# SML Lab 2 - Polynomial Regression

In [1]:
# Imports:
import turicreate as tc
from sklearn.model_selection import train_test_split

In [2]:
# 1) Load the datafile RVT.csv into an SFrame.
data = tc.SFrame('RVT.csv')
data

------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[int,float,float,float]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


X1,R,V,T
0,81.75974500157666,-3.17337489125203,29.46784268084049
1,70.53346486050641,-4.765524444590707,-7.601876111180856
2,41.93519103947173,-0.2252427666224576,16.462177210720117
3,28.54727888717895,4.055789636035739,4.3350445142063565
4,1.866856419290297,1.5604743994854564,-2.174899569612092
5,5.482682848873466,-2.4269979856718598,-5.908776805668268
6,25.505202346164136,0.0258303584573029,6.834325974774764
7,89.53883755482829,-2.956148498433293,43.95816013223048
8,37.20709101449364,-0.7465745712509309,9.768672734348904
9,0.8923460414945406,4.807160015680399,-24.039518973155182


In [3]:
# 2) Add columns to the dataset to hold powers of ‘R’ and ‘V’ terms for a polynomial of order 3.  
# Split the data so that 70% is used for training, 20% for validation, and 10% for testing.
data_df = data.to_dataframe()

data_df['R^2'] = data_df['R'] ** 2
data_df['R^3'] = data_df['R'] ** 3
data_df['V^2'] = data_df['V'] ** 2
data_df['V^3'] = data_df['V'] ** 3
data_df['R*V'] = data_df['R'] * data_df['V']
data_df['R^2*V'] = data_df['R'] ** 2 * data_df['V']
data_df['R*V^2'] = data_df['R'] * data_df['V'] ** 2

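# First split: 70% training, 30% held back.
train_df, remaining_df = train_test_split(data_df, train_size=0.7, random_state=42)
# Second split: 2/3 of the held-back 30% is validation (20% overall); the rest is testing (10% overall).
val_df, test_df = train_test_split(remaining_df, train_size=2/3, random_state=42)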


train_data = tc.SFrame(train_df)
val_data = tc.SFrame(val_df)
test_data = tc.SFrame(test_df)
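
Together with 'R' and 'V', the seven columns added in step 2 are all the monomials R^i * V^j with 1 <= i + j <= 3. As a minimal sketch, the same column names could be generated for any order instead of being typed by hand. The helper names `term_name` and `polynomial_terms` are hypothetical and are not part of Turi Create or pandas.

```python
def term_name(r_power, v_power):
    """Build a column name such as 'R', 'V^2' or 'R^2*V' from the two exponents."""
    parts = []
    if r_power:
        parts.append('R' if r_power == 1 else f'R^{r_power}')
    if v_power:
        parts.append('V' if v_power == 1 else f'V^{v_power}')
    return '*'.join(parts)


def polynomial_terms(order):
    """Return (name, r_power, v_power) for every monomial with 1 <= r_power + v_power <= order."""
    terms = []
    for total in range(1, order + 1):
        for r_power in range(total, -1, -1):
            v_power = total - r_power
            terms.append((term_name(r_power, v_power), r_power, v_power))
    return terms


third_order_names = [name for name, _, _ in polynomial_terms(3)]

# Assumed usage with the notebook's DataFrame (not run here):
# for name, i, j in polynomial_terms(3):
#     data_df[name] = data_df['R'] ** i * data_df['V'] ** j
```

The generated names match the ones used in this notebook, so the feature lists in steps 3 to 5 would correspond to `polynomial_terms(1)`, `polynomial_terms(2)` and `polynomial_terms(3)`.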

In [4]:
# 3) Use Turi Create (and training data) to fit a polynomial, first order in ‘R’ and ‘V’, 
# to predict the temperature ‘T’. (Hint: you can select which features are 
# used in training by setting the ‘features’ parameter.) Name the model ‘model1’. 
# Calculate and display the training and validation RMSE.
features = ['R', 'V']
model1 = tc.linear_regression.create(train_data, target='T', features=features)
print("Training error (rmse):", model1.evaluate(train_data)['rmse'])
print("Validation error (rmse):", model1.evaluate(val_data)['rmse'])

Training error (rmse): 12.577252722009018


Validation error (rmse): 12.718368737154982
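
The 'rmse' value reported above is the root mean squared error: the square root of the average squared difference between the actual and predicted 'T'. A minimal sketch of that calculation in plain Python follows; the function name `rmse` and the sample values are made up for illustration and are not taken from RVT.csv.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences of numbers."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("actual and predicted must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up values: the errors are 1, 0 and -2, so the mean squared error is 5/3.
example_rmse = rmse([3.0, -1.0, 2.0], [2.0, -1.0, 4.0])
```

Because the errors are squared before averaging, a few large misses raise the RMSE more than many small ones.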


In [5]:
# 4) Repeat step 3) but fit a second order polynomial named ‘model2’.
features = ['R', 'V', 'R^2', 'V^2', 'R*V']
model2 = tc.linear_regression.create(train_data, target='T', features=features)
print("Training error (rmse):", model2.evaluate(train_data)['rmse'])
print("Validation error (rmse):", model2.evaluate(val_data)['rmse'])

Training error (rmse): 0.8231031652631792


Validation error (rmse): 0.9422502165235571


In [6]:
# 5) Repeat step 3) but fit a third order polynomial named ‘model3’.
features = ['R', 'V', 'R^2', 'V^2', 'R*V', 'R^3', 'V^3', 'R^2*V', 'R*V^2']
model3 = tc.linear_regression.create(train_data, target='T', features=features)
print("Training error (rmse):", model3.evaluate(train_data)['rmse'])
print("Validation error (rmse):", model3.evaluate(val_data)['rmse'])

Training error (rmse): 0.8229527282913328
Validation error (rmse): 1.0310471771723417


**6) Add a markdown cell to discuss the three models. Are they overfitting or underfitting? Which one is best?**

**Outputs:**
<pre>
Model 1 - Training RMSE: 12.577252722009018   Validation RMSE: 12.718368737154982

Model 2 - Training RMSE: 0.8231031652631792   Validation RMSE: 0.9422502165235571

Model 3 - Training RMSE: 0.8229527282913328   Validation RMSE: 1.0310471771723417
</pre>

**Analysis:**
<pre>
Model 1 has high training and validation errors (RMSE of about 12.6 and 12.7). This means that the model is underfitting: a first order polynomial is too simple to capture the relationship.

Model 2 has low training and validation errors (about 0.82 and 0.94), and the lowest validation error of the three. This makes it the best model.

Model 3 has essentially the same training error as model 2 (about 0.82) but a higher validation error (about 1.03). The third order terms do not improve the fit and generalize slightly worse, so the model is mildly overfitting.
</pre>

**Summary:**
<pre>
Model 1 is underfitting the data.
Model 2 fits the data best because the validation error is the lowest.
Model 3 overfits the data when compared to model 2.
</pre>
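
The comparison above can be sketched in a few lines: pick the model with the lowest validation RMSE, and use the gap between validation and training RMSE as a rough overfitting signal. The numbers are copied from the outputs of steps 3 to 5; the dictionary layout and variable names are only an illustration.

```python
# Training and validation RMSE copied from the cell outputs above.
results = {
    'model1': {'train': 12.577252722009018, 'val': 12.718368737154982},
    'model2': {'train': 0.8231031652631792, 'val': 0.9422502165235571},
    'model3': {'train': 0.8229527282913328, 'val': 1.0310471771723417},
}

# Validation minus training RMSE: a larger gap at a similar training error hints at overfitting.
gaps = {name: scores['val'] - scores['train'] for name, scores in results.items()}

# Model selection uses the validation error only; the test set stays untouched until step 9.
best_model = min(results, key=lambda name: results[name]['val'])

for name in results:
    print(name, "val RMSE:", results[name]['val'], "gap:", gaps[name])
print("Best by validation RMSE:", best_model)
```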

In [7]:
# 7) Create a new third order model named ‘model3r’ that uses L1 and/or L2 regularization. 
# Experiment with different values and see if you can beat the best model 
# chosen in step 6). Calculate and display the training and validation RMSE.
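# Note: this feature list is second order, the same as model2, and l1_penalty=0.0
# applies no L1 penalty, so this cell reproduces model2. A third order 'model3r'
# would use the nine features from step 5 with a nonzero l1_penalty and/or l2_penalty.
features = ['R', 'V', 'R^2', 'V^2', 'R*V']
model3r = tc.linear_regression.create(train_data, target='T', features=features, l1_penalty=0.0)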
print("Training error (rmse):", model3r.evaluate(train_data)['rmse'])
print("Validation error (rmse):", model3r.evaluate(val_data)['rmse'])

Training error (rmse): 0.8231031652631792


Validation error (rmse): 0.9422502165235571
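
Step 7 asks for experimenting with different penalty values. One way to organize that is a small grid search that scores each (L1, L2) pair and keeps the one with the lowest validation RMSE. The sketch below is generic: `grid_search` is a hypothetical helper, and `toy_score` is a made-up stand-in for "train a model and return its validation RMSE", so no results here come from RVT.csv.

```python
import itertools


def grid_search(score_fn, l1_values, l2_values):
    """Return (best_score, best_l1, best_l2) for the pair that minimizes score_fn(l1, l2)."""
    best = None
    for l1, l2 in itertools.product(l1_values, l2_values):
        score = score_fn(l1, l2)
        if best is None or score < best[0]:
            best = (score, l1, l2)
    return best


def toy_score(l1, l2):
    """Made-up stand-in for a validation RMSE; its minimum is at l1=0.1, l2=1.0."""
    return 1.0 + (l1 - 0.1) ** 2 + (l2 - 1.0) ** 2


best_score, best_l1, best_l2 = grid_search(toy_score, [0.0, 0.1, 1.0], [0.0, 0.01, 1.0])
print("Best penalties:", best_l1, best_l2, "score:", best_score)
```

In this notebook, `score_fn` could plausibly wrap `tc.linear_regression.create` with the nine third order features, pass `l1_penalty=l1` and `l2_penalty=l2`, and return `model.evaluate(val_data)['rmse']`. That wrapper is an assumption and was not run here.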


**8) Add a markdown cell to discuss how well ‘model3r’ compares with the model you chose in step 6). Which one is best?**

**Outputs:**
<pre>
Model 2 - Training RMSE: 0.8231031652631792   Validation RMSE: 0.9422502165235571

Model3r - Training RMSE: 0.8231031652631792   Validation RMSE: 0.9422502165235571
</pre>
    
**Analysis:**
<pre>
Model 2 and model3r have identical training and validation RMSE values. 
This is expected: model3r was trained on the same second order features as model 2 with l1_penalty=0.0, so the two models are the same fit. 
model3r therefore does not beat model 2, and model 2 remains the choice.
</pre>
    
**Summary:**
<pre>
Model 2 is the best model of the four because it has the lowest validation RMSE (model3r only matches it).
</pre>

In [9]:
# 9) Calculate and display the testing RMSE of the best overall model.
print("Testing error (rmse):", model2.evaluate(test_data)['rmse'])

Testing error (rmse): 0.8598681076864875


**9.b Add a markdown cell to discuss how close you expect the model’s predictions to be upon implementation.**

**Model 2 RMSE values:**
<pre>
- Training   RMSE:   0.8231031652631792
- Validation RMSE:   0.9422502165235571
- Testing    RMSE:   0.8598681076864875
</pre>
    
The best overall model is model 2 because it has the lowest validation RMSE. Its testing RMSE of about 0.86 is close to its training RMSE (0.82) and validation RMSE (0.94), which suggests that the model generalizes to data it has not seen. Upon implementation, predictions of the temperature 'T' from 'R' and 'V' should therefore typically be off by about 0.86, in the units of 'T', as long as new data resembles the data in RVT.csv.
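
To turn the testing RMSE into a rough error band, one common rule of thumb treats the RMSE as the standard deviation of the prediction errors. This is only a sketch under a strong assumption that the notebook does not check: the errors are roughly normal with zero mean. The names `half_width` and `prediction_band`, and the example prediction of 10.0, are made up for illustration.

```python
# Testing RMSE of model2, copied from the output of step 9.
test_rmse = 0.8598681076864875

# Assumption: errors are roughly normal with zero mean, so about 95% of them
# fall within 1.96 standard deviations. The notebook does not verify this.
Z_95 = 1.96
half_width = Z_95 * test_rmse


def prediction_band(prediction):
    """Return an approximate 95% band (low, high) around a single prediction."""
    return (prediction - half_width, prediction + half_width)


# Hypothetical predicted temperature, not a value from RVT.csv.
low, high = prediction_band(10.0)
print("Approximate 95% band:", low, "to", high)
```

Under that assumption, most predictions would land within roughly 1.7 units of the actual 'T'. A histogram of the test residuals would be one way to check whether the assumption is reasonable.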