| [$\leftarrow$ 3. Polynomial Regression ](n3_polynomial_regression.ipynb) | [5. Comparison and Conclusion $\rightarrow$](n5_comparison_conclusion.ipynb) |
| :-----------------------------------------------------------------: | :---------------------------------------------------------------: |

<hr>

### 4. **Linear and Polynomial Regression**

#### 4.1. **Getting Started**

In this notebook, the best polynomial model is combined with the best linear model. As before, we compare the results obtained with the encoded species against those obtained with no species at all.

In [1]:
# Append the path to useful directories
import sys
sys.path.append('../my_functions')

# Packages needed
from download_dataset import download_dataset
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

# Download and extract the dataset
fishcatch = download_dataset(data_file='fishcatch', extension='.tar.xz')

# Quick peek at the data
df = pd.read_csv(fishcatch)
df.head(3).style.background_gradient(cmap='viridis')

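|   | Species | Weight | Length1 | Length2 | Length3 | Height | Width |
| :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: |
| 0 | Bream | 242.0 | 23.2 | 25.4 | 30.0 | 11.52 | 4.02 |
| 1 | Bream | 290.0 | 24.0 | 26.3 | 31.2 | 12.48 | 4.3056 |
| 2 | Bream | 340.0 | 23.9 | 26.5 | 31.1 | 12.3778 | 4.6961 |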


We transform `Weight` into polynomial features of degree n = 5, since that degree gave the best polynomial model in the previous notebook.

In [2]:
# Transforming weight
X_weight = df['Weight']

# Drop weight
df.drop('Weight', axis=1, inplace=True)

# PolynomialFeatures (preprocessing)
from sklearn.preprocessing import PolynomialFeatures

# Transforming the feature to the desired degree
poly = PolynomialFeatures(degree=5)
X_weight = poly.fit_transform(X_weight.values.reshape(-1, 1))

# Quick peek at the transformed feature
X_weight = pd.DataFrame(X_weight, columns=[f'Weight^{i}' for i in range(X_weight.shape[1])])
X_weight.head(3).style.background_gradient(cmap='viridis')

|   | Weight^0 | Weight^1 | Weight^2 | Weight^3 | Weight^4 | Weight^5 |
| :-: | :-: | :-: | :-: | :-: | :-: | :-: |
| 0 | 1.0 | 242.0 | 58564.0 | 14172488.0 | 3429742096.0 | 829997587232.0 |
| 1 | 1.0 | 290.0 | 84100.0 | 24389000.0 | 7072810000.0 | 2051114900000.0 |
| 2 | 1.0 | 340.0 | 115600.0 | 39304000.0 | 13363360000.0 | 4543542400000.0 |
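
For a single input column, a degree-5 expansion is just the powers 0 through 5 of that column, which is why `Weight^0` is a column of ones. The sketch below reproduces the three rows shown above with plain NumPy; `np.vander` is used only for illustration and is not what the notebook itself calls.

```python
import numpy as np

# First three weights from the dataset preview
weights = np.array([242.0, 290.0, 340.0])

# Columns are weights**0, weights**1, ..., weights**5
powers = np.vander(weights, N=6, increasing=True)

for row in powers:
    print(row)
```

`PolynomialFeatures` also accepts `include_bias=False`, which omits the constant `Weight^0` column. Since `LinearRegression` fits its own intercept by default, that column should add no information to the model.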


#### 4.2. **Dropping the 'Species' variable**

In [3]:
# Including only the numeric columns
df_no_species = df.select_dtypes(include=['int64', 'float64'])
df_no_species.head(3).style.background_gradient(cmap='viridis')

|   | Length1 | Length2 | Length3 | Height | Width |
| :-: | :-: | :-: | :-: | :-: | :-: |
| 0 | 23.2 | 25.4 | 30.0 | 11.52 | 4.02 |
| 1 | 24.0 | 26.3 | 31.2 | 12.48 | 4.3056 |
| 2 | 23.9 | 26.5 | 31.1 | 12.3778 | 4.6961 |


In [4]:
# Combining the transformed weight with the other features
df_no_species = pd.concat([df_no_species, X_weight], axis=1)
df_no_species.head(3).style.background_gradient(cmap='viridis')

|   | Length1 | Length2 | Length3 | Height | Width | Weight^0 | Weight^1 | Weight^2 | Weight^3 | Weight^4 | Weight^5 |
| :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: |
| 0 | 23.2 | 25.4 | 30.0 | 11.52 | 4.02 | 1.0 | 242.0 | 58564.0 | 14172488.0 | 3429742096.0 | 829997587232.0 |
| 1 | 24.0 | 26.3 | 31.2 | 12.48 | 4.3056 | 1.0 | 290.0 | 84100.0 | 24389000.0 | 7072810000.0 | 2051114900000.0 |
| 2 | 23.9 | 26.5 | 31.1 | 12.3778 | 4.6961 | 1.0 | 340.0 | 115600.0 | 39304000.0 | 13363360000.0 | 4543542400000.0 |


In [5]:
# Split the dataset into features (X) and the target variable (y = Width)
X = df_no_species.drop(columns=['Width'])
y = df_no_species['Width']

In [6]:
# Split the dataset into training and testing set
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.2,
                                                    random_state=42)
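
As a rough check on the split sizes, the sketch below computes how many rows land in each set. It assumes that a fractional `test_size` is rounded up to a whole number of rows and that the dataset has 159 rows; the row count is an assumption, since it is not stated in this notebook.

```python
import math

n_samples = 159   # assumption: row count of the dataset, not stated in this notebook
test_size = 0.2

# A fractional test_size is rounded up to a whole number of rows
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```

Under these assumptions the test set has 32 rows, which agrees with the 32 predictions printed in the testing step.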

In [7]:
# Fitting the model
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

lr = LinearRegression()
lr.fit(X_train, y_train)

In [8]:
# R2 score of the training set
train_r2 = lr.score(X_train, y_train)
print(f'Training R²: {train_r2:.3f}')

Training R²: 0.946


In [9]:
# Intercept: the predicted Width when every feature is zero
intercept = lr.intercept_
print(f'Intercept: {intercept}')

# The coefficients of the model
coef = lr.coef_
print(f'Coefficients: {coef}')

Intercept: 1.7204338360932176
Coefficients: [ 1.27858372e-01  1.15389718e-01 -2.51717257e-01  7.77124304e-02
  1.11050322e-13  2.34489689e-02 -6.56899317e-05  1.03378464e-07
 -7.43893111e-11  1.91374694e-14]
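
The coefficient on `Weight^0` (about 1.1e-13) is effectively zero. A plausible explanation, sketched below, is that a model fitting its own intercept effectively centers the features first, so a constant column turns into all zeros and carries no weight. The toy data and variable names here are hypothetical, and `np.linalg.lstsq` stands in for whatever solver the library uses internally.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: one real feature plus a constant column of ones
x = rng.normal(size=50)
X = np.column_stack([np.ones_like(x), x])
y = 2.0 + 3.0 * x + rng.normal(scale=0.1, size=50)

# Center features and target, as an intercept-fitting model effectively does
X_centered = X - X.mean(axis=0)
y_centered = y - y.mean()

# Least-squares fit on the centered data (minimum-norm solution)
coef, *_ = np.linalg.lstsq(X_centered, y_centered, rcond=None)
intercept = y.mean() - X.mean(axis=0) @ coef

print(coef)       # the entry for the constant column is ~0
print(intercept)  # the constant term is absorbed by the intercept
```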


In [10]:
# Equation of the fitted model
print(f"{y.name} = \n{intercept:.6f} ")
for i, j in zip(X.columns, coef):
    print(f"+ {i}*{j:.6f}")

Width = 
1.720434 
+ Length1*0.127858
+ Length2*0.115390
+ Length3*-0.251717
+ Height*0.077712
+ Weight^0*0.000000
+ Weight^1*0.023449
+ Weight^2*-0.000066
+ Weight^3*0.000000
+ Weight^4*-0.000000
+ Weight^5*0.000000
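
The higher-order terms print as `0.000000` only because six decimal places cannot show coefficients of order 1e-7 or smaller. They are not negligible: the `Weight^3` coefficient times 242³ contributes roughly 1.5 to the prediction. A minimal sketch of the same equation in scientific notation, with the values copied from the model output above:

```python
# Intercept and coefficients copied from the model output
intercept = 1.7204338360932176
names = ['Length1', 'Length2', 'Length3', 'Height', 'Weight^0',
         'Weight^1', 'Weight^2', 'Weight^3', 'Weight^4', 'Weight^5']
coef = [1.27858372e-01, 1.15389718e-01, -2.51717257e-01, 7.77124304e-02,
        1.11050322e-13, 2.34489689e-02, -6.56899317e-05, 1.03378464e-07,
        -7.43893111e-11, 1.91374694e-14]

# Scientific notation with an explicit sign keeps the small terms visible
lines = [f"{intercept:+.3e}"] + [f"{c:+.3e} * {n}" for n, c in zip(names, coef)]
print("Width =")
print("\n".join(lines))
```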


In [11]:
# Testing the model
y_pred = lr.predict(X_test)
y_pred

array([3.0226921 , 1.74053477, 3.73254046, 4.37434882, 3.64991052,
       6.90880814, 1.61029163, 3.79440678, 4.06204145, 6.00398025,
       5.75262534, 7.00115839, 5.60581262, 3.78545462, 5.94588171,
       6.73277788, 7.16594665, 4.38614392, 4.39903243, 5.96984176,
       1.62940692, 5.6775179 , 5.15566483, 5.07330428, 6.81742112,
       7.142034  , 3.9121319 , 4.32668009, 1.61110614, 1.6271897 ,
       5.03006804, 2.78861878])

In [12]:
# Performance of the model
mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
rmse = np.sqrt(mse)

print(f'Mean Squared Error: {mse}')
print(f'Root Mean Squared Error: {rmse}')
print(f'Mean Absolute Error: {mae}')
print(f'R2 Score: {r2}')

Mean Squared Error: 0.20957490968683545
Root Mean Squared Error: 0.4577935229848009
Mean Absolute Error: 0.37594793487107553
R2 Score: 0.9402730149853048


In [13]:
# Save the results to a csv file
from save_results import save_results
save_results(mse, mae, r2, rmse, train_r2, 'Linear and Polynomial: with no species')

Results saved successfully to "results.csv".


