### 1. **Linear Regression 1**


The first task calls for a model built on all of the numerical variables.

For this first Linear Regression model, we will compare two different approaches:

- dropping the 'Species' variable;
- using it as a predictor.

'Species' is a categorical variable, but it can be converted to a numerical one by assigning each category an integer code. This is called label encoding.
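
As a minimal sketch of label encoding, assuming a toy column with made-up species names (they are illustrative and not read from the dataset), pandas can map each category to an integer code:

```python
import pandas as pd

# Hypothetical toy data; the species names are illustrative only
toy = pd.DataFrame({'Species': ['Bream', 'Roach', 'Pike', 'Bream']})

# Convert to a categorical dtype; the categories are sorted alphabetically,
# so each one receives a stable integer code
toy['Species_code'] = toy['Species'].astype('category').cat.codes

print(toy)
```

Keep in mind that the integer codes impose an arbitrary order on the categories, and a linear model will treat that order as a real numeric scale.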

First, we will drop the 'Species' variable and fit the model.


In [11]:
# Append the path to useful directories
import sys
sys.path.append('../my_functions')

# Packages needed
from download_dataset import download_dataset
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

# Download and extract the dataset
fishcatch = download_dataset(data_file='fishcatch', extension='.tar.xz')

ModuleNotFoundError: No module named 'download_dataset'

In [None]:
# Same dataset as in the exploratory data analysis
df = pd.read_csv(fishcatch)

#### 1.1. **Dropping the 'Species' variable**


In [None]:
# Including only the numeric columns
df_no_species = df.select_dtypes(include=['int64', 'float64'])
df_no_species.head(3).style.background_gradient(cmap='viridis')

After dropping the 'Species' variable, we will split the data into training and testing sets, fit the model and evaluate it.


In [None]:
# Splitting the dataset into features and target variable
X = df_no_species.drop(columns=['Width'])
y = df_no_species['Width']

In [None]:
# Split the dataset into training and testing sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.2,
                                                    random_state=42)

In [None]:
# Fitting the model
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

lr = LinearRegression()
lr.fit(X_train, y_train)

In [None]:
# R2 score of the training set
train_r2_no_species = lr.score(X_train, y_train)
print(f'Training R²: {train_r2_no_species:.2f}')

In [None]:
# Intercept: the predicted value when every feature equals zero
intercept = lr.intercept_
print(f'Intercept: {intercept}')

# The coefficients of the model
coef = lr.coef_
print(f'Coefficients: {coef}')

In [None]:
# Equation of the fitted model (one term per feature)
print(f"{y.name} = \n{intercept:.6f} ")
for i, j in zip(X.columns, coef):
    print(f"+ {i}*{j:.6f}")
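
The lines printed above follow the general form of a multiple linear regression. The symbols here are notation chosen for this sketch rather than names from the notebook: $\beta_0$ is the intercept, $\beta_i$ is the coefficient of feature $x_i$, and $p$ is the number of features.

$$
\widehat{\text{Width}} = \beta_0 + \sum_{i=1}^{p} \beta_i x_i
$$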

In [None]:
# Testing the model
y_pred = lr.predict(X_test)
y_pred

In [None]:
# Performance of the model
mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
rmse = np.sqrt(mse)

print(f'Mean Squared Error: {mse}')
print(f'Root Mean Squared Error: {rmse}')
print(f'Mean Absolute Error: {mae}')
print(f'Test R²: {r2}')
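
To make the four metrics above concrete, the following sketch recomputes them by hand with NumPy. The arrays are hypothetical toy values, and the `*_manual` names are introduced here only to avoid overwriting the notebook's own variables.

```python
import numpy as np

# Hypothetical toy values, chosen only to make the arithmetic easy to follow
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.0, 7.5, 9.0])

residuals = y_true - y_hat

mse_manual = np.mean(residuals ** 2)      # mean of the squared errors
rmse_manual = np.sqrt(mse_manual)         # back in the units of the target
mae_manual = np.mean(np.abs(residuals))   # mean of the absolute errors

# R² = 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(mse_manual, rmse_manual, mae_manual, r2_manual)
```

RMSE and MAE are expressed in the same units as the target, which usually makes them easier to interpret than MSE.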

More information about the function used below is available in [plot_regression.py](../my_functions/plot_regression.py).


In [None]:
# Plotting the model
from plot_regression import plot_regression
plot_regression(
    y_test, y_pred,
    regression_type='Linear',
    title="Using All Numerical Features\nSpecies not included in the model")

| [$\leftarrow$ Exploratory Analysis ](n0_exploratory_analysis.ipynb) | [Linear Regression 2 $\rightarrow$](n2_linear_regression_2.ipynb) |
| :-----------------------------------------------------------------: | :---------------------------------------------------------------: |
