Steps:
1. Import libraries 
2. Load dataset
3. Make a copy of the original dataset and keep working on one of them
4. Inspect the data: check for inconsistencies, missing values, and duplicates
5. Split the dataset into train/test sets
6. Separate the categorical and numerical features
7. Handle missing values in the categorical and numerical features using SimpleImputer
8. Fit the model
9. Inspect the result and evaluate the model

# Step 1: Import libraries

In [17]:
import pandas as pd
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn import set_config
from sklearn.model_selection import train_test_split
from sklearn.compose import make_column_selector
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
set_config(display='diagram', transform_output='pandas')
# Limit how many rows pandas displays
pd.set_option('display.max_rows', 10)
# pd.set_option('display.max_columns', 5)   # Optionally limit the number of columns as well

# Step 2: Load Dataset

In [2]:
path = './Fish - Fish.csv'
fish_df = pd.read_csv(path)
fish_df.info()
fish_df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159 entries, 0 to 158
Data columns (total 7 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   Species  155 non-null    object 
 1   Weight   159 non-null    float64
 2   Length1  157 non-null    float64
 3   Length2  157 non-null    float64
 4   Length3  150 non-null    float64
 5   Height   156 non-null    float64
 6   Width    157 non-null    float64
dtypes: float64(6), object(1)
memory usage: 8.8+ KB


Unnamed: 0,Species,Weight,Length1,Length2,Length3,Height,Width
0,Bream,242.0,23.2,25.4,30.0,11.52,4.02
1,Bream,290.0,24.0,26.3,31.2,12.48,4.3056
2,Bream,340.0,23.9,26.5,31.1,12.3778,4.6961
3,Bream,363.0,26.3,29.0,33.5,12.73,4.4555
4,Bream,430.0,26.5,29.0,34.0,12.444,5.134


# Step 3: Copy the Dataset

In [3]:
fish_df_copy = fish_df.copy()
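
To illustrate why the copy is worth making, here is a minimal sketch on a made-up two-row frame (toy_df and toy_backup are hypothetical names, not part of the fish data). It suggests that .copy() returns an independent object, so later edits to the working frame leave the copy untouched.

```python
import pandas as pd

# Hypothetical toy frame standing in for fish_df
toy_df = pd.DataFrame({"Species": ["Bream", "Pike"], "Weight": [242.0, 200.0]})

# .copy() (deep by default) creates an independent DataFrame
toy_backup = toy_df.copy()

# Modify the working frame only
toy_df.loc[0, "Weight"] = 0.0

print(toy_df.loc[0, "Weight"])      # value in the working frame after the edit
print(toy_backup.loc[0, "Weight"])  # the copy still holds the original value
```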

# Step 4: Inspect the Data

    There are 0 duplicates. However, the first look at the dataset (fish_df.info() above) shows that several columns contain null values.

In [4]:
fish_df.shape

(159, 7)

In [5]:
# Reorder the columns so that the target (Weight) comes last and is easy to find
column_reordered = ['Species', 'Length1', 'Length2', 'Length3', 'Height', 'Width', 'Weight']

fish_df = fish_df[column_reordered]

In [6]:
fish_df.head()

Unnamed: 0,Species,Length1,Length2,Length3,Height,Width,Weight
0,Bream,23.2,25.4,30.0,11.52,4.02,242.0
1,Bream,24.0,26.3,31.2,12.48,4.3056,290.0
2,Bream,23.9,26.5,31.1,12.3778,4.6961,340.0
3,Bream,26.3,29.0,33.5,12.73,4.4555,363.0
4,Bream,26.5,29.0,34.0,12.444,5.134,430.0



In [7]:
def chk_for_dups_null(df):
    print("checking for duplicates ===>  \n",df.duplicated().sum())
    print("checking for null values ===> \n",df.isna().sum())
chk_for_dups_null(fish_df)


checking for duplicates ===>  
 0
checking for null values ===> 
 Species    4
Length1    2
Length2    2
Length3    9
Height     3
Width      2
Weight     0
dtype: int64


However, there are null values. We could drop the affected rows, since no column is missing more than about 6% of its values (at most 9 of 159 rows), but we are practicing pipelines, so we will build a pipeline that fills in the missing data and standardizes the features.
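
As a rough check on how much data is affected, the sketch below computes the share of missing values per column and lists the distinct category labels, which is one way to spot inconsistent spellings. It runs on a small made-up frame (toy_df and its values are assumptions for illustration); on the real data the same calls would be applied to fish_df. The processed output later in this notebook contains a cat__Species_Beam column next to cat__Species_Bream, which may point to this kind of misspelled label.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with missing values and a misspelled label
toy_df = pd.DataFrame({
    "Species": ["Bream", "Beam", "Bream", None],
    "Length1": [23.2, 24.0, np.nan, 26.3],
    "Weight": [242.0, 290.0, 340.0, 363.0],
})

# Fraction of missing values per column (isna() gives booleans, mean() averages them)
missing_share = toy_df.isna().mean()
print(missing_share)

# Distinct labels and their counts; rare near-duplicates such as "Beam" stand out
label_counts = toy_df["Species"].value_counts(dropna=False)
print(label_counts)
```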

# Step 5: Split the Dataset

    Why split data:
            Training Data: This is the portion of your data used to train your machine learning model. The model learns patterns, relationships, and rules from this data.
            Testing Data: This is a separate portion of your data that the model has never seen during training. It's used to assess how well the trained model generalizes to new, unseen data.

In [8]:
target = 'Weight'
X = fish_df.drop(columns=[target])
y = fish_df[target]

# test_size is left at its default (0.25), so 40 of the 159 rows go to the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

    train_test_split returns containers of the same kind as its inputs. Here X is a pandas DataFrame and y is a pandas Series, so X_train and X_test are DataFrames and y_train and y_test are Series (NumPy arrays in would give NumPy arrays out).
    
    X_train: the feature data for the training set, the portion of the data the model is trained on.

    X_test: the feature data for the testing set, used to evaluate the model's performance on unseen data.

    y_train: the target values for the training set.

    y_test: the target values for the testing set.

    The order (X_train, X_test, y_train, y_test) is the order in which train_test_split returns the values: a train/test pair for each array passed in, in the order the arrays were passed.
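
The points above can be checked on a small stand-in for the fish data. This is only a sketch: toy_X, toy_y, and the random values are assumptions, chosen so that the row count (159) matches the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in with the same number of rows as the fish data (159)
rng = np.random.default_rng(0)
toy_X = pd.DataFrame({"Length1": rng.normal(size=159), "Height": rng.normal(size=159)})
toy_y = pd.Series(rng.normal(size=159), name="Weight")

# Same call pattern as above; test_size is left at its default of 0.25
X_tr, X_te, y_tr, y_te = train_test_split(toy_X, toy_y, random_state=42)

print(type(X_tr).__name__, X_tr.shape)
print(type(X_te).__name__, X_te.shape)
print(type(y_tr).__name__, y_tr.shape)
print(type(y_te).__name__, y_te.shape)

# The row labels stay aligned between features and target
print(X_tr.index.equals(y_tr.index))
```

With 159 rows and the default split, this gives 119 training rows and 40 testing rows, which matches the shapes seen later in the notebook.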

# Step 6 : Separate categorical and numerical data 

In [9]:
# Instantiate the column selectors
num_selector = make_column_selector(dtype_include='number')
cat_selector = make_column_selector(dtype_include='object')


In [10]:
numeric_columns = num_selector(X_train)
categorical_columns = cat_selector(X_train)
print(f"Numeric columns are : {numeric_columns} \n Categorical columns are : {categorical_columns}")



Numeric columns are : ['Length1', 'Length2', 'Length3', 'Height', 'Width'] 
 Categorical columns are : ['Species']
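
As a small illustration of what the selectors do, the sketch below applies them to a made-up frame (toy_X, pick_numeric, and pick_object are hypothetical names). A selector is a callable that takes a DataFrame and returns the names of the columns whose dtype matches.

```python
import pandas as pd
from sklearn.compose import make_column_selector

# Hypothetical toy frame; the Species column is given object dtype explicitly,
# because newer pandas versions may infer a dedicated string dtype instead
toy_X = pd.DataFrame({
    "Species": pd.Series(["Bream", "Pike", "Perch"], dtype=object),
    "Length1": [23.2, 34.0, 20.1],
    "Height": [11.52, 7.1, 5.2],
})

pick_numeric = make_column_selector(dtype_include="number")
pick_object = make_column_selector(dtype_include="object")

print(pick_numeric(toy_X))  # names of the numeric columns
print(pick_object(toy_X))   # names of the object-dtype columns
```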


# Step 7: Handle Missing Data, Categorical Data, and Scaling!

    AKA setting up our preprocessor pipeline

In [11]:
# Scaler
scaler = StandardScaler() #for numeric data 
# One-hot encoder
ohe = OneHotEncoder(handle_unknown='ignore', sparse_output=False) #categorical nominal data
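
A short sketch of what handle_unknown='ignore' is for, using made-up labels (train_labels, test_labels, and encoder are hypothetical names). A category that was never seen during fitting is encoded as a row of zeros instead of raising an error. The sketch keeps the default sparse output and converts it with .toarray(), which is an assumption made here for portability rather than the notebook's own setting.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training and test labels; "Smelt" never appears in training
train_labels = np.array([["Bream"], ["Pike"], ["Bream"]])
test_labels = np.array([["Pike"], ["Smelt"]])

# Default sparse output is converted with .toarray() so it can be printed
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train_labels)
encoded_test = encoder.transform(test_labels).toarray()

print(encoder.categories_[0])  # categories learned from the training labels
print(encoded_test)            # the unseen label becomes a row of zeros
```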

In [18]:
# Create one pipeline per column type: impute missing values, then scale (numeric) or one-hot encode (categorical)
# The step names ("imputer", "scaler", "ohe") are arbitrary labels; pick ones that make sense

numeric_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', scaler)
])

categorical_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('ohe', ohe)
])



numeric_transformer and categorical_transformer are each a scikit-learn Pipeline: an imputer followed by a second preprocessing step (scaling or one-hot encoding).
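
To make the numeric steps concrete, here is a minimal sketch on a single made-up column (heights and toy_numeric are hypothetical). The missing value is first replaced by the column mean, and after standard scaling that imputed value lands exactly on 0.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical single numeric column with one missing value
heights = np.array([[10.0], [12.0], [np.nan], [14.0]])

toy_numeric = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),  # NaN -> column mean (12.0)
    ("scaler", StandardScaler()),                 # subtract mean, divide by std
])

scaled = toy_numeric.fit_transform(heights)
print(scaled.ravel())
```

An exact 0.000000 in a scaled column, like the num__Height value for row 26 in the processed output, is what an imputed mean looks like after scaling, although a value that happens to equal the mean would look the same.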

In [19]:
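# Create a ColumnTransformer that applies the numeric transformer to the numeric columns
# and the categorical transformer to the categorical columns
# "num" and "cat" are arbitrary names; they become the prefixes of the output column names (e.g. num__Length1)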
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_columns),
        ('cat', categorical_transformer, categorical_columns)
    ])

# Step 8: Fit The Model!

In [20]:
# import some more libraries 
from sklearn.linear_model import LinearRegression 
from sklearn.dummy import DummyRegressor #baseline model for regression

In [21]:
# Fit the preprocessor on your training data
preprocessor.fit(X_train)

Linear Regression Model Part

    IMPORTANT: We fit ONLY on the TRAINING data, never on the testing data
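
To illustrate what fitting on the training data only means in practice, here is a sketch with a scaler and made-up numbers (train_col, test_col, and scaler_demo are hypothetical). The statistics are learned from the training data and then reused unchanged on the test data, so nothing about the test set leaks into preprocessing.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical one-column train and test sets
train_col = np.array([[1.0], [2.0], [3.0]])
test_col = np.array([[4.0], [5.0]])

scaler_demo = StandardScaler()
scaler_demo.fit(train_col)  # learns mean and std from the training data only

train_scaled = scaler_demo.transform(train_col)
test_scaled = scaler_demo.transform(test_col)  # reuses the training mean and std

print(scaler_demo.mean_)    # mean learned from train_col
print(train_scaled.mean())  # training data is centered on 0
print(test_scaled.ravel())  # test data is shifted by the training mean, not its own
```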

In [22]:
dummy_reg = DummyRegressor(strategy='mean')

# Create model pipeline
dummy_pipe = Pipeline([
    ('preprocessor', preprocessor),  # Tuple 1: (name, estimator)
    ('dummy_reg', dummy_reg)  # Tuple 2: (name, estimator)
])

# This is where the model starts learning
dummy_pipe.fit(X_train, y_train)

In [24]:
# Create an instance of Linear Regression
linear_reg_model = LinearRegression()
linear_pipe =  Pipeline([
    ('preprocessor', preprocessor),  # Tuple 1: (name, estimator)
    ('linear_reg', linear_reg_model)
])

linear_pipe.fit(X_train, y_train)

In [None]:
print("preprocessor", preprocessor.named_transformers_['cat'])

In [25]:
# transform train and test data
X_train_processed = preprocessor.transform(X_train)
X_test_processed = preprocessor.transform(X_test)

# Step 9: Inspect the Result

In [29]:
def inspectResults(train_data, test_data):
    print(np.isnan(train_data).sum().sum(), 'missing values in training data')
    print(np.isnan(test_data).sum().sum(), 'missing values in testing data')
    print('\n')
    print('shape of data is', train_data.shape)
    print('\n')
    return train_data

In [30]:
inspectResults(X_train_processed, X_test_processed)

0 missing values in training data
0 missing values in testing data




shape of data is (119, 13)




Unnamed: 0,num__Length1,num__Length2,num__Length3,num__Height,num__Width,cat__Species_Beam,cat__Species_Bream,cat__Species_Parkki,cat__Species_Perch,cat__Species_Pike,cat__Species_Roach,cat__Species_Smelt,cat__Species_Whitefish
26,0.579039,0.668158,0.879888,0.000000,1.002242,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
137,1.585471,1.650867,1.586725,-0.460459,0.273485,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
146,-1.635110,-1.739477,-1.890146,-1.726736,-2.006780,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
90,-0.628679,-0.609362,-0.753477,-0.825389,-0.276440,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
66,-0.729322,-0.737114,-0.782132,-0.059962,-0.704050,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
71,-0.226106,-0.216279,-0.228126,0.658577,-0.130566,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
106,-0.034884,-0.019737,-0.189918,-0.242033,-0.148755,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
14,0.317367,0.373346,0.555126,1.569038,0.441209,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
92,-0.578357,-0.560227,-0.705718,-0.503108,-0.502879,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0


Evaluating your model
    The function below prints the regression metrics for a set of predictions.

In [31]:
def eval_regression(true,predicted_values):
    '''Takes true and predicted values (arrays) and prints MAE, MSE, RMSE, and R2'''
    # The sklearn.metrics functions do the calculations; numpy is only needed for the square root
    # Two parameters are needed because every metric compares the true values with the predicted values
    mae = mean_absolute_error(true, predicted_values)
    mse = mean_squared_error(true, predicted_values)
    rmse = np.sqrt(mse)
    r2 = r2_score(true, predicted_values)
    print(f'MAE {mae}, \n MSE {mse}, \n RMSE {rmse}, \n R^2 {r2}')
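
For intuition about these metrics, here is a small worked sketch with made-up numbers (y_true, pred_mean, and pred_perfect are hypothetical). Predicting the mean of the true values gives an R^2 of 0, and a perfect prediction gives an R^2 of 1.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical true values and two sets of predictions
y_true = np.array([1.0, 2.0, 3.0])
pred_mean = np.array([2.0, 2.0, 2.0])     # always predicts the mean of y_true
pred_perfect = np.array([1.0, 2.0, 3.0])  # matches y_true exactly

mae = mean_absolute_error(y_true, pred_mean)  # (1 + 0 + 1) / 3
mse = mean_squared_error(y_true, pred_mean)   # (1 + 0 + 1) / 3
rmse = np.sqrt(mse)
r2_mean = r2_score(y_true, pred_mean)         # 1 - 2/2
r2_perfect = r2_score(y_true, pred_perfect)   # 1 - 0/2

print(mae, mse, rmse, r2_mean, r2_perfect)
```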

In [32]:
train_predicts = dummy_pipe.predict(X_train)
test_predicts = dummy_pipe.predict(X_test)

In [33]:
train_predicts

array([393.27226891, 393.27226891, 393.27226891, 393.27226891,
       393.27226891, 393.27226891, 393.27226891, 393.27226891,
       393.27226891, 393.27226891, 393.27226891, 393.27226891,
       393.27226891, 393.27226891, 393.27226891, 393.27226891,
       393.27226891, 393.27226891, 393.27226891, 393.27226891,
       393.27226891, 393.27226891, 393.27226891, 393.27226891,
       393.27226891, 393.27226891, 393.27226891, 393.27226891,
       393.27226891, 393.27226891, 393.27226891, 393.27226891,
       393.27226891, 393.27226891, 393.27226891, 393.27226891,
       393.27226891, 393.27226891, 393.27226891, 393.27226891,
       393.27226891, 393.27226891, 393.27226891, 393.27226891,
       393.27226891, 393.27226891, 393.27226891, 393.27226891,
       393.27226891, 393.27226891, 393.27226891, 393.27226891,
       393.27226891, 393.27226891, 393.27226891, 393.27226891,
       393.27226891, 393.27226891, 393.27226891, 393.27226891,
       393.27226891, 393.27226891, 393.27226891, 393.27

In [34]:
test_predicts

array([393.27226891, 393.27226891, 393.27226891, 393.27226891,
       393.27226891, 393.27226891, 393.27226891, 393.27226891,
       393.27226891, 393.27226891, 393.27226891, 393.27226891,
       393.27226891, 393.27226891, 393.27226891, 393.27226891,
       393.27226891, 393.27226891, 393.27226891, 393.27226891,
       393.27226891, 393.27226891, 393.27226891, 393.27226891,
       393.27226891, 393.27226891, 393.27226891, 393.27226891,
       393.27226891, 393.27226891, 393.27226891, 393.27226891,
       393.27226891, 393.27226891, 393.27226891, 393.27226891,
       393.27226891, 393.27226891, 393.27226891, 393.27226891])
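
Every prediction above is the same number because DummyRegressor(strategy='mean') ignores the features and always returns the mean of the training target, so 393.27226891 should be the mean of y_train. A minimal sketch with made-up data (X_fit, y_fit, X_new, and baseline are hypothetical):

```python
import numpy as np
from sklearn.dummy import DummyRegressor

# Hypothetical features and targets; DummyRegressor ignores the features entirely
X_fit = np.zeros((4, 2))
y_fit = np.array([100.0, 200.0, 300.0, 400.0])
X_new = np.ones((3, 2))

baseline = DummyRegressor(strategy="mean")
baseline.fit(X_fit, y_fit)
new_predictions = baseline.predict(X_new)

print(y_fit.mean())     # mean of the training targets
print(new_predictions)  # the same value repeated for every new row
```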

In [35]:
# MAE, MSE, RMSE, and R^2 score of the baseline model on the train data (the test data follows in the next cell)
eval_regression(y_train, train_predicts)

MAE 288.0584139538168, 
 MSE 126217.46620577642, 
 RMSE 355.27097574355327, 
 R^2 0.0


In [36]:
eval_regression(y_test, test_predicts)

MAE 314.72834033613447, 
 MSE 130791.0537290975, 
 RMSE 361.65045794122466, 
 R^2 -0.0030955235923453284
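
The slightly negative R^2 on the test data is expected for this baseline: it predicts the mean of the training target, which is generally not the mean of the test target. The sketch below uses made-up numbers (y_train_demo, y_test_demo, and constant are hypothetical) and checks the result against the identity R^2 = -n * (mean(y_test) - c)^2 / SS_tot, which holds for any constant prediction c.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical targets: the training mean (2.0) differs from the test mean (3.0)
y_train_demo = np.array([1.0, 2.0, 3.0])
y_test_demo = np.array([2.0, 3.0, 4.0])

constant = y_train_demo.mean()
constant_preds = np.full_like(y_test_demo, constant)

r2_test = r2_score(y_test_demo, constant_preds)

# For a constant prediction c: R^2 = -n * (mean(y_test) - c)^2 / SS_tot
n = len(y_test_demo)
ss_tot = np.sum((y_test_demo - y_test_demo.mean()) ** 2)
r2_formula = -n * (y_test_demo.mean() - constant) ** 2 / ss_tot

print(r2_test, r2_formula)
```

The closer the training mean is to the test mean, the closer this value is to 0, which fits the small negative score reported above.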
