Steps:
1. Import libraries 
2. Load dataset
3. Make a copy of original dataset and use one for continued pre-processing
4. Inspect the data | Check for inconsistencies and missing values, duplicates
5. Split the data set in train/test
6. Separate the categorical and numerical features
7. Check for and handle missing values in the categorical and numerical features using SimpleImputer
8. Fit model and evaluate it

# Step 1: Import libraries

In [1]:
import pandas as pd
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn import set_config
from sklearn.model_selection import train_test_split
from sklearn.compose import make_column_selector
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
set_config(display='diagram')

# Step 2: Load Dataset

In [2]:
path = './Fish - Fish.csv'
fish_df = pd.read_csv(path)
fish_df.info()
fish_df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159 entries, 0 to 158
Data columns (total 7 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   Species  155 non-null    object 
 1   Weight   159 non-null    float64
 2   Length1  157 non-null    float64
 3   Length2  157 non-null    float64
 4   Length3  150 non-null    float64
 5   Height   156 non-null    float64
 6   Width    157 non-null    float64
dtypes: float64(6), object(1)
memory usage: 8.8+ KB


Unnamed: 0,Species,Weight,Length1,Length2,Length3,Height,Width
0,Bream,242.0,23.2,25.4,30.0,11.52,4.02
1,Bream,290.0,24.0,26.3,31.2,12.48,4.3056
2,Bream,340.0,23.9,26.5,31.1,12.3778,4.6961
3,Bream,363.0,26.3,29.0,33.5,12.73,4.4555
4,Bream,430.0,26.5,29.0,34.0,12.444,5.134


In [3]:
fish_df_copy = fish_df.copy()

# Step 4: Inspect the Data

    There are 0 duplicates. However, some null values appear to be present, based on an initial look at the dataset.

In [4]:
fish_df.shape

(159, 7)

In [5]:
# reorder columns so that it is easier to search for target 
column_reordered = ['Species', 'Length1', 'Length2', 'Length3', 'Height', 'Width', 'Weight']

fish_df = fish_df[column_reordered]

In [6]:
fish_df.head()

Unnamed: 0,Species,Length1,Length2,Length3,Height,Width,Weight
0,Bream,23.2,25.4,30.0,11.52,4.02,242.0
1,Bream,24.0,26.3,31.2,12.48,4.3056,290.0
2,Bream,23.9,26.5,31.1,12.3778,4.6961,340.0
3,Bream,26.3,29.0,33.5,12.73,4.4555,363.0
4,Bream,26.5,29.0,34.0,12.444,5.134,430.0



In [7]:
def chk_for_dups_null(df):
    print("checking for duplicates ===>  \n",df.duplicated().sum())
    print("checking for null values ===> \n",df.isna().sum())
chk_for_dups_null(fish_df)


checking for duplicates ===>  
 0
checking for null values ===> 
 Species    4
Length1    2
Length2    2
Length3    9
Height     3
Width      2
Weight     0
dtype: int64


However, there are null values. We could drop the affected rows, since no column is missing more than 9 of its 159 values (under 6%), but we are practicing pipelines, so we will build a pipeline that fills in the missing data and standardizes the features.
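To judge whether dropping rows would be acceptable, it can help to look at the share of missing values rather than the raw counts. The snippet below is a minimal sketch on a made-up frame (`toy_df` and its values are illustrative assumptions, not the fish data).

```python
import numpy as np
import pandas as pd

# Toy frame standing in for fish_df (values are made up for illustration)
toy_df = pd.DataFrame({
    "Species": ["Bream", None, "Perch", "Pike"],
    "Length1": [23.2, 24.0, np.nan, 26.3],
    "Weight": [242.0, 290.0, 340.0, 363.0],
})

# Fraction of missing values in each column
missing_share = toy_df.isna().mean()
print(missing_share)

# Fraction of rows that contain at least one missing value
rows_with_nulls = toy_df.isna().any(axis=1).mean()
print(f"Rows with at least one null: {rows_with_nulls:.0%}")
```

The row-level share is the number that matters for dropping, because a single missing cell removes the whole row.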

# Step 5: Split the Dataset

    Why split data:
            Training Data: This is the portion of your data used to train your machine learning model. The model learns patterns, relationships, and rules from this data.
            Testing Data: This is a separate portion of your data that the model has never seen during training. It's used to assess how well the trained model generalizes to new, unseen data.

In [8]:
target = 'Weight'
X = fish_df.drop(columns=[target])
y= fish_df[target]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

    The objects returned by the train_test_split function have the same container types as its inputs. Here X is a pandas DataFrame and y is a pandas Series, so X_train and X_test are DataFrames and y_train and y_test are Series. Passing NumPy arrays instead would return NumPy arrays.
    
    X_train: Contains the feature data for the training set. It's typically a DataFrame or NumPy array. It's the portion of your data that your machine learning model will be trained on.

    X_test: Contains the feature data for the testing set. It's also typically a DataFrame or NumPy array. It's used to evaluate the model's performance on unseen data.

    y_train: Contains the target variable data for the training set. It's typically a Series (if using pandas) or a NumPy array.

    y_test: Contains the target variable data for the testing set. It's also typically a Series (if using pandas) or a NumPy array.

    The order of these variables (X_train, X_test, y_train, y_test) is not arbitrary: it is the order in which the train_test_split function returns the values, so unpacking them in a different order would silently mix up the sets.
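The return types and split sizes described above can be checked directly. This is a small sketch on a made-up frame (`toy_df`, `X_toy`, `y_toy` and the short variable names are illustrative assumptions).

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data standing in for the fish measurements (values are illustrative)
toy_df = pd.DataFrame({
    "Length1": [23.2, 24.0, 23.9, 26.3, 26.5, 26.8, 26.8, 27.6],
    "Height": [11.5, 12.5, 12.4, 12.7, 12.4, 13.6, 14.2, 12.7],
    "Weight": [242.0, 290.0, 340.0, 363.0, 430.0, 450.0, 500.0, 390.0],
})
X_toy = toy_df.drop(columns=["Weight"])
y_toy = toy_df["Weight"]

# Same call pattern as in the notebook: no test_size given
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=42)

# DataFrame in, DataFrame out; Series in, Series out
print(type(X_tr).__name__, type(y_tr).__name__)
print(X_tr.shape, X_te.shape)

# Features and target stay aligned row by row
print(X_tr.index.equals(y_tr.index))
```

When neither test_size nor train_size is given, train_test_split holds out 25% of the rows for testing, which is why 8 rows split into 6 and 2 here.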

# Step 6 : Separate categorical and numerical data 

In [9]:
# instantiate the columns selectors
num_selector = make_column_selector(dtype_include='number')
cat_selector = make_column_selector(dtype_include='object')

In [10]:
numeric_columns = num_selector(X_train)
categorical_columns = cat_selector(X_train)
print(f"Numeric columns are : {numeric_columns} \n Categorical columns are : {categorical_columns}")



Numeric columns are : ['Length1', 'Length2', 'Length3', 'Height', 'Width'] 
 Categorical columns are : ['Species']


# Step 7: Handle Missing Data, Categorical Data, and Scaling!

In [11]:
# Create a transformer for imputing missing values in numeric columns
# "imputer" can be any name just as long as it makes sense
numeric_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())  
])

categorical_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    # scikit-learn >= 1.2 names this argument sparse_output; older versions use sparse=False
    ('onehot', OneHotEncoder(sparse_output=False, handle_unknown='ignore'))
])



numeric_transformer and categorical_transformer are each a preprocessing Pipeline: an imputation step followed by scaling (numeric) or one-hot encoding (categorical).

In [12]:
# Create a ColumnTransformer to apply the numeric transformer to numeric columns
# "num" and "cat" can be any name just as long as it makes sense
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_columns),
        ('cat', categorical_transformer, categorical_columns)
    ])

Important note: the next few cells only inspect what the data looks like after preprocessing, before the model is trained and tested.

In [13]:
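# Fit and transform the full dataframe, for inspection only
# (Weight is not listed in either column group, so the ColumnTransformer drops it)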
X_preprocessed = preprocessor.fit_transform(fish_df)

# Print intermediate results (optional)
print("Intermediate Result after Numeric Imputation:")
print(X_preprocessed[:, :len(numeric_columns)])  # Numeric columns


Intermediate Result after Numeric Imputation:
[[-2.92555900e-01 -2.68531910e-01 -1.11589054e-01  6.22604980e-01
  -2.36346595e-01]
 [-2.11297561e-01 -1.81225055e-01  3.73207539e-04  8.51357750e-01
  -6.62429649e-02]
 [-2.21454853e-01 -1.61823532e-01 -8.95698094e-03  8.27005111e-01
   1.66339205e-01]
 [ 2.23201649e-02  8.06955071e-02  2.14967542e-01  9.10928784e-01
   2.30376170e-02]
 [ 4.26347497e-02  8.06955071e-02  2.61618485e-01  8.42779521e-01
   4.27152859e-01]
 [ 7.31066270e-02  1.48600838e-01  3.26929804e-01  1.11880786e+00
   3.04101704e-01]
 [ 7.31066270e-02  1.48600838e-01  3.08269427e-01  1.25632164e+00
   5.13217196e-01]
 [ 1.54364966e-01  1.77703123e-01  3.54920370e-01  8.96631736e-01
   1.62706039e-01]
 [ 1.54364966e-01  1.77703123e-01  3.64250558e-01  1.21471723e+00
   2.54309465e-01]
 [ 2.45780598e-01  2.45608454e-01  4.66882631e-01  1.26754482e+00
   3.23160934e-01]
 [ 2.35623306e-01  2.74710739e-01  4.66882631e-01  1.27617071e+00
   0.00000000e+00]
 [ 2.66095183e-01  



In [14]:
print("Intermediate Result after Categorical Imputation:")
print(X_preprocessed[:, len(numeric_columns):])  # Categorical columns

Intermediate Result after Categorical Imputation:
[[0. 1. 0. ... 0. 0. 0.]
 [0. 1. 0. ... 0. 0. 0.]
 [0. 1. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 1. 0.]
 [0. 0. 0. ... 0. 1. 0.]
 [0. 0. 0. ... 0. 1. 0.]]


To check what's happening under the hood in scikit-learn's transformers and pipelines, you can use the fit_transform or transform methods and print the intermediate results at various stages. This can help you understand how data is being processed at each step. 

The data returned by this preprocessor is a NumPy array.

Turning the data into a NumPy array in this context has several advantages and use cases:

    Compatibility with Scikit-Learn: Scikit-Learn, a popular machine learning library, primarily works with NumPy arrays or pandas DataFrames. By converting your data into a NumPy array, you ensure that it can be seamlessly integrated with various Scikit-Learn functions and models.

    Efficient Computation: NumPy arrays are highly optimized for numerical computations. When you perform mathematical operations or apply machine learning algorithms, NumPy arrays are more efficient than native Python data structures like lists or tuples.

    Slicing and Indexing: NumPy arrays allow for easy slicing and indexing, as you've seen in the code. This makes it convenient to access specific subsets of your data, such as numeric and categorical columns in this case.

    Standardized Data Format: Converting data into a NumPy array helps standardize the data format. This can be important when working with various data preprocessing steps, transformations, and machine learning models. It ensures a consistent and expected input format.

    Memory Efficiency: NumPy arrays can be more memory-efficient than pandas DataFrames, especially when dealing with large datasets. NumPy stores data in a more compact way, which can reduce memory overhead.

    Integration with Other Libraries: Many other scientific and numerical libraries, beyond Scikit-Learn, are built to work with NumPy arrays. This compatibility makes it easier to use these libraries in conjunction with your machine learning workflow.

# Important to note:
The imputation and one-hot encoding are NOT applied to the original dataframe (fish_df). The cell above stored the transformed result in the X_preprocessed variable, which is a NumPy array.

X_preprocessed now contains your data with missing values filled in and all columns converted to numerical format.

    Missing Values: The missing values in both numeric and categorical columns were filled in using the appropriate strategies (mean for numeric and most frequent for categorical).

    One-Hot Encoding: Categorical columns were one-hot encoded, converting them from categorical to numerical format.

    Numerical Columns: Numeric columns were imputed and then standardized, so they no longer hold their original values and contain no missing values.

The StandardScaler step in numeric_transformer standardizes the numeric features so that each has a mean of 0 and a standard deviation of 1.

# Why do we scale, again?

    Equal Treatment of Features: Scaling ensures that all features are treated equally in terms of their impact on the model. Without scaling, features with larger scales (e.g., income in thousands) can dominate those with smaller scales (e.g., age) in the learning process.

    Convergence: Algorithms that rely on optimization, like gradient descent, converge faster when features are on similar scales. This helps reduce the time it takes for the model to find the optimal solution.

    Numerical Stability: Scaling can improve the numerical stability of certain calculations, particularly in algorithms that involve matrix operations. This prevents issues like overflow or underflow.

Mathematics of Scaling:
Scaling transforms your data so that it has a particular scale, often a mean of 0 and a standard deviation of 1. This is called standardization; it changes the location and spread of a feature but not the shape of its distribution. The standard scaling transformation is:

    $$\text{Standardized Value} = \frac{\text{Value} - \text{Mean}}{\text{Standard Deviation}}$$

Here's a brief explanation of the math:

    Subtracting the mean centers the data around 0. This step ensures that the transformed data has a mean of 0.

    Dividing by the standard deviation rescales the data so that the transformed data has a standard deviation (and therefore a variance) of 1.

By scaling your data using the standardization process, you bring all features to a similar scale, and they all have a mean of 0 and a standard deviation of 1. This standardization helps ensure that your machine learning algorithms can effectively learn from and generalize to the data.
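The formula can be checked against StandardScaler on a handful of numbers. This is a minimal sketch; the `values` array is made up, and the manual version uses NumPy's default population standard deviation (ddof=0), which is what StandardScaler uses as well.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# One illustrative feature column
values = np.array([[23.2], [24.0], [23.9], [26.3], [26.5]])

# Manual standardization: subtract the mean, divide by the standard deviation
manual = (values - values.mean(axis=0)) / values.std(axis=0)

# The same transformation done by scikit-learn
scaled = StandardScaler().fit_transform(values)

print(np.allclose(manual, scaled))
print(scaled.mean(axis=0), scaled.std(axis=0))
```

The mean and standard deviation printed at the end will be 0 and 1 up to floating-point rounding.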

# Step 8: Fit The Model!

Linear Regression Model

In [15]:
# import LinearRegression model
from sklearn.linear_model import LinearRegression

# Create an instance of Linear Regression
linear_reg_model = LinearRegression()


In [18]:
# fit on train ONLY
preprocessor.fit(X_train)
# X_train




In [None]:
# transform train and test data
X_train_processed = preprocessor.transform(X_train)
X_test_processed = preprocessor.transform(X_test)
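Fitting on the training data only means that every statistic applied to the test set (fill values, means, standard deviations, known categories) comes from the training set. The sketch below illustrates this on made-up data; `toy_train`, `toy_test` and `toy_preprocessor` are illustrative names, and `sparse_threshold=0` is an assumption made only to get a dense array back.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

toy_train = pd.DataFrame({
    "Species": pd.Series(["Bream", "Perch", np.nan, "Bream"], dtype=object),
    "Length1": [10.0, 20.0, np.nan, 30.0],
})
# A test row with a species never seen in training and a missing length
toy_test = pd.DataFrame({
    "Species": pd.Series(["Pike"], dtype=object),
    "Length1": [np.nan],
})

toy_preprocessor = ColumnTransformer(
    transformers=[
        ("num", Pipeline([("imputer", SimpleImputer(strategy="mean")),
                          ("scaler", StandardScaler())]), ["Length1"]),
        ("cat", Pipeline([("imputer", SimpleImputer(strategy="most_frequent")),
                          ("onehot", OneHotEncoder(handle_unknown="ignore"))]), ["Species"]),
    ],
    sparse_threshold=0,  # always return a dense NumPy array
)

toy_preprocessor.fit(toy_train)              # learn statistics from train only
test_out = toy_preprocessor.transform(toy_test)

# The missing length is filled with the training mean (20.0), which scales to 0.
# The unseen species is encoded as all zeros because handle_unknown='ignore'.
print(test_out)
```

Calling fit_transform on the test set instead would compute new statistics from test data, which is the leakage the "fit on train ONLY" comment guards against.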

Evaluating your model:
    The function below prints the regression metrics for a set of true and predicted values.

In [17]:
def eval_regression(true,predicted_values):
    ''' Takes true and predicted values (arrays) and prints MAE, MSE, RMSE, and R2'''
    # NumPy is only needed for the square root; scikit-learn computes the other metrics
    # two parameters are needed for the calculations: the true values and the predicted values
    mae = mean_absolute_error(true, predicted_values)
    mse = mean_squared_error(true, predicted_values)
    rmse = np.sqrt(mse)
    r2 = r2_score(true, predicted_values)
    print(f'MAE {mae}, \n MSE {mse}, \n RMSE {rmse}, \n R^2 {r2}')
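The remaining part of Step 8 is to fit the model on the processed training data and score it on both splits. The sketch below shows that pattern on synthetic data so it can run on its own; `X_demo`, `y_demo` and `demo_model` are illustrative stand-ins, not the fish data or the notebook's results.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for preprocessed features and a target (not the fish data)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

demo_model = LinearRegression()
demo_model.fit(X_tr, y_tr)  # fit on the training split only

# Predict on both splits to compare training and testing performance
train_pred = demo_model.predict(X_tr)
test_pred = demo_model.predict(X_te)

print(f"Train R^2: {r2_score(y_tr, train_pred):.3f}")
print(f"Test  R^2: {r2_score(y_te, test_pred):.3f}")
print(f"Test  MAE: {mean_absolute_error(y_te, test_pred):.3f}")
```

In this notebook the corresponding calls would presumably be linear_reg_model.fit(X_train_processed, y_train), followed by eval_regression(y_train, linear_reg_model.predict(X_train_processed)) and eval_regression(y_test, linear_reg_model.predict(X_test_processed)). A training score far above the testing score would suggest overfitting.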