Steps:
1. Import libraries 
2. Load dataset
3. Make a copy of the original dataset and keep working on one of them
4. Inspect the data: check for inconsistencies, missing values, and duplicates
5. Split the dataset into train/test sets
6. Separate the categorical and numerical features
7. Handle missing values in the categorical and numerical features using SimpleImputer
8. Fit model and evaluate it

# Step 1: Import libraries

In [16]:
import pandas as pd
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn import set_config
from sklearn.model_selection import train_test_split
from sklearn.compose import make_column_selector
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
set_config(display='diagram')
# Set the maximum number of rows and columns to display
pd.set_option('display.max_rows', 10)      # Set the maximum number of rows to display
# pd.set_option('display.max_columns', 5)   # Set the maximum number of columns to display

# Step 2: Load Dataset

In [2]:
path = './Fish - Fish.csv'
fish_df = pd.read_csv(path)
fish_df.info()
fish_df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159 entries, 0 to 158
Data columns (total 7 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   Species  155 non-null    object 
 1   Weight   159 non-null    float64
 2   Length1  157 non-null    float64
 3   Length2  157 non-null    float64
 4   Length3  150 non-null    float64
 5   Height   156 non-null    float64
 6   Width    157 non-null    float64
dtypes: float64(6), object(1)
memory usage: 8.8+ KB


Unnamed: 0,Species,Weight,Length1,Length2,Length3,Height,Width
0,Bream,242.0,23.2,25.4,30.0,11.52,4.02
1,Bream,290.0,24.0,26.3,31.2,12.48,4.3056
2,Bream,340.0,23.9,26.5,31.1,12.3778,4.6961
3,Bream,363.0,26.3,29.0,33.5,12.73,4.4555
4,Bream,430.0,26.5,29.0,34.0,12.444,5.134

# Step 3: Make a Copy of the Dataset


In [3]:
fish_df_copy = fish_df.copy()

# Step 4: Inspect the Data

    There are 0 duplicates. However, the dataset does contain null values, as the non-null counts in the info() output above already suggest.

In [4]:
fish_df.shape

(159, 7)

In [5]:
# reorder columns so that it is easier to search for target 
column_reordered = ['Species', 'Length1', 'Length2', 'Length3', 'Height', 'Width', 'Weight']

fish_df = fish_df[column_reordered]

In [6]:
fish_df.head()

Unnamed: 0,Species,Length1,Length2,Length3,Height,Width,Weight
0,Bream,23.2,25.4,30.0,11.52,4.02,242.0
1,Bream,24.0,26.3,31.2,12.48,4.3056,290.0
2,Bream,23.9,26.5,31.1,12.3778,4.6961,340.0
3,Bream,26.3,29.0,33.5,12.73,4.4555,363.0
4,Bream,26.5,29.0,34.0,12.444,5.134,430.0



In [7]:
def chk_for_dups_null(df):
    print("checking for duplicates ===>  \n",df.duplicated().sum())
    print("checking for null values ===> \n",df.isna().sum())
chk_for_dups_null(fish_df)


checking for duplicates ===>  
 0
checking for null values ===> 
 Species    4
Length1    2
Length2    2
Length3    9
Height     3
Width      2
Weight     0
dtype: int64


However, there are null values. We could drop the affected rows, since no column is missing more than 9 of its 159 values (about 6%), but we are practicing pipelines, so we will build a pipeline that fills in missing data and standardizes the features.
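
To judge whether dropping rows would be acceptable, it can help to look at the share of missing values rather than the raw counts. A minimal sketch, assuming a small made-up frame (`toy_df` is hypothetical and is not the fish data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, only to illustrate the calculation
toy_df = pd.DataFrame({
    "Species": ["Bream", np.nan, "Pike", "Perch"],
    "Length1": [23.2, 24.0, np.nan, 26.3],
    "Weight": [242.0, 290.0, 340.0, 363.0],
})

# Share of missing values per column, as a percentage
missing_pct = toy_df.isna().mean() * 100
print(missing_pct.round(1))

# Share of rows that contain at least one missing value
rows_with_na_pct = toy_df.isna().any(axis=1).mean() * 100
print(f"{rows_with_na_pct:.1f}% of rows have at least one missing value")
```

The row-level figure is the one that matters for dropping, because a row is lost as soon as any one of its columns is missing.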

# Step 5: Split the Dataset

    Why split data:
            Training Data: This is the portion of your data used to train your machine learning model. The model learns patterns, relationships, and rules from this data.
            Testing Data: This is a separate portion of your data that the model has never seen during training. It's used to assess how well the trained model generalizes to new, unseen data.

In [8]:
target = 'Weight'
X = fish_df.drop(columns=[target])
y= fish_df[target]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

    train_test_split returns containers of the same kind it was given. Here X is a pandas DataFrame and y is a pandas Series, so X_train and X_test are DataFrames and y_train and y_test are Series. Passing NumPy arrays instead would return NumPy arrays.
    
    X_train: Contains the feature data for the training set. It's typically a DataFrame or NumPy array. It's the portion of your data that your machine learning model will be trained on.

    X_test: Contains the feature data for the testing set. It's also typically a DataFrame or NumPy array. It's used to evaluate the model's performance on unseen data.

    y_train: Contains the target variable data for the training set. It's typically a Series (if using pandas) or a NumPy array.

    y_test: Contains the target variable data for the testing set. It's also typically a Series (if using pandas) or a NumPy array.

    The order of these variables (X_train, X_test, y_train, y_test) is the order in which train_test_split returns them when given X and y: both feature splits first, then both target splits. Unpacking them in any other order would silently mislabel the data.
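
The points above can be checked directly. This is a small sketch on made-up data (`toy_X` and `toy_y` are hypothetical names), showing the returned types, the default 75/25 split, and that features and targets stay aligned by index:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data with 8 rows
toy_X = pd.DataFrame({"Length1": range(8), "Height": range(8, 16)})
toy_y = pd.Series(range(100, 108), name="Weight")

# With no test_size given, scikit-learn holds out 25% of the rows
Xtr, Xte, ytr, yte = train_test_split(toy_X, toy_y, random_state=42)

print(type(Xtr).__name__, Xtr.shape)  # feature containers keep the input type
print(type(ytr).__name__, ytr.shape)  # target containers keep the input type
print(Xte.shape, yte.shape)

# Rows of X and y are shuffled together, so their indexes still match
print(Xtr.index.equals(ytr.index))
```

The same default explains the 119 training rows seen later in this notebook: 25% of 159 rows is rounded up to 40 test rows.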

# Step 6 : Separate categorical and numerical data 

In [9]:
# instantiate the column selectors
num_selector = make_column_selector(dtype_include='number')
cat_selector = make_column_selector(dtype_include='object')

In [10]:
numeric_columns = num_selector(X_train)
categorical_columns = cat_selector(X_train)
print(f"Numeric columns are : {numeric_columns} \n Categorical columns are : {categorical_columns}")



Numeric columns are : ['Length1', 'Length2', 'Length3', 'Height', 'Width'] 
 Categorical columns are : ['Species']


# Step 7: Handle Missing Data, Categorical Data, and Scaling!

    AKA setting up our preprocessor pipeline

In [11]:
# Scaler
scaler = StandardScaler() #for numeric data 
# One-hot encoder
ohe = OneHotEncoder(handle_unknown='ignore', sparse_output=False) #categorical nominal data

In [12]:
# Create a transformer for imputing missing values in numeric columns
# "imputer" can be any name just as long as it makes sense

numeric_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', scaler)
])

categorical_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('ohe', ohe)
])



numeric_transformer and categorical_transformer are each a scikit-learn Pipeline: the first imputes and then scales, the second imputes and then one-hot encodes.

In [13]:
# Create a ColumnTransformer that applies the numeric transformer to the numeric
# columns and the categorical transformer to the categorical columns
# "num" and "cat" can be any name just as long as it makes sense
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_columns),
        ('cat', categorical_transformer, categorical_columns)
    ])

Important note: the next few cells check what the data looks like after preprocessing, before any model is trained.

In [None]:
# Fit and transform your data
X_preprocessed = preprocessor.fit_transform(X_train)

# Print intermediate results (optional)
print("Numeric columns after imputation and scaling:")
print(X_preprocessed[:, :len(numeric_columns)])  # numeric columns come first


In [None]:
print("Categorical columns after imputation and one-hot encoding:")
print(X_preprocessed[:, len(numeric_columns):])  # one-hot columns follow the numeric ones

To check what's happening under the hood in scikit-learn's transformers and pipelines, you can use the fit_transform or transform methods and print the intermediate results at various stages. This can help you understand how data is being processed at each step. 

The data returned by this preprocessor is a NumPy array. Working with a NumPy array in this context has several advantages and use cases:

    Compatibility with Scikit-Learn: Scikit-Learn, a popular machine learning library, primarily works with NumPy arrays or pandas DataFrames. By converting your data into a NumPy array, you ensure that it can be seamlessly integrated with various Scikit-Learn functions and models.

    Efficient Computation: NumPy arrays are highly optimized for numerical computations. When you perform mathematical operations or apply machine learning algorithms, NumPy arrays are more efficient than native Python data structures like lists or tuples.

    Slicing and Indexing: NumPy arrays allow for easy slicing and indexing, as you've seen in the code. This makes it convenient to access specific subsets of your data, such as numeric and categorical columns in this case.

    Standardized Data Format: Converting data into a NumPy array helps standardize the data format. This can be important when working with various data preprocessing steps, transformations, and machine learning models. It ensures a consistent and expected input format.

    Memory Efficiency: NumPy arrays can be more memory-efficient than pandas DataFrames, especially when dealing with large datasets. NumPy stores data in a more compact way, which can reduce memory overhead.

    Integration with Other Libraries: Many other scientific and numerical libraries, beyond Scikit-Learn, are built to work with NumPy arrays. This compatibility makes it easier to use these libraries in conjunction with your machine learning workflow.

# Important to note:
The imputed values and the one-hot encoding are NOT applied to the original dataframe (fish_df) or to X_train. fit_transform returns a new object, stored here in the X_preprocessed variable, which is a NumPy array.

X_preprocessed now contains the data with missing values filled in and all columns converted to numerical format.

    Missing Values: The missing values in both numeric and categorical columns were filled in using the appropriate strategies (mean for numeric and most frequent for categorical).

    One-Hot Encoding: Categorical columns were one-hot encoded, converting them from categorical to numerical format.

    Numerical Columns: Numeric columns had their missing values filled in and were then standardized by the StandardScaler in numeric_transformer.

The StandardScaler standardizes each numeric feature so that it has a mean of 0 and a standard deviation of 1 on the data it was fit on.

# Why do we scale, again?

    Equal Treatment of Features: Scaling ensures that all features are treated equally in terms of their impact on the model. Without scaling, features with larger scales (e.g., income in thousands) can dominate those with smaller scales (e.g., age) in the learning process.

    Convergence: Algorithms that rely on optimization, like gradient descent, converge faster when features are on similar scales. This helps reduce the time it takes for the model to find the optimal solution.

    Numerical Stability: Scaling can improve the numerical stability of certain calculations, particularly in algorithms that involve matrix operations. This prevents issues like overflow or underflow.

Mathematics of Scaling:
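Scaling involves transforming your data so that it has a particular scale. Standardization gives each feature a mean of 0 and a standard deviation of 1. It does not change the shape of the distribution, so the result is standard normal only if the feature was normally distributed to begin with. The standard scaling transformation is: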

$$\text{Standardized Value} = \frac{\text{Value} - \text{Mean}}{\text{Standard Deviation}}$$

Here's a brief explanation of the math:

    Subtracting the mean centers the data around 0. This step ensures that the transformed data has a mean of 0.

    Dividing by the standard deviation rescales the data so that the transformed feature has a standard deviation (and therefore a variance) of 1.

By scaling your data using the standardization process, you bring all features to a similar scale, and they all have a mean of 0 and a standard deviation of 1. This standardization helps ensure that your machine learning algorithms can effectively learn from and generalize to the data.
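
The formula can be verified by hand. This sketch takes the first five Length1 values shown earlier and compares a manual calculation with StandardScaler; note that StandardScaler divides by the population standard deviation (`ddof=0`), not the sample one:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# First five Length1 values from the head() output, as a single-column array
lengths = np.array([[23.2], [24.0], [23.9], [26.3], [26.5]])

# Manual standardization: subtract the mean, divide by the standard deviation
manual = (lengths - lengths.mean(axis=0)) / lengths.std(axis=0, ddof=0)

# The same transformation through scikit-learn
scaled = StandardScaler().fit_transform(lengths)

print(np.allclose(manual, scaled))
print(scaled.mean(axis=0), scaled.std(axis=0))
```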

In [20]:
# create a subset of data for only categorical columns
train_cat_data = X_train[cat_selector(X_train)]
test_cat_data = X_test[cat_selector(X_test)]
train_cat_data.head()

Unnamed: 0,Species
26,Bream
137,Pike
146,Smelt
90,Perch
66,Parkki


In [22]:
#fit the OneHotEncoder on the training data
ohe.fit(train_cat_data)
#transform both the training and the testing data
train_ohe = ohe.transform(train_cat_data)
test_ohe = ohe.transform(test_cat_data)
train_ohe

array([[0., 1., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 1., 0., 0.],
       ...,
       [0., 1., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [23]:
ohe_column_names = ohe.get_feature_names_out(train_cat_data.columns)
train_ohe = pd.DataFrame(train_ohe, columns=ohe_column_names)
test_ohe = pd.DataFrame(test_ohe, columns=ohe_column_names)
train_ohe.head()

Unnamed: 0,Species_Beam,Species_Bream,Species_Parkki,Species_Perch,Species_Pike,Species_Roach,Species_Smelt,Species_Whitefish,Species_nan
0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
3,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0


# Step 8: Fit The Model!

Linear Regression Model Part

    IMPORTANT: We fit ONLY on the TRAINING data, NOT on the testing data

In [24]:
# import LinearRegression model
from sklearn.linear_model import LinearRegression

# Create an instance of Linear Regression
linear_reg_model = LinearRegression()


At this step, the preprocessor is fit on the training data, meaning it learns its parameters from that data. The regression model created above has not been fit yet.

    Initialization: preprocessor is an instance of ColumnTransformer. Before it can be used to transform data, it needs to be configured with the transformations you want to apply to your data.

    Configuration: When you create a ColumnTransformer, you specify what preprocessing steps should be applied to different subsets of your features. For example, you might specify that numeric features should be scaled, while categorical features should be one-hot encoded. This configuration is based on the transformers you defined.

    Fitting: The fit method of the ColumnTransformer goes through each specified transformation and "fits" or "learns" parameters from the training data. This step varies depending on the specific transformers used. For example:

        If you're using StandardScaler to scale numeric features, it calculates the mean and standard deviation of each numeric feature from X_train.

        If you're using OneHotEncoder to one-hot encode categorical features, it determines the unique categories in each categorical feature from X_train.

    Parameter Learning: During the fitting process, the transformers are essentially learning something about the training data that they will later use to transform the data. For example, the scaler learns the mean and standard deviation of numeric features, while the one-hot encoder learns the unique categories.

    Transformation: After fitting, the transformers are ready to transform new data. You can use the same preprocessor to transform both your training and test data. The learned parameters are applied consistently to ensure that the same transformations are performed on both datasets.

In summary, preprocessor.fit(X_train) configures and learns the preprocessing steps you've specified in the ColumnTransformer based on the training data. This ensures that the same transformations are applied consistently to both the training and test data, preventing data leakage and allowing your model to make predictions on new, unseen data.

In [25]:
# fit on train ONLY
preprocessor.fit(X_train)
# X_train


In [26]:
# transform train and test data
X_train_processed = preprocessor.transform(X_train)
X_test_processed = preprocessor.transform(X_test)

# Step 9: Inspect the Result

In [29]:
# categorical_columns
# # Fit the encoder on your categorical columns in your training data 'X_train'
# ohe.fit(X_train[categorical_columns])

# # Get the feature (column) names after one-hot encoding
# ohe_cat_col_names = ohe.get_feature_names_out(input_features=categorical_columns)
ohe_cat_col_names = ohe_column_names.tolist() # converting the Numpy Array to a list
ohe_cat_col_names


['Species_Beam',
 'Species_Bream',
 'Species_Parkki',
 'Species_Perch',
 'Species_Pike',
 'Species_Roach',
 'Species_Smelt',
 'Species_Whitefish',
 'Species_nan']

In [30]:
# Check data types in numeric_columns
print(type(numeric_columns))
print(type(ohe_cat_col_names))
print(type(X_train_processed))


<class 'list'>
<class 'list'>
<class 'numpy.ndarray'>


In [35]:
def inspectResults(train_data, test_data):
    print(np.isnan(train_data).sum().sum(), 'missing values in training data')
    print(np.isnan(test_data).sum().sum(), 'missing values in testing data')
    print('\n')
    print('All data in X_train_processed are', train_data.dtype)
    print('All data in X_test_processed are', test_data.dtype)
    print('\n')
    print('shape of data is', train_data.shape)
    print('\n')
    # Combine numeric and one-hot encoded column names
    # Create a DataFrame with the correct number of columns
    # train_data = pd.DataFrame(train_data, columns=numeric_columns + ohe_cat_col_names)
    return train_data

In [36]:
inspectResults(X_train_processed, X_test_processed)

0 missing values in training data
0 missing values in testing data


All data in X_train_processed are float64
All data in X_test_processed are float64


shape of data is (119, 13)




array([[ 0.5790393 ,  0.66815832,  0.87988819, ...,  0.        ,
         0.        ,  0.        ],
       [ 1.58547083,  1.65086656,  1.58672453, ...,  0.        ,
         0.        ,  0.        ],
       [-1.63511008, -1.73947686, -1.89014612, ...,  0.        ,
         1.        ,  0.        ],
       ...,
       [ 0.3173671 ,  0.37334585,  0.55512555, ...,  0.        ,
         0.        ,  0.        ],
       [-0.57835697, -0.56022698, -0.70571766, ...,  0.        ,
         0.        ,  0.        ],
       [-0.10533415, -0.08852702, -0.25678106, ...,  0.        ,
         0.        ,  0.        ]])

In [45]:
from sklearn.dummy import DummyRegressor #baseline model for regression

In [46]:
# instantiate a baseline model
dummy_reg = DummyRegressor(strategy='mean')

# create model pipeline
# Pipeline expects a single list of (name, step) tuples; passing the steps as
# separate positional arguments raises
# "TypeError: Pipeline.__init__() takes 2 positional arguments but 3 were given"
dummy_pipe = Pipeline([('preprocessor', preprocessor), ('dummy', dummy_reg)])

# this is where the model starts learning 
dummy_pipe.fit(X_train, y_train)

Evaluating your model:
    The function below prints the regression metrics for a set of predictions.

In [44]:
def eval_regression(true,predicted_values):
    ''' Takes true and predicted values (arrays) and prints MAE, MSE, RMSE, and R2'''
    # true: the actual target values; predicted_values: the model's predictions
    # RMSE is the square root of MSE, so NumPy is only needed for np.sqrt
    mae = mean_absolute_error(true, predicted_values)
    mse = mean_squared_error(true, predicted_values)
    rmse = np.sqrt(mse)
    r2 = r2_score(true, predicted_values)
    print(f'MAE {mae}, \n MSE {mse}, \n RMSE {rmse}, \n R^2 {r2}')
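
Metrics are easier to interpret next to a baseline. The sketch below is not the notebook's fish model: it uses synthetic data (generated with a fixed seed) and `make_pipeline` to show the general pattern of fitting a mean-predicting DummyRegressor and a LinearRegression on the training set, then scoring both on the held-out set. All names here (`toy_X`, `toy_y`, `baseline`, `linreg`) are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: weight grows with length, plus random noise
rng = np.random.default_rng(42)
length = rng.uniform(10, 40, size=80)
toy_X = pd.DataFrame({"Length1": length})
toy_y = pd.Series(20 * length + rng.normal(0, 10, size=80), name="Weight")

Xtr, Xte, ytr, yte = train_test_split(toy_X, toy_y, random_state=42)

# Both pipelines are fit on the training data only
baseline = make_pipeline(StandardScaler(), DummyRegressor(strategy="mean")).fit(Xtr, ytr)
linreg = make_pipeline(StandardScaler(), LinearRegression()).fit(Xtr, ytr)

# Score each model on the held-out test data
for name, model in [("baseline", baseline), ("linear regression", linreg)]:
    preds = model.predict(Xte)
    print(f"{name}: MAE={mean_absolute_error(yte, preds):.2f}, "
          f"R^2={r2_score(yte, preds):.3f}")
```

In the notebook itself, the equivalent would presumably be a pipeline that combines preprocessor with linear_reg_model, fit on X_train and y_train, with eval_regression called on both the training and the testing predictions.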