<a href="https://colab.research.google.com/github/henrybearden/SYS3501/blob/main/hw4_ml_pipeline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Homework 4 - Understanding Machine Learning Pipeline

- Read the code below and understand the whole process of building a machine learning pipeline
- Answer the 5 questions in the markdown cells
- Store your answers and submit your ipynb file via Canvas
- You CAN use any resources including internet and GenAI tools (remember you can use ChatGPT to help you understand the code)

## A Machine Learning Pipeline for Titanic Dataset Survival Prediction

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import Pipeline

# Load dataset
titanic = sns.load_dataset('titanic')

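# Data cleaning: fill missing values by assignment
# (chained `inplace=True` on a column is deprecated and will stop working in pandas 3.0)
titanic['age'] = titanic['age'].fillna(titanic['age'].median())
titanic['embarked'] = titanic['embarked'].fillna(titanic['embarked'].mode()[0])
titanic['fare'] = titanic['fare'].fillna(titanic['fare'].median())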

# Encode categorical variables
titanic['sex'] = LabelEncoder().fit_transform(titanic['sex'])
titanic['embarked'] = LabelEncoder().fit_transform(titanic['embarked'].astype(str))

# Define raw features
X_raw = titanic[['pclass', 'sex', 'age', 'fare', 'sibsp', 'parch', 'embarked']]
y = titanic['survived']

# Feature engineering
# Some of the features below may be insightful, while others may only add noise.
titanic['family_size'] = titanic['sibsp'] + titanic['parch'] + 1 # family size = sibsp + parch + 1
titanic['is_alone'] = (titanic['family_size'] == 1).astype(int) # is_alone = 1 if family size == 1, otherwise 0
titanic['fare_bin'] = pd.qcut(titanic['fare'], 4, labels=[1, 2, 3, 4]).astype(int) # fare_bin = 1, 2, 3, 4; mapping fare into 4 bins
titanic['age_fare_ratio'] = titanic['age'] / (titanic['fare'] + 1) # age_fare_ratio = age / (fare + 1)
titanic['sibsp_parch_ratio'] = (titanic['sibsp'] + 1) / (titanic['parch'] + 1) # sibsp_parch_ratio = (sibsp + 1) / (parch + 1)
titanic['age_class_interaction'] = titanic['age'] * titanic['pclass'] # age_class_interaction = age * pclass
titanic['fare_per_family_member'] = titanic['fare'] / (titanic['family_size'] + 1) # fare_per_family_member = fare / (family_size + 1)

# Combine features
X_engineered = titanic[['pclass', 'sex', 'age', 'fare', 'embarked', 'family_size', 'is_alone',
                       'fare_bin', 'age_fare_ratio', 'sibsp_parch_ratio', 'age_class_interaction',
                       'fare_per_family_member']]

model = RandomForestClassifier(random_state=42)

# Define pipeline with feature selection and scaling with Random Forest model
pipeline = Pipeline([
    ('feature_selection', SelectKBest(score_func=f_classif, k=6)), # keep the 6 features with the highest ANOVA F-scores instead of all features
    ('scaling', StandardScaler()),
    ('model', model)
])

# Perform 5-fold cross-validation with the pipeline
cv_scores = cross_val_score(pipeline, X_engineered, y, cv=5, scoring='accuracy')

# Output cross-validation results
print("5-Fold Cross-Validation Accuracy Scores:", cv_scores)
print("Mean Accuracy Score:", cv_scores.mean())

5-Fold Cross-Validation Accuracy Scores: [0.77094972 0.8258427  0.87640449 0.80337079 0.83146067]
Mean Accuracy Score: 0.8216056744711568


#### Question 1: please describe the process of building this machine learning pipeline - step by step.

Your answer here: First, the dataset is loaded and cleaned: missing values in age and fare are filled with the median, and missing values in embarked with the mode. Next, the categorical variables sex and embarked are converted to integer codes with LabelEncoder so the model can use them. New features are then engineered from the raw columns. The pipeline itself has three components: (1) feature selection, which keeps the six features with the highest ANOVA F-scores (SelectKBest with f_classif), (2) standard scaling, which rescales each selected feature to zero mean and unit variance, and (3) a RandomForestClassifier. Finally, the pipeline is evaluated with 5-fold cross-validation; because selection and scaling are part of the pipeline, they are refit on the training portion of each fold and then applied to the held-out portion.
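The cross-validation step can also be written out by hand. The sketch below is only an illustration on synthetic data: `make_classification` stands in for the Titanic features, and the names `X_demo`, `y_demo`, and `demo_pipeline` as well as the sample sizes are assumptions made for this example. It shows roughly what `cross_val_score` does for a classifier when `cv=5`: a fresh copy of the whole pipeline is fit on four folds and scored on the fifth.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the Titanic features (assumption, for illustration only)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=12, n_informative=4, random_state=0
)

demo_pipeline = Pipeline([
    ('feature_selection', SelectKBest(score_func=f_classif, k=6)),
    ('scaling', StandardScaler()),
    ('model', RandomForestClassifier(n_estimators=25, random_state=42)),
])

# Manual 5-fold loop: selection, scaling, and the model are all refit per fold
manual_scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=5).split(X_demo, y_demo):
    fold_pipeline = clone(demo_pipeline)  # fresh, unfitted copy for every fold
    fold_pipeline.fit(X_demo[train_idx], y_demo[train_idx])
    manual_scores.append(fold_pipeline.score(X_demo[test_idx], y_demo[test_idx]))
manual_scores = np.array(manual_scores)

# The library call should give the same per-fold accuracies
library_scores = cross_val_score(demo_pipeline, X_demo, y_demo, cv=5, scoring='accuracy')
print("manual: ", manual_scores)
print("library:", library_scores)
```

Because the selector and scaler never see the held-out fold during fitting, the fold scores are not inflated by information leaking from the test rows.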

#### Question 2: based on what you have learned in the lecture, please explain why we need to extract more features versus using raw features only.

Your answer here: Engineered features can expose relationships that the raw columns do not show individually. Interactions between variables (such as age_class_interaction) and ratios (such as age_fare_ratio) combine information from several columns into a single value, so a pattern that depends on two variables at once becomes directly visible to the model. This can make patterns easier to learn and may improve performance, although, as the code comment notes, some engineered features may only add noise.
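As a small illustration of this point, the sketch below applies three of the formulas from the feature engineering cell to three hypothetical passengers. The values are made up for the example and are not rows from the dataset.

```python
import pandas as pd

# Three hypothetical passengers (made-up values, not rows from the dataset)
toy = pd.DataFrame({
    'sibsp': [0, 1, 3],
    'parch': [0, 0, 2],
    'fare': [7.25, 71.28, 31.5],
})

# Same formulas as in the feature engineering cell
toy['family_size'] = toy['sibsp'] + toy['parch'] + 1
toy['is_alone'] = (toy['family_size'] == 1).astype(int)
toy['fare_per_family_member'] = toy['fare'] / (toy['family_size'] + 1)

print(toy)
```

In this toy table the third passenger's raw fare is more than four times the first passenger's, yet the two per-member fares are close. That is the kind of distinction a single raw column cannot express.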

#### Question 3: based on what you have learned in the lecture, please explain why we need feature selection.

Your answer here: Feature selection keeps the most useful features and discards those that mostly add noise. Irrelevant or redundant features can hurt model performance and increase computational cost. Keeping only the top features simplifies the model, lowers the risk of overfitting, and lets the model concentrate on the features most strongly related to the target variable. In this code, SelectKBest with f_classif scores each feature with an ANOVA F-test against the target and keeps the six highest-scoring features.
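The scoring step can be sketched on synthetic data. In the example below, one column is constructed to shift with the class label and two columns are pure noise; the names `signal`, `noise_a`, `noise_b` and all sizes are assumptions made for the illustration. `SelectKBest` with `f_classif` is then asked to keep a single column.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
y_demo = np.repeat([0, 1], 100)                     # 100 samples per class
signal = y_demo + rng.normal(scale=0.5, size=200)   # shifts with the class label
noise_a = rng.normal(size=200)                      # unrelated to the label
noise_b = rng.normal(size=200)                      # unrelated to the label
X_demo = np.column_stack([noise_a, signal, noise_b])

# Score every column with the ANOVA F-test and keep the single best one
selector = SelectKBest(score_func=f_classif, k=1).fit(X_demo, y_demo)
print("F-scores:", selector.scores_)
print("Selected column indices:", selector.get_support(indices=True))
```

The informative column (index 1) should receive a far larger F-score than the noise columns and be the one retained.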

#### Question 4: see printed results (accuracy scores across 5 folds versus averaged accuracy score), explain the benefits of using 5-fold CV versus one-time 80-20 split.

Your answer here: The accuracy scores across the 5 folds (approximately 0.77, 0.83, 0.88, 0.80, and 0.83) vary by about 0.11 between the lowest and highest fold, so the result of any single split depends noticeably on which rows land in the test set. Averaging the five scores (mean of about 0.82) gives a more stable estimate of model accuracy than one 80-20 split, which would report only one of these values. In 5-fold cross-validation every data point is used for testing exactly once and for training in the other four folds, so the estimate draws on the whole dataset rather than a single 20% sample.
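How the folds cover the data can be checked directly. The sketch below uses plain `KFold` on ten dummy samples for simplicity (for classifiers, `cross_val_score` uses a stratified variant, but the coverage property is the same) and counts how often each sample lands in a test fold and in a training fold.

```python
import numpy as np
from sklearn.model_selection import KFold

n_samples = 10  # arbitrary small size, chosen for readability
test_counts = np.zeros(n_samples, dtype=int)
train_counts = np.zeros(n_samples, dtype=int)

for train_idx, test_idx in KFold(n_splits=5).split(np.arange(n_samples)):
    test_counts[test_idx] += 1
    train_counts[train_idx] += 1

print("times used for testing: ", test_counts)
print("times used for training:", train_counts)
```

With a single 80-20 split, by contrast, 80% of the rows would never be used for testing.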

## Compare the performance of different feature sets
1. Raw features only
2. With engineered features
3. Engineered features with feature selection

In [None]:
# Initialize a pipeline without feature selection
pipeline_no_selection = Pipeline([
    ('scaling', StandardScaler()),
    ('model', model)
])

# 1. Raw features only
cv_scores_raw = cross_val_score(pipeline_no_selection, X_raw, y, cv=5, scoring='accuracy')

# 2. Raw plus engineered features (without feature selection)
cv_scores_extraction = cross_val_score(pipeline_no_selection, X_engineered, y, cv=5, scoring='accuracy')

# 3. Engineered features with feature selection
#    (uses the pipeline from the previous cell; SelectKBest is refit inside each fold)
cv_scores_full = cross_val_score(pipeline, X_engineered, y, cv=5, scoring='accuracy')

# Results summary
results_df = pd.DataFrame({
    'Experiment': [
        'Raw Features Only',
        'With Feature Extraction',
        'With Feature Extraction & Feature Selection'
    ],
    'Mean Score': [
        cv_scores_raw.mean(),
        cv_scores_extraction.mean(),
        cv_scores_full.mean()
    ]
})

# Display results
print(results_df)


                                    Experiment  Mean Score
0                            Raw Features Only    0.803616
1                      With Feature Extraction    0.811500
2  With Feature Extraction & Feature Selection    0.821606


#### Question 5: use formal language to explain printed results "results_df".

Your answer here: The table reports the mean 5-fold cross-validation accuracy for three feature sets. With raw features only, the mean accuracy is 0.8036. Adding the engineered features raises it to 0.8115, which suggests that the engineered features carry some additional information. Applying feature selection to the engineered feature set raises it further to 0.8216, which suggests that restricting the model to the six highest-scoring features is beneficial. The improvements are small (roughly 0.01 each) compared with the variation between individual folds (0.77 to 0.88 for the full pipeline), so the results are consistent with feature extraction and selection improving performance but do not establish it conclusively.
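To put the differences in the table into perspective, the sketch below recomputes the two gains from the printed mean scores and compares them with the spread of the five fold scores printed for the full pipeline in the first cell. The numbers are copied from the outputs above; treating the fold spread as a rough yardstick is an informal check, not a significance test.

```python
import numpy as np

# Mean scores copied from the printed results_df
mean_scores = {
    'raw': 0.803616,
    'engineered': 0.811500,
    'engineered + selection': 0.821606,
}

# Per-fold accuracies of the full pipeline, copied from the first cell's output
fold_scores_full = np.array([0.77094972, 0.8258427, 0.87640449, 0.80337079, 0.83146067])

gain_engineering = mean_scores['engineered'] - mean_scores['raw']
gain_selection = mean_scores['engineered + selection'] - mean_scores['engineered']
fold_std = fold_scores_full.std(ddof=1)  # sample standard deviation across folds

print(f"gain from engineered features: {gain_engineering:.4f}")
print(f"gain from feature selection:   {gain_selection:.4f}")
print(f"std across folds (full pipeline): {fold_std:.4f}")
```

Both gains are smaller than the fold-to-fold standard deviation, which is why the ranking of the three feature sets should be read with some caution.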

## [Optional] Feature Importance Analysis
- Note from lecturer:
- Feature importance, in ML, measures how each feature contributes to the prediction accuracy of a machine learning model. It quantitatively reveals the influence of various factors on the model's predictive outcomes.
- I found some final project groups proposed investigating the influence of certain factors on predictive outcomes but did not include a detailed methodological plan.
- More specifically, descriptive analysis and visualizations are good for finding insights, but they are not enough to draw reliable quantitative conclusions.
- For these groups, feature importance analysis might be a helpful approach to highlight the impact of different factors. I’ve attached the following code as a reference!

In [3]:
# Fit pipeline on the entire dataset to select features
pipeline.fit(X_engineered, y)

# Get selected features
selected_features_indices = pipeline.named_steps['feature_selection'].get_support(indices=True)
selected_features = X_engineered.columns[selected_features_indices]

# Scale the data using the fitted scaler
X_selected_scaled = pipeline.named_steps['scaling'].transform(X_engineered.iloc[:, selected_features_indices])

# Train model on the entire dataset with selected features for global feature importance
model = RandomForestClassifier(random_state=42)
model.fit(X_selected_scaled, y)
feature_importances = model.feature_importances_

# Display selected features and their importance
feature_importance_df = pd.DataFrame({
    'Feature': selected_features,
    'Importance': feature_importances,
    'Original or Engineered': ['Engineered' if f in ['family_size', 'is_alone', 'fare_bin',
                                                     'age_fare_ratio', 'sibsp_parch_ratio',
                                                     'age_class_interaction', 'fare_per_family_member']
                               else 'Original' for f in selected_features]
}).sort_values(by='Importance', ascending=False)

# Display results
print(feature_importance_df)



                  Feature  Importance Original or Engineered
4   age_class_interaction    0.280336             Engineered
1                     sex    0.259849               Original
5  fare_per_family_member    0.199735             Engineered
2                    fare    0.176670               Original
0                  pclass    0.054854               Original
3                fare_bin    0.028557             Engineered


#### No question for feature importance :)