# Handling Imbalanced Data with Oversampling

### Objective

Address potential class imbalance in the Titanic survival prediction (more non-survivors than survivors) by implementing the Synthetic Minority Oversampling Technique (SMOTE). Compare the performance of a tuned Random Forest model with and without oversampling.

`Steps:`

`Analyze Class Imbalance:`
Check the distribution of `Survived` (0 = non-survivor, 1 = survivor) in the dataset.

`Apply SMOTE:`
Use `SMOTE` from `imblearn.over_sampling` to oversample the minority class (survivors) in the training data only, so the test set keeps its original class distribution.

`Train and Tune a Random Forest Model:`
Reuse the `GridSearchCV` approach from Day 5 to tune the model on the oversampled training data.

`Evaluate and Compare:`
Compare the model’s accuracy, classification report, and feature importances with and without SMOTE.
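
To make the oversampling step concrete, here is a minimal NumPy sketch of the interpolation idea behind SMOTE: pick a minority sample, pick one of its nearest minority neighbours, and create a new point somewhere on the line between them. The function name `smote_like_samples` and the toy array are assumptions for illustration only; this is not the `imblearn` implementation, which handles more cases.

```python
import numpy as np

def smote_like_samples(X_minority, n_new, k=5, seed=42):
    """Generate n_new synthetic points by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    X_minority = np.asarray(X_minority, dtype=float)
    n = len(X_minority)
    k = min(k, n - 1)  # cannot have more neighbours than other points

    # Pairwise Euclidean distances between minority samples
    diffs = X_minority[:, None, :] - X_minority[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    # For each new point: random base sample, random neighbour, random step in [0, 1)
    base_idx = rng.integers(0, n, size=n_new)
    nn_idx = neighbours[base_idx, rng.integers(0, k, size=n_new)]
    lam = rng.random((n_new, 1))
    return X_minority[base_idx] + lam * (X_minority[nn_idx] - X_minority[base_idx])

# Hypothetical toy minority class with two features
X_min = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 3.0], [3.0, 2.5]])
synthetic = smote_like_samples(X_min, n_new=6, k=2)
print(synthetic.shape)
```

Because every synthetic point is a convex combination of two real minority points, it always falls inside the range already covered by the minority class.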

In [16]:
# Import libraries 
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from imblearn.over_sampling import SMOTE
from sklearn.metrics import accuracy_score, classification_report

In [2]:
def wrangle(filepath):
    df = pd.read_csv(filepath)

    return df
    

In [3]:
df = wrangle(r"C:\Users\User\Desktop\100DayOfCode\Titanic_clean.csv")

In [4]:
df.head()

Unnamed: 0.1,Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked
0,0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,S
1,1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C
2,2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,S
3,3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,S
4,4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,S


In [5]:
# Data Preprocessing
def preprocessing(df):
    df_processed = df.copy()
    # Drop the leftover index column from the CSV export
    df_processed.drop(columns=["Unnamed: 0"], inplace=True)
    # Feature engineering: family size and the title extracted from the name
    df_processed["Family_size"] = df_processed["SibSp"] + df_processed["Parch"] + 1
    df_processed["Title"] = df_processed["Name"].str.extract(r" ([A-Za-z]+)\.", expand=False)
    df_processed.drop(columns=["Name", "SibSp", "Parch"], inplace=True)

    # One-hot encode the categorical columns with pandas
    cat_cols = ["Embarked", "Sex", "Title"]
    df_processed = pd.get_dummies(df_processed, columns=cat_cols, drop_first=True)

    # Convert the boolean dummy columns to int
    for col in df_processed.columns:
        if df_processed[col].dtype == "bool":
            df_processed[col] = df_processed[col].astype(int)
    # Drop identifier-like columns that are not used as features
    df_processed.drop(columns=["PassengerId", "Ticket"], inplace=True)

    # Standardize the numerical columns
    # Note: the scaler is fitted on the full dataset, before the train-test split
    num_cols = ["Pclass", "Age", "Fare", "Family_size"]
    scaler = StandardScaler()
    df_processed[num_cols] = scaler.fit_transform(df_processed[num_cols])

    return df_processed


In [6]:
df_clean = preprocessing(df)
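
As a quick check of the two engineered features, the sketch below applies the same `Family_size` sum and `Title` regular expression to three names taken from the `df.head()` output above. The small `sample` frame is an assumption built for illustration.

```python
import pandas as pd

# Three passengers copied from the head of the dataset
sample = pd.DataFrame({
    "Name": [
        "Braund, Mr. Owen Harris",
        "Heikkinen, Miss. Laina",
        "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
    ],
    "SibSp": [1, 0, 1],
    "Parch": [0, 0, 0],
})

# Family size counts siblings/spouses, parents/children and the passenger
sample["Family_size"] = sample["SibSp"] + sample["Parch"] + 1
# The title is the first run of letters that follows a space and ends with a period
sample["Title"] = sample["Name"].str.extract(r" ([A-Za-z]+)\.", expand=False)
print(sample[["Title", "Family_size"]])
```

The extracted titles are `Mr`, `Miss` and `Mrs`, with family sizes 2, 1 and 2.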

In [7]:
df_clean.head()

Unnamed: 0,Survived,Pclass,Age,Fare,Family_size,Embarked_Q,Embarked_S,Sex_male,Title_Col,Title_Countess,...,Title_Major,Title_Master,Title_Miss,Title_Mlle,Title_Mme,Title_Mr,Title_Mrs,Title_Ms,Title_Rev,Title_Sir
0,0,0.827377,-0.565736,-0.502445,0.05916,0,1,1,0,0,...,0,0,0,0,0,1,0,0,0,0
1,1,-1.566107,0.663861,0.786845,0.05916,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
2,1,0.827377,-0.258337,-0.488854,-0.560975,0,1,0,0,0,...,0,0,1,0,0,0,0,0,0,0
3,1,-1.566107,0.433312,0.42073,0.05916,0,1,0,0,0,...,0,0,0,0,0,0,1,0,0,0
4,0,0.827377,0.433312,-0.486337,-0.560975,0,1,1,0,0,...,0,0,0,0,0,1,0,0,0,0
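
The standardized columns above come from a scaler fitted on every row, so the test rows influence the scaling statistics. One common alternative, sketched here on a made-up `demo` frame (the data and variable names are assumptions, not the Titanic file), is to split first and fit `StandardScaler` on the training rows only.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric data standing in for Age and Fare
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    "Age": rng.normal(30, 12, size=200),
    "Fare": rng.exponential(30, size=200),
    "Survived": rng.integers(0, 2, size=200),
})
num_cols = ["Age", "Fare"]

X_tr, X_te, y_tr, y_te = train_test_split(
    demo[num_cols], demo["Survived"], test_size=0.2, random_state=42
)

# Fit on the training rows only, then reuse the same statistics for the test rows
scaler = StandardScaler()
X_tr_scaled = pd.DataFrame(scaler.fit_transform(X_tr), columns=num_cols, index=X_tr.index)
X_te_scaled = pd.DataFrame(scaler.transform(X_te), columns=num_cols, index=X_te.index)
print(X_tr_scaled.mean().round(6))
```

With this order the training columns have mean 0 and unit variance, while the test columns are only approximately standardized, which is the expected behaviour.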


In [14]:
# Separate features and target
features = [col for col in df_clean.columns if col != "Survived"]
X = df_clean[features]
y = df_clean["Survived"]

# Train-test split
# Note: train_size=0.2 keeps only 20% of the rows (178) for training and leaves 713 for testing;
# the outputs below reflect this split. test_size=0.2 is the more common choice.
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.2, random_state=42)

# Analyse class imbalance (counts for the full dataset)
print("Class distribution before application of SMOTE:\n", y.value_counts())
# Apply SMOTE to the training data only
smote = SMOTE(random_state=42)
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)
print("Class distribution after application of SMOTE:\n", y_train_smote.value_counts())

# Train and tune a Random Forest
rf_model = RandomForestClassifier(random_state=42)
params = {
    "n_estimators": [50, 100, 200],
    "max_depth": [None, 10, 100]
}
grid_search_smote = GridSearchCV(rf_model, params, cv=5, scoring="accuracy")
grid_search_smote.fit(X_train_smote, y_train_smote)


Class distribution before application of SMOTE:
 Survived
0    549
1    342
Name: count, dtype: int64
Class distribution after application of SMOTE:
 Survived
0    113
1    113
Name: count, dtype: int64


In [20]:
# Evaluate model performance
best_model_smote = grid_search_smote.best_estimator_
y_pred_smote = best_model_smote.predict(X_test)
accuracy = accuracy_score(y_test, y_pred_smote)
print(f"Best parameters with SMOTE: {grid_search_smote.best_params_}")
print(f"Random Forest accuracy with SMOTE: {accuracy}")
print("Random Forest classification report with SMOTE:\n", classification_report(y_test, y_pred_smote))

Best parameters with SMOTE: {'max_depth': None, 'n_estimators': 200}
Random Forest accuracy with SMOTE: 0.7727910238429172
Random Forest classification report with SMOTE:
               precision    recall  f1-score   support

           0       0.83      0.79      0.81       436
           1       0.69      0.74      0.72       277

    accuracy                           0.77       713
   macro avg       0.76      0.77      0.76       713
weighted avg       0.78      0.77      0.77       713
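
The objective also calls for a comparison without SMOTE and a look at feature importance. The sketch below shows one way to collect those numbers for any fitted Random Forest. The helper `summarize`, the synthetic data and the feature names `f0` to `f5` are assumptions for illustration; no results for the Titanic data are implied.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split

def summarize(model, X_test, y_test, top_n=3):
    """Return accuracy, minority-class recall and the top feature importances."""
    y_pred = model.predict(X_test)
    importances = pd.Series(model.feature_importances_, index=X_test.columns)
    return {
        "accuracy": accuracy_score(y_test, y_pred),
        "recall_class_1": recall_score(y_test, y_pred, pos_label=1),
        "top_features": importances.sort_values(ascending=False).head(top_n),
    }

# Hypothetical imbalanced dataset with named feature columns
X_arr, y_arr = make_classification(
    n_samples=400, n_features=6, weights=[0.7, 0.3], random_state=42
)
X_demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(6)])
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_arr, test_size=0.2, random_state=42)

# Baseline: Random Forest trained on the original, non-oversampled training data
baseline = RandomForestClassifier(random_state=42).fit(X_tr, y_tr)
result = summarize(baseline, X_te, y_te)
print(result["accuracy"], result["recall_class_1"])
print(result["top_features"])
```

Calling the same helper on `best_model_smote` with `X_test` and `y_test` would give the SMOTE side of the comparison; recall for class 1 is usually the metric most affected by oversampling.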

