### **$$Response \ to \ Marketing \ Campaign \ by \ SparkCognition$$**

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedKFold
import numpy as np
from pathlib import Path
from sklearn.metrics import accuracy_score,precision_score,recall_score,f1_score,confusion_matrix
from imblearn.over_sampling import SMOTE
smote = SMOTE(random_state=42)

In [2]:
train_df = pd.read_csv('/Response_to_marketing_campaign/datasets/marketing_training.csv')
test_df = pd.read_csv('/Response_to_marketing_campaign/datasets/marketing_test.csv').drop('Unnamed: 0',axis = 1)

FileNotFoundError: [Errno 2] No such file or directory: '/Response_to_marketing_campaign/datasets/marketing_training.csv'

In [None]:
display(train_df.head(5))
train_df.name = 'Training_Data'
display(test_df.head(5))
test_df.name = 'Testing_Data'

In [None]:
train_df.info()

In [None]:
test_df.info()

In [None]:
# percentage of missing values per column (avoids hard-coding the row count)
train_df.isnull().mean()*100

### **~ Observation:**

- The columns `'custAge'`, `'schooling'` and `'day_of_week'` contain null values: approximately 24% for `'custAge'`, 29% for `'schooling'` and 9.5% for `'day_of_week'`.

### **@ Approach:**

- Because these columns have a large share of null values, dropping the affected rows is not feasible.
- Instead, we can fill the numeric column `'custAge'` with the **mean** or **median**, and the categorical columns `'schooling'` and `'day_of_week'` with the **mode**.

In [None]:
# custAge is a numeric column, so we impute it with the mean or median
print('******************************* Treating the null values in customer age column ***************************************')
custAge_mean = int(train_df['custAge'].mean())      # mean() and median() skip NaN by default
custAge_median = int(train_df['custAge'].median())

print('********************************************* Mean Imputation *********************************************************')

mean_impute = train_df.custAge.fillna(custAge_mean)
plt.hist(mean_impute,bins = 20,color = 'blue',edgecolor = 'black')
plt.title('Distribution of Customer Age After Mean Imputation')
plt.xlabel('Customer Age')
plt.ylabel('Frequency')
plt.grid(alpha = 0.75)
plt.show()

print('********************************************* Median Imputation *******************************************************')

median_impute = train_df.custAge.fillna(custAge_median)
plt.hist(median_impute,bins = 20,color = 'blue',edgecolor = 'black')
plt.title('Distribution of Customer Age After Median Imputation')
plt.xlabel('Customer Age')
plt.ylabel('Frequency')
plt.grid(alpha = 0.75)
plt.show()


### **~ Observation:**

- The histogram of `custAge` shows a similar **left-skewed** distribution when missing values are filled with either the **mean** or the **median**.

### **@ Approach:**

- Since both give similar results and **median** is generally more robust to skewed data, we will proceed with **median imputation**.
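
As a quick check on this choice, the sketch below uses a small made-up series (the values are hypothetical, not taken from the campaign data) to show how `Series.skew()` reports the direction of skew, and how the median moves less than the mean when a single extreme age is removed.

```python
import pandas as pd

# Toy ages (hypothetical values); None marks a missing entry.
ages = pd.Series([22, 25, 27, 30, 31, 33, 35, 38, 41, 78, None, None], dtype="float64")

skewness = ages.skew()       # > 0: longer right tail, < 0: longer left tail
age_mean = ages.mean()       # NaN values are skipped by default
age_median = ages.median()

# Compare how far each candidate fill value moves when the extreme age (78) is removed.
trimmed = ages[ages < 78]
mean_shift = abs(age_mean - trimmed.mean())
median_shift = abs(age_median - trimmed.median())

print(f"skew={skewness:.2f}, mean={age_mean:.1f}, median={age_median:.1f}")
print(f"mean shift={mean_shift:.2f}, median shift={median_shift:.2f}")
```

Running the same three calls on `train_df['custAge']` would give the sign of the skew for the real column.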


In [None]:
train_df.custAge = train_df.custAge.fillna(custAge_median) 
test_df.custAge = test_df.custAge.fillna(custAge_median)

In [None]:
# imputation of the schooling and day_of_week columns

print('******************************* Treating the null values in schooling and day_of_week column ***************************************')

schooling_mode = train_df[train_df['schooling'].isna()==False]['schooling'].mode()[0]
day_of_week_mode = train_df[train_df['day_of_week'].isna()==False]['day_of_week'].mode()[0]
print(f"Mode of schooling column is : {schooling_mode}")
print(f"Mode of day_of_week column is : {day_of_week_mode}")
train_df.schooling = train_df.schooling.fillna(schooling_mode) # imputation with mode as categorical column
train_df.day_of_week = train_df.day_of_week.fillna(day_of_week_mode) # imputation with mode as categorical column
test_df.schooling = test_df.schooling.fillna(schooling_mode) # imputation with mode as categorical column
test_df.day_of_week = test_df.day_of_week.fillna(day_of_week_mode) # imputation with mode as categorical column

In [None]:
train_df.info()

In [None]:
lst = [train_df,test_df]
for i in lst:
    print(f"************************************************** {i.name} ****************************************************************")
    for j in i.columns:
        print(j)
        print(i[j].unique())
        print(len(i[j]))

<div align="left">

### **~ Observation:**

- After printing the unique values in each column, it was observed that some columns contain `'unknown'` as a value.  
- While this is not technically null, it still represents missing or uninformative data.  
- We can either **impute** these values using appropriate strategies (such as **mode imputation**),  
  or treat `'unknown'` as a **separate category** and **encode** it accordingly.

### **@ Approach:**

- I plan to check the count of `'unknown'` values in each column first.
- If the `'count'` is relatively **small**, I’ll consider imputing them (e.g., with the mode for categorical features).
- But if the count is high, I might treat `'unknown'` as a separate **category** and **encode** it instead.

</div>
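
The counting step in the approach above could look roughly like the sketch below. The toy frame, the `MAX_UNKNOWN_SHARE` cut-off of 20% and the `plan` dictionary are assumptions made for illustration only; the notebook itself decides by reading the `value_counts()` output.

```python
import pandas as pd

# Toy frame (hypothetical values); 'unknown' is a placeholder string, not a real NaN.
df = pd.DataFrame({
    "marital": ["married", "single", "unknown", "married", "married",
                "single", "married", "divorced", "married", "single"],
    "default": ["no", "unknown", "no", "unknown", "no",
                "unknown", "no", "no", "unknown", "no"],
})

MAX_UNKNOWN_SHARE = 0.20  # assumed cut-off, chosen only for this illustration

# Share of 'unknown' entries per column.
unknown_share = (df == "unknown").mean()

# Small share: impute with the mode. Large share: keep 'unknown' as its own category.
plan = {
    col: "impute with mode" if share <= MAX_UNKNOWN_SHARE else "keep as category"
    for col, share in unknown_share.items()
}
print(unknown_share)
print(plan)
```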



In [None]:
for i in train_df.columns:
    if train_df[i].dtype=='object':
        print(f"****************************************************** {i} **********************************************************")
        print(train_df[i].value_counts())

### **~ Observation & Imputation Strategy:**

Most of the **`unknown`** entries form a **small percentage** of the total values.  
- For example, out of 7414 entries:  
  - `profession` has 61 unknowns.  
  - `marital` has 8 unknowns.  
  - `schooling` has 231 unknowns.  
  - `housing` and `loan` have 168 unknowns each.
- The exception is the `default` column, with 1432 unknowns out of 7414 entries (about 19%), which is significantly higher than the rest.

### **@ Approach**:
- For columns where **`unknown`** appears only a few times, the values are imputed with the **mode**.
- For **`default`**, where the number of `unknown` entries is **high**, **`unknown`** is treated as a separate category and encoded to preserve information.

In [None]:
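columns_with_unknowns = ['profession','marital','schooling','housing','loan']
for i in columns_with_unknowns:
    # compute the mode once from the training data (ignoring 'unknown') and reuse it for the test set
    col_mode = train_df.loc[train_df[i] != 'unknown', i].mode()[0]
    train_df[i] = train_df[i].replace('unknown', col_mode)
    test_df[i] = test_df[i].replace('unknown', col_mode)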

In [None]:
train_df['responded'] = train_df['responded'].map({'yes':1,'no':0}).astype(int)
train_numeric = []
train_categorical = []
for i in train_df.drop('responded',axis = 1).columns:
    if train_df[i].dtype=='object':
        train_categorical.append(i)
    else:
        train_numeric.append(i)
print(train_categorical)
print(train_numeric)

In [None]:
test_numeric = []
test_categorical = []
for i in test_df.columns:
    if test_df[i].dtype=='object':
        test_categorical.append(i)
    else:
        test_numeric.append(i)
print(test_categorical)
print(test_numeric)

In [None]:
train_encoded_df = pd.get_dummies(train_df[train_categorical], drop_first = True).astype(int).join(train_df[train_numeric])
train_encoded_df['responded'] = train_df['responded']

In [None]:
test_encoded_df = pd.get_dummies(test_df[test_categorical], drop_first=True).astype(int).join(test_df[test_numeric])
test_encoded_df = test_encoded_df.reindex(columns=train_encoded_df.columns.drop('responded'), fill_value=0)

In [None]:
possible_flags = ['schooling_illiterate','default_yes','month_mar','month_sep','poutcome_nonexistent',
                          'poutcome_success','pdays','previous','pmonths','pastEmail']
for i in possible_flags:
    print(train_encoded_df[i].value_counts())

### **~ Observation:**

- Some columns have only 1 or 2 positive samples, such as **`schooling_illiterate`** and **`default_yes`**.

- The columns **`pdays`** and **`pmonths`** indicate when the person was previously contacted.

- The month columns (**`month_mar`**, **`month_sep`**), **`poutcome_success`**, **`previous`** and **`pastEmail`** are informative and can be kept as they are.

### **@ Approach**:

- Drop the columns **`schooling_illiterate`** and **`default_yes`**, as they add little value to the overall analysis.

- Transform the columns **`pdays`** and **`pmonths`** into binary flags (`0` where the value is `999`, `1` otherwise).

- Outlier detection was considered, but because the dataset is imbalanced and removal risks losing minority-class information, no aggressive outlier removal or capping was applied.
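
A minimal sketch of the binary transformation, on made-up values. It assumes, as the notebook's code does, that `999` is the placeholder for "no previous contact"; the column name `was_contacted` and the constant `NOT_CONTACTED` are hypothetical.

```python
import pandas as pd

# Toy values (hypothetical); 999 is assumed to be the "no previous contact" placeholder.
toy = pd.DataFrame({"pdays": [999, 3, 999, 6, 999]})

NOT_CONTACTED = 999
# Vectorised comparison: 0 for the placeholder, 1 for any real value.
toy["was_contacted"] = (toy["pdays"] != NOT_CONTACTED).astype(int)
print(toy)
```

The vectorised comparison gives the same result as an `apply` with a lambda, without a Python-level loop over the rows.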


In [None]:
train_encoded_df.drop(['schooling_illiterate','default_yes'],axis = 1,inplace = True)
test_encoded_df.drop(['schooling_illiterate','default_yes'],axis = 1,inplace = True)

In [None]:
train_encoded_df['pdays'] = train_encoded_df['pdays'].apply(lambda x: 0 if x==999 else 1).astype(int)
train_encoded_df['pmonths'] = train_encoded_df['pmonths'].apply(lambda x: 0 if x==999 else 1).astype(int)
test_encoded_df['pdays'] = test_encoded_df['pdays'].apply(lambda x: 0 if x==999 else 1).astype(int)
test_encoded_df['pmonths'] = test_encoded_df['pmonths'].apply(lambda x: 0 if x==999 else 1).astype(int)

### **~Observation :**

- The output class is highly imbalanced.

### **@Approach :**

- Apply SMOTE to the training data only, to avoid leakage.
- SMOTE balances the dataset by synthesizing new minority samples.
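
To make "synthesizing new minority samples" concrete, here is a simplified illustration of the interpolation idea behind SMOTE. It is not the `imblearn` implementation: this sketch always interpolates towards the single nearest neighbour, whereas SMOTE picks at random among several nearest neighbours. The function name `synthesize` and the toy points are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy minority-class points in 2-D (hypothetical values).
minority = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.3], [1.4, 1.1]])

def synthesize(points, n_new, rng):
    """Create n_new points, each on the segment between a point and its nearest neighbour."""
    new_points = []
    for _ in range(n_new):
        i = rng.integers(len(points))
        # Distances from point i to every point; mask the point itself.
        dists = np.linalg.norm(points - points[i], axis=1)
        dists[i] = np.inf
        j = int(np.argmin(dists))
        lam = rng.random()  # interpolation weight in [0, 1)
        new_points.append(points[i] + lam * (points[j] - points[i]))
    return np.array(new_points)

synthetic = synthesize(minority, n_new=6, rng=rng)
print(synthetic)
```

Because every synthetic point is built from existing rows, resampling before the split would let information from validation rows shape the training data, which is why it is applied to the training portion only.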

In [None]:
x_train,x_test,y_train,y_test = train_test_split(train_encoded_df.drop('responded',axis=1),train_encoded_df['responded'],test_size = 0.2,random_state = 42,shuffle = True)
X_resampled, y_resampled = smote.fit_resample(x_train, y_train)

### **AdaBoost Classifier :**

In [None]:
from sklearn.ensemble import AdaBoostClassifier
ab = AdaBoostClassifier(n_estimators=50, random_state=42)

kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
thresholds = np.arange(0.1, 0.6, 0.05)
f1_scores = {t: [] for t in thresholds}

for fold, (train_idx, val_idx) in enumerate(kf.split(X_resampled, y_resampled), 1):
    X_train, X_val = X_resampled.iloc[train_idx], X_resampled.iloc[val_idx]
    y_train, y_val = y_resampled.iloc[train_idx], y_resampled.iloc[val_idx]
    ab.fit(X_train, y_train)
    y_proba = ab.predict_proba(X_val)[:, 1]
    for thresh in thresholds:
        y_pred = (y_proba >= thresh).astype(int)
        f1_scores[thresh].append(f1_score(y_val, y_pred))
        accuracy = accuracy_score(y_val,y_pred)
        precision = precision_score(y_val, y_pred)
        recall = recall_score(y_val, y_pred)
avg_f1 = {t: np.mean(scores) for t, scores in f1_scores.items()}
best_thresh = max(avg_f1, key=avg_f1.get)
print(f"Best Threshold is {best_thresh} and Average F1 score is {avg_f1[best_thresh]}")
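
The threshold search used in the cell above can be reduced to the following sketch for a single validation fold. The labels and probabilities are made up, and `f1_at` is a hypothetical helper that computes F1 by hand from the confusion counts; the notebook itself uses `sklearn.metrics.f1_score` and averages over five folds.

```python
import numpy as np

def f1_at(y_true, y_proba, thresh):
    """F1 of the positive class when predicting 1 for probabilities >= thresh."""
    y_pred = (y_proba >= thresh).astype(int)
    tp = int(((y_pred == 1) & (y_true == 1)).sum())
    fp = int(((y_pred == 1) & (y_true == 0)).sum())
    fn = int(((y_pred == 0) & (y_true == 1)).sum())
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Toy validation fold (hypothetical labels and predicted probabilities).
y_val = np.array([0, 0, 0, 0, 1, 1, 0, 1, 0, 1])
val_proba = np.array([0.04, 0.22, 0.32, 0.47, 0.42, 0.62, 0.57, 0.81, 0.12, 0.37])

# Rounding keeps the dictionary keys readable (0.15 rather than 0.15000000000000002).
thresholds = np.round(np.arange(0.1, 0.6, 0.05), 2)
scores = {float(t): f1_at(y_val, val_proba, t) for t in thresholds}
best = max(scores, key=scores.get)
print(f"best threshold = {best:.2f}, F1 = {scores[best]:.3f}")
```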

### **Decision Tree Classifier :**

In [None]:
from sklearn import tree
clf = tree.DecisionTreeClassifier()

kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
thresholds = np.arange(0.1, 0.6, 0.05)
f1_scores = {t: [] for t in thresholds}

for fold, (train_idx, val_idx) in enumerate(kf.split(X_resampled, y_resampled), 1):
    X_train, X_val = X_resampled.iloc[train_idx], X_resampled.iloc[val_idx]
    y_train, y_val = y_resampled.iloc[train_idx], y_resampled.iloc[val_idx]
    clf.fit(X_train, y_train)
    y_proba = clf.predict_proba(X_val)[:, 1]
    for thresh in thresholds:
        y_pred = (y_proba >= thresh).astype(int)
        f1_scores[thresh].append(f1_score(y_val, y_pred))
        accuracy = accuracy_score(y_val,y_pred)
        precision = precision_score(y_val, y_pred)
        recall = recall_score(y_val, y_pred)
avg_f1 = {t: np.mean(scores) for t, scores in f1_scores.items()}
best_thresh = max(avg_f1, key=avg_f1.get)
print(f"Best Threshold is {best_thresh} and Average F1 score is {avg_f1[best_thresh]}")

### **Random Forest Classifier :**

In [None]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=100,criterion='gini', max_depth=None)

kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
thresholds = np.arange(0.1, 0.6, 0.05)
f1_scores = {t: [] for t in thresholds}

for fold, (train_idx, val_idx) in enumerate(kf.split(X_resampled, y_resampled), 1):
    X_train, X_val = X_resampled.iloc[train_idx], X_resampled.iloc[val_idx]
    y_train, y_val = y_resampled.iloc[train_idx], y_resampled.iloc[val_idx]
    rf.fit(X_train, y_train)
    y_proba = rf.predict_proba(X_val)[:, 1]
    for thresh in thresholds:
        y_pred = (y_proba >= thresh).astype(int)
        f1_scores[thresh].append(f1_score(y_val, y_pred))
        accuracy = accuracy_score(y_val,y_pred)
        precision = precision_score(y_val, y_pred)
        recall = recall_score(y_val, y_pred)
avg_f1 = {t: np.mean(scores) for t, scores in f1_scores.items()}
best_thresh = max(avg_f1, key=avg_f1.get)
print(f"Best Threshold is {best_thresh} and Average F1 score is {avg_f1[best_thresh]}")

In [None]:
imp_features = pd.Series(rf.feature_importances_, index=X_resampled.columns).sort_values(ascending=False)
plt.figure(figsize=(10,6))
imp_features.head(15).plot(kind='bar')
plt.title('Top 15 Feature Importances by Random Forest')
plt.ylabel('Importance Score')
plt.xlabel('Features')
plt.tight_layout()
plt.show()

In [None]:
opt_model = RandomForestClassifier(n_estimators=200, max_depth=None,random_state=42)
opt_model.fit(X_resampled, y_resampled)
test_proba = opt_model.predict_proba(test_encoded_df)[:,1]
test_pred = (test_proba >= best_thresh).astype(int)
print(test_pred)

<div align="left">

### **~ Summary:**

- Outlier removal was skipped to avoid losing minority-class data.
- `unknown` values were either imputed or encoded as a separate category.
- Near-constant columns such as `schooling_illiterate` and `default_yes` were dropped.
- Columns like `pdays` and `pmonths` were converted to binary to extract the underlying meaning from the data.
- SMOTE was applied to the training data only, to avoid data leakage.
- A Random Forest model was used with a tuned decision threshold to handle class imbalance.
- Thresholds were chosen using cross-validation so that they generalize better.
- Lastly, a feature importance plot was added to make the model easier to interpret.

### **Questions & Answers**

**1. Describe your model and why did you choose this model over other types of models?**

The final model is SMOTE combined with a Random Forest classifier. SMOTE balances the classes by adding synthetic minority samples, and Random Forest learns effectively from this more balanced data using its ensemble approach.


**2. Describe any other models you have tried and why do you think this model performs better?**

- **Models tried**:

    - **Logistic Regression**: Combined with SMOTE, LR achieved high accuracy but failed to address the minority class. The best threshold was 0.40 with an average F1 score of 0.83.
    - **AdaBoost**: With SMOTE, it performed slightly better than LR but is sensitive to the noise introduced by SMOTE. The best threshold was 0.50 with an average F1 score of 0.843.
    - **Decision Tree**: With SMOTE, it performed better than both LR and AdaBoost, with a best threshold of 0.50 and an average F1 score of 0.8915. However, it is prone to overfitting.
    - **Random Forest**: Combined with SMOTE and threshold tuning, it performed better than all the previous models. It combines many decision trees, which reduces the chance of overfitting and captures the complex patterns in the SMOTE-balanced data. The best average F1 score was 0.936 at threshold 0.55.

**3. How did you handle missing data?**

- Instead of dropping rows with null values, the numeric column **`'custAge'`** was filled with the **median**, as it is more robust to skewed data, and the categorical columns **`'schooling'`** and **`'day_of_week'`** were filled with the **mode**.
- For columns where **`unknown`** appeared only a few times, the values were imputed with the **mode**.
- For columns where the number of **`unknown`** entries was **high**, `unknown` was treated as a separate category and encoded to preserve information.
- The columns **`schooling_illiterate`** and **`default_yes`** were dropped, as they add little value to the overall analysis.
- The columns **`pdays`** and **`pmonths`** were transformed into binary flags to preserve the underlying information.

**4. How did you handle categorical (string) data?**

- Used **one-hot encoding** (`pd.get_dummies`) on categorical features.
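
One detail worth illustrating is keeping the test columns aligned with the training columns. The sketch below is a variation rather than the exact notebook code: it encodes the test frame without `drop_first` and lets `reindex` select the training columns, which stays correct even if a category is absent from the test set. The toy frames are hypothetical.

```python
import pandas as pd

# Toy frames (hypothetical values); the test set contains only one of the three levels.
train = pd.DataFrame({"marital": ["married", "single", "divorced"]})
test = pd.DataFrame({"marital": ["single", "single"]})

# Training: drop the first level (alphabetically 'divorced') as the reference category.
train_enc = pd.get_dummies(train, drop_first=True).astype(int)

# Test: encode every level present, then keep exactly the training columns.
# Columns missing from the test set are created and filled with 0.
test_enc = pd.get_dummies(test).astype(int)
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(train_enc.columns.tolist())
print(test_enc)
```

If `drop_first=True` were applied to this toy test frame on its own, the only level present (`single`) would be dropped and the reindexed column would be all zeros, so checking that both sets contain the same levels is a useful safeguard.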

**5. How did you handle unbalanced data?**

- Used **SMOTE** to oversample the minority class in the training set only, to avoid data leakage.

**6. How did you test your model?**

- Split data into **train/test sets**.
- Applied **cross-validation** on training set for robust model selection and threshold tuning.
- Final evaluation performed using metrics **precision, recall, F1_score** (not just accuracy).
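
The reason for not relying on accuracy alone can be seen in a tiny example. The labels below are made up (one responder out of ten customers), and the "model" simply predicts `no` for everyone.

```python
import numpy as np

# Toy labels (hypothetical): 1 responder out of 10 customers.
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 1])
y_pred = np.zeros_like(y_true)  # a model that always predicts "no"

accuracy = (y_pred == y_true).mean()

# Recall: share of actual responders that the model finds.
tp = int(((y_pred == 1) & (y_true == 1)).sum())
recall = tp / int((y_true == 1).sum())

print(f"accuracy = {accuracy:.2f}, recall = {recall:.2f}")
```

Accuracy looks high here even though no responder is identified, which is what precision, recall and F1 are meant to expose.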
