________________
<h1 align="center"><span style='font-family:Georgia'> SPEED DATING MATCH PREDICTION </span></h1>

_________________



## **Problem Definition:**

Predict the outcome of a speed dating session between two individuals based on their profiles. The goal is to develop a recommendation system that can better match individuals in speed dating events. The dataset used for this task is relatively clean and needs little preparation beyond handling missing values. The task is a binary classification problem where the model needs to predict whether the two individuals will have a successful match or not. By accurately predicting the outcome of speed dating sessions, the recommendation system can be improved to increase the chances of successful matches.

## **Input:**

The input is a dataset from a Columbia University study that explored gender differences in dating preferences.

#### **Dataset Description**:
Participants attended a dating event where they had a 4-minute date with every other participant of the opposite sex who attended the same event.
The participants decided to accept or reject their partner. If both the participant and partner matched, they received each other's contact information.
Participants rated their partners on six personal attributes: attractiveness, sincerity, intelligence, fun, ambition and shared interests.
Before and after the event, participants rated their preferences in the six attributes and gave themselves ratings.
Other information was collected about the participants' background and preferences.

## **Output:**
The main output is a numeric score representing predicted match potential. The scores range from 0 (no match) to 1 (perfect match). Higher scores mean a more promising, compatible match.

I will predict on the test dataset using `predict_proba` to get the "match probability" between two individuals, and evaluate the model using ROC AUC.
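
As a rough illustration of this output and metric, the sketch below trains a classifier on synthetic data and scores it with `roc_curve` and `auc`. The synthetic dataset, the roughly 17% positive rate, and the choice of logistic regression are assumptions made only for this example; they are not the notebook's data or final model.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the speed dating data (assumption)
X, y = make_classification(n_samples=500, n_features=10,
                           weights=[0.83, 0.17], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Column 1 of predict_proba is the probability of the positive class (match)
match_proba = model.predict_proba(X_test)[:, 1]

# ROC curve from the probabilities, then the area under it
fpr, tpr, _ = roc_curve(y_test, match_proba)
roc_auc = auc(fpr, tpr)
print(f"ROC AUC: {roc_auc:.3f}")
```

Scoring on probabilities rather than hard 0/1 predictions matters here, because ROC AUC measures how well the model ranks matches above non-matches across all thresholds.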

## **Required Data Mining Function:**

#### **Binary Classifier:**

  * LogisticRegression - LR
  * Support Vector Classification - SVC
  * RandomForestClassifier - RF
  * Multi-Layer Perceptron - MLP

#### **Model Fitting / Pipelining:**
* Pipeline
* GridSearchCV
* RandomizedSearchCV
* make_pipeline

#### **Evaluations**
* roc_curve
* auc

## **Challenges**
Some key challenges I see in this project are:

1. **Handling imbalanced classes**: Successful matches are less frequent than non-matches, which can bias models towards the majority class. *Oversampling*, undersampling, and class weights are techniques that help address imbalanced classes.

2. **Feature engineering**: Developing good features that capture compatibility and chemistry between individuals will be key.

3. **Model selection**: Choosing a model that can accurately learn the nuances of romantic compatibility will be important. Options could include logistic regression, random forests, neural networks, and support vector classifier.

4. **Generalization**: The recommendations should ideally be personalized for each new query (dating pair), so the model needs to generalize well beyond the training data distribution. Ensemble methods (random forest and XGBoost) or neural networks could help here.
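
To make the class-weight idea from the first challenge concrete, here is a small sketch of the "balanced" heuristic, `n_samples / (n_classes * class_count)`, which is what scikit-learn applies for `class_weight='balanced'`. The counts are an approximation derived from the training set summary (5909 rows, mean of `match` about 0.167); treat them as illustrative.

```python
import numpy as np

n_total = 5909                          # rows in the training set
n_match = round(0.167203 * n_total)     # approximate number of matches
counts = np.array([n_total - n_match, n_match])  # [no match, match]

# "Balanced" weights: rarer classes receive proportionally larger weights
weights = n_total / (2 * counts)
print(dict(zip([0, 1], weights.round(2))))
```

With these counts a match is weighted about five times as heavily as a non-match, which offsets the roughly 5:1 imbalance.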

## **Impact**
The impact of this project can be significant in the field of dating and matchmaking. By accurately predicting the outcome of speed dating sessions, the recommendation system can be improved to increase the chances of successful matches. This can help people to find compatible partners more easily and increase their likelihood of forming lasting relationships.

The project can also have wider applications in other fields such as e-commerce, healthcare, and finance, where predictive modeling can be used to make better decisions and improve customer satisfaction.

## **Ideal Solution:**

- Bayesian search is a good choice because it explores a wide range of hyperparameters efficiently, without requiring too much computational power or time.
- XGBoost was the best model for this dataset: it achieved the highest AUC and generalized well on the test set. XGBoost also has several advantages over other machine learning algorithms, such as built-in regularization, native handling of missing values, speed, and flexibility.


## **Methodology:**

 **✍🏼 Load the dataset.**

 **✍🏼 Data exploration and Preprocessing**: cleaning and transforming the data into a suitable format for machine learning algorithms.

 **✍🏼 Feature Selection**: Extract features from the profiles that could be indicative of match likelihood (common interests, personality traits, etc.). At the end of this step I created 3 versions of the original dataset:

  1. A dataset after dropping null-heavy features.
  2. A dataset after dropping null-heavy features + highly correlated features.
  3. A dataset after dropping null-heavy features + oversampling.


 **✍🏼 Build Pipeline.**

 **✍🏼 Apply complete Pipeline**: from data preprocessing to model training as a single pipeline:
  
  - **Grid search trial**.
    * LogisticRegression - LR.
    * Support Vector Classification - SVC.
    * RandomForestClassifier - RF.
    * Multi-Layer Perceptron - MLP.

  - **Random search trial.**
    * LogisticRegression - LR.
    * Support Vector Classification - SVC.
    * RandomForestClassifier - RF.
    * Multi-Layer Perceptron - MLP.
    * XGBoost - XGB.

  - **Bayesian search trial.**
    * LogisticRegression - LR
    * Support Vector Classification - SVC
    * XGBoost - XGB.

    
 **✍🏼 Evaluate the model's performance**

 **✍🏼 Predict test dataset:** Use the trained model to predict match likelihood for new pairs of profiles.


---


## **Notebook Description**:  

This notebook aims to predict the outcome of a specific speed dating session based on the profile of two people.

The notebook also includes visualizations and explanations of the data and models used, as well as discussions of the results and their implications for the recommendation system. Overall, the notebook provides a comprehensive and systematic approach to solving the problem of predicting the outcome of a speed dating session based on the profile of two people.




_________________
<h1 align="center"><span style='font-family:Georgia'>  SETUP</span></h1>

_________________


In [None]:
import re
import numpy as np
import pandas as pd
from scipy.stats import uniform, randint

pd.set_option('display.max_columns', 100)
import warnings
warnings.filterwarnings(action='ignore')

######################## for creating graphs ################
import plotly.express as px
import seaborn as sns
import matplotlib as mt
import matplotlib.pyplot as plt
import plotly
import plotly.graph_objects as go

from matplotlib import style

######################## for Pipeline ########################
from sklearn.compose import ColumnTransformer
from sklearn.datasets import fetch_openml
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer, MissingIndicator


######################## for modeling ########################
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score, StratifiedKFold, RandomizedSearchCV
from sklearn.preprocessing import StandardScaler, PowerTransformer, MinMaxScaler, OneHotEncoder, RobustScaler

from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier

from sklearn.feature_selection import RFE
from sklearn.metrics import accuracy_score, classification_report, precision_score, recall_score
from sklearn.metrics import confusion_matrix, precision_recall_curve, roc_curve, auc, log_loss
from sklearn.metrics import make_scorer, balanced_accuracy_score


In [None]:

# from bokeh.io import output_notebook
# output_notebook()
mystyle = plt.style.library['fivethirtyeight']
# sns.set(rc={"figure.figsize":(9,6)})


# using the style for the plot
# %pylab inline
# plt.rc('figure', figsize=(12,9))
# colors = sns.color_palette()
# plt.style.use('fivethirtyeight')



_________________
# 1. Load Dataset

________________


In [None]:

train_data = pd.read_csv('/kaggle/input/cisc-873-dm-w23-a2/train.csv')
test_data = pd.read_csv('/kaggle/input/cisc-873-dm-w23-a2/test.csv')
train_data.head()

Unnamed: 0,gender,idg,condtn,wave,round,position,positin1,order,partner,pid,match,int_corr,samerace,age_o,race_o,pf_o_att,pf_o_sin,pf_o_int,pf_o_fun,pf_o_amb,pf_o_sha,attr_o,sinc_o,intel_o,fun_o,amb_o,shar_o,like_o,prob_o,met_o,age,field,field_cd,undergra,mn_sat,tuition,race,imprace,imprelig,from,zipcode,income,goal,date,go_out,career,career_c,sports,tvsports,exercise,...,attr3_2,sinc3_2,intel3_2,fun3_2,amb3_2,attr5_2,sinc5_2,intel5_2,fun5_2,amb5_2,you_call,them_cal,date_3,numdat_3,num_in_3,attr1_3,sinc1_3,intel1_3,fun1_3,amb1_3,shar1_3,attr7_3,sinc7_3,intel7_3,fun7_3,amb7_3,shar7_3,attr4_3,sinc4_3,intel4_3,fun4_3,amb4_3,shar4_3,attr2_3,sinc2_3,intel2_3,fun2_3,amb2_3,shar2_3,attr3_3,sinc3_3,intel3_3,fun3_3,amb3_3,attr5_3,sinc5_3,intel5_3,fun5_3,amb5_3,id
0,0,3,2,14,18,2,2.0,14,12,372.0,0,-0.03,0,27.0,2.0,30.0,15.0,15.0,20.0,5.0,15.0,7.0,7.0,7.0,6.0,5.0,,7.0,1.0,2.0,33.0,Ed.D. in higher education policy at TC,9.0,University of Michigan-Ann Arbor,1290.0,21645.0,3.0,2.0,1.0,"Palo Alto, CA",,,1.0,6.0,3.0,University President,2.0,3.0,4.0,4.0,...,10.0,10.0,10.0,9.0,10.0,10.0,9.0,10.0,9.0,10.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,2583
1,1,14,1,3,10,2,,8,8,63.0,0,0.21,0,24.0,4.0,5.0,15.0,45.0,25.0,0.0,10.0,3.0,8.0,5.0,3.0,7.0,1.0,1.0,3.0,2.0,22.0,Engineering,5.0,,,,2.0,8.0,1.0,"Boston, MA",2021.0,,5.0,6.0,1.0,Engineer or iBanker or consultant,7.0,8.0,3.0,7.0,...,6.0,7.0,7.0,7.0,8.0,,,,,,0.0,0.0,0.0,,,20.0,20.0,15.0,20.0,10.0,15.0,,,,,,,,,,,,,,,,,,,6.0,8.0,8.0,7.0,8.0,,,,,,6830
2,1,14,1,13,10,8,8.0,10,10,331.0,0,0.43,0,34.0,2.0,15.0,15.0,10.0,25.0,10.0,25.0,4.0,8.0,7.0,4.0,7.0,3.0,3.0,2.0,2.0,27.0,Urban Planning,5.0,"Rizvi College of Architecture, Bombay University",,,6.0,1.0,1.0,"Bombay, India",,,1.0,4.0,2.0,Real Estate Consulting,7.0,4.0,2.0,7.0,...,7.0,9.0,9.0,8.0,10.0,7.0,9.0,8.0,7.0,9.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,4840
3,1,38,2,9,20,18,13.0,6,7,200.0,0,0.72,1,25.0,2.0,13.21,18.87,18.87,16.98,16.98,15.09,5.0,9.0,7.0,5.0,8.0,,7.0,6.0,2.0,31.0,International Affairs,13.0,,,,2.0,4.0,7.0,"Washington, DC",10471.0,45300.0,2.0,5.0,4.0,public service,9.0,5.0,5.0,9.0,...,8.0,9.0,8.0,8.0,7.0,,,,,,1.0,0.0,0.0,,,16.33,18.37,18.37,18.37,14.29,14.29,,,,,,,9.0,7.0,7.0,8.0,7.0,7.0,8.0,8.0,8.0,8.0,8.0,,8.0,9.0,8.0,8.0,6.0,,,,,,5508
4,1,24,2,14,20,6,6.0,20,17,357.0,0,0.33,0,27.0,4.0,15.0,20.0,20.0,20.0,20.0,5.0,4.0,5.0,7.0,5.0,5.0,6.0,4.0,3.0,2.0,27.0,Business,8.0,Harvard College,1400.0,26019.0,2.0,9.0,7.0,Midwest USA,66208.0,46138.0,2.0,5.0,2.0,undecided,10.0,9.0,10.0,8.0,...,7.0,8.0,9.0,9.0,8.0,7.0,7.0,8.0,8.0,8.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,4828


### **1.1 Dataset Overview**

**Information about the attributes**

In [None]:
train_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5909 entries, 0 to 5908
Columns: 192 entries, gender to id
dtypes: float64(173), int64(11), object(8)
memory usage: 8.7+ MB


In [None]:
train_data.describe()

Unnamed: 0,gender,idg,condtn,wave,round,position,positin1,order,partner,pid,match,int_corr,samerace,age_o,race_o,pf_o_att,pf_o_sin,pf_o_int,pf_o_fun,pf_o_amb,pf_o_sha,attr_o,sinc_o,intel_o,fun_o,amb_o,shar_o,like_o,prob_o,met_o,age,field_cd,race,imprace,imprelig,goal,date,go_out,career_c,sports,tvsports,exercise,dining,museums,art,hiking,gaming,clubbing,reading,tv,...,attr3_2,sinc3_2,intel3_2,fun3_2,amb3_2,attr5_2,sinc5_2,intel5_2,fun5_2,amb5_2,you_call,them_cal,date_3,numdat_3,num_in_3,attr1_3,sinc1_3,intel1_3,fun1_3,amb1_3,shar1_3,attr7_3,sinc7_3,intel7_3,fun7_3,amb7_3,shar7_3,attr4_3,sinc4_3,intel4_3,fun4_3,amb4_3,shar4_3,attr2_3,sinc2_3,intel2_3,fun2_3,amb2_3,shar2_3,attr3_3,sinc3_3,intel3_3,fun3_3,amb3_3,attr5_3,sinc5_3,intel5_3,fun5_3,amb5_3,id
count,5909.0,5909.0,5909.0,5909.0,5909.0,5909.0,4591.0,5909.0,5909.0,5901.0,5909.0,5800.0,5909.0,5844.0,5861.0,5850.0,5850.0,5850.0,5843.0,5836.0,5826.0,5756.0,5700.0,5689.0,5644.0,5397.0,5122.0,5726.0,5674.0,5634.0,5846.0,5850.0,5864.0,5851.0,5851.0,5851.0,5837.0,5851.0,5809.0,5851.0,5851.0,5851.0,5851.0,5851.0,5851.0,5851.0,5851.0,5851.0,5851.0,5851.0,...,5262.0,5262.0,5262.0,5262.0,5262.0,3088.0,3088.0,3088.0,3088.0,3088.0,2804.0,2804.0,2804.0,1060.0,460.0,2804.0,2804.0,2804.0,2804.0,2804.0,2804.0,1413.0,1413.0,1413.0,1413.0,1413.0,1413.0,2071.0,2071.0,2071.0,2071.0,2071.0,2071.0,2071.0,2071.0,2071.0,2071.0,2071.0,1413.0,2804.0,2804.0,2804.0,2804.0,2804.0,1413.0,1413.0,1413.0,1413.0,1413.0,5909.0
mean,0.505331,17.360298,1.824843,11.347436,16.850228,9.001523,9.254846,8.91166,8.962938,283.733266,0.167203,0.195257,0.396345,26.323922,2.759427,22.509007,17.33434,20.261403,17.427746,10.716157,11.910333,6.190323,7.185,7.372825,6.394578,6.791366,5.505271,6.14394,5.235196,1.964856,26.341088,7.653675,2.756651,3.770979,3.643651,2.12579,5.008223,2.159631,5.300052,6.436336,4.602803,6.272774,7.760041,6.974192,6.708426,5.749615,3.890959,5.732353,7.673902,5.310374,...,7.131889,7.95534,8.241543,7.59179,7.49867,6.839054,7.43329,7.85978,7.284326,7.356865,0.801712,0.962553,0.380528,1.242453,0.917391,24.501665,16.691669,19.348053,16.209943,10.924044,12.596797,31.274593,15.779901,16.754423,16.431706,7.921444,11.951168,25.696765,10.711251,11.561082,14.354418,9.149686,11.133752,24.982134,10.823274,11.96282,15.066634,9.638339,11.779193,7.241797,8.105563,8.377318,7.644437,7.398716,6.799717,7.631989,7.944798,7.162774,7.092711,4191.314943
std,0.500014,10.947542,0.380133,6.011495,4.389246,5.482368,5.611803,5.4571,5.500706,158.993002,0.373188,0.304197,0.489179,3.520844,1.226749,12.778567,7.042785,6.795808,6.056044,6.139879,6.363146,1.946493,1.73173,1.551435,1.954052,1.790194,2.151772,1.838872,2.12642,0.255263,3.597877,3.773379,1.219248,2.844792,2.819893,1.408971,1.444282,1.10947,3.309817,2.628608,2.808744,2.413376,1.772684,2.051354,2.264274,2.578219,2.59883,2.499944,2.007243,2.53348,...,1.385184,1.486971,1.17779,1.540505,1.746286,1.425735,1.569496,1.268374,1.647053,1.5241,1.678785,1.386066,0.485603,1.336878,0.748071,13.729117,7.535481,6.146231,5.094267,5.932833,6.385001,17.119259,9.413485,8.003796,7.207205,6.122237,8.395316,17.530203,5.77983,6.069804,6.987285,6.432015,6.436383,16.962794,6.283531,7.180426,8.03425,6.615898,6.879225,1.593787,1.601011,1.459013,1.757559,1.956924,1.535768,1.498024,1.320919,1.687431,1.713729,2408.009173
min,0.0,1.0,1.0,1.0,5.0,1.0,1.0,1.0,1.0,1.0,0.0,-0.83,0.0,18.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,18.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,...,2.0,2.0,4.0,1.0,2.0,2.0,2.0,2.0,2.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,0.0,0.0,0.0,0.0,0.0,2.0,2.0,3.0,2.0,1.0,2.0,2.0,4.0,1.0,1.0,0.0
25%,0.0,8.0,2.0,7.0,14.0,4.0,4.0,4.0,4.0,153.0,0.0,-0.02,0.0,24.0,2.0,15.0,15.0,17.39,15.0,5.0,10.0,5.0,6.0,6.0,5.0,6.0,4.0,5.0,4.0,2.0,24.0,5.0,2.0,1.0,1.0,1.0,4.0,1.0,2.0,4.0,2.0,5.0,7.0,6.0,5.0,4.0,2.0,4.0,7.0,3.0,...,7.0,7.0,8.0,7.0,7.0,6.0,7.0,7.0,6.0,6.0,0.0,0.0,0.0,1.0,1.0,15.22,10.0,16.67,15.0,5.0,10.0,20.0,10.0,10.0,10.0,0.0,5.0,10.0,7.0,7.0,9.0,5.0,7.0,10.0,7.0,7.0,9.0,6.0,5.0,7.0,7.0,8.0,7.0,6.0,6.0,7.0,7.0,6.0,6.0,2124.0
50%,1.0,16.0,2.0,11.0,18.0,8.0,9.0,8.0,8.0,280.0,0.0,0.21,0.0,26.0,2.0,20.0,18.18,20.0,18.0,10.0,10.64,6.0,7.0,7.0,7.0,7.0,6.0,6.0,5.0,2.0,26.0,8.0,2.0,3.0,3.0,2.0,5.0,2.0,6.0,7.0,4.0,7.0,8.0,7.0,7.0,6.0,3.0,6.0,8.0,6.0,...,7.0,8.0,8.0,8.0,8.0,7.0,8.0,8.0,7.0,7.0,0.0,0.0,0.0,1.0,1.0,20.0,16.67,20.0,16.33,10.0,13.95,25.0,15.0,15.0,17.0,10.0,10.0,20.0,10.0,10.0,12.0,9.0,10.0,20.0,10.0,10.0,15.0,10.0,10.0,7.0,8.0,8.0,8.0,8.0,7.0,8.0,8.0,7.0,7.0,4210.0
75%,1.0,26.0,2.0,15.0,20.0,13.0,14.0,13.0,13.0,409.0,0.0,0.43,1.0,28.0,4.0,25.0,20.0,23.6725,20.0,15.0,16.0,8.0,8.0,8.0,8.0,8.0,7.0,7.0,7.0,2.0,28.0,10.0,4.0,6.0,6.0,2.0,6.0,3.0,7.0,9.0,7.0,8.0,9.0,8.0,8.0,8.0,6.0,8.0,9.0,7.0,...,8.0,9.0,9.0,9.0,9.0,8.0,8.0,9.0,8.0,8.0,1.0,1.0,1.0,1.0,1.0,30.0,20.0,20.0,20.0,15.0,16.67,40.0,20.0,20.0,20.0,10.0,20.0,37.0,15.0,15.0,20.0,10.0,15.0,30.0,15.0,15.0,20.0,10.0,15.0,8.0,9.0,9.0,9.0,9.0,8.0,9.0,9.0,8.0,8.0,6266.0
max,1.0,44.0,2.0,21.0,22.0,22.0,22.0,22.0,22.0,552.0,1.0,0.91,1.0,55.0,6.0,100.0,60.0,50.0,50.0,53.0,30.0,10.5,10.0,10.0,10.0,10.0,10.0,10.0,10.0,8.0,55.0,18.0,6.0,10.0,10.0,6.0,7.0,7.0,17.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,14.0,10.0,13.0,10.0,...,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,21.0,9.0,1.0,9.0,4.0,80.0,65.0,45.0,30.0,30.0,55.0,80.0,60.0,45.0,40.0,30.0,55.0,80.0,40.0,30.0,30.0,40.0,45.0,80.0,50.0,60.0,40.0,50.0,45.0,12.0,12.0,12.0,12.0,12.0,10.0,10.0,10.0,10.0,10.0,8372.0


**Data shape (Number of variable and observation)**

In [None]:
train_data.shape

(5909, 192)

_________________
# 2. Data PreProcessing

_________________

## **2.1 Check Duplicated Rows**

In [None]:
train_data.duplicated().sum()

0

> **OBSERVATION**:
- No duplication


## **2.2 Check Columns with Constant Values**


In [None]:
def CheckConstantCol(df, n_unique=1):
    """
    Check for and return a dictionary of constant columns in a dataframe.
    A constant column is a column with at most `n_unique` distinct values (by default, a single value).

    Args:
    df (DataFrame): The dataframe to analyze.
    n_unique (int): The maximum number of unique values for a column to be flagged as constant.

    Returns:
    constants (dict): A dictionary with the constant column names as keys and the number of unique values as values.
    """
    constants = {}
    for col in df.columns:
        # Count distinct values, treating NaN as a value of its own
        n = df[col].nunique(dropna=False)
        if n <= n_unique:
            constants[col] = n

    return constants

In [None]:
constants = CheckConstantCol(train_data)
if constants:
    print('Variables With Constant Value:\n', list(constants.keys()))
else:
    print('No Constant columns in this dataset 😊')

No Constant columns in this dataset 😊


> **OBSERVATION**:
No column with constant value

## **2.3 Understanding what is missing**


Remove uninformative columns and handle missing data, in order to prepare the dataset for modeling.

### **2.3.1 Helper function to Check Missing Values**

Identify null-heavy columns:

- df.isnull().sum() / len(df) to get **% null per column**.

- df.isnull().sum() to get **count of nulls per column**.

- Visual inspection of column stats.
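
A minimal sketch of these two checks on a toy frame. The column names and values are made up for illustration, and the 0.6 cut-off mirrors the threshold used later in this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with one mostly-null column
toy = pd.DataFrame({
    "age": [27.0, np.nan, 31.0, 25.0],
    "income": [np.nan, np.nan, np.nan, 45300.0],
    "gender": [0, 1, 1, 0],
})

null_count = toy.isnull().sum()       # count of nulls per column
null_fraction = toy.isnull().mean()   # same as isnull().sum() / len(toy)

# Columns whose null fraction exceeds the threshold
to_drop = null_fraction[null_fraction > 0.6].index.tolist()
print(null_count.to_dict(), to_drop)
```

Taking the mean of the boolean mask is a compact way to get the fraction, since `True` counts as 1.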

In [None]:
import plotly.graph_objs as go

def CountNulls(df, p=0):

  """
  Calculate the total number of null values and their percentage in each column of a pandas DataFrame.

  Args:
      df (pandas.DataFrame): The DataFrame to analyze.
      p (float): the minimum fraction of nulls (between 0 and 1) a column must have to be kept in the result.

  Returns:
      pandas.DataFrame: A DataFrame that shows the total number of null values and their percentage in each column.
  """
  # Calculate Number of NULLs in each column
  vals = df.isnull().sum().sort_values(ascending=False)

  # Calculate fraction of NULLs in each column
  per = (df.isnull().sum()/df.isnull().count()).sort_values(ascending=False)

  # create a new DataFrame that combines the null counts and fractions
  nulls_df = pd.concat([vals, per], axis=1, keys=['null_count', 'null_percentage'])
  nulls_df = nulls_df.reset_index().rename(columns={'index': 'feature'})

  # Keep only the features whose fraction of NULLs is at least the threshold p
  nulls_df = nulls_df[nulls_df['null_percentage'] >= p]

  return nulls_df

In [None]:
def barPlot(x, y, title):
      """
      Plot the bar graph of the total number of null values in each column of a pandas DataFrame.

      Args:
        x (pd.Series): a list of strings representing the column names in the DataFrame or Series object.
        y (pd.Series): a list of integers representing the number of null values in each column.
        title (str): a string representing the title of the bar plot.

      Returns: None
      """
      # create a bar plot using Plotly
      data = [go.Bar(x=x, y=y)]

      # set the layout of the plot
      layout = go.Layout(title=title, xaxis=dict(title='Column'), yaxis=dict(title='Number of Nulls'))

      # create a figure object that includes both the data and layout
      fig = go.Figure(data=data, layout=layout)

      # display the figure
      fig.show()

In [None]:
# Calculate NULLs for all the columns
nulls_df = CountNulls(train_data)
# create a bar plot using Plotly
barPlot(x = nulls_df['feature'], y=nulls_df['null_count'],
        title = 'Number of Nulls for each column')

> **OBSERVATION:**

Many columns have a large share of missing values. Such sparse columns carry little usable signal and can hurt model performance, so we will drop every column in which 60% or more of the values are missing.

In [None]:
# Calculate NULLs for columns with at least 60% null values
nulls_df = CountNulls(train_data, 0.6)
barPlot(x = nulls_df['feature'], y=nulls_df['null_count'],
        title = 'Number of Nulls for columns with at least 60% null values')

In [None]:
droped_cols = list(nulls_df['feature'].values)
print(len(droped_cols))
print(droped_cols)


33
['num_in_3', 'numdat_3', 'expnum', 'amb7_2', 'sinc7_2', 'shar7_2', 'fun7_2', 'intel7_2', 'attr7_2', 'attr7_3', 'sinc7_3', 'intel7_3', 'fun7_3', 'amb7_3', 'shar7_3', 'shar2_3', 'attr5_3', 'sinc5_3', 'intel5_3', 'fun5_3', 'amb5_3', 'shar4_3', 'fun4_3', 'intel4_3', 'sinc4_3', 'attr4_3', 'attr2_3', 'sinc2_3', 'intel2_3', 'fun2_3', 'amb2_3', 'amb4_3', 'mn_sat']


### **2.3.2 Drop uninformative Columns**


Drop the columns in which at least **60%** of the values are null. Removing such sparse, mostly-null columns is a useful data cleaning step.

In [None]:
print('Dataset shape before dropping columns',train_data.shape)

df1 = train_data.drop(droped_cols, axis = 1)

print('Dataset shape after dropping columns',df1.shape)

Dataset shape before dropping columns (5909, 192)
Dataset shape after dropping columns (5909, 159)


### **2.3.3 Separate Numerical and Categorical Features**

In [None]:
nums_df = df1.select_dtypes(include=np.number)
cats_df = df1.select_dtypes(include='object')

features_numeric = list(df1.select_dtypes(include=np.number))
features_categorical = list(df1.select_dtypes(include='object'))

print('Numerical Features: \n',features_numeric)
print('\nCategorical Features: \n',features_categorical)

Numerical Features: 
 ['gender', 'idg', 'condtn', 'wave', 'round', 'position', 'positin1', 'order', 'partner', 'pid', 'match', 'int_corr', 'samerace', 'age_o', 'race_o', 'pf_o_att', 'pf_o_sin', 'pf_o_int', 'pf_o_fun', 'pf_o_amb', 'pf_o_sha', 'attr_o', 'sinc_o', 'intel_o', 'fun_o', 'amb_o', 'shar_o', 'like_o', 'prob_o', 'met_o', 'age', 'field_cd', 'race', 'imprace', 'imprelig', 'goal', 'date', 'go_out', 'career_c', 'sports', 'tvsports', 'exercise', 'dining', 'museums', 'art', 'hiking', 'gaming', 'clubbing', 'reading', 'tv', 'theater', 'movies', 'concerts', 'music', 'shopping', 'yoga', 'exphappy', 'attr1_1', 'sinc1_1', 'intel1_1', 'fun1_1', 'amb1_1', 'shar1_1', 'attr4_1', 'sinc4_1', 'intel4_1', 'fun4_1', 'amb4_1', 'shar4_1', 'attr2_1', 'sinc2_1', 'intel2_1', 'fun2_1', 'amb2_1', 'shar2_1', 'attr3_1', 'sinc3_1', 'fun3_1', 'intel3_1', 'amb3_1', 'attr5_1', 'sinc5_1', 'intel5_1', 'fun5_1', 'amb5_1', 'attr', 'sinc', 'intel', 'fun', 'amb', 'shar', 'like', 'prob', 'met', 'match_es', 'attr1_s', '

### **2.3.4 Handling categorical features**

In [None]:
# Calculate NULLs for all the columns
nulls_df = CountNulls(cats_df)
# create a bar plot using Plotly
barPlot(x = nulls_df['feature'], y=nulls_df['null_count'],title='Number of Nulls in Categorical Features')

> **OBSERVATION:** `zipcode` seems to be an uninformative column with too many null values, so I will drop it.

In [None]:
df1 = df1.drop(['zipcode'], axis = 1)

In [None]:
df1[['tuition', 'income']]

Unnamed: 0,tuition,income
0,21645.00,
1,,
2,,
3,,45300.00
4,26019.00,46138.00
...,...,...
5904,,65708.00
5905,,
5906,13258.00,37881.00
5907,,


> **OBSERVATION:** The `tuition` and `income` columns are stored as text with thousands separators, so they need to be converted to numeric values.

In [None]:
df1[['tuition', 'income']] = df1[['tuition', 'income']].apply(lambda x: x.str.replace(',','')).astype(float)
test_data[['tuition', 'income']] = test_data[['tuition', 'income']].apply(lambda x: x.str.replace(',','')).astype(float)

In [None]:
df1[['tuition', 'income']]

Unnamed: 0,tuition,income
0,21645.0,
1,,
2,,
3,,45300.0
4,26019.0,46138.0
...,...,...
5904,,65708.0
5905,,
5906,13258.0,37881.0
5907,,


## **2.4 Check Correlation in Numerical Features**

In [None]:
import plotly.figure_factory as ff

def Heatmap(corr) :
    """
    The Heatmap function creates an annotated heatmap of the input correlation matrix using the Plotly library.

      Args:
      corr (pandas.DataFrame): A pandas DataFrame or numpy array object containing the correlation matrix to be visualized.

      Returns: None
    """
    # applies the mask to the correlation matrix using the mask() method,
    # which sets all the values in the upper triangular part of the matrix to NaN.

    mask = np.triu(np.ones_like(corr, dtype=bool))
    df_mask = corr.mask(mask)

    # sets the z parameter to the masked correlation matrix, and the x and y parameters to the column names of the correlation matrix.
    hmap = ff.create_annotated_heatmap(
                                    z=np.around(df_mask.to_numpy(),2),
                                    x=df_mask.columns.tolist(),
                                    y=df_mask.columns.tolist(),
                                    colorscale=px.colors.diverging.RdBu,
                                    showscale=True, ygap=1, xgap=1)

    # It also sets the colorscale, showscale, and various layout parameters of the heatmap.
    hmap.update_xaxes(side="bottom")
    hmap.update_layout(
      title_text='Feature Correlation Heatmap'.upper(),
      width=1000,
      height=800,
      xaxis_showgrid=False,
      yaxis_showgrid=False,
      xaxis_zeroline=False,
      yaxis_zeroline=False,
      yaxis_autorange='reversed',
      template='plotly_white'
    )
    for i in range(len(hmap.layout.annotations)):
        hmap.layout.annotations[i].font.size = 8
        hmap.layout.annotations[i].font.color = 'black'

        if hmap.layout.annotations[i].text == 'nan':
            hmap.layout.annotations[i].text =''
    hmap.show()


In [None]:
corr_df = nums_df.corr()
corr_df.shape

(152, 152)

In [None]:
# Dividing the correlation matrix to better look into the graph
corr1 = corr_df.iloc[:35, :35]
corr2 = corr_df.iloc[35:70, 35:70]
corr3 = corr_df.iloc[70: 105, 70:105]
corr4 = corr_df.iloc[105:140 , 105:140]
corr5 = corr_df.iloc[140: , 140:]

In [None]:
Heatmap(corr1)

In [None]:
Heatmap(corr2)

In [None]:
Heatmap(corr3)

In [None]:
Heatmap(corr4)

In [None]:
Heatmap(corr5)

>**OBSERVATION**: Some features are strongly correlated with one another, so feature selection is needed, either manual or automatic. Here I keep the features whose pairwise correlation stays below a specific threshold (0.85) and drop one feature from each pair above it.

In [None]:
def CheckCorrelated(corr, threshold = 0.5):
    """
    The CheckCorrelated function takes a precomputed correlation matrix
    and returns a list of column names whose absolute correlation coefficient is greater than `threshold`
    with at least one column that appears earlier in the matrix.

    Args:
      corr (pandas.DataFrame): A square correlation matrix, e.g. the result of DataFrame.corr().
      threshold (float): The absolute correlation above which a column is flagged.

    Returns:
      correlated_cols (List): A list of column names that have an absolute correlation coefficient
      greater than the threshold with at least one earlier column.
    """
    # keep only the upper triangle (above the diagonal) so each pair is checked once
    upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))

    # Find the columns whose absolute correlation with an earlier column exceeds the threshold
    correlated_cols  = list(column for column in upper.columns if any(abs(upper[column]) > threshold))

    # Return the correlated features (the candidates to drop)
    return correlated_cols
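
To see how the upper-triangle mask picks one column per correlated pair, here is a small self-contained sketch on toy data. The frame and its column names are hypothetical; the masking logic mirrors the function above.

```python
import numpy as np
import pandas as pd

# Toy data: 'b' is an exact multiple of 'a'; 'c' is only weakly related to both
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.0],
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],
})
corr = toy.corr()

# Keep only entries above the diagonal so each pair appears once
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))

# A column is flagged when it correlates strongly with any earlier column
flagged = [col for col in upper.columns if any(abs(upper[col]) > 0.85)]
print(flagged)
```

Only `b` is flagged: of the pair (`a`, `b`), the earlier column is kept and the later one becomes the drop candidate. Column order therefore decides which member of a pair survives.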

## **2.5 Create 3 Versions From the dataset**

1. A dataset after dropping null-heavy features.

2. A dataset after dropping null-heavy features + highly correlated features.

3. A dataset after dropping null-heavy features + oversampling using `RandomOverSampler`.

### **2.5.1 Dataset V1**

This dataset is obtained by **dropping** the columns that have a **high percentage of null values (60% or more)**.
The final dataset therefore has fewer columns and, hopefully, cleaner and more reliable data in the remaining ones.


In [None]:
# split the data into features and labels
X1 = df1.drop('match', axis=1)
y1 = df1['match']

### **2.5.2 Dataset V2**

I dropped one column from each pair whose correlation exceeded a **threshold of 85%**, to avoid including redundant information. The resulting dataset has fewer columns, and the remaining columns provide more unique information.

In [None]:
df2 = df1.copy()

dropped_cols = CheckCorrelated(corr_df, 0.85)
print(len(dropped_cols))
print()
print(dropped_cols)


3

['pid', 'art', 'attr5_1']


In [None]:
df2 = df2.drop(dropped_cols, axis=1)

In [None]:
# split the data into features and labels
X2 = df2.drop('match', axis=1)
y2 = df2['match']

### **2.5.3 Dataset V3 (Oversampled Dataset)**
OverSample the Minority Class

In [None]:
from imblearn.over_sampling import RandomOverSampler

# oversample the minority class using RandomOverSampler
ros = RandomOverSampler(random_state=42)
X3, y3 = ros.fit_resample(X1, y1)
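
For intuition, random oversampling simply resamples the minority class with replacement until both classes have the same size. The sketch below reproduces that idea with plain pandas on a toy frame; the data is hypothetical and this is not the `imblearn` implementation.

```python
import pandas as pd

# Toy frame (hypothetical): 4 non-matches and 2 matches
toy = pd.DataFrame({
    "like": [3, 4, 5, 6, 8, 9],
    "match": [0, 0, 0, 0, 1, 1],
})

majority_n = toy["match"].value_counts().max()

parts = []
for label, group in toy.groupby("match"):
    # Draw extra rows with replacement until the class reaches the majority size
    extra = group.sample(n=majority_n - len(group), replace=True, random_state=42)
    parts.append(pd.concat([group, extra]))

balanced = pd.concat(parts, ignore_index=True)
print(balanced["match"].value_counts().to_dict())
```

Because the added rows are exact duplicates, oversampling is commonly applied only to the training portion of a split, so that copies of the same row do not land in both training and validation data.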

_________________
# 3. Full Pipeline

_________________

The full pipeline chains data preprocessing and modeling in scikit-learn: imputers handle missing values in the numerical and categorical features, and the hyperparameters are optimized using grid search.
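
The overall shape of such a pipeline can be sketched on toy data as follows. The toy columns, the logistic regression step, and the small `C` grid are assumptions for illustration only; the actual features and search spaces used in this notebook differ.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, RobustScaler

# Toy data (hypothetical) with missing values in both column types
rng = np.random.default_rng(0)
n = 120
toy = pd.DataFrame({
    "age": rng.integers(18, 40, size=n).astype(float),
    "like": rng.uniform(1, 10, size=n),
    "field": pd.Series(rng.choice(["Law", "Business", "Engineering"], size=n),
                       dtype=object),
})
y = (toy["like"] + rng.normal(0, 1.5, size=n) > 6.5).astype(int)
toy.loc[::10, "age"] = np.nan
toy.loc[::15, "field"] = np.nan

numeric = Pipeline([("imputer", KNNImputer()), ("scaler", RobustScaler())])
categorical = Pipeline([
    ("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])
preprocessor = ColumnTransformer([
    ("num", numeric, ["age", "like"]),
    ("cat", categorical, ["field"]),
])
pipe = Pipeline([("preprocessor", preprocessor),
                 ("LR", LogisticRegression(max_iter=1000))])

# Step parameters are addressed as <step name>__<parameter>
search = GridSearchCV(
    pipe, {"LR__C": [0.1, 1.0, 10.0]}, scoring="roc_auc",
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0))
search.fit(toy, y)
print(search.best_params_, round(search.best_score_, 3))
```

Putting the imputers and scaler inside the pipeline means they are refit on each training fold during the search, so no statistics from the validation fold leak into preprocessing.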

## **3.0 Create Helper Function for the Pipeline**

I need this helper function because it is used several times with different algorithms and differently preprocessed data.

#### Using one-hot encoding:
This ensures that the input data is entirely numerical, and there is no arbitrary relationship between the categories. It would not make sense to assume an ordinal relationship between categories.
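
A small sketch of this encoding, assuming a toy `field` column with made-up values. It also shows what `handle_unknown='ignore'` does with a category that was not seen during fitting.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({"field": ["Business", "Engineering", "Business"]})
new = pd.DataFrame({"field": ["Engineering", "Law"]})  # "Law" is unseen

encoder = OneHotEncoder(handle_unknown="ignore").fit(train)
encoded = encoder.transform(new).toarray()

print(encoder.categories_[0].tolist())  # ['Business', 'Engineering']
print(encoded.tolist())                 # [[0.0, 1.0], [0.0, 0.0]]
```

The unseen category becomes an all-zero row instead of raising an error, which is useful when the test set contains values absent from the training set.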

#### Using KNNImputer:
It takes into account the relationships between the features in the data, and can lead to a more accurate representation of the underlying data distribution. It is applied to the numerical features only; the categorical features are imputed with `SimpleImputer`.
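
As a rough illustration, the sketch below runs `KNNImputer` on a tiny numeric array. The array and `n_neighbors=2` are choices made for this example; the pipeline in this notebook uses the default settings.

```python
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

# Each missing value is replaced by the mean of that feature over the
# 2 nearest rows, with distances computed on the features both rows share
imputed = KNNImputer(n_neighbors=2).fit_transform(X)
print(imputed)
```

The missing entry in the first row becomes 4.0, the mean of 3.0 and 5.0 from its two nearest rows. The missing entry in the third row becomes 5.5, the mean of 3.0 and 8.0.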

#### Using RobustScaler:
It works by subtracting the median of the data and then dividing by the interquartile range (IQR). This makes it more robust to the presence of outliers and extreme values in the data, compared to simpler scaling methods like the StandardScaler.
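
The calculation can be checked by hand on a toy column containing one extreme value (the numbers are made up for illustration):

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])  # 100 is an outlier

# Manual version: subtract the median, divide by the interquartile range
median = np.median(x, axis=0)
q1, q3 = np.percentile(x, [25, 75], axis=0)
manual = (x - median) / (q3 - q1)

scaled = RobustScaler().fit_transform(x)
print(manual.ravel())  # [-1.  -0.5  0.   0.5 48.5]
print(np.allclose(manual, scaled))
```

The median (3) and IQR (4 - 2 = 2) are unaffected by the outlier, so the four ordinary values stay evenly spaced while the extreme value remains clearly separated.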

In [None]:
def BuildFullPipeline(preprocessor, model, model_name):
    """
    The BuildFullPipeline function builds a modeling pipeline for a given machine learning algorithm.
    The pipeline consists of a preprocessor step followed by a machine learning algorithm step.

    Args:
    preprocessor: a ColumnTransformer object that applies data preprocessing techniques to the input data.
    model: the (unfitted) estimator to use as the final step.
    model_name (str): the name of the model step, used as the prefix for its hyperparameters.

    Return: An unfitted Pipeline object that combines the preprocessor and the model.

    """
    # define a pipeline to model
    full_pipeline = Pipeline(
        steps=[
            ('preprocessor', preprocessor),
            (model_name, model)])
    return full_pipeline

In [None]:
# Define a pipeline for numeric feature preprocessing
from sklearn.impute import KNNImputer

transformer_numeric = Pipeline(
    steps=[
        ('imputer', KNNImputer()),
        ('scaler', RobustScaler())]
)

# define a pipeline for categorical feature preprocessing
transformer_categorical = Pipeline(
    steps=[
        ('imputer', SimpleImputer(strategy='constant')),
        ('onehot', OneHotEncoder(handle_unknown='ignore'))
    ]
)


## **3.1 Build Preprocessing and Model Pipeline for Dataset V1**
Pipeline for the data after dropping the null columns only



In [None]:
features_numeric1 = list(X1.select_dtypes(exclude=['object']))
features_categorical1 = list(X1.select_dtypes(include=['object']))

print(f'Numeric features: \n{features_numeric1}')
print(f'\nCategorical features: \n{features_categorical1}')

Numeric features: 
['gender', 'idg', 'condtn', 'wave', 'round', 'position', 'positin1', 'order', 'partner', 'pid', 'int_corr', 'samerace', 'age_o', 'race_o', 'pf_o_att', 'pf_o_sin', 'pf_o_int', 'pf_o_fun', 'pf_o_amb', 'pf_o_sha', 'attr_o', 'sinc_o', 'intel_o', 'fun_o', 'amb_o', 'shar_o', 'like_o', 'prob_o', 'met_o', 'age', 'field_cd', 'tuition', 'race', 'imprace', 'imprelig', 'income', 'goal', 'date', 'go_out', 'career_c', 'sports', 'tvsports', 'exercise', 'dining', 'museums', 'art', 'hiking', 'gaming', 'clubbing', 'reading', 'tv', 'theater', 'movies', 'concerts', 'music', 'shopping', 'yoga', 'exphappy', 'attr1_1', 'sinc1_1', 'intel1_1', 'fun1_1', 'amb1_1', 'shar1_1', 'attr4_1', 'sinc4_1', 'intel4_1', 'fun4_1', 'amb4_1', 'shar4_1', 'attr2_1', 'sinc2_1', 'intel2_1', 'fun2_1', 'amb2_1', 'shar2_1', 'attr3_1', 'sinc3_1', 'fun3_1', 'intel3_1', 'amb3_1', 'attr5_1', 'sinc5_1', 'intel5_1', 'fun5_1', 'amb5_1', 'attr', 'sinc', 'intel', 'fun', 'amb', 'shar', 'like', 'prob', 'met', 'match_es', 'at

In [None]:
# define the preprocessor
preprocessor1 = ColumnTransformer(
    transformers=[
        ('num', transformer_numeric, features_numeric1),
        ('cat', transformer_categorical, features_categorical1)
    ]
)

In [None]:
RF_pipline1 = BuildFullPipeline(preprocessor1, RandomForestClassifier(),'RF')
MLP_pipline1 = BuildFullPipeline(preprocessor1, MLPClassifier(),'MLP')
LR_pipline1 = BuildFullPipeline(preprocessor1, LogisticRegression(),'LR')
SVC_pipline1 = BuildFullPipeline(preprocessor1, SVC(probability=True),'SVC')
XGB_pipline1 = BuildFullPipeline(preprocessor1, XGBClassifier(objective='binary:logistic'),'XGB')
RF_pipline1

Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('num',
                                                  Pipeline(steps=[('imputer',
                                                                   KNNImputer()),
                                                                  ('scaler',
                                                                   RobustScaler())]),
                                                  ['gender', 'idg', 'condtn',
                                                   'wave', 'round', 'position',
                                                   'positin1', 'order',
                                                   'partner', 'pid', 'int_corr',
                                                   'samerace', 'age_o',
                                                   'race_o', 'pf_o_att',
                                                   'pf_o_sin', 'pf_o_int',
                                                   'pf_o

## **3.2 Build Preprocessing and Model Pipeline for Dataset V2**
Pipeline for the data after dropping the null columns and the highly correlated columns

In [None]:
features_numeric2 = list(X2.select_dtypes(exclude=['object']))
features_categorical2 = list(X2.select_dtypes(include=['object']))

print(f'Numeric features: \n{features_numeric2}')
print(f'\nCategorical features: \n{features_categorical2}')

Numeric features: 
['gender', 'idg', 'condtn', 'wave', 'round', 'position', 'positin1', 'order', 'partner', 'int_corr', 'samerace', 'age_o', 'race_o', 'pf_o_att', 'pf_o_sin', 'pf_o_int', 'pf_o_fun', 'pf_o_amb', 'pf_o_sha', 'attr_o', 'sinc_o', 'intel_o', 'fun_o', 'amb_o', 'shar_o', 'like_o', 'prob_o', 'met_o', 'age', 'field_cd', 'tuition', 'race', 'imprace', 'imprelig', 'income', 'goal', 'date', 'go_out', 'career_c', 'sports', 'tvsports', 'exercise', 'dining', 'museums', 'hiking', 'gaming', 'clubbing', 'reading', 'tv', 'theater', 'movies', 'concerts', 'music', 'shopping', 'yoga', 'exphappy', 'attr1_1', 'sinc1_1', 'intel1_1', 'fun1_1', 'amb1_1', 'shar1_1', 'attr4_1', 'sinc4_1', 'intel4_1', 'fun4_1', 'amb4_1', 'shar4_1', 'attr2_1', 'sinc2_1', 'intel2_1', 'fun2_1', 'amb2_1', 'shar2_1', 'attr3_1', 'sinc3_1', 'fun3_1', 'intel3_1', 'amb3_1', 'sinc5_1', 'intel5_1', 'fun5_1', 'amb5_1', 'attr', 'sinc', 'intel', 'fun', 'amb', 'shar', 'like', 'prob', 'met', 'match_es', 'attr1_s', 'sinc1_s', 'intel

In [None]:
# define the preprocessor
preprocessor2 = ColumnTransformer(
    transformers=[
        ('num', transformer_numeric, features_numeric2),
        ('cat', transformer_categorical, features_categorical2)
    ]
)


In [None]:
RF_pipline2 = BuildFullPipeline(preprocessor2, RandomForestClassifier(),'RF')
MLP_pipline2 = BuildFullPipeline(preprocessor2, MLPClassifier(),'MLP')
LR_pipline2 = BuildFullPipeline(preprocessor2, LogisticRegression(),'LR')
SVC_pipline2 = BuildFullPipeline(preprocessor2, SVC(probability=True),'SVC')
XGB_pipline2 = BuildFullPipeline(preprocessor2, XGBClassifier(objective='binary:logistic'),'XGB')

## **3.3 Build Preprocessing and Model Pipeline for Dataset V3**
Pipeline for the oversampled version of the dataset after dropping the null columns only

In [None]:
features_numeric3 = list(X3.select_dtypes(exclude=['object']))
features_categorical3 = list(X3.select_dtypes(include=['object']))

print(f'Numeric features: \n{features_numeric3}')
print(f'\nCategorical features: \n{features_categorical3}')

Numeric features: 
['gender', 'idg', 'condtn', 'wave', 'round', 'position', 'positin1', 'order', 'partner', 'pid', 'int_corr', 'samerace', 'age_o', 'race_o', 'pf_o_att', 'pf_o_sin', 'pf_o_int', 'pf_o_fun', 'pf_o_amb', 'pf_o_sha', 'attr_o', 'sinc_o', 'intel_o', 'fun_o', 'amb_o', 'shar_o', 'like_o', 'prob_o', 'met_o', 'age', 'field_cd', 'tuition', 'race', 'imprace', 'imprelig', 'income', 'goal', 'date', 'go_out', 'career_c', 'sports', 'tvsports', 'exercise', 'dining', 'museums', 'art', 'hiking', 'gaming', 'clubbing', 'reading', 'tv', 'theater', 'movies', 'concerts', 'music', 'shopping', 'yoga', 'exphappy', 'attr1_1', 'sinc1_1', 'intel1_1', 'fun1_1', 'amb1_1', 'shar1_1', 'attr4_1', 'sinc4_1', 'intel4_1', 'fun4_1', 'amb4_1', 'shar4_1', 'attr2_1', 'sinc2_1', 'intel2_1', 'fun2_1', 'amb2_1', 'shar2_1', 'attr3_1', 'sinc3_1', 'fun3_1', 'intel3_1', 'amb3_1', 'attr5_1', 'sinc5_1', 'intel5_1', 'fun5_1', 'amb5_1', 'attr', 'sinc', 'intel', 'fun', 'amb', 'shar', 'like', 'prob', 'met', 'match_es', 'at

In [None]:
# define the preprocessor
preprocessor3 = ColumnTransformer(
    transformers=[
        ('num', transformer_numeric, features_numeric3),
        ('cat', transformer_categorical, features_categorical3)
    ]
)

In [None]:
LR_pipline3 = BuildFullPipeline(preprocessor3, LogisticRegression(),'LR')
SVC_pipline3 = BuildFullPipeline(preprocessor3, SVC(probability=True),'SVC')
XGB_pipline3 = BuildFullPipeline(preprocessor3, XGBClassifier(objective='binary:logistic'),'XGB')

_________________


# 4. Apply Pipeline With GridSearch

_________________

I used grid search to find the best hyperparameters for 4 different algorithms (MLP, SVC, RF, LR) with 3 different datasets:

1. A dataset after dropping the null features.

2. A dataset after dropping the null features + the highly correlated features.

3. A dataset after dropping the null features + oversampling.

## **4.1 Multi-Layer Perceptron in Grid Search Classifier**

MLPs can handle high-dimensional input features, but they require a large amount of data to learn complex patterns. If the dataset is small, an MLP may overfit or fail to generalize well to new data.


In [None]:
scorer = 'roc_auc'
cv = 2
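
The `roc_auc` scorer ranks candidates by the area under the ROC curve. One way to read that number: it is the fraction of (match, no-match) pairs in which the match receives the higher predicted score. The sketch below computes it by direct pair counting on four made-up predictions; `pairwise_auc` is a hypothetical helper written for illustration, not a scikit-learn function.

```python
def pairwise_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))


# Toy labels (1 = match) and predicted match probabilities.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

auc = pairwise_auc(y_true, scores)
print(auc)  # 3 of the 4 pairs are ordered correctly
```

Because the metric depends only on the ranking of the scores, it is less sensitive to class imbalance than accuracy.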

In [None]:
# define the hyperparameters to optimize using grid search
MLP_param_grid = {
    'preprocessor__num__imputer__n_neighbors': [3],
    'MLP__hidden_layer_sizes': [(100, 50, 10)],
    'MLP__alpha': [0.001, 0.01],
    'MLP__activation': ['logistic', 'relu'],
    'MLP__solver': ['adam']
}
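
The keys in this grid follow scikit-learn's nested `step__parameter` naming: `MLP__alpha` reaches the `alpha` of the step named `MLP`, and `preprocessor__num__imputer__n_neighbors` walks from the pipeline through the `num` transformer to its `imputer` step. The sketch below rebuilds a small pipeline with the same step names to check this; the column names `age` and `field` are placeholders, and nothing is fitted.

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, RobustScaler

numeric = Pipeline([("imputer", KNNImputer()), ("scaler", RobustScaler())])
categorical = Pipeline([
    ("imputer", SimpleImputer(strategy="constant")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])
# "age" and "field" are placeholder column names for this sketch.
preprocessor = ColumnTransformer([
    ("num", numeric, ["age"]),
    ("cat", categorical, ["field"]),
])
pipe = Pipeline([("preprocessor", preprocessor), ("MLP", MLPClassifier())])

# The same names that appear as grid keys can be set directly.
pipe.set_params(preprocessor__num__imputer__n_neighbors=3, MLP__alpha=0.01)
params = pipe.get_params()
print(params["preprocessor__num__imputer__n_neighbors"], params["MLP__alpha"])
```

Calling `pipe.get_params().keys()` on any of the pipelines lists every name that a search space may use.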

### **4.1.1 MLP classifier in Grid Search with dataset v1**

In [None]:
# run grid search to find the best hyperparameters
MLP1_grid_search = GridSearchCV(MLP_pipline1,MLP_param_grid,verbose=3, cv=cv, scoring=scorer)
MLP1_grid_search.fit(X1, y1)

# print the best hyperparameters and score
print("Best hyperparameters: ", MLP1_grid_search.best_params_)
print("Best score: ", MLP1_grid_search.best_score_)

Fitting 2 folds for each of 4 candidates, totalling 8 fits
[CV 1/2] END MLP__activation=logistic, MLP__alpha=0.001, MLP__hidden_layer_sizes=(100, 50, 10), MLP__solver=adam, preprocessor__num__imputer__n_neighbors=3;, score=0.846 total time=  51.3s
[CV 2/2] END MLP__activation=logistic, MLP__alpha=0.001, MLP__hidden_layer_sizes=(100, 50, 10), MLP__solver=adam, preprocessor__num__imputer__n_neighbors=3;, score=0.837 total time=  51.0s
[CV 1/2] END MLP__activation=logistic, MLP__alpha=0.01, MLP__hidden_layer_sizes=(100, 50, 10), MLP__solver=adam, preprocessor__num__imputer__n_neighbors=3;, score=0.839 total time=  49.7s
[CV 2/2] END MLP__activation=logistic, MLP__alpha=0.01, MLP__hidden_layer_sizes=(100, 50, 10), MLP__solver=adam, preprocessor__num__imputer__n_neighbors=3;, score=0.835 total time=  49.0s
[CV 1/2] END MLP__activation=relu, MLP__alpha=0.001, MLP__hidden_layer_sizes=(100, 50, 10), MLP__solver=adam, preprocessor__num__imputer__n_neighbors=3;, score=0.824 total time=  19.2s
[C

> **OBSERVATION:** Based on the results, MLP can capture the pattern in this dataset of shape (5909, 157), but it is computationally expensive to train, especially in grid search. I will try a wider range of hyperparameters in random and Bayesian search to get a better score.

### **4.1.2 MLP classifier in Grid Search with dataset v2**

In [None]:
# run grid search to find the best hyperparameters
MLP2_grid_search = GridSearchCV(MLP_pipline2, MLP_param_grid,verbose=3, cv=cv, scoring=scorer)
MLP2_grid_search.fit(X2, y2)

# print the best hyperparameters and score
print("Best hyperparameters: ", MLP2_grid_search.best_params_)
print("Best score: ", MLP2_grid_search.best_score_)

Fitting 2 folds for each of 4 candidates, totalling 8 fits
[CV 1/2] END MLP__activation=logistic, MLP__alpha=0.001, MLP__hidden_layer_sizes=(100, 50, 10), MLP__solver=adam, preprocessor__num__imputer__n_neighbors=3;, score=0.842 total time=  51.2s
[CV 2/2] END MLP__activation=logistic, MLP__alpha=0.001, MLP__hidden_layer_sizes=(100, 50, 10), MLP__solver=adam, preprocessor__num__imputer__n_neighbors=3;, score=0.846 total time=  52.5s
[CV 1/2] END MLP__activation=logistic, MLP__alpha=0.01, MLP__hidden_layer_sizes=(100, 50, 10), MLP__solver=adam, preprocessor__num__imputer__n_neighbors=3;, score=0.845 total time=  51.2s
[CV 2/2] END MLP__activation=logistic, MLP__alpha=0.01, MLP__hidden_layer_sizes=(100, 50, 10), MLP__solver=adam, preprocessor__num__imputer__n_neighbors=3;, score=0.840 total time=  51.7s
[CV 1/2] END MLP__activation=relu, MLP__alpha=0.001, MLP__hidden_layer_sizes=(100, 50, 10), MLP__solver=adam, preprocessor__num__imputer__n_neighbors=3;, score=0.811 total time=  20.6s
[C

> **OBSERVATION:** In MLP with the same hyperparameters, the results on the dataset without the highly correlated features are slightly lower.

## **4.2 Support Vector Classifier in Grid Search**

In [None]:
SVC_param_grid = {
    'preprocessor__num__imputer__n_neighbors': [3],
    'SVC__C': [0.1, 1, 10, 100],
    'SVC__kernel': ['rbf'],
    'SVC__gamma': ['scale', 'auto', 0.1, 1, 10],
}
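
Grid search fits every combination of the listed values once per fold, so the cost grows multiplicatively with each added value. As a rough check, the sketch below counts the combinations for this grid with `itertools.product`, using the `cv = 2` setting defined earlier in this section.

```python
from itertools import product

# Same values as the SVC grid in this section.
svc_grid = {
    "preprocessor__num__imputer__n_neighbors": [3],
    "SVC__C": [0.1, 1, 10, 100],
    "SVC__kernel": ["rbf"],
    "SVC__gamma": ["scale", "auto", 0.1, 1, 10],
}

# Every combination of one value per key is one candidate.
candidates = list(product(*svc_grid.values()))
n_folds = 2
total_fits = len(candidates) * n_folds
print(len(candidates), total_fits)
```

This agrees with the "20 candidates, totalling 40 fits" line that `GridSearchCV` prints for this grid.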

### **4.2.1 Support Vector Classifier in Grid Search with dataset v1**

In [None]:
SVC1_grid_search = GridSearchCV(SVC_pipline1, SVC_param_grid,verbose=3, cv=cv, scoring=scorer)
SVC1_grid_search.fit(X1, y1)

print('best score {}'.format(SVC1_grid_search.best_score_))
print('best hyperparameters {}'.format(SVC1_grid_search.best_params_))

Fitting 2 folds for each of 20 candidates, totalling 40 fits
[CV 1/2] END SVC__C=0.1, SVC__gamma=scale, SVC__kernel=rbf, preprocessor__num__imputer__n_neighbors=3;, score=0.853 total time=  28.2s
[CV 2/2] END SVC__C=0.1, SVC__gamma=scale, SVC__kernel=rbf, preprocessor__num__imputer__n_neighbors=3;, score=0.841 total time=  28.3s
[CV 1/2] END SVC__C=0.1, SVC__gamma=auto, SVC__kernel=rbf, preprocessor__num__imputer__n_neighbors=3;, score=0.858 total time=  26.2s
[CV 2/2] END SVC__C=0.1, SVC__gamma=auto, SVC__kernel=rbf, preprocessor__num__imputer__n_neighbors=3;, score=0.853 total time=  26.8s
[CV 1/2] END SVC__C=0.1, SVC__gamma=0.1, SVC__kernel=rbf, preprocessor__num__imputer__n_neighbors=3;, score=0.623 total time=  47.9s
[CV 2/2] END SVC__C=0.1, SVC__gamma=0.1, SVC__kernel=rbf, preprocessor__num__imputer__n_neighbors=3;, score=0.644 total time=  47.3s
[CV 1/2] END SVC__C=0.1, SVC__gamma=1, SVC__kernel=rbf, preprocessor__num__imputer__n_neighbors=3;, score=0.561 total time=  48.7s
[CV 

> **OBSERVATION**:
- `C=1` : The penalty on misclassification is fairly low (a wider margin), so the model is not too constrained. This suits this imbalanced dataset, as it balances underfitting against overfitting.
- `kernel='rbf'` : The Gaussian RBF kernel is a suitable choice for nonlinear classification problems, as it can model complex decision boundaries.
- `n_neighbors=3` : Imputing missing values using the 3 nearest neighbors is a simple, robust approach.
- `gamma='auto'` : Gamma is set to 1/n_features. This may be a good choice for a dataset with a moderate number of features (157 here).
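
To make the two string options for `gamma` concrete, the sketch below evaluates both formulas documented by scikit-learn on synthetic data. It assumes 157 features, the width quoted for dataset v1; the number of columns reaching the SVC after one-hot encoding may differ.

```python
import numpy as np

# Synthetic stand-in for the preprocessed feature matrix (an assumption:
# 157 columns, matching the shape quoted for dataset v1).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 157))
n_features = X.shape[1]

gamma_auto = 1.0 / n_features               # gamma='auto'
gamma_scale = 1.0 / (n_features * X.var())  # gamma='scale'

print(gamma_auto, gamma_scale)
```

Both values are close to 0.006 here, far below the explicit candidates 0.1, 1 and 10 in the grid, which is consistent with those larger values scoring much worse in the search log.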

### **4.2.2 Support Vector Classifier in Grid Search with dataset v2**

In [None]:
SVC2_grid_search = GridSearchCV(SVC_pipline2, SVC_param_grid,verbose=3, cv=cv, scoring=scorer)
SVC2_grid_search.fit(X2, y2)

print('best score {}'.format(SVC2_grid_search.best_score_))
print('best hyperparameters {}'.format(SVC2_grid_search.best_params_))

Fitting 2 folds for each of 20 candidates, totalling 40 fits
[CV 1/2] END SVC__C=0.1, SVC__gamma=scale, SVC__kernel=rbf, preprocessor__num__imputer__n_neighbors=3;, score=0.850 total time=  27.2s
[CV 2/2] END SVC__C=0.1, SVC__gamma=scale, SVC__kernel=rbf, preprocessor__num__imputer__n_neighbors=3;, score=0.835 total time=  27.3s
[CV 1/2] END SVC__C=0.1, SVC__gamma=auto, SVC__kernel=rbf, preprocessor__num__imputer__n_neighbors=3;, score=0.858 total time=  25.5s
[CV 2/2] END SVC__C=0.1, SVC__gamma=auto, SVC__kernel=rbf, preprocessor__num__imputer__n_neighbors=3;, score=0.855 total time=  26.1s
[CV 1/2] END SVC__C=0.1, SVC__gamma=0.1, SVC__kernel=rbf, preprocessor__num__imputer__n_neighbors=3;, score=0.621 total time=  46.1s
[CV 2/2] END SVC__C=0.1, SVC__gamma=0.1, SVC__kernel=rbf, preprocessor__num__imputer__n_neighbors=3;, score=0.641 total time=  45.5s
[CV 1/2] END SVC__C=0.1, SVC__gamma=1, SVC__kernel=rbf, preprocessor__num__imputer__n_neighbors=3;, score=0.556 total time=  46.3s
[CV 

> **OBSERVATION:** In SVC with the same hyperparameters, the results on the dataset without the highly correlated features are slightly lower.

## **4.3 Random Forest Classifier in Grid Search**
Random Forest computes the importance of each feature as the decrease in impurity of the nodes that use the feature for splitting. It is suitable for data with correlated features.

In [None]:
RF_param_grid = {
    'preprocessor__num__imputer__n_neighbors': [3],
    'RF__n_estimators': [20, 30, 40],
    'RF__max_depth':[10, 20, 30] ,
    'RF__max_features': ['sqrt', 'log2']
}

### **4.3.1 Random Forest in Grid Search with Dataset V1**

In [None]:
RF1_grid_search = GridSearchCV(RF_pipline1, RF_param_grid,verbose=3,cv=cv, scoring=scorer)
RF1_grid_search.fit(X1, y1)

print('best score {}'.format(RF1_grid_search.best_score_))
print('best hyperparameters {}'.format(RF1_grid_search.best_params_))

Fitting 2 folds for each of 18 candidates, totalling 36 fits
[CV 1/2] END RF__max_depth=10, RF__max_features=sqrt, RF__n_estimators=20, preprocessor__num__imputer__n_neighbors=3;, score=0.810 total time=  10.3s
[CV 2/2] END RF__max_depth=10, RF__max_features=sqrt, RF__n_estimators=20, preprocessor__num__imputer__n_neighbors=3;, score=0.814 total time=  10.8s
[CV 1/2] END RF__max_depth=10, RF__max_features=sqrt, RF__n_estimators=30, preprocessor__num__imputer__n_neighbors=3;, score=0.826 total time=  10.4s
[CV 2/2] END RF__max_depth=10, RF__max_features=sqrt, RF__n_estimators=30, preprocessor__num__imputer__n_neighbors=3;, score=0.823 total time=  10.9s
[CV 1/2] END RF__max_depth=10, RF__max_features=sqrt, RF__n_estimators=40, preprocessor__num__imputer__n_neighbors=3;, score=0.837 total time=  10.6s
[CV 2/2] END RF__max_depth=10, RF__max_features=sqrt, RF__n_estimators=40, preprocessor__num__imputer__n_neighbors=3;, score=0.825 total time=  11.1s
[CV 1/2] END RF__max_depth=10, RF__max_

> **OBSERVATION:** It fits the data with an acceptable score.

- `max_depth=10` : Limiting the tree depth helps prevent overfitting on a dataset of this size. Deeper trees have more capacity to model noise and complex patterns that may not generalize.
- `max_features='sqrt'` : Considering the square root of the number of features at each split gives a good subset of features without being too restrictive. This often gives a good balance.
- `n_estimators=40` : This is the largest value in the grid. Adding trees to a random forest mainly reduces variance rather than causing overfitting, so 40 trees give enough diversity here, and a larger value may still be worth trying at a higher training cost.
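
For a sense of scale, the sketch below works out how many features each split considers under the two `max_features` options in the grid. It assumes 157 input features, the width quoted for dataset v1; the count after one-hot encoding may differ.

```python
import math

n_features = 157  # assumption: width quoted for dataset v1

# 'sqrt' and 'log2' both truncate to an integer, with a minimum of 1.
sqrt_features = max(1, int(math.sqrt(n_features)))
log2_features = max(1, int(math.log2(n_features)))

print(sqrt_features, log2_features)
```

So `'sqrt'` lets each split choose among roughly 12 candidate features, while `'log2'` restricts it to about 7.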

### **4.3.2 Random Forest in Grid Search with Dataset v2**

In [None]:
RF2_grid_search = GridSearchCV(RF_pipline2, RF_param_grid,verbose=3, cv=cv, scoring=scorer)
RF2_grid_search.fit(X2, y2)

print('best score {}'.format(RF2_grid_search.best_score_))
print('best hyperparameters {}'.format(RF2_grid_search.best_params_))

Fitting 2 folds for each of 18 candidates, totalling 36 fits
[CV 1/2] END RF__max_depth=10, RF__max_features=sqrt, RF__n_estimators=20, preprocessor__num__imputer__n_neighbors=3;, score=0.807 total time=  10.0s
[CV 2/2] END RF__max_depth=10, RF__max_features=sqrt, RF__n_estimators=20, preprocessor__num__imputer__n_neighbors=3;, score=0.800 total time=  10.4s
[CV 1/2] END RF__max_depth=10, RF__max_features=sqrt, RF__n_estimators=30, preprocessor__num__imputer__n_neighbors=3;, score=0.819 total time=  10.1s
[CV 2/2] END RF__max_depth=10, RF__max_features=sqrt, RF__n_estimators=30, preprocessor__num__imputer__n_neighbors=3;, score=0.816 total time=  10.7s
[CV 1/2] END RF__max_depth=10, RF__max_features=sqrt, RF__n_estimators=40, preprocessor__num__imputer__n_neighbors=3;, score=0.824 total time=  10.2s
[CV 2/2] END RF__max_depth=10, RF__max_features=sqrt, RF__n_estimators=40, preprocessor__num__imputer__n_neighbors=3;, score=0.832 total time=  11.2s
[CV 1/2] END RF__max_depth=10, RF__max_

> **OBSERVATION:** In RFC with the same hyperparameters, the results on the dataset without the highly correlated features are slightly higher.

## **4.4 Logistic Regression in Grid Search**


In [None]:
LR_param_grid = {
    'preprocessor__num__imputer__n_neighbors': [3],
    'LR__C': np.logspace(-4, 4, 50),
    'LR__penalty' : ['l1', 'l2'],
    'LR__solver': ['liblinear'],
}
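
`np.logspace(-4, 4, 50)` spaces the candidate `C` values evenly on a log scale, so each value is a constant factor larger than the previous one. The sketch below inspects that grid and counts the fits, assuming the 3-fold cross-validation used in the logistic regression searches of this section; it also locates the `C` reported for dataset v1 within the grid.

```python
import numpy as np

c_values = np.logspace(-4, 4, 50)

# Constant ratio between neighbouring values: 10 ** (8 / 49).
ratio = c_values[1] / c_values[0]

# 50 values of C x 2 penalties x 1 solver x 1 n_neighbors value.
n_candidates = len(c_values) * 2
n_fits = n_candidates * 3  # assumes cv=3

# Position of the C value reported as best for dataset v1.
best_index = int(np.argmin(np.abs(c_values - 0.12648552168552957)))

print(c_values[0], c_values[-1], ratio)
print(n_candidates, n_fits, best_index)
```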

### **4.4.1 Logistic Regression in Grid Search with dataset v1**

In [None]:
LR1_grid_search = GridSearchCV(LR_pipline1, LR_param_grid, cv=3, scoring=scorer)
LR1_grid_search.fit(X1, y1)

print('best score {}'.format(LR1_grid_search.best_score_))
print('best hyperparameters {}'.format(LR1_grid_search.best_params_))

best score 0.8578848709056244
best hyperparameters {'LR__C': 0.12648552168552957, 'LR__penalty': 'l1', 'LR__solver': 'liblinear', 'preprocessor__num__imputer__n_neighbors': 3}


> **OBSERVATION:** So far, LR, SVC and MLP (~0.86, ~0.86 and ~0.85 respectively) are better than RFC (~0.83).

- `C=0.12648552168552957` : In scikit-learn, `C` is the inverse of the regularization strength, so this relatively low value means a stronger penalty (regularization) on the weights. This helps prevent overfitting on a dataset of this size.
- `penalty='l1'` : The L1 (Lasso) penalty encourages sparseness in the weight vector by setting many weights exactly to 0. This reduces model complexity.
- `solver='liblinear'` : The liblinear solver is a good, efficient choice for linear models and can handle the L1 penalty.
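
The interaction between `C` and the L1 penalty can be seen on synthetic data. The sketch below is not run on the speed dating dataset; it fits the same kind of model with a small and a large `C` and counts how many weights end up exactly zero. The dataset size and the two `C` values are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data: 3 informative features and 17 pure-noise features.
X, y = make_classification(n_samples=300, n_features=20, n_informative=3,
                           n_redundant=0, random_state=0)
X = StandardScaler().fit_transform(X)


def count_zero_weights(C):
    """Fit an L1-penalized model and count coefficients that are exactly 0."""
    model = LogisticRegression(C=C, penalty="l1", solver="liblinear")
    model.fit(X, y)
    return int(np.sum(model.coef_ == 0))


zeros_strong = count_zero_weights(0.01)   # small C: strong regularization
zeros_weak = count_zero_weights(100.0)    # large C: weak regularization
print(zeros_strong, zeros_weak)
```

A smaller `C` removes more features from the model, which is how the L1 penalty can discount redundant or correlated columns.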

### **4.4.2 Logistic Regression in Grid Search with dataset v2**

In [None]:
LR2_grid_search = GridSearchCV(LR_pipline2, LR_param_grid, cv=3, scoring=scorer)
LR2_grid_search.fit(X2, y2)

print('best score {}'.format(LR2_grid_search.best_score_))
print('best hyperparameters {}'.format(LR2_grid_search.best_params_))

best score 0.8563045136127782
best hyperparameters {'LR__C': 0.040949150623804234, 'LR__penalty': 'l1', 'LR__solver': 'liblinear', 'preprocessor__num__imputer__n_neighbors': 3}


> **OBSERVATION:** Same penalty and solver, a smaller `C` (0.041 vs. 0.126) and nearly the same score (0.856 vs. 0.858). It seems that the L1 penalty handles the correlated features in the first dataset (dataset-v1, where the highly correlated features were not dropped).

_________________


# 5. Apply Pipeline With RandomSearch

_________________

Randomized search, combined with pipelines, samples a fixed number of candidates from the hyperparameter space. This lets us explore **more of the hyperparameter space** with **fewer iterations** than an exhaustive grid, scales to large spaces, stays reproducible when `random_state` is fixed, and jointly tunes the hyperparameters of every step in a multi-step pipeline.


In the following work we used 3 different datasets:

1. A Dataset after dropping nulls features.

2. A Dataset after dropping nulls features + highly correlated Features.

3. A Dataset after dropping nulls features + Oversampling.

## **5.1 MLP Classifier in Random Search**


In [None]:
iters = 10
cv = 5
seed = 42

In [None]:
# define the hyperparameters to optimize using random search
MLP_param_rand = {
    'preprocessor__num__imputer__n_neighbors': [3,5],
    'MLP__hidden_layer_sizes': [(50,),(100,), (50, 100), (100, 50, 10)],
    'MLP__alpha': uniform(0.0001, 0.01),
    'MLP__activation': ['logistic', 'tanh', 'relu'],
    'MLP__learning_rate': ['constant', 'adaptive'],
    'MLP__solver': ['adam', 'sgd']
}
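
One detail worth checking in this search space: `scipy.stats.uniform` takes `(loc, scale)`, not `(low, high)`, so `uniform(0.0001, 0.01)` samples from the interval starting at 0.0001 with width 0.01. The sketch below confirms the bounds; the sample size and seed are arbitrary.

```python
from scipy.stats import uniform

# uniform(loc, scale) covers the interval [loc, loc + scale].
alpha_dist = uniform(0.0001, 0.01)
low, high = alpha_dist.support()

# Draw a reproducible sample to confirm the range empirically.
samples = alpha_dist.rvs(size=1000, random_state=42)
print(low, high)
print(samples.min(), samples.max())
```

For a parameter such as `alpha` that spans orders of magnitude, a log-uniform distribution is a common alternative, though that would be a change to the search space used here.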

### **5.1.1 MLP Classifier in Random Search with Dataset v1**

In [None]:
# Define the RandomizedSearchCV object
MLP1_rand_search = RandomizedSearchCV(MLP_pipline1, MLP_param_rand, scoring = scorer,
                                       n_iter=iters, cv=cv, n_jobs=-1, random_state=42)

# Fit the RandomizedSearchCV object to the data
MLP1_rand_search.fit(X1, y1)

# Print the best hyperparameters and the mean cross-validation score
print("Best score: ", MLP1_rand_search.best_score_)
print("Best hyperparameters: ", MLP1_rand_search.best_params_)

Best score:  0.8655281754413823
Best hyperparameters:  {'MLP__activation': 'relu', 'MLP__alpha': 0.0018052412368729153, 'MLP__hidden_layer_sizes': (50, 100), 'MLP__learning_rate': 'adaptive', 'MLP__solver': 'sgd', 'preprocessor__num__imputer__n_neighbors': 3}


> **OBSERVATION:** the best hyperparameters changed when we used a wider range with random search, and the score rose by about 0.02 (note that the grid search used 2-fold and the random search 5-fold cross-validation, so the two scores are not strictly comparable):

* Best hyperparameters in GridSearch:  {`activation`: 'logistic', `alpha`: 0.001, `hidden_layer_sizes`: (100, 50, 10), `solver`: 'adam', `n_neighbors`: 3}
        Best score:  0.846


* Best hyperparameters in RandomSearch:  {`activation`: 'relu', `alpha`: 0.0018, `hidden_layer_sizes`: (50, 100), `learning_rate`: 'adaptive', `solver`: 'sgd', `n_neighbors`: 3}
        Best score:  0.866

### **5.1.2 MLP Classifier in Random Search with Dataset v2**

In [None]:
# Define the RandomizedSearchCV object
MLP2_rand_search = RandomizedSearchCV(MLP_pipline2, MLP_param_rand, scoring = scorer,n_iter=iters, cv=cv, random_state=42)
# Fit the RandomizedSearchCV object to the data
MLP2_rand_search.fit(X2, y2)

# Print the best hyperparameters and the mean cross-validation score
print("Best score: ", MLP2_rand_search.best_score_)
print("Best hyperparameters: ", MLP2_rand_search.best_params_)



Best score:  0.8612765703496403
Best hyperparameters:  {'MLP__activation': 'relu', 'MLP__alpha': 0.0018052412368729153, 'MLP__hidden_layer_sizes': (50, 100), 'MLP__learning_rate': 'adaptive', 'MLP__solver': 'sgd', 'preprocessor__num__imputer__n_neighbors': 3}


> **OBSERVATION:** Same hyperparameters, with a score lower by about 0.004.

## **5.2 Support Vector Classifier in Random Search**

In [None]:
# Define the hyperparameters to optimize using RandomizedSearchCV
SVC_param_rand = {
    'preprocessor__num__imputer__n_neighbors': [3,5],
    'SVC__C': uniform(0, 10),

     # list with options including 'scale' and 'auto',
     # plus 5 random draws from uniform distribution between 0 and 1

    'SVC__gamma': ['scale', 'auto'] + list(uniform(0, 1).rvs(5)),
    'SVC__kernel': ['poly', 'rbf', 'sigmoid'],
}
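
Note that `uniform(0, 1).rvs(5)` draws the five numeric `gamma` candidates once, when the dictionary is defined, and without a seed, so they change every time the cell is re-run. A minimal sketch of a reproducible variant is shown below; passing `random_state` to `rvs` is a suggestion, not what the original cell does.

```python
from scipy.stats import uniform

seed = 42

# Seeding the draw makes the numeric gamma candidates repeatable.
gamma_draws = list(uniform(0, 1).rvs(5, random_state=seed))
gamma_draws_again = list(uniform(0, 1).rvs(5, random_state=seed))

# 'scale' and 'auto' plus the five fixed draws: seven options in total.
gamma_options = ["scale", "auto"] + gamma_draws
print(gamma_options)
```

With the seed fixed, the reported best `gamma` can be traced back to the same candidate list on every run.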

### **5.2.1 Support Vector Classifier in Random Search with Dataset v1**

In [None]:

SVC1_rand_search = RandomizedSearchCV(SVC_pipline1, SVC_param_rand, scoring = scorer,n_iter=iters, cv=cv, random_state=seed)
SVC1_rand_search.fit(X1, y1)

# Print the best hyperparameters and the mean cross-validation score
print("Best Score: ", SVC1_rand_search.best_score_)
print("Best hyperparameters: ", SVC1_rand_search.best_params_)

Best Score:  0.817514625414858
Best hyperparameters:  {'SVC__C': 0.007787658410143283, 'SVC__gamma': 0.21708373206139808, 'SVC__kernel': 'poly', 'preprocessor__num__imputer__n_neighbors': 3}


> **OBSERVATION:** the random search selected a much smaller `C` and a different kernel. When `C` decreases, the margin widens and the penalty on misclassification is reduced, so the model is more strongly regularized; here the score decreases by about 0.04, as follows:

* Best hyperparameters in Random Search: {`C`: 0.00779, `gamma`: 0.217, `kernel`: 'poly', `n_neighbors`: 3}
        Best Score:  0.818
* Best hyperparameters in Grid Search: {`C`: 0.1, `gamma`: 'auto', `kernel`: 'rbf', `n_neighbors`: 3}
        Best score:  0.856

### **5.2.2 Support Vector Classifier in Random Search with Dataset v2**

In [None]:
# Define the RandomizedSearchCV object
SVC2_rand_search = RandomizedSearchCV(SVC_pipline2, SVC_param_rand, scoring = scorer, n_iter=iters, cv=cv, random_state=seed)
# Fit the RandomizedSearchCV object to the data
SVC2_rand_search.fit(X2, y2)

# Print the best hyperparameters and the mean cross-validation score
print("Best Score: ", SVC2_rand_search.best_score_)
print("Best hyperparameters: ", SVC2_rand_search.best_params_)

Best Score:  0.811147665091616
Best hyperparameters:  {'SVC__C': 3.0424224295953772, 'SVC__gamma': 0.8848486297571929, 'SVC__kernel': 'poly', 'preprocessor__num__imputer__n_neighbors': 5}


> **OBSERVATION:** With dataset-v2, the kernel stays 'poly' while `C`, `gamma` and `n_neighbors` change, and the score on dataset-v1 is slightly better, as shown below:

* Best hyperparameters in Random Search with **dataset-v1**: {`C`: 0.00779, `gamma`: 0.217, `kernel`: 'poly', `n_neighbors`: 3}
        Best Score:  0.818
* Best hyperparameters in Random Search with **dataset-v2**: {`C`: 3.0424, `gamma`: 0.885, `kernel`: 'poly', `n_neighbors`: 5}
        Best Score:  0.811

## **5.3 Random Forest Classifier in Random Search**

In [None]:

RF_param_rand = {
    'preprocessor__num__imputer__n_neighbors': [3,5],
    # randint(10, 100) samples integers from 10 to 99
    'RF__n_estimators': randint(10, 100),
    'RF__max_depth':[None, 5, 10, 15] ,
    'RF__min_samples_split': randint(2, 10),
    # None means every feature is considered at each split
    'RF__max_features': ['sqrt', 'log2', None] ,
}

### **5.3.1 Random Forest Classifier in Random Search with Dataset v1**

In [None]:
RF1_rand_search = RandomizedSearchCV(RF_pipline1, RF_param_rand, n_iter=iters, cv=cv, scoring=scorer,random_state=seed)
RF1_rand_search.fit(X1, y1)

print('best score {}'.format(RF1_rand_search.best_score_))
print('best hyperparameters {}'.format(RF1_rand_search.best_params_))

best score 0.8480620306749547
best hyperparameters {'RF__max_depth': 5, 'RF__max_features': None, 'RF__min_samples_split': 7, 'RF__n_estimators': 71, 'preprocessor__num__imputer__n_neighbors': 3}



> **OBSERVATION:** with a wider range of hyperparameters in random search, the best hyperparameters changed and the score improved, as follows:


* Best hyperparameters in Random Search: {`max_depth`: 5, `max_features`: None, `min_samples_split`: 7, `n_estimators`: 71, `n_neighbors`: 3}
        best score 0.848

* Best hyperparameters in Grid Search: {`max_depth`: 10, `max_features`: 'sqrt', `n_estimators`: 40, `n_neighbors`: 3}
        best score 0.831

### **5.3.2 Random Forest Classifier in Random Search with Dataset v2**

In [None]:
RF2_rand_search = RandomizedSearchCV(RF_pipline2, RF_param_rand,verbose=3, n_iter=iters, cv=cv, scoring=scorer, random_state=seed)
RF2_rand_search.fit(X2, y2)

print('best score {}'.format(RF2_rand_search.best_score_))
print('best hyperparameters {}'.format(RF2_rand_search.best_params_))

Fitting 5 folds for each of 10 candidates, totalling 50 fits
[CV 1/5] END RF__max_depth=10, RF__max_features=sqrt, RF__min_samples_split=8, RF__n_estimators=81, preprocessor__num__imputer__n_neighbors=3;, score=0.854 total time=  17.4s
[CV 2/5] END RF__max_depth=10, RF__max_features=sqrt, RF__min_samples_split=8, RF__n_estimators=81, preprocessor__num__imputer__n_neighbors=3;, score=0.853 total time=  17.3s
[CV 3/5] END RF__max_depth=10, RF__max_features=sqrt, RF__min_samples_split=8, RF__n_estimators=81, preprocessor__num__imputer__n_neighbors=3;, score=0.836 total time=  17.9s
[CV 4/5] END RF__max_depth=10, RF__max_features=sqrt, RF__min_samples_split=8, RF__n_estimators=81, preprocessor__num__imputer__n_neighbors=3;, score=0.840 total time=  17.2s
[CV 5/5] END RF__max_depth=10, RF__max_features=sqrt, RF__min_samples_split=8, RF__n_estimators=81, preprocessor__num__imputer__n_neighbors=3;, score=0.852 total time=  15.7s
[CV 1/5] END RF__max_depth=None, RF__max_features=None, RF__min_

> **OBSERVATION:** The hyperparameters changed, with less depth and more estimators, but the result is the same as the result on dataset-v1, so I prefer the **simpler model**.

## **5.4 Logistic Regression Classifier in Random Search**

In [None]:
LR_param_rand = {
    'preprocessor__num__imputer__n_neighbors': [3,5],
    'LR__C': np.logspace(-4, 4, 50),
    'LR__penalty' : ['l1', 'l2'],
    'LR__solver': [ 'liblinear'],
}


### **5.4.1 Logistic Regression Classifier in Random Search with Dataset v1**

In [None]:
LR1_rand_search = RandomizedSearchCV(LR_pipline1, LR_param_rand, cv=cv, n_iter=iters, scoring=scorer, random_state=seed)
LR1_rand_search.fit(X1, y1)

print('best score {}'.format(LR1_rand_search.best_score_))
print('best hyperparameters {}'.format(LR1_rand_search.best_params_))


best score 0.8609507878377448
best hyperparameters {'preprocessor__num__imputer__n_neighbors': 5, 'LR__solver': 'liblinear', 'LR__penalty': 'l1', 'LR__C': 0.05963623316594643}


### **5.4.2 Logistic Regression Classifier in Random Search with Dataset v2**

In [None]:
LR2_rand_search = RandomizedSearchCV(LR_pipline2, LR_param_rand, cv=cv, n_iter=iters, scoring=scorer, random_state=seed)
LR2_rand_search.fit(X2, y2)

print('best score {}'.format(LR2_rand_search.best_score_))
print('best hyperparameters {}'.format(LR2_rand_search.best_params_))

best score 0.8599604436898842
best hyperparameters {'preprocessor__num__imputer__n_neighbors': 5, 'LR__solver': 'liblinear', 'LR__penalty': 'l1', 'LR__C': 0.05963623316594643}


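> **OBSERVATION:** Random search for logistic regression selects the same penalty (`l1`) and solver as grid search, with a `C` of the same order of magnitude (0.060 vs. 0.126 on dataset v1) and `n_neighbors=5` instead of 3.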

## **5.5 XGBoost Classifier in Random Search**

In [None]:
# Define the hyperparameters to be tuned
XGB_param_rand = {
    'preprocessor__num__imputer__n_neighbors': [3,5],
    'XGB__n_estimators': [100, 500, 1000],
    'XGB__max_depth': [3, 5, 7],
    'XGB__learning_rate': [0.01, 0.1, 1],
    'XGB__gamma': [0, 1, 5],
    'XGB__subsample': [0.5, 0.8, 1],
    'XGB__colsample_bytree': [0.5, 0.8, 1],
    'XGB__reg_alpha': [0, 0.1, 0.5],
    'XGB__reg_lambda': [0, 0.1, 0.5]
}
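
This space is a good illustration of why random search is used here instead of an exhaustive grid. The sketch below counts the combinations in the dictionary above and compares them with the number of candidates actually sampled, assuming `iters = 10` as set at the start of this section.

```python
from math import prod

# Same values as the XGBoost search space in this section.
xgb_space = {
    "preprocessor__num__imputer__n_neighbors": [3, 5],
    "XGB__n_estimators": [100, 500, 1000],
    "XGB__max_depth": [3, 5, 7],
    "XGB__learning_rate": [0.01, 0.1, 1],
    "XGB__gamma": [0, 1, 5],
    "XGB__subsample": [0.5, 0.8, 1],
    "XGB__colsample_bytree": [0.5, 0.8, 1],
    "XGB__reg_alpha": [0, 0.1, 0.5],
    "XGB__reg_lambda": [0, 0.1, 0.5],
}

# Size of the full grid versus the number of sampled candidates.
total_combinations = prod(len(values) for values in xgb_space.values())
n_iter = 10
fraction_sampled = n_iter / total_combinations
print(total_combinations, fraction_sampled)
```

Ten candidates cover well under 0.1% of the full grid, so the reported best score is a sample from the space rather than its optimum.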

### **5.5.1 XGBoost Classifier in Random Search with dataset v1**

In [None]:
XGB1_rand_search = RandomizedSearchCV(XGB_pipline1, XGB_param_rand, verbose=3,cv=cv,  n_iter=iters, scoring=scorer, random_state=seed)
XGB1_rand_search.fit(X1, y1)

print('best score {}'.format(XGB1_rand_search.best_score_))
print('best hyperparameters {}'.format(XGB1_rand_search.best_params_))

Fitting 5 folds for each of 10 candidates, totalling 50 fits
[CV 1/5] END XGB__colsample_bytree=0.8, XGB__gamma=1, XGB__learning_rate=1, XGB__max_depth=7, XGB__n_estimators=1000, XGB__reg_alpha=0.1, XGB__reg_lambda=0.5, XGB__subsample=1, preprocessor__num__imputer__n_neighbors=3;, score=0.836 total time=  49.8s
[CV 2/5] END XGB__colsample_bytree=0.8, XGB__gamma=1, XGB__learning_rate=1, XGB__max_depth=7, XGB__n_estimators=1000, XGB__reg_alpha=0.1, XGB__reg_lambda=0.5, XGB__subsample=1, preprocessor__num__imputer__n_neighbors=3;, score=0.839 total time=  50.1s
[CV 3/5] END XGB__colsample_bytree=0.8, XGB__gamma=1, XGB__learning_rate=1, XGB__max_depth=7, XGB__n_estimators=1000, XGB__reg_alpha=0.1, XGB__reg_lambda=0.5, XGB__subsample=1, preprocessor__num__imputer__n_neighbors=3;, score=0.829 total time=  50.0s
[CV 4/5] END XGB__colsample_bytree=0.8, XGB__gamma=1, XGB__learning_rate=1, XGB__max_depth=7, XGB__n_estimators=1000, XGB__reg_alpha=0.1, XGB__reg_lambda=0.5, XGB__subsample=1, prepro

> **OBSERVATION:** So far, XGBoost with dataset version 1 has the highest score (0.88064).

### **5.5.2 XGBoost Classifier in Random Search with dataset v2**

In [None]:
XGB2_rand_search = RandomizedSearchCV(XGB_pipline2, XGB_param_rand, verbose=3, cv=cv,  n_iter=iters, scoring=scorer, random_state=seed)
XGB2_rand_search.fit(X2, y2)

print('best score {}'.format(XGB2_rand_search.best_score_))
print('best hyperparameters {}'.format(XGB2_rand_search.best_params_))

Fitting 5 folds for each of 10 candidates, totalling 50 fits
[CV 1/5] END XGB__colsample_bytree=0.8, XGB__gamma=1, XGB__learning_rate=1, XGB__max_depth=7, XGB__n_estimators=1000, XGB__reg_alpha=0.1, XGB__reg_lambda=0.5, XGB__subsample=1, preprocessor__num__imputer__n_neighbors=3;, score=0.832 total time=  50.5s
[CV 2/5] END XGB__colsample_bytree=0.8, XGB__gamma=1, XGB__learning_rate=1, XGB__max_depth=7, XGB__n_estimators=1000, XGB__reg_alpha=0.1, XGB__reg_lambda=0.5, XGB__subsample=1, preprocessor__num__imputer__n_neighbors=3;, score=0.824 total time=  51.0s
[CV 3/5] END XGB__colsample_bytree=0.8, XGB__gamma=1, XGB__learning_rate=1, XGB__max_depth=7, XGB__n_estimators=1000, XGB__reg_alpha=0.1, XGB__reg_lambda=0.5, XGB__subsample=1, preprocessor__num__imputer__n_neighbors=3;, score=0.840 total time=  50.7s
[CV 4/5] END XGB__colsample_bytree=0.8, XGB__gamma=1, XGB__learning_rate=1, XGB__max_depth=7, XGB__n_estimators=1000, XGB__reg_alpha=0.1, XGB__reg_lambda=0.5, XGB__subsample=1, prepro

> **OBSERVATION:** XGBoost has the highest score among all 5 algorithms (AUC = 0.88). It has built-in regularization that helps prevent overfitting and improves generalization, controlled here by two hyperparameters: `XGB__reg_lambda` (L2 penalty) and `XGB__reg_alpha` (L1 penalty).

### **5.5.3 XGBoost Classifier in Random Search with dataset v3**

In [None]:
XGB3_rand_search = RandomizedSearchCV(XGB_pipline3, XGB_param_rand, verbose=3, cv=3,  n_iter=1, scoring=scorer, random_state=seed)
XGB3_rand_search.fit(X3, y3)

print('best score {}'.format(XGB3_rand_search.best_score_))
print('best hyperparameters {}'.format(XGB3_rand_search.best_params_))

Fitting 3 folds for each of 1 candidates, totalling 3 fits
[CV 1/3] END XGB__colsample_bytree=0.8, XGB__gamma=1, XGB__learning_rate=1, XGB__max_depth=7, XGB__n_estimators=1000, XGB__reg_alpha=0.1, XGB__reg_lambda=0.5, XGB__subsample=1, preprocessor__num__imputer__n_neighbors=3;, score=0.974 total time= 1.3min
[CV 2/3] END XGB__colsample_bytree=0.8, XGB__gamma=1, XGB__learning_rate=1, XGB__max_depth=7, XGB__n_estimators=1000, XGB__reg_alpha=0.1, XGB__reg_lambda=0.5, XGB__subsample=1, preprocessor__num__imputer__n_neighbors=3;, score=0.987 total time= 1.3min
[CV 3/3] END XGB__colsample_bytree=0.8, XGB__gamma=1, XGB__learning_rate=1, XGB__max_depth=7, XGB__n_estimators=1000, XGB__reg_alpha=0.1, XGB__reg_lambda=0.5, XGB__subsample=1, preprocessor__num__imputer__n_neighbors=3;, score=0.981 total time= 1.3min
best score 0.9808573854733509
best hyperparameters {'preprocessor__num__imputer__n_neighbors': 3, 'XGB__subsample': 1, 'XGB__reg_lambda': 0.5, 'XGB__reg_alpha': 0.1, 'XGB__n_estimators'

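> **OBSERVATION:** The very high score on the oversampled dataset (0.98) should be read with caution. While oversampling can improve the performance of the model on the minority class, it can also lead to overfitting if not done correctly: if the data are oversampled before the cross-validation split, copies or close neighbours of the same minority samples can appear in both the training and validation folds and inflate the score. The model may also become biased towards the minority class and perform worse on the majority class.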
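
The risk of oversampling before cross-validation can be illustrated with a small, library-free sketch. It assumes plain random oversampling (duplicating minority rows) applied to the whole dataset before splitting. Whether dataset v3 was built this way is an assumption, and the toy rows and fold logic are made up for illustration.

```python
import random

rng = random.Random(42)

# Toy dataset: 80 majority rows and 20 minority rows, each with a unique id.
majority = [("maj", i) for i in range(80)]
minority = [("min", i) for i in range(20)]

# Random oversampling BEFORE splitting: duplicate minority rows until balanced.
extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
oversampled = majority + minority + extra
rng.shuffle(oversampled)

# Simple 3-fold split of the already oversampled data.
k = 3
folds = [oversampled[i::k] for i in range(k)]

leak_report = []
for i in range(k):
    validation = folds[i]
    training = {row for j in range(k) if j != i for row in folds[j]}
    # A validation row "leaks" if an identical row is also in the training folds.
    leaked = sum(1 for row in validation if row in training)
    leak_report.append((leaked, len(validation)))

for fold_id, (leaked, size) in enumerate(leak_report, start=1):
    print(f"fold {fold_id}: {leaked} of {size} validation rows also appear in training")
```

In this setup only duplicated minority rows can leak, which is why validation scores on oversampled data tend to look better than scores on untouched data. Oversampling only the training part of each fold avoids the overlap.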

_________________


# **6. Apply Pipeline With Bayesian Search**

_________________


Bayesian search builds a surrogate model of the validation score from the hyperparameter settings evaluated so far and uses it to choose the next setting to try. It therefore performs a smarter search of the space than grid or random search, focusing on the areas it expects to lead to better-performing models. Here it is run with `BayesSearchCV` from scikit-optimize. I will apply Bayesian search to 3 algorithms:

1. Support Vector Classifier.
2. Logistic Regression Classifier.
3. XGBoost Classifier.


In the following work I used 2 different datasets:

1. Dataset v1: the dataset after dropping the null features.

2. Dataset v3: the dataset after dropping the null features + **Oversampling**. For this dataset, I will take the hyperparameters directly from the results of dataset version 1.

In [None]:
from skopt.space import Real, Categorical, Integer
from skopt import BayesSearchCV

cv = 3
iters = 5
seed = 42

## **6.1 Support Vector Classifier in Bayesian Search**


### **6.1.1 Support Vector Classifier in Bayesian Search with Dataset v1**


In [None]:
# define ranges for bayes search
SVC1_param_bayes ={
                'preprocessor__num__imputer__n_neighbors': [3,5,7],
                'SVC__C': Real(1e-6, 1e+6, prior='log-uniform'),
                'SVC__gamma': Real(1e-6, 1e+1, prior='log-uniform'),
                'SVC__degree': Integer(1,8),
                'SVC__kernel': Categorical([ 'poly', 'rbf']),}
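
The `prior='log-uniform'` setting matters for ranges that span many orders of magnitude, such as `SVC__C` from 1e-6 to 1e+6. As a minimal sketch using only the standard library (the helper `sample_log_uniform` is hypothetical and not part of scikit-optimize), sampling uniformly in log space spreads candidates evenly across the decades, whereas plain uniform sampling almost never proposes a value below 1:

```python
import math
import random


def sample_log_uniform(low, high, rng):
    """Draw one value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(42)
low, high = 1e-6, 1e6  # same bounds as SVC__C
n_samples = 10_000

log_samples = [sample_log_uniform(low, high, rng) for _ in range(n_samples)]
uniform_samples = [rng.uniform(low, high) for _ in range(n_samples)]

# Share of samples below 1.0, the geometric midpoint of the range.
frac_log_below_1 = sum(x < 1.0 for x in log_samples) / n_samples
frac_uniform_below_1 = sum(x < 1.0 for x in uniform_samples) / n_samples

print(f"log-uniform: {frac_log_below_1:.1%} of samples below 1")
print(f"uniform:     {frac_uniform_below_1:.1%} of samples below 1")
```

Roughly half of the log-uniform draws fall below 1, so small and large values of `C` get a comparable share of the search budget.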

In [None]:
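# Note: no `scoring` is passed here, so BayesSearchCV falls back to the estimator's
# default score (mean accuracy for scikit-learn classifiers), not `scorer`.
SVC1_bayes_search = BayesSearchCV(SVC_pipline1, SVC1_param_bayes, n_iter=iters,
                                  random_state=seed, verbose=1, cv=cv)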
SVC1_bayes_search.fit(X1, y1)

print('best score {}'.format(SVC1_bayes_search.best_score_))
print('best hyperparameters {}'.format(SVC1_bayes_search.best_params_))


Fitting 3 folds for each of 1 candidates, totalling 3 fits
Fitting 3 folds for each of 1 candidates, totalling 3 fits
Fitting 3 folds for each of 1 candidates, totalling 3 fits
Fitting 3 folds for each of 1 candidates, totalling 3 fits
Fitting 3 folds for each of 1 candidates, totalling 3 fits
best score 0.8470130508499336
best hyperparameters OrderedDict([('SVC__C', 5607.275056505338), ('SVC__degree', 2), ('SVC__gamma', 0.015357818918361629), ('SVC__kernel', 'rbf'), ('preprocessor__num__imputer__n_neighbors', 5)])


> **OBSERVATION:** Bayesian optimization is generally considered more efficient than random search for hyperparameter tuning: it balances exploration and exploitation and samples adaptively in promising regions of the hyperparameter space. These factors may have contributed to the better SVC results compared to random search, although with only 5 iterations the search has little history to learn from. Note also that `BayesSearchCV` is called here without a `scoring` argument, so the reported score is the estimator's default metric (accuracy for a classifier) and, if `scorer` is AUC, is not directly comparable with the random search scores.
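
Scores from different searches are only comparable when they use the same metric, and the `BayesSearchCV` call above is made without a `scoring` argument. The sketch below is a minimal, library-free illustration of how accuracy and ROC AUC can disagree on imbalanced data. The labels and scores are toy values made up for the example, and accuracy is computed by thresholding scores at 0.5, which is a simplification of what a classifier's `predict` does.

```python
def accuracy(y_true, y_score, threshold=0.5):
    """Fraction of correct labels after thresholding the scores."""
    preds = [1 if s >= threshold else 0 for s in y_score]
    return sum(p == t for p, t in zip(preds, y_true)) / len(y_true)


def roc_auc(y_true, y_score):
    """Probability that a random positive is scored above a random negative
    (ties count as one half)."""
    pos = [s for s, t in zip(y_score, y_true) if t == 1]
    neg = [s for s, t in zip(y_score, y_true) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy imbalanced labels: 8 non-matches, 2 matches.
y_true = [0] * 8 + [1] * 2

# Model A ranks both matches above every non-match, but never crosses 0.5.
ranked = [0.05, 0.10, 0.10, 0.15, 0.20, 0.20, 0.25, 0.30, 0.40, 0.45]
# Model B gives every pair the same score.
constant = [0.10] * 10

for name, scores in [("ranked", ranked), ("constant", constant)]:
    print(name, "accuracy:", accuracy(y_true, scores), "AUC:", roc_auc(y_true, scores))
```

Both toy models reach the same accuracy because they predict "no match" everywhere, yet their AUC values differ, so an accuracy of 0.85 and an AUC of 0.85 do not describe the same level of performance.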

### **6.1.2 Support Vector Classifier in Bayesian Search with Dataset v3**


In [None]:
# define ranges for bayes search
SVC3_param_bayes ={
                'preprocessor__num__imputer__n_neighbors': [5],
                'SVC__C': [3.317697704417197],
                'SVC__gamma': [0.002987218052601489],
                'SVC__degree': [7],
                'SVC__kernel': ['rbf'],}

In [None]:
SVC3_bayes_search = BayesSearchCV(SVC_pipline3, SVC3_param_bayes, n_iter=1,
                                  random_state=seed,cv=cv,verbose=3,)
SVC3_bayes_search.fit(X3, y3)

print('best score {}'.format(SVC3_bayes_search.best_score_))
print('best hyperparameters {}'.format(SVC3_bayes_search.best_params_))

Fitting 3 folds for each of 1 candidates, totalling 3 fits
[CV 1/3] END SVC__C=3.317697704417197, SVC__degree=7, SVC__gamma=0.002987218052601489, SVC__kernel=rbf, preprocessor__num__imputer__n_neighbors=5;, score=0.881 total time= 2.2min
[CV 2/3] END SVC__C=3.317697704417197, SVC__degree=7, SVC__gamma=0.002987218052601489, SVC__kernel=rbf, preprocessor__num__imputer__n_neighbors=5;, score=0.868 total time= 2.2min
[CV 3/3] END SVC__C=3.317697704417197, SVC__degree=7, SVC__gamma=0.002987218052601489, SVC__kernel=rbf, preprocessor__num__imputer__n_neighbors=5;, score=0.884 total time= 2.2min
best score 0.8773629520050154
best hyperparameters OrderedDict([('SVC__C', 3.317697704417197), ('SVC__degree', 7), ('SVC__gamma', 0.002987218052601489), ('SVC__kernel', 'rbf'), ('preprocessor__num__imputer__n_neighbors', 5)])


> **OBSERVATION:** With Bayesian search on the oversampled data, using the hyperparameters taken from dataset version 1, the SVC fits the data and performs well (score 0.877).

## **6.2 Logistic Regression Classifier in Bayesian Search**


### **6.2.1 Logistic Regression Classifier in Bayesian Search with Dataset v1**


In [None]:
# search space for Bayesian search (named separately so the search object does not overwrite it)
LR1_param_bayes = {
                'preprocessor__num__imputer__n_neighbors': [3,5,7],
                'LR__C': Real(0.1, 10, prior='log-uniform'),
                'LR__penalty': ['l1', 'l2'],
                'LR__solver': ["liblinear"],}

In [None]:
LR1_bayes_search = BayesSearchCV(LR_pipline1, LR1_param_bayes, n_iter=iters,
                                  random_state=seed,cv=cv)
LR1_bayes_search.fit(X1, y1)

print('best score {}'.format(LR1_bayes_search.best_score_))
print('best hyperparameters {}'.format(LR1_bayes_search.best_params_))

best score 0.8520895367193186
best hyperparameters OrderedDict([('LR__C', 0.661009829541915), ('LR__penalty', 'l2'), ('LR__solver', 'liblinear'), ('preprocessor__num__imputer__n_neighbors', 5)])


> **OBSERVATION:** Here random search gives a slightly higher score than Bayesian search for logistic regression (0.861 vs 0.852).

### **6.2.2 Logistic Regression Classifier in Bayesian Search with Dataset v3**


In [None]:
# fixed hyperparameters for the oversampled dataset (one value per parameter)
LR3_param_bayes = {
                'preprocessor__num__imputer__n_neighbors': [5],
                'LR__C': [0.10168587136004645],
                'LR__penalty': ['l2'],
                'LR__solver': ["liblinear"],}

In [None]:
LR3_bayes_search = BayesSearchCV(LR_pipline3, LR3_param_bayes, n_iter=1,
                                  random_state=seed,cv=cv)
LR3_bayes_search.fit(X3, y3)

print('best score {}'.format(LR3_bayes_search.best_score_))
print('best hyperparameters {}'.format(LR3_bayes_search.best_params_))

best score 0.8263568513466298
best hyperparameters OrderedDict([('LR__C', 0.10168587136004645), ('LR__penalty', 'l2'), ('LR__solver', 'liblinear'), ('preprocessor__num__imputer__n_neighbors', 5)])


> **OBSERVATION:** With the same hyperparameters, logistic regression scores lower on the oversampled dataset (0.826) than on version 1 of the dataset (0.852).

## **6.3 XGBoost Classifier in Bayesian Search**


### **6.3.1 XGBoost Classifier in Bayesian Search with dataset v1**


In [None]:
XGB1_param_bayes ={
            'preprocessor__num__imputer__n_neighbors': [3,5,7],
            'XGB__learning_rate': Real(0.01, 0.3, prior='log-uniform'),
            'XGB__n_estimators': Integer(50, 200),
            'XGB__max_depth': Integer(2, 10),
            'XGB__min_child_weight': Integer(1, 10),
            'XGB__subsample': Real(0.5, 1.0, prior='uniform'),
            'XGB__gamma': Real(0, 1, prior='uniform'),
            'XGB__colsample_bytree': Real(0.5, 1.0, prior='uniform'),
            'XGB__reg_alpha': Real(1e-5, 1e-1, prior='log-uniform'),
            'XGB__reg_lambda': Real(1e-5, 1e-1, prior='log-uniform')
            }

In [None]:
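# Note: this cell runs a single iteration (n_iter=1) and fits on X3, y3
# (the oversampled data), so the score below is not a dataset v1 result.
XGB1_bayes_search = BayesSearchCV(XGB_pipline1, XGB1_param_bayes, n_iter=1,
                                  random_state=seed,cv=cv)
XGB1_bayes_search.fit(X3, y3)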

print('best score {}'.format(XGB1_bayes_search.best_score_))
print('best hyperparameters {}'.format(XGB1_bayes_search.best_params_))

best score 0.9338551849401456
best hyperparameters OrderedDict([('XGB__colsample_bytree', 0.705051979426657), ('XGB__gamma', 0.7277257431773251), ('XGB__learning_rate', 0.2387586688716479), ('XGB__max_depth', 5), ('XGB__min_child_weight', 7), ('XGB__n_estimators', 112), ('XGB__reg_alpha', 0.0002533525848634837), ('XGB__reg_lambda', 0.009078559343576646), ('XGB__subsample', 0.6522316555182531), ('preprocessor__num__imputer__n_neighbors', 5)])


> **OBSERVATION:** Here random search gives a higher score than Bayesian search for XGBoost (0.981 vs 0.934 on the oversampled data).

### **6.3.2 XGBoost Classifier in Bayesian Search with dataset v3**


In [None]:
XGB3_param_bayes ={
            'preprocessor__num__imputer__n_neighbors': [5],
            'XGB__learning_rate': [0.2387586688716479],
            'XGB__n_estimators': [112],
            'XGB__max_depth': [5],
            'XGB__min_child_weight': [7],
            'XGB__subsample': [0.6522316555182531],
            'XGB__gamma': [0.7277257431773251],
            'XGB__colsample_bytree': [0.705051979426657],
            'XGB__reg_alpha': [0.0002533525848634837],
            'XGB__reg_lambda': [0.009078559343576646],
            }

In [None]:
XGB3_bayes_search = BayesSearchCV(XGB_pipline3, XGB3_param_bayes, n_iter=iters,
                                  random_state=seed,cv=cv)
XGB3_bayes_search.fit(X3, y3)

print('best score {}'.format(XGB3_bayes_search.best_score_))
print('best hyperparameters {}'.format(XGB3_bayes_search.best_params_))

best score 0.9338551849401456
best hyperparameters OrderedDict([('XGB__colsample_bytree', 0.705051979426657), ('XGB__gamma', 0.7277257431773251), ('XGB__learning_rate', 0.2387586688716479), ('XGB__max_depth', 5), ('XGB__min_child_weight', 7), ('XGB__n_estimators', 112), ('XGB__reg_alpha', 0.0002533525848634837), ('XGB__reg_lambda', 0.009078559343576646), ('XGB__subsample', 0.6522316555182531), ('preprocessor__num__imputer__n_neighbors', 5)])


________________
<h1 align="center"><span style='font-family:Georgia'> Conclusion</span></h1>

_________________

> **As shown in the results above** (AUC based on validation data using grid, random, and Bayesian search):

1. Performance differs between datasets v1 and v3.
2. Results differ between the three search techniques.
3. All the models have **higher generalization error** when using the oversampled dataset (dataset version 3).
4. I will choose **XGBoost** as the best model. This is a reasonable decision given that it achieved the highest AUC and was able to generalize well on the test set. Additionally, XGBoost has several advantages over other machine learning algorithms, such as built-in regularization techniques, handling of missing values, speed, and flexibility.

________________
<h1 align="center"><span style='font-family:Georgia'> Submission File </span></h1>

_________________

In [None]:
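# Build the submission: one row per test pair with its predicted match probability
submission = pd.DataFrame()
submission['id'] = test_data['id']
# Column 1 of predict_proba holds the probability of the positive class (match = 1)
submission['match'] = XGB3_rand_search.predict_proba(test_data)[:,1]
submission.to_csv('XGB_oversampled_dataset.csv', index=False)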