# Final Project Phase 3 - Home Credit Default Risk

Spring 2024

**Team Members:**
- Glen Colletti
- Alex Bordanca
- Paul Miller






Recency, Frequency, Monetary (RFM)

- **Recency:** how recently a customer has made a purchase, measured as the time elapsed since the customer's last purchase.
- **Frequency:** how often a customer purchases, measured as the number of transactions the customer has made.
- **Monetary:** how much money a customer has spent on purchases, measured as the sum of all transactions.

Typically each measure is scaled from 1 to 5, with 5 marking the best customers (recent purchase, frequent purchases, high spending).
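
As a rough illustration of the 1-to-5 scaling, the sketch below scores a small made-up customer table by quintile. The column names, the toy values, and the `quintile_score` helper are assumptions for illustration only; real RFM schemes differ in how they bin.

```python
import pandas as pd

# Hypothetical toy data: one row per customer.
customers = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 5],
    "days_since_last_purchase": [3, 40, 10, 200, 90],
    "n_purchases": [12, 2, 8, 1, 4],
    "total_spent": [900.0, 50.0, 400.0, 20.0, 150.0],
})

def quintile_score(series, higher_is_better=True):
    """Map a numeric column onto scores 1..5 by quintile of its rank."""
    # Ranking first avoids duplicate bin edges when values are tied.
    ranks = series.rank(method="first", ascending=higher_is_better)
    return pd.qcut(ranks, 5, labels=[1, 2, 3, 4, 5]).astype(int)

# Recency: fewer days since the last purchase is better, so invert the ranking.
customers["R"] = quintile_score(customers["days_since_last_purchase"], higher_is_better=False)
customers["F"] = quintile_score(customers["n_purchases"])
customers["M"] = quintile_score(customers["total_spent"])

print(customers[["customer_id", "R", "F", "M"]])
```

In this toy table customer 1 scores 5 on all three measures and customer 4 scores 1 on all three.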

## How to translate to loans?

Monetary seems to translate well to loans. An applicant who borrows large amounts of money would, all else being equal, be a good customer. Then again, if an applicant has just borrowed a lot of money, it might not be logical to lend them even more. Maybe the other features will account for how well the applicant is managing the previous high-dollar loan.

Frequency might do OK with loans. An applicant who borrows frequently would, all else being equal, be a good customer.

Recency might not do as well with loans. Is it good that a recent customer is applying for another loan? This might depend quite a bit on what the loans are for. A business might take out loans frequently if its business model involves borrowing money to start projects that customers will pay for (home construction comes to mind).

## Dataset issue

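The data is organized by loan ID, not customer ID. Each row of `application_train.csv` is one loan application (`SK_ID_CURR`), and `previous_application.csv` links each earlier application (`SK_ID_PREV`) back to it, so the RFM measures have to be aggregated per `SK_ID_CURR`.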
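
A quick way to see this is to check which ID is unique per row. The sketch below uses a made-up miniature of `previous_application.csv` (the values are hypothetical; only the two column names come from the dataset).

```python
import pandas as pd

# Hypothetical miniature of previous_application.csv: one row per previous loan.
prev = pd.DataFrame({
    "SK_ID_PREV": [10, 11, 12, 13],
    "SK_ID_CURR": [1, 1, 2, 1],
})

# SK_ID_PREV identifies a row; SK_ID_CURR repeats across rows.
prev_is_row_key = prev["SK_ID_PREV"].is_unique
curr_is_row_key = prev["SK_ID_CURR"].is_unique

# Number of previous applications attached to each current application.
rows_per_curr = prev.groupby("SK_ID_CURR").size()

print(prev_is_row_key, curr_is_row_key)
print(rows_per_curr.to_dict())
```

Because `SK_ID_CURR` repeats, any per-applicant feature needs a `group by` (or `groupby`) before it can be joined onto the application table.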

# Load Dependencies and data

In [1]:
from google.colab import files
files.upload()

Saving kaggle.json to kaggle.json


{'kaggle.json': b'{"username":"alexbordanca","key":"<redacted>"}'}

In [2]:
! mkdir -p ~/.kaggle
! cp kaggle.json ~/.kaggle/
! chmod 600 ~/.kaggle/kaggle.json
! kaggle datasets list

ref                                                           title                                            size  lastUpdated          downloadCount  voteCount  usabilityRating  
------------------------------------------------------------  ----------------------------------------------  -----  -------------------  -------------  ---------  ---------------  
sudarshan24byte/online-food-dataset                           Online Food Dataset                               3KB  2024-03-02 18:50:30          26549        522  0.9411765        
nbroad/gemma-rewrite-nbroad                                   gemma-rewrite-nbroad                              8MB  2024-03-03 04:52:39           1661        104  1.0              
lovishbansal123/adult-census-income                           Adult Census Income                             450KB  2024-04-12 08:18:30            476         23  1.0              
sukhmandeepsinghbrar/most-subscribed-youtube-channel          Most Subscribed YouTube Chan

In [3]:
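DATA_DIR = "/HCDR/DATA_DIR"   # same level as the course repo, in the data directory
# Create the target directory itself ($DATA_DIR expands the Python variable;
# a bare DATA_DIR would create a literal folder named "DATA_DIR")
!mkdir -p $DATA_DIR

! kaggle competitions download home-credit-default-risk -p $DATA_DIR
!ls -l $DATA_DIR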

Downloading home-credit-default-risk.zip to /HCDR/DATA_DIR
100% 685M/688M [00:23<00:00, 28.3MB/s]
100% 688M/688M [00:24<00:00, 30.0MB/s]
total 704708
-rw-r--r-- 1 root root 721616255 Dec 11  2019 home-credit-default-risk.zip


In [4]:
import zipfile

unzippingReq = True  # set to False once the archive has been extracted
if unzippingReq:
    # extractall(path) extracts every member of the archive into `path`
    # (the current working directory when no path is given)
    with zipfile.ZipFile(f'{DATA_DIR}/home-credit-default-risk.zip', 'r') as zip_ref:
        zip_ref.extractall(DATA_DIR)

In [5]:
!ls -l $DATA_DIR

total 3326092
-rw-r--r-- 1 root root  26567651 Apr 16 00:46 application_test.csv
-rw-r--r-- 1 root root 166133370 Apr 16 00:46 application_train.csv
-rw-r--r-- 1 root root 375592889 Apr 16 00:46 bureau_balance.csv
-rw-r--r-- 1 root root 170016717 Apr 16 00:46 bureau.csv
-rw-r--r-- 1 root root 424582605 Apr 16 00:46 credit_card_balance.csv
-rw-r--r-- 1 root root     37383 Apr 16 00:46 HomeCredit_columns_description.csv
-rw-r--r-- 1 root root 721616255 Dec 11  2019 home-credit-default-risk.zip
-rw-r--r-- 1 root root 723118349 Apr 16 00:46 installments_payments.csv
-rw-r--r-- 1 root root 392703158 Apr 16 00:46 POS_CASH_balance.csv
-rw-r--r-- 1 root root 404973293 Apr 16 00:46 previous_application.csv
-rw-r--r-- 1 root root    536202 Apr 16 00:46 sample_submission.csv


In [6]:
from __future__ import print_function

import pandas as pd
import numpy as np

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from xgboost import XGBClassifier
np.random.seed(0)


train_data = pd.read_csv(f'{DATA_DIR}/application_train.csv')  # data we have the target class for
test_data = pd.read_csv(f'{DATA_DIR}/application_test.csv')  # data we need to predict the target class for (competition test set)

col_names = train_data.columns.values.tolist()
col_names.sort()
print(col_names)

['AMT_ANNUITY', 'AMT_CREDIT', 'AMT_GOODS_PRICE', 'AMT_INCOME_TOTAL', 'AMT_REQ_CREDIT_BUREAU_DAY', 'AMT_REQ_CREDIT_BUREAU_HOUR', 'AMT_REQ_CREDIT_BUREAU_MON', 'AMT_REQ_CREDIT_BUREAU_QRT', 'AMT_REQ_CREDIT_BUREAU_WEEK', 'AMT_REQ_CREDIT_BUREAU_YEAR', 'APARTMENTS_AVG', 'APARTMENTS_MEDI', 'APARTMENTS_MODE', 'BASEMENTAREA_AVG', 'BASEMENTAREA_MEDI', 'BASEMENTAREA_MODE', 'CNT_CHILDREN', 'CNT_FAM_MEMBERS', 'CODE_GENDER', 'COMMONAREA_AVG', 'COMMONAREA_MEDI', 'COMMONAREA_MODE', 'DAYS_BIRTH', 'DAYS_EMPLOYED', 'DAYS_ID_PUBLISH', 'DAYS_LAST_PHONE_CHANGE', 'DAYS_REGISTRATION', 'DEF_30_CNT_SOCIAL_CIRCLE', 'DEF_60_CNT_SOCIAL_CIRCLE', 'ELEVATORS_AVG', 'ELEVATORS_MEDI', 'ELEVATORS_MODE', 'EMERGENCYSTATE_MODE', 'ENTRANCES_AVG', 'ENTRANCES_MEDI', 'ENTRANCES_MODE', 'EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'FLAG_CONT_MOBILE', 'FLAG_DOCUMENT_10', 'FLAG_DOCUMENT_11', 'FLAG_DOCUMENT_12', 'FLAG_DOCUMENT_13', 'FLAG_DOCUMENT_14', 'FLAG_DOCUMENT_15', 'FLAG_DOCUMENT_16', 'FLAG_DOCUMENT_17', 'FLAG_DOCUMENT_18', 

In [7]:
PrevApp_data = pd.read_csv(f'{DATA_DIR}/previous_application.csv')  # previous applications to Home Credit
print(PrevApp_data.shape)
col_names = sorted(PrevApp_data.columns)
print(col_names)

(1670214, 37)
['AMT_ANNUITY', 'AMT_APPLICATION', 'AMT_CREDIT', 'AMT_DOWN_PAYMENT', 'AMT_GOODS_PRICE', 'CHANNEL_TYPE', 'CNT_PAYMENT', 'CODE_REJECT_REASON', 'DAYS_DECISION', 'DAYS_FIRST_DRAWING', 'DAYS_FIRST_DUE', 'DAYS_LAST_DUE', 'DAYS_LAST_DUE_1ST_VERSION', 'DAYS_TERMINATION', 'FLAG_LAST_APPL_PER_CONTRACT', 'HOUR_APPR_PROCESS_START', 'NAME_CASH_LOAN_PURPOSE', 'NAME_CLIENT_TYPE', 'NAME_CONTRACT_STATUS', 'NAME_CONTRACT_TYPE', 'NAME_GOODS_CATEGORY', 'NAME_PAYMENT_TYPE', 'NAME_PORTFOLIO', 'NAME_PRODUCT_TYPE', 'NAME_SELLER_INDUSTRY', 'NAME_TYPE_SUITE', 'NAME_YIELD_GROUP', 'NFLAG_INSURED_ON_APPROVAL', 'NFLAG_LAST_APPL_IN_DAY', 'PRODUCT_COMBINATION', 'RATE_DOWN_PAYMENT', 'RATE_INTEREST_PRIMARY', 'RATE_INTEREST_PRIVILEGED', 'SELLERPLACE_AREA', 'SK_ID_CURR', 'SK_ID_PREV', 'WEEKDAY_APPR_PROCESS_START']


In [8]:
PrevApp_data.head(5)

Unnamed: 0,SK_ID_PREV,SK_ID_CURR,NAME_CONTRACT_TYPE,AMT_ANNUITY,AMT_APPLICATION,AMT_CREDIT,AMT_DOWN_PAYMENT,AMT_GOODS_PRICE,WEEKDAY_APPR_PROCESS_START,HOUR_APPR_PROCESS_START,...,NAME_SELLER_INDUSTRY,CNT_PAYMENT,NAME_YIELD_GROUP,PRODUCT_COMBINATION,DAYS_FIRST_DRAWING,DAYS_FIRST_DUE,DAYS_LAST_DUE_1ST_VERSION,DAYS_LAST_DUE,DAYS_TERMINATION,NFLAG_INSURED_ON_APPROVAL
0,2030495,271877,Consumer loans,1730.43,17145.0,17145.0,0.0,17145.0,SATURDAY,15,...,Connectivity,12.0,middle,POS mobile with interest,365243.0,-42.0,300.0,-42.0,-37.0,0.0
1,2802425,108129,Cash loans,25188.615,607500.0,679671.0,,607500.0,THURSDAY,11,...,XNA,36.0,low_action,Cash X-Sell: low,365243.0,-134.0,916.0,365243.0,365243.0,1.0
2,2523466,122040,Cash loans,15060.735,112500.0,136444.5,,112500.0,TUESDAY,11,...,XNA,12.0,high,Cash X-Sell: high,365243.0,-271.0,59.0,365243.0,365243.0,1.0
3,2819243,176158,Cash loans,47041.335,450000.0,470790.0,,450000.0,MONDAY,7,...,XNA,12.0,middle,Cash X-Sell: middle,365243.0,-482.0,-152.0,-182.0,-177.0,1.0
4,1784265,202054,Cash loans,31924.395,337500.0,404055.0,,337500.0,THURSDAY,9,...,XNA,24.0,high,Cash Street: high,,,,,,


In [9]:
! pip install pandasql

Collecting pandasql
  Downloading pandasql-0.7.3.tar.gz (26 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: pandasql
  Building wheel for pandasql (setup.py) ... done
  Created wheel for pandasql: filename=pandasql-0.7.3-py3-none-any.whl size=26771 sha256=c4fb4da385fdc98cd69cd102d47de9dfbc07445f0792c16e0f4ecca8febb7a8f
  Stored in directory: /root/.cache/pip/wheels/e9/bc/3a/8434bdcccf5779e72894a9b24fecbdcaf97940607eaf4bcdf9
Successfully built pandasql
Installing collected packages: pandasql
Successfully installed pandasql-0.7.3


In [10]:
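from pandasql import sqldf

# Build RFM features per SK_ID_CURR from the previous applications, then
# left-join them onto the training data (rows with no qualifying previous
# application get missing values for the three features).
#   MONETARY_VALUE:    total credit across previous applications
#   RECENCY_FEATURE:   DAYS_DECISION of the most recent previous application
#   FREQUENCY_FEATURE: days between the first and last decision, divided by
#                      the number of previous applications
augmented_train_data = sqldf('''
with rfm as (select
  SK_ID_CURR, sum(AMT_CREDIT) as MONETARY_VALUE,
  max(DAYS_DECISION) as RECENCY_FEATURE,
  (max(DAYS_DECISION) - min(DAYS_DECISION))/COUNT(DISTINCT SK_ID_PREV) as FREQUENCY_FEATURE
from PrevApp_data
where AMT_CREDIT <> 0
group by 1
)
select train.*, rfm.RECENCY_FEATURE, rfm.FREQUENCY_FEATURE, rfm.MONETARY_VALUE
from train_data train
left join rfm
on train.SK_ID_CURR = rfm.SK_ID_CURR
''')
augmented_train_data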

Unnamed: 0,SK_ID_CURR,TARGET,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,...,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR,RECENCY_FEATURE,FREQUENCY_FEATURE,MONETARY_VALUE
0,100002,1,Cash loans,M,N,Y,0,202500.0,406597.5,24700.5,...,0,0.0,0.0,0.0,0.0,0.0,1.0,-606.0,0.0,179055.0
1,100003,0,Cash loans,F,N,N,0,270000.0,1293502.5,35698.5,...,0,0.0,0.0,0.0,0.0,0.0,0.0,-746.0,531.0,1452573.0
2,100004,0,Revolving loans,M,Y,Y,0,67500.0,135000.0,6750.0,...,0,0.0,0.0,0.0,0.0,0.0,0.0,-815.0,0.0,20106.0
3,100006,0,Cash loans,F,N,Y,0,135000.0,312682.5,29686.5,...,0,,,,,,,-181.0,72.0,2625259.5
4,100007,0,Cash loans,M,N,Y,0,121500.0,513000.0,21865.5,...,0,0.0,0.0,0.0,0.0,0.0,0.0,-374.0,330.0,999832.5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
307506,456251,0,Cash loans,M,N,N,0,157500.0,254700.0,27558.0,...,0,,,,,,,-273.0,0.0,40455.0
307507,456252,0,Cash loans,F,N,Y,0,72000.0,269550.0,12001.5,...,0,,,,,,,-2497.0,0.0,56821.5
307508,456253,0,Cash loans,F,N,Y,0,153000.0,677664.0,29979.0,...,0,1.0,0.0,0.0,1.0,0.0,1.0,-1909.0,471.0,41251.5
307509,456254,1,Cash loans,F,N,Y,0,171000.0,370107.0,20205.0,...,0,0.0,0.0,0.0,0.0,0.0,0.0,-277.0,22.0,268879.5
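
For readers who prefer pandas to SQL, the aggregation in the query above can be sketched with `groupby`. This is an illustrative equivalent on made-up rows, not the code that produced the output above. It assumes the SQL engine performs integer division on integer operands (hence `//`), and it drops missing `AMT_CREDIT` explicitly, because a SQL `<> 0` filter also excludes NULLs.

```python
import pandas as pd

# Hypothetical rows shaped like previous_application.csv.
prev = pd.DataFrame({
    "SK_ID_PREV": [10, 11, 12, 13, 14],
    "SK_ID_CURR": [1, 1, 2, 1, 2],
    "AMT_CREDIT": [1000.0, 500.0, 300.0, 250.0, 0.0],
    "DAYS_DECISION": [-400, -100, -50, -250, -900],
})
# Hypothetical application table with one applicant (3) that has no history.
apps = pd.DataFrame({"SK_ID_CURR": [1, 2, 3]})

# where AMT_CREDIT <> 0  (NULLs fail the comparison in SQL, so drop NaN too)
nonzero = prev[prev["AMT_CREDIT"].notna() & (prev["AMT_CREDIT"] != 0)]

rfm = nonzero.groupby("SK_ID_CURR").agg(
    MONETARY_VALUE=("AMT_CREDIT", "sum"),
    RECENCY_FEATURE=("DAYS_DECISION", "max"),
    first_decision=("DAYS_DECISION", "min"),
    n_prev=("SK_ID_PREV", "nunique"),
)
# Span between first and last decision, per previous application.
rfm["FREQUENCY_FEATURE"] = (rfm["RECENCY_FEATURE"] - rfm["first_decision"]) // rfm["n_prev"]
rfm = rfm.drop(columns=["first_decision", "n_prev"]).reset_index()

# left join: applicants without previous applications keep NaN features
merged = apps.merge(rfm, on="SK_ID_CURR", how="left")
print(merged)
```

Note that an applicant with a single previous application gets `FREQUENCY_FEATURE = 0`, the same value as one whose applications were all decided on the same day.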


In [11]:
filtered_train_data = sqldf('''
SELECT
  TARGET,
  FLOORSMAX_MEDI,
  ELEVATORS_MEDI,
  FLOORSMIN_MEDI,
  AMT_CREDIT,
  TOTALAREA_MODE,
  DAYS_EMPLOYED,
  OBS_30_CNT_SOCIAL_CIRCLE,
  CNT_FAM_MEMBERS,
  CNT_CHILDREN,
  OWN_CAR_AGE,
  DAYS_ID_PUBLISH,
  DAYS_LAST_PHONE_CHANGE,
  CODE_GENDER,
  OCCUPATION_TYPE,
  AMT_INCOME_TOTAL,
  RECENCY_FEATURE,
  FREQUENCY_FEATURE,
  MONETARY_VALUE
FROM
  augmented_train_data

''')



In [12]:
filtered_train_data

Unnamed: 0,TARGET,FLOORSMAX_MEDI,ELEVATORS_MEDI,FLOORSMIN_MEDI,AMT_CREDIT,TOTALAREA_MODE,DAYS_EMPLOYED,OBS_30_CNT_SOCIAL_CIRCLE,CNT_FAM_MEMBERS,CNT_CHILDREN,OWN_CAR_AGE,DAYS_ID_PUBLISH,DAYS_LAST_PHONE_CHANGE,CODE_GENDER,OCCUPATION_TYPE,AMT_INCOME_TOTAL,RECENCY_FEATURE,FREQUENCY_FEATURE,MONETARY_VALUE
0,1,0.0833,0.00,0.1250,406597.5,0.0149,-637,2.0,1.0,0,,-2120,-1134.0,M,Laborers,202500.0,-606.0,0.0,179055.0
1,0,0.2917,0.08,0.3333,1293502.5,0.0714,-1188,1.0,2.0,0,,-291,-828.0,F,Core staff,270000.0,-746.0,531.0,1452573.0
2,0,,,,135000.0,,-225,0.0,1.0,0,26.0,-2531,-815.0,M,Laborers,67500.0,-815.0,0.0,20106.0
3,0,,,,312682.5,,-3039,2.0,2.0,0,,-2437,-617.0,F,Laborers,135000.0,-181.0,72.0,2625259.5
4,0,,,,513000.0,,-3038,0.0,1.0,0,,-3458,-1106.0,M,Core staff,121500.0,-374.0,330.0,999832.5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
307506,0,0.6042,0.22,0.2708,254700.0,0.2898,-236,0.0,1.0,0,,-1982,-273.0,M,Sales staff,157500.0,-273.0,0.0,40455.0
307507,0,0.0833,0.00,0.1250,269550.0,0.0214,365243,0.0,1.0,0,,-4090,0.0,F,,72000.0,-2497.0,0.0,56821.5
307508,0,0.1667,0.00,0.2083,677664.0,0.7970,-7921,6.0,1.0,0,,-5150,-1909.0,F,Managers,153000.0,-1909.0,471.0,41251.5
307509,1,0.0417,,,370107.0,0.0086,-4786,0.0,2.0,0,,-931,-322.0,F,Laborers,171000.0,-277.0,22.0,268879.5


In [13]:
from pandasql import sqldf

augmented_test_data = sqldf('''
with rfm as (select
  SK_ID_CURR, sum(AMT_CREDIT) as MONETARY_VALUE,
  max(DAYS_DECISION) as RECENCY_FEATURE,
  (max(DAYS_DECISION) - min(DAYS_DECISION))/COUNT(DISTINCT SK_ID_PREV) as FREQUENCY_FEATURE
from PrevApp_data
where AMT_CREDIT <> 0
group by 1
)
select test.*, rfm.RECENCY_FEATURE, rfm.FREQUENCY_FEATURE, rfm.MONETARY_VALUE
from test_data test
left join rfm
on test.SK_ID_CURR = rfm.SK_ID_CURR
''')
filtered_test_data = sqldf('''
SELECT
  FLOORSMAX_MEDI,
  ELEVATORS_MEDI,
  FLOORSMIN_MEDI,
  AMT_CREDIT,
  TOTALAREA_MODE,
  DAYS_EMPLOYED,
  OBS_30_CNT_SOCIAL_CIRCLE,
  CNT_FAM_MEMBERS,
  CNT_CHILDREN,
  OWN_CAR_AGE,
  DAYS_ID_PUBLISH,
  DAYS_LAST_PHONE_CHANGE,
  CODE_GENDER,
  OCCUPATION_TYPE,
  AMT_INCOME_TOTAL,
  RECENCY_FEATURE,
  FREQUENCY_FEATURE,
  MONETARY_VALUE
FROM
  augmented_test_data

''')
filtered_test_data


Unnamed: 0,FLOORSMAX_MEDI,ELEVATORS_MEDI,FLOORSMIN_MEDI,AMT_CREDIT,TOTALAREA_MODE,DAYS_EMPLOYED,OBS_30_CNT_SOCIAL_CIRCLE,CNT_FAM_MEMBERS,CNT_CHILDREN,OWN_CAR_AGE,DAYS_ID_PUBLISH,DAYS_LAST_PHONE_CHANGE,CODE_GENDER,OCCUPATION_TYPE,AMT_INCOME_TOTAL,RECENCY_FEATURE,FREQUENCY_FEATURE,MONETARY_VALUE
0,0.1250,,,568800.0,0.0392,-2329,0.0,2.0,0,,-812,-1740.0,F,,135000.0,-1740.0,0.0,23787.000
1,,,,222768.0,,-4469,0.0,2.0,0,,-1623,0.0,M,Low-skill Laborers,99000.0,-757.0,0.0,40153.500
2,,,,663264.0,,-4458,0.0,2.0,0,5.0,-3503,-856.0,M,Drivers,202500.0,-273.0,575.0,584536.500
3,0.3750,0.32,0.0417,1575000.0,0.3700,-1866,0.0,4.0,2,,-4208,-1805.0,F,Sales staff,315000.0,-797.0,252.0,464602.500
4,,,,625500.0,,-2191,0.0,3.0,1,16.0,-4262,-821.0,M,,180000.0,-111.0,355.0,601101.000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
48739,,,,412560.0,,-5169,1.0,1.0,0,,-3399,-684.0,F,,121500.0,-683.0,0.0,254700.000
48740,,,,622413.0,,-1149,2.0,4.0,2,,-3003,0.0,F,Sales staff,157500.0,-770.0,420.0,394816.500
48741,0.3333,0.16,,315000.0,0.1663,-3037,0.0,3.0,1,4.0,-1504,-838.0,F,,202500.0,-84.0,377.0,265033.665
48742,0.6250,0.16,,450000.0,0.1974,-2731,0.0,2.0,0,,-1364,-2308.0,M,Managers,225000.0,-577.0,432.0,637893.000


In [14]:
filtered_train_data.columns

Index(['TARGET', 'FLOORSMAX_MEDI', 'ELEVATORS_MEDI', 'FLOORSMIN_MEDI',
       'AMT_CREDIT', 'TOTALAREA_MODE', 'DAYS_EMPLOYED',
       'OBS_30_CNT_SOCIAL_CIRCLE', 'CNT_FAM_MEMBERS', 'CNT_CHILDREN',
       'OWN_CAR_AGE', 'DAYS_ID_PUBLISH', 'DAYS_LAST_PHONE_CHANGE',
       'CODE_GENDER', 'OCCUPATION_TYPE', 'AMT_INCOME_TOTAL', 'RECENCY_FEATURE',
       'FREQUENCY_FEATURE', 'MONETARY_VALUE'],
      dtype='object')

In [15]:
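from sklearn.base import BaseEstimator, TransformerMixin

class ReplaceValuesTransformer(BaseEstimator, TransformerMixin):
    """Replace positive values in one column with 0 (e.g. the 365243
    entries seen in DAYS_EMPLOYED). Expects a pandas DataFrame."""

    def __init__(self, column):
        self.column = column

    def fit(self, X, y=None):
        # Stateless: nothing to learn from the data
        return self

    def transform(self, X):
        X_copy = X.copy()
        # Vectorized form of apply(lambda x: 0 if x > 0 else x); NaN is left as-is
        X_copy[self.column] = X_copy[self.column].mask(X_copy[self.column] > 0, 0)
        return X_copy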
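
To see what the transformer's rule does, here is a toy check on a made-up series. The values are hypothetical; 365243 mimics the large positive `DAYS_EMPLOYED` entries visible in the output earlier, which appear to act as a placeholder rather than a real day count. The sketch also shows that a vectorized `mask` gives the same result as the element-wise `apply`.

```python
import numpy as np
import pandas as pd

# Hypothetical values; real DAYS_EMPLOYED entries are mostly negative day counts.
days_employed = pd.Series([-637, 365243, np.nan, -3039])

# Element-wise rule: positive values become 0, everything else is kept.
via_apply = days_employed.apply(lambda x: 0 if x > 0 else x)

# Vectorized equivalent: NaN > 0 is False, so missing values are left alone.
via_mask = days_employed.mask(days_employed > 0, 0)

print(via_apply.tolist())
print(via_mask.tolist())
```

Missing values pass through untouched, so they are still handled by the median imputer that follows in the numeric pipeline.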

In [16]:
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from xgboost import XGBClassifier

# Separate the features from the target and hold out 20% for validation
X = filtered_train_data.drop(columns=['TARGET'])
y = filtered_train_data['TARGET']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define column transformer for numerical and categorical features.
# NOTE: AMT_INCOME_TOTAL, RECENCY_FEATURE, FREQUENCY_FEATURE and MONETARY_VALUE
# are not listed below, so ColumnTransformer (remainder='drop' by default)
# discards them before the classifier sees the data.
numeric_features = ['FLOORSMAX_MEDI', 'ELEVATORS_MEDI', 'FLOORSMIN_MEDI',
       'AMT_CREDIT', 'TOTALAREA_MODE', 'DAYS_EMPLOYED',
       'OBS_30_CNT_SOCIAL_CIRCLE', 'CNT_FAM_MEMBERS', 'CNT_CHILDREN',
       'OWN_CAR_AGE', 'DAYS_ID_PUBLISH', 'DAYS_LAST_PHONE_CHANGE']  # numerical feature column names
categorical_features = ['CODE_GENDER','OCCUPATION_TYPE']  # categorical feature column names

numeric_transformer = Pipeline(steps=[
    ('replace_values', ReplaceValuesTransformer(column='DAYS_EMPLOYED')),
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())

])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])

# Create the pipeline with the preprocessor and XGBoost classifier
xgb_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', XGBClassifier())
])

# Fit the pipeline on the training data
xgb_pipeline.fit(X_train, y_train)

# Predict using the pipeline on the test data
y_pred = xgb_pipeline.predict(X_test)

# Calculate the accuracy of the classifier
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")


Accuracy: 0.9194673430564363
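
Accuracy can look high on a default-prediction problem simply because defaults are rare. The sketch below uses made-up labels (8 defaults among 100 applicants, a rate assumed purely for illustration and not taken from this notebook) to show that a model predicting "no default" for everyone already reaches high accuracy while catching no defaults at all.

```python
# Hypothetical labels: 8 defaults (1) among 100 applicants.
y_true = [1] * 8 + [0] * 92

# A trivial "model" that always predicts the majority class.
y_pred = [0] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall on the default class: share of true defaults that were flagged.
caught = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
recall = caught / sum(y_true)

print(f"accuracy={accuracy}, recall={recall}")
```

Comparing the pipeline's accuracy against this kind of majority-class baseline is a useful sanity check, and a ranking metric such as ROC AUC, used in the grid searches that follow, tends to be more informative here.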


In [None]:
from sklearn.metrics import roc_auc_score

# Define parameter grid for XGBoost classifier
xgb_param_grid = {
    'classifier__n_estimators': [50, 100, 200],
    'classifier__learning_rate': [0.05, 0.1, 0.2],
    'classifier__max_depth': [3, 5, 7],
    'classifier__gamma': [0, 0.1, 0.2, 0.5]
}

# Create GridSearchCV with n_jobs=-1 and scoring='roc_auc'
xgb_grid_search = GridSearchCV(xgb_pipeline, xgb_param_grid, cv=3, scoring='roc_auc', verbose=3, n_jobs=-1)

# Fit the grid search on the training data
xgb_grid_search.fit(X_train, y_train)

# Get the best estimator from the grid search
best_xgb_pipeline = xgb_grid_search.best_estimator_

# Predict positive-class probabilities using the best estimator
# (ROC AUC needs scores, not hard 0/1 labels)
y_pred = best_xgb_pipeline.predict_proba(X_test)[:, 1]

# Calculate the AUC-ROC of the best estimator
auc_roc = roc_auc_score(y_test, y_pred)
print(f"AUC-ROC: {auc_roc}")


Fitting 3 folds for each of 108 candidates, totalling 324 fits
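
The counts in the log line above follow directly from the parameter grid: every combination of values is one candidate, and each candidate is fitted once per fold. A small sketch of the arithmetic, with the grid copied from the cell above (prefixes dropped for brevity):

```python
from itertools import product

# Same values as xgb_param_grid, without the 'classifier__' prefix.
param_grid = {
    "n_estimators": [50, 100, 200],
    "learning_rate": [0.05, 0.1, 0.2],
    "max_depth": [3, 5, 7],
    "gamma": [0, 0.1, 0.2, 0.5],
}
cv_folds = 3

# Every combination of parameter values is one candidate.
candidates = list(product(*param_grid.values()))
n_candidates = len(candidates)   # 3 * 3 * 3 * 4
n_fits = n_candidates * cv_folds

print(n_candidates, n_fits)
```

With `refit=True` (the `GridSearchCV` default), the best candidate is then fitted one more time on the full training split.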


In [None]:
from sklearn.linear_model import LogisticRegression

# Create the pipeline with the preprocessor and logistic regression classifier
logreg_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(max_iter=10000))
])

logreg_param_grid = {
    'classifier__C': [0.1, 1, 10],
    'classifier__penalty': ['l1', 'l2'],
    'classifier__solver': ['liblinear', 'saga']
}

# Create GridSearchCV with n_jobs=-1 and scoring='roc_auc'
logreg_grid_search = GridSearchCV(logreg_pipeline, logreg_param_grid, cv=3, scoring='roc_auc', verbose=3, n_jobs=-1)

# Fit the grid search on the training data
logreg_grid_search.fit(X_train, y_train)

# Get the best estimator from the grid search
best_logreg_pipeline = logreg_grid_search.best_estimator_

# Predict using the best estimator
y_pred = best_logreg_pipeline.predict_proba(X_test)[:, 1]

# Calculate the AUC-ROC of the best estimator
auc_roc = roc_auc_score(y_test, y_pred)
print(f"AUC-ROC: {auc_roc}")
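
The cell above passes `predict_proba(...)[:, 1]` rather than `predict(...)` to `roc_auc_score`. The sketch below illustrates why that matters, using a hand-rolled pairwise definition of AUC on made-up scores. `pairwise_auc` is a hypothetical helper written for this illustration, not scikit-learn's implementation.

```python
def pairwise_auc(y_true, scores):
    """ROC AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical model output: positives are ranked above negatives,
# but every probability is below the default 0.5 threshold.
y_true = [0, 0, 1, 1]
probabilities = [0.10, 0.20, 0.30, 0.40]
hard_labels = [1 if p >= 0.5 else 0 for p in probabilities]

auc_from_probabilities = pairwise_auc(y_true, probabilities)
auc_from_labels = pairwise_auc(y_true, hard_labels)

print(auc_from_probabilities, auc_from_labels)
```

Thresholding collapses the ranking: here the probabilities separate the classes perfectly, while the hard labels are all 0 and score no better than chance.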



In [None]:
from sklearn.neighbors import KNeighborsClassifier

# Create the pipeline with the preprocessor and KNN classifier
knn_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', KNeighborsClassifier())
])

knn_param_grid = {
    'classifier__n_neighbors': [3, 5, 7, 9, 11],  # Number of neighbors to use
    'classifier__weights': ['uniform', 'distance'],  # Weighting method for predictions
    'classifier__metric': ['euclidean', 'manhattan'],  # Distance metric
    'classifier__algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute']  # Algorithm used to compute nearest neighbors
}

# Create GridSearchCV with n_jobs=-1 and scoring='roc_auc'
knn_grid_search = GridSearchCV(knn_pipeline, knn_param_grid, cv=3, scoring='roc_auc', verbose=3, n_jobs=-1)

# Fit the grid search on the training data
knn_grid_search.fit(X_train, y_train)

# Get the best estimator from the grid search
best_knn_pipeline = knn_grid_search.best_estimator_

# Predict positive-class probabilities using the best estimator
# (ROC AUC needs scores, not hard 0/1 labels)
y_pred = best_knn_pipeline.predict_proba(X_test)[:, 1]

# Calculate the AUC-ROC of the best estimator
auc_roc = roc_auc_score(y_test, y_pred)
print(f"AUC-ROC: {auc_roc}")

