# IMPORTANT 

## Install latest version of packages to be used in the code

The latest version of libraries need to be installed as per competition rules and kindly adhere to that and install the updated version of libraries in the code. 

## Please set random seed so that reproducible answers are attained

Wherever randomness is expected, do select the random seed so that the results are reproducible. Reproducibility of results is a **very important** component of model development without which reliable models are not attained. 

In [1]:
!pip install --upgrade scikit-learn numpy pandas catboost 



In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from termcolor import colored
import plotly.express as px
from sklearn.model_selection import train_test_split
from catboost import CatBoostClassifier
from sklearn.metrics import f1_score,accuracy_score,classification_report
import warnings
warnings.filterwarnings('always') 

## Loading test and train datasets 

We will load the train and test datasets and do some basic level of EDA to understand the pattern of features in the data 

* <b> Train data: </b> This is the data which we will be using to train the model. Since we are solving a classification problem, we will have a column in train dataset corresponding to the target labels. 
* <b> Test data: </b> This is the data on which the predictions will be made based on the model trained on train dataset. 



In [3]:
################# Reading train and test datasets
#train_data         = pd.read_csv('Train.csv')
#test_data          = pd.read_csv('Test.csv')

train_data         = pd.read_csv("C:/Users/KWABENABOATENG/Desktop/AZUBI AFRICA/AZUBI CAPSTONE/AZUBI-CAPSTONE-PROJECT/SampleSubmission.csv")
test_data          = pd.read_csv("C:/Users/KWABENABOATENG/Desktop/AZUBI AFRICA/AZUBI CAPSTONE/AZUBI-CAPSTONE-PROJECT/Test.csv")



target_column_name = ['income_above_limit']

########## The target column to be used for training 
target_column      = train_data[target_column_name]

########## Since ID is a unique identifier, it must be dropped 
Cols2drop          = ['ID']


######### Feature set corresponding to train and test data
train_df           = train_data.drop(Cols2drop+target_column_name,axis=1)
test_df            = test_data.drop(Cols2drop,axis=1)

print(colored(f'The shape of train data is    {train_df.shape}     ','green',attrs=['bold']))

print(colored(f'The shape of target column is {target_column.shape}','green',attrs=['bold']))

print(colored(f'The shape of test data is     {test_df.shape}      ','blue',attrs=['bold']))

print('------------------------------------------------------------------------------')
print(colored('The train data looks like below :- \n','green'))
display(train_df.head(5))

print('------------------------------------------------------------------------------')
print(colored('The test data looks like below :- \n','blue'))
display(test_df.head(5))


[1m[32mThe shape of train data is    (89786, 0)     [0m
[1m[32mThe shape of target column is (89786, 1)[0m
[1m[34mThe shape of test data is     (89786, 41)      [0m
------------------------------------------------------------------------------
[32mThe train data looks like below :- 
[0m


0
1
2
3
4


------------------------------------------------------------------------------
[34mThe test data looks like below :- 
[0m


Unnamed: 0,age,gender,education,class,education_institute,marital_status,race,is_hispanic,employment_commitment,unemployment_reason,...,country_of_birth_father,country_of_birth_mother,migration_code_change_in_msa,migration_prev_sunbelt,migration_code_move_within_reg,migration_code_change_in_reg,residence_1_year_ago,old_residence_reg,old_residence_state,importance_of_record
0,54,Male,High school graduate,Private,,Married-civilian spouse present,White,All other,Children or Armed Forces,,...,US,US,unchanged,,unchanged,unchanged,Same,,,3388.96
1,53,Male,5th or 6th grade,Private,,Married-civilian spouse present,White,Central or South American,Full-time schedules,,...,El-Salvador,El-Salvador,?,?,?,?,,,,1177.55
2,42,Male,Bachelors degree(BA AB BS),Private,,Married-civilian spouse present,White,All other,Full-time schedules,,...,US,US,?,?,?,?,,,,4898.55
3,16,Female,9th grade,,High school,Never married,White,All other,Children or Armed Forces,,...,US,US,unchanged,,unchanged,unchanged,Same,,,1391.44
4,16,Male,9th grade,,High school,Never married,White,All other,Not in labor force,,...,US,US,?,?,?,?,,,,1933.18


In [4]:
########### Encoding the target column 

target_column['income_above_limit'] = target_column['income_above_limit'].map({'Above limit':1,'Below limit':0})
target_column['income_above_limit'].value_counts()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  target_column['income_above_limit'] = target_column['income_above_limit'].map({'Above limit':1,'Below limit':0})


Series([], Name: count, dtype: int64)

<b>Class imbalance </b> <br>


We will be seeing the class imbalance using value_counts() method of pandas dataframe and use histogram to plot the imbalances
<hr>

In [5]:
print('The class Imbalance in the data is given below')
display(train_data['income_above_limit'].value_counts())
print('---------------------------------------------------------------\n')
print('The class imbalance in terms of percentage is given below ')
display(train_data['income_above_limit'].value_counts(normalize=True))
print('----------------------------------------------------------------\n')
pct_df = pd.DataFrame(train_data['income_above_limit'].value_counts(normalize=True)).reset_index().rename({'index':'Target_values','income_above_limit':'Percentage'},axis=1)
fig = px.bar(pct_df,x='Target_values',y='Percentage', height=400,width = 400,title='class imbalance')
fig.show()

The class Imbalance in the data is given below


income_above_limit
1    89786
Name: count, dtype: int64

---------------------------------------------------------------

The class imbalance in terms of percentage is given below 


income_above_limit
1    1.0
Name: proportion, dtype: float64

----------------------------------------------------------------



ValueError: Value of 'x' is not the name of a column in 'data_frame'. Expected one of ['Percentage', 'proportion'] but received: Target_values

Clearly we have a highly imbalanced dataset available with us and hence we need to perform steps to mitigate the imbalance accordingly. The following methods could be used:- 
1. Downsample the majority class (Here majority class is 'Below limit') 
2. Upsample the minority class (Here, minority class is 'Above limit') 
3. Use class weights while performing model development <br>
Reference : https://scikit-learn.org/stable/modules/generated/sklearn.utils.class_weight.compute_class_weight.html



<b> NaN value analysis </b> 


In [None]:
def nan_value_plot(df):
    nan_dict  = {}
    for cols in df.columns:
        nan_dict[cols] = df[cols].isna().sum()/df.shape[0]
    nan_pct_df = pd.DataFrame.from_dict(nan_dict,orient='index').reset_index().rename({'index':'Columns',0:'NaN_pct'},axis=1)
    fig = px.bar(nan_pct_df,x='Columns',y='NaN_pct', height=400,width = 400,title='NaN value percentage in each column')
    fig.update_layout(
                        xaxis = dict(
                        tickfont = dict(size=5)))
    fig.show()    

In [None]:
print(colored('We see the distribution of NaN values in train data as below','green',attrs=['bold']))
nan_value_plot(train_df)

print('-------------------------------------------------------------------------------------------------')
print('\n')
print(colored('We see the distribution of NaN values in test data as below','blue',attrs=['bold']))
nan_value_plot(test_df)

<b> Comments:- </b>
* There are columns with extremely high proportion of NaN values, we may drop them. 
* There are columns with NaN values that can be handled easily using imputations with mean, median (in case of numerical) or mode(in case of categorical) 
* Use Models like LightGBM, CatBoost or XGBoost that handles the NaN values implicitly while model training. 
* Observe that the proportion of NaN value distribution is same in train and test and select NaN value handling techniques accordingly. 
* Be creative 🧠 (but also be logical 😉) !!



I will personally drop all the columns where the proportion of NaN values is above 80% and proceed with columns/features that are left. 

In [6]:
nan_cols_drop  = []
for cols in test_df.columns:
    if test_df[cols].isna().sum()/test_df.shape[0] >0.8:
        nan_cols_drop.append(cols)

In [7]:
print(colored(f'We will drop the following columns from both train and test data: ','yellow',attrs=['bold']))
print(nan_cols_drop)

[1m[33mWe will drop the following columns from both train and test data: [0m
['education_institute', 'unemployment_reason', 'is_labor_union', 'veterans_admin_questionnaire', 'old_residence_reg', 'old_residence_state']


In [8]:
print('The shape of train and test data before dropping columns with high proportion of NaN values is - ')
print(colored(f'The shape of train data is    {train_df.shape}     ','green',attrs=['bold']))

print(colored(f'The shape of target column is {target_column.shape}','green',attrs=['bold']))

print(colored(f'The shape of test data is     {test_df.shape}      ','blue',attrs=['bold']))

train_df = train_df.drop(nan_cols_drop,axis=1)
test_df  = test_df.drop(nan_cols_drop,axis=1)

print('---------------------------------------------------------------------------------------------------')
print('The shape of train and test data after dropping columns with high proportion of NaN values is - ')
print(colored(f'The shape of train data is    {train_df.shape}     ','green',attrs=['bold']))

print(colored(f'The shape of target column is {target_column.shape}','green',attrs=['bold']))

print(colored(f'The shape of test data is     {test_df.shape}      ','blue',attrs=['bold']))

The shape of train and test data before dropping columns with high proportion of NaN values is - 
[1m[32mThe shape of train data is    (89786, 0)     [0m
[1m[32mThe shape of target column is (89786, 1)[0m
[1m[34mThe shape of test data is     (89786, 41)      [0m


KeyError: "['education_institute', 'unemployment_reason', 'is_labor_union', 'veterans_admin_questionnaire', 'old_residence_reg', 'old_residence_state'] not found in axis"

### Simple Baseline Validation strategy 

We will now do an 80-20 split of train data provided. As discussed previously, the participants are free to use the validation strategy of their own choice. 

Points to consider while selecting a validation strategy:
* Make sure the model is not overfitting on train data. 
* CV score and leaderboard scores are in sync. 
* Stable validation strategy when using K Folds etc. 

In [None]:
train, X_test, train_y, y_test = train_test_split(train_df, target_column, test_size=0.2, random_state=42,stratify=target_column)

### Model development 🤖 💻 🤖

We will be straight away using a CatBoost model for training because it handles categorical features well, can implicitly handle NaN values, and can give a quick baseline (with minimal preprocessing) which can be used as a benchmark to be improved upon. 

<br>

In the below steps, we will convert all the categorical columns to string datatype and capture the indices where string datatype is present which will then be used as an input for the CatBoost Classification model. 

In [None]:
cat_cols_index = np.where(train_df.dtypes=='object')[0]
for i in range(len(train_df.columns)):
    if i in cat_cols_index:
        train[train_df.columns[i]]   = train[train_df.columns[i]].astype(str)
        X_test[train_df.columns[i]]  = X_test[train_df.columns[i]].astype(str)
        test_df[train_df.columns[i]] = test_df[train_df.columns[i]].astype(str)

In [None]:
model           = CatBoostClassifier(random_state=42,n_estimators =50 )
_               = model.fit(train,train_y,cat_features= cat_cols_index)


Parameter tuning tips for CatBoost:

👓 Do focus on parameters like n_estimators, max_depth, reg_lambda, reg_alpha, scale_pos_weight, learning_rate and explore other parameters from the link : https://catboost.ai/en/docs/references/training-parameters/


In [None]:
acc_valid = accuracy_score(model.predict(X_test),y_test)

print(colored(f'The accuracy attained on the validation set is {acc_valid}','green',attrs=['bold']))



We got a good enough accuracy but is our model really performing that good ?? 🤔

👓 Consider the class imbalance of the data given with respect to the metric assigned. We can get 94% accuracy just by classifying everything as 'Below limit' but that will mean that we must get an accuracy above 94% to ensure the models are learning properly. 👓 

🔭 Let's investigate the classification report for both train and validation data and see how good the baseline is. 

In [None]:
print('\n')
print('The classification report only on the validation data is below-')
print(colored(classification_report(y_test, model.predict(X_test)),'blue',attrs=['bold']))

print('The classification report only on the train data is below-')
print(colored(classification_report(train_y, model.predict(train)),'green',attrs=['bold']))

The performance of our minority class in terms of precision and recall is too low. Hence our F1 score is also very low. 



### A little hack 

Let's do a small hack though 🤓 🤓 🤓

We can use probability based thresholds and see how performance improves. We will select a lower threshold for class label 1.
The default threshold is 0.5 which means that if the probability of 1 is above 0.5, then the predicted class is 1 else it is 0.

<br>

We will lower the threshold to 0.4 and say that if the probability of class being 1 is above 0.4, then we can classify it as 1 and if it is less than 0.4, then it will be 0. 

In [None]:
thresh     = 0.4
train_pred = np.where(model.predict_proba(train)[:,1]>thresh,1,0)
test_pred  = np.where(model.predict_proba(X_test)[:,1]>thresh,1,0)

print('\n')
print('The classification report only on the validation data is below-')
print(colored(classification_report(y_test,test_pred),'blue',attrs=['bold']))

print('The classification report only on the train data is below-')
print(colored(classification_report(train_y, train_pred),'green',attrs=['bold']))

We do see some improvement in the performance because the f1 score on our validation data moved from 0.58 to 0.61. 
For more information about how the threshold is selected, please follow [ROC Curve](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_curve.html) of sklearn and in general how ROC curve works 📚 📚

### Submission Time 🎉

We will now predict on the test data given and see what score we get on leaderboard. 

We will now download the file "Sample_submission_1.csv" and submit it. 

In [None]:
subdf                       = pd.read_csv('/content/SampleSubmission.csv')
subdf['income_above_limit'] = model.predict(test_df)
subdf.to_csv('Sample_submission_1.csv',index=False)
subdf['income_above_limit'].value_counts(normalize=True)

How to get better scores:
1. Feature engineering is the key. Refer to the variable dictionary and create meaningful features which can boost the score
2. Try out different models and categorical data preprocessing (read about categorical encoding) because a lot of features are categorical. 
3. Feature selection with feature importance 
4. Keep a check on classification report to observe overfitting and underfitting and select appropriate hyper-parameters to tune.
5. Suitable probability threshold selection as shown above. 
6. Be creative while selecting validation split 
For example:- Use Stratified K folds, grouped K folds, repeated stratified k folds, train test split with stratification etc 
7. Ensemble multiple models to get a stable prediction. 
8. Be creative and may the best model win 🏆 🏆 🏆