## Binary Classification with a Software Defects Dataset

This is my Kaggle notebook for [this competition](https://www.kaggle.com/competitions/playground-series-s3e23/overview), which is about predicting defects in C programs given various attributes of the code.

## Importing libraries and loading dataset

In [None]:
!pip install opendatasets

Collecting opendatasets
  Downloading opendatasets-0.1.22-py3-none-any.whl (15 kB)
Installing collected packages: opendatasets
Successfully installed opendatasets-0.1.22


In [None]:
!pip install xgboost==2.0.0



In [None]:
import opendatasets as od
import os
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import warnings

pd.set_option('display.max_columns', None)

In [None]:
warnings.filterwarnings("ignore")

In [None]:
od.download('https://www.kaggle.com/competitions/playground-series-s3e23/data')

Downloading playground-series-s3e23.zip to ./playground-series-s3e23
Extracting archive ./playground-series-s3e23/playground-series-s3e23.zip to ./playground-series-s3e23

In [None]:
DIR = './playground-series-s3e23'
dirs = os.listdir(DIR)
dirs

['sample_submission.csv', 'test.csv', 'train.csv']

## EDA
A first look at the dataset: its shape, attribute data types, missing values, and summary statistics.

You can find information about the meaning of each attribute in [this](https://www.kaggle.com/competitions/playground-series-s3e23/discussion/445196) discussion!

In [None]:
labeled_dir, test_dir = os.path.join(DIR, 'train.csv'), os.path.join(DIR, 'test.csv')
data = pd.read_csv(labeled_dir)
test_data = pd.read_csv(test_dir)

In [None]:
data.head(10)

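|  | id | loc | v(g) | ev(g) | iv(g) | n | v | l | d | i | e | b | t | lOCode | lOComment | lOBlank | locCodeAndComment | uniq_Op | uniq_Opnd | total_Op | total_Opnd | branchCount | defects |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 22.0 | 3.0 | 1.0 | 2.0 | 60.0 | 278.63 | 0.06 | 19.56 | 14.25 | 5448.79 | 0.09 | 302.71 | 17 | 1 | 1 | 0 | 16.0 | 9.0 | 38.0 | 22.0 | 5.0 | False |
| 1 | 1 | 14.0 | 2.0 | 1.0 | 2.0 | 32.0 | 151.27 | 0.14 | 7.0 | 21.11 | 936.71 | 0.05 | 52.04 | 11 | 0 | 1 | 0 | 11.0 | 11.0 | 18.0 | 14.0 | 3.0 | False |
| 2 | 2 | 11.0 | 2.0 | 1.0 | 2.0 | 45.0 | 197.65 | 0.11 | 8.05 | 22.76 | 1754.01 | 0.07 | 97.45 | 8 | 0 | 1 | 0 | 12.0 | 11.0 | 28.0 | 17.0 | 3.0 | False |
| 3 | 3 | 8.0 | 1.0 | 1.0 | 1.0 | 23.0 | 94.01 | 0.19 | 5.25 | 17.86 | 473.66 | 0.03 | 26.31 | 4 | 0 | 2 | 0 | 8.0 | 6.0 | 16.0 | 7.0 | 1.0 | True |
| 4 | 4 | 11.0 | 2.0 | 1.0 | 2.0 | 17.0 | 60.94 | 0.18 | 5.63 | 12.44 | 365.67 | 0.02 | 20.31 | 7 | 0 | 2 | 0 | 7.0 | 6.0 | 10.0 | 10.0 | 3.0 | False |
| 5 | 5 | 23.0 | 4.0 | 4.0 | 3.0 | 69.0 | 338.21 | 0.07 | 14.15 | 22.81 | 3772.51 | 0.11 | 209.42 | 17 | 1 | 2 | 0 | 16.0 | 10.0 | 40.0 | 19.0 | 7.0 | False |
| 6 | 6 | 24.0 | 4.0 | 1.0 | 4.0 | 60.0 | 294.41 | 0.08 | 12.46 | 24.62 | 3295.25 | 0.1 | 183.07 | 19 | 0 | 3 | 0 | 14.0 | 13.0 | 40.0 | 23.0 | 7.0 | False |
| 7 | 7 | 14.0 | 1.0 | 1.0 | 1.0 | 49.0 | 221.65 | 0.18 | 5.47 | 46.06 | 1183.48 | 0.07 | 65.75 | 11 | 0 | 2 | 0 | 7.0 | 18.0 | 26.0 | 23.0 | 1.0 | False |
| 8 | 8 | 34.0 | 10.0 | 1.0 | 4.0 | 122.0 | 684.98 | 0.07 | 14.33 | 43.43 | 9941.84 | 0.23 | 552.32 | 29 | 1 | 3 | 0 | 16.0 | 29.0 | 75.0 | 47.0 | 19.0 | False |
| 9 | 9 | 9.0 | 2.0 | 1.0 | 2.0 | 16.0 | 55.35 | 0.11 | 9.0 | 6.15 | 498.16 | 0.02 | 27.68 | 4 | 0 | 2 | 0 | 9.0 | 2.0 | 12.0 | 4.0 | 3.0 | False |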


In [None]:
print('Training data shape:', data.shape)
print('Testing data shape:', test_data.shape)

Training data shape: (101763, 23)
Testing data shape: (67842, 22)


In [None]:
data.dtypes

id                     int64
loc                  float64
v(g)                 float64
ev(g)                float64
iv(g)                float64
n                    float64
v                    float64
l                    float64
d                    float64
i                    float64
e                    float64
b                    float64
t                    float64
lOCode                 int64
lOComment              int64
lOBlank                int64
locCodeAndComment      int64
uniq_Op              float64
uniq_Opnd            float64
total_Op             float64
total_Opnd           float64
branchCount          float64
defects                 bool
dtype: object

In [None]:
data.isna().sum()

id                   0
loc                  0
v(g)                 0
ev(g)                0
iv(g)                0
n                    0
v                    0
l                    0
d                    0
i                    0
e                    0
b                    0
t                    0
lOCode               0
lOComment            0
lOBlank              0
locCodeAndComment    0
uniq_Op              0
uniq_Opnd            0
total_Op             0
total_Opnd           0
branchCount          0
defects              0
dtype: int64

In [None]:
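# Convert the boolean target to 0/1. Assigning the result back avoids the
# chained in-place `replace`, which may not modify `data` in recent pandas versions.
data['defects'] = data['defects'].astype(int)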

In [None]:
data.describe()

|  | id | loc | v(g) | ev(g) | iv(g) | n | v | l | d | i | e | b | t | lOCode | lOComment | lOBlank | locCodeAndComment | uniq_Op | uniq_Opnd | total_Op | total_Opnd | branchCount | defects |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 101763.0 | 101763.0 | 101763.0 | 101763.0 | 101763.0 | 101763.0 | 101763.0 | 101763.0 | 101763.0 | 101763.0 | 101763.0 | 101763.0 | 101763.0 | 101763.0 | 101763.0 | 101763.0 | 101763.0 | 101763.0 | 101763.0 | 101763.0 | 101763.0 | 101763.0 | 101763.0 |
| mean | 50881.0 | 37.34716 | 5.492684 | 2.845022 | 3.498826 | 96.655995 | 538.280956 | 0.111634 | 13.681881 | 27.573007 | 20853.59 | 0.179164 | 1141.357982 | 22.802453 | 1.773945 | 3.979865 | 0.196604 | 11.896131 | 15.596671 | 57.628116 | 39.249698 | 9.839549 | 0.226644 |
| std | 29376.592059 | 54.600401 | 7.900855 | 4.631262 | 5.534541 | 171.147191 | 1270.791601 | 0.100096 | 14.121306 | 22.856742 | 190571.4 | 0.421844 | 9862.795472 | 38.54101 | 5.902412 | 6.382358 | 0.998906 | 6.749549 | 18.064261 | 104.53766 | 71.692309 | 14.412769 | 0.418663 |
| min | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 25% | 25440.5 | 13.0 | 2.0 | 1.0 | 1.0 | 25.0 | 97.67 | 0.05 | 5.6 | 15.56 | 564.73 | 0.03 | 31.38 | 7.0 | 0.0 | 1.0 | 0.0 | 8.0 | 7.0 | 15.0 | 10.0 | 3.0 | 0.0 |
| 50% | 50881.0 | 22.0 | 3.0 | 1.0 | 2.0 | 51.0 | 232.79 | 0.09 | 9.82 | 23.36 | 2256.23 | 0.08 | 125.4 | 14.0 | 0.0 | 2.0 | 0.0 | 11.0 | 12.0 | 30.0 | 20.0 | 5.0 | 0.0 |
| 75% | 76321.5 | 42.0 | 6.0 | 3.0 | 4.0 | 111.0 | 560.25 | 0.15 | 18.0 | 34.34 | 10193.24 | 0.19 | 565.92 | 26.0 | 1.0 | 5.0 | 0.0 | 16.0 | 20.0 | 66.0 | 45.0 | 11.0 | 0.0 |
| max | 101762.0 | 3442.0 | 404.0 | 165.0 | 402.0 | 8441.0 | 80843.08 | 1.0 | 418.2 | 569.78 | 16846620.0 | 26.95 | 935923.39 | 2824.0 | 344.0 | 219.0 | 43.0 | 410.0 | 1026.0 | 5420.0 | 3021.0 | 503.0 | 1.0 |


In [None]:
from sklearn.model_selection import cross_val_score, KFold, train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.feature_selection import SelectKBest, chi2
from xgboost import XGBClassifier
from sklearn.preprocessing import MinMaxScaler, StandardScaler, Normalizer
from sklearn.feature_selection import mutual_info_regression
from sklearn.decomposition import PCA

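In [None]:
# Separate the features from the target; `id` is only a row identifier.
X_train, y_train = data.drop(['defects', 'id'], axis=1), data['defects']
X_test = test_data.drop('id', axis=1)

In [None]:
# Fit the scaler on the training features only, then reuse it for the test set.
scaler = MinMaxScaler()
scaled_X_train = scaler.fit_transform(X_train)
scaled_X_test = scaler.transform(X_test)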
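
To make the scaling step concrete, here is a minimal pure-Python sketch of what min-max scaling does: the minimum and maximum are learned from the training rows and then reused unchanged for the test rows. The helper names `fit_min_max` and `transform_min_max` are hypothetical, the three training rows are the `loc` and `v(g)` values of rows 0, 1 and 8 from the `head()` output, and the test row is invented for illustration. It is not scikit-learn's implementation.

```python
def fit_min_max(rows):
    """Record the per-column minimum and maximum of the training rows."""
    columns = list(zip(*rows))
    return [min(c) for c in columns], [max(c) for c in columns]


def transform_min_max(rows, mins, maxs):
    """Apply (x - min) / (max - min) per column; constant columns map to 0 in this sketch."""
    return [
        [(x - lo) / (hi - lo) if hi > lo else 0.0 for x, lo, hi in zip(row, mins, maxs)]
        for row in rows
    ]


# loc and v(g) for rows 0, 1 and 8 of the head() output.
train_rows = [[22.0, 3.0], [14.0, 2.0], [34.0, 10.0]]
# Hypothetical unseen row, invented for illustration.
test_rows = [[40.0, 1.0]]

mins, maxs = fit_min_max(train_rows)
scaled_train = transform_min_max(train_rows, mins, maxs)
scaled_test = transform_min_max(test_rows, mins, maxs)

print(scaled_train)
print(scaled_test)
```

Because the statistics come from the training data only, a test value outside the training range is mapped outside [0, 1], as the invented test row shows.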

Because the dataset is imbalanced (defective modules are the minority class), we use SMOTE to oversample the minority class until both classes have the same number of instances.
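
The degree of imbalance can be estimated from numbers already printed above. The following arithmetic sketch uses the row count from `data.shape` and the mean of `defects` from `data.describe()`. The counts are approximate because the printed mean is rounded, and the last line assumes SMOTE raises the minority class to the size of the majority class.

```python
n_rows = 101763           # rows in train.csv, from data.shape
positive_rate = 0.226644  # mean of `defects`, from data.describe()

# Approximate class counts implied by the rounded mean.
n_defective = round(n_rows * positive_rate)
n_clean = n_rows - n_defective

# Assumption: oversampling raises the minority class to the majority count.
rows_after_smote = 2 * n_clean

print(n_defective, n_clean, rows_after_smote)
```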

In [None]:
from imblearn.over_sampling import SMOTE
smote = SMOTE()
X_train_oversampled, y_train_oversampled = smote.fit_resample(scaled_X_train, y_train)
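
As a rough illustration of the idea behind SMOTE, the sketch below creates one synthetic minority point by interpolating between a randomly chosen minority sample and one of its nearest minority neighbours. The function `smote_like_sample` and the toy points are hypothetical. This is a simplified sketch of the interpolation step, not the `imblearn` implementation.

```python
import random


def smote_like_sample(minority, k=2, seed=0):
    """Interpolate between a random minority point and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    base = rng.choice(minority)
    # Rank the other minority points by squared Euclidean distance to the base point.
    others = [p for p in minority if p is not base]
    others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
    neighbour = rng.choice(others[:k])
    gap = rng.random()  # interpolation weight in [0, 1)
    return [a + gap * (b - a) for a, b in zip(base, neighbour)]


# Toy minority-class points in a scaled two-feature space (invented for illustration).
minority = [[0.10, 0.20], [0.15, 0.25], [0.30, 0.10], [0.90, 0.80]]
synthetic = smote_like_sample(minority)
print(synthetic)
```

The synthetic point always lies on the line segment between two existing minority points, so it stays within the range the minority class already covers.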

## Model selection
To choose among five candidate models (**Random Forest Classifier**, **Multinomial Naive Bayes**, **KNN**, **Logistic Regression**, and **XGB Classifier**), we use the following techniques:

1. **KFold Cross Validation**: This technique helps us evaluate the performance of each model by splitting the data into K folds and training the model K times, each time using a different fold as the validation set. This allows us to get a more reliable estimate of the model's performance.

2. **Randomized Search CV**: This technique helps us find the best hyperparameters for each model. It performs a randomized search over a predefined hyperparameter space and evaluates the model's performance using cross validation. By trying different combinations of hyperparameters, we can find the set of parameters that gives the best performance for each model.

By combining KFold cross validation and Randomized Search CV, we can effectively compare the performance of the five models and select the one with the best parameters.
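
The two techniques above can be sketched in plain Python to show the mechanics. The helpers `kfold_indices` and `sample_params` and the small parameter space are hypothetical. For simplicity this sketch samples combinations with replacement, which may differ from how `RandomizedSearchCV` draws its candidates.

```python
import random


def kfold_indices(n_samples, n_splits, seed=0):
    """Shuffle the indices once, then yield (train, validation) index lists for each fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_splits] for i in range(n_splits)]
    for k in range(n_splits):
        val = folds[k]
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, val


def sample_params(space, n_iter, seed=0):
    """Draw n_iter random combinations from a dict of candidate value lists."""
    rng = random.Random(seed)
    return [
        {name: rng.choice(values) for name, values in space.items()}
        for _ in range(n_iter)
    ]


splits = list(kfold_indices(n_samples=10, n_splits=4))
candidates = sample_params(
    {"max_depth": [3, 5, 7], "n_estimators": [100, 200]}, n_iter=3
)

for train, val in splits:
    print("validation fold:", sorted(val))
print(candidates)
```

Each candidate would be trained and scored once per fold, which is why 20 candidates with 4 folds produce 80 fits.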

In [None]:
models_dict = {
    'RandomForestClassifier': RandomForestClassifier(),
    'MultinomialNB': MultinomialNB(),
    'KNeighborsClassifier': KNeighborsClassifier(),
    'LogisticRegression': LogisticRegression(),
    'XGBClassifier': XGBClassifier()
}

models_parameters = {
    'RandomForestClassifier': {'n_estimators': np.arange(100, 1000, 100),
                               'criterion': ['gini', 'entropy'],
                               'max_depth': np.arange(3, 10)},


    'MultinomialNB': {'alpha': np.logspace(-3, 0, num=4)},

    'KNeighborsClassifier': {'n_neighbors': np.arange(3, 10)},

    'LogisticRegression': {'C': np.linspace(0.1, 1, num=10)},

    'XGBClassifier': {'n_estimators': np.arange(100, 1000, 100),
                      'max_depth': np.arange(3, 10),
                      'learning_rate': np.logspace(-3, 0, num=5),
                      'subsample': [0.6, 0.7, 0.8, 1],
                      'lambda': np.logspace(-5, 2, num=8),
                      'alpha': np.logspace(-5, 2, num=8),
                      'min_child_weight': np.arange(0, 5, 3)}
}

In [None]:
kfold = KFold(n_splits=4, random_state=20, shuffle=True)

In [None]:
def select_model(X, y):
    models = {}
    for model_name, model in models_dict.items():
        random_search = RandomizedSearchCV(model, param_distributions=models_parameters[model_name], n_iter=20,
                        verbose=1, cv=kfold, scoring='roc_auc', random_state=42, n_jobs=-1)
        random_search.fit(X, y)
        print('Result for', model_name)
        print('Best params:', random_search.best_params_)
        print('Best score:', random_search.best_score_)
        models[model_name] = random_search.best_estimator_
        print('*' * 10)

    return models

In [None]:
models = select_model(X_train_oversampled, y_train_oversampled)

Fitting 4 folds for each of 20 candidates, totalling 80 fits
Result for RandomForestClassifier
Best params: {'n_estimators': 100, 'max_depth': 9, 'criterion': 'entropy'}
Best score: 0.7900860403133818
**********
Fitting 4 folds for each of 4 candidates, totalling 16 fits
Result for MultinomialNB
Best params: {'alpha': 0.001}
Best score: 0.7633900650615708
**********
Fitting 4 folds for each of 7 candidates, totalling 28 fits
Result for KNeighborsClassifier
Best params: {'n_neighbors': 9}
Best score: 0.7452673327520609
**********
Fitting 4 folds for each of 10 candidates, totalling 40 fits
Result for LogisticRegression
Best params: {'C': 1.0}
Best score: 0.7741653014073198
**********
Fitting 4 folds for each of 20 candidates, totalling 80 fits
Result for XGBClassifier
Best params: {'subsample': 0.6, 'n_estimators': 800, 'min_child_weight': 0, 'max_depth': 8, 'learning_rate': 0.005623413251903491, 'lambda': 0.1, 'alpha': 10.0}
Best score: 0.7922677016688203
**********


Based on the cross-validated ROC AUC scores above, which were computed on the oversampled training data, the **XGB Classifier** performed best (0.7923), narrowly ahead of the Random Forest Classifier (0.7901). We therefore use the tuned XGB Classifier to predict a defect probability for each instance in the test data.
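
Since the models are scored with ROC AUC, which depends on how instances are ranked, probabilities carry more information than hard 0/1 labels. The sketch below computes ROC AUC as the fraction of (positive, negative) pairs that are ranked correctly, with ties counted as half. The function `roc_auc` is a hypothetical helper, and the labels and scores are toy values invented for illustration, not competition data.

```python
def roc_auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy labels and predicted probabilities (invented for illustration).
y_true = [0, 0, 1, 0, 1, 1]
toy_probs = [0.23, 0.19, 0.66, 0.46, 0.42, 0.80]
# Hard labels obtained by thresholding the probabilities at 0.5.
toy_labels = [1 if p >= 0.5 else 0 for p in toy_probs]

auc_from_probs = roc_auc(y_true, toy_probs)
auc_from_labels = roc_auc(y_true, toy_labels)
print(auc_from_probs, auc_from_labels)
```

In this toy case, thresholding turns some correctly ranked pairs into ties, so the AUC computed from hard labels is lower than the AUC computed from the probabilities.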

In [None]:
model = models['XGBClassifier']

In [None]:
predicts = model.predict(scaled_X_test)
predicts[:5]

array([0, 0, 1, 0, 0])

In [None]:
probs = model.predict_proba(scaled_X_test)
probs[:5]

array([[0.7712573 , 0.2287427 ],
       [0.8081882 , 0.19181181],
       [0.34197295, 0.65802705],
       [0.5350929 , 0.4649071 ],
       [0.86241776, 0.13758226]], dtype=float32)

In [None]:
probs_df = pd.DataFrame({'id': test_data['id'], 'defects': probs[:,1]})
probs_df.head()

|  | id | defects |
|---|---|---|
| 0 | 101763 | 0.228743 |
| 1 | 101764 | 0.191812 |
| 2 | 101765 | 0.658027 |
| 3 | 101766 | 0.464907 |
| 4 | 101767 | 0.137582 |


In [None]:
probs_df.shape

(67842, 2)

In [None]:
probs_df.to_csv('submission.csv', index=False)

## Conclusion

This was my first attempt at an ongoing Kaggle competition. At the end of the competition, my submission achieved a ROC AUC score of **0.79344**, placing 370th out of 1704 participants, only 0.00085 below the winning score.