# **CSC 418 Group D Data Science Project**

Category : **Major**
Group Members
 
1. P15/5620/2019 : Njagi Baraka Fadhili
2. P15/1636/2019 : Kabiru Sharleen Njeri
3. P15/1635/2019 : Obora Melanie Fayne
4. P15/137631/2019 : Ali Amina Abdi
5. P15/130607/2018 : Munyao Mary June

## Importing libraries
_____

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import scipy.stats as stats
%matplotlib inline

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

# Loading Model libraries
import sys
!{sys.executable} -m pip install xgboost

from sklearn.ensemble import VotingClassifier
from xgboost import XGBClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from statsmodels.stats.outliers_influence import variance_inflation_factor
from lightgbm import LGBMClassifier
from scipy.special import erfc
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score,accuracy_score,classification_report,confusion_matrix , recall_score, precision_score
from sklearn.metrics import roc_curve, roc_auc_score, log_loss

from keras.layers import Input, Dense
from keras.models import Model

np.random.seed(2017)

## Reading Files
_____

In [None]:
train_data = pd.read_csv("../data/train.csv")
# Preview the first five rows of the train dataset
print(f'The shape of the dataset is: {train_data.shape}')
train_data.head()

In [None]:
test_data = pd.read_csv("../data/test.csv")
# Preview the first five rows of the test dataset
print(f'The shape of the dataset is: {test_data.shape}')
test_data.head()

* We are provided with an anonymized dataset containing numeric feature variables, a numeric target column, and a string ID column.
* The train and test data have 4992 unique columns.
* The train data has 4459 rows.
* The test data has 49342 rows.
* In the train data, the number of columns exceeds the number of rows.
* The test set is about 11 times the size of the train set.


## Data Understanding and Preparation
_____

**Things to check for**:

1. Check for unique data and unique column names in the train and test data
2. Check for null and duplicate values
3. Check for outliers
4. Check for feature distribution
5. Check for feature importance, through correlation and collinearity

### Checking for Data Uniqueness

#### In the Train Dataset

In [None]:
# Check unique data in the train dataset columns 
unique_df = train_data.nunique().reset_index()
unique_df.columns = ["col_name", "unique_count"]
unique_df = unique_df.sort_values("unique_count")
unique_df

In [None]:
# As we can see, some columns contain only one unique value
# Let's print the number of columns with 1 unique value
constant_col = unique_df[unique_df["unique_count"]==1]
constant_col.shape

In [None]:
#remove this constant col 
print('Original Shape of Train Dataset {}'.format(train_data.shape))
train_data.drop(constant_col.col_name.tolist(), axis = 1, inplace = True)
print('Shape after dropping Constant Columns from Train Dataset {}'.format(train_data.shape))

#### In the Test Dataset

In [None]:
#check unique data in the column 
unique_df = test_data.nunique().reset_index()
unique_df.columns = ["col_name", "unique_count"]
unique_df = unique_df.sort_values("unique_count")
unique_df

In [None]:
constant_col = unique_df[unique_df["unique_count"]==1]
constant_col.shape

**Observation**: 
1. The train data has 256 constant columns 
2. The test data has 0 constant columns 
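
The constant-column check above can be sketched on a toy frame. This is only an illustration: the frame `toy` and its column names are hypothetical and are not part of the project data.

```python
import pandas as pd

# Toy frame (hypothetical): "b" is constant, "a" and "c" vary.
toy = pd.DataFrame({"a": [1, 2, 3], "b": [0, 0, 0], "c": [5, 5, 6]})

# A column with exactly one distinct value carries no information for a model.
constant_cols = toy.columns[toy.nunique() == 1].tolist()
reduced = toy.drop(columns=constant_cols)

print(constant_cols)  # ['b']
print(reduced.shape)  # (3, 2)
```

If columns are dropped from the train set only, the same list generally needs to be dropped from the test set as well so that both frames keep identical feature columns.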

### Checking for Duplicate Features 

In [None]:
# get the boolean array of duplicate column names in the train dataset
duplicate_col = train_data.columns.duplicated()

# check if there are any duplicate column names
if any(duplicate_col):
    print("There are duplicate column names")
else:
    print("All column names are unique")

In [None]:
# get the boolean array of duplicate column names in the test dataset
duplicate_col = test_data.columns.duplicated()

# check if there are any duplicate column names
if any(duplicate_col):
    print("There are duplicate column names")
else:
    print("All column names are unique")

**Observation**: The train and test data have unique columns names 

### Reducing dimensionality using an Autoencoder

Dimensionality reduction is the process of reducing the number of features or variables in a dataset while preserving as much of the important information as possible. This can be useful for visualizing high-dimensional data, reducing the computational cost of modeling, and avoiding overfitting.


An autoencoder is an unsupervised neural network that learns to reconstruct the input data by compressing it into a lower-dimensional representation (encoding) and then decompressing it back to its original form (decoding). It can be used for dimensionality reduction by using the encoded representation as a new feature space.

**Steps involved**: 
* Prepare the data
* Design the autoencoder
* Train the autoencoder
* Extract the encoder layers from the trained autoencoder
* Use the encoder to obtain reduced-dimensionality data for the train and test sets
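
The steps above can be sketched with a deliberately small example. This is not the deep ReLU Keras network used in this notebook: it is a single-layer *linear* autoencoder written in plain NumPy, trained by gradient descent on synthetic data, and every name in it (`W_enc`, `W_dec`, `mse`, the toy matrix `X`) is an assumption made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Prepare data: 200 samples, 8 features generated from 2 latent factors.
latent = rng.normal(size=(200, 2))
X = latent @ rng.normal(size=(2, 8))
X = (X - X.mean(axis=0)) / X.std(axis=0)  # standardise the features

# Design: encoder (8 -> 2) and decoder (2 -> 8) weight matrices.
n_features, encoding_dim = X.shape[1], 2
W_enc = rng.normal(scale=0.1, size=(n_features, encoding_dim))
W_dec = rng.normal(scale=0.1, size=(encoding_dim, n_features))

def mse(X, W_enc, W_dec):
    """Mean squared reconstruction error."""
    return float(np.mean((X @ W_enc @ W_dec - X) ** 2))

initial_loss = mse(X, W_enc, W_dec)

# Train: minimise the reconstruction error with plain gradient descent.
learning_rate = 0.01
for _ in range(2000):
    Z = X @ W_enc                  # encode
    X_hat = Z @ W_dec              # decode
    err = (X_hat - X) / len(X)     # scaled reconstruction error
    grad_dec = Z.T @ err
    grad_enc = X.T @ (err @ W_dec.T)
    W_dec -= learning_rate * grad_dec
    W_enc -= learning_rate * grad_enc

final_loss = mse(X, W_enc, W_dec)

# Use the encoder alone to obtain the reduced representation.
encoded = X @ W_enc
print(encoded.shape)
print(initial_loss, final_loss)
```

Only the encoder half is kept once training ends, which mirrors the last two steps in the list: the decoder exists solely to provide a training signal.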


In [None]:
# lets first create a copy of the train and test data 
train_df = train_data.copy()
test_df = test_data.copy()

In [None]:
# Drop the target and ID columns so that only the features are encoded
train_df.drop(columns=['ID', 'target'], inplace=True)
test_df.drop(columns=['ID'], inplace=True)
print(train_df.shape)
print(test_df.shape)

In [None]:
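# Scale the train and test data for the neural network
# Create the scaler object
scaler = StandardScaler()
# Fit the scaler on the train features only
train_scaled = scaler.fit_transform(train_df)
# Reuse the train statistics on the test set; selecting train_df.columns aligns
# the test features with train, where the constant columns were dropped
test_scaled = scaler.transform(test_df[train_df.columns])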
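
Standardisation of a held-out set is usually done with the statistics learned from the training set, so that both sets live on the same scale. The sketch below shows the arithmetic in plain NumPy on made-up arrays; `train` and `test` here are hypothetical toy values, not the project data.

```python
import numpy as np

train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
test = np.array([[2.0, 40.0]])

# Statistics come from the training data only.
mean = train.mean(axis=0)
std = train.std(axis=0)  # population standard deviation (ddof=0)

train_scaled = (train - mean) / std
test_scaled = (test - mean) / std  # same mean and std as train

print(train_scaled.mean(axis=0))  # approximately [0, 0]
print(test_scaled)
```

Refitting the scaler on the test set would give the test features their own mean and variance, so identical raw values would map to different scaled values in the two sets.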

In [None]:
# Design the autoencoder
# Split the scaled train data into a training part and a validation part
np.random.seed(2017)
# random_state expects an integer seed; np.random.seed() returns None
X_train, X_test = train_test_split(train_scaled, train_size = 0.9, random_state = 2017)


In [None]:
# Defining the input layer 
col_no = train_scaled.shape[1]
input_dim = Input(shape = (col_no, ))

# Defining the encoder dimension
encoding_dim = 200

# Creating  Encoder Layers
encoded1 = Dense(3000, activation = 'relu')(input_dim) #apply to the previous layer 
encoded2 = Dense(2750, activation = 'relu')(encoded1)
encoded3 = Dense(2500, activation = 'relu')(encoded2)
encoded4 = Dense(2250, activation = 'relu')(encoded3)
encoded5 = Dense(2000, activation = 'relu')(encoded4)
encoded6 = Dense(1750, activation = 'relu')(encoded5)
encoded7 = Dense(1500, activation = 'relu')(encoded6)
encoded8 = Dense(1250, activation = 'relu')(encoded7)
encoded9 = Dense(1000, activation = 'relu')(encoded8)
encoded10 = Dense(750, activation = 'relu')(encoded9)
encoded11 = Dense(500, activation = 'relu')(encoded10)
encoded12 = Dense(250, activation = 'relu')(encoded11)
encoded13 = Dense(encoding_dim, activation = 'relu')(encoded12)

# Creating the Decoder Layers
decoded1 = Dense(250, activation = 'relu')(encoded13)
decoded2 = Dense(500, activation = 'relu')(decoded1)
decoded3 = Dense(750, activation = 'relu')(decoded2)
decoded4 = Dense(1000, activation = 'relu')(decoded3)
decoded5 = Dense(1250, activation = 'relu')(decoded4)
decoded6 = Dense(1500, activation = 'relu')(decoded5)
decoded7 = Dense(1750, activation = 'relu')(decoded6)
decoded8 = Dense(2000, activation = 'relu')(decoded7)
decoded9 = Dense(2250, activation = 'relu')(decoded8)
decoded10 = Dense(2500, activation = 'relu')(decoded9)
decoded11 = Dense(2750, activation = 'relu')(decoded10)
decoded12 = Dense(3000, activation = 'relu')(decoded11)
# Linear output: the standardized inputs are not confined to [0, 1]
decoded13 = Dense(col_no, activation = 'linear')(decoded12)

# Creating the autoencoder
# The model maps the input layer to the final decoder layer
autoencoder = Model(inputs = input_dim, outputs = decoded13)

# Compiling the model
# Mean squared error suits real-valued targets; binary cross-entropy assumes targets in [0, 1]
autoencoder.compile(optimizer = 'adadelta', loss = 'mse')

In [None]:
autoencoder.summary()

In [None]:
# Once the autoencoder is compiled, we train it using the training dataset.
autoencoder.fit(X_train, X_train, epochs = 10, batch_size = 32, shuffle = False, validation_data = (X_test, X_test))

Using the encoder to reduce dimensionality:
* Once the autoencoder is trained, you can use the encoder part of the autoencoder to reduce the dimensionality of the dataset. By calling the predict() function on the encoder, you can transform the input data to a lower-dimensional representation.

In [None]:
# We use the autoencoder to reduce the dimension of the train and test data 
encoder = Model(inputs = input_dim, outputs = encoded13)
encoded_input = Input(shape = (encoding_dim, ))

In [None]:
# Predict the new train and test using the autoencoder 
new_train = pd.DataFrame(encoder.predict(train_scaled))
new_train = new_train.add_prefix('feature_')


In [None]:
new_test = pd.DataFrame(encoder.predict(test_scaled))
new_test = new_test.add_prefix('feature_')

In [None]:
# We then add back the target and the ID columns we dropped earlier
train_df1 = pd.concat([train_data[['ID', 'target']], new_train], axis=1)
print(train_df1.shape)
train_df1.head()

In [None]:
# Build the new test data from the encoded test features and view its shape
test_df1 = pd.concat([test_data[['ID']], new_test], axis=1)
print(test_df1.shape)
test_df1.head()


### Checking for nulls and duplicates

In [None]:
train_df1.info()

In [None]:
test_df1.info()

In [None]:
# describing numerical values
train_df1.describe().T

In [None]:
# Categorical Values/Object Values
train_df1.describe(include="O").T

In [None]:
test_df1.describe().T

In [None]:
#Categorical Values/Object Values
test_df1.describe(include="O").T

In [None]:
#missing value 
train_df1.isnull().sum()

In [None]:
#missing value 
test_df1.isnull().sum()

In [None]:
#duplicate train data rows  
train_df1.duplicated().sum()

In [None]:
#duplicate test data rows 
test_df1.duplicated().sum()


**Observations**: 
1. The dataset contains a large share of zeros.
2. The encoded dataset contains numeric feature variables, the numeric target column, and a string ID column.
3. The encoded train set has 202 columns (ID, target, and the 200 encoded features) and 4459 rows.
4. The dataset has 0 missing values.
5. The dataset has 0 duplicate rows.

### Checking for Data Uniqueness ... again 

In [None]:
# Check for columns with a single unique value in the encoded train data
unique_df = train_df1.nunique().reset_index()
unique_df.columns = ["col_name", "unique_count"]
unique_df = unique_df.sort_values("unique_count")
unique_df


In [None]:
# As we can see, some columns contain only one unique value
# Let's print the number of columns with 1 unique value
constant_col = unique_df[unique_df["unique_count"]==1]
constant_col.shape

In [None]:
#remove this constant col 
print('Original Shape of Train Dataset {}'.format(train_df1.shape))
train_df1.drop(constant_col.col_name.tolist(), axis = 1, inplace = True)
print('Shape after dropping Constant Columns from Train Dataset {}'.format(train_df1.shape))

In [None]:
# Check for columns with a single unique value in the encoded test data
unique_df = test_df1.nunique().reset_index()
unique_df.columns = ["col_name", "unique_count"]
unique_df = unique_df.sort_values("unique_count")
unique_df


In [None]:
# As we can see, some columns contain only one unique value
# Let's print the number of columns with 1 unique value
constant_col = unique_df[unique_df["unique_count"]==1]
constant_col.shape

In [None]:
# Remove the constant columns from the test data
print('Original Shape of Test Dataset {}'.format(test_df1.shape))
test_df1.drop(constant_col.col_name.tolist(), axis = 1, inplace = True)
print('Shape after dropping Constant Columns from Test Dataset {}'.format(test_df1.shape))

**Observations**: 
1. The dataset is f...

### Dealing with sparse data 

Sparse data is data in which most of the recorded values are empty or zero.
As we saw in our train and test data, most of the dataset values are zeros.
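
One way to quantify how sparse a dataset is, sketched here on a hypothetical toy frame (`toy` and its columns are made up for illustration), is to compute the fraction of zero entries per column and overall.

```python
import pandas as pd

# Hypothetical toy features; f2 is entirely zero.
toy = pd.DataFrame({
    "f0": [0, 0, 0, 5],
    "f1": [0, 1, 0, 2],
    "f2": [0, 0, 0, 0],
})

# Share of zeros in each column, and across the whole frame.
zero_fraction = (toy == 0).mean()
overall_sparsity = float((toy == 0).to_numpy().mean())

print(zero_fraction)
print(overall_sparsity)  # 0.75
```

A column whose zero fraction is 1.0 is constant and can be dropped without losing information; columns that are merely mostly zero may still carry signal.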

In [None]:
# Drop features that are constant in the train set from both train and test
def drop_sparse_from_train_test(train_df1, test_df1):
    feature_cols = [x for x in train_df1.columns if x not in ['ID', 'target']]
    cols_to_drop = [f for f in feature_cols if train_df1[f].nunique() < 2]
    # errors='ignore' skips columns already removed from one of the frames
    train_df1 = train_df1.drop(columns=cols_to_drop, errors='ignore')
    test_df1 = test_df1.drop(columns=cols_to_drop, errors='ignore')
    return train_df1, test_df1

train_df1, test_df1 = drop_sparse_from_train_test(train_df1, test_df1)

### Checking for Feature Distribution 

In [None]:
print('Distributions of the  columns in  the dataset')

plt.figure(figsize=(40, 200))
for i, col in enumerate(list(train_df1.columns)[2:]):
    plt.subplot(50,4,i+1 ,);
    plt.hist(train_df1[col])
    plt.title(col)

In [None]:
# Distribution of columns per target class
# Note: this assumes a binary 0/1 target, so run it after the discretization step below
print("Distribution of columns per target class")
sns.set_style('darkgrid')
plt.figure(figsize=(40, 200))
for i, col in enumerate(list(train_df1.columns)[2:]):
    plt.subplot(50, 4, i+1)
    # kdeplot replaces the deprecated distplot(hist=False)
    sns.kdeplot(train_df1[train_df1['target']==0][col], label='0', color='green')
    sns.kdeplot(train_df1[train_df1['target']==1][col], label='1', color='red')


In [None]:
# Distribution of the features against the target variable
# Scatter plots and distribution curves
my_colors = ['blue', 'red']
sns.pairplot(train_df1,hue="target", palette=my_colors, corner=True)

### Discretization of the Continuous Target Variable 

* Discretization is a technique where a continuous variable is divided into a set of discrete intervals, called bins. Each bin represents a range of values, and each data point is assigned to the bin that contains its value.
* Entropy-based discretization uses the concept of entropy to determine the number of bins and the boundaries of each bin, looking for the binning that maximizes the information gain with respect to the target variable.
* The code below uses `KBinsDiscretizer` with `strategy='quantile'` and two bins, which is equal-frequency discretization: the target is split at its median. Unlike equal-width discretization, whose bin edges depend on the minimum and maximum, this split is not moved by extreme outliers.
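
To illustrate how a binning strategy interacts with outliers, the sketch below compares a two-bin equal-width split with a two-bin equal-frequency (median) split. The `target` array is a hypothetical toy example with one extreme value, not the project target.

```python
import numpy as np

# Hypothetical skewed target with one large outlier.
target = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 1000.0])

# Equal-width: the single cut sits midway between the minimum and maximum.
width_cut = (target.min() + target.max()) / 2
equal_width = (target >= width_cut).astype(int)

# Equal-frequency: the single cut sits at the median.
median_cut = np.median(target)
equal_freq = (target >= median_cut).astype(int)

print(np.bincount(equal_width))  # [7 1]
print(np.bincount(equal_freq))   # [4 4]
```

The outlier drags the equal-width cut far from the bulk of the data and leaves one class nearly empty, while the median cut yields two classes of equal size.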

In [None]:
from sklearn.preprocessing import KBinsDiscretizer

# Create the discretizer: two equal-frequency bins, i.e. a split at the median
est = KBinsDiscretizer(n_bins=2, encode='ordinal', strategy='quantile')

# Fit the discretizer to the target column and transform it in one step;
# ravel() flattens the (n, 1) output back into a 1-D column
target_2d = train_df1['target'].values.reshape(-1, 1)
train_df1['target'] = est.fit_transform(target_2d).ravel().astype(int)

### Target Variable distribution 

In [None]:
#target value count 
train_df1["target"].value_counts() 

In [None]:
# Target Variable Analysis
# Pass the column as x; recent seaborn versions treat the first positional argument as `data`
sns.countplot(x=train_df1["target"], palette='Set2')

**Observation**: The target column is balanced, i.e., the two classes are roughly equally represented, as expected from a two-bin quantile split.

### Checking for Feature Importance

#### Using Correlation

In [None]:
# Exclude the string ID column; corr() only applies to numeric columns
train_df1.drop(columns=['ID']).corr(method='pearson').style.background_gradient(cmap='rocket_r')

**General Interpretation of Pearson Correlation**
1. **Perfect**: Near ± 1
2. **High**: ± 0.50 to ± 1
3. **Moderate**: ± 0.30 to ± 0.49
4. **Low**: Below ± 0.29
5. **None**: 0
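
The scale above can be turned into a small helper. This is a sketch only: the function name `correlation_strength` is hypothetical, and the exact cut-offs are an assumption based on the scale above, with any non-zero magnitude under 0.30 treated as low.

```python
import numpy as np

def correlation_strength(r: float) -> str:
    """Map a Pearson coefficient to a qualitative label (assumed cut-offs)."""
    magnitude = abs(r)
    if np.isclose(magnitude, 1.0):
        return "perfect"
    if np.isclose(magnitude, 0.0):
        return "none"
    if magnitude < 0.30:
        return "low"
    if magnitude < 0.50:
        return "moderate"
    return "high"

# Example: y is an exact linear function of x, so r is 1 up to rounding.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2 * x + 1
r = np.corrcoef(x, y)[0, 1]

print(correlation_strength(r))      # perfect
print(correlation_strength(-0.35))  # moderate
```

The sign only indicates direction, so the helper classifies on the absolute value.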

**Observation**: 


In [None]:
# #checking for relevant feature 
# cor =dataset.corr()
# #Correlation with output variable
# cor_target = abs(cor["target"])
# #Selecting highly correlated features
# relevant_features = cor_target[cor_target>0.05]
# relevant_features

#### Using Collinearity

In [None]:
# from statsmodels.stats.outliers_influence import variance_inflation_factor
# def calc_vif(X):
#     # Calculating VIF
#     vif = pd.DataFrame()
#     vif["variables"] = X.columns
#     vif["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]

#     return(vif)
# result = calc_vif(dataset[dataset.columns.difference(['target', 'ID_code'], sort=False)])
# result
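
For reference, collinearity is commonly measured with the variance inflation factor, defined for feature i as 1 / (1 - R_i²), where R_i² comes from regressing feature i on the remaining features. The sketch below computes it in plain NumPy as the diagonal of the inverse correlation matrix; the helper name `vif_from_corr` and the synthetic columns are assumptions for illustration, and the values need not match `statsmodels` exactly, since that routine does not add an intercept on its own.

```python
import numpy as np

def vif_from_corr(X: np.ndarray) -> np.ndarray:
    """VIF of each column, read off the diagonal of the inverse correlation matrix."""
    corr = np.corrcoef(X, rowvar=False)
    return np.diag(np.linalg.inv(corr))

rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)              # independent of a
c = a + 0.1 * rng.normal(size=500)    # nearly a copy of a

vif = vif_from_corr(np.column_stack([a, b, c]))
print(vif)
```

A VIF close to 1 indicates little collinearity; values above roughly 5 to 10 are often taken as a sign that a feature is largely explained by the others.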

## Exporting Our New Shiny Clean Datasets

In [None]:
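# We then export the clean processed datasets for modelling
import os
# Create the output folder if needed; forward slashes match the paths used when reading back
os.makedirs('cleaned_data', exist_ok=True)
train_df1.to_csv('./cleaned_data/clean_train.csv', index=False)
test_df1.to_csv('./cleaned_data/clean_test.csv', index=False)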

In [None]:
# Reading the clean train dataset
clean_train = pd.read_csv("./cleaned_data/clean_train.csv")
print(f'The shape of the dataset is: {clean_train.shape}')
clean_train.head()

In [None]:
# Reading the clean test dataset
clean_test = pd.read_csv("./cleaned_data/clean_test.csv")
print(f'The shape of the dataset is: {clean_test.shape}')
clean_test.head()

## Splitting the Train Dataset into Train and Test for the Model

In [None]:
# Select main columns to be used in training
main_cols = clean_train.columns.difference(['ID', 'target'])
X = clean_train[main_cols]
y = clean_train.target

In [None]:
# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3, random_state=42)

print(X_train.shape)
print(X_test.shape)

## Testing Different Classifier Algorithms

---

### 1. Logistic regression

In [None]:
model_lr = LogisticRegression()
model_lr.fit(X_train, y_train)

# Prediction
y_pred_lr = model_lr.predict(X_test)

In [None]:
print('Accuracy is : ', accuracy_score(y_test, y_pred_lr))
print(f'F1 score on the X_test is: {f1_score(y_test, y_pred_lr)}')
print(' recall:', recall_score(y_test, y_pred_lr))
print(' precision:',precision_score(y_test, y_pred_lr))
print('Area under the ROC curve:' , roc_auc_score(y_test, y_pred_lr))
confusion = confusion_matrix(y_test, y_pred_lr)
print(f'Confusion Matrix on the X_test is:\n {confusion}')
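
The accuracy, precision, recall and F1 values printed above all follow from the four cells of the confusion matrix. A minimal sketch on hypothetical labels (`y_true` and `y_pred` below are toy arrays, not model output):

```python
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 1])

# Cells of the binary confusion matrix.
tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # true positives
fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # false positives
fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # false negatives
tn = int(np.sum((y_true == 0) & (y_pred == 0)))  # true negatives

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(tp, fp, fn, tn)               # 3 2 1 2
print(accuracy, precision, recall)  # 0.625 0.6 0.75
print(round(f1, 4))                 # 0.6667
```

F1 is the harmonic mean of precision and recall, so it stays low unless both are high.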

**F1 Score Board**

1.   Logistic Regression :

### 2. LGBM Classifier 

In [None]:
model_lgbm = LGBMClassifier()
model_lgbm.fit(X_train, y_train)

# Make predictions
y_pred_lgbm = model_lgbm.predict(X_test)

In [None]:
print('Accuracy is : ', accuracy_score(y_test, y_pred_lgbm))
print(f'F1 score on the X_test is: {f1_score(y_test, y_pred_lgbm)}')
print(' recall:', recall_score(y_test, y_pred_lgbm))
print(' precision:',precision_score(y_test, y_pred_lgbm))
print('Area under the ROC curve:' , roc_auc_score(y_test, y_pred_lgbm))
confusion = confusion_matrix(y_test, y_pred_lgbm)
print(f'Confusion Matrix on the X_test is:\n {confusion}')

**F1 Score Board**

1.   LGBMClassifier : 
2.   Logistic Regression : 

Improved by ****

### 3. Random Forest Classifier

In [None]:
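# The ID column is named 'ID'; excluding it keeps the string identifier out of the features
main_cols = clean_train.columns.difference(['ID', 'target'])
X = clean_train[main_cols]
y = clean_train.target
# Fix the seed so the split is reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 42)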

In [None]:
model_rf = RandomForestClassifier(criterion='entropy')   
model_rf.fit(X_train,y_train)

# Prediction
y_pred_rf = model_rf.predict(X_test)

In [None]:
# Check accuracy, F1 score and Confusion Matrix

print('Accuracy is : ', accuracy_score(y_test, y_pred_rf))
print(f'F1 score on the X_test is: {f1_score(y_test, y_pred_rf)}')
print(' recall:', recall_score(y_test, y_pred_rf))
print(' precision:',precision_score(y_test, y_pred_rf))
print('Area under the ROC curve:' , roc_auc_score(y_test, y_pred_rf))
confusion = confusion_matrix(y_test, y_pred_rf)
print(f'Confusion Matrix on the X_test is:\n {confusion}')

**F1 Score Board**

1.   RandomForestClassifier :
2.   LGBMClassifier :
3.   Logistic Regression :

Improved by ****


### 4. Gradient Boosting Classifier

In [None]:
model_gb = GradientBoostingClassifier(
    n_estimators = 400,
    learning_rate = 1.0,
    min_samples_leaf = 10,
    subsample = 1.0,
)
model_gb.fit(X_train, y_train)

# Prediction
y_pred_gb = model_gb.predict(X_test)

In [None]:
# Check accuracy, F1 score and Confusion Matrix

print('Accuracy is : ', accuracy_score(y_test, y_pred_gb))
print(f'F1 score on the X_test is: {f1_score(y_test, y_pred_gb)}')
print(' recall:', recall_score(y_test, y_pred_gb))
print(' precision:',precision_score(y_test, y_pred_gb))
print('Area under the ROC curve:' , roc_auc_score(y_test, y_pred_gb))
confusion = confusion_matrix(y_test, y_pred_gb)
print(f'Confusion Matrix on the X_test is:\n {confusion}')

**F1 Score Board**

1.   Gradient Classifier :
2.   RandomForestClassifier :
3.   LGBMClassifier :
4.   Logistic Regression :

Improved by ****

### 5. XGBoost Classifier

In [None]:
model_xg = XGBClassifier()
model_xg.fit(X_train, y_train)

#  Prediction
y_pred_xg = model_xg.predict(X_test)

In [None]:
print('Accuracy is : ', accuracy_score(y_test, y_pred_xg))
print(f'F1 score on the X_test is: {f1_score(y_test, y_pred_xg)}')
print(' recall:', recall_score(y_test, y_pred_xg))
print(' precision:',precision_score(y_test, y_pred_xg))
print('Area under the ROC curve:' , roc_auc_score(y_test, y_pred_xg))
confusion = confusion_matrix(y_test, y_pred_xg)
print(f'Confusion Matrix on the X_test is:\n {confusion}')

**F1 Score Board**

1.   Gradient Classifier :
2.   RandomForestClassifier :
3.   LGBMClassifier :
4.   Logistic Regression :
5.   XGBoost Classifier :

Improved by ****

### 6. Ensemble Classifier

In [None]:
# Making the final model using voting classifier
final_model = VotingClassifier(
	estimators=[('lr', model_lr), ('lgbm', model_lgbm)], voting='soft')

# training all the model on the train dataset
final_model.fit(X_train, y_train)

# predicting the output on the test dataset
pred_final = final_model.predict(X_test)

In [None]:
# Log loss needs predicted probabilities rather than hard class labels
print('Log loss:', log_loss(y_test, final_model.predict_proba(X_test)))
print('Accuracy is : ', accuracy_score(y_test, pred_final))
print(f'F1 score on the X_test is: {f1_score(y_test, pred_final)}')
print(' recall:', recall_score(y_test, pred_final))
print(' precision:',precision_score(y_test, pred_final))
print('Area under the ROC curve:' , roc_auc_score(y_test, pred_final))
confusion = confusion_matrix(y_test, pred_final)
print(f'Confusion Matrix on the X_test is:\n {confusion}')
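
Soft voting, as requested with `voting='soft'` above, averages the class probabilities of the member models and predicts the class with the highest average. The sketch below reproduces that arithmetic in NumPy; the probability arrays are made-up values standing in for the `predict_proba` output of two hypothetical models.

```python
import numpy as np

# Hypothetical class probabilities [P(class 0), P(class 1)] for three samples.
proba_lr = np.array([[0.60, 0.40], [0.30, 0.70], [0.55, 0.45]])
proba_lgbm = np.array([[0.20, 0.80], [0.40, 0.60], [0.70, 0.30]])

# Soft vote: unweighted mean of the probabilities, then argmax per row.
avg_proba = (proba_lr + proba_lgbm) / 2
soft_vote = avg_proba.argmax(axis=1)

# Log loss uses the probability assigned to the true class of each sample.
y_true = np.array([1, 1, 0])
p_true = avg_proba[np.arange(len(y_true)), y_true]
loss = float(-np.mean(np.log(p_true)))

print(soft_vote)       # [1 1 0]
print(round(loss, 4))  # 0.4705
```

Because log loss is computed on probabilities, feeding it hard 0/1 predictions discards the confidence information it is meant to score.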

**F1 Score Board**

1.   Gradient Classifier :
2.   RandomForestClassifier :
3.   LGBMClassifier :
4.   Logistic Regression :
5.   XGBoost Classifier :
6.   Ensemble :

Improved by ****

## Predicting The Test Dataset with our Star Model

In [None]:
# Make predictions on the cleaned, encoded test set (the raw test_df lacks the encoded feature columns)
test_features = clean_test[main_cols]
predictions = model_lgbm.predict(test_features)

In [None]:

sample_sub = pd.read_csv('../data/sample_submission.csv')
# Create a submission file
sub_file = sample_sub.copy()
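# Bracket assignment creates the column if it does not exist yet;
# attribute assignment would only set a Python attribute in that case
sub_file['predictions'] = predictions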

In [None]:
# Check the distribution of our predictions
sns.countplot(x=sub_file.predictions)

In [None]:
sub_file.to_csv('Unprocessed_Lr_Submission.csv', index = False)