<a href="https://www.kaggle.com/code/mayurdevarajpatil/airline-passenger-satisfaction-eda-model-building?scriptVersionId=144136670" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

### Loading Packages

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

### DATA
#### We have two CSV files here:

- #### train.csv : 
  This dataset will be used for training the model, i.e. our model will learn from this file. It contains all the feature variables and the target variable.

- #### test.csv : 
  This dataset contains all the feature variables, but not the target variable. We will apply the model to predict the target variable for the test data.

In [None]:
train=pd.read_csv('/kaggle/input/airline-passenger-satisfaction/train.csv')
test=pd.read_csv('/kaggle/input/airline-passenger-satisfaction/test.csv')

Let’s make a copy of train and test data so that even if we have to make any changes in these datasets we would not lose the original datasets.

In [None]:
org_train=train.copy()
org_test=test.copy()

In this section, we will look at the structure of the train and test datasets. Firstly, we will check the features present in our data and then we will look at their data types.

In [None]:
train.columns

We have 24 independent variables and 1 target variable, i.e. satisfaction in the train dataset. Let’s also have a look at the columns of test dataset.

In [None]:
test.columns

We have similar features in the test dataset as in the train dataset. We don't need the target variable in the test dataset, because we will predict satisfaction using the model built on the train data.

#### Dropping Irrelevant variables for both datasets

In [None]:
train=train.drop(columns=['Unnamed: 0','id'])
train.columns

In [None]:
test=test.drop(columns=['Unnamed: 0','satisfaction','id'])
test.columns

###### Given below is the description for each variable.

- Gender: Gender of the passengers (Female, Male)

- Customer Type: The customer type (Loyal customer, disloyal customer)

- Age: The actual age of the passengers

- Type of Travel: Purpose of the flight of the passengers (Personal Travel, Business Travel)

- Class: Travel class in the plane of the passengers (Business, Eco, Eco Plus)

- Flight distance: The flight distance of this journey

- Inflight wifi service: Satisfaction level of the inflight wifi service (0:Not Applicable;1-5)

- Departure/Arrival time convenient: Satisfaction level of Departure/Arrival time convenient

- Ease of Online booking: Satisfaction level of online booking

- Gate location: Satisfaction level of Gate location

- Food and drink: Satisfaction level of Food and drink

- Online boarding: Satisfaction level of online boarding

- Seat comfort: Satisfaction level of Seat comfort

- Inflight entertainment: Satisfaction level of inflight entertainment

- On-board service: Satisfaction level of On-board service

- Leg room service: Satisfaction level of Leg room service

- Baggage handling: Satisfaction level of baggage handling

- Check-in service: Satisfaction level of Check-in service

- Inflight service: Satisfaction level of inflight service

- Cleanliness: Satisfaction level of Cleanliness

- Departure Delay in Minutes: Delay at departure, in minutes

- Arrival Delay in Minutes: Delay at arrival, in minutes

- Satisfaction: Airline satisfaction level (satisfied, neutral or dissatisfied)
     

In [None]:
# Print data types for each variable 
train.dtypes

Let’s look at the shape of the dataset.

In [None]:
train.shape,test.shape

We have 103904 rows and 24 columns in the train dataset and 25976 rows and 23 columns in test dataset.

In this section, we will do univariate analysis. It is the simplest form of analyzing data where we examine each variable individually. For categorical features we can use frequency table or bar plots which will calculate the number of each category in a particular variable. For numerical features, probability density plots can be used to look at the distribution of the variable.
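
As a minimal sketch of the two kinds of summary described above, the snippet below builds a frequency table for a categorical column and summary statistics for a numerical one. The `toy` frame and its values are invented for illustration and are not rows from the dataset.

```python
import pandas as pd

# Hypothetical toy data; the values are invented for illustration only
toy = pd.DataFrame({
    "Class": ["Business", "Eco", "Eco", "Business", "Eco Plus", "Business"],
    "Age": [25, 41, 33, 58, 47, 36],
})

# Categorical feature: frequency table (counts) and share of each category
class_counts = toy["Class"].value_counts()
class_share = toy["Class"].value_counts(normalize=True)

# Numerical feature: summary statistics describing the distribution
age_summary = toy["Age"].describe()

print(class_counts)
print(class_share.round(2))
print(age_summary)
```

The same two calls, `value_counts` and `describe`, apply unchanged to the columns of `train`.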



### Target Variable
We will first look at the target variable, i.e., Satisfaction. As it is a categorical variable, let us look at its frequency table, percentage distribution and bar plot. Frequency table of a variable will give us the count of each category in that variable.

In [None]:
train.satisfaction.value_counts()

In [None]:
train.satisfaction.value_counts(normalize=True)

In [None]:
train.satisfaction.value_counts().plot.bar(sharey=True)


Out of 103904 passengers, 43% are satisfied.

Now let's visualize each variable separately.

### Categorical Variables:

In [None]:
Categorical = ["Gender","Customer Type","Type of Travel","Class"]
fig,ax=plt.subplots(2,2,figsize=(12,12))
for i,j in enumerate(Categorical):  
    train[j].value_counts().plot.pie(autopct='%0.1f%%',title=j,ax=ax[i//2,i%2])
# plt.subplots_adjust(wspace=.6,hspace=0.5)
plt.show()

In [None]:
plt.figure(figsize=(10,18))
# plt.style.use('ggplot')
plt.rcParams['figure.facecolor'] = '#FFF9ED'


plt.subplot(421)
train.Gender.value_counts().plot.pie(autopct='%0.1f%%',title='Gender')

plt.subplot(422) 
train['Customer Type'].value_counts().plot.pie(autopct='%0.1f%%',title='Customer Type')

plt.subplot(423) 
train.Class.value_counts().plot.pie(autopct='%0.1f%%',title='Class')

plt.subplot(424) 
train['Type of Travel'].value_counts().plot.pie(autopct='%0.1f%%',title='Type of Travel')


plt.show()

#### It can be inferred from the above pie charts that:
- There are more female passengers than male passengers.
- The number of loyal customers is much higher than the number of disloyal customers.
- Most passengers travel in Business class, followed by Eco class; very few passengers travel in Eco Plus class.
- 69% of the passengers are travelling for business and 31% are travelling for personal reasons.
- From the bar plot of the target variable, 43% of the passengers are satisfied with the airline, while the remaining 56.7% are neutral or dissatisfied.
- The density plot of Flight Distance (shown in the numerical variables section below) indicates that about 80% of the passengers travel less than 2000 km.

### Ordinal Variables:

In [None]:
Ordinal=['Inflight wifi service',
       'Departure/Arrival time convenient', 'Ease of Online booking',
       'Gate location', 'Food and drink', 'Online boarding', 'Seat comfort',
       'Inflight entertainment', 'On-board service', 'Leg room service',
       'Baggage handling', 'Checkin service', 'Inflight service',
       'Cleanliness']

In [None]:
for x in Ordinal:
    fig,ax=plt.subplots(1,2,figsize=(16,5))
    ax=ax.flatten()

    # One explode offset per category (len), not per distinct count value (nunique)
    ax[0].pie(train[x].value_counts(),autopct='%1.f%%',
              shadow=True, explode=[.1 for _ in range(len(train[x].value_counts()))],startangle=0,labels=train[x].value_counts().index
             )
#     ax[0].legend(train[x].value_counts().index)
#     ax[0].legend(loc='right')

    sns.countplot(data=train, y=x, ax=ax[1], order=train[x].value_counts().index[::-1])
    ax[1].set_xlabel('Count',fontsize=20)
    ax[1].set_ylabel(None)

    fig.suptitle(x,fontsize=20,fontweight='bold')
    plt.tight_layout()
    plt.show()

### It can be inferred from the above plots that:
- #### Inflight wifi service:
Most passengers rated the inflight wifi service 2 or 3. Some passengers gave a rating of 0, which in this dataset means "not applicable" rather than dissatisfaction.

- #### Departure/Arrival time convenient:
Most passengers rated the departure/arrival time convenience 4 or 5, which suggests that most of them are satisfied with it.

- #### Ease of Online booking:
A large share of passengers rated the ease of online booking 3, which suggests that the booking service may need improvement.

- #### Gate location:
Ratings of 3 and 4 are the most common, so passengers appear only somewhat satisfied with the gate location. The lower ratings may come from passengers who had to walk further than they expected.

- #### Food and drink:
Most passengers rated food and drink 4 or 5, so they appear satisfied. About one third of the passengers are not satisfied with the food and drink service.

- #### Online boarding:
Ratings of 4 and 3 are the most common, so passengers appear satisfied.

- #### Seat comfort:
Ratings of 4 and 5 are the most common, so passengers appear satisfied. 26% of the passengers are not satisfied with the seats of the plane.

- #### Inflight entertainment:
Most passengers are satisfied or neutral about the inflight entertainment. Ratings of 4 and 5 are the most common.

- #### On-board service:
Ratings of 4 and 5 are the most common, so passengers appear satisfied. About one quarter of the passengers are not satisfied with the on-board service.

- #### Leg room service:
Very few passengers rated 0 or 1, and ratings of 4 and 5 are the most common, so most passengers appear satisfied.

- #### Baggage handling:
Ratings of 4 and 5 are the most common, so passengers appear satisfied. Only 18% of the customers are not happy with the baggage handling service.

- #### Inflight service:
63% of the customers are happy with the inflight service. Ratings of 4 and 5 are the most common.

- #### Checkin service:
Passengers are mostly satisfied with this service; most of them rated it 4 or 3. Nearly one quarter of the customers are not satisfied with the check-in service.

- #### Cleanliness:
Very few passengers gave the lowest ratings (0 or 1). Nearly 30% of the passengers are not satisfied with the leg room service and the cleanliness of the plane.


### Numerical Variables:
So far we have looked at the categorical and ordinal variables; now let's visualize the numerical variables: Age, Flight Distance, Departure Delay in Minutes and Arrival Delay in Minutes.

Let's look at the distribution of Age first.

#### Age

In [None]:
sns.histplot(train['Age'])

In [None]:
plt.figure(1) 
plt.subplot(121) 
sns.histplot(train['Age'], kde=True, stat='density')  # replaces the deprecated sns.distplot
plt.subplot(122) 
train['Age'].plot.box(figsize=(16,5)) 
plt.show()

The histogram and box plot above show how Age is distributed. As noted in the outlier treatment section below, the box plot shows no outliers for Age, so this feature needs no outlier treatment.

#### Flight Distance

In [None]:
plt.figure(1) 
plt.subplot(121) 
sns.histplot(train['Flight Distance'], kde=True, stat='density')  # replaces the deprecated sns.distplot
plt.subplot(122) 
train['Flight Distance'].plot.box(figsize=(16,5)) 
plt.show()

#### Departure Delay in Minutes

In [None]:
plt.figure(1) 
plt.subplot(121) 
sns.histplot(train['Departure Delay in Minutes'], kde=True, stat='density')  # replaces the deprecated sns.distplot
plt.subplot(122) 
train['Departure Delay in Minutes'].plot.box(figsize=(16,5)) 
plt.show()

#### Arrival Delay in Minutes

In [None]:
plt.figure(1) 
plt.subplot(121) 
sns.histplot(train['Arrival Delay in Minutes'], kde=True, stat='density')  # replaces the deprecated sns.distplot
plt.subplot(122) 
train['Arrival Delay in Minutes'].plot.box(figsize=(16,5)) 
plt.show()

### Categorical  Variable v/s Target Variable
First, we will look at the relationship between the target variable and the categorical variables. The pie chart and bar plot below show the overall proportion of satisfied and unsatisfied passengers across the categories of each categorical variable.

In [None]:
for j in Categorical:
    fig,axes=plt.subplots(1,2,figsize=(16,8))
    ax=axes.flatten()
    
    # One explode offset per wedge (len), not per distinct count value (nunique)
    ax[0].pie(train.groupby('satisfaction')[j].value_counts(),autopct='%1.f%%',
              shadow=True, explode=[.1 for _ in range(len(train.groupby('satisfaction')[j].value_counts()))],
              startangle=0,
              )
    ax[0].legend(train.groupby('satisfaction')[j].value_counts().index)
    
    train.groupby('satisfaction')[j].value_counts().unstack().plot.bar(title=j,ax=ax[1],grid=True)
#     ax[1].xticks(rotation=90)
    
    fig.suptitle(j,fontsize=20,fontweight='bold')
    plt.tight_layout()
    plt.show()

#### It can be inferred from the above plots:
- The split between male and female passengers is more or less the same among satisfied passengers and among neutral or dissatisfied passengers.
- Loyal customers far outnumber disloyal customers among both satisfied and neutral or dissatisfied passengers.
- Among satisfied passengers, Business travel is far more common than Personal Travel, whereas among neutral or dissatisfied passengers the two types of travel are more or less equally common.
- Among satisfied passengers, Business class is far more common than Eco and Eco Plus class, whereas among neutral or dissatisfied passengers Eco class is more common than Business and Eco Plus class.

### Missing Value Imputation

Let’s list out feature-wise count of missing values.

In [None]:
train.isnull().sum()

There are missing values in __'Arrival Delay in Minutes'__ feature.

We will treat the missing values in the feature.

We have these methods to fill the missing values:

For numerical variables : imputation using mean or median.

For categorical variables : imputation using mode.

There are very few missing values in the __"Arrival Delay in Minutes"__ feature, so we can fill them using the median of the feature.
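
As a minimal sketch of why the choice of fill value matters, the snippet below compares the mean and the median of a small made-up series of delays containing one missing value and one extreme value. The `delays` values are assumptions chosen for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical delays in minutes with one missing value and one extreme value
delays = pd.Series([0.0, 5.0, 10.0, np.nan, 15.0, 300.0])

mean_fill = delays.mean()      # pulled upwards by the 300-minute delay
median_fill = delays.median()  # robust to the extreme value

# Assign the result back rather than relying on inplace=True
filled = delays.fillna(median_fill)

print(mean_fill, median_fill, filled.isnull().sum())
```

For a heavily right-skewed feature such as a delay, the median is usually the more representative fill value.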

In [None]:
# train.loc[train['Arrival Delay in Minutes'].isnull(),['Arrival Delay in Minutes']]=train['Arrival Delay in Minutes'].median()

In [None]:
train['Arrival Delay in Minutes']=train['Arrival Delay in Minutes'].fillna(train['Arrival Delay in Minutes'].median())

Now lets check whether all the missing values are filled in the dataset.

In [None]:
train.isnull().sum()

In [None]:
test.isnull().sum()

In [None]:
test['Arrival Delay in Minutes']=test['Arrival Delay in Minutes'].fillna(test['Arrival Delay in Minutes'].median())

In [None]:
test.isnull().sum()

As we can see, all the missing values have been filled in both the train and test datasets.

### Outlier Treatment
As we saw earlier in the univariate analysis, 'Flight Distance', 'Departure Delay in Minutes' and 'Arrival Delay in Minutes' contain outliers, so we have to treat them, because outliers affect the distribution of the data. Let’s examine what can happen to a data set with outliers. For the sample data set:

[1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4]

We find the following: mean, median, mode, and standard deviation

Mean               :   2.45

Median             :   2

Mode               :   2

Standard Deviation :   1.04


If we add an outlier to the data set:

[1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 400]

The new values of our statistics are:

Mean               :   35.58

Median             :   2.5

Mode               :   2

Standard Deviation :   114.77

It can be seen that having outliers often has a significant effect on the mean and standard deviation and hence affecting the distribution. We must take steps to remove outliers from our data sets.
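
The summary statistics for the two sample lists can be recomputed with Python's standard `statistics` module. This is only a small check, assuming the sample standard deviation (`statistics.stdev`) is the one intended.

```python
import statistics

base = [1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4]
with_outlier = base + [400]  # the same data with one extreme value added

for name, data in [("without outlier", base), ("with outlier", with_outlier)]:
    print(name,
          "mean:", round(statistics.mean(data), 2),
          "median:", statistics.median(data),
          "mode:", statistics.mode(data),
          "stdev:", round(statistics.stdev(data), 2))  # sample standard deviation
```

The mean and the standard deviation change drastically when the single extreme value is added, while the median and the mode barely move.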

We saw previously that there are no outliers in the Age feature, so it needs no treatment.

Let's see the Outliers present in the feature 'Flight Distance':


Due to these outliers bulk of the data in the 'Flight Distance' is at the left and the right tail is longer. This is called right skewness. One way to remove the skewness is by doing the log transformation. As we take the log transformation, it does not affect the smaller values much, but reduces the larger values. So, we get a distribution similar to normal distribution.
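
As a rough sketch of this effect, the snippet below measures the skewness of a small right-skewed series before and after a log transformation. The `distances` values are made up for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed distances: many short flights, a few very long ones
distances = pd.Series([200, 250, 300, 350, 400, 500, 600, 800, 1500, 4000], dtype=float)

skew_before = distances.skew()
skew_after = np.log(distances).skew()  # np.log requires strictly positive values

print(round(skew_before, 2), round(skew_after, 2))
```

The log compresses the large values far more than the small ones, so the skewness drops. If a feature can contain zeros, `np.log1p` is a common alternative because `np.log(0)` is negative infinity.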

Let’s visualize the effect of log transformation. We will do the similar changes to the test file simultaneously.

In [None]:
train['Flight Distance'].hist(bins=20)
plt.suptitle('Plot before outlier removing');

In [None]:
train['Flight Distance']=np.log(train['Flight Distance'])
test['Flight Distance']=np.log(test['Flight Distance'])  # transform the test column itself, not the train column

In [None]:
train['Flight Distance'].hist(bins=20)
plt.suptitle('Plot after outlier removing');

In [None]:
test['Flight Distance'].hist(bins=20)
plt.suptitle('Plot after outlier removing for test');

Now the distribution looks much closer to normal and the effect of extreme values has been significantly reduced.

In [None]:
train['Flight Distance'].plot.box() 

In the above plot we can see that the outliers have been significantly reduced.

Now let's look at the remaining features:

We will treat their outliers by replacing them with the mean (mean imputation).

##### Mean Imputation For 'Departure Delay in Minutes'


In [None]:
sns.boxplot(train['Departure Delay in Minutes'])
plt.title("Box Plot before mean imputation")
plt.show()
q1 = train['Departure Delay in Minutes'].quantile(0.25)
q3 = train['Departure Delay in Minutes'].quantile(0.75)
iqr = q3-q1
Lower_tail = q1 - 1.5 * iqr
Upper_tail = q3 + 1.5 * iqr
m = np.mean(train['Departure Delay in Minutes'])
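# Replace every value outside the IQR fences with the mean in one vectorized step
outliers = (train['Departure Delay in Minutes'] > Upper_tail) | (train['Departure Delay in Minutes'] < Lower_tail)
train['Departure Delay in Minutes'] = train['Departure Delay in Minutes'].astype(float).mask(outliers, m)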
sns.boxplot(train['Departure Delay in Minutes'])
plt.title("Box Plot after mean imputation")
plt.show()  

##### Mean Imputation For 'Departure Delay in Minutes' for test

In [None]:
sns.boxplot(test['Departure Delay in Minutes'])
plt.title("Box Plot before mean imputation")
plt.show()
q1 = test['Departure Delay in Minutes'].quantile(0.25)
q3 = test['Departure Delay in Minutes'].quantile(0.75)
iqr = q3-q1
Lower_tail = q1 - 1.5 * iqr
Upper_tail = q3 + 1.5 * iqr
m = np.mean(test['Departure Delay in Minutes'])
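# Replace every value outside the IQR fences with the mean in one vectorized step
outliers = (test['Departure Delay in Minutes'] > Upper_tail) | (test['Departure Delay in Minutes'] < Lower_tail)
test['Departure Delay in Minutes'] = test['Departure Delay in Minutes'].astype(float).mask(outliers, m)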
sns.boxplot(test['Departure Delay in Minutes'])
plt.title("Box Plot after mean imputation")
plt.show()  

##### Mean Imputation for 'Arrival Delay in Minutes'

In [None]:
sns.boxplot(train['Arrival Delay in Minutes'])
plt.title("Box Plot before mean imputation")
plt.show()
q1 = train['Arrival Delay in Minutes'].quantile(0.25)
q3 = train['Arrival Delay in Minutes'].quantile(0.75)
iqr = q3-q1
Lower_tail = q1 - 1.5 * iqr
Upper_tail = q3 + 1.5 * iqr
m = np.mean(train['Arrival Delay in Minutes'])
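# Replace every value outside the IQR fences with the mean in one vectorized step
outliers = (train['Arrival Delay in Minutes'] > Upper_tail) | (train['Arrival Delay in Minutes'] < Lower_tail)
train['Arrival Delay in Minutes'] = train['Arrival Delay in Minutes'].astype(float).mask(outliers, m)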
sns.boxplot(train['Arrival Delay in Minutes'])
plt.title("Box Plot after mean imputation")
plt.show()  

##### Mean Imputation for 'Arrival Delay in Minutes' for test

In [None]:
sns.boxplot(test['Arrival Delay in Minutes'])
plt.title("Box Plot before mean imputation")
plt.show()
q1 = test['Arrival Delay in Minutes'].quantile(0.25)
q3 = test['Arrival Delay in Minutes'].quantile(0.75)
iqr = q3-q1
Lower_tail = q1 - 1.5 * iqr
Upper_tail = q3 + 1.5 * iqr
m = np.mean(test['Arrival Delay in Minutes'])
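# Replace every value outside the IQR fences with the mean in one vectorized step
outliers = (test['Arrival Delay in Minutes'] > Upper_tail) | (test['Arrival Delay in Minutes'] < Lower_tail)
test['Arrival Delay in Minutes'] = test['Arrival Delay in Minutes'].astype(float).mask(outliers, m)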
sns.boxplot(test['Arrival Delay in Minutes'])
plt.title("Box Plot after mean imputation")
plt.show() 
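
The four cells above repeat the same steps for each column and dataset. One way to avoid the repetition is a small helper function. `replace_outliers_with_mean` below is a hypothetical name introduced only for this sketch, and the toy series is invented for illustration.

```python
import pandas as pd

def replace_outliers_with_mean(series: pd.Series) -> pd.Series:
    """Hypothetical helper: replace values outside the 1.5*IQR fences with the mean."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = (series < lower) | (series > upper)
    # Cast to float so the (generally non-integer) mean fits the column dtype
    return series.astype(float).mask(outliers, series.mean())

# Toy delays in minutes; 500 lies far above the upper fence
toy_delays = pd.Series([0, 0, 2, 5, 8, 10, 12, 15, 500])
cleaned = replace_outliers_with_mean(toy_delays)
print(cleaned.tolist())
```

Note that the mean is computed before the replacement, so it is itself pulled up by the outliers it replaces. Clipping to the fences or using the median are common alternatives.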

### LabelEncoding for train dataset

In [None]:
train.head()

In [None]:
from sklearn.preprocessing import LabelEncoder

lb=LabelEncoder()
train['Gender']=lb.fit_transform(train['Gender'])
train['Customer Type']=lb.fit_transform(train['Customer Type'])
train['Type of Travel']=lb.fit_transform(train['Type of Travel'])
train['Class']=lb.fit_transform(train['Class'])
train['satisfaction']=lb.fit_transform(train['satisfaction'])

In [None]:
train.head()

### LabelEncoding for test dataset

In [None]:
test.head()

In [None]:
test['Gender']=lb.fit_transform(test['Gender'])
test['Customer Type']=lb.fit_transform(test['Customer Type'])
test['Type of Travel']=lb.fit_transform(test['Type of Travel'])
test['Class']=lb.fit_transform(test['Class'])


In [None]:
test.head()
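
Refitting the encoder on the test data gives the same codes only when each column contains exactly the same set of categories in both datasets. A more defensive variant, sketched below on toy frames, fits one encoder per column on the train data and reuses it for the test data. The `encoders` dictionary and the toy frames are assumptions made for this illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frames standing in for train and test; the values are illustrative
toy_train = pd.DataFrame({"Class": ["Business", "Eco", "Eco Plus", "Eco"]})
toy_test = pd.DataFrame({"Class": ["Eco Plus", "Business"]})

encoders = {}  # hypothetical dict: column name -> fitted encoder
for col in ["Class"]:
    enc = LabelEncoder()
    toy_train[col] = enc.fit_transform(toy_train[col])  # learn the mapping on train
    toy_test[col] = enc.transform(toy_test[col])        # reuse the same mapping on test
    encoders[col] = enc

print(list(encoders["Class"].classes_))
print(toy_test["Class"].tolist())
```

Here the test frame lacks the "Eco" category, yet "Eco Plus" still maps to the same code as in the train frame because the mapping was learned once.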

X=train.drop(columns=['satisfaction'])
y=train['satisfaction']

Now let's look at the correlation between all the numerical variables. We will use a heatmap to visualize the correlation. Heatmaps visualize data through variations in coloring; with the colormap used below, more saturated cells indicate stronger correlation (red for positive, blue for negative).

In [None]:
train.drop(columns=['satisfaction']).corr() 

In [None]:
matrix = train.drop(columns=['satisfaction']).corr()

# Create a heatmap to visualize the correlation matrix
plt.figure(figsize=(20, 14))  # Adjust the figure size as needed
sns.set(font_scale=1.2)  # Adjust the font size as needed

# Create the heatmap with annotated values in each square
sns.heatmap(matrix, annot=True, cmap='coolwarm', fmt=".2f", square=True)

# Display the plot
plt.title("Correlation Matrix")
plt.show()

### Observations
- Departure Delay in Minutes is highly correlated with Arrival Delay in Minutes.
- Inflight entertainment, Food and drink, Seat comfort and Cleanliness are correlated with each other.
- Ease of Online booking is correlated with Inflight wifi service.
- Baggage handling is correlated with Inflight service.
- The dataframe is free of duplicate rows.
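
Observations like these can be cross-checked by listing the most strongly correlated pairs directly instead of reading them off the heatmap. The sketch below does this on synthetic stand-in data; the column names `dep_delay`, `arr_delay` and `age` are invented for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic stand-in data: 'arr_delay' is built to track 'dep_delay' closely
dep = rng.exponential(scale=15, size=500)
toy = pd.DataFrame({
    "dep_delay": dep,
    "arr_delay": dep + rng.normal(0, 2, size=500),
    "age": rng.integers(18, 80, size=500),
})

corr = toy.corr()
# Keep only the upper triangle so each pair appears once, then rank by |r|
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack().dropna().sort_values(key=np.abs, ascending=False)
print(pairs.head())
```

Applied to `train.drop(columns=['satisfaction']).corr()`, the same steps would rank the feature pairs by absolute correlation.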


### Splitting Data

In [None]:
X=train.drop(columns=['satisfaction'])
y=train['satisfaction']

In [None]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(X,y, test_size =0.3)

### Feature importance
Identification of important features using feature importance.

In [None]:
from sklearn.ensemble import RandomForestClassifier
clf=RandomForestClassifier()

In [None]:
clf.fit(x_train,y_train)

In [None]:
clf.score(x_test,y_test)

In [None]:
clf.feature_importances_

In [None]:
feature_importance = pd.DataFrame({'importance': clf.feature_importances_}, index= x_train.columns).sort_values('importance')
feature_importance

In [None]:
feature_importance.plot.barh()

In [None]:
feature_importance[feature_importance.importance > 0.03]#.index

In [None]:
to_keep = feature_importance[feature_importance.importance > 0.03].index

In [None]:
to_keep

The features above are the most important ones, so we will keep only these features in our train and test sets for model building.
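
The selection step can be sketched end to end on synthetic data. The data generated by `make_classification` and the names `X_toy`, `forest` and `selected` are assumptions for this example; only the 0.03 threshold is taken from the cells above.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data standing in for the passenger features
X_arr, y_arr = make_classification(n_samples=500, n_features=8, n_informative=3,
                                   n_redundant=0, random_state=0)
X_toy = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(8)])

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_toy, y_arr)
importance = pd.Series(forest.feature_importances_, index=X_toy.columns).sort_values()

threshold = 0.03  # same cut-off as used above
selected = importance[importance > threshold].index
X_selected = X_toy[selected]

print(importance.round(3))
print(list(selected))
```

The impurity-based importances of a random forest sum to 1, so a fixed threshold such as 0.03 becomes stricter as the number of features grows. The cut-off is a judgment call rather than a rule.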

In [None]:
x_train=x_train[to_keep]
x_test=x_test[to_keep]

x_train.head()

In [None]:
x_test.head()

### Model Building and Evaluation

### logistic regression model.

In [None]:
from sklearn.linear_model import LogisticRegression 
from sklearn.metrics import accuracy_score

lr = LogisticRegression() 
lr.fit(x_train, y_train)

In [None]:
pred_test=lr.predict(x_test)
accuracy_score(y_test,pred_test)

In [None]:
pred_test=lr.predict(test[to_keep])

In [None]:
pred_test

### Decision Tree

In [None]:
from sklearn import tree
dt = tree.DecisionTreeClassifier(criterion = 'entropy')
model = dt.fit(x_train[:100],y_train[:100])
model

In [None]:
dt.score(x_test,y_test)

In [None]:
from sklearn.tree import plot_tree

In [None]:
plt.figure(figsize=(35,30))
tree.plot_tree(model,
              filled=True,rounded=True)
plt.show()

### Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier(n_estimators=10,random_state=1, max_depth=10)
model = rfc.fit(x_train,y_train)

In [None]:
model.score(x_train,y_train),model.score(x_test,y_test)

##### Random Search CV

In [None]:
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV # for hyperparameter tuning

In [None]:
# Search max_depth from 1 to 19 in steps of 2 and n_estimators from 1 to 91 in steps of 10
paramgrid = {'max_depth': list(range(1, 20, 2)), 'n_estimators': list(range(1, 100, 10))}
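
To get a feel for the size of this search space, the short calculation below counts the parameter combinations and compares the number of cross-validation fits for a randomized search with those of an exhaustive grid search. The values `n_iter=12` and `cv=5` mirror the search cell that follows.

```python
# Same search space as the cell above
paramgrid = {'max_depth': list(range(1, 20, 2)), 'n_estimators': list(range(1, 100, 10))}

n_depths = len(paramgrid['max_depth'])          # 1, 3, ..., 19
n_estimators = len(paramgrid['n_estimators'])   # 1, 11, ..., 91
n_combinations = n_depths * n_estimators

n_iter, cv = 12, 5  # settings used for RandomizedSearchCV in this notebook
random_fits = n_iter * cv
grid_fits = n_combinations * cv

print(n_combinations, random_fits, grid_fits)
```

The randomized search therefore samples 12 of the 100 possible combinations, which keeps the run time manageable at the cost of possibly missing the best combination.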

In [None]:
rfc = RandomForestClassifier()


In [None]:
%%time
random_search = RandomizedSearchCV(rfc, param_distributions=paramgrid, n_iter=12, cv=5)
random_search.fit(x_train,y_train)

In [None]:
random_search.best_estimator_

In [None]:
random_search.best_score_


In [None]:
rfc=RandomForestClassifier(max_depth=11, n_estimators=91, random_state=1)
model=rfc.fit(x_train,y_train)

In [None]:
%%time
rfc.score(x_train,y_train),rfc.score(x_test,y_test)

In [None]:
df=pd.DataFrame({'score(x_train,y_train)':rfc.score(x_train,y_train),'score(x_test,y_test)':rfc.score(x_test,y_test)},index=['Random Forest'])
df

### XGBOOST

In [None]:
from xgboost import XGBClassifier

In [None]:
xgb=XGBClassifier(n_estimators=50, max_depth=4)
model=xgb.fit(x_train,y_train)
model

In [None]:
xgb.score(x_train,y_train),xgb.score(x_test,y_test)

In [None]:
# Add XGBoost as a new row instead of overwriting the Random Forest scores
df.loc['XGBoost']=[xgb.score(x_train,y_train),xgb.score(x_test,y_test)]
df

### KNN

In [None]:
from sklearn.neighbors import KNeighborsClassifier

In [None]:
knn=KNeighborsClassifier(n_neighbors=10)
knn.fit(x_train,y_train)

In [None]:
knn.score(x_train,y_train),knn.score(x_test,y_test)

### SVM

In [None]:
from sklearn.svm import SVC

In [None]:
%%time 
svm=SVC(kernel='poly',C=10)
svm.fit(x_train,y_train)

In [None]:
%%time 
svm.score(x_train,y_train),svm.score(x_test,y_test)

In [None]:
%%time 
svm=SVC(kernel='rbf',C=10)
svm.fit(x_train,y_train)

In [None]:
%%time 
svm.score(x_train,y_train),svm.score(x_test,y_test)

In [None]:
%%time 
svm=SVC(kernel='sigmoid',C=10)
svm.fit(x_train,y_train)

In [None]:
%%time 
svm.score(x_train,y_train),svm.score(x_test,y_test)

### NB

In [None]:
from sklearn.naive_bayes import GaussianNB
nb = GaussianNB()

In [None]:
%%time 
nb.fit(x_train,y_train)

In [None]:
%%time 
nb.score(x_train, y_train),nb.score(x_test, y_test)

### GB

In [None]:
from sklearn.ensemble import GradientBoostingClassifier
gb = GradientBoostingClassifier(n_estimators=100, learning_rate=1.0,
                                 max_depth=1, random_state=0)

In [None]:
%%time
gb.fit(x_train, y_train)

In [None]:
%%time
gb.score(x_train, y_train),gb.score(x_test, y_test)

### AdaBoost

In [None]:
from sklearn.ensemble import AdaBoostClassifier
abc=AdaBoostClassifier(n_estimators=100,random_state=1)

In [None]:
%%time
abc.fit(x_train,y_train)

In [None]:
%%time
abc.score(x_train,y_train),abc.score(x_test,y_test)

### LGBMClassifier

In [None]:
# pip install lightgbm
from lightgbm import LGBMClassifier
lgbm=LGBMClassifier()

In [None]:
%%time 
lgbm.fit(x_train,y_train)

In [None]:
%%time
lgbm.score(x_train,y_train),lgbm.score(x_test,y_test)

### CatBoostClassifier

In [None]:
# pip install catboost
from catboost import CatBoostClassifier
cb=CatBoostClassifier()

In [None]:
%%time 
cb.fit(x_train,y_train);

In [None]:
%%time
cb.score(x_train,y_train),cb.score(x_test,y_test)

### Voting

In [None]:
from sklearn.ensemble import VotingClassifier

In [None]:
vc=VotingClassifier(
    estimators=[('lr', lr), ('rfc', rfc), ('gnb', nb)],
    voting='soft')

In [None]:
%%time
vc.fit(x_train,y_train)

In [None]:
%%time
vc.score(x_train,y_train),vc.score(x_test,y_test)

In [None]:
vc=VotingClassifier(
    estimators=[('lr', lr), ('rfc', rfc), ('gnb', nb)],
    voting='hard')

In [None]:
%%time
vc.fit(x_train,y_train)

In [None]:
%%time
vc.score(x_train,y_train),vc.score(x_test,y_test)
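
The difference between the two `voting` modes can be illustrated with plain NumPy on made-up probabilities. This is a simplified sketch for the binary case, not scikit-learn's implementation, and the numbers in `proba` are invented.

```python
import numpy as np

# Hypothetical class-1 probabilities from three classifiers for four samples
proba = np.array([
    [0.90, 0.40, 0.45],   # sample 0
    [0.20, 0.30, 0.60],   # sample 1
    [0.55, 0.60, 0.10],   # sample 2
    [0.10, 0.20, 0.30],   # sample 3
])

# Hard voting: each model casts a 0/1 vote and the majority wins
votes = (proba >= 0.5).astype(int)
hard = (votes.sum(axis=1) >= 2).astype(int)

# Soft voting: average the probabilities, then threshold the average
soft = (proba.mean(axis=1) >= 0.5).astype(int)

print(hard.tolist(), soft.tolist())
```

Samples 0 and 2 show how the two modes can disagree: one very confident model can outweigh two lukewarm ones under soft voting, but not under hard voting.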

### Multi-layer Perceptron

In [None]:
from sklearn.neural_network import MLPClassifier
mlc=MLPClassifier()

In [None]:
%%time
mlc.fit(x_train,y_train)

In [None]:
%%time
mlc.score(x_train,y_train),mlc.score(x_test,y_test)

### Running 10 Classifiers 
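We will create a dataframe containing "Balanced Accuracy", "Training Accuracy", "Testing Accuracy", "F1 Score", "Precision" and "Recall" for all the classifiers. Note that the "Balanced Accuracy" column in the code below is filled with the ROC AUC score computed from the predicted probabilities (`roc_auc_score`), not with scikit-learn's balanced accuracy.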
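
As a reminder of what these metrics measure, the sketch below computes them by hand from the four confusion-matrix counts. The labels in `y_true` and `y_hat` are invented for illustration.

```python
import numpy as np

# Hypothetical true labels and predictions (1 = satisfied)
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_hat = np.array([1, 1, 1, 0, 0, 0, 0, 0, 1, 1])

tp = int(((y_true == 1) & (y_hat == 1)).sum())  # true positives
tn = int(((y_true == 0) & (y_hat == 0)).sum())  # true negatives
fp = int(((y_true == 0) & (y_hat == 1)).sum())  # false positives
fn = int(((y_true == 1) & (y_hat == 0)).sum())  # false negatives

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)            # true positive rate
specificity = tn / (tn + fp)       # true negative rate
f1 = 2 * precision * recall / (precision + recall)
balanced_accuracy = (recall + specificity) / 2

print(accuracy, precision, recall, round(f1, 3), round(balanced_accuracy, 3))
```

Balanced accuracy averages the recall of both classes, so it is computed from hard predictions, whereas ROC AUC is computed from predicted probabilities. The two are different quantities.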

In [None]:
import pandas as pd
from sklearn.metrics import classification_report, roc_auc_score, accuracy_score, f1_score, precision_score, recall_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from xgboost import XGBClassifier
from catboost import CatBoostClassifier
from tabulate import tabulate

# Create a list of classifiers
classifiers = [
    ("Logistic Regression", LogisticRegression()),
    ("Decision Tree", DecisionTreeClassifier()),
    ("Random Forest", RandomForestClassifier()),
    ("Gradient Boosting", GradientBoostingClassifier()),
    ("K-Nearest Neighbors", KNeighborsClassifier()),
    ("Gaussian Naive Bayes", GaussianNB()),
    ("Multi-layer Perceptron", MLPClassifier()),
    ("XGBoost", XGBClassifier()),
    ("CatBoost", CatBoostClassifier()), 
    ("AdaBoost", AdaBoostClassifier())
]

# Initialize an empty DataFrame to store the results
results_df = pd.DataFrame(columns=["Classifier", "Balanced Accuracy", "Training Accuracy", "Testing Accuracy", "F1 Score", "Precision", "Recall"])
Classifier=[]
Balanced_Accuracy=[]
Training_Accuracy=[]
Testing_Accuracy=[]
F1_Score=[]
Precision=[]
Recall=[]

# Train and evaluate each classifier
for name, clf in classifiers:
    clf.fit(x_train, y_train)
    y_pred = clf.predict(x_test)
    y_prob = clf.predict_proba(x_test)[:, 1]
    auc_score = roc_auc_score(y_test, y_prob)
    accuracy_train = accuracy_score(y_train, clf.predict(x_train))
    accuracy_test = accuracy_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    
    
    Classifier.append(name)
    Balanced_Accuracy.append(auc_score)
    Training_Accuracy.append(accuracy_train)
    Testing_Accuracy.append(accuracy_test)
    F1_Score.append(f1)
    Precision.append(precision)
    Recall.append(recall)
    
results_df = pd.DataFrame({
    "Classifier": Classifier,
    "Balanced Accuracy": Balanced_Accuracy,
    "Training Accuracy": Training_Accuracy,
    "Testing Accuracy": Testing_Accuracy,
    "F1 Score": F1_Score,
    "Precision": Precision,
    "Recall": Recall
})

In [None]:
# Sort the DataFrame by Balanced Accuracy in descending order
results_df = results_df.sort_values(by="Balanced Accuracy", ascending=False)

# Print the results table with styling
styled_results = results_df.style.background_gradient(cmap='Blues', subset=["Balanced Accuracy", "Training Accuracy", "Testing Accuracy", "F1 Score", "Precision", "Recall"])
# styled_results = styled_results.hide_index()

styled_results

In [None]:
results_df[['Balanced Accuracy','Training Accuracy','Testing Accuracy']].plot.bar()

#### The CatBoost and XGBoost models stand out as the best performers, showing remarkable accuracy and precision.

In [None]:
from catboost import CatBoostClassifier, Pool
from sklearn.metrics import roc_auc_score, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns

# Assuming you have your training and testing data as x_train, y_train, x_test, and y_test

# Create a CatBoost classifier
catboost_model = CatBoostClassifier(iterations=100, depth=6, learning_rate=0.1, loss_function='Logloss', verbose=0)

# Train the model
catboost_model.fit(x_train, y_train)

# Make predictions
y_pred = catboost_model.predict(x_test)
y_prob = catboost_model.predict_proba(x_test)[:, 1]

# Calculate AUC score
auc_score = roc_auc_score(y_test, y_prob)
print(f"AUC Score: {auc_score:.4f}")

# Generate and display the confusion matrix
cm = confusion_matrix(y_test, y_pred)

plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues", cbar=False)
plt.xlabel('Predicted Labels')
plt.ylabel('True Labels')
plt.title('Confusion Matrix')
plt.show()

### Conclusion

From the outset, our goal was to address a pressing problem through a systematic approach. We began by defining the problem and identifying its key aspects. Subsequently, we conducted an exploratory data analysis (EDA) to gain insights into our dataset, revealing valuable patterns and trends.

Next, we ventured into the modeling phase, where we experimented with various machine learning classifiers. After rigorous testing and evaluation, two standout models emerged: CatBoost and XGBoost. These models consistently outperformed the rest, exhibiting exceptional balanced accuracy, testing accuracy, F1 score, precision, and recall.

As a result, we confidently recommend employing CatBoost or XGBoost as the primary models for solving our problem. Their robust performance is a testament to their ability to provide accurate predictions and drive meaningful outcomes. By leveraging these models, we can make informed decisions and address our problem with confidence. This journey, from problem definition to model selection, has been both illuminating and productive. It underscores the power of data-driven approaches in solving complex challenges and reinforces the significance of selecting the right tools and methodologies for the task at hand.