## LOGISTIC REGRESSION:-
Logistic regression solves classification problems: it predicts a categorical outcome, whereas linear regression predicts a continuous one.

In the simplest case there are only two possible outcomes; this is called binomial (binary) logistic regression.
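
As a minimal sketch of the idea, binary logistic regression passes a linear score through the sigmoid function to get a probability and then thresholds it at 0.5. The weights, bias, and inputs below are made-up numbers chosen only for illustration; they are not learned from any data.

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights and bias, for illustration only
w = np.array([0.8, -1.2])
b = 0.1

# Two example rows with two features each (made-up values)
x = np.array([[1.0, 0.5],
              [0.2, 2.0]])

scores = x @ w + b                    # linear part, as in linear regression
probs = sigmoid(scores)               # estimated probability of class 1
labels = (probs >= 0.5).astype(int)   # binary decision at the 0.5 threshold
print(probs, labels)
```

The only difference from linear regression in this sketch is the sigmoid step, which squeezes the score into a probability.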

In [179]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import warnings

warnings.filterwarnings('ignore')

In [180]:
# First, load the dataset.
# This section uses the Titanic dataset that ships with seaborn.
df = sns.load_dataset('titanic')

In [181]:
df.head()

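|   | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | S | Third | man | True | NaN | Southampton | no | False |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | S | Third | woman | False | NaN | Southampton | yes | True |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 | S | First | woman | False | C | Southampton | yes | False |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 | S | Third | man | True | NaN | Southampton | no | True |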


In [182]:
df.columns

Index(['survived', 'pclass', 'sex', 'age', 'sibsp', 'parch', 'fare',
       'embarked', 'class', 'who', 'adult_male', 'deck', 'embark_town',
       'alive', 'alone'],
      dtype='object')

In [183]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   survived     891 non-null    int64   
 1   pclass       891 non-null    int64   
 2   sex          891 non-null    object  
 3   age          714 non-null    float64 
 4   sibsp        891 non-null    int64   
 5   parch        891 non-null    int64   
 6   fare         891 non-null    float64 
 7   embarked     889 non-null    object  
 8   class        891 non-null    category
 9   who          891 non-null    object  
 10  adult_male   891 non-null    bool    
 11  deck         203 non-null    category
 12  embark_town  889 non-null    object  
 13  alive        891 non-null    object  
 14  alone        891 non-null    bool    
dtypes: bool(2), category(2), float64(2), int64(4), object(5)
memory usage: 80.7+ KB


In [184]:
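# Drop columns that duplicate other columns (who, adult_male, class, embark_town, alive)
# or that are mostly missing (deck has only 203 non-null values)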

df.drop(['who', 'adult_male','class', 'deck', 'embark_town','alive'], axis=1, inplace=True)

In [185]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 9 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   survived  891 non-null    int64  
 1   pclass    891 non-null    int64  
 2   sex       891 non-null    object 
 3   age       714 non-null    float64
 4   sibsp     891 non-null    int64  
 5   parch     891 non-null    int64  
 6   fare      891 non-null    float64
 7   embarked  889 non-null    object 
 8   alone     891 non-null    bool   
dtypes: bool(1), float64(2), int64(4), object(2)
memory usage: 56.7+ KB


In [186]:
# Now only two columns have null values: age and embarked.
# Fill the missing ages with the column mean.
# Assigning the result back avoids the chained inplace call, which newer pandas versions warn about.
df['age'] = df['age'].fillna(df['age'].mean())

In [187]:
# The two missing 'embarked' values could be filled with the mode instead
# (mode() returns a Series, so take its first entry):
# df['embarked'] = df['embarked'].fillna(df['embarked'].mode()[0])
# Here we drop those two rows instead, to show how this kind of operation works.
# After this step the dataset has 889 rows instead of 891.
df.dropna(subset=['embarked'], inplace=True)
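
The three ways of handling missing values mentioned above (mean fill, mode fill, dropping rows) can be compared on a toy frame. This is only a sketch: the `toy` DataFrame and its values are made up and stand in for the Titanic columns.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the Titanic columns (values are made up)
toy = pd.DataFrame({
    "age": [20.0, np.nan, 40.0],
    "embarked": ["S", None, "S"],
})

# Option 1: fill a numeric column with its mean
filled = toy.copy()
filled["age"] = filled["age"].fillna(filled["age"].mean())

# Option 2: fill a categorical column with its most frequent value.
# mode() returns a Series, so take the first entry with [0].
filled["embarked"] = filled["embarked"].fillna(filled["embarked"].mode()[0])

# Option 3: drop the rows where the column is missing
dropped = toy.dropna(subset=["embarked"])

print(filled)
print(len(dropped))
```

Dropping is reasonable when only a handful of rows are affected, as with the two `embarked` rows here; filling keeps every row at the cost of inventing values.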


In [188]:
# Check the dataset again: no null values remain
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 889 entries, 0 to 890
Data columns (total 9 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   survived  889 non-null    int64  
 1   pclass    889 non-null    int64  
 2   sex       889 non-null    object 
 3   age       889 non-null    float64
 4   sibsp     889 non-null    int64  
 5   parch     889 non-null    int64  
 6   fare      889 non-null    float64
 7   embarked  889 non-null    object 
 8   alone     889 non-null    bool   
dtypes: bool(1), float64(2), int64(4), object(2)
memory usage: 63.4+ KB


In [189]:
# Next comes the encoding step.
# Import LabelEncoder, which turns text categories into integers.
from sklearn.preprocessing import LabelEncoder
label = LabelEncoder()


In [190]:
df.head()

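|   | survived | pclass | sex | age | sibsp | parch | fare | embarked | alone |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | S | False |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | False |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | S | True |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 | S | False |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 | S | True |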


In [191]:
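# Only two columns still have the object dtype: 'sex' and 'embarked'.
# LabelEncoder assigns integers in sorted order (female=0, male=1; C=0, Q=1, S=2).
df['sex'] = label.fit_transform(df['sex'])
df['embarked'] = label.fit_transform(df['embarked'])

# Finally, cast every column to int.
# Note: this truncates the decimals of 'age' and 'fare' (7.25 becomes 7)
# and turns the boolean 'alone' into 0/1.
df = df.astype(int)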
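
To check which integer each category receives, the fitted encoder's `classes_` attribute can be inspected. The sketch below uses a short made-up list rather than the Titanic column.

```python
from sklearn.preprocessing import LabelEncoder

enc = LabelEncoder()
codes = enc.fit_transform(["male", "female", "female", "male"])

# classes_ is sorted, and each class is encoded as its position in that array
mapping = {str(cls): code for code, cls in enumerate(enc.classes_)}
print(codes)
print(mapping)

# inverse_transform recovers the original text labels
print(enc.inverse_transform(codes))
```

Re-fitting the same encoder on a second column overwrites the first mapping, so a separate encoder per column is needed if the original labels must be recovered later.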

In [192]:
df.head()

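|   | survived | pclass | sex | age | sibsp | parch | fare | embarked | alone |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | 1 | 22 | 1 | 0 | 7 | 2 | 0 |
| 1 | 1 | 1 | 0 | 38 | 1 | 0 | 71 | 0 | 0 |
| 2 | 1 | 3 | 0 | 26 | 0 | 0 | 7 | 2 | 1 |
| 3 | 1 | 1 | 0 | 35 | 1 | 0 | 53 | 2 | 0 |
| 4 | 0 | 3 | 1 | 35 | 0 | 0 | 8 | 2 | 1 |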


In [193]:
# Feature scaling could also be applied to this dataset.
# Logistic regression does not strictly require it (although it can help the solver converge),
# so we move straight to the logistic regression part.

In [194]:
# Split the dataset into features X and target y:
# X holds every column except 'survived',
# and y holds the 'survived' column.

X = df.drop('survived',axis=1)
y = df['survived']

In [195]:
# Import the helper for the train/test split
from sklearn.model_selection import train_test_split


In [196]:
X_train, X_test, y_train, y_test = train_test_split(X, y,
     test_size=0.2, random_state=42)
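
A small sketch of what `test_size` and `random_state` do, on made-up data (the `X_toy` and `y_toy` arrays are illustrative only). The `stratify` argument at the end is optional and is not used in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 10 rows, 2 features, balanced binary target
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0, 1] * 5)

# test_size=0.2 keeps 20% of the rows for testing; random_state fixes the shuffle
X_a, X_b, y_a, y_b = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)
print(X_a.shape, X_b.shape)

# The same random_state reproduces the same split
_, X_b2, _, _ = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)
print((X_b == X_b2).all())

# Optional: stratify keeps the class proportions equal in both parts
_, _, _, y_strat = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy)
print(sorted(y_strat.tolist()))
```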

In [197]:
# Import the logistic regression estimator and create the model

from sklearn.linear_model import LogisticRegression
model = LogisticRegression()

In [198]:
# Train the model on the training data
model.fit(X_train, y_train)

In [199]:
# The model is trained, so let's predict.
# Here we pass the test features to it.
y_pred = model.predict(X_test)
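
Behind `predict`, the model estimates a probability for each class; `predict_proba` exposes it. The sketch below uses a tiny made-up dataset (`X_toy`, `y_toy`) to show that thresholding the class-1 probability at 0.5 gives the same labels as `predict`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny made-up dataset: one feature, label is 1 for the larger values
X_toy = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression().fit(X_toy, y_toy)

proba = clf.predict_proba(X_toy)[:, 1]   # estimated P(class = 1) for each row
manual = (proba > 0.5).astype(int)       # threshold the probabilities at 0.5
print(np.round(proba, 2))
print(manual, clf.predict(X_toy))
```

Looking at the probabilities is useful when a threshold other than 0.5 suits the problem better.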

In [200]:
y_pred

array([0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
       0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0,
       0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0,
       0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0,
       0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0,
       0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0,
       0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1,
       0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0,
       0, 1])

In [201]:
y_test

281    0
435    1
39     1
418    0
585    1
      ..
433    0
807    0
25     1
85     1
10     1
Name: survived, Length: 178, dtype: int32

In [202]:
# Comparing the first few predictions with y_test, the model looks to be on the right track.
# In linear regression we compared predictions with the true values and used R^2 to test the model.
# For logistic regression we build a confusion matrix for the evaluation instead.
# Let's see how to do that.

In [203]:
# CONFUSION MATRIX:-
# Logistic regression is a classification algorithm.
# Here the target is binary (0 or 1), so every prediction falls into one of four outcomes:
#                      PREDICTED DATA:
#                   1                0
#   SURVIVED :-
#  ACTUAL DATA: 1   TP              FN
#               0   FP              TN

# TRUE POSITIVE / TRUE NEGATIVE   :- correct predictions
# FALSE POSITIVE                  :- Type I error (predicted 1, actual 0)
# FALSE NEGATIVE                  :- Type II error (predicted 0, actual 1)
# Note: sklearn's confusion_matrix sorts the labels, so its output is [[TN, FP], [FN, TP]].
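
The diagram above puts class 1 first, while scikit-learn's `confusion_matrix` sorts the labels in ascending order. The sketch below, on made-up labels, shows both layouts and how to unpack the four counts.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels, only to show the layout
y_true = [1, 1, 1, 0, 0, 0, 0]
y_hat = [1, 1, 0, 0, 0, 0, 1]

# Default: rows = actual, columns = predicted, labels sorted as (0, 1)
cm = confusion_matrix(y_true, y_hat)
tn, fp, fn, tp = cm.ravel()
print(cm)
print(tn, fp, fn, tp)

# Passing labels=[1, 0] gives the TP/FN over FP/TN layout drawn in the diagram
cm_pos_first = confusion_matrix(y_true, y_hat, labels=[1, 0])
print(cm_pos_first)
```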


In [204]:
# Several metrics can be calculated from the confusion matrix:
# 1. Accuracy
# 2. Precision
# 3. Recall
# 4. F1 score

# Let's go through them one by one.

In [205]:
# Accuracy : the share of all predictions that the model got right
#   Accuracy = (TP + TN) / (TP + TN + FP + FN)

# Precision : the share of "positive" predictions that were actually correct
#   Precision = TP / (TP + FP)

# Recall (sensitivity) : the share of actual positives that the model detected
#   Recall = TP / (TP + FN)

# F1-Score : the harmonic mean of precision and recall, balancing their trade-off
#   F1 = 2 * Precision * Recall / (Precision + Recall)
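
The four formulas can be written out directly. This is a sketch with hypothetical counts; `metrics_from_counts` is an illustrative helper, not a scikit-learn function, and it does not guard against division by zero.

```python
def metrics_from_counts(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Hypothetical counts, for illustration only
acc, prec, rec, f1 = metrics_from_counts(tp=8, fp=2, fn=4, tn=6)
print(round(acc, 3), round(prec, 3), round(rec, 3), round(f1, 3))
```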

In [206]:
# Now let's compute these metrics to check our model's predictions.
# First, import the metric functions.
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

In [207]:
# Check each metric in turn.
# Each function takes the true test labels first and the predictions second.
accuracy_score(y_test , y_pred)

0.8033707865168539

In [208]:
confusion_matrix(y_test , y_pred)

array([[90, 19],
       [16, 53]], dtype=int64)

In [209]:
print(classification_report(y_test , y_pred))

              precision    recall  f1-score   support

           0       0.85      0.83      0.84       109
           1       0.74      0.77      0.75        69

    accuracy                           0.80       178
   macro avg       0.79      0.80      0.79       178
weighted avg       0.81      0.80      0.80       178



In [None]:
# That completes the logistic regression model.
# The model performs reasonably well: accuracy is about 80%,
# with an F1 score of 0.84 for class 0 (did not survive) and 0.75 for class 1 (survived).


In [211]:
# Next, let's move on to another type of model.
# We reuse the same dataset and the same train/test split,
# so that we can try different models and see which one works best for us.


## KNN :- K Nearest Neighbour

KNN is a simple, supervised machine learning (ML) algorithm that can be used for classification or regression tasks, and it is also frequently used for missing value imputation. It is based on the idea that the observations closest to a given data point are the most "similar" observations in a data set, so we can classify unseen points based on the values of the closest existing points. By choosing K, the user selects the number of nearby observations the algorithm uses.

Here, we will show you how to implement the KNN algorithm for classification, and show how different values of K affect the results.
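
Before using the scikit-learn implementation, the idea can be sketched from scratch with NumPy. This is a simplified illustration on made-up points (Euclidean distance, plain majority vote), not how scikit-learn implements KNN internally; `knn_predict`, `X_demo`, and `y_demo` are hypothetical names.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training points."""
    dists = np.linalg.norm(X_train - x_new, axis=1)   # Euclidean distances
    nearest = np.argsort(dists)[:k]                   # indices of the k closest points
    votes = Counter(y_train[nearest].tolist())        # count the labels of those points
    return votes.most_common(1)[0][0]

# Made-up 2-D points: class 0 near the origin, class 1 near (5, 5)
X_demo = np.array([[0, 0], [1, 0], [0, 1], [5, 5], [6, 5], [5, 6]], dtype=float)
y_demo = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_demo, y_demo, np.array([0.5, 0.5]), k=3))
print(knn_predict(X_demo, y_demo, np.array([5.5, 5.5]), k=3))
```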

In [212]:
df

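|   | survived | pclass | sex | age | sibsp | parch | fare | embarked | alone |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | 1 | 22 | 1 | 0 | 7 | 2 | 0 |
| 1 | 1 | 1 | 0 | 38 | 1 | 0 | 71 | 0 | 0 |
| 2 | 1 | 3 | 0 | 26 | 0 | 0 | 7 | 2 | 1 |
| 3 | 1 | 1 | 0 | 35 | 1 | 0 | 53 | 2 | 0 |
| 4 | 0 | 3 | 1 | 35 | 0 | 0 | 8 | 2 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 886 | 0 | 2 | 1 | 27 | 0 | 0 | 13 | 2 | 1 |
| 887 | 1 | 1 | 0 | 19 | 0 | 0 | 30 | 2 | 1 |
| 888 | 0 | 3 | 0 | 29 | 1 | 2 | 23 | 2 | 0 |
| 889 | 1 | 1 | 1 | 26 | 0 | 0 | 30 | 0 | 1 |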


In [213]:
# For a KNN model we need to scale the features first, because it works on distances to neighbours.
# A feature with large values (such as fare) could otherwise dominate the distance and distort the model.
# So first, import the scaler.
from sklearn.preprocessing import StandardScaler

In [214]:
scaler = StandardScaler()

In [215]:
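# Fit the scaler on the training data only, then reuse it on the test data.
# Calling fit_transform on X_test would re-estimate the mean and std from the test set,
# so scores can differ slightly from a run that re-fits the scaler on X_test.
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)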
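
The difference between `fit_transform` and `transform` is easiest to see on a tiny example. The arrays below are made up; the point is that the test data should be scaled with the mean and standard deviation learned from the training data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[2.0], [4.0]])

sc = StandardScaler()
train_scaled = sc.fit_transform(train)   # learns the mean and std from train only
test_scaled = sc.transform(test)         # reuses the training mean and std

# Re-fitting on the test data uses different statistics and gives different values
refit = StandardScaler().fit_transform(test)

print(sc.mean_)
print(test_scaled.ravel())
print(refit.ravel())
```

With re-fitting, the same raw value would map to different scaled values in the training and test sets, which is why the scaler is fitted once.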

In [216]:
# Now import the KNN classifier and create the model (the default is n_neighbors=5)
from sklearn.neighbors import KNeighborsClassifier

knn_model = KNeighborsClassifier()

In [217]:
knn_model.fit(X_train_scaled, y_train)


In [218]:
y_pred_knn = knn_model.predict(X_test_scaled)

In [222]:
accuracy_score(y_test , y_pred_knn)

0.7808988764044944

In [220]:
confusion_matrix(y_test, y_pred_knn)

array([[90, 19],
       [20, 49]], dtype=int64)

In [221]:
print(classification_report(y_test , y_pred_knn))

              precision    recall  f1-score   support

           0       0.82      0.83      0.82       109
           1       0.72      0.71      0.72        69

    accuracy                           0.78       178
   macro avg       0.77      0.77      0.77       178
weighted avg       0.78      0.78      0.78       178
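
The cells above use the default K (`n_neighbors=5` in scikit-learn). To see how different values of K affect the results, one could loop over a few values and compare test accuracy. The sketch below uses synthetic data from `make_classification` as a stand-in, which is an assumption made to keep the example self-contained; in the notebook, the Titanic `X_train`, `X_test`, `y_train`, and `y_test` would be used instead. The list of K values is arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption; swap in the notebook's X and y)
X_syn, y_syn = make_classification(n_samples=300, n_features=6, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.2, random_state=42)

scores = {}
for k in (1, 3, 5, 11, 21):
    # The pipeline fits the scaler on the training data only
    pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    pipe.fit(X_tr, y_tr)
    scores[k] = pipe.score(X_te, y_te)   # test accuracy for this k

for k, acc in scores.items():
    print(f"k={k:>2}  accuracy={acc:.3f}")
```

Very small K tends to follow noise in the training data, while very large K smooths over real structure, so a value in between is usually chosen, ideally with cross-validation rather than the test set.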

