## What is Logistic Regression?


Logistic regression predicts a categorical outcome from one or more independent (predictor) variables. Suppose there are two classes and we need to decide which class a new data point belongs to. The algorithm computes a probability between 0 and 1 that the point belongs to a given class, and the point is assigned to a class by comparing that probability with a threshold (commonly 0.5).
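
As a rough illustration of how a score becomes a probability, the sketch below applies the logistic (sigmoid) function to a linear combination of features. The intercept, coefficients, and feature values are made-up numbers for illustration only; they are not fitted to the Titanic data.

```python
import numpy as np

def sigmoid(z):
    """Map a real-valued score to a probability in the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical values, chosen only to demonstrate the calculation
intercept = 0.5
coefs = np.array([-1.0, 0.8])
x_new = np.array([1.2, -0.3])

score = intercept + coefs @ x_new        # linear combination of the features
p_class_1 = sigmoid(score)               # estimated probability of class 1
predicted_class = int(p_class_1 >= 0.5)  # assign a class with a 0.5 threshold

print(p_class_1, predicted_class)
```

A score of 0 maps to a probability of exactly 0.5, so the sign of the linear score decides the class when the threshold is 0.5.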


This is a full list of the variables:

| Variable Name | Variable Description |
| --- | --- |
| PassengerID | Passenger identification number |
| Survived | Whether a passenger survived or not: 0 = No, 1 = Yes |
| Pclass | Ticket class: 1 = 1st, 2 = 2nd, 3 = 3rd |
| sex | Sex (stored in the dataset as the `Male` indicator column) |
| Age | Age in years |
| sibsp | # of siblings / spouses aboard the Titanic |
| parch | # of parents / children aboard the Titanic |
| Pclass_Male | Interaction of Pclass and Male (Pclass_Male = Pclass×Male) |
| Pclass_Age | Interaction of Pclass and Age (Pclass_Age = Pclass×Age) |
| Male_Age | Interaction of Male and Age (Male_Age = Male×Age) |
| Age_sibsp | Interaction of Age and sibsp (Age_sibsp = Age×sibsp) |
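
The interaction variables are plain element-wise products of two columns. The sketch below rebuilds them with pandas for three passengers whose values are copied from the first rows of `Titanic.csv` shown later in this notebook. The name `sample` is only for illustration; the file already ships with these columns precomputed.

```python
import pandas as pd

# Three passengers, values copied from the first rows of the dataset
sample = pd.DataFrame({
    "Pclass": [3, 2, 3],
    "Male":   [1, 1, 0],
    "Age":    [0.42, 0.67, 0.75],
    "SibSp":  [0, 1, 2],
})

# Each interaction term is the product of its two parent columns
sample["Pclass_Male"] = sample["Pclass"] * sample["Male"]
sample["Pclass_Age"] = sample["Pclass"] * sample["Age"]
sample["Male_Age"] = sample["Male"] * sample["Age"]
sample["Age_SibSp"] = sample["Age"] * sample["SibSp"]

print(sample)
```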

### Categorical outcome Variable: Survived (1 = did survive, 0 = did not survive).


In [1]:
import pandas as pd
import numpy as np

In [3]:
df = pd.read_csv("Titanic.csv")

In [5]:
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Male,Age,SibSp,Parch,Fare,Pclass_Male,Pclass_Age,Male_Age,Age_SibSp
0,804,1,3,"Thomas, Master. Assad Alexander",1,0.42,0,1,8.5167,3,1.26,0.42,0.0
1,756,1,2,"Hamalainen, Master. Viljo",1,0.67,1,1,14.5,2,1.34,0.67,0.67
2,470,1,3,"Baclini, Miss. Helene Barbara",0,0.75,2,1,19.2583,0,2.25,0.0,1.5
3,645,1,3,"Baclini, Miss. Eugenie",0,0.75,2,1,19.2583,0,2.25,0.0,1.5
4,79,1,2,"Caldwell, Master. Alden Gates",1,0.83,0,2,29.0,2,1.66,0.83,0.0


In [6]:
df.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Male,Age,SibSp,Parch,Fare,Pclass_Male,Pclass_Age,Male_Age,Age_SibSp
count,714.0,714.0,714.0,714.0,714.0,714.0,714.0,714.0,714.0,714.0,714.0,714.0
mean,448.582633,0.406162,2.236695,0.634454,29.699118,0.512605,0.431373,34.694514,1.481793,61.938151,19.494636,11.066415
std,259.119524,0.49146,0.83825,0.481921,14.526497,0.929783,0.853289,52.91893,1.300697,34.379609,18.864076,19.093099
min,1.0,0.0,1.0,0.0,0.42,0.0,0.0,0.0,0.0,0.92,0.0,0.0
25%,222.25,0.0,1.0,0.0,20.125,0.0,0.0,8.05,0.0,38.0,0.0,0.0
50%,445.0,0.0,2.0,1.0,28.0,0.0,0.0,15.7417,1.0,58.0,20.0,0.0
75%,677.75,1.0,3.0,1.0,38.0,1.0,1.0,33.375,3.0,81.0,32.0,20.0
max,891.0,1.0,3.0,1.0,80.0,5.0,6.0,512.3292,3.0,222.0,80.0,106.0


In [8]:
df.isnull().sum()

PassengerId    0
Survived       0
Pclass         0
Name           0
Male           0
Age            0
SibSp          0
Parch          0
Fare           0
Pclass_Male    0
Pclass_Age     0
Male_Age       0
Age_SibSp      0
dtype: int64

In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 714 entries, 0 to 713
Data columns (total 13 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  714 non-null    int64  
 1   Survived     714 non-null    int64  
 2   Pclass       714 non-null    int64  
 3   Name         714 non-null    object 
 4   Male         714 non-null    int64  
 5   Age          714 non-null    float64
 6   SibSp        714 non-null    int64  
 7   Parch        714 non-null    int64  
 8   Fare         714 non-null    float64
 9   Pclass_Male  714 non-null    int64  
 10  Pclass_Age   714 non-null    float64
 11  Male_Age     714 non-null    float64
 12  Age_SibSp    714 non-null    float64
dtypes: float64(5), int64(7), object(1)
memory usage: 72.6+ KB


## Splitting the dataset

import itertools # construct specialized tools
import matplotlib.pyplot as plt # visualizations
from matplotlib import rcParams # plot size customization
from termcolor import colored as cl # text customization

X_var = np.asarray(df[['Pclass', 'Male', 'Age', 'SibSp', 'Parch', 'Fare']])
y_var = np.asarray(df['Survived'])

print(cl('X_var samples : ', attrs = ['bold']), X_var[:5])
print(cl('y_var samples : ', attrs = ['bold']), y_var[:5])

Using the ‘StandardScaler’ class in scikit-learn, we are going to standardize the independent variables (the ‘X’ matrix) so that each column has zero mean and unit variance. The following code standardizes X in Python.

In [21]:
from sklearn.model_selection import train_test_split # splitting the data
from sklearn.linear_model import LogisticRegression # model algorithm
from sklearn.preprocessing import StandardScaler # data normalization
from sklearn.metrics import f1_score, log_loss # evaluation metrics
from sklearn.metrics import precision_score # evaluation metric
from sklearn.metrics import classification_report # evaluation metric
from sklearn.metrics import confusion_matrix # evaluation metric

# Standardize each column of X to zero mean and unit variance
X_var = StandardScaler().fit_transform(X_var)

print(cl(X_var[:5], attrs = ['bold']))

[[ 0.91123237  0.75905134 -2.01697919 -0.55170307  0.66686178 -0.49502447]
 [-0.28256564  0.75905134 -1.99975719  0.52457013  0.66686178 -0.38187981]
 [ 0.91123237 -1.31743394 -1.99424616  1.60084334  0.66686178 -0.29189999]
 [ 0.91123237 -1.31743394 -1.99424616  1.60084334  0.66686178 -0.29189999]
 [-0.28256564  0.75905134 -1.98873512 -0.55170307  1.83961871 -0.1076837 ]]
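
To see where these numbers come from, the first standardized row can be recomputed by hand as (value − mean) / standard deviation. The sketch below uses the means and standard deviations printed by `df.describe()` above. One detail to keep in mind: `describe()` reports the sample standard deviation (ddof=1), while `StandardScaler` divides by the population standard deviation (ddof=0), so the values are rescaled first. The variable names are illustrative only.

```python
import numpy as np

n = 714  # row count reported by df.describe()

# Column order matches X_var: Pclass, Male, Age, SibSp, Parch, Fare
mean = np.array([2.236695, 0.634454, 29.699118, 0.512605, 0.431373, 34.694514])
sample_std = np.array([0.83825, 0.481921, 14.526497, 0.929783, 0.853289, 52.91893])

# Convert the sample std (ddof=1) to the population std (ddof=0)
population_std = sample_std * np.sqrt((n - 1) / n)

# First passenger shown by df.head()
first_row = np.array([3, 1, 0.42, 0, 1, 8.5167])

z = (first_row - mean) / population_std
print(np.round(z, 4))
```

The result should agree with the first row of the standardized array printed above, up to the rounding in the `describe()` statistics.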
