# Logistic Regression

Logistic regression (also called logit regression) is a statistical model that uses the logistic function to model a binary dependent variable. Fitting the model means estimating the coefficients of that function from data.

Typical uses include detecting spam in emails or predicting whether a customer will default on a loan.
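
To make the role of the logistic function concrete, here is a minimal sketch of how it turns a linear score into a probability and then into a class label. The coefficients `b0` and `b1` and the input values are made up for illustration; they are not fitted to any data in this notebook.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients of a one-feature model: p = sigmoid(b0 + b1 * x)
b0, b1 = -1.0, 0.5
x = np.array([-2.0, 0.0, 2.0, 6.0])

probabilities = sigmoid(b0 + b1 * x)          # estimated P(y = 1 | x)
labels = (probabilities >= 0.5).astype(int)   # threshold at 0.5 to get a class

print(probabilities)
print(labels)
```

A score of zero maps to a probability of exactly 0.5, which is why 0.5 is the usual decision threshold.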

In [4]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

## Dataset

In [7]:
df = pd.read_csv("../xdata/titanic.csv")
df.head(5)

| | pclass | survived | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked | boat | body | home.dest |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 1.0 | Allen, Miss. Elisabeth Walton | female | 29.0 | 0.0 | 0.0 | 24160 | 211.3375 | B5 | S | 2.0 | NaN | St Louis, MO |
| 1 | 1.0 | 1.0 | Allison, Master. Hudson Trevor | male | 0.9167 | 1.0 | 2.0 | 113781 | 151.55 | C22 C26 | S | 11.0 | NaN | Montreal, PQ / Chesterville, ON |
| 2 | 1.0 | 0.0 | Allison, Miss. Helen Loraine | female | 2.0 | 1.0 | 2.0 | 113781 | 151.55 | C22 C26 | S | NaN | NaN | Montreal, PQ / Chesterville, ON |
| 3 | 1.0 | 0.0 | Allison, Mr. Hudson Joshua Creighton | male | 30.0 | 1.0 | 2.0 | 113781 | 151.55 | C22 C26 | S | NaN | 135.0 | Montreal, PQ / Chesterville, ON |
| 4 | 1.0 | 0.0 | Allison, Mrs. Hudson J C (Bessie Waldo Daniels) | female | 25.0 | 1.0 | 2.0 | 113781 | 151.55 | C22 C26 | S | NaN | NaN | Montreal, PQ / Chesterville, ON |


Data exploration

In [9]:
data = df[["pclass","sex","age","survived"]]
data.shape

(1310, 4)

In [10]:
data.describe()

| | pclass | age | survived |
|---|---|---|---|
| count | 1309.0 | 1046.0 | 1309.0 |
| mean | 2.294882 | 29.881135 | 0.381971 |
| std | 0.837836 | 14.4135 | 0.486055 |
| min | 1.0 | 0.1667 | 0.0 |
| 25% | 2.0 | 21.0 | 0.0 |
| 50% | 3.0 | 28.0 | 0.0 |
| 75% | 3.0 | 39.0 | 1.0 |
| max | 3.0 | 80.0 | 1.0 |


In [11]:
data.head()

| | pclass | sex | age | survived |
|---|---|---|---|---|
| 0 | 1.0 | female | 29.0 | 1.0 |
| 1 | 1.0 | male | 0.9167 | 1.0 |
| 2 | 1.0 | female | 2.0 | 0.0 |
| 3 | 1.0 | male | 30.0 | 0.0 |
| 4 | 1.0 | female | 25.0 | 0.0 |


In [12]:
data.tail()

| | pclass | sex | age | survived |
|---|---|---|---|---|
| 1305 | 3.0 | female | NaN | 0.0 |
| 1306 | 3.0 | male | 26.5 | 0.0 |
| 1307 | 3.0 | male | 27.0 | 0.0 |
| 1308 | 3.0 | male | 29.0 | 0.0 |
| 1309 | NaN | NaN | NaN | NaN |


## Preprocessing

Finding missing values

In [15]:
print(data.isna().sum(axis=0))

pclass        1
sex           1
age         264
survived      1
dtype: int64


Handling missing values

In [17]:
data = data.dropna(subset=["sex","pclass","survived"])
print(data.isna().sum(axis=0))

pclass        0
sex           0
age         263
survived      0
dtype: int64


In [18]:
data.tail()

| | pclass | sex | age | survived |
|---|---|---|---|---|
| 1304 | 3.0 | female | 14.5 | 0.0 |
| 1305 | 3.0 | female | NaN | 0.0 |
| 1306 | 3.0 | male | 26.5 | 0.0 |
| 1307 | 3.0 | male | 27.0 | 0.0 |
| 1308 | 3.0 | male | 29.0 | 0.0 |


Label Encoding

In [19]:
data["sex"] = data["sex"].map({"male":1,"female":0})
data["sex"]

0       0
1       1
2       0
3       1
4       0
       ..
1304    0
1305    0
1306    1
1307    1
1308    1
Name: sex, Length: 1309, dtype: int64

In [20]:
data.head()

| | pclass | sex | age | survived |
|---|---|---|---|---|
| 0 | 1.0 | 0 | 29.0 | 1.0 |
| 1 | 1.0 | 1 | 0.9167 | 1.0 |
| 2 | 1.0 | 0 | 2.0 | 0.0 |
| 3 | 1.0 | 1 | 30.0 | 0.0 |
| 4 | 1.0 | 0 | 25.0 | 0.0 |


## Feature Extraction

In [21]:
features = data[["sex","age","pclass"]]
target = data[["survived"]]
features.head()

| | sex | age | pclass |
|---|---|---|---|
| 0 | 0 | 29.0 | 1.0 |
| 1 | 1 | 0.9167 | 1.0 |
| 2 | 0 | 2.0 | 1.0 |
| 3 | 1 | 30.0 | 1.0 |
| 4 | 0 | 25.0 | 1.0 |


In [22]:
target.head()

| | survived |
|---|---|
| 0 | 1.0 |
| 1 | 1.0 |
| 2 | 0.0 |
| 3 | 0.0 |
| 4 | 0.0 |


Imputing missing values

In [23]:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
features = imputer.fit_transform(features)

In [24]:
features

array([[ 0.    , 29.    ,  1.    ],
       [ 1.    ,  0.9167,  1.    ],
       [ 0.    ,  2.    ,  1.    ],
       ...,
       [ 1.    , 26.5   ,  3.    ],
       [ 1.    , 27.    ,  3.    ],
       [ 1.    , 29.    ,  3.    ]])
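
To see what mean imputation does on a scale small enough to check by hand, the following sketch applies the same `SimpleImputer` settings to a three-row toy array. The values are invented for illustration and are not taken from the Titanic data.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy array with the same column order as above: sex, age, pclass.
toy = np.array([
    [0.0, 20.0, 1.0],
    [1.0, np.nan, 3.0],   # missing age
    [1.0, 40.0, 3.0],
])

toy_imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
filled = toy_imputer.fit_transform(toy)

# The missing age is replaced by the mean of the observed ages: (20 + 40) / 2.
print(filled)
print(toy_imputer.statistics_)  # per-column means learned during fit
```

Columns without missing entries pass through unchanged, so only the `age` column is affected here.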

Split dataset

In [25]:
feature_train, feature_test, target_train, target_test = train_test_split(features,target)
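
The call above uses the defaults, so the split (and every score derived from it) changes from run to run. As a hedged sketch, the example below shows how `random_state` and `stratify` could make a split reproducible and keep the class ratio the same in both parts. The arrays `X` and `y`, the seed `0`, and the 25% test size are illustrative choices, not values used in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 20 samples, 2 features, 60% class 0 and 40% class 1.
X = np.arange(40).reshape(20, 2)
y = np.array([0] * 12 + [1] * 8)

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.25,    # hold out a quarter of the rows
    random_state=0,    # fixed seed: the same split on every run
    stratify=y,        # keep the class proportions in both parts
)

print(X_train.shape, X_test.shape)
print(y_train.mean(), y_test.mean())  # share of class 1 in each part
```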

## Model Training

In [26]:
model = LogisticRegression()
# fit expects a 1-D target; flatten the single-column DataFrame to avoid the column_or_1d warning
model.fit(feature_train, target_train.to_numpy().ravel())
predictions = model.predict(feature_test)
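
In this notebook the imputer is fitted on all rows before the split, so the mean age is computed partly from rows that later land in the test set. One possible alternative, sketched below on synthetic data, is to chain the imputer and the classifier in a pipeline so that the imputation statistics come from the training rows only. The generated columns, the noisy toy target, and the seeds are assumptions made for this example.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-ins for the sex, age and pclass columns.
sex = rng.integers(0, 2, n).astype(float)
age = rng.uniform(1, 80, n)
pclass = rng.integers(1, 4, n).astype(float)
age[rng.random(n) < 0.2] = np.nan            # knock out about 20% of the ages

X = np.column_stack([sex, age, pclass])
y = ((sex == 0) ^ (rng.random(n) < 0.2)).astype(int)   # noisy toy target

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The imputer is fitted inside the pipeline, on the training rows only.
pipe = make_pipeline(SimpleImputer(strategy="mean"), LogisticRegression())
pipe.fit(X_train, y_train)

proba = pipe.predict_proba(X_test)   # column 1 holds the estimated P(y = 1)
print(proba[:5])
print(pipe.score(X_test, y_test))    # accuracy on the held-out rows
```

`predict_proba` exposes the probabilities behind the hard labels returned by `predict`, which is useful when a threshold other than 0.5 is wanted.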


Performance

In [27]:
print(confusion_matrix(target_test,predictions))
print(accuracy_score(target_test,predictions))

[[172  30]
 [ 40  86]]
0.7865853658536586
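
The accuracy can be recomputed from the confusion matrix printed above, together with precision, recall and F1 for the "survived" class. The sketch below hard-codes that matrix and relies on scikit-learn's layout, in which rows are true labels and columns are predicted labels.

```python
import numpy as np

# Confusion matrix copied from the output above: [[TN, FP], [FN, TP]].
cm = np.array([[172, 30],
               [40, 86]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()    # share of all predictions that are correct
precision = tp / (tp + fp)         # of predicted survivors, how many survived
recall = tp / (tp + fn)            # of actual survivors, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy  = {accuracy:.4f}")
print(f"precision = {precision:.4f}")
print(f"recall    = {recall:.4f}")
print(f"f1        = {f1:.4f}")
```

Recall is lower than precision here: the model misses 40 of the 126 actual survivors in the test set, which accuracy alone does not show.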
