# Logistic Regression with Python

For this lecture we will be working with the Adult income data set (`adult.csv`), a widely used benchmark for binary classification.

We'll be trying to predict a binary class: whether a person's income is above 50K (`>50K`) or not (`<=50K`).
Let's begin our understanding of implementing Logistic Regression in Python for classification.

The file is used mostly as-is: missing entries appear as `?` rather than as nulls, so you may want to do some additional cleaning not shown in this lecture notebook.

## Import Libraries
Let's import some libraries to get started!

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Load Data

Let's start by reading the adult.csv file into a pandas DataFrame.

In [None]:
df = pd.read_csv(r'https://github.com/kaopanboonyuen/Python-Data-Science/raw/master/Dataset/adult_data/adult.csv')
df.head()

Unnamed: 0,age,workclass,fnlwgt,education,educational-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country,income
0,25,Private,226802,11th,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States,<=50K
1,38,Private,89814,HS-grad,9,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States,<=50K
2,28,Local-gov,336951,Assoc-acdm,12,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States,>50K
3,44,Private,160323,Some-college,10,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,United-States,>50K
4,18,?,103497,Some-college,10,Never-married,?,Own-child,White,Female,0,0,30,United-States,<=50K


In [None]:
df.columns


Index(['age', 'workclass', 'fnlwgt', 'education', 'educational-num',
       'marital-status', 'occupation', 'relationship', 'race', 'gender',
       'capital-gain', 'capital-loss', 'hours-per-week', 'native-country',
       'income'],
      dtype='object')

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48842 entries, 0 to 48841
Data columns (total 15 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   age              48842 non-null  int64 
 1   workclass        48842 non-null  object
 2   fnlwgt           48842 non-null  int64 
 3   education        48842 non-null  object
 4   educational-num  48842 non-null  int64 
 5   marital-status   48842 non-null  object
 6   occupation       48842 non-null  object
 7   relationship     48842 non-null  object
 8   race             48842 non-null  object
 9   gender           48842 non-null  object
 10  capital-gain     48842 non-null  int64 
 11  capital-loss     48842 non-null  int64 
 12  hours-per-week   48842 non-null  int64 
 13  native-country   48842 non-null  object
 14  income           48842 non-null  object
dtypes: int64(6), object(9)
memory usage: 5.6+ MB


In [None]:
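# Keep only the columns used for modelling. The .copy() gives an independent
# frame, so overwriting 'income' in the next cell does not trigger
# pandas' SettingWithCopyWarning.
df = df[['age', 'workclass', 'education', 'educational-num', 'marital-status',
         'occupation', 'relationship', 'race', 'gender', 'hours-per-week',
         'native-country', 'income']].copy()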


In [None]:
# Convert income to binary values: 1 for '>50K' and 0 for '<=50K'
print(df['income'].dtype)
print(df['income'].unique())
df['income'] = df['income'].apply(lambda x: 1 if x == '>50K' else 0)
df.head(100)

object
['<=50K' '>50K']


Unnamed: 0,age,workclass,education,educational-num,marital-status,occupation,relationship,race,gender,hours-per-week,native-country,income
0,25,Private,11th,7,Never-married,Machine-op-inspct,Own-child,Black,Male,40,United-States,0
1,38,Private,HS-grad,9,Married-civ-spouse,Farming-fishing,Husband,White,Male,50,United-States,0
2,28,Local-gov,Assoc-acdm,12,Married-civ-spouse,Protective-serv,Husband,White,Male,40,United-States,1
3,44,Private,Some-college,10,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,40,United-States,1
4,18,?,Some-college,10,Never-married,?,Own-child,White,Female,30,United-States,0
...,...,...,...,...,...,...,...,...,...,...,...,...
95,20,Private,HS-grad,9,Never-married,Handlers-cleaners,Own-child,White,Male,40,United-States,0
96,25,Private,Bachelors,13,Never-married,Exec-managerial,Own-child,White,Female,40,United-States,0
97,49,Private,10th,6,Married-civ-spouse,Farming-fishing,Husband,White,Male,40,United-States,0
98,59,Private,HS-grad,9,Married-civ-spouse,Transport-moving,Husband,White,Male,40,United-States,1


## Check Missing Data

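`df.info()` lists the non-null count for every column, which gives a quick check for missing data. Note that this data set marks missing values with `?` (see `workclass` and `occupation` in row 4 of the preview above), so they are not counted as nulls here.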

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48842 entries, 0 to 48841
Data columns (total 12 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   age              48842 non-null  int64 
 1   workclass        48842 non-null  object
 2   education        48842 non-null  object
 3   educational-num  48842 non-null  int64 
 4   marital-status   48842 non-null  object
 5   occupation       48842 non-null  object
 6   relationship     48842 non-null  object
 7   race             48842 non-null  object
 8   gender           48842 non-null  object
 9   hours-per-week   48842 non-null  int64 
 10  native-country   48842 non-null  object
 11  income           48842 non-null  int64 
dtypes: int64(4), object(8)
memory usage: 4.5+ MB
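
`df.info()` reports no nulls, yet the preview rows contain `?` entries. Here is a minimal sketch of counting such placeholders. It uses a small hand-built frame (`sample`, an illustrative name, with three rows copied from the preview) rather than the full data set, and it assumes `?` is the only placeholder, which is not verified here:

```python
import pandas as pd

# Three rows copied from the preview; `sample` is an illustrative name.
sample = pd.DataFrame({
    "age": [25, 38, 18],
    "workclass": ["Private", "Private", "?"],
    "occupation": ["Machine-op-inspct", "Farming-fishing", "?"],
})

# isnull() finds nothing, because '?' is an ordinary string value.
null_counts = sample.isnull().sum()

# isin(['?']) marks the placeholder cells; summing counts them per column.
placeholder_counts = sample.isin(["?"]).sum()

print(null_counts)
print(placeholder_counts)
```

On the full frame the same check would be `df.isin(['?']).sum()`. Whether to drop those rows or keep `?` as its own category is a modelling choice.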


## Converting Categorical Features

We'll need to convert categorical features to dummy variables using pandas! Otherwise our machine learning algorithm won't be able to directly take in those features as inputs.

In [None]:
workclass = pd.get_dummies(df['workclass'], drop_first=True)
education = pd.get_dummies(df['education'], drop_first=True)
marital_status = pd.get_dummies(df['marital-status'], drop_first=True)
occupation = pd.get_dummies(df['occupation'], drop_first=True)
relationship = pd.get_dummies(df['relationship'], drop_first=True)
race = pd.get_dummies(df['race'], drop_first=True)
gender = pd.get_dummies(df['gender'], drop_first=True)
native_country = pd.get_dummies(df['native-country'], drop_first=True)
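
To see what `drop_first=True` does, here is a small sketch on a toy frame (`toy` is an illustrative name, not part of the notebook). It also shows the DataFrame form of `pd.get_dummies`, which encodes the listed columns in one call but prefixes the new column names:

```python
import pandas as pd

# Toy frame with one categorical column; `toy` is an illustrative name.
toy = pd.DataFrame({"age": [25, 38, 18], "gender": ["Male", "Female", "Male"]})

# Series form, as in the notebook: one indicator column per category, minus
# the first one in sorted order ('Female'), so only 'Male' remains.
gender_dummies = pd.get_dummies(toy["gender"], drop_first=True)

# DataFrame form: replaces the listed columns with their dummies and
# prefixes each new column with the original column name.
encoded = pd.get_dummies(toy, columns=["gender"], drop_first=True)

print(list(gender_dummies.columns))
print(list(encoded.columns))
```

Dropping the first category avoids a redundant column: with two genders, `Male = False` already implies `Female`.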


In [None]:
df.info()
df['income'].head(3)


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48842 entries, 0 to 48841
Data columns (total 12 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   age              48842 non-null  int64 
 1   workclass        48842 non-null  object
 2   education        48842 non-null  object
 3   educational-num  48842 non-null  int64 
 4   marital-status   48842 non-null  object
 5   occupation       48842 non-null  object
 6   relationship     48842 non-null  object
 7   race             48842 non-null  object
 8   gender           48842 non-null  object
 9   hours-per-week   48842 non-null  int64 
 10  native-country   48842 non-null  object
 11  income           48842 non-null  int64 
dtypes: int64(4), object(8)
memory usage: 4.5+ MB


0    0
1    0
2    1
Name: income, dtype: int64


In [None]:
df.drop(['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race', 'gender', 'native-country'], axis=1, inplace=True)


In [None]:
# Append all dummy-variable frames to the numeric columns in a single concat
df = pd.concat(
    [df, workclass, education, marital_status, occupation,
     relationship, race, gender, native_country],
    axis=1,
)


In [None]:
df.head()

Unnamed: 0,age,educational-num,hours-per-week,income,Federal-gov,Local-gov,Never-worked,Private,Self-emp-inc,Self-emp-not-inc,...,Portugal,Puerto-Rico,Scotland,South,Taiwan,Thailand,Trinadad&Tobago,United-States,Vietnam,Yugoslavia
0,25,7,40,0,False,False,False,True,False,False,...,False,False,False,False,False,False,False,True,False,False
1,38,9,50,0,False,False,False,True,False,False,...,False,False,False,False,False,False,False,True,False,False
2,28,12,40,1,False,True,False,False,False,False,...,False,False,False,False,False,False,False,True,False,False
3,44,10,40,1,False,False,False,True,False,False,...,False,False,False,False,False,False,False,True,False,False
4,18,10,30,0,False,False,False,False,False,False,...,False,False,False,False,False,False,False,True,False,False


Great! Our data is ready for our model!

# Building a Logistic Regression model

Let's start by splitting our data into a training set and a test set. We hold out 30% of the rows for testing and pass `stratify=y` so that both sets keep the same proportion of each income class.



## Train/Test Split

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X = df.drop('income',axis=1)
y = df['income']
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.30, random_state=101)
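
The effect of `stratify` can be checked on synthetic labels. The sketch below uses made-up data (`X_demo`, `y_demo` are illustrative names) with a 25% positive class and the same `test_size` and `random_state` as the cell above:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced labels: 750 of class 0 and 250 of class 1 (illustrative only).
y_demo = np.array([0] * 750 + [1] * 250)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, test_size=0.30, random_state=101
)

# With stratify, the share of class 1 is preserved in both splits.
print(len(y_tr), y_tr.mean())
print(len(y_te), y_te.mean())
```

Without `stratify`, the class shares in the two splits can drift apart by chance, which matters more as the minority class gets smaller.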

## Training and Predicting

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
print(X_train.dtypes)


age                int64
educational-num    int64
hours-per-week     int64
Federal-gov         bool
Local-gov           bool
                   ...  
Thailand            bool
Trinadad&Tobago     bool
United-States       bool
Vietnam             bool
Yugoslavia          bool
Length: 97, dtype: object


In [None]:
model = LogisticRegression()
model.fit(X_train,y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
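
The warning suggests two remedies: scale the features or raise `max_iter`. Here is a hedged sketch of both, on synthetic data standing in for `X_train` and `y_train` (`X_demo`, `y_demo`, and `pipe` are illustrative names, and `max_iter=1000` is an arbitrary larger budget, not a tuned value):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training data (illustrative only).
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)
# Put one feature on a much larger scale, similar to mixing age with 0/1 dummies.
X_demo[:, 0] *= 1000

# Standardize the features, then fit logistic regression with a larger iteration budget.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_demo, y_demo)

# Number of iterations the solver actually used.
n_iter = pipe.named_steps["logisticregression"].n_iter_[0]
print(n_iter)
```

The metrics reported in this notebook come from the unscaled default model, so rerunning with a pipeline like this would give different numbers and coefficients.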


In [None]:
y_pred = model.predict(X_test)

In [None]:
print(list(y_test[:5]))
print(y_pred[:5])

[1, 1, 1, 0, 0]
[0 1 0 0 1]


Let's move on to evaluate our model!

## Evaluation

We can check precision, recall, and F1-score using a classification report, and the raw counts behind them using a confusion matrix.

In [None]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

In [None]:
confusion_matrix(y_test, y_pred, labels=[0,1])

array([[10340,   807],
       [ 1636,  1870]])

In [None]:
print(classification_report(y_test,y_pred, digits=4))

              precision    recall  f1-score   support

           0     0.8634    0.9276    0.8943     11147
           1     0.6985    0.5334    0.6049      3506

    accuracy                         0.8333     14653
   macro avg     0.7810    0.7305    0.7496     14653
weighted avg     0.8239    0.8333    0.8251     14653
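
The report's class-1 row and the accuracy can be reproduced by hand from the confusion matrix above. This is a short worked check that uses only the four counts printed by `confusion_matrix`:

```python
import numpy as np

# Confusion matrix from the cell above: rows are true labels, columns are predictions.
cm = np.array([[10340, 807],
               [1636, 1870]])
tn, fp, fn, tp = cm.ravel().tolist()

precision_1 = tp / (tp + fp)   # 1870 / 2677
recall_1 = tp / (tp + fn)      # 1870 / 3506
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)
accuracy = (tp + tn) / (tn + fp + fn + tp)   # 12210 / 14653

print(round(precision_1, 4), round(recall_1, 4), round(f1_1, 4), round(accuracy, 4))
# prints 0.6985 0.5334 0.6049 0.8333
```

The low recall for class 1 means the model misses 1636 of the 3506 high-income rows in the test set, even though overall accuracy is 0.8333.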



## Check model parameters
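
The intercept and coefficients printed in this section are on the log-odds scale. Exponentiating a coefficient gives an odds ratio, which is often easier to read. The sketch below uses three coefficient values copied from the report at the end of this notebook (`coefs` and `odds_ratios` are illustrative names). Because the fit stopped at the iteration limit on unscaled features, treat the numbers as rough:

```python
import math

# Coefficients copied from the report in this notebook (log-odds scale).
coefs = {
    "age": 0.02402730480243188,
    "educational-num": 0.13634048791675404,
    "11th": -0.9859995647918476,
}

# exp(coef) is the multiplicative change in the odds of income > 50K
# for a one-unit increase in the feature, holding the other features fixed.
odds_ratios = {name: math.exp(value) for name, value in coefs.items()}

for name, ratio in odds_ratios.items():
    print(f"{name}: {ratio:.3f}")
```

A ratio above 1 raises the odds of the `>50K` class and a ratio below 1 lowers them. For a dummy column such as `11th`, the comparison is against the category dropped by `drop_first=True`.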

In [None]:
print(X.columns)
print(model.intercept_)
print(model.coef_)

Index(['age', 'educational-num', 'hours-per-week', 'Federal-gov', 'Local-gov',
       'Never-worked', 'Private', 'Self-emp-inc', 'Self-emp-not-inc',
       'State-gov', 'Without-pay', '11th', '12th', '1st-4th', '5th-6th',
       '7th-8th', '9th', 'Assoc-acdm', 'Assoc-voc', 'Bachelors', 'Doctorate',
       'HS-grad', 'Masters', 'Preschool', 'Prof-school', 'Some-college',
       'Married-AF-spouse', 'Married-civ-spouse', 'Married-spouse-absent',
       'Never-married', 'Separated', 'Widowed', 'Adm-clerical', 'Armed-Forces',
       'Craft-repair', 'Exec-managerial', 'Farming-fishing',
       'Handlers-cleaners', 'Machine-op-inspct', 'Other-service',
       'Priv-house-serv', 'Prof-specialty', 'Protective-serv', 'Sales',
       'Tech-support', 'Transport-moving', 'Not-in-family', 'Other-relative',
       'Own-child', 'Unmarried', 'Wife', 'Asian-Pac-Islander', 'Black',
       'Other', 'White', 'Male', 'Cambodia', 'Canada', 'China', 'Columbia',
       'Cuba', 'Dominican-Republic', 'Ecuador',

# Report

In [None]:
# Summary report for this project
print("Model Report")
print("-" * 20)
print("Classification Report:")
print(classification_report(y_test, y_pred, digits=4))
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred, labels=[0, 1]))
print("-" * 20)
print("Model Parameters:")
print("Intercept:", model.intercept_)
print("Coefficients:")
for feature, coef in zip(X.columns, model.coef_[0]):
    print(f"{feature}: {coef}")

# Macro F1 = 0.7496


Model Report
--------------------
Classification Report:
              precision    recall  f1-score   support

           0     0.8634    0.9276    0.8943     11147
           1     0.6985    0.5334    0.6049      3506

    accuracy                         0.8333     14653
   macro avg     0.7810    0.7305    0.7496     14653
weighted avg     0.8239    0.8333    0.8251     14653

Confusion Matrix:
[[10340   807]
 [ 1636  1870]]
--------------------
Model Parameters:
Intercept: [-4.27707402]
Coefficients:
age: 0.02402730480243188
educational-num: 0.13634048791675404
hours-per-week: 0.029002833995311377
Federal-gov: 0.3600768449425697
Local-gov: -0.17979718885161933
Never-worked: -0.012873043331788146
Private: -0.009339927735521348
Self-emp-inc: 0.08035554041652666
Self-emp-not-inc: -0.5275913539104696
State-gov: -0.5191279915502693
Without-pay: -0.0594866007980557
11th: -0.9859995647918476
12th: -0.5005044358941123
1st-4th: -0.3836647230371493
5th-6th: -0.5814745797502041
7th-8th: -0.9