# Logistic Regression in Python

For this project we will work with the Titanic data set from Kaggle. It is a well-known data set and is often a student's first step in machine learning.

We'll be trying to predict a binary class: survived or deceased. Let's begin by implementing logistic regression in Python for classification.

We'll use a "semi-cleaned" version of the Titanic data set. If you use the data set hosted directly on Kaggle, you may need to do some additional cleaning not shown in this lecture notebook.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

### Data acquisition and clean-up

In [None]:
train = pd.read_csv("titanic_train.csv")
train.head()

### Exploratory Data Analysis

##### Missing Data

In [None]:
train.isnull().sum()
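
Raw null counts are easier to act on as fractions, since a column that is mostly empty is a candidate for dropping while a column with a few gaps is a candidate for imputation. The sketch below shows the idea on a small made-up frame; the `toy` data and the `missing_fraction` name are illustrative assumptions, not part of the Titanic file.

```python
import numpy as np
import pandas as pd

# Made-up rows standing in for a few Titanic-style columns (illustrative only).
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, 54.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 8.05, 51.86],
})

# isnull() gives a boolean frame; the mean of booleans is the share of True values,
# so this is the fraction of missing entries per column, largest first.
missing_fraction = toy.isnull().mean().sort_values(ascending=False)
print(missing_fraction)
```

The same expression applied to `train` would rank its columns by how much data is missing.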

In [None]:
sns.heatmap(train.isnull(), cmap='viridis')

In [None]:
sns.countplot(x='Survived', data=train)

In [None]:
sns.countplot(x='Survived', hue='Sex', data=train)

In [None]:
sns.countplot(x='Survived', hue='Pclass', data=train)

In [None]:
sns.displot(train['Age'].dropna())

In [None]:
sns.countplot(x='SibSp', data=train)

In [None]:
train['Fare'].hist(bins=40)

##### Data Cleaning: Age Column

In [None]:
sns.boxplot(x='Pclass', y='Age', data=train)

In [None]:
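def impute_age(cols):
    # Look up values by column label; positional access such as cols[0]
    # on a label-indexed Series is deprecated in recent pandas versions.
    age = cols['Age']
    pclass = cols['Pclass']

    # Fill a missing age with a fixed value for the passenger's class;
    # known ages are returned unchanged.
    if pd.isnull(age):
        if pclass == 1:
            return 37
        elif pclass == 2:
            return 29
        else:
            return 24
    else:
        return age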

In [None]:
train['Age'] = train[['Age', 'Pclass']].apply(impute_age, axis=1)
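
The function above hard-codes one age per class. A possible alternative, sketched here on made-up data, is to compute the per-class median from the data itself and use it to fill the gaps. The `toy` frame and the `Age_filled` column are assumptions for illustration; this is not the notebook's own method.

```python
import numpy as np
import pandas as pd

# Small made-up frame standing in for the 'Pclass' and 'Age' columns.
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 2, 2, 3, 3, 3],
    "Age": [38.0, np.nan, 36.0, 30.0, np.nan, 22.0, 26.0, np.nan],
})

# Median age within each class, broadcast back to every row of that class.
# Missing ages are skipped when the median is computed.
class_median = toy.groupby("Pclass")["Age"].transform("median")

# Fill only the missing ages; observed ages stay as they are.
toy["Age_filled"] = toy["Age"].fillna(class_median)
print(toy)
```

This avoids the row-by-row `apply` and keeps the fill values in sync with the data if the file changes.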

In [None]:
sns.heatmap(train.isnull(), cmap='viridis')

##### Data Cleaning: Cabin Column

In [None]:
train.drop('Cabin', axis=1, inplace=True)

In [None]:
sns.heatmap(train.isnull(), cmap='viridis')

In [None]:
train.isnull().sum()

In [None]:
train.dropna(inplace=True)

In [None]:
train.isnull().sum()

In [None]:
train.head()

In [None]:
train.drop(['Name', 'Ticket'], inplace=True, axis=1)

In [None]:
train.head()

##### Data Wrangling: convert 'Sex' and 'Embarked' to numeric columns

In [None]:
sex = pd.get_dummies(train['Sex'], drop_first=True)
sex

In [None]:
embark = pd.get_dummies(train['Embarked'], drop_first=True)
embark

In [None]:
train.drop(['Sex', 'Embarked'], inplace=True, axis=1)

In [None]:
train = pd.concat([train, sex, embark], axis=1)
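
To see why `drop_first=True` is used above, it helps to compare the encoding with and without it on a tiny made-up column. With two categories, the two indicator columns are exact complements of each other, so one of them carries no extra information for a linear model. The `toy` frame below is an illustrative assumption.

```python
import pandas as pd

# Made-up column with the same two categories as 'Sex'.
toy = pd.DataFrame({"Sex": ["male", "female", "female", "male"]})

# One indicator column per category.
full = pd.get_dummies(toy["Sex"])
# Drop the first category; 'female' is then implied by male == 0.
reduced = pd.get_dummies(toy["Sex"], drop_first=True)

print(list(full.columns))     # ['female', 'male']
print(list(reduced.columns))  # ['male']

# Depending on the pandas version the indicators are booleans or integers;
# casting to int makes the 0/1 coding explicit.
print(reduced["male"].astype(int).tolist())  # [1, 0, 0, 1]
```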

In [None]:
train.head(10)

### Building the Model

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(train.drop('Survived', axis=1), 
                                                    train['Survived'], test_size=0.20, 
                                                    random_state=42)

In [None]:
from sklearn.linear_model import LogisticRegression
logisticmodel = LogisticRegression(max_iter=1000)
logisticmodel.fit(X_train, y_train)
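
As a rough picture of what the fitted model does for a binary target, it computes a weighted sum of the features plus an intercept and passes that score through the logistic (sigmoid) function to get a probability. The sketch below uses only the standard library; the coefficient and feature values are made up for illustration and are not taken from the fitted `logisticmodel`.

```python
import math

def sigmoid(z):
    """Map any real-valued score to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba_one(features, coefficients, intercept):
    """Probability of the positive class for one row of features."""
    z = intercept + sum(w * x for w, x in zip(coefficients, features))
    return sigmoid(z)

# Hypothetical values, chosen only to make the arithmetic easy to follow.
coefficients = [-1.0, 0.5]
intercept = 0.0
features = [1.0, 4.0]  # linear score z = -1.0 + 2.0 = 1.0

p = predict_proba_one(features, coefficients, intercept)
label = int(p >= 0.5)  # threshold the probability at 0.5 to get a class
print(round(p, 3), label)
```

A positive score gives a probability above 0.5 and a negative score gives one below it, which is why the sign of each coefficient indicates the direction of a feature's effect.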

In [None]:
predictions = logisticmodel.predict(X_test)

### Evaluation

In [None]:
from sklearn.metrics import classification_report, confusion_matrix
print(classification_report(y_test, predictions))

In [None]:
print(confusion_matrix(y_test, predictions))
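
The numbers in the classification report can be derived from the confusion matrix. For a binary problem, scikit-learn lays the matrix out with true classes as rows and predicted classes as columns, so it reads `[[TN, FP], [FN, TP]]`. The sketch below recomputes the main metrics for the positive class from made-up counts; the `binary_metrics` helper and the counts are illustrative assumptions, not results from this notebook.

```python
def binary_metrics(tn, fp, fn, tp):
    """Accuracy, plus precision, recall and F1 for the positive class."""
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Made-up confusion matrix in the [[TN, FP], [FN, TP]] layout.
matrix = [[50, 10], [20, 20]]
(tn, fp), (fn, tp) = matrix

metrics = binary_metrics(tn, fp, fn, tp)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Precision answers "of the passengers predicted to survive, how many did?", while recall answers "of the passengers who survived, how many were found?".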

In [None]:
test = pd.read_csv("titanic_test.csv")

In [None]:
test.head()

In [None]:
X_test, y_test