# <font color=blue>Assignments for "Logistic Regression"</font>

- In this assignment, you will build a logistic regression model to predict which passengers survived the Titanic disaster. Submit a link to a Jupyter notebook containing solutions to the following tasks.
    - Download the [Titanic](https://www.kaggle.com/c/titanic/data) data from Kaggle. The train.csv file is all you need.
    - Split your data into training and test sets.
    - Build your model and predict survival on the test set you split off.
    - Is your model's performance satisfactory? Explain.
    - Try to improve your model's performance by adding or removing some variables.

- Explore the advantages and disadvantages of logistic regression and discuss them with your mentor.

In [1]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

In [2]:
titanic = pd.read_csv("../../data/logistic regression/train.csv")
titanic.head()

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [3]:
titanic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [4]:
titanic.isnull().sum()*100/titanic.shape[0]

PassengerId     0.000000
Survived        0.000000
Pclass          0.000000
Name            0.000000
Sex             0.000000
Age            19.865320
SibSp           0.000000
Parch           0.000000
Ticket          0.000000
Fare            0.000000
Cabin          77.104377
Embarked        0.224467
dtype: float64

In [5]:
# Name is free text and Cabin is about 77% missing, so drop both
titanic.drop(columns=['Name', 'Cabin'], inplace=True)

In [6]:
var_null=titanic.isnull().sum()
var_missing=var_null[var_null!=0].index
var_missing

Index(['Age', 'Embarked'], dtype='object')

In [7]:
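# Assign the result back: inplace=True on a column selection is deprecated in recent pandas
titanic['Age'] = titanic['Age'].fillna(titanic['Age'].median())
# Fill the missing ports with the most frequent value
titanic['Embarked'] = titanic['Embarked'].fillna(titanic['Embarked'].mode()[0])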

In [8]:
titanic.isnull().sum()*100/titanic.shape[0]

PassengerId    0.0
Survived       0.0
Pclass         0.0
Sex            0.0
Age            0.0
SibSp          0.0
Parch          0.0
Ticket         0.0
Fare           0.0
Embarked       0.0
dtype: float64
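
The fill values above are computed from all 891 rows, before the train/test split, so rows that later land in the test set contribute to the median age and the most frequent port. One way to avoid this is to learn the fill values from the training rows only and then apply them to both sets. The following is a minimal sketch of that idea in plain Python; the helper names `fit_fill_values` and `apply_fill_values` and the toy rows are hypothetical and not part of the notebook. In the notebook itself, scikit-learn's `SimpleImputer` fitted on `X_train` would play the same role.

```python
from statistics import median, mode

def fit_fill_values(train_rows, numeric_cols, categorical_cols):
    """Learn one fill value per column from the training rows only."""
    fill = {}
    for col in numeric_cols:
        fill[col] = median(r[col] for r in train_rows if r[col] is not None)
    for col in categorical_cols:
        fill[col] = mode(r[col] for r in train_rows if r[col] is not None)
    return fill

def apply_fill_values(rows, fill):
    """Return copies of the rows with missing entries replaced by the learned values."""
    return [
        {k: (fill[k] if v is None and k in fill else v) for k, v in r.items()}
        for r in rows
    ]

# Toy rows for illustration only (not taken from the dataset)
train = [
    {"Age": 22.0, "Embarked": "S"},
    {"Age": None, "Embarked": "C"},
    {"Age": 38.0, "Embarked": "S"},
    {"Age": 26.0, "Embarked": None},
]
test = [{"Age": None, "Embarked": None}, {"Age": 80.0, "Embarked": "Q"}]

fill = fit_fill_values(train, ["Age"], ["Embarked"])
train_filled = apply_fill_values(train, fill)
test_filled = apply_fill_values(test, fill)  # test rows never influence `fill`
print(fill)
```

With only two columns affected and few missing values in `Embarked`, the difference on this dataset is likely small, but the habit matters once more preprocessing steps depend on the data.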

In [9]:
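# Treat passenger class as a category rather than a number
titanic['Pclass'] = titanic['Pclass'].astype("object")

var_numeric = titanic.select_dtypes(include=['float64', 'int64'])
var_cat = titanic.select_dtypes(include=['object'])
# Ticket is still in var_cat, so this also creates one dummy column per ticket value;
# only the columns listed in feature_cols are used later
var_dummies = pd.get_dummies(var_cat, drop_first=True)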

var_regress=pd.concat([var_numeric,var_dummies],axis=1)
var_regress.head()

,PassengerId,Survived,Age,SibSp,Parch,Fare,Pclass_2,Pclass_3,Sex_male,Ticket_110413,...,Ticket_W./C. 14258,Ticket_W./C. 14263,Ticket_W./C. 6607,Ticket_W./C. 6608,Ticket_W./C. 6609,Ticket_W.E.P. 5734,Ticket_W/C 14208,Ticket_WE/P 5735,Embarked_Q,Embarked_S
0,1,0,22.0,1,0,7.25,0,1,1,0,...,0,0,0,0,0,0,0,0,0,1
1,2,1,38.0,1,0,71.2833,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,3,1,26.0,0,0,7.925,0,1,0,0,...,0,0,0,0,0,0,0,0,0,1
3,4,1,35.0,1,0,53.1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
4,5,0,35.0,0,0,8.05,0,1,1,0,...,0,0,0,0,0,0,0,0,0,1


In [10]:
feature_cols = ['Age', 'SibSp', 'Fare', 'Parch', 'Pclass_2', 'Pclass_3', 'Sex_male', 'Embarked_Q', 'Embarked_S']
X = var_regress[feature_cols]  # Features
Y = var_regress['Survived']    # Target


In [11]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.3, random_state = 6)

In [12]:
logreg = LogisticRegression(solver='liblinear')
logreg.fit(X_train, Y_train)

train_accuracy = logreg.score(X_train, Y_train)
test_accuracy = logreg.score(X_test, Y_test)

print('Accuracy on Train Data : {:.2f}'.format(train_accuracy), 
      'Accuracy on Test Data  : {:.2f}'.format(test_accuracy), sep='\n')

Accuracy on Train Data : 0.79
Accuracy on Test Data  : 0.85
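
Accuracy on its own does not say whether the model is satisfactory. It needs a reference point, such as the accuracy of always predicting the majority class, and a breakdown of which kinds of errors the model makes. A test accuracy above the training accuracy also suggests the figure depends on this particular split (`random_state = 6`), so it is worth treating it as one sample rather than a final score. The sketch below shows the baseline and confusion-matrix calculations in plain Python. The function names and the label lists are made up for illustration; in the notebook, `Y_test` and `logreg.predict(X_test)` would take their place, and `sklearn.metrics.confusion_matrix` and `classification_report` compute the same quantities.

```python
from collections import Counter

def majority_baseline_accuracy(y_true):
    """Accuracy obtained by always predicting the most common class."""
    counts = Counter(y_true)
    return counts.most_common(1)[0][1] / len(y_true)

def binary_metrics(y_true, y_pred, positive=1):
    """Confusion-matrix counts plus accuracy, precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return {
        "tp": tp, "fp": fp, "fn": fn, "tn": tn,
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }

# Made-up labels for illustration only (1 = survived, 0 = did not survive)
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]

baseline = majority_baseline_accuracy(y_true)
metrics = binary_metrics(y_true, y_pred)
print("Baseline accuracy:", baseline)
print(metrics)
```

If the model's accuracy is only slightly above the baseline, or recall for survivors is low, the headline accuracy overstates how useful the predictions are.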
