# Kaggle Titanic
## Logistic Regression with Python


For this lecture we will be working with the [Titanic Data Set from Kaggle](https://www.kaggle.com/c/titanic). This is a very famous dataset.



# Step - 0

## Import Libraries

In [1]:
!pip install missingno

Collecting missingno
  Downloading missingno-0.4.0.tar.gz
Building wheels for collected packages: missingno
  Running setup.py bdist_wheel for missingno ... done
  Stored in directory: /Users/venkat/Library/Caches/pip/wheels/92/46/9a/a8f3e9ad98ee4a53242e5ec371309dd71bd1177eb95c72788f
Successfully built missingno
Installing collected packages: missingno
Successfully installed missingno-0.4.0


In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as ms
%matplotlib inline

# Step - 1

## Load the Dataset

Pandas provides two core data types, Series and DataFrame, each with built-in methods that cover most data-handling tasks.

# Series

A Series is very similar to a one-dimensional array. The difference is that a Series has axis labels, so it can be indexed by label instead of only by integer position. The labels do not have to be integers; any hashable Python object can serve as a label.

In [3]:
# create a Series; with no index given, the labels default to the positions 0, 1, 2, ...
my_list = [10,20,30]
series_1=pd.Series(my_list)
series_1

0    10
1    20
2    30
dtype: int64

In [4]:
# pass a list of labels as the second argument to set the index
labels=['a','b','c']
series_2=pd.Series(my_list,labels)
series_2

a    10
b    20
c    30
dtype: int64

In [5]:
# with keyword arguments, data and index can be given in either order;
# when passed positionally, data must come first and index second.
series_3=pd.Series(index=labels,data=my_list)
series_3

a    10
b    20
c    30
dtype: int64

In [6]:
# a Series can also be created from a dictionary; the keys become the labels
d={'a':10,'b':20,'c':30}
series_4=pd.Series(d)
series_4

a    10
b    20
c    30
dtype: int64
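
Because a Series carries labels, an element can be looked up either by its label or by its integer position. The following is a minimal sketch on an illustrative Series (the name `prices` is made up for this example):

```python
import pandas as pd

# Illustrative Series with string labels (the name is hypothetical)
prices = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

by_label = prices['b']          # label-based lookup
by_position = prices.iloc[1]    # position-based lookup (counting from 0)
subset = prices[['a', 'c']]     # a list of labels returns a smaller Series

print(by_label, by_position)
print(subset)
```

Using `.iloc` for positions keeps the intent explicit, which matters once the labels themselves are integers.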

# DataFrames

A DataFrame is the other core pandas data type. It can be thought of as a collection of Series that share the same index.

In [7]:
from numpy.random import randn
np.random.seed(101)

In [8]:
df = pd.DataFrame(randn(5,4),index='A B C D E'.split(),columns='W X Y Z'.split())
df

| | W | X | Y | Z |
|---|---|---|---|---|
| A | 2.70685 | 0.628133 | 0.907969 | 0.503826 |
| B | 0.651118 | -0.319318 | -0.848077 | 0.605965 |
| C | -2.018168 | 0.740122 | 0.528813 | -0.589001 |
| D | 0.188695 | -0.758872 | -0.933237 | 0.955057 |
| E | 0.190794 | 1.978757 | 2.605967 | 0.683509 |
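
To make the "collection of Series" idea concrete, here is a small sketch that builds a DataFrame from two Series sharing the same labels. The names `w`, `x`, and `frame` and the values are illustrative only:

```python
import pandas as pd

# Two Series that share the same index labels (illustrative data)
w = pd.Series([1.0, 2.0, 3.0], index=['A', 'B', 'C'])
x = pd.Series([4.0, 5.0, 6.0], index=['A', 'B', 'C'])

# A dict of Series becomes a DataFrame; the dict keys become the column names
frame = pd.DataFrame({'W': w, 'X': x})

print(frame)
print(type(frame['W']))  # each column is itself a Series
```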


Pandas provides functions such as `read_csv`, `read_excel`, and `read_html` for reading data from various sources. The data is read and stored in DataFrames.
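
As a rough sketch of how `read_csv` behaves, the example below reads a tiny CSV held in memory instead of a file on disk. The rows are made up and only borrow a few column names from the Titanic data; `csv_text` and `sample` are hypothetical names:

```python
import io
import pandas as pd

# A tiny in-memory CSV with made-up rows; it stands in for a file on disk
csv_text = """PassengerId,Survived,Pclass,Age
1,0,3,22.0
2,1,1,38.0
3,1,3,
"""

sample = pd.read_csv(io.StringIO(csv_text))

print(sample)
print(sample.dtypes)  # column types are inferred; the empty Age field becomes NaN
```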

In [9]:
data = pd.read_csv('titanic.csv')
data.head()  # show the first five rows of the loaded data

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


In [10]:
#to get the last 5 entries of the data
data.tail()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 886 | 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.0 | NaN | S |
| 887 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0 | B42 | S |
| 888 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.45 | NaN | S |
| 889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0 | C148 | C |
| 890 | 891 | 0 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | 370376 | 7.75 | NaN | Q |


In [11]:
data.shape

(891, 12)

In [12]:
data.describe()

| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 891.0 | 714.0 | 891.0 | 891.0 | 891.0 |
| mean | 446.0 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min | 1.0 | 0.0 | 1.0 | 0.42 | 0.0 | 0.0 | 0.0 |
| 25% | 223.5 | 0.0 | 2.0 | 20.125 | 0.0 | 0.0 | 7.9104 |
| 50% | 446.0 | 0.0 | 3.0 | 28.0 | 0.0 | 0.0 | 14.4542 |
| 75% | 668.5 | 1.0 | 3.0 | 38.0 | 1.0 | 0.0 | 31.0 |
| max | 891.0 | 1.0 | 3.0 | 80.0 | 8.0 | 6.0 | 512.3292 |


In [13]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB


### Accessing individual data in the data frame

#### Working with Columns

Since each DataFrame is a collection of Series, accessing a single column returns a Series object.

In [14]:
s=data['Cabin'].head()

In [15]:
type(s)

pandas.core.series.Series

In [16]:
data[['Cabin','Parch']].head()

| | Cabin | Parch |
|---|---|---|
| 0 | NaN | 0 |
| 1 | C85 | 0 |
| 2 | NaN | 0 |
| 3 | C123 | 0 |
| 4 | NaN | 0 |


In [17]:
data.Cabin.head()

0     NaN
1     C85
2     NaN
3    C123
4     NaN
Name: Cabin, dtype: object

In [18]:
# add a new column that is a copy of Parch (for illustration)
data['New_parch']=data['Parch']
data['New_parch'].head()

0    0
1    0
2    0
3    0
4    0
Name: New_parch, dtype: int64

In [19]:
data.info()  # New_parch now appears among the DataFrame's columns

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 13 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
New_parch      891 non-null int64
dtypes: float64(2), int64(6), object(5)
memory usage: 90.6+ KB


### Working with rows

There are two ways to access the rows of a DataFrame: by integer position with `iloc`, or by index label with `loc`.

In [20]:
data.iloc[2]  # access the third row (position 2, counting from 0)

PassengerId                         3
Survived                            1
Pclass                              3
Name           Heikkinen, Miss. Laina
Sex                            female
Age                                26
SibSp                               0
Parch                               0
Ticket               STON/O2. 3101282
Fare                            7.925
Cabin                             NaN
Embarked                            S
New_parch                           0
Name: 2, dtype: object

In [21]:
data.loc[67]  # the default index labels are integers, so the label here is the number 67

PassengerId                          68
Survived                              0
Pclass                                3
Name           Crease, Mr. Ernest James
Sex                                male
Age                                  19
SibSp                                 0
Parch                                 0
Ticket                        S.P. 3464
Fare                             8.1583
Cabin                               NaN
Embarked                              S
New_parch                             0
Name: 67, dtype: object

In [22]:
# or, using the example DataFrame created earlier:
df

| | W | X | Y | Z |
|---|---|---|---|---|
| A | 2.70685 | 0.628133 | 0.907969 | 0.503826 |
| B | 0.651118 | -0.319318 | -0.848077 | 0.605965 |
| C | -2.018168 | 0.740122 | 0.528813 | -0.589001 |
| D | 0.188695 | -0.758872 | -0.933237 | 0.955057 |
| E | 0.190794 | 1.978757 | 2.605967 | 0.683509 |


In [23]:
df.loc['A']

W    2.706850
X    0.628133
Y    0.907969
Z    0.503826
Name: A, dtype: float64
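
The difference between `loc` and `iloc` is easiest to see when integer labels do not match the row positions. The sketch below uses a made-up frame (the name `frame` and its values are illustrative) whose labels are deliberately out of order:

```python
import pandas as pd

# Illustrative frame whose integer labels are not in positional order
frame = pd.DataFrame({'value': ['first', 'second', 'third']}, index=[2, 0, 1])

by_label = frame.loc[2, 'value']    # the row whose label is 2
by_position = frame.iloc[2, 0]      # the row at position 2, i.e. the third row

print(by_label)
print(by_position)
```

With the default `RangeIndex` of the Titanic data, labels and positions coincide, which is why `data.loc[67]` and `data.iloc[67]` would return the same row there.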

To drop columns or rows, we use the `drop` method.

In [24]:
data.drop('New_parch',axis=1).head() #axis=1 specifies we are dealing with columns

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


In [None]:
# the command above does not modify the original DataFrame; to modify it in place, use
data.drop('New_parch',axis=1,inplace=True)
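
The same method also drops rows. The sketch below, on a small made-up frame (all names are illustrative), shows dropping a row by label and confirms that `drop` without `inplace=True` leaves the original untouched:

```python
import pandas as pd

frame = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}, index=['x', 'y', 'z'])

# axis=0 (the default) drops rows by index label
without_y = frame.drop('y', axis=0)

# drop returns a new frame, so assigning the result is an alternative to inplace=True
frame_no_b = frame.drop('B', axis=1)

print(without_y)
print(frame_no_b)
print(frame.shape)  # the original frame still has all rows and columns
```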

In [None]:
# a boolean condition can be used to select rows from the DataFrame
data[data['Pclass']>2].head()  # returns the first rows that satisfy the condition
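
Several conditions can be combined in one selection. The following sketch uses made-up rows that borrow column names from the Titanic data; the frame `people` and both result names are hypothetical:

```python
import pandas as pd

# Made-up rows using column names from the Titanic data
people = pd.DataFrame({
    'Pclass': [1, 3, 3, 2],
    'Sex': ['female', 'male', 'female', 'male'],
    'Age': [38.0, 22.0, 26.0, 35.0],
})

# Each condition needs its own parentheses; use & for "and", | for "or"
third_class_women = people[(people['Pclass'] == 3) & (people['Sex'] == 'female')]
first_or_over_30 = people[(people['Pclass'] == 1) | (people['Age'] > 30)]

print(third_class_women)
print(first_or_over_30)
```

Python's `and` and `or` keywords do not work element by element on a Series, which is why the `&` and `|` operators are used here.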

### Merging

In [None]:
left = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                     'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3']})
   
right = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                          'C': ['C0', 'C1', 'C2', 'C3'],
                          'D': ['D0', 'D1', 'D2', 'D3']})  

In [None]:
pd.merge(left,right,how='inner',on='key')
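
In the frames above every key appears on both sides, so the choice of `how` makes no visible difference. As a sketch of what changes when the keys only partly overlap, consider two smaller made-up frames (`left_small` and `right_small` are hypothetical names):

```python
import pandas as pd

left_small = pd.DataFrame({'key': ['K0', 'K1', 'K2'], 'A': ['A0', 'A1', 'A2']})
right_small = pd.DataFrame({'key': ['K1', 'K2', 'K3'], 'C': ['C1', 'C2', 'C3']})

inner = pd.merge(left_small, right_small, how='inner', on='key')     # keys found in both
outer = pd.merge(left_small, right_small, how='outer', on='key')     # keys found in either
left_join = pd.merge(left_small, right_small, how='left', on='key')  # every key from the left frame

print(inner)
print(outer)
print(left_join)
```

Rows without a match on the other side are kept by the outer and left joins, with `NaN` filling the missing columns.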

# Step - 2

## Exploratory Data Analysis


Visualizing missing values in the dataset.

In [None]:
ms.matrix(data)

In [None]:
data['Cabin'].value_counts()

We can observe that there are missing values in 'Age', 'Cabin' and 'Embarked'. Let's continue.
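
The same observation can be made numerically. The sketch below counts missing values per column on a small made-up frame (names and values are illustrative, not taken from the dataset); the same two calls should work on `data`:

```python
import numpy as np
import pandas as pd

# Made-up rows; np.nan and None both mark missing entries
frame = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, np.nan],
    'Cabin': [None, 'C85', None, None],
    'Fare': [7.25, 71.28, 7.92, 8.05],
})

missing_counts = frame.isnull().sum()   # number of missing values per column
missing_share = frame.isnull().mean()   # fraction of missing values per column

print(missing_counts)
print(missing_share)
```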

In [None]:
data.info()

Visualization of data with Seaborn

In [None]:
sns.jointplot(x='Fare',y='Age',data=data)

In [None]:
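# distplot is deprecated in recent seaborn releases; histplot (seaborn 0.11+) with kde=True draws a similar plot
sns.histplot(data['Fare'], kde=True)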

In [None]:
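# correlations are only defined for numeric columns, so select those first
sns.heatmap(data.select_dtypes(include='number').corr(),cmap='coolwarm')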
plt.title('data.corr()')

In [None]:
sns.swarmplot(x='Pclass',y='Age',data=data,palette='Set2')

In [None]:
sns.set_style('whitegrid')
sns.countplot(x='Survived',data=data,palette='RdBu_r')

In [None]:
sns.set_style('whitegrid')
sns.countplot(x='Survived',hue='Sex',data=data,palette='RdBu_r')

In [None]:
sns.set_style('whitegrid')
sns.countplot(x='Survived',hue='Pclass',data = data,palette='rainbow')

In [None]:
data['Age'].hist(bins = 30, color = 'darkred', alpha = 0.8)

In [None]:
sns.countplot(x = 'SibSp', data = data)

In [None]:
data['Fare'].hist(color = 'green', bins = 40, figsize = (8,3))

#### What do you observe from the above charts?

# Step - 3


## Data Cleaning

We want to fill the missing values in the Age column with the average age of the passenger's class. Filling in missing values this way is called data imputation.

In [None]:
plt.figure(figsize=(12, 7))
sns.boxplot(x='Pclass',y='Age',data=data,palette='winter')

The average age for each class, read from the center line of each box (which marks the median), is estimated as follows:
  
  * For **Class 1** - The average age is 37
  * For **Class 2** - The average age is 29
  * For **Class 3** - The average age is 24
  
Let's impute these values into the age column.

In [None]:
def impute_age(cols):
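    # look values up by column label; positional access such as cols[0] is deprecated for labeled rows
    Age = cols['Age']
    Pclass = cols['Pclass']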
    
    if pd.isnull(Age):
        # Class-1
        if Pclass == 1:
            return 37
        # Class-2 
        elif Pclass == 2:
            return 29
        # Class-3
        else:
            return 24

    else:
        return Age

Applying the function.

In [None]:
data['Age'] = data[['Age','Pclass']].apply(impute_age,axis=1)
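
As an alternative to hard-coding the per-class values, the fill values can be computed from the data. This is only a sketch of a different technique (`groupby` with `transform`), not the approach used in this notebook, and it runs on made-up rows:

```python
import numpy as np
import pandas as pd

# Made-up rows using the Age and Pclass column names
frame = pd.DataFrame({
    'Pclass': [1, 1, 1, 3, 3, 3],
    'Age': [30.0, 40.0, np.nan, 20.0, np.nan, 24.0],
})

# Median age of each class, aligned row by row with the original frame
class_median = frame.groupby('Pclass')['Age'].transform('median')

# Fill only the missing ages; existing ages are left untouched
frame['Age'] = frame['Age'].fillna(class_median)

print(frame)
```

Here the missing age in class 1 becomes 35.0 (the median of 30 and 40) and the one in class 3 becomes 22.0 (the median of 20 and 24).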

Now let's visualize the missing values.

In [None]:
ms.matrix(data)

The Age column is imputed successfully.

Let's drop the Cabin column and the rows where Embarked is NaN.

In [None]:
data.drop('Cabin', axis = 1,inplace=True)

In [None]:
data.head()

In [None]:
data.dropna(inplace = True)

In [None]:
ms.matrix(data)

## Converting Categorical Features 

We need to convert the categorical features to dummy variables using pandas. Otherwise, our machine learning algorithm won't be able to take those features directly as inputs.
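
As a small sketch of what dummy variables look like, the example below encodes a made-up `Sex` column with and without `drop_first=True` (the names `sex_column`, `all_dummies`, and `reduced` are illustrative):

```python
import pandas as pd

sex_column = pd.Series(['male', 'female', 'female', 'male'], name='Sex')

all_dummies = pd.get_dummies(sex_column)                # one column per category
reduced = pd.get_dummies(sex_column, drop_first=True)   # first category dropped

print(list(all_dummies.columns))
print(list(reduced.columns))
print(reduced.astype(int))
```

With two categories, one column already determines the other, so dropping the first loses no information and avoids feeding the model two perfectly correlated inputs.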

In [None]:
data.info()

In [None]:
# drop_first=True drops the first category of each feature; the remaining columns already imply it
sex = pd.get_dummies(data['Sex'],drop_first=True)
embark = pd.get_dummies(data['Embarked'],drop_first=True)
sex


In [None]:
embark

In [None]:
data.drop(['Sex','Embarked','Name','Ticket'],axis=1,inplace=True)
data.head()


In [None]:
data = pd.concat([data,sex,embark],axis=1)

In [None]:
data.head()

# Step - 4

## Building a Logistic Regression model


In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(data.drop('Survived',axis=1), 
                                                    data['Survived'], test_size=0.30, 
                                                    random_state=101)
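
To illustrate what `test_size` and `random_state` do, here is a minimal sketch on ten made-up samples (the arrays are illustrative and unrelated to the Titanic data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten made-up samples with two features each, plus a binary target
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=101
)

print(X_train.shape, X_test.shape)

# Repeating the call with the same random_state reproduces the same split
X_train_again, _, _, _ = train_test_split(X, y, test_size=0.30, random_state=101)
print(np.array_equal(X_train, X_train_again))
```

Fixing `random_state` makes results reproducible between runs, which helps when comparing models.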

In [None]:
from sklearn.linear_model import LogisticRegression

# Build the Model.
logmodel = LogisticRegression()
logmodel.fit(X_train,y_train)

In [None]:
predict =  logmodel.predict(X_test)
predict
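
For intuition about what the fitted model computes, a binary logistic regression passes a weighted sum of the features through the sigmoid function and predicts class 1 when the resulting probability exceeds 0.5. The sketch below uses made-up weights and samples, not values from the model trained above:

```python
import numpy as np

def sigmoid(z):
    """Map any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Made-up weights and intercept, not taken from a fitted model
weights = np.array([0.8, -0.5])
intercept = -0.2

samples = np.array([[2.0, 1.0],
                    [0.0, 3.0]])

scores = samples @ weights + intercept        # linear part: w . x + b
probabilities = sigmoid(scores)               # estimated P(class = 1)
labels = (probabilities > 0.5).astype(int)    # threshold at 0.5

print(probabilities)
print(labels)
```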

Let's move on to evaluate our model.

## Evaluation

We can check precision, recall, and F1 score using the classification report, and inspect the raw counts with the confusion matrix.

In [None]:
from sklearn.metrics import confusion_matrix, classification_report

In [None]:
print(confusion_matrix(y_test, predict))
print(classification_report(y_test, predict))
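
To show how precision, recall, and F1 score relate to the confusion matrix, the sketch below derives them by hand from a made-up 2x2 matrix. The counts are illustrative and are not results from this model:

```python
import numpy as np

# Made-up confusion matrix in scikit-learn's layout:
# rows are true classes, columns are predicted classes
cm = np.array([[50, 10],
               [5, 35]])

tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)   # of the predicted positives, how many were correct
recall = tp / (tp + fn)      # of the actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f1)
```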

---
                                                     THE END