## Classification Algorithm
This section walks through a classification problem from start to finish.

Problem Statement: Predict which passengers survived the sinking of the Titanic, a disaster that happened more than 100 years ago.



In [1]:
#Libraries
import numpy as np
import pandas as pd

In [2]:
df_training = pd.read_csv('../input/train.csv')

df_training.shape

(891, 12)

The data set has 891 rows, one per passenger, each recording whether that passenger survived. It has 12 columns; let's see what they are and what their data types are.

In [3]:
df_training.dtypes

PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object

## Feature Description
We have the following details about the passengers of the Titanic. Let's apply some basic analysis to them: we will do an EDA before accepting or rejecting a column as a predictor of survival.
- PassengerId - Uniquely identifies each passenger. It cannot help us predict whether a passenger survived.
- Survived - The target value: whether a passenger survived, encoded as 0 (no) or 1 (yes). This is a categorical value, so its data type needs to be changed in the dataframe.
- Pclass - The passenger's ticket class. First-class passengers survived at a higher rate than the others, so Pclass should help us identify survivors.
- Name - Like PassengerId, Name is different for each row, so it is not useful for predicting survival as it stands.
- Sex - Female passengers were given preference over male passengers for places in the lifeboats, so Sex is a good predictor of survival.
- Age - Children were given preference during the rescue operation. We will therefore group ages into categories.
- SibSp - Number of siblings and spouses the passenger had aboard the ship.
- Parch - Number of parents and children the passenger had aboard the ship.
- Ticket - Ticket number.
- Fare - Passenger fare.
- Cabin - Cabin number. A passenger may or may not have a cabin recorded, so we can create a categorical variable that stores whether a cabin is present.
- Embarked - Port at which the passenger boarded: S (Southampton), C (Cherbourg) or Q (Queenstown).
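The claims above about Sex and Pclass can be checked with a simple group-by before committing to a feature. The following is a minimal sketch, not part of the original notebook: `survival_rate_by` is a hypothetical helper name, and the five rows are the first five passengers of `train.csv`, far too few to draw conclusions from.

```python
import pandas as pd

# First five rows of train.csv (only the columns needed here).
sample = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0],
    "Pclass": [3, 1, 3, 1, 3],
    "Sex": ["male", "female", "female", "female", "male"],
})

def survival_rate_by(df, column):
    """Return the share of survivors (mean of Survived) for each value of `column`."""
    return df.groupby(column)["Survived"].mean()

rate_by_sex = survival_rate_by(sample, "Sex")
rate_by_class = survival_rate_by(sample, "Pclass")
print(rate_by_sex)
print(rate_by_class)
```

On the full data the same call would be `survival_rate_by(df_training, 'Sex')`, run while `Survived` is still an integer column, since the mean is not defined for the category type.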

In [4]:
# Remove PassengerId from the training set and keep it in a separate variable
training_passengerId = df_training.PassengerId

df_training.drop(columns=['PassengerId'],inplace=True)

# Also drop Name, Ticket and Fare
df_training.drop(columns=['Name','Ticket','Fare'],inplace=True)
df_training.head()

   Survived  Pclass     Sex   Age  SibSp  Parch Cabin Embarked
0         0       3    male  22.0      1      0   NaN        S
1         1       1  female  38.0      1      0   C85        C
2         1       3  female  26.0      0      0   NaN        S
3         1       1  female  35.0      1      0  C123        S
4         0       3    male  35.0      0      0   NaN        S


In [5]:
# Let's analyze the values of the remaining columns

print('Survived value counts: ')
print(df_training.Survived.value_counts())

print('Count by class: ')
print(df_training.Pclass.value_counts())

print('count by sex: ')
print(df_training.Sex.value_counts())

print('Cabin or without cabin count')
print('Without cabin', df_training.Cabin.isnull().sum())
print('With cabin', df_training.shape[0] - df_training.Cabin.isnull().sum())

print('Count by Journey Embarking point:')
print(df_training.Embarked.value_counts())

Survived value counts: 
0    549
1    342
Name: Survived, dtype: int64
Count by class: 
3    491
1    216
2    184
Name: Pclass, dtype: int64
count by sex: 
male      577
female    314
Name: Sex, dtype: int64
Cabin or without cabin count
Without cabin 687
With cabin 204
Count by Journey Embarking point:
S    644
C    168
Q     77
Name: Embarked, dtype: int64


Let's convert these columns to the category type.


In [6]:
#creating category types
df_training.Survived=df_training.Survived.astype('category')
df_training.Pclass=df_training.Pclass.astype('category')
df_training.Sex=df_training.Sex.astype('category')
df_training.Embarked = df_training.Embarked.astype('category')

# Feature engineering on Cabin: 1 if the passenger has a cabin recorded, 0 otherwise
df_training['cabinAllocated'] = df_training.Cabin.notnull().astype(int)
df_training['cabinAllocated'] = df_training['cabinAllocated'].astype('category')

In [7]:
df_training.dtypes

Survived          category
Pclass            category
Sex               category
Age                float64
SibSp                int64
Parch                int64
Cabin               object
Embarked          category
cabinAllocated    category
dtype: object
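To make the category type less abstract, here is a small sketch of what `astype('category')` stores: a sorted list of distinct values plus an integer code per row. The Series below holds the Sex values of the first five training rows; the variable names are illustrative only.

```python
import pandas as pd

# Sex values of the first five passengers in the training data.
sex = pd.Series(["male", "female", "female", "female", "male"]).astype("category")

categories = list(sex.cat.categories)  # distinct values, sorted
codes = sex.cat.codes.tolist()         # position of each row's value in `categories`

print(categories)  # ['female', 'male']
print(codes)       # [1, 0, 0, 0, 1]
```

Because the categories are sorted, `male` gets code 1 and `female` code 0 here.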

In [8]:
# Cabin is now captured by cabinAllocated, so drop it
df_training.drop(columns=['Cabin'],inplace=True)

Now let's look at the Age column, starting with its range and its missing values.

In [9]:
print("Min Age : {}, Max age : {}".format(df_training.Age.min(),df_training.Age.max()))

Min Age : 0.42, Max age : 80.0


In [10]:
df_training.Age.isnull().sum()

177

As 177 records have no age, we can either ignore them or fill them in. Age played an important role in deciding who survived, so we will keep these rows and replace each missing value with a random integer drawn from within one standard deviation of the mean age.

In [11]:
age_mean = df_training.Age.mean()
age_std = df_training.Age.std()
missing_age = df_training.Age.isnull()
# Draw random integer ages within one standard deviation of the mean
random_list = np.random.randint(int(age_mean - age_std),
                                int(age_mean + age_std),
                                size=missing_age.sum())
# Assign with .loc to avoid chained assignment and the SettingWithCopyWarning
df_training.loc[missing_age, 'Age'] = random_list
df_training['Age'] = df_training['Age'].astype(int)


In [12]:
# Divide Age into 5 equal-width bins

df_training['AgeGroup'] = pd.cut(df_training.Age,5,labels=[1,2,3,4,5])
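When `pd.cut` is given an integer, it splits the observed range into that many equal-width intervals, so the bin edges depend on the minimum and maximum of the data passed in. The sketch below uses a handful of illustrative ages spanning 0 to 80 (the range of the integer Age column) and asks `pd.cut` to return the edges via `retbins=True`, so that the same edges can be reused on other data. The variable names and the `new_ages` values are assumptions for illustration.

```python
import pandas as pd

# Illustrative ages covering the observed range 0..80.
ages = pd.Series([0, 22, 38, 26, 35, 80])

# retbins=True also returns the computed bin edges.
groups, edges = pd.cut(ages, 5, labels=[1, 2, 3, 4, 5], retbins=True)
print(groups.tolist())  # [1, 2, 3, 2, 3, 5]
print(edges)

# Reusing the edges keeps the grouping identical on a different set of ages.
new_ages = pd.Series([5, 40, 76])
new_groups = pd.cut(new_ages, bins=edges, labels=[1, 2, 3, 4, 5])
print(new_groups.tolist())  # [1, 3, 5]
```

Passing explicit edges matters when the same grouping has to be applied to a second file whose age range differs, because a fresh `pd.cut(..., 5)` would then place the boundaries elsewhere.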


In [13]:
# Age is now captured by AgeGroup, so remove it
df_training.drop(columns=['Age'],inplace=True)

Let's compute the complete family size by adding the Parch and SibSp columns.

In [14]:
# Add 1 to count the passenger in that row
df_training['family'] = df_training.Parch+df_training.SibSp+1

In [15]:
df_training.drop(columns=['SibSp','Parch'],inplace=True)
df_training.head()

  Survived  Pclass     Sex  Embarked  cabinAllocated  AgeGroup  family
0        0       3    male         S               0         2       2
1        1       1  female         C               1         3       2
2        1       3  female         S               0         2       1
3        1       1  female         S               1         3       2
4        0       3    male         S               0         3       1


In [16]:
df_training['Sex'].value_counts()

male      577
female    314
Name: Sex, dtype: int64

In [17]:
df_training['category_sex'] = df_training['Sex'].apply(lambda x: 1 if x=='male'  else 0)

In [18]:
df_training.drop(columns=['Sex'],inplace=True)

In [19]:
df_training.Embarked.value_counts()

S    644
C    168
Q     77
Name: Embarked, dtype: int64

In [20]:
df_training.Embarked = df_training.Embarked.fillna('S')
df_training.Embarked = df_training.Embarked.map({'S':1,'C':2,'Q':3}).astype('int')

In [21]:
df_training.Embarked.value_counts()

1    646
2    168
3     77
Name: Embarked, dtype: int64
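Mapping the ports to 1, 2 and 3 gives them an order (S < C < Q) that the ports do not really have. Tree-based models are often tolerant of this, so treat the following as an optional alternative rather than a correction: a sketch of one-hot encoding with `pd.get_dummies`, using illustrative port values rather than rows from the data set.

```python
import pandas as pd

# Illustrative port values; not taken from specific rows of the data.
embarked = pd.Series(["S", "C", "S", "S", "Q"], name="Embarked")

# One indicator column per port instead of a single ordered integer.
one_hot = pd.get_dummies(embarked, prefix="Embarked")
print(one_hot)
```

Each row has exactly one indicator set, and no port is treated as larger or smaller than another.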

In [22]:
from sklearn.model_selection import train_test_split
train_x,test_x,train_y,test_y = train_test_split(df_training.iloc[:,1:],df_training.iloc[:,0],test_size=0.2,random_state=0)

In [23]:
from sklearn.ensemble import RandomForestClassifier

randomForest = RandomForestClassifier(n_estimators=100)

randomForest.fit(train_x,train_y)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [24]:
y_hat = randomForest.predict(test_x)

In [25]:
from sklearn.metrics import accuracy_score
accuracy_score(test_y,y_hat)

0.7821229050279329
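An accuracy figure is easier to judge against a baseline. The sketch below computes the accuracy of always predicting the majority class, using the Survived counts printed earlier (549 and 342). It is computed on the full training set rather than on the 20% hold-out, so the comparison with the score above is only approximate.

```python
# Survived value counts from the training data: 549 did not survive, 342 did.
survived_counts = {0: 549, 1: 342}

total = sum(survived_counts.values())
majority_class = max(survived_counts, key=survived_counts.get)

# Accuracy obtained by predicting the majority class for every passenger.
baseline_accuracy = survived_counts[majority_class] / total
print(majority_class, round(baseline_accuracy, 4))  # 0 0.6162
```

A model that scores close to this baseline has learned little beyond the class balance.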

Let's now refit the model on the complete training set.

In [26]:
randomForest.fit(df_training.iloc[:,1:],df_training.iloc[:,0])

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

Now let's import the test file and create the output from it. Before that, we have to apply to the test file every transformation that we applied to the training file.

In [27]:
df_testing = pd.read_csv('../input/test.csv')

# Remove PassengerId from the test set and keep it in a separate variable
testing_passengerId = df_testing.PassengerId

df_testing.drop(columns=['PassengerId'],inplace=True)

# Also drop Name, Ticket and Fare
df_testing.drop(columns=['Name','Ticket','Fare'],inplace=True)
df_testing.head()

#creating category types
df_testing.Pclass=df_testing.Pclass.astype('category')
df_testing.Sex=df_testing.Sex.astype('category')
df_testing.Embarked = df_testing.Embarked.astype('category')

# Feature engineering on Cabin: 1 if the passenger has a cabin recorded, 0 otherwise
df_testing['cabinAllocated'] = df_testing.Cabin.notnull().astype(int)
df_testing['cabinAllocated'] = df_testing['cabinAllocated'].astype('category')

# Cabin is now captured by cabinAllocated, so drop it
df_testing.drop(columns=['Cabin'],inplace=True)

age_mean = df_testing.Age.mean()
age_std = df_testing.Age.std()
missing_age = df_testing.Age.isnull()
# Draw random integer ages within one standard deviation of the mean
random_list_test = np.random.randint(int(age_mean - age_std),
                                     int(age_mean + age_std),
                                     size=missing_age.sum())
# Assign with .loc to avoid chained assignment and the SettingWithCopyWarning
df_testing.loc[missing_age, 'Age'] = random_list_test
df_testing['Age'] = df_testing['Age'].astype(int)

# Divide Age into 5 equal-width bins
df_testing['AgeGroup'] = pd.cut(df_testing.Age,5,labels=[1,2,3,4,5])

# Age is now captured by AgeGroup, so remove it
df_testing.drop(columns=['Age'],inplace=True)

# Add 1 to count the passenger in that row
df_testing['family'] = df_testing.Parch+df_testing.SibSp+1

df_testing.drop(columns=['SibSp','Parch'],inplace=True)

df_testing['category_sex'] = df_testing['Sex'].apply(lambda x: 1 if x=='male'  else 0)
df_testing.drop(columns=['Sex'],inplace=True)

df_testing.Embarked = df_testing.Embarked.fillna('S')
df_testing.Embarked = df_testing.Embarked.map({'S':1,'C':2,'Q':3}).astype('int')


In [28]:
submission_data = pd.DataFrame({'PassengerId':testing_passengerId, 'Survived':randomForest.predict(df_testing)})

submission_data.to_csv("Submission_Data.csv",index=False)
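The test-set preparation repeats the training steps by hand, which makes it easy for the two to drift apart (for example, `pd.cut(..., 5)` picks its bin edges from whichever file it is given). One way to keep them in sync is to put the steps in a single function and call it for both files. The following is a sketch under stated assumptions, not the notebook's method: `prepare_features` and `AGE_BIN_EDGES` are hypothetical names, the fixed edges approximate the equal-width bins over ages 0 to 80, and the demo frame is the first five training rows with one Age blanked out for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical fixed edges, close to 5 equal-width bins over ages 0..80.
AGE_BIN_EDGES = [-1, 16, 32, 48, 64, 200]

def prepare_features(df, age_mean, age_std, rng):
    """Build the model's feature columns from a raw Titanic frame."""
    out = pd.DataFrame(index=df.index)
    out["Pclass"] = df["Pclass"]
    out["Embarked"] = df["Embarked"].fillna("S").map({"S": 1, "C": 2, "Q": 3}).astype(int)
    out["cabinAllocated"] = df["Cabin"].notna().astype(int)

    # Fill missing ages with random integers within one std of the given mean.
    age = df["Age"].copy()
    missing = age.isna()
    low, high = int(age_mean - age_std), int(age_mean + age_std)
    age[missing] = rng.integers(low, high, size=int(missing.sum()))
    out["AgeGroup"] = pd.cut(age.astype(int), bins=AGE_BIN_EDGES,
                             labels=[1, 2, 3, 4, 5]).astype(int)

    out["family"] = df["Parch"] + df["SibSp"] + 1
    out["category_sex"] = (df["Sex"] == "male").astype(int)
    return out

# Demo frame: first five training rows, with the third Age removed for illustration.
raw = pd.DataFrame({
    "Pclass": [3, 1, 3, 1, 3],
    "Sex": ["male", "female", "female", "female", "male"],
    "Age": [22.0, 38.0, np.nan, 35.0, 35.0],
    "SibSp": [1, 1, 0, 1, 0],
    "Parch": [0, 0, 0, 0, 0],
    "Cabin": [np.nan, "C85", np.nan, "C123", np.nan],
    "Embarked": ["S", "C", "S", "S", "S"],
})

rng = np.random.default_rng(0)  # seeded so the imputed ages are reproducible
features = prepare_features(raw, raw["Age"].mean(), raw["Age"].std(), rng)
print(features)
```

In this design the age statistics are passed in, so values computed on the training file could be reused when preparing the test file, and the column order matches the one the model was fitted on.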