Importing the required packages:
1. pandas, a data analysis library, used here to read the dataset files
2. linear_model from scikit-learn, which provides the Logistic Regression algorithm

In [25]:
import pandas as ps
from sklearn import linear_model

Begin by reading the training dataset file, which contains the input features (the Age, SibSp, Parch, Sex, Pclass, Fare and Embarked columns) as well as the output feature (the Survived column) required for training the algorithm:

In [26]:
trainSet=ps.read_csv("./data/train.csv")

Before beginning the data analysis, it is important to clean up the data: empty values must be filled in, and text values must be converted to numbers so that the algorithm can work with them.

1. In the data_cleanup method below, empty values in the "Age" and "Fare" columns are filled with the mean of the respective column (the median or mode could be used instead).
2. Empty values in the "Embarked" column are replaced with the default value "S".
   To make the data algorithm-friendly, the values "S", "C" and "Q" are then replaced with 1, 2 and 3 respectively.
3. The values in the "Sex" column are replaced with 1 for "female" and 0 for "male".

In [27]:
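def data_cleanup(dataset):
    """Fill empty values and encode text columns as numbers, modifying dataset in place."""
    # Fill missing numeric values with the column mean (mean() skips empty values)
    dataset["Age"] = dataset["Age"].fillna(dataset["Age"].mean())
    dataset["Fare"] = dataset["Fare"].fillna(dataset["Fare"].mean())
    # Default missing ports of embarkation to "S", then encode S/C/Q as 1/2/3
    dataset["Embarked"] = dataset["Embarked"].fillna("S")
    dataset.loc[dataset["Embarked"] == "S", "Embarked"] = 1
    dataset.loc[dataset["Embarked"] == "C", "Embarked"] = 2
    dataset.loc[dataset["Embarked"] == "Q", "Embarked"] = 3
    # Encode Sex: female as 1, male as 0
    dataset.loc[dataset["Sex"] == "female", "Sex"] = 1
    dataset.loc[dataset["Sex"] == "male", "Sex"] = 0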
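
The same encoding can also be written with Series.map, which replaces each value through a dictionary in one step. The following is only a sketch of that alternative, not the method used in this notebook: the function name data_cleanup_mapped and the toy rows are made up for illustration, and unlike data_cleanup it returns a cleaned copy instead of modifying its argument.

```python
import pandas as ps

# Hypothetical toy rows standing in for a few lines of train.csv
toy = ps.DataFrame({
    "Age": [22.0, None, 30.0],
    "Fare": [7.25, 71.28, None],
    "Embarked": ["S", None, "Q"],
    "Sex": ["male", "female", "female"],
})

def data_cleanup_mapped(dataset):
    """Return a cleaned copy using the same rules as data_cleanup."""
    cleaned = dataset.copy()
    # Fill missing numeric values with the column mean
    cleaned["Age"] = cleaned["Age"].fillna(cleaned["Age"].mean())
    cleaned["Fare"] = cleaned["Fare"].fillna(cleaned["Fare"].mean())
    # Default missing ports to "S", then encode S/C/Q as 1/2/3
    cleaned["Embarked"] = cleaned["Embarked"].fillna("S").map({"S": 1, "C": 2, "Q": 3})
    # Encode Sex: female as 1, male as 0
    cleaned["Sex"] = cleaned["Sex"].map({"female": 1, "male": 0})
    return cleaned

cleaned_toy = data_cleanup_mapped(toy)
print(cleaned_toy)
```

One thing to keep in mind with map: any value missing from the dictionary becomes NaN, so an unexpected category shows up as an empty value rather than being left as text.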

The training dataset read from the file is cleaned up using the data_cleanup method.

Next, the columns required to feed the algorithm are selected from the training dataset and assigned to a variable called featureSet. This will be the input for the algorithm.

The output column "Survived" is selected and assigned to the targetValue variable. The algorithm will treat this as the expected output:

In [28]:
data_cleanup(trainSet)

featureSet=trainSet[["Age","SibSp","Parch","Sex","Pclass","Fare","Embarked"]].values
targetValue=trainSet["Survived"].values

In the code below, an instance of the LogisticRegression class is created, and the input and output features are fed to the algorithm by calling the fit method on the instance "classify". The fit method returns the fitted classifier, which is stored in classify_:

In [29]:
classify=linear_model.LogisticRegression()
classify_=classify.fit(featureSet,targetValue)

The accuracy of the classifier can be computed using the score method on the classify_ variable. Here it is about 80%, measured on the same data the classifier was trained on:

In [30]:
print(classify_.score(featureSet,targetValue))

0.8002244668911336
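
Because the score above is computed on the training data itself, it can be optimistic. A common way to get a less biased estimate is to hold back part of the data for validation. The sketch below illustrates the idea on synthetic data generated with make_classification, which merely stands in for featureSet and targetValue; the 25% split size, the random_state values and max_iter=1000 are arbitrary assumptions, not settings from this notebook.

```python
from sklearn import linear_model
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for featureSet / targetValue (not the Titanic data)
X, y = make_classification(n_samples=400, n_features=7, random_state=0)

# Hold back 25% of the rows; stratify keeps the class ratio similar in both parts
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

model = linear_model.LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

train_acc = model.score(X_train, y_train)  # accuracy on data the model has seen
val_acc = model.score(X_val, y_val)        # accuracy on held-back data
print(train_acc, val_acc)
```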


We can also see how the accuracy varies when the set of input features is changed. Below, the "Fare" and "Embarked" columns are dropped:

In [31]:
featureSet=trainSet[["Age","SibSp","Parch","Sex","Pclass"]].values
targetValue=trainSet["Survived"].values

classify=linear_model.LogisticRegression()
classify_=classify.fit(featureSet,targetValue)

print(classify_.score(featureSet,targetValue))

0.8035914702581369
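
A difference this small between two feature sets is hard to interpret from a single score on the training data. One option is to compare the feature sets with cross-validation, which averages the accuracy over several train/validation splits. This is a sketch on synthetic data, not the Titanic columns; the column slices, cv=5 and max_iter=1000 are illustrative assumptions.

```python
from sklearn import linear_model
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score

# Synthetic stand-in with 7 feature columns (not the Titanic data)
X, y = make_classification(n_samples=400, n_features=7, random_state=0)

# Two candidate feature sets: all columns, and only the first five
feature_sets = {"all 7 columns": slice(0, 7), "first 5 columns": slice(0, 5)}

mean_scores = {}
for name, columns in feature_sets.items():
    model = linear_model.LogisticRegression(max_iter=1000)
    # 5-fold cross-validation: one accuracy value per fold
    scores = cross_val_score(model, X[:, columns], y, cv=5)
    mean_scores[name] = scores.mean()
    print(name, scores.mean())
```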


Now the testing dataset file is read. It contains the input features (the Age, SibSp, Parch, Sex, Pclass, Fare and Embarked columns) for testing the algorithm.

The test dataset is cleaned up as well, using the method defined above.

The input features are then assigned to the variable testfeatureSet. They are the same five columns the most recent classifier was trained on.

In [33]:
testSet=ps.read_csv("./data/test_input.csv")

data_cleanup(testSet)

testfeatureSet=testSet[["Age","SibSp","Parch","Sex","Pclass"]].values
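
Note that calling data_cleanup on the test set fills its empty values with means computed from the test set itself. A frequently used alternative is to compute the fill values once on the training data and reuse them for the test data, so that both sets are treated identically. The sketch below shows the idea for the "Age" column only, with made-up toy values; it is an illustration of that alternative, not what the cell above does.

```python
import pandas as ps

# Hypothetical toy data: one missing Age in each set
train = ps.DataFrame({"Age": [20.0, 30.0, None, 40.0]})
test = ps.DataFrame({"Age": [None, 60.0]})

# Compute the fill value on the training data only
train_age_mean = train["Age"].mean()

# Reuse the same value for both sets
train_filled = train["Age"].fillna(train_age_mean)
test_filled = test["Age"].fillna(train_age_mean)

print(train_age_mean)
print(test_filled.tolist())
```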

The testfeatureSet is passed to the predict method of the classifier instance, and the predictions are stored in the variable result.
The first 20 predictions can be seen below:

In [34]:
result=classify_.predict(testfeatureSet)

print("Printing first 20 results from the prediction: 0 as deceased and 1 as survived")

result[0:20]

Printing first 20 results from the prediction: 0 as deceased and 1 as survived


array([0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0],
      dtype=int64)
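
Besides the 0/1 labels returned by predict, a fitted LogisticRegression can also report how confident it is through predict_proba, which returns one probability per class for each row. The sketch below uses synthetic data in place of testfeatureSet; the sample sizes and max_iter=1000 are arbitrary assumptions.

```python
import numpy as np
from sklearn import linear_model
from sklearn.datasets import make_classification

# Synthetic stand-in for the feature data (not the Titanic data)
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
model = linear_model.LogisticRegression(max_iter=1000).fit(X, y)

labels = model.predict(X[:5])               # hard 0/1 predictions
probabilities = model.predict_proba(X[:5])  # one column per entry of model.classes_

# The predicted label is the class with the highest probability
most_likely = model.classes_[np.argmax(probabilities, axis=1)]

print(labels)
print(probabilities)
```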

The expected output is read from the file test_expected_output.csv, which contains the correct output for the test dataset.

The classifier's predictions are compared with the expected output to compute the accuracy of the prediction, which is about 89% here:

In [35]:
output = ps.read_csv('./data/test_expected_output.csv')
targetOutput=output["Survived"].values

print(classify_.score(testfeatureSet,targetOutput))

0.8947368421052632
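
The score method reports only the fraction of correct predictions. To see which kinds of mistakes are made, the predictions can be compared with the expected output using accuracy_score and confusion_matrix from sklearn.metrics. The arrays below are made-up values for illustration, not results from this notebook.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical expected and predicted labels (0 = deceased, 1 = survived)
expected = np.array([0, 0, 1, 1, 1, 0, 1, 0])
predicted = np.array([0, 1, 1, 1, 0, 0, 1, 0])

# Fraction of positions where the two arrays agree
accuracy = accuracy_score(expected, predicted)

# Rows are the expected class, columns are the predicted class
matrix = confusion_matrix(expected, predicted)

print(accuracy)
print(matrix)
```

In the confusion matrix, the diagonal holds the correct predictions; the off-diagonal cells separate passengers wrongly predicted as survived from those wrongly predicted as deceased.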
