In [1]:
# In this notebook, I will show how to tackle Titanic, Kaggle's entry-level challenge. In this challenge, you are given a training dataset and a test dataset. Your goal is to build and train a model on the training dataset and then use it to predict whether each passenger listed in the test dataset survives. Once you have your predictions, you submit them to Kaggle, which evaluates your model's performance.
# As part of this challenge, we will:
# 1. Load data
# 2. Explore and clean data
# 3. Train our model
# 4. Predict values using our model

# I will be using Databricks Community Edition to run Apache Spark and its machine learning library, MLlib, to build a logistic regression model.

In [2]:
# Download the train and test datasets from my GitHub repo: https://github.com/himoacs/kaggle/tree/master/titanic
# Upload them into Databricks so you can easily load them into a DataFrame
train_df = spark.read.format('csv').options(header='true', inferSchema='true').load('/FileStore/tables/train.csv')
train_df.show()

# Here is the description of each column provided by Kaggle
# PassengerId - Unique id for the passenger
# Survived - Whether the passenger survived or not
# Pclass - Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd)
# Name - Passenger's name
# Sex - Passenger's sex
# Age - Passenger's age
# SibSp - Number of siblings / spouses aboard the Titanic
# Parch - Number of parents / children aboard the Titanic
# Ticket - Ticket number
# Fare - Ticket fare
# Cabin - Cabin number
# Embarked - Port of Embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)

In [3]:
# We can also take a look at the schema
train_df.printSchema()

In [4]:
# We now need to decide which of the features provided to us can actually be used to predict whether a passenger survives or drowns.
# I am going to exclude PassengerId, Name, SibSp, Parch, Ticket and Cabin because I don't believe these features influence the outcome. For example, whether a person survives or not doesn't depend on his or her name. We can sometimes get additional information from these features, such as extracting the title from the name to see whether a person is a doctor and using that as a feature. This technique is called Feature Engineering and we will not be focusing on it in this post.
# Which features you select is also a personal decision. You may think that the number of children a passenger has on board matters, whereas I might think that it does not. While there are some very obvious features that should be included, others can be tough to select.

train_df = train_df.select(['Survived', 'Pclass', 'Sex', 'Age', 'Fare', 'Embarked'])
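
# To make the title idea mentioned above concrete, here is a minimal plain-Python sketch of pulling a title out of a name.
# It assumes names follow the "Surname, Title. Given names" pattern; extract_title and TITLE_PATTERN are hypothetical names and are not used anywhere else in this notebook.

```python
import re

# Hypothetical helper for illustration only; not part of the pipeline built below.
# Assumes the title sits between the first comma and the following period.
TITLE_PATTERN = re.compile(r",\s*([^.,]+)\.")

def extract_title(name):
    """Return the title found after the comma, or None if the pattern is absent."""
    match = TITLE_PATTERN.search(name)
    return match.group(1).strip() if match else None

# Sample names in the style of the Name column
sample_names = [
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Name without a title",
]
titles = [extract_title(name) for name in sample_names]
print(titles)
```

# A title extracted this way would still need to be encoded like any other categorical feature before a model could use it.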

In [5]:
# Let's explore the dataset
train_df.describe().show()

# As we can see below, four of the six columns have 891 values, but two columns (Age and Embarked) have only 714 and 889 values. This is because they have missing/NA values.

In [6]:
# Let's clean this data by dropping any rows with null values
train_df_clean = train_df.na.drop()
train_df_clean.describe().show()

# As you can see now, after we drop the rows with null values, we have a total of 712 rows and each column has the same number of values.
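
# For intuition, here is a plain-Python sketch of the two steps above: counting non-missing values per column and dropping any row that has a missing value.
# The rows are toy data and the function names are hypothetical; this mimics the behaviour conceptually and is not how Spark implements it.

```python
# Toy rows standing in for the DataFrame; None marks a missing value.
rows = [
    {"Survived": 0, "Age": 22.0, "Embarked": "S"},
    {"Survived": 1, "Age": None, "Embarked": "C"},
    {"Survived": 1, "Age": 35.0, "Embarked": None},
    {"Survived": 0, "Age": 54.0, "Embarked": "Q"},
]

def non_null_counts(rows):
    """Count non-missing values per column, like the 'count' row of describe()."""
    return {col: sum(r[col] is not None for r in rows) for col in rows[0]}

def drop_nulls(rows):
    """Keep only rows in which every column has a value."""
    return [r for r in rows if all(v is not None for v in r.values())]

counts = non_null_counts(rows)
clean_rows = drop_nulls(rows)
print(counts)
print(len(clean_rows))
```

# Dropping rows is simple, but it throws away every other value in those rows as well, which is the price of this approach.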

In [7]:
# Now that we have a feel for what the data looks like and have cleaned it up, we can start preparing the features for our model.

from pyspark.ml.feature import VectorAssembler, StringIndexer, OneHotEncoder

In [8]:
# Since we have some categorical features (Sex and Embarked) in our dataset, we need to one-hot encode them so that our machine learning model can understand them. In Spark, this takes two steps: StringIndexer maps each string to a numeric index, and OneHotEncoder turns that index into a vector. I have a separate post that explains OneHotEncoding: http://www.enlistq.com/feature-encoding-python-using-scikit-learn/

sex_indexer = StringIndexer(inputCol='Sex', outputCol='SexIndex')
sex_encoder = OneHotEncoder(inputCol='SexIndex', outputCol='SexVec')

# We will need to do the same for our other categorical feature - Embarked.

embarked_indexer = StringIndexer(inputCol='Embarked', outputCol='EmbarkedIndex')
embarked_encoder = OneHotEncoder(inputCol='EmbarkedIndex', outputCol='EmbarkedVec')
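
# Here is a rough plain-Python sketch of what the indexer and encoder do conceptually.
# It assumes the default settings: the most frequent category gets index 0, and the last category is dropped so that it is encoded as all zeros. The function names are hypothetical, and Spark returns sparse vectors rather than the plain lists used here.

```python
from collections import Counter

def fit_string_index(values):
    """Map each category to an index, most frequent category first (index 0)."""
    ordered = [category for category, _ in Counter(values).most_common()]
    return {category: index for index, category in enumerate(ordered)}

def one_hot(index, num_categories, drop_last=True):
    """Encode an index as a 0/1 list; with drop_last the final category is all zeros."""
    size = num_categories - 1 if drop_last else num_categories
    return [1.0 if i == index else 0.0 for i in range(size)]

# Toy Embarked values: S appears 3 times, C twice, Q once
embarked = ["S", "C", "S", "Q", "S", "C"]
mapping = fit_string_index(embarked)
encoded = {cat: one_hot(idx, len(mapping)) for cat, idx in mapping.items()}
print(mapping)
print(encoded)
```

# Dropping the last category loses no information, because "all zeros" already identifies it uniquely.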

In [9]:
# Now that we have our categorical features encoded, we are ready to convert our training dataset into the vector form that Spark's MLlib expects. Remember not to include the 'Survived' column as an input because that's what we are trying to predict.

assembler = VectorAssembler(inputCols=['Pclass', 'SexVec', 'Age', 'Fare', 'EmbarkedVec'], outputCol='features')
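
# Conceptually, assembling just concatenates the chosen columns, scalars and vectors alike, into one flat list of numbers per row.
# A minimal plain-Python sketch follows; the assemble function and the row values are made up for illustration.

```python
def assemble(row, input_cols):
    """Concatenate scalar and list-valued columns into one flat feature list."""
    features = []
    for col in input_cols:
        value = row[col]
        if isinstance(value, (list, tuple)):
            features.extend(float(v) for v in value)  # unpack encoded vectors
        else:
            features.append(float(value))             # plain numeric column
    return features

# Toy row with already-encoded categorical columns
row = {
    "Survived": 1,
    "Pclass": 1,
    "SexVec": [0.0],
    "Age": 38.0,
    "Fare": 71.28,
    "EmbarkedVec": [0.0, 1.0],
}
input_cols = ["Pclass", "SexVec", "Age", "Fare", "EmbarkedVec"]
features = assemble(row, input_cols)
print(features)
```

# Note that 'Survived' stays out of input_cols, so the label never leaks into the features.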

In [10]:
# We are now ready to use a logistic regression model.
# I have covered Logistic Regression in my earlier post: http://www.enlistq.com/implementing-a-binomial-logistic-regression-model-in-python/
from pyspark.ml.classification import LogisticRegression
logistic_reg_model = LogisticRegression(featuresCol='features', labelCol='Survived')
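
# As a reminder of what a fitted logistic regression model computes, here is a small plain-Python sketch of the prediction step.
# The weights, intercept and feature values are made up, and the 0.5 threshold is an assumption matching the usual default; training (finding the weights) is not shown.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(features, weights, intercept):
    """Probability of the positive class (here: survived)."""
    z = intercept + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

def predict(features, weights, intercept, threshold=0.5):
    """Return 1.0 if the probability exceeds the threshold, else 0.0."""
    return 1.0 if predict_proba(features, weights, intercept) > threshold else 0.0

# Made-up model with two features
weights = [-1.0, 2.0]
intercept = 0.5

likely = predict([1.0, 1.0], weights, intercept)    # z = 1.5
unlikely = predict([3.0, 0.0], weights, intercept)  # z = -2.5
print(likely, unlikely)
```

# The predictions are 1.0 and 0.0 as floats, which is also the form in which they appear in the 'prediction' column later on.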

In [11]:
# We will now create a pipeline to bring everything together.

from pyspark.ml import Pipeline
pipeline = Pipeline(stages=[sex_indexer, embarked_indexer, sex_encoder, embarked_encoder, assembler, logistic_reg_model])

In [12]:
# We will now fit our pipeline (the indexers, encoders, assembler and logistic regression model) on our cleaned training dataset
model_fit = pipeline.fit(train_df_clean)

In [13]:
# According to the Kaggle submission rules, we need to predict values for the passengers listed in 'test.csv' and then submit those results to Kaggle.
# The final results should include two columns: PassengerId and Survived.
test_df = spark.read.format('csv').options(header='true', inferSchema='true').load('/FileStore/tables/test.csv')
test_df = test_df.select(['PassengerId', 'Pclass', 'Sex', 'Age', 'Fare', 'Embarked'])
test_df.describe().show()

In [14]:
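# As we can see, some of the rows don't have Age and/or Fare. Unlike with the training data, we cannot simply drop these rows, because Kaggle expects a prediction for every passenger in the test dataset. We need to fill these with some sensible values. One popular way to fill missing values is to use the mean.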
age_mean = test_df.agg({'Age': 'mean'}).first()[0]
fare_mean = test_df.agg({'Fare': 'mean'}).first()[0]
test_df = test_df.fillna(age_mean, subset=['Age'])
test_df = test_df.fillna(fare_mean, subset=['Fare'])

# As we can see now, all columns have the same number of values (418)
test_df.describe().show()
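
# The same idea in plain Python, for intuition: compute the mean over the values that are present and use it to fill the gaps.
# The ages below are toy values and the helper names are hypothetical.

```python
def mean_ignoring_missing(values):
    """Average of the values that are present; None entries are skipped."""
    present = [v for v in values if v is not None]
    return sum(present) / len(present)

def fill_missing(values, fill_value):
    """Replace every None with fill_value and leave other values untouched."""
    return [fill_value if v is None else v for v in values]

# Toy Age column with two missing entries
ages = [34.0, None, 62.0, 27.0, None]
toy_age_mean = mean_ignoring_missing(ages)   # (34 + 62 + 27) / 3
filled_ages = fill_missing(ages, toy_age_mean)
print(toy_age_mean)
print(filled_ages)
```

# Filling with the mean keeps the column average unchanged, but it does shrink the spread of the values, so it is a convenience rather than a perfect fix.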

In [15]:
# We need to feed the test data to our fitted model and predict.
results = model_fit.transform(test_df)

In [16]:
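# Here is what our predictions look like. Kaggle expects the columns PassengerId and Survived, so we rename 'prediction' to 'Survived' and cast it from double to int.
kaggle_results = results.select('PassengerId', results['prediction'].cast('int').alias('Survived'))
kaggle_results.show()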

In [17]:
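# Write to CSV so we can submit to Kaggle. Spark writes a directory of part files at this path, so we coalesce to a single partition to get one file, and we include the header row.
kaggle_results.coalesce(1).write.csv('/FileStore/tables/titanic_kaggle_results.csv', header=True)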
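
# For reference, here is a plain-Python sketch of the file layout the submission should end up with: a header row with PassengerId and Survived, followed by one row per passenger with an integer 0 or 1.
# The ids and predictions below are toy values and to_submission_csv is a hypothetical helper; it writes to an in-memory buffer rather than a real file.

```python
import csv
import io

# Toy (PassengerId, prediction) pairs; predictions arrive as floats
predictions = [(892, 0.0), (893, 1.0), (894, 0.0)]

def to_submission_csv(predictions):
    """Return CSV text with a header row and integer 0/1 predictions."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["PassengerId", "Survived"])
    for passenger_id, prediction in predictions:
        writer.writerow([passenger_id, int(prediction)])
    return buffer.getvalue()

submission = to_submission_csv(predictions)
print(submission)
```

# Comparing the first few lines of your real output file against this layout is a quick sanity check before uploading.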