### Titanic Dataset

The sinking of the Titanic is one of the most infamous shipwrecks in history.

On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.

While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

In this challenge, we ask you to build a predictive model that answers the question “what sorts of people were more likely to survive?” using passenger data (e.g., name, age, gender, and socio-economic class).

In [1]:
from IPython.display import display, HTML
# Widen the notebook container to the full browser width
display(HTML("<style>.container { width:100% !important; }</style>"))

### Start Spark Session

In [2]:
from pyspark.sql import SparkSession

In [5]:
spark = SparkSession.builder.appName('LogRed').getOrCreate()

### Read CSV Data

In [8]:
df = spark.read.csv('titanic.csv', inferSchema=True, header=True)

In [9]:
df.show()

+--------+------+--------------------+------+----+-----------------------+-----------------------+-------+
|Survived|Pclass|                Name|   Sex| Age|Siblings/Spouses Aboard|Parents/Children Aboard|   Fare|
+--------+------+--------------------+------+----+-----------------------+-----------------------+-------+
|       0|     3|Mr. Owen Harris B...|  male|22.0|                      1|                      0|   7.25|
|       1|     1|Mrs. John Bradley...|female|38.0|                      1|                      0|71.2833|
|       1|     3|Miss. Laina Heikk...|female|26.0|                      0|                      0|  7.925|
|       1|     1|Mrs. Jacques Heat...|female|35.0|                      1|                      0|   53.1|
|       0|     3|Mr. William Henry...|  male|35.0|                      0|                      0|   8.05|
|       0|     3|     Mr. James Moran|  male|27.0|                      0|                      0| 8.4583|
|       0|     1|Mr. Timothy J McC...

In [10]:
df.printSchema()

root
 |-- Survived: integer (nullable = true)
 |-- Pclass: integer (nullable = true)
 |-- Name: string (nullable = true)
 |-- Sex: string (nullable = true)
 |-- Age: double (nullable = true)
 |-- Siblings/Spouses Aboard: integer (nullable = true)
 |-- Parents/Children Aboard: integer (nullable = true)
 |-- Fare: double (nullable = true)



### Select Columns

In [12]:
my_cols = df.select(['Survived', 'Pclass', 'Sex', 'Age', 'Siblings/Spouses Aboard', 'Parents/Children Aboard', 'Fare'])

In [13]:
my_cols.describe().show()

+-------+-------------------+------------------+------+------------------+-----------------------+-----------------------+-----------------+
|summary|           Survived|            Pclass|   Sex|               Age|Siblings/Spouses Aboard|Parents/Children Aboard|             Fare|
+-------+-------------------+------------------+------+------------------+-----------------------+-----------------------+-----------------+
|  count|                887|               887|   887|               887|                    887|                    887|              887|
|   mean| 0.3855693348365276| 2.305524239007892|  null|29.471443066516347|     0.5253664036076663|     0.3833145434047351|32.30542018038328|
| stddev|0.48700411775101266|0.8366620036697728|  null|14.121908405462552|      1.104668553867569|     0.8074659070316833|49.78204040017391|
|    min|                  0|                 1|female|              0.42|                      0|                      0|              0.0|
|    max|    
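
As a cross-check on the summary, the `Survived` statistics can be reproduced without Spark. The sketch below assumes 342 survivors among the 887 rows, a figure inferred from mean × count rather than stated anywhere in the notebook, and it uses the sample standard deviation (n − 1 denominator), which is consistent with the `stddev` value shown above.

```python
import statistics

# Assumption: 342 survivors out of 887 rows, inferred from mean * count in the table.
n_rows = 887
n_survived = 342
survived = [1] * n_survived + [0] * (n_rows - n_survived)

mean = statistics.mean(survived)
stddev = statistics.stdev(survived)  # sample standard deviation (n - 1)

print(round(mean, 4), round(stddev, 4))
```

Both values agree with the `mean` and `stddev` rows of the `Survived` column to the printed precision.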

### Drop Rows with NA

In [33]:
# Default how='any' drops a row if any column is null; 'all' would only drop fully null rows
final_data = my_cols.na.drop()
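
The `how` argument of `na.drop` decides which rows go: `'any'` (the default) drops a row with at least one null, while `'all'` drops a row only when every value is null. A minimal plain-Python sketch of that rule, using made-up rows with `None` standing in for null (the helper name `drop_na` is hypothetical):

```python
def drop_na(rows, how="any"):
    """Mimic the 'any'/'all' row-dropping rule on a list of dicts."""
    if how not in ("any", "all"):
        raise ValueError("how must be 'any' or 'all'")
    check = any if how == "any" else all
    return [r for r in rows if not check(v is None for v in r.values())]

# Hypothetical rows, not taken from titanic.csv.
rows = [
    {"Age": 22.0, "Fare": 7.25},
    {"Age": None, "Fare": 8.05},
    {"Age": None, "Fare": None},
]

print(len(drop_na(rows, "any")))  # 1
print(len(drop_na(rows, "all")))  # 2
```

Since every column in the summary above has a count of 887, neither setting should remove rows from this particular file.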

### Feature Engineering

In [34]:
from pyspark.ml.feature import VectorAssembler, VectorIndexer, OneHotEncoder, StringIndexer

In [44]:
gender_indexer = StringIndexer(inputCol='Sex', outputCol='SexIndex')
gender_encoder = OneHotEncoder(inputCol='SexIndex', outputCol='SexVec')

In [45]:
Pclass_encoder = OneHotEncoder(inputCol='Pclass', outputCol='PclassVec')

In [46]:
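# Note: the raw numeric 'Pclass' column is assembled here; the one-hot
# 'PclassVec' produced by Pclass_encoder is computed but not used as a feature.
assembler = VectorAssembler(inputCols=['Pclass', 'SexVec', 'Age', 'Siblings/Spouses Aboard',
                                       'Parents/Children Aboard', 'Fare'], outputCol='features')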
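
To make the indexing and encoding steps concrete, here is a plain-Python sketch of what they do under Spark's defaults as commonly documented: `StringIndexer` assigns index 0 to the most frequent label, and `OneHotEncoder` drops the last category, so a two-valued column becomes a one-element vector. The helper names and sample values are hypothetical.

```python
from collections import Counter

def fit_string_index(values):
    """Map each label to an index, most frequent first (ties broken alphabetically here)."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

def one_hot(index, n_categories, drop_last=True):
    """Dense one-hot vector; with drop_last the final category is all zeros."""
    size = n_categories - 1 if drop_last else n_categories
    vec = [0.0] * size
    if int(index) < size:
        vec[int(index)] = 1.0
    return vec

sex = ["male", "female", "female", "male", "male"]  # hypothetical sample
index = fit_string_index(sex)
encoded = {label: one_hot(i, len(index)) for label, i in index.items()}
print(index)    # {'male': 0.0, 'female': 1.0}
print(encoded)  # {'male': [1.0], 'female': [0.0]}
```

Dropping the last category avoids a redundant column: for a binary feature, one indicator already carries all the information.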

### Logistic Regression

In [47]:
from pyspark.ml.classification import LogisticRegression

In [48]:
log_reg_titanic = LogisticRegression(featuresCol='features', labelCol='Survived')
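
For intuition, a fitted binary logistic regression scores a feature vector by passing a linear combination through the sigmoid and thresholding the resulting probability (0.5 is the usual default). The weights below are invented for illustration and are not the coefficients learned in this notebook.

```python
import math

def predict_proba(features, weights, intercept):
    """P(label = 1 | features) under a logistic model."""
    margin = intercept + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-margin))

def predict(features, weights, intercept, threshold=0.5):
    """Hard 0/1 prediction from the probability."""
    return 1.0 if predict_proba(features, weights, intercept) > threshold else 0.0

# Hypothetical weights for two features, e.g. [is_male, age].
weights = [-2.5, -0.03]
intercept = 2.0

p = predict_proba([1.0, 30.0], weights, intercept)
print(round(p, 3), predict([1.0, 30.0], weights, intercept))
```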

### Spark Pipeline

In [49]:
from pyspark.ml import Pipeline

In [50]:
pipeline = Pipeline(stages=[gender_indexer, gender_encoder, Pclass_encoder, assembler, log_reg_titanic])
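
Conceptually, fitting a pipeline walks the stages in order: each estimator is fitted on the data as transformed so far, and its fitted model transforms the data for the next stage. The toy classes below are hypothetical stand-ins that sketch this flow in plain Python; they are not Spark's implementation.

```python
class MeanCenter:
    """Toy estimator: learns the mean, then subtracts it."""
    def fit(self, data):
        self.mean_ = sum(data) / len(data)
        return self
    def transform(self, data):
        return [x - self.mean_ for x in data]

class Scale:
    """Toy estimator: learns the max absolute value, then divides by it."""
    def fit(self, data):
        self.scale_ = max(abs(x) for x in data) or 1.0
        return self
    def transform(self, data):
        return [x / self.scale_ for x in data]

class ToyPipeline:
    def __init__(self, stages):
        self.stages = stages
    def fit(self, data):
        for stage in self.stages:
            data = stage.fit(data).transform(data)  # later stages see earlier output
        return self
    def transform(self, data):
        for stage in self.stages:
            data = stage.transform(data)
        return data

model = ToyPipeline([MeanCenter(), Scale()]).fit([1.0, 2.0, 3.0])
print(model.transform([4.0]))  # [2.0]
```

The practical benefit is that parameters learned during fitting (here the mean and scale, in the notebook the string index and regression coefficients) are reused unchanged when transforming held-out data.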

### Split Data into Train and Test

In [51]:
train_data, test_data = final_data.randomSplit([.8, .2])
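
`randomSplit` is called here without a seed, so each run yields a different split and slightly different metrics; it accepts an optional `seed` argument for repeatability. The split sizes are also approximate, because rows are assigned at random rather than counted out. A plain-Python sketch of that behaviour (the function name and seed value are illustrative):

```python
import random

def random_split(rows, weights, seed=None):
    """Assign each row to a split with probability proportional to its weight."""
    rng = random.Random(seed)
    total = sum(weights)
    bounds = []
    running = 0.0
    for w in weights:
        running += w / total
        bounds.append(running)
    bounds[-1] = 1.0  # guard against floating-point drift
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for i, upper in enumerate(bounds):
            if draw < upper:
                splits[i].append(row)
                break
    return splits

rows = list(range(887))
train, test = random_split(rows, [0.8, 0.2], seed=42)
train_again, test_again = random_split(rows, [0.8, 0.2], seed=42)
print(len(train) + len(test) == len(rows), train == train_again)  # True True
```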

In [52]:
fit_model = pipeline.fit(train_data)

In [54]:
results = fit_model.transform(test_data)

### Evaluate Model

In [56]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator, MulticlassClassificationEvaluator

In [63]:
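# Note: rawPredictionCol points at the hard 0/1 'prediction' column, so the AUC below
# reflects a single decision threshold rather than the model's full score ranking.
bi_eval = BinaryClassificationEvaluator(rawPredictionCol='prediction', labelCol='Survived')
multi_eval_f1 = MulticlassClassificationEvaluator(predictionCol='prediction', labelCol='Survived', metricName='f1')
multi_eval_acc = MulticlassClassificationEvaluator(predictionCol='prediction', labelCol='Survived', metricName='accuracy')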

In [64]:
results = results.select('Survived', 'prediction')

### AUC

In [65]:
bi_eval.evaluate(results)

0.754509697545097
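
Because the evaluator is fed the hard 0/1 `prediction` column rather than a continuous score, the ROC curve has only one interior point, and the area under it reduces to the average of the true-positive and true-negative rates. The sketch below checks this identity in plain Python on invented labels; it is not the notebook's test set.

```python
def auc_from_scores(labels, scores):
    """AUC as the probability a positive outranks a negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def balanced_rate(labels, preds):
    """Mean of true-positive rate and true-negative rate."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    tn = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 0)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    return (tp / n_pos + tn / n_neg) / 2

# Invented example data.
labels = [1, 1, 1, 0, 0, 0, 0, 1]
preds = [1, 0, 1, 0, 0, 1, 0, 1]

print(auc_from_scores(labels, preds), balanced_rate(labels, preds))  # 0.75 0.75
```

Evaluating on the model's `rawPrediction` or `probability` column instead would measure how well the scores rank survivors above non-survivors across all thresholds, which is the more common reading of AUC.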

### F1

In [66]:
multi_eval_f1.evaluate(results)

0.767085103397247
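
With `metricName='f1'`, `MulticlassClassificationEvaluator` reports, as far as the Spark documentation describes it, a weighted F-measure: the F1 score of each label weighted by that label's share of the true labels, not the F1 of the positive class alone. A plain-Python sketch on invented data:

```python
def f1_for_label(labels, preds, target):
    """F1 score treating `target` as the positive class."""
    tp = sum(1 for y, p in zip(labels, preds) if y == target and p == target)
    fp = sum(1 for y, p in zip(labels, preds) if y != target and p == target)
    fn = sum(1 for y, p in zip(labels, preds) if y == target and p != target)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def weighted_f1(labels, preds):
    """Per-label F1 averaged with weights equal to each label's frequency."""
    n = len(labels)
    return sum(labels.count(c) / n * f1_for_label(labels, preds, c) for c in set(labels))

# Invented example data.
labels = [1, 1, 1, 0, 0, 0, 0, 0]
preds = [1, 0, 1, 0, 0, 1, 0, 0]

print(round(weighted_f1(labels, preds), 4))  # 0.75
```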

### Accuracy

In [67]:
multi_eval_acc.evaluate(results)

0.7701149425287356

### End Spark Session

In [68]:
spark.stop()