# Titanic_Logistic_Regression

This sample works with the well-known Titanic dataset, which is provided in the file titanic.csv.
The dataset records various attributes of passengers on the Titanic, including who survived and who didn't.

Using the "Survived" column as the label (the supervised information) and the other columns as features, we can predict whether a given passenger (an instance) survived by using the Spark LogisticRegression model as a binary classifier.
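
To make the idea of a binary classifier concrete, here is a minimal plain-Python sketch of how a logistic regression model turns features into a survival prediction. It is not Spark's implementation, and the two features, the weights and the bias are hypothetical values chosen only for illustration.

```python
import math

def sigmoid(z):
    """Map any real number to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_survival(features, weights, bias, threshold=0.5):
    """Return (probability of label 1, predicted label) for one passenger."""
    # Weighted sum of the features plus the bias term
    z = bias + sum(w * x for w, x in zip(weights, features))
    p = sigmoid(z)
    return p, (1 if p >= threshold else 0)

# Hypothetical model with two features: [is_female, passenger_class]
weights = [2.5, -1.0]
bias = 0.5

# A female passenger in first class: z = 0.5 + 2.5 - 1.0 = 2.0
p_first, label_first = predict_survival([1.0, 1.0], weights, bias)

# A male passenger in third class: z = 0.5 + 0.0 - 3.0 = -2.5
p_third, label_third = predict_survival([0.0, 3.0], weights, bias)

print(round(p_first, 3), label_first)   # 0.881 1
print(round(p_third, 3), label_third)   # 0.076 0
```

Training a real model means finding the weights and bias that best fit the labelled data. In this notebook that is what the Spark LogisticRegression stage does when the pipeline is fitted.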

Finally, the quality of the predictions is evaluated using the area under the ROC curve (AUC).
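
The evaluation step at the end of this notebook uses BinaryClassificationEvaluator, whose default metric is the area under the ROC curve (AUC) rather than accuracy. The plain-Python sketch below shows how the two differ. It uses a small made-up set of labels and scores and computes AUC as the fraction of (positive, negative) pairs that are ranked correctly, which is an illustrative definition and not Spark's code.

```python
def auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def accuracy(labels, predictions):
    """Fraction of predictions that match the labels exactly."""
    return sum(y == p for y, p in zip(labels, predictions)) / len(labels)

# Made-up example data: true labels and model scores (probability of label 1)
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.2]

# Hard 0/1 predictions obtained by thresholding the scores at 0.5
hard = [1 if s >= 0.5 else 0 for s in scores]

auc_from_scores = auc(labels, scores)   # 8 of 9 pairs ranked correctly
auc_from_hard = auc(labels, hard)       # coarser: many pairs are tied
acc = accuracy(labels, hard)            # 4 of 6 predictions correct
```

Because hard 0/1 predictions discard the ranking information in the scores, an AUC computed from them is usually coarser than one computed from raw scores or probabilities.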



In [None]:
from pyspark.sql import SparkSession

In [None]:
spark = SparkSession.builder.appName('myproj').getOrCreate()

In [None]:
data = spark.read.csv('titanic.csv',inferSchema=True,header=True)

In [None]:
data.printSchema()

In [None]:
data.columns

In [None]:
# Ignore PassengerId, Name, Ticket and Cabin, as they are not useful for this analysis

my_cols = data.select(['Survived',
                       'Pclass',
                       'Sex',
                       'Age',
                       'SibSp',
                       'Parch',
                       'Fare',
                       'Embarked'])


In [None]:
my_final_data = my_cols.na.drop()

In [None]:
my_final_data.show()

### Working with Categorical Columns

Let's break this down into multiple steps to make it all clear.
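
Categorical columns such as Sex and Embarked hold strings, so they are first mapped to numeric indices and then to one-hot vectors. The plain-Python sketch below mimics that idea. It assumes the default behaviour of Spark's StringIndexer (the most frequent label gets index 0) and of OneHotEncoder (the last category is dropped), and the helper names `fit_string_index` and `one_hot` are hypothetical, not part of any library.

```python
from collections import Counter

def fit_string_index(values):
    """Assign index 0 to the most frequent label, 1 to the next, and so on."""
    counts = Counter(values)
    # Sort by descending frequency, breaking ties alphabetically
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: i for i, label in enumerate(ordered)}

def one_hot(index, num_categories, drop_last=True):
    """Return a one-hot list; with drop_last the last category is all zeros."""
    size = num_categories - 1 if drop_last else num_categories
    vec = [0.0] * size
    if index < size:
        vec[index] = 1.0
    return vec

# Made-up sample of the Embarked column
embarked = ["S", "C", "S", "Q", "S", "C"]

mapping = fit_string_index(embarked)
encoded = [one_hot(mapping[v], len(mapping)) for v in embarked]

print(mapping)     # {'S': 0, 'C': 1, 'Q': 2}
print(encoded[0])  # [1.0, 0.0]  -> 'S'
print(encoded[3])  # [0.0, 0.0]  -> 'Q', the dropped last category
```

Dropping the last category keeps the encoded columns from being linearly dependent, which matters for linear models such as logistic regression.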

In [None]:
# Sex and Embarked are stored as strings, so they must be converted to numeric indices and then one-hot encoded

from pyspark.ml.feature import (VectorAssembler,VectorIndexer,
                                OneHotEncoder,StringIndexer)

In [None]:
gender_indexer = StringIndexer(inputCol='Sex',outputCol='SexIndex')
gender_encoder = OneHotEncoder(inputCol='SexIndex',outputCol='SexVec')

In [None]:
embark_indexer = StringIndexer(inputCol='Embarked',outputCol='EmbarkIndex')
embark_encoder = OneHotEncoder(inputCol='EmbarkIndex',outputCol='EmbarkVec')

In [None]:
# Define the feature columns. "Survived" is left out because it is the label.
assembler = VectorAssembler(inputCols=['Pclass',
                                       'SexVec',
                                       'Age',
                                       'SibSp',
                                       'Parch',
                                       'Fare',
                                       'EmbarkVec'],
                            outputCol='features')

In [None]:


from pyspark.ml.classification import LogisticRegression

# Use help() to inspect the LogisticRegression class and its parameters

help(LogisticRegression)

## Pipelines 

The Pipeline API in Spark (like the one in Scikit-Learn) makes it easier to combine multiple algorithms into a single pipeline, or workflow.

More details are available at https://spark.apache.org/docs/2.2.0/ml-pipeline.html

Here, the case is a simple one.

Let's see an example of how to use pipelines (we'll get a lot more practice with these later!)
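
Before the Spark version, the following plain-Python sketch illustrates the general pipeline idea: stages that need to learn something from the data are fitted first, and every stage's output feeds the next stage. All class names here (`Scale`, `Center`, `Shift`, `MiniPipeline`, `MiniPipelineModel`) are hypothetical and exist only for this illustration. This is not how Spark implements its Pipeline class.

```python
class Scale:
    """Transformer: multiplies every value by a fixed factor."""
    def __init__(self, factor):
        self.factor = factor

    def transform(self, rows):
        return [x * self.factor for x in rows]


class Shift:
    """Transformer: adds a fixed offset to every value."""
    def __init__(self, offset):
        self.offset = offset

    def transform(self, rows):
        return [x + self.offset for x in rows]


class Center:
    """Estimator: learns the mean in fit() and returns a Shift transformer."""
    def fit(self, rows):
        mean = sum(rows) / len(rows)
        return Shift(-mean)


class MiniPipelineModel:
    """Fitted pipeline: applies every fitted stage in order."""
    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


class MiniPipeline:
    """Unfitted pipeline: fits estimators one by one on the running data."""
    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):
                stage = stage.fit(rows)   # an estimator becomes a transformer
            fitted.append(stage)
            rows = stage.transform(rows)  # pass the output to the next stage
        return MiniPipelineModel(fitted)


train = [1.0, 2.0, 3.0]
model = MiniPipeline([Scale(2.0), Center()]).fit(train)

print(model.transform(train))  # [-2.0, 0.0, 2.0]
print(model.transform([4.0]))  # [4.0], using the mean learned from train
```

The key design point is that fitting happens once on the training data, and the fitted model then applies exactly the same transformations to any new data.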

In [None]:
from pyspark.ml import Pipeline

In [None]:
log_reg_titanic = LogisticRegression(featuresCol='features',labelCol='Survived')

In [None]:
# A Pipeline chains (or connects) multiple Transformers and Estimators together to specify an ML workflow.

help(Pipeline)

# Combine the indexers, encoders, assembler and classifier into one pipeline
pipeline = Pipeline(stages=[gender_indexer, embark_indexer,
                            gender_encoder, embark_encoder,
                            assembler, log_reg_titanic])

In [None]:
# Randomly split the data into roughly 70% training rows and 30% test rows
train_titanic_data, test_titanic_data = my_final_data.randomSplit([0.7, 0.3])

In [None]:
fit_model = pipeline.fit(train_titanic_data)  # fit every pipeline stage on the training data

In [None]:
results = fit_model.transform(test_titanic_data)

# A Transformer is an algorithm that transforms one DataFrame into another.
# For example, a fitted ML model is a Transformer that turns a DataFrame
# with features into a DataFrame with predictions.

In [None]:
results

In [None]:
results.show()

In [None]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Evaluator for binary classification, which expects two input columns: rawPrediction and label.
# The rawPrediction column can be of type double (binary 0/1 prediction, or probability of
# label 1) or of type vector (length-2 vector of raw predictions, scores, or label probabilities).
# See the help output for more details.

help(BinaryClassificationEvaluator)

In [None]:
# rawPredictionCol='prediction' passes the hard 0/1 predictions to the evaluator.
# The default metric is areaUnderROC, so evaluate() returns an AUC value, not accuracy.
my_eval = BinaryClassificationEvaluator(rawPredictionCol='prediction',
                                        labelCol='Survived')

In [None]:
results.select('Survived','prediction').show()

In [None]:
AUC = my_eval.evaluate(results)

In [None]:
# Area under the ROC curve (AUC) on the test set; note that this is not the same as accuracy
AUC

## Great Job!