# PySpark Tutorial on the Titanic Dataset

Apache Spark is an open-source distributed computing framework for big data analysis.
PySpark is the Python API for Apache Spark and includes libraries such as Spark SQL and MLlib.

In [2]:
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName('titanic_logreg').getOrCreate()
df = spark.read.csv('titanic.csv', inferSchema = True, header = True)
df.show(5)

## Explore Data

In [2]:
df.printSchema()

In [3]:
df.columns

In [4]:
titanic_df = df.toPandas()
titanic_df.sample(5)

In [5]:
titanic_df.info()

In [5]:
titanic_df.describe()

## Pre-processing

In [2]:
# Keep only the modelling columns; this drops PassengerId, Name, Ticket and Cabin
my_col = df.select(['Survived','Pclass','Sex','Age','SibSp','Parch','Fare','Embarked'])

In [3]:
# Drop every row that has a null in any of the selected columns
final_data = my_col.na.drop()
final_data.toPandas().info()

## Feature Transformers: VectorAssembler, StringIndexer, VectorIndexer, OneHotEncoder

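A <b>transformer</b> converts one DataFrame into another, often by appending or combining columns. Strictly speaking, StringIndexer must first be fit on the data to learn its label-to-index mapping, which makes it an estimator whose fitted model is the transformer.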

* <b>VectorAssembler:</b> combines a given list of columns into a single vector column. It is often used to merge raw features and features generated by other feature transformers into one feature vector.
* <b>StringIndexer:</b> encodes a string column of labels as a column of label indices.
* <b>VectorIndexer:</b> takes an input column in vector form, decides which features are categorical, and converts their values to indices. It is often used with decision trees and tree ensembles.
* <b>OneHotEncoder:</b> maps a categorical feature, represented as a label index, to a binary vector. Each entry in that vector indicates the presence of one particular value of the categorical feature.
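
To make the bullet points above concrete, here is a minimal plain-Python sketch of what indexing, one-hot encoding, and assembling do to a single column. It is not Spark code: the helper names `fit_string_indexer`, `one_hot`, and `assemble` and the toy `embarked` list are hypothetical. It assumes Spark's default behaviour, where StringIndexer gives index 0 to the most frequent label and OneHotEncoder drops the last category, so a feature with three values becomes a vector of length two. Spark also stores the result as a sparse vector rather than a Python list.

```python
from collections import Counter


def fit_string_indexer(values):
    """Map each label to an index, most frequent first (ties broken alphabetically)."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: index for index, label in enumerate(ordered)}


def one_hot(index, num_categories, drop_last=True):
    """Encode an index as a binary vector; with drop_last the final category is all zeros."""
    size = num_categories - 1 if drop_last else num_categories
    return [1.0 if position == index else 0.0 for position in range(size)]


def assemble(*parts):
    """Concatenate scalars and vectors into one flat feature vector."""
    features = []
    for part in parts:
        if isinstance(part, (list, tuple)):
            features.extend(part)
        else:
            features.append(float(part))
    return features


# Hypothetical toy column, not taken from titanic.csv.
embarked = ['S', 'C', 'S', 'Q', 'S', 'C']
embark_index = fit_string_indexer(embarked)
print(embark_index)  # {'S': 0, 'C': 1, 'Q': 2}

# Encode one passenger who embarked at 'C', then combine with Pclass=3 and Age=22.
row_vector = one_hot(embark_index['C'], num_categories=len(embark_index))
print(assemble(3, 22.0, row_vector))  # [3.0, 22.0, 0.0, 1.0]
```

Dropping the last category avoids a redundant column: a row of all zeros already identifies the least frequent label.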

In [4]:
from pyspark.ml.feature import (VectorAssembler, StringIndexer, VectorIndexer, OneHotEncoder)

gender_indexer = StringIndexer(inputCol = 'Sex', outputCol = 'SexIndex')
gender_encoder = OneHotEncoder(inputCol='SexIndex', outputCol = 'SexVec')

In [5]:
embark_indexer = StringIndexer(inputCol = 'Embarked', outputCol = 'EmbarkIndex')
embark_encoder = OneHotEncoder(inputCol = 'EmbarkIndex', outputCol = 'EmbarkVec')

In [6]:
assembler = VectorAssembler(inputCols = ['Pclass', 'SexVec', 'Age', 'SibSp', 'Parch', 'Fare', 'EmbarkVec'], outputCol = 'features')

## What is a Pipeline?

Spark represents a machine learning workflow as a sequence of transformers and estimators, chained together in a pipeline.

As mentioned earlier, a <b>transformer</b> converts one DataFrame into another, often by appending or combining columns. An <b>estimator</b> is an algorithm that is fit on a DataFrame and produces a model, which is itself a transformer.

A <b>pipeline</b> is a series of stages linked together, where each stage is either a transformer or an estimator. Our pipeline below links the gender_indexer, embark_indexer, gender_encoder, embark_encoder, and assembler feature stages with the log_reg estimator.
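
The fit-then-transform flow described above can be sketched in plain Python. This is an illustration of the idea only, not Spark's implementation: the classes `AddOne`, `Scaler`, `ScalerModel`, `ToyPipeline`, and `ToyPipelineModel` are hypothetical, and lists of numbers stand in for DataFrames. The point it tries to show is that fitting a pipeline fits each estimator on the output of the stages before it, and that the fitted pipeline reuses those learned parameters on new data.

```python
class AddOne:
    """Hypothetical transformer: needs no fitting, just rewrites the rows."""

    def transform(self, rows):
        return [row + 1 for row in rows]


class Scaler:
    """Hypothetical estimator: learns the maximum and returns a fitted model."""

    def fit(self, rows):
        return ScalerModel(max(rows))


class ScalerModel:
    """Fitted model produced by Scaler; it behaves as a transformer."""

    def __init__(self, largest):
        self.largest = largest

    def transform(self, rows):
        return [row / self.largest for row in rows]


class ToyPipeline:
    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, 'fit'):      # estimator: fit it on the current rows
                stage = stage.fit(rows)
            fitted.append(stage)
            rows = stage.transform(rows)   # pass transformed rows to the next stage
        return ToyPipelineModel(fitted)


class ToyPipelineModel:
    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


# Fit on "training" rows: AddOne gives [2, 4], so Scaler learns a maximum of 4.
model = ToyPipeline([AddOne(), Scaler()]).fit([1, 3])
# Transform "test" rows with the parameters learned during fit.
print(model.transform([3, 7]))  # [1.0, 2.0]
```

Note that the test rows are scaled by the maximum learned from the training rows, which is why the pipeline is fit on `train` only and then applied to `test`.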

In [7]:
from pyspark.ml import Pipeline

log_reg = LogisticRegression(featuresCol = 'features', labelCol = 'Survived')

In [8]:
pipeline = Pipeline(stages = [gender_indexer, embark_indexer, 
                             gender_encoder, embark_encoder,
                             assembler, log_reg])

In [9]:
train, test = final_data.randomSplit([0.7, 0.3])
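
As a rough illustration of what a weighted random split does, the plain-Python sketch below assigns each row to a split independently, with probability proportional to the weights. The helper `random_split` is hypothetical and is not Spark's code. Two consequences carry over to `randomSplit`: the split sizes are only approximately 70/30, and the result changes from run to run unless a seed is fixed (`randomSplit` accepts an optional `seed` argument for this).

```python
import random


def random_split(rows, weights, seed):
    """Assign each row independently to one split, with probability proportional to its weight."""
    rng = random.Random(seed)
    total = sum(weights)
    # Cumulative upper bounds of each split on the interval [0, 1).
    bounds = []
    running = 0.0
    for weight in weights:
        running += weight / total
        bounds.append(running)

    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for index, bound in enumerate(bounds):
            if draw < bound:
                splits[index].append(row)
                break
        else:
            # Guard against floating-point rounding in the last bound.
            splits[-1].append(row)
    return splits


# Hypothetical data: 1000 row ids standing in for passengers.
train_rows, test_rows = random_split(list(range(1000)), [0.7, 0.3], seed=42)
print(len(train_rows), len(test_rows))  # close to 700 and 300, but not exactly
```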

In [10]:
fit_model = pipeline.fit(train)

In [11]:
results = fit_model.transform(test)

In [12]:
results.select('prediction', 'Survived').show(3)

## Metrics and Performance Analysis

In [13]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator

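# Named 'evaluator' to avoid shadowing Python's built-in eval(); the default metric is areaUnderROC
evaluator = BinaryClassificationEvaluator(rawPredictionCol = 'rawPrediction', labelCol = 'Survived')
AUC = evaluator.evaluate(results)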
AUC
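
To help interpret the AUC value, the plain-Python sketch below computes the area under the ROC curve in its pairwise form: the fraction of (survivor, non-survivor) pairs in which the survivor received the higher score, with ties counted as half. The function `area_under_roc` and the toy labels and scores are hypothetical, and this is an equivalent way to read the metric rather than the way Spark calculates it.

```python
def area_under_roc(labels, scores):
    """Fraction of positive/negative pairs ranked correctly; ties count as half."""
    positives = [score for label, score in zip(labels, scores) if label == 1]
    negatives = [score for label, score in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError('AUC needs at least one positive and one negative label')

    wins = 0.0
    for positive in positives:
        for negative in negatives:
            if positive > negative:
                wins += 1.0
            elif positive == negative:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical labels and scores, not output from the Titanic model.
labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.4, 0.3]
print(area_under_roc(labels, scores))  # 0.75
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking of survivors above non-survivors.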
