<a href="https://colab.research.google.com/github/muhammetsnts/SPARK/blob/main/projects/3.Titanic_Survive_with_Logistic_Regression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Project Info

We will try to predict whether a passenger survived, using the `titanic.csv` dataset.

# Setup Environment

In [None]:
# Install Java 8 (headless JDK)
!apt-get -q install openjdk-8-jdk-headless -qq > /dev/null

# Download Spark 3.1.1 (prebuilt for Hadoop 2.7)
!wget -q https://archive.apache.org/dist/spark/spark-3.1.1/spark-3.1.1-bin-hadoop2.7.tgz

# Extract the archive
!tar xf spark-3.1.1-bin-hadoop2.7.tgz

# Install findspark
!pip install -q findspark


import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["PATH"] = os.environ["JAVA_HOME"] + "/bin:" + os.environ["PATH"]
os.environ["SPARK_HOME"] = "/content/spark-3.1.1-bin-hadoop2.7"


import findspark
findspark.init()  # locate Spark via SPARK_HOME and add pyspark to sys.path
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()  # local mode, all available cores
#spark = SparkSession.builder.appName('lr').getOrCreate()

# Download and Read Dataset
We will use the Titanic dataset as a classification example.

In [None]:
!wget -q https://raw.githubusercontent.com/muhammetsnts/SPARK/main/data/titanic.csv

In [None]:
data = spark.read.csv("titanic.csv", inferSchema=True, header=True)

In [None]:
data.show()

+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+
|PassengerId|Survived|Pclass|                Name|   Sex| Age|SibSp|Parch|          Ticket|   Fare|Cabin|Embarked|
+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+
|          1|       0|     3|Braund, Mr. Owen ...|  male|22.0|    1|    0|       A/5 21171|   7.25| null|       S|
|          2|       1|     1|Cumings, Mrs. Joh...|female|38.0|    1|    0|        PC 17599|71.2833|  C85|       C|
|          3|       1|     3|Heikkinen, Miss. ...|female|26.0|    0|    0|STON/O2. 3101282|  7.925| null|       S|
|          4|       1|     1|Futrelle, Mrs. Ja...|female|35.0|    1|    0|          113803|   53.1| C123|       S|
|          5|       0|     3|Allen, Mr. Willia...|  male|35.0|    0|    0|          373450|   8.05| null|       S|
|          6|       0|     3|    Moran, Mr. James|  male|null|    0|    0|      

In [None]:
data.printSchema()

root
 |-- PassengerId: integer (nullable = true)
 |-- Survived: integer (nullable = true)
 |-- Pclass: integer (nullable = true)
 |-- Name: string (nullable = true)
 |-- Sex: string (nullable = true)
 |-- Age: double (nullable = true)
 |-- SibSp: integer (nullable = true)
 |-- Parch: integer (nullable = true)
 |-- Ticket: string (nullable = true)
 |-- Fare: double (nullable = true)
 |-- Cabin: string (nullable = true)
 |-- Embarked: string (nullable = true)



# Data Cleaning

In [None]:
data.columns

['PassengerId',
 'Survived',
 'Pclass',
 'Name',
 'Sex',
 'Age',
 'SibSp',
 'Parch',
 'Ticket',
 'Fare',
 'Cabin',
 'Embarked']

We will select only the columns we are going to use.


In [None]:
my_cols = data.select(['Survived',
                       'Pclass', 
                       'Sex',
                       'Age',
                       'SibSp',
                       'Parch',
                       'Fare',
                       'Embarked'])

## Dealing with Missing Data

We will simply drop every row that contains a missing value.

In [None]:
final_data = my_cols.na.drop()

## Dealing with Categorical Data

In [None]:
from pyspark.ml.feature import VectorAssembler, VectorIndexer, OneHotEncoder, StringIndexer

### STRING INDEXER

`StringIndexer` maps each distinct string in a column to a numeric index. Example:

|A| B| C|
|-|-|-|
|0|1|2|
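
The mapping above can be sketched in plain Python. This is a conceptual illustration, not Spark's implementation: `fit_string_index` and the sample values are made up for this example. It assumes the default ordering of `StringIndexer`, where the most frequent label gets index 0 and ties are broken alphabetically.

```python
from collections import Counter

def fit_string_index(values):
    """Map each distinct string to an index, most frequent first (ties alphabetical)."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

# Made-up sample column, for illustration only
sex_column = ["male", "female", "female", "male", "male"]

sex_index = fit_string_index(sex_column)
indexed = [sex_index[value] for value in sex_column]

print(sex_index)
print(indexed)
```

Because the indices depend on label frequency, the same string can receive a different index if the indexer is fitted on different data.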



### ONEHOT ENCODER
`OneHotEncoder` turns each index into a binary vector with a single 1 at the position of that category. Example:

KEY: A B C

For A:
[1, 0, 0]
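
A minimal plain-Python sketch of the encoding follows. The helper `one_hot` is hypothetical and only illustrates the idea. One detail worth knowing: Spark's `OneHotEncoder` has a `dropLast` parameter that defaults to `True`, so the last category is represented by an all-zero vector and the output is one element shorter than the full vector shown above.

```python
def one_hot(index, num_categories, drop_last=True):
    """Return a dense one-hot list for a category index."""
    size = num_categories - 1 if drop_last else num_categories
    vector = [0.0] * size
    if index < size:
        # With drop_last=True the last category stays all zeros
        vector[int(index)] = 1.0
    return vector

full_a = one_hot(0, 3, drop_last=False)  # full-length vector for A
short_a = one_hot(0, 3)                  # A with the last category dropped
short_c = one_hot(2, 3)                  # C, the dropped category

print(full_a, short_a, short_c)
```

Spark stores these vectors in sparse form, so the displayed values look different from the dense lists used here.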

In [None]:
gender_indexer = StringIndexer(inputCol='Sex', outputCol='SexIndex')
gender_encoder = OneHotEncoder(inputCol='SexIndex', outputCol='SexVec')

embark_indexer = StringIndexer(inputCol='Embarked', outputCol='EmbarkIndex')
embark_encoder = OneHotEncoder(inputCol='EmbarkIndex', outputCol='EmbarkVec')

In [None]:
assembler = VectorAssembler(inputCols=['Pclass', 'SexVec', 'EmbarkVec', 'Age', 'SibSp', 'Parch', 'Fare'], 
                            outputCol='features')

# Logistic Regression Model with Pipeline

We have defined the indexers, encoders and assembler, but none of them has been applied to the data yet. A `Pipeline` chains them together with the classification model, so a single `fit` call runs every stage in order.
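
To make the chaining idea concrete, here is a toy sketch in plain Python. `IndexStage` and `ToyPipeline` are hypothetical classes written for this illustration and are not part of PySpark. The sketch assumes the general pattern that each stage is fitted on the output of the previous one, and that the fitted stages are later replayed on new data.

```python
class IndexStage:
    """Toy estimator: learns a string-to-index mapping for one column.

    Uses alphabetical order for simplicity, unlike Spark's frequency-based default.
    """

    def __init__(self, input_col, output_col):
        self.input_col = input_col
        self.output_col = output_col
        self.mapping = {}

    def fit(self, rows):
        labels = sorted({row[self.input_col] for row in rows})
        self.mapping = {label: i for i, label in enumerate(labels)}
        return self

    def transform(self, rows):
        return [{**row, self.output_col: self.mapping[row[self.input_col]]}
                for row in rows]


class ToyPipeline:
    """Fits stages in order, feeding each stage the previous stage's output."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        for stage in self.stages:
            rows = stage.fit(rows).transform(rows)
        return self

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


# Made-up rows, for illustration only
train_rows = [{"Sex": "male", "Embarked": "S"}, {"Sex": "female", "Embarked": "C"}]
test_rows = [{"Sex": "female", "Embarked": "S"}]

toy = ToyPipeline([IndexStage("Sex", "SexIndex"),
                   IndexStage("Embarked", "EmbarkIndex")]).fit(train_rows)
transformed = toy.transform(test_rows)
print(transformed)
```

The mappings are learned from the training rows only and then reused on the test rows, which is the same separation the real `Pipeline` provides.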

In [None]:
from pyspark.ml.classification import LogisticRegression

In [None]:
from pyspark.ml import Pipeline

In [None]:
log_reg_titanic = LogisticRegression(featuresCol='features', labelCol='Survived')

In [None]:
pipeline = Pipeline(stages=[
                            gender_indexer,
                            gender_encoder,
                            embark_indexer,
                            embark_encoder,
                            assembler,
                            log_reg_titanic
])

## Train-Test Split

In [None]:
train_data, test_data = final_data.randomSplit([0.7,0.3])
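
The split above is random, so the two parts only approximate a 70/30 ratio and change from run to run. `randomSplit` accepts an optional `seed` argument that makes the split repeatable. The plain-Python sketch below illustrates the idea of a weighted, seeded split; `random_split` is a hypothetical helper and does not reproduce Spark's exact sampling.

```python
import random

def random_split(rows, weights, seed=None):
    """Assign each row to one split by an independent weighted random draw."""
    rng = random.Random(seed)
    total = float(sum(weights))
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random() * total
        cumulative = 0.0
        for i, weight in enumerate(weights):
            cumulative += weight
            # The last split also catches any floating-point leftovers
            if draw < cumulative or i == len(weights) - 1:
                splits[i].append(row)
                break
    return splits

rows = list(range(100))  # stand-in for DataFrame rows
train_rows, test_rows = random_split(rows, [0.7, 0.3], seed=42)
print(len(train_rows), len(test_rows))
```

Since each row is drawn independently, the split sizes vary around the requested proportions rather than matching them exactly.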

## Fitting Pipeline

In [None]:
fit_model = pipeline.fit(train_data)

## Transform Pipeline

In [None]:
results = fit_model.transform(test_data)

# Evaluation

In [None]:
results.select('Survived','prediction').show()

+--------+----------+
|Survived|prediction|
+--------+----------+
|       0|       1.0|
|       0|       1.0|
|       0|       1.0|
|       0|       1.0|
|       0|       0.0|
|       0|       0.0|
|       0|       0.0|
|       0|       0.0|
|       0|       0.0|
|       0|       0.0|
|       0|       0.0|
|       0|       0.0|
|       0|       0.0|
|       0|       0.0|
|       0|       0.0|
|       0|       0.0|
|       0|       0.0|
|       0|       0.0|
|       0|       1.0|
|       0|       0.0|
+--------+----------+
only showing top 20 rows



In [None]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator

In [None]:
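# BinaryClassificationEvaluator returns the area under the ROC curve by default.
# Note: rawPredictionCol is set to the hard 0/1 'prediction' column here, so the ROC
# curve has a single operating point; the 'rawPrediction' column would give a score-based AUC.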

my_eval = BinaryClassificationEvaluator(rawPredictionCol='prediction', labelCol='Survived')

In [None]:
AUC = my_eval.evaluate(results) # Area Under the Curve

In [None]:
AUC

0.7859375
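
The evaluator above receives the hard 0/1 `prediction` column rather than continuous scores. The plain-Python sketch below shows how that choice can change the AUC. It is an illustration only: `roc_auc` is a hypothetical helper that computes AUC as the probability that a random positive is ranked above a random negative (ties count as one half), and the labels and probabilities are made up.

```python
def roc_auc(labels, scores):
    """AUC via pairwise ranking of positives against negatives."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Made-up labels and predicted probabilities, for illustration only
labels = [0, 0, 0, 1, 1, 1]
probabilities = [0.1, 0.4, 0.6, 0.45, 0.7, 0.9]
hard_predictions = [1.0 if p >= 0.5 else 0.0 for p in probabilities]

auc_from_scores = roc_auc(labels, probabilities)
auc_from_hard = roc_auc(labels, hard_predictions)
print(auc_from_scores, auc_from_hard)
```

With hard predictions the result reduces to the average of the true positive rate and the true negative rate at a single threshold, so it no longer reflects how well the model ranks passengers across all thresholds.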