---
<h1 style="text-align: center;">
CSCI 4521: Applied Machine Learning (Fall 2024)
</h1>

<h1 style="text-align: center;">
Homework 3
</h1>

<h3 style="text-align: center;">
(Due Tue, Oct. 29, 11:59 PM CT)
</h3>

---

### The RMS Titanic was a British ocean liner considered by many as "unsinkable." Unfortunately, the Titanic hit an iceberg and sank on April 15, 1912, on her trip from Southampton, England to New York City, USA. There were not enough lifeboats onboard for everyone and, as a result, an estimated 1500 people died out of the 2224 passengers and crew onboard. The Titanic disaster was one of the deadliest ship sinkings. There was a large element of luck involved in surviving the shipwreck, but some people were more likely to survive than others.

![rms-titanic-14047.png](attachment:d3b8257e-f179-4545-8953-e44343a5f64d.png)


### In this homework, your task is to predict whether a passenger will survive the shipwreck or not. You need to use machine learning and develop classification models to accomplish this task. The only data you have available is passenger data in the dataset `titanic_dataset_csci4521.csv` which consists of the following features:
- ### Passenger ID,
- ### Ticket class (1 = first class, 2 = second class, 3 = third class),
- ### Passenger name,
- ### Sex,
- ### Age,
- ### Number of siblings or spouses aboard,
- ### Number of parents or children aboard,
- ### Ticket number,
- ### Fare,
- ### Cabin number, and
- ### Port of Embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)
### and label:
- ### Survived ($y_i=1$) or
- ### Not survived ($y_i = 0$).
### You must decide if and how to clean and preprocess the data, which classification algorithms to use, which hyperparameters to tune and how, how to measure performance, which models to select, and which final model to use.

### You can use any of the coding packages we've used in class (numpy, pandas, pyspark, scikit-learn, etc.) and you must write and submit working code. Reminder, you cannot use ChatGPT or similar technologies. Please see the syllabus for more details.

### You also need to submit a short report of your work describing all steps you took, explanations of why you took those steps, results, what you learned, how you might use what you learned in the future, and your conclusions. We expect the report to be well-written and clearly describe everything you've done and why.

---

### Write your code here

## Step 1 - The Basics: Setting up imports and starting a pyspark session.

In [28]:
# Install PySpark on Colab
# !pip install pyspark

In [29]:
# PySpark imports
from pyspark.sql import SparkSession
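# Import only the column helpers used below by name; other functions are reached through `F`.
# (Importing `round` by name would shadow Python's built-in round().)
from pyspark.sql.functions import col, when, count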
from pyspark.sql import functions as F

from pyspark.ml.feature import Imputer, StandardScaler, VectorAssembler, StringIndexer, OneHotEncoder
from pyspark.ml.functions import vector_to_array
from pyspark.ml.linalg import Vectors
from pyspark.ml.classification import RandomForestClassifier, LogisticRegression, DecisionTreeClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator, MulticlassClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

# Colab imports
from google.colab import drive
drive.mount('/content/drive')

# sklearn imports
from sklearn.model_selection import train_test_split, LeaveOneOut

# Misc Import
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [30]:
# Build PySpark session
spark = SparkSession.builder.appName("HW3").getOrCreate()

## Step 2 - Loading the Dataset

In [31]:
# Read CSV Data
df = spark.read.csv("/content/drive/MyDrive/colab_data/titanic_dataset_csci4521.csv", header=True, inferSchema=True)

# No. of Samples
print("Number of samples = ", df.count())
# No. of Features available
print("Number of features = ", len(df.columns))

# Print the first 10 rows (show() prints the table itself and returns None)
df.show(10)

Number of samples =  894
Number of features =  12
+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+
|PassengerId|Survived|Pclass|                Name|   Sex| Age|SibSp|Parch|          Ticket|   Fare|Cabin|Embarked|
+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+
|          1|       0|     3|Braund, Mr. Owen ...|  male|22.0|    1|    0|       A/5 21171|   7.25| NULL|       S|
|          2|       1|     1|Cumings, Mrs. Joh...|female|38.0|    1|    0|        PC 17599|71.2833|  C85|       C|
|          3|       1|     3|Heikkinen, Miss. ...|female|26.0|    0|    0|STON/O2. 3101282|  7.925| NULL|       S|
|          4|       1|     1|Futrelle, Mrs. Ja...|female|35.0|    1|    0|          113803|   53.1| C123|       S|
|          5|       0|     3|Allen, Mr. Willia...|  male|35.0|    0|    0|          373450|   8.05| NULL|       S|
|          6|       0|     3| 

## Step 3 - Examining Nulls

In [32]:
# Examining Nulls
null_counts = df.select([
    count(when(col(column).isNull(), column)).alias(column)
    for column in df.columns
])

# Display the count of null values for each column
null_counts.show()

+-----------+--------+------+----+---+---+-----+-----+------+----+-----+--------+
|PassengerId|Survived|Pclass|Name|Sex|Age|SibSp|Parch|Ticket|Fare|Cabin|Embarked|
+-----------+--------+------+----+---+---+-----+-----+------+----+-----+--------+
|          0|       0|    11|   4|  5|178|    1|    0|     0|   0|  690|      17|
+-----------+--------+------+----+---+---+-----+-----+------+----+-----+--------+



## Step 4 - Cleaning and Splitting Data

In [33]:
# Drop cabin - the data here is mostly null
df = df.drop("Cabin")

# Impute nulls in Age by randomly sampling from the observed ages within the same Pclass
# (collect_list skips nulls, so each list holds only observed ages)
ages_per_class = df.groupBy("Pclass").agg(F.collect_list("Age").alias("class_age"))

df = df.join(ages_per_class, on="Pclass", how="left")

random_sample_code_string = "class_age[floor(rand()*size(class_age))]"

df = df.withColumn(
    "Age",
    F.when(
        df["Age"].isNull(), F.expr(random_sample_code_string)
    ).otherwise(df["Age"])
)

# Drop the added helper col
df = df.drop("class_age")

# Sanity Check
print("Samples Before: ", df.count())
# Drop all samples with any columns as null
df = df.dropna(how="any")
print("Samples After: ", df.count())

Samples Before:  894
Samples After:  859


In [34]:
# Separate Labels
label = df.select("Survived")

# Drop irrelevant/unnecessary features (reasoning in the report)
features_to_remove = ["PassengerId", "Name", "Ticket"]
df = df.drop(*features_to_remove)

# Show new df
df.show(20)

+------+--------+------+----+-----+-----+-------+--------+
|Pclass|Survived|   Sex| Age|SibSp|Parch|   Fare|Embarked|
+------+--------+------+----+-----+-----+-------+--------+
|     3|       0|  male|22.0|    1|    0|   7.25|       S|
|     1|       1|female|38.0|    1|    0|71.2833|       C|
|     3|       1|female|26.0|    0|    0|  7.925|       S|
|     1|       1|female|35.0|    1|    0|   53.1|       S|
|     3|       0|  male|35.0|    0|    0|   8.05|       S|
|     3|       0|  male|11.0|    0|    0| 8.4583|       Q|
|     1|       0|  male|54.0|    0|    0|51.8625|       S|
|     3|       0|  male| 2.0|    3|    1| 21.075|       S|
|     3|       1|female|27.0|    0|    2|11.1333|       S|
|     2|       1|female|14.0|    1|    0|30.0708|       C|
|     3|       1|female| 4.0|    1|    1|   16.7|       S|
|     1|       1|female|58.0|    0|    0|  26.55|       S|
|     3|       0|  male|39.0|    1|    5| 31.275|       S|
|     3|       0|female|14.0|    0|    0| 7.8542|       

## Step 5 - Using different classification methods

In [35]:
# Setting up future operations by encoding, scaling and vectorizing
indexer_sex = StringIndexer(inputCol="Sex", outputCol="SexIndex")
indexer_embarked = StringIndexer(inputCol="Embarked", outputCol="EmbarkedIndex")
encoder = OneHotEncoder(inputCols=["SexIndex", "EmbarkedIndex"],
                        outputCols=["SexVec", "EmbarkedVec"])

# Assemble into vector columns
assembler = VectorAssembler(
    inputCols=["Pclass", "Age", "SibSp", "Parch", "Fare", "SexVec", "EmbarkedVec"],
    outputCol="vector_features"
)

# Scaling
scaler = StandardScaler(inputCol="vector_features", outputCol="features")

# Transform df
df = indexer_sex.fit(df).transform(df)
df = indexer_embarked.fit(df).transform(df)
df = encoder.fit(df).transform(df)
df = assembler.transform(df)
df = scaler.fit(df).transform(df)

# Final DataFrame with features and label columns only
final_df = df.select("features", "Survived")
final_df = final_df.withColumnRenamed("Survived", "label")

final_df.show(20)


+--------------------+-----+
|            features|label|
+--------------------+-----+
|[3.58375316685557...|    0|
|[1.19458438895185...|    1|
|(8,[0,1,4,6],[3.5...|    1|
|[1.19458438895185...|    1|
|[3.58375316685557...|    0|
|(8,[0,1,4,5],[3.5...|    0|
|[1.19458438895185...|    0|
|[3.58375316685557...|    0|
|[3.58375316685557...|    1|
|[2.38916877790371...|    1|
|[3.58375316685557...|    1|
|(8,[0,1,4,6],[1.1...|    1|
|[3.58375316685557...|    0|
|(8,[0,1,4,6],[3.5...|    0|
|(8,[0,1,4,6],[2.3...|    1|
|[3.58375316685557...|    0|
|[2.38916877790371...|    1|
|[3.58375316685557...|    0|
|(8,[0,1,4,7],[3.5...|    1|
|[2.38916877790371...|    0|
+--------------------+-----+
only showing top 20 rows



#### Split data into train, validation, and test sets

In [36]:
train, valid, test = final_df.randomSplit([0.8, 0.1, 0.1], seed = 5782267)

### 5.1 Logistic Regression

In [37]:
# Build the Logistic regression model with hyperparameters
lr_model = LogisticRegression()
param_grid = ParamGridBuilder().addGrid(lr_model.regParam, [0.01, 0.1, 1.0]).build()

# Test
cv = CrossValidator(estimator=lr_model, estimatorParamMaps=param_grid,
                    evaluator=BinaryClassificationEvaluator(labelCol = "label"), numFolds=5)
cv_model = cv.fit(train)

best_lr_model = cv_model.bestModel

print("Best Logistic Regression Model Params: ", best_lr_model.extractParamMap())

Best Logistic Regression Model Params:  {Param(parent='LogisticRegression_2bd4be8be406', name='aggregationDepth', doc='suggested depth for treeAggregate (>= 2).'): 2, Param(parent='LogisticRegression_2bd4be8be406', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.0, Param(parent='LogisticRegression_2bd4be8be406', name='family', doc='The name of family which is a description of the label distribution to be used in the model. Supported options: auto, binomial, multinomial'): 'auto', Param(parent='LogisticRegression_2bd4be8be406', name='featuresCol', doc='features column name.'): 'features', Param(parent='LogisticRegression_2bd4be8be406', name='fitIntercept', doc='whether to fit an intercept term.'): True, Param(parent='LogisticRegression_2bd4be8be406', name='labelCol', doc='label column name.'): 'label', Param(parent='LogisticRegression_2bd4be8be406', name='maxBlockSizeInMB

### 5.2 Decision Trees

In [38]:
# Build DT model with hyperparams
dt = DecisionTreeClassifier(featuresCol="features", labelCol="label")
param_grid_dt = ParamGridBuilder().addGrid(dt.maxDepth, [5, 10, 15]).addGrid(dt.maxBins, [20, 40, 60]).build()

# Test
cv_dt = CrossValidator(estimator=dt, estimatorParamMaps=param_grid_dt,
                       evaluator=BinaryClassificationEvaluator(labelCol="label"), numFolds=5)
cvModel_dt = cv_dt.fit(train)

best_dt_model = cvModel_dt.bestModel

print("Best Decision Tree Model Params: ",  best_dt_model.extractParamMap())

Best Decision Tree Model Params:  {Param(parent='DecisionTreeClassifier_6edf7939ee9c', name='cacheNodeIds', doc='If false, the algorithm will pass trees to executors to match instances with nodes. If true, the algorithm will cache node IDs for each instance. Caching can speed up training of deeper trees. Users can set how often should the cache be checkpointed or disable it by setting checkpointInterval.'): False, Param(parent='DecisionTreeClassifier_6edf7939ee9c', name='checkpointInterval', doc='set checkpoint interval (>= 1) or disable checkpoint (-1). E.g. 10 means that the cache will get checkpointed every 10 iterations. Note: this setting will be ignored if the checkpoint directory is not set in the SparkContext.'): 10, Param(parent='DecisionTreeClassifier_6edf7939ee9c', name='featuresCol', doc='features column name.'): 'features', Param(parent='DecisionTreeClassifier_6edf7939ee9c', name='impurity', doc='Criterion used for information gain calculation (case-insensitive). Supported

### 5.3 Random Forest

In [39]:
# Build Random Forest Model with Hyperparams
rf = RandomForestClassifier(featuresCol="features", labelCol="label")
paramGrid_rf = ParamGridBuilder().addGrid(rf.numTrees, [40, 100, 160]).addGrid(rf.maxDepth, [4, 10, 16]).build()

# Test
cv_rf = CrossValidator(estimator=rf, estimatorParamMaps=paramGrid_rf,
                       evaluator=BinaryClassificationEvaluator(labelCol="label"), numFolds=5)

cvModel_rf = cv_rf.fit(train)

best_rf_model = cvModel_rf.bestModel

print("Best Random Forest Model Params: ", best_rf_model.extractParamMap())

Best Random Forest Model Params:  {Param(parent='RandomForestClassifier_c40f4f99965f', name='bootstrap', doc='Whether bootstrap samples are used when building trees.'): True, Param(parent='RandomForestClassifier_c40f4f99965f', name='cacheNodeIds', doc='If false, the algorithm will pass trees to executors to match instances with nodes. If true, the algorithm will cache node IDs for each instance. Caching can speed up training of deeper trees. Users can set how often should the cache be checkpointed or disable it by setting checkpointInterval.'): False, Param(parent='RandomForestClassifier_c40f4f99965f', name='checkpointInterval', doc='set checkpoint interval (>= 1) or disable checkpoint (-1). E.g. 10 means that the cache will get checkpointed every 10 iterations. Note: this setting will be ignored if the checkpoint directory is not set in the SparkContext.'): 10, Param(parent='RandomForestClassifier_c40f4f99965f', name='featureSubsetStrategy', doc="The number of features to consider for

## Step 6 - Comparing Models

In [40]:
# Helper Function
def model_performance(model, data, labelCol="label"):
    """Score `model` on `data`; return accuracy, F1, weighted precision and weighted recall."""
    # Metrics were chosen based on this article : https://medium.com/analytics-vidhya/evaluation-metrics-for-classification-models-e2f0d8009d69
    # Use the DataFrame passed in (previously the global `valid` was used and `data` was ignored)
    predictions = model.transform(data)

    evaluate_acc = MulticlassClassificationEvaluator(labelCol=labelCol, metricName="accuracy")
    evaluate_f1 = MulticlassClassificationEvaluator(labelCol=labelCol, metricName="f1")
    evaluate_precision = MulticlassClassificationEvaluator(labelCol=labelCol, metricName="weightedPrecision")
    evaluate_recall = MulticlassClassificationEvaluator(labelCol=labelCol, metricName="weightedRecall")

    accuracy = evaluate_acc.evaluate(predictions)
    f1 = evaluate_f1.evaluate(predictions)
    precision = evaluate_precision.evaluate(predictions)
    recall = evaluate_recall.evaluate(predictions)

    results = {
        "Accuracy": accuracy,
        "F1": f1,
        "Precision": precision,
        "Recall": recall
    }

    return results

# Evaluate each model on the validation split
results_lr = model_performance(best_lr_model, valid)
results_dt = model_performance(best_dt_model, valid)
results_rf = model_performance(best_rf_model, valid)

## Step 7 - Summarize and Conclude

In [41]:
# create a table to better see results
results_df = pd.DataFrame([results_lr, results_dt, results_rf])
results_df.index = ["Logistic Regression", "Decision Tree", "Random Forest"]
results_df

| Model | Accuracy | F1 | Precision | Recall |
|---|---|---|---|---|
| Logistic Regression | 0.764706 | 0.756624 | 0.771992 | 0.764706 |
| Decision Tree | 0.741176 | 0.734767 | 0.742415 | 0.741176 |
| Random Forest | 0.788235 | 0.775913 | 0.815092 | 0.788235 |


## Step 8 - Apply best performing model to test data

In [46]:
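# Refit on train + validation before the final test evaluation.
# Note: `dt` is the DecisionTreeClassifier estimator with its default hyperparameters,
# not the cross-validated best_dt_model from Step 5.2.
combined_data = train.union(valid)
best_model = dt.fit(combined_data)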

# Prediction on test data
test_predictions = best_model.transform(test)
evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction")

# Calculate evaluation metrics on the test set
accuracy = evaluator.evaluate(test_predictions, {evaluator.metricName: "accuracy"})
f1_score = evaluator.evaluate(test_predictions, {evaluator.metricName: "f1"})
weighted_precision = evaluator.evaluate(test_predictions, {evaluator.metricName: "weightedPrecision"})
weighted_recall = evaluator.evaluate(test_predictions, {evaluator.metricName: "weightedRecall"})

print(f"Test Accuracy: {accuracy}")
print(f"Test F1 Score: {f1_score}")
print(f"Test Weighted Precision: {weighted_precision}")
print(f"Test Weighted Recall: {weighted_recall}")

Test Accuracy: 0.8222222222222222
Test F1 Score: 0.8199539252170831
Test Weighted Precision: 0.8211530283700867
Test Weighted Recall: 0.8222222222222222


---

### Write your report here

The Report

1.   Step 1 - The Basics

*   Reasons:
*   I chose to use PySpark because it provides multiple useful packages and functions for the various steps of the task.
*   It is robust and works well for the moderately sized dataset that we have.
*   Allows good interoperability with other packages if needed later.

2. Step 2 - Examining data

*  Reasons:
*  Loaded the data to see what the feature types were, which data was largely missing, etc.

3. Step 3 - Examining Nulls

*  I print the number of missing values per column and observe that Age (178) and especially Cabin (690) have many missing values.
*  Here is my strategy for dealing with missing values. Since Age is numerical and missing in many rows, I impute it instead of dropping those rows (by random sampling within each ticket class, as described in Step 4). The remaining features have only a few missing values, so removing those rows shouldn't affect the learning too much, and I drop them.

4. Step 4 - Cleaning the data



*   Noticed that the Cabin column had too many nulls. Decided to drop the feature, as filling in its values from such a small subset of the data would likely yield biased results, and the feature is not likely to have much impact on our overall predictions because we can ascertain similar information from Pclass.
*   Decided to impute the values for Age using random sampling within a Pclass, to provide some semblance of realism to the data without introducing too much bias.
*   Removed the feature PassengerId because it is linearly increasing and not an intrinsic property of the passenger, so it could have a negative effect on the predictions.
*   Removed Name and Ticket for reasons similar to PassengerId.
*   Stored a copy of the label column (Survived) separately. The label also stays in the DataFrame, where it is renamed to `label` in Step 5.
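
The imputation idea above (fill a missing age with a randomly chosen observed age from the same ticket class) can be sketched in plain Python. This is only an illustration under stated assumptions, not the PySpark code from Step 4: the function name `impute_age_within_class`, the `seed` parameter, and the toy `passengers` list are all hypothetical.

```python
import random
from collections import defaultdict


def impute_age_within_class(rows, seed=0):
    """Fill missing ages by sampling an observed age from the same ticket class."""
    rng = random.Random(seed)

    # Pool the observed (non-missing) ages for each ticket class.
    ages_by_class = defaultdict(list)
    for row in rows:
        if row["Age"] is not None and row["Pclass"] is not None:
            ages_by_class[row["Pclass"]].append(row["Age"])

    imputed = []
    for row in rows:
        new_row = dict(row)  # copy, so the input rows are left untouched
        pool = ages_by_class.get(row["Pclass"], [])
        # Only fill when the age is missing and the class has observed ages.
        if new_row["Age"] is None and pool:
            new_row["Age"] = rng.choice(pool)
        imputed.append(new_row)
    return imputed


# Toy data (hypothetical, not taken from the dataset)
passengers = [
    {"Pclass": 1, "Age": 38.0},
    {"Pclass": 1, "Age": None},
    {"Pclass": 3, "Age": 22.0},
    {"Pclass": 3, "Age": 26.0},
    {"Pclass": 3, "Age": None},
]
filled = impute_age_within_class(passengers)
```

Because each filled value is drawn from the observed ages of the same class, the imputed column keeps roughly the same per-class age distribution, which a single median fill would not.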



5. Step 5: Using the data
*   First I transformed the data into a useful format. This meant scaling the features so that they have comparable variance (StandardScaler scales each feature to unit standard deviation) and encoding categorical features such as Sex and Embarked.
* Then I vectorised the data so that it could be passed to the PySpark functions, since they require vectors as input.
* Finally I split the data into train, validation, and test sets. This allows us to train and test the models while also keeping some data aside to tune models and parameters.

* I decided to use logistic regression, decision trees, and random forests. Since the task is classification, and only binary classification at that, these seemed like the most useful models.
* Logistic regression seemed like a good choice because the features were scaled, low-dimensional, and linearly independent, and the task at hand is binary classification.
* Decision Tree and Random Forest are also reasonable models, as they can capture deeper interactions among features at the cost of longer computation.

* Hyperparameters: I chose the grid values somewhat arbitrarily, guided by commonly used values and the lecture slides.
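
To make the cost of the hyperparameter search concrete, the grid can be sketched as a Cartesian product of the candidate values. This is a plain-Python illustration, not Spark's `ParamGridBuilder`; the helper `build_grid` is hypothetical, while the candidate values and the 5 folds are the ones used for the random forest in Step 5.3.

```python
from itertools import product


def build_grid(**param_values):
    """Return every combination of the candidate values as a list of dicts."""
    names = list(param_values)
    return [dict(zip(names, combo)) for combo in product(*param_values.values())]


# Candidate values from the random forest grid in Step 5.3
rf_grid = build_grid(numTrees=[40, 100, 160], maxDepth=[4, 10, 16])

num_folds = 5
# Each combination is trained once per fold during cross-validation.
fits = len(rf_grid) * num_folds

print(len(rf_grid), fits)  # 9 45
```

As far as I understand Spark's `CrossValidator`, it also refits the best combination once on the full training set, so the random forest search trains 46 forests in total. That is worth keeping in mind before adding more values to a grid.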

6. Step 6: Comparing the models

* I wrote a simple helper function to compute performance metrics for the models. The metrics chosen were from the article linked in the code block.
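
For reference, here is a minimal plain-Python sketch of how support-weighted precision, recall, and F1 can be computed from labels and predictions. It assumes the weighted metrics are per-class scores averaged with weights equal to each class's share of the true labels, which is my understanding of what `weightedPrecision`, `weightedRecall`, and `f1` mean in the evaluator; the function `weighted_metrics` and the toy labels are hypothetical.

```python
from collections import Counter


def weighted_metrics(labels, predictions):
    """Accuracy plus precision/recall/F1 averaged over classes, weighted by class support."""
    n = len(labels)
    classes = sorted(set(labels) | set(predictions))
    support = Counter(labels)          # true count per class
    predicted = Counter(predictions)   # predicted count per class
    true_pos = Counter(l for l, p in zip(labels, predictions) if l == p)

    precision = recall = f1 = 0.0
    for c in classes:
        p_c = true_pos[c] / predicted[c] if predicted[c] else 0.0
        r_c = true_pos[c] / support[c] if support[c] else 0.0
        f_c = 2 * p_c * r_c / (p_c + r_c) if (p_c + r_c) else 0.0
        weight = support[c] / n
        precision += weight * p_c
        recall += weight * r_c
        f1 += weight * f_c

    accuracy = sum(true_pos.values()) / n
    return {"Accuracy": accuracy, "F1": f1, "Precision": precision, "Recall": recall}


# Toy example (hypothetical labels, not the validation split)
labels = [1, 1, 1, 0, 0, 0, 0, 0]
predictions = [1, 0, 0, 0, 0, 0, 0, 0]
metrics = weighted_metrics(labels, predictions)
```

One consequence of the weighting: each class's recall is multiplied by its support, so the weighted recall always reduces to correct predictions divided by total samples, which is the accuracy. This is why the Accuracy and Recall columns in the Step 7 table are identical.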

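7. Steps 7 and 8: Summarizing and testing

* Finally I compared the models' performance. On the validation set, Random Forest scored highest (accuracy 0.788), followed by Logistic Regression (0.765) and Decision Tree (0.741). The model I evaluated on the test data in Step 8 was a Decision Tree refit on the combined train and validation data, which reached roughly 82% test accuracy.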
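
The comparison step can be sketched as ranking the candidate models by a validation metric. The numbers below are copied from the results table in Step 7; the helper `rank_models` is hypothetical and only illustrates one simple selection rule.

```python
# Validation metrics copied from the Step 7 results table
validation_results = {
    "Logistic Regression": {"Accuracy": 0.764706, "F1": 0.756624},
    "Decision Tree": {"Accuracy": 0.741176, "F1": 0.734767},
    "Random Forest": {"Accuracy": 0.788235, "F1": 0.775913},
}


def rank_models(results, metric="Accuracy"):
    """Return model names sorted from best to worst on the chosen metric."""
    return sorted(results, key=lambda name: results[name][metric], reverse=True)


ranking = rank_models(validation_results)
print(ranking)  # ['Random Forest', 'Logistic Regression', 'Decision Tree']
```

Ranking by F1 instead of accuracy gives the same order for these numbers. With a validation set this small, the gaps between models are only a few samples wide, so the ranking should be read as a rough guide.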

What I Learned

* Data cleaning and imputing to produce viable data
* Using pyspark for classification tasks
* How to evaluate and tune models.

Future Use

* Classification is a very useful technique.
* I could extend this given a data set with more features to be more accurate.
* Use these skills for open source projects on Kaggle
* Tackle small classification problems on open data sets in medical and financial settings (my own interest)

Results

* I used Python code to clean the data, create useful forms of it, and fit machine learning models to it. Random Forest scored best on the validation set, and the Decision Tree model evaluated on the test set predicts the survival of a passenger with roughly 82% accuracy.

---