<h1 style="text-align:center"> Drexel University </h1>
<h2 style = "text-align:center"> College of Computing and Informatics</h2>
<h2 style = "text-align:center">INFO 323: Cloud Computing and Big Data</h2>
<h3 style = "text-align:center">Assignment 4: Spark ML</h3>
<div style="text-align:center; border-style:solid; padding: 10px">
<div style="font-weight:bold">Due Date: Sunday, June 11, 2023</div>
This assignment counts for 10% of the final grade
</div>

### A. Assignment Overview
This assignment provides the opportunity for you to practice with Spark data analytics. 

### B. What to Hand In
	
Submit this completed Jupyter notebook.

### C. How to Hand In

Submit your Jupyter notebook file through the course website in the Blackboard Learn system.

### D. When to Hand In

1. Submit your assignment no later than 11:59pm on the due date.
2. There will be a 10% (absolute value) deduction for each day of lateness, to a maximum of 3 days; assignments will not be accepted beyond that point. Missing work will earn a zero grade.

### Note: All programming must be done on the Spark platform

## Import libraries

In [None]:
from pyspark.ml.classification import RandomForestClassifier, RandomForestClassificationModel
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import Binarizer, VectorAssembler
from pyspark.ml import Pipeline
from pyspark.ml.evaluation import BinaryClassificationEvaluator, MulticlassClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder, CrossValidatorModel
from pyspark.sql.functions import col
from pyspark.sql.types import DoubleType

## Data Ingest:
### Go to the Storage section of the GCP web console and create a new bucket
### Open CloudShell and git clone this repo: `git clone https://github.com/GoogleCloudPlatform/data-science-on-gcp`
### Then, run:
- `cd data-science-on-gcp/02_ingest`
- `./ingest_from_crsbucket.sh bucketname`
- `./bqload.sh (csv-bucket-name) YEAR`
- `cd ../03_sqlstudio`
- `./create_views.sh`
- `cd ../04_streaming`
- `./ingest_from_crsbucket.sh`

After the above steps, 26 JSON files should appear in the folder "flights/tzcorr/" in the bucket.

# Problem Definition:
In this assignment, you are asked to build, tune, and evaluate a RandomForest classifier for predicting arrival delay of flights. You will use the tzcorr data sets.

There are 26-30 data files. When you build the model, start with one data set. When your code works on the single data set, then apply the model to more data sets.

You will build the model by experimenting with different sets of features and tuning the hyperparameters of the RandomForest classifier.

## Path to dataset:
1. If you use Databricks, the data files are located in an AWS S3 storage bucket. List the bucket contents to get their paths:
```
dbutils.fs.ls("s3://info323-ya45-spring2023/tzcorr/")
```
2. If you use GCP, the data files can be accessed from your GCS bucket:
```
BUCKET = 'your bucket name'
inputs = 'gs://{}/flights/tzcorr/all_flights-00000-*'.format(BUCKET) # a file
#inputs = 'gs://{}/flights/tzcorr/all_flights-*'.format(BUCKET)  # all files
```

## Question 1: Read dataset: Choose a data file and read the content as a Spark DataFrame named 'flights'; create a relational view for Spark SQL; and print out the schema of the data. How many records are in the data set?

In [None]:
dbutils.fs.ls("s3://info323-ya45-spring2023/tzcorr/")

In [None]:
json_file_path = "s3://info323-ya45-spring2023/tzcorr/flights_tzcorr_all_flights-00015-of-00026.json"
flights = spark.read.json(json_file_path)
flights.createOrReplaceTempView("flights_view")
flights.printSchema()

In [None]:
num_records = flights.count()
num_records

## Question 2: Choose at least 4 columns as features including `DEP_DELAY`, `TAXI_OUT`, and `DISTANCE`; define the feature columns as `featureColumns`.

In [None]:
featureColumns = ['DEP_DELAY', 'TAXI_IN', 'TAXI_OUT', 'DISTANCE']

## Question 3: Check missing values. Run `flights.describe().toPandas()`. Remove the rows with missing values in the feature columns and 'ARR_DELAY'.

In [None]:
# Check missing values
missing_values = flights.describe().toPandas()
print(missing_values)  # the "count" row shows the number of non-null values per column

# Remove rows with missing values in the feature columns and 'ARR_DELAY'
flights_cleaned = flights.na.drop(subset=featureColumns + ['ARR_DELAY'])

flights_cleaned.show()

## Question 4: If any one of the selected feature columns is not in numeric data type, convert it to appropriate numeric data type.

In [None]:
# Check the data types of the selected feature columns
feature_data_types = flights_cleaned.select(featureColumns).dtypes

# Convert non-numeric columns to appropriate numeric data type
for column, data_type in feature_data_types:
    if data_type != 'double':
        flights_cleaned = flights_cleaned.withColumn(column, flights_cleaned[column].cast('double'))

## Question 5: Create the label column. Use "Binarizer" function to create a target categorical variable from the "ARR_DELAY" column. Name this target variable as "label". Specifically, if arrival delay is less than 15 minutes, then we want the categorical value to be 0, otherwise the categorical value should be 1. Use "14.99999" for the threshold argument. Use "ARR_DELAY" for the inputCol argument. Use "label" for the outputCol argument.

In [None]:
# Define the threshold value
threshold = 14.99999

# Create the Binarizer transformer
binarizer = Binarizer(threshold=threshold, inputCol="ARR_DELAY", outputCol="label")
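
The threshold is 14.99999 rather than 15 because Spark's `Binarizer` maps values strictly greater than the threshold to 1.0 and all other values to 0.0. The following is a minimal pure-Python sketch of that rule, not Spark code; the helper `binarize` and the sample delays are hypothetical and only illustrate the behavior.

```python
def binarize(value, threshold=14.99999):
    """Mimic the Binarizer rule: 1.0 if value > threshold, else 0.0."""
    return 1.0 if value > threshold else 0.0

# Hypothetical arrival delays in minutes (not taken from the flights data)
sample_delays = [-5.0, 14.0, 15.0, 42.0]
sample_labels = [binarize(d) for d in sample_delays]
print(sample_labels)  # [0.0, 0.0, 1.0, 1.0]

# With a threshold of exactly 15, a 15-minute delay would be labeled 0.0
print(binarize(15.0, threshold=15.0))  # 0.0
```

With a threshold just below 15, a delay of exactly 15 minutes gets label 1, which matches the "15 minutes or more" definition in the question.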

## Question 6: Apply the above "binarizer" transformation to the input DataFrame to create a new DataFrame named "binarizedDF".

In [None]:
# Apply the binarizer transformation to create a new DataFrame
binarizedDF = binarizer.transform(flights_cleaned)

## Question 7: Look at the values of the first four rows in the new DataFrame `binarizedDF`.

In [None]:
binarizedDF.show(4)

## Question 8: Use "VectorAssembler" to aggregate the features which will be used to make predictions into a single column. Use "featureColumns" defined earlier for the inputCols argument. Use "features" for the outputCol argument.

In [None]:
assembler = VectorAssembler(inputCols=featureColumns, outputCol='features')

## Question 9: Apply the above assembler to the DataFrame "binarizedDF" to create a new DataFrame called "assembled" with the aggregated features in a column.

In [None]:
assembled = assembler.transform(binarizedDF)

## Question 10: Split the DataFrame "assembled" into "train" and "test" by calling randomSplit(). Specify the ratios of the two sets as 80% and 20%. Set the seed number to be 12345.

In [None]:
trainRatio = 0.8
testRatio = 0.2
seed = 12345

train, test = assembled.randomSplit([trainRatio, testRatio], seed)

## Question 11: Print the number of rows in the train and test DataFrames to check the sizes.

In [None]:
train_count = train.count()
test_count = test.count()
print("Train set count:", train_count)
print("Test set count:", test_count)
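
The counts will usually be close to, but not exactly, 80% and 20% of the rows, because `randomSplit()` assigns rows by random sampling rather than by cutting the data at a fixed position. The following pure-Python sketch illustrates the idea under a simplifying assumption (one independent random draw per row); the helper `random_split` and the toy rows are hypothetical and this is not Spark's actual implementation.

```python
import random

def random_split(rows, weights, seed):
    """Assign each row to one split by an independent random draw."""
    total = sum(weights)
    # Cumulative, normalized boundaries, e.g. [0.8, 1.0]
    bounds = []
    acc = 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)
    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for i, bound in enumerate(bounds):
            # The last split also catches any floating-point rounding gap
            if draw < bound or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits

toy_rows = list(range(10_000))  # hypothetical stand-in for DataFrame rows
toy_train, toy_test = random_split(toy_rows, [0.8, 0.2], seed=12345)
print(len(toy_train), len(toy_test))  # close to 8000 and 2000, but not exact
```

Fixing the seed makes the toy split repeatable across runs, which is the reason the question asks for a specific seed value.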

## Question 12: Create and train a Random Forest classifier using default parameter values.

In [None]:
# Create a Random Forest classifier
rf = RandomForestClassifier()

# Train the Random Forest classifier on the training data
rfmodel = rf.fit(train)

## Question 13: Save the model for later use.

In [None]:
path = 'dbfs:/FileStore/tables/ly364/models/'
model_path = path + "rfmodel"
rfmodel.write().overwrite().save(model_path)

## Question 14: Load the saved model.

In [None]:
rfmodel = RandomForestClassificationModel.load(model_path)


In [None]:
rfmodel = RandomForestClassificationModel.read().load(model_path)

## Question 15: Make prediction using the test data set. Store the results in "predictions".

In [None]:
predictions = rfmodel.transform(test)

## Question 16: Look at the values of "prediction" and "label" of the first ten rows in the predictions to see whether the prediction matches the input.

In [None]:
predictions.select("prediction", "label").head(10)

## Question 17: Evaluate the model on the test data. Save the result as `evalSummary`.

In [None]:
# Evaluate the model on the test data
evalSummary = rfmodel.evaluate(test)

## Question 18: Print out the accuracy.

In [None]:
evalSummary.accuracy

## Question 19: Print out precisions by labels.

In [None]:
evalSummary.precisionByLabel

## Question 20: Print out recalls by labels.

In [None]:
evalSummary.recallByLabel
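
As a reminder of what these metrics mean, the sketch below computes accuracy and per-label precision and recall by hand from (prediction, label) pairs. It is plain Python, not Spark code; the helper `metrics_by_label` and the toy pairs are hypothetical and are not results from the flights data.

```python
def metrics_by_label(pairs, label):
    """Return (precision, recall) for one label from (prediction, label) pairs."""
    tp = sum(1 for p, y in pairs if p == label and y == label)  # true positives
    fp = sum(1 for p, y in pairs if p == label and y != label)  # false positives
    fn = sum(1 for p, y in pairs if p != label and y == label)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical (prediction, label) pairs, for illustration only
toy_pairs = [(1, 1), (1, 0), (0, 0), (0, 0), (0, 1), (0, 1), (0, 0), (1, 1)]

toy_accuracy = sum(1 for p, y in toy_pairs if p == y) / len(toy_pairs)
print(toy_accuracy)                    # 0.625
print(metrics_by_label(toy_pairs, 1))  # (0.6666666666666666, 0.5)
print(metrics_by_label(toy_pairs, 0))  # (0.6, 0.75)
```

In this toy example, accuracy hides the fact that only half of the label=1 cases are found, which is why the per-label values are worth inspecting separately.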

## Question 21: Feature selection and hyperparameter tuning. In the following, you will select the best combination of features and tune the hyperparameters of the random forest classifier through grid search. First, split the DataFrame "binarizedDF" into "train" and "test" by calling randomSplit(). Specify the ratios of the two sets as 80% and 20%. Set the seed number to be 12345.

In [None]:
train, test = binarizedDF.randomSplit([0.8, 0.2], seed=12345)

## Question 22: Define a list of possible feature combinations. Modify the following list "features_sets" by adding the additional feature you selected. Use all possible 2-, 3-, and 4-tuple combinations.

In [None]:
from itertools import combinations
features_sets = [list(c) for r in (2, 3, 4) for c in combinations(featureColumns, r)]  # all 2-, 3-, and 4-column combinations of featureColumns

## Question 23: Create a VectorAssembler with outputCol="features" and store the result to "assembler".

In [None]:
assembler = VectorAssembler(inputCols=featureColumns, outputCol="features")

## Question 24: Create a RandomForestClassifier with all default values and store the result to "rf".

In [None]:
rf = RandomForestClassifier()

## Question 25: Create a Pipeline with stages=[assembler, rf] and store the result to "pipeline".

In [None]:
pipeline = Pipeline(stages=[assembler, rf])

## Question 26: Create a ParamGridBuilder and store the result to "grid". Add the following parameters and their ranges to "grid":
1. assembler.inputCols, features_sets
2. rf.numTrees, [20, 50, 100]
3. rf.maxDepth, [2, 3, 5]
4. rf.minInstancesPerNode, [1, 2, 3]

In [None]:
grid = ParamGridBuilder() \
    .addGrid(assembler.inputCols, features_sets) \
    .addGrid(rf.numTrees, [20, 50, 100]) \
    .addGrid(rf.maxDepth, [2, 3, 5]) \
    .addGrid(rf.minInstancesPerNode, [1, 2, 3]) \
    .build()
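
The grid is the Cartesian product of all the value lists, so its size grows quickly. The sketch below estimates how many models the cross-validation step will train. It is plain Python; `toy_columns` is an assumed feature selection, and the fold count assumes `CrossValidator` is left at its default of 3 folds.

```python
from itertools import combinations, product

# Assumed feature columns; replace with your own selection
toy_columns = ["DEP_DELAY", "TAXI_IN", "TAXI_OUT", "DISTANCE"]
toy_feature_sets = [list(c) for r in (2, 3, 4) for c in combinations(toy_columns, r)]

num_trees = [20, 50, 100]
max_depth = [2, 3, 5]
min_instances = [1, 2, 3]

# One parameter map per element of the Cartesian product
toy_grid = list(product(toy_feature_sets, num_trees, max_depth, min_instances))
num_folds = 3  # assumed: CrossValidator's default numFolds

print(len(toy_feature_sets))      # 11
print(len(toy_grid))              # 297
print(len(toy_grid) * num_folds)  # 891 pipeline fits during cross-validation
```

A list of only four feature sets would give 4 × 27 = 108 parameter maps instead. Either way, this is a good reason to develop on a single data file before scaling up to more files.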

## Question 27: Create a BinaryClassificationEvaluator and store the result to "evaluator".

In [None]:
evaluator = BinaryClassificationEvaluator()

## Question 28: Create CrossValidator with the following options and store the result to "cv".
1. estimator=pipeline
2. estimatorParamMaps=grid
3. evaluator=evaluator
4. parallelism=2

In [None]:
cv = CrossValidator(estimator=pipeline,
                    estimatorParamMaps=grid,
                    evaluator=evaluator,
                    parallelism=2)

## Question 29: Fit the CrossValidator on the train set and store the result to "cvModel".

In [None]:
cvModel = cv.fit(train)

## Question 30: Print out the avgMetrics of the "cvModel".

In [None]:
print(cvModel.avgMetrics)
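
The values in `avgMetrics` are in the same order as the parameter maps in `grid`, so pairing the two lists shows which setting produced which score. The following is a minimal sketch of that pairing in plain Python; the helper `rank_param_maps` is hypothetical, and the parameter maps and metric values are placeholders, not results from this assignment.

```python
def rank_param_maps(param_maps, avg_metrics, larger_is_better=True):
    """Pair each parameter map with its average metric and sort best first."""
    paired = list(zip(param_maps, avg_metrics))
    return sorted(paired, key=lambda pair: pair[1], reverse=larger_is_better)

# Placeholder parameter maps and metric values, for illustration only
toy_maps = [
    {"numTrees": 20, "maxDepth": 2},
    {"numTrees": 50, "maxDepth": 5},
    {"numTrees": 100, "maxDepth": 3},
]
toy_metrics = [0.71, 0.84, 0.79]

toy_ranked = rank_param_maps(toy_maps, toy_metrics)
print(toy_ranked[0])  # ({'numTrees': 50, 'maxDepth': 5}, 0.84)
```

In the notebook, a call along the lines of `rank_param_maps([{p.name: v for p, v in m.items()} for m in grid], cvModel.avgMetrics)` should give readable pairs, since each entry of `grid` maps `Param` objects to values.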

## Question 31: Save and load the cvModel.

In [None]:
path = "dbfs:/FileStore/tables/ly364/models/" 
model_path = path + "rf_cvModel" 
cvModel.write().overwrite().save(model_path) 
cvModel = CrossValidatorModel.read().load(model_path) 

## Question 32: Make predictions on the test data. Store the result to "predictions".

In [None]:
# Make predictions on the test data
predictions = cvModel.transform(test)

## Question 33: Create a MulticlassClassificationEvaluator and store the result to cvEvaluator.

In [None]:
# Create a MulticlassClassificationEvaluator
cvEvaluator = MulticlassClassificationEvaluator()

## Question 34: Print out the precision for label=1, i.e., ARR_DELAY >=15 minutes.

In [None]:
# Set the label column to "label"
cvEvaluator.setLabelCol("label")

# Set the metric name to "precisionByLabel" and set the target class to 1
cvEvaluator.setMetricName("precisionByLabel")
cvEvaluator.setMetricLabel(1)

# Calculate the precision for label=1
precision_label_1 = cvEvaluator.evaluate(predictions)

# Print the precision for label=1
print("Precision for label=1 (ARR_DELAY >=15 minutes):", precision_label_1)


## Question 35: Print out the recall for label=1, i.e., ARR_DELAY >=15 minutes.

In [None]:
# Set the label column to "label"
cvEvaluator.setLabelCol("label")

# Set the metric name to "recallByLabel" and set the target class to 1
cvEvaluator.setMetricName("recallByLabel")
cvEvaluator.setMetricLabel(1)

# Calculate the recall for label=1
recall_label_1 = cvEvaluator.evaluate(predictions)

# Print the recall for label=1
print("Recall for label=1 (ARR_DELAY >=15 minutes):", recall_label_1)


## Question 36: Print out the precision for label=0, i.e., ARR_DELAY < 15 minutes.

In [None]:
# Set the label column to "label"
cvEvaluator.setLabelCol("label")

# Set the metric name to "precisionByLabel" and set the target class to 0
cvEvaluator.setMetricName("precisionByLabel")
cvEvaluator.setMetricLabel(0)

# Calculate the precision for label=0
precision_label_0 = cvEvaluator.evaluate(predictions)

# Print the precision for label=0
print("Precision for label=0 (ARR_DELAY < 15 minutes):", precision_label_0)

## Question 37: Print out the recall for label=0, i.e., ARR_DELAY  < 15 minutes.

In [None]:
# Set the label column to "label"
cvEvaluator.setLabelCol("label")

# Set the metric name to "recallByLabel" and set the target class to 0
cvEvaluator.setMetricName("recallByLabel")
cvEvaluator.setMetricLabel(0)

# Calculate the recall for label=0
recall_label_0 = cvEvaluator.evaluate(predictions)

# Print the recall for label=0
print("Recall for label=0 (ARR_DELAY < 15 minutes):", recall_label_0)

## Question 38: What are the feature combination and hyperparameter values for the best model? Did the best model improve the label=1 precision and recall compared to the model before cross validation tuning? Could you continue to improve the model?

In [None]:
# Get the best model from the CrossValidator
bestModel = cvModel.bestModel

# Retrieve the stages of the best model
stages = bestModel.stages

# Extract the VectorAssembler stage
assembler = stages[0]

# Get the feature combination from the VectorAssembler
featureCombination = assembler.getInputCols()

# Extract the fitted RandomForestClassificationModel stage
rf = stages[1]

# Get the hyperparameter values from the fitted model
numTrees = rf.getNumTrees  # a property on the fitted model, so no parentheses
maxDepth = rf.getMaxDepth()
minInstancesPerNode = rf.getMinInstancesPerNode()

# Print the feature combination and hyperparameter values
print("Feature Combination:", featureCombination)
print("Num Trees:", numTrees)
print("Max Depth:", maxDepth)
print("Min Instances Per Node:", minInstancesPerNode)

In the code above, we first retrieve the best model from the CrossValidator object cvModel and store it in the variable bestModel. We then access the stages of the best model using the stages attribute. The first stage is the VectorAssembler, so we extract it from the stages list and store it in the variable assembler. Next, we retrieve the feature combination used by the VectorAssembler using getInputCols().

Similarly, we extract the fitted random forest stage (a RandomForestClassificationModel) from the stages list and store it in the variable rf. Finally, we read the hyperparameter values from it: getNumTrees is a property on the fitted model, while getMaxDepth() and getMinInstancesPerNode() are methods and must be called with parentheses.

The code snippet prints the feature combination and hyperparameter values using print().

To assess whether further improvement is possible, we can analyze the performance of the best model and consider additional approaches, such as:

1. Feature Engineering: Explore different feature combinations, derive new features, or apply feature transformations to enhance the model's predictive power. Experiment with domain-specific knowledge or feature selection techniques to select the most informative features.

2. Hyperparameter Tuning: The grid above covers only three values per hyperparameter, so there may be further room for improvement. You can search a wider range of values (for example, more trees or deeper trees), or try a different search strategy such as random search or Bayesian optimization.

3. Model Selection: Apart from the RandomForestClassifier, you can try other machine learning algorithms that are suitable for classification tasks, such as Gradient Boosting, Support Vector Machines, or Neural Networks. Compare their performance with the RandomForestClassifier to determine if a different algorithm yields better results.

4. Ensemble Methods: A random forest is already a bagging ensemble of decision trees, so consider other ensemble methods such as stacking, where different kinds of models are combined to make predictions. Combining models with diverse characteristics may improve the overall performance and robustness of the predictive model.

5. Data Augmentation: If the available dataset is limited, you can explore techniques like data augmentation to generate additional training samples. This can involve techniques such as oversampling the minority class (label=1) or applying synthetic data generation methods.
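
As one illustration of the last item, the sketch below shows random oversampling of the minority class in plain Python. The helper `oversample_minority` and the toy rows are hypothetical. In practice, resampling like this would be applied to the training set only, so that the test set keeps the original class balance.

```python
import random

def oversample_minority(rows, label_of, seed=12345):
    """Duplicate randomly chosen minority-class rows until classes are balanced."""
    by_label = {}
    for row in rows:
        by_label.setdefault(label_of(row), []).append(row)
    # Every class is grown to the size of the largest class
    target = max(len(group) for group in by_label.values())
    rng = random.Random(seed)
    balanced = []
    for group in by_label.values():
        extra = [rng.choice(group) for _ in range(target - len(group))]
        balanced.extend(group + extra)
    rng.shuffle(balanced)
    return balanced

# Hypothetical rows of (DEP_DELAY, label), with label 1 in the minority
toy_labeled_rows = [(3.0, 0), (0.0, 0), (-2.0, 0), (7.0, 0), (45.0, 1)]
toy_balanced = oversample_minority(toy_labeled_rows, label_of=lambda r: r[1])
toy_counts = {lbl: sum(1 for r in toy_balanced if r[1] == lbl) for lbl in (0, 1)}
print(toy_counts)  # {0: 4, 1: 4}
```

Oversampling only repeats existing rows, so it can raise recall for label=1 at the cost of precision. Compare both metrics before and after to judge whether it helps.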