<h1 style="text-align:center"> Drexel University </h1>
<h2 style = "text-align:center"> College of Computing and Informatics</h2>
<h2 style = "text-align:center">INFO 323: Cloud Computing and Big Data</h2>
<h3 style = "text-align:center">Assignment 4: Spark ML</h3>
<div style="text-align:center; border-style:solid; padding: 10px">
<div style="font-weight:bold">Due Date: Sunday, June 11, 2023</div>
This assignment counts for 10% of the final grade
</div>

### A. Assignment Overview
This assignment provides the opportunity for you to practice with Spark data analytics. 

### B. What to Hand In
	
Submit the completed Jupyter notebook.

### C. How to Hand In

Submit your Jupyter notebook file through the course website in the Blackboard Learn system.

### D. When to Hand In

1. Submit your assignment no later than 11:59pm on the due date.
2. There will be a 10% (absolute value) deduction for each day of lateness, to a maximum of 3 days; assignments will not be accepted beyond that point. Missing work will earn a zero grade.

### Note: All programming must be done on the Spark platform

## Import libraries

```
from pyspark.ml.classification import RandomForestClassifier, RandomForestClassificationModel
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import Binarizer, VectorAssembler
from pyspark.ml import Pipeline
from pyspark.ml.evaluation import BinaryClassificationEvaluator, MulticlassClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder, CrossValidatorModel
```

## Data Ingest:
### Go to the Storage section of the GCP web console and create a new bucket
### Open CloudShell and git clone this repo: `git clone https://github.com/GoogleCloudPlatform/data-science-on-gcp`
### Then, run:
- `cd data-science-on-gcp/02_ingest`
- `./ingest_from_crsbucket.sh bucketname`
- `./bqload.sh (csv-bucket-name) YEAR`
- `cd ../03_sqlstudio`
- `./create_views.sh`
- `cd ../04_streaming`
- `./ingest_from_crsbucket.sh`

After the above steps, 26 JSON files should appear in the folder "flights/tzcorr/" in the bucket.

# Problem Definition:
In this assignment, you are asked to build, tune, and evaluate a RandomForest classifier for predicting arrival delay of flights. You will use the tzcorr data sets.

There are 26-30 data files. When you build the model, start with one data set. When your code works on the single data set, then apply the model to more data sets.

You will build the model by experimenting with different sets of features and tuning the hyperparameters of the RandomForest classifier.

## Path to dataset:
1. If you use Databricks, the data files are located in an AWS S3 storage bucket. List the bucket contents to get the file paths:
```
dbutils.fs.ls("s3://info323-ya45-spring2023/tzcorr/")
```
2. If you use GCP, the data files can be accessed from your Google Cloud Storage bucket:
```
BUCKET = 'your bucket name'
inputs = 'gs://{}/flights/tzcorr/all_flights-00000-*'.format(BUCKET) # a single file
#inputs = 'gs://{}/flights/tzcorr/all_flights-*'.format(BUCKET)  # all files
```

## Question 1: Read dataset: Choose a data file and read the content as a Spark DataFrame named 'flights'; create a relational view for Spark SQL; and print out the schema of the data. How many records are in the data set?

## Question 2: Choose at least 4 columns as features including `DEP_DELAY`, `TAXI_OUT`, and `DISTANCE`; define the feature columns as `featureColumns`.

## Question 3: Check missing values. Run `flights.describe().toPandas()`. Remove the rows with missing values in the feature columns and 'ARR_DELAY'.

## Question 4: If any of the selected feature columns is not of a numeric data type, convert it to an appropriate numeric data type.

## Question 5: Create the label column. Use "Binarizer" function to create a target categorical variable from the "ARR_DELAY" column. Name this target variable as "label". Specifically, if arrival delay is less than 15 minutes, then we want the categorical value to be 0, otherwise the categorical value should be 1. Use "14.99999" for the threshold argument. Use "ARR_DELAY" for the inputCol argument. Use "label" for the outputCol argument.
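The threshold of 14.99999 follows from how Spark's Binarizer compares values: anything strictly greater than the threshold becomes 1.0, and anything equal to or below it becomes 0.0. The plain-Python sketch below imitates that rule to show where a 15-minute delay lands. It is not the Spark API: `binarize` is a hypothetical helper and the sample delays are made up.

```python
# Plain-Python imitation of the Binarizer rule (not the Spark API).
THRESHOLD = 14.99999


def binarize(value, threshold=THRESHOLD):
    """Return 1.0 when value is strictly greater than threshold, else 0.0."""
    return 1.0 if value > threshold else 0.0


# Made-up ARR_DELAY values in minutes, for illustration only.
sample_delays = [-5.0, 0.0, 14.0, 15.0, 42.0]
labels = [binarize(delay) for delay in sample_delays]
print(list(zip(sample_delays, labels)))

# With a threshold of exactly 15, a 15-minute delay would be labeled 0.0,
# because the comparison is strict.
print(binarize(15.0, threshold=15.0))
```

Since the comparison is strict, a threshold just below 15 is what places a delay of exactly 15 minutes in class 1.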

## Question 6: Apply the above "binarizer" transformation to the input DataFrame to create a new DataFrame named "binarizedDF".

## Question 7: Look at the values of the first four rows in the new DataFrame `binarizedDF`.

## Question 8: Use "VectorAssembler" to aggregate the features which will be used to make predictions into a single column. Use "featureColumns" defined earlier for the inputCols argument. Use "features" for the outputCol argument.

## Question 9: Apply the above assembler to the DataFrame "binarizedDF" to create a new DataFrame called "assembled" with the aggregated features in a column.

## Question 10: Split the DataFrame "assembled" into "train" and "test" by calling randomSplit(). Specify the ratios of the two sets as 80% and 20%. Set the seed number to be 12345.

## Question 11: Print the number of rows in the train and test DataFrames to check the sizes.
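Checking the sizes is worthwhile because randomSplit() assigns rows to the splits by random sampling, so the resulting sizes are close to, but usually not exactly, 80% and 20%. The plain-Python sketch below imitates that behavior under stated assumptions: `random_split` is a hypothetical helper rather than Spark's implementation, and the 10,000 integer "rows" are made up.

```python
import random


def random_split(rows, train_fraction=0.8, seed=12345):
    """Send each row to train with probability train_fraction, else to test."""
    rng = random.Random(seed)  # a fixed seed keeps the sketch repeatable
    train, test = [], []
    for row in rows:
        if rng.random() < train_fraction:
            train.append(row)
        else:
            test.append(row)
    return train, test


rows = list(range(10_000))  # made-up stand-in for DataFrame rows
train, test = random_split(rows)

# The two sizes always add up to the total, but the ratio is only approximate.
print(len(train), len(test), len(train) / len(rows))
```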

## Question 12: Create and train a Random Forest classifier using default parameter values.

## Question 13: Save the model for later use.

## Question 14: Load the saved model.

## Question 15: Make predictions using the test data set. Store the results in "predictions".

## Question 16: Look at the values of "prediction" and "label" of the first ten rows in the predictions to see whether the prediction matches the input.

## Question 17: Evaluate the model on the test data. Save the result as `evalSummary`.

## Question 18: Print out the accuracy.

## Question 19: Print out the precision for each label.

## Question 20: Print out the recall for each label.
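As a reminder of what these metrics measure, the plain-Python sketch below computes accuracy, per-label precision, and per-label recall from a handful of (label, prediction) pairs. The pairs are made up for illustration, and `precision` and `recall` are hypothetical helpers, not Spark functions. The sketch assumes every label is both present and predicted at least once, so no division by zero occurs.

```python
# Made-up (label, prediction) pairs, for illustration only.
pairs = [(0, 0), (0, 0), (0, 0), (0, 1), (1, 1),
         (1, 0), (1, 0), (1, 1), (0, 0), (0, 0)]


def precision(pairs, cls):
    """Of the rows predicted as cls, the fraction whose true label is cls."""
    predicted = [label for label, pred in pairs if pred == cls]
    return sum(label == cls for label in predicted) / len(predicted)


def recall(pairs, cls):
    """Of the rows whose true label is cls, the fraction predicted as cls."""
    actual = [pred for label, pred in pairs if label == cls]
    return sum(pred == cls for pred in actual) / len(actual)


# Accuracy is the fraction of rows where the prediction matches the label.
accuracy = sum(label == pred for label, pred in pairs) / len(pairs)

print("accuracy:", accuracy)
for cls in (0, 1):
    print("label", cls, "precision:", precision(pairs, cls),
          "recall:", recall(pairs, cls))
```

In this made-up sample, label 1 is the minority class and has the lower recall, which is a pattern worth watching for with delayed flights.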

## Question 21: Feature selection and hyperparameter tuning. In the following, you will select the best combination of features and tune the hyperparameters of the random forest classifier through grid search. First, split the DataFrame "binarizedDF" into "train" and "test" by calling randomSplit(). Specify the ratios of the two sets as 80% and 20%. Set the seed number to be 12345.

## Question 22: Define a list of possible feature combinations. Modify the following list "features_sets" by adding the additional feature you selected. Use all possible 2-, 3-, and 4-tuple combinations.

```
features_sets = [['DEP_DELAY', 'DISTANCE'], ['DEP_DELAY', 'TAXI_OUT'], ['TAXI_OUT', 'DISTANCE'], ['DEP_DELAY', 'TAXI_OUT', 'DISTANCE']]
```
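Rather than typing every combination by hand, the list can be generated. The sketch below is one possible way, using `itertools.combinations`. The name `EXTRA_FEATURE` is a hypothetical placeholder for whichever fourth column you selected in Question 2.

```python
from itertools import combinations

base_features = ['DEP_DELAY', 'TAXI_OUT', 'DISTANCE']
extra_feature = 'EXTRA_FEATURE'  # hypothetical placeholder: use your own column
all_features = base_features + [extra_feature]

# Every 2-, 3-, and 4-element combination, each stored as a list of names.
features_sets = [
    list(combo)
    for size in (2, 3, 4)
    for combo in combinations(all_features, size)
]

print(len(features_sets))
for feature_set in features_sets:
    print(feature_set)
```

With four features this yields 6 pairs, 4 triples, and 1 quadruple, so 11 feature sets in total.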

## Question 23: Create a VectorAssembler with outputCol="features" and store the result to "assembler".

## Question 24: Create a RandomForestClassifier with all default values and store the result to "rf".

## Question 25: Create a Pipeline with stages=[assembler, rf] and store the result to "pipeline".

## Question 26: Create a ParamGridBuilder and store the result to "grid". Add the following parameters and their ranges to "grid":
1. assembler.inputCols, features_sets
2. rf.numTrees, [20, 50, 100]
3. rf.maxDepth, [2, 3, 5]
4. rf.minInstancesPerNode, [1, 2, 3]
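Before running the search, it helps to estimate how many models it will train, since a grid is the Cartesian product of all parameter ranges. The plain-Python sketch below counts the combinations for the four feature sets given in Question 22 (your own list will be longer once you add a feature). It assumes 3 folds, which is the default numFolds of CrossValidator. The dictionary keys are illustrative labels, not Spark Param objects.

```python
from itertools import product

# The four feature sets given in the assignment.
features_sets = [
    ['DEP_DELAY', 'DISTANCE'],
    ['DEP_DELAY', 'TAXI_OUT'],
    ['TAXI_OUT', 'DISTANCE'],
    ['DEP_DELAY', 'TAXI_OUT', 'DISTANCE'],
]
num_trees = [20, 50, 100]
max_depth = [2, 3, 5]
min_instances_per_node = [1, 2, 3]

# One entry per point in the grid (Cartesian product of all ranges).
param_combinations = [
    {'inputCols': fs, 'numTrees': nt, 'maxDepth': md, 'minInstancesPerNode': mi}
    for fs, nt, md, mi in product(
        features_sets, num_trees, max_depth, min_instances_per_node)
]

num_folds = 3  # assumption: the CrossValidator default
fits_during_cv = len(param_combinations) * num_folds

print("grid points:", len(param_combinations))
print("model fits during cross validation:", fits_during_cv)
```

Every grid point is trained once per fold, which is why it pays to get the code working on a single data file before applying it to more.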

## Question 27: Create a BinaryClassificationEvaluator and store the result to "evaluator".

## Question 28: Create a CrossValidator with the following options and store the result to "cv".
1. estimator=pipeline
2. estimatorParamMaps=grid
3. evaluator=evaluator
4. parallelism=2

## Question 29: Fit the CrossValidator on the train set and store the result to "cvModel".

## Question 30: Print out the avgMetrics of the "cvModel".
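avgMetrics holds one averaged score per parameter combination, in the same order as the parameter maps in "grid". To read it, pair each score with its combination and look for the largest value (this assumes the evaluator's default metric, areaUnderROC, where larger is better). The plain-Python sketch below shows the pairing. The parameter maps and scores are made-up stand-ins, not real results.

```python
# Made-up stand-ins: in the notebook these would be the parameter maps in
# "grid" and the list cvModel.avgMetrics.
param_maps = [
    {'numTrees': 20, 'maxDepth': 2},
    {'numTrees': 50, 'maxDepth': 3},
    {'numTrees': 100, 'maxDepth': 5},
]
avg_metrics = [0.81, 0.84, 0.83]  # hypothetical scores, not real results

# Index of the highest score, assuming larger is better for the metric.
best_index = max(range(len(avg_metrics)), key=lambda i: avg_metrics[i])

# Print the combinations from best to worst.
for metric, params in sorted(zip(avg_metrics, param_maps),
                             key=lambda pair: pair[0], reverse=True):
    print(round(metric, 4), params)

print("best combination:", param_maps[best_index])
```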

## Question 31: Save and load the cvModel.

## Question 32: Make predictions on the test data. Store the result to "predictions".

## Question 33: Create a MulticlassClassificationEvaluator and store the result to cvEvaluator.

## Question 34: Print out the precision for label=1, i.e., ARR_DELAY >=15 minutes.

## Question 35: Print out the recall for label=1, i.e., ARR_DELAY >=15 minutes.

## Question 36: Print out the precision for label=0, i.e., ARR_DELAY < 15 minutes.

## Question 37: Print out the recall for label=0, i.e., ARR_DELAY < 15 minutes.

## Question 38: What are the feature combination and hyperparameter values for the best model? Did the best model improve the label=1 precision and recall compared to the model before cross validation tuning? Could you continue to improve the model?