---
<h1 style="text-align: center;">
CSCI 4521: Applied Machine Learning (Fall 2024)
</h1>

<h1 style="text-align: center;">
Homework 4
</h1>

<h3 style="text-align: center;">
(Due Tue, Nov. 12, 11:59 PM CT)
</h3>

---

### Weather impacts life each and every day in both relatively minor and significant ways. Many people check the weather daily to know what the predicted temperature and precipitation will be so they can plan how to dress, what activities to do, and how early to leave on their daily commute.

![weather.jpg](attachment:8a91f958-b8a7-413a-9be5-491b8c881176.jpg)

Image from https://www.un.org/en/un-chronicle/future-weather-climate-and-water-across-generations


### In this homework, your task is to predict the temperature (in degrees Celcius) at different points in time. You need to use machine learning and develop regression models to accomplish this task. The only data you have available is the weather data in the dataset `weather_csci4521_hw4.csv`. Each row in the dataset is a different point in time and the columns are the features consisting of Date and Daily Summary, and many features computed from Visibility, Wind Speed and Bearing, Humidity, Pressure, and Loud Cover. The target variable is in column "Temperature (C)".

### You must clean and preprocess the data then decide which regression algorithms to use, which and how to tune any hyperparameters, how to measure performance, which models to select, and which final model to use.

### You must use **PySpark** to clean and preprocess the data. If you use anything other than PySpark, you will receive no credit for this homework. The one exception is for feature selection. You are allowed to use Pandas to decide how many features to keep but you must use PySpark to select those features. After cleaning and preprocessing, you can use any of the coding packages we've used in class (Numpy, Pandas, PySpark, Scikit-learn, etc.). Make sure to write and submit clean, working code. Reminder, you cannot use ChatGPT or similar technologies. Please see the syllabus for more details.

### You also need to submit a short report of your work describing all steps you took, explanations of why you took those steps, results, what you learned, how you might use what you learned in the future, and your conclusions. We expect the report to be well-written and clearly describe everything you've done and why.

---

### Write your code here

In [23]:
# Setting up imports

# Pyspark imports
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, year, month, dayofmonth, hour, weekofyear, dayofweek
from pyspark.sql.types import DoubleType
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler, StandardScaler, Imputer
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
from pyspark.ml.regression import LinearRegression, RandomForestRegressor, GBTRegressor
from pyspark.ml.evaluation import RegressionEvaluator

# Colab imports
from google.colab import drive
drive.mount('/content/drive')

# Misc imports
import numpy as np
import pandas as pd

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [24]:
# Create Spark Session
spark = SparkSession.builder.appName("HW4").getOrCreate()

In [25]:
# Step 0: Load Data

# Read CSV
df = spark.read.csv("/content/drive/MyDrive/colab_data/weather_csci4521_hw4.csv", header=True, inferSchema=True)

# No. of Samples
print("Number of samples = ", df.count())
# No. of Features available
print("Number of features = ", len(df.columns))

# Print head upto 30
display(df.show(30))

Number of samples =  96453
Number of features =  109
+-------------------+--------------------+-------------------+-------------------+--------------------+-------------------+--------------------+-------------------+---------+---------+-------------------+----------+-------------------+--------------------+--------------------+-------------------+-------------------+-------------------+----------+-------------------+------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+------------------+------------------+-------------------+-------------------+-------------------+--------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-

None

In [26]:
# Step 1: Cleaning and Preprocessing

# feature_0 through feature_105 are numerical doubles
numerical_features_columns = [col for col in df.columns if col.startswith("feature_")]

# imputing these with the median (reasoning in report)
imputer = Imputer(strategy="median", inputCols=numerical_features_columns, outputCols=numerical_features_columns)
df = imputer.fit(df).transform(df)

# Transforming Formatted date into date (reasoning in report)
df = df.withColumn("Year", year("Formatted Date")) \
           .withColumn("Month", month("Formatted Date")) \
           .withColumn("Day", dayofmonth("Formatted Date")) \
           .withColumn("Hour", hour("Formatted Date")) \
           .withColumn("WeekOfYear", weekofyear("Formatted Date")) \
           .withColumn("DayOfWeek", dayofweek("Formatted Date"))

# Drop the original "Formatted Date" column
# Data already accounted for, don't want to over represent
df = df.drop("Formatted Date")

# Index daily summary
daily_summary_indexer = StringIndexer(inputCol="Daily Summary", outputCol="DailySummaryIndex")
df = daily_summary_indexer.fit(df).transform(df)

# Encode daily summary index using one hot encoder, then transform
daily_summary_encoder = OneHotEncoder(inputCol="DailySummaryIndex", outputCol="DailySummaryVec")
df = daily_summary_encoder.fit(df).transform(df)

# Drop old columns
df = df.drop("Daily Summary", "DailySummaryIndex")

In [27]:
# Step 2: Feature Selection

# Vectorizing the features for future steps
initial_features = [col for col in df.columns if col not in ["Temperature (C)"]]
assembler = VectorAssembler(inputCols=initial_features, outputCol="features_assembled")
df = assembler.transform(df)

# Use Random Forest Regressor to find feature importance (reasoning in report)
rf = RandomForestRegressor(featuresCol="features_assembled", labelCol="Temperature (C)", numTrees=50)
rf_model = rf.fit(df)

# Maintainance Step:
# function creates a dense maps to float value of x
# just makes future steps easier
feature_importances_dense = [float(x) for x in rf_model.featureImportances.toArray()]  # Dense array of all importances

# dictionary - each feature to its importance
feature_importance_dict = dict(zip(initial_features, feature_importances_dense))

# Sort features by importance
sorted_features = sorted(feature_importance_dict, key=feature_importance_dict.get, reverse=True)

# Choose top 25 features (reasoning in report)
selected_features = sorted_features[:25]

print("Selected features based on importance:", selected_features)

# Setup df to use selected features and the label
df = df.select(selected_features + ["Temperature (C)"])
assembler = VectorAssembler(inputCols=selected_features, outputCol="features_assembled")
df = assembler.transform(df)

# Scaling
scaler = StandardScaler(inputCol="features_assembled", outputCol="features")
df = scaler.fit(df).transform(df).select("features", col("Temperature (C)").alias("label"))

# Check df after preprocessing
df.show(10)

Selected features based on importance: ['WeekOfYear', 'feature_8', 'Month', 'feature_19', 'feature_7', 'Hour', 'feature_87', 'Year', 'Day', 'DailySummaryVec', 'feature_10', 'DayOfWeek', 'feature_92', 'feature_52', 'feature_66', 'feature_100', 'feature_99', 'feature_76', 'feature_77', 'feature_58', 'feature_82', 'feature_44', 'feature_21', 'feature_15', 'feature_93']
+--------------------+-----------------+
|            features|            label|
+--------------------+-----------------+
|(237,[0,1,2,3,4,5...| 9.47222222222222|
|(237,[0,1,2,3,4,5...|9.355555555555558|
|(237,[0,1,2,3,4,6...|9.377777777777778|
|(237,[0,1,2,3,4,5...| 8.28888888888889|
|(237,[0,1,2,3,4,5...|8.755555555555553|
|(237,[0,1,2,3,4,5...| 9.22222222222222|
|(237,[0,1,2,3,4,5...|7.733333333333334|
|(237,[0,1,2,3,4,5...| 8.77222222222222|
|(237,[0,1,2,3,4,5...|10.82222222222222|
|(237,[0,1,2,3,4,5...|13.77222222222222|
+--------------------+-----------------+
only showing top 10 rows



In [28]:
# Train-Test-Validation split (reasoning in report)
train_data, temp_data = df.randomSplit([0.7, 0.3], seed=5782267)
test_data, val_data = temp_data.randomSplit([0.5, 0.5], seed=5782267)

In [29]:
# Step 3: Training and validation

# Choose some regression models (reasoning in report)
models = [
    (LinearRegression(featuresCol="features", labelCol="label"),
     ParamGridBuilder()
     .addGrid(LinearRegression.regParam, [0.01, 0.1, 1.0])
     .addGrid(LinearRegression.elasticNetParam, [0.0, 0.5, 1.0])
     .build()),
    (RandomForestRegressor(featuresCol="features", labelCol="label"),
     ParamGridBuilder()
     .addGrid(RandomForestRegressor.numTrees, [50, 100])
     .addGrid(RandomForestRegressor.maxDepth, [5, 10])
     .build()),
    (GBTRegressor(featuresCol="features", labelCol="label"),
     ParamGridBuilder()
     .addGrid(GBTRegressor.maxDepth, [5, 10])
     .addGrid(GBTRegressor.maxIter, [20, 50])
     .build())
]

# Evaluate the different models using rmse (reason in report)
evaluator = RegressionEvaluator(predictionCol="prediction", labelCol="label", metricName="rmse")

# Loop through each model and each param
best_model = None
best_rmse = float("inf")

for model, param_grid in models:
    crossval = CrossValidator(estimator=model,
                              estimatorParamMaps=param_grid,
                              evaluator=evaluator,
                              numFolds=5)
    cv_model = crossval.fit(train_data)

    # Get best model based on rmse (reasoning in report)
    validation_rmse = evaluator.evaluate(cv_model.transform(val_data))

    # got model.__clas__.__name__ from the official documentation and a
    # stack overflow page that I can't find, not attempting to plagiarize
    print(f"Validation RMSE for {model.__class__.__name__}: {validation_rmse}")

    if validation_rmse < best_rmse:
        best_rmse = validation_rmse
        best_model = cv_model.bestModel

print("Best model:", best_model)

Validation RMSE for LinearRegression: 6.623138036673498
Validation RMSE for RandomForestRegressor: 4.322085529432972
Validation RMSE for GBTRegressor: 3.6355988189233646
Best model: GBTRegressionModel: uid=GBTRegressor_04a6ce427ddd, numTrees=20, numFeatures=237


In [30]:
# Step 4 : Results

# Different results of chosen model on test data
predictions = best_model.transform(test_data)
metrics = ["rmse", "mae", "r2"]

# Results
for metric in metrics:
    eval_metric = RegressionEvaluator(predictionCol="prediction", labelCol="label", metricName=metric)
    print(f"{metric.upper()}: {eval_metric.evaluate(predictions)}")

RMSE: 3.5837771799337808
MAE: 2.823774326893743
R2: 0.860111843134522


---

### Write your report here

Step 0 : Loading the Data

*   Used PySpark as instructed.
*   Relatively straight forward - load the data and see the structure.
*   First thing I noticed is a large number of numerical data features. The second is two useful non numerical features, date and summary.
*   Examined the data and especially the nulls to see what data was missing, as well as the schema of the data itself.


Step 1: Cleaning


*   Missing values - Imputed the null values with the median strategy for numerical features. I chose the median because it's less sensitive to outliers.
*   Formatting some features - I modifed the formatted date column to become different features in the dat frame. This makes logical sense since feature data might actually be more indicative atomically. For example, certain months are more susceptive to rainfall, and so this split captures seasonal changes.
*   I also used One Hot Encoding to capture the daily summary, since some description of numerical data might encompass data from features we might no otherwise consider.



Step 2 : Feature Selection


*   Because of the high dimensionality but need for specificity I used RandomForestRegressor to reduce the number of features I was considering. I ranked features by importance and picked the first 25. This is because after 15, the importances tapered off to become negligible.
*   I also scaled the data, mainly for future steps. Linear regressor is known to perform better on scaled data, and it doesn't affect the others negatively to scale, hence the decision was made.



Step 3 : Training and Validation


*   I used a standard 0.7, 0.15, 0.15 split for splitting the available data into train test and validation. I will use the validation set to tune the hyperparamters subsequently.
*   Explaining some choices : I chose to test between LinearRegressor, RandomForestRegressor and GBTRegressor for the model training.
*   Linear regressor acts as a baseline model to see if the data works for simple models.
*   Then I used RandomForest as an ensemble method, and finally GBT as the most computationally intensive but theoretically best performer on this kind of data.
*  Additionally I added some hyperparameters to test across for each model. The values were chosen mainly at random, but the main idea is to move across orders of magnitude, and pick the best performing ones. I used 5 fold cross val to avoid overfitting but the number itself was mainly chosen as a standard.
*  Finally, I used the Root Mean Squared Error to rank the inital performance of the model. I choose to define the "Best" model for this data, using this metric because it is harsh on large errors which helps deal with outliers and averages the predictions providing decent fits.




Step 4 : Results

*   Finally, I test the performance of the best model (which is the GBT model as expected) and these are the obtianed results -
*   RMSE: 3.5837771799337808
*   MAE: 2.823774326893743
*   R2: 0.860111843134522



Lessons


*   Tuning hyperparamters is essential to finding good performance metrics for many different types of models.
*   Sparse vectors and how to deal with them as outputs of builtin functions.
*   The different error metrics and how to interpret them.



Future Applications


*   Regression tasks can be used in just about every aspect of life. For me particularly, I am going to use the learnigs from this assignment to build models for stock price prediction and exploring feature creation in complex models.
*   Other potential applications include many continuous classification tasks.



Conclusions


*   Ultimately, I used different regression models to predict temperature based on multiple features.
*   We obtained a R2 of 0.86 meaning there is a 86% correlation between the data and the prediction which a reasonable result.



---