<center><img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/assets/logos/SN_web_lightmode.png" width="300"></center><br/>

# Assignment: Notebook for Peer Assignment

Estimated time needed: 60 minutes


# Assignment Scenario

Congratulations! You have just been hired by a US weather forecasting firm as a data scientist.

The company wants to use weather conditions to help predict the likelihood of precipitation, which involves various local climatological variables, including temperature, wind speed, humidity, dew point, and pressure. The data you will be handling was collected by a NOAA weather station located at John F. Kennedy International Airport in Queens, New York.

Your task is to provide a high-level analysis of weather data at JFK Airport. Your stakeholders want to understand the current and historical record of precipitation based on different variables. For now, they are mainly interested in a macro view of JFK Airport weather and how it relates to the likelihood of rain, because rain affects flight delays, among other things.


# Introduction

This project relates to the NOAA Weather Dataset - JFK Airport (New York). The original dataset contains 114,546 hourly observations of 12 local climatological variables (such as temperature and wind speed) collected at JFK airport. This dataset can be obtained for free from the IBM Developer [Data Asset Exchange](https://developer.ibm.com/exchanges/data/all/jfk-weather-data/). 

For this project, you will be using a subset of the dataset, which contains 5727 rows (about 5% of the original rows) and 9 columns. The end goal will be to predict the precipitation using some of the available features. In this project, you will practice reading data files, preprocessing data, creating models, improving models, and evaluating them to ultimately choose the best model.




## Table of Contents:

Using this R notebook you will complete **10 tasks**:
* [0. Import Modules](#cell0)
* [1. Download and Unzip NOAA Weather Dataset](#cell1)
* [2. Read Dataset into Project](#cell2)
* [3. Select Subset of Columns](#cell3)
* [4. Clean Up Columns](#cell4)
* [5. Convert Columns to Numerical Types](#cell5)
* [6. Rename Columns](#cell6)
* [7. Exploratory Data Analysis](#cell7)
* [8. Linear Regression](#cell8)
* [9. Improve the Model](#cell9)
* [10. Find Best Model](#cell10)


<a id="cell0"></a>
## 0. Import required modules

Tidymodels is a collection of packages that use tidyverse principles to cover the entire modeling process, from preprocessing the initial data, to creating a model, to tuning hyperparameters. The tidymodels packages can be used to produce high-quality statistical and machine learning models. Our Jupyter notebook platforms have the Tidyverse, Tidymodels, and rlang packages built in, so you do not need to install them before loading them. However, if you decide to run this lab locally in RStudio Desktop, run the install lines below before loading the libraries.
 


In [1]:
# Install tidymodels if you haven't done so
install.packages("rlang")
install.packages("tidymodels")


**Note: After installing the packages, restart the kernel. Without installing the packages again, load them. Tidyverse and Tidymodels will be the two main packages you will use.**


In [None]:
# Library for modeling
library(tidymodels)

# Load tidyverse
library(tidyverse)


### Understand the Dataset

The original NOAA JFK dataset contains 114,546 hourly observations of various local climatological variables (including temperature, wind speed, humidity, dew point, and pressure). 

In this project you will use a sample dataset, which is around 293 KB. [Link to the sample dataset](https://dax-cdn.cdn.appdomain.cloud/dax-noaa-weather-data-jfk-airport/1.1.4/noaa-weather-sample-data.tar.gz).

The sample contains 5727 rows (about 5% of the original rows) and 9 columns, which are:
- DATE
- HOURLYDewPointTempF
- HOURLYRelativeHumidity
- HOURLYDRYBULBTEMPF
- HOURLYWETBULBTEMPF
- HOURLYPrecip
- HOURLYWindSpeed
- HOURLYSeaLevelPressure
- HOURLYStationPressure

The original dataset is much bigger. Feel free to explore the original dataset. [Link to the original dataset.](https://dax-cdn.cdn.appdomain.cloud/dax-noaa-weather-data-jfk-airport/1.1.4/noaa_weather.html) 

For more information about the dataset, check out the [preview](https://dax-cdn.cdn.appdomain.cloud/dax-noaa-weather-data-jfk-airport/1.1.4/data-preview/index.html?_ga=2.176781478.281508226.1616293518-1509963377.1616117067&cm_mc_uid=90945889198916153255549&cm_mc_sid_50200000=64650651616293516933) of NOAA Weather - JFK Airport.


<a id="cell1"></a>

## 1. Download NOAA Weather Dataset

Use the `download.file()` function to download the sample dataset from the URL below.

URL = 'https://dax-cdn.cdn.appdomain.cloud/dax-noaa-weather-data-jfk-airport/1.1.4/noaa-weather-sample-data.tar.gz'


In [None]:
# URL of the dataset
URL <- 'https://dax-cdn.cdn.appdomain.cloud/dax-noaa-weather-data-jfk-airport/1.1.4/noaa-weather-sample-data.tar.gz'

# Download the dataset
download.file(URL, destfile = 'noaa-weather-sample-data.tar.gz', mode = 'wb')


Untar the zipped file.


In [None]:
# Specify the file path
file_path <- 'noaa-weather-sample-data.tar.gz'

# Untar the file
untar(file_path)


<a id="cell2"></a>
## 2. Extract and Read into Project
We start by reading in the raw dataset. You should specify the file name as "noaa-weather-sample-data/jfk_weather_sample.csv".


In [None]:
library(readr)

# Specify the file path
file_path <- 'noaa-weather-sample-data/jfk_weather_sample.csv'

# Read the CSV file into a dataframe
weather_data <- read_csv(file_path)

Next, display the first few rows of the dataframe.


In [None]:
head(weather_data, 10)

Also, take a `glimpse` of the dataset to see the different column data types and make sure it is the correct subset dataset with about 5700 rows and 9 columns.


In [None]:
# Take a glimpse of the dataframe (column names, types, and first values)
glimpse(weather_data)

# Summary of the dataframe (for numeric columns)
summary(weather_data)

# Check the dimensions of the dataframe (number of rows and columns)
dim(weather_data)

<a id="cell3"></a>
## 3. Select Subset of Columns

The end goal of this project will be to predict `HOURLYPrecip` (precipitation) using a few other variables. Before you can do this, you first need to preprocess the dataset. Sections 3 through 6 focus on preprocessing.

The first step in preprocessing is to select a subset of data columns and inspect the column types.

The key columns that we will explore in this project are:
- HOURLYRelativeHumidity
- HOURLYDRYBULBTEMPF
- HOURLYPrecip
- HOURLYWindSpeed
- HOURLYStationPressure

Data Glossary:
- 'HOURLYRelativeHumidity' is the relative humidity given to the nearest whole percentage.
- 'HOURLYDRYBULBTEMPF' is the dry-bulb temperature and is commonly used as the standard air temperature reported. It is given here in whole degrees Fahrenheit.
- 'HOURLYPrecip' is the amount of precipitation in inches to hundredths over the past hour. For certain automated stations, precipitation will be reported at sub-hourly intervals (e.g. every 15 or 20 minutes) as an accumulated amount of all precipitation within the preceding hour. A “T” indicates a trace amount of precipitation.
- 'HOURLYWindSpeed' is the speed of the wind at the time of observation given in miles per hour (mph).
- 'HOURLYStationPressure' is the atmospheric pressure observed at the station during the time of observation. Given in inches of Mercury (in Hg).

`Select` those five columns and store the modified dataframe as a new variable.


In [None]:
selected_columns <- c("HOURLYRelativeHumidity", "HOURLYDRYBULBTEMPF", "HOURLYPrecip", "HOURLYWindSpeed", "HOURLYStationPressure")
weather_subset <- weather_data[selected_columns]

Show the first 10 rows of this new dataframe.


In [None]:
head(weather_subset, 10)

<a id="cell4"></a>
## 4. Clean Up Columns

From the dataframe preview above, we can see that the column `HOURLYPrecip` - which is the hourly measure of precipitation levels - contains both `NA` and `T` values. `T` specifies *trace amounts of precipitation* (meaning essentially no precipitation), while `NA` means *not available*, and is used to denote missing values. Additionally, some values also have "s" at the end of them, indicating that the precipitation was snow. 

Inspect the unique values present in the column `HOURLYPrecip` (with `unique(dataframe$column)`) to see these values.


In [None]:
unique_values <- unique(weather_subset$HOURLYPrecip)
unique_values

Having characters in values (like the "T" and "s" that you see in the unique values) will cause problems when you create a model because values for precipitation should be numerical. So you need to fix these values that have characters. 

Now, for the column `HOURLYPrecip`:
1. Replace all the `T` values with "0.0" and 
2. Remove "s" from values like "0.02s". In R, you can use the function `str_remove(column, pattern = "s$")` to remove the character "s" from the end of values. The "$" anchors the match to the end of each value. The `pattern` is a regex pattern. See [here](https://www.rdocumentation.org/packages/stringi/versions/1.5.3/topics/about_search_regex) for more information about regex and string matching in R.

Remember that you can use `tidyverse`'s  `mutate()` to update columns.

You can check your work by checking if unique values of `HOURLYPrecip` still contain any `T` or `s`. Store the modified dataframe as a new variable.
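As a small illustration of the two cleaning steps, the sketch below applies them to a hypothetical character vector (`raw_precip` is made up for this example and is not taken from the dataset). It assumes the `stringr` package is available, as it is when the tidyverse is installed.

```r
library(stringr)

# Hypothetical raw values resembling HOURLYPrecip (illustration only)
raw_precip <- c("T", "0.02s", "0.00", NA, "0.15")

# Step 1: replace trace markers with "0.0"; NA entries stay NA
step1 <- ifelse(raw_precip == "T", "0.0", raw_precip)

# Step 2: strip a trailing "s"; "$" anchors the match to the end of the string
cleaned_precip <- str_remove(step1, pattern = "s$")

cleaned_precip
```

Note that missing values pass through both steps unchanged, so they still need to be handled before modeling.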


In [None]:
# Load required libraries if not already loaded
library(dplyr)
library(stringr)

# Make a copy of weather_subset to work with
weather_cleaned <- weather_subset

# Replace "T" values with "0.0" in HOURLYPrecip
weather_cleaned <- weather_cleaned %>%
  mutate(HOURLYPrecip = ifelse(HOURLYPrecip == "T", "0.0", HOURLYPrecip))

# Remove "s" from values like "0.02s"
weather_cleaned <- weather_cleaned %>%
  mutate(HOURLYPrecip = str_remove(HOURLYPrecip, pattern = "s$"))

# Check unique values in HOURLYPrecip column to verify
unique_values_cleaned <- unique(weather_cleaned$HOURLYPrecip)
unique_values_cleaned

<a id="cell5"></a>
## 5. Convert Columns to Numerical Types
Now that you have removed the characters in the `HOURLYPrecip` column, you can safely convert the column to a numeric type.

First, check the types of the columns. You will notice that all are `dbl` (double or numeric) except for `HOURLYPrecip`, which is `chr` (character or string). Use the `glimpse` function from Tidyverse.
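To see why the cleanup has to come before the conversion, here is a minimal base R sketch on hypothetical values (not from the dataset): `as.numeric()` parses clean strings, keeps `NA` as `NA`, and turns any leftover non-numeric string into `NA` with a warning.

```r
# Hypothetical cleaned values (illustration only)
precip_chr <- c("0.0", "0.02", NA, "0.15")
precip_num <- as.numeric(precip_chr)

# A value that still carries a trailing "s" cannot be parsed and becomes NA;
# suppressWarnings() hides the coercion warning for this demonstration
leftover <- suppressWarnings(as.numeric("0.02s"))

precip_num
leftover
```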


In [2]:
# Load required libraries if not already loaded
library(dplyr)

# Display the current structure of the dataframe.
# HOURLYPrecip should still be <chr> here; the conversion happens in the next cell.
glimpse(weather_cleaned)



Convert `HOURLYPrecip` to the `numeric` type and store the cleaned dataframe as a new variable.


In [None]:
# Load required libraries if not already loaded
library(dplyr)

# Make a copy of weather_cleaned to work with
weather_final <- weather_cleaned

# Convert HOURLYPrecip to numeric type
weather_final <- weather_final %>%
  mutate(HOURLYPrecip = as.numeric(HOURLYPrecip))

# Check the structure of weather_final to confirm the conversion
glimpse(weather_final)


We can now see that all fields have a numeric data type.


<a id="cell6"></a>
## 6. Rename Columns
Let's rename the following columns as:
- 'HOURLYRelativeHumidity' to 'relative_humidity'
- 'HOURLYDRYBULBTEMPF' to 'dry_bulb_temp_f'
- 'HOURLYPrecip' to 'precip'
- 'HOURLYWindSpeed' to 'wind_speed'
- 'HOURLYStationPressure' to 'station_pressure'

You can use `dplyr::rename()`. Then, store the final dataframe as a new variable.


In [None]:
# Load required libraries if not already loaded
library(dplyr)

# Make a copy of weather_final to work with
weather_final_renamed <- weather_final

# Rename columns
weather_final_renamed <- weather_final_renamed %>%
  rename(
    relative_humidity = HOURLYRelativeHumidity,
    dry_bulb_temp_f = HOURLYDRYBULBTEMPF,
    precip = HOURLYPrecip,
    wind_speed = HOURLYWindSpeed,
    station_pressure = HOURLYStationPressure
  )

# Display the structure of weather_final_renamed to verify column names
glimpse(weather_final_renamed)


<a id="cell7"></a>
## 7. Exploratory Data Analysis
Now that you have finished preprocessing the dataset, you can start exploring the columns more.

First, split the data into a training and testing set. Splitting a dataset is done randomly, so to have reproducible results set the seed = 1234. Also, use 80% of the data for training.


In [None]:
# Load required libraries
library(rsample)

# Set seed for reproducibility
set.seed(1234)

# Split the data into training (80%) and testing (20%) sets
data_split <- initial_split(weather_final_renamed, prop = 0.8)

# Extract the training and testing datasets
train_data <- training(data_split)
test_data <- testing(data_split)

# Check the dimensions of the datasets
dim(train_data)
dim(test_data)


Next, looking at just the **training set**, plot histograms or box plots of the variables (`relative_humidity`, `dry_bulb_temp_f`, `precip`, `wind_speed`, `station_pressure`) for an initial look at their distributions using `tidyverse`'s `ggplot`. Leave the testing set as is, because it is good practice not to look at the testing set until evaluating the final model.
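One possible way to view all five distributions at once is to reshape the data to long format and facet the plot. The sketch below uses a hypothetical stand-in data frame (`toy_train`, filled with random numbers rather than real weather data) and assumes `dplyr`, `tidyr`, and `ggplot2` are installed.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Hypothetical stand-in for the training set (random values, not real weather data)
set.seed(1234)
toy_train <- tibble(
  relative_humidity = runif(50, 20, 100),
  dry_bulb_temp_f   = rnorm(50, 55, 15),
  precip            = rexp(50, rate = 20),
  wind_speed        = runif(50, 0, 30),
  station_pressure  = rnorm(50, 30, 0.2)
)

# Reshape to long format: one row per (variable, value) pair
toy_long <- toy_train %>%
  pivot_longer(cols = everything(), names_to = "variable", values_to = "value")

# One histogram panel per variable, each panel with its own axis ranges
p <- ggplot(toy_long, aes(x = value)) +
  geom_histogram(bins = 20, fill = "skyblue", color = "black") +
  facet_wrap(~ variable, scales = "free") +
  labs(title = "Distributions of training-set variables")
p
```

Free scales matter here because the variables have very different ranges; a shared x-axis would flatten `precip` and `station_pressure` into a single bar.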


In [None]:
# Load required libraries if not already loaded
library(ggplot2)
library(dplyr)

# Plot histograms for each variable
train_data %>%
  ggplot(aes(x = relative_humidity)) +
  geom_histogram(binwidth = 5, fill = "skyblue", color = "black") +
  labs(title = "Distribution of Relative Humidity")

train_data %>%
  ggplot(aes(x = dry_bulb_temp_f)) +
  geom_histogram(binwidth = 2, fill = "lightgreen", color = "black") +
  labs(title = "Distribution of Dry Bulb Temperature (F)")

train_data %>%
  ggplot(aes(x = precip)) +
  geom_histogram(binwidth = 0.1, fill = "lightcoral", color = "black") +
  labs(title = "Distribution of Precipitation")

train_data %>%
  ggplot(aes(x = wind_speed)) +
  geom_histogram(binwidth = 1, fill = "lightyellow", color = "black") +
  labs(title = "Distribution of Wind Speed")

train_data %>%
  ggplot(aes(x = station_pressure)) +
  geom_histogram(binwidth = 0.01, fill = "lightblue", color = "black") +
  labs(title = "Distribution of Station Pressure")

# Plot boxplots for each variable
train_data %>%
  ggplot(aes(y = relative_humidity)) +
  geom_boxplot(fill = "skyblue", color = "black") +
  labs(title = "Distribution of Relative Humidity")

train_data %>%
  ggplot(aes(y = dry_bulb_temp_f)) +
  geom_boxplot(fill = "lightgreen", color = "black") +
  labs(title = "Distribution of Dry Bulb Temperature (F)")

train_data %>%
  ggplot(aes(y = precip)) +
  geom_boxplot(fill = "lightcoral", color = "black") +
  labs(title = "Distribution of Precipitation")

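train_data %>%
  ggplot(aes(y = wind_speed)) +
  geom_boxplot(fill = "lightyellow", color = "black") +
  labs(title = "Distribution of Wind Speed")

train_data %>%
  ggplot(aes(y = station_pressure)) +
  geom_boxplot(fill = "lightblue", color = "black") +
  labs(title = "Distribution of Station Pressure")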


<a id="cell8"></a>
## 8. Linear Regression 
After exploring the dataset more, you are now ready to start creating models to predict the precipitation (`precip`).

Create simple linear regression models where `precip` is the response variable and each of `relative_humidity`, `dry_bulb_temp_f`, `wind_speed`, and `station_pressure` is the predictor variable in turn, e.g. `precip ~ relative_humidity`, `precip ~ dry_bulb_temp_f`, etc., for a total of four simple models.
Additionally, visualize each simple model with a scatter plot.
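Before plotting, it can help to look at what a fitted simple model contains. The following base R sketch fits `precip ~ relative_humidity` on a tiny hypothetical data frame (`toy`, with made-up values) and extracts the intercept, slope, and R-squared.

```r
# Hypothetical toy data (values are made up for illustration)
toy <- data.frame(
  relative_humidity = c(40, 55, 70, 85, 100),
  precip            = c(0.00, 0.01, 0.02, 0.05, 0.10)
)

# Fit a simple linear regression with one predictor
toy_model <- lm(precip ~ relative_humidity, data = toy)

# Intercept and slope of the fitted line
toy_coefs <- coef(toy_model)
toy_coefs

# Share of the variance in precip explained by the predictor
toy_r2 <- summary(toy_model)$r.squared
toy_r2
```

The slope is the estimated change in `precip` per one-unit increase in the predictor, which is the line that `geom_smooth(method = "lm")` draws on a scatter plot.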


In [None]:
# Load required libraries if not already loaded
library(ggplot2)
library(dplyr)

# Fit linear regression model
model1 <- lm(precip ~ relative_humidity, data = train_data)

# Scatter plot for model 1
ggplot(train_data, aes(x = relative_humidity, y = precip)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "blue") +
  labs(title = "Linear Regression: precip ~ relative_humidity",
       x = "Relative Humidity (%)", y = "Precipitation")


In [None]:
# Fit linear regression model
model2 <- lm(precip ~ dry_bulb_temp_f, data = train_data)

# Scatter plot for model 2
ggplot(train_data, aes(x = dry_bulb_temp_f, y = precip)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "green") +
  labs(title = "Linear Regression: precip ~ dry_bulb_temp_f",
       x = "Dry Bulb Temperature (F)", y = "Precipitation")


In [None]:
# Fit linear regression model
model3 <- lm(precip ~ wind_speed, data = train_data)

# Scatter plot for model 3
ggplot(train_data, aes(x = wind_speed, y = precip)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(title = "Linear Regression: precip ~ wind_speed",
       x = "Wind Speed (mph)", y = "Precipitation")


In [None]:
# Fit linear regression model
model4 <- lm(precip ~ station_pressure, data = train_data)

# Scatter plot for model 4
ggplot(train_data, aes(x = station_pressure, y = precip)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "purple") +
  labs(title = "Linear Regression: precip ~ station_pressure",
       x = "Station Pressure (in Hg)", y = "Precipitation")


<a id="cell9"></a>
## 9. Improve the Model
Now, try improving the simple models you created in the previous section. 

Create at least two more models. Each model should use at least one of the following techniques:
1. Add more features/predictors
2. Add regularization (L1, L2 or a mix)
3. Add a polynomial component

Also, for each of the models you create, check the model performance using the **training set** and a metric like MSE, RMSE, or R-squared.
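For reference, the three metrics can be computed by hand in base R. The sketch below uses hypothetical observed and predicted vectors (not model output from this project).

```r
# Hypothetical observed and predicted values (illustration only)
observed  <- c(1, 2, 3, 4)
predicted <- c(1.5, 1.5, 3.5, 3.5)

errors <- observed - predicted

# Mean squared error and its square root
mse  <- mean(errors^2)
rmse <- sqrt(mse)

# R-squared: 1 minus residual sum of squares over total sum of squares
r_squared <- 1 - sum(errors^2) / sum((observed - mean(observed))^2)

c(MSE = mse, RMSE = rmse, R2 = r_squared)
```

With real data that contains missing values, the rows with `NA` need to be dropped first (or `na.rm = TRUE` passed to `mean()` and `sum()`), otherwise every metric evaluates to `NA`.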

Consider using `tidymodels` if you choose to add regularization and tune lambda.
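A minimal sketch of tuning lambda with `tidymodels` might look like the following. It runs on simulated data (`toy`, `x1`, `x2`, and `y` are hypothetical names), assumes the `glmnet` package is installed as the engine, and uses an arbitrary penalty grid; treat it as one possible setup rather than the required solution.

```r
library(tidymodels)

# Simulated data: y depends on x1 only, x2 is pure noise (illustration only)
set.seed(1234)
toy <- tibble(x1 = rnorm(200), x2 = rnorm(200))
toy$y <- 2 * toy$x1 + rnorm(200, sd = 0.5)

# Lasso specification: mixture = 1 is L1, penalty (lambda) is left to tune
lasso_spec <- linear_reg(penalty = tune(), mixture = 1) %>%
  set_engine("glmnet")

lasso_wf <- workflow() %>%
  add_recipe(recipe(y ~ ., data = toy)) %>%
  add_model(lasso_spec)

# 5-fold cross-validation over an arbitrary grid of candidate penalties
folds <- vfold_cv(toy, v = 5)
penalty_grid <- tibble(penalty = 10^seq(-3, 0, length.out = 10))

tune_results <- tune_grid(
  lasso_wf,
  resamples = folds,
  grid = penalty_grid,
  metrics = metric_set(rmse)
)

# Pick the penalty with the lowest cross-validated RMSE and refit on all rows
best_penalty <- select_best(tune_results, metric = "rmse")
final_fit <- finalize_workflow(lasso_wf, best_penalty) %>%
  fit(data = toy)

best_penalty
```

Setting `mixture = 0` would give ridge (L2) regression, and values between 0 and 1 give an elastic net mix.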


In [None]:
# Load required libraries if not already loaded
library(tidymodels)
library(glmnet)
library(dplyr)

# Keep only the modeling columns and drop rows with missing values, so that
# every model is fit and scored on the same observations
model_cols <- c("precip", "relative_humidity", "dry_bulb_temp_f", "wind_speed", "station_pressure")
train_complete <- train_data %>%
  select(all_of(model_cols)) %>%
  tidyr::drop_na()

# Model 1: Multiple Predictors
formula1 <- precip ~ relative_humidity + dry_bulb_temp_f + wind_speed + station_pressure
model1_multiple <- lm(formula1, data = train_complete)

# Check model performance on training set
pred_model1 <- predict(model1_multiple, newdata = train_complete)
train_rmse <- sqrt(mean((train_complete$precip - pred_model1)^2))
train_rmse

# Model 2: Lasso Regularization (lambda chosen by cross-validation)
set.seed(1234)  # cv.glmnet assigns folds at random
x <- model.matrix(precip ~ ., data = train_complete)[, -1]
y <- train_complete$precip
cv_model2 <- cv.glmnet(x, y, alpha = 1)
best_lambda <- cv_model2$lambda.min
model2_lasso_best <- glmnet(x, y, alpha = 1, lambda = best_lambda)
train_pred_lasso <- as.numeric(predict(model2_lasso_best, s = best_lambda, newx = x))
train_rmse_lasso <- sqrt(mean((y - train_pred_lasso)^2))
train_rmse_lasso

# Model 3: Polynomial Component
model3_poly <- lm(precip ~ poly(dry_bulb_temp_f, degree = 2), data = train_complete)
pred_poly <- predict(model3_poly, newdata = train_complete)
train_rmse_poly <- sqrt(mean((train_complete$precip - pred_poly)^2))
train_rmse_poly


<a id="cell10"></a>
## 10. Find Best Model
Compare the regression metrics of each model from section 9 to find the best model overall. To do this, 

1. Evaluate the models on the **testing set** using at least one metric (like MSE, RMSE or R-squared).
2. After calculating the metrics on the testing set for each model, print them out as a table to easily compare. You can use something like:
```
model_names <- c("model_1", "model_2", "model_3")
train_error <- c("model_1_value", "model_2_value", "model_3_value")
test_error <- c("model_1_value", "model_2_value", "model_3_value")
comparison_df <- data.frame(model_names, train_error, test_error)
```
3. Finally, from the comparison table you create, conclude which model performed the best.
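As an illustration of steps 2 and 3, the sketch below fills the template with hypothetical numeric error values (the numbers are invented, not results from this project) and picks the model with the lowest test error.

```r
# Hypothetical error values (illustration only, not real results)
model_names <- c("model_1", "model_2", "model_3")
train_error <- c(0.040, 0.035, 0.042)
test_error  <- c(0.045, 0.038, 0.047)

comparison_df <- data.frame(model_names, train_error, test_error,
                            stringsAsFactors = FALSE)

# Sort by test error so the best model appears first
comparison_df[order(comparison_df$test_error), ]

# Name of the model with the smallest test error
best_model <- comparison_df$model_names[which.min(comparison_df$test_error)]
best_model
```

For error metrics such as MSE or RMSE, lower is better; if you compare R-squared instead, use `which.max()`.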


In [None]:
# Load required libraries if not already loaded
library(tidymodels)
library(glmnet)
library(dplyr)

# Keep only the modeling columns and drop rows with missing values, so that
# helper columns or NAs cannot leak into the design matrices
model_cols <- c("precip", "relative_humidity", "dry_bulb_temp_f", "wind_speed", "station_pressure")
train_complete <- train_data %>% select(all_of(model_cols)) %>% tidyr::drop_na()
test_complete <- test_data %>% select(all_of(model_cols)) %>% tidyr::drop_na()

# Fit Model 1: Multiple Predictors
formula1 <- precip ~ relative_humidity + dry_bulb_temp_f + wind_speed + station_pressure
model1_multiple <- lm(formula1, data = train_complete)

# Fit Model 2: Lasso Regularization (lambda chosen by cross-validation)
set.seed(1234)  # cv.glmnet assigns folds at random
x <- model.matrix(precip ~ ., data = train_complete)[, -1]
y <- train_complete$precip
cv_model2 <- cv.glmnet(x, y, alpha = 1)
best_lambda <- cv_model2$lambda.min
model2_lasso_best <- glmnet(x, y, alpha = 1, lambda = best_lambda)

# Fit Model 3: Polynomial Component
model3_poly <- lm(precip ~ poly(dry_bulb_temp_f, degree = 2), data = train_complete)

# Evaluate models on the testing set
# Model 1: Multiple Predictors
pred_model1 <- predict(model1_multiple, newdata = test_complete)
test_rmse_model1 <- sqrt(mean((test_complete$precip - pred_model1)^2))

# Model 2: Lasso Regularization
x_test <- model.matrix(precip ~ ., data = test_complete)[, -1]
test_pred_lasso <- as.numeric(predict(model2_lasso_best, s = best_lambda, newx = x_test))
test_rmse_model2 <- sqrt(mean((test_complete$precip - test_pred_lasso)^2))

# Model 3: Polynomial Component
pred_poly <- predict(model3_poly, newdata = test_complete)
test_rmse_model3 <- sqrt(mean((test_complete$precip - pred_poly)^2))

# Create comparison table (training errors were computed in section 9)
model_names <- c("Model 1 (Multiple Predictors)", "Model 2 (Lasso Regularization)", "Model 3 (Polynomial Component)")
train_error <- c(train_rmse, train_rmse_lasso, train_rmse_poly)
test_error <- c(test_rmse_model1, test_rmse_model2, test_rmse_model3)

comparison_df <- data.frame(model_names, train_error, test_error)

# Print comparison table
print(comparison_df)

# Conclude which model performed the best (lowest RMSE on the testing set)
best_model <- as.character(comparison_df$model_names[which.min(comparison_df$test_error)])
cat("\nThe best performing model on the testing set is:", best_model, "\n")

## Author(s)

<h4> Yiwen Li </h4>

## Contributions

<h4> Tiffany Zhu </h4>

## <h3 align="center"> © IBM Corporation 2021. All rights reserved. </h3>
