# Predicting the maximum temperature for tomorrow

This project explores climate and temperature data to build a model that predicts the next day's maximum temperature in Davis, CA. The data was collected from an online platform: https://www.visualcrossing.com

I chose Davis as the city for this model, but data for any city or region can be downloaded and used in the same way.

### State the Question and Determine required data

The question we are going to explore is: what will tomorrow's maximum temperature be? We will build the model on approximately one year of the city's temperature data.

## Roadmap 

Before jumping into the project, we need to make a roadmap of what we are expecting from this project. Here is a brief guide to keep us on track. The following steps form the basis for the machine learning workflow once we have a problem and model in mind:
1. State the question and determine required data
2. Acquire the data in an accessible format
3. Identify and correct missing data points/anomalies as required
4. Prepare the data for the machine learning model
5. Establish a baseline model that you aim to exceed
6. Train the model on the training data
7. Make predictions on the test data
8. Compare predictions to the known test set targets and calculate performance metrics
9. If performance is not satisfactory, adjust the model, acquire more data, or try a different modeling technique
10. Interpret model and report results visually and numerically


### Importing the necessary libraries

In [2]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

### Importing the dataset

After saving the CSV file of temperature data locally, we read it with Python and use it as the dataset for this project.

In [8]:
# read_csv already returns a DataFrame, so no extra conversion is needed
df = pd.read_csv('Davis weather report.csv')
df.describe()

| | tempmax | tempmin | temp | feelslikemax | feelslikemin | feelslike | dew | humidity | precip | precipprob | ... | windspeed | winddir | sealevelpressure | cloudcover | visibility | solarradiation | solarenergy | uvindex | severerisk | moonphase |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 397.0 | 397.0 | 397.0 | 397.0 | 397.0 | 397.0 | 397.0 | 397.0 | 397.0 | 306.0 | ... | 397.0 | 397.0 | 397.0 | 397.0 | 397.0 | 397.0 | 397.0 | 397.0 | 113.0 | 397.0 |
| mean | 24.240302 | 9.566751 | 16.379345 | 23.580856 | 9.1267 | 16.043073 | 6.781864 | 60.370227 | 0.729673 | 15.03268 | ... | 19.639295 | 189.09471 | 1015.797985 | 20.057683 | 14.660202 | 219.96801 | 16.409068 | 6.783375 | 10.176991 | 0.507834 |
| std | 8.344292 | 4.982814 | 6.16985 | 7.695052 | 5.3666 | 6.170964 | 4.59535 | 19.223851 | 6.229447 | 35.797686 | ... | 8.383926 | 54.029475 | 5.560739 | 26.346453 | 2.919583 | 137.798003 | 9.104151 | 2.458427 | 1.881442 | 0.310301 |
| min | 5.9 | -0.8 | 3.4 | 5.6 | -4.1 | 2.8 | -7.8 | 16.6 | 0.0 | 0.0 | ... | 2.3 | 50.6 | 1003.5 | 0.0 | 1.5 | 6.2 | 0.5 | 0.0 | 10.0 | 0.0 |
| 25% | 17.8 | 5.7 | 11.6 | 17.8 | 4.8 | 11.4 | 4.2 | 45.73 | 0.0 | 0.0 | ... | 13.4 | 161.9 | 1011.7 | 1.2 | 15.1 | 95.4 | 7.8 | 5.0 | 10.0 | 0.25 |
| 50% | 23.9 | 9.1 | 16.0 | 23.9 | 9.0 | 16.0 | 7.1 | 57.4 | 0.0 | 0.0 | ... | 18.6 | 182.8 | 1014.9 | 7.1 | 15.9 | 230.6 | 18.4 | 7.0 | 10.0 | 0.5 |
| 75% | 31.1 | 13.8 | 21.7 | 29.3 | 13.8 | 21.4 | 10.6 | 72.9 | 0.0 | 0.0 | ... | 24.1 | 210.9 | 1020.3 | 30.3 | 16.0 | 308.3 | 24.6 | 9.0 | 10.0 | 0.76 |
| max | 42.5 | 20.6 | 30.7 | 40.9 | 20.6 | 29.9 | 15.0 | 99.5 | 109.77 | 100.0 | ... | 48.1 | 343.0 | 1030.4 | 100.0 | 25.9 | 545.0 | 31.7 | 10.0 | 30.0 | 1.0 |


### Acquire the data in an accessible format

We know that Davis, CA is a city where it never snows, so we can leave out the columns that hold snow data. Similarly, there are other columns we do not need, so we keep only the data that is relevant to us.

In [9]:
# Take a copy so that new columns can be added later without a SettingWithCopyWarning
df = df[['datetime','tempmax', 'tempmin', 'temp', 'feelslike', 'dew', 'humidity', 'precip']].copy()
df.describe()

| | tempmax | tempmin | temp | feelslike | dew | humidity | precip |
|---|---|---|---|---|---|---|---|
| count | 397.0 | 397.0 | 397.0 | 397.0 | 397.0 | 397.0 | 397.0 |
| mean | 24.240302 | 9.566751 | 16.379345 | 16.043073 | 6.781864 | 60.370227 | 0.729673 |
| std | 8.344292 | 4.982814 | 6.16985 | 6.170964 | 4.59535 | 19.223851 | 6.229447 |
| min | 5.9 | -0.8 | 3.4 | 2.8 | -7.8 | 16.6 | 0.0 |
| 25% | 17.8 | 5.7 | 11.6 | 11.4 | 4.2 | 45.73 | 0.0 |
| 50% | 23.9 | 9.1 | 16.0 | 16.0 | 7.1 | 57.4 | 0.0 |
| 75% | 31.1 | 13.8 | 21.7 | 21.4 | 10.6 | 72.9 | 0.0 |
| max | 42.5 | 20.6 | 30.7 | 29.9 | 15.0 | 99.5 | 109.77 |


Notice that the date is stored in a single "YYYY-MM-DD" column, so we would like to split it into separate year, month, and day columns. Also, the columns we kept have no missing data points, anomalies, or incorrect information, so we do not need to correct anything other than the date format.

In [10]:
# Parse the date strings once, then split them into numeric columns
dates = pd.to_datetime(df['datetime'])
df['year'] = dates.dt.year
df['month'] = dates.dt.month
df['day'] = dates.dt.day

# The original date column is no longer needed
df = df.drop('datetime', axis = 1)

In [11]:
df

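| | tempmax | tempmin | temp | feelslike | dew | humidity | precip | year | month | day |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 28.1 | 7.1 | 18.1 | 17.8 | 2.7 | 42.50 | 0.0 | 2021 | 4 | 1 |
| 1 | 27.9 | 6.6 | 17.2 | 17.0 | 0.8 | 40.81 | 0.0 | 2021 | 4 | 2 |
| 2 | 21.9 | 6.3 | 13.3 | 12.7 | 4.1 | 56.63 | 0.0 | 2021 | 4 | 3 |
| 3 | 19.9 | 5.3 | 12.4 | 11.7 | 5.5 | 65.50 | 0.0 | 2021 | 4 | 4 |
| 4 | 20.3 | 8.1 | 13.2 | 12.3 | 6.8 | 67.30 | 0.0 | 2021 | 4 | 5 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 392 | 21.4 | 7.1 | 13.9 | 13.6 | 3.8 | 55.00 | 0.0 | 2022 | 4 | 28 |
| 393 | 25.5 | 8.6 | 16.5 | 16.2 | 1.1 | 42.60 | 0.0 | 2022 | 4 | 29 |
| 394 | 26.1 | 8.5 | 17.2 | 17.2 | 7.6 | 57.20 | 0.0 | 2022 | 4 | 30 |
| 395 | 27.8 | 8.1 | 17.6 | 17.2 | 2.8 | 47.10 | 0.0 | 2022 | 5 | 1 |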


Now the date is split into day, month, and year columns, as we wanted. We are ready to proceed to the next steps.
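The earlier statement that the retained columns have no missing values can be checked directly. In the first `describe()` output, `precipprob` and `severerisk` have counts below 397, so the raw file does contain gaps, although not in the columns kept here. The following is a minimal sketch on a small made-up frame (the name `sample` and its values are assumptions for illustration, with one gap inserted on purpose); the same two calls could be applied to `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the weather table; the values are illustrative
# and one 'precip' entry is deliberately missing.
sample = pd.DataFrame({
    'tempmax': [28.1, 27.9, 21.9, 19.9],
    'tempmin': [7.1, 6.6, 6.3, 5.3],
    'precip': [0.0, np.nan, 0.0, 0.0],
})

# Count missing values per column; all zeros would mean nothing is missing.
missing_per_column = sample.isna().sum()
print(missing_per_column)

# A single yes/no answer for the whole frame.
has_missing = bool(sample.isna().any().any())
print('Any missing values:', has_missing)
```

If every count is zero for the real data, no imputation or row dropping is needed before modeling.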

### Prepare the data for the machine learning model

We are ready to prepare the data for our machine learning model. We need to separate the features (the inputs) from the target (the value we want to predict).

In [13]:
# y are the values we want to predict
y = np.array(df['tempmax'])

# Remove the y column from the dataframe
df = df.drop('tempmax', axis = 1)

# Save the feature (column) names for later use
df_list = list(df.columns)

# Convert to numpy array
X = np.array(df)

We pass the data to the Random Forest algorithm as NumPy arrays, which is why we converted the dataframe into NumPy arrays for X and y.
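The question posed at the start concerns tomorrow's maximum temperature, while `y` above holds each row's own `tempmax`. One possible way to frame a next-day target, shown only as a sketch and not as what this notebook does, is to shift the column up by one row. The column name `tempmax_next` and the variables `sample`, `X_next`, and `y_next` are introduced here for illustration; the values are the first five rows of the table shown earlier, and the rows are assumed to be in date order.

```python
import pandas as pd

# First five days from the table above (two columns only, for brevity).
sample = pd.DataFrame({
    'tempmax': [28.1, 27.9, 21.9, 19.9, 20.3],
    'tempmin': [7.1, 6.6, 6.3, 5.3, 8.1],
})

# 'tempmax_next' (a hypothetical column name) holds the following day's maximum.
sample['tempmax_next'] = sample['tempmax'].shift(-1)

# The last day has no "tomorrow" in the data, so it is dropped.
sample = sample.dropna(subset=['tempmax_next'])

# Features are today's measurements; the target is tomorrow's maximum.
X_next = sample[['tempmax', 'tempmin']].to_numpy()
y_next = sample['tempmax_next'].to_numpy()
print(X_next.shape, y_next.shape)
```

With this framing, today's `tempmax` becomes a legitimate feature, because it is already known when tomorrow's value is predicted.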

### Splitting the dataset into Training set and Test set

In [21]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 42)

Let's look at the shape of all the data to make sure we did everything correctly.

In [22]:
print('Training X Shape:', X_train.shape)
print('Training y Shape:', y_train.shape)
print('Test X Shape:', X_test.shape)
print('Test y Shape:', y_test.shape)

Training X Shape: (297, 9)
Training y Shape: (297,)
Test X Shape: (100, 9)
Test y Shape: (100,)


The row counts of X_train and y_train match, as do those of X_test and y_test, so the split worked as expected.

### Establish Baseline


Before we start making predictions, we need to set a baseline for our model: a simple reference prediction that the model has to beat. If our model cannot improve upon the baseline, then it is a failure and we should try a different model or admit that machine learning is not right for our problem.

For our baseline we use the day's average temperature (the `temp` column) as the prediction for the max temperature. In other words, our baseline is the error we would get if we simply predicted each day's average temperature as its maximum.

In [23]:
# The baseline predictions are each test day's average temperature (the 'temp' column)
baseline_preds = X_test[:, df_list.index('temp')]

# Baseline errors, and display average baseline error
baseline_errors = abs(baseline_preds - y_test)

print('Average Baseline Error:', round(np.mean(baseline_errors),2))

Average Baseline Error: 8.09
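An alternative baseline predicts the average max temperature of the training days for every test day. The sketch below illustrates that idea on made-up numbers; the arrays `y_train_demo` and `y_test_demo` are hypothetical stand-ins for `y_train` and `y_test`, and nothing here reproduces the 8.09 figure above.

```python
import numpy as np

# Hypothetical stand-ins for y_train and y_test (max temperatures in degrees).
y_train_demo = np.array([20.0, 25.0, 30.0, 25.0])
y_test_demo = np.array([22.0, 31.0])

# Predict the training-set mean of the max temperature for every test day.
mean_max = y_train_demo.mean()
baseline_preds_demo = np.full_like(y_test_demo, mean_max)

# Mean absolute error of this constant prediction.
baseline_mae_demo = np.mean(np.abs(baseline_preds_demo - y_test_demo))
print('Mean-max baseline error:', round(baseline_mae_demo, 2))
```

Using only the training rows to compute the mean keeps the baseline from seeing the test targets.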


### Training the model 

In [24]:
from sklearn.ensemble import RandomForestRegressor
regressor = RandomForestRegressor(n_estimators = 1000,
                                 random_state = 42) # n_estimators is the number of decision trees
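# Caution: this fits on the full dataset, including the rows in X_test.
# Fitting on X_train, y_train instead would keep the test set unseen.
regressor.fit(X, y)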

RandomForestRegressor(n_estimators=1000, random_state=42)
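Note that the cell above calls `fit` on `X` and `y`, which contain the test rows as well as the training rows. The sketch below, on synthetic data rather than the Davis dataset, compares a forest fit on the training rows only with one fit on all rows. All names (`X_demo`, `model_train_only`, `model_all_rows`, and so on) are assumptions introduced for illustration, and the forests use fewer trees to keep the example fast.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic data (not the Davis dataset): a noisy linear relationship.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(200, 3))
y_demo = 2.0 * X_demo[:, 0] + rng.normal(0, 1.0, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42)

# Model A sees only the training rows.
model_train_only = RandomForestRegressor(n_estimators=50, random_state=42)
model_train_only.fit(X_tr, y_tr)

# Model B is fit on every row, so it has already seen the test rows.
model_all_rows = RandomForestRegressor(n_estimators=50, random_state=42)
model_all_rows.fit(X_demo, y_demo)

# Evaluate both on the same held-out rows.
mae_train_only = np.mean(np.abs(model_train_only.predict(X_te) - y_te))
mae_all_rows = np.mean(np.abs(model_all_rows.predict(X_te) - y_te))
print('MAE, fit on training rows only:', round(mae_train_only, 2))
print('MAE, fit on all rows:', round(mae_all_rows, 2))
```

The second error is typically lower because the model is scored on rows it was trained on, which is why a test-set error measured that way tends to be optimistic.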

### Predicting the Test set Results

Now that we have built the model, we are ready to make predictions and evaluate it.

In [32]:
y_pred = regressor.predict(X_test)
np.set_printoptions(precision=2)

# Show predictions (left) next to actual values (right) for the first five test days.
# NumPy arrays have no .head() method, so slice the array instead.
print(np.concatenate((y_pred.reshape(len(y_pred),1), y_test.reshape(len(y_test),1)),1)[:5])

[[36.69 36.7 ]
 [14.37 14.  ]
 [18.21 18.  ]
 [30.43 30.1 ]
 [32.38 32.  ]]

Above, the left column holds the predictions and the right column holds the test points (the actual max temp for each day). The predicted values are very close to the actual max temperatures, which suggests that our model fits this data well.

In [34]:
# Calculate the absolute Errors
errors = abs(y_pred - y_test)

# Print out the mean absolute error (MAE)
print('Mean Absolute Error:', round(np.mean(errors),2), 'degrees.')

Mean Absolute Error: 0.39 degrees.


Our predictions are off by 0.39 degrees on average, an improvement of about 7.7 degrees over the baseline error of 8.09. That is a suspiciously large gain for a model like this. Part of the reason is that the baseline is weak: it uses the same day's average temperature as the prediction, which sits well below the daily maximum. Something like the average historical max temp would have been a more sensible baseline. In addition, the regressor was fit on the full `X` and `y`, which include the test rows, so the test error is optimistic. Anyhow, we continue with the next steps.

In [None]:
# Calculate mean absolute percentage error (MAPE)
mape = 100 * (errors / y_test)

# Calculate and display accuracy
accuracy = 100 - np.mean(mape)
print('Accuracy:', round(accuracy, 2), '%')


In [None]:
from sklearn.metrics import r2_score
r2_score(y_test, y_pred)
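To make the three metrics concrete, here is a small worked example on made-up numbers, written with plain NumPy. The arrays `y_true_demo` and `y_pred_demo` are hypothetical. MAPE divides by the true values, so it is undefined when a true value is zero; the minimum `tempmax` in this dataset is 5.9, so that does not arise here.

```python
import numpy as np

# Hypothetical true and predicted max temperatures.
y_true_demo = np.array([20.0, 25.0, 40.0])
y_pred_demo = np.array([22.0, 24.0, 40.0])

# Mean absolute error: average size of the miss, in degrees.
abs_errors = np.abs(y_pred_demo - y_true_demo)
mae_demo = abs_errors.mean()

# Mean absolute percentage error: average miss relative to the true value.
mape_demo = 100 * np.mean(abs_errors / np.abs(y_true_demo))

# R squared: 1 minus (residual sum of squares / total sum of squares).
ss_res = np.sum((y_true_demo - y_pred_demo) ** 2)
ss_tot = np.sum((y_true_demo - y_true_demo.mean()) ** 2)
r2_demo = 1 - ss_res / ss_tot

print('MAE:', round(mae_demo, 2))
print('MAPE:', round(mape_demo, 2), '%')
print('R squared:', round(r2_demo, 4))
```

The R squared formula used here is the same definition that `r2_score` implements, so both should agree on the same inputs.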

Although we are getting very high accuracy, this is not a good model, because the features we are using are not good predictors for the stated question. The inputs include the same day's `tempmin`, `temp`, and `feelslike`, which are closely tied to that day's max temp, so the model is biased toward producing a very high accuracy and does not really forecast the next day. Nonetheless, I was able to implement the Random Forest algorithm, train a model on the dataset, and make predictions with it.
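One way to see which inputs a forest leans on is its `feature_importances_` attribute, paired with the column names (the role `df_list` was saved for). The sketch below uses synthetic data in which only the first column drives the target; the feature names, `X_demo`, `y_demo`, and `forest` are assumptions for illustration and say nothing about the actual importances in the Davis model.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: the target depends strongly on the first column only.
rng = np.random.default_rng(1)
feature_names = ['temp', 'humidity', 'day']  # illustrative names
X_demo = rng.uniform(0, 1, size=(300, 3))
y_demo = 10.0 * X_demo[:, 0] + rng.normal(0, 0.1, size=300)

forest = RandomForestRegressor(n_estimators=50, random_state=42)
forest.fit(X_demo, y_demo)

# Pair each feature name with its importance and sort, largest first.
ranked = sorted(zip(feature_names, forest.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
for name, importance in ranked:
    print(f'{name}: {importance:.3f}')
```

For the notebook's own model, the equivalent pairing would be `zip(df_list, regressor.feature_importances_)`.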