# Predicting Max Temperature using Machine Learning
*Junsoo Derek Shin*
<br>
*May 2018*
***
### Purpose
The purpose of this project is to predict the daily maximum temperatures in April 2018 using machine learning techniques.

The general guidance on the machine learning concepts, and the inspiration to apply them, came from these two sources:
1. https://www.kaggle.com/learn/machine-learning
2. http://stackabuse.com/using-machine-learning-to-predict-the-weather-part-2


### 1. Get and Load Weather Underground Data

The **`extract_weather_data()`** function requests historical data from the Weather Underground API and appends the returned JSON to a text file. A six-second delay between requests keeps the request rate within the limit of 10 requests per minute. There is also a daily limit of 500 requests, so the `days` argument should not be greater than 500. I have already run this function to gather data from January 1, 2014 to April 30, 2018, and the data is available in the `weatherdata.txt` file.

In [1]:
import requests
from datetime import datetime, timedelta
import time
import json

from collections import namedtuple
import pandas as pd

In [2]:
# get history JSON data from the Weather Underground API
# def extract_weather_data(api_key, base_url, target_date, days):
#     with open('weatherdata.txt', 'a') as outfile:
#         for _ in range(days):
#             request = base_url.format(api_key, target_date.strftime('%Y%m%d'))
#             response = requests.get(request)
#             if response.status_code == 200:
#                 data = response.json()["history"]["dailysummary"][0]
#                 json.dump(data, outfile)
#                 outfile.write('\n')
#             time.sleep(6)
#             target_date += timedelta(days=1)

The **`fill_dataframe()`** function reads the text file of JSON objects and creates a Pandas DataFrame from it. The `target_date` argument is the starting date of the weather text file and should be a `datetime` object like `datetime(2014, 1, 1)`. This date is incremented by one day for each line read, and it serves as the index of the DataFrame. A `namedtuple` is similar to a struct or a lightweight class: it lets us refer to fields by name, which makes the code more readable.

In [3]:
# from the text file of JSON objects, create a list of namedtuples, and use it
# to create a DataFrame
def fill_dataframe(target_date):
    features = ["date", "meantempm", "meandewptm", "meanpressurem", 
                "maxhumidity", "minhumidity", "maxtempm", "mintempm", 
                "maxdewptm", "mindewptm", "maxpressurem", "minpressurem", 
                "precipm"]
    DailySummary = namedtuple("DailySummary", features)
    records = []
    with open('weatherdata.txt', 'r') as f:
        for line in f:
            data = json.loads(line)
            records.append(DailySummary(
                date = target_date,
                meantempm = data["meantempm"],
                meandewptm = data["meandewptm"],
                meanpressurem = data["meanpressurem"],
                maxhumidity = data["maxhumidity"],
                minhumidity = data["minhumidity"],
                maxtempm = data["maxtempm"],
                mintempm = data["mintempm"],
                maxdewptm = data["maxdewptm"],
                mindewptm = data["mindewptm"],
                maxpressurem = data["maxpressurem"],
                minpressurem = data["minpressurem"],
                precipm = data["precipm"],
            ))
            target_date += timedelta(days=1)
    df = pd.DataFrame(records, columns=features).set_index('date')
    return df
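
As a small aside, here is a toy sketch of the `namedtuple`-to-DataFrame step on its own. The `Reading` type and the three string values are made up for illustration; the real `DailySummary` has thirteen fields filled from the JSON file.

```python
from collections import namedtuple
from datetime import datetime, timedelta

import pandas as pd

# Hypothetical two-field record; the real DailySummary has thirteen fields.
Reading = namedtuple("Reading", ["date", "maxtempm"])

start = datetime(2014, 1, 1)
# Made-up values, stored as strings the way the JSON stores its numbers.
raw_values = ["3", "5", "-1"]

records = [Reading(date=start + timedelta(days=i), maxtempm=value)
           for i, value in enumerate(raw_values)]

# Fields are accessible by name, which is what keeps the code readable.
first_maxtemp = records[0].maxtempm

# pandas treats each namedtuple as one row of the DataFrame.
toy_df = pd.DataFrame(records, columns=list(Reading._fields)).set_index("date")
print(toy_df)
```

Note that the values are still strings at this point, which is why a numeric conversion is needed afterwards.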

In [4]:
target_date = datetime(2014, 1, 1)
days = 365
# extract_weather_data(API_KEY, BASE_URL, target_date, days)

In [5]:
data = fill_dataframe(target_date)


All of the columns have the `object` dtype (that is, they hold strings). Let's convert them into numeric data, so that they are easier to work with.

In [None]:
data.dtypes

In [None]:
data = data.apply(pd.to_numeric, errors='ignore')

In [None]:
data.dtypes

In [None]:
data['precipm'].unique()

The `precipm` column was the only column that couldn't be converted into numbers, because it contains values such as `'T'`, which stands for "Trace", a very little amount of precipitation. Since "Trace" should be different from zero precipitation, I will assign an arbitrary value of 0.01 for now.

In [None]:
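# convert every column to numbers; values that cannot be parsed (such as 'T')
# become NaN
data = data.apply(pd.to_numeric, errors='coerce')
# treat every NaN in precipm as a trace amount
trace_rows = data['precipm'].isnull()
data.loc[trace_rows, 'precipm'] = 0.01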
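
One caveat with filling every `NaN`: if the file also contained genuinely missing readings (an assumption on my part; the only non-numeric value mentioned above is `'T'`), they would become 0.01 as well. A minimal sketch of a stricter variant, on made-up values, replaces the `'T'` marker explicitly and leaves anything else unparseable as `NaN`:

```python
import pandas as pd

TRACE_AMOUNT = 0.01  # same arbitrary stand-in for trace precipitation

# Made-up raw values: two numbers, a trace marker, and an empty string
# standing in for a hypothetical missing reading.
raw_precip = pd.Series(["0.0", "T", "2.5", ""])

# Replace the trace marker first, then convert whatever is left.
precip = pd.to_numeric(raw_precip.replace("T", str(TRACE_AMOUNT)),
                       errors="coerce")
print(precip)
```

Here only the `'T'` entry becomes 0.01, while the empty string stays `NaN` and can be handled separately.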

In [None]:
data.dtypes

In [None]:
data['precipm'].unique()

### 2. Add Features/Columns
What we want to predict is clear: it's `maxtempm`. However, we cannot train our model on measurements from the same date, because those measurements won't be available yet on the day we make a prediction. One way around this is to add columns of measurements from the previous days. For example, for `meanpressurem`, we would add the columns `meanpressurem_1`, `meanpressurem_2`, and `meanpressurem_3`, which hold the measurements from 1, 2, and 3 days prior.

So let's add the columns for the 1-, 2-, and 3-days-prior measurements for every column, including the temperature columns. The first few rows at the top need placeholders because there is no prior-day data for them.

In [None]:
# given a feature and the number of prior days N, add a column to the DataFrame
# holding that feature's value from N rows (days) earlier
def derive_nth_day_feature(df, feature, N):
    col_name = "{}_{}".format(feature, N)
    # shift(N) moves every value down N rows and leaves NaN in the first N rows
    df[col_name] = df[feature].shift(N)

In [None]:
for feature in data.columns:
    for N in range(1, 4):
        derive_nth_day_feature(data, feature, N)
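
To make the lagged columns concrete, here is a toy illustration with five made-up pressure readings, built with pandas' `Series.shift`. The numbers are invented; only the column naming follows the scheme above.

```python
import pandas as pd

# Made-up readings for five consecutive days.
toy = pd.DataFrame(
    {"meanpressurem": [1012, 1015, 1009, 1011, 1018]},
    index=pd.date_range("2014-01-01", periods=5, freq="D"),
)

for n in range(1, 4):
    # shift(n) moves each value down n rows, so the row for day t
    # holds the reading from day t - n.
    toy["meanpressurem_{}".format(n)] = toy["meanpressurem"].shift(n)

# Only rows with all three prior-day values present survive dropna().
complete_rows = toy.dropna()
print(toy)
```

The first three rows each lack at least one prior-day value, so only the last two rows of this toy frame are complete.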

In [None]:
data.columns

We don't need the same-day measurements of the other variables, because we won't have them for the days we are trying to predict. If we had those measurements, we would probably have the temperatures for those days as well. So let's drop those columns.

We also see that each new column is missing a few values, because the first rows have no N-days-prior measurements. Other than those rows, everything seems to be filled in. Although every row is valuable, I don't want to impute values that could be way off, so let's drop those first three rows and obtain non-null data all around.

In [None]:
features_to_drop = ['meandewptm', 'maxdewptm', 'mindewptm',
                    'meanpressurem', 'maxpressurem', 'minpressurem',
                    'maxhumidity', 'minhumidity',
                    'precipm']
data.drop(features_to_drop, axis=1, inplace=True)

In [None]:
data.info()

In [None]:
data.dropna(axis=0, inplace=True)

In [None]:
data.info()

Let's split the data into **`train_data`** and **`april_data`**, so that we can train our models with **`train_data`** and eventually test them with **`april_data`**.

In [None]:
april_data = data[data.index >= datetime(2018, 4, 1)]
train_data = data.drop(april_data.index, axis=0)
april_data.info()
train_data.info()

### 3. Linear Regression
Now we can build our first model using linear regression! Linear regression requires that the features we are using and the target variable we are trying to predict have linear relationships. One way to assess that is by calculating Pearson correlation coefficients. These values range from -1 to 1, and the values close to -1 and 1 mean that the features have strong linear relationships with the target variable, and values close to 0 mean that they have weak linear relationships with the target variable.

In [None]:
train_data.corr()[['maxtempm']].sort_values('maxtempm')
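
For reference, the Pearson coefficient is the covariance of two variables divided by the product of their standard deviations. The sketch below computes it by hand on made-up, perfectly linear data and compares it with pandas; `pearson_r` is a helper name I made up for illustration.

```python
import numpy as np
import pandas as pd

def pearson_r(x, y):
    """Covariance divided by the product of the standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_centered = x - x.mean()
    y_centered = y - y.mean()
    return (x_centered * y_centered).sum() / np.sqrt(
        (x_centered ** 2).sum() * (y_centered ** 2).sum()
    )

# Made-up values: y_up rises exactly linearly with x, y_down falls exactly linearly.
x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
y_up = 2 * x + 1
y_down = -3 * x + 10

r_up = pearson_r(x, y_up)      # close to +1
r_down = pearson_r(x, y_down)  # close to -1
r_pandas = x.corr(y_up)        # pandas uses Pearson by default
print(r_up, r_down, r_pandas)
```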

It looks like the features with temperature and dew point have linear relationships with max temperature.

In [None]:
linear_features = [feature for feature in train_data.columns
                   if 'temp' in feature or 'dewpt' in feature]

The first model will use linear regression. With five-fold cross-validation, we cut the data into five parts; in each round, the model is trained on four parts (80% of the data) and tested on the remaining part (20%). We rotate through the five training-testing pairs and take the mean of the five scores as the final score for this model.

In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression

X = train_data[linear_features].drop(['meantempm', 'maxtempm', 'mintempm'], axis=1)
y = train_data['maxtempm']

scores = -1 * cross_val_score(LinearRegression(),
                              X, y,
                              cv=5,
                              scoring='neg_mean_absolute_error')
scores.mean()
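
To show what the five-way rotation looks like, here is a NumPy-only sketch of the splitting step on ten toy rows. It assumes contiguous, unshuffled folds, which to my understanding is what `cv=5` produces for a regressor; `cross_val_score` handles all of this internally.

```python
import numpy as np

n_samples = 10  # toy size; the real training set is much larger
n_folds = 5

indices = np.arange(n_samples)
# Without shuffling, each fold is a contiguous block of rows.
folds = np.array_split(indices, n_folds)

splits = []
for k, test_idx in enumerate(folds):
    # Train on every fold except the k-th, then test on the k-th.
    train_idx = np.concatenate([fold for j, fold in enumerate(folds) if j != k])
    splits.append((train_idx, test_idx))

test_fraction = len(splits[0][1]) / n_samples
for train_idx, test_idx in splits:
    print("train:", train_idx, "test:", test_idx)
```

Every row lands in the test part exactly once, so each of the five scores is computed on data the model did not see in that round.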

In [None]:
from sklearn.metrics import mean_absolute_error

prev_maxtemp = X['maxtempm_1']
mean_absolute_error(y, prev_maxtemp)

Although the mean absolute error of the linear regression model is slightly better than simply using the previous day's max temperature, an error of 3.42 degrees Celsius (6.16 degrees Fahrenheit) is pretty high. Let's see if we can improve our predictions with another technique.
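
A quick note on the unit conversion: an error is a temperature difference, so only the 9/5 scale factor applies and the +32 offset does not.

```python
# Converting a temperature *difference* uses the scale factor only;
# the +32 offset applies to absolute temperatures, not to errors.
mae_celsius = 3.42                     # value quoted in the text above
mae_fahrenheit = mae_celsius * 9 / 5   # roughly 6.16
print(round(mae_fahrenheit, 2))
```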

### 4. Decision Tree, Random Forest and XGBoost

In [None]:
X = train_data.drop(['meantempm', 'maxtempm', 'mintempm'], axis=1)
y = train_data['maxtempm']

In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

for max_leaf_nodes in [5, 50, 500, 5000]:
    scores = -1 * cross_val_score(DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes),
                            X, y,
                            cv=5,
                            scoring='neg_mean_absolute_error')
    print(str(max_leaf_nodes), str(scores.mean()))

In [None]:
from sklearn.ensemble import RandomForestRegressor

scores = -1 * cross_val_score(RandomForestRegressor(n_estimators=50),
                              X, y,
                              cv=5,
                              scoring='neg_mean_absolute_error')
scores.mean()

In [None]:
from xgboost import XGBRegressor

scores = -1 * cross_val_score(XGBRegressor(n_estimators=1000, learning_rate=0.05),
                              X, y,
                              cv=5,
                              scoring='neg_mean_absolute_error')
scores.mean()

### 5. Predict April 2018

Of the models trained and tested on the training data, the linear regression model seems to be the most accurate, so we will use it to predict the daily maximum temperatures of April 2018. First, let's train the model on all of the data before April 2018 (not just 80% of it), since the April data will be our test data anyway. Then, using the April data, we will make predictions.

In [None]:
train_y = train_data['maxtempm']
train_X = train_data[linear_features].drop(['meantempm', 'maxtempm', 'mintempm'], axis=1)
test_y = april_data['maxtempm']
test_X = april_data[linear_features].drop(['meantempm', 'maxtempm', 'mintempm'], axis=1)

linear_model = LinearRegression()
linear_model.fit(train_X, train_y)

predictions = linear_model.predict(test_X)
print("Linear Regression model's mean absolute error: "+ str(mean_absolute_error(test_y, predictions)))

Let's create two naive benchmarks. The first benchmark will use previous-day measurements as its predictions. The second benchmark will use previous-year measurements as its predictions.

In [None]:
prev_day_benchmark = april_data['maxtempm_1']
prev_year_benchmark = train_data.loc[(train_data.index >= datetime(2017, 4, 1)) &
                                     (train_data.index <= datetime(2017, 4, 30)),
                                     'maxtempm']
print("prev_day MAE: " + str(mean_absolute_error(test_y, prev_day_benchmark)))
print("prev_year MAE: " + str(mean_absolute_error(test_y, prev_year_benchmark)))
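
For intuition, here is the previous-day (persistence) benchmark worked out by hand on one week of made-up temperatures. The values are invented, and the mean absolute error is computed directly with NumPy rather than with scikit-learn.

```python
import numpy as np
import pandas as pd

# Made-up daily maximum temperatures (degrees Celsius) for one week.
maxtemp = pd.Series(
    [10.0, 12.0, 9.0, 15.0, 14.0, 11.0, 13.0],
    index=pd.date_range("2018-04-01", periods=7, freq="D"),
)

# Persistence forecast: each day's prediction is the previous day's value.
persistence = maxtemp.shift(1)

# The first day has no previous value, so leave it out of the comparison.
valid = persistence.notna()
persistence_mae = np.abs(maxtemp[valid] - persistence[valid]).mean()
print(persistence_mae)
```

A model only adds value if its error is below this kind of baseline, which costs nothing to compute.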

April 2018 was a tough month for predicting maximum temperatures. They seem to have swung from day to day, and they differed even more from those of April 2017. Both benchmarks and our linear regression model had what I would call large errors. However, the linear regression model does do better than the naive benchmarks, so this is a decent start. After studying the machine learning algorithms further, I want to figure out what made the ones I used here struggle.

### 6. Time to See What Happened

Let's look at how my predictions and the actual maximum temperatures look on a scatterplot. If the predictions were good, the points would fall along a straight, diagonal line.

In [None]:
from matplotlib import pyplot as plt

plt.scatter(predictions, test_y)
plt.xlabel('predicted maxtempm')
plt.ylabel('actual maxtempm')
plt.show()

The plot looks more like two diagonal lines than the single straight line I wanted.

Let's check if there were any outliers in my training data that I missed.

In [None]:
features_used = [feature for feature in linear_features 
                                     if feature not in ['meantempm', 'maxtempm', 'mintempm']]
for feature in features_used:
    train_data[feature].hist()
    plt.xlabel(feature)
    plt.show()

There don't seem to be many outliers in our training data.

Admittedly, predicting the next day's maximum temperature from the temperature and dew point measurements of the 3 previous days was already difficult during training. Perhaps the model is underfit at the moment and needs to fit the data more closely, or perhaps it should have been trained on April data only.

In [None]:
my_preds = linear_model.predict(train_X)
print(str(mean_absolute_error(train_y, my_preds)))

In [None]:
plt.scatter(my_preds, train_y)
plt.show()