<h1>Part I: Prepare our weather training data for the models</h1>
<ul>
    <li>Extract the data for each city from each data set and combine it all into one dataset</li>
    <li>Look up each city's latitude and longitude and include them in the dataset as features for predicting the forecasts</li>
    <li>Prepare the labels so that they are offset by one hour</li>
</ul>
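The one-hour label offset from the last bullet can be sketched on a toy series. The readings below are made up for illustration. `Series.shift(-1)` moves every value up one row, so each row's label is the next hour's reading and the final row has no label.

```python
import pandas as pd

# hypothetical hourly humidity readings for a single city
toy = pd.DataFrame({"humidity": [76.0, 76.0, 77.0, 78.0]})

# the label for hour t is the reading at hour t + 1
toy["humidity_labels"] = toy["humidity"].shift(-1)

print(toy)
```

The last row ends up with a missing label, which is why rows with missing values have to be dropped before training.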

<h3>Step 1: Load data from csv files</h3>

In [1]:
import tensorflow as tf


tf.test.is_built_with_cuda()
tf.test.is_gpu_available(cuda_only=False, min_cuda_compute_capability=None)

Instructions for updating:
Use `tf.config.list_physical_devices('GPU')` instead.


True

In [2]:
import pandas as pd

def load_weather_data(csv_file):
    city_attrib_csv_path = "datasets/historical-hourly-weather-data/" + csv_file
    return pd.read_csv(city_attrib_csv_path)

In [3]:
city_attributes_csv = load_weather_data("city_attributes.csv")
humidity_csv = load_weather_data("humidity.csv")
pressure_csv = load_weather_data("pressure.csv")
temperature_csv = load_weather_data("temperature.csv")
weather_description_csv = load_weather_data("weather_description.csv")
wind_direction_csv = load_weather_data("wind_direction.csv")
wind_speed_csv = load_weather_data("wind_speed.csv")
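Judging from how the later cells index these frames, each weather file has a "datetime" column followed by one column per city. The snippet below is a minimal sketch of that layout with an in-memory file; the city names `CityA` and `CityB` and all values are hypothetical.

```python
import io
import pandas as pd

# hypothetical excerpt in the assumed wide layout:
# a "datetime" column followed by one column per city
toy_csv = io.StringIO(
    "datetime,CityA,CityB\n"
    "2012-10-01 12:00:00,,\n"
    "2012-10-01 13:00:00,76.0,81.0\n"
    "2012-10-01 14:00:00,76.0,80.0\n"
)
toy_humidity = pd.read_csv(toy_csv)

toy_columns = toy_humidity.columns.tolist()
# empty cells are read as NaN
missing_city_a = int(toy_humidity["CityA"].isna().sum())

print(toy_columns, missing_city_a)
```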

<h3>Step 2: Load the datasets for each city</h3>

In [4]:
import numpy as np

def load_city_dataset(city_name, model_labels_name):
    # split the "datetime" column into integer month, day and hour_of_day columns
    timestamps = pd.to_datetime(humidity_csv["datetime"])
    city_dataset = pd.DataFrame({
        "month": timestamps.dt.month,
        "day": timestamps.dt.day,
        "hour_of_day": timestamps.dt.hour,
    })

    # look up the city's latitude and longitude
    # assigning a scalar repeats the value on every row
    city_row = city_attributes_csv[city_attributes_csv["City"] == city_name].iloc[0]
    city_dataset["latitude"] = city_row["Latitude"]
    city_dataset["longitude"] = city_row["Longitude"]

    # assign weather data for the city to columns
    city_dataset["humidity"] = humidity_csv[city_name]
    city_dataset["pressure"] = pressure_csv[city_name]
    city_dataset["temperature"] = temperature_csv[city_name]
    city_dataset["weather_description"] = weather_description_csv[city_name]
    city_dataset["wind_direction"] = wind_direction_csv[city_name]
    city_dataset["wind_speed"] = wind_speed_csv[city_name]

    # create the label column: each label is the value of the chosen
    # attribute one hour later, so the last row has no label (NaN)
    city_dataset[model_labels_name + "_labels"] = city_dataset[model_labels_name].shift(-1)

    return city_dataset
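The datetime split can be checked on two toy timestamps. This is a sketch using the pandas `.dt` accessors; the timestamp strings are hypothetical and assume the "YYYY-MM-DD HH:MM:SS" format implied by the string splitting in this step.

```python
import pandas as pd

# hypothetical timestamps in the assumed "YYYY-MM-DD HH:MM:SS" format
toy = pd.DataFrame({"datetime": ["2012-10-01 12:00:00", "2012-11-29 23:00:00"]})

stamps = pd.to_datetime(toy["datetime"])
toy["month"] = stamps.dt.month
toy["day"] = stamps.dt.day
toy["hour_of_day"] = stamps.dt.hour

# the raw string column is no longer needed once it has been split
toy = toy.drop("datetime", axis=1)

print(toy)
```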

<h3>Step 3: Combine all of the cities into one dataset</h3>

In [5]:
def combine_all_cities(city_att_file, model_labels_name):
    # load the data for each city
    city_datasets = [load_city_dataset(city, model_labels_name)
                     for city in city_att_file["City"]]

    # stack the per-city dataframes into one dataset
    # ignore_index=True re-indexes the combined dataframe from 0
    full_dataset = pd.concat(city_datasets, ignore_index=True)

    return full_dataset
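Because the labels are built per city before the cities are stacked, a label never crosses from one city into the next. A minimal sketch with two made-up cities illustrates this; `toy_city` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd

def toy_city(values):
    # hypothetical stand-in for a per-city dataset with one-hour-ahead labels
    frame = pd.DataFrame({"humidity": values})
    frame["humidity_labels"] = frame["humidity"].shift(-1)
    return frame

# stack two toy cities and re-index the result from 0
combined = pd.concat([toy_city([76.0, 77.0]), toy_city([50.0, 55.0])],
                     ignore_index=True)

print(combined)
```

The last row of each city keeps a missing label rather than borrowing the first reading of the following city.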
        

In [46]:
full_humidity_dataset = combine_all_cities(city_attributes_csv, model_labels_name="humidity")


In [47]:
full_humidity_dataset

Unnamed: 0,month,day,hour_of_day,latitude,longitude,humidity,pressure,temperature,weather_description,wind_direction,wind_speed,humidity_labels
0,10,1,12,49.2497,-123.119,,,,,,,76
1,10,1,13,49.2497,-123.119,76.0,,284.630000,mist,0.0,0.0,76
2,10,1,14,49.2497,-123.119,76.0,,284.629041,broken clouds,6.0,0.0,76
3,10,1,15,49.2497,-123.119,76.0,,284.626998,broken clouds,20.0,0.0,77
4,10,1,16,49.2497,-123.119,77.0,,284.624955,broken clouds,34.0,0.0,78
...,...,...,...,...,...,...,...,...,...,...,...,...
1629103,11,29,20,31.769,35.2163,,,,,,,
1629104,11,29,21,31.769,35.2163,,,,,,,
1629105,11,29,22,31.769,35.2163,,,,,,,
1629106,11,29,23,31.769,35.2163,,,,,,,


In [48]:
# export dataset as csv for easy import on other projects
full_humidity_dataset.to_csv('weather_data_with_humidity_labels.csv', index=False)
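The `index=False` argument matters when the file is read back. The sketch below, using in-memory buffers and made-up values, compares the columns that come back with and without it.

```python
import io
import pandas as pd

toy = pd.DataFrame({"humidity": [76.0, 77.0]})

# default export: the row index is written as an unnamed first column
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)

# index=False: only the data columns are written
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)

cols_with = pd.read_csv(with_index).columns.tolist()
cols_without = pd.read_csv(without_index).columns.tolist()

print(cols_with, cols_without)
```

Without `index=False`, the reloaded frame gains an extra "Unnamed: 0" column that would otherwise be treated as a feature.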

In [6]:
full_pressure_dataset = combine_all_cities(city_attributes_csv, model_labels_name="pressure")


In [42]:
full_pressure_dataset.to_csv('weather_data_with_pressure_labels.csv', index=False)

In [52]:
full_temperature_dataset = combine_all_cities(city_attributes_csv, model_labels_name="temperature")


In [53]:
full_temperature_dataset.to_csv('weather_data_with_temperature_labels.csv', index=False)

In [13]:
full_description_dataset = combine_all_cities(city_attributes_csv, 
                                              model_labels_name="weather_description")


In [43]:
full_description_dataset.to_csv('weather_data_with_description_labels.csv', index=False)

In [19]:
full_wdirection_dataset = combine_all_cities(city_attributes_csv,
                                            model_labels_name="wind_direction")
full_wdirection_dataset.to_csv('weather_data_with_wdirection_labels.csv')


In [44]:
full_wdirection_dataset.to_csv('weather_data_with_wdirection_labels.csv', index=False)

In [20]:
full_wspeed_dataset = combine_all_cities(city_attributes_csv,
                                        model_labels_name="wind_speed")
full_wspeed_dataset.to_csv('weather_data_with_wspeed_labels.csv')


In [45]:
full_wspeed_dataset.to_csv('weather_data_with_wspeed_labels.csv', index=False)

<h3>Step 4: Clean the full dataset</h3>

In [6]:
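# load_weather_data prepends "datasets/historical-hourly-weather-data/",
# so the exported csv is expected to be in that folder
full_temperature_dataset = load_weather_data('weather_data_with_temperature_labels.csv')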

In [7]:
full_temperature_dataset = full_temperature_dataset.dropna()

In [8]:
full_temperature_dataset = full_temperature_dataset.reset_index(drop=True)

In [25]:
full_pressure_dataset = load_weather_data('weather_data_with_pressure_labels.csv')
full_pressure_dataset = full_pressure_dataset.dropna()
full_pressure_dataset = full_pressure_dataset.reset_index(drop=True)
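The cleaning step can be sketched on a toy frame with made-up pressure values. `dropna()` removes every row that has a missing value in any column, which covers both the hours before a city's first reading and the final hour that has no label.

```python
import numpy as np
import pandas as pd

# hypothetical values: the first row has no reading, the last row has no label
toy = pd.DataFrame({
    "pressure": [np.nan, 1012.0, 1013.0, 1011.0],
    "pressure_labels": [1012.0, 1013.0, 1011.0, np.nan],
})

# drop incomplete rows, then renumber the remaining rows from 0
cleaned = toy.dropna().reset_index(drop=True)

print(cleaned)
```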

<h3>Step 5: Transformer pipeline</h3>

In [26]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

numbers_pipe = Pipeline([
     ('std_scaler', StandardScaler()),
])

In [None]:
train_num = full_temperature_dataset.drop("weather_description", axis=1)
train_num = train_num.drop("temperature_labels", axis=1)

In [27]:
train_num = full_pressure_dataset.drop("weather_description", axis=1)
train_num = train_num.drop("pressure_labels", axis=1)

In [28]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

num_attribs = list(train_num)
cat_attribs = ["weather_description"]

full_pipeline = ColumnTransformer([
    ("num", numbers_pipe, num_attribs),
    ("cat", OneHotEncoder(), cat_attribs),
])
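To see what the combined transformer produces, here is a minimal sketch on a toy frame. The values and the "sky is clear" category are made up. Numeric columns are standardized and the text column is expanded into one column per distinct category, so the output width is the number of numeric columns plus the number of categories.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# hypothetical rows with two numeric features and one categorical feature
toy = pd.DataFrame({
    "humidity": [76.0, 77.0, 80.0, 60.0],
    "pressure": [1012.0, 1013.0, 1009.0, 1020.0],
    "weather_description": ["mist", "broken clouds", "broken clouds", "sky is clear"],
})

toy_pipeline = ColumnTransformer([
    ("num", Pipeline([("std_scaler", StandardScaler())]), ["humidity", "pressure"]),
    ("cat", OneHotEncoder(), ["weather_description"]),
])

transformed = toy_pipeline.fit_transform(toy)
# the result may be a sparse matrix or a dense array depending on its density
dense = transformed.toarray() if hasattr(transformed, "toarray") else transformed

print(dense.shape)
```

Here two numeric columns plus three categories give five output columns. By the same arithmetic, the 64 columns reported later in this notebook would be consistent with 10 numeric columns and 54 distinct weather descriptions.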

<h3>Step 6: Split the data into a training and testing set</h3>

In [11]:
temperature_labels = full_temperature_dataset[['temperature_labels']]
temperature_data = full_temperature_dataset.drop(columns=['temperature_labels'])

In [32]:
labels = full_pressure_dataset[['pressure_labels']]
data = full_pressure_dataset.drop(columns=['pressure_labels'])

In [33]:
full_pipeline.fit(data)

ColumnTransformer(n_jobs=None, remainder='drop', sparse_threshold=0.3,
                  transformer_weights=None,
                  transformers=[('num',
                                 Pipeline(memory=None,
                                          steps=[('std_scaler',
                                                  StandardScaler(copy=True,
                                                                 with_mean=True,
                                                                 with_std=True))],
                                          verbose=False),
                                 ['month', 'day', 'hour_of_day', 'latitude',
                                  'longitude', 'humidity', 'pressure',
                                  'temperature', 'wind_direction',
                                  'wind_speed']),
                                ('cat',
                                 OneHotEncoder(categories='auto', drop=None,
                                              

In [34]:
from sklearn.model_selection import train_test_split

full_training_data, test_data, full_training_labels, test_labels = train_test_split(
    data, labels)
training_data, valid_data, training_labels, valid_labels = train_test_split(
    full_training_data, full_training_labels)
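With no `test_size` given, `train_test_split` holds out 25% of the rows. The sketch below reproduces the second split on plain row numbers, using the training and validation counts that the training log in this notebook reports (896413 and 298805).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# rows left after the first split has removed the test set,
# taken from the sample counts reported in the training log
rows_after_first_split = np.arange(896413 + 298805)

# default split: 25% of the rows go to the second (validation) part
train_rows, valid_rows = train_test_split(rows_after_first_split)

print(len(train_rows), len(valid_rows))
```

Applying the default twice leaves roughly 56% of the cleaned rows for training, 19% for validation and 25% for testing.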

In [35]:
training_data = full_pipeline.transform(training_data).toarray()
training_labels = training_labels.to_numpy()

valid_data = full_pipeline.transform(valid_data).toarray()
valid_labels = valid_labels.to_numpy()

test_data = full_pipeline.transform(test_data).toarray()
test_labels = test_labels.to_numpy()
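The notebook fits the pipeline on the full dataset before splitting. A common alternative, shown here only as a sketch on made-up data, is to fit the transformer on the training split alone so that the scaling statistics never see the test rows. The name `leak_free_pipeline` is hypothetical, and `handle_unknown="ignore"` is an added assumption to cope with categories that only appear outside the training split.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# hypothetical data: one numeric and one categorical feature
toy = pd.DataFrame({
    "humidity": [76.0, 77.0, 80.0, 60.0, 65.0, 90.0, 55.0, 70.0],
    "weather_description": ["mist", "broken clouds"] * 4,
})

train_part, test_part = train_test_split(toy, random_state=0)

leak_free_pipeline = ColumnTransformer([
    ("num", StandardScaler(), ["humidity"]),
    # categories unseen during fit are encoded as all zeros
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["weather_description"]),
])

# statistics and categories are learned from the training rows only
train_matrix = leak_free_pipeline.fit_transform(train_part)
test_matrix = leak_free_pipeline.transform(test_part)

print(train_matrix.shape, test_matrix.shape)
```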

In [36]:
valid_data.shape

(298805, 64)

<h3>Step 7: Start training models</h3>

In [37]:
import tensorflow as tf
from tensorflow import keras

def build_weather_model(n_hidden=1, n_neurons=30, input_shape=[64]):
    model = keras.models.Sequential()
    model.add(keras.layers.InputLayer(input_shape=input_shape))
    for layer in range(n_hidden):
        model.add(keras.layers.Dense(n_neurons, activation="relu"))
    model.add(keras.layers.Dense(1))
    model.compile(loss="mse", optimizer="adam")
    return model
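To get a feel for the size of the networks this function builds, the number of trainable parameters can be worked out by hand: a dense layer has one weight per input-output pair plus one bias per output. The helper `count_dense_parameters` below is hypothetical and only mirrors the layer layout of `build_weather_model`.

```python
def count_dense_parameters(n_inputs, n_hidden, n_neurons):
    """Count weights and biases for n_hidden dense layers plus a 1-unit output."""
    total = 0
    previous_width = n_inputs
    for _ in range(n_hidden):
        # weights (previous_width x n_neurons) plus one bias per neuron
        total += previous_width * n_neurons + n_neurons
        previous_width = n_neurons
    # output layer: one weight per incoming neuron plus a single bias
    total += previous_width + 1
    return total

default_size = count_dense_parameters(n_inputs=64, n_hidden=1, n_neurons=30)
deep_size = count_dense_parameters(n_inputs=64, n_hidden=10, n_neurons=100)

print(default_size, deep_size)
```

The default configuration has 1981 parameters, while ten hidden layers of 100 neurons on 64 inputs come to 97501.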


In [295]:
# the model predicts a continuous value with an mse loss, so wrap it as a regressor
model = keras.wrappers.scikit_learn.KerasRegressor(build_fn=build_weather_model, epochs=100, batch_size=128)

In [44]:
import tensorflow as tf
from tensorflow import keras

model = build_weather_model(n_hidden=10, n_neurons=100, input_shape=training_data.shape[1:])
weather_model = model.fit(training_data, training_labels, epochs=200,
                   batch_size=128, validation_data=(valid_data, valid_labels),
                   callbacks=[keras.callbacks.EarlyStopping(patience=20)])

Train on 896413 samples, validate on 298805 samples
Epoch 1/200
...
Epoch 61/200
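Training stopped at epoch 61 of 200. Assuming that the `EarlyStopping` callback ended the run and that it used its default settings apart from `patience=20`, the best validation loss would have occurred around epoch 41. The function below is a simplified sketch of the patience rule, not the Keras implementation, and the loss curve is made up.

```python
def early_stopping_epoch(val_losses, patience):
    """Return the 1-based epoch at which patience-based early stopping halts."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            # improvement: remember it and reset the counter
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return len(val_losses)

# hypothetical curve: improves for 41 epochs, then never improves again
toy_losses = [1.0 / epoch for epoch in range(1, 42)] + [1.0] * 159

stopped_at = early_stopping_epoch(toy_losses, patience=20)
print(stopped_at)
```

Because `restore_best_weights` defaults to `False`, the model keeps the weights from the final epoch rather than those from the best one.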


In [45]:
# temperature
# n = 300, hidden = 5, loss = 1.3554
# n = 500, hidden = 4, loss = 1.4676

# pressure
# n = 500, hidden = 5, loss = 33.8083
# n = 250 performed slightly better, probably around 30.0
# n = 250, hidden = 10, loss = 33.8063
# n = 100, hidden = 10, loss = 36.1238
mse_test = model.evaluate(test_data, test_labels)
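Since the labels are not scaled, the MSE values noted in the comments are in squared label units, and their square roots are easier to interpret. The sketch below assumes that the first two comment lines refer to the temperature model (as the saved file name suggests), that temperatures are in kelvin, and that pressures are in hPa.

```python
import math

# losses copied from the comments above; the grouping by target is an assumption
reported_mse = {
    "temperature (n = 300, hidden = 5)": 1.3554,
    "pressure (n = 250, hidden = 10)": 33.8063,
}

# root mean squared error is in the same units as the labels
rmse = {name: math.sqrt(mse) for name, mse in reported_mse.items()}

for name, value in rmse.items():
    print(f"{name}: RMSE = {value:.3f}")
```

Under those assumptions the typical error is a little over 1 K for temperature and close to 6 hPa for pressure.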



In [19]:
model.save("temp_model.1.3554.h5")

In [196]:
X_new = temp_test_data[:3]
y_pred = model.predict(X_new)

In [197]:
y_pred

array([[278.3897 ],
       [287.9649 ],
       [295.60535]], dtype=float32)

In [198]:
temp_test_labels[:3]

array([[279.17 ],
       [286.218],
       [296.13 ]])
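The three predictions can be compared with their labels directly. This sketch copies the printed values from the two outputs above and computes the absolute error for each sample.

```python
import numpy as np

# values copied from the prediction and label outputs above
y_pred = np.array([[278.3897], [287.9649], [295.60535]])
y_true = np.array([[279.17], [286.218], [296.13]])

# absolute error per sample and the mean over the three samples
abs_errors = np.abs(y_pred - y_true).ravel()
mean_abs_error = abs_errors.mean()

print(abs_errors, mean_abs_error)
```

The errors are roughly 0.78, 1.75 and 0.52, about 1.0 on average.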