# Data Exploration Notebook

In this notebook, I explore the dataset to get to know it better and to extract the information I need to build a weather-predicting ML model.

Before analyzing the dataset, let's import the required libraries. You can install them using the ```REQUIREMENTS.txt``` file in the GitHub repository that contains this notebook.

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn

## Introduction

The dataset used in this project is a weather dataset found on kaggle.com on 13 February 2021. You can find it by following [this link](https://www.kaggle.com/selfishgene/historical-hourly-weather-data). The data is licensed under the [ODbL](https://opendatacommons.org/licenses/odbl/).

The data was collected between 2012 and 2017 and includes hourly weather measurements from 30 cities in the United States and Canada, as well as 6 cities in Israel.

The dataset includes the following attributes:
* humidity in %
* pressure in hPa
* temperature in Kelvin (K)
* wind_direction in meteorological degrees (°)
* wind_speed in m/s
* weather_description in text
* city_attributes in text

The attributes are stored in separate csv files, which are located inside the folder ./weather_data_raw

In the model, we first focus on predicting the temperature by letting the model learn which temperature to expect for a given date. This means we feed in the date as our feature and hopefully get a temperature value (label) in return. Later on, I might expand this project to incorporate more of the components needed to predict the weather, but time will tell (I will update this page when the rest of the project changes).


## Importing the files

Now it is time to open the csv files as pandas DataFrames and display them in table form. Displaying the data as tables and plots in this notebook makes it easier to read than the comma-separated csv text files with thousands and thousands of lines. We won't display the complete dataset in a table, because the files have so many entries.


In [3]:
path = "./weather_data_raw/"

humidity_raw = pd.read_csv(path + 'humidity.csv', index_col = 0)
pressure_raw = pd.read_csv(path + 'pressure.csv', index_col = 0)
temperature_raw = pd.read_csv(path + 'temperature.csv', index_col = 0)
wind_direction_raw = pd.read_csv(path + 'wind_direction.csv', index_col = 0)
wind_speed_raw = pd.read_csv(path + 'wind_speed.csv', index_col = 0)
weather_description_raw = pd.read_csv(path + 'weather_description.csv', index_col = 0)
city_attributes_raw = pd.read_csv(path + 'city_attributes.csv', index_col = 0)
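
The cell above loads the files but does not display them yet. As a minimal sketch, the snippet below uses a tiny in-memory csv with made-up values (not taken from the dataset) to show how `head()` and `shape` give a quick preview without printing every row. The column layout, a `datetime` column followed by one column per city, is an assumption about the raw files.

```python
import io

import pandas as pd

# Made-up sample standing in for one of the raw files (values are illustrative only).
sample_csv = io.StringIO(
    "datetime,Vancouver,Portland\n"
    "2012-10-01 13:00:00,284.63,282.08\n"
    "2012-10-01 14:00:00,284.62,282.08\n"
    "2012-10-01 15:00:00,284.62,282.10\n"
)

# Same call pattern as in the cell above: the first column becomes the index.
sample_raw = pd.read_csv(sample_csv, index_col=0)

preview = sample_raw.head(2)       # first two rows only
n_rows, n_cols = sample_raw.shape  # size of the full table

print(preview)
print(f"{n_rows} rows x {n_cols} columns")
```

Calling `head()` on the real DataFrames, for example `temperature_raw.head()`, would preview them in the same way.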

In [4]:
cities = city_attributes_raw.index.values

cities[0]

'Vancouver'

## Splitting the data

Each file we just imported includes the data for all the cities. It would make our lives easier to split each file into a dictionary that holds one pandas DataFrame per city, so we can look up a city using the ```cityname``` key of the dictionary. Additionally, we rename each column from the city name to the quantity it holds (temperature etc.).

In [25]:
humidity = {}
pressure = {}
temperature = {}
wind_direction = {}
wind_speed = {}
weather_description = {}
city_attributes = {}

for city in cities:
    humidity[city] = pd.DataFrame(humidity_raw[city]).rename(columns={city: 'humidity'})
    pressure[city] = pd.DataFrame(pressure_raw[city]).rename(columns={city: 'pressure'})
    temperature[city] = pd.DataFrame(temperature_raw[city]).rename(columns={city: 'temperature'})
    wind_direction[city] = pd.DataFrame(wind_direction_raw[city]).rename(columns={city: 'wind_direction'})
    wind_speed[city] = pd.DataFrame(wind_speed_raw[city]).rename(columns={city: 'wind_speed'})
    weather_description[city] = weather_description_raw[city]
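
As an aside, the same split can be written more compactly. The sketch below is one possible alternative, not the approach used in this notebook: it uses `Series.to_frame(name=...)` inside a dictionary comprehension on a small made-up frame. The helper name `split_by_city` is hypothetical.

```python
import pandas as pd


def split_by_city(raw, cities, column_name):
    """Return {city: one-column DataFrame} with the column named column_name."""
    return {city: raw[city].to_frame(name=column_name) for city in cities}


# Made-up sample data, for illustration only.
sample_raw = pd.DataFrame(
    {"Vancouver": [76.0, 76.0], "Portland": [81.0, 80.0]},
    index=["2012-10-01 13:00:00", "2012-10-01 14:00:00"],
)

sample_humidity = split_by_city(sample_raw, ["Vancouver", "Portland"], "humidity")
print(sample_humidity["Vancouver"])
```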

## Recombining the dataset

To make it easier to train the model, we create new .csv files and pandas DataFrames, where each file includes all the meteorological data for one specific city, such as Vancouver. This makes it easy to train the model on a single city first, which helps to reduce computation time.

The new csv file will have the following layout:

Date | temperature | humidity | pressure | wind_speed | wind_direction
------ | ------ | ------ | ------ | ------ | ------
Unix timestamp | Temperature in Celsius | Humidity in % | Pressure in hPa | Wind speed in m/s | Wind direction in meteorological degrees

Because our dataset stores the temperature in Kelvin and we want Celsius for our calculations, we convert it using the following formula: ```Celsius [°C] = Kelvin [K] - 273.15```

In addition, we store all the DataFrames in a Python dictionary, so we can access each city by its name.

On disk, we store the new csv files using the following naming scheme: *\[city_name\]_weather_data_dataset.csv*

To keep the file names uniform, we also convert the city name to lowercase and replace its spaces with -.

Additionally, a json file listing all the cities will be stored on disk and called *cities.json*.

Knowing this, we can later access the data by importing the cities list and loading each listed city's file using the naming scheme above.
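
The notebook cells shown here do not write *cities.json* or read the files back, so the following is only a sketch of how that round trip could look. The helper names `city_filename`, `save_all` and `load_all` are hypothetical, the file-name pattern is copied from the writing cell further down, and a temporary directory stands in for the project folder.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd


def city_filename(city):
    """Lowercase the city name and replace spaces with '-' (pattern assumed from the writing cell)."""
    return city.lower().replace(" ", "-") + "_weather_data_dataset.csv"


def save_all(dataset, folder):
    """Write one csv per city plus a cities.json listing the city names."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    for city, frame in dataset.items():
        frame.to_csv(folder / city_filename(city), encoding="utf-8")
    with open(folder / "cities.json", "w", encoding="utf-8") as f:
        json.dump(list(dataset), f)


def load_all(folder):
    """Read cities.json, then load each listed city's csv into a dictionary."""
    folder = Path(folder)
    with open(folder / "cities.json", encoding="utf-8") as f:
        cities = json.load(f)
    return {city: pd.read_csv(folder / city_filename(city), index_col=0) for city in cities}


# Made-up sample data, for illustration only.
sample = {
    "New York": pd.DataFrame(
        {"temperature": [11.5, 12.0]},
        index=["2012-10-01 13:00:00", "2012-10-01 14:00:00"],
    ),
}

with tempfile.TemporaryDirectory() as tmp:
    save_all(sample, tmp)
    written = sorted(p.name for p in Path(tmp).iterdir())
    restored = load_all(tmp)

print(written)
print(restored["New York"])
```

Keeping the file-name logic in a single helper means the writing and reading sides cannot drift apart.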

In [26]:
# convert data to Celsius [°C]
for city in cities:
    # vectorized subtraction over the whole column; avoids element-wise chained assignment
    temperature[city]['temperature'] = temperature[city]['temperature'] - 273.15

In [35]:
dataset = {}
for city in cities:
    dataset[city] = pd.concat([temperature[city], humidity[city], pressure[city], wind_speed[city], wind_direction[city]], axis=1)
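
The layout table lists the date as a Unix timestamp, while the cells shown here leave the index as it was read from the csv files. As a hedged sketch, the snippet below shows how `pd.concat(..., axis=1)` lines up columns on a shared index and one way to convert a datetime index to Unix seconds. The sample values are made up, and treating the timestamps as UTC is an assumption.

```python
import pandas as pd

# Made-up sample data, for illustration only.
index = ["2012-10-01 13:00:00", "2012-10-01 14:00:00"]
sample_temperature = pd.DataFrame({"temperature": [11.48, 11.47]}, index=index)
sample_humidity = pd.DataFrame({"humidity": [76.0, 77.0]}, index=index)

# axis=1 places the frames side by side, matching rows by index label.
combined = pd.concat([sample_temperature, sample_humidity], axis=1)

# Parse the string index as UTC datetimes (assumption) and convert
# to whole seconds since 1970-01-01 00:00:00 UTC.
timestamps = pd.to_datetime(combined.index, utc=True)
epoch = pd.Timestamp("1970-01-01", tz="UTC")
combined.index = (timestamps - epoch) // pd.Timedelta(seconds=1)
combined.index.name = "Date"

print(combined)
```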

## Writing to .csv file

Now that we have rearranged the data into a more usable format, let's write the dataset to disk. To do so, we run the code below, which stores the data in the weather_data folder of this project directory.

In [36]:
for city in dataset:
    filename = "weather_data/" + city.lower().replace(" ", "-") + "_weather_data_dataset.csv"
    dataset[city].to_csv(filename, encoding='utf-8')