# Cleaning the data

We perform a preliminary cleaning of the data here; further cleaning and analysis for specific models are done in separate notebooks. We obtained historical weather data for Montreal from OpenWeatherMap, covering January 1st, 1979 to July 31st, 2020.

The collected features are: 

- <code> city_name </code> City name
- <code> lat </code> Geographical coordinates of the location (latitude)
- <code> lon </code> Geographical coordinates of the location (longitude)
- <code> main </code>
    - <code> main.temp </code> Temperature
    - <code> main.feels_like </code> Temperature adjusted for the human perception of weather
    - <code> main.pressure </code> Atmospheric pressure (on the sea level), hPa
    - <code> main.humidity </code> Humidity, %
    - <code> main.temp_min </code> Minimum temperature at the moment. This is a deviation from the current temperature that can occur in large, geographically spread-out cities and megalopolises (this parameter is optional).
    - <code> main.temp_max </code> Maximum temperature at the moment. This is a deviation from the current temperature that can occur in large, geographically spread-out cities and megalopolises (this parameter is optional).
- <code> wind </code>
    - <code> wind.speed </code> Wind speed. Unit Default: meter/sec
    - <code> wind.deg </code> Wind direction, degrees (meteorological)
- <code> clouds </code>
    - <code> clouds.all </code> Cloudiness, %
- <code> rain </code>
    - <code> rain.1h </code> Rain volume for the last hour, mm
    - <code> rain.3h </code> Rain volume for the last 3 hours, mm
- <code> snow </code>
    - <code> snow.1h </code> Snow volume for the last hour, mm (in liquid state)
    - <code> snow.3h </code> Snow volume for the last 3 hours, mm (in liquid state)
- <code> weather </code> 
    - <code> weather.id </code> Weather condition id
    - <code> weather.main </code> Group of weather parameters (Rain, Snow, Extreme etc.)
    - <code> weather.description </code> Weather condition within the group
    - <code> weather.icon </code> Weather icon id
- <code> dt </code> Time of data calculation, unix, UTC
- <code> dt_iso </code> Date and time in UTC format
- <code> timezone </code> Shift in seconds from UTC

The explanation for the weather condition id and icon id can be found here: https://openweathermap.org/weather-conditions

We import the useful libraries. 

In [1]:
import numpy as np
import pandas as pd
from datetime import datetime, timedelta
import matplotlib.pyplot as plt

We read the CSV file.

In [2]:
df_data = pd.read_csv('weather_data_montreal.csv')

print('Data Shape = {}'.format(df_data.shape))
print(df_data.columns)

Data Shape = (373025, 25)
Index(['dt', 'dt_iso', 'timezone', 'city_name', 'lat', 'lon', 'temp',
       'feels_like', 'temp_min', 'temp_max', 'pressure', 'sea_level',
       'grnd_level', 'humidity', 'wind_speed', 'wind_deg', 'rain_1h',
       'rain_3h', 'snow_1h', 'snow_3h', 'clouds_all', 'weather_id',
       'weather_main', 'weather_description', 'weather_icon'],
      dtype='object')


In [3]:
pd.set_option('display.max_columns', 999)
print(df_data.info())
df_data.sample(5)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 373025 entries, 0 to 373024
Data columns (total 25 columns):
dt                     373025 non-null int64
dt_iso                 373025 non-null object
timezone               373025 non-null int64
city_name              373025 non-null object
lat                    373025 non-null float64
lon                    373025 non-null float64
temp                   373025 non-null float64
feels_like             373025 non-null float64
temp_min               373025 non-null float64
temp_max               373025 non-null float64
pressure               373025 non-null int64
sea_level              0 non-null float64
grnd_level             0 non-null float64
humidity               373025 non-null int64
wind_speed             373025 non-null float64
wind_deg               373025 non-null int64
rain_1h                36787 non-null float64
rain_3h                1622 non-null float64
snow_1h                11761 non-null float64
snow_3h               

,dt,dt_iso,timezone,city_name,lat,lon,temp,feels_like,temp_min,temp_max,pressure,sea_level,grnd_level,humidity,wind_speed,wind_deg,rain_1h,rain_3h,snow_1h,snow_3h,clouds_all,weather_id,weather_main,weather_description,weather_icon
94435,608936400,1989-04-18 21:00:00 +0000 UTC,-14400,Montreal,45.501689,-73.567256,278.5,271.7,277.15,281.539,1015,,,64,6.7,240,,,,,75,803,Clouds,broken clouds,04d
306595,1357718400,2013-01-09 08:00:00 +0000 UTC,-18000,Montreal,45.501689,-73.567256,268.84,265.14,267.15,270.428,1027,,,92,1.5,120,,,,,40,802,Clouds,scattered clouds,03n
253239,1166094000,2006-12-14 11:00:00 +0000 UTC,-18000,Montreal,45.501689,-73.567256,277.68,271.43,276.586,278.15,1012,,,75,6.2,230,,,,,90,804,Clouds,overcast clouds,04n
72083,531622800,1986-11-06 01:00:00 +0000 UTC,-18000,Montreal,45.501689,-73.567256,270.59,265.04,269.142,271.15,1022,,,58,3.6,50,,,,,20,801,Clouds,few clouds,02n
326462,1429228800,2015-04-17 00:00:00 +0000 UTC,-14400,Montreal,45.501689,-73.567256,286.41,283.66,285.542,287.403,1020,,,37,0.86,132,,,,,100,804,Clouds,overcast clouds,04n


We convert 'dt_iso' to a datetime format. We then remove 'dt' and 'timezone', since we keep all dates and times in UTC, and use 'dt_iso' as the index.

In [4]:
# Strip the fixed ' +0000 UTC' suffix (including its leading space), then parse.
df_data['dt_iso'] = df_data['dt_iso'].str.replace(' +0000 UTC', '', regex=False)
df_data['dt_iso'] = pd.to_datetime(df_data['dt_iso'], format='%Y-%m-%d %H:%M:%S')
print(df_data.head())
print(df_data.info())

          dt              dt_iso  timezone city_name        lat        lon  \
0  283996800 1979-01-01 00:00:00    -18000  Montreal  45.501689 -73.567256   
1  284000400 1979-01-01 01:00:00    -18000  Montreal  45.501689 -73.567256   
2  284004000 1979-01-01 02:00:00    -18000  Montreal  45.501689 -73.567256   
3  284007600 1979-01-01 03:00:00    -18000  Montreal  45.501689 -73.567256   
4  284011200 1979-01-01 04:00:00    -18000  Montreal  45.501689 -73.567256   

     temp  feels_like  temp_min  temp_max  pressure  sea_level  grnd_level  \
0  275.12      269.76   274.736   275.443      1025        NaN         NaN   
1  275.08      271.53   274.774   275.305      1023        NaN         NaN   
2  275.06      271.16   274.762   275.217      1022        NaN         NaN   
3  275.97      267.30   275.150   276.952      1021        NaN         NaN   
4  276.32      267.88   276.150   276.862      1019        NaN         NaN   

   humidity  wind_speed  wind_deg  rain_1h  rain_3h  snow_1h  

In [5]:
df_data = df_data.drop(columns = ['dt', 'timezone'])
df_data = df_data.set_index('dt_iso')
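
Since 'dt' holds the same instant as 'dt_iso' in Unix time, the conversion can be cross-checked before 'dt' is dropped. The sketch below is only an illustration on a toy frame (the name `sample` is hypothetical; the three rows are copied from the first lines of the dataset shown above). It decodes 'dt' with `unit='s'` and compares the result with the parsed 'dt_iso' strings.

```python
import pandas as pd

# Toy rows copied from the first lines of the dataset shown above.
sample = pd.DataFrame({
    'dt': [283996800, 284000400, 284004000],
    'dt_iso': ['1979-01-01 00:00:00 +0000 UTC',
               '1979-01-01 01:00:00 +0000 UTC',
               '1979-01-01 02:00:00 +0000 UTC'],
})

# Decode the Unix timestamps (seconds since the epoch, UTC).
from_unix = pd.to_datetime(sample['dt'], unit='s')

# Parse the text column after stripping the fixed ' +0000 UTC' suffix.
from_iso = pd.to_datetime(
    sample['dt_iso'].str.replace(' +0000 UTC', '', regex=False),
    format='%Y-%m-%d %H:%M:%S',
)

# Both routes should yield the same (timezone-naive, UTC) timestamps.
timestamps_agree = bool((from_unix == from_iso).all())
print(timestamps_agree)
```

If the two columns ever disagreed on the full dataset, that would be a reason to inspect the affected rows before discarding 'dt'.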

Both 'sea_level' and 'grnd_level' contain no values, so we drop those columns. The columns 'city_name', 'lat', and 'lon' are also irrelevant, as they never change. The 'weather_description' and 'weather_icon' columns contain the same information as 'weather_id', so we keep only 'weather_id'. The 'weather_main' column does not add information beyond 'weather_id' either, but we keep it for now: it groups the conditions into a small number of categories, whereas 'weather_id' could have too many categories to be a practical categorical feature. Finally, 'temp_min' and 'temp_max' are deviations from 'temp', and we will not use them in our models.

In [6]:
df_data = df_data.drop(columns = ['sea_level', 'grnd_level', 'city_name', 'lat', 'lon', 'weather_description', 'weather_icon', 'temp_min', 'temp_max'])
print(df_data.info())

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 373025 entries, 1979-01-01 00:00:00 to 2020-07-31 23:00:00
Data columns (total 13 columns):
temp            373025 non-null float64
feels_like      373025 non-null float64
pressure        373025 non-null int64
humidity        373025 non-null int64
wind_speed      373025 non-null float64
wind_deg        373025 non-null int64
rain_1h         36787 non-null float64
rain_3h         1622 non-null float64
snow_1h         11761 non-null float64
snow_3h         938 non-null float64
clouds_all      373025 non-null int64
weather_id      373025 non-null int64
weather_main    373025 non-null object
dtypes: float64(7), int64(5), object(1)
memory usage: 39.8+ MB
None
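
The choice of columns to drop can also be checked programmatically rather than by reading the `info()` output. The following is a minimal sketch on a toy frame (the name `toy` and its values are illustrative assumptions, not the real data): it lists the columns that are entirely missing and those that hold a single distinct value.

```python
import numpy as np
import pandas as pd

# Toy frame mimicking a few columns of the raw data (illustrative values).
toy = pd.DataFrame({
    'city_name': ['Montreal'] * 4,
    'lat': [45.501689] * 4,
    'sea_level': [np.nan] * 4,
    'temp': [275.12, 275.08, 275.06, 275.97],
})

# Columns with no observed value at all.
empty_cols = toy.columns[toy.isna().all()].tolist()

# Columns with exactly one distinct non-missing value.
constant_cols = toy.columns[toy.nunique() == 1].tolist()

print(empty_cols)
print(constant_cols)
```

Applied to the full frame before the drop, the same two expressions would be expected to flag 'sea_level' and 'grnd_level' as empty and 'city_name', 'lat', and 'lon' as constant.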


In [7]:
df_data.head()

dt_iso,temp,feels_like,pressure,humidity,wind_speed,wind_deg,rain_1h,rain_3h,snow_1h,snow_3h,clouds_all,weather_id,weather_main
1979-01-01 00:00:00,275.12,269.76,1025,80,4.6,140,,,,,90,300,Drizzle
1979-01-01 01:00:00,275.08,271.53,1023,80,2.0,90,,,,,90,600,Snow
1979-01-01 02:00:00,275.06,271.16,1022,80,2.5,120,,,,,90,804,Clouds
1979-01-01 03:00:00,275.97,267.3,1021,86,9.7,160,,,,,90,804,Clouds
1979-01-01 04:00:00,276.32,267.88,1019,93,9.7,160,0.5,,,,90,500,Rain


We get the possible values for 'weather_main' and 'weather_id'.

In [8]:
print('The possible values for "weather_main" are:', df_data.weather_main.unique(), '\n')
print('The possible values for "weather_id" are:', df_data.weather_id.unique())

The possible values for "weather_main" are: ['Drizzle' 'Snow' 'Clouds' 'Rain' 'Fog' 'Clear' 'Haze' 'Mist'
 'Thunderstorm' 'Dust' 'Smoke'] 

The possible values for "weather_id" are: [300 600 804 500 741 501 520 801 803 800 620 721 601 802 602 701 521 201
 211 502 301 731 621 711 612 511 522 321 202 302 503 200]


We see that there are 11 categories in 'weather_main' and 32 categories in 'weather_id'.
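
The earlier remark that 'weather_main' carries no information beyond 'weather_id' can be tested directly. The sketch below uses a toy frame (the name `toy` is hypothetical; the id and label pairs are taken from the rows displayed above) and counts how many distinct 'weather_main' labels appear for each 'weather_id'.

```python
import pandas as pd

# Toy frame with id/label pairs taken from the rows displayed above.
toy = pd.DataFrame({
    'weather_id': [300, 600, 804, 804, 500],
    'weather_main': ['Drizzle', 'Snow', 'Clouds', 'Clouds', 'Rain'],
})

# Number of distinct weather_main labels seen for each weather_id.
labels_per_id = toy.groupby('weather_id')['weather_main'].nunique()

# True when every weather_id maps to a single weather_main label.
id_determines_main = bool((labels_per_id == 1).all())

# Reverse direction: how many distinct ids fall under each weather_main group.
ids_per_main = toy.groupby('weather_main')['weather_id'].nunique()

print(id_determines_main)
print(ids_per_main)
```

On the full dataset, `ids_per_main` would show how the 32 ids are distributed over the 11 groups.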

We write the data to a CSV file, which we will clean further for the different models.

In [26]:
df_data.to_csv("weather_data_initial_clean.csv", header=True, index=True)
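
When the file is read back in another notebook, the datetime index is not restored automatically. The sketch below shows one way to do it, using an in-memory buffer and a toy frame (both `toy` and `buffer` are illustrative stand-ins for the real frame and file): `index_col='dt_iso'` together with `parse_dates=True` rebuilds the `DatetimeIndex`.

```python
import io
import pandas as pd

# Toy frame with a datetime index named 'dt_iso', as in the cleaned data.
toy = pd.DataFrame(
    {'temp': [275.12, 275.08], 'weather_main': ['Drizzle', 'Snow']},
    index=pd.to_datetime(['1979-01-01 00:00:00', '1979-01-01 01:00:00']),
)
toy.index.name = 'dt_iso'

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer, header=True, index=True)
buffer.seek(0)

# Read back, using 'dt_iso' as the index and parsing it as datetimes.
restored = pd.read_csv(buffer, index_col='dt_iso', parse_dates=True)
index_is_datetime = isinstance(restored.index, pd.DatetimeIndex)
print(index_is_datetime)
```

Without `parse_dates`, the index would come back as plain strings and time-based slicing or resampling would not work as expected.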