# Temperature prediction project

### Project scope

The project involves building a machine learning algorithm to predict the weather for the city of Szeged, Hungary.<br>
The data were recorded on every full hour from 2006-01-01 00:00:00 to 2016-12-31 23:00:00.

In [1]:
# Imports

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns
from pylab import rcParams
rcParams['figure.figsize'] = 10, 8

import sklearn

import warnings
warnings.filterwarnings('ignore')

In [2]:
# load dataset
df = pd.read_csv("weatherHistory.csv")
df.head()

| | Formatted Date | Summary | Precip Type | Temperature (C) | Apparent Temperature (C) | Humidity | Wind Speed (km/h) | Wind Bearing (degrees) | Visibility (km) | Loud Cover | Pressure (millibars) | Daily Summary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2006-04-01 00:00:00.000 +0200 | Partly Cloudy | rain | 9.472222 | 7.388889 | 0.89 | 14.1197 | 251.0 | 15.8263 | 0.0 | 1015.13 | Partly cloudy throughout the day. |
| 1 | 2006-04-01 01:00:00.000 +0200 | Partly Cloudy | rain | 9.355556 | 7.227778 | 0.86 | 14.2646 | 259.0 | 15.8263 | 0.0 | 1015.63 | Partly cloudy throughout the day. |
| 2 | 2006-04-01 02:00:00.000 +0200 | Mostly Cloudy | rain | 9.377778 | 9.377778 | 0.89 | 3.9284 | 204.0 | 14.9569 | 0.0 | 1015.94 | Partly cloudy throughout the day. |
| 3 | 2006-04-01 03:00:00.000 +0200 | Partly Cloudy | rain | 8.288889 | 5.944444 | 0.83 | 14.1036 | 269.0 | 15.8263 | 0.0 | 1016.41 | Partly cloudy throughout the day. |
| 4 | 2006-04-01 04:00:00.000 +0200 | Mostly Cloudy | rain | 8.755556 | 6.977778 | 0.83 | 11.0446 | 259.0 | 15.8263 | 0.0 | 1016.51 | Partly cloudy throughout the day. |


In [3]:
# shape of dataframe
df_shape = df.shape
print(f"The weatherHistory dataset has {df_shape[1]} features and {df_shape[0]} observations.")

The weatherHistory dataset has 12 features and 96453 observations.


In [4]:
# data frame information about features
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 96453 entries, 0 to 96452
Data columns (total 12 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Formatted Date            96453 non-null  object 
 1   Summary                   96453 non-null  object 
 2   Precip Type               95936 non-null  object 
 3   Temperature (C)           96453 non-null  float64
 4   Apparent Temperature (C)  96453 non-null  float64
 5   Humidity                  96453 non-null  float64
 6   Wind Speed (km/h)         96453 non-null  float64
 7   Wind Bearing (degrees)    96453 non-null  float64
 8   Visibility (km)           96453 non-null  float64
 9   Loud Cover                96453 non-null  float64
 10  Pressure (millibars)      96453 non-null  float64
 11  Daily Summary             96453 non-null  object 
dtypes: float64(8), object(4)
memory usage: 8.8+ MB


In [5]:
pd.DataFrame(df.isna().mean()).T

| | Formatted Date | Summary | Precip Type | Temperature (C) | Apparent Temperature (C) | Humidity | Wind Speed (km/h) | Wind Bearing (degrees) | Visibility (km) | Loud Cover | Pressure (millibars) | Daily Summary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.00536 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


In [6]:
# description for numerical features
df.describe()

| | Temperature (C) | Apparent Temperature (C) | Humidity | Wind Speed (km/h) | Wind Bearing (degrees) | Visibility (km) | Loud Cover | Pressure (millibars) |
|---|---|---|---|---|---|---|---|---|
| count | 96453.0 | 96453.0 | 96453.0 | 96453.0 | 96453.0 | 96453.0 | 96453.0 | 96453.0 |
| mean | 11.932678 | 10.855029 | 0.734899 | 10.81064 | 187.509232 | 10.347325 | 0.0 | 1003.235956 |
| std | 9.551546 | 10.696847 | 0.195473 | 6.913571 | 107.383428 | 4.192123 | 0.0 | 116.969906 |
| min | -21.822222 | -27.716667 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 4.688889 | 2.311111 | 0.6 | 5.8282 | 116.0 | 8.3398 | 0.0 | 1011.9 |
| 50% | 12.0 | 12.0 | 0.78 | 9.9659 | 180.0 | 10.0464 | 0.0 | 1016.45 |
| 75% | 18.838889 | 18.838889 | 0.89 | 14.1358 | 290.0 | 14.812 | 0.0 | 1021.09 |
| max | 39.905556 | 39.344444 | 1.0 | 63.8526 | 359.0 | 16.1 | 0.0 | 1046.38 |


#### Conclusions

The data frame has NaN values only in the 'Precip Type' feature, where they make up about 0.54% (5.4 per mille) of all observations. <br> <br>

Categorical columns: Formatted Date, Summary, Precip Type, Daily Summary <br>
Numerical columns: Temperature (C), Apparent Temperature (C), Humidity, Wind Speed (km/h), Wind Bearing (degrees), Visibility (km), Loud Cover, Pressure (millibars)

The Loud Cover column contains only zeros, so it carries no information for the model.
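
The share of missing values reported in the conclusions above comes from `df.isna().mean()`, which returns a fraction per column. A minimal sketch of the conversion to percent and per mille, using a small made-up frame instead of `weatherHistory.csv` (the `toy` frame and its values are assumptions for illustration only):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real data: 1 missing value out of 8 rows.
toy = pd.DataFrame({
    "Precip Type": ["rain", "snow", np.nan, "rain", "rain", "snow", "rain", "rain"],
    "Loud Cover": [0.0] * 8,
})

# Fraction of missing values per column, then scaled to percent and per mille.
missing_fraction = toy.isna().mean()
missing_percent = missing_fraction * 100
missing_per_mille = missing_fraction * 1000

print(missing_percent["Precip Type"])    # 12.5
print(missing_per_mille["Precip Type"])  # 125.0
```

Applied to the real data, a fraction of 0.00536 corresponds to roughly 0.54%, or 5.4 per mille.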

## Feature engineering

### Replacing missing values in the 'Precip Type' column


In [7]:
# value counts for Precip Type
df["Precip Type"].value_counts(dropna=False)

rain    85224
snow    10712
NaN       517
Name: Precip Type, dtype: int64

In [8]:
# create a dataframe of the rows with a missing Precip Type and display 10 random samples
df_precip_null = df.loc[df["Precip Type"].isnull()]
df_precip_null.sample(10)

| | Formatted Date | Summary | Precip Type | Temperature (C) | Apparent Temperature (C) | Humidity | Wind Speed (km/h) | Wind Bearing (degrees) | Visibility (km) | Loud Cover | Pressure (millibars) | Daily Summary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 95204 | 2016-10-18 00:00:00.000 +0200 | Mostly Cloudy | NaN | 8.461111 | 5.727778 | 0.76 | 17.4685 | 148.0 | 16.1 | 0.0 | 1028.79 | Mostly cloudy throughout the day. |
| 95473 | 2016-10-28 05:00:00.000 +0200 | Clear | NaN | 1.527778 | -0.677778 | 0.85 | 7.2128 | 20.0 | 0.0 | 0.0 | 1032.82 | Clear throughout the day. |
| 52819 | 2012-04-17 19:00:00.000 +0200 | Overcast | NaN | 7.777778 | 5.905556 | 0.9 | 10.4489 | 336.0 | 10.3523 | 0.0 | 1006.72 | Light rain in the morning and afternoon. |
| 95357 | 2016-10-23 09:00:00.000 +0200 | Foggy | NaN | 6.2 | 4.977778 | 0.9 | 6.4561 | 135.0 | 1.7549 | 0.0 | 1020.75 | Foggy starting overnight continuing until morn... |
| 58916 | 2012-05-26 21:00:00.000 +0200 | Mostly Cloudy | NaN | 16.066667 | 16.066667 | 0.7 | 9.4507 | 21.0 | 9.982 | 0.0 | 1016.8 | Partly cloudy throughout the day. |
| 95547 | 2016-10-30 06:00:00.000 +0100 | Clear | NaN | 3.672222 | 0.633333 | 0.86 | 12.1394 | 307.0 | 0.0 | 0.0 | 1027.25 | Clear throughout the day. |
| 95226 | 2016-10-18 22:00:00.000 +0200 | Overcast | NaN | 9.411111 | 7.65 | 0.65 | 11.7852 | 139.0 | 15.8263 | 0.0 | 1023.73 | Mostly cloudy throughout the day. |
| 94270 | 2016-11-01 02:00:00.000 +0100 | Partly Cloudy | NaN | 5.311111 | 3.616667 | 0.93 | 7.6153 | 298.0 | 0.0 | 0.0 | 1021.61 | Partly cloudy starting in the afternoon. |
| 95102 | 2016-10-13 18:00:00.000 +0200 | Mostly Cloudy | NaN | 9.944444 | 9.944444 | 0.77 | 4.186 | 56.0 | 16.0517 | 0.0 | 1020.95 | Foggy starting in the evening. |
| 95135 | 2016-10-15 03:00:00.000 +0200 | Overcast | NaN | 8.333333 | 6.405556 | 0.79 | 11.3988 | 107.0 | 12.2199 | 0.0 | 1019.13 | Drizzle starting in the evening. |


In [9]:
# check min and max temperatures for the rows with NaNs
temp_max = round(df_precip_null["Temperature (C)"].max(),2)
temp_min = round(df_precip_null["Temperature (C)"].min(),2)
print(f"Temperature for rows with a missing Precip Type: max={temp_max}, min={temp_min}")

Temperature for rows with a missing Precip Type: max=25.04, min=1.26


#### Conclusions:

The minimum temperature among the rows with a missing Precip Type is 1.26 °C, which is above freezing, so the missing values were replaced with 'rain'.
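
The reasoning above can be made explicit by imputing from the temperature itself: 'snow' at or below a threshold, 'rain' otherwise. This is only a sketch under stated assumptions, not the method used in this notebook: the 0 °C threshold, the helper name `impute_precip_type`, and the toy values are all hypothetical.

```python
import numpy as np
import pandas as pd

def impute_precip_type(frame, threshold_c=0.0):
    """Fill missing 'Precip Type' values from temperature (hypothetical helper)."""
    # Guess 'snow' at or below the threshold, 'rain' above it.
    guessed = pd.Series(
        np.where(frame["Temperature (C)"] <= threshold_c, "snow", "rain"),
        index=frame.index,
    )
    # Existing labels are kept; only NaNs take the guessed value.
    return frame["Precip Type"].fillna(guessed)

toy = pd.DataFrame({
    "Temperature (C)": [8.46, 1.53, -3.20, 16.07],
    "Precip Type": [np.nan, np.nan, np.nan, "rain"],
})
toy["Precip Type"] = impute_precip_type(toy)
print(toy["Precip Type"].tolist())  # ['rain', 'rain', 'snow', 'rain']
```

Because every row with a missing value in the real data is warmer than 1.26 °C, such a rule would give the same result as filling everything with 'rain'.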

In [10]:
# replace NaNs with 'rain'
df['Precip Type'] = df['Precip Type'].fillna('rain')
df["Precip Type"].value_counts(dropna=False)

rain    85741
snow    10712
Name: Precip Type, dtype: int64

### Creating columns for month and hour

In [11]:
# add column with month
df["Month"] = df["Formatted Date"].apply(lambda x: int(x[5:7]))

In [12]:
# add column with hour
df["Hour"] = df["Formatted Date"].apply(lambda x: int(x[11:13]))
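
The slicing above depends on the fixed character layout of `Formatted Date`. As an alternative sketch, the local wall-clock part of the timestamp can be parsed with `pd.to_datetime`. Dropping the UTC offset (everything after the first 19 characters) is an assumption made here so that the hour stays in local time, as it does with slicing; the `dates` series holds two sample values only.

```python
import pandas as pd

dates = pd.Series([
    "2006-04-01 00:00:00.000 +0200",
    "2016-10-30 06:00:00.000 +0100",
])

# Keep only 'YYYY-MM-DD HH:MM:SS' (first 19 characters) and parse it.
local_time = pd.to_datetime(dates.str.slice(0, 19), format="%Y-%m-%d %H:%M:%S")

# The .dt accessor exposes calendar fields of the parsed timestamps.
month = local_time.dt.month
hour = local_time.dt.hour
print(month.tolist())  # [4, 10]
print(hour.tolist())   # [0, 6]
```

Parsing fails loudly on malformed dates, whereas slicing would silently produce wrong numbers or a less informative error.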

### Grouping the month and hour columns into seasons and parts of the day

In [13]:
# add column with grouped month

# 1 - Winter
# 2 - Spring
# 3 - Summer
# 4 - Autumn

season = {1: 1, 2: 1, 3: 2, 4: 2, 5: 2, 6: 3, 7: 3, 8: 3, 9: 4, 10: 4, 11: 4, 12: 1}
df["year_season"] = df["Month"].map(season)

In [14]:
# add column with grouped hour

# 1 - Morning
# 2 - Afternoon
# 3 - Evening
# 4 - Night

parts_of_the_day = {5: 1, 6: 1, 7: 1, 8: 1, 9: 1, 10: 1, 11: 2, 12: 2, 13: 2, 14: 2, 15: 2, 16: 2, 17: 3, 18: 3, 19: 3, 20: 3, 21: 3, 22: 3, 23: 4, 0: 4, 1: 4, 2: 4, 3: 4, 4: 4}

df["day_part"] = df["Hour"].map(parts_of_the_day)
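
The hour-to-part-of-day dictionary can also be built from ranges, which makes the boundaries easier to check. The sketch below assumes the same boundaries as the dictionary above; the helper name `build_day_part_map` is hypothetical.

```python
import pandas as pd

def build_day_part_map():
    """Return {hour: part} with 1=Morning, 2=Afternoon, 3=Evening, 4=Night."""
    parts = [
        (1, list(range(5, 11))),         # 05:00-10:59
        (2, list(range(11, 17))),        # 11:00-16:59
        (3, list(range(17, 23))),        # 17:00-22:59
        (4, [23] + list(range(0, 5))),   # 23:00-04:59
    ]
    mapping = {}
    for part, hours in parts:
        for hour in hours:
            mapping[hour] = part
    return mapping

day_part_map = build_day_part_map()

# Every hour 0..23 must be covered, otherwise .map() would produce NaN.
assert sorted(day_part_map) == list(range(24))

hours = pd.Series([0, 5, 11, 17, 23])
print(hours.map(day_part_map).tolist())  # [4, 1, 2, 3, 4]
```

The coverage check is the useful part: a missing key in a hand-written dictionary would otherwise only show up later as NaN values in `day_part`.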

### Dropping columns

In [15]:
# print columns that contain only one unique value

for col in df.columns:
    a = df[col].unique()
    if len(a) == 1:
        print(col, a)

Loud Cover [0.]
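
The same check can be written without an explicit loop by using `DataFrame.nunique()`. A minimal sketch on a made-up frame (the `toy` data are an assumption for illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    "Humidity": [0.89, 0.86, 0.83],
    "Loud Cover": [0.0, 0.0, 0.0],
})

# nunique() counts distinct non-null values per column.
unique_counts = toy.nunique()
constant_columns = unique_counts[unique_counts == 1].index.tolist()
print(constant_columns)  # ['Loud Cover']
```

Note that `nunique()` ignores NaN by default, while `unique()` in the loop above includes it; passing `dropna=False` makes the two approaches count the same values.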


In [16]:
# drop column 'Loud Cover' - because it has only 1 unique value
df = df.drop('Loud Cover', axis=1)

In [17]:
# drop column 'Formatted Date' - because features for month and hour have been created from this feature
df = df.drop('Formatted Date', axis=1)

### Categorical columns

In [18]:
# create features for rain and snow from the Precip Type feature, then remove the original column
df = pd.get_dummies(df, columns = ['Precip Type'])
df.rename(columns={'Precip Type_rain': 'Rain', 'Precip Type_snow': 'Snow'}, inplace=True)
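
To show what `pd.get_dummies` produces here, the sketch below runs the same two steps on three made-up rows. `dtype=int` is an addition not present in the cell above: it forces 0/1 integers, since newer pandas versions return booleans by default.

```python
import pandas as pd

toy = pd.DataFrame({"Precip Type": ["rain", "snow", "rain"]})

# One indicator column per category; the original column is removed.
encoded = pd.get_dummies(toy, columns=["Precip Type"], dtype=int)
encoded = encoded.rename(
    columns={"Precip Type_rain": "Rain", "Precip Type_snow": "Snow"}
)
print(encoded.to_dict("list"))  # {'Rain': [1, 0, 1], 'Snow': [0, 1, 0]}
```

With only two categories, `Rain` is always `1 - Snow`, so one of the two columns is redundant; `get_dummies` accepts `drop_first=True` if that matters for the chosen model.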

In [19]:
# create a feature for each kind of "Summary" value; values joined with the separator ' and ' are split into their parts.
# Assign 1 to an observation when it has the given value

def create_summary_columns(column_name, data_frame):
    """Create the column and fill it with 0 if it does not exist in the data frame yet.
    :param column_name: name of the column to create
    :param data_frame: data frame
    """
    if column_name not in data_frame.columns:
        data_frame[column_name] = 0

separator = " and " # separator for concatenated values
list_of_values = df["Summary"].unique() # list of unique values in feature 'Summary'

# Summary column
# create new columns from Summary unique values
for unique_value in list_of_values:
    if separator in unique_value:
        value_list = unique_value.split(separator)
        for col_name in value_list:
            create_summary_columns(col_name, df)
            df.loc[df["Summary"] == unique_value, col_name] = 1
    else:
        create_summary_columns(unique_value, df)
        df.loc[df["Summary"] == unique_value, unique_value] = 1
        
# drop feature 'Summary'
df = df.drop('Summary', axis=1)
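
The loop above builds multi-label indicator columns by hand. pandas offers `Series.str.get_dummies`, which splits on a separator and should give an equivalent result for this kind of data. The sketch below uses three made-up `Summary` values and is not the code used in this notebook.

```python
import pandas as pd

toy = pd.DataFrame({"Summary": ["Partly Cloudy", "Breezy and Foggy", "Foggy"]})

# Split each value on ' and ' and create one 0/1 column per distinct part.
summary_flags = toy["Summary"].str.get_dummies(sep=" and ")
print(summary_flags.columns.tolist())  # ['Breezy', 'Foggy', 'Partly Cloudy']
print(summary_flags.values.tolist())   # [[0, 0, 1], [1, 1, 0], [0, 1, 0]]

# Replace the text column with the indicator columns.
toy = pd.concat([toy.drop("Summary", axis=1), summary_flags], axis=1)
```

One difference from the loop is column order: `str.get_dummies` returns the columns sorted alphabetically rather than in order of first appearance.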

### Numerical columns

In [20]:
# Add a column with the difference between the apparent and the measured temperature

# this column will not be used to build the prediction model
df["temperature difference"] = df["Apparent Temperature (C)"] - df["Temperature (C)"]
df = df.drop("Apparent Temperature (C)", axis=1)

In [21]:
# display data frame
df.head()

| | Temperature (C) | Humidity | Wind Speed (km/h) | Wind Bearing (degrees) | Visibility (km) | Pressure (millibars) | Daily Summary | Month | Hour | year_season | ... | Foggy | Breezy | Clear | Humid | Windy | Dry | Dangerously Windy | Light Rain | Drizzle | temperature difference |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9.472222 | 0.89 | 14.1197 | 251.0 | 15.8263 | 1015.13 | Partly cloudy throughout the day. | 4 | 0 | 2 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | -2.083333 |
| 1 | 9.355556 | 0.86 | 14.2646 | 259.0 | 15.8263 | 1015.63 | Partly cloudy throughout the day. | 4 | 1 | 2 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | -2.127778 |
| 2 | 9.377778 | 0.89 | 3.9284 | 204.0 | 14.9569 | 1015.94 | Partly cloudy throughout the day. | 4 | 2 | 2 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 |
| 3 | 8.288889 | 0.83 | 14.1036 | 269.0 | 15.8263 | 1016.41 | Partly cloudy throughout the day. | 4 | 3 | 2 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | -2.344444 |
| 4 | 8.755556 | 0.83 | 11.0446 | 259.0 | 15.8263 | 1016.51 | Partly cloudy throughout the day. | 4 | 4 | 2 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | -1.777778 |


#### Conclusions:
The categorical columns were either dropped or used to derive new columns that can be used to build the model.

The 'Daily Summary' feature has not been dropped for now.

## Feature plot analysis