# Data preparation

### Check and set the working directory (it must be the project root)

In [1]:
%pwd

'C:\\Users\\Kuroha\\source\\repos_py\\bauman_final_project\\notebooks'

In [2]:
%cd ..

C:\Users\Kuroha\source\repos_py\bauman_final_project


### Loading the datasets:

In [3]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns

from src.utils import *

In [4]:
def open_dataset(dataset_name):
    path = get_filepath(dataset_name, is_raw=True)
    return pd.read_csv(path, index_col=['uid', 'date'], parse_dates=True)

weather_train_df = open_dataset(DATA_WEATHER_TRAIN)
weather_test_df = open_dataset(DATA_WEATHER_TEST)
water_lvl_df = open_dataset(DATA_WATER_LEVEL)

### Dataset information:

In [5]:
weather_train_df.head()

uid,date,temperature,pressure,cloud,weather,wind_dir,wind_spd
9386,2008-01-01,-17.0,735.0,dull,snow,ЮЗ,2.0
9386,2008-01-02,-31.0,747.0,sun,clear,СЗ,2.0
9386,2008-01-03,-43.0,753.0,sun,clear,З,2.0
9386,2008-01-04,-34.0,733.0,dull,snow,Ш,0.0
9386,2008-01-05,-28.0,728.0,suncl,clear,З,1.0


In [6]:
water_lvl_df.info()

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 98263 entries, (9386, Timestamp('2008-01-01 00:00:00')) to (9568, Timestamp('2017-12-31 00:00:00'))
Data columns (total 1 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   water_level  98263 non-null  int64
dtypes: int64(1)
memory usage: 1.2 MB


In [7]:
water_lvl_df.columns

Index(['water_level'], dtype='object')

The **water_level** dataset contains water level measurements for hydrological monitoring posts, taken from the AIS GMVO (АИС ГМВО) website.

In [8]:
weather_train_df.head(5)

uid,date,temperature,pressure,cloud,weather,wind_dir,wind_spd
9386,2008-01-01,-17.0,735.0,dull,snow,ЮЗ,2.0
9386,2008-01-02,-31.0,747.0,sun,clear,СЗ,2.0
9386,2008-01-03,-43.0,753.0,sun,clear,З,2.0
9386,2008-01-04,-34.0,733.0,dull,snow,Ш,0.0
9386,2008-01-05,-28.0,728.0,suncl,clear,З,1.0


In [9]:
weather_train_df.info()

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 102072 entries, (9386, Timestamp('2008-01-01 00:00:00')) to (9518, Timestamp('2017-12-31 00:00:00'))
Data columns (total 6 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   temperature  101236 non-null  float64
 1   pressure     101236 non-null  float64
 2   cloud        100924 non-null  object 
 3   weather      101236 non-null  object 
 4   wind_dir     101236 non-null  object 
 5   wind_spd     101236 non-null  float64
dtypes: float64(3), object(3)
memory usage: 5.1+ MB


In [10]:
weather_train_df.columns

Index(['temperature', 'pressure', 'cloud', 'weather', 'wind_dir', 'wind_spd'], dtype='object')

The **weather_train** and **weather_test** datasets, which hold Gismeteo weather reports, have the following columns:
- index **uid** - identification number of the hydrological monitoring post from the AIS GMVO (АИС ГМВО) website
- index **date** - measurement date
- **temperature** - temperature
- **pressure** - atmospheric pressure
- **cloud** - cloud cover
- **weather** - weather phenomenon
- **wind_dir** - wind direction
- **wind_spd** - wind speed in m/s

In [11]:
weather_test_df

uid,date,temperature,pressure,cloud,weather,wind_dir,wind_spd
9386,2018-01-01,-26.0,760.0,dull,clear,Ю,1.0
9386,2018-01-02,-20.0,758.0,dull,snow,ЮЗ,3.0
9386,2018-01-03,-13.0,753.0,sun,clear,ЮЗ,2.0
9386,2018-01-04,-12.0,749.0,sunc,clear,ЮЗ,2.0
9386,2018-01-05,-10.0,742.0,dull,snow,З,2.0
...,...,...,...,...,...,...,...
9518,2018-12-27,-24.0,752.0,sunc,clear,ЮЗ,1.0
9518,2018-12-28,-24.0,754.0,sunc,clear,ЮВ,1.0
9518,2018-12-29,-23.0,755.0,sun,clear,ЮВ,1.0
9518,2018-12-30,-28.0,758.0,suncl,clear,Ю,2.0


The **weather_test** dataset contains the weather data for 2018, for which the predictions will be made.

### Joining the training datasets:

In [12]:
df = weather_train_df.join(water_lvl_df)
df.head()

uid,date,temperature,pressure,cloud,weather,wind_dir,wind_spd,water_level
9386,2008-01-01,-17.0,735.0,dull,snow,ЮЗ,2.0,138.0
9386,2008-01-02,-31.0,747.0,sun,clear,СЗ,2.0,138.0
9386,2008-01-03,-43.0,753.0,sun,clear,З,2.0,138.0
9386,2008-01-04,-34.0,733.0,dull,snow,Ш,0.0,138.0
9386,2008-01-05,-28.0,728.0,suncl,clear,З,1.0,138.0
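
`DataFrame.join` performs a left join on the index by default, which is why the joined frame keeps every weather row and why `water_level` is shown as a float: rows with no matching measurement receive `NaN`. A minimal sketch on made-up data (the `toy_*` frames are illustrative assumptions, not the project's data):

```python
import pandas as pd

# Three weather rows for one hypothetical post
idx = pd.MultiIndex.from_tuples(
    [(1, "2008-01-01"), (1, "2008-01-02"), (1, "2008-01-03")],
    names=["uid", "date"],
)
toy_weather = pd.DataFrame({"temperature": [-17.0, -31.0, -43.0]}, index=idx)

# Water level is known for the first two days only
toy_water = pd.DataFrame({"water_level": [138, 137]}, index=idx[:2])

# Left join on the (uid, date) index: all weather rows are kept
joined = toy_weather.join(toy_water)

print(joined)
print(joined["water_level"].dtype)  # integers are upcast to float to hold NaN
```

The same effect would explain why the joined frame can have more rows than `water_lvl_df`.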


### Feature analysis:

#### Cloud (cloud cover):

In [13]:
df['cloud'].describe(), df['cloud'].unique()

(count     100924
 unique         4
 top         dull
 freq       38227
 Name: cloud, dtype: object,
 array(['dull', 'sun', 'suncl', 'sunc', nan], dtype=object))

Cloud cover can take the following values:
- **sun** - clear
- **sunc** - partly cloudy
- **suncl** - cloudy
- **dull** - overcast

There is a natural order here, from a clear sky to an overcast one, so this ordinal feature should be label-encoded (the Label Encoder approach) with integer codes that follow that order.

The sklearn implementation of this method [sorts the unique values alphabetically](https://github.com/scikit-learn/scikit-learn/blob/f3f51f9b611bf873bd5836748647221480071a87/sklearn/preprocessing/_label.py#L799) before encoding, which would break the order: **dull** would be encoded as 0, **sun** as 1, **sunc** as 2 and **suncl** as 3. The mapping is therefore set explicitly.
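
The ordering problem can be checked on the four labels alone. The sketch below compares `LabelEncoder` with an explicitly ordered `pandas.Categorical`. The `cloud_order` list is an assumption that mirrors the clear-to-overcast order described above.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

cloud = pd.Series(["sun", "sunc", "suncl", "dull"])

# LabelEncoder assigns codes by the sorted (alphabetical) order of the labels
alphabetical_codes = LabelEncoder().fit_transform(cloud)

# An explicit category order keeps the clear-to-overcast progression
cloud_order = ["sun", "sunc", "suncl", "dull"]
ordered_codes = pd.Categorical(cloud, categories=cloud_order, ordered=True).codes

print(list(alphabetical_codes))  # 'dull' gets the lowest code
print(list(ordered_codes))       # codes follow cloud_order
```

An explicit dictionary passed to `Series.map` gives the same ordered codes without the extra dependency.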

In [14]:
df['cloud'] = df['cloud'].map({'sun': 0, 'sunc': 1, 'suncl': 2, 'dull': 3})
df.head(5)

uid,date,temperature,pressure,cloud,weather,wind_dir,wind_spd,water_level
9386,2008-01-01,-17.0,735.0,3.0,snow,ЮЗ,2.0,138.0
9386,2008-01-02,-31.0,747.0,0.0,clear,СЗ,2.0,138.0
9386,2008-01-03,-43.0,753.0,0.0,clear,З,2.0,138.0
9386,2008-01-04,-34.0,733.0,3.0,snow,Ш,0.0,138.0
9386,2008-01-05,-28.0,728.0,2.0,clear,З,1.0,138.0


#### Weather (precipitation)

In [15]:
df['weather'].describe(), df['weather'].unique()

(count     101236
 unique         4
 top        clear
 freq       77591
 Name: weather, dtype: object,
 array(['snow', 'clear', 'rain', 'storm', nan], dtype=object))

Precipitation can take the following values:
- **clear** - no precipitation
- **rain** - rain
- **storm** - thunderstorm
- **snow** - snow

This feature can be encoded in several ways:
1. A single **precipitation present** feature: rain and snow are both water falling as precipitation, and thunderstorms are usually accompanied by heavy rain;
2. Merging "thunderstorm" into "rain", which gives 2 features: **rain** and **snow**;
3. 3 features: **rain**, **storm**, **snow**, since dry thunderstorms do occur.

In every case, the absence of precipitation is encoded as 0 in all features.

In [16]:
df['weather_v1_precip'] = df['weather'].map({'clear': 0, 'rain': 1, 'storm': 1, 'snow': 1})

df['weather_v2_rain'] = df['weather'].map({'clear': 0, 'rain': 1, 'storm': 1, 'snow': 0})

# snow is encoded identically in options 2 and 3
df['weather_snow'] = df['weather'].map({'clear': 0, 'rain': 0, 'storm': 0, 'snow': 1})

df['weather_v3_rain'] = df['weather'].map({'clear': 0, 'rain': 1, 'storm': 0, 'snow': 0})
df['weather_v3_storm'] = df['weather'].map({'clear': 0, 'rain': 0, 'storm': 1, 'snow': 0})

df = df.drop(['weather'], axis=1)
df.head()

uid,date,temperature,pressure,cloud,wind_dir,wind_spd,water_level,weather_v1_precip,weather_v2_rain,weather_snow,weather_v3_rain,weather_v3_storm
9386,2008-01-01,-17.0,735.0,3.0,ЮЗ,2.0,138.0,1.0,0.0,1.0,0.0,0.0
9386,2008-01-02,-31.0,747.0,0.0,СЗ,2.0,138.0,0.0,0.0,0.0,0.0,0.0
9386,2008-01-03,-43.0,753.0,0.0,З,2.0,138.0,0.0,0.0,0.0,0.0,0.0
9386,2008-01-04,-34.0,733.0,3.0,Ш,0.0,138.0,1.0,0.0,1.0,0.0,0.0
9386,2008-01-05,-28.0,728.0,2.0,З,1.0,138.0,0.0,0.0,0.0,0.0,0.0


#### Wind_dir (wind direction)

In [17]:
df['wind_dir'].describe(), df['wind_dir'].unique()

(count     101236
 unique         9
 top            Ш
 freq       16681
 Name: wind_dir, dtype: object,
 array(['ЮЗ', 'СЗ', 'З', 'Ш', 'С', 'ЮВ', 'СВ', 'В', 'Ю', nan], dtype=object))

Wind direction is given either as a cardinal direction, **С** (north), **Ю** (south), **З** (west), **В** (east), or as an intercardinal one: **СЗ** (north-west), **СВ** (north-east), **ЮЗ** (south-west), **ЮВ** (south-east). The absence of wind is marked **Ш** (calm).
This feature can be encoded with 4 columns, one per cardinal direction: an intercardinal direction sets two of them, and calm sets none.

In [18]:
values = df['wind_dir'].dropna().unique()  # drop NaN explicitly instead of relying on its position

north_dict = {val: (1 if 'С' in val else 0) for val in values}  # north
south_dict = {val: (1 if 'Ю' in val else 0) for val in values}  # south
west_dict = {val: (1 if 'З' in val else 0) for val in values}   # west
east_dict = {val: (1 if 'В' in val else 0) for val in values}   # east

df['north'] = df['wind_dir'].map(north_dict)
df['south'] = df['wind_dir'].map(south_dict)
df['west'] = df['wind_dir'].map(west_dict)
df['east'] = df['wind_dir'].map(east_dict)

df = df.drop(['wind_dir'], axis=1)
df[['north', 'south', 'west', 'east']].head()

uid,date,north,south,west,east
9386,2008-01-01,0.0,1.0,1.0,0.0
9386,2008-01-02,1.0,0.0,1.0,0.0
9386,2008-01-03,0.0,0.0,1.0,0.0
9386,2008-01-04,0.0,0.0,0.0,0.0
9386,2008-01-05,0.0,0.0,1.0,0.0
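
As an alternative sketch, the same four indicators could be derived without building a dictionary per direction. `direction_flag` is a hypothetical helper introduced here for illustration; it relies on `Series.map` with `na_action="ignore"` so that missing directions stay missing.

```python
import numpy as np
import pandas as pd

def direction_flag(series, letter):
    """Return 1.0 where the direction code contains `letter`, 0.0 otherwise; NaN is kept."""
    return series.map(lambda value: float(letter in value), na_action="ignore")

# Toy values: south-west, calm, missing, north-east
wind_dir = pd.Series(["ЮЗ", "Ш", np.nan, "СВ"])

letters = {"north": "С", "south": "Ю", "west": "З", "east": "В"}
indicators = pd.DataFrame(
    {name: direction_flag(wind_dir, letter) for name, letter in letters.items()}
)
print(indicators)
```

Calm ("Ш") contains none of the four letters, so all of its indicators are 0.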


#### Handling missing values:

In [19]:
print(f'Shape of water_lvl_df: {water_lvl_df.shape}')
print(f'Shape of weather_train_df: {weather_train_df.shape}')
print(f'Shape of df: {df.shape}')
print(f'Shape of weather_test_df: {weather_test_df.shape}')

Shape of water_lvl_df: (98263, 1)
Shape of weather_train_df: (102072, 6)
Shape of df: (102072, 14)
Shape of weather_test_df: (10220, 6)


The **weather_train** dataset has a row for every day at every post, but the data contain gaps:

In [20]:
df[df.isnull().any(axis=1)]

uid,date,temperature,pressure,cloud,wind_spd,water_level,weather_v1_precip,weather_v2_rain,weather_snow,weather_v3_rain,weather_v3_storm,north,south,west,east
9386,2011-06-19,,,,,137.0,,,,,,,,,
9386,2012-02-06,,,,,129.0,,,,,,,,,
9386,2013-11-10,1.0,743.0,1.0,4.0,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
9386,2013-11-11,-3.0,740.0,3.0,4.0,,1.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0
9386,2013-11-12,-7.0,739.0,1.0,4.0,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9518,2016-09-20,10.0,730.0,,1.0,79.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0
9518,2016-09-28,11.0,728.0,,2.0,79.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0
9518,2017-08-14,28.0,729.0,,3.0,63.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0
9518,2017-11-26,,,,,81.0,,,,,,,,,


Consider the data for post 9518 from 2017-11-25 to 2017-11-28:

In [21]:
test_start_date = '2017-11-25'
test_end_date   = '2017-11-28'
df.query('uid == 9518 and date >= @test_start_date and date <= @test_end_date')

uid,date,temperature,pressure,cloud,wind_spd,water_level,weather_v1_precip,weather_v2_rain,weather_snow,weather_v3_rain,weather_v3_storm,north,south,west,east
9518,2017-11-25,-35.0,741.0,1.0,2.0,81.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
9518,2017-11-26,,,,,81.0,,,,,,,,,
9518,2017-11-27,,,,,80.0,,,,,,,,,
9518,2017-11-28,-33.0,749.0,0.0,0.0,80.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [22]:
df.isnull().sum()

temperature           836
pressure              836
cloud                1148
wind_spd              836
water_level          4007
weather_v1_precip     836
weather_v2_rain       836
weather_snow          836
weather_v3_rain       836
weather_v3_storm      836
north                 836
south                 836
west                  836
east                  836
dtype: int64

In [23]:
df.info()

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 102072 entries, (9386, Timestamp('2008-01-01 00:00:00')) to (9518, Timestamp('2017-12-31 00:00:00'))
Data columns (total 14 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   temperature        101236 non-null  float64
 1   pressure           101236 non-null  float64
 2   cloud              100924 non-null  float64
 3   wind_spd           101236 non-null  float64
 4   water_level        98065 non-null   float64
 5   weather_v1_precip  101236 non-null  float64
 6   weather_v2_rain    101236 non-null  float64
 7   weather_snow       101236 non-null  float64
 8   weather_v3_rain    101236 non-null  float64
 9   weather_v3_storm   101236 non-null  float64
 10  north              101236 non-null  float64
 11  south              101236 non-null  float64
 12  west               101236 non-null  float64
 13  east               101236 non-null  float64
dtypes: float64(14)
memory usage

In [24]:
#for column in df.columns:
#    df[column] = df[column].ffill()
df = df.interpolate()
df.query('uid == 9518 and date >= @test_start_date and date <= @test_end_date')

uid,date,temperature,pressure,cloud,wind_spd,water_level,weather_v1_precip,weather_v2_rain,weather_snow,weather_v3_rain,weather_v3_storm,north,south,west,east
9518,2017-11-25,-35.0,741.0,1.0,2.0,81.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
9518,2017-11-26,-34.333333,743.666667,0.666667,1.333333,81.0,0.0,0.0,0.0,0.0,0.0,0.0,0.666667,0.0,0.0
9518,2017-11-27,-33.666667,746.333333,0.333333,0.666667,80.0,0.0,0.0,0.0,0.0,0.0,0.0,0.333333,0.0,0.0
9518,2017-11-28,-33.0,749.0,0.0,0.0,80.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
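
`df.interpolate()` runs down the whole frame in row order, so a gap that touches the boundary between two posts is filled using values from the neighbouring post. A hedged sketch of per-post interpolation on made-up data (the `toy` frame and its values are assumptions):

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product(
    [[1, 2], pd.date_range("2017-11-25", periods=3)], names=["uid", "date"]
)
# Post 1 ends with a gap, post 2 starts with one
toy = pd.DataFrame(
    {"temperature": [0.0, 2.0, np.nan, np.nan, 10.0, 12.0]}, index=index
)

# Whole-frame interpolation: the gap is bridged across the two posts
global_fill = toy["temperature"].interpolate()

# Per-post interpolation: each uid is filled from its own values only
per_post_fill = toy.groupby(level="uid")["temperature"].transform(
    lambda s: s.interpolate()
)

print(pd.DataFrame({"global": global_fill, "per_post": per_post_fill}))
```

With the default settings a leading gap inside a post stays `NaN` in the per-post version, so a follow-up such as `bfill` within each group may be needed. For binary indicators, the commented-out `ffill` avoids fractional values altogether.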


In [25]:
# convert the float indicator columns to int32 to reduce memory usage
# (int32 matches the dtypes shown in the df.info() output below)
for column in ['cloud', 'weather_v1_precip', 'weather_snow', 'weather_v2_rain',
               'weather_v3_rain', 'weather_v3_storm', 'north', 'south', 'west', 'east']:
    df[column] = df[column].round().astype('int32')

df.info()

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 102072 entries, (9386, Timestamp('2008-01-01 00:00:00')) to (9518, Timestamp('2017-12-31 00:00:00'))
Data columns (total 14 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   temperature        102072 non-null  float64
 1   pressure           102072 non-null  float64
 2   cloud              102072 non-null  int32  
 3   wind_spd           102072 non-null  float64
 4   water_level        102072 non-null  float64
 5   weather_v1_precip  102072 non-null  int32  
 6   weather_v2_rain    102072 non-null  int32  
 7   weather_snow       102072 non-null  int32  
 8   weather_v3_rain    102072 non-null  int32  
 9   weather_v3_storm   102072 non-null  int32  
 10  north              102072 non-null  int32  
 11  south              102072 non-null  int32  
 12  west               102072 non-null  int32  
 13  east               102072 non-null  int32  
dtypes: float64(4), int32(10)
me

In [26]:
df.query('uid == 9518 and date >= @test_start_date and date <= @test_end_date')

uid,date,temperature,pressure,cloud,wind_spd,water_level,weather_v1_precip,weather_v2_rain,weather_snow,weather_v3_rain,weather_v3_storm,north,south,west,east
9518,2017-11-25,-35.0,741.0,1,2.0,81.0,0,0,0,0,0,0,1,0,0
9518,2017-11-26,-34.333333,743.666667,1,1.333333,81.0,0,0,0,0,0,0,1,0,0
9518,2017-11-27,-33.666667,746.333333,0,0.666667,80.0,0,0,0,0,0,0,0,0,0
9518,2017-11-28,-33.0,749.0,0,0.0,80.0,0,0,0,0,0,0,0,0,0
