<img width="200" src="https://raw.githubusercontent.com/lukwies/mid-bootcamp-project/main/data/img/bikes.png">

---


# Bikesharing in Seoul / Data Cleaning

---

### Sources

 * Data: https://archive.ics.uci.edu/ml/datasets/Seoul+Bike+Sharing+Demand
 * Image: https://global.chinadaily.com.cn/a/201801/25/WS5a69cab3a3106e7dcc136a6d.html

---

### Tasks

 * Normalize column names
 * Change type of column `date` to **datetime64**
 * Change values of column `holiday` from **Holiday**/**No Holiday** to **Yes**/**No**.
 * Create column `daytime` with values **Morning, Noon, Afternoon, Evening, Night**.
 * Create column `temperature_type` with values: **Hot, Warm, Mild, Cold, Frost**.
 * Create column `month` which holds only the month extracted from date.
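
The implementation of `clean.clean_data` is not shown in this notebook, so the following is only a minimal sketch of how the renaming, date, holiday and month tasks could be done with pandas. The helper names `normalize_name` and `sketch_basic_cleaning` are hypothetical, and the date format `%d/%m/%Y` is inferred from raw values such as `01/12/2017`.

```python
import re
import pandas as pd


def normalize_name(name: str) -> str:
    """Lowercase a column name, drop a unit in parentheses, use underscores."""
    name = re.sub(r"\(.*?\)", "", name)   # "Wind speed (m/s)" -> "Wind speed "
    name = name.strip().lower()           # -> "wind speed"
    return re.sub(r"\s+", "_", name)      # -> "wind_speed"


def sketch_basic_cleaning(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper, not the project's clean_data()."""
    out = df.rename(columns=normalize_name)
    # Assumed day/month/year format, based on the raw sample rows
    out["date"] = pd.to_datetime(out["date"], format="%d/%m/%Y")
    out["holiday"] = out["holiday"].map({"Holiday": "Yes", "No Holiday": "No"})
    out["month"] = out["date"].dt.month
    return out


# Tiny made-up frame that mimics a few raw columns
sample = pd.DataFrame({
    "Date": ["01/12/2017", "30/11/2018"],
    "Wind speed (m/s)": [2.2, 1.0],
    "Temperature(C)": [-5.2, 2.1],
    "Holiday": ["No Holiday", "Holiday"],
})
cleaned = sketch_basic_cleaning(sample)
```

Passing an explicit `format` avoids pandas guessing between day-first and month-first dates, which matters for values like `01/12/2017`.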

---

#### Columns after cleaning
After applying our cleaning functions, the dataset will be formatted as follows:

|Column Name|Datatype|Type|Values|Unit|
|:----------|:-------|:---|:-----|:---|
|date|datetime64|categorical|2017/12/1 - 2018/11/30| |
|month|int64|numerical|1 - 12| |
|hour|int64|numerical|0 - 23| |
|daytime|object|categorical|Morning,Noon,Afternoon,Evening,Night| |
|weekday|int64|numerical|0 - 6| |
|seasons|object|categorical|Spring,Summer,Autumn,Winter| |
|holiday|object|categorical|Yes,No| |
|functioning_day|object|categorical|Yes,No| |
|temperature|float64|numerical|-17.8 - 39.4|°C|
|temperature_type|object|categorical|Hot,Warm,Mild,Cold,Frost| |
|humidity|int64|numerical|0 - 98|%|
|wind_speed|float64|numerical|0.0 - 7.4|m/s|
|visibility|int64|numerical|27 - 2000|10 m|
|solar_radiation|float64|numerical|0.0 - 3.52|MJ/m2|
|rainfall|float64|numerical|0.0 - 35.0|mm|
|snowfall|float64|numerical|0.0 - 8.0|cm|
|rented_bike_count|int64|numerical|0 - 3556| |
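
The `daytime` and `temperature_type` columns are binned versions of `hour` and `temperature`. The boundaries used by `clean.clean_data` are not stated in this notebook. The sketch below assumes cut points that agree with the rows visible in the cleaned output (hours 0 to 3 and 22 are Night, 4 is Morning, 19 to 21 are Evening; -5.2 °C is Frost, 4.2 °C is Cold). Every other boundary and both function names are assumptions.

```python
import pandas as pd


def hour_to_daytime(hour: int) -> str:
    """Map an hour (0-23) to a daytime label; boundaries are assumed."""
    if 4 <= hour < 11:
        return "Morning"
    if 11 <= hour < 14:
        return "Noon"
    if 14 <= hour < 19:
        return "Afternoon"
    if 19 <= hour < 22:
        return "Evening"
    return "Night"  # hours 22-23 and 0-3


def temperature_to_type(temp: pd.Series) -> pd.Series:
    """Bin temperatures in °C into five labels; cut points are assumed."""
    bins = [-float("inf"), 0, 10, 20, 30, float("inf")]
    labels = ["Frost", "Cold", "Mild", "Warm", "Hot"]
    # right=False makes each bin [low, high), so exactly 0.0 counts as Cold
    binned = pd.cut(temp, bins=bins, labels=labels, right=False)
    return binned.astype(object)  # plain strings instead of a categorical


# Made-up values covering every label at least once
sample = pd.DataFrame({
    "hour": [0, 4, 12, 19, 22],
    "temperature": [-5.2, 4.2, 15.0, 25.0, 35.0],
})
sample["daytime"] = sample["hour"].map(hour_to_daytime)
sample["temperature_type"] = temperature_to_type(sample["temperature"])
```

Converting the result of `pd.cut` to `object` keeps the dtype in line with the table above; keeping it categorical would be an equally valid choice.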


In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import yaml
import os
import sys

#### Import my own functions

In [2]:
sys.path.insert(0, os.path.abspath('../src'))
import lib.cleaning as clean

#### Open YAML config file

In [3]:
with open('../params.yaml') as file:
    config = yaml.safe_load(file)
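
The notebook reads two keys from the config, `data.csv_raw` and `data.csv_cleaned`. The file `params.yaml` itself is not shown here, so the sketch below only illustrates a layout that would satisfy those lookups. The paths are placeholders, not the project's real paths.

```python
import yaml

# Hypothetical params.yaml content; only the key names come from the notebook.
example_yaml = """
data:
  csv_raw: ../data/raw.csv          # placeholder path
  csv_cleaned: ../data/cleaned.csv  # placeholder path
"""

example_config = yaml.safe_load(example_yaml)
raw_path = example_config["data"]["csv_raw"]
cleaned_path = example_config["data"]["csv_cleaned"]
```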

#### Load dataset

In [4]:
df = pd.read_csv(config['data']['csv_raw'])
df.head()

| |Date|Rented Bike Count|Hour|Temperature(C)|Humidity(%)|Wind speed (m/s)|Visibility (10m)|Dew point temperature(C)|Solar Radiation (MJ/m2)|Rainfall(mm)|Snowfall (cm)|Seasons|Holiday|Functioning Day|
|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|
|0|01/12/2017|254|0|-5.2|37|2.2|2000|-17.6|0.0|0.0|0.0|Winter|No Holiday|Yes|
|1|01/12/2017|204|1|-5.5|38|0.8|2000|-17.6|0.0|0.0|0.0|Winter|No Holiday|Yes|
|2|01/12/2017|173|2|-6.0|39|1.0|2000|-17.7|0.0|0.0|0.0|Winter|No Holiday|Yes|
|3|01/12/2017|107|3|-6.2|40|0.9|2000|-17.6|0.0|0.0|0.0|Winter|No Holiday|Yes|
|4|01/12/2017|78|4|-6.0|36|2.3|2000|-18.6|0.0|0.0|0.0|Winter|No Holiday|Yes|


#### Clean data

In [5]:
df_clean = clean.clean_data(df)

#### Store cleaned data to file

In [6]:
df_clean.to_csv(config['data']['csv_cleaned'], index=False)

#### Check cleaned data

In [7]:
df_clean.isna().sum().sum()

0
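
Beyond counting missing values, the cleaned frame could also be checked against the column table at the top of this notebook. The following is a minimal sketch of such a check. The helper `find_problems` is hypothetical and not part of `lib.cleaning`; the allowed categories are copied from that table.

```python
import pandas as pd

# Allowed categories, taken from the column table in this notebook
EXPECTED_CATEGORIES = {
    "daytime": {"Morning", "Noon", "Afternoon", "Evening", "Night"},
    "seasons": {"Spring", "Summer", "Autumn", "Winter"},
    "holiday": {"Yes", "No"},
    "functioning_day": {"Yes", "No"},
    "temperature_type": {"Hot", "Warm", "Mild", "Cold", "Frost"},
}


def find_problems(df: pd.DataFrame) -> list[str]:
    """Hypothetical helper: describe every check that fails."""
    problems = []
    if df.isna().sum().sum() > 0:
        problems.append("missing values present")
    for column, allowed in EXPECTED_CATEGORIES.items():
        unexpected = set(df[column].dropna().unique()) - allowed
        if unexpected:
            problems.append(f"{column}: unexpected values {sorted(unexpected)}")
    if not df["hour"].between(0, 23).all():
        problems.append("hour outside 0-23")
    return problems


# One row copied from the cleaned output, reduced to the checked columns
sample = pd.DataFrame({
    "hour": [0],
    "daytime": ["Night"],
    "seasons": ["Winter"],
    "holiday": ["No"],
    "functioning_day": ["Yes"],
    "temperature_type": ["Frost"],
})
problems = find_problems(sample)
```

On the real data this would be called as `find_problems(df_clean)`, where an empty list means every check passed.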

In [8]:
df_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8760 entries, 0 to 8759
Data columns (total 17 columns):
 #   Column             Non-Null Count  Dtype         
---  ------             --------------  -----         
 0   date               8760 non-null   datetime64[ns]
 1   month              8760 non-null   int64         
 2   hour               8760 non-null   int64         
 3   daytime            8760 non-null   object        
 4   weekday            8760 non-null   int64         
 5   seasons            8760 non-null   object        
 6   holiday            8760 non-null   object        
 7   functioning_day    8760 non-null   object        
 8   temperature        8760 non-null   float64       
 9   temperature_type   8760 non-null   object        
 10  humidity           8760 non-null   int64         
 11  wind_speed         8760 non-null   float64       
 12  visibility         8760 non-null   int64         
 13  solar_radiation    8760 non-null   float64       
 14  rainfall

In [9]:
df_clean

| |date|month|hour|daytime|weekday|seasons|holiday|functioning_day|temperature|temperature_type|humidity|wind_speed|visibility|solar_radiation|rainfall|snowfall|rented_bike_count|
|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|
|0|2017-12-01|12|0|Night|4|Winter|No|Yes|-5.2|Frost|37|2.2|2000|0.0|0.0|0.0|254|
|1|2017-12-01|12|1|Night|4|Winter|No|Yes|-5.5|Frost|38|0.8|2000|0.0|0.0|0.0|204|
|2|2017-12-01|12|2|Night|4|Winter|No|Yes|-6.0|Frost|39|1.0|2000|0.0|0.0|0.0|173|
|3|2017-12-01|12|3|Night|4|Winter|No|Yes|-6.2|Frost|40|0.9|2000|0.0|0.0|0.0|107|
|4|2017-12-01|12|4|Morning|4|Winter|No|Yes|-6.0|Frost|36|2.3|2000|0.0|0.0|0.0|78|
|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|
|8755|2018-11-30|11|19|Evening|4|Autumn|No|Yes|4.2|Cold|34|2.6|1894|0.0|0.0|0.0|1003|
|8756|2018-11-30|11|20|Evening|4|Autumn|No|Yes|3.4|Cold|37|2.3|2000|0.0|0.0|0.0|764|
|8757|2018-11-30|11|21|Evening|4|Autumn|No|Yes|2.6|Cold|39|0.3|1968|0.0|0.0|0.0|694|
|8758|2018-11-30|11|22|Night|4|Autumn|No|Yes|2.1|Cold|41|1.0|1859|0.0|0.0|0.0|712|
