<img width="200" src="https://raw.githubusercontent.com/lukwies/mid-bootcamp-project/main/data/img/bikes.png">

---


# Bikesharing in Seoul / Data Cleaning

---

### Sources

 * Data: https://archive.ics.uci.edu/ml/datasets/Seoul+Bike+Sharing+Demand
 * Image: https://global.chinadaily.com.cn/a/201801/25/WS5a69cab3a3106e7dcc136a6d.html

---

### Tasks

 * Normalize column names
 * Change type of column `date` to **datetime64**
 * Change values of column `holiday` from **Holiday**/**No Holiday** to **Yes**/**No**.
 * Create column `daytime` with values **Morning, Noon, Afternoon, Evening, Night**.
 * Create column `temperature_type` with values: **Hot, Warm, Mild, Cold, Frost**.
 * Create column `month` which holds only the month extracted from date.
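
The tasks above could be sketched roughly as follows. This is not the code in `lib.cleaning`, which is not shown in this notebook. The function name `clean_sketch`, the raw column names in the sample, the day/month/year date format, and every hour and temperature boundary are assumptions made for illustration.

```python
import pandas as pd


def hour_to_daytime(hour: int) -> str:
    """Map an hour (0-23) to a daytime label. The boundaries are assumed."""
    if 6 <= hour < 11:
        return "Morning"
    if 11 <= hour < 14:
        return "Noon"
    if 14 <= hour < 18:
        return "Afternoon"
    if 18 <= hour < 22:
        return "Evening"
    return "Night"


def clean_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Illustrative version of the listed cleaning tasks (hypothetical helper)."""
    out = df.copy()

    # Normalize column names: drop units in parentheses, lowercase, snake_case
    out.columns = (
        out.columns.str.replace(r"\(.*?\)", "", regex=True)
        .str.strip()
        .str.lower()
        .str.replace(r"\s+", "_", regex=True)
    )

    # Convert date strings to datetime64 (day/month/year format is assumed)
    out["date"] = pd.to_datetime(out["date"], format="%d/%m/%Y")

    # Holiday / No Holiday -> Yes / No
    out["holiday"] = out["holiday"].map({"Holiday": "Yes", "No Holiday": "No"})

    # Month extracted from the date
    out["month"] = out["date"].dt.month

    # Daytime label derived from the hour
    out["daytime"] = out["hour"].apply(hour_to_daytime)

    # Temperature classes; the thresholds in °C are assumed, not from the project
    bins = [float("-inf"), 0, 10, 20, 30, float("inf")]
    labels = ["Frost", "Cold", "Mild", "Warm", "Hot"]
    out["temperature_type"] = pd.cut(
        out["temperature"], bins=bins, labels=labels, right=False
    ).astype(object)

    return out


# Small made-up sample that mimics the assumed raw column names
sample = pd.DataFrame(
    {
        "Date": ["01/12/2017", "25/12/2017"],
        "Hour": [0, 12],
        "Temperature(°C)": [-5.2, 21.0],
        "Holiday": ["No Holiday", "Holiday"],
        "Rented Bike Count": [254, 100],
    }
)
cleaned = clean_sketch(sample)
print(cleaned)
```

Using `pd.cut` with `right=False` makes each class include its lower bound, so a temperature of exactly 0 °C falls into **Cold** rather than **Frost** under these assumed thresholds.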

---

#### Columns after cleaning

|Column Name|Datatype|Type|Values|Unit|
|:----------|:-------|:---|:-----|:---|
|date|datetime64|categorical|2017/12/1 - 2018/11/30||
|month|int64|numerical|1 - 12||
|hour|int64|numerical|0 - 23||
|daytime|object|categorical|Morning, Noon, Afternoon, Evening, Night||
|weekday|int64|numerical|0 - 6||
|seasons|object|categorical|Spring, Summer, Autumn, Winter||
|holiday|object|categorical|Yes, No||
|functioning_day|object|categorical|Yes, No||
|temperature|float64|numerical|-17.8 - 39.4|°C|
|temperature_type|object|categorical|Hot, Warm, Mild, Cold, Frost||
|humidity|int64|numerical|0 - 98|%|
|wind_speed|float64|numerical|0.0 - 7.4|m/s|
|visibility|int64|numerical|27 - 2000|10m|
|solar_radiation|float64|numerical|0.0 - 3.52|MJ/m2|
|rainfall|float64|numerical|0.0 - 35.0|mm|
|snowfall|float64|numerical|0.0 - 8.0|cm|
|rented_bike_count|int64|numerical|0 - 3556||

In [7]:
import pandas as pd
import numpy as np
import seaborn as sns
import yaml
import os
import sys

# Import my own functions
sys.path.insert(0, os.path.abspath('../src'))
import lib.cleaning as clean

#### Open YAML config file

In [8]:
# Open yaml configs
with open('../params.yaml') as file:
    config = yaml.safe_load(file)
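
For orientation, the cells below read `config['data']['csv_raw']` and `config['data']['csv_cleaned']`, so `params.yaml` presumably has a `data` mapping with those two keys. The following sketch parses an inline stand-in for that file. The two paths are placeholders invented for this example, not the project's real paths.

```python
import yaml

# Hypothetical stand-in for params.yaml. Only the key names (data, csv_raw,
# csv_cleaned) come from the notebook; the paths are made up.
example_params = """
data:
  csv_raw: ../data/raw/bikes.csv
  csv_cleaned: ../data/cleaned/bikes_cleaned.csv
"""

example_config = yaml.safe_load(example_params)

# safe_load returns plain Python dicts and strings
print(example_config["data"]["csv_raw"])
print(example_config["data"]["csv_cleaned"])
```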

#### Load dataset

In [9]:
df = pd.read_csv(config['data']['csv_raw'])

#### Clean data

In [10]:
df_clean = clean.clean_data(df)

#### Store cleaned data to file

In [11]:
df_clean.to_csv(config['data']['csv_cleaned'], index=False)

#### Check cleaned data

In [12]:
df_clean.isna().sum().sum()

0
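
Beyond counting missing values, the cleaned frame could also be compared against the value sets and ranges listed in the "Columns after cleaning" table. The sketch below shows one way to do that. The helper `find_violations` and the two dictionaries are hypothetical and not part of `lib.cleaning`.

```python
import pandas as pd

# Allowed values and ranges, copied from the "Columns after cleaning" table
EXPECTED_VALUES = {
    "daytime": {"Morning", "Noon", "Afternoon", "Evening", "Night"},
    "seasons": {"Spring", "Summer", "Autumn", "Winter"},
    "holiday": {"Yes", "No"},
    "functioning_day": {"Yes", "No"},
    "temperature_type": {"Hot", "Warm", "Mild", "Cold", "Frost"},
}
EXPECTED_RANGES = {"month": (1, 12), "hour": (0, 23), "weekday": (0, 6)}


def find_violations(df: pd.DataFrame) -> list:
    """Return human-readable descriptions of columns that break the table."""
    problems = []
    for col, allowed in EXPECTED_VALUES.items():
        if col in df.columns:
            unexpected = set(df[col].dropna().unique()) - allowed
            if unexpected:
                names = sorted(str(value) for value in unexpected)
                problems.append(f"{col}: unexpected values {names}")
    for col, (low, high) in EXPECTED_RANGES.items():
        # between() is inclusive on both ends by default
        if col in df.columns and not df[col].between(low, high).all():
            problems.append(f"{col}: values outside {low} - {high}")
    return problems


# Made-up frames to demonstrate the check
good = pd.DataFrame({"holiday": ["Yes", "No"], "hour": [0, 23], "month": [1, 12]})
bad = pd.DataFrame({"holiday": ["Holiday", "No"], "hour": [0, 24]})

print(find_violations(good))
print(find_violations(bad))
```

Columns that are absent from the frame are skipped, so the same check can run on partial frames; in the notebook it would be called as `find_violations(df_clean)`.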

In [13]:
df_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8760 entries, 0 to 8759
Data columns (total 17 columns):
 #   Column             Non-Null Count  Dtype         
---  ------             --------------  -----         
 0   date               8760 non-null   datetime64[ns]
 1   month              8760 non-null   int64         
 2   hour               8760 non-null   int64         
 3   daytime            8760 non-null   object        
 4   weekday            8760 non-null   int64         
 5   seasons            8760 non-null   object        
 6   holiday            8760 non-null   object        
 7   functioning_day    8760 non-null   object        
 8   temperature        8760 non-null   float64       
 9   temperature_type   8760 non-null   object        
 10  humidity           8760 non-null   int64         
 11  wind_speed         8760 non-null   float64       
 12  visibility         8760 non-null   int64         
 13  solar_radiation    8760 non-null   float64       
 14  rainfall