Reference: https://www.kaggle.com/viveksrinivasan/eda-ensemble-model-top-10-percentile

# EDA & Ensemble Model (Top 10 Percentile)

This notebook explains how to explore and prepare data for model building. It is structured as follows:

- About Dataset
- Data Summary
- Feature Engineering
- Missing Value Analysis
- Outlier Analysis
- Correlation Analysis
- Visualizing Distribution Of Data
- Visualizing Count VS (Month, Season, Hour, Weekday, Usertype)
- Filling 0's In Windspeed Using Random Forest
- Linear Regression Model
- Regularization Models
- Ensemble Models

## About Dataset
### Overview
Bike sharing systems are a means of renting bicycles where the process of obtaining membership and returning a bike is automated via a network of kiosk locations throughout a city. Using these systems, people are able to rent a bike from one location and return it to a different place on an as-needed basis. Currently, there are over 500 bike-sharing programs around the world.

#### Data Fields
- datetime - hourly date + timestamp
- season - 1 = spring, 2 = summer, 3 = fall, 4 = winter
- holiday - whether the day is considered a holiday
- workingday - whether the day is neither a weekend nor holiday
- weather -
    - 1: Clear, Few clouds, Partly cloudy
    - 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
    - 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
    - 4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
- temp - temperature in Celsius
- atemp - "feels like" temperature in Celsius
- humidity - relative humidity
- windspeed - wind speed
- casual - number of non-registered user rentals initiated
- registered - number of registered user rentals initiated
- count - number of total rentals (Dependent Variable)
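
The field list implies that `count` is the sum of `casual` and `registered`. A minimal sketch of how that relationship could be checked, using only the three sample rows printed elsewhere in this notebook (the `sample` frame and `count_is_sum` name are illustrative, not part of the original code):

```python
import pandas as pd

# Three rows copied from the sample output printed in this notebook.
sample = pd.DataFrame({
    "casual": [3, 8, 5],
    "registered": [13, 32, 27],
    "count": [16, 40, 32],
})

# True when every row satisfies count == casual + registered.
count_is_sum = bool((sample["casual"] + sample["registered"] == sample["count"]).all())
print(count_is_sum)
```

Running the same comparison on the full `dailyData` frame would show whether the relationship holds for every row, which matters later because `casual` and `registered` would then leak the target if used as predictors.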

In [1]:
import pylab
import calendar
import numpy as np
import pandas as pd
import seaborn as sn
from scipy import stats
import missingno as msno
from datetime import datetime
import matplotlib.pyplot as plt
import warnings
pd.options.mode.chained_assignment = None
warnings.filterwarnings("ignore", category = DeprecationWarning)
%matplotlib inline

**Let's Read In The Dataset**

In [4]:
dailyData = pd.read_csv("../kaggle/data/bike sharing/train.csv")

## Data Summary
As a first step, let's look at three things about the dataset:
- Its size
- A glimpse of the data, by printing a few rows
- The data types of its variables

**Shape of The Dataset**

In [5]:
dailyData.shape

(10886, 12)

**Sample Of First Few Rows**

In [6]:
dailyData.head(2)

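| | datetime | season | holiday | workingday | weather | temp | atemp | humidity | windspeed | casual | registered | count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2011-01-01 00:00:00 | 1 | 0 | 0 | 1 | 9.84 | 14.395 | 81 | 0.0 | 3 | 13 | 16 |
| 1 | 2011-01-01 01:00:00 | 1 | 0 | 0 | 1 | 9.02 | 13.635 | 80 | 0.0 | 8 | 32 | 40 |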


**Variables Data Type**

In [7]:
dailyData.dtypes

datetime       object
season          int64
holiday         int64
workingday      int64
weather         int64
temp          float64
atemp         float64
humidity        int64
windspeed     float64
casual          int64
registered      int64
count           int64
dtype: object

## Feature Engineering
As the results above show, the columns "season", "holiday", "workingday", and "weather" should be of "categorical" data type, but their current data type is "int". Let us transform the dataset in the following ways so that we can get started with our EDA:

- Create new columns "date", "hour", "weekday", and "month" from the "datetime" column.
- Coerce the data type of "season", "holiday", "workingday", and "weather" to category.
- Drop the "datetime" column, as we have already extracted the useful features from it.

**Creating New Columns From "Datetime" Column**

In [15]:
dailyData["date"] = dailyData.datetime.apply(lambda x : x.split()[0])
dailyData["hour"] = dailyData.datetime.apply(lambda x : x.split()[1].split(":")[0])
dailyData["weekday"] = dailyData.date.apply(lambda dateString : calendar.day_name[datetime.strptime(dateString,"%Y-%m-%d").weekday()])
dailyData["month"] = dailyData.date.apply(lambda dateString : calendar.month_name[datetime.strptime(dateString,"%Y-%m-%d").month])
dailyData["season"] = dailyData.season.map({1: "Spring", 2 : "Summer", 3 : "Fall", 4 :"Winter" })
dailyData["weather"] = dailyData.weather.map({1: "Clear, Few clouds, Partly cloudy",
                                        2: "Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist",
                                        3: "Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds",
                                        4: "Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog"})
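
The date features above are built row by row with `apply` and string splitting. As an alternative, the same columns could be derived by parsing the timestamps once and using the pandas `.dt` accessor. This is a sketch under assumptions, not the notebook's method: the `frame` and `parsed` names are hypothetical, and the two timestamps are taken from the sample rows.

```python
import pandas as pd

# Two timestamps in the same format as the "datetime" column.
frame = pd.DataFrame({"datetime": ["2011-01-01 00:00:00", "2011-01-01 01:00:00"]})

# Parse once, then derive each feature with the .dt accessor.
parsed = pd.to_datetime(frame["datetime"], format="%Y-%m-%d %H:%M:%S")
frame["date"] = parsed.dt.strftime("%Y-%m-%d")
frame["hour"] = parsed.dt.hour            # integer hour (0), not a string such as "00"
frame["weekday"] = parsed.dt.day_name()
frame["month"] = parsed.dt.month_name()

print(frame)
```

One difference to keep in mind: the string-splitting version yields the hour as text, while `.dt.hour` yields an integer, so any later code that compares hours would need to use the matching type.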

In [17]:
dailyData.head(3)

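| | datetime | season | holiday | workingday | weather | temp | atemp | humidity | windspeed | casual | registered | count | date | hour | weekday | month |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2011-01-01 00:00:00 | | 0 | 0 | | 9.84 | 14.395 | 81 | 0.0 | 3 | 13 | 16 | 2011-01-01 | 0 | Saturday | January |
| 1 | 2011-01-01 01:00:00 | | 0 | 0 | | 9.02 | 13.635 | 80 | 0.0 | 8 | 32 | 40 | 2011-01-01 | 1 | Saturday | January |
| 2 | 2011-01-01 02:00:00 | | 0 | 0 | | 9.02 | 13.635 | 80 | 0.0 | 5 | 27 | 32 | 2011-01-01 | 2 | Saturday | January |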
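
The season and weather cells are blank in the output above. One plausible explanation, which is an assumption based on the out-of-order cell numbers rather than something the notebook states, is that the mapping cell was executed more than once: `Series.map` with a dictionary returns NaN for any value that is not one of the dictionary's keys, so mapping already-mapped labels wipes them out. A minimal sketch with a hypothetical `season` series:

```python
import pandas as pd

season_labels = {1: "Spring", 2: "Summer", 3: "Fall", 4: "Winter"}
season = pd.Series([1, 2, 4])

first_pass = season.map(season_labels)        # codes become labels
second_pass = first_pass.map(season_labels)   # labels are not keys, so every value becomes NaN

# A hypothetical guard (not in the original notebook): fall back to the
# existing value when it is not a key, so re-running leaves labels intact.
rerun_safe = first_pass.map(lambda value: season_labels.get(value, value))

print(second_pass.isna().all())
print(rerun_safe.tolist())
```

If this is what happened, reloading the CSV and running the mapping cell exactly once should restore the labels.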


**Coercing To Category Type**

In [18]:
categoryVariableList = ["hour", "weekday", "month", "season", "weather", "holiday", "workingday"]
for var in categoryVariableList:
    dailyData[var] = dailyData[var].astype("category")
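
Plain `astype("category")` on string labels produces unordered categories listed in sorted (alphabetical) order, which can make weekday or month plots appear out of calendar order. A hedged sketch of one way to attach an explicit order with `pd.CategoricalDtype`; the `weekday` series and `day_order` list here are illustrative and not taken from the notebook:

```python
import pandas as pd

# Hypothetical weekday values; the real column holds one name per row.
weekday = pd.Series(["Saturday", "Monday", "Sunday", "Monday"])

# Plain coercion, as in the loop above: categories are listed alphabetically.
plain = weekday.astype("category")

# Ordered variant: categories follow an explicit Monday-first calendar order.
day_order = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
ordered_type = pd.CategoricalDtype(categories=day_order, ordered=True)
ordered = weekday.astype(ordered_type)

print(list(plain.cat.categories))
print(ordered.sort_values().tolist())
```

Whether an explicit order is worth adding depends on how the later plots and group-by summaries are expected to be arranged.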

In [23]:
type(categoryVariableList)

list