# Setting Up The Usable DataFrame

### Importing the Necessary Libraries

In [18]:
import pandas as pd
import datetime as dt

### Reading in the Data

The data is read from `day.csv` in the current folder.

In [19]:
df = pd.read_csv('day.csv')

The CSV file contains daily data from the Capital Bikeshare system in Washington, D.C., covering 2011 and 2012. Fields include the day, season, weather, temperature, and humidity, among others.

### Creating a Function

A helper function that adds new columns holding the dummy variables needed for statistical analysis.

In [20]:
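def columnize(df, column, new_vals, new_col):
    """
    Maps the codes in `column` to the labels in `new_vals` (stored in `new_col`)
    and creates one dummy-variable column per label.
    Returns a new dataframe with the newly created columns.
    """
    # sort the codes so labels are assigned by code value, not by order of appearance
    codes = sorted(df[column].unique())
    mapping = dict(zip(codes, new_vals))
    df[new_col] = df[column].map(mapping)
    # axis must be passed by keyword in recent pandas versions
    df = pd.concat([df, pd.get_dummies(df[new_col])], axis=1)
    return df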
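
To see what this kind of helper produces, here is a minimal sketch on a toy frame. The toy data and the explicit code-to-label dictionary are assumptions made for illustration (the function above builds its mapping from the column's unique values instead); the labels themselves are the ones used later in this notebook.

```python
import pandas as pd

# Toy frame with made-up weather codes (not the real dataset)
toy = pd.DataFrame({'weathersit': [2, 1, 3, 1]})

# Explicit code-to-label mapping, so the result does not depend on row order
labels = {1: 'clear', 2: 'misty', 3: 'light_storm', 4: 'heavy_storm'}
toy['weather'] = toy['weathersit'].map(labels)

# One indicator column per label that actually occurs in the data
toy = pd.concat([toy, pd.get_dummies(toy['weather'])], axis=1)

# Names of the newly created dummy columns
dummy_cols = sorted(c for c in toy.columns if c not in ('weathersit', 'weather'))
print(dummy_cols)
```

Note that no `heavy_storm` column is created here, because `get_dummies` only produces columns for values present in the data.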

### Cleaning the Data

- Removed outliers 
- Changed the dates to ordinal 
- Replaced the weather with correct descriptions
- Matched each month to the appropriate season
- Created dummy variables where needed

In [21]:
# this day is an outlier: Washington, D.C. was shut down due to Hurricane Sandy
df[df['cnt']==22]

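| | instant | dteday | season | yr | mnth | holiday | weekday | workingday | weathersit | temp | atemp | hum | windspeed | casual | registered | cnt |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 667 | 668 | 2012-10-29 | 4 | 1 | 10 | 0 | 1 | 1 | 3 | 0.44 | 0.4394 | 0.88 | 0.3582 | 2 | 20 | 22 |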


In [9]:
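# Filtering out potential outliers: keep only days with more than 100 rentals
# (.copy() so later column assignments act on an independent dataframe)
df = df[df['cnt'] > 100].copy()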
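
Before dropping rows with a threshold like this, it can help to look at exactly which rows fall below it. The following is a small sketch on made-up counts; the toy values and the variable names `mask`, `dropped`, and `kept` are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd

# Toy daily counts (hypothetical values, one of them very low)
toy = pd.DataFrame({'cnt': [985, 22, 801, 1349]})

# Boolean mask for the rows the threshold would remove
mask = toy['cnt'] <= 100

dropped = toy[mask]        # rows to inspect before discarding them
kept = toy[~mask].copy()   # independent copy of the remaining rows

print(len(dropped), 'row(s) dropped;', len(kept), 'row(s) kept')
```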

In [10]:
#changing dates (dteday) to datetime objects
df['dteday'] = pd.to_datetime(df['dteday'])

#changing dates to ordinal, to use in regression
df['dteday'] = df['dteday'].map(dt.datetime.toordinal)
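
As a quick illustration of what the ordinal conversion does, the sketch below converts two consecutive dates and maps them back. The sample dates are chosen for illustration only; `toordinal` and `fromordinal` are standard `datetime` methods.

```python
import datetime as dt
import pandas as pd

# Two consecutive sample dates as datetime objects
dates = pd.to_datetime(pd.Series(['2012-10-29', '2012-10-30']))

# Each date becomes an integer day count, so consecutive days differ by 1
ordinals = dates.map(dt.datetime.toordinal)

# The conversion is reversible, which is handy when reading results back
restored = [dt.date.fromordinal(int(n)) for n in ordinals]

print(ordinals.tolist())
print(restored)
```

Because the ordinals are plain integers, a regression coefficient on this column can be read as a change per day.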

In [11]:
#create and add new columns for each weather situation
weathertypes = ['clear', 'misty', 'light_storm', 'heavy_storm']

df = columnize(df, 'weathersit', weathertypes, 'weather')

In [12]:
# changing the season column to be more accurate (matching season to month)
summer = [6,7,8]
fall = [9,10,11]
winter = [12,1,2]
spring = [3,4,5]

sum_dict = dict.fromkeys(summer,'summer')
wint_dict = dict.fromkeys(winter,'winter')
spr_dict = dict.fromkeys(spring,'spring')
fall_dict = dict.fromkeys(fall,'fall')

In [13]:
# combine all the season dicts
seasons = {**sum_dict,**wint_dict,**spr_dict,**fall_dict}

In [14]:
# apply the new season column to the dataframe
df['season']=df['mnth'].map(seasons)
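
The month-to-season lookup can also be built in a single step and checked on a few sample months. This is an alternative sketch, not the notebook's own code; `season_months` is a hypothetical name, while the month groupings are the ones defined above.

```python
import pandas as pd

# Same month groupings as above, keyed by season name
season_months = {
    'summer': [6, 7, 8],
    'fall': [9, 10, 11],
    'winter': [12, 1, 2],
    'spring': [3, 4, 5],
}

# Invert into a month -> season lookup
seasons = {month: name for name, months in season_months.items() for month in months}

# Check the lookup on a few sample months
sample = pd.Series([1, 4, 7, 10, 12])
mapped = sample.map(seasons)
print(mapped.tolist())
```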

In [15]:
# create dummy variable columns for each season and add them to the dataframe
df = pd.concat([df, pd.get_dummies(df['season'])], axis=1)

### Exporting the DataFrame as a JSON file

The data is exported in JSON format for use in other notebooks.

In [14]:
# export dataframe to JSON file
df.to_json('cleaned_bike_share_data.json', orient='records')
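
A downstream notebook could load a file written this way with `pd.read_json`, using the same `orient`. The sketch below round-trips a small toy frame through a temporary file; the toy values and the file name `toy.json` are assumptions for illustration.

```python
import os
import tempfile

import pandas as pd

# Toy frame standing in for the cleaned data (hypothetical values)
toy = pd.DataFrame({
    'dteday': [734805, 734806],
    'season': ['fall', 'fall'],
    'cnt': [1500, 2300],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'toy.json')
    # Write one JSON object per row, as in the export above
    toy.to_json(path, orient='records')
    # Read it back with the matching orient
    restored = pd.read_json(path, orient='records')

print(restored)
```

With `orient='records'` the row index is not stored, so the restored frame gets a fresh default index.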