# Data Preprocessing: Prepare Complete Datasets

The data for this project consist of four datasets:
1. Bike rental data - [download script]()
2. Bike stations' location data - [source](http://opendata.dc.gov/datasets/capital-bike-share-locations)
3. Weather data - [download script]()
4. Bike stations' distances

The goal of this script is to create one separate dataset for each of the sources above, check that the data are broadly correct, and apply consistent feature naming.

In [None]:
import os
os.chdir('..')

In [1]:
import pandas as pd
from geopy.geocoders import Nominatim
from geopy.distance import geodesic
from itertools import combinations
import re
import os

## Bike Stations' Location Data

Performed operations:

1. Rename and reorder columns
2. Calculate station capacity as the number of docked bikes plus the number of empty docks
3. Reverse-geocode each station's coordinates to obtain its full location description
4. Extract postal code and city from the station's address

In [2]:
stations = pd.read_csv('data/raw_data/Capital_Bike_Share_Locations.csv')

stations.columns = stations.columns.str.lower()
stations['station_capacity'] = stations['number_of_bikes'] + stations['number_of_empty_docks']
stations = stations[['address', 'terminal_number', 'latitude', 'longitude', 'station_capacity']]

stations.head()

| | address | terminal_number | latitude | longitude | station_capacity |
|---|---|---|---|---|---|
| 0 | 4th St & Madison Dr NW | 31288 | 38.890493 | -77.017253 | 25 |
| 1 | Henry Bacon Dr & Lincoln Memorial Circle NW | 31289 | 38.890544 | -77.049379 | 10 |
| 2 | 17th St & Independence Ave SW | 31290 | 38.888097 | -77.038325 | 18 |
| 3 | Franklin St & S Washington St | 31907 | 38.798133 | -77.0487 | 18 |
| 4 | Mount Vernon Ave & Four Mile Run Park | 31909 | 38.843422 | -77.064016 | 12 |


In [3]:
# Recent geopy releases require an explicit user_agent; the value below is a placeholder
geolocator = Nominatim(user_agent='bike-rental-preprocessing')

In [4]:
def find_location(coordinates):
    location = geolocator.reverse(coordinates, timeout=30)
    return(location[0])

In [5]:
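# Build a "lat, lon" string per station; label-based access avoids positional indexing on a labelled row
stations['coordinates'] = stations[['latitude', 'longitude']].apply(lambda x: ', '.join([str(x['latitude']), str(x['longitude'])]), axis=1)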
stations['location'] = stations['coordinates'].apply(find_location)

In [6]:
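def find_postalCode(location):
    # The postal code is expected in the second-to-last comma-separated field
    code = location.split(',')[-2].strip()
    try:
        code = int(code[:5])
    except ValueError:
        # Field is not numeric (no postal code in the address): fall back to 0
        code = 0
    return code

def find_city(location):
    # The city/state is expected in the third-to-last comma-separated field
    city = location.split(',')[-3].strip()
    if city == 'Arlington County':
        city = 'Virginia'
    return city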
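
Both helpers depend on where a field sits in the comma-separated string returned by the geocoder. The sketch below replays that logic on two made-up address strings (they are illustrative assumptions, not actual Nominatim responses) to show the normal case and the fallback when no postal code is present. The function names are local to this example.

```python
def parse_postal_code(location):
    """Return the 5-digit code from the second-to-last field, or 0 if it is not numeric."""
    field = location.split(',')[-2].strip()
    try:
        return int(field[:5])
    except ValueError:
        return 0

def parse_city(location):
    """Return the third-to-last field, mapping 'Arlington County' to 'Virginia'."""
    city = location.split(',')[-3].strip()
    return 'Virginia' if city == 'Arlington County' else city

# Hypothetical address strings, for illustration only
with_code = "4th St and Madison Dr NW, Madison Drive Northwest, Washington, D.C., 20301, USA"
without_code = "Some Park, Arlington County, Virginia, USA"

results = [(parse_postal_code(s), parse_city(s)) for s in (with_code, without_code)]
print(results)
```

Because the parsing is purely positional, an address with an extra or missing field silently shifts the result, so spot-checking rows with `postal_code == 0` is a reasonable sanity check.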

In [7]:
stations['postal_code'] = stations['location'].apply(find_postalCode)
stations['city'] = stations['location'].apply(find_city)
stations = stations.drop('coordinates', axis=1)

cols = stations.columns.to_list()
cols = [cols.pop(cols.index('terminal_number'))] + cols
stations = stations[cols]

stations.rename({'terminal_number': 'station_id'}, axis=1, inplace=True)

stations.head()

| | station_id | address | latitude | longitude | station_capacity | location | postal_code | city |
|---|---|---|---|---|---|---|---|---|
| 0 | 31288 | 4th St & Madison Dr NW | 38.890493 | -77.017253 | 25 | 4th St and Madison Dr NW, Madison Drive Northw... | 20301 | D.C. |
| 1 | 31289 | Henry Bacon Dr & Lincoln Memorial Circle NW | 38.890544 | -77.049379 | 10 | West Potomac Park, Independence Avenue Southwe... | 20418 | D.C. |
| 2 | 31290 | 17th St & Independence Ave SW | 38.888097 | -77.038325 | 18 | National Mall, Independence Avenue Southwest, ... | 20227 | D.C. |
| 3 | 31907 | Franklin St & S Washington St | 38.798133 | -77.0487 | 18 | Franklin and S Washington St, Franklin Street,... | 22314 | Virginia |
| 4 | 31909 | Mount Vernon Ave & Four Mile Run Park | 38.843422 | -77.064016 | 12 | Four Mile Run Park, 4131, Mount Vernon Avenue,... | 22305 | Virginia |


In [8]:
stations.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 535 entries, 0 to 534
Data columns (total 8 columns):
station_id          535 non-null int64
address             535 non-null object
latitude            535 non-null float64
longitude           535 non-null float64
station_capacity    535 non-null int64
location            535 non-null object
postal_code         535 non-null int64
city                535 non-null object
dtypes: float64(2), int64(3), object(3)
memory usage: 33.5+ KB


In [9]:
stations.to_parquet('data/db_stations.parquet', index=False)

In [None]:
stations = None

## Bike Rental Data

Performed operations:
1. Rename and reorder columns, drop some of them
2. Ensure correct format of dates
3. Add information about city as a separate column

In [10]:
bikeRental = pd.read_parquet('data/raw_data/bikeRental.parquet')

In [14]:
bikeRental = bikeRental.drop(['Start station', 'End station'], axis=1)
bikeRental.columns = ['duration', 'start_date', 'end_date', 'start_station_id', 'end_station_id', 'bike_id', 'member_type']

cols = bikeRental.columns.to_list()
cols.append(cols.pop(cols.index('duration')))
bikeRental = bikeRental[cols]

bikeRental.reset_index(drop=True, inplace=True)
bikeRental.insert(0, 'rental_id', bikeRental.index)

bikeRental.head()

| | rental_id | start_date | end_date | start_station_id | end_station_id | bike_id | member_type | duration |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 2010-09-20 11:27:04 | 2010-09-20 11:43:56 | 31208 | 31108 | W00742 | Member | 1012 |
| 1 | 1 | 2010-09-20 11:41:22 | 2010-09-20 11:42:23 | 31209 | 31209 | W00032 | Member | 61 |
| 2 | 2 | 2010-09-20 12:05:37 | 2010-09-20 12:50:27 | 31600 | 31100 | W00993 | Member | 2690 |
| 3 | 3 | 2010-09-20 12:06:05 | 2010-09-20 12:29:32 | 31600 | 31602 | W00344 | Member | 1406 |
| 4 | 4 | 2010-09-20 12:10:43 | 2010-09-20 12:34:17 | 31100 | 31201 | W00883 | Member | 1413 |


In [16]:
bikeRental['start_date'] = pd.to_datetime(bikeRental['start_date'])
bikeRental['end_date'] = pd.to_datetime(bikeRental['end_date'])

In [20]:
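# Reload the stations table (the variable was released earlier) and keep only the columns needed here
stations = pd.read_parquet('data/db_stations.parquet')[['station_id', 'city']]

# Attach the city of the start station, then drop the redundant join key
bikeRental = bikeRental.merge(stations, how='left', left_on='start_station_id', right_on='station_id').drop('station_id', axis=1)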
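
A left merge keeps every rental and leaves `city` empty when the start station is missing from the stations table. The sketch below shows this on two small made-up frames (the rows are illustrative, not taken from the real data). Passing `validate='many_to_one'` is an optional safeguard, added here as an assumption about what is desirable, that raises an error if a station id appears twice and would otherwise duplicate rentals.

```python
import pandas as pd

# Toy data for illustration only
rentals = pd.DataFrame({
    'rental_id': [0, 1, 2],
    'start_station_id': [31208, 31907, 99999],  # 99999 has no matching station
})
stations = pd.DataFrame({
    'station_id': [31208, 31907],
    'city': ['D.C.', 'Virginia'],
})

merged = rentals.merge(
    stations[['station_id', 'city']],
    how='left',
    left_on='start_station_id',
    right_on='station_id',
    validate='many_to_one',  # fail loudly if station_id is not unique
).drop(columns='station_id')

# Rentals whose start station was not found keep a missing city
unmatched = int(merged['city'].isna().sum())
print(merged)
print('rentals without a city:', unmatched)
```

Counting the unmatched rows after the merge gives a quick measure of how many rentals start at stations that are absent from the location data.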

In [21]:
bikeRental.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 22969237 entries, 0 to 22969236
Data columns (total 9 columns):
rental_id           int64
start_date          datetime64[ns]
end_date            datetime64[ns]
start_station_id    int64
end_station_id      int64
bike_id             object
member_type         object
duration            int64
city                object
dtypes: datetime64[ns](2), int64(4), object(3)
memory usage: 1.7+ GB


In [22]:
bikeRental.to_parquet('data/db_rentals.parquet', index=False)

In [23]:
bikeRental = None

## Weather Data

Weather data need to be loaded from separate files and joined into one table.
Other performed operations:

1. Rename and reorder columns, drop some of them
2. Calculate new column containing datetime of weather observation
3. Format numeric values
4. Extract information about clouds, rain, snowfall and thunder from weather description

In [40]:
weather_dir = 'data/raw_data/weather_monthly/'
# Read every monthly file from weather_dir and stack them into one table
weather = pd.concat([pd.read_parquet(os.path.join(weather_dir, i)) for i in os.listdir(weather_dir)])
print(len(weather))

85528


In [41]:
def format_numbers(number):
    # Keep only digits, the decimal point and the minus sign, then truncate to an integer
    number = float(re.sub(r'[^0-9.-]+', '', number))
    return int(number)

def clouds_analysis(details):
    details = details.split(' at ')[0].lower()
    cloud_encoding = {
        'clear': 'clear',
        'scattered clouds': 'partly cloudy',
        'few clouds': 'cloudy',
        'broken clouds': 'cloudy',
        'cloudy': 'cloudy'
    }
    for i in cloud_encoding.keys():
        if i in details:
            return(cloud_encoding[i])
    return('')

def check_weatherConditions(details, fall_type):
    # Flag the condition wherever it appears, including at the very start of the description
    return int(fall_type in details.lower())
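
When flagging rain, snow or thunder, note that `str.find` returns 0 for a match at the very start of the string and -1 when there is no match, so a flag should be set for any non-negative position. A minimal sketch with made-up descriptions (the strings and the helper name `has_condition` are assumptions for illustration):

```python
def has_condition(details, keyword):
    """Return 1 if keyword occurs anywhere in details (case-insensitive), else 0."""
    return int(keyword in details.lower())

# Positions reported by str.find for three hypothetical descriptions
pos_start = "Rain. Broken clouds.".lower().find('rain')     # match at the start
pos_middle = "Light rain. Overcast.".lower().find('rain')   # match inside the text
pos_missing = "Clear.".lower().find('rain')                 # no match

flags = [has_condition(d, 'rain')
         for d in ("Rain. Broken clouds.", "Light rain. Overcast.", "Clear.")]
print(pos_start, pos_middle, pos_missing, flags)
```

The membership test `keyword in text` avoids reasoning about positions altogether.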

In [42]:
weather['datetime']= pd.to_datetime(weather['date'] + ' ' + weather['Time'])
weather['date'] = pd.to_datetime(weather['date'])
weather.drop(['Wind Gust', 'Dew Point', 'Icon'], inplace=True, axis=1)
weather.columns = ['time', 'temperature', 'relative_temperature', 'wind', 'relative_humidity', 'pressure', 'details', 'date', 'datetime']

In [43]:
weather['wind'] = weather['wind'].apply(lambda x: x.replace('Calm', '0').replace('°','.'))
weather['details'] = weather['details'].apply(lambda x: x.split(';')[1])

for col in ['temperature', 'relative_temperature', 'wind', 'relative_humidity', 'pressure']:
    weather[col] = weather[col].apply(format_numbers)

weather['relative_humidity'] = weather['relative_humidity']/100
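
The numeric clean-up can be checked on a few sample strings. The raw formats below (`'-1 °C'`, `'1017 mbar'`, `'86%'`) are assumptions about what the source files contain, and `to_int` is a name local to this sketch. It strips every character that is not a digit, a decimal point or a minus sign, then truncates to an integer.

```python
import re

def to_int(text):
    """Strip units and symbols from a measurement string and truncate it to an int."""
    cleaned = re.sub(r'[^0-9.-]+', '', text)
    return int(float(cleaned))

# Hypothetical raw values, for illustration only
temperatures = [to_int(s) for s in ['-1 °C', '27 °C']]
pressure = to_int('1017 mbar')
humidity = to_int('86%') / 100      # percentage converted to a 0-1 fraction
truncated = to_int('3.9 km/h')      # int() truncates rather than rounds

print(temperatures, pressure, humidity, truncated)
```

Since `int()` truncates toward zero, fractional readings lose their decimal part; `round()` would be the alternative if rounding is preferred.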

In [44]:
weather['clouds'] = weather['details'].apply(clouds_analysis)
weather['rain'] = weather['details'].apply(check_weatherConditions, fall_type='rain')
weather['snow'] = weather['details'].apply(check_weatherConditions, fall_type='snow')
weather['thunder'] = weather['details'].apply(check_weatherConditions, fall_type='thunder')

In [47]:
weather = weather[['datetime', 'date', 'time', 'temperature', 'relative_temperature', 'wind', 'relative_humidity', 'pressure', 'clouds', 'rain', 'snow', 'thunder']]
weather.sample(5)

| | datetime | date | time | temperature | relative_temperature | wind | relative_humidity | pressure | clouds | rain | snow | thunder |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 18 | 2013-02-02 23:52:00 | 2013-02-02 | 23:52 | -1 | -1 | 0 | 0.86 | 1017 | cloudy | 0 | 0 | 0 |
| 20 | 2017-08-20 20:52:00 | 2017-08-20 | 20:52 | 27 | 29 | 130 | 0.7 | 1020 | clear | 0 | 0 | 0 |
| 6 | 2013-03-27 06:52:00 | 2013-03-27 | 06:52 | 4 | 1 | 300 | 0.61 | 1017 | cloudy | 0 | 0 | 0 |
| 10 | 2017-11-25 10:52:00 | 2017-11-25 | 10:52 | 11 | 11 | 170 | 0.54 | 1008 | partly cloudy | 0 | 0 | 0 |
| 5 | 2016-12-19 05:52:00 | 2016-12-19 | 05:52 | 0 | -4 | 360 | 0.6 | 1038 | cloudy | 0 | 0 | 0 |


In [48]:
weather.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 85528 entries, 0 to 23
Data columns (total 12 columns):
datetime                85528 non-null datetime64[ns]
date                    85528 non-null datetime64[ns]
time                    85528 non-null object
temperature             85528 non-null int64
relative_temperature    85528 non-null int64
wind                    85528 non-null int64
relative_humidity       85528 non-null float64
pressure                85528 non-null int64
clouds                  85528 non-null object
rain                    85528 non-null int64
snow                    85528 non-null int64
thunder                 85528 non-null int64
dtypes: datetime64[ns](2), float64(1), int64(7), object(2)
memory usage: 8.5+ MB


In [49]:
weather.to_parquet('data/db_weather.parquet', index=False)

In [50]:
weather = None

## Stations Distance

Since the majority of rides go from point A to point B and the locations of the bike stations are known, the code below calculates the distance between every pair of stations.

Performed operations:
1. Create a list of all possible pairs of stations
2. Calculate the distance based on coordinates
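
The two steps can be sketched without any third-party dependency. Note the substitution: the example uses the haversine great-circle formula on a spherical Earth rather than geopy's `geodesic`, so results differ slightly from the ellipsoidal distance. The three stations and their coordinates come from the table shown earlier; the names `haversine_km` and `EARTH_RADIUS_KM` are local to this sketch.

```python
from itertools import combinations
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius, spherical approximation

def haversine_km(a, b):
    """Great-circle distance in km between two (latitude, longitude) points in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(h))

stations = {
    31288: (38.890493, -77.017253),
    31289: (38.890544, -77.049379),
    31290: (38.888097, -77.038325),
}

# Compute each unordered pair once, then store it in both directions
distances = {}
for a, b in combinations(stations, 2):
    d = haversine_km(stations[a], stations[b])
    distances[(a, b)] = d
    distances[(b, a)] = d

print(len(distances), round(distances[(31288, 31289)], 2))
```

For n stations this produces n(n-1)/2 distance computations and n(n-1) stored rows, since each pair is evaluated once and mirrored.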

In [5]:
stations = pd.read_parquet('data/db_stations.parquet')
stations = stations[['station_id', 'latitude', 'longitude']]

In [6]:
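def find_distance(coordinates):
    # coordinates is one row with latitude_a, longitude_a, latitude_b, longitude_b
    a = (coordinates['latitude_a'], coordinates['longitude_a'])
    b = (coordinates['latitude_b'], coordinates['longitude_b'])

    return geodesic(a, b).km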

In [7]:
comb = list(combinations(stations['station_id'].unique(), 2))
df_distance = pd.DataFrame(comb).rename({0: 'station_a', 1: 'station_b'}, axis=1)

df_distance = df_distance.merge(stations, how='left', left_on='station_a', right_on='station_id')
df_distance = df_distance.merge(stations, how='left', left_on='station_b', right_on='station_id', suffixes=['_a', '_b'])
df_distance = df_distance[['station_a', 'station_b', 'latitude_a', 'longitude_a', 'latitude_b', 'longitude_b']]

In [8]:
df_distance['distance'] = df_distance[['latitude_a', 'longitude_a', 'latitude_b', 'longitude_b']].apply(find_distance, axis=1)

In [9]:
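# Mirror each pair so both (a, b) and (b, a) are present; coordinate columns are swapped to match
df_distance_inverse = df_distance.rename({
    'station_a': 'station_b', 'station_b': 'station_a',
    'latitude_a': 'latitude_b', 'latitude_b': 'latitude_a',
    'longitude_a': 'longitude_b', 'longitude_b': 'longitude_a'}, axis=1)
# DataFrame.append was removed in pandas 2.0; pd.concat aligns the columns by name
df_distance = pd.concat([df_distance, df_distance_inverse], ignore_index=True, sort=False)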


In [10]:
df_distance.to_parquet('data/db_stations_distance.parquet', index=False)

In [12]:
df_distance = None