The purpose of this notebook is to inspect the May pickle file and find opportunities for data cleanup. Ideally, we want:

1. No null values in columns where we need data.
2. No time periods where a scooter's charge level was at 0%, to rule out a dead battery as the reason for non-use.
3. Time periods condensed to a frame of weekday rush hours.
4. The resulting dataframe saved in a new pkl file.

Then, we will use that data to plot these points on our promise zone map in a separate notebook.

In [1]:
import pandas as pd
import pickle
import matplotlib.pyplot as plt

In [2]:
may = pd.read_pickle('../data/may.pkl')
may.head()

Unnamed: 0,pubdatetime,latitude,longitude,sumdid,chargelevel,companyname
0,2019-05-01 00:01:41.247,36.136822,-86.799877,PoweredLIRL1,93.0,0
1,2019-05-01 00:01:41.247,36.191252,-86.772945,PoweredXWRWC,35.0,0
2,2019-05-01 00:01:41.247,36.144752,-86.806293,PoweredMEJEH,90.0,0
3,2019-05-01 00:01:41.247,36.162056,-86.774688,Powered1A7TC,88.0,0
4,2019-05-01 00:01:41.247,36.150973,-86.783109,Powered2TYEF,98.0,0


In [3]:
may.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 20283582 entries, 0 to 20292502
Data columns (total 6 columns):
 #   Column       Dtype         
---  ------       -----         
 0   pubdatetime  datetime64[ns]
 1   latitude     float64       
 2   longitude    float64       
 3   sumdid       object        
 4   chargelevel  float64       
 5   companyname  int64         
dtypes: datetime64[ns](1), float64(3), int64(1), object(1)
memory usage: 1.1+ GB


Find null values.

In [4]:
may.isnull().sum()

pubdatetime      0
latitude         0
longitude        0
sumdid           0
chargelevel    283
companyname      0
dtype: int64

Remove the null chargelevel rows.

In [5]:
may = may.dropna()
may.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 20283299 entries, 0 to 20292502
Data columns (total 6 columns):
 #   Column       Dtype         
---  ------       -----         
 0   pubdatetime  datetime64[ns]
 1   latitude     float64       
 2   longitude    float64       
 3   sumdid       object        
 4   chargelevel  float64       
 5   companyname  int64         
dtypes: datetime64[ns](1), float64(3), int64(1), object(1)
memory usage: 1.1+ GB
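Here `dropna()` with no arguments is enough, because `chargelevel` is the only column that reported nulls. If other columns could hold nulls that we would rather keep, one option is to restrict the drop to the column we care about. A minimal sketch on a made-up toy frame (the values and the `toy` and `cleaned` names are illustrative only, not taken from the May data):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the scooter data (values are made up).
toy = pd.DataFrame({
    'sumdid': ['A', 'B', 'C', 'D'],
    'latitude': [36.10, np.nan, 36.12, 36.13],
    'chargelevel': [93.0, 35.0, np.nan, 88.0],
})

# Drop a row only when chargelevel is missing; row 'B' survives even
# though its latitude is null.
cleaned = toy.dropna(subset=['chargelevel'])

print(cleaned['sumdid'].tolist())
```

Whether a null latitude should also cost a row depends on what the later map plot needs, so treat the choice of `subset` as an assumption to revisit.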


Count the rows where chargelevel is 0% and remove them.

In [6]:
(may['chargelevel'] == 0).sum()

1025190

In [7]:
may = may[may['chargelevel'] > 0]
(may['chargelevel'] == 0).sum()

0
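The 1,025,190 zero-charge rows are roughly 5% of the 20,283,299 rows left after dropping nulls. The count and the share can both be read off the same boolean mask. A small sketch on made-up values (the names `zero_mask`, `n_zero`, `share_zero`, and `charged` are illustrative):

```python
import pandas as pd

# Toy charge levels (made up); two of the five readings are at 0%.
toy = pd.DataFrame({'chargelevel': [93.0, 0.0, 35.0, 0.0, 1.0]})

# True where the scooter reported an empty battery.
zero_mask = toy['chargelevel'] == 0

n_zero = int(zero_mask.sum())    # number of 0% rows
share_zero = zero_mask.mean()    # fraction of rows at 0%

# Keep everything that is not at 0%.
charged = toy[~zero_mask]

print(n_zero, share_zero, len(charged))
```

Note that `~zero_mask` and `chargelevel > 0` only select the same rows when there are no nulls or negative values, which holds in this notebook because the nulls were dropped first.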

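Condense time periods to a frame of rush hours, 7 - 9am and 4 - 6pm. The filter compares whole hour values inclusively, so it keeps readings from 7:00 through 9:59am and from 4:00 through 6:59pm.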

In [18]:
may['hour'] = pd.to_datetime(may['pubdatetime']).dt.hour

may_by_hour = may.loc[((may['hour'] >= 7) & (may['hour'] <=9)) | ((may['hour'] >= 16) & (may['hour'] <= 18))] 
print(may_by_hour.head(20))
print(may_by_hour.tail(20))

                   pubdatetime   latitude  longitude      sumdid  chargelevel  \
154603 2019-05-01 07:00:03.897  36.121480 -86.770450  Powered447         66.0   
154604 2019-05-01 07:00:03.897  36.121393 -86.770228  Powered695         96.0   
154605 2019-05-01 07:00:03.897  36.144292 -86.811540  Powered341         90.0   
154606 2019-05-01 07:00:03.897  36.121616 -86.770332  Powered351         61.0   
154607 2019-05-01 07:00:03.897  36.121575 -86.770093  Powered759         41.0   
154608 2019-05-01 07:00:03.897  36.121421 -86.770544  Powered658         98.0   
154609 2019-05-01 07:00:03.897  36.160729 -86.777545  Powered373         56.0   
154610 2019-05-01 07:00:03.897  36.127284 -86.789176  Powered384         95.0   
154611 2019-05-01 07:00:03.897  36.141765 -86.813116  Powered704         58.0   
154612 2019-05-01 07:00:03.897  36.121581 -86.770304  Powered515          3.0   
154614 2019-05-01 07:00:03.897  36.158274 -86.793622  Powered566         98.0   
154615 2019-05-01 07:00:03.8
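Because the filter above compares whole hour values inclusively, it keeps readings stamped 7:00 through 9:59 and 16:00 through 18:59. Listing the hours explicitly makes that boundary behavior easier to see. A sketch on made-up timestamps (the `RUSH_HOURS` constant and the `toy` frame are assumptions for illustration):

```python
import pandas as pd

# Made-up timestamps placed on either side of each window boundary.
toy = pd.DataFrame({
    'pubdatetime': pd.to_datetime([
        '2019-05-01 06:59:59',
        '2019-05-01 07:00:00',
        '2019-05-01 09:59:59',
        '2019-05-01 10:00:00',
        '2019-05-01 16:00:00',
        '2019-05-01 18:59:59',
        '2019-05-01 19:00:00',
    ])
})

# Same hours the inclusive comparisons select: 7, 8, 9 and 16, 17, 18.
RUSH_HOURS = [7, 8, 9, 16, 17, 18]

hour = toy['pubdatetime'].dt.hour
in_rush = hour.isin(RUSH_HOURS)
rush = toy[in_rush]

print(rush)
```

If the windows should instead end exactly at 9:00 and 18:00, dropping 9 and 18 from the list would be one way to do it.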

Make a new column for day of the week and take out Saturdays and Sundays. pandas numbers Monday as 0, so Saturday is 5 and Sunday is 6.

In [21]:
may_by_hour['DOW'] = pd.to_datetime(may['pubdatetime']).dt.dayofweek
may_rush_hour = may_by_hour.loc[(may_by_hour['DOW'] != 5) & (may_by_hour['DOW'] != 6)]
print(may_rush_hour.head())

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  may_by_hour['DOW'] = pd.to_datetime(may['pubdatetime']).dt.dayofweek


                   pubdatetime   latitude  longitude      sumdid  chargelevel  \
154603 2019-05-01 07:00:03.897  36.121480 -86.770450  Powered447         66.0   
154604 2019-05-01 07:00:03.897  36.121393 -86.770228  Powered695         96.0   
154605 2019-05-01 07:00:03.897  36.144292 -86.811540  Powered341         90.0   
154606 2019-05-01 07:00:03.897  36.121616 -86.770332  Powered351         61.0   
154607 2019-05-01 07:00:03.897  36.121575 -86.770093  Powered759         41.0   

        companyname  hour  DOW  
154603            2     7    2  
154604            2     7    2  
154605            2     7    2  
154606            2     7    2  
154607            2     7    2  
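The SettingWithCopyWarning above appears because `may_by_hour` was created by filtering `may`, and a new column was then assigned to it. One common way to avoid the warning is to take an explicit `.copy()` of the filtered frame before adding columns. A minimal sketch on made-up rows (the `toy`, `subset`, and `weekdays` names are illustrative):

```python
import pandas as pd

# Made-up readings spanning a weekend.
toy = pd.DataFrame({
    'pubdatetime': pd.to_datetime([
        '2019-05-03 08:00:00',  # Friday
        '2019-05-04 08:00:00',  # Saturday
        '2019-05-05 17:00:00',  # Sunday
        '2019-05-06 17:00:00',  # Monday
    ]),
    'chargelevel': [66.0, 96.0, 90.0, 61.0],
})

# .copy() makes the filtered frame independent of toy, so assigning a
# new column to it does not trigger SettingWithCopyWarning.
subset = toy.loc[toy['chargelevel'] > 60].copy()

# dayofweek: Monday=0 ... Sunday=6, computed from the subset's own column.
subset['DOW'] = subset['pubdatetime'].dt.dayofweek

# Weekdays are 0 through 4, so "< 5" removes Saturday and Sunday.
weekdays = subset.loc[subset['DOW'] < 5]

print(weekdays)
```

The copy costs extra memory, which may matter for a frame with millions of rows, so this is a trade-off rather than a free fix.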


Save to pkl file.

In [22]:
may_rush_hour.to_pickle('../data/may_rush_hour.pkl')
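Before relying on the new file in the mapping notebook, it can be worth reading it back once to confirm it round-trips. A sketch using a temporary directory and a one-row toy frame, so nothing under `../data/` is touched (the file name `toy_rush_hour.pkl` is hypothetical):

```python
import os
import tempfile

import pandas as pd

# One made-up row with the same kinds of columns as the saved frame.
toy = pd.DataFrame({
    'pubdatetime': pd.to_datetime(['2019-05-01 07:00:03.897']),
    'chargelevel': [66.0],
    'hour': [7],
    'DOW': [2],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_rush_hour.pkl')
    toy.to_pickle(path)
    restored = pd.read_pickle(path)

# equals() checks values and dtypes, so datetimes must survive as datetimes.
round_trip_ok = restored.equals(toy)

print(round_trip_ok)
```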