# Import Dependencies

**Note:** PySpark has its own [implementation of the pandas API](https://spark.apache.org/docs/3.2.0/api/python/reference/pyspark.pandas/frame.html), which provides the same functionality but runs on distributed clusters under the hood. There are some differences from pandas, and the methods it implements are listed in the PySpark documentation linked above.

In [1]:
# import pyspark.pandas as pd
import pandas as pd
import os
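
The commented-out import above hints that either backend could be used. A minimal sketch of one way to switch between them, assuming the only goal is to keep the `pd` alias working whether or not PySpark is installed (the `backend` variable is a hypothetical name used for illustration, and not every call in this notebook is guaranteed to behave identically on both backends):

```python
# Sketch: prefer the distributed pandas API when PySpark is available,
# otherwise fall back to plain pandas. The alias `pd` is kept either way.
try:
    import pyspark.pandas as pd  # pandas API on Spark (PySpark 3.2+)
    backend = "pyspark.pandas"   # hypothetical helper variable
except ImportError:
    import pandas as pd
    backend = "pandas"

print(f"Using {backend}")
```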

# Load data

In [2]:
path_to_data = os.path.join('.', 'data')
datas = [pd.read_csv(os.path.join(path_to_data, csv), encoding='utf-8')
         for csv in os.listdir(path_to_data) if 'uber-raw-data' in csv]
raw_data = pd.concat(datas, axis=0)

print(raw_data.shape)
raw_data.head()

(4534327, 4)


,Date/Time,Lat,Lon,Base
0,4/1/2014 0:11:00,40.769,-73.9549,B02512
1,4/1/2014 0:17:00,40.7267,-74.0345,B02512
2,4/1/2014 0:21:00,40.7316,-73.9873,B02512
3,4/1/2014 0:28:00,40.7588,-73.9776,B02512
4,4/1/2014 0:33:00,40.7594,-73.9722,B02512
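
Because each CSV is read with its own 0-based index, stacking the frames with `pd.concat` keeps repeated index labels. The sketch below illustrates this on two tiny in-memory stand-ins for the files (`part_a` and `part_b` are hypothetical names; the rows are copied from the preview above) and shows `ignore_index=True` as one possible way to renumber the rows. It is not what the notebook itself does.

```python
import io
import pandas as pd

# Two small stand-ins for separate CSV files (hypothetical split of the preview rows)
part_a = io.StringIO(
    "Date/Time,Lat,Lon,Base\n"
    "4/1/2014 0:11:00,40.769,-73.9549,B02512\n"
    "4/1/2014 0:17:00,40.7267,-74.0345,B02512\n"
)
part_b = io.StringIO(
    "Date/Time,Lat,Lon,Base\n"
    "4/1/2014 0:21:00,40.7316,-73.9873,B02512\n"
    "4/1/2014 0:28:00,40.7588,-73.9776,B02512\n"
)

frames = [pd.read_csv(buf) for buf in (part_a, part_b)]

# Default behaviour: each frame keeps its own index labels
stacked = pd.concat(frames, axis=0)
# Alternative: discard the per-file labels and renumber from 0
renumbered = pd.concat(frames, axis=0, ignore_index=True)

print(stacked.index.tolist())     # [0, 1, 0, 1]
print(renumbered.index.tolist())  # [0, 1, 2, 3]
```

Repeated labels are harmless for the aggregations in this notebook, but they matter if rows are later selected with `.loc`.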


# Data Descriptions

In [3]:
raw_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4534327 entries, 0 to 1028135
Data columns (total 4 columns):
 #   Column     Dtype  
---  ------     -----  
 0   Date/Time  object 
 1   Lat        float64
 2   Lon        float64
 3   Base       object 
dtypes: float64(2), object(2)
memory usage: 173.0+ MB


**There are no missing values in the 2014 Uber dataset**

In [4]:
raw_data.isna().sum()

Date/Time    0
Lat          0
Lon          0
Base         0
dtype: int64

**There are 5 unique Base values. The share of all trips belonging to each one is shown below.**

In [5]:
raw_data.get('Base').value_counts(normalize = True)

B02617    0.321735
B02598    0.307237
B02682    0.267468
B02764    0.058200
B02512    0.045359
Name: Base, dtype: float64
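
To see how `normalize = True` relates to the raw counts, here is a small sketch on made-up data (the four labels below are illustrative, not the real distribution). It places counts and proportions side by side. The column names `count` and `share` are assumptions chosen for the example.

```python
import pandas as pd

# Illustrative labels only; not the real distribution
bases = pd.Series(["B02617", "B02617", "B02598", "B02512"], name="Base")

counts = bases.value_counts()                 # absolute frequencies
shares = bases.value_counts(normalize=True)   # frequencies divided by the total

# Align both Series on the Base label and give the columns explicit names
summary = pd.concat([counts, shares], axis=1, keys=["count", "share"])

print(bases.nunique())  # 3
print(summary)
```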

**There are "hot spots" where many trips share the same or a very similar location; the most frequent coordinate pair appears 2,299 times**

In [6]:
location_counts = raw_data.groupby('Lat')['Lon'].value_counts(dropna = True).sort_values()
location_counts

Lat      Lon     
39.6569  -74.2258       1
40.7435  -73.9560       1
         -73.9568       1
         -73.9573       1
         -73.9712       1
                     ... 
40.7741  -73.8726    1921
40.6449  -73.7822    1947
40.6448  -73.7820    2079
40.7685  -73.8625    2257
40.6448  -73.7819    2299
Name: Lon, Length: 574558, dtype: int64
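
The counts above treat coordinates as distinct unless they match to the last decimal, so near-identical points such as (40.6448, -73.7819) and (40.6448, -73.7820) are tallied separately. One way to merge "similar" locations is to round the coordinates before grouping. The sketch below assumes three decimal places (roughly 100 m in latitude), which is an illustrative choice rather than a value taken from the notebook, and uses a handful of hypothetical rows.

```python
import pandas as pd

# Hypothetical trips: three near-identical points plus one elsewhere
trips = pd.DataFrame({
    "Lat": [40.6448, 40.6448, 40.6449, 40.7741],
    "Lon": [-73.7819, -73.7820, -73.7822, -73.8726],
})

# Exact matching: every distinct (Lat, Lon) pair is its own group
exact = trips.groupby(["Lat", "Lon"]).size().sort_values()

# Coarser matching: round to 3 decimals (an assumed precision) before grouping
coarse = (
    trips.round(3)
    .groupby(["Lat", "Lon"])
    .size()
    .sort_values(ascending=False)
)

print(len(exact))      # 4
print(coarse.iloc[0])  # 3
```

Grouping on both columns with `.size()` produces the same kind of (Lat, Lon) count as `groupby('Lat')['Lon'].value_counts()`, just written as a single grouping step.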

# Data Preprocessing

**Time-Series Totals**

In [20]:
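# Parse the timestamps (note: infer_datetime_format is deprecated as of pandas 2.0)
date_time = pd.to_datetime(raw_data.get("Date/Time"), infer_datetime_format = True)
# In the pandas version used here, value_counts() keeps this name for the resulting counts
date_time.name = 'Ride Count'

# Total rides per calendar day, plus a trailing 7-day average
date_only = date_time.dt.date.value_counts(dropna = True).sort_index()
pd.concat([date_only, date_only.rolling(window = 7, min_periods = 1).mean()],
          axis = 1, keys = ['Total Rides', '7-Day Average Rides']).to_csv( os.path.join(path_to_data, 'rides_by_date.csv') )

# Total rides per weekday (sort_index orders the day names alphabetically)
date_time.dt.day_name().value_counts(dropna = True).sort_index().to_csv( os.path.join(path_to_data, 'total_rides_by_day.csv') )

# Total rides per (Lat, Lon) pair
location_counts.name = 'Ride Count'
location_counts.to_csv( os.path.join(path_to_data, 'total_rides_by_location.csv') )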
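
Calling `sort_index()` on weekday names orders them alphabetically, so the weekday file starts with Friday rather than Monday. If calendar order is wanted downstream, one option is to reindex against an explicit list of day names. The sketch below uses hypothetical timestamps in the dataset's format. The explicit `format` string and the `week` list are assumptions made for the example.

```python
import pandas as pd

# Hypothetical timestamps: a Friday, a Sunday and two on a Monday
stamps = pd.to_datetime(
    pd.Series([
        "4/4/2014 0:11:00",
        "4/6/2014 8:00:00",
        "4/7/2014 9:30:00",
        "4/7/2014 18:45:00",
    ]),
    format="%m/%d/%Y %H:%M:%S",  # assumed layout: month/day/year
)

week = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]

counts = stamps.dt.day_name().value_counts()
alphabetical = counts.sort_index()             # what sort_index() yields
calendar = counts.reindex(week, fill_value=0)  # Monday-first, missing days set to 0

print(alphabetical.index.tolist())  # ['Friday', 'Monday', 'Sunday']
print(calendar.tolist())            # [2, 0, 0, 0, 1, 0, 1]
```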

**Daily Totals and 7-Day Average**

In [31]:
date_only = date_time.dt.date.value_counts(dropna = True).sort_index()

In [28]:
pd.DataFrame([date_only, date_only.rolling(window = 7, min_periods = 1).mean()])

,2014-04-01,2014-04-02,2014-04-03,2014-04-04,2014-04-05,2014-04-06,2014-04-07,2014-04-08,2014-04-09,2014-04-10,...,2014-09-21,2014-09-22,2014-09-23,2014-09-24,2014-09-25,2014-09-26,2014-09-27,2014-09-28,2014-09-29,2014-09-30
Ride Count,14546.0,17474.0,20701.0,26714.0,19521.0,13445.0,19550.0,16188.0,16843.0,20041.0,...,28620.0,28312.0,30316.0,31301.0,38203.0,37504.0,39468.0,29656.0,29201.0,33431.0
Ride Count,14546.0,16010.0,17573.666667,19858.75,19791.2,18733.5,18850.142857,19084.714286,18994.571429,18900.285714,...,35693.142857,35530.0,34704.857143,34100.571429,33804.714286,33302.857143,33389.142857,33537.142857,33664.142857,34109.142857


,Total Rides,7-Day Average Rides
2014-04-01,14546,14546.000000
2014-04-02,17474,16010.000000
2014-04-03,20701,17573.666667
2014-04-04,26714,19858.750000
2014-04-05,19521,19791.200000
...,...,...
2014-09-26,37504,33302.857143
2014-09-27,39468,33389.142857
2014-09-28,29656,33537.142857
2014-09-29,29201,33664.142857


In [30]:
date_only

,Ride Count,Ride Count.1
2014-04-01,14546,14546.000000
2014-04-02,17474,16010.000000
2014-04-03,20701,17573.666667
2014-04-04,26714,19858.750000
2014-04-05,19521,19791.200000
...,...,...
2014-09-26,37504,33302.857143
2014-09-27,39468,33389.142857
2014-09-28,29656,33537.142857
2014-09-29,29201,33664.142857
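
The outputs above differ only in how the two Series are combined. As a hedged sketch on the first three daily totals from this notebook: passing a list of Series to `pd.DataFrame` lays each Series out as a row, `pd.concat(..., axis = 1)` lays them out as columns but keeps the duplicated name, and the `keys` argument assigns distinct column labels. The variable names below are illustrative.

```python
import pandas as pd

# First three daily totals from the notebook's output
totals = pd.Series(
    [14546, 17474, 20701],
    index=["2014-04-01", "2014-04-02", "2014-04-03"],
    name="Ride Count",
)
averages = totals.rolling(window=7, min_periods=1).mean()  # keeps the name 'Ride Count'

as_rows = pd.DataFrame([totals, averages])       # one row per Series
as_cols = pd.concat([totals, averages], axis=1)  # one column per Series, same name twice
labelled = pd.concat(
    [totals, averages], axis=1,
    keys=["Total Rides", "7-Day Average Rides"],  # distinct column labels
)

print(as_rows.shape)              # (2, 3)
print(as_cols.columns.tolist())   # ['Ride Count', 'Ride Count']
print(labelled.columns.tolist())  # ['Total Rides', '7-Day Average Rides']
```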


In [26]:
date_time.dt.date.value_counts(dropna = True).sort_index().rolling(window = 7, min_periods = 1).mean()

2014-04-01    14546.000000
2014-04-02    16010.000000
2014-04-03    17573.666667
2014-04-04    19858.750000
2014-04-05    19791.200000
                  ...     
2014-09-26    33302.857143
2014-09-27    33389.142857
2014-09-28    33537.142857
2014-09-29    33664.142857
2014-09-30    34109.142857
Name: Ride Count, Length: 183, dtype: float64
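
The early values of the rolling series equal the mean of however many days are available so far, which is the effect of `min_periods = 1`. A small sketch reproducing the first five values from the daily totals shown in this notebook, and contrasting them with the default behaviour, where a 7-day window yields NaN until seven observations exist (the names `lenient` and `strict` are illustrative):

```python
import pandas as pd

# Daily totals for 2014-04-01 through 2014-04-05, taken from the notebook's output
daily = pd.Series(
    [14546, 17474, 20701, 26714, 19521],
    index=pd.date_range("2014-04-01", periods=5, freq="D"),
    name="Total Rides",
)

# min_periods=1: average over whatever part of the 7-day window exists
lenient = daily.rolling(window=7, min_periods=1).mean()
# default min_periods (equal to the window size): NaN until 7 values are available
strict = daily.rolling(window=7).mean()

print(lenient.round(2).tolist())  # [14546.0, 16010.0, 17573.67, 19858.75, 19791.2]
print(strict.isna().all())        # True
```

For example, the second value is (14546 + 17474) / 2 = 16010, matching the output above.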