# Supervised ML

The goal of this model is to predict ridership within the University of Chicago Lyft Program Area. The features are the daily ridership counts of other Chicago community areas, plus weather. The labels are the daily ridership counts within the program area.

We will train the model on data from before the University Lyft program was introduced, then use the difference between its predictions and the actual ridership as a rough estimate of the program's effect on rideshare usage in the area. We will look at two changes: the introduction of the program, and its later reduction from 10 rides of up to 15 dollars each to 7 rides of up to 10 dollars each.
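
As a rough illustration of the comparison described above, the sketch below takes the daily gap between actual and predicted ridership and averages it over a post-program window. The `estimate_effect` helper and all of the numbers are hypothetical placeholders, not results from this project.

```python
from statistics import mean


def estimate_effect(actual, predicted):
    """Return the per-day gaps (actual - predicted) and their mean."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must cover the same days")
    gaps = [a - p for a, p in zip(actual, predicted)]
    return gaps, mean(gaps)


# Hypothetical daily ride counts in the program area after the program starts
actual = [520, 540, 610, 580]
# Hypothetical model predictions for the same days (model trained on pre-program data)
predicted = [400, 410, 450, 440]

gaps, avg_gap = estimate_effect(actual, predicted)
print(gaps)     # [120, 130, 160, 140]
print(avg_gap)  # 137.5
```

A positive average gap would suggest more rides than the pre-program pattern predicts. It is only a rough estimate, since anything else that changed over the same period is folded into the gap.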

In [1]:
# import packages and create the Spark environment
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

import matplotlib.pyplot as plt
%matplotlib inline

spark = SparkSession.builder.appName('supervised').getOrCreate()

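# change configuration settings on Spark
# (note: values set on an already-running context may not take effect until the session is restarted)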
conf = spark.sparkContext._conf.setAll([('spark.executor.memory', '4g'), ('spark.app.name', 'Spark Updated Conf'), ('spark.executor.cores', '4'), ('spark.cores.max', '4'), ('spark.driver.memory','4g')])

# print Spark configuration settings
spark.sparkContext.getConf().getAll()

[('spark.stage.maxConsecutiveAttempts', '10'),
 ('spark.dynamicAllocation.minExecutors', '1'),
 ('spark.eventLog.enabled', 'true'),
 ('spark.submit.pyFiles',
  '/root/.ivy2/jars/com.johnsnowlabs.nlp_spark-nlp_2.12-4.4.0.jar,/root/.ivy2/jars/graphframes_graphframes-0.8.2-spark3.1-s_2.12.jar,/root/.ivy2/jars/com.typesafe_config-1.4.2.jar,/root/.ivy2/jars/org.rocksdb_rocksdbjni-6.29.5.jar,/root/.ivy2/jars/com.amazonaws_aws-java-sdk-bundle-1.11.828.jar,/root/.ivy2/jars/com.github.universal-automata_liblevenshtein-3.0.0.jar,/root/.ivy2/jars/com.google.cloud_google-cloud-storage-2.16.0.jar,/root/.ivy2/jars/com.navigamez_greex-1.0.jar,/root/.ivy2/jars/com.johnsnowlabs.nlp_tensorflow-cpu_2.12-0.4.4.jar,/root/.ivy2/jars/it.unimi.dsi_fastutil-7.0.12.jar,/root/.ivy2/jars/org.projectlombok_lombok-1.16.8.jar,/root/.ivy2/jars/com.google.guava_guava-31.1-jre.jar,/root/.ivy2/jars/com.google.guava_failureaccess-1.0.1.jar,/root/.ivy2/jars/com.google.guava_listenablefuture-9999.0-empty-to-avoid-conflict-

In [3]:
# read in rideshare data for all years, concatenate, create appropriate partitioning
# we are dropping 2020 because COVID-19 disruptions would affect the performance of our model
df_2018 = spark.read.csv("gs://msca-bdp-student-gcs/bdp-rideshare-project/rideshare/processed_data/rides_2018.csv", inferSchema=True, header=True)
df_2019 = spark.read.csv("gs://msca-bdp-student-gcs/bdp-rideshare-project/rideshare/processed_data/rides_2019.csv", inferSchema=True, header=True)
df_2021 = spark.read.csv("gs://msca-bdp-student-gcs/bdp-rideshare-project/rideshare/processed_data/rides_2021.csv", inferSchema=True, header=True)
df_2022 = spark.read.csv("gs://msca-bdp-student-gcs/bdp-rideshare-project/rideshare/processed_data/rides_2022.csv", inferSchema=True, header=True)
df_all = df_2018.union(df_2019).union(df_2021).union(df_2022)
df_all.show(5)

                                                                                

+--------------------+-------------------+-------------------+-------+-----+------------+-------------+-----------+------------+----+---+-----+-------------+--------------+-------------+--------------+-----+------------+----+---+
|                  ID|    start_timestamp|      end_timestamp|seconds|miles|pickup_tract|dropoff_tract|pickup_area|dropoff_area|Fare|Tip|total|   pickup_lat|    pickup_lon|  dropoff_lat|   dropoff_lon|month|day_of_month|hour|day|
+--------------------+-------------------+-------------------+-------+-----+------------+-------------+-----------+------------+----+---+-----+-------------+--------------+-------------+--------------+-----+------------+----+---+
|625e77ae6e0ff7191...|2018-11-06 19:00:00|2018-11-06 19:15:00|   1142|  5.8| 17031063400|  17031010400|          6|           1|12.5|  0| 15.0|41.9346591566|-87.6467297286| 42.004764559| -87.659122427|   11|           6|  19|  3|
|62945fdb2e70957f0...|2018-11-06 19:00:00|2018-11-06 19:00:00|    341|  1.2| 170

In [5]:
# TODO: REPARTITION

# we will need a year column in this model
df_all = df_all.withColumn('year', F.year(df_all.start_timestamp))

## Notes for Harsh:

I'm assuming we are predicting with the full dataset and not restricting ourselves to rides within the program hours.

I started writing code that goes through the steps I think will be necessary. The code is unfinished because I ran out of time to test it all or to think formally through the problems I was seeing. Feel free to change things or make your own assumptions.

Here is the process I was thinking of. I was trying all of this on a sample dataframe so I could code faster.
1. Get daily counts for each community area
2. Pivot so that there is a column for each community area (the Hyde Park, Woodlawn, and Kenwood columns form y; every other column is a feature)
3. Merge with daily weather data
4. Separate out y (counts for every day in the program area) and X (a column of counts for each community area outside the program area)
5. Filter for pre-program rides
6. Create a supervised model on all that data
7. Predict the next month or so of counts after Sept. 29, 2021
8. Graph predictions versus reality
9. Maybe do the same thing in 2023 once data is available
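
Steps 1, 2, and 4 above can be sketched on a toy example. This is a plain-Python stand-in for the Spark `groupBy`/`pivot` operations so that it runs without a cluster. The ride records are made up, the community area numbers (39 Kenwood, 41 Hyde Park, 42 Woodlawn) are an assumption to check against the data, and counting by pickup area only is also an assumption, since the pickup/dropoff choice is still open.

```python
from collections import Counter

# Assumption: community area numbers for Kenwood, Hyde Park, and Woodlawn
PROGRAM_AREAS = {39, 41, 42}

# Hypothetical rides as (date, pickup_area) pairs
rides = [
    ("2021-09-01", 8), ("2021-09-01", 8), ("2021-09-01", 41),
    ("2021-09-01", 28), ("2021-09-02", 8), ("2021-09-02", 42),
    ("2021-09-02", 39),
]

# Step 1: daily counts for each community area
daily_counts = Counter(rides)

# Step 2: pivot so that each non-program area becomes one feature column
dates = sorted({date for date, _ in rides})
feature_areas = sorted({area for _, area in rides} - PROGRAM_AREAS)

# Step 4: X holds one row of feature counts per day, y the program-area total
# (a Counter returns 0 for day/area pairs with no rides)
X = {d: [daily_counts[(d, a)] for a in feature_areas] for d in dates}
y = {d: sum(daily_counts[(d, a)] for a in PROGRAM_AREAS) for d in dates}

print(feature_areas)  # [8, 28]
print(X)              # {'2021-09-01': [2, 1], '2021-09-02': [1, 0]}
print(y)              # {'2021-09-01': 1, '2021-09-02': 2}
```

Filling missing day/area combinations with 0 matters here: after a Spark pivot the same cells would be null and would need an explicit fill before modeling.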

In [6]:
# take a sample to test these operations out on first
sample_df = df_all.sample(fraction=1/1000000)

# get only the columns needed for the model
selected_columns = ["pickup_area","dropoff_area","day","month","year","ID"]
sample_selected = sample_df.select(selected_columns)


# group the rideshare data by day and community area and create counts
# (group the trimmed dataframe so only the selected columns are carried along)
sample_df = sample_selected.groupby('day',"month","year",'pickup_area','dropoff_area').agg({'ID':'count'})
sample_df.show(5)



+---+-----+----+-----------+------------+---------+
|day|month|year|pickup_area|dropoff_area|count(ID)|
+---+-----+----+-----------+------------+---------+
|  4|   12|2018|          7|          24|        1|
|  5|    1|2019|         24|           8|        1|
|  3|    8|2019|         28|           8|        1|
|  2|    3|2021|         28|           8|        1|
|  6|    4|2021|         28|          28|        1|
+---+-----+----+-----------+------------+---------+
only showing top 5 rows



                                                                                

In [None]:
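# the output of the sample df above looks off. investigate
# possible causes: `day` appears to be the day of the week rather than day_of_month
# (2018-11-06 has day_of_month 6 but day 3), so these are not true daily counts,
# and at a 1/1,000,000 sample nearly every group contains a single ride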

# pivot so that each area is a column
# should probably create a new variable that denotes in program rides, and figure out what combination of pickup or dropoff area we want to u
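# the count column produced by the groupby above is named "count(ID)";
# sum it so that rides from different pickup areas are combined per dropoff area
pivoted_df = sample_df.groupBy("day","month","year").pivot("dropoff_area").agg(F.sum("count(ID)"))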

In [None]:
# read in weather data, merge with rideshare data