# Creating a station good and bad weather index

An attempt to create an index of how resilient a CitiBike station's use is to seasonally adjusted changes in weather.

* weatherIndex of 1.0 means the station has increased use on monthly-adjusted good weather days vs. monthly-adjusted bad weather days.
* weatherIndex of 0.0 means the station has identical use on monthly-adjusted good weather days vs. monthly-adjusted bad weather days.
* weatherIndex of -1.0 means the station has decreased use on monthly-adjusted good weather days vs. monthly-adjusted bad weather days.

The weatherIndex is of use when characterizing a station for business purposes, and grouping CitiBike stations by similar use.


### Calculation
1. Need to categorize each day as good or bad monthly-adjusted weather based on feels_like, humidity, wind_speed, precip
    1. get mean and standard deviation of feels_like per month.
    1. get mean and standard deviation of humidity per month.
    1. get mean and standard deviation of wind_speed per month.
    1. get count of precip days per month.
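
The per-month statistics listed above can be sketched in plain Python before running them at scale in Spark. This is only an illustration: the `observations` list is hypothetical data, and `statistics.stdev` is used because, like Spark's `stddev`, it is the sample standard deviation.

```python
from collections import defaultdict
from statistics import mean, stdev

# Hypothetical (month, feels_like) observations; the real values come from
# the feels_like column of the trip data.
observations = [
    ("Jan", 28.0), ("Jan", 32.0), ("Jan", 36.0),
    ("Jul", 80.0), ("Jul", 84.0), ("Jul", 88.0),
]

# Collect the readings for each month.
by_month = defaultdict(list)
for month, feels_like in observations:
    by_month[month].append(feels_like)

# Mean and sample standard deviation per month.
monthly_stats = {
    month: {"mean": mean(values), "std": stdev(values)}
    for month, values in by_month.items()
}
```

The same pattern applies to humidity and wind_speed; only the column being collected changes.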


# ALWAYS RUN THIS NEXT CELL
with the import libraries and spark context and session


In [1]:
# load modules
import pandas as pd
from pyspark.sql import SparkSession
from pyspark import SparkContext
from pyspark import SparkConf
from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler
from pyspark.mllib.regression import LabeledPoint

from pyspark.sql.functions import count, avg, stddev, col, udf
from pyspark.sql.functions import lit, concat
from pyspark.sql.types import DoubleType, StringType

# import modules
from functools import reduce
from pyspark.sql import DataFrame
import os
# start spark context
spark = SparkSession.builder \
        .master("local[*]") \
        .appName("weatherwork") \
        .getOrCreate()

# set configurations
conf = SparkConf().setMaster("local[*]").setAppName("weatherwork")
sc = SparkContext.getOrCreate(conf=conf)

# TODO: Write example code for bringing in the good/bad by date broadcast variable

In [None]:
# /home/hdm5s/ds5110/FinalProject/gt2017_stationName_withGoodBadWeather.parquet
goodBadByDate_filepath = "/home/hdm5s/ds5110/FinalProject/broadcastGoodBadWeatherByDate.csv"
sdf_goodBadByDate = spark.read.csv(goodBadByDate_filepath).rdd.collectAsMap()
bc_goodbadweatherdate = sc.broadcast(sdf_goodBadByDate)

# to access the GOOD OR BAD weather use tripdate
# bc_goodbadweatherdate.value.get(tripdate)

@udf(returnType=StringType())
def isBadOrGoodWeather(tripdate : str):
    return bc_goodbadweatherdate.value.get(tripdate)


# load the final data file with only the columns I am interested in using:
file = "/project/ds5559/Summer2021_TeamBike/gt2017_bikeweather.parquet" # FINAL DATA FILE
sdf = spark.read.parquet(file).select('date', 'tripduration', 'startStationId', 'startStationName', 'startStationLatitude','startStationLongitude','endStationId', 
                                      'endStationName', 'endStationLatitude', 'endStationLongitude', 'usertype', 'feels_like', 'humidity', 'wind_speed', 'weather_main',
                                      'dow', 'day', 'month', 'time_bin', 'peak_commute', 'year', 'precip')
sdf = sdf.withColumn("weatherGoodOrBad", isBadOrGoodWeather(col("date")))

# now you have the weatherGoodOrBad column for every record with either GOOD, or BAD values in it. This is the monthly adjusted binary good/bad weather category
sdf.show(5)

# TODO: Write example code for bringing in the stationGoodWeatherIndex and stationBadWeatherIndex

...

# IGNORE CELLS until the 'RUN NOW' cell 
These are here for documentation of the process to create the low and high feels-like monthly broadcast variable csv's.

In [None]:
# load the standard data file with only the columns I am interested in using:
# 'startStationLatitude', 'startStationLongitude','tripduration','month','year'
file = "/project/ds5559/Summer2021_TeamBike/gt2017_bikeweather.parquet" # FINAL DATA FILE
sdf = spark.read.parquet(file).select('date', 'hour', 'tripduration', 'starttime', 'startStationId', 'startStationLatitude', 'startStationLongitude',
    'endStationId', 'endStationName', 'endStationLatitude', 'endStationLongitude', 'bikeid', 'usertype', 'birthyear', 'feels_like', 'humidity', 'wind_speed',
    'weather_main', 'dow', 'day', 'time_bin', 'peak_commute', 'precip','month','year')
sdf.show(1)


In [None]:
sdf.show(1)

In [None]:
# group by month and get the mean and standard deviation of feels_like, humidity and wind_speed
sdf_grouped = sdf.groupBy('month').agg(
    avg('feels_like').alias('avg_feels_like'), 
    stddev('feels_like').alias('std_feels_like'),
    avg('humidity').alias('avg_humidity'), 
    stddev('humidity').alias('std_humidity'),
    avg('wind_speed').alias('avg_wind_speed'), 
    stddev('wind_speed').alias('std_wind_speed'),
    stddev('wind_speed').alias('std_wind_speed'),
)

In [None]:
sdf_grouped.show(12)
sdf_grouped.cache()

In [None]:
# group by month and precip and count the records in each group
sdf_precip_grouped = sdf.groupBy('month', 'precip').agg(
    count('date').alias('precip_count')
)
sdf_precip_grouped.orderBy('month','precip').show(24)

### What Makes a Bad Weather Day?

#### For Wind Speed and Precip it is relatively easy to determine a threshold for a good vs. bad weather day:
1. if **wind_speed >= 10 == BAD weather day.** 
    1. Wind speed doesn't change much month-to-month, and somewhere between 10 and 15 mph is recognized as windy by cyclists, based on an informal survey of five urban bicycle commuters.
1. if **precip == "precip" == BAD weather day.**
    1. For our purposes, if a day has precipitation we are calling it a "BAD" weather day.
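
A minimal sketch of these two rules as a plain Python function. It assumes wind_speed is in mph and that the precip column holds the string "precip" on wet days, which is the value the classification UDF later in this notebook compares against. The function name and the "no_precip" value in the usage lines are hypothetical.

```python
WIND_SPEED_BAD_THRESHOLD = 10.0  # mph, the threshold proposed above


def is_bad_wind_or_precip(wind_speed, precip):
    """Return True when wind or precipitation alone makes the day BAD."""
    too_windy = wind_speed is not None and wind_speed >= WIND_SPEED_BAD_THRESHOLD
    has_precip = precip == "precip"
    return too_windy or has_precip


# Hypothetical readings; "no_precip" stands in for any non-precip value.
windy_dry = is_bad_wind_or_precip(12.0, "no_precip")
calm_wet = is_bad_wind_or_precip(5.0, "precip")
calm_dry = is_bad_wind_or_precip(5.0, "no_precip")
```

Note that the classification UDF defined later in this notebook checks only feels_like and precip, so the wind rule sketched here would need to be added to it explicitly.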
    
    
    
#### What About Feels Like?
feels_like is harder because a seasonally adjusted feels_like "bad" weather day in January is very different from a seasonally-adjusted "bad" weather day in August.

Based on the mean and std by month for feels_like I propose the following:

| Month       | BAD condition LOW   |  BAD condition HIGH |
| :------------- | :----------: | :----------: |
| Jan | < (mean - 1 * std) | 90 |
| Feb | < (mean - 1 * std) | 90 |
| Mar | < (mean - 1 * std) | 90 |
| Apr | < (mean - 1 * std) | 90 |
| May | < (mean - 2 * std) | 90 |
| Jun | 45 | > (mean + 1.5 * std) |
| Jul | 45 | > (mean + 1.5 * std) |
| Aug | 45 | > (mean + 1.5 * std) |
| Sep | < (mean - 3 * std)  | 90 |
| Oct | < (mean - 1 * std) | 90 |
| Nov | < (mean - 1 * std) | 90 |
| Dec | < (mean - 1 * std) | 90 |
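
The thresholds can be sketched as a single plain Python function returning both bounds. This sketch follows the multipliers used by the `calcFeelsLikeLow` and `calcFeelsLikeHigh` UDFs in this notebook, which apply 1 std for May and 2 std for Sep rather than the 2 and 3 listed in the table. The function name and the sample statistics are hypothetical.

```python
COOL_MONTHS = {"Jan", "Feb", "Mar", "Apr", "May", "Oct", "Nov", "Dec"}
HOT_MONTHS = {"Jun", "Jul", "Aug"}


def feels_like_bounds(month, mean, std):
    """Return (low, high); a day is BAD when feels_like falls outside this range."""
    if month in COOL_MONTHS:
        low = mean - std
    elif month == "Sep":
        low = mean - 2 * std
    else:
        low = 45.0  # fixed floor for the summer months
    # Only the summer months get a statistical ceiling; the rest use a fixed 90.
    high = mean + 1.5 * std if month in HOT_MONTHS else 90.0
    return low, high


# Hypothetical monthly statistics, for illustration only.
print(feels_like_bounds("Jan", 30.0, 10.0))  # (20.0, 90.0)
print(feels_like_bounds("Jul", 80.0, 6.0))   # (45.0, 89.0)
```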

#### What About Humidity?
Based on the relationship between humidity, precip and feels_like, and the relative stability of humidity I propose we do not include humidity in our BAD weather classification.

### Next Step: create the broadcast variables for feels_like_LOW and feels_like_HIGH for each month
https://sparkbyexamples.com/pyspark/pyspark-broadcast-variables/



In [None]:
# why doesn't python have a switch case statement !!???!!!

@udf(returnType=DoubleType())
def calcFeelsLikeLow(fl_month, fl_avg, fl_std):
    # default value
    feels_like_low = 45
    
    if fl_month in ("Jan", "Feb", "Mar", "Apr", "May", "Oct", "Nov", "Dec"):
        feels_like_low = fl_avg - (fl_std)
    elif fl_month == "Sep":
        feels_like_low = fl_avg - (2 * fl_std)
    else:
        feels_like_low = 45.0
    
    return feels_like_low

@udf(returnType=DoubleType())
def calcFeelsLikeHigh(fl_month, fl_avg, fl_std):
    # default value
    feels_like_high = 90.0
    
    if fl_month in ("Jun", "Jul", "Aug"):
        feels_like_high = fl_avg + (1.5 * fl_std)
    else:
        feels_like_high = 90.0
    
    return feels_like_high


sdf_grouped = sdf_grouped.withColumn("feels_like_LOW", calcFeelsLikeLow(col("month"), col("avg_feels_like"), col("std_feels_like"))).withColumn("feels_like_HIGH", calcFeelsLikeHigh(col("month"), col("avg_feels_like"), col("std_feels_like")))

In [None]:
sdf_grouped.show()
sdf_grouped.cache()

### calculate bad weather across the entire data set for each day in the dataset based on criteria:
1. if feels_like < feels_like_LOW for the month
1. if feels_like > feels_like_HIGH for the month
1. if precip == "precip"
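
A minimal sketch of these criteria as a plain Python function, useful for checking the rule on a few values before wrapping it in a UDF. It uses the strict comparisons listed above and treats a missing feels_like reading as not bad on its own; that None handling and the function name are assumptions, not part of the original rule.

```python
def classify_day(feels_like, precip, low, high):
    """Return "BAD" if any criterion above is met, otherwise "GOOD"."""
    if precip == "precip":
        return "BAD"
    if feels_like is None:
        # Assumption: with no temperature reading, only precip can mark the day BAD.
        return "GOOD"
    if feels_like < low or feels_like > high:
        return "BAD"
    return "GOOD"


# Hypothetical monthly bounds of 20.0 and 90.0.
cold_day = classify_day(10.0, "no_precip", 20.0, 90.0)
mild_day = classify_day(50.0, "no_precip", 20.0, 90.0)
wet_day = classify_day(50.0, "precip", 20.0, 90.0)
```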


# NEXT TASKS
1. Assign Good or Bad weather to every ride in the dataset
Create the Broadcast Variables:

In [None]:
# create a broadcast of month, feels_like_LOW and month, feels_like_HIGH
# (fields are read by name so the lookup does not depend on column order)
sdf_month_low = sdf_grouped.rdd.map(lambda x: (x["month"], x["feels_like_LOW"])).collectAsMap()
monthLowBroadcast = sc.broadcast(sdf_month_low)


sdf_month_high = sdf_grouped.rdd.map(lambda x: (x["month"], x["feels_like_HIGH"])).collectAsMap()
monthHighBroadcast = sc.broadcast(sdf_month_high)

In [None]:
# save the low broadcast to csv
sdf_grouped.rdd.map(lambda x: (x["month"], x["feels_like_LOW"])).toDF().write.csv('/home/hdm5s/ds5110/FinalProject/broadcastLow.csv')

Use the broadcast variables to create the BAD weather indicator for each row in the dataframe:

In [None]:
# save the high broadcast to csv
sdf_grouped.rdd.map(lambda x: (x["month"], x["feels_like_HIGH"])).toDF().write.csv('/home/hdm5s/ds5110/FinalProject/broadcastHigh.csv')

# RUN NOW: IF you want to ...
create the gt2017- data file that includes the weatherGoodOrBad column

In [None]:
# load the standard data file with only the columns I am interested in using:
file = "/project/ds5559/Summer2021_TeamBike/gt2017_bikeweather.parquet" # FINAL DATA FILE
sdf = spark.read.parquet(file).select('date', 'startStationId', 'startStationName', 'feels_like', 'month','year','precip')
sdf.show(1)

In [None]:
# load the low and high rdds
rdd_low = spark.read.csv("/home/hdm5s/ds5110/FinalProject/broadcastLow.csv", header=False).rdd.collectAsMap()
monthLowBroadcast = sc.broadcast(rdd_low)

rdd_high = spark.read.csv("/home/hdm5s/ds5110/FinalProject/broadcastHigh.csv", header=False).rdd.collectAsMap()
monthHighBroadcast = sc.broadcast(rdd_high)

In [None]:
monthLowBroadcast.value.get('Oct')

In [None]:
monthHighBroadcast.value.get('Oct')

In [None]:
# using the broadcasts for low and high and comparing with the feels_like
@udf(returnType=StringType())
def isBadOrGoodWeather(month, feels_like, precip):
    goodOrBad = "GOOD"
    
    if float(feels_like) <= float(monthLowBroadcast.value.get(month)) or float(feels_like) >= float(monthHighBroadcast.value.get(month)) or precip == "precip":
        goodOrBad = "BAD"
    else:
        goodOrBad = "GOOD"
        
    return goodOrBad

In [None]:
sdf = sdf.withColumn("weatherGoodOrBad", isBadOrGoodWeather(col("month"), col("feels_like"), col("precip")))

In [None]:
sdf.show(1)

In [None]:
# save the dataframe with Good/Bad weather columns
sdf.write.parquet('/home/hdm5s/ds5110/FinalProject/gt2017_stationName_withGoodBadWeather.parquet')

In [None]:
# count the records for each month and good/bad weather category
sdf.groupBy('month', 'weatherGoodOrBad').count().show(24)

There is a nice distribution of good and bad weather days.

**Save a csv broadcast of date** 
1. Create a saved broadcast of date and GOOD or BAD weatherGoodOrBad
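
The save-and-reload round trip for this date lookup can be sketched with the standard library alone. This is an in-memory illustration with hypothetical dates, assuming a headerless two-column CSV of date and label.

```python
import csv
import io

# Hypothetical date -> label pairs standing in for the real per-date results.
labels = {"2018-10-01": "GOOD", "2018-10-02": "BAD"}

# Write a headerless two-column CSV (in memory here, instead of a file path).
buffer = io.StringIO()
csv.writer(buffer).writerows(labels.items())

# Read it back into a dictionary keyed by date.
buffer.seek(0)
restored = {date: label for date, label in csv.reader(buffer)}

# A date that was never saved has no entry.
missing = restored.get("2018-10-03")
```

A date missing from the file yields `None`, so a null check on the resulting weatherGoodOrBad column is worthwhile after applying the lookup.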

In [None]:
# save the date -> weatherGoodOrBad mapping to csv
#sdf.rdd.map(lambda x: (x[0], x[7])).toDF().write.csv('/home/hdm5s/ds5110/FinalProject/broadcastGoodBadWeatherByDate.csv')
# [0] == date
# [7] == weatherGoodOrBad
sdf.rdd.map(lambda x: (x[0], x[7])).toDF().write.format('csv').option('header',False).mode('overwrite').save('/home/hdm5s/ds5110/FinalProject/broadcastGoodBadWeatherByDate.csv')

# RUN THIS if you want to ...
create the good weather and bad weather index by startStationName broadcast variable
1. Group by start station: good weather station index = average daily ride count on good weather days / average daily ride count over all days (EXPECT > 1, but HOW MUCH greater than 1)
1. Group by start station: bad weather station index = average daily ride count on bad weather days / average daily ride count over all days (EXPECT < 1, but HOW MUCH less than 1)
1. Create a saved broadcast csv with the stationName goodWeatherIndex
1. Create a saved broadcast csv with stationName badWeatherIndex
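
As a rough illustration of the two ratios described in this list, the sketch below computes them for one station from per-day ride counts. The function name and the sample counts are hypothetical, and it assumes at least one day with rides. A category with no days falls back to 1.0, mirroring the fallback to the overall value used by the index UDFs in this notebook.

```python
from statistics import mean


def weather_indexes(daily_rides):
    """daily_rides: list of (label, ride_count) pairs, one per day, for one station."""
    all_days = mean(count for _, count in daily_rides)
    good_days = [count for label, count in daily_rides if label == "GOOD"]
    bad_days = [count for label, count in daily_rides if label == "BAD"]
    # Fall back to 1.0 (no measurable effect) when a category has no days.
    good_index = mean(good_days) / all_days if good_days else 1.0
    bad_index = mean(bad_days) / all_days if bad_days else 1.0
    return good_index, bad_index


# Hypothetical station: three good-weather days and one bad-weather day.
good_index, bad_index = weather_indexes(
    [("GOOD", 120), ("GOOD", 100), ("GOOD", 110), ("BAD", 70)]
)
```

With these counts the overall daily average is 100 rides, so the good weather index comes out above 1 and the bad weather index below 1, as expected.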

# BUT SKIP UNTIL 'RUN NOW' IF YOU JUST WANT TO LOAD THE ALREADY CREATED broadcast variables


In [46]:
# /home/hdm5s/ds5110/FinalProject/gt2017_stationName_withGoodBadWeather.parquet
goodBadByDate_filepath = "/home/hdm5s/ds5110/FinalProject/broadcastGoodBadWeatherByDate.csv"
sdf_goodBadByDate = spark.read.csv(goodBadByDate_filepath).rdd.collectAsMap()
bc_goodbadweatherdate = sc.broadcast(sdf_goodBadByDate)

In [47]:
# to access the GOOD OR BAD weather use tripdate
# bc_goodbadweatherdate.value.get(tripdate)

@udf(returnType=StringType())
def isBadOrGoodWeather(tripdate : str):
    return bc_goodbadweatherdate.value.get(tripdate)


In [48]:
# load the standard data file with only the columns I am interested in using:
file = "/project/ds5559/Summer2021_TeamBike/gt2017_bikeweather.parquet" # FINAL DATA FILE
sdf = spark.read.parquet(file).select('date', 'tripduration', 'startStationId', 'startStationName', 'startStationLatitude','startStationLongitude','endStationId', 
                                      'endStationName', 'endStationLatitude', 'endStationLongitude', 'usertype', 'feels_like', 'humidity', 'wind_speed', 'weather_main',
                                      'dow', 'month', 'time_bin', 'peak_commute', 'year', 'precip')
sdf = sdf.withColumn("weatherGoodOrBad", isBadOrGoodWeather(col("date")))

In [52]:
# are there nulls in weatherGoodOrBad?
sdf.filter(col("weatherGoodOrBad").isNull()).show()

+----+------------+--------------+----------------+--------------------+---------------------+------------+--------------+------------------+-------------------+--------+----------+--------+----------+------------+---+-----+--------+------------+----+------+----------------+
|date|tripduration|startStationId|startStationName|startStationLatitude|startStationLongitude|endStationId|endStationName|endStationLatitude|endStationLongitude|usertype|feels_like|humidity|wind_speed|weather_main|dow|month|time_bin|peak_commute|year|precip|weatherGoodOrBad|
+----+------------+--------------+----------------+--------------------+---------------------+------------+--------------+------------------+-------------------+--------+----------+--------+----------+------------+---+-----+--------+------------+----+------+----------------+
+----+------------+--------------+----------------+--------------------+---------------------+------------+--------------+------------------+-------------------+--------+--

In [49]:
sdf.show(5)

+----------+------------+--------------+--------------------+--------------------+---------------------+------------+------------------+------------------+-------------------+----------+----------+--------+----------+------------+---+-----+--------+------------+----+---------+----------------+
|      date|tripduration|startStationId|    startStationName|startStationLatitude|startStationLongitude|endStationId|    endStationName|endStationLatitude|endStationLongitude|  usertype|feels_like|humidity|wind_speed|weather_main|dow|month|time_bin|peak_commute|year|   precip|weatherGoodOrBad|
+----------+------------+--------------+--------------------+--------------------+---------------------+------------+------------------+------------------+-------------------+----------+----------+--------+----------+------------+---+-----+--------+------------+----+---------+----------------+
|2018-10-01|         330|         293.0|Lafayette St & E ...|   40.73020660529954|   -73.99102628231049|       504.

In [50]:
sdf.cache()

DataFrame[date: string, tripduration: bigint, startStationId: string, startStationName: string, startStationLatitude: double, startStationLongitude: double, endStationId: double, endStationName: string, endStationLatitude: double, endStationLongitude: double, usertype: string, feels_like: double, humidity: bigint, wind_speed: double, weather_main: string, dow: int, month: string, time_bin: string, peak_commute: string, year: int, precip: string, weatherGoodOrBad: string]

**Next**, group by station and month and view the following:
* average trip count per month
* average trip count when weather good per month
* average trip count when weather bad per month
* average duration per month
* average duration when weather good per month
* average duration when weather bad per month
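
The grouping listed above can be sketched on a handful of rows in plain Python. The `trips` rows and the `summarize` helper are hypothetical; the point is only that the same aggregation is run twice, once per (month, station) and once per (month, station, weather).

```python
from collections import defaultdict

# Hypothetical trips: (month, startStationName, weatherGoodOrBad, tripduration).
trips = [
    ("Oct", "Station A", "GOOD", 600),
    ("Oct", "Station A", "GOOD", 900),
    ("Oct", "Station A", "BAD", 300),
    ("Nov", "Station A", "GOOD", 500),
]


def summarize(trips, with_weather=False):
    """Return {group key: (trip count, average duration)}."""
    durations = defaultdict(list)
    for month, station, weather, duration in trips:
        key = (month, station, weather) if with_weather else (month, station)
        durations[key].append(duration)
    return {key: (len(v), sum(v) / len(v)) for key, v in durations.items()}


per_station_month = summarize(trips)
per_station_month_weather = summarize(trips, with_weather=True)
```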

In [51]:
# group by month, startStationName and get the average tripduration, and the count of items in the aggregation
sdf_month_station = sdf.groupBy('month', 'startStationName').agg(
    count('tripduration').alias('counttrips'), 
    avg('tripduration').alias('avg_tripduration')
)

In [53]:
sdf_month_station.show(25)

+-----+--------------------+----------+------------------+
|month|    startStationName|counttrips|  avg_tripduration|
+-----+--------------------+----------+------------------+
|  Oct|  W 67 St & Broadway|     23694| 866.0102557609522|
|  Oct|Cooper Square & A...|     41372| 713.8334622449967|
|  Oct|Hanson Pl & Ashla...|     21984| 1386.932996724891|
|  Oct|     E 58 St & 3 Ave|     25413|   891.34734191162|
|  Oct|  E 98 St & Park Ave|      2383| 830.2194712547209|
|  Nov|Allen St & Riving...|     19748| 847.1000607656472|
|  Nov|    W 12 St & W 4 St|      2979|  713.860355824102|
|  Nov|E 110 St & Madiso...|      5953| 1119.487149336469|
|  Nov|  E 98 St & Park Ave|      2026| 734.2290227048371|
|  Dec|   Kent Ave & N 7 St|      8919|1102.0799416974996|
|  Dec|Pioneer St & Rich...|       661| 772.3373676248109|
|  Dec|Howard St & Centr...|     10941| 769.8045882460469|
|  Dec|Adam Clayton Powe...|      1808|  1340.64657079646|
|  Dec| Henry St & Grand St|     12724| 780.990647595095

In [None]:
sdf_month_station.cache()

In [61]:
# create a broadcast of month-station-counttrips
sdf_month_station_trips = sdf_month_station.rdd.map(lambda x: (str(x[0]) + "-" + str(x[1]), x[2])).collectAsMap()                                                       
monthStationTrips = sc.broadcast(sdf_month_station_trips)

# create a broadcast of month-station-avgDuration
sdf_month_station_duration = sdf_month_station.rdd.map(lambda x: (str(x[0]) + "-" + str(x[1]), x[3])).collectAsMap()                                                       
monthStationDuration = sc.broadcast(sdf_month_station_duration)



In [62]:
# save the broadcasts to file
sdf_month_station.rdd.map(lambda x: (str(x[0]) + "-" + str(x[1]), x[2])).toDF().write.mode("overwrite").csv('/home/hdm5s/ds5110/FinalProject/broadcastMonthStationTrips.csv')
sdf_month_station.rdd.map(lambda x: (str(x[0]) + "-" + str(x[1]), x[3])).toDF().write.mode("overwrite").csv('/home/hdm5s/ds5110/FinalProject/broadcastMonthStationDuration.csv')

sdf_month_station.rdd.map(lambda x: (str(x[0]) + "-" + str(x[1]), x[2])).toDF().write.mode("overwrite").csv('/project/ds5559/Summer2021_TeamBike/broadcastMonthStationTrips.csv')
sdf_month_station.rdd.map(lambda x: (str(x[0]) + "-" + str(x[1]), x[3])).toDF().write.mode("overwrite").csv('/project/ds5559/Summer2021_TeamBike/broadcastMonthStationDuration.csv')

In [63]:
# group by month, startStationName, and weatherGoodOrBad and get the average tripduration, and the count of items in the aggregation
sdf_month_station_goodBad = sdf.groupBy('month', 'startStationName', 'weatherGoodOrBad').agg(
    count('tripduration').alias('counttrips'), 
    avg('tripduration').alias('avg_tripduration')
)

In [64]:
sdf_month_station_goodBad.show(25)

+-----+--------------------+----------------+----------+------------------+
|month|    startStationName|weatherGoodOrBad|counttrips|  avg_tripduration|
+-----+--------------------+----------------+----------+------------------+
|  Oct|Lafayette St & E ...|            GOOD|     34681| 732.5926876387647|
|  Oct|Central Park West...|            GOOD|     16946| 1294.232916322436|
|  Oct|        7 St & 3 Ave|            GOOD|      2628| 834.7534246575342|
|  Oct|    W 26 St & 10 Ave|            GOOD|     12055| 909.6550808793032|
|  Oct|  Broadway & W 51 St|            GOOD|     14936| 918.5858998393144|
|  Oct|      23 Ave & 27 St|            GOOD|      1252| 986.8825878594249|
|  Oct|Carroll St & Fran...|            GOOD|       999| 1135.003003003003|
|  Oct|    W 47 St & 10 Ave|             BAD|      5558| 835.7563871896366|
|  Nov|E 24 St & Park Ave S|            GOOD|     17351| 671.8246786928707|
|  Nov|S 3 St & Bedford Ave|            GOOD|      7084|1123.1077075098815|
|  Nov|Park 

In [None]:
sdf_month_station_goodBad.cache()

In [65]:
# create a broadcast of month-station-weather-counttrips
sdf_month_station_goodbad_trips = sdf_month_station_goodBad.rdd.map(lambda x: (str(x[0]) + "-" + str(x[1]) + str(x[2]), x[3])).collectAsMap()                                                       
monthStationGoodBadTrips = sc.broadcast(sdf_month_station_goodbad_trips)

# create a broadcast of month-station-weather-avgDuration
sdf_month_station_goodbad_duration = sdf_month_station_goodBad.rdd.map(lambda x: (str(x[0]) + "-" + str(x[1]) + str(x[2]), x[4])).collectAsMap()                                                       
monthStationGoodBadDuration = sc.broadcast(sdf_month_station_goodbad_duration)


In [67]:
# save the broadcasts to file
sdf_month_station_goodBad.rdd.map(lambda x: (str(x[0]) + "-" + str(x[1]) + str(x[2]), x[3])).toDF().write.mode("overwrite").csv('/home/hdm5s/ds5110/FinalProject/broadcastMonthStationWeatherGoodBadTrips.csv')
sdf_month_station_goodBad.rdd.map(lambda x: (str(x[0]) + "-" + str(x[1]) + str(x[2]), x[4])).toDF().write.mode("overwrite").csv('/home/hdm5s/ds5110/FinalProject/broadcastMonthStationWeatherGoodBadDuration.csv')

sdf_month_station_goodBad.rdd.map(lambda x: (str(x[0]) + "-" + str(x[1]) + str(x[2]), x[3])).toDF().write.mode("overwrite").csv('/project/ds5559/Summer2021_TeamBike/broadcastMonthStationWeatherGoodBadTrips.csv')
sdf_month_station_goodBad.rdd.map(lambda x: (str(x[0]) + "-" + str(x[1]) + str(x[2]), x[4])).toDF().write.mode("overwrite").csv('/project/ds5559/Summer2021_TeamBike/broadcastMonthStationWeatherGoodBadDuration.csv')

# RUN NOW, to ... 
apply the good/bad index for the month and station for all trips in the dataset
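
A minimal sketch of the lookup behind the trip index, in plain Python. It assumes the key layout of the saved broadcasts, "month-station" for totals and "month-station" followed by GOOD or BAD for the weather split, and that values arrive as strings because they were read from CSV. The function name and the sample dictionaries are hypothetical.

```python
def trip_share(totals, by_weather, month, station, weather):
    """Share of a station's monthly trips taken in the given weather ("GOOD" or "BAD")."""
    total = totals.get(month + "-" + station)
    if total is None:
        return None  # station/month not present in the lookup
    # Missing weather entry: fall back to the monthly total, giving 1.0.
    in_weather = by_weather.get(month + "-" + station + weather, total)
    return float(in_weather) / float(total)


# Hypothetical lookups; values are strings because they came from CSV.
totals = {"Jan-Station A": "1000"}
by_weather = {"Jan-Station AGOOD": "760", "Jan-Station ABAD": "240"}

good_share = trip_share(totals, by_weather, "Jan", "Station A", "GOOD")
bad_share = trip_share(totals, by_weather, "Jan", "Station A", "BAD")
```

Computed this way, the good and bad trip values are shares of the monthly total, so they sum to 1 whenever both entries exist.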

In [51]:
# load the broadcast variables from csv
# trips by station and month
tripsByStationMonth_filepath = "/project/ds5559/Summer2021_TeamBike/broadcastMonthStationTrips.csv"
sdf_tripsByStationMonth = spark.read.csv(tripsByStationMonth_filepath).rdd.collectAsMap()
for x in list(sdf_tripsByStationMonth)[0:3]:
    print ("key {}, value {} ".format(x,  sdf_tripsByStationMonth[x]))
bc_tripsByStationMonth = sc.broadcast(sdf_tripsByStationMonth)

# duration by station and month
durationByStationMonth_filepath = "/project/ds5559/Summer2021_TeamBike/broadcastMonthStationDuration.csv"
sdf_durationByStationMonth = spark.read.csv(durationByStationMonth_filepath).rdd.collectAsMap()
for x in list(sdf_durationByStationMonth)[0:3]:
    print ("key {}, value {} ".format(x,  sdf_durationByStationMonth[x]))
bc_durationByStationMonth = sc.broadcast(sdf_durationByStationMonth)


# trips by station and month and good/bad weather
goodBadTripsByStationMonth_filepath = "/project/ds5559/Summer2021_TeamBike/broadcastMonthStationWeatherGoodBadTrips.csv"
sdf_goodBadTripsByStationMonth = spark.read.csv(goodBadTripsByStationMonth_filepath).rdd.collectAsMap()
for x in list(sdf_goodBadTripsByStationMonth)[0:3]:
    print ("key {}, value {} ".format(x,  sdf_goodBadTripsByStationMonth[x]))
bc_goodBadTripsByStationMonth = sc.broadcast(sdf_goodBadTripsByStationMonth)

# duration by station and month and good/bad weather
goodBadDurationByStationMonth_filepath = "/project/ds5559/Summer2021_TeamBike/broadcastMonthStationWeatherGoodBadDuration.csv"
sdf_goodBadDurationByStationMonth = spark.read.csv(goodBadDurationByStationMonth_filepath).rdd.collectAsMap()
for x in list(sdf_goodBadDurationByStationMonth)[0:3]:
    print ("key {}, value {} ".format(x,  sdf_goodBadDurationByStationMonth[x]))
bc_goodBadDurationByStationMonth = sc.broadcast(sdf_goodBadDurationByStationMonth)

key Oct-Maiden Ln & Pearl St, value 12743 
key Oct-E 81 St & York Ave, value 11707 
key Oct-W 16 St & The High Line, value 24878 
key Oct-Maiden Ln & Pearl St, value 1050.0648199011223 
key Oct-E 81 St & York Ave, value 873.0382677030836 
key Oct-W 16 St & The High Line, value 823.0103304124126 
key Oct-Washington Ave & Greene AveGOOD, value 4450 
key Oct-E 7 St & Avenue ABAD, value 9344 
key Oct-Carlton Ave & Dean StBAD, value 1346 
key Oct-Washington Ave & Greene AveGOOD, value 837.4429213483146 
key Oct-E 7 St & Avenue ABAD, value 688.5225813356165 
key Oct-Carlton Ave & Dean StBAD, value 737.2303120356612 


In [52]:
# UDFs that look up trips and durations by month and station in the broadcast dictionaries
# and return the good or bad weather value divided by the overall value

@udf(returnType=DoubleType())
def setweatherGoodTripIndex(month, startStationName):
    
    averageTrips = bc_tripsByStationMonth.value.get(month + "-" + startStationName)
    goodTrips = bc_goodBadTripsByStationMonth.value.get(month + "-" + startStationName + "GOOD")
    if goodTrips is None:
        # if the dictionary doesn't have a good trips value, use the average trips
        goodTrips = averageTrips
    
    index_value = float(goodTrips) / float(averageTrips)
    return index_value


@udf(returnType=DoubleType())
def setweatherBadTripIndex(month : str, startStationName : str):
    averageTrips = bc_tripsByStationMonth.value.get(month + "-" + startStationName)
    badTrips = bc_goodBadTripsByStationMonth.value.get(month + "-" + startStationName + "BAD")
    if badTrips is None:
        # if the dictionary doesn't have a bad trips value, use the average trips
        badTrips = averageTrips

    index_value = float(badTrips) / float(averageTrips)
    return index_value

@udf(returnType=DoubleType())
def setweatherGoodDurationIndex(month : str, startStationName : str):
    averageDuration = bc_durationByStationMonth.value.get(month + "-" + startStationName)
    goodDuration= bc_goodBadDurationByStationMonth.value.get(month + "-" + startStationName + "GOOD")
    if goodDuration is None:
        # if the dictionary doesn't have a good duration value, use the average duration
        goodDuration = averageDuration
                         
    index_value = float(goodDuration) / float(averageDuration)
    return index_value


@udf(returnType=DoubleType())
def setweatherBadDurationIndex(month : str, startStationName : str):
    averageDuration = bc_durationByStationMonth.value.get(month + "-" + startStationName)
    badDuration= bc_goodBadDurationByStationMonth.value.get(month + "-" + startStationName + "BAD")
    if badDuration is None:
        # if the dictionary doesn't have a bad duration value, use the average duration
        badDuration = averageDuration

    index_value = float(badDuration) / float(averageDuration)
    return index_value



In [53]:
# load the standard data file with only the columns I am interested in using:
file = "/project/ds5559/Summer2021_TeamBike/gt2017_stationName_withGoodBadWeather.parquet" # FINAL DATA FILE
sdf = spark.read.parquet(file).select('startStationName', 'month')

sdf = sdf.withColumn("weatherGoodTripIndex", setweatherGoodTripIndex(col("month"), col("startStationName")))
sdf = sdf.withColumn("weatherBadTripIndex", setweatherBadTripIndex(col("month"), col("startStationName")))
sdf = sdf.withColumn("weatherGoodDurationIndex", setweatherGoodDurationIndex(col("month"), col("startStationName")))
sdf = sdf.withColumn("weatherBadDurationIndex", setweatherBadDurationIndex(col("month"), col("startStationName")))

In [54]:
sdf.show(25)

+----------------+-----+--------------------+-------------------+------------------------+-----------------------+
|startStationName|month|weatherGoodTripIndex|weatherBadTripIndex|weatherGoodDurationIndex|weatherBadDurationIndex|
+----------------+-----+--------------------+-------------------+------------------------+-----------------------+
|6 Ave & Canal St|  Jan|  0.7619760479041916|0.23802395209580837|      1.0185203735615738|     0.9407115085355908|
|6 Ave & Canal St|  Jan|  0.7619760479041916|0.23802395209580837|      1.0185203735615738|     0.9407115085355908|
|6 Ave & Canal St|  Jan|  0.7619760479041916|0.23802395209580837|      1.0185203735615738|     0.9407115085355908|
|6 Ave & Canal St|  Jan|  0.7619760479041916|0.23802395209580837|      1.0185203735615738|     0.9407115085355908|
|6 Ave & Canal St|  Jan|  0.7619760479041916|0.23802395209580837|      1.0185203735615738|     0.9407115085355908|
|6 Ave & Canal St|  Jan|  0.7619760479041916|0.23802395209580837|      1.0185203

In [10]:
sdf.cache()

DataFrame[startStationName: string, month: string, weatherGoodTripIndex: double, weatherBadTripIndex: double, weatherGoodDurationIndex: double, weatherBadDurationIndex: double]

In [11]:
sdf_distinct = sdf.select(["startStationName","month", "weatherGoodTripIndex", "weatherBadTripIndex", "weatherGoodDurationIndex", "weatherBadDurationIndex"]).distinct()

In [4]:
#sdf_distinct.show()

In [None]:
sdf_distinct.write.mode("overwrite").csv('/home/hdm5s/ds5110/FinalProject/weatherIndexDistinctStationMonth.csv')
sdf_distinct.write.mode("overwrite").csv('/project/ds5559/Summer2021_TeamBike/weatherIndexDistinctStationMonth.csv')

# Save csv's to use as broadcast variables

In [None]:
from pyspark.sql.functions import col, lit, concat
sdf_distinct.withColumn("key", concat(col("startStationName"),lit("-"),col("month"))).show()

In [None]:
sdf_weatherGoodTripIndex = sdf_distinct.withColumn("key", concat(col("startStationName"),lit("-"),col("month"))).select("key","weatherGoodTripIndex")
sdf_weatherGoodTripIndex.show()

In [None]:
sdf_weatherGoodTripIndex.write.mode("overwrite").csv('/home/hdm5s/ds5110/FinalProject/weatherIndexDistinctStationMonth-GoodTripIndex.csv')
sdf_weatherGoodTripIndex.write.mode("overwrite").csv('/project/ds5559/Summer2021_TeamBike/weatherIndexDistinctStationMonth-GoodTripIndex.csv')

In [None]:
sdf_weatherBadTripIndex = sdf_distinct.withColumn("key", concat(col("startStationName"),lit("-"),col("month"))).select("key","weatherBadTripIndex")
sdf_weatherBadTripIndex.show()
sdf_weatherBadTripIndex.write.mode("overwrite").csv('/home/hdm5s/ds5110/FinalProject/weatherIndexDistinctStationMonth-BadTripIndex.csv')
sdf_weatherBadTripIndex.write.mode("overwrite").csv('/project/ds5559/Summer2021_TeamBike/weatherIndexDistinctStationMonth-BadTripIndex.csv')

In [None]:
sdf_weatherGoodDurationIndex = sdf_distinct.withColumn("key", concat(col("startStationName"),lit("-"),col("month"))).select("key","weatherGoodDurationIndex")
sdf_weatherGoodDurationIndex.show()
sdf_weatherGoodDurationIndex.write.mode("overwrite").csv('/home/hdm5s/ds5110/FinalProject/weatherIndexDistinctStationMonth-GoodDurationIndex.csv')
sdf_weatherGoodDurationIndex.write.mode("overwrite").csv('/project/ds5559/Summer2021_TeamBike/weatherIndexDistinctStationMonth-GoodDurationIndex.csv')

In [None]:
sdf_weatherBadDurationIndex = sdf_distinct.withColumn("key", concat(col("startStationName"),lit("-"),col("month"))).select("key","weatherBadDurationIndex")
sdf_weatherBadDurationIndex.show()
sdf_weatherBadDurationIndex.write.mode("overwrite").csv('/home/hdm5s/ds5110/FinalProject/weatherIndexDistinctStationMonth-BadDurationIndex.csv')
sdf_weatherBadDurationIndex.write.mode("overwrite").csv('/project/ds5559/Summer2021_TeamBike/weatherIndexDistinctStationMonth-BadDurationIndex.csv')

# Example of loading the good weather duration and bad weather duration index to the entire dataset

In [71]:
# load the broadcast variables from csv
# duration by station and month
durationByStationMonth_filepath = "/project/ds5559/Summer2021_TeamBike/broadcastMonthStationDuration.csv"
sdf_durationByStationMonth = spark.read.csv(durationByStationMonth_filepath).rdd.collectAsMap()
#for x in list(sdf_durationByStationMonth)[0:3]:
#    print ("key {}, value {} ".format(x,  sdf_durationByStationMonth[x]))
bc_durationByStationMonth = sc.broadcast(sdf_durationByStationMonth)

# duration by station and month and good/bad weather
goodBadDurationByStationMonth_filepath = "/project/ds5559/Summer2021_TeamBike/broadcastMonthStationWeatherGoodBadDuration.csv"
sdf_goodBadDurationByStationMonth = spark.read.csv(goodBadDurationByStationMonth_filepath).rdd.collectAsMap()
#for x in list(sdf_goodBadDurationByStationMonth)[0:3]:
#    print ("key {}, value {} ".format(x,  sdf_goodBadDurationByStationMonth[x]))
bc_goodBadDurationByStationMonth = sc.broadcast(sdf_goodBadDurationByStationMonth)
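
The UDFs in the next cell divide a station's good-weather (or bad-weather) duration by its overall duration for the same month. A minimal plain-Python sketch of that ratio follows, assuming the same key layout the UDFs use (`month-station` for the average, with `GOOD` or `BAD` appended for the weather-specific value). The function name `duration_index`, the sample values, and the `None` return for a missing or zero average are assumptions made for illustration.

```python
def duration_index(month, station, condition, avg_by_key, cond_by_key):
    """Ratio of the condition-specific duration to the overall average.

    condition is the suffix appended to the key, e.g. "GOOD" or "BAD".
    Returns None when no usable average exists for the station and month.
    """
    key = f"{month}-{station}"
    average = avg_by_key.get(key)
    if average is None or float(average) == 0.0:
        return None
    # Fall back to the average when the condition has no entry,
    # which yields an index of exactly 1.0.
    value = cond_by_key.get(key + condition, average)
    return float(value) / float(average)


# Hypothetical values, stored as strings as they would be after a CSV read.
averages = {"7-Example St": "600.0"}
by_condition = {"7-Example StGOOD": "660.0"}

good = duration_index("7", "Example St", "GOOD", averages, by_condition)
bad = duration_index("7", "Example St", "BAD", averages, by_condition)
missing = duration_index("7", "Unknown St", "GOOD", averages, by_condition)
```

Under these assumptions, an index above 1.0 means trips from that station run longer than its monthly average in that weather, and an index of 1.0 means either no difference or no weather-specific value in the lookup.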

In [72]:
# add columns with the good- and bad-weather duration indexes
from pyspark.sql.functions import col

@udf(returnType=DoubleType())
def setweatherGoodDurationIndex(month: str, startStationName: str):
    key = f"{month}-{startStationName}"
    averageDuration = bc_durationByStationMonth.value.get(key)
    if averageDuration is None or float(averageDuration) == 0.0:
        # no usable average for this month and station, so return null instead of failing
        return None
    goodDuration = bc_goodBadDurationByStationMonth.value.get(key + "GOOD")
    if goodDuration is None:
        # if the dictionary doesn't have a good duration value, use the average duration
        goodDuration = averageDuration

    return float(goodDuration) / float(averageDuration)


@udf(returnType=DoubleType())
def setweatherBadDurationIndex(month: str, startStationName: str):
    key = f"{month}-{startStationName}"
    averageDuration = bc_durationByStationMonth.value.get(key)
    if averageDuration is None or float(averageDuration) == 0.0:
        # no usable average for this month and station, so return null instead of failing
        return None
    badDuration = bc_goodBadDurationByStationMonth.value.get(key + "BAD")
    if badDuration is None:
        # if the dictionary doesn't have a bad duration value, use the average duration
        badDuration = averageDuration

    return float(badDuration) / float(averageDuration)


# load the standard data file with only the columns I am interested in using:
# THIS IS THE FINAL DATA FILE
file = "/project/ds5559/Summer2021_TeamBike/gt2017_bikeweather.parquet"
sdf = spark.read.parquet(file)

sdf = sdf.withColumn("weatherGoodDurationIndex", setweatherGoodDurationIndex(col("month"), col("startStationName")))
sdf = sdf.withColumn("weatherBadDurationIndex", setweatherBadDurationIndex(col("month"), col("startStationName")))

sdf.show(5, False)

+----------+----+------------+------------------------+--------------+------------------------+--------------------+---------------------+------------+------------------+------------------+-------------------+-------+----------+---------+------+------+----------+--------+--------+--------+--------+----------+-------+-------+-------+-------+----------+------------+---+------+-----+--------+------------+----+---------+---------+---------+------------------------+-----------------------+
|date      |hour|tripduration|starttime               |startStationId|startStationName        |startStationLatitude|startStationLongitude|endStationId|endStationName    |endStationLatitude|endStationLongitude|bikeid |usertype  |birthyear|gender|temp  |feels_like|temp_min|temp_max|pressure|humidity|wind_speed|rain_1h|rain_3h|snow_1h|snow_3h|clouds_all|weather_main|dow|day   |month|time_bin|peak_commute|year|3h_precip|1h_precip|precip   |weatherGoodDurationIndex|weatherBadDurationIndex|
+----------+----+---