# Notebook V - Using ARIMA to Generate Predictions for all Stations

In [1]:
# imports
import pandas as pd
import numpy as np

# modeling
import pmdarima as pmd
from pmdarima.utils import tsdisplay, plot_acf, plot_pacf
from sklearn.metrics import mean_squared_error
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.seasonal import seasonal_decompose

# graphing
import matplotlib.pyplot as plt
from matplotlib.ticker import MultipleLocator
from pandas.plotting import register_matplotlib_converters
register_matplotlib_converters()
import seaborn as sns

## 5a. Read and Process Predictions File

In [249]:
# read predictions
predictions = pd.read_csv('./submission/predictions.csv')
predictions.head()

Unnamed: 0,1,1.1,17th & Guadalupe,-1
0,1,1,2nd & Congress,-1
1,1,1,5th & Bowie,-1
2,1,1,8th & Congress,-1
3,1,1,City Hall / Lavaca & 2nd,-1
4,1,1,Convention Center / 3rd & Trinity,-1


In [250]:
# move the header (17th & Guadalupe to row)
predictions = pd.concat([predictions.columns.to_frame().T, predictions], ignore_index = True)

# rename the columns
predictions.columns = ['month', 'day', 'start_station_name', 'predicted_station_revenue']

# change the "1.1" to 1 for day (odd format carries over from reading csv)
predictions.loc[0, 'day'] = 1
predictions.loc[0, 'month'] = 1

# now preview!
predictions.head()

Unnamed: 0,month,day,start_station_name,predicted_station_revenue
0,1,1,17th & Guadalupe,-1
1,1,1,2nd & Congress,-1
2,1,1,5th & Bowie,-1
3,1,1,8th & Congress,-1
4,1,1,City Hall / Lavaca & 2nd,-1


In [203]:
# let's store the unq stations we want to predict for
prediction_stations = list(predictions['start_station_name'].unique())
prediction_stations

['17th & Guadalupe',
 '2nd & Congress',
 '5th & Bowie',
 '8th & Congress',
 'City Hall / Lavaca & 2nd',
 'Convention Center / 3rd & Trinity',
 'Davis at Rainey Street',
 'Guadalupe & 21st',
 'South Congress & Academy',
 'West & 6th St.']

#### 5a. (i) One Final Rabbit Hole

At this stage, I had another epiphany. I initially wanted to verify that each of the stations I want to predict for have 730 rows (daily data for 2 years). If they did not, it meant something is broken. I found below that one station stood out: `Convention Center / 3rd & Trinity`

In [108]:
for name in prediction_stations:
    if daily_revenue[daily_revenue['start_station_name'] == name].shape[0] != 730:
        print(f"{name}: Not nice, what's going on?")

Convention Center / 3rd & Trinity: Not nice, what's going on?


In [109]:
# verifying that the above is true
daily_revenue[daily_revenue['start_station_name'] == 'Convention Center / 3rd & Trinity']

Unnamed: 0_level_0,start_station_name,trip_revenue
date,Unnamed: 1_level_1,Unnamed: 2_level_1


I went through the unique start station names in the data again by examining the unique names. My initial thought process was that it's possible there could be a mismatch in the names. So, I checked for any resemblance to 'Convention Center.'

In [114]:
for name in list(daily_revenue['start_station_name'].unique()):
    if 'convention' in name.lower():
        print(name)

Convention Center / 4th St. @ MetroRail


I found "Convention Center / 4th St. @ MetroRail" to resemble the missing station name. There were a couple hypotheses here:
1. There was a mistake - "Convention Center / 3rd & Trinity" was no longer active, and "4th St. @ MetroRail" replaced it
2. "Convention Center / 3rd & Trinity" is just not in the dataset

I checked Google Maps and found that there in fact is a [3rd & Trinity station](https://www.google.com/maps/place/Trinity+St+%26+E+3rd+St,+Austin,+TX+78701/@30.2645946,-97.7415888,19.01z/data=!4m6!3m5!1s0x8644b5a80b4639c7:0x4834f48bd0189e62!8m2!3d30.2642785!4d-97.7403589!16s%2Fg%2F11hb87s4xz?entry=ttu) as well as a [4th St. @ MetroRail](https://www.google.com/maps/place/ATX+MetroBike+Station/@30.2649572,-97.7390054,20.26z/data=!4m6!3m5!1s0x8644b5a7d4ad5db5:0x1716668ba7da4821!8m2!3d30.2648847!4d-97.7393963!16s%2Fg%2F11c4fc7ptt?entry=ttu), both right at the Convention Center. <br><br> After going back to the original dataset in a separate notebook and confirming that there were no trip records for "Convention Center / 3rd & Trinity", I decided to populate the predictions for this station as NaN.

---
## 5b. Read and Process Aggregated Trips Data

In [43]:
# read cleaned aggregated data
daily_revenue = pd.read_csv('./cleaned_data/trips_cleaned_aggregated.csv')
daily_revenue = daily_revenue[['date', 'start_station_name', 'trip_revenue']] # select relevant columns

# convert to datetime
daily_revenue['date'] = pd.to_datetime(daily_revenue['date'])

# set date as index
daily_revenue.set_index('date', inplace = True)

# preview
daily_revenue

Unnamed: 0_level_0,start_station_name,trip_revenue
date,Unnamed: 1_level_1,Unnamed: 2_level_1
2014-01-01,6th & Navasota St.,0.0
2014-01-02,6th & Navasota St.,0.0
2014-01-03,6th & Navasota St.,0.0
2014-01-04,6th & Navasota St.,0.0
2014-01-05,6th & Navasota St.,0.0
...,...,...
2015-12-27,Red River @ LBJ Library,0.0
2015-12-28,Red River @ LBJ Library,0.0
2015-12-29,Red River @ LBJ Library,0.0
2015-12-30,Red River @ LBJ Library,0.0


---
## 5c. Writing a Function to Generate Predictions

Note: I cannot fit the same model on all stations. Each station is technically its own time-series. They have different data with different starting/end points. Although I will assume the parameters to be the same for simplicity (as mentioned before), I want to train each model with the **station's own revenue** data from 2014-2015. <br>
In this section, I will write a function that accepts a station name and computes the following: 
1. The station's revenue data in a separate data frame
2. Train data set on the log of revenue data
3. Empty predictions dataframe for 2016
4. Instantiates an ARIMA class with chosen parameters (refer to Notebook IV)
5. Fits and predicts data and populates the predictions dataframe with the exponentiated predictions
6. Returns a dataframe for the station with predictions of daily revenue for 2016

In [128]:
def station_forecast(name):
    
    ### INITIAL CHECK FOR NAME ### 
    
    # if the passed name is not found, return a dataframe with indexed dates, station name, and nan values for predicted revenue
    if name not in list(daily_revenue['start_station_name'].unique()):
        return pd.DataFrame({'predicted_station_revenue': float('nan'), 'start_station_name': name},
                            index = pd.date_range(start = '2016-01-01', end = '2016-12-31', freq = 'D'))
    
    
    ### Setting up DataFrames ###
    
    # set up df (keeping the name arbitrary)
    df = daily_revenue[daily_revenue['start_station_name'] == name]
    # train (on log1p of revenue)
    train = np.log1p(df['trip_revenue'])
#     # empty df w/ 2016 to be populated w/ preds later
#     df_preds = pd.DataFrame(index = pd.date_range(start = '2016-01-01', end = '2016-12-31', freq = 'D'))
    

    ### Setting up the model ###
    
    # instantiate ARIMA (use normal ARIMA, we will hardcode parameters)
    # please refer to Notebook 4 end, parameters selected from best model
    arima = pmd.ARIMA(order = (1, 0, 1), seasonal_order = (2, 1, 1, 52))
    
    # fit arima on the train data for the model --> this step will likely take some time
    arima.fit(y = train)
    
    # use model to make predictions!
        # ensure predicting 366 (2016 leap yr)
    df_preds = pd.DataFrame(np.exp(arima.predict(366)), columns = ['predicted_station_revenue'])
    
    # add name
    df_preds['start_station_name'] = name
    
    
    ### Return predictions data_frame ###
    return df_preds

In [123]:
# create dictionary to store dataframes
stations_revenue = {key: None for key in prediction_stations}
stations_revenue

{'17th & Guadalupe': None,
 '2nd & Congress': None,
 '5th & Bowie': None,
 '8th & Congress': None,
 'City Hall / Lavaca & 2nd': None,
 'Convention Center / 3rd & Trinity': None,
 'Davis at Rainey Street': None,
 'Guadalupe & 21st': None,
 'South Congress & Academy': None,
 'West & 6th St.': None}

Now, I will loop through the keys in the dictionary and use the key names to pass through my function. The resulting value (dataframe) gets stored with the associated key (name).

#### Disclaimer: Running the code below issues several 'ValueWarnings'. One goal for the future is to figure out specifically how to prevent this.

In [143]:
for k in stations_revenue.keys():
    stations_revenue[k] = station_forecast(k)

  self._init_dates(dates, freq)
  self._init_dates(dates, freq)
  self._init_dates(dates, freq)
  self._init_dates(dates, freq)
  self._init_dates(dates, freq)
  self._init_dates(dates, freq)
  self._init_dates(dates, freq)
  self._init_dates(dates, freq)
  self._init_dates(dates, freq)
  self._init_dates(dates, freq)
  self._init_dates(dates, freq)
  self._init_dates(dates, freq)
  self._init_dates(dates, freq)
  self._init_dates(dates, freq)
  self._init_dates(dates, freq)
  self._init_dates(dates, freq)
  self._init_dates(dates, freq)
  self._init_dates(dates, freq)


In [144]:
# preview!
stations_revenue

{'17th & Guadalupe':             predicted_station_revenue start_station_name
 2016-01-01                  13.347392   17th & Guadalupe
 2016-01-02                  14.000361   17th & Guadalupe
 2016-01-03                  11.144340   17th & Guadalupe
 2016-01-04                  15.473278   17th & Guadalupe
 2016-01-05                  11.074507   17th & Guadalupe
 ...                               ...                ...
 2016-12-27                  19.662020   17th & Guadalupe
 2016-12-28                  12.762220   17th & Guadalupe
 2016-12-29                  12.909232   17th & Guadalupe
 2016-12-30                  14.454571   17th & Guadalupe
 2016-12-31                  13.561875   17th & Guadalupe
 
 [366 rows x 2 columns],
 '2nd & Congress':             predicted_station_revenue start_station_name
 2016-01-01                  22.018151     2nd & Congress
 2016-01-02                  36.363665     2nd & Congress
 2016-01-03                  27.182522     2nd & Congress
 2016-0

In [251]:
# use pd.concat() to combine all prediction DFs into 1
daily_revenue_predictions = pd.concat(stations_revenue.values())

In [252]:
# preview
daily_revenue_predictions

Unnamed: 0,predicted_station_revenue,start_station_name
2016-01-01,13.347392,17th & Guadalupe
2016-01-02,14.000361,17th & Guadalupe
2016-01-03,11.144340,17th & Guadalupe
2016-01-04,15.473278,17th & Guadalupe
2016-01-05,11.074507,17th & Guadalupe
...,...,...
2016-12-27,42.758178,West & 6th St.
2016-12-28,35.347591,West & 6th St.
2016-12-29,27.339568,West & 6th St.
2016-12-30,30.202653,West & 6th St.


Now, we need to add the same month & day columns. We will use month, day, and start_station_name for joining onto the `predictions` dataframe we read in earlier.

In [253]:
daily_revenue_predictions['month'] = daily_revenue_predictions.index.month
daily_revenue_predictions['day'] = daily_revenue_predictions.index.day
daily_revenue_predictions

Unnamed: 0,predicted_station_revenue,start_station_name,month,day
2016-01-01,13.347392,17th & Guadalupe,1,1
2016-01-02,14.000361,17th & Guadalupe,1,2
2016-01-03,11.144340,17th & Guadalupe,1,3
2016-01-04,15.473278,17th & Guadalupe,1,4
2016-01-05,11.074507,17th & Guadalupe,1,5
...,...,...,...,...
2016-12-27,42.758178,West & 6th St.,12,27
2016-12-28,35.347591,West & 6th St.,12,28
2016-12-29,27.339568,West & 6th St.,12,29
2016-12-30,30.202653,West & 6th St.,12,30


In [246]:
# save off daily_revenue_predictions dataset for future use
daily_revenue_predictions.to_csv('./cleaned_data/all-2016-predictions.csv', index = False)

Now, left join the `predicted_station_revenue` onto the dataframe from the `predictions.csv` file by joining on month, day, and start_station_name.

In [254]:
# merge, make sure left join
predictions = pd.merge(predictions, daily_revenue_predictions, on = ['month', 'day', 'start_station_name'], how = 'left')

In [255]:
# sample of 1 station
predictions[predictions['start_station_name'] == '17th & Guadalupe']

Unnamed: 0,month,day,start_station_name,predicted_station_revenue_x,predicted_station_revenue_y
0,1,1,17th & Guadalupe,-1,13.347392
10,1,2,17th & Guadalupe,-1,14.000361
20,1,3,17th & Guadalupe,-1,11.144340
30,1,4,17th & Guadalupe,-1,15.473278
40,1,5,17th & Guadalupe,-1,11.074507
...,...,...,...,...,...
1190,10,27,17th & Guadalupe,-1,15.410924
1200,10,28,17th & Guadalupe,-1,25.511229
1210,10,29,17th & Guadalupe,-1,15.662095
1220,10,30,17th & Guadalupe,-1,27.438606


In [256]:
# drop previous place holder column
predictions.drop(columns = ['predicted_station_revenue_x'], axis = 1, inplace = True)

# rename added column
predictions.rename(columns = {'month': 'MONTH','day': 'DAY','start_station_name':'START STATION NAME',
                              'predicted_station_revenue_y': 'PREDICTED STATION REVENUE'}, inplace = True)

In [257]:
# save off predictions file!
predictions.to_csv('./submission/predictions.csv', index = False)

This concludes my code and explanations for this bike-share daily revenue forecasting project. <br><br> Thank you for reading. Please contact me at patelprem922@gmail.com for questions, comments, concerns!