# Technical Summary of Process and Results

## Summary

## Table of Contents

* [0. Preliminaries and Setup Instructions](#preliminaries)
    * [0.1 PostgreSQL / PostGIS setup](#postgres)
    * [0.2 Configuring Your Database Connection in CaBi](#config-files)
    * [0.3 Conda Environment](#conda-env)
    * [0.4 The CaBi Package](#cabi-package)
* [1. Data Sources](#data-sources)
    * [1.1 Extract](#extract)
    * [1.2 Initial EDA](#eda-initial)
* [2. Data Preparation](#data-prep)
    * [2.1 Data Processing Steps](#data-processing)
    * [2.2 Pre-Modeling EDA](#eda-processed)
* [3. The Business Problem: Framing Our Modeling Approach](#business-problem)
    * [3.1 Background](#business-background)
    * [3.2 Goals](#business-goals)
        * [3.2.1 Countercyclical Areas](#countercyclical)
        * [3.2.2 Detecting Anomalies](#anomalies)
* [4. Modeling](#modeling)
    * [4.0 Historical Comparison of Data](#model-historical)
        * [4.0.1 Normal Range of Variance](#variance-benchmark)
    * [4.1 Benchmark: Baseline Model](#baseline)
    * [4.2 Model Evaluation Approach](#evaluation-approach)
        * [4.2.1 Tuning Comparison Criteria - AIC](#evaluation-aic)
        * [4.2.2 Cross Validation Method](#evaluation-cv)
        * [4.2.3 Out of Sample Performance - RMSE/SMAPE](#evaluation-oos)
    * [4.3 SARIMA](#sarima)
        * [4.3.1 Model Summary](#sarima-summary)
        * [4.3.2 Out of Sample Performance](#sarima-performance)
        * [4.3.3 Power Transforms](#sarima-transforms)
        * [4.3.4 Structural Limitations](#sarima-limits)
    * [4.4 SARIMAX: Adding Exogenous Features](#sarimax)
        * [4.4.1 Date Features](#exog-date)
        * [4.4.2 Fourier Approach](#exog-fourier)
        * [4.4.3 Dummy Filter](#dummy-filter)
    * [4.5 Markov Extension: Regime Switching](#markov)
    * [4.6 VARMAX???]
    * [4.7 Best Model](#best-model)
* [5. Evaluation/Results](#results)
* [6. Business Recommendations](#business-recommendations)
* [7. Further Research/Future Improvements](#further-research)
* [8. Sources](#sources)





## 0. Preliminaries and Setup Instructions <a class="anchor" id="preliminaries"></a>

This section is intended to help the user replicate the results of this analysis or extend it to new data. Below, I outline the steps for initializing a PostgreSQL/PostGIS database, creating the config.py file needed to use the database functionality of the CaBi (Capital Bikeshare) package, using the conda environment provided for the project, and installing the CaBi package for local use.


### 0.1 PostgreSQL / PostGIS <a class="anchor" id="postgres"></a>

Though it is not strictly necessary for the scope of this project as it stands, I highly recommend PostgreSQL/PostGIS as an intermediate storage method if the user plans to revisit the data more than once or twice. Pickling geopandas objects and other traditional file formats are a bit finicky, and the database approach has several advantages. First among these is flexibility with respect to the volume of data handled at any one time. In contrast to a pandas dataframe, for example, a database allows the user to load a smaller selection of areas to model at one time while storing the majority of records elsewhere. Geographical data are already somewhat larger than traditional datatypes, and if the user wishes to observe longer periods of time, the number of observations quickly expands into the millions after just a few months of trip history. Next is extensibility. Below we work primarily with counts from a modeling perspective, but in addition to new trip information becoming available monthly, there are a number of logical extensions to this method which incorporate, for example, weather data in conjunction with additional features of the dataset not used for this phase of the project. A database is well suited to joining and automatically updating these features.
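As a rough illustration of the first advantage, the sketch below pulls only the rows for one area out of a larger table. It uses an in-memory SQLite database as a stand-in for PostgreSQL so that it runs anywhere, and the `trips` table, its columns, and the area labels are all hypothetical rather than the schema this project creates.

```python
import sqlite3

# In-memory SQLite stands in for the PostgreSQL "CABI" database so the
# sketch runs anywhere; the table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (trip_id INTEGER, area TEXT, started_at TEXT)")
conn.executemany(
    "INSERT INTO trips VALUES (?, ?, ?)",
    [
        (1, "ANC 2B", "2020-04-01 08:00"),
        (2, "ANC 6D", "2020-04-01 08:05"),
        (3, "ANC 2B", "2020-04-01 09:30"),
    ],
)


def load_area(connection, area):
    """Fetch only the trips for one area, leaving the rest in the database."""
    cursor = connection.execute(
        "SELECT trip_id, started_at FROM trips WHERE area = ? ORDER BY trip_id",
        (area,),
    )
    return cursor.fetchall()


subset = load_area(conn, "ANC 2B")
print(subset)
```

Against PostgreSQL the same idea applies, though the placeholder syntax depends on the driver (psycopg2, for instance, uses `%s` rather than `?`).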

One can find the installation instructions for PostgreSQL on their website along with other helpful resources: https://www.postgresql.org/
The EDB installer is available for download here: https://www.enterprisedb.com/downloads/postgres-postgresql-downloads
For this project we utilize PostGIS to store point geometries for longitude/latitude coordinates. You can install PostGIS as part of the "Stack Builder" application distributed with the EDB installer referenced above, or find instructions on their website: https://postgis.net/install/

The database used was a default PostgreSQL database named "CABI", which the user can create with either psql or pgAdmin; all tables are generated directly from Python below. If this is your first time using PostgreSQL/PostGIS, a good guide to setting up the database is available here: https://www.e-education.psu.edu/spatialdb/node/1958. Remember to use the name "CABI".


One serious advantage of the database approach is that it allows the data to be accessed from more than one computer. For this project I used only a single user and a local network access permission scheme, but it could easily be scaled. Users who need to distribute data beyond the local network, or to set up multiple user permissions, are presumed capable of doing so. For more casual users who would like to use the local network approach, instructions are available here: https://www.postgresql.org/docs/12/auth-pg-hba-conf.html

An additional note for first-time users wishing to access the database from their local network, which is poorly addressed in the above documentation: after modifying the pg_hba.conf file, you must connect to the PostgreSQL instance and run the following; otherwise the modifications will not be recognized:

```SELECT pg_reload_conf()```


### 0.2 Configuring Your Database Connection in CaBi <a class="anchor" id="config-files"></a>

We use a rudimentary method to prevent the user's login credentials from being shared. For the CaBi package to work correctly with the database functionality, please do the following *prior to installing the CaBi package.*

    1. After forking and cloning the project locally, create a file named "config.py" in cabi/etl
    2. In config.py define the function connection_params in the following format, filling in the four parameters with the values you defined in setting up the CABI database above:
    
   ```python
    def connection_params():
        return 'postgresql://<user>:<password>@<address>:<port>/CABI'
   ```
           
       Note: the defaults are most likely postgres for user, localhost for address (unless you are connecting over a local network), and 5432 for port (unless you chose a different one during setup).
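If the password contains characters such as `@` or `/`, pasting it directly into the URL can break parsing. The following is a minimal sketch of one way to assemble the string safely. The helper name `build_connection_url` and its defaults are assumptions for illustration, not part of the CaBi package.

```python
from urllib.parse import quote_plus, urlsplit


def build_connection_url(user, password, address="localhost", port=5432, database="CABI"):
    """Assemble a PostgreSQL URL, percent-encoding the credentials."""
    return (
        f"postgresql://{quote_plus(user)}:{quote_plus(password)}"
        f"@{address}:{port}/{database}"
    )


# Example with a password containing reserved characters.
url = build_connection_url("postgres", "p@ss/word")

# Splitting the URL again is a quick check that each part landed where expected.
parts = urlsplit(url)
print(parts.hostname, parts.port, parts.path)
```

A `connection_params` function in config.py could then simply return the result of such a helper.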
        

### 0.3 Conda Environment <a class="anchor" id="conda-env"></a>

**FLAGGED NEEDS WORK! OUTSTANDING COMPATIBILITY ISSUE WITH STATSMODELS 0.12 AND PMDARIMA**
Possible Solutions:

1. 2 separate environments, don't love this
2. Drop either markov/fancy statespace bits or pmdarima, I think preference would be to retain pmdarima
3. Have asked for help in pmdarima issue #326 and on StackOverflow


Unfortunately, a bug in the statsmodels 0.12 release (which is otherwise excellent and contains several useful features that we make use of below) creates a compatibility issue with pmdarima's current release, 1.7.1. By the time you read this the issue may be resolved; for now, a carefully researched fix is included in this repository. The steps below should work and have been tested on Mac OSX and Linux, but there may be lingering issues for Windows users. If conda is unable to solve the included environment, we suggest removing the version constraint on statsmodels (pmdarima will want it to be less than 0.12, the opposite of what is specified in the environment file). Doing so means losing the ability to run the modeling sections on Markov Regime Switching, but as discussed we do not select this as our final model due to its out-of-sample prediction limitations. The steps we recommend for recreating the project environment follow. Seasoned conda users may want to reference the discussion thread at https://github.com/alkaline-ml/pmdarima/issues/326, especially the detailed answer given by pmdarima maintainer aaronreidsmith towards the end of the thread.

First, try the following from the home directory of this repo:

``` conda env create -f environment.yaml ```

If you encounter an issue where the above approach "hangs" when solving environment, the first recommendation is to update conda with

``` conda update conda ```

If the solve still hangs, run the following before creating the environment to limit the search space. The solve may still take a moment, but this should do the trick:

``` conda config --set channel_priority strict ```

Once the environment is in place, we need to install pmdarima. To do that without causing the compatibility issue mentioned above, we have adapted the meta.yaml file as discussed in the GitHub issue referenced above. It is available in the pmdarima directory of this repository. Please feel free to inspect meta.yaml before installing for details on the workaround. Next, run the following from the home directory of this repository:

``` conda-build pmdarima ```

This will generate a local channel from which you can install pmdarima. For more details please see: https://docs.conda.io/projects/conda-build/en/latest/user-guide/tutorials/build-pkgs-skeleton.html

If you previously set channel priority, you will next need to reset it with either "false", as shown below, or "flexible":

``` conda config --set channel_priority false ```

Next, we install pmdarima from our new local channel. Copy the channel location output by conda-build (it should look something like this: /opt/anaconda3/envs/cabi-env/conda-bld/osx-64/pmdarima-1.7.1-py38h1de35cc_0.tar.bz2), then run:

```conda config --add channels <YOUR-LOCAL-CHANNEL> ```

And last, to complete installation run:

```conda install pmdarima```

### 0.4 The CaBi package <a class="anchor" id="cabi-package"></a>

The helper functions used in this project are packaged together as CaBi, which the user can install locally to fully reproduce all steps. After creating the config.py file (it is important that this be done first), navigate to the home directory of the project from the command line and run one of the following commands, depending on whether or not you are using a conda environment. Both install in develop mode, so that the user can edit the underlying scripts as desired.

If using conda, to install:

```conda develop . ```

And to uninstall:

```conda develop -u . ```

If not using conda (substitute python3 on systems that differentiate), to install:

``` python setup.py develop ```

And to uninstall:

``` python setup.py develop --uninstall ```


The last step in the setup process is to add a kernel for the environment so you can use it inside Jupyter:

``` ipython kernel install --user --name=cabi-env ``` 



## 1. Data Sources <a class="anchor" id="data-sources"></a>

Several sources of raw data were utilized for this project.  All are listed here with a brief description:

1. The primary source of data for this project is the trip history released monthly by Capital Bikeshare. This can be found at https://s3.amazonaws.com/capitalbikeshare-data/index.html


2. This project utilizes several of the shapefiles available from Open Data DC for comparing locations and mapping. These are:
    - DC Advisory Neighborhood Commission Shapefiles: "https://opendata.arcgis.com/datasets/fcfbf29074e549d8aff9b9c708179291_1.geojson"
    - DC Boundary Line: "https://opendata.arcgis.com/datasets/7241f6d500b44288ad983f0942b39663_10.geojson"


3. Last, the region and station info from Capital Bikeshare is obtained from:
    - Capital Bikeshare Regions were used to speed up some of the geographic comparisons made on the point geometries below. These can be found at 'https://gbfs.capitalbikeshare.com/gbfs/en/system_regions.json'
    - Capital Bikeshare Stations Information was not utilized in real time (though this is certainly an area for improvement), but was used similarly for geometric comparisons as well as some mapping: https://gbfs.capitalbikeshare.com/gbfs/en/station_information.json
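For readers unfamiliar with the GBFS feeds listed above, here is a minimal sketch of reading station coordinates from the station information feed. It assumes the standard GBFS layout, in which stations sit under `data.stations` with `station_id`, `name`, `lat`, and `lon` keys. The helper names are hypothetical and this is not how the CaBi package itself loads the feed.

```python
import json
from urllib.request import urlopen

STATION_INFO_URL = "https://gbfs.capitalbikeshare.com/gbfs/en/station_information.json"


def fetch_json(url, timeout=30):
    """Download and decode a JSON document (requires network access)."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


def station_points(payload):
    """Return (station_id, name, lon, lat) tuples from a GBFS station feed."""
    stations = payload["data"]["stations"]
    return [(s["station_id"], s["name"], s["lon"], s["lat"]) for s in stations]


# A tiny hand-written payload in the GBFS layout, so the parser can be tried offline.
sample = {
    "data": {
        "stations": [
            {"station_id": "1", "name": "Example St & 1st St", "lon": -77.05, "lat": 38.86}
        ]
    }
}
print(station_points(sample))
```

With network access, `station_points(fetch_json(STATION_INFO_URL))` would apply the same parsing to the live feed.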

### 1.1 Extract <a class="anchor" id="extract"></a>

The project setup was designed to loosely mimic an ETL structure, to facilitate building out PostgreSQL triggers and automated collection in the future. The functions for gathering the raw data from source can be found in cabi.etl.extract. We walk through them here to show how they can be used to replicate the project structure.






```python
import cabi.etl.extract as extract
```

```python
# Builds the data/raw directory, containing csv files downloaded in zip format
# from https://s3.amazonaws.com/capitalbikeshare-data/index.html and unzipped in data/raw
# see comments in cabi.etl.extract for a breakdown of each function in the sequence called here
# Note to user: the resulting directory will be around 1.1 GB
extract.build_raw_data()
```

```python
# Extracts all csvs in the raw directory just created above into one list of dataframes
# corresponding to each csv
raw_dfs = extract.dfs_from_raw()
```
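To give a feel for what an unzip step of this kind involves, here is a minimal sketch that extracts only the csv members of a zip archive. It is an illustration under assumed names (`unzip_csvs` is hypothetical) and is not the implementation in cabi.etl.extract, so consult that module's comments for the actual behavior.

```python
import io
import tempfile
import zipfile
from pathlib import Path


def unzip_csvs(zip_bytes, target_dir):
    """Extract only the .csv members of a zip archive into target_dir."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for member in archive.namelist():
            if member.lower().endswith(".csv"):
                archive.extract(member, target)
                extracted.append(target / member)
    return extracted


# Build a small archive in memory to exercise the helper without downloading.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("example-tripdata.csv", "Duration,Start date\n60,2020-01-01 00:00:00\n")
    archive.writestr("readme.txt", "not a csv")

with tempfile.TemporaryDirectory() as tmp:
    paths = unzip_csvs(buffer.getvalue(), tmp)
    names = [p.name for p in paths]
    first_line = paths[0].read_text().splitlines()[0]

print(names, first_line)
```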



### 1.2 Initial EDA <a class="anchor" id="eda-initial"></a>

##### The Columns Are Not All The Same

A good first step in any data science process is to view the column names we're dealing with. In this case we see that the naming convention for the trips data has recently changed.

```python
import pandas as pd
```

```python
col_names = [df.columns for df in raw_dfs]
col_names
```

```
[Index(['Duration', 'Start date', 'End date', 'Start station number',
        'Start station', 'End station number', 'End station', 'Bike number',
        'Member type'],
       dtype='object'),
 Index(['Duration', 'Start date', 'End date', 'Start station number',
        'Start station', 'End station number', 'End station', 'Bike number',
        'Member type'],
       dtype='object'),
 Index(['Duration', 'Start date', 'End date', 'Start station number',
        'Start station', 'End station number', 'End station', 'Bike number',
        'Member type'],
       dtype='object'),
 Index(['Duration', 'Start date', 'End date', 'Start station number',
        'Start station', 'End station number', 'End station', 'Bike number',
        'Member type'],
       dtype='object'),
 Index(['Duration', 'Start date', 'End date', 'Start station number',
        'Start station', 'End station number', 'End station', 'Bike number',
        'Member type'],
       dtype='object'),
 Index(['Duration', 'Star
```
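Because the full listing is long and repetitive, it can help to count how many files share each distinct set of column names. The sketch below does this with the standard library. The helper name `schema_counts` is hypothetical, and the second layout in the example input is illustrative only.

```python
from collections import Counter


def schema_counts(column_sets):
    """Count how many files share each distinct, ordered set of column names."""
    return Counter(tuple(columns) for columns in column_sets)


# Illustrative input: two files in the older layout and one in a hypothetical
# newer layout. With the real data, pass `col_names` from the cell above.
example = [
    ["Duration", "Start date", "End date"],
    ["Duration", "Start date", "End date"],
    ["ride_id", "started_at", "ended_at"],
]
counts = schema_counts(example)
for schema, count in counts.most_common():
    print(count, schema)
```

Since iterating over a pandas `Index` yields its labels, the same function should accept the `col_names` list directly.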

##### But the Information Is Roughly The Same

# **FLAGGED Insert Steps from Excel Feedback here**

- Data understanding: go into what the dataset is.  What date ranges are included, what time increments the data is in, what geographic range the data covers, etc.
 
- EDA: include EDA here


- Notably different items are:
    - Rideable Type (All of the old sets are presumed to be Docked Bikes)
    - Missing Lat/Lng (need an imputation strategy)
    - Different Station ID/Numbering Convention (not a huge deal, since naming is consistent)
    - No Unique Ride ID (we may want to add some sort of unique identifier if using this Data)
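One possible way to reconcile the two layouts is to rename the older columns and fill in the fields they lack. The sketch below works on plain dictionaries to stay self-contained. The newer column names in `OLD_TO_NEW` are assumptions that should be checked against the actual files, the `docked_bike` label is likewise assumed, and `harmonize` is a hypothetical helper, not part of the CaBi package.

```python
# Mapping from the older column names to the newer ones. The newer names are
# assumptions here and should be verified against the downloaded files.
OLD_TO_NEW = {
    "Start date": "started_at",
    "End date": "ended_at",
    "Start station number": "start_station_id",
    "Start station": "start_station_name",
    "End station number": "end_station_id",
    "End station": "end_station_name",
    "Member type": "member_casual",
}


def harmonize(row):
    """Rename old-format keys and fill in fields the old format lacks."""
    new_row = {OLD_TO_NEW.get(key, key): value for key, value in row.items()}
    # Older files have no rideable type; they are presumed to be docked bikes.
    new_row.setdefault("rideable_type", "docked_bike")
    # Coordinates are absent in the old format; leave them empty for later imputation.
    for coord in ("start_lat", "start_lng", "end_lat", "end_lng"):
        new_row.setdefault(coord, None)
    return new_row


old_row = {"Duration": 60, "Start station": "Example St & 1st St", "Member type": "Member"}
result = harmonize(old_row)
print(result)
```

Renaming alone does not reconcile the different station numbering conventions, so station names remain the safer join key.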

## 2. Data Preparation <a class="anchor" id="data-prep"></a>


### 2.1 Data Processing Steps <a class="anchor" id="data-processing"></a>






### 2.2 Pre-Modeling EDA <a class="anchor" id="eda-processed"></a>

## 3. The Business Problem: Framing Our Modeling Approach <a class="anchor" id="business-problem"></a>

### 3.1 Background <a class="anchor" id="business-background"></a>

### 3.2 Goals <a class="anchor" id="business-goals"></a>

#### 3.2.1 Countercyclical Areas <a class="anchor" id="countercyclical"></a>

#### 3.2.2 Detecting Anomalies <a class="anchor" id="anomalies"></a>


## 4. Modeling <a class="anchor" id="modeling"></a>

        
        

### 4.0 Historical Comparison of Data <a class="anchor" id="model-historical"></a>

#### 4.0.1 Normal Range of Variance <a class="anchor" id="variance-benchmark"></a>

### 4.1 Benchmark: Baseline Model <a class="anchor" id="baseline"></a>  

### 4.2 Model Evaluation Approach <a class="anchor" id="evaluation-approach"></a>  

#### 4.2.1 Tuning Comparison Criteria - AIC <a class="anchor" id="evaluation-aic"></a>  

#### 4.2.2 Cross Validation Method <a class="anchor" id="evaluation-cv"></a>  

#### 4.2.3 Out of Sample Performance - RMSE/SMAPE <a class="anchor" id="evaluation-oos"></a>

### 4.3 SARIMA <a class="anchor" id="sarima"></a>

#### 4.3.1 Model Summary <a class="anchor" id="sarima-summary"></a>

#### 4.3.2 Out of Sample Performance <a class="anchor" id="sarima-performance"></a>

#### 4.3.3 Power Transforms <a class="anchor" id="sarima-transforms"></a>

#### 4.3.4 Structural Limitations  <a class="anchor" id="sarima-limits"></a>

### 4.4 SARIMAX: Adding Exogenous Features  <a class="anchor" id="sarimax"></a>


#### 4.4.1 Date Features  <a class="anchor" id="exog-date"></a>


#### 4.4.2 Fourier Approach  <a class="anchor" id="exog-fourier"></a>


#### 4.4.3 Dummy Filter <a class="anchor" id="dummy-filter"></a>

### 4.5 Markov Extension: Regime Switching <a class="anchor" id="markov"></a>

### 4.6 VARMAX OR OTHER MODEL APPROACH?!?! <a class="anchor" id="varmax"></a>

### 4.7 Best Model <a class="anchor" id="best-model"></a>

## 5. Evaluation/Results  <a class="anchor" id="results"></a>


## 6. Business Recommendations <a class="anchor" id="business-recommendations"></a>


## 7. Further Research/Future Improvements <a class="anchor" id="further-research"></a>


## 8. Sources <a class="anchor" id="sources"></a>