# Technical Summary of Process and Results

## Summary

## Table of Contents

* [0. Preliminaries and Setup Instructions](#preliminaries)
    * [0.1 PostgreSQL / PostGIS setup](#postgres)
    * [0.2 Configuring Your Database Connection in CaBi](#config-files)
    * [0.3 Conda Environment](#conda-env)
    * [0.4 The CaBi Package](#cabi-package)
* [1. Data Sources](#data-sources)
    * [1.1 Extract](#extract)
    * [1.2 Initial EDA](#eda-initial)
* [2. Data Preparation](#data-prep)
    * [2.1 Data Processing Steps](#data-processing)
    * [2.2 Pre-Modeling EDA](#eda-processed)
* [3. The Business Problem: Framing Our Modeling Approach](#business-problem)
    * [3.1 Background](#business-background)
    * [3.2 Goals](#business-goals)
        * [3.2.1 Countercyclical Areas](#countercyclical)
        * [3.2.2 Detecting Anomalies](#anomalies)
* [4. Modeling](#modeling)
    * [4.0 Historical Comparison of Data](#model-historical)
        * [4.0.1 Normal Range of Variance](#variance-benchmark)
    * [4.1 Benchmark: Baseline Model](#baseline)
    * [4.2 Model Evaluation Approach](#evaluation-approach)
        * [4.2.1 Tuning Comparison Criteria - AIC](#evaluation-aic)
        * [4.2.2 Cross Validation Method](#evaluation-cv)
        * [4.2.3 Out of Sample Performance - RMSE/SMAPE](#evaluation-oos)
    * [4.3 SARIMA](#sarima)
        * [4.3.1 Model Summary](#sarima-summary)
        * [4.3.2 Out of Sample Performance](#sarima-performance)
        * [4.3.3 Power Transforms](#sarima-transforms)
        * [4.3.4 Structural Limitations](#sarima-limits)
    * [4.4 SARIMAX: Adding Exogenous Features](#sarimax)
        * [4.4.1 Date Features](#exog-date)
        * [4.4.2 Fourier Approach](#exog-fourier)
        * [4.4.3 Dummy Filter](#dummy-filter)
    * [4.5 Markov Extension: Regime Switching](#markov)
    * [4.6 VARMAX???]
    * [4.7 Best Model](#best-model)
* [5. Evaluation/Results](#results)
* [6. Business Recommendations](#business-recommendations)
* [7. Further Research/Future Improvements](#further-research)
* [8. Sources](#sources)





## 0. Preliminaries and Setup Instructions <a class="anchor" id="preliminaries"></a>

This section is intended to help the user replicate the results of this analysis or extend it more comfortably to new data. Below, I outline the steps for initializing a PostgreSQL/PostGIS database, creating the config.py file required by the database functionality of the CaBi (Capital Bikeshare) package, using the conda environment provided for the project, and installing the CaBi package for local use.


### 0.1 PostgreSQL / PostGIS <a class="anchor" id="postgres"></a>

Though it is not strictly necessary for the scope of this project as it stands, I highly recommend using PostgreSQL/PostGIS as an intermediate storage layer if the user plans to revisit the data more than once or twice. Pickled geopandas objects and other traditional file formats can be finicky, and the database approach has several advantages.

First among these is flexibility with respect to the volume of data handled at any one time. In contrast to a pandas dataframe, for example, a database allows the user to load a smaller selection of areas to model at one time while the majority of records are stored elsewhere. Geographical data are already somewhat larger than traditional datatypes, and if the user wishes to observe longer periods of time, the number of observations quickly expands into the millions after just a few months of trip history.

Next is extensibility. Below we work primarily with counts from a modeling perspective, but new trip information becomes available monthly, and there are a number of logical extensions to this method that incorporate, for example, weather data, in conjunction with additional features of the dataset not used for this phase of the project. A database is well suited to joining and automatically updating these features.

Installation instructions for PostgreSQL, along with other helpful resources, are available on the project website: https://www.postgresql.org/

The EDB installer is available for download here: https://www.enterprisedb.com/downloads/postgres-postgresql-downloads

For this project we use PostGIS to store point geometries for longitude/latitude coordinates. You can install PostGIS through the "Stack Builder" application distributed with the EDB installer referenced above, or follow the instructions on the PostGIS website: https://postgis.net/install/

The database used for this project was a default PostgreSQL database named "CABI", which the user can create with either psql or pgAdmin. All tables are generated below directly from Python. If this is your first time using PostgreSQL/PostGIS, you can find a good guide to setting up the database here: https://www.e-education.psu.edu/spatialdb/node/1958 (but remember to use the name "CABI").


One serious advantage of the database approach is that it allows the data to be accessed from more than one computer. For this project I used only a single user and a local network access permission scheme, but the setup could easily be scaled. Users for whom it makes sense to distribute data beyond the local network, or to set up multiple user permissions, are presumed capable of doing so. For more casual users who would like to use the local network approach, instructions are available here: https://www.postgresql.org/docs/12/auth-pg-hba-conf.html

An additional note for first-time users wishing to access the database from their local network, which is poorly addressed in the above documentation: once you have modified the pg_hba.conf file, connect to the PostgreSQL instance and run the following. Otherwise the modifications will not be recognized until the server is reloaded or restarted:

```sql
SELECT pg_reload_conf();
```
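
The reload step above can also be issued from Python rather than from psql or pgAdmin. The following is a minimal sketch, not part of the CaBi package: `reload_pg_config` is a hypothetical helper name, and it assumes `conn` is an open DB-API connection (for example one created with psycopg2) for a role that is allowed to call `pg_reload_conf()`, which is typically a superuser.

```python
def reload_pg_config(conn):
    """Ask the PostgreSQL server to re-read its configuration files.

    Hypothetical helper. `conn` is assumed to be an open DB-API
    connection for a role permitted to call pg_reload_conf().
    """
    cur = conn.cursor()
    try:
        cur.execute("SELECT pg_reload_conf()")
        # The query returns one row holding a single boolean.
        return bool(cur.fetchone()[0])
    finally:
        cur.close()


# Hypothetical usage (requires psycopg2 and a running server):
# import psycopg2
# conn = psycopg2.connect("postgresql://<user>:<password>@<address>:<port>/CABI")
# print(reload_pg_config(conn))
```

Passing the connection in, instead of opening it inside the function, keeps credentials out of the helper and lets it work with whichever driver is already in use.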


### 0.2 Configuring Your Database Connection in CaBi <a class="anchor" id="config-files"></a>

We use a rudimentary method to prevent the user's login credentials from being shared. For the database functionality of the CaBi package to work correctly, please do the following *prior to installing the CaBi package*:

1. After forking and cloning the project locally, create a file named "config.py" in cabi/etl.
2. In config.py, define the function connection_params in the following format, filling in the four parameters with the values you defined when setting up the CABI database above:

    ```python
    def connection_params():
        """Return the connection URL for the CABI database."""
        return 'postgresql://<user>:<password>@<address>:<port>/CABI'
    ```

    Note: the defaults are probably postgres for user, localhost for address (unless you are connecting over a local network), and 5432 for port (unless you chose a different one during setup).
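
A password containing characters such as `@` or `:` can break a hand-written connection URL. The sketch below shows one way to assemble and sanity-check a URL in the format above using only the standard library. Both function names are hypothetical and are not part of the CaBi package, and the default values simply mirror the likely defaults mentioned in the note.

```python
from urllib.parse import quote, urlsplit


def build_connection_url(user, password, address="localhost", port=5432, database="CABI"):
    """Assemble a URL in the format connection_params() is expected to return.

    Hypothetical helper. The user name and password are percent-encoded
    so that characters like '@' or ':' are not read as URL delimiters.
    """
    return (
        f"postgresql://{quote(user, safe='')}:{quote(password, safe='')}"
        f"@{address}:{int(port)}/{database}"
    )


def check_connection_url(url):
    """Return the parsed parts of the URL, raising ValueError if any is missing."""
    parts = urlsplit(url)
    found = {
        "scheme": parts.scheme,
        "user": parts.username,
        "address": parts.hostname,
        "port": parts.port,  # raises ValueError itself if the port is not a number
        "database": parts.path.lstrip("/"),
    }
    missing = [name for name, value in found.items() if not value]
    if parts.scheme != "postgresql" or missing:
        raise ValueError(f"malformed connection URL, check: {missing or ['scheme']}")
    return found


url = build_connection_url("postgres", "p@ss:word")
print(url)
print(check_connection_url(url))
```

With helpers like these, connection_params in config.py could return the result of `build_connection_url(...)` instead of a hand-typed string, though the plain string shown above works just as well for simple passwords.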
        

### 0.3 Conda Environment <a class="anchor" id="conda-env"></a>


### 0.4 The CaBi Package <a class="anchor" id="cabi-package"></a>





## 1. Data Sources <a class="anchor" id="data-sources"></a>

### 1.1 Extract <a class="anchor" id="extract"></a>

### 1.2 Initial EDA <a class="anchor" id="eda-initial"></a>




## 2. Data Preparation <a class="anchor" id="data-prep"></a>


### 2.1 Data Processing Steps <a class="anchor" id="data-processing"></a>

### 2.2 Pre-Modeling EDA <a class="anchor" id="eda-processed"></a>




## 3. The Business Problem: Framing Our Modeling Approach <a class="anchor" id="business-problem"></a>

### 3.1 Background <a class="anchor" id="business-background"></a>

### 3.2 Goals <a class="anchor" id="business-goals"></a>

#### 3.2.1 Countercyclical Areas <a class="anchor" id="countercyclical"></a>

#### 3.2.2 Detecting Anomalies <a class="anchor" id="anomalies"></a>


## 4. Modeling <a class="anchor" id="modeling"></a>

        
        

### 4.0 Historical Comparison of Data <a class="anchor" id="model-historical"></a>

#### 4.0.1 Normal Range of Variance <a class="anchor" id="variance-benchmark"></a>

### 4.1 Benchmark: Baseline Model <a class="anchor" id="baseline"></a>  

### 4.2 Model Evaluation Approach <a class="anchor" id="evaluation-approach"></a>  

#### 4.2.1 Tuning Comparison Criteria - AIC <a class="anchor" id="evaluation-aic"></a>  

#### 4.2.2 Cross Validation Method <a class="anchor" id="evaluation-cv"></a>  

#### 4.2.3 Out of Sample Performance - RMSE/SMAPE <a class="anchor" id="evaluation-oos"></a>

### 4.3 SARIMA <a class="anchor" id="sarima"></a>

#### 4.3.1 Model Summary <a class="anchor" id="sarima-summary"></a>

#### 4.3.2 Out of Sample Performance <a class="anchor" id="sarima-performance"></a>

#### 4.3.3 Power Transforms <a class="anchor" id="sarima-transforms"></a>

#### 4.3.4 Structural Limitations  <a class="anchor" id="sarima-limits"></a>

### 4.4 SARIMAX: Adding Exogenous Features  <a class="anchor" id="sarimax"></a>


#### 4.4.1 Date Features  <a class="anchor" id="exog-date"></a>


#### 4.4.2 Fourier Approach  <a class="anchor" id="exog-fourier"></a>


#### 4.4.3 Dummy Filter <a class="anchor" id="dummy-filter"></a>

### 4.5 Markov Extension: Regime Switching <a class="anchor" id="markov"></a>

### 4.6 VARMAX or Other Model Approach? <a class="anchor" id="varmax"></a>

### 4.7 Best Model <a class="anchor" id="best-model"></a>

## 5. Evaluation/Results  <a class="anchor" id="results"></a>


## 6. Business Recommendations <a class="anchor" id="business-recommendations"></a>


## 7. Further Research/Future Improvements <a class="anchor" id="further-research"></a>


## 8. Sources <a class="anchor" id="sources"></a>