# This is the Final build for the Project

### There are 3 main stages in this project, plus one auxiliary stage (Stage 0) that is run once to set up the project.
First, `final_project.ipynb` is run to populate the PostgreSQL database and do some data cleaning.
This file is run afterwards.  

This file contains 3 stages:
Data Generation -> Graph Construction -> Visualization


## Stage 1: Data Generation


<div class="alert alert-block alert-danger">
<b>Danger:</b> Run Stage 0 (<code>final_project.ipynb</code>) before starting this stage.
</div>

### Input: 
* Start time ``(2019-01-01 00:00:00)``
* End time ``(2019-01-01 23:59:59)``
* Poolsize ``(300, 420, 600)``
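
To illustrate how these three inputs fit together, here is a minimal sketch that splits the start/end range into consecutive pooling windows. It assumes the pool size is expressed in seconds (suggested by the `day_in_seconds` constant used later, but not stated explicitly); `pool_windows` is a hypothetical helper, not part of the project code.

```python
from datetime import datetime, timedelta


def pool_windows(start, end, pool_size_s):
    """Yield half-open (window_start, window_end) pairs that cover start..end.

    pool_size_s is assumed to be the pool size in seconds.
    """
    if pool_size_s <= 0:
        raise ValueError("pool_size_s must be positive")
    step = timedelta(seconds=pool_size_s)
    window_start = start
    # The last window may extend past `end`; callers can clamp it if needed.
    while window_start <= end:
        yield window_start, window_start + step
        window_start += step


start = datetime(2019, 1, 1, 0, 0, 0)
end = datetime(2019, 1, 1, 23, 59, 59)
windows = list(pool_windows(start, end, 300))
print(len(windows))  # 288 five-minute windows in one day
print(windows[0])
```

Each window could then be used to select the trips whose pickup time falls inside it before the graph for that pool is built.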
    
### Output: 
* CSV file / pandas DataFrame that contains the data for the next stage
    * The DataFrame will contain these fields: 
         1. id
         2. tpep_pickup_datetime
         3. tpep_dropoff_datetime
         4. passenger_count
         5. trip_distance - acquired from OSRM
         6. PULocationID
         7. DOLocationID


In [31]:
import numpy as np
import pandas as pd 
import geopandas as gpd
import psycopg2

import matplotlib.pyplot as plt
from datetime import datetime, timedelta
from sqlalchemy import create_engine

plt.style.use('ggplot')

# Constants
# Maps a duration key to the interval string passed to the database function
interval_dict = {
    "1hour": "1 HOURS",
    "1day": "1 DAYS",
    "1month": "1 MONTHS",
}
pool_size = 300
day_in_seconds = 24 * 3600


# Connect to DB

conn_string = "postgresql://nycrideshare:nycrideshare@127.0.0.1:5432/nyc_taxi"
nyc_database = create_engine(conn_string)

In [32]:
from sqlalchemy import text


def generate_data(start_time, data_size_duration):
    '''
    Fetch a chunk of trip data that starts at start_time and spans the
    requested duration (an hour, a day, or a month).

    [start_time]: datetime object marking the start of the chunk
    [data_size_duration]: one of the keys of interval_dict ("1hour", "1day", "1month")

    Returns the query result as a pandas DataFrame.
    '''
    # Bind the values as parameters instead of formatting them into the SQL string
    query = text(
        """select 
        id, 
        tpep_pickup_datetime, 
        tpep_dropoff_datetime, 
        passenger_count, 
        "PULocationID", 
        "DOLocationID"
        from nyc_taxi_schema.get_cust_between_timestamps_lgd(:start_time, :duration);"""
    )
    params = {
        "start_time": start_time.strftime("%Y-%m-%d %H:%M:%S"),
        "duration": interval_dict[data_size_duration],
    }
    return pd.read_sql_query(query, nyc_database, params=params)

In [33]:
start_time = datetime(2019, 1, 1, 0, 0, 0)
data_size_duration = "1hour"
df_jan = generate_data(start_time, data_size_duration)
    

9.458480834960938
1239726


## Stage 2: Graph Construction

This stage constructs graphs with NetworkX to model the relationships between passengers. 
Each edge connects rides that are merged. 

### Input Parameters: 
* Poolsize
* Weight-calculating functions, passed as arguments
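
As a rough sketch of how a weight function can be passed in as an argument, the example below builds a weighted edge list for one pool of trips. Everything here is illustrative: `build_pool_edges`, `shared_endpoint_weight`, the sample trips, and the scoring rule are assumptions, not the project's actual method. It uses only the standard library; the resulting edge list is in the `(u, v, weight)` form that NetworkX's `Graph.add_weighted_edges_from` accepts.

```python
from itertools import combinations


def shared_endpoint_weight(a, b):
    """Hypothetical weight: 1.0 for each shared pickup/dropoff zone.

    Returns None when the two trips share neither zone, meaning no edge.
    """
    score = float(a["PULocationID"] == b["PULocationID"]) \
        + float(a["DOLocationID"] == b["DOLocationID"])
    return score if score > 0 else None


def build_pool_edges(trips, weight_fn):
    """Return (id_a, id_b, weight) tuples for every pair that weight_fn accepts."""
    edges = []
    for a, b in combinations(trips, 2):
        weight = weight_fn(a, b)
        if weight is not None:
            edges.append((a["id"], b["id"], weight))
    return edges


# Made-up sample pool; only the fields needed for the illustration are included.
pool = [
    {"id": 1, "PULocationID": 132, "DOLocationID": 236},
    {"id": 2, "PULocationID": 132, "DOLocationID": 236},
    {"id": 3, "PULocationID": 132, "DOLocationID": 48},
    {"id": 4, "PULocationID": 90, "DOLocationID": 7},
]
edges = build_pool_edges(pool, shared_endpoint_weight)
print(edges)  # [(1, 2, 2.0), (1, 3, 1.0), (2, 3, 1.0)]
```

Keeping the weight function as a parameter means different merge criteria can be tried without changing the graph-building code.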
 

## Stage 3: Visualization

This stage gathers the data from Stage 2 for visualization.

### Input Parameters: 
* TBD
* TBD