# This is the Final build for the Project

### This project has 3 main stages and one auxiliary stage that is run to set up the project.
First, run `final_project.ipynb` to populate the PostgreSQL database and do some data cleaning.
Then run this file.

This file contains 3 phases:
Data Generation -> Graph Building -> Visualization


## Stage 1: Data Generation


<div class="alert alert-block alert-danger">
<b>Danger:</b> Run Stage 0 before starting this
</div>

### Input: 
* Start time ``(2019-01-01 00:00:00)``
* End time ``(2019-01-01 23:59:59)``
* Poolsize ``(300, 420, 600)``
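
To make the three inputs concrete, here is a minimal sketch of how a pool size could split the start/end range into pooling windows. It assumes the pool sizes are in seconds (suggested by the `day_in_seconds` constant used in this notebook, but not stated explicitly); `pool_windows` is a hypothetical helper, not part of the project code.

```python
from datetime import datetime, timedelta


def pool_windows(start, end, pool_size_seconds):
    """Yield (window_start, window_end) pairs that cover [start, end)."""
    delta = timedelta(seconds=pool_size_seconds)
    current = start
    while current < end:
        # The last window is clipped so it never runs past `end`.
        yield current, min(current + delta, end)
        current += delta


start = datetime(2019, 1, 1, 0, 0, 0)
end = datetime(2019, 1, 1, 23, 59, 59)

# Number of pooling windows per day for each documented pool size
# (assumed to be seconds).
windows_per_day = {
    size: sum(1 for _ in pool_windows(start, end, size))
    for size in (300, 420, 600)
}
print(windows_per_day)
```

Under that assumption, a smaller pool size means more, shorter windows per day, so fewer rides are candidates for merging in each window.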
    
### Output: 
* CSV / Pandas DF that contains data for the next cycle
    * The dataframe will contain these fields: 
         1. id
         2. tpep_pickup_datetime
         3. tpep_dropoff_datetime
         4. passenger_count
         5. trip_distance - acquired from OSRM
         6. PULocationID
         7. DOLocationID


In [1]:
import numpy as np
import pandas as pd 
import geopandas as gpd
import psycopg2

import matplotlib.pyplot as plt
from datetime import datetime, timedelta
from sqlalchemy import create_engine

plt.style.use('ggplot')

# Constants
interval_dict = {
    "1hour": """1 HOURS""",
    "1day": """1 DAYS""",
    "1month": """1 MONTHS"""

}
pool_size = 300
day_in_seconds = 24 * 3600


# Connect to DB

conn_string = "postgresql://nycrideshare:nycrideshare@127.0.0.1:5432/nyc_taxi"
nyc_database = create_engine(conn_string)

In [2]:
def generate_data(start_time, data_size_duration):
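    '''
    Fetch a chunk of trip data (an hour, a day or a month) starting at start_time
    and attach the interzonal distance to every row.

    [start_time]: datetime object marking the start of the chunk
    [data_size_duration]: one of the keys of interval_dict ("1hour", "1day", "1month")

    Returns a DataFrame indexed by id.
    '''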
    
    time_string = start_time.strftime("%Y%m%d_%H%M%S")
    query = \
        f"""select 
        id, 
        tpep_pickup_datetime, 
        tpep_dropoff_datetime, 
        passenger_count, 
        "PULocationID", 
        "DOLocationID"
        from nyc_taxi_schema.get_cust_between_timestamps_lgd('{start_time.strftime("%Y-%m-%d %H:%M:%S")}', '{interval_dict[data_size_duration]}');"""


    # Load the interzonal distance matrix (rows: pickup zone, columns: drop-off zone)
    interzonal_dist = pd.read_csv("./data/interzonal.csv")
    # Run the query and load the result into a dataframe
    df_temp = pd.read_sql_query(query, nyc_database)
    # Add the distance to all the rows (location IDs are 1-based, the matrix is 0-based)
    df_temp["Distance"] = df_temp.apply(lambda row: interzonal_dist.iloc[row["PULocationID"]-1, row["DOLocationID"]-1], axis=1)

    return df_temp.set_index("id", drop=True)
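
The row-wise `apply` above can get slow on a full day or month of trips. As a sketch of an alternative, the same positional lookup can be done in one step with NumPy integer-array indexing. `add_distance_vectorized` is a hypothetical helper and the 3x3 matrix is synthetic; it assumes, like the code above, that zone ID `k` sits at position `k - 1` on both axes of the matrix.

```python
import numpy as np
import pandas as pd


def add_distance_vectorized(trips, interzonal_dist):
    """Return a copy of `trips` with a Distance column looked up positionally."""
    dist = interzonal_dist.to_numpy()
    # Location IDs are 1-based, matrix positions are 0-based.
    pu = trips["PULocationID"].to_numpy() - 1
    do = trips["DOLocationID"].to_numpy() - 1
    out = trips.copy()
    # Pairs each pickup index with its drop-off index, one lookup per row.
    out["Distance"] = dist[pu, do]
    return out


# Synthetic 3-zone distance matrix, for illustration only.
interzonal_dist = pd.DataFrame(
    [[0.0, 10.0, 20.0],
     [10.0, 0.0, 5.0],
     [20.0, 5.0, 0.0]]
)
trips = pd.DataFrame({"PULocationID": [1, 3], "DOLocationID": [2, 2]})

result = add_distance_vectorized(trips, interzonal_dist)
print(result)
```

The result should match what `interzonal_dist.iloc[pu - 1, do - 1]` returns for each row, without calling a Python function per trip.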




In [3]:
start_time = datetime(2019, 1, 1, 0, 0, 0)
end_time = start_time + timedelta(minutes=60)
data_size_duration = "1hour" # This is to get the initial dataframe
df_generated = generate_data(start_time, data_size_duration)

In [4]:
lgd_flag = {
    "pickup": 'PULocationID',
    "drop": 'DOLocationID'
}  # maps the flag to the column that must equal the LGD zone ID (138): pickup or drop-off

def date_iterator(ts_start, ts_end, delta_in_minutes, flag):
    current = ts_start
    delta = timedelta(minutes=delta_in_minutes)
    while current < ts_end:
        yield df_generated[
            (df_generated['tpep_pickup_datetime'] >= current) & 
            (df_generated['tpep_pickup_datetime'] < current + delta) &
            (df_generated[lgd_flag[flag]] == 138)]

        current += delta
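
`date_iterator` reads the global `df_generated`, so it only works after cell 3 has run. A possible variant, sketched below, takes the dataframe and the zone ID as arguments and also yields the window start. The name `iter_windows`, the `zone_id` parameter and the demo rows are assumptions made for illustration, not part of the project.

```python
from datetime import datetime, timedelta

import pandas as pd

LOCATION_COLUMN = {"pickup": "PULocationID", "drop": "DOLocationID"}


def iter_windows(df, ts_start, ts_end, delta_in_minutes, flag, zone_id=138):
    """Yield (window_start, rows) for trips touching `zone_id` in each window."""
    column = LOCATION_COLUMN[flag]
    delta = timedelta(minutes=delta_in_minutes)
    current = ts_start
    while current < ts_end:
        in_window = (df["tpep_pickup_datetime"] >= current) & (
            df["tpep_pickup_datetime"] < current + delta
        )
        yield current, df[in_window & (df[column] == zone_id)]
        current += delta


# Synthetic rows, for illustration only.
demo = pd.DataFrame(
    {
        "tpep_pickup_datetime": pd.to_datetime(
            ["2019-01-01 00:02:00", "2019-01-01 00:12:00", "2019-01-01 00:14:00"]
        ),
        "PULocationID": [138, 138, 10],
        "DOLocationID": [79, 140, 138],
    },
    index=pd.Index([1, 2, 3], name="id"),
)

start = datetime(2019, 1, 1)
end = start + timedelta(minutes=30)
pickup_counts = [len(rows) for _, rows in iter_windows(demo, start, end, 10, "pickup")]
drop_counts = [len(rows) for _, rows in iter_windows(demo, start, end, 10, "drop")]
print(pickup_counts, drop_counts)
```

Passing the dataframe in makes the generator easy to test on small frames and to reuse for a day or month of data.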



In [5]:
# Just a demo for drop
for df_filtered in date_iterator(start_time, end_time, 10, "drop"):
    print(df_filtered.head(2))


      tpep_pickup_datetime tpep_dropoff_datetime  passenger_count  \
id                                                                  
11866  2019-01-01 00:04:29   2019-01-01 00:26:02                1   

       PULocationID  DOLocationID  Distance  
id                                           
11866           164           138   13397.4  
     tpep_pickup_datetime tpep_dropoff_datetime  passenger_count  \
id                                                                 
9596  2019-01-01 00:18:09   2019-01-01 00:41:11                1   

      PULocationID  DOLocationID  Distance  
id                                          
9596           161           138   12556.8  
    tpep_pickup_datetime tpep_dropoff_datetime  passenger_count  PULocationID  \
id                                                                              
261  2019-01-01 00:27:06   2019-01-01 00:51:51                1            68   

     DOLocationID  Distance  
id                           
261       

In [7]:
# Just a demo for pickup
for df_filtered in date_iterator(start_time, end_time, 10, "pickup"):
    print(df_filtered.head(2))

     tpep_pickup_datetime tpep_dropoff_datetime  passenger_count  \
id                                                                 
5915  2019-01-01 00:02:51   2019-01-01 00:08:51                1   
2455  2019-01-01 00:06:09   2019-01-01 00:29:46                2   

      PULocationID  DOLocationID  Distance  
id                                          
5915           138            79   15242.4  
2455           138           170   13219.5  
      tpep_pickup_datetime tpep_dropoff_datetime  passenger_count  \
id                                                                  
7296   2019-01-01 00:10:33   2019-01-01 00:26:52                1   
11210  2019-01-01 00:13:38   2019-01-01 00:28:45                1   

       PULocationID  DOLocationID  Distance  
id                                           
7296            138           236   12901.0  
11210           138           140   12094.6  
     tpep_pickup_datetime tpep_dropoff_datetime  passenger_count  \
id                

## Stage 2: Graph Construction

This stage constructs graphs with NetworkX to model the relationships between passengers.
Connected edges represent rides that are merged.

### Input Parameters: 
* Poolsize
* Weight calculating functions as arguments
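
As a rough sketch of how these inputs could fit together: the rides of one pool window become nodes, a weight function passed as an argument decides which pairs get an edge, and a matching picks the pairs to merge. The use of `nx.max_weight_matching`, the helper `build_pool_graph`, the weight function `shared_dropoff_saving` and the ride data are all assumptions for illustration; the project may select merges differently.

```python
from itertools import combinations

import networkx as nx


def build_pool_graph(rides, weight_fn):
    """Build a graph with one node per ride and weighted edges for mergeable pairs.

    rides: dict mapping ride id -> dict of ride attributes
    weight_fn: callable(ride_a, ride_b) -> positive weight, or None if not mergeable
    """
    graph = nx.Graph()
    graph.add_nodes_from(rides)
    for a, b in combinations(rides, 2):
        weight = weight_fn(rides[a], rides[b])
        if weight is not None and weight > 0:
            graph.add_edge(a, b, weight=weight)
    return graph


def shared_dropoff_saving(ride_a, ride_b):
    """Hypothetical weight: distance saved when two rides share a drop-off zone."""
    if ride_a["DOLocationID"] != ride_b["DOLocationID"]:
        return None
    return min(ride_a["Distance"], ride_b["Distance"])


# Synthetic rides picked up in the same pool window, for illustration only.
rides = {
    1: {"DOLocationID": 79, "Distance": 15000.0},
    2: {"DOLocationID": 79, "Distance": 15000.0},
    3: {"DOLocationID": 170, "Distance": 13000.0},
    4: {"DOLocationID": 170, "Distance": 13000.0},
    5: {"DOLocationID": 236, "Distance": 12900.0},
}

graph = build_pool_graph(rides, shared_dropoff_saving)
# Each ride is merged with at most one other ride; pairs are sorted for stable output.
merged = {tuple(sorted(pair)) for pair in nx.max_weight_matching(graph)}
print(merged)
```

Because the weight function is an argument, different merge criteria can be compared without changing the graph-building code.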
 

## Stage 3: Visualization

This stage gathers data from Stage 2 for visualization.

The idea is that Stage 2 calls this method at each iteration. Parameters are TBD.
When this function is called, the merged data and the individual data are collated and stored as a DataFrame or file. This can be used later to build graphs.
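
Since the parameters are still TBD, the following is only a sketch of one possible shape for this collation step: an object that Stage 2 calls once per iteration and that can be turned into a DataFrame afterwards. The class name `PoolingLog`, its columns and the sample numbers are all hypothetical.

```python
from datetime import datetime

import pandas as pd


class PoolingLog:
    """Collects one summary row per Stage 2 iteration (hypothetical schema)."""

    COLUMNS = [
        "window_start",
        "ride_count",
        "merged_pairs",
        "individual_distance",
        "merged_distance",
    ]

    def __init__(self):
        self._rows = []

    def record(self, window_start, ride_count, merged_pairs,
               individual_distance, merged_distance):
        """Store the totals for one pool window."""
        values = (window_start, ride_count, merged_pairs,
                  individual_distance, merged_distance)
        self._rows.append(dict(zip(self.COLUMNS, values)))

    def to_frame(self):
        """Return everything recorded so far as a DataFrame."""
        return pd.DataFrame(self._rows, columns=self.COLUMNS)


# Made-up numbers, for illustration only.
log = PoolingLog()
log.record(datetime(2019, 1, 1, 0, 0), 5, 2, 68900.0, 40900.0)
log.record(datetime(2019, 1, 1, 0, 5), 3, 1, 30000.0, 20000.0)

summary = log.to_frame()
summary["distance_saved"] = summary["individual_distance"] - summary["merged_distance"]
print(summary)
```

The resulting frame can be written out with `summary.to_csv(...)` if a file is preferred over an in-memory DataFrame.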

### Input Parameters: 
* TBD
* TBD