# Understanding Hired Rides in NYC

_[Project prompt](https://docs.google.com/document/d/1VERPjEZcC1XSs4-02aM-DbkNr_yaJVbFjLJxaYQswqA/edit#)_

_This scaffolding notebook may be used to help set up your final project. It's **totally optional** whether you make use of this or not._

_If you do use this notebook, everything provided is optional as well - you may remove or add prose and code as you wish._

_Anything in italics (prose) or comments (in code) is meant to provide you with guidance. **Remove the italic lines and provided comments** before submitting the project, if you choose to use this scaffolding. We don't need the guidance when grading._

_**All code below should be considered "pseudo-code" - not functional by itself, and only a suggestion at the approach.**_

## Requirements

_A checklist of requirements to keep you on track. Remove this whole cell before submitting the project._

* Code clarity: make sure the code conforms to:
    * [ ] [PEP 8](https://peps.python.org/pep-0008/) - You might find [this resource](https://realpython.com/python-pep8/) helpful as well as [this](https://github.com/dnanhkhoa/nb_black) or [this](https://jupyterlab-code-formatter.readthedocs.io/en/latest/) tool
    * [ ] [PEP 257](https://peps.python.org/pep-0257/)
    * [ ] Break each task down into logical functions
* The following files are submitted for the project (see the project's GDoc for more details):
    * [ ] `README.md`
    * [ ] `requirements.txt`
    * [ ] `.gitignore`
    * [ ] `schema.sql`
    * [ ] 6 query files (using the `.sql` extension), appropriately named for the purpose of the query
    * [x] Jupyter Notebook containing the project (this file!)
* [x] You can edit this cell and add an `x` inside the `[ ]` like this task to denote a completed task

## Project Setup

In [1]:
# all import statements needed for the project, for example:
# !pip install matplotlib
# !pip install requests
# !pip install bs4
# !pip install sqlalchemy
# !pip install pandas
# !pip install geojsonio --upgrade
# !pip install geopandas
import math
from math import *
import sqlite3
import sqlalchemy
from sqlalchemy.orm import sessionmaker
import bs4
import matplotlib.pyplot as plt
import pandas as pd
import json
import requests
import sqlalchemy as db
import re
import datetime
import geojsonio
import numpy as np
import geopandas as gpd

In [2]:
# any general notebook setup, like log formatting

In [3]:
# any constants you might need, for example:

TAXI_URL = "https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page"
# add other constants to refer to any local data, e.g. uber & weather
UBER_CSV = "uber_rides_sample.csv"
# weather csv data file
# 2009_weather to 2015_weather (just pick the first 6 months)
csv__= '/Users/morax/Documents/哥大/IEORE4501/IEOR4501 HW/IEOR4501 Project/'
csv09_file = csv__ + '2009_weather.csv'
csv10_file = csv__ + '2010_weather.csv'
csv11_file = csv__ + '2011_weather.csv'
csv12_file = csv__ + '2012_weather.csv'
csv13_file = csv__ + '2013_weather.csv'
csv14_file = csv__ + '2014_weather.csv'
csv15_file = csv__ + '2015_weather.csv'

NEW_YORK_BOX_COORDS = ((40.560445, -74.242330), (40.908524, -73.717047))

DATABASE_URL = "sqlite:///project.db"
DATABASE_SCHEMA_FILE = "schema.sql"
QUERY_DIRECTORY = "queries"

## Part 1: Data Preprocessing

_A checklist of requirements to keep you on track. Remove this whole cell before submitting the project. The order of these tasks isn't necessarily the order in which they need to be done. It's okay to do them in an order that makes sense to you._

* [x] Define a function that calculates the distance between two coordinates in kilometers that **only uses the `math` module** from the standard library.
* [ ] Taxi data:
    * [ ] Use the `re` module, and the packages `requests`, BeautifulSoup (`bs4`), and (optionally) `pandas` to programmatically download the required CSV files & load into memory.
    * You may need to do this one file at a time - download, clean, sample. You can cache the sample by saving it as a CSV file (thereby freeing up memory on your computer) before moving on to the next file.
* [ ] Weather & Uber data:
    * [x] Download the data manually from the link provided in the project doc.
* [ ] All data:
    * [ ] Load the data using `pandas`
    * [ ] Clean the data, including:
        * Remove unnecessary columns
        * Remove invalid data points (take a moment to consider what's invalid)
        * Normalize column names
        * (Taxi & Uber data) Remove trips that start and/or end outside the designated [coordinate box](http://bboxfinder.com/#40.560445,-74.242330,40.908524,-73.717047)
    * [ ] (Taxi data) Sample the data so that you have roughly the same amount of data points over the given date range for both Taxi data and Uber data.
* [ ] Weather data:
    * [ ] Split into two `pandas` DataFrames: one for required hourly data, and one for the required daily data.
    * [ ] You may find that the weather data you need later on does not exist at the frequency needed (daily vs hourly). You may calculate/generate samples from one to populate the other. Just document what you’re doing so we can follow along. 
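
One way to size the taxi sample is to spread the Uber row count evenly across the months in the date range. The sketch below is only an illustration: the helper names `months_between` and `rows_per_month` are hypothetical, and the figure of 200,000 Uber rows is the sum of the weekday counts shown in Part 3 of this notebook. Each monthly taxi file could then be reduced with `DataFrame.sample(n=...)`.

```python
import math


def months_between(start, end):
    """Count the months from start to end, inclusive; both are (year, month) tuples."""
    (start_year, start_month), (end_year, end_month) = start, end
    return (end_year - start_year) * 12 + (end_month - start_month) + 1


def rows_per_month(target_rows, start, end):
    """Rows to sample from each monthly file so the total is at least target_rows."""
    return math.ceil(target_rows / months_between(start, end))


# January 2009 through June 2015, matched against 200,000 Uber rows
n_months = months_between((2009, 1), (2015, 6))
per_month = rows_per_month(200_000, (2009, 1), (2015, 6))
print(n_months, per_month)
```

Rounding up means the taxi sample ends slightly larger than the Uber sample, which matches the "roughly the same amount" wording of the requirement.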

### Calculating distance
To calculate the distance between two points in the Uber data, we use the longitude and latitude of the pickup and drop-off locations. Two functions that rely only on the `math` module are provided: `calculate_distance1()`, an equirectangular (flat-earth) approximation, and `calculate_distance2()`, the haversine formula. More accurate methods exist, but they need more than the `math` module. The taxi data sets already include a distance, but it is given in miles, so `miles_to_km()` converts it to kilometers to allow a direct comparison between taxis and Ubers. Note that the taxi distance is measured along city streets, whereas the Uber distance is a straight-line distance, so the taxi distances are likely to be somewhat longer for comparable trips. The function `add_distance_column()` adds the calculated distance to the Uber data set, building the new column row by row from the pickup and drop-off coordinates.

In [4]:
def calculate_distance1(from_coord, to_coord):
    """Approximate distance in km between two (longitude, latitude) points.

    Uses an equirectangular projection with a 40,000 km Earth circumference.
    """
    mean_lat = math.radians((to_coord[1] + from_coord[1]) / 2)
    # East-west leg: a degree of longitude shrinks by cos(latitude)
    x = (to_coord[0] - from_coord[0]) * 40000 * math.cos(mean_lat) / 360
    # North-south leg
    y = (to_coord[1] - from_coord[1]) * 40000 / 360
    # Pythagoras: x^2 + y^2 = distance^2
    return math.sqrt(x * x + y * y)


def calculate_distance2(from_coord, to_coord):
    """Haversine distance in km between two (longitude, latitude) points."""
    earth_radius_km = 6373.0  # approximate radius of the Earth in km

    # math.radians() converts a degree value into radians
    lon1, lat1 = math.radians(from_coord[0]), math.radians(from_coord[1])
    lon2, lat2 = math.radians(to_coord[0]), math.radians(to_coord[1])
    dlat = lat2 - lat1
    dlon = lon2 - lon1

    a = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    c = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))
    return earth_radius_km * c


# calculate_distance1 is cheaper and accurate enough over city-scale distances;
# calculate_distance2 stays accurate over long distances.
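
To put a number on how the two approaches compare, the sketch below re-implements both formulas with a shared Earth radius and evaluates them on one city-scale trip. The function names, the radius of 6371 km, and the two sample coordinates (roughly Midtown Manhattan and JFK airport) are assumptions made for this illustration, not values from the project data.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def equirectangular_km(from_coord, to_coord):
    """Flat-earth approximation; coordinates are (longitude, latitude) in degrees."""
    lon1, lat1 = map(math.radians, from_coord)
    lon2, lat2 = map(math.radians, to_coord)
    x = (lon2 - lon1) * math.cos((lat1 + lat2) / 2)
    y = lat2 - lat1
    return EARTH_RADIUS_KM * math.hypot(x, y)


def haversine_km(from_coord, to_coord):
    """Great-circle distance; coordinates are (longitude, latitude) in degrees."""
    lon1, lat1 = map(math.radians, from_coord)
    lon2, lat2 = map(math.radians, to_coord)
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


midtown = (-73.9857, 40.7484)  # illustrative point
jfk = (-73.7781, 40.6413)      # illustrative point
flat = equirectangular_km(midtown, jfk)
curved = haversine_km(midtown, jfk)
print(f"equirectangular: {flat:.3f} km, haversine: {curved:.3f} km")
```

Over a distance this short the two results agree closely, so the cheaper approximation is usually adequate inside the NYC bounding box.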

In [5]:
def miles_to_km(distance_miles):
    """Convert miles to kilometers (1 km = 0.62137119 miles)."""
    return distance_miles / 0.62137119

In [6]:
def add_distance_column(dataframe):
    """Add a 'Distance' column (km) from the pickup and drop-off coordinates, in place."""
    dataframe["Distance"] = dataframe.apply(
        lambda x: calculate_distance2((x["pickup_longitude"], x["pickup_latitude"]),
                                      (x["dropoff_longitude"], x["dropoff_latitude"])),
        axis=1)


### Processing Taxi Data

In this section, the taxi data is processed. The first step is to find the download links. Each file is then read in, the monthly datasets are combined into one larger dataset, and the distance column is converted from miles to kilometers.

- **get_taxi_html() and find_taxi_parquet_urls():** These functions use web scraping to find the link for each yellow taxi dataset. The first, get_taxi_html(), returns the HTML content of the web page that lists the taxi data. The second, find_taxi_parquet_urls(), pulls all of the links out of that page and keeps those that point to yellow taxi datasets from January 2009 to June 2015.

- **get_and_clean_month_taxi_data(url):** This function reads in the data for a single month.

- **get_and_clean_taxi_data():** This function concatenates the data for every month into one large dataset.

In [7]:
def get_taxi_html():
    """Return the HTML of the TLC trip-record page, raising on an HTTP error."""
    response = requests.get(TAXI_URL)
    # Fail loudly instead of handing None to the HTML parser
    response.raise_for_status()
    return response.content


def find_taxi_parquet_urls():
    """Return links to yellow taxi parquet files from Jan. 2009 through June 2015."""
    soup = bs4.BeautifulSoup(get_taxi_html(), "html.parser")
    # File names look like yellow_tripdata_2015-01.parquet
    pattern = re.compile(r"yellow_tripdata_(\d{4})-(\d{2})")
    urls = []
    for anchor in soup.find_all("a", href=True):
        match = pattern.search(anchor["href"])
        if match is None:
            continue
        year, month = int(match.group(1)), int(match.group(2))
        # Tuple comparison keeps Jan. 2009 through June 2015, each link once
        if (2009, 1) <= (year, month) <= (2015, 6):
            urls.append(anchor["href"].strip())
    return urls


In [8]:
def get_and_clean_month_taxi_data(url):
    return pd.read_parquet(url,engine='pyarrow')


In [36]:
# print(find_taxi_parquet_urls())
test_parquet= get_and_clean_month_taxi_data('https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2015-01.parquet')
test_parquet.head(10)


In [None]:
# miles_to_km is defined earlier in the notebook; applying it to the whole
# column at once avoids a slow row-by-row apply
test_parquet["trip_distance"] = miles_to_km(test_parquet["trip_distance"])


test_parquet.head(10)

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge,airport_fee
0,1,2015-01-01 00:11:33,2015-01-01 00:16:48,1,1.609344,1,N,41,166,1,5.7,0.5,0.5,1.4,0.0,0.0,8.4,,
1,1,2015-01-01 00:18:24,2015-01-01 00:24:20,1,1.44841,1,N,166,238,3,6.0,0.5,0.5,0.0,0.0,0.0,7.3,,
2,1,2015-01-01 00:26:19,2015-01-01 00:41:06,1,5.632704,1,N,238,162,1,13.2,0.5,0.5,2.9,0.0,0.0,17.4,,
3,1,2015-01-01 00:45:26,2015-01-01 00:53:20,1,3.379622,1,N,162,263,1,8.2,0.5,0.5,2.37,0.0,0.0,11.87,,
4,1,2015-01-01 00:59:21,2015-01-01 01:05:24,1,1.609344,1,N,236,141,3,6.0,0.5,0.5,0.0,0.0,0.0,7.3,,
5,1,2015-01-01 00:07:31,2015-01-01 00:11:32,1,1.287475,1,N,239,238,2,5.0,0.5,0.5,0.0,0.0,0.0,6.3,,
6,1,2015-01-01 00:47:08,2015-01-01 00:54:50,1,1.770278,1,N,238,239,2,7.0,0.5,0.5,0.0,0.0,0.0,8.3,,
7,1,2015-01-01 00:58:04,2015-01-01 01:11:56,1,4.667098,1,N,238,42,1,12.2,0.5,0.5,2.7,0.0,0.0,16.2,,
8,1,2015-01-01 00:29:25,2015-01-01 00:37:25,2,2.092147,1,N,90,125,2,7.0,0.5,0.5,0.0,0.0,0.0,8.3,,
9,1,2015-01-01 00:39:02,2015-01-01 01:02:37,2,6.920179,1,N,125,141,2,18.0,0.5,0.5,0.0,0.0,0.0,19.3,,


In [None]:
def zones_within_bbox():
    """Return the taxi zones whose centroid lies inside NEW_YORK_BOX_COORDS."""
    taxi_zones = gpd.read_file("NYCTaxiZones.geojson").to_crs(4326)

    # Centroids are taken in degrees; geopandas warns about this, but the
    # precision is sufficient for a coarse bounding-box test.
    taxi_zones["lon"] = taxi_zones.centroid.x
    taxi_zones["lat"] = taxi_zones.centroid.y

    # NEW_YORK_BOX_COORDS is ((south, west), (north, east))
    (southlimit, westlimit), (northlimit, eastlimit) = NEW_YORK_BOX_COORDS
    in_box = (
        taxi_zones["lon"].between(westlimit, eastlimit)
        & taxi_zones["lat"].between(southlimit, northlimit)
    )
    return taxi_zones[in_box]


def get_and_clean_taxi_data():
    """Download every monthly taxi file and combine them into one DataFrame."""
    all_taxi_dataframes = []

    for parquet_url in find_taxi_parquet_urls():
        # maybe: first check whether this exact file has already been
        # downloaded and saved before fetching it again
        dataframe = get_and_clean_month_taxi_data(parquet_url)

        # TODO: keep only trips whose pickup and drop-off zones are in the
        # [coordinate box](http://bboxfinder.com/#40.560445,-74.242330,40.908524,-73.717047)

        # Convert the distance column from miles to kilometers
        dataframe["trip_distance"] = miles_to_km(dataframe["trip_distance"])

        # maybe: if the file hasn't been saved, save it so that re-running
        # the function does not download it again
        all_taxi_dataframes.append(dataframe)

    # create one large dataframe with data from every month needed
    return pd.concat(all_taxi_dataframes, ignore_index=True)
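
The caching idea mentioned in the comments above can be sketched with the standard library alone. In this illustration, `cached_path` and `fetch_once` are hypothetical helpers, and `download` is any callable that returns the file's bytes (for example a thin wrapper around `requests.get(url).content`); none of these names come from the project.

```python
import os
import tempfile


def cached_path(url, cache_dir):
    """Map a download URL to a local file path inside cache_dir."""
    return os.path.join(cache_dir, os.path.basename(url))


def fetch_once(url, cache_dir, download):
    """Return the local path for url, calling download(url) only on a cache miss."""
    path = cached_path(url, cache_dir)
    if not os.path.exists(path):
        os.makedirs(cache_dir, exist_ok=True)
        with open(path, "wb") as f:
            f.write(download(url))
    return path


# Demonstration with a fake downloader, so no network access is needed
calls = []


def fake_download(url):
    calls.append(url)
    return b"example bytes"


with tempfile.TemporaryDirectory() as tmp:
    example_url = "https://example.com/trip-data/yellow_tripdata_2015-01.parquet"
    first = fetch_once(example_url, tmp, fake_download)
    second = fetch_once(example_url, tmp, fake_download)
    same_path = first == second
    downloads = len(calls)
print(same_path, downloads)
```

The returned path can be passed to `pd.read_parquet` in place of the URL, so a re-run of the notebook reads from disk instead of downloading again.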

In [None]:
zones_within_bbox()




Unnamed: 0,shape_area,objectid,shape_leng,location_id,zone,borough,geometry,lon,lat
0,0.0007823067885,1,0.116357453189,1,Newark Airport,EWR,"MULTIPOLYGON (((-74.18445 40.69500, -74.18449 ...",-74.174000,40.691831
1,0.00486634037837,2,0.43346966679,2,Jamaica Bay,Queens,"MULTIPOLYGON (((-73.82338 40.63899, -73.82277 ...",-73.831299,40.616745
2,0.000314414156821,3,0.0843411059012,3,Allerton/Pelham Gardens,Bronx,"MULTIPOLYGON (((-73.84793 40.87134, -73.84725 ...",-73.847422,40.864474
3,0.000111871946192,4,0.0435665270921,4,Alphabet City,Manhattan,"MULTIPOLYGON (((-73.97177 40.72582, -73.97179 ...",-73.976968,40.723752
5,0.000606460984581,6,0.150490542523,6,Arrochar/Fort Wadsworth,Staten Island,"MULTIPOLYGON (((-74.06367 40.60220, -74.06351 ...",-74.071771,40.600324
...,...,...,...,...,...,...,...,...,...
258,0.000168611097013,256,0.0679149669603,256,Williamsburg (South Side),Brooklyn,"MULTIPOLYGON (((-73.95834 40.71331, -73.95681 ...",-73.959905,40.710880
259,0.000394552487366,259,0.126750305191,259,Woodlawn/Wakefield,Bronx,"MULTIPOLYGON (((-73.85107 40.91037, -73.85207 ...",-73.852215,40.897932
260,0.000422345326907,260,0.133514154636,260,Woodside,Queens,"MULTIPOLYGON (((-73.90175 40.76078, -73.90147 ...",-73.906306,40.744235
261,0.0000343423231652,261,0.0271204563616,261,World Trade Center,Manhattan,"MULTIPOLYGON (((-74.01333 40.70503, -74.01327 ...",-74.013023,40.709139
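
Because the taxi files shown above identify locations by zone ID rather than by coordinates, the bounding-box filter can be applied by keeping trips whose pickup and drop-off IDs both appear in the zone table. The sketch below assumes the zone table exposes its IDs in a `location_id` column (as in the output above) and that they may need converting to integers. The helper name `filter_trips_by_zone` and the tiny DataFrames are made up for illustration.

```python
import pandas as pd


def filter_trips_by_zone(trips, zones):
    """Keep trips whose pickup and drop-off zone IDs both appear in zones."""
    valid_ids = set(zones["location_id"].astype(int))
    in_box = trips["PULocationID"].isin(valid_ids) & trips["DOLocationID"].isin(valid_ids)
    return trips[in_box]


# Made-up example data: zone 999 is not in the zone table
zones = pd.DataFrame({"location_id": ["4", "41", "166"]})
trips = pd.DataFrame({
    "PULocationID": [41, 166, 999],
    "DOLocationID": [166, 4, 41],
    "trip_distance": [1.6, 1.4, 5.6],
})
kept = filter_trips_by_zone(trips, zones)
print(kept)
```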


### Processing Uber Data

This portion of the project uses two functions, load_and_clean_uber_data() and get_uber_data(). The function load_and_clean_uber_data() reads the Uber data from a CSV file and returns a dataframe. The function get_uber_data() calls it to read in the data, then uses the previously defined add_distance_column() to add the distance in kilometers of each Uber trip, and returns the result as a dataframe. In addition, pickup_datetime is converted to a datetime object, a column is created for the day of the week on which the pickup occurred, and trips that start or end outside the coordinate box are removed.

In [None]:
def load_and_clean_uber_data(csv_file):
    return pd.read_csv(csv_file)

In [None]:
def get_uber_data():
    """Load the Uber sample, keep trips inside the NYC box, and add derived columns."""
    uber_dataframe = load_and_clean_uber_data(UBER_CSV)

    # Remove trips that start or end outside the
    # [coordinate box](http://bboxfinder.com/#40.560445,-74.242330,40.908524,-73.717047)
    (southlimit, westlimit), (northlimit, eastlimit) = NEW_YORK_BOX_COORDS
    in_box = (
        uber_dataframe["pickup_longitude"].between(westlimit, eastlimit)
        & uber_dataframe["pickup_latitude"].between(southlimit, northlimit)
        & uber_dataframe["dropoff_longitude"].between(westlimit, eastlimit)
        & uber_dataframe["dropoff_latitude"].between(southlimit, northlimit)
    )
    uber_dataframe = uber_dataframe[in_box].copy()

    add_distance_column(uber_dataframe)
    # Make pickup_datetime a datetime object and derive the ISO weekday (1 = Monday)
    uber_dataframe["pickup_datetime"] = uber_dataframe["pickup_datetime"].apply(
        lambda x: datetime.datetime.strptime(x, "%Y-%m-%d %H:%M:%S %Z"))
    uber_dataframe["day_of_week"] = uber_dataframe["pickup_datetime"].apply(
        lambda x: x.isoweekday())
    return uber_dataframe

### Processing Weather Data

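In this section, the yearly weather CSV files are read in and split into two DataFrames. The function clean_month_weather_data_hourly() keeps the last reading of each hour, clean_month_weather_data_daily() keeps the last reading of each day, and load_and_clean_weather_data() combines the files for 2009 through 2014 with the first six months of 2015.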

In [9]:
def clean_month_weather_data_hourly(csv_file):
    """Return one weather row per hour, keeping the last reading of each hour."""
    hourly_data = pd.read_csv(csv_file)

    timestamps = pd.to_datetime(hourly_data["DATE"], format="%Y-%m-%dT%H:%M:%S")
    hourly_data["hours"] = timestamps.dt.hour
    # Calendar date (YYYY-MM-DD), set column-wise to avoid chained assignment
    hourly_data["newDATE"] = hourly_data["DATE"].str[:10]

    # One row per (date, hour): 24 rows for each day
    return hourly_data.drop_duplicates(subset=["hours", "newDATE"], keep="last")


In [10]:
def clean_month_weather_data_daily(csv_file):
    """Return one weather row per day, keeping the last reading of each day."""
    daily_data = pd.read_csv(csv_file)

    timestamps = pd.to_datetime(daily_data["DATE"], format="%Y-%m-%dT%H:%M:%S")
    daily_data["days"] = timestamps.dt.day
    # Year and month (YYYY-MM), set column-wise to avoid chained assignment
    daily_data["newDATE"] = daily_data["DATE"].str[:7]

    # One row per (month, day): 28 to 31 rows for each month
    return daily_data.drop_duplicates(subset=["days", "newDATE"], keep="last")
    

In [11]:
def load_and_clean_weather_data():
    """Return (hourly, daily) weather DataFrames for Jan. 2009 through June 2015."""
    hourly_dataframes = []
    daily_dataframes = []

    # File paths are the constants defined in the project setup
    weather_csv_files = [csv09_file, csv10_file, csv11_file, csv12_file,
                         csv13_file, csv14_file, csv15_file]

    for csv_file in weather_csv_files:
        hourly_dataframes.append(clean_month_weather_data_hourly(csv_file))
        daily_dataframes.append(clean_month_weather_data_daily(csv_file))

    # create two dataframes with hourly & daily data from every month
    hourly_data = pd.concat(hourly_dataframes).reset_index(drop=True)
    daily_data = pd.concat(daily_dataframes).reset_index(drop=True)

    # Keep only data through June 2015; newDATE is an ISO-formatted string,
    # so string comparison orders the rows chronologically
    hourly_data = hourly_data[hourly_data["newDATE"] < "2015-07"].reset_index(drop=True)
    daily_data = daily_data[daily_data["newDATE"] < "2015-07"].reset_index(drop=True)

    return hourly_data, daily_data
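
Where a daily value is missing, it can be derived from the hourly readings, as the requirements above allow. The sketch below aggregates made-up hourly rows into daily totals and means. The column names `HourlyPrecipitation` and `HourlyWindSpeed` are assumptions about the weather files, and non-numeric entries are coerced to missing values with `pd.to_numeric` before aggregating.

```python
import pandas as pd

# Made-up hourly readings; "T" stands in for a non-numeric entry
hourly = pd.DataFrame({
    "DATE": ["2009-01-01T00:51:00", "2009-01-01T01:51:00",
             "2009-01-02T00:51:00", "2009-01-02T01:51:00"],
    "HourlyPrecipitation": ["0.10", "T", "0.00", "0.25"],
    "HourlyWindSpeed": [10, 14, 6, 8],
})

hourly["DATE"] = pd.to_datetime(hourly["DATE"])
# Anything that is not a number becomes NaN and is skipped by sum()
hourly["HourlyPrecipitation"] = pd.to_numeric(hourly["HourlyPrecipitation"], errors="coerce")

# Group by calendar day: total precipitation and mean wind speed
daily = (
    hourly.groupby(hourly["DATE"].dt.date)
    .agg(DailyPrecipitation=("HourlyPrecipitation", "sum"),
         DailyAverageWindSpeed=("HourlyWindSpeed", "mean"))
)
print(daily)
```

Whether a sum, a mean, or a maximum is the right aggregate depends on the measurement, so the choice should be documented alongside the code.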

### Process All Data

Once all of the functions needed to process the data have been written, each of them can be executed. Doing so provides four clean data sets: taxi_data, uber_data, hourly_weather_data, and daily_weather_data.

In [25]:
#taxi_data = get_and_clean_taxi_data()
# uber_data = get_uber_data()
hourly_weather_data, daily_weather_data = load_and_clean_weather_data()


In [None]:
uber_data.head(10)

In [None]:
taxi_data.head(10)

In [26]:
hourly_weather_data.head(10)
daily_weather_data.head(10)

Unnamed: 0,STATION,DATE,LATITUDE,LONGITUDE,ELEVATION,NAME,REPORT_TYPE,SOURCE,HourlyAltimeterSetting,HourlyDewPointTemperature,...,BackupDistanceUnit,BackupElements,BackupElevation,BackupEquipment,BackupLatitude,BackupLongitude,BackupName,WindEquipmentChangeDate,days,newDATE
0,72505394728,2009-01-01T23:51:00,40.77898,-73.96925,42.7,"NY CITY CENTRAL PARK, NY US",AUTO,4,30.22,1.0,...,mi,SNOW,,SNOWBOARD,,,CENTRAL PARK ZOO,2006-09-18,1,2009-01
1,72505394728,2009-01-02T23:59:00,40.77898,-73.96925,42.7,"NY CITY CENTRAL PARK, NY US",SOD,O,,,...,mi,SNOW,,SNOWBOARD,,,CENTRAL PARK ZOO,2006-09-18,2,2009-01
2,72505394728,2009-01-03T23:51:00,40.77898,-73.96925,42.7,"NY CITY CENTRAL PARK, NY US",AUTO,4,30.08,9.0,...,mi,SNOW,,SNOWBOARD,,,CENTRAL PARK ZOO,2006-09-18,3,2009-01
3,72505394728,2009-01-04T23:51:00,40.77898,-73.96925,42.7,"NY CITY CENTRAL PARK, NY US",AUTO,4,30.0,12.0,...,mi,SNOW,,SNOWBOARD,,,CENTRAL PARK ZOO,2006-09-18,4,2009-01
4,72505394728,2009-01-05T23:51:00,40.77898,-73.96925,42.7,"NY CITY CENTRAL PARK, NY US",AUTO,4,30.03,3.0,...,mi,SNOW,,SNOWBOARD,,,CENTRAL PARK ZOO,2006-09-18,5,2009-01
5,72505394728,2009-01-06T23:59:00,40.77898,-73.96925,42.7,"NY CITY CENTRAL PARK, NY US",SOD,O,,,...,mi,SNOW,,SNOWBOARD,,,CENTRAL PARK ZOO,2006-09-18,6,2009-01
6,72505394728,2009-01-07T23:59:00,40.77898,-73.96925,42.7,"NY CITY CENTRAL PARK, NY US",SOD,O,,,...,mi,SNOW,,SNOWBOARD,,,CENTRAL PARK ZOO,2006-09-18,7,2009-01
7,72505394728,2009-01-08T23:51:00,40.77898,-73.96925,42.7,"NY CITY CENTRAL PARK, NY US",AUTO,4,29.73,16.0,...,mi,SNOW,,SNOWBOARD,,,CENTRAL PARK ZOO,2006-09-18,8,2009-01
8,72505394728,2009-01-09T23:51:00,40.77898,-73.96925,42.7,"NY CITY CENTRAL PARK, NY US",AUTO,4,30.26,12.0,...,mi,SNOW,,SNOWBOARD,,,CENTRAL PARK ZOO,2006-09-18,9,2009-01
9,72505394728,2009-01-10T23:59:00,40.77898,-73.96925,42.7,"NY CITY CENTRAL PARK, NY US",SOD,O,,,...,mi,SNOW,,SNOWBOARD,,,CENTRAL PARK ZOO,2006-09-18,10,2009-01


## Part 2: Storing Cleaned Data

_Write some prose that tells the reader what you're about to do here._

In [15]:
engine = db.create_engine(DATABASE_URL)
# First, using SQLAlchemy, create a SQLite database with which you’ll load in your preprocessed datasets.

In [27]:
# if using SQL (as opposed to SQLAlchemy), define the commands 
# to create your 4 tables/dataframes
HOURLY_WEATHER_SCHEMA = """
CREATE TABLE IF NOT EXISTS hourly_weather
    (id INTEGER PRIMARY KEY,
    DATE DATE,
    LATITUDE REAL,
    LONGITUDE REAL,
    NAME TEXT,
    BackupName TEXT,
    hours INTEGER,
    newDATE DATE);
"""

DAILY_WEATHER_SCHEMA = """
CREATE TABLE IF NOT EXISTS daily_weather
    (id INTEGER PRIMARY KEY,
    DATE DATE,
    LATITUDE REAL,
    LONGITUDE REAL,
    NAME TEXT,
    BackupName TEXT,
    days INTEGER,
    newDATE DATE);
"""

TAXI_TRIPS_SCHEMA = """
TODO
"""

UBER_TRIPS_SCHEMA = """
TODO
"""

In [28]:
# create that required schema.sql file
with open(DATABASE_SCHEMA_FILE, "w") as f:
    f.write(HOURLY_WEATHER_SCHEMA)
    f.write(DAILY_WEATHER_SCHEMA)
    f.write(TAXI_TRIPS_SCHEMA)
    f.write(UBER_TRIPS_SCHEMA)

In [29]:
# create the tables from the schema statements
# engine.begin() opens one connection and commits when the block exits
with engine.begin() as connection:
    connection.execute(db.text(HOURLY_WEATHER_SCHEMA))
    connection.execute(db.text(DAILY_WEATHER_SCHEMA))

- `pd.read_sql_query`: read data from querying a SQL table
- `pd.read_sql_table`: read an entire SQL table
- `df.to_sql`: add data from the dataframe to a SQL table
- `pd.to_numeric`: convert an argument to a numeric type
- `pd.concat`: concatenate pandas objects along a particular axis, with optional set logic along the other axes
- `pd.merge`: merge DataFrame or named Series objects with a database-style join
- `pd.merge_asof`: perform a merge by key distance. This is similar to a left join, except that rows are matched on the nearest key rather than on equal keys. Both DataFrames must be sorted by the key.
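
As a quick illustration of how a few of these functions fit together, the sketch below writes a small DataFrame to an in-memory SQLite database and reads an aggregate back. The table name `demo_trips` and its columns are made up for the example.

```python
import sqlite3

import pandas as pd

# Made-up data standing in for a cleaned dataset
trips = pd.DataFrame({"day_of_week": [1, 1, 5], "distance": [2.0, 4.0, 3.0]})

connection = sqlite3.connect(":memory:")
# df.to_sql: add data from the DataFrame to a SQL table
trips.to_sql("demo_trips", connection, index=False, if_exists="append")

# pd.read_sql_query: read the result of a query into a DataFrame
summary = pd.read_sql_query(
    "SELECT day_of_week, COUNT(*) AS trips, AVG(distance) AS avg_distance "
    "FROM demo_trips GROUP BY day_of_week ORDER BY day_of_week",
    connection,
)
connection.close()
print(summary)
```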

### Add Data to Database

_**TODO:** Write some prose that tells the reader what you're about to do here._

In [30]:
def write_dataframes_to_table(table_to_df_dict):
    """Append each DataFrame to the table named by its key, using only the table's columns."""
    inspector = db.inspect(engine)
    for table_name, dataframe in table_to_df_dict.items():
        # The schema lists only some of the DataFrame's columns, so select those
        table_columns = [column["name"] for column in inspector.get_columns(table_name)]
        shared_columns = [name for name in table_columns if name in dataframe.columns]
        dataframe[shared_columns].to_sql(table_name, con=engine, if_exists="append", index=False)

In [31]:
map_table_name_to_dataframe = {
    # "taxi_trips": taxi_data,
    # "uber_trips": uber_data,
    "hourly_weather": hourly_weather_data,
    "daily_weather": daily_weather_data,
}

In [32]:
write_dataframes_to_table(map_table_name_to_dataframe)

## Part 3: Understanding the Data

_A checklist of requirements to keep you on track. Remove this whole cell before submitting the project. The order of these tasks isn't necessarily the order in which they need to be done. It's okay to do them in an order that makes sense to you._

* [ ] For 01-2009 through 06-2015, what hour of the day was the most popular to take a yellow taxi? The result should have 24 bins.
* [ ] For the same time frame, what day of the week was the most popular to take an uber? The result should have 7 bins.
* [ ] What is the 95th percentile of distance traveled for all hired trips during July 2013?
* [ ] What were the top 10 days with the highest number of hired rides for 2009, and what was the average distance for each day?
* [ ] Which 10 days in 2014 were the windiest, and how many hired trips were made on those days?
* [ ] During Hurricane Sandy in NYC (Oct 29-30, 2012) and the week leading up to it, how many trips were taken each hour, and for each hour, how much precipitation did NYC receive and what was the sustained wind speed?
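
The first question can be answered with a `GROUP BY` on the pickup hour. The sketch below runs against a throwaway in-memory database. The table name `taxi_trips` matches the mapping used earlier in this notebook, while the `pickup_datetime` column and the sample rows are assumptions made for illustration.

```python
import sqlite3

QUERY_POPULAR_HOUR = """
SELECT strftime('%H', pickup_datetime) AS pickup_hour,
       COUNT(*) AS trip_count
FROM taxi_trips
GROUP BY pickup_hour
ORDER BY trip_count DESC;
"""

connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE taxi_trips (id INTEGER PRIMARY KEY, pickup_datetime TEXT)")
# Made-up rows: two pickups just after midnight, one in the evening
connection.executemany(
    "INSERT INTO taxi_trips (pickup_datetime) VALUES (?)",
    [("2015-01-01 00:11:33",), ("2015-01-01 00:18:24",), ("2015-01-01 19:05:00",)],
)
rows = connection.execute(QUERY_POPULAR_HOUR).fetchall()
connection.close()
print(rows)
```

On the full table the result has at most 24 rows, one per hour of the day.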

In [None]:
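import os


def write_query_to_file(query, outfile):
    """Write a SQL query string to `outfile` inside QUERY_DIRECTORY."""
    os.makedirs(QUERY_DIRECTORY, exist_ok=True)
    with open(os.path.join(QUERY_DIRECTORY, outfile), "w") as f:
        f.write(query)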
    

In [None]:
def uber_popular_day_of_week(df):
    """Count Uber trips per ISO weekday (1 = Monday, 7 = Sunday), most popular first."""
    return df.groupby("day_of_week").size().sort_values(ascending=False)


uber_popular_day_of_week(uber_data)

day_of_week
5    30880
6    30251
4    30021
3    29037
2    28127
7    26441
1    25243
dtype: int64

In [None]:
def taxi_most_pickup_hour(df):
    """Count taxi trips per pickup hour (0-23), most popular first."""
    # pd.to_datetime accepts both strings and values that are already datetimes
    pickup_hour = pd.to_datetime(df["tpep_pickup_datetime"]).dt.hour.rename("pickup_hour")
    # value_counts sorts by count in descending order
    return pickup_hour.value_counts()


# requires taxi_data, so get_and_clean_taxi_data() must have been run first
taxi_most_pickup_hour(taxi_data)

NameError: name 'taxi_data' is not defined

### Query N

_**TODO:** Write some prose that tells the reader what you're about to do here._

_Repeat for each query_

In [None]:
QUERY_N = """
TODO
"""

In [None]:
with engine.connect() as connection:
    print(connection.execute(db.text(QUERY_N)).fetchall())

In [None]:
write_query_to_file(QUERY_N, "some_descriptive_name.sql")

## Part 4: Visualizing the Data

_A checklist of requirements to keep you on track. Remove this whole cell before submitting the project. The order of these tasks isn't necessarily the order in which they need to be done. It's okay to do them in an order that makes sense to you._

* [ ] Create an appropriate visualization for the first query/question in part 3
* [ ] Create a visualization that shows the average distance traveled per month (regardless of year - so group by each month). Include the 90% confidence interval around the mean in the visualization
* [ ] Define three lat/long coordinate boxes around the three major New York airports: LGA, JFK, and EWR (you can use bboxfinder to help). Create a visualization that compares what day of the week was most popular for drop offs for each airport.
* [ ] Create a heatmap of all hired trips over a map of the area. Consider using KeplerGL or another library that helps generate geospatial visualizations.
* [ ] Create a scatter plot that compares tip amount versus distance.
* [ ] Create another scatter plot that compares tip amount versus precipitation amount.

_Be sure these cells are executed so that the visualizations are rendered when the notebook is submitted._

### Visualization: For 01-2009 through 06-2015, what hour of the day was the most popular to take a yellow taxi? 
_**TODO:** Write some prose that tells the reader what you're about to do here._


In [None]:
def plot_visual_hour_of_day(dataframe):
    figure, axes = plt.subplots(figsize=(20, 10))
    
    values = "..."  # use the dataframe to pull out values needed to plot
    
    # you may want to use matplotlib to plot your visualizations;
    # there are also many other plot types (other 
    # than axes.plot) you can use
    axes.plot(values, "...")
    # there are other methods to use to label your axes, to style 
    # and set up axes labels, etc
    axes.set_title("Some Descriptive Title")
    
    plt.show()

In [None]:
some_dataframe = get_data_for_visual_n()
plot_visual_n(some_dataframe)

### Visualization: Define three lat/long coordinate boxes around the three major New York airports: LGA, JFK, and EWR. Compare which day of the week was most popular for drop-offs at each airport.
_**TODO:** Write some prose that tells the reader what you're about to do here._


EWR (Newark): -74.195995,40.664103,-74.148445,40.713045

JFK: -73.832496,40.618362,-73.744262,40.669421

LGA (LaGuardia): -73.892010,40.764638,-73.852357,40.787711

In [None]:
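# Bounding boxes as (west, south, east, north), from the coordinates listed above
AIRPORT_BOXES = {
    "EWR": (-74.195995, 40.664103, -74.148445, 40.713045),
    "JFK": (-73.832496, 40.618362, -73.744262, 40.669421),
    "LGA": (-73.892010, 40.764638, -73.852357, 40.787711),
}


def get_zone(lon, lat, airport_boxes=AIRPORT_BOXES):
    """Return the name of the airport whose box contains (lon, lat), or None."""
    for name, (west, south, east, north) in airport_boxes.items():
        if west <= lon <= east and south <= lat <= north:
            return name
    return None


# Only the Uber data is used here: the taxi parquet files shown earlier
# identify drop-offs by zone ID, not by coordinates
uber_data["dropoff_zone"] = uber_data.apply(
    lambda x: get_zone(x["dropoff_longitude"], x["dropoff_latitude"]), axis=1)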
        

In [None]:
def plot_visual_airport_dropoffs(dataframe):
    figure, axes = plt.subplots(figsize=(20, 10))

    values = "..."  # use the dataframe to pull out values needed to plot

    # you may want to use matplotlib to plot your visualizations;
    # there are also many other plot types (other
    # than axes.plot) you can use
    axes.plot(values, "...")
    # there are other methods to use to label your axes, to style
    # and set up axes labels, etc
    axes.set_title("Some Descriptive Title")

    plt.show()

In [None]:
some_dataframe = get_data_for_visual_n()
plot_visual_airport_dropoffs(some_dataframe)
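
As a rough sketch, the airport comparison could also be done without a row-by-row `apply`, by testing each bounding box against whole columns. This assumes the boxes listed above are ordered as (min longitude, min latitude, max longitude, max latitude), and that the dataframe has a `dropoff_datetime` column alongside `dropoff_longitude` and `dropoff_latitude`. The `dropoff_datetime` column name, the function names, and the sample rows are illustrative assumptions.

```python
import pandas as pd

# (min_lon, min_lat, max_lon, max_lat), copied from the boxes listed above
AIRPORT_BOXES = {
    "EWR": (-74.195995, 40.664103, -74.148445, 40.713045),
    "JFK": (-73.832496, 40.618362, -73.744262, 40.669421),
    "LGA": (-73.892010, 40.764638, -73.852357, 40.787711),
}


def label_airport_dropoffs(dataframe):
    """Return a copy with a `dropoff_zone` column (airport code or None)."""
    labeled = dataframe.copy()
    labeled["dropoff_zone"] = None
    for airport, (min_lon, min_lat, max_lon, max_lat) in AIRPORT_BOXES.items():
        in_lon = labeled["dropoff_longitude"].between(min_lon, max_lon)
        in_lat = labeled["dropoff_latitude"].between(min_lat, max_lat)
        labeled.loc[in_lon & in_lat, "dropoff_zone"] = airport
    return labeled


def count_dropoffs_by_weekday(labeled):
    """Return drop-off counts with weekdays as rows and airports as columns."""
    at_airport = labeled.dropna(subset=["dropoff_zone"])
    # `dropoff_datetime` is an assumed column name
    weekday = pd.to_datetime(at_airport["dropoff_datetime"]).dt.day_name()
    return pd.crosstab(weekday, at_airport["dropoff_zone"])


# made-up sample rows, for illustration only
sample_trips = pd.DataFrame(
    {
        "dropoff_longitude": [-73.78, -73.87, -73.98, -73.79],
        "dropoff_latitude": [40.64, 40.77, 40.75, 40.65],
        "dropoff_datetime": [
            "2015-01-05 10:00:00",
            "2015-01-05 12:00:00",
            "2015-01-05 13:00:00",
            "2015-01-06 09:00:00",
        ],
    }
)
labeled_trips = label_airport_dropoffs(sample_trips)
weekday_counts = count_dropoffs_by_weekday(labeled_trips)
print(weekday_counts)
```

The resulting table can be passed straight to a grouped bar chart, for example with `weekday_counts.plot(kind="bar")`, and `weekday_counts.idxmax()` gives the busiest weekday for each airport.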

### Visualization:  A heatmap of all hired trips over a map of the area

_**TODO:** Write some prose that tells the reader what you're about to do here._

In [None]:
def plot_visual_trip_heatmap(dataframe):
    figure, axes = plt.subplots(figsize=(20, 10))

    values = "..."  # use the dataframe to pull out values needed to plot

    # you may want to use matplotlib to plot your visualizations;
    # there are also many other plot types (other
    # than axes.plot) you can use
    axes.plot(values, "...")
    # there are other methods to use to label your axes, to style
    # and set up axes labels, etc
    axes.set_title("Some Descriptive Title")

    plt.show()

In [None]:
some_dataframe = get_data_for_visual_n()
plot_visual_trip_heatmap(some_dataframe)
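
One possible starting point for the heatmap is to bin the coordinates with NumPy and draw the counts with matplotlib. Note that this sketch draws density on plain longitude/latitude axes, not over a basemap; KeplerGL or another geospatial library would still be needed to overlay it on a map of the area. The column names `pickup_longitude` and `pickup_latitude`, the function names, and the random sample points are all assumptions made for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def bin_pickups(dataframe, bins=50):
    """Count pickups in a bins-by-bins grid of longitude/latitude cells."""
    # `pickup_longitude` / `pickup_latitude` are assumed column names
    return np.histogram2d(
        dataframe["pickup_longitude"].to_numpy(),
        dataframe["pickup_latitude"].to_numpy(),
        bins=bins,
    )


def plot_pickup_density(counts, lon_edges, lat_edges):
    """Draw the binned counts as a heatmap on plain lon/lat axes."""
    figure, axes = plt.subplots(figsize=(10, 10))
    image = axes.imshow(
        counts.T,  # histogram2d indexes [lon, lat]; imshow wants rows = lat
        origin="lower",
        extent=[lon_edges[0], lon_edges[-1], lat_edges[0], lat_edges[-1]],
        aspect="auto",
        cmap="hot",
    )
    figure.colorbar(image, ax=axes, label="Number of trips")
    axes.set_xlabel("Longitude")
    axes.set_ylabel("Latitude")
    axes.set_title("Pickup density")
    return axes


# made-up points scattered around midtown Manhattan, for illustration only
rng = np.random.default_rng(0)
sample_trips = pd.DataFrame(
    {
        "pickup_longitude": rng.normal(-73.98, 0.02, size=1000),
        "pickup_latitude": rng.normal(40.75, 0.02, size=1000),
    }
)
counts, lon_edges, lat_edges = bin_pickups(sample_trips)
plot_pickup_density(counts, lon_edges, lat_edges)
plt.show()
```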

### Visualization: The average distance traveled per month (regardless of year - so group by each month). Include the 90% confidence interval around the mean in the visualization

_**TODO:** Write some prose that tells the reader what you're about to do here._

_Repeat for each visualization._

_The examples in this notebook make use of the `matplotlib` library. There are other libraries, including `pandas`' built-in plotting, KeplerGL for geospatial data representation, `seaborn`, and others._
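
A sketch of one way to get the monthly mean distance with a 90% confidence interval. It uses the normal approximation (mean ± 1.645 × standard error), which is reasonable when each month has many trips; a t-based interval would be the alternative for small samples. The column names `pickup_datetime` and `distance`, the function names, and the sample rows are assumptions for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd

Z_90 = 1.645  # two-sided 90% critical value of the standard normal


def monthly_distance_summary(dataframe):
    """Return mean distance and 90% CI half-width for each calendar month."""
    # group by month number only, so every January is pooled across years
    months = pd.to_datetime(dataframe["pickup_datetime"]).dt.month
    summary = dataframe.groupby(months)["distance"].agg(["mean", "sem"])
    summary["ci_half_width"] = Z_90 * summary["sem"]
    summary.index.name = "month"
    return summary


def plot_monthly_distance(summary):
    """Plot the monthly means with error bars showing the 90% CI."""
    figure, axes = plt.subplots(figsize=(20, 10))
    axes.errorbar(
        summary.index,
        summary["mean"],
        yerr=summary["ci_half_width"],
        fmt="o-",
        capsize=4,
    )
    axes.set_xlabel("Month")
    axes.set_ylabel("Average distance")
    axes.set_title("Average trip distance per month (90% confidence interval)")
    return axes


# made-up sample rows, for illustration only
sample_trips = pd.DataFrame(
    {
        "pickup_datetime": [
            "2010-01-03",
            "2012-01-20",
            "2009-02-01",
            "2011-02-14",
            "2013-02-28",
        ],
        "distance": [1.0, 3.0, 2.0, 2.0, 5.0],
    }
)
monthly_summary = monthly_distance_summary(sample_trips)
plot_monthly_distance(monthly_summary)
plt.show()
```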

### Visualization: A scatter plot that compares tip amount versus distance
_**TODO:** Write some prose that tells the reader what you're about to do here._

In [None]:
def plot_visual_tip_vs_distance(dataframe):
    figure, axes = plt.subplots(figsize=(20, 10))

    values = "..."  # use the dataframe to pull out values needed to plot

    # you may want to use matplotlib to plot your visualizations;
    # there are also many other plot types (other
    # than axes.plot) you can use
    axes.plot(values, "...")
    # there are other methods to use to label your axes, to style
    # and set up axes labels, etc
    axes.set_title("Some Descriptive Title")

    plt.show()

In [None]:
some_dataframe = get_data_for_visual_n()
plot_visual_tip_vs_distance(some_dataframe)
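
A minimal sketch of the tip-versus-distance scatter plot. Because plotting every trip would be slow and heavily overplotted, it draws a random sample and uses small, semi-transparent markers. The column names `distance` and `tip_amount`, the function name, the `max_points` default, and the sample rows are assumptions for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd


def scatter_tip_vs_distance(dataframe, max_points=5000):
    """Scatter tip amount against trip distance for a random sample of rows."""
    # cap the number of points drawn; random_state keeps the sample repeatable
    sample = dataframe.sample(n=min(len(dataframe), max_points), random_state=0)
    figure, axes = plt.subplots(figsize=(20, 10))
    axes.scatter(sample["distance"], sample["tip_amount"], s=10, alpha=0.3)
    axes.set_xlabel("Trip distance")
    axes.set_ylabel("Tip amount")
    axes.set_title("Tip amount versus trip distance")
    return axes


# made-up sample rows, for illustration only
sample_trips = pd.DataFrame(
    {"distance": [0.8, 2.5, 5.1, 11.0], "tip_amount": [1.0, 2.0, 0.0, 6.5]}
)
tip_axes = scatter_tip_vs_distance(sample_trips)
plt.show()
```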

### Visualization: A scatter plot that compares tip amount versus precipitation amount
_**TODO:** Write some prose that tells the reader what you're about to do here._

In [None]:
def plot_visual_tip_vs_precipitation(dataframe):
    figure, axes = plt.subplots(figsize=(20, 10))

    values = "..."  # use the dataframe to pull out values needed to plot

    # you may want to use matplotlib to plot your visualizations;
    # there are also many other plot types (other
    # than axes.plot) you can use
    axes.plot(values, "...")
    # there are other methods to use to label your axes, to style
    # and set up axes labels, etc
    axes.set_title("Some Descriptive Title")

    plt.show()

In [None]:
some_dataframe = get_data_for_visual_n()
plot_visual_tip_vs_precipitation(some_dataframe)
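
Comparing tips with precipitation requires putting the two datasets on a common key first. The sketch below joins trips to weather by calendar day, assuming the weather data has already been aggregated to one row per day. The column names (`pickup_datetime`, `tip_amount`, `date`, `precipitation`), the function name, and the sample rows are assumptions for illustration; an hourly join would follow the same pattern with a different key.

```python
import pandas as pd


def join_tips_with_precipitation(trips, daily_weather):
    """Attach each trip's same-day precipitation total to its tip amount."""
    # normalize() drops the time of day so both sides share a date key
    trips = trips.assign(
        date=pd.to_datetime(trips["pickup_datetime"]).dt.normalize()
    )
    weather = daily_weather.assign(
        date=pd.to_datetime(daily_weather["date"]).dt.normalize()
    )
    # the inner join drops trips on days with no weather record
    merged = trips.merge(weather[["date", "precipitation"]], on="date", how="inner")
    return merged[["date", "tip_amount", "precipitation"]]


# made-up sample rows, for illustration only
sample_trips = pd.DataFrame(
    {
        "pickup_datetime": [
            "2015-01-05 10:00:00",
            "2015-01-06 09:30:00",
            "2015-01-09 22:10:00",
        ],
        "tip_amount": [2.0, 0.0, 3.5],
    }
)
sample_weather = pd.DataFrame(
    {"date": ["2015-01-05", "2015-01-06"], "precipitation": [0.0, 0.42]}
)
tips_and_rain = join_tips_with_precipitation(sample_trips, sample_weather)
print(tips_and_rain)
```

The joined dataframe can then be drawn with `axes.scatter(tips_and_rain["precipitation"], tips_and_rain["tip_amount"])`.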

### Visualization N

_**TODO:** Write some prose that tells the reader what you're about to do here._

_Repeat for each visualization._

_The example below makes use of the `matplotlib` library. There are other libraries, including `pandas`' built-in plotting, KeplerGL for geospatial data representation, `seaborn`, and others._

In [None]:
# use a more descriptive name for your function
def plot_visual_n(dataframe):
    figure, axes = plt.subplots(figsize=(20, 10))
    
    values = "..."  # use the dataframe to pull out values needed to plot
    
    # you may want to use matplotlib to plot your visualizations;
    # there are also many other plot types (other 
    # than axes.plot) you can use
    axes.plot(values, "...")
    # there are other methods to use to label your axes, to style 
    # and set up axes labels, etc
    axes.set_title("Some Descriptive Title")
    
    plt.show()

In [None]:
def get_data_for_visual_n():
    # Query SQL database for the data needed.
    # You can put the data queried into a pandas dataframe, if you wish
    raise NotImplementedError()

In [None]:
some_dataframe = get_data_for_visual_n()
plot_visual_n(some_dataframe)