<h1>Mapping (and some data manipulation) practice</h1>
In this assignment, you will grab one month of NYC taxi trip data and draw a choropleth map for part of Manhattan. The assignment has several parts, some of which are very tricky, so follow the steps below carefully!

The tricky part of this assignment is that we'll create our own base map for the choropleth map.

We will:

<li>Divide the bounding box into n x n equal zones</li>
<li>Figure out the four corners for each zone</li>
<li>Create a geojson object for this collection of zones</li>
<li>Use this geojson object as our base map</li>
<li>Allocate each trip to the zone that corresponds to the pickup coordinates for the ride</li>
<li>Count the number of trips in each zone by grouping by zone</li>
<li>The trip counts are skewed, so we'll smooth them a little by taking the log of the counts</li>
<li>Use these logs in the choropleth map</li>

<h1>Step 0: Set the number of buckets</h1>
<li>If your code is all good, you can change this number, select Kernel --> Restart and Run All from the Jupyter menubar, and rerun the notebook for a different value of num_buckets</li>

In [1]:
num_buckets = 5

<h2>STEP 1: Read the data</h2>
<li>The data is at <a href="https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2016-01.csv">https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2016-01.csv</a></li>
<li>Use <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html">pd.read_csv</a> to read the data into a dataframe</li>
<li>The fields in the data are described at <a href="https://www1.nyc.gov/assets/tlc/downloads/pdf/data_dictionary_trip_records_yellow.pdf">https://www1.nyc.gov/assets/tlc/downloads/pdf/data_dictionary_trip_records_yellow.pdf</a></li>
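The full month is a large file, so it may help to load only the columns this assignment uses. The sketch below is one way to do that, assuming the column names shown in the data dictionary. The in-memory <code>sample_csv</code> text is a made-up two-row stand-in for the real file, and <code>needed_columns</code> is a hypothetical name.

```python
import io
import pandas as pd

# Hypothetical two-row sample standing in for the real CSV file
sample_csv = io.StringIO(
    "VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,"
    "trip_distance,pickup_longitude,pickup_latitude\n"
    "2,2016-01-01 00:00:00,2016-01-01 00:18:30,5.52,-73.980118,40.74305\n"
    "1,2016-01-01 00:00:01,2016-01-01 00:11:55,1.2,-73.979424,40.744614\n"
)

needed_columns = ["tpep_pickup_datetime", "tpep_dropoff_datetime",
                  "trip_distance", "pickup_longitude", "pickup_latitude"]

# usecols skips the columns we do not need;
# parse_dates converts the two time columns to datetime while reading
sample_df = pd.read_csv(sample_csv, usecols=needed_columns,
                        parse_dates=["tpep_pickup_datetime", "tpep_dropoff_datetime"])
print(sample_df.dtypes)
```

With the real file, the same <code>usecols</code> and <code>parse_dates</code> arguments can be passed alongside the URL or path.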

<h2>Read the data</h2>

In [2]:
import pandas as pd
import numpy as np

# for all the data
datasource = "https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2016-01.csv"
df = pd.read_csv(datasource)

In [3]:
import pandas as pd
import numpy as np

# for the reduced data
#df = pd.read_csv('C:/Users/iawol/Desktop/College/Senior/Data Analytics/HW/taxi_data_much_reduced.csv')
#df.head()

<h2>STEP 2: Keep data that is generally south of 125th street in Manhattan</h2>
<li>Construct a bounding box at <a href="http://boundingbox.klokantech.com">http://boundingbox.klokantech.com</a></li>
<li>Select "Dublin core" to get the directional limits in the correct format</li>
<li>Approximate is fine</li>
<li>Remove any rows whose pickup latitude or pickup longitude is not in the bounding box</li>
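As a rough illustration of the bounding-box filter, the sketch below keeps only the points that fall inside a box. The limits and points are made up for the example, and <code>Series.between</code> is just one of several ways to write the test.

```python
import pandas as pd

# Made-up limits and points, for illustration only
west, south, east, north = -74.02, 40.70, -73.93, 40.82
points = pd.DataFrame({
    "pickup_longitude": [-73.99, -73.80, -74.00, 0.0],
    "pickup_latitude":  [40.73, 40.75, 40.90, 0.0],
})

# between() is inclusive at both ends by default
inside = (points["pickup_latitude"].between(south, north)
          & points["pickup_longitude"].between(west, east))

# .copy() makes the result an independent DataFrame
points_in_box = points[inside].copy()
print(points_in_box)
```

Only the first point survives: the second is too far east, the third too far north, and the fourth is a (0, 0) placeholder of the kind that sometimes appears in GPS data.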

In [4]:
# bounding box coordinates from website
westlimit=-74.020255; southlimit=40.697944; eastlimit=-73.928999; northlimit=40.820273

# removes all rows that have latitude south of the bounding box or north of the bounding box
# also removes all rows that have longitudes east of the bounding box or west of the bounding box
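# .copy() gives df_box its own data, so columns can be added later without a SettingWithCopyWarning
df_box = df[~((df['pickup_latitude'] < southlimit) | (df['pickup_latitude'] > northlimit)
              | (df['pickup_longitude'] < westlimit) | (df['pickup_longitude'] > eastlimit))].copy()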
df_box.head()

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,pickup_longitude,pickup_latitude,RatecodeID,store_and_fwd_flag,dropoff_longitude,dropoff_latitude,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount
0,2,2016-01-01 00:00:00,2016-01-01 00:00:00,2,1.1,-73.990372,40.734695,1,N,-73.981842,40.732407,2,7.5,0.5,0.5,0.0,0.0,0.3,8.8
1,2,2016-01-01 00:00:00,2016-01-01 00:00:00,5,4.9,-73.980782,40.729912,1,N,-73.944473,40.716679,1,18.0,0.5,0.5,0.0,0.0,0.3,19.3
3,2,2016-01-01 00:00:00,2016-01-01 00:00:00,1,4.75,-73.993469,40.71899,1,N,-73.962242,40.657333,2,16.5,0.0,0.5,0.0,0.0,0.3,17.3
4,2,2016-01-01 00:00:00,2016-01-01 00:00:00,3,1.76,-73.960625,40.78133,1,N,-73.977264,40.758514,2,8.0,0.0,0.5,0.0,0.0,0.3,8.8
5,2,2016-01-01 00:00:00,2016-01-01 00:18:30,2,5.52,-73.980118,40.74305,1,N,-73.91349,40.763142,2,19.0,0.5,0.5,0.0,0.0,0.3,20.3


<h2>STEP 3: Calculate taxi trip duration</h2>
<li>Add a df_box['duration'] column to df_box</li>
<li>The new column should contain the timedelta found by subtracting the pickup time column from the drop off time column after converting both to datetime</li>
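A minimal sketch of this step on two made-up rows (the <code>trips</code> frame is hypothetical; the column names and time format follow the data above):

```python
import pandas as pd

# Two made-up trips with the timestamps still stored as text
trips = pd.DataFrame({
    "tpep_pickup_datetime":  ["2016-01-01 00:00:00", "2016-01-01 00:00:01"],
    "tpep_dropoff_datetime": ["2016-01-01 00:18:30", "2016-01-01 00:11:55"],
})

# Convert both columns to datetime before subtracting
fmt = "%Y-%m-%d %H:%M:%S"
for col in ["tpep_pickup_datetime", "tpep_dropoff_datetime"]:
    trips[col] = pd.to_datetime(trips[col], format=fmt)

# Subtracting two datetime columns gives a timedelta column
trips["duration"] = trips["tpep_dropoff_datetime"] - trips["tpep_pickup_datetime"]
print(trips["duration"])
```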

In [5]:
# sets the pickup and dropoff columns of df_box to datetime
df_box['tpep_pickup_datetime'] = pd.to_datetime(df_box['tpep_pickup_datetime'],format='%Y-%m-%d %H:%M:%S')
df_box['tpep_dropoff_datetime'] = pd.to_datetime(df_box['tpep_dropoff_datetime'],format='%Y-%m-%d %H:%M:%S')

# finds the difference between dropoff and pickup to give duration of trip
df_box['duration'] = df_box['tpep_dropoff_datetime'] - df_box['tpep_pickup_datetime']
df_box.head()


Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,pickup_longitude,pickup_latitude,RatecodeID,store_and_fwd_flag,dropoff_longitude,dropoff_latitude,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,duration
0,2,2016-01-01 00:00:00,2016-01-01 00:00:00,2,1.1,-73.990372,40.734695,1,N,-73.981842,40.732407,2,7.5,0.5,0.5,0.0,0.0,0.3,8.8,00:00:00
1,2,2016-01-01 00:00:00,2016-01-01 00:00:00,5,4.9,-73.980782,40.729912,1,N,-73.944473,40.716679,1,18.0,0.5,0.5,0.0,0.0,0.3,19.3,00:00:00
3,2,2016-01-01 00:00:00,2016-01-01 00:00:00,1,4.75,-73.993469,40.71899,1,N,-73.962242,40.657333,2,16.5,0.0,0.5,0.0,0.0,0.3,17.3,00:00:00
4,2,2016-01-01 00:00:00,2016-01-01 00:00:00,3,1.76,-73.960625,40.78133,1,N,-73.977264,40.758514,2,8.0,0.0,0.5,0.0,0.0,0.3,8.8,00:00:00
5,2,2016-01-01 00:00:00,2016-01-01 00:18:30,2,5.52,-73.980118,40.74305,1,N,-73.91349,40.763142,2,19.0,0.5,0.5,0.0,0.0,0.3,20.3,00:18:30


<h2>STEP 4: Convert duration into a float</h2>
<li>into minutes</li>
<li>I've done this for you</li>
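For reference, the same conversion can also be written without <code>apply</code>. The sketch below shows two equivalent vectorized forms on made-up durations; it is an alternative, not the approach used in the cell that follows.

```python
import pandas as pd

# Made-up durations, for illustration only
durations = pd.Series(pd.to_timedelta(["00:18:30", "00:11:54", "00:00:00"]))

# Option 1: total seconds, then divide by 60
minutes_a = durations.dt.total_seconds() / 60.0

# Option 2: divide the timedelta by a one-minute timedelta
minutes_b = durations / pd.Timedelta(minutes=1)

print(minutes_a)
print(minutes_b)
```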

In [6]:
df_box['duration_in_minutes'] = (df_box['duration'].apply(lambda x:x/np.timedelta64(1, 's')))/60.0
df_box.head()


Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,pickup_longitude,pickup_latitude,RatecodeID,store_and_fwd_flag,dropoff_longitude,...,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,duration,duration_in_minutes
0,2,2016-01-01 00:00:00,2016-01-01 00:00:00,2,1.1,-73.990372,40.734695,1,N,-73.981842,...,2,7.5,0.5,0.5,0.0,0.0,0.3,8.8,00:00:00,0.0
1,2,2016-01-01 00:00:00,2016-01-01 00:00:00,5,4.9,-73.980782,40.729912,1,N,-73.944473,...,1,18.0,0.5,0.5,0.0,0.0,0.3,19.3,00:00:00,0.0
3,2,2016-01-01 00:00:00,2016-01-01 00:00:00,1,4.75,-73.993469,40.71899,1,N,-73.962242,...,2,16.5,0.0,0.5,0.0,0.0,0.3,17.3,00:00:00,0.0
4,2,2016-01-01 00:00:00,2016-01-01 00:00:00,3,1.76,-73.960625,40.78133,1,N,-73.977264,...,2,8.0,0.0,0.5,0.0,0.0,0.3,8.8,00:00:00,0.0
5,2,2016-01-01 00:00:00,2016-01-01 00:18:30,2,5.52,-73.980118,40.74305,1,N,-73.91349,...,2,19.0,0.5,0.5,0.0,0.0,0.3,20.3,00:18:30,18.5


Remove data for rides that are shorter than 5 minutes or longer than 1.5 hours
<li>probably bad data</li>
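A filter like this needs care with operator precedence: in Python, <code>~</code> binds more tightly than <code>|</code>, so a negation written in front of two OR-ed comparisons only applies to the first one unless the whole condition is wrapped in parentheses. The sketch below shows the difference on three made-up durations.

```python
import pandas as pd

# Made-up durations in minutes: too short, fine, too long
minutes = pd.Series([2.0, 18.5, 120.0])

# ~ applies only to the first comparison here, so the 120-minute ride is kept
loose = ~(minutes < 5) | (minutes > 90)

# Wrapping the whole "bad ride" condition negates both parts
strict = ~((minutes < 5) | (minutes > 90))

# Equivalent positive form: keep rides between 5 and 90 minutes inclusive
same = minutes.between(5, 90)

print(loose.tolist(), strict.tolist(), same.tolist())
```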

In [7]:
# removes all data where rides are less than 5 minutes or greater than 90 minutes
# the outer parentheses make ~ negate the whole condition, not just the first comparison
df_new = df_box[~((df_box['duration_in_minutes']<5) | (df_box['duration_in_minutes']>90))].copy()
df_new.head()

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,pickup_longitude,pickup_latitude,RatecodeID,store_and_fwd_flag,dropoff_longitude,...,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,duration,duration_in_minutes
5,2,2016-01-01 00:00:00,2016-01-01 00:18:30,2,5.52,-73.980118,40.74305,1,N,-73.91349,...,2,19.0,0.5,0.5,0.0,0.0,0.3,20.3,00:18:30,18.5
6,2,2016-01-01 00:00:00,2016-01-01 00:26:45,2,7.45,-73.994057,40.71999,1,N,-73.966362,...,2,26.0,0.5,0.5,0.0,0.0,0.3,27.3,00:26:45,26.75
7,1,2016-01-01 00:00:01,2016-01-01 00:11:55,1,1.2,-73.979424,40.744614,1,N,-73.992035,...,2,9.0,0.5,0.5,0.0,0.0,0.3,10.3,00:11:54,11.9
8,1,2016-01-01 00:00:02,2016-01-01 00:11:14,1,6.0,-73.947151,40.791046,1,N,-73.920769,...,2,18.0,0.5,0.5,0.0,0.0,0.3,19.3,00:11:12,11.2
9,2,2016-01-01 00:00:02,2016-01-01 00:11:08,1,3.21,-73.998344,40.723896,1,N,-73.99585,...,2,11.5,0.5,0.5,0.0,0.0,0.3,12.8,00:11:06,11.1


<h2>STEP 5: trip distance</h2>
<li>Is in miles</li>
<li>We'll get rid of anything less than .2 miles (2 blocks)</li>
<li>And anything greater than 30 miles</li>
<li>Probably bad data!</li>

In [8]:
# removes all data where rides are less than 0.2 miles or greater than 30 miles
# the outer parentheses make ~ negate the whole condition, not just the first comparison
df2 = df_new[~((df_new['trip_distance']<0.2) | (df_new['trip_distance']>30))].copy()

<h2>STEP 6: Create zones and allocate trips to zones</h2>
<b>This is the tricky part!</b>



<b>First bucket the data into latitude buckets and longitude buckets</b>
<li>Use <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.cut.html">pd.cut</a> for this</li>
<li>pd.cut, called with retbins=True, returns the bucketed values and an array of the bucket boundaries</li>
<li>I've done the latitude bucket for you</li>
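To see what <code>pd.cut</code> returns, here is a small sketch on five made-up latitudes split into two buckets. Note that when the number of bins is given as an integer, <code>pd.cut</code> derives the edges from the minimum and maximum of the data, not from the bounding box limits.

```python
import pandas as pd

# Made-up latitudes, for illustration only
latitudes = pd.Series([40.70, 40.72, 40.75, 40.79, 40.82])
n = 2

# labels gives each bucket an integer id starting at 1;
# retbins=True also returns the array of bucket edges
buckets, edges = pd.cut(latitudes, n, labels=list(range(1, n + 1)), retbins=True)

print(buckets.tolist())
print(edges)
```

There are always <code>n + 1</code> edges for <code>n</code> buckets, and the lowest edge is nudged slightly below the smallest value so that value falls inside the first bucket.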


In [9]:
df2['latitude_bucket'],lat_cats = pd.cut(df2['pickup_latitude'],num_buckets,
                                         labels=list(range(1,num_buckets+1)),retbins=True)


In [10]:
df2['longitude_bucket'],lon_cats = pd.cut(df2['pickup_longitude'],num_buckets,
                                         labels=list(range(1,num_buckets+1)),retbins=True)


<b>Then create a zones_table list that contains zone_labels and zone_boundaries</b>
<li>You'll need this to create the geojson file</li>
<li>I've done this for you as well</li>
<li>Note that zone identifiers look like 001002, where 001 is the latitude bucket and 002 is the longitude bucket</li>
<li>We will need to add a column to the dataframe that correctly renders the zone id</li>

In [11]:
zones_table = list()
for i in range(1,len(lat_cats)):
    lat_bucket = i
    for j in range(1,len(lon_cats)):
        lon_bucket = j
        lat1 = lat_cats[i-1]
        lat2 = lat_cats[i]
        lon1 = lon_cats[j-1]
        lon2 = lon_cats[j]
        zone_bounds = [[lon1,lat1],[lon2,lat1],[lon2,lat2],[lon1,lat2],[lon1,lat1]]
        zone="%03d%03d"%(lat_bucket,lon_bucket)
        zones_table.append((zone,zone_bounds))
zones_table

[('001001',
  [[-74.0203320236206, 40.697822315216065],
   [-74.00199279785156, 40.697822315216065],
   [-74.00199279785156, 40.72240982055664],
   [-74.0203320236206, 40.72240982055664],
   [-74.0203320236206, 40.697822315216065]]),
 ('001002',
  [[-74.00199279785156, 40.697822315216065],
   [-73.98374481201171, 40.697822315216065],
   [-73.98374481201171, 40.72240982055664],
   [-74.00199279785156, 40.72240982055664],
   [-74.00199279785156, 40.697822315216065]]),
 ('001003',
  [[-73.98374481201171, 40.697822315216065],
   [-73.96549682617187, 40.697822315216065],
   [-73.96549682617187, 40.72240982055664],
   [-73.98374481201171, 40.72240982055664],
   [-73.98374481201171, 40.697822315216065]]),
 ('001004',
  [[-73.96549682617187, 40.697822315216065],
   [-73.94724884033202, 40.697822315216065],
   [-73.94724884033202, 40.72240982055664],
   [-73.96549682617187, 40.72240982055664],
   [-73.96549682617187, 40.697822315216065]]),
 ('001005',
  [[-73.94724884033202, 40.697822315216065]

<b>Then create the corresponding zone names in df2</b>
<li>add a new column df2['zone']</li>
<li>the values will be the concatenation of the latitude bucket and the longitude bucket for each row in df2</li>
<li>remember to pad the bucket ids with 0's</li>
<li>Example: latitude bucket = 2, longitude bucket = 33, zone = 002033</li>
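One possible way to build the padded zone id is with the vectorized string method <code>str.zfill</code>. This sketch uses a made-up <code>buckets</code> frame with plain integer columns; with the categorical columns produced by <code>pd.cut</code>, the <code>astype(int)</code> step assumes there are no missing buckets.

```python
import pandas as pd

# Made-up bucket ids, for illustration only
buckets = pd.DataFrame({"latitude_bucket": [2, 1, 4],
                        "longitude_bucket": [33, 2, 5]})

# Convert each id to text, left-pad with zeros to width 3, then concatenate
buckets["zone"] = (buckets["latitude_bucket"].astype(int).astype(str).str.zfill(3)
                   + buckets["longitude_bucket"].astype(int).astype(str).str.zfill(3))

print(buckets["zone"].tolist())
```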

In [12]:
# adds a column for zone padded with 0s
df2['zone'] = df2['latitude_bucket'].apply(lambda x: "%03d"%x).astype(str) \
    + df2['longitude_bucket'].apply(lambda x: "%03d"%x).astype(str)

df2.head()


Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,pickup_longitude,pickup_latitude,RatecodeID,store_and_fwd_flag,dropoff_longitude,...,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,duration,duration_in_minutes,latitude_bucket,longitude_bucket,zone
5,2,2016-01-01 00:00:00,2016-01-01 00:18:30,2,5.52,-73.980118,40.74305,1,N,-73.91349,...,0.5,0.0,0.0,0.3,20.3,00:18:30,18.5,2,3,2003
6,2,2016-01-01 00:00:00,2016-01-01 00:26:45,2,7.45,-73.994057,40.71999,1,N,-73.966362,...,0.5,0.0,0.0,0.3,27.3,00:26:45,26.75,1,2,1002
7,1,2016-01-01 00:00:01,2016-01-01 00:11:55,1,1.2,-73.979424,40.744614,1,N,-73.992035,...,0.5,0.0,0.0,0.3,10.3,00:11:54,11.9,2,3,2003
8,1,2016-01-01 00:00:02,2016-01-01 00:11:14,1,6.0,-73.947151,40.791046,1,N,-73.920769,...,0.5,0.0,0.0,0.3,19.3,00:11:12,11.2,4,5,4005
9,2,2016-01-01 00:00:02,2016-01-01 00:11:08,1,3.21,-73.998344,40.723896,1,N,-73.99585,...,0.5,0.0,0.0,0.3,12.8,00:11:06,11.1,2,2,2002


<h2>STEP 7: Remove zones with no or few pickups</h2>
<li>remove zones with fewer than 100 pickups</li>
<li>this way we'll get rid of trips that are not in Manhattan but crept in anyway</li>
<li>done for you</li>
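The same idea can be sketched in a vectorized way by mapping each row to the size of its zone. This is an alternative to the loop below, shown on made-up data; <code>min_pickups</code> is a hypothetical name and is set to 2 only to keep the toy example small.

```python
import pandas as pd

# Made-up rides: three in zone 001001, one in 001002, two in 002001
rides = pd.DataFrame({"zone": ["001001"] * 3 + ["001002"] * 1 + ["002001"] * 2})
min_pickups = 2  # the assignment uses 100

# value_counts gives pickups per zone; map attaches that count to every row
pickups_in_zone = rides["zone"].map(rides["zone"].value_counts())

# Keep only rows whose zone has enough pickups
kept = rides[pickups_in_zone >= min_pickups].copy()
print(sorted(kept["zone"].unique()))
```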

<b>Identify the zones we want to keep</b>

In [13]:
zones_with_data = df2['zone'].unique()
zone_sizes = df2.groupby('zone').size()
for zone in zone_sizes.index:
    if zone_sizes[zone]<100:
        zones_with_data=np.delete(zones_with_data,np.where(zones_with_data == zone))

<b>keep only rows with zones in zones_with_data</b>


In [14]:
# keeps only rows whose zone is in the array zones_with_data
# .copy() lets us add columns to df3 later without a SettingWithCopyWarning
df3 = df2[df2.zone.isin(zones_with_data)].copy()
df3.head()

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,pickup_longitude,pickup_latitude,RatecodeID,store_and_fwd_flag,dropoff_longitude,...,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,duration,duration_in_minutes,latitude_bucket,longitude_bucket,zone
5,2,2016-01-01 00:00:00,2016-01-01 00:18:30,2,5.52,-73.980118,40.74305,1,N,-73.91349,...,0.5,0.0,0.0,0.3,20.3,00:18:30,18.5,2,3,2003
6,2,2016-01-01 00:00:00,2016-01-01 00:26:45,2,7.45,-73.994057,40.71999,1,N,-73.966362,...,0.5,0.0,0.0,0.3,27.3,00:26:45,26.75,1,2,1002
7,1,2016-01-01 00:00:01,2016-01-01 00:11:55,1,1.2,-73.979424,40.744614,1,N,-73.992035,...,0.5,0.0,0.0,0.3,10.3,00:11:54,11.9,2,3,2003
8,1,2016-01-01 00:00:02,2016-01-01 00:11:14,1,6.0,-73.947151,40.791046,1,N,-73.920769,...,0.5,0.0,0.0,0.3,19.3,00:11:12,11.2,4,5,4005
9,2,2016-01-01 00:00:02,2016-01-01 00:11:08,1,3.21,-73.998344,40.723896,1,N,-73.99585,...,0.5,0.0,0.0,0.3,12.8,00:11:06,11.1,2,2,2002


<h2>STEP 8: Write a function for creating geojson based on zone boundaries </h2>
<li>Include only zones that are in zones with data</li>
<li>I've partially written this function for you</li>
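For orientation, here is a small sketch of what one zone looks like as GeoJSON. The helper <code>make_zone_feature</code> and the corner values are hypothetical. Two GeoJSON rules matter here: positions are written as [longitude, latitude], and a polygon ring must end with the same position it starts with.

```python
import json

def make_zone_feature(zone_id, ring):
    """Hypothetical helper: wrap one zone's corner list in a GeoJSON Feature."""
    if ring[0] != ring[-1]:
        ring = ring + [ring[0]]  # close the ring by repeating the first corner
    return {
        "type": "Feature",
        "geometry": {"type": "Polygon", "coordinates": [ring]},
        "properties": {"description": zone_id},
    }

# Made-up corners in [longitude, latitude] order
corners = [[-74.02, 40.70], [-74.00, 40.70], [-74.00, 40.72], [-74.02, 40.72]]

collection = {"type": "FeatureCollection",
              "features": [make_zone_feature("001001", corners)]}

# The collection is plain Python data, so it serializes directly to JSON text
geojson_text = json.dumps(collection)
print(geojson_text)
```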

In [15]:
def createZoneGeoJson(zone_table, zones_to_keep=None):
    # builds a GeoJSON FeatureCollection with one Polygon feature per zone;
    # if zones_to_keep is given, zones that are not in it are skipped
    keep = None if zones_to_keep is None else set(zones_to_keep)
    zone_data_dict = dict()
    zone_data_dict['type'] = 'FeatureCollection'
    zone_data_dict_features = list()
    zone_data_dict['features'] = zone_data_dict_features
    for zone_id, zone_bounds in zone_table:
        if keep is not None and zone_id not in keep:
            continue

        # adds a Feature of type Polygon for this zone
        zone_data_dict_features.append({"type": "Feature",
                                        "geometry": {"type": "Polygon",
                                                     "coordinates": [zone_bounds]},
                                        "properties": {"description": zone_id}})

    return zone_data_dict

# includes only the zones identified in STEP 7 (zones_with_data)
zones_geojson = createZoneGeoJson(zones_table, zones_with_data)

<h2>STEP 9: Choropleth Map all rides</h2>
<li>Draw a folium map using zones geojson object as the base map</li>
<li>and the counts for the boxes after grouping the data by zone</li>
<li>create a column with log counts and use that column in the maps</li>
<li>this is similar to the class example</li>
<a href="http://www.datasciencemadesimple.com/log-natural-logarithmic-value-column-pandas-python-2/">log examples</a>
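A small sketch of the count-then-log step on made-up data, to show how the log compresses a skewed range (1000 versus 10 pickups becomes roughly 6.9 versus 2.3):

```python
import numpy as np
import pandas as pd

# Made-up rides: one very busy zone and one quiet zone
rides = pd.DataFrame({"zone": ["001001"] * 1000 + ["001002"] * 10})

# size() counts rows per zone; reset_index turns the result into a two-column frame
counts = rides.groupby("zone").size().reset_index(name="count")

# natural log of the counts
counts["log_count"] = np.log(counts["count"])
print(counts)
```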

In [16]:
# groups by zone and counts trips in each zone
zone_groups = df3.groupby("zone")
counts = pd.DataFrame({'zone': zone_groups.zone.count().index, 'count': zone_groups.zone.count().values})

# smoothing
counts['log_count'] = np.log(counts['count'])

In [17]:
import folium
import branca

# creates branca element to resize folium
f = branca.element.Figure(height=800)

# creates the map
m=folium.Map(location = [40.7589,-73.9851],zoom_start=10)

# adds folium map to branca figure
f.add_child(m)

# creates the color grid and legend
c = folium.Choropleth(geo_data=zones_geojson,
                      data=counts,
                      columns=['zone','log_count'],
                      key_on='feature.properties.description',
                      fill_color='RdYlGn',
                      legend_name="Logarithmic Distribution of Rides",
                     highlight=True)
c.add_to(m)
f

<h2>STEP 10: Choropleth Map morning rush rides</h2>
<li>Extract the hour of the day from pickup time</li>
<li>Extract only data for trips that start at 8am or later and end at 10am or earlier</li>
<li>Group the data and draw a choropleth map for morning rush rides</li>


<a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.dt.hour.html">extracting the hour</a>

<a href="https://www.geeksforgeeks.org/python-pandas-series-dt-hour/">example</a>
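One way to read "start at 8am or later and end at 10am or earlier" is sketched below on three made-up trips: the pickup hour must be at least 8, and the dropoff must be no later than 10:00 on the pickup day. The column names <code>pickup</code> and <code>dropoff</code> are hypothetical.

```python
import pandas as pd

# Made-up trips: starts too early, fits the window, ends too late
trips = pd.DataFrame({
    "pickup":  pd.to_datetime(["2016-01-04 07:55:00", "2016-01-04 08:10:00", "2016-01-04 09:50:00"]),
    "dropoff": pd.to_datetime(["2016-01-04 08:20:00", "2016-01-04 08:40:00", "2016-01-04 10:15:00"]),
})

# dt.hour extracts the hour of the day (0 to 23)
trips["pickuphour"] = trips["pickup"].dt.hour

# 10:00 on the day of each pickup
window_end = trips["pickup"].dt.normalize() + pd.Timedelta(hours=10)

morning = trips[(trips["pickuphour"] >= 8) & (trips["dropoff"] <= window_end)]
print(morning)
```

Filtering on the pickup hour alone with <code>&lt;= 10</code> would also admit rides that start as late as 10:59.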

In [18]:
df3.head()

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,pickup_longitude,pickup_latitude,RatecodeID,store_and_fwd_flag,dropoff_longitude,...,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,duration,duration_in_minutes,latitude_bucket,longitude_bucket,zone
5,2,2016-01-01 00:00:00,2016-01-01 00:18:30,2,5.52,-73.980118,40.74305,1,N,-73.91349,...,0.5,0.0,0.0,0.3,20.3,00:18:30,18.5,2,3,2003
6,2,2016-01-01 00:00:00,2016-01-01 00:26:45,2,7.45,-73.994057,40.71999,1,N,-73.966362,...,0.5,0.0,0.0,0.3,27.3,00:26:45,26.75,1,2,1002
7,1,2016-01-01 00:00:01,2016-01-01 00:11:55,1,1.2,-73.979424,40.744614,1,N,-73.992035,...,0.5,0.0,0.0,0.3,10.3,00:11:54,11.9,2,3,2003
8,1,2016-01-01 00:00:02,2016-01-01 00:11:14,1,6.0,-73.947151,40.791046,1,N,-73.920769,...,0.5,0.0,0.0,0.3,19.3,00:11:12,11.2,4,5,4005
9,2,2016-01-01 00:00:02,2016-01-01 00:11:08,1,3.21,-73.998344,40.723896,1,N,-73.99585,...,0.5,0.0,0.0,0.3,12.8,00:11:06,11.1,2,2,2002


In [19]:
# converts to datetime
df3['tpep_pickup_datetime'] = pd.to_datetime(df3['tpep_pickup_datetime'],format='%Y-%m-%d %H:%M:%S')

# extracts hour
df3['pickuphour'] = df3['tpep_pickup_datetime'].dt.hour


In [20]:
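# keeps rides that start at 8 am or later and end by 10 am on the same day
dropoff_time = pd.to_datetime(df3['tpep_dropoff_datetime'])
ten_am = df3['tpep_pickup_datetime'].dt.normalize() + pd.Timedelta(hours=10)
morning_rush_df = df3[(df3['pickuphour'] >= 8) & (dropoff_time <= ten_am)]

# resets index
morning_rush_df = morning_rush_df.reset_index(drop=True)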

In [21]:
# groups by zone
zone_groups_morning_rush = morning_rush_df.groupby("zone")

# creates new dataframe from count and zone
counts_morning_rush = pd.DataFrame({'zone': zone_groups_morning_rush.zone.count().index, 
                                    'count': zone_groups_morning_rush.zone.count().values})

# smoothing
counts_morning_rush['log_count'] = np.log(counts_morning_rush['count'])

In [22]:
# creates branca element to resize folium
f = branca.element.Figure(height=800)

# creates the map
m=folium.Map(location = [40.7589,-73.9851],zoom_start=10)

# adds folium map to branca figure
f.add_child(m)

# creates the color grid and legend
c = folium.Choropleth(geo_data=zones_geojson,
                      data=counts_morning_rush,
                      columns=['zone','log_count'],
                      key_on='feature.properties.description',
                      fill_color='RdYlGn',
                      legend_name="Logarithmic Distribution of Morning Rush Rides",
                     highlight=True)
c.add_to(m)
f

<h2>STEP 11: Choropleth Map evening rush rides</h2>
<li>Extract the hour of the day from pickup time</li>
<li>Extract only data for trips that start at 1600 or later and end at 1800 or earlier</li>
<li>Group the data and draw a choropleth map for evening rush rides</li>

In [23]:
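# keeps rides that start at 4 pm or later and end by 6 pm on the same day
dropoff_time = pd.to_datetime(df3['tpep_dropoff_datetime'])
six_pm = df3['tpep_pickup_datetime'].dt.normalize() + pd.Timedelta(hours=18)
evening_rush_df = df3[(df3['pickuphour'] >= 16) & (dropoff_time <= six_pm)]

# resets the index
evening_rush_df = evening_rush_df.reset_index(drop=True)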

In [24]:
# groups by zone
zone_groups_evening_rush = evening_rush_df.groupby("zone")

# creates new dataframe from count and zone
counts_evening_rush = pd.DataFrame({'zone': zone_groups_evening_rush.zone.count().index, 
                                    'count': zone_groups_evening_rush.zone.count().values})

# smoothing
counts_evening_rush['log_count'] = np.log(counts_evening_rush['count'])

In [25]:
# creates branca element to resize folium
f = branca.element.Figure(height=800)

# creates the map
m=folium.Map(location = [40.7589,-73.9851],zoom_start=10)

# adds folium map to branca figure
f.add_child(m)

# creates the color grid and legend
c = folium.Choropleth(geo_data=zones_geojson,
                      data=counts_evening_rush,
                      columns=['zone','log_count'],
                      key_on='feature.properties.description',
                      fill_color='RdYlGn',
                      legend_name="Logarithmic Distribution of Evening Rush Rides",
                     highlight=True)
c.add_to(m)
f