# Nearest Neighbors Implementation - Lab

## Objectives

* Apply the nearest neighbors approach in a real-world domain to answer similarity-based analytical questions about data.
* Use pandas and numpy to calculate distance-based similarity measures between data objects.
* Perform basic data exploration and feature selection for a given dataset.
* Group and plot similarity between geographical data objects using folium.
* Practice converting data between lists, dictionaries, numpy arrays and pandas dataframes.
* Develop a complete system based on reusable functions that each perform a single task.

### Introduction

In this lab, we shall apply the nearest neighbors technique from the previous lab to help a taxi company predict the length of its rides.  We shall work with several data types and convert between them.

Imagine that we are hired to consult for LiftOff, a limo and taxi service that is just opening up in NYC.  LiftOff wants its taxi drivers to target longer rides, since the longer the ride, the more money it makes.  LiftOff has the following theory:

* *The pickup location of a taxi ride can help predict the length of the ride.*  


LiftOff asks us to do some analysis and write a function that will allow it to **predict the length of a taxi ride for any given location**.

Our technique will be the following:
  * **Collect:** Obtain the data containing all of the taxi information, and select only the attributes of taxi trips that we need
  * **Explore:** Examine the attributes of our data, and plot some of our data on a map
  * **Train:** Write our nearest neighbors formula, and vary the number of nearby trips used to predict the length of a new trip
  * **Predict:** Use our function to predict trip lengths for new locations

### Collect and Explore the data

#### Collect the Data

Luckily for us, [NYC Open Data](https://opendata.cityofnewyork.us/) collects information about NYC taxi trips and provides this data on [its website](https://data.cityofnewyork.us/Transportation/2014-Yellow-Taxi-Trip-Data/gn7m-em8n).

![](./nyc-taxi.png)

For your reading pleasure, the data has already been downloaded into the `trips.json` file included with this lab.  We'll use Python's `json` library to read the data from `trips.json` and store it as a pandas dataframe in our notebook. Here is the code to give you a head start.

In [1]:
import pandas as pd
import numpy as np
import json

# First, open the file and parse its JSON contents
# (the with block closes the file for us when we are done)
with open('trips.json') as trips_file:
    trips = json.load(trips_file)

# Then, convert the contents to a dataframe
trips_df = pd.DataFrame(trips)

trips_df.head()

| | dropoff_datetime | dropoff_latitude | dropoff_longitude | fare_amount | imp_surcharge | mta_tax | passenger_count | payment_type | pickup_datetime | pickup_latitude | pickup_longitude | rate_code | store_and_fwd_flag | tip_amount | tolls_amount | total_amount | trip_distance | vendor_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2014-11-26T22:31:00.000 | 40.74677 | -73.99745 | 52.0 | 0.0 | 0.5 | 1 | CSH | 2014-11-26T21:59:00.000 | 40.64499 | -73.78114999999998 | 2 | | 0.0 | 5.33 | 57.83 | 18.38 | VTS |
| 1 | 2014-02-22T17:54:37.000 | 40.781845 | -73.979073 | 7.5 | 0.0 | 0.5 | 1 | CSH | 2014-02-22T17:47:23.000 | 40.766931 | -73.98209799999998 | 1 | N | 0.0 | 0.0 | 8.0 | 1.3 | CMT |
| 2 | 2014-04-06T04:07:34.000 | 40.72661 | -73.99595499999998 | 15.5 | 0.5 | 0.5 | 1 | CRD | 2014-04-06T03:51:59.000 | 40.77773 | -73.951902 | 1 | N | 3.3 | 0.0 | 19.8 | 4.5 | CMT |
| 3 | 2014-09-08T21:15:30.000 | 40.773373 | -73.949354 | 11.5 | 0.5 | 0.5 | 1 | CSH | 2014-09-08T21:02:17.000 | 40.795678 | -73.97104899999998 | 1 | N | 0.0 | 0.0 | 12.5 | 2.4 | CMT |
| 4 | 2014-11-13T11:55:00.000 | 40.752932 | -73.974688 | 5.5 | 0.0 | 0.5 | 2 | CRD | 2014-11-13T11:50:00.000 | 40.762912 | -73.967782 | 1 | | 1.1 | 0.0 | 7.1 | 0.8399999999999999 | VTS |


#### Explore the data

The next step is to explore the data.  First, let's see how many trips we have.

In [2]:
len(trips_df)

1000

1000 data elements, not bad at all!  Now let's see what these trips look like.  Each trip is a row in our dataframe, so we can see the attributes of each trip with the dataframe's `columns` attribute.

In [10]:
# View the columns of the dataframe

trips_df.columns

# Index(['dropoff_datetime', 'dropoff_latitude', 'dropoff_longitude',
#        'fare_amount', 'imp_surcharge', 'mta_tax', 'passenger_count',
#        'payment_type', 'pickup_datetime', 'pickup_latitude',
#        'pickup_longitude', 'rate_code', 'store_and_fwd_flag', 'tip_amount',
#        'tolls_amount', 'total_amount', 'trip_distance', 'vendor_id'],
#       dtype='object')


Index(['dropoff_datetime', 'dropoff_latitude', 'dropoff_longitude',
       'fare_amount', 'imp_surcharge', 'mta_tax', 'passenger_count',
       'payment_type', 'pickup_datetime', 'pickup_latitude',
       'pickup_longitude', 'rate_code', 'store_and_fwd_flag', 'tip_amount',
       'tolls_amount', 'total_amount', 'trip_distance', 'vendor_id'],
      dtype='object')

#### Limit our data

We see a lot of variables in the dataset. Some of these may not be appropriate for our analysis. Let's begin to think through what data is relevant for our task.

### Feature Selection

Remember that our task is to 

>**use the trip location to predict the length of a trip**.  

To answer the question above, let's select the `pickup_latitude`, `pickup_longitude`, and `trip_distance` from each trip.  That will give us the start location and the corresponding `trip_distance` for each trip.  Then, based on these **actual** trip distances, we can use nearest neighbors to predict an **expected** trip distance for a new trip, given an **actual** pickup location. Let's move on with this idea.

**Parse and filter the dataset for required features**

Write a function called `parse_trips(trips)` that filters the trips data frame and returns a new data frame containing only the following attributes for each trip: 

* `trip_distance`
* `pickup_latitude`
* `pickup_longitude`

Use the `dataframe.filter()` function to select the above features and create a new dataset. Make sure that all the columns in the new dataset are numeric and of type `float64`.

In [11]:
def parse_trips(trips):
    
    # Create a new dataset containing only the required columns
    # Ensure the values in new dataset are numeric
    parsed = trips.filter(['trip_distance', 'pickup_latitude', 
                           'pickup_longitude'], axis=1).apply(pd.to_numeric)

    return parsed
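
To see what the `pd.to_numeric` step contributes, here is a minimal sketch on a made-up two-row frame. The name `raw` and the idea that the values arrive as strings are assumptions for illustration; the lab does not say how the JSON values are typed.

```python
import pandas as pd

# Hypothetical raw records: numeric values stored as strings (an assumption)
raw = pd.DataFrame({
    'trip_distance': ['18.38', '1.3'],
    'pickup_latitude': ['40.64499', '40.766931'],
    'pickup_longitude': ['-73.78115', '-73.982098'],
    'vendor_id': ['VTS', 'CMT'],
})

wanted = ['trip_distance', 'pickup_latitude', 'pickup_longitude']

# filter() keeps only the listed columns; to_numeric converts each column
parsed = raw.filter(wanted, axis=1).apply(pd.to_numeric)

print(list(parsed.columns))   # only the three wanted columns remain
print(parsed.dtypes.tolist()) # dtypes after conversion
```

Columns that are already numeric pass through `pd.to_numeric` unchanged, so applying it is harmless when the data is clean.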

In [12]:
parsed_trips = parse_trips(trips_df)

print(parsed_trips.head(),'\n', type(parsed_trips))

# trip_distance  pickup_latitude  pickup_longitude
# 0          18.38        40.644990        -73.781150
# 1           1.30        40.766931        -73.982098
# 2           4.50        40.777730        -73.951902
# 3           2.40        40.795678        -73.971049
# 4           0.84        40.762912        -73.967782 
# <class 'pandas.core.frame.DataFrame'>

   trip_distance  pickup_latitude  pickup_longitude
0          18.38        40.644990        -73.781150
1           1.30        40.766931        -73.982098
2           4.50        40.777730        -73.951902
3           2.40        40.795678        -73.971049
4           0.84        40.762912        -73.967782 
 <class 'pandas.core.frame.DataFrame'>


### Data Exploration and Visualization

Now that we have filtered our data down to the required columns, let's get a sense of our trip data. We can use the `folium` Python library to plot a map of Manhattan along with our data.  First we must import `folium`, and then call `folium.Map` with a `location` and a `zoom_start` (the initial zoom level of the map; try changing it to see the effect).


**If a map isn't showing up below, copy and paste the command `pip install folium` into your terminal to install `folium` then try again.**

In [13]:
# pip install folium
import folium
manhattan_map = folium.Map(location=[40.7589, -73.9851], zoom_start=11)

In [14]:
manhattan_map

Ok, now let's see how we could add a **"marker"** to pin a specific location using the `.add_to()` method.  We'll start with our map centered on Times Square ( [40.7589, -73.9851] ) and then put a marker on it.

In [15]:
marker = folium.CircleMarker(location = [40.7589, -73.9851], radius=5)
marker.add_to(manhattan_map)

<folium.features.CircleMarker at 0x110231a90>

Above, we first create a marker.  Then we add that circle marker to the `manhattan_map` we created earlier. 

In [16]:
manhattan_map

Do you see that blue circle near Times Square?  That is our marker.  

So now that we can plot one marker on a map, we should have a sense of how we can plot many markers on a map to display our taxi ride data.  We simply plot a map, and then we add a marker for each location of a taxi trip.

Now let's write some functions to allow us to plot maps and add markers a little more easily.  

#### Map plotting functions

As a first step towards this, note that the functions that create a marker and a map each take a location as a two-element list, representing the latitude and longitude values.  Take another look:

```python
marker = folium.CircleMarker(location = [40.7589, -73.9851])
manhattan_map = folium.Map(location=[40.7589, -73.9851])
```

So let's write a function to create this two-element list from a trip.  Write a function called `location` that takes in a trip as an argument and returns a list where the first element is the latitude and the second is the longitude.  Remember that a trip looks like the following:

In [18]:
# Set the first element of parsed_trips as first_trip
first_trip = parsed_trips.iloc[0]
first_trip

trip_distance       18.38000
pickup_latitude     40.64499
pickup_longitude   -73.78115
Name: 0, dtype: float64

Now let's write our function.

In [20]:
def location(trip):
    loc = [trip['pickup_latitude'], trip['pickup_longitude']]
    return loc

Let's pass the first trip in the dataset and see if it returns the expected values.

In [21]:
first_location = location(first_trip) 
first_location 

# [40.64499, -73.78115]

[40.64499, -73.78114999999998]
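
Because `location` only uses key lookup, it should work equally well for a plain dictionary as for a dataframe row; later in this lab a dict (`midtown_trip`) is passed to it. A minimal sketch, where `times_square_trip` is a made-up example trip:

```python
def location(trip):
    # Works for any object that supports key lookup (dict or pandas Series)
    return [trip['pickup_latitude'], trip['pickup_longitude']]

# Hypothetical trip written as a plain dict (Times Square coordinates)
times_square_trip = {'pickup_latitude': 40.7589, 'pickup_longitude': -73.9851}

print(location(times_square_trip))   # [40.7589, -73.9851]
```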

Ok, now that we can turn a trip into a location, let's turn a location into a marker.  Write a function called `loc_to_marker` that takes in a location (coordinates in the form of a list) as an argument, and returns a folium `CircleMarker` for that location.  The radius of the marker should always equal 5.

In [22]:
def loc_to_marker(location):
    marker = folium.CircleMarker(location, radius = 5)
    return marker

Let's pass a location in the expected format and inspect the output.

In [23]:
times_square_marker = loc_to_marker([40.7589, -73.9851])
times_square_marker.location 

# [40.7589, -73.9851]

[40.7589, -73.9851]

To check that the marker radius has been saved along with the location, we need to use the `json` library, as the options for markers are stored in JSON format. So let's look for the `radius` value in `times_square_marker.options`.

In [24]:

json.loads(times_square_marker.options)['radius'] 

# 5

5
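
The lookup above relies on the options being stored as JSON text. As a hedged sketch, the helper below reads an option from either JSON text or a dictionary, on the assumption that some folium versions may expose `options` as a dict instead; check your installed version. `read_option` is a hypothetical helper, not part of folium, and the sample option values are made up.

```python
import json

def read_option(options, key):
    """Return one option value from either a JSON string or a dict.

    Hypothetical helper for illustration; it is not part of folium.
    """
    if isinstance(options, str):
        options = json.loads(options)   # options stored as JSON text
    return options[key]

print(read_option('{"radius": 5, "color": "blue"}', 'radius'))   # 5
print(read_option({'radius': 5, 'color': 'blue'}, 'radius'))     # 5
```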

That all worked well. So now that we know how to produce a single marker for a trip, let's write a function to produce markers for many trips.  We can write a function called `markers_from_trips` that takes in `parsed_trips` and returns a list containing one marker object for each trip.  

In [25]:
def markers_from_trips(trips):
    locations = []
    markers = []
    
    # In a for loop, append latitude and longitude values from each row of 'trips' to 'locations'
    for index, trip in trips.iterrows():
        locations.append([trip['pickup_latitude'], trip['pickup_longitude']])
   
    # In a for loop, pass each value from 'locations' to 'loc_to_marker'.
    # Append the output of each iteration to 'markers'
    for loc in locations:
        markers.append(loc_to_marker(loc))
    
    return markers

In [26]:
trip_markers = markers_from_trips(parsed_trips)
trip_markers[0], len(trip_markers)

# (<folium.features.CircleMarker at 0x10edc9cf8>, 1000)

(<folium.features.CircleMarker at 0x11020db00>, 1000)

Great, so now we have 1000 markers, one for each trip. 
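
The first loop in `markers_from_trips` converts dataframe rows into nested lists. As one possible alternative, the same conversion can be done without `iterrows`. This is a minimal sketch on a hypothetical three-trip frame whose values are copied from the head of the data:

```python
import pandas as pd

# Hypothetical three-trip frame, with values taken from the head of the data
trips = pd.DataFrame({
    'trip_distance': [18.38, 1.3, 4.5],
    'pickup_latitude': [40.64499, 40.766931, 40.77773],
    'pickup_longitude': [-73.78115, -73.982098, -73.951902],
})

# Select the two coordinate columns and convert the rows to nested lists
locations = trips[['pickup_latitude', 'pickup_longitude']].values.tolist()

print(locations[0])     # [40.64499, -73.78115]
print(len(locations))   # 3
```

Each inner list has the `[latitude, longitude]` shape that `loc_to_marker` expects.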

#### A quick re-cap

Let's look back at what we have achieved so far. We have a function that creates locations and a function that creates markers, so it is time to write a function to plot a map. 

Write a function called `map_from_location` that takes a location list as its first argument and an integer `zoom_start` value as its second, and returns a `folium` map with the corresponding location and `zoom_start` attributes.

> Hint: The following is how to write a map with folium:
> ```python 
> folium.Map(location=location, zoom_start=zoom_amount)
> ```

In [27]:
def map_from_location(location, zoom_amount):
    loc_map = folium.Map(location=location, zoom_start=zoom_amount)
    return loc_map

In [28]:
times_square_map = map_from_location([40.7589, -73.9851], 15)

Now we shall add the `times_square_marker` created above to this map, using the format:
```
     <marker>.add_to(map)
```

In [29]:
times_square_marker.add_to(times_square_map)
times_square_map

Now that we have a marker and a map, let's write a function that adds many markers to a map. Let's first create a Manhattan map with zoom level 13 as our base map to work on. 

In [31]:
manhattan_map = map_from_location([40.7589, -73.9851], 13)
manhattan_map

We can now write a function `add_markers` that takes in a list of markers (like the `trip_markers` we created above) and a map object (`manhattan_map` in our case), and returns the map object with the markers from the passed list added to it. The function will simply iterate through the list and add each marker using the `add_to()` method we saw earlier.

In [32]:
def add_markers(markers, map_obj):
    for marker in markers:
        marker.add_to(map_obj)
    return map_obj

In [33]:
map_with_markers = add_markers(trip_markers, manhattan_map)

In [34]:
map_with_markers

So all these circles now show our pick-up points for the 1000 rides in the dataset.

### Nearest Neighbors

Ok, now we have all the visualization functions in place. Let's write a new function that, given a latitude and longitude, will predict the trip distance for us.  We'll do this by first finding the nearest trips to a given latitude and longitude. 

#### Calculate distance between locations

Here we once again apply the Euclidean distance formula that we saw in the previous lab. As a first step, write a function named `distance_between_locations` that calculates the distance between the pickup locations of two trips as:

>**Distance = SquareRoot( ( trip1[latitude] - trip2[latitude] )<sup>2</sup> + ( trip1[longitude] - trip2[longitude] )<sup>2</sup> )**

Round the distance values to 4 decimal places. 

In [35]:
import math

def distance_between_locations(selected_trip, neighbor_trip):
    
    # Calculate the distance between the selected and neighbor trips using the Pythagorean theorem
    distance_squared = (neighbor_trip['pickup_latitude'] - selected_trip['pickup_latitude'])**2 \
                     + (neighbor_trip['pickup_longitude'] - selected_trip['pickup_longitude'])**2
    
    distance = round(math.sqrt(distance_squared),4)
    
    return distance

Now we can pick the first two trips from the dataset and calculate the distance between them using `distance_between_locations`.

In [36]:
first_trip = parsed_trips.iloc[0]
second_trip = parsed_trips.iloc[1]
distance_first_and_second = distance_between_locations(first_trip, second_trip)
distance_first_and_second

# 0.2351

0.2351

Ok now we can calculate the distance between any two trips' pickup points. 
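
As a worked check of the formula, the sketch below recomputes the distance between the first two trips step by step, using only the standard library. The coordinates are copied from the output shown earlier; `first` and `second` are plain dicts standing in for the dataframe rows.

```python
import math

# Pickup coordinates of the first two trips, copied from the output above
first = {'pickup_latitude': 40.64499, 'pickup_longitude': -73.78115}
second = {'pickup_latitude': 40.766931, 'pickup_longitude': -73.982098}

lat_diff = second['pickup_latitude'] - first['pickup_latitude']
lon_diff = second['pickup_longitude'] - first['pickup_longitude']

# math.hypot(a, b) returns sqrt(a**2 + b**2)
distance = round(math.hypot(lat_diff, lon_diff), 4)

print(round(lat_diff, 6), round(lon_diff, 6))   # 0.121941 -0.200948
print(distance)                                 # 0.2351
```

Note that this distance is measured in degrees of latitude and longitude, not in miles or kilometers, so it is best read as a relative measure of closeness between pickup points.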

#### Calculate distance between neighbors

Next, write a function called `distance_between_neighbors` that takes in two trips (each a row of our dataframe, i.e. a pandas Series) and returns a copy of the `neighbor_trip` with a new entry called `distance_from_selected`, holding the distance of the `neighbor_trip` from the `selected_trip`. Use the `distance_between_locations` function created above to calculate the distance. 

**CAUTION:** When adding or removing data from a dataset, it is always advisable to use the `dataframe.copy()` method to create a new copy of the data. Otherwise, you may end up changing the original dataset and those changes may not be reversible. 

In [37]:
def distance_between_neighbors(selected_trip, neighbor_trip):
    
    # Copy neighbor_trip and add a 'distance_from_selected' entry using the function above
    neighbor_with_distance = neighbor_trip.copy()
    neighbor_with_distance['distance_from_selected'] = distance_between_locations(selected_trip, neighbor_trip)    
    
    return neighbor_with_distance

In [38]:
distance_between_neighbors(first_trip, second_trip)

# trip_distance              1.300000
# pickup_latitude           40.766931
# pickup_longitude         -73.982098
# distance_from_selected     0.235100
# Name: 1, dtype: float64

trip_distance              1.300000
pickup_latitude           40.766931
pickup_longitude         -73.982098
distance_from_selected     0.235100
Name: 1, dtype: float64

Now our `neighbor_trip` has another attribute called `distance_from_selected`, which indicates the distance of the `neighbor_trip`'s pickup location from that of the `selected_trip`.
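
The caution about `copy()` can be seen in a small experiment. This is a minimal sketch on a made-up Series; the names `original`, `safe`, `alias` and the label `note_added_via_alias` are hypothetical.

```python
import pandas as pd

original = pd.Series({'trip_distance': 1.3, 'pickup_latitude': 40.766931})

# Working on a copy leaves the original untouched
safe = original.copy()
safe['distance_from_selected'] = 0.2351

# A plain assignment only creates a second name for the same object
alias = original
alias['note_added_via_alias'] = 1.0

print('distance_from_selected' in original.index)   # False
print('note_added_via_alias' in original.index)     # True
```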

> **Take a step back and understand the data:**

>Our data now has a few attributes, two of which say "distance".  Let's make sure we understand the difference. 

> * **`distance_from_selected`:** This is our calculation of the distance of the neighbor's pickup location from the selected trip.

> * **`trip_distance`:** This is the attribute we were provided initially.  It tells us the length of the neighbor's taxi trip from pickup to dropoff.  

Next, write a function called `distance_all` that takes in a selected trip and a collection of neighboring trips as a dataframe, and returns a list of those neighbors, each with its respective `distance_from_selected` value.

In [39]:
def distance_all(selected, neighbors):
    distances = []
    
    # Skip any neighbor that is identical to the selected trip, then calculate the
    # distance of pick up points from the selected trip to each remaining neighbor
    for index, n in neighbors.iterrows():
        
        if (n == selected).all():
            continue

        distances.append(distance_between_neighbors(selected, n))
        
    return distances    

Let's try our function by passing in the first 5 trips from the `parsed_trips` dataframe and inspecting the output. 

In [40]:
distance_all(first_trip, parsed_trips[0:5])

# [trip_distance              1.300000
#  pickup_latitude           40.766931
#  pickup_longitude         -73.982098
#  distance_from_selected     0.235100
#  Name: 1, dtype: float64, 
#  trip_distance              4.500000
#  pickup_latitude           40.777730
#  pickup_longitude         -73.951902
#  distance_from_selected     0.216300
#  Name: 2, dtype: float64, 
#  trip_distance              2.400000
#  pickup_latitude           40.795678
#  pickup_longitude         -73.971049
#  distance_from_selected     0.242400
#  Name: 3, dtype: float64, 
#  trip_distance              0.840000
#  pickup_latitude           40.762912
#  pickup_longitude         -73.967782
#  distance_from_selected     0.220800
#  Name: 4, dtype: float64]

[trip_distance              1.300000
 pickup_latitude           40.766931
 pickup_longitude         -73.982098
 distance_from_selected     0.235100
 Name: 1, dtype: float64, trip_distance              4.500000
 pickup_latitude           40.777730
 pickup_longitude         -73.951902
 distance_from_selected     0.216300
 Name: 2, dtype: float64, trip_distance              2.400000
 pickup_latitude           40.795678
 pickup_longitude         -73.971049
 distance_from_selected     0.242400
 Name: 3, dtype: float64, trip_distance              0.840000
 pickup_latitude           40.762912
 pickup_longitude         -73.967782
 distance_from_selected     0.220800
 Name: 4, dtype: float64]
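
The loop in `distance_all` computes one distance at a time. As a possible alternative, numpy can apply the same formula to all neighbors at once. This is a sketch under the assumption that the coordinates are available as arrays; the values are copied from the first five trips shown earlier.

```python
import numpy as np

# Pickup coordinates of the first five trips, copied from the output above
latitudes = np.array([40.64499, 40.766931, 40.77773, 40.795678, 40.762912])
longitudes = np.array([-73.78115, -73.982098, -73.951902, -73.971049, -73.967782])

# Use the first trip as the selected trip and the rest as neighbors
selected_lat, selected_lon = latitudes[0], longitudes[0]

# np.hypot applies sqrt(a**2 + b**2) element by element
distances = np.hypot(latitudes[1:] - selected_lat, longitudes[1:] - selected_lon)

print([round(d, 4) for d in distances.tolist()])   # [0.2351, 0.2163, 0.2424, 0.2208]
```

The four values agree with the `distance_from_selected` entries returned by `distance_all` above.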

Now write the `nearest_neighbors` function to calculate the distance of the `selected_trip` from all of the `parsed_trips` in our dataset and return the closest ones.  If no number is provided, it should return the top 3 neighbors.

In [41]:
from operator import itemgetter

def nearest_neighbors(selected_trip, trips, number = 3):
    
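    neighbor_distances = distance_all(selected_trip, trips)
    # Sort by label rather than by position, so the sort does not depend on column order
    sorted_neighbors = sorted(neighbor_distances, key=itemgetter('distance_from_selected'))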
    
    return sorted_neighbors[:number]


In [42]:
new_trip = {'pickup_latitude': 40.64499, 
            'pickup_longitude': -73.78115, 
            'trip_distance': 18.38}
                         
new_trip_df = pd.DataFrame([new_trip])

nearest_neighbors(new_trip, parsed_trips , number = 3)


# [trip_distance             18.38000
# pickup_latitude           40.64499
# pickup_longitude         -73.78115
# distance_from_selected     0.00000
# Name: 0, dtype: float64, 
# trip_distance              7.780000
# pickup_latitude           40.644830
# pickup_longitude         -73.781578
# distance_from_selected     0.000500
# Name: 514, dtype: float64, 
# trip_distance             12.700000
# pickup_latitude           40.644657
# pickup_longitude         -73.782229
# distance_from_selected     0.001100
# Name: 33, dtype: float64]


[trip_distance             18.38000
 pickup_latitude           40.64499
 pickup_longitude         -73.78115
 distance_from_selected     0.00000
 Name: 0, dtype: float64, trip_distance              7.780000
 pickup_latitude           40.644830
 pickup_longitude         -73.781578
 distance_from_selected     0.000500
 Name: 514, dtype: float64, trip_distance             12.700000
 pickup_latitude           40.644657
 pickup_longitude         -73.782229
 distance_from_selected     0.001100
 Name: 33, dtype: float64]
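
The "sort, then take the first k" step can be tried in isolation. The sketch below uses plain dicts holding the three neighbors from the output above, deliberately listed out of order. It also shows `heapq.nsmallest` as an alternative, which is an option for illustration and not something the lab requires.

```python
from heapq import nsmallest
from operator import itemgetter

# Neighbor records copied from the output above, written as plain dicts
neighbors = [
    {'trip_distance': 12.70, 'distance_from_selected': 0.0011},
    {'trip_distance': 18.38, 'distance_from_selected': 0.0000},
    {'trip_distance': 7.78, 'distance_from_selected': 0.0005},
]

by_distance = itemgetter('distance_from_selected')

# Option 1: sort everything, then slice off the first k items
closest_sorted = sorted(neighbors, key=by_distance)[:2]

# Option 2: ask for the k smallest items directly
closest_heap = nsmallest(2, neighbors, key=by_distance)

print([n['trip_distance'] for n in closest_sorted])   # [18.38, 7.78]
print(closest_sorted == closest_heap)                 # True
```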

Ok great! Now that we can provide a new trip location and find the three nearest trips, we can calculate an estimate of the trip distance for that new trip location.  

We do so simply by calculating the median trip distance of its nearest neighbors.

In [43]:
import statistics

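def median_distance(neighbors):
    nearest_distances = []

    # Collect the trip_distance of every neighbor
    for neighbor in neighbors:
        nearest_distances.append(neighbor['trip_distance'])

    # Compute the median once, after all distances are collected
    median = round(statistics.median(nearest_distances), 3)

    return median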

In [44]:
nearest_three_neighbors = nearest_neighbors(new_trip, parsed_trips, number = 3)
distance_estimate_of_selected_trip = median_distance(nearest_three_neighbors) 
distance_estimate_of_selected_trip

# 12.7

12.7
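
To see where 12.7 comes from, here is the same calculation done by hand on the three `trip_distance` values of the nearest neighbors found above:

```python
import statistics

# trip_distance values of the three nearest neighbors found above
nearest_trip_distances = [18.38, 7.78, 12.7]

# The median is the middle value once the list is sorted
estimate = statistics.median(nearest_trip_distances)

print(sorted(nearest_trip_distances))   # [7.78, 12.7, 18.38]
print(estimate)                         # 12.7
```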

### Choosing the correct number of neighbors

Now, as we know from the last lesson, one tricky element is determining **how many neighbors to choose**, our **k** value, before calculating the median (you can also consider using the average instead of the median).  We want to choose our value of **k** so that it matches the actual data well and also applies to new data.  There are more sophisticated ways to **train** our algorithm so that our formula is optimized for all data, but we shall leave those for a later lesson.  Here, let's try different **k** values manually.  This is the gist of choosing our **k** value:

* If we choose a **k** value that is too low, our formula will be too heavily influenced by a single neighbor, whereas if our **k** value is too high, we will be choosing so many neighbors that our nearest neighbors formula will not adjust enough to the location.

Ok, let's experiment with this.

####  A new pickup point - a new location

First, let's choose a midtown location and see what the trip distance would be.  A Google search reveals the coordinates of 51st Street and 7th Avenue to be the following.

In [45]:
midtown_trip = dict(pickup_latitude=40.761710, pickup_longitude=-73.982760)
midtown_trip_df = pd.DataFrame([midtown_trip])
midtown_trip_df

   pickup_latitude  pickup_longitude
0         40.76171         -73.98276


Let's try to identify the seven closest neighboring trips using the functions we developed above.

In [46]:
seven_closest = nearest_neighbors(midtown_trip, parsed_trips, number = 7)
seven_closest

# [trip_distance              1.300000
# pickup_latitude           40.766931
# pickup_longitude         -73.982098
# distance_from_selected     0.235100

[trip_distance              0.580000
 pickup_latitude           40.761372
 pickup_longitude         -73.982602
 distance_from_selected     0.000400
 Name: 202, dtype: float64, trip_distance              0.800000
 pickup_latitude           40.762444
 pickup_longitude         -73.982440
 distance_from_selected     0.000800
 Name: 995, dtype: float64, trip_distance              1.400000
 pickup_latitude           40.762767
 pickup_longitude         -73.982293
 distance_from_selected     0.001200
 Name: 801, dtype: float64, trip_distance              8.300000
 pickup_latitude           40.762868
 pickup_longitude         -73.983233
 distance_from_selected     0.001300
 Name: 326, dtype: float64, trip_distance              1.260000
 pickup_latitude           40.760057
 pickup_longitude         -73.983502
 distance_from_selected     0.001800
 Name: 118, dtype: float64, trip_distance              1.720000
 pickup_latitude           40.762107
 pickup_longitude         -73.984790
 distance_from

Looking at `distance_from_selected`, it appears that our trips are still fairly close to our selected trip.  Notice that most of the data is within a distance of 0.002, so going to the top 7 nearest neighbors didn't seem to give us neighbors that are too far away, which is a good sign.

Still, it's hard to know what a distance in latitude and longitude really looks like, so let's map the data. 

In [47]:
# Set the location of midtown trip in the format [latitude, longitude]
midtown_location = location(midtown_trip) # [40.76171, -73.98276]

# Use midtown location to create a map object. Use zoom level 16.
midtown_map = map_from_location(midtown_location, 16)

# convert the "seven_closest" list to a dataframe and get markers for all 7 trips
closest_markers = markers_from_trips(pd.DataFrame(seven_closest))

# Add the closest markers to the map 
add_markers(closest_markers, midtown_map)

Ok.  These locations stay fairly close to our selected location of 51st Street and 7th Avenue.  So they could give a good estimate of a trip distance. Let's use the median distance function we created earlier to find the median distance for `seven_closest`.

In [48]:
median_distance(seven_closest) 

# 1.26

1.26

So there we have it. Recall the original task: **use the trip location to predict the length of a trip.** Our analysis addresses this by predicting the distance of a new trip from its pickup location. 

#### How accurate is this?

In order to evaluate the predictive ability of the model, we would need a slightly different data modeling and analysis approach. We shall cover more sophisticated versions of this approach, e.g. the k-nearest neighbors algorithm, which predicts a value for a data element from the labeled examples most similar to it. We shall also cover model evaluation techniques to inspect the outcome and associate an accuracy with the model. 
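
As a rough illustration of what such an evaluation could look like, the sketch below hides each trip in turn, predicts its length from the remaining trips, and averages the absolute errors. This is only one simple approach on six trips copied from earlier outputs, not necessarily the method later lessons use; `predict` and the choice of `k=2` are assumptions for illustration.

```python
import math
import statistics

# Toy data: (pickup_latitude, pickup_longitude, trip_distance),
# values taken from trips shown earlier in this lab
trips = [
    (40.644990, -73.781150, 18.38),
    (40.644830, -73.781578, 7.78),
    (40.644657, -73.782229, 12.70),
    (40.761372, -73.982602, 0.58),
    (40.762444, -73.982440, 0.80),
    (40.762767, -73.982293, 1.40),
]

def predict(lat, lon, known_trips, k=2):
    # Median trip_distance of the k known trips closest to (lat, lon)
    nearest = sorted(known_trips,
                     key=lambda t: math.hypot(t[0] - lat, t[1] - lon))[:k]
    return statistics.median(t[2] for t in nearest)

# Leave-one-out: hide each trip in turn and predict it from the others
errors = []
for i, (lat, lon, actual) in enumerate(trips):
    others = trips[:i] + trips[i + 1:]
    errors.append(abs(predict(lat, lon, others) - actual))

mean_absolute_error = round(statistics.mean(errors), 3)
print(mean_absolute_error)
```

A lower mean absolute error would suggest predictions that sit closer to the actual trip lengths, though six trips are far too few to draw conclusions.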

#### Bonus

* Try changing the number of neighbors, i.e. **k**, to predict the trip length above. You can also add new locations and see how the system responds.
* Create an `average_distance()` function and compare its output with that of `median_distance()`. Which of these is more suitable for our predictions?
* Explore marker coloring options and change the contrast based on distance from selected location.
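
As a starting point for the first two bonus items, the sketch below compares the median and the average for growing values of **k**. It uses the `trip_distance` values of the six midtown neighbors visible in the `seven_closest` output, ordered from closest to farthest. The `average_distance` shown here is just one possible implementation, and it takes a plain list of numbers rather than a list of trips.

```python
import statistics

# trip_distance of the six nearest midtown neighbors visible in the
# seven_closest output, ordered from closest to farthest
trip_distances = [0.58, 0.80, 1.40, 8.30, 1.26, 1.72]

def average_distance(distances):
    # Hypothetical counterpart to median_distance, for comparison only
    return round(statistics.mean(distances), 3)

for k in range(1, len(trip_distances) + 1):
    nearest = trip_distances[:k]
    median = round(statistics.median(nearest), 3)
    print(k, median, average_distance(nearest))
```

Once the single 8.30 trip enters at k = 4, the average jumps to 2.77 while the median only moves to 1.1, which hints at how differently the two measures react to an unusual neighbor.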

### Summary

In this lab, we used the nearest neighbors approach to predict the length of a taxi ride.  To do so, we selected a location, then found a number of taxi rides closest to that location, and finally took the median trip length of the nearest taxi rides as an estimate of the new ride's trip length.  You can see that even with just a little bit of math and programming we can begin to make meaningful predictions with data.