# Nearest Neighbors Lab

### Introduction

In this lab, you will apply the nearest neighbors technique to different applications and in various dimensions.  Imagine that we are hired to consult for LiftOff, a limo and taxi service that is just opening up.  It wants to do some initial research on NYC trips.

Lucky for us, information about NYC taxi trips is available on [the city's open data website](https://data.cityofnewyork.us/Transportation/2014-Yellow-Taxi-Trip-Data/gn7m-em8n).  LiftOff is interested in which locations in NYC it should target, and how to increase the length of a trip, as it makes more money that way.  Let's get started!

### Exploring and Gathering the Data

If you go to [NYC Open Data](https://opendata.cityofnewyork.us/), you can find the NYC taxi data after a quick search, or go straight to [the dataset's page](https://data.cityofnewyork.us/Transportation/2014-Yellow-Taxi-Trip-Data/gn7m-em8n).  If you click on the "API" button, you'll find the data that we'll be working with.  For your reading pleasure, the data has already been moved to the "trips.json" file in this lab.

```python
[
  {
    "dropoff_datetime": "2014-11-26T22:31:00.000",
    "dropoff_latitude": "40.746769999999998",
    "dropoff_longitude": "-73.997450000000001",
    "fare_amount": "52",
    "imp_surcharge": "0",
    "mta_tax": "0.5",
    "passenger_count": "1",
    "payment_type": "CSH",
    "pickup_datetime": "2014-11-26T21:59:00.000",
    "pickup_latitude": "40.64499",
    "pickup_longitude": "-73.781149999999997",
    "rate_code": "2",
    "tip_amount": "0",
    "tolls_amount": "5.3300000000000001",
    "total_amount": "57.829999999999998",
    "trip_distance": "18.379999999999999",
    "vendor_id": "VTS"
  },
...
...
]
```

### Document Retrieval 

Now, we like the amount of data, but we don't need all of the attributes provided.  We decide that all we need for this exploration is `trip_distance`, `pickup_latitude`, `pickup_longitude`, and `passenger_count`.

The first step is to load the data from a JSON file, which can be a little tricky in Python.  We've written a function to do it for you, using the pandas library.

In [2]:
import pandas

def parse_file(fileName):
    trips_df = pandas.read_json(fileName)
    return trips_df.to_dict('records')

trips = parse_file('trips.json')

In [5]:
trips[0]

{'dropoff_datetime': '2014-11-26T22:31:00.000',
 'dropoff_latitude': 40.74677,
 'dropoff_longitude': -73.99745,
 'fare_amount': 52.0,
 'imp_surcharge': 0.0,
 'mta_tax': 0.5,
 'passenger_count': 1,
 'payment_type': 'CSH',
 'pickup_datetime': '2014-11-26T21:59:00.000',
 'pickup_latitude': 40.64499,
 'pickup_longitude': -73.78115,
 'rate_code': 2,
 'store_and_fwd_flag': nan,
 'tip_amount': 0.0,
 'tolls_amount': 5.33,
 'total_amount': 57.83,
 'trip_distance': 18.38,
 'vendor_id': 'VTS'}

Ok, so as you can see from above, the `trips` variable holds a list of dictionaries, with each dictionary representing a trip.
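For comparison, here is a minimal sketch of loading similar records with the standard library's `json` module instead of pandas. The inline sample, the `NUMERIC_KEYS` set, and the `parse_json_text` name are assumptions made for illustration; the raw records store numbers as strings, so the sketch converts a few known fields to floats by hand.

```python
import json

# A tiny inline sample shaped like the raw records shown earlier;
# in the lab you would read the text from 'trips.json' instead.
raw = '''
[
  {"pickup_latitude": "40.64499", "pickup_longitude": "-73.78115",
   "trip_distance": "18.38", "passenger_count": "1", "payment_type": "CSH"}
]
'''

# Hypothetical choice of which string fields to convert to floats
NUMERIC_KEYS = {'pickup_latitude', 'pickup_longitude', 'trip_distance'}

def parse_json_text(text):
    """Parse JSON text into a list of dicts, converting known numeric strings to floats."""
    records = json.loads(text)
    return [
        {key: float(value) if key in NUMERIC_KEYS else value
         for key, value in record.items()}
        for record in records
    ]

sample_trips = parse_json_text(raw)
print(sample_trips[0]['trip_distance'])  # 18.38
```

Fields outside `NUMERIC_KEYS`, such as `passenger_count` here, stay as strings, which is one reason the pandas route is more convenient for this lab.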

Write a function called `parse_trips(trips)` that returns a list of the trips with just the following attributes: `trip_distance`, `pickup_latitude`, `pickup_longitude`, `passenger_count`.  

Run the `nearest-neighbor-lab-tests.py` file to ensure that you wrote it correctly.

In [11]:
def parse_trips(trips):
    # Keep only the attributes needed for this exploration
    keys = ['trip_distance', 'pickup_latitude', 'pickup_longitude', 'passenger_count']
    return [{key: trip[key] for key in keys} for trip in trips]
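
The same idea can be generalized so the list of attributes is passed in rather than hard-coded. The sketch below is one way to do that; `pick_attributes` and the one-trip sample are hypothetical names and data, not part of the lab's tests.

```python
def pick_attributes(trips, keys):
    """Return a new list of dicts containing only the requested keys."""
    return [{key: trip[key] for key in keys} for trip in trips]

# Made-up sample with one extra attribute that should be dropped
sample = [{'trip_distance': 18.38, 'pickup_latitude': 40.64499,
           'pickup_longitude': -73.78115, 'passenger_count': 1,
           'payment_type': 'CSH'}]
wanted = ['trip_distance', 'pickup_latitude', 'pickup_longitude', 'passenger_count']

slimmed = pick_attributes(sample, wanted)
print(sorted(slimmed[0].keys()))
# ['passenger_count', 'pickup_latitude', 'pickup_longitude', 'trip_distance']
```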
    

### Calculating the sides of the triangle

In [15]:
parsed_trips = parse_trips(trips)

parsed_trips[0]
# {'passenger_count': 1,
#  'pickup_latitude': 40.64499,
#  'pickup_longitude': -73.78115,
#  'trip_distance': 18.38}
len(parsed_trips)
# 1000

# See a unique list of attributes in the parsed_trips array  
set([key for trip in parsed_trips for key in list(trip.keys())])
# {'passenger_count', 'pickup_latitude', 'pickup_longitude', 'trip_distance'}

{'passenger_count', 'pickup_latitude', 'pickup_longitude', 'trip_distance'}

### Exploring the Data

In [18]:
# could add in to scope data down to manhattan, or brooklyn, etc.

Now that we have pared down our data, let's answer some initial questions.  Let's plot the pickup locations on a map.

In [50]:
import gmplot
gmap = gmplot.GoogleMapPlotter(40.758896, -73.985130, 12)
gmap.draw("mymap.html")


from IPython.display import IFrame

IFrame('mymap.html', width=1200, height=400)

Now, plotting the data feeds into the following function.
```python
gmap.plot(latitudes, longitudes, 'cornflowerblue', edge_width=10)
```
So we'll need a list of latitudes, each element representing the pickup latitude of a trip, and a list of longitudes, each representing the pickup longitude associated with a trip.  Write a function called `trip_latitudes` that, given a list of trips, returns a list of latitudes, and a function called `trip_longitudes` that, given a list of trips, returns a list of longitudes.  Run the file `nearest-neighbor-lab-tests.py` to get feedback.  

In [74]:
def trip_latitudes(trips):
    return list(map(lambda trip: trip['pickup_latitude'], trips))

In [75]:
def trip_longitudes(trips):
    return list(map(lambda trip: trip['pickup_longitude'], trips))

In [43]:
latitudes = list(map(lambda trip: trip['pickup_latitude'], parsed_trips))
longitudes = list(map(lambda trip: trip['pickup_longitude'], parsed_trips))

In [51]:
gmap.scatter(latitudes, longitudes, 'cornflowerblue', edge_width=10)
gmap.draw("myplot.html")

Now open the `myplot.html` file to see if there's anything interesting.  Looking around, it seems like the east side of the map has more trips than the west side.

### Using Nearest Neighbors

Ok, let's write a function that, given a latitude and longitude, will predict the trip distance for us.  We'll do this by first finding the trips whose pickups are nearest to that latitude and longitude. 

First, write a distance function that calculates the distance between two individuals (here, two pickup locations).

In [52]:
import math

def distance(selected_individual, neighbor):
    # Straight-line (Euclidean) distance, measured in degrees of latitude/longitude
    lat_diff = neighbor['pickup_latitude'] - selected_individual['pickup_latitude']
    lng_diff = neighbor['pickup_longitude'] - selected_individual['pickup_longitude']
    return math.sqrt(lat_diff**2 + lng_diff**2)

In [53]:
def distance_between_neighbors(selected_individual, neighbor):
    neighbor_with_distance = neighbor.copy()
    neighbor_with_distance['distance'] = distance(selected_individual, neighbor)
    return neighbor_with_distance

def distance_all(selected_individual, neighbors):
    remaining_neighbors = filter(lambda neighbor: neighbor != selected_individual, neighbors)
    return list(map(lambda neighbor: distance_between_neighbors(selected_individual, neighbor), remaining_neighbors))

Write the nearest neighbors function.  If no number is provided, it should return the 3 nearest neighbors.

In [64]:
def nearest_neighbors(selected_individual, neighbors, number = 3):
    # Attach a distance to every other trip, sort by it, and keep the closest ones
    neighbor_distances = distance_all(selected_individual, neighbors)
    sorted_neighbors = sorted(neighbor_distances, key=lambda neighbor: neighbor['distance'])
    return sorted_neighbors[:number]
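
Sorting every trip is fine for 1,000 records. As an alternative sketch, the standard library's `heapq.nsmallest` can pick the closest few without fully sorting the list. The function name `nearest_neighbors_heap` and the toy points are assumptions for illustration, and unlike the version above this one returns the original dictionaries without a `distance` key.

```python
import heapq
import math

def nearest_neighbors_heap(selected, neighbors, number=3):
    """Return the `number` neighbors closest to `selected`, nearest first."""
    def dist(neighbor):
        return math.hypot(neighbor['pickup_latitude'] - selected['pickup_latitude'],
                          neighbor['pickup_longitude'] - selected['pickup_longitude'])
    # Skip the selected point itself, as distance_all does
    candidates = (n for n in neighbors if n != selected)
    return heapq.nsmallest(number, candidates, key=dist)

# Made-up points on a line, so the ordering is easy to check by eye
origin = {'pickup_latitude': 0.0, 'pickup_longitude': 0.0}
points = [origin] + [{'pickup_latitude': 0.0, 'pickup_longitude': float(x)} for x in (3, 1, 2, 5)]

closest_two = nearest_neighbors_heap(origin, points, number=2)
print([p['pickup_longitude'] for p in closest_two])  # [1.0, 2.0]
```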

In [59]:
first_trip = parsed_trips[0]
first_trip

{'passenger_count': 1,
 'pickup_latitude': 40.64499,
 'pickup_longitude': -73.78115,
 'trip_distance': 18.38}

In [66]:
nearest_neighbors(first_trip, parsed_trips, number = 3)

# [{'distance': 0.0004569288784918792,
#   'passenger_count': 3,
#   'pickup_latitude': 40.64483,
#   'pickup_longitude': -73.781578,
#   'trip_distance': 7.78},
#  {'distance': 0.0011292165425673159,
#   'passenger_count': 1,
#   'pickup_latitude': 40.644657,
#   'pickup_longitude': -73.782229,
#   'trip_distance': 12.7},
#  {'distance': 0.0042359798158141185,
#   'passenger_count': 2,
#   'pickup_latitude': 40.648509,
#   'pickup_longitude': -73.783508,
#   'trip_distance': 17.3}]

[{'distance': 0.0004569288784918792,
  'passenger_count': 3,
  'pickup_latitude': 40.64483,
  'pickup_longitude': -73.781578,
  'trip_distance': 7.78},
 {'distance': 0.0011292165425673159,
  'passenger_count': 1,
  'pickup_latitude': 40.644657,
  'pickup_longitude': -73.782229,
  'trip_distance': 12.7},
 {'distance': 0.0042359798158141185,
  'passenger_count': 2,
  'pickup_latitude': 40.648509,
  'pickup_longitude': -73.783508,
  'trip_distance': 17.3}]

Looking at the trip distances of those neighbors, there is quite a spread, so perhaps we can increase the number of nearest neighbors.  But we want to make sure that doing so doesn't spread our trip distances too much.

In [78]:
seven_closest = nearest_neighbors(first_trip, parsed_trips, number = 7)

Notice that most of the neighbors are within a distance of about .0045, so going to the top 7 nearest neighbors didn't seem to add too much noise to our data.  It's hard to know what a distance in latitude and longitude means, so let's try mapping the data.  
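
To get a rough feel for what a distance in degrees means on the ground, the sketch below converts a pair of coordinates to approximate meters. It assumes about 111,320 meters per degree of latitude and scales longitude by the cosine of the latitude (an equirectangular approximation, reasonable only over short distances). The `approx_meters` name and the constant are assumptions, not part of the lab.

```python
import math

METERS_PER_DEGREE_LAT = 111_320  # approximate; varies slightly with latitude

def approx_meters(first, second):
    """Approximate ground distance in meters between two pickup locations."""
    lat_mid = math.radians((first['pickup_latitude'] + second['pickup_latitude']) / 2)
    north_south = (second['pickup_latitude'] - first['pickup_latitude']) * METERS_PER_DEGREE_LAT
    # A degree of longitude shrinks with latitude, so scale it by cos(latitude)
    east_west = ((second['pickup_longitude'] - first['pickup_longitude'])
                 * METERS_PER_DEGREE_LAT * math.cos(lat_mid))
    return math.hypot(north_south, east_west)

# The first trip and its third-nearest neighbor from the output above
airport = {'pickup_latitude': 40.64499, 'pickup_longitude': -73.78115}
neighbor = {'pickup_latitude': 40.648509, 'pickup_longitude': -73.783508}
print(round(approx_meters(airport, neighbor)))
```

So a distance of about .004 in degrees is on the order of a few hundred meters here. Note that the plain Euclidean distance used in this lab treats a degree of longitude as equal to a degree of latitude, which slightly overstates east-west gaps at New York's latitude.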

In [88]:
len(seven_closest)

7

In [89]:
seven_lats = trip_latitudes(seven_closest[:3])
seven_lats

[40.64483, 40.644657, 40.648509]

In [90]:
seven_longs = trip_longitudes(seven_closest[:3])
seven_longs

[-73.781578, -73.782229, -73.783508]

In [92]:
gmap = gmplot.GoogleMapPlotter(first_trip['pickup_latitude'], first_trip['pickup_longitude'], 15)

gmap.scatter(seven_lats, seven_longs, 'cornflowerblue', edge_width=10)
gmap.draw("nearestneighbors.html")

Open the map.  What do you see?  Well, it looks like we can't really assess a good number of nearest neighbors with this data.  Our location is the airport, which is probably not a good place to extrapolate from.

Let's choose another spot that we expect to be more typical.  51st Street and 7th Avenue is at 40.761710, -73.982760.  Now let's run the same test again to see how many nearest neighbors we should choose.

In [101]:
midtown_loc = {'pickup_latitude': 40.761710, 'pickup_longitude': -73.982760}
closest = nearest_neighbors(midtown_loc, parsed_trips, number = 7)
list(map(lambda trip: trip['distance'], closest))

[0.00037310588309379025,
 0.00080072217404248,
 0.0011555682584735844,
 0.0012508768924205918,
 0.0018118976240381972,
 0.002067074502774709,
 0.0020684557041472677]

The distance jumps by roughly 45% as we go from the fourth to the fifth neighbor.  How far is this really?

In [102]:
gmap = gmplot.GoogleMapPlotter(midtown_loc['pickup_latitude'], midtown_loc['pickup_longitude'], 15)
closest_lats = trip_latitudes(closest)
closest_longs = trip_longitudes(closest)

gmap.scatter(closest_lats, closest_longs, 'cornflowerblue', edge_width=10)
gmap.draw("nearestmidtown.html")

So essentially this is one or two blocks away from our location of 51st and 7th.  Not too bad.

Now write a function that, given a list of trips and an attribute, calculates the median of that attribute. 

In [126]:
import statistics
def median_of(neighbors, attribute):
    return statistics.median(list(map(lambda x: x[attribute], neighbors)))
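
One reason to prefer the median here: a single long trip, such as an airport run, can pull the mean far from what is typical. The values below are made up to illustrate the effect.

```python
import statistics

# Hypothetical trip distances; the last one mimics an airport run
trip_distances = [1.1, 1.3, 1.2, 1.4, 18.38]

print(statistics.mean(trip_distances))    # pulled well above the typical trip
print(statistics.median(trip_distances))  # 1.3
```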

In [127]:
median_of(closest, 'distance')

0.0012508768924205918

In [128]:
median_of(closest, 'trip_distance')

1.26

In [129]:
median_of(parsed_trips, 'trip_distance')

1.66

In [135]:
median_of(nearest_neighbors(midtown_loc, parsed_trips, number = 15), 'trip_distance')

1.9

Interesting: there does appear to be a deviation from the overall median here, but the result shifts with the number of neighbors (1.26 with 7, 1.9 with 15), so we need to increase the number of nearest neighbors.  Past 15, it appears to settle on a consistent number.

Let's try a couple of other locations to see how we do.

In [136]:
uws_loc = {'pickup_latitude': 40.786430, 'pickup_longitude': -73.975979}

In [142]:
median_of(nearest_neighbors(uws_loc, parsed_trips, number = 20), 'trip_distance')

1.6600000000000001

In [143]:
downtown_loc = {'pickup_latitude': 40.713186, 'pickup_longitude': -74.007243}

In [146]:
median_of(nearest_neighbors(downtown_loc, parsed_trips, number = 20), 'trip_distance')

2.2249999999999996
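
Putting the pieces together, a prediction function along the lines described at the start of this section might look like the sketch below. The name `predict_trip_distance`, its default of 20 neighbors, and the toy trips are assumptions for illustration; it repeats the distance logic so that it runs on its own.

```python
import math
import statistics

def predict_trip_distance(location, trips, number=20):
    """Median trip_distance of the `number` trips whose pickups are closest to `location`."""
    def dist(trip):
        return math.hypot(trip['pickup_latitude'] - location['pickup_latitude'],
                          trip['pickup_longitude'] - location['pickup_longitude'])
    closest = sorted(trips, key=dist)[:number]
    return statistics.median(trip['trip_distance'] for trip in closest)

# Made-up trips: three near the query point and one far away with a long ride
toy_trips = [
    {'pickup_latitude': 0.0, 'pickup_longitude': 0.001, 'trip_distance': 1.0},
    {'pickup_latitude': 0.0, 'pickup_longitude': 0.002, 'trip_distance': 2.0},
    {'pickup_latitude': 0.0, 'pickup_longitude': 0.003, 'trip_distance': 3.0},
    {'pickup_latitude': 0.0, 'pickup_longitude': 1.0, 'trip_distance': 50.0},
]
query = {'pickup_latitude': 0.0, 'pickup_longitude': 0.0}

print(predict_trip_distance(query, toy_trips, number=3))  # 2.0
```

With `number=3` the far-away trip is ignored, so the prediction reflects only the nearby pickups.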

### Including other features

Now let's begin writing the functions involved in calculating the hypotenuse of our right triangle.  Using the Pythagorean Theorem, write a function called `distance_between_students_squared` that calculates the length of the hypotenuse squared.

In [30]:
def distance_between_students_squared(first_student, second_student):
    # Use ** for exponentiation; ^ is bitwise XOR in Python
    return street_distance(first_student, second_student)**2 + avenue_distance(first_student, second_student)**2
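
A common pitfall when squaring in Python is that `^` is bitwise XOR, not exponentiation, so the Pythagorean Theorem needs `**`. The short sketch below shows the difference; `squared_distance` and its arguments are hypothetical stand-ins for the street and avenue gaps.

```python
print(3 ** 2)  # 9, exponentiation
print(3 ^ 2)   # 1, bitwise XOR of 0b11 and 0b10

def squared_distance(street_gap, avenue_gap):
    """Hypotenuse squared for a right triangle with the given side lengths."""
    return street_gap ** 2 + avenue_gap ** 2

print(squared_distance(3, 4))  # 25
```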

In [31]:
distance_between_students_squared(fred, daniel) # 4

4

Now take the next step, and write a function called `distance`, that given two students returns the distance between them.  

In [33]:
import math
def distance(first_student, second_student):
    return math.sqrt(distance_between_students_squared(first_student, second_student))

In [32]:
distance(fred, daniel)

2.0

### Writing Our "Nearest Neighbors" Functions

This next section will work up to building a `nearest_neighbor` function.  This is a function that given one student, will tell us which students are closest to him.  How do we write something like this? Can we use our calculation of distance between two students, to figure out the closest students to an individual?

Sure, we first need to calculate the distances between one student and all of the others.  Next, we sort those students by their distance from the student.  Finally, we select a given number of the closest students.  Let's work through it.   
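
The three steps above (compute distances, sort, take the closest few) can be sketched in one place as follows. The students and their coordinates are made up for illustration, and `closest_students` is a hypothetical name; the exercises below build the same behavior from smaller functions.

```python
import math

# Hypothetical students with made-up street and avenue numbers
students = [
    {'name': 'ana', 'street': 1, 'avenue': 1},
    {'name': 'ben', 'street': 1, 'avenue': 3},
    {'name': 'cal', 'street': 4, 'avenue': 5},
    {'name': 'dee', 'street': 2, 'avenue': 1},
]

def closest_students(selected, others, number):
    with_distance = []
    for other in others:
        if other == selected:
            continue  # do not compare a student with themselves
        entry = other.copy()
        # Step 1: distance from the selected student (Pythagorean Theorem)
        entry['distance'] = math.hypot(other['street'] - selected['street'],
                                       other['avenue'] - selected['avenue'])
        with_distance.append(entry)
    # Step 2: sort by distance, closest first
    with_distance.sort(key=lambda student: student['distance'])
    # Step 3: keep only the requested number of students
    return with_distance[:number]

names = [s['name'] for s in closest_students(students[0], students, 2)]
print(names)  # ['dee', 'ben']
```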

Note that we already have a function that calculates the distance between two students.  We may think we could simply use this function to loop through our students, but that would just return an array of distances.  

In [26]:
distances = []
for student in students:
    distance_between = distance(fred, student)
    distances.append(distance_between)

distances

[0.0, 2.0, 2.0, 1.4142135623730951]

The returned array from the above procedure isn't super helpful.  We need to know who each distance is associated with.  

So let's accomplish this by writing a function called `distance_with_student` that works like our distance function but, instead of returning a number, returns a dictionary representing the `second_student`, with an added key-value pair indicating the distance from the `first_student`.

In [14]:
def distance_with_student(first_student, second_student):
    student_with_distance = second_student.copy()
    distance = math.sqrt(distance_between_students_squared(first_student, second_student))
    student_with_distance['distance'] = distance
    return student_with_distance


In [43]:
distance_with_student(fred, daniel)
# {'avenue': 5, 'distance': 2.0, 'name': 'daniel', 'street': 1}

{'avenue': 5, 'distance': 2.0, 'name': 'daniel', 'street': 1}

Now write a function called `distance_all` that returns a list representing the distances between a `first_student` and the rest of the students.  The list should not include the `first_student` in its collection of students. 

In [19]:
def distance_all(first_student, students):
    remaining_students = filter(lambda student: student != first_student, students)
    return list(map(lambda student: distance_with_student(first_student, student), remaining_students))

In [44]:
distance_all(fred, students)
# [{'avenue': 5, 'distance': 2.0, 'name': 'daniel', 'street': 1},
#  {'avenue': 6, 'distance': 2.0, 'name': 'rachel', 'street': 2},
#  {'avenue': 10, 'distance': 1.4142135623730951, 'name': 'steven', 'street': 4}]

[{'avenue': 5, 'distance': 2.0, 'name': 'daniel', 'street': 1},
 {'avenue': 6, 'distance': 2.0, 'name': 'rachel', 'street': 2},
 {'avenue': 10, 'distance': 1.4142135623730951, 'name': 'steven', 'street': 4}]

Finally, write a function called `nearest_neighbors` that, given a student, returns a list of students ordered from closest to furthest from that student.  The function should take an optional third argument that specifies how many "nearest" students are returned.

In [21]:
def nearest_neighbors(first_student, students, number = None):
    number = number or len(students) - 1
    student_distances = distance_all(first_student, students)
    sorted_students = sorted(student_distances, key=lambda student: student['distance'])
    return sorted_students[:number]

In [22]:
nearest_neighbors(fred, students, 2)
# [{'avenue': 10, 'distance': 1.4142135623730951, 'name': 'steven', 'street': 4},
#  {'avenue': 5, 'distance': 2.0, 'name': 'daniel', 'street': 1}]

[{'avenue': 10, 'distance': 1.4142135623730951, 'name': 'steven', 'street': 4},
 {'avenue': 5, 'distance': 2.0, 'name': 'daniel', 'street': 1}]