# Location Data Analysis Using Python

Step 1. Install Anaconda. It bundles most of the packages you need to do data science. Here is a link: http://bit.ly/1NlvHdW

Step 2. Install IPython. You don't have to do this; you could use IDLE, PyCharm, or PyDev instead, but IPython is hands down the best way to write Python code. It's also pretty easy to install (assuming you don't go on mass file-deleting sprees like I do). Here is the link: http://bit.ly/1SRw7MP

Now we can start writing code. Import pandas and create a DataFrame using the 'read_csv' function.

In [1]:
import pandas as pd

cities = pd.read_csv(r'N:\bandera\ejercicio\PuertoRico.csv')  # data - cities of Puerto Rico (raw string keeps the backslashes literal)
cities.head()       # shows all columns and the first 5 rows

Unnamed: 0,zip_code,latitude,longitude,city,state,county
0,601,18.165273,-66.722583,Adjuntas,PR,Adjuntas
1,602,18.393103,-67.180953,Aguada,PR,Aguada
2,605,18.465162,-67.141486,Aguadilla,PR,Aguadilla
3,606,18.172947,-66.944111,Maricao,PR,Maricao
4,610,18.288685,-67.139696,Anasco,PR,Anasco




We can use the head() method as shown above to get a feel for the dataset. We can also use the count() method, which reports the number of non-null values in each column.

In [2]:
cities.count()

zip_code     89
latitude     89
longitude    89
city         89
state        89
county       89
dtype: int64
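
Because count() skips missing values, it is not always the same as the number of rows. Here is a minimal sketch of the difference, using a hypothetical three-row `sample` frame (not the real CSV) with one county left blank on purpose:

```python
import pandas as pd

# Hypothetical sample in the same shape as the cities data,
# with one missing county to show how count() treats gaps.
sample = pd.DataFrame({
    "zip_code": [601, 602, 605],
    "city": ["Adjuntas", "Aguada", "Aguadilla"],
    "county": ["Adjuntas", None, "Aguadilla"],
})

non_null = sample.count()     # non-null values per column
n_rows = len(sample)          # total number of rows, missing or not
n_rows_cols = sample.shape    # (rows, columns)

print(non_null)
print(n_rows, n_rows_cols)
```

In the real dataset every column reports 89, so count() and len() agree there.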

Below you can see how to select a whole column and then a single row within it. You will end up doing this all the time.

In [3]:
cities['city']          # to access the whole 'city' column
cities['city'].iloc[0]  # to access just the first row of the city column - Adjuntas

'Adjuntas'
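
Row lookups in pandas can go by position or by label, and the two only coincide when the index happens to be 0, 1, 2, and so on. A small sketch of the difference, assuming a hypothetical `sample` frame whose index deliberately starts at 10:

```python
import pandas as pd

# Hypothetical frame; the index labels are chosen so that
# "first row" and "row labelled 0" are clearly different things.
sample = pd.DataFrame(
    {"city": ["Adjuntas", "Aguada", "Aguadilla"],
     "zip_code": [601, 602, 605]},
    index=[10, 11, 12],
)

first_by_position = sample["city"].iloc[0]   # first row, whatever its label is
by_label = sample.loc[11, "city"]            # the row labelled 11
one_cell = sample.at[12, "zip_code"]         # single-value lookup by label

print(first_by_position, by_label, one_cell)
```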

Below is something I like to do if I'm planning on running more computer-science-style algorithms on the data (perhaps a greedy algorithm, or something else that the dataset lends itself to): make an object for each row (only if appropriate!).

So I create a standard Python class and pass the row number to the constructor, because that's how I'm going to build a list of these objects. I'm using 'getters' instead of accessing the data members directly, mostly out of habit; plain attribute access is the more common Python style.

In [4]:
class City():
    def __init__(self, rowNum):
        self.name = cities['city'].iloc[rowNum]             # the most important attribute!
        self.zipCode = cities['zip_code'].iloc[rowNum]
        self.latitude = cities['latitude'].iloc[rowNum]
        self.longitude = cities['longitude'].iloc[rowNum]
        self.county = cities['county'].iloc[rowNum]
        
    def getName(self):
        return self.name
    
    def getZipCode(self):
        return self.zipCode
    
    def getLat(self):
        return self.latitude
    
    def getLong(self):
        return self.longitude
    
    def getCounty(self):
        return self.county   
     
    # we should always have a string representation of the object    
    def show(self):                                     
        string = "City = " + self.getName() + "\n" + "Latitude = " + str(self.getLat()) + "\n" + "Longitude = " + str(self.getLong()) + "\n"
        print(string)

The show() method prints a nice string representation of the city object. Below I'm going to create a list and fill it with the whole dataset, so it will be easier to run an algorithm over it.

In [5]:
# now I'm going to make a list of cities. The point of all this is to make running 
# algorithms on the dataset easier on myself.

places = []                       # already used the name 'cities'...

for i in range(len(cities)):      # one City object per row of the DataFrame
    places.append(City(i))


for j in range(5):           # the data in its new list-of-objects format!
    places[j].show()

City = Adjuntas
Latitude = 18.165273
Longitude = -66.722583

City = Aguada
Latitude = 18.393103
Longitude = -67.180953

City = Aguadilla
Latitude = 18.465162
Longitude = -67.141486

City = Maricao
Latitude = 18.172947
Longitude = -66.944111

City = Anasco
Latitude = 18.288685
Longitude = -67.139696
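
For comparison, here is one possible variation on the class and loop above, sketched with a dataclass and `DataFrame.itertuples()`. The names `CitySketch`, `sample`, and `sketch_places` are hypothetical, and the two rows are copied from the head() output so the block runs without the CSV. Defining `__str__` lets `print()` produce the same text that show() does.

```python
from dataclasses import dataclass

import pandas as pd


@dataclass
class CitySketch:              # hypothetical name, not the City class above
    name: str
    zip_code: int
    latitude: float
    longitude: float
    county: str

    def __str__(self):
        # same layout as the show() output
        return (f"City = {self.name}\n"
                f"Latitude = {self.latitude}\n"
                f"Longitude = {self.longitude}\n")


# Two rows taken from the head() output earlier
sample = pd.DataFrame({
    "zip_code": [601, 602],
    "latitude": [18.165273, 18.393103],
    "longitude": [-66.722583, -67.180953],
    "city": ["Adjuntas", "Aguada"],
    "state": ["PR", "PR"],
    "county": ["Adjuntas", "Aguada"],
})

# itertuples() yields one named tuple per row, so no row numbers are needed
sketch_places = [
    CitySketch(row.city, row.zip_code, row.latitude, row.longitude, row.county)
    for row in sample.itertuples(index=False)
]

print(sketch_places[0])
```

Whether this is nicer than explicit getters is a matter of taste; it mainly saves typing.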



Now we're going to get a little more complicated. I want to calculate the distance between two cities (nodes) and then write a function that finds the closest city to any given city!

Note - you can ignore the details of the Haversine formula below. It's the mathematical way to calculate the great-circle distance between two points given their longitude and latitude. It's one of those things you google when you need it and then never use or remember again. It is, however, critical to our distance function.
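
For the curious, this is the formula that the haversine() code below computes, written out with latitudes φ and longitudes λ in radians and r as the Earth's radius (6371 km in the code). The symbol names are my own choice for illustration:

```latex
a = \sin^2\!\left(\frac{\varphi_2 - \varphi_1}{2}\right)
    + \cos\varphi_1 \,\cos\varphi_2 \,\sin^2\!\left(\frac{\lambda_2 - \lambda_1}{2}\right)

d = 2\, r \,\arcsin\!\left(\sqrt{a}\right)
```

Here `a` matches the variable `a` in the code, and `d` is the value the function returns (`c * r`, with `c = 2 * asin(sqrt(a))`).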

Notice that when I write the function to find the closest city, I'm extremely careful to make sure that it never compares a city to itself. The distance from a city to itself is zero, so if it did, it would pick itself every time, making the algorithm useless. This is important to take into account if you are writing any route-planning algorithms, like the one I want you to try in the challenge part.

In [5]:
# now some other methods that might be useful for analysis on these cities!

# the haversine formula is a way to calculate the distance between two longitude/latitude points.
# this code is via - http://bit.ly/1bKauqS
# don't look too far into it unless you love geography...
from math import radians, cos, sin, asin, sqrt

def haversine(lon1, lat1, lon2, lat2):
    """
    Calculate the great circle distance between two points 
    on the earth (specified in decimal degrees)
    """
    # convert decimal degrees to radians 
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])

    # haversine formula 
    dlon = lon2 - lon1 
    dlat = lat2 - lat1 
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    c = 2 * asin(sqrt(a)) 
    r = 6371 # Radius of earth in kilometers. Use 3956 for miles
    return c * r


# a distance function to make my life easier
def distance(a, b):                  # 'a', and 'b' will just be City objects!
    return haversine(a.getLong(), a.getLat(), b.getLong(), b.getLat())


# what if I want to know the closest city?
import random

def findClosestCity(a):
    start = places[random.randrange(len(places))]   # any city will do as a first guess
    while start is a:                         # if I don't make sure it can't be itself, 
        start = places[random.randrange(len(places))]   # it will pick itself every time.

    champDistance = distance(a, start)        # the distance we will "challenge"
    closest = start
    
    for i in places:
        testDistance = distance(a, i)
        if testDistance < champDistance and i is not a:
            closest = i
            champDistance = testDistance      # now it will be the thing to challenge.
    
    return closest
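
Two quick checks on the ideas above, as a sketch rather than a replacement for the code in this cell. First, a sanity test of the formula: one degree of latitude along a meridian should come out at roughly 111.19 km for r = 6371. Second, the same "closest city, but never itself" search written with the built-in `min()`. The names `haversine_km`, `sample_points`, and `closest_to` are hypothetical, and the coordinates are copied from the rows printed earlier.

```python
from math import radians, cos, sin, asin, sqrt


def haversine_km(lon1, lat1, lon2, lat2, r=6371):
    """Same formula as haversine() above, repeated so this sketch runs on its own."""
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * r * asin(sqrt(a))


# Sanity check: one degree of latitude, same longitude
one_degree = haversine_km(0, 0, 0, 1)

# (name, longitude, latitude) tuples from the head() output
sample_points = [
    ("Adjuntas", -66.722583, 18.165273),
    ("Aguada", -67.180953, 18.393103),
    ("Aguadilla", -67.141486, 18.465162),
    ("Maricao", -66.944111, 18.172947),
]


def closest_to(target, points):
    # drop the target itself first, otherwise distance 0 always wins
    others = [p for p in points if p is not target]
    return min(others, key=lambda p: haversine_km(target[1], target[2], p[1], p[2]))


nearest = closest_to(sample_points[1], sample_points)   # closest to Aguada
print(round(one_degree, 2), nearest[0])
```

Filtering the target out before calling `min()` does the same job as the random starting city and the extra check inside the loop.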


In [31]:
# now let's test some of the functionality of what we just coded!!!
# let's find a location to start at. 35 is a randomish number..
places[35].show()

City = Mayaguez
Latitude = 18.219023
Longitude = -67.508068



In [32]:
# ok, Mayaguez it is!
Mayaguez = places[35]
closeToMaya = findClosestCity(Mayaguez)

closeToMaya.show()               # here's google directions to check - http://bit.ly/1BTTyks

City = Rincon
Latitude = 18.335781
Longitude = -67.252547



In [35]:
places[33].show()

City = Rincon
Latitude = 18.335781
Longitude = -67.252547



Challenge - write a greedy algorithm that tries to find the minimum travel time to ten locations. Start in Mayaguez. I'll post the answer soon enough.