# Toy example of address geocoding with Pandas and Geopy
## Demo data was downloaded from [data.gouv.fr - Liste des pharmacies parisiennes](https://www.data.gouv.fr/fr/datasets/carte-des-pharmacies-de-paris-idf/)

First, let's import *pandas* and *geopy*. We will use Google's geocoding service, but you can check the [geopy documentation](http://geopy.readthedocs.org/en/latest/#module-geopy.geocoders) for information about other providers.

After importing the modules, let's load the CSV file with the address data:

In [123]:
import pandas as pd
from geopy.geocoders import GoogleV3
pharmaFile = './carte-des-pharmacies-de-paris.csv'
pharma_df = pd.read_csv(pharmaFile, sep=';')

The dataset already contains map coordinates (the `lat` and `lng` columns). For the purpose of this toy example, let's drop those columns. Let's also do some basic data cleaning by dropping rows with empty values:

In [124]:
# Drop all columns we are not interested in
pharma_df = pharma_df.drop(['nofinesset', 'nofinessej', 'rslongue', 'compvoie', 'lieuditbp', 'wgs84', 'lat', 'lng'], axis=1)

# Drop rows with missing data
pharma_df = pharma_df.dropna()

# Reset index after all clean-up operations
pharma_df.reset_index(inplace=True, drop=True)

# Convert building number to int (by default it was float)
pharma_df['numvoie'] = pharma_df['numvoie'].astype(int)
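
The order of the clean-up steps matters: a numeric column that contains missing values is stored as float, and it can only be cast to int once the incomplete rows are gone. A minimal sketch on made-up rows (the values in `toy_df` are illustrative; only the column names mirror the pharmacy dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows; only the column names mirror the pharmacy dataset
toy_df = pd.DataFrame({
    'numvoie': [3.0, np.nan, 169.0],
    'voie': ["JEANNE D'ARC", 'DELAMBRE', 'D ALESIA'],
})

# A missing value forces the whole column to a float dtype
dtype_before = toy_df['numvoie'].dtype

# Drop incomplete rows, renumber the index, then the cast to int is safe
toy_df = toy_df.dropna().reset_index(drop=True)
toy_df['numvoie'] = toy_df['numvoie'].astype(int)
```

Calling `astype(int)` before `dropna()` would raise an error here, because a missing value has no integer representation.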

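Here is a sample of our data set (judging by the cell number, this cell was run after the later steps, so the output already includes the `fulladdress`, `lat` and `lng` columns):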

In [136]:
pharma_df.head()

| | rs | complrs | numvoie | typvoie | voie | departement | libdepartement | cp | commune | telephone | telecopie | dateouv | dateautor | datemaj | fulladdress | lat | lng |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | SELARL PHARMACIE MATHIAU LAM | PHARMACIE MATHIAU LAM | 3 | RUE | JEANNE D'ARC | 75 | PARIS | 75013 | PARIS | 145834022 | 145864415 | 2011-06-30 | 1994-06-29 | 2011-07-21 | 3 RUE JEANNE D'ARC, 75013, PARIS | 48.828895 | 2.369385 |
| 1 | SELARL PHARMACIE DIDOT ALESIA | PHARMACIE DIDOT ALESIA | 169 | RUE | D ALESIA | 75 | PARIS | 75014 | PARIS | 145423493 | 145426363 | 2011-04-11 | 1942-12-30 | 2011-07-22 | 169 RUE D ALESIA, 75014, PARIS | 48.830662 | 2.318188 |
| 2 | SELARL PHARMACIE EDGAR QUINET | PHARMACIE EDGAR QUINET | 43 | RUE | DELAMBRE | 75 | PARIS | 75014 | PARIS | 143207012 | 143202920 | 2010-07-01 | 1942-11-10 | 2011-07-21 | 43 RUE DELAMBRE, 75014, PARIS | 48.841301 | 2.325438 |
| 3 | M OMAR AADRI | PHARMACIE AADRI | 11 | RUE | MARGUERIN | 75 | PARIS | 75014 | PARIS | 145407284 | 145394212 | 2012-07-30 | 1943-05-05 | 2012-09-12 | 11 RUE MARGUERIN, 75014, PARIS | 48.826962 | 2.328586 |
| 4 | EURL PLANCHE | PHARMACIE PL DENFERT ROCHEREAU | 97 | AVENUE | DENFERT ROCHEREAU | 75 | PARIS | 75014 | PARIS | 143542971 | 144070600 | 2011-04-01 | 1942-12-04 | 2011-07-22 | 97 AVENUE DENFERT ROCHEREAU, 75014, PARIS | 48.834771 | 2.333121 |


We are going to use a geocoder from *geopy*, and to do so we need a full address in the format `"building_num street_type street, zip code, city"`.
We already have all the components of the full address, so we just apply a lambda function that formats a string, and add the result as a new column to our dataframe:

In [125]:
pharma_df['fulladdress'] = pharma_df.apply(
    lambda row: '{num} {st_type} {street}, {code}, {city}'.format(num=row['numvoie'],
                                                                st_type=row['typvoie'],
                                                                street=row['voie'],
                                                                code=row['cp'],
                                                                city=row['commune']), axis=1)
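
The row-wise `apply` above is easy to read; for larger tables the same string can also be assembled column by column. A sketch of that alternative, using two rows copied from the sample shown earlier (`toy_df` is a name introduced here for illustration only):

```python
import pandas as pd

# Two rows copied from the sample shown earlier
toy_df = pd.DataFrame({
    'numvoie': [3, 169],
    'typvoie': ['RUE', 'RUE'],
    'voie': ["JEANNE D'ARC", 'D ALESIA'],
    'cp': [75013, 75014],
    'commune': ['PARIS', 'PARIS'],
})

# Cast numeric columns to str, then concatenate whole columns at once
toy_df['fulladdress'] = (
    toy_df['numvoie'].astype(str) + ' ' + toy_df['typvoie'] + ' ' + toy_df['voie']
    + ', ' + toy_df['cp'].astype(str) + ', ' + toy_df['commune']
)
```

Both approaches should produce the same `fulladdress` strings; which one to prefer is mostly a matter of taste at this data size.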

Let's have a look at the data after adding the full address column:

In [137]:
pharma_df[['numvoie', 'typvoie', 'voie', 'cp', 'commune', 'fulladdress']].head()

| | numvoie | typvoie | voie | cp | commune | fulladdress |
|---|---|---|---|---|---|---|
| 0 | 3 | RUE | JEANNE D'ARC | 75013 | PARIS | 3 RUE JEANNE D'ARC, 75013, PARIS |
| 1 | 169 | RUE | D ALESIA | 75014 | PARIS | 169 RUE D ALESIA, 75014, PARIS |
| 2 | 43 | RUE | DELAMBRE | 75014 | PARIS | 43 RUE DELAMBRE, 75014, PARIS |
| 3 | 11 | RUE | MARGUERIN | 75014 | PARIS | 11 RUE MARGUERIN, 75014, PARIS |
| 4 | 97 | AVENUE | DENFERT ROCHEREAU | 75014 | PARIS | 97 AVENUE DENFERT ROCHEREAU, 75014, PARIS |


That looks like what we wanted!

The next step is to pass the `fulladdress` data to the geocoder and store the *latitude* and *longitude* in the dataframe.

We define a helper function that accepts an address string and a geolocator object as input arguments, and returns the latitude and longitude for the address:

In [138]:
def find_map_coordinates(full_address, geolocator):
    """
    Find latitude and longitude for an address.

    Input
    :full_address: string with the full address (like "55 rue du Faubourg Saint-Honore, 75008, Paris")
    :geolocator: geocoder object from the geopy module

    Output
    :latitude, longitude: latitude and longitude of the input address,
        or (None, None) if the geocoder finds no match
    """
    location = geolocator.geocode(full_address)
    # geocode() returns None when the address cannot be resolved
    if location is None:
        return None, None
    return location.latitude, location.longitude

Let's create a geolocator object (here, the Google one):

In [None]:
# Create geocoder object
# Note: recent geopy releases require an API key here, e.g. GoogleV3(api_key=...)
geolocator = GoogleV3()

Now that we have a function to convert an address to coordinates, let's geocode the `fulladdress` column and add the conversion results to the dataframe:
>**Note**: because the server enforces a quota on geocoding requests, for this toy example we limit our dataset to the first 10 rows.

In [133]:
# Limit our data to the first 10 rows (copy so that new columns are not written to a slice)
pharma_df = pharma_df[:10].copy()

# Find coordinates by applying the helper function; the geolocator is passed as an extra argument
locs = pharma_df['fulladdress'].apply(find_map_coordinates, args=(geolocator,)).tolist()

# Add new columns to the dataframe with latitude and longitude values
pharma_df['lat'], pharma_df['lng'] = zip(*locs)
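
Since the quota is counted per request, one way to stretch it is to avoid asking twice for the same address. A minimal sketch, assuming a stand-in `FakeGeolocator` (hypothetical, used here only so the example runs offline; a real geopy geocoder would take its place) and Python's standard `functools.lru_cache`:

```python
from collections import namedtuple
from functools import lru_cache

# Hypothetical stand-in for a geopy geocoder, so no network request is made
Location = namedtuple('Location', ['latitude', 'longitude'])

class FakeGeolocator:
    def __init__(self):
        self.calls = 0  # number of (pretend) requests sent

    def geocode(self, address):
        self.calls += 1
        # Coordinates copied from this notebook's sample output for that address
        known = {"3 RUE JEANNE D'ARC, 75013, PARIS": Location(48.828895, 2.369385)}
        return known.get(address)  # None when nothing is found

geolocator = FakeGeolocator()

@lru_cache(maxsize=None)
def cached_coordinates(address):
    """Geocode an address once; repeated addresses are served from the cache."""
    location = geolocator.geocode(address)
    if location is None:
        return None, None
    return location.latitude, location.longitude

# Three identical addresses plus one unknown address: only two requests are sent
addresses = ["3 RUE JEANNE D'ARC, 75013, PARIS"] * 3 + ['UNKNOWN ADDRESS']
coords = [cached_coordinates(a) for a in addresses]
```

The cache only helps when addresses repeat, and it lives for the duration of the session; it does not replace respecting the provider's rate limits.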

Now our dataframe has two new columns with the map coordinates of each address:

In [135]:
pharma_df[['fulladdress','lat','lng']].head()

| | fulladdress | lat | lng |
|---|---|---|---|
| 0 | 3 RUE JEANNE D'ARC, 75013, PARIS | 48.828895 | 2.369385 |
| 1 | 169 RUE D ALESIA, 75014, PARIS | 48.830662 | 2.318188 |
| 2 | 43 RUE DELAMBRE, 75014, PARIS | 48.841301 | 2.325438 |
| 3 | 11 RUE MARGUERIN, 75014, PARIS | 48.826962 | 2.328586 |
| 4 | 97 AVENUE DENFERT ROCHEREAU, 75014, PARIS | 48.834771 | 2.333121 |


## That's all!
In a few simple steps, we saw how to convert addresses from text to map coordinates using pandas and geopy.
What to do with the resulting coordinates is limited only by one's imagination. For example, we can check whether some districts have more pharmacies than others.
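
A first rough look at that last question does not even need the coordinates. The sketch below assumes the postal code column `cp` is an acceptable stand-in for a district, and uses only the postal codes of the five sample rows shown in this notebook (`toy_df` and `per_district` are names introduced here for illustration):

```python
import pandas as pd

# Postal codes of the five sample rows shown in this notebook
toy_df = pd.DataFrame({'cp': [75013, 75014, 75014, 75014, 75014]})

# Number of pharmacies per postal code, largest count first
per_district = toy_df['cp'].value_counts()
```

On the full dataset the same call would give one count per postal code; the coordinates become useful once the comparison has to follow boundaries other than postal codes.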

---
*Alexander Usoltsev, 2016*