# Plotting Maps: Visualizing Haiti Earthquake Crisis Data
    
Ushahidi is a non-profit software company that enables crowdsourcing of information
related to natural disasters and geopolitical events via text message. Many of these data
sets are then published on their website for analysis and visualization. I downloaded the
data collected during the 2010 Haiti earthquake crisis and aftermath, and I’ll show
you how I prepared the data for analysis and visualization using pandas and other tools
we have looked at thus far. After downloading the CSV file from the Ushahidi website, we can
load it into a DataFrame using read_csv:

In [1]:
from pandas import DataFrame, Series
import pandas as pd
import sys
import numpy as np

import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
data = pd.read_csv('Haiti.csv')

In [3]:
data.head()

,Serial,INCIDENT TITLE,INCIDENT DATE,LOCATION,DESCRIPTION,CATEGORY,LATITUDE,LONGITUDE,APPROVED,VERIFIED
0,4052,* URGENT * Type O blood donations needed in #J...,05/07/2010 17:26,"Jacmel, Haiti",Birthing Clinic in Jacmel #Haiti urgently need...,"1. Urgences | Emergency, 3. Public Health,",18.233333,-72.533333,YES,NO
1,4051,"Food-Aid sent to Fondwa, Haiti",28/06/2010 23:06,fondwa,Please help food-aid.org deliver more food to ...,"1. Urgences | Emergency, 2. Urgences logistiqu...",50.226029,5.729886,NO,NO
2,4050,how haiti is right now and how it was during t...,24/06/2010 16:21,centrie,i feel so bad for you i know i am supposed to ...,"2. Urgences logistiques | Vital Lines, 8. Autr...",22.278381,114.174287,NO,NO
3,4049,Lost person,20/06/2010 21:59,Genoca,We are family members of Juan Antonio Zuniga O...,"1. Urgences | Emergency,",44.407062,8.933989,NO,NO
4,4042,Citi Soleil school,18/05/2010 16:26,"Citi Soleil, Haiti",We are working with Haitian (NGO) -The Christi...,"1. Urgences | Emergency,",18.571084,-72.334671,YES,NO


It’s easy now to tinker with this data set to see what kinds of things we might want to
do with it. Each row represents a report sent from someone’s mobile phone indicating
an emergency or some other problem. Each has an associated timestamp and a location
as latitude and longitude:

In [5]:
data[['INCIDENT DATE', 'LATITUDE', 'LONGITUDE']][:10]

,INCIDENT DATE,LATITUDE,LONGITUDE
0,05/07/2010 17:26,18.233333,-72.533333
1,28/06/2010 23:06,50.226029,5.729886
2,24/06/2010 16:21,22.278381,114.174287
3,20/06/2010 21:59,44.407062,8.933989
4,18/05/2010 16:26,18.571084,-72.334671
5,26/04/2010 13:14,18.593707,-72.310079
6,26/04/2010 14:19,18.4828,-73.6388
7,26/04/2010 14:27,18.415,-73.195
8,15/03/2010 10:58,18.517443,-72.236841
9,15/03/2010 11:00,18.54779,-72.41001
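
The INCIDENT DATE values are plain strings in day/month/year order (note 28/06/2010 above). Here is a minimal sketch of converting them to timestamps with pd.to_datetime, assuming the format seen in these rows holds for the whole file. The small sample frame is copied from the output above and is not the full data set:

```python
import pandas as pd

# A few values copied from the output above (not the full file)
sample = pd.DataFrame({
    'INCIDENT DATE': ['05/07/2010 17:26',
                      '28/06/2010 23:06',
                      '15/03/2010 10:58'],
})

# The dates are day-first, so spell out the format explicitly
# instead of relying on pandas to infer it
sample['INCIDENT DATE'] = pd.to_datetime(sample['INCIDENT DATE'],
                                         format='%d/%m/%Y %H:%M')

# The first row is 5 July, not 7 May
print(sample['INCIDENT DATE'].dt.month.tolist())
```

Giving the format explicitly avoids silently reading ambiguous values such as 05/07/2010 as month-first.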


The CATEGORY field contains a comma-separated list of codes indicating the type of
message:

In [7]:
data['CATEGORY'][:6]

0          1. Urgences | Emergency, 3. Public Health, 
1    1. Urgences | Emergency, 2. Urgences logistiqu...
2    2. Urgences logistiques | Vital Lines, 8. Autr...
3                            1. Urgences | Emergency, 
4                            1. Urgences | Emergency, 
5                       5e. Communication lines down, 
Name: CATEGORY, dtype: object

Some records have no category at all, so we might want to drop these data points.
Additionally, calling describe shows that there are some aberrant locations, such as a
maximum latitude of about 50.2 and a maximum longitude of about 114.2, both far outside Haiti:

In [9]:
data.describe()

,Serial,LATITUDE,LONGITUDE
count,3593.0,3593.0,3593.0
mean,2080.277484,18.611495,-72.32268
std,1171.10036,0.738572,3.650776
min,4.0,18.041313,-74.452757
25%,1074.0,18.52407,-72.4175
50%,2163.0,18.539269,-72.335
75%,3088.0,18.56182,-72.29357
max,4052.0,50.226029,114.174287


Cleaning the bad locations and removing the missing categories is now fairly simple:

In [10]:
data = data[(data.LATITUDE > 18) & (data.LATITUDE < 20) &
            (data.LONGITUDE > -75) & (data.LONGITUDE < -70)
            & data.CATEGORY.notnull()]
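
To see what this filter keeps, here is a small sketch on a toy frame. The coordinates are taken from the head() output above. The missing category in the last row is made up purely for illustration:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'LATITUDE': [18.233333, 50.226029, 18.571084, 18.593707],
    'LONGITUDE': [-72.533333, 5.729886, -72.334671, -72.310079],
    'CATEGORY': ['1. Urgences | Emergency, 3. Public Health,',
                 '1. Urgences | Emergency,',
                 '1. Urgences | Emergency,',
                 np.nan],  # blanked out for illustration only
})

# Same bounding box as above: a rough rectangle around Haiti
in_haiti = ((toy.LATITUDE > 18) & (toy.LATITUDE < 20) &
            (toy.LONGITUDE > -75) & (toy.LONGITUDE < -70))
has_category = toy.CATEGORY.notnull()

cleaned = toy[in_haiti & has_category]

# Row 1 is dropped for its location, row 3 for its missing category
print(cleaned.index.tolist())
```

Note that the surviving rows keep their original index labels, which is why later outputs in this section show a non-contiguous index such as 0, 4, 5, 6, 7.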

Now we might want to do some analysis or visualization of this data by category, but
each category field may have multiple categories. Additionally, each category is given
as a code plus an English name and possibly also a French name. Thus, a little bit of
wrangling is required to get the data into a more agreeable form. First, I wrote these
functions to split a category field into a list, to get a list of all the categories, and to
split each category into a code and an English name:

In [12]:
def to_cat_list(catstr):
    stripped = (x.strip() for x in catstr.split(','))
    return [x for x in stripped if x]

In [13]:
def get_all_categories(cat_series):
    cat_sets = (set(to_cat_list(x)) for x in cat_series)
    return sorted(set.union(*cat_sets))

In [19]:
def get_english(cat):
    code, names = cat.split('.', 1)  # split on the first period only
    if '|' in names:
        names = names.split('|')[1]
    return code, names.strip()

You can test out that the get_english function does what you expect:

In [20]:
get_english('2. Urgences logistiques | Vital Lines')

('2', 'Vital Lines')
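
The other two helpers can be checked the same way. The sketch below repeats their definitions so it runs on its own, and applies them to three CATEGORY strings copied from the output earlier in this section:

```python
def to_cat_list(catstr):
    # Split on commas and drop the empty piece left by the trailing comma
    stripped = (x.strip() for x in catstr.split(','))
    return [x for x in stripped if x]

def get_all_categories(cat_series):
    # Union of the categories found in every record, sorted for stable order
    cat_sets = (set(to_cat_list(x)) for x in cat_series)
    return sorted(set.union(*cat_sets))

sample = ['1. Urgences | Emergency, 3. Public Health, ',
          '1. Urgences | Emergency, ',
          '5e. Communication lines down, ']

print(to_cat_list(sample[0]))
print(get_all_categories(sample))
```

The trailing comma and space in each field are why to_cat_list filters out empty strings after stripping.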

Now I make a dict mapping each code to its English name, because we’ll use the codes for
analysis and the names later when adorning plots (note the use of a generator expression
in lieu of a list comprehension):

In [21]:
all_cats = get_all_categories(data.CATEGORY)

In [22]:
# Generator expression
english_mapping = dict(get_english(x) for x in all_cats)

In [25]:
english_mapping['2a']


'Food Shortage'

In [27]:
english_mapping['6c']

'Earthquake and aftershocks'

There are many ways to go about augmenting the data set to be able to easily select
records by category. One way is to add indicator (or dummy) columns, one for each
category. To do that, first extract the unique category codes and construct a DataFrame
of zeros having those as its columns and the same index as data:

In [29]:
def get_code(seq):
    return [x.split('.')[0] for x in seq if x]

In [30]:
all_codes = get_code(all_cats)
code_index = pd.Index(np.unique(all_codes))
dummy_frame = DataFrame(np.zeros((len(data), len(code_index))),
        index=data.index, columns=code_index)

If all goes well, dummy_frame should look something like this:

In [33]:
dummy_frame.iloc[:, :6].head()

,1,1a,1b,1c,1d,2
0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0
6,0.0,0.0,0.0,0.0,0.0,0.0
7,0.0,0.0,0.0,0.0,0.0,0.0


As you recall, the trick is then to set the appropriate entries of each row to 1, lastly
joining this with data:

In [35]:
for row, cat in zip(data.index, data.CATEGORY):
    codes = get_code(to_cat_list(cat))
    dummy_frame.loc[row, codes] = 1  # .ix was removed from pandas; .loc selects by label
data = data.join(dummy_frame.add_prefix('category_'))

data now has new indicator columns like:

In [38]:
data.iloc[:, 10:15].head()

,category_1,category_1a,category_1b,category_1c,category_1d
0,1.0,0.0,0.0,0.0,0.0
4,1.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0
6,0.0,0.0,0.0,0.0,0.0
7,0.0,0.0,0.0,0.0,0.0
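
With indicator columns in place, selecting records by category becomes a simple boolean filter. The following self-contained sketch rebuilds the indicator columns on a three-row toy frame and then selects the emergency reports. The toy frame and its index labels are made up for illustration, the CATEGORY strings are copied from earlier output, and label-based assignment is done with .loc:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {'CATEGORY': ['1. Urgences | Emergency, 3. Public Health, ',
                  '1. Urgences | Emergency, ',
                  '5e. Communication lines down, ']},
    index=[0, 4, 5])  # non-contiguous index, as in the cleaned data

def to_cat_list(catstr):
    stripped = (x.strip() for x in catstr.split(','))
    return [x for x in stripped if x]

def get_code(seq):
    return [x.split('.')[0] for x in seq if x]

# Codes present in each row, then the sorted set of all codes
codes_per_row = [get_code(to_cat_list(c)) for c in toy.CATEGORY]
code_index = pd.Index(sorted({c for codes in codes_per_row for c in codes}))

# Frame of zeros, then flip the matching entries of each row to 1
dummies = pd.DataFrame(np.zeros((len(toy), len(code_index))),
                       index=toy.index, columns=code_index)
for row, codes in zip(toy.index, codes_per_row):
    dummies.loc[row, codes] = 1

toy = toy.join(dummies.add_prefix('category_'))

# Select every record tagged with category code 1
emergencies = toy[toy['category_1'] == 1]
print(emergencies.index.tolist())
```

Because a record can carry several codes, the same row may show up under more than one category filter.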


Let’s make some plots! As this is spatial data, we’d like to plot the data by category on
a map of Haiti. The basemap toolkit (http://matplotlib.github.com/basemap), an add-on
to matplotlib, enables plotting 2D data on maps in Python. basemap provides many
different globe projections and a means for projecting latitude and longitude
coordinates on the globe onto a two-dimensional matplotlib plot. After some
trial and error and using the above data as a guideline, I wrote this function, which draws
a simple black and white map of Haiti:

In [40]:
from mpl_toolkits.basemap import Basemap
import matplotlib.pyplot as plt

ImportError: No module named 'mpl_toolkits.basemap'
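
If basemap is not installed, as the ImportError above indicates, a rough first look at the spatial distribution is still possible with plain matplotlib. This is only a fallback sketch, not the basemap function described in the text: it plots raw longitude against latitude with no map projection or coastlines. The axis limits reuse the bounding box from the cleaning step, and the points are a handful of in-range coordinates from the output shown earlier:

```python
import matplotlib.pyplot as plt

# A few in-range coordinates copied from the output earlier in this section
lats = [18.233333, 18.571084, 18.593707, 18.4828, 18.415]
lons = [-72.533333, -72.334671, -72.310079, -73.6388, -73.195]

fig, ax = plt.subplots(figsize=(6, 4))

# Longitude on the x axis, latitude on the y axis
ax.scatter(lons, lats, s=10, color='k')

# Same bounding box used when cleaning the data
ax.set_xlim(-75, -70)
ax.set_ylim(18, 20)
ax.set_xlabel('Longitude')
ax.set_ylabel('Latitude')
ax.set_title('Report locations (no map projection)')
```

Over an area this small the distortion from skipping a projection is modest, but the plot has no geographic context, which is what basemap would add.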