# Data Science Project

Web services are increasingly being deployed in infrastructure-as-a-service (IaaS) clouds such as Amazon EC2, Windows Azure, and Rackspace. Industry sources claim that over 1% of Internet traffic goes to EC2, and EC2 outages are reputed to disrupt a wide variety of services. Our goal is to determine who is using IaaS clouds (specifically EC2), how these services use the cloud, and finally to use PlanetLab or EDNS queries to estimate the impact of wide-area route outages.

**Warning** Acquiring and cleaning data is a messy process, but your approach shouldn't be.  Approach this lab with a rigorous problem-solving mindset.  Design and implement a solution that is robust to unexpected inputs and handles anomalies gracefully.

## Set up

If you make changes to your code and rerun a Python notebook, your changes may not be detected because Python is lazy about reloading modules.  The following two lines force a reload.

In [1]:
%load_ext autoreload
%autoreload 2

Install the patch for the UNIX dig utility:

    $ wget ftp://ftp.isc.org/isc/bind9/9.9.3/bind-9.9.3.tar.gz
    $ tar xf bind-9.9.3.tar.gz
    $ cd bind-9.9.3
    $ wget http://wilmer.gaa.st/edns-client-subnet/bind-9.9.3-dig-edns-client-subnet-iana.diff
    $ patch -p0 < bind-9.9.3-dig-edns-client-subnet-iana.diff
    $ ./configure --without-openssl
    $ make

Install Pytricia, then add the book's code directory to the Python path so that its modules can be imported:

    import sys
    sys.path.append('/vagrant/data-science-from-scratch/code/')
    import working_with_data as wwd   # to import the code from Ch. 10, for instance

To see how the python files align with the book chapters, go [here](https://github.com/joelgrus/data-science-from-scratch).
   

## Scraping and cleaning the data

In the file `scrape.py`, write a function called `scrape_cherry_blossoms()` that scrapes the web pages of race results for a given set of years and genders and writes the results to one file and errors to another.  The results file should be a CSV file that looks something like this:

    year,gender,place,name,hometown,age,time
    2010,m,1,Stephen Tum,Kenya,24,45.71666666666667
    2010,m,2,Lelisa Desisa,Ethiopia,20,45.733333333333334
    ...

Notice that race time has been converted to minutes. For more details, see the docstring of `scrape_cherry_blossoms()`.

Here are a few guidelines in writing this function:
- Before you start, examine the Cherry Blossom results web pages (View Source in your web browser).  You'll notice that the format varies slightly from page to page.  Try to write the code in a general way so that the same approach can be applied to all 28 pages (14 years x 2 genders).
- Start small: debug your program on one year, then two years, etc. before unleashing it on all years.
- The web pages contain some unicode characters that may cause problems when you try to write them out to a file.  If string `s` is a unicode string, you can encode it in ASCII as follows: `s.encode('ascii', 'replace')`.
- Not only are the pages formatted differently, the *data* varies from page to page.  To aggregate the data across all years, we would like to decide on a uniform data format.  Thus, you will need to make some decisions on how to map each individual result file to this common format.
- As is often the case, there is missing data.  Some runners don't have an age, some don't have a time, etc.  Handle missing data as follows: any record that is missing a value for *any* of the *required fields* should be discarded and put into the errors file.  The required fields are those that appear in the CSV example above.
- Try to process as many years as you can.  Years 2001, 2006 and 2009 are especially tricky and are considered optional **challenge problems**.
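
The missing-data guideline above can be sketched as follows. This is only one possible approach, written for Python 3: the helper names `is_complete` and `write_records` are hypothetical, and the error-file format (one `repr` per line) is an assumption, since the assignment does not prescribe one. The second record in the demo is made up to show a rejected row.

```python
import csv
import io

# The required fields are the columns of the CSV example above.
REQUIRED_FIELDS = ['year', 'gender', 'place', 'name', 'hometown', 'age', 'time']


def is_complete(record):
    """Return True when every required field is present and non-empty."""
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if value is None or str(value).strip() == '':
            return False
    return True


def write_records(records, results_file, errors_file):
    """Write complete records as CSV rows and the rest to the errors file.

    Returns the number of records written to the results file.
    """
    writer = csv.DictWriter(results_file, fieldnames=REQUIRED_FIELDS,
                            extrasaction='ignore', lineterminator='\n')
    writer.writeheader()
    kept = 0
    for record in records:
        if is_complete(record):
            writer.writerow(record)
            kept += 1
        else:
            # Assumed error format: the raw record, one per line.
            errors_file.write(repr(record) + '\n')
    return kept


# In-memory files stand in for the real results and errors files.
records = [
    {'year': 2010, 'gender': 'm', 'place': 1, 'name': 'Stephen Tum',
     'hometown': 'Kenya', 'age': 24, 'time': 45.71666666666667},
    {'year': 2010, 'gender': 'm', 'place': 3, 'name': 'Runner Without Age',
     'hometown': 'Somewhere', 'age': None, 'time': 50.0},  # made-up row
]
results, errors = io.StringIO(), io.StringIO()
kept = write_records(records, results, errors)
```

Keeping the completeness check in its own function makes it easy to test against the odd rows you find in individual years before running the full scrape.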

Invoke your function here, passing in as many years as you can process:

In [2]:
import geo

NoSuchDisplayException: Cannot connect to "None"

In the space below, explain how you handled time:

We compared the format of the results for the different years and determined for each year which column number is to be used for time. Most result years used the 7th or 8th column for gun time/time, except for 2008 which used the 11th column and 1999 which used the 6th column. We created lists for each of those groups and then selected the corresponding regex group for those years. In the cases where some of the columns were empty, we used conditional statements to select the correct regex group. After extracting the correct time value, we used a helper function "time_in_mins" to calculate the total time in minutes. 
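
As an illustration of the last step, a helper like `time_in_mins` could look roughly like the sketch below. This is not the code used in this notebook. It assumes Python 3, times formatted as `H:MM:SS` or `MM:SS`, and that some pages append marker characters such as `#` or `*` to a time, which is an assumption about the source pages.

```python
def time_in_mins(time_str):
    """Convert 'H:MM:SS' or 'MM:SS' to minutes as a float.

    Returns None when the string cannot be parsed, so the caller can
    route the record to the errors file.
    """
    # Assumption: trailing '#' or '*' markers may follow the time.
    cleaned = time_str.strip().rstrip('#*').strip()
    parts = cleaned.split(':')
    if len(parts) not in (2, 3):
        return None
    try:
        numbers = [int(p) for p in parts]
    except ValueError:
        return None
    if len(numbers) == 2:
        numbers = [0] + numbers  # no hours component
    hours, minutes, seconds = numbers
    return hours * 60 + minutes + seconds / 60.0
```

Returning `None` rather than raising keeps the scraping loop simple: a record whose time cannot be parsed is treated like any other record with a missing required field.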

## Exploring the data

In `manipulate.py`, write a function called `load_race_data(filename)` that reads a CSV file named `filename` and parses each field into the appropriate type.  The result should be a list of dicts where each dict represents one row of data (i.e., the format shown on p. 129 of DSFS).  For example, if the result file contains men's results from 2010, the output of the `load_race_data` function should look like this:

    [
     {'name': 'Stephen Tum', 'gender': 'm', 'age': 24, 
      'year': 2010, 'hometown': 'Kenya', 'place': 1, 'time': 45.71666666666667}, 
     {'name': 'Lelisa Desisa', 'gender': 'm', 'age': 20, 
      'year': 2010, 'hometown': 'Ethiopia', 'place': 2, 'time': 45.733333333333334},
     ...
    ]
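
One way to get this shape is sketched below, assuming Python 3 and a results file whose header row matches the CSV example earlier. The helper `parse_race_rows` and the `CONVERTERS` table are hypothetical names introduced for illustration. Rows that fail type conversion are skipped here, which is a design assumption rather than a requirement of the assignment.

```python
import csv

# Field name -> converter; fields not listed here stay strings.
CONVERTERS = {'year': int, 'place': int, 'age': int, 'time': float}


def parse_race_rows(lines):
    """Turn CSV lines (header first) into a list of typed dicts."""
    rows = []
    for raw in csv.DictReader(lines):
        try:
            row = {field: CONVERTERS.get(field, str)(value)
                   for field, value in raw.items()}
        except (TypeError, ValueError):
            continue  # skip rows with missing or malformed values
        rows.append(row)
    return rows


def load_race_data(filename):
    """Read the results CSV and parse each field into its type."""
    with open(filename, newline='') as f:
        return parse_race_rows(f)


sample = [
    'year,gender,place,name,hometown,age,time',
    '2010,m,1,Stephen Tum,Kenya,24,45.71666666666667',
    '2010,m,2,Lelisa Desisa,Ethiopia,20,45.733333333333334',
]
rows = parse_race_rows(sample)
```

Separating the parsing from the file handling means the parser can be exercised on a handful of lines without touching the disk.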

Now we can manipulate the data to answer some basic questions.  

1. What is the age range for male runners?
2. What is the age range for female runners?
3. Who had the fastest running time across all years/genders, how fast did they run and in what year?
4. Who had the slowest running time across all years/genders and how fast did they run and in what year?

If any of these answers seem odd, go back to your data cleaning process and the source data.  If the oddity is due to the source data, you can leave it.
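
The questions above reduce to `min` and `max` over the list of dicts. The sketch below assumes Python 3 and rows in the format shown earlier. The function names are hypothetical, the gender code `'f'` is an assumption (only `'m'` appears in the example), and the women's rows in the demo are made up.

```python
def age_range(rows, gender):
    """Return (youngest, oldest) age among runners of the given gender."""
    ages = [r['age'] for r in rows if r['gender'] == gender]
    return min(ages), max(ages)


def fastest_and_slowest(rows):
    """Return the rows with the smallest and largest time."""
    fastest = min(rows, key=lambda r: r['time'])
    slowest = max(rows, key=lambda r: r['time'])
    return fastest, slowest


rows = [
    {'name': 'Stephen Tum', 'gender': 'm', 'age': 24, 'year': 2010,
     'hometown': 'Kenya', 'place': 1, 'time': 45.71666666666667},
    {'name': 'Lelisa Desisa', 'gender': 'm', 'age': 20, 'year': 2010,
     'hometown': 'Ethiopia', 'place': 2, 'time': 45.733333333333334},
    # Made-up rows for illustration only.
    {'name': 'Runner A', 'gender': 'f', 'age': 31, 'year': 2011,
     'hometown': 'Somewhere', 'place': 1, 'time': 55.0},
    {'name': 'Runner B', 'gender': 'f', 'age': 62, 'year': 2012,
     'hometown': 'Elsewhere', 'place': 900, 'time': 130.5},
]

low, high = age_range(rows, 'm')
print('1. The age range for men is %d to %d.' % (low, high))
fastest, slowest = fastest_and_slowest(rows)
print('3. The fastest runner was %s (%.2f minutes, %d).'
      % (fastest['name'], fastest['time'], fastest['year']))
```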

In [None]:
# TODO: compute answers to the questions here.  Your code should print 
# a well-formatted response (e.g., "1. The age range for men is __ to __.")
# If your code gets complex, you can always write it in manipulate.py and 
# then import and call it here.
from manipulate import load_race_data
from manipulate import calculations

calculations(load_race_data('results.txt'))

Do any of your above answers seem unusual due to anomalies in the source data?  Please explain in the space below.  

The lower end of the age range for men seems unusual, considering that a 1-year-old completing a 10-mile run would be the most incredible thing I've ever seen. We went back and checked the source of this data and found that it is just an anomaly in the source data. This happened earlier as well with another runner who was listed as 0 years old, which is also very unlikely, but we already had a condition marking 0 as an invalid age.

How has the number of runners changed over the years?  Plot a histogram of the number of runners per year.
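
Counting runners per year is a natural fit for `collections.Counter`. The sketch below assumes Python 3 and rows in the format returned by `load_race_data`. The function names are hypothetical and are not the `runners_by_year` function used in this notebook. matplotlib is imported inside the plotting function so the counting step can be checked without a display.

```python
from collections import Counter


def count_runners_by_year(rows):
    """Return a Counter mapping each year to its number of runners."""
    return Counter(r['year'] for r in rows)


def plot_runners_by_year(rows):
    """Draw one bar per year showing the number of runners."""
    import matplotlib.pyplot as plt  # lazy import: only needed for plotting

    counts = count_runners_by_year(rows)
    years = sorted(counts)
    plt.bar(years, [counts[y] for y in years])
    plt.xlabel('Year')
    plt.ylabel('Number of runners')
    plt.title('Runners per year')
    plt.show()


# Minimal made-up rows; only the 'year' field matters here.
demo_rows = [{'year': 2010}, {'year': 2010}, {'year': 2011}]
demo_counts = count_runners_by_year(demo_rows)
```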

In [None]:
# TODO: plot a histogram of runners per year
from manipulate import runners_by_year
years = [1999, 2000, 2002, 2003, 2004, 2005, 2007, 2008, 2010, 2011, 2012]
runners_by_year(load_race_data('results.txt'),years)

## How much do runners slow down with age?  A cross-sectional analysis.

Do runners slow down with age?  If so, by how much?  Let's investigate this question by making a scatter plot of age vs. running time.  Let's make a *single* plot that has two panels, one for men and one for women.  To do this you will have to combine ideas from the scatter plot example (p. 124) with the scatter plot matrix example (p. 126) as well as [other examples of using the subplot command](http://matplotlib.org/examples/pylab_examples/subplots_demo.html).  Plot data from a single year -- I suggest you choose a year with a small number of runners.

Make the plot readable: you may need to adjust the size of the points, use a "point" marker style rather than the default, and finally use alpha blending so we can more easily see overlapping points. See the [matplotlib documentation](http://matplotlib.org/api/pyplot_api.html#matplotlib.pyplot.scatter) for details.
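
A two-panel layout with small, semi-transparent points can be sketched as follows. This assumes Python 3, rows in the format returned by `load_race_data`, and the gender codes `'m'` and `'f'` (the latter is an assumption). The function names are hypothetical, and the marker size and alpha values are starting points to tune rather than recommended settings.

```python
def ages_and_times(rows, year, gender):
    """Return parallel lists of ages and times for one year and gender."""
    selected = [r for r in rows
                if r['year'] == year and r['gender'] == gender]
    return [r['age'] for r in selected], [r['time'] for r in selected]


def plot_age_vs_time(rows, year):
    """Scatter age against time in two side-by-side panels."""
    import matplotlib.pyplot as plt  # lazy import: only needed for plotting

    # One row, two columns; a shared y-axis makes the panels comparable.
    fig, axes = plt.subplots(1, 2, sharey=True, figsize=(10, 4))
    for ax, gender, label in zip(axes, ['m', 'f'], ['Men', 'Women']):
        ages, times = ages_and_times(rows, year, gender)
        # Small point markers with alpha blending reveal overlapping points.
        ax.scatter(ages, times, s=10, marker='.', alpha=0.3)
        ax.set_title('%s, %d' % (label, year))
        ax.set_xlabel('Age (years)')
    axes[0].set_ylabel('Time (minutes)')
    plt.show()


demo_rows = [
    {'year': 2010, 'gender': 'm', 'age': 24, 'time': 45.71666666666667},
    {'year': 2010, 'gender': 'm', 'age': 20, 'time': 45.733333333333334},
    {'year': 2011, 'gender': 'm', 'age': 40, 'time': 80.0},  # made-up row
]
demo_ages, demo_times = ages_and_times(demo_rows, 2010, 'm')
```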

In [None]:
# TODO: plot your scatter plot here.  you can write the code directly here or write it in
# manipulate.py and then simply call your functions here.
from manipulate import cross_section

cross_section(load_race_data('results.txt'))

What does the scatter plot suggest (if anything) about the relationship between run time and age?  Is there anything about this dataset that could .

Most people who participated in this race were between the ages of 20 and 50, with a weak positive correlation between age and speed.


Here's an example of what the final plot might look like: <img src="scatter_plot.png">

## Challenge problem: longitudinal study

If you finish the above and want to get a little more data munging experience, consider completing the following challenge problem.  Challenge problems are intended for students who are looking for, well, a little more of a challenge.  Completion of challenge problems is factored into your final course grade.

Now, on to the challenge problem.  The above scatter plot shows us a *cross section* of runners of varying ages, from teenagers to folks in their 80s.  However, we also have race data for a 14-year span, and several runners may compete in the Cherry Blossom in multiple years.  This means it's possible to do a *longitudinal analysis*, looking at how an individual runner's speed changes over the 14-year period.

To complete this challenge, you must do the following:
- Develop a procedure for identifying the same person across different races.  Hint: looking at a person's name is not enough; you might look not only at the person's name but also their hometown and their birth year (which can be determined from their age and the year of the race).
- Identify a subset of runners who have run the Cherry Blossom multiple times over the years.
- Analyze this set of runners in some fashion.  Get creative!  Come up with a visualization or a statistic that captures the effects of aging.
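
As a starting point for the first two steps, runners can be grouped by a key built from name, hometown, and estimated birth year. The sketch below assumes Python 3 and rows in the format returned by `load_race_data`. The function names are hypothetical and the demo rows are made up. Exact matching on birth year is a simplification: `year - age` can be off by one depending on whether the runner's birthday falls before or after race day, so a more careful procedure would tolerate a one-year difference.

```python
from collections import defaultdict


def runner_key(row):
    """Build an identity key from name, hometown, and estimated birth year."""
    birth_year = row['year'] - row['age']
    return (row['name'].strip().lower(),
            row['hometown'].strip().lower(),
            birth_year)


def repeat_runners(rows, min_races=2):
    """Map each key to its races (sorted by year) for repeat runners only."""
    by_runner = defaultdict(list)
    for row in rows:
        by_runner[runner_key(row)].append(row)
    return {key: sorted(races, key=lambda r: r['year'])
            for key, races in by_runner.items()
            if len({r['year'] for r in races}) >= min_races}


# Made-up rows: the same person in two years, plus a namesake.
demo_rows = [
    {'name': 'Pat Doe', 'hometown': 'Arlington VA', 'age': 32,
     'year': 2005, 'time': 80.0},
    {'name': 'Pat Doe', 'hometown': 'Arlington VA', 'age': 30,
     'year': 2003, 'time': 78.5},
    {'name': 'Pat Doe', 'hometown': 'Boston MA', 'age': 55,
     'year': 2005, 'time': 95.0},
]
repeats = repeat_runners(demo_rows)
```

Each value in the result is one runner's races in chronological order, which is a convenient input for plotting time against age per runner.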

In [None]:
# TODO (optional): write your solution to the challenge problem here