# Data Science Project: Clever Name Here
### Francisco Xavier Flores and Pam Needle

Web services are increasingly deployed in infrastructure-as-a-service (IaaS) clouds such as Amazon EC2, Windows Azure, and Rackspace. Industry sources claim that over 1% of Internet traffic goes to EC2 and that EC2 outages hamper a wide variety of services. Our goal is to determine who is using IaaS clouds (specifically EC2), how these services use the cloud, and finally to use (Planet-Lab or EDNS queries) to estimate the impact of wide-area route outages.

**Warning** Acquiring and cleaning data is a messy process, but your approach shouldn't be.  Approach this lab with a rigorous problem solving mindset.  Design and implement a solution that is robust to unexpected inputs and handles these anomalies gracefully.

If you make changes to your code and rerun a Python notebook, your changes may not be detected because Python does not reload modules that are already imported.  The following two lines will force reloads.

In [1]:
%load_ext autoreload
%autoreload 2

# I. Introduction

**TODO: Set the context (introduce the dataset and questions), provide motivation for why these are interesting questions to explore. The introduction should end with a brief summary of your findings.**

- Background: 
- What is EC2?
- What is the cloud?
- all those networky terms explained

# II. Methodology (2 pages)

**TODO: Describe, at a high-level, the methods you employed. Focus more attention on the more challenging/interesting/novel aspects. Provide references to your code as appropriate.**
- describes your data and the methodologies used to acquire, clean and prepare your data
- analysis with references to code


## Data Acquisition

Amazon previously published "Alexa Top 1m Sites", a list of the top 1 million web site domains ordered by Alexa Traffic Rank. This data used to be publicly available; however, Amazon now charges a fee, so we work with the top 1m domains published in 2013. We extracted a list of subdomains from a dataset derived from Alexa's 2013 top 1m domains that contains all subdomains for each domain in the top 1m [http://pages.cs.wisc.edu/~keqhe/imc2013/Alexa_subdomain_dns_records.tar.gz]. Note: these subdomains all date from 2013, so we are assuming that they have remained the same.
We created this list using the following command: 

In [None]:
awk -F'#' '!seen[$2]++ {print $1, $2}' ALL_subdomains_Alexa_top1m.csv > uniquewithrank.txt

- Why subdomains rather than domains? Subdomains of the same domain can be hosted on different hosts.
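
For readers less familiar with awk, the deduplication command above can be sketched in Python. This is only an approximation under assumptions: it treats the first `#`-separated field as the rank and the second as the subdomain (the output filename suggests this, but the layout is not documented here), and `unique_by_subdomain` is a hypothetical name. Unlike the awk one-liner, it skips lines that have no second field.

```python
def unique_by_subdomain(lines, sep="#"):
    """Keep the first (field 1, field 2) pair seen for each distinct field 2.

    Assumption: field 1 is the rank and field 2 is the subdomain.
    """
    seen = set()
    unique = []
    for line in lines:
        fields = line.rstrip("\n").split(sep)
        if len(fields) < 2:
            continue  # malformed line: no second field to deduplicate on
        rank, subdomain = fields[0], fields[1]
        if subdomain not in seen:
            seen.add(subdomain)
            unique.append((rank, subdomain))
    return unique


# Made-up sample lines, for illustration only.
sample_lines = [
    "1#www.example.com#A\n",
    "1#mail.example.com#A\n",
    "2#www.example.com#CNAME\n",
]
unique_pairs = unique_by_subdomain(sample_lines)
```

Only the first occurrence of each subdomain survives, so the third sample line is dropped.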

# III. Results (2 pages)

**TODO: Present your findings through:**
- statistics
- tables
- visualizations


# IV. Conclusions (1/2 page)

**TODO: Summarize the conclusions of your study. This might include a discussion of future work.**

# V. Related Work (1/2 page)

**TODO: Briefly describe related work on this topic (if applicable)**

## Scraping and cleaning the data

In the file `scrape.py`, write a function called `scrape_cherry_blossoms()` that scrapes the web pages of race results for a given set of years and genders and writes the results to one file and errors to another.  The results file should be a CSV file that looks something like this:

    year,gender,place,name,hometown,age,time
    2010,m,1,Stephen Tum,Kenya,24,45.71666666666667
    2010,m,2,Lelisa Desisa,Ethiopia,20,45.733333333333334
    ...

Notice that race time has been converted to minutes. For more details, see the docstring of `scrape_cherry_blossoms()`.

Here are a few guidelines in writing this function:
- Before you start, examine the Cherry Blossom results web pages (View Source in your web browser).  You'll notice that the format varies slightly from page to page.  Try to write the code in a general way so that the same approach can be applied to all 28 pages (14 years x 2 genders).
- Start small: debug your program on one year, then two years, etc. before unleashing it on all years.
- The web pages contain some unicode characters that may cause problems when you try to write them out to a file.  If string `s` is a unicode string, you can encode it in ASCII as follows: `s.encode('ascii', 'replace')`.
- Not only are the pages formatted differently, the *data* varies from page to page.  To aggregate the data across all years, we would like to decide on a uniform data format.  Thus, you will need to make some decisions on how to map each individual result file to this common format.
- As is often the case, there is missing data.  Some runners don't have an age, some don't have a time, etc.  Handle missing data as follows: any record that is missing a value for *any* of the *required fields* should be discarded and put into the errors file.  The required fields are those that appear in the CSV example above.
- Try to process as many years as you can.  Years 2001, 2006 and 2009 are especially tricky and are considered optional **challenge problems**.
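
The missing-data rule in the guidelines above can be sketched as a small partitioning step. This is a minimal illustration, not the required implementation: it assumes each parsed record is a dict keyed by the column names from the CSV example, and `REQUIRED_FIELDS` and `split_valid_and_errors` are hypothetical names.

```python
# Column names taken from the CSV example in this section.
REQUIRED_FIELDS = ["year", "gender", "place", "name", "hometown", "age", "time"]


def split_valid_and_errors(records):
    """Partition records into (valid, errors).

    A record goes to errors if any required field is absent, None, or "".
    """
    valid, errors = [], []
    for record in records:
        missing = [f for f in REQUIRED_FIELDS if record.get(f) in (None, "")]
        if missing:
            errors.append(record)
        else:
            valid.append(record)
    return valid, errors


complete = {"year": 2010, "gender": "m", "place": 1, "name": "Stephen Tum",
            "hometown": "Kenya", "age": 24, "time": 45.71666666666667}
no_age = dict(complete, age="")  # same record with the age blanked out
valid, errors = split_valid_and_errors([complete, no_age])
```

The `valid` list would then be written to the results file and `errors` to the errors file.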

Invoke your function here, passing in as many years as you can process:

In [2]:
import geo

ImportError: No module named geo

In the space below, explain how you handled time:

We compared the result formats across years and determined, for each year, which column holds the time. Most years use the 7th or 8th column for gun time/time, except 2008, which uses the 11th column, and 1999, which uses the 6th. We created lists for each of those groups and then selected the corresponding regex group for those years. Where some of the columns were empty, we used conditional statements to select the correct regex group. After extracting the correct time value, we used a helper function, `time_in_mins`, to calculate the total time in minutes.
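
The conversion step could look roughly like the sketch below. The body of `time_in_mins` is not shown in this notebook, so this version is illustrative only. It assumes the extracted value is a clean `H:MM:SS` or `MM:SS` string; trailing markers or other noise in the raw pages would need to be stripped first.

```python
def time_in_mins(time_str):
    """Convert an 'H:MM:SS' or 'MM:SS' string to total minutes as a float."""
    parts = [int(p) for p in time_str.strip().split(":")]
    if len(parts) == 3:
        hours, minutes, seconds = parts
    elif len(parts) == 2:
        hours = 0
        minutes, seconds = parts
    else:
        raise ValueError("unrecognized time format: %r" % time_str)
    return hours * 60 + minutes + seconds / 60.0


winning_time = time_in_mins("45:43")   # 45 minutes 43 seconds
longer_time = time_in_mins("1:05:30")  # 1 hour 5 minutes 30 seconds
```

Under these assumptions, `45:43` comes out at roughly 45.717 minutes, which matches the first row of the CSV example.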

## Exploring the data

In `manipulate.py`, write a function called `load_race_data(filename)` that reads a CSV file named `filename` and parses each field into the appropriate type.  The result should be a list of dicts where each dict represents one row of data (i.e., the format shown on p. 129 of DSFS).  For example, if the result file contains men's results from 2010, the output of the `load_race_data` function should look like this:

    [
     {'name': 'Stephen Tum', 'gender': 'm', 'age': 24, 
      'year': 2010, 'hometown': 'Kenya', 'place': 1, 'time': 45.71666666666667}, 
     {'name': 'Lelisa Desisa', 'gender': 'm', 'age': 20, 
      'year': 2010, 'hometown': 'Ethiopia', 'place': 2, 'time': 45.733333333333334},
     ...
    ]
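
One possible shape for `load_race_data` is sketched below, assuming Python 3 and a results file whose header row matches the CSV example earlier in this notebook. The `converters` mapping is a hypothetical design choice, not something the lab prescribes. The demo at the bottom writes the two sample rows to a throwaway file so the sketch can run on its own.

```python
import csv
import os
import tempfile


def load_race_data(filename):
    """Read the results CSV and return a list of dicts with typed values."""
    # Columns not listed here (name, gender, hometown) stay as strings.
    converters = {"year": int, "place": int, "age": int, "time": float}
    rows = []
    with open(filename, newline="") as f:
        for raw in csv.DictReader(f):
            rows.append({key: converters.get(key, str)(value)
                         for key, value in raw.items()})
    return rows


# Demo: the two sample rows from the CSV example, written to a temp file.
sample = (
    "year,gender,place,name,hometown,age,time\n"
    "2010,m,1,Stephen Tum,Kenya,24,45.71666666666667\n"
    "2010,m,2,Lelisa Desisa,Ethiopia,20,45.733333333333334\n"
)
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    tmp.write(sample)
rows = load_race_data(tmp.name)
os.remove(tmp.name)
```

This sketch raises `ValueError` on a field that cannot be converted. Since the scraper is supposed to discard incomplete records, that should be rare, but a more defensive loader could log and skip such rows instead.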

Now we can manipulate the data to answer some basic questions.  

1. What is the age range for male runners?
2. What is the age range for female runners?
3. Who had the fastest running time across all years/genders, how fast did they run and in what year?
4. Who had the slowest running time across all years/genders and how fast did they run and in what year?

If any of these answers seem odd, go back to your data cleaning process and the source data.  If the oddity is due to the source data, you can leave it.
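
The four questions above reduce to `min` and `max` over the list of dicts. The sketch below assumes rows in the format returned by `load_race_data`. The helper names `age_range` and `fastest_and_slowest` are hypothetical, and the two rows are the sample rows from this section rather than real results for all years.

```python
def age_range(rows, gender):
    """Return (youngest, oldest) for one gender, or None if there are no rows."""
    ages = [r["age"] for r in rows if r["gender"] == gender]
    if not ages:
        return None
    return min(ages), max(ages)


def fastest_and_slowest(rows):
    """Return the rows with the smallest and largest time, in that order."""
    fastest = min(rows, key=lambda r: r["time"])
    slowest = max(rows, key=lambda r: r["time"])
    return fastest, slowest


sample_rows = [
    {"name": "Stephen Tum", "gender": "m", "age": 24,
     "year": 2010, "hometown": "Kenya", "place": 1, "time": 45.71666666666667},
    {"name": "Lelisa Desisa", "gender": "m", "age": 20,
     "year": 2010, "hometown": "Ethiopia", "place": 2, "time": 45.733333333333334},
]
male_ages = age_range(sample_rows, "m")
female_ages = age_range(sample_rows, "f")  # no female rows in this sample
fastest, slowest = fastest_and_slowest(sample_rows)
```

Each returned row carries the name, time, and year, which is what questions 3 and 4 ask for.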