# Part 1. Exploratory Data Analysis

## Introduction
In this evaluation we ask you questions that we expect you to answer using data. For each question we ask you to complete a series of tasks that should help guide you through the data analysis. Complete these tasks and then write a short answer to the question.

#### Data
Use [Sean Lahman's Baseball Database](http://seanlahman.com/baseball-archive/statistics), an online database that contains the "complete batting and pitching statistics from 1871 to 2013, plus fielding statistics, standings, team stats, managerial records, post-season data, and more. For more details on the latest release, please [read the documentation](http://seanlahman.com/files/database/readme2012.txt)."

#### Purpose
There are three main goals for you to demonstrate:

* Load CSV files from the web.
* Create functions in Python.
* Create plots and summary statistics for exploratory data analysis, such as histograms, boxplots, and scatter plots.

#### Useful libraries for this test 
* [numpy](http://docs.scipy.org/doc/numpy-dev/user/index.html), for arrays
* [pandas](http://pandas.pydata.org/), for data frames
* [matplotlib](http://matplotlib.org/), for plotting

---

In [None]:
# special IPython command to prepare the notebook for matplotlib
%matplotlib inline 

import numpy
import pandas
import matplotlib.pyplot as plt

### Problem setup

There is evidence that the 2002 and 2003 Oakland A's, a team that used data science, had a competitive advantage. It is now widely known that baseball relies heavily on data science, and other teams have started using these methods as well. Use exploratory data analysis to determine whether the competitive advantage still exists.

#### Problem 1(a) 
Load [these CSV files](http://seanlahman.com/files/database/lahman-csv_2014-02-14.zip) from [Sean Lahman's Baseball Database](http://seanlahman.com/baseball-archive/statistics).

This task only uses the 'Salaries.csv' and 'Teams.csv' tables. Read these tables into a pandas `DataFrame` and show the head of each table. 

**Hint**: Use the [requests](http://docs.python-requests.org/en/latest/), [BytesIO](https://docs.python.org/3.7/library/io.html#binary-i-o), and [zipfile](https://docs.python.org/3.7/library/zipfile.html) modules to download the archive from the web and read the CSV files inside it.
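The hint can be sketched roughly as follows. This is a minimal illustration, not a reference solution: the helper names `download_bytes` and `read_csv_from_zip` are hypothetical, and the sketch assumes each table can be found in the archive by the end of its member name (the folder layout inside the zip is not specified here).

```python
import io
import zipfile

import pandas as pd


def download_bytes(url, timeout=60):
    """Fetch a URL and return the raw response body (hypothetical helper)."""
    import requests  # imported lazily so the zip helper works without it

    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # fail loudly on HTTP errors
    return response.content


def read_csv_from_zip(zip_bytes, member_suffix):
    """Return a DataFrame for the first member whose name ends with member_suffix."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            if name.endswith(member_suffix):
                with archive.open(name) as handle:
                    return pd.read_csv(handle)
    raise KeyError(f"no member ending with {member_suffix!r} in archive")


# Possible usage in a notebook cell (requires network access):
# zip_bytes = download_bytes("http://seanlahman.com/files/database/lahman-csv_2014-02-14.zip")
# salaries = read_csv_from_zip(zip_bytes, "Salaries.csv")
# teams = read_csv_from_zip(zip_bytes, "Teams.csv")
# salaries.head()
```

Matching on the end of the member name is a design choice that tolerates a top-level folder inside the archive; if the archive is flat, `archive.open("Salaries.csv")` would work directly.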

In [None]:
#your code here

#### Problem 1(b)

Summarize the Salaries DataFrame to show the total salaries for each team for each year. Show the head of the new summarized DataFrame. 
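One common way to build this kind of summary is a `groupby` followed by `sum`. The sketch below uses a small made-up table, not the real data, and assumes the Salaries table has `yearID`, `teamID`, and `salary` columns; the variable names are illustrative.

```python
import pandas as pd

# Toy stand-in for Salaries.csv: one row per player per season (made-up numbers).
toy_salaries = pd.DataFrame(
    {
        "yearID": [2002, 2002, 2002, 2003],
        "teamID": ["OAK", "OAK", "NYA", "OAK"],
        "salary": [100, 250, 900, 400],
    }
)

# Sum player salaries within each (year, team) pair.
# as_index=False keeps yearID and teamID as ordinary columns.
total_salaries = toy_salaries.groupby(["yearID", "teamID"], as_index=False)["salary"].sum()

print(total_salaries.head())
```

Keeping the grouping keys as columns rather than as an index makes the result easier to merge with another table later.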

In [None]:
#your code here

#### Problem 1(c)

Merge the new summarized Salaries DataFrame and the Teams DataFrame to create a new DataFrame showing wins and total salaries for each team for each year. Show the head of the new merged DataFrame.

**Hint**: Merge the DataFrames using `teamID` and `yearID`.
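A merge on two key columns might look like the sketch below. The tables are toy data with made-up numbers, and the sketch assumes the Teams table stores wins in a column named `W`; an inner join is one reasonable choice, since it keeps only team seasons present in both tables.

```python
import pandas as pd

# Toy summarized salaries: one row per team per year (made-up numbers).
toy_totals = pd.DataFrame(
    {"yearID": [2002, 2002], "teamID": ["OAK", "NYA"], "salary": [350, 900]}
)

# Toy stand-in for Teams.csv; the 2001 row has no salary counterpart.
toy_teams = pd.DataFrame(
    {"yearID": [2002, 2002, 2001], "teamID": ["OAK", "NYA", "OAK"], "W": [95, 90, 88]}
)

# Match rows that agree on both keys; unmatched rows are dropped by the inner join.
merged = pd.merge(toy_totals, toy_teams, how="inner", on=["yearID", "teamID"])

print(merged.head())
```

Checking `len(merged)` against the length of each input is a quick way to notice rows lost or duplicated by the join.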

In [None]:
#your code here

#### Problem 1(d)

How would you graphically display the relationship between total wins and total salaries for a given year? What kind of plot would be best? Choose a plot to show this relationship and specifically annotate the Oakland baseball team on the plot. Show this plot across multiple years.

**Hints**: Use a `for` loop to consider multiple years. Use the `teamID` (the three-letter representation of the team name) to save space on the plot.
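The hints could be combined along the lines of the sketch below, which draws one scatter panel per year and labels every point with its `teamID`. The function name `plot_wins_vs_salary` and its parameters are hypothetical, a scatter plot is only one possible choice, and the sketch assumes a merged table with `yearID`, `teamID`, `salary`, and `W` columns.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_wins_vs_salary(df, years, highlight="OAK"):
    """Draw one salary-vs-wins scatter panel per year, labelling each team."""
    # squeeze=False keeps axes two-dimensional even for a single year.
    fig, axes = plt.subplots(1, len(years), figsize=(5 * len(years), 4), squeeze=False)
    for ax, year in zip(axes[0], years):
        season = df[df["yearID"] == year]
        ax.scatter(season["salary"] / 1e6, season["W"], color="lightgray")
        for _, row in season.iterrows():
            is_highlight = row["teamID"] == highlight
            # Label each point with its teamID; emphasize the highlighted team.
            ax.annotate(
                row["teamID"],
                (row["salary"] / 1e6, row["W"]),
                color="red" if is_highlight else "black",
                fontweight="bold" if is_highlight else "normal",
            )
        ax.set_title(str(year))
        ax.set_xlabel("Total salary (millions USD)")
        ax.set_ylabel("Wins")
    fig.tight_layout()
    return fig
```

In a notebook cell, a call such as `plot_wins_vs_salary(merged, years=[2002, 2003])` would render the panels inline, where `merged` stands for whatever name the merged DataFrame was given.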

In [None]:
#your code here

# Part 2. Web Development

## Introduction
At the Department of Biomedical Informatics, the Data Core assists faculty members with the creation of tools to enable cutting-edge research in multidisciplinary fields. Common tasks such as data warehousing, collection, and protection are not the expertise of the clinical faculty, and it’s our job to be those specialists.

#### Purpose
Show off your web development chops. Share an online repo demonstrating a recent project. Alternatively, if you don’t have one to share, complete the example project described below. The goal here is to help us understand how you develop software as a whole, including design, documentation, and testing. For this exercise, clear code and intentional design are the most important. We recognize that you probably have a lot of other

**Example project description**
A key project managed by DBMI is the Undiagnosed Diseases Network (UDN). The UDN platform is a Django Python app that collects information about participants and allows hospitals and medical centers to review participant information.

For this test we would like you to create a small Django app to show us how you approach coding problems and demonstrate some knowledge of the Python ecosystem. 
