# What is a mini-project?

A mini-project is where you will begin to develop code and analysis to explore a data set and address questions. The format differs from a weekly lab in that you will develop your own notebook below.  **There will be no automatic test of your answers.** As a budding data scientist you need to develop the habit of checking your own work.

You will create all the notebook code and **use extensive markdown to document** the process and your observations. This can be challenging at first, but it is the way to truly understand the code development process and build confidence using code to analyze data.

# Olympics Data mini-project
## Overview
In celebration of the Olympic spirit we will analyze trends in a data set which spans from the 1896 Athens games to Rio in 2016. With this data we will explore trends in medals awarded, sports, and countries, as well as any host country advantage. The dataset is from Kaggle (https://www.kaggle.com), a data science dataset, coding, and competition site. The mini-project represents your first chance to try out your coding and data skills to address specific questions without template code. Look to your previous labs and our work in class for ideas.

(Data source:
[Kaggle dataset](https://www.kaggle.com/heesoo37/120-years-of-olympic-history-athletes-and-results))

Background Resources:
- Data Tables (Inferential Thinking 6.1-6.4) 
- Visualization (Inferential Thinking 7.1, 7.2)
- Cross-classifying (Inferential Thinking 8.3)

## Data Set
Athletes: Olympic_Data/athlete_events.csv
Source: Kaggle https://www.kaggle.com/heesoo37/120-years-of-olympic-history-athletes-and-results

The dataset contains the following columns:
### Data Fields
1. ID - Unique number for each athlete
2. Name - Athlete's name
3. Sex - M or F
4. Age - Integer
5. Height - In centimeters
6. Weight - In kilograms
7. Team - Team name
8. NOC - National Olympic Committee 3-letter code
9. Games - Year and season
10. Year - Integer
11. Season - Summer or Winter
12. City - Host city
13. Sport - Sport
14. Event - Event
15. Medal - Gold, Silver, Bronze, or nan

## Initialization

In [1]:
# Extra Python functionality to import
from datascience import *  # datascience Table 
import EDS
import numpy as np
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('ggplot')
import os
user = os.getenv('JUPYTERHUB_USER')

In [None]:
# Enter your name as a string
name = ...

We will limit our project to data from the Winter Olympics by using the where method, `.where("Season","Winter")`, which leaves us with 18,923 individual athletes and 48,564 athlete/event data points (many athletes compete in multiple events and/or over multiple Olympics).

In [None]:
datafile = "Olympic_Data/winter_athletes.csv"
athletes = Table.read_table(datafile).sort("Year",descending=True).where("Season","Winter")
athletes

### Considerations
- Each observation is an athlete's performance in a particular event for a particular year.
- `.group()` is a very useful Table method to group by Team, Year, Age, etc.
- Athletes can perform in multiple events and over multiple years
- `athletes.group("Name", np.unique)` provides unique athlete names and combines each athlete's entries into a list. This can make analysis of specific performances harder, but it gives a good count of total athletes: `athletes.group("Name", np.unique).num_rows`

In [None]:
athletes.group("Name", np.unique).num_rows

In [None]:
athletes.num_rows
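
The difference between the two counts above can be seen on a toy example. This is a minimal sketch using plain NumPy on made-up names (not taken from the dataset), so the variable names here are purely illustrative.

```python
import numpy as np

# Made-up names, one entry per athlete/event row (not from the dataset).
names = np.array(["Ana", "Ben", "Ana", "Cy", "Ben", "Ana"])

n_rows = names.size                 # every athlete/event row is counted
n_athletes = np.unique(names).size  # each distinct name is counted once

print("rows:", n_rows, "distinct names:", n_athletes)
```

One caveat worth checking in your own analysis: two different athletes who share a name would be merged by this approach. Since the `ID` field is described as unique for each athlete, grouping on `ID` instead of `Name` may give a more reliable count.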

## Mini-Project Questions to address
Now develop your project to address the questions below. For each task a few blank code cells have been added, but you will be adding lines of code and markdown as needed below. Feel free to expand the project to explore ideas of interest.

**You typically begin exploration of a data set by computing a few basic statistics and by plotting the data distribution.**

1. What is the earliest year for a Winter Olympics in this dataset? Check this value against the official results at [https://olympics.com/en/olympic-games/olympic-results](https://olympics.com/en/olympic-games/olympic-results). Does it match the data?


2. Examine the distribution of the age of all Olympians with a histogram. What do you find? A complementary view of the distribution is a five-number summary, which here consists of the min, max, median, mean, and standard deviation. Compute it by applying `np.min`, `np.max`, `np.median`, `np.mean`, and `np.std` to the corresponding column array or, better yet, write a function that computes and displays the summary given a table and a column label. Because a given athlete can appear in multiple events, a better way to examine the age distribution of athletes is to group the data by name using a function such as `np.average`, e.g. `athletes.group("Name", np.average)`.

5. What are the top ten countries by number of gold, silver, and bronze medals, and by total medals? You should end up with four top-ten lists, one per scenario. Again, compute the five-number summary (min, max, median, mean, and standard deviation) of each medal type across all countries.
Hint: use `.where("Medal",are.not_equal_to("nan"))` to keep only medal winners. Consider how to create a column for the sum of the three medal categories.

6. What are the top 5 sports in terms of number of athletes?

7. Which sports (top 5) awarded the most medals in Lake Placid, New York (1980, https://www.lakeplacid.com/do/activities/olympic-sites)?
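
The summary function mentioned in the age question could look something like the sketch below. The helper name `summarize` is hypothetical, and the example ages are made up. One deliberate swap: it uses the NaN-aware variants (`np.nanmin`, `np.nanmedian`, and so on) instead of the plain functions named above, on the assumption that a column may contain missing values, in which case `np.min` and friends would return `nan`.

```python
import numpy as np

def summarize(values):
    """Return min, max, median, mean, and standard deviation of a 1-D array.

    Hypothetical helper: missing values (NaN) are ignored, which is an
    assumption about how you want to treat them, not a requirement.
    """
    values = np.asarray(values, dtype=float)
    return {
        "min": np.nanmin(values),
        "max": np.nanmax(values),
        "median": np.nanmedian(values),
        "mean": np.nanmean(values),
        "std": np.nanstd(values),  # population standard deviation (ddof=0)
    }

# Made-up ages with one missing value, for illustration only.
ages = [20, 22, 24, 30, np.nan]
summary = summarize(ages)
for label, value in summary.items():
    print(f"{label:>6}: {value:.2f}")
```

With a Table, the array to pass in would come from the column of interest, for example `summarize(athletes.column("Age"))`. Checking the result against a histogram of the same column is a quick way to test your own work.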

**Challenge Question:** Does the host country team have an advantage? To get at this you may need to create another column in the athletes Table with the team name of the host country. Use a markdown cell to create a strategy to address this question. **This is the sort of research question that can emerge during your data exploration.**
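
One possible starting point for the host-country column is a lookup from host city to host team. The sketch below uses plain Python on a few made-up rows. The mapping `HOST_TEAM`, the helper `host_team`, and the `Host` and `IsHost` labels are all hypothetical, only two cities are filled in, and the assumption that the team names are spelled this way in the `Team` column should be checked against the data.

```python
# Illustrative city -> host team lookup. Only two entries are shown; the
# remaining host cities would need to be filled in and verified.
HOST_TEAM = {
    "Lake Placid": "United States",
    "Lillehammer": "Norway",
}

def host_team(city):
    """Return the host team for a city, or 'unknown' if it is not mapped."""
    return HOST_TEAM.get(city, "unknown")

# Made-up athlete/event rows, for illustration only.
rows = [
    {"City": "Lake Placid", "Team": "United States", "Medal": "Gold"},
    {"City": "Lake Placid", "Team": "Norway", "Medal": "Silver"},
    {"City": "Lillehammer", "Team": "Norway", "Medal": "Gold"},
    {"City": "Lillehammer", "Team": "United States", "Medal": "nan"},
]

# Add the host team and a flag marking rows where the athlete's team hosted.
for row in rows:
    row["Host"] = host_team(row["City"])
    row["IsHost"] = row["Team"] == row["Host"]

# Count medals won by a team competing at home.
host_medals = sum(1 for r in rows if r["IsHost"] and r["Medal"] != "nan")
print("medals won at home:", host_medals)
```

In the notebook, the same idea could probably be expressed with the `datascience` methods `Table.apply` (to run `host_team` over the `City` column) and `with_column` (to attach the result). A fair test of a host advantage would then compare a team's medal count in the years it hosted with its counts in other years.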

#### Time Trends

9. Plot the trend in number of athletes per year. What is the trend?
<br>Hint: `athletes.group("Year").plot("Year","count")`

10. Plot the number of medals per year. What is the trend? How does this trend compare to that of the number of athletes?

11. Team sports award every team member the same medal. Plot the gold medal trend excluding “Ice Hockey”. Why exclude hockey?

12. Plot the yearly trend in number of sports. Think of a strategy to code this. What is the trend? 

13. Plot an overlay of gold, silver, and bronze medals as a function of year on the same plot excluding hockey. What is the trend? Are the medals awarded at a similar rate?

14. Compare the US and Norway medal counts as a function of year by overlaying their counts. Hint: you could create separate tables for the US and Norway using an appropriate `.where` call. These tables can then be combined with the Table `.append` method, which adds the rows of one table to another, for instance `NORUSA = US.append(Norway)`. Note that `.append` modifies `US` in place, so work on a copy (`US.copy()`) if you still need the original table.

15. Now use a scatter plot (`.scatter()`) to look at the number of athletes per year for the US versus that for Norway. What trends do you see?
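
Comparing two countries year by year requires their counts to line up on the same set of years, including years in which one of them has no entries. The sketch below shows that alignment step in plain Python on made-up `(team, year)` pairs. The codes `USA` and `NOR` and all variable names are assumptions for illustration, not values confirmed from the dataset.

```python
from collections import Counter

# Made-up (team, year) pairs, one per athlete/event row.
rows = [
    ("USA", 1980), ("USA", 1980), ("NOR", 1980),
    ("USA", 1984), ("NOR", 1984), ("NOR", 1984),
    ("NOR", 1988),
]

# Count rows per year for each team.
usa = Counter(year for team, year in rows if team == "USA")
nor = Counter(year for team, year in rows if team == "NOR")

# Shared, sorted year axis. A Counter returns 0 for a missing year,
# so both lists have one value per year and stay aligned.
years = sorted(set(usa) | set(nor))
usa_counts = [usa[y] for y in years]
nor_counts = [nor[y] for y in years]

for y, u, n in zip(years, usa_counts, nor_counts):
    print(y, u, n)
```

Once the lists are aligned, `plt.plot(years, usa_counts)` and `plt.plot(years, nor_counts)` would give an overlay, and `plt.scatter(usa_counts, nor_counts)` would give one point per year for the scatter comparison.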

#### Ideas for future exploration

Data sets are available in this folder for average annual temperature by country, country population, and the highest peak in each country. How might you use these data to come up with further insights into which countries have more athletes and medals? Furthermore, global warming is exerting pressure on winter sports. How might we examine this impact?

### <font color=blue> **Feedback** </font>

Please include a reflection. 
* How did this mini-project go? 
* Was it difficult to write code without a template?
* Did you seek help from any of the instructors or class assistants?
* Were there questions you found especially challenging you would like your instructor to review in class? 
* How long did the project take you to complete?
  
Share your feedback so we can continue to improve this class!

**Insert a markdown cell below this one and write your reflection on this mini-project.**

### <font color=blue> **Before you submit:** </font>
* Check that you answered all of the questions.
* Check that you used markdown cells to document your process (even if you couldn't get the answer) and conclusions.
* Check that all notebook cells have been executed from top to bottom.

In [None]:
import time

print("Nice work,", user)
# Record the local time at which the notebook was run for submission
localtime = time.asctime(time.localtime(time.time()))
print("Submitted @", localtime)