# SI 330 - Homework #1: Data Manipulation

## Background

This homework assignment focuses on the analysis of historical data from the Olympic games.  The description of the data includes the following:
> This is a historical dataset on the modern Olympic Games, including all the Games from Athens 1896 to Rio 2016. I scraped this data from www.sports-reference.com in May 2018.

Your main task in this assignment is to explore the data *using the data
manipulation methods we covered in class* as well as those in the assigned readings.  You may need to consult pandas documentation, Stack Overflow, or other online resources.  

** You should also feel free to ask questions on the [class Slack homework channel] (https://si330wn2019.slack.com/messages/CFA5AJPCL/)! **

A total of 50 points is available in this homework assignment, consisting of:
- 40 points for completing the specific homework questions. More comprehensive 
answers will tend to gain more points.
- 5 points for the overall quality of spelling, grammar, punctuation, and style of written responses.  (see https://faculty.washington.edu/heagerty/Courses/b572/public/StrunkWhite.pdf for a detailed specifications).
- 5 points for creating code that conforms to [PEP 8](https://www.python.org/dev/peps/pep-0008/) guidelines.  You should review those guidelines before proceding with the assignment.


## Answer the questions below. 
For each question, you should
1. Write code using Python and pandas that can help you answer the following questions, and
2. Explain your answers in plain English. You should use complete sentences that would be understood by an educated professional who is not necessarily a data scientist (like a product manager).

### <font color="magenta"> Q1 (6 points): Explore and Describe the dataset. 
- How many rows and columns are there in the data frame? Please use the appropriate dataframe property.
- Provide summary statistics (i.e. use the .describe() function) for age, height, and weight. Please present this information in a single dataframe.

In [2]:
import pandas as pd
import numpy as np
from math import *

In [6]:
# read and open athlete_events.csv
olympics = pd.read_csv('data/athlete_events.csv')
# finds dimensions of data file
olympics.shape

There are 15 columns and 271116 rows. 

In [120]:
o1 = olympics[['Age', 'Height', 'Weight']].copy()

In [121]:
o1.describe()

Unnamed: 0,Age,Height,Weight
count,261642.0,210945.0,208241.0
mean,25.556898,175.33897,70.702393
std,6.393561,10.518462,14.34802
min,10.0,127.0,25.0
25%,21.0,168.0,60.0
50%,24.0,175.0,70.0
75%,28.0,183.0,79.0
max,97.0,226.0,214.0


I used the describe() function to find the summary statistic for age, height and weight.
There are 261642 entries for age. The mean age is 25.5. The min is 10.0 and the max is 97.0. 
There are 210945 entries for height. The mean height is 175.3. The min is 127.0 and the max is 226.0.
There are 208241 entries for weight. The mean weight is 70.7. The min is 25.0 and the max is 214.0.

### <font color="magenta"> Q2 (8 points): How many unique athletes are in the dataset? What proportion of unique athletes are female?
    - Ideally, your code should return a proportion without hardcoding.

In [129]:
number = olympics.Name
number.nunique()

134732

There are 134732 unique athletes in the dataset.

In [122]:
olympics[olympics.Sex=='F']

Unnamed: 0,ID,Name,Sex,Age,Height,Weight,Team,NOC,Games,Year,Season,City,Sport,Event,Medal
4,5,Christine Jacoba Aaftink,F,21.0,185.0,82.0,Netherlands,NED,1988 Winter,1988,Winter,Calgary,Speed Skating,Speed Skating Women's 500 metres,
5,5,Christine Jacoba Aaftink,F,21.0,185.0,82.0,Netherlands,NED,1988 Winter,1988,Winter,Calgary,Speed Skating,"Speed Skating Women's 1,000 metres",
6,5,Christine Jacoba Aaftink,F,25.0,185.0,82.0,Netherlands,NED,1992 Winter,1992,Winter,Albertville,Speed Skating,Speed Skating Women's 500 metres,
7,5,Christine Jacoba Aaftink,F,25.0,185.0,82.0,Netherlands,NED,1992 Winter,1992,Winter,Albertville,Speed Skating,"Speed Skating Women's 1,000 metres",
8,5,Christine Jacoba Aaftink,F,27.0,185.0,82.0,Netherlands,NED,1994 Winter,1994,Winter,Lillehammer,Speed Skating,Speed Skating Women's 500 metres,
9,5,Christine Jacoba Aaftink,F,27.0,185.0,82.0,Netherlands,NED,1994 Winter,1994,Winter,Lillehammer,Speed Skating,"Speed Skating Women's 1,000 metres",
26,8,"Cornelia ""Cor"" Aalten (-Strannood)",F,18.0,168.0,,Netherlands,NED,1932 Summer,1932,Summer,Los Angeles,Athletics,Athletics Women's 100 metres,
27,8,"Cornelia ""Cor"" Aalten (-Strannood)",F,18.0,168.0,,Netherlands,NED,1932 Summer,1932,Summer,Los Angeles,Athletics,Athletics Women's 4 x 100 metres Relay,
32,13,Minna Maarit Aalto,F,30.0,159.0,55.5,Finland,FIN,1996 Summer,1996,Summer,Atlanta,Sailing,Sailing Women's Windsurfer,
33,13,Minna Maarit Aalto,F,34.0,159.0,55.5,Finland,FIN,2000 Summer,2000,Summer,Sydney,Sailing,Sailing Women's Windsurfer,


In [130]:
sex = olympics.groupby('Sex').size()
sex

Sex
F     74522
M    196594
dtype: int64

(Use this space to explain your answers)

In [131]:
sex.loc['F']/(sex.loc['F']+sex.loc['M'])

0.2748712728131132

Around 0.275 of the unique athletes are females.

### <font color="magenta"> Q3 (10 points): Looking at the time period from 1950 to today, which athelete competed in the most number of events? 
In which unique events did the athelete participate, and for what range of years? Which country did the athlete represent?

In [132]:
# filters athletes who participated from 1950 to today
athlete = olympics[olympics.Year>=1950].copy()
athlete.head()

Unnamed: 0,ID,Name,Sex,Age,Height,Weight,Team,NOC,Games,Year,Season,City,Sport,Event,Medal
0,1,A Dijiang,M,24.0,180.0,80.0,China,CHN,1992 Summer,1992,Summer,Barcelona,Basketball,Basketball Men's Basketball,
1,2,A Lamusi,M,23.0,170.0,60.0,China,CHN,2012 Summer,2012,Summer,London,Judo,Judo Men's Extra-Lightweight,
4,5,Christine Jacoba Aaftink,F,21.0,185.0,82.0,Netherlands,NED,1988 Winter,1988,Winter,Calgary,Speed Skating,Speed Skating Women's 500 metres,
5,5,Christine Jacoba Aaftink,F,21.0,185.0,82.0,Netherlands,NED,1988 Winter,1988,Winter,Calgary,Speed Skating,"Speed Skating Women's 1,000 metres",
6,5,Christine Jacoba Aaftink,F,25.0,185.0,82.0,Netherlands,NED,1992 Winter,1992,Winter,Albertville,Speed Skating,Speed Skating Women's 500 metres,


In [11]:
# finds athlete id who competed in the most events
athlete.groupby('ID').size().idxmax()

89187

In [137]:
# finds athlete info with id 89187
most = athlete[athlete.ID==89187].copy()
most

Unnamed: 0,ID,Name,Sex,Age,Height,Weight,Team,NOC,Games,Year,Season,City,Sport,Event,Medal
177408,89187,Takashi Ono,M,20.0,160.0,58.0,Japan,JPN,1952 Summer,1952,Summer,Helsinki,Gymnastics,Gymnastics Men's Individual All-Around,
177409,89187,Takashi Ono,M,20.0,160.0,58.0,Japan,JPN,1952 Summer,1952,Summer,Helsinki,Gymnastics,Gymnastics Men's Team All-Around,
177410,89187,Takashi Ono,M,20.0,160.0,58.0,Japan,JPN,1952 Summer,1952,Summer,Helsinki,Gymnastics,Gymnastics Men's Floor Exercise,
177411,89187,Takashi Ono,M,20.0,160.0,58.0,Japan,JPN,1952 Summer,1952,Summer,Helsinki,Gymnastics,Gymnastics Men's Horse Vault,Bronze
177412,89187,Takashi Ono,M,20.0,160.0,58.0,Japan,JPN,1952 Summer,1952,Summer,Helsinki,Gymnastics,Gymnastics Men's Parallel Bars,
177413,89187,Takashi Ono,M,20.0,160.0,58.0,Japan,JPN,1952 Summer,1952,Summer,Helsinki,Gymnastics,Gymnastics Men's Horizontal Bar,
177414,89187,Takashi Ono,M,20.0,160.0,58.0,Japan,JPN,1952 Summer,1952,Summer,Helsinki,Gymnastics,Gymnastics Men's Rings,
177415,89187,Takashi Ono,M,20.0,160.0,58.0,Japan,JPN,1952 Summer,1952,Summer,Helsinki,Gymnastics,Gymnastics Men's Pommelled Horse,
177416,89187,Takashi Ono,M,25.0,160.0,58.0,Japan,JPN,1956 Summer,1956,Summer,Melbourne,Gymnastics,Gymnastics Men's Individual All-Around,Silver
177417,89187,Takashi Ono,M,25.0,160.0,58.0,Japan,JPN,1956 Summer,1956,Summer,Melbourne,Gymnastics,Gymnastics Men's Team All-Around,Silver


In [138]:
# used athlete id to find name and country of person who participated in the most events
most[['Name', 'Team']].head(1)

Unnamed: 0,Name,Team
177408,Takashi Ono,Japan


From 1950 to today, Takashi Ono from Japan participated in the most Olympic events

In [139]:
# Finds the events Takashi Ono participated in
most['Event']

177408    Gymnastics Men's Individual All-Around
177409          Gymnastics Men's Team All-Around
177410           Gymnastics Men's Floor Exercise
177411              Gymnastics Men's Horse Vault
177412            Gymnastics Men's Parallel Bars
177413           Gymnastics Men's Horizontal Bar
177414                    Gymnastics Men's Rings
177415          Gymnastics Men's Pommelled Horse
177416    Gymnastics Men's Individual All-Around
177417          Gymnastics Men's Team All-Around
177418           Gymnastics Men's Floor Exercise
177419              Gymnastics Men's Horse Vault
177420            Gymnastics Men's Parallel Bars
177421           Gymnastics Men's Horizontal Bar
177422                    Gymnastics Men's Rings
177423          Gymnastics Men's Pommelled Horse
177424    Gymnastics Men's Individual All-Around
177425          Gymnastics Men's Team All-Around
177426           Gymnastics Men's Floor Exercise
177427              Gymnastics Men's Horse Vault
177428            Gy

See above for list of events Takashi Ono participated in

In [27]:
# Finds the last year Takashi Ono participated in the olympics
most.loc[most['Year'].idxmax()]

ID                                         89187
Name                                 Takashi Ono
Sex                                            M
Age                                           33
Height                                       160
Weight                                        58
Team                                       Japan
NOC                                          JPN
Games                                1964 Summer
Year                                        1964
Season                                    Summer
City                                       Tokyo
Sport                                 Gymnastics
Event     Gymnastics Men's Individual All-Around
Medal                                        NaN
Name: 177432, dtype: object

The last year Takashi Ono participated in the Olympics was 1964.

In [140]:
# Finds the first year Takashi Ono participated in the olympics
most.loc[most['Year'].idxmin()]

ID                                         89187
Name                                 Takashi Ono
Sex                                            M
Age                                           20
Height                                       160
Weight                                        58
Team                                       Japan
NOC                                          JPN
Games                                1952 Summer
Year                                        1952
Season                                    Summer
City                                    Helsinki
Sport                                 Gymnastics
Event     Gymnastics Men's Individual All-Around
Medal                                        NaN
Name: 177408, dtype: object

The first year Takashi Ono participated in the Olympics was 1952.

Takashi Ono from Japan competed from 1952 to 1964.

### <font color="magenta"> Q4 (7 points): Which sport has the lowest median athlete age? Is there a tie?

In [143]:
# makes copy of dataframe sorting by age and sport
o2 = olympics[['Age', 'Sport']].copy()
o2.head()

Unnamed: 0,Age,Sport
0,24.0,Basketball
1,23.0,Judo
2,24.0,Football
3,34.0,Tug-Of-War
4,21.0,Speed Skating


In [150]:
# finds the median age for each sport and lists them in ascending order
o2.groupby('Sport').median().sort_values(by='Age', ascending = True).head()

Unnamed: 0_level_0,Age
Sport,Unnamed: 1_level_1
Rhythmic Gymnastics,18.0
Swimming,20.0
Synchronized Swimming,22.0
Figure Skating,22.0
Diving,22.0


Rhythmic gymnastic has the lowest median age of 18.0. There is no tie.

### <font color="magenta"> Q5 (8 points): In which year did Canada win the most medals?

In [117]:
# filters only data with Canada
o3 = olympics[olympics.Team=='Canada']
o3.head()

Unnamed: 0,ID,Name,Sex,Age,Height,Weight,Team,NOC,Games,Year,Season,City,Sport,Event,Medal
269,140,"William ""Bill"" Abbott Jr.",M,42.0,172.0,80.0,Canada,CAN,1996 Summer,1996,Summer,Atlanta,Sailing,Sailing Mixed Three Person Keelboat,
270,140,"William ""Bill"" Abbott Jr.",M,46.0,172.0,80.0,Canada,CAN,2000 Summer,2000,Summer,Sydney,Sailing,Sailing Mixed Three Person Keelboat,
279,146,Jeremy Abbott,M,19.0,179.0,71.0,Canada,CAN,1976 Summer,1976,Summer,Montreal,Canoeing,"Canoeing Men's Canadian Doubles, 1,000 metres",
280,147,Joanne Abbott,F,41.0,160.0,57.0,Canada,CAN,1996 Summer,1996,Summer,Atlanta,Sailing,Sailing Mixed Three Person Keelboat,
281,148,"Kathryn ""Katie"" Abbott",F,21.0,164.0,63.0,Canada,CAN,2008 Summer,2008,Summer,Beijing,Sailing,Sailing Women's Three Person Keelboat,


In [160]:
# sorts by data by year and medal
new_o3 = o3[['Year', 'Medal']].sort_values(by='Year', ascending=True).copy()
new_o3.head()

Unnamed: 0,Year,Medal
83481,1900,
178304,1900,Bronze
178306,1900,
83480,1900,
83497,1900,


In [159]:
# finds the year with the most medal counts by sorting by descending order
new_o3.groupby('Year').count().sort_values(by='Medal', ascending=False).head(1)

Unnamed: 0_level_0,Medal
Year,Unnamed: 1_level_1
1984,91


Canada won the most medals in 1981. Canada won 91 medals.

### <font color="magenta"> Q6 (11 points): In which sports in the 2014 Winter Olympics did female medalists win more medal points than male medalists?

In [233]:
o6 = olympics.copy()
# filters data in only the 2014 olympics
winter = o6[o6.Year==2014]
# sets and counts point values for medals
medal = winter[(winter.Medal=='Gold') |(winter.Medal=='Silver') | (winter.Medal=='Bronze')]
for i in medal['Medal']:
    if i == 'Gold':
        medal['Points'] = 3
    elif i == 'Silver':
        medal['Points'] = 2
    else:
        i == 'Bronze'
        medal['Points'] = 1

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  del sys.path[0]
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  # Remove the CWD from sys.path while we load stuff.


In [234]:
# separates data by sex
male = medal[medal.Sex=='M']
female = medal[medal.Sex=='F']

In [227]:
# finds points for male medalists for each sport
male.groupby('Sport').Points.sum()

Sport
Alpine Skiing                 48
Biathlon                      90
Bobsleigh                     54
Cross Country Skiing          90
Curling                       36
Figure Skating                69
Freestyle Skiing              45
Ice Hockey                   213
Luge                          54
Nordic Combined               54
Short Track Speed Skating     63
Skeleton                       9
Ski Jumping                   54
Snowboarding                  45
Speed Skating                 72
Name: Points, dtype: int64

In [235]:
# finds points for female medalists for each sport
female.groupby('Sport').Points.sum()

Sport
Alpine Skiing                 45
Biathlon                      90
Bobsleigh                     18
Cross Country Skiing          90
Curling                       36
Figure Skating                66
Freestyle Skiing              45
Ice Hockey                   177
Luge                          18
Short Track Speed Skating     66
Skeleton                       9
Ski Jumping                    9
Snowboarding                  45
Speed Skating                 81
Name: Points, dtype: int64

In the 2014 winter olympics, female medalists won more points than male medalists in short track speed skating (66 vs 63) and in speed skating (81 vs 72).

### QBonus 5 pt: For each year in which games were held, what proportion of the host country medalists were women? 
You will need to combine multiple datasets to complete the analysis.
<p>Suggested data: 
<p>- https://en.wikipedia.org/wiki/List_of_Olympic_Games_host_cities (note: we suggest you use pd.read_html)

In [None]:
# put your code here

(Use this space to explain your answers)

## Please submit your completed notebook in .IPYNB and .HTML formats via Canvas