# Bringing it all together
Here, you will bring together everything you have learned in this course while working with data from the Summer Olympic Games going as far back as 1896! This rich dataset will let you fully apply the data manipulation techniques you have learned. You will pivot, unstack, group, slice, and reshape your data as you explore it and uncover some fascinating insights. Enjoy!

# 1. Case Study - Summer Olympics
## 1.1 Grouping and aggregating
The Olympic medal data for the following exercises comes from [The Guardian](https://www.theguardian.com/sport/datablog/2012/jun/25/olympic-medal-winner-list-data). It comprises records of all events held at the Olympic games between 1896 and 2012.

Suppose you have loaded the data into a DataFrame `medals`. You now want to find the total number of medals awarded to the USA per edition. To do this, filter the `'USA'` rows and use the `.groupby()` method to put the `'Edition'` column on the index:
```
USA_edition_grouped = medals.loc[medals.NOC == 'USA'].groupby('Edition')
```
Given the goal of finding the total number of USA medals awarded per edition, what column should you select and which aggregation method should you use?

###### Possible Answers:
1. `USA_edition_grouped['City'].mean()`
2. `USA_edition_grouped['Athlete'].sum()`
3. `USA_edition_grouped['Medal'].count()`
4. `USA_edition_grouped['Gender'].first()`

<div style="text-align: right"> <b>Answer:</b> (3) </div>
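
To see why counting the `'Medal'` column answers the question, here is a minimal sketch on a small made-up DataFrame. The rows in `toy_medals` are invented for illustration and are not taken from the dataset. Each row stands for one medal, so counting the non-null `'Medal'` entries per group gives the number of medals per edition.

```python
import pandas as pd

# Hypothetical toy data: one row per medal awarded (not real Olympic records)
toy_medals = pd.DataFrame({
    'Edition': [1896, 1896, 1900, 1900, 1900],
    'NOC':     ['USA', 'GRE', 'USA', 'USA', 'FRA'],
    'Medal':   ['Gold', 'Silver', 'Gold', 'Bronze', 'Silver'],
})

# Keep only the USA rows, then group them by edition
usa_grouped = toy_medals.loc[toy_medals.NOC == 'USA'].groupby('Edition')

# Count the non-null 'Medal' entries in each group
usa_counts = usa_grouped['Medal'].count()
print(usa_counts)
```

The other options would not work as well: `.mean()` and `.sum()` are not meaningful for text columns such as `'City'` or `'Athlete'`, and `.first()` returns a single value per group rather than a count.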

## 1.2 Using .value_counts() for ranking
For this exercise, you will use the pandas Series method `.value_counts()` to determine the top 15 countries ranked by total number of medals.

Notice that `.value_counts()` sorts by values by default. The result is returned as a Series of counts indexed by unique entries from the original Series with values (counts) ranked in descending order.
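
The default ordering can be checked on a tiny example. The Series `toy_names` below is made up for illustration; it is a sketch of the behavior described above, not part of the exercise.

```python
import pandas as pd

# Hypothetical toy Series of country codes, one entry per medal
toy_names = pd.Series(['USA', 'FRA', 'USA', 'GBR', 'USA', 'FRA'])

# Count occurrences of each unique entry; the largest counts come first
toy_counts = toy_names.value_counts()
print(toy_counts)

# Because the result is already sorted, .head(n) gives the top n entries
print(toy_counts.head(2))
```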

### Instructions:
* Extract the `'NOC'` column from the DataFrame `medals` and assign the result to `country_names`. Notice that this Series has repeated entries for every medal (of _any_ type) a country has won in any Edition of the Olympics.
* Create a Series `medal_counts` by applying `.value_counts()` to the Series `country_names`.
* Print the top 15 countries ranked by total number of medals won. 

```
import pandas as pd
medals = pd.read_csv('_datasets/all_medalists.csv')
medals.head()
```

| | City | Edition | Sport | Discipline | Athlete | NOC | Gender | Event | Event_gender | Medal |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Athens | 1896 | Aquatics | Swimming | HAJOS, Alfred | HUN | Men | 100m freestyle | M | Gold |
| 1 | Athens | 1896 | Aquatics | Swimming | HERSCHMANN, Otto | AUT | Men | 100m freestyle | M | Silver |
| 2 | Athens | 1896 | Aquatics | Swimming | DRIVAS, Dimitrios | GRE | Men | 100m freestyle for sailors | M | Bronze |
| 3 | Athens | 1896 | Aquatics | Swimming | MALOKINIS, Ioannis | GRE | Men | 100m freestyle for sailors | M | Gold |
| 4 | Athens | 1896 | Aquatics | Swimming | CHASAPIS, Spiridon | GRE | Men | 100m freestyle for sailors | M | Silver |


```
# Select the 'NOC' column of medals: country_names
country_names = medals['NOC']

# Count the number of medals won by each country: medal_counts
medal_counts = country_names.value_counts()

# Print top 15 countries ranked by medals
medal_counts.head(15)
```

```
USA    4335
URS    2049
GBR    1594
FRA    1314
ITA    1228
GER    1211
AUS    1075
HUN    1053
SWE    1021
GDR     825
NED     782
JPN     704
CHN     679
RUS     638
ROU     624
Name: NOC, dtype: int64
```

It looks like the top 5 countries here are `USA`, `URS`, `GBR`, `FRA`, and `ITA`.

## 1.3 Using .pivot_table() to count medals by type
Rather than ranking countries by total medals won and showing that list, you may want to see a bit more detail. You can use a _pivot table_ to compute how many separate bronze, silver and gold medals each country won. That pivot table can then be used to repeat the previous computation to rank by total medals won.

In this exercise, you will use `.pivot_table()` first to aggregate the total medals by type. Then, you can use `.sum()` along the columns of the pivot table to produce a new column. When the modified pivot table is sorted by the total medals column, you can display the results from the last exercise with a bit more detail.

### Instructions:
* Construct a pivot table `counted` from the DataFrame `medals` aggregating by `count`. Use `'NOC'` as the index, `'Athlete'` for the values, and `'Medal'` for the columns.
* Modify the DataFrame `counted` by adding a column `counted['totals']`. The new column `'totals'` should contain the result of taking the sum along the columns (i.e., use `.sum(axis='columns')`).
* Overwrite the DataFrame `counted` by sorting it by the `'totals'` column with the `.sort_values()` method. Specify the keyword argument `ascending=False`.
* Print the first 15 rows of `counted` using `.head(15)`.

```
# Construct the pivot table: counted
counted = medals.pivot_table(index='NOC', columns='Medal', values='Athlete', aggfunc='count')

# Create the new column: counted['totals']
counted['totals'] = counted.sum(axis='columns')

# Sort counted by the 'totals' column
counted = counted.sort_values('totals', ascending=False)

# Print the top 15 rows of counted
counted.head(15)
```

| NOC | Bronze | Gold | Silver | totals |
|---|---|---|---|---|
| USA | 1052.0 | 2088.0 | 1195.0 | 4335.0 |
| URS | 584.0 | 838.0 | 627.0 | 2049.0 |
| GBR | 505.0 | 498.0 | 591.0 | 1594.0 |
| FRA | 475.0 | 378.0 | 461.0 | 1314.0 |
| ITA | 374.0 | 460.0 | 394.0 | 1228.0 |
| GER | 454.0 | 407.0 | 350.0 | 1211.0 |
| AUS | 413.0 | 293.0 | 369.0 | 1075.0 |
| HUN | 345.0 | 400.0 | 308.0 | 1053.0 |
| SWE | 325.0 | 347.0 | 349.0 | 1021.0 |
| GDR | 225.0 | 329.0 | 271.0 | 825.0 |


Take a moment to look at the results and see if you find anything interesting!
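
The same three steps can be traced on a toy DataFrame small enough to check by hand. The data in `toy_medals` is invented for illustration. Note that a country with no medal of a given type gets `NaN` in that cell, and `.sum()` skips `NaN` by default; the presence of `NaN` values may also be why the counts above display as floats.

```python
import pandas as pd

# Hypothetical toy data: one row per medal (athlete names are placeholders)
toy_medals = pd.DataFrame({
    'NOC':     ['USA', 'USA', 'USA', 'FRA', 'FRA', 'GBR'],
    'Athlete': ['A', 'B', 'C', 'D', 'E', 'F'],
    'Medal':   ['Gold', 'Gold', 'Silver', 'Bronze', 'Gold', 'Silver'],
})

# Count athletes for every (NOC, Medal) combination
toy_counted = toy_medals.pivot_table(index='NOC', columns='Medal',
                                     values='Athlete', aggfunc='count')

# Row-wise total across the medal columns (NaN cells are skipped)
toy_counted['totals'] = toy_counted.sum(axis='columns')

# Rank countries by their total, largest first
toy_counted = toy_counted.sort_values('totals', ascending=False)
print(toy_counted)
```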

# 2. Understanding the column labels
## 2.1 Applying .drop_duplicates()
What could be the difference between the `'Event_gender'` and `'Gender'` columns? You should be able to evaluate your guess by looking at the unique values of the pairs `(Event_gender, Gender)` in the data. In particular, you should not see something like `(Event_gender='M', Gender='Women')`. However, you will see that, strangely enough, there is an observation with `(Event_gender='W', Gender='Men')`.

The duplicates can be dropped using the `.drop_duplicates()` method, leaving behind the unique observations. The DataFrame has been loaded as `medals`.
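
As a minimal sketch of what `.drop_duplicates()` does, consider the made-up DataFrame `toy_pairs` below (not taken from the dataset). By default the method keeps the first occurrence of each distinct row together with its original index label, so the index of the result tells you where each pair first appears.

```python
import pandas as pd

# Hypothetical toy data with repeated (Event_gender, Gender) pairs
toy_pairs = pd.DataFrame({
    'Event_gender': ['M', 'M', 'W', 'X', 'W', 'X'],
    'Gender':       ['Men', 'Men', 'Women', 'Men', 'Women', 'Women'],
})

# Keep only the first occurrence of each distinct row
toy_uniques = toy_pairs.drop_duplicates()
print(toy_uniques)
```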

### Instructions:
* Select the columns `'Event_gender'` and `'Gender'` and assign the result to `ev_gen`.
* Create a DataFrame `ev_gen_uniques` containing the unique pairs in `ev_gen`.
* Print `ev_gen_uniques`. 

```
# Select columns: ev_gen
ev_gen = medals[['Event_gender', 'Gender']]

# Drop duplicate pairs: ev_gen_uniques
ev_gen_uniques = ev_gen.drop_duplicates()

# Print ev_gen_uniques
ev_gen_uniques
```

| | Event_gender | Gender |
|---|---|---|
| 0 | M | Men |
| 348 | X | Men |
| 416 | W | Women |
| 639 | X | Women |
| 23675 | W | Men |


## 2.2 Finding possible errors with .groupby()
You will now use `.groupby()` to continue your exploration. Your job is to group by `'Event_gender'` and `'Gender'` and count the rows.

You will see that there is only one suspicious row, which is likely a data error.

### Instructions:
* Group `medals` by `'Event_gender'` and `'Gender'`.
* Create a `medal_count_by_gender` DataFrame with a group count using the `.count()` method.
* Print `medal_count_by_gender`. 

```
# Group medals by the two columns: medals_by_gender
medals_by_gender = medals.groupby(['Event_gender', 'Gender'])

# Create a DataFrame with a group count: medal_count_by_gender
medal_count_by_gender = medals_by_gender.count()

# Print medal_count_by_gender
medal_count_by_gender
```

| Event_gender | Gender | City | Edition | Sport | Discipline | Athlete | NOC | Event | Medal |
|---|---|---|---|---|---|---|---|---|---|
| M | Men | 20067 | 20067 | 20067 | 20067 | 20067 | 20067 | 20067 | 20067 |
| W | Men | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| W | Women | 7277 | 7277 | 7277 | 7277 | 7277 | 7277 | 7277 | 7277 |
| X | Men | 1653 | 1653 | 1653 | 1653 | 1653 | 1653 | 1653 | 1653 |
| X | Women | 218 | 218 | 218 | 218 | 218 | 218 | 218 | 218 |
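
When no values are missing, `.count()` repeats the same number in every column. As an alternative sketch (not the approach the exercise asks for), `.size()` returns a single count of rows per group. The DataFrame `toy_medals` below is made up for illustration.

```python
import pandas as pd

# Hypothetical toy data: one row per medal
toy_medals = pd.DataFrame({
    'Event_gender': ['M', 'M', 'W', 'W', 'X', 'X', 'W'],
    'Gender':       ['Men', 'Men', 'Women', 'Women', 'Men', 'Women', 'Men'],
    'Medal':        ['Gold', 'Silver', 'Gold', 'Bronze', 'Gold', 'Gold', 'Bronze'],
})

# One row count per (Event_gender, Gender) group, returned as a Series
toy_sizes = toy_medals.groupby(['Event_gender', 'Gender']).size()
print(toy_sizes)
```

The result is a Series with a two-level index, so a single group can be looked up with a tuple such as `toy_sizes.loc[('W', 'Men')]`.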


## 2.3 Locating suspicious data
You will now inspect the suspect record by locating the offending row.

You will see that, according to the data, Joyce Chepchumba was a man who won a medal in a women's event. That is a data error, as you can confirm with a web search.

### Instructions:
* Create a Boolean Series with a condition that captures the only row that has `medals.Event_gender == 'W'` and `medals.Gender == 'Men'`. Be sure to use the `&` operator.
* Use the Boolean Series to create a DataFrame called `suspect` with the suspicious row.
* Print `suspect`. 

```
# Create the Boolean Series: sus
sus = (medals.Event_gender == 'W') & (medals.Gender == 'Men')

# Create a DataFrame with the suspicious row: suspect
suspect = medals[sus]

# Print suspect
suspect
```

| | City | Edition | Sport | Discipline | Athlete | NOC | Gender | Event | Event_gender | Medal |
|---|---|---|---|---|---|---|---|---|---|---|
| 23675 | Sydney | 2000 | Athletics | Athletics | CHEPCHUMBA, Joyce | KEN | Men | marathon | W | Bronze |
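
The parentheses around each comparison matter: in Python, `&` binds more tightly than `==`, so leaving them out changes the meaning of the expression. The sketch below shows the pattern on a made-up DataFrame (`toy_medals` and its placeholder athlete names are assumptions for illustration), together with `.query()` as one possible alternative way to write the same filter.

```python
import pandas as pd

# Hypothetical toy data: row 2 has mismatched gender labels
toy_medals = pd.DataFrame({
    'Athlete':      ['A', 'B', 'C', 'D'],
    'Gender':       ['Men', 'Women', 'Men', 'Women'],
    'Event_gender': ['M', 'W', 'W', 'X'],
})

# Combine two Boolean Series element-wise; each comparison is parenthesized
mask = (toy_medals.Event_gender == 'W') & (toy_medals.Gender == 'Men')

# Boolean indexing keeps only the rows where mask is True
toy_suspect = toy_medals[mask]
print(toy_suspect)

# The same filter expressed as a query string
toy_suspect_q = toy_medals.query("Event_gender == 'W' and Gender == 'Men'")
```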


# 3. Constructing alternative country rankings
