# FINAL PROJECT - BANA 4143 TEAM 5
## Analysis

### Team members:
   * Kevin McDonald
   * Hsinke Ku
   * Leah Ngan Lai
   * Thomas Lyons
   * Zachary Harvey
### Analysis Contents:
* 3.1. Import data
* 3.2. Population and GDP per Capita
* 3.3. Medals won and countries
* 3.4. Medals by Gender and Olympics seasons

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec

In [None]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

### 3.1. First, we need to import our data.

In [None]:
olympics = pd.read_csv('olympic_medals.csv')
#reading in the dataframe
olympics.head()
olympics.tail()
#checking first & last rows to make sure the dataframe was read correctly

### 3.2. We will begin by visualizing the relationship between Population and GDP per Capita

Before we visualize this relationship, it would be a good idea to run some general summary statistics.

In [None]:
olympics.agg({'Population': ['min',
                             'max',
                             'median',
                             'mean',
                             'skew'],
              'GDP per Capita': ['min',
                                 'max',
                                 'median',
                                 'mean',
                                 'skew']})
#summary statistics for Population and GDP per Capita 
#gives us a general idea of these two variables

In [None]:
sns.relplot(x = 'Population',
            y = 'GDP per Capita',
            data = olympics)
#creating a scatterplot of Population vs. GDP per Capita

It is difficult to see any relationship because 2 outlier countries
with very large populations compress the rest of the plot.
We will filter these out and re-create the visualization.

In [None]:
olympics2 = olympics[(olympics.Population < 1200000000)]
#filtering populations less than 1,200,000,000
sns.relplot(x = 'Population',
            y = 'GDP per Capita',
            data = olympics2)

The large majority of the countries in the olympics data set have populations below 50,000,000. We will split the data into two groups: countries with populations below 50,000,000, and countries with populations of 50,000,000 or more.
Then, we will graph the relationship between Population and GDP per Capita for both groups and observe whether they show similar trends.

In [None]:
olympics_lower = olympics[(olympics.Population < 50000000)]
#filtering below 50,000,000
olympics_greater = olympics[(olympics.Population >= 50000000)]
#filtering above 50,000,000

sns.relplot(x = 'Population',
            y = 'GDP per Capita',
            data = olympics_lower)
sns.relplot(x = 'Population',
            y = 'GDP per Capita',
            data = olympics_greater,
            hue = 'Country')
#adding a legend to list countries by color in this plot

Both plots suggest that countries with smaller populations tend to have a higher GDP per Capita.
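
One way to check this visual impression numerically is a rank correlation computed on one row per country. The sketch below assumes the data has one row per medal, so each country's Population and GDP per Capita repeat many times. The helper name `population_gdp_rank_correlation` and the toy data are hypothetical and are not taken from `olympic_medals.csv`.

```python
import pandas as pd

def population_gdp_rank_correlation(df):
    """Spearman-style rank correlation between Population and GDP per Capita.

    Keeps one row per country first, so countries with many medals
    are not counted more than once.
    """
    per_country = df.drop_duplicates(subset='Country')
    # Pearson correlation of the ranks is the Spearman rank correlation
    pop_rank = per_country['Population'].rank()
    gdp_rank = per_country['GDP per Capita'].rank()
    return pop_rank.corr(gdp_rank)

# toy data, made up for illustration only
toy = pd.DataFrame({
    'Country': ['A', 'A', 'B', 'C', 'C', 'D'],
    'Population': [1_000_000, 1_000_000, 5_000_000,
                   20_000_000, 20_000_000, 80_000_000],
    'GDP per Capita': [60_000, 60_000, 40_000, 20_000, 20_000, 10_000],
})
rho = population_gdp_rank_correlation(toy)
print(rho)
```

A value close to -1 would support the trend described above, while a value near 0 would suggest the pattern in the plots is weaker than it looks.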

### 3.3. Next, we would like to know which countries have won the most medals. We will visualize this by calculating the mean number of medals won, and then plotting the countries that have won at least that many medals.

In [None]:
medals_count = olympics.groupby('Country')['Medal'].count()
#counting medals won per country
medals_count.mean()
#calculating mean number of medals per country

In [None]:
medals = olympics.groupby(['Country'],
                          as_index = False).agg({'Medal': 'count'})
#counting medals earned per country
medals_df = medals[(medals.Medal > 235)]
#keeping countries that won more than the mean number of medals
#(the cutoff of 235 comes from the mean calculated in the previous cell)
plot1 = sns.catplot(x = 'Country',
                    y = 'Medal',
                    kind = 'bar',
                    data = medals_df)
plot1.set_xticklabels(rotation=45,
                      ha="right",
                      fontsize = 8)
#plotting a barchart of medals per country - we adjusted the labels
#on the x-axis for readability
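
The cutoff above is typed in by hand, so it would go stale if the data changed. As a minimal sketch, the threshold could instead be computed from the per-country counts. The helper name `countries_above_mean` and the toy data are hypothetical.

```python
import pandas as pd

def countries_above_mean(df):
    """Return the countries whose medal count exceeds the mean count,
    together with the mean that was used as the cutoff."""
    counts = df.groupby('Country', as_index=False).agg({'Medal': 'count'})
    threshold = counts['Medal'].mean()
    return counts[counts['Medal'] > threshold], threshold

# toy data, made up for illustration only: A has 6 medals, B has 2, C has 1
toy = pd.DataFrame({
    'Country': ['A'] * 6 + ['B'] * 2 + ['C'] * 1,
    'Medal': ['Gold'] * 9,
})
above, threshold = countries_above_mean(toy)
print(threshold)
print(above)
```

The filtered frame has the same `Country` and `Medal` columns as `medals_df`, so it could be passed to `sns.catplot` in the same way.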

While 26 countries have won more than the average number of medals, the United States has clearly won the most by a wide margin. It would be interesting to see how removing the United States from the calculation affects this visualization.

In [None]:
medals_nous = olympics[(olympics.Country != 'United States')]
#removing United States before we calculate mean medals won.
medals_count_nous = medals_nous.groupby('Country')['Medal'].count()
medals_count_nous.mean()
#recalculating mean number of medals won per country

In [None]:
medals2 = medals_nous.groupby(['Country'], as_index = False).agg({'Medal': 'count'})
#counting medals earned per country
medals_df2 = medals2[(medals2.Medal > 196)]
#filtering countries above new mean medal count
plot2 = sns.catplot(x = 'Country',
                    y = 'Medal',
                    kind = 'bar',
                    data = medals_df2)
plot2.set_xticklabels(rotation=45,
                      ha="right",
                      fontsize = 8)
#plotting barchart of medals per country above new mean

Removing the United States from the calculation lowers the mean number of medals won per country, but it does not increase the number of countries above that mean. Surprisingly, that number decreased by 1, and the one country that dropped out is the United States itself. This tells us that there is a very large gap between countries that win many medals and those that win only a few.
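
This effect is typical of a heavily skewed distribution, where the mean sits well above the median. The sketch below illustrates it on made-up medal counts. The numbers and the helper name `n_above_mean` are hypothetical and are not the values from our data.

```python
import pandas as pd

def n_above_mean(counts):
    """Number of entries in a Series of medal counts that exceed its mean."""
    return int((counts > counts.mean()).sum())

# toy medal counts, made up for illustration only
toy_counts = pd.Series({'United States': 100, 'A': 30, 'B': 28,
                        'C': 2, 'D': 1, 'E': 1})

with_us = n_above_mean(toy_counts)
without_us = n_above_mean(toy_counts.drop('United States'))

# a mean far above the median is a sign of a skewed distribution
mean_all = toy_counts.mean()
median_all = toy_counts.median()
print(with_us, without_us, mean_all, median_all)
```

In this toy example the mean falls when the largest country is removed, yet the same smaller countries stay above it, so the count of countries above the mean only drops by the one that was removed.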

### 3.4. Finally, we want to look at the distribution of Medals by Gender as well as by Olympics Season.

In [None]:
CountOfMedalsbygender = olympics.groupby(['Country',
                                          'Gender', 
                                          'Medal']).agg({'Medal': 'count'})
#count of medals by country, gender, type of medal
CountOfMedalsbygender.head()

In [None]:
sns.displot(data = olympics,
            x='Gender',
            hue='Medal',
            multiple='stack')
#stacked barchart of medals by gender

This visualization shows the stark difference in medals between men and women.  Women's sports haven't had the same representation in the Olympics as men's sports until recently, but this shows just how drastic the difference really is.
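
To see whether the gap has narrowed over time, one option is to compute each gender's share of medals per year. This is a minimal sketch that assumes the `Year` and `Gender` columns used elsewhere in this notebook. The labels `'Men'` and `'Women'`, the helper name `gender_share_by_year`, and the toy data are assumptions for illustration.

```python
import pandas as pd

def gender_share_by_year(df):
    """Share of medals won by each gender within each Olympic year.

    Each row of the result sums to 1.
    """
    return pd.crosstab(df['Year'], df['Gender'], normalize='index')

# toy data, made up for illustration only
toy = pd.DataFrame({
    'Year':   [1900, 1900, 1900, 1900, 2000, 2000, 2000, 2000],
    'Gender': ['Men', 'Men', 'Men', 'Women',
               'Men', 'Men', 'Women', 'Women'],
})
share = gender_share_by_year(toy)
print(share)
```

Plotting the resulting table as a line chart, for example with `share.plot()`, would show how the split changes from one year to the next.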

We also wanted to visualize the same distribution of medals but between the Summer and Winter games.

In [None]:
CountOfMedalsbyYear = olympics.groupby(['Year',
                                        'Games',
                                        'Medal']).agg({'Medal': 'count'})
#count of medals by year, games, and medal type
CountOfMedalsbyYear.head()

In [None]:
sns.displot(data = olympics,
            x='Games',
            hue='Medal',
            multiple='stack')
#stacked barchart of Summer and Winter games' medals

The disparity in medals between the two types of games has two likely explanations: the Summer games have been around longer, and they include many more events than the Winter games.
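
Both explanations can be checked roughly by counting the distinct years each type of games appears in and dividing the medals by that count. The sketch below assumes the `Games` column holds values such as `'Summer'` and `'Winter'`. The helper name `medals_per_edition` and the toy data are hypothetical.

```python
import pandas as pd

def medals_per_edition(df):
    """Per type of games: number of editions (distinct years),
    total medals, and average medals per edition."""
    summary = df.groupby('Games').agg(editions=('Year', 'nunique'),
                                      medals=('Medal', 'count'))
    summary['medals_per_edition'] = summary['medals'] / summary['editions']
    return summary

# toy data, made up for illustration only
toy = pd.DataFrame({
    'Games': ['Summer'] * 8 + ['Winter'] * 2,
    'Year':  [1896] * 3 + [1900] * 3 + [1924] * 2 + [1924, 1928],
    'Medal': ['Gold'] * 10,
})
summary = medals_per_edition(toy)
print(summary)
```

More editions would point to the longer history, and more medals per edition would point to the larger number of events.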