# Is Success in the Summer Olympic Games More Dependent on a Country's Population or Wealth?

**Authors**: Jules Brettle, Shree Madan, Anusha Karandikar

Run the cell below each time the notebook is started or restarted so that any changes you make to the library code are picked up automatically by this notebook.

In [2]:
%load_ext autoreload
%autoreload 2
import pandas as pd # library for data analysis
import requests # library to handle requests
from bs4 import BeautifulSoup # library to parse HTML documents
import matplotlib.pyplot as plt
import plotly.express as px
from helpers import *
from vis_helpers import *


# 1. Introduction

## 1.1 Background Information

The Summer Olympic Games are an international multi-sport event typically held every four years [1]. Each sporting event awards gold, silver, and bronze medals for first, second, and third place respectively.

## 1.2 Primary Question

The goal of this project is to answer the following question: Can success in the Summer Olympic Games be predicted more accurately from a country’s population or from its GDP per capita?

# 2. Methodology

## 2.1 Data

We chose Wikipedia as our data source because it was easy to scrape and had the most thorough data we found. From it we obtained the medal counts for every participating country at each Summer Olympics from 2004 to 2016, along with the corresponding population and GDP per capita data. Wikipedia's population data comes from the United States Census Bureau and is reported every 5 years; since the Summer Olympics occur every 4 years, we used the population figure from the closest available year. The GDP per capita data, also taken from Wikipedia, comes from the International Monetary Fund's (IMF) yearly estimates based on purchasing power parity (PPP).
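As a minimal sketch of the closest-year matching described above (the function name `closest_year` and the list of population years are assumptions for illustration, not taken from `helpers.py`):

```python
def closest_year(target, available):
    """Return the year in `available` nearest to `target`.

    On a tie, the entry that appears first in `available` wins,
    because min() returns the first minimal element it encounters.
    """
    return min(available, key=lambda year: abs(year - target))


# Assumed for illustration: population estimates at 5-year intervals.
population_years = [2000, 2005, 2010, 2015, 2020]
olympic_years = [2004, 2008, 2012, 2016]

matches = {games: closest_year(games, population_years) for games in olympic_years}
print(matches)  # {2004: 2005, 2008: 2010, 2012: 2010, 2016: 2015}
```

Note that with this rule both the 2008 and 2012 Games map to the 2010 population figure.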

### 2.1.1 Web Scraping

To obtain our data, we used the Python libraries **Beautiful Soup** to parse HTML documents, **Requests** to handle HTTP requests, and **pandas** for data analysis. In `helpers.py`, we define the function `table_scrape`, which takes the URL of a Wikipedia page and returns, by default, the first table on the page that has the `wikitable` class. Passing a different `index` selects another table on the page.

In [1]:
from io import StringIO

import pandas as pd
import requests
from bs4 import BeautifulSoup


def table_scrape(url, index=0):
    """
    Scrape one table from a Wikipedia article into a pandas DataFrame.

    By default the first table with the "wikitable" class is returned. If the
    page has several such tables, pass the index of the one to scrape.

    Args:
        url: string representing the URL of a Wikipedia article
        index: integer index of the wikitable on the page (default 0)

    Returns:
        pandas DataFrame with the data in the wikitable, or None if the
        request did not succeed
    """
    response = requests.get(url)
    # only parse the page if the request succeeded (HTTP 200 OK)
    if response.status_code != 200:
        print(f"Error: request failed with status code {response.status_code}.")
        return None
    # parse the HTML into a BeautifulSoup object
    soup = BeautifulSoup(response.text, "html.parser")
    tables = soup.find_all("table", {"class": "wikitable"})
    # read_html returns a list of DataFrames; our single table is the first
    return pd.read_html(StringIO(str(tables[index])))[0]

### 2.1.2 Cleaning

Also in `helpers.py`, we define the function `medal_clean`, which takes a medal table as a DataFrame and cleans it: it appends the year to the medal column names, drops the `Rank` column, and removes the final totals row.

In [2]:
def medal_clean(df, year):
    """
    Clean a medal table scraped from Wikipedia so it is easier to read.

    Args:
        df: DataFrame containing a medal table from Wikipedia
        year: integer representing the year of the Olympic Games

    Returns:
        cleaned DataFrame (the input DataFrame is left unchanged)
    """
    # rename the medal columns to include the year
    df = df.rename(columns={"Gold": f"Gold-{year}", "Silver": f"Silver-{year}",
                            "Bronze": f"Bronze-{year}",
                            "Total": f"Total-{year}"})
    # drop the rank column because it is not relevant to our question
    df = df.drop(columns=["Rank"])
    # remove the final row, which holds the totals rather than a country
    return df.iloc[:-1]

We decided to merge our dataframes for medals, population, and GDP per capita to make creating visualizations easier. This required some changes to our data. For example, we removed Independent Olympic Athletes (IOA) from our medals dataframe, since they have no population or GDP data. Some countries also compete under different names: the United Kingdom competes as Great Britain, and Taiwan competes as Chinese Taipei. On the original Wikipedia pages, the host country is also marked with an asterisk beside its name, which we removed.
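The sketch below illustrates one way these adjustments and the merge could be done with pandas. It is not the code from `helpers.py`: the function `normalize_countries`, the `TEAM_TO_COUNTRY` mapping, the `Country` column name, and the other column names are assumptions, and all numbers are placeholders rather than real figures.

```python
import pandas as pd

# Hypothetical mapping from Olympic team names to the names used in the
# population and GDP tables; only the two examples from the text are listed.
TEAM_TO_COUNTRY = {
    "Great Britain": "United Kingdom",
    "Chinese Taipei": "Taiwan",
}


def normalize_countries(medals, column="Country"):
    """Return a copy of `medals` with names aligned to the other tables."""
    out = medals.copy()
    # remove the asterisk that marks the host country, plus stray whitespace
    out[column] = out[column].str.replace("*", "", regex=False).str.strip()
    # rename teams that compete under a different name
    out[column] = out[column].replace(TEAM_TO_COUNTRY)
    # drop the team that has no population or GDP data
    out = out[out[column] != "Independent Olympic Athletes"]
    return out.reset_index(drop=True)


# Placeholder values only, not real medal, population, or GDP figures.
medals = pd.DataFrame({
    "Country": ["Great Britain*", "Chinese Taipei", "Independent Olympic Athletes"],
    "Total-2012": [3, 2, 1],
})
population = pd.DataFrame({
    "Country": ["United Kingdom", "Taiwan"],
    "Population-2010": [100, 50],
})
gdp = pd.DataFrame({
    "Country": ["United Kingdom", "Taiwan"],
    "GDP-2012": [10, 20],
})

# inner joins keep only countries present in all three tables
merged = (
    normalize_countries(medals)
    .merge(population, on="Country", how="inner")
    .merge(gdp, on="Country", how="inner")
)
print(merged)
```

An inner join silently drops any country whose name does not match across tables, so comparing the row counts before and after the merge is a quick way to spot names that still need mapping.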

## 2.2 Visualizations

# 3. Results

# 4. Interpretation

# 5. Sources

[1] https://en.wikipedia.org/wiki/Summer_Olympic_Games