# Mini Project I: Does the host country have an advantage?

The challenge question in the Olympic Mini-Project was to think of a way of looking at whether or not the host country has an advantage in the Olympics. Let's take a first look at this question.

## Initialization

In [1]:
# Extra Python functionality to import
from datascience import *  # datascience Table 
import EDS
import numpy as np
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('ggplot')
import os
user = os.getenv('JUPYTERHUB_USER')

## Load the Data

In [2]:
datafile = "Olympic_Data/winter_athletes.csv"
athletes = Table.read_table(datafile).sort("Year",descending=True).where("Season","Winter")
athletes.show(3)

ID,Name,Sex,Age,Height,Weight,Team,NOC,Games,Year,Season,City,Sport,Event,Medal
16,Juhamatti Tapio Aaltonen,M,28,184,85,Finland,FIN,2014 Winter,2014,Winter,Sochi,Ice Hockey,Ice Hockey Men's Ice Hockey,Bronze
126,Forough Abbasi,F,20,164,58,Iran,IRI,2014 Winter,2014,Winter,Sochi,Alpine Skiing,Alpine Skiing Women's Slalom,
145,Jeremy Abbott,M,28,175,70,United States,USA,2014 Winter,2014,Winter,Sochi,Figure Skating,Figure Skating Men's Singles,


## Finding the Country Names given the City Names
The problem of finding addresses or coordinates from partial information is called "geocoding." Python has a module called "geopy" that can help, but it is not installed on the server by default, so we must install it before we can import it. This has to be done every time we run the notebook on a fresh virtual machine.

In [31]:
!pip install geopy
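
Because the install has to be repeated on every fresh machine, one option is to install only when the import would otherwise fail. The sketch below assumes a standard `pip` setup and that the package name matches the import name (as it does for geopy). `ensure_installed` is a hypothetical helper name, not part of geopy or the course tools.

```python
import importlib.util
import subprocess
import sys


def ensure_installed(package):
    """Install `package` with pip only if it cannot be imported.

    Returns True if an install was run, False if the package was already available.
    """
    if importlib.util.find_spec(package) is not None:
        return False
    # Use the same interpreter that is running this notebook
    subprocess.check_call([sys.executable, "-m", "pip", "install", package])
    return True


# "json" ships with Python, so no install is triggered here
print(ensure_installed("json"))
```

Calling `ensure_installed("geopy")` at the top of the notebook would then skip the install whenever the module is already present.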



We can use the module to find the country name given the city name. Obviously, this could be a problem when multiple countries have cities of the same name. Fortunately, this is not the case with the big-name Olympic cities.

We create a `Nominatim` geolocator, identifying our application with the user agent string `city_country_finder`, and pass `language="en"` when geocoding to get the English names in the address. People in China don't call Beijing by its English name, for example, so without this option we'd get back Chinese characters.

In [32]:
from geopy.geocoders import Nominatim

# Initialize the geolocator
geolocator = Nominatim(user_agent="city_country_finder")

def find_country(city_name):
    location = geolocator.geocode(city_name, language="en")
    if location:
        return location.address.split(",")[-1].strip()  # Get the last part as the country
    else:
        return "Country not found"

In [33]:
# Whenever you create a function you should test it
city = "Sochi"
country = find_country(city)
print(f"The country of {city} is {country}.")

The country of Sochi is Russia.
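
To see why `find_country` takes the last comma-separated piece, here is a minimal sketch of that parsing step on its own. The address string is hand-written in the comma-separated style that the geocoder returns; it is an assumed illustration, not a real query result, and `country_from_address` is a hypothetical helper name.

```python
def country_from_address(address):
    """Return the last comma-separated part of an address string."""
    return address.split(",")[-1].strip()


# Hand-written example address (an assumption, not real geocoder output)
example_address = "Sochi, Krasnodar Krai, Southern Federal District, Russia"
print(country_from_address(example_address))
```

This only works when the address ends with the country name, which is the assumption `find_country` relies on.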


### Get the Olympic City names
Use `np.unique()` to return all of the Olympic city names without duplicates.

In [9]:
olympic_cities = np.unique((athletes.column("City")))
olympic_cities

array(['Albertville', 'Calgary', 'Chamonix', "Cortina d'Ampezzo",
       'Garmisch-Partenkirchen', 'Grenoble', 'Innsbruck', 'Lake Placid',
       'Lillehammer', 'Nagano', 'Oslo', 'Salt Lake City', 'Sankt Moritz',
       'Sapporo', 'Sarajevo', 'Sochi', 'Squaw Valley', 'Torino',
       'Vancouver'],
      dtype='<U22')

### Use our function to find the country to go with each city. Save the results in a dictionary.

**Notice that three of the host cities are in the US (Lake Placid hosted twice, in 1932 and 1980). We'll use this later.**

In [13]:
city_country = {}
for city in olympic_cities:
    city_country[city] = find_country(city)
city_country

{'Albertville': 'France',
 'Calgary': 'Canada',
 'Chamonix': 'France',
 "Cortina d'Ampezzo": 'Italy',
 'Garmisch-Partenkirchen': 'Germany',
 'Grenoble': 'France',
 'Innsbruck': 'Austria',
 'Lake Placid': 'United States',
 'Lillehammer': 'Norway',
 'Nagano': 'Japan',
 'Oslo': 'Norway',
 'Salt Lake City': 'United States',
 'Sankt Moritz': 'Switzerland',
 'Sapporo': 'Japan',
 'Sarajevo': 'Bosnia and Herzegovina',
 'Sochi': 'Russia',
 'Squaw Valley': 'United States',
 'Torino': 'Italy',
 'Vancouver': 'Canada'}
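
To see which countries appear for more than one city, the dictionary can be inverted so that each country maps to a list of its host cities. This is a small sketch using only a few entries copied from the output above; `city_country_sample` and `country_cities` are illustrative names.

```python
from collections import defaultdict

# A few entries copied from the dictionary above
city_country_sample = {
    'Lake Placid': 'United States',
    'Oslo': 'Norway',
    'Salt Lake City': 'United States',
    'Sochi': 'Russia',
    'Squaw Valley': 'United States',
}

# Group the cities under their country
country_cities = defaultdict(list)
for city, country in city_country_sample.items():
    country_cities[country].append(city)

print(country_cities['United States'])
```

Running the same loop over the full `city_country` dictionary would list the host cities for every country.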

### Extract the unique country names for the athletes.

In [14]:
teams = np.unique(athletes.column("Team"))
print(teams)

['Albania' 'Algeria' 'American Samoa' 'Andorra' 'Argentina' 'Argentina-1'
 'Argentina-2' 'Armenia' 'Australia' 'Australia-1' 'Australia-2' 'Austria'
 'Austria-1' 'Austria-2' 'Azerbaijan' 'Belarus' 'Belgium' 'Belgium-1'
 'Belgium-2' 'Bermuda' 'Bolivia' 'Bosnia and Herzegovina' 'Brazil'
 'British Virgin Islands' 'Bulgaria' 'Bulgaria-1' 'Bulgaria-2' 'Cameroon'
 'Canada' 'Canada-1' 'Canada-2' 'Canada-3' 'Cayman Islands' 'Chile' 'China'
 'China-1' 'China-2' 'China-3' 'Chinese Taipei' 'Chinese Taipei-1'
 'Chinese Taipei-2' 'Colombia' 'Costa Rica' 'Croatia' 'Cyprus'
 'Czech Republic' 'Czech Republic-1' 'Czech Republic-2' 'Czechoslovakia'
 'Czechoslovakia-1' 'Czechoslovakia-2' 'Denmark' 'Dominica' 'East Germany'
 'East Germany-1' 'East Germany-2' 'East Germany-3' 'Egypt' 'Estonia'
 'Ethiopia' 'Fiji' 'Finland' 'France' 'France-1' 'France-2' 'France-3'
 'Georgia' 'Germany' 'Germany-1' 'Germany-2' 'Germany-3' 'Ghana'
 'Great Britain' 'Great Britain-1' 'Great Britain-2' 'Great Britain-3'
 'Greece'

### A Team Name Problem
Sometimes a country fields multiple teams, for example: 'United States-1' 'United States-2' 'United States-3'

To fix this we will split each team name on the hyphen and keep only the first part.

In [34]:
# An example of splitting
'France-2'.split('-')

['France', '2']

In [35]:
# Splitting and keeping only the first element
'France-2'.split('-')[0]

'France'

In [36]:
# Now split all the names in a list comprehension
teams = [team.split("-")[0] for team in teams]

# Get rid of duplicates
teams = np.unique(teams)
print(teams)

['Australia' 'Austria' 'Belarus' 'Belgium' 'Bulgaria' 'Canada' 'China'
 'Croatia' 'Czech Republic' 'Czechoslovakia' 'Denmark' 'East Germany'
 'Estonia' 'Finland' 'France' 'Germany' 'Great Britain' 'Hungary' 'India'
 'Italy' 'Japan' 'Kazakhstan' 'Latvia' 'Liechtenstein' 'Luxembourg' 'Nepal'
 'Netherlands' 'New Zealand' 'North Korea' 'Norway' 'Poland' 'Romania'
 'Russia' 'Slovakia' 'Slovenia' 'South Korea' 'Soviet Union' 'Spain'
 'Sweden' 'Switzerland' 'Ukraine' 'Unified Team' 'United States'
 'Uzbekistan' 'West Germany' 'Yugoslavia']


In [37]:
# Put this into a function so we can apply it to a table
def find_team_country(team_name):
    return team_name.split("-")[0]

In [39]:
# Test the function
find_team_country('France-2')

'France'
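
Splitting on every hyphen would also cut a name that contains a hyphen of its own. A more cautious variant, sketched here as an alternative and not as the method used in this notebook, removes only a trailing hyphen followed by digits. `strip_team_suffix` is a hypothetical name, and 'Guinea-Bissau' is used purely as an example of a hyphenated name; it may not appear in this dataset.

```python
import re


def strip_team_suffix(team_name):
    """Remove a trailing '-<digits>' suffix, leaving other hyphens alone."""
    return re.sub(r"-\d+$", "", team_name)


print(strip_team_suffix("France-2"))
print(strip_team_suffix("Guinea-Bissau"))
```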

## Check the Medal Count

In [20]:
medalists = athletes.where("Medal", are.not_equal_to('nan'))
medalists.show(3)

ID,Name,Sex,Age,Height,Weight,Team,NOC,Games,Year,Season,City,Sport,Event,Medal
16,Juhamatti Tapio Aaltonen,M,28,184,85,Finland,FIN,2014 Winter,2014,Winter,Sochi,Ice Hockey,Ice Hockey Men's Ice Hockey,Bronze
145,Jeremy Abbott,M,28,175,70,United States,USA,2014 Winter,2014,Winter,Sochi,Figure Skating,Figure Skating Mixed Team,Bronze
840,"Victoria ""Vicki"" Adams",F,24,164,69,Great Britain,GBR,2014 Winter,2014,Winter,Sochi,Curling,Curling Women's Curling,Bronze


In [40]:
# Add a column with the team's country
medalists = medalists.with_column('Team_Country', medalists.apply(find_team_country, "Team"))
medalists.show(3)

ID,Name,Sex,Age,Height,Weight,Team,NOC,Games,Year,Season,City,Sport,Event,Medal,Team_Country
16,Juhamatti Tapio Aaltonen,M,28,184,85,Finland,FIN,2014 Winter,2014,Winter,Sochi,Ice Hockey,Ice Hockey Men's Ice Hockey,Bronze,Finland
145,Jeremy Abbott,M,28,175,70,United States,USA,2014 Winter,2014,Winter,Sochi,Figure Skating,Figure Skating Mixed Team,Bronze,United States
840,"Victoria ""Vicki"" Adams",F,24,164,69,Great Britain,GBR,2014 Winter,2014,Winter,Sochi,Curling,Curling Women's Curling,Bronze,Great Britain


In [42]:
# Make a list of the three US venues
us_venues = ['Lake Placid', 'Squaw Valley', 'Salt Lake City']

In [41]:
# Select the medalists at each of these three venues
lake_placid = medalists.where('City', 'Lake Placid')
squaw_valley = medalists.where('City', 'Squaw Valley')
salt_lake_city = medalists.where('City', 'Salt Lake City')

In [43]:
# Combine them into one table. All the medals awarded at US games.
home = lake_placid.append(squaw_valley)
home = home.append(salt_lake_city)
home.num_rows

935

In [44]:
# Now create a table of all the medals awarded at Olympics outside the US.
away = medalists.where('City', are.not_equal_to('Lake Placid'))
away = away.where('City', are.not_equal_to('Squaw Valley'))
away = away.where('City', are.not_equal_to('Salt Lake City'))
away.num_rows

4760

In [45]:
# Check the numbers add up
home.num_rows + away.num_rows == medalists.num_rows

True
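
The home/away split is a partition of the medal rows on whether the host city is one of the US venues. Here is a minimal plain-Python sketch of the same logic, using a few made-up rows and not the real table:

```python
us_venues = ['Lake Placid', 'Squaw Valley', 'Salt Lake City']

# Made-up rows for illustration only: (City, Team_Country, Medal)
rows = [
    ('Lake Placid', 'United States', 'Gold'),
    ('Sochi', 'Russia', 'Gold'),
    ('Salt Lake City', 'Norway', 'Silver'),
    ('Oslo', 'United States', 'Bronze'),
]

home_rows = [row for row in rows if row[0] in us_venues]
away_rows = [row for row in rows if row[0] not in us_venues]

# Every row lands in exactly one group, so the counts must add up
print(len(home_rows), len(away_rows))
print(len(home_rows) + len(away_rows) == len(rows))
```

Because each row is either in the list of venues or not, the final check plays the same role as the `True` result above.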

In [46]:
# Create a function to calculate win percentages
def find_usa_percentage(tbl, medal):
    winners = tbl.where('Medal', medal).num_rows
    usa_winners = tbl.where('Medal', medal).where('Team_Country', 'United States').num_rows
    return 100 * usa_winners / winners
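
The calculation is a simple ratio: the rows for a given medal won by the country, divided by all rows for that medal, times 100. The plain-Python sketch below works on made-up rows and adds a guard for the case where no medals of that type exist, which would otherwise divide by zero. `share_percentage` is a hypothetical name. Note that the table has one row per athlete, so a team event contributes several rows.

```python
def share_percentage(rows, medal, country):
    """Percent of `medal` rows won by `country`; None if there are no such medals."""
    medal_rows = [row for row in rows if row[2] == medal]
    if not medal_rows:
        return None  # avoid dividing by zero
    country_rows = [row for row in medal_rows if row[1] == country]
    return 100 * len(country_rows) / len(medal_rows)


# Made-up rows for illustration only: (City, Team_Country, Medal)
rows = [
    ('Lake Placid', 'United States', 'Gold'),
    ('Lake Placid', 'Norway', 'Gold'),
    ('Lake Placid', 'Norway', 'Gold'),
    ('Lake Placid', 'Finland', 'Gold'),
    ('Lake Placid', 'Norway', 'Silver'),
]

print(share_percentage(rows, 'Gold', 'United States'))    # 1 of 4 gold rows
print(share_percentage(rows, 'Bronze', 'United States'))  # no bronze rows
```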

## Compare percent of US medals won home and away

In [47]:
medal_types = ['Gold', 'Silver', 'Bronze']
for medal in medal_types:
    print(f"At home the US won {find_usa_percentage(home, medal)}% of the {medal} medals")
    print(f"Away the US won {find_usa_percentage(away, medal)}% of the {medal} medals")
    print()


At home the US won 20.253164556962027% of the Gold medals
Away the US won 6.386975579211021% of the Gold medals

At home the US won 28.06451612903226% of the Silver medals
Away the US won 13.934426229508198% of the Silver medals

At home the US won 7.766990291262136% of the Bronze medals
Away the US won 8.687381103360812% of the Bronze medals
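
To make the comparison easier to read, the sketch below takes the percentages from the output above, rounds them to one decimal place, and prints the home-minus-away difference in percentage points. The dictionary name `usa_share` is illustrative.

```python
# Percentages copied from the output above: (home, away)
usa_share = {
    'Gold': (20.253164556962027, 6.386975579211021),
    'Silver': (28.06451612903226, 13.934426229508198),
    'Bronze': (7.766990291262136, 8.687381103360812),
}

for medal, (home_pct, away_pct) in usa_share.items():
    difference = home_pct - away_pct
    print(f"{medal}: {home_pct:.1f}% at home, {away_pct:.1f}% away, "
          f"difference {difference:+.1f} points")
```

A positive difference means the US share was larger at home; a negative one means it was larger away.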



## Conclusions

**On inspection, the US appears to win a larger share of the gold and silver medals when it competes at home, although its share of bronze medals is slightly lower.**

But this is not proof, merely food for further investigation. We need to turn this into a testable hypothesis. We will revisit this later, after you have learned more about hypothesis testing.