# Mini Project I: Does the host country have an advantage?

The challenge question in the Olympic Mini-Project was to think of a way of looking at whether or not the host country has an advantage in the Olympics. Let's take a first look at this question.

## Initialization

In [None]:
# Extra Python functionality to import
from datascience import *  # datascience Table 
import numpy as np
import os
user = os.getenv('JUPYTERHUB_USER')

## Load the Data

In [None]:
datafile = "../../../Mini Project I/Olympic_Data/winter_athletes.csv"
athletes = Table.read_table(datafile).sort("Year",descending=True).where("Season","Winter")
athletes.show(3)

## Finding the Country Names given the City Names
The problem of finding addresses or coordinates from partial information is called "geocoding." Python has a module called "geopy" that can help, but it is not installed on the server by default, so we must install it before we can import it. This has to be done every time we run the notebook on a fresh virtual machine.

In [None]:
!pip install geopy

We can use the module to find the country name given the city name. Obviously, this could be a problem when multiple countries have cities of the same name. Fortunately, this is not the case with the big-name Olympic cities.

We create a geolocator with `Nominatim`, identifying ourselves with the user agent `city_country_finder`, and pass the `language="en"` option to get the English form of the address. Beijing, for example, is not written in English locally, so without this option we'd get back Chinese characters.

In [None]:
from geopy.geocoders import Nominatim

# Initialize the geolocator
geolocator = Nominatim(user_agent="city_country_finder")

def find_country(city_name):
    location = geolocator.geocode(city_name, language="en")
    if location:
        return location.address.split(",")[-1].strip()  # Get the last part as the country
    else:
        return "Country not found"

In [None]:
# Whenever you create a function you should test it
city = "Sochi"
country = find_country(city)
print(f"The country of {city} is {country}.")
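
To see what the last line of `find_country` does without making a network request, the sketch below applies the same split to a hand-written address string. The sample address is illustrative, not a recorded Nominatim response, and `country_from_address` is a hypothetical helper name. It assumes the country is always the last comma-separated part of the address.

```python
def country_from_address(address):
    """Return the text after the last comma, stripped of whitespace."""
    return address.split(",")[-1].strip()

# Illustrative address in the comma-separated style a geocoder returns
sample = "Sochi, Krasnodar Krai, Russia"
print(country_from_address(sample))
```

If an address has no commas at all, the whole string is returned unchanged, so a one-word result is still treated as the country.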

### Get the Olympic City names
Use `np.unique()` to return all of the Olympic city names without duplicates.

In [None]:
olympic_cities = np.unique(athletes.column("City"))
olympic_cities

### Use our function to find the country to go with each city. Save the results in a dictionary.

**Notice that three of the Winter Olympic host cities are in the US (Lake Placid hosted twice). We'll use this later.**

In [None]:
city_country = {}
for city in olympic_cities:
    city_country[city] = find_country(city)
city_country
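
The loop above sends one request per city to the public Nominatim service, whose usage policy asks clients to stay at or below one request per second. A minimal sketch of the same loop with a pause between requests is shown below. `build_city_country` and `fake_lookup` are hypothetical names; `fake_lookup` stands in for `find_country` so the example runs offline, and its return values are made up for illustration.

```python
import time

def build_city_country(cities, lookup, delay_seconds=1.0):
    """Map each city to lookup(city), pausing between consecutive requests."""
    results = {}
    for i, city in enumerate(cities):
        if i > 0:
            time.sleep(delay_seconds)  # be polite to the geocoding service
        results[city] = lookup(city)
    return results

def fake_lookup(city):
    # Stand-in for find_country; values are illustrative only
    known = {"Sochi": "Russia", "Lake Placid": "United States"}
    return known.get(city, "Country not found")

demo = build_city_country(["Sochi", "Lake Placid", "Atlantis"], fake_lookup, delay_seconds=0)

# Flag any city the lookup could not resolve so it can be fixed by hand
not_found = [city for city, country in demo.items() if country == "Country not found"]
print(demo)
print(not_found)
```

In the notebook, passing `find_country` as `lookup` and leaving `delay_seconds` at its default would keep the requests spaced out.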

### Extract the unique team names for the athletes.

In [None]:
teams = np.unique(athletes.column("Team"))
print(teams)

### A Team Name Problem
Sometimes a country fields multiple teams, for example: `'United States-1'`, `'United States-2'`, `'United States-3'`.

To fix this we will split each team name on the hyphen and keep only the first part.

In [None]:
# An example of splitting
'France-2'.split('-')

In [None]:
# Splitting and keeping only the first element
'France-2'.split('-')[0]

In [None]:
# Now split all the names in a list comprehension
teams = [team.split("-")[0] for team in teams]

# Get rid of duplicates
teams = np.unique(teams)
print(teams)

In [None]:
# Put this into a function so we can apply it to a table
def find_team_country(team_name):
    return team_name.split("-")[0]

In [None]:
# Test the function
find_team_country('France-2')
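
Splitting on every hyphen would also truncate a team whose name legitimately contains one. A hedged alternative, sketched below, removes only a trailing hyphen followed by digits. `strip_team_suffix` is a hypothetical name, and "Guinea-Bissau" is used purely as an example of a hyphenated country name; whether such a team appears in this dataset has not been checked.

```python
import re

def strip_team_suffix(team_name):
    """Remove a trailing '-<digits>' squad number, if present."""
    return re.sub(r"-\d+$", "", team_name)

print(strip_team_suffix("United States-1"))
print(strip_team_suffix("Guinea-Bissau"))
```

Names without a numeric suffix pass through unchanged, so the function can be applied to every row of the `Team` column.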

## Check the Medal Count

In [None]:
medalists = athletes.where("Medal", are.not_equal_to('nan'))
medalists.show(3)

In [None]:
# Add a column with the team's country
medalists = medalists.with_column('Team_Country', medalists.apply(find_team_country, "Team"))
medalists.show(3)

In [None]:
# Make a list of the three US venues
us_venues = ['Lake Placid', 'Squaw Valley', 'Salt Lake City']

In [None]:
# Select the medalists at each of these three venues
lake_placid = medalists.where('City', 'Lake Placid')
squaw_valley = medalists.where('City', 'Squaw Valley')
salt_lake_city = medalists.where('City', 'Salt Lake City')

In [None]:
# Combine them into one table. All the medals awarded at US games.
home = lake_placid.append(squaw_valley)
home = home.append(salt_lake_city)
home.num_rows

In [None]:
# Now create a table of all the medals awarded at Olympics held outside the US.
away = medalists.where('City', are.not_equal_to('Lake Placid'))
away = away.where('City', are.not_equal_to('Squaw Valley'))
away = away.where('City', are.not_equal_to('Salt Lake City'))
away.num_rows

In [None]:
# Check the numbers add up
home.num_rows + away.num_rows == medalists.num_rows
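
The home/away split can also be expressed as a single membership test against the `us_venues` list. The sketch below shows the idea in plain Python on made-up rows, so it runs without the dataset; the cities and medals are illustrative only. The datascience library documents an `are.contained_in` predicate that may serve the same purpose on a Table, but treat that as something to verify rather than a tested recipe.

```python
us_venues = ["Lake Placid", "Squaw Valley", "Salt Lake City"]

# Made-up rows for illustration only
rows = [
    {"City": "Lake Placid", "Medal": "Gold"},
    {"City": "Sochi", "Medal": "Silver"},
    {"City": "Salt Lake City", "Medal": "Bronze"},
    {"City": "Calgary", "Medal": "Gold"},
]

# Every row lands in exactly one of the two lists
home_rows = [row for row in rows if row["City"] in us_venues]
away_rows = [row for row in rows if row["City"] not in us_venues]

print(len(home_rows), len(away_rows))
print(len(home_rows) + len(away_rows) == len(rows))
```

Because the two conditions are exact opposites, the row counts are guaranteed to add up, which is the same check the notebook performs.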

In [None]:
# Create a function to calculate win percentages
def find_usa_percentage(tbl, medal):
    winners = tbl.where('Medal', medal).num_rows
    usa_winners = tbl.where('Medal', medal).where('Team_Country', 'United States').num_rows
    return 100 * usa_winners / winners
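
To make the arithmetic concrete, here is a plain-Python sketch of the same calculation on a handful of made-up rows. `medal_share` is a hypothetical name, and the rows are invented for illustration. Unlike the function above, this version returns NaN instead of raising `ZeroDivisionError` when no medals of the requested type were awarded.

```python
def medal_share(rows, medal, country):
    """Percentage of `medal` rows whose Team_Country equals `country`."""
    awarded = [row for row in rows if row["Medal"] == medal]
    if not awarded:
        return float("nan")  # no medals of this type, so the share is undefined
    won = [row for row in awarded if row["Team_Country"] == country]
    return 100 * len(won) / len(awarded)

# Made-up rows: four gold medals, one of them won by the United States
rows = [
    {"Medal": "Gold", "Team_Country": "United States"},
    {"Medal": "Gold", "Team_Country": "Norway"},
    {"Medal": "Gold", "Team_Country": "Norway"},
    {"Medal": "Gold", "Team_Country": "Canada"},
    {"Medal": "Silver", "Team_Country": "United States"},
]
print(medal_share(rows, "Gold", "United States"))  # 1 of 4 gold rows
```

If each row of the table is one athlete's medal, as the table name suggests, a team event contributes one row per team member, so these percentages count medal-winning athletes rather than events.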

## Compare the percentage of medals the US won at home and away

In [None]:
medal_types = ['Gold', 'Silver', 'Bronze']
for medal in medal_types:
    print(f"At home the US won {find_usa_percentage(home, medal):.1f}% of the {medal} medals")
    print(f"Away the US won {find_usa_percentage(away, medal):.1f}% of the {medal} medals")
    print()

## Conclusions

**On inspection, it certainly appears that US athletes win a larger share of the medals when they compete at home!**

But this is not proof, merely grounds for further investigation. We need to turn this into a testable hypothesis. We will revisit this later, after you have learned more about hypothesis testing.