
# COGS 108 - Data Checkpoint

# Names

- Calixto Calangi
- Ivan Guo
- Thoai Phan
- Brian Ponce

# Research Question

Is a country's population, or the share of its GDP allocated to sports, related to how well its athletes place (gold, silver, bronze) at the Olympics? In other words, does either a country's population or its sports expenditure significantly predict its athletes' medal placements?




## Background and Prior Work

One cannot deny the substantial revenues that the Olympics bring to many countries throughout the sporting world. According to an article from the International Olympic Committee, $590 million is usually spent on athlete development, the training of coaches and players, and the accessibility of the Olympic Games throughout the world. <a name="cite_ref-1"></a>[<sup>1</sup>](#cite_note-1) Some argue that the reason for this funding is the economic boom that may follow as the Olympics encourage more tourism. After reading background information on this topic, our group decided to propose a tentative question: is it true that the higher a country's GDP in an Olympic year, the better its medal results will be? We will also research other factors, such as population size and healthcare accessibility. We chose the Olympics because it is known as the largest sporting celebration in terms of the number of events, athletes, and people gathered from around the world.

One published work on this topic comes from a group of students at the Georgia Institute of Technology, who initially argued that GDP per capita is the main determinant of a country's Olympic performance. <a name="cite_ref-2"></a>[<sup>2</sup>](#cite_note-2) However, their analysis showed that a country's size and healthcare expenditure per capita, rather than GDP per capita, are the main factors driving Olympic performance. To us this is a plausible argument: the better a country's general welfare, the better its healthcare, and the more likely its athletes are to have access to good healthcare. Additionally, after analyzing several unrestricted multiple-variable regression models, the group found that the relationship between Olympic performance, country size, and per capita health expenditure made the most sense, and these were the only variables significant at the 5% level.

After looking at this published project, we decided to consider other factors besides GDP. That project already includes a country's healthcare as a factor, but we also want to see whether the spending allocated to sports has an effect on the medals a country receives at the Olympics. Many hidden factors might affect athletes' performances at this event, but the purpose of this project is to see which factors are the most significant. We also want to see whether a country's population has an effect on its standing in the Olympics (i.e., is it true that the larger the population, the more likely a country is to do better?).

1. <a name="cite_note-1"></a> [^](#cite_ref-1) How the IOC finances a better world through sport. *International Olympic Committee*. https://olympics.com/ioc/funding 
2. <a name="cite_note-2"></a> [^](#cite_ref-2) Boudreau et al. The Miracle on Thin Ice: How A Nation's GDP Affects its Olympic Performance. *Georgia Institute of Technology*. https://repository.gatech.edu/server/api/core/bitstreams/1aa2b537-c3de-4177-8295-3fcd3a03a965/content#:~:text=We%20estimate%20that%20GDP%20per,bronze%20medals%20a%20country%20receives.


# Hypothesis


We hypothesize that greater government spending on athletes and sports in general translates to better performance and better medal placements at the Olympics, whereas population is a less significant contributor, because the resources needed to succeed in sports (training equipment, facilities) are a better overall predictor of athletes' performance. Our reasoning for the first point is that prior research has examined the relationship between healthcare expenditure and performance and found healthcare to be a strong indicator.

It would make sense, then, that the share of government spending allocated to sports should be an even stronger indicator of Olympic performance: the resources an athlete needs to succeed include access to the best healthcare but, more importantly, access to top-of-the-line sports resources (better facilities, equipment, and training).
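One way to compare the strength of two predictors, as the hypothesis does, is to fit a regression on standardized variables and compare the coefficients. The sketch below is illustrative only: the data are synthetic and deliberately constructed so that spending matters more than population, and the variable names (`spending`, `population`, `medals`) are placeholders rather than columns from our datasets.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic predictors (NOT real data): sports spending and population.
spending = rng.normal(size=n)
population = rng.normal(size=n)
# Synthetic outcome built so that spending matters more than population.
medals = 2.0 * spending + 0.5 * population + rng.normal(size=n)


def standardize(x):
    """Rescale to mean 0 and standard deviation 1."""
    return (x - x.mean()) / x.std()


# Design matrix: intercept plus the two standardized predictors.
X = np.column_stack([np.ones(n), standardize(spending), standardize(population)])
y = standardize(medals)

# Ordinary least squares fit.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
beta_spending, beta_population = coef[1], coef[2]
print(f"standardized coefficient, spending:   {beta_spending:.2f}")
print(f"standardized coefficient, population: {beta_population:.2f}")
```

Because both predictors are on the same scale after standardizing, the larger coefficient indicates the predictor more strongly associated with the outcome in this toy setup. On real data the comparison would also need significance tests and checks for correlated predictors.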

# Data

## Data overview


- Dataset #1
  - Dataset Name: Athlete Events in Olympic History
  - Link to the dataset: https://github.com/cstorm125/information_value/blob/master/data/120-years-of-olympic-history-athletes-and-results/athlete_events.csv
  - Number of observations: 271117 (we will most likely cut this in the future)
  - Number of variables: 15


This dataset provides information such as the year an athlete competed in the Olympics, the event, the medal won, and the season. We can use it to see which countries have more medal winners and then research factors, such as GDP, that might contribute to their performance. Because the dataset has many data points, we will most likely use scatterplots to compare the relationship between two variables, such as a country-level factor and medal success.
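To see which countries have more winners, the athlete-level rows first have to be aggregated into medal counts per country. The following is a minimal sketch on toy rows that mimic the `NOC`, `Year`, `Event`, and `Medal` columns; the values are made up. It assumes that a team event should count as one medal for the country, which is a modeling choice and not something the dataset enforces.

```python
import pandas as pd

# Toy rows mimicking the athlete_events.csv columns used here (values are made up).
athletes = pd.DataFrame({
    "NOC":   ["USA", "USA", "USA", "FIN", "FIN", "FIN"],
    "Year":  [2000, 2000, 2000, 2000, 2000, 2004],
    "Event": ["Relay", "Relay", "100m", "Shot Put", "Javelin", "Javelin"],
    "Medal": ["Gold", "Gold", "Silver", "Bronze", None, "Gold"],
})

# Keep medal-winning rows only.
medalists = athletes.dropna(subset=["Medal"])

# A team event produces one row per athlete, so collapse to one row per
# (country, year, event, medal) before counting; otherwise relays and team
# sports would inflate a country's total.
medal_events = medalists.drop_duplicates(subset=["NOC", "Year", "Event", "Medal"])

# Count medals per country and year, one column per medal type.
medal_counts = (
    medal_events.groupby(["NOC", "Year", "Medal"])
    .size()
    .unstack(fill_value=0)
    .reindex(columns=["Gold", "Silver", "Bronze"], fill_value=0)
)
medal_counts["Total"] = medal_counts.sum(axis=1)
print(medal_counts)
```

Collapsing on (country, year, event, medal) would also merge the rare case of two athletes from the same country tying for the same medal, so this rule should be checked against the real data.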

- Dataset #2
  - Dataset Name: Athletes in Olympics
  - Link to the dataset: https://github.com/chanronnie/Olympics/blob/main/data/athletes.csv
  - Number of observations: 476349 (we will most likely cut this in the future)
  - Number of variables: 13


This dataset is another source of information about the athletes who participated in the Olympics, such as which team they were on, which sport they played, and which medal they won in a given year. We can research the GDP and other factors of countries with high success, because there may be a relationship between the resources athletes receive and their performance at the Olympics. Since this dataset is so large, we will most likely use scatterplots to represent the data and perform some form of linear regression to see relationships clearly (a positive relationship, a clear distinction between two countries, etc.).

Both of these data sources can play an important role in our research project because they list many athletes who participated in the Olympics and won a medal. We can pull examples from both datasets of athletes from certain countries (most likely those with more Olympic victories) and see whether there is a correlation between their success and factors such as population and GDP. We hope to use these examples to test our hypothesis that government expenditure on sports has an effect on athletes' performances at the Olympics, and that population is an important factor but not as prominent as government spending.
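As a first look at whether success correlates with population and GDP, a correlation table over a per-country summary could be computed as sketched below. The table `countries` and every number in it are hypothetical placeholders; no real GDP data has been collected yet.

```python
import pandas as pd

# Hypothetical per-country table; every number below is made up for illustration.
countries = pd.DataFrame({
    "noc":        ["AAA", "BBB", "CCC", "DDD", "EEE"],
    "medals":     [2, 10, 25, 40, 90],
    "population": [5e6, 8e7, 2e7, 3e8, 1.4e9],
    "gdp":        [1e10, 5e11, 9e11, 4e12, 2e13],
})

numeric = countries[["medals", "population", "gdp"]]

# Pearson correlation measures linear association.
pearson = numeric.corr()

# Spearman correlation is Pearson correlation computed on ranks, which is
# less sensitive to the heavy skew typical of population and GDP figures.
spearman = numeric.rank().corr()

print(pearson["medals"])
print(spearman["medals"])
```

A correlation alone does not separate the effect of population from that of GDP, since large countries often have large economies, so this would only be a starting point before any regression.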

- Dataset #3 
  - Dataset Name: Population, 10,000 BCE to 2021
  - Link to the dataset: https://ourworldindata.org/grapher/population#sources-and-processing
  - Number of observations: 58252
  - Number of variables: 4

This dataset collects population estimates across the globe from 10,000 BCE to 2021. The variables are the country represented (Afghanistan, Zimbabwe, etc.), its country code, the year, and the population for that year. As with the other two datasets (and the GDP dataset we still need to find), we will need to trim it to the range of years that all datasets share. The Olympic data range from 1896 to, as of 5/17/2024, the 2022 Olympics, whereas this dataset stretches from 10,000 BCE to 2021.
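Trimming the tables to a shared range of years and then joining them could look like the sketch below. The two small tables are toy stand-ins with made-up values, and the `noc_to_iso` mapping is a hypothetical, partial lookup. It is needed because Olympic NOC codes and the ISO codes in the population data do not always agree (Germany is `GER` in one and `DEU` in the other).

```python
import pandas as pd

# Toy stand-ins for the two tables (values are illustrative, not real figures).
medals = pd.DataFrame({
    "NOC":   ["GER", "FIN", "NED", "GER"],
    "Year":  [2000, 2000, 2000, 2022],
    "Total": [5, 2, 4, 3],
})
population = pd.DataFrame({
    "Code":       ["DEU", "FIN", "DEU", "DEU"],
    "Year":       [1800, 2000, 2000, 2021],
    "Population": [18_000_000, 5_100_000, 81_000_000, 83_000_000],
})

# Years covered by both tables.
first_year = max(medals["Year"].min(), population["Year"].min())
last_year = min(medals["Year"].max(), population["Year"].max())

# Translate NOC codes to the codes used by the population table.
# This mapping is hypothetical and deliberately incomplete.
noc_to_iso = {"GER": "DEU"}
medals["Code"] = medals["NOC"].replace(noc_to_iso)

# Left join keeps every medal row; indicator=True adds a _merge column that
# flags rows with no matching population record.
in_range = medals[medals["Year"].between(first_year, last_year)]
merged = in_range.merge(population, on=["Code", "Year"], how="left", indicator=True)

# Countries whose code still needs a mapping entry.
unmatched = merged.loc[merged["_merge"] == "left_only", "NOC"].tolist()
print(merged)
print("unmatched NOC codes:", unmatched)
```

Listing the unmatched codes after the join makes it easy to see which countries would silently drop out of the analysis if an inner join were used instead.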




## Dataset: 120 years of Olympic data

In [2]:
#import code
import pandas as pd
import numpy as np


In [11]:
# Load the athlete-level Olympic results
Olympics_120yr = pd.read_csv("https://raw.githubusercontent.com/cstorm125/information_value/master/data/120-years-of-olympic-history-athletes-and-results/athlete_events.csv")
# We want to restrict the data to who the athletes are, their gender (men's and women's performance differ), which Games, nationality, sport, and medal placement
Olympics_120yr = Olympics_120yr.drop(columns=['Age', 'Height', 'Weight'])
# Eventually we want to wrangle this into data where only medal placements (gold, silver, bronze) are present
# Note: dropna() returns a new DataFrame and does not modify Olympics_120yr, so the medal-only rows are only displayed here
Olympics_120yr.dropna()


|  | ID | Name | Sex | Team | NOC | Games | Year | Season | City | Sport | Event | Medal |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3 | 4 | Edgar Lindenau Aabye | M | Denmark/Sweden | DEN | 1900 Summer | 1900 | Summer | Paris | Tug-Of-War | Tug-Of-War Men's Tug-Of-War | Gold |
| 37 | 15 | Arvo Ossian Aaltonen | M | Finland | FIN | 1920 Summer | 1920 | Summer | Antwerpen | Swimming | Swimming Men's 200 metres Breaststroke | Bronze |
| 38 | 15 | Arvo Ossian Aaltonen | M | Finland | FIN | 1920 Summer | 1920 | Summer | Antwerpen | Swimming | Swimming Men's 400 metres Breaststroke | Bronze |
| 40 | 16 | Juhamatti Tapio Aaltonen | M | Finland | FIN | 2014 Winter | 2014 | Winter | Sochi | Ice Hockey | Ice Hockey Men's Ice Hockey | Bronze |
| 41 | 17 | Paavo Johannes Aaltonen | M | Finland | FIN | 1948 Summer | 1948 | Summer | London | Gymnastics | Gymnastics Men's Individual All-Around | Bronze |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 271078 | 135553 | Galina Ivanovna Zybina (-Fyodorova) | F | Soviet Union | URS | 1956 Summer | 1956 | Summer | Melbourne | Athletics | Athletics Women's Shot Put | Silver |
| 271080 | 135553 | Galina Ivanovna Zybina (-Fyodorova) | F | Soviet Union | URS | 1964 Summer | 1964 | Summer | Tokyo | Athletics | Athletics Women's Shot Put | Bronze |
| 271082 | 135554 | Bogusaw Zych | M | Poland | POL | 1980 Summer | 1980 | Summer | Moskva | Fencing | Fencing Men's Foil, Team | Bronze |
| 271102 | 135563 | Olesya Nikolayevna Zykina | F | Russia | RUS | 2000 Summer | 2000 | Summer | Sydney | Athletics | Athletics Women's 4 x 400 metres Relay | Bronze |


## Dataset: Olympic Dataset 2

In [13]:
# Load the second athlete-level Olympic dataset
Olympic_set2 = pd.read_csv("https://raw.githubusercontent.com/chanronnie/Olympics/main/data/athletes.csv")
# We want to restrict the data to who the athletes are, their gender (men's and women's performance differ), which Games, nationality, sport, and medal placement
# columns= already selects the axis, so the redundant axis=1 argument is omitted
Olympic_set2 = Olympic_set2.drop(columns=['born', 'died', 'height', 'weight'])
# Eventually we want to wrangle this into data where only medal placements (gold, silver, bronze) are present
# Note: dropna() returns a new DataFrame and does not modify Olympic_set2, so the medal-only rows are only displayed here
Olympic_set2.dropna()

|  | id | name | gender | team | game | noc | sport | event | medal |
|---|---|---|---|---|---|---|---|---|---|
| 7 | 129369 | Eunice Kirwa | Female | Bahrain | 2016 Summer Olympics | BRN | Athletics | Athletics, Marathon, Women(Olympic) | Silver |
| 8 | 129369 | Eunice Kirwa | Female | Bahrain | 2016 Summer Olympics | BRN | Athletics | Athletics, Marathon, Women(Olympic) | Silver |
| 22 | 101764 | Park Hye-Won | Female | Republic of Korea | 2002 Winter Olympics | KOR | Short Track Speed Skating (Skating) | Short Track Speed Skating (Skating), 3,000 met... | Gold |
| 23 | 101764 | Park Hye-Won | Female | Republic of Korea | 2002 Winter Olympics | KOR | Short Track Speed Skating (Skating) | Short Track Speed Skating (Skating), 3,000 met... | Gold |
| 39 | 59207 | Lee Jeong-Geun | Male | Republic of Korea | 1984 Summer Olympics | KOR | Wrestling | Wrestling, Featherweight, Freestyle, Men(Olympic) | Bronze |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 476324 | 15475 | Vittorio Marcelli | Male | Italy | 1968 Summer Olympics | ITA | Cycling Road (Cycling) | Cycling Road (Cycling), 100 kilometres Team Ti... | Bronze |
| 476325 | 15475 | Vittorio Marcelli | Male | Italy | 1968 Summer Olympics | ITA | Cycling Road (Cycling) | Cycling Road (Cycling), 100 kilometres Team Ti... | Bronze |
| 476334 | 122196 | Aleksa Šaponjić | Male | Serbia | 2012 Summer Olympics | SRB | Water Polo (Aquatics) | Water Polo (Aquatics), Water Polo, Men(Olympic) | Bronze |
| 476335 | 122196 | Aleksa Šaponjić | Male | Serbia | 2012 Summer Olympics | SRB | Water Polo (Aquatics) | Water Polo (Aquatics), Water Polo, Men(Olympic) | Bronze |
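In the preview of this dataset, consecutive rows such as 7 and 8 look identical in every displayed column. If they turn out to be exact duplicates, they would double-count medals, so a de-duplication step is probably needed. The sketch below is a minimal illustration on three rows copied from the preview, with a reduced set of columns. Whether the full dataset's repeated rows are true duplicates has not been verified here.

```python
import pandas as pd

# Rows copied from the preview above (a subset of columns, for illustration).
rows = pd.DataFrame({
    "id":    [129369, 129369, 59207],
    "name":  ["Eunice Kirwa", "Eunice Kirwa", "Lee Jeong-Geun"],
    "game":  ["2016 Summer Olympics", "2016 Summer Olympics", "1984 Summer Olympics"],
    "event": [
        "Athletics, Marathon, Women(Olympic)",
        "Athletics, Marathon, Women(Olympic)",
        "Wrestling, Featherweight, Freestyle, Men(Olympic)",
    ],
    "medal": ["Silver", "Silver", "Bronze"],
})

# Count rows that exactly repeat an earlier row across all columns.
n_dupes = int(rows.duplicated().sum())

# Keep the first occurrence of each row and renumber the index.
deduped = rows.drop_duplicates().reset_index(drop=True)

print(f"{n_dupes} duplicate row(s) removed; {len(deduped)} rows remain")
```

Running the same two calls on `Olympic_set2` after dropping the unused columns would show how many rows are affected before any medal counts are computed.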


## Dataset: Population of the World (10,000 BCE to 2021)

In [16]:
Population_Set = pd.read_csv('https://raw.githubusercontent.com/COGS108/Group07_SP24/master/population.csv?token=GHSAT0AAAAAACRLKJALQO273H3XALHRXOC6ZSH7JMQ')
# Since the Olympics began in 1896, we can trim the data to 1896 onwards.
Population_Set[Population_Set['Year'] >= 1896]

|  | Entity | Code | Year | Population (historical estimates) |
|---|---|---|---|---|
| 133 | Afghanistan | AFG | 1896 | 4603230 |
| 134 | Afghanistan | AFG | 1897 | 4623613 |
| 135 | Afghanistan | AFG | 1898 | 4644079 |
| 136 | Afghanistan | AFG | 1899 | 4672084 |
| 137 | Afghanistan | AFG | 1900 | 4707744 |
| ... | ... | ... | ... | ... |
| 58247 | Zimbabwe | ZWE | 2017 | 14751101 |
| 58248 | Zimbabwe | ZWE | 2018 | 15052191 |
| 58249 | Zimbabwe | ZWE | 2019 | 15354606 |
| 58250 | Zimbabwe | ZWE | 2020 | 15669663 |


# Ethics & Privacy

Some issues with the data used so far stem largely from terms of use. While some datasets specify that they are available in the public domain, as is the case for the '120 years of Olympic history: athletes and results' dataset, as we find more datasets online it is important to (1) check that the data is available for public use and (2) cite and link to the projects or websites where we collected it.

One ethical concern that particularly affects how we can work with the Olympic data is that not all countries competed, or could compete, in every Olympics. Countries with a 'head start' may therefore appear to be 'better performers' when the pool of athletes in early Games was drawn from only a few countries. One way of handling this problem is to look at when most, if not all, countries were approved to send athletes to the Olympics. With this data, we may be able to separate the Olympics into eras, identify which countries competed in each, look at the broader picture of performance, and then do the GDP/population analysis within each era.
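A rough way to find such eras is to count how many distinct countries sent athletes to each Games and when each country first appears. The sketch below uses made-up rows with the `NOC` and `Year` columns, and the `MIN_COUNTRIES` cutoff is a hypothetical threshold chosen only for illustration.

```python
import pandas as pd

# Toy athlete rows (made up); the real table has one row per athlete-event.
athletes = pd.DataFrame({
    "NOC":  ["USA", "FRA", "USA", "FRA", "KEN", "USA", "FRA", "KEN", "CHN"],
    "Year": [1900, 1900, 1960, 1960, 1960, 2000, 2000, 2000, 2000],
})

# Number of distinct countries that sent athletes in each Games year.
countries_per_year = athletes.groupby("Year")["NOC"].nunique()

# First Games year in which each country appears.
first_appearance = athletes.groupby("NOC")["Year"].min()

# Hypothetical cutoff: keep only Games with at least this many countries.
MIN_COUNTRIES = 3
broad_years = countries_per_year[countries_per_year >= MIN_COUNTRIES].index.tolist()

print(countries_per_year)
print(first_appearance)
print("Games years with broad participation:", broad_years)
```

Absence from the data in a given year does not by itself say whether a country chose not to compete or was not permitted to, so the resulting eras would still need to be checked against the historical record.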

Another problem is the presentation of GDP as a significant factor in Olympic performance, a bias we have already addressed and should strive to keep in mind. As mentioned in the background, one study found that healthcare and a country's size, rather than GDP per capita, are stronger factors in Olympic performance. It is therefore important to remember that our focus is not GDP itself but how much of that spending is allocated to sports and sports enhancement, which is a key part of our project.

Another issue with our data is that, while Olympic data stretches all the way back to 1896, publicly available data on countries' GDP may not stretch as far back and may not be as accurate as we require. We would then need to align the years available in both datasets before collecting and analyzing the data.


# Team Expectations 



* Team communication: Discord, smartphone group chat; use to talk about meetings, what/whose work is being done, when things are uploaded
* Event of conflicts: have meeting on Discord, compare
* Workload: 
    * Across team: compare and contrast data sets/data wrangling, decide upon best method/model/data/wrangling methods through meeting/Discord
    * Individually: data wrangle the sets we find ourselves/data sets we decided upon

# Project Timeline Proposal

As of 5/15/24, the timeline is more defined because the datasets we looked at and decided upon are now settled. We need to start on EDA.



| Meeting Date  | Meeting Time| Completed Before Meeting  | Discuss at Meeting |
|---|---|---|---|
| 4/25  |  9PM | Read about projects, think of problems/good things about projects  | Discuss project review, submit project review; discuss plans for project proposal, what to decide on | 
| 5/2  |  9PM |  Do background checks on topic | Discuss project proposal, work on project proposal together | 
| 5/3  | 9PM  | Edit, last minute checks on proposal | Discuss final submission for project proposal, start thinking of datasets for project   |
| 5/6  | 9PM   | Bring up datasets | Compare, bring up datasets we have looked at, think about methods for analysis/visualization   |
| Week of 5/15 | 9PM  | Chosen datasets at this point | Data wrangling, answer Data Checkpoint for Project, check-in |
| Week of 5/20 | 9PM(?) | Partial analysis complete, finished data wrangling | Compare progress on final product, agree on thoughts before finishing final analysis |
| Week of 5/27  | 9PM(?)  | Complete analysis; Draft results/conclusion/discussion (Wasp)| Discuss/edit full project |
| Week of 6/3 | 9PM(?)  | NA | Turn in Final Project & Group Project Surveys |