### Dataset #1: Olympic Medalists

This dataset comes from Wikipedia's table of Summer Olympic medalists, available at this [link](https://en.wikipedia.org/wiki/Lists_of_Olympic_medalists). The first challenge was getting the table out of Wikipedia. I didn’t want to go too deep into other Python packages or libraries, since we’re mostly focusing on **pandas** right now. I first tried the built-in **pandas** function `read_html()` on the page URL, but it failed with an HTTP 403 error. Searching for a workaround, I found that some people had success fetching the page with the **requests** library, but I wasn’t able to get `read_html()` to parse the text of the response object. In the end I took a manual approach: I copied the table into Excel and saved it as a CSV. **BeautifulSoup** would probably help here, but since it looks like we’ll cover that later, I stayed within the scope of this project.
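For reference, here is a minimal sketch of how the **requests** route might be made to work. It rests on two assumptions: that the 403 comes from the default user agent being rejected, and that an HTML parser such as `lxml` is installed for `read_html()`. The helper names `fetch_html` and `tables_from_html` and the User-Agent string are hypothetical, chosen only for illustration.

```python
from io import StringIO

import pandas as pd
import requests

URL = "https://en.wikipedia.org/wiki/Lists_of_Olympic_medalists"


def fetch_html(url, user_agent="olympic-medalists-notebook/0.1 (class project)"):
    """Download a page, sending an explicit User-Agent header."""
    response = requests.get(url, headers={"User-Agent": user_agent}, timeout=30)
    response.raise_for_status()  # raise an exception on 403 and other HTTP errors
    return response.text


def tables_from_html(html):
    """Parse every <table> in an HTML string into a list of DataFrames."""
    # read_html expects a URL, path, or file-like object, so literal HTML
    # is wrapped in StringIO before parsing.
    return pd.read_html(StringIO(html))


# Usage (needs network access):
# tables = tables_from_html(fetch_html(URL))
# medalists_untidy = tables[0]  # assumption: inspect the list to find the right table
```

`read_html()` returns a list with one DataFrame per table on the page, so the table of interest still has to be picked out by position or by checking its columns.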


In [15]:
import pandas as pd


medalists_untidy = pd.read_csv("Summer Olympic Medalists CSV.csv")

medalists_untidy.head(10)

Unnamed: 0,Discipline (link to medalists list),Contested,Number of,Unnamed: 3,Medals awarded,Unnamed: 5,Unnamed: 6,Unnamed: 7,Athlete(s) with the most medals,Athlete(s) with the most gold medals
0,,,Olympics,Medal events,"1st place, gold medalist(s)","2nd place, silver medalist(s)","3rd place, bronze medalist(s)",Total,(gold–silver–bronze),
1,,,(up to conclusion of 2024),(in 2024),,,,,,
2,Archery,1900–1908; 1920; since 1972,18,5,76,74,66,216,Hubert van Innis (BEL) (6–3–0),Hubert van Innis (BEL) (6–3–0)
3,Artistic swimming,Since 1984,11,2,22,20,21,63,Svetlana Romashina (RUS) (7–0–0),Svetlana Romashina (RUS) (7–0–0)
4,,,,,,,,,Huang Xuechen (CHN) (0–5–2),
5,Athletics,Since 1896,30,48,1075,1084,1073,3232,Paavo Nurmi (FIN) (9–3–0),Paavo Nurmi (FIN) (9–3–0)
6,"(men, women)",,,,,,,,,Carl Lewis (USA) (9–1–0)
7,Badminton,Since 1992,9,5,44,44,48,136,Gao Ling (CHN) (2–1–1),Gao Ling (CHN) (2–1–1)
8,,,,,,,,,,Fu Haifeng (CHN) (2–1–0)
9,,,,,,,,,,Viktor Axelsen (DEN) (2–0–1)


Very messy. Let's start by building a list of what we want the column names to be.

In [16]:
columns = ['Discipline', 'Contested', 'Number of Olympics', 'Medal Events', '# of Gold Medals']
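
To illustrate where a column list like this could lead, the sketch below rebuilds a few of the preview rows by hand and applies one possible cleanup. The last five column names, the `sample` frame, and the choice to forward-fill `Discipline` are assumptions made for illustration, not steps taken from this notebook.

```python
import pandas as pd

# A few rows copied from the preview above, keeping the ten-column layout
rows = [
    [None, None, "Olympics", "Medal events", "1st place, gold medalist(s)",
     "2nd place, silver medalist(s)", "3rd place, bronze medalist(s)",
     "Total", "(gold–silver–bronze)", None],
    [None, None, "(up to conclusion of 2024)", "(in 2024)",
     None, None, None, None, None, None],
    ["Archery", "1900–1908; 1920; since 1972", "18", "5", "76", "74", "66", "216",
     "Hubert van Innis (BEL) (6–3–0)", "Hubert van Innis (BEL) (6–3–0)"],
    ["Artistic swimming", "Since 1984", "11", "2", "22", "20", "21", "63",
     "Svetlana Romashina (RUS) (7–0–0)", "Svetlana Romashina (RUS) (7–0–0)"],
    [None, None, None, None, None, None, None, None,
     "Huang Xuechen (CHN) (0–5–2)", None],
]
sample = pd.DataFrame(rows)

# Only the first five names come from the cell above; the rest are hypothetical
column_names = [
    "Discipline", "Contested", "Number of Olympics", "Medal Events",
    "# of Gold Medals", "# of Silver Medals", "# of Bronze Medals",
    "Total Medals", "Most Medals", "Most Gold Medals",
]

cleaned = sample.copy()
cleaned.columns = column_names

# The first two rows are leftover pieces of the multi-line header
cleaned = cleaned.iloc[2:].reset_index(drop=True)

# Continuation rows have no discipline, so carry the previous one down.
# Rows such as "(men, women)" in the real data would need separate handling.
cleaned["Discipline"] = cleaned["Discipline"].ffill()

# Counts were read as text; convert them to numbers (blank cells become NaN)
count_cols = ["Number of Olympics", "Medal Events", "# of Gold Medals",
              "# of Silver Medals", "# of Bronze Medals", "Total Medals"]
cleaned[count_cols] = cleaned[count_cols].apply(pd.to_numeric)

print(cleaned[["Discipline", "# of Gold Medals", "Most Medals"]])
```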

### Dataset #2: COVID-19 Cases

In [17]:
covid_untidy = pd.read_csv("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv")

covid_untidy.head(10)

Unnamed: 0,Province/State,Country/Region,Lat,Long,1/22/20,1/23/20,1/24/20,1/25/20,1/26/20,1/27/20,...,2/28/23,3/1/23,3/2/23,3/3/23,3/4/23,3/5/23,3/6/23,3/7/23,3/8/23,3/9/23
0,,Afghanistan,33.93911,67.709953,0,0,0,0,0,0,...,209322,209340,209358,209362,209369,209390,209406,209436,209451,209451
1,,Albania,41.1533,20.1683,0,0,0,0,0,0,...,334391,334408,334408,334427,334427,334427,334427,334427,334443,334457
2,,Algeria,28.0339,1.6596,0,0,0,0,0,0,...,271441,271448,271463,271469,271469,271477,271477,271490,271494,271496
3,,Andorra,42.5063,1.5218,0,0,0,0,0,0,...,47866,47875,47875,47875,47875,47875,47875,47875,47890,47890
4,,Angola,-11.2027,17.8739,0,0,0,0,0,0,...,105255,105277,105277,105277,105277,105277,105277,105277,105288,105288
5,,Antarctica,-71.9499,23.347,0,0,0,0,0,0,...,11,11,11,11,11,11,11,11,11,11
6,,Antigua and Barbuda,17.0608,-61.7964,0,0,0,0,0,0,...,9106,9106,9106,9106,9106,9106,9106,9106,9106,9106
7,,Argentina,-38.4161,-63.6167,0,0,0,0,0,0,...,10044125,10044125,10044125,10044125,10044125,10044125,10044957,10044957,10044957,10044957
8,,Armenia,40.0691,45.0382,0,0,0,0,0,0,...,446819,446819,446819,446819,446819,446819,446819,446819,447308,447308
9,Australian Capital Territory,Australia,-35.4735,149.0124,0,0,0,0,0,0,...,232018,232018,232619,232619,232619,232619,232619,232619,232619,232974


In [None]:
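# Reshape from one column per date to one row per location and date
covid_tidy = covid_untidy.melt(id_vars=["Province/State", "Country/Region", "Lat", "Long"], var_name="Date", value_name="Cases")
# Column labels look like "1/22/20"; an explicit format avoids the format-inference warning
covid_tidy["Date"] = pd.to_datetime(covid_tidy["Date"], format="%m/%d/%y")
covid_tidy.sort_values(by=["Country/Region", "Date"]).head(10)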


Unnamed: 0,Province/State,Country/Region,Lat,Long,Date,Cases
0,,Afghanistan,33.93911,67.709953,2020-01-22,0
289,,Afghanistan,33.93911,67.709953,2020-01-23,0
578,,Afghanistan,33.93911,67.709953,2020-01-24,0
867,,Afghanistan,33.93911,67.709953,2020-01-25,0
1156,,Afghanistan,33.93911,67.709953,2020-01-26,0
1445,,Afghanistan,33.93911,67.709953,2020-01-27,0
1734,,Afghanistan,33.93911,67.709953,2020-01-28,0
2023,,Afghanistan,33.93911,67.709953,2020-01-29,0
2312,,Afghanistan,33.93911,67.709953,2020-01-30,0
2601,,Afghanistan,33.93911,67.709953,2020-01-31,0
