# Scraping data for the Olympics story
This notebook documents every step, successful or not, taken to collect data on the sex of the athletes at the Summer Olympic Games from 1896 to 2016.

In [5]:
# Import needed libraries

from bs4 import BeautifulSoup as bs
from bs4 import SoupStrainer
import requests 
import pandas as pd
import time

In [2]:
# Test bs -- this is a practice page for learning how to scrape. Just checking all is in order!

url = "http://pythonscraping.com/pages/page1.html"
req = requests.get(url)
soup = bs(req.content, "html.parser") # name the parser explicitly so the result does not depend on what is installed
soup

<html>
<head>
<title>A Useful Page</title>
</head>
<body>
<h1>An Interesting Title</h1>
<div>
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
</div>
</body>
</html>

In [9]:
# Get the URLs using a for loop -- URL from https://www.olympedia.org/counts

# A list with the page numbers that refer to the Summer Games
pag = [1, 2, 4]

for i in pag: # there are x 
    url = f"https://www.olympedia.org/counts/edition/{i}"
    print(url)
    time.sleep(3) # pause the loop for three seconds between iterations so the site is not flooded with requests

https://www.olympedia.org/counts/edition/1
https://www.olympedia.org/counts/edition/2
https://www.olympedia.org/counts/edition/4
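
The loop above only prints the URLs. As a rough sketch of how the same throttled loop could collect the pages, the helper below takes the fetching function as a parameter so it can be tried without touching the network. The names `fetch_pages` and `fetch_with_requests` are hypothetical and not part of the original notebook.

```python
import time
from typing import Callable, Dict, Iterable


def fetch_pages(urls: Iterable[str],
                fetch: Callable[[str], str],
                delay: float = 3.0) -> Dict[str, str]:
    """Fetch each URL in turn, sleeping `delay` seconds between requests."""
    pages = {}
    for n, url in enumerate(urls):
        if n > 0:
            time.sleep(delay)  # pause between requests, not before the first one
        pages[url] = fetch(url)
    return pages


def fetch_with_requests(url: str) -> str:
    """Hypothetical fetcher built on requests (imported lazily)."""
    import requests
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # fail loudly on 4xx/5xx instead of parsing an error page
    return response.text
```

Calling `fetch_pages(urls, fetch_with_requests)` would then return a dictionary mapping each URL to its HTML, which can be handed to BeautifulSoup afterwards.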


### Note 2!
The 1956 Olympics in Australia have a quirk: the equestrian events were held in Sweden, because Australia required horses to be quarantined for six months before entering the country, which made holding them there impractical. Olympedia stores this data on a separate page (TBC for the counter!), which affects the edition numbers in the counts URLs. Similarly, Olympedia counts the 1906 Intercalated Games which, although they <i>technically</i> were Olympic Games, are not currently recognized as such (the IOC does not even count those medals), so it might be a good idea to skip them.
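
One possible way to handle these extra pages is to keep a set of edition numbers to skip when building the URLs. This is only a sketch: the function name `edition_urls` is hypothetical, and the skipped number below is a placeholder, since the actual page numbers for the 1906 Games and the 1956 equestrian events still need to be checked on the site.

```python
BASE_URL = "https://www.olympedia.org/counts/edition/{}"


def edition_urls(edition_ids, skip=()):
    """Build the counts URL for every edition number not listed in `skip`."""
    skipped = set(skip)
    return [BASE_URL.format(i) for i in edition_ids if i not in skipped]


# Placeholder values: 3 is NOT a verified page number for any edition to skip.
urls = edition_urls(range(1, 6), skip={3})
for url in urls:
    print(url)
```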

In [10]:
# Same principle, data from Wikipedia's info box

games = [1896, 1900, 1904, 1908, 1912, 1920, 1924, 1928, 1932, 1936,
         1948, 1952, 1956, 1960, 1964, 1968, 1972, 1976, 1980, 1984, 1988,
         1992, 1996, 2000, 2004, 2008, 2012, 2016]

for i in games:
    url = f"https://en.wikipedia.org/wiki/{i}_Summer_Olympics"
    print(url)
    time.sleep(3) # pause the loop for three seconds between iterations

https://en.wikipedia.org/wiki/1896_Summer_Olympics
https://en.wikipedia.org/wiki/1900_Summer_Olympics
https://en.wikipedia.org/wiki/1904_Summer_Olympics
https://en.wikipedia.org/wiki/1908_Summer_Olympics
https://en.wikipedia.org/wiki/1912_Summer_Olympics
https://en.wikipedia.org/wiki/1920_Summer_Olympics
https://en.wikipedia.org/wiki/1924_Summer_Olympics
https://en.wikipedia.org/wiki/1928_Summer_Olympics
https://en.wikipedia.org/wiki/1932_Summer_Olympics
https://en.wikipedia.org/wiki/1936_Summer_Olympics
https://en.wikipedia.org/wiki/1948_Summer_Olympics
https://en.wikipedia.org/wiki/1952_Summer_Olympics
https://en.wikipedia.org/wiki/1956_Summer_Olympics
https://en.wikipedia.org/wiki/1960_Summer_Olympics
https://en.wikipedia.org/wiki/1964_Summer_Olympics
https://en.wikipedia.org/wiki/1968_Summer_Olympics
https://en.wikipedia.org/wiki/1972_Summer_Olympics
https://en.wikipedia.org/wiki/1976_Summer_Olympics
https://en.wikipedia.org/wiki/1980_Summer_Olympics
https://en.wikipedia.org/wiki/1

### PROBLEM:
Wikipedia's data sits in the infobox rather than in a table, which means that the data cleaning process might take longer.
Also, Wikipedia's and Olympedia's numbers do not match. Olympedia generally records more athletes (both male and female) than Wikipedia. The Olympedia team is arguably "more dedicated" to the content of the site, and its members are "experts" (their About page reveals that most contributors were academics or historians), so those numbers could be more trustworthy.
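
Once both sources have been scraped, the discrepancies could be measured per edition rather than eyeballed. The sketch below assumes each source ends up in a DataFrame with `edition`, `men` and `women` columns; those column names, the function `compare_counts` and all the numbers are placeholders, not real counts from either site.

```python
import pandas as pd


def compare_counts(olympedia: pd.DataFrame, wikipedia: pd.DataFrame) -> pd.DataFrame:
    """Join both sources on edition and compute Olympedia minus Wikipedia."""
    merged = olympedia.merge(
        wikipedia,
        on="edition",
        how="outer",  # keep editions that appear in only one source
        suffixes=("_olympedia", "_wikipedia"),
    )
    for col in ("men", "women"):
        merged[f"{col}_diff"] = merged[f"{col}_olympedia"] - merged[f"{col}_wikipedia"]
    return merged.sort_values("edition").reset_index(drop=True)


# Placeholder data only: these are made-up numbers, not scraped values.
olympedia = pd.DataFrame({"edition": ["edition_a", "edition_b"],
                          "men": [100, 200], "women": [10, 20]})
wikipedia = pd.DataFrame({"edition": ["edition_a", "edition_b"],
                          "men": [90, 200], "women": [10, 15]})

result = compare_counts(olympedia, wikipedia)
print(result[["edition", "men_diff", "women_diff"]])
```

A positive difference means Olympedia lists more athletes than Wikipedia for that edition.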

In [12]:
# Test

url1 = "https://www.olympedia.org/counts/edition/1"
my_headers = {"User-Agent":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36"}
req1 = requests.get(url1, headers = my_headers)
soup1 = bs(req1.content, "html.parser")

soup1

<!DOCTYPE html>

<html>
<head>
<title>Olympedia – Athlete count for 1896 Summer Olympics</title>
<meta content="authenticity_token" name="csrf-param"/>
<meta content="9FimWLFoVnxnm2KDLmDIMgqCfZJTOYu0uaiVJ9AHtwifz6QvcfnEymea7WaqG68NwsGvQMkahuZdBuN6qKVm2w==" name="csrf-token"/>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<meta content="EN" http-equiv="content-language"/>
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<link href="/assets/bootstrap.min-460a43de22fd9534d595e5aea2715cb154560291c9c6401b526e31c86a5ce32d.css" media="all" rel="stylesheet"/>
<link href="/assets/bootstrap-sortable-363d232309d54b549fa85446295ef2b5d290e3f8a49f1a646247340be3705ef9.css" media="all" rel="stylesheet"/>
<link href="/assets/jquery-ui-1.11.4.min-359ba1b9eb679ad05fb4c8fda710ee4c0239354f1ba635200b6065638295d646.css" media="all" rel="stylesheet"/>
<link href="/assets/lightbox-e29689e123fc27505d2b9d919f43ffcb6fade539cb4670f21c35aa07848105e7.css" media="screen" re

### Note!
The cells holding the total number of athletes seem to carry a "rowspan" attribute, unlike the cells in the rest of the rows. Filtering on that attribute might be the way to scrape the data from these pages.

The opening tag names have been removed so that the markup displays as text: <br>
class="count border-left border-top" rowspan="2">176 <br>
class="count border-top" rowspan="2">0 <br>
class="count border-top" rowspan="2"><b>176</b>
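
A minimal sketch of that idea, assuming the counts live in `<td>` cells (the tag name was stripped from the snippet above, so this is a guess). The sample HTML is rebuilt from the three cells shown, plus one made-up row without `rowspan` to show that it gets ignored. The function name `total_counts` is hypothetical.

```python
from bs4 import BeautifulSoup

# Rebuilt from the note above; the <td> tag name and the second row are assumptions.
SAMPLE_HTML = """
<table>
  <tr>
    <td class="count border-left border-top" rowspan="2">176</td>
    <td class="count border-top" rowspan="2">0</td>
    <td class="count border-top" rowspan="2"><b>176</b></td>
  </tr>
  <tr><td class="count">12</td></tr>
</table>
"""


def total_counts(html):
    """Return the integer values of every <td> that has a rowspan attribute."""
    soup = BeautifulSoup(html, "html.parser")
    cells = soup.find_all("td", attrs={"rowspan": True})  # True matches any value
    return [int(cell.get_text(strip=True)) for cell in cells]


totals = total_counts(SAMPLE_HTML)
print(totals)
```

On the real page the same function could be given `req1.content`, although the selector may need tightening if other cells also use `rowspan`.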