<a href="https://colab.research.google.com/github/kennedot/Github-and-Jupyter-setup/blob/main/stadium_exploration.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Data Collection

In this assignment, you'll work through another example of extracting data from a web page.

We'll start by importing some libraries.

In [1]:
import pandas as pd

# we'll use requests to read web pages
import requests

# newer versions of pandas expect HTML text to be wrapped in a
# file-like object such as StringIO before it is passed to read_html
from io import StringIO

## Before You Start
You will see some lines of code that use Python's `assert` statement. **DO NOT** change, update, or delete the assert statements. An `assert` checks that a condition is true and raises an error if it is not, so these statements can help you see whether your code is working correctly.

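If you haven't met `assert` before, here is a minimal sketch of how it behaves. The variable names and messages are made up for illustration; only the 44,500 figure comes from this assignment.

```python
# A passing assert does nothing; the program simply continues.
seats = 44_500
assert seats > 0, "capacity should be positive"

# A failing assert raises AssertionError, carrying the message after the comma.
try:
    assert seats > 100_000, "Alumni Stadium is not that big"
except AssertionError as error:
    message = str(error)
    print("assert failed:", message)
```

In the notebook, a failed `assert` stops the cell and shows its message, which is your cue to revisit the code above it.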
## Part 1: Web Scraping
Boston College plays home football games in Alumni Stadium. Many BC fans think this is a big stadium, as it can seat 44,500 people. That's a lot of people, but it's nowhere near the largest in the world. You may know that Michigan Stadium, also known as "The Big House," is the largest stadium in the United States, with a capacity of 107,601. Big, indeed.

As with so many topics, Wikipedia has a [page](https://en.wikipedia.org/wiki/List_of_stadiums_by_capacity) dedicated to the largest stadiums in the world. You'll see some interesting things on that page. In fact, the data may run counter to your initial expectations.

1. The largest stadium in the world is Narendra Modi Stadium, a cricket ground in India. Cricket is enormously popular in countries that don't start with "USA," so this might not be a surprise.

2. Number 2 is Rungrado 1st of May Stadium in North Korea. The North Korean national football (American translation: soccer) team plays in this stadium. Football makes sense because of its even more enormous popularity around the globe. North Korea...that one you may not have guessed.

3. Numbers 3-10 are all in the United States, and their tenants are...college football teams. Not NFL teams&mdash;*college* football teams. These stadiums are so big that they may have a larger capacity than the population of the college town they sit in.

And that's a question we might want to explore. What is the ratio of a stadium's capacity to the population of its city? To study that, we need to turn this Wikipedia page into data we can analyze.


### Task 1: Scraping
Let's start by grabbing the contents of the Wikipedia page.
  
1. Start by making a new variable called `wikipedia_URL`. Assign the following URL to that variable:     
[https://en.wikipedia.org/wiki/List_of_stadiums_by_capacity](https://en.wikipedia.org/wiki/List_of_stadiums_by_capacity). Make sure the URL is a string (in quotes).   

2. Use the function `requests.get()` to get the contents of the page. Use `wikipedia_URL` as the function's input value. Assign the result to a variable called `wikipedia_page`.

In [26]:
## ASSIGN VALUES TO THE VARIABLES BELOW
# this URL points to an archived snapshot (web.archive.org) of the Wikipedia page
wikipedia_URL = "https://web.archive.org/web/20240825234348/https://en.wikipedia.org/wiki/List_of_stadiums_by_capacity"

# request the page once and keep the response
wikipedia_page = requests.get(wikipedia_URL)

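Before working with a page, it can help to confirm that the request actually succeeded. The sketch below shows one way to do that with the response's `ok`, `status_code`, and `raise_for_status()`. The helper `describe_response` is hypothetical, and the two hand-built `Response` objects are only stand-ins so the example runs without a network connection; in the notebook you would pass `wikipedia_page` instead.

```python
import requests

def describe_response(response):
    """Summarize whether a response looks usable (hypothetical helper)."""
    # .ok is True when the status code is below 400
    if response.ok:
        return f"OK ({response.status_code})"
    return f"Problem ({response.status_code})"

# Hand-built stand-ins for real responses (assumption: offline demo only)
good = requests.Response()
good.status_code = 200

missing = requests.Response()
missing.status_code = 404

print(describe_response(good))
print(describe_response(missing))

# raise_for_status() turns a 4xx or 5xx status into an exception
try:
    missing.raise_for_status()
    raised = False
except requests.HTTPError:
    raised = True
print("raise_for_status raised an error:", raised)
```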
### Task 2: Do We Have the Right Content?

Let's make sure the text retrieved using `requests.get()` is correct by checking its title. In HTML, titles are embedded between `<title>` and `</title>` markup tags. We're looking for the following line in all of the HTML text:

> `<title>`List of stadiums by capacity - Wikipedia`</title>`

First, remember that we can use `.text` to get the text of the response stored in `wikipedia_page`. Try that below. But wait: print just the first 1000 characters unless you want to see **a lot** of HTML. You can slice the text string this way:

> `... .text[0:1000]`

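As a quick refresher, here is a small sketch of string slicing on a made-up HTML string (not the Wikipedia text). It also shows that asking for more characters than a string contains is harmless.

```python
html = "<!DOCTYPE html><html><head><title>Demo</title></head></html>"

first_15 = html[0:15]    # characters at positions 0 through 14
same_thing = html[:15]   # the starting 0 is optional

print(first_15)

# slicing past the end is safe: you simply get the whole string back
whole = html[0:1000]
print(len(whole) == len(html))
```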
In [29]:
# YOUR CODE HERE
# print the first 1000 characters from the stadium page
print(wikipedia_page.text[0:1000])

<!DOCTYPE html>
<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-custom-font-size-clientpref-1 vector-feature-appearance-enabled vector-feature-appearance-pinned-clientpref-1 vector-feature-night-mode-enabled skin-theme-clientpref-day vector-toc-available" lang="en" dir="ltr">
<head><script type="text/javascript" src="https://web-static.archive.org/_static/js/bundle-playback.js?v=FC38Hc5A" charset="utf-8"></script>
<script type="text/javascript" src="https://web-static.archive.org/_static/js/wombat.js?v=txqj7nKC" charset="utf-8"></script>
<script>window.RufflePlayer=window.RufflePlayer||{};window.RufflePlayer.config={"autoplay":"on","unmuteOverlay":"hidden","

Now let's see how to find the title of the page in this text. There are a few ways to look for the string. You'll learn about [regular expressions](https://www.oreilly.com/content/an-introduction-to-regular-expressions/) when you take the Intro to Python course. Regular expressions allow us to specify patterns to look for in a piece of text. In this case, we'd like to find the text between `<title>` and `</title>`. But we'll do this a simpler way for now.  

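For the curious, a regular-expression version might look like the sketch below. It uses Python's built-in `re` module on a made-up HTML string; the pattern is just one reasonable choice, and you don't need it for this assignment.

```python
import re

html = "<html><head><title>Page Title</title></head><body></body></html>"

# (.*?) captures as few characters as possible between the two tags
match = re.search(r"<title>(.*?)</title>", html, flags=re.DOTALL)

# re.search returns None when the pattern is not found
title = match.group(1) if match else ""
print(title)
```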
Below is a function named `find_string_between`. It takes three strings as arguments:
1. the_string: The complete string you want to search in
2. beginning_substring: The beginning string to look for. The string you want to find is between this one and...
3. end_substring: The end string.

The function will return the text between the beginning_substring and end_substring. First, here's the definition.

In [32]:
def find_string_between(the_string, beginning_substring, end_substring):
  """
    Given a string, find a substring that's between two other substrings, beginning_substring
    and end_substring.

    Arguments:
      the_string: The string you want to search
      beginning_substring: the substring to start the search
      end_substring: the substring that is the end boundary

    Return:
      the string between beginning_substring and end_substring OR "" if
      nothing is found
  """
  # find where the beginning substring starts; str.find returns -1 if it's missing
  begin_index = the_string.find(beginning_substring)
  if begin_index == -1:
    return ""

  # the text we want starts right after the beginning substring
  start = begin_index + len(beginning_substring)

  # get the ending index, where the end substring begins
  end = the_string.find(end_substring, start)
  if end == -1:
    return ""

  # the text we're looking for is between start and end
  return the_string[start:end]

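The function relies on the string method `str.find`. Here is a small sketch of how `find` behaves on its own, using a made-up string. Note that `find` signals "not found" by returning -1 rather than by raising an error.

```python
text = "<h1>This is a Heading</h1>"

found_at = text.find("<h1>")         # index where the first match starts
missing_at = text.find("<h2>")       # -1 means the substring is not there
closing_at = text.find("</h1>", 4)   # the second argument is where to start searching

print(found_at, missing_at, closing_at)

# slicing between the two positions gives the text in the middle
print(text[found_at + len("<h1>"):closing_at])
```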


Here are some examples of how you'd use it.

In [33]:
# some test strings
test_html_string = "<!DOCTYPE html><html><head><title>Page Title</title></head><body><h1>This is a Heading</h1><p>This is a paragraph.</p></body></html>"
phone_number = "(617) 867-5309"
medical_record = "Patient's name: John Smith, Age: 30, Medication: Aspirin."


# let's look for what's between <h1> and </h1>
in_between_string = find_string_between(test_html_string, "<h1>", "</h1>")

# grab the area code between the parentheses
area_code = find_string_between(phone_number, '(', ')')

# get the patient's name
patient_name = find_string_between(medical_record, 'name:', ',')

# Print the found strings
print("Found text:", in_between_string)
print("Area code:", area_code)
print("Patient's name:", patient_name)

Found text: This is a Heading
Area code: 617
Patient's name:  John Smith


You can use this on the Wikipedia text to find the page's title. The title is between the `<title>` and `</title>` substrings. Try it for yourself!

In [42]:
# call find_string_between with the wikipedia text and the title tags to get the page title
# get the return value in a variable named page_title

page_title = find_string_between(wikipedia_page.text, '<title>', '</title>')
print(page_title)

# you shouldn't see an error if the match worked
assert page_title == "List of stadiums by capacity - Wikipedia", "Incorrect string assigned to page_title"
assert type(page_title) == str, "page_title should be a string"

List of stadiums by capacity - Wikipedia


### Finding the Table
We learned earlier that `pandas` has the function `read_html` to read tables from HTML text. You call it much like `read_csv` and similar functions that you've used before. See if you can call the function and assign its return value to `web_tables`.

In [49]:
web_tables = pd.read_html(StringIO(wikipedia_page.text))

# print the statement below to see what we get
print("read_html found", len(web_tables), "tables")

read_html found 8 tables


Take a look at the [web page](https://en.wikipedia.org/wiki/List_of_stadiums_by_capacity). You'll see **multiple** tables with stadium information. The default for `read_html` is to find **all** tables on a page and return them in a list. Let's try to pick just one, the one containing stadiums with a capacity of 100,000 or more.

`read_html` has an argument named `match=` that selects only the tables containing a given piece of text (a string or regular expression). The table of stadiums with a capacity of 100,000 or more has several unique items. Let's use the name of one of its stadiums, Narendra Modi Stadium, in the `match=` argument to get that table.

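To see the difference between reading every table and reading only a matching one, here is a minimal sketch on a tiny, hand-written HTML string (not the Wikipedia page). It assumes an HTML parser that `pandas` can use, such as `lxml`, is installed.

```python
from io import StringIO

import pandas as pd

# Two small tables, written by hand for illustration
html = """
<table>
  <tr><th>Stadium</th><th>Country</th></tr>
  <tr><td>Narendra Modi Stadium</td><td>India</td></tr>
  <tr><td>Michigan Stadium</td><td>United States</td></tr>
</table>
<table>
  <tr><th>Stadium</th><th>Country</th></tr>
  <tr><td>Alumni Stadium</td><td>United States</td></tr>
</table>
"""

# Without match=, read_html returns a list with every table it finds
all_tables = pd.read_html(StringIO(html))

# With match=, only tables containing the given text are returned
matching = pd.read_html(StringIO(html), match="Narendra Modi")

print(len(all_tables), "tables in total,", len(matching), "matching")
print(matching[0])
```

Even when only one table matches, the result is still a list, so you index into it to get the `DataFrame`.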


In [52]:
# call read_html on the Wikipedia text, keeping only the table that matches "Narendra Modi"
web_tables = pd.read_html(StringIO(wikipedia_page.text), match="Narendra Modi")

# print the statement below to see what we get

print("read_html found", len(web_tables), "tables")

# check the result
assert len(web_tables) == 1

read_html found 1 tables


You should only have one table in the web_tables list. And that should be the one with stadiums with a capacity of 100,000 or more.  

Make a new variable below called `over_100000_table`. Assign the table in `web_tables` to `over_100000_table`. Then check the head of the `DataFrame` to see if it looks familiar.

In [58]:
over_100000_table = web_tables[0] # get the table from web_tables

# check the head below
over_100000_table.head()

In [59]:
# Some asserts to make sure the code is working as expected.
assert isinstance(over_100000_table, pd.DataFrame)

### Grabbing all of the tables
We started by grabbing one table containing stadiums with a capacity over 100,000. Now let's go back and grab all of the tables containing stadium data.   

In fact, we'll use "Stadium" as a unique identifier. Only tables with stadium data should have that word in their headers.

In [60]:
# get only the tables that contain the text "Stadium"
web_tables = pd.read_html(StringIO(wikipedia_page.text), match="Stadium")

# print the length of web_tables
print("read_html found", len(web_tables), "tables")

# checking the result
assert len(web_tables) == 7

read_html found 7 tables


We have 7 tables. You can check that by hand/eye on the [stadium web page](https://en.wikipedia.org/wiki/List_of_stadiums_by_capacity). We can also make sure that each of these tables has the same number of columns: That'll be important when we try to join them into a single table of stadium data.

In [61]:
for index, table in enumerate(web_tables):
    number_columns = len(table.columns)
    print("Number of columns in table", index, "=", number_columns)

Number of columns in table 0 = 8
Number of columns in table 1 = 8
Number of columns in table 2 = 8
Number of columns in table 3 = 7
Number of columns in table 4 = 7
Number of columns in table 5 = 7
Number of columns in table 6 = 7

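Counting columns tells you whether tables differ; comparing the column names tells you how. A minimal sketch using two made-up tables (the `Notes` column is invented for illustration and is not taken from the Wikipedia data):

```python
import pandas as pd

# Two toy tables; the second one lacks the "Notes" column
wide = pd.DataFrame({"Stadium": ["A"], "Capacity": [100000], "Notes": ["x"]})
narrow = pd.DataFrame({"Stadium": ["B"], "Capacity": [40000]})

# Set subtraction keeps the names that appear in wide but not in narrow
missing = set(wide.columns) - set(narrow.columns)
print(missing)
```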

Unfortunately, we don't have the same number of columns in every table. Can you tell which column is missing in the last four tables?  

But we can try to merge, or concatenate, the tables despite the missing column. Let's do that in two steps:

1. Use `pd.concat()` to concatenate the seven tables.
2. Each table has its own index. After doing the concatenation, call `reset_index()` so we get a single index for all the data. Include `drop=True` so `pandas` doesn't keep the old indices.

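Here is a small sketch of those two steps on made-up tables, one of which lacks a column (the `Notes` column and all values are invented for illustration). It shows that `pd.concat` fills the missing column with `NaN` and that `reset_index(drop=True)` replaces the repeated row labels.

```python
import pandas as pd

first = pd.DataFrame({"Stadium": ["A", "B"],
                      "Capacity": [100000, 90000],
                      "Notes": ["x", "y"]})
second = pd.DataFrame({"Stadium": ["C"], "Capacity": [40000]})

# Stack the tables; each one keeps its own row labels for now
stacked = pd.concat([first, second])
print(stacked.index.tolist())

# Replace the repeated labels with a single fresh index
combined = stacked.reset_index(drop=True)
print(combined.index.tolist())

# The row from the narrower table has NaN in the column it lacked
print(combined)
```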
In [63]:
# concatenate the tables
stadium_data = pd.concat(web_tables)

# reset the index on the new table
stadium_data = stadium_data.reset_index(drop=True)

# and let's look at the number of rows and columns
stadium_data.shape

(537, 8)

In [64]:
# checking that we have the correct number of rows
assert stadium_data.shape[0] == 537

Let's look at the last few entries in `stadium_data`. Compare these with the final entries in the [last Wikipedia table](https://en.wikipedia.org/wiki/List_of_stadiums_by_capacity#Capacity_of_40,000–50,000)...they should be the same.

In [66]:
# Call tail() on the stadium data below
stadium_data.tail()

|     | Stadium | Capacity | City (state) | Country | Region | Tenants | Sport(s) | Image |
|-----|---------|----------|--------------|---------|--------|---------|----------|-------|
| 532 | Xining Stadium | 40000 | Xining | China | East Asia | local football teams | Association football | NaN |
| 533 | Shaoxing China Textile City Sports Center | 40000 | Shaoxing | China | East Asia | NaN | Athletics | NaN |
| 534 | Anqing Sports Centre Stadium | 40000 | Anqing | China | East Asia | NaN | Athletics | NaN |
| 535 | Monumental Stadium of Caracas Simón Bolívar | 40000 | Caracas | Venezuela | South America | Leones del Caracas | Baseball | NaN |
| 536 | Kardinia Park | 40,000[155] | Geelong | Australia | Oceania | Geelong Cats, Melbourne Renegades* | Cricket, Australian rules football | NaN |


Look closely at the data in each row. You should see a few unusual items:
1. The capacity of Kardinia Park is "40,000[155]"
2. Several rows contain values labeled "NaN"
3. It looks like the Image column may contain lots of 'NaN'
4. Kardinia Park also has multiple tenants and sports listed

We're not going to touch these things yet&mdash;we'll wait until the next module on data cleaning to deal with these.

### Task 3: Saving the Data
Everything looks good. Your final task is to save the table as a CSV file named `stadium_data.csv`. Remember how to do that?

In [68]:
### YOUR CODE TO SAVE THE STADIUM_DATA HERE
stadium_data.to_csv("stadium_data.csv", index=False)

In [69]:
## doing a check to see that the file has been written to the current directory
import os
assert os.path.exists("stadium_data.csv"), "File named stadium_data.csv is not in the current directory."

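If you're wondering what `index=False` changes, here is a minimal sketch using a made-up table and a temporary folder (so it doesn't touch your `stadium_data.csv`). Saving with the index and reading the file back adds an extra `Unnamed: 0` column; saving with `index=False` does not.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"Stadium": ["A", "B"], "Capacity": [100000, 40000]})

with tempfile.TemporaryDirectory() as folder:
    # Default: the row labels 0, 1, ... are written as an unnamed first column
    with_index_path = os.path.join(folder, "with_index.csv")
    df.to_csv(with_index_path)
    with_index = pd.read_csv(with_index_path)

    # index=False leaves the row labels out of the file
    no_index_path = os.path.join(folder, "no_index.csv")
    df.to_csv(no_index_path, index=False)
    no_index = pd.read_csv(no_index_path)

print(with_index.columns.tolist())
print(no_index.columns.tolist())
```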
### Task 4: What about arenas?
You successfully crawled the Wikipedia page to get stadium data. But do these include **all** sporting venues? Of course not.  

Your next task is to crawl Wikipedia's [List of sports venues by capacity](https://en.wikipedia.org/wiki/List_of_sports_venues_by_capacity). Open the page, grab the table, and read it into a `DataFrame` named `venues_df`.

In [71]:
venues_URL = "https://web.archive.org/web/20240719185824/https://en.wikipedia.org/wiki/List_of_sports_venues_by_capacity"
venues_page = requests.get(venues_URL)  # use requests.get to open the URL

# read the tables on venues_page into venue_tables
venue_tables = pd.read_html(StringIO(venues_page.text), match="Capacity")

venues_df = venue_tables[0]  # get the first table

# call shape below to see the size of the DataFrame
display(venues_df.shape)

(583, 7)

In [None]:
assert venues_df.shape[0] == 583

### Done!
You successfully crawled Wikipedia pages and pulled the stadium and venue data! You may have noticed along the way that some of the data in the `DataFrame`s look...well, odd. We'll come back to the data in the next module on data cleaning.

The only thing left for you to do is to submit the completed Notebook to the Canvas site.