<font style='font-size:1.5em'>**Notebook 01 – Collecting Wikipedia data**</font>

<font style='font-size:1.2em'>LSE DS105L (2022/23) - Week 09</font>

**AUTHOR:**  [@jonjoncardoso](http://github.com/jonjoncardoso)

**DATE:** 17 March 2023

---


## Imports 
Section with library imports.

In [1]:
# First, I import the libraries I want to refer to by name
import requests

# Now I import the libraries I want to refer to by alias
import pandas as pd

# Lastly, I import individual functions and classes I want to refer to directly, rather than through the library name
from bs4 import BeautifulSoup

### Default functions
Section with some general functions used throughout the notebook.

# 1. Getting Data

Let's start by scraping some data from Wikipedia. Our focus will be on the [Machine Learning wiki entry](https://en.wikipedia.org/wiki/Machine_learning).

We will be using the `requests` library to make HTTP requests and the `BeautifulSoup` library to parse the HTML.

In [2]:
# First, I need to get the HTML from the page
response = requests.get('https://en.wikipedia.org/wiki/Machine_learning')

# Now I need to parse the HTML
soup = BeautifulSoup(response.text, 'html.parser')

Nothing new, right? We have been using these libraries for a while now.
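
One thing the cell above does not do is check whether the request actually succeeded before parsing. Below is a minimal sketch of a more defensive version. The helper name `fetch_soup` and the 10-second timeout are assumptions made for illustration; they are not part of the original notebook.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, timeout=10):
    """Download a page and return it parsed (hypothetical helper)."""
    # The timeout stops the call from hanging forever on a slow connection
    response = requests.get(url, timeout=timeout)

    # Raise requests.HTTPError if the server returned a 4xx/5xx status code
    response.raise_for_status()

    return BeautifulSoup(response.text, 'html.parser')
```

With this helper, `soup = fetch_soup('https://en.wikipedia.org/wiki/Machine_learning')` would do the same job as the two lines in the cell above, but fail loudly if the page could not be retrieved.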

# 2. Parsing HTML (Pure Python)

Say I want to capture the **headlines** of this page. As you know, I can do so by using the `find_all` method of the `BeautifulSoup` object.

In [28]:
headlines_soup = soup.find_all(attrs={'class': 'mw-headline'})
# How many headlines did I find?
len(headlines_soup)

54
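
To see what `find_all` matches without depending on the live page, here is a minimal sketch on a hand-written HTML snippet. The snippet is an assumption: it only imitates the heading markup this notebook relies on (a `span` with class `mw-headline` inside an `h2` or `h3`), and all variable names are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hand-written snippet imitating the heading markup (an assumption, not the real page)
sample_html = """
<h2><span class="mw-headline" id="Overview">Overview</span><span>[edit]</span></h2>
<p>Some paragraph.</p>
<h3><span class="mw-headline" id="Data_mining">Data mining</span><span>[edit]</span></h3>
"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')

# Three equivalent ways of selecting tags by CSS class
by_attrs = sample_soup.find_all(attrs={'class': 'mw-headline'})
by_class = sample_soup.find_all(class_='mw-headline')
by_css = sample_soup.select('.mw-headline')

# For each match, collect its id, the parent tag name and the parent tag text
sample_info = [(tag.get('id'), tag.parent.name, tag.parent.text) for tag in by_attrs]
print(sample_info)
```

All three selections should return the same two `span` tags here; which one to use is mostly a matter of taste.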

In [None]:
# For each headline, I want to get the parent tag:
for headline in headlines_soup:
    print(headline.parent)

In [21]:
def get_headlines_info(headlines_soup):
    """
    Extracts information about the headlines on a Wikipedia page.

    The returned list has one dictionary for each headline on the page.

    Parameters
    ----------
    headlines_soup : list
        A list of BeautifulSoup Tag objects with the class 'mw-headline'
        (for example, the result of soup.find_all)

    Returns
    -------
    list
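        A list of dictionaries, one per headline, with the following keys:
        - 'headline id': the id attribute of the headline
        - 'parent tag': the name of the parent tag (h2, h3, h4, h5, h6, etc.)
        - 'parent tag text': the text of the parent tag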
    """

    list_headlines = []

    # Build one dictionary per headline found on the page
    for headline in headlines_soup:
        list_headlines.append({'headline id': headline.get('id'),
                               'parent tag': headline.parent.name,
                               'parent tag text': headline.parent.text})

    return list_headlines

In [22]:
get_headlines_info(headlines_soup)[1:4]

[{'headline id': 'History_and_relationships_to_other_fields',
  'parent tag': 'h2',
  'parent tag text': 'History and relationships to other fields[edit]'},
 {'headline id': 'Artificial_intelligence',
  'parent tag': 'h3',
  'parent tag text': 'Artificial intelligence[edit]'},
 {'headline id': 'Data_mining',
  'parent tag': 'h3',
  'parent tag text': 'Data mining[edit]'}]
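
The `[edit]` at the end of each `parent tag text` comes from the edit link that sits next to every heading on the page. Here is a minimal sketch of one way to strip it, using a small sample copied by hand from the output above. The helper name `clean_heading_text` is hypothetical.

```python
def clean_heading_text(text, suffix='[edit]'):
    """Remove a trailing '[edit]' marker, if present (hypothetical helper)."""
    if text.endswith(suffix):
        return text[:-len(suffix)].strip()
    return text.strip()


# Sample copied by hand from the output above
sample_headlines = [
    {'headline id': 'Artificial_intelligence',
     'parent tag': 'h3',
     'parent tag text': 'Artificial intelligence[edit]'},
    {'headline id': 'Data_mining',
     'parent tag': 'h3',
     'parent tag text': 'Data mining[edit]'},
]

# Copy each dictionary, replacing only the text field with its cleaned version
cleaned_headlines = [
    {**row, 'parent tag text': clean_heading_text(row['parent tag text'])}
    for row in sample_headlines
]

print([row['parent tag text'] for row in cleaned_headlines])
```

Checking with `endswith` first means text without the marker is left alone, so the helper is safe to apply to every row.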

In [26]:
pd.DataFrame(get_headlines_info(headlines_soup))

|    | headline id                               | parent tag   | parent tag text                                 |
|---:|:------------------------------------------|:-------------|:------------------------------------------------|
|  0 | Overview                                  | h2           | Overview[edit]                                  |
|  1 | History_and_relationships_to_other_fields | h2           | History and relationships to other fields[edit] |
|  2 | Artificial_intelligence                   | h3           | Artificial intelligence[edit]                   |
|  3 | Data_mining                               | h3           | Data mining[edit]                               |
|  4 | Optimization                              | h3           | Optimization[edit]                              |


# Conclusions

This notebook is a work in progress (WIP). I will be adding more content to it as we go along.