# Week 4: Web Scraping

This week, we will learn to automatically extract information from a website. To do so, we need to understand the following:

- the HTML language, the building block of the Web;
- the CSS language, describing how HTML elements are positioned, how text is displayed, etc.;
- how CSS styles are attached to HTML elements through attributes and ‘CSS selectors’;
- inspecting a website in your browser and finding out how to select information to scrape.

Once we have the above basic understanding, we will use a Python library to help us automate information extraction from the websites.

## HTML

HyperText Markup Language (HTML) is the language that web browsers interpret to display a web page. Just as Python code is interpreted by the Python interpreter, web browsers (e.g., Chrome, Firefox, Safari) are interpreters for HTML code.

A simple example of HTML code looks as follows:

```
<html>
    <div>
        <h1>Hello World!</h1>
        <p>This is my first HTML code</p>
    </div>
</html>
```

<div class="alert-info alert">

**Exercise 0.1:** Go to https://htmledit.squarefree.com/ and insert the HTML code in the top input box. Notice how the tags such as `<p>` or `</p>` disappear, so that only the text between these tags is rendered.

</div>


### HTML Elements

**Notice** how the tags appear in pairs, such as `<p>` and `</p>` or `<div>` and `</div>`. A start tag (e.g., `<p>`), its content, and the end tag (e.g., `</p>`) together form an **HTML element**. A web browser displays the text contained in an element according to the element's function, and each tag has a specific purpose. Here are some examples:

<hr />
<br />

| tag (name)              | Function | 
|:------------------------|:------------|
| `p` (paragraph)         | Display the text as is |                    
| `h1` (headline-1)       | Display the text as the main headline with bold and larger font |
| `h2` (headline-2)       | Display the text as a sub-headline with bold and slightly smaller font than `h1` |
| `div` (division)        | Define a section in the HTML document |
| `html` (root)           | Define the end and start of the HTML code | 

<br/>
<hr/>

Finally, these **tags can be nested**. In our example, a div-tag contains an h1-tag and a p-tag.
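
To make the nesting concrete, here is a minimal sketch that walks the example document with `HTMLParser` from Python's standard library (not the library used later in this lab) and prints each tag indented by its depth. The class name `OutlineParser` is our own, chosen for illustration, and the sketch assumes well-formed HTML in which every start tag has a matching end tag.

```python
from html.parser import HTMLParser

HTML_DOC = """
<html>
    <div>
        <h1>Hello World!</h1>
        <p>This is my first HTML code</p>
    </div>
</html>
"""


class OutlineParser(HTMLParser):
    """Record each element's nesting depth and the text it contains."""

    def __init__(self):
        super().__init__()
        self.outline = []  # (depth, tag) pairs in document order
        self.texts = []    # (enclosing tag, text) pairs
        self._stack = []   # tags that are currently open

    def handle_starttag(self, tag, attrs):
        self.outline.append((len(self._stack), tag))
        self._stack.append(tag)

    def handle_endtag(self, tag):
        # Assumes well-formed input: the closing tag matches the last open tag.
        self._stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text and self._stack:
            self.texts.append((self._stack[-1], text))


parser = OutlineParser()
parser.feed(HTML_DOC)

for depth, tag in parser.outline:
    print("    " * depth + tag)
print(parser.texts)
```

The indentation of the printed outline mirrors the parent/child structure that the browser's inspector will show later in this lab.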

<div class="alert-warning alert">

A **full list of all tags** is available here: https://www.w3schools.com/tags/default.asp
</div>

## CSS 

Cascading Style Sheets (CSS) tells the web browser how to style the content of an element. It is the language that usually accompanies HTML code. The styling may include a specific font size, color, background color, and many more properties related to visually displaying that element.

One can think of HTML tags like `<h1>` as having some default CSS. However, when we want to do more than what is available through HTML tags alone, we need to use CSS.

```html
<html>
    <head>
        <style>
        p {
            background-color: lightblue;
            font-size: 20px
        }

        h1 {
            text-align: center
        }
        </style>
    </head>

    <body>
        <div>
            <h1>Hello World!</h1>
            <p>This is my first HTML code</p>
        </div>
    </body>
</html>
```



Since it is beyond the capacity of a notebook to render the above CSS in these blocks, we will use an online HTML hosting platform (JSFiddle) to test the previous example.

Click on https://jsfiddle.net/pguptacs/vg8wuzL5/ to see how the above HTML code is displayed. If HTML doesn't display at the bottom right corner, then click on the `Run` button at the top-left corner of the window. 

**Note** how specifications like `text-align` or `background-color` stylize their respective HTML elements.
**Note** that we have partitioned the HTML document into two sections using the head-tag and the body-tag. The head-tag contains all the metadata, while the body tag contains everything that needs to be displayed on a web page. 

The above code cleanly separates CSS styling from HTML code via a style-tag nested within the head-tag. However, HTML is flexible enough to let the user define these styles on the individual elements. For example, study the code below and note how we moved the styles into the elements themselves.

```
<html>
  <div>
    <h1 style="text-align: center;">Hello World!</h1>
    <p style="background-color:lightblue; font-size:20px">This is my first HTML code</p>
  </div>
</html>
```

<div class="alert-info alert">

**Exercise 0.2:** Go to https://htmledit.squarefree.com/ and insert the HTML code in the top input box. Notice how the individual elements have been styled in a similar fashion to the preceding HTML code with a separate style-tag.

</div>

You can also view the above code snippet in JSFiddle here: https://jsfiddle.net/pguptacs/5s2av39u/
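
An inline `style` attribute is simply a list of `property: value` declarations separated by semicolons. As a rough illustration, the sketch below splits such an attribute value into a Python dictionary. The function name `parse_inline_style` is hypothetical, and the sketch ignores edge cases that real CSS parsers handle (for example, semicolons inside quoted values).

```python
def parse_inline_style(style_value):
    """Split 'prop: value; prop: value' into a {prop: value} dictionary."""
    declarations = {}
    for declaration in style_value.split(";"):
        if ":" not in declaration:
            continue  # skip empty fragments, e.g. after a trailing semicolon
        prop, value = declaration.split(":", 1)
        declarations[prop.strip()] = value.strip()
    return declarations


p_style = parse_inline_style("background-color:lightblue; font-size:20px")
h1_style = parse_inline_style("text-align: center;")

print(p_style)
print(h1_style)
```

Note that the separator between declarations is a semicolon, not a comma; a comma would cause the browser to discard the declaration.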

### Advanced CSS styling using HTML attributes

We saw above that each start tag can carry attributes such as `style="text-align: center"` to define the style of that element. HTML code can quickly become too messy to read if we define styles in each element. To organize the CSS styling, consider the following two situations:

- **(A)** Multiple elements all need to be styled in the same way: HTML provides the `class` attribute for this, and elements sharing a class are styled using the CSS selector `.`
- **(B)** One element needs a unique CSS style: HTML provides the `id` attribute to give that element a unique ID, and it can be styled separately using the CSS selector `#`.

**Note:** We have introduced **HTML attributes** and their corresponding **CSS selectors**. This distinction will become clearer with the following example, which assigns `class` and `id` attributes to the p-tags and defines their styles separately in the style sheet using the corresponding selectors `.` and `#`.

```
<html>
    <head>
        <style>
        .done {
            background-color: lightblue;
            font-size: 20px
        }

        #todo {
            background-color: lightgreen;
            text-align: center
        }
        </style>
    </head>

    <body>
        <div>
            <h1>Hello World!</h1>
            <p class="done">I know how HTML works</p>
            <p class="done">I know how CSS works</p>
            <p id="todo">I am learning about HTML attributes and CSS selectors</p>
        </div>
    </body>
</html>
```

To display the above code, head to https://jsfiddle.net/pguptacs/o9vfryg8/ and notice the difference.

<div class="alert-info alert">

**Example 0.3:** Think of what will happen when we have a class and an id defined on the same element. For example, how will the following HTML code be rendered?

</div>

```
<html>
    <head>
        <style>
        .done {
            background-color: lightblue;
            font-size: 20px
        }

        #todo {
            background-color: lightgreen;
            text-align: center
        }
        </style>
    </head>

    <body>
        <div>
            <h1>Hello World!</h1>
            <p class="done">I know how HTML works</p>
            <p class="done">I know how CSS works</p>
            <p class="done" id="todo">I am learning about HTML attributes and CSS selectors</p>
        </div>
    </body>
</html>
```

Head over to https://jsfiddle.net/pguptacs/utbxoe48/ and note the following:

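- The `#todo` rule overrides the `.done` rule where they conflict (here, `background-color`), because an `id` selector is more specific than a `class` selector.
- The final style is the union of the individual styles: the paragraph keeps `font-size` from `.done` and gains `text-align` from `#todo`.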
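
The two observations can be mimicked with a toy model in Python: merge the declarations of every matching rule, applying the `id` rule last so that it wins any conflict. This is only a sketch of the idea; the function `computed_style` is hypothetical, and real browsers follow more detailed cascade rules (source order, `!important`, inheritance) that are ignored here.

```python
# Declarations from the style sheet above, keyed by selector.
rules = {
    ".done": {"background-color": "lightblue", "font-size": "20px"},
    "#todo": {"background-color": "lightgreen", "text-align": "center"},
}


def computed_style(classes, element_id=None):
    """Merge matching rules; the id rule is applied last, so it wins conflicts."""
    style = {}
    for class_name in classes:
        style.update(rules.get(f".{class_name}", {}))
    if element_id is not None:
        style.update(rules.get(f"#{element_id}", {}))
    return style


# The third paragraph carries both class="done" and id="todo".
third_paragraph = computed_style(["done"], "todo")
print(third_paragraph)
```

The merged dictionary contains three properties: `font-size` from the class, `text-align` from the id, and the `background-color` of the id rule.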

### Combining multiple HTML attributes

Finally, many classes and ids can be combined in the same tag. For example, look at the code below and think about how it will be rendered.

```
<html>
    <head>
        <style>
        .done {
            background-color: lightblue;
            font-size: 20px
        }

        #todo {
            background-color: lightgreen;
            text-align: center
        }
        
        .weight {
          font-weight: bold;
        }
         </style>
    </head>

    <body>
        <div>
            <h1>Hello World!</h1>
            <p class="done weight">I know how HTML works</p>
            <p class="done weight">I know how CSS works</p>
            <p class="done " id="todo">I am learning about HTML attributes and CSS selectors</p>
        </div>
    </body>
</html>
```

Now head over to https://jsfiddle.net/pguptacs/03bu2tLs/ and check for yourself how it is rendered.
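
A `class` attribute holds a whitespace-separated list of class names, whereas `id` holds a single name. The sketch below, which again uses the standard-library `HTMLParser` rather than the library introduced later, collects both for each paragraph of the example. The class name `ParagraphAttributes` is our own choice for illustration.

```python
from html.parser import HTMLParser

BODY = """
<p class="done weight">I know how HTML works</p>
<p class="done weight">I know how CSS works</p>
<p class="done " id="todo">I am learning about HTML attributes and CSS selectors</p>
"""


class ParagraphAttributes(HTMLParser):
    """Collect the class list and id of every <p> element."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []  # (list of classes, id or None) per paragraph

    def handle_starttag(self, tag, attrs):
        if tag != "p":
            return
        attributes = dict(attrs)
        # str.split() without arguments also discards the stray trailing space.
        classes = (attributes.get("class") or "").split()
        self.paragraphs.append((classes, attributes.get("id")))


collector = ParagraphAttributes()
collector.feed(BODY)

for classes, element_id in collector.paragraphs:
    print(classes, element_id)
```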

## Inspecting a web page

To extract information from a website automatically, we need to find out how to tell a computer what patterns to look for. To do so, we will make use of the HTML code itself, in particular its elements and their CSS classes and ids.

Let's take as an example, the following website https://www.oii.ox.ac.uk/people/faculty/. 

- **Step 1:** Where do we find the HTML code of that page?
    - Open the above URL in a separate window.
    - Two-finger click on a Mac (or right-click on Windows) anywhere on the web page, and select the last option `Inspect` from the dropdown menu.
    - This will open up a **console** in the web browser with options far beyond what we need in this course.
    - Focus your attention on HTML code under the **Inspector** (Firefox) or **Elements** (Chrome) tab, which will also be the default view.
    - For now, just know that this is how we can read the HTML code of any web page.


<image src="https://mdn.mozillademos.org/files/16371/landingPage_PageInspector.png" />

- **Step 2:** We want to **extract the name of all the faculty members** at the OII. We can visually locate them within the blue blocks on the web page, but what does it mean in terms of the HTML code?
    - How can we localize the HTML tags that correspond to the name of the faculty members?
    - Reading all the HTML code is a tedious task. Therefore, move the cursor over the faculty name (e.g., hover over where it's written “Professor Victoria Nash”). Then right-click and repeat the above process of inspecting the source code.
    - You will notice that the console now highlights the HTML element that contains this hovered text. Specifically, it will highlight the following element: `<h4>Professor Victoria Nash</h4>`.

- **Step 3:** We are also interested in extracting other information (e.g., **bio and the department position**) from these blue blocks.
    - How do we locate the HTML element corresponding to the bigger blue box with name, photo, and bio?
    - As a reminder, HTML code is written as nested elements. Once you have found the name, look at its parent and repeat until you find an element that contains all the child elements holding the title, photo, and bio.
    - In the console, hover the mouse over a few different elements, and notice how the corresponding highlight changes on the web page. This happens because the web browser highlights the element that is currently being hovered over. This is very helpful!
    - Now notice that we are able to locate the bigger blue box by hovering over the element: `<article class="box  people-box light-background box-has-button third-at-full  has_url people  ">`

- **Step 4:** Great! We have understood how the web page designer structured the HTML code such that the faculty information can be easily located under the article-tag with classes such as `box`, `people-box`, and so on. Now, perform a sanity check by hovering over similar tags and verifying whether they correspond to the boxes of other faculty members.




## Automating Web Scraping using Python

We just understood the manual process of locating information on a web page. We will now automate that process using Python. Clearly, we need a tool to download a page and select HTML elements that match certain attributes. In our running example on extracting information about OII faculty members, we need to select elements with article-tag and attributes as box, people-box, etc. How can we do so using Python?


We need two libraries:
- `requests` to download the HTML code of a web page. It is used by `requests-html`,
- `requests-html` is a library that has functions to enable HTML parsing and search of information within the HTML code.

In [None]:
%%capture
!pip install requests
!pip install requests-html

In [None]:
import pandas as pd
from tqdm import tqdm  # fancy library to print the progress bar while iterating through a list

<div class="alert-warning alert">

**Note:** Throughout the exercises, we use the browsing `session` created below. Think of it as a virtual browser that will download the pages you tell it to crawl.

</div>

In [None]:
from requests_html import HTMLSession
session = HTMLSession()

<div class="alert alert-info">

**Example 0.4:** To extract the HTML code of a URL, we will use the `session.get` method.

Extract the HTML code of the website: https://www.oii.ox.ac.uk/people/faculty/ and print the HTML code. 

</div>

In [None]:
r = session.get('https://www.oii.ox.ac.uk/people/faculty/')

In [None]:
print(r.html.html)

<div class="alert-info alert">

**Example 0.5:** Now, we are going to make use of the full potential of requests-html.

We are going to find and extract all the elements corresponding to the tag `<article class="box people-box light-background box-has-button third-at-full has_url people">`.

</div>

<div class="alert-warning alert">

**Reminder:** Refer back to the definition of CSS selectors:
- classes are prefaced by a `.` (e.g., `.box`, `.people-box`)
- unique id selectors by `#` (e.g., `#about`)
- elements do not have any prefix (e.g., `div`, `h1`, `h2`)

To match an HTML element, we can combine all these selectors in a nested chain: `div.people-box h2` would match any `h2` element that is inside a `div` element with the `people-box` class.

You can do more complex patterns: `div.box.people-box.has_url` matches any `div` element that has the three classes `box`, `people-box`, and `has_url`.

Be careful that:

- The sequence of selectors matters, as HTML is nested code
- The order matters: `.box p` matches a paragraph in a box, `p.box` matches a paragraph with the `box` class, `p .box` matches a box within a paragraph.

</div>
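
To see what a compound selector such as `article.box.people-box` asks for, here is a toy matcher written in plain Python. It is only a sketch under simplifying assumptions: the helper names `parse_compound` and `matches` are hypothetical, elements are represented as small dictionaries, and only a single compound selector (tag, `#id`, `.class`) is supported. It does not handle nested chains such as `div.people-box h2`, which a real selector engine like the one in requests-html supports.

```python
import re


def parse_compound(selector):
    """Split a selector such as 'div.box#about' into (tag, id, classes)."""
    tag = re.match(r"[a-zA-Z][\w-]*", selector)
    ids = re.findall(r"#([\w-]+)", selector)
    classes = re.findall(r"\.([\w-]+)", selector)
    return (tag.group(0) if tag else None, ids[0] if ids else None, set(classes))


def matches(element, selector):
    """Return True if the element satisfies every part of the selector."""
    tag, element_id, classes = parse_compound(selector)
    if tag is not None and element["tag"] != tag:
        return False
    if element_id is not None and element.get("id") != element_id:
        return False
    # Every class named in the selector must be present on the element.
    return classes <= set(element.get("class", "").split())


# The article element located in the inspector earlier in this lab.
article = {
    "tag": "article",
    "class": "box  people-box light-background box-has-button third-at-full  has_url people  ",
}

print(matches(article, "article.box.people-box"))
print(matches(article, ".box.people-box"))
print(matches(article, "div.box.people-box"))
```

The first two selectors match the same element, while the third fails because the element is an `article`, not a `div`.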

In [None]:
faculty_members = r.html.find('article.box.people-box')
faculty_members

In [None]:
faculty_members = r.html.find('.box.people-box')
faculty_members

<div class="alert-warning alert">

**Note** how both the selectors `article.box.people-box` and `.box.people-box` return the required elements. It is entirely possible to have several ways to identify the same elements.

</div>

<div class="alert-info alert">

**Example 0.6:** Let's investigate the `faculty_members` elements returned above, and look at the various nested elements and how their content is displayed on the web page.

- Select the first faculty member in the list (`faculty_members[0]`)
- Go back to the web page and inspect its child elements under the `article` element. Try to understand which child element contains the name, position in the department, bio, and the link to their home page. 


Finally, go back to Python and use the `find` method of the first faculty member to extract the name, position, bio, and link.

</div>

In [None]:
first_person = r.html.find('.box.people-box')[0]

The following is a rough sketch of the article element:

```
<article>
    <a href="URL TO THE HOMEPAGE">
    <img src="URL TO THE IMAGE">
    <div>
        <h4>NAME</h4>
        <p>POSITION IN THE DEPARTMENT</p>
        <p>BIO</p>
    </div>
</article>
```


Clearly, we will find name in the first h4-tag, department position in the first p-tag, and bio in the second p-tag, and the URL to the home page in the first a-tag.
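
The same lookup logic can be tried offline on a filled-in copy of the sketch. The example below uses `xml.etree.ElementTree` from the standard library as a stand-in, which only works because this copy is well-formed XML (every tag is closed); real web pages usually are not, which is why the lab relies on requests-html. The person, position, bio, and URLs are invented placeholders.

```python
import xml.etree.ElementTree as ET

# A well-formed, filled-in copy of the sketch above (all values are made up).
article_xml = """
<article class="box people-box">
    <a href="https://example.org/people/jane-doe/">
        <img src="https://example.org/images/jane-doe.jpg" />
    </a>
    <div>
        <h4>Dr Jane Doe</h4>
        <p>Senior Research Fellow</p>
        <p>Jane studies online communities.</p>
    </div>
</article>
"""

article = ET.fromstring(article_xml)

name = article.find(".//h4").text                       # first h4 anywhere below
paragraphs = [p.text for p in article.findall(".//p")]  # all p elements, in order
position, bio = paragraphs[0], paragraphs[1]
homepage_link = article.find(".//a").get("href")        # href of the first a

print("Name:", name)
print("Position:", position)
print("Bio:", bio)
print("URL:", homepage_link)
```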

In [None]:
name = first_person.find('h4', first=True).text
pos = first_person.find('p')[0].text
bio = first_person.find('p')[1].text
homepage_link = list(first_person.find('a', first=True).links)[0]

print("Name:", name)
print("Position:", pos)
print("Bio:", bio)
print("URL:", homepage_link)

<div class="alert-info alert">

**Example 0.7:** Print name, position, and url of all the faculty members at OII. 

</div>

In [None]:
for e in r.html.find('.box.people-box'):
    name = e.find('h4', first=True).text
    pos = e.find('p')[0].text
    urls = list(e.links)
    
    print(name, " | ",  pos, " | ", urls[0])

## Exercise 1: Web page crawler

<div class="alert-info alert">

**Exercise 1.1:** Investigate the home page of Dr. Victoria Nash (or any other faculty member): https://www.oii.ox.ac.uk/people/profiles/victoria-nash/ and find the HTML elements

- that contain the contents of **About**;

- that contain the contents of **Current Courses**.


</div>



In [None]:
r = # COMPLETE THE CODE TO CALL session.get

In [None]:
about_section = # COMPLETE THE CODE TO SELECT THE ELEMENT BY ID
print(about_section.text)

In [None]:
teaching = r.html.find('REPLACE THIS STRING WITH APPROPRIATE CSS SELECTOR SEQUENCE')
list_of_classes = [<<REPLACE THIS WITH AN APPROPRIATE ATTRIBUTE TO ACCESS TEXT>> for t in teaching]

print(list_of_classes)

<div class="alert-info alert">

**Example 1.2:** Make a function that takes a home page URL (string) as an argument and returns the content in the **About** section as well as the list of courses being taught by the faculty member in that term.

- Call this function on https://www.oii.ox.ac.uk/people/profiles/victoria-nash/ and verify that it is correct
- Call this function on https://www.oii.ox.ac.uk/people/profiles/luc-rocher/ and verify that it is correct

</div>

In [None]:
def crawl_homepage(URL):
    r = session.get(URL)

    about_section = # COMPLETE THE CODE TO SELECT THE ELEMENT BY ID AND SELECT THE TEXT

    teaching = # COMPLETE THE CODE TO SELECT THE ELEMENT BY APPROPRIATE CSS SELECTOR SEQUENCE
    list_of_classes = # COMPLETE THE CODE

    return about_section, list_of_classes

In [None]:
crawl_homepage('https://www.oii.ox.ac.uk/people/profiles/victoria-nash/')

In [None]:
crawl_homepage('https://www.oii.ox.ac.uk/people/profiles/luc-rocher/')

<div class="alert-info alert">

**Example 1.3:** Make a dataframe containing the following information about the OII faculty members: `name` , `position` in the department, `shortbio` as specified on  https://www.oii.ox.ac.uk/people/faculty/ , `url` of their homepage, `about` contents on their homepage, and `courses` that they are teaching currently. 

</div>

In [None]:
page_to_crawl = "REPLACE THE STRING WITH APPROPRIATE URL"
r = session.get(page_to_crawl)

people = []

for people_box in r.html.find('REPLACE THIS STRING WITH APPROPRIATE CSS SELECTOR SEQUENCE'):
    name = people_box.find('h4', first=True).text
    pos = people_box.find('p')[0].text
    shortbio = people_box.find('p')[1].text
    homepage_url = list(people_box.links)[0]

    about, courses = # CALL THE FUNCTION 
    
    people.append(dict(name=name, pos=pos, shortbio=shortbio, url=homepage_url, about=about, courses=courses))

people_df = pd.DataFrame(people)

In [None]:
people_df

## Exercise 2: Understanding the limits of data scraping


<div class="alert-info alert">

- Go to https://www.instagram.com/oxford_uni/
- Inspect the elements. What selectors would you use to extract the description, number of followers, and the URL of each post?
- Now use `session` to extract HTML code of the above link
- Extract all the links from this HTML. How many links did it return?
- Print the HTML code returned above and check the header of that HTML. What does the title say? What is different this time compared to the previous exercise?

</div>

In [None]:
r = session.get('REPLACE THE STRING WITH APPROPRIATE URL')

In [None]:
## access links using `.links` attribute on r.html

# YOUR CODE HERE

What happened? Let's look at the source code:

In [None]:
## access html source code  using `.html` attribute on r.html
print(
    # YOUR CODE HERE
)

Did you notice that Instagram forces you to login? The title of that page at the top is `<title>Login • Instagram</title>`. We arrived directly at the login page!

Scraping Instagram would require a much more complex authentication process, and even then you could still be labelled as a bot (your account could be locked or blocked). While doing so is beyond the scope of this lab, keep this pitfall in mind when evaluating which websites you can scrape.
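
One inexpensive safeguard is to look at the page title before parsing anything else. The sketch below does this on a hard-coded HTML string, so it runs without network access. The helper `page_title` and the rule "the title contains the word login" are our own assumptions for illustration, not a general way to detect blocked requests.

```python
import re


def page_title(html_source):
    """Return the text of the first <title> element, or None if there is none."""
    match = re.search(
        r"<title[^>]*>(.*?)</title>", html_source, flags=re.IGNORECASE | re.DOTALL
    )
    return match.group(1).strip() if match else None


# Stand-in for the HTML returned by the crawl above.
sample_html = "<html><head><title>Login • Instagram</title></head><body></body></html>"

title = page_title(sample_html)
is_login_wall = title is not None and "login" in title.lower()

print(title)
print("Redirected to a login page?", is_login_wall)
```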

## Exercise 3: Let's scrape El País

Let's now see if we can use our new scraping skills on a different website, the Spanish newspaper El País. We would like to collect all news articles published on a given date, for instance January 1st 1990 to start with.

Head towards https://elpais.com/hemeroteca/1990-01-01/ and inspect the webpage. Look at the **elements corresponding to the article headline** and the way the CSS classes are defined. Scraping is doable, but the class names carry little meaning: the classes `c` and `c--m-n` appear to encode an article block. (This is likely an automatic optimization done to minimize bandwidth and reduce the size of a webpage.)

In [None]:
# Let's start by crawling the list of articles from 1990-01-01:

r = session.get("https://elpais.com/hemeroteca/1990-01-01/")

<div class="alert alert-info">

**Exercise 3.1** Our first task is to extract the links to the pages containing articles published on that date.

We have two solutions:
1. we extract all the links in the webpage into a list, then filter that list to keep only the URLs that point to a news article;
2. we find the HTML element whose children are all the links to news articles, then extract all its links.

Run the commands below and compare them. Which one seems easier for you?

</div>
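
The filtering step of the first solution can be tried offline on a small hand-made list. In the sketch below, only the first path comes from this lab; the other entries are invented to show what gets rejected. The pattern escapes the dot in `.html` and anchors the end of the string, which is slightly stricter than strictly necessary.

```python
import re

# Hypothetical sample of links, as they might appear on an archive page.
sample_links = [
    "/diario/1990/01/02/madrid/631283054_850215.html",
    "/hemeroteca/1990-01-02/",
    "https://elpais.com/",
    "/diario/1990/01/01/",
]

# re.match anchors at the start of the string; "$" anchors the end.
article_pattern = re.compile(r"/diario/1990/.*\.html$")

article_links = [link for link in sample_links if article_pattern.match(link)]

print(f"Kept {len(article_links)} of {len(sample_links)} links")
print(article_links)
```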

In [None]:
# Solution 1

import re

# We create a list of all the links on the webpage we crawled:

all_links = # YOUR CODE HERE

print(f"We collected {len(all_links)} URLs")
print(all_links)

# We filter that using a regular expression (https://automatetheboringstuff.com/2e/chapter7/):
links_to_articles = [link for link in all_links if re.match(r"/diario/1990.*\.html", link)]

# Our new list includes all the links to a ‘diario’ page:
print(f"We filtered down to {len(links_to_articles)} article URLs")
print(links_to_articles)

In [None]:
# Solution 2: 

# after inspection, we will collect all the headers-2 (h2) elements with the class 'c_t'
article_headers = r.html.find('REPLACE THIS STRING WITH APPROPRIATE CSS SELECTOR SEQUENCE')

links_to_articles = []
for h2 in article_headers:
    links_to_articles.extend(
        # YOUR CODE HERE TO ACCESS links in h2
            ) # there should be only one link

print(f"We collected {len(links_to_articles)} article URLs")
print(links_to_articles)

<div class="alert-warning alert">

**Note:** We see that all of the links start with `/diario/year/month/date/blah/blah`. If you copy any of those links and paste it in your browser, it will not work. The full link is actually `https://elpais.com/diario/year/month/date/blah/blah`. The former is referred to as a **relative path** and the latter as an **absolute path**. It is really up to the website developers which one they prefer. In our case, we have access to the relative paths, so we need to convert them to absolute paths before we can crawl the articles.


</div>
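
Besides gluing strings together, the standard library offers `urllib.parse.urljoin`, which resolves a relative path against the URL of the page it was found on. A minimal sketch, using the archive URL and an article path that appear in this lab:

```python
from urllib.parse import urljoin

# The page we crawled, and a relative path found on it.
base_url = "https://elpais.com/hemeroteca/1990-01-01/"
relative_path = "/diario/1990/01/02/madrid/631283054_850215.html"

# A path starting with "/" replaces the whole path of the base URL.
absolute_url = urljoin(base_url, relative_path)

print(absolute_url)
```

Using `urljoin` also behaves sensibly if a link is already absolute, in which case it is returned unchanged.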

<div class="alert-info alert">

**Exercise 3.2:** Let's now extract information from that first article
- Select the first article link
- Convert the link to an absolute link and call the crawler on that link.

</div>

In [None]:
article_one = # YOUR CODE HERE
print(article_one)

In [None]:
absolute_link = f'https://elpais.com{article_one}'
print("absolute link", absolute_link)

r = # YOUR CODE HERE TO CRAWL absolute_link
print(r.url)

<div class="alert alert-info">

**Exercise 3.3** Let's extract the title, author, and full text of the article.

Open the page of the above url in your browser and inspect it. Find which HTML element and CSS selector to use in order to match title, author, and full text.

For the full text, think about how you would match all the paragraphs.
</div>

In [None]:
title = r.html.find('REPLACE THIS STRING WITH APPROPRIATE TAG NAME', first=True).text
print(title)

In [None]:
author = r.html.find('REPLACE THIS STRING WITH APPROPRIATE CSS CLASS SELECTOR AND CLASS NAME', first=True).text
print(author)

In [None]:
# The full text is divided into paragraphs. Let's first match all the paragraphs:

# How would you match all the paragraphs?
all_the_paragraphs = r.html.find('REPLACE THIS STRING WITH APPROPRIATE CSS SELECTORS FOR PARAGRAPHS IN article')
print(f"We collected {len(all_the_paragraphs)} paragraphs.")

# Then, we extract the text of each paragraph and put them all into a list
paragraph_texts = [p.text for p in all_the_paragraphs]

# Finally, we concatenate the paragraph texts into a full_text object
full_text = '\n'.join(paragraph_texts)

<div class="alert alert-info">

**Exercise 3.4** Fill in the function `process_article` below. It takes an article path (e.g., `/diario/1990/01/02/madrid/631283054_850215.html`), crawls it, and returns a dictionary with the title, author, and full text.
</div>

In [None]:
def process_article(article_path):
    article_url = f'https://elpais.com{article_path}'
    r = session.get(article_url)

    title = # COMPLETE THIS CODE (REFER TO THE CODE SNIPPETS ABOVE)
    author = # COMPLETE THIS CODE (REFER TO THE CODE SNIPPETS ABOVE)
    paragraphs = # COMPLETE THIS CODE (REFER TO THE CODE SNIPPETS ABOVE)
    full_text = # COMPLETE THIS CODE (REFER TO THE CODE SNIPPETS ABOVE)

    return dict(title=title, author=author, full_text=full_text)

In [None]:
process_article(article_one)

<div class="alert-info alert">

**Exercise 3.5:** Loop over all the article links published on January 1st, 1990 and scrape all their contents. You can use list comprehension to loop over all the articles.

- Make a DataFrame of the data scraped above
- Add a column, `date` to your dataframe that contains the date January 1st, 1990 in pandas DateTime format. 

</div>

In [None]:
content_from_1990_01_01 = [<<CALL FUNCTION ON THIS LINK TO PROCESS IT>> for link in tqdm(links_to_articles)]
df_1990_01_01 = pd.DataFrame(content_from_1990_01_01)

In [None]:
# Let's convert the date into a meaningful Python format:
df_1990_01_01['date'] = pd.to_datetime('1990-01-01').date()

In [None]:
df_1990_01_01

<div class="alert-info alert">

**Exercise 3.6:** How would we collect **one year of data**? Complete the function below.

The code snippet below simplifies our manual process. Notice how simply iterating over all the potential dates (from `'1990-01-01'` to `'1990-12-31'`) would allow us to crawl the entire 1990 archive. All that in a few dozen lines of Python.

</div>

 

In [None]:
def article_urls_for_one_date(str_date, str_year):
    """
    Return links to all articles of El País for a given date.

    Arguments:
        - str_date (str): a string formatted as yyyy-mm-dd
        - str_year (str): the year as a four-digit string, used to filter article links
    """
    absolute_link = # YOUR CODE HERE
    r = session.get(absolute_link)

    all_links = # YOUR CODE HERE TO COLLECT ALL LINKS
    links_to_articles = [link for link in all_links if re.match(rf"/diario/{str_year}.*\.html", link)]
    return links_to_articles

In [None]:
articles_to_crawl = article_urls_for_one_date('1990-01-01', '1990')
df = pd.DataFrame([process_article(link) for link in tqdm(articles_to_crawl)])
df['date'] = pd.to_datetime('1990-01-01').date()

<div class="alert-warning alert">

How do we iterate over the dates in a year? Check out: https://stackoverflow.com/a/1060352/3413239

</div>
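
As an offline illustration of the date arithmetic alone (no crawling involved), the sketch below generates every date string of 1990 with a small generator. The function name `dates_between` is our own; the approach is the same `timedelta` loop as in the linked answer, wrapped in a reusable form.

```python
from datetime import date, timedelta


def dates_between(first_day, last_day):
    """Yield every date from first_day to last_day, both included."""
    current = first_day
    while current <= last_day:
        yield current
        current += timedelta(days=1)


dates_1990 = [
    day.strftime("%Y-%m-%d")
    for day in dates_between(date(1990, 1, 1), date(1990, 12, 31))
]

print(len(dates_1990), "dates, from", dates_1990[0], "to", dates_1990[-1])
```

Each string has the `yyyy-mm-dd` format expected by `article_urls_for_one_date`.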

In [None]:
# THIS IS DONE FOR YOU (NO EXERCISE HERE)

from datetime import date, timedelta
import re

all_articles_df = []

# code copied from the Stackoverflow link above
start_date = date(1990, 1, 1)
end_date = date(1990, 1, 4)
delta = timedelta(days=1)
while start_date <= end_date:
    str_date = start_date.strftime("%Y-%m-%d")
    str_year = start_date.strftime("%Y")

    print(str_date)

    # our own code
    articles_to_crawl = article_urls_for_one_date(str_date, str_year)
    df = pd.DataFrame([process_article(link) for link in tqdm(articles_to_crawl)])
    df['date'] = pd.to_datetime(str_date).date()

    # collect all dataframe in a list
    all_articles_df.append(df)

    start_date += delta


# concatenation of all dataframes
all_articles_df = pd.concat(all_articles_df)

# let's look
print("Number of articles collected: ", all_articles_df.shape[0])



In [None]:
all_articles_df.sample(10)

## Exercise 4 (optional): Extracting information from the BitChute search engine

BitChute is a popular video platform for alt-right and far-right communities. In this optional exercise, we will see how to scrape information from BitChute. This exercise is optional and **does not work** in a Jupyter notebook.

<div class="alert alert-info">

**Exercise 4.1** Inspect https://www.bitchute.com/search/?query=oxford&kind=video and try to find the elements that match a video title.

Then run the following commands **from your local terminal** on your machine.

</div>

In [None]:
%%capture
!pip install requests-html

In [None]:
from requests_html import HTMLSession
session = HTMLSession()

r = session.get('https://www.bitchute.com/search/?query=oxford&kind=video')

In [None]:
# What should this extract?

titles = r.html.find('.video-result-title')
print(len(titles))

We are running into a problem! In fact, the browser doesn't have time to render the video elements, which are loaded a few milliseconds after we crawl the page. Let's fix that.

We will use some advanced Python functions (asynchronous features) that you don't need to fully understand. Intuitively, requests-html is going to download Google Chrome, then instruct Chrome to crawl the page. This will simulate an actual browser accessing the page.

In [None]:
from requests_html import AsyncHTMLSession
session = AsyncHTMLSession()

r = await session.get("https://www.bitchute.com/search/?query=oxford&kind=video")

await r.html.arender()

In [None]:
titles = r.html.find('.video-result-title')
print(len(titles))

Again, this shouldn't work. Let's do a final test, this time by instructing Chrome to wait for a full second and scroll down until videos are loaded.

In [None]:
session = AsyncHTMLSession()

r = await session.get("https://www.bitchute.com/search/?query=oxford&kind=video")
await r.html.arender(sleep=1, scrolldown=10)

In [None]:
titles = r.html.find('.video-result-title')
print(len(titles)) # We should get approximately 30 titles now!

## Homework 4: this week's datasheet questions

Throughout this course, we will aim to build on the practice of documenting our datasets, using the Datasheets for Datasets framework (here is an <a href="https://github.com/zykls/folktables/blob/main/datasheet.md">example of a datasheet</a>). In the homework for this week, you designed a small dataset of page content scraped from El País. Let's assume you plan to use this dataset in your research project.

How would you structure your Datasheet for this small dataset? For this week's homework, please answer the following questions:

> **Was any preprocessing/cleaning/labeling of the data done** (e.g., discretization or bucketing, tokenization, part-of-speech tagging, SIFT feature extraction, removal of instances, processing of missing values)? If so, please provide a description. If not, you may skip the remainder of the questions in this section.
>
>...

>**What tasks could the dataset be used for?**
>
>...

>**Is there anything about the composition of the dataset or the way it was collected and preprocessed/cleaned/labeled that might impact future uses?** For example, is there anything that a future user might need to know to avoid uses that could result in unfair treatment of individuals or groups (e.g., stereotyping, quality of service issues) or other undesirable harms (e.g., financial harms, legal risks)? If so, please provide a description. Is there anything a future user could do to mitigate these undesirable harms?
>
>...

>**Are there tasks for which the dataset should not be used?** If so, please provide a description.
>
>...


