# CSE621 Web Crawling Examples
---
Prepared by: Kyle Spurlock

Spring 2023

University of Louisville

---

In [8]:
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup
from IPython.display import display, Image
import pickle
import time
import json
import re
import sys

sys.path.append("../")

# I. Wikipedia
---
More information available here: [Wikimedia API](https://www.mediawiki.org/wiki/API).

## I.I Supplementing Existing Data

For this example, suppose I want to add some textual data to an existing dataset, [MovieLens 10M: Hetrec 2011](https://grouplens.org/datasets/hetrec-2011/).

First, we load the parts of the dataset that we want to supplement.

In [6]:
movies = pd.read_csv("hetrec/movies.dat", encoding="latin-1", sep="\t").rename(
    columns={"id": "movieID"}
)

movies_df = movies.loc[:, ["movieID", "title", "year"]]
movies_df["year"] = movies_df["year"].astype("str")
display(movies_df.head(7))
print(movies_df.shape)

|   | movieID | title | year |
|---|---|---|---|
| 0 | 1 | Toy story | 1995 |
| 1 | 2 | Jumanji | 1995 |
| 2 | 3 | Grumpy Old Men | 1993 |
| 3 | 4 | Waiting to Exhale | 1995 |
| 4 | 5 | Father of the Bride Part II | 1995 |
| 5 | 6 | Heat | 1995 |
| 6 | 7 | Sabrina | 1954 |


(10197, 3)


To crawl Wikipedia we will mainly be using the [requests library](https://pypi.org/project/requests/) with the Wikipedia API endpoint. Requests is similar to Python's standard-library `urllib` module, or to the command-line tool wget. There are also many other tools that make this process even simpler. [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) is another helpful library, used for parsing the HTML that requests retrieves.

In [9]:
response = requests.get("https://requests.readthedocs.io/en/latest/user/advanced/")
soup = BeautifulSoup(response.content, "html.parser")  # Name the parser explicitly

# Print the contents of the first five paragraphs only
for paragraph in soup.find_all("p")[:5]:
    print(paragraph.contents)


['This document covers some of Requests more advanced features.']
['The Session object allows you to persist certain parameters across\nrequests. It also persists cookies across all requests made from the\nSession instance, and will use ', <code class="docutils literal notranslate"><span class="pre">urllib3</span></code>, '’s ', <a class="reference external" href="https://urllib3.readthedocs.io/en/latest/reference/index.html#module-urllib3.connectionpool">connection pooling</a>, '. So if\nyou’re making several requests to the same host, the underlying TCP\nconnection will be reused, which can result in a significant performance\nincrease (see ', <a class="reference external" href="https://en.wikipedia.org/wiki/HTTP_persistent_connection">HTTP persistent connection</a>, ').']
['A Session object has all the methods of the main Requests API.']
['Let’s persist some cookies across requests:']
['Sessions can also be used to provide default data to the request methods. This\nis done by provid

So, the goal is to find the corresponding Wikipedia articles for each of these 10,197 movies, and then extract the text from them to use as additional features in another task.

However, we need to consider what kind of result Wikipedia will give us. Some titles, such as "Heat" above, are ambiguous on their own. Look at what comes back when we query directly with a title:

In [376]:
URL = "https://en.wikipedia.org/w/api.php"

SEARCHPAGE = "Toy Story"

PARAMS = {
    "action": "query",
    "format": "json",
    "list": "search",
    # "eilimit": 500,
    "srsearch": SEARCHPAGE,
}

response = requests.get(url=URL, params=PARAMS)
result = response.json()

result

{'batchcomplete': '',
 'continue': {'sroffset': 10, 'continue': '-||'},
 'query': {'searchinfo': {'totalhits': 21771,
   'suggestion': 'to story',
   'suggestionsnippet': 'to story'},
  'search': [{'ns': 0,
    'title': 'Toy Story',
    'pageid': 53085,
    'size': 113875,
    'wordcount': 11806,
    'snippet': '<span class="searchmatch">Toy</span> <span class="searchmatch">Story</span> is a 1995 American computer-animated comedy film directed by John Lasseter (in his feature directorial debut), produced by Pixar Animation Studios',
    'timestamp': '2023-04-19T15:38:10Z'},
   {'ns': 0,
    'title': 'Toy Story (franchise)',
    'pageid': 20800509,
    'size': 104363,
    'wordcount': 7447,
    'snippet': '<span class="searchmatch">Toy</span> <span class="searchmatch">Story</span> is an American media franchise owned by The Walt Disney Company. The franchise centers around <span class="searchmatch">toys</span> that, unknown to humans, are secretly living',
    'timestamp': '2023-04-16T0
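
To make the request above less opaque, here is a minimal sketch, using only the standard library, of how a parameter dictionary turns into the query string of a URL. It is an illustration of the idea rather than a description of what requests does internally, and `build_query_url` is a hypothetical helper name, not part of any library.

```python
from urllib.parse import urlencode

URL = "https://en.wikipedia.org/w/api.php"
PARAMS = {
    "action": "query",
    "format": "json",
    "list": "search",
    "srsearch": "Toy Story",
}


def build_query_url(url, params):
    """Append URL-encoded parameters to a base URL (hypothetical helper)."""
    return f"{url}?{urlencode(params)}"


full_url = build_query_url(URL, PARAMS)
print(full_url)
```

Pasting the printed URL into a browser is a quick way to inspect an API response by hand before writing any crawling code.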

Many movie titles, like "Heat," share their names with other things. If we simply grab the top result and pull the text from it, that text may describe something other than the object we actually care about (movies, in this case).

Getting around this requires a solution that is highly specific to movies. While REST APIs and API endpoints are common and broadly uniform across websites, you will typically have to read the documentation to determine how to use a given one properly. This mostly pertains to which parameters are valid for your query.

For Wikipedia, it is possible to request a "generator," which we can use to make iterative requests (essentially like going from page 1 -> page 2 -> page n of results). We are going to use this to collect the page IDs of all articles that embed a "Template:Infobox_film" element (the generic version of the box that shows up right underneath the primary image of an article), which indicates that the article is about a movie.

In [None]:
S = requests.Session()
URL = "https://en.wikipedia.org/w/api.php"
PARAMS = {
    "action": "query",
    "generator": "embeddedin",
    "format": "json",
    "geititle": "Template:Infobox_film",  # Per the API docs, some parameters must be prepended with a g when using a generator
}

all_movie_pages = {}

while True:
    response = S.get(url=URL, params=PARAMS)
    result = response.json()
    if "error" in result:
        raise RuntimeError(result["error"])
    if "warnings" in result:
        print(result["warnings"])
    if "query" in result:
        all_movie_pages.update(result["query"]["pages"])

    print(f"Items collected: {len(all_movie_pages)}", end="\r")

    if "continue" not in result:
        break
    # Carry every continuation value (including geicontinue) into the next request
    PARAMS.update(result["continue"])
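
The continuation pattern in the loop above can also be factored into a small generator, which makes it easy to try out without touching the network. The following is only a sketch: `iterate_pages`, `fake_fetch`, and the two canned responses are hypothetical names and data made up for illustration, with the responses shaped like the ones the API returns.

```python
def iterate_pages(fetch, params):
    """Yield each batch of pages, following 'continue' values until none remain."""
    params = dict(params)  # Copy so the caller's dictionary is not mutated
    while True:
        result = fetch(params)
        if "error" in result:
            raise RuntimeError(result["error"])
        if "query" in result:
            yield result["query"]["pages"]
        if "continue" not in result:
            return
        params.update(result["continue"])


# Made-up two-page "API", keyed by the continuation token it receives
FAKE_RESPONSES = {
    None: {
        "query": {"pages": {"1": {"title": "A"}}},
        "continue": {"geicontinue": "token-2", "continue": "gei||"},
    },
    "token-2": {"query": {"pages": {"2": {"title": "B"}}}},
}


def fake_fetch(params):
    """Stand-in for a real HTTP call; returns a canned response."""
    return FAKE_RESPONSES[params.get("geicontinue")]


all_pages = {}
for batch in iterate_pages(fake_fetch, {"action": "query", "generator": "embeddedin"}):
    all_pages.update(batch)

print(sorted(all_pages))
```

Against the real endpoint, `fetch` could be something like `lambda p: S.get(url=URL, params=p).json()`, reusing the session and URL defined earlier.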

Saving these for later:

In [11]:
with open("all_movie_pages.pickle", "wb") as handle:
    pickle.dump(all_movie_pages, handle)

In [15]:
with open("all_movie_pages.pickle", "rb") as handle:
    all_movie_pages = pickle.load(handle)

In [16]:
len(all_movie_pages)

153660

This is a slow process, but it ensures that we get the most accurate results when we start to collect the articles.

Now what we are going to do with this is essentially use it as a lookup table to confirm whether or not a query item is in fact a movie.

In [24]:
S = requests.Session()
URL = "https://en.wikipedia.org/w/api.php"
PARAMS = {
    "action": "query",
    "format": "json",
    "list": "search",
    "srsearch": "", # Fill this in in the loop
}

wiki_page_ids = {}

for i, row in movies_df.iterrows():
    iid, title, year = row.values
    # Separate title and year with a space so they are searched as two terms
    PARAMS["srsearch"] = f"{title} {year}"

    response = S.get(url=URL, params=PARAMS)
    result = response.json()

    if "error" in result:
        raise RuntimeError(result["error"])
    if "warnings" in result:
        print(result["warnings"])
    if "query" in result and result["query"]["searchinfo"]["totalhits"] > 0:
        for hit in result["query"]["search"]:
            pageid = hit["pageid"]
            # The lookup table part: keep the first hit that is a known film page
            if str(pageid) in all_movie_pages:
                # print(f"Match found for: {title} \t Pageid: {pageid}")
                entry = {"title": title, "year": year, "pageid": pageid}
                wiki_page_ids[iid] = entry
                break

In [25]:
len(wiki_page_ids)

9772
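
So 9,772 of the 10,197 movies (about 95.8%) were matched, leaving 425 without a page ID. A small helper like the one below can list which items were missed so they can be inspected by hand. This is only a sketch: `coverage_report` is a hypothetical name and the IDs in the example are toy values, not taken from the dataset.

```python
def coverage_report(all_ids, matched):
    """Return (match rate, list of IDs that have no entry in `matched`)."""
    all_ids = list(all_ids)
    unmatched = [i for i in all_ids if i not in matched]
    rate = 1 - len(unmatched) / len(all_ids) if all_ids else 0.0
    return rate, unmatched


# Toy example: four items, one of which found no match
example_ids = [1, 2, 3, 4]
example_matches = {1: {"pageid": 11}, 2: {"pageid": 22}, 4: {"pageid": 44}}

rate, unmatched = coverage_report(example_ids, example_matches)
print(rate, unmatched)
```

With the notebook's own objects, the call would be `coverage_report(movies_df["movieID"], wiki_page_ids)`.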

Saving for later again:

In [27]:
with open("hetrec_movie_wiki_id.pickle", "wb") as handle:
    pickle.dump(wiki_page_ids, handle)

In [28]:
with open("hetrec_movie_wiki_id.pickle", "rb") as handle:
    wiki_page_ids = pickle.load(handle)

In [197]:
soup.find("meta", {"property": "mw:PageProp/toc"})

<meta property="mw:PageProp/toc">
<h2><span class="mw-headline" id="Plot">Plot</span><span class="mw-editsection"><span class="mw-editsection-bracket">[</span><a href="/w/index.php?title=Toy_Story&amp;action=edit&amp;section=1" title="Edit section: Plot">edit</a><span class="mw-editsection-bracket">]</span></span></h2>
<p>In a world where toys come to life and pretend to be lifeless whenever humans aren’t around, a group of toys are preparing to move into a new house with their owner <a class="mw-redirect" href="/wiki/Andy_Davis_(Toy_Story)" title="Andy Davis (Toy Story)">Andy Davis</a>, his sister <a class="mw-redirect" href="/wiki/Molly_Davis" title="Molly Davis">Molly</a> and <a href="/wiki/List_of_Toy_Story_characters#Mrs._Davis" title="List of Toy Story characters">their single mother</a>. The toys become uneasy when Andy has his birthday party a week early. <a class="mw-redirect" href="/wiki/Sheriff_Woody" title="Sheriff Woody">Sheriff Woody</a>, Andy's favorite toy and their lea

In [215]:
import copy


wiki_page_ids_plots = copy.deepcopy(wiki_page_ids)

S = requests.Session()

URL = "https://en.wikipedia.org/w/index.php" # Note this is different from before!

PARAMS = {
    "curid": None
}

counter = 1
size = len(wiki_page_ids_plots)

for iid, attr in wiki_page_ids_plots.items():
    PARAMS["curid"] = attr["pageid"]

    response = S.get(url=URL, params=PARAMS)
    soup = BeautifulSoup(response.content, "html.parser")

    print(f"{np.round(100*(counter / size),3)}%", end="\r")
    counter += 1

    # find() returns None (it does not raise) when the page has no infobox
    current_element = soup.find("table", {"class": "infobox"})
    if current_element is None:
        print(f"No infobox found for: {attr['title']}")
        wiki_page_ids_plots[iid]["Description"] = ""
        continue

    all_paragraph_string = ""
    while True:
        current_element = current_element.next_sibling
        if current_element is None:
            break  # Ran out of sibling elements
        if current_element == "\n":
            continue  # Skip the whitespace between elements
        if current_element.name == "p":
            # Join every text fragment inside the paragraph
            all_paragraph_string += "".join(current_element.strings)
        elif current_element.name == "meta":
            # The table-of-contents marker ends the lead section; jump to the
            # first section heading, which is labelled as the plot
            current_element = current_element.find_next("h2")
            if current_element is None:
                break
            all_paragraph_string += "\n\nPlot: "
        else:
            break  # The next heading, or any other element, ends the section

    wiki_page_ids_plots[iid]["Description"] = all_paragraph_string

100.0%%
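
To see how the sibling walk in the loop above behaves, it helps to run the same idea on a tiny piece of HTML. The snippet below is a sketch under loud assumptions: the HTML is a made-up, heavily simplified stand-in for an article body, `lead_and_first_section` is a hypothetical function name, and real Wikipedia markup is more complex and may change over time.

```python
from bs4 import BeautifulSoup

# Made-up stand-in for an article body (not real Wikipedia markup)
HTML = (
    "<div>"
    '<table class="infobox"><tr><td>Infobox</td></tr></table>'
    "<p>Lead paragraph.</p>"
    '<meta property="mw:PageProp/toc">'
    "<h2>Plot</h2>"
    "<p>Plot paragraph.</p>"
    "<h2>Cast</h2>"
    "<p>Cast paragraph.</p>"
    "</div>"
)


def lead_and_first_section(html):
    """Collect the paragraphs after the infobox, plus the first section's text."""
    soup = BeautifulSoup(html, "html.parser")
    element = soup.find("table", {"class": "infobox"})
    text = ""
    while element is not None:
        element = element.next_sibling
        if element is None or element == "\n":
            continue  # End of siblings (loop exits) or whitespace (skipped)
        if element.name == "p":
            text += element.get_text()
        elif element.name == "meta":
            # Table-of-contents marker: jump to the first section heading
            element = element.find_next("h2")
            if element is not None:
                text += "\n\nPlot: "
        else:
            break  # A second heading or any other element stops the walk
    return text


description = lead_and_first_section(HTML)
print(repr(description))
```

The walk stops at the second heading, so the cast section is never reached, and a page without an infobox simply yields an empty string.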

In [216]:
wiki_page_ids_plots

{1: {'title': 'Toy story',
  'year': '1995',
  'pageid': 53085,
  'Description': 'Toy Story is a 1995 American computer-animated comedy film produced by Pixar Animation Studios and released by Walt Disney Pictures. The first installment in the  Toy Story franchise, it was the first entirely computer-animated feature film, as well as the first feature film from Pixar. It was directed by John Lasseter (in his feature directorial debut) and produced by Bonnie Arnold and Ralph Guggenheim, from a screenplay written by Joss Whedon, Andrew Stanton, Joel Cohen, and Alec Sokolow and a story by Lasseter, Stanton, Pete Docter, and Joe Ranft. The film features music by Randy Newman, and was executive-produced by Steve Jobs and Edwin Catmull. The film features the voices of Tom Hanks, Tim Allen, Don Rickles, Jim Varney, Wallace Shawn, John Ratzenberger, Annie Potts, R. Lee Ermey, John Morris, Laurie Metcalf, and Erik von Detten.\nTaking place in a world where toys come to life when humans are not p

In [217]:
with open("wiki_page_ids_plots.pickle", "wb") as f:
    pickle.dump(wiki_page_ids_plots, f)

In [230]:
wiki_df = pd.DataFrame.from_dict(wiki_page_ids_plots, orient="index").reset_index().rename({"index": "movieID"}, axis=1)
wiki_df = wiki_df.drop(["year", "title"], axis=1)

wiki_df.to_csv("./hetrec_movie_wiki_descriptions.csv", index=False)
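
Since the goal was to supplement the existing data, one possible last step is to join the descriptions back onto the movie table. The sketch below uses two tiny DataFrames with placeholder values (the description strings are made up, not crawled text) and assumes a left join is wanted, so that movies without a match are kept with an empty description.

```python
import pandas as pd

# Placeholder data shaped like movies_df and wiki_df
example_movies = pd.DataFrame(
    {"movieID": [1, 2, 3], "title": ["Toy story", "Jumanji", "Grumpy Old Men"]}
)
example_wiki = pd.DataFrame(
    {"movieID": [1, 3], "Description": ["placeholder text A", "placeholder text B"]}
)

# A left join keeps every movie, even those with no crawled description
merged = example_movies.merge(example_wiki, on="movieID", how="left")
merged["Description"] = merged["Description"].fillna("")

print(merged)
```

With the real data, the same call would be `movies_df.merge(wiki_df, on="movieID", how="left")`.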