# Fetching Movie Data from Wikipedia

Recommender systems are one of the most common applications of machine learning. They are usually trained on previously collected user ratings, and making good suggestions requires a substantial amount of data per user. Since we don't have such data, we'll use outgoing Wikipedia links to collect a dataset of movies, which we'll then use to train embeddings for both movies and links.

In this notebook, we will implement the steps required to extract data from Wikipedia dumps. **NOTE**: The final dump is around 13 GB of data and takes a LOT of time to collect, so in the next notebook, which focuses on building the actual models, we'll use a subset of the top 10,000 movies.

## Before starting...

Let's import the libraries we will use:

In [1]:
import inspect
import os
from helpers import *

Using TensorFlow backend.


## Collecting the data

Let's list the available Wikipedia dumps programmatically:

In [2]:
print(inspect.getsource(get_wikipedia_dumps))

def get_wikipedia_dumps():
    index = requests.get('https://dumps.wikimedia.org/enwiki/').text
    soup_index = BeautifulSoup(index, 'html.parser')
    dumps = [a['href'] for a in soup_index.find_all('a')
             if a.has_attr('href') and a.text[:-1].isdigit()]

    return dumps



In [3]:
dumps = get_wikipedia_dumps()
print(dumps)

['20180701/', '20180720/', '20180801/', '20180820/', '20180901/', '20180920/', '20181001/']


Good. Now we'll go through this list to find the newest dump that has actually finished processing and then fetch the dump. This will take many hours!

In [4]:
print(inspect.getsource(download_dump))

def download_dump(dumps):
    for dump_url in sorted(dumps, reverse=True):
        print(dump_url)
        dump_html = requests.get(f'https://dumps.wikimedia.org/enwiki/{dump_url}').text
        soup_dump = BeautifulSoup(dump_html, 'html.parser')
        pages_xml = [a['href'] for a in soup_dump.find_all('a')
                     if a.has_attr('href') and a['href'].endswith('-pages-articles.xml.bz2')]

        if pages_xml:
            break

        time.sleep(1)  # Must wait so Wikipedia does not kick us.

    wikipedia_dump = pages_xml[0].rsplit('/')[-1]
    url = f'https://dumps.wikimedia.org/{pages_xml[0]}'
    path = get_file(wikipedia_dump, url)

    return path



In [5]:
path = download_dump(dumps)

20181001/
Downloading data from https://dumps.wikimedia.org//enwiki/20181001/enwiki-20181001-pages-articles.xml.bz2
   86147072/15480471011 [..............................] - ETA: 2:04:37

KeyboardInterrupt: 

The dump we fetched is a bz2-compressed XML file. To parse it we will use Python's `xml.sax`. We are only interested in the `<title>` and `<text>` elements inside each `<page>`, so we will use this `ContentHandler`:

In [6]:
print(inspect.getsource(WikiXmlHandler))

What this class does is collect the contents of the title and the text of each `<page>` tag into `self._values`. Then, it calls `process_article` with these values.
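As an illustration of the pattern described above, here is a minimal sketch of a SAX handler that collects the title and text of each page and hands them to a callback. `TitleTextHandler` and the sample XML are hypothetical and are not the `WikiXmlHandler` from `helpers`; the real class may differ in its details.

```python
import xml.sax


class TitleTextHandler(xml.sax.ContentHandler):
    """Hypothetical stand-in: collects the <title> and <text> of each <page>."""

    def __init__(self, process_article):
        super().__init__()
        self._process_article = process_article
        self._values = {}
        self._current_tag = None
        self._buffer = []

    def startElement(self, name, attrs):
        # Start buffering only for the two elements we care about.
        if name in ('title', 'text'):
            self._current_tag = name
            self._buffer = []

    def characters(self, content):
        # SAX may deliver the text of one element in several pieces.
        if self._current_tag:
            self._buffer.append(content)

    def endElement(self, name):
        if name == self._current_tag:
            self._values[name] = ''.join(self._buffer)
            self._current_tag = None
        if name == 'page':
            # A page is complete: pass on what was collected and reset.
            self._process_article(self._values.get('title'),
                                  self._values.get('text'))
            self._values = {}


# Made-up sample that mimics the nesting of a dump page.
sample = ('<mediawiki><page><title>Example Film</title>'
          '<revision><text>{{Infobox film}}</text></revision>'
          '</page></mediawiki>')

articles = []
xml.sax.parseString(
    sample.encode('utf-8'),
    TitleTextHandler(lambda title, text: articles.append((title, text))))
print(articles)
```

Buffering the pieces in a list and joining them at `endElement` matters for real dumps, where a single article body usually arrives through many `characters` calls.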

Wikipedia pages use a concept called **templates** to make sure that similar pages contain similar information and that it is rendered in the same way. For instance, if you visit the Wikipedia pages of two cities or countries, you'll find a box on the right with information about population, landmarks and so on. The same holds for movie articles. These pages contain an **infobox** template of type **film** that we can use to obtain the name, the outgoing links and the properties stored in the box for each movie.

We use `mwparserfromhell` to parse the wikitext for this task.

We create the content handler and parser with the following function:

In [7]:
print(inspect.getsource(create_wiki_xml_parser_and_handler))

def create_wiki_xml_parser_and_handler():
    parser = xml.sax.make_parser()
    handler = WikiXmlHandler()
    parser.setContentHandler(handler)

    return parser, handler



In [None]:
parser, handler = create_wiki_xml_parser_and_handler()

Let's actually parse the dumps:

In [8]:
print(inspect.getsource(parse_dumps))

def parse_dumps(parser, dumps_path):
    for line in subprocess.Popen(['bzcat'], stdin=open(dumps_path), stdout=subprocess.PIPE).stdout:
        try:
            parser.feed(line)
        except StopIteration:
            break



In [None]:
parse_dumps(parser, path)
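`parse_dumps` relies on the external `bzcat` command. On systems where it is not available, a similar streaming approach can be sketched with Python's standard `bz2` module. This is an alternative technique, not the notebook's implementation; `parse_dumps_with_bz2`, `TitleCollector` and the tiny demo file are hypothetical.

```python
import bz2
import os
import tempfile
import xml.sax


def parse_dumps_with_bz2(parser, dumps_path, chunk_size=1 << 20):
    """Decompress the dump chunk by chunk and feed it to a SAX parser."""
    with bz2.open(dumps_path, 'rb') as compressed:
        while True:
            chunk = compressed.read(chunk_size)
            if not chunk:
                break
            parser.feed(chunk)
    parser.close()


class TitleCollector(xml.sax.ContentHandler):
    """Tiny demo handler that only records page titles."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_title = False
        self._buffer = []

    def startElement(self, name, attrs):
        if name == 'title':
            self._in_title = True
            self._buffer = []

    def characters(self, content):
        if self._in_title:
            self._buffer.append(content)

    def endElement(self, name):
        if name == 'title':
            self.titles.append(''.join(self._buffer))
            self._in_title = False


# Demo on a made-up, two-page "dump" written to a temporary file.
titles_handler = TitleCollector()
demo_parser = xml.sax.make_parser()
demo_parser.setContentHandler(titles_handler)

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'demo.xml.bz2')
    with open(demo_path, 'wb') as fout:
        fout.write(bz2.compress(
            b'<mediawiki><page><title>A</title></page>'
            b'<page><title>B</title></page></mediawiki>'))
    # A small chunk size forces several feed() calls even on this tiny file.
    parse_dumps_with_bz2(demo_parser, demo_path, chunk_size=16)

print(titles_handler.titles)
```

Either way, only a small part of the decompressed XML is held in memory at a time, which is what makes parsing a multi-gigabyte dump feasible.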

And, finally, write results to disk:

In [9]:
print(inspect.getsource(write_movies))

def write_movies(handler, output_path='generated/wp_movies.ndjson'):
    with open(output_path, 'wt') as fout:
        for movie in handler._movies:
            fout.write(f'{json.dumps(movie)}\n')



In [None]:
os.makedirs('generated', exist_ok=True)

write_movies(handler)
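The output file holds one JSON document per line (newline-delimited JSON), so it can be read back line by line. The following sketch shows a matching reader; `read_movies` is a hypothetical helper, and the sample records are invented, since the actual shape of each movie record depends on what the handler stores.

```python
import json
import os
import tempfile


def read_movies(input_path):
    """Load an NDJSON file: one JSON document per non-empty line."""
    with open(input_path, 'rt') as fin:
        return [json.loads(line) for line in fin if line.strip()]


# Invented records, only to demonstrate the round trip.
sample_movies = [
    ['Example Film', {'director': 'Jane Doe'}, ['Jane Doe', 'John Roe']],
    ['Another Film', {'director': 'John Roe'}, ['John Roe']],
]

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, 'wp_movies.ndjson')
    # Written in the same one-document-per-line format as write_movies.
    with open(sample_path, 'wt') as fout:
        for movie in sample_movies:
            fout.write(f'{json.dumps(movie)}\n')
    loaded_movies = read_movies(sample_path)

print(len(loaded_movies))
```

Because every line is an independent document, the file can also be processed lazily or truncated to a subset without re-parsing everything.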