# Extract Wikipedia articles

This notebook fetches the Wikipedia articles for the entities found in Wikinews and stores them in the `wikipedia-entities.json` file in the `data/wikiphrase` folder.

In [1]:
import os
import pandas as pd
import wikipedia
import sys
import datetime
from tqdm import tqdm
from wikipedia.exceptions import PageError, DisambiguationError

sys.path.append(os.path.join('..', '..'))

In [2]:
data_path = os.path.join('..', '..', 'data', 'wikiphrase')
df_entities = pd.read_json(os.path.join(data_path, 'wikinews-entities.json'))

## Wikipedia fetching function

Two main errors can occur while fetching information from Wikipedia. A `PageError` occurs when no article can be found for the given entity. A `DisambiguationError` occurs when multiple Wikipedia pages match the given entity. The function handles the latter by picking the first suggested option and flagging the result as ambiguous.

In [3]:
def fetch_wikipedia_article(entity, is_ambiguous=False):
    # Try to fetch the Wikipedia article (and a short summary of it)
    try:
        wikipedia_short_summary = wikipedia.summary(entity, sentences=2)
        wikipedia_summary = wikipedia.summary(entity)
        return wikipedia_short_summary, wikipedia_summary, is_ambiguous
    except PageError:
        # Re-raise the PageError when no article exists for the entity
        raise
    except DisambiguationError as error:
        # Go for the first option
        return fetch_wikipedia_article(error.options[0], True)
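The recursive fallback above assumes that the first option always resolves to a page. As a minimal sketch of a more defensive variant, the helper below follows the first option at most `max_depth` times and then gives up. The names `resolve_summary`, `summary_fn`, `max_depth`, and `fake_summary` are hypothetical, and the two exception classes are local stand-ins for the ones in `wikipedia.exceptions` so that the example runs offline.

```python
class PageError(Exception):
    """Stand-in for wikipedia.exceptions.PageError (hypothetical, offline use)."""


class DisambiguationError(Exception):
    """Stand-in for wikipedia.exceptions.DisambiguationError (hypothetical)."""

    def __init__(self, title, options):
        super().__init__(title)
        self.options = options


def resolve_summary(entity, summary_fn, max_depth=3):
    """Return (summary, resolved_title, is_ambiguous) for `entity`.

    `summary_fn` is any callable that takes a title and returns its summary,
    raising PageError or DisambiguationError when the lookup fails.
    """
    title, is_ambiguous = entity, False
    for _ in range(max_depth + 1):
        try:
            return summary_fn(title), title, is_ambiguous
        except DisambiguationError as error:
            if not error.options:
                break
            # Follow the first suggested option, as the notebook does
            title, is_ambiguous = error.options[0], True
    raise PageError('Could not resolve {!r} within {} hops'.format(entity, max_depth))


# Offline demonstration with a fake lookup instead of the live API
def fake_summary(title):
    pages = {'Mercury (planet)': 'Mercury is the smallest planet.'}
    if title == 'Mercury':
        raise DisambiguationError(title, ['Mercury (planet)', 'Mercury (element)'])
    if title not in pages:
        raise PageError(title)
    return pages[title]


result = resolve_summary('Mercury', fake_summary)
print(result)
```

Returning the resolved title alongside the summary is a design choice of this sketch: it records which page was actually used when the entity name was ambiguous.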

## Wikipedia fetch and store

The next section fetches the data from Wikipedia and stores it in the `data/wikiphrase/wikipedia-entities.json` file. For the Wikiphrase entities, this took roughly 30 to 40 minutes (38:35 for 2,314 entities in the recorded run).

In [4]:
wikipedia_entries = []

entities = tqdm(df_entities.entity.unique())

# Loop through all unique entities
for entity in entities:
    # Update the progressbar
    entities.set_description('Processing {}...'.format(entity))
    
    # Try to fetch the Wikipedia article
    try:
        wikipedia_short_summary, wikipedia_summary, is_ambiguous = fetch_wikipedia_article(entity)
    except PageError:
        continue
    
    # When successful, add it to the list of entries
    wikipedia_entries.append({
        'entity': entity,
        'wikipedia_text_short': wikipedia_short_summary,
        'wikipedia_text': wikipedia_summary,
        'is_ambiguous': is_ambiguous,
        'added_at': datetime.datetime.now()
    })



Processing Kathleen Kane...: 100%|██████████| 2314/2314 [38:35<00:00,  1.03s/it]
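The loop above silently drops entities that have no article. As a sketch of one way to keep track of them, the helper below returns the skipped names next to the fetched entries. `collect_entries`, `fetch_fn`, `missing_error`, and `fake_fetch` are hypothetical names; `fetch_fn` stands in for `fetch_wikipedia_article`, and `LookupError` stands in for `PageError` so that the example runs without network access. The `added_at` timestamp is left out for brevity.

```python
def collect_entries(entities, fetch_fn, missing_error=LookupError):
    """Split `entities` into fetched entries and names without an article.

    `fetch_fn` must return (short_summary, summary, is_ambiguous) and raise
    `missing_error` when no article exists for the entity.
    """
    entries, skipped = [], []
    for entity in entities:
        try:
            short_summary, summary, is_ambiguous = fetch_fn(entity)
        except missing_error:
            # Remember the entity instead of dropping it silently
            skipped.append(entity)
            continue
        entries.append({
            'entity': entity,
            'wikipedia_text_short': short_summary,
            'wikipedia_text': summary,
            'is_ambiguous': is_ambiguous,
        })
    return entries, skipped


# Offline demonstration with placeholder texts instead of real summaries
def fake_fetch(entity):
    if entity == 'Nonexistent Entity':
        raise LookupError(entity)
    return 'Short text.', 'Longer text.', False


entries, skipped = collect_entries(['Kathleen Kane', 'Nonexistent Entity'], fake_fetch)
print(len(entries), skipped)
```

With the notebook's own function, the call would presumably look like `collect_entries(df_entities.entity.unique(), fetch_wikipedia_article, PageError)`.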


In [5]:
# Build a DataFrame from the collected entries and write it to JSON
df_wiki = pd.DataFrame(wikipedia_entries)
df_wiki.added_at = pd.to_datetime(df_wiki.added_at)
df_wiki.to_json(os.path.join(data_path, 'wikipedia-entities.json'))

## Conclusion

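This notebook creates the `wikipedia-entities.json` file, which contains Wikipedia summaries for the entities in the Wikiphrase dataset that have an article. Entities that were resolved through a disambiguation page are included and flagged with `is_ambiguous`.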