In [1]:
%load_ext autoreload
%autoreload 2

# Pre-Processing Wikipedia data
This notebook is a workflow for pre-processing the wikipedia data. This includes cleaning the data before it is used further. Shortening the texts based on some criteria and finally saving the data in a format that can be used by the labeling program and training of the model.

In [1]:
import preprocessor
from preprocessor import Preprocessor
import os
import spacy
import pandas as pd
from datetime import date

ROOT_DIR = preprocessor.ROOT_DIR
DATA_PATH = preprocessor.DATA_PATH

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Beheerder\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


## 1. Importing and filtering the UNESCO WORLD HERITAGE sites
### 1.1 Subset 1: Spacy embedding filtering
#### 1.1.1 Importing the UNESCO data
The first subset of the entire Wikipedia dataset will consists of articles for which we are positive that they contain UNESCO World Heritage Sites
We start he preprocessing with retrieving the unesco world heritage sites from a dataset and filtering these sites in the large wikipedia dataset. 

The data is read from a csv file, followed by making a set of all the names in this column with english names. This is done for further processing using the set of names in stead of a dataframe.

To see what kind of data we are using, a head of the dataframe is printed.

In [2]:
preprocessor = Preprocessor(ROOT_DIR)

df_unesco = pd.read_csv(os.path.join(DATA_PATH, "unesco_names.csv"), header=0, names=["landmark_name"])
landmark_names = set(df_unesco["landmark_name"].to_list())

df_unesco.head()

Unnamed: 0,landmark_name
0,Cultural Landscape and Archaeological Remains ...
1,Minaret and Archaeological Remains of Jam
2,Historic Centres of Berat and Gjirokastra
3,Butrint
4,Al Qal'a of Beni Hammad


#### 1.1.2 Embedding the UNESCO data
For filtering of the wikipedia pages based on the landmark names, the landmark names need to be embedded such that we can do a similarity search. This is needed since the titels of the wikipedia pages are rarely exactly like the landmark names described in the UNESCO dataset.

For embedding the names, we are using spacy.

In [3]:
landmark_embeddings = []

for landmark_name in landmark_names:
    landmark_embeddings.append(preprocessor.ner_spacy(landmark_name))

print("Number of embedded landmarks: ", len(landmark_embeddings))

Number of embedded landmarks:  1157


#### 1.1.3 Subsetting the Wikipedia Dataset
Comparisson with embeddings takes a lot of time and is hence not feasible for the entire Wikipedia dataset. Hence, we take a subset of the dataset by filtering on Wikipedia articles that occur in (parts of) the UNESCO landmarks list. This is done using the `process_folders` function, which can either be used to filter on title or on landmark names. Based on the variable `title_based`, it decides whether to run `process_file_regex` or `process_file_nlp` respectively.


In [4]:
folders = ["AA", "AB"]
# Title for title filtering, trailing space is important for filtering
title = "UNESCO World Heritage Site "

page_list = preprocessor.process_folders(folders = folders, landmarks = landmark_names, debug = True, title = title, title_based = False)
preprocessor.clear_dictionary()

1/100 - Started processing 'wiki_00' in folder 'AA'
2/100 - Started processing 'wiki_01' in folder 'AA'
3/100 - Started processing 'wiki_02' in folder 'AA'
4/100 - Started processing 'wiki_03' in folder 'AA'
5/100 - Started processing 'wiki_04' in folder 'AA'
6/100 - Started processing 'wiki_05' in folder 'AA'
7/100 - Started processing 'wiki_06' in folder 'AA'
8/100 - Started processing 'wiki_07' in folder 'AA'
9/100 - Started processing 'wiki_08' in folder 'AA'
10/100 - Started processing 'wiki_09' in folder 'AA'
11/100 - Started processing 'wiki_10' in folder 'AA'
12/100 - Started processing 'wiki_11' in folder 'AA'
13/100 - Started processing 'wiki_12' in folder 'AA'
14/100 - Started processing 'wiki_13' in folder 'AA'
15/100 - Started processing 'wiki_14' in folder 'AA'
16/100 - Started processing 'wiki_15' in folder 'AA'


KeyboardInterrupt: 

#### 1.1.4 Export Wikipedia subset
The resulting subset has 5160 entries, which is more feasible for Spacy embedding filtering. The next step is to save this dataset to a JSON file which can be used later on in the workflow. This is done using the `writeFile` function of the `Preprocessor` class.

In [6]:
preprocessor.writeFile(page_list, f"unesco_first_subset.json")

#### 1.1.5 Embedding filtering on the Wikipedia subset
The next step is to process the subset by filtering based on title and UNESCO landmarks similarity. All article titles are embedded just like the landmarks. A cosine similarity comparrison between the embeddings then indicates whether to include the article. This is done if the similarity is at least `0.97`. The function which does this is `process_file_nlp`.

In [None]:
file_path = os.path.join(DATA_PATH, "unesco_first_subset.json")
results = preprocessor.process_file_nlp(file_path, landmark_embeddings)

#### 1.1.6 Exporting filtered data
To be able to use the texts for further cleaning, the current filtered data is being stored as a json file to the disk.

In [None]:
preprocessor.writeFile(results, f"unesco_wikipedia_pages.json")

#### 1.2.1 Filtering the Wikipedia dataset
The second subset is filtered using the `process_file_title` function, which is called in the `process_folders` function if the parameter `title_based` is set to True. The `process_file_title` function checks whether the string "UNESCO World Heritage Site " (trailing space to prevent plural versions) occurs in the article text. If so, the article is included in the subset. 

In [None]:
folders = ["AA", "AB"]
# Title for title filtering, trailing space is important for filtering
title = "UNESCO World Heritage Site "

page_list = preprocessor.process_folders(folders = folders, landmarks = landmark_names, debug = True, title = title, title_based = True)
preprocessor.clear_dictionary()

#### 1.2.2 Exporting the second subset
The final step for the second subset is to export it to a JSON file so that it can be used later on.

In [7]:
preprocessor.writeFile(page_list, f"unesco_wikipedia_titles.json")

## 2. Clearning the wikipedia pages

### 2.1 Shortening the wikipedia pages
For the main purpose of the task, namely converting information from text to a knowledge graph, most important information of a heritage site is stored in the first few paragraphs. Therefore, we will first shorten all the texts to the first 2 paragraphs and with a minimum of 500 words (average paragraph length is 250 words).

We start of by defining a function that can shorten a text to 2 paragraphs. This is done by splitting the text on the newline character and then joining the first two paragraphs together.

In [8]:
unesco_wikipedia_pages = preprocessor.loadFile(f"unesco_wikipedia_pages.json")

In [9]:
for page in unesco_wikipedia_pages:
    splitted_text = page["text"].split("\n")

    # Get the lengths of every split and get the first index of the split where the total number is higher than 500
    total_length = 0
    for index, split in enumerate(splitted_text):
        total_length += len(split)
        if total_length > 500:
            page["text"] = ''.join(splitted_text[:index+1])
            break

At this point the unesco_wikipedia_pages are split per new line. The average paragraph length is 250 words. Therefore, we will shorten the texts with a minimum of 500 words.

### 2.2 Removing unwanted characters and fixing unicode

In [10]:
unesco_wikipedia_pages_clean = preprocessor.fix_unicode(unesco_wikipedia_pages)

Cleaning data with ftfy...


100%|██████████| 4/4 [00:00<00:00, 1996.81it/s]


Cleaning data with given regex...


100%|██████████| 4/4 [00:00<?, ?it/s]


An example of the cleaned text is shown below:

In [11]:
unesco_wikipedia_pages_clean[0]

{'id': '2750841',
 'revid': '182902',
 'url': 'https://en.wikipedia.org/wiki?curid=2750841',
 'title': 'Alejandro de Humboldt National Park',
 'text': 'Alejandro de Humboldt National Park is a national park in the Cuban provinces of Holguín and Guantánamo It is named after the German scientist Alexander von Humboldt who visited the island in 1800 and 1801 The park was inscribed as a UNESCO World Heritage Site in 2001 for of its size altitude range complex lithology landform diversity and wealth of endemic flora and fauna Geography The rivers that flow off the peaks of the park are some of the largest in the insular Caribbean The park is said to be the most humid place in Cuba and this causes a high biological diversity The park has an area of of which land area and marine area Elevation ranges from sea level to on El Toldo Peak',
 'original_text': 'Alejandro de Humboldt National Park () is a national park in the Cuban provinces of Holguín and Guantánamo. It is named after the German sc

### 2.3 Export the cleaned data in separate json files
There is a export needed of separate json file for the annotation program.

In [12]:
preprocessor.save_file(unesco_wikipedia_pages_clean, "subset_texts")