In [1]:
%load_ext autoreload
%autoreload 2

# Pre-Processing Wikipedia data
This notebook is a workflow for pre-processing the wikipedia data. This includes cleaning the data before it is used further. Shortening the texts based on some criteria and finally saving the data in a format that can be used by the labeling program and training of the model.

In [2]:
import preprocessor
from preprocessor import Preprocessor
import os
import spacy
import pandas as pd
from datetime import date

ROOT_DIR = preprocessor.ROOT_DIR
DATA_PATH = preprocessor.DATA_PATH

preprocessor = Preprocessor(ROOT_DIR)

## 1. Importing and filtering the UNESCO WORLD HERITAGE sites
### 1.1 Importing the UNESCO data
We start he preprocessing with retrieving the unesco world heritage sites from a dataset and filtering these sites in the large wikipedia dataset. 

The data is read from a csv file, followed by making a set of all the names in this column with english names. This is done for further processing using the set of names in stead of a dataframe.

To see what kind of data we are using, a head of the dataframe is printed.

In [3]:
df_unesco = pd.read_csv(os.path.join(DATA_PATH, "unesco_names.csv"), header=0, names=["landmark_name"])
landmark_names = set(df_unesco["landmark_name"].to_list())

df_unesco.head()

Unnamed: 0,landmark_name
0,Cultural Landscape and Archaeological Remains ...
1,Minaret and Archaeological Remains of Jam
2,Historic Centres of Berat and Gjirokastra
3,Butrint
4,Al Qal'a of Beni Hammad


### 1.2 Embedding the UNESCO data
For filtering of the wikipedia pages based on the landmark names, the landmark names need to be embedded such that we can do a similarity search. This is needed since the titels of the wikipedia pages are rarely exactly like the landmark names described in the UNESCO dataset.

For embedding the names, we are using spacy.

In [4]:
landmark_embeddings = []

for landmark_name in landmark_names:
    landmark_embeddings.append(preprocessor.ner_spacy(landmark_name))

print("Number of embedded landmarks: ", len(landmark_embeddings))

Number of embedded landmarks:  1157


### 1.3 Filtering the wikipedia pages
Now that we have the embedding for all the landmark names, it is important to see if we can match this embedding to the embedding of the wikipedia page titles. 

To be able to do this we need two functions. The data is stored in multiple files in two different directories. We can process the data per directory and read each file to get the data for each webpage. The function `process_file` is used for this. The function `process_folder` is used to process all the files in a directory using the `process_file` function.


In [5]:
folders = ["AA", "AB"]
# Title for title filtering, trailing space is important for filtering
title = "UNESCO World Heritage Site "
page_list = []

for folder in folders:
    page_list.append(preprocessor.process_folder(folder = folder, landmark_embeddings = landmark_embeddings, debug=False, title = title, nlp = True))

print(f"\nThere are \x1b[32m{len(page_list)}\x1b[0m wikipedia pages found with a title similar to a landmark name")

  similarity_score = title_embedding.similarity(landmark)
Processing 'test_set' in folder 'tests\data\TT': 100%|██████████| 1/1 [00:03<00:00,  3.23s/it]


There are [32m4[0m wikipedia pages found with a title similar to a landmark name





Every wikipedia text has the following raw format:

In [6]:
page_list[:3]

[{'id': '2750841',
  'revid': '182902',
  'url': 'https://en.wikipedia.org/wiki?curid=2750841',
  'title': 'Alejandro de Humboldt National Park',
  'text': 'Alejandro de Humboldt National Park () is a national park in the Cuban provinces of Holguín and Guantánamo. It is named after the German scientist Alexander von Humboldt who visited the island in 1800 and 1801. The park was inscribed as a UNESCO World Heritage Site in 2001 for of its size, altitude range, complex lithology, landform diversity, and wealth of endemic flora and fauna.\nGeography.\nThe rivers that flow off the peaks of the park are some of the largest in the insular Caribbean. The park is said to be the most humid place in Cuba and this causes a high biological diversity. The park has an area of , of which land area and marine area. Elevation ranges from sea level to on "El Toldo" Peak.\nThe region around Alejandro de Humboldt National Park is geologically complex, containing karst landscapes that originated from ocean

### 1.4 Exporting filtered data
To be able to use the texts for further cleaning, the current filtered data is being stored as a json file to the disk.

In [7]:
preprocessor.writeFile(page_list, f"unesco_wikipedia_pages_{date.today()}.json")

## 2. Clearning the wikipedia pages

### 2.1 Shortening the wikipedia pages
For the main purpose of the task, namely converting information from text to a knowledge graph, most important information of a heritage site is stored in the first few paragraphs. Therefore, we will first shorten all the texts to the first 2 paragraphs and with a minimum of 500 words (average paragraph length is 250 words).

We start of by defining a function that can shorten a text to 2 paragraphs. This is done by splitting the text on the newline character and then joining the first two paragraphs together.

In [8]:
unesco_wikipedia_pages = preprocessor.loadFile(f"unesco_wikipedia_pages.json")

In [9]:
for page in unesco_wikipedia_pages:
    splitted_text = page["text"].split("\n")

    # Get the lengths of every split and get the first index of the split where the total number is higher than 500
    total_length = 0
    for index, split in enumerate(splitted_text):
        total_length += len(split)
        if total_length > 500:
            page["text"] = ''.join(splitted_text[:index+1])
            break

At this point the unesco_wikipedia_pages are split per new line. The average paragraph length is 250 words. Therefore, we will shorten the texts with a minimum of 500 words.

### 2.2 Removing unwanted characters and fixing unicode

In [10]:
unesco_wikipedia_pages_clean = preprocessor.fix_unicode(unesco_wikipedia_pages)

Cleaning data with ftfy...


100%|██████████| 4/4 [00:00<00:00, 1996.81it/s]


Cleaning data with given regex...


100%|██████████| 4/4 [00:00<?, ?it/s]


An example of the cleaned text is shown below:

In [11]:
unesco_wikipedia_pages_clean[0]

{'id': '2750841',
 'revid': '182902',
 'url': 'https://en.wikipedia.org/wiki?curid=2750841',
 'title': 'Alejandro de Humboldt National Park',
 'text': 'Alejandro de Humboldt National Park is a national park in the Cuban provinces of Holguín and Guantánamo It is named after the German scientist Alexander von Humboldt who visited the island in 1800 and 1801 The park was inscribed as a UNESCO World Heritage Site in 2001 for of its size altitude range complex lithology landform diversity and wealth of endemic flora and fauna Geography The rivers that flow off the peaks of the park are some of the largest in the insular Caribbean The park is said to be the most humid place in Cuba and this causes a high biological diversity The park has an area of of which land area and marine area Elevation ranges from sea level to on El Toldo Peak',
 'original_text': 'Alejandro de Humboldt National Park () is a national park in the Cuban provinces of Holguín and Guantánamo. It is named after the German sc

### 2.3 Export the cleaned data in separate json files
There is a export needed of separate json file for the annotation program.

In [12]:
preprocessor.save_file(unesco_wikipedia_pages_clean, "subset_texts")

3 Calculating precision and recall

In [1]:
import requests

page = requests.get("https://en.wikipedia.org/wiki/List_of_World_Heritage_Sites_by_year_of_inscription")



In [2]:
print(page.content)

b'<!DOCTYPE html>\n<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-zebra-design-disabled vector-feature-custom-font-size-clientpref-disabled vector-feature-client-preferences-disabled vector-feature-typography-survey-disabled vector-toc-available" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8">\n<title>List of World Heritage Sites by year of inscription - Wikipedia</title>\n<script>(function(){var className="client-js vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-

In [4]:
from bs4 import BeautifulSoup

soup = BeautifulSoup(page.content, 'html.parser')
print(soup.prettify())

<!DOCTYPE html>
<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-zebra-design-disabled vector-feature-custom-font-size-clientpref-disabled vector-feature-client-preferences-disabled vector-feature-typography-survey-disabled vector-toc-available" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   List of World Heritage Sites by year of inscription - Wikipedia
  </title>
  <script>
   (function(){var className="client-js vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-featur

In [5]:
print(soup.find_all('table'))


[<table class="wikitable sortable" style="font-size:100%;">
<tbody><tr>
<th scope="col" width="175">Country
</th>
<th scope="col" width="700">Site
</th>
<th scope="col" width="130">Category
</th>
<th scope="col" width="130">UNESCO Reference no.
</th></tr>
<tr>
<td rowspan="2"><span class="flagicon"><span class="mw-image-border" typeof="mw:File"><span><img alt="" class="mw-file-element" data-file-height="600" data-file-width="1200" decoding="async" height="12" src="//upload.wikimedia.org/wikipedia/commons/thumb/d/d9/Flag_of_Canada_%28Pantone%29.svg/23px-Flag_of_Canada_%28Pantone%29.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/d/d9/Flag_of_Canada_%28Pantone%29.svg/35px-Flag_of_Canada_%28Pantone%29.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/d/d9/Flag_of_Canada_%28Pantone%29.svg/46px-Flag_of_Canada_%28Pantone%29.svg.png 2x" width="23"/></span></span> </span><a href="/wiki/Canada" title="Canada">Canada</a></td>
<td><a href="/wiki/L%27Anse_aux_Meadows" ti