In [1]:
%load_ext autoreload
%autoreload 2

# Pre-Processing Wikipedia data
This notebook is a workflow for pre-processing the wikipedia data. This includes cleaning the data before it is used further. Shortening the texts based on some criteria and finally saving the data in a format that can be used by the labeling program and training of the model.

In [1]:
import preprocessor
from preprocessor import Preprocessor
import os
import pandas as pd

ROOT_DIR = preprocessor.ROOT_DIR
DATA_PATH = preprocessor.DATA_PATH

Root directory: d:\TUe\2AMM30 - Text Mining\Text-Mining


## 1. Importing and filtering the UNESCO WORLD HERITAGE sites
### 1.1 Subset 1: Spacy embedding filtering
#### 1.1.1 Importing the UNESCO data
The first subset of the entire Wikipedia dataset will consists of articles for which we are positive that they contain UNESCO World Heritage Sites
We start he preprocessing with retrieving the unesco world heritage sites from a dataset and filtering these sites in the large wikipedia dataset. 

The data is read from a csv file, followed by making a set of all the names in this column with english names. This is done for further processing using the set of names in stead of a dataframe.

To see what kind of data we are using, a head of the dataframe is printed.

In [3]:
preprocessor = Preprocessor(ROOT_DIR)

df_unesco = pd.read_csv(os.path.join(DATA_PATH, "component_1", "unesco_names.csv"), header=0, names=["landmark_name"])
landmark_names = set(df_unesco["landmark_name"].to_list())

df_unesco.head()



Unnamed: 0,landmark_name
0,Cultural Landscape and Archaeological Remains ...
1,Minaret and Archaeological Remains of Jam
2,Historic Centres of Berat and Gjirokastra
3,Butrint
4,Al Qal'a of Beni Hammad


#### 1.1.2 Embedding the UNESCO data
For filtering of the wikipedia pages based on the landmark names, the landmark names need to be embedded such that we can do a similarity search. This is needed since the titels of the wikipedia pages are rarely exactly like the landmark names described in the UNESCO dataset.

For embedding the names, we are using spacy.

In [4]:
landmark_embeddings = []

for landmark_name in landmark_names:
    landmark_embeddings.append(preprocessor.ner_spacy(landmark_name))

print("Number of embedded landmarks: ", len(landmark_embeddings))

Number of embedded landmarks:  1157


#### 1.1.3 Subsetting the Wikipedia Dataset
Comparisson with embeddings takes a lot of time and is hence not feasible for the entire Wikipedia dataset. Hence, we take a subset of the dataset by filtering on Wikipedia articles that occur in (parts of) the UNESCO landmarks list. This is done using the `process_folders` function, which can either be used to filter on title or on landmark names. Based on the variable `title_based`, it decides whether to run `process_file_regex` or `process_file_nlp` respectively.


In [6]:
folders = [os.path.join("component_1","AA"), os.path.join("component_1","AB")]

# Title for title filtering, trailing space is important for filtering
title = "UNESCO World Heritage Site "

page_list = preprocessor.process_folders(folders = folders, landmarks = landmark_names, debug = True, title = title, title_based = False)

1/100 - Started processing 'wiki_00' in folder 'component_1\AA'
2/100 - Started processing 'wiki_01' in folder 'component_1\AA'
3/100 - Started processing 'wiki_02' in folder 'component_1\AA'
4/100 - Started processing 'wiki_03' in folder 'component_1\AA'
5/100 - Started processing 'wiki_04' in folder 'component_1\AA'
6/100 - Started processing 'wiki_05' in folder 'component_1\AA'
7/100 - Started processing 'wiki_06' in folder 'component_1\AA'
8/100 - Started processing 'wiki_07' in folder 'component_1\AA'
9/100 - Started processing 'wiki_08' in folder 'component_1\AA'
10/100 - Started processing 'wiki_09' in folder 'component_1\AA'
11/100 - Started processing 'wiki_10' in folder 'component_1\AA'
12/100 - Started processing 'wiki_11' in folder 'component_1\AA'
13/100 - Started processing 'wiki_12' in folder 'component_1\AA'
14/100 - Started processing 'wiki_13' in folder 'component_1\AA'
15/100 - Started processing 'wiki_14' in folder 'component_1\AA'
16/100 - Started processing 'wiki_

#### 1.1.4 Export Wikipedia subset
The resulting subset has 5160 entries, which is more feasible for Spacy embedding filtering. The next step is to save this dataset to a JSON file which can be used later on in the workflow. This is done using the `writeFile` function of the `Preprocessor` class.

In [7]:
preprocessor.writeFile(page_list, f"component_1/unesco_first_subset.json")

#### 1.1.5 Embedding filtering on the Wikipedia subset
The next step is to process the subset by filtering based on title and UNESCO landmarks similarity. All article titles are embedded just like the landmarks. A cosine similarity comparrison between the embeddings then indicates whether to include the article. This is done if the similarity is at least `0.97`. The function which does this is `process_file_nlp`.

In [8]:
file_path = os.path.join(DATA_PATH, "component_1", "unesco_first_subset.json")
results = preprocessor.process_file_nlp(file_path, landmark_embeddings, similarity_threshold = 0.97)
print(f"Results found: {len(results)}")

  similarity_score = title_embedding.similarity(landmark)
100%|██████████| 5160/5160 [01:50<00:00, 46.63it/s]


Results found: 373


#### 1.1.6 Exporting filtered data
To be able to use the texts for further cleaning, the current filtered data is being stored as a json file to the disk.

In [9]:
preprocessor.writeFile(results, f"component_1/unesco_wikipedia_titles.json")

## 1.2 Subset 2: Title filtering
#### 1.2.1 Filtering the Wikipedia dataset
The second subset is filtered using the `process_file_title` function, which is called in the `process_folders` function if the parameter `title_based` is set to True. The `process_file_title` function checks whether the string "UNESCO World Heritage Site " (trailing space to prevent plural versions) occurs in the article text. If so, the article is included in the subset. 

In [10]:
folders = [os.path.join("component_1","AA"), os.path.join("component_1","AB")]
# Title for title filtering, trailing space is important for filtering
title = "UNESCO World Heritage Site "

page_list = preprocessor.process_folders(folders = folders, landmarks = landmark_names, debug = True, title = title, title_based = True)

1/100 - Started processing 'wiki_00' in folder 'component_1\AA'
2/100 - Started processing 'wiki_01' in folder 'component_1\AA'
3/100 - Started processing 'wiki_02' in folder 'component_1\AA'
4/100 - Started processing 'wiki_03' in folder 'component_1\AA'
5/100 - Started processing 'wiki_04' in folder 'component_1\AA'
6/100 - Started processing 'wiki_05' in folder 'component_1\AA'
7/100 - Started processing 'wiki_06' in folder 'component_1\AA'
8/100 - Started processing 'wiki_07' in folder 'component_1\AA'
9/100 - Started processing 'wiki_08' in folder 'component_1\AA'
10/100 - Started processing 'wiki_09' in folder 'component_1\AA'
11/100 - Started processing 'wiki_10' in folder 'component_1\AA'
12/100 - Started processing 'wiki_11' in folder 'component_1\AA'
13/100 - Started processing 'wiki_12' in folder 'component_1\AA'
14/100 - Started processing 'wiki_13' in folder 'component_1\AA'
15/100 - Started processing 'wiki_14' in folder 'component_1\AA'
16/100 - Started processing 'wiki_

#### 1.2.2 Exporting the second subset
The final step for the second subset is to export it to a JSON file so that it can be used later on.

In [11]:
preprocessor.writeFile(page_list, f"component_1/unesco_wikipedia_pages.json")

## 2. Clearning the wikipedia pages

### 2.1 Shortening the wikipedia pages
For the main purpose of the task, namely converting information from text to a knowledge graph, most important information of a heritage site is stored in the first few paragraphs. Therefore, we will first shorten all the texts to the first 2 paragraphs and with a minimum of 500 words (average paragraph length is 250 words).

We start of by defining a function that can shorten a text to 2 paragraphs. This is done by splitting the text on the newline character and then joining the first two paragraphs together.

In [12]:
unesco_wikipedia_pages = preprocessor.loadFile(f"component_1/unesco_wikipedia_titles.json")

In [13]:
for page in unesco_wikipedia_pages:
    splitted_text = page["text"].split("\n")

    # Get the lengths of every split and get the first index of the split where the total number is higher than 500
    total_length = 0
    for index, split in enumerate(splitted_text):
        total_length += len(split)
        if total_length > 500:
            page["text"] = ''.join(splitted_text[:index+1])
            break

At this point the unesco_wikipedia_pages are split per new line. The average paragraph length is 250 words. Therefore, we will shorten the texts with a minimum of 500 words.

### 2.2 Removing unwanted characters and fixing unicode

In [14]:
unesco_wikipedia_pages_clean = preprocessor.fix_unicode(unesco_wikipedia_pages)

100%|██████████| 373/373 [00:00<00:00, 2434.20it/s]


An example of the cleaned text is shown below:

In [15]:
unesco_wikipedia_pages_clean[0]

{'id': '2642',
 'revid': '22970011',
 'url': 'https://en.wikipedia.org/wiki?curid=2642',
 'title': 'Ajanta Caves',
 'text': 'The Ajanta Caves are 29 rock-cut Buddhist cave monuments dating from the second century BCE to about 480 CE in the Aurangabad District of Maharashtra state in India. Ajanta Caves are a UNESCO World Heritage Site. Universally regarded as masterpieces of Buddhist religious art, the caves include paintings and rock-cut sculptures described as among the finest surviving examples of ancient Indian art, particularly expressive paintings that present emotions through gesture, pose and form.The caves were built in two phases, the first starting around the second century BCE and the second occurring from 400 to 650 CE, according to older accounts, or in a brief period of 460–480 CE according to later scholarship. ',
 'original_text': 'The Ajanta Caves are 29 rock-cut Buddhist cave monuments dating from the second century BCE to about 480 CE in the Aurangabad District of M

### 2.3 Export the cleaned data in separate json files
There is a export needed of separate json file for the annotation program.

In [16]:
preprocessor.save_file(unesco_wikipedia_pages_clean, "component_1/subset_texts")

## 3 Calculating precision and recall
The precision and recall scores of the retrieved UNESCO articles are calculated. The precision is the number of retrieved relevant articles divided by the total number of retrieved articles. The recall is the number of retrieved relevant articles divided by the total number of relevant articles. This lets us determine how well the filtering worked for the first and second subset.

In [18]:
import requests
from bs4 import BeautifulSoup

### 3.1 Scraping Wikipedia
The list of UNESCO landmarks is retrieved by scraping the Wikipedia page. The list is then used to determine the confusion matrix.

In [19]:
page = requests.get("https://en.wikipedia.org/wiki/List_of_World_Heritage_Sites_by_year_of_inscription")
soup = BeautifulSoup(page.content, 'html.parser')

tables = soup.find_all('table')
titles = []

for table in tables:
    for row in table.find_all('tr')[1:]:
        flag_icon = row.find('span', {'class': 'flagicon'})
        
        # Select the appropriate column based on the presence of 'flagicon'
        title_column_index = 1 if flag_icon else 0
        
        # Check if there is a hyperref title element
        title = row.find_all('td')[title_column_index].find('a')
        
        if title:
            title = title.get('title')
            titles.append(title)

### 3.2.1 Getting titles subset 1 (Spacy) 
Retrieving the titles from the document subsets for subset 1, which was created with spacy embedding filtering.

In [20]:
import json 

titles_found_set1 = []
folder_path = "data/component_1/subset_texts"
for filename in os.listdir(folder_path):
    file_path = os.path.join(folder_path, filename)

    with open(file_path, "r", encoding="utf-8") as file:
            json_data = json.load(file)
            title = json_data.get('title')
            titles_found_set1.append(title)

### 3.2.2 Getting titles subset 2 ("UNESCO World Heritage Site " checks)
Retrieving the titles from subset 2, which was created with checking whether 'UNESCO World Heritage Site ' occured in the text.

In [21]:
titles_found_set2 = []
unesco_wikipedia_pages = preprocessor.loadFile(f"component_1/unesco_wikipedia_pages.json")
for item in unesco_wikipedia_pages:
    title = item.get('title')
    titles_found_set2.append(title)

### 3.3.1 Calculating precision and recall subset 1 (Spacy)
First we calculate the precision and recall for subset 1, done by retrieving the true positives, false positives and false negatives. The precision and recall are then calculated using these values.

In [22]:
true_positives_set1 = 0
false_positives_set1 = 0

for title in titles_found_set1:
    if title in titles:
        true_positives_set1 += 1
    else:
        false_positives_set1 +=1

##list we used has 1158 UNESCO items so that is the maximum we could find
false_negatives_set1 = 1158 - false_positives_set1

precision_set1 = true_positives_set1 / (true_positives_set1 + false_positives_set1)
recall_set1 = true_positives_set1 / (true_positives_set1 + false_negatives_set1)

### 3.3.2 Calculating precision and recall subset 2 ("UNESCO World Heritage Site " checks)
First we calculate the precision and recall for subset 2, done by retrieving the true positives, false positives and false negatives. The precision and recall are then calculated using these values.

In [23]:
true_positives_set2 = 0
false_positives_set2 = 0

for title in titles_found_set2:
    if title in titles:
        true_positives_set2 += 1
    else:
        false_positives_set2 +=1

#list we used has 1158 UNESCO items so that is the maximum we could find
false_negatives_set2 = max(0, 1158 - false_positives_set2)

precision_set2 = true_positives_set2 / (true_positives_set2 + false_positives_set2)
recall_set2 = true_positives_set2 / (true_positives_set2 + false_negatives_set2)

### 3.3.3 Results
The results indicate the desired outcome for subset 1, where the precision is 1.0. The low value for the recall was expected since only 373 of the 1158 articles were included. The results for subset 2 are not as good, with a precision of 0.1 and a recall of -0.4. This is due to the fact that the check for 'UNESCO World Heritage Site ' in the text is not very accurate. However, this subset is mainly used to test the model on and create the knowledge graph, so the results are not as important as for subset 1.

In [24]:
print(f"Precision set 1: {precision_set1}")
print(f"Recall set 1: {recall_set1}")
print(f"Precision set 2: {precision_set2}")
print(f"Recall set 2: {recall_set2}")

Precision set 1: 1.0
Recall set 1: 0.2446183953033268
Precision set 2: 0.08346547498918841
Recall set 2: -0.1252975546418524
