# Web Mining and NER in Python
## Preprocessing and Named Entity Recognition
### Phase 2 by Illia Nesterenko

### Introduction

This notebook contains the code and accompanying commentary for the second phase of the Web Mining and NER in Python project. The end goal of the project is to "explore how artificial intelligence can be applied to extract meaningful entities from vast textual data and establish connections between them". The second phase covers all of the text preprocessing steps: tokenization, removing stop words, handling special characters, implementing named entity recognition (NER), entity linking, and exporting the results to a .csv file. Below is a walkthrough of the whole process, with code and my comments where needed.

### 0. Preliminaries

After the first phase, in which I worked with the Wikipedia API directly, I discovered a Python library named ```wikipediaapi``` that significantly simplifies interaction with the Wikipedia database. It also has a simple interface for finding and retrieving the text of any Wikipedia article. So here I decided to work through this library, keeping in mind that I will need to refine the first phase of the project for the final presentation.

Another important thing to mention is that here I demonstrate the text preprocessing pipeline on a single article. I did this on purpose, so that it was easier to test during development. The code will be adapted to more articles for the final presentation.

Keeping all this in mind, let's begin with a motivation for using ```wikipediaapi``` by picking up where we left off, that is, by accessing a JSON file with the full HTML page of the Wikipedia article on the Russo-Ukrainian War.

In [4]:
import json
from bs4 import BeautifulSoup
import wikipediaapi
import re
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import spacy
import pandas as pd

In [5]:
# Convenience code: display pandas DataFrames in full, without "cropping" the middle of the table.
pd.set_option('display.max_columns', None) # show all columns
pd.set_option('display.max_rows', None) # show all rows
pd.set_option('display.width', 200) # widen display width if needed

The code below opens the JSON file with the stored HTML of the page. When we print out the page, we can see that it is filled with a lot of unnecessary text such as site navigation text, contents, notes, etc. It is possible to filter out all this text and get the pure article text, but there is a more ergonomic way to do this.

In [6]:
with open('data.json') as file:
    data = json.load(file)

In [7]:
soup = BeautifulSoup(data['Russo-Ukrainian War'][1], 'html.parser')

In [8]:
print(soup.get_text())





Russo-Ukrainian War - Wikipedia



































Jump to content







Main menu





Main menu
move to sidebar
hide



		Navigation
	


Main pageContentsCurrent eventsRandom articleAbout WikipediaContact usDonate





		Contribute
	


HelpLearn to editCommunity portalRecent changesUpload file



















Search











Search





























Create account

Log in








Personal tools





 Create account Log in





		Pages for logged out editors learn more



ContributionsTalk




























Contents
move to sidebar
hide




(Top)





1Background



Toggle Background subsection





1.1Independent Ukraine and the Orange Revolution







1.2Euromaidan, Revolution of Dignity, and pro-Russian unrest







1.3Russian military bases in Crimea







1.4Legality and declaration of war









2History



Toggle History subsection





2.1Russian annexation of Crimea (2014)







2.2War in the Donbas (2014–2015)





2.2.1Pr
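
The manual filtering mentioned above can be sketched roughly as follows. This is a minimal illustration, not the approach used in this notebook: it keeps only the text found inside `<p>` elements, on the assumption that article prose lives in paragraphs while navigation text does not. It uses the standard library's `html.parser` so that it runs without extra dependencies; `ParagraphTextExtractor` and `sample_html` are hypothetical names invented for the example.

```python
from html.parser import HTMLParser


class ParagraphTextExtractor(HTMLParser):
    """Collect text that appears inside <p> elements only (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self._depth = 0        # number of <p> elements we are currently inside
        self._chunks = []      # text pieces of the paragraph being read
        self.paragraphs = []   # finished paragraphs

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "p" and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                text = "".join(self._chunks).strip()
                if text:
                    self.paragraphs.append(text)
                self._chunks = []

    def handle_data(self, data):
        # Text outside <p> (menus, search boxes, ...) is ignored.
        if self._depth > 0:
            self._chunks.append(data)


# Hypothetical miniature page: navigation links plus one real paragraph.
sample_html = """
<html><body>
  <div id="nav"><a href="#">Main page</a><a href="#">Contents</a></div>
  <div id="content">
    <p>The <b>Russo-Ukrainian War</b> is an ongoing war.</p>
  </div>
</body></html>
"""

extractor = ParagraphTextExtractor()
extractor.feed(sample_html)
paragraphs = extractor.paragraphs
print(paragraphs)
```

On a real Wikipedia page this heuristic would still let through captions, notes and similar fragments, which is part of why a dedicated library is the more ergonomic option.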

First, we take the Wikipedia URL from the previous phase and extract the title of the page. We also specify a User-Agent, as the library requires the client to identify itself in order to work.

In [9]:
data['Russo-Ukrainian War'][0]

'https://en.wikipedia.org/wiki/Russo-Ukrainian_War'

In [10]:
UserAgent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:125.0) Gecko/20100101 Firefox/125.0'

page_link = data['Russo-Ukrainian War'][0]

# Extract page title from the link
page_title = page_link.split("/")[-1]
page_title

'Russo-Ukrainian_War'
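
Splitting on `/` works for this link. As a hedged alternative, the sketch below uses the standard library's `urllib.parse` to ignore any `#fragment` or `?query` part and to decode percent-escapes. The helper name `title_from_wikipedia_url` is hypothetical, and the code assumes the link has the usual `/wiki/<title>` form.

```python
from urllib.parse import unquote, urlsplit


def title_from_wikipedia_url(url):
    """Return the article title from a /wiki/<title> URL (hypothetical helper)."""
    path = urlsplit(url).path                 # drops any ?query or #fragment
    _, _, title = path.partition("/wiki/")    # everything after the first /wiki/
    return unquote(title)                     # decode percent-escapes such as %27


print(title_from_wikipedia_url("https://en.wikipedia.org/wiki/Russo-Ukrainian_War"))
print(title_from_wikipedia_url("https://en.wikipedia.org/wiki/Revolution_of_Dignity#Aftermath"))
```

If the link does not contain `/wiki/`, the function returns an empty string, which a caller could treat as "no title found".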

Then, we create an instance of the ```Wikipedia``` class and use its ```page()``` method to look up the page we need. After that we can easily access the text of the article via the ```.text``` attribute.

In [11]:

# Initialize Wikipedia API
wiki = wikipediaapi.Wikipedia(UserAgent, 'en')

# Fetch the page
page = wiki.page(page_title)

# Extract the text of the article
article_text = page.text

print(article_text)


The Russo-Ukrainian War is an ongoing war between Russia and Ukraine, which began in February 2014. Following Ukraine's Revolution of Dignity, Russia occupied and annexed Crimea from Ukraine and supported pro-Russian separatists fighting the Ukrainian military in the Donbas war. The first eight years of conflict also included naval incidents, cyberwarfare, and heightened political tensions. In February 2022, Russia launched a full-scale invasion of Ukraine and began occupying more of the country.
In early 2014, the Euromaidan protests led to the Revolution of Dignity and the ousting of Ukraine's pro-Russian president Viktor Yanukovych. Shortly after, pro-Russian unrest erupted in eastern and southern Ukraine, while unmarked Russian troops occupied Crimea. Russia soon annexed Crimea after a highly disputed referendum. In April 2014, Russian-backed militants seized towns in Ukraine's eastern Donbas region and proclaimed the Donetsk People's Republic (DPR) and the Luhansk People's Republi

### 1. Text Preprocessing
Clean and preprocess the text data obtained from Wikipedia pages. Perform tasks like tokenization, removing stop words, and handling special characters.

For tokenization, we use nltk's ```word_tokenize()``` function. For removing stop words, we use nltk's list of English stop words. For handling special characters, we use the ```re``` library and its ```sub()``` function. It takes just five lines of code to get clean, preprocessed text.

In [13]:
# Tokenization
tokens = word_tokenize(article_text)

# Removing stop words
stop_words = set(stopwords.words('english'))
filtered_tokens = [word for word in tokens if word.lower() not in stop_words]

# Keep only alphanumeric tokens (this drops punctuation and hyphenated words), then lowercase and strip any non-ASCII characters
cleaned_tokens = [re.sub(r'[^a-zA-Z0-9]', '', word.lower()) for word in filtered_tokens if word.isalnum()]

# Convert list of cleaned tokens back to a string
cleaned_text = ' '.join(cleaned_tokens)

print(cleaned_text)

war ongoing war russia ukraine began february 2014 following ukraine revolution dignity russia occupied annexed crimea ukraine supported separatists fighting ukrainian military donbas war first eight years conflict also included naval incidents cyberwarfare heightened political tensions february 2022 russia launched invasion ukraine began occupying country early 2014 euromaidan protests led revolution dignity ousting ukraine president viktor yanukovych shortly unrest erupted eastern southern ukraine unmarked russian troops occupied crimea russia soon annexed crimea highly disputed referendum april 2014 militants seized towns ukraine eastern donbas region proclaimed donetsk people republic dpr luhansk people republic lpr independent states starting donbas war separatists received considerable covert support russia ukrainian attempts fully retake areas failed although russia denied involvement russian troops took part fighting february 2015 russia ukraine signed minsk ii agreements end c
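
To make the effect of each step visible, here is a small self-contained sketch run on one sentence. It is only an approximation of the cell above: `TOKEN_PATTERN` is a rough regular-expression stand-in for `word_tokenize()`, and `STOP_WORDS` is a tiny hypothetical list instead of nltk's full English list, so that the example runs without downloading any nltk data.

```python
import re

# Hypothetical miniature stop-word list; the notebook uses nltk's full English list.
STOP_WORDS = {"the", "of", "and", "in", "to", "a"}

# Rough stand-in for nltk's word_tokenize: words (hyphens kept), clitics such
# as 's, and single punctuation marks each become one token.
TOKEN_PATTERN = re.compile(r"\w+(?:-\w+)*|'\w+|[^\w\s]")


def preprocess(text):
    tokens = TOKEN_PATTERN.findall(text)
    # 1) drop stop words (case-insensitive)
    filtered = [t for t in tokens if t.lower() not in STOP_WORDS]
    # 2) keep purely alphanumeric tokens, lowercase them, strip non-ASCII characters
    cleaned = [re.sub(r"[^a-zA-Z0-9]", "", t.lower()) for t in filtered if t.isalnum()]
    return tokens, filtered, cleaned


sample = "The ousting of Ukraine's pro-Russian president, Viktor Yanukovych."
tokens, filtered, cleaned = preprocess(sample)
print(tokens)
print(filtered)
print(" ".join(cleaned))
```

Note that the `isalnum()` check removes whole tokens such as `pro-Russian` and `'s` rather than cleaning them, which is why hyphenated words are absent from the output above.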

### 2. Named Entity Recognition
Implement NER using libraries like SpaCy or NLTK. Extract entities such as persons, organizations, locations, etc., from the preprocessed text.

I decided to use spaCy for NER. I checked the labels it can capture and chose the ones most relevant to the task. The list of these entity labels is shown below.

In [14]:
valid_entities = ['PERSON','NORP','FAC','ORG','GPE','LOC','PRODUCT','EVENT','WORK_OF_ART','LAW','LANGUAGE']

In [15]:
# Load SpaCy's English language model
nlp = spacy.load("en_core_web_sm")

# Process the preprocessed text
doc = nlp(cleaned_text)

# Extract named entities
entities = [(entity.text, entity.label_) for entity in doc.ents if entity.label_ in valid_entities]

print(entities)


[('russia', 'GPE'), ('russia', 'GPE'), ('russia', 'GPE'), ('ukraine', 'GPE'), ('russian', 'NORP'), ('russia', 'GPE'), ('donbas', 'GPE'), ('russia', 'GPE'), ('russia', 'GPE'), ('russian', 'NORP'), ('russia', 'GPE'), ('donbas war', 'ORG'), ('russian', 'NORP'), ('russia', 'GPE'), ('russian', 'NORP'), ('russian', 'NORP'), ('vladimir putin', 'PERSON'), ('nato', 'ORG'), ('russia', 'GPE'), ('putin', 'PERSON'), ('russia', 'GPE'), ('russian', 'NORP'), ('russia', 'GPE'), ('russia', 'GPE'), ('russia', 'GPE'), ('russia', 'GPE'), ('russia', 'GPE'), ('soviet', 'NORP'), ('russia', 'GPE'), ('soviet', 'NORP'), ('russia', 'GPE'), ('kingdom united states', 'GPE'), ('russia', 'GPE'), ('european', 'NORP'), ('eastern bloc', 'LOC'), ('nato', 'ORG'), ('russia', 'GPE'), ('russian', 'LANGUAGE'), ('first chechen war', 'EVENT'), ('putin', 'PERSON'), ('european', 'NORP'), ('viktor yushchenko', 'PERSON'), ('russia', 'GPE'), ('supreme court', 'ORG'), ('yulia', 'GPE'), ('anthony cordesman', 'PERSON'), ('russian', 'NO

Now we go through the list of found entities and keep only the first occurrence of each entity text.

In [16]:
# Initialize an empty set to store unique entities
unique_entities = set()

# Filter the list to contain only unique values
filtered_entities = []

for entity_text, entity_label in entities:
    # Check if the entity text is not already in the set of unique entities
    if entity_text not in unique_entities:
        # Add the entity text to the set of unique entities
        unique_entities.add(entity_text)
        # Add the entity to the filtered list
        filtered_entities.append((entity_text, entity_label))

print(filtered_entities)

[('russia', 'GPE'), ('ukraine', 'GPE'), ('russian', 'NORP'), ('donbas', 'GPE'), ('donbas war', 'ORG'), ('vladimir putin', 'PERSON'), ('nato', 'ORG'), ('putin', 'PERSON'), ('soviet', 'NORP'), ('kingdom united states', 'GPE'), ('european', 'NORP'), ('eastern bloc', 'LOC'), ('first chechen war', 'EVENT'), ('viktor yushchenko', 'PERSON'), ('supreme court', 'ORG'), ('yulia', 'GPE'), ('anthony cordesman', 'PERSON'), ('favour putin', 'PERSON'), ('georgia', 'GPE'), ('western european', 'NORP'), ('ukraine georgia', 'GPE'), ('us', 'GPE'), ('george bush', 'PERSON'), ('georgia ukraine', 'ORG'), ('eurasian', 'NORP'), ('eu', 'GPE'), ('kremlin', 'ORG'), ('rada', 'GPE'), ('ukraine russia', 'PERSON'), ('crimean', 'GPE'), ('kacha hvardiiske simferopol', 'PERSON'), ('insignia', 'GPE'), ('aksyonov', 'NORP'), ('ukraine putin', 'PERSON'), ('donbas part', 'ORG'), ('chechen', 'NORP'), ('un', 'ORG'), ('eastern ukraine', 'NORP'), ('insignia girkin', 'ORG'), ('malaysia', 'GPE'), ('united banner', 'ORG'), ('nikol
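
Because uniqueness is decided by the entity text alone, the label that survives is whichever one happened to appear first. A possible variation, shown here only as a sketch, keeps the label assigned most often to each text. The function name `unique_entities_by_majority` and the sample data are hypothetical.

```python
from collections import Counter, defaultdict


def unique_entities_by_majority(entities):
    """Return one (text, label) pair per entity text, in first-seen order.

    The label is the one assigned most often to that text; ties go to the
    label that was seen first.
    """
    label_counts = defaultdict(Counter)
    for text, label in entities:
        label_counts[text][label] += 1
    return [(text, counts.most_common(1)[0][0]) for text, counts in label_counts.items()]


# Small hand-made sample in the same shape as the `entities` list.
sample_entities = [
    ("russia", "GPE"),
    ("russian", "LANGUAGE"),
    ("russian", "NORP"),
    ("russia", "GPE"),
    ("russian", "NORP"),
    ("nato", "ORG"),
]

result = unique_entities_by_majority(sample_entities)
print(result)
```

In this sample, `russian` ends up as `NORP` even though its first mention was tagged `LANGUAGE`.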

### 3. Entity Linking
Integrate entity linking to associate recognized entities with their corresponding Wikipedia pages. Use the Wikipedia API to enhance entity information.

We implement task 3 by defining a function that fetches the Wikipedia article for an entity (if it exists). Then, we add its URL to the list of found entities.

In [17]:
# Function to fetch Wikipedia page for an entity
def fetch_wikipedia_page(entity):
    page = wiki.page(entity)
    if page.exists():
        return page.fullurl
    else:
        return None

# Entity linking: attach the Wikipedia URL (or None if no page exists) to each entity
linked_entities = []
for entity_text, entity_label in filtered_entities:
    page_url = fetch_wikipedia_page(entity_text)
    linked_entities.append((entity_text, entity_label, page_url))

print(linked_entities)

[('russia', 'GPE', 'https://en.wikipedia.org/wiki/Russia'), ('ukraine', 'GPE', 'https://en.wikipedia.org/wiki/Ukraine'), ('russian', 'NORP', 'https://en.wikipedia.org/wiki/Russian'), ('donbas', 'GPE', 'https://en.wikipedia.org/wiki/Donbas'), ('donbas war', 'ORG', 'https://en.wikipedia.org/wiki/War_in_Donbas'), ('vladimir putin', 'PERSON', 'https://en.wikipedia.org/wiki/Vladimir_Putin'), ('nato', 'ORG', 'https://en.wikipedia.org/wiki/NATO'), ('putin', 'PERSON', 'https://en.wikipedia.org/wiki/Vladimir_Putin'), ('soviet', 'NORP', 'https://en.wikipedia.org/wiki/Soviet_Union'), ('kingdom united states', 'GPE', None), ('european', 'NORP', 'https://en.wikipedia.org/wiki/European'), ('eastern bloc', 'LOC', 'https://en.wikipedia.org/wiki/Eastern_Bloc'), ('first chechen war', 'EVENT', 'https://en.wikipedia.org/wiki/First_Chechen_War'), ('viktor yushchenko', 'PERSON', None), ('supreme court', 'ORG', 'https://en.wikipedia.org/wiki/Supreme_court'), ('yulia', 'GPE', 'https://en.wikipedia.org/wiki/Yu
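
The linking step can also be written so that the lookup is passed in as a parameter, which makes it possible to try the logic offline. The sketch below is an illustration under that assumption: `link_entities`, `fake_resolver` and `KNOWN_PAGES` are hypothetical names, and the small dictionary stands in for real Wikipedia requests. Each distinct entity text is resolved only once.

```python
def link_entities(entities, resolve_url):
    """Attach a URL (or None) to every (text, label) pair.

    `resolve_url` is any callable mapping an entity text to a URL or None.
    Each distinct text is resolved once and the answer is reused.
    """
    cache = {}
    linked = []
    for text, label in entities:
        if text not in cache:
            cache[text] = resolve_url(text)
        linked.append((text, label, cache[text]))
    return linked


# Hypothetical offline resolver standing in for a real Wikipedia lookup.
KNOWN_PAGES = {
    "russia": "https://en.wikipedia.org/wiki/Russia",
    "nato": "https://en.wikipedia.org/wiki/NATO",
}
calls = []


def fake_resolver(text):
    calls.append(text)          # record every lookup that is actually performed
    return KNOWN_PAGES.get(text)


sample = [
    ("russia", "GPE"),
    ("kingdom united states", "GPE"),
    ("russia", "GPE"),
    ("nato", "ORG"),
]
linked = link_entities(sample, fake_resolver)
print(linked)
print(calls)
```

In the notebook itself, `fetch_wikipedia_page` could be passed as `resolve_url` in place of the fake resolver.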

Finally, we create a pandas DataFrame and drop the entities with no URL. We then export the result to a .csv file.

In [18]:
df = pd.DataFrame(linked_entities, columns=['Entity', 'Label', 'URL']).dropna()
df

| | Entity | Label | URL |
|---|---|---|---|
| 0 | russia | GPE | https://en.wikipedia.org/wiki/Russia |
| 1 | ukraine | GPE | https://en.wikipedia.org/wiki/Ukraine |
| 2 | russian | NORP | https://en.wikipedia.org/wiki/Russian |
| 3 | donbas | GPE | https://en.wikipedia.org/wiki/Donbas |
| 4 | donbas war | ORG | https://en.wikipedia.org/wiki/War_in_Donbas |
| 5 | vladimir putin | PERSON | https://en.wikipedia.org/wiki/Vladimir_Putin |
| 6 | nato | ORG | https://en.wikipedia.org/wiki/NATO |
| 7 | putin | PERSON | https://en.wikipedia.org/wiki/Vladimir_Putin |
| 8 | soviet | NORP | https://en.wikipedia.org/wiki/Soviet_Union |
| 10 | european | NORP | https://en.wikipedia.org/wiki/European |


In [19]:
df.to_csv('linked_entities.csv')
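
By default, `DataFrame.to_csv` also writes the row index as an unnamed first column; passing `index=False` omits it. For comparison, the same filter-and-export step can be sketched with only the standard library's `csv` module. This is a swapped-in technique for illustration, not what the notebook does: the rows are hand-made sample data and an in-memory buffer stands in for the output file.

```python
import csv
import io

# Hand-made rows in the same shape as `linked_entities`.
rows = [
    ("russia", "GPE", "https://en.wikipedia.org/wiki/Russia"),
    ("kingdom united states", "GPE", None),
    ("nato", "ORG", "https://en.wikipedia.org/wiki/NATO"),
]

buffer = io.StringIO()   # in-memory stand-in for a file on disk
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["Entity", "Label", "URL"])
# Skip entities without a URL, mirroring the dropna() step.
writer.writerows(row for row in rows if row[2] is not None)

csv_text = buffer.getvalue()
print(csv_text)
```

The resulting text has exactly three columns and no index column, so it can be read back without an extra unnamed field.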

### Conclusions
In this notebook I presented the code that performs text preprocessing and stores the linked entities in a .csv file. This is a key part of the project. With these results, once we repeat the process for several articles, we can generate a knowledge graph in phase 3.

--------------
Illia Nesterenko