# Web Mining and NER in Python
## Entity Recognition and Linking with Wikipedia API
### Phase 1 by Illia Nesterenko

### Introduction

This notebook contains the code and commentary for the first phase of the Web Mining and NER in Python project. The end goal of the project is to "explore how artificial intelligence can be applied to extract meaningful entities from vast textual data and establish connections between them". The first phase covers the "setting the scene" parts: defining the topic, extracting data via the Wikipedia API, and storing the scraped pages in a JSON file. Below is a walkthrough of the whole process with code and my comments (where needed).

### 1. Identify Topics
Choose specific topics or categories of interest
for entity extraction. This could be based on user input,
predefined categories, or a mix of both. 

For this project I decided to choose a predefined topic: **Russian Invasion of Ukraine**.


### 2. Wikipedia API Integration
Implement code to interact with
the Wikipedia API to fetch relevant pages based on the
identified topics. Explore different API endpoints for extracting
content (e.g., action=query, prop=extracts).

After examining the Wikipedia API, I found that the most straightforward way to get the pages is to search for relevant titles and capture the page ids of the matching articles. The page ids are then used to obtain the article URLs, and the URLs are used to fetch the page contents. Finally, the page contents, URLs, and titles are saved as JSON for further processing. All this requires only two Python libraries: _requests_ for making API queries and _json_ for storing data. We will also use _BeautifulSoup_ to illustrate the parsed pages.

#### 2.1. Query page ids

In [30]:
import requests as r
from bs4 import BeautifulSoup
import json

In [2]:
API_ENDPOINT = 'https://en.wikipedia.org/w/api.php' # API endpoint to query pages from

search_params = {
    'action' : 'query', # to fetch data
    'list' : 'search', # to perform a full-text search
    'srsearch' : 'Russian-Ukrainian War', # query for text
    'srlimit' : 3, # items to return
    'format' : 'json', # return format
    'maxlag' : 1 # will prevent task from running when the load on the servers is high
}

resp = r.get(API_ENDPOINT, params=search_params, timeout=10)
resp.json()

{'batchcomplete': '',
 'continue': {'sroffset': 3, 'continue': '-||'},
 'query': {'searchinfo': {'totalhits': 43810},
  'search': [{'ns': 0,
    'title': 'Russo-Ukrainian War',
    'pageid': 42085878,
    'size': 307171,
    'wordcount': 24480,
    'snippet': 'Russo-<span class="searchmatch">Ukrainian</span> <span class="searchmatch">War</span> is an ongoing <span class="searchmatch">war</span> between <span class="searchmatch">Russia</span> and <span class="searchmatch">Ukraine</span>, which began in February 2014. Following <span class="searchmatch">Ukraine\'s</span> Revolution of Dignity, <span class="searchmatch">Russia</span> occupied',
    'timestamp': '2024-04-11T15:32:37Z'},
   {'ns': 0,
    'title': 'Russian invasion of Ukraine',
    'pageid': 70149799,
    'size': 389296,
    'wordcount': 33997,
    'snippet': 'On 24 February 2022, <span class="searchmatch">Russia</span> invaded <span class="searchmatch">Ukraine</span> in an escalation of the Russo-<span class="searchmatch">U

The output above shows a successful query with its matching results. I limited the number of pages so that it is easier to test and illustrate the whole pipeline. Later, as the amount of data needed for the project grows, the limit will be raised.
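
Once the limit is raised, results that do not fit in one response can be collected by following the `continue` object the API returns (the `sroffset` value visible in the output). The following is a minimal sketch under stated assumptions: `search_titles` and the injectable `fetch` callable are hypothetical names that are not part of this notebook, and `fetch` is expected to take a params dict and return the decoded JSON.

```python
from typing import Callable, Dict


def search_titles(fetch: Callable[[dict], dict], query: str, total: int,
                  batch: int = 50) -> Dict[str, int]:
    """Collect up to `total` {title: pageid} pairs, following continuation.

    `fetch` is a hypothetical callable: params dict in, decoded JSON out.
    """
    params = {
        'action': 'query',
        'list': 'search',
        'srsearch': query,
        'srlimit': min(batch, total),
        'format': 'json',
    }
    titles: Dict[str, int] = {}
    while len(titles) < total:
        data = fetch(params)
        results = data['query']['search']
        if not results:
            break  # nothing returned, stop instead of looping forever
        for page in results:
            titles[page['title']] = page['pageid']
            if len(titles) >= total:
                break
        if 'continue' not in data:
            break  # the API reports no further results
        # Merge the continuation tokens (e.g. sroffset) into the next request.
        params = {**params, **data['continue']}
    return titles
```

With the names used in this notebook, `fetch` could be `lambda p: r.get(API_ENDPOINT, params=p, timeout=10).json()`. Passing the request function in as an argument keeps the paging logic testable without network access.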

In [15]:
titles = {}
pages = resp.json()['query']['search']

for page in pages:
    titles.update({page['title'] : page['pageid']})
titles

{'Russo-Ukrainian War': 42085878,
 'Russian invasion of Ukraine': 70149799,
 'War crimes in the Russian invasion of Ukraine': 70167888}

Above is the intermediate result: a dictionary with page titles as keys and the corresponding page ids as values. Next, we will iterate through this dictionary to get the page URLs.

#### 2.2. Query URLs via page ids

In [16]:
urls = {}
for title, pageid in titles.items():
    parse_params = {
        'action' : 'query',
        'prop': 'info', # to query for an info about a page
        'inprop': 'url', # to query specifically for a URL
        'pageids': pageid,
        'format': 'json',
        'maxlag' : 1 
    }

    resp1 = r.get(API_ENDPOINT, params=parse_params, timeout=10)
    url = resp1.json()['query']['pages'][f'{pageid}']['fullurl']
    urls.update({title : url})
urls

{'Russo-Ukrainian War': 'https://en.wikipedia.org/wiki/Russo-Ukrainian_War',
 'Russian invasion of Ukraine': 'https://en.wikipedia.org/wiki/Russian_invasion_of_Ukraine',
 'War crimes in the Russian invasion of Ukraine': 'https://en.wikipedia.org/wiki/War_crimes_in_the_Russian_invasion_of_Ukraine'}

In [21]:
urls.values()

dict_values(['https://en.wikipedia.org/wiki/Russo-Ukrainian_War', 'https://en.wikipedia.org/wiki/Russian_invasion_of_Ukraine', 'https://en.wikipedia.org/wiki/War_crimes_in_the_Russian_invasion_of_Ukraine'])
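
The loop above issues one request per page. The MediaWiki API also accepts several ids in a single `pageids` parameter, separated by `|` (up to a per-request cap, which is 50 for ordinary clients as far as I know), so the same URLs could be fetched in one call. Below is a sketch of that idea. The helper names `url_params` and `urls_from_response` are hypothetical, and the sample payload is hand-written in the shape of the response used above, not real API output.

```python
from typing import Dict, Iterable


def url_params(pageids: Iterable[int]) -> dict:
    """Build one query that asks for the URLs of several pages at once."""
    return {
        'action': 'query',
        'prop': 'info',
        'inprop': 'url',
        'pageids': '|'.join(str(pid) for pid in pageids),
        'format': 'json',
    }


def urls_from_response(payload: dict) -> Dict[str, str]:
    """Map page title -> full URL from a prop=info&inprop=url response."""
    return {
        page['title']: page['fullurl']
        for page in payload['query']['pages'].values()
        if 'fullurl' in page  # skip missing or invalid page ids
    }


# Hand-written sample in the shape of the API response (not real output).
sample = {'query': {'pages': {
    '42085878': {'pageid': 42085878, 'title': 'Russo-Ukrainian War',
                 'fullurl': 'https://en.wikipedia.org/wiki/Russo-Ukrainian_War'},
}}}
print(url_params([42085878, 70149799])['pageids'])
print(urls_from_response(sample))
```

For three pages the saving is negligible, but with a higher search limit a batched request would reduce the number of API calls considerably.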

Now, with the URLs at hand, we can make an HTTP GET request and obtain the contents of a page. Below is an illustration of a page's textual data, obtained by parsing the HTML with BeautifulSoup.

In [22]:
test_url = 'https://en.wikipedia.org/wiki/Russo-Ukrainian_War'

resp2 = r.get(test_url, timeout=10)
soup = BeautifulSoup(resp2.text, 'html.parser') # name the parser explicitly to avoid a warning and parser-dependent results
print(soup.get_text())

Russo-Ukrainian War - Wikipedia

Jump to content

Main menu

Main menu
move to sidebar
hide

Navigation

Main pageContentsCurrent eventsRandom articleAbout WikipediaContact usDonate

Contribute

HelpLearn to editCommunity portalRecent changesUpload file

Search

Search

Create account

Log in

Personal tools

Create account Log in

Pages for logged out editors learn more

ContributionsTalk

Contents
move to sidebar
hide

(Top)

1Background

Toggle Background subsection

1.1Independent Ukraine and the Orange Revolution

1.2Euromaidan, Revolution of Dignity, and pro-Russian unrest

1.3Russian military bases in Crimea

1.4Legality and declaration of war

2History

Toggle History subsection

2.1Russian annexation of Crimea (2014)

2.2War in the Donbas (2014–2015)

2.2.1Pr
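
The parsed text mixes the article with navigation elements such as "Main menu", "Search", and "Log in". One possible alternative is the `prop=extracts` endpoint mentioned in the task description, which returns the article text without HTML when `explaintext` is set. This is a sketch only: it assumes the TextExtracts extension is enabled on the wiki (it is on English Wikipedia, as far as I know), `extract_params` and `plain_text` are hypothetical helper names, and the sample payload is hand-written rather than real API output.

```python
def extract_params(pageid: int) -> dict:
    """Parameters asking for the plain-text extract of one page."""
    return {
        'action': 'query',
        'prop': 'extracts',
        'explaintext': 1,  # plain text instead of limited HTML
        'pageids': pageid,
        'format': 'json',
    }


def plain_text(payload: dict, pageid: int) -> str:
    """Pull the extract for `pageid` out of a decoded JSON response."""
    page = payload['query']['pages'][str(pageid)]
    return page.get('extract', '')  # empty string if no extract was returned


# Hand-written payload in the expected shape (not real API output).
sample = {'query': {'pages': {'42085878': {
    'pageid': 42085878,
    'title': 'Russo-Ukrainian War',
    'extract': 'Example article text.',
}}}}
print(plain_text(sample, 42085878))
```

With the names used in this notebook, the payload could be obtained as `r.get(API_ENDPOINT, params=extract_params(42085878), timeout=10).json()`.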

### 3. Data Storage
Design a data structure to store the retrieved
Wikipedia pages. Consider using a suitable data format, such
as JSON or a database, to organize and store the data.

In the beginning I faced a pseudo-dilemma: should I save only the URLs and make new HTTP requests later, or should I save the contents of each page right away? In the end I decided to save both in a list and map that list to the page title in a dictionary. Below, the whole dataset is collected in a single dictionary.

In [31]:
pages = {}
for title, url in urls.items():
    resp = r.get(url, timeout=10)
    pages.update({title : [url, resp.text]})
pages

{'Russo-Ukrainian War': ['https://en.wikipedia.org/wiki/Russo-Ukrainian_War',
 'Russian invasion of Ukraine': ['https://en.wikipedia.org/wiki/Russian_invasion_of_Ukraine',
 'War crimes in the Russian invasion of Ukraine': ['https://en.wikipedia.org/wiki/War_crimes_in_the_Russian_invasion_of_Ukraine',
  '<!DOCTYPE html>\n<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-custom-font-size-clientpref-0 vector-feature-client-preferences-disabled vector-feature-client-prefs-pinned-disabled vector-feature-night-mode-disabled skin-theme-clientpref-day vector-toc-available" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8">\n<title>War crimes in the Russian invasion 

In [33]:
new_data = pages

try:
    with open('data.json', mode='r') as data_file:
        # Reading old data
        data = json.load(data_file)

except FileNotFoundError:
    with open('data.json', mode='w') as data_file:
        # No file yet: saving the data from the current session
        json.dump(new_data, data_file, indent=4)

else:
    # Iterate over new_data (not urls) so that both the URL and the page
    # contents are stored, matching the structure written on the first run
    for title, content in new_data.items():
        if title in data:
            print(f'The page "{title}" has already been saved.')
        else:
            # Updating old data with new data
            data.update({title : content})
            with open('data.json', mode='w') as data_file:
                # Saving updated data
                json.dump(data, data_file, indent=4)

The page "Russo-Ukrainian War" has already been saved.
The page "Russian invasion of Ukraine" has already been saved.
The page "War crimes in the Russian invasion of Ukraine" has already been saved.


Above is a moderately branched version of saving data in JSON format. First, I use a try-except-else statement to check whether a data file already exists. If not, a new file is created and filled with the data from the current session. If the file exists, the old data are loaded, new pages are added, and the updated data are written back to the file. Pages are checked and written one by one, so titles that are already stored are skipped and no duplicates are created. This approach may become a bottleneck in the future because of the extensive read-write workload. For now it works all right, and the benefit of keeping everything in one file instead of a pile of separate JSON files outweighs the possible drawbacks.
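
If the read-write workload does become a problem, one option is to merge all new pages in memory and write the file only once per run. The sketch below is an illustration under assumptions, not the code used in this notebook: `merge_pages` is a hypothetical helper, and the temporary-file swap via `os.replace` is an extra safeguard I added so that an interrupted run does not leave a half-written file.

```python
import json
import os
import tempfile


def merge_pages(path: str, new_pages: dict) -> list:
    """Add pages whose titles are not yet stored; return the titles added."""
    try:
        with open(path, encoding='utf-8') as f:
            data = json.load(f)
    except FileNotFoundError:
        data = {}  # first run: start from an empty store

    added = [title for title in new_pages if title not in data]
    for title in added:
        data[title] = new_pages[title]

    if added:
        # Write to a temporary file in the same directory, then swap it in.
        directory = os.path.dirname(os.path.abspath(path))
        fd, tmp = tempfile.mkstemp(dir=directory, suffix='.tmp')
        with os.fdopen(fd, 'w', encoding='utf-8') as f:
            json.dump(data, f, indent=4)
        os.replace(tmp, path)
    return added
```

Called as `merge_pages('data.json', pages)`, this keeps the single-file design and the duplicate check while performing at most one write per session.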

### Conclusions

In this notebook I presented the code that scrapes Wikipedia pages and stores the results in a JSON file. This is but a foundation for future analysis. Nevertheless, it is an important part of the project, since without data no further exploration is possible.

---------
Illia Nesterenko