This file shows the steps taken to generate the Namarsh dataset.

The first step is to extract data from the .html files.
The extracted data consists of the names of the website sections and the texts of the articles.

In [None]:
# The library to extract data from HTML
from bs4 import BeautifulSoup
# The library used for data manipulation: https://pandas.pydata.org
import pandas as pd
# The libraries to access local files and folders which will be used to iterate over thousands of .html files in the namarsh folder
import os
import glob

Next, the folder that contains the Namarsh HTML files is defined.
full = [] creates an empty list where the processed data will be stored during the next step.

In [None]:
files = glob.glob('materials/*.html')
full = []

The next step is to iterate over the HTML files in the folder and pull the website articles.
The search is done using the BeautifulSoup library.
The articles are found through the <div> elements with the content-container class, whose text holds the <p></p> paragraphs of the articles.
Each record is stored together with the name of the file it came from.
All the data is then converted to a dataframe for further manipulation.

In [None]:
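for file in files:

    # Open in binary mode and close the file as soon as it has been parsed
    with open(file, 'rb') as extractor:
        soup = BeautifulSoup(extractor, features="html.parser")

    # Collect the text of every content container on the page
    text = soup.find_all("div", class_="content-container")
    p_text = [p.get_text() for p in text]

    # Keep the file name next to the extracted text
    full.append([p_text, os.path.basename(file)])

# The columns get the default labels 0 (text) and 1 (file name)
df_full = pd.DataFrame(full)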
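
To make the extraction step easier to follow, here is a minimal sketch of the same BeautifulSoup search on an inline HTML string. The markup in sample_html is a made-up fragment, not a real Namarsh page, so the actual structure of the site may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical page fragment used only for illustration
sample_html = """
<html><body>
  <div class="content-container"><p>Первый абзац</p><p>Второй абзац</p></div>
  <div class="sidebar">Меню</div>
</body></html>
"""

soup = BeautifulSoup(sample_html, features="html.parser")

# Only the <div> elements with the content-container class are selected
containers = soup.find_all("div", class_="content-container")

# get_text() joins the text of all nested tags into one string
texts = [c.get_text() for c in containers]
print(texts)
```

The sidebar block is skipped because it does not carry the content-container class, and the two paragraphs end up in a single string.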

Next, a function is created that runs over the dataframe rows and finds the News of Protest ("Новости протеста") subsection.
The matching texts are then stored in a new column called 'Name'.

In [None]:
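def match_cat(column):
    # With axis=1 the function receives one row; the cell labelled 0 holds the extracted text
    # The text is converted to a string because the later steps use string operations on it
    text = str(column[0])
    if "Новости протеста" in text: return text

df_full['0'] = df_full.apply(match_cat, axis=1)
df_full['Name'] = df_full['0']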

A new dataframe is created to hold the cleaner data for convenience.
The rows where the function found no match contain null values, which are removed using the pandas dropna() function.
The result is then stored in the clean dataframe.

In [None]:
clean_df = pd.DataFrame()
clean_df['Name'] = df_full['Name'].dropna()
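
The filtering pattern used here (a row-wise function that returns nothing for unwanted rows, followed by dropna()) can be sketched on a toy dataframe. The two rows and the name keep_protest_news are invented for the illustration and are not part of the Namarsh data.

```python
import pandas as pd

# Toy data: column 0 holds the text, column 1 the file name
toy = pd.DataFrame({
    0: ["['Новости протеста: митинг']", "['Другой раздел']"],
    1: ["a.html", "b.html"],
})

def keep_protest_news(row):
    # Return the text for matching rows and None for all the others
    return row[0] if "Новости протеста" in row[0] else None

toy["Name"] = toy.apply(keep_protest_news, axis=1)

# dropna() removes the rows where the function returned None
kept = toy["Name"].dropna()
print(kept)
```

The index of the surviving rows is preserved, so they can still be traced back to the original dataframe.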

Since the extracted text still includes leftover markup, the next steps are to get rid of it and split the data into dataset columns.
The columns are: title of the event, short description (which will be used as the event identifier or title when merged with the Forthcoming Events subsection) and details.
The data is split in the order it is presented in the articles: the title comes first, followed by the short description in the first paragraph and the main description from the second paragraph onwards.
The split is done on the \t and \n sequences which were pulled together with the text during the extraction step.
\n stands for "new line", and new lines are used to begin new sections.
The 'output_df' dataframe will store the final version of the News of Protest subsection.

In [None]:
# In the text, \t and \n appear as a backslash followed by a Latin letter
# The splits below are made on the letters "t" and "n" and the leftover backslashes are stripped afterwards
# This only works as intended if no Latin "t" or "n" occurs earlier in the text

# Drop everything up to the first \t; the text starts with the date, which ends up in column 0
output_df = clean_df['Name'].str.split("t", n=1, expand=True)
output_df = output_df.drop([0], axis=1)
output_df = output_df[1].str.split(" ", n=1, expand=True)

# The title and description columns. The latter contains both the introduction paragraph and main description
output_df[['title', 'description']] = output_df[1].str.split("n", n=1, expand=True)
output_df['title'] = output_df['title'].str.strip("\\")
output_df['description'] = output_df['description'].str.strip(r"\n")
output_df = output_df.drop([1], axis=1)

# Split description into "short description" and "details"
output_df[['short_description', 'details']] = output_df['description'].str.split("n", n=1, expand=True)
output_df['short_description'] = output_df.short_description.str.strip("\\")
output_df['details'] = output_df.details.str.strip(r"\n")

# Get rid of the original column since it has been split
output_df = output_df.drop(['description'], axis=1)
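
The effect of n=1 and expand=True can be seen on a single made-up record. This sketch assumes the text stores \n as a literal backslash followed by the letter n, and it splits on that two-character sequence with a regular expression instead of on the letter alone; the sample string is invented.

```python
import pandas as pd

# A raw string keeps \n as two characters: a backslash and the letter n
raw = pd.Series([r"Заголовок\nКраткое описание\nПодробности"])

# The pattern \\n matches a literal backslash followed by "n"
# n=1 stops after the first match, expand=True returns the parts as columns
parts = raw.str.split(r"\\n", n=1, expand=True)

print(parts[0][0])  # the title
print(parts[1][0])  # the rest, still containing the second \n
```

Because only the first separator is consumed, the remainder can be split again in the same way to separate the short description from the details.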

Now that the data is split into columns, the next step is to remove the unnecessary symbols that contaminate the dataset.
For this, the cleaner() function is introduced, which replaces each of these symbols with a space (" ").

In [None]:
def cleaner(data):
    lister = {r'\r', r'\n', r'\xa0'}
    for i in lister:
        data = data.replace(i, " ")
    return data

The function is then applied to relevant columns with extra symbols

In [None]:
# Convert every value to a string and replace the escaped symbols with spaces
output_df['short_description'] = output_df['short_description'].apply(str).apply(cleaner)
output_df['details'] = output_df['details'].apply(str).apply(cleaner)

# Additional cleaning: collapse repeated whitespace into a single space
output_df['short_description'] = output_df['short_description'].str.replace(r'\s+', ' ', regex=True)
output_df['details'] = output_df['details'].str.replace(r'\s+', ' ', regex=True)
# Strip stray backslash, quote and closing bracket characters from both ends
output_df['details'] = output_df['details'].str.strip(r'\']')
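
The two cleaning stages (replacing the escaped symbols, then collapsing whitespace) can be tried on one string without pandas. This is only a sketch: clean_text is a hypothetical helper that combines both stages, and the sample sentence is invented.

```python
import re

def clean_text(text):
    # Stage 1: replace the escaped sequences, stored as literal characters, with spaces
    for token in (r"\r", r"\n", r"\xa0"):
        text = text.replace(token, " ")
    # Stage 2: collapse any run of whitespace into a single space
    return re.sub(r"\s+", " ", text).strip()

# The raw string keeps \r, \n and \xa0 as literal backslash sequences
sample = r"Акция\r\nпрошла\xa0в  центре города"
cleaned = clean_text(sample)
print(cleaned)
```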

To extract locations from the data, string matching is used.
For that, a list of Russian cities is imported and each city is checked against the text one by one.
The short description is searched first, and the details are searched only when the short description contains no city.
The first matching city is returned for each row, and the results are attached to the dataframe as a new column.
This is done using the city_extractor() function.

In [None]:
# The list of cities is imported
cities_list = pd.read_csv('towns.csv')

# This line is specific to the Russian language because its declension system changes the endings of words
# The final vowel is removed to keep the root of the word, and this root is used to find matches
cities_list['city'] = cities_list['city'].apply(lambda x: x[:-1] if x.endswith(('а', 'е', 'я')) else x)

def city_extractor(row):
    # Look for a city in the short description first and fall back to the details
    for column in ('short_description', 'details'):
        for cities in cities_list['city']:
            if cities in row[column]: return cities
    return 'None'

The function is then applied.
It returns the city names without their ending vowels, so the full names are restored afterwards with a replacement table.

In [None]:
output_df['cities'] = output_df.apply(city_extractor, axis=1)

replacements = {
    'cities': {
        "Москв" : "Москва", "Калуг" : "Калуга", "Тул" : "Тула", "Самар" : "Самара", "Раменск" : "Раменское",
        "Балаших" : "Балашиха", "Костром" : "Кострома", "Вологд" : "Вологда", "Кашир" : "Кашира", "Донецк" : "None",
        "Козловк" : "Козловка", "Махачкал" : "Махачкала", "Туапс" : "Туапсе",
        "Чит" : "Чита", "Пенз" : "Пенза", "Каменк" : "Каменка", "Кушв" : "Кушва", "Тарус" : "Таруса",
        "Ухт" : "Ухта", "Уф" : "Уфа", "Находк" : "Находка", "Балахн" : "Балахна", "Юрг" : "Юрга", "Раменско" : "Раменское",
        "Лобн" : "Лобня"}
    }

output_df.replace(replacements, inplace=True)
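
The stem-matching idea can be illustrated without pandas. This is a simplified sketch: the helper names to_stem and find_city, the three cities and the sample sentences are all invented, and the stems are mapped back to the full names through a dictionary built on the fly rather than a hand-written table.

```python
def to_stem(city):
    # Drop a final а, е or я so that declined forms still contain the stem
    return city[:-1] if city.endswith(("а", "е", "я")) else city

cities = ["Москва", "Тула", "Казань"]

# Map every stem back to the full city name
stems = {to_stem(c): c for c in cities}

def find_city(text):
    # Return the full name of the first city whose stem occurs in the text
    for stem, full_name in stems.items():
        if stem in text:
            return full_name
    return None

print(find_city("Митинг прошёл в Москве"))
print(find_city("Жители Тулы вышли на пикет"))
print(find_city("Акция в Казани"))
```

The last call finds nothing because "Казань" ends in a soft sign, which the rule does not trim, so the declined form "Казани" does not contain it. Short stems also match inside unrelated words, so the results of this kind of matching are worth checking by hand.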

The last step is to save the output to a CSV file.

In [None]:
output_df.to_csv("test.csv", index=False)