# Preparation

Installing the package for Russian text lemmatization.

In [1]:
%pip install --quiet pymorphy2

Note: you may need to restart the kernel to use updated packages.


Modules for working with the data, for preprocessing, and for showing the progress of long-running steps in a progress bar.

In [20]:
import nltk
import re
import pandas as pd

# Russian lemmatization
import pymorphy2
# progress bar
from tqdm.auto import tqdm
# stop words
from nltk.corpus import stopwords

In [21]:
nltk.download('stopwords')
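# note: this rebinds the name `stopwords` from the corpus reader to a set of words;
# a set makes the `word not in stopwords` checks fast
stopwords = set(stopwords.words('russian'))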

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\justa\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


This dataset is a corpus of texts parsed from archived websites. The label marks a binary class: the page belongs to a school site (1) or not (0). Broken HTML markup has been preserved in many cells.

In [10]:
df = pd.read_csv('school.csv')
df.head()

,Unnamed: 0,main_page,school
0,0,"\nздравствуйте\n,\nвы сейчас на главной страни...",1
1,1,\nхостинг от \nucoz\nуважаемые пользователи!\n...,1
2,2,,0
3,3,\n #js-show-iframe-wrapper{position:relative;d...,1
4,4,\n адрес школы\nадрес: \nадрес: ул. л...,1


In [11]:
# checking NaN cells
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3630 entries, 0 to 3629
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Unnamed: 0  3630 non-null   int64 
 1   main_page   3604 non-null   object
 2   school      3630 non-null   int64 
dtypes: int64(2), object(1)
memory usage: 85.2+ KB


In [12]:
df = df.dropna()

# Tasks

- I'll work with the original and the preprocessed texts side by side, to compare how different vectorization methods and ML algorithms classify them
- It would be interesting to cluster the non-school class and identify the sources of its texts
- To visualize some data, I can try to extract named entities using statistical approaches (word frequency) or features of the vectorizing models
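
As a rough illustration of the word-frequency idea in the last task, here is a minimal sketch that counts tokens separately for each label. The helper `word_counts_by_label` and the toy `rows` are hypothetical and only stand in for `(main_page, school)` pairs; they are not taken from the dataset.

```python
from collections import Counter, defaultdict

def word_counts_by_label(rows):
    """Count whitespace-separated tokens separately for each label."""
    counts = defaultdict(Counter)
    for text, label in rows:
        counts[label].update(text.split())
    return counts

# Toy rows standing in for (main_page, school) pairs; NOT real data.
rows = [
    ('школа учитель урок школа', 1),
    ('школа расписание урок', 1),
    ('хостинг домен хостинг', 0),
]

counts = word_counts_by_label(rows)
for label, counter in sorted(counts.items()):
    # the two most frequent tokens of each class
    print(label, counter.most_common(2))
```

Words that are frequent in one class and rare in the other would be natural candidates for a visualization.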

# Preprocessing

It's hard to strip all the HTML markup, so I'll delete Latin characters instead (the texts are in Cyrillic, so they are probably unnecessary). Text preprocessing algorithm:
- lowercase
- deleting punctuation and digits
- deleting Latin characters
- dropping stop words
- lemmatization
In [22]:
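# morphological analyzer for Russian lemmatization
morph = pymorphy2.MorphAnalyzer()

def normalized(text):
    """Lowercase, keep only Cyrillic words, drop stop words, lemmatize."""
    # 'ё' lies outside the а-я range, so it is listed explicitly
    no_punct_cap = re.sub(r'[^а-яё]+', ' ', text.lower()).split(' ')
    # parse() returns candidate analyses; [0] is the most probable one
    norm_sentence = [morph.parse(word)[0].normal_form
                     for word in no_punct_cap
                     if word not in stopwords]
    return ' '.join(norm_sentence)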

In [28]:
#example
normalized(df['main_page'][0])[:165]

' здравствуйте главный страница официальный сайт всош главный визитка лицензия приоритетный национальный проект образование управлять совет положение управлять совет '

This can be a little slow because of pymorphy2 lemmatization.

In [29]:
tqdm.pandas()

df_norm = df.copy()
df_norm['main_page'] = df_norm['main_page'].progress_apply(normalized)

  0%|          | 0/3604 [00:00<?, ?it/s]
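
One possible way to speed this step up, not used in this notebook, is to cache the lemma of each distinct word, since web pages repeat the same words many times. The sketch below uses `functools.lru_cache`. The `toy_lemmatize` function is a placeholder standing in for something like `lambda w: morph.parse(w)[0].normal_form` and does no real morphology.

```python
from functools import lru_cache

def make_cached_lemmatizer(lemmatize, maxsize=100_000):
    """Wrap a word -> lemma function so each distinct word is analysed once."""
    return lru_cache(maxsize=maxsize)(lemmatize)

calls = []  # records every real (uncached) call, for demonstration only

def toy_lemmatize(word):
    # Placeholder, NOT real morphology: it just strips two common endings.
    calls.append(word)
    return word.rstrip('ыа')

cached = make_cached_lemmatizer(toy_lemmatize)
lemmas = [cached(w) for w in 'школа школы школа сайт школа'.split()]

print(lemmas)      # five lemmas, one per input token
print(len(calls))  # only the distinct words reached toy_lemmatize
```

Whether the cache pays off depends on how often words repeat across the corpus, so it is worth timing on a sample first.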

In [30]:
df_norm.to_csv('school_normalized.csv', index=False)