## Data cleaning and preprocessing notebook.

Our data is raw and has not been processed in any way.

Let's filter out uninformative texts and do some preprocessing and preparation before we start training models.

In [1]:
%matplotlib inline
from matplotlib import pyplot as plt
import numpy as np
from numpy.random import choice

import nltk
from nltk.corpus import stopwords as nltk_stopwords

from sklearn.model_selection import train_test_split

from os.path import join
from glob import glob

import preprocessing_tools as pt
from tqdm import tqdm_notebook as tqdm

import pickle

from os import mkdir 
from shutil import move

ModuleNotFoundError: No module named 'preprocessing_tools'

In [8]:
#loading data
data_path = '../Data/raw_data'

template = join(data_path,'*.txt')
filenames = glob(template)
print(len(filenames))

4154


In [None]:
#Let's tokenize all texts and collect their lengths to see the length distribution.
texts_lens = []
for name in tqdm(filenames):
    with open(name, 'r') as f:
        text = f.read()
        tok_text = nltk.word_tokenize(text)
        tok_text = pt.normalize(tok_text, tokenized=True)
        texts_lens.append(len(tok_text))

In [None]:
#text length distribution (bin edges every 100 tokens, from 0 to 49900)
plt.hist(texts_lens, bins=list(range(0, 50000, 100)))
plt.show()

### A lot of very short (<100 words) texts.

Such short texts are quite uninformative: they only consume memory
while producing very sparse rows in the term-document matrix.

So let's put them aside.
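To make the sparsity argument concrete, here is a minimal sketch using only the standard library. The toy corpus and the helper names `term_doc_rows` and `row_density` are made up for illustration and are not part of this project's code.

```python
from collections import Counter


def term_doc_rows(docs):
    """Build a dense term-document count matrix as a list of rows (one per document)."""
    vocab = sorted({tok for doc in docs for tok in doc})
    rows = []
    for doc in docs:
        counts = Counter(doc)
        rows.append([counts[term] for term in vocab])
    return vocab, rows


def row_density(row):
    """Fraction of non-zero entries in one document row."""
    return sum(1 for count in row if count > 0) / len(row)


# Toy corpus (hypothetical): one longer text and one very short text.
long_doc = ("topic models need enough words per text to find "
            "stable word cooccurrence patterns").split()
short_doc = "nice post".split()

vocab, rows = term_doc_rows([long_doc, short_doc])
long_density = row_density(rows[0])
short_density = row_density(rows[1])
print(len(vocab), long_density, short_density)
```

The short text fills only a small fraction of its row, and on a real vocabulary of many thousands of terms that fraction gets even smaller.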

In [None]:
from os import makedirs
from os.path import basename

#collecting short texts (fewer than 100 tokens)
texts_lens = np.array(texts_lens)
small_texts_mask = texts_lens < 100
small_texts_names = np.array(filenames)[small_texts_mask]

#creating a separate directory for them
#(exist_ok=True keeps this cell rerunnable)
small_texts_dir = join(data_path, 'small_texts')
makedirs(small_texts_dir, exist_ok=True)

#moving them there
for name in small_texts_names:
    #basename() is portable, unlike splitting the path on '/'
    dst = join(small_texts_dir, basename(name))
    move(name, dst)

## Note:

In the beginning, I tried to write my own normalization/preprocessing/tokenization module.

As I read more documentation for different NLP modules, I found everything I needed to preprocess texts.

So writing your own preprocessing is really excessive, unless it has to be very specific.

Still, I kept my own preprocessing module as an example and as a legacy part of the text length filtering code.
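For the length filtering earlier in this notebook, the exact normalization matters little. A rough stand-in using only the standard library might look like the sketch below. Note that `simple_normalize` and its `min_len` parameter are hypothetical names: this does not reproduce `pt.normalize`, whose behavior is not shown in this notebook.

```python
import re

# Assumption: a token is a run of letters (Latin or Cyrillic);
# digits, underscores and punctuation act as separators.
_WORD_RE = re.compile(r"[^\W\d_]+")


def simple_normalize(text, min_len=2):
    """Lowercase the text and return alphabetic tokens with at least min_len characters."""
    tokens = _WORD_RE.findall(text.lower())
    return [tok for tok in tokens if len(tok) >= min_len]


tokens = simple_normalize("Привет! Check https://vk.com, 2 new posts.")
print(tokens)
```

Tokens such as `https`, `vk` and `com` survive this step, which is one reason they show up in the custom stopword list.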

### Keywords:

We have several stopword sets that sometimes overlap but are not identical.

Let's put them all together.

#### Note:
Some of these stopword sets were obtained after the first LDA models were trained, while tuning those models.

Specifically, they are stored in the file named 'final_extra_stopwords.pkl'.

In [10]:
#loading stopwords from all the sources we have

#loading text files with external stopwords
#merging via set.update()
stopwords_path = '../Data/stopwords/raw_stopwords'
stopwords_template = join(stopwords_path, '*.txt')
stopwords_files = glob(stopwords_template)
extra_stopwords = set()
for name in stopwords_files:
    with open(name, 'r') as f:
        words = f.readlines()
        words = [word.strip() for word in words]
        extra_stopwords.update(set(words))

#also adding the nltk stopwords
stopwords = set()
stopwords.update(set(nltk_stopwords.words('english')))
stopwords.update(set(nltk_stopwords.words('russian')))
stopwords.update(extra_stopwords)

#some heuristically added stopwords
custom_stopwords = set(['http', 'https', 'ru', 'com', 'vk',
                         'привет', 'здравствуйте', 'например', 'репост'])

#stopwords from final LDA tuning
#comment this piece out if you don't have them.
final_stopwords_path = join(stopwords_path, 'final_extra_stopwords.pkl')
with open(final_stopwords_path, 'rb') as f:
    stopwords_from_top = pickle.load(f)
stopwords.update(set(stopwords_from_top))

stopwords.update(custom_stopwords)
stopwords = list(stopwords)
print('Total number of stopwords:', len(stopwords))

#serializing the merged stopwords list
with open('../Data/stopwords/stopwords.pkl', 'wb') as f:
    pickle.dump(stopwords, f)

Total number of stopwords: 796
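
The merged stopwords are pickled as a list, so a later notebook would probably convert them back to a set before filtering tokens. The sketch below shows one way to do that. The helpers `load_stopwords` and `remove_stopwords` are hypothetical, and the demo writes a tiny stand-in file to a temporary directory rather than reading the real `stopwords.pkl`.

```python
import os
import pickle
import tempfile


def load_stopwords(path):
    """Load a pickled stopword collection and return it as a set for fast lookups."""
    with open(path, 'rb') as f:
        return set(pickle.load(f))


def remove_stopwords(tokens, stopwords):
    """Drop every token found in the stopword set (tokens are assumed to be lowercased)."""
    return [tok for tok in tokens if tok not in stopwords]


# Self-contained demo: a tiny stand-in list instead of the real stopwords.pkl.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'stopwords.pkl')
    with open(demo_path, 'wb') as f:
        pickle.dump(['http', 'vk', 'привет', 'the'], f)
    demo_stopwords = load_stopwords(demo_path)

filtered = remove_stopwords(['привет', 'the', 'topic', 'model', 'vk'], demo_stopwords)
print(filtered)
```

Membership tests on a set take constant time on average, while each lookup in a list of several hundred words is a linear scan.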


### Test/train split.
I'll also split the data into train and test sets in this notebook.

In [10]:
template = join(data_path,'*.txt')
filenames = glob(template)
train_names, test_names = train_test_split(filenames, test_size=0.1, random_state=666)
print(len(train_names), len(test_names))

3738 416


In [11]:
#Serializing splits
with open('../Data/train_names.pkl', 'wb') as f:
    pickle.dump(train_names, f)
    
with open('../Data/test_names.pkl', 'wb') as f:
    pickle.dump(test_names, f)    
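
Before relying on the serialized splits, it may be worth checking that they are disjoint and together cover every file. The sketch below is one way to do that. `check_split` is a hypothetical helper, and the file names are toy values rather than real paths from `../Data/raw_data`.

```python
def check_split(all_names, train_names, test_names):
    """Return True if train and test do not overlap and together cover all_names exactly."""
    train, test = set(train_names), set(test_names)
    return train.isdisjoint(test) and (train | test) == set(all_names)


# Toy file names (hypothetical) standing in for the globbed paths.
all_demo = ['a.txt', 'b.txt', 'c.txt', 'd.txt']
ok = check_split(all_demo, ['a.txt', 'b.txt', 'c.txt'], ['d.txt'])
leaky = check_split(all_demo, ['a.txt', 'b.txt', 'd.txt'], ['d.txt'])
print(ok, leaky)
```

In this notebook the equivalent call would be `check_split(filenames, train_names, test_names)`. Keep in mind that the pickled names are relative paths, so they resolve correctly only from the same working directory.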