# Text_Processing

In this notebook I process the `.txt` files produced by the previous step ([Text_Preprocessing](./Text_Preprocessing.ipynb)). At this point they contain only the main body of each thesis (from Introduction to Bibliography).

The notebook implements the following pipeline:

0. Deal with in-text citations
1. Tokenize
2. Delete non-alphabetic characters
3. Lemmatize
4. POS tag
5. Delete stop-words
6. Join frequent bi- and trigrams 

## Preparation

In [1]:
import re
import os
import nltk

import numpy as np

from collections import Counter
from razdel import tokenize

from pymorphy2 import MorphAnalyzer
morph = MorphAnalyzer()

In [2]:
def get_text(file, folder):
    with open(os.path.join(folder, file), 'r', encoding='utf-8') as f:
        text = f.read()
        
    return text

## Dealing with citations

In [3]:
# raw string, so that \d, \- and \] reach the regex engine without invalid-escape warnings
citation = re.compile(r"([([][А-ЯЁA-Z][А-Яа-яёЁA-Za-z'`-]+[А-Яа-яёЁA-Za-z'`&\-,. ]*,? \d{3,4}[)\]])")

def find_citations(text):
    matches = re.findall(citation, text)
    return len(matches), matches

def delete_citations(text):
    return re.sub(citation, ' ', text)
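
A quick check of the pattern on a made-up sentence (the sample text is mine, not taken from the corpus). It catches author-year citations in round or square brackets. A citation that starts with a lowercase word, such as `(see Smith, 2010)`, is left in place, because the pattern requires a capital letter right after the opening bracket.

```python
import re

# same pattern as in the notebook, written as a raw string
citation = re.compile(r"([([][А-ЯЁA-Z][А-Яа-яёЁA-Za-z'`-]+[А-Яа-яёЁA-Za-z'`&\-,. ]*,? \d{3,4}[)\]])")

def find_citations(text):
    matches = citation.findall(text)
    return len(matches), matches

def delete_citations(text):
    return citation.sub(' ', text)

# made-up sample sentence, for illustration only
sample = "As shown earlier (Smith, 2010) and [Иванов 1999], the effect holds (see Smith, 2010)."

n, found = find_citations(sample)
cleaned = delete_citations(sample)

print(n, found)
print(cleaned)
```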

## Text processing

Main processing: tokenization, lemmatisation, stop-word deletion. 
I also implemented POS tagging, as it is used for filtering ngrams, and a function that removes the tags for better readability.

In [9]:
# tokenization with options to lowercase and to delete tokens containing
# numbers, punctuation, or any non-alphabetic characters
def tokenizer(text_data, lower=False, delete=False):
    
    if lower:
        text_data = text_data.lower()
    
    if delete:
        # leave only alphabetic characters
        if delete == 'all':
            contains_extra = r'[^а-яА-ЯёЁa-zA-Z]'
        # delete only numerics
        elif delete == 'num':
            contains_extra = r'\d'
        # delete only punctuation
        elif delete == 'punct':
            contains_extra = r'[^а-яА-ЯёЁa-zA-Z\d]'
        # any other truthy value would leave contains_extra undefined
        else:
            raise ValueError(f"unknown delete mode: {delete!r}")
            
        tokens = [t.text for t in tokenize(text_data) if not re.search(contains_extra, t.text)]
        
    else:
        tokens = [t.text for t in tokenize(text_data)]
        
    # tokens of one or two characters are always dropped
    return [t for t in tokens if len(t) > 2]

# lemmatization with option to POS tag
def lemmatizer(tokens, pos_tag=False):
    lem_tokens = []
    for word in tokens:
        p = morph.parse(word)[0]
        lem = p.normal_form
        
        if pos_tag:
            lem = f'{lem}_{p.tag.POS}'
        
        lem_tokens.append(lem)
    return lem_tokens

# delete stop words (works for both tagged and clean token lists)
def delete_stops(tokens, stops, tagged=False):
    text = ' '.join(tokens)
    
    for stop in stops:
        # check that the match is a separate word (and not a substring of another word)
        if tagged:
            pattern = re.compile(r'\b'+re.escape(stop)+r'_\w{4}')
        else:
            pattern = re.compile(r'\b'+re.escape(stop)+r'\b')
        # delete all found matches   
        text = re.sub(pattern, '', text)
    return text.split()

# delete POS tags and glue everything into one string
def get_clean_text(tagged_tokens):
    clean_tokens = [re.sub(r'_\w{4}$', '', t) for t in tagged_tokens]
    return ' '.join(clean_tokens)
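
To make the stop-word and tag handling concrete without loading pymorphy2, here is a minimal sketch that runs `delete_stops` and `get_clean_text` on hand-tagged tokens. The tokens, their tags, and the short stop list are assumptions made up for illustration. In the notebook they come from `lemmatizer(..., pos_tag=True)` and the NLTK stop lists.

```python
import re

def delete_stops(tokens, stops, tagged=False):
    text = ' '.join(tokens)
    for stop in stops:
        if tagged:
            # a stop word followed by "_" and its 4-character POS tag
            pattern = re.compile(r'\b' + re.escape(stop) + r'_\w{4}')
        else:
            pattern = re.compile(r'\b' + re.escape(stop) + r'\b')
        text = pattern.sub('', text)
    return text.split()

def get_clean_text(tagged_tokens):
    # strip the trailing "_TAG" from every token and join with spaces
    return ' '.join(re.sub(r'_\w{4}$', '', t) for t in tagged_tokens)

# hand-tagged lemmas and a toy stop list (both made up)
tagged = ['этот_NPRO', 'историк_NOUN', 'и_CONJ', 'свой_ADJF', 'труд_NOUN']
toy_stops = ['этот', 'и', 'свой']

kept = delete_stops(tagged, toy_stops, tagged=True)
clean = get_clean_text(kept)

# with tagged=False nothing is removed: "_" is a word character,
# so there is no word boundary between a stop word and its tag
untouched = delete_stops(tagged, toy_stops, tagged=False)

print(kept)
print(clean)
print(untouched == tagged)
```

This is why `process_text` passes `tagged=n_grams`: the flag has to match whether the lemmas carry tags.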

Simple counter to get frequencies of tokens (for general data description, but also to use them later for working with ngrams).

In [10]:
def count_words(tokens):
    c = Counter(tokens).most_common()
    return list(c)

## Dealing with ngrams

I filter ngrams by the POS of their words, allowing the following combinations:

*bigrams*: Noun/Adjective/None + Noun/None

*trigrams*: Noun/Adjective/None + Noun/Adjective/None + Noun/None

In [11]:
def check_bigram(tags):
    
    cond1 = tags[0] in ['NOUN', 'ADJF', 'None']
    cond2 = tags[1] in ['NOUN', 'None']
    
    return cond1 and cond2
    
def check_trigram(tags):
    
    cond1 = tags[0] in ['NOUN', 'ADJF', 'None']
    cond2 = tags[1] in ['NOUN', 'ADJF', 'None']
    cond3 = tags[2] in ['NOUN', 'None']
    
    return cond1 and cond2 and cond3

# filter out results with non-positive scores <- re-think this if using chi square
# separate tags from words and send them for checking
def check_ngram(ngram):
    
    if ngram[1] <= 0:
        return False
    
    words = ngram[0]
    tags = [w.split('_')[-1] for w in words]
    
    if len(tags) == 2:
        return check_bigram(tags)
    elif len(tags) == 3:
        return check_trigram(tags)
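
For illustration, here is how the filter behaves on a few scored ngrams of the shape `((word_TAG, ...), score)`. The entries and scores are hand-written assumptions, and `allowed` is a compact restatement of the checks above written for this example, not a replacement for them.

```python
NON_FINAL = {'NOUN', 'ADJF', 'None'}  # tags allowed in every position but the last
FINAL = {'NOUN', 'None'}              # tags allowed in the last position

def allowed(ngram):
    words, score = ngram
    # only bigrams and trigrams with a positive score are considered
    if score <= 0 or len(words) not in (2, 3):
        return False
    tags = [w.split('_')[-1] for w in words]
    return all(t in NON_FINAL for t in tags[:-1]) and tags[-1] in FINAL

# made-up examples
candidates = [
    (('имущественный_ADJF', 'ценз_NOUN'), 11.0),               # adjective + noun
    (('широкий_ADJF', 'народный_ADJF', 'масса_NOUN'), 15.0),   # adj + adj + noun
    (('рассматривать_INFN', 'тенденция_NOUN'), 9.0),           # verb first
    (('историк_NOUN', 'труд_NOUN'), -0.5),                     # negative score
]

results = [allowed(c) for c in candidates]
print(results)
```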

Get bi- and trigrams, drop the ones with a frequency below the limit (the same limit for both), and filter the rest by POS checking. 

I also return the top N bi- and trigrams, so I can analyse them later and tune some parameters.

In [12]:
bigram_measures = nltk.collocations.BigramAssocMeasures()
trigram_measures = nltk.collocations.TrigramAssocMeasures()

def get_ngrams(tokens, freq_limit, metric='pmi'):
    
    # added measure choice for further research
    if metric == 'pmi':
        bi_measure = bigram_measures.pmi
        tri_measure = trigram_measures.pmi
    elif metric == 'chi':
        bi_measure = bigram_measures.chi_sq
        tri_measure = trigram_measures.chi_sq
    # any other value would leave the measures undefined
    else:
        raise ValueError(f"unknown metric: {metric!r}")
        
    # get trigrams
    trigramFinder = nltk.collocations.TrigramCollocationFinder.from_words(tokens)
    # filter by frequency
    trigramFinder.apply_freq_filter(freq_limit)
    # get scores by chosen metric
    tri_scores = trigramFinder.score_ngrams(tri_measure)
    # filter the results by POS
    tri_grams = [n for n in tri_scores if check_ngram(n)]

    # same for bigrams
    bigramFinder = nltk.collocations.BigramCollocationFinder.from_words(tokens)
    bigramFinder.apply_freq_filter(freq_limit)
    bi_scores = bigramFinder.score_ngrams(bi_measure)
    bi_grams = [n for n in bi_scores if check_ngram(n)]
    
    # get the top N of each list,
    # where N is at most 5, depending on how many ngrams were found
    top_grams = []
    for grams in [tri_grams, bi_grams]:
        l = len(grams)
        if l <= 4:
            top_grams.extend(grams[:l])
        else:
            top_grams.extend(grams[:5])
    
    # trigrams should be first so they get replaced first
    # in case some bigram is a substring of some trigram
    final_grams = tri_grams + bi_grams
    
    
    return final_grams, top_grams
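
For reference, a bigram's PMI can be written as `log2(n_12 * N / (n_1 * n_2))`, where `n_12` is the bigram count, `n_1` and `n_2` are the counts of its words, and `N` is the total number of tokens. As far as I can tell, this is what NLTK's `pmi` measure computes for bigrams. The sketch below is my own pure-Python restatement on a toy token list, not NLTK's code.

```python
import math
from collections import Counter

def bigram_pmi(tokens):
    """PMI of every adjacent pair: log2(n_12 * N / (n_1 * n_2))."""
    n_total = len(tokens)
    word_counts = Counter(tokens)
    pair_counts = Counter(zip(tokens, tokens[1:]))
    return {
        pair: math.log2(n_pair * n_total / (word_counts[pair[0]] * word_counts[pair[1]]))
        for pair, n_pair in pair_counts.items()
    }

# toy token list, for illustration only
toy = ['a', 'b', 'a', 'b', 'c']
scores = bigram_pmi(toy)

for pair, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 3))
```

A pair that co-occurs less often than its word frequencies would suggest gets a negative score, which is what the `ngram[1] <= 0` check in `check_ngram` filters out.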

Join ngrams together (with `_`) and return the final text without tags.

In [13]:
def replace_ngrams(text, ngram_list):
    
    for ngram in ngram_list:
        # delete tags and join all 2-3 words together with whitespace
        phrase = get_clean_text(ngram[0])
        # replace whitespace with underscore
        replacement = phrase.replace(' ', '_')
        # check that the match is a separate phrase (and not a substring of another word)
        pattern = re.compile(r'\b'+phrase+r'\b')
        text = re.sub(pattern, replacement, text)
        
    return text
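
The order of the list matters here, which is why `get_ngrams` puts trigrams before bigrams. A minimal sketch of the effect, with two hand-written ngrams and made-up scores:

```python
import re

def get_clean_text(tagged_tokens):
    return ' '.join(re.sub(r'_\w{4}$', '', t) for t in tagged_tokens)

def replace_ngrams(text, ngram_list):
    for words, _score in ngram_list:
        phrase = get_clean_text(words)
        pattern = r'\b' + re.escape(phrase) + r'\b'
        text = re.sub(pattern, phrase.replace(' ', '_'), text)
    return text

# hand-written ngrams with made-up scores
tri = (('широкий_ADJF', 'народный_ADJF', 'масса_NOUN'), 15.0)
bi = (('народный_ADJF', 'масса_NOUN'), 9.0)
text = 'широкий народный масса и народный масса'

tri_first = replace_ngrams(text, [tri, bi])
bi_first = replace_ngrams(text, [bi, tri])

print(tri_first)
print(bi_first)
```

With the bigram applied first, the trigram no longer matches, because its last two words have already been glued together.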

### Main pipeline

By default it only deletes tokens with non-alphabetic characters, but by setting `tok_del=False` you can get just raw tokenization+lemmatization (punctuation tokens would then be counted among the wordforms and lemmas). For my research purposes I enable every option (`del_citations`, `del_stops`, `n_grams`), to delete as much irrelevant information as possible.

**Note on minimal frequency for ngrams**: I dynamically set the minimal absolute frequency to the 80th percentile
of the lemma frequency distribution. In my experiments this value was usually greater than the mean and gave the most appropriate results.
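
As a small worked example of the threshold, the sketch below computes the 80th percentile with linear interpolation between neighbouring values, which is the default behaviour of `np.percentile`, and compares it to the mean. The frequency list is made up for illustration, and the helper name `percentile` is mine.

```python
import math

def percentile(values, q):
    """q-th percentile (0-100) with linear interpolation between neighbours."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100
    low = math.floor(pos)
    high = min(low + 1, len(ordered) - 1)
    return ordered[low] + (pos - low) * (ordered[high] - ordered[low])

# made-up lemma frequencies, most frequent first
freqs = [34, 21, 13, 8, 5, 3, 2, 1, 1, 1]

limit = percentile(freqs, 80)
mean = sum(freqs) / len(freqs)

print(round(limit, 2), mean)
```

`process_text` stores both numbers under `frequency` in the metadata, so they can be compared for each document.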

*For the future*: "Personally, I find it effective to multiply PMI and frequency to take into account both probability lift and frequency of occurrence." via [Medium](https://medium.com/@nicharuch/collocations-identifying-phrases-that-act-like-individual-words-in-nlp-f58a93a2f84a)
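
The idea in that quote could be sketched as a re-ranking step on top of the PMI scores. This is not part of the current pipeline: `rerank_by_pmi_times_freq` is a hypothetical helper, and the scores and counts below are made-up numbers. With NLTK, the raw counts could presumably be read from the finder's `ngram_fd`.

```python
def rerank_by_pmi_times_freq(scored, counts):
    """Sort (ngram, pmi) pairs by pmi * raw frequency, highest first."""
    weighted = [(ngram, pmi * counts[ngram]) for ngram, pmi in scored]
    return sorted(weighted, key=lambda item: item[1], reverse=True)

# made-up PMI scores and raw counts
scored = [(('rare', 'pair'), 12.0), (('common', 'pair'), 8.0)]
counts = {('rare', 'pair'): 5, ('common', 'pair'): 40}

ranked = rerank_by_pmi_times_freq(scored, counts)
print(ranked)
```

Here the frequent pair overtakes the rare one even though its PMI is lower, which is the trade-off the quote describes.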

In [14]:
stops = nltk.corpus.stopwords.words("russian")
stops.extend(nltk.corpus.stopwords.words("english"))
stops.extend(nltk.corpus.stopwords.words("german"))
stops.extend(['рисунок', 'таблица', 'автор', 'статья', 'c', 
              'ее', 'этот', 'работа', 'это', 'который', 'свой',
             ])

def process_text(text, del_citations=False, tok_del='all', del_stops=False, n_grams=False):
    
    text = text.replace('ё', 'е')
    
    cits = None
    if del_citations:
        cits = find_citations(text)
        text = delete_citations(text)
        
    tok = tokenizer(text, delete=tok_del, lower=True)
    lem = lemmatizer(tok, pos_tag=n_grams)
    
    if del_stops:
        lem = delete_stops(lem, stops, tagged=n_grams)
     
    wordsforms = count_words(tok)
    lems = count_words(lem)  
    
    freq_stats = None
    top_ngrams = []
    
    if n_grams:
    
        
        freq = [l[1] for l in lems]
        freq_limit = np.percentile(freq, 80)
        freq_stats = {'80_percentile': freq_limit,
                      'mean': np.mean(freq)}
        
        final_ngrams, top_ngrams = get_ngrams(lem, freq_limit, metric='pmi')
        
        final_text = replace_ngrams(get_clean_text(lem), final_ngrams)
        lems = count_words(final_text.split())
    
    else:
        final_text = get_clean_text(lem)
        
    metadata = {'n_wordforms': len(wordsforms), 
                'n_lemmas': len(lems), 
                'frequency': freq_stats,
                'top5_lem': lems[:5],
                'top_ngrams': top_ngrams if len(top_ngrams) > 0 else None,
                'citations': cits
               }
        
    return final_text, metadata

Custom function for printing out information about the processing of a document in a more readable format.

In [15]:
from tabulate import tabulate

In [18]:
def print_meta(meta):
    
    print(f"number of wordforms: {meta['n_wordforms']}")
    print(f"number of lemmas: {meta['n_lemmas']}")
          
    freq = meta['frequency']
    if freq is not None:
        percent = freq['80_percentile']
        mean = freq['mean']
        print("frequency:")
        print(f"\t80 percentile = {percent} <- is used as freq treshold for ngrams")
        print(f"\tmean = {mean}")
          
    lems = meta['top5_lem']
    print('top 5 lemmas:')
    print(tabulate([['\t', l[0], l[1]] for l in lems]))
          
    top_n = meta['top_ngrams']
    if top_n is not None:
        print(f'top tri- and bigrams:')
        
        ngrams = []
        for n in  top_n:
            words = get_clean_text(n[0])
            metric = n[1]
            ngrams.append(['\t', words, metric])
          
        print(tabulate(ngrams))

    cit = meta['citations']
    if cit is not None:
          cits = sorted(list(set(cit[1])))
          cits = '\n\t'.join(cits)
          print(f'number of citations: {cit[0]}\n\n\t{cits}')

# Bulk processing

In [19]:
import os
import json
from tqdm.auto import tqdm

In [20]:
raw_folder = './cut_texts'
processed_folder = './clean_texts'

In [21]:
files = [f for f in os.listdir(path=raw_folder)]

texts = []

for file in files:
    texts.append(get_text(file, raw_folder))
    
assert len(texts) == len(files)

In [22]:
with open('processing_meta.json', 'w', encoding='utf-8') as metafile:
    metafile.write('[\n')

In [23]:
first_entry = True

for i in tqdm(range(len(files))):
    text = texts[i]
    file = files[i]
    try:
        processed_text, meta = process_text(text,
                                            del_citations=True,
                                            tok_del='all',
                                            del_stops=True,
                                            n_grams=True)

        with open(os.path.join(processed_folder, file), 'w', encoding='utf-8') as out:
            out.write(processed_text)
        
        with open('processing_meta.json', 'a', encoding='utf-8') as metafile:
            # separate entries with commas so the file is a valid JSON array
            if not first_entry:
                metafile.write(',\n')
            json.dump({file: meta}, metafile, ensure_ascii=False, indent=4)
            first_entry = False
            
    except Exception as e:
        print(file)
        print(e)
        print('-----------')
        
with open('processing_meta.json', 'a', encoding='utf-8') as metafile:
    metafile.write('\n]')
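
Appending one `json.dump` call per file needs explicit separators between the entries for the result to parse as a single JSON array. A simpler alternative, sketched here under the assumption that all metadata fits in memory, is to collect the entries and dump them in one call. `write_meta` and the two metadata dicts are made up for illustration.

```python
import json
import os
import tempfile

def write_meta(all_meta, path):
    """Dump {filename: metadata} pairs as one JSON array in a single call."""
    entries = [{name: meta} for name, meta in all_meta.items()]
    with open(path, 'w', encoding='utf-8') as f:
        json.dump(entries, f, ensure_ascii=False, indent=4)

# made-up metadata for two files
all_meta = {
    'a.txt': {'n_wordforms': 10, 'n_lemmas': 7, 'citations': None},
    'b.txt': {'n_wordforms': 12, 'n_lemmas': 9, 'citations': None},
}

path = os.path.join(tempfile.mkdtemp(), 'processing_meta.json')
write_meta(all_meta, path)

# reading the file back confirms that it parses as valid JSON
with open(path, encoding='utf-8') as f:
    loaded = json.load(f)

print(len(loaded), loaded[0])
```

The trade-off is that nothing is written until every file has been processed, so an interrupted run leaves no partial metadata behind.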





# Testing

In [24]:
lengths = [len(t) for t in texts]
# np.argmax picks the longest text, not the shortest
longest = np.argmax(lengths)

In [25]:
m = longest
test = texts[m]
print(files[m], lengths[m])

366645257.txt 369008


In [85]:
%%timeit
_, _ = process_text(test, del_citations=False, tok_del='all', del_stops=False, n_grams=False)

15.7 s ± 1.3 s per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [86]:
%%timeit
_, _ = process_text(test, del_citations=True, tok_del='all', del_stops=True, n_grams=True)

17.7 s ± 703 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [26]:
_, meta = process_text(test, del_citations=False, tok_del='all', del_stops=False, n_grams=False)

In [27]:
print_meta(meta)

number of wordforms: 10290
number of lemmas: 4772
top 5 lemmas:
  ---------  ---
  он         393
  как        322
  свой       309
  испанский  254
  быть       248
  ---------  ---


In [28]:
res, meta = process_text(test, del_citations=True, tok_del='all', del_stops=True, n_grams=True)

In [29]:
print_meta(meta)

number of wordforms: 10290
number of lemmas: 4921
frequency:
	80 percentile = 5.0 <- is used as freq treshold for ngrams
	mean = 5.810691703391455
top 5 lemmas:
  -------  ---
  также    228
  историк  179
  боливар  167
  год      155
  однако   124
  -------  ---
top tri- and bigrams:
  --------------------------  -------
  отмена подушный подать      21.3138
  венесуэла новый гранада     15.7566
  президент великий колумбия  15.6175
  широкий народный масса      15.2461
  роль народный масса         15.2461
  подушный подать             11.8585
  тупак амар                  11.7183
  приходский священник        11.5663
  имущественный ценз          11.5257
  девятнадцать столетие       11.4028
  --------------------------  -------
number of citations: 0

	


In [30]:
print(res)

появление создать детально рассматривать основной тенденция изменение подход советский современный российский проблемный поль хронологический рамка последний восемьдесят год первое интересовать исследование мирошевский екатерина франсиско_миранда вопрос международный связь сепаратист xviii век опубликовать журнал историк марксист ныне вопрос_история находить историографический обоснование ряд ретроспективный труд историк ссср наиболее детальный подробный небольшой объём страница уникальный хронологический охват монография альпер советский_историография страна_латинский_америка кратко охарактеризовать многие выйти время немногочисленный ещё отечественный история_борьба независимость_испанский_америка однако нисколько умалять значение данный труд хотеть заметить пятьдесят год прошедшее момент публикация прекращаться нарастание корпус историография различный вопрос латиноамериканистика попытка провести подобный исследование предприниматься сторона также осознать отличие ситуация конец наш

In [92]:
print(test)

Появление созданной нами  работы-обобщения, детально рассматривающей основные  тенденции и изменения в подходах советских и современных российских  историков-латиноамериканистов к этому проблемному полю в хронологических рамках последних восьмидесяти лет (первое, интересующее нас исследование, статья В.М. Мирошевского « Екатерина II и Франсиско  Миранда (к вопросу о международных связях испано-американских сепаратистов в XVIII веке)» было опубликовано в журнале «Историк - Марксист» (ныне - «Вопросы истории») в 1940 г.) находит своё историографическое обоснование в  ряде ретроспективных трудов  историков СССР.  Наиболее детальной и подробной из них была небольшая по своему объёму (80 стр.), но уникальная по хронологическому охвату монография  (1917-1966 гг.) М.С. Альперовича «Советская историография стран Латинской Америки» (1968 г.).  В ней кратко охарактеризованы многие из вышедших к тому времени немногочисленных ещё отечественных работ по истории Борьбы за независимость в Испанской А

In [32]:
total_words = []

for t in texts:
    tokens = [_.text for _ in list(tokenize(t))]
    total_words.append(tokens)
    
sum([len(t) for t in total_words])

9688788

*Anna Polyanskaya, 2021*