<a id = 'toc'></a>
# Table of Contents

- ### [Preprocessing-Light](#preprocessing_light)
- ### [Preprocessing-Pandas](#preprocessing_pandas)

<a id = 'preprocessing_light'></a>
# Preprocessing-Light

---
## Decoding
Converting a sequence of bytes into a sequence of characters. This usually involves three layers:

- **Unpacking** \
*.plain/.zip/.gz/...*
- **Encoding** \
*ASCII/utf-8/Windows-1251/...*
- **Format** \
*csv/xml/json/doc/...*
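
A minimal sketch of the three layers, using only the Python standard library. The payload is made up for illustration (a tiny UTF-8 encoded CSV that is then gzip-compressed), and the variable names are assumptions rather than part of the original notebook.

```python
import csv
import gzip
import io

# Hypothetical payload: a small CSV table, UTF-8 encoded, then gzip-compressed.
raw_text = "word,count\nnaïve,3\ncafé,5\n"
packed = gzip.compress(raw_text.encode("utf-8"))

# 1. Unpacking: compressed bytes -> raw bytes
raw_bytes = gzip.decompress(packed)

# 2. Encoding: raw bytes -> characters
text = raw_bytes.decode("utf-8")

# 3. Format: characters -> structured rows
rows = list(csv.DictReader(io.StringIO(text)))
print(rows)
```

Decoding with the wrong encoding (for example `Windows-1251` instead of `utf-8`) would either raise an error or silently produce garbled characters, which is why the encoding has to be known or detected before parsing the format.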

---

## Split into tokens
Splitting a sequence of characters into parts (tokens), possibly excluding some characters from consideration.  
Naive approach: split the string on whitespace and discard punctuation marks.

**Problems:**  
* example@example.com, 127.0.0.1
* C++, C#
* York University vs New York University
* Language dependency (“Lebensversicherungsgesellschaftsangestellter”, “l’amour”)
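
The first two problems can be reproduced with a small sketch of the naive approach. The function name `naive_tokenize` is an assumption made for this illustration.

```python
import string

def naive_tokenize(text):
    """Split on whitespace and delete every ASCII punctuation character."""
    table = str.maketrans("", "", string.punctuation)
    stripped = (part.translate(table) for part in text.split())
    # Drop parts that consisted of punctuation only
    return [tok for tok in stripped if tok]

print(naive_tokenize("Mail example@example.com from 127.0.0.1"))
# ['Mail', 'exampleexamplecom', 'from', '127001']

print(naive_tokenize("C++ and C# differ"))
# ['C', 'and', 'C', 'differ']
```

The e-mail address and the IP address collapse into meaningless strings, and `C++` and `C#` become the same token.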

Alternative: n-grams
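
A minimal sketch of both flavours of n-grams, character-level and word-level. The helper names `char_ngrams` and `word_ngrams` are assumptions made for this illustration.

```python
def char_ngrams(text, n=3):
    """All overlapping character substrings of length n."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def word_ngrams(tokens, n=2):
    """All overlapping runs of n consecutive tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(char_ngrams("l'amour", 3))
# ["l'a", "'am", 'amo', 'mou', 'our']

print(word_ngrams("New York University".split(), 2))
# [('New', 'York'), ('York', 'University')]
```

Character n-grams need no language-specific splitting rules, and word n-grams keep multi-word names such as "New York" together as a unit.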

---

In [1]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Leo\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [14]:
from nltk.tokenize import RegexpTokenizer

sequence = 'The quick brown fox jumps, and jumps over the lazy dog'

# Raw string, so that \w and \s reach the regex engine unchanged
tokenizer = RegexpTokenizer(r'\w+|[^\w\s]+')
for token in tokenizer.tokenize(sequence):
    print(token)

The
quick
brown
fox
jumps
,
and
jumps
over
the
lazy
dog


---
## Stop words
The most frequent words in a language, which carry little or no information about the content of a text.

**Problem**: some meaningful phrases consist entirely of stop words, e.g. “To be or not to be”.
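
The problem is easy to reproduce. The sketch below uses a deliberately tiny, hand-written stop list (an assumption for illustration; real lists such as NLTK's are much longer).

```python
# Hypothetical stop list, kept tiny on purpose
STOP_WORDS = {"to", "be", "or", "not", "the", "a", "of", "and"}

def remove_stop_words(tokens):
    """Keep only tokens whose lowercase form is not in the stop list."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(remove_stop_words("The fox and the dog".split()))
# ['fox', 'dog']

print(remove_stop_words("To be or not to be".split()))
# []
```

The second phrase disappears completely, so a search for it over stop-word-filtered text would have nothing to match.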

---

In [3]:
from nltk.corpus import stopwords

print(' '.join(stopwords.words('english')[1:20]))

me my myself we our ours ourselves you you're you've you'll you'd your yours yourself yourselves he him his


---
## Normalization
Bringing tokens to a single form in order to get rid of superficial differences in spelling

**Approaches:**
* formulate a set of rules by which the token is transformed \
New-Yorker → new-yorker → newyorker → newyork
* explicitly store connections between tokens (WordNet - Princeton) \
car → auto, Windows 6 → window
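
The second approach can be sketched as a plain lookup table. The table `CANONICAL` below is hand-written and hypothetical; a real system might derive such links from a resource like WordNet.

```python
# Hypothetical mapping from a surface form to its canonical token
CANONICAL = {
    "car": "auto",
    "automobile": "auto",
}

def normalize(token):
    """Lowercase the token, then replace it with its canonical form if one is stored."""
    token = token.lower()
    return CANONICAL.get(token, token)

print([normalize(t) for t in ["Car", "automobile", "Dog"]])
# ['auto', 'auto', 'dog']
```

Unlike rules, a table never produces a form it was not told about, but it has to be built and maintained explicitly.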

---

In [33]:
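import re

word = 'New-Yorker'

# Rule 1: lowercase
word_1 = word.lower()

# Rule 2: drop non-word characters such as the hyphen
word_2 = re.sub(r'\W', '', word_1)

# Rule 3: strip a trailing "er" suffix (anchored so only the end of the word is affected)
word_3 = re.sub(r'er$', '', word_2)

print(f'{word} → {word_1} → {word_2} → {word_3}')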

New-Yorker → new-yorker → newyorker → newyork


---
## Stemming and Lemmatization
**Stemming** strips the last few characters of a word by rule, which often yields forms with the wrong meaning or spelling.  
**Lemmatization** considers the context and converts the word to its meaningful base form, called the *lemma*.

**Example:**
* Stemming \
Caring → Car
* Lemmatization \
Caring → Care
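
The contrast can be sketched with two toy functions. Both are simplifications made up for illustration: `toy_stem` is not the Porter algorithm, and `toy_lemmatize` uses a hypothetical three-entry dictionary and ignores context, which a real lemmatizer would take into account.

```python
def toy_stem(word):
    """Strip the first matching suffix by rule, with no dictionary check."""
    word = word.lower()
    for suffix in ("ing", "ed", "s"):
        # Keep at least three characters of the word
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

# Hypothetical lemma dictionary: surface form -> base form
LEMMAS = {"caring": "care", "cars": "car", "better": "good"}

def toy_lemmatize(word):
    """Look the word up in the dictionary; fall back to the lowercase form."""
    word = word.lower()
    return LEMMAS.get(word, word)

print(toy_stem("Caring"), toy_lemmatize("Caring"))
# car care

print(toy_stem("better"), toy_lemmatize("better"))
# better good
```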

---

## Stemming

In [54]:
from nltk.stem.snowball import PorterStemmer
from nltk.stem.snowball import EnglishStemmer

p_stemmer = PorterStemmer()
print(f'[Porter Stemmer]: {p_stemmer.stem("New-Yorker")}')
print(f'[Porter Stemmer]: {p_stemmer.stem("Tokenization")}')

eng_stemmer = EnglishStemmer()
print(f'[English Stemmer]: {eng_stemmer.stem("Perfection")}')
print(f'[English Stemmer]: {eng_stemmer.stem("Difference")}')

[Porter Stemmer]: new-york
[Porter Stemmer]: token
[English Stemmer]: perfect
[English Stemmer]: differ


## Lemmatization

In [78]:
import pymorphy2

morph = pymorphy2.MorphAnalyzer()
print(f'[pymorphy2]: {morph.parse("New-Yorker")[0].normal_form}')
print(f'[pymorphy2]: {morph.parse("Tokenization")[0].normal_form}')
print(f'[pymorphy2]: {morph.parse("Perfection")[0].normal_form}')
print(f'[pymorphy2]: {morph.parse("Difference")[0].normal_form}')

[pymorphy2]: new-yorker
[pymorphy2]: tokenization
[pymorphy2]: perfection
[pymorphy2]: difference

Note: `pymorphy2` is a morphological analyzer for Russian, so for the English words above it only lowercases the input. An English lemmatizer (for example NLTK's `WordNetLemmatizer`) is needed to obtain real lemmas for such words.


---

## Heaps' Law (Herdan's Law)
An empirical regularity in linguistics that describes the number of unique words in a document (or set of documents) as a function of its length.


$M = kT^{\beta}$
- $M$ - vocabulary size (number of unique words)
- $T$ - total word count
- $k$, $\beta$ - empirical parameters, typically $30 \leq k \leq 100$, $\beta \approx 0.5$
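
A short numeric sketch of the formula. The defaults `k=50` and `beta=0.5` are assumptions picked from the typical ranges above, not values fitted to any corpus.

```python
def heaps_vocabulary_size(token_count, k=50.0, beta=0.5):
    """Predicted number of unique words M = k * T**beta."""
    return k * token_count ** beta

for T in (10_000, 1_000_000):
    print(T, round(heaps_vocabulary_size(T)))
# 10000 5000
# 1000000 50000
```

With $\beta \approx 0.5$, a text 100 times longer has only about 10 times as many unique words: the vocabulary keeps growing, but more and more slowly.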

---

[UP](#toc)

<a id = 'preprocessing_pandas'></a>
# Preprocessing-Pandas

#### Using methods

In [141]:
import pandas as pd

sequences_list = pd.Series(['Our mother washed - The Dishes', 'The countdown, Is over'], dtype = "string")
print(f'[BEFORE]: {sequences_list[0]}')

sequences_list = sequences_list.str.lower()
sequences_list = sequences_list.str.strip()
#sequences_list = sequences_list.str.split(' ', expand = True)
sequences_list = sequences_list.str.split(' ')
print(f'[AFTER]: {sequences_list[0]}')

[BEFORE]: Our mother washed - The Dishes
[AFTER]: ['our', 'mother', 'washed', '-', 'the', 'dishes']


#### Using functions

In [140]:
import string
import pymorphy2

morpher = pymorphy2.MorphAnalyzer()
sw = ['dishes']

def preprocess_txt(line):
    # Drop punctuation characters, then split on whitespace
    exclude = set(string.punctuation)
    spls = ''.join(i for i in line.strip() if i not in exclude).split()
    # Lowercase each token and reduce it to its normal form
    spls = [morpher.parse(i.lower())[0].normal_form for i in spls]
    # Remove stop words (split() never yields empty strings, so no extra check is needed)
    spls = [i for i in spls if i not in sw]
    return spls

sequences_list = pd.Series(['Our mother washed - The Dishes', 'The countdown, Is over'], dtype = "string")
print(f'[BEFORE]: {sequences_list[0]}')
# Pass the function directly instead of wrapping it in a lambda
sequences_list = sequences_list.apply(preprocess_txt)
print(f'[AFTER]: {sequences_list[0]}')

[BEFORE]: Our mother washed - The Dishes
[AFTER]: ['our', 'mother', 'washed', 'the']


[UP](#toc)

<a id = 'kmeans_inertia'></a>
<left>
<div style="color:white;
           display:fill;
           border: 0px;
           border-bottom: 2px solid #AAA;
           font-size:80%;
           letter-spacing:0.5px">
<h2 style="padding: 10px;
           color:#212121;">Inertia
</h2>
</div>    
</left>