# Text normalization

__Text normalization can include the following steps. Choose the ones that suit your task.__

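* Sentence segmentation – splitting text into separate sentences

* Word tokenization – splitting sentences into separate words

* Lemmatization – reducing a word to its dictionary form, the lemma (sing is the common lemma for sang, sings, sung)

* Stemming – stripping affixes by rule; the resulting stem need not be a real word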

* Lowercasing

* Removing punctuation and/or digits

* Removing stop words

* Expanding contractions
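
As a rough illustration of how several of these steps chain together, here is a minimal sketch that uses only the standard library. The `normalize` function and the tiny `STOP_WORDS` set are assumptions made for this example; the sections below use NLTK and other libraries for the individual steps.

```python
import re

# Hypothetical, deliberately tiny stop list (real lists are much longer).
STOP_WORDS = {"a", "an", "the", "is", "of", "and"}

def normalize(text, stop_words=STOP_WORDS):
    """Split into sentences, lowercase, keep letter-only tokens, drop stop words."""
    # Naive sentence segmentation: break after ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    normalized = []
    for sentence in sentences:
        # [^\W\d_]+ matches runs of letters, so punctuation and digits are dropped.
        tokens = re.findall(r"[^\W\d_]+", sentence.lower())
        normalized.append([t for t in tokens if t not in stop_words])
    return normalized

result = normalize("Backgammon is one of the oldest known board games. It has 2 players!")
print(result)
# [['backgammon', 'one', 'oldest', 'known', 'board', 'games'], ['it', 'has', 'players']]
```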




## Sentence tokenization

In [6]:
from nltk.tokenize import sent_tokenize

In [7]:
text = "this’s a sent tokenize test. This is sent two. is this sent three? Sent 4 in place! Sent 5 is coming right away."

In [8]:
sent_tokenize_list = sent_tokenize(text)
sent_tokenize_list

['this’s a sent tokenize test.',
 'This is sent two.',
 'is this sent three?',
 'Sent 4 in place!',
 'Sent 5 is coming right away.']
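
For intuition about what a sentence tokenizer has to handle, a naive splitter can be sketched with a regular expression. This is not how `sent_tokenize` works; `naive_sent_split` is a hypothetical helper that breaks on whitespace following `.`, `!` or `?`, and it mishandles abbreviations.

```python
import re

def naive_sent_split(text):
    """Split after ., ! or ? when followed by whitespace (illustration only)."""
    return re.split(r"(?<=[.!?])\s+", text.strip())

text = ("this’s a sent tokenize test. This is sent two. is this sent three? "
        "Sent 4 in place! Sent 5 is coming right away.")
sentences = naive_sent_split(text)
print(sentences)

# The rule is too simple for abbreviations: "Dr." is treated as a sentence end.
broken = naive_sent_split("Dr. Smith arrived. He sat down.")
print(broken)  # ['Dr.', 'Smith arrived.', 'He sat down.']
```

On the test string the naive rule gives the same five sentences as the output above, which is why a dedicated tokenizer only pays off on messier text.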

## Word tokenization

In [15]:
from nltk.tokenize import word_tokenize  # Treebank-style tokenizer, slower
from nltk.tokenize.regexp import regexp_tokenize  # regular expressions, faster
from nltk.tokenize import ToktokTokenizer  # works well for Russian

In [16]:
word_tokenize("this’s a test.")

['this', '’', 's', 'a', 'test', '.']

In [17]:
word_tokenize("Александр Эдуардович был не в духе сегодня, ему снились ночью кошмары.")

['Александр',
 'Эдуардович',
 'был',
 'не',
 'в',
 'духе',
 'сегодня',
 ',',
 'ему',
 'снились',
 'ночью',
 'кошмары',
 '.']

In [18]:
tokenizer_rus = ToktokTokenizer()
tokenizer_rus.tokenize("Александр Эдуардович был не в духе сегодня, ему снились ночью кошмары.")

['Александр',
 'Эдуардович',
 'был',
 'не',
 'в',
 'духе',
 'сегодня',
 ',',
 'ему',
 'снились',
 'ночью',
 'кошмары',
 '.']

In [19]:
import re

In [20]:
WORD = re.compile(r"\w+")  # raw string, so \w is not treated as a string escape

In [21]:
WORD.findall("Александр Эдуардович был не в духе сегодня, ему снились ночью кошмары.")

['Александр',
 'Эдуардович',
 'был',
 'не',
 'в',
 'духе',
 'сегодня',
 'ему',
 'снились',
 'ночью',
 'кошмары']
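
Note that the `\w+` pattern silently drops punctuation. If punctuation should survive as separate tokens, as in the `word_tokenize` and `ToktokTokenizer` output above, the pattern can be extended. The sketch below is one possible pattern, not the rule set either tokenizer uses; `TOKEN` and `simple_tokenize` are names chosen for this example.

```python
import re

# A word (optionally with inner apostrophes, e.g. this’s) or one punctuation mark.
TOKEN = re.compile(r"\w+(?:[’']\w+)*|[^\w\s]")

def simple_tokenize(text):
    """Return words and punctuation marks as separate tokens."""
    return TOKEN.findall(text)

english = simple_tokenize("this’s a test.")
russian = simple_tokenize(
    "Александр Эдуардович был не в духе сегодня, ему снились ночью кошмары."
)
print(english)  # ['this’s', 'a', 'test', '.']
print(russian)
```

Unlike `word_tokenize`, this pattern keeps `this’s` in one piece, which matters if contractions are expanded later.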

## Lemmatization & Stemming

In [None]:
#https://interviewbubble.com/porterstemmer-vs-lancasterstemmer-vs-snowballstemmer/

In [18]:
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.corpus import wordnet
from nltk.stem.snowball import RussianStemmer
from nltk.stem.snowball import SnowballStemmer
from pymystem3 import Mystem
import pymorphy2

In [23]:
def compare_stemmer_and_lemmatizer(stemmer, lemmatizer, word, pos):
    """
    Print the results of stemming and lemmatization for the given stemmer, lemmatizer, word and pos (part of speech).
    """
    print("Stemmer:", stemmer.stem(word))
    print("Lemmatizer:", lemmatizer.lemmatize(word, pos))
    print()

lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()
compare_stemmer_and_lemmatizer(stemmer, lemmatizer, word="seen", pos=wordnet.VERB)
compare_stemmer_and_lemmatizer(stemmer, lemmatizer, word="drove", pos=wordnet.VERB)

Stemmer: seen
Lemmatizer: see

Stemmer: drove
Lemmatizer: drive



In [24]:
PorterStemmer().stem("completeness")

'complet'
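
The truncated stem above shows that a stemmer applies suffix rules rather than looking words up. A toy version of the idea is sketched below; it is far simpler than the Porter algorithm, and `SUFFIXES`, `toy_stem` and `min_stem_len` are assumptions made for this example.

```python
# Hypothetical, tiny suffix list; longer suffixes are tried first.
SUFFIXES = ("ingly", "ness", "ing", "ed", "es", "s")

def toy_stem(word, min_stem_len=3):
    """Strip the first matching suffix if enough of the word remains."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

for w in ["completeness", "studies", "studying", "sings", "sang"]:
    print(w, "->", toy_stem(w))
# completeness -> complete
# studies -> studi
# studying -> study
# sings -> sing
# sang -> sang
```

Rule-based stripping cannot relate `sang` to `sing`; that takes a dictionary, which is what a lemmatizer provides.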

In [25]:
PorterStemmer().stem("поднимающийся")

'поднимающийся'

In [26]:
# For Russian, use nltk's SnowballStemmer or any other stemmer that supports Russian
SnowballStemmer(language='russian').stem("поднимающийся")

'поднима'

In [27]:
wordnet_lemmatizer = WordNetLemmatizer()
text = "studies studying cries cry"
tokenization = word_tokenize(text)
for w in tokenization:
    print("Lemma for {} is {}".format(w, wordnet_lemmatizer.lemmatize(w))) 

Lemma for studies is study
Lemma for studying is studying
Lemma for cries is cry
Lemma for cry is cry


In [12]:
from nltk.corpus import wordnet as wn
from nltk.stem.wordnet import WordNetLemmatizer
from nltk import word_tokenize, pos_tag
from collections import defaultdict
# Map the first letter of a POS tag to a WordNet part of speech; default to noun
tag_map = defaultdict(lambda: wn.NOUN)
tag_map['J'] = wn.ADJ
tag_map['V'] = wn.VERB
tag_map['R'] = wn.ADV

text = "studies studying cries cry"
tokens = word_tokenize(text)
lemma_function = WordNetLemmatizer()
for token, tag in pos_tag(tokens):
    lemma = lemma_function.lemmatize(token, tag_map[tag[0]])
    print(token, "=>", lemma)

studies => study
studying => study
cries => cry
cry => cry
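
The `tag_map` above works because the Penn Treebank tags that `pos_tag` returns encode the coarse part of speech in their first letter. The sketch below spells out the same mapping without NLTK, using the single-letter values that the `wordnet` constants stand for (`'n'`, `'v'`, `'a'`, `'r'`); `penn_to_wordnet` is a hypothetical helper name.

```python
# First letter of a Penn Treebank tag -> WordNet POS letter.
FIRST_LETTER_TO_WORDNET = {"J": "a", "V": "v", "R": "r"}

def penn_to_wordnet(tag):
    """Return the WordNet POS letter for a Penn Treebank tag, defaulting to noun."""
    return FIRST_LETTER_TO_WORDNET.get(tag[:1], "n")

for tag in ["NNS", "VBG", "JJ", "RB", "CD"]:
    print(tag, "->", penn_to_wordnet(tag))
```

Without a POS hint the lemmatizer assumes a noun, which is why `studying` stayed unchanged in the earlier cell and became `study` once it was tagged as a verb.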


#### Pymorphy2

pymorphy2 is written in Python (it runs on 2.7 and 3.3+). It can:

- reduce a word to its normal form (for example, “люди -> человек” or “гулял -> гулять”);
- inflect a word into a required form, for example make it plural or change its case;
- return grammatical information about a word (number, gender, case, part of speech, and so on).

It relies on the OpenCorpora dictionary and builds hypotheses for unknown words. The library is fairly fast: at present it processes from several thousand words per second to more than 100 thousand words per second (depending on the operation, the interpreter, and the installed packages). The letter ё is fully supported.

In [19]:
morph = pymorphy2.MorphAnalyzer() 

In [21]:
morph.parse('деньги')[0].normal_form

'деньга'

In [22]:
p = morph.parse('стали')
print(p[0].normal_form)
p

стать


[Parse(word='стали', tag=OpencorporaTag('VERB,perf,intr plur,past,indc'), normal_form='стать', score=0.975342, methods_stack=((DictionaryAnalyzer(), 'стали', 945, 4),)),
 Parse(word='стали', tag=OpencorporaTag('NOUN,inan,femn sing,gent'), normal_form='сталь', score=0.010958, methods_stack=((DictionaryAnalyzer(), 'стали', 13, 1),)),
 Parse(word='стали', tag=OpencorporaTag('NOUN,inan,femn plur,nomn'), normal_form='сталь', score=0.005479, methods_stack=((DictionaryAnalyzer(), 'стали', 13, 6),)),
 Parse(word='стали', tag=OpencorporaTag('NOUN,inan,femn sing,datv'), normal_form='сталь', score=0.002739, methods_stack=((DictionaryAnalyzer(), 'стали', 13, 2),)),
 Parse(word='стали', tag=OpencorporaTag('NOUN,inan,femn sing,loct'), normal_form='сталь', score=0.002739, methods_stack=((DictionaryAnalyzer(), 'стали', 13, 5),)),
 Parse(word='стали', tag=OpencorporaTag('NOUN,inan,femn plur,accs'), normal_form='сталь', score=0.002739, methods_stack=((DictionaryAnalyzer(), 'стали', 13, 9),))]
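
In the output above the candidates are ordered by descending score, so `p[0]` is only the most likely reading out of context. When the part of speech is known from elsewhere, the candidate list can be filtered. The sketch below uses a simplified stand-in for the `Parse` objects (a namedtuple with a `pos` field, which is an assumption of this example), filled with values copied from the output above.

```python
from collections import namedtuple

# Simplified stand-in for pymorphy2 Parse objects (illustration only).
Reading = namedtuple("Reading", "word pos normal_form score")

readings = [
    Reading("стали", "VERB", "стать", 0.975342),
    Reading("стали", "NOUN", "сталь", 0.010958),
    Reading("стали", "NOUN", "сталь", 0.005479),
]

def best_lemma(readings, pos=None):
    """Return the lemma of the highest-scoring reading, optionally for one POS."""
    candidates = [r for r in readings if pos is None or r.pos == pos]
    if not candidates:
        return None
    return max(candidates, key=lambda r: r.score).normal_form

print(best_lemma(readings))              # стать
print(best_lemma(readings, pos="NOUN"))  # сталь
```

With real pymorphy2 results, the part of speech of a candidate is available as `p.tag.POS`.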

In [31]:
p = morph.parse('ебали')  # unfortunately, pymorphy2 does not cope well with obscene vocabulary
print(p[0].normal_form)
p

ебали


[Parse(word='ебали', tag=OpencorporaTag('NOUN,anim,masc,Fixd,Name sing,nomn'), normal_form='ебали', score=0.06686046511627906, methods_stack=((<DictionaryAnalyzer>, 'али', 62, 0), (<UnknownPrefixAnalyzer>, 'еб'))),
 Parse(word='ебали', tag=OpencorporaTag('NOUN,anim,masc,Fixd,Name sing,gent'), normal_form='ебали', score=0.06686046511627906, methods_stack=((<DictionaryAnalyzer>, 'али', 62, 1), (<UnknownPrefixAnalyzer>, 'еб'))),
 Parse(word='ебали', tag=OpencorporaTag('NOUN,anim,masc,Fixd,Name sing,datv'), normal_form='ебали', score=0.06686046511627906, methods_stack=((<DictionaryAnalyzer>, 'али', 62, 2), (<UnknownPrefixAnalyzer>, 'еб'))),
 Parse(word='ебали', tag=OpencorporaTag('NOUN,anim,masc,Fixd,Name sing,accs'), normal_form='ебали', score=0.06686046511627906, methods_stack=((<DictionaryAnalyzer>, 'али', 62, 3), (<UnknownPrefixAnalyzer>, 'еб'))),
 Parse(word='ебали', tag=OpencorporaTag('NOUN,anim,masc,Fixd,Name sing,ablt'), normal_form='ебали', score=0.06686046511627906, methods_stack

#### Pymystem3

The pymystem3 module is a wrapper around Yandex Mystem 3.0 (released in June 2014), a morphological analyzer for Russian. A morphological analyzer lemmatizes text and derives a set of morphological attributes for each token. For details of the algorithm, see I. Segalovich, «A fast morphological algorithm with unknown word guessing induced by a dictionary for a web search engine», MLMTA-2003, Las Vegas, Nevada, USA.

In [32]:
mstem = Mystem()

In [33]:
mstem.analyze('деньги')

[{'text': 'деньги',
  'analysis': [{'lex': 'деньги', 'gr': 'S,мн,неод=(вин|им)'}]},
 {'text': '\n'}]

In [34]:
mstem.analyze('ебали')

[{'text': 'ебали',
  'analysis': [{'lex': 'ебать', 'gr': 'V,обсц,несов=прош,мн,изъяв'}]},
 {'text': '\n'}]

In [35]:
mstem.analyze('стали')

[{'text': 'стали',
  'analysis': [{'lex': 'становиться', 'gr': 'V,нп=прош,мн,изъяв,сов'}]},
 {'text': '\n'}]

In [36]:
mstem.analyze('мы не стали этого делать')[4]

{'text': 'стали',
 'analysis': [{'lex': 'становиться', 'gr': 'V,нп=прош,мн,изъяв,сов'}]}

In [37]:
mstem.analyze('предмет сделан из стали')[6]

{'text': 'стали',
 'analysis': [{'lex': 'сталь',
   'gr': 'S,жен,неод=(пр,ед|вин,мн|дат,ед|род,ед|им,мн)'}]}
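
The `analyze` result is a list of plain dictionaries, so the lemma and part of speech can be pulled out with ordinary Python. The sketch below works on a result copied from the output above; `lemma_and_pos` is a hypothetical helper, and taking the part of speech as the text before the first `,` or `=` in `gr` is an assumption based on the outputs shown here.

```python
import re

def lemma_and_pos(item):
    """Return (lemma, POS) for an analyzed token, or None if it has no analysis."""
    analysis = item.get("analysis")
    if not analysis:  # whitespace, punctuation, or a token without an analysis
        return None
    first = analysis[0]
    pos = re.split(r"[,=]", first["gr"], maxsplit=1)[0]
    return first["lex"], pos

result = [
    {"text": "стали",
     "analysis": [{"lex": "сталь",
                   "gr": "S,жен,неод=(пр,ед|вин,мн|дат,ед|род,ед|им,мн)"}]},
    {"text": "\n"},
]
pairs = [lemma_and_pos(item) for item in result]
print(pairs)  # [('сталь', 'S'), None]
```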

In [38]:
mstem.lemmatize('мы не стали этого делать')

['мы', ' ', 'не', ' ', 'становиться', ' ', 'это', ' ', 'делать', '\n']
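
The list returned by `lemmatize` keeps the whitespace between words and the trailing newline. A small sketch of dropping them, applied to the output copied from above:

```python
lemmas = ['мы', ' ', 'не', ' ', 'становиться', ' ', 'это', ' ', 'делать', '\n']

# Keep only items that contain something other than whitespace.
words_only = [lemma for lemma in lemmas if lemma.strip()]
print(words_only)  # ['мы', 'не', 'становиться', 'это', 'делать']
```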

## Lowercasing

In [39]:
"Александр Эдуардович был не в духе сегодня, ему снились ночью кошмары.".lower()

'александр эдуардович был не в духе сегодня, ему снились ночью кошмары.'
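
When the goal of lowercasing is case-insensitive comparison rather than display, `str.casefold()` is a stricter alternative to `str.lower()`. The German word below is an example chosen for this sketch; for the Russian sentence above both methods give the same result.

```python
word = "Straße"

print(word.lower())     # straße
print(word.casefold())  # strasse

# casefold makes the two spellings compare equal, lower does not
print("STRASSE".lower() == word.lower())        # False
print("STRASSE".casefold() == word.casefold())  # True
```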

## Removing punctuation and/or digits

In [40]:
import re
import string

In [41]:
sentence = "The development of snowboarding was inspired by 55 skateboarding, sledding, surfing and skiing."
pattern = r"\W|\d"  # any non-word character or any digit
print(re.sub(pattern, " ", sentence))

The development of snowboarding was inspired by    skateboarding  sledding  surfing and skiing 


In [42]:
words = word_tokenize(sentence)
without_punctuation = " ".join([word for word in words if word.isalpha()])
print(without_punctuation)

The development of snowboarding was inspired by skateboarding sledding surfing and skiing


In [43]:
tokens = " ".join([word for word in words if word not in string.punctuation])  # drops punctuation, keeps digits
tokens

'The development of snowboarding was inspired by 55 skateboarding sledding surfing and skiing'
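
A variant that needs neither a regular expression nor a tokenizer is `str.translate` with a deletion table. This is a sketch of an alternative, not a replacement for the cells above; note that `string.punctuation` covers ASCII punctuation only, so characters such as the typographic apostrophe `’` are left in place.

```python
import string

sentence = ("The development of snowboarding was inspired by 55 skateboarding, "
            "sledding, surfing and skiing.")

# Third argument of str.maketrans lists the characters to delete.
table = str.maketrans("", "", string.punctuation + string.digits)

# translate() removes the characters; split/join collapses the leftover double spaces.
cleaned = " ".join(sentence.translate(table).split())
print(cleaned)
# The development of snowboarding was inspired by skateboarding sledding surfing and skiing
```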

## Stopwords

In [44]:
from nltk.corpus import stopwords
print(stopwords.words("english"))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [45]:
print(stopwords.words("russian"))

['и', 'в', 'во', 'не', 'что', 'он', 'на', 'я', 'с', 'со', 'как', 'а', 'то', 'все', 'она', 'так', 'его', 'но', 'да', 'ты', 'к', 'у', 'же', 'вы', 'за', 'бы', 'по', 'только', 'ее', 'мне', 'было', 'вот', 'от', 'меня', 'еще', 'нет', 'о', 'из', 'ему', 'теперь', 'когда', 'даже', 'ну', 'вдруг', 'ли', 'если', 'уже', 'или', 'ни', 'быть', 'был', 'него', 'до', 'вас', 'нибудь', 'опять', 'уж', 'вам', 'ведь', 'там', 'потом', 'себя', 'ничего', 'ей', 'может', 'они', 'тут', 'где', 'есть', 'надо', 'ней', 'для', 'мы', 'тебя', 'их', 'чем', 'была', 'сам', 'чтоб', 'без', 'будто', 'чего', 'раз', 'тоже', 'себе', 'под', 'будет', 'ж', 'тогда', 'кто', 'этот', 'того', 'потому', 'этого', 'какой', 'совсем', 'ним', 'здесь', 'этом', 'один', 'почти', 'мой', 'тем', 'чтобы', 'нее', 'сейчас', 'были', 'куда', 'зачем', 'всех', 'никогда', 'можно', 'при', 'наконец', 'два', 'об', 'другой', 'хоть', 'после', 'над', 'больше', 'тот', 'через', 'эти', 'нас', 'про', 'всего', 'них', 'какая', 'много', 'разве', 'три', 'эту', 'моя', 'впр

In [46]:
stop_words = set(stopwords.words("english"))
sentence = "Backgammon is one of the oldest known board games."

words = word_tokenize(sentence)
without_stop_words = [word for word in words if word not in stop_words]
print(without_stop_words)

['Backgammon', 'one', 'oldest', 'known', 'board', 'games', '.']
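
Two details of the result above are worth noting: the final `.` survives, and because the NLTK list is lowercase, a capitalized stop word such as `The` at the start of a sentence would survive too. The sketch below handles both; the token list is written by hand to keep the example self-contained, and `STOP_WORDS` is a small subset of the English list printed earlier.

```python
# Small subset of the English stop word list shown above.
STOP_WORDS = {"a", "an", "the", "and", "is", "of"}

# Hand-written tokens standing in for word_tokenize output.
tokens = ["The", "oldest", "game", "is", "one", "of", "the", "classics", "."]

# Compare in lowercase and keep alphabetic tokens only.
kept = [t for t in tokens if t.isalpha() and t.lower() not in STOP_WORDS]
print(kept)  # ['oldest', 'game', 'one', 'classics']
```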


## Expanding contractions

In [5]:
contraction_mapping = {"ain't": "is not", "aren't": "are not","can't": "cannot", "'cause": "because", "could've": "could have", "couldn't": "could not",

                           "didn't": "did not", "doesn't": "does not", "don't": "do not", "hadn't": "had not", "hasn't": "has not", "haven't": "have not",

                           "he'd": "he would","he'll": "he will", "he's": "he is", "how'd": "how did", "how'd'y": "how do you", "how'll": "how will", "how's": "how is",

                           "I'd": "I would", "I'd've": "I would have", "I'll": "I will", "I'll've": "I will have","I'm": "I am", "I've": "I have", "i'd": "i would",

                           "i'd've": "i would have", "i'll": "i will",  "i'll've": "i will have","i'm": "i am", "i've": "i have", "isn't": "is not", "it'd": "it would",

                           "it'd've": "it would have", "it'll": "it will", "it'll've": "it will have","it's": "it is", "let's": "let us", "ma'am": "madam",

                           "mayn't": "may not", "might've": "might have","mightn't": "might not","mightn't've": "might not have", "must've": "must have",

                           "mustn't": "must not", "mustn't've": "must not have", "needn't": "need not", "needn't've": "need not have","o'clock": "of the clock",

                           "oughtn't": "ought not", "oughtn't've": "ought not have", "shan't": "shall not", "sha'n't": "shall not", "shan't've": "shall not have",

                           "she'd": "she would", "she'd've": "she would have", "she'll": "she will", "she'll've": "she will have", "she's": "she is",

                           "should've": "should have", "shouldn't": "should not", "shouldn't've": "should not have", "so've": "so have","so's": "so as",

                           "this's": "this is","that'd": "that would", "that'd've": "that would have", "that's": "that is", "there'd": "there would",

                           "there'd've": "there would have", "there's": "there is", "here's": "here is","they'd": "they would", "they'd've": "they would have",

                           "they'll": "they will", "they'll've": "they will have", "they're": "they are", "they've": "they have", "to've": "to have",

                           "wasn't": "was not", "we'd": "we would", "we'd've": "we would have", "we'll": "we will", "we'll've": "we will have", "we're": "we are",

                           "we've": "we have", "weren't": "were not", "what'll": "what will", "what'll've": "what will have", "what're": "what are",

                           "what's": "what is", "what've": "what have", "when's": "when is", "when've": "when have", "where'd": "where did", "where's": "where is",

                           "where've": "where have", "who'll": "who will", "who'll've": "who will have", "who's": "who is", "who've": "who have",

                           "why's": "why is", "why've": "why have", "will've": "will have", "won't": "will not", "won't've": "will not have",

                           "would've": "would have", "wouldn't": "would not", "wouldn't've": "would not have", "y'all": "you all",

                           "y'all'd": "you all would","y'all'd've": "you all would have","y'all're": "you all are","y'all've": "you all have",

                           "you'd": "you would", "you'd've": "you would have", "you'll": "you will", "you'll've": "you will have",

                           "you're": "you are", "you've": "you have"}

In [47]:
text = "what's the case"

In [48]:
' '.join([contraction_mapping.get(t, t) for t in text.split(" ")])

'what is the case'
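
Splitting on spaces misses contractions that are capitalized, followed by punctuation, or written with a typographic apostrophe (`’`). A regex-based variant is sketched below under those assumptions; `CONTRACTIONS` is a small subset of the mapping above so that the example stays self-contained, and `expand_contractions` is a hypothetical helper name.

```python
import re

# Small subset of contraction_mapping, lowercase keys only.
CONTRACTIONS = {"what's": "what is", "can't": "cannot", "won't": "will not", "i'm": "i am"}

# Longest keys first, so longer contractions win over shorter ones.
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(k) for k in sorted(CONTRACTIONS, key=len, reverse=True)) + r")\b",
    flags=re.IGNORECASE,
)

def expand_contractions(text):
    """Expand known contractions, preserving a leading capital letter."""
    text = text.replace("’", "'")  # normalize typographic apostrophes

    def replace(match):
        found = match.group(0)
        expanded = CONTRACTIONS[found.lower()]
        if found[0].isupper():
            expanded = expanded[0].upper() + expanded[1:]
        return expanded

    return _PATTERN.sub(replace, text)

expanded = expand_contractions("What’s the case? I can't tell, I'm sure.")
print(expanded)  # What is the case? I cannot tell, I am sure.
```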

-------------