# Introduction

This notebook explores basic text processing procedures in Python. We stick to preprocessing and to building meaningful representations from existing models, rather than delving into broader Natural Language Processing discussions.

Here is an outline:

**Basic Preprocessing Techniques**

1. Tokenization
2. Stemming & Lemmatization
3. Stop Words
4. Removing punctuation

**Basic Vector Representation**

5. Bag of Words
6. Term Frequency - Inverse Document Frequency (TF-IDF)
7. n-grams (unigrams, bigrams, ...)

**Semantic Vector Representation**

8. Word2vec

**Popular Libraries**

1. Gensim
2. NLTK
3. spaCy

# Get Data

In [1]:
import nltk # Natural Language Toolkit
import numpy as np
import pandas as pd

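# Uncomment on the first run to fetch the resources used in this notebook
# nltk.download()             # opens the interactive downloader
# nltk.download('brown')      # corpus
# nltk.download('punkt')      # sentence and word tokenizers (newer NLTK releases may ask for 'punkt_tab')
# nltk.download('stopwords')  # stop word lists
# nltk.download('wordnet')    # lexicon used by WordNetLemmatizer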

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


[nltk_data] Downloading package brown to /Users/mohamed/nltk_data...
[nltk_data]   Package brown is already up-to-date!


True

The Brown Corpus was the first million-word electronic corpus of English, created in 1961 at Brown University.
It contains text from 500 sources, categorized by genre (news, editorial, and so on). The [NLTK book, chapter 2](https://www.nltk.org/book/ch02.html) lists the categories.

In [2]:
from nltk.corpus import brown

brown.categories()

['adventure',
 'belles_lettres',
 'editorial',
 'fiction',
 'government',
 'hobbies',
 'humor',
 'learned',
 'lore',
 'mystery',
 'news',
 'religion',
 'reviews',
 'romance',
 'science_fiction']

In [3]:
reader = brown.words(categories='science_fiction')[:200]
words_list = list(reader)
text = ' '.join(words_list)
print(text)

Now that he knew himself to be self he was free to grok ever closer to his brothers , merge without let . Self's integrity was and is and ever had been . Mike stopped to cherish all his brother selves , the many threes-fulfilled on Mars , corporate and discorporate , the precious few on Earth -- the unknown powers of three on Earth that would be his to merge with and cherish now that at last long waiting he grokked and cherished himself . Mike remained in trance ; ; there was much to grok , loose ends to puzzle over and fit into his growing -- all that he had seen and heard and been at the Archangel Foster Tabernacle ( not just cusp when he and Digby had come face to face alone ) why Bishop Senator Boone made him warily uneasy , how Miss Dawn Ardent tasted like a water brother when she was not , the smell of goodness he had incompletely grokked in the jumping up and down and wailing -- Jubal's conversations coming and going -- Jubal's words troubled him most ; ; he studied them , compa

# Basic Text Preprocessing

## Tokenization

In [4]:
# make sentences (each sentence is a token)
sentences = nltk.sent_tokenize(text)
for sentence in sentences:
    print(sentence)
    print()

Now that he knew himself to be self he was free to grok ever closer to his brothers , merge without let .

Self's integrity was and is and ever had been .

Mike stopped to cherish all his brother selves , the many threes-fulfilled on Mars , corporate and discorporate , the precious few on Earth -- the unknown powers of three on Earth that would be his to merge with and cherish now that at last long waiting he grokked and cherished himself .

Mike remained in trance ; ; there was much to grok , loose ends to puzzle over and fit into his growing -- all that he had seen and heard and been at the Archangel Foster Tabernacle ( not just cusp when he and Digby had come face to face alone ) why Bishop Senator Boone made him warily uneasy , how Miss Dawn Ardent tasted like a water brother when she was not , the smell of goodness he had incompletely grokked in the jumping up and down and wailing -- Jubal's conversations coming and going -- Jubal's words troubled him most ; ; he studied them , co

In [5]:
# make words (each word is a token)
words = nltk.word_tokenize(text)[:5]
print(words, '...')

['Now', 'that', 'he', 'knew', 'himself'] ...


## Stemming & Lemmatization

For grammatical reasons, documents are going to use different forms of a word, such as organize, organizes, and organizing. Additionally, there are families of derivationally related words with similar meanings, such as democracy, democratic, and democratization. In many situations, it seems as if it would be useful for a search for one of these words to return documents that contain another word in the set.

The goal of both stemming and lemmatization is to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form. For instance:

am, are, is $\Rightarrow$ be

car, cars, car's, cars' $\Rightarrow$ car

**Stemming** usually refers to a crude heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time.

**Lemmatization** usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma.

Source: Stanford NLP Group
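
To make the contrast concrete, here is a toy sketch in plain Python. The suffix list in `crude_stem` and the lookup table in `toy_lemmatize` are made up for illustration; they are far simpler than the Porter algorithm or WordNet, and only aim to show why a stemmer can return non-words while a lemmatizer returns dictionary forms.

```python
# Hypothetical suffix list: a real stemmer (e.g. Porter) uses many ordered rules.
SUFFIXES = ("ing", "es", "ly", "s")

# Hypothetical mini-vocabulary: a real lemmatizer consults a full lexicon.
LEMMA_TABLE = {"am": "be", "are": "be", "is": "be", "cars": "car", "leaves": "leaf"}


def crude_stem(word: str, min_stem_len: int = 3) -> str:
    """Chop off the first matching suffix, if enough of the word remains."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word: str) -> str:
    """Look the word up in the vocabulary; unknown words are returned unchanged."""
    return LEMMA_TABLE.get(word, word)


for w in ["organizes", "organizing", "leaves", "is"]:
    print(w, "->", crude_stem(w), "|", toy_lemmatize(w))
```

Both inflected forms of *organize* collapse to the same stem `organiz`, which is not a word, while the lookup maps `leaves` to `leaf` and `is` to `be`.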

In [6]:
# stemming example
from nltk.stem import PorterStemmer

ps = PorterStemmer()

words = ['grows', 'leaves', 'fairly']
for word in words:
    stemmed_word = ps.stem(word)
    print(stemmed_word)

grow
leav
fairli


In [7]:
# lemmatization example
from nltk.stem import WordNetLemmatizer 
  
lemmatizer = WordNetLemmatizer() 
  
words = ['grows', 'leaves', 'fairly']
for word in words:
    lemmatized_word = lemmatizer.lemmatize(word)
    print(lemmatized_word)

grows
leaf
fairly


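Much better: the lemmatizer returns dictionary words (`leaf` rather than `leav`). Note that `lemmatize` treats every word as a noun unless told otherwise, which is why `grows` is unchanged; passing `pos='v'` returns `grow`.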

## Stop Words

Stop words are the most common words in a language, such as *of*, *them*, *they*, and *or*. They usually add little information for our application, so we filter them out.

In [8]:
from nltk.corpus import stopwords

stopwords_en = stopwords.words('english')

# use the brown scifi corpus data
reader = brown.words(categories='science_fiction')[:1000]
words_list = list(reader)

# filter out stop words
filtered_words = [word for word in words_list if word not in stopwords_en]

print(f'Number of raw words {len(words_list)}')
print(f'Number of filtered words {len(filtered_words)}')

Number of raw words 1000
Number of filtered words 639
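
The filter in the cell above compares tokens exactly as they appear, so a capitalized token will not match a lowercase stop word. A minimal sketch of a case-insensitive variant follows; the tiny `toy_stopwords` set and the sample tokens are made up for illustration and stand in for the NLTK list and the corpus.

```python
# Hypothetical stand-in for stopwords.words('english')
toy_stopwords = {"the", "to", "he", "was", "and", "of"}

# Hypothetical sample tokens
tokens = ["The", "smell", "of", "goodness", "and", "He", "was", "free"]

# exact match: capitalized stop words slip through
case_sensitive = [t for t in tokens if t not in toy_stopwords]

# lowercase each token before the lookup
case_insensitive = [t for t in tokens if t.lower() not in toy_stopwords]

print(case_sensitive)    # ['The', 'smell', 'goodness', 'He', 'free']
print(case_insensitive)  # ['smell', 'goodness', 'free']
```

Storing the stop words in a `set` also makes each membership test constant-time, which helps on larger corpora.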


## Punctuation

We can also remove punctuation. One option is a regular-expression tokenizer that keeps only runs of word characters:

In [9]:
from nltk.tokenize import RegexpTokenizer

regex_tokenizer = RegexpTokenizer(r'\w+')

clean_words = regex_tokenizer.tokenize(text)

' '.join(clean_words)

'Now that he knew himself to be self he was free to grok ever closer to his brothers merge without let Self s integrity was and is and ever had been Mike stopped to cherish all his brother selves the many threes fulfilled on Mars corporate and discorporate the precious few on Earth the unknown powers of three on Earth that would be his to merge with and cherish now that at last long waiting he grokked and cherished himself Mike remained in trance there was much to grok loose ends to puzzle over and fit into his growing all that he had seen and heard and been at the Archangel Foster Tabernacle not just cusp when he and Digby had come face to face alone why Bishop Senator Boone made him warily uneasy how Miss Dawn Ardent tasted like a water brother when she was not the smell of goodness he had incompletely grokked in the jumping up and down and wailing Jubal s conversations coming and going Jubal s words troubled him most he studied them compared them with what he'

In [10]:
clean_words = [word for word in nltk.word_tokenize(text) if word.isalpha()]

' '.join(clean_words)

'Now that he knew himself to be self he was free to grok ever closer to his brothers merge without let Self integrity was and is and ever had been Mike stopped to cherish all his brother selves the many on Mars corporate and discorporate the precious few on Earth the unknown powers of three on Earth that would be his to merge with and cherish now that at last long waiting he grokked and cherished himself Mike remained in trance there was much to grok loose ends to puzzle over and fit into his growing all that he had seen and heard and been at the Archangel Foster Tabernacle not just cusp when he and Digby had come face to face alone why Bishop Senator Boone made him warily uneasy how Miss Dawn Ardent tasted like a water brother when she was not the smell of goodness he had incompletely grokked in the jumping up and down and wailing Jubal conversations coming and going Jubal words troubled him most he studied them compared them with what he'

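As the second cell above shows, an alternative is to keep only the tokens for which `isalpha()` is true. This also drops hyphenated tokens such as `threes-fulfilled` and possessive fragments such as `'s`.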
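
A third option, sketched here with only the standard library, deletes punctuation characters before splitting on whitespace. The sample sentence is taken from the corpus excerpt above. Note that `string.punctuation` covers ASCII punctuation only, and that this approach glues `Self's` into `Selfs` instead of splitting off the suffix.

```python
import string

sample = "Self's integrity was and is and ever had been ."

# translation table that deletes every ASCII punctuation character
drop_punct = str.maketrans("", "", string.punctuation)

tokens = sample.translate(drop_punct).split()
print(tokens)
# ['Selfs', 'integrity', 'was', 'and', 'is', 'and', 'ever', 'had', 'been']
```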

# Basic Vector Representation

## Bag of Words

The bag-of-words model is a simplifying representation used in natural language processing and information retrieval (IR). In this model, a text (such as a sentence or a document) is represented as the bag (multiset) of its words, disregarding grammar and even word order but keeping multiplicity.

Let's get some sentences and represent them using the BOW vector representation.

In [11]:
# data
reader = brown.words(categories='science_fiction')[:500]
words_list = list(reader)
text = ' '.join(words_list)

# lowercase only
text = text.lower()

# words tokenizer
words_list = nltk.word_tokenize(text)

# remove stop words
filtered_words = [word for word in words_list if word not in stopwords_en]
text = ' '.join(filtered_words)

# get sentences
sentences = nltk.sent_tokenize(text)

# remove punctuation, symbols (!, ?), and numbers
clean_sentences = []
for sent in sentences:
    # tokenize words
    sent_words = nltk.word_tokenize(sent)
    # remove all tokens that are not alphabetic
    sent_words = [word for word in sent_words if word.isalpha()]
    # join words back together to form sentence
    sent = ' '.join(sent_words)
    # drop empty sentences
    if not sent:
        continue
    clean_sentences.append(sent)
    
for sent in clean_sentences:
    print(sent)

knew self free grok ever closer brothers merge without let
self integrity ever
mike stopped cherish brother selves many mars corporate discorporate precious earth unknown powers three earth would merge cherish last long waiting grokked cherished
mike remained trance much grok loose ends puzzle fit growing seen heard archangel foster tabernacle cusp digby come face face alone bishop senator boone made warily uneasy miss dawn ardent tasted like water brother smell goodness incompletely grokked jumping wailing jubal conversations coming going jubal words troubled studied compared taught nestling struggling bridge languages one thought one learning think
word church turned among jubal words gave knotty difficulty martian concept match unless one took church worship god congregation many words equated totality world known forced concept back english phrase rejected differently jubal mahmoud digby
thou art god
closer understanding english although could never inevitability martian concept st

Our BOW representation of each sentence is a vector whose length is the number of unique words in the corpus. Each entry is 1 if the corresponding word occurs in the sentence at least a given number of times (the frequency threshold), and 0 otherwise.

Example:

- they wrote some code
- code is being written
- they like sports

After removing stop words and lemmatizing, we get:

- write some code
- code write
- like sport

All words:

write, some, code, like, sport

Representation:

- sent1: [1, 1, 1, 0, 0] (the first word, write, is present; the last word, sport, is not)
- sent2: [1, 0, 1, 0, 0]
- sent3: [0, 0, 0, 1, 1]

You can see how this gives a rough indication of similarity: sentences that look alike get vectors that look alike. However, the representation is sparse and problematic. For instance, all words (or terms) are given the same weight, even though some are more meaningful than others.
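
The similarity claim can be checked with cosine similarity, a common choice for comparing such vectors (it is an assumption here, since the text does not name a specific measure). The sketch uses the three example vectors above and only the standard library.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# column order: write, some, code, like, sport
sent1 = [1, 1, 1, 0, 0]
sent2 = [1, 0, 1, 0, 0]
sent3 = [0, 0, 0, 1, 1]

print(round(cosine_similarity(sent1, sent2), 3))  # 0.816
print(round(cosine_similarity(sent1, sent3), 3))  # 0.0
```

The first two sentences share two of their words and score close to 1, while the third shares none and scores 0.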

In [12]:
from collections import Counter
from typing import List


def make_bow(sentences: List[str], threshold: int = 1) -> List[List[int]]:
    # unique words
    text = ' '.join(sentences)
    all_words = text.split(' ')
    unique_words = list(set(all_words))
    n = len(unique_words)
    print(f'We have {n} unique words')

    # instantiate vectors
    vectors = [[0 for i in range(n)] for _ in range(len(sentences))]

    # `threshold` is the minimum in-sentence count for a word to be marked present
    for sent_index, sent in enumerate(sentences):
        
        # tokenize words
        words = nltk.word_tokenize(sent)
        
        # compute frequency of each word in the sentence
        counter = Counter(words)
        
        # for every word, check if its freq passes the threshold
        # and if it does set the word index in the sentence vector
        # to 1
        for word, freq in counter.items():
            if freq >= threshold:
                word_index = unique_words.index(word)
                vectors[sent_index][word_index] = 1
    return vectors

vectors = make_bow(clean_sentences)

# sanity check
nrows, ncols = np.array(vectors).shape
print(f'our matrix has {nrows} rows and {ncols} cols')

We have 167 unique words
our matrix has 19 rows and 167 cols


In [13]:
# try our simple example for ease of visualization
sentences = ['write some code', 'code write', 'like sport']
np.array(make_bow(sentences))

We have 5 unique words


array([[1, 1, 0, 1, 0],
       [1, 0, 0, 1, 0],
       [0, 0, 1, 0, 1]])

Nice!

We don't need to implement this from scratch; scikit-learn's `CountVectorizer` does it for us.

In [19]:
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(max_features=20)
matrix = cv.fit_transform(sentences)
matrix.todense()

matrix([[1, 0, 1, 0, 1],
        [1, 0, 0, 0, 1],
        [0, 1, 0, 1, 0]])

In [25]:
# The column order differs from ours, so read the feature names from the vectorizer.
# (scikit-learn >= 1.0; older releases used cv.get_feature_names())
cv.get_feature_names_out().tolist()

['code', 'like', 'some', 'sport', 'write']

## Term Frequency - Inverse Document Frequency (TF-IDF)

TF-IDF is the product of two statistics: term frequency and inverse document frequency. It is intended to reflect how important a word is to a document in a collection or corpus. There are various ways of determining the exact values of both statistics; the definitions below treat each sentence as a document.

Term Frequency = $\frac{\text{term count}}{\text{number of words in the sentence}}$


Inverse Document Frequency = $\log\left(\frac{1 + \text{number of sentences}}{1 + \text{number of sentences with this word}}\right) + 1$

TF-IDF = TF $\times$ IDF

Example

- write some code
- write python code
- write essay

In [39]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer()
sentences = ['write some code', 'write python code', 'write essay']

data = tfidf.fit_transform(sentences).toarray()
# scikit-learn >= 1.0; older releases used tfidf.get_feature_names()
words = tfidf.get_feature_names_out()
pd.DataFrame(data, columns=words, index=['sent1', 'sent2', 'sent3'])

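|       | code     | essay    | python   | some     | write    |
|-------|----------|----------|----------|----------|----------|
| sent1 | 0.547832 | 0.0      | 0.0      | 0.720333 | 0.425441 |
| sent2 | 0.547832 | 0.0      | 0.720333 | 0.0      | 0.425441 |
| sent3 | 0.0      | 0.861037 | 0.0      | 0.0      | 0.508542 |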
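
The numbers in the table can be reproduced by hand. The sketch below applies the TF and IDF formulas above with the natural logarithm, then scales each sentence vector to unit Euclidean length, which is what `TfidfVectorizer` does by default (`norm='l2'`). scikit-learn uses the raw count as TF, but the per-sentence division cancels out under this normalization. The whitespace tokenization and the variable names are simplifications chosen for this example.

```python
import math
from collections import Counter

sentences = ['write some code', 'write python code', 'write essay']
tokenized = [s.split() for s in sentences]  # simplification: split on whitespace
vocab = sorted({w for sent in tokenized for w in sent})
n_sentences = len(tokenized)

# number of sentences that contain each word
doc_freq = {w: sum(w in sent for sent in tokenized) for w in vocab}

# smoothed inverse document frequency, as in the formula above
idf = {w: math.log((1 + n_sentences) / (1 + doc_freq[w])) + 1 for w in vocab}


def tfidf_vector(sent_words):
    """TF * IDF for every vocabulary word, scaled to unit L2 norm."""
    counts = Counter(sent_words)
    raw = [counts[w] / len(sent_words) * idf[w] for w in vocab]
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm for x in raw]


manual_tfidf = [tfidf_vector(sent) for sent in tokenized]

print(vocab)
for vec in manual_tfidf:
    print([round(x, 6) for x in vec])
```

The word `write` appears in every sentence, so it gets the lowest IDF and therefore the lowest weight in each row.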


# Advanced Vector Representations 

## Word2Vec (gensim)

In [95]:
from gensim.models.word2vec import Word2Vec

sentences = ['write some code', 'code write', 'like sport']

# represent each sentence as a list of words
tokenized_sentences = [nltk.word_tokenize(sent) for sent in sentences]

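# Train the model on the sentences (size is the length of the embedding vector;
# gensim >= 4.0 renames this parameter to vector_size)
model = Word2Vec(tokenized_sentences, min_count=1, size=4)

# access the model vocabulary (gensim >= 4.0 exposes it as model.wv.key_to_index)
words = model.wv.vocab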
print(words)
print('-----')

# to extract a specific word vector
word = 'code'
word_vec = model.wv[word]
print(f'Vector Representation of the word "{word}": ', word_vec)
print('-----')

# find top similar words to your word
model.wv.most_similar(word, topn=2)

{'write': <gensim.models.keyedvectors.Vocab object at 0x136920390>, 'some': <gensim.models.keyedvectors.Vocab object at 0x136920128>, 'code': <gensim.models.keyedvectors.Vocab object at 0x136920048>, 'like': <gensim.models.keyedvectors.Vocab object at 0x1369200f0>, 'sport': <gensim.models.keyedvectors.Vocab object at 0x1369205c0>}
-----
Vector Representation of the word "code":  [ 0.06649669 -0.0045267   0.10990859  0.07124589]
-----


[('like', -0.08335989713668823), ('sport', -0.1381823718547821)]