# Text representation


____________________
- **Name:** Maite Giménez.
- **Mail:** mgimenez@dsic.upv.es
- **Github:** maigimenez

## Table of contents

1. Definition of the Problem: Text Representation in NLP.
2. Basic Word Representation.
    1. Bag of words (BoW).
    2. Bag of n-grams (n-grams).
3. Not so Basic Word Representation.
    1. Weighted models (Tf-Idf).
    2. Probabilistic language models.
4. New approaches.
    1. Classical approaches, classical problems. 
    2. The deep (and not so deep) learning approach.

# 1. Definition of the Problem: Text Representation in NLP.



Machine learning models learn from numeric input (images, measurements) how to predict the class of that input.

![problem](imgs/problem.png "Problems with NLP")

But words are NOT numbers!

![problem2](imgs/problem2.png "Problems with NLP")


# Practical example:  Brown Corpus

- The first million-word electronic corpus of English
- Contains text from 500 sources.
- Sources have been categorized by genre.

In [1]:
from nltk.corpus import brown

print("* Categories: {} \n".format(', '.join(brown.categories())))

print("* Sentences from a category:")
for sentence in brown.sents(categories=['news'])[:2]:
    # Sentences are a list of words
    print("  - ", ' '.join(sentence))

* Categories: adventure, belles_lettres, editorial, fiction, government, hobbies, humor, learned, lore, mystery, news, religion, reviews, romance, science_fiction 

* Sentences from a category:
  -  The Fulton County Grand Jury said Friday an investigation of Atlanta's recent primary election produced `` no evidence '' that any irregularities took place .
  -  The jury further said in term-end presentments that the City Executive Committee , which had over-all charge of the election , `` deserves the praise and thanks of the City of Atlanta '' for the manner in which the election was conducted .


# 2. Basic Text Representation.
## A. Bag of words (BoW).

- The simplest way to represent text.
- A text is represented as the bag (multiset) of its words.
- Does **not** consider the order of the words.

- Create a vector with the size of the vocabulary seen in **train**.
- Each sentence is represented by counting the number of times each word appears (frequency), or simply by recording whether each word appears or not (one-hot representation).
- Frequencies can be normalized.


In [2]:
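dataset = ["El gato comerá pato dentro de un rato", 
           "El pato se esconde de un gato en un zapato"]
# Some preprocessing might happen (lowercasing, accent stripping).
# Lists (not sets) keep the word order that the vector positions rely on.
vocabulary = ["el","gato","comera","pato","dentro","de","un","rato","se","esconde","en","zapato"]

# 1 if the word appears in the sentence, 0 otherwise
onehot_representation = [[1,1,1,1,1,1,1,1,0,0,0,0],
                         [1,1,0,1,0,1,1,0,1,1,1,1]]

# Number of times each word appears ("un" appears twice in the second sentence)
bow_representation = [[1,1,1,1,1,1,1,1,0,0,0,0],
                      [1,1,0,1,0,1,2,0,1,1,1,1]]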
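
The toy vectors above can also be computed instead of written by hand. The following is a minimal sketch, not the notebook's own code: it assumes whitespace tokenization, lowercasing and accent stripping as the preprocessing, and the helper names (`normalize`, `build_vocabulary`, `bow_vector`) are hypothetical.

```python
import unicodedata
from collections import Counter


def normalize(text):
    """Lowercase and strip accents (an assumed preprocessing step)."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def build_vocabulary(texts):
    """Collect the words of the training texts in order of first appearance."""
    vocab = []
    for text in texts:
        for word in normalize(text).split():
            if word not in vocab:
                vocab.append(word)
    return vocab


def bow_vector(text, vocab, binary=False):
    """Count each vocabulary word in the text (or flag its presence if binary)."""
    counts = Counter(normalize(text).split())
    return [min(counts[word], 1) if binary else counts[word] for word in vocab]


train = ["El gato comerá pato dentro de un rato",
         "El pato se esconde de un gato en un zapato"]

toy_vocab = build_vocabulary(train)
toy_counts = [bow_vector(text, toy_vocab) for text in train]
toy_binary = [bow_vector(text, toy_vocab, binary=True) for text in train]

print(toy_vocab)
print(toy_counts)
print(toy_binary)
```

Words that are not in the vocabulary built from the training texts are silently ignored by `bow_vector`, which is the usual behavior when a vocabulary is fixed at training time.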


In [3]:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(analyzer = "word", 
                             tokenizer = None,
                             preprocessor = None,
                             stop_words = None,
                             ngram_range=(1, 1)) 

sentences = [' '.join(sentence) for sentence in brown.sents(categories=['news'])]
sentences_token = [sentence for sentence in brown.sents(categories=['news'])]

# Fit the text
BOW_text = vectorizer.fit_transform(sentences)
BOW_text = BOW_text.toarray()

print('Text: {{0|1}}^({}x{})'.format(BOW_text.shape[0], BOW_text.shape[1]))
# scikit-learn >= 1.0; older versions use get_feature_names()
vocab = vectorizer.get_feature_names_out()
print('\nVOCABULARY EXTRACT: {}'.format(', '.join(vocab[500:550])))
# np.set_printoptions(threshold=np.nan)
print('\nTWEET REPRESENTATION: {}. Size of the vector: {}'.format(BOW_text[0], BOW_text[0].shape))

Text: {0|1}^(4623x11880)

VOCABULARY EXTRACT: acid, acknowledge, acknowledged, acknowledgment, acquaint, acquaintance, acquire, acquired, acquisition, acquittal, acre, acreage, acres, acrobatic, across, act, acted, acting, action, actions, active, activities, activity, actor, actors, actress, acts, actual, actually, acute, ad, adair, adam, adamant, adams, adamson, adaptation, adapting, adc, adcock, add, added, addicts, adding, addition, additional, address, addressed, addresses, addressing

TWEET REPRESENTATION: [0 0 0 ..., 0 0 0]. Size of the vector: (11880,)


# 2. Basic Text Representation.
## B. Bag of n-grams (n-grams).

- The BoW approach can be generalized to units other than single words.
- An n-gram is a contiguous sequence of *n* elements: words or characters.
- Most common n-grams:
    - 1-grams (unigrams)
    - 2-grams (bigrams)
    - 3-grams (trigrams)
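
As a minimal sketch of the definition above, the same sliding window yields word n-grams from a token list and character n-grams from a string. The function name `extract_ngrams` is hypothetical and no padding symbols are added.

```python
def extract_ngrams(items, n):
    """Return every contiguous run of n elements as a tuple."""
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]


words = "el gato comera pato".split()
word_bigrams = extract_ngrams(words, 2)

# A string is a sequence of characters, so the same function applies
char_trigrams = ["".join(gram) for gram in extract_ngrams("gato", 3)]

print(word_bigrams)
print(char_trigrams)
```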

In [11]:
# WORD-BIGRAM representation
dataset = ["El gato comerá pato dentro de un rato", 
           "El pato se esconde de un gato en un zapato"]
# Some preprocessing might happen 
# Lists (not sets) keep the order that the vector positions rely on.
vocabulary = ["el gato","gato comera", "comera pato", "pato dentro", 
              "dentro de","de un","un rato","rato <EOF>", "el pato",
              "pato se", "se esconde", "esconde de", "un gato", "gato en", "en un", "un zapato",
              "zapato <EOF>"]

ngrams_representation = [[1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0],
                         [0,0,0,0,0,1,0,0,1,1,1,1,1,1,1,1,1]]

In [12]:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(analyzer = "word", 
                             tokenizer = None,
                             preprocessor = None,
                             stop_words = None,
                             ngram_range=(1, 2)) 

sentences = [' '.join(sentence) for sentence in brown.sents(categories=['news'])]
sentences_token = [sentence for sentence in brown.sents(categories=['news'])]

# Fit the text
ngram_text = vectorizer.fit_transform(sentences)
ngram_text = ngram_text.toarray()

print('Text: {{0|1}}^({}x{})'.format(ngram_text.shape[0], ngram_text.shape[1]))
# scikit-learn >= 1.0; older versions use get_feature_names()
vocab = vectorizer.get_feature_names_out()
print('\nVOCABULARY EXTRACT: {}'.format(', '.join(vocab[500:550])))
# np.set_printoptions(threshold=np.nan)
print('\nTWEET REPRESENTATION: {}. Size of the vector: {}'.format(ngram_text[0], ngram_text[0].shape))

Text: {0|1}^(4623x67007)

VOCABULARY EXTRACT: 1913, 1913 few, 1914, 1914 the, 1917, 1917 and, 1919, 1919 white, 192, 192 865, 1920, 1920 as, 1920 presidential, 1920s, 1920s following, 1921, 1921 from, 1922, 1922 the, 1923, 1923 to, 1924, 1924 and, 1925, 1925 and, 1925 as, 1926, 1926 ne, 1927, 1927 by, 1927 fewer, 1927 ruth, 1927 season, 1927 two, 1928, 1928 and, 1930, 1930 made, 1930 they, 1930 when, 1930s, 1930s he, 1932, 1932 or, 1933, 1933 individuals, 1934, 1934 farmers, 1934 implicit, 1935

TWEET REPRESENTATION: [0 0 0 ..., 0 0 0]. Size of the vector: (67007,)


# 3. Not so Basic Text Representation.
## A. Weighted models: term frequency–inverse document frequency (Tf-Idf).

- Common technique from Information Retrieval (IR)
- This word representation tries to model how important a word is to a document in a corpus.
- Two terms are combined (or weighted): **term frequency** and **inverse document frequency**.

### Term frequency
- Count the number of times that term *t* occurs in a document *c*, normalized by the count of the most frequent term in that document.
$$tf(t, c) = \frac{f(t,c)}{\max_{w \in c} f(w,c)}$$

### Inverse document frequency
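- To keep common terms with low relevance from dominating, the term frequency is weighted by the inverse document frequency, usually on a logarithmic scale:
$$idf(t, C) = \log \frac{|C|}{|\{c \in C : t \in c\}|}$$
$|C|$: number of documents in the corpus. $|\{c \in C : t \in c\}|$: number of documents where the term *t* appears.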

### Term frequency–Inverse document frequency 
$$\text{tf-idf}(t, c, C) = tf(t,c) \times idf(t,C) $$
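
The formulas can be checked on a toy corpus. This is a minimal sketch under stated assumptions: documents are already tokenized, the term appears in at least one document, and the idf is log-scaled by default (pass `log_scale=False` for the raw ratio). The function names are hypothetical.

```python
import math
from collections import Counter


def tf(term, doc):
    """Term count normalized by the count of the most frequent term in doc."""
    counts = Counter(doc)
    return counts[term] / max(counts.values())


def idf(term, corpus, log_scale=True):
    """Ratio of corpus size to the number of documents containing the term."""
    n_containing = sum(1 for doc in corpus if term in doc)
    ratio = len(corpus) / n_containing
    return math.log(ratio) if log_scale else ratio


def tf_idf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)


docs = [["el", "gato", "come", "pato"],
        ["el", "pato", "se", "esconde"],
        ["el", "gato", "duerme"]]

# "el" appears in every document, so its weight vanishes
print(tf_idf("el", docs[0], docs))
# "duerme" appears in a single document, so it gets the largest weight
print(tf_idf("duerme", docs[2], docs))
```

scikit-learn's `TfidfVectorizer` uses a smoothed idf and L2-normalizes each row by default, so its values will not match this sketch exactly.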


In [6]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_df=0.5, 
                             min_df=2, 
                             stop_words='english')

# Fit the text
tfidf_text = vectorizer.fit_transform(sentences)
tfidf_text = tfidf_text.toarray()

print('Text: {{0|1}}^({}x{})'.format(tfidf_text.shape[0], tfidf_text.shape[1]))
# scikit-learn >= 1.0; older versions use get_feature_names()
vocab = vectorizer.get_feature_names_out()
print('\nVOCABULARY EXTRACT: {}'.format(', '.join(vocab[500:550])))
# np.set_printoptions(threshold=np.nan)
print('\nTWEET REPRESENTATION: {}. Size of the vector: {}'.format(tfidf_text[0], tfidf_text[0].shape))

Text: {0|1}^(4623x5807)

VOCABULARY EXTRACT: athletic, athletics, atlanta, atlantic, atmosphere, atomic, attachment, attack, attacked, attacks, attempt, attempted, attempting, attempts, attend, attended, attending, attends, attention, attitude, attorney, attorneys, attract, attracted, attraction, attractive, atty, audience, audio, auditorium, audubon, aug, august, augusta, aunt, aurora, austere, austin, author, authorities, authority, authorize, authorized, authorizing, auto, automatic, automatically, automobile, autumn, av

TWEET REPRESENTATION: [ 0.  0.  0. ...,  0.  0.  0.]. Size of the vector: (5807,)


# 3. Not so Basic Text Representation.
## B. Statistical language models.


- A probability distribution over sequences of words.
- Able to handle some word ordering. 
- Several assumptions are made:
    - The probability of a word *w* depends on its history (the previous words seen).
    - *The Markov assumption*: the probability of a word depends only on a fixed number of previous words (0 for unigrams, 1 for bigrams, ...).
    

- Given a sequence of *n* words, the probability that the random variables $W_1, \ldots, W_n$ take the values of the sequence $w_1^n = w_1 \ldots w_n$ can be decomposed using the chain rule:

$$ P(w_1^n) = P(w_1) \cdot P(w_2|w_1) \cdot P(w_3|w_1^2) \cdots P(w_n|w_1^{n-1}) $$
$$ = \prod_{i=1}^{n}{P(w_i|w_1^{i-1}) }$$

- Applying the Markov assumption with n-grams of order *N* (each word depends only on the *N-1* previous words):

$$ P(w_1^n) \approx \prod_{i=1}^{n}{P(w_i|w_{i-N+1}^{i-1}) }$$
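
For bigrams, the product above reduces to a chain of $P(w_i|w_{i-1})$ terms. A minimal sketch follows, assuming maximum-likelihood estimates from raw counts and hypothetical `<s>` / `</s>` boundary symbols; the function names are illustrative only.

```python
from collections import Counter


def train_bigram_lm(tokenized_sentences):
    """Count contexts and bigrams, padding each sentence with boundary symbols."""
    context_counts = Counter()
    bigram_counts = Counter()
    for sent in tokenized_sentences:
        padded = ["<s>"] + sent + ["</s>"]
        context_counts.update(padded[:-1])
        bigram_counts.update(zip(padded[:-1], padded[1:]))
    return context_counts, bigram_counts


def bigram_prob(prev, word, context_counts, bigram_counts):
    """Maximum-likelihood estimate of P(word | prev)."""
    if context_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / context_counts[prev]


def sentence_prob(sent, context_counts, bigram_counts):
    """Multiply the bigram probabilities along the padded sentence."""
    padded = ["<s>"] + sent + ["</s>"]
    prob = 1.0
    for prev, word in zip(padded[:-1], padded[1:]):
        prob *= bigram_prob(prev, word, context_counts, bigram_counts)
    return prob


toy_corpus = [["el", "gato", "come"], ["el", "pato", "come"]]
contexts, bigrams = train_bigram_lm(toy_corpus)

print(sentence_prob(["el", "gato", "come"], contexts, bigrams))
# A sentence with an unseen word gets probability zero without smoothing
print(sentence_prob(["el", "perro", "come"], contexts, bigrams))
```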

- Statistical language models can deal with unseen words and unseen n-grams by applying smoothing techniques such as:
    - Backoff.
    - Linear interpolation.
    - Good-Turing. 
    - ...
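
As an illustration of one technique from the list, linear interpolation mixes the bigram estimate with the unigram estimate, so a bigram that was never observed still receives some probability mass. This is a minimal sketch: the weight `lam=0.7` is an arbitrary assumption (in practice it is tuned on held-out data), unigram counts are used as context counts for simplicity, and the function name is hypothetical.

```python
from collections import Counter


def interpolated_prob(prev, word, unigrams, bigrams, lam=0.7):
    """lam * P(word | prev) + (1 - lam) * P(word), both from raw counts."""
    total = sum(unigrams.values())
    p_unigram = unigrams[word] / total
    p_bigram = bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0
    return lam * p_bigram + (1 - lam) * p_unigram


tokens = "el gato come el pato".split()
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))

# "gato pato" was never observed, yet its probability is not zero
print(interpolated_prob("gato", "pato", unigrams, bigrams))
# An observed bigram still scores higher
print(interpolated_prob("el", "gato", unigrams, bigrams))
```

A word that never appeared in training still gets probability zero here; handling it requires reserving mass for an unknown-word token or one of the other techniques listed.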

# 4. New approaches.
## A. Classical approaches, classical problems.

- Unseen words.
- No semantic relationships.
    - One-hot representation: every word is a vector in $\mathbb{R}^{|V|}$ with a single 1, so any two distinct words are orthogonal and no semantic relationship is captured:
      $(w^{hotel})^T w^{motel} = (w^{hotel})^T w^{hostel} = 0$

In [7]:
hotel = [0, 0, 0, 0, 0, 0, 1, 0]
motel = [0, 0, 0, 0, 1, 0, 0, 0]
hostel = [1, 0, 0, 0, 0, 0, 0, 0]
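
A quick check of the orthogonality claim, as a minimal sketch using the same toy vectors (the `_vec` names and the `cosine_similarity` helper are introduced here for illustration only):

```python
import numpy as np

hotel_vec = np.array([0, 0, 0, 0, 0, 0, 1, 0])
motel_vec = np.array([0, 0, 0, 0, 1, 0, 0, 0])
hostel_vec = np.array([1, 0, 0, 0, 0, 0, 0, 0])


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


# Every pair of distinct one-hot vectors has similarity 0,
# no matter how related the words are
print(cosine_similarity(hotel_vec, motel_vec))
print(cosine_similarity(hotel_vec, hostel_vec))
print(cosine_similarity(hotel_vec, hotel_vec))
```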

![problem3](imgs/problem3.png "Similar words not similar representations")

### The curse of dimensionality
- Generalizing locally (e.g. nearest neighbors) requires representative examples for all relevant variations.
- The number of possible configurations of the variables of interest is much larger than the number of training samples.
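
The gap can be made concrete with the feature counts reported earlier for the news category. A rough sketch, assuming those two `CountVectorizer` runs (11880 features for unigrams, 67007 for unigrams plus bigrams):

```python
unigram_features = 11880          # vocabulary size with ngram_range=(1, 1)
unigram_bigram_features = 67007   # vocabulary size with ngram_range=(1, 2)

# Bigrams actually seen in the corpus
observed_bigrams = unigram_bigram_features - unigram_features
# Bigrams that could be formed from the unigram vocabulary
possible_bigrams = unigram_features ** 2
coverage = observed_bigrams / possible_bigrams

print("Observed bigrams:", observed_bigrams)
print("Possible bigrams:", possible_bigrams)
print("Fraction observed: {:.6f}".format(coverage))
```

Only a tiny fraction of the possible word pairs is ever observed, and the fraction shrinks further for longer n-grams.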


![curse](imgs/curse.jpg "The curse of dimensionality")


# 4. New approaches.
## B. The deep (and not so deep) learning approach.

> You shall know a word by the company it keeps.
>
> -- <cite>John Rupert Firth</cite>

![firth](imgs/firth.png "John Rupert Firth")


### Word2vec

- Word2vec is a family of algorithms for learning word embeddings. To be precise, you may want to talk about a skip-gram model trained using negative sampling.
- Word2vec is NOT the holy grail (if you don't understand it).
    
    > Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems (pp. 3111-3119).


- Word2vec algorithms learn word embeddings from raw, unlabeled text (unsupervised).
- Two algorithms are implemented:
    - Continuous Bag-of-Words model (CBOW)
        - Predicts target words (e.g. 'mat') from source context words ('the cat sits on the').
        - CBOW smooths over a lot of the distributional information (it treats an entire context as one observation).
        - Useful for smaller datasets.
    - Skip-Gram model
         - Predicts source context words ('the cat sits on the') from target words (e.g. 'mat').
         - Treats each context-target pair as a new observation.
         - Useful for larger datasets.
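
The difference between the two models is easiest to see in the training examples they generate. A minimal sketch, assuming a symmetric context window and hypothetical helper names (`skipgram_pairs`, `cbow_examples`):

```python
def skipgram_pairs(tokens, window=2):
    """One (target, context word) pair per context position."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs


def cbow_examples(tokens, window=2):
    """One (whole context, target) example per position."""
    examples = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + window + 1]
        examples.append((context, target))
    return examples


tokens = "the cat sits on the mat".split()

# Skip-gram: every context-target pair is its own observation
print(skipgram_pairs(tokens, window=1))
# CBOW: the whole context is a single observation
print(cbow_examples(tokens, window=1))
```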

- These algorithms are trained using:
    - Negative sampling.
    - Hierarchical Softmax
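
Negative sampling replaces the softmax over the whole vocabulary with a handful of binary decisions: the observed context word should score high against the center word, and a few randomly drawn words should score low. The sketch below computes that loss for one training pair with NumPy; the vectors are random placeholders and the function names are hypothetical, so this is not gensim's implementation.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def negative_sampling_loss(center_vec, context_vec, negative_vecs):
    """-log s(u_o . v_c) - sum_k log s(-u_k . v_c) for one (center, context) pair."""
    positive_term = np.log(sigmoid(context_vec @ center_vec))
    negative_term = np.sum(np.log(sigmoid(-(negative_vecs @ center_vec))))
    return float(-(positive_term + negative_term))


rng = np.random.default_rng(0)
center = rng.normal(size=8)
negatives = rng.normal(size=(5, 8))   # 5 sampled "noise" words

# A context vector aligned with the center word gives a lower loss
# than one pointing in the opposite direction
loss_aligned = negative_sampling_loss(center, center, negatives)
loss_opposed = negative_sampling_loss(center, -center, negatives)
print(loss_aligned, loss_opposed)
```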

![w2v](imgs/nce-nplm.png "w2v")



In [8]:
from gensim.models import word2vec

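# Initialize and train the model (this will take some time)
model = word2vec.Word2Vec(sentences_token, 
                          workers = 4,       # Number of worker threads
                          size = 300,        # Embeddings size (named `vector_size` in gensim >= 4.0)
                          min_count = 1,     # Minimum number of times a word must appear to be taken into account
                          window = 5,        # Maximum distance between a target word and its context words
                          sample = 1e-3 ,    # Downsample setting for frequent words
                         )

# L2-normalize the vectors in place to save memory; 
# after this the model cannot be trained any further
model.init_sims(replace=True)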



In [9]:
model.wv.most_similar('price')

[('return', 0.9964651465415955),
 ('economic', 0.9960927963256836),
 ('recently', 0.9959506988525391),
 ('order', 0.9958515763282776),
 ('side', 0.9956673383712769),
 ('Stein', 0.9956244230270386),
 ('South', 0.9955523610115051),
 ('Johnny', 0.9954700469970703),
 ('gown', 0.9954312443733215),
 ('newly', 0.995410144329071)]

In [10]:
import re
from collections import Counter

import numpy as np
from bokeh.models import ColumnDataSource, HoverTool
from bokeh.plotting import figure, show
from nltk.corpus import stopwords
from nltk.tokenize import TweetTokenizer
from sklearn.manifold import TSNE

def get_most_common_vocab(most_common, vocabulary):
    """ Get the most common words in a vocabulary

    Args:
        most_common (int): Number of most common words to retrieve.
        vocabulary (Counter): Vocabulary with words and frequencies of each word.
        
    Returns:
        set: set of most common words in this vocabulary.
    """
    most_common_words = vocabulary.most_common(int(most_common))
    return set(word for word, _ in most_common_words)

def get_vocabulary(corpus, tokenizer):
    """ Get the vocabulary of a dataset. 

    Get a vocabulary of a set of tweets after removing stopwords, non letters, 
    and replacing each number by the token <number>

    Args:
        corpus (list of tweets): A list of tweets.
        tokenizer (function): tokenizer function. To get the tokens of each tweet.

    Returns:
        Counter: Vocabulary with the frequency of each word in it.
    """
    stop_words = stopwords.words('english')

    # Remove punctuation marks
    no_punks = [re.sub(r'\W', ' ', tweet) for tweet in corpus]
    
    # Tokenize and remove stop words
    clean_tokens = []
    for tweet in no_punks:
        # Replace different numbers with a token
        tweet = re.sub(r"\.\d+\s*", ".<number> ", tweet)
        tweet = re.sub(r"\d+\s*", " <number> ", tweet)
    
        tokens = tokenizer(tweet)
        tokens = [token for token in tokens if token not in stop_words]
        clean_tokens.extend(tokens)

    # Build the vocabulary
    return Counter(clean_tokens)


def get_words_to_plot(most_common, vocabulary, dictionary):
    words_to_plot = {}
    unseen_words = []
    for word in get_most_common_vocab(most_common, vocabulary):
        if word in dictionary:
            words_to_plot[word] = dictionary[word]
        else:
            unseen_words.append(word)
    return words_to_plot, unseen_words

def plot_tsne(dictionary, most_common):
    """ Plot a 2D t-SNE projection of the embeddings of the most common words.

    Args:
        dictionary: Mapping from a word to its embedding (e.g. a trained word2vec model).
        most_common (int): Number of most common words to plot.

    Returns:
        set: most common words that have no embedding in the dictionary.
    """
    # `n_iter` is named `max_iter` in scikit-learn >= 1.5
    tsne = TSNE(perplexity=30, n_components=2, init='pca', n_iter=5000)
    tknzr = TweetTokenizer()
    vocabulary = get_vocabulary(sentences, tknzr.tokenize)
    words_to_plot, unseen_words = get_words_to_plot(most_common, vocabulary, dictionary)
    
    # t-SNE needs the embedding vectors, not the words themselves
    words = list(words_to_plot)
    embeddings = np.array([words_to_plot[word] for word in words])
    low_dim_embs = tsne.fit_transform(embeddings)
    
    source = ColumnDataSource(data=dict(words=words,
                                        x=low_dim_embs[:,0], 
                                        y=low_dim_embs[:,1]))

    # Show the word under the mouse pointer
    hover = HoverTool(tooltips=[("word", "@words")])
    hover.point_policy = "follow_mouse"

    TOOLS="pan,wheel_zoom,box_zoom,reset,save"

    p = figure(title = "Word visualization", tools=TOOLS)
    p.add_tools(hover)
    p.circle('x', 'y', source=source, fill_alpha=0.2, size=10, color='navy')

    show(p)
    return set(unseen_words)

