# Description

Given a text file with a description of a movie, predict whether the movie has a positive, negative, or neutral evaluation.

## Tokenization
Tokenization splits a sentence into a list of tokens (words and punctuation marks).

NLTK (Natural Language Toolkit) is a popular library that provides tokenizers such as `word_tokenize`.

In [3]:
import nltk
nltk.download('punkt')

from nltk.tokenize import word_tokenize

# your sentence
sentence = "hi, how are you?"

[nltk_data] Downloading package punkt to /home/nuno/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


### Compare these two methods

The first one only splits the sentence on whitespace, so punctuation stays attached to the words.

The second one uses NLTK's `word_tokenize`, which separates words and punctuation properly.

In [8]:
sentence.split()

['hi,', 'how', 'are', 'you?']

In [4]:
word_tokenize(sentence)

['hi', ',', 'how', 'are', 'you', '?']
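
The difference can also be reproduced without NLTK using a rough regular-expression sketch. This is not how `word_tokenize` works internally; the pattern below is an illustrative assumption that only handles simple cases like this sentence.

```python
import re

sentence = "hi, how are you?"

# Whitespace splitting keeps punctuation glued to the words.
split_tokens = sentence.split()

# Illustrative pattern (an assumption, not NLTK's algorithm):
# match either a run of word characters or a single punctuation mark.
regex_tokens = re.findall(r"\w+|[^\w\s]", sentence)

print(split_tokens)
print(regex_tokens)
```

For real text, `word_tokenize` is the safer choice because it also handles contractions and abbreviations.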

# Bag of words

Counts how many times each word appears in each sentence of the corpus.
It stores the counts in a sparse matrix.

In [5]:
from sklearn.feature_extraction.text import CountVectorizer
# create a corpus of sentences
corpus = [
"hello, how are you?",
"im getting bored at home. And you? What do you think?",
"did you know about counts",
"let's see if this works!",
"YES!!!!"
]
# initialize CountVectorizer
ctv = CountVectorizer()
# fit the vectorizer on corpus
ctv.fit(corpus)
corpus_transformed = ctv.transform(corpus)

In [6]:
print(corpus_transformed)

  (0, 2)	1
  (0, 9)	1
  (0, 11)	1
  (0, 22)	1
  (1, 1)	1
  (1, 3)	1
  (1, 4)	1
  (1, 7)	1
  (1, 8)	1
  (1, 10)	1
  (1, 13)	1
  (1, 17)	1
  (1, 19)	1
  (1, 22)	2
  (2, 0)	1
  (2, 5)	1
  (2, 6)	1
  (2, 14)	1
  (2, 22)	1
  (3, 12)	1
  (3, 15)	1
  (3, 16)	1
  (3, 18)	1
  (3, 20)	1
  (4, 21)	1


In [9]:
print(ctv.vocabulary_)

{'hello': 9, 'how': 11, 'are': 2, 'you': 22, 'im': 13, 'getting': 8, 'bored': 4, 'at': 3, 'home': 10, 'and': 1, 'what': 19, 'do': 7, 'think': 17, 'did': 6, 'know': 14, 'about': 0, 'counts': 5, 'let': 15, 'see': 16, 'if': 12, 'this': 18, 'works': 20, 'yes': 21}


**Explanation:**

Each output line reads `(row, column) count`. In the second sentence (row 1) the word 'you' (column 22) appears two times:
    ``(1, 22) 2``
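
To read a count without scanning the sparse printout, one option is to look up the word's column in `vocabulary_` and slice that column out. A minimal sketch, reusing the same corpus (the variable names `counts`, `col` and `you_counts` are only illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "hello, how are you?",
    "im getting bored at home. And you? What do you think?",
    "did you know about counts",
    "let's see if this works!",
    "YES!!!!",
]

ctv = CountVectorizer()
counts = ctv.fit_transform(corpus)  # sparse matrix: one row per sentence

# Column index assigned to the word 'you'
col = ctv.vocabulary_["you"]

# Count of 'you' in every sentence, as a plain Python list
you_counts = counts[:, col].toarray().ravel().tolist()
print(you_counts)
```

The second entry of the list corresponds to the second sentence (row 1).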

We can integrate `word_tokenize` into `CountVectorizer` to **take special characters into account**.
In this case:
- full stop (.)
- exclamation mark (!)
- comma (,)
- words that contain an apostrophe (') are split, e.g. "let's" becomes:
    - let
    - 's

In [2]:
from sklearn.feature_extraction.text import CountVectorizer
from nltk.tokenize import word_tokenize

In [4]:
# create a corpus of sentences
corpus = ["hello, how are you?",
          "im getting bored at home. And you? What do you think?",
          "did you know about counts",
          "let's see if this works!",
          "YES!!!!"]

# initialize CountVectorizer with word_tokenize from nltk
# as the tokenizer
ctv = CountVectorizer(tokenizer=word_tokenize, token_pattern=None)

# fit the vectorizer on corpus
ctv.fit(corpus)

corpus_transformed = ctv.transform(corpus)
print(ctv.vocabulary_)

{'hello': 14, ',': 2, 'how': 16, 'are': 7, 'you': 27, '?': 4, 'im': 18, 'getting': 13, 'bored': 9, 'at': 8, 'home': 15, '.': 3, 'and': 6, 'what': 24, 'do': 12, 'think': 22, 'did': 11, 'know': 19, 'about': 5, 'counts': 10, 'let': 20, "'s": 1, 'see': 21, 'if': 17, 'this': 23, 'works': 25, '!': 0, 'yes': 26}


# TF-IDF
- TF-IDF: `imbd_tfidf.py`. It is calculated using the following information:
    - TF(X) = (Number of times a term X appears in a document) / (Total number of terms in the document)
    - IDF(X) = log((Total number of documents) / (Number of documents with term X in it))
    - TF-IDF(X) = TF(X) * IDF(X)
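
The definitions above can be checked by hand on a tiny example. The sketch below assumes the common logarithmic form of IDF and a made-up, already tokenized corpus; the helper names `tf`, `idf` and `tf_idf` are hypothetical. It is not what `TfidfVectorizer` computes exactly, since scikit-learn applies smoothing and normalization by default.

```python
import math

# Toy, pre-tokenized corpus (illustrative only)
docs = [
    ["hello", "how", "are", "you"],
    ["did", "you", "know", "about", "counts"],
    ["yes"],
]

def tf(term, doc):
    # Share of the document's tokens that are `term`
    return doc.count(term) / len(doc)

def idf(term, docs):
    # Log of (number of documents / number of documents containing `term`)
    n_containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / n_containing)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

# 'you' occurs in two documents, 'hello' in only one,
# so 'hello' gets the higher weight in the first document.
print(tf_idf("you", docs[0], docs))
print(tf_idf("hello", docs[0], docs))
```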

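We see that instead of integer values, this time we get floats. By default, scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2-normalizes each row, so the values differ from a direct application of the formulas above.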

In [6]:
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.tokenize import word_tokenize

# create a corpus of sentences
corpus = ["hello, how are you?",
          "im getting bored at home. And you? What do you think?",
          "did you know about counts",
          "let's see if this works!",
          "YES!!!!"]

# initialize TfidfVectorizer with word_tokenize from nltk
# as the tokenizer
tfv = TfidfVectorizer(tokenizer=word_tokenize, token_pattern=None)
# fit the vectorizer on corpus
tfv.fit(corpus)
corpus_transformed = tfv.transform(corpus)
print(corpus_transformed)

  (0, 27)	0.2965698850220162
  (0, 16)	0.4428321995085722
  (0, 14)	0.4428321995085722
  (0, 7)	0.4428321995085722
  (0, 4)	0.35727423026525224
  (0, 2)	0.4428321995085722
  (1, 27)	0.35299699146792735
  (1, 24)	0.2635440111190765
  (1, 22)	0.2635440111190765
  (1, 18)	0.2635440111190765
  (1, 15)	0.2635440111190765
  (1, 13)	0.2635440111190765
  (1, 12)	0.2635440111190765
  (1, 9)	0.2635440111190765
  (1, 8)	0.2635440111190765
  (1, 6)	0.2635440111190765
  (1, 4)	0.42525129752567803
  (1, 3)	0.2635440111190765
  (2, 27)	0.31752680284846835
  (2, 19)	0.4741246485558491
  (2, 11)	0.4741246485558491
  (2, 10)	0.4741246485558491
  (2, 5)	0.4741246485558491
  (3, 25)	0.38775666010579296
  (3, 23)	0.38775666010579296
  (3, 21)	0.38775666010579296
  (3, 20)	0.38775666010579296
  (3, 17)	0.38775666010579296
  (3, 1)	0.38775666010579296
  (3, 0)	0.3128396318588854
  (4, 26)	0.2959842226518677
  (4, 0)	0.9551928286692534


# N-Grams

N-grams are sequences of N consecutive words. The order is important.

Until now we have considered single words (one-grams, or unigrams). With n-grams,
sequences of words also become part of our vocabulary.

We can set the n-gram range with the `ngram_range` parameter of `CountVectorizer` and `TfidfVectorizer`. By default
the minimum and maximum limits are (1, 1). We can change them to, for example, (1, 3).
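
As a small sketch of that parameter, the example below uses `ngram_range=(1, 2)` on a single made-up sentence, so the vocabulary holds both single words and pairs of consecutive words. The sentence and variable names are illustrative only.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["hello how are you"]

# Keep unigrams and bigrams in the vocabulary
ctv = CountVectorizer(ngram_range=(1, 2))
ctv.fit(corpus)

# Vocabulary entries in alphabetical order
vocab = sorted(ctv.vocabulary_)
print(vocab)
```

Widening the range grows the vocabulary quickly, so larger values are usually paired with a bigger corpus.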


In [7]:
from nltk import ngrams
from nltk.tokenize import word_tokenize
# let's see 3 grams
N = 3
# input sentence
sentence = "hi, how are you?"
# tokenized sentence
tokenized_sentence = word_tokenize(sentence)
# generate n_grams
n_grams = list(ngrams(tokenized_sentence, N))
print(n_grams)

[('hi', ',', 'how'), (',', 'how', 'are'), ('how', 'are', 'you'), ('are', 'you', '?')]


# Stemming and Lemmatization

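- These techniques reduce a word to a base form: the stem for stemming, the lemma for lemmatization.
- Stemming is the more aggressive of the two: it strips suffixes with heuristic rules, while lemmatization looks up the dictionary form of the word. `WordNetLemmatizer` treats every word as a noun unless a part-of-speech tag is passed, so it can leave verb forms such as "fishing" and "fished" unchanged.
- Stemming is more popular and widely used.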
- Types of stemmers and lemmatizers:
    - Snowball Stemmer
    - WordNet Lemmatizer

In [9]:
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.stem.snowball import SnowballStemmer

nltk.download('wordnet')

# initialize lemmatizer
lemmatizer = WordNetLemmatizer()

# initialize stemmer
stemmer = SnowballStemmer("english")
words = ["fishing", "fishes", "fished"]

for word in words:
    print(f"word={word}")
    print(f"stemmed_word={stemmer.stem(word)}")
    print(f"lemma={lemmatizer.lemmatize(word)}")
    print("")

[nltk_data] Downloading package wordnet to /home/nuno/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


word=fishing
stemmed_word=fish
lemma=fishing

word=fishes
stemmed_word=fish
lemma=fish

word=fished
stemmed_word=fish
lemma=fished
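
Because all three forms share one stem, stemming shrinks the vocabulary a vectorizer has to learn. A minimal sketch of that effect, using only the Snowball stemmer (the token list is illustrative):

```python
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")

tokens = ["fishing", "fishes", "fished", "fish"]

# Map every token to its stem, then keep the distinct stems
stems = [stemmer.stem(token) for token in tokens]
vocabulary = sorted(set(stems))

print(vocabulary)
```

Four distinct tokens end up as a single vocabulary entry.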

