# Feature Extraction With TF-IDF

This notebook is a simple step-by-step example of how to compute TF-IDF by hand.

### 1. Have a data set to work on.
The **movie_reviews** variable contains 5 movie reviews from the IMDB Dataset of 50K Movie Reviews.

In [1]:
# Here we have our sample movie reviews.
# These reviews came from IMDB Dataset of 50K Movie Reviews
# https://www.kaggle.com/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews?select=IMDB+Dataset.csv
movie_reviews = [
    """
    Petter Mattei's "Love in the Time of Money" is a visually stunning film to watch. Mr. Mattei offers us a vivid portrait about human relations. This is a movie that seems to be telling us what money, power and success do to people in the different situations we encounter. <br /><br />This being a variation on the Arthur Schnitzler's play about the same theme, the director transfers the action to the present time New York where all these different characters meet and connect. Each one is connected in one way, or another to the next person, but no one seems to know the previous point of contact. Stylishly, the film has a sophisticated luxurious look. We are taken to see how these people live and the world they live in their own habitat.<br /><br />The only thing one gets out of all these souls in the picture is the different stages of loneliness each one inhabits. A big city is not exactly the best place in which human relations find sincere fulfillment, as one discerns is the case with most of the people we encounter.<br /><br />The acting is good under Mr. Mattei's direction. Steve Buscemi, Rosario Dawson, Carol Kane, Michael Imperioli, Adrian Grenier, and the rest of the talented cast, make these characters come alive.<br /><br />We wish Mr. Mattei good luck and await anxiously for his next work.
    """.strip(),
    """
    Probably my all-time favorite movie, a story of selflessness, sacrifice and dedication to a noble cause, but it's not preachy or boring. It just never gets old, despite my having seen it some 15 or more times in the last 25 years. Paul Lukas' performance brings tears to my eyes, and Bette Davis, in one of her very few truly sympathetic roles, is a delight. The kids are, as grandma says, more like "dressed-up midgets" than children, but that only makes them more fun to watch. And the mother's slow awakening to what's happening in the world and under her own roof is believable and startling. If I had a dozen thumbs, they'd all be "up" for this movie.
    """.strip(),
    """
    I sure would like to see a resurrection of a up dated Seahunt series with the tech they have today it would bring back the kid excitement in me.I grew up on black and white TV and Seahunt with Gunsmoke were my hero's every week.You have my vote for a comeback of a new sea hunt.We need a change of pace in TV and this would work for a world of under water adventure.Oh by the way thank you for an outlet like this to view many viewpoints about TV and the many movies.So any ole way I believe I've got what I wanna say.Would be nice to read some more plus points about sea hunt.If my rhymes would be 10 lines would you let me submit,or leave me out to be in doubt and have me to quit,If this is so then I must go so lets do it.
    """.strip(),
    """
    This show was an amazing, fresh & innovative idea in the 70's when it first aired. The first 7 or 8 years were brilliant, but things dropped off after that. By 1990, the show was not really funny anymore, and it's continued its decline further to the complete waste of time it is today.<br /><br />It's truly disgraceful how far this show has fallen. The writing is painfully bad, the performances are almost as bad - if not for the mildly entertaining respite of the guest-hosts, this show probably wouldn't still be on the air. I find it so hard to believe that the same creator that hand-selected the original cast also chose the band of hacks that followed. How can one recognize such brilliance and then see fit to replace it with such mediocrity? I felt I must give 2 stars out of respect for the original cast that made this show such a huge success. As it is now, the show is just awful. I can't believe it's still on the air.
    """.strip(),
    """
    Encouraged by the positive comments about this film on here I was looking forward to watching this film. Bad mistake. I've seen 950+ films and this is truly one of the worst of them - it's awful in almost every way: editing, pacing, storyline, 'acting,' soundtrack (the film's only song - a lame country tune - is played no less than four times). The film looks cheap and nasty and is boring in the extreme. Rarely have I been so happy to see the end credits of a film. <br /><br />The only thing that prevents me giving this a 1-score is Harvey Keitel - while this is far from his best performance he at least seems to be making a bit of an effort. One for Keitel obsessives only.
    """.strip()
]

### 2. Tokenize the data.
The code below uses NLTK's `word_tokenize` function to tokenize each movie review, keeps only the alphanumeric tokens, and lowercases them.
Each review's tokens are stored in a dictionary whose key is the first 30 alphanumeric characters of the review.

The **movie_review_tokens** variable maps that 30-character key to a list of tokens.
```javascript
{
  "PetterMatteisLoveintheTimeofMo": [ "petter","mattei","love","in","the","time","of","money","is"...]
}
```

In [2]:
# Tokenize the reviews with NLTK (Natural Language Toolkit).
# A token is roughly a word; punctuation marks are tokens too, so we filter them out below.
# Note: word_tokenize relies on NLTK's Punkt tokenizer data, which may need to be
# downloaded once with nltk.download.
import re

import nltk

# Dictionary that stores the tokens of each review.
movie_review_tokens = {}

for review in movie_reviews:
    review_tokens = nltk.word_tokenize(review)
    # Keep only purely alphanumeric tokens, lowercased.
    alphanumeric_tokens = []
    for token in review_tokens:
        if token.isalnum():
            alphanumeric_tokens.append(token.lower())
    # Key: the first 30 alphanumeric characters of the review.
    key = re.sub("[^0-9a-zA-Z]+", "", review)[:30]
    movie_review_tokens[key] = alphanumeric_tokens
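
To make the two string operations in this step concrete, here is a minimal sketch that runs without NLTK. The helper names `review_key` and `keep_alphanumeric` are hypothetical (they are not part of the notebook), and `raw_tokens` is a hand-written list standing in for tokenizer output, so the real output of `word_tokenize` may differ.

```python
import re

def review_key(review, length=30):
    """Drop every non-alphanumeric character, then keep the first `length` characters."""
    return re.sub("[^0-9a-zA-Z]+", "", review)[:length]

def keep_alphanumeric(tokens):
    """Lowercase the tokens and discard any that are not purely alphanumeric."""
    return [t.lower() for t in tokens if t.isalnum()]

# First sentence of the first review.
sample = 'Petter Mattei\'s "Love in the Time of Money" is a visually stunning film to watch.'
print(review_key(sample))

# Hand-written stand-in for tokenizer output (an assumption, not real NLTK output).
raw_tokens = ["Petter", "Mattei", "'s", "``", "Love", "in", "the", "Time", ".", "<", "br", "/", ">"]
print(keep_alphanumeric(raw_tokens))
```

Note that a fragment such as `br` passes the `isalnum()` filter, so if the tokenizer splits the `<br />` markup this way, leftovers from it can end up among the tokens.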

### 3. Count how many times each word appears in a review.

The **movie_review_word_count** variable holds the following data: for each review key, a dictionary of its words and the number of times each word appears in that review.

```javascript
{
  "PetterMatteisLoveintheTimeofMo": {
    "petter": 1,
    "mattei": 4,
    "love": 1,
    "in": 6,
    "the": 20,
    ...
  }
}
```

In [3]:
# Store the number of times each word appears in each movie review.
movie_review_word_count = {}

for review, tokens in movie_review_tokens.items():
    movie_review_word_count[review] = {}
    for token in tokens:
        if token not in movie_review_word_count[review]:
            movie_review_word_count[review][token] = 0
        # Increment the count for this word.
        movie_review_word_count[review][token] += 1
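
The same counting can be sketched more compactly with `collections.Counter` from the standard library. This is an alternative to the explicit loop, not the notebook's own method, and `toy_tokens` is made-up data used only for illustration.

```python
from collections import Counter

# Made-up token lists standing in for movie_review_tokens.
toy_tokens = {
    "review_a": ["the", "film", "the", "cast", "the"],
    "review_b": ["a", "film"],
}

# Counter tallies each token; dict() turns it back into a plain dictionary.
toy_word_count = {name: dict(Counter(tokens)) for name, tokens in toy_tokens.items()}
print(toy_word_count)
```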

### 4. Get the TF of a word in a review.
TF = number_of_times_the_word_appears_in_a_document / total_number_of_words_in_the_document

The **movie_review_tf** variable contains the TF of each word in each review.

```javascript
{
  "PetterMatteisLoveintheTimeofMo": {
    "petter": 0.004405286343612335,
    "mattei": 0.01762114537444934,
    "love": 0.004405286343612335,
    "in": 0.02643171806167401,
    "the": 0.0881057268722467,
    ...
  }
}
```

In [4]:
# Compute the TF value of each word for each movie review.
# TF = word_count / total_number_of_words
# Term frequency is the share of a review's words accounted for by a given word.
movie_review_tf = {}

for review,word_counts in movie_review_word_count.items():
    total_number_of_words = 0
    for word, counts in word_counts.items():
        total_number_of_words = total_number_of_words + counts
    
    for word, counts in word_counts.items():
        tf = counts / total_number_of_words
        
        if review not in movie_review_tf:
            movie_review_tf[review] = dict()
        
        movie_review_tf[review][word] = tf
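
As a small worked example of the TF formula, the sketch below applies it to made-up counts. The function name `term_frequencies` and the data in `toy_counts` are hypothetical. The last lines check the sample values shown above: they are consistent with the first review having 227 alphanumeric tokens, which is an inference from the numbers rather than something the notebook states.

```python
def term_frequencies(word_counts):
    """Divide each word's count by the total number of words in the document."""
    total = sum(word_counts.values())
    if total == 0:
        return {}
    return {word: count / total for word, count in word_counts.items()}

toy_counts = {"the": 3, "film": 1, "cast": 1}
toy_tf = term_frequencies(toy_counts)
print(toy_tf)

# Sample values from the first review: "petter" occurs once, "mattei" 4 times.
print(1 / 227, 4 / 227)
```

Because every count is divided by the same total, the TF values of one document always sum to 1.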

### 5. Get the IDF of a word.
IDF = log(number_of_documents / number_of_documents_containing_the_word)

*It does not matter how many times a word appears within a single document: whether it appears once or several times, that document is counted once.*

The **movie_review_idf** variable contains each word as a key and its IDF as the value. An IDF of 0 means the word appears in every document (here, log(5 / 5) = 0). Such words are often articles or other stop words, which carry little meaning and cannot discriminate between documents.

```javascript
{
  "petter": 1.6094379124341003,
  "mattei": 1.6094379124341003,
  "love": 1.6094379124341003,
  "in": 0.0,
  "the": 0.0,
  "time": 0.9162907318741551,
  ...
}
```

In [10]:
import numpy as np

# Collect the set of reviews in which each term appears.
movie_review_df = {}
for review, tokens in movie_review_tokens.items():
    for token in tokens:
        if token not in movie_review_df:
            movie_review_df[token] = set()
        movie_review_df[token].add(review)

# Document frequency: replace each set with the number of reviews it contains.
for token, appeared_in_docs in movie_review_df.items():
    movie_review_df[token] = len(appeared_in_docs)

# IDF = log(number_of_documents / document_frequency), using the natural logarithm.
movie_review_idf = {}
for token, df in movie_review_df.items():
    movie_review_idf[token] = np.log(len(movie_reviews) / df)
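
The IDF computation can be checked on a tiny corpus. The sketch below uses `math.log` from the standard library instead of NumPy; the function name `inverse_document_frequencies` and the three toy documents are assumptions made for illustration.

```python
import math

def inverse_document_frequencies(tokens_by_doc):
    """IDF = log(number of documents / number of documents containing the word)."""
    n_docs = len(tokens_by_doc)
    doc_freq = {}
    for tokens in tokens_by_doc.values():
        # set() ensures a word is counted once per document, however often it repeats.
        for token in set(tokens):
            doc_freq[token] = doc_freq.get(token, 0) + 1
    return {token: math.log(n_docs / df) for token, df in doc_freq.items()}

toy_docs = {
    "d1": ["the", "film", "the"],
    "d2": ["the", "cast"],
    "d3": ["the", "film", "plot"],
}
toy_idf = inverse_document_frequencies(toy_docs)
print(toy_idf)
```

Here "the" appears in all three documents, so its IDF is log(3 / 3) = 0, while "cast" appears in only one and gets log(3 / 1). With the 5 reviews of this notebook, a word found in a single review gets log(5 / 1) and a word found in two reviews gets log(5 / 2), which matches the sample values shown for "petter" and "time".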

### 6. Get the TF-IDF value of a word in a document.
The **movie_review_tfidf** variable maps each review key to a dictionary of tokens and their TF-IDF values.

The value is 0 if the word appears in every document, which means it does nothing to distinguish this document from the others.

The value grows as the word becomes more frequent in the document and rarer across the collection, so higher-scoring words carry more of what is specific to the document. With this formula the score is not capped at 1.0: since TF is at most 1, its upper bound is log(number_of_documents).

```javascript
{
  "PetterMatteisLoveintheTimeofMo": {
    "petter": 0.007090034856537888,
    "mattei": 0.02836013942615155,
    "love": 0.007090034856537888,
    "in": 0.0,
    "the": 0.0,
    "time": 0.008073046095807534,
    "of": 0.0,
    "money": 0.014180069713075776,
    "is": 0.0,
    "a": 0.0,
    ...
  }
}
```

In [11]:
movie_review_tfidf = {}
for review, token_tfs in movie_review_tf.items():
    movie_review_tfidf[review] = dict()
    for token, tf in token_tfs.items():
        idf = movie_review_idf[token]
        # TF-IDF is the product of the two values.
        movie_review_tfidf[review][token] = tf * idf
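
One way to read the scores is to rank the words of a review by TF-IDF. The sketch below does that with a hypothetical helper, `top_terms`, applied only to the ten sample values listed above; the full dictionary for the review contains more words, so the real ranking may differ.

```python
def top_terms(tfidf_scores, k=3):
    """Return the k (word, score) pairs with the highest TF-IDF score."""
    return sorted(tfidf_scores.items(), key=lambda item: item[1], reverse=True)[:k]

# Only the excerpt shown above, not the full set of scores for the review.
sample_scores = {
    "petter": 0.007090034856537888,
    "mattei": 0.02836013942615155,
    "love": 0.007090034856537888,
    "in": 0.0,
    "the": 0.0,
    "time": 0.008073046095807534,
    "of": 0.0,
    "money": 0.014180069713075776,
    "is": 0.0,
    "a": 0.0,
}
for word, score in top_terms(sample_scores):
    print(word, score)
```

Within this excerpt, "mattei" ranks first because it occurs four times in the review and in no other review, while words such as "the" and "of" score 0.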

## Conclusion
TF-IDF is the product of TF and IDF. In other words, it assigns to a term t a weight in document d that is:

**Highest** when a term occurs many times within a small number of documents (thus lending high discriminating power to these documents)

**Lower** when the term occurs in many documents (thus offering a less pronounced relevance signal)

**Lowest** when the term occurs in virtually all documents (with the formula used here, exactly 0 when it occurs in every document)
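
The three cases can be reproduced on a toy corpus with a compact end-to-end sketch. It follows the same TF and IDF formulas as the steps above, but the function name `tf_idf` and the documents in `toy_corpus` are made up for illustration.

```python
import math
from collections import Counter

def tf_idf(tokens_by_doc):
    """Compute TF-IDF for every word of every document (natural log, no smoothing)."""
    n_docs = len(tokens_by_doc)
    # Document frequency: in how many documents each word appears.
    doc_freq = Counter()
    for tokens in tokens_by_doc.values():
        doc_freq.update(set(tokens))
    scores = {}
    for name, tokens in tokens_by_doc.items():
        counts = Counter(tokens)
        total = len(tokens)
        scores[name] = {
            word: (count / total) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        }
    return scores

toy_corpus = {
    "d1": ["shark", "shark", "shark", "movie", "the"],
    "d2": ["movie", "the", "plot"],
    "d3": ["the", "cast", "movie"],
    "d4": ["the", "ending"],
}
scores = tf_idf(toy_corpus)
print(scores["d1"])
```

In document `d1`, "shark" is frequent and appears nowhere else, so it gets the highest weight; "movie" appears in three of the four documents and gets a lower one; "the" appears in all four and gets 0.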