[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/juanhuguet/intro_to_nlp/blob/main/notebooks/02-basic-sentiment-classification-dictionary.ipynb)

# How do we make machines read text?

We have seen that machine learning models are designed to work with numerical data.

**We therefore need to find a numerical representation for textual data.**

Let's work through a simple example: sentiment classification.

Suppose we have the following restaurant review:

> "Overall the restaurant is very good as the food was tasty and good even though the service was bad."

A simple approach is to **count how many positive and negative words the text contains** and compute an overall sentiment from them.

1. First, choose a sentiment dictionary that maps words to sentiment scores. A popular choice is the AFINN lexicon, which assigns each word an integer score between -5 (very negative) and 5 (very positive).
2. Next, count the occurrences of each word in your text. You can do this by tokenizing the text into words and then counting the frequency of each word.
3. Finally, calculate the sentiment score of the text by summing the sentiment scores of each word, weighted by its frequency.
    
    
    | Word | Word Score | Num Occur | Total |
    | --- | --- | --- | --- |
    | good | 3 | 2 | 6 |
    | tasty | 1 | 1 | 1 |
    | bad | -3 | 1 | -3 |
    |  |  | Total | 4 |
4. You can then classify the text as positive, negative, or neutral based on the sentiment score. For example, you could say that a score greater than 0 is positive, less than 0 is negative, and 0 is neutral.
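
The four steps above can be sketched in a few lines. This is a minimal illustration, not the notebook's final implementation: `mini_lexicon` is a hypothetical three-word dictionary whose scores are copied from the table, and the review has been lowercased and stripped of its final period by hand so that a plain split works.

```python
from collections import Counter

# Hypothetical mini-lexicon: scores copied from the table above.
mini_lexicon = {"good": 3, "tasty": 1, "bad": -3}

# The example review, lowercased and without punctuation (done by hand here).
review = (
    "overall the restaurant is very good as the food was tasty "
    "and good even though the service was bad"
)

# Step 2: count how often each word occurs.
counts = Counter(review.split())

# Step 3: sum score * frequency; words missing from the lexicon count as 0.
total = sum(mini_lexicon.get(word, 0) * n for word, n in counts.items())

# Step 4: turn the total into a label.
label = "positive" if total > 0 else "negative" if total < 0 else "neutral"

print(total, label)  # 4 positive
```

The total matches the last row of the table: 6 + 1 - 3 = 4.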

## Load the lexicon

In [1]:
import pandas as pd

In [2]:
sentiment = pd.read_csv("afinn/afinn-111.csv")

In [3]:
WORD_SENTIMENTS_DICT = sentiment.set_index("word")["sentiment_score"].to_dict()
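
To make the shape of this dictionary concrete, here is a small sketch that builds the same kind of word-to-score mapping using only the standard library. The column names `word` and `sentiment_score` follow the cell above; `sample_csv` is a made-up three-row stand-in for the real file, with scores taken from the table at the top of this notebook.

```python
import csv
import io

# Made-up stand-in for the lexicon file, using the same column names.
sample_csv = "word,sentiment_score\ngood,3\ntasty,1\nbad,-3\n"

reader = csv.DictReader(io.StringIO(sample_csv))

# Map each word to its integer score, like WORD_SENTIMENTS_DICT above.
sample_dict = {row["word"]: int(row["sentiment_score"]) for row in reader}

print(sample_dict)  # {'good': 3, 'tasty': 1, 'bad': -3}
```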

## Create a tokenizer function

We simply split the text on spaces to get a list of words.

In [4]:
def tokenizer(text):
    return text.split(" ")

In [5]:
tokenizer("the food was bad")

['the', 'food', 'was', 'bad']
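
Splitting on spaces keeps capital letters and punctuation attached to the words, so a token such as `BAD!` will not match a lexicon entry `bad`. The sketch below shows one possible alternative, assuming the lexicon stores lowercase words; `simple_regex_tokenizer` is a hypothetical helper and is not used in the rest of this notebook.

```python
import re

def simple_regex_tokenizer(text):
    # Hypothetical helper: lowercase the text and keep only runs of
    # letters and apostrophes, which drops punctuation.
    return re.findall(r"[a-z']+", text.lower())

print("The food was BAD!".split(" "))               # ['The', 'food', 'was', 'BAD!']
print(simple_regex_tokenizer("The food was BAD!"))  # ['the', 'food', 'was', 'bad']
```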

## Calculate word frequencies in the text to classify


In [6]:
def word_freq_dict(tokenized_text):
    freqs = {}
    for w in tokenized_text:
        if w in freqs:
            freqs[w] += 1
        else:
            freqs[w] = 1
    return freqs

In [7]:
tokenized_test = tokenizer("this is a bad test")

word_freqs = word_freq_dict(tokenized_test)

word_freqs

{'this': 1, 'is': 1, 'a': 1, 'bad': 1, 'test': 1}
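
For comparison, the standard library's `collections.Counter` produces the same counts in a single call. This is only an equivalent shortcut; the hand-written loop above makes the logic explicit.

```python
from collections import Counter

tokens = "this is a bad test".split(" ")

# Counter counts how many times each token occurs.
counter_freqs = Counter(tokens)

print(dict(counter_freqs))  # {'this': 1, 'is': 1, 'a': 1, 'bad': 1, 'test': 1}
```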

## Assign a score to each word

`score = freq_word * afinn_score`

In [8]:
def assign_word_score(word_freq):
    scores = {}
    for word, num_appearances in word_freq.items():
        # Words that are not in the lexicon get a score of 0.
        sentiment_weight = WORD_SENTIMENTS_DICT.get(word, 0)
        scores[word] = sentiment_weight * num_appearances
    return scores

In [9]:
word_scores = assign_word_score(word_freq=word_freqs)

In [10]:
word_scores

{'this': 0, 'is': 0, 'a': 0, 'bad': -3, 'test': 0}

## Calculate the total score by summing all the sentiment scores

In [11]:
def calculate_total_sentiment_score(scores):
    return sum(scores.values())

In [12]:
total_score = calculate_total_sentiment_score(word_scores)

total_score

-3
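
Written as a formula, the steps so far amount to the following. The symbols are notation introduced here for illustration: $f_w$ is the number of times word $w$ appears in the text, $s_w$ is its lexicon score (taken as 0 when the word is not in the lexicon), and the sum runs over the distinct words of the text.

$$
S = \sum_{w} f_w \cdot s_w
$$

For the test sentence "this is a bad test", only "bad" has a nonzero score, so $S = 1 \cdot (-3) = -3$.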

## Now, just create the decision function

In [13]:
def assign_sentiment(total_score, threshold=0):
    # Scores within [-threshold, threshold] are neutral (assumes threshold >= 0).
    if total_score > threshold:
        return "positive"
    elif total_score < -threshold:
        return "negative"
    else:
        return "neutral"

In [14]:
assign_sentiment(total_score)

'negative'

## Finally, let's put it all together

In [15]:
def classify_review(text):
    tokenized_text = tokenizer(text)
    freqs = word_freq_dict(tokenized_text)
    scores = assign_word_score(freqs)
    total_score = calculate_total_sentiment_score(scores)
    sentiment = assign_sentiment(total_score)
    return sentiment, total_score

In [16]:
text = "the movie was dull and bad though the actors were good"

In [17]:
classify_review(text)

('negative', -2)

> Think! What are the main limitations of this approach?
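
One way to start exploring this question is to feed the method a sentence it is likely to get wrong. The sketch below is self-contained and uses a hypothetical two-word `mini_lexicon` (with the scores for "good" and "bad" seen earlier) and a hypothetical `mini_score` helper instead of the full pipeline.

```python
# Hypothetical mini-lexicon with the scores used earlier in this notebook.
mini_lexicon = {"good": 3, "bad": -3}

def mini_score(text):
    # Sum the lexicon score of every token; unknown tokens count as 0.
    return sum(mini_lexicon.get(word, 0) for word in text.split(" "))

# The negation is ignored, so a negative review gets a positive score.
print(mini_score("the food was not good"))  # 3
```

Each word is scored on its own, so anything that depends on the surrounding words is invisible to this method.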