
# **Word Frequency**
Word frequency analysis is a fundamental technique in natural language processing (NLP) that supports applications such as sentiment analysis, topic modeling, and text summarization. The process begins with text preprocessing: normalizing the text and removing punctuation, special characters, and stop words so that counts are accurate. In Python, dictionaries, and in particular the Counter class from the collections module, are commonly used to store and compute word frequencies efficiently. To compare texts of different lengths fairly, relative frequency is calculated by dividing each word count by the total number of words. Interpreting frequency results requires contextual awareness, since common words do not always signal meaningful themes, and visualizations such as bar charts can help identify patterns in the data.

In [1]:
# Importing all the necessary libraries and resources:
from collections import Counter
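
The preprocessing steps named in the introduction (normalization, punctuation removal, stop-word removal) can be sketched as follows. This is a minimal illustration, not the notebook's own pipeline: the `preprocess` helper, the `STOP_WORDS` set, and the sample sentence are all assumptions made for the example.

```python
import string
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; a real project would use a fuller one.
STOP_WORDS = {'the', 'a', 'an', 'is', 'and', 'of', 'in', 'to'}

def preprocess(text):
    """Lowercase the text, strip ASCII punctuation, and drop stop words."""
    # Normalization: lowercase so 'Data' and 'data' count as the same word.
    lowered = text.lower()
    # Remove ASCII punctuation characters.
    cleaned = lowered.translate(str.maketrans('', '', string.punctuation))
    # Tokenize on whitespace and filter out stop words.
    return [word for word in cleaned.split() if word not in STOP_WORDS]

# Hypothetical sample sentence.
sample = 'The data is clean, and the analysis of the data is fast.'
tokens = preprocess(sample)
print(tokens)
print(Counter(tokens))
```

Only ASCII punctuation is handled here; text with typographic quotes or other Unicode punctuation would likely need a broader cleanup step.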

## **Example: Word Counting**
Word counting can be achieved by creating a dictionary in Python, where each word is mapped to its frequency of occurrence in the texts. This is accomplished by processing the texts to transform the content into a list of words (tokens) and then counting each word using a data structure that accumulates the frequencies.

In [2]:
# Text example for preprocessing:
text = ''

# Splitting the sentence into words
words = text.split()

# Counting words with Counter:
frequencies = Counter(words)
print(frequencies)

Counter()
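
Because the example text above is an empty string, the result is an empty Counter. To see non-trivial counts, the same steps can be run on a short sentence. The sentence below is a hypothetical sample, not part of the original notebook.

```python
from collections import Counter

# Hypothetical sample sentence (an assumption for illustration).
sample_text = 'the cat sat on the mat and the dog sat too'

# Split on whitespace, then count each word.
sample_words = sample_text.split()
sample_frequencies = Counter(sample_words)

# most_common(n) returns the n highest counts as (word, count) pairs.
print(sample_frequencies.most_common(2))
# → [('the', 3), ('sat', 2)]
```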


## **Example: Relative Frequency**
In the context of Natural Language Processing (NLP), relative frequency is a crucial concept for analyzing and comparing texts of different sizes and types. When examining academic and e-commerce datasets, we note that simply counting tokens can lead to biased analyses due to the disparity in the sizes of the datasets. The academic dataset, for example, has 34 million tokens, while the e-commerce dataset has approximately 3 million.

In [4]:
# Assuming we have the count of a given token in each dataset and the total token count of each dataset:
academic_count = 768000
ecommerce_count = 101000
total_tokens_academic = 34000000
total_tokens_ecommerce = 3000000

# Calculating the relative frequency:
freq_prop_academic = academic_count / total_tokens_academic
freq_prop_ecommerce = ecommerce_count / total_tokens_ecommerce

print(f'Relative frequency in academic: {freq_prop_academic:.4f}')
print(f'Relative frequency in e-commerce: {freq_prop_ecommerce:.4f}')

Relative frequency in academic: 0.0226
Relative frequency in e-commerce: 0.0337
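
The effect of normalization can be made explicit by putting the raw counts and the normalized rates side by side. The sketch below reuses the numbers from the cell above; the helper name `relative_frequency_of` and the per-million scaling are assumptions added for readability.

```python
def relative_frequency_of(count, total_tokens):
    """Return count / total_tokens, guarding against an empty corpus."""
    if total_tokens == 0:
        return 0.0
    return count / total_tokens

# Same figures as in the cell above.
academic = relative_frequency_of(768_000, 34_000_000)
ecommerce = relative_frequency_of(101_000, 3_000_000)

# Raw counts make the token look far more prominent in the academic corpus ...
print(f'Raw count ratio (academic / e-commerce): {768_000 / 101_000:.1f}')
# ... while the per-token rate is actually higher in the e-commerce corpus.
print(f'Per million tokens, academic: {academic * 1_000_000:.0f}')
print(f'Per million tokens, e-commerce: {ecommerce * 1_000_000:.0f}')
```

Expressing the rate per million tokens is a presentation choice; it avoids long strings of leading zeros when frequencies are small.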


## **Example: Implementation and Visualization of Relative Frequency**
In the practical implementation of relative frequency in Natural Language Processing (NLP), it is crucial to understand that this metric adjusts the word count to the document size, allowing for fairer comparisons between texts of different sizes. For example, comparing the word frequency between a large academic corpus and a smaller e-commerce dataset requires data normalization to avoid biased conclusions.

In [7]:
# Example data: 'frequency_counter' holds the token counts per dataset, and 'token_quantity' holds the total number of tokens per dataset:
frequency_counter = {
    'male': {'data': 100, 'analysis': 50, 'tool': 20},
    'female': {'data': 120, 'research': 60, 'study': 30}
}
token_quantity = {
    'male': 2000,
    'female': 2500
}

relative_frequency = {}

# Calculating the relative frequency for each token in each dataset:
for gender in frequency_counter:
  relative_frequency[gender] = {}

  for token in frequency_counter[gender]:
    freq_abs = frequency_counter[gender][token]
    total_tokens = token_quantity[gender]
    relative_frequency[gender][token] = freq_abs / total_tokens

# Displaying the results:
for gender, tokens in relative_frequency.items():
  print(f'Proportional frequencies for {gender}.')

  for token, freq in tokens.items():
    print(f"{token}: {freq:.4f}")

Proportional frequencies for male.
data: 0.0500
analysis: 0.0250
tool: 0.0100
Proportional frequencies for female.
data: 0.0480
research: 0.0240
study: 0.0120
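
The proportions printed above can also be visualized. The following is a minimal, standard-library-only sketch that draws a text bar chart; a plotting library could be used instead for a graphical bar chart. The `text_bar_chart` helper and its `scale` parameter are assumptions for this example, and the dictionary simply repeats the values from the output above.

```python
# Values copied from the output above.
relative_frequency = {
    'male': {'data': 0.05, 'analysis': 0.025, 'tool': 0.01},
    'female': {'data': 0.048, 'research': 0.024, 'study': 0.012},
}

def text_bar_chart(frequencies, scale=1000):
    """Return one line per token with a '#' bar proportional to its frequency."""
    lines = []
    # Sort from the most to the least frequent token.
    ordered = sorted(frequencies.items(), key=lambda item: item[1], reverse=True)
    for token, freq in ordered:
        bar = '#' * round(freq * scale)
        lines.append(f'{token:<10} {bar} {freq:.4f}')
    return lines

for dataset, frequencies in relative_frequency.items():
    print(dataset)
    for line in text_bar_chart(frequencies):
        print('  ' + line)
```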


## **Interpreting Relative Frequency Results**
Interpreting relative frequency results in Natural Language Processing (NLP) involves understanding how the values reflect the linguistic and thematic characteristics of the analyzed texts. Through the analysis of the relative frequency of tokens in the datasets, it is possible to identify patterns in word usage, differences between text genres, and terms unique to each dataset.

In [9]:
# Assuming the 'relative_frequency' dictionary exists for each dataset, sort its tokens by descending relative frequency:
sorted_tokens_male = sorted(relative_frequency['male'].items(), key=lambda item: item[1], reverse=True)
sorted_tokens_female = sorted(relative_frequency['female'].items(), key=lambda item: item[1], reverse=True)

# Displaying the most frequent words:
print('Most frequent words for male:', sorted_tokens_male[:20])
print('Most frequent words for female:', sorted_tokens_female[:20])

Most frequent words for male: [('data', 0.05), ('analysis', 0.025), ('tool', 0.01)]
Most frequent words for female: [('data', 0.048), ('research', 0.024), ('study', 0.012)]
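
Terms unique to each dataset, and terms the datasets share, can be found with set operations on the dictionary keys. This is a sketch under the assumption that the dictionary has the same shape and values as the results shown above; the variable names are illustrative.

```python
# Values copied from the results above.
relative_frequency = {
    'male': {'data': 0.05, 'analysis': 0.025, 'tool': 0.01},
    'female': {'data': 0.048, 'research': 0.024, 'study': 0.012},
}

male_tokens = set(relative_frequency['male'])
female_tokens = set(relative_frequency['female'])

# Tokens present in both datasets, and tokens present in only one of them.
shared = male_tokens & female_tokens
only_male = male_tokens - female_tokens
only_female = female_tokens - male_tokens

print('Shared tokens:', sorted(shared))
print('Only in male:', sorted(only_male))
print('Only in female:', sorted(only_female))

# For shared tokens, compare the relative frequencies directly.
for token in sorted(shared):
    diff = relative_frequency['male'][token] - relative_frequency['female'][token]
    print(f'{token}: difference of {diff:+.4f} (male minus female)')
```

With data this small, the comparison is only illustrative; on real corpora, small differences in relative frequency may not be meaningful without further statistical checks.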
