# Portuguese Bigram Probabilities

In this notebook, we use the [BrWaC dataset](https://huggingface.co/datasets/brwac), a large corpus of Brazilian Portuguese, to count how often each bigram (two-letter sequence) occurs. We then convert the counts into normalized frequency scores and sort the bigrams by score. The resulting table is intended as the basis for a random string detector.

In [1]:
# Add root directory to path
import sys
from pathlib import Path
sys.path.append(str(Path('.').absolute().parent))

## Load data

In [2]:
from datasets import load_dataset

dataset = load_dataset("brwac", data_dir="../data")

In [3]:
from tqdm import tqdm

def flatten(items):
    """Recursively flatten nested lists and tuples into a single flat list."""
    out = []
    for item in items:
        if isinstance(item, (list, tuple)):
            out.extend(flatten(item))
        else:
            out.append(item)
    return out

texts = []
for record in tqdm(dataset["train"]):
    # Flatten the nested paragraph lists of each record into one list of strings.
    texts.extend(flatten(record["text"]["paragraphs"]))

print(f"Total of texts: {len(texts)}")

100%|██████████| 3530796/3530796 [12:36<00:00, 4664.86it/s] 

Total of texts: 145370673
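
To illustrate what `flatten` does, here is a small self-contained check on a made-up nested list. The sample data is hypothetical and only mimics the kind of nesting the loop above passes to `flatten`.

```python
def flatten(items):
    """Recursively flatten nested lists and tuples into a single flat list."""
    out = []
    for item in items:
        if isinstance(item, (list, tuple)):
            out.extend(flatten(item))
        else:
            out.append(item)
    return out

# Hypothetical sample: two paragraphs, each given as a list of sentences.
sample_paragraphs = [["Primeira frase.", "Segunda frase."], ["Terceira frase."]]
flat = flatten(sample_paragraphs)
print(flat)
```

Each string ends up as its own entry in the flat list, which is why the total number of texts is much larger than the number of records in the dataset.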





## Preprocess text

In [4]:
texts = [text.lower() for text in tqdm(texts)]

100%|██████████| 145370673/145370673 [06:19<00:00, 383120.74it/s]


In [5]:
# Sanity check: confirm that no text contains uppercase letters
for text in tqdm(texts):
    assert text == text.lower()

100%|██████████| 145370673/145370673 [03:51<00:00, 627512.36it/s]


In [6]:
from src.preprocessing import TextPreprocessing

preprocessing = TextPreprocessing()

texts = [preprocessing(text.lower()) for text in tqdm(texts)]

100%|██████████| 145370673/145370673 [56:22<00:00, 42972.38it/s] 
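
`TextPreprocessing` comes from the project's `src.preprocessing` module, and its implementation is not shown in this notebook. The word list in the next section contains `nao` rather than `não`, which suggests that accents are removed at this stage. The sketch below is a hypothetical stand-in written under that assumption; the function name `simple_preprocess` and every step in it are illustrative guesses, not the project's actual code.

```python
import re
import unicodedata

def simple_preprocess(text: str) -> str:
    """Hypothetical stand-in for TextPreprocessing; NOT the project's real code."""
    text = text.lower()
    # Decompose accented characters (NFKD) and drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Replace anything that is not a-z or whitespace with a space.
    text = re.sub(r"[^a-z\s]", " ", text)
    # Collapse runs of whitespace and trim the ends.
    return re.sub(r"\s+", " ", text).strip()

cleaned = simple_preprocess("Não é possível, São Paulo!")
print(cleaned)
```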


## Word counts

In this section, we will count the number of occurrences of each word in the corpus.

In [7]:
from collections import Counter

word_counts = Counter()

for text in tqdm(texts):
    word_counts.update(text.lower().split())

print(f"Total of words: {len(word_counts)}")

100%|██████████| 145370673/145370673 [10:41<00:00, 226507.20it/s]

Total of words: 11646606
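
Note that `len(word_counts)` is the number of distinct words (the vocabulary size), whereas `sum(word_counts.values())` is the number of word occurrences. A minimal illustration of the difference, using made-up sample texts:

```python
from collections import Counter

# Made-up, already preprocessed sample texts.
sample_texts = ["o gato e o cachorro", "o gato dorme"]

sample_counts = Counter()
for text in sample_texts:
    sample_counts.update(text.split())

distinct_words = len(sample_counts)          # vocabulary size
total_tokens = sum(sample_counts.values())   # all word occurrences

print(distinct_words, total_tokens)
print(sample_counts.most_common(2))
```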





The 20 most frequent words in the corpus, each shown as a percentage of all word occurrences:

In [8]:
total_words = sum(word_counts.values())

for word, count in word_counts.most_common(20):
    print(f"{word}: {count / total_words * 100:.2f}%")

de: 4.83%
e: 3.97%
a: 3.56%
o: 2.91%
que: 2.46%
do: 1.71%
da: 1.47%
em: 1.31%
para: 1.19%
com: 1.03%
um: 0.96%
no: 0.88%
os: 0.84%
nao: 0.84%
uma: 0.79%
na: 0.73%
as: 0.68%
se: 0.64%
por: 0.62%
como: 0.51%


## Two-Letter Sequence (Bigram) Counts

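Now we turn to sequences of letters. The cell below strips all whitespace from each text and counts every pair of consecutive characters (a "bigram"), so a bigram can also span the boundary between two words. Pairs containing a non-alphabetic character are then discarded. The cell prints the number of distinct bigrams; the result, 676, equals 26 × 26, which is consistent with every pair of the letters a to z occurring at least once.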

In [9]:
from collections import Counter

bigram_counts = Counter()

for text in tqdm(texts):
    text = "".join(text.lower().split())
    bigram_counts.update(text[i : i + 2] for i in range(len(text) - 1))

# Remove bigrams with non-alphabetic characters
for bigram in list(bigram_counts):
    if not bigram.isalpha():
        del bigram_counts[bigram]

print(f"Total of bigrams: {len(bigram_counts)}")

100%|██████████| 145370673/145370673 [28:02<00:00, 86393.94it/s] 

Total of bigrams: 676
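
The counting logic can be checked on a tiny input. The sketch below wraps the same steps in a hypothetical helper, `count_bigrams` (a name introduced here for illustration), and runs it on a two-word sample.

```python
from collections import Counter

def count_bigrams(text: str) -> Counter:
    """Count alphabetic two-character sequences after removing whitespace."""
    joined = "".join(text.split())
    counts = Counter(joined[i : i + 2] for i in range(len(joined) - 1))
    # Drop bigrams that contain a non-alphabetic character.
    return Counter({bg: n for bg, n in counts.items() if bg.isalpha()})

sample = count_bigrams("bom dia")
print(sorted(sample))
```

Because whitespace is removed before counting, the sample yields the bigram `md`, which joins the last letter of `bom` with the first letter of `dia`.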





## Compute Bigram Frequencies

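Convert the bigram counts into scores between 0 and 1, where higher values represent more frequent bigrams. Each count is first expressed as a percentage of all bigram occurrences and then min-max normalized, so the most frequent bigram scores 1 and the least frequent scores 0. Because min-max normalization is unaffected by a constant scale factor, the intermediate multiplication by 100 does not change the final scores.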

In [10]:
pt_bigrams_freqs = {}

total_bigrams = sum(bigram_counts.values())
for bigram, count in bigram_counts.items():
    pt_bigrams_freqs[bigram] = count / total_bigrams * 100

# Min-max normalization: rescale so the rarest bigram maps to 0 and the most frequent to 1
min_freq = min(pt_bigrams_freqs.values())
max_freq = max(pt_bigrams_freqs.values())

for bigram, freq in pt_bigrams_freqs.items():
    pt_bigrams_freqs[bigram] = (freq - min_freq) / (max_freq - min_freq)
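
A small worked example may help to see the effect of the min-max step. The sketch below uses a hypothetical helper, `min_max`, and made-up counts for three bigrams; it also shows that scaling the inputs by 100 beforehand leaves the normalized result unchanged.

```python
def min_max(values: dict) -> dict:
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values.values()), max(values.values())
    return {key: (value - lo) / (hi - lo) for key, value in values.items()}

# Made-up counts for three bigrams.
toy_counts = {"de": 60, "qu": 30, "zx": 10}
total = sum(toy_counts.values())

# Normalize relative frequencies expressed as fractions and as percentages.
as_fraction = min_max({bg: n / total for bg, n in toy_counts.items()})
as_percent = min_max({bg: n / total * 100 for bg, n in toy_counts.items()})

print(as_fraction)
print(as_percent)
```

Both dictionaries agree up to floating-point rounding, with the most frequent bigram at 1 and the least frequent at 0. Note that the helper would divide by zero if all values were equal, which does not arise with real corpus counts.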

## Sort bigrams by frequency

In [11]:
pt_bigrams_freqs = dict(
    sorted(pt_bigrams_freqs.items(), key=lambda item: item[1], reverse=True)
)

## Save bigram frequencies

In [12]:
import json

with open("../src/bigrams/portuguese.json", "w", encoding="utf-8") as f:
    json.dump(pt_bigrams_freqs, f, indent=4, ensure_ascii=False)
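
One way the saved scores could feed a random string detector is to average the scores of a string's bigrams and compare the result against a threshold. The following is only a sketch under stated assumptions: the function name `mean_bigram_score`, the toy score table, and the threshold of 0.2 are all hypothetical and are not taken from the project.

```python
def mean_bigram_score(text: str, scores: dict) -> float:
    """Average the normalized scores of the alphabetic bigrams in `text`.

    Bigrams missing from `scores` count as 0 (never observed).
    """
    joined = "".join(text.lower().split())
    bigrams = [joined[i : i + 2] for i in range(len(joined) - 1)]
    bigrams = [bg for bg in bigrams if bg.isalpha()]
    if not bigrams:
        return 0.0
    return sum(scores.get(bg, 0.0) for bg in bigrams) / len(bigrams)

# Toy score table with made-up values; the real table would be loaded
# from the JSON file written above.
toy_scores = {"ca": 0.6, "as": 0.7, "sa": 0.5, "xq": 0.01, "qz": 0.0, "zk": 0.0}

THRESHOLD = 0.2  # hypothetical cut-off that would need tuning on real data

for candidate in ["casa", "xqzk"]:
    score = mean_bigram_score(candidate, toy_scores)
    label = "plausible" if score >= THRESHOLD else "random-looking"
    print(f"{candidate}: {score:.2f} -> {label}")
```

Averaging keeps the score comparable across strings of different lengths; other aggregations, such as the minimum bigram score, would be equally reasonable starting points.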