# **Practice 1. Introduction to Text Processing**

## 1. Introduction to Python

Python is a programming language that gives you considerable freedom in how you write code, as long as you follow its syntax rules. However, there are recommended good practices for programming in Python. We recommend consulting them before you start programming in Python.

* [Python guidelines: PEP 8](https://www.python.org/dev/peps/pep-0008/)

Other Python guidelines:

* [Python guidelines by Google](https://google.github.io/styleguide/pyguide.html).

Next, we review some basic programming concepts to start coding in Python.

### Lists and tuples

Sequences are another very important data type that we are going to use: tuples and lists. Both are ordered collections of elements: tuples are delimited by parentheses ( ) and lists by square brackets [ ].

Differences:
* A list can be altered, a tuple cannot. Tuples are "immutable".
* A tuple can be used as a key in a dictionary, a list cannot.
* A tuple consumes less memory than a list.

Some examples:

In [None]:
mi_lista = [1, 2, "5", 2]
mi_tupla = (1, 2, "5", 2)

In [None]:
# We can check whether or not an element is in a sequence
print(2 in mi_lista)
print(2 not in mi_tupla)

In [None]:
# We use len() to get the number of elements in the sequence.
print(len(mi_lista))
print(len(mi_tupla))

In [None]:
mi_lista.append("A") # Appends the character A to the end of the list
mi_lista.extend(["B", "C"]) # Appends the characters B and C to the end
mi_lista.insert(0, "D") # Inserts the character D at position 0 (the beginning)
mi_lista.remove(2) # Removes the first occurrence of the element 2
dato = mi_lista.pop(0) # Removes the first element and returns it
dato = mi_lista.pop() # By default, removes and returns the last element. Same as mi_lista.pop(-1)

print(mi_lista)
print(dato)

In [None]:
# Concatenating lists and tuples
lista_1 = [1, 2, 3]
lista_2 = [4, 5, 6]
lista3 = lista_1 + lista_2
print(lista3)

tupla_1 = (1, 2, 3)
tupla_2 = (4, 5, 6)
tupla_3 = tupla_1 + tupla_2
print(tupla_3)

In [None]:
# There are also several methods for searching and sorting
estudiantes = ['Rosa', 'Antonio', 'Ismael', 'Anabel', 'Miguel', 'Cristina', 'Lucas', 'Miguel']

estudiantes.reverse()   # Reverses the order of the elements
print(".reverse()", estudiantes)

estudiantes.sort()      # Sorts the elements (alphabetically for str)
print(".sort()", estudiantes)

estudiantes.sort(reverse=True)  # Sorts the elements in reverse order
print(".sort(reverse=True)", estudiantes)

print(f"Miguel appears {estudiantes.count('Miguel')} times.")   # Counts the occurrences of the given element
print(f"Miguel first appears at position {estudiantes.index('Miguel')}")   # Returns the position of the first occurrence
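
The differences between lists and tuples listed at the start of this section can be checked directly. The following is a minimal sketch with illustrative variable names (`my_list`, `my_tuple`, `distances` are not part of the notebook); the memory comparison relies on `sys.getsizeof`, whose exact values depend on the Python implementation.

```python
import sys

my_list = [1, 2, "5", 2]
my_tuple = (1, 2, "5", 2)

# Lists are mutable: item assignment works in place.
my_list[0] = 99

# Tuples are immutable: item assignment raises TypeError.
try:
    my_tuple[0] = 99
    tuple_is_immutable = False
except TypeError:
    tuple_is_immutable = True

# A tuple of hashable items can be a dictionary key; a list cannot.
distances = {(0, 0): "origin"}
try:
    distances[[1, 1]] = "invalid"
    list_is_hashable = True
except TypeError:
    list_is_hashable = False

print(my_list)
print(tuple_is_immutable, list_is_hashable)

# In CPython a tuple usually takes less memory than a list with the same items.
print(sys.getsizeof(my_tuple), sys.getsizeof(my_list))
```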

### Ranges

Ranges are a special type in Python: `range` returns an object that produces a sequence of integers from `start` (inclusive) to `stop` (exclusive), advancing by `step` (optional, 1 by default). If only one value is given, Python interprets it as `stop`, and `start` defaults to 0.

They are especially useful for iterating in a for loop.

In [None]:
print(list(range(6)))
print(list(range(0, 6, 2)))
print(list(range(5, -1, -1)))
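
As a small illustration of using a range inside a for loop, the sketch below iterates over the valid indices of a list. The list `students` is an illustrative name, not part of the notebook.

```python
students = ["Rosa", "Antonio", "Ismael"]

# range(len(students)) yields the valid indices 0, 1, 2
for i in range(len(students)):
    print(i, students[i])

# A range object is lazy: it stores only start, stop and step,
# but it still supports len() and membership tests.
r = range(0, 10, 3)
print(len(r), list(r), 9 in r)
```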

### Dictionaries

In Python, a dictionary is a collection of values that are accessed through a key. This means that instead of accessing information using a numerical index (position), as is the case with lists and tuples, it is possible to access **values** through their **keys**, which can be of various types.

Keys are **unique** within a dictionary, meaning that there cannot be a dictionary that has the same key twice. If a value is assigned to an existing key, the previous value is replaced.

There is no direct way to access a key through its value, and nothing prevents the same value from being assigned to different keys.

Since Python 3.7, dictionaries preserve the order in which keys were inserted. They are not sorted by key or by value, and items are looked up by key rather than by position.

Any **hashable object** (in practice, an immutable one) can be a **key** in a dictionary: strings, integers, tuples (with immutable values in their members), etc. **There are no restrictions on the values** that the dictionary can contain; any type can be the value: lists, strings, tuples, other dictionaries, objects...

Similar to lists, it is possible to define a dictionary directly with the members it will contain, or to initialize an empty dictionary and then add values one by one or in bulk.

To define it along with the members it will contain, the list of values is enclosed in curly braces, the key-value pairs are separated by commas, and the key and value are separated by a colon ":".

In [None]:
punto = {"x": 2, "y": 1, "z": 4}

materias = {}
materias["lunes"] = [6103, 7540]
materias["martes"] = [6201]
materias["miercoles"] = [6103, 7540]
materias["jueves"] = []
materias["viernes"] = [6201]

# To access the value associated with a given key, use the same syntax
# as with lists, but with the chosen key instead of the index.

valor = materias["lunes"]
print(valor)

# Values can also be accessed with the method
# "get(key, default)". If the key does not exist, it returns default.

valor = materias.get("sábado", [777])
print(valor)
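
The properties described in this section (unique keys, tuple keys, no direct lookup by value) can be seen in a short sketch. The names `point`, `grid` and `keys_with_value_one` are illustrative; the ordering shown assumes Python 3.7 or later, where dictionaries keep insertion order.

```python
point = {"x": 2, "y": 1}
point["z"] = 4    # adds a new key at the end
point["x"] = 10   # existing key: the value is replaced, the position is kept

keys_in_order = list(point)  # keys in insertion order
print(keys_in_order)

# A tuple of immutable members is a valid key
grid = {(0, 0): "origin", (1, 0): "east"}
print(grid[(1, 0)])

# Iterating over key-value pairs
for key, value in point.items():
    print(key, value)

# There is no direct lookup by value: it requires a search over the items
keys_with_value_one = [k for k, v in point.items() if v == 1]
print(keys_with_value_one)
```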

### Methods

In Python, a function is defined with the `def` keyword followed by a descriptive function `name`, which follows the same rules as variable names, and then opening and closing parentheses. The function header ends with a colon (:). The body of the function is indented, by convention with 4 spaces:

In [None]:
def my_method():
    print('Hello world!')

my_method()

When defining the function, you can specify as many arguments or input parameters as needed, which may or may not have default values.
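
A minimal sketch of parameters with and without default values follows. The function `greet` and its parameters are hypothetical names chosen for illustration.

```python
def greet(name, greeting="Hello", punctuation="!"):
    """Return a greeting; `greeting` and `punctuation` have default values."""
    return f"{greeting}, {name}{punctuation}"

print(greet("Ana"))                   # uses both defaults
print(greet("Ana", "Hi"))             # overrides `greeting` by position
print(greet("Ana", punctuation="?"))  # overrides `punctuation` by keyword
```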

### Files

Python has several modes for reading and writing files, which are specified as a parameter to the open() function:

* "r": Read - Default value. Opens an existing file for reading; raises an error if the file does not exist.
* "a": Append - Opens a file for appending content; creates the file if it does not exist.
* "w": Write - Opens a file for writing content, discarding any existing content; creates the file if it does not exist.
* "x": Create - Creates a new file; raises an error if the file already exists.
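
The four modes can be tried safely inside a temporary directory. This is a sketch only: the file name `notes.txt` is hypothetical, and `tempfile` is used so that nothing is left on disk.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "notes.txt")  # hypothetical file name

    # "w" creates the file (or empties it if it already exists)
    with open(path, "w", encoding="utf-8") as f:
        f.write("first line\n")

    # "a" keeps the existing content and writes at the end
    with open(path, "a", encoding="utf-8") as f:
        f.write("second line\n")

    # "r" reads the file back
    with open(path, "r", encoding="utf-8") as f:
        lines = f.read().splitlines()

    # "x" refuses to overwrite an existing file
    try:
        open(path, "x")
        exclusive_create_failed = False
    except FileExistsError:
        exclusive_create_failed = True

print(lines, exclusive_create_failed)
```

Using `with` closes the file automatically, even if an error occurs inside the block.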



### Libraries

A library is a collection of modules that contain code that can be reused in different programs. Python ships with a large standard library, and many more libraries can be installed with the shell command `pip install <library>` (in Colab, a leading `!` is used to execute shell commands).

There are several ways to import a library:

* `import <library>`
* `import <library> as <alias>`
* `from <library> import <module>`

Some of the most commonly used standard-library modules are:

* `os` - Operating system dependent functionalities.
* `math` - Mathematical functions.
* `random` - Generation of pseudo-random numbers.
* `datetime` - Date-related functions.

Some of the most commonly used installable libraries are listed below (being so popular, Colab has them pre-installed, but to use them locally on your computer you would have to install them):

* `numpy` - For working with numerical arrays. Typically imported under the alias np (import numpy as np).
* `pandas` - For analysis of datasets in csv, tsv, xlsx, etc. files. Typically imported under the alias pd (import pandas as pd).
* `sklearn` - For machine learning (installed with `pip install scikit-learn`).
* `polars` - For data manipulation.

In [None]:
from datetime import datetime
print(f"Current date and time: {datetime.now()}") # Get the current date and time

import math
print(math.sqrt(144))   # Square root

import numpy as np
array_aleatorio = np.random.rand(5) # Array of size 5 with random values.
print(array_aleatorio)
print(array_aleatorio.argmax()) # Index of the maximum value in the array.

## 2. Text preprocessing

### What is **NLTK**?

[NLTK](http://www.nltk.org/) is a library that provides interfaces for easily using a large number of lexical resources, as well as methods for text processing, analysis, and classification.

The library has an associated book, which, in addition to instructing on its use, explains many concepts of NLP: http://www.nltk.org/book/.

#### 1. Installing NLTK in the notebook

This notebook has some dependencies, most of which can be installed through the Python package manager `pip`.


In [None]:
!pip install nltk

### Importing and downloading linguistic resources

Using NLTK requires importing it. NLTK is more than just a library, as it offers the download of linguistic resources.

In [None]:
import nltk
nltk.download()

### Text preprocessing

Preprocessing starts with *tokenization* and sentence splitting. NLTK provides specific resource packages for these operations, such as *punkt*.

To download it: `nltk.download('punkt')`. Recent NLTK releases also need the `punkt_tab` resource, which the cell below downloads as well.

In [1]:
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

#### 1. Sentence splitting

The standard method for sentence splitting is:

```
sent_tokenize(text, language='english')
```

This function splits the text passed as an argument into sentences, using a model for the language we want to analyze. These pre-trained Punkt models capture how sentences begin and end in each language, and they are available for 17 European languages (Spanish, English, Dutch, French...). By default, if no language is specified, the English model is used.

Let's see an example. First, we import the sent_tokenize function and then call it, passing the text we want to split as an argument. The data type it returns is a list containing the sentences of the text.

In [None]:
from nltk.tokenize import sent_tokenize

In [None]:
text = "Esto es una oración de prueba. ¿También divide las preguntas, Sr. Smith? Además, este tokenizador no separa por comas."
sent_tokenize(text, language="spanish")

#### 2. Splitting sentences into words (*tokenization*)

Once the text has been split into sentences, let's see how to divide a sentence into words, or more precisely into tokens. The basic form of *tokenization* separates the text into tokens using spaces and punctuation marks. To do so, we are going to use the *TreebankWordTokenizer* (although there are many others).

First we import the tokenizer and then instantiate the class.

In [None]:
from nltk.tokenize import TreebankWordTokenizer

In [None]:
tokenizer = TreebankWordTokenizer()

In [None]:
text = "Esto es una oración de prueba. ¿También divide las preguntas, Sr. Smith? Además, este tokenizador no separa por comas."
tokenizer.tokenize(text)

NLTK provides other tokenizers such as `RegexpTokenizer`, `WhitespaceTokenizer`, `SpaceTokenizer`, `WordPunctTokenizer`, etc., which you should try out to complete the exercises in this practice.

#### 3. Stopwords removal

Stop words are words that lack meaning on their own. They are usually articles, pronouns, prepositions...

In some Natural Language Processing tasks, it is useful to remove these words, so next we are going to see how we could eliminate the stop words that are part of a set of tokens.

NLTK has a list of stop words for different languages. Let's see how it is used:

In [None]:
from nltk.corpus import stopwords
nltk.download('stopwords')

In [None]:
spanish_stops = stopwords.words('spanish')
print(spanish_stops)

Next, given a list of words or tokens, we are going to filter them to remove those words that are considered stop words:

In [None]:
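words = ['Esto', 'es', 'una', 'oración', 'de', 'prueba.', '¿También', 'divide', 'las', 'preguntas', ',', 'Sr.', 'Smith', '?', 'Además', ',', 'este', 'tokenizador', 'no', 'separa', 'por', 'comas', '.']

# Equivalent loop form of the list comprehension below:
# filtered = []
# for word in words:
#     if word not in spanish_stops:
#         filtered.append(word)
# Note: the comparison is case-sensitive and the NLTK list is lowercase,
# so capitalized tokens such as 'Esto' are not removed.
filtered = [word for word in words if word not in spanish_stops]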

print(filtered)
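
Because the stop list is lowercase, a common variant lowercases each token before the comparison and also drops punctuation tokens. The sketch below illustrates the idea with a small hand-written stop list (`sample_stops`) that stands in for `stopwords.words('spanish')`; the token list is a simplified version of the one above.

```python
import string

# Hypothetical mini stop list; in the notebook, use stopwords.words('spanish')
sample_stops = {"esto", "es", "una", "de", "las", "este", "no", "por"}

words = ['Esto', 'es', 'una', 'oración', 'de', 'prueba', '.', 'Además', ',',
         'este', 'tokenizador', 'no', 'separa', 'por', 'comas', '.']

# ASCII punctuation plus the Spanish opening marks
punctuation = set(string.punctuation) | {"¿", "¡"}

filtered = [
    w for w in words
    if w.lower() not in sample_stops and w not in punctuation
]
print(filtered)
```

Storing the stop words in a set also makes each membership test faster than searching a list.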

#### 4. Stemming

`Stemming` is the technique used to remove the affixes of a word with the objective of obtaining its root or stem. For example, the stem of 'biblioteca' is 'bibliotec'.

This method is often used in information retrieval systems for word indexing because, instead of storing all forms of a word, it allows storing only the stems, reducing the index size and improving the results.

There are different stemming algorithms: Porter Stemmer, Lancaster Stemmer, Snowball Stemmer...

NLTK has an implementation of some of these algorithms that are very easy to use. Simply instantiate the class, for example, PorterStemmer, and call the stem() method with the word for which you want to obtain its stem.

Next, we are going to see an example of how to obtain the stems of a list of tokens using the Snowball algorithm:

In [None]:
from nltk.stem.snowball import SnowballStemmer

In [None]:
stemmer = SnowballStemmer("spanish")

In [None]:
print(stemmer.stem("corriendo"))
print(stemmer.stem("biblioteca"))
print(stemmer.stem("aburridos"))

#### 5. BPE: The tokenizer of some LLMs (ChatGPT among others)

Byte-Pair Encoding (BPE) was initially developed as an algorithm for text compression, and OpenAI used it for tokenization during the pre-training of the GPT model, although today it is used by many other Transformer models such as the GPT family, RoBERTa, Llama-3, or Gemma.

BPE iteratively replaces the most frequent pair of elements with a new element that was not contained in the initial dataset until the desired vocabulary size is reached. For example:

Starting with a corpus with the following 5 words:

```
"hug", "pug", "pun", "bun", "hugs"
```

The initial vocabulary is `["b", "g", "h", "n", "p", "s", "u"]`. At each step of the training of this tokenizer, the algorithm searches for the most frequent pair of consecutive tokens and merges them. Thus, the first rule learned by this tokenizer would be `("u", "g") -> "ug"`, because this pair appears 3 times in the corpus (in "hug", "pug", and "hugs"), resulting in the following updated vocabulary: `["b", "g", "h", "n", "p", "s", "u", "ug"]`. This process is repeated as many times as necessary until the desired vocabulary size is reached (at this point the vocabulary has 8 tokens). The vocabulary of modern tokenizers is usually around 100k tokens.
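
The training loop described above can be sketched in a few lines of plain Python. This is a simplified illustration, not the implementation used by any production tokenizer: every word counts once, there is no byte-level handling, and ties between equally frequent pairs are broken by first appearance. The function names (`count_pairs`, `merge_pair`, `train_bpe`) are hypothetical.

```python
from collections import Counter

def count_pairs(words):
    """Count adjacent symbol pairs over all words (each word is a list of symbols)."""
    pairs = Counter()
    for symbols in words:
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += 1
    return pairs

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with the merged symbol."""
    merged = []
    for symbols in words:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged.append(out)
    return merged

def train_bpe(corpus, vocab_size):
    """Learn merge rules until the vocabulary reaches `vocab_size`."""
    words = [list(w) for w in corpus]                 # start from single characters
    vocab = sorted({ch for w in corpus for ch in w})  # initial vocabulary
    merges = []
    while len(vocab) < vocab_size:
        pairs = count_pairs(words)
        if not pairs:
            break                                     # nothing left to merge
        best = max(pairs, key=pairs.get)              # most frequent pair
        words = merge_pair(words, best)
        merges.append(best)
        vocab.append(best[0] + best[1])
    return vocab, merges, words

corpus = ["hug", "pug", "pun", "bun", "hugs"]
vocab, merges, words = train_bpe(corpus, vocab_size=8)
print(vocab)
print(merges)
print(words)
```

Raising `vocab_size` makes the loop learn further merges on top of `"ug"`.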

Pre-trained tokenizers from OpenAI can be used through the Tiktoken library (https://github.com/openai/tiktoken).

```
!pip install tiktoken

import tiktoken
encoding = tiktoken.encoding_for_model("gpt-4o-mini")

encoded_text = encoding.encode("tiktoken is great!")
print(encoded_text)

print(encoding.decode(encoded_text))
print([encoding.decode_single_token_bytes(token) for token in encoded_text])

```

* More info at: https://huggingface.co/learn/nlp-course/en/chapter6/5

* Testing tools: https://tiktokenizer.vercel.app/?model=gpt-3.5-turbo

In [None]:
!pip install tiktoken

import tiktoken
encoding = tiktoken.encoding_for_model("gpt-4o-mini")

encoded_text = encoding.encode("Fragmentación de oraciones mediante el tokenizador de GPT4o.")

print("Encoded text:", encoded_text)
print("Decoded text:", encoding.decode(encoded_text))
print("Individual tokens:", [encoding.decode_single_token_bytes(token) for token in encoded_text])

## Exercises

The results of this first practice should be submitted to PLATEA by **11:59 PM on February 17, 2025**. You should submit this same notebook with the .ipynb extension, renaming it as follows: pr1_user1_user2.ipynb. Replace "user1" and "user2" with your email aliases.

For the development of these exercises, you must use the collection of documents from PLATEA in the "Material Complementario" folder called "colección_SciELO_PLN".

This collection is composed of 25 files in XML format. You should process each file and consider the text included in the <dc:description xml:lang="en"> tag.
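
Before processing the real files, the tag lookup can be tried on a small inline document. The record below is a made-up example that only imitates the structure described above (its element names other than `dc:description` are assumptions); the namespace URIs are the standard Dublin Core and XML ones.

```python
import xml.etree.ElementTree as ET

# Hypothetical record imitating the structure of the collection
sample_xml = """<record xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Example title</dc:title>
  <dc:description xml:lang="es">Resumen en español.</dc:description>
  <dc:description xml:lang="en">Abstract in English.</dc:description>
</record>"""

root = ET.fromstring(sample_xml)

# Map the prefix used in the XPath expression to the Dublin Core namespace
namespaces = {"dc": "http://purl.org/dc/elements/1.1/"}
# ElementTree exposes xml:lang under the fully qualified XML namespace
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

english = [
    el.text
    for el in root.iterfind(".//dc:description", namespaces)
    if el.get(XML_LANG) == "en"
]
print(english)
```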

In [2]:
import os
import xml.etree.ElementTree as ET
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [23]:
data_path = "/content/drive/MyDrive/NLP/colección_SciELO_PLN"
# Create an empty list to store the description of each file
file_data = []
# Iterate through each file in the data directory
for filename in os.listdir(data_path):
  if filename.endswith(".xml"):
    filepath = os.path.join(data_path, filename)
    tree = ET.parse(filepath)
    root = tree.getroot() # Get the root element of the XML tree
    description_element = root.find(".//{http://purl.org/dc/elements/1.1/}description[@{http://www.w3.org/XML/1998/namespace}lang='en']")
    if description_element is not None:
      description = description_element.text
      file_data.append({
          "filename": filename,
          "description": description
      })


for item in file_data:
    print(f"File: {item['filename']}")
    print(f"Description: {item['description']}")

File: S0211-69952009000500006.xml
Description: Hemodialysis (HD) patients have an impaired response to hepatitis B (HB) vaccines, and the persistence of immunity, the efficacy of revaccination and the periodicity of postvaccination testing are not well defined. We present the experience during 18 years in an outpatient dialysis center of 136 HD patients who completed a HB vaccination program consisting in 3 doses of 40 µg intramuscular recombinant B vaccine (Engerix-B). In all patients anti-HBs titers were determined annually and in 31 patients every 6 months. Nonresponders patients and responders patients that lost their antibodies (<10 UI/ml) received annually a booster double dose of vaccine. Seventy-four patients (54.4%) developed immunity and the remaining 62 patients were considered nonresponders. When compared both groups, gender and the etiology of chronic kidney disease did not differ between the two groups; nevertheless, nonresponders patients were significantly older than re

### Exercise 1

Create a function that splits the texts into sentences, using the sent_tokenize function. The function should display the average number of sentences per file analyzed, the name of the file containing the fewest sentences, and the name of the file containing the most sentences.

In [4]:
import nltk
from nltk.tokenize import sent_tokenize

In [25]:
for item in file_data:
    text = item['description']
    sentences = sent_tokenize(text, language="english")
    item['sentence_count'] = len(sentences)

In [26]:
total_sentences = sum(item["sentence_count"] for item in file_data)
average_sentences = total_sentences / len(file_data)

min_file = min(file_data, key=lambda item: item["sentence_count"])
max_file = max(file_data, key=lambda item: item["sentence_count"])

In [27]:
print(f"Average amount of sentences in file: {average_sentences}")
print(f"File with fewest sentences: {min_file['filename']} with {min_file['sentence_count']} sentences")
print(f"File with most sentences: {max_file['filename']} with {max_file['sentence_count']} sentences")

Average amount of sentences in file: 10.84
File with fewest sentences: S0211-69952009000500015.xml with 2 sentences
File with most sentences: S0211-69952009000500008.xml with 24 sentences


### Exercise 2

Create a program that splits the texts into sentences. Subsequently, perform word tokenization using the `WordPunctTokenizer` class. Finally, the function should display the average number of words per file, the file containing the fewest words, and the file containing the most words.

In [28]:
from nltk import WordPunctTokenizer

In [29]:
tokenizer = WordPunctTokenizer()

In [30]:
for item in file_data:
    text = item['description']
    sentences = sent_tokenize(text, language="english")
    words = []
    for sentence in sentences:
        words.extend(tokenizer.tokenize(sentence))
    item['word_count'] = len(words)

In [31]:
total_words = sum(item["word_count"] for item in file_data)
average_words = total_words / len(file_data)

min_file = min(file_data, key=lambda item: item["word_count"])
max_file = max(file_data, key=lambda item: item["word_count"])

In [32]:
print(f"Average amount of words in file: {average_words}")
print(f"File with fewest words: {min_file['filename']} with {min_file['word_count']} words")
print(f"File with most words: {max_file['filename']} with {max_file['word_count']} words")

Average amount of words in file: 294.08
File with fewest words: S0211-69952009000500015.xml with 26 words
File with most words: S0211-69952009000500012.xml with 666 words


### Exercise 3

Split the sentence shown below into tokens using the following tokenizers: TreebankWordTokenizer, WhitespaceTokenizer, SpaceTokenizer, and WordPunctTokenizer from NLTK and gpt-4o-mini from Tiktoken.

You have to explain the differences among the tokenizers using the following sentence:


In [33]:
sentence = "Sorry, I can't go to the meeting.\n"

In [35]:
from nltk import TreebankWordTokenizer
treebank_tokenizer = TreebankWordTokenizer()  # separate name so the class is not shadowed
print(treebank_tokenizer.tokenize(sentence))

['Sorry', ',', 'I', 'ca', "n't", 'go', 'to', 'the', 'meeting', '.']


In [36]:
from nltk import WhitespaceTokenizer
whitespace_tokenizer = WhitespaceTokenizer()  # separate name so the class is not shadowed
print(whitespace_tokenizer.tokenize(sentence))

['Sorry,', 'I', "can't", 'go', 'to', 'the', 'meeting.']


In [37]:
from nltk import SpaceTokenizer
space_tokenizer = SpaceTokenizer()  # separate name so the class is not shadowed
print(space_tokenizer.tokenize(sentence))

['Sorry,', 'I', "can't", 'go', 'to', 'the', 'meeting.\n']


In [34]:
from nltk import WordPunctTokenizer
wordpunct_tokenizer = WordPunctTokenizer()  # separate name so the class is not shadowed
print(wordpunct_tokenizer.tokenize(sentence))

['Sorry', ',', 'I', 'can', "'", 't', 'go', 'to', 'the', 'meeting', '.']


In [39]:
!pip install tiktoken

Collecting tiktoken
  Downloading tiktoken-0.9.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.7 kB)
Downloading tiktoken-0.9.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)
Installing collected packages: tiktoken
Successfully installed tiktoken-0.9.0


In [42]:
import tiktoken
encoding = tiktoken.encoding_for_model("gpt-4o-mini")

encoded_text = encoding.encode(sentence)
print(encoded_text)
print([encoding.decode_single_token_bytes(token) for token in encoded_text])

[33680, 11, 357, 8535, 810, 316, 290, 9176, 558]
[b'Sorry', b',', b' I', b" can't", b' go', b' to', b' the', b' meeting', b'.\n']


## Explanation of Tokenizer Differences

**TreebankWordTokenizer:** Splits contractions (e.g., "can't" -> "ca", "n't"). Separates punctuation into its own tokens (here the comma and the final period) and drops the trailing newline.

**WhitespaceTokenizer:** Splits on all whitespace (spaces, tabs, newlines). Keeps punctuation attached.

**SpaceTokenizer:** Splits only on spaces. Other whitespace (like \n) is part of the token.

**WordPunctTokenizer:** Splits all punctuation separately. Splits contractions.

**Tiktoken (gpt-4o-mini):** Uses Byte Pair Encoding (BPE). Here the contraction " can't" is kept as a single token. The preceding space is stored as part of the token (e.g., " go"), and the final newline is merged with the period into ".\n". Designed for LLMs.

**Key Differences:**

*   **Contractions:** Treebank & WordPunct split. Whitespace & Space keep together. Tiktoken keeps together (but can vary).
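*   **Punctuation:** Treebank & WordPunct separate it. Whitespace & Space leave it attached to words. Tiktoken separates the comma but merges the period with the newline.
*   **Whitespace:** Whitespace splits on all. Space only on spaces. Tiktoken keeps whitespace inside its tokens.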
*   **Subwords:** Tiktoken uses subwords. Others don't.
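
Three of these behaviours can be approximated with the standard library alone, which makes the differences easier to see. This is a sketch, not the NLTK implementation: `str.split()` mimics `WhitespaceTokenizer`, `str.split(" ")` mimics `SpaceTokenizer`, and the regular expression is the pattern that the NLTK documentation gives for `WordPunctTokenizer`.

```python
import re

sentence = "Sorry, I can't go to the meeting.\n"

# Split on any run of whitespace (spaces, tabs, newlines)
whitespace_tokens = sentence.split()

# Split only on the space character; the trailing "\n" stays in the last token
space_tokens = sentence.split(" ")

# Runs of word characters, or runs of characters that are neither word nor space
wordpunct_tokens = re.findall(r"\w+|[^\w\s]+", sentence)

print(whitespace_tokens)
print(space_tokens)
print(wordpunct_tokens)
```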

### Exercise 4

Create a tokenizer based on regular expressions using the RegexpTokenizer class from NLTK that extracts only the words present in the text, meaning it should not return punctuation marks, tabs/line breaks, etc., as output.

Furthermore, the tokenizer should not separate contractions in the text.

What are the tokens extracted if we pass the following sentence to it?


In [43]:
sentence = "Sorry, I can't go to the meeting.\n"

In [45]:
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r"[a-zA-Z]+(?:'[a-z]+)?")

tokens = tokenizer.tokenize(sentence)
print(tokens)

['Sorry', 'I', "can't", 'go', 'to', 'the', 'meeting']


### Exercise 5

Using the SFU corpus, composed of 400 opinion documents from 8 different domains (books, cars, computers, kitchen utensils, hotels, movies, music, and phones), the following operations should be performed:

**Note**: The corpus is located in the "Material Complementario" section of PLATEA.

* Show the size of the vocabulary (unique tokens) of each domain (using 2 tokenizers, one of them based on BPE).
* Show the total number of stop words per domain (using 2 tokenizers, one of them based on BPE).
* Show the percentage of stop words in relation to the number of unique tokens and unique words (without punctuation marks; using 2 tokenizers, one of them based on BPE).
* Show the 5 most common stems in each domain, obviously without considering stop words (using 2 tokenizers, one of them based on BPE).




In [56]:
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer
from collections import Counter
from nltk import TreebankWordTokenizer

nltk.download('stopwords')

domains = []
path = "/content/drive/MyDrive/NLP/SFU"

for item in os.listdir(path):
  item_path = os.path.join(path, item)
  if os.path.isdir(item_path):
    domains.append(item)

stop_words = set(stopwords.words('english'))
stemmer = SnowballStemmer("english")

tokenizer = TreebankWordTokenizer()
bpe_tokenizer = tiktoken.encoding_for_model("gpt-4o-mini")

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [70]:
def analyze_domain(domain_path, tokenizer, bpe_tokenizer):
    all_words = []
    bpe_words = []

    for file in os.listdir(domain_path):
        if file.endswith(".txt"):
            file_path = os.path.join(domain_path, file)
            with open(file_path, "r", encoding='ascii', errors='ignore') as f:
                text = f.read()
                words = tokenizer.tokenize(text)
                all_words.extend(words)
                bpe_tokens = bpe_tokenizer.encode(text)
                bpe_decoded_tokens = [bpe_tokenizer.decode([token]) for token in bpe_tokens]
                bpe_words.extend(bpe_decoded_tokens)

    return all_words, bpe_words

def statistics(words, bpe_words, domain):
  # Size of the vocabulary (unique tokens) of each tokenizer
  vocabulary = set(words)
  bpe_vocabulary = set(bpe_words)

  # Stop words per domain. This counts unique stop-word types in the
  # vocabulary, not total occurrences, and the match is exact: capitalized
  # tokens ("The") and BPE tokens with a leading space (" the") do not match.
  stop_words_count = sum(1 for w in vocabulary if w in stop_words)
  bpe_stop_words_count = sum(1 for w in bpe_vocabulary if w in stop_words)

  # Unique words without punctuation marks (alphanumeric tokens only)
  vocabulary_no_punct = [w for w in vocabulary if w.isalnum()]
  bpe_vocabulary_no_punct = [w for w in bpe_vocabulary if w.isalnum()]

  # Percentage of stop words in relation to the unique words
  stop_words_percentage = (stop_words_count / len(vocabulary_no_punct)) * 100 if vocabulary_no_punct else 0
  bpe_stop_words_percentage = (bpe_stop_words_count / len(bpe_vocabulary_no_punct)) * 100 if bpe_vocabulary_no_punct else 0

  # Stems of the unique non-stop words. Each count is the number of distinct
  # vocabulary entries sharing a stem, not the frequency in the corpus.
  stemmed_words = [stemmer.stem(w) for w in vocabulary_no_punct if w not in stop_words]
  bpe_stemmed_words = [stemmer.stem(w) for w in bpe_vocabulary_no_punct if w not in stop_words]

  stem_counts = Counter(stemmed_words)
  bpe_stem_counts = Counter(bpe_stemmed_words)

  # The 5 most common stems in each domain
  most_common_stems = stem_counts.most_common(5)
  bpe_most_common_stems = bpe_stem_counts.most_common(5)


  print(f"Domain: {domain}")
  print(f"Vocabulary size: {len(vocabulary)}")
  print(f"Vocabulary size (BPE): {len(bpe_vocabulary)}")
  print(f"Stop words count: {stop_words_count}")
  print(f"Stop words count (BPE): {bpe_stop_words_count}")
  print(f"Stop words percentage: {stop_words_percentage}%")
  print(f"Stop words percentage (BPE): {bpe_stop_words_percentage}%")
  print(f"Most common stems: {most_common_stems}")
  print(f"Most common stems (BPE): {bpe_most_common_stems}")
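
The function above compares raw tokens with the stop list, so capitalized words and BPE tokens that start with a space are never counted as stop words. One possible variant, shown here only as a sketch, normalizes each token first and counts total occurrences instead of unique types. The stop list `sample_stops`, the token lists and the helper names are hypothetical; in the notebook the real inputs would come from `analyze_domain` and `stopwords.words('english')`.

```python
from collections import Counter

# Hypothetical mini stop list; the notebook uses stopwords.words('english')
sample_stops = {"the", "is", "a", "and", "of"}

def normalize(token):
    """Strip surrounding whitespace (BPE tokens often start with a space) and lowercase."""
    return token.strip().lower()

def stop_word_stats(tokens, stops):
    """Return (total stop-word occurrences, unique stop words, % of word tokens)."""
    normalized = [normalize(t) for t in tokens]
    words = [t for t in normalized if t.isalnum()]  # drop punctuation tokens
    counts = Counter(words)
    total_stops = sum(n for w, n in counts.items() if w in stops)
    unique_stops = sum(1 for w in counts if w in stops)
    share = 100 * total_stops / len(words) if words else 0.0
    return total_stops, unique_stops, share

# Made-up token lists imitating a word tokenizer and a BPE tokenizer
word_tokens = ["The", "plot", "is", "a", "mess", ",", "and", "the", "ending", "is", "weak", "."]
bpe_like_tokens = ["The", " plot", " is", " a", " mess", ",", " and", " the", " ending", " is", " weak", "."]

print(stop_word_stats(word_tokens, sample_stops))
print(stop_word_stats(bpe_like_tokens, sample_stops))
```

With normalization, both tokenizations give the same stop-word statistics for this sentence, which makes the two columns of results easier to compare.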


In [66]:
print(domains)

['BOOKS', 'CARS', 'COMPUTERS', 'COOKWARE', 'HOTELS', 'MOVIES', 'MUSIC', 'PHONES']


In [71]:
for domain in domains:
    domain_path = os.path.join(path, domain)
    all_words, bpe_words = analyze_domain(domain_path, tokenizer, bpe_tokenizer)
    statistics(all_words, bpe_words, domain)

Domain: BOOKS
Vocabulary size: 5553
Vocabulary size (BPE): 5760
Stop words count: 118
Stop words count (BPE): 53
Stop words percentage: 2.619893428063943%
Stop words percentage (BPE): 4.985888993414863%
Most common stems: [('narrat', 8), ('murder', 8), ('like', 7), ('believ', 7), ('develop', 6)]
Most common stems (BPE): [('ell', 4), ('ate', 4), ('ere', 4), ('ation', 4), ('ing', 4)]
Domain: CARS
Vocabulary size: 7939
Vocabulary size (BPE): 7460
Stop words count: 130
Stop words count (BPE): 66
Stop words percentage: 2.2146507666098807%
Stop words percentage (BPE): 3.9927404718693285%
Most common stems: [('acceler', 9), ('impress', 8), ('posit', 8), ('engin', 8), ('adjust', 7)]
Most common stems (BPE): [('ate', 5), ('anc', 5), ('age', 5), ('ide', 4), ('ine', 4)]
Domain: COMPUTERS
Vocabulary size: 7135
Vocabulary size (BPE): 6935
Stop words count: 118
Stop words count (BPE): 65
Stop words percentage: 2.1965748324646315%
Stop words percentage (BPE): 4.049844236760125%
Most common stems: [('

In [64]:
import chardet

with open("/content/drive/MyDrive/NLP/SFU/BOOKS/no1.txt", 'rb') as f:  # Open the file in binary mode
    result = chardet.detect(f.read())

print(result['encoding'])  # Prints the detected encoding

ascii
