Imports for the notebook

In [1]:
import nltk
import re

Make a corpus available for local use with the library. If you try to use a corpus that has not been downloaded, NLTK raises a `LookupError`.

In [None]:
# Note that this corpus is in Spanish.
# The content is news text in Spanish from multiple sources.
nltk.download(
    'cess_esp',
    download_dir='../nltk_data'
)

[nltk_data] Downloading package cess_esp to ../nltk_data...
[nltk_data]   Unzipping corpora/cess_esp.zip.


True

NOTE: Running this cell downloads the corpus to the local machine.
If no `download_dir` is specified, look for the `nltk_data` folder in your home directory.

If you delete the folder where the corpus is stored, you will have to run this cell again to download it.

In [None]:
# Only run this if using a custom download_dir
import pathlib

# Add the path to the nltk_data directory
abs_path = pathlib.Path('../nltk_data').resolve()
nltk.data.path.append(str(abs_path))

In [9]:
# First get the corpus available for usage
corpus = nltk.corpus.cess_esp.sents()

In [10]:
# Check corpus content (it's already tokenized)
print(f"Sample corpus content: {corpus}")
print(f"Number of sentences in corpus: {len(corpus)}")

Sample corpus content: [['El', 'grupo', 'estatal', 'Electricité_de_France', '-Fpa-', 'EDF', '-Fpt-', 'anunció', 'hoy', ',', 'jueves', ',', 'la', 'compra', 'del', '51_por_ciento', 'de', 'la', 'empresa', 'mexicana', 'Electricidad_Águila_de_Altamira', '-Fpa-', 'EAA', '-Fpt-', ',', 'creada', 'por', 'el', 'japonés', 'Mitsubishi_Corporation', 'para', 'poner_en_marcha', 'una', 'central', 'de', 'gas', 'de', '495', 'megavatios', '.'], ['Una', 'portavoz', 'de', 'EDF', 'explicó', 'a', 'EFE', 'que', 'el', 'proyecto', 'para', 'la', 'construcción', 'de', 'Altamira_2', ',', 'al', 'norte', 'de', 'Tampico', ',', 'prevé', 'la', 'utilización', 'de', 'gas', 'natural', 'como', 'combustible', 'principal', 'en', 'una', 'central', 'de', 'ciclo', 'combinado', 'que', 'debe', 'empezar', 'a', 'funcionar', 'en', 'mayo_del_2002', '.'], ...]
Number of sentences in corpus: 6030


The corpus is a list of sentences, and each sentence is a list of tokens. To work with individual words, we flatten this nested structure into a single list.

In [11]:
flattened_corpus = [word for sentence in corpus for word in sentence]

In [12]:
print(f"Total words in corpus: {len(flattened_corpus)}")

Total words in corpus: 192686
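
As a side note, the same flattening can be sketched with `itertools.chain.from_iterable` from the standard library. The `sample_sentences` list below is a small hand-made stand-in for the corpus (an assumption for illustration, not the real data), so the cell runs without downloading anything.

```python
from itertools import chain

# Hypothetical stand-in for the corpus: a list of tokenized sentences
sample_sentences = [
    ['El', 'grupo', 'estatal'],
    ['Una', 'portavoz', 'de', 'EDF'],
]

# Nested list comprehension, as used in the notebook
flattened_with_comprehension = [
    word for sentence in sample_sentences for word in sentence
]

# Equivalent flattening with itertools
flattened_with_chain = list(chain.from_iterable(sample_sentences))

print(flattened_with_chain)
print(flattened_with_comprehension == flattened_with_chain)
```

Both approaches produce the same list; which one reads better is mostly a matter of taste.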


Use regular expressions (regex) to filter the text

In [19]:
# Filter words that contain 'es' anywhere (re.search scans the whole word)
# https://docs.python.org/3/library/re.html#search-vs-match
filtered_corpus = [
    word
    for word in flattened_corpus
    if re.search(
        r'es',
        word
    )
]

In [20]:
print(filtered_corpus[:5])

['estatal', 'jueves', 'empresa', 'centrales', 'francesa']
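
The pattern `r'es'` with `re.search` matches 'es' anywhere in the word, which is why 'jueves' and 'empresa' show up in the output. The sketch below contrasts `re.search`, `re.match` and `re.fullmatch` on a small hand-picked word list (`sample_words` is an illustrative assumption, not the corpus itself).

```python
import re

# Hypothetical sample; a few of these words appear in the corpus output
sample_words = ['estatal', 'jueves', 'empresa', 'grupo', 'es']

# re.search: 'es' may appear anywhere in the word
contains_es = [w for w in sample_words if re.search(r'es', w)]

# re.match: 'es' must appear at the beginning of the word
starts_with_es = [w for w in sample_words if re.match(r'es', w)]

# re.fullmatch: the whole word must be exactly 'es'
is_exactly_es = [w for w in sample_words if re.fullmatch(r'es', w)]

print(contains_es)     # ['estatal', 'jueves', 'empresa', 'es']
print(starts_with_es)  # ['estatal', 'es']
print(is_exactly_es)   # ['es']
```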


In [21]:
# Only get words that end with 'es'
filtered_corpus_end_word = [
    word
    for word in flattened_corpus
    if re.search(
        r'es$',
        word
    )
]

In [24]:
print(filtered_corpus_end_word[:5])

['jueves', 'centrales', 'millones', 'millones', 'dólares']


In [25]:
# Get words that start with 'es'
filtered_corpus_start_word = [
    word
    for word in flattened_corpus
    if re.search(
        r'^es',
        word
    )
]

In [26]:
print(filtered_corpus_start_word[:5])

['estatal', 'es', 'esta', 'esta', 'eso']
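
Note that `^es` is case-sensitive, so capitalized tokens such as a sentence-initial 'Esta' would be skipped. A minimal sketch of one way to handle this, using the `re.IGNORECASE` flag on a hand-made word list (`sample_words` is an assumption for illustration):

```python
import re

# Hypothetical sample mixing lowercase and capitalized words
sample_words = ['estatal', 'Esta', 'es', 'ESO', 'jueves']

# Default behaviour: only lowercase 'es' at the start matches
case_sensitive = [w for w in sample_words if re.search(r'^es', w)]

# With re.IGNORECASE, 'Es' and 'ES' at the start match as well
case_insensitive = [
    w for w in sample_words
    if re.search(r'^es', w, flags=re.IGNORECASE)
]

print(case_sensitive)    # ['estatal', 'es']
print(case_insensitive)  # ['estatal', 'Esta', 'es', 'ESO']
```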


For more complex filtering, we can use character sets and ranges: `[ghi]` matches any one of the listed letters, and a range such as `[a-z]` matches any lowercase ASCII letter.

In [27]:
# Get words that begin with 'g', 'h' or 'i'
filtered_corpus_range_example = [
    word
    for word in flattened_corpus
    if re.search(
        r'^[ghi]',
        word
    )
]

In [28]:
print(filtered_corpus_range_example[:5])

['grupo', 'hoy', 'gas', 'gas', 'intervendrá']
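
Since 'g', 'h' and 'i' are consecutive letters, the set `[ghi]` can also be written as the range `[g-i]`. One caveat for Spanish text: `[a-z]` covers only unaccented ASCII letters, so words containing 'á' or 'ñ' need those characters listed explicitly. The sketch below illustrates both points on a hand-made word list (`sample_words` and the extended character class are assumptions for illustration).

```python
import re

# Hypothetical sample with accented and non-ASCII Spanish letters
sample_words = ['grupo', 'hoy', 'intervendrá', 'jueves', 'Águila', 'ñandú', 'año']

# An explicit set and the equivalent range give the same result
set_matches = [w for w in sample_words if re.search(r'^[ghi]', w)]
range_matches = [w for w in sample_words if re.search(r'^[g-i]', w)]

# [a-z] only accepts unaccented lowercase ASCII letters
ascii_lowercase_only = [w for w in sample_words if re.fullmatch(r'[a-z]+', w)]

# Listing the accented vowels, 'ü' and 'ñ' covers lowercase Spanish words
with_spanish_letters = [
    w for w in sample_words if re.fullmatch(r'[a-záéíóúüñ]+', w)
]

print(set_matches == range_matches)  # True
print(ascii_lowercase_only)          # ['grupo', 'hoy', 'jueves']
print(with_spanish_letters)
```

'Águila' is excluded from the last list because the class contains only lowercase letters.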


By combining quantifiers (which indicate how many times a character or expression must match, e.g. `*` or `+`) with ranges, we can refine our filtering.

https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_expressions/Cheatsheet

In [34]:
# Get words that begin with 'no' repeated one or more times
filtered_corpus_quantifiers_example = [
    word
    for word in flattened_corpus
    if re.search(
        r'^(no)+',
        word
    )
]

In [36]:
print(filtered_corpus_quantifiers_example[:5])

['norte', 'no', 'no', 'noche', 'no']
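
The choice of quantifier matters here: `+` requires at least one 'no', while `*` also accepts zero occurrences and therefore matches every word. A small sketch on a hand-made word list (`sample_words` is an assumption for illustration) makes the difference visible:

```python
import re

# Hypothetical sample; 'nono' is included to show repeated matches
sample_words = ['norte', 'no', 'nono', 'grupo']

# '+' means one or more repetitions of the group (no)
one_or_more = [w for w in sample_words if re.search(r'^(no)+', w)]

# '*' means zero or more, so the empty prefix of any word matches
zero_or_more = [w for w in sample_words if re.search(r'^(no)*', w)]

# group(0) is the text actually consumed by the greedy quantifier
matched_prefixes = {
    w: re.search(r'^(no)+', w).group(0) for w in one_or_more
}

print(one_or_more)       # ['norte', 'no', 'nono']
print(zero_or_more)      # ['norte', 'no', 'nono', 'grupo']
print(matched_prefixes)  # {'norte': 'no', 'no': 'no', 'nono': 'nono'}
```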
