{{< include _include_d4.qmd >}}

## Tokenization

First, we install the required packages and their associated language data. These downloads may require a lot of free disk space!

In [None]:
#| eval: false
#| echo: true
#| output: true

!pip install -q nltk spacy
!python -m nltk.downloader popular
!python -m spacy download en_core_web_sm

# downloads to ~/nltk_data/
#import nltk; nltk.download('stopwords')
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
stop_words.extend(['from', 'subject', 're', 'edu', 'use'])

Let's assume the following text data:

In [None]:
#| eval: true
#| echo: true
#| output: true

import pandas as pd

# Read the remote TSV file into a Pandas DataFrame named 'df'
df = pd.read_csv('https://raw.githubusercontent.com/nils-holmberg/socs-qmd/main/txt/zen_of_python.tsv', sep='\t')

# Display the first few rows of the DataFrame
df.head()

To tokenize the text data using the Natural Language Toolkit (NLTK) package, you can follow these steps:

1. First, import the necessary NLTK library: `from nltk.tokenize import word_tokenize`.
2. Create an empty DataFrame to store the tokenized words along with their corresponding 'id' from the original text.
3. Loop through each row of the original DataFrame (`df`), tokenize the text in the 'text' column using `word_tokenize()`, and append the tokens along with their 'id' to the new DataFrame.

The following code implements these steps:

In [None]:
#| eval: true
#| echo: true
#| output: true

import pandas as pd
from nltk.tokenize import word_tokenize

# Create an empty DataFrame to store tokens and ids
tokens_df = pd.DataFrame(columns=['id', 'token'])

# Loop through each row in the original DataFrame
for index, row in df.iterrows():
    id_value = row['id']
    text_value = row['text']
    
    # Tokenize the text
    tokens = word_tokenize(text_value)
    
    # Create a temporary DataFrame to hold tokens and ids
    temp_df = pd.DataFrame({'id': [id_value]*len(tokens), 'token': tokens})
    
    # Append to the main DataFrame
    tokens_df = pd.concat([tokens_df, temp_df], ignore_index=True)

# Show the first few rows of the resulting DataFrame
tokens_df.head()

Running this code will create a new DataFrame `tokens_df` that contains one token per row, along with the original 'id' to associate each token with its originating text.
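
The row-by-row loop above is easy to follow, but calling `pd.concat` inside a loop gets slow on larger corpora. As a minimal sketch of an alternative, the same long-format table can be built with `apply` and `explode`. The two-row `sample` DataFrame and the regex-based `simple_tokenize` function are stand-ins introduced here for illustration (they are not part of the original material); with the NLTK data installed, `word_tokenize` can be passed to `apply` instead.

```python
import re
import pandas as pd

# Hypothetical two-row sample with the same 'id' and 'text' columns as df
sample = pd.DataFrame({
    "id": [1, 2],
    "text": ["Beautiful is better than ugly.",
             "Explicit is better than implicit."],
})

def simple_tokenize(text):
    """Stand-in tokenizer: words and punctuation marks become separate tokens."""
    return re.findall(r"\w+|[^\w\s]", text)

# Build a list of tokens per row, then expand each list into one row per token
tokens_long = (
    sample.assign(token=sample["text"].apply(simple_tokenize))
          .explode("token", ignore_index=True)[["id", "token"]]
)

print(tokens_long.head())
```

Each token keeps the `id` of the text it came from, so the result has the same shape as `tokens_df`.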

## Matrix representation

A matrix representation turns unstructured language data into a structured table, with one row per text and one column per word (see @fig-mr).

![matrix representation](../../res/img/nlp-image_0-259d7a671398a16dc7cdfe05d89d4880.png){#fig-mr}

In [None]:
#| eval: true
#| echo: true
#| output: true

!pip install -q scikit-learn

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.decomposition import LatentDirichletAllocation as LDA
#from nltk.corpus import stopwords

Consider the following corpus of five short sentences (all taken from New York Times headlines). A topic model applied to this corpus should clearly identify one topic related to politics and coronavirus, and a second one related to Nadal and tennis.

In [None]:
#| eval: true
#| echo: true
#| output: true

corpus = ["Rafael Nadal Joins Roger Federer in Missing U.S. Open",
          "Rafael Nadal Is Out of the Australian Open",
          "Biden Announces Virus Measures",
          "Biden's Virus Plans Meet Reality",
          "Where Biden's Virus Plan Stands"]

`CountVectorizer()` generates a matrix that records how often each word occurs in each text. Note that `CountVectorizer` also supports preprocessing through parameters such as `stop_words` to remove stop words, `ngram_range` to include n-grams, or `lowercase=True` to convert all characters to lowercase.

In [None]:
#| eval: true
#| echo: true
#| output: true

from nltk.corpus import stopwords

count_vect = CountVectorizer(stop_words=stopwords.words('english'), lowercase=True)
x_counts = count_vect.fit_transform(corpus)
x_counts.todense()
#count_vect.get_feature_names_out()
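
To make the structure of the count matrix concrete, here is a rough sketch that builds a document-term matrix by hand for three of the headlines. It is an illustration of the idea, not scikit-learn's implementation: the one-word stop list is a made-up placeholder, and the regular expression keeps only tokens of two or more word characters, which is why "Biden's" is counted as "biden".

```python
import re
from collections import Counter

headlines = ["Biden Announces Virus Measures",
             "Biden's Virus Plans Meet Reality",
             "Where Biden's Virus Plan Stands"]

# Hypothetical one-word stop list, for illustration only
stop = {"where"}

def tokenize(text):
    """Lowercase, keep tokens of 2+ word characters, drop stop words."""
    return [t for t in re.findall(r"\b\w\w+\b", text.lower()) if t not in stop]

tokenized = [tokenize(text) for text in headlines]

# Columns of the matrix: the sorted vocabulary of the whole corpus
vocab = sorted({term for doc in tokenized for term in doc})

# Rows of the matrix: how often each vocabulary term occurs in each headline
matrix = [[Counter(doc)[term] for term in vocab] for doc in tokenized]

print(vocab)
for row in matrix:
    print(row)
```

Every row has one cell per vocabulary term, and most cells are zero, which is why scikit-learn returns a sparse matrix that has to be converted with `.todense()` before printing.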

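Term frequency-inverse document frequency (tf-idf) is an alternative weighting: instead of storing the raw frequency of each word in each cell of the matrix, we store its tf-idf coefficient. The coefficient is the product of two numbers:

- tf: the frequency of a given term or word in a text, and
- idf: the logarithm of the total number of documents divided by the number of documents that contain that given term.

tf-idf measures how characteristic a word is of a particular text: it is high when the word is frequent in that text but rare in the rest of the corpus, and low when the word appears in most texts. To be able to subdivide words into groups, it is important to understand not only which words appear in each text, but also which words appear frequently in one text but not at all in others. Note that scikit-learn's `TfidfTransformer` by default uses a smoothed variant of idf and normalizes each row to unit length, so its values differ somewhat from the plain formula.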
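
As a small worked example of the plain definition (tf multiplied by the logarithm of the number of documents over the document frequency), the sketch below computes tf-idf by hand for a toy corpus of three pre-tokenized headlines. The helper name `tf_idf` and the toy token lists are chosen here for illustration only.

```python
import math
from collections import Counter

# Three pre-tokenized, lowercased headlines (toy corpus for illustration)
docs = [
    ["biden", "announces", "virus", "measures"],
    ["biden", "virus", "plans", "meet", "reality"],
    ["rafael", "nadal", "australian", "open"],
]

n_docs = len(docs)

# Document frequency: in how many documents does each term occur?
doc_freq = Counter(term for doc in docs for term in set(doc))

def tf_idf(term, doc):
    """Plain tf-idf: raw term count times log(N / document frequency)."""
    tf = doc.count(term)
    idf = math.log(n_docs / doc_freq[term])
    return tf * idf

print(round(tf_idf("virus", docs[0]), 3))     # → 0.405 (occurs in 2 of 3 documents)
print(round(tf_idf("measures", docs[0]), 3))  # → 1.099 (occurs in 1 of 3 documents)
```

Both words occur once in the first headline, but "measures" gets the higher weight because it occurs in no other document.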

In [None]:
#| eval: true
#| echo: true
#| output: true

tfidf_transformer = TfidfTransformer()
x_tfidf = tfidf_transformer.fit_transform(x_counts)
x_tfidf

## Parts of speech

spaCy is a prominent Python library for natural language processing. To analyze the Zen of Python with spaCy, first install the package and its English model (see the installation cell at the top of this chapter). After loading the model, process the Zen text to tokenize and parse it. spaCy's `displacy` module can then draw the syntactic structure of the first sentence; the diagram also shows the part-of-speech tag of each token.

In [None]:
#| eval: true
#| echo: true
#| output: true

import spacy
from spacy import displacy

# Open the file in read mode
with open('../../txt/zen_of_python.txt', 'r') as file:
    zen_text = file.read()

# Load the English model
nlp = spacy.load('en_core_web_sm')
#nlp._path

# Process the Zen of Python text
doc = nlp(zen_text)

# Visualize the syntactic structure of the first sentence
displacy.render(list(doc.sents)[0], style='dep', jupyter=True)

This code provides a graphical representation of the sentence's grammatical relationships.

## Named entities

In [None]:
#| eval: true
#| echo: true
#| output: true

doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
for token in doc:
    print(token.text)

To learn more about entity recognition in spaCy, how to **add your own
entities** to a document and how to **train and update** the entity predictions
of a model, see the spaCy usage guides on
[named entity recognition](https://spacy.io/usage/linguistic-features#named-entities) and
[training pipelines](https://spacy.io/usage/training).

A named entity is a "real-world object" that's assigned a name – for example, a
person, a country, a product or a book title. spaCy can **recognize various
types of named entities in a document, by asking the model for a
prediction**. Because models are statistical and strongly depend on the
examples they were trained on, this doesn't always work _perfectly_ and might
need some tuning later, depending on your use case.

Named entities are available as the `ents` property of a `Doc`:

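> - **Text:** The original entity text (`ent.text`).
> - **Start:** Character offset of the start of the entity in the text (`ent.start_char`).
> - **End:** Character offset of the end of the entity in the text (`ent.end_char`).
> - **Label:** Entity label, i.e. type (`ent.label_`).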

| Text        | Start | End | Label   | Description                                          |
| ----------- | :---: | :-: | ------- | ---------------------------------------------------- |
| Apple       |   0   |  5  | `ORG`   | Companies, agencies, institutions.                   |
| U.K.        |  27   | 31  | `GPE`   | Geopolitical entity, i.e. countries, cities, states. |
| \$1 billion |  44   | 54  | `MONEY` | Monetary values, including unit.                     |
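
The Start and End values in the table are character offsets, so they can be checked with ordinary string slicing and without loading a model. The sketch below assumes the sentence "Apple is looking at buying U.K. startup for $1 billion", which is the sentence these offsets correspond to; the `entities` list simply restates the table rows.

```python
text = "Apple is looking at buying U.K. startup for $1 billion"

# (start, end, label) triples copied from the table above
entities = [(0, 5, "ORG"), (27, 31, "GPE"), (44, 54, "MONEY")]

# Slicing the text with the offsets recovers each entity's surface form
spans = [(text[start:end], label) for start, end, label in entities]

for surface, label in spans:
    print(surface, label)
```

The end offset is exclusive, as in Python slicing, which is why `text[0:5]` yields exactly "Apple".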

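The loop below prints these attributes for each entity in our example sentence.
spaCy's built-in [displaCy visualizer](https://spacy.io/usage/visualizers) can
also highlight named entities directly in the text, as the second cell does for
a longer passage: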

In [None]:
#| eval: true
#| echo: true
#| output: true

for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

In [None]:
#| eval: true
#| echo: true
#| output: true

text = """Apple decided to fire Tim Cook and hire somebody called John Doe as the new CEO.
They also discussed a merger with Google. On the long run it seems more likely that Apple
will merge with Amazon and Microsoft with Google. The companies will all relocate to
Austin in Texas before the end of the century. John Doe bought a car."""

doc = nlp(text)
displacy.render(doc, style='ent', jupyter=True)

## Topic modeling

To cluster our corpus, we can choose from several algorithms, including non-negative matrix factorization (NMF), sparse principal components analysis (sparse PCA), and latent Dirichlet allocation (LDA). We'll focus on LDA because it is widely used in the scientific community, with good results reported in social media analysis, medical science, political science, and software engineering.

LDA is a model for unsupervised topic decomposition: It groups texts based on the words they contain and the probability of a word belonging to a certain topic. The LDA algorithm outputs the topic word distribution. With this information, we can define the main topics based on the words that are most likely associated with them. Once we have identified the main topics and their associated words, we can know which topic or topics apply to each text.

In [None]:
#| eval: true
#| echo: true
#| output: true

tfidf_transformer = TfidfTransformer()
x_tfidf = tfidf_transformer.fit_transform(x_counts)

In order to perform the LDA decomposition, we have to define the number of topics. In this simple case, we know there are two topics or "dimensions." But in general cases, this is a hyperparameter that needs some tuning, which could be done using algorithms like random search or grid search:

In [None]:
#| eval: true
#| echo: true
#| output: true

dimension = 2
# Fix the seed so that the fitted topics are reproducible across runs
lda = LDA(n_components=dimension, random_state=0)
lda_array = lda.fit_transform(x_tfidf)
lda_array

LDA is a probabilistic method. Each row of the output gives the probability that one of the five headlines belongs to each of the two topics. We expect the first two headlines (tennis) to have a high probability for one topic and the last three (Biden and the virus) for the other. The numbering of the topics is arbitrary, so either group can end up in the first column.
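
To turn these probabilities into a single label per headline, one option is to pick the most probable topic in each row. The sketch below uses made-up probabilities purely for illustration (they are not the output of the model above); in practice `doc_topic` would be replaced by `lda_array`.

```python
import numpy as np

# Made-up document-topic probabilities, for illustration only
# (rows: the five headlines, columns: the two topics)
doc_topic = np.array([
    [0.85, 0.15],
    [0.80, 0.20],
    [0.20, 0.80],
    [0.15, 0.85],
    [0.25, 0.75],
])

# Index of the most probable topic for each headline
dominant_topic = doc_topic.argmax(axis=1)

print(dominant_topic)  # → [0 0 1 1 1]
```

Each row sums to 1, so a headline with probabilities close to 0.5 for both topics would be a sign that the model cannot separate it clearly.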

Finally, if we want to understand what these two topics are about, we can see the most important words in each topic:

In [None]:
#| eval: true
#| echo: true
#| output: true

# Topic-word weights: one row per topic, one column per vocabulary term
components = lda.components_
features = count_vect.get_feature_names_out()
# For each topic, list the three terms with the largest weights
important_words = [[features[i] for i in topic.argsort()[::-1][:3]] for topic in components]
important_words

```python
[['virus', 'biden', 'plan'], ['open', 'nadal', 'rafael']]
```

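As expected, LDA grouped the words related to tennis tournaments and Nadal into one topic and the words related to Biden and the virus into the other. In the output shown above, the Biden topic happens to be listed first; the order of the topics is arbitrary and depends on the random initialization.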

## Try it yourself!

Here are 10 tasks focusing on cleaning language data using the NLTK package:

**Tasks:**

1. Tokenize the given text into words.
2. Convert all the tokens into lowercase.
3. Remove any punctuation from the tokenized words.
4. Use stemming to reduce the words to their root form.
5. Lemmatize the words to obtain their base or dictionary form.
6. Remove English stopwords from the tokenized list.
7. Tokenize the given text into sentences.
8. Count the frequency of each word in the tokenized list.
9. Find the bigrams (two consecutive words) in the tokenized list.
10. Identify the parts of speech (POS) for each token.

In [None]:
#| eval: false
#| echo: false
#| output: false

import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.probability import FreqDist
from nltk.util import bigrams
from nltk import pos_tag

# Sample text for processing
text = "NLTK is a leading platform for building Python programs to work with human language data."

# 1. Tokenize
tokens = word_tokenize(text)

# 2. Convert to lowercase
lower_tokens = [token.lower() for token in tokens]

# 3. Remove punctuation
words = [word for word in lower_tokens if word.isalnum()]

# 4. Stemming
ps = PorterStemmer()
stemmed_words = [ps.stem(word) for word in words]

# 5. Lemmatization
lemmatizer = WordNetLemmatizer()
lemmatized_words = [lemmatizer.lemmatize(word) for word in words]

# 6. Remove stopwords (build the set once for fast membership tests)
stop_set = set(stopwords.words('english'))
filtered_words = [word for word in lemmatized_words if word not in stop_set]

# 7. Sentence tokenization
sentences = sent_tokenize(text)

# 8. Frequency distribution
freq_dist = FreqDist(filtered_words)

# 9. Bigrams
word_bigrams = list(bigrams(tokens))

# 10. POS tagging
pos_tags = pos_tag(tokens)

tokens, lower_tokens, words, stemmed_words, lemmatized_words, filtered_words, sentences, freq_dist, word_bigrams, pos_tags

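Note: Before executing the code, ensure you have the required NLTK data downloaded (e.g., the `punkt` tokenizer models, `stopwords`, `wordnet` for lemmatization, and the `averaged_perceptron_tagger` for POS tagging). You can use `nltk.download('package_name')` to download the necessary datasets. If a resource is missing, NLTK raises a `LookupError` that names the package to download; the exact package names can differ between NLTK versions.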
