{{< include _include_d4.qmd >}}

## Tokenization

The examples below use the Zen of Python, loaded from a tab-separated file with an 'id' and a 'text' column:

In [None]:
#| eval: true
#| echo: true
#| output: true

import pandas as pd
# Read the TSV file into a pandas DataFrame named 'df'
df = pd.read_csv('../../txt/zen_of_python.tsv', sep='\t')

# Display the first few rows of the DataFrame
df.head()

To tokenize the text data with the Natural Language Toolkit (NLTK), follow these steps:

1. Import the tokenizer function: `from nltk.tokenize import word_tokenize`.
2. Loop through each row of the original DataFrame (`df`) and tokenize the text in the 'text' column using `word_tokenize()`.
3. Store each row's tokens, together with the row's 'id', in a small DataFrame, and combine all of these into a single DataFrame once the loop has finished.

The following cell implements these steps:

In [None]:
#| eval: true
#| echo: true
#| output: true

from nltk.tokenize import word_tokenize
import pandas as pd

# word_tokenize relies on NLTK's Punkt tokenizer data. If it is missing,
# download it once with nltk.download('punkt') (recent NLTK releases
# ask for 'punkt_tab' instead).

# Collect one small DataFrame of tokens per row of the original data
frames = []

# Loop through each row in the original DataFrame
for index, row in df.iterrows():
    id_value = row['id']
    text_value = row['text']

    # Tokenize the text
    tokens = word_tokenize(text_value)

    # Pair every token with the 'id' of the text it came from
    frames.append(pd.DataFrame({'id': [id_value]*len(tokens), 'token': tokens}))

# Combine the per-row DataFrames in a single step; concatenating inside
# the loop would copy the growing DataFrame on every iteration
tokens_df = pd.concat(frames, ignore_index=True)

# Show the first few rows of the resulting DataFrame
tokens_df.head()

Running this code will create a new DataFrame `tokens_df` that contains one token per row, along with the original 'id' to associate each token with its originating text.
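
The same one-token-per-row table can also be built without an explicit loop. The following is a minimal sketch, not the only way to do it: it applies a tokenizer to the 'text' column and then uses `DataFrame.explode` to spread each list of tokens over separate rows. The helper names `simple_tokenize` and `tokens_per_row` and the two-row sample are hypothetical, and the regex tokenizer is only a stand-in so the sketch runs without NLTK data; passing `word_tokenize` as the `tokenizer` argument should give NLTK's tokens instead.

```python
import re

import pandas as pd


def simple_tokenize(text):
    """Stand-in tokenizer (an assumption, not NLTK): words and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def tokens_per_row(frame, tokenizer=simple_tokenize):
    """Return one row per token, keeping the 'id' of the originating text."""
    out = frame[["id", "text"]].copy()
    # Each cell of 'token' holds the list of tokens for that row's text
    out["token"] = out["text"].apply(tokenizer)
    # explode() turns every list element into its own row and repeats the 'id'
    out = out.explode("token", ignore_index=True)
    return out[["id", "token"]]


# Hypothetical two-row sample with the same columns as `df`
sample = pd.DataFrame({
    "id": [1, 2],
    "text": ["Beautiful is better than ugly.", "Explicit is better than implicit."],
})
tokens_sample = tokens_per_row(sample)
print(tokens_sample.head())
```

With the data loaded earlier, `tokens_per_row(df, tokenizer=word_tokenize)` would be the corresponding call. Note that a text producing no tokens yields a single row with a missing token rather than disappearing from the result.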

## Parts of speech

spaCy is a widely used Python library for natural language processing. To analyze the Zen of Python with spaCy, first install the package (`pip install spacy`) and its small English model (`python -m spacy download en_core_web_sm`). Passing the Zen text through the loaded model tokenizes it and annotates each token with its part of speech and syntactic dependencies. spaCy's `displacy` module can then visualize the syntactic analysis of the first sentence.

In [None]:
#| eval: true
#| echo: true
#| output: true

import spacy
from spacy import displacy

# Read the Zen of Python text file
with open('../../txt/zen_of_python.txt', 'r', encoding='utf-8') as file:
    zen_text = file.read()

# Load the English model
nlp = spacy.load('en_core_web_sm')

# Process the Zen of Python text
doc = nlp(zen_text)

# Visualize the syntactic structure of the first sentence
displacy.render(list(doc.sents)[0], style='dep', jupyter=True)

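This code draws the dependency parse of the first sentence: arcs connect each word to its syntactic head and are labeled with the dependency relation, and the part-of-speech tag appears beneath each word.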
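
The part-of-speech tags can also be read directly from the processed text rather than from the plot. The following is a minimal sketch, assuming the `doc` object created above. The helper names `pos_rows` and `pos_counts` are hypothetical, while `text`, `pos_`, `tag_`, `dep_`, and `is_space` are standard attributes of spaCy tokens.

```python
from collections import Counter


def pos_rows(doc):
    """Return (text, coarse POS, fine-grained tag, dependency label) per token.

    `doc` is expected to be a spaCy Doc (or any iterable of tokens exposing
    the same attributes); whitespace-only tokens such as newlines are skipped.
    """
    return [(t.text, t.pos_, t.tag_, t.dep_) for t in doc if not t.is_space]


def pos_counts(doc):
    """Count how often each coarse part-of-speech tag occurs."""
    return Counter(t.pos_ for t in doc if not t.is_space)
```

With the `doc` from the cell above, `pos_rows(doc)[:5]` lists the first five tokens with their tags, and `pos_counts(doc).most_common()` summarizes how the tags are distributed. The rows can also be passed to `pd.DataFrame(pos_rows(doc), columns=['token', 'pos', 'tag', 'dep'])` for display as a table.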

## Topic modeling

- Upcoming
