<h2>Exercise: Apply Tokenization on Text Data With NLTK and Store Processed Tokens in S3</h2>

<h3>Set up dataset</h3>

In [1]:
import pandas as pd

# Sample data (5 short reviews)
# You can store this in a CSV, or just build it in code:
data = {
    "review_id": [1, 2, 3, 4, 5],
    "review_text": [
        "I absolutely loved this product! It works great and the quality is fantastic.",
        "Not bad, but shipping was slow... I might try a different seller next time.",
        "Terrible experience. The item broke in two days and support was unhelpful.",
        "Decent value for the price. Could be better packaged.",
        "Amazing! Fast delivery and excellent customer service. Highly recommend."
    ]
}
df = pd.DataFrame(data)
print(df)


   review_id                                        review_text
0          1  I absolutely loved this product! It works grea...
1          2  Not bad, but shipping was slow... I might try ...
2          3  Terrible experience. The item broke in two day...
3          4  Decent value for the price. Could be better pa...
4          5  Amazing! Fast delivery and excellent customer ...


<h3>Download NLTK resources & set up helpers</h3>

In [2]:
import nltk
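nltk.download('punkt')
# Assumption: on newer NLTK releases (3.9+), word_tokenize loads 'punkt_tab' instead;
# if tokenization raises a LookupError, also run nltk.download('punkt_tab').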
nltk.download('stopwords')


[nltk_data] Downloading package punkt to /Users/sksingh/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/sksingh/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

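<h3>Tokenize and clean</h3>
We'll tokenize the text using word_tokenize, convert tokens to lowercase, keep only alphabetic tokens using str.isalpha(), and remove English stop words. Note that NLTK's English stop-word list includes negations such as "not", so "Not bad" in review 2 is reduced to "bad", which can matter for sentiment-style tasks.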

In [3]:
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

stop_words = set(stopwords.words("english"))

def tokenize_and_clean(text: str):
    tokens = word_tokenize(text)
    tokens = [t.lower() for t in tokens]                 # lowercase
    tokens = [t for t in tokens if t.isalpha()]          # keep alphabetic tokens only (drops punctuation and numbers)
    tokens = [t for t in tokens if t not in stop_words]  # remove stop words
    return tokens

df["tokens"] = df["review_text"].apply(tokenize_and_clean)
print(df[["review_id", "tokens"]])


   review_id                                             tokens
0          1  [absolutely, loved, product, works, great, qua...
1          2  [bad, shipping, slow, might, try, different, s...
2          3  [terrible, experience, item, broke, two, days,...
3          4    [decent, value, price, could, better, packaged]
4          5  [amazing, fast, delivery, excellent, customer,...


<h3>(Optional) Create an exploded (1 token per row) table</h3>

In [4]:
tokens_exploded = df[["review_id", "tokens"]].explode("tokens") \
    .rename(columns={"tokens": "token"})
print(tokens_exploded.head())


   review_id       token
0          1  absolutely
0          1       loved
0          1     product
0          1       works
0          1       great
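
With one token per row, per-token statistics reduce to ordinary pandas aggregations. The following is a minimal sketch, assuming a small hand-built frame shaped like tokens_exploded; the rows are illustrative and are not the full output above.

```python
import pandas as pd

# Illustrative stand-in for tokens_exploded: one (review_id, token) pair per row.
tokens_exploded = pd.DataFrame({
    "review_id": [1, 1, 1, 2, 2, 5, 5],
    "token": ["loved", "product", "great", "bad", "shipping", "fast", "great"],
})

# Corpus-wide frequency of each token.
token_counts = tokens_exploded["token"].value_counts()

# Number of distinct reviews each token appears in (document frequency).
doc_freq = tokens_exploded.groupby("token")["review_id"].nunique()

print(token_counts.head(3))
print(doc_freq["great"])
```

The same two aggregations should apply unchanged to the real tokens_exploded table built in the cell above.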


<h3>Save to CSV locally</h3>

In [5]:
df.to_csv("tokenized_reviews_nested.csv", index=False)       # tokens as Python lists (stringified)
tokens_exploded.to_csv("tokenized_reviews_exploded.csv", index=False)  # one token per row
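
Because the nested file stores each token list as a string, reading it back needs an explicit parsing step. The sketch below shows one way to do that with ast.literal_eval as a read_csv converter. It writes to an in-memory buffer rather than to the files above, and the single-row frame is an illustrative stand-in for df.

```python
import ast
import io

import pandas as pd

# Illustrative one-row stand-in for the nested DataFrame.
df = pd.DataFrame({
    "review_id": [4],
    "tokens": [["decent", "value", "price", "could", "better", "packaged"]],
})

# Write to an in-memory buffer instead of disk so the sketch has no side effects.
buffer = io.StringIO()
df.to_csv(buffer, index=False)

# Without a converter, the tokens column comes back as plain strings.
buffer.seek(0)
raw = pd.read_csv(buffer)
print(type(raw.loc[0, "tokens"]))

# ast.literal_eval parses the stringified list back into a real Python list
# without executing arbitrary code (unlike eval).
buffer.seek(0)
restored = pd.read_csv(buffer, converters={"tokens": ast.literal_eval})
print(restored.loc[0, "tokens"])
```

If this parsing step is a concern, a format with native list support (for example JSON Lines) may be a better fit than CSV for the nested table.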


<h3>Upload to S3 with boto3</h3>

In [7]:
import boto3

region = "us-east-1"                   # change if needed
bucket = "knodax-feature-engineering"  # change to your own bucket name
prefix = "nlp/"                        # optional folder (key prefix) in the bucket

s3 = boto3.client("s3", region_name=region)

# Upload files
s3.upload_file("tokenized_reviews_nested.csv", bucket, f"{prefix}tokenized_reviews_nested.csv")
s3.upload_file("tokenized_reviews_exploded.csv", bucket, f"{prefix}tokenized_reviews_exploded.csv")

print("Files uploaded to S3!")


Files uploaded to S3!
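
The print statement above only confirms that upload_file returned without raising. As a hedged sketch, the helper below (upload_and_verify is a hypothetical name, not part of boto3) uploads each file and then calls head_object to confirm that the key exists and to report its size. It accepts any client object exposing upload_file and head_object, which is what boto3's S3 client provides.

```python
import os


def upload_and_verify(s3_client, local_paths, bucket, prefix=""):
    """Upload each file, then confirm it exists. Returns {key: size_in_bytes}."""
    sizes = {}
    for path in local_paths:
        key = f"{prefix}{os.path.basename(path)}"
        s3_client.upload_file(path, bucket, key)
        # With a real boto3 client, head_object raises
        # botocore.exceptions.ClientError if the key is missing.
        response = s3_client.head_object(Bucket=bucket, Key=key)
        sizes[key] = response["ContentLength"]
    return sizes


# Usage (requires AWS credentials and an existing bucket you can write to):
# import boto3
# s3 = boto3.client("s3", region_name="us-east-1")
# print(upload_and_verify(s3, ["tokenized_reviews_nested.csv"], "your-bucket", "nlp/"))
```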


<h3>What was done</h3>
<ul>
<li>Tokenized and cleaned raw text using NLTK.</li>
<li>Saved the processed tokens in both nested and exploded formats.</li>
<li>Persisted the processed artifacts to Amazon S3 for future workflows (ETL, SageMaker, Glue, Athena, etc.).</li>
</ul>

For token-level work, the exploded format is often the more convenient starting point, especially for tasks such as:
<ul>
<li>Building a vocabulary</li>
<li>Training word embeddings (e.g., Word2Vec, FastText), once tokens are regrouped into per-document lists</li>
<li>Preparing inputs for sequence models (LSTM, Transformer), which likewise consume per-document token sequences</li>
<li>Token frequency analysis or TF-IDF</li>
<li>Feeding tokenized inputs to NLP pipelines (e.g., Hugging Face models)</li>
</ul>

<h3>Why Exploded Format Is Preferred</h3>



| Format       | Structure                                            | Pros                                                       | Cons                                         |
| ------------ | ---------------------------------------------------- | ---------------------------------------------------------- | -------------------------------------------- |
| **Exploded** | 1 row per token (`review_id`, `token`)               | Easy to aggregate, count, map vocab IDs, filter, vectorize | File is longer, not nested                   |
| **Nested**   | 1 row per doc with list of tokens (`tokens = [...]`) | Better for direct reuse of full token lists per document   | Requires parsing; harder for per-token stats |
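
To illustrate the "map vocab IDs" advantage listed for the exploded format, here is a minimal sketch that turns a flat token stream into an integer vocabulary. The token list is illustrative, and reserving ID 0 for an `<unk>` placeholder is an assumption; conventions vary between libraries.

```python
from collections import Counter

# Illustrative flat token stream, as it would be read from the "token" column.
tokens = ["great", "product", "great", "fast", "delivery", "fast", "great"]

counts = Counter(tokens)

# Order by descending frequency, breaking ties alphabetically so the
# mapping is reproducible across runs.
ordered = sorted(counts, key=lambda t: (-counts[t], t))

# ID 0 is reserved for unknown tokens (an assumption of this sketch).
vocab = {"<unk>": 0}
vocab.update({token: idx for idx, token in enumerate(ordered, start=1)})

# Encode a new sequence; "slow" is not in the vocabulary and maps to <unk>.
encoded = [vocab.get(t, vocab["<unk>"]) for t in ["great", "fast", "slow"]]
print(vocab)
print(encoded)  # [1, 2, 0]
```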


<h3>Example Use Cases</h3>

| Use Case                       | Preferred Format                              |
| ------------------------------ | --------------------------------------------- |
| Word frequency/counts          | **Exploded**                                  |
| TF-IDF / CountVectorizer       | **Exploded** or nested (depends on tool)      |
| Sequence modeling (e.g., LSTM) | **Nested** (with padding/token index mapping) |
| Word embeddings                | **Exploded**                                  |
| Custom NLP pretraining         | **Exploded**                                  |


For data analysis and vocabulary generation, use the exploded format. For feeding ML models, especially when batching sentences or documents, nested token lists may be more useful, though they are usually converted to tensors or padded arrays later on. In practice, you often start with the exploded format and convert as needed.
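
The conversion from exploded to nested, followed by padding, can be sketched as below. The sample rows and the `<pad>` placeholder token are assumptions for illustration; a real pipeline would more likely pad integer IDs than strings.

```python
import pandas as pd

# Illustrative exploded table: one (review_id, token) pair per row.
exploded = pd.DataFrame({
    "review_id": [4, 4, 4, 5, 5],
    "token": ["decent", "value", "price", "amazing", "fast"],
})

# Regroup tokens per review; row order within each group is preserved.
nested = exploded.groupby("review_id")["token"].agg(list)

# Pad every list to the length of the longest one.
PAD = "<pad>"  # placeholder name is an assumption of this sketch
max_len = max(len(toks) for toks in nested)
padded = nested.map(lambda toks: toks + [PAD] * (max_len - len(toks)))

print(padded.tolist())
# [['decent', 'value', 'price'], ['amazing', 'fast', '<pad>']]
```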