# Part 1: Text Processing and Exploratory Data Analysis

Author/s: <font color="blue">Jhonatan Barcos Gambaro | Daniel Alexander Yearwood</font>

E-mail: <font color="blue">jhonatan.barcos01@estudiant.upf.edu | danielalexander.yearwood01@estudiant.upf.edu </font>

Date: <font color="blue">24/10/2025</font>

In [1]:
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

In [2]:
# Download nltk resources (uncomment on the first run)
#nltk.download('punkt')
#nltk.download('stopwords')

In [3]:
# Load dataset
data_path = '../../data/fashion_products_dataset.json'
products = pd.read_json(data_path)

# Display head of the dataset
display(products.head(5))

Unnamed: 0,_id,actual_price,average_rating,brand,category,crawled_at,description,discount,images,out_of_stock,pid,product_details,seller,selling_price,sub_category,title,url
0,fa8e22d6-c0b6-5229-bb9e-ad52eda39a0a,2999,3.9,York,Clothing and Accessories,2021-02-10 20:11:51,Yorker trackpants made from 100% rich combed c...,69% off,[https://rukminim1.flixcart.com/image/128/128/...,False,TKPFCZ9EA7H5FYZH,"[{'Style Code': '1005COMBO2'}, {'Closure': 'El...",Shyam Enterprises,921,Bottomwear,Solid Women Multicolor Track Pants,https://www.flipkart.com/yorker-solid-men-mult...
1,893e6980-f2a0-531f-b056-34dd63fe912c,1499,3.9,York,Clothing and Accessories,2021-02-10 20:11:52,Yorker trackpants made from 100% rich combed c...,66% off,[https://rukminim1.flixcart.com/image/128/128/...,False,TKPFCZ9EJZV2UVRZ,"[{'Style Code': '1005BLUE'}, {'Closure': 'Draw...",Shyam Enterprises,499,Bottomwear,Solid Men Blue Track Pants,https://www.flipkart.com/yorker-solid-men-blue...
2,eb4c8eab-8206-59d0-bcd1-a724d96bf74f,2999,3.9,York,Clothing and Accessories,2021-02-10 20:11:52,Yorker trackpants made from 100% rich combed c...,68% off,[https://rukminim1.flixcart.com/image/128/128/...,False,TKPFCZ9EHFCY5Z4Y,"[{'Style Code': '1005COMBO4'}, {'Closure': 'El...",Shyam Enterprises,931,Bottomwear,Solid Men Multicolor Track Pants,https://www.flipkart.com/yorker-solid-men-mult...
3,3f3f97bb-5faf-57df-a9ff-1af24e2b1045,2999,3.9,York,Clothing and Accessories,2021-02-10 20:11:53,Yorker trackpants made from 100% rich combed c...,69% off,[https://rukminim1.flixcart.com/image/128/128/...,False,TKPFCZ9ESZZ7YWEF,"[{'Style Code': '1005COMBO3'}, {'Closure': 'El...",Shyam Enterprises,911,Bottomwear,Solid Women Multicolor Track Pants,https://www.flipkart.com/yorker-solid-men-mult...
4,750caa3d-6264-53ca-8ce1-94118a1d8951,2999,3.9,York,Clothing and Accessories,2021-02-10 20:11:53,Yorker trackpants made from 100% rich combed c...,68% off,[https://rukminim1.flixcart.com/image/128/128/...,False,TKPFCZ9EVXKBSUD7,"[{'Style Code': '1005COMBO1'}, {'Closure': 'Dr...",Shyam Enterprises,943,Bottomwear,"Solid Women Brown, Grey Track Pants",https://www.flipkart.com/yorker-solid-men-brow...


## 1.1. Pre-Processing text

Pre-process the documents, in particular the text fields (title,
description).

In [4]:
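# Define clean_text function to preprocess documents:
# 1. Lowercasing and tokenization with nltk's word_tokenize
# 2. Removing English stop words with nltk
# 3. Removing non-alphanumeric tokens (punctuation marks, but also tokens
#    with embedded punctuation such as "itch-free")
# 4. Stemming with nltk's PorterStemmer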

def clean_text(text):
    stop_words = set(stopwords.words("english"))
    word_tokens = word_tokenize(text.lower())
    textos_limpios = ' '.join([word for word in word_tokens if word not in stop_words and word.isalnum()])
    stemmer = nltk.PorterStemmer()
    textos_limpios = ' '.join([stemmer.stem(word) for word in word_tokenize(textos_limpios)])
    return textos_limpios

In [5]:
# Apply clean_text to the products dataset: the cleaned title overwrites 'title'
# in the copy, and the cleaned description is stored in a new 'cleaned' column
products_cleaned = products.copy()
products_cleaned['title'] = products_cleaned['title'].apply(clean_text)
products_cleaned['cleaned'] = products_cleaned['description'].apply(clean_text)

In [6]:
# Print the original and cleaned title and description of the first 2 products
for i in range(2):
    print("Title:", products['title'].iloc[i])
    print("Cleaned Title:", products_cleaned['title'].iloc[i])
    print("Description:", products['description'].iloc[i])
    print("Cleaned Description:", products_cleaned['cleaned'].iloc[i], "\n")

Title: Solid Women Multicolor Track Pants
Cleaned Title: solid women multicolor track pant
Description: Yorker trackpants made from 100% rich combed cotton giving it a rich look.Designed for Comfort,Skin friendly fabric,itch-free waistband & great for all year round use Proudly made in India
Cleaned Description: yorker trackpant made 100 rich comb cotton give rich comfort skin friendli fabric waistband great year round use proudli made india 

Title: Solid Men Blue Track Pants
Cleaned Title: solid men blue track pant
Description: Yorker trackpants made from 100% rich combed cotton giving it a rich look.Designed for Comfort,Skin friendly fabric,itch-free waistband & great for all year round use Proudly made in India
Cleaned Description: yorker trackpant made 100 rich comb cotton give rich comfort skin friendli fabric waistband great year round use proudli made india 
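
In the cleaned descriptions above, "look", "designed" and "itch-free" have disappeared. This is consistent with fused tokens such as "look.designed" failing the `isalnum()` check and being dropped whole. The sketch below illustrates the effect with the standard library only. It is not the notebook's pipeline: whitespace splitting stands in for `word_tokenize`, and `split_fused_tokens` is a hypothetical helper that splits on any run of non-alphanumeric characters before filtering.

```python
import re

def split_fused_tokens(text):
    """Hypothetical helper: lowercase, then split on runs of non-alphanumeric characters."""
    return [tok for tok in re.split(r"[^a-z0-9]+", text.lower()) if tok]

# Fragment of the first product description shown above
sample = ("giving it a rich look.Designed for Comfort,Skin friendly "
          "fabric,itch-free waistband")

# Stand-in for the original approach: tokenize first, then keep only
# tokens that are fully alphanumeric. Fused tokens are discarded whole.
whitespace_tokens = sample.lower().split()
kept_by_isalnum = [tok for tok in whitespace_tokens if tok.isalnum()]

# Alternative: split on punctuation first, so both halves survive.
regex_tokens = split_fused_tokens(sample)

print(kept_by_isalnum)
print(regex_tokens)
```

Whether the extra terms help retrieval depends on the queries. Splitting "itch-free" into "itch" and "free" keeps more vocabulary but loses the compound meaning.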



## 1.2. Handling category, sub_category, brand, product_details, and seller during pre-processing


In [None]:
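# First, analyze the columns category, sub_category, brand, product_details, and seller
# to decide how to handle them during pre-processing
columns_to_analyze = ['category', 'sub_category', 'brand', 'product_details', 'seller']

# Describe each column (count, unique, top, freq).
# product_details appears to hold lists of dicts, which are unhashable,
# so the columns are cast to str before calling describe()
display(products[columns_to_analyze].astype(str).describe())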
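
Of these columns, product_details is the only nested one: the preview in the dataset head suggests a list of single-key dictionaries. Before it can be tokenized or indexed it has to become plain text. A minimal sketch of one way to do that follows, assuming that structure. `flatten_product_details` is a hypothetical helper, and the second entry of the sample is invented for illustration.

```python
def flatten_product_details(details):
    """Hypothetical helper: join the keys and values of a list of dicts into one string."""
    if not isinstance(details, list):
        # Missing or malformed values become an empty string
        return ""
    parts = []
    for entry in details:
        if isinstance(entry, dict):
            for key, value in entry.items():
                parts.append(f"{key} {value}")
    return " ".join(parts)

# First entry taken from the dataset preview; second entry is a made-up placeholder
sample = [{"Style Code": "1005COMBO2"}, {"Example Key": "example value"}]
print(flatten_product_details(sample))
```

With pandas this could be applied as `products['product_details'].apply(flatten_product_details)`, and the result passed through `clean_text` like the other text fields.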




Should they be merged into a single text field, indexed as separate fields in the inverted index or any other alternative?

Justify your choice, considering how their distinctiveness may affect retrieval effectiveness. 

What are the pros and cons of each approach?
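
To make the two options concrete, the sketch below builds both kinds of index over a toy collection. It is an illustration under stated assumptions, not the final design: `build_merged_index` and `build_fielded_index` are hypothetical helpers, tokenization is a plain whitespace split, the first two documents reuse values from the dataset head, and the third (`HYPOTHETICAL_PID`) is invented to show a term that appears in different fields.

```python
from collections import defaultdict

docs = {
    "TKPFCZ9EA7H5FYZH": {"title": "Solid Women Multicolor Track Pants",
                         "brand": "York", "seller": "Shyam Enterprises"},
    "TKPFCZ9EJZV2UVRZ": {"title": "Solid Men Blue Track Pants",
                         "brand": "York", "seller": "Shyam Enterprises"},
    # Invented row: a brand whose name is also a colour term
    "HYPOTHETICAL_PID": {"title": "Solid Men Grey Track Pants",
                         "brand": "Blue", "seller": "Shyam Enterprises"},
}

def build_merged_index(docs):
    """One posting list per term, regardless of which field it came from."""
    index = defaultdict(set)
    for pid, fields in docs.items():
        for text in fields.values():
            for term in text.lower().split():
                index[term].add(pid)
    return index

def build_fielded_index(docs):
    """One posting list per (field, term) pair."""
    index = defaultdict(set)
    for pid, fields in docs.items():
        for field, text in fields.items():
            for term in text.lower().split():
                index[(field, term)].add(pid)
    return index

merged = build_merged_index(docs)
fielded = build_fielded_index(docs)

print(sorted(merged["blue"]))              # matches in any field
print(sorted(fielded[("title", "blue")]))  # matches in the title only
print(sorted(fielded[("brand", "blue")]))  # matches in the brand only
```

The merged index is smaller and simpler to query, but it cannot tell a blue product from a product by the brand "Blue". The fielded index keeps that distinction and allows per-field weighting or filtering, at the cost of more posting lists and a query step that decides which fields to search.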

# Part 2: Exploratory Data Analysis