# Task 3 - Predicting the product category from the title

Goal: train and evaluate ML models that predict a product's category from its title.
At the end, I pick the best model and save it in .pkl format for use in the prediction script.


In [82]:
import pandas as pd

# Load the product dataset
df = pd.read_csv("/content/products.csv")

# Clean the column names (strip leading and trailing whitespace)
df.columns = df.columns.str.strip()

df.columns


# Show the first rows to understand the structure
df.head()


Unnamed: 0,product ID,Product Title,Merchant ID,Category Label,_Product Code,Number_of_Views,Merchant Rating,Listing Date
0,1,apple iphone 8 plus 64gb silver,1,Mobile Phones,QA-2276-XC,860.0,2.5,5/10/2024
1,2,apple iphone 8 plus 64 gb spacegrau,2,Mobile Phones,KA-2501-QO,3772.0,4.8,12/31/2024
2,3,apple mq8n2b/a iphone 8 plus 64gb 5.5 12mp sim...,3,Mobile Phones,FP-8086-IE,3092.0,3.9,11/10/2024
3,4,apple iphone 8 plus 64gb space grey,4,Mobile Phones,YI-0086-US,466.0,3.4,5/2/2022
4,5,apple iphone 8 plus gold 5.5 64gb 4g unlocked ...,5,Mobile Phones,NZ-3586-WP,4426.0,1.6,4/12/2023


## 1. Loading and understanding the dataset

In this step I loaded the products.csv dataset with pandas.
The aim is to understand the structure of the available information and the columns
that will be used to classify products by title.


In [83]:
# Dataset shape (rows, columns)
df.shape


(35311, 8)

In [84]:
# Column names, non-null counts, and data types
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35311 entries, 0 to 35310
Data columns (total 8 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   product ID       35311 non-null  int64  
 1   Product Title    35139 non-null  object 
 2   Merchant ID      35311 non-null  int64  
 3   Category Label   35267 non-null  object 
 4   _Product Code    35216 non-null  object 
 5   Number_of_Views  35297 non-null  float64
 6   Merchant Rating  35141 non-null  float64
 7   Listing Date     35252 non-null  object 
dtypes: float64(2), int64(2), object(4)
memory usage: 2.2+ MB


In [85]:
# Count missing values per column
df.isna().sum()


Unnamed: 0,0
product ID,0
Product Title,172
Merchant ID,0
Category Label,44
_Product Code,95
Number_of_Views,14
Merchant Rating,170
Listing Date,59


## 2. Initial data analysis

I checked the dataset size, the column types, and the missing values.
This step is essential for spotting problems that could
hurt the performance of the classification model.
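
To put the missing-value counts in perspective, the short calculation below converts them into percentages of the dataset. It is only a sketch: the numbers are copied from the `df.shape` and `df.isna().sum()` outputs above, and the variable names are illustrative.

```python
# Counts copied from the df.shape and df.isna().sum() outputs above.
total_rows = 35311
missing = {"Product Title": 172, "Category Label": 44}

# Share of rows missing each field, as a percentage.
missing_pct = {col: round(100 * n / total_rows, 2) for col, n in missing.items()}

# Upper bound on rows lost when dropping rows that miss either field
# (the bound is reached only if no row is missing both fields).
max_dropped = sum(missing.values())
max_dropped_pct = round(100 * max_dropped / total_rows, 2)

print(missing_pct)
print(max_dropped, max_dropped_pct)
```

Both columns used for modelling are missing in well under 1% of rows, so dropping those rows costs very little data.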


In [86]:
# Keep only the relevant columns
df = df[["Product Title", "Category Label"]].copy()

# Drop rows with missing values
df.dropna(subset=["Product Title", "Category Label"], inplace=True)

# Normalize the text (lowercase titles, strip whitespace)
df["Product Title"] = df["Product Title"].astype(str).str.lower().str.strip()
df["Category Label"] = df["Category Label"].astype(str).str.strip()

# Check the result
df.head()


Unnamed: 0,Product Title,Category Label
0,apple iphone 8 plus 64gb silver,Mobile Phones
1,apple iphone 8 plus 64 gb spacegrau,Mobile Phones
2,apple mq8n2b/a iphone 8 plus 64gb 5.5 12mp sim...,Mobile Phones
3,apple iphone 8 plus 64gb space grey,Mobile Phones
4,apple iphone 8 plus gold 5.5 64gb 4g unlocked ...,Mobile Phones


In [87]:
df.columns


Index(['Product Title', 'Category Label'], dtype='object')

In [88]:
from sklearn.model_selection import train_test_split

# X = product title text
X = df["Product Title"]

# y = product category
y = df["Category Label"]

# Split the data: 80% train, 20% test, stratified by category
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42,
    stratify=y
)

print("Split sizes:")
print("Train:", X_train.shape, y_train.shape)
print("Test:", X_test.shape, y_test.shape)


Split sizes:
Train: (28076,) (28076,)
Test: (7020,) (7020,)


In [89]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize the vectorizer
tfidf = TfidfVectorizer(
    stop_words="english",
    max_features=5000
)

# Fit TF-IDF on the training set only, then transform both sets
X_train_tfidf = tfidf.fit_transform(X_train)
X_test_tfidf = tfidf.transform(X_test)

print("Shapes after TF-IDF:")
print("Train:", X_train_tfidf.shape)
print("Test:", X_test_tfidf.shape)


Shapes after TF-IDF:
Train: (28076, 5000)
Test: (7020, 5000)


In [90]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Initialize the model
log_reg = LogisticRegression(
    max_iter=1000,
    n_jobs=-1
)

# Train the model
log_reg.fit(X_train_tfidf, y_train)

# Predict on the test set
y_pred_lr = log_reg.predict(X_test_tfidf)

# Evaluate
accuracy = accuracy_score(y_test, y_pred_lr)
print("Accuracy Logistic Regression:", accuracy)

print("\nClassification Report:")
print(classification_report(y_test, y_pred_lr))


Accuracy Logistic Regression: 0.9528490028490029

Classification Report:
                  precision    recall  f1-score   support

             CPU       0.00      0.00      0.00        17
            CPUs       0.98      1.00      0.99       749
 Digital Cameras       0.99      0.99      0.99       538
     Dishwashers       0.93      0.96      0.94       681
        Freezers       0.99      0.93      0.96       440
 Fridge Freezers       0.93      0.94      0.94      1094
         Fridges       0.87      0.89      0.88       687
      Microwaves       0.99      0.96      0.98       466
    Mobile Phone       0.00      0.00      0.00        11
   Mobile Phones       0.95      1.00      0.97       801
             TVs       0.99      0.99      0.99       708
Washing Machines       0.96      0.95      0.95       803
          fridge       0.00      0.00      0.00        25

        accuracy                           0.95      7020
       macro avg       0.74      0.74      0.74      7020



Logistic Regression combined with TF-IDF reaches high accuracy (about 95%) and performs very well on the main product categories.

The classes with very small support (CPU, Mobile Phone, fridge) score 0.00 on precision, recall, and F1. Their names closely resemble larger categories (for example CPU vs. CPUs), which points to imbalanced and inconsistent labels in the data rather than a major limitation of the model.

For the business goal of automatically classifying most products, the model is reliable and easy to use.
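
One way to address the inconsistent labels is to merge the rare variants into their larger counterparts before splitting the data. The sketch below is an assumption, not a step from this notebook: the `LABEL_FIXES` mapping and the `normalize_labels` helper are hypothetical, and the mapping should be checked against the actual rows first (for instance, `fridge` might belong to `Fridge Freezers` rather than `Fridges`).

```python
import pandas as pd

# Hypothetical mapping: assumes each rare label is a spelling variant
# of the larger category. Verify against the data before applying it.
LABEL_FIXES = {
    "CPU": "CPUs",
    "Mobile Phone": "Mobile Phones",
    "fridge": "Fridges",
}


def normalize_labels(labels: pd.Series) -> pd.Series:
    """Strip surrounding whitespace and merge known label variants."""
    return labels.str.strip().replace(LABEL_FIXES)


# Small demo series standing in for df["Category Label"].
demo = pd.Series(["CPU", "CPUs", " fridge ", "Fridges", "Mobile Phone", "TVs"])
normalized = normalize_labels(demo)
print(normalized.value_counts())
```

In the notebook this would be applied as `df["Category Label"] = normalize_labels(df["Category Label"])` before the train/test split, so the stratification and the classification report both see the merged classes.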

In [91]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report


In [92]:
# Initialize the Naive Bayes model
nb_model = MultinomialNB()

# Train the model
nb_model.fit(X_train_tfidf, y_train)

# Predict on the test set
y_pred_nb = nb_model.predict(X_test_tfidf)

# Compute the accuracy
accuracy_nb = accuracy_score(y_test, y_pred_nb)

print("Accuracy Naive Bayes:", accuracy_nb)


Accuracy Naive Bayes: 0.9235042735042736


In [93]:
print("Classification Report - Naive Bayes:\n")
print(classification_report(y_test, y_pred_nb))


Classification Report - Naive Bayes:



                  precision    recall  f1-score   support

             CPU       0.00      0.00      0.00        17
            CPUs       0.98      1.00      0.99       749
 Digital Cameras       0.99      1.00      0.99       538
     Dishwashers       0.98      0.91      0.95       681
        Freezers       0.99      0.62      0.76       440
 Fridge Freezers       0.74      0.97      0.84      1094
         Fridges       0.87      0.80      0.83       687
      Microwaves       0.99      0.97      0.98       466
    Mobile Phone       0.00      0.00      0.00        11
   Mobile Phones       0.98      0.98      0.98       801
             TVs       0.99      0.99      0.99       708
Washing Machines       0.97      0.94      0.96       803
          fridge       0.00      0.00      0.00        25

        accuracy                           0.92      7020
       macro avg       0.73      0.71      0.71      7020
    weighted avg       0.93      0.92      0.92      7020





## Model comparison

I trained and evaluated two classification models:
- Logistic Regression
- Multinomial Naive Bayes

Logistic Regression reached higher accuracy (95.3%) than Naive Bayes (92.4%) and a better macro-averaged F1 score (0.74 vs. 0.71).
It was also more consistent across most categories, especially those with large support. Naive Bayes, for example, reached only 0.62 recall on Freezers, where Logistic Regression reached 0.93.

Based on these results, Logistic Regression was chosen as the final model for product classification.
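
The comparison can also be collected programmatically instead of being read off two separate reports. The sketch below is illustrative only: `compare_models` is a hypothetical helper, and the tiny corpus is made-up demo data, not the products dataset. In the notebook, the function would receive `X_train_tfidf`, `y_train`, `X_test_tfidf`, and `y_test`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.naive_bayes import MultinomialNB


def compare_models(models, X_train, y_train, X_test, y_test):
    """Fit each model and return its accuracy and macro F1 on the test set."""
    results = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        pred = model.predict(X_test)
        results[name] = {
            "accuracy": accuracy_score(y_test, pred),
            # zero_division=0 scores classes with no predictions as 0
            # without emitting a warning.
            "macro_f1": f1_score(y_test, pred, average="macro", zero_division=0),
        }
    return results


# Made-up demo data standing in for the real train/test split.
train_titles = [
    "apple iphone 8 64gb silver",
    "samsung galaxy s9 smartphone",
    "bosch dishwasher 60cm white",
    "beko dishwasher freestanding",
]
train_labels = ["Mobile Phones", "Mobile Phones", "Dishwashers", "Dishwashers"]
test_titles = ["apple iphone 7 32gb", "bosch dishwasher"]
test_labels = ["Mobile Phones", "Dishwashers"]

vectorizer = TfidfVectorizer()
X_tr = vectorizer.fit_transform(train_titles)
X_te = vectorizer.transform(test_titles)

results = compare_models(
    {
        "Logistic Regression": LogisticRegression(max_iter=1000),
        "Multinomial Naive Bayes": MultinomialNB(),
    },
    X_tr, train_labels, X_te, test_labels,
)
for name, scores in results.items():
    print(name, scores)
```

Reporting macro F1 next to accuracy matters here because accuracy is dominated by the large classes, while macro F1 weights every class equally and so exposes the rare labels that both models miss.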


In [94]:
import pickle

# Save the Logistic Regression model
with open("product_category_model.pkl", "wb") as f:
    pickle.dump(log_reg, f)

# Also save the TF-IDF vectorizer (VERY IMPORTANT: the model needs
# the same vocabulary at prediction time)
with open("tfidf_vectorizer.pkl", "wb") as f:
    pickle.dump(tfidf, f)

print("Model and vectorizer saved successfully.")

Model and vectorizer saved successfully.
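
The prediction script needs to load both files and apply the same text normalization used in training. The following is a minimal sketch of that round trip, not the actual script: `load_artifacts` and `predict_categories` are hypothetical helpers, and a tiny stand-in model is trained and pickled to a temporary directory so the example runs on its own.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression


def load_artifacts(model_path, vectorizer_path):
    """Load the pickled classifier and the vectorizer it was trained with."""
    with open(model_path, "rb") as f:
        model = pickle.load(f)
    with open(vectorizer_path, "rb") as f:
        vectorizer = pickle.load(f)
    return model, vectorizer


def predict_categories(titles, model, vectorizer):
    """Normalize titles as in training (lowercase, strip), then predict."""
    cleaned = [str(t).lower().strip() for t in titles]
    return model.predict(vectorizer.transform(cleaned)).tolist()


# Stand-in artifacts trained on made-up demo data.
titles = ["apple iphone 8 64gb", "samsung galaxy phone",
          "bosch dishwasher 60cm", "beko dishwasher white"]
labels = ["Mobile Phones", "Mobile Phones", "Dishwashers", "Dishwashers"]
vec = TfidfVectorizer().fit(titles)
clf = LogisticRegression(max_iter=1000).fit(vec.transform(titles), labels)

with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "product_category_model.pkl"
    vec_path = Path(tmp) / "tfidf_vectorizer.pkl"
    model_path.write_bytes(pickle.dumps(clf))
    vec_path.write_bytes(pickle.dumps(vec))

    model, vectorizer = load_artifacts(model_path, vec_path)
    demo_predictions = predict_categories(["  Apple iPhone 7 32GB "], model, vectorizer)

print(demo_predictions)
```

Two practical notes: pickle files should only be loaded from trusted sources, because unpickling can execute arbitrary code, and they should be loaded with the same scikit-learn version that produced them.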


In [95]:
test_titles = [
    "iphone 7 32gb gold",
    "bosch serie 4 kgv39vl31g",
    "smeg sbs8004po",
    "olympus e m10 mark iii"
]

test_tfidf = tfidf.transform(test_titles)
predictions = log_reg.predict(test_tfidf)

for title, pred in zip(test_titles, predictions):
    print(f"{title} --> {pred}")

iphone 7 32gb gold --> Mobile Phones
bosch serie 4 kgv39vl31g --> Dishwashers
smeg sbs8004po --> Fridges
olympus e m10 mark iii --> Digital Cameras
