# Translate and Classify

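**Project aim.** Build a simple scikit-learn text classifier over call transcripts that labels each call as pre-purchase or post-purchase, then compare accuracy on the original Spanish text against an English translation.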

| Stage      | Input                | Output     | Why               |
| ---------- | -------------------- | ---------- | ----------------- |
| Ingest     | Labeled MP3 snippets | WAV files  | Standardize I/O   |
| Transcribe | WAV                  | Text list  | Features for ML   |
| Model      | Text + labels        | Classifier | Automate labeling |

**Data check.** Verify folders, counts, and formats before processing.

| Action        | Tool                               | Result               |
| ------------- | ---------------------------------- | -------------------- |
| List files    | `os.listdir` / `pathlib.Path.glob` | ~50 per class, MP3   |
| Decide target | WAV PCM16                          | Consistent ASR input |
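The check above can be sketched with `pathlib`. This is a minimal sketch that assumes one subfolder per label under a common root; the helper name `count_audio_files` and that layout are illustrative assumptions, not something the notebook defines.

```python
from collections import Counter
from pathlib import Path


def count_audio_files(root: Path) -> dict[str, Counter]:
    """Count files by lowercase extension inside each label subfolder of root."""
    summary = {}
    for label_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        files = [f for f in label_dir.glob("*") if f.is_file()]
        summary[label_dir.name] = Counter(f.suffix.lower() for f in files)
    return summary


# Hypothetical usage (the "data" folder is an assumed path):
# for label, counts in count_audio_files(Path("data")).items():
#     print(label, dict(counts))
```

A class whose counter shows anything other than `.mp3`, or a count far from the expected ~50, is worth inspecting before conversion.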

**Batch conversion.** Convert MP3→WAV across both label folders, then transcribe to lists.
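One way the conversion step might look is to shell out to `ffmpeg`. This is a sketch under stated assumptions, not the notebook's own conversion function: the helper names are hypothetical, `ffmpeg` is assumed to be on the PATH, and mono 16 kHz is an assumed target (the source only specifies WAV PCM16).

```python
import subprocess
from pathlib import Path


def ffmpeg_command(src: Path, dst: Path, sample_rate: int = 16000) -> list[str]:
    """Build an ffmpeg call that writes mono 16-bit PCM WAV (rate and mono are assumptions)."""
    return [
        "ffmpeg", "-y",           # overwrite an existing output file
        "-i", str(src),           # input MP3
        "-ac", "1",               # one audio channel (mono)
        "-ar", str(sample_rate),  # output sample rate in Hz
        "-c:a", "pcm_s16le",      # signed 16-bit little-endian PCM
        str(dst),
    ]


def plan_conversions(src_dir: Path, dst_dir: Path) -> list[tuple[Path, Path]]:
    """Pair every MP3 in src_dir with a same-named WAV path in dst_dir."""
    return [(mp3, dst_dir / (mp3.stem + ".wav")) for mp3 in sorted(src_dir.glob("*.mp3"))]


def convert_all(src_dir: Path, dst_dir: Path) -> None:
    """Run ffmpeg once per planned pair; raises if any conversion fails."""
    dst_dir.mkdir(parents=True, exist_ok=True)
    for src, dst in plan_conversions(src_dir, dst_dir):
        subprocess.run(ffmpeg_command(src, dst), check=True)
```

Separating the plan from the execution keeps the file mapping easy to inspect before any audio is touched; `convert_all` would be called once per label folder.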

**DataFrame assembly.** Join texts and labels for sklearn.

| Set           | Label           | Size | Note          |
| ------------- | --------------- | ---: | ------------- |
| Pre-purchase  | `pre_purchase`  |  ≈50 | From folder A |
| Post-purchase | `post_purchase` |  ≈50 | From folder B |


In [1]:
import pandas as pd

pre_texts = ["Hola, estoy interesado en comprar un producto.", 
             "¿Cuáles son las opciones de pago disponibles?", 
             "Me gustaría saber más sobre las características del producto.", 
             "¿Hay alguna oferta especial en este momento?", 
             "Quisiera saber el tiempo de entrega estimado."]

post_texts = ["Me encanta el producto que compré.", 
              "MUERETE.",
              "Tuve algunos problemas con la entrega, pero se resolvieron rápidamente.",
              "Estoy muy satisfecho con mi compra.",
              "Definitivamente recomendaría este producto a mis amigos."]

df_pre  = pd.DataFrame({"label": "pre_purchase",  "text": pre_texts})
df_post = pd.DataFrame({"label": "post_purchase", "text": post_texts})
df = pd.concat([df_pre, df_post], ignore_index=True)
df.head()

|     | label        | text                                              |
| --: | ------------ | ------------------------------------------------- |
|   0 | pre_purchase | Hola, estoy interesado en comprar un producto.    |
|   1 | pre_purchase | ¿Cuáles son las opciones de pago disponibles?     |
|   2 | pre_purchase | Me gustaría saber más sobre las característica... |
|   3 | pre_purchase | ¿Hay alguna oferta especial en este momento?      |
|   4 | pre_purchase | Quisiera saber el tiempo de entrega estimado.     |



**Model pieces.** Classic bag-of-words + TF-IDF + Multinomial Naive Bayes.

| Component          | Role             | Key args            |
| ------------------ | ---------------- | ------------------- |
| `CountVectorizer`  | Token counts     | `ngram_range=(1,1)` |
| `TfidfTransformer` | Reweight counts  | `use_idf=True`      |
| `MultinomialNB`    | Prob. classifier | `alpha=1.0`         |

In [2]:

# Train/test split, pipeline, fit, evaluate
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

X_train, X_test, y_train, y_test = train_test_split(
    df["text"].astype(str), df["label"], test_size=0.30, random_state=42, stratify=df["label"]
)

pipe = Pipeline([
    ("vec", CountVectorizer()),          # bag-of-words
    ("tfidf", TfidfTransformer()),       # TF-IDF weighting
    ("clf", MultinomialNB())             # classifier
])

pipe.fit(X_train, y_train)
y_hat = pipe.predict(X_test)
acc = np.mean(y_hat == y_test.to_numpy())
print(f"accuracy={acc:.3f}")


accuracy=0.333


In [None]:
# import sys, subprocess, pkgutil
# subprocess.check_call([sys.executable, "-m", "pip", "install", "-U", "sentencepiece", "transformers", "torch"])


In [None]:
# import sys, subprocess
# subprocess.check_call([sys.executable, "-m", "pip", "install", "-U", "ipywidgets"])


In [4]:
# %pip install -U transformers torch  # run once

import pandas as pd
from transformers import pipeline

# Example df
# df = pd.DataFrame({"text": ["hola mundo", "¿cómo estás?"]})

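# Named `translator` so it is not confused with the sklearn Pipeline called `pipe`
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-es-en")  # es→en

def batch_translate(series: pd.Series, batch_size: int = 16) -> pd.Series:
    """Translate non-empty entries in batches; missing or empty entries pass through unchanged."""
    s = series.astype("string")
    mask = s.notna() & (s.str.len() > 0)   # rows that actually contain text
    idx = s[mask].index.to_list()          # index labels of those rows (assumes a unique index)
    texts = s[mask].to_list()
    out = s.copy()
    for i in range(0, len(texts), batch_size):
        preds = translator(texts[i:i+batch_size], max_length=512)
        # Write each batch back to the rows it came from, not to positions i..i+batch_size
        out.loc[idx[i:i+batch_size]] = [p["translation_text"] for p in preds]
    return out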

df["translated"] = batch_translate(df["text"])


Device set to use cuda:0


In [5]:
df["translated"]

0           Hello, I'm interested in buying a product.
1              What are the payment options available?
2    I would like to know more about the characteri...
3                  Is there a special offer right now?
4        I'd like to know the estimated delivery time.
5                         I love the product I bought.
6                                                DEAD.
7    I had some problems with delivery, but they re...
8                 I'm very satisfied with my purchase.
9    I would definitely recommend this product to m...
Name: translated, dtype: string

In [7]:

# Same split and pipeline as before, now trained on the English translations
X_train, X_test, y_train, y_test = train_test_split(
    df["translated"].astype(str), df["label"], test_size=0.30, random_state=42, stratify=df["label"]
)

pipe = Pipeline([
    ("vec", CountVectorizer()),          # bag-of-words
    ("tfidf", TfidfTransformer()),       # TF-IDF weighting
    ("clf", MultinomialNB())             # classifier
])

pipe.fit(X_train, y_train)
y_hat = pipe.predict(X_test)
acc = np.mean(y_hat == y_test.to_numpy())
print(f"accuracy={acc:.3f}")

accuracy=0.667
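With 10 rows and `test_size=0.30`, the test split holds only 3 examples, so each reported accuracy can only be 0, 1/3, 2/3, or 1. A less noisy comparison could come from stratified k-fold cross-validation. The sketch below is one possible way to do that with the same three-step pipeline; `cv_accuracy` is a hypothetical helper, and the fold count and seed are assumptions.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline


def cv_accuracy(texts, labels, n_splits: int = 5, seed: int = 42) -> np.ndarray:
    """Return per-fold accuracy for the bag-of-words + TF-IDF + Naive Bayes pipeline."""
    model = Pipeline([
        ("vec", CountVectorizer()),
        ("tfidf", TfidfTransformer()),
        ("clf", MultinomialNB()),
    ])
    X = np.asarray(texts, dtype=object)   # array of strings, indexable by fold
    y = np.asarray(labels)
    folds = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    return cross_val_score(model, X, y, cv=folds, scoring="accuracy")


# Hypothetical usage with the notebook's DataFrame:
# print(cv_accuracy(df["text"], df["label"]).mean())
# print(cv_accuracy(df["translated"], df["label"]).mean())
```

Each fold refits the vectorizer on its own training portion, so no vocabulary leaks from the held-out texts. With 5 examples per class, 5 folds is the largest stratified split available.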



**Workflow summary.** End-to-end steps you’ll reuse.

| Step       | Function                     | Output            |
| ---------- | ---------------------------- | ----------------- |
| Convert    | `convert_folder_to_wav(...)` | Canonical WAVs    |
| Transcribe | `create_text_list(...)`      | Text per file     |
| Assemble   | `pd.concat([...])`           | Labeled DataFrame |
| Translate  | `batch_translate(...)`       | English text      |
| Train      | sklearn `Pipeline.fit`       | Trained model     |
| Evaluate   | `predict` then accuracy      | Baseline score    |
