# Final Project - Course 6: Classifying Claims on TikTok

**Objective:** Build a random forest model that predicts whether a video posted on TikTok presents a 'claim' or an 'opinion'.

**Context:** TikTok's data team wants to automate the review of user reports.

## Loading the synthetic data

In [3]:
!pip install nltk

Successfully installed click-8.2.1 nltk-3.9.1 regex-2024.11.6


In [4]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\JMGY-\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


True

In [11]:
import pandas as pd

# Load the original file
df = pd.read_csv("tiktok_dataset.csv")  # make sure the file is in the working directory
print("✅ Dataset loaded.")
print("\n🧾 Available columns:", df.columns.tolist())

✅ Dataset loaded.

🧾 Available columns: ['#', 'claim_status', 'video_id', 'video_duration_sec', 'video_transcription_text', 'verified_status', 'author_ban_status', 'video_view_count', 'video_like_count', 'video_share_count', 'video_download_count', 'video_comment_count']


In [12]:
print("\n🔎 First rows of the dataset:")
display(df.head())


🔎 First rows of the dataset:


| | # | claim_status | video_id | video_duration_sec | video_transcription_text | verified_status | author_ban_status | video_view_count | video_like_count | video_share_count | video_download_count | video_comment_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | claim | 7017666017 | 59 | someone shared with me that drone deliveries a... | not verified | under review | 343296.0 | 19425.0 | 241.0 | 1.0 | 0.0 |
| 1 | 2 | claim | 4014381136 | 32 | someone shared with me that there are more mic... | not verified | active | 140877.0 | 77355.0 | 19034.0 | 1161.0 | 684.0 |
| 2 | 3 | claim | 9859838091 | 31 | someone shared with me that american industria... | not verified | active | 902185.0 | 97690.0 | 2858.0 | 833.0 | 329.0 |
| 3 | 4 | claim | 1866847991 | 25 | someone shared with me that the metro of st. p... | not verified | active | 437506.0 | 239954.0 | 34812.0 | 1234.0 | 584.0 |
| 4 | 5 | claim | 7105231098 | 19 | someone shared with me that the number of busi... | not verified | active | 56167.0 | 34987.0 | 4110.0 | 547.0 | 152.0 |


In [13]:
print(f"\n🔢 The dataset has {df.shape[0]} rows and {df.shape[1]} columns.")
print("\n📋 Data types:")
print(df.dtypes)


🔢 The dataset has 19382 rows and 12 columns.

📋 Data types:
#                             int64
claim_status                 object
video_id                      int64
video_duration_sec            int64
video_transcription_text     object
verified_status              object
author_ban_status            object
video_view_count            float64
video_like_count            float64
video_share_count           float64
video_download_count        float64
video_comment_count         float64
dtype: object


In [14]:
print("\n🧼 Null values per column:")
print(df.isna().sum())


🧼 Null values per column:
#                             0
claim_status                298
video_id                      0
video_duration_sec            0
video_transcription_text    298
verified_status               0
author_ban_status             0
video_view_count            298
video_like_count            298
video_share_count           298
video_download_count        298
video_comment_count         298
dtype: int64
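
Seven columns each report exactly 298 missing values, which suggests (but does not prove) that the same rows are incomplete throughout. One way to check is sketched below on a hypothetical miniature frame named `toy`; the column subset and values are made up for illustration, and the same three lines of logic could be pointed at `df` instead.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame: rows 1 and 3 are missing every label/count field.
toy = pd.DataFrame({
    "claim_status": ["claim", np.nan, "opinion", np.nan],
    "video_transcription_text": ["a b c", np.nan, "d e f", np.nan],
    "video_view_count": [10.0, np.nan, 20.0, np.nan],
})

# Columns that contain at least one missing value.
cols_with_nulls = toy.columns[toy.isna().any()]

# Number of missing fields per row, restricted to those columns.
nulls_per_row = toy[cols_with_nulls].isna().sum(axis=1)

# If the gaps always co-occur, each row is missing either nothing or everything.
same_rows = bool(nulls_per_row.isin([0, len(cols_with_nulls)]).all())
n_incomplete = int((nulls_per_row > 0).sum())
print(same_rows, n_incomplete)
```

If `same_rows` comes out true on the real data, a single `dropna()` removes exactly those incomplete rows and nothing else.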


In [15]:
if 'video_transcription_text' in df.columns:
    # Sample from the non-null texts only, capped at the number actually available
    texts = df['video_transcription_text'].dropna()
    print("\n📝 Some texts:")
    print(texts.sample(min(5, len(texts))))
    print("\n📉 Number of empty or whitespace-only texts:")
    print((texts.str.strip() == '').sum())
else:
    print("\n⚠️ Column 'video_transcription_text' is not in the dataset.")


📝 Some texts:
13967    my friends' impression is that the longest pos...
8051     i read a story claiming that mice typically on...
15730    my family's hypothesis is that it would take 2...
7923     someone discovered a report claiming that the ...
16528    my colleagues' sentiment is that the average c...
Name: video_transcription_text, dtype: object

📉 Number of empty or whitespace-only texts:
0


In [16]:
# 📦 Library imports
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, roc_auc_score, roc_curve, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns

# ✅ 1. Load and clean
df = pd.read_csv("tiktok_dataset.csv")

# 🧼 2. Drop rows with missing values and encode the target
df = df.dropna()
# NOTE: these keys are Spanish, while the labels shown by df.head() are English ('claim').
# Series.map returns NaN for every value that has no matching key.
df['claim_status'] = df['claim_status'].map({'opinión': 0, 'reclamación': 1})
# Drop rows whose label did not map; astype(int) raises IntCastingNaNError if any NaN remains
df = df.dropna(subset=['claim_status'])
df['claim_status'] = df['claim_status'].astype(int)

# 🔐 3. One-hot encode the categorical variables
df = pd.get_dummies(df, columns=['verified_status', 'author_ban_status'], drop_first=True)

# 🧠 4. TF-IDF text processing
df['video_transcription_text'] = df['video_transcription_text'].fillna('').astype(str)
tfidf = TfidfVectorizer(max_features=50, token_pattern=r'\b\w+\b', min_df=1)
X_text = pd.DataFrame(tfidf.fit_transform(df['video_transcription_text']).toarray())
df = df.drop(columns='video_transcription_text')

# 🧩 5. Separate X and y
X = pd.concat([df.drop(columns='claim_status').reset_index(drop=True), X_text], axis=1)
X.columns = X.columns.astype(str)
y = df['claim_status']

# 🔀 6. Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# 🌳 7. Training
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

# 📈 8. Predictions
y_pred = clf.predict(X_test)
y_proba = clf.predict_proba(X_test)[:, 1]

# 📋 9. Evaluation
print("\n📊 Classification report:\n", classification_report(y_test, y_pred))
print(f"🎯 AUC: {roc_auc_score(y_test, y_proba):.4f}")

# 📉 10. ROC curve
fpr, tpr, _ = roc_curve(y_test, y_proba)
plt.figure(figsize=(6,4))
plt.plot(fpr, tpr, label='ROC - Random Forest')
plt.plot([0,1], [0,1], 'k--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend()  # show the label passed to plt.plot above
plt.grid(True)
plt.tight_layout()
plt.show()

# 🧱 11. Confusion matrix
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(4,4))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.title('Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.tight_layout()
plt.show()

IntCastingNaNError: Cannot convert non-finite values (NA or inf) to integer
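
The error above is consistent with a label mapping whose keys do not match the values stored in the data. The sketch below reproduces the effect on a toy Series rather than on the real file. It assumes the second label is spelled `'opinion'`; only `'claim'` is visible in the preview of the dataset, so that spelling is an assumption worth confirming with `df['claim_status'].unique()`.

```python
import pandas as pd

# Toy labels in the same English form that df.head() displays.
labels = pd.Series(["claim", "opinion", "claim"])

# Keys that never occur in the data: every value maps to NaN.
mismatched = labels.map({"opinión": 0, "reclamación": 1})
all_unmapped = bool(mismatched.isna().all())

# Casting NaN to int fails; IntCastingNaNError is a ValueError subclass in pandas.
try:
    mismatched.astype(int)
    cast_failed = False
except ValueError:
    cast_failed = True

# Keys that match the data ('opinion' is an assumed spelling).
encoded = labels.map({"opinion": 0, "claim": 1}).astype(int)
print(all_unmapped, cast_failed, encoded.tolist())
```

Dropping the NaN rows before the cast avoids the exception but, with mismatched keys, it also discards every row, so matching the keys to the data is the change that matters.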

In [10]:
# 📦 1. Library imports
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, roc_auc_score, roc_curve, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns

print("✅ Libraries imported.")

# 📥 2. Load data
df = pd.read_csv("tiktok_dataset.csv")
print("✅ Dataset loaded.")
print(df.columns.tolist())

# 🧹 3. Encode the target variable
# NOTE: these keys are Spanish, while the labels shown by df.head() are English ('claim').
# Series.map returns NaN for every unmatched value, so the dropna below removes
# every row (see "0 of 0" in the output).
df['claim_status'] = df['claim_status'].map({'opinión': 0, 'reclamación': 1})
df = df.dropna(subset=['claim_status'])
df['claim_status'] = df['claim_status'].astype(int)
print("🔁 Target variable encoded and cleaned.")

# 🔐 4. Encode categorical variables
categoricas = ['verified_status', 'author_ban_status']
df = pd.get_dummies(df, columns=categoricas, drop_first=True)
print("🔁 Categorical variables encoded.")

# 🧠 5. Process the text only if the column exists
if 'video_transcription_text' in df.columns:
    print("\n📝 Sample texts:")
    df['video_transcription_text'] = df['video_transcription_text'].fillna('').astype(str)
    print(df['video_transcription_text'].sample(min(5, len(df))))

    # Word count per transcription; keep only texts with more than two words
    df['texto_util'] = df['video_transcription_text'].str.strip().str.split().str.len()
    df_text = df[df['texto_util'] > 2].copy()

    print(f"\n🧮 Usable texts found: {len(df_text)} of {len(df)}")

    if len(df_text) > 0:
        tfidf = TfidfVectorizer(max_features=50, token_pattern=r'\b\w+\b', min_df=1)
        X_text = pd.DataFrame(tfidf.fit_transform(df_text['video_transcription_text']).toarray())
        df_text = df_text.drop(columns=['video_transcription_text', 'texto_util'])
        df_model = pd.concat([df_text.reset_index(drop=True), X_text], axis=1)
        print("✅ TF-IDF applied successfully.")
    else:
        print("⚠️ No usable texts found. Skipping TF-IDF.")
        df = df.drop(columns=['video_transcription_text', 'texto_util'], errors='ignore')
        df_model = df.copy()
else:
    print("⚠️ Column 'video_transcription_text' is not in the dataset. Continuing without text.")
    df_model = df.copy()

# 🧩 6. Build X and y
if df_model.empty:
    print("\n⛔ Final dataset is empty. Cannot train the model.")
else:
    X = df_model.drop(columns='claim_status')
    y = df_model['claim_status']
    print(f"\n✅ Final dataset with {len(X)} samples.")

    # 🔀 7. Train/test split and training
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
    clf = RandomForestClassifier(n_estimators=100, random_state=42)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    y_proba = clf.predict_proba(X_test)[:, 1]

    # 📊 8. Evaluation
    print("\n📋 Classification report:")
    print(classification_report(y_test, y_pred))
    print(f"🎯 AUC: {roc_auc_score(y_test, y_proba):.4f}")

    # 📈 9. ROC curve
    fpr, tpr, _ = roc_curve(y_test, y_proba)
    plt.figure(figsize=(6,4))
    plt.plot(fpr, tpr, label="ROC")
    plt.plot([0,1], [0,1], 'k--')
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('ROC Curve')
    plt.legend()
    plt.grid(True)
    plt.tight_layout()
    plt.show()

    # 🧱 10. Confusion matrix
    cm = confusion_matrix(y_test, y_pred)
    plt.figure(figsize=(4,4))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
    plt.title('Confusion Matrix')
    plt.xlabel('Predicted')
    plt.ylabel('Actual')
    plt.tight_layout()
    plt.show()

✅ Libraries imported.
✅ Dataset loaded.
['#', 'claim_status', 'video_id', 'video_duration_sec', 'video_transcription_text', 'verified_status', 'author_ban_status', 'video_view_count', 'video_like_count', 'video_share_count', 'video_download_count', 'video_comment_count']
🔁 Target variable encoded and cleaned.
🔁 Categorical variables encoded.

📝 Sample texts:
Series([], Name: video_transcription_text, dtype: object)

🧮 Usable texts found: 0 of 0
⚠️ No usable texts found. Skipping TF-IDF.

⛔ Final dataset is empty. Cannot train the model.
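
The empty result above follows from the target mapping: none of its keys match the labels in the file, so every row is dropped before any feature is built. What follows is a hedged sketch of one way the pipeline could be restructured, not the notebook's own code. It runs on a hypothetical stand-in frame (`toy`) so that it is self-contained, and it assumes the labels are spelled `'claim'` and `'opinion'`. The helper `with_tfidf` and the `tfidf_` column prefix are illustrative names. Compared with the cells above, it uses matching keys, drops the identifier column, and fits TF-IDF on the training split only.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for tiktok_dataset.csv so the sketch runs on its own.
rng = np.random.default_rng(42)
n = 200
status = rng.choice(["claim", "opinion"], size=n)
toy = pd.DataFrame({
    "claim_status": status,
    "video_id": np.arange(n),
    "video_transcription_text": np.where(
        status == "claim",
        "someone shared with me that a report exists",
        "my friends' impression is that this is nice",
    ),
    "verified_status": rng.choice(["verified", "not verified"], size=n),
    "video_view_count": rng.integers(0, 1_000_000, size=n).astype(float),
})

# Encode the target with keys that match the stored labels, then verify nothing was lost.
data = toy.dropna().reset_index(drop=True)
data["claim_status"] = data["claim_status"].map({"opinion": 0, "claim": 1})
assert data["claim_status"].notna().all(), "unmapped labels found"
data["claim_status"] = data["claim_status"].astype(int)

# One-hot encode categoricals as floats and drop the identifier column.
data = pd.get_dummies(data.drop(columns="video_id"), columns=["verified_status"],
                      drop_first=True, dtype=float)

train, test = train_test_split(data, test_size=0.3, random_state=42,
                               stratify=data["claim_status"])

def with_tfidf(frame, vectorizer, fit):
    """Replace the text column with TF-IDF columns (fit only on training data)."""
    text = frame["video_transcription_text"]
    matrix = vectorizer.fit_transform(text) if fit else vectorizer.transform(text)
    names = [f"tfidf_{i}" for i in range(matrix.shape[1])]
    text_df = pd.DataFrame(matrix.toarray(), columns=names, index=frame.index)
    return pd.concat([frame.drop(columns="video_transcription_text"), text_df], axis=1)

vectorizer = TfidfVectorizer(max_features=50)
train_m = with_tfidf(train, vectorizer, fit=True)
test_m = with_tfidf(test, vectorizer, fit=False)

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(train_m.drop(columns="claim_status"), train_m["claim_status"])
proba = clf.predict_proba(test_m.drop(columns="claim_status"))[:, 1]
auc = roc_auc_score(test_m["claim_status"], proba)
print(f"rows kept: {len(data)}, AUC on toy data: {auc:.2f}")
```

The toy texts are trivially separable, so the score printed here says nothing about performance on the real dataset; the point is only that rows survive the encoding step and the model trains.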
