
# 1. Importing Packages

In [25]:
!pip install datasets



In [26]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datasets import load_dataset

import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt_tab')
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

import warnings
warnings.filterwarnings('ignore')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


# 2. Reading the Data

## Dataset Description

The IMDb dataset is drawn from the Internet Movie Database (IMDb), a comprehensive online database of information about movies, TV shows, and more.

The IMDb Movie Reviews dataset is a binary sentiment analysis dataset containing 50,000 reviews, each labeled as positive (1) or negative (0).

In [27]:
imdb = load_dataset("imdb")
imdb

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})

`imdb` is a dictionary-like `DatasetDict` with three splits: `train`, `test`, and `unsupervised`. For this exercise we use only the `train` and `test` splits.

In [28]:
# Concatenate the train and test datasets into a single DataFrame
df_imdb = pd.concat([pd.DataFrame(imdb['train']), pd.DataFrame(imdb['test'])])
df_imdb.head()

                                                text  label
0  I rented I AM CURIOUS-YELLOW from my video sto...      0
1  "I Am Curious: Yellow" is a risible and preten...      0
2  If only to avoid making this type of film in t...      0
3  This film was probably inspired by Godard's Ma...      0
4  Oh, brother...after hearing about this ridicul...      0


In [29]:
df_imdb['label'].value_counts()

label  count
0      25000
1      25000


In [30]:
df_imdb.info()

<class 'pandas.core.frame.DataFrame'>
Index: 50000 entries, 0 to 24999
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text    50000 non-null  object
 1   label   50000 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 1.1+ MB


Our data has 50,000 rows and no null values. The index runs from 0 to 24999 because the two 25,000-row splits were concatenated without resetting it.

# 3. Preprocessing

In [31]:
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

# Tokenize and lemmatize the text while removing stop words and non-alphanumeric tokens
def preprocess_text(text):
    text = text.lower()
    tokens = nltk.word_tokenize(text)
    tokens = [lemmatizer.lemmatize(token) for token in tokens if token.isalnum() and token not in stop_words]
    return " ".join(tokens)

df_imdb['processed_text'] = df_imdb['text'].apply(preprocess_text)
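
As a rough illustration of what `preprocess_text` keeps and drops, the sketch below reproduces only its filtering step (`token.isalnum() and token not in stop_words`) on a sentence that is already tokenized. The token list and the small `demo_stop_words` set are hypothetical stand-ins for the output of `nltk.word_tokenize` and for the NLTK English stop-word list, and lemmatization is skipped, so this is not the notebook's exact pipeline.

```python
# Hypothetical stand-ins: a hand-tokenized review and a tiny stop-word set.
# The notebook itself uses nltk.word_tokenize and stopwords.words('english').
demo_tokens = ["this", "movie", "was", "great", ",", "i", "loved", "it", "!"]
demo_stop_words = {"this", "was", "i", "it"}


def filter_tokens(tokens, stop_words):
    """Keep alphanumeric tokens that are not stop words."""
    return [t for t in tokens if t.isalnum() and t not in stop_words]


kept = filter_tokens(demo_tokens, demo_stop_words)
print(" ".join(kept))  # movie great loved
```

Punctuation fails `isalnum()` and is dropped together with the stop words, so only content words reach the vectorizer.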

# 4. Training

In [32]:
# Split the data into training (70%) and test (30%) sets
X = df_imdb['processed_text']
y = df_imdb['label']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
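
The split above is purely random, so the two classes need not end up equally represented in each part. A stratified split keeps the class proportions fixed; scikit-learn's `train_test_split` accepts a `stratify` argument for this. The sketch below shows the idea in plain Python on toy labels. The helper `stratified_split` and the toy data are hypothetical and are not what the notebook runs.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.3, seed=42):
    """Hypothetical helper: split indices so each class keeps the same test fraction."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels: 10 negative (0) and 10 positive (1) reviews.
toy_labels = [0] * 10 + [1] * 10
train_idx, test_idx = stratified_split(toy_labels)
test_counts = [sum(1 for i in test_idx if toy_labels[i] == c) for c in (0, 1)]
print(len(train_idx), len(test_idx), test_counts)  # 14 6 [3, 3]
```

Each class contributes the same number of examples to the test part, whichever indices the shuffle picks.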

In [33]:
# TF-IDF Vectorization
vectorizer = TfidfVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)
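
The vectorizer is fitted on `X_train` only and then applied to `X_test`, so no document statistics leak from the test set. To make the weighting concrete, here is a minimal pure-Python sketch of TF-IDF with the smoothed formula that scikit-learn documents as its default, `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, followed by L2 normalization. It assumes whitespace tokenization, which is simpler than the `TfidfVectorizer` tokenizer, and the function name `tfidf` and the two toy documents are hypothetical.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return (vocabulary, rows) of L2-normalized TF-IDF weights.

    Assumes whitespace tokenization; this is a simplification of what
    TfidfVectorizer does, not a drop-in replacement.
    """
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed inverse document frequency.
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1 for tok in vocab}

    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        weights = [counts[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        rows.append([w / norm for w in weights])
    return vocab, rows


vocab, rows = tfidf(["great movie", "terrible movie"])
for tok, weight in zip(vocab, rows[0]):
    print(f"{tok}: {weight:.3f}")
```

In the first document, `movie` appears in both reviews and so receives a lower weight than `great`, which appears in only one.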

In [34]:
# Model training
model = LogisticRegression()
model.fit(X_train_vec, y_train)
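
Once fitted, a logistic regression model scores a review as a weighted sum of its TF-IDF features plus an intercept and passes that score through a sigmoid. The sketch below illustrates this with made-up numbers: `toy_weights` and `toy_bias` are hypothetical, whereas in the notebook the learned values would live in `model.coef_` and `model.intercept_`.

```python
import math


def predict_proba_positive(features, weights, bias):
    """Logistic model: P(label = 1) = sigmoid(w . x + b)."""
    score = bias + sum(
        weights[tok] * value for tok, value in features.items() if tok in weights
    )
    return 1.0 / (1.0 + math.exp(-score))


# Hypothetical weights, chosen by hand for illustration only.
toy_weights = {"great": 2.0, "terrible": -2.5, "movie": 0.1}
toy_bias = 0.0

p_pos = predict_proba_positive({"great": 0.8, "movie": 0.6}, toy_weights, toy_bias)
p_neg = predict_proba_positive({"terrible": 0.8, "movie": 0.6}, toy_weights, toy_bias)
print(p_pos > 0.5, p_neg > 0.5)  # True False
```

A probability above 0.5 corresponds to predicting label 1 (positive), and below 0.5 to label 0 (negative).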

# 5. Model Evaluation

In [35]:
# Prediction and evaluation
y_pred = model.predict(X_test_vec)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")
print(classification_report(y_test, y_pred))

Accuracy: 0.8919333333333334
              precision    recall  f1-score   support

           0       0.90      0.88      0.89      7557
           1       0.88      0.91      0.89      7443

    accuracy                           0.89     15000
   macro avg       0.89      0.89      0.89     15000
weighted avg       0.89      0.89      0.89     15000
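
For reference, the per-class columns in the report can be derived from simple counts of true positives, false positives, and false negatives. The sketch below computes precision, recall, and F1 for one class. The helper `binary_report` and the toy labels are hypothetical and are not taken from the notebook's results.

```python
def binary_report(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for one class from raw label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy labels for illustration only.
toy_true = [1, 1, 1, 0, 0, 0, 1, 0]
toy_pred = [1, 1, 0, 0, 0, 1, 1, 0]
precision, recall, f1 = binary_report(toy_true, toy_pred)
print(precision, recall, f1)  # 0.75 0.75 0.75
```

Precision answers "of the reviews predicted positive, how many were positive?", while recall answers "of the positive reviews, how many were found?".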



In [36]:
# Evaluation on the training set
y_pred_train = model.predict(X_train_vec)
accuracy = accuracy_score(y_train, y_pred_train)
print(f"Accuracy: {accuracy}")
print(classification_report(y_train, y_pred_train))

Accuracy: 0.9303428571428571
              precision    recall  f1-score   support

           0       0.94      0.92      0.93     17443
           1       0.92      0.94      0.93     17557

    accuracy                           0.93     35000
   macro avg       0.93      0.93      0.93     35000
weighted avg       0.93      0.93      0.93     35000



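Overall, the model separates negative and positive reviews well: test accuracy is about 0.89, with similar precision and recall for both classes, against about 0.93 on the training set, so the gap between training and test performance is small.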