
##Objective

I am going to build a binary sentiment classification system. The goal is for the model to analyze the text of the reviews that users leave about certain medications and automatically determine whether the sentiment expressed in each review is positive or negative. This gives a quick tool for gauging how users perceive a medication and acting on it.

**Classes**

* Negative class: 0. Reviews with negative sentiment (rating below 6).
* Positive class: 1. Reviews with positive sentiment (rating of 6 or higher).

In [19]:
#@title Import the libraries
!pip3 install ucimlrepo
!pip3 install nltk
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.model_selection import RandomizedSearchCV
from nltk.corpus import stopwords
import nltk
from ucimlrepo import fetch_ucirepo



##Dataset
**Description**

This dataset provides patient reviews of specific drugs along with the related conditions and a 10-star patient rating that reflects overall satisfaction.

In [3]:
# fetch dataset
drug_reviews_drugs_com = fetch_ucirepo(id=462)

# data (as pandas dataframes)
X = drug_reviews_drugs_com.data.features
y = drug_reviews_drugs_com.data.targets

# metadata
print(drug_reviews_drugs_com.metadata)

{'uci_id': 462, 'name': 'Drug Reviews (Drugs.com)', 'repository_url': 'https://archive.ics.uci.edu/dataset/462/drug+review+dataset+drugs+com', 'data_url': 'https://archive.ics.uci.edu/static/public/462/data.csv', 'abstract': 'The dataset provides patient reviews on specific drugs along with related conditions and a 10 star patient rating reflecting overall patient satisfaction.', 'area': 'Health and Medicine', 'tasks': ['Classification', 'Regression', 'Clustering'], 'characteristics': ['Multivariate', 'Text'], 'num_instances': 215063, 'num_features': 6, 'feature_types': ['Integer'], 'demographics': [], 'target_col': None, 'index_col': ['id'], 'has_missing_values': 'no', 'missing_values_symbol': None, 'year_of_dataset_creation': 2018, 'last_updated': 'Wed Apr 03 2024', 'dataset_doi': '10.24432/C5SK5S', 'creators': ['Surya Kallumadi', 'Felix Grer'], 'intro_paper': {'ID': 407, 'type': 'NATIVE', 'title': 'Aspect-Based Sentiment Analysis of Drug Reviews Applying Cross-Domain and Cross-Data 

In [18]:
print(drug_reviews_drugs_com.metadata['abstract'])

The dataset provides patient reviews on specific drugs along with related conditions and a 10 star patient rating reflecting overall patient satisfaction.


In [4]:
# Explore the variables
print(drug_reviews_drugs_com.variables)

          name     role         type demographic description units  \
0           id       ID      Integer        None        None  None   
1     drugName  Feature  Categorical        None        None  None   
2    condition  Feature  Categorical        None        None  None   
3       review  Feature  Categorical        None        None  None   
4       rating  Feature  Categorical        None        None  None   
5         date  Feature         Date        None        None  None   
6  usefulCount  Feature  Categorical        None        None  None   

  missing_values  
0             no  
1             no  
2             no  
3             no  
4             no  
5             no  
6             no  


In [5]:
# Start the EDA
X

Unnamed: 0,drugName,condition,review,rating,date,usefulCount
0,Valsartan,Left Ventricular Dysfunction,"""It has no side effect, I take it in combinati...",9,20-May-12,27
1,Guanfacine,ADHD,"""My son is halfway through his fourth week of ...",8,27-Apr-10,192
2,Lybrel,Birth Control,"""I used to take another oral contraceptive, wh...",5,14-Dec-09,17
3,Ortho Evra,Birth Control,"""This is my first time using any form of birth...",8,3-Nov-15,10
4,Buprenorphine / naloxone,Opiate Dependence,"""Suboxone has completely turned my life around...",9,27-Nov-16,37
...,...,...,...,...,...,...
215058,Tamoxifen,"Breast Cancer, Prevention","""I have taken Tamoxifen for 5 years. Side effe...",10,13-Sep-14,43
215059,Escitalopram,Anxiety,"""I&#039;ve been taking Lexapro (escitaploprgra...",9,8-Oct-16,11
215060,Levonorgestrel,Birth Control,"""I&#039;m married, 34 years old and I have no ...",8,15-Nov-10,7
215061,Tapentadol,Pain,"""I was prescribed Nucynta for severe neck/shou...",1,28-Nov-11,20


In [6]:
# Check how balanced or imbalanced the conditions are
X.groupby('condition').size().sort_values(ascending = False)

condition,count
Birth Control,38436
Depression,12164
Pain,8245
Anxiety,7812
Acne,7435
...,...
Thyroid Suppression Test,1
Dermatitis Herpeti,1
Prevention of Perinatal Group B Streptococcal Disease,1
Tinea Barbae,1


In [9]:
# Combine the data into a single dataframe
df = pd.concat([X, y], axis=1)
df

Unnamed: 0,drugName,condition,review,rating,date,usefulCount
0,Valsartan,Left Ventricular Dysfunction,"""It has no side effect, I take it in combinati...",9,20-May-12,27
1,Guanfacine,ADHD,"""My son is halfway through his fourth week of ...",8,27-Apr-10,192
2,Lybrel,Birth Control,"""I used to take another oral contraceptive, wh...",5,14-Dec-09,17
3,Ortho Evra,Birth Control,"""This is my first time using any form of birth...",8,3-Nov-15,10
4,Buprenorphine / naloxone,Opiate Dependence,"""Suboxone has completely turned my life around...",9,27-Nov-16,37
...,...,...,...,...,...,...
215058,Tamoxifen,"Breast Cancer, Prevention","""I have taken Tamoxifen for 5 years. Side effe...",10,13-Sep-14,43
215059,Escitalopram,Anxiety,"""I&#039;ve been taking Lexapro (escitaploprgra...",9,8-Oct-16,11
215060,Levonorgestrel,Birth Control,"""I&#039;m married, 34 years old and I have no ...",8,15-Nov-10,7
215061,Tapentadol,Pain,"""I was prescribed Nucynta for severe neck/shou...",1,28-Nov-11,20


In [10]:
# Create a 'sentiment' column based on 'rating' (1 if rating >= 6, else 0)
df['sentiment'] = df['rating'].apply(lambda x: 1 if x >= 6 else 0)
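
The labeling rule above can be checked by hand on a few rows. This is a minimal sketch, not part of the original pipeline: `rating_to_sentiment` is a hypothetical helper that mirrors the lambda, and `sample_ratings` holds the ratings of the nine rows visible in the dataframe preview.

```python
from collections import Counter

def rating_to_sentiment(rating, threshold=6):
    """Map a 1-10 star rating to a binary label: 1 if rating >= threshold, else 0."""
    return 1 if rating >= threshold else 0

# Ratings of the nine rows visible in the dataframe preview (illustrative subset only)
sample_ratings = [9, 8, 5, 8, 9, 10, 9, 8, 1]
sample_labels = [rating_to_sentiment(r) for r in sample_ratings]

# Count how many examples fall in each class
label_counts = Counter(sample_labels)
positive_share = label_counts[1] / len(sample_labels)

print(sample_labels)
print(dict(label_counts))
print(f"Positive share: {positive_share:.2f}")
```

On the full dataframe, `df['sentiment'].value_counts(normalize=True)` would give the same kind of summary and is a quick way to see the class imbalance before training.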

In [11]:
# Split the data into training and test sets
df_train, df_test = train_test_split(df, test_size=0.25, random_state=42)
# Check the new column and the structure
print("\nFirst rows of the training set with 'sentiment':")
print(df_train[['rating', 'sentiment']].head())


First rows of the training set with 'sentiment':
        rating  sentiment
108836       7          1
113898      10          1
159302      10          1
59039       10          1
146076       6          1


In [13]:
#@title Dataset preprocessing
# Vectorize the 'review' column with TF-IDF; English stop words are removed by the vectorizer.
# The vectorizer is fit on the training set only and then reused on the test set.
tfidf = TfidfVectorizer(stop_words='english', max_features=5000)
X_train_tfidf = tfidf.fit_transform(df_train['review'])
X_test_tfidf = tfidf.transform(df_test['review'])

# Create the target variables
y_train = df_train['sentiment']
y_test = df_test['sentiment']
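
To make the fit/transform split concrete, here is a simplified, hand-rolled version of TF-IDF weighting. It is a sketch under stated assumptions, not what `TfidfVectorizer` runs internally. It uses scikit-learn's default smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, and L2 normalization, but tokenizes with a plain whitespace split and skips stop-word removal and `max_features`. The documents and the helpers `fit_idf` and `transform_doc` are hypothetical.

```python
import math
from collections import Counter

def fit_idf(docs):
    """Learn smoothed IDF weights from the training documents only."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc.split()))
    return {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}

def transform_doc(doc, idf):
    """Weight a document with the learned IDF; terms unseen during fit are dropped."""
    term_freq = Counter(term for term in doc.split() if term in idf)
    weights = {term: count * idf[term] for term, count in term_freq.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()} if norm else {}

# Toy training corpus (made up for illustration)
train_docs = ["no side effects", "bad side effects", "helped a lot"]
idf = fit_idf(train_docs)

# A "test" document containing a word that never appeared in training
test_vector = transform_doc("bad effects unknownword", idf)
print(test_vector)
```

Terms that appear in more training documents ("side", "effects") get a lower IDF than rarer ones ("bad"), and a word missing from the training vocabulary contributes nothing to the test vector. Calling `fit_transform` on the training reviews and only `transform` on the test reviews has this same effect.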

In [14]:
#@title Logistic Regression model
model_logreg = LogisticRegression()
model_logreg.fit(X_train_tfidf, y_train)

# Make predictions and evaluate the Logistic Regression model
y_pred_logreg = model_logreg.predict(X_test_tfidf)
print("Metrics for Logistic Regression:")
print(classification_report(y_test, y_pred_logreg))
print("Confusion matrix:")
print(confusion_matrix(y_test, y_pred_logreg))

Metrics for Logistic Regression:
              precision    recall  f1-score   support

           0       0.77      0.64      0.70     16099
           1       0.86      0.92      0.89     37667

    accuracy                           0.83     53766
   macro avg       0.81      0.78      0.79     53766
weighted avg       0.83      0.83      0.83     53766

Confusion matrix:
[[10319  5780]
 [ 3145 34522]]
[[10319  5780]
 [ 3145 34522]]
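
The figures in the report follow directly from the confusion matrix. As a small worked check, using only the four counts printed above (rows are true classes, columns are predicted classes, which is the layout `confusion_matrix` returns):

```python
# Confusion matrix from the Logistic Regression run: rows = true class, columns = predicted class
cm = [[10319, 5780],
      [3145, 34522]]

true_0_pred_0, true_0_pred_1 = cm[0]
true_1_pred_0, true_1_pred_1 = cm[1]
total = sum(sum(row) for row in cm)

# Recall: share of each true class that was predicted correctly
recall_0 = true_0_pred_0 / (true_0_pred_0 + true_0_pred_1)
recall_1 = true_1_pred_1 / (true_1_pred_0 + true_1_pred_1)

# Precision: share of each predicted class that was actually correct
precision_0 = true_0_pred_0 / (true_0_pred_0 + true_1_pred_0)
precision_1 = true_1_pred_1 / (true_0_pred_1 + true_1_pred_1)

accuracy = (true_0_pred_0 + true_1_pred_1) / total

print(f"class 0: precision={precision_0:.2f} recall={recall_0:.2f}")
print(f"class 1: precision={precision_1:.2f} recall={recall_1:.2f}")
print(f"accuracy={accuracy:.2f}")
```

The weak spot is the recall of class 0: 5780 of the 16099 negative reviews in the test set are classified as positive.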


In [16]:
#@title Random Forest model
#@markdown I set class_weight='balanced' here because in a first attempt the model came out completely unbalanced, giving me a recall of 0.05
param_dist = {
    'n_estimators': [100, 200],
    'max_depth': [10, 20],
    'min_samples_split': [2, 5],
    'class_weight': ['balanced']  # Add the class_weight parameter
}

# Configure RandomizedSearchCV with the new hyperparameters
random_search_rf = RandomizedSearchCV(
    RandomForestClassifier(),
    param_distributions=param_dist,
    n_iter=4,
    cv=3,
    scoring='accuracy',
    random_state=42
)

# Train the model with the tuned hyperparameters
random_search_rf.fit(X_train_tfidf, y_train)

# Get the best model and make predictions
best_rf_model = random_search_rf.best_estimator_
y_pred_rf = best_rf_model.predict(X_test_tfidf)

# Evaluate the model
print("Metrics for Random Forest (best model with class_weight='balanced'):")
print(classification_report(y_test, y_pred_rf))
print("Confusion matrix:")
print(confusion_matrix(y_test, y_pred_rf))

Metrics for Random Forest (best model with class_weight='balanced'):
              precision    recall  f1-score   support

           0       0.62      0.79      0.70     16099
           1       0.90      0.80      0.84     37667

    accuracy                           0.79     53766
   macro avg       0.76      0.79      0.77     53766
weighted avg       0.82      0.79      0.80     53766

Confusion matrix:
[[12731  3368]
 [ 7713 29954]]
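
To see why `class_weight='balanced'` raises the recall of class 0, here is a rough sketch of the weights it produces. scikit-learn documents the formula as `n_samples / (n_classes * count_of_class)`, computed on the labels passed to `fit`. The counts below are the test-set supports from the report, used only as stand-ins on the assumption that the training split has similar proportions, so the real weights may differ slightly.

```python
# Stand-in class counts: test-set supports from the report (the real weights use the training labels)
class_counts = {0: 16099, 1: 37667}

n_samples = sum(class_counts.values())
n_classes = len(class_counts)

# 'balanced' heuristic: n_samples / (n_classes * count_of_class)
balanced_weights = {
    label: n_samples / (n_classes * count) for label, count in class_counts.items()
}

# After weighting, both classes carry the same total weight
weighted_totals = {label: balanced_weights[label] * class_counts[label] for label in class_counts}

print({label: round(w, 2) for label, w in balanced_weights.items()})
print(weighted_totals)
```

Each negative review counts roughly 2.3 times as much as a positive one, which pushes the forest to stop ignoring the minority class. That is consistent with the trade-off in the two reports: recall for class 0 goes from 0.64 to 0.79, while recall for class 1 drops from 0.92 to 0.80.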


##Real-world test case

In [17]:
# Add a list of new reviews to see how the model would behave "in the real world"
resenas_prueba = [
    "This medication is fantastic, it has helped me a lot with no side effects.",
    "I didn't like this medicine, it made me dizzy and didn't improve my condition.",
    "It's a regular product, I didn't see much difference but no negative effects either.",
    "I recommend it, it has really helped me a lot with my condition."
]

# Vectorize with the TF-IDF vectorizer fitted on the training set
resenas_prueba_tfidf = tfidf.transform(resenas_prueba)

# Make the predictions
predicciones = best_rf_model.predict(resenas_prueba_tfidf)

# Results
for resena, pred in zip(resenas_prueba, predicciones):
    sentimiento = "Positive" if pred == 1 else "Negative"
    print(f"Review: '{resena}' \nPredicted sentiment: {sentimiento}\n")

Review: 'This medication is fantastic, it has helped me a lot with no side effects.' 
Predicted sentiment: Positive

Review: 'I didn't like this medicine, it made me dizzy and didn't improve my condition.' 
Predicted sentiment: Negative

Review: 'It's a regular product, I didn't see much difference but no negative effects either.' 
Predicted sentiment: Positive

Review: 'I recommend it, it has really helped me a lot with my condition.' 
Predicted sentiment: Positive
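
The third review is fairly neutral, yet a binary classifier has to put it in one of the two classes. One possible refinement, sketched here as an assumption rather than something this notebook does, is to look at the predicted probability of the positive class (for example from `best_rf_model.predict_proba(resenas_prueba_tfidf)[:, 1]`) and flag predictions that sit close to 0.5. The helper `label_with_margin` and the probability values are hypothetical, chosen only to show the behaviour.

```python
def label_with_margin(p_positive, margin=0.1):
    """Turn a positive-class probability into a label, flagging values near 0.5 as uncertain."""
    if abs(p_positive - 0.5) < margin:
        return "Uncertain"
    return "Positive" if p_positive >= 0.5 else "Negative"

# Made-up probabilities for illustration only (not model output)
example_probabilities = [0.90, 0.20, 0.55]
example_labels = [label_with_margin(p) for p in example_probabilities]

for p, label in zip(example_probabilities, example_labels):
    print(f"p(positive)={p:.2f} -> {label}")
```

The width of the margin is a design choice: a wider band sends more reviews to manual inspection, a narrower one trusts the model more.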

