# NLP Basics Assessment

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/cam2149/icesi-nlp/blob/main/Sesion1/8-practice.ipynb)

In this notebook we put into practice some of the concepts covered in the previous notebooks, applied to a specific corpus:

With so many tweets circulating every second, it is hard to tell whether the sentiment behind a specific tweet will boost the brand of a company or person by going viral (positive), or devastate profits because of its negative tone. Capturing sentiment in language matters at a time when decisions and reactions are created and updated in seconds. But which words actually drive the sentiment label? In this competition, you have to identify the part of the tweet (word or phrase) that reflects the sentiment.

"My ridiculous dog is amazing." [sentiment: positive]

Build your skills in this important area with this broad dataset of tweets. Refine your technique to reach the top spot in this competition. Which words in tweets support a positive, negative, or neutral sentiment? How can you help determine that using machine learning tools?

The dataset is titled "Sentiment Analysis: Emotion in Text tweets with existing sentiment labels" and is used here under the Creative Commons Attribution 4.0 International license. The goal of the competition is to build a model that, given a tweet and its labeled sentiment, determines which word or phrase best supports that label.

Disclaimer: The dataset for this competition contains text that may be considered profane, vulgar, or offensive.

## References
* [Tweet Sentiment Extraction](https://www.kaggle.com/competitions/tweet-sentiment-extraction/overview)
* [NLP - Natural Language Processing With Python](https://www.udemy.com/course/nlp-natural-language-processing-with-python)

In [1]:
!pip install kaggle
!pip install vaderSentiment

Collecting vaderSentiment
  Downloading vaderSentiment-3.3.2-py2.py3-none-any.whl.metadata (572 bytes)
Downloading vaderSentiment-3.3.2-py2.py3-none-any.whl (125 kB)
Installing collected packages: vaderSentiment
Successfully installed vaderSentiment-3.3.2


In [2]:
import os
import warnings

import pandas as pd
import spacy
from sklearn.metrics import accuracy_score, classification_report
from spacy.matcher import Matcher
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

warnings.filterwarnings('ignore')

# Detect Colab by trying the Colab-only import, instead of relying on the
# deprecated pkg_resources API; outside Colab the import simply fails.
try:
    from google.colab import files
    IN_COLAB = True
except ImportError:
    IN_COLAB = False


In [4]:

# Initialize SpaCy and VADER
nlp = spacy.load("en_core_web_sm")
analyzer = SentimentIntensityAnalyzer()

# Download and extract the Kaggle dataset
# (expects kaggle.json, the Kaggle API token, in the current working directory)

!mkdir -p ~/.kaggle
!mv kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json

!kaggle competitions download -c tweet-sentiment-extraction
!unzip -o tweet-sentiment-extraction.zip -d tweet_data

Downloading tweet-sentiment-extraction.zip to /content
  0% 0.00/1.39M [00:00<?, ?B/s]
100% 1.39M/1.39M [00:00<00:00, 670MB/s]
Archive:  tweet-sentiment-extraction.zip
  inflating: tweet_data/sample_submission.csv  
  inflating: tweet_data/test.csv     
  inflating: tweet_data/train.csv    


In [5]:
!test '{IN_COLAB}' = 'True' && wget  https://github.com/cam2149/icesi-nlp/raw/refs/heads/main/requirements.txt && pip install -r requirements.txt

--2025-08-08 13:29:19--  https://github.com/cam2149/icesi-nlp/raw/refs/heads/main/requirements.txt
Resolving github.com (github.com)... 140.82.116.4
Connecting to github.com (github.com)|140.82.116.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/cam2149/icesi-nlp/refs/heads/main/requirements.txt [following]
--2025-08-08 13:29:20--  https://raw.githubusercontent.com/cam2149/icesi-nlp/refs/heads/main/requirements.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 349 [text/plain]
Saving to: ‘requirements.txt’


2025-08-08 13:29:20 (18.8 MB/s) - ‘requirements.txt’ saved [349/349]

Collecting pandas==2.1.1 (from -r requirements.txt (line 1))
  Downloading pandas-2.1.1-cp311-cp311-manylinux_2_17_

In [6]:
# Initialize SpaCy and VADER
nlp = spacy.load("en_core_web_sm")
analyzer = SentimentIntensityAnalyzer()

In [7]:
# Load the dataset
try:
    train_df = pd.read_csv("tweet_data/train.csv")
except FileNotFoundError as err:
    # Stop here: every later cell depends on train_df being defined
    raise FileNotFoundError(
        "train.csv not found even after attempting download and extraction."
    ) from err


# Data Exploration
print("Dataset Preview:")
print(train_df.head())
print("\nColumns:", train_df.columns.tolist())


Dataset Preview:
       textID                                               text  \
0  cb774db0d1                I`d have responded, if I were going   
1  549e992a42      Sooo SAD I will miss you here in San Diego!!!   
2  088c60f138                          my boss is bullying me...   
3  9642c003ef                     what interview! leave me alone   
4  358bd9e861   Sons of ****, why couldn`t they put them on t...   

                         selected_text sentiment  
0  I`d have responded, if I were going   neutral  
1                             Sooo SAD  negative  
2                          bullying me  negative  
3                       leave me alone  negative  
4                        Sons of ****,  negative  

Columns: ['textID', 'text', 'selected_text', 'sentiment']


In [8]:
# Count whitespace-separated tokens in the text column
# astype(str) guards against non-string values; note that it turns a missing
# value into the literal string "nan", which is counted as one token
token_count = train_df["text"].astype(str).apply(lambda x: len(x.split())).sum()

print(f"\nTotal number of tokens in the dataset: {token_count}")


Total number of tokens in the dataset: 354572
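
The count above relies on whitespace splitting, and `astype(str)` turns a missing tweet into the string `nan`, adding one token per missing value. Below is a minimal stdlib-only sketch of the same count that skips missing values instead. `count_tokens` is a hypothetical helper, not part of the notebook, and the third list entry is a made-up missing value.

```python
import math

def count_tokens(texts):
    """Sum whitespace-separated tokens, skipping None and NaN entries."""
    total = 0
    for text in texts:
        # Missing values typically appear as None or float('nan')
        if text is None or (isinstance(text, float) and math.isnan(text)):
            continue
        total += len(str(text).split())
    return total

tweets = [
    " I`d have responded, if I were going",  # 7 tokens
    "my boss is bullying me...",             # 5 tokens
    float("nan"),                            # skipped
]
print(count_tokens(tweets))  # 12
```

Whitespace splitting keeps punctuation attached to words (`me...` is one token), so this figure will generally differ from a spaCy token count.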


In [8]:
# Count the number of sentences in the text column
# astype(str) guards against non-string values (a missing value becomes the string "nan")
#sentence_count = train_df["text"].astype(str).apply(lambda x: len(list(nlp(x).sents))).sum()

#print(f"\nTotal number of sentences in the dataset: {sentence_count}")

In [9]:
# Select a row from the dataset (e.g., the first row)
selected_row = train_df.iloc[0]
text = selected_row["text"]

# Process the text with SpaCy
doc = nlp(text)

# Print information for each token
print(f"Analyzing text: '{text}'\n")
print("{:20}{:20}{:20}{:20}".format("Text", "POS", "dep", "lemma"))
for token in doc:
    print(f"{token.text:{20}}{token.pos_:{20}}{token.dep_:{20}}{token.lemma_:{20}}")



Analyzing text: ' I`d have responded, if I were going'

Text                POS                 dep                 lemma               
                    SPACE               dep                                     
I`d                 PROPN               nsubj               I`d                 
have                AUX                 aux                 have                
responded           VERB                ROOT                respond             
,                   PUNCT               punct               ,                   
if                  SCONJ               mark                if                  
I                   PRON                nsubj               I                   
were                AUX                 aux                 be                  
going               VERB                advcl               go                  


In [10]:

# Preprocessing function using SpaCy
def preprocess_text(text):
    if pd.isna(text):
        return ""
    doc = nlp(str(text))
    tokens = [token.lemma_ for token in doc if not token.is_stop and not token.is_punct]
    return " ".join(tokens)

train_df["processed_text"] = train_df["text"].apply(preprocess_text)

# Sentiment Analysis with VADER
def get_sentiment(text):
    scores = analyzer.polarity_scores(text)
    compound = scores["compound"]
    if compound > 0.05:
        return "positive"
    elif compound < -0.05:
        return "negative"
    else:
        return "neutral"

train_df["predicted_sentiment"] = train_df["processed_text"].apply(get_sentiment)
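
The thresholding step in `get_sentiment` can be isolated so that it is testable without running VADER. The sketch below assumes the convention documented in the VADER README, where a compound score of exactly ±0.05 already counts as positive or negative (`>=` / `<=`), whereas the cell above uses strict inequalities. `label_from_compound` and its `threshold` parameter are hypothetical names.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a VADER compound score in [-1, 1] to a sentiment label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

for score in (0.6, 0.05, 0.0, -0.05, -0.7):
    print(score, label_from_compound(score))
```

Because VADER's rules use capitalization, punctuation, and negation words, it may also be worth comparing predictions on the raw `text` column against those on `processed_text`, where lemmatization and stop-word removal strip some of those cues.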


In [11]:
# Evaluation against provided labels
print("\nSentiment Prediction Evaluation:")
accuracy = accuracy_score(train_df["sentiment"], train_df["predicted_sentiment"])
print(f"Accuracy: {accuracy:.4f}")
print(classification_report(train_df["sentiment"], train_df["predicted_sentiment"]))


Sentiment Prediction Evaluation:
Accuracy: 0.6310
              precision    recall  f1-score   support

    negative       0.73      0.57      0.64      7781
     neutral       0.71      0.48      0.57     11118
    positive       0.54      0.88      0.67      8582

    accuracy                           0.63     27481
   macro avg       0.66      0.64      0.63     27481
weighted avg       0.66      0.63      0.62     27481
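
To see where the numbers in the report come from, per-class precision and recall can be recomputed from pairs of true and predicted labels. This is a stdlib-only sketch on toy labels (not the notebook's data). `per_class_metrics` is a hypothetical helper.

```python
from collections import Counter

def per_class_metrics(y_true, y_pred):
    """Return {label: (precision, recall)} computed from paired labels."""
    pairs = Counter(zip(y_true, y_pred))  # (true, predicted) -> count
    true_totals = Counter(y_true)         # support per class
    pred_totals = Counter(y_pred)         # number of predictions per class
    metrics = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = pairs[(label, label)]        # correctly predicted rows
        precision = tp / pred_totals[label] if pred_totals[label] else 0.0
        recall = tp / true_totals[label] if true_totals[label] else 0.0
        metrics[label] = (precision, recall)
    return metrics

y_true = ["negative", "negative", "neutral", "neutral", "positive"]
y_pred = ["negative", "positive", "positive", "neutral", "positive"]
for label, (p, r) in per_class_metrics(y_true, y_pred).items():
    print(f"{label:10} precision={p:.2f} recall={r:.2f}")
```

The toy labels reproduce the pattern visible in the report: `positive` has high recall (0.88) but low precision (0.54), which means many tweets labeled neutral or negative are being predicted as positive.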



In [12]:

# Text Justification Extraction with SpaCy Matcher
matcher = Matcher(nlp.vocab)
positive_pattern = [{"LOWER": {"IN": ["good", "great", "excellent", "love"]}}]
negative_pattern = [{"LOWER": {"IN": ["bad", "poor", "terrible", "hate"]}}]
matcher.add("PositiveWords", [positive_pattern])
matcher.add("NegativeWords", [negative_pattern])

def extract_justification(text):
    if isinstance(text, str):  # skip NaN and other non-string values
        doc = nlp(text)
        matches = matcher(doc)
        if matches:
            match_id, start, end = matches[0]
            return doc[start:end].text
    return ""

train_df["extracted_text"] = train_df["text"].apply(extract_justification)
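
`extract_justification` returns the first match whether it is a positive or a negative word, and it ignores the tweet's label. A possible refinement, sketched here with plain Python sets instead of the spaCy `Matcher`, picks the word list by label and falls back to the whole tweet for neutral rows (as in row 0 of the preview, where `selected_text` equals the full text). `LEXICON` and `extract_for_label` are hypothetical names, and the word lists simply reuse the patterns above.

```python
import string

# Same word lists as the Matcher patterns, keyed by sentiment label
LEXICON = {
    "positive": {"good", "great", "excellent", "love"},
    "negative": {"bad", "poor", "terrible", "hate"},
}

def extract_for_label(text, sentiment):
    """Return the first lexicon word that matches the given sentiment label."""
    if not isinstance(text, str):
        return ""
    if sentiment == "neutral":
        return text.strip()  # assumption: neutral tweets are kept whole
    words = LEXICON.get(sentiment, set())
    for token in text.split():
        # Strip surrounding punctuation before the case-insensitive lookup
        cleaned = token.strip(string.punctuation)
        if cleaned.lower() in words:
            return cleaned
    return ""

print(extract_for_label("I hate Mondays but love coffee!", "positive"))  # love
print(extract_for_label("I hate Mondays but love coffee!", "negative"))  # hate
print(repr(extract_for_label("my boss is bullying me...", "negative")))  # ''
```

The last call shows the main limitation of a small fixed lexicon: tweets whose sentiment is carried by other words, such as "bullying", yield no justification at all.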


In [15]:
# Display results
print("\nSample Results with Extracted Justification:")
print(train_df[["text", "sentiment", "predicted_sentiment", "selected_text", "extracted_text"]].head())


Sample Results with Extracted Justification:
                                                text sentiment  \
0                I`d have responded, if I were going   neutral   
1      Sooo SAD I will miss you here in San Diego!!!  negative   
2                          my boss is bullying me...  negative   
3                     what interview! leave me alone  negative   
4   Sons of ****, why couldn`t they put them on t...  negative   

  predicted_sentiment                        selected_text extracted_text  
0             neutral  I`d have responded, if I were going                 
1            negative                             Sooo SAD                 
2            negative                          bullying me                 
3            negative                       leave me alone                 
4             neutral                        Sons of ****,                 
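
To quantify how close `extracted_text` is to the reference `selected_text`, a word-level Jaccard score (the metric used by the Kaggle competition) can be computed per row. This is a minimal sketch: the `jaccard` helper is written for illustration, and returning 0.0 when both strings are empty is an assumption.

```python
def jaccard(a, b):
    """Word-level Jaccard similarity between two strings (case-insensitive)."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a and not set_b:
        return 0.0  # assumption: two empty strings score zero
    return len(set_a & set_b) / len(set_a | set_b)

# 2 shared words out of 10 distinct words
print(jaccard("Sooo SAD", "Sooo SAD I will miss you here in San Diego!!!"))  # 0.2
# An empty extraction never overlaps with the reference
print(jaccard("", "bullying me"))  # 0.0
```

With pandas, the score could be averaged over the dataset with something like `train_df.apply(lambda r: jaccard(str(r["extracted_text"]), str(r["selected_text"])), axis=1).mean()`. Since `extracted_text` is empty for all five rows shown above, their score would be 0.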


In [18]:
# Rebuild the column index from a plain list of column names before saving
train_df.columns = train_df.columns.tolist()


In [19]:
train_df.to_csv("tweet_sentiment_results.csv", index=False)
print("\nResults saved to 'tweet_sentiment_results.csv'")

AttributeError: 'Index' object has no attribute '_format_native_types'
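
The traceback is truncated, so the cause cannot be confirmed from the output alone. One plausible explanation, stated here as an assumption, is that cell `In [5]` installed `pandas==2.1.1` after another pandas version had already been imported, leaving the session with old modules in memory and new files on disk. Restarting the runtime after the install and re-running the notebook would rule this out. The hypothetical helper `loaded_vs_installed` below compares the version of a module already in memory with the version installed on disk.

```python
import sys
from importlib import metadata

def loaded_vs_installed(name):
    """Return (loaded_version, installed_version), or None if unavailable."""
    module = sys.modules.get(name)
    if module is None:
        return None  # module has not been imported in this session
    try:
        installed = metadata.version(name)
    except metadata.PackageNotFoundError:
        return None  # no distribution metadata under that name
    return getattr(module, "__version__", None), installed

versions = loaded_vs_installed("pandas")
if versions is None:
    print("pandas is not imported or not installed")
elif versions[0] != versions[1]:
    print(f"Mismatch: loaded {versions[0]}, installed {versions[1]}; restart the runtime")
else:
    print(f"pandas {versions[0]} is consistent")
```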