# NLP Basics Assessment

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Ohtar10/icesi-nlp/blob/main/Sesion1/8-practice.ipynb)

In this notebook we put into practice some of the concepts covered in the previous notebooks, applied to a specific corpus:
[_An Occurrence at Owl Creek Bridge_](https://en.wikipedia.org/wiki/An_Occurrence_at_Owl_Creek_Bridge) by Ambrose Bierce (1890). The story is in the public domain, and the corpus was obtained from [Project Gutenberg](https://www.gutenberg.org/ebooks/375.txt.utf-8).

## References
* [NLP - Natural Language Processing With Python](https://www.udemy.com/course/nlp-natural-language-processing-with-python)

In [None]:
import pkg_resources
import warnings

warnings.filterwarnings('ignore')

installed_packages = [package.key for package in pkg_resources.working_set]
IN_COLAB = 'google-colab' in installed_packages
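
`pkg_resources` is deprecated as an API in recent setuptools releases, so importing it can emit a warning. The following is a minimal sketch of the same installed-package check using the standard library's `importlib.metadata`; the helper name `is_installed` is hypothetical and not part of the original notebook.

```python
from importlib import metadata


def is_installed(dist_name: str) -> bool:
    """Return True if a distribution with this name is installed."""
    try:
        metadata.distribution(dist_name)
    except metadata.PackageNotFoundError:
        return False
    return True


# Same flag as in the cell above, computed without pkg_resources.
IN_COLAB = is_installed("google-colab")
print(IN_COLAB)
```

Outside Colab this should print `False`, since the `google-colab` distribution is normally only present in Colab runtimes.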


In [None]:
!test '{IN_COLAB}' = 'True' && wget  https://github.com/Ohtar10/icesi-nlp/raw/refs/heads/main/requirements.txt && pip install -r requirements.txt

In [None]:
# RUN THIS CELL to perform standard imports:
import spacy
nlp = spacy.load('en_core_web_sm')

**1. Create the document from the file `owlcreek.txt`**<br>
> Hint: Use `with open('./owlcreek.txt') as f:`

In [None]:
!test '{IN_COLAB}' = 'True' && wget  https://github.com/Ohtar10/icesi-nlp/raw/refs/heads/main/Sesion1/owlcreek.txt

In [None]:
with open('./owlcreek.txt') as file:
    doc = nlp(file.read())

In [None]:
doc[:36]

AN OCCURRENCE AT OWL CREEK BRIDGE

by Ambrose Bierce

I

A man stood upon a railroad bridge in northern Alabama, looking down
into the swift water twenty feet below.  

The document was loaded successfully!

**2. How many tokens are in the file?**

In [None]:
len(doc)

4835
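
Note that `len(doc)` counts every token, including punctuation and whitespace tokens such as line breaks. As a rough sketch of the difference, the example below uses a blank English pipeline (tokenizer only, so no model download is needed) on a made-up sample sentence; the names `nlp_blank`, `sample`, and `words` are illustrative and chosen so they do not overwrite the notebook's `nlp` and `doc`.

```python
import spacy

# A blank pipeline contains only the tokenizer.
nlp_blank = spacy.blank("en")
sample = nlp_blank("A man stood upon a bridge,\nlooking down.")

# Every token, including "," "." and the "\n" whitespace token.
total = len(sample)

# Only tokens that are neither whitespace nor punctuation.
words = [t for t in sample if not t.is_space and not t.is_punct]

print(total, len(words))
```

Applying the same filter to the full document would give a word count lower than the raw token count shown above.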

**3. How many sentences are in the file?**
<br>Hint: You will need a list first

In [None]:
sentences = list(doc.sents)
len(sentences)

204

**4. Print the second sentence of the document**
<br> Hint: Indices start at 0, and the title counts as the first sentence.

In [None]:
sentences[1]

The man's hands were behind
his back, the wrists bound with a cord.  

**5. For each token in the previous sentence, print its `text`, `POS` tag, `dep` tag, and `lemma`**
<br>

In [None]:
print("{:20}{:20}{:20}{:20}".format("Text", "POS", "dep", "lemma"))
for token in sentences[1]:
    print(f"{token.text:{20}}{token.pos_:{20}}{token.dep_:{20}}{token.lemma_:{20}}")

Text                POS                 dep                 lemma               
The                 DET                 det                 the                 
man                 NOUN                poss                man                 
's                  PART                case                's                  
hands               NOUN                nsubj               hand                
were                AUX                 ROOT                be                  
behind              ADP                 prep                behind              

                   SPACE               dep                 
                   
his                 PRON                poss                his                 
back                NOUN                pobj                back                
,                   PUNCT               punct               ,                   
the                 DET                 det                 the                 
wrists              NOUN    
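
The broken row in the output above comes from the whitespace token between "behind" and "his": its text is a line break, so printing it raw splits the table. One possible workaround, sketched here in plain Python on rows copied from that output, is to print an escaped form of each cell. The helper `format_row` is hypothetical.

```python
# (text, POS, dep, lemma) rows taken from the output above.
rows = [
    ("behind", "ADP", "prep", "behind"),
    ("\n", "SPACE", "dep", "\n"),
    ("his", "PRON", "poss", "his"),
]


def format_row(cells, width=20):
    # repr() renders control characters as escapes; [1:-1] drops the quotes.
    return "".join(f"{repr(c)[1:-1]:{width}}" for c in cells)


lines = [format_row(("Text", "POS", "dep", "lemma"))]
lines += [format_row(r) for r in rows]
print("\n".join(line.rstrip() for line in lines))
```

In the notebook, the same idea could be applied by passing `(token.text, token.pos_, token.dep_, token.lemma_)` for each token, or by skipping tokens where `token.is_space` is true.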

**6. Implement a matcher named *Swimming* that finds the occurrences of the phrase *swimming vigorously***
<br>
Hint: You should include an `'IS_SPACE': True` pattern between the two words.

In [None]:
from spacy.matcher import Matcher

matcher = Matcher(nlp.vocab)
pattern = [{'LOWER': 'swimming'}, {'IS_SPACE': True}, {'LOWER': 'vigorously'}]
matcher.add("Swimming", [pattern])


In [None]:
found_matches = matcher(doc)
found_matches




[(12881893835109366681, 1274, 1277), (12881893835109366681, 3609, 3612)]
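
Each match is a tuple of `(match_id, start, end)`, where `match_id` is the hash of the rule name and `start`/`end` are token indices. The pattern above requires a whitespace token between the two words, which fits this corpus because the phrase is split across a line break. As a hedged variant, the sketch below marks that token as optional with `"OP": "?"` so the phrase also matches when the words share a line. It runs on a blank pipeline and a made-up sample text; `nlp_blank`, `demo_matcher`, and `demo_doc` are illustrative names.

```python
import spacy
from spacy.matcher import Matcher

nlp_blank = spacy.blank("en")  # tokenizer only, no model download
demo_matcher = Matcher(nlp_blank.vocab)

# The middle token is optional: zero or one whitespace token.
pattern = [
    {"LOWER": "swimming"},
    {"IS_SPACE": True, "OP": "?"},
    {"LOWER": "vigorously"},
]
demo_matcher.add("Swimming", [pattern])

demo_doc = nlp_blank("He was swimming\nvigorously. She kept swimming vigorously too.")
demo_matches = demo_matcher(demo_doc)

for match_id, start, end in demo_matches:
    # Look up the rule name from its hash in the shared vocabulary.
    rule_name = nlp_blank.vocab.strings[match_id]
    print(rule_name, start, end, repr(demo_doc[start:end].text))
```

The first match spans three tokens (including the line break) and the second spans two, which is the case the stricter pattern would miss.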

**7. Print the text around each match found**

In [None]:
start, end = found_matches[0][1:]
doc[start-9:end+13]

By diving I could evade the bullets and, swimming
vigorously, reach the bank, take to the woods and get away home

In [None]:
start, end = found_matches[1][1:]
doc[start-7:end+5]

over his shoulder; he was now swimming
vigorously with the current.  
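
The offsets used above (`start-9`, `end+13`, and so on) were chosen by hand for these two matches. In Python slicing a negative start counts from the end of the sequence, so a fixed offset could select the wrong span if a match fell near the beginning of the text. A minimal sketch of a clamped window, shown on a plain list of words, could look like this; `context_window` is a hypothetical helper.

```python
def context_window(tokens, start, end, before=5, after=5):
    """Return tokens[start-before:end+after], clamped to the sequence bounds."""
    lo = max(start - before, 0)
    hi = min(end + after, len(tokens))
    return tokens[lo:hi]


tokens = "he was now swimming vigorously with the current".split()

# The match covers tokens 3..5; asking for 9 tokens of left context is
# clamped to the start of the list instead of wrapping around.
window = context_window(tokens, 3, 5, before=9, after=2)
print(" ".join(window))
```

The same bounds arithmetic should carry over to a spaCy `Doc`, since it supports `len()` and integer slicing.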

**8. Print the sentence that contains each match found**

In [None]:
for sentence in sentences:
    for _, start, end in found_matches:
        if sentence.start <= start and sentence.end >= end:
            print(sentence.text, '\n')

By diving I could evade the bullets and, swimming
vigorously, reach the bank, take to the woods and get away home.   

The hunted man saw all this over his shoulder; he was now swimming
vigorously with the current.   
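
The nested loop above compares every sentence with every match, which is fine for a short story. Because sentence start offsets are sorted, the containing sentence can also be found with a binary search. The sketch below illustrates this on made-up offsets; `containing_sentence`, `sentence_starts`, and `match_spans` are hypothetical, and it assumes a match never crosses a sentence boundary.

```python
from bisect import bisect_right


def containing_sentence(sentence_starts, token_index):
    """Index of the sentence whose start is the last one <= token_index."""
    return bisect_right(sentence_starts, token_index) - 1


# Hypothetical sentence start offsets (token indices), sorted ascending.
sentence_starts = [0, 8, 36, 60]
# Hypothetical (start, end) match spans.
match_spans = [(10, 13), (61, 64)]

hits = [containing_sentence(sentence_starts, start) for start, _ in match_spans]
print(hits)
```

In the notebook, the offsets could be built with `[s.start for s in sentences]` and the resulting indices used to look up entries in `sentences`.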



Collecting vaderSentiment
  Downloading vaderSentiment-3.3.2-py2.py3-none-any.whl.metadata (572 bytes)
Downloading vaderSentiment-3.3.2-py2.py3-none-any.whl (125 kB)
Installing collected packages: vaderSentiment
Successfully installed vaderSentiment-3.3.2


In [13]:
!pip install kaggle
!pip install vaderSentiment

import pandas as pd
import spacy
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from spacy.matcher import Matcher
from sklearn.metrics import accuracy_score, classification_report
import os
from google.colab import files

# Initialize SpaCy and VADER
nlp = spacy.load("en_core_web_sm")
analyzer = SentimentIntensityAnalyzer()

# Download and load the Kaggle dataset

!mkdir -p ~/.kaggle
!mv kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json

!kaggle competitions download -c tweet-sentiment-extraction
!unzip -o tweet-sentiment-extraction.zip -d tweet_data


Downloading tweet-sentiment-extraction.zip to /content
  0% 0.00/1.39M [00:00<?, ?B/s]
100% 1.39M/1.39M [00:00<00:00, 648MB/s]
Archive:  tweet-sentiment-extraction.zip
  inflating: tweet_data/sample_submission.csv  
  inflating: tweet_data/test.csv     
  inflating: tweet_data/train.csv    


In [1]:
import pandas as pd
import spacy
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from spacy.matcher import Matcher
from sklearn.metrics import accuracy_score, classification_report
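import os
import zipfile

# Initialize SpaCy and VADER
nlp = spacy.load("en_core_web_sm")
analyzer = SentimentIntensityAnalyzer()

# Download the Kaggle dataset. `kaggle competitions download` has no --unzip
# flag (only `kaggle datasets download` does), so extract the archive here.
os.system("kaggle competitions download -c tweet-sentiment-extraction -p tweet_data")
zip_path = "tweet_data/tweet-sentiment-extraction.zip"
if os.path.exists(zip_path):
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall("tweet_data")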


# Load the dataset
try:
    train_df = pd.read_csv("tweet_data/train.csv")
except FileNotFoundError as err:
    # Stop here: every later step needs train_df, so continuing would only raise a NameError.
    raise FileNotFoundError(
        "train.csv not found even after attempting download and extraction."
    ) from err


# Data Exploration
print("Dataset Preview:")
print(train_df.head())
print("\nColumns:", train_df.columns.tolist())

# Preprocessing function using SpaCy
def preprocess_text(text):
    if pd.isna(text):
        return ""
    doc = nlp(str(text))
    tokens = [token.lemma_ for token in doc if not token.is_stop and not token.is_punct]
    return " ".join(tokens)

train_df["processed_text"] = train_df["text"].apply(preprocess_text)

# Sentiment Analysis with VADER
def get_sentiment(text):
    scores = analyzer.polarity_scores(text)
    compound = scores["compound"]
    if compound > 0.05:
        return "positive"
    elif compound < -0.05:
        return "negative"
    else:
        return "neutral"

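# Note: VADER's heuristics use punctuation, capitalization, and negation words such as
# "not", all of which preprocess_text strips; scoring the raw "text" column may be more accurate.
train_df["predicted_sentiment"] = train_df["processed_text"].apply(get_sentiment)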

# Evaluation against provided labels
print("\nSentiment Prediction Evaluation:")
accuracy = accuracy_score(train_df["sentiment"], train_df["predicted_sentiment"])
print(f"Accuracy: {accuracy:.4f}")
print(classification_report(train_df["sentiment"], train_df["predicted_sentiment"]))

# Text Justification Extraction with SpaCy Matcher
matcher = Matcher(nlp.vocab)
positive_pattern = [{"LOWER": {"IN": ["good", "great", "excellent", "love"]}}]
negative_pattern = [{"LOWER": {"IN": ["bad", "poor", "terrible", "hate"]}}]
matcher.add("PositiveWords", [positive_pattern])
matcher.add("NegativeWords", [negative_pattern])

def extract_justification(text):
    if isinstance(text, str):  # Skip NaN and other non-string values
        doc = nlp(text)
        matches = matcher(doc)
        if matches:
            match_id, start, end = matches[0]
            return doc[start:end].text
    return ""

train_df["extracted_text"] = train_df["text"].apply(extract_justification)

# Display results
print("\nSample Results with Extracted Justification:")
print(train_df[["text", "sentiment", "predicted_sentiment", "selected_text", "extracted_text"]].head())

# Save results
train_df.to_csv("tweet_sentiment_results.csv", index=False)
print("\nResults saved to 'tweet_sentiment_results.csv'")

ModuleNotFoundError: No module named 'vaderSentiment'
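
The thresholding and accuracy steps in the cell above do not depend on VADER or scikit-learn, so their logic can be checked on its own. The sketch below mirrors `get_sentiment`'s cutoffs on made-up compound scores and gold labels; `label_from_compound`, `scores`, and `gold` are hypothetical and the numbers are for illustration only, not results from the dataset.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a compound score in [-1, 1] to a sentiment label."""
    if compound > threshold:
        return "positive"
    if compound < -threshold:
        return "negative"
    return "neutral"


# Hypothetical compound scores and gold labels, for illustration only.
scores = [0.62, -0.40, 0.01, 0.05]
gold = ["positive", "negative", "neutral", "positive"]

predicted = [label_from_compound(s) for s in scores]

# Fraction of predictions that equal the gold label.
accuracy = sum(p == g for p, g in zip(predicted, gold)) / len(gold)
print(predicted, accuracy)
```

Because the comparison is strict, a score of exactly 0.05 is labeled neutral, which is why the last example counts as a miss.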