# NLP Basics Assessment

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/NickEsColR/icesi-nlp/blob/main/Sesion1/6-practice.ipynb)

In this notebook we put into practice some of the concepts covered in the previous notebooks, applied to a specific corpus: 
[_John F. Kennedy's Inaugural Address_](https://en.wikipedia.org/wiki/Inauguration_of_John_F._Kennedy) (1961). This speech is in the public domain, and the corpus was obtained from [Project Gutenberg](https://www.gutenberg.org/ebooks/3.txt.utf-8).

## References
* [NLP - Natural Language Processing With Python](https://www.udemy.com/course/nlp-natural-language-processing-with-python)

Run the cell below if you don't have `en_core_web_sm` downloaded.

In [None]:
!python -m spacy download en_core_web_sm

In [1]:
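import warnings

# Silence warnings first: importing pkg_resources emits a deprecation notice
warnings.filterwarnings('ignore')

import pkg_resources

# Detect Google Colab by looking for the 'google-colab' package
installed_packages = [package.key for package in pkg_resources.working_set]
IN_COLAB = 'google-colab' in installed_packages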

In [2]:
!test '{IN_COLAB}' = 'True' && wget  https://github.com/Ohtar10/icesi-nlp/raw/refs/heads/main/requirements.txt && pip install -r requirements.txt

In [None]:
import spacy
nlp = spacy.load('en_core_web_sm')

**1. Create the document from the file `JFKIA.txt`**<br>
> Hint: Use `with open('./JFKIA.txt') as f:`

In [4]:
!test '{IN_COLAB}' = 'True' && wget  https://github.com/NickEsColR/icesi-nlp/raw/refs/heads/main/Sesion1/JFKIA.txt

In [5]:
with open('./JFKIA.txt', encoding='utf-8') as file:
    doc = nlp(file.read())

The file is now loaded into the variable `doc`. Let's look at the first 36 tokens of the text.

In [6]:
doc[:36]

John F. Kennedy’s Inaugural Address, January 20, 1961, 12:11 EST


We observe today not a victory of party but a celebration of
freedom—symbolizing an end as well as

The document was loaded successfully!

**2. Tokens in the file**

In [1]:
print(f'The file has {len(doc)} tokens.')

**3. Sentences in the file**
<br>Hint: You will need a list first.

In [None]:
sentences = list(doc.sents)
print(f'The file has {len(sentences)} sentences.')

50

**4. Print the second sentence of the document**
<br> Hint: Indices start at 0, and the title counts as the first sentence.

In [9]:
sentences[1]

The world is very different now, for man holds in his mortal hands the
power to abolish all forms of human poverty and all forms of human
life.

**5. For each token in the previous sentence, print its `text`, `POS` tag, `dep` tag and `lemma`**
<br>

In [10]:
print("{:20}{:20}{:20}{:20}".format("Text", "POS", "dep", "lemma"))
for token in sentences[1]:
    print(f"{token.text:{20}}{token.pos_:{20}}{token.dep_:{20}}{token.lemma_:{20}}")

Text                POS                 dep                 lemma               
The                 DET                 det                 the                 
world               NOUN                nsubj               world               
is                  AUX                 ROOT                be                  
very                ADV                 advmod              very                
different           ADJ                 acomp               different           
now                 ADV                 advmod              now                 
,                   PUNCT               punct               ,                   
for                 SCONJ               mark                for                 
man                 NOUN                nsubj               man                 
holds               VERB                advcl               hold                
in                  ADP                 prep                in                  
his                 PRON    
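
The tag abbreviations in the table can be hard to read at first. As a small aside that is not part of the original exercise, `spacy.explain` returns a short description for the labels it knows about. The list of labels below is picked by hand from the output above, and the name `descriptions` is only illustrative.

```python
import spacy

# Labels copied by hand from the table above.
# spacy.explain is a glossary lookup, so no trained model is needed.
labels = ["DET", "NOUN", "AUX", "ADV", "nsubj", "acomp", "advcl"]

# Map each label to its description (None for labels spaCy does not know).
descriptions = {label: spacy.explain(label) for label in labels}

for label, description in descriptions.items():
    print(f"{label:10}{description}")
```

Coarse POS tags are upper case and dependency labels are lower case, so both kinds can share one lookup.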

**6. Implement a matcher called *Pledge* that finds the occurrences of the phrase *we pledge***

Hint: You should include an `'IS_SPACE': True` pattern between the two words.

In [11]:
from spacy.matcher import Matcher

matcher = Matcher(nlp.vocab)
# Pattern to find "we pledge" specifically
pledge_pattern = [
    {'LOWER': 'we'},
    {'IS_SPACE': True, 'OP': '*'},
    {'LOWER': 'pledge'}
]

matcher.add("Pledge", [pledge_pattern])


In [12]:
found_matches = matcher(doc)
found_matches

[(13569445558805512325, 314, 316),
 (13569445558805512325, 333, 336),
 (13569445558805512325, 403, 405),
 (13569445558805512325, 509, 511)]

The matcher finds 4 occurrences.
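
Each entry returned by the matcher is a `(match_id, start, end)` tuple: `match_id` is the hash of the rule name, and `start` and `end` are token indices into the document. The sketch below shows how these pieces could be read back. It assumes a blank English pipeline (enough here, since `LOWER` and `IS_SPACE` are lexical attributes) and an invented sample sentence; `nlp_blank`, `demo_matcher` and `demo_doc` are illustrative names, not part of the notebook.

```python
import spacy
from spacy.matcher import Matcher

# Blank pipeline: tokenizer only, no trained model required.
nlp_blank = spacy.blank("en")

demo_matcher = Matcher(nlp_blank.vocab)
demo_matcher.add("Pledge", [[
    {"LOWER": "we"},
    {"IS_SPACE": True, "OP": "*"},  # zero or more whitespace tokens (e.g. a line break)
    {"LOWER": "pledge"},
]])

# Invented sample: the second "we pledge" is split by a newline token.
demo_doc = nlp_blank("This much we pledge and more: we\npledge our word.")

# Sort by start index so the matches are listed in reading order.
demo_matches = sorted(demo_matcher(demo_doc), key=lambda m: m[1])

for match_id, start, end in demo_matches:
    rule_name = nlp_blank.vocab.strings[match_id]  # hash -> "Pledge"
    print(rule_name, start, end, repr(demo_doc[start:end].text))
```

The match that spans a line break is one token longer than the other, which is the same effect visible in the `(…, 333, 336)` entry of the output above.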

**7. Print the text around each match found**

In [13]:
for _, start, end in found_matches:
    # Clamp the start so a match near the beginning never yields a negative index
    print(doc[max(start - 9, 0):end + 13])
    print('-' * 40)

and the success of liberty.

This much we pledge—and more.

To those old allies whose cultural and spiritual
----------------------------------------
allies whose cultural and spiritual origins we share: we
pledge the loyalty of faithful friends. United, there is little we

----------------------------------------
we welcome to the ranks of the free: we pledge
our word that one form of colonial control shall not have passed
----------------------------------------

to break the bonds of mass misery: we pledge our best efforts to help
them help themselves, for whatever period
----------------------------------------
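
A fixed window of 9 tokens before and 13 after works here because every match sits well inside the document. Near the start of a text, `start - 9` would be negative, and a negative slice index counts from the end, as with Python lists. The sketch below illustrates the clamping idea on a plain list of words; `context_window` is a hypothetical helper, not something from the notebook or from spaCy.

```python
def context_window(tokens, start, end, before=9, after=13):
    """Return the match plus its surrounding tokens, clamped to the sequence bounds."""
    left = max(start - before, 0)            # never go below index 0
    right = min(end + after, len(tokens))    # never go past the last token
    return tokens[left:right]


words = "This much we pledge and more".split()

# Match "we pledge" sits at indices 2..4, close to the start of the text.
print(context_window(words, 2, 4))                     # whole list: window is clamped
print(context_window(words, 2, 4, before=1, after=1))  # one token on each side

# Without clamping, a negative start silently selects from the end instead.
print(words[2 - 3:4 + 1])
```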


**8. Print the sentence that contains each match found**

In [14]:
for sentence in sentences:
    for _, start, end in found_matches:
        if sentence.start <= start and sentence.end >= end:
            print(sentence.text)
            print('-' * 40)

This much we pledge—and more.


----------------------------------------
To those old allies whose cultural and spiritual origins we share: we
pledge the loyalty of faithful friends.
----------------------------------------
To those new states whom we welcome to the ranks of the free: we pledge
our word that one form of colonial control shall not have passed away
merely to be replaced by a far more iron tyranny.
----------------------------------------
To those people in the huts and villages of half the globe struggling
to break the bonds of mass misery: we pledge our best efforts to help
them help themselves, for whatever period is required—not because the
Communists may be doing it, not because we seek their votes, but
because it is right.
----------------------------------------
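
The nested loop compares every sentence with every match. As an alternative sketch, a spaCy `Span` exposes the sentence that contains it through its `sent` attribute, so each match can be mapped to its sentence directly. The example assumes a blank English pipeline with the rule-based `sentencizer` component and an invented sample text; `nlp_blank`, `demo_matcher`, `demo_doc` and `matched_sentences` are illustrative names.

```python
import spacy
from spacy.matcher import Matcher

# Blank pipeline plus rule-based sentence splitting (no trained model needed).
nlp_blank = spacy.blank("en")
nlp_blank.add_pipe("sentencizer")

demo_matcher = Matcher(nlp_blank.vocab)
demo_matcher.add("Pledge", [[
    {"LOWER": "we"},
    {"IS_SPACE": True, "OP": "*"},
    {"LOWER": "pledge"},
]])

# Invented three-sentence sample; only the middle sentence contains a pledge.
demo_doc = nlp_blank(
    "We observe today a celebration of freedom. "
    "This much we pledge and more. "
    "Let us begin."
)

# Span.sent returns the sentence span that contains the matched tokens.
matched_sentences = [
    demo_doc[start:end].sent.text
    for _, start, end in demo_matcher(demo_doc)
]

for sentence_text in matched_sentences:
    print(sentence_text)
    print('-' * 40)
```

In the notebook itself, `doc[start:end].sent` would play the same role, since `en_core_web_sm` already sets sentence boundaries.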


We can see that the matcher lets us identify the parts of the text where promises are made.

Note that there are other ways to phrase a promise; capturing all of them would require broadening the pattern options or creating several matchers. To keep the analysis from growing with too many matches, the exercise was done with only one kind of promise phrasing in the language.