# Named Entity Recognition (NER) for Corridos
This notebook was created by Group Four in IS578: Introduction to Digital Humanities at the University of Illinois, Urbana-Champaign (Fall 2022). It lightly adapts code from Melanie Walsh's free online textbook, [Introduction to Cultural Analytics and Python](https://melaniewalsh.github.io/Intro-Cultural-Analytics/05-Text-Analysis/Multilingual/Spanish/02-Named-Entity-Recognition-Spanish.html).

A special thanks to Dr. Zoe LeBlanc for guiding us to resources this semester, especially Walsh's text.

## Overview

Named Entity Recognition (NER) is a common task in Natural Language Processing (NLP). It identifies named entities in text, such as people, places, and organizations, which usually appear as proper nouns. Several Python libraries support NER, but here we focus on spaCy, one of the most popular NLP libraries in Python. You can learn more about spaCy on its [official site](https://spacy.io/) and on [Wikipedia](https://en.wikipedia.org/wiki/SpaCy).

For our project, we used NER to identify places mentioned in a corpus of corridos. Corridos are a traditional form of Mexican music and folklore that often describe historic events, people, and places. Using NER, we hoped to extract data that could be transferred to geospatial mappings as a way to represent corridos and the significance of the places they mention.

The program in this notebook was useful but not perfect. We relied on one of spaCy's free pretrained pipelines rather than training data of our own. While it found almost all the named entities, it was not particularly accurate at sorting them into persons (PER), organizations (ORG), locations (LOC), or miscellaneous (MISC). If we had more time and a larger corpus, we would have attempted to create our own training data to improve these results. This could be a project for the future, if any of our group members are so inclined.

There was also no way for the program to specify locations with enough detail to automate the mapping process. To map every place at once automatically, we would have needed city, state, country, longitude, and latitude data. Because this information is often implied or symbolic in our corpus, it was beyond what our program could identify on its own. Still, our NER program was useful to us because it saved us the tedious work of identifying all the named entities through reading. We still had to double-check the categorization results and manually search place names in Google Maps to verify their accuracy in relation to the corridos, but the program in this notebook provided the list of named entities from which to begin the manual work.
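The manual lookup described above could be organized with a simple worksheet. The following is a minimal sketch, not part of our original program: it writes one row per unique place name and leaves the city, state, country, latitude, and longitude columns blank to be filled in by hand. The column names, the function name `write_place_worksheet`, and the placeholder inputs are all assumptions made for illustration.

```python
import csv
import io

# Hypothetical column layout: the fields the text lists as necessary for mapping.
FIELDNAMES = ["entity", "city", "state", "country", "latitude", "longitude"]


def write_place_worksheet(place_names, destination):
    """Write one row per unique place name, leaving the location fields blank."""
    writer = csv.DictWriter(destination, fieldnames=FIELDNAMES)
    writer.writeheader()
    seen = set()
    for name in place_names:
        if name in seen:
            continue  # skip repeated mentions of the same name
        seen.add(name)
        row = {field: "" for field in FIELDNAMES}
        row["entity"] = name
        writer.writerow(row)


# Placeholder strings stand in for place names found by NER.
buffer = io.StringIO()
write_place_worksheet(["PLACE_A", "PLACE_B", "PLACE_A"], buffer)
worksheet_text = buffer.getvalue()
print(worksheet_text)
```

Passing an open file instead of the in-memory buffer would save the worksheet to disk, where it could be opened in a spreadsheet program and completed while searching each place in Google Maps.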

Finally, we put this program into a Colab notebook so every group member could follow along, no matter their Python experience. Given this priority on accessibility, we have kept the script short and focused on its primary task. Before settling on this version, we tried minor variations that used different input methods, compared other NER libraries such as NLTK, and tested different pretrained models. Such variations could still be added to this program. Its use is also not limited to corridos. If you hope to adapt this script to other projects, we encourage you to review Melanie Walsh's [Introduction to Cultural Analytics and Python](https://melaniewalsh.github.io/Intro-Cultural-Analytics/05-Text-Analysis/Multilingual/Spanish/02-Named-Entity-Recognition-Spanish.html). Also, feel free to contact [Matthew Kollmer](https://matthewkollmer.com). He will be happy to assist if he can (but please know he is still learning how to program in Python, too).


## Import spaCy

To begin our analysis, we need to import spaCy and download a pretrained model. Importing spaCy gives us access to the functions in the spaCy library. The downloaded model, which was trained in advance on annotated Spanish text, is what lets those functions recognize named entities. If you want to explore other Spanish models, check out spaCy's list [here](https://spacy.io/models/es). To use a different model, replace each occurrence of "es_core_news_md" in the code below with the name of the model you prefer.

In [1]:
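import spacy

# Download the medium Spanish pipeline (the leading "!" runs a shell command in Colab/Jupyter)
!python -m spacy download es_core_news_md

# Optional: importing the package confirms that the download installed correctly
import es_core_news_md

# Load the pipeline; calling SPANISH_NER_MODEL on a string returns the analyzed text
SPANISH_NER_MODEL = spacy.load('es_core_news_md')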

2022-11-19 22:37:26.084438: E tensorflow/stream_executor/cuda/cuda_driver.cc:271] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting es-core-news-md==3.4.0
  Downloading https://github.com/explosion/spacy-models/releases/download/es_core_news_md-3.4.0/es_core_news_md-3.4.0-py3-none-any.whl (42.3 MB)
     |████████████████████████████████| 42.3 MB 1.3 MB/s
Installing collected packages: es-core-news-md
Successfully installed es-core-news-md-3.4.0
✔ Download and installation successful
You can now load the package via spacy.load('es_core_news_md')
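If you experiment with other Spanish models, it can help to check which ones are already installed before loading one. The sketch below is an illustration rather than part of our notebook. It relies on the fact that spaCy installs each model as a Python package with the same name, and it uses only the standard library. The helper names `is_installed` and `first_installed`, and the order of the candidate list, are assumptions.

```python
import importlib.util

# Spanish pipelines published by spaCy, listed here from largest to smallest.
CANDIDATE_MODELS = ["es_core_news_lg", "es_core_news_md", "es_core_news_sm"]


def is_installed(package_name):
    """Return True if a top-level package with this name can be imported."""
    return importlib.util.find_spec(package_name) is not None


def first_installed(candidates):
    """Return the first installed package name from candidates, or None."""
    for name in candidates:
        if is_installed(name):
            return name
    return None


model_name = first_installed(CANDIDATE_MODELS)
if model_name is None:
    print("No Spanish model found; run the download cell first.")
else:
    # In the notebook, this name could then be passed to spacy.load(model_name).
    print("Model available:", model_name)
```

Larger models take longer to download and load, so the preferred order is a judgment call that depends on your project.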


## Input Corrido

Now we need to choose our text for NER analysis. There are a number of ways to input texts, but here we've made it easy. Just copy and paste the corrido you want to analyze between the two sets of triple quotes ('''). This assigns the corrido to a string (a sequence of characters that Python can analyze), which we've named "corrido" below.

In [2]:
corrido = '''
CORRIDO DE MACARIO ROMERO 

¡Válgame San Juan, qué veo, 
cuánto yaqui de guarache 
y cuánto maldito apache 
con sus flechas de trofeo ! 

¡Abran paso que ahí voy yo,
 ni a los yaquis tengo miedo, 
yo soy Macario Romero 
que al mismo diablo corrió! 

Voy a cantar, mis amigos, 
con cariño verdadero, 
para recordar del hombre 
que fué Macario Romero. -

Era amigo de los hombres, 
los quería de corazón, 
por un amor lo mataron, 
lo mataron a traición. 

Dijo Macario Romero: 
oiga, mi general Plata, 
concédame una licencia, 
para ir a ver a mi chata.

El general Plata dijo: 
Macario ¿qué vas a hacer? 
Te van a quitar la vida 
por una ingrata mujer.

Dijo Macario Romero, 
dando vuelta a una ladera: 
al cabo ¿qué me han de hacer,
 si es pura sarracuatera ? 

Le dijo el general Plata 
sin mi licencia no vas; 
mas si llevas tu capricho, 
en tu salud lo hallarás. 

Dijo Macario Romero 
al salir de la garita : 
yo voy a ver a mi chata, 
a mí nadie me la quita. 

Dijo Jesusita Llamas: 
papá, ahí viene Macario, 
desde a leguas lo conozco
en su caballo melado. 

Don Vicente Llamas dijo: 
¡Jesús! ¿qué plan le pondremos?
 Vamos haciéndole un baile 
y así ya lo mataremos. 

Llega Macario Romero,
lo convidan a bailar, 
y cuando está desarmado, 
le comienzan a tirar:

Dijo Macario Romero : 
acábame de matar, 
que al cabo mi hermano Pedro 
es el que me ha de vengar. 

¡Cobardes! así son buenos, 
me asesinan a traición, 
por viles y montoneros, 
allá lo verán con Dios. 

Sepan que muero en mi ley, 
como se mueren los hombres, 
viles, traidores, collones, 
solos los quisiera ver. 

¡Adiós, chata de mi vida; 
adiós, mi bello lucero, 
adiós, mi prenda querida, 
Jesús, Jesús, que me muero! 

Y diciendo esto expiró, 
el valiente de Macario, 
que en garras de un sanguinario, 
por su desgracia cayó. 

Jesusita Llamas dijo: 
ahora sí quedamos bien, 
ya mataron a Macario, 
mátenme ahora a mí también. 

¡Bandidos, sigan conmigo, 
morirme, morirme quiero! 
para qué quiero la vida 
sin mi Macario Romero!

Brazo a brazo, frente a frente 
debían haberlo agarrado, 
y no traicioneramente, 
como lo han asesinado. 

Don Jesús Aceves dijo: 
vamos levantando una acta, 
que matamos a un bandido, 
de los del general Plata. 

Ya nos quitamos del frente 
a ese famoso escorpión, 
que la echaba de valiente, 
cuando los cogía a traición. 

Ya con esta me despido, 
porque llorar ya no quiero 
la muerte de ese valiente, 
de ese valiente Romero.
'''
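Pasting works well for a single corrido. As one of the other input methods mentioned above, a corrido saved as a plain-text file could be read into a string instead. This is a minimal sketch under stated assumptions: the function name `load_corrido` and the file name are hypothetical, the file is assumed to be UTF-8 encoded, and a temporary file with two lines from the corrido is created only so the example runs on its own.

```python
from pathlib import Path
import tempfile


def load_corrido(path):
    """Read a corrido from a UTF-8 text file and return it as a string."""
    return Path(path).read_text(encoding="utf-8")


# Demonstration with a temporary file so the sketch is self-contained.
with tempfile.TemporaryDirectory() as folder:
    sample_path = Path(folder) / "macario_romero.txt"  # hypothetical file name
    sample_path.write_text(
        "yo soy Macario Romero\nque al mismo diablo corrió!\n",
        encoding="utf-8",
    )
    corrido_from_file = load_corrido(sample_path)

print(corrido_from_file)
```

Specifying the encoding explicitly matters for Spanish text, because accented characters and inverted punctuation can be misread if the platform default is not UTF-8.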

## Apply Model to Corrido

Now we need to apply our model to our corrido and store the results. The code below does just that: it passes the corrido to the model as an argument and assigns the result (a spaCy `Doc` object) to "corrido_entities".

In [3]:
corrido_entities = SPANISH_NER_MODEL(corrido)

Then if we access the "ents" attribute of our new object, we get the named entities the model found. Simple as that!

In [5]:
corrido_entities.ents

(CORRIDO,
 Macario Romero,
 Macario Romero,
 Dijo Macario Romero,
 Plata,
 Macario,
 Dijo Macario Romero,
 Plata,
 Dijo Macario Romero,
 Jesusita Llamas,
 Macario,
 Don Vicente Llamas,
 Llega Macario Romero,
 Macario Romero,
 Pedro,
 Dios. 
 
 Sepan,
 Jesús,
 Macario,
 Jesusita Llamas,
 Macario,
 ¡Bandidos,
 Macario Romero!
 
 Brazo,
 Don Jesús Aceves,
 general Plata.,
 Romero.)
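Several entities in the output above carry stray punctuation or span a line break (for example, "¡Bandidos" and "Dios." followed by "Sepan"). A small clean-up step can make the list easier to check by hand. The sketch below is an assumption about how that might be done, not something our notebook ran: the hypothetical helper `clean_entity_text` collapses internal whitespace and trims punctuation from both ends. It does not split entities that the model merged.

```python
import string

# Characters to trim from the edges of an entity: ASCII punctuation,
# the inverted marks used in Spanish, and whitespace.
EDGE_CHARACTERS = string.punctuation + "¡¿" + string.whitespace


def clean_entity_text(text):
    """Collapse internal whitespace and trim punctuation from both ends."""
    collapsed = " ".join(text.split())
    return collapsed.strip(EDGE_CHARACTERS)


# Entity strings modeled on the output above.
raw_entities = ["Dios. \n\nSepan", "¡Bandidos", "general Plata.", "Macario Romero"]
for raw in raw_entities:
    print(repr(raw), "->", repr(clean_entity_text(raw)))
```

In the notebook, the raw strings could come from `[entity.text for entity in corrido_entities.ents]`, since each spaCy entity exposes its text through the `text` attribute.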

## spaCy NER Categories

If we want to see which NER category each named entity has been assigned, we can use a for loop that prints each entity's category next to its text. Keep in mind that these categories come from the pretrained model and are not always correct. Their labels are also abbreviations that may need clarification. For a full list of these categories with brief explanations, check out [this part](https://www.kaggle.com/code/curiousprogrammer/entity-extraction-and-classification-using-spacy?scriptVersionId=11364473&cellId=9) of another notebook on spaCy.

In [7]:
for named_entity in corrido_entities.ents:
    print(named_entity.label_, named_entity)


ORG CORRIDO
PER Macario Romero
PER Macario Romero
PER Dijo Macario Romero
PER Plata
PER Macario
PER Dijo Macario Romero
PER Plata
PER Dijo Macario Romero
PER Jesusita Llamas
LOC Macario
PER Don Vicente Llamas
PER Llega Macario Romero
PER Macario Romero
PER Pedro
LOC Dios. 

Sepan
PER Jesús
PER Macario
PER Jesusita Llamas
PER Macario
LOC ¡Bandidos
PER Macario Romero!

Brazo
PER Don Jesús Aceves
ORG general Plata.
MISC Romero.
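Since our project focused on places, a natural next step is to group the results by category and pull out the LOC candidates for manual review. The following is a minimal sketch of that idea. It works on plain (label, text) pairs, a few of which are copied from the output above; the function name `group_by_label` is an assumption. In the notebook, the pairs could be built with `[(entity.label_, entity.text) for entity in corrido_entities.ents]`.

```python
from collections import Counter, defaultdict

# A few (label, text) pairs copied from the output above.
labeled_entities = [
    ("ORG", "CORRIDO"),
    ("PER", "Macario Romero"),
    ("PER", "Macario Romero"),
    ("PER", "Plata"),
    ("LOC", "Macario"),
    ("LOC", "¡Bandidos"),
    ("ORG", "general Plata."),
    ("MISC", "Romero."),
]


def group_by_label(pairs):
    """Count how often each entity text appears under each label."""
    grouped = defaultdict(Counter)
    for label, text in pairs:
        grouped[label][text] += 1
    return grouped


grouped = group_by_label(labeled_entities)
for label, counts in grouped.items():
    print(label, dict(counts))

# Unique strings the model tagged as locations, ready for manual checking.
location_candidates = sorted(grouped["LOC"])
print(location_candidates)
```

For this corrido, both LOC candidates are misclassifications, which illustrates why the list still needs to be verified by a reader.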



## Conclusion

As you can see, this final part of the process is not perfect. The model labeled one mention of "Macario" as a location and "CORRIDO" as an organization, and a few entities run across line breaks (for example, "Dios." followed by "Sepan"). The solution (project-specific training data) is beyond the scope of this semester. Still, this NER method saved us a fair amount of time because double-checking these named entities from a list was faster than trying to identify them through close reading. We were therefore grateful to have discovered these methods.

We also benefited from the exposure to Python and spaCy. As has been true for so much of this semester, our group found many things we hope to build on in the future.