# Named Entity Recognition Pipeline

The pipeline takes the URL of an RSS feed, fetches the title and description of each article in the feed, detects named entities with a model (in this notebook, a simple capitalization heuristic), and displays them sorted by frequency of occurrence.

### Versions
Tested with:
* Almond 0.6.0
* Ammonite 1.6.7
* Scala library version **2.11.12** -- Copyright 2002-2017, LAMP/EPFL
* Java 1.8.0_282

For more information, go to Help -> About Scala Kernel.

## 1. Get the text

### 1.1 Import libraries

In [2]:
// Equivalent of adding dependencies to maven or sbt files
// For example, to add "org.scalaj" %% "scalaj-http" % "2.4.2" 
import $ivy.`org.scalaj::scalaj-http:2.4.2`
// "org.scala-lang.modules" %% "scala-xml" % "1.3.0"
import $ivy.`org.scala-lang.modules::scala-xml:1.3.0`


In [3]:
import scalaj.http.{Http, HttpResponse}
import scala.xml.XML


### 1.2 Get the text from the RSS feed

We make an HTTP request, which returns an `HttpResponse` instance. The `body` attribute of the `HttpResponse` holds the feed text in XML format. The XML is then parsed to extract the `title` and `description` fields of each item.

In [4]:
// Tutorial https://alvinalexander.com/source-code/scala-how-to-http-download-xml-rss-feed-timeout/
// get the xml content using scalaj-http
val url = "https://www.chicagotribune.com/arcio/rss/category/sports/?query=display_date:[now-2d+TO+now]&sort=display_date:desc"
val response: HttpResponse[String] = Http(url)
  .timeout(connTimeoutMs = 2000, readTimeoutMs = 5000)
  .asString
val xmlString = response.body
// convert the `String` to a `scala.xml.Elem`
val xml = XML.loadString(xmlString)
// Extract text from title and description
val rssText = (xml \\ "item").map { item =>
    ((item \ "title").text ++ " " ++ (item \ "description").text)
}

url: String = "https://www.chicagotribune.com/arcio/rss/category/sports/?query=display_date:[now-2d+TO+now]&sort=display_date:desc"
response: HttpResponse[String] = HttpResponse(
  """<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:media="http://search.yahoo.com/mrss/" xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"><channel><title>Chicago Tribune</title><link>https://www.chicagotribune.com</link><language>en-US</language><copyright>© 2021 Chicago Tribune</copyright><atom:link href="https://www.chicagotribune.com/arcio/rss/category/sports/?query=display_date:%5Bnow-2d+TO+now%5D&amp;sort=display_date:desc" rel="self" type="application/rss+xml"/><description>Chicago Tribune News Feed</description><lastBuildDate>Sat, 22 May 2021 22:46:03 +0000</lastBui
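
To see the extraction step in isolation, the following is a minimal sketch that parses a small inline feed instead of a live response. It assumes scala-xml is on the classpath, as in section 1.1. The feed content and the names `sampleFeed`, `sampleXml`, and `sampleText` are hypothetical and only serve the illustration.

```scala
import scala.xml.XML

// Hypothetical two-item feed, inlined so the example needs no network access.
val sampleFeed =
  """<rss version="2.0"><channel>
    |<title>Sample feed</title>
    |<item><title>First headline</title><description>First summary.</description></item>
    |<item><title>Second headline</title><description>Second summary.</description></item>
    |</channel></rss>""".stripMargin

val sampleXml = XML.loadString(sampleFeed)

// `\\` searches all descendants, so it finds every <item> under <channel>.
// `\` looks only at direct children, so the channel-level <title> is not picked up.
val sampleText = (sampleXml \\ "item").map { item =>
  (item \ "title").text + " " + (item \ "description").text
}
// sampleText contains "First headline First summary." and "Second headline Second summary."
```

Because `\` is applied to each `item` node, only that item's own `title` and `description` are concatenated.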

## 2. Detect named entities

### 2.1 Create the model

The **model** is just the function `getNEs`, which takes a list of texts.
For each text, it splits the text into words on spaces and treats a word as a named entity if it starts with an uppercase letter, is longer than one character, and is not a stopword.

This code lists the punctuation symbols and some common English words that will be removed from the text.
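
One possible way to turn a string of punctuation symbols into a removal pattern is to quote every symbol with `java.util.regex.Pattern.quote`, which avoids escaping each character by hand. This is only a sketch: the names `symbols`, `removalRegex`, and `cleaned` are hypothetical, and the typographic apostrophe `\u2019` is an addition assumed here, not part of the notebook's own list.

```scala
import java.util.regex.Pattern

// Punctuation to strip. The trailing \u2019 (typographic apostrophe) is an
// assumption added for this sketch; the notebook's list does not include it.
val symbols = ".,()!?;:'`´\n\u2019"

// Quote each symbol so that regex metacharacters such as '.' or '(' are
// matched literally, then join the alternatives with '|'.
val removalRegex = symbols.map(c => Pattern.quote(c.toString)).mkString("|")

// Remove every listed symbol from a sample sentence.
val cleaned = "Leafs\u2019 win, again!".replaceAll(removalRegex, "")
// cleaned == "Leafs win again"
```

Without `\u2019` in the symbol list, a token such as `Leafs’` would keep its apostrophe and be counted separately from `Leafs`.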

In [5]:
val STOPWORDS = Seq (
    "i", "me", "my", "myself", "we", "our", "ours", "ourselves", "you",
    "yours", "yourself", "yourselves", "he", "him", "his", "himself", "she",
    "her", "hers", "herself", "it", "its", "itself", "they", "them", "your",
    "their", "theirs", "themselves", "what", "which", "who", "whom",
    "this", "that", "these", "those", "am", "is", "are", "was", "were",
    "be", "been", "being", "have", "has", "had", "having", "do", "does",
    "did", "doing", "a", "an", "the", "and", "but", "if", "or",
    "because", "as", "until", "while", "of", "at", "by", "for", "with",
    "about", "against", "between", "into", "through", "during", "before",
    "after", "above", "below", "to", "from", "up", "down", "in", "out",
    "off", "over", "under", "again", "further", "then", "once", "here",
    "there", "when", "where", "why", "how", "all", "any", "both", "each",
    "few", "more", "most", "other", "some", "such", "no", "nor", "not",
    "only", "own", "same", "so", "than", "too", "very", "s", "t", "can",
    "will", "just", "don", "should", "now", "on",
    // Contractions written without the apostrophe
    "im", "ive", "id", "youre", "youd", "youve",
    "hes", "hed", "shes", "shed", "itd", "were", "wed", "weve",
    "theyre", "theyd", "theyve",
    "shouldnt", "couldnt", "mustnt", "cant", "wont",
    // Common uppercase words
    "hi", "hello"
)
val punctuationSymbols = ".,()!?;:'`´\n"
val punctuationRegex = "\\" + punctuationSymbols.split("").mkString("|\\")

STOPWORDS: Seq[String] = List(
  "i",
  "me",
  "my",
  "myself",
  "we",
  "our",
  "ours",
  "ourselves",
  "you",
  "yours",
  "yourself",
  "yourselves",
  "he",
  "him",
  "his",
  "himself",
  "she",
  "her",
  "hers",
  "herself",
  "it",
  "its",
  "itself",
  "they",
  "them",
  "your",
  "their",
  "theirs",
  "themselves",
  "what",
  "which",
  "who",
  "whom",
  "this",
  "that",
  "these",
  "those",
  "am",
...
punctuationSymbols: String = """.,()!?;:'`´
"""
punctuationRegex: String = """\.|\,|\(|\)|\!|\?

In [6]:
class NERModel(STOPWORDS: Seq[String], punctuationSymbols: String, punctuationRegex: String) {
    // Extract named entities: capitalized words longer than one character that are not stopwords
    def getNEsSingle(text: String): Seq[String] = {
      text.replaceAll(punctuationRegex, "").split(" ")
        .filter { word:String => word.length > 1 &&
                  Character.isUpperCase(word.charAt(0)) &&
                  !STOPWORDS.contains(word.toLowerCase) }.toSeq
    }
    def getNEs(textList: Seq[String]): Seq[Seq[String]] = {
        textList.map(getNEsSingle)
    }
    
    // Counts Named Entities
    def countNEs(result: Seq[Seq[String]]): Map[String, Int] = {
        result.flatten.foldLeft(Map.empty[String, Int]) {
         (count, word) => count + (word -> (count.getOrElse(word, 0) + 1)) }
    }
    
    // Sorts Named Entities
    def sortNEs(counts: Map[String, Int]): List[(String, Int)] = {
        counts.toList.sortBy(_._2)(Ordering[Int].reverse)
    } 
}

defined class NERModel
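
To make the behavior of the heuristic concrete, here is a minimal stand-alone sketch of the same idea applied to one sentence. It is not the notebook's class: `capitalizedWords`, `stop`, and `sample` are hypothetical names, and the stopword set and punctuation pattern are reduced for brevity.

```scala
// Reduced, hypothetical stopword set used only in this sketch.
val stop = Set("the", "in", "on")

// Keep words that are longer than one character, start with an uppercase
// letter, and are not stopwords (compared in lowercase).
def capitalizedWords(text: String): Seq[String] =
  text.replaceAll("[.,!?]", "").split(" ").toSeq
    .filter(w => w.length > 1 && w.head.isUpper && !stop.contains(w.toLowerCase))

val sample = "The White Sox beat the Yankees on Saturday in New York."
val found = capitalizedWords(sample)
// found contains: White, Sox, Yankees, Saturday, New, York
```

Two limitations show up here: multi-word names such as "New York" come back as separate tokens, and capitalized non-entities such as "Saturday" pass the filter.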

In [7]:
val model = new NERModel(STOPWORDS, punctuationSymbols, punctuationRegex)

model: NERModel = ammonite.$sess.cmd5$Helper$NERModel@27218727

### 2.2 Apply the "model" to the data

In [8]:
val result = model.getNEs(rssText)

result: Seq[Seq[String]] = List(
  ArrayBuffer(
    "Chicago",
    "White",
    "Sox",
    "New",
    "York",
    "Yankees",
    "Chicago",
    "White",
    "Sox",
    "Dylan",
    "Cease",
    "New",
    "York",
    "Yankees",
    "Saturday",
    "Yankee",
    "Stadium",
    "Sox"
  ),
  ArrayBuffer(
    "John",
    "Tavares",
    "Toronto",
    "Maple",
    "Leafs\u2019",
    "Montreal",
    "Canadiens",
    "Toronto",
    "John",
    "Tavares",
    "Montreal",
    "Canadiens",
    "Game"
  ),
  ArrayBuffer("NBA", "NBA", "Saturday"),
  ArrayBuffer(
    "Chi

## 3. Count and sort the entities

Concatenate all the lists, count each named entity, and then sort by frequency.
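
As a small stand-alone illustration of this step, the sketch below counts with `groupBy` and sorts by descending frequency. The input `entities` and the names `toyCounts` and `toySorted` are hypothetical, and the alphabetical tie-break is an addition assumed here so that the order is deterministic.

```scala
// Hypothetical per-article entity lists.
val entities = Seq(Seq("Chicago", "Sox"), Seq("Sox", "NBA"), Seq("Sox"))

// Flatten into one list, group equal words together, and count each group.
val toyCounts: Map[String, Int] =
  entities.flatten.groupBy(identity).map { case (word, occurrences) => word -> occurrences.size }

// Sort by descending count; ties are broken alphabetically (an assumption of this sketch).
val toySorted: List[(String, Int)] =
  toyCounts.toList.sortBy { case (word, n) => (-n, word) }
// toySorted == List(("Sox", 3), ("Chicago", 1), ("NBA", 1))
```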

In [23]:
val counts = model.countNEs(result)
val sortedNEs = model.sortNEs(counts)

counts: Map[String, Int] = Map(
  "Much" -> 1,
  "Gleyber" -> 1,
  "Mercedes" -> 4,
  "Parker" -> 2,
  "John" -> 2,
  "Yerm\u00edn" -> 4,
  "Foster\u2019s" -> 1,
  "Soccer" -> 1,
  "Canadiens" -> 2,
  "Column" -> 1,
  "Yankee" -> 2,
  "Sky" -> 4,
  "Nationals" -> 1,
  "Washington" -> 1,
  "Mariners" -> 2,
  "Rod\u00f3n" -> 1,
  "Skys" -> 1,
  "Confused" -> 1,
  "Louis" -> 3,
  "Cease" -> 1,
  "President" -> 1,
  "Foster" -> 1,
  "Olympic" -> 1,
  "Sox\u2019s" -> 2,
