# Named Entity Recognition Pipeline

The pipeline takes the URL of an RSS feed, fetches the title and description of each article in the feed, detects the named entities with a pre-trained model, and displays them sorted by frequency of occurrence.

### Versions
Tested with:
* Almond 0.6.0
* Ammonite 1.6.7
* Scala library version **2.11.12** -- Copyright 2002-2017, LAMP/EPFL
* Java 1.8.0_282

For more information, go to Help -> About Scala Kernel.

## 1. Get the text

### 1.1 Import libraries

In [1]:
// Equivalent of adding dependencies to maven or sbt files
// For example, to add "org.scalaj" %% "scalaj-http" % "2.4.2" 
import $ivy.`org.scalaj::scalaj-http:2.4.2`
// "org.scala-lang.modules" %% "scala-xml" % "1.3.0"
import $ivy.`org.scala-lang.modules::scala-xml:1.3.0`
import $ivy.`org.json4s::json4s-jackson:3.4.0`


In [2]:
import scalaj.http.{Http, HttpResponse}
import scala.xml.XML
import scala.collection.mutable.MutableList
//import org.json4s.JsonDSL._
import org.json4s._
import org.json4s.jackson.JsonMethods._
implicit val formats = DefaultFormats

formats: DefaultFormats.type = org.json4s.DefaultFormats$@2bdce90

### 1.2 Get the text from the RSS feed

We issue an HTTP request, which returns an `HttpResponse` instance. Its `body` attribute holds the feed text in XML format. The XML is then parsed to extract the `title` and `description` fields.
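As an offline illustration of the extraction step, the sketch below parses a small inline RSS document instead of a live feed. The sample XML and the names `sample` and `texts` are made up for this example, and the code assumes the scala-xml module is on the classpath, as loaded in section 1.1.

```scala
import scala.xml.XML

// Hypothetical two-item feed, used only for illustration
val sample =
  """<rss><channel>
    |  <item><title>First title</title><description>First description</description></item>
    |  <item><title>Second title</title><description>Second description</description></item>
    |</channel></rss>""".stripMargin

// Find every <item> at any depth and join its title and description with a space
val texts: Seq[String] = (XML.loadString(sample) \\ "item").map { item =>
  (item \ "title").text + " " + (item \ "description").text
}
```

Here `\\` searches all descendants, while `\` only looks at direct children, which is why the items are found even though they are nested inside `channel`.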

In [3]:
abstract class Parser {
    val emptyText: String
    
    // Get the content
    def openURL(url: String): String = {
        try {
            val response = Http(url)
                .timeout(connTimeoutMs = 2000, readTimeoutMs = 5000)
                .asString
            response.body
        } catch {
            // On any failure, report it and fall back to an empty document
            case e: Throwable =>
                println("Error in HTTP response")
                emptyText
        }
    }
    
    // Extract text 
    def processText(rawText: String): Seq[String] 
    
    // Read the text
    def readText(url: String): Seq[String] = {
        processText(openURL(url))
    }
}

class RSSParser() extends Parser {
    val emptyText = "<rss></rss>"
    
    def processText(rawText: String): Seq[String] = {
        // convert the `String` to a `scala.xml.Elem`
        val xml = XML.loadString(rawText)
        // Extract text from title and description
        (xml \\ "item").map { item =>
            ((item \ "title").text ++ " " ++ (item \ "description").text)
        }
    }
}

class RedditParser() extends Parser {
    val emptyText = "{}"
    
    def processText(rawText: String): Seq[String] = {
        // Matches http(s), ftp and file URLs so they can be stripped from the text
        val urlPattern = "(https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]"
        // Parse the JSON once and reuse the list of posts for both fields
        val posts = parse(rawText) \ "data" \ "children" \ "data"
        val titles = (posts \ "title").extract[List[String]]
        val selftexts = (posts \ "selftext").extract[List[String]]
        val result = titles.zip(selftexts).map { case (a, b) => a ++ " " ++ b }
        result.map(text => text.replaceAll(urlPattern, " "))
    }
}



defined [32mclass[39m [36mParser[39m
defined [32mclass[39m [36mRSSParser[39m
defined [32mclass[39m [36mRedditParser[39m

In [4]:
class FeedService() {
    // The list itself is mutable, so the reference can be a val
    val urls: MutableList[(String, Parser)] = new MutableList[(String, Parser)]()
    
    // Register one URL per parameter, or the template itself when there are no parameters
    def subscribe(urlTemplate: String, params: Seq[String], parser: Parser) = {
        if (params.isEmpty) {
            urls ++= Seq((urlTemplate, parser))
        } else {
            val result = params.map(x => urlTemplate.format(x)).map(y => (y, parser))
            urls ++= result
        }
    }
    
    def getText(): Seq[String] = {
        urls.flatMap{case (a, b) => b.readText(a)}
    }
}

defined [32mclass[39m [36mFeedService[39m

In [5]:
val rss_parser = new RSSParser
val url_rss = "https://www.chicagotribune.com/arcio/rss/category/sports/?query=display_date:[now-2d+TO+now]&sort=display_date:desc"
val rssText = rss_parser.readText(url_rss)

rss_parser: RSSParser = ammonite.$sess.cmd2$Helper$RSSParser@6ea69848
url_rss: String = "https://www.chicagotribune.com/arcio/rss/category/sports/?query=display_date:[now-2d+TO+now]&sort=display_date:desc"
rssText: Seq[String] = List(
  "As United Center, other sites wind down, Chicago shifting vaccination focus to local events As the city of Chicago prepares to wind down its biggest mass vaccination site, officials said Tuesday they are focused on a hyperlocal that includes dozens of pop-up events, vaccine incentives and home visits.",
  "Chicago Bears are optimistic about their cornerback competition, but they still are considering adding a veteran free-agent option such as Bashaud Breeland The Chicago Bears are optimistic about the competition they have at cornerback with Desmond Trufant and Kindle Vildor. But the team still is window shopping and considering a veteran option suc

In [7]:
val reddit_parser = new RedditParser
val url_reddit = "https://www.reddit.com/r/Android/hot/.json?count=10"
val redditText = reddit_parser.readText(url_reddit)

reddit_parser: RedditParser = ammonite.$sess.cmd2$Helper$RedditParser@9c36188
url_reddit: String = "https://www.reddit.com/r/Android/hot/.json?count=10"
redditText: Seq[String] = List(
  """Moronic Monday (May 24 2021) - Your weekly questions thread! Note 1. Join us at /r/MoronicMondayAndroid, a sub serving as a repository for our retired weekly threads. Just pick any thread and Ctrl-F your way to wisdom! 

Note 2. Join our IRC, and Telegram chat-rooms! [Please see our wiki for instructions.]( )""",
  "Google Assistant will soon be able to power off your Android - 9to5Google ",
  "Arm Announces Mobile Armv9 CPU Microarchitectures: Cortex-X2, Cortex-A710 &amp; Cortex-A510 ",
  "Anker teases Nebula Android TV dongle for 2021 release - 9to5Google ",
  "VideoCardz: \"ARM announces Mali-G710, G610, G510 and G310 graphics processing units\" ",
  "Ga

In [8]:
val servicio = new FeedService
servicio.subscribe("https://www.chicagotribune.com/arcio/rss/category/%s/?query=display_date:[now-2d+TO+now]&sort=display_date:desc", List("sports", "business"), rss_parser)
servicio.subscribe("https://www.reddit.com/r/Android/hot/.json?count=10", List(), reddit_parser)

servicio: FeedService = ammonite.$sess.cmd3$Helper$FeedService@7d40e3c
res7_1: MutableList[(String, Parser)] = MutableList(
  (
    "https://www.chicagotribune.com/arcio/rss/category/sports/?query=display_date:[now-2d+TO+now]&sort=display_date:desc",
    ammonite.$sess.cmd2$Helper$RSSParser@6ea69848
  ),
  (
    "https://www.chicagotribune.com/arcio/rss/category/business/?query=display_date:[now-2d+TO+now]&sort=display_date:desc",
    ammonite.$sess.cmd2$Helper$RSSParser@6ea69848
  ),
  (
    "https://www.reddit.com/r/Android/hot/.json?count=10",
    ammonite.$sess.cmd2$Helper$RedditParser@9c36188
  )
)
res7_2: MutableList[(String, Parser)] = MutableList(
  (
    "https://www.chicagotribune.com/arcio/rss/category/sports/?query=display_date:[now-2d+TO+now]&sort=display_date:desc",
    ammonite.$sess.cmd2$Helper$RSSParser@6ea69848

In [21]:
servicio.getText()

res20: Seq[String] = MutableList(
  "Illinois is set to play a Friday game against Maryland this Big Ten football season Illinois will play on a Friday this season, hosting Maryland on Sept. 17, the Big Ten announced.",
  "Chicago Bull Zach LaVine sells Lakeview mansion for $3M Chicago Bulls guard Zach LaVine on May 21 sold his five-bedroom mansion in the city's Lakeview neighborhood for $3 million.",
  "Aaron Rodgers doesn\u2019t attend the 1st day of Green Bay Packers OTAs \u2014 and remains noncommittal about his future in ESPN interview Green Bay Packers quarterback Aaron Rodgers wasn\u2019t present for the first day of organized team activities Monday, and his future with the team remains uncertain.",
  "Chicago Blackhawks Q&A: What would\u2019ve happened if Corey Crawford stayed? Why are they so set on their defensive scheme? And is there any hope in next season\u2019s Central Division? The Chicago Blackha

## 2. Detect the named entities

### 2.1 Create the model

The **model** is just the `getNEs` function, which receives a list of texts.
For each text, it splits the text into words on spaces and treats a word as a named entity if it starts with an uppercase letter.
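A minimal sketch of that heuristic on a single sentence is shown below. The names `stopwords` and `candidates`, the two-word stop list, and the sample sentence are assumptions made for this example. The character class is a compact stand-in for the punctuation regex built in the `NERModel` class further down, not the notebook's own expression.

```scala
// Tiny hypothetical stop list, only for this example
val stopwords = Set("the", "will")

// Strip punctuation, split on spaces, and keep capitalized words
// that are longer than one character and are not stop words
def candidates(text: String): Seq[String] =
  text.replaceAll("[.,()!?;:'`´\n]", "").split(" ").toSeq
    .filter(w => w.length > 1 && w.head.isUpper && !stopwords(w.toLowerCase))

val found = candidates("The Chicago Bears will play on Friday.")
```

For this sentence `found` contains `Chicago`, `Bears` and `Friday`: `The` is capitalized but is dropped as a stop word, and the lowercase words never pass the uppercase check.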

The code below lists some common English words (stop words) that are removed from the text. The punctuation symbols to strip are defined in the `NERModel` class.

In [23]:
val STOPWORDS = Seq (
    "i", "me", "my", "myself", "we", "our", "ours", "ourselves", "you",
    "yours", "yourself", "yourselves", "he", "him", "his", "himself", "she",
    "her", "hers", "herself", "it", "its", "itself", "they", "them", "your",
    "their", "theirs", "themselves", "what", "which", "who", "whom",
    "this", "that", "these", "those", "am", "is", "are", "was", "were",
    "be", "been", "being", "have", "has", "had", "having", "do", "does",
    "did", "doing", "a", "an", "the", "and", "but", "if", "or",
    "because", "as", "until", "while", "of", "at", "by", "for", "with",
    "about", "against", "between", "into", "through", "during", "before",
    "after", "above", "below", "to", "from", "up", "down", "in", "out",
    "off", "over", "under", "again", "further", "then", "once", "here",
    "there", "when", "where", "why", "how", "all", "any", "both", "each",
    "few", "more", "most", "other", "some", "such", "no", "nor", "not",
    "only", "own", "same", "so", "than", "too", "very", "s", "t", "can",
    "will", "just", "don", "should", "now", "on",
    // Contractions without the apostrophe (lowercase, since words are lowercased before the lookup)
    "im", "ive", "id", "youre", "youd", "youve",
    "hes", "hed", "shes", "shed", "itd", "were", "wed", "weve",
    "theyre", "theyd", "theyve",
    "shouldnt", "couldnt", "mustnt", "cant", "wont",
    // Common uppercase words
    "hi", "hello"
)

STOPWORDS: Seq[String] = List(
  "i",
  "me",
  "my",
  "myself",
  "we",
  "our",
  "ours",
  "ourselves",
  "you",
  "yours",
  "yourself",
  "yourselves",
  "he",
  "him",
  "his",
  "himself",
  "she",
  "her",
  "hers",
  "herself",
  "it",
  "its",
  "itself",
  "they",
  "them",
  "your",
  "their",
  "theirs",
  "themselves",
  "what",
  "which",
  "who",
  "whom",
  "this",
  "that",
  "these",
  "those",
  "am",
...

In [24]:
class NERModel(STOPWORDS: Seq[String]) {
    
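    // Characters stripped from the text before it is split into words
    val punctuationSymbols = ".,()!?;:'`´\n"
    // Alternation of the backslash-escaped symbols: \.|\,|\(|...
    val punctuationRegex = "\\" + punctuationSymbols.split("").mkString("|\\")
    
    // Extract named entities from a single text
    def getNEsSingle(text: String): Seq[String] = {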
      text.replaceAll(punctuationRegex, "").split(" ")
        .filter { word:String => word.length > 1 &&
                  Character.isUpperCase(word.charAt(0)) &&
                  !STOPWORDS.contains(word.toLowerCase) }.toSeq
    }
    def getNEs(textList: Seq[String]): Seq[Seq[String]] = {
        textList.map(getNEsSingle)
    }
    
    // Counts Named Entities
    def countNEs(result: Seq[Seq[String]]): Map[String, Int] = {
        result.flatten.foldLeft(Map.empty[String, Int]) {
         (count, word) => count + (word -> (count.getOrElse(word, 0) + 1)) }
    }
    
    // Sorts Named Entities
    def sortNEs(counts: Map[String, Int]): List[(String, Int)] = {
        counts.toList.sortBy(_._2)(Ordering[Int].reverse)
    } 
}

defined [32mclass[39m [36mNERModel[39m

In [25]:
val model = new NERModel(STOPWORDS)

model: NERModel = ammonite.$sess.cmd23$Helper$NERModel@4f88f068

### 2.2 Apply the "model" to the data

In [26]:
val result = model.getNEs(redditText)

result: Seq[Seq[String]] = List(
  ArrayBuffer(
    "Moronic",
    "Monday",
    "May",
    "Note",
    "Join",
    "Ctrl-F",
    "Note",
    "Join",
    "IRC",
    "Telegram"
  ),
  ArrayBuffer("Google", "Assistant", "Android"),
  ArrayBuffer(
    "Arm",
    "Announces",
    "Mobile",
    "Armv9",
    "CPU",
    "Microarchitectures",
    "Cortex-X2",
    "Cortex-A710",
    "Cortex-A510"
  ),
  ArrayBuffer("Anker", "Nebula", "Android", "TV"),
  ArrayBuffer(
    "Galaxy",
    "Upcycling",
    "Samsung",
    "Ruined",
    "Best",
    "Idea",
    "Years"
  ),

## 3. Count and sort the entities

Concatenate all the lists, count each named entity, and then sort by frequency.
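The three steps can be sketched on a small hand-made input. This version counts with `groupBy` rather than the `foldLeft` used in `NERModel`, and it breaks ties alphabetically. Both choices, along with the sample data and the names `perText`, `tally` and `ranked`, are assumptions of this example rather than the notebook's behavior.

```scala
// Hypothetical per-text entity lists, for illustration only
val perText: Seq[Seq[String]] = Seq(
  Seq("Google", "Android"),
  Seq("Android", "Samsung"),
  Seq("Android")
)

// 1. Concatenate, 2. count occurrences of each entity
val tally: Map[String, Int] =
  perText.flatten.groupBy(identity).map { case (word, ws) => word -> ws.size }

// 3. Sort by descending count, then by name so that ties have a stable order
val ranked: List[(String, Int)] =
  tally.toList.sortBy { case (word, n) => (-n, word) }
```

`ranked` starts with `Android` (3 occurrences), followed by `Google` and `Samsung` with one each. Without the secondary key, entities with equal counts would appear in whatever order the `Map` happens to iterate them.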

In [57]:
val counts = model.countNEs(result)
val sortedNEs = model.sortNEs(counts)

counts: Map[String, Int] = Map(
  "Thing" -> 1,
  "Advertising" -> 1,
  "PLN" -> 5,
  "Please" -> 1,
  "Easy" -> 1,
  "Locked" -> 1,
  "One" -> 8,
  "CEO" -> 1,
  "Tab" -> 3,
  "Weinbach" -> 3,
  "Use" -> 1,
  "Cheaper" -> 1,
  "Take" -> 1,
  "OLED" -> 1,
  "NPUs" -> 1,
  "A11" -> 3,
  "Disagree" -> 1,
  "Upcycling" -> 1,
  "ISP" -> 1,
  "Settings" -> 1,
  "Lite" -> 1,
  "Pichai" -> 1,
  "Users" -> 1,
  "Reddit" -> 1,
  "GPS" -> 1,
  "Google