# Named Entity Recognition Pipeline

The pipeline takes the URL of an RSS feed, fetches the title and description of each article in the feed, detects the named entities (NEs) with a model (in this notebook, a simple capitalization heuristic rather than a trained one), and displays them sorted by frequency.

### Versions
Tested with:
* Almond 0.6.0
* Ammonite 1.6.7
* Scala library version **2.11.12** -- Copyright 2002-2017, LAMP/EPFL
* Java 1.8.0_282

For more information, see Help -> About Scala Kernel.

## 1. Get the text

### 1.1 Import libraries

In [7]:
// Equivalent of adding dependencies to maven or sbt files
// For example, to add "org.scalaj" %% "scalaj-http" % "2.4.2" 
import $ivy.`org.scalaj::scalaj-http:2.4.2`
// "org.scala-lang.modules" %% "scala-xml" % "1.3.0"
import $ivy.`org.scala-lang.modules::scala-xml:1.3.0`
import $ivy.`org.json4s::json4s-jackson:3.4.0`

import $ivy.$
import $ivy.$
import $ivy.$

In [8]:
import scalaj.http.{Http, HttpResponse}
import scala.xml.XML
import scala.collection.mutable.MutableList
//import org.json4s.JsonDSL._
import org.json4s._
import org.json4s.jackson.JsonMethods._
implicit val formats = DefaultFormats

import scalaj.http.{Http, HttpResponse}
import scala.xml.XML
import scala.collection.mutable.MutableList
import org.json4s._
import org.json4s.jackson.JsonMethods._
formats: DefaultFormats.type = org.json4s.DefaultFormats$@6e756a75

### 1.2 Get the text from the RSS feed

We make an HTTP request, which returns an `HttpResponse` instance. The `body` attribute of the `HttpResponse` holds the text of the feed in XML format. The XML is then parsed to extract the `title` and `description` fields of each item.
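
As a small illustration of the extraction step only (no HTTP involved), the sketch below runs the same `\\ "item"` and `\ "title"` selectors on an inline feed. The feed content and the names `sampleFeed` and `itemTexts` are made up for this example, and it assumes the scala-xml dependency from section 1.1 is already loaded.

```scala
import scala.xml.XML

// Made-up two-item feed, used only for illustration.
val sampleFeed =
  """<rss><channel>
    |  <item><title>First title</title><description>First description.</description></item>
    |  <item><title>Second title</title><description>Second description.</description></item>
    |</channel></rss>""".stripMargin

// Hypothetical helper: joins the title and description of every <item>.
def itemTexts(rawXml: String): Seq[String] =
  (XML.loadString(rawXml) \\ "item").map { item =>
    (item \ "title").text + " " + (item \ "description").text
  }

val texts = itemTexts(sampleFeed)
// An empty feed has no <item> elements, so the result is an empty sequence.
val noTexts = itemTexts("<rss></rss>")
```

The second call shows why an empty `<rss></rss>` document is a convenient fallback when a request fails: parsing it succeeds and simply yields no texts.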

In [9]:
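abstract class Parser {
    // Fallback text returned when the request fails
    val emptyText: String

    // Fetch the raw content of the URL (XML or JSON, depending on the feed)
    def openURL(url: String): String = {
        try {
            val response = Http(url)
                .timeout(connTimeoutMs = 2000, readTimeoutMs = 5000)
                .asString
            response.body
        } catch {
            // Network errors and timeouts fall back to the empty document
            case scala.util.control.NonFatal(e) =>
                println(s"Error in http response: ${e.getMessage}")
                emptyText
        }
    }

    // Extract the text fragments from the raw content
    def processText(rawText: String): Seq[String]

    // Fetch the URL and extract its text
    def readText(url: String): Seq[String] = {
        processText(openURL(url))
    }
}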

class RSSParser() extends Parser {
    val emptyText = "<rss></rss>"
    def processText(rawText: String): Seq[String] = {
        // Convert the `String` to a `scala.xml.Elem`
        val xml = XML.loadString(rawText)
        // Join the title and description of every <item>
        (xml \\ "item").map { item =>
            (item \ "title").text + " " + (item \ "description").text
        }
    }
}

class RedditParser() extends Parser {
    val emptyText = "{}"
    def processText(rawText: String): Seq[String] = {
        // Parse once, then collect all titles followed by all self-texts
        val posts = parse(rawText) \ "data" \ "children" \ "data"
        (posts \ "title").extract[List[String]] ++
            (posts \ "selftext").extract[List[String]]
    }
}

defined class Parser
defined class RSSParser
defined class RedditParser

In [12]:
abstract class FeedService() {
    var urls: MutableList[(String, Parser)] = new MutableList[(String, Parser)]()
    // TODO: subscribe is not implemented yet. Idea: build one URL per parameter
    // (urlTemplate ++ params_i), pair each URL with the parser, and append the pairs to urls.
    def subscribe(urlTemplate: String, params: Seq[String], parser: Parser): Unit
    // Returns the parsed text of the subscribed URLs
    def getText(): Seq[String]
}

defined class FeedService
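
One way the `subscribe` idea noted in the comments of `FeedService` could be realized is sketched below. It is only a guess at the intended design, not the author's implementation: `TextSource` is a hypothetical stand-in for the notebook's `Parser` (so the sketch runs without network access), `SimpleFeedService` and `EchoSource` are illustrative names, the `example.com` URLs are placeholders, and `ListBuffer` is used instead of `MutableList`.

```scala
import scala.collection.mutable.ListBuffer

// Hypothetical stand-in for the notebook's Parser, so that the sketch runs on its own.
trait TextSource {
  def readText(url: String): Seq[String]
}

// Illustrative take on the subscribe/getText idea.
class SimpleFeedService {
  private val urls = ListBuffer.empty[(String, TextSource)]

  // Registers one URL per parameter by appending the parameter to the template.
  def subscribe(urlTemplate: String, params: Seq[String], parser: TextSource): Unit =
    urls ++= params.map(param => (urlTemplate + param, parser))

  // Reads every subscribed URL with its own parser and concatenates the texts.
  def getText(): Seq[String] =
    urls.toList.flatMap { case (url, parser) => parser.readText(url) }
}

// Fake source that echoes the URL instead of fetching it.
object EchoSource extends TextSource {
  def readText(url: String): Seq[String] = Seq("text from " + url)
}

val service = new SimpleFeedService
service.subscribe("https://example.com/feed/", Seq("sports", "news"), EchoSource)
val feedTexts = service.getText()
```

Keeping the parser next to each URL lets a single service mix RSS and JSON sources, which appears to be the reason `urls` stores `(String, Parser)` pairs.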

In [10]:
val parser = new RSSParser

parser: RSSParser = ammonite.$sess.cmd8$Helper$RSSParser@5b874efc

In [14]:
val url = "https://www.chicagotribune.com/arcio/rss/category/sports/?query=display_date:[now-2d+TO+now]&sort=display_date:desc"
val rssText = parser.readText(url)

url: String = "https://www.chicagotribune.com/arcio/rss/category/sports/?query=display_date:[now-2d+TO+now]&sort=display_date:desc"
rssText: Seq[String] = List(
  "Simone Biles dials up the difficulty. \u2018Because I can.\u2019  The Olympic gold medalist\u2019s new vault is so dangerous that gymnastics, for now, limits the scoring rewards for trying it. Biles says that\u2019s unfair.",
  "Connecticut Sun coach Curt Miller apologizes for comment about Las Vegas Aces center Liz Cambage\u2019s weight Connecticut Sun coach Curt Miller has apologized for making a disparaging remark to a referee about the weight of Las Vegas Aces post player Liz Cambage.",
  "Aaron Rodgers doesn\u2019t attend the 1st day of Green Bay Packers OTAs Green Bay Packers quarterback Aaron Rodgers wasn\u2019t present for the first day of organized team activities Monday, according to a person familiar with the situation."
...

In [15]:
val parser2 = new RedditParser
val reddit = "https://www.reddit.com/r/Android/hot/.json?count=10"
val redditText = parser2.readText(reddit)

parser2: RedditParser = ammonite.$sess.cmd8$Helper$RedditParser@10b324c4
reddit: String = "https://www.reddit.com/r/Android/hot/.json?count=10"
redditText: Seq[String] = List(
  "Moronic Monday (May 24 2021) - Your weekly questions thread!",
  "Honor\u2019s upcoming phones will have Google apps pre-installed",
  "4 things to know about Google Photos' storage policy change",
  "Galaxy Upcycling: How Samsung Ruined Their Best Idea in Years",
  "LineageOS 18.1 brings Android 11 to three Xiaomi, ASUS, and Sony devices",
  "Practical application of the Android Neural Network API for the use of Tensorflow Lite models",
  "Galaxy Tab S7 FE 5G silver 64 GB( German Samsung website)",
  "Samsung Galaxy Tab S7 FE quietly launches with a Snapdragon 750G and 10,090 mAh battery",
  "Xiaomi Black Shark 4 review",
  "New Pi
...

## 2. Detect the named entities

### 2.1 Create the model

The **model** is just the function `getNEs`, which takes a list of texts.
For each text, it splits the text into words on spaces and treats a word as a named entity if it starts with an uppercase letter.

The following code lists some common English words (stopwords) that are excluded from the results; the punctuation symbols to remove are defined in `NERModel`.

In [53]:
val STOPWORDS = Seq (
    "i", "me", "my", "myself", "we", "our", "ours", "ourselves", "you",
    "yours", "yourself", "yourselves", "he", "him", "his", "himself", "she",
    "her", "hers", "herself", "it", "its", "itself", "they", "them", "your",
    "their", "theirs", "themselves", "what", "which", "who", "whom",
    "this", "that", "these", "those", "am", "is", "are", "was", "were",
    "be", "been", "being", "have", "has", "had", "having", "do", "does",
    "did", "doing", "a", "an", "the", "and", "but", "if", "or",
    "because", "as", "until", "while", "of", "at", "by", "for", "with",
    "about", "against", "between", "into", "through", "during", "before",
    "after", "above", "below", "to", "from", "up", "down", "in", "out",
    "off", "over", "under", "again", "further", "then", "once", "here",
    "there", "when", "where", "why", "how", "all", "any", "both", "each",
    "few", "more", "most", "other", "some", "such", "no", "nor", "not",
    "only", "own", "same", "so", "than", "too", "very", "s", "t", "can",
    "will", "just", "don", "should", "now", "on",
    // Contractions without the apostrophe
    "im", "ive", "id", "youre", "youd", "youve",
    "hes", "hed", "shes", "shed", "itd", "were", "wed", "weve",
    "theyre", "theyd", "theyve",
    "shouldnt", "couldnt", "mustnt", "cant", "wont",
    // Common capitalized words
    "hi", "hello"
)

STOPWORDS: Seq[String] = List(
  "i",
  "me",
  "my",
  "myself",
  "we",
  "our",
  "ours",
  "ourselves",
  "you",
  "yours",
  "yourself",
  "yourselves",
  "he",
  "him",
  "his",
  "himself",
  "she",
  "her",
  "hers",
  "herself",
  "it",
  "its",
  "itself",
  "they",
  "them",
  "your",
  "their",
  "theirs",
  "themselves",
  "what",
  "which",
  "who",
  "whom",
  "this",
  "that",
  "these",
  "those",
  "am",
...

In [54]:
class NERModel(STOPWORDS: Seq[String]) {

    val punctuationSymbols = ".,()!?;:'`´\n"
    // Regex matching any one of the punctuation symbols, each quoted literally
    val punctuationRegex = punctuationSymbols
        .map(c => java.util.regex.Pattern.quote(c.toString))
        .mkString("|")
    // Lowercased set for fast, case-insensitive lookups
    private val stopwordSet = STOPWORDS.map(_.toLowerCase).toSet

    // Extract the named entities of one text: capitalized words that are
    // longer than one character and are not stopwords
    def getNEsSingle(text: String): Seq[String] = {
        text.replaceAll(punctuationRegex, "").split(" ")
            .filter { word: String =>
                word.length > 1 &&
                Character.isUpperCase(word.charAt(0)) &&
                !stopwordSet.contains(word.toLowerCase)
            }.toSeq
    }

    // Extract the named entities of every text
    def getNEs(textList: Seq[String]): Seq[Seq[String]] = {
        textList.map(getNEsSingle)
    }

    // Count the occurrences of each named entity
    def countNEs(result: Seq[Seq[String]]): Map[String, Int] = {
        result.flatten.foldLeft(Map.empty[String, Int]) {
            (count, word) => count + (word -> (count.getOrElse(word, 0) + 1))
        }
    }

    // Sort the named entities by descending count
    def sortNEs(counts: Map[String, Int]): List[(String, Int)] = {
        counts.toList.sortBy(_._2)(Ordering[Int].reverse)
    }
}

defined class NERModel
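
To see the heuristic on a single sentence, here is a simplified standalone sketch. It does not use `NERModel`: `capitalizedWords` and `sampleStopwords` are names introduced only for this example, the stopword set is deliberately tiny, and the punctuation is stripped with a character class rather than the regex built in the class.

```scala
// Tiny stopword set, just for this example; the notebook uses the full STOPWORDS list.
val sampleStopwords = Set("with", "a", "and")

// Simplified heuristic: strip punctuation, split on spaces, and keep capitalized
// words longer than one character that are not stopwords.
def capitalizedWords(text: String): Seq[String] =
  text.replaceAll("[.,()!?;:'`´\\n]", "")
    .split(" ")
    .toSeq
    .filter { w =>
      w.length > 1 &&
      Character.isUpperCase(w.charAt(0)) &&
      !sampleStopwords.contains(w.toLowerCase)
    }

// One of the Reddit titles shown earlier in the notebook.
val sample = "Samsung Galaxy Tab S7 FE quietly launches with a Snapdragon 750G and 10,090 mAh battery"
val entities = capitalizedWords(sample)
// entities contains: Samsung, Galaxy, Tab, S7, FE, Snapdragon
```

Tokens such as `750G` and `mAh` are dropped because their first character is not an uppercase letter. The heuristic also keeps any capitalized word that is not in the stopword list, so ordinary words at the start of a sentence or in a title-cased headline are reported as entities too.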

In [55]:
val model = new NERModel(STOPWORDS)

model: NERModel = ammonite.$sess.cmd53$Helper$NERModel@4c69d628

### 2.2 Apply the "model" to the data

In [56]:
val result = model.getNEs(redditText)

result: Seq[Seq[String]] = List(
  ArrayBuffer("Moronic", "Monday", "May"),
  ArrayBuffer("Honor\u2019s", "Google"),
  ArrayBuffer("Google", "Photos"),
  ArrayBuffer(
    "Galaxy",
    "Upcycling",
    "Samsung",
    "Ruined",
    "Best",
    "Idea",
    "Years"
  ),
  ArrayBuffer("LineageOS", "Android", "Xiaomi", "ASUS", "Sony"),
  ArrayBuffer(
    "Practical",
    "Android",
    "Neural",
    "Network",
    "API",
    "Tensorflow",
    "Lite"
  ),
  ArrayBuffer("Galaxy", "Tab", "S7", "FE", "GB", "German", "Samsung"),
  ArrayBuffer(
...

## 3. Count and sort the entities

Concatenate all the lists, count each named entity, and then sort by frequency.
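
These three steps can also be sketched on their own with `groupBy`, as an alternative to the `foldLeft` used in `NERModel.countNEs`. The input below is a made-up sample in the style of the output of section 2.2, and the names `sampleResult`, `sampleCounts` and `sampleSorted` exist only in this example.

```scala
// Made-up entity lists, one inner Seq per text.
val sampleResult: Seq[Seq[String]] = Seq(
  Seq("Honor", "Google"),
  Seq("Google", "Photos"),
  Seq("Galaxy", "Samsung"),
  Seq("Google")
)

// 1. Concatenate the lists, 2. group equal words and count each group.
val sampleCounts: Map[String, Int] =
  sampleResult.flatten
    .groupBy(identity)
    .map { case (word, occurrences) => word -> occurrences.size }

// 3. Sort by descending count (the order among equal counts is not specified).
val sampleSorted: List[(String, Int)] =
  sampleCounts.toList.sortBy { case (_, n) => -n }
```

Both approaches produce the same counts. `groupBy` builds intermediate lists of occurrences, whereas the `foldLeft` version updates a single map as it goes.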

In [57]:
val counts = model.countNEs(result)
val sortedNEs = model.sortNEs(counts)

counts: Map[String, Int] = Map(
  "Thing" -> 1,
  "Advertising" -> 1,
  "PLN" -> 5,
  "Please" -> 1,
  "Easy" -> 1,
  "Locked" -> 1,
  "One" -> 8,
  "CEO" -> 1,
  "Tab" -> 3,
  "Weinbach" -> 3,
  "Use" -> 1,
  "Cheaper" -> 1,
  "Take" -> 1,
  "OLED" -> 1,
  "NPUs" -> 1,
  "A11" -> 3,
  "Disagree" -> 1,
  "Upcycling" -> 1,
  "ISP" -> 1,
  "Settings" -> 1,
  "Lite" -> 1,
  "Pichai" -> 1,
  "Users" -> 1,
  "Reddit" -> 1,
  "GPS" -> 1,
  "Google