##### Social Media Analytics
### Introduction to Text Mining
## Named Entity Recognition
(c) Nuno Antonio 2019-2022 v1.02

### Initial setup

In [1]:
# Import packages
import csv
import pandas as pd
import numpy as np
import nltk 
import re
from bs4 import BeautifulSoup
import spacy
from spacy import displacy
from collections import Counter

In [2]:
# Load dataset
dtypes = {'title':'category','author':'category','text':'category'}
ds = pd.read_csv("CNNArticles.csv", sep=",", 
                 on_bad_lines='skip', dtype=dtypes, decimal=',', 
                 index_col='Unnamed: 0', parse_dates=['date'])





In [3]:
ds.head()

Unnamed: 0,title,author,date,text
0,Russia's war in Ukraine,"['Jessie Yeung', 'Sana Noor Haq', 'Ivana Kotta...",2023-03-05,US Ambassador to Russia Lynne Tracy visited Pa...
1,What we know about the murky drone attack on t...,"['Rob Picheta', 'Anna Chernova', 'Allegra Good...",2023-04-05,The tight ring of security that surrounds the ...
2,How the Kremlin drone attack hands Russia an o...,['Jill Dougherty'],2023-04-05,"At first glance, it looks like a sci-fi movie...."
3,Wave of Russian attacks on Kyiv worst in a yea...,"['Josh Pennington ', 'Olga Voitovych', 'Helen ...",2023-04-05,Russia unleashed its worst attacks on Kyiv in ...
4,"5 things to know for May 4: Atlanta shooting, ...",['Alexandra Meeks'],2023-04-05,Thousands of people are planning to line the s...


### Functions

In [4]:
# Text preprocessing
def textPreProcess(rawText, removeHTML=True,
                   charsToRemove=r'\?|\.|\!|\;|\.|\"|\,|\(|\)|\&|\:|\|[0-9]|--| [ ] ',
                   removeNumbers=True, removeLineBreaks=False,
                   specialCharsToRemove=r'[^\x00-\xfd]',
                   convertToLower=True, removeConsecutiveSpaces=True):
    # Leave non-string values (e.g. NaN) untouched
    if not isinstance(rawText, str):
        return rawText
    procText = rawText
        
    # Remove HTML
    if removeHTML:
        procText = BeautifulSoup(procText,'html.parser').get_text()

    # Remove punctuation and other special characters
    if len(charsToRemove)>0:
        procText = re.sub(charsToRemove,' ',procText)

    # Remove numbers
    if removeNumbers:
        procText = re.sub(r'\d+',' ',procText)

    # Remove line breaks
    if removeLineBreaks:
        procText = procText.replace('\n',' ').replace('\r', '')

    # Remove special characters
    if len(specialCharsToRemove)>0:
        procText = re.sub(specialCharsToRemove,' ',procText)

    # Normalize to lower case
    if convertToLower:
        procText = procText.lower() 

    # Replace multiple consecutive spaces with just one space
    if removeConsecutiveSpaces:
        procText = re.sub(' +', ' ', procText)

    return procText
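
The default `charsToRemove` pattern is worth reading one alternative at a time, because two of its branches may not do what they appear to. The stdlib-only sketch below (the sample strings are made up for illustration) suggests that `\|[0-9]` matches a literal pipe followed by a digit rather than any digit, and that ` [ ] ` matches a run of three spaces. Bare digits are only removed later, by the separate `removeNumbers` step.

```python
import re

# Default value of charsToRemove, copied from textPreProcess above
chars_to_remove = r'\?|\.|\!|\;|\.|\"|\,|\(|\)|\&|\:|\|[0-9]|--| [ ] '

# A bare digit is not matched by any alternative, so it survives this step
digit_only = re.sub(chars_to_remove, ' ', 'a5b')

# A pipe immediately followed by a digit is matched as one unit
pipe_digit = re.sub(chars_to_remove, ' ', 'a|5b')

# Three consecutive spaces are replaced by a single space
three_spaces = re.sub(chars_to_remove, ' ', 'a   b')

print(repr(digit_only), repr(pipe_digit), repr(three_spaces))
```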

### Analysis

In [5]:
# Create a dataframe with only the preprocessed article text
dsprocessedText = pd.DataFrame(
    data=ds.text.apply(textPreProcess, charsToRemove='', removeLineBreaks=False, removeNumbers=False).values,
    index=ds.index, columns=['PreProcessedText'])

In [6]:
dsprocessedText.head()

Unnamed: 0,PreProcessedText
0,us ambassador to russia lynne tracy visited pa...
1,the tight ring of security that surrounds the ...
2,"at first glance, it looks like a sci-fi movie...."
3,russia unleashed its worst attacks on kyiv in ...
4,thousands of people are planning to line the s...


In [7]:
# Remove rows with empty text
dsprocessedText.PreProcessedText = dsprocessedText.PreProcessedText.str.strip()
dsprocessedText = dsprocessedText[dsprocessedText.PreProcessedText != '']

In [8]:
# Load Spacy English model
nlp = spacy.load("en_core_web_sm")

In [12]:
# Check entities in the article with index label 1
print(dsprocessedText['PreProcessedText'][1])
doc = nlp(dsprocessedText['PreProcessedText'][1])
print([(X.text, X.label_) for X in doc.ents])

the tight ring of security that surrounds the seat of the russian presidency was punctured in dramatic fashion by what appeared to be two attempted drone strikes in the early hours of wednesday morning. but until the kremlin chose to publicize the incident around 12 hours later, social media footage of the incident had gained little attention. why russia decided to reveal the security breach is unclear. but in a five-paragraph statement on wednesday, the kremlin made the incendiary claim that the drones were an assassination attempt launched by ukraine on the the russian president, vladimir putin. kyiv forcefully denied the claim. kyiv was bombarded with missiles in the hours following russia's claims, in keeping with putin's historic willingness to strike ukrainian cities after any alleged act of provocation. many details about the incident remain murky. here's what we know -- and the questions that remain. what happened? moscow said the alleged attack took place in the early hours of

In [13]:
# Check entities in the article with index label 0
print(dsprocessedText['PreProcessedText'][0])
doc = nlp(dsprocessedText['PreProcessedText'][0])
print([(X.text, X.label_) for X in doc.ents])

us ambassador to russia lynne tracy visited paul whelan on thursday her first visit to the detained american since taking up the post in moscow earlier this year. "his release remains an absolute priority," the us embassy in moscow said on twitter.  whelan is serving out his prison sentence at a prison camp in mordovia, an eight-hour drive from moscow. background on whelan's case: the american citizen, who also holds irish, british and canadian citizenship, was detained in russia in december 2018 and later sentenced to 16 years in prison on an espionage charge, which he strongly denies.  in an interview with cnn in december, whelan described the prison camp as "better than most in russia because it's mostly foreigners held here, but the conditions are extremely bad." although thursday was tracy's first in-person visit, she has spoken by phone with whelan in the past. the us government was unable to secure whelan's release last year when they brought home two other wrongfully detained a

[('us', 'GPE'), ('russia', 'GPE'), ('paul', 'PERSON'), ('thursday', 'DATE'), ('first', 'ORDINAL'), ('american', 'NORP'), ('moscow', 'GPE'), ('earlier this year', 'DATE'), ('us', 'GPE'), ('moscow', 'GPE'), ('mordovia', 'GPE'), ('eight-hour', 'TIME'), ('moscow', 'GPE'), ('american', 'NORP'), ('irish', 'NORP'), ('british', 'NORP'), ('canadian', 'NORP'), ('russia', 'GPE'), ('december 2018', 'DATE'), ('16 years', 'DATE'), ('cnn', 'ORG'), ('december', 'DATE'), ('russia', 'GPE'), ('thursday', 'DATE'), ('first', 'ORDINAL'), ('us', 'GPE'), ('last year', 'DATE'), ('two', 'CARDINAL'), ('americans', 'NORP'), ('april', 'DATE'), ('december', 'DATE'), ('americans', 'NORP'), ('one', 'CARDINAL'), ('two', 'CARDINAL'), ('americans', 'NORP'), ('russia', 'GPE'), ('wall street journal', 'ORG'), ('evan gershkovich', 'PERSON'), ('more than a month ago', 'DATE'), ('the united states', 'GPE'), ('week', 'DATE'), ('kremlin', 'ORG'), ('tom cotton', 'PERSON'), ('senate armed services committee', 'ORG'), ('russian',
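
The list of `(text, label)` tuples printed above is easier to scan once it is grouped by label. Here is a minimal sketch that uses a hand-copied subset of those tuples instead of a live `doc`, so it runs without spaCy. The name `entities_by_label` is illustrative and does not appear in the notebook.

```python
from collections import defaultdict

# Subset of the (text, label) tuples printed above, copied by hand
entities = [
    ('us', 'GPE'), ('russia', 'GPE'), ('paul', 'PERSON'),
    ('thursday', 'DATE'), ('first', 'ORDINAL'), ('american', 'NORP'),
    ('moscow', 'GPE'), ('cnn', 'ORG'), ('evan gershkovich', 'PERSON'),
    ('tom cotton', 'PERSON'), ('wall street journal', 'ORG'),
]

# Map each label to its distinct entity texts, in first-seen order
entities_by_label = defaultdict(list)
for text, label in entities:
    if text not in entities_by_label[label]:
        entities_by_label[label].append(text)

for label, texts in entities_by_label.items():
    print(f"{label}: {texts}")
```

With a live spaCy document, the hand-copied list could be replaced by `[(ent.text, ent.label_) for ent in doc.ents]`.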

In [14]:
#annReviews = []
# for r in dsprocessedText['PreProcessedText']:
#     doc = nlp(r)
#     for entity in doc.ents:
#         if entity.label_ == 'LANGUAGE':
#             annReviews.append((entity.text, entity.label_))

# print(annReviews)

In [15]:
# print(dsprocessedText['PreProcessedText'][:])  # Print the entire text (optional)

# doc = nlp(" ".join(dsprocessedText['PreProcessedText']))  # Combine all texts and process as a single document

# print([(X.text, X.label_) for X in doc.ents])  # Print the entities found in the entire text


In [16]:
# Count the labels
labels = [x.label_ for x in doc.ents]
Counter(labels)

Counter({'GPE': 185,
         'PERSON': 108,
         'DATE': 91,
         'ORDINAL': 12,
         'NORP': 81,
         'TIME': 28,
         'ORG': 95,
         'CARDINAL': 46,
         'MONEY': 34,
         'PERCENT': 1,
         'FAC': 2,
         'QUANTITY': 5})
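
Raw counts are easier to compare as shares of the total. The sketch below copies the counts from the output above into a `Counter`, so it does not depend on `doc`, and prints each label with its proportion of all entities. The variable names are illustrative.

```python
from collections import Counter

# Label counts copied from the output above
label_counts = Counter({
    'GPE': 185, 'PERSON': 108, 'DATE': 91, 'ORDINAL': 12, 'NORP': 81,
    'TIME': 28, 'ORG': 95, 'CARDINAL': 46, 'MONEY': 34, 'PERCENT': 1,
    'FAC': 2, 'QUANTITY': 5,
})

total = sum(label_counts.values())

# Labels ordered from most to least frequent, with their share of all entities
for label, count in label_counts.most_common():
    print(f"{label:<10}{count:>5}{count / total:>8.1%}")
```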

In [17]:
# Show the 30 most frequent entity texts
entity_texts = [x.text for x in doc.ents]
Counter(entity_texts).most_common(30)

[('russia', 66),
 ('russian', 54),
 ('kremlin', 38),
 ('moscow', 32),
 ('ukraine', 30),
 ('wednesday', 26),
 ('us', 18),
 ('thursday', 16),
 ('vladimir putin', 16),
 ('cnn', 15),
 ('two', 14),
 ('putin', 14),
 ('nato', 12),
 ('zelensky', 12),
 ('washington', 11),
 ('volodymyr zelensky', 10),
 ('#', 10),
 ('##kremlin incident', 8),
 ('netherlands', 8),
 ('kherson', 8),
 ('ukrainian', 7),
 ('##', 6),
 ('dutch', 5),
 ('overnight', 5),
 ('early thursday', 5),
 ('wednesday night', 5),
 ('the united states', 4),
 ('the white house', 4),
 ('third', 4),
 ('peskov', 4)]
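
The ranking above counts variant mentions such as 'putin' and 'vladimir putin' separately, and it includes markup-like tokens such as '#'. One possible clean-up is sketched below. It assumes a hand-written alias map (`ALIASES` is hypothetical, not part of the notebook) and a simple rule that drops entries containing no letters or digits. The counts are a subset copied from the output above.

```python
from collections import Counter

# A few (text, count) pairs copied from the ranking above
entity_counts = Counter({
    'russia': 66, 'vladimir putin': 16, 'putin': 14, 'zelensky': 12,
    'volodymyr zelensky': 10, '#': 10, '##': 6,
})

# Hypothetical alias map: variant mention -> canonical name
ALIASES = {
    'putin': 'vladimir putin',
    'zelensky': 'volodymyr zelensky',
}

merged = Counter()
for text, count in entity_counts.items():
    # Skip tokens with no letters or digits, such as '#' and '##'
    if not any(ch.isalnum() for ch in text):
        continue
    merged[ALIASES.get(text, text)] += count

print(merged.most_common())
```

A manual alias map only scales to a handful of names. It is shown here because the variants are visible in the output, not as a general entity-linking method.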

In [18]:
# Entities visualization
displacy.render(doc, jupyter=True, style='ent')

In [20]:
# For example, if our objective was to understand what the articles say about languages, we could look for articles that mention a language
counter = 0   # Number of matching articles found so far (capped for demonstration speed)
annReviews = []
for r in dsprocessedText['PreProcessedText']:
    doc = nlp(r)
    for i in doc.ents:
        if i.label_ == 'LANGUAGE':
            annReviews.append(r)
            counter = counter + 1
            break
    if counter >= 3:    # Stop after the first three matching articles have been found
        break

annReviews

['the tight ring of security that surrounds the seat of the russian presidency was punctured in dramatic fashion by what appeared to be two attempted drone strikes in the early hours of wednesday morning. but until the kremlin chose to publicize the incident around 12 hours later, social media footage of the incident had gained little attention. why russia decided to reveal the security breach is unclear. but in a five-paragraph statement on wednesday, the kremlin made the incendiary claim that the drones were an assassination attempt launched by ukraine on the the russian president, vladimir putin. kyiv forcefully denied the claim. kyiv was bombarded with missiles in the hours following russia\'s claims, in keeping with putin\'s historic willingness to strike ukrainian cities after any alleged act of provocation. many details about the incident remain murky. here\'s what we know -- and the questions that remain. what happened? moscow said the alleged attack took place in the early hou
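
The loop above stops as soon as three matching texts have been found. The same early-stopping idea can be sketched with `itertools.islice` over a lazy `filter`. The helper `first_n_matching` and the stand-in predicate `mentions_language` are hypothetical names, and the predicate is a plain substring check so that the sketch runs without spaCy.

```python
from itertools import islice

def first_n_matching(texts, predicate, n):
    """Return the first n texts for which predicate(text) is true."""
    # filter() is lazy, so texts after the n-th match are never evaluated
    return list(islice(filter(predicate, texts), n))

# Stand-in predicate so the sketch runs without spaCy; with the notebook's
# objects one could pass:
#   lambda t: any(ent.label_ == 'LANGUAGE' for ent in nlp(t).ents)
def mentions_language(text):
    return 'russian' in text or 'english' in text

sample_texts = [
    'the russian statement was short',
    'kyiv denied the claim',
    'he spoke in english',
    'no language here',
]
matches = first_n_matching(sample_texts, mentions_language, n=2)
print(matches)
```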