##### Social Media Analytics
### Introduction to Text Mining
## Preprocessing
(c) Nuno Antonio 2019-2023 v1.02

In [110]:
import nltk
#nltk.download()

## Packages installation

Packages required for this set of exercises:
- **NLTK**: Install via Anaconda Navigator Environments tab or using *pip install nltk* or using *conda install -c anaconda nltk*
- **NLTK data**: After installing NLTK open a notebook and run:<br>
*import nltk<br>
nltk.download()*
- **BeautifulSoup4**: Install via Anaconda Navigator Environments tab or using *pip install beautifulsoup4* or using *conda install -c anaconda beautifulsoup4*
- **WordCloud**: Install via Anaconda Navigator Environments tab or using *pip install wordcloud*  or using *conda install -c conda-forge wordcloud*
- **spaCy**: Install via Anaconda Navigator Environments tab or using *pip install spacy* or *conda install -c conda-forge spacy*<br>
   To install Spanish models: *python -m spacy download es_core_news_sm*<br>
   To install Portuguese models: *python -m spacy download pt_core_news_sm*<br>
   To install English models: *python -m spacy download en_core_web_sm*<br>
   To install other languages models see: https://spacy.io/usage/models#languages<br>
- **Rake-NLTK**: Install via Anaconda Navigator Environments tab or using *pip install rake-nltk* or  *conda install -c conda-forge rake_nltk*
- **VaderSentiment**: Install via Anaconda Navigator Environments tab or using *pip install vaderSentiment* or *conda install -c conda-forge vadersentiment*
- **Gensim**: Install via Anaconda Navigator Environments tab or using *pip install gensim* or *conda install -c anaconda gensim*
- **pyLDAvis**: Install via Anaconda Navigator Environments tab or using *pip install pyldavis* or *conda install -c conda-forge pyldavis*
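
Before running the exercises, it can help to confirm which of the packages listed above are importable in the active environment. The sketch below uses only the standard library; the import names in `REQUIRED_MODULES` are assumptions (they can differ from the pip/conda package names), and `check_modules` is a hypothetical helper, not part of any of these libraries.

```python
from importlib.util import find_spec

# Import names are assumptions; e.g., BeautifulSoup4 is imported as "bs4".
REQUIRED_MODULES = ["nltk", "bs4", "wordcloud", "spacy",
                    "rake_nltk", "vaderSentiment", "gensim", "pyLDAvis"]

def check_modules(names):
    """Return {module_name: bool} telling whether each module can be found."""
    return {name: find_spec(name) is not None for name in names}

# Report the status of every module without actually importing it
for name, available in check_modules(REQUIRED_MODULES).items():
    print(f"{name:15s} {'OK' if available else 'MISSING'}")
```

Any module reported as `MISSING` can then be installed with the corresponding command from the list.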

## Data preprocessing

In [111]:
# Import packages
import csv
import pandas as pd
import numpy as np
import nltk 
import datetime

In [112]:
# Load dataset
# on_bad_lines='skip' replaces the deprecated error_bad_lines=False (pandas >= 1.3)
dtypes = {'title':'category','author':'category','text':'category'}
ds = pd.read_csv("CNNArticles.csv", sep=",", 
                 on_bad_lines='skip', dtype=dtypes, decimal=',', 
                 index_col='Unnamed: 0', parse_dates=['date'])




In [113]:
# Check first rows
ds.head(50)

Unnamed: 0,title,author,date,text
0,Russia's war in Ukraine,"['Jessie Yeung', 'Sana Noor Haq', 'Ivana Kotta...",2023-03-05,US Ambassador to Russia Lynne Tracy visited Pa...
1,What we know about the murky drone attack on t...,"['Rob Picheta', 'Anna Chernova', 'Allegra Good...",2023-04-05,The tight ring of security that surrounds the ...
2,How the Kremlin drone attack hands Russia an o...,['Jill Dougherty'],2023-04-05,"At first glance, it looks like a sci-fi movie...."
3,Wave of Russian attacks on Kyiv worst in a yea...,"['Josh Pennington ', 'Olga Voitovych', 'Helen ...",2023-04-05,Russia unleashed its worst attacks on Kyiv in ...
4,"5 things to know for May 4: Atlanta shooting, ...",['Alexandra Meeks'],2023-04-05,Thousands of people are planning to line the s...
5,Russia unleashes intense airstrikes against Uk...,[],2023-04-05,Russia unleashes its most intense attacks in K...
6,Saudi Arabia wants to be a bigger player in th...,[],2023-03-05,When Iranian evacuees from Sudan were being fl...
7,Washington is letting Chinese airlines increas...,['Michelle Toh'],2023-04-05,The United States said Wednesday it would allo...
8,Russia blames Ukraine for Moscow drone strikes,[],2023-04-05,Kyiv denies involvement in drone attack on the...
9,"May 3, 2023 - Russia-Ukraine news","['Kathleen Magramo', 'Sophie Tanno', 'Adrienne...",2023-02-05,Our live coverage of Russia's war in Ukraine h...


In [114]:
print((ds['author'] == '[]').sum())

969
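
The comparison with `'[]'` works because the `author` column stores each author list as a string. A minimal sketch of turning such strings into real Python lists with the standard library follows; `parse_authors` is a hypothetical helper, and the multi-author sample value is made up to show the whitespace stripping.

```python
import ast

def parse_authors(raw):
    """Parse a string such as "['A', 'B']" into a list of stripped names."""
    names = ast.literal_eval(raw)          # safely evaluates the list literal
    return [name.strip() for name in names]

samples = [
    "['Jill Dougherty']",                   # value seen in the table above
    "['First Author ', 'Second Author']",   # made-up value with a trailing space
    "[]",                                   # article without authors
]
for raw in samples:
    authors = parse_authors(raw)
    print(len(authors), authors)
```

Applied to the whole column (for example with `ds['author'].astype(str).map(parse_authors)`), this would allow counting authors per article instead of only detecting empty lists.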


In [115]:
# Describe dataset
summary=ds.describe(include='all')
summary=summary.transpose()
summary.head(len(summary))


Unnamed: 0,count,unique,top,freq,first,last
title,4550,4471,Russia's war in Ukraine,22,NaT,NaT
author,4550,1593,[],969,NaT,NaT
date,4550,470,2022-01-03 00:00:00,60,2013-01-22,2023-12-04
text,4550,4544,It tells you plenty about the Eurovision Song ...,2,NaT,NaT


In [116]:
# View the text of the article with index 1
ArticleText = ds.text[1]
print("Full review:\n",ArticleText)
# Scraped text may contain HTML tags (e.g., "<br>") and entities
# &amp; is the HTML encoding for "&"

Full review:
 The tight ring of security that surrounds the seat of the Russian presidency was punctured in dramatic fashion by what appeared to be two attempted drone strikes in the early hours of Wednesday morning. But until the Kremlin chose to publicize the incident around 12 hours later, social media footage of the incident had gained little attention. Why Russia decided to reveal the security breach is unclear. But in a five-paragraph statement on Wednesday, the Kremlin made the incendiary claim that the drones were an assassination attempt launched by Ukraine on the the Russian president, Vladimir Putin. Kyiv forcefully denied the claim. Kyiv was bombarded with missiles in the hours following Russia's claims, in keeping with Putin's historic willingness to strike Ukrainian cities after any alleged act of provocation. Many details about the incident remain murky. Here's what we know -- and the questions that remain. What happened? Moscow said the alleged attack took place in the 

In [117]:
# Let's use BeautifulSoup to remove HTML
from bs4 import BeautifulSoup

In [118]:
# Let's now remove the HTML
rawText = BeautifulSoup(ArticleText,'html.parser').get_text()
print("Full review without HTML:\n",rawText)
# Compare with the previous output: tags such as "<br>" are stripped and entities such as "&amp;" are decoded

Full review without HTML:
 The tight ring of security that surrounds the seat of the Russian presidency was punctured in dramatic fashion by what appeared to be two attempted drone strikes in the early hours of Wednesday morning. But until the Kremlin chose to publicize the incident around 12 hours later, social media footage of the incident had gained little attention. Why Russia decided to reveal the security breach is unclear. But in a five-paragraph statement on Wednesday, the Kremlin made the incendiary claim that the drones were an assassination attempt launched by Ukraine on the the Russian president, Vladimir Putin. Kyiv forcefully denied the claim. Kyiv was bombarded with missiles in the hours following Russia's claims, in keeping with Putin's historic willingness to strike Ukrainian cities after any alleged act of provocation. Many details about the incident remain murky. Here's what we know -- and the questions that remain. What happened? Moscow said the alleged attack took 
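
This particular article contains no markup, so the two outputs look identical. To see what HTML removal does, the sketch below runs a made-up string through a small standard-library stand-in for `get_text()`. It is not BeautifulSoup itself: `TextExtractor` and `strip_html` are hypothetical names built on `html.parser.HTMLParser`.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect only the text nodes of an HTML fragment."""
    def __init__(self):
        super().__init__(convert_charrefs=True)  # decode entities such as &amp;
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def strip_html(markup):
    """Return the text of `markup` with tags removed and entities decoded."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return " ".join(part.strip() for part in parser.parts if part.strip())

sample = "Great location<br>Staff &amp; food were excellent"  # made-up example
print(strip_html(sample))
```

The `<br>` tag disappears and `&amp;` becomes `&`, which is the kind of difference to look for when the source text does contain HTML.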

In [119]:
# Normalize case
# The usual approach is to convert everything to lower case
normalizedText = rawText.lower()
print("Normalized text:\n",normalizedText)

Normalized text:
 the tight ring of security that surrounds the seat of the russian presidency was punctured in dramatic fashion by what appeared to be two attempted drone strikes in the early hours of wednesday morning. but until the kremlin chose to publicize the incident around 12 hours later, social media footage of the incident had gained little attention. why russia decided to reveal the security breach is unclear. but in a five-paragraph statement on wednesday, the kremlin made the incendiary claim that the drones were an assassination attempt launched by ukraine on the the russian president, vladimir putin. kyiv forcefully denied the claim. kyiv was bombarded with missiles in the hours following russia's claims, in keeping with putin's historic willingness to strike ukrainian cities after any alleged act of provocation. many details about the incident remain murky. here's what we know -- and the questions that remain. what happened? moscow said the alleged attack took place in 

In [120]:
# Remove certain classes of characters (using "regular expressions" - regex)
# For that let's use RE
import re
# To test regex online: https://regex101.com
# Regex in more detail: https://www.machinelearningplus.com/python/python-regex-tutorial-examples/

In [121]:
# Remove punctuation and other characters such as "&"
# The character class lists the single characters to drop; "--" and runs of three spaces are matched separately
charsToRemove = r'[?.!;",()&0-9]|--| {3}'
textWOPunctuation = re.sub(charsToRemove,'', normalizedText)
print("Text without punctuation:\n",textWOPunctuation)

Text without punctuation:
 the tight ring of security that surrounds the seat of the russian presidency was punctured in dramatic fashion by what appeared to be two attempted drone strikes in the early hours of wednesday morning but until the kremlin chose to publicize the incident around  hours later social media footage of the incident had gained little attention why russia decided to reveal the security breach is unclear but in a five-paragraph statement on wednesday the kremlin made the incendiary claim that the drones were an assassination attempt launched by ukraine on the the russian president vladimir putin kyiv forcefully denied the claim kyiv was bombarded with missiles in the hours following russia's claims in keeping with putin's historic willingness to strike ukrainian cities after any alleged act of provocation many details about the incident remain murky here's what we know  and the questions that remain what happened moscow said the alleged attack took place in the earl
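
The output above still contains apostrophes and single hyphens (`russia's`, `five-paragraph`) and shows double spaces where digits or `--` were removed. The sketch below applies a pattern in the same spirit to a short made-up sentence and then collapses the leftover whitespace; `PUNCT_PATTERN` and `remove_punctuation` are hypothetical names, and the whitespace step is an addition that the notebook does not perform.

```python
import re

# Single characters to drop, plus the double hyphen
PUNCT_PATTERN = re.compile(r'[?.!;",()&0-9]|--')

def remove_punctuation(text):
    """Delete the listed characters, then collapse repeated whitespace."""
    stripped = PUNCT_PATTERN.sub('', text)
    return re.sub(r'\s+', ' ', stripped).strip()

sample = "kyiv's claim -- made 12 hours later (wednesday)!"
print(remove_punctuation(sample))
```

Apostrophes are deliberately left alone here, as in the notebook's pattern, which is why possessives survive this step.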

In [122]:
# Download the stopwords resource
import nltk
nltk.download('stopwords')

# Now, let's remove stop words
# Let's start by loading the stop word list and inspecting it
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
print(stop_words)

{'why', 'haven', 'how', 'just', 'are', 'her', 'if', 'then', 'needn', 'nor', 'was', 'they', "should've", 'theirs', 'some', "didn't", 'once', 'more', 't', 'does', 'do', 'the', "won't", "you'll", 's', 'ma', 'who', 'as', 'where', 'y', "mustn't", 'don', 'through', 'were', 'under', 'when', "isn't", 'by', 'but', 'will', 'for', 'with', 'from', 'ain', 'yours', 'your', 'into', 'that', 'not', 'hers', 'very', 'and', 'me', 'whom', 'a', 'such', 'it', 'same', 'o', 'below', 're', 'she', 'further', 'which', 'this', 'during', "that'll", 'these', 'above', 'itself', 'yourselves', 'while', 'so', 'my', 'mustn', 'shouldn', 'too', 'can', "you've", 'now', "hasn't", 'didn', 'what', 'shan', 'have', "aren't", 'before', 'its', "haven't", 'ourselves', 'is', 'because', 'hadn', 've', 'again', 'down', "it's", 'wasn', 'no', "doesn't", "hadn't", 'has', "don't", 'doing', 'over', 'to', 'both', 'being', 'm', 'our', 'wouldn', 'yourself', 'between', 'herself', 'at', 'out', 'myself', 'll', 'or', 'mightn', 'than', "weren't", '

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [123]:
nltk.download('punkt')

# Tokenize the text, i.e., break it down into individual words (tokens)
from nltk.tokenize import word_tokenize
tokenizedText = word_tokenize(textWOPunctuation, language='english') # use the English tokenizer model
print("List of words:\n",tokenizedText)

List of words:
 ['the', 'tight', 'ring', 'of', 'security', 'that', 'surrounds', 'the', 'seat', 'of', 'the', 'russian', 'presidency', 'was', 'punctured', 'in', 'dramatic', 'fashion', 'by', 'what', 'appeared', 'to', 'be', 'two', 'attempted', 'drone', 'strikes', 'in', 'the', 'early', 'hours', 'of', 'wednesday', 'morning', 'but', 'until', 'the', 'kremlin', 'chose', 'to', 'publicize', 'the', 'incident', 'around', 'hours', 'later', 'social', 'media', 'footage', 'of', 'the', 'incident', 'had', 'gained', 'little', 'attention', 'why', 'russia', 'decided', 'to', 'reveal', 'the', 'security', 'breach', 'is', 'unclear', 'but', 'in', 'a', 'five-paragraph', 'statement', 'on', 'wednesday', 'the', 'kremlin', 'made', 'the', 'incendiary', 'claim', 'that', 'the', 'drones', 'were', 'an', 'assassination', 'attempt', 'launched', 'by', 'ukraine', 'on', 'the', 'the', 'russian', 'president', 'vladimir', 'putin', 'kyiv', 'forcefully', 'denied', 'the', 'claim', 'kyiv', 'was', 'bombarded', 'with', 'missiles', 'in'

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
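
`word_tokenize` separates possessive and contraction endings from their words, which is why tokens such as `'s` show up in the later outputs. The sketch below imitates that behaviour for lower-cased text with a single regular expression. It is only a rough approximation for illustration, not NLTK's algorithm, and `simple_tokenize` is a hypothetical helper.

```python
import re

# A word (optionally hyphenated), or an apostrophe followed by letters ("'s")
TOKEN_PATTERN = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*|'[a-z]+")

def simple_tokenize(text):
    """Split lower-cased text into word tokens and clitics such as 's."""
    return TOKEN_PATTERN.findall(text)

print(simple_tokenize("russia's claims in a five-paragraph statement"))
```

Hyphenated words stay together while `russia's` becomes two tokens, mirroring what the notebook's outputs show.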


In [124]:
# Let's create a list with all words that are not part of the stop words list

cleanedText = []
for t in tokenizedText:
    if t not in stop_words:
        cleanedText.append(t)
print("Text without stopwords:\n",cleanedText) 

Text without stopwords:
 ['tight', 'ring', 'security', 'surrounds', 'seat', 'russian', 'presidency', 'punctured', 'dramatic', 'fashion', 'appeared', 'two', 'attempted', 'drone', 'strikes', 'early', 'hours', 'wednesday', 'morning', 'kremlin', 'chose', 'publicize', 'incident', 'around', 'hours', 'later', 'social', 'media', 'footage', 'incident', 'gained', 'little', 'attention', 'russia', 'decided', 'reveal', 'security', 'breach', 'unclear', 'five-paragraph', 'statement', 'wednesday', 'kremlin', 'made', 'incendiary', 'claim', 'drones', 'assassination', 'attempt', 'launched', 'ukraine', 'russian', 'president', 'vladimir', 'putin', 'kyiv', 'forcefully', 'denied', 'claim', 'kyiv', 'bombarded', 'missiles', 'hours', 'following', 'russia', "'s", 'claims', 'keeping', 'putin', "'s", 'historic', 'willingness', 'strike', 'ukrainian', 'cities', 'alleged', 'act', 'provocation', 'many', 'details', 'incident', 'remain', 'murky', "'s", 'know', 'questions', 'remain', 'happened', 'moscow', 'said', 'allege
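
The filtering loop can also be written as a list comprehension, and the stop word set can be extended with corpus-specific tokens. The output above still contains `'s`, so that token is evidently not in NLTK's list. The sketch below uses a tiny hand-picked stop word set instead of NLTK's (an assumption, to keep it self-contained); `remove_stop_words` is a hypothetical helper.

```python
# Tiny illustrative stop word set; the notebook uses NLTK's English list instead
base_stop_words = {"the", "of", "that", "was", "in", "by", "what", "to", "be"}

# Extra tokens to drop for this corpus (an assumption for illustration)
custom_stop_words = base_stop_words | {"'s"}

def remove_stop_words(tokens, stop_words):
    """Keep only the tokens that are not in `stop_words`."""
    return [t for t in tokens if t not in stop_words]

tokens = ["the", "tight", "ring", "of", "security", "russia", "'s", "claims"]
print(remove_stop_words(tokens, custom_stop_words))
```

Using a `set` keeps each membership test fast, which matters once the same filter runs over thousands of articles.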

In [125]:
# Now, let's join the tokens back into a single string
# ' '.join() puts one space between tokens and leaves no trailing space
newText = ' '.join(cleanedText)
print("Full sentence with changes so far:\n", newText)

Full sentence with changes so far:
 tight ring security surrounds seat russian presidency punctured dramatic fashion appeared two attempted drone strikes early hours wednesday morning kremlin chose publicize incident around hours later social media footage incident gained little attention russia decided reveal security breach unclear five-paragraph statement wednesday kremlin made incendiary claim drones assassination attempt launched ukraine russian president vladimir putin kyiv forcefully denied claim kyiv bombarded missiles hours following russia 's claims keeping putin 's historic willingness strike ukrainian cities alleged act provocation many details incident remain murky 's know questions remain happened moscow said alleged attack took place early hours wednesday two unmanned aerial vehicles intercepted destroyed caused damage injury kremlin said russian president building time according kremlin spokesperson dmitry peskov videos emerged social media appearing show incident cnn a

In [126]:
# Terms often need to be replaced by other terms
# E.g., wi-fi, wifi, internet, wi fi
# In this example: replace "tub" with "bathtub"
#newText = (" "+newText+" ").replace(" tub "," bathtub ")
#print("Text after the replacements:\n", newText)

# Note: the "[", "]" and quote characters in the outputs above come from printing Python lists; they are not part of the text
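
A minimal sketch of such term replacement, using regular expressions with word boundaries rather than padding the text with spaces. The mapping in `REPLACEMENTS` and the helper `replace_terms` are hypothetical; the variants come from the comment in the cell above.

```python
import re

# Pattern -> replacement; patterns are written for lower-cased text
REPLACEMENTS = {
    r"wi[- ]?fi": "internet",   # wi-fi, wifi, wi fi
    r"tub": "bathtub",
}

def replace_terms(text, replacements):
    """Replace whole-word matches of each pattern with its target term."""
    for pattern, target in replacements.items():
        # \b keeps "tub" from matching inside "bathtub"
        text = re.sub(rf"\b(?:{pattern})\b", target, text)
    return text

print(replace_terms("the wi-fi was slow and the tub was small", REPLACEMENTS))
```

The word boundaries make the replacement safe to run more than once, since an already replaced `bathtub` is not matched again.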

In [127]:
# Now, let's lemmatize the text - reduce terms to their base (dictionary) form
# Let's start by loading the necessary package
from nltk.stem.wordnet import WordNetLemmatizer
import nltk
nltk.download('wordnet')
nltk.download('omw')
nltk.download('omw-1.4')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package omw is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [128]:
# We are using the tokenized text
# Lemmatization maps a word to its dictionary form, e.g., "cars" -> "car"
# Note: lemmatize() assumes pos='n' (noun) by default, so "ran" -> "run" needs pos='v'
# and "better" -> "good" needs pos='a'
lem = WordNetLemmatizer()
lemmatizedText= []
for t in cleanedText:
    lemWord = lem.lemmatize(t)
    lemmatizedText.append(lemWord)
print("Lemmatized text :\n",lemmatizedText)
# Not that much of a difference - let's see stemming

Lemmatized text :
 ['tight', 'ring', 'security', 'surround', 'seat', 'russian', 'presidency', 'punctured', 'dramatic', 'fashion', 'appeared', 'two', 'attempted', 'drone', 'strike', 'early', 'hour', 'wednesday', 'morning', 'kremlin', 'chose', 'publicize', 'incident', 'around', 'hour', 'later', 'social', 'medium', 'footage', 'incident', 'gained', 'little', 'attention', 'russia', 'decided', 'reveal', 'security', 'breach', 'unclear', 'five-paragraph', 'statement', 'wednesday', 'kremlin', 'made', 'incendiary', 'claim', 'drone', 'assassination', 'attempt', 'launched', 'ukraine', 'russian', 'president', 'vladimir', 'putin', 'kyiv', 'forcefully', 'denied', 'claim', 'kyiv', 'bombarded', 'missile', 'hour', 'following', 'russia', "'s", 'claim', 'keeping', 'putin', "'s", 'historic', 'willingness', 'strike', 'ukrainian', 'city', 'alleged', 'act', 'provocation', 'many', 'detail', 'incident', 'remain', 'murky', "'s", 'know', 'question', 'remain', 'happened', 'moscow', 'said', 'alleged', 'attack', 'to

In [129]:
# Load the necessary package
from nltk.stem.porter import PorterStemmer

In [130]:
# Check the differences after this
# Stemming removes suffixes, e.g., "stayed" -> "stay" or "arranged" -> "arrang"
stem = PorterStemmer()
stemmedText = []
for t in cleanedText:
    stemmedWord = stem.stem(t)
    stemmedText.append(stemmedWord)
print("Stemmed text :\n",stemmedText) 

Stemmed text :
 ['tight', 'ring', 'secur', 'surround', 'seat', 'russian', 'presid', 'punctur', 'dramat', 'fashion', 'appear', 'two', 'attempt', 'drone', 'strike', 'earli', 'hour', 'wednesday', 'morn', 'kremlin', 'chose', 'public', 'incid', 'around', 'hour', 'later', 'social', 'media', 'footag', 'incid', 'gain', 'littl', 'attent', 'russia', 'decid', 'reveal', 'secur', 'breach', 'unclear', 'five-paragraph', 'statement', 'wednesday', 'kremlin', 'made', 'incendiari', 'claim', 'drone', 'assassin', 'attempt', 'launch', 'ukrain', 'russian', 'presid', 'vladimir', 'putin', 'kyiv', 'forc', 'deni', 'claim', 'kyiv', 'bombard', 'missil', 'hour', 'follow', 'russia', "'s", 'claim', 'keep', 'putin', "'s", 'histor', 'willing', 'strike', 'ukrainian', 'citi', 'alleg', 'act', 'provoc', 'mani', 'detail', 'incid', 'remain', 'murki', "'s", 'know', 'question', 'remain', 'happen', 'moscow', 'said', 'alleg', 'attack', 'took', 'place', 'earli', 'hour', 'wednesday', 'two', 'unman', 'aerial', 'vehicl', 'intercept'
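
To make the difference between the two approaches concrete, the sketch below compares the first 17 tokens copied from the three outputs above and lists the positions where a token changed. `changed_pairs` is a hypothetical helper; nothing here calls NLTK.

```python
# First 17 tokens copied from the outputs above
cleaned = ['tight', 'ring', 'security', 'surrounds', 'seat', 'russian',
           'presidency', 'punctured', 'dramatic', 'fashion', 'appeared',
           'two', 'attempted', 'drone', 'strikes', 'early', 'hours']
lemmatized = ['tight', 'ring', 'security', 'surround', 'seat', 'russian',
              'presidency', 'punctured', 'dramatic', 'fashion', 'appeared',
              'two', 'attempted', 'drone', 'strike', 'early', 'hour']
stemmed = ['tight', 'ring', 'secur', 'surround', 'seat', 'russian',
           'presid', 'punctur', 'dramat', 'fashion', 'appear',
           'two', 'attempt', 'drone', 'strike', 'earli', 'hour']

def changed_pairs(before, after):
    """Return (before, after) pairs for positions where the token changed."""
    return [(b, a) for b, a in zip(before, after) if b != a]

lemma_changes = changed_pairs(cleaned, lemmatized)
stem_changes = changed_pairs(cleaned, stemmed)
print(len(lemma_changes), "changed by lemmatization:", lemma_changes)
print(len(stem_changes), "changed by stemming:", stem_changes)
```

In this sample the lemmatizer only normalizes a few plural-looking forms and always returns real words, while the stemmer changes more tokens and often produces non-words such as `secur` or `earli`.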