# **Project - Part 1: Text Processing**

You are provided with a document corpus which is a set of tweets related to Hurricane Ian
(tw_hurricane_data.json). You can see an example document in the appendix.
As a first step, you must pre-process the documents by:

* Removing stop words
* Tokenization
* Removing punctuation marks
* Stemming
* Anything else you think is needed (bonus point)


HINTS:
1. Take into account that for future queries, the final output must return (when
present) the following information for each of the selected documents: Id | Tweet |
Username | Date | Hashtags | Likes | Retweets | Url (here the “Url” means the
tweet link).

2. Think about how to handle hashtags in your pre-processing steps (e.g.,
removing the “#” from the word), since it may be useful to include them as separate terms
in the inverted index.

3. Suggested library for stemming and stop words: nltk

Make sure you map the tweet Ids to the document Ids, since the document Ids will be
used in the evaluation stage of the project (tweet_document_ids_map.csv).
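
The Id mapping mentioned above can be sketched as follows. This is only a minimal illustration: the layout of tweet_document_ids_map.csv is not shown in this notebook, so the tab delimiter, the `doc_id, tweet_id` column order, the sample rows, and the helper name `load_doc_id_map` are all assumptions to adjust against the real file.

```python
import csv
import io

def load_doc_id_map(handle, delimiter="\t"):
    """Return {tweet_id (int): doc_id (str)} read from an open text handle.

    Assumes one 'doc_id<delimiter>tweet_id' pair per row and no header row.
    """
    mapping = {}
    for row in csv.reader(handle, delimiter=delimiter):
        if len(row) < 2:
            continue  # skip blank or malformed rows
        doc_id, tweet_id = row[0].strip(), row[1].strip()
        mapping[int(tweet_id)] = doc_id
    return mapping

# Hypothetical sample rows standing in for the real CSV file
sample = io.StringIO("doc_1\t1001\ndoc_2\t1002\n")
tweet_to_doc = load_doc_id_map(sample)
print(tweet_to_doc[1001])
```

With the real file, the same function could be called on `open(path, newline='')` and the resulting dictionary used to look up the document Id from each tweet's `id` field.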


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
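import collections
import json
import re
from array import array
from collections import defaultdict

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from tabulate import tabulate

# Download the NLTK resources needed for stop words and tokenization
nltk.download('stopwords')
nltk.download('punkt')

# Snowball stemmer for English
stemmer = nltk.stem.SnowballStemmer('english')
# From here on, `stopwords` is a plain set of words, not the NLTK corpus reader
stopwords = set(stopwords.words('english'))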

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [None]:
# Path to the corpus file (one JSON-encoded tweet per line)
file_name = '/content/drive/Shareddrives/IRWA/PROJECT/data/tw_hurricane_data.json'
# Parse each line with json.loads to obtain the list of tweets
with open(file_name, 'r') as f:
    lines = [json.loads(line) for line in f]
# Print the media URL of the first tweet as a sanity check
print(lines[0]['entities']['media'][0]['url'])

https://t.co/VROTxNS9rz


In [None]:
# Print total number of tweets
print("Total number of Tweets: {}".format(len(lines)))

Total number of Tweets: 4000


In [None]:
class Tweet:
    """Container for the fields that must be returned for each document."""

    def __init__(self, id, tweet, username, date, hashtags, likes, retweets, url):
        self.id = id
        self.tweet = tweet
        self.username = username
        self.date = date
        self.hashtags = hashtags
        self.likes = likes
        self.retweets = retweets
        self.url = url

    def aslist(self):
        return [self.id, self.tweet, self.username, self.date, self.hashtags, self.likes, self.retweets, self.url]

    def __iter__(self):
        # Allows a Tweet to be iterated as a row of values (used when tabulating)
        return iter(self.aslist())


tweets = []

for line in lines:
    entities = line['entities']
    # Use the first media URL when the tweet has attached media, else an empty string
    url = entities['media'][0]['url'] if 'media' in entities else ""
    # Collect the text of every hashtag in the tweet
    hashtags = [h.get('text') for h in entities['hashtags']]

    tweets.append(Tweet(line['id'],
                        line['full_text'],
                        line['user']['screen_name'],
                        line['created_at'],
                        hashtags,
                        line['favorite_count'],
                        line['retweet_count'],
                        url))
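
Hint 1 says that “Url” means the tweet link, whereas the loop above stores the first media URL, which is empty for tweets without media. A possible alternative is sketched below, assuming the usual `https://twitter.com/<screen_name>/status/<id>` permalink pattern; the helper name `tweet_link` and the sample record are hypothetical and not part of the original notebook.

```python
def tweet_link(screen_name, tweet_id):
    """Build a tweet permalink from the author's screen name and the tweet id."""
    return f"https://twitter.com/{screen_name}/status/{tweet_id}"

# Hypothetical record with the same field names used in the loop above
example = {"id": 1234567890, "user": {"screen_name": "some_user"}}
link = tweet_link(example["user"]["screen_name"], example["id"])
print(link)
```

Because it only needs `id` and `user.screen_name`, this would give every tweet a non-empty URL regardless of whether it has media attached.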

In [None]:
# Print total number of tweets
print("Total number of Tweets: {}".format(len(tweets)))

Total number of Tweets: 4000


In [None]:
# Remove white spaces
def remove_white_space(text):
    return ' '.join(text.split())
# Remove stopwords
def remove_stopwords(words):
    return [w for w in words if w.lower() not in stopwords]
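# Remove emojis and other pictographic symbols
def remove_emojis(data):
    emoj = re.compile("["
        "\U0001F600-\U0001F64F"  # emoticons
        "\U0001F300-\U0001F5FF"  # symbols & pictographs
        "\U0001F680-\U0001F6FF"  # transport & map symbols
        "\U0001F1E0-\U0001F1FF"  # flags
        "\U00002500-\U00002BEF"  # box drawing, geometric shapes, misc. symbols, arrows
        "\U00002702-\U000027B0"  # dingbats
        "\U000024C2-\U0001F251"  # broad range: also removes CJK and other scripts above U+24C2
        "\U0001f926-\U0001f937"  # people and gesture emojis
        "\U00010000-\U0010ffff"  # all supplementary planes
        "\u2640-\u2642"          # female sign through male sign
        "\u2600-\u2B55"          # misc. symbols through heavy large circle
        "\u200d"                 # zero-width joiner
        "\u23cf"                 # eject symbol
        "\u23e9"                 # fast-forward symbol
        "\u231a"                 # watch
        "\ufe0f"                 # variation selector-16
        "\u3030"                 # wavy dash
        "]+", re.UNICODE)
    return emoj.sub(r'', data)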
# Remove punctuation
def remove_punctuation(data):
    return re.sub(r'[^\w\s]', '', data)
# Remove numbers
def remove_numbers(data):
    return re.sub(r'[0-9]', '', data)

def additional_steps(words):
    # Punctuation has already been stripped at this point, so hashtags arrive
    # as plain keywords (no "#") and links as single tokens such as "httpstco..."
    f_words = []
    for word in words:
        # Drop link tokens to keep only the text of the tweet
        # (some tweets provide photo URLs at the end of the full text)
        if word.startswith("https"):
            continue
        f_words.append(word)

    return f_words
# Preprocess text
def preprocess(text):
    # Treat line breaks as word separators
    text = text.replace('\n', ' ')
    text = remove_emojis(text)
    text = remove_punctuation(text)
    text = remove_numbers(text)
    text = remove_white_space(text)
    words = nltk.tokenize.word_tokenize(text)
    # Remove stop words before stemming: the stop-word list holds unstemmed forms
    words = remove_stopwords(words)
    words = [stemmer.stem(word) for word in words]
    words = additional_steps(words)
    text = " ".join(words)
    return text
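
Hint 2 suggests treating hashtags as separate terms. One way to do this is sketched below: a camel-case hashtag such as `#HurricaneIan` is kept as a whole term and also split into its component words. The helper name `hashtag_terms` and the camel-case heuristic are assumptions for illustration, not part of the original notebook, and the heuristic will not split all-lowercase hashtags.

```python
import re

def hashtag_terms(hashtag):
    """Return the lowercased hashtag plus its camel-case components, without duplicates."""
    tag = hashtag.lstrip("#")
    # Split on capitalised words, lowercase runs, uppercase runs and digit runs
    parts = re.findall(r"[A-Z]?[a-z]+|[A-Z]+|\d+", tag)
    terms = [tag.lower()]
    for part in parts:
        if part.lower() not in terms:
            terms.append(part.lower())
    return terms

print(hashtag_terms("#HurricaneIan"))
print(hashtag_terms("#scwx"))
```

The resulting terms could then be stemmed and indexed alongside the tweet text, so that a query for "hurricane" also matches tweets that only mention the hashtag.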


In [None]:
# Compare the original and the pre-processed text of the first 10 tweets
for i in range(10):
    print("===================")
    print("ORIGINAL TEXT")
    print("===================")
    print(tweets[i].tweet)
    print("===================")
    print("PROCESSED TEXT")
    print("===================")
    print(preprocess(tweets[i].tweet))
    print('\n')

# Replace the text of every tweet with its pre-processed version
for tweet in tweets:
    tweet.tweet = preprocess(tweet.tweet)

ORIGINAL TEXT
So this will keep spinning over us until 7 pm…go away already. #HurricaneIan https://t.co/VROTxNS9rz
PROCESSED TEXT
keep spin us pmgo away alreadi hurricaneian


ORIGINAL TEXT
Our hearts go out to all those affected by #HurricaneIan. We wish everyone on the roads currently braving the conditions safe travels. 💙
PROCESSED TEXT
heart go affect hurricaneian wish everyon road current brave condit safe travel


ORIGINAL TEXT
Kissimmee neighborhood off of Michigan Ave. 
#HurricaneIan https://t.co/jf7zseg0Fe
PROCESSED TEXT
kissimme neighborhood michigan ave hurricaneian


ORIGINAL TEXT
I have this one tree in my backyard that scares me more than the poltergeist tree when it’s storming and windy like this. #scwx #HurricaneIan
PROCESSED TEXT
one tree backyard scare poltergeist tree storm windi like scwx hurricaneian


ORIGINAL TEXT
@AshleyRuizWx @Stephan89441722 @lilmizzheidi @Mr__Sniffles @winknews @DylanFedericoWX @julianamwx @sydneypersing @NicoleGabeTV I pray for everyone affe

In [None]:
# Final visualization of the first 10 tweets
visualization_tweets = tweets[:10]


headers = ['ID','TWEET','USERNAME','DATE','HASHTAGS','LIKES', 'RETWEETS', 'URL']
print(tabulate(visualization_tweets, headers=headers, tablefmt='fancy_grid'))


╒═════════════════════╤════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╤═══════════════╤════════════════════════════════╤═══════════════════════════════════════════════════════════════════════════════════════╤═════════╤════════════╤═════════════════════════╕
│                  ID │ TWEET                                                                                                                                                                                  │ USERNAME      │ DATE                           │ HASHTAGS                                                                              │   LIKES │   RETWEETS │ URL                     │
╞═════════════════════╪════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╪══