# IRWA Final Project
## **Part 1: Text processing**

Authors:


*   Malena Díaz - u172961
*   Cristina Galvez - u172954






### **Section 1: Importing the data**

First, we install the `word2number` library, which is not preinstalled in most Python environments.

In [1]:
!pip install word2number

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting word2number
  Downloading word2number-1.1.zip (9.7 kB)
Building wheels for collected packages: word2number
  Building wheel for word2number (setup.py) ... done
  Created wheel for word2number: filename=word2number-1.1-py3-none-any.whl size=5582 sha256=ea611a89e79e5c4b650c5300bc96dd4ae432e316f017e8fdac5398bee0dcf710
  Stored in directory: /root/.cache/pip/wheels/4b/c3/77/a5f48aeb0d3efb7cd5ad61cbd3da30bbf9ffc9662b07c9f879
Successfully built word2number
Installing collected packages: word2number
Successfully installed word2number-1.1


We import the libraries used in this project. Make sure they are installed; otherwise the import cell will raise an error.

In [2]:
import json
import string
import re
import csv
from word2number import w2n
import nltk
nltk.download('stopwords')
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
import datetime

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


This notebook is designed to run in Google Colab so that it can read files from Google Drive. Please allow Colab to access your Google Drive when prompted.

In [3]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


**IMPORTANT!**

For the program to work correctly, set the variable `data_path` below to the location of the data. Make sure the document *tw_hurricane_data.json* is stored there.

In [4]:
data_path = 'drive/Shareddrives/IRWA/Project/data/tw_hurricane_data.json' 
with open(data_path) as fp:
    data = fp.readlines()

**IMPORTANT!**

Also set the variable `map_path` to the path of the map file (*tweet_document_ids_map.csv*). Despite its extension, the code reads this file as tab-separated.

In [5]:
# import the map: tweet id => document name
doc_id_dict = {}
map_path = 'drive/Shareddrives/IRWA/Project/data/tweet_document_ids_map.csv'
with open(map_path) as map_file:
    tsv_reader = csv.reader(map_file, delimiter="\t")  # the file is tab-separated
    for line in tsv_reader:
      (doc, tweet_id) = line  # each row holds: document name, tweet id
      doc_id_dict[int(tweet_id)] = doc

Check that the files have been read correctly:

In [6]:
print("Total number of Tweets in the dataset: {}".format(len(data)))
print("Total number of Tweets in the map: {}".format(len(doc_id_dict)))

Total number of Tweets in the dataset: 4000
Total number of Tweets in the map: 4000
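
Matching counts do not by themselves guarantee that every tweet id has an entry in the map. A stricter check could look roughly like the sketch below. It uses small in-memory stand-ins for the two files, and the ids, document names and the variable names `id_to_doc` and `missing_ids` are made up for illustration.

```python
import csv
import io
import json

# Hypothetical stand-ins for the two input files (tab-separated map, one JSON object per line).
map_text = "doc_1\t101\ndoc_2\t102\n"
data_text = '{"id": 101, "full_text": "a"}\n{"id": 103, "full_text": "b"}\n'

# Build the tweet id -> document name map, in the same way as the cell above.
id_to_doc = {}
for doc, tweet_id in csv.reader(io.StringIO(map_text), delimiter="\t"):
    id_to_doc[int(tweet_id)] = doc

# Parse each line and collect the ids that have no entry in the map.
tweets = [json.loads(line) for line in data_text.splitlines()]
missing_ids = [t["id"] for t in tweets if t["id"] not in id_to_doc]

print(missing_ids)  # [103]
```

An empty `missing_ids` list would mean that every tweet can be resolved to a document name.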


### **Section 2: Extracting main fields of the dataset**

The variable `data` contains a list of strings, each holding the information of one tweet as a JSON object. We parse each string into a Python dictionary.

In [7]:
data = [json.loads(x) for x in data] # transform each tweet from string to dictionary

As required by the project statement, we extract the following fields: id, tweet, username, date, hashtags, number of likes, number of retweets and the tweet URL.

In [8]:
def create_struct(data):
  """
  Extract the fields id, tweet, username, date, hashtag, number of likes, number of retweets and tweet url from each tweet.
  
  Argument:
  data -- A list of tweets (dictionaries) that contain the keys id, full_text, user, created_at, entities and retweet_count. user and entities
  are both dictionaries: user contains the keys screen_name and favourites_count, and entities contains the key hashtags.

  Returns:
  collection -- A dictionary whose keys are document names (looked up in the global doc_id_dict). Each value is another dictionary containing the fields
  of interest.

  """
  collection = {}     # dictionary of dictionaries

  for tweet in data:
    doc_dict = {}
    id = tweet['id']
    doc_name = doc_id_dict[id]

    doc_dict['id'] = id
    doc_dict['tweet'] = tweet['full_text']
    doc_dict['username'] = tweet['user']['screen_name']
    doc_dict['date'] = datetime.datetime.strptime(tweet['created_at'], '%a %b %d %H:%M:%S %z %Y') # convert to a timezone-aware datetime
    doc_dict['hashtag'] = [x['text'] for x in tweet['entities']['hashtags']] # text of each hashtag
    doc_dict['likes'] = tweet['user']['favourites_count'] # note: in the standard Twitter API layout this counts the tweets the author has liked; likes received by the tweet itself are in tweet['favorite_count']
    doc_dict['retweet'] = tweet['retweet_count'] # number of retweets
    doc_dict['url'] = "https://twitter.com/" +  doc_dict['username'] + "/status/" + str(id) # https://twitter.com/[screen name]/status/[Tweet ID]
    collection[doc_name] = doc_dict

  return collection

Using the function defined above, we transform our data into the structure `collection`, a dictionary of dictionaries.

In [9]:
collection = create_struct(data)  # transform data
for field, value in collection['doc_2'].items():  # show one document as a check
  print(field, ':', value)

id : 1575918151862304768
tweet : Our hearts go out to all those affected by #HurricaneIan. We wish everyone on the roads currently braving the conditions safe travels. 💙
username : lytx
date : 2022-09-30 18:39:01+00:00
hashtag : ['HurricaneIan']
likes : 2633
retweet : 0
url : https://twitter.com/lytx/status/1575918151862304768
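
The date and URL fields can be reproduced in isolation. The sketch below is illustrative only: the helper names `parse_created_at` and `tweet_url` are hypothetical, and the raw `created_at` string is reconstructed from the parsed date shown above rather than copied from the dataset.

```python
import datetime

# Timestamp layout used in create_struct, e.g. "Fri Sep 30 18:39:01 +0000 2022".
TWITTER_DATE_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def parse_created_at(created_at):
    """Parse a `created_at` string into a timezone-aware datetime."""
    return datetime.datetime.strptime(created_at, TWITTER_DATE_FORMAT)


def tweet_url(screen_name, tweet_id):
    """Build the public URL of a tweet from its author and id."""
    return "https://twitter.com/{}/status/{}".format(screen_name, tweet_id)


parsed = parse_created_at("Fri Sep 30 18:39:01 +0000 2022")
print(parsed)                                   # 2022-09-30 18:39:01+00:00
print(tweet_url("lytx", 1575918151862304768))   # https://twitter.com/lytx/status/1575918151862304768
```

Because the format includes `%z`, the resulting datetime carries its UTC offset, which makes later comparisons between tweets unambiguous.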


### **Section 3: Cleaning the dataset**

In this part of the project, we clean the text of each document: among other steps, we remove symbols and URLs and convert everything to lowercase.

First, we pay special attention to turning written numbers into digits. We define separate functions for this purpose:

In [10]:
# to handle turning written numbers into digits

def is_written_num(word):
  # returns True only if it is a written number
  # returns False if it is a digit string or something else
  try:
    w2n.word_to_num(word)
  except Exception:  # word2number could not parse the word as a number
    return False
  try:
    int(word)
    return False  # already written with digits
  except ValueError:
    return True

# given an array of strings, takes all written numbers and makes them digits
def make_numbers_digits(text):
  consec_num = False
  num_string = ''
  result = []

  for word in text:
    if not is_written_num(word):
      if consec_num:
        result.append(str(w2n.word_to_num(num_string)))
        consec_num = False
        num_string = ''
      result.append(word)
    else:
      consec_num = True
      num_string += ' ' + word

  if consec_num:
    result.append(str(w2n.word_to_num(num_string)))

  return result

In [11]:
# Check the function is working
example = ['there', 'are', 'five', 'hundred', 'twenty', 'seven', 'ideas', 'for', 'the', 'eighty', 'nine', 'people', 'to', 'process', 'one', 'more', 'time', '66', '345']
print(make_numbers_digits(example))

['there', 'are', '527', 'ideas', 'for', 'the', '89', 'people', 'to', 'process', '1', 'more', 'time', '66', '345']
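
The core idea of `make_numbers_digits` is to collect each run of consecutive number words and convert the run as a whole. The sketch below shows that grouping step on its own using only the standard library. It is a simplified illustration, not the `word2number` implementation: the vocabulary is deliberately small, and the names `is_number_word`, `words_to_int` and `replace_number_runs` are hypothetical.

```python
from itertools import groupby

# Small illustrative vocabulary; word2number covers many more cases.
UNITS = {w: i for i, w in enumerate(
    "zero one two three four five six seven eight nine ten eleven twelve "
    "thirteen fourteen fifteen sixteen seventeen eighteen nineteen".split())}
TENS = {w: 10 * (i + 2) for i, w in enumerate(
    "twenty thirty forty fifty sixty seventy eighty ninety".split())}
SCALES = {"hundred", "thousand"}


def is_number_word(word):
    return word in UNITS or word in TENS or word in SCALES


def words_to_int(words):
    """Convert a run of number words, e.g. ['five', 'hundred', 'two'] -> 502."""
    total, current = 0, 0
    for word in words:
        if word in UNITS:
            current += UNITS[word]
        elif word in TENS:
            current += TENS[word]
        elif word == "hundred":
            current = max(current, 1) * 100
        else:  # "thousand"
            total += max(current, 1) * 1000
            current = 0
    return total + current


def replace_number_runs(tokens):
    """Replace each maximal run of number words with a single digit string."""
    result = []
    for is_num, run in groupby(tokens, key=is_number_word):
        run = list(run)
        result.extend([str(words_to_int(run))] if is_num else run)
    return result


tokens = ["there", "are", "five", "hundred", "twenty", "seven", "ideas", "66"]
print(replace_number_runs(tokens))  # ['there', 'are', '527', 'ideas', '66']
```

Tokens that are already digits, such as `'66'`, are not number words and pass through unchanged, which matches the behaviour checked in the example above.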


The following function treats hashtags. Many hashtags consist of several joined words, each starting with a capital letter. The function drops the leading `#` and inserts a space before every capital letter:

`#hurricaneIan --> 'hurricane Ian'`

In [12]:
def treat_hashtags(text):
  remove_hashtag = text[1:]
  split_lower_upper = re.sub(r"([A-Z])", r" \1", remove_hashtag)
  return split_lower_upper
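
One limitation of inserting a space before every capital letter is that an all-caps hashtag such as `#FEMA` becomes ` F E M A`, and those single letters are later discarded by the cleaning step. A possible alternative, sketched below under the assumption that acronyms should stay intact, splits only at word boundaries. The function name `split_hashtag` is hypothetical and this is not the approach used in the notebook.

```python
import re


def split_hashtag(hashtag):
    """Split a CamelCase hashtag into words, keeping acronyms intact."""
    body = hashtag.lstrip("#")
    # Break between a lowercase letter or digit and an uppercase letter (hurricane|Ian),
    # and between an acronym and a following capitalised word (FEMA|Response).
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])", " ", body)


print(split_hashtag("#HurricaneIan"))  # Hurricane Ian
print(split_hashtag("#FEMAResponse"))  # FEMA Response
print(split_hashtag("#scwx"))          # scwx
```

The pattern uses only zero-width lookarounds, so no characters are consumed and the original letters are preserved.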

We define the function `build_terms()` to turn a given string into a list of clean tokens. It applies the following transformations:
- remove URLs
- remove user names
- deal with hashtags
- deal with dashes
- deal with currencies
- remove symbols
- transform to lowercase
- delete stop-words
- perform stemming
- turn written numbers into digits
- remove single letter words

In [13]:
def build_terms(line):
    """
    Preprocess the tweet text: clean it, remove stop words, apply stemming,
    transform to lowercase and return the tokens of the text.
    
    Argument:
    line -- string (text) to be preprocessed
    
    Returns:
    line -- a list of tokens corresponding to the input text after the preprocessing
    """

    # define stemmer and reference lists
    stemmer = PorterStemmer()
    stop_words = set(stopwords.words("english"))
    whitelist = string.ascii_letters + string.digits + ' '

    # clean text
    line = re.sub(r'http\S+', '', line) # remove urls
    line = re.sub(r'@\S+', '', line) # remove mentioned users
    line = ' '.join([treat_hashtags(i) if i.startswith("#") else i for i in line.split()]) # deal with hashtags
    line = line.replace("-", " ") # deal with dashes
    line = line.replace("$", " dollars ") # deal with currencies (spaces on both sides so "$5" does not become "dollars5")
    line = line.replace("€", " euros ") # deal with currencies
    line = ''.join([char if char in whitelist else ' ' for char in line]) # remove all symbols (leave only letters, digits and spaces)
    line = line.lower() # transform to lowercase
    line = line.split() # tokenize the text to get a list of terms
    line = [x for x in line if x not in stop_words] # eliminate the stop words
    line = make_numbers_digits(line) # turn written numbers into digits (before stemming, which would turn e.g. 'twenty' into 'twenti')
    line = [stemmer.stem(x) for x in line] # perform stemming
    line = [x for x in line if (len(x) > 1 or x.isdigit())] # remove single letters

    return line
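
To see what the regex and whitelist steps do before stop-word removal and stemming, the sketch below reproduces only that part of the pipeline with the standard library. The function name `basic_clean` and the tiny `STOP_WORDS` set are hypothetical stand-ins: the notebook itself uses NLTK's English stop-word list and the Porter stemmer.

```python
import re
import string

# Tiny illustrative stop-word set; the notebook uses NLTK's English list instead.
STOP_WORDS = {"so", "this", "will", "over", "until"}
ALLOWED = set(string.ascii_letters + string.digits + " ")


def basic_clean(text):
    """Remove URLs and mentions, strip symbols, lowercase, tokenize, drop stop words."""
    text = re.sub(r"http\S+", "", text)   # remove URLs
    text = re.sub(r"@\S+", "", text)      # remove mentioned users
    text = "".join(ch if ch in ALLOWED else " " for ch in text)  # symbols become spaces
    return [tok for tok in text.lower().split() if tok not in STOP_WORDS]


sample = "So this will keep spinning over us until 7 pm...go away already. @user https://t.co/x"
print(basic_clean(sample))
# ['keep', 'spinning', 'us', '7', 'pm', 'go', 'away', 'already']
```

Replacing symbols with spaces rather than deleting them keeps words such as `pm...go` from being glued together into a single token.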

Check that the text is being cleaned properly:

In [14]:
sample_texts = [collection[doc]['tweet'] for doc in collection][:5]

for i in range(5):
  print('Tweet {}:\n{}\n{}\n'.format(i+1, sample_texts[i], build_terms(sample_texts[i])))

Tweet 1:
So this will keep spinning over us until 7 pm…go away already. #HurricaneIan https://t.co/VROTxNS9rz
['keep', 'spin', 'us', '7', 'pm', 'go', 'away', 'alreadi', 'hurrican', 'ian']

Tweet 2:
Our hearts go out to all those affected by #HurricaneIan. We wish everyone on the roads currently braving the conditions safe travels. 💙
['heart', 'go', 'affect', 'hurrican', 'ian', 'wish', 'everyon', 'road', 'current', 'brave', 'condit', 'safe', 'travel']

Tweet 3:
Kissimmee neighborhood off of Michigan Ave. 
#HurricaneIan https://t.co/jf7zseg0Fe
['kissimme', 'neighborhood', 'michigan', 'ave', 'hurrican', 'ian']

Tweet 4:
I have this one tree in my backyard that scares me more than the poltergeist tree when it’s storming and windy like this. #scwx #HurricaneIan
['1', 'tree', 'backyard', 'scare', 'poltergeist', 'tree', 'storm', 'windi', 'like', 'scwx', 'hurrican', 'ian']

Tweet 5:
@AshleyRuizWx @Stephan89441722 @lilmizzheidi @Mr__Sniffles @winknews @DylanFedericoWX @julianamwx @sydneypersing

The last step is to use the previously defined function to clean all the tweets in `collection`:

In [15]:
for doc in collection:
  collection[doc]['tweet'] = build_terms(collection[doc]['tweet'])
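
Note that this loop overwrites the raw text with its tokens, so running the cell a second time would pass a list to `build_terms` and fail inside `re.sub`. If the original text is needed later (for example, to display results), one option is to store the tokens under a separate key. The sketch below illustrates the idea; the key name `terms` and the helper `add_terms` are hypothetical, and `str.split` stands in for `build_terms`.

```python
def add_terms(collection, tokenize):
    """Store tokens under a separate 'terms' key, leaving the raw 'tweet' text untouched."""
    for doc in collection.values():
        doc["terms"] = tokenize(doc["tweet"])
    return collection


# Hypothetical two-document collection; str.split stands in for build_terms.
demo = {"doc_1": {"tweet": "go away already"}, "doc_2": {"tweet": "stay safe"}}
add_terms(demo, str.split)
add_terms(demo, str.split)  # safe to run twice: 'tweet' is still a string

print(demo["doc_1"])  # {'tweet': 'go away already', 'terms': ['go', 'away', 'already']}
```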