**DELIVERY 1: TEXT PROCESSING**

HARRISON LIAN   **u196989**    

HUGO DA SILVA   **u191838**   

BRAYAN GONZÁLEZ  **u172820**

**As a first step, you must pre-process the documents by:**

- Removing stop-words
- Tokenization
- Removing punctuation
- Stemming
- Anything else you think is needed

**IMPORTANT** - decide whether to keep hashtags out of some pre-processing steps (e.g., removing the “#” from the word), since it may be useful to store them as separate terms in the inverted index.

**The documents are a set of tweets from the World Health Organization (@WHO) and you can see an example document in the appendix.**
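One way to approach the hashtag question above is to collect hashtags in their own list while still cleaning ordinary words. The following is only a minimal sketch using the standard library; the helper name `split_hashtags` is hypothetical and is not part of the pipeline used in this notebook.

```python
import re

def split_hashtags(text):
    """Return (plain_tokens, hashtags) for one tweet (hypothetical helper)."""
    plain_tokens, hashtags = [], []
    for raw in text.lower().split():
        # Remember whether the token was a hashtag before stripping punctuation
        is_hashtag = raw.startswith("#")
        word = re.sub(r"\W+", "", raw)
        if not word:
            continue  # token was only punctuation or symbols
        if is_hashtag:
            hashtags.append("#" + word)  # keep the marker so it stays a distinct term
        else:
            plain_tokens.append(word)
    return plain_tokens, hashtags

print(split_hashtags("Start learning today! #Ready4Response"))
```

Keeping the `#` prefix means a hashtag such as `#ready4response` and a plain word with the same letters would be indexed as different terms, which is one possible design choice among several.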

In [1]:
# Download the NLTK stop-word lists
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/datalore/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [2]:
# Import necessary libraries
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
import json
import string
import re

In [3]:
# Path where our .txt is located
path = 'dataset_tweets_WHO.txt'

# Parse the JSON file into a Python dict
with open(path) as f:
    tweets_json = json.load(f)

In [215]:
# Build the stemmer and the stop-word set once, not on every call
stemmer = PorterStemmer()
stop_words = set(stopwords.words("english"))

def text_processing(line):

    # Keep only printable ASCII characters: this drops emojis and symbols from
    # other alphabets, and also accented letters (e.g. 'está' becomes 'est')
    line = "".join(c for c in line if c in string.printable)

    # Lowercase and tokenize on whitespace
    line = line.lower().split()

    # Strip punctuation: \W+ matches runs of non-alphanumeric characters,
    # so this also removes the '#' from hashtags
    line = [re.sub(r'\W+', '', word) for word in line]

    # Drop leftover HTML entities (&amp; is now 'amp', q&amp;a is now 'qampa'),
    # URLs (any token starting with 'http'), and empty strings left behind by
    # the emoji, symbol or punctuation removal
    line = [word for word in line
            if word and not re.match(r"^(?:qampa$|amp$|http)", word)]

    # Remove the stop-words
    line = [word for word in line if word not in stop_words]

    # Stem each remaining token
    line = [stemmer.stem(word) for word in line]

    return line
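
To see why tokens such as `amp`, `qampa` and anything starting with `http` are filtered, it can help to run the punctuation and filtering steps on their own. The sketch below reproduces only those two steps without NLTK; the function name `clean_tokens` and the sample tweet are made up for illustration.

```python
import re

def clean_tokens(text):
    """Lowercase, split, strip punctuation and filter, as in the steps above."""
    # After stripping, '&amp;' becomes 'amp', 'q&amp;a' becomes 'qampa',
    # and 'https://t.co/abc123' becomes 'httpstcoabc123'
    tokens = [re.sub(r"\W+", "", word) for word in text.lower().split()]
    return [word for word in tokens
            if word and not re.match(r"^(?:qampa$|amp$|http)", word)]

sample = "Q&amp;A on #COVID19: https://t.co/abc123 &amp; more"  # invented example
print(clean_tokens(sample))  # ['on', 'covid19', 'more']
```

Stop-word removal and stemming are left out here, so `on` survives in this sketch even though the full function would drop it.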

In [214]:

for key in tweets_json:
    print(text_processing(tweets_json[key]['full_text']))



['intern', 'day', 'disast', 'risk', 'reduct', 'openwho', 'launch', 'multiti', 'core', 'curriculum', 'help', 'equip', 'compet', 'need', 'work', 'within', 'public', 'health', 'emerg', 'respons', 'start', 'learn', 'today', 'ready4respons']
['covid19', 'shown', 'health', 'emerg', 'disast', 'affect', 'entir', 'commun', 'especi', 'weak', 'health', 'system', 'vulner', 'popul', 'like', 'migrant', 'indigen', 'peopl', 'live', 'fragil', 'humanitarian', 'condit']
['intern', 'day', 'disast', 'risk', 'reduct', 'better', 'respond', 'emerg', 'countri', 'must', 'invest', 'health', 'care', 'system', 'achiev', 'gender', 'equiti', 'protect', 'marginalis', 'group', 'ensur', 'readi', 'equit', 'access', 'suppli', 'strong', 'resili', 'health', 'system']
['rt', 'whoafro', 'congratul', 'algeria', 'algeria', '16th', 'countri', 'africa', 'reach', 'mileston', 'fulli', 'vaccin', '10', 'pop']
['rt', 'opsom', 'si', 'est', 'completament', 'vacunado', 'pued', 'contraer', 'covid19', 'importa', 'si', 'est', 'vacunado', '