# IRWA Final Project 

This project builds a search engine that implements several indexing and ranking algorithms. The corpus is a file containing a set of tweets from the World Health Organization (@WHO).

It is divided into four parts:

1) Text processing
2) Indexing and ranking
3) Evaluation
4) User interface and web analytics


Students Group 9:
- Mireia Beltran (U161808)
- Cisco Orteu (U162354)
- Laura Casanovas (U161832)

#### Packages

We first import all the packages needed for text processing. 

In [2]:
# if you do not have 'nltk', the following command should work "python -m pip install nltk"
import nltk
nltk.download('stopwords')
from collections import defaultdict
from array import array
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
import math
import numpy as np
import collections
from numpy import linalg as la
import json
import regex as re  # third-party 'regex' package: "python -m pip install regex"
import string

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Lau\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


#### Load data into memory

The dataset is stored in the text file ```dataset_tweets_WHO.txt``` and contains a set of tweets in JSON format. We build ```tweets_data``` by reading the file line by line and parsing each line with ```json.loads()```, which converts a JSON string into the corresponding Python object.

In [3]:
text = open('dataset_tweets_WHO.txt', 'r')

In [4]:
tweets_data=[]
for line in text:
    tweet=json.loads(line)
    tweets_data.append(tweet)
text.close()

In [6]:
len(tweets_data[0].keys())

2399
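
A minimal sketch of the same loading step, written with a context manager so the file is closed even if parsing fails. It assumes, as the key count above suggests, that each line of the file is one JSON object mapping an index to a tweet. The helper name ```load_tweets``` and the two-tweet sample file are hypothetical and only serve the illustration.

```python
import json
import os
import tempfile

# Hypothetical two-tweet sample standing in for dataset_tweets_WHO.txt
sample = {
    "0": {"full_text": "First tweet"},
    "1": {"full_text": "Second tweet"},
}

# Write the sample to a temporary file as a single JSON line
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write(json.dumps(sample) + "\n")
    sample_path = tmp.name


def load_tweets(path):
    """Return a flat list of tweet dicts read from the file at `path`."""
    tweets = []
    with open(path, "r", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)        # assumed shape: {index: tweet, ...}
            tweets.extend(record.values())   # keep only the tweet objects
    return tweets


sample_tweets = load_tweets(sample_path)
os.remove(sample_path)  # clean up the temporary file
print(len(sample_tweets))
```

Flattening the values gives a plain list with one tweet per position, which can be convenient when only the tweet objects are needed.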

#### Dataset Creation

In [78]:
import pandas as pd

# Build a DataFrame from the parsed tweet objects (one column per tweet)
df = pd.DataFrame(tweets_data)

# Print head of DataFrame
print(df.head())

# Save the DataFrame to a CSV file
df.to_csv('df.csv')

                                                   0  \
0  {'created_at': 'Wed Oct 13 09:15:58 +0000 2021...   

                                                   1  \
0  {'created_at': 'Wed Oct 13 08:46:17 +0000 2021...   

                                                   2  \
0  {'created_at': 'Wed Oct 13 07:53:28 +0000 2021...   

                                                   3  \
0  {'created_at': 'Wed Oct 13 05:47:26 +0000 2021...   

                                                   4  \
0  {'created_at': 'Wed Oct 13 05:47:10 +0000 2021...   

                                                   5  \
0  {'created_at': 'Wed Oct 13 05:44:56 +0000 2021...   

                                                   6  \
0  {'created_at': 'Tue Oct 12 22:15:15 +0000 2021...   

                                                   7  \
0  {'created_at': 'Tue Oct 12 22:15:12 +0000 2021...   

                                                   8  \
0  {'created_at': 'Tue Oct 12 21:01:45 +

We now create a list called ```texts``` that holds the full text of one tweet in each position.

In [79]:
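texts = []
for column in df:  # iterating over a DataFrame yields its column labels
    tweet = df[column].item()  # the single cell of the column holds the tweet object
    texts.append(str(tweet['full_text']))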
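
The same extraction can be sketched without pandas by reading ```full_text``` straight from the parsed objects. The sketch below also decodes HTML entities with ```html.unescape```, since tweet texts may contain sequences such as ```&amp;``` that would otherwise leave a stray ```amp``` term once punctuation is stripped. The sample dictionary and the helper name ```extract_texts``` are hypothetical stand-ins for ```tweets_data[0]```.

```python
import html

# Hypothetical stand-in for tweets_data[0]: tweet index -> tweet object
sample_tweets = {
    "0": {"full_text": "Start learning today &amp; be #Ready4Response"},
    "1": {"full_text": "Health for all"},
}


def extract_texts(tweets_by_id):
    """Collect the full_text of every tweet, decoding HTML entities."""
    return [
        html.unescape(str(tweet["full_text"]))
        for tweet in tweets_by_id.values()
    ]


sample_texts = extract_texts(sample_tweets)
print(sample_texts[0])
```

Whether to decode entities at this stage or inside the preprocessing function is a design choice; doing it here keeps the stored texts readable.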

In [80]:
texts[0]

"It's International Day for Disaster Risk Reduction\n\n#OpenWHO has launched a multi-tiered core curriculum to help equip you with the competencies needed to work within public health emergency response.\n\nStart learning today &amp; be #Ready4Response:\n👉 https://t.co/hBFFOF0xKL https://t.co/fgZY22RWuS"

"it's international day for disaster risk reduction\n\n#openwho has launched a multi-tiered core curriculum to help equip you with the competencies needed to work within public health emergency response.\n\nstart learning today &amp; be #ready4response:\n👉 https://t.co/hbffof0xkl https://t.co/fgzy22rwus"

## 1) Text Processing

We implement the function ```build_terms(text)```.

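It takes a text as input and performs the following operations, in this order:

- Remove URLs, emojis and punctuation (the ```#``` sign is kept so that hashtags survive)
- Transform all text to lowercase
- Tokenize the text to get a list of terms
- Remove stop words
- Stem terms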

In [143]:
def build_terms(text):
    """
    Preprocess a tweet: strip URLs, punctuation and emojis, transform to
    lowercase, tokenize, remove stop words and stem.

    Argument:
    text -- string (text) to be preprocessed

    Returns:
    text -- a list of tokens corresponding to the input text after the preprocessing
    """
    stemmer = PorterStemmer()
    stop_words = set(stopwords.words("english"))

    # Remove URLs first, while they still contain their punctuation
    text = re.sub(r'http\S+', '', text)

    # Characters to strip: ASCII punctuation without '#' (so hashtags survive),
    # plus the en dash and the inverted question mark
    remove = string.punctuation.replace("#", "") + "–¿"
    # Escape the characters so each one is literal inside the character class
    pattern = r"[{}]".format(re.escape(remove))
    text = re.sub(pattern, "", text)

    # Compile a character class covering emoji and symbol ranges
    emoji_pattern = re.compile("["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
        u"\U00002702-\U000027B0"  # dingbats
        u"\U000024C2-\U0001F251"  # broad range that also covers many non-emoji characters
                           "]+", flags=re.UNICODE)
    # Delete every match of the emoji pattern
    text = emoji_pattern.sub(r'', text)

    # Transform to lowercase
    text = text.lower()
    # Tokenize the text on whitespace to get a list of terms
    text = text.split()
    # Remove the stop words
    text = [term for term in text if term not in stop_words]
    # Perform stemming
    text = [stemmer.stem(term) for term in text]

    return text

In [148]:
build_terms(texts[11])

['rt',
 'drtedro',
 'donat',
 'arent',
 'enough',
 'deliv',
 '#vaccinequ',
 'end',
 '#covid19',
 'pandem',
 'need',
 'stronger',
 'leadership',
 'ramp',
 'the…']
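
The term ```arent``` in the output above comes from the order of operations: the apostrophe is stripped before the stop word lookup, so a contraction that a stop word list stores as ```aren't``` no longer matches. One possible remedy, sketched below under assumptions, is to normalize the stop words with the same punctuation rule as the text. The mini stop list, the sample sentence and the helper names are hypothetical, and the sketch uses only the standard library rather than NLTK.

```python
import string

# Hypothetical mini stop list; a real list (for example NLTK's) is much longer
STOP_WORDS = {"aren't", "to", "the"}

# Simplified version of the rule in build_terms:
# drop ASCII punctuation but keep '#'
DROP_TABLE = str.maketrans("", "", string.punctuation.replace("#", ""))


def strip_punctuation(token):
    """Remove punctuation characters from a single token."""
    return token.translate(DROP_TABLE)


def simple_terms(text, stop_words):
    """Lowercase, tokenize, strip punctuation, then drop stop words."""
    tokens = [strip_punctuation(t) for t in text.lower().split()]
    return [t for t in tokens if t and t not in stop_words]


# Apply the same punctuation rule to the stop words themselves
NORMALIZED_STOP_WORDS = {strip_punctuation(w) for w in STOP_WORDS}

sample = "Donations aren't enough to deliver #VaccinEquity"
raw_result = simple_terms(sample, STOP_WORDS)
normalized_result = simple_terms(sample, NORMALIZED_STOP_WORDS)
print(raw_result)         # 'arent' survives the raw stop list
print(normalized_result)  # 'arent' is filtered by the normalized list
```

Normalizing the stop list once keeps the per-tweet work unchanged, and the same idea carries over to ```build_terms``` by building ```stop_words``` from the stripped NLTK entries.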

In [123]:
texts[1]

'#COVID19 has shown how health emergencies and disasters affect entire communities – especially those with weak health systems, and vulnerable populations like migrants, indigenous peoples, and those living in fragile humanitarian conditions. https://t.co/jpUQpnu0V1'