# IRWA Final Project 

This project aims to build a search engine implementing different indexing and ranking algorithms. This will be done using a file containing a set of tweets from the World Health Organization (@WHO).

It will be divided in four parts:

    1) Text processing
    2) Indexing and ranking
    3) Evaluation 
    4) User Interface and Web analytics


Students Group 9:
- Mireia Beltran (U161808)
- Cisco Orteu (U162354)
- Laura Casanovas (U161832)

#### Packages

We first import all the packages needed for text processing. 

In [22]:
from collections import defaultdict
from array import array
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
import math
import numpy as np
import collections
from numpy import linalg as la

In [23]:
# if you do not have 'nltk', the following command should work "python -m pip install nltk"
import nltk
nltk.download('stopwords')
from collections import defaultdict
from array import array
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
import math
import numpy as np
import collections
from numpy import linalg as la
import json
import regex as re 
import string

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\mire2\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


#### Load data into memory

The dataset is stored in a txt file ```dataset_tweets_WHO.txt```and it contains a set of tweets in json format. We create tweets_data by using json.loads() function, which from a JSON string it can be parsed and it returns the content of the file.

In [2]:
text = open('dataset_tweets_WHO.txt', 'r')

In [3]:
tweets_data=[]
for line in text:
    tweet=json.loads(line)
    tweets_data.append(tweet)
text.close()

#### Dataset Creation

In order to read and process each tweet, we create a dictionary in which we will have just one row, and each column will contain one tweet

In [4]:
import pandas as pd

# Build DataFrame of tweet texts and languages
df = pd.DataFrame(tweets_data)
df.to_csv('df.csv')

We now create a new variable called 'texts' which will contain in each position of the array a tweet. Below this cell we print as an example the content of the first position of the array

In [5]:
texts=[]
for i in df:
    line =  df[i].item()['full_text']
    texts.append(str(line))

In [6]:
texts[0]

"It's International Day for Disaster Risk Reduction\n\n#OpenWHO has launched a multi-tiered core curriculum to help equip you with the competencies needed to work within public health emergency response.\n\nStart learning today &amp; be #Ready4Response:\n👉 https://t.co/hBFFOF0xKL https://t.co/fgZY22RWuS"

## 1) Text Processing

We implement the function ```build_terms(text)```.

It takes as input a text and performs the following operations:

- Stem terms
- Remove stop words
- Remove punctuation 
- Remove links
- Remove emojis
- Transform all text to lowercase
- Tokenize the text to get a list of terms

(We decided not to remove hashtags since it may be interesting to treat them separately later)

In [7]:
def build_terms(text):
    """
    Preprocess the article text (title + body) removing stop words, stemming,
    transforming in lowercase and return the tokens of the text.
    
    Argument:
    text -- string (text) to be preprocessed
    
    Returns:
    text - a list of tokens corresponding to the input text after the preprocessing
    """
    # create the pattern
    stemmer = PorterStemmer()
    
    stop_words = set(stopwords.words("english"))
    remove = string.punctuation
    remove = remove.replace("#", "–")# don't remove hashtags
    remove = remove+'¿'
    pattern = r"[{}]".format(remove) # create the pattern
    text = re.sub(pattern, "", text)
    text = re.sub(r'http\S+', '', text)
    
    #compile a regular expression pattern into a regular expression object
    emoji_pattern = re.compile("["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
        u"\U00002702-\U000027B0"
        u"\U000024C2-\U0001F251"
                           "]+", flags=re.UNICODE)
    #Return the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string by the replacement
    text = emoji_pattern.sub(r'', text) # no emoji
    
    # Transform in lowercase
    text=  str.lower(text) 
    # Tokenize the text to get a list of terms
    text=  text.split() 
    # Eliminate the stopwords
    text=[l for l in text if l not in stop_words] 
    

    # Perform stemming 
    text=[stemmer.stem(word) for word in text]
    
    return text

In [8]:
texts_processed = []
for i in range(len(texts)):
    texts_processed.append(build_terms(texts[i]))
    

In [151]:
def create_index(texts, num_documents):
    """
    Implement the inverted index
    
    Argument:
    lines -- collection of Wikipedia articles
    
    Returns:
    index - the inverted index (implemented through a Python dictionary) containing terms as keys and the corresponding
    list of documents where these keys appears in (and the positions) as values.
    """
    index = defaultdict(list)
    tf = defaultdict(list)  #term frequencies of terms in documents (documents in the same order as in the main index)
    df = defaultdict(int)  #document frequencies of terms in the corpus
    title_index = {}  # dictionary to map page titles to page ids
    idf = defaultdict(float)
    tweet_id = 0
    for text in texts:  # Remember, lines contain all documents
        terms = build_terms(text) #page_title + page_text
        tweet_id += 1
        title_index[tweet_id]=text  
        
        ## ===============================================================        
        ## create the index for the current page and store it in current_page_index (current_page_index)
        ## current_page_index ==> { ‘term1’: [current_doc, [list of positions]], ...,‘term_n’: [current_doc, [list of positions]]}

        ## Example: if the curr_doc has id 1 and his text is 
        ##"web retrieval information retrieval":

        ## current_page_index ==> { ‘web’: [1, [0]], ‘retrieval’: [1, [1,4]], ‘information’: [1, [2]]}

        ## the term ‘web’ appears in document 1 in positions 0, 
        ## the term ‘retrieval’ appears in document 1 in positions 1 and 4
        ## ===============================================================
        doc_freq = {i:terms.count(i) for i in terms}
        current_page_index = {}
        
        for position, term in enumerate(terms): # terms contains page_title + page_text. Loop over all terms
            
            try:
                # if the term is already in the index for the current page (current_page_index)
                # append the position to the corresponding list
                
        ## START CODE
                current_page_index[term][1].append(position)  
            except:
                # Add the new term as dict key and initialize the array of positions and add the position
                current_page_index[term] = [tweet_id, array('I',[position])] #'I' indicates unsigned int (int in Python)
        
        norm = 0
        for term, posting in current_page_index.items():
            # posting will contain the list of positions for current term in current document. 
            # posting ==> [current_doc, [list of positions]] 
            # you can use it to infer the frequency of current term.
            norm += len(posting[1]) ** 2
        norm = math.sqrt(norm)
            

        #calculate the tf(dividing the term frequency by the above computed norm) and df weights
        for term, posting in current_page_index.items():
            # append the tf for current term (tf = term frequency in current doc/norm)
            tf[term].append(np.round(len(posting[1])/norm,4)) ## SEE formula (1) above
            #increment the document frequency of current term (number of documents containing the current term)
            df[term] +=1 # increment DF for current term

        #merge the current page index with the main index
        for term_page, posting_page in current_page_index.items():
            index[term_page].append(posting_page)

        # Compute IDF following the formula (3) above. HINT: use np.log
        for term in df:
            idf[term] = np.round(np.log(float(num_documents/df[term])), 4)

    return index, tf, df, idf, title_index

In [152]:
import time
start_time = time.time()
index, tf, df, idf, title_index = create_index(texts, len(texts))
print("Total time to create the index: {} seconds".format(np.round(time.time() - start_time, 2)))

Total time to create the index: 281.47 seconds


In [153]:
print("Index results for the term 'researcher': {}\n".format(index['researcher']))
print("First 10 Index results for the term 'research': \n{}".format(index['research'][:10]))

Index results for the term 'researcher': []

First 10 Index results for the term 'research': 
[[23, array('I', [16])], [154, array('I', [22])], [172, array('I', [17])], [204, array('I', [5])], [211, array('I', [8])], [212, array('I', [5])], [222, array('I', [6])], [423, array('I', [12])], [429, array('I', [4])], [460, array('I', [23])]]


# Query

In [154]:
def rank_documents(terms, docs, index, idf, tf, title_index):
    """
    Perform the ranking of the results of a search based on the tf-idf weights
    
    Argument:
    terms -- list of query terms
    docs -- list of documents, to rank, matching the query
    index -- inverted index data structure
    idf -- inverted document frequencies
    tf -- term frequencies
    title_index -- mapping between page id and page title
    
    Returns:
    Print the list of ranked documents
    """

    # I'm interested only on the element of the docVector corresponding to the query terms 
    # The remaining elements would became 0 when multiplied to the query_vector
    doc_vectors = defaultdict(lambda: [0] * len(terms)) # I call doc_vectors[k] for a nonexistent key k, the key-value pair (k,[0]*len(terms)) will be automatically added to the dictionary
    query_vector = [0] * len(terms)

    # compute the norm for the query tf
    query_terms_count = collections.Counter(terms)  # get the frequency of each term in the query. 
    # Example: collections.Counter(["hello","hello","world"]) --> Counter({'hello': 2, 'world': 1})
    #HINT: use when computing tf for query_vector

    query_norm = la.norm(list(query_terms_count.values()))

    for termIndex, term in enumerate(terms):  #termIndex is the index of the term in the query
        if term not in index:
            continue

        ## Compute tf*idf(normalize TF as done with documents)
        query_vector[termIndex]=query_terms_count[term]/query_norm * idf[term] 

        # Generate doc_vectors for matching docs
        for doc_index, (doc, postings) in enumerate(index[term]):
            # Example of [doc_index, (doc, postings)]
            # 0 (26, array('I', [1, 4, 12, 15, 22, 28, 32, 43, 51, 68, 333, 337]))
            # 1 (33, array('I', [26, 33, 57, 71, 87, 104, 109]))
            # term is in doc 26 in positions 1,4, .....
            # term is in doc 33 in positions 26,33, .....

            #tf[term][0] will contain the tf of the term "term" in the doc 26            
            if doc in docs:
                doc_vectors[doc][termIndex] = tf[term][doc_index] * idf[term]  # TODO: check if multiply for idf

    # Calculate the score of each doc 
    # compute the cosine similarity between queyVector and each docVector:
    # HINT: you can use the dot product because in case of normalized vectors it corresponds to the cosine similarity
    # see np.dot
    
    doc_scores=[[np.dot(curDocVec, query_vector), doc] for doc, curDocVec in doc_vectors.items() ]
    doc_scores.sort(reverse=True)
    result_docs = [x[1] for x in doc_scores]
    #print document titles instead if document id's
    #result_docs=[ title_index[x] for x in result_docs ]
    if len(result_docs) == 0:
        print("No results found, try again")
        query = input()
        docs = search_tf_idf(query, index)
    #print ('\n'.join(result_docs), '\n')
    return result_docs

In [155]:
def search(query, index):
    """
    The output is the list of documents that contain any of the query terms. 
    So, we will get the list of documents for each query term, and take the union of them.
    """
    query = build_terms(query)
    docs = set()
    for term in query:
    ## START DODE
        try:
            # store in term_docs the ids of the docs that contain "term"                        
            term_docs=[posting[0] for posting in index[term]]
            # docs = docs Union term_docs
            docs = docs.union(term_docs)
        except:
            #term is not in index
            pass
    docs = list(docs)
    ranked_docs = rank_documents(query, docs, index, idf, tf, title_index)
    return docs

In [156]:
#for the queries
top_10 = {}
for i in index:
    top_10[i]=len(index[i])
top_10 = dict(sorted(top_10.items(), reverse=True, key=lambda item: item[1]))
top_10

{'drtedro': 951,
 '#covid19': 723,
 'amp': 635,
 'health': 596,
 'rt': 549,
 'vaccin': 432,
 'countri': 330,
 'peopl': 300,
 'support': 243,
 'pandem': 233,
 'global': 228,
 'need': 227,
 'live': 201,
 '#vaccinequ': 197,
 'care': 151,
 'help': 149,
 'access': 138,
 'world': 134,
 'year': 133,
 'work': 125,
 'use': 125,
 'new': 116,
 'diseas': 115,
 'emerg': 113,
 'today': 112,
 'death': 112,
 'risk': 110,
 'provid': 109,
 'servic': 108,
 'includ': 108,
 'call': 105,
 'million': 105,
 'continu': 105,
 'prevent': 104,
 'actacceler': 104,
 'protect': 102,
 'one': 96,
 'safe': 96,
 'also': 95,
 'everi': 95,
 'must': 94,
 'end': 93,
 'respons': 92,
 'make': 92,
 'suppli': 91,
 'around': 88,
 'share': 87,
 'develop': 83,
 'thank': 81,
 'time': 81,
 'medic': 81,
 'dr': 81,
 'commit': 80,
 'public': 79,
 'get': 79,
 'mani': 78,
 'system': 77,
 'un': 76,
 'variant': 76,
 'month': 75,
 'know': 75,
 'case': 75,
 'dose': 74,
 'join': 74,
 'increas': 74,
 'take': 73,
 'commun': 72,
 'ensur': 71,
 '

In [157]:
queries = ['covid vaccine', 'global pandemic', 'world disease', 'emergency call','countries risk']
for i in queries:
    docs = search(i, index)
    top = 10

    print("\n======================\nSample of {} results out of {} for the query '", i,"'\n".format(top, len(docs)))
    for d_id in docs[:top]:
        print("tweet_id= {} - tweet: {}\n".format(d_id, title_index[d_id]))


Sample of {} results out of {} for the query ' covid vaccine '

tweet_id= 2048 - tweet: @DrTedros "The global gap in #COVID19 vaccine supply is hugely uneven and inequitable. Some countries and regions are actually ordering millions of booster doses, before other countries have had supplies to vaccinate their #healthworkers and most vulnerable"-@DrTedros #VaccinEquity

tweet_id= 2050 - tweet: @DrTedros "It is definitely worse in places that have very few vaccines but the #COVID19 pandemic is not over anywhere. The current collective strategy reminds me of a fire fighting team taking on a forest blaze"-@DrTedros https://t.co/qFijG1cZj0

tweet_id= 4 - tweet: RT @WHOAFRO: Congratulations Algeria🇩🇿!

#Algeria is the 16th country in #Africa to reach the milestone of fully vaccinating 10% of its pop…

tweet_id= 2053 - tweet: @DrTedros "Vaccines have never been the way out of this crisis on their own but this current wave is demonstrating again just what a powerful tool they are to battle ba