In [1]:
from google.colab import drive
drive.mount('/content/drive') 

Mounted at /content/drive


# IRWA Project

The aim of this project is to build a search engine that implements different indexing and ranking algorithms. The search engine is developed on a document corpus made up of tweets related to Hurricane Ian.

The project will have the following four incremental steps: Text Processing, Indexing and Evaluation, Ranking, and User Interface and Web Analytics.

## PART 1: Text Processing

For this first part of the project, we were asked to pre-process the documents (set of tweets) by:

- Removing stop words
- Tokenization
- Removing punctuation marks
- Stemming

However, we added some pre-processing steps that we thought would be useful, such as removing the '#' sign from hashtags.


#### Load Python packages
We will first import all the packages that we will use during this first part.

In [2]:
# if you do not have 'nltk', the following command should work "python -m pip install nltk"
import nltk
import time
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [3]:
from collections import defaultdict
from array import array
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
import math
import numpy as np
import collections
import json
import pandas as pd
from numpy import linalg as la
!pip install language-data
from langcodes import *

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting language-data
  Downloading language_data-1.1-py3-none-any.whl (4.9 MB)
Collecting marisa-trie<0.8.0,>=0.7.7
  Downloading marisa_trie-0.7.7-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (1.2 MB)
Installing collected packages: marisa-trie, language-data
Successfully installed language-data-1.1 marisa-trie-0.7.7


#### Load data into memory
The document corpus is stored in a JSON file and, as already mentioned, consists of a set of tweets related to Hurricane Ian. Each tweet carries many fields, so we keep only those that are relevant for later stages.

We open the JSON file as plain text and load each line (one tweet per line) into a list of tweets.

In [4]:
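docs_path = '/content/drive/Shareddrives/RIAW/data_part1/tw_hurricane_data.json'
# Fields we keep for each tweet in later stages
info_needed = ["created_at", "tweet_id", "full_text", "username", "favorite_count", "retweet_count", "hashtags", "URL"]
tweets = []  # list of tweets, each one a parsed JSON object (dict)
# The file holds one JSON object per line; the context manager closes it afterwards
with open(docs_path, 'r') as fp:
    for line in fp:
        tweets.append(json.loads(line))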
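
The loading step relies on the file holding one JSON object per line. A minimal sketch of that parsing, using an in-memory sample instead of the real file; the two sample records are made up for illustration and keep only two fields:

```python
import io
import json

# Hypothetical two-line sample standing in for tw_hurricane_data.json
sample = io.StringIO(
    '{"id": 1, "full_text": "Stay safe #HurricaneIan"}\n'
    '{"id": 2, "full_text": "Power is out in Punta Gorda"}\n'
)

# Parse each non-empty line as an independent JSON document
sample_tweets = [json.loads(line) for line in sample if line.strip()]

print(len(sample_tweets))             # 2
print(sample_tweets[0]["full_text"])  # Stay safe #HurricaneIan
```

Skipping blank lines is an addition of this sketch; it avoids a `json.JSONDecodeError` if the file ends with an empty line.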

#### Define useful functions

Implement the function ```extract_hashtags(tweet)```.

It takes a tweet as input and returns its hashtags as a list.

In [5]:
def extract_hashtags(tweet):
  # Each hashtag entity stores its text without the leading '#'
  return [hashtag["text"] for hashtag in tweet["entities"]["hashtags"]]
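
To make the expected input shape concrete, here is a small usage sketch. The sample tweet is hypothetical and reduced to the only fields the function reads; the function body repeats the same logic so that the block runs on its own:

```python
def extract_hashtags(tweet):
    # Collect the "text" of every hashtag entity of the tweet
    return [tag["text"] for tag in tweet["entities"]["hashtags"]]

# Hypothetical tweet reduced to the fields the function reads
sample_tweet = {
    "entities": {
        "hashtags": [
            {"text": "scwx", "indices": [0, 5]},
            {"text": "HurricaneIan", "indices": [6, 19]},
        ]
    }
}

hashtags = extract_hashtags(sample_tweet)
print(hashtags)  # ['scwx', 'HurricaneIan']

# A tweet without hashtags yields an empty list
print(extract_hashtags({"entities": {"hashtags": []}}))  # []
```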


Implement the function ```extract_username(tweet)```.

It takes a tweet as input and returns its username. We chose the screen_name because it contains neither special characters nor emojis.

In [6]:
def extract_username(tweet):
  return tweet["user"]["screen_name"]

Implement the function ```extract_URL(tweet)```.

It takes a tweet as input and returns its first URL, if it has one. Otherwise, it returns a "NO URL" tag to mark the missing value.

In [7]:
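def extract_URL(tweet):
  try:
    # Only the first URL entity of the tweet is kept
    return tweet["entities"]["urls"][0]["url"]
  except (KeyError, IndexError):
    # Missing "urls" key or empty list: the tweet has no URL
    return "NO URL"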
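
A short usage sketch of the two cases, with hypothetical tweets reduced to the relevant fields. This version catches only `KeyError` and `IndexError`, the two exceptions a missing URL would raise, so that unrelated errors are not silently turned into "NO URL":

```python
def extract_URL(tweet):
    try:
        # Only the first URL entity is returned
        return tweet["entities"]["urls"][0]["url"]
    except (KeyError, IndexError):
        return "NO URL"

# Hypothetical tweets: one with a URL entity, one with an empty list
with_url = {"entities": {"urls": [{"url": "https://t.co/jrmrS3tJXa"}]}}
without_url = {"entities": {"urls": []}}

print(extract_URL(with_url))     # https://t.co/jrmrS3tJXa
print(extract_URL(without_url))  # NO URL
```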

Implement the function ```get_stopwords(tweets)```.

It takes a set of tweets as input and returns the stop words for all the languages found in those tweets. In this case all the tweets are in English, but this might be useful for future tweet files.

In [8]:
def get_stopwords(tweets):
  # Distinct language names used in the tweets (e.g. 'en' -> 'english')
  languages = {Language.get(tweet['lang']).display_name().lower() for tweet in tweets}
  stop_words = set()
  for lang in languages:
    # Only load languages for which NLTK provides a stop-word list
    if lang in stopwords.fileids():
      stop_words = stop_words.union(set(stopwords.words(lang)))
  return stop_words
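
The idea of the function (find the distinct languages, then take the union of their stop-word lists) can be sketched without external packages. The two lookup tables below are hypothetical stand-ins for what the notebook obtains from langcodes and NLTK, and `und` is used here as an example of a language code with no stop-word list:

```python
# Hypothetical lookup tables; the notebook obtains these from langcodes and NLTK
LANG_NAMES = {"en": "english", "es": "spanish"}
STOP_WORDS = {"english": {"the", "a", "of"}, "spanish": {"el", "la", "de"}}

def get_stopwords_sketch(tweets):
    # Distinct language names present in the corpus (unknown codes are ignored)
    languages = {LANG_NAMES[t["lang"]] for t in tweets if t["lang"] in LANG_NAMES}
    stop_words = set()
    for lang in languages:
        stop_words |= STOP_WORDS[lang]
    return stop_words

sample = [{"lang": "en"}, {"lang": "en"}, {"lang": "es"}, {"lang": "und"}]
result = get_stopwords_sketch(sample)
print(sorted(result))  # ['a', 'de', 'el', 'la', 'of', 'the']
```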

Implement the function ```eliminate_hashtags_and_urls(line)```.

It takes as input the list of tokens of a tweet and returns only the tokens that are not hashtags, are not URLs, and are purely alphanumeric.

In [9]:
def eliminate_hashtags_and_urls(line):
  words=[]
  for word in line:
    # Keep only purely alphanumeric tokens that are neither hashtags nor URLs
    if "#" not in word and "http" not in word and word.isalnum():
      words.append(word)
  return words
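
A small sketch of the filter on a made-up sentence shows one consequence of the `isalnum()` test: a token with punctuation attached, such as `florida.`, is dropped entirely rather than cleaned. The function body repeats the same condition so that the block runs on its own:

```python
def eliminate_hashtags_and_urls(tokens):
    # Keep a token only if it has no '#', no 'http', and is purely alphanumeric
    return [t for t in tokens if "#" not in t and "http" not in t and t.isalnum()]

# Made-up example text, already lowercased and split on single spaces
tokens = "power is out in florida. #hurricaneian https://t.co/x stay safe 7".split(" ")
kept = eliminate_hashtags_and_urls(tokens)
print(kept)  # ['power', 'is', 'out', 'in', 'stay', 'safe', '7']
```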

Implement the function ```build_terms(line, stopwords)```.

It takes as input a text and a set of stop words, and performs the following operations:

- Transform all text to lowercase
- Tokenize the text to get a list of terms (use *split function*)
- Eliminate the hashtags and urls using the previously defined function
- Remove stop words
- Stem terms 
- Remove the characters listed in *elements_to_replace* from each term



In [10]:
def build_terms(line, stopwords):
    # 'stopwords' is the set of stop words to remove (it shadows the NLTK module only inside this function)
    elements_to_replace = ['}', '{', '[', ']', '"', ',']
    stemmer = PorterStemmer()

    line = line.lower()  # transform to lowercase
    line = line.split(" ")  # tokenize the text to get a list of terms
    line = eliminate_hashtags_and_urls(line)  # drop hashtags, URLs and non-alphanumeric tokens
    line = [x for x in line if x not in stopwords]  # eliminate the stop words
    line = [stemmer.stem(word) for word in line]  # perform stemming
    for e in elements_to_replace:
        # strip leftover brackets, quotes and commas from each term
        line = [word.strip().replace(e, '') for word in line]

    return line
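
The order of the steps can be traced with a self-contained sketch. To avoid depending on NLTK, the stemmer is passed in as a parameter, and `toy_stem` is a deliberately naive stand-in for `PorterStemmer().stem` (an assumption of this sketch, not the stemmer the notebook uses); the stop-word set and the sentence are also made up:

```python
def build_terms_sketch(text, stop_words, stem=lambda w: w):
    tokens = text.lower().split(" ")                     # lowercase + tokenize
    tokens = [t for t in tokens
              if "#" not in t and "http" not in t and t.isalnum()]  # filter tokens
    tokens = [t for t in tokens if t not in stop_words]  # remove stop words
    return [stem(t) for t in tokens]                     # stem what is left

def toy_stem(word):
    # Toy stand-in for PorterStemmer.stem: strips one trailing 's' from longer words
    return word[:-1] if word.endswith("s") and len(word) > 3 else word

terms = build_terms_sketch(
    "The Roads are flooded near Tampa #HurricaneIan", {"the", "are"}, toy_stem
)
print(terms)  # ['road', 'flooded', 'near', 'tampa']
```

With the real Porter stemmer the stems would differ (the table at the end of this part shows stems such as `everyon` and `damag`).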

In the following cell, we work with the *tweet_document_ids_map.csv* file. The code below reads the file, strips the \t and \n characters, splits each line into a [doc_id, tweet_id] pair, and converts the tweet_id to an integer. Finally, each pair is appended to the docs_ids list.

In [11]:
mapping_doc = '/content/drive/Shareddrives/RIAW/data_part1/tweet_document_ids_map.csv'
docs_ids = []
with open(mapping_doc) as fp:
    for row in fp:
        # split() with no arguments discards the tab separator and the trailing newline
        doc_id = row.split()
        doc_id[1] = int(doc_id[1])  # tweet ids are compared as integers later on
        docs_ids.append(doc_id)
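
The same parsing can also be sketched with the standard `csv` module. The sample below stands in for the real file and assumes it is tab-separated with two columns (doc id, tweet id); the two rows reuse ids that appear in the final table of this part:

```python
import csv
import io

# Two-row sample standing in for tweet_document_ids_map.csv (assumed tab-separated)
sample = io.StringIO("doc_1\t1575918182698979328\ndoc_2\t1575918151862304768\n")

# Each row becomes a [doc_id, tweet_id] pair with the tweet id as an integer
docs_ids = [[doc_id, int(tweet_id)]
            for doc_id, tweet_id in csv.reader(sample, delimiter="\t")]

print(docs_ids[0])  # ['doc_1', 1575918182698979328]
```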

Implement the function ```define_key(tweet, docs_ids)```.

It takes as input a tweet and an array of docs_ids, and it returns the document id of the tweet.

In [12]:
def define_key(tweet,docs_ids):
  # Linear scan over [doc_id, tweet_id] pairs; implicitly returns None if no pair matches
  for entry in docs_ids:
    if tweet["id"]==entry[1]:
      return entry[0]
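
Because this function scans the whole list for every tweet, the total cost grows with the product of the number of tweets and the number of mapping rows. One possible alternative, sketched below, builds a dictionary once and then looks up each tweet id directly. The names `tweet_to_doc` and `define_key_fast` are hypothetical and not part of the notebook:

```python
# Sample pairs in the same [doc_id, tweet_id] layout as docs_ids
docs_ids = [["doc_1", 1575918182698979328], ["doc_2", 1575918151862304768]]

# Build the reverse index once: tweet id -> document id
tweet_to_doc = {tweet_id: doc_id for doc_id, tweet_id in docs_ids}

def define_key_fast(tweet, tweet_to_doc):
    # Returns None when the tweet id is not in the mapping
    return tweet_to_doc.get(tweet["id"])

print(define_key_fast({"id": 1575918151862304768}, tweet_to_doc))  # doc_2
print(define_key_fast({"id": 42}, tweet_to_doc))                   # None
```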

Implement the function ```preprocess_tweets(tweets, info_needed, doc_ids)```.

It takes as input the list of tweets, info_needed (the tweet fields we want to keep for later stages), and the list of doc_ids. The function performs the following operations:

- Get the stop words of the tweets with the ```get_stopwords(tweets)``` function

Then, for each tweet:
- Define a new key with the ```define_key(tweet,doc_ids)``` function

For each field in info_needed:
- Apply the corresponding extraction function defined above, or read the value directly from the tweet, and save the result in the processed_tweet dictionary

Once all the fields of the tweet are extracted:
- Map the document id (the key defined previously) to its dictionary of values (the tweet information) and add it to the tweets_processed dictionary created beforehand

Finally, the function returns the dictionary of dictionaries, tweets_processed.

In [13]:
def preprocess_tweets(tweets, info_needed, doc_ids):
  stop_words = get_stopwords(tweets)  # computed once for the whole corpus
  tweets_processed = {}
  for tweet in tweets:
    new_key = define_key(tweet, doc_ids)  # document id, e.g. 'doc_1'
    processed_tweet = {}
    for field in info_needed:
      if field == "created_at":
        processed_tweet[field] = tweet["created_at"]
      elif field == "tweet_id":
        processed_tweet[field] = tweet["id"]
      elif field == "full_text":
        # the text is stored as a list of processed terms
        processed_tweet[field] = build_terms(tweet["full_text"], stop_words)
      elif field == "username":
        processed_tweet[field] = extract_username(tweet)
      elif field == "favorite_count":
        processed_tweet[field] = tweet["favorite_count"]
      elif field == "retweet_count":
        processed_tweet[field] = tweet["retweet_count"]
      elif field == "hashtags":
        processed_tweet[field] = extract_hashtags(tweet)
      elif field == "URL":
        processed_tweet[field] = extract_URL(tweet)
    tweets_processed[new_key] = processed_tweet
  return tweets_processed
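
The per-field branching can also be written as a dispatch table that maps each field name to a small extraction function. The sketch below covers only the fields that need no text processing; `EXTRACTORS` and `process_one` are hypothetical names, and the sample tweet reuses the values of doc_1 from the final table of this part:

```python
# Hypothetical dispatch table: one small function per output field
EXTRACTORS = {
    "created_at": lambda t: t["created_at"],
    "tweet_id": lambda t: t["id"],
    "username": lambda t: t["user"]["screen_name"],
    "favorite_count": lambda t: t["favorite_count"],
    "retweet_count": lambda t: t["retweet_count"],
    "hashtags": lambda t: [h["text"] for h in t["entities"]["hashtags"]],
}

def process_one(tweet, info_needed):
    # Fields without a registered extractor (here "URL") are skipped
    return {f: EXTRACTORS[f](tweet) for f in info_needed if f in EXTRACTORS}

sample_tweet = {
    "created_at": "Fri Sep 30 18:39:08 +0000 2022",
    "id": 1575918182698979328,
    "user": {"screen_name": "suzjdean"},
    "favorite_count": 0,
    "retweet_count": 0,
    "entities": {"hashtags": [{"text": "HurricaneIan"}]},
}

row = process_one(sample_tweet, ["tweet_id", "username", "hashtags", "URL"])
print(row)
# {'tweet_id': 1575918182698979328, 'username': 'suzjdean', 'hashtags': ['HurricaneIan']}
```

Adding a new field then only requires a new entry in the table, without touching the loop.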

Apply the ```preprocess_tweets(tweets, info_needed, doc_ids)``` function to the whole set of tweets that forms the document corpus.

In [14]:
preprocessed_tweets = preprocess_tweets(tweets, info_needed,docs_ids)

Finally, we build a pandas.DataFrame from the dictionary of dictionaries obtained in the previous cell and transpose it, so that each row is a document and each column is a field, which gives a clearer table.

In [15]:
df = pd.DataFrame.from_dict(preprocessed_tweets)
df = df.transpose()
df

Unnamed: 0,created_at,tweet_id,full_text,username,favorite_count,retweet_count,hashtags,URL
doc_1,Fri Sep 30 18:39:08 +0000 2022,1575918182698979328,"[keep, spin, us, 7, away]",suzjdean,0,0,[HurricaneIan],NO URL
doc_2,Fri Sep 30 18:39:01 +0000 2022,1575918151862304768,"[heart, go, affect, wish, everyon, road, curre...",lytx,0,0,[HurricaneIan],NO URL
doc_3,Fri Sep 30 18:38:58 +0000 2022,1575918140839673873,"[kissimme, neighborhood, michigan]",CHeathWFTV,0,0,[HurricaneIan],NO URL
doc_4,Fri Sep 30 18:38:57 +0000 2022,1575918135009738752,"[one, tree, backyard, scare, poltergeist, tree...",spiralgypsy,0,0,"[scwx, HurricaneIan]",NO URL
doc_5,Fri Sep 30 18:38:53 +0000 2022,1575918119251419136,"[pray, everyon, affect, associ, sympathi, anim...",Blondie610,0,0,[HurricaneIan],NO URL
...,...,...,...,...,...,...,...,...
doc_3996,Fri Sep 30 14:33:06 +0000 2022,1575856268022992896,"[carrboro, public, servic, place, stand, best,...",CarrboroFire,2,0,"[CarrboroSafe, ncwx, HurricaneIan]",https://t.co/jrmrS3tJXa
doc_3997,Fri Sep 30 14:33:01 +0000 2022,1575856245650919424,"[list, widespread, flood, bc, bc, low, even]",Baconbitsnews,0,0,"[Kissimmee, SaintCloud, BlueCounty, Disney, De...",https://t.co/JaOkK6skP9
doc_3998,Fri Sep 30 14:32:57 +0000 2022,1575856228886089728,"[realli, flood, flute]",jganyfl1,16,8,"[HurricaneIan, Florida, MAGATears]",NO URL
doc_3999,Fri Sep 30 14:32:56 +0000 2022,1575856226139017216,"[damag, area, punta, tropic, gulf, power]",haddad_cj,2,1,[HurricaneIan],NO URL
