# COMP6210 Assignment 1

## Part 2: Keyword extraction from tweets

**Name: Mohammad Abdul Ahad Chowdhury**
    
**ID: 46168249**

**Email: mohammadabdul.ahadchowdhury@students.mq.edu.au**

**Date: 17 March 2020**

## Methodology

First, we clean up the text data. Then we lemmatize the tokens and remove stopwords.

We then use the [TF-IDF](https://en.wikipedia.org/wiki/Tf–idf) method to extract a keyword from each cleaned-up text. The method consists of calculating the term frequency (TF) of each token within each text, and the inverse document frequency (IDF) of each token across all texts. Finally, we combine these two data structures to find the keyword of each tweet.

## Libraries
We are going to use
- **NLTK** for NLP tasks such as tokenization and lemmatization
- **NumPy** for numeric operations
- **Pandas** for two data structures: `DataFrame` and `Series`
- **re** (Python's built-in module) for regular-expression operations
- **PyMongo** for connecting to MongoDB Atlas for CRUD operations

Note: `pymongo` was chosen over `mongoengine` because more documentation for it was available online at the time of writing.

In [57]:
import nltk
import numpy as np
import pandas as pd
import re
from pymongo import MongoClient

These downloads are needed the first time NLTK's tokenizer, lemmatizer and stopword list are used.

In [None]:
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')

Now we read the tweets from the MongoDB collection into a Pandas `DataFrame`. We keep two copies of the `DataFrame`: one to work with, and the other as a fresh reference copy, a source of truth, in case things get messy with the working copy.

In [41]:
client = MongoClient("mongodb+srv://maac:1234@cluster0-dtz35.mongodb.net/test?retryWrites=true&w=majority")
db = client.Tweets
db_tweets = list(db.tweets.find())
persistent_dataset = pd.DataFrame(db_tweets)
# persistent_dataset is the original copy.
# dataset is the copy we will work on. Without .copy(), both names
# would refer to the same DataFrame object.
dataset = persistent_dataset.copy()
dataset.head()

Unnamed: 0,_id,created_at,id_str,text,user,place,entities,tag
0,5e6fcd2cfaabb5690b6eff12,Thu Apr 06 15:24:15 +0000 2017,1850006245121695744,"Train people well enough so they can leave, tr...","{'id': 224499494501.0, 'name': 'user 01', 'scr...",{},"{'hashtags': [], 'urls': [{'url': 'twt url sam...",My number 1 Tweet
1,5e6fcd2cfaabb5690b6eff13,Thu Apr 06 15:24:16 +0000 2017,8150006245121695744,On Friday evening we welcomed our new MRes and...,"{'id': 224499494502, 'name': 'user 02', 'scree...",{},"{'hashtags': [], 'urls': [{'url': 'twt url sam...",My number 2 Tweet
2,5e6fcd2cfaabb5690b6eff14,Thu Apr 06 15:24:17 +0000 2017,8510006245121695744,"Excited to publish the BPM Newsletter 1-2018, ...","{'id': 224499494503, 'name': 'user 03', 'scree...",{},"{'hashtags': [], 'urls': [{'url': 'twt url sam...",
3,5e6fcd2cfaabb5690b6eff15,Thu Apr 06 15:24:18 +0000 2017,8501006245121695744,Our social networks amplify negative more tha...,"{'id': 224499494504, 'name': 'user 04', 'scree...",{},"{'hashtags': [], 'urls': [{'url': 'twt url sam...",
4,5e6fcd2cfaabb5690b6eff16,Thu Apr 06 15:24:19 +0000 2017,8500106245121695744,Welcome to the official Department of Computin...,"{'id': 224499494505, 'name': 'user 05', 'scree...",{},"{'hashtags': [], 'urls': [{'url': 'twt url sam...",


## Cleaning up

Suppose we have the example tweet `I had a size 5 football.`

First, we will

- convert all the text to lowercase: `i had a size 5 football.`
- tokenize all the words: `["i", "had", "a", "size", "5", "football", "."]`
- get rid of all the numbers: `["i", "had", "a", "size", "football", "."]`
- remove all punctuation marks: `["i", "had", "a", "size", "football"]`
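
As a quick check of the four steps above, here is a minimal, standard-library-only sketch applied to the example tweet. The regular-expression tokenizer is an assumption used as a stand-in so that the snippet runs without NLTK data; it is not `word_tokenize` and will split some text differently. The name `clean_tokens` is hypothetical.

```python
import re
from string import punctuation


def clean_tokens(text):
    """Lowercase, tokenize, then drop numeric and punctuation tokens."""
    lowered = text.lower()
    # Stand-in tokenizer (an assumption, not NLTK): runs of word characters,
    # or any single character that is neither a word character nor whitespace.
    tokens = re.findall(r"\w+|[^\w\s]", lowered)
    # Drop tokens that start with a digit.
    tokens = [t for t in tokens if not t[0].isdigit()]
    # Drop tokens that start with an ASCII punctuation mark.
    tokens = [t for t in tokens if t[0] not in punctuation]
    return tokens


example_tokens = clean_tokens("I had a size 5 football.")
print(example_tokens)  # ['i', 'had', 'a', 'size', 'football']
```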

In [27]:
from string import punctuation
from nltk.tokenize import word_tokenize

# Compile the patterns once, outside the loop.
# match() anchors at the start of the string, so both patterns
# test the first character of a token.
re_num = re.compile(r'[0-9]')
re_pun = re.compile('[%s]' % re.escape(punctuation))

tokenized_tweets = []

texts = dataset.text

# step 1: convert whole text to lowercase
tweets = texts.str.lower().to_list()

for tweet in tweets:
    # step 2: tokenize all the words
    tokens = word_tokenize(tweet)

    # step 3: remove tokens that start with a digit
    tokens = [i for i in tokens if not re_num.match(i)]

    # step 4: remove tokens that start with a punctuation mark
    tokens = [i for i in tokens if not re_pun.match(i)]

    tokenized_tweets.append(tokens)

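Next, we use NLTK's `WordNetLemmatizer` to lemmatize the words. Lemmatization converts words (e.g. 'running', 'ran') into their base form (e.g. 'run'), so that different forms of the same word are counted together. Note that `lemmatize` treats every token as a noun unless a part-of-speech tag is passed, so the default call used here mainly reduces plural nouns (e.g. 'students' becomes 'student') and leaves verb forms unchanged.

So, with the default call, our previous example stays `["i", "had", "a", "size", "football"]`; passing `pos="v"` for verbs would turn 'had' into 'have'.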

In [28]:
from nltk.stem.wordnet import WordNetLemmatizer
wnl = WordNetLemmatizer()

lemmatized_tweets = []

for tweet in tokenized_tweets:
    lemma = [wnl.lemmatize(token) for token in tweet]
    lemmatized_tweets.append(lemma)

Now, we remove the stopwords from the texts. Stopwords are very common words (e.g. 'this', 'the') that carry little meaning on their own and would otherwise distort the counts.

In our example, 'i', 'a' and the verb are all on NLTK's English stopword list, so it becomes `["size", "football"]`.

In [29]:
from nltk.corpus import stopwords
stop_words = set(stopwords.words("english"))

filtered_tweets = []

for tweet in lemmatized_tweets:
    filtered = [token for token in tweet if token not in stop_words]
    filtered_tweets.append(filtered)

## Calculation

After clean-up, we move on to calculation. As mentioned before, to obtain TF-IDF we have to calculate `TF` and `IDF` first.

`TF`, or term frequency, of a text is a table of how many times each token occurs in that text. Continuing our example, whose remaining tokens are 'size' and 'football', here is the TF built with the `Counter` class from Python's standard `collections` module:

`Counter({'size': 1, 'football': 1})`

In [32]:
from collections import Counter
tf = []

for tweet in filtered_tweets:
    tf.append(Counter(tweet))

To calculate `IDF` or inverse document frequency for each token `T`, we use the following formula:

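`IDF(T) = log(number_of_documents ÷ number_of_documents_that_include_T)`

This quantity measures how informative a token is across the collection: a token that appears in few tweets gets a high IDF, while a token that appears in every tweet gets an IDF of 0. The code below uses the natural logarithm.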
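
To make the formula concrete, here is a small sketch on a made-up corpus. The tweets in `docs` are hypothetical and not taken from the dataset. It counts the document frequency of every token in a single pass and then applies the formula above.

```python
from collections import Counter
from math import log

# Hypothetical toy corpus of already-cleaned tweets (not from the dataset).
docs = [
    ["size", "football"],
    ["football", "match"],
    ["student", "football", "football"],
    ["student", "welcome"],
]

# Document frequency: in how many tweets each token appears.
# set(doc) ensures a token repeated within one tweet is counted only once.
doc_freq = Counter(token for doc in docs for token in set(doc))

# IDF(T) = log(N / df(T)), using the natural logarithm.
toy_idf = {token: log(len(docs) / df) for token, df in doc_freq.items()}

# Print tokens from least to most informative.
for token, value in sorted(toy_idf.items(), key=lambda kv: kv[1]):
    print(f"{token:10s} {value:.3f}")
```

Here 'football' appears in three of the four toy tweets and so gets the lowest IDF, while tokens that appear in only one tweet get the highest.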

In [34]:
from math import log
total_tweets = len(filtered_tweets)
idf = dict()

for tweet in filtered_tweets:
    for token in tweet:
        # compute the IDF of each distinct token only once
        if token not in idf:
            # indices of the tweets that contain this token
            tweets_with_token = [idx for idx in range(total_tweets) if tf[idx][token] > 0]
            document_frequency = len(tweets_with_token)
            idf[token] = log(total_tweets / document_frequency)

Finally, we calculate `TF-IDF` for each tweet and each token. The `TF-IDF` of a particular token in a specific tweet is the product of the `TF` of the token in that tweet and the `IDF` of that token across the whole collection of texts. This process yields an `n`-by-`m` data structure (in this case, a `DataFrame`), where `n` is the number of tweets (rows) and `m` is the number of distinct tokens (columns).
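
The product described above can be traced by hand on a tiny example. The following is a minimal sketch using plain dictionaries instead of a `DataFrame`. The two tweets and all `toy_*` names are hypothetical.

```python
from collections import Counter
from math import log

# Hypothetical cleaned tweets, used only for illustration.
toy_tweets = [
    ["well", "train", "well"],
    ["train", "student"],
]

# Term frequency: one Counter per tweet.
toy_tf = [Counter(tweet) for tweet in toy_tweets]

# Vocabulary and inverse document frequency over the whole collection.
n_docs = len(toy_tweets)
vocab = sorted({token for tweet in toy_tweets for token in tweet})
toy_df = {t: sum(1 for tf_row in toy_tf if tf_row[t] > 0) for t in vocab}
toy_idf = {t: log(n_docs / toy_df[t]) for t in vocab}

# TF-IDF: one row per tweet, one entry per vocabulary token.
toy_tf_idf = [{t: tf_row[t] * toy_idf[t] for t in vocab} for tf_row in toy_tf]

for row in toy_tf_idf:
    print({t: round(score, 3) for t, score in row.items()})
```

In this sketch 'train' appears in both tweets, so its IDF, and therefore its TF-IDF, is 0, while 'well' occurs twice in the first tweet and nowhere else and so scores highest there.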

In [24]:
tokens = list(idf.keys())  # fixed column order for the matrix
matrix = []

for tweet_idx, tweet in enumerate(filtered_tweets):
    numbers = []
    for token in tokens:
        numbers.append(tf[tweet_idx][token] * idf[token])
    matrix.append(numbers)
    
tf_idf = pd.DataFrame(matrix, columns=tokens)

In [36]:
tf_idf.head()

Unnamed: 0,train,people,well,enough,leave,treat,n't,want,—,richardbranson,...,master,career,universe,pretty,place,u,seems,awful,space,sagan
0,3.912023,1.966113,7.824046,7.824046,3.912023,3.912023,3.218876,3.218876,3.912023,3.912023,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


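For each row of this `DataFrame`, `idxmax(axis=1)` returns the label of the column with the highest value, which gives us the keyword of each tweet as a `Series`. If several tokens tie for the highest value, the first such column is returned.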

In [51]:
keywords = tf_idf.idxmax(axis=1)
keywords.head()

0        well
1     student
2     excited
3      social
4    official
dtype: object
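
The first tweet illustrates the tie case: 'well' and 'enough' share the same top score, and 'well' is reported because its column comes first. The plain-Python sketch below mirrors that selection rule on a hypothetical row with rounded scores. It is an illustration of the rule, not the pandas implementation.

```python
# Hypothetical TF-IDF row: token -> score, in column order.
row = {"train": 3.9, "well": 7.8, "enough": 7.8, "leave": 3.9}

# max() returns the first key with the highest score, so ties are
# broken by column order.
keyword = max(row, key=row.get)

# To inspect ties, collect every token that reaches the top score.
top_score = max(row.values())
tied = [token for token, score in row.items() if score == top_score]

print(keyword)  # well
print(tied)     # ['well', 'enough']
```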

## Exporting to CSV
We export these values to a CSV file. `header` is set to `False` so that no header row is written at the top of the file.

In [55]:
keywords.to_csv('keywords.csv', header=False)

## Updating the database
We update the database with the extracted keywords using the `pymongo` driver, adding a `keyword` field to each tweet document.

In [50]:
for i, t in enumerate(db_tweets):
    # match each document by its unique _id rather than by its full contents
    db.tweets.update_one({"_id": t["_id"]}, {"$set": {"keyword": keywords[i]}})