# Text Analytics | BAIS:6100
# Module 10: Text Similarity (Exercises)

Instructor: Kang-Pyo Lee 

Twitter hashtag options:
- ai
- bitcoin
- blacklivesmatter
- bts
- covid19
- fakenews
- innovation
- mentalhealth
- metoo
- startup

Choose a Twitter hashtag you're interested in and save it in the `hashtag` variable below.

In [None]:
# Your answer here
hashtag = ""

In [None]:
N = 500

In [None]:
import pandas as pd
pd.set_option('display.max_colwidth', 150)

months = ["202012", "202011", "202010", "202009", "202008", "202007", 
          "202006", "202005", "202004", "202003", "202002", "202001"]

frames = []
for month in months:
    dftmp = pd.read_csv("classdata/tweets/tweets_{}_{}.csv".format(hashtag, month), sep="\t", quoting=3)
    
    ##############################################
    # Keep at most N randomly sampled rows per month.
    ##############################################
    if len(dftmp) > N:
        dftmp = dftmp.sample(n=N)
    ##############################################
    
    frames.append(dftmp)
    print("{}: {:,}".format(month, len(dftmp)))

# Concatenate once at the end instead of growing the dataframe inside the loop.
df = pd.concat(frames)
print("Total number of tweets in df: {:,}\n".format(len(df)))

df.user_name = df.user_name.astype(str)
df.text = df.text.astype(str)

df = df.drop_duplicates(["text"])
df.index = range(len(df))

df

The cell above drops all duplicates in the `text` column and then re-indexes the rows of the dataframe so that the index ranges from 0 to the length - 1.

1\. Get all the unique words from the first 200 values in the `text` column of `df`, sort them in alphabetical order, and save them in a list named `unique_words`. Use the `get_unique_words` function defined below, which returns a set.
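
As a warm-up, here is a minimal sketch of the same idea on toy data: collect the words into a set, then sort the set into a list. The `sample_texts` list and the `toy_unique_words` helper are hypothetical, and whitespace splitting stands in for `nltk.word_tokenize` to keep the sketch self-contained, so its tokens will not match NLTK's exactly.

```python
import re

# Hypothetical toy texts standing in for tweet text.
sample_texts = ["Bitcoin hits a new high", "A new high for bitcoin in 2020"]

def toy_unique_words(text_list):
    """Collect lowercased tokens that start with a letter and have 2+ characters."""
    words = set()
    for text in text_list:
        # Simplification: split on whitespace instead of using nltk.word_tokenize.
        for token in text.split():
            if re.search("^[a-zA-Z][a-zA-Z0-9]+", token):
                words.add(token.lower())
    return words

# A set has no order, so sorted() is what turns it into an alphabetical list.
toy_words = sorted(toy_unique_words(sample_texts))
print(toy_words)  # ['bitcoin', 'for', 'high', 'hits', 'in', 'new']
```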

In [None]:
import nltk
import re

def get_unique_words(text_list):
    all_words = set()
    
    for text in text_list:
        words = nltk.word_tokenize(text)
        for word in words:
            # Keep tokens that start with a letter followed by at least one alphanumeric character.
            if re.search("^[a-zA-Z][a-zA-Z0-9]+", word):
                all_words.add(word.lower())
                
    return all_words

In [None]:
# Your answer here


In [None]:
# Check your answer here. (Do not make any change to this cell. Just run this cell.)
len(unique_words)

2\. Write a function named `find_similar_words` that takes two parameters, `word_list` and `distance`, and prints all the pairs of words in `word_list` whose Levenshtein distance is exactly `distance`. Set the default value of the `distance` parameter to 1. Filter out the words that have fewer than 5 characters or end with '…'. After writing the function, test it by passing the `unique_words` list from question 1 as the first argument and 1 and 2, respectively, as the second argument.
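
To get a feel for the Levenshtein distance before writing the function, the following sketch computes it in pure Python for a few word pairs. `toy_levenshtein` is a hypothetical helper written for illustration only; it assumes unit cost for insertions, deletions, and substitutions, and keeps just two rows of the dynamic-programming table.

```python
def toy_levenshtein(a, b):
    """Edit distance between strings a and b with unit costs."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance between a[:i] and the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

print(toy_levenshtein("tweet", "tweets"))    # 1
print(toy_levenshtein("covid", "covid19"))   # 2
print(toy_levenshtein("kitten", "sitting"))  # 3
```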

In [None]:
import numpy as np

def get_levenshtein_dist(seq1, seq2):
    size_x = len(seq1) + 1
    size_y = len(seq2) + 1
    matrix = np.zeros((size_x, size_y), dtype=int)

    # Base cases: distance between a prefix and the empty string.
    for x in range(size_x):
        matrix[x, 0] = x
    for y in range(size_y):
        matrix[0, y] = y

    for x in range(1, size_x):
        for y in range(1, size_y):
            # A substitution is free when the two characters match.
            cost = 0 if seq1[x - 1] == seq2[y - 1] else 1
            matrix[x, y] = min(
                matrix[x - 1, y] + 1,         # deletion
                matrix[x - 1, y - 1] + cost,  # substitution or match
                matrix[x, y - 1] + 1          # insertion
            )

    return matrix[size_x - 1, size_y - 1]

In [None]:
# Your answer here


In [None]:
# Check your answer here. (Do not make any change to this cell. Just run this cell.)
find_similar_words(unique_words)

In [None]:
# Check your answer here. (Do not make any change to this cell. Just run this cell.)
find_similar_words(unique_words, 2)

3\. Create a TF-IDF vectorizer with the `norm` parameter set to l2 normalization, the `stop_words` parameter set to the combined global and local stopwords below, and `max_df` set to 0.7. Then, using the vectorizer, transform the `text` column of `df` into a document-term matrix named `dtm`. Lastly, create a dataframe named `df_sim` that contains all the pairwise cosine similarities of `dtm` as its data and uses the index of `df` as both its index and its columns.
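
One reason l2 normalization pairs well with cosine similarity: once two vectors are scaled to unit length, their cosine similarity is simply their dot product. The sketch below checks this in pure Python. The vectors `doc_a` and `doc_b` are hypothetical term weights over a shared three-term vocabulary, not output from a real vectorizer, and the helper names are made up for illustration.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))

def l2_normalize(vec):
    """Scale a vector to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(dot(vec, vec))
    return [v / norm for v in vec] if norm else list(vec)

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the lengths."""
    denom = math.sqrt(dot(u, u)) * math.sqrt(dot(v, v))
    return dot(u, v) / denom if denom else 0.0

# Hypothetical term weights for two documents.
doc_a = [1.0, 2.0, 0.0]
doc_b = [2.0, 1.0, 2.0]

sim_raw = cosine(doc_a, doc_b)
sim_normalized = dot(l2_normalize(doc_a), l2_normalize(doc_b))
print(round(sim_raw, 4), round(sim_normalized, 4))  # 0.5963 0.5963
```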

In [None]:
from nltk.corpus import stopwords
import string

global_stopwords = stopwords.words("english")
local_stopwords = [c for c in string.punctuation] +\
                  ['’', '``', '…', '...', "''", '‘', '“', '”', "'m", "'re", "'s", "'ve", 'amp', 'https', "n't", 'rt']

In [None]:
# Your answer here


In [None]:
# Check your answer here. (Do not make any change to this cell. Just run this cell.)
df_sim

4\. Using `df_sim`, print all the pairs of tweet texts in the `text` column of `df` whose cosine similarity is strictly higher than 0.7 and strictly lower than 1. Skip a pair if the first 20 characters of either text appear in the other text.
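
The two conditions can be tried out on a toy example first. In the sketch below, `toy_texts` and the values in `toy_sim` are hypothetical and made up for illustration; a plain nested list stands in for `df_sim`. It also assumes that each unordered pair should be reported once, so only positions with `j > i` are visited.

```python
# Hypothetical texts and a hand-written symmetric similarity matrix.
toy_texts = [
    "bitcoin price climbs again today",
    "bitcoin price climbs again today!!! wow",
    "bitcoin price falls again today",
    "completely unrelated tweet",
]
toy_sim = [
    [1.0, 0.9, 0.8, 0.0],
    [0.9, 1.0, 0.75, 0.0],
    [0.8, 0.75, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]

kept = []
n = len(toy_texts)
for i in range(n):
    for j in range(i + 1, n):  # upper triangle only: each pair once
        if not (0.7 < toy_sim[i][j] < 1):
            continue
        a, b = toy_texts[i], toy_texts[j]
        # Skip near-copies that share the same opening 20 characters.
        if a[:20] in b or b[:20] in a:
            continue
        kept.append((i, j))
        print(i, j, toy_sim[i][j])

print(kept)  # [(0, 2), (1, 2)]
```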

In [None]:
# Your answer here


<hr>

The cell below creates a new dataframe `df2` with two columns, `text1` and `text2`, that holds every ordered pair of distinct texts from a random sample of 10 tweets.

In [None]:
texts = df.text.sample(n=10)
data = []

for text1 in texts:
    for text2 in texts:
        if text1 != text2:
            data.append([text1, text2])
            
df2 = pd.DataFrame(data=data, columns=["text1", "text2"])
df2

5\. Add a new column `sim_jaccard` to `df2` that holds the Jaccard similarity of the two texts in the `text1` and `text2` columns.
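
As a reminder of what the Jaccard similarity measures, here is a small worked sketch: the size of the intersection of the two word sets divided by the size of their union. The sentences are hypothetical, and whitespace splitting is a simplification used in place of a real tokenizer.

```python
# Hypothetical sentences; split() stands in for a proper tokenizer.
words_a = set("the cat sat".split())
words_b = set("the cat ran".split())

shared = words_a & words_b    # {'the', 'cat'}
combined = words_a | words_b  # {'the', 'cat', 'sat', 'ran'}

toy_jaccard = len(shared) / len(combined)
print(toy_jaccard)  # 0.5
```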

In [None]:
def tokenize(text):
    words = nltk.word_tokenize(text)
    words = [word.lower() for word in words if word not in string.punctuation]
    
    return words

def get_jaccard_sim(text1, text2):
    a = set(tokenize(text1))
    b = set(tokenize(text2))
    i = a & b
    u = a | b

    # If neither text has any tokens, the union is empty; treat the similarity as 0.
    return len(i) / len(u) if u else 0.0

In [None]:
# Your answer here


In [None]:
# Check your answer here. (Do not make any change to this cell. Just run this cell.)
df2[["text1", "text2", "sim_jaccard"]]

6\. Add a new column `sim_cosine` to `df2` that holds the cosine similarity of the two texts in the `text1` and `text2` columns.
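
It helps to see how cosine similarity on term counts differs from a set-based measure: repeated words change the count vectors but not the word sets. The sketch below illustrates this with `collections.Counter` in pure Python. The two sentences and the `count_cosine` helper are hypothetical, and whitespace splitting is a simplification, so the numbers are not meant to match scikit-learn's tokenization.

```python
import math
from collections import Counter

def count_cosine(text_a, text_b):
    """Cosine similarity between the raw term-count vectors of two texts."""
    counts_a, counts_b = Counter(text_a.split()), Counter(text_b.split())
    dot = sum(counts_a[w] * counts_b[w] for w in counts_a)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical texts: same vocabulary, different word frequencies.
text_a = "good good good day"
text_b = "good day"

set_a, set_b = set(text_a.split()), set(text_b.split())
toy_set_sim = len(set_a & set_b) / len(set_a | set_b)
toy_cos_sim = count_cosine(text_a, text_b)

print(toy_set_sim)            # 1.0
print(round(toy_cos_sim, 4))  # 0.8944
```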

In [None]:
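from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def get_cosine_sim(text1, text2):
    corpus = [text1, text2]
    # With use_idf=False and norm=None, the vectors are raw term counts.
    vectorizer = TfidfVectorizer(use_idf=False, norm=None)
    dtm = vectorizer.fit_transform(corpus)
    
    return cosine_similarity(dtm)[0][1]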

In [None]:
# Your answer here


In [None]:
# Check your answer here. (Do not make any change to this cell. Just run this cell.)
df2[["text1", "text2", "sim_jaccard", "sim_cosine"]]