# ENSF-612 Assignment 2

#### Install modules

In [0]:
!pip install nltk pyspellchecker beautifulsoup4



#### Import Modules

In [0]:
from spellchecker import SpellChecker
import nltk
import string
import re
import math
from bs4 import BeautifulSoup

#### Download `nltk` data

In [0]:
nltk.download(['punkt', 'stopwords', 'maxent_treebank_pos_tagger', 'averaged_perceptron_tagger', 'wordnet'])

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package maxent_treebank_pos_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package maxent_treebank_pos_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
Out[3]: True

## Question 1

#### 1. Task 1

To fix typos, we can use `pyspellchecker` that uses Levenshtein Distance algorithm to find permutations of the original word. It then compares all permutations (insertions, deletions, replacements, and transpositions) to known words in a word frequency list to determine the correct word.

The below steps are followed for this task -

1. Initialize `SpellChecker`, and remove the word "jsut" from the `pyspellchecker`, as it according to `pyspellchecker`, "jsut" is a valid english word.
1. Tokenize the original sentence into words using `word_tokenize` of `nltk`.
1. Pass each word to `SpellChecker` function `correction` to get the word without typo. The already correct word is returned as it is. Store the fixed words in a list.
1. Join the list back to get the sentence.

In [0]:
# 1. removing "jsut" word
spell = SpellChecker()
spell.word_frequency.remove('jsut')

# 2. tokenize the given sentence
given_sentence = 'this is jsut graet!'
given_words = nltk.word_tokenize(given_sentence)

# 3. fix typo in each word, and store in list fixed_words
fixed_words = []
for w in given_words:
  fixed_words.append(spell.correction(w))

# 4. form the sentence again
fixed_sentence = " ".join(fixed_words)

print(fixed_sentence)

this is just great !


#### 2. Task 2

So, there were two typos in the given sentence - 'jsut' and 'graet'.

1. jsut  
    Two replace are required for letter 's' to 'u' and letter 'u' to 's'
1. graet  
    Two replace are required for letter 'a' to 'e' and letter 'e' to 'a'

## Question 2

#### Importing CSV files

The csv files are uploaded to DBFS. We use the spark `read` function to import csv for all three programming languages. Also, we change the datatype of column "Score" to integer.

In [0]:
def read_CSV_to_DF(filepath):
  """
  Reads a csv file into a spark dataframe
  """
  df = (spark.read
        .option("multiline", "true")
        .option("quote", '"')
        .option("header", "true")
        .option("escape", "\\")
        .option("escape", '"')
        .csv(filepath)
        )
  
  return df


# importing files from DBFS
df_jv = read_CSV_to_DF('/FileStore/assignment_2/SO_Java.csv')
df_py = read_CSV_to_DF('/FileStore/assignment_2/SO_Python.csv')
df_js = read_CSV_to_DF('/FileStore/assignment_2/SO_Javascript.csv')

# cast the Score to int
df_jv = df_jv.withColumn("Score", df_jv["Score"].cast("int"))
df_py = df_py.withColumn("Score", df_py["Score"].cast("int"))
df_js = df_js.withColumn("Score", df_js["Score"].cast("int"))

#### 1. Pre-Processing

The textual contents in the files are preprocessed using various methods described below.

##### a. Merging of Title and Body column

The columns "Title" and "Body" are merged together and stored as "Title_Body", so that any operation applied is reflected to contents of both the columns.

In [0]:
@udf
def merge_cols(a, b):
  """
  UDF to merge two columns
  """
  c = a + " " + b  
  return c


df_jv = df_jv.select("*", merge_cols("Title", "Body").alias("Title_Body")).drop("Body")
df_py = df_py.select("*", merge_cols("Title", "Body").alias("Title_Body")).drop("Body")
df_js = df_js.select("*", merge_cols("Title", "Body").alias("Title_Body")).drop("Body")

##### b. Extraction of textual contents

After removing all the hyperlinks and code snippets, textual contents are extracted from the column "Title_Body". The case of all words is also lowered. This extracted textual contents in lowercase are stored in column "Text".

In [0]:
@udf
def preprocess_text(body):
  """
  UDF to extract textual contents
  """

  soup = BeautifulSoup(body)

  # remove hyperlinks
  urls  =  soup.find_all('a')
  if len(urls) > 0: soup.a.clear()

  # remove code
  codes = soup.find_all('code')
  if len(codes) > 0: soup.code.clear()

  # remove preformatted text
  pres = soup.find_all('pre')
  if len(pres) > 0: soup.pre.clear()

  # all the remaining is textual content
  text = soup.get_text().lower()

  return text


df_jv = df_jv.select("*", preprocess_text("Title_Body").alias("Text")).drop("Title_Body")
df_py = df_py.select("*", preprocess_text("Title_Body").alias("Text")).drop("Title_Body")
df_js = df_js.select("*", preprocess_text("Title_Body").alias("Text")).drop("Title_Body")

##### c. Tokenizing and stopwords removal

The column "Text" is tokenized from a string to a list of words using `nltk`, and then all stopwords are removed. Also, words smaller than length of three are also removed. The result is stored in column "Text_no_stopwords".

In [0]:
@udf
def remove_stopwords(text):
  """
  UDF to tokenize and remove stopwords and words of length
  two or less.
  """
  
  stopwords = nltk.corpus.stopwords.words('english')
  sentences = nltk.sent_tokenize(text)
  filtered_words = []

  for sentence in sentences:
    words = nltk.word_tokenize(sentence)

    for word in words:
      if len(word) < 3:
        continue
      
      if word in stopwords:
        continue
      
      filtered_words.append(word)

  return filtered_words


df_jv = df_jv.select("*", remove_stopwords("Text").alias("Text_no_stopwords")).drop("Text")
df_py = df_py.select("*", remove_stopwords("Text").alias("Text_no_stopwords")).drop("Text")
df_js = df_js.select("*", remove_stopwords("Text").alias("Text_no_stopwords")).drop("Text")

##### d. Noise removal
The punctuation, considered as noise, is removed. The results is then stored in column "Text_cleaned".

In [0]:
regex = re.compile('^[!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~]+$')

@udf
def remove_noise(text):
  """
  UDF to remove punctuation
  """
  
  text_no_noise = []
  
  for word in text:
    if bool(regex.match(word)) == False:
      text_no_noise.append(word)
    
  return text_no_noise


df_jv = df_jv.select("*", remove_noise("Text_no_stopwords").alias("Text_cleaned")).drop("Text_no_stopwords")
df_py = df_py.select("*", remove_noise("Text_no_stopwords").alias("Text_cleaned")).drop("Text_no_stopwords")
df_js = df_js.select("*", remove_noise("Text_no_stopwords").alias("Text_cleaned")).drop("Text_no_stopwords")

##### e. Stemming

All the words in the column "Text_cleaned" are now kept only in their root form. The Snowball stemmer is used for this purpose. The result is stored in column "Text_stemmed".

In [0]:
@udf
def word_stem(text):
  """
  UDF to stem the words
  """
  
  sb = nltk.stem.SnowballStemmer(language='english')
  
  stemmed_words = []
  
  for t in text:
    st = sb.stem(t)
    stemmed_words.append(st)
  
  return stemmed_words


df_jv = df_jv.select("*", word_stem("Text_cleaned").alias("Text_stemmed")).drop("Text_cleaned")
df_py = df_py.select("*", word_stem("Text_cleaned").alias("Text_stemmed")).drop("Text_cleaned")
df_js = df_js.select("*", word_stem("Text_cleaned").alias("Text_stemmed")).drop("Text_cleaned")

#### 2. Highest scored question without accepted answer

The pre-processed dataframe is filtered to have only those rows that don't have any accepted answer. Then, that dataframe is sorted in descending order on the column "Score", to have highest scored question on top. Only the topmost question with highest score is selected. If more than one question has same highest score, only the first to appear in the dataframe is selected.

In [0]:
def highest_unanswered(pyspark_df):
  """
  Returns the Row that has highest score question
  without any accepted answer
  """
  
  unanswered_df = pyspark_df.filter(pyspark_df["AcceptedAnswerId"].isNull())
  unanswered_df = unanswered_df.sort("Score", ascending=False)
  
  return unanswered_df.first()


q_jv = highest_unanswered(df_jv)
q_py = highest_unanswered(df_py)
q_js = highest_unanswered(df_js)

#### 3. Cosine Similarity

Below is the udf to calculate cosine similarity of parameter `t1` with global variable `t2`. Both `t1` and `t2` are list of words. The formula given in the slides is implemented in pure python.

Also, there is a utility function to call the udf, that will be used by dataframe of all three programming languages. This function filters to keep only questions that have accepted answers, and stores the cosine similarity in the column "Similarity". The dataframe is then sorted based on the cosine similarity score in descending order.

In [0]:
@udf
def cosine_similarity(t1):
  """
  UDF to calculates the cosine similarity of t1 with
  with highest scored question without accepted answer
  """
  
  d = list(set(t1 + t2))
  
  dt1 = []
  dt2 = []
  
  for word in d:
    if word in t1:
      dt1.append(1)
      
    else:
      dt1.append(0)
  
    if word in t2:
      dt2.append(1)
      
    else:
      dt2.append(0)
  
  numerator = 0
  sum_sq_dt1 = 0
  sum_sq_dt2 = 0
  
  for i in range(0, len(dt1)):
    numerator += dt1[i] * dt2[i]
    
    sum_sq_dt1 += dt1[i] * dt1[i]
    sum_sq_dt2 += dt2[i] * dt2[i]
    
  denominator = math.sqrt(sum_sq_dt1 * sum_sq_dt2)
  
  if denominator == 0:
    return 0
  
  return numerator/denominator


def df_with_similarity(pyspark_df):
  """
  Returns a pyspark dataframe with column Similarity
  calculated using cosine similarity
  """
  
  df = pyspark_df.filter(pyspark_df["AcceptedAnswerId"].isNotNull()).select("*", cosine_similarity("Text_stemmed").alias("Similarity")).sort("Similarity", ascending=False)
  
  return df

##### a. Calculating similarity of java df

In [0]:
t2 = q_jv["Text_stemmed"].strip('][').split(', ')
answered_df_jv = df_with_similarity(df_jv)

##### b. Calculating similarity of python df

In [0]:
t2 = q_py["Text_stemmed"].strip('][').split(', ')
answered_df_py = df_with_similarity(df_py)

##### c. Calculating similarity of javascript df

In [0]:
t2 = q_js["Text_stemmed"].strip('][').split(', ')
answered_df_js = df_with_similarity(df_js)

#### 4. Printing most similar question

In [0]:
def print_top_n(df, n):
  """
  Prints top n Title of df
  """
  
  top_n = df.take(n)
  
  for i in range(0, n):
    print("  {}. [Id = {}] {}".format(i+1, top_n[i]["Id"], top_n[i]["Title"]))

Three most similar questions in Java

In [0]:
print("The highest scored question in Java without any accepted answer is-\n  [Id = {}] {}\n".format(q_jv["Id"], q_jv["Title"]))
 
print("And the top three most similar questions with accepted answers are")

print_top_n(answered_df_jv, 3)

The highest scored question in Java without any accepted answer is-
  [Id = 109383] Sort a Map<Key, Value> by values

And the top three most similar questions with accepted answers are
  1. [Id = 48663774] Add a value to all Sets (values in a map) in a Map<String, Set<String>>
  2. [Id = 16954286] Initialize map of map in java
  3. [Id = 32078074] Java: Array sorting


Three most similar questions in Python

In [0]:
print("The highest scored question in Python without any accepted answer is-\n  [Id = {}] {}\n".format(q_py["Id"], q_py["Title"]))

print("And the top three most similar questions with accepted answers are")

print_top_n(answered_df_py, 3)

The highest scored question in Python without any accepted answer is-
  [Id = 455612] Limiting floats to two decimal points

And the top three most similar questions with accepted answers are
  1. [Id = 32778666] How to change all True values in list on False and vice versa
  2. [Id = 32300907] Counting number of unique values in subset of sorted array
  3. [Id = 63869688] Returning large data array from Python to Java using Chaquopy


Three most similar questions in JavaScript

In [0]:
print("The highest scored question in Javascript without any accepted answer is-\n  [Id = {}] {}\n".format(q_js["Id"], q_js["Title"]))

print("And the top three most similar questions with accepted answers are")

print_top_n(answered_df_js, 3)

The highest scored question in Javascript without any accepted answer is-
  [Id = 8567114] How to make an AJAX call without jQuery?

And the top three most similar questions with accepted answers are
  1. [Id = 890807] Iterate over a Javascript associative array in sorted order
  2. [Id = 48863885] Converting array of object to en object with key : value dynamic
  3. [Id = 417645] How to convert variable name to string in JavaScript?
