# **INFO5731 Assignment Three**

In this assignment, you are required to conduct information extraction and semantic analysis based on **the dataset you collected in assignment two**. You may use the scipy and numpy packages in this assignment.

# **Question 1: Understand N-gram**

(45 points). Write a Python program to conduct N-gram analysis based on the dataset from your assignment two:

(1) Count the frequency of all the N-grams (N=3).

(2) Calculate the probabilities for all the bigrams in the dataset by using the formula count(w2 w1) / count(w2). For example, count(really like) / count(really) = 1 / 3 = 0.33.

(3) Extract all the **noun phrases** and calculate the relative probabilities of each review in terms of other reviews (abstracts, or tweets) by using the formula frequency (noun phrase) / max frequency (noun phrase) on the whole dataset. Print out the result in a table with all the noun phrases as column names and all the 100 reviews (abstracts, or tweets) as row names.
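
As a hedged illustration of parts (1) and (2), the sketch below counts n-grams on a toy token list and divides each bigram count by the count of its first word, which is how the `count(really like) / count(really)` example reads. The toy sentence and the helper names `ngrams` and `bigram_probabilities` are assumptions made for illustration; they are not part of the assignment data.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all full-length n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bigram_probabilities(tokens):
    """Map each bigram (first, second) to count(first second) / count(first)."""
    word_count = Counter(tokens)
    bigram_count = Counter(ngrams(tokens, 2))
    return {bg: c / word_count[bg[0]] for bg, c in bigram_count.items()}

# Toy data (assumed): "really" occurs three times, "really like" once.
tokens = "i really like it really good really nice".split()
trigram_count = Counter(ngrams(tokens, 3))
probs = bigram_probabilities(tokens)

print(trigram_count[("i", "really", "like")])
print(round(probs[("really", "like")], 2))
```

Using tuples as keys avoids re-splitting strings later, and stopping the loop at `len(tokens) - n + 1` keeps shorter fragments at the end of the list out of the counts.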


In [None]:
# Imports and downloads
import string
import csv
import re
# Import and download all needed NLTK modules and data
import nltk
from nltk.corpus import PlaintextCorpusReader
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from textblob import Word # for lemmatization

nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

In [None]:
# Raw results are cleaned again to capture doi from URL to use as doc ID
results_file_path = 'citeseerx_results.csv'
doc_id_file_path = 'citeseerx_results_docid.csv'
# .*doi=(.*?)&
p = re.compile('.*doi=(.*?)&')
cleaned = []

# open file to write doc id and abstract for sentiment analysis
# newline='' lets the csv module control line endings (avoids blank rows on Windows)
doc_object = open(doc_id_file_path, "w", newline='')
doc_writer = csv.writer(doc_object)
header = ['doc_id', 'abstract', 'sentiment']
doc_writer.writerow(header)
# Open data file containing previous results
with open(results_file_path) as results_file:
  results_reader = csv.DictReader(results_file)
  for result in results_reader:
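    url = result['url']
    m = p.search(url)
    if m is None:
      # skip rows whose URL carries no doi parameter instead of crashing on m.group
      continue
    doi = m.group(1)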
    abstract = result['abstract']
    #print(doi, abstract)
    #doc_object.write(doi + ',' + abstract + '\n')
    doc_writer.writerow([doi, abstract, None])

    sentences = nltk.sent_tokenize(abstract)
    # lowercase
    sentences_lowered = []
    for sent in sentences:
      sent_lowered = []
      words = nltk.word_tokenize(sent)
      for word in words:
        sent_lowered.append(word.lower())        
      #print(' '.join(sent_lowered))
      sentences_lowered.append(sent_lowered)
    #print(sentences_lowered)

    # remove punctuation
    # make translation for punctuation
    remove_punctuation = str.maketrans(string.punctuation, ' '*len(string.punctuation))
    no_punc_sentences = []
    for sent in sentences_lowered:
      no_punc_sent = []
      for word in sent:
        no_punc_word = word.translate(remove_punctuation)
        if not no_punc_word.isspace():
          no_punc_sent.append(no_punc_word)
      #print(no_punc_sent)
      no_punc_sentences.append(no_punc_sent)
    #print(no_punc_sentences)

    # remove numbers
    remove_digits = str.maketrans(string.digits, ' '*len(string.digits))
    no_digits_sentences = []
    for sent in no_punc_sentences:
      no_digit_sent = []
      for word in sent:
        word_no_digits = word.translate(remove_digits)
        if not word_no_digits.isspace():
          no_digit_sent.append(word_no_digits)
      no_digits_sentences.append(no_digit_sent)
    #print(no_digits_sentences)

    # remove stopwords
    stop = stopwords.words('english')
    no_stops_sentences = []
    for sent in no_digits_sentences:
      no_stop_sent = []
      for word in sent:
        if word not in stop:
          no_stop_sent.append(word)
      no_stops_sentences.append(no_stop_sent)
    #print(no_stops_sentences)

    # lemmatize
    #st = PorterStemmer()
    lemmed_sentences = []
    for sent in no_stops_sentences:
      lemmed_sent = []
      for word in sent:
        #lemmed.append(st.stem(Word(word).lemmatize()))
        lemmed_sent.append(Word(word).lemmatize())
      lemmed_sentences.append(lemmed_sent)
    # create single doc for each result
    doc_list = []
    for l in lemmed_sentences:
      doc_list.extend(l)
    lemmed_doc = ' '.join(str(tok) for tok in doc_list)
    # add to cleaned results
    cleaned.append({'doi': doi, 'lemmed_doc': lemmed_doc, 'lemmed': lemmed_sentences})

doc_object.close()
# Generate text corpus from results
corpus_list = [] # for tf-idf
for result in cleaned:
  #print(result)
  corpus_list.append(result['lemmed_doc'])
# for n-gram: join non-empty docs with a space so words do not merge across documents
corpus_text = ' '.join(doc for doc in corpus_list if doc)
print(corpus_text)



concept maximum entropy traced back along multiple thread biblical time recently however computer become powerful enough permit widescale application concept real world problem statistical estimation pattern recognition paper describe method statistical modeling based maximum entropy present maximum likelihood approach automatically constructing maximum entropy model describe implement approach efficiently using example several problem natural language processing scaling conditional random field natural language processing term condition term condition copyright work deposited minerva access retained paper address issue cooperation linguistics natural language processing nlp general linguistics machine translation mt particular focus one direction cooperation namely application linguistics nlp virtually ignoring natural language processing application description logic used encode knowledge base syntactic semantic pragmatic element needed drive semantic interpretation natural language gen
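
The cleaning cell above pulls the document ID out of each result URL with a regular expression that expects `doi=` to be followed by `&`. A minimal sketch of a more forgiving lookup, using the standard library's URL parser, is shown here. The helper name `extract_doi` and the example URLs are hypothetical and only illustrate the shape of the input.

```python
from urllib.parse import urlparse, parse_qs

def extract_doi(url):
    """Return the value of the doi query parameter, or None if it is absent."""
    query = parse_qs(urlparse(url).query)
    values = query.get('doi')
    return values[0] if values else None

# Hypothetical URLs: one with a doi parameter, one without.
with_doi = 'http://example.org/viewdoc/summary?doi=10.1.1.103.7637&rank=1'
without_doi = 'http://example.org/viewdoc/summary'

print(extract_doi(with_doi))
print(extract_doi(without_doi))
```

Returning `None` for a missing parameter lets the caller decide whether to skip the row or assign a fallback ID.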

In [None]:
# ngrams
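def ngrams(words, n):
    """Return every full-length n-gram in words as a space-joined string."""
    all_ngrams = []
    # stop at len(words) - n + 1 so the tail does not yield shorter fragments
    for x in range(0, len(words) - n + 1):
        ngram = ' '.join(words[x:x + n])
        all_ngrams.append(ngram)

    return all_ngrams
# split() without arguments also drops empty strings left by repeated spaces
corpus_word_list = corpus_text.split()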

# 1. trigram frequency
trigrams = ngrams(corpus_word_list, 3)
trigram_count = {}
for trigram in trigrams:
  count = trigram_count.get(trigram, 0)
  trigram_count[trigram] = count + 1
print('Trigram count:')
print(trigram_count)

# 2. bigram frequency
# Calculate the probabilities for all the bigrams in the dataset by using the formula
# count(w2 w1) / count(w2). For example, count(really like) / count(really) = 1 / 3 = 0.33.
bigrams = ngrams(corpus_word_list, 2)
bigram_count = {}
for bigram in bigrams:
  count = bigram_count.get(bigram, 0)
  bigram_count[bigram] = count + 1
# word count
word_count = {}
for word in corpus_word_list:
  count = word_count.get(word, 0)
  word_count[word] = count + 1
print('Word count:')
print(word_count)
print('Bigram probability:')
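for bigram in bigram_count:
  try:
    w1, w2 = bigram.split()
  except ValueError:
    # skip fragments that are not exactly two words (w2 would be stale or undefined)
    continue
  count = bigram_count[bigram]
  # NOTE: this divides by the count of the bigram's second word (w2 here);
  # the assignment's example, count(really like) / count(really), divides by the first word
  w2_count = word_count[w2]
  #print(w2_count)
  prob = count / w2_count
  print(bigram, '-', prob)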


Streaming output truncated to the last 5000 lines.
one meaning - 0.2
meaning understood - 0.5
understood one - 0.0625
one way - 0.1111111111111111
way natural - 0.006666666666666667
language ambiguous - 1.0
ambiguous computer - 0.058823529411764705
computer able - 0.16666666666666666
able understand - 0.4
understand language - 0.005
language way - 0.2222222222222222
way people - 0.25
people natural - 0.013333333333333334
nlp concerned - 1.0
concerned development - 0.05
development computational - 0.08333333333333333
computational model - 0.030303030303030304
model aspect - 0.3333333333333333
aspect human - 0.07142857142857142
human language - 0.01
processing ambiguity - 0.25
ambiguity occur - 1.0
occur various - 0.07142857142857142
various level - 0.1111111111111111
level nlp - 0.011627906976744186
nlp ambiguity - 0.25
ambiguity could - 0.5
could lexical - 0.2
lexical syntactic - 0.09090909090909091
pragmatic etc - 0.25
etc paper - 0.03125
paper present - 0.25
present stu
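
Part (3) of Question 1 asks for a table of relative noun-phrase frequencies, which the cells above do not cover. The sketch below shows one possible reading of the formula: each cell is the phrase's frequency in a document divided by the highest frequency that phrase reaches in any single document. That reading, the helper name `relative_np_table`, and the toy phrase lists are assumptions. In practice the phrase lists would come from a noun-phrase extractor (for example TextBlob's `noun_phrases` property), which this notebook does not call.

```python
from collections import Counter

def relative_np_table(doc_phrases):
    """Return (phrases, rows); each cell is freq in doc / max freq of that phrase in any doc."""
    counts = [Counter(phrases) for phrases in doc_phrases]
    vocab = sorted(set().union(*counts))
    max_freq = {phrase: max(c[phrase] for c in counts) for phrase in vocab}
    rows = [[c[phrase] / max_freq[phrase] for phrase in vocab] for c in counts]
    return vocab, rows

# Toy noun phrases per document (assumed, not taken from the dataset).
doc_phrases = [
    ["maximum entropy", "maximum entropy", "language model"],
    ["language model", "maximum entropy"],
    [],
]
vocab, rows = relative_np_table(doc_phrases)

# Columns are noun phrases, rows are documents.
print("doc", *vocab, sep="\t")
for doc_index, row in enumerate(rows):
    print(doc_index, *(f"{value:.2f}" for value in row), sep="\t")
```

A document with no noun phrases still gets a row of zeros, so the table keeps one row per document.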

# **Question 2: Understand TF-IDF and Document representation**

(40 points). Starting from the documents (all the reviews, or abstracts, or tweets) collected for assignment two, write a Python program:

(1) To build the **document-term weights (tf*idf) matrix**.

(2) To rank the documents with respect to a query (design a query by yourself, for example, "An outstanding movie with a haunting performance and best character development") by using **cosine similarity**.
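
To make the two steps concrete, here is a minimal standard-library sketch under simple assumptions: the weight is the raw term count times `log(N / df)`, and documents are ranked by cosine similarity to the query vector. The toy documents, the query, and the helper names are hypothetical. scikit-learn's `TfidfVectorizer` uses a smoothed idf and L2 normalization by default, so its numbers will differ from these.

```python
import math
from collections import Counter

def vectorize(text, vocab, idf):
    """Weight each vocabulary term by term count * idf; unknown words are ignored."""
    tf = Counter(text.split())
    return [tf[tok] * idf[tok] for tok in vocab]

def build_tfidf(docs):
    """Return (vocab, idf, matrix) with one row of weights per document."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    df = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {tok: math.log(len(docs) / df[tok]) for tok in vocab}
    matrix = [vectorize(doc, vocab, idf) for doc in docs]
    return vocab, idf, matrix

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Toy corpus and query (assumed for illustration).
docs = ["maximum entropy model", "language model", "language processing"]
vocab, idf, matrix = build_tfidf(docs)
query_vec = vectorize("maximum entropy method", vocab, idf)
scores = [cosine(query_vec, row) for row in matrix]
ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)

for i in ranked:
    print(round(scores[i], 3), docs[i])
```

The query is projected onto the document vocabulary, so a query word that never occurs in the corpus ("method" here) contributes nothing to the score.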

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer 
tfidf = TfidfVectorizer() 
tfidf_result = tfidf.fit_transform(corpus_list) 
#docs_tfidf = vectorizer.fit_transform(corpus_list)

# 1. Matrix
print('TFIDF matrix:')  
matrix = tfidf_result.toarray()
for line in matrix:
  #print(str(line))
  line_val = []
  for item in line:
    line_val.append(str(item))
  print(line_val)


TFIDF matrix:
['0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0'

In [None]:
# 2. rank by similarity
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

query_text = ['A novel method for extracting text from natural history collections produces world changing results']
tfidf_query = tfidf.transform(query_text)
similarities = cosine_similarity(tfidf_query, tfidf_result)
similar_docs = []
item_index = 0
for sim in np.nditer(similarities):
  doc = corpus_list[item_index]
  #print(sim, doc)
  item_index += 1
  similar_docs.append({'similarity': float(sim), 'doc':doc})

print(similar_docs)

[{'similarity': 0.0, 'doc': ''}, {'similarity': 0.07567315258156022, 'doc': 'concept maximum entropy traced back along multiple thread biblical time recently however computer become powerful enough permit widescale application concept real world problem statistical estimation pattern recognition paper describe method statistical modeling based maximum entropy present maximum likelihood approach automatically constructing maximum entropy model describe implement approach efficiently using example several problem natural language processing'}, {'similarity': 0.009174037342425922, 'doc': 'scaling conditional random field natural language processing term condition term condition copyright work deposited minerva access retained'}, {'similarity': 0.008389090899131448, 'doc': 'paper address issue cooperation linguistics natural language processing nlp general linguistics machine translation mt particular focus one direction cooperation namely application linguistics nlp virtually ignoring'}, 
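
The list printed above is in corpus order rather than ranked. A short sketch of the missing sort step follows. The helper name `rank_documents` is hypothetical, the scores are copied from the first four entries of the output above, and the document texts are replaced by placeholder labels.

```python
def rank_documents(similarities, docs):
    """Pair each document with its score and sort from most to least similar."""
    scored = [{'similarity': float(s), 'doc': d} for s, d in zip(similarities, docs)]
    return sorted(scored, key=lambda item: item['similarity'], reverse=True)

# Scores from the first four printed entries; labels stand in for the abstracts.
scores = [0.0, 0.07567315258156022, 0.009174037342425922, 0.008389090899131448]
labels = ['doc 0', 'doc 1', 'doc 2', 'doc 3']

for rank, item in enumerate(rank_documents(scores, labels), start=1):
    print(rank, item['doc'], item['similarity'])
```

In the notebook, the same helper could presumably be called as `rank_documents(similarities[0], corpus_list)`, since `cosine_similarity` returns one row of scores per query.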

# **Question 3: Create your own training and evaluation data for sentiment analysis**

(15 points). **You don't need to write a program for this question!** Read each review (abstract or tweet) you collected in detail, and annotate each review with a sentiment (positive, negative, or neutral). Save the annotated dataset into a csv file with three columns (document_id, clean_text, sentiment), upload the csv file to GitHub, and submit the file link below. This dataset will be used for assignment four: sentiment analysis and text classification.
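
Although no program is required here, a small script can help keep the annotated file consistent. The sketch below writes the three columns and rejects any label other than the three allowed sentiments. The helper name `write_annotations` and the example rows are hypothetical, and an in-memory buffer stands in for the real csv file.

```python
import csv
import io

ALLOWED_SENTIMENTS = {'positive', 'negative', 'neutral'}

def write_annotations(rows, handle):
    """Write (document_id, clean_text, sentiment) rows, rejecting unknown labels."""
    writer = csv.writer(handle)
    writer.writerow(['document_id', 'clean_text', 'sentiment'])
    for doc_id, text, sentiment in rows:
        if sentiment not in ALLOWED_SENTIMENTS:
            raise ValueError(f'unexpected sentiment {sentiment!r} for {doc_id}')
        writer.writerow([doc_id, text, sentiment])

# Hypothetical annotated rows.
rows = [
    ('doc-001', 'paper present maximum entropy model', 'neutral'),
    ('doc-002', 'method achieves strong result', 'positive'),
]
buffer = io.StringIO()
write_annotations(rows, buffer)
print(buffer.getvalue())
```

When writing to a real file, opening it with `newline=''` lets the `csv` module manage line endings itself.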


In [None]:
# The GitHub link of your final csv file

# Link: https://github.com/jbest/Jason_INFO5731_Spring2021/blob/main/citeseerx_results_sentiment.csv