# Introduction to NLP: Assignment 3

## Assignment on Unsupervised Learning in NLP

### Description of Assignment 3

This assignment relates to Theme 3 of the Introduction to NLP course and will focus on topics of unsupervised learning in NLP.
The data set to be used in the assignment is the abstracts.zip file. Use the files in the subdirectory "awards_2002" as the corpus in the exercises. Follow the code examples in the lecture notebook, adapt them to work with this data, and perform the steps described in the "Assignment steps/Questions" section below.

**Assignment steps/Questions:**

1. **Topic Analysis**
   * There are quantitative measures for evaluating various aspects of clustering results and topic models intrinsically, and they can also be evaluated extrinsically by how well the clusters/topics serve some supervised task as features. In this exercise, however, we will focus on qualitative evaluation of the results in terms of their descriptiveness. As in the example code, you may limit yourself to the 1000 first documents of the corpus when performing clustering, in order to simplify the task and speed up experimentation, but use the whole corpus to calculate tf-idf features.
   * **a.** Experiment with different setups of the tf-idf feature extraction and clustering (k-means or hierarchical), in order to obtain meaningful results. When you arrive at a good configuration, describe it and motivate your chosen setup/parameters.
   * **b.** Inspect the keywords of the clusters. List the first 10 clusters out of all (i.e., not cherry-picked examples) and provide as descriptive a label as possible for each of them.
   * **c.** Select one or two good clusters (that can be clearly interpreted) and one or two bad clusters (that might be difficult to interpret or distinguish). Motivate your choice (clusters may, for instance, be overlapping, too broad/narrow or incoherent).
   * **d.** Repeat the experiment in (a) with LDA topic modeling instead (on the whole corpus), and explain briefly how the results compare to your previously chosen clustering setup. A few concrete examples may be helpful. Do your best to make sure the list of topic keywords is informative through appropriate post-processing.

2. **Word Vectors**
   * **a.** Choose about 5 words (arbitrarily) to use as seed words in the following experiment. Train word2vec vectors on the corpus while trying out variations on the parameters. Evaluate the vector models by inspecting the most similar words for each of the seed words, and try to identify qualitative differences between different parameter choices. Which parameters seem to have the most interesting effect? At what values? Motivate. Finally, study the qualitative effect of increasing the training data, by similarly comparing vectors trained with the best setup on the texts from the awards_2002 directory against vectors trained on the whole set of abstracts (1990-2002).
   * **b.** Repeat the experiment with ELMo from the lecture, with a different target word and different sentences. Choose a word that can have multiple senses, and construct 10 sentences that express 2-3 different senses of the word. Produce ELMo embeddings for the target word in each sentence and measure the similarity between the vectors. Evaluate in how many cases the measured similarities can be used to successfully distinguish between the different senses. Comment on the results, e.g., are you able to identify a particular way in which the model fails?
   

### Import libraries

In [51]:
import os
from sklearn.feature_extraction.text import TfidfVectorizer
from collections import Counter
from sklearn.cluster import KMeans
import heapq, numpy as np
#!pip3 install gensim
from gensim import corpora, models

### Read in the data set 

In [2]:
abstracts = []

corpus_dir = os.path.join(os.getcwd(), 'awards_2002')
for subdirectory in os.listdir(corpus_dir):
    subdir_path = os.path.join(corpus_dir, subdirectory)
    for filename in os.listdir(subdir_path):
        path = os.path.join(subdir_path, filename)
        with open(path, 'rt', encoding='latin1') as f:
            # Collect every line that follows the "Abstract    :" header line
            readNextLines = False
            abstractStr = ""
            for line in f:
                if readNextLines:
                    abstractStr += line.strip() + " "
                if line.startswith('Abstract    :'):
                    readNextLines = True
            abstracts.append(abstractStr)

## 1. Topic Analysis
**There are quantitative measures for evaluating various aspects of clustering results and topic models intrinsically, and they can also be evaluated extrinsically by how well the clusters/topics serve some supervised task as features. In this exercise, however, we will focus on qualitative evaluation of the results in terms of their descriptiveness. As in the example code, you may limit yourself to the 1000 first documents of the corpus when performing clustering, in order to simplify the task and speed up experimentation, but use the whole corpus to calculate tf-idf features.**

### 1.a - Experiment with different setups of the tf-idf feature extraction and clustering (k-means or hierarchical), in order to obtain meaningful results. When you arrive at a good configuration, describe it and motivate your chosen setup/parameters.

In [33]:
tfidf_vectorizer = TfidfVectorizer(min_df=2, use_idf=True, sublinear_tf=True, max_df=1.0, max_features=20000, ngram_range=(1,1))
# Tip: the vectorizer also supports extracting n-gram features (common short sequences of words), which may be more descriptive but also much less frequent

# Calculate term-document matrix with tf-idf scores
tfidf_matrix = tfidf_vectorizer.fit_transform(abstracts)

# Check matrix shape
tfidf_matrix.shape # N_docs x N_terms (no need to densify the sparse matrix)

(9923, 20000)
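
To make the effect of `sublinear_tf=True` more concrete, the following is a minimal plain-Python sketch of the weighting, assuming the smoothed idf and L2 normalisation that `TfidfVectorizer` applies by default. It is an illustration only, not scikit-learn's implementation; the function name `tfidf_vectors` and the toy documents are made up for this example.

```python
import math
from collections import Counter

def tfidf_vectors(tokenized_docs):
    """Sublinear tf-idf with smoothed idf and L2 normalisation (one dict per document)."""
    n_docs = len(tokenized_docs)
    # Document frequency: in how many documents each term occurs
    df = Counter(term for doc in tokenized_docs for term in set(doc))
    # Smoothed idf: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}

    vectors = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        # Sublinear tf: 1 + ln(count) dampens the effect of repeated terms
        weights = {t: (1 + math.log(c)) * idf[t] for t, c in counts.items()}
        # L2 normalisation; fall back to 1.0 for empty documents
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({t: w / norm for t, w in weights.items()})
    return vectors

# Hypothetical toy corpus
docs = [
    "the chow ring of this variety".split(),
    "the fluid equations of this model".split(),
    "the students of this college".split(),
]
for vec in tfidf_vectors(docs):
    # Three highest-weighted terms per document
    print(sorted(vec.items(), key=lambda kv: -kv[1])[:3])
```

Terms that occur in every document ("the", "of", "this") receive the minimum idf of 1, so the document-specific terms end up with the highest weights.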

In [34]:
terms_in_docs = tfidf_vectorizer.inverse_transform(tfidf_matrix)
token_counter = Counter()
for terms in terms_in_docs:
    token_counter.update(terms)

for term, count in token_counter.most_common(20):
    print("%d\t%s" % (count, term))

9637	the
9619	of
9613	and
9511	to
9442	in
8743	this
8625	for
8269	is
8228	will
7632	be
7419	on
7271	with
7167	that
6656	are
6642	research
6561	by
6424	as
5968	from
5750	an
5402	these
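
The most frequent terms are almost all function words, and with `max_df=1.0` none of them are removed from the vocabulary. The sketch below illustrates in plain Python what document-frequency filtering does, under the assumption that `min_df` is an absolute count and `max_df` a proportion of documents, as in the vectorizer call above. The helper name, the toy documents and the threshold `max_df=0.5` are hypothetical and only meant for experimentation.

```python
from collections import Counter

def filter_by_document_frequency(tokenized_docs, min_df=2, max_df=0.5):
    """Keep terms found in at least min_df documents and at most a fraction max_df of them."""
    n_docs = len(tokenized_docs)
    # Count each term once per document
    df = Counter(term for doc in tokenized_docs for term in set(doc))
    kept = {t for t, d in df.items() if d >= min_df and d / n_docs <= max_df}
    return sorted(kept)

# Hypothetical toy corpus
docs = [
    "the theory of lie algebras".split(),
    "the theory of fluid waves".split(),
    "the fleet of research ships".split(),
    "the college of engineering".split(),
]
# "the" and "of" occur in every document and are dropped by max_df=0.5
print(filter_by_document_frequency(docs, min_df=2, max_df=0.5))
# With max_df=1.0 they stay in the vocabulary
print(filter_by_document_frequency(docs, min_df=2, max_df=1.0))
```

Lowering `max_df` in the real vectorizer, or passing a stop-word list, would be two ways to try the same idea on the corpus.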


In [35]:
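## Inspect top terms per document

# Note: newer scikit-learn versions replace get_feature_names() with get_feature_names_out()
features = tfidf_vectorizer.get_feature_names()
for doc_i in range(5):
    print("\nDocument %d, top terms by TF-IDF" % doc_i)
    # Densify only the current row instead of the whole matrix
    doc_scores = tfidf_matrix[doc_i].toarray().ravel()
    for term, score in sorted(zip(features, doc_scores), key=lambda x: -x[1])[:5]:
        print("%.2f\t%s" % (score, term))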


Document 0, top terms by TF-IDF
0.29	chow
0.19	hodge
0.19	algebraic
0.14	geometry
0.14	subgroup

Document 1, top terms by TF-IDF
0.22	cultivating
0.21	ethnically
0.20	control
0.20	exchanging
0.18	american

Document 2, top terms by TF-IDF
0.30	updating
0.20	reference
0.18	secondly
0.17	method
0.14	simulation

Document 3, top terms by TF-IDF
0.23	conference
0.20	computations
0.19	learn
0.18	theory
0.18	group

Document 4, top terms by TF-IDF
0.21	uncontrollable
0.20	commonality
0.19	uncertainties
0.18	preferences
0.18	alternative


#### Try out clustering

In [36]:
matrix_sample = tfidf_matrix[:1000]

In [42]:

# Do clustering
km = KMeans(n_clusters=30, random_state=123, verbose=0)
km.fit(matrix_sample)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
       n_clusters=30, n_init=10, n_jobs=None, precompute_distances='auto',
       random_state=123, tol=0.0001, verbose=0)

In [43]:

# Custom function to print top keywords for each cluster
def print_clusters(matrix, clusters, n_keywords=10):
    for cluster in range(min(clusters), max(clusters)+1):
        cluster_docs = [i for i, c in enumerate(clusters) if c == cluster]
        print("Cluster: %d (%d docs)" % (cluster, len(cluster_docs)))
        
        # Keep scores for top n terms
        new_matrix = np.zeros((len(cluster_docs), matrix.shape[1]))
        for cluster_i, doc_vec in enumerate(matrix[cluster_docs].toarray()):
            for idx, score in heapq.nlargest(n_keywords, enumerate(doc_vec), key=lambda x:x[1]):
                new_matrix[cluster_i][idx] = score

        # Aggregate scores for kept top terms
        keywords = heapq.nlargest(n_keywords, zip(new_matrix.sum(axis=0), features))
        print(', '.join([w for s,w in keywords]))
        print()


In [44]:
print_clusters(matrix_sample, km.labels_)

Cluster: 0 (4 docs)
proofs, resultants, theorems, nova, lemma, footage, analogy, pbs, logarithmic, wgbh

Cluster: 1 (88 docs)
equations, dynamical, fluid, nonlinear, random, waves, solutions, fluids, mesoscale, schrodinger

Cluster: 2 (46 docs)
colleges, alliance, technicians, college, umeb, manufacturing, curriculum, workforce, technical, industry

Cluster: 3 (43 docs)
available, not, zygotic, zygomycota, zygomycetes, zygmund, zurich, zuni, zro2, zr

Cluster: 4 (56 docs)
conference, algebraic, contract, arsenic, seminar, meeting, stochastic, statistical, grid, haiwee

Cluster: 5 (13 docs)
oceanographic, vessel, fleet, ship, ships, shipboard, operated, ctd, unols, equipment

Cluster: 6 (26 docs)
fellowship, mathematical, sciences, fellowships, zygotic, zygomycota, zygomycetes, zygmund, zurich, zuni

Cluster: 7 (30 docs)
algebras, representation, lie, symmetries, theory, representations, finite, sheaves, quantum, automorphic

Cluster: 8 (45 docs)
workshop, cmes, costa, cme, sediment, so
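
Clusters 3 and 6 list terms such as "zygotic, zygomycota, zygomycetes" after their first few keywords. These are most likely filler: when fewer than `n_keywords` terms have a non-zero aggregated score, `heapq.nlargest` breaks the ties at score 0 by comparing the term strings, so the alphabetically last vocabulary entries are returned. A minimal sketch of a keyword selection that skips zero scores is shown below; the helper `top_keywords` and the toy scores are hypothetical and not part of the lecture code.

```python
import heapq

def top_keywords(scores, terms, n_keywords=10):
    """Return up to n_keywords terms with the highest strictly positive score."""
    scored = [(s, t) for s, t in zip(scores, terms) if s > 0]
    return [t for s, t in heapq.nlargest(n_keywords, scored)]

# Toy example: only two terms have a non-zero aggregated score
terms = ["available", "not", "zr", "zuni", "zurich"]
scores = [3.0, 2.5, 0.0, 0.0, 0.0]

# Selection as in print_clusters: ties at 0.0 are filled with the last terms alphabetically
print([t for s, t in heapq.nlargest(4, zip(scores, terms))])
# Selection that ignores zero scores
print(top_keywords(scores, terms, n_keywords=4))
```

In `print_clusters`, the same filter could be applied to `zip(new_matrix.sum(axis=0), features)` before calling `heapq.nlargest`.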

#### Explanation to choices

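I restricted the n-gram range to at most (1,2), since longer n-grams would be too rare to be informative; the run shown above uses unigrams only, `ngram_range=(1,1)`. The vocabulary is capped at `max_features=20000`, because terms rarer than that should have little effect on the clustering.

I chose k-means because it is fast, performs well, and is more familiar to me than hierarchical clustering. The number of clusters (`n_clusters=30`) still needs to be motivated.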

### 1.b - Inspect the keywords of the clusters. List the first 10 clusters out of all (i.e., not cherry-picked examples) and provide as descriptive a label as possible for each of them.


Note: only the first 10 clusters need to be listed here.

In [46]:
print_clusters(matrix_sample, km.labels_)

Cluster: 0 (4 docs)
proofs, resultants, theorems, nova, lemma, footage, analogy, pbs, logarithmic, wgbh

Cluster: 1 (88 docs)
equations, dynamical, fluid, nonlinear, random, waves, solutions, fluids, mesoscale, schrodinger

Cluster: 2 (46 docs)
colleges, alliance, technicians, college, umeb, manufacturing, curriculum, workforce, technical, industry

Cluster: 3 (43 docs)
available, not, zygotic, zygomycota, zygomycetes, zygmund, zurich, zuni, zro2, zr

Cluster: 4 (56 docs)
conference, algebraic, contract, arsenic, seminar, meeting, stochastic, statistical, grid, haiwee

Cluster: 5 (13 docs)
oceanographic, vessel, fleet, ship, ships, shipboard, operated, ctd, unols, equipment

Cluster: 6 (26 docs)
fellowship, mathematical, sciences, fellowships, zygotic, zygomycota, zygomycetes, zygmund, zurich, zuni

Cluster: 7 (30 docs)
algebras, representation, lie, symmetries, theory, representations, finite, sheaves, quantum, automorphic

Cluster: 8 (45 docs)
workshop, cmes, costa, cme, sediment, so

### 1.c - Select one or two good clusters (that can be clearly interpreted) and one or two bad clusters (that might be difficult to interpret or distinguish). Motivate your choice (clusters may, for instance, be overlapping, too broad/narrow or incoherent).



### 1.d - Repeat the experiment in (a) with LDA topic modeling instead (on the whole corpus), and explain briefly how the results compare to your previously chosen clustering setup. A few concrete examples may be helpful. Do your best to make sure the list of topic keywords is informative through appropriate post-processing.

In [52]:
## Topic modeling demo
#!pip3 install gensim

# Fast and simple tokenization
new_vectorizer = TfidfVectorizer()
word_tokenizer = new_vectorizer.build_tokenizer()
tokenized_text = [word_tokenizer(abstract) for abstract in abstracts]  # avoid shadowing the built-in abs()

# Train LDA model
dictionary = corpora.Dictionary(tokenized_text)
lda_corpus = [dictionary.doc2bow(text) for text in tokenized_text]
lda_model = models.LdaModel(lda_corpus, id2word=dictionary, num_topics=10)
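
The tokenizer built above only splits the text into tokens: it does not lowercase them and it keeps function words, so both end up in the LDA vocabulary. A minimal sketch of a preprocessing step that could be applied before building the dictionary is shown below. It assumes the vectorizer's default token pattern (words of two or more characters); the function name `preprocess` and the stop-word list are hypothetical and should be adapted to the corpus.

```python
import re

# Same pattern as the default TfidfVectorizer tokenizer: words of two or more characters
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

# Illustrative stop-word list; extend it after inspecting the topic keywords
STOPWORDS = frozenset(
    "the of and to for in or is be may an with at are on by as from can will "
    "that this these which have has their between".split()
)

def preprocess(text, stopwords=STOPWORDS):
    """Lowercase the text, tokenize it and drop stop words."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    return [t for t in tokens if t not in stopwords]

print(preprocess("This project studies the Evolution of plant genes."))
```

Replacing `word_tokenizer(...)` with such a function when building `tokenized_text` would be one way to keep capitalized variants and function words out of the topics.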

In [55]:
# Function words to skip when printing (matching is case-sensitive, so e.g. "This" is not filtered)
stopwords = set("the of and to for in or The is be may an a with at are on by as from can will that this".split())

for i, topic in lda_model.show_topics(num_words=50, formatted=False):
    print("Topic", i)
    printed_terms = 0
    for term, score in topic:
        if printed_terms >= 10:
            break
        if term in stopwords:
            continue
        printed_terms += 1
        print("%.4f\t%s" % (score, term))
    print()

Topic 0
0.0078	program
0.0072	students
0.0070	research
0.0064	University
0.0055	project
0.0050	science
0.0043	engineering
0.0039	This
0.0032	their
0.0031	year

Topic 1
0.0026	retail
0.0026	This
0.0024	MJO
0.0021	query
0.0019	project
0.0018	lake
0.0017	research
0.0016	data
0.0014	new
0.0013	these

Topic 2
0.0058	research
0.0048	project
0.0047	This
0.0041	these
0.0039	have
0.0034	study
0.0032	which
0.0032	has
0.0029	understanding
0.0027	their

Topic 3
0.0113	species
0.0064	plant
0.0059	plants
0.0056	genes
0.0041	research
0.0041	evolutionary
0.0040	This
0.0039	these
0.0035	genetic
0.0034	gene

Topic 4
0.0079	research
0.0067	materials
0.0055	This
0.0038	new
0.0038	project
0.0038	high
0.0037	properties
0.0034	these
0.0031	used
0.0029	students

Topic 5
0.0051	data
0.0042	This
0.0036	study
0.0029	have
0.0028	these
0.0028	research
0.0027	project
0.0025	which
0.0025	ice
0.0023	between

Topic 6
0.0045	carbon
0.0035	research
0.0035	species
0.0035	these
0.0031	between
0.0030	project
0.0030	This
0.