# Applied Machine Learning (2021), exercises


## General instructions for all exercises

Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All).

Follow the instructions and fill in your solution under the line marked with the tag:

> YOUR CODE HERE

Do not change other areas of the document, since it may disturb the autograding of your results!
  
Having written the answer, execute the code cell by pressing the `Shift-Enter` key combination. The code runs and may print some information under the code cell. The focus automatically moves to the next cell, which you can execute by pressing `Shift-Enter` again, until you reach the code cell that tests your solution. Execute it and follow the feedback. Usually it either says that the solution seems acceptable or reports some errors. You can go back to your solution, modify it, and repeat the process until you are satisfied. Then proceed to the next task.
   
Repeat the process for all tasks.

The notebook may also contain manually graded answers. Write your manually graded answer under the line marked with the tag:

> YOUR ANSWER HERE

Manually graded tasks may be text, pseudocode, or mathematical formulas. You can write formulas in $\LaTeX$ syntax by enclosing them in dollar signs (`$`); for example, `$f(x)=2 \pi / \alpha$` produces $f(x)=2 \pi / \alpha$.

When you have passed the tests in the notebook and are ready to submit your solutions, download the whole notebook using the menu `File -> Download as -> Notebook (.ipynb)`. Save the file to your hard disk and submit it in [Moodle](https://moodle.uwasa.fi) under the corresponding exercise.

Your solution should be executable Python code. Use the existing code as an example of Python programming, and consult the many Python programming resources on the Internet if necessary.


In [None]:
NAME = "Hoang Nguyen Duc"
Student_number = "d120411"

---

# NLP

In [1]:
# Standard libraries to be used
import glob
import matplotlib.pyplot as plt
import numpy as np
plt.style.use('fivethirtyeight')


# Import the NLTK library
import nltk
from nltk.corpus import PlaintextCorpusReader
from nltk.stem import  WordNetLemmatizer
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import gensim
from gensim import corpora
import pyLDAvis
from pyLDAvis import gensim_models

In [2]:
nltk.download('stopwords')
nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\u375049\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\u375049\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


## Task 1

Read the sample dataset containing 2491 short sentences each in separate lines. The first five lines of the dataset are shown below:

```
Innovation in Database Management: Computer Science vs. Engineering.
High performance prime field multiplication for GPU.
enchanted scissors: a scissor interface for support in cutting and interactive fabrication.
Detection of channel degradation attack by Intermediary Node in Linear Networks.
Pinning a Complex Network through the Betweenness Centrality Strategy.
```

Read the data and prepare it for ML in the following phases:

 - Read the dataset `dataset.txt`, for example with `PlaintextCorpusReader`
 - Tokenize the dataset into words
     - if `data` is the corpus returned by `PlaintextCorpusReader`, you may find `data.sents()` a convenient function for tokenizing every line
 - Remove all words which are shorter than 5 characters
 - Remove all English stop words
 - Lemmatize words

Store the result in a list called `words`. Make sure that the list contains 2491 sublists, each containing a few words.
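
The filtering phases listed above can be illustrated on two of the sample lines without any external library. This is only a minimal sketch under stated assumptions: the regular expression is a crude stand-in for `word_tokenize`, `TOY_STOP_WORDS` is a hypothetical miniature stop word list (NLTK's English list is much longer), `toy_preprocess` is an illustrative name, and lemmatization is left out entirely, so it is not a replacement for the NLTK-based solution.

```python
import re

# Hypothetical miniature stop word list, for illustration only.
TOY_STOP_WORDS = {"about", "after", "their", "there", "these", "which", "through"}

def toy_preprocess(line):
    """Tokenize one line, then apply the length, case, and stop word filters."""
    # Crude stand-in for word_tokenize: keep runs of letters only.
    tokens = re.findall(r"[A-Za-z]+", line)
    # Drop words shorter than 5 characters and convert the rest to lowercase.
    tokens = [t.lower() for t in tokens if len(t) >= 5]
    # Drop stop words (lemmatization is intentionally omitted in this sketch).
    return [t for t in tokens if t not in TOY_STOP_WORDS]

sample_lines = [
    "Innovation in Database Management: Computer Science vs. Engineering.",
    "Pinning a Complex Network through the Betweenness Centrality Strategy.",
]

# One sublist of cleaned words per input line.
toy_words = [toy_preprocess(line) for line in sample_lines]
print(toy_words[0])
print(toy_words[1])
```

The result has the same shape as the required `words` list: one sublist per line, each holding the surviving lowercase tokens.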

In [7]:
# YOUR CODE HERE
from nltk.tokenize import sent_tokenize

def preprocess(sentence: str):
    """Tokenize one sentence and return its cleaned, lemmatized words."""
    processed_words = word_tokenize(sentence)
    # Remove all words which are shorter than 5 characters
    processed_words = [w for w in processed_words if len(w) >= 5]
    # Transform all words to lowercase
    processed_words = [w.lower() for w in processed_words]
    # Remove all English stop words, using the stop_words set built above;
    # stopwords.words() without an argument returns the lists of every language
    processed_words = [w for w in processed_words if w not in stop_words]
    # Lemmatize words
    processed_words = [lemmatizer.lemmatize(w) for w in processed_words]
    return processed_words

docs = PlaintextCorpusReader('.', 'dataset.txt')
sentence_tokens = sent_tokenize(docs.raw())

words = [preprocess(sent) for sent in sentence_tokens]

In [8]:
if 'words' not in globals():
    print("Use name words for a data structure for cleaned words, please.")

if len(words)>3000: print("You may have placed all the words in one single list, not as a list of sentences.")
assert(len(words)==2491), "Check that words is a list of sentences, each of which is a list of words"

if len(words[0])>6: print("Perhaps you forgot to exclude words shorter than 5 characters") 
assert(len(words[0])==6), "Check that each item in the list is a single sentence"

if words[0][0]=='Innovation': print ("Convert to lowercase. You may use token.lower() to convert a token to lowercase")
assert(words[0][0]=='innovation')
    
if (words[8][6]=='attacks'): print("You probably forgot to lemmatize words")
assert(words[8][6]=='attack')


## Task 2

Make an LDA model of the text data:
 - Create a dictionary of words with gensim
 - Create a bag of words (bow) from the words using the dictionary, and name it `corpus`
 - Make the model and name it `ldamodel`. Try using 8 topics and 15 passes. Use default values for most of the other parameters
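
To make the bag-of-words step concrete, the sketch below mimics the output format of a dictionary lookup in plain Python: each document becomes a list of `(token_id, count)` pairs. It is not gensim's implementation. The names `toy_doc2bow` and `toy_token2id` are hypothetical, the second document is made up, and the rule that new tokens receive ids in alphabetical order within each document is an assumption chosen to agree with the grading cell, which expects id 1 to map to `database`.

```python
from collections import Counter

def toy_doc2bow(tokens, token2id):
    """Return sorted (token_id, count) pairs, adding unseen tokens to token2id."""
    counts = Counter(tokens)
    # Assumption: unseen tokens get the next free ids in alphabetical order.
    for token in sorted(t for t in counts if t not in token2id):
        token2id[token] = len(token2id)
    return sorted((token2id[t], n) for t, n in counts.items())

toy_token2id = {}
toy_docs = [
    # Cleaned words of the first sample sentence.
    ["innovation", "database", "management", "computer", "science", "engineering"],
    # Made-up second document with a repeated word.
    ["complex", "network", "network", "science"],
]

toy_corpus = [toy_doc2bow(doc, toy_token2id) for doc in toy_docs]
print(toy_token2id["database"])
print(toy_corpus[1])
```

Word order is discarded and only the counts survive, which is the only information the LDA model receives about each document.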

In [17]:
# YOUR CODE HERE
dictionary = corpora.Dictionary(words)
corpus = [dictionary.doc2bow(doc) for doc in words]
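# Use 8 topics and 15 passes as the task suggests; other parameters keep their defaults
ldamodel = gensim.models.ldamodel.LdaModel(corpus=corpus, id2word=dictionary,
                                           num_topics=8, passes=15)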

In [18]:
if 'dictionary' not in globals():
    print("Use variable dictionary for dictionary, please.")

if 'corpus' not in globals():
    print("Use variable corpus for bag of words, please.")

assert(dictionary.id2token[1]=='database')
assert(len(corpus)==2491)
assert(len(corpus[0])==6)
assert(len(ldamodel.show_topic(1))==10)

## Visualize LDA model

- Visualize the LDA model using pyLDAvis, and store the prepared visualization in a variable named `vis`
- Set the relevance metric to $\lambda=0.2$ with the interactive slider and try to come up with names for some of the clusters
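
The effect of the slider can be sketched with the relevance definition from the LDAvis paper by Sievert and Shirley, $r(w, t \mid \lambda) = \lambda \log p(w \mid t) + (1-\lambda) \log \frac{p(w \mid t)}{p(w)}$. The probabilities below are hypothetical values chosen for illustration, not numbers taken from the fitted model, and `relevance`, `toy_terms`, and `rankings` are illustrative names.

```python
import math

def relevance(p_word_given_topic, p_word, lam):
    """Relevance of a term to a topic for a weight lam between 0 and 1."""
    lift = p_word_given_topic / p_word
    return lam * math.log(p_word_given_topic) + (1 - lam) * math.log(lift)

# Hypothetical probabilities: (p(word | topic), p(word) in the whole corpus).
toy_terms = {
    "network": (0.050, 0.040),      # frequent in the topic, but frequent everywhere
    "betweenness": (0.010, 0.001),  # rarer, but concentrated in this topic
}

# Rank the terms from most to least relevant for two slider positions.
rankings = {}
for lam in (1.0, 0.2):
    rankings[lam] = sorted(
        toy_terms,
        key=lambda w: relevance(toy_terms[w][0], toy_terms[w][1], lam),
        reverse=True,
    )
    print(lam, rankings[lam])
```

With $\lambda=1$ the ranking follows raw within-topic frequency, while a small value such as $\lambda=0.2$ favours terms that are distinctive for the topic, which tends to make the clusters easier to name.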

In [26]:
# YOUR CODE HERE
# Visualize the topics
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim_models.prepare(ldamodel, corpus, dictionary)
vis
# Some possible cluster names:
# 1. Communication technology
# 2. Biomedical analytics
# 3. Mechanics control

In [27]:
if 'vis' not in globals():
    print("Use variable vis for prepared visualization, please.")

assert(vis.topic_info.shape[1]==6)