# <center>HW 7: Topic Modeling and Word Vectors</center>

## Q1 Topic Modeling

In this question, you'll need the dataset that was used in HW 6:

- hw6_train.csv: This file contains a list of documents. It's used for training models.
- hw6_test.csv: This file contains a list of documents and their ground-truth labels. It's used for testing performance. 

Define a function `topic_modeling` as follows: 
- Take the following parameters:
    - `train_text`: a list of train documents
    - `test_text`: a list of test documents
    - `test_label`: a list of ground-truth labels for the test documents
    - `num_topics`: the number of topics
    
    
- Cluster `train_text` into `num_topics` topics using LDA
    - Fit and transform the `train_text` into word counts by a vectorizer. Then transform `test_text` by the fitted vectorizer.
    - When generating counts, you need to tune parameters such as `stop_words` and `min_df` for better performance
    
    
- Predict the topic mixture of each document in `test_text`, and then assign the document only to the topic with the `max probability`.
- Apply the `majority vote` rule to map the predicted topic IDs to `test_label`. Hint: 
    - Do not hardcode the mapping in your code, because each run may give you a different mapping. You can use the `idxmax` function of pandas to generate the mapping dynamically.
    - You can use cross tabulation to map the clusters to ground truth labels. Check the class notes for details.
- Print the classification report


- Return the fitted lda model

Test your function and compare the performance with what you achieved in HW6. Briefly analyze whether LDA can deliver better clustering for this dataset.
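
The topic assignment and majority-vote mapping described above can be sketched on toy data. This is only an illustration: the topic mixtures, the label names, and the variable names (`topic_mix`, `true_labels`) are made up, and a fitted LDA model would supply the real mixtures.

```python
import numpy as np
import pandas as pd

# Hypothetical topic mixtures for six test documents and three topics.
# Each row sums to 1, like the output of a fitted LDA model.
topic_mix = np.array([
    [0.7, 0.2, 0.1],
    [0.6, 0.3, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.2, 0.7],
    [0.5, 0.1, 0.4],
])
# Hypothetical ground-truth labels for the same six documents.
true_labels = ["art", "art", "bird", "bird", "river", "river"]

# Assign each document to the topic with the highest probability.
predicted_topic = topic_mix.argmax(axis=1)

# Cross tabulation: rows are topic IDs, columns are ground-truth labels.
crosstab = pd.crosstab(index=pd.Series(predicted_topic, name="topic"),
                       columns=pd.Series(true_labels, name="label"))

# Majority vote: each topic maps to its most frequent ground-truth label.
mapping = crosstab.idxmax(axis=1).to_dict()
predicted_label = [mapping[t] for t in predicted_topic]

print(crosstab)
print(mapping)
```

Because the mapping is derived from the cross tabulation at run time, it stays valid even when LDA numbers the topics differently on the next run.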

In [1]:
import pandas as pd
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from nltk.corpus import stopwords
from sklearn.decomposition import LatentDirichletAllocation
import gensim
from gensim import corpora
import numpy as np
import nltk,string
from gensim.models import word2vec
from operator import itemgetter
from sklearn import metrics
from sklearn import svm
from sklearn.metrics import classification_report
# add your import

In [2]:
train = pd.read_csv("hw6_train.csv")
train.head()

test = pd.read_csv("hw6_test.csv")
test.head()

Unnamed: 0,description
0,The Barguelonne (French: la Barguelonne) is a...
1,Conus nielsenae is a species of sea snail a m...
2,Coleto Creek Reservoir is a reservoir on Cole...
3,Indian River is a 59.1-mile-long (95.1 km) tr...
4,The Funtenseetauern is a 2579 m high border p...


Unnamed: 0,label,description
0,4,Onychogomphus styx is a species of dragonfly ...
1,3,The Bonnet Carré Spillway is a flood control ...
2,4,Coleophora centaureivora is a moth of the Col...
3,1,Paris Nogari (c. 1536–1601) was an Italian pa...
4,3,Blacktail Butte (7688 feet (2343 m)) is a but...


In [3]:
def topic_modeling(train_text, test_text, test_label, num_clusters):

    # Build word counts; the vocabulary is fitted on the training documents only
    stop = list(stopwords.words('english')) + ['said']
    tf_vectorizer = CountVectorizer(min_df=3, stop_words=stop)
    tf = tf_vectorizer.fit_transform(train_text)
    test_tf = tf_vectorizer.transform(test_text)

    # Convert the sparse count matrices (rows = documents) into gensim corpora
    corpus = gensim.matutils.Sparse2Corpus(tf, documents_columns=False)
    test_corpus = gensim.matutils.Sparse2Corpus(test_tf, documents_columns=False)
    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
    id2word = dict(enumerate(tf_vectorizer.get_feature_names_out()))

    lda = gensim.models.ldamodel.LdaModel(corpus, num_topics=num_clusters, id2word=id2word, iterations=30)

    # Assign each test document to its most probable topic
    # (iterate over all documents instead of hardcoding the test set size)
    predicted = [max(doc_topics, key=itemgetter(1))[0]
                 for doc_topics in lda.get_document_topics(test_corpus)]

    # Majority vote: map each topic ID to the most frequent ground-truth label in it
    confusion_df = pd.DataFrame({"label": list(test_label), "cluster": predicted})
    crosstable = pd.crosstab(index=confusion_df.cluster, columns=confusion_df.label)
    mapping = dict(crosstable.idxmax(axis=1))
    predicted_target = [mapping[i] for i in predicted]

    print(metrics.classification_report(test_label, predicted_target))

    return lda

In [4]:
lda = topic_modeling(train["description"], 
                                   test["description"], 
                                   test["label"],
                                   num_clusters=4 )

              precision    recall  f1-score   support

           1       0.67      0.82      0.74       123
           2       0.68      0.69      0.68       100
           3       0.67      0.29      0.41       146
           4       0.58      0.82      0.68       131

    accuracy                           0.64       500
   macro avg       0.65      0.66      0.63       500
weighted avg       0.65      0.64      0.62       500



## Q2: Supervised Sentiment Analysis Using Word Vectors

In this question, you'll need the following datasets:
- `hw7_train.csv`: dataset for training
- `hw7_test.csv`: dataset for testing

A snippet of the dataset is given below.

In [2]:
train = pd.read_csv("hw7_train.csv")
test = pd.read_csv("hw7_test.csv")

train.head()

Unnamed: 0,label,text
0,1,Getting ready for college. I had a good sleep....
1,1,We are having a party now to have all the fami...
2,0,@marC0110 ummm.. i see you.. and i really wann...
3,1,@saboteur1 Thanks for following Much apprecia...
4,1,Why eat at home? Picnic plans for today are al...


### Q2.1: Train Word Vectors

Write a function `train_wordvec(docs, vector_size)` as follows:
- Take two inputs:
    - `docs`: a list of documents
    - `vector_size`: the dimension of word vectors
- First tokenize `docs` into tokens
- Use the `gensim` package to train word vectors. Set `vector_size` and also carefully set other parameters such as `window` and `min_count`.
- Return the trained word vector model
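
The tokenization step can be sketched without any external tokenizer. The helper below, `simple_tokenize`, is a hypothetical stand-in that splits on whitespace; it is an assumption for illustration and not the tokenizer the assignment requires. Its output, a list of token lists, is the input format that gensim's `Word2Vec` expects for its sentences.

```python
import string

def simple_tokenize(doc):
    """Lowercase, split on whitespace, strip punctuation from token edges,
    and drop tokens shorter than two characters."""
    stripped = (tok.strip(string.punctuation) for tok in doc.lower().split())
    return [tok for tok in stripped if len(tok) >= 2]

# Two made-up example documents.
sample_docs = ["Getting ready for college.", "@user Thanks for following!"]
sentences = [simple_tokenize(doc) for doc in sample_docs]
print(sentences)
```

Keeping the tokenizer in one helper makes it easy to apply the same preprocessing when training the word vectors and when generating document vectors later.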

In [3]:
def train_wordvec(docs, vector_size = 100):
    
    # Lowercase and tokenize each document, strip punctuation from token edges,
    # and keep only tokens with at least two characters
    sentences=[ [token.strip(string.punctuation).strip() for token in nltk.word_tokenize(doc.lower()) if token not in string.punctuation and \
                 len(token.strip(string.punctuation).strip())>=2] for doc in docs]
    wv_model = word2vec.Word2Vec(sentences,vector_size=vector_size ,min_count=5, window=5, workers=4 )
    
    return wv_model

In [4]:
wv_model = train_wordvec(train["text"], vector_size = 100)

### Q2.2: Generate Vector Representation for Documents

Write a function `generate_doc_vector(docs, wv_model)` as follows:
- Take two inputs:
    - `docs`: a list of documents
    - `wv_model`: trained word vector model
- First tokenize each document `doc` in `docs` into tokens
- For each token in `doc`, look up its word vector in `wv_model`. Then the document vector (denoted as `d`) of `doc` can be calculated as the `mean of the word vectors of its tokens`, i.e. $d = \frac{\sum_{i \in doc}{v_i}}{|doc|}$, where $v_i$ is the word vector of the i-th token.
- Return the vector representations of all `docs` as a numpy array of shape `(n, vector_size)`, where `n` is the number of documents in `docs` and `vector_size` is the dimension of word vectors.


Note: It may not be a good idea to represent a document as the mean of its word vectors. For example, if one word is positive and another is negative, averaging the two may produce a vector that is no longer sensitive to sentiment. You'll learn more advanced methods to generate document vectors in deep learning courses.
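
The averaging formula and the caveat in the note can be illustrated with a few hand-made vectors. Everything here is hypothetical: `toy_vectors` stands in for a trained model, and the choice to skip unknown tokens and to return a zero vector when no token is known is an assumption, since the assignment does not specify how to handle out-of-vocabulary words.

```python
import numpy as np

# Hypothetical 2-dimensional word vectors standing in for a trained model.
toy_vectors = {
    "good": np.array([1.0, 0.0]),
    "bad": np.array([-1.0, 0.0]),
    "movie": np.array([0.0, 2.0]),
}

def mean_vector(tokens, vectors, vector_size):
    """Average the vectors of the known tokens; return zeros if none is known."""
    known = [vectors[tok] for tok in tokens if tok in vectors]
    if not known:
        return np.zeros(vector_size)
    return np.mean(known, axis=0)

toy_docs = [["good", "movie"], ["good", "bad"], ["unseen", "words"]]
doc_matrix = np.array([mean_vector(doc, toy_vectors, 2) for doc in toy_docs])
print(doc_matrix.shape)
print(doc_matrix)
```

The second document shows the problem raised in the note: the vectors for "good" and "bad" cancel, so its mean is the zero vector, the same as for a document with no known words.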

In [34]:
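def generate_doc_vector(docs, wv_model):
    
    # Tokenize exactly as in train_wordvec so tokens match the model vocabulary
    sentences=[ [token.strip(string.punctuation).strip() for token in nltk.word_tokenize(doc.lower()) if token not in string.punctuation and \
                 len(token.strip(string.punctuation).strip())>=2] for doc in docs]
    vectors=[]
    for sentence in sentences:
        # Keep only tokens in the vocabulary; the earlier version appended a stale
        # vector (or raised NameError) for out-of-vocabulary tokens
        word_vectors=[wv_model.wv[word] for word in sentence if word in wv_model.wv.key_to_index]
        if word_vectors:
            vectors.append(np.mean(word_vectors,axis=0))
        else:
            # No known token: fall back to a zero vector to avoid NaN values
            vectors.append(np.zeros(wv_model.wv.vector_size))
    vectors=np.array(vectors)
    
    return vectors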

In [35]:
train_X = generate_doc_vector(train["text"], wv_model)
test_X = generate_doc_vector(test["text"], wv_model)

### Q2.3: Put everything together


Define a function `predict_sentiment(train_text, train_label, test_text, test_label, vector_size = 100)` as follows:

- Take the following inputs:
    - `train_text, train_label`: a list of documents and their labels for training
    - `test_text, test_label`: a list of documents and their labels for testing,
    - `vector_size`: the dimension of word vectors. Set the default value to 100.
- Call `train_wordvec(docs, vector_size)` to train a word vector model using `train_text`
- Call `generate_doc_vector(docs, wv_model)` to generate vector representations (denoted as `train_X`) for documents in `train_text`. 
- Call `generate_doc_vector(docs, wv_model)` to generate vector representations (denoted as `test_X`) for each document in `test_text`
- Fit a linear SVM model using `train_X` and `train_label`
- Predict the label for `test_X` and print out classification report for the testing subset.
- This function has no return
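
The final two steps, fitting a linear SVM and printing a report, can be sketched on toy data. The 2-dimensional vectors and labels below are made up for illustration; in the assignment they would come from `generate_doc_vector`. Setting `random_state` is an optional choice made here for repeatability.

```python
import numpy as np
from sklearn.metrics import classification_report
from sklearn.svm import LinearSVC

# Hypothetical document vectors: label 0 lies in the lower left,
# label 1 in the upper right, so the classes are linearly separable.
toy_train_X = np.array([[-2.0, -1.0], [-1.0, -2.0], [-1.5, -1.5],
                        [2.0, 1.0], [1.0, 2.0], [1.5, 1.5]])
toy_train_y = [0, 0, 0, 1, 1, 1]
toy_test_X = np.array([[-1.0, -1.0], [1.0, 1.0]])
toy_test_y = [0, 1]

# Fit on the training vectors, then predict labels for the test vectors.
clf = LinearSVC(random_state=0)
clf.fit(toy_train_X, toy_train_y)
toy_pred = clf.predict(toy_test_X)

print(classification_report(toy_test_y, toy_pred))
```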

In [37]:
def predict_sentiment(train_text, train_label, test_text, test_label, vector_size = 100):
    wv_model = train_wordvec(train_text, vector_size = vector_size)
    train_X = generate_doc_vector(train_text, wv_model)
    test_X = generate_doc_vector(test_text, wv_model)
    # Fit a linear SVM on the training document vectors, then evaluate on the test set
    cls = svm.LinearSVC()
    cls.fit(train_X,train_label)
    predicted=cls.predict(test_X)
    print(classification_report(test_label,predicted))


In [38]:
predict_sentiment(train["text"], train["label"],\
                  test["text"], test["label"],\
                  vector_size = 100)
    

              precision    recall  f1-score   support

           0       0.71      0.70      0.70      9968
           1       0.70      0.72      0.71     10032

    accuracy                           0.71     20000
   macro avg       0.71      0.71      0.71     20000
weighted avg       0.71      0.71      0.71     20000



