# Problem #2A: Identify the best number of clusters into which responses to the following question fall, using NLP methods:

"What one action can faculty take to improve your educational experience at UW?"

No assumptions are made about how many clusters (groups) these responses will fall into. The goal of
this portion of the NLP project is to identify the optimal number of clusters to support future coding of
these responses. 

This will be accomplished by representing the students' responses along three dimensions:
part A: topic, part B: sentiment analysis, and part C: semantic similarity.

This code solves part A: topic modeling.

In [1]:
# Only pandas, NumPy, re, and matplotlib are imported here;
# the scikit-learn classes are imported in the cells that use them.

import pandas as pd
import numpy as np
import re

import matplotlib.pyplot as plt

In [2]:
df = pd.read_csv('/Users/nehakardam/Documents/UWclasses /EE517 NLP/Project/FacultySupport_SA1_Jan14_2021_All.csv')

In [3]:
df.shape

(1624, 60)

In [4]:
df.head()

Unnamed: 0,Join Code,RemoteTrad,Subject Code,Class,Quarter,Year,Section,A1_Status,A2_Major,A3.1,...,Unnamed: 50,Unnamed: 51,Unnamed: 52,Unnamed: 53,Unnamed: 54,Unnamed: 55,Unnamed: 56,Unnamed: 57,Unnamed: 58,Unnamed: 59
0,48.0,2.0,EE233_SP20_AC_48,EE233_Spring2020,Spring,2020.0,AC,2,1.0,1.0,...,,,,,,,,,,
1,49.0,2.0,EE233_SP20_AA_49,EE233_Spring2020,Spring,2020.0,AA,4,1.0,1.0,...,,,,,,,,,,
2,63.0,2.0,EE235_SP20_AD_63,EE235_Spring2020,Spring,2020.0,AD,2,1.0,,...,,,,,,,,,,
3,11.0,2.0,EE331_SP20_AA_11,EE331_Spring2020,Spring,2020.0,AA,3,1.0,1.0,...,,,,,,,,,,
4,3.0,2.0,EE233_SP20_AB_3,EE233_Spring2020,Spring,2020.0,AB,2,1.0,1.0,...,,,,,,,,,,


In [5]:
df["SA1"][350]

'0e breakout rooms in lectures to allow students to still have ineraction with one another.'

In [6]:
from sklearn.feature_extraction.text import CountVectorizer

count_vect = CountVectorizer(max_df=0.8, min_df=2, stop_words='english')
doc_term_matrix = count_vect.fit_transform(df['SA1'].values.astype('U'))

In [7]:
doc_term_matrix

<1624x1138 sparse matrix of type '<class 'numpy.int64'>'
	with 12089 stored elements in Compressed Sparse Row format>
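One detail of the `astype('U')` cast is worth checking: a missing response (NaN) is converted to the literal string `'nan'`, which the vectorizer then treats as an ordinary word. The sketch below illustrates this on a small made-up series; `responses` is a hypothetical stand-in for `df['SA1']`, and dropping missing rows is only one possible way to handle them.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-in for df['SA1']; the second entry is a missing response.
responses = pd.Series([
    "Post lecture notes online",
    np.nan,
    "Hold more office hours",
    "Post lecture recordings online",
])

# astype('U') turns the missing value into the string 'nan'.
as_text = responses.values.astype('U')
print(repr(as_text[1]))

# With the cast alone, 'nan' becomes a vocabulary term.
vocab_with_nan = CountVectorizer().fit(as_text).vocabulary_
print('nan' in vocab_with_nan)

# Dropping missing responses first keeps it out of the vocabulary.
cleaned = responses.dropna().astype(str)
vocab_clean = CountVectorizer().fit(cleaned).vocabulary_
print('nan' in vocab_clean)
```

If rows are dropped this way, any topic labels written back to the DataFrame need to be aligned on the index of the remaining rows.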

# Method 1: LDA for Topic Modeling

LDA (Latent Dirichlet Allocation) models each response as a mixture of topics and each topic as a probability distribution over the words in our vocabulary.
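As a rough illustration of the per-topic word distributions, the sketch below fits scikit-learn's LDA on a tiny made-up corpus and normalizes each row of `components_` so that it sums to 1. The names `toy_docs`, `toy_vect`, `toy_lda`, and `topic_word` are hypothetical and not part of the notebook.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Tiny made-up corpus, used only for illustration.
toy_docs = [
    "post lecture notes and slides",
    "post lecture recordings online",
    "hold more office hours",
    "extra office hours for homework help",
]
toy_vect = CountVectorizer()
toy_counts = toy_vect.fit_transform(toy_docs)

toy_lda = LatentDirichletAllocation(n_components=2, random_state=42)
toy_lda.fit(toy_counts)

# components_ holds unnormalized topic-word weights. Dividing each row by
# its sum gives one probability distribution over the vocabulary per topic.
topic_word = toy_lda.components_ / toy_lda.components_.sum(axis=1, keepdims=True)
print(topic_word.shape)        # (n_topics, n_vocabulary_terms)
print(topic_word.sum(axis=1))  # each row sums to 1
```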

In [8]:
from sklearn.decomposition import LatentDirichletAllocation

LDA = LatentDirichletAllocation(n_components=5, random_state=42)
LDA.fit(doc_term_matrix)

LatentDirichletAllocation(n_components=5, random_state=42)
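The value `n_components=5` above is fixed by hand. Since the stated goal is to find a suitable number of clusters, one possible approach (an assumption here, not something the notebook does) is to fit LDA for several candidate topic counts and compare held-out perplexity, where lower is better. The sketch below uses a made-up corpus; `toy_docs`, `candidate_ks`, and `scores` are hypothetical names, and perplexity is only one of several criteria that could be used.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Tiny made-up corpus, used only for illustration.
toy_docs = [
    "post lecture notes and slides",
    "post lecture recordings online",
    "share lecture slides before class",
    "hold more office hours",
    "extra office hours for homework help",
    "office hours at different times",
    "more practice problems for exams",
    "provide practice exams and examples",
]
toy_counts = CountVectorizer().fit_transform(toy_docs)

# Hold out a quarter of the documents for evaluation.
train, test = train_test_split(toy_counts, test_size=0.25, random_state=42)

candidate_ks = (2, 3, 4)
scores = {}
for k in candidate_ks:
    model = LatentDirichletAllocation(n_components=k, random_state=42)
    model.fit(train)
    # Perplexity on held-out documents; lower suggests a better fit.
    scores[k] = model.perplexity(test)

best_k = min(scores, key=scores.get)
print(scores)
print("lowest held-out perplexity at k =", best_k)
```

On a corpus this small the scores are not meaningful; the point is only the shape of the loop, which could be run on `doc_term_matrix` with a wider range of candidates.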

In [9]:
first_topic = LDA.components_[0]

In [10]:
top_topic_words = first_topic.argsort()[-10:]

In [11]:
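# get_feature_names() was removed in scikit-learn 1.2;
# get_feature_names_out() is available from 1.0 onward.
feature_names = count_vect.get_feature_names_out()
for i in top_topic_words:
    print(feature_names[i])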

classes
assignments
online
questions
work
exam
professors
time
class
students


In [12]:
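# Look up the vocabulary once, and use j for word indices so the
# topic index i is not shadowed inside the comprehension.
feature_names = count_vect.get_feature_names_out()
for i, topic in enumerate(LDA.components_):
    print(f'Top 10 words for topic #{i}:')
    print([feature_names[j] for j in topic.argsort()[-10:]])
    print('\n')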

Top 10 words for topic #0:
['classes', 'assignments', 'online', 'questions', 'work', 'exam', 'professors', 'time', 'class', 'students']


Top 10 words for topic #1:
['giving', 'practice', 'real', 'solve', 'assignments', 'example', 'problems', 'professors', 'exams', 'nan']


Top 10 words for topic #2:
['online', 'provide', 'recordings', 'professor', 'post', 'lectures', 'class', 'slides', 'notes', 'lecture']


Top 10 words for topic #3:
['help', 'having', 'problems', 'homework', 'online', 'class', 'time', 'questions', 'office', 'hours']


Top 10 words for topic #4:
['students', 'make', 'provide', 'problems', 'material', 'practice', 'professors', 'examples', 'class', 'lectures']




In [13]:
topic_values = LDA.transform(doc_term_matrix)
topic_values.shape

(1624, 5)

In [14]:
df['Topic'] = topic_values.argmax(axis=1)
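The assignment above keeps only the most probable topic per response. As a small sketch of what `transform` returns, the example below (on a made-up corpus, with hypothetical names such as `doc_topic` and `dominant`) shows that each row is a distribution over topics, that `argmax` picks the dominant one, and how the number of responses per topic could be counted.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Tiny made-up corpus, used only for illustration.
toy_docs = [
    "post lecture notes and slides",
    "post lecture recordings online",
    "hold more office hours",
    "extra office hours for homework help",
]
toy_counts = CountVectorizer().fit_transform(toy_docs)
toy_lda = LatentDirichletAllocation(n_components=2, random_state=42).fit(toy_counts)

# One row per document, one column per topic; each row sums to 1.
doc_topic = toy_lda.transform(toy_counts)
print(doc_topic.sum(axis=1))

# Dominant topic per document, and how strongly it dominates.
dominant = doc_topic.argmax(axis=1)
strength = doc_topic.max(axis=1)

# Number of documents assigned to each topic.
counts_per_topic = np.bincount(dominant, minlength=doc_topic.shape[1])
print(dominant, strength, counts_per_topic)
```

Responses whose strongest topic has a low probability are split across several topics, which may matter when the labels are later used for manual coding.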

In [15]:
df.head()

Unnamed: 0,Join Code,RemoteTrad,Subject Code,Class,Quarter,Year,Section,A1_Status,A2_Major,A3.1,...,Unnamed: 51,Unnamed: 52,Unnamed: 53,Unnamed: 54,Unnamed: 55,Unnamed: 56,Unnamed: 57,Unnamed: 58,Unnamed: 59,Topic
0,48.0,2.0,EE233_SP20_AC_48,EE233_Spring2020,Spring,2020.0,AC,2,1.0,1.0,...,,,,,,,,,,3
1,49.0,2.0,EE233_SP20_AA_49,EE233_Spring2020,Spring,2020.0,AA,4,1.0,1.0,...,,,,,,,,,,1
2,63.0,2.0,EE235_SP20_AD_63,EE235_Spring2020,Spring,2020.0,AD,2,1.0,,...,,,,,,,,,,0
3,11.0,2.0,EE331_SP20_AA_11,EE331_Spring2020,Spring,2020.0,AA,3,1.0,1.0,...,,,,,,,,,,0
4,3.0,2.0,EE233_SP20_AB_3,EE233_Spring2020,Spring,2020.0,AB,2,1.0,1.0,...,,,,,,,,,,0


In [16]:
df.to_csv('/Users/nehakardam/Documents/UWclasses /EE517 NLP/Project/FS_Topic_LDA_Aug8.csv', index = False)

# Method 2: Non-Negative Matrix Factorization (NMF)

Non-negative matrix factorization (NMF) is also an unsupervised learning technique; it performs clustering as well as dimensionality reduction. It can be combined with a TF-IDF representation to perform topic modeling. In this section, we will see how Python can be used to perform non-negative matrix factorization for topic modeling.
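To make the factorization concrete: NMF approximates the document-term matrix V by the product of two non-negative matrices, W (documents by topics) and H (topics by terms). The sketch below shows the shapes and the reconstruction on a made-up corpus; `toy_docs`, `V`, `W`, `H`, and `rel_error` are hypothetical names, and `max_iter=500` is an arbitrary choice for this toy example.

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# Tiny made-up corpus, used only for illustration.
toy_docs = [
    "post lecture notes and slides",
    "post lecture recordings online",
    "hold more office hours",
    "extra office hours for homework help",
]
V = TfidfVectorizer().fit_transform(toy_docs)  # (n_docs, n_terms)

toy_nmf = NMF(n_components=2, init='random', random_state=0, max_iter=500)
W = toy_nmf.fit_transform(V)  # document-topic weights, (n_docs, n_topics)
H = toy_nmf.components_       # topic-term weights, (n_topics, n_terms)

print(W.shape, H.shape)
print(bool((W >= 0).all() and (H >= 0).all()))  # both factors are non-negative

# Relative Frobenius error of the approximation V ~ W @ H.
dense_V = V.toarray()
rel_error = np.linalg.norm(dense_V - W @ H) / np.linalg.norm(dense_V)
print(rel_error)
```

Unlike the LDA output, the rows of W are not probabilities and do not sum to 1; they are only comparable by relative size.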

In [17]:
df = pd.read_csv('/Users/nehakardam/Documents/UWclasses /EE517 NLP/Project/FacultySupport_SA1_Jan14_2021_All.csv')
df.head()

Unnamed: 0,Join Code,RemoteTrad,Subject Code,Class,Quarter,Year,Section,A1_Status,A2_Major,A3.1,...,Unnamed: 50,Unnamed: 51,Unnamed: 52,Unnamed: 53,Unnamed: 54,Unnamed: 55,Unnamed: 56,Unnamed: 57,Unnamed: 58,Unnamed: 59
0,48.0,2.0,EE233_SP20_AC_48,EE233_Spring2020,Spring,2020.0,AC,2,1.0,1.0,...,,,,,,,,,,
1,49.0,2.0,EE233_SP20_AA_49,EE233_Spring2020,Spring,2020.0,AA,4,1.0,1.0,...,,,,,,,,,,
2,63.0,2.0,EE235_SP20_AD_63,EE235_Spring2020,Spring,2020.0,AD,2,1.0,,...,,,,,,,,,,
3,11.0,2.0,EE331_SP20_AA_11,EE331_Spring2020,Spring,2020.0,AA,3,1.0,1.0,...,,,,,,,,,,
4,3.0,2.0,EE233_SP20_AB_3,EE233_Spring2020,Spring,2020.0,AB,2,1.0,1.0,...,,,,,,,,,,


In [22]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vect = TfidfVectorizer(max_df=0.8, min_df=2, stop_words='english')
doc_term_matrix1 = tfidf_vect.fit_transform(df['SA1'].values.astype('U'))

In [23]:
from sklearn.decomposition import NMF

nmf = NMF(n_components=5, init='random', random_state=0)
nmf.fit(doc_term_matrix1)

NMF(init='random', n_components=5, random_state=0)

In [24]:
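import random

# Print 10 randomly chosen vocabulary terms as a sanity check.
feature_names = tfidf_vect.get_feature_names_out()
for i in range(10):
    # randrange excludes the upper bound; randint(0, len(...)) could index past the end.
    random_id = random.randrange(len(feature_names))
    print(feature_names[random_id])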

complete
reachable
want
adding
voice
classrooms
detailed
engaged
just
post


In [25]:
first_topic = nmf.components_[0]
top_topic_words = first_topic.argsort()[-10:]

In [26]:
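# get_feature_names() was removed in scikit-learn 1.2;
# get_feature_names_out() is available from 1.0 onward.
feature_names = tfidf_vect.get_feature_names_out()
for i in top_topic_words:
    print(feature_names[i])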

helps
learn
sessions
cover
provide
exam
understand
review
material
nan


In [27]:
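# Look up the vocabulary once, and use j for word indices so the
# topic index i is not shadowed inside the comprehension.
feature_names = tfidf_vect.get_feature_names_out()
for i, topic in enumerate(nmf.components_):
    print(f'Top 10 words for topic #{i}:')
    print([feature_names[j] for j in topic.argsort()[-10:]])
    print('\n')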

Top 10 words for topic #0:
['helps', 'learn', 'sessions', 'cover', 'provide', 'exam', 'understand', 'review', 'material', 'nan']


Top 10 words for topic #1:
['material', 'exam', 'extra', 'tests', 'homework', 'examples', 'exams', 'provide', 'problems', 'practice']


Top 10 words for topic #2:
['review', 'clear', 'online', 'canvas', 'provide', 'recordings', 'slides', 'post', 'notes', 'lecture']


Top 10 words for topic #3:
['sessions', 'online', 'lots', 'extra', 'holding', 'help', 'hold', 'available', 'office', 'hours']


Top 10 words for topic #4:
['online', 'ask', 'helpful', 'make', 'professors', 'time', 'questions', 'lectures', 'students', 'class']




In [28]:
topic_values1 = nmf.transform(doc_term_matrix1)
df['Topic1'] = topic_values1.argmax(axis=1)
df.head()

Unnamed: 0,Join Code,RemoteTrad,Subject Code,Class,Quarter,Year,Section,A1_Status,A2_Major,A3.1,...,Unnamed: 51,Unnamed: 52,Unnamed: 53,Unnamed: 54,Unnamed: 55,Unnamed: 56,Unnamed: 57,Unnamed: 58,Unnamed: 59,Topic1
0,48.0,2.0,EE233_SP20_AC_48,EE233_Spring2020,Spring,2020.0,AC,2,1.0,1.0,...,,,,,,,,,,1
1,49.0,2.0,EE233_SP20_AA_49,EE233_Spring2020,Spring,2020.0,AA,4,1.0,1.0,...,,,,,,,,,,4
2,63.0,2.0,EE235_SP20_AD_63,EE235_Spring2020,Spring,2020.0,AD,2,1.0,,...,,,,,,,,,,4
3,11.0,2.0,EE331_SP20_AA_11,EE331_Spring2020,Spring,2020.0,AA,3,1.0,1.0,...,,,,,,,,,,4
4,3.0,2.0,EE233_SP20_AB_3,EE233_Spring2020,Spring,2020.0,AB,2,1.0,1.0,...,,,,,,,,,,4


In [29]:
df.to_csv('/Users/nehakardam/Documents/UWclasses /EE517 NLP/Project/FS_Topic_NMF_Aug8.csv', index = False)
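The LDA labels (`Topic`) and the NMF labels (`Topic1`) end up in two separate files. One way to see how far the two methods agree, offered here as a sketch rather than as part of the original analysis, is a cross-tabulation of the two label columns. The labels below are made up for illustration; in practice the two columns would come from the saved CSV files.

```python
import pandas as pd

# Made-up topic labels for six responses, standing in for the
# 'Topic' (LDA) and 'Topic1' (NMF) columns of the notebook.
labels = pd.DataFrame({
    "Topic":  [0, 0, 1, 1, 2, 2],
    "Topic1": [1, 1, 0, 0, 2, 0],
})

# Rows: LDA topic; columns: NMF topic; cells: number of responses.
table = pd.crosstab(labels["Topic"], labels["Topic1"])
print(table)
```

Topic numbers are arbitrary within each model, so agreement shows up as rows with one dominant cell rather than as matching indices.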

Reference: https://stackabuse.com/python-for-nlp-topic-modeling/