# Individual Coding Exercise (ICE) 2 for HUDK4051: Learning Analytics

In this ICE 2, the following aspects of Natural Language Processing (NLP) are covered:

- perform basic data cleaning for NLP
- build a basic tokenized document-term matrix for analysis
- run basic exploratory analysis on the document-term matrix
- implement LDA topic modeling
- implement a basic text classifier

NLP is a branch of data science that comprises systematic processes for understanding, analyzing, and deriving information from text efficiently. With NLP, analysts can organize large amounts of text data, automate a range of tasks, and tackle problems such as automatic summarization, machine translation, named entity recognition, relationship extraction, sentiment analysis, speech recognition, and topic segmentation.

In [1]:
#importing necessary packages

import pandas as pd
import numpy as np
import re
import string
from gensim import matutils, models
import scipy.sparse

The dataset used in this ICE was collected from students of a prominent university in North India and contains their feedback about the university. It covers six categories: teaching, course content, examination, lab work, library facilities, and extracurricular activities. Each category has two columns: a rating label, which is 0 (neutral), 1 (positive), or -1 (negative), and the free-text comment that goes with it (for example, `teaching` holds the label and `teaching1` the comment).

In [2]:
#loading the dataset

Eval = pd.read_csv("ICE2_dataset.csv")
Eval

,teaching,teaching1,content,content1,exam,exam1,lab,lab1,lib,lib1,extrac,extrac1
0,0,teacher are punctual but they should also give...,0.0,content of courses are average,1.0,examination pattern is good,-1,"not satisfactory, lab work must include latest...",0.0,library facilities are good but number of book...,1,extracurricular activities are excellent and p...
1,1,Good,-1.0,Not good,1.0,Good,1,Good,-1.0,Not good,1,Good
2,1,Excellent lectures are delivered by teachers a...,1.0,All courses material provide very good knowled...,1.0,Exam pattern is up to the mark and the Cgpa de...,1,Lab work is properly covered in the labs by th...,1.0,Library facilities are excellent in terms of g...,1,Extra curricular activities also help students...
3,1,Good,-1.0,Content of course is perfectly in line with th...,-1.0,Again the university tests students of their a...,1,Good,0.0,Its the best thing i have seen in this univers...,-1,Complete wastage of time. Again this opinion i...
4,1,teachers give us all the information required ...,1.0,content of courses improves my knowledge,1.0,examination pattern is good,1,practical work provides detail knowledge of th...,1.0,library has huge collection of books from diff...,1,extracurricular activities increases mental an...
...,...,...,...,...,...,...,...,...,...,...,...,...
180,1,intraction is good and leacture delivery also ...,0.0,every one can tell depth of course but some on...,,exam pattern is good and marks distribution is...,1,all labs and practical going on well,,good,1,they all are held in super
181,1,all the given terms are good regarding the uni...,1.0,we are getting maximum knowledge,1.0,all are good,1,not bad,-1.0,library facilities are not good.They are not f...,1,good
182,1,All the terms are good regarding the universit...,1.0,Knowledge is maximum gained by reading books ...,0.0,The examination pattern is good .But time is n...,1,Labs are upto the mark.,1.0,They are good,1,the extracurricular activities held in univers...
183,-1,Some of the teacher are un experienced. Also t...,0.0,Its fine but it should focus more towards prac...,1.0,MCQ pattern is quite good and efficient way fo...,-1,Our labs do not have all facalities.,1.0,We have a good library with all facalities.,1,Our university has lot of extracurricular goin...


## **Data Cleaning and Wrangling**

In [3]:
#removing rows that contain null values

Evaln = Eval.dropna(axis=0)
Evaln

,teaching,teaching1,content,content1,exam,exam1,lab,lab1,lib,lib1,extrac,extrac1
0,0,teacher are punctual but they should also give...,0.0,content of courses are average,1.0,examination pattern is good,-1,"not satisfactory, lab work must include latest...",0.0,library facilities are good but number of book...,1,extracurricular activities are excellent and p...
1,1,Good,-1.0,Not good,1.0,Good,1,Good,-1.0,Not good,1,Good
2,1,Excellent lectures are delivered by teachers a...,1.0,All courses material provide very good knowled...,1.0,Exam pattern is up to the mark and the Cgpa de...,1,Lab work is properly covered in the labs by th...,1.0,Library facilities are excellent in terms of g...,1,Extra curricular activities also help students...
3,1,Good,-1.0,Content of course is perfectly in line with th...,-1.0,Again the university tests students of their a...,1,Good,0.0,Its the best thing i have seen in this univers...,-1,Complete wastage of time. Again this opinion i...
4,1,teachers give us all the information required ...,1.0,content of courses improves my knowledge,1.0,examination pattern is good,1,practical work provides detail knowledge of th...,1.0,library has huge collection of books from diff...,1,extracurricular activities increases mental an...
...,...,...,...,...,...,...,...,...,...,...,...,...
179,1,GOOD,-1.0,HAVE TO IMPROVE,1.0,EXAMPATTERN IS GOOD AND MARKS DISTRIBUTION IS ...,1,IT IS GOOD,1.0,IT IS ALSO GOOD,1,VERY GOOD
181,1,all the given terms are good regarding the uni...,1.0,we are getting maximum knowledge,1.0,all are good,1,not bad,-1.0,library facilities are not good.They are not f...,1,good
182,1,All the terms are good regarding the universit...,1.0,Knowledge is maximum gained by reading books ...,0.0,The examination pattern is good .But time is n...,1,Labs are upto the mark.,1.0,They are good,1,the extracurricular activities held in univers...
183,-1,Some of the teacher are un experienced. Also t...,0.0,Its fine but it should focus more towards prac...,1.0,MCQ pattern is quite good and efficient way fo...,-1,Our labs do not have all facalities.,1.0,We have a good library with all facalities.,1,Our university has lot of extracurricular goin...


In [4]:
#reshaping the data from wide to long format: one row per rating, comment and category

new_rows = []
for index, row in Evaln.iterrows():
    new_rows.append([row.teaching, row.teaching1, 'teaching'])
    new_rows.append([row.content, row.content1, 'course content'])
    new_rows.append([row.exam, row.exam1, 'examination'])
    new_rows.append([row.lab, row.lab1, 'lab work'])
    new_rows.append([row.lib, row.lib1, 'library facilities'])
    new_rows.append([row.extrac, row.extrac1, 'extracurricular'])
EvalClean = pd.DataFrame(new_rows, columns=['Ratings','Comments', 'Categories'])
EvalClean

,Ratings,Comments,Categories
0,0.0,teacher are punctual but they should also give...,teaching
1,0.0,content of courses are average,course content
2,1.0,examination pattern is good,examination
3,-1.0,"not satisfactory, lab work must include latest...",lab work
4,0.0,library facilities are good but number of book...,library facilities
...,...,...,...
1081,-1.0,HAVE TO IMPROVE,course content
1082,0.0,PAPER CHECKING IS VERY HARD REMAINING IS GOOD,examination
1083,1.0,ALL PRACTICAL WORK IS GOOD,lab work
1084,-1.0,THEY IS NO PROBLEM WITH THEM\n,library facilities


In [5]:
#creating a function to make text lowercase, remove punctuation and remove numbers

def clean_text(text):
    #convert to lowercase
    text = text.lower()
    #strip every punctuation character listed in string.punctuation
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    #strip digits
    text = re.sub('[0-9]+', '', text)
    return text
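
As a quick check, the function can be applied to a single made-up comment. The sample string below is hypothetical and not taken from the dataset; `clean_text` is repeated only so that the snippet runs on its own.

```python
import re
import string

def clean_text(text):
    text = text.lower()
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    text = re.sub('[0-9]+', '', text)
    return text

# Hypothetical comment with capitals, punctuation, digits and a trailing newline.
sample = "Labs are upto the mark, but 2 of the 10 PCs don't work!\n"
cleaned = clean_text(sample)
print(repr(cleaned))
```

The function leaves whitespace untouched, so double spaces remain where digits were removed and a trailing `\n` survives, as in the last row of the cleaned table. This is harmless here because `CountVectorizer` splits on word boundaries.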

In [6]:
#applying the clean_text function defined above to every comment in the reshaped dataset

EvalClean['Comments'] = EvalClean['Comments'].map(clean_text)
EvalClean

,Ratings,Comments,Categories
0,0.0,teacher are punctual but they should also give...,teaching
1,0.0,content of courses are average,course content
2,1.0,examination pattern is good,examination
3,-1.0,not satisfactory lab work must include latest ...,lab work
4,0.0,library facilities are good but number of book...,library facilities
...,...,...,...
1081,-1.0,have to improve,course content
1082,0.0,paper checking is very hard remaining is good,examination
1083,1.0,all practical work is good,lab work
1084,-1.0,they is no problem with them\n,library facilities


## **Organizing Data**

In [7]:
#Using CountVectorizer() to build a document-term matrix in a Bag-Of-Words style without stop words
#turning the corpus data into a document-term matrix for the machine to be able to read it

from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(stop_words='english')

commentCV = cv.fit_transform(EvalClean['Comments'])
#get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
commentCV_dtm = pd.DataFrame(commentCV.toarray(), columns = cv.get_feature_names_out())

commentCV_dtm.index = EvalClean['Comments'].index
commentCV_dtm

,abilities,ability,able,abroad,absolutely,abt,academic,accessable,accitivties,accordance,...,working,works,world,worth,write,writing,yeah,year,years,yes
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1081,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1082,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1083,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1084,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
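
To make the structure of a document-term matrix concrete, here is a minimal plain-Python sketch of the bag-of-words idea on a three-comment toy corpus. The comments and the two-word stop list are made up for illustration; `CountVectorizer` uses a much longer English stop-word list and its own tokenizer, so this is an approximation of what it does, not its implementation.

```python
from collections import Counter

# Hypothetical mini-corpus, already cleaned (lowercase, no punctuation).
docs = [
    "library facilities are good",
    "lab work is good",
    "good good library",
]

# Tiny illustrative stop-word list (an assumption, not scikit-learn's list).
stop_words = {"are", "is"}

# Tokenize on whitespace and drop stop words.
tokenized = [[w for w in doc.split() if w not in stop_words] for doc in docs]

# Vocabulary: alphabetically sorted unique terms mapped to column positions.
terms = sorted({w for tokens in tokenized for w in tokens})
vocabulary = {term: col for col, term in enumerate(terms)}

# One row per document, one column per term, each cell a term count.
dtm = []
for tokens in tokenized:
    counts = Counter(tokens)
    dtm.append([counts[term] for term in vocabulary])

print(list(vocabulary))
for row in dtm:
    print(row)
```

Most cells are zero because each comment uses only a few of the terms in the vocabulary, which is also why the full matrix above looks almost empty.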


In [8]:
#checking the distribution of negative, neutral and positive comments across categories

EvalClean.groupby(['Ratings', 'Categories']).count().unstack()

Ratings,course content,examination,extracurricular,lab work,library facilities,teaching
-1.0,30,23,12,37,31,13
0.0,25,29,19,16,24,35
1.0,126,129,150,128,126,133


In [9]:
#calculating the most frequent words

totalCT = commentCV_dtm.sum()
totalCT.sort_values(ascending = False).head(20)

good          629
excellent      72
university     61
students       61
books          47
library        47
course         39
teachers       38
lab            38
pattern        38
activities     36
knowledge      35
time           31
teaching       30
work           29
average        29
content        29
paper          28
checking       27
courses        27
dtype: int64

## **Topic Modeling**

The goal of topic modeling is to discover the topics present in a corpus. Each document is treated as a mixture of one or more topics. Here we use one of the most widely used topic modeling methods, Latent Dirichlet Allocation (LDA).
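
In LDA terms, a document being made up of topics means that each document $d$ has its own mixture over the $K$ topics, and each topic $k$ has its own distribution over words. Using the conventional symbols $\theta$ and $\phi$ (standard LDA notation, not names used in this notebook's code), the probability of seeing word $w$ in document $d$ is:

$$
p(w \mid d) = \sum_{k=1}^{K} \theta_{d,k}\,\phi_{k,w},
\qquad \sum_{k=1}^{K} \theta_{d,k} = 1,
\qquad \sum_{w} \phi_{k,w} = 1
$$

Here $\theta_{d,k}$ is the share of topic $k$ in document $d$ and $\phi_{k,w}$ is the probability of word $w$ under topic $k$. The weights that an LDA model prints next to each word of a topic are the largest entries of $\phi_k$.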

In [10]:
#transposing the document-term matrix into a term-document matrix,
#because Sparse2Corpus treats each column as a document by default

tdm = commentCV_dtm.transpose()

#converting the matrix to a sparse format, wrapping it as a gensim corpus,
#and building id2word, a dictionary that maps each column index of the dtm back to its term
sparse_counts = scipy.sparse.csr_matrix(tdm)
corpus = matutils.Sparse2Corpus(sparse_counts)
id2word = dict((v, p) for p, v in cv.vocabulary_.items())
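
The two structures built in this cell can be pictured with a small plain-Python sketch. The vocabulary and counts below are hypothetical; the point is only to show the shape of the data, namely that a bag-of-words corpus keeps the non-zero `(term_id, count)` pairs for each document and that `id2word` is the inverse of a term-to-index vocabulary. gensim's `Sparse2Corpus` produces this kind of structure lazily from the sparse matrix rather than as a ready-made list.

```python
# Hypothetical vocabulary in the shape of cv.vocabulary_ (term -> column index).
vocabulary = {"good": 0, "lab": 1, "library": 2}

# Invert it the same way the cell above does: column index -> term.
id2word = dict((v, p) for p, v in vocabulary.items())

# Hypothetical document-term counts: rows are documents, columns are terms.
dtm = [
    [1, 0, 1],  # "library good"
    [2, 1, 0],  # "good lab good"
]

# Per document, keep only the non-zero (term_id, count) pairs.
corpus = [[(j, n) for j, n in enumerate(row) if n > 0] for row in dtm]

print(id2word)
print(corpus)
```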

In [11]:
#fitting an LDA model with 3 topics, making 20 passes over the corpus

lda = models.LdaModel(corpus=corpus, id2word=id2word, num_topics=3, passes=20)
lda.print_topics()

[(0,
  '0.039*"university" + 0.029*"lab" + 0.023*"students" + 0.022*"work" + 0.022*"average" + 0.019*"practical" + 0.012*"teaching" + 0.011*"teachers" + 0.010*"activities" + 0.009*"helpful"'),
 (1,
  '0.284*"good" + 0.022*"books" + 0.022*"library" + 0.014*"course" + 0.012*"time" + 0.012*"facilities" + 0.011*"delivery" + 0.011*"teachers" + 0.011*"content" + 0.011*"knowledge"'),
 (2,
  '0.059*"excellent" + 0.031*"pattern" + 0.023*"paper" + 0.022*"checking" + 0.019*"exam" + 0.017*"marks" + 0.016*"activities" + 0.016*"examination" + 0.012*"distribution" + 0.010*"best"')]

## **Text Classifier**

In [12]:
#splitting the dataset into a training set (80%) and a test set (20%)

from sklearn.model_selection import train_test_split

Xs_docs = EvalClean['Comments']
Ys_evals = EvalClean['Ratings']
xs_training, xs_test, y_training, y_test = train_test_split(Xs_docs, Ys_evals, test_size = 0.2)
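
Because the split is random, the accuracies reported further down will change slightly from run to run. The sketch below illustrates a seeded hold-out split in plain Python. The function `split_train_test` and the sample comments are hypothetical, and rounding the test size up is an assumption meant to mimic scikit-learn's behavior rather than a copy of its code.

```python
import math
import random

def split_train_test(items, test_size=0.2, seed=0):
    """Shuffle the indices with a fixed seed and hold out a `test_size` share."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(len(items) * test_size)
    test = [items[i] for i in indices[:n_test]]
    train = [items[i] for i in indices[n_test:]]
    return train, test

# Hypothetical comments, just to show the sizes of the two parts.
comments = [f"comment {i}" for i in range(10)]
train, test = split_train_test(comments, seed=42)
print(len(train), len(test))
```

In scikit-learn, the same reproducibility is obtained by passing a `random_state` value to `train_test_split`.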

In [13]:
#using MultinomialNB, which works with word-count features and labels that have more than two levels
from sklearn.naive_bayes import MultinomialNB

#prepare the training features
cv = CountVectorizer(stop_words = 'english')
features = cv.fit_transform(xs_training)

#train a multinomial Naive Bayes Model
model = MultinomialNB()
model.fit(features, y_training)

#prepare the test features using the vocabulary learned from the training set
feature_test = cv.transform(xs_test)

#print the model accuracy
print(model.score(feature_test, y_test))

0.7752293577981652
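
For intuition about what the classifier computes, here is a minimal plain-Python sketch of multinomial Naive Bayes with add-one (Laplace) smoothing. The functions `train_nb` and `predict_nb` and the five training comments are hypothetical. The sketch follows the textbook formulation and assumes `alpha=1.0`, which is also the default in `MultinomialNB`; it is not scikit-learn's implementation.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Fit multinomial Naive Bayes on tokenized documents."""
    vocab = sorted({w for doc in docs for w in doc})
    class_docs = Counter(labels)         # number of documents per class
    word_counts = defaultdict(Counter)   # word counts per class
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc)
    log_prior = {c: math.log(n / len(docs)) for c, n in class_docs.items()}
    log_lik = {}
    for c in class_docs:
        # Smoothed denominator: class word total plus alpha per vocabulary term.
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        log_lik[c] = {w: math.log((word_counts[c][w] + alpha) / total) for w in vocab}
    return log_prior, log_lik

def predict_nb(model, doc):
    """Return the class with the highest log posterior; unseen words are skipped."""
    log_prior, log_lik = model
    scores = {
        c: log_prior[c] + sum(log_lik[c][w] for w in doc if w in log_lik[c])
        for c in log_prior
    }
    return max(scores, key=scores.get)

# Hypothetical training comments with ratings in {-1, 0, 1}.
train_docs = [
    "labs are excellent".split(),
    "very good teaching".split(),
    "library is not good".split(),
    "poor lab equipment".split(),
    "content is average".split(),
]
train_labels = [1, 1, -1, -1, 0]

model = train_nb(train_docs, train_labels)
pred_positive = predict_nb(model, "excellent teaching".split())
pred_negative = predict_nb(model, "poor equipment".split())
print(pred_positive, pred_negative)
```

Skipping words that were never seen in training mirrors what happens in the cell above, where `cv.transform` ignores test words that are not in the training vocabulary.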


In [14]:
#repeating the same steps with a support vector machine (SVM) classifier
from sklearn import svm

#prepare the training features
cv = CountVectorizer(stop_words = 'english')
features = cv.fit_transform(xs_training)

#train SVM classifier
model = svm.SVC()
model.fit(features, y_training)

#prepare the test features using the vocabulary learned from the training set
feature_test = cv.transform(xs_test)

#print the model accuracy
print(model.score(feature_test, y_test))

0.7660550458715596
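
Both accuracies are easier to judge against a naive baseline that always predicts the most common rating. The sketch below computes that baseline from the rating counts in the distribution table earlier in this notebook. It uses the full dataset, so the majority share in any particular random test split will differ slightly.

```python
# Comment counts per rating, summed across the six categories
# in the rating-by-category table shown earlier.
counts = {
    -1: 30 + 23 + 12 + 37 + 31 + 13,
    0: 25 + 29 + 19 + 16 + 24 + 35,
    1: 126 + 129 + 150 + 128 + 126 + 133,
}

total = sum(counts.values())
majority_label = max(counts, key=counts.get)
baseline_accuracy = counts[majority_label] / total

print(total, majority_label, round(baseline_accuracy, 3))
```

On the full dataset this baseline is about 0.729, so the scores of roughly 0.775 and 0.766 reported above are only a few points better than always predicting a positive rating.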
