# Faculty Bios Topic Modelling

This notebook contains the code for the topic model that we built on the faculty bios data provided at https://github.com/CS410Fall2020/ExpertSearch/tree/master/data/compiled_bios

We used the gensim library (https://pypi.org/project/gensim/) for topic modelling and nltk (https://www.nltk.org/) for text cleaning and preprocessing.

In [10]:
from nltk.corpus import stopwords as sw
from nltk.tokenize import word_tokenize
from nltk.stem.porter import PorterStemmer
import nltk

#### Loading the stopword list and the stemmer

In [11]:

stopwords = set(sw.words('english'))
stemmer = PorterStemmer()

#### Loading the faculty bios pages

In [12]:
positive_data = []
with open('../data/classificationData/positive.txt', 'r') as f:
    for line in f:
        data = line.split("#####")
        positive_data.append(data[0].strip())

len(positive_data)

6521
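
The loading cell keeps only the text before the `#####` delimiter on each line. A minimal sketch of that parsing step on in-memory sample lines follows; the sample text, the helper name `load_bios`, and the content after the delimiter are assumptions made for illustration, not taken from the dataset:

```python
import io

# Hypothetical sample lines; the real file content may differ.
sample = io.StringIO(
    "Jane Doe is a professor of computer science. ##### extra field\n"
    "John Roe works on sensor networks.\n"
)

def load_bios(handle, delimiter="#####"):
    """Return the text before the first delimiter on each line, stripped."""
    bios = []
    for line in handle:
        # str.partition returns the whole line as the first part
        # when the delimiter is absent, so no special case is needed.
        text, _, _ = line.partition(delimiter)
        bios.append(text.strip())
    return bios

bios = load_bios(sample)
print(len(bios))
print(bios[0])
```

Lines without a delimiter are kept whole, which matches what `line.split("#####")[0]` does in the cell above.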

#### Using WordNet (nltk.corpus.wordnet) for lemmatization

In [13]:
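nltk.download('wordnet')
from nltk.corpus import wordnet as wn

def get_lemma(word):
    # wn.morphy returns the WordNet base form, or None if it finds none.
    lemma = wn.morphy(word)
    # Fall back to the lowercased word when no base form is found.
    return word.lower() if lemma is None else lemma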

[nltk_data] Downloading package wordnet to /Users/pushpit/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
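
The fallback logic in `get_lemma` can be tried without downloading WordNet. The sketch below passes the lookup function in as a parameter; `toy_table` and `toy_morphy` are hypothetical stand-ins for `wn.morphy` and only mimic its "return None when nothing is found" behaviour:

```python
def get_lemma(word, morphy):
    """Return morphy(word) if it finds a base form, else the lowercased word."""
    lemma = morphy(word)
    return word.lower() if lemma is None else lemma

# Hypothetical stand-in for wn.morphy: dict.get returns None for unknown words.
toy_table = {"systems": "system", "networks": "network"}
toy_morphy = toy_table.get

print(get_lemma("systems", toy_morphy))  # found in the table
print(get_lemma("Urbana", toy_morphy))   # not found, lowercased instead
```

The parentheses in `word.lower()` matter: `word.lower` without them would put the method object itself into the token list instead of a string.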


#### Preprocessing data for topic modelling
1. Tokenize the text with the nltk word tokenizer.
2. Remove stopwords, drop all words with fewer than 5 characters, and replace each remaining word with its lemmatized form.

In [14]:
def preprocess_data_for_topic_modeling(text):
    text_tokens = word_tokenize(text)
    # Drop stopwords and tokens of 4 or fewer characters, then lemmatize the rest.
    processed_tokens = [get_lemma(word) for word in text_tokens if word not in stopwords and len(word) > 4]
    return processed_tokens

cleaned_data = [preprocess_data_for_topic_modeling(data) for data in positive_data]
cleaned_data[0]

[<function str.lower()>,
 <function str.lower()>,
 <function str.lower()>,
 'professor',
 <function str.lower()>,
 <function str.lower()>,
 'scholar',
 'depart',
 <function str.lower()>,
 <function str.lower()>,
 <function str.lower()>,
 <function str.lower()>,
 'urbana',
 'champaign',
 'urbana',
 <function str.lower()>,
 <function str.lower()>,
 <function str.lower()>,
 <function str.lower()>,
 'ph.d.',
 <function str.lower()>,
 'michigan',
 'arbor',
 'professor',
 <function str.lower()>,
 'professor',
 <function str.lower()>,
 'virginia',
 'august',
 'august',
 <function str.lower()>,
 <function str.lower()>,
 'urbana',
 'champaign',
 <function str.lower()>,
 'professor',
 <function str.lower()>,
 <function str.lower()>,
 'professor',
 'interest',
 <function str.lower()>,
 'system',
 <function str.lower()>,
 'system',
 'network',
 'sensor',
 'network',
 <function str.lower()>,
 'system',
 'embed',
 <function str.lower()>,
 'system',
 <function str.lower()>,
 'interest',
 'develop',
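
The filtering rule used in the preprocessing cell can be sketched on a toy sentence without nltk. The tokenizer, the stopword set, and the function names below are hypothetical stand-ins, and lowercasing takes the place of lemmatization:

```python
import re

# Hypothetical stand-ins for nltk's English stopword list and word_tokenize.
toy_stopwords = {"about", "their", "which", "there"}

def toy_tokenize(text):
    return re.findall(r"\w+", text)

def preprocess(text, min_len=5):
    tokens = toy_tokenize(text)
    # Keep tokens that are not stopwords and have at least min_len characters,
    # then lowercase them (the notebook lemmatizes at this point).
    return [t.lower() for t in tokens if t not in toy_stopwords and len(t) >= min_len]

result = preprocess("About their work: sensor networks and embedded systems")
print(result)
```

In this sketch the stopword test runs on the original token before it is lowercased, as in the cell above, so the capitalized "About" is not matched by the lowercase stopword set and survives as "about".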
 

# Topic Modeling

Using gensim to build the dictionary, the bag-of-words corpus, and the LDA topic model.

In [19]:
from gensim import corpora
dictionary = corpora.Dictionary(cleaned_data)
corpus = [dictionary.doc2bow(line) for line in cleaned_data]

import pickle
# Persist the bag-of-words corpus; the with block closes the file afterwards.
with open('expert_corpus.pkl', 'wb') as f:
    pickle.dump(corpus, f)
dictionary.save('dictionary.gensim')
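
Conceptually, the dictionary maps each distinct token to an integer id, and `doc2bow` turns a token list into `(token_id, count)` pairs. The pure-Python sketch below mimics that idea on two toy documents; it does not call gensim, the `toy_` names are hypothetical, and the ids gensim assigns to real data may differ:

```python
from collections import Counter

toy_docs = [["system", "network", "system"], ["network", "sensor"]]

# Assign each distinct token an integer id in order of first appearance.
toy_token2id = {}
for doc in toy_docs:
    for token in doc:
        toy_token2id.setdefault(token, len(toy_token2id))

def to_bow(doc):
    """Return sorted (token_id, count) pairs for tokens known to the mapping."""
    counts = Counter(toy_token2id[t] for t in doc if t in toy_token2id)
    return sorted(counts.items())

toy_corpus = [to_bow(doc) for doc in toy_docs]
print(toy_token2id)
print(toy_corpus)
```

Tokens that are missing from the mapping are skipped here, which is also how an unseen word in a new document would drop out of its bag-of-words vector.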

In [21]:
import gensim
num_of_topics = 10
# Train LDA on the bag-of-words corpus; passes=20 makes 20 full sweeps over the corpus.
lda_topic_model = gensim.models.ldamodel.LdaModel(corpus, num_topics=num_of_topics, id2word=dictionary, passes=20)
lda_topic_model.save('expert_topic_model_10.gensim')
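
A trained gensim LDA model can return the topic mixture of a document as `(topic_id, probability)` pairs, for example through `get_document_topics`. A minimal sketch of picking the dominant topic from such pairs follows; the helper name `dominant_topic` and the sample probabilities are hypothetical, and the sketch does not call gensim:

```python
def dominant_topic(doc_topics):
    """Return the (topic_id, probability) pair with the highest probability."""
    if not doc_topics:
        return None
    return max(doc_topics, key=lambda pair: pair[1])

# Hypothetical topic mixture for one bio; not produced by the model above.
sample_topics = [(1, 0.12), (4, 0.71), (7, 0.17)]
print(dominant_topic(sample_topics))
```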

In [24]:
experts_topics = lda_topic_model.print_topics(num_words=10)
experts_topics

[(0,
  '0.019*"graphics" + 0.018*"paper" + 0.015*"image" + 0.014*"siggraph" + 0.010*"computer" + 0.010*"video" + 0.009*"rendering" + 0.009*"vision" + 0.009*"shape" + 0.008*"light"'),
 (1,
  '0.033*"conference" + 0.027*"international" + 0.018*"systems" + 0.014*"proceedings" + 0.013*"networks" + 0.013*"computing" + 0.012*"network" + 0.009*"security" + 0.009*"computer" + 0.009*"workshop"'),
 (2,
  '0.015*"translation" + 0.012*"speech" + 0.010*"blandford" + 0.010*"rogers" + 0.009*"nadia" + 0.008*"article" + 0.008*"yvonne" + 0.006*"philipp" + 0.005*"koehn" + 0.005*"bianchi-berthouze"'),
 (3,
  '0.022*"research" + 0.019*"function" + 0.019*"study" + 0.017*"details" + 0.014*"state" + 0.013*"liverpool" + 0.012*"university" + 0.011*"return" + 0.011*"postgraduate" + 0.010*"about"'),
 (4,
  '0.014*"programming" + 0.013*"system" + 0.011*"architecture" + 0.011*"software" + 0.010*"memory" + 0.009*"parallel" + 0.009*"paper" + 0.008*"systems" + 0.008*"design" + 0.007*"performance"'),
 (5,
  '0.017*"ele