# Paper Grading Assistant

## Modeling

Data comes from this link:
- https://components.one/datasets/all-the-news-2-news-articles-dataset/

Heavy inspiration drawn from:
- https://towardsdatascience.com/topic-modeling-articles-with-nmf-8c6b2a227a45
(Use incognito window when opening that link)

In [146]:
# !pip install gensim
import os, sys
from gensim import corpora, models
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import re
import nltk
nltk.download('stopwords')
nltk.download('wordnet')

from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\maxw2\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\maxw2\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [147]:
# Helper Functions
# Run the utilty functions from a seperate notebook
%run topic_model_utils.ipynb

def strip_html(raw_html):
    clean_re = re.compile('<.*?>|&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});')
    clean_re2 = re.compile('\n|\t')
    text = re.sub(clean_re, '', raw_html)
    text = re.sub(clean_re2, ' ', text)
    return text

def grab_text(path):
    text = []
    paragraph = ''
    with open(path) as file:
        for line in file:
            if len(paragraph.split()) < 100:
                paragraph = paragraph + ' \n ' + strip_html(line.strip())
            else:
                text.append(paragraph)
                paragraph = ''
    file.close()
    return text


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\maxw2\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\maxw2\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [148]:
data = grab_text("D:\\Kaggle\\paul-graham-essays\\paul_graham_essay.txt")
data = np.array(data)
print(data.shape)
df = pd.DataFrame()
df['texts'] = data
df['word_count'] = df['texts'].apply(word_count)
df["id"] = df.index + 1
df.head(20)


(3932,)


Unnamed: 0,texts,word_count,id
0,\n September 2017 \n \n The most valuable in...,123,1
1,\n the world. But another less common approac...,126,2
2,\n \n Corollary: the more general the ideas ...,124,3
3,\n with more. But only if you keep going. So ...,134,4
4,\n \n People who are powerful but uncharisma...,117,5
5,\n the right one. \n January 2017 \n \n Beca...,124,6
6,\n \n But maybe there is a simpler explanati...,117,7
7,\n Newton made three bets. One of them worked...,130,8
8,"\n \n In the real world, \n about 4% of peop...",126,9
9,"\n as I say in the talk, Pittsburgh has some ...",124,10


In [149]:
df_filtered = df[df['word_count'] > 2]
df_filtered

Unnamed: 0,texts,word_count,id
0,\n September 2017 \n \n The most valuable in...,123,1
1,\n the world. But another less common approac...,126,2
2,\n \n Corollary: the more general the ideas ...,124,3
3,\n with more. But only if you keep going. So ...,134,4
4,\n \n People who are powerful but uncharisma...,117,5
...,...,...,...
3927,"\n Instead of a single, monolithic program, \...",130,3928
3928,\n modify. Fewer components also means fewer...,120,3929
3929,\n \n Bottom-up design makes programs easie...,130,3930
3930,\n perhaps to redesign the program in a simpl...,128,3931


In [150]:
df_filtered['processed_texts'] = df_filtered['texts'].apply(process_text)
df_filtered

Unnamed: 0,texts,word_count,id,processed_texts
0,\n September 2017 \n \n The most valuable in...,123,1,"[september, valuable, insight, general, surpri..."
1,\n the world. But another less common approac...,126,2,"[world, le, common, approach, focus, general, ..."
2,\n \n Corollary: the more general the ideas ...,124,3,"[corollary, general, idea, talking, le, worry,..."
3,\n with more. But only if you keep going. So ...,134,4,"[going, doubly, important, let, discouraged, p..."
4,\n \n People who are powerful but uncharisma...,117,5,"[people, powerful, uncharismatic, tend, dislik..."
...,...,...,...,...
3927,"\n Instead of a single, monolithic program, \...",130,3928,"[instead, single, monolithic, program, larger,..."
3928,\n modify. Fewer components also means fewer...,120,3929,"[modify, fewer, component, mean, fewer, connec..."
3929,\n \n Bottom-up design makes programs easie...,130,3930,"[bottomup, design, make, program, easier, read..."
3930,\n perhaps to redesign the program in a simpl...,128,3931,"[redesign, program, simpler, way, bottomup, de..."


In [151]:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

no_features = 1000

# NMF is able to use tf-idf
tfidf_vectorizer = TfidfVectorizer(max_df=0.95, 
                                   min_df=2, 
                                   max_features=no_features, 
                                   stop_words='english', 
                                   preprocessor=' '.join)
tfidf = tfidf_vectorizer.fit_transform(df_filtered['processed_texts'])
tfidf_feature_names = tfidf_vectorizer.get_feature_names()

# LDA can only use raw term counts for LDA because it is a probabilistic graphical model
tf_vectorizer = CountVectorizer(max_df=0.95, 
                                min_df=2, 
                                max_features=no_features, 
                                stop_words='english', 
                                preprocessor=' '.join)
tf = tf_vectorizer.fit_transform(df_filtered['processed_texts'])
tf_feature_names = tf_vectorizer.get_feature_names()

In [152]:
from sklearn.decomposition import NMF, LatentDirichletAllocation

no_topics = 20

# Run NMF
nmf = NMF(n_components=no_topics, random_state=1, alpha=.1, l1_ratio=.5, init='nndsvd').fit(tfidf)

# Run LDA
lda = LatentDirichletAllocation(n_components=no_topics, max_iter=5, learning_method='online', learning_offset=50.,random_state=0).fit(tf)


In [153]:
# Use the top words for each cluster by tfidf weight
# to create 'topics'

# Getting a df with each topic by document
nmf_docweights = nmf.transform(tfidf_vectorizer.transform(df_filtered['processed_texts']))
lda_docweights = lda.transform(tf_vectorizer.transform(df_filtered['processed_texts']))

n_top_words = 8

nmf_topic_df = topic_table(
    nmf,
    tfidf_feature_names,
    n_top_words
).T

# Cleaning up the top words to create topic summaries
nmf_topic_df['topics'] = nmf_topic_df.apply(lambda x: [' '.join(x)], axis=1) # Joining each word into a list
nmf_topic_df['topics'] = nmf_topic_df['topics'].str[0]  # Removing the list brackets
nmf_topic_df['topics'] = nmf_topic_df['topics'].apply(lambda x: whitespace_tokenizer(x)) # tokenize
nmf_topic_df['topics'] = nmf_topic_df['topics'].apply(lambda x: unique_words(x))  # Removing duplicate words
nmf_topic_df['topics'] = nmf_topic_df['topics'].apply(lambda x: [' '.join(x)])  # Joining each word into a list
nmf_topic_df['topics'] = nmf_topic_df['topics'].str[0]  # Removing the list brackets

nmf_topic_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,topics
0,people,thing,make,good,think,just,way,want,people thing make good think just way want
1,language,program,programming,programmer,code,design,java,use,language program programming programmer code d...
2,investor,fundraising,invest,deal,founder,offer,lead,price,investor fundraising invest deal founder offer...
3,wa,did,time,yahoo,got,wanted,computer,year,wa did time yahoo got wanted computer year
4,company,big,market,product,technology,stock,employee,small,company big market product technology stock em...


In [154]:
# Use the top words for each cluster by tfidf weight
# to create 'topics'

# Getting a df with each topic by document
lda_docweights = lda.transform(tf_vectorizer.transform(df_filtered['processed_texts']))

n_top_words = 8

lda_topic_df = topic_table(
    lda,
    tf_feature_names,
    n_top_words
).T

# Cleaning up the top words to create topic summaries
lda_topic_df['topics'] = lda_topic_df.apply(lambda x: [' '.join(x)], axis=1) # Joining each word into a list
lda_topic_df['topics'] = lda_topic_df['topics'].str[0]  # Removing the list brackets
lda_topic_df['topics'] = lda_topic_df['topics'].apply(lambda x: whitespace_tokenizer(x)) # tokenize
lda_topic_df['topics'] = lda_topic_df['topics'].apply(lambda x: unique_words(x))  # Removing duplicate words
lda_topic_df['topics'] = lda_topic_df['topics'].apply(lambda x: [' '.join(x)])  # Joining each word into a list
lda_topic_df['topics'] = lda_topic_df['topics'].str[0]  # Removing the list brackets

lda_topic_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,topics
0,program,work,people,write,word,kid,make,book,program work people write word kid make book
1,software,spam,user,web,filter,application,server,site,software spam user web filter application serv...
2,money,investor,company,round,angel,million,vcs,valuation,money investor company round angel million vcs...
3,programmer,microsoft,google,business,serverbased,technology,application,company,programmer microsoft google business serverbas...
4,learn,hacker,hacking,problem,treat,fix,hack,bug,learn hacker hacking problem treat fix hack bug


In [155]:

# Create a df with only the created topics and topic num
lda_topic_df = lda_topic_df['topics'].reset_index()
lda_topic_df.columns = ['lda_topic_num', 'topics']

lda_topic_df.head(30)

Unnamed: 0,lda_topic_num,topics
0,0,program work people write word kid make book
1,1,software spam user web filter application serv...
2,2,money investor company round angel million vcs...
3,3,programmer microsoft google business serverbas...
4,4,learn hacker hacking problem treat fix hack bug
5,5,number list wa measure new performance rate gr...
6,6,thing people good time question year ha just
7,7,valley silicon city people audience country am...
8,8,startup founder investor want make deal vcs pe...
9,9,reading thanks jessica draft robert livingston...


In [156]:

# Create a df with only the created topics and topic num
nmf_topic_df = nmf_topic_df['topics'].reset_index()
nmf_topic_df.columns = ['nmf_topic_num', 'topics']

nmf_topic_df.head(30)

Unnamed: 0,nmf_topic_num,topics
0,0,people thing make good think just way want
1,1,language program programming programmer code d...
2,2,investor fundraising invest deal founder offer...
3,3,wa did time yahoo got wanted computer year
4,4,company big market product technology stock em...
5,5,startup founder start starting successful succ...
6,6,idea good bad new come mind change ask
7,7,thanks jessica draft reading livingston morris...
8,8,spam filter mail email spammer probability fil...
9,9,vcs angel round vc founder deal series valuation


In [157]:
# Creating a temp df with the url and topic num to join on
id_ = df_filtered['id'].tolist()
df_temp1 = pd.DataFrame({
    'id': id_,
    'nmf_topic_num': nmf_docweights.argmax(axis=1)
})

merged_topic1 = df_temp1.merge(
    nmf_topic_df,
    on='nmf_topic_num',
    how='left'
)

df_temp2 = pd.DataFrame({
    'id': id_,
    'lda_topic_num': lda_docweights.argmax(axis=1)
})

merged_topic2 = df_temp2.merge(
    lda_topic_df,
    on='lda_topic_num',
    how='left'
)

# Merging to get the topic num with url
merged_topic = merged_topic1.merge(
    merged_topic2,
    on='id',
    how='left'
)

# Merging with the original df
df_topics = pd.merge(
    df_filtered,
    merged_topic,
    on='id',
    how='left'
)

df_topics = df_topics.drop(
    'processed_texts',
    axis=1
)

df_topics = df_topics.rename(columns={'topics_x' : 'nmf_topic', 'topics_y' : 'lda_topic' })

df_topics.head(15)

Unnamed: 0,texts,word_count,id,nmf_topic_num,nmf_topic,lda_topic_num,lda_topic
0,\n September 2017 \n \n The most valuable in...,123,1,0,people thing make good think just way want,1,software spam user web filter application serv...
1,\n the world. But another less common approac...,126,2,6,idea good bad new come mind change ask,11,idea company people work startup big problem hard
2,\n \n Corollary: the more general the ideas ...,124,3,6,idea good bad new come mind change ask,11,idea company people work startup big problem hard
3,\n with more. But only if you keep going. So ...,134,4,7,thanks jessica draft reading livingston morris...,11,idea company people work startup big problem hard
4,\n \n People who are powerful but uncharisma...,117,5,0,people thing make good think just way want,18,language lisp programming problem use think wa...
5,\n the right one. \n January 2017 \n \n Beca...,124,6,0,people thing make good think just way want,8,startup founder investor want make deal vcs pe...
6,\n \n But maybe there is a simpler explanati...,117,7,0,people thing make good think just way want,6,thing people good time question year ha just
7,\n Newton made three bets. One of them worked...,130,8,3,wa did time yahoo got wanted computer year,12,wa hacker design just software people make time
8,"\n \n In the real world, \n about 4% of peop...",126,9,0,people thing make good think just way want,0,program work people write word kid make book
9,"\n as I say in the talk, Pittsburgh has some ...",124,10,10,valley silicon city university town hub boston...,8,startup founder investor want make deal vcs pe...


In [158]:

A = tfidf_vectorizer.transform(df_topics['texts'])
W = nmf.components_
H = nmf.transform(A)

print('A = {} x {}'.format(A.shape[0], A.shape[1]))
print('W = {} x {}'.format(W.shape[0], W.shape[1]))
print('H = {} x {}'.format(H.shape[0], H.shape[1]))

# Get the residuals for each document
r = np.zeros(A.shape[0])

for row in range(A.shape[0]):
    r[row] = np.linalg.norm(A[row, :] - H[row, :].dot(W), 'fro')

sum_sqrt_res = round(sum(np.sqrt(r)), 3)
print('Sum of the squared residuals is {}'.format(sum_sqrt_res))

A = 3932 x 1000
W = 20 x 1000
H = 3932 x 20
Sum of the squared residuals is 0.0


In [161]:
def print_lines(idx):
    print('=====')
    print('entry: ', idx)
    print(df_topics['texts'][idx])
    print(' ')
    print('Topics:')
    print("nmf topics:", df_topics['nmf_topic'][idx])
    print("lda topics:", df_topics['lda_topic'][idx])
    print('=====')

for i in range(0,20):
    print_lines(i)

=====
entry:  0
 
 September 2017 
  
 The most valuable insights are both general and surprising. 
 F=ma for example. But general and surprising is a hard 
 combination to achieve. That territory tends to be picked 
 clean, precisely because those insights are so valuable. 
  
 Ordinarily, the best that people can do is one without the 
 other: either surprising without being general (e.g. 
 gossip), or general without being surprising (e.g. 
 platitudes). 
  
 Where things get interesting is the moderately valuable 
 insights.  You get those from small additions of whichever 
 quality was missing.  The more common case is a small 
 addition of generality: a piece of gossip that's more than
 
Topics:
nmf topics people thing make good think just way want
lda topics software spam user web filter application server site
=====
=====
entry:  1
 
 the world. But another less common approach is to focus on 
 the most general ideas and see if you can find something new 
 to say about them. Be