# 1. Importing libraries and Dataset

In [1]:
import pandas as pd 
import numpy as np
from nltk.tokenize import sent_tokenize,word_tokenize
from nltk.stem import WordNetLemmatizer
import re
from sklearn.decomposition import LatentDirichletAllocation
from nltk.corpus import stopwords
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_absolute_error,mean_squared_error
from sklearn.feature_extraction.text import CountVectorizer
import gensim
import pyLDAvis.gensim
import matplotlib.pyplot as plt
from collections import Counter
import warnings

In [2]:
%matplotlib inline
warnings.filterwarnings('ignore')

In [3]:
data_df=pd.read_csv("nips-papers/papers.csv")

In [4]:
data_df.columns

Index(['id', 'year', 'title', 'event_type', 'pdf_name', 'abstract',
       'paper_text'],
      dtype='object')

In [5]:
data_df['paper_text'].head().apply(len)

0    21643
1    15505
2    20523
3    19441
4    20219
Name: paper_text, dtype: int64

In [6]:
author_name =pd.read_csv('nips-papers/authors.csv')

In [7]:
author_name.head()

Unnamed: 0,id,name
0,1,Hisashi Suzuki
1,10,David Brady
2,100,Santosh S. Venkatesh
3,1000,Charles Fefferman
4,10000,Artur Speiser


# 2. Text Processing

In [8]:
data_df[['title','pdf_name']].head(10)

Unnamed: 0,title,pdf_name
0,Self-Organization of Associative Database and ...,1-self-organization-of-associative-database-an...
1,A Mean Field Theory of Layer IV of Visual Cort...,10-a-mean-field-theory-of-layer-iv-of-visual-c...
2,Storing Covariance by the Associative Long-Ter...,100-storing-covariance-by-the-associative-long...
3,Bayesian Query Construction for Neural Network...,1000-bayesian-query-construction-for-neural-ne...
4,"Neural Network Ensembles, Cross Validation, an...",1001-neural-network-ensembles-cross-validation...
5,Using a neural net to instantiate a deformable...,1002-using-a-neural-net-to-instantiate-a-defor...
6,Plasticity-Mediated Competitive Learning,1003-plasticity-mediated-competitive-learning.pdf
7,ICEG Morphology Classification using an Analog...,1004-iceg-morphology-classification-using-an-a...
8,Real-Time Control of a Tokamak Plasma Using Ne...,1005-real-time-control-of-a-tokamak-plasma-usi...
9,Pulsestream Synapses with Non-Volatile Analogu...,1006-pulsestream-synapses-with-non-volatile-an...


The title and pdf_name columns carry the same information, so pdf_name is dropped.
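The overlap between `title` and `pdf_name` can be checked rather than eyeballed. The sketch below assumes, based only on the ten rows shown above, that a file name is the paper id, then a hyphenated lower-case version of the title, then a `.pdf` suffix. `slugify_title` and `pdf_name_matches_title` are hypothetical helpers, not part of the notebook.

```python
import re

def slugify_title(title):
    """Lower-case a title and join its alphanumeric runs with hyphens."""
    return "-".join(re.findall(r"[a-z0-9]+", title.lower()))

def pdf_name_matches_title(title, pdf_name):
    """Strip the numeric id prefix and the .pdf suffix, then compare slugs."""
    stem = re.sub(r"\.pdf$", "", pdf_name)   # drop the file extension
    stem = re.sub(r"^\d+-", "", stem)        # drop the leading paper id
    return stem == slugify_title(title)

# Row 6 of the table is the only one displayed without truncation
match = pdf_name_matches_title(
    "Plasticity-Mediated Competitive Learning",
    "1003-plasticity-mediated-competitive-learning.pdf",
)
print(match)
```

Applied to the whole frame, the share of rows where this returns True would show how safe it is to drop the column.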

In [9]:
data_df['event_type'].unique()

array([nan, 'Oral', 'Spotlight', 'Poster'], dtype=object)

In [10]:
data_df.drop(['pdf_name','event_type'],axis=1,inplace=True)

Our main goal is to summarize the text and tag its keywords, so the event type is not needed.

In [11]:
data_df.head()

Unnamed: 0,id,year,title,abstract,paper_text
0,1,1987,Self-Organization of Associative Database and ...,Abstract Missing,767\n\nSELF-ORGANIZATION OF ASSOCIATIVE DATABA...
1,10,1987,A Mean Field Theory of Layer IV of Visual Cort...,Abstract Missing,683\n\nA MEAN FIELD THEORY OF LAYER IV OF VISU...
2,100,1988,Storing Covariance by the Associative Long-Ter...,Abstract Missing,394\n\nSTORING COVARIANCE BY THE ASSOCIATIVE\n...
3,1000,1994,Bayesian Query Construction for Neural Network...,Abstract Missing,Bayesian Query Construction for Neural\nNetwor...
4,1001,1994,"Neural Network Ensembles, Cross Validation, an...",Abstract Missing,"Neural Network Ensembles, Cross\nValidation, a..."


In [12]:
paper_separate_abstract = data_df[(data_df['abstract']!='Abstract Missing')].index

In [13]:
paper_separate_abstract

Int64Index([ 941, 1067, 2384, 2385, 2388, 2389, 2390, 2393, 2394, 2396,
            ...
            6937, 6938, 6939, 6940, 6941, 6943, 6944, 6945, 6946, 6947],
           dtype='int64', length=3924)

In [14]:
len(data_df[(data_df['abstract']!='Abstract Missing')].index)

3924

In [15]:
print(data_df['abstract'].iloc[941])

Non-negative matrix factorization (NMF) has previously been shown to 
be a useful decomposition for multivariate data. Two different multi- 
plicative algorithms for NMF are analyzed. They differ only slightly in 
the multiplicative factor used in the update rules. One algorithm can be 
shown to minimize the conventional least squares error while the other 
minimizes the generalized Kullback-Leibler divergence. The monotonic 
convergence of both algorithms can be proven using an auxiliary func- 
tion analogous to that used for proving convergence of the Expectation- 
Maximization algorithm. The algorithms can also be interpreted as diag- 
onally rescaled gradient descent, where the rescaling factor is optimally 
chosen to ensure convergence. 


In [16]:
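# paper_separate_abstract holds index labels, so select with .loc; copy so later in-place drops act on an independent frame
data_df = data_df.loc[paper_separate_abstract].copy()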

In [17]:
data_df.drop('paper_text',axis=1,inplace=True)

An abstract is present for 3924 papers, so we keep only those rows and drop the full paper text.

In [18]:
data_df.reset_index(inplace = True)

In [19]:
data_df.drop('index',axis=1,inplace= True)

In [20]:
print(data_df['abstract'][0])

Non-negative matrix factorization (NMF) has previously been shown to 
be a useful decomposition for multivariate data. Two different multi- 
plicative algorithms for NMF are analyzed. They differ only slightly in 
the multiplicative factor used in the update rules. One algorithm can be 
shown to minimize the conventional least squares error while the other 
minimizes the generalized Kullback-Leibler divergence. The monotonic 
convergence of both algorithms can be proven using an auxiliary func- 
tion analogous to that used for proving convergence of the Expectation- 
Maximization algorithm. The algorithms can also be interpreted as diag- 
onally rescaled gradient descent, where the rescaling factor is optimally 
chosen to ensure convergence. 


Next, clean the abstracts and record the word count before and after stop-word removal.

In [21]:
data_df.head()

Unnamed: 0,id,year,title,abstract
0,1861,2000,Algorithms for Non-negative Matrix Factorization,Non-negative matrix factorization (NMF) has pr...
1,1975,2001,Characterizing Neural Gain Control using Spike...,Spike-triggered averaging techniques are effec...
2,3163,2007,Competition Adds Complexity,It is known that determinining whether a DEC-P...
3,3164,2007,Efficient Principled Learning of Thin Junction...,We present the first truly polynomial algorith...
4,3167,2007,Regularized Boost for Semi-Supervised Learning,Semi-supervised inductive learning concerns ho...


In [22]:
def text_processing(df,col):
    temp_df = df[col]
    # 1. Replace every non-letter character (punctuation, digits, line breaks) with a space
    temp_df = temp_df.apply(lambda x: re.sub('[^a-zA-Z]',' ',x))
    # 2. Convert to lower case
    temp_df = temp_df.apply(lambda x: x.lower())
    # 3. Collapse any remaining digits, non-word characters and repeated whitespace into a single space
    temp_df = temp_df.apply(lambda x: re.sub("(\\d|\\W)+"," ",x))
    return temp_df

In [23]:
data_df['abstract'] =text_processing(data_df,'abstract')

In [25]:
print(data_df['abstract'][0])

non negative matrix factorization nmf has previously been shown to be a useful decomposition for multivariate data two different multi plicative algorithms for nmf are analyzed they differ only slightly in the multiplicative factor used in the update rules one algorithm can be shown to minimize the conventional least squares error while the other minimizes the generalized kullback leibler divergence the monotonic convergence of both algorithms can be proven using an auxiliary func tion analogous to that used for proving convergence of the expectation maximization algorithm the algorithms can also be interpreted as diag onally rescaled gradient descent where the rescaling factor is optimally chosen to ensure convergence 
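The cleaned abstract still contains fragments such as "multi plicative", "func tion" and "diag onally", because the raw text hyphenates words across line breaks. One possible fix, sketched here as an assumption rather than as part of the original pipeline, is to rejoin those pieces before punctuation is stripped. `dehyphenate` is a hypothetical helper.

```python
import re

def dehyphenate(text):
    """Rejoin words that were split by a hyphen at the end of a line."""
    # word char, hyphen, optional spaces, newline, optional spaces, word char
    return re.sub(r"(\w)-[ \t]*\n[ \t]*(\w)", r"\1\2", text)

raw = "Two different multi- \nplicative algorithms for NMF are analyzed."
cleaned = dehyphenate(raw)
print(cleaned)  # Two different multiplicative algorithms for NMF are analyzed.
```

The trade-off is that genuine compounds broken at a line end (for example "Expectation-" followed by "Maximization") are merged into one token as well, so the result is worth spot-checking.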


In [26]:
def tokenize_lemmatize(df,col):
    temp_df =df[col]
    # Build the stop-word set once instead of reloading the list for every token
    stop_words = set(stopwords.words('english'))
    #1. Word Tokenization:
    temp_df = temp_df.apply(lambda x : word_tokenize(x))
    # word count before stop-word removal
    word_no_pre = temp_df.apply(lambda x: len(x))
    temp_df = temp_df.apply(lambda x : [i for i in x if i not in stop_words])
    #2. Word Lemmatization:
    lemmatize =WordNetLemmatizer()
    temp_df = temp_df.apply(lambda x: [lemmatize.lemmatize(i) for i in x])
    # word count after stop-word removal and lemmatization
    word_no_post =temp_df.apply(lambda x: len(x))
    temp_df = temp_df.apply(lambda x: " ".join(x))
    return temp_df,word_no_pre,word_no_post

In [27]:
data_df['abstract_post'],data_df['word_count_pre'],data_df['word_count_post']=tokenize_lemmatize(data_df,'abstract')

In [28]:
data_df.head()

Unnamed: 0,id,year,title,abstract,abstract_post,word_count_pre,word_count_post
0,1861,2000,Algorithms for Non-negative Matrix Factorization,non negative matrix factorization nmf has prev...,non negative matrix factorization nmf previous...,108,67
1,1975,2001,Characterizing Neural Gain Control using Spike...,spike triggered averaging techniques are effec...,spike triggered averaging technique effective ...,83,52
2,3163,2007,Competition Adds Complexity,it is known that determinining whether a dec p...,known determinining whether dec pomdp namely c...,70,40
3,3164,2007,Efficient Principled Learning of Thin Junction...,we present the first truly polynomial algorith...,present first truly polynomial algorithm learn...,144,89
4,3167,2007,Regularized Boost for Semi-Supervised Learning,semi supervised inductive learning concerns ho...,semi supervised inductive learning concern lea...,123,81


In [29]:
token_df = data_df['abstract_post'].apply(lambda x:x.split(" "))

In [30]:
# creating word dictionary
dictionary = gensim.corpora.Dictionary(token_df)
#converting dictionary into a bag of words 
word_map =[dictionary.doc2bow(text) for text in token_df]
# Build LDA model
lda_model = gensim.models.ldamodel.LdaModel(corpus=word_map,
                                           id2word=dictionary,
                                           num_topics=4, 
                                           random_state=100,
                                           update_every=1,
                                           chunksize=100,
                                           passes=50,
                                           per_word_topics=True)
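For intuition about what the bag-of-words step feeds into the model: each document becomes a list of `(token id, count)` pairs. The plain-Python sketch below mimics that shape on two toy documents. `build_vocabulary` and `to_bag_of_words` are hypothetical stand-ins, and the ids they assign need not match the ones gensim's `Dictionary` would choose.

```python
from collections import Counter

def build_vocabulary(tokenized_docs):
    """Map each distinct token to an integer id in order of first appearance."""
    vocab = {}
    for doc in tokenized_docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab

def to_bag_of_words(doc, vocab):
    """Return sorted (token_id, count) pairs for the known tokens of one document."""
    counts = Counter(token for token in doc if token in vocab)
    return sorted((vocab[token], n) for token, n in counts.items())

docs = [["matrix", "factorization", "matrix"], ["neural", "network"]]
vocab = build_vocabulary(docs)
bows = [to_bag_of_words(doc, vocab) for doc in docs]
print(bows)  # [[(0, 2), (1, 1)], [(2, 1), (3, 1)]]
```

Word order is discarded at this point, which is why LDA sees only how often each term occurs in a document.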

In [30]:
pyLDAvis.enable_notebook()
pyLDAvis.gensim.prepare(lda_model, word_map, dictionary)

In [31]:
topics = lda_model.show_topics(formatted=False)

In [32]:
topix={}
for i in range(len(topics)):
    val=[]
    print( 'Topic '+str(i))
    for j in range(len(topics[i][1])):
        val.append(topics[i][1][j][0])
    print(val)
    topix['Topic '+str(i)]=val

Topic 0
['method', 'matrix', 'problem', 'data', 'kernel', 'algorithm', 'linear', 'estimator', 'n', 'analysis']
Topic 1
['model', 'data', 'approach', 'learning', 'method', 'algorithm', 'inference', 'graph', 'distribution', 'based']
Topic 2
['network', 'model', 'neural', 'image', 'deep', 'task', 'training', 'learning', 'representation', 'object']
Topic 3
['algorithm', 'learning', 'problem', 'function', 'bound', 'show', 'optimization', 'gradient', 'optimal', 'result']


In [33]:
#len(lda_model[word_map])
len(dictionary)

12641

In [54]:
no_feat =5000
# LDA is a generative model over word counts, so feed it raw term counts (CountVectorizer) rather than TF-IDF weights
tf_vectorizer = CountVectorizer(max_features=no_feat,stop_words='english')
X_tf = tf_vectorizer.fit_transform(data_df['abstract_post'])
tf_feat_name = tf_vectorizer.get_feature_names()

lda = LatentDirichletAllocation(learning_method='online')
lda_param = {'n_components': [4,5,6], 'learning_decay': [0.5,0.7,0.9]}
lda_model = GridSearchCV(estimator=lda,param_grid=lda_param)
lda_model.fit(X_tf)

GridSearchCV(cv=None, error_score='raise',
       estimator=LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
             evaluate_every=-1, learning_decay=0.7,
             learning_method='online', learning_offset=10.0,
             max_doc_update_iter=100, max_iter=10, mean_change_tol=0.001,
             n_components=10, n_jobs=1, n_topics=None, perp_tol=0.1,
             random_state=None, topic_word_prior=None,
             total_samples=1000000.0, verbose=0),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'n_components': [4, 5, 6], 'learning_decay': [0.5, 0.7, 0.9]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [55]:
lda_model.best_params_

{'learning_decay': 0.9, 'n_components': 4}

In [56]:
best_lda_model =lda_model.best_estimator_

In [57]:
lda_output = best_lda_model.fit_transform(X_tf)

In [58]:
np.argsort(best_lda_model.components_[1])[:-10-1:-1]

array([2855, 2973, 2975, 2147, 2526, 4534, 3045, 1146, 4331, 1747])
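The slice `[:-10-1:-1]` is a compact way to take the ten largest entries: `argsort` returns indices in ascending order of weight, and the slice walks that order backwards from the end, stopping after ten items. The sketch below reproduces the idiom on a plain list. `ascending_argsort` is a hypothetical stand-in for `np.argsort`, and the weights are made up.

```python
def ascending_argsort(values):
    """Indices that would sort `values` from smallest to largest."""
    return sorted(range(len(values)), key=lambda i: values[i])

def top_k_indices(values, k):
    """Indices of the k largest values, largest first."""
    # Same slice as in the notebook: start at the end, step backwards, take k items
    return ascending_argsort(values)[:-k-1:-1]

weights = [0.1, 0.7, 0.3, 0.9, 0.5]   # made-up topic-word weights
print(top_k_indices(weights, 2))       # [3, 1]
```

The returned numbers are column positions in the count matrix, so they only become readable once mapped back through the vectorizer's feature names.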

In [59]:
def visualizing_topic_cluster(model,stop_len,feat_name):
    topic={}
    for index,topix in enumerate(model.components_):
        topic[index]= [feat_name[i] for i in topix.argsort()[:-stop_len-1:-1]]
    
    return topic
        

In [60]:
dict_topic=visualizing_topic_cluster(best_lda_model,stop_len=10,feat_name=tf_feat_name)

In [61]:
len(best_lda_model.components_)

4

In [62]:
for key in dict_topic:
    print('Topic'+str(key),dict_topic[key])

Topic0 ['method', 'data', 'problem', 'algorithm', 'learning', 'model', 'matrix', 'approach', 'function', 'kernel']
Topic1 ['model', 'network', 'neural', 'image', 'learning', 'task', 'object', 'deep', 'state', 'feature']
Topic2 ['model', 'data', 'inference', 'learning', 'distribution', 'latent', 'label', 'approach', 'method', 'variable']
Topic3 ['algorithm', 'problem', 'learning', 'function', 'bound', 'optimal', 'result', 'time', 'policy', 'stochastic']


In [63]:
columns=['Topic'+str(key) for key in dict_topic]

In [64]:
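# Topic shares per paper, rounded to two decimals
topic_share_df = pd.DataFrame(lda_output,columns=columns).round(2)
# Dominant topic per paper and the keywords of that topic
major_topic_df = pd.DataFrame([[np.argmax(x),dict_topic[np.argmax(x)]] for x in lda_output],columns=['Major_topic','keywords'])
lda_df = pd.concat(objs=[data_df,topic_share_df,major_topic_df],axis=1)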

In [65]:
columns

['Topic0', 'Topic1', 'Topic2', 'Topic3']

In [66]:
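def rmse_calculation(x,y):
    # Despite the name, this returns the Euclidean distance between the
    # query vector x and each row of y (no mean is taken before the root).
    error=[]
    for row in y:
        error.append(np.sqrt(np.power((np.array(x)-np.array(row)),2).sum()))
    return error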
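A note on naming: the function above sums squared differences and takes the root without averaging, which is the Euclidean distance rather than a root mean squared error. The sketch below, using two topic-share rows taken from the `lda_df` output and hypothetical helper names, shows that the two quantities differ only by a constant factor of the square root of the vector length, so the ranking of neighbours should be the same either way.

```python
import math

def euclidean_distance(u, v):
    """Square root of the summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def root_mean_squared_error(u, v):
    """Square root of the mean squared difference."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)) / len(u))

u = [0.99, 0.00, 0.00, 0.00]   # topic shares of paper 3167
v = [0.57, 0.00, 0.30, 0.12]   # topic shares of paper 3164

dist = euclidean_distance(u, v)
rmse = root_mean_squared_error(u, v)
# With four topics the ratio is sqrt(4) = 2
print(dist, rmse, dist / rmse)
```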

In [102]:
def reccomendation(df,paper_id,col):
    # Row of the query paper
    paper_1 = df[(df['id']==paper_id)]
    dominant_topic = int(paper_1['Major_topic'].iloc[0])
    # Candidates: papers with the same dominant topic, excluding the query paper
    paper_2 = df[(df['Major_topic']== dominant_topic)]
    paper_2 = paper_2.drop(paper_1.index)
    # Compare topic-share vectors over the columns passed in as col
    x= paper_1[col].values.tolist()
    y= paper_2[col].values.tolist()
    error = np.round(rmse_calculation(x,y),2)
    paper_2['error'] = error
    return  paper_2[['id','title','error']]

In [156]:
lda_df

Unnamed: 0,id,year,title,abstract,abstract_post,word_count_pre,word_count_post,Topic0,Topic1,Topic2,Topic3,Major_topic,keywords
0,1861,2000,Algorithms for Non-negative Matrix Factorization,non negative matrix factorization nmf has prev...,non negative matrix factorization nmf previous...,108,67,0.38,0.00,0.00,0.61,3,"[algorithm, problem, learning, function, bound..."
1,1975,2001,Characterizing Neural Gain Control using Spike...,spike triggered averaging techniques are effec...,spike triggered averaging technique effective ...,83,52,0.16,0.82,0.01,0.01,1,"[model, network, neural, image, learning, task..."
2,3163,2007,Competition Adds Complexity,it is known that determinining whether a dec p...,known determinining whether dec pomdp namely c...,70,40,0.01,0.01,0.01,0.97,3,"[algorithm, problem, learning, function, bound..."
3,3164,2007,Efficient Principled Learning of Thin Junction...,we present the first truly polynomial algorith...,present first truly polynomial algorithm learn...,144,89,0.57,0.00,0.30,0.12,0,"[method, data, problem, algorithm, learning, m..."
4,3167,2007,Regularized Boost for Semi-Supervised Learning,semi supervised inductive learning concerns ho...,semi supervised inductive learning concern lea...,123,81,0.99,0.00,0.00,0.00,0,"[method, data, problem, algorithm, learning, m..."
5,3168,2007,Simplified Rules and Theoretical Analysis for ...,we show that under suitable assumptions primar...,show suitable assumption primarily linearizati...,158,99,0.41,0.58,0.00,0.00,1,"[model, network, neural, image, learning, task..."
6,3169,2007,Predicting human gaze using low-level saliency...,under natural viewing conditions human observe...,natural viewing condition human observer shift...,204,129,0.00,0.94,0.05,0.00,1,"[model, network, neural, image, learning, task..."
7,3171,2007,Mining Internet-Scale Software Repositories,large repositories of source code create new c...,large repository source code create new challe...,191,130,0.08,0.39,0.53,0.00,2,"[model, data, inference, learning, distributio..."
8,3172,2007,Continuous Time Particle Filtering for fMRI,we construct a biologically motivated stochast...,construct biologically motivated stochastic di...,102,63,0.68,0.31,0.00,0.00,0,"[method, data, problem, algorithm, learning, m..."
9,3174,2007,An online Hebbian learning rule that performs ...,independent component analysis ica is a powerf...,independent component analysis ica powerful me...,104,62,0.54,0.45,0.00,0.00,0,"[method, data, problem, algorithm, learning, m..."


In [157]:
reccomended =reccomendation(lda_df,3167,columns)

In [158]:
reccomended.sort_values('error')

Unnamed: 0,id,title,error
3433,6794,Consistent Multitask Learning with Nonlinear O...,0.00
3361,6722,Fixed-Rank Approximation of a Positive-Semidef...,0.00
690,4000,Sufficient Conditions for Generating Group Lev...,0.00
1877,5237,Learning with Fredholm Kernels,0.00
678,3988,Efficient and Robust Feature Selection via Joi...,0.00
1879,5239,Kernel Mean Estimation via Spectral Filtering,0.00
3376,6737,Generalized Linear Model Regression under Dist...,0.00
1880,5240,Subspace Embeddings for the Polynomial Kernel,0.00
1284,4626,Exact and Stable Recovery of Sequences of Sign...,0.00
655,3965,Network Flow Algorithms for Structured Sparsity,0.00
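One likely reason the closest papers all show an error of 0.00: the topic shares stored in `lda_df` were rounded to two decimals, and the distance is rounded again, so papers dominated by the same topic collapse into ties. The sketch below illustrates the effect with made-up share vectors; the numbers are not taken from the dataset.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def rounded(vec, ndigits=2):
    """Round every share the way the notebook does before comparing."""
    return [round(x, ndigits) for x in vec]

# Made-up topic shares, for illustration only
query   = [0.991, 0.003, 0.003, 0.003]
paper_a = [0.994, 0.002, 0.002, 0.002]
paper_b = [0.989, 0.004, 0.004, 0.003]

raw_a, raw_b = euclidean(query, paper_a), euclidean(query, paper_b)
tied_a = euclidean(rounded(query), rounded(paper_a))
tied_b = euclidean(rounded(query), rounded(paper_b))

print(raw_a, raw_b)     # distinct distances on the unrounded shares
print(tied_a, tied_b)   # both 0.0 once the shares are rounded
```

Computing distances on the unrounded `lda_output` and rounding only for display would keep the ordering among such near-identical papers meaningful.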


In [168]:
lda_df.iloc[4]['abstract']

'semi supervised inductive learning concerns how to learn a decision rule from a data set containing both labeled and unlabeled data several boosting algorithms have been extended to semi supervised learning with various strategies to our knowledge however none of them takes local smoothness constraints among data into account during ensemble learning in this paper we introduce a local smoothness regularizer to semi supervised boosting algorithms based on the universal optimization framework of margin cost functionals our regularizer is applicable to existing semi supervised boosting algorithms to improve their generalization and speed up their training comparative results on synthetic benchmark and real world tasks demonstrate the effectiveness of our local smoothness regularizer we discuss relevant issues and relate our regularizer to previous work '

In [172]:
lda_df.iloc[1877]['abstract']

'in this paper we propose a framework for supervised and semi supervised learning based on reformulating the learning problem as a regularized fredholm integral equation our approach fits naturally into the kernel framework and can be interpreted as constructing new data dependent kernels which we call fredholm kernels we proceed to discuss the noise assumption for semi supervised learning and provide evidence evidence both theoretical and experimental that fredholm kernels can effectively utilize unlabeled data under the noise assumption we demonstrate that methods based on fredholm learning show very competitive performance in the standard semi supervised learning setting '