In [1]:
%matplotlib inline


# Topic extraction with Non-negative Matrix Factorization and Latent Dirichlet Allocation


This is an example of applying :class:`sklearn.decomposition.NMF` and
:class:`sklearn.decomposition.LatentDirichletAllocation` to a corpus
of documents to extract additive models of the topic structure of the
corpus. The output is a list of topics, each represented as a list of
terms (weights are not shown).

Non-negative Matrix Factorization is applied with two different objective
functions: the Frobenius norm, and the generalized Kullback-Leibler divergence.
The latter is equivalent to Probabilistic Latent Semantic Indexing.
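
As a minimal sketch of the two objectives, the snippet below fits NMF twice on a tiny made-up corpus, once per loss. The corpus, the number of components and the iteration limits are illustrative assumptions, not values taken from this notebook.

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (hypothetical, for illustration only).
docs = [
    "supply chain logistics warehouse",
    "warehouse logistics inventory supply",
    "brand marketing trust customer",
    "customer marketing brand loyalty",
]

tfidf = TfidfVectorizer().fit_transform(docs)

# Frobenius norm objective (the default beta_loss).
nmf_fro = NMF(n_components=2, beta_loss="frobenius", init="nndsvda",
              random_state=0, max_iter=500).fit(tfidf)

# Generalized Kullback-Leibler objective; this loss is only available
# with the multiplicative-update solver (solver="mu").
nmf_kl = NMF(n_components=2, beta_loss="kullback-leibler", solver="mu",
             init="nndsvda", random_state=0, max_iter=1000).fit(tfidf)

# Both models expose one non-negative weight row per topic.
print(nmf_fro.components_.shape, nmf_kl.components_.shape)
```

Both fitted models have a `components_` matrix of shape (topics, terms), so the same top-word listing can be applied to either.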

The default parameters (n_samples / n_features / n_components) should make
the example run in a few tens of seconds. You can try to
increase the dimensions of the problem, but be aware that the time
complexity of NMF is polynomial. For LDA, the time complexity is
proportional to (n_samples * iterations).


<function fetch_20newsgroups at 0x000002897EA268B8>
Loading dataset...


TypeError: fetch_20newsgroups() got an unexpected keyword argument 'return_X_y'

In [42]:
import pandas as pd
import numpy as np

# Load the cleaned project data: one row per project description,
# with supervisor/project ids and 0/1 keyword columns.
df = pd.read_csv('eBM_cleaned_data.csv')

In [43]:
df.head()

Unnamed: 0,SUPERVISOR,PROJECT_DESCRIPTION,project_id,supervisor_id,AR-VR,Accountancy,Action Research,Aeronautical,Aerospace,Africa,...,Warehousing,Web 2.0,Workflow,Workload,company-based project,digital technology,e-commerce,eBusiness,gender,industry-based project
0,Amanda Dowd,The areas that I would be interested in superv...,1,12,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
1,Amanda Dowd,INTRODUCTION\r\n\r\nWith increasing numbers of...,2,12,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
2,Nigel Denton,If you are thinking about a project in the are...,3,159,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
3,Des Doran,Exploring the impact of sustainable issues upo...,4,58,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Des Doran,This project should explore how strategic posi...,5,58,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [44]:
# The free-text project descriptions are the documents to be modelled.
data_samples = df["PROJECT_DESCRIPTION"]

In [52]:
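from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

# n_samples is only reported in the log message of the next cell;
# the vectorizer is fitted on every row of data_samples.
n_samples = 2000
n_features = 1000
n_components = 30
n_top_words = 20

# LDA works on raw term counts: drop English stop words, terms found in
# more than 95% of the documents, and terms found in fewer than 2 documents.
print("Extracting tf features for LDA...")
tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2,
                                max_features=n_features,
                                stop_words='english')
tf = tf_vectorizer.fit_transform(data_samples)
print("done")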

Extracting tf features for LDA...
done
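
To make the effect of the `max_df` and `min_df` arguments used above concrete, here is a small sketch on three made-up documents (the documents are an assumption for illustration). A term is dropped when it occurs in more than 95% of the documents or in fewer than two of them.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy documents (hypothetical): "project" occurs in all three,
# "digital" and "marketing" in two, "supply" and "analytics" in one.
docs = [
    "digital marketing project",
    "digital supply project",
    "marketing analytics project",
]

vec = CountVectorizer(max_df=0.95, min_df=2)
X = vec.fit_transform(docs)

# "project" exceeds max_df; "supply" and "analytics" fall below min_df.
kept_terms = sorted(vec.vocabulary_)
print(kept_terms)  # ['digital', 'marketing']
print(X.shape)     # (3, 2)
```

The filtered terms are also recorded in the fitted vectorizer's `stop_words_` attribute, which can help when checking why a term is missing from the vocabulary.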


In [53]:
from sklearn.decomposition import LatentDirichletAllocation

def print_top_words(model, feature_names, n_top_words):
    # Each row of components_ holds one topic's weights over the vocabulary.
    for topic_idx, topic in enumerate(model.components_):
        message = "Topic #%d: " % topic_idx
        # argsort is ascending, so the reversed slice picks the
        # n_top_words highest-weighted terms, largest first.
        message += " ".join([feature_names[i]
                             for i in topic.argsort()[:-n_top_words - 1:-1]])
        print(message)
    print()


print("Fitting LDA models with tf features, "
      "n_samples=%d and n_features=%d..."
      % (n_samples, n_features))
lda = LatentDirichletAllocation(n_components=n_components, max_iter=5,
                                learning_method='online',
                                learning_offset=50.,
                                random_state=0)

lda.fit(tf)
print(" LDA fit done")
print("\nTopics in LDA model:")
# get_feature_names() was removed in scikit-learn 1.2; on versions
# older than 1.0, call tf_vectorizer.get_feature_names() instead.
tf_feature_names = tf_vectorizer.get_feature_names_out()
print_top_words(lda, tf_feature_names, n_top_words)

Fitting LDA models with tf features, n_samples=2000 and n_features=1000...
 LDA fit done

Topics in LDA model:
Topic #0: disruptive radical project innovation businesses use innovations technology based term rate students economic statistics ideas process technological test highlight wider
Topic #1: ideas search based project open commerce contact optimisation seo conversion following voice rate engine topics usability website strategy mobile content
Topic #2: research project application look new warwick technical company send learning ac business technology data write applications current discuss statistics provide
Topic #3: sector student students conducted supervise exploring automotive requirements work looking company extent context innovative focus explore choose stream erp global
Topic #4: brand little bit marketing trust building research manage students project nottingham external experience student interested completing management years able number
Topic #5: enterprise resou
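
The slice `topic.argsort()[:-n_top_words - 1:-1]` used to build each topic line is compact but easy to misread. The sketch below works through it on a made-up weight vector; the terms and weights are illustrative assumptions.

```python
import numpy as np

# Hypothetical topic weights over a five-term vocabulary.
terms = ["data", "brand", "erp", "seo", "trust"]
weights = np.array([0.1, 0.7, 0.05, 0.4, 0.2])
n_top = 3

# argsort() orders indices by ascending weight: [2, 0, 4, 3, 1].
# Walking backwards from the end and stopping before position -(n_top + 1)
# yields the indices of the n_top largest weights, largest first.
top_idx = weights.argsort()[:-n_top - 1:-1]
top_terms = [terms[i] for i in top_idx]
print(top_terms)  # ['brand', 'seo', 'trust']

# The same result, written in two more explicit steps.
same_idx = weights.argsort()[::-1][:n_top]
```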

In [54]:
# One row per document, one column per topic.
doc_topic_dist = lda.transform(tf)
df_tatio = pd.DataFrame(doc_topic_dist)

In [55]:
df_tatio.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,20,21,22,23,24,25,26,27,28,29
0,0.000521,0.000521,0.000521,0.000521,0.000521,0.000521,0.000521,0.000521,0.000521,0.000521,...,0.000521,0.000521,0.984896,0.000521,0.000521,0.000521,0.000521,0.000521,0.000521,0.000521
1,0.00023,0.00023,0.00023,0.00023,0.00023,0.00023,0.00023,0.00023,0.00023,0.00023,...,0.00023,0.00023,0.993333,0.00023,0.00023,0.00023,0.00023,0.00023,0.00023,0.00023
2,0.002222,0.002222,0.002222,0.002222,0.002222,0.002222,0.002222,0.002222,0.002222,0.002222,...,0.002222,0.002222,0.74294,0.002222,0.002222,0.002222,0.002222,0.002222,0.002222,0.002222
3,0.00101,0.00101,0.00101,0.00101,0.00101,0.00101,0.00101,0.00101,0.00101,0.00101,...,0.00101,0.00101,0.00101,0.00101,0.00101,0.00101,0.00101,0.00101,0.00101,0.970707
4,0.00101,0.00101,0.00101,0.00101,0.00101,0.00101,0.00101,0.00101,0.00101,0.00101,...,0.00101,0.00101,0.00101,0.00101,0.00101,0.00101,0.00101,0.00101,0.00101,0.970707
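
Each row above sums to roughly 1 and is dominated by a single topic. A simple way to summarise such a table is to record the highest-weighted topic per document. The sketch below does this on a small made-up matrix; the values are assumptions, not rows from this dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical document-topic matrix: 3 documents, 4 topics, rows sum to 1.
doc_topic = np.array([
    [0.01, 0.97, 0.01, 0.01],
    [0.70, 0.10, 0.10, 0.10],
    [0.05, 0.05, 0.05, 0.85],
])

summary = pd.DataFrame({
    # Column index of the largest weight in each row.
    "dominant_topic": doc_topic.argmax(axis=1),
    # The weight of that topic, as a rough confidence indicator.
    "weight": doc_topic.max(axis=1),
})
print(summary["dominant_topic"].tolist())  # [1, 0, 3]
```

Applied to the real output, the same two calls on `doc_topic_dist` would give one dominant topic per project description.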


In [41]:
df_tatio.to_csv("eBM_ratio_data.csv", index=False)

In [56]:
# Attach the identifiers so each topic row can be traced back to its project.
df_tatio["project_id"] = df["project_id"]
df_tatio["supervisor_id"] = df["supervisor_id"]
df_tatio.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,22,23,24,25,26,27,28,29,project_id,supervisor_id
0,0.000521,0.000521,0.000521,0.000521,0.000521,0.000521,0.000521,0.000521,0.000521,0.000521,...,0.984896,0.000521,0.000521,0.000521,0.000521,0.000521,0.000521,0.000521,1,12
1,0.00023,0.00023,0.00023,0.00023,0.00023,0.00023,0.00023,0.00023,0.00023,0.00023,...,0.993333,0.00023,0.00023,0.00023,0.00023,0.00023,0.00023,0.00023,2,12
2,0.002222,0.002222,0.002222,0.002222,0.002222,0.002222,0.002222,0.002222,0.002222,0.002222,...,0.74294,0.002222,0.002222,0.002222,0.002222,0.002222,0.002222,0.002222,3,159
3,0.00101,0.00101,0.00101,0.00101,0.00101,0.00101,0.00101,0.00101,0.00101,0.00101,...,0.00101,0.00101,0.00101,0.00101,0.00101,0.00101,0.00101,0.970707,4,58
4,0.00101,0.00101,0.00101,0.00101,0.00101,0.00101,0.00101,0.00101,0.00101,0.00101,...,0.00101,0.00101,0.00101,0.00101,0.00101,0.00101,0.00101,0.970707,5,58


In [57]:
df_tatio.to_csv("eBM_ratio_data.csv", index=False)