In [None]:
%matplotlib inline


# Topic extraction with Non-negative Matrix Factorization and Latent Dirichlet Allocation


This example applies `sklearn.decomposition.NMF` and
`sklearn.decomposition.LatentDirichletAllocation` to a corpus of documents
to extract additive models of the corpus's topic structure. The output is a
list of topics, each represented as a list of terms (weights are not shown).
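
As a rough illustration of how a topic becomes a list of terms, the sketch below ranks the terms of each topic by weight and, unlike the notebook's output, keeps the weights. The small `components` matrix and `feature_names` array are made-up stand-ins for a fitted model's `components_` and the vectorizer's vocabulary, and `top_terms` is a hypothetical helper, not part of scikit-learn.

```python
import numpy as np

# Hypothetical 2-topic x 5-term weight matrix standing in for model.components_.
components = np.array([
    [0.1, 2.0, 0.0, 0.7, 1.5],
    [3.0, 0.2, 0.9, 0.0, 0.4],
])
# Hypothetical vocabulary, one entry per column of `components`.
feature_names = np.array(["cell", "city", "drug", "design", "building"])


def top_terms(components, feature_names, n_top_words):
    """Return, per topic, the n_top_words (term, weight) pairs by decreasing weight."""
    topics = []
    for topic in components:
        # argsort is ascending, so reverse it and keep the first n_top_words indices.
        top_idx = topic.argsort()[::-1][:n_top_words]
        topics.append([(str(feature_names[i]), float(topic[i])) for i in top_idx])
    return topics


for topic_idx, terms in enumerate(top_terms(components, feature_names, 3)):
    print("Topic #%d: %s" % (topic_idx, ", ".join("%s (%.1f)" % t for t in terms)))
```

Dropping the weights from each pair gives the plain term lists printed further down in this notebook.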

Non-negative Matrix Factorization is applied with two different objective
functions: the Frobenius norm, and the generalized Kullback-Leibler divergence.
The latter is equivalent to Probabilistic Latent Semantic Indexing.
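
To make the two objectives concrete, here is a minimal NumPy sketch that evaluates both on a toy factorization. It assumes the usual textbook forms, 0.5 * ||X - WH||_F^2 for the Frobenius objective and sum(X * log(X / WH) - X + WH) for the generalized Kullback-Leibler divergence; scaling conventions and regularization terms inside scikit-learn may differ. The function names and the small `eps` guard are illustrative choices, not library API.

```python
import numpy as np


def frobenius_loss(X, W, H):
    """Squared Frobenius objective: 0.5 * ||X - WH||_F^2."""
    return 0.5 * np.sum((X - W @ H) ** 2)


def generalized_kl_loss(X, W, H, eps=1e-12):
    """Generalized KL divergence: sum(X * log(X / WH) - X + WH).

    Entries where X == 0 contribute nothing to the log term (0 * log 0 := 0);
    eps only guards against division by zero in WH.
    """
    WH = W @ H
    mask = X > 0
    log_term = np.sum(X[mask] * np.log(X[mask] / (WH[mask] + eps)))
    return log_term - X.sum() + WH.sum()


# Toy non-negative factors (3 documents, 2 topics, 3 terms).
W = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
H = np.array([[1.0, 2.0, 0.0], [0.0, 1.0, 3.0]])
X = W @ H  # data that the factors reproduce exactly

print("exact fit:    ", frobenius_loss(X, W, H), generalized_kl_loss(X, W, H))
print("perturbed fit:", frobenius_loss(X, W, H + 0.5), generalized_kl_loss(X, W, H + 0.5))
```

Both quantities are (numerically) zero when WH reproduces X and grow as the reconstruction degrades; they differ in how they penalize the mismatch, which is why the two NMF fits in this notebook can yield different topics.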

The default parameters (n_samples / n_features / n_components) should let
the example run in a few tens of seconds. You can try increasing the
dimensions of the problem, but be aware that the time complexity of NMF
is polynomial in them, while the time complexity of LDA is
proportional to (n_samples * iterations).




In [6]:
# Author: Olivier Grisel <olivier.grisel@ensta.org>
#         Lars Buitinck
#         Chyi-Kwei Yau <chyikwei.yau@gmail.com>
# License: BSD 3 clause

from time import time

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation
from sklearn.datasets import fetch_20newsgroups

In [7]:
print("Loading dataset")
t0 = time()
dataset = fetch_20newsgroups(shuffle=True, random_state=1,
                             remove=('headers', 'footers', 'quotes'))

Loading dataset


In [12]:
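import pandas as pd

# Replace the 20 newsgroups data loaded above with a pre-cleaned corpus:
# a DataFrame with a `text` column, read from a local pickle file.
dataset = pd.read_pickle('data_clean.pkl')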

In [13]:
# Show up to 1500 characters per cell so long texts are not cut off too early.
pd.set_option('display.max_colwidth', 1500)
dataset

Unnamed: 0,text
0,thank you so much chris and it is truly a great honor to have the opportunity to come to this stage twice I am extremely grateful i have been blown away by this conference and i want to thank all of you for the many nice comment about what i had to say the other night and i say that sincerely partly because mock sob i need that laughter put yourselves in my position laughter i flew on air force two for eight year laughter now i have to take off my shoe or boot to get on an airplane laughter applause I will tell you one quick story to illustrate what that is been like for me laughter it is a true story every bit of this is true soon after tipper and i left the mock sob white house laughter we were driving from our home in nashville to a little farm we have mile east of nashville driving ourselves laughter i know it sound like a little thing to you but laughter i looked in the rearview mirror and all of a sudden it just hit me there wa no motorcade back there laughter you have heard of phantom limb pain laughter this wa a rented ford taurus laughter it wa dinnertime and we started looking for a place to eat we were on we got to exit lebanon tennessee we got off the exit we found a shoneys restaurant lowcost family restaurant chain for those of you who do not know it we went in and sat down at the booth and the waitress came over made a big commotion over tipper laughter she took our order...
9,if you are here today and I am very happy that you are you have all heard about how sustainable development will save u from ourselves however when we are not at ted we are often told that a real sustainability policy agenda is just not feasible especially in large urban area like new york city and that is because most people with decisionmaking power in both the public and the private sector really do not feel a though they are in danger the reason why I am here today in part is because of a dog an abandoned puppy i found back in the rain back in she turned out to be a much bigger dog than I would anticipated when she came into my life we were fighting against a huge waste facility planned for the east river waterfront despite the fact that our small part of new york city already handled more than percent of the entire city commercial waste a sewage treatment pelletizing plant a sewage sludge plant four power plant the world largest fooddistribution center a well a other industry that bring more than diesel truck trip to the area each week the area also ha one of the lowest ratio of park to people in the city so when i wa contacted by the park department about a seedgrant initiative to help develop waterfront project i thought they were really wellmeaning but a bit naive I would lived in this area all my life and you could not get to the river because of all the lovely facility that i mentioned earlier then while jogging with my dog one morning she pulled ...
18,good morning how are you laughter it is been great ha not it I have been blown away by the whole thing in fact I am leaving laughter there have been three theme running through the conference which are relevant to what i want to talk about one is the extraordinary evidence of human creativity in all of the presentation that we have had and in all of the people here just the variety of it and the range of it the second is that it is put u in a place where we have no idea what is going to happen in term of the future no idea how this may play out i have an interest in education actually what i find is everybody ha an interest in education do not you i find this very interesting if you are at a dinner party and you say you work in education actually you are not often at dinner party frankly laughter if you work in education you are not asked laughter and you are never asked back curiously that is strange to me but if you are and you say to somebody you know they say what do you do and you say you work in education you can see the blood run from their face they are like oh my god you know why me laughter my one night out all week laughter but if you ask about their education they pin you to the wall because it is one of those thing that go deep with people am i right like religion and money and other thing so i have a big interest in education and i think we all do we have a huge vested interest in it partly because it is education th...
25,about year ago i took on the task to teach global development to swedish undergraduate student that wa after having spent about year together with african institution studying hunger in africa so i wa sort of expected to know a little about the world and i started in our medical university karolinska institute an undergraduate course called global health but when you get that opportunity you get a little nervous i thought these student coming to u actually have the highest grade you can get in swedish college system so i thought maybe they know everything I am going to teach them about so i did a pretest when they came and one of the question from which i learned a lot wa this one which country ha the highest child mortality of these five pair i put them together so that in each pair of country one ha twice the child mortality of the other and this mean that it is much bigger a difference than the uncertainty of the data i will not put you at a test here but it is turkey which is highest there poland russia pakistan and south africa and these were the result of the swedish student i did it so i got the confidence interval which is pretty narrow and i got happy of course a right answer out of five possible that mean that there wa a place for a professor of international health and for my course laughter but one late night when i wa compiling the report i really realized my discovery i have shown that swedish top student know statistically significantly le ab...
36,thank you i have to tell you I am both challenged and excited my excitement is i get a chance to give something back my challenge is the shortest seminar i usually do is hour laughter I am not exaggerating i do weekend i do more obviously i also coach people but I am into immersion because how did you learn language not just by learning principle you got in it and you did it so often that it became real the bottom line of why I am here besides being a crazy mofo is that I am not here to motivate you you do not need that obviously often that is what people think i do and it is the furthest thing from it what happens though is people say to me i do not need any motivation but that is not what i do I am the why guy i want to know why you do what you do what is your motive for action what is it that drive you in your life today not year ago are you running the same pattern because i believe that the invisible force of internal drive activated is the most important thing I am here because i believe emotion is the force of life all of u here have great mind most of u here have great mind right we all know how to think with our mind we can rationalize anything we can make anything happen i agree with what wa described a few day ago that people work in their selfinterest but we know that that is bullshit at time you do not work in your selfinterest all the time because when emotion come into it the wiring change in the way it function so it is wonderful to t...
44,I am going to present three project in rapid fire i do not have much time to do it and i want to reinforce three idea with that rapidfire presentation the first is what i like to call a hyperrational process it is a process that take rationality almost to an absurd level and it transcends all the baggage that normally come with what people would call sort of a rational conclusion to something and it concludes in something that you see here that you actually would not expect a being the result of rationality the second the second is that this process doe not have a signature there is no authorship architect are obsessed with authorship this is something that ha editing and it ha team but in fact we no longer see within this process the traditional master architect creating a sketch that his minion carry out and the third is that it challenge and this is in the length of this very hard to support why connect all these thing but it challenge the high modernist notion of flexibility high modernist said we will create sort of singular space that are generic almost anything can happen within them i call it sort of shotgun flexibility turn your head this way shoot and you are bound to kill something so this is the promise of high modernism within a single space actually any kind of activity can happen but a we are seeing operational cost are starting to dwarf capital cost in term of design parameter and so with this sort of idea what happens is whatever actually i...
49,I am often asked what surprised you about the book and i say that i got to write it i would have never imagined that not in my wildest dream did i think i do not even consider myself to be an author and I am often asked why do you think so many people have read this this thing selling still about a million copy a month and i think it is because spiritual emptiness is a universal disease i think inside at some point we put our head down on the pillow and we go there is got to be more to life than this get up in the morning go to work come home and watch tv go to bed get up in the morning go to work come home watch tv go to bed go to party on weekend a lot of people say I am living no you are not living that is just existing just existing i really think that there is this inner desire i do believe what chris said i believe that you are not an accident your parent may not have planned you but i believe god did i think there are accidental parent there is no doubt about that i do not think there are accidental kid and i think you matter i think you matter to god i think you matter to history i think you matter to this universe and i think that the difference between what i call the survival level of living the success level of living and the significance level of living is do you figure out what on earth am i here for i meet a lot of people who are very smart and say but why can not i figure out my problem and i meet a lot of people who are very successful who say w...
57,it is wonderful to be back i love this wonderful gathering and you must be wondering what on earth have they put up the wrong slide no no look at this magnificent beast and ask the question who designed it this is ted this is technology entertainment design and there is a dairy cow it is a quite wonderfully designed animal and i wa thinking how do i introduce this and i thought well maybe that old doggerel by joyce kilmer you know poem are made by fool like me but only god can make a tree and you might say well god designed the cow but of course god got a lot of help this is the ancestor of cattle this is the aurochs and it wa designed by natural selection the process of natural selection over many million of year and then it became domesticated thousand of year ago and human being became it steward and without even knowing what they were doing they gradually redesigned it and redesigned it and redesigned it and then more recently they really began to do reverse engineering on this beast and figure out just what the part were how they worked and how they might be optimized how they might be made better now why am i talking about cow because i want to say that much the same thing is true of religion religion are natural phenomenon they are just a natural a cow they have evolved over millennium they have a biological base just like the aurochs they have become domesticated and human being have been redesigning their religion for thousand of year this is ted and...
65,I am going to take you on a journey very quickly to explain the wish I am going to have to take you somewhere where many people have not been and that is around the world when i wa about year old kate stohr and myself started an organization to get architect and designer involved in humanitarian work not only about responding to natural disaster but involved in systemic issue we believe that where the resource and expertise are scarce innovative sustainable design can really make a difference in people life so i started my life a an architect or training a an architect and i wa always interested in socially responsible design and how you can really make an impact but when i went to architecture school it seemed that i wa a black sheep in the family many architect seemed to think that when you design you design a jewel and it is a jewel that you try and crave for whereas i felt that when you design you either improve or you create a detriment to the community in which you are designing so you are not just doing a building for the resident or for the people who are going to use it but for the community a a whole and in we started by responding to the issue of the housing crisis for returning refugee in kosovo and i did not know what i wa doing like i said and I am the internet generation so we started a website we put a call out there and to my surprise in a couple of month we had hundred of entry from around the world that led to a number of prototype being bu...
75,i can not help but this wish to think about when you are a little kid and all your friend ask you if a genie could give you one wish in the world what would it be and i always answered well I would want the wish to have the wisdom to know exactly what to wish for well then you would be screwed because you would know what to wish for and you would use up your wish and now since we only have one wish unlike last year they had three wish I am not going to wish for that so let u get to what i would like which is world peace and i know what you are thinking you are thinking the poor girl up there she think she is at a beauty pageant she is not she is at the ted prize laughter but i really do think it make sense and i think that the first step to world peace is for people to meet each other I have met a lot of different people over the year and I have filmed some of them from a dotcom executive in new york who wanted to take over the world to a military press officer in qatar who would rather not take over the world if you have seen the film control room that wa sent out you would understand a little bit why applause thank you wow some of you watched it that is great that is great so basically what I would like to talk about today is a way for people to travel to meet people in a different way than because you can not travel all over the world at the same time and a long time ago well about year ago my mom had an exchange student and I am going to show you slide o...


In [14]:
n_samples = 1591
n_features = 1000
n_components = 10
n_top_words = 20

# Vectorize the corpus loaded above (the `text` column of `dataset`). We use
# a few heuristics to filter out uninformative terms early on: common English
# words, words occurring in only one document, and words occurring in more
# than 95% of the documents are removed.

def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        message = "Topic #%d: " % topic_idx
        message += " ".join([feature_names[i]
                             for i in topic.argsort()[:-n_top_words - 1:-1]])
        print(message)
    print()

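data_samples = dataset.text[:n_samples]
# t0 was set in the dataset-loading cell above, so this elapsed time also
# covers everything run between that cell and this one.
print("done in %0.3fs." % (time() - t0))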

# Use tf-idf features for NMF.
print("Extracting tf-idf features for NMF...")
tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2,
                                   max_features=n_features,
                                   stop_words='english')
t0 = time()
tfidf = tfidf_vectorizer.fit_transform(data_samples)
print("done in %0.3fs." % (time() - t0))

# Use tf (raw term count) features for LDA.
print("Extracting tf features for LDA...")
tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2,
                                max_features=n_features,
                                stop_words='english')
t0 = time()
tf = tf_vectorizer.fit_transform(data_samples)
print("done in %0.3fs." % (time() - t0))
print()

# Fit the NMF model with the Frobenius norm objective (the default beta_loss).
print("Fitting the NMF model (Frobenius norm) with tf-idf features, "
      "n_samples=%d and n_features=%d..."
      % (n_samples, n_features))
t0 = time()
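# Note: `alpha` is the regularization argument of older scikit-learn releases;
# newer releases replace it with `alpha_W` / `alpha_H`.
nmf = NMF(n_components=n_components, random_state=1,
          alpha=.1, l1_ratio=.5).fit(tfidf)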
print("done in %0.3fs." % (time() - t0))

print("\nTopics in NMF model (Frobenius norm):")
# Note: newer scikit-learn releases rename this method to get_feature_names_out().
tfidf_feature_names = tfidf_vectorizer.get_feature_names()
print_top_words(nmf, tfidf_feature_names, n_top_words)

# Fit the NMF model with the generalized Kullback-Leibler divergence objective.
print("Fitting the NMF model (generalized Kullback-Leibler divergence) with "
      "tf-idf features, n_samples=%d and n_features=%d..."
      % (n_samples, n_features))
t0 = time()
nmf = NMF(n_components=n_components, random_state=1,
          beta_loss='kullback-leibler', solver='mu', max_iter=1000, alpha=.1,
          l1_ratio=.5).fit(tfidf)
print("done in %0.3fs." % (time() - t0))

print("\nTopics in NMF model (generalized Kullback-Leibler divergence):")
tfidf_feature_names = tfidf_vectorizer.get_feature_names()
print_top_words(nmf, tfidf_feature_names, n_top_words)

print("Fitting LDA models with tf features, "
      "n_samples=%d and n_features=%d..."
      % (n_samples, n_features))
lda = LatentDirichletAllocation(n_components=n_components, max_iter=5,
                                learning_method='online',
                                learning_offset=50.,
                                random_state=0)
t0 = time()
lda.fit(tf)
print("done in %0.3fs." % (time() - t0))

print("\nTopics in LDA model:")
tf_feature_names = tf_vectorizer.get_feature_names()
print_top_words(lda, tf_feature_names, n_top_words)

done in 108.475s.
Extracting tf-idf features for NMF...
done in 2.402s.
Extracting tf features for LDA...
done in 2.469s.

Fitting the NMF model (Frobenius norm) with tf-idf features, n_samples=1591 and n_features=1000...
done in 1.777s.

Topics in NMF model (Frobenius norm):
Topic #0: said laughter people know life did story time child say day just love year got man thing school kid family
Topic #1: woman men girl sex female boy violence male mother village daughter young black man mom talk role country job community
Topic #2: people country world government percent year global dollar need ha problem state change company social economy china democracy money economic
Topic #3: cancer cell patient disease drug body doctor tissue health blood treatment medical medicine dna gene protein trial virus heart bone
Topic #4: planet earth universe water star specie ocean year ice animal plant life energy light solar tree climate sun mar space
Topic #5: city building design architecture space com