In [1]:
import sklearn
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import Normalizer
from sklearn import metrics
from sklearn.cluster import KMeans, MiniBatchKMeans
import pandas as pd
import numpy as np
import nltk
from nltk.corpus import stopwords

In [2]:
n_samples = 2000
n_features = 1000
n_components = 5
n_top_words = 10

In [3]:
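def print_top_words(model, feature_names, n_top_words):
    """Print the n_top_words highest-weighted terms of each component in model."""
    for topic_idx, topic in enumerate(model.components_):
        message = "Topic #%d: " % topic_idx
        # argsort() is ascending, so the reversed slice yields the largest weights first.
        message += " ".join([feature_names[i]
                             for i in topic.argsort()[:-n_top_words - 1:-1]])
        print(message)
    print()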
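
The slice `[:-n_top_words - 1:-1]` used in `print_top_words` is easy to misread, so here is a minimal sketch of what it selects. The vocabulary and weights below are made up for illustration; they are not taken from the dataset.

```python
import numpy as np

# Hypothetical weights of one topic over a six-word vocabulary.
feature_names = ["site", "tag", "wiki", "edit", "meta", "queue"]
topic = np.array([0.10, 0.70, 0.05, 0.40, 0.90, 0.20])
n_top_words = 3

# argsort() returns indices in ascending order of weight; the slice
# [:-n_top_words - 1:-1] walks backwards from the end and stops after
# n_top_words entries, so the largest weights come first.
top_idx = topic.argsort()[:-n_top_words - 1:-1]
top_words = [feature_names[i] for i in top_idx]
print(top_words)
```

Note that the ranking uses the signed weight, so a term with a large negative loading in an SVD component would not appear in the printed list.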

In [4]:
data = pd.read_csv("./data/dataset.csv", encoding="utf_8")

In [5]:
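# Requires the NLTK stop-word corpus; run nltk.download('stopwords') once if it is missing.
stop_words = set(stopwords.words('english'))
# Domain-specific words that appear in most posts and carry little topical signal.
stop_words.update(["coffee", "question", "questions", "answer"])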

In [6]:
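# Recent scikit-learn releases validate stop_words as a list (or 'english'), so convert the set.
tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2,
                                max_features=n_features,
                                stop_words=list(stop_words))
tf = tf_vectorizer.fit_transform(data["Body"])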
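
To make the effect of `min_df=2` and `max_df=0.95` concrete, here is a rough standard-library sketch of document-frequency filtering followed by counting. The four documents are invented for illustration, and the sketch leaves out stop-word removal and `max_features`; it is not how `CountVectorizer` is implemented, only an approximation of the filtering rule.

```python
import re
from collections import Counter

# Hypothetical mini-corpus; "coffee" occurs in every document.
docs = [
    "Coffee pour over and drip brewing",
    "Drip brewing coffee at home",
    "Tag wiki for coffee brewing",
    "Edit the coffee tag wiki",
]
min_df = 2      # keep terms found in at least 2 documents
max_df = 0.95   # drop terms found in more than 95% of documents

# Lowercase and keep tokens of two or more word characters,
# which mirrors CountVectorizer's default token pattern.
token_re = re.compile(r"(?u)\b\w\w+\b")
tokenized = [token_re.findall(d.lower()) for d in docs]

# Document frequency: the number of documents containing each term.
df = Counter(term for tokens in tokenized for term in set(tokens))
vocab = sorted(t for t, n in df.items()
               if n >= min_df and n <= max_df * len(docs))

# Rows are documents, columns are vocabulary terms, entries are counts.
matrix = [[tokens.count(t) for t in vocab] for tokens in tokenized]
print(vocab)
print(matrix)
```

In this toy corpus "coffee" is removed by `max_df` because it appears in all four documents, and every term seen in only one document is removed by `min_df`.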

In [7]:
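# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it (>= 1.0).
pd.DataFrame(tf.toarray(), index=data["Body"], columns=tf_vectorizer.get_feature_names_out())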

Body,10,1000,12,2015,24,51,75,80,ability,able,...,write,writing,wrong,wrote,year,years,yes,yesterday,yet,zero
While answering a few of EdChum's questions I discovered that what I/we in the USA call pour over coffee is referred to as drip coffee in the UK. I added the pour-over tag to both questions I encountered but figured we should decide as a community which tag to use to describe this brewing process and then properly document it because drip-coffee means something different in the US (which is apparently referred to as filter-cofee in the UK). For clarification the method in question is shown in the image below. \n \n,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
"Being newly created we have zero feeds appearing in our main chat right now. What blogs, news sites, or other important coffee related things should appear in our main chat room's feed? Post your suggestions/submissions. \n",0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
"It looks like filter coffee has another, different meaning too. When I read ""drip coffee,"" I think of the kind you get from a traditional coffeemaker. Go for ""pour-over.""\n",0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
"The chatroom name is so bland. ""Coffee."" Look at all the creative names others have thought up:\n\n""Root Access"" for Super User\n""The DMZ"" for Security\n""The Renderfarm"" for Blender\n""The Litter Box"" for Pets\n""The Hangar"" for Aviation\n""You Are Here"" for Travel\n""The Water Cooler"" for The Workplace\n""The Whiteboard"" for Programmers\n""The Nineteenth Byte"" for Code Golf\netc...\n\nCan we think of a better name for our chatroom?\nOnly one idea per answer, please. Vote up the ideas that you like!\nStolen from Lifehacks meta, which was in turn stolen from PPCG meta. But that's okay, because I wrote both of those posts too. :P\n",0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
"On most SE sites, product recommendations are off-topic, as they tend to become obsolete quickly (see this blog post). Do we want questions asking for the recommendation of goods/products here?\nDo we want to completely rule recomendations off-topic, or should we allow certain kinds?\n",0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
"I see no reason why a different rule should apply. \nAnd when you are fixing a typo, removing greetings and signatures is common sense. It’s a minor edit, some may think it superfluous, I appreciate the effort.\nIf I am to venture a guess - the user in question has been a member for a comparatively short time and no linked profiles to other sites. There’s a good chance that he simply didn’t know about the convention and combined with the fact that it takes some users a while to become comfortable with the fact that their posts may be edited by the community, the decline may well have been a knee-jerk protective reaction. \n",0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
"This tag is used by staff when sharing concepts in the Discovery phase relating to product or configuration changes. In most cases a direction and/or goal has been established, and there has likely been some amount of time invested into discovery work and research. The post is being presented to the Community for feedback to be taken into consideration. Where possible, the post includes specific questions to help guide Community feedback.\n",0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Indicates that the post shares product or configuration change concepts during the Discovery phase. Open to receiving feedback from the Community preceding implementation.,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Seasoned Advice has excluded recipe requests for various reasons. One of the most important one being that asking for recipes is fundamentally opinion-based and therefore not a good fit for the site and the SE system in general.\nShould we adapt this reasoning and rule for Coffee SE as well?\n,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [8]:
tf_vectorizer.get_feature_names_out().tolist()

['10',
 '1000',
 '12',
 '2015',
 '24',
 '51',
 '75',
 '80',
 'ability',
 'able',
 'accept',
 'acceptable',
 'accepted',
 'accepting',
 'accomplish',
 'according',
 'account',
 'across',
 'action',
 'actions',
 'active',
 'activity',
 'actual',
 'actually',
 'ad',
 'add',
 'added',
 'adding',
 'ads',
 'advertising',
 'advice',
 'affect',
 'ago',
 'allow',
 'almost',
 'along',
 'already',
 'also',
 'alternative',
 'always',
 'amount',
 'another',
 'answered',
 'answering',
 'answers',
 'anyone',
 'anything',
 'anyway',
 'app',
 'apparently',
 'apply',
 'appreciate',
 'appreciated',
 'approach',
 'appropriate',
 'approve',
 'approved',
 'arabica',
 'area',
 'area51',
 'around',
 'aside',
 'ask',
 'asked',
 'asker',
 'asking',
 'asks',
 'aspect',
 'aspects',
 'associated',
 'assume',
 'attempt',
 'attention',
 'attract',
 'attractive',
 'author',
 'authoritative',
 'authors',
 'auto',
 'automated',
 'automatically',
 'available',
 'avatar',
 'avoid',
 'award',
 'aware',
 'away',
 'awesome'

In [9]:
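# LSA via truncated SVD. Note that `tf` holds raw term counts, not TF-IDF weights,
# despite the name of the result variable.
lsa = TruncatedSVD(n_components=n_components, n_iter=100, random_state=42)
tfidf_lsa = lsa.fit_transform(tf)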

In [10]:
tf_feature_names = tf_vectorizer.get_feature_names_out()
print_top_words(lsa, tf_feature_names, n_top_words)

Topic #0: site answers like think one would topic sites meta people
Topic #1: tag wiki flavor edit content one process brewing like history
Topic #2: queue tag tasks flags users moderation wiki posts edit user
Topic #3: hats bash winter public earn site viewed se particular well
Topic #4: meta sites edit people main suggested want participation work site
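
The topics above come from a truncated singular value decomposition of the document-term count matrix. As a small NumPy-only sketch of that factorization (the 4x4 matrix and `k = 2` are made-up values, and a full SVD stands in here for the truncated solver), the rows of `Vt` play the role of `components_` and `U * s` plays the role of the transformed documents, up to sign:

```python
import numpy as np

# Hypothetical 4-document x 4-term count matrix
# (columns: brewing, drip, tag, wiki).
X = np.array([[1, 1, 0, 0],
              [1, 1, 0, 0],
              [1, 0, 1, 1],
              [0, 0, 1, 1]], dtype=float)
k = 2  # number of latent topics to keep

# Thin SVD of the uncentered matrix: X = U @ diag(s) @ Vt.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

components = Vt[:k]            # term loadings per topic, shape (k, n_terms)
doc_topics = U[:, :k] * s[:k]  # document coordinates, shape (n_docs, k)

# Share of the total squared singular values kept by the first k topics.
retained = (s[:k] ** 2).sum() / (s ** 2).sum()
print(components.round(2))
print(doc_topics.round(2))
print(round(float(retained), 3))
```

Projecting the documents onto the kept components (`X @ components.T`) gives the same coordinates as `doc_topics`. The `retained` figure is a share of the squared singular values, which is not necessarily the same quantity as the fitted model's `explained_variance_ratio_`.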

