In [8]:
import json
import sklearn

In [9]:
# jobs.jl is a JSON Lines file: one JSON object per line
with open('jobs.jl') as fp:
    jobs = [json.loads(line) for line in fp]

In [10]:
len(jobs)

239

In [11]:
jobs[0]

{'url': 'https://www.facebook.com/careers/jobs/a0I1H00000LCeXGUA1/',
 'title': 'Product Management Lead, Account Integrity (Community Integrity)',
 'location': 'London, United Kingdom',
 'description': " Facebook's mission is to give people the power to build community and bring the world closer together. Through our family of apps and services, we're building a different kind of company that connects billions of people around the world, gives them ways to share what matters most to them, and helps bring people closer together. Whether we're creating new products or helping a small business expand its reach, people at Facebook are builders at heart. Our global teams are constantly iterating, solving problems, and working together to empower people around the world to build community and connect in meaningful ways. Together, we can help people build stronger communities — we're just getting started.  Protecting the safety and integrity of the Facebook community is the company's top prio

The topic modeling demo below is adapted from https://medium.com/mlreview/topic-modeling-with-scikit-learn-e80d33668730 and http://scikit-learn.org/stable/auto_examples/applications/plot_topics_extraction_with_nmf_lda.html#sphx-glr-auto-examples-applications-plot-topics-extraction-with-nmf-lda-py

In [16]:
from sklearn.feature_extraction.text import TfidfVectorizer
no_features = 10000
tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2, max_features=no_features, stop_words='english')
docs = ["{} {}".format(j['title'], j['description']) for j in jobs]
tfidf = tfidf_vectorizer.fit_transform(docs)
tfidf_feature_names = tfidf_vectorizer.get_feature_names()
print(tfidf_feature_names)

['10', '12', '15', '180', '18th', '20', '2017', '2018', '21st', '25', '2b', '30', '3d', '50', 'abilities', 'ability', 'able', 'abnormal', 'abuse', 'academic', 'academics', 'accelerate', 'accelerators', 'access', 'accessible', 'accomplish', 'according', 'accordingly', 'account', 'accountability', 'accountable', 'accounting', 'accounts', 'accuracy', 'accurate', 'achieve', 'acquisition', 'acquisitions', 'act', 'action', 'actionable', 'actions', 'activating', 'active', 'actively', 'activities', 'activity', 'actor', 'actors', 'actuators', 'acuity', 'ad', 'adapt', 'adaptable', 'adcs', 'add', 'adding', 'addition', 'additional', 'additionally', 'address', 'adf', 'adhere', 'adjust', 'adjustments', 'administration', 'admins', 'adoption', 'ads', 'advance', 'advanced', 'advances', 'advantage', 'adversarial', 'advertisement', 'advertiser', 'advertisers', 'advertising', 'advise', 'advocacy', 'advocate', 'advocates', 'advocating', 'affairs', 'affect', 'affecting', 'affects', 'afraid', 'africa', 'agen



It's interesting how NMF, PLSI (fit here as NMF with a Kullback-Leibler loss), and LDA come up with different topics for the same documents.



In [23]:
from sklearn.decomposition import NMF
no_topics = 10

# Run NMF
nmf = NMF(n_components=no_topics, random_state=1, alpha=.1, l1_ratio=.5, init='nndsvd').fit(tfidf)

In [24]:
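def display_topics(model, feature_names, no_top_words):
    """Print the no_top_words highest-weighted words of each topic."""
    for topic_idx, topic in enumerate(model.components_):
        print("Topic %d:" % (topic_idx))
        # argsort is ascending, so slice from the end to get the top words
        print(" ".join([feature_names[i]
                        for i in topic.argsort()[:-no_top_words - 1:-1]]))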

no_top_words = 10
display_topics(nmf, tfidf_feature_names, no_top_words)

Topic 0:
workflows internal data business external workflow metrics measurement threats communicate
Topic 1:
recruiting hiring recruiter compensation hr maintain partnering cycle stakeholders team
Topic 2:
enterprise financial productivity establish bachelors engineering business operational aligned month
Topic 3:
enterprise applications web end java software scalable apis soap imagining
Topic 4:
data analysis technical informed business statistical quality insights quantitative engineering
Topic 5:
operations team risk community media solutions business policy escalations data
Topic 6:
program implementation technical coordination day management projects tpm requirements manager
Topic 7:
whatsapp abuse detection simple reliable talk free anti way erlang
Topic 8:
electrical hardware design layout signal mechanical engineer test board network
Topic 9:
product marketing partnerships startups policy developers companies strategic market developer
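
The slice `topic.argsort()[:-no_top_words - 1:-1]` in `display_topics` is terse, so here is a minimal sketch of what it does on a single made-up topic. The vocabulary and weights below are hypothetical and chosen only to make the ordering easy to follow.

```python
import numpy as np

# Hypothetical weights for one topic over a five-word vocabulary.
feature_names = ["data", "team", "policy", "abuse", "design"]
topic = np.array([0.2, 0.9, 0.0, 0.5, 0.1])
no_top_words = 3

# argsort() returns indices in ascending order of weight.
ascending = topic.argsort()

# [:-no_top_words - 1:-1] walks backwards from the last index and stops
# after no_top_words items, so the heaviest words come first.
top_idx = ascending[:-no_top_words - 1:-1]
top_words = [feature_names[i] for i in top_idx]
print(top_words)
```

The same indices can be obtained with `ascending[::-1][:no_top_words]`, which some readers may find easier to parse.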


In [26]:
# NMF with a Kullback-Leibler loss, which is equivalent to PLSI
plsi = NMF(n_components=no_topics, random_state=1,
           beta_loss='kullback-leibler', solver='mu', max_iter=1000, alpha=.1,
           l1_ratio=.5).fit(tfidf)
display_topics(plsi, tfidf_feature_names, no_top_words)

Topic 0:
ways stronger solving services team work reach mission empower family
Topic 1:
recruiting updates sourcing relationships spike stakeholders recruiter regularly techniques unique
Topic 2:
systems getting team work gives supporting years strategic position ways
Topic 3:
complex measurement metrics expand based mission tools ensure design pain
Topic 4:
php sql wrong sets systems using quantitative python solve content
Topic 5:
provide operations support problem services strong scalable risk partners risks
Topic 6:
technical systems hands program software technology scope engineering manager problem
Topic 7:
php simple social ideal related knowledge machine laundry science engineers
Topic 8:
high phd level technology years ideas using design including researchers
Topic 9:
travel 25 proactively willingness serving partnerships track startup social relationship
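
The two NMF-based models differ mainly in the loss they minimize: the Frobenius norm for plain NMF, and the Kullback-Leibler divergence for the PLSI-style fit. As a rough illustration, the sketch below fits both losses on a tiny, hypothetical document-term matrix. Unlike the cells above, it uses the `mu` solver for both models and no regularization, so that the loss is the only thing that changes.

```python
import numpy as np
from sklearn.decomposition import NMF

# Hypothetical document-term counts: 4 documents, 5 terms.
X = np.array([
    [3, 2, 0, 0, 1],
    [2, 3, 0, 1, 0],
    [0, 0, 4, 2, 1],
    [0, 1, 3, 3, 0],
], dtype=float)

# Settings shared by both models so that only beta_loss differs.
shared = dict(n_components=2, init='nndsvda', solver='mu',
              max_iter=500, random_state=0)
frobenius_nmf = NMF(beta_loss='frobenius', **shared)
kl_nmf = NMF(beta_loss='kullback-leibler', **shared)

# W holds document-topic weights; components_ holds topic-term weights.
W_frob = frobenius_nmf.fit_transform(X)
W_kl = kl_nmf.fit_transform(X)

for name, model in [("Frobenius", frobenius_nmf), ("KL / PLSI-style", kl_nmf)]:
    print(name)
    print(np.round(model.components_, 2))
```

Both fits return non-negative factors of the same shape, but the topic-term weights generally differ, which is one reason the two topic lists above do not match.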


In [32]:
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2, max_features=no_features,
                                stop_words='english')
tf = tf_vectorizer.fit_transform(docs)
lda = LatentDirichletAllocation(n_components=no_topics, max_iter=5,
                                learning_method='online',
                                learning_offset=50.,
                                random_state=0)
lda.fit(tf)
tf_feature_names = tf_vectorizer.get_feature_names()
display_topics(lda, tf_feature_names, 10)

Topic 0:
data team business engineering technical product analysis systems functional cross
Topic 1:
data team business closer bring community product operations technical ways
Topic 2:
business team data community product ways bring closer work management
Topic 3:
whatsapp abuse research team detection simple product data spam anti
Topic 4:
product team business data help analysis bring ways solving closer
Topic 5:
business product team engineering management community closer ways operations solving
Topic 6:
network data suppliers odm generation design signal product business community
Topic 7:
team electrical business systems engineering community work internal functional design
Topic 8:
data team product management community business closer engineering bring program
Topic 9:
team business product data mission solving work systems management ensure


Note that LDA works from raw term counts rather than tf-idf weights.  As a result, the LDA topics are crowded out by the boilerplate language: almost every topic is about bringing communities closer.  The NMF models get around this because they are fit on tf-idf weights, which discount words that occur in nearly every document.
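
To make the count versus tf-idf difference concrete, here is a small sketch on a hypothetical three-document corpus (not the job data). The word "community" stands in for boilerplate that appears in every document, and "recruiting" appears in only one.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical mini-corpus: "community" is boilerplate in every document,
# while "recruiting", "hardware" and "policy" each appear in only one.
toy_docs = [
    "community community recruiting",
    "community community hardware",
    "community community policy",
]

count_vec = CountVectorizer()
counts = count_vec.fit_transform(toy_docs).toarray()

tfidf_vec = TfidfVectorizer()
weights = tfidf_vec.fit_transform(toy_docs).toarray()

# vocabulary_ maps each term to its column index.
c_boiler = counts[0, count_vec.vocabulary_["community"]]
c_rare = counts[0, count_vec.vocabulary_["recruiting"]]
w_boiler = weights[0, tfidf_vec.vocabulary_["community"]]
w_rare = weights[0, tfidf_vec.vocabulary_["recruiting"]]

print("count ratio  (boilerplate / rare):", c_boiler / c_rare)
print("tf-idf ratio (boilerplate / rare):", w_boiler / w_rare)
```

In the first document the boilerplate word has twice the raw count of the rare word, while under tf-idf that ratio shrinks. With scikit-learn's default smoothing the boilerplate word is down-weighted rather than removed outright.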

As a comparison point, here's the same analysis with k-means.  I'm not a big fan of k-means for this problem because I don't have a good prior for how many clusters there are, but it may be more familiar to people.  As a bonus, each job is actually assigned to a cluster.
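
One common heuristic when there is no prior for the number of clusters is to sweep k and compare silhouette scores. This is not part of the analysis in this notebook; the sketch below only illustrates the idea on a hypothetical toy corpus built from three disjoint vocabularies, so the score should peak at k = 3 there.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score

# Hypothetical toy corpus: three groups of documents with disjoint vocabularies.
toy_docs = [
    "recruiting hiring recruiter sourcing",
    "recruiter hiring compensation recruiting",
    "hiring recruiting university recruiter",
    "hardware electrical board signal",
    "electrical hardware layout board",
    "signal board hardware mechanical",
    "policy escalations enforcement standards",
    "policy standards enforcement public",
    "escalations policy public standards",
]
X = TfidfVectorizer().fit_transform(toy_docs).toarray()

# Fit k-means for several values of k and record the mean silhouette score.
scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
for k in sorted(scores):
    print("k=%d silhouette=%.3f" % (k, scores[k]))
print("best k:", best_k)
```

On real job descriptions the curve is usually much flatter than on this toy data, so the score is a hint rather than an answer.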

In [50]:
from sklearn.cluster import MiniBatchKMeans
from sklearn import metrics

true_k = 5
km = MiniBatchKMeans(n_clusters=true_k, init='k-means++', n_init=1,
                     init_size=1000, batch_size=1000).fit(tfidf)
print(km.labels_)

order_centroids = km.cluster_centers_.argsort()[:, ::-1]
for i in range(true_k):
    print("Cluster %d:" % i, end='')
    for ind in order_centroids[i, :10]:
        print(' %s' % tfidf_feature_names[ind], end='')
    print()

[0 0 2 2 0 2 2 2 0 2 2 2 2 0 0 0 2 2 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 0 2
 2 0 2 2 2 2 0 3 2 3 0 2 3 2 0 2 0 0 0 2 2 2 3 2 0 2 2 0 2 2 2 3 0 0 2 2 2
 2 2 2 0 2 2 2 2 2 2 0 4 4 2 2 0 4 2 2 2 0 0 2 0 2 4 2 4 4 2 2 3 0 4 4 3 0
 0 0 0 0 3 3 0 0 3 3 0 0 2 2 2 3 2 2 2 2 1 2 0 1 2 2 2 2 2 0 2 1 0 2 1 2 1
 0 0 1 2 0 2 1 0 1 2 2 2 2 1 2 1 0 0 2 2 2 1 2 0 2 0 4 1 3 0 2 1 1 0 2 1 2
 0 2 0 2 1 2 2 0 1 2 1 1 1 1 1 1 2 0 1 2 0 2 2 0 1 2 0 1 2 0 0 0 3 3 3 3 3
 3 0 0 0 0 0 3 0 0 3 3 2 0 0 3 4 4]
Cluster 0: product design research marketing security whatsapp abuse electrical community team
Cluster 1: recruiting hiring recruiter maintain hr compensation partnering university cycle team
Cluster 2: data business team operations systems quality community product analysis engineering
Cluster 3: program product programs startups technical partnerships developers work policy management
Cluster 4: policy public escalations standards community africa policies benefits enforcement whatsapp
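
Since each job gets a cluster label, the labels can be used to list the job titles that fall in each cluster. The sketch below uses hypothetical titles and labels. In this notebook the inputs would presumably be `[j['title'] for j in jobs]` and `km.labels_`.

```python
from collections import defaultdict

# Hypothetical titles and labels standing in for the job data and km.labels_.
titles = [
    "Product Manager, Integrity",
    "Technical Recruiter",
    "Product Designer",
    "Policy Escalations Analyst",
    "Recruiting Coordinator",
]
labels = [0, 1, 0, 2, 1]

# Collect titles under their cluster label, preserving input order.
by_cluster = defaultdict(list)
for title, label in zip(titles, labels):
    by_cluster[int(label)].append(title)

for label in sorted(by_cluster):
    print("Cluster %d:" % label)
    for title in by_cluster[label]:
        print("  " + title)
```

Reading a handful of titles per cluster is often a quicker sanity check than the centroid word lists alone.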


In the end, I think this reflects that the positions break down along operating unit lines: business development, recruiting, analytics, risk and security.  This isn't very indicative of what integrity means, but it does show that, as a term, it spans the company's operations.  I can't tell from here whether it means the same thing to everyone.