In [1]:
import numpy as np

import nltk
import pandas as pd

from sklearn.decomposition import NMF
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer

import pickle

## Import n-grammed pickle, vectorize

In [2]:
df = pd.read_pickle('../Data/01_clean_sf_custom_ngram')

In [3]:
df.shape

(3364, 4)

In [4]:
tf_idf = TfidfVectorizer(max_df=0.95)
tf_idf_array = tf_idf.fit_transform(df.listed_items).toarray()
tf_idf_df = pd.DataFrame(tf_idf_array,columns=tf_idf.get_feature_names())
tf_idf_df.shape

(3364, 15388)

## Pickling models/vectorizers for use in the Flask App

In [5]:
# Persist the fitted vectorizer; the context manager closes the file even if dump fails
with open('../Tools_and_models/tf_idf_model', "wb") as pickle_out:
    pickle.dump(tf_idf, pickle_out)

In [6]:
# `tf_idf_vectorizer` is never defined in this notebook; the fitted vectorizer is `tf_idf`
with open('../Tools_and_models/tf_idf_vectorizer', "wb") as pickle_out:
    pickle.dump(tf_idf, pickle_out)
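
A minimal sketch of the round trip the Flask app would rely on: pickle a fitted vectorizer, load it back, and transform unseen text against the original vocabulary. The toy corpus, the temporary path, and the variable names are assumptions for illustration, not the project's actual data or file layout.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus standing in for df.listed_items (hypothetical data).
corpus = [
    "python sql statistics model",
    "design research ux prototype",
    "python machine learning model",
]

vectorizer = TfidfVectorizer()
vectorizer.fit(corpus)

# Use a temporary directory so the sketch never touches the real model folder.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "tf_idf_model")
    with open(path, "wb") as handle:
        pickle.dump(vectorizer, handle)
    with open(path, "rb") as handle:
        restored = pickle.load(handle)

# The restored vectorizer maps unseen text onto the vocabulary learned at fit time;
# words it never saw (here "with") are simply ignored.
new_vector = restored.transform(["machine learning with python"])
print(new_vector.shape)
```

Note that a pickled scikit-learn object should be loaded with the same library version that wrote it, so the app's environment needs to match the notebook's.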

In [5]:
df.columns = ['company_name', 'job_title', 'listed_items', 'posting_url']

In [6]:
df = df.merge(tf_idf_df,left_index=True,right_index=True)
df.to_pickle('../Data/01_tf_idf_and_features')
df.head(2)

Unnamed: 0,company_name,job_title,listed_items,posting_url,aa,aaa,aaai,aac,aad,aami,...,zoura,zpn,zuckerberg,zurb,zvs,zweigwhite,zymergen,zymo,zynga,ºc
0,Gap Inc. Corporate,"Software Engineer, Price Execution",write build product according business conduct...,https://www.indeed.com/rc/clk?jk=77d524a7cf198...,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,WrkShp,Business Analyst,closely product assist investigation deep dive...,https://www.indeed.com/company/WrkShp/jobs/Bus...,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


# Topic Modeling based on elbows found in KMeans clustering Inertia

Note that only the TF-IDF clustering was used, as word2vec did not produce salient elbows. During my modeling I used the pretrained version of word2vec, which was trained on Google News. The lack of elbows in the inertia plot likely results from the fact that 'cloud', for example, was not used in relation to cloud computing in the training set. As a result, word2vec would not know to associate terms such as 'cloud' and 'azure'.

I ultimately decided to use 9 classes during topic modeling. While there was no elbow at 9 in the inertia plot, it should not be surprising that the locations of the elbows did not correspond directly to the number of topics. After all, part of the difficulty in navigating data science job descriptions is that the different roles within the field of data science may require different skill sets (or, in terms of how that translates to my modeling, different _'topics'_).
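
The elbow heuristic referred to above can be sketched as follows. The clustering code itself is not part of this notebook, so everything here is an assumption for illustration: synthetic 2-D blobs stand in for the TF-IDF matrix, and the range of `k` is arbitrary.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Three well-separated synthetic blobs (hypothetical stand-in for document vectors).
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
points = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

# Inertia is the within-cluster sum of squared distances to the nearest centroid.
inertias = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=42)
    model.fit(points)
    inertias[k] = model.inertia_

for k, value in inertias.items():
    print(k, round(value, 1))
```

With data like this, inertia drops sharply up to the true number of blobs and flattens afterwards; that bend is the "elbow". Real text data rarely produces such a clean bend, which is consistent with the elbows not lining up with the number of topics chosen here.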

## 9 Classes

In [7]:
nmf_model = NMF(n_components=9, random_state=42)
nmf = nmf_model.fit_transform(tf_idf_df)

In [8]:
W = nmf
H = nmf_model.components_

The W factor contains the document membership weights relative to each of the k topics. Each row corresponds to a single document, and each column corresponds to a topic.

In [9]:
W.shape

(3364, 9)

The H factor contains the term weights relative to each of the k topics. In this case, each row corresponds to a topic, and each column corresponds to a unique term in the corpus vocabulary.

In [10]:
H.shape

(9, 15388)
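
To make the roles of the two factors concrete, here is a small sketch on a random non-negative matrix (a hypothetical stand-in for the TF-IDF matrix). It checks the shapes of W and H and measures how well their product approximates the input. The `init` and `max_iter` settings are choices made for this toy example, not the settings used above.

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)

# Hypothetical non-negative "document-term" matrix: 6 documents, 8 terms.
X = rng.random((6, 8))

model = NMF(n_components=3, init="nndsvda", random_state=42, max_iter=1000)
W = model.fit_transform(X)  # shape (documents, topics)
H = model.components_       # shape (topics, terms)

# NMF seeks non-negative W and H such that W @ H is close to X.
approximation = W @ H
relative_error = np.linalg.norm(X - approximation) / np.linalg.norm(X)

print(W.shape, H.shape, round(relative_error, 3))
```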

In [11]:
def get_descriptor(terms, H, topic_index, top):
    # argsort sorts ascending; reversing puts the highest-weight term indices first
    top_indices = np.argsort(H[topic_index,:])[::-1]
    top_terms = []
    for term_index in top_indices[0:top]:
        top_terms.append(terms[term_index])
    return top_terms

In [12]:
def list_top_words(model, feature_names, n_top_words):
    top_words = []
    for topic in model.components_:
        top_words.append([feature_names[i] for i in topic.argsort()[:-n_top_words - 1:-1]])
    return top_words

In [13]:
def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print("Topic #%d:" % topic_idx)
        print(" ".join([feature_names[i]
                        for i in topic.argsort()[:-n_top_words - 1:-1]]))
        
    print()
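
The helpers above use two different idioms to pick the highest-weight terms: `np.argsort(...)[::-1]` followed by a slice, and `argsort()[:-n - 1:-1]`. The sketch below, on a made-up 2 x 5 term-weight matrix, shows that both select the same terms. The function names `top_terms` and `top_terms_sliced` are hypothetical and exist only for this comparison.

```python
import numpy as np

# Made-up vocabulary and topic-term weights (2 topics, 5 terms).
terms = ["sql", "python", "design", "cell", "cloud"]
H = np.array([
    [0.9, 0.7, 0.0, 0.1, 0.3],
    [0.0, 0.1, 0.8, 0.0, 0.2],
])


def top_terms(terms, H, topic_index, top):
    # Sort ascending, reverse to descending, then keep the first `top` indices.
    order = np.argsort(H[topic_index, :])[::-1]
    return [terms[i] for i in order[:top]]


def top_terms_sliced(terms, H, topic_index, top):
    # Walk backwards from the end of the ascending order, taking `top` indices.
    order = H[topic_index, :].argsort()[:-top - 1:-1]
    return [terms[i] for i in order]


print(top_terms(terms, H, 0, 3))
print(top_terms_sliced(terms, H, 0, 3))
```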

In [14]:
print_top_words(nmf_model,tf_idf.get_feature_names(),10)

Topic #0:
business analysis analytics insight sql model statistical tool quantitative statistic
Topic #1:
learning machine model algorithm ml deep production tensorflow technique deep_learning
Topic #2:
project process management support required client business issue technical quality
Topic #3:
test database application java aws cloud pipeline web development tool
Topic #4:
design research designer visual web interaction ux mobile end prototyping
Topic #5:
product customer team technical cross partner development drive strategy lead
Topic #6:
cell biology assay laboratory molecular scientific method chemistry protein development
Topic #7:
marketing sale customer content campaign channel strategy enablement digital develop
Topic #8:
security network infrastructure vulnerability incident threat cloud linux application technical



## 11 Classes

In [15]:
nmf_model = NMF(n_components=11, random_state=42)
nmf = nmf_model.fit_transform(tf_idf_df)

In [16]:
W = nmf
H = nmf_model.components_

The W factor contains the document membership weights relative to each of the k topics. Each row corresponds to a single document, and each column corresponds to a topic.

In [17]:
W.shape

(3364, 11)

The H factor contains the term weights relative to each of the k topics. In this case, each row corresponds to a topic, and each column corresponds to a unique term in the corpus vocabulary.

In [18]:
H.shape

(11, 15388)

In [19]:
top_indices = np.argsort(H[1,:])[::-1]

In [20]:
def get_descriptor(terms, H, topic_index, top):
    # argsort sorts ascending; reversing puts the highest-weight term indices first
    top_indices = np.argsort(H[topic_index,:])[::-1]
    top_terms = []
    for term_index in top_indices[0:top]:
        top_terms.append(terms[term_index])
    return top_terms

In [21]:
def list_top_words(model, feature_names, n_top_words):
    top_words = []
    for topic in model.components_:
        top_words.append([feature_names[i] for i in topic.argsort()[:-n_top_words - 1:-1]])
    return top_words

In [22]:
def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print("Topic #%d:" % topic_idx)
        print(" ".join([feature_names[i]
                        for i in topic.argsort()[:-n_top_words - 1:-1]]))
        
    print()

In [23]:
print_top_words(nmf_model,tf_idf.get_feature_names(),10)

Topic #0:
business analysis analytics insight sql statistical tool quantitative model statistic
Topic #1:
learning machine model algorithm ml deep tensorflow technique production deep_learning
Topic #2:
project management process business client required support quality manage issue
Topic #3:
pipeline spark aws database processing distributed cloud hadoop infrastructure java
Topic #4:
design research visual designer interaction ux prototyping prototype engineer mechanical
Topic #5:
product team cross development strategy drive functional lead partner define
Topic #6:
cell biology assay laboratory molecular scientific method chemistry protein development
Topic #7:
marketing sale content campaign channel strategy digital medium enablement develop
Topic #8:
security network infrastructure vulnerability incident cloud threat linux application response
Topic #9:
test web testing application development javascript automation framework end react
Topic #10:
customer technical support sale issu

## 15 Classes

In [24]:
tf_idf = TfidfVectorizer(max_df=0.95)
tf_idf_array = tf_idf.fit_transform(df.listed_items).toarray()
tf_idf_df = pd.DataFrame(tf_idf_array,columns=tf_idf.get_feature_names())
tf_idf_df.shape

(3364, 15388)

In [25]:
nmf_model = NMF(n_components=15, random_state=42)
nmf = nmf_model.fit_transform(tf_idf_df)

In [26]:
W = nmf
H = nmf_model.components_

The W factor contains the document membership weights relative to each of the k topics. Each row corresponds to a single document, and each column corresponds to a topic.

In [27]:
W.shape

(3364, 15)

The H factor contains the term weights relative to each of the k topics. In this case, each row corresponds to a topic, and each column corresponds to a unique term in the corpus vocabulary.

In [28]:
H.shape

(15, 15388)

In [29]:
top_indices = np.argsort(H[1,:])[::-1]

In [30]:
def get_descriptor(terms, H, topic_index, top):
    # argsort sorts ascending; reversing puts the highest-weight term indices first
    top_indices = np.argsort(H[topic_index,:])[::-1]
    top_terms = []
    for term_index in top_indices[0:top]:
        top_terms.append(terms[term_index])
    return top_terms

In [31]:
def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print("Topic #%d:" % topic_idx)
        print(" ".join([feature_names[i]
                        for i in topic.argsort()[:-n_top_words - 1:-1]]))
    print()

In [32]:
print_top_words(nmf_model,tf_idf.get_feature_names(),10)

Topic #0:
analysis analytics business insight sql statistical tool statistic quantitative model
Topic #1:
learning machine model algorithm ml deep tensorflow technique production deep_learning
Topic #2:
project required equipment material procedure control process knowledge construction report
Topic #3:
web application javascript end react development framework cs html apis
Topic #4:
product team cross product_management development strategy manager define drive launch
Topic #5:
sale salesforce account finance financial enablement marketing develop planning operation
Topic #6:
cell biology assay molecular scientific laboratory protein development biochemistry chemistry
Topic #7:
marketing content campaign digital medium channel social strategy creative product_marketing
Topic #8:
security network infrastructure vulnerability incident cloud threat linux application response
Topic #9:
test testing automation qa tool quality development bug process case
Topic #10:
customer technical suppo

## 18 Classes

In [33]:
tf_idf = TfidfVectorizer(max_df=0.95)
tf_idf_array = tf_idf.fit_transform(df.listed_items).toarray()
tf_idf_df = pd.DataFrame(tf_idf_array,columns=tf_idf.get_feature_names())
tf_idf_df.shape

(3364, 15388)

In [34]:
nmf_model = NMF(n_components=18, random_state=42)
nmf = nmf_model.fit_transform(tf_idf_df)

In [35]:
W = nmf
H = nmf_model.components_

The W factor contains the document membership weights relative to each of the k topics. Each row corresponds to a single document, and each column corresponds to a topic.

In [36]:
W.shape

(3364, 18)

The H factor contains the term weights relative to each of the k topics. In this case, each row corresponds to a topic, and each column corresponds to a unique term in the corpus vocabulary.

In [37]:
H.shape

(18, 15388)

In [38]:
top_indices = np.argsort(H[1,:])[::-1]

In [39]:
def get_descriptor(terms, H, topic_index, top):
    # argsort sorts ascending; reversing puts the highest-weight term indices first
    top_indices = np.argsort(H[topic_index,:])[::-1]
    top_terms = []
    for term_index in top_indices[0:top]:
        top_terms.append(terms[term_index])
    return top_terms

In [40]:
def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print("Topic #%d:" % topic_idx)
        print(" ".join([feature_names[i]
                        for i in topic.argsort()[:-n_top_words - 1:-1]]))
    print()

In [41]:
print_top_words(nmf_model,tf_idf.get_feature_names(),10)

Topic #0:
business process functional partner management technical area team across drive
Topic #1:
learning machine algorithm model ml deep tensorflow technique deep_learning production
Topic #2:
laboratory process chemistry method equipment lab analytical knowledge development material
Topic #3:
web application javascript react end development framework cs html apis
Topic #4:
product team cross product_management development strategy define manager drive management
Topic #5:
sale salesforce enablement marketing account develop operation enterprise channel customer
Topic #6:
cell biology assay molecular scientific molecular_biology protein biochemistry biological pcr
Topic #7:
marketing content campaign digital medium channel strategy social product_marketing creative
Topic #8:
security network infrastructure vulnerability incident cloud threat linux application response
Topic #9:
test testing automation qa tool quality bug case development agile
Topic #10:
customer technical support 

# Labeling using 9 topics
For plotting in 3-D using plotly

In [43]:
nmf_model = NMF(n_components=9, random_state=42)
nmf = nmf_model.fit_transform(tf_idf_df)

In [44]:
nmf_results = pd.DataFrame(nmf)
nmf_results.columns = ['Predicted Role: Business Analyst', 'Predicted Role: Advanced AI', 
                       'Predicted Role: Data Manager', 
                       'Predicted Role: Software Engineer', 'Predicted Role: Engineering/Design', 
                       'Predicted Role: Product Manager', 'Predicted Role: Biology/Chemistry', 
                       'Predicted Role: Marketing and Ads', 'Predicted Role: Data Engineer']
nmf_labels = pd.DataFrame(nmf_results.T.idxmax())
nmf_labels.columns = ['labels']
nmf_labels.head() 

Unnamed: 0,labels
0,Predicted Role: Software Engineer
1,Predicted Role: Business Analyst
2,Predicted Role: Software Engineer
3,Predicted Role: Data Manager
4,Predicted Role: Advanced AI


## Assigning Classes

Classes were assigned based on the topic that had the maximum score.
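
As a small illustration of that rule, the sketch below builds a made-up table of topic scores and picks, for each row, the column with the largest value. The column names and numbers are hypothetical. `idxmax(axis=1)` gives the same labels as the `.T.idxmax()` form used above.

```python
import pandas as pd

# Made-up topic scores: one row per document, one column per topic.
scores = pd.DataFrame(
    [[0.10, 0.02, 0.00],
     [0.01, 0.30, 0.05],
     [0.00, 0.04, 0.20]],
    columns=["Topic A", "Topic B", "Topic C"],
)

# For each row, return the name of the column holding the maximum score.
labels = scores.idxmax(axis=1)

# Equivalent form: transpose, then take the row label of each column's maximum.
labels_via_transpose = scores.T.idxmax()

print(labels.tolist())
```

A document whose scores are nearly tied across topics still receives a single label under this rule, so borderline postings may be worth inspecting by hand.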

In [45]:
df = pd.read_pickle('../Data/01_clean_sf')

In [46]:
df.shape

(3364, 4)

In [47]:
df = df.merge(nmf_labels,left_index=True,right_index=True)

## Adding PCA components (computed via SVD)

In [48]:
pca = PCA(n_components=3)
pca.fit(tf_idf_df)
tf_idf_df = pd.DataFrame(pca.transform(tf_idf_df))
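
The cell above uses `PCA`, which centers each feature before decomposing, whereas scikit-learn's `TruncatedSVD` works on the uncentered matrix. The sketch below shows the practical difference on random data, a hypothetical stand-in for the TF-IDF matrix: PCA coordinates have zero mean per component, while truncated SVD coordinates generally do not.

```python
import numpy as np
from sklearn.decomposition import PCA, TruncatedSVD

rng = np.random.default_rng(0)

# Hypothetical non-negative feature matrix: 20 documents, 6 features.
X = rng.random((20, 6))

# PCA subtracts the column means, then projects onto the top 3 directions.
pca = PCA(n_components=3, random_state=42)
coords_pca = pca.fit_transform(X)

# TruncatedSVD skips the centering step, which lets it accept sparse input.
svd = TruncatedSVD(n_components=3, random_state=42)
coords_svd = svd.fit_transform(X)

print(coords_pca.mean(axis=0).round(6))
print(coords_svd.mean(axis=0).round(3))
print(pca.explained_variance_ratio_.sum().round(3))
```

Because `TruncatedSVD` does not center, it can run directly on the sparse output of `TfidfVectorizer` without the `.toarray()` conversion, which may matter for larger corpora.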

In [49]:
df = df.merge(tf_idf_df,left_index=True,right_index=True)

In [50]:
df.to_pickle('../Data/01_sf_labeled')

In [51]:
df.head()

Unnamed: 0,company_name,job_title,listed_items,url,labels,0,1,2
0,Gap Inc. Corporate,"Software Engineer, Price Execution",write build product according business conduct...,https://www.indeed.com/rc/clk?jk=77d524a7cf198...,Predicted Role: Software Engineer,-0.018076,0.014856,0.118068
1,WrkShp,Business Analyst,closely product assist investigation deep dive...,https://www.indeed.com/company/WrkShp/jobs/Bus...,Predicted Role: Business Analyst,-0.049042,0.180981,-0.111674
2,Ceres Imaging,Image Processing: GIS / Remote Sensing Analyst,proficiency gi e g arcgis envi processing prod...,https://www.indeed.com/rc/clk?jk=8f702cd563785...,Predicted Role: Software Engineer,0.068105,-0.108212,-0.024264
3,Deloitte,"Analyst, Strategy and Research",effectively interpret client request use tacti...,https://www.indeed.com/rc/clk?jk=8a288a5c5a09d...,Predicted Role: Data Manager,-0.076932,-0.030792,-0.033639
4,Turing Video,Computer Vision Software Engineer,maintain existing implement algorithm necessar...,https://www.indeed.com/rc/clk?jk=fcf308f2fee2a...,Predicted Role: Advanced AI,0.144809,0.012745,0.025521
