#### Validation of predicted future jobs
In this notebook, I use cosine similarity to compare predicted future jobs (predicted job i+1) with actual future jobs (actual job i+1), both represented as rows of topic weight matrices. As a baseline, I compute the cosine similarity between the actual current job (actual job i) and actual job i+1, which amounts to using someone's current job as the prediction of their next job. I also randomly shuffle the pairings between actual job i and actual job i+1 to estimate the roughly chance-level similarity between the topic weights of two jobs.
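
For reference, the measure used throughout is the standard cosine similarity. Here $\mathbf{a}$ and $\mathbf{b}$ are my notation (not used elsewhere in the notebook) for the topic weight vectors of the two jobs being compared:

```latex
\cos(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert}
```

For non-negative vectors such as NMF topic weights, this value lies between 0 (no shared topics) and 1 (identical topic profile up to scale).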

In [1]:
import warnings
warnings.filterwarnings('ignore')
import pickle
import os
import numpy as np
import pandas as pd

In [2]:
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

In [30]:
# load pickled files (the context manager closes each file after reading)
def load_pickle(path):
    with open(path, 'rb') as f:
        return pickle.load(f)

df_work_val = load_pickle('work_exp_validation.pkl')
transition_mat = load_pickle('transition_mat.pkl')
vectorizer = load_pickle('vectorizer.pkl')
nmf = load_pickle('nmf.pkl')

In [31]:
df_work_val.head()

Unnamed: 0,resume_id,job_id,job_title_processed,job_description_processed,title_and_desc
0,26,0,data scientist data analytics lab lead departm...,planned business strategy budget hired managed...,data scientist data analytics lab lead departm...
1,26,1,engineer,developed implemented customized plan performe...,engineer developed implemented customized plan...
2,26,2,manager,modified javascript core interpreter extension...,manager modified javascript core interpreter e...
3,26,3,junior researcher,developed event handling system tmax window ne...,junior researcher developed event handling sys...
4,26,4,student researcher master course,built robot action scripting programming langu...,student researcher master course built robot a...


In [32]:
# extract topics from job descriptions in validation set
job_description_val = vectorizer.transform(df_work_val['job_description_processed'])
W_val = nmf.transform(job_description_val)

# save topic weights as df and concatenate with df_work_val 
topic_weights_val = pd.DataFrame(W_val, columns=['job_i_topic_'+'{:01d}'.format(i+1) for i in range(20)])
# normalize topic weights such that they sum to 1 for each job description
topic_weights_val = topic_weights_val.div(topic_weights_val.sum(axis=1), axis=0)

df_work_val = pd.concat([df_work_val, topic_weights_val], axis=1, sort=False)

In [34]:
# predicted topic weights for job i+1 = topic weights for job i (columns 5-24) times the transition matrix
topics_job_i = df_work_val.iloc[:, 5:25].values
topics_job_iplus1_pred = np.dot(topics_job_i, transition_mat)
topics_job_iplus1_pred = pd.DataFrame(topics_job_iplus1_pred,
                                      columns=['job_i+1_predicted_topic_'+'{:01d}'.format(i+1) for i in range(20)])
# concat with the main df
df_work_val = pd.concat([df_work_val, topics_job_iplus1_pred], axis=1, sort=False)
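
To make the prediction step concrete, here is a minimal sketch with 3 topics instead of 20. The matrix `toy_transition` and the rows in `toy_job_i` are made-up values for illustration, and I am assuming each row of the transition matrix sums to 1; the real values come from `transition_mat.pkl`.

```python
import numpy as np

# Hypothetical 3-topic transition matrix (assumption: each row sums to 1).
# Entry [r, c] is read here as the weight that topic r in job i
# contributes to topic c in job i+1.
toy_transition = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.3, 0.4],
])

# Two hypothetical jobs: one purely topic 1, one split between topics 1 and 2.
toy_job_i = np.array([
    [1.0, 0.0, 0.0],
    [0.5, 0.5, 0.0],
])

# Same operation as np.dot(topics_job_i, transition_mat) in the cell above.
toy_job_iplus1_pred = toy_job_i @ toy_transition
print(toy_job_iplus1_pred)
```

Under that row-sum assumption, each predicted row is a weighted average of the matrix rows, so it also sums to 1 when the input weights do.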

In [35]:
# NaN for job i+1 predicted topics if job_id=0 (most recent job), no data to validate
cols = [col for col in df_work_val.columns if 'predicted' in col]
df_work_val.loc[df_work_val['job_id']==0, cols] = np.nan

In [36]:
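# create columns for job i+1 actual topics using shift
# (assumes rows are sorted by resume_id, then job_id, with job_id=0 the most recent job,
# so the previous row holds the next job; shift also spills across resume boundaries,
# but only into job_id=0 rows, which are dropped later)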
topics_job_iplus1_actual = df_work_val.iloc[:, 5:25].shift()
topics_job_iplus1_actual.columns = ['job_i+1_actual_topic_'+'{:01d}'.format(i+1) for i in range(20)]

df_work_val = pd.concat([df_work_val, topics_job_iplus1_actual], axis=1, sort=False)
df_work_val.head()

Unnamed: 0,resume_id,job_id,job_title_processed,job_description_processed,title_and_desc,job_i_topic_1,job_i_topic_2,job_i_topic_3,job_i_topic_4,job_i_topic_5,job_i_topic_6,job_i_topic_7,job_i_topic_8,job_i_topic_9,job_i_topic_10,job_i_topic_11,job_i_topic_12,job_i_topic_13,job_i_topic_14,job_i_topic_15,job_i_topic_16,job_i_topic_17,job_i_topic_18,job_i_topic_19,job_i_topic_20,job_i+1_predicted_topic_1,job_i+1_predicted_topic_2,job_i+1_predicted_topic_3,job_i+1_predicted_topic_4,job_i+1_predicted_topic_5,job_i+1_predicted_topic_6,job_i+1_predicted_topic_7,job_i+1_predicted_topic_8,job_i+1_predicted_topic_9,job_i+1_predicted_topic_10,job_i+1_predicted_topic_11,job_i+1_predicted_topic_12,job_i+1_predicted_topic_13,job_i+1_predicted_topic_14,job_i+1_predicted_topic_15,job_i+1_predicted_topic_16,job_i+1_predicted_topic_17,job_i+1_predicted_topic_18,job_i+1_predicted_topic_19,job_i+1_predicted_topic_20,job_i+1_actual_topic_1,job_i+1_actual_topic_2,job_i+1_actual_topic_3,job_i+1_actual_topic_4,job_i+1_actual_topic_5,job_i+1_actual_topic_6,job_i+1_actual_topic_7,job_i+1_actual_topic_8,job_i+1_actual_topic_9,job_i+1_actual_topic_10,job_i+1_actual_topic_11,job_i+1_actual_topic_12,job_i+1_actual_topic_13,job_i+1_actual_topic_14,job_i+1_actual_topic_15,job_i+1_actual_topic_16,job_i+1_actual_topic_17,job_i+1_actual_topic_18,job_i+1_actual_topic_19,job_i+1_actual_topic_20
0,26,0,data scientist data analytics lab lead departm...,planned business strategy budget hired managed...,data scientist data analytics lab lead departm...,0.031262,0.019807,0.041755,0.0,0.0,0.0,0.288095,0.016559,0.092926,0.048439,0.052461,0.062533,0.128481,0.0,0.057644,0.0,0.009052,0.0,0.0,0.150985,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1,26,1,engineer,developed implemented customized plan performe...,engineer developed implemented customized plan...,0.0,0.0,0.233295,0.021032,0.001139,0.012604,0.187922,0.064003,0.0,0.056375,0.087451,0.009515,0.019449,0.214731,0.047625,0.007274,0.002886,0.0,0.024835,0.009864,0.053834,0.053906,0.095094,0.03633,0.0473,0.03475,0.132531,0.033347,0.030076,0.044125,0.043641,0.039284,0.039717,0.117552,0.041555,0.028034,0.037301,0.019854,0.042053,0.029719,0.031262,0.019807,0.041755,0.0,0.0,0.0,0.288095,0.016559,0.092926,0.048439,0.052461,0.062533,0.128481,0.0,0.057644,0.0,0.009052,0.0,0.0,0.150985
2,26,2,manager,modified javascript core interpreter extension...,manager modified javascript core interpreter e...,0.0,0.0,0.466469,0.0,0.020704,0.007929,0.161754,0.0,0.009889,0.0,0.1329,0.032057,0.085056,0.0,0.007465,0.0,0.075776,0.0,0.0,0.0,0.052867,0.045243,0.156008,0.025439,0.04476,0.041256,0.119603,0.023461,0.036509,0.040388,0.056763,0.051937,0.057715,0.025704,0.032906,0.025419,0.051652,0.027714,0.049757,0.0349,0.0,0.0,0.233295,0.021032,0.001139,0.012604,0.187922,0.064003,0.0,0.056375,0.087451,0.009515,0.019449,0.214731,0.047625,0.007274,0.002886,0.0,0.024835,0.009864
3,26,3,junior researcher,developed event handling system tmax window ne...,junior researcher developed event handling sys...,0.002027,0.031736,0.082929,0.0,0.0,0.029299,0.046821,0.019593,0.0,0.018675,0.552699,0.0,0.010481,0.039956,0.038732,0.027024,0.062586,0.0,0.026686,0.010758,0.062575,0.071959,0.078853,0.044031,0.040496,0.038977,0.063649,0.03502,0.044271,0.045091,0.103089,0.040014,0.056341,0.035814,0.04612,0.042291,0.043285,0.041533,0.037293,0.0293,0.0,0.0,0.466469,0.0,0.020704,0.007929,0.161754,0.0,0.009889,0.0,0.1329,0.032057,0.085056,0.0,0.007465,0.0,0.075776,0.0,0.0,0.0
4,26,4,student researcher master course,built robot action scripting programming langu...,student researcher master course built robot a...,0.0,0.076347,0.033516,0.0,0.0,0.024421,0.067403,0.02864,0.035995,0.009719,0.453864,0.0,0.061972,0.031641,0.015607,0.02286,0.091075,0.035705,0.0,0.011235,0.06368,0.08587,0.062453,0.041767,0.04428,0.035364,0.069836,0.036688,0.047949,0.044567,0.094428,0.038405,0.061315,0.032112,0.044693,0.039394,0.047727,0.050535,0.028351,0.030585,0.002027,0.031736,0.082929,0.0,0.0,0.029299,0.046821,0.019593,0.0,0.018675,0.552699,0.0,0.010481,0.039956,0.038732,0.027024,0.062586,0.0,0.026686,0.010758
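
The plain `shift()` above works here only because the rows it contaminates (the `job_id=0` row of each resume) are dropped afterwards. A minimal sketch of a grouped alternative, assuming the same sort order; the toy values and the second `resume_id` (31) are made up for illustration:

```python
import pandas as pd

# Hypothetical toy frame: two resumes, most recent job first (job_id=0).
toy = pd.DataFrame({
    'resume_id': [26, 26, 26, 31, 31],
    'job_id': [0, 1, 2, 0, 1],
    'job_i_topic_1': [0.1, 0.2, 0.3, 0.4, 0.5],
})

# Plain shift: the job_id=0 row of resume 31 inherits a value from resume 26.
toy['plain_shift'] = toy['job_i_topic_1'].shift()

# Grouped shift: the most recent job of every resume gets NaN instead.
toy['grouped_shift'] = toy.groupby('resume_id')['job_i_topic_1'].shift()
print(toy)
```

With the grouped version, rows without a job i+1 are NaN by construction, so the later `dropna` no longer depends on the predicted columns having been blanked out first.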


In [37]:
# drop rows with NaN, i.e. rows with job_id=0, which have no data on job i+1
df_work_val1 = df_work_val.dropna(axis=0).reset_index(drop=True)

In [38]:
# randomly shuffle rows of job i topic weights to get chance-level pairings with job i+1
random_topics = df_work_val1.iloc[:, 5:25].sample(frac=1, random_state=42).reset_index(drop=True)
random_topics.columns = ['random_topic_'+'{:01d}'.format(i+1) for i in range(20)]

df_work_val1 = pd.concat([df_work_val1, random_topics], axis=1, sort=False)

In [39]:
# compute cosine similarity between predicted future topic weights and actual future topic weights
# as baseline, also compute cosine similarity between current topic weights and actual future topic weights
cols_current = df_work_val1.columns[5:25]
cols_future_pred = [col for col in df_work_val1.columns if 'predicted' in col]
cols_future_actual = [col for col in df_work_val1.columns if 'actual' in col]
cols_random = [col for col in df_work_val1.columns if 'random' in col]

current_all = df_work_val1.loc[:, cols_current].values
future_pred_all = df_work_val1.loc[:, cols_future_pred].values
future_actual_all = df_work_val1.loc[:, cols_future_actual].values
random_all = df_work_val1.loc[:, cols_random].values

for i in range(len(df_work_val1)):
    current = current_all[i, :].reshape(1, -1)
    future_pred = future_pred_all[i, :].reshape(1, -1)
    future_actual = future_actual_all[i, :].reshape(1, -1)
    random = random_all[i, :].reshape(1, -1)
    # cosine similarity of actual job i+1 topic weights with predicted, current (baseline), and shuffled topic weights
    df_work_val1.loc[i,'similarity_future_pred'] = np.dot(future_pred, future_actual.T)[0][0] \
                                                   /(np.linalg.norm(future_pred)*np.linalg.norm(future_actual)) # retrieve number from 2-d array
    df_work_val1.loc[i,'similarity_current'] = np.dot(current, future_actual.T)[0][0] \
                                               /(np.linalg.norm(current)*np.linalg.norm(future_actual))
    df_work_val1.loc[i, 'similarity_rand'] = np.dot(random, future_actual.T)[0][0] \
                                             /(np.linalg.norm(random)*np.linalg.norm(future_actual))
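
The row-by-row loop above can also be written as one vectorized computation. This is a minimal sketch, not what the notebook ran; `rowwise_cosine` is a helper name I am introducing here, and the toy arrays are made up:

```python
import numpy as np

def rowwise_cosine(a, b):
    """Cosine similarity between matching rows of two 2-D arrays."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Row-wise dot products: sum over columns of the elementwise product.
    dots = np.einsum('ij,ij->i', a, b)
    norms = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    # Like the loop, this yields NaN for a row whose norm is zero.
    return dots / norms

# Hypothetical toy inputs: identical, 45 degrees apart, and orthogonal rows.
toy_a = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]])
toy_b = np.array([[1.0, 0.0], [1.0, 0.0], [3.0, 0.0]])
toy_sim = rowwise_cosine(toy_a, toy_b)
print(toy_sim)
```

In this notebook it could be used as, for example, `df_work_val1['similarity_future_pred'] = rowwise_cosine(future_pred_all, future_actual_all)`, which avoids one `.loc` assignment per row.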

In [42]:
df_work_val1[['similarity_future_pred', 'similarity_current', 'similarity_rand']].describe()

Unnamed: 0,similarity_future_pred,similarity_current,similarity_rand
count,1256.0,1256.0,1256.0
mean,0.617108,0.543472,0.250229
std,0.168128,0.270744,0.204
min,0.10898,0.0,0.0
25%,0.507555,0.329467,0.08715
50%,0.637477,0.571899,0.204335
75%,0.744747,0.770289,0.372872
max,0.960588,0.999821,1.0


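So the mean cosine similarity between predicted job i+1 and actual job i+1 (0.62) is well above the chance-level similarity between the topic weights of two shuffled jobs (0.25). It is also higher than the baseline similarity between actual job i and actual job i+1 (0.54), with a smaller standard deviation (0.17 vs. 0.27). The baseline does have a higher 75th percentile (0.77 vs. 0.74) and maximum, so the prediction's advantage comes mainly from the lower half of the distribution. Not bad.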
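
The summary table compares the two distributions, not individual rows. A minimal sketch of a per-row comparison, assuming the same column names as above; the numbers in `toy_sims` are made up and are not results from this notebook:

```python
import pandas as pd

# Hypothetical similarity scores for four validation rows.
toy_sims = pd.DataFrame({
    'similarity_future_pred': [0.62, 0.55, 0.70, 0.48],
    'similarity_current': [0.50, 0.80, 0.65, 0.10],
})

# Per-row gain of the prediction over the current-job baseline.
diff = toy_sims['similarity_future_pred'] - toy_sims['similarity_current']
share_pred_better = (diff > 0).mean()  # fraction of rows where the prediction wins
mean_gain = diff.mean()                # average per-row improvement
print(share_pred_better, mean_gain)
```

Running the same two lines on `df_work_val1` would show how often the prediction beats the baseline for the same person, which the means alone cannot tell.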