###### Name: Deepak Vadithala
###### Course: MSc Data Science
###### Project Name: MOOC Recommender System

##### Notes:
This notebook contains the analysis of **Google's Word2Vec** model, which is trained on news articles. 
Two variables **(Role Score and Skill Score)** are used to predict the course category. 
The Skill Score is calculated as the similarity between the skills from LinkedIn and the Coursera course description together with its keywords.


*Model Source Code Path: /mooc-recommender/Model/Cosine_Distance.py*

*Github Repo: https://github.com/iamdv/mooc-recommender*
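
The model source itself is not reproduced in this notebook, so the following is only a minimal sketch of the kind of similarity measure the notes describe. It assumes that both the LinkedIn skills and the course text have already been turned into fixed-length vectors (for example by averaging word embeddings, which is an assumption here). The function name `cosine_similarity` and the toy vectors are hypothetical.

```python
import numpy as np

def cosine_similarity(vec_a, vec_b):
    """Return the cosine of the angle between two 1-D vectors."""
    a = np.asarray(vec_a, dtype=float)
    b = np.asarray(vec_b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        # A zero vector has no direction; treat it as "no similarity".
        return 0.0
    return float(np.dot(a, b) / denom)

# Toy 3-dimensional vectors standing in for real embeddings (hypothetical data).
skills_vec = np.array([1.0, 2.0, 0.0])
course_vec = np.array([2.0, 4.0, 0.0])     # same direction as skills_vec
unrelated_vec = np.array([0.0, 0.0, 3.0])  # orthogonal to skills_vec

print(cosine_similarity(skills_vec, course_vec))
print(cosine_similarity(skills_vec, unrelated_vec))
```

Vectors pointing in the same direction score close to 1 and orthogonal vectors score 0, which is consistent with the score ranges seen in the per-role CSV files loaded below.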

In [42]:
# **************************** IMPORTANT ****************************
'''
This cell contains the configuration settings for the notebook.
You can run one role at a time to evaluate the performance of the model.
Change the variable names to run for multiple roles.

In this model:
1. The Google word2vec model has two variables, Roles and Skills, with
50% weightage for each.
'''

# *******************************************************************
# For each role, a list of category names is grouped.
# Please don't change these variables.

label_DataScientist = ['Data Science','Data Analysis','Data Mining','Data Visualization']

label_SoftwareDevelopment = ['Software Development','Computer Science',
                           'Programming Languages', 'Algorithms and Data Structures', 
                           'Information Technology']


label_DatabaseAdministrator = ['Databases']

label_Cybersecurity = ['Cybersecurity']

label_FinancialAccountant = ['Finance', 'Accounting']

label_MachineLearning = ['Machine Learning', 'Deep Learning']

label_Musician = ['Music']

label_Dietitian = ['Nutrition & Wellness', 'Health & Medicine']

            
# *******************************************************************


# *******************************************************************
# Environment and Config Variables. Change these variables as per the requirement.

my_fpath_model = "../Data/Final_Model_Output.csv"

my_fpath_courses = "../Data/main_coursera.csv"

my_fpath_skills_DataScientist = "../Data/Word2Vec-Google/Word2VecGoogle_DataScientist.csv"

my_fpath_skills_SoftwareDevelopment = "../Data/Word2Vec-Google/Word2VecGoogle_SoftwareDevelopment.csv" 

my_fpath_skills_DatabaseAdministrator = "../Data/Word2Vec-Google/Word2VecGoogle_DatabaseAdministrator.csv"

my_fpath_skills_Cybersecurity = "../Data/Word2Vec-Google/Word2VecGoogle_Cybersecurity.csv"

my_fpath_skills_FinancialAccountant = "../Data/Word2Vec-Google/Word2VecGoogle_FinancialAccountant.csv"

my_fpath_skills_MachineLearning = "../Data/Word2Vec-Google/Word2VecGoogle_MachineLearning.csv"

my_fpath_skills_Musician = "../Data/Word2Vec-Google/Word2VecGoogle_Musician.csv"

my_fpath_skills_Dietitian = "../Data/Word2Vec-Google/Word2VecGoogle_Dietitian.csv"


# *******************************************************************


# *******************************************************************
# Weighting Variables. Change them as per the requirement.
# Role score is not applicable for Google's Word2Vec model.

my_role_weight = 0.5

my_skill_weight = 0.5

my_threshold = 0.37

# *******************************************************************
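
As a rough illustration of how the weighting variables and the threshold interact, the sketch below combines a role score and a skill score into a final score and compares it with the threshold. The helper names `final_score` and `predict` are hypothetical; the default values mirror `my_role_weight`, `my_skill_weight`, and `my_threshold` above, and the sample scores are the Software Development and Data Scientist values for course 305 from the `df_cosdist.head(20)` output in this notebook.

```python
def final_score(role_score, skill_score, role_weight=0.5, skill_weight=0.5):
    """Weighted sum of the role and skill scores."""
    return role_score * role_weight + skill_score * skill_weight

def predict(role_score, skill_score, threshold=0.37,
            role_weight=0.5, skill_weight=0.5):
    """A course is recommended for a role when its final score reaches the threshold."""
    return final_score(role_score, skill_score, role_weight, skill_weight) >= threshold

# Course 305: SoftwareDevelopment role/skill scores vs. DataScientist role/skill scores.
sd_score = final_score(0.637541, 0.671046)
ds_score = final_score(0.201266, 0.248907)

print(sd_score, predict(0.637541, 0.671046))
print(ds_score, predict(0.201266, 0.248907))
```

With equal weights the final score is simply the mean of the two scores, so raising `my_threshold` trades false positives for false negatives across every role at once.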


In [43]:
# Importing required modules/packages

import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer
import nltk
import string
import csv
import json



In [44]:
# Downloading the stopwords like i, me, and, is, the etc.

nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /Users/DV/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [45]:
# Loading courses and skills data from the CSV files

df_courses = pd.read_csv(my_fpath_courses)

df_DataScientist = pd.read_csv(my_fpath_skills_DataScientist)
df_DataScientist = df_DataScientist.drop(columns='Role')
df_DataScientist.columns = ['Course Id', 'DataScientist_Skill_Score', 'DataScientist_Role_Score', 'DataScientist_Keyword_Score']

df_SoftwareDevelopment = pd.read_csv(my_fpath_skills_SoftwareDevelopment)
df_SoftwareDevelopment = df_SoftwareDevelopment.drop(columns='Role')
df_SoftwareDevelopment.columns = ['Course Id','SoftwareDevelopment_Skill_Score', 'SoftwareDevelopment_Role_Score', 'SoftwareDevelopment_Keyword_Score']

df_DatabaseAdministrator = pd.read_csv(my_fpath_skills_DatabaseAdministrator)
df_DatabaseAdministrator = df_DatabaseAdministrator.drop(columns='Role')
df_DatabaseAdministrator.columns = ['Course Id','DatabaseAdministrator_Skill_Score', 'DatabaseAdministrator_Role_Score', 'DatabaseAdministrator_Keyword_Score']

df_Cybersecurity = pd.read_csv(my_fpath_skills_Cybersecurity)
df_Cybersecurity = df_Cybersecurity.drop(columns='Role')
df_Cybersecurity.columns = ['Course Id','Cybersecurity_Skill_Score', 'Cybersecurity_Role_Score', 'Cybersecurity_Keyword_Score']

df_FinancialAccountant = pd.read_csv(my_fpath_skills_FinancialAccountant)
df_FinancialAccountant = df_FinancialAccountant.drop(columns='Role')
df_FinancialAccountant.columns = ['Course Id','FinancialAccountant_Skill_Score', 'FinancialAccountant_Role_Score', 'FinancialAccountant_Keyword_Score']

df_MachineLearning = pd.read_csv(my_fpath_skills_MachineLearning)
df_MachineLearning = df_MachineLearning.drop(columns='Role')
df_MachineLearning.columns = ['Course Id','MachineLearning_Skill_Score', 'MachineLearning_Role_Score', 'MachineLearning_Keyword_Score']

df_Musician = pd.read_csv(my_fpath_skills_Musician)
df_Musician = df_Musician.drop(columns='Role')
df_Musician.columns = ['Course Id','Musician_Skill_Score', 'Musician_Role_Score', 'Musician_Keyword_Score']

df_Dietitian = pd.read_csv(my_fpath_skills_Dietitian)
df_Dietitian = df_Dietitian.drop(columns='Role')
df_Dietitian.columns = ['Course Id','Dietitian_Skill_Score', 'Dietitian_Role_Score','Dietitian_Keyword_Score']


In [46]:
# Merging the per-role score dataframes on 'Course Id'

df_cosdist = df_DataScientist.merge(df_SoftwareDevelopment, on = 'Course Id', how = 'outer')

df_cosdist = df_cosdist.merge(df_DatabaseAdministrator, on = 'Course Id', how = 'outer')

df_cosdist = df_cosdist.merge(df_Cybersecurity, on = 'Course Id', how = 'outer')

df_cosdist = df_cosdist.merge(df_FinancialAccountant, on = 'Course Id', how = 'outer')

df_cosdist = df_cosdist.merge(df_MachineLearning, on = 'Course Id', how = 'outer')

df_cosdist = df_cosdist.merge(df_Musician, on = 'Course Id', how = 'outer')

df_cosdist = df_cosdist.merge(df_Dietitian, on = 'Course Id', how = 'outer')
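
The chain of outer merges above can also be written as a single fold over a list of dataframes. The sketch below is one possible way to do that with `functools.reduce`; the helper name `merge_on_course_id` and the three toy dataframes are hypothetical and only stand in for the per-role score tables.

```python
from functools import reduce

import pandas as pd

def merge_on_course_id(frames):
    """Outer-merge a list of dataframes on 'Course Id', left to right."""
    return reduce(
        lambda left, right: left.merge(right, on='Course Id', how='outer'),
        frames,
    )

# Toy stand-ins for the per-role score tables (hypothetical data).
a = pd.DataFrame({'Course Id': [303, 305], 'A_Skill_Score': [0.35, 0.25]})
b = pd.DataFrame({'Course Id': [305, 306], 'B_Skill_Score': [0.67, 0.46]})
c = pd.DataFrame({'Course Id': [303, 306], 'C_Skill_Score': [0.48, 0.18]})

merged = merge_on_course_id([a, b, c])
print(merged)
```

Because the merge is an outer join, a course missing from one table is kept and its scores for that role become `NaN`, which is worth checking before thresholding.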



In [47]:
# Exploring data dimensionality, feature names, and feature types.

print(df_courses.shape,"\n")

print(df_cosdist.shape,"\n")

print(df_courses.columns, "\n")

print(df_cosdist.shape,"\n")

print(df_courses.describe(), "\n")

print(df_cosdist.describe(), "\n")


(2213, 19) 

(2213, 25) 

Index(['Unnamed: 0', 'Course Id', 'Course Name', 'Course Description', 'Slug',
       'Provider', 'Universities/Institutions', 'Parent Subject',
       'Child Subject', 'Category', 'Url', 'Length', 'Language',
       'Credential Name', 'Rating', 'Number of Ratings', 'Certificate',
       'Workload', 'Course Keywords'],
      dtype='object') 

(2213, 25) 

        Unnamed: 0    Course Id      Length       Rating  Number of Ratings  \
count  2213.000000  2213.000000  964.000000  2213.000000        2213.000000   
mean   1106.000000  4816.998192    6.063278     2.352785          10.321735   
std     638.982394  3033.878865    2.724669     2.129134         110.680382   
min       0.000000   303.000000    1.000000     0.000000           0.000000   
25%     553.000000  1829.000000    4.000000     0.000000           0.000000   
50%    1106.000000  4880.000000    6.000000     3.000000           1.000000   
75%    1659.000000  7329.000000    7.000000     4.428571       

In [48]:
# Quick check to see if the dataframe is showing the right results

df_cosdist.head(20)

Unnamed: 0,Course Id,DataScientist_Skill_Score,DataScientist_Role_Score,DataScientist_Keyword_Score,SoftwareDevelopment_Skill_Score,SoftwareDevelopment_Role_Score,SoftwareDevelopment_Keyword_Score,DatabaseAdministrator_Skill_Score,DatabaseAdministrator_Role_Score,DatabaseAdministrator_Keyword_Score,...,FinancialAccountant_Keyword_Score,MachineLearning_Skill_Score,MachineLearning_Role_Score,MachineLearning_Keyword_Score,Musician_Skill_Score,Musician_Role_Score,Musician_Keyword_Score,Dietitian_Skill_Score,Dietitian_Role_Score,Dietitian_Keyword_Score
0,303,0.353896,0.218317,0.006389,0.499727,0.4079,0.286107,0.481017,0.552069,0.239622,...,0.126673,0.297831,0.218179,-0.00998,0.113307,0.009469,0.031942,0.158573,0.129606,0.202016
1,305,0.248907,0.201266,0.060863,0.671046,0.637541,0.339763,0.326477,0.478538,0.337813,...,0.244841,0.255012,0.263021,0.070378,0.238268,0.126894,0.102302,0.167242,0.123262,0.224514
2,306,0.196352,0.111541,0.028367,0.462116,0.177365,0.022605,0.17797,0.172532,0.141302,...,-0.015601,0.33411,0.221548,-0.022811,0.201262,0.104183,0.093264,0.208029,0.163249,0.18022
3,307,0.325023,0.269903,0.192762,0.461995,0.365222,0.234099,0.31837,0.290179,0.158531,...,0.133242,0.404256,0.389423,0.102171,0.201187,0.136207,0.161853,0.236696,0.216953,0.193625
4,308,0.268324,0.177796,0.09639,0.424911,0.171458,0.070023,0.251495,0.102554,0.157861,...,0.04969,0.35383,0.325543,0.158385,0.267893,0.094709,0.060037,0.226828,0.12305,0.164989
5,309,0.346725,0.279909,0.093165,0.451475,0.406482,0.310806,0.311971,0.405204,0.389088,...,0.142451,0.377204,0.210824,0.072626,0.193436,0.00828,0.01041,0.216048,0.12402,0.215586
6,316,0.330795,0.341219,0.076157,0.401275,0.225375,0.133911,0.26369,0.267757,0.107805,...,0.168597,0.296694,0.241718,0.047848,0.186473,0.035255,-0.010463,0.238093,0.254985,0.326331
7,317,0.281024,0.177137,0.065165,0.361162,0.201478,0.083905,0.204727,0.168023,0.096344,...,0.075515,0.348276,0.378852,0.12202,0.253335,0.165166,0.146791,0.261725,0.143083,0.192215
8,318,0.274875,0.276595,0.090191,0.479362,0.401717,0.237797,0.33721,0.443668,0.275484,...,0.271226,0.303685,0.247455,0.001999,0.171764,0.089439,0.030157,0.16674,0.185945,0.166751
9,322,0.337485,0.187397,0.028805,0.404149,0.291805,0.082941,0.251568,0.248627,0.173421,...,0.232345,0.3684,0.391255,0.100462,0.200777,0.116844,0.087077,0.201222,0.236463,0.240487


In [49]:
# Joining two dataframes - Courses and the Cosine Similarity results - on the 'Course Id' variable.
# Inner join: keeps only the rows whose 'Course Id' appears in both tables. This is a set operation.

df_courses_score = df_courses.merge(df_cosdist, on ='Course Id', how='inner')

print(df_courses_score.shape,"\n")


(2213, 43) 



In [50]:
# Transforming and shaping the data to create the confusion matrix for the ROLE: DATA SCIENTIST

y_actu_DataScientist         = ''
y_pred_DataScientist         = ''

df_courses_score['DataScientist_Final_Score'] = (df_courses_score['DataScientist_Role_Score'] * my_role_weight) + (df_courses_score['DataScientist_Skill_Score'] * my_skill_weight)

df_courses_score['DataScientist_Predict'] = (df_courses_score['DataScientist_Final_Score'] >= my_threshold)

df_courses_score['DataScientist_Label'] = df_courses_score.Category.isin(label_DataScientist)

y_pred_DataScientist = pd.Series(df_courses_score['DataScientist_Predict'], name='Predicted')

y_actu_DataScientist = pd.Series(df_courses_score['DataScientist_Label'], name='Actual')

df_confusion_DataScientist = pd.crosstab(y_actu_DataScientist, y_pred_DataScientist , rownames=['Actual'], colnames=['Predicted'], margins=False)


In [51]:
# Transforming and shaping the data to create the confusion matrix for the ROLE: SOFTWARE ENGINEER/DEVELOPER

y_actu_SoftwareDevelopment         = ''
y_pred_SoftwareDevelopment         = ''

df_courses_score['SoftwareDevelopment_Final_Score'] = (df_courses_score['SoftwareDevelopment_Role_Score'] * my_role_weight) + (df_courses_score['SoftwareDevelopment_Skill_Score'] * my_skill_weight)

df_courses_score['SoftwareDevelopment_Predict'] = (df_courses_score['SoftwareDevelopment_Final_Score'] >= my_threshold)

df_courses_score['SoftwareDevelopment_Label'] = df_courses_score.Category.isin(label_SoftwareDevelopment)

y_pred_SoftwareDevelopment = pd.Series(df_courses_score['SoftwareDevelopment_Predict'], name='Predicted')

y_actu_SoftwareDevelopment = pd.Series(df_courses_score['SoftwareDevelopment_Label'], name='Actual')

df_confusion_SoftwareDevelopment = pd.crosstab(y_actu_SoftwareDevelopment, y_pred_SoftwareDevelopment , rownames=['Actual'], colnames=['Predicted'], margins=False)


In [52]:
# Transforming and shaping the data to create the confusion matrix for the ROLE: DATABASE DEVELOPER/ADMINISTRATOR

y_actu_DatabaseAdministrator         = ''
y_pred_DatabaseAdministrator         = ''

df_courses_score['DatabaseAdministrator_Final_Score'] = (df_courses_score['DatabaseAdministrator_Role_Score'] * my_role_weight) + (df_courses_score['DatabaseAdministrator_Skill_Score'] * my_skill_weight)

df_courses_score['DatabaseAdministrator_Predict'] = (df_courses_score['DatabaseAdministrator_Final_Score'] >= my_threshold)

df_courses_score['DatabaseAdministrator_Label'] = df_courses_score.Category.isin(label_DatabaseAdministrator)

y_pred_DatabaseAdministrator = pd.Series(df_courses_score['DatabaseAdministrator_Predict'], name='Predicted')

y_actu_DatabaseAdministrator = pd.Series(df_courses_score['DatabaseAdministrator_Label'], name='Actual')

df_confusion_DatabaseAdministrator = pd.crosstab(y_actu_DatabaseAdministrator, y_pred_DatabaseAdministrator , rownames=['Actual'], colnames=['Predicted'], margins=False)


In [53]:
# Transforming and shaping the data to create the confusion matrix for the ROLE: CYBERSECURITY CONSULTANT

y_actu_Cybersecurity         = ''
y_pred_Cybersecurity         = ''

df_courses_score['Cybersecurity_Final_Score'] = (df_courses_score['Cybersecurity_Role_Score'] * my_role_weight) + (df_courses_score['Cybersecurity_Skill_Score'] * my_skill_weight)

df_courses_score['Cybersecurity_Predict'] = (df_courses_score['Cybersecurity_Final_Score'] >= my_threshold)

df_courses_score['Cybersecurity_Label'] = df_courses_score.Category.isin(label_Cybersecurity)

y_pred_Cybersecurity = pd.Series(df_courses_score['Cybersecurity_Predict'], name='Predicted')

y_actu_Cybersecurity = pd.Series(df_courses_score['Cybersecurity_Label'], name='Actual')

df_confusion_Cybersecurity = pd.crosstab(y_actu_Cybersecurity, y_pred_Cybersecurity , rownames=['Actual'], colnames=['Predicted'], margins=False)


In [54]:
# Transforming and shaping the data to create the confusion matrix for the ROLE: FINANCIAL ACCOUNTANT

y_actu_FinancialAccountant         = ''
y_pred_FinancialAccountant         = ''

df_courses_score['FinancialAccountant_Final_Score'] = (df_courses_score['FinancialAccountant_Role_Score'] * my_role_weight) + (df_courses_score['FinancialAccountant_Skill_Score'] * my_skill_weight)

df_courses_score['FinancialAccountant_Predict'] = (df_courses_score['FinancialAccountant_Final_Score'] >= my_threshold)

df_courses_score['FinancialAccountant_Label'] = df_courses_score.Category.isin(label_FinancialAccountant)

y_pred_FinancialAccountant = pd.Series(df_courses_score['FinancialAccountant_Predict'], name='Predicted')

y_actu_FinancialAccountant = pd.Series(df_courses_score['FinancialAccountant_Label'], name='Actual')

df_confusion_FinancialAccountant = pd.crosstab(y_actu_FinancialAccountant, y_pred_FinancialAccountant , rownames=['Actual'], colnames=['Predicted'], margins=False)


In [55]:
# Transforming and shaping the data to create the confusion matrix for the ROLE: MACHINE LEARNING ENGINEER

y_actu_MachineLearning         = ''
y_pred_MachineLearning         = ''

df_courses_score['MachineLearning_Final_Score'] = (df_courses_score['MachineLearning_Role_Score'] * my_role_weight) + (df_courses_score['MachineLearning_Skill_Score'] * my_skill_weight)

df_courses_score['MachineLearning_Predict'] = (df_courses_score['MachineLearning_Final_Score'] >= my_threshold)

df_courses_score['MachineLearning_Label'] = df_courses_score.Category.isin(label_MachineLearning)

y_pred_MachineLearning = pd.Series(df_courses_score['MachineLearning_Predict'], name='Predicted')

y_actu_MachineLearning = pd.Series(df_courses_score['MachineLearning_Label'], name='Actual')

df_confusion_MachineLearning = pd.crosstab(y_actu_MachineLearning, y_pred_MachineLearning , rownames=['Actual'], colnames=['Predicted'], margins=False)


In [56]:
# Transforming and shaping the data to create the confusion matrix for the ROLE: MUSICIAN

y_actu_Musician         = ''
y_pred_Musician         = ''

df_courses_score['Musician_Final_Score'] = (df_courses_score['Musician_Role_Score'] * my_role_weight) + (df_courses_score['Musician_Skill_Score'] * my_skill_weight)

df_courses_score['Musician_Predict'] = (df_courses_score['Musician_Final_Score'] >= my_threshold)

df_courses_score['Musician_Label'] = df_courses_score.Category.isin(label_Musician)

y_pred_Musician = pd.Series(df_courses_score['Musician_Predict'], name='Predicted')

y_actu_Musician = pd.Series(df_courses_score['Musician_Label'], name='Actual')

df_confusion_Musician = pd.crosstab(y_actu_Musician, y_pred_Musician , rownames=['Actual'], colnames=['Predicted'], margins=False)


In [57]:
# Transforming and shaping the data to create the confusion matrix for the ROLE: NUTRITIONIST/DIETITIAN

y_actu_Dietitian         = ''
y_pred_Dietitian         = ''

df_courses_score['Dietitian_Final_Score'] = (df_courses_score['Dietitian_Role_Score'] * my_role_weight) + (df_courses_score['Dietitian_Skill_Score'] * my_skill_weight)

df_courses_score['Dietitian_Predict'] = (df_courses_score['Dietitian_Final_Score'] >= my_threshold)

df_courses_score['Dietitian_Label'] = df_courses_score.Category.isin(label_Dietitian)

y_pred_Dietitian = pd.Series(df_courses_score['Dietitian_Predict'], name='Predicted')

y_actu_Dietitian = pd.Series(df_courses_score['Dietitian_Label'], name='Actual')

df_confusion_Dietitian = pd.crosstab(y_actu_Dietitian, y_pred_Dietitian , rownames=['Actual'], colnames=['Predicted'], margins=False)
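
The eight cells above repeat the same steps for each role, so they could be folded into one helper. The sketch below is one way to do it, not the notebook's own code: the function name `confusion_for_role` and the toy dataframe are hypothetical, while the column naming pattern (`<Role>_Role_Score`, `<Role>_Skill_Score`, `Category`) follows the dataframes used here. The `reindex` call is an added safeguard that keeps the table 2x2 even when no course is predicted `True` (or `False`) for a role.

```python
import pandas as pd

def confusion_for_role(df, role, labels,
                       role_weight=0.5, skill_weight=0.5, threshold=0.37):
    """Build a 2x2 Actual-vs-Predicted table for one role."""
    final = (df[f'{role}_Role_Score'] * role_weight
             + df[f'{role}_Skill_Score'] * skill_weight)
    predicted = (final >= threshold).rename('Predicted')
    actual = df['Category'].isin(labels).rename('Actual')
    table = pd.crosstab(actual, predicted)
    # Guarantee both False/True rows and columns exist, filling gaps with 0.
    return table.reindex(index=[False, True], columns=[False, True], fill_value=0)

# Toy data standing in for df_courses_score (hypothetical values).
toy = pd.DataFrame({
    'Category': ['Music', 'Music', 'Music', 'Finance', 'Databases'],
    'Musician_Role_Score': [0.60, 0.20, 0.70, 0.50, 0.10],
    'Musician_Skill_Score': [0.50, 0.30, 0.70, 0.40, 0.20],
})

table = confusion_for_role(toy, 'Musician', ['Music'])
strict = confusion_for_role(toy, 'Musician', ['Music'], threshold=0.99)
print(table)
print(strict)
```

With a helper of this shape, each role reduces to one call such as `confusion_for_role(df_courses_score, 'Musician', label_Musician)`, and the fixed 2x2 layout means the cell lookups never fail on a missing column.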


In [58]:
df_confusion_DataScientist


Actual,Predicted=False,Predicted=True
False,2071,60
True,26,56


In [59]:
df_confusion_SoftwareDevelopment

Actual,Predicted=False,Predicted=True
False,1511,581
True,14,107


In [60]:
df_confusion_DatabaseAdministrator

Actual,Predicted=False,Predicted=True
False,2103,99
True,2,9


In [61]:
df_confusion_Cybersecurity

Actual,Predicted=False,Predicted=True
False,2110,73
True,0,30


In [62]:
df_confusion_FinancialAccountant

Actual,Predicted=False,Predicted=True
False,1998,112
True,23,80


In [63]:
df_confusion_MachineLearning

Actual,Predicted=False,Predicted=True
False,1990,199
True,1,23


In [64]:
df_confusion_Musician

Actual,Predicted=False,Predicted=True
False,2163,13
True,2,35


In [65]:
df_confusion_Dietitian

Actual,Predicted=False,Predicted=True
False,2052,116
True,20,25


In [66]:
# Performance summary for the ROLE: DATA SCIENTIST


try:
    tn_DataScientist = df_confusion_DataScientist.iloc[0][False]
except:
    tn_DataScientist = 0
    
try:
    tp_DataScientist =  df_confusion_DataScientist.iloc[1][True]
except:
    tp_DataScientist = 0

    
try:
    fn_DataScientist = df_confusion_DataScientist.iloc[1][False]
except:
    fn_DataScientist = 0
    
try:
    fp_DataScientist =  df_confusion_DataScientist.iloc[0][True]
except:
    fp_DataScientist = 0  
    
    
total_count_DataScientist = tn_DataScientist + tp_DataScientist + fn_DataScientist + fp_DataScientist

print('Data Scientist Accuracy Rate : ', '{0:.2f}'.format((tn_DataScientist + tp_DataScientist) / total_count_DataScientist * 100))

print('Data Scientist Misclassification Rate : ',  '{0:.2f}'.format((fn_DataScientist + fp_DataScientist) / total_count_DataScientist * 100))

print('Data Scientist True Positive Rate : ',  '{0:.2f}'.format(tp_DataScientist / (tp_DataScientist + fn_DataScientist) * 100))

print('Data Scientist False Positive Rate : ',  '{0:.2f}'.format(fp_DataScientist / (tn_DataScientist + fp_DataScientist) * 100))


Data Scientist Accuracy Rate :  96.11
Data Scientist Misclassification Rate :  3.89
Data Scientist True Positive Rate :  68.29
Data Scientist False Positive Rate :  2.82


In [67]:
# Performance summary for the ROLE: SOFTWARE ENGINEER


try:
    tn_SoftwareDevelopment = df_confusion_SoftwareDevelopment.iloc[0][False]
except:
    tn_SoftwareDevelopment = 0
    
try:
    tp_SoftwareDevelopment =  df_confusion_SoftwareDevelopment.iloc[1][True]
except:
    tp_SoftwareDevelopment = 0

    
try:
    fn_SoftwareDevelopment = df_confusion_SoftwareDevelopment.iloc[1][False]
except:
    fn_SoftwareDevelopment = 0
    
try:
    fp_SoftwareDevelopment =  df_confusion_SoftwareDevelopment.iloc[0][True]
except:
    fp_SoftwareDevelopment = 0  
    
    
total_count_SoftwareDevelopment = tn_SoftwareDevelopment + tp_SoftwareDevelopment + fn_SoftwareDevelopment + fp_SoftwareDevelopment

print('Software Engineer Accuracy Rate : ', '{0:.2f}'.format((tn_SoftwareDevelopment + tp_SoftwareDevelopment) / total_count_SoftwareDevelopment * 100))

print('Software Engineer Misclassification Rate : ',  '{0:.2f}'.format((fn_SoftwareDevelopment + fp_SoftwareDevelopment) / total_count_SoftwareDevelopment * 100))

print('Software Engineer True Positive Rate : ',  '{0:.2f}'.format(tp_SoftwareDevelopment / (tp_SoftwareDevelopment + fn_SoftwareDevelopment) * 100))

print('Software Engineer False Positive Rate : ',  '{0:.2f}'.format(fp_SoftwareDevelopment / (tn_SoftwareDevelopment + fp_SoftwareDevelopment) * 100))


Software Engineer Accuracy Rate :  73.11
Software Engineer Misclassification Rate :  26.89
Software Engineer True Positive Rate :  88.43
Software Engineer False Positive Rate :  27.77


In [68]:
# Performance summary for the ROLE: DATABASE DEVELOPER/ ADMINISTRATOR


try:
    tn_DatabaseAdministrator = df_confusion_DatabaseAdministrator.iloc[0][False]
except:
    tn_DatabaseAdministrator = 0
    
try:
    tp_DatabaseAdministrator =  df_confusion_DatabaseAdministrator.iloc[1][True]
except:
    tp_DatabaseAdministrator = 0

    
try:
    fn_DatabaseAdministrator = df_confusion_DatabaseAdministrator.iloc[1][False]
except:
    fn_DatabaseAdministrator = 0
    
try:
    fp_DatabaseAdministrator =  df_confusion_DatabaseAdministrator.iloc[0][True]
except:
    fp_DatabaseAdministrator = 0  
    
    
total_count_DatabaseAdministrator = tn_DatabaseAdministrator + tp_DatabaseAdministrator + fn_DatabaseAdministrator + fp_DatabaseAdministrator

print('Database Administrator Accuracy Rate : ', '{0:.2f}'.format((tn_DatabaseAdministrator + tp_DatabaseAdministrator) / total_count_DatabaseAdministrator * 100))

print('Database Administrator Misclassification Rate : ',  '{0:.2f}'.format((fn_DatabaseAdministrator + fp_DatabaseAdministrator) / total_count_DatabaseAdministrator * 100))

print('Database Administrator True Positive Rate : ',  '{0:.2f}'.format(tp_DatabaseAdministrator / (tp_DatabaseAdministrator + fn_DatabaseAdministrator) * 100))

print('Database Administrator False Positive Rate : ',  '{0:.2f}'.format(fp_DatabaseAdministrator / (tn_DatabaseAdministrator + fp_DatabaseAdministrator) * 100))


Database Administrator Accuracy Rate :  95.44
Database Administrator Misclassification Rate :  4.56
Database Administrator True Positive Rate :  81.82
Database Administrator False Positive Rate :  4.50


In [69]:
# Performance summary for the ROLE: CYBERSECURITY CONSULTANT


try:
    tn_Cybersecurity = df_confusion_Cybersecurity.iloc[0][False]
except:
    tn_Cybersecurity = 0
    
try:
    tp_Cybersecurity =  df_confusion_Cybersecurity.iloc[1][True]
except:
    tp_Cybersecurity = 0

    
try:
    fn_Cybersecurity = df_confusion_Cybersecurity.iloc[1][False]
except:
    fn_Cybersecurity = 0
    
try:
    fp_Cybersecurity =  df_confusion_Cybersecurity.iloc[0][True]
except:
    fp_Cybersecurity = 0  
    
    
total_count_Cybersecurity = tn_Cybersecurity + tp_Cybersecurity + fn_Cybersecurity + fp_Cybersecurity

print('Cybersecurity Consultant Accuracy Rate : ', '{0:.2f}'.format((tn_Cybersecurity + tp_Cybersecurity) / total_count_Cybersecurity * 100))

print('Cybersecurity Consultant Misclassification Rate : ',  '{0:.2f}'.format((fn_Cybersecurity + fp_Cybersecurity) / total_count_Cybersecurity * 100))

print('Cybersecurity Consultant True Positive Rate : ',  '{0:.2f}'.format(tp_Cybersecurity / (tp_Cybersecurity + fn_Cybersecurity) * 100))

print('Cybersecurity Consultant False Positive Rate : ',  '{0:.2f}'.format(fp_Cybersecurity / (tn_Cybersecurity + fp_Cybersecurity) * 100))


Cybersecurity Consultant Accuracy Rate :  96.70
Cybersecurity Consultant Misclassification Rate :  3.30
Cybersecurity Consultant True Positive Rate :  100.00
Cybersecurity Consultant False Positive Rate :  3.34


In [70]:
# Performance summary for the ROLE: FINANCIAL ACCOUNTANT


try:
    tn_FinancialAccountant = df_confusion_FinancialAccountant.iloc[0][False]
except:
    tn_FinancialAccountant = 0
    
try:
    tp_FinancialAccountant =  df_confusion_FinancialAccountant.iloc[1][True]
except:
    tp_FinancialAccountant = 0

    
try:
    fn_FinancialAccountant = df_confusion_FinancialAccountant.iloc[1][False]
except:
    fn_FinancialAccountant = 0
    
try:
    fp_FinancialAccountant =  df_confusion_FinancialAccountant.iloc[0][True]
except:
    fp_FinancialAccountant = 0  
    
    
total_count_FinancialAccountant = tn_FinancialAccountant + tp_FinancialAccountant + fn_FinancialAccountant + fp_FinancialAccountant

print('Financial Accountant Consultant Accuracy Rate : ', '{0:.2f}'.format((tn_FinancialAccountant + tp_FinancialAccountant) / total_count_FinancialAccountant * 100))

print('Financial Accountant Consultant Misclassification Rate : ',  '{0:.2f}'.format((fn_FinancialAccountant + fp_FinancialAccountant) / total_count_FinancialAccountant * 100))

print('Financial Accountant Consultant True Positive Rate : ',  '{0:.2f}'.format(tp_FinancialAccountant / (tp_FinancialAccountant + fn_FinancialAccountant) * 100))

print('Financial Accountant Consultant False Positive Rate : ',  '{0:.2f}'.format(fp_FinancialAccountant / (tn_FinancialAccountant + fp_FinancialAccountant) * 100))


Financial Accountant Consultant Accuracy Rate :  93.90
Financial Accountant Consultant Misclassification Rate :  6.10
Financial Accountant Consultant True Positive Rate :  77.67
Financial Accountant Consultant False Positive Rate :  5.31


In [71]:
# Performance summary for the ROLE: MACHINE LEARNING ENGINEER


try:
    tn_MachineLearning = df_confusion_MachineLearning.iloc[0][False]
except:
    tn_MachineLearning = 0
    
try:
    tp_MachineLearning =  df_confusion_MachineLearning.iloc[1][True]
except:
    tp_MachineLearning = 0

    
try:
    fn_MachineLearning = df_confusion_MachineLearning.iloc[1][False]
except:
    fn_MachineLearning = 0
    
try:
    fp_MachineLearning =  df_confusion_MachineLearning.iloc[0][True]
except:
    fp_MachineLearning = 0  
    
    
total_count_MachineLearning = tn_MachineLearning + tp_MachineLearning + fn_MachineLearning + fp_MachineLearning

print('Machine Learning Engineer Accuracy Rate : ', '{0:.2f}'.format((tn_MachineLearning + tp_MachineLearning) / total_count_MachineLearning * 100))

print('Machine Learning Engineer Misclassification Rate : ',  '{0:.2f}'.format((fn_MachineLearning + fp_MachineLearning) / total_count_MachineLearning * 100))

print('Machine Learning Engineer True Positive Rate : ',  '{0:.2f}'.format(tp_MachineLearning / (tp_MachineLearning + fn_MachineLearning) * 100))

print('Machine Learning Engineer False Positive Rate : ',  '{0:.2f}'.format(fp_MachineLearning / (tn_MachineLearning + fp_MachineLearning) * 100))


Machine Learning Engineer Accuracy Rate :  90.96
Machine Learning Engineer Misclassification Rate :  9.04
Machine Learning Engineer True Positive Rate :  95.83
Machine Learning Engineer False Positive Rate :  9.09


In [72]:
# Performance summary for the ROLE: MUSICIAN


try:
    tn_Musician = df_confusion_Musician.iloc[0][False]
except:
    tn_Musician = 0
    
try:
    tp_Musician =  df_confusion_Musician.iloc[1][True]
except:
    tp_Musician = 0

    
try:
    fn_Musician = df_confusion_Musician.iloc[1][False]
except:
    fn_Musician = 0
    
try:
    fp_Musician =  df_confusion_Musician.iloc[0][True]
except:
    fp_Musician = 0  
    
    
total_count_Musician = tn_Musician + tp_Musician + fn_Musician + fp_Musician

print('Musician Accuracy Rate : ', '{0:.2f}'.format((tn_Musician + tp_Musician) / total_count_Musician * 100))

print('Musician Misclassification Rate : ',  '{0:.2f}'.format((fn_Musician + fp_Musician) / total_count_Musician * 100))

print('Musician True Positive Rate : ',  '{0:.2f}'.format(tp_Musician / (tp_Musician + fn_Musician) * 100))

print('Musician False Positive Rate : ',  '{0:.2f}'.format(fp_Musician / (tn_Musician + fp_Musician) * 100))


Musician Accuracy Rate :  99.32
Musician Misclassification Rate :  0.68
Musician True Positive Rate :  94.59
Musician False Positive Rate :  0.60


In [73]:
# Performance summary for the ROLE: DIETITIAN


try:
    tn_Dietitian = df_confusion_Dietitian.iloc[0][False]
except:
    tn_Dietitian = 0
    
try:
    tp_Dietitian =  df_confusion_Dietitian.iloc[1][True]
except:
    tp_Dietitian = 0

    
try:
    fn_Dietitian = df_confusion_Dietitian.iloc[1][False]
except:
    fn_Dietitian = 0
    
try:
    fp_Dietitian =  df_confusion_Dietitian.iloc[0][True]
except:
    fp_Dietitian = 0  
    
    
total_count_Dietitian = tn_Dietitian + tp_Dietitian + fn_Dietitian + fp_Dietitian

print('Dietitian Accuracy Rate : ', '{0:.2f}'.format((tn_Dietitian + tp_Dietitian) / total_count_Dietitian * 100))

print('Dietitian Misclassification Rate : ',  '{0:.2f}'.format((fn_Dietitian + fp_Dietitian) / total_count_Dietitian * 100))

print('Dietitian True Positive Rate : ',  '{0:.2f}'.format(tp_Dietitian / (tp_Dietitian + fn_Dietitian) * 100))

print('Dietitian False Positive Rate : ',  '{0:.2f}'.format(fp_Dietitian / (tn_Dietitian + fp_Dietitian) * 100))


Dietitian Accuracy Rate :  93.85
Dietitian Misclassification Rate :  6.15
Dietitian True Positive Rate :  55.56
Dietitian False Positive Rate :  5.35
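
The four rates printed for each role all derive from the same four confusion-matrix counts, so the arithmetic can be captured in one small function. The sketch below is an illustration only: the name `summarize` is hypothetical, and `precision` is an extra metric that the notebook itself does not report. The counts passed in are the Data Scientist values from the confusion matrix shown earlier.

```python
def summarize(tn, fp, fn, tp):
    """Return percentage metrics (rounded to 2 decimals) from confusion counts."""
    total = tn + fp + fn + tp

    def pct(numerator, denominator):
        # Guard against empty classes to avoid division by zero.
        return round(100.0 * numerator / denominator, 2) if denominator else 0.0

    return {
        'accuracy': pct(tn + tp, total),
        'misclassification': pct(fn + fp, total),
        'true_positive_rate': pct(tp, tp + fn),
        'false_positive_rate': pct(fp, tn + fp),
        # Not reported in the notebook; added here as an assumption of usefulness.
        'precision': pct(tp, tp + fp),
    }

# Data Scientist counts from the confusion matrix above.
ds = summarize(tn=2071, fp=60, fn=26, tp=56)
for name, value in ds.items():
    print(f'Data Scientist {name}: {value:.2f}')
```

Because relevant courses are a small fraction of the 2213 rows for every role, accuracy stays high even when many recommendations are wrong, so the true positive rate and precision are usually the more informative figures.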


In [74]:
df_final_model = df_courses_score[['Course Id', 'Course Name', 'Course Description', 'Slug',
       'Provider', 'Universities/Institutions', 'Parent Subject',
       'Child Subject', 'Category', 'Url', 'Length', 'Language',
       'Credential Name', 'Rating', 'Number of Ratings', 'Certificate',
       'Workload',
        'DataScientist_Final_Score', 'DataScientist_Predict',
        'SoftwareDevelopment_Final_Score', 'SoftwareDevelopment_Predict', 
        'DatabaseAdministrator_Final_Score', 'DatabaseAdministrator_Predict',
        'Cybersecurity_Final_Score', 'Cybersecurity_Predict',
        'FinancialAccountant_Final_Score', 'FinancialAccountant_Predict',
        'MachineLearning_Final_Score', 'MachineLearning_Predict',
        'Musician_Final_Score', 'Musician_Predict',
        'Dietitian_Final_Score', 'Dietitian_Predict']]

In [78]:
df_final_model

test = df_final_model.sort_values('FinancialAccountant_Final_Score', ascending=False)


test


Unnamed: 0,Course Id,Course Name,Course Description,Slug,Provider,Universities/Institutions,Parent Subject,Child Subject,Category,Url,...,Cybersecurity_Final_Score,Cybersecurity_Predict,FinancialAccountant_Final_Score,FinancialAccountant_Predict,MachineLearning_Final_Score,MachineLearning_Predict,Musician_Final_Score,Musician_Predict,Dietitian_Final_Score,Dietitian_Predict
1485,6749,Financial Accounting: Foundations,"In this course, you will learn foundations of ...",coursera-financial-accounting-foundations,Coursera,University of Illinois at Urbana-Champaign,Business,Accounting,Accounting,https://www.coursera.org/learn/financial-accou...,...,0.298198,False,0.670776,True,0.231681,False,0.122455,False,0.225509,False
287,769,Introduction to Financial Accounting,Master the technical skills needed to analyze ...,coursera-introduction-to-financial-accounting,Coursera,University of Pennsylvania|||Wharton School of...,Business,Accounting,Accounting,https://www.coursera.org/learn/wharton-accounting,...,0.279021,False,0.655986,True,0.232760,False,0.119940,False,0.221610,False
1483,6726,Accounting: Principles of Financial Accounting,Financial Accounting is often called the langu...,coursera-accounting-principles-of-financial-ac...,Coursera,IESE Business School,Business,Accounting,Accounting,https://www.coursera.org/learn/financial-accou...,...,0.287678,False,0.638713,True,0.255465,False,0.154618,False,0.249125,False
844,3539,More Introduction to Financial Accounting,The course builds on my Introduction to Financ...,coursera-more-introduction-to-financial-accoun...,Coursera,University of Pennsylvania,Business,Accounting,Accounting,https://www.coursera.org/learn/wharton-financi...,...,0.259679,False,0.628390,True,0.234770,False,0.103120,False,0.223486,False
1983,9083,Formal Financial Accounting,This course builds upon what you learned in Fi...,coursera-formal-financial-accounting,Coursera,University of Illinois at Urbana-Champaign,Business,Accounting,Accounting,https://www.coursera.org/learn/formal-financia...,...,0.313807,False,0.627069,True,0.250703,False,0.150274,False,0.239414,False
1828,8304,Financial Accounting Fundamentals,This course will teach you the tools you'll ne...,coursera-financial-accounting-fundamentals,Coursera,University of Virginia,Business,Accounting,Accounting,https://www.coursera.org/learn/uva-darden-fina...,...,0.265765,False,0.623439,True,0.247642,False,0.143329,False,0.256181,False
1724,7750,Financial Accounting Toolkit for Decision Making,This course gives you a firm grounding in both...,coursera-financial-accounting-toolkit-for-deci...,Coursera,Vanderbilt University,Business,Accounting,Accounting,https://www.coursera.org/learn/finance-and-acc...,...,0.312383,False,0.616475,True,0.287686,False,0.124117,False,0.232599,False
1631,7186,Accounting and Finance for IT professionals,This course presents an introduction to the ba...,coursera-accounting-and-finance-for-it-profess...,Coursera,Indian School of Business,Business,Accounting,Accounting,https://www.coursera.org/learn/accounting-finance,...,0.293120,False,0.598871,True,0.227224,False,0.141161,False,0.281292,False
1107,4893,The Global Financial Crisis,Former U.S. Secretary of the Treasury Timothy ...,coursera-the-global-financial-crisis,Coursera,Yale University,Business,Finance,Finance,https://www.coursera.org/learn/global-financia...,...,0.263494,False,0.593397,True,0.134600,False,0.121971,False,0.214919,False
1645,7251,Operational Finance: Finance for Managers,"When it comes to numbers, there is always more...",coursera-operational-finance-finance-for-managers,Coursera,IESE Business School,Business,Finance,Finance,https://www.coursera.org/learn/operational-fin...,...,0.248254,False,0.593112,True,0.168117,False,0.122643,False,0.249337,False
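
As a quick sanity check on a ranking like the one above, the share of top-ranked courses whose `Category` belongs to the role's label list can be computed (precision at k). This is only a sketch and not the evaluation used in the project. `precision_at_k` and `top_categories` are hypothetical names, and the category values are copied by hand from the ten rows displayed above.

```python
# Category labels grouped under the Financial Accountant role (from the config cell)
label_FinancialAccountant = ['Finance', 'Accounting']

# 'Category' values of the ten rows displayed above, in ranked order
top_categories = ['Accounting'] * 8 + ['Finance'] * 2


def precision_at_k(categories, relevant_labels, k):
    """Fraction of the first k categories that fall in relevant_labels."""
    top_k = categories[:k]
    if not top_k:
        return 0.0
    hits = sum(1 for category in top_k if category in relevant_labels)
    return hits / len(top_k)


p_at_10 = precision_at_k(top_categories, label_FinancialAccountant, 10)
print(p_at_10)  # 1.0 for the ten rows shown above
```

With the full DataFrame, the same function could take `test['Category'].tolist()` as its first argument.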


In [35]:
# Prepare the model results for export: drop stray index columns,
# blank out missing values and rename the columns to camelCase

# Drop any leftover 'Unnamed: ...' index columns
df_final_model = df_final_model.drop(
    df_final_model.columns[df_final_model.columns.str.contains('unnamed', case=False)],
    axis=1)

# Replace missing values with empty strings
df_final_model = df_final_model.fillna('')

# Rename the columns positionally; the order must match the selection above
df_final_model.columns = [
    'courseId', 'courseName', 'courseDescription', 'slug', 'provider',
    'universitiesInstitutions', 'parentSubject', 'childSubject',
    'category', 'url', 'length', 'language', 'credentialName', 'rating',
    'numberOfRatings', 'certificate', 'workload',
    'dataScientistFinalScore', 'dataScientistPredict',
    'softwareDevelopmentFinalScore', 'softwareDevelopmentPredict',
    'databaseAdministratorFinalScore', 'databaseAdministratorPredict',
    'cybersecurityFinalScore', 'cybersecurityPredict',
    'financialAccountantFinalScore', 'financialAccountantPredict',
    'machineLearningFinalScore', 'machineLearningPredict',
    'musicianFinalScore', 'musicianPredict',
    'dietitianFinalScore', 'dietitianPredict',
]


df_final_model


,courseId,courseName,courseDescription,slug,provider,universitiesInstitutions,parentSubject,childSubject,category,url,...,cybersecurityFinalScore,cybersecurityPredict,financialAccountantFinalScore,financialAccountantPredict,machineLearningFinalScore,machineLearningPredict,musicianFinalScore,musicianPredict,dietitianFinalScore,dietitianPredict
0,303,Introduction to Databases,This course covers database design and the use...,coursera-introduction-to-databases,Coursera,Stanford University,Programming,Databases,Databases,https://www.coursera.org/course/db,...,0.342538,False,0.197449,False,0.258005,False,0.061388,False,0.144089,False
1,305,Software as a Service,,coursera-software-as-a-service,Coursera,"University of California, Berkeley",Programming,Web Development,Web Development,https://www.coursera.org/course/saas,...,0.358490,False,0.279609,False,0.259016,False,0.182581,False,0.145252,False
2,306,Human-Computer Interaction,Helping you build human-centered design skills...,coursera-human-computer-interaction,Coursera,"University of California, San Diego",Art & Design,Design & Creativity,Design & Creativity,https://www.coursera.org/course/hciucsd,...,0.218914,False,0.104302,False,0.277829,False,0.152722,False,0.185639,False
3,307,Natural Language Processing,Have you ever wondered how to build a system t...,coursera-natural-language-processing,Coursera,Columbia University,Computer Science,Artificial Intelligence,Artificial Intelligence,https://www.coursera.org/course/nlp,...,0.230990,False,0.199540,False,0.396840,True,0.168697,False,0.226825,False
4,308,Game Theory,"Popularized by movies such as ""A Beautiful Min...",coursera-game-theory,Coursera,Stanford University|||The University of Britis...,Social Sciences,Economics,Economics,https://www.coursera.org/learn/game-theory-1,...,0.210484,False,0.189530,False,0.339687,False,0.181301,False,0.174939,False
5,309,Probabilistic Graphical Models 1: Representation,Probabilistic graphical models (PGMs) are a ri...,coursera-probabilistic-graphical-models-1-repr...,Coursera,Stanford University,Computer Science,Artificial Intelligence,Artificial Intelligence,https://www.coursera.org/learn/probabilistic-g...,...,0.265583,False,0.221871,False,0.294014,False,0.100858,False,0.170034,False
6,316,Information Theory,This course is an introduction to information ...,coursera-information-theory,Coursera,The Chinese University of Hong Kong,Engineering,Electrical Engineering,Electrical Engineering,https://www.coursera.org/course/informationtheory,...,0.288635,False,0.248245,False,0.269206,False,0.110864,False,0.246539,False
7,317,Model Thinking,We live in a complex world with diverse people...,coursera-model-thinking,Coursera,University of Michigan,Social Sciences,Sociology,Sociology,https://www.coursera.org/learn/model-thinking,...,0.221478,False,0.233149,False,0.363564,False,0.209250,False,0.202404,False
8,318,Computer Security,Learn how to design secure systems and write s...,coursera-computer-security,Coursera,Stanford University,Computer Science,Cybersecurity,Cybersecurity,https://www.coursera.org/course/security,...,0.518575,True,0.294041,False,0.275570,False,0.130601,False,0.176342,False
9,322,Computer Vision: The Fundamentals,"In this course, we will study the concepts and...",coursera-computer-vision-the-fundamentals,Coursera,"University of California, Berkeley",Computer Science,Artificial Intelligence,Artificial Intelligence,https://www.coursera.org/course/vision,...,0.285846,False,0.248627,False,0.379828,True,0.158811,False,0.218842,False
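
The camelCase names used for the export follow mechanically from the original column names, so the mapping could also be derived by a small helper rather than listed positionally. The sketch below is one way to do this. `to_camel_case` is a hypothetical helper that is not part of the notebook, and it assumes column names are separated by spaces, underscores or slashes.

```python
import re


def to_camel_case(name):
    """Convert a column name such as 'Course Id' to camelCase ('courseId')."""
    # Split on any run of characters that is not a letter or digit
    tokens = [token for token in re.split(r'[^0-9A-Za-z]+', name) if token]
    if not tokens:
        return name
    # Lower-case only the first character so inner capitals are preserved
    first = tokens[0][0].lower() + tokens[0][1:]
    rest = [token[0].upper() + token[1:] for token in tokens[1:]]
    return first + ''.join(rest)


for original in ['Course Id', 'Universities/Institutions',
                 'Number of Ratings', 'DataScientist_Final_Score']:
    print(original, '->', to_camel_case(original))
```

Because pandas' `DataFrame.rename` accepts a callable for `columns`, a call such as `df.rename(columns=to_camel_case)` would apply the helper to every column by name rather than by position.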


In [36]:
df_final_model.to_csv(my_fpath_model, sep=',', encoding='utf-8')
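
By default `to_csv` also writes the DataFrame index as an unnamed first column, and `read_csv` then loads that column back as `Unnamed: 0`. The small round trip below illustrates the difference that `index=False` makes. It uses an in-memory buffer and a two-row toy DataFrame rather than the real output file, so the data and variable names here are assumptions for illustration only.

```python
import io

import pandas as pd

# Toy stand-in for df_final_model (two rows taken from the table above)
df = pd.DataFrame({
    'courseId': [303, 305],
    'courseName': ['Introduction to Databases', 'Software as a Service'],
})

# Default behaviour: the index is written as an unnamed first column
with_index = io.StringIO()
df.to_csv(with_index, sep=',')
with_index.seek(0)
cols_with_index = list(pd.read_csv(with_index).columns)

# index=False: only the data columns are written
without_index = io.StringIO()
df.to_csv(without_index, sep=',', index=False)
without_index.seek(0)
cols_without_index = list(pd.read_csv(without_index).columns)

print(cols_with_index)
print(cols_without_index)
```

Whether to pass `index=False` when saving depends on what the consumers of the CSV file expect, so any code that reads `Final_Model_Output.csv` should be checked before changing it.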

### End of the Notebook. Thank you!