# Mount Colab with Drive

Google Colaboratory is a free Jupyter notebook environment provided by Google where you can use free GPUs and TPUs. If you don’t have a computer that can take the training workload of complex machine learning and deep learning models. So [Google colab](https://colab.research.google.com/notebooks/welcome.ipynb) is a good idea for you!

here I will mount google colab with my drive so that it can access my drive and see my files.

In [85]:
from google.colab import drive
drive.mount('/gdrive')

Drive already mounted at /gdrive; to attempt to forcibly remount, call drive.mount("/gdrive", force_remount=True).


# Import Dataset

In [0]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import re

In [0]:
dataset = pd.read_csv('/gdrive/My Drive/MohamedAbdullah_JobFunctionRecommendationTask/01_Code/dataset/jobs_data.csv', header = 0, index_col=False)

# Perform EDA and Get insights about the data

Exploratory Data Analysis (**EDA**) is an approach to analyzing datasets to summarize their main characteristics in order to know which statistical model to use. but primarily EDA is for seeing what the data can tell us beyond the formal modeling or hypothesis testing task.

In [88]:
dataset.head()

Unnamed: 0.1,Unnamed: 0,title,jobFunction,industry
0,0,Full Stack PHP Developer,"['Engineering - Telecom/Technology', 'IT/Softw...","['Computer Software', 'Marketing and Advertisi..."
1,1,CISCO Collaboration Specialist Engineer,"['Installation/Maintenance/Repair', 'IT/Softwa...",['Information Technology Services']
2,2,Senior Back End-PHP Developer,"['Engineering - Telecom/Technology', 'IT/Softw...","['Computer Software', 'Computer Networking']"
3,3,UX Designer,"['Creative/Design/Art', 'IT/Software Developme...","['Computer Software', 'Information Technology ..."
4,4,Java Technical Lead,"['Engineering - Telecom/Technology', 'IT/Softw...","['Computer Software', 'Information Technology ..."


here we want to drop the first index column. we can use [DataFrame.drop](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html) function in pandas. `inplace` means to do this operation in the same dataframe.

In [0]:
dataset.drop('Unnamed: 0', axis=1, inplace=True)


##### We noticed that jobFunction has many cells with a value  of `'[nan]'`. so we want to count these nans to take a dicision to it. each cell of the jobFunction column is a string,  so I want to parse this string and converting it to NAN object. 

In [0]:
dataset['jobFunction'].replace('[\'nan\']', np.nan, inplace=True)

Now we want to check the number of `nans` in the jobFunction column. 
Fortunately there exist a module in python called `Counter` can help us with this task, it will take the jobFunction series and returning a dictionary of each string with its frequency

In [91]:
from collections import Counter

Counter(dataset['jobFunction'])

Counter({"['Accounting/Finance', 'Administration', 'Operations/Management']": 1,
         "['Accounting/Finance', 'Administration']": 8,
         "['Accounting/Finance', 'Analyst/Research']": 9,
         "['Accounting/Finance', 'Banking']": 1,
         "['Accounting/Finance', 'Business Development']": 1,
         "['Accounting/Finance', 'C-Level Executive/GM/Director']": 1,
         "['Accounting/Finance', 'Customer Service/Support', 'IT/Software Development']": 3,
         "['Accounting/Finance', 'Customer Service/Support']": 3,
         "['Accounting/Finance', 'Education/Teaching']": 3,
         "['Accounting/Finance', 'Engineering - Telecom/Technology', 'IT/Software Development']": 3,
         "['Accounting/Finance', 'Human Resources', 'Administration']": 1,
         "['Accounting/Finance', 'Human Resources']": 1,
         "['Accounting/Finance', 'IT/Software Development', 'Engineering - Telecom/Technology']": 1,
         "['Accounting/Finance', 'IT/Software Development', 'Strategy/

# Data Cleaning

Data cleaning is the process of preparing data for analysis by removing or modifying data that is incorrect, incomplete, irrelevant, duplicated, or improperly formatted. This data is usually not necessary or helpful when it comes to analyzing data because it may provide inaccurate results. <br>
 **garbage in .. garbage out!!**

Since our model expects the user to input the `Title` then it recommends `jobFunction/s` to it. we see that we have *117* cells in `jobFunction` that is not annotated (with `nan` value). we may replace `nan` with some value but in our situation I think the simplest and best way is to remove `nans`

you can do it easily with [DataFrame.dropna](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html) in pandas.

In [0]:
dataset.dropna(subset=['jobFunction'], inplace=True)

Now the next step after removing the `nans`, we want to get insights about the `jobFunction` column considering it our label that the model will predict.<br>

Lets play with the dataset and get the unique job functions (*number of classes*) in the dataset 

In [0]:
def from_str_to_list(text):
    '''
    This function takes a string object (with a list format) then parsing it 
    and converting it to list 
    '''
    str_cleaned = re.sub('[\'"\[\]]', '', text)
    return [token.strip() for token in str_cleaned.split(',')]


In [0]:
# Convert jobFunctions from list of strings to a list of lists 
job_functions = [from_str_to_list(function) for function in dataset['jobFunction']]

# flatten jobFunctions to deal with each function separately
job_functions_flattened = [item for sublist in job_functions for item in sublist]

In [95]:
job_functions_flattened

['Engineering - Telecom/Technology',
 'IT/Software Development',
 'Installation/Maintenance/Repair',
 'IT/Software Development',
 'Engineering - Telecom/Technology',
 'Engineering - Telecom/Technology',
 'IT/Software Development',
 'Creative/Design/Art',
 'IT/Software Development',
 'Engineering - Telecom/Technology',
 'IT/Software Development',
 'IT/Software Development',
 'Engineering - Telecom/Technology',
 'Customer Service/Support',
 'Engineering - Telecom/Technology',
 'IT/Software Development',
 'Engineering - Mechanical/Electrical',
 'Sales/Retail',
 'Education/Teaching',
 'Administration',
 'Operations/Management',
 'Sales/Retail',
 'Marketing/PR/Advertising',
 'Media/Journalism/Publishing',
 'Accounting/Finance',
 'Sales/Retail',
 'Creative/Design/Art',
 'IT/Software Development',
 'Engineering - Telecom/Technology',
 'Education/Teaching',
 'Media/Journalism/Publishing',
 'Marketing/PR/Advertising',
 'Marketing/PR/Advertising',
 'Sales/Retail',
 'IT/Software Development',
 'E

To get the unique job functions with a simple way, you can use a `set`

In [96]:
unique_job_functions_flattened = set(job_functions_flattened)

classes = list(unique_job_functions_flattened)
print('number of classes is {}' .format(len(classes)))

number of classes is 37


##### so we have **37 classes** that our model should recommend from them based on the employee's `title`

According to the job title in our dataset we want to see the frequency of the words in each title to know the most important words our model should preserve

In [0]:
title_words = [word for title in dataset['title'] for word in title.split()]

In [98]:
Counter(title_words)

Counter({'Full': 246,
         'Stack': 264,
         'PHP': 169,
         'Developer': 2064,
         'CISCO': 2,
         'Collaboration': 3,
         'Specialist': 1109,
         'Engineer': 1308,
         'Senior': 1913,
         'Back': 31,
         'End-PHP': 1,
         'UX': 27,
         'Designer': 509,
         'Java': 120,
         'Technical': 345,
         'Lead': 130,
         'Support': 242,
         'iOS': 129,
         'Mechanical': 86,
         'Real': 131,
         'Estate': 126,
         'Sales': 1220,
         '-': 1983,
         '10th': 10,
         'of': 50,
         'Ramadan': 13,
         'School': 29,
         'Principal': 11,
         'Representative': 414,
         'Accountant': 279,
         'Indoor': 43,
         'Executive': 501,
         'Full-Stack': 29,
         'Joomla': 6,
         'Expert': 17,
         'English': 228,
         'Teacher': 322,
         'Assistant': 209,
         'Marketing': 463,
         'Coordinator': 158,
         'Business': 333

we see that the `title` coiumn has many charachers and words that should be cleaned before feeding this data to the model.
<br><br>
we suggest to remove **punctuations** and **numbers** from the title column since it would not have a critical value in our model. then we can **lowercase** all the string so that the model doesn't differentiate between a word like '*Developer*' and '*developer*'

In [0]:
import string

def normalize(title):
  # remove punctuation
  replace_punctuation = str.maketrans(string.punctuation, ' '*len(string.punctuation))
  title = title.translate(replace_punctuation)
  # remove numbers
  title = re.sub('\d', ' ', title)
  # lowercase all characters
  title = title.lower()
  # remove extra spaces between string
  title = ' '.join(title.split())

  return title


In [100]:
print('before applying normalizaitoin: \n')
print(dataset['title'][:20])

# apply the normalization to the title column
normalized_titles = [normalize(title) for title in dataset['title']]

print('\n after applying normalizaitoin: \n')
print(normalized_titles[:20])

before applying normalizaitoin: 

0                           Full Stack PHP Developer
1            CISCO Collaboration Specialist Engineer
2                      Senior Back End-PHP Developer
3                                        UX Designer
4                                Java Technical Lead
5                         Technical Support Engineer
6                               Senior iOS Developer
7                                Mechanical Engineer
8     Real Estate Sales Specialist - 10th of Ramadan
9                                   School Principal
10                       Senior Sales Representative
11                                        Accountant
12                            Indoor Sales Executive
13                    PHP Full-Stack - Joomla Expert
14                         English Teacher Assistant
15                             Marketing Coordinator
16                         Senior Business Developer
17                          Senior Website Developer
18          

The next step we want to know if this model is *balanced* (number of observation in the dataset is approximately the same for each class) or not. <br>
lets get insight about our labels `jobFunctions`

In [101]:
Counter(job_functions_flattened)

Counter({'Accounting/Finance': 477,
         'Administration': 656,
         'Analyst/Research': 272,
         'Banking': 10,
         'Business Development': 445,
         'C-Level Executive/GM/Director': 3,
         'Creative/Design/Art': 739,
         'Customer Service/Support': 863,
         'Education/Teaching': 527,
         'Engineering - Construction/Civil/Architecture': 302,
         'Engineering - Mechanical/Electrical': 499,
         'Engineering - Oil & Gas/Energy': 10,
         'Engineering - Other': 169,
         'Engineering - Telecom/Technology': 3886,
         'Fashion': 3,
         'Hospitality/Hotels/Food Services': 15,
         'Human Resources': 260,
         'IT/Software Development': 4383,
         'Installation/Maintenance/Repair': 576,
         'Legal': 21,
         'Logistics/Supply Chain': 171,
         'Manufacturing/Production': 122,
         'Marketing/PR/Advertising': 1400,
         'Media/Journalism/Publishing': 951,
         'Medical/Healthcare': 217,
 

we see that our dataset is **imbalanced** (one class has 4383 record and another one has only 7 !!) and this may cause some problems in our model. so we should take care about this issue. since there are many machine learning algorithms that fails with imbalanced data and many metrics also doesn't provide correct insights about the error with imbalanced data. based on this point *we should choose the right algorithm and the right metric for this data!*
<br><br>
But now how should we overcome this problem? We can consider **downsampling** the majority class or **upsampling** the minority class. But because we don't have a large dataset I think the simplest and best one in our situation is  '*upsampling*' which means to get more data about minority classes. this will make every class has approximately the same observations as the other classes and this will be very benificial to our model so that it can't be biased to a particular class.
 <br><br>
Another solution we may try with our problem in the **feature engineering** part is to merge some labels that is quite similar to each others into one label. this will improve our model significantly!


# Vectorize your data

Unfortunately our machine learning model will not understand these categorical data even after cleaning it. it expects the input as numbers not characters and words. so what can we do?
<br>
we will convert these words into vectors of numbers then feed it to the algorithm. we can use many techniques such as [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) or [TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html). both of them tokenize a collection of text and build a Word counts but I prefer `TfidfVectorizer` because it also penalize worthless words like 'the' that may appear many times in the data.

In [0]:
from sklearn.feature_extraction.text import TfidfVectorizer

Here we will build a `TfidfVectorizer` with n-gram in a range from 1 to 3. this means that the algorithm creates a window from one word to 3 words and passes it around each title to detect the correlations between these words and the order of them in the windows. and finally it will return a sparse matrix that carries the features of all these words with the correlations between each others and the order of them.

In [0]:
# build Tfidf with n-gram of 1 to 3 words
vectorizer = TfidfVectorizer(analyzer='word', ngram_range=(1,3))
feature_matrix = vectorizer.fit_transform(normalized_titles)

Also we should vectorize the labels and since our model has multilabel (*title may has more than one function*) we should use [MultiLabelBinarizer](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MultiLabelBinarizer.html)
. It will return a matrix of zeros and ones. for each row the index of the true label is one and the others are zeros.

In [0]:
from sklearn.preprocessing import MultiLabelBinarizer

multiLabelBinarizer = MultiLabelBinarizer()
labels = multiLabelBinarizer.fit_transform(job_functions)

# Build the model

Well! lets start a new part and train a simple model with our data. we will try to train [multiOutputClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html). This is a simple strategy for extending classifiers that do not natively support *multi-target classification*.
<br> And we will try to train the model with (**SVC**) Support Vector  Classifier which works fine on sparse and *unbalanced data*.

In [105]:
from sklearn.svm import LinearSVC 
from sklearn.multioutput import MultiOutputClassifier
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(feature_matrix, labels, test_size=0.2)
estimator = LinearSVC()
model = MultiOutputClassifier(estimator)
model.fit(X_train, y_train)

MultiOutputClassifier(estimator=LinearSVC(C=1.0, class_weight=None, dual=True,
                                          fit_intercept=True,
                                          intercept_scaling=1,
                                          loss='squared_hinge', max_iter=1000,
                                          multi_class='ovr', penalty='l2',
                                          random_state=None, tol=0.0001,
                                          verbose=0),
                      n_jobs=None)

After we tained the model we want to see how the model works on the test set. You can do it directly with a **predict** function.

In [0]:
y_pred = model.predict(X_test)

# Evaluate your model

Our data is imbalanced so choosing a good metric will boost the performance of the model significantly. We will try **F-1 score** because it’s one of the best metrics to use with imbalanced data. It’s simply the harmonic mean of *precision* and *recall*. And when it reaches a high score, that means that we reached a perfect precision and recall together.

In [107]:
from sklearn.metrics import f1_score

f1_score(y_test, y_pred, average='weighted')

  'precision', 'predicted', average, warn_for)
  'recall', 'true', average, warn_for)


0.911731906663528

#Test your own

You can try this model by entering an employee's **job title** and our model will recommend new **job functions** !!

In [0]:
def recommend_job_functions(title):
  '''
  This function recommend new job functions based on the employee job title
  '''
  title_norm = normalize(title)
  title_vec = vectorizer.transform([title_norm])
  pred = model.predict(title_vec)
  
  # return job functions from the prediction
  results = []
  for index in np.where(pred[0]==1)[0]:
    results.append(classes[index])

  return results


In [109]:
title = 'machine learning engineer'
functions = recommend_job_functions(title)
print('Recommended functions based on this title: \n', functions)

Recommended functions based on this title: 
 ['Media/Journalism/Publishing', 'Administration', 'Operations/Management']


# saving the model using `pickle`

Finally after we built the model, we want to save it to a file on the desk in order to load it by flask app. we recommend using `pickle`. it's a python module used for **serializing** and **de-serializing** a python object structure. Pickling is a way to convert a python object (list, dict, etc.) into a character stream. The idea is that this character stream contains all the information necessary to reconstruct the object in another python script.<br>
So in our situation we will pickle `model`, `vectorizer` and `classes` objects so that we can de-serialize and use them in the flask app.

To pickle an object in python. you should use `pickle.dump()` and pass to it the file path you want to dump (with `'wb'` mode)

In [0]:
import pickle

pickle.dump(model, open('/gdrive/My Drive/MohamedAbdullah_JobFunctionRecommendationTask/01_Code/model.pkl', 'wb'))
pickle.dump(vectorizer, open('/gdrive/My Drive/MohamedAbdullah_JobFunctionRecommendationTask/01_Code/vectorizer.pkl', 'wb'))
pickle.dump(classes, open('/gdrive/My Drive/MohamedAbdullah_JobFunctionRecommendationTask/01_Code/classes.pkl', 'wb'))

In order to use them in flask app. All you need is to use `pickle.load()` with your pickle file in your path. don't forget to use `'rb'` mode while loading.

In [0]:
model = pickle.load(open('/gdrive/My Drive/MohamedAbdullah_JobFunctionRecommendationTask/01_Code/model.pkl', 'rb'))
vectorizer = pickle.load(open('/gdrive/My Drive/MohamedAbdullah_JobFunctionRecommendationTask/01_Code/vectorizer.pkl', 'rb'))
classes = pickle.load(open('/gdrive/My Drive/MohamedAbdullah_JobFunctionRecommendationTask/01_Code/classes.pkl', 'rb'))