## Centric Innovation Pilot: Employee Recommendation Engine
------------

#### Hypothesis
 
In a professional services organization, we can improve overall utilization and reduce dependency on recruiting by using aggregated workforce data to produce machine-generated staffing suggestions.

![](Innovation.png)

For v0.1, we had access to resume data for Centric employees, which was pulled from Centricity. This is what it looked like:


![](ResumeSample.png)


This one file has the resume information for 331 employees.

#### Step 1: We will import the file and check its encoding

In [10]:
import pandas as pd
import numpy as np
import string
import chardet
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

fileName = 'D:/Python/Centric/output/EverySingleResume.txt'

with open(fileName, 'rb') as f:
    result = chardet.detect(f.readline())
    
print(result)

{'encoding': 'windows-1252', 'confidence': 0.73}
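
Since chardet only inspected the first line here and reported a confidence of 0.73, a cross-check may be worthwhile. The sketch below is one possible approach and is not part of the original notebook: it attempts a strict decode of a byte sample against a list of candidate encodings and returns the first one that succeeds. The helper name `first_valid_encoding` and the candidate list are assumptions chosen for illustration.

```python
def first_valid_encoding(raw, candidates=("utf-8", "windows-1252")):
    """Return the first candidate encoding that decodes `raw` without errors.

    Hypothetical helper for illustration; returns None if no candidate fits.
    """
    for encoding in candidates:
        try:
            raw.decode(encoding, errors="strict")
        except UnicodeDecodeError:
            continue
        return encoding
    return None


# The en dash in "English – Fluent in Business" is the single byte 0x96 in
# windows-1252, which is not valid UTF-8, so the UTF-8 attempt is rejected.
sample = "English – Fluent in Business".encode("windows-1252")
print(first_valid_encoding(sample))
```

For the real file, the sample could be a larger chunk such as `open(fileName, 'rb').read(100000)` instead of a single line. A successful strict decode does not prove the encoding is right, so this is best treated as a sanity check alongside chardet.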


#### Step 2: Read contents of the file into a dataframe

In [12]:
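# Caret-delimited file; on_bad_lines='skip' (pandas 1.3+) replaces error_bad_lines=False, removed in pandas 2.0.
df = pd.read_csv(fileName, sep="^", skiprows=2, header=None, encoding='windows-1252', on_bad_lines='skip')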

df.tail()

| | 0 | 1 |
|---|---|---|
| 77789 | Oswaldo Gonzalez 2015.TXT | |
| 77790 | Oswaldo Gonzalez 2015.TXT | Foreign Language Skills |
| 77791 | Oswaldo Gonzalez 2015.TXT | |
| 77792 | Oswaldo Gonzalez 2015.TXT | English – Fluent in Business |
| 77793 | Oswaldo Gonzalez 2015.TXT | Spanish – Native |


#### Step 3: Clean up the name field and remove NaN values

In [14]:
    # Drop rows with missing values (defaults: axis=0, how='any').
    df = df.dropna()

    # Regex patterns stripped from the file name, applied in order.
    # Matching is by substring, so a token such as 'May' is removed wherever it occurs.
    noise_patterns = [
        'Centric', 'centric', 'CENTRIC', 'Resume', 'resume', 'New', 'Current',
        'Profile', 'Consulting', r'\.txt', r'\.TXT', '-', '_',
        r'\(.*\)',   # parenthesised notes
        r'[*0-9]',   # asterisks and digits
        r'\.',       # remaining literal dots
        'January', 'February', 'Feburary', 'March', 'April', 'May', 'June',
        'July', 'August', 'September', 'October', 'November', 'December',
        'Feb', 'Apr', 'Jun', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec',
    ]
    for pattern in noise_patterns:
        df[0] = df[0].replace({pattern: ''}, regex=True)
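
Because each pattern in this step is applied as a plain substring match, tokens such as `May` or `New` would also be removed from inside names like "Mayer" or "Newman". The sketch below shows one possible alternative, assuming the filename noise always appears as whole words. The helper `clean_name` and its word list are illustrative and not the notebook's code. Unlike the cell above, it turns separators into spaces instead of deleting them.

```python
import re

# Hypothetical noise words; extend the list to match the real filenames.
NOISE_WORDS = ["centric", "resume", "new", "current", "profile", "consulting",
               "january", "february", "march", "april", "may", "june", "july",
               "august", "september", "october", "november", "december",
               "feb", "apr", "jun", "aug", "sep", "oct", "nov", "dec"]
NOISE_RE = re.compile(r"\b(?:" + "|".join(NOISE_WORDS) + r")\b", re.IGNORECASE)


def clean_name(filename):
    """Reduce a resume filename to the employee name (illustrative sketch)."""
    name = re.sub(r"\.txt$", "", filename, flags=re.IGNORECASE)  # drop the extension
    name = re.sub(r"\(.*?\)", " ", name)     # drop parenthesised notes
    name = re.sub(r"[-_.*0-9]", " ", name)   # separators and digits become spaces
    name = NOISE_RE.sub(" ", name)           # remove whole-word noise only
    return " ".join(name.split())            # collapse runs of whitespace


print(clean_name("Oswaldo Gonzalez 2015.TXT"))
print(clean_name("Mayer_Resume_May2016.txt"))
```

Digits and underscores are replaced before the noise words are removed so that the word boundaries in `NOISE_RE` can match tokens such as `May` in `May2016`.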

#### Step 4: Strip leading and trailing whitespace

In [17]:
    df[0] = df[0].str.strip()
    uniqueNames = df[0].unique()
    uniqueNames[0:5]

array(['Aaron Aude', 'Abrams', 'Adam Burkholder', 'Adam Wiggershaus',
       'AJ Flynn'], dtype=object)

#### Step 5: Group all the resume lines into one

In [20]:
df[1] = df[1].astype(str) + " "
df_new = df.groupby([0]).sum().reset_index() 
df_new.rename(columns={0:'ResourceName'}, inplace=True)
df_new.rename(columns={1:'text'}, inplace=True)
df_new.head()

| | ResourceName | text |
|---|---|---|
| 0 | A Nixon | ÿþ Andrew Nixon Associate Profile Andrew... |
| 1 | AJ Flynn | ÿþ Anthony J. Flynn Professional Experienc... |
| 2 | Aaron Aude | Aaron Aude Associate Profile Aaron has ove... |
| 3 | Abrams | ÿþANGELA ABRAMS 105 Brookhill Drive Gahann... |
| 4 | Adam Burkholder | ÿþ Adam Burkholder Associate Profile Ada... |
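
Summing an object column concatenates its strings, which is why each line gets a trailing space appended first. A sketch of an equivalent grouping that joins each employee's lines explicitly is shown below. The toy frame `lines` and its contents are made up for illustration; only the column labels 0 and 1 and the output column names follow the notebook.

```python
import pandas as pd

# Toy stand-in for the cleaned frame: column 0 is the name, column 1 one resume line.
lines = pd.DataFrame({
    0: ["Aaron Aude", "Aaron Aude", "AJ Flynn"],
    1: ["first line", "second line", "only line"],
})

# Join every line belonging to the same employee with a single space.
grouped = (
    lines.groupby(0)[1]
    .agg(" ".join)
    .reset_index()
    .rename(columns={0: "ResourceName", 1: "text"})
)
print(grouped)
```

Joining avoids the extra trailing space and makes the separator explicit, though for a file of this size either approach should be adequate.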


#### Step 6: Convert resume text to lowercase

In [21]:
df_new.text = df_new['text'].str.lower()
df_new.text.head()

0    ÿþ   andrew nixon   associate profile   andrew...
1    ÿþ   anthony j. flynn   professional experienc...
2    aaron aude   associate profile   aaron has ove...
3    ÿþangela abrams   105 brookhill drive   gahann...
4    ÿþ   adam burkholder   associate profile   ada...
Name: text, dtype: object
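
The `ÿþ` prefix on several rows is what the bytes 0xFF 0xFE (the UTF-16 little-endian byte order mark) look like when decoded as windows-1252, which suggests that some source resumes may have been saved as UTF-16. A minimal sketch of confirming this and stripping the residue follows. `strip_bom_residue` is a hypothetical helper, and it assumes the marker only needs removing from the start of the text.

```python
import codecs

# The UTF-16 LE byte order mark, misread as windows-1252, yields the characters "ÿþ".
BOM_RESIDUE = codecs.BOM_UTF16_LE.decode("windows-1252")


def strip_bom_residue(text):
    """Remove a leading misdecoded BOM and surrounding whitespace (illustrative)."""
    text = text.lstrip()
    if text.startswith(BOM_RESIDUE):
        text = text[len(BOM_RESIDUE):]
    return text.lstrip()


print(repr(BOM_RESIDUE))
print(strip_bom_residue("ÿþ   andrew nixon   associate profile"))
```

It could be applied to the grouped frame with `df_new['text'].map(strip_bom_residue)`. Both characters are already lowercase letters, so the helper works before or after the lowercasing step.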

#### Step 7: Build the recommendation engine

To build a recommendation engine, we convert the collection of resumes into a sparse matrix of TF-IDF features. Using this matrix, we calculate the cosine similarity between one resume and every resume in the collection, then rank the resumes from most to least similar. The top-ranked resumes are the recommendations for the employee name given as input.

In [27]:
import warnings
warnings.filterwarnings('ignore')
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# One TF-IDF row per resume; rows are L2-normalised by default,
# so linear_kernel (a dot product) gives cosine similarity.
tfidf = TfidfVectorizer().fit_transform(df_new.text)

def similar_resume(name_of_employee):
    # Find the rows whose ResourceName starts with the given name
    x = df_new[df_new['ResourceName'].str.contains('^'+name_of_employee)].index
    y = calculate_cosine_similarity(x[0])
    print(df_new.ResourceName[y])

def calculate_cosine_similarity(employee_loc):
    # Score the selected TF-IDF row (row employee_loc - 1) against every resume
    cosine_similarities = linear_kernel(tfidf[employee_loc-1:employee_loc], tfidf).flatten()
    # Indices of the four highest scores, best first
    related_docs_indices = cosine_similarities.argsort()[:-5:-1]
    return related_docs_indices

try:
    similar_resume('Hitanshu Pande')
except IndexError:
    # x[0] raises IndexError when no ResourceName matches
    print("Resume not in database!")


156     Hendarsin Lukito
181          Jeff Benson
170         Jana Sanders
152    Hanmanth Jogiraju
Name: ResourceName, dtype: object
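
One detail in the cell above is worth checking: the slice `tfidf[employee_loc-1:employee_loc]` selects row `employee_loc - 1`, the resume listed just before the matched employee, and `argsort()[:-5:-1]` keeps four indices, not five. The sketch below shows one way to query the employee's own row and exclude that row from the results. It runs on a tiny made-up corpus; the names `top_similar`, `docs`, and `names` are illustrative and do not come from the notebook.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Made-up miniature corpus standing in for df_new.ResourceName / df_new.text.
names = ["Ann", "Bob", "Cy", "Dee"]
docs = [
    "python data analysis pandas",
    "python data engineering spark",
    "project management data agile scrum",
    "agile coaching scrum master",
]

# TfidfVectorizer L2-normalises rows by default, so a plain dot product
# (linear_kernel) equals cosine similarity.
tfidf = TfidfVectorizer().fit_transform(docs)


def top_similar(position, n=5):
    """Return positions of the n resumes most similar to the one at `position`."""
    scores = linear_kernel(tfidf[position:position + 1], tfidf).flatten()
    scores[position] = -np.inf            # never recommend the query resume itself
    return np.argsort(scores)[::-1][:n]   # highest score first


for idx in top_similar(names.index("Ann"), n=2):
    print(names[idx])
```

In the notebook's setting, `position` would be the row found in `df_new` for the employee name, and the returned positions would index `df_new.ResourceName`.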


----------------
#### As seen above, the employees returned for 'Hitanshu Pande' were Hendarsin Lukito, Jeff Benson, Jana Sanders, and Hanmanth Jogiraju.

#### This is a first step toward determining which employee should be suggested for a particular job opening. With enriched historical data, we could make more mature recommendations that account for previous staffing decisions, availability, travel preferences, and SO and IV allegiances.