## Group1

### Material:
- [Pandas - Filter row and columns](https://python.plainenglish.io/filtering-rows-and-columns-in-pandas-python-techniques-you-must-know-6cdfc32c614c)
- [Pandas - Drop multiple columns](https://pythonexamples.org/pandas-dataframe-delete-column/#5)
- [Pandas - Check Pandas data type](https://datascientyst.com/check-dtype-column-columns-pandas-dataframe/#:~:text=%20How%20to%20Check%20the%20Dtype%20of%20Column,Check%20if%20column%20is%20numeric%2C%20dateti...%20More%20)
- [Data - Columns Views - Original Data](https://www.kaggle.com/datasets/sobhanmoosavi/us-accidents)
- [Pandas - Convert value in columns](https://stackoverflow.com/questions/52317459/python-pandas-convert-single-value-in-object-column)
- [Time Ranges / Time Comparison](https://pythonawesome.com/time-ranges-with-python/)
- [Remove columns or Rows in Pandas](https://www.bing.com/search?q=remove+column+from+pandas&cvid=0b68a851c23b4a55abbb755ec28ca2f6&aqs=edge..69i57j0l7j69i64.8187j0j1&pglt=931&FORM=ANNTA1&PC=U531)
- [Remove rows with certain criteria in Python Pandas](https://stackoverflow.com/questions/42125131/delete-row-based-on-nulls-in-certain-columns-pandas)
- [AI BOOKS](http://aima.cs.berkeley.edu/)
- https://towardsdatascience.com/gentle-start-to-natural-language-processing-using-python-6e46c07addf3
- https://monkeylearn.com/keyword-extraction/
- https://www.justintodata.com/use-nlp-in-python-practical-step-by-step-example/
- https://mathdatasimplified.com/
- https://www.kdnuggets.com/2019/11/content-based-recommender-using-natural-language-processing-nlp.html (7/28/2022)

In [None]:
# Data Pre-Processing - Job listing Dataset 
# Import necessary packages
!pip install rake_nltk
import nltk
nltk.download('stopwords')
nltk.download('punkt')
from rake_nltk import Rake
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import CountVectorizer
import os
import gc # Garbage collection, used to free memory
import re

## Support Functions

In [None]:
# Support function to print out any list
def pretty_print(word_list):
  index = 1
  for word in word_list:
    print(word, end=', ')
    if index % 10 == 0:
      print('')
    index += 1

In [None]:
# Garbage collection
gc.collect()

## Required library

In [None]:
from google.colab import drive
drive.mount('/content/drive')

## Job Data Analysis:
- Goal: identify the top 3 career choices and their success factors (for example salary growth, location, etc., as per the AI attributes)

### Read Data:

In [None]:
# Return a list with the names of the CSV files contained in the dataset path
def get_file_names(path):
  filenames = []
  for file in os.listdir(path):
    if file.endswith('.csv'):
      filenames.append(file)
  return filenames

# Read one file named in the file-name list and return it as a DataFrame
def GetFile(fnombre, path):
  # os.path.join works whether or not path ends with a slash
  location = os.path.join(path, fnombre)
  df = pd.read_csv(location)
  return df

file_path_job = './drive/MyDrive/datasets/jobposting/'
# Combine all the DataFrames into one using a list comprehension
dfjob = pd.concat([GetFile(file, file_path_job) for file in get_file_names(file_path_job)])

In [None]:
describe_chart = dfjob.describe(include='all')
describe_chart.T

Unnamed: 0,count,unique,top,freq,mean,std,min,25%,50%,75%,max
jobid,300000.0,300000.0,00009f127a9e34a7,1.0,,,,,,,
apply_link,135806.0,135806.0,https://www.indeed.com/applystart?jk=00009f127...,1.0,,,,,,,
company_link,286348.0,286348.0,https://www.indeed.com/cmp/The-Est%C3%A9e-Laud...,1.0,,,,,,,
company_name,286666.0,97715.0,Deloitte,3804.0,,,,,,,
company_rating,217652.0,,,,3.568953,0.498957,1.0,3.3,3.6,3.9,5.0
company_reviews_count,217652.0,,,,3997.138106,12995.71481,2.0,47.0,310.0,2029.0,236369.0
country,300000.0,1.0,US,300000.0,,,,,,,
country_code,293806.0,166.0,US,282417.0,,,,,,,
current_url,300000.0,300000.0,https://www.indeed.com/viewjob?jk=00009f127a9e...,1.0,,,,,,,
date_posted,300000.0,66.0,30+ days ago,112773.0,,,,,,,


### Check Missing Values and Clean Up
Truc Report:
- Data cleaning took me more than 8 hours in total, most of it spent deciding which data to keep and which to drop.
- All the attributes need to make sense and support the machine learning model.
- Data that is considered biased will be dropped.
- Missing data needs to be fixed and transformed into meaningful data.
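
Before deciding what to keep or drop, it helps to see how much is missing in each column. The sketch below is only an illustration: `toy` is a made-up frame standing in for `dfjob`, and the values are not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy frame standing in for dfjob; the values are made up.
toy = pd.DataFrame({
    "company_name": ["Acme", None, "Globex", "Initech"],
    "company_rating": [4.0, 3.5, None, None],
    "salary_formatted": [None, None, "From $30 an hour", None],
})

# Count and share of missing values per column, worst column first.
missing = pd.DataFrame({
    "missing": toy.isna().sum(),
    "share": toy.isna().mean(),
}).sort_values("share", ascending=False)

print(missing)
```

Running the same two calls on `dfjob` would give the per-column picture that the cleaning decisions in the next cell rely on.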

In [None]:
dfjob.drop(['jobid','apply_link','company_link','country','current_url','date_posted','date_posted_parsed','domain','region','srcname'],axis=1,inplace=True)

# The output will be the company name and job title
# Remove apply_link because it is not necessary (we want to analyze the successful candidates as well as the current ones)
# An apply_link can be removed when the job is filled, which is a good sign for analyzing these job descriptions (apply_link holds no value)

# Drop the rows where the company name or job type is missing
# (thresh=2 keeps only rows with both values present; how= must not be passed together with thresh)
dfjob.dropna(axis=0, subset=['company_name', 'job_type'], thresh=2, inplace=True)

# Change null in qualification to no requirement
dfjob['qualifications'] = dfjob['qualifications'].fillna('["No requirement"]')

# Change null in benefits to no benefits 
dfjob['benefits'] = dfjob['benefits'].fillna('["No benefits"]')

# Assume every missing value in salary_formatted is negotiable (50% of the dataset)
dfjob['salary_formatted'] = dfjob['salary_formatted'].fillna('Negotiable')

# Assume all the missing country is Others
dfjob['country_code']=dfjob['country_code'].fillna('Other')

# Fill in the rating with 0
dfjob['company_rating']=dfjob['company_rating'].fillna(0.0)
dfjob['company_reviews_count']=dfjob['company_reviews_count'].fillna(0.0)

# Drop the rows where the company rating or review count is 0:
dfjob = dfjob[(dfjob.company_rating != 0) & (dfjob.company_reviews_count != 0)]
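
The `dropna` call in the cell above works on a two-column subset with `thresh=2`, which keeps a row only when both values are present. On two columns that is the same as `how='any'`. The sketch below checks this on a made-up frame (`toy` and its values are assumptions for illustration). As far as I know, pandas 2.x raises a `TypeError` when `how` and `thresh` are passed together, so only one of them should be given.

```python
import pandas as pd

# Hypothetical toy frame; the values are made up for illustration.
toy = pd.DataFrame({
    "company_name": ["Acme", None, "Globex", None],
    "job_type": ['["Full-time"]', '["Contract"]', None, None],
})

# thresh=2: keep rows with at least 2 non-missing values in the subset.
by_thresh = toy.dropna(subset=["company_name", "job_type"], thresh=2)

# how="any": drop rows with any missing value in the subset.
by_any = toy.dropna(subset=["company_name", "job_type"], how="any")

print(by_thresh.index.tolist())
print(by_thresh.equals(by_any))
```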

In [None]:
dfjob.shape

(199173, 12)

In [None]:
dfjob.head()

Unnamed: 0,company_name,company_rating,company_reviews_count,country_code,description,description_text,job_title,job_type,location,salary_formatted,benefits,qualifications
0,The Estée Lauder Companies,4.0,2214.0,US,<div>\n <p>The Treasury Analyst will assist th...,The Treasury Analyst will assist the Treasury ...,"Analyst, Treasury – Banking Retail","[""Full-time""]",United States,Negotiable,"[""No benefits""]","[""No requirement""]"
2,Accenture,4.0,21827.0,US,<div></div>\n<div>\n <div>\n <div>\n <b>ACC...,ACCENTURE's Flexible Workforce solves clients’...,Cloud Architect,"[""Contract""]",United States,Negotiable,"[""No benefits""]","[""No requirement""]"
3,Techo-Bloc,3.1,114.0,AO,<div>\n Company Description\n <p><b><br> Why W...,Company Description\n Why We Want You: Multi...,Maintenance Technician - $30+/hr Day Shift (El...,"[""Full-time""]","Angola, IN 46703",From $30 an hour,"[""401(k)"",""Dental insurance"",""Disability insur...","[""No requirement""]"
7,McMahon Associates,3.8,8.0,US,<div>\n <p><b>POSITION SUMMARY:</b></p> \n <p>...,POSITION SUMMARY: \n The Senior Project Engine...,Senior Project Engineer (Transportation - Design),"[""Full-time""]","Camp Hill, PA 17011",Negotiable,"[""No benefits""]","[""No requirement""]"
8,Amazon Kuiper Manufacturing,3.5,82832.0,US,<div>\n <ul>\n <li>BS degree or higher in Ele...,BS degree or higher in Electrical or Computer ...,Design Verification Manager,"[""Full-time""]","Austin, TX",Negotiable,"[""No benefits""]","[""No requirement""]"


In [None]:
describe_chart = dfjob.describe(include='all')
describe_chart.T

Unnamed: 0,count,unique,top,freq,mean,std,min,25%,50%,75%,max
company_name,199173.0,48175.0,Deloitte,3746.0,,,,,,,
company_rating,199173.0,,,,3.564244,0.502832,1.0,3.3,3.6,3.9,5.0
company_reviews_count,199173.0,,,,3945.074357,12875.904782,2.0,44.0,288.0,1906.0,236369.0
country_code,199173.0,155.0,US,187555.0,,,,,,,
description,199173.0,168669.0,"<div>\n Assurance believes that you’re unique,...",874.0,,,,,,,
description_text,199173.0,168450.0,"Assurance believes that you’re unique, and you...",875.0,,,,,,,
job_title,199173.0,123554.0,Assistant Manager,438.0,,,,,,,
job_type,199173.0,186.0,"[""Full-time""]",132319.0,,,,,,,
location,199173.0,28117.0,Remote,2850.0,,,,,,,
salary_formatted,199173.0,14595.0,Negotiable,132510.0,,,,,,,


## Bag of words and cosine similarity

In [None]:
# drop for SVM, Bag of words and cosine similarity
dfjob.drop(['country_code', 'description', 'job_type', 'salary_formatted', 'benefits'],axis=1,inplace=True)

In [None]:
dfjob.head()

Unnamed: 0,company_name,company_rating,company_reviews_count,description_text,job_title,location,qualifications
0,The Estée Lauder Companies,4.0,2214.0,The Treasury Analyst will assist the Treasury ...,"Analyst, Treasury – Banking Retail",United States,"[""No requirement""]"
2,Accenture,4.0,21827.0,ACCENTURE's Flexible Workforce solves clients’...,Cloud Architect,United States,"[""No requirement""]"
3,Techo-Bloc,3.1,114.0,Company Description\n Why We Want You: Multi...,Maintenance Technician - $30+/hr Day Shift (El...,"Angola, IN 46703","[""No requirement""]"
7,McMahon Associates,3.8,8.0,POSITION SUMMARY: \n The Senior Project Engine...,Senior Project Engineer (Transportation - Design),"Camp Hill, PA 17011","[""No requirement""]"
8,Amazon Kuiper Manufacturing,3.5,82832.0,BS degree or higher in Electrical or Computer ...,Design Verification Manager,"Austin, TX","[""No requirement""]"


In [None]:
describe_chart = dfjob.describe(include='all')
describe_chart.T

Unnamed: 0,count,unique,top,freq,mean,std,min,25%,50%,75%,max
company_name,199173.0,48175.0,Deloitte,3746.0,,,,,,,
company_rating,199173.0,,,,3.564244,0.502832,1.0,3.3,3.6,3.9,5.0
company_reviews_count,199173.0,,,,3945.074357,12875.904782,2.0,44.0,288.0,1906.0,236369.0
description_text,199173.0,168450.0,"Assurance believes that you’re unique, and you...",875.0,,,,,,,
job_title,199173.0,123554.0,Assistant Manager,438.0,,,,,,,
location,199173.0,28117.0,Remote,2850.0,,,,,,,
qualifications,199173.0,9070.0,"[""No requirement""]",173994.0,,,,,,,


**Result:**
- Based on the analysis above, I conclude that the no-rating rows come from companies that posted a job only once. The chance of promotion at these companies is small given the number of jobs posted and the missing rating. Therefore, I will remove the rows associated with these companies, where the chance is small and there are no reviews.

### Adding Resume data

In [None]:
# Open the resume text file and read it line by line
with open('./drive/MyDrive/resume.txt', 'r') as file1:
    # readlines() returns one string per line and stops at end of file
    resume_data = file1.readlines()

In [None]:
type(resume_data)

list

In [None]:
# Clean up the resume
# Strip bracketed text, then keep only alphabetic words, which drops numbers and punctuation
for i in range(0,len(resume_data)):
    resume_data[i] = re.sub(r'\[.*?\]', '', resume_data[i])
    word1 = " ".join(re.findall("[a-zA-Z]+", resume_data[i]))
    resume_data[i] = word1

# keyword_dict is a list that holds all the keywords
keyword_dict = []

for line in resume_data:
    li = list(line.split(" "))
    for string_ in li:
        keyword_dict.append(string_.lower()) # Convert the string to lower

# Words that are not useful for the search can be removed
remove_characters = ['','a','truc','huynh','through','self','classroom','ide','concepts','founder','manager','online','first','second','are','was','unsatisfied',
                     'an','to','on','and','that','this','the','by','in','with','s','of','non','co','my','your','his','her','they','their','he','she','it','under',
                     'may','guided','submit','vietnam','cis','any','unsatisfied','services','for','watercraft','specialist','us','recomendation','years','work','team',
                     'customer','ensure','supply','work','year','plans','customer','developing','records','technologies','computer','monitoring','building','market',
                     'ensures','supply','options','learn','master','recommendation','science','risk','strategize','experienced','create','tracking','stock','students',
                     'previous','concerns','structures','budget','next','methods','stakeholders','define','making','profits','achievement','address','routine','installed',
                     'visual','higher','coming','teaching','letters','chain','content','trading','cross','headquarters','audiences','increase','warehouse','loss','car',
                     'advice','highly','shows','toward','commander','compare','fiscal','directly','instructor','reduced','working','project','monitor','learning',
                     'ethical','teach','trade']

soft_skill_remove = ["structure", "experience", "requirements", "worked", "years", "others", "skills", "communication", "ability", "application", "program", "customers",
                     "company", "information", "plan", "knowledge", "benefit", "process", "training", "developed", "assistant", "support", "schedules", "education", 
                     "provided", "business", "operation", "systems", "oriented", "level", "base", "strong", "procedures", "organization", "functional", "practices", 
                     "reports", "office", "people", "certificate", "pay", "industries", "accountable", "staff", "associate", "full", "equipment", "technology", 
                     "maintaining", "design", "record", "clients", "bachelor", "projects", "issues", "using", "relationship", "internal", "technical", "collaborative", 
                     "meet", "implementation", "sales", "background", "detail", "preparing", "lead", "build", "coordination", "monitored", "different", "software", 
                     "marketing", "result", "weeks", "testing", "financial", "security", "proficient", "ensure", "decision", "improve", "engineer", "efficiency", 
                     "driving", "first", "futures", "instruction", "contracts", "strategies", "conducted", "attention", "identified", "analytics", "evaluated"]

for char in remove_characters:
    while(char in keyword_dict) :
        keyword_dict.remove(char)

for char in soft_skill_remove:
    while(char in keyword_dict) :
        keyword_dict.remove(char)

# Remove repeated words while keeping first-seen order
keyword_dict = list(dict.fromkeys(keyword_dict))

# Number of keywords left in the list
len(keyword_dict)

102

In [None]:
# Display the keyword list
pretty_print(keyword_dict)

stack, developer, management, analyzing, data, scientist, machine, cyber, introduce, python, 
react, git, docker, web, development, ms, visio, jira, github, slack, 
html, css, json, bootstrap, r, shiny, server, framework, flask, restful, 
api, javascript, heroku, paas, lifecycle, agile, methodologies, visualization, dashboard, bi, 
mining, ml, ai, database, sql, mysql, algorithms, c, java, spring, 
mvc, net, eclipse, studio, vs, code, anaconda, pycharm, jupiter, notebook, 
servlet, apache, tomcat, automation, bot, script, hacking, ui, ux, explaining, 
multitask, virtual, machines, windows, linux, mac, osx, autocad, d, modelling, 
inventory, planning, forecasting, optimization, logistics, teams, prototype, predict, volatility, analyst, 
infrastructure, social, media, compared, logistic, vso, army, medals, vessel, lab, 
coding, research, 
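
The lower-casing, stop-word removal and de-duplication steps above can also be written as a single pass with a set lookup. This is just a sketch of the same idea, not the code that produced the list above; `filter_keywords` and the sample words are hypothetical.

```python
def filter_keywords(words, stop_words):
    """Lower-case, drop stop words, de-duplicate, keep first-seen order."""
    stop = set(stop_words)  # set membership avoids scanning the list repeatedly
    lowered = (w.lower() for w in words)
    # dict.fromkeys keeps the first occurrence of each word, in order
    return list(dict.fromkeys(w for w in lowered if w not in stop))

# Hypothetical input, for illustration only.
words = ["Python", "and", "SQL", "python", "", "the", "Docker"]
print(filter_keywords(words, ["", "and", "the"]))
```

With the real data, `remove_characters + soft_skill_remove` would play the role of `stop_words`.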

### Sampling

The number of words is too large, so I will reduce the data to 1,000 rows so that we can see how our model works and develop it further later.

In [None]:
sample_size=1000

# Create a sample of sample_size (1000) rows
dfjob_s_2 = dfjob.sample(n=sample_size)

# Inspect the first rows of the sample
dfjob_s_2.head()

Unnamed: 0,company_name,company_rating,company_reviews_count,description_text,job_title,location,qualifications
3595,CACI,3.8,2086.0,Mid-Level Human Resource Support - TS/SCI with...,Mid-Level Human Resource Support - TS/SCI with...,"Chantilly, VA 20151","[""No requirement""]"
27998,Integris Health,3.3,245.0,"Full time Acute Care Nurse Practitioner (APRN,...",Acute Care Nurse Practitioner - Critical Care,"Oklahoma City, OK","[""No requirement""]"
29590,Prospect Healthcare,4.5,6.0,A lovely small animal general practice located...,Associate Veterinarian or Medical Director J17...,"Bentonville, AR","[""No requirement""]"
18455,Southern Glazer's Wine & Spirits,3.5,1499.0,What You Need To Know Overview The Inv...,Inventory Accounting Processor (FT),"Dallas, TX","[""No requirement""]"
16145,"DRMP, Inc.",3.7,30.0,Job Description:\n DRMP is a multi-discipline...,Project Engineer / Project Manager - Utilities,"Fort Myers, FL 33901","[""No requirement""]"
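
The sample above changes on every run because `sample` is called without a seed, so the recommendations further down change too. A minimal sketch of a repeatable draw, assuming a fixed `random_state` is acceptable (the seed value 42 and the `toy` frame are arbitrary choices for illustration):

```python
import pandas as pd

# Hypothetical toy frame standing in for dfjob.
toy = pd.DataFrame({"job_title": [f"job {i}" for i in range(50)]})

# random_state is an assumption added here; the original call does not set one.
sample_a = toy.sample(n=10, random_state=42)
sample_b = toy.sample(n=10, random_state=42)

# Same seed, same rows, so results can be compared between runs.
print(sample_a.equals(sample_b))
```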


In [None]:
# POS tagging requires NLTK and its averaged_perceptron_tagger model
import nltk
from nltk import pos_tag
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

nltk.download('averaged_perceptron_tagger')
nltk.download('punkt')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [None]:
ps = PorterStemmer()

# process the job description.
def prepare_job_desc(desc):
    # tokenize description.
    tokens = word_tokenize(desc)
        
    # Parts of speech (POS) tag tokens.
    token_tag = pos_tag(tokens)
    
    # Only include some of the POS tags.
    include_tags = ['VBN', 'VBD', 'JJ', 'JJS', 'JJR', 'CD', 'NN', 'NNS', 'NNP', 'NNPS']
    filtered_tokens = [tok for tok, tag in token_tag if tag in include_tags]
    
    # stem words.
    stemmed_tokens = [ps.stem(tok).lower() for tok in filtered_tokens]
    return set(stemmed_tokens)

dfjob_s_2['keywords'] = dfjob_s_2['description_text'].map(prepare_job_desc)

In [None]:
# Convert the data to list
dfjob_s_2['keywords'] = dfjob_s_2['keywords'].apply(list)

# Drop the unnecessary columns
dfjob_s_2.drop(['description_text'], axis=1, inplace=True)

# Merge the first 4 columns (company name, rating, review count, job title) to create the company portfolio
dfjob_s_2['jobs_all_information'] = dfjob_s_2[dfjob_s_2.columns[0:4]].apply(
    lambda x: '|'.join(x.dropna().astype(str)),
    axis=1)

# Remove unnecessary attributes
dfjob_s_2.drop(['company_name', 'company_rating', 'company_reviews_count', 'job_title', 'location'], axis=1, inplace=True)

In [None]:
# Build the bag of words from the keywords column (columns[1:2] selects only 'keywords')
dfjob_s_2['bag_of_words'] = dfjob_s_2[dfjob_s_2.columns[1:2]].apply(
    lambda x: ','.join(x.dropna().astype(str)),
    axis=1)

# Drop the unnecessary columns
dfjob_s_2.drop(['qualifications','keywords'], axis=1, inplace=True)

dfjob_s_2['bag_of_words'] = dfjob_s_2['bag_of_words'].apply(str)

dfjob_s_2['bag_of_words'] = dfjob_s_2['bag_of_words'].apply(lambda x: x.replace("'",''))
dfjob_s_2['bag_of_words'] = dfjob_s_2['bag_of_words'].apply(lambda x: x.replace(",",''))

dfjob_s_2 = dfjob_s_2.reset_index(drop=True)

In [None]:
dfjob_s_2

Unnamed: 0,jobs_all_information,bag_of_words
0,CACI|3.8|2086.0|Mid-Level Human Resource Suppo...,[financi impact work sociolog 10 full-tim educ...
1,Integris Health|3.3|245.0|Acute Care Nurse Pra...,[pleas respiratori limit blood follow-up manag...
2,Prospect Healthcare|4.5|6.0|Associate Veterina...,"[prefer veterinarian healthi chill gener ""appl..."
3,Southern Glazer's Wine & Spirits|3.5|1499.0|In...,[gener gaap organ family-own limit supervis bs...
4,"DRMP, Inc.|3.7|30.0|Project Engineer / Project...",[prefer de-brief ” organ work ask conceiv serv...
...,...,...
995,West Virginia Wesleyan College|4.0|31.0|Assist...,[prefer orient digniti gener ideal.found work ...
996,"AmeriGas Propane, Inc.|2.9|1173.0|CDL Transpor...",[prefer basi tractor/tank orient gener worker ...
997,"Harris Health System|3.7|541.0|Manager, Care M...",[prefer pathway lmsw-ap orient speak organ wor...
998,Aya Healthcare|4.2|329.0|LTC LVN / LPN,[gener help work pleas use click licensur 12:0...


In [None]:
# Run this cell only once; each run appends another resume row
# Join the keywords into one string
keyword_dict_str = ' '.join([str(elem) for elem in keyword_dict])

# Append the resume as the last row of the job DataFrame
dfjob_s_2.loc[len(dfjob_s_2.index)] = ['new_candidates_resume',keyword_dict_str]

In [None]:
# Drop the appended rows if the data does not match
# dfjob_s_2 = dfjob_s_2.drop(index=[1000, 1001, 1002, 1003])

In [None]:
# Transform the bag of words into a matrix of token counts
count = CountVectorizer()
count_matrix = count.fit_transform(dfjob_s_2['bag_of_words'])

# Create the matrix to compare
cosine_sim = cosine_similarity(count_matrix, count_matrix)

# create a Series of job titles, so that the series index can match the row and column index of the similarity matrix.
indices = pd.Series(dfjob_s_2['jobs_all_information'])

# Validate similarity matrix
print(cosine_sim)

[[1.         0.17413824 0.22791836 ... 0.19094357 0.21823948 0.00736929]
 [0.17413824 1.         0.14901977 ... 0.15875666 0.11958351 0.01704924]
 [0.22791836 0.14901977 1.         ... 0.19246755 0.19054021 0.00679142]
 ...
 [0.19094357 0.15875666 0.19246755 ... 1.         0.17847433 0.00817888]
 [0.21823948 0.11958351 0.19054021 ... 0.17847433 1.         0.04723238]
 [0.00736929 0.01704924 0.00679142 ... 0.00817888 0.04723238 1.        ]]
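
Each entry of the matrix above is the cosine of the angle between two word-count vectors: the dot product divided by the product of the vector lengths. The hand-rolled sketch below shows the arithmetic on made-up strings. It is not how scikit-learn computes it internally, and it splits on whitespace, whereas `CountVectorizer` with default settings also ignores single-character tokens (so resume keywords such as `r`, `c` and `d` do not count).

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two whitespace-tokenized strings."""
    va, vb = Counter(a.split()), Counter(b.split())
    # Dot product over the words the two texts share
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical bags of words, for illustration only.
resume = "python sql docker"
job_dev = "python docker linux"
job_nurse = "patient care nurse"

print(cosine(resume, job_dev))    # two of three words shared
print(cosine(resume, job_nurse))  # no words shared
```

Two shared words out of three give a score of about 0.67, and no shared words give 0, which is why the last row and column of the matrix (the resume) are close to 0 for most jobs.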


In [None]:
(dfjob_s_2['bag_of_words'].iloc[1000])

'stack developer management analyzing data scientist machine cyber introduce python react git docker web development ms visio jira github slack html css json bootstrap r shiny server framework flask restful api javascript heroku paas lifecycle agile methodologies visualization dashboard bi mining ml ai database sql mysql algorithms c java spring mvc net eclipse studio vs code anaconda pycharm jupiter notebook servlet apache tomcat automation bot script hacking ui ux explaining multitask virtual machines windows linux mac osx autocad d modelling inventory planning forecasting optimization logistics teams prototype predict volatility analyst infrastructure social media compared logistic vso army medals vessel lab coding research'

In [None]:
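def recommend(resume_title, cosine_sim = cosine_sim):
    # Row position of the resume in the similarity matrix
    idx = indices[indices == resume_title].index[0]
    # Rank every row by its similarity to the resume, highest first
    score_series = pd.Series(cosine_sim[idx]).sort_values(ascending = False)
    # Skip the first entry (expected to be the resume itself) and keep the next 10
    top_10_indices = list(score_series.iloc[1:11].index)

    # Look up the job information for each of the top 10 positions
    return [dfjob_s_2['jobs_all_information'].iloc[i] for i in top_10_indices]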

In [None]:
recommend('new_candidates_resume', cosine_sim = cosine_sim)

['CACI|3.8|2086.0|Sr. Software Engineer',
 'Cella|3.4|16.0|SENIOR REACT FRONT END DEVELOPER',
 'Deloitte|4.0|10699.0|Java Developer Senior Consultant',
 'MATRIX Resources|4.0|94.0|Java Developer',
 'Humanity|4.2|36.0|Sr. Web Developer - Media',
 'Capgemini|3.8|8685.0|Senior Java Developer',
 'Intermex Wire Transfer|3.5|46.0|Lead Software Engineer',
 'Chenega Corporation|3.6|669.0|Software Developer – JavaScript',
 'Wipro Limited|3.8|15348.0|Sales Engineer',
 'Tesla|3.4|5148.0|Software QA Engineer – Vision Automation (Reno)']
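
The list above shows the matches but not how strong each one is. A possible variation, sketched below in plain Python, returns the score next to each job and excludes the resume row by position instead of assuming it is ranked first. `top_matches`, `labels` and `sim_row` are hypothetical names with made-up values.

```python
def top_matches(sim_row, labels, query_pos, k=3):
    """Return the k (label, score) pairs most similar to the query row."""
    # Rank every position except the query itself, highest score first
    ranked = sorted(
        (i for i in range(len(sim_row)) if i != query_pos),
        key=lambda i: sim_row[i],
        reverse=True,
    )
    return [(labels[i], sim_row[i]) for i in ranked[:k]]

# Made-up similarity row for a resume stored at position 3.
labels = ["Java Developer", "Nurse", "Web Developer", "resume"]
sim_row = [0.40, 0.01, 0.35, 1.00]
print(top_matches(sim_row, labels, query_pos=3, k=2))
```

In the notebook, `sim_row` would correspond to `cosine_sim[idx]` and `labels` to the `jobs_all_information` column.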