<h2> Hard Skill Detection </h2>

Problem Statement:

- Develop a code that can clean a dataset containing technical jargon and extract Technical (Hard) skills
- Technical skills are defined as demonstrable and quantifiable skills. They can be tested to prove their capacity in each hard skill an individual possesses

Sample of 900 random examples of technical skills will be provided

Possible approaches:

- Identify acronyms
- Identify words that are not in dictionary
- Create reference using Example list
- Consider frequencies of words in the English language to identify unusual words/common words
- Treat numeric values
- Treat empty values



In [1]:
import numpy as np
import pandas as pd
import re
from wordfreq import zipf_frequency
import random

- Load data including raw data, example list of hard skills, and words in English language

In [2]:
data = pd.read_csv("Raw_Skills_Dataset.csv")
hardskills = pd.read_csv("Example_Technical_Skills.csv")
words_d = pd.read_csv("words.csv")


- Clean any missing values

In [3]:
words_d = words_d.dropna()
data = data.dropna()
hardskills = hardskills.dropna()

- Create list with all words in English dictionary

In [4]:
stringlist = list(words_d.iloc[:,0].astype('string'))

- Create list of lists containing words split by characters

In [5]:
res = []
for sub in stringlist:
    res.append(list(sub))


- Create dictionary with all letters as keys and all words starting with the corresponding letter under that letter's values

In [6]:
worddict = {}

for t in range(0,len(stringlist)):
    if list(stringlist[t])[0] not in list(worddict.keys()):
        worddict[list(stringlist[t])[0]] = list(stringlist[t])
    else:
        worddict[list(stringlist[t])[0]].append(stringlist[t])
    


- Create list containing data points of hard skills sample data, organized in lists containing individual words, with no special characters or numbers

In [7]:
wlist = pd.DataFrame(data = np.zeros(len(hardskills)), columns = ['Word Lists'])
wlist = []
for x in range(0,len(hardskills)):
    
    wlist.append(re.sub(r"[^a-zA-Z]"," ", hardskills.iloc[x,0]).split())

- Create dictionary containing all words that appear in hard skills sample list, ensuring that words like "the", "a", etc are not included

In [8]:
dictionary = {}

for x in wlist:
    for y in x:
        if zipf_frequency(y, 'en',wordlist='small') > 6: #value of 6 for frequency ensured that non-key words would not be included
            continue
        else:
            if y in list(dictionary.keys()):
                dictionary[y] += 1
            else: 
                dictionary[y] = 1
                
freq = sorted(dictionary, key=dictionary.get, reverse = True)
fredict = {}
for x in freq: #organized dictionary by frequency of word in hard skill sample
    fredict[x] = dictionary.get(x)

- Create list containing data points of raw data, organized in lists containing individual words, with no special characters or numbers

In [9]:
wlist_data = []
for x in range(0,len(data)):
    if len(re.sub(r"[^a-zA-Z]"," ", data.iloc[x,0]).split()) == 0:
        wlist_data.append(["a"]) #if list is empty, i.e. only had numbers/special characters, place placeholder of "a"
    else:
        wlist_data.append(re.sub(r"[^a-zA-Z]"," ", data.iloc[x,0]).split())

    
    

- Examined data point in raw data and classified using the following parameters:


    - Word appeared in Hard Skill sample
    - Word contains more than one capital letter
    - Unusual word as per frequency value
    - Word not appearing in English dictionary


In [10]:
analysis2 = pd.DataFrame(data = np.zeros(len(data)), columns = ['Result'], dtype = object)
reason = []
count = 0

for x in wlist_data:
    
    word_position = 0

    for y in x: 

        if y in fredict:
            analysis2.iloc[count,0] = 1
            reason.append("in sample of hard skills")
            count += 1
            break
        elif y.capitalize() in fredict:
            analysis2.iloc[count,0] = 1
            reason.append("in sample of hard skills")
            count += 1
            break
        elif sum(1 for c in y if c.isupper()) > 1: #checking for more than one capital letter per word
            analysis2.iloc[count,0] = 1
            reason.append("more than one capital letter")
            count += 1
            break  
        elif zipf_frequency(y, 'en',wordlist='small') < 3: #frequency of under 3 is considered unusual word
            analysis2.iloc[count,0] = 1
            reason.append("unusual word")
            count += 1
            break
        elif y.lower() not in worddict[y.lower()[0]]: #searching by initial letter vs. searching whole dictionary reduced run time from 20 minutes to 10 seconds
            analysis2.iloc[count,0] = 1
            reason.append("not in english dictionary")
            count += 1
            break
        else:
            if len(x) > 1 and word_position != len(x)-1: #cycle through all words until reaching last one
                word_position += 1 
            else: #when last word in list is reached and no classification has been made, classify as not a hard skill
                analysis2.iloc[count,0] = 0
                reason.append("not detected as hard skill")
                count += 1
    
analysis2.insert(0, "Data", list(data.iloc[:,0]))  
analysis2.insert(2, "Reason", reason)
analysis2.to_csv('Hard_Skill_Detection.csv', index = False)
analysis2[analysis2.iloc[:,1] == 1]["Data"].to_csv('Clean_Data.csv', index = False)

- Analyzing results using random sample of 25 data points

In [11]:
random.seed(55)
testing = random.sample(list(range(0,len(analysis2)+1)), k = 25)

In [12]:
analysis2.iloc[testing,:]

Unnamed: 0,Data,Result,Reason
5920,PySpark,1,more than one capital letter
12861,the core banking processor,1,in sample of hard skills
9828,iOS/Android automation frameworks,1,more than one capital letter
19836,the right tradeoff,1,unusual word
5220,Assess risks,0,not detected as hard skill
12043,announcements,0,not detected as hard skill
19787,e.g. PHP,1,in sample of hard skills
5800,a Data Orchestration solutions,1,in sample of hard skills
23044,client/developer feedback,1,in sample of hard skills
30930,analytical problem-solving skills,0,not detected as hard skill


- Results appear to indicate a 84% accuracy classification, missing index points 12861,19836,23044, and 25421. 


Future Considerations:

- Data set contained large number of hard skills, would have to try with a data set containing mostly soft skills
- Hard skills not extracted from sentences 
- Frequency of words library contained samples from many sources, including websites. Fiction based library could provide more reliable frequency values for tech words
- Further increase hard skill word bank
- Use NLTK tools to better analyze the language and adapt to never-before-seen cases

