## Problem statement
We are given a raw dataset of around 30k data points that contain technical skills mixed with other values. A second dataset provides around 900 examples of technical skills. The task is to extract the technical skills from the raw dataset.

In [2]:
import pandas as pd  
import numpy as np

In [3]:
# reading datasets
tech_skill_df=pd.read_csv("Example_Technical_Skills.csv")  
raw_skill=pd.read_csv("Raw_Skills_Dataset.csv")

In [4]:
tech_skill_df

Unnamed: 0,Technology Skills
0,SAP Fiori Developer
1,Oracle Instance Management & Strategy
2,Boomi Master Data Management
3,Digital Manufacturing on Cloud ( DMC)
4,DevOps
...,...
974,Oracle Cloud Revenue Management
975,Oracle EBS Grid Contral Mgt Pack
976,Amazon Elastic MapReduce (EMR)
977,Apache Kudu


In [5]:
raw_skill

Unnamed: 0,RAW DATA
0,What ifs
1,seniority
2,familiarity
3,functionalities
4,Lambdas
...,...
34111,negotiation
34112,deadlines
34113,"Self-motivated, enthusiastic and strong drive"
34114,negotiation


In [12]:
## check whether there are any NaN values in the datasets so that we can remove them
tech_skill_df.isnull().sum()

Technology Skills    0
dtype: int64

In [13]:
raw_skill.isnull().sum()

RAW DATA    0
dtype: int64

Neither dataset contains NaN values.

Next, we process the text in both datasets so that it can be used by our model.

In [14]:
import nltk

In [15]:
## downloading the datapackages that we will need
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

In [16]:
from nltk.corpus import stopwords  ## stop words, used to remove common words from the text
from nltk.stem import WordNetLemmatizer  ## lemmatizer, used to get the base form of a word
import re  ## regular expressions library

In [18]:
lemmatizer = WordNetLemmatizer()

In [19]:
stop_words = set(stopwords.words('english'))  ## build the stop-word set once instead of once per word
corpus1 = []  ## processed text of the tech_skill_df dataframe
for i in range(0,tech_skill_df.shape[0]):
    review = re.sub('[^a-zA-Z]', ' ', tech_skill_df['Technology Skills'][i]) # replace every character except a-z and A-Z with a space
    review = review.lower() # convert to lower case
    review = review.split() # split into words
    ## drop the stop words and lemmatize the remaining words
    review = [lemmatizer.lemmatize(word) for word in review if word not in stop_words]
    review = ' '.join(review) ## join the words back into a single string
    corpus1.append(review)   ## add the processed text to the corpus

In [20]:
## similarly processing the text of the raw_skill dataframe
stop_words = set(stopwords.words('english'))  ## stop-word set built once, not once per word
corpus2 = []
for i in range(0,raw_skill.shape[0]):
    review = re.sub('[^a-zA-Z]', ' ', raw_skill['RAW DATA'][i])
    review = review.lower()
    review = review.split()
    review = [lemmatizer.lemmatize(word) for word in review if word not in stop_words]
    review = ' '.join(review)
    corpus2.append(review)
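
The two preprocessing loops above apply the same steps, so they could be folded into one helper. The following is a minimal sketch, not the notebook's own code: `clean_text`, `demo_stop_words`, and the identity default for `lemmatize` are hypothetical names chosen so the example runs without the NLTK downloads. In the notebook one would pass `set(stopwords.words('english'))` and `lemmatizer.lemmatize` instead.

```python
import re

def clean_text(text, stop_words, lemmatize=lambda word: word):
    """Keep letters only, lowercase, drop stop words, lemmatize what is left."""
    words = re.sub(r'[^a-zA-Z]', ' ', text).lower().split()
    return ' '.join(lemmatize(w) for w in words if w not in stop_words)

# Tiny illustrative stop-word set (an assumption, not NLTK's English list)
demo_stop_words = {'on', 'and', 'the'}

print(clean_text('Digital Manufacturing on Cloud ( DMC)', demo_stop_words))
# digital manufacturing cloud dmc
print(clean_text('JPA2', demo_stop_words))
# jpa
```

The second call shows a side effect of the `[^a-zA-Z]` pattern: digits are removed along with punctuation, so a skill such as `JPA2` is reduced to `jpa` before vectorization.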

In [24]:
## viewing some processed text
corpus1[0:5]

['sap fiori developer',
 'oracle instance management strategy',
 'boomi master data management',
 'digital manufacturing cloud dmc',
 'devops']

In [29]:
## we will now convert the sentences into vectors using TFIDF vectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

In [30]:
tfidf_v=TfidfVectorizer()

In [31]:
X=tfidf_v.fit_transform(corpus1)  # learn the vocabulary from the example skills and return the TF-IDF matrix

In [32]:
Y=tfidf_v.transform(corpus2)  # transform the raw text into a matrix using the learned vocabulary

In [40]:
X.shape ## number of rows and number of features (no dense copy needed)

(979, 1232)

As we can see, the matrix contains 1232 features (words).
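
Because the vocabulary is learned only from the example skills, a raw entry made up entirely of unseen words is transformed into an all-zero row. The sketch below shows one way to detect such rows. The strings are toy stand-ins for `corpus1` and `corpus2`, and the names `example_skills`, `raw_entries`, and `is_out_of_vocabulary` are assumptions made for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-ins for corpus1 and corpus2 (illustrative values only)
example_skills = ['sap fiori developer', 'apache kudu', 'devops']
raw_entries = ['apache spark', 'negotiation', 'deadline']

vectorizer = TfidfVectorizer()
vectorizer.fit(example_skills)                  # vocabulary comes from the examples only
raw_matrix = vectorizer.transform(raw_entries)  # unseen words are silently ignored

# TF-IDF weights are non-negative, so a row sums to 0 only if every entry is 0
row_sums = np.asarray(raw_matrix.sum(axis=1)).ravel()
is_out_of_vocabulary = row_sums == 0

print(is_out_of_vocabulary.tolist())
# [False, True, True]
```

Applied to the real matrix, `is_out_of_vocabulary.sum()` would count the raw entries that reach the model as identical empty vectors. All of them necessarily receive the same prediction, whatever the original text was.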

This is a one-class classification problem: the example dataset contains only one class (technical skills), and we need to extract the technical skills from the raw dataset.
So we will build our model with a one-class support vector machine (OneClassSVM).

In [44]:
## import OneClassSVM from scikit-learn's support vector machine module
from sklearn.svm import OneClassSVM

In [45]:
model = OneClassSVM(kernel='rbf',gamma='auto')

In [47]:
model.fit(X)

OneClassSVM(gamma='auto')

In [48]:
pred=model.predict(Y) ## to get the predictions in the form of 1 and -1

In [49]:
pred

array([ 1,  1,  1, ..., -1,  1,  1])

The predictions are either 1 or -1. A value of 1 means the text belongs to our technical skills class, and -1 means it does not. We now extract from the raw skills dataset the texts predicted as 1, since the model assigns them to the technical skills class. To do this, we add the predictions as a column of the raw_skill dataframe and then filter on that column.
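
Besides the hard 1 / -1 labels, `OneClassSVM` also exposes `decision_function`, which returns a signed score that is positive for inliers and negative for outliers. The sketch below prints both side by side on toy data. It assumes the same kernel settings as the notebook, and `demo_model`, `example_skills`, and `raw_entries` are hypothetical names. No output is shown because the values depend on the fitted model.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import OneClassSVM

# Toy stand-ins for corpus1 and corpus2 (illustrative values only)
example_skills = ['sap fiori developer', 'oracle cloud revenue management',
                  'apache kudu', 'devops', 'boomi master data management']
raw_entries = ['oracle cloud', 'negotiation', 'apache kudu']

vectorizer = TfidfVectorizer()
X_demo = vectorizer.fit_transform(example_skills)
Y_demo = vectorizer.transform(raw_entries)

# Same kernel settings as the notebook; nu is left at its default value
demo_model = OneClassSVM(kernel='rbf', gamma='auto').fit(X_demo)

labels = demo_model.predict(Y_demo)            # 1 = inlier, -1 = outlier
scores = demo_model.decision_function(Y_demo)  # positive = inlier side of the boundary

for text, label, score in zip(raw_entries, labels, scores):
    print(f'{text!r}: label={int(label):+d}, score={float(score):.4f}')
```

Sorting the raw entries by score rather than relying only on the label is one way to review borderline cases by hand.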

In [70]:
raw_skill['predictions']=pred  ## adding prediction column to dataframe

In [71]:
## selecting the rows where predictions == 1 and converting them to a numpy array
final_texts=raw_skill[raw_skill.predictions==1]['RAW DATA'].to_numpy() 

In [81]:
final_texts ##array of predicted technical skills

array(['What ifs', 'seniority', 'familiarity', ..., 'deadlines',
       'negotiation', 'deadlines'], dtype=object)

In [89]:
df=pd.DataFrame(final_texts)  ##converting to dataframe
df

Unnamed: 0,0
0,What ifs
1,seniority
2,familiarity
3,ORM
4,JPA2
...,...
17454,all applicants
17455,negotiation
17456,deadlines
17457,negotiation


In [90]:
# write the predicted technical skills to a csv file
df.to_csv('predicted_technical_skills.csv')
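
With its default settings, `to_csv` stores the row index as an unnamed first column, which `read_csv` later reports as `Unnamed: 0`. The sketch below shows the filter-and-export step with a named column and `index=False`. It uses a toy dataframe and an in-memory buffer in place of the real file, and the output column name `Technology Skills` is an assumption borrowed from the example dataset.

```python
import io
import pandas as pd

# Toy stand-in for raw_skill with its predictions column (illustrative values only)
demo = pd.DataFrame({
    'RAW DATA': ['Lambdas', 'negotiation', 'JPA2', 'negotiation'],
    'predictions': [1, -1, 1, -1],
})

# Boolean mask: keep only the rows predicted as technical skills
kept = demo.loc[demo['predictions'] == 1, ['RAW DATA']]
kept = kept.rename(columns={'RAW DATA': 'Technology Skills'})

# index=False leaves the row index out of the file; the buffer stands in
# for 'predicted_technical_skills.csv'
buffer = io.StringIO()
kept.to_csv(buffer, index=False)
buffer.seek(0)

reloaded = pd.read_csv(buffer)
print(reloaded.columns.tolist())
# ['Technology Skills']
print(reloaded['Technology Skills'].tolist())
# ['Lambdas', 'JPA2']
```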