<h2><center>Extracting Job Descriptions</center></h2>

<left>

#### I will use the Hugging Face `datasets` library to load the dataset, which fetches and caches it for me so I do not need to download the files by hand. I will then select 15 random job descriptions, clean them, and save them to a new CSV file.

In [2]:
#importing dataset
from datasets import load_dataset
raw_df = load_dataset('jacob-hugging-face/job-descriptions', split='train')
raw_df


Dataset({
    features: ['company_name', 'job_description', 'position_title', 'description_length', 'model_response'],
    num_rows: 853
})

In [3]:
#checking first entry
raw_df[0]


{'company_name': 'Google',
 'job_description': 'minimum qualifications\nbachelors degree or equivalent practical experience years of experience in saas or productivity tools businessexperience managing enterprise accounts with sales cycles\npreferred qualifications\n years of experience building strategic business partnerships with enterprise customersability to work through and with a reseller ecosystem to scale the businessability to plan pitch and execute a territory business strategyability to build relationships and to deliver results in a crossfunctionalmatrixed environmentability to identify crosspromoting and uppromoting opportunities within the existing account baseexcellent account management writtenverbal communication strategic and analyticalthinking skills\nabout the job\nas a member of the google cloud team you inspire leading companies schools and government agencies to work smarter with google tools like google workspace search and chrome you advocate the innovative pow

In [4]:
# listing all position titles
raw_df['position_title']


['Sales Specialist',
 'Apple Solutions Consultant',
 'Licensing Coordinator - Consumer Products',
 'Web Designer',
 'Web Developer',
 'Frontend Web Developer',
 'Remote Website Designer',
 'Web Designer',
 'Web Designer',
 'SR. Web Designer',
 'Web Developer',
 'Web Developer',
 'Senior UI Designer',
 'Wordpress Web Developer',
 'UI Web Designer',
 'Senior Web Designer (REMOTE)',
 'Chief Executive Officer',
 'Executive Vice President & Chief Executive Officer (CEO), Medical...',
 'CEO',
 'CEO',
 'CEO, Positivly',
 'Chief Executive Officer - CEO Consultant',
 'Chief Executive Officer',
 'Chief Executive Officer - Healthcare - Columbus',
 'CEO Coach',
 'Assistant Vice President Premier Relationship Manager',
 'Digital Marketing Specialist',
 'Senior Marketing Specialist',
 'Performance Marketing Specialist, Paid Media',
 'Software Engineer - Reno, NV',
 'Software Engineer',
 'Entry Level Software Engineer',
 'Associate Software Engineering',
 'Software Engineer',
 'Software Engineer',
 '

In [5]:
#creating df out of dataset
import pandas as pd
pd.set_option('display.max_columns', None)
df = pd.DataFrame(raw_df)
df.head()

Unnamed: 0,company_name,job_description,position_title,description_length,model_response
0,Google,minimum qualifications\nbachelors degree or eq...,Sales Specialist,2727,"{\n ""Core Responsibilities"": ""Responsible fo..."
1,Apple,description\nas an asc you will be highly infl...,Apple Solutions Consultant,828,"{\n ""Core Responsibilities"": ""as an asc you ..."
2,Netflix,its an amazing time to be joining netflix as w...,Licensing Coordinator - Consumer Products,3205,"{\n ""Core Responsibilities"": ""Help drive bus..."
3,Robert Half,description\n\nweb designers looking to expand...,Web Designer,2489,"{\n ""Core Responsibilities"": ""Designing webs..."
4,TrackFive,at trackfive weve got big goals were on a miss...,Web Developer,3167,"{\n ""Core Responsibilities"": ""Build and layo..."


**Now I will shuffle the dataframe and select the first 15 rows.**

In [6]:
df = df.sample(frac=1, random_state=42).head(15)
df

Unnamed: 0,company_name,job_description,position_title,description_length,model_response
66,Grand-flo Spritvest Sdn Bhd,responsibilities\n\nproduce clean efficient co...,Software Engineer (Web),475,"{\n ""Core Responsibilities"": ""produce clean ..."
434,Legal Marketing And Staffing,it systems administrator\n our client is a gr...,Information Technology System Administrator,3598,"{\n ""Core Responsibilities"": ""Respond to hel..."
198,Randstad,office clerkdo you have experience in adminis...,Office Clerk,1600,"{\n ""Core Responsibilities"": ""Answering phon..."
212,Cardinal Logistics,hiring cdl a company drivers in ohio\nstarting...,CDL-A Truck Driver - $5K Sign On Bonus,932,"{\n ""Core Responsibilities"": ""Drive trucks t..."
652,Motivate LLC,motivate llc is seeking a qualified assistant ...,Assistant Manager,3866,"{\n ""Core Responsibilities"": ""Oversee routin..."
543,Cleveland Metroparks,the marketing intern works in an inhouse posit...,Marketing Intern,2810,"{\n ""Core Responsibilities"": ""Provides suppo..."
280,Macys--Remote,\n\ncustomer service representative full time...,Customer Service Representative,3655,"{\n ""Core Responsibilities"": ""Answer and res..."
296,SimplyInsured,anywhere in us per hour annual bonus co...,Data Entry Clerk,473,"{\n ""Core Responsibilities"": ""Review, prepar..."
365,Journeys,journeys providence place as a sales associa...,Part-Time Sales Associate,342,"{\n ""Core Responsibilities"": ""Meet and excee..."
679,Firstsource Solution USA,fulltime entry level great way to get hands o...,Remote Patient Advocate Specialist,5162,"{\n ""Core Responsibilities"": ""Contact patien..."
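
The shuffle-then-`head` pattern above is reproducible because of the fixed `random_state`. A minimal sketch on a toy DataFrame (the `toy` frame and its column values are made up for illustration) suggests two things worth knowing: the same seed returns the same rows, and the sample keeps the original index labels, which is why the output above is indexed 66, 434, 198 and so on.

```python
import pandas as pd

# Hypothetical stand-in for the job-description DataFrame.
toy = pd.DataFrame({"position_title": [f"job_{i}" for i in range(100)]})

# Shuffle every row with a fixed seed, then keep the first 15.
first_draw = toy.sample(frac=1, random_state=42).head(15)
second_draw = toy.sample(frac=1, random_state=42).head(15)

# Same seed, same rows, in the same order.
same_rows = first_draw.index.equals(second_draw.index)

# The sample keeps the original index labels; reset them if a 0..14 index is preferred.
renumbered = first_draw.reset_index(drop=True)
print(same_rows, len(first_draw), list(renumbered.index[:3]))
```

Whether to reset the index is a matter of taste; keeping the original labels makes it easy to trace a row back to the full dataset.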


**The useful part of the dataset is riddled with punctuation, symbols, and extra whitespace.**

In [7]:
#cleaning df

import re
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer


def cleanJD(text):
    # replace newlines with spaces
    cleaned_text = text.replace("\n", " ")
    # replace punctuation, symbols and underscores with spaces
    cleaned_text = re.sub(r'[^\w\s]|_', ' ', cleaned_text)
    cleaned_text = cleaned_text.lower()
    # collapse repeated whitespace and trim the ends
    cleaned_text = re.sub(r'\s+', ' ', cleaned_text).strip()
    return cleaned_text
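
To see what these cleaning steps do to a raw string, here is a standalone sketch that mirrors the regex steps of `cleanJD`. The name `clean_text` and the sample string are invented for illustration and are not taken from the dataset.

```python
import re

def clean_text(text):
    """Standalone copy of the cleaning steps, for illustration only."""
    text = text.replace("\n", " ")            # newlines -> spaces
    text = re.sub(r"[^\w\s]|_", " ", text)    # punctuation, symbols, underscores -> spaces
    text = text.lower()
    return re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace

# Made-up fragment shaped like a piece of a model_response string.
sample = '"Required Skills": "Class-A CDL,\n  clean_driving record!"'
cleaned = clean_text(sample)
print(cleaned)
```

Note that hyphens and apostrophes become spaces too, so a term like `Class-A` ends up as two tokens.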


In [9]:
#cleaning and storing the second comma-separated fragment of each model_response; this will be used in cosine similarity
df['skills'] = [cleanJD(response.split(',')[1]) for response in df['model_response']]
df.head()

Unnamed: 0,company_name,job_description,position_title,description_length,model_response,skills
66,Grand-flo Spritvest Sdn Bhd,responsibilities\n\nproduce clean efficient co...,Software Engineer (Web),475,"{\n ""Core Responsibilities"": ""produce clean ...",develop test and implement new or existing sof...
434,Legal Marketing And Staffing,it systems administrator\n our client is a gr...,Information Technology System Administrator,3598,"{\n ""Core Responsibilities"": ""Respond to hel...",maintain network integrity and security
198,Randstad,office clerkdo you have experience in adminis...,Office Clerk,1600,"{\n ""Core Responsibilities"": ""Answering phon...",filing
212,Cardinal Logistics,hiring cdl a company drivers in ohio\nstarting...,CDL-A Truck Driver - $5K Sign On Bonus,932,"{\n ""Core Responsibilities"": ""Drive trucks t...",required skills class a commercial driver s li...
652,Motivate LLC,motivate llc is seeking a qualified assistant ...,Assistant Manager,3866,"{\n ""Core Responsibilities"": ""Oversee routin...",onboarding
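
Splitting `model_response` on commas returns whatever fragment follows the first comma, which is why some `skills` values above are a single word such as `filing`. The field looks like a JSON object (the output shows a `"Core Responsibilities"` key, and a `"Required Skills"` key appears to follow it), so one alternative is to parse it instead. The sketch below assumes that structure and uses a made-up record; the helper `extract_field` is hypothetical, and the real key names should be checked against the dataset first.

```python
import json

# Hypothetical record shaped like the model_response strings shown above.
# The "Required Skills" key is an assumption inferred from the cleaned output.
response = """{
  "Core Responsibilities": "Answering phones, filing, data entry",
  "Required Skills": "Microsoft Office, attention to detail"
}"""

def extract_field(raw, key, default=""):
    """Return one field from a JSON string, or `default` if parsing fails."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return default
    if not isinstance(parsed, dict):
        return default
    return str(parsed.get(key, default))

by_split = response.split(",")[1]                     # the comma-split approach
by_json = extract_field(response, "Required Skills")  # the JSON approach
print(repr(by_split))
print(repr(by_json))
```

The JSON route returns the whole field regardless of how many commas it contains, and falls back to an empty string when a response is not valid JSON.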


In [10]:
#dropping columns that are no longer needed
df = df.drop(columns=['job_description', 'model_response', 'description_length'])
df

Unnamed: 0,company_name,position_title,skills
66,Grand-flo Spritvest Sdn Bhd,Software Engineer (Web),develop test and implement new or existing sof...
434,Legal Marketing And Staffing,Information Technology System Administrator,maintain network integrity and security
198,Randstad,Office Clerk,filing
212,Cardinal Logistics,CDL-A Truck Driver - $5K Sign On Bonus,required skills class a commercial driver s li...
652,Motivate LLC,Assistant Manager,onboarding
543,Cleveland Metroparks,Marketing Intern,required skills must be currently enrolled in ...
280,Macys--Remote,Customer Service Representative,returns and sales answer and resolve calls fro...
296,SimplyInsured,Data Entry Clerk,prepare and submit customer applications to in...
365,Journeys,Part-Time Sales Associate,provide full service experience to customers
679,Firstsource Solution USA,Remote Patient Advocate Specialist,provide instructions
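
The `skills` column is meant to feed a cosine-similarity comparison later. As a rough sketch of what that comparison computes, the snippet below scores two cleaned strings using plain word counts. The function name `cosine_similarity` and the query text are made up here, and the final task may well use TF-IDF or embeddings instead.

```python
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    """Cosine similarity between two texts using raw word counts."""
    counts_a, counts_b = Counter(text_a.split()), Counter(text_b.split())
    dot = sum(counts_a[w] * counts_b[w] for w in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction to compare
    return dot / (norm_a * norm_b)

job_skills = "maintain network integrity and security"  # taken from the table above
query = "network security administration"               # hypothetical candidate text
score = cosine_similarity(job_skills, query)
print(round(score, 3))
```

Because the score depends only on shared words, very short `skills` values such as `filing` or `onboarding` will match almost nothing, which is one reason to extract a fuller field.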


In [None]:
#saving jds.csv locally for final task
df.to_csv('jds.csv', index=False)
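
Passing `index=False` matters when the file is read back later. A small round trip on a toy frame (the file names and data here are made up) shows that writing the index produces an extra `Unnamed: 0` column on reload, while `index=False` does not.

```python
import os
import tempfile

import pandas as pd

# Toy frame with non-consecutive index labels, like the sampled rows.
toy = pd.DataFrame(
    {"position_title": ["Office Clerk", "Marketing Intern"], "skills": ["filing", "onboarding"]},
    index=[198, 543],
)

with tempfile.TemporaryDirectory() as tmp:
    with_index = os.path.join(tmp, "with_index.csv")
    without_index = os.path.join(tmp, "without_index.csv")
    toy.to_csv(with_index)                  # index written as an unnamed first column
    toy.to_csv(without_index, index=False)  # index dropped
    cols_with = list(pd.read_csv(with_index).columns)
    cols_without = list(pd.read_csv(without_index).columns)

print(cols_with)
print(cols_without)
```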