# Webscraping job descriptions from Indeed for keyword extraction:

## Introduction:



## Know the ATS bots: 

ATS stands for applicant tracking system. In short, an ATS is a piece of software used by employers to scan and rank the online job applications they receive for their open positions. These bots were initially created with large organizations in mind, which needed help sifting through the thousands of incoming applications they received on a weekly basis. An estimated 95% of Fortune 500 companies currently use an ATS to manage their applicant tracking process. Today, this software has become popular with employers and recruiting firms of all shapes and sizes.

Think of ATSs as the gatekeepers to your dream job. You’ve got to get past them first in order to succeed.


## Keywords matter

Action verbs like “outperformed,” “solved,” “led,” and “delivered” are essential when crafting a resume.

Beyond those, each job description has its own specific keywords. For example, the skills listed for a data scientist will probably include Python, R, SQL, machine learning, Hadoop, etc.

Companies also want to see other skills, such as communication and leadership. These are generic keywords, though, and I would like to generate industry-specific ones. Or at least I hope so.

These compelling action verbs powerfully show off what you did in each of your roles. However, when it comes to the bots, you’ve got to kick things up a notch.

The most important element — beyond formatting your resume so it can be accurately ‘read’ and parsed by the ATS — is keyword optimization. This is how the applicant tracking system determines if you possess the necessary qualifications to be considered for the position. In addition to listing out a specific term, be sure to also include any common abbreviations to cover your bases.

However, keyword stuffing, i.e. packing your resume and cover letter with buzzwords, is not something you want to do.

__If the ATS can’t sift through the B.S., I guarantee the recruiter or hiring manager will — and then promptly dismiss your application. Instead use keywords sparingly and intelligently.__

To make sure your resume is compatible with an ATS, the usual advice is to work the best keywords into your resume 2-3 times, with at least one of those references falling within your Work Experience or Education section. It’s one thing to state that “SEO (search engine optimization)” is among your core competencies, but it’s another thing entirely to show where in your employment history you leveraged that knowledge to add value to an organization.
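
As a rough way to check the "2-3 times" guideline, here is a minimal sketch that counts how often each keyword appears in a piece of resume text. The function name `keyword_counts`, the sample resume string and the keyword list are all made up for illustration; they are not part of the notebook.

```python
import re
from collections import Counter

def keyword_counts(resume_text, keywords):
    """Count case-insensitive occurrences of each keyword or phrase."""
    text = resume_text.lower()
    counts = Counter()
    for kw in keywords:
        # Lookarounds instead of \b so that terms ending in symbols (e.g. "c++") still match
        pattern = r"(?<!\w)" + re.escape(kw.lower()) + r"(?!\w)"
        counts[kw] = len(re.findall(pattern, text))
    return counts

# Hypothetical resume snippet and keyword list
resume = (
    "Core competencies: SEO (search engine optimization), SQL, Python. "
    "Experience: used Python and SQL to build reports; led SEO audits."
)
counts = keyword_counts(resume, ["SEO", "Python", "SQL", "machine learning"])

# Keywords that appear fewer than 2 times are candidates to work in
under_used = [kw for kw, n in counts.items() if n < 2]
print(dict(counts))
print("under-used:", under_used)
```

A count like this only says whether a keyword is present; it cannot tell whether the keyword is used meaningfully, which is what the recruiter will judge.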


# Motivation behind this notebook: 

Basically, I wanted to learn and practise web scraping and NLP techniques.

The idea is to help me and my friends who are graduating this year polish our resumes with industry-specific keywords.

The high level idea:

1. The user goes to indeed.com and searches for the job title they want keywords for.

For example: the user types "data science intern" on the Indeed home page and clicks search.

2. They then copy the URL of the first results page into the variable __initial_searchpage_url__.

3. They then run the script shown below. It visits each job link and copies all the text from the job posting.

4. Finally, it returns a list of all the lines found in the jobs that were scraped.



In [236]:
from bs4 import BeautifulSoup
from urllib.request import urlopen as uReq
import requests

base_url = "http://www.indeed.com"

# utility function that loads a url and returns a BeautifulSoup object for its html
def url_loader(my_url):
    UClient = uReq(my_url)
    page_html = UClient.read()
    UClient.close()
    page_soup = BeautifulSoup(page_html, "html.parser")
    return page_soup



## Function 2: scrapes the urls for a desired job description keyword search
def scrape_pages(initial_searchpage_url, no_of_jobs=10):
    
    # scrape the job links on the first results page
    resp = requests.get(initial_searchpage_url)
    soup = BeautifulSoup(resp.content, "html.parser") # name the parser explicitly to avoid a bs4 warning
    urls = soup.findAll('a', {'rel': 'nofollow', 'target': '_blank'})
    urls = [link['href'] for link in urls]
    
    # build the links for the other result pages (page 2, page 3, ...), stepping 10 jobs at a time
    pages = []
    for i in range(10, no_of_jobs, 10):
        pages.append(initial_searchpage_url + '&start=' + str(i))
    
    # for each page in the list pages --> add its job post links to the list urls
    for page in pages:
        r = requests.get(page)
        s = BeautifulSoup(r.content, "html.parser")
        ur = s.findAll('a', {'rel': 'nofollow', 'target': '_blank'})
        urls.extend(link['href'] for link in ur)

    # each relative url is then re-written in the form: http://www.indeed.com/url
    urls = [base_url + url for url in urls]
    
    # return only the unique urls, as some jobs are repeated on different pages
    return set(urls), pages


## Function 3: takes the list of links, loads each one, extracts the body of the job post and
## copies its lines into a new list called raw_data
def generate_corpus(links):
    raw_data = []
    for link in links: # for each job link in the list links
        try:
            k = url_loader(link) # load the link and create a html soup object
            n = k.findAll("div", {"class": "jobsearch-JobComponent-description icl-u-xs-mt--md"}) # from each object extract the relevant div
            for li in n:
                g = li.text # convert the html element to text
                f = g.split('\n') # split on new line
                raw_data.append(f) # append inside the loop so every matching div is kept and no stale value is re-used
        except Exception: # if the link throws an error:
            continue # go to the next one
    flat = [item for sublist in raw_data for item in sublist] # flatten the list of lists into a single list
        
    return flat
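
The URL handling in `scrape_pages()` (appending `&start=` and prefixing `base_url`) can also be done with `urllib.parse`. That avoids a doubled `start` parameter if the pasted URL already has one. The sketch below assumes 10 results per page, as the notebook does; `page_urls` and `absolute_links` are hypothetical helper names, not functions from the notebook.

```python
from urllib.parse import parse_qsl, urlencode, urljoin, urlparse, urlunparse

BASE_URL = "http://www.indeed.com"  # same value as the notebook's base_url

def page_urls(first_page_url, no_of_jobs=10, page_size=10):
    """Return the result-page URLs after the first one, one per `start` offset."""
    parts = urlparse(first_page_url)
    # Keep every existing query parameter except a stale `start`
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True) if k != "start"]
    urls = []
    for start in range(page_size, no_of_jobs, page_size):
        new_query = urlencode(query + [("start", str(start))])
        urls.append(urlunparse(parts._replace(query=new_query)))
    return urls

def absolute_links(hrefs, base=BASE_URL):
    """Turn relative hrefs into absolute URLs and drop duplicates."""
    return sorted({urljoin(base, href) for href in hrefs})

first = "https://www.indeed.com/jobs?q=Data+Science+Intern&l="
for url in page_urls(first, no_of_jobs=30):
    print(url)
print(absolute_links(["/rc/clk?jk=a", "/rc/clk?jk=a", "/rc/clk?jk=b"]))
```

With `no_of_jobs=30` this yields the pages at `start=10` and `start=20`; with `no_of_jobs=10` it yields no extra pages, which matches the `range(10, no_of_jobs, 10)` loop in the notebook.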

__I went on Indeed, typed "data science intern" and clicked search. The URL of the first results page is as follows:__

In [313]:
initial_searchpage_url = 'https://www.indeed.com/jobs?q=Data+Science+Intern&l=&ts=1537107678897&rs=1'

In [316]:
links, pages = scrape_pages(initial_searchpage_url,20)

#### With no_of_jobs=20, the scrape_pages() function collects the job urls from the first 2 result pages on Indeed (the initial page plus the one at start=10). We then iterate over these links and extract all the information.

In [317]:
links # just show the links for the jobs. If you copy-paste a url into your browser you can see the job posting

['http://www.indeed.com/rc/clk?jk=6c14cab0a5158d2e&fccid=0562f887e2bed9ea&vjs=3',
 'http://www.indeed.com/pagead/clk?mo=r&ad=-6NYlbfkN0CqvC8G7aT6hT5Icz9kOi6WGm5w8ftueCiLpxbYAnpnrr2tD8nNrTvoHHbkHUo6Qk6rNc0QLjREHHQpMGNjiHPSh47IEtFThwIHCbauFCT4YmrA6DejReVLMDVe1Q2mO-UsQf6suSvgGjuXX4jQHb1cVjCObwfPkyQ6hVAqZOQrDGK7jew3-HxmWBy7-5zki3UkUSoBwl6SlMFPEGNQ0gUOPhZUuKQQyvNtEJyATYa_WF1YEZ4c9jqcHkiQNrd8tCKQXczlc-HO2pVqf0A64yALSX3_J4Es32iqXdv76x1FXNNLQdQQYMNbfUBZ95Wky1Mf2tm9iXh9RO2CamHAg44ba13g3J3e2p4RFn7c7b0_f5523E6edgvjPEbq1M22LsERCh8WNlmSUt0ttfrT2D9875pjCjeMhY_SEiVB35CLQRqUlg-qlp2MDcq7KFBWNRZlB529ptk2000IMbMiQPwIgzTcj-MNB2A=&vjs=3&p=1&sk=&fvj=0',
 'http://www.indeed.com/rc/clk?jk=62dd66850c9f654d&fccid=1c70eede37c5caee&vjs=3',
 'http://www.indeed.com/rc/clk?jk=bb3d4523cae0dde4&fccid=5a969b35c0256a8a&vjs=3',
 'http://www.indeed.com/pagead/clk?mo=r&ad=-6NYlbfkN0B4T38LaclBH_EPfcmL_zyoW0miqhPUmO93XKpK6qbVhEGD7gSM5WdnKtdYwFg9_Bhm8iuadzRFaPZwuUdTfri4WDUUMPPSmobVRKUmswHOdj4dKo5qcKA_B-0ZJ5hFgUxvAdpQDjc3xnbe9

In [354]:
flat = generate_corpus(links=links) # calling generate_corpus produces the flattened list of lines
flat                                # don't trigger this function too often, as you might get blocked by Indeed. Use sparingly!

['Palo Alto Networks® is the next-generation security company, leading a new era in cybersecurity by safely enabling applications and preventing cyber breaches for tens of thousands of organizations worldwide. If you are motivated, intelligent, creative, hardworking and want to make an impact, then this job is for you!',
 '',
 'Our Summer Internship Program (May-August) or (June-September) provides you:',
 '1:1 mentorship',
 'Fun and engaging events that inspire your intellectual curiosity',
 'Opportunities to expand your knowledge and work on challenging projects',
 'Connections to other interns, recent grads, and employees across the company as well as our leaders.',
 '',
 'The Palo Alto Networks Data Science team uses data science and machine learning to solve problems throughout the company. In the words of one of our members: “we develop machine learning based highly scalable solutions to address cyber security challenges.” We are looking for a talented and motivated intern that c
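
Since the cell above warns about getting blocked, one simple precaution is to pause between requests. The following is only a sketch: `fetch_politely` is a hypothetical helper, the 2-second default delay is an arbitrary assumption, and a stand-in `fake_fetch` is used here so the example runs without touching the network. In the notebook, `url_loader` could be passed as the `fetch` argument.

```python
import time

def fetch_politely(urls, fetch, delay_seconds=2.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing between requests and skipping failures."""
    results = {}
    for i, url in enumerate(urls):
        if i > 0:
            sleep(delay_seconds)  # wait before every request except the first
        try:
            results[url] = fetch(url)
        except Exception as exc:
            print(f"skipping {url}: {exc}")
    return results

def fake_fetch(url):
    """Stand-in for a real page loader; fails on purpose for one URL."""
    if "bad" in url:
        raise ValueError("simulated failure")
    return f"<html>{url}</html>"

pages = fetch_politely(["job-1", "bad-job", "job-2"], fake_fetch, delay_seconds=0)
print(sorted(pages))
```

The `sleep` parameter is injectable only so the pause can be swapped out when experimenting; a real run would leave it at `time.sleep`.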

## Let's create a dataframe

In [357]:
import pandas as pd ## lets create a dataframe out of the above corpus of text
df = pd.DataFrame(flat)
df.columns = ['corpus_lines']
df.shape

(1089, 1)

In [358]:
pd.set_option('display.max_colwidth', 150)
df[100:200]

Unnamed: 0,corpus_lines
100,We believe in autonomy & taking initiative
101,"We are challenged, developed and have meaningful impact"
102,We take what we do seriously. We don't take ourselves seriously
103,"We have a smart, experienced leadership team that wants to do it right & is open to new ideas"
104,We offer competitive compensation packages and comprehensive health benefits
105,You will be proud to say that you work for Stitch Fix and will know that the work you do brings joy to our clients every day
106,
107,About Stitch Fix
108,
109,Stitch Fix is an online personal style service for men and women combining art and science to disrupt and redefine the retail industry. We're the ...


# Text pre-processing:

Basically, at this point it's 2am and I am really tired. I will revisit the best way to preprocess the data and eliminate useless information later. For now, just for the sake of preprocessing, I will run some generic functions: removing punctuation, tokenization, lemmatization, etc.

In [359]:
from nltk.tokenize import word_tokenize
import string
string.punctuation

def remove_punkt(string_):
    word_list_no_punkt = "".join([char for char in string_ if char not in string.punctuation])
    return word_list_no_punkt

In [360]:
df['cleaned_text'] = df['corpus_lines'].apply(remove_punkt) # strip punctuation
df['cleaned_text'] = df['cleaned_text'].apply(str.lower)    # lower-case everything

In [361]:
df

Unnamed: 0,corpus_lines,cleaned_text
0,"Palo Alto Networks® is the next-generation security company, leading a new era in cybersecurity by safely enabling applications and preventing cyb...",palo alto networks® is the nextgeneration security company leading a new era in cybersecurity by safely enabling applications and preventing cyber...
1,,
2,Our Summer Internship Program (May-August) or (June-September) provides you:,our summer internship program mayaugust or juneseptember provides you
3,1:1 mentorship,11 mentorship
4,Fun and engaging events that inspire your intellectual curiosity,fun and engaging events that inspire your intellectual curiosity
5,Opportunities to expand your knowledge and work on challenging projects,opportunities to expand your knowledge and work on challenging projects
6,"Connections to other interns, recent grads, and employees across the company as well as our leaders.",connections to other interns recent grads and employees across the company as well as our leaders
7,,
8,The Palo Alto Networks Data Science team uses data science and machine learning to solve problems throughout the company. In the words of one of o...,the palo alto networks data science team uses data science and machine learning to solve problems throughout the company in the words of one of ou...
9,,
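
The output above shows a side effect of deleting punctuation outright: "next-generation" becomes "nextgeneration" and "1:1" becomes "11". One possible alternative, sketched below, is to replace punctuation with a space instead. `replace_punct` is a hypothetical name, and this is not what the notebook currently does.

```python
import string

# Map every ASCII punctuation character to a space instead of deleting it
PUNCT_TO_SPACE = str.maketrans({ch: " " for ch in string.punctuation})

def replace_punct(text):
    """Swap punctuation for spaces, then collapse repeated whitespace."""
    return " ".join(text.translate(PUNCT_TO_SPACE).split())

print(replace_punct("next-generation security company"))
print(replace_punct("1:1 mentorship"))
print(replace_punct("(May-August) or (June-September) provides you:"))
```

Note that `string.punctuation` only covers ASCII, so symbols such as "®" or curly quotes pass through either version untouched, as the "networks®" token in the output shows.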


In [362]:
df['tokenized_words'] = df['cleaned_text'].apply(lambda x: word_tokenize(x))

In [363]:
df

Unnamed: 0,corpus_lines,cleaned_text,tokenized_words
0,"Palo Alto Networks® is the next-generation security company, leading a new era in cybersecurity by safely enabling applications and preventing cyb...",palo alto networks® is the nextgeneration security company leading a new era in cybersecurity by safely enabling applications and preventing cyber...,"[palo, alto, networks®, is, the, nextgeneration, security, company, leading, a, new, era, in, cybersecurity, by, safely, enabling, applications, a..."
1,,,[]
2,Our Summer Internship Program (May-August) or (June-September) provides you:,our summer internship program mayaugust or juneseptember provides you,"[our, summer, internship, program, mayaugust, or, juneseptember, provides, you]"
3,1:1 mentorship,11 mentorship,"[11, mentorship]"
4,Fun and engaging events that inspire your intellectual curiosity,fun and engaging events that inspire your intellectual curiosity,"[fun, and, engaging, events, that, inspire, your, intellectual, curiosity]"
5,Opportunities to expand your knowledge and work on challenging projects,opportunities to expand your knowledge and work on challenging projects,"[opportunities, to, expand, your, knowledge, and, work, on, challenging, projects]"
6,"Connections to other interns, recent grads, and employees across the company as well as our leaders.",connections to other interns recent grads and employees across the company as well as our leaders,"[connections, to, other, interns, recent, grads, and, employees, across, the, company, as, well, as, our, leaders]"
7,,,[]
8,The Palo Alto Networks Data Science team uses data science and machine learning to solve problems throughout the company. In the words of one of o...,the palo alto networks data science team uses data science and machine learning to solve problems throughout the company in the words of one of ou...,"[the, palo, alto, networks, data, science, team, uses, data, science, and, machine, learning, to, solve, problems, throughout, the, company, in, t..."
9,,,[]


In [364]:
df.iloc[29]

corpus_lines         
cleaned_text         
tokenized_words    []
Name: 29, dtype: object

In [365]:
import nltk
from nltk.corpus import stopwords
stopwords_nltk = list(stopwords.words('english'))
print(stopwords_nltk[0:5])
print(len(stopwords_nltk))

from sklearn.feature_extraction import text
stopwords_sklearn = list(text.ENGLISH_STOP_WORDS)
print(stopwords_sklearn[0:5])
print(len(stopwords_sklearn))

['i', 'me', 'my', 'myself', 'we']
179
['she', 'as', 'six', 'could', 'mostly']
318


In [366]:
with open('C:/Users/1234567890/Desktop/python projects/stopwords-master/en/atire_puurula.txt') as fh: # reading the unstructured data
    stopword_data = fh.read()
#stopword_data[0:115] # first 115 characters
stopwords_atire_puruula = stopword_data.split('\n')
print(stopwords_atire_puruula[:50])
print(len(stopwords_atire_puruula))


with open('C:/Users/1234567890/Desktop/python projects/stopwords-master/en/terrier.txt') as fh: # reading the unstructured data
    stopword_data = fh.read()
#stopword_data[0:115] # first 115 characters

stopwords_terrier = stopword_data.split('\n')
print(stopwords_terrier[:50])
print(len(stopwords_terrier))

["'ll", "'ve", '1-1', 'a', "a's", 'able', 'about', 'above', 'abroad', 'abst', 'accordance', 'according', 'accordingly', 'across', 'act', 'actually', 'added', 'adj', 'adopted', 'affected', 'affecting', 'affects', 'after', 'afterwards', 'again', 'against', 'ago', 'ah', 'ahead', "ain't", 'all', 'allow', 'allows', 'almost', 'alone', 'along', 'alongside', 'already', 'also', 'although', 'always', 'am', 'amid', 'amidst', 'among', 'amongst', 'amoungst', 'amount', 'an', 'and']
989
['x', 'y', 'your', 'yours', 'yourself', 'yourselves', 'you', 'yond', 'yonder', 'yon', 'ye', 'yet', 'z', 'zillion', 'j', 'u', 'umpteen', 'usually', 'us', 'username', 'uponed', 'upons', 'uponing', 'upon', 'ups', 'upping', 'upped', 'up', 'unto', 'until', 'unless', 'unlike', 'unliker', 'unlikest', 'under', 'underneath', 'use', 'used', 'usedest', 'r', 'rath', 'rather', 'rathest', 'rathe', 're', 'relate', 'related', 'relatively', 'regarding', 'really']
734


In [367]:
stopwords_merged= list(set(stopwords_nltk + stopwords_sklearn + stopwords_atire_puruula + stopwords_terrier))

In [368]:
def remove_stopwords(tokenized_words):
    stopwords_removed = [word for word in tokenized_words if word not in stopwords_merged]
    return stopwords_removed

In [369]:
df['stop_words_removed'] = df['tokenized_words'].apply(lambda x: remove_stopwords(x))
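
Because `stopwords_merged` is a list, every `word not in stopwords_merged` check scans the whole list. Keeping the merged stop words in a set makes each lookup much cheaper. The sketch below uses tiny made-up lists in place of the four real stop-word sources, and the helper names are hypothetical.

```python
def build_stopword_set(*word_lists):
    """Merge several stop-word lists into one frozenset, dropping blank entries."""
    merged = set()
    for words in word_lists:
        merged.update(w.strip().lower() for w in words if w.strip())
    return frozenset(merged)

def remove_stopwords_fast(tokens, stopword_set):
    """Keep only the tokens that are not stop words (set lookup per token)."""
    return [tok for tok in tokens if tok not in stopword_set]

# Small stand-ins for the NLTK, sklearn and file-based lists
list_a = ["our", "or", "you", ""]
list_b = ["provides", "you", "Or"]
stop = build_stopword_set(list_a, list_b)

tokens = ["our", "summer", "internship", "program", "or", "provides", "you"]
print(remove_stopwords_fast(tokens, stop))
```

Dropping blank entries matters for the file-based lists, since splitting a file on newlines can leave an empty string at the end.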

In [370]:
df

Unnamed: 0,corpus_lines,cleaned_text,tokenized_words,stop_words_removed
0,"Palo Alto Networks® is the next-generation security company, leading a new era in cybersecurity by safely enabling applications and preventing cyb...",palo alto networks® is the nextgeneration security company leading a new era in cybersecurity by safely enabling applications and preventing cyber...,"[palo, alto, networks®, is, the, nextgeneration, security, company, leading, a, new, era, in, cybersecurity, by, safely, enabling, applications, a...","[palo, alto, networks®, nextgeneration, security, company, leading, era, cybersecurity, safely, enabling, applications, preventing, cyber, breache..."
1,,,[],[]
2,Our Summer Internship Program (May-August) or (June-September) provides you:,our summer internship program mayaugust or juneseptember provides you,"[our, summer, internship, program, mayaugust, or, juneseptember, provides, you]","[summer, internship, program, mayaugust, juneseptember]"
3,1:1 mentorship,11 mentorship,"[11, mentorship]","[11, mentorship]"
4,Fun and engaging events that inspire your intellectual curiosity,fun and engaging events that inspire your intellectual curiosity,"[fun, and, engaging, events, that, inspire, your, intellectual, curiosity]","[fun, engaging, events, inspire, intellectual, curiosity]"
5,Opportunities to expand your knowledge and work on challenging projects,opportunities to expand your knowledge and work on challenging projects,"[opportunities, to, expand, your, knowledge, and, work, on, challenging, projects]","[opportunities, expand, knowledge, challenging, projects]"
6,"Connections to other interns, recent grads, and employees across the company as well as our leaders.",connections to other interns recent grads and employees across the company as well as our leaders,"[connections, to, other, interns, recent, grads, and, employees, across, the, company, as, well, as, our, leaders]","[connections, interns, grads, employees, company, leaders]"
7,,,[],[]
8,The Palo Alto Networks Data Science team uses data science and machine learning to solve problems throughout the company. In the words of one of o...,the palo alto networks data science team uses data science and machine learning to solve problems throughout the company in the words of one of ou...,"[the, palo, alto, networks, data, science, team, uses, data, science, and, machine, learning, to, solve, problems, throughout, the, company, in, t...","[palo, alto, networks, data, science, team, data, science, machine, learning, solve, company, “, develop, machine, learning, based, highly, scalab..."
9,,,[],[]


In [371]:
wnlm = nltk.WordNetLemmatizer()

In [372]:
def lemmatize(tokens):
    lemmatized = [wnlm.lemmatize(token) for token in tokens]
    return lemmatized


In [373]:
df['lemmatized_words'] = df['stop_words_removed'].apply(lambda x: lemmatize(x))


In [374]:
df

Unnamed: 0,corpus_lines,cleaned_text,tokenized_words,stop_words_removed,lemmatized_words
0,"Palo Alto Networks® is the next-generation security company, leading a new era in cybersecurity by safely enabling applications and preventing cyb...",palo alto networks® is the nextgeneration security company leading a new era in cybersecurity by safely enabling applications and preventing cyber...,"[palo, alto, networks®, is, the, nextgeneration, security, company, leading, a, new, era, in, cybersecurity, by, safely, enabling, applications, a...","[palo, alto, networks®, nextgeneration, security, company, leading, era, cybersecurity, safely, enabling, applications, preventing, cyber, breache...","[palo, alto, networks®, nextgeneration, security, company, leading, era, cybersecurity, safely, enabling, application, preventing, cyber, breach, ..."
1,,,[],[],[]
2,Our Summer Internship Program (May-August) or (June-September) provides you:,our summer internship program mayaugust or juneseptember provides you,"[our, summer, internship, program, mayaugust, or, juneseptember, provides, you]","[summer, internship, program, mayaugust, juneseptember]","[summer, internship, program, mayaugust, juneseptember]"
3,1:1 mentorship,11 mentorship,"[11, mentorship]","[11, mentorship]","[11, mentorship]"
4,Fun and engaging events that inspire your intellectual curiosity,fun and engaging events that inspire your intellectual curiosity,"[fun, and, engaging, events, that, inspire, your, intellectual, curiosity]","[fun, engaging, events, inspire, intellectual, curiosity]","[fun, engaging, event, inspire, intellectual, curiosity]"
5,Opportunities to expand your knowledge and work on challenging projects,opportunities to expand your knowledge and work on challenging projects,"[opportunities, to, expand, your, knowledge, and, work, on, challenging, projects]","[opportunities, expand, knowledge, challenging, projects]","[opportunity, expand, knowledge, challenging, project]"
6,"Connections to other interns, recent grads, and employees across the company as well as our leaders.",connections to other interns recent grads and employees across the company as well as our leaders,"[connections, to, other, interns, recent, grads, and, employees, across, the, company, as, well, as, our, leaders]","[connections, interns, grads, employees, company, leaders]","[connection, intern, grad, employee, company, leader]"
7,,,[],[],[]
8,The Palo Alto Networks Data Science team uses data science and machine learning to solve problems throughout the company. In the words of one of o...,the palo alto networks data science team uses data science and machine learning to solve problems throughout the company in the words of one of ou...,"[the, palo, alto, networks, data, science, team, uses, data, science, and, machine, learning, to, solve, problems, throughout, the, company, in, t...","[palo, alto, networks, data, science, team, data, science, machine, learning, solve, company, “, develop, machine, learning, based, highly, scalab...","[palo, alto, network, data, science, team, data, science, machine, learning, solve, company, “, develop, machine, learning, based, highly, scalabl..."
9,,,[],[],[]


# Limitations & obstacles faced during the project: 

Limitations: 

From the dataframe above we can see there is a lot of noise: empty lists and the like. In the next iteration of this project I will try to clean the data more precisely, but for now this is good enough. Feel free to try this out yourself.
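
As a first step towards that cleanup, the sketch below drops empty rows and tokens that contain no letters or digits (such as the stray “ token visible in the output above). It works on plain Python lists standing in for the `lemmatized_words` column; `clean_rows` is a hypothetical helper, not part of the notebook.

```python
def clean_rows(token_rows):
    """Drop symbol-only tokens, then drop rows that end up empty."""
    cleaned = []
    for row in token_rows:
        kept = [tok for tok in row if any(ch.isalnum() for ch in tok)]
        if kept:
            cleaned.append(kept)
    return cleaned

# Made-up rows that mimic the shape of the lemmatized column
rows = [
    ["palo", "alto", "network"],
    [],
    ["“", "develop", "machine", "learning"],
    ["”"],
]
print(clean_rows(rows))
```

On the dataframe itself, a filter along the lines of `df[df['lemmatized_words'].str.len() > 0]` should remove the empty rows, though I have not run it against this data.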

I personally don't think this is a robust crawler like the ones pros would build with Scrapy.

Obstacles: 

I tried a lot of scripts. My first attempt used Selenium, but that didn't work because I was trying to find elements by XPath, and every job has its own unique XPath except for the first 3 and the last 2, whose XPaths are fixed. So I had to read a lot of Stack Overflow before I kind of learned how to use the beautifulsoup, urllib and requests packages.



# Conclusion: 

This is a good starting point for web scraping. Naturally, I have seen pros build more complex, more robust and faster crawlers. But I am no programming genius. I get by reading Stack Overflow and documentation.

This script can provide learning data scientists with a custom corpus of Indeed job descriptions. They can build NLP models on it for whatever projects they are doing.

All in all, this was a very rewarding project.


Basically there are three phases to this project: 


1. Getting the corpus by webscraping

2. Building the NLP model to extract keywords for the job description

3. Deploying & using the model to polish resumes
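
Phase 2 is not built yet. As a very rough placeholder for it, a plain frequency count over the cleaned tokens already gives a first ranking of candidate keywords. This is only one possible baseline under that assumption, not the model the project will end up using; `top_keywords` and the sample rows are made up for illustration.

```python
from collections import Counter

def top_keywords(token_rows, n=5):
    """Rank tokens by how often they occur across all rows."""
    counts = Counter(tok for row in token_rows for tok in row)
    return counts.most_common(n)

# Made-up token rows in the shape of the lemmatized_words column
rows = [
    ["python", "sql", "machine", "learning"],
    ["python", "sql", "communication"],
    ["python", "tableau"],
]
print(top_keywords(rows, n=2))
```

Raw counts favour words that are common in every posting, so a later version would probably need some weighting or a list of known skills to filter against.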


I hope you liked it! 

Feel free to shoot me an email if you have any questions or constructive criticism regarding the code or how I could have done it better!

You can also send me an email if you would like to work further on this idea and polish it a little more.

Muhammadut@gmail.com


## Thank you for reading! 

I will keep you posted! 