## Overview

#### This tool goes through all the job postings in one or more cities and adds to a list those that require skills matching the user's skillset. All of the code is written as functions so that the parameters, search terms, and number of pages to search are easy to change.



##### job_score is the score of a job based on how often specific keywords appear in its description. Keywords of interest include Python, SQL, R, SAS, Machine Learning, and Tableau; the function below counts R, SQL, Python, Hadoop, and Tableau.

In [1]:
#importing the necessary libraries
import requests
import bs4
import re
import time
import smtplib

#Defining a function that would score a job based on the specific keywords you want the job description to contain
def job_score(url):
    
    #obtaining the html script
    htmlcomplete = requests.get(url)
    htmlcontent = bs4.BeautifulSoup(htmlcomplete.content, 'lxml')
    htmlbody = htmlcontent('body')[0]
    
    #finding all the keywords
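    #counting a standalone "R" followed by a comma or period; the raw string avoids invalid escapes and \b skips words ending in R such as "HR,"
    r = len(re.findall(r'\bR[,.]', htmlbody.text))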
    sql = htmlbody.text.count('sql')+htmlbody.text.count('Sql')+htmlbody.text.count('SQL')
    python = htmlbody.text.count('python')+htmlbody.text.count('Python')
    hadoop = htmlbody.text.count('hadoop')+htmlbody.text.count('Hadoop')+htmlbody.text.count('HADOOP')
    tableau = htmlbody.text.count('tableau')+htmlbody.text.count('Tableau')
    total=r+python+sql+hadoop+tableau
    print ('R count:', r, ',','Python count:', python, ',','SQL count:', sql, ',','Hadoop count:', hadoop, ',','Tableau count:', tableau, ',',)
    return total

### Example: job_score of a job

In [2]:
job_score('https://www.indeed.com/viewjob?jk=6bf736094f9552d9&tk=1ck7sb1n05u8m89o&from=serp&vjs=3')

R count: 1 , Python count: 0 , SQL count: 0 , Hadoop count: 0 , Tableau count: 1 ,


2
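
The keyword counting can be tried without any network access. The following is a minimal sketch, not the notebook's own method: `KEYWORD_PATTERNS` and `count_keywords` are hypothetical names, and the patterns use word boundaries and case-insensitive matching, which is an assumption that differs from the substring counts in `job_score` (for example, "MySQL" is not counted as SQL here).

```python
import re

# Hypothetical keyword table; each pattern is an assumption chosen for illustration.
KEYWORD_PATTERNS = {
    'R': r'\bR\b(?=[,.])',      # standalone "R" followed by a comma or period
    'Python': r'\bpython\b',
    'SQL': r'\bsql\b',
    'Hadoop': r'\bhadoop\b',
    'Tableau': r'\btableau\b',
}

def count_keywords(text):
    """Return a dict mapping each keyword to its number of matches in text."""
    counts = {}
    for name, pattern in KEYWORD_PATTERNS.items():
        # "R" stays case-sensitive so that a lowercase "r," is not counted
        flags = 0 if name == 'R' else re.IGNORECASE
        counts[name] = len(re.findall(pattern, text, flags))
    return counts

sample = 'Experience with R, Python and SQL. HR, payroll. MySQL and python scripting.'
counts = count_keywords(sample)
print(counts)
print('total:', sum(counts.values()))
```

Returning a dictionary rather than printing inside the function keeps the per-keyword counts available to the caller, who can still sum them to get a single score.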

### Looking at the HTML source to see how it is structured and where the relevant information is located

In [None]:
#This section of the code lets you see the html script so that you can understand the structure and what information can be extracted from which part of the script 
URL = 'https://www.indeed.com/jobs?q=data+scientist&l='

#conducting a request of the stated URL above:
complete = requests.get(URL)

#parsing the page with the html parser - this allows python to read the various components of the page, rather than treating it as one long string.
content = bs4.BeautifulSoup(complete.text, 'html.parser')

#printing the parsed content in a more structured tree format that makes for easier reading
print(content.prettify())
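
A full `prettify()` dump can be long, so a quick tally of tags and marker attributes may help when looking for where the job data lives. The sketch below swaps in the standard library's `html.parser` instead of BeautifulSoup so that it runs without extra packages. `TagSummary` and `sample_html` are hypothetical; in the sample markup, the `sponsoredJob` value and the `company` class are assumptions, while `organicJob` and the `date` class are the values used in the notebook.

```python
from collections import Counter
from html.parser import HTMLParser

class TagSummary(HTMLParser):
    """Count tag names and the values of the data-tn-component attribute."""

    def __init__(self):
        super().__init__()
        self.tags = Counter()
        self.components = Counter()

    def handle_starttag(self, tag, attrs):
        self.tags[tag] += 1
        for name, value in attrs:
            if name == 'data-tn-component' and value:
                self.components[value] += 1

# Hypothetical markup that imitates the structure of a results page
sample_html = '''
<div data-tn-component="organicJob"><h2><a href="/job/1" title="Data Scientist">Data Scientist</a></h2>
<span class="company">Example Co</span><span class="date">Just posted</span></div>
<div data-tn-component="sponsoredJob"><h2><a href="/job/2" title="Analyst">Analyst</a></h2></div>
'''

summary = TagSummary()
summary.feed(sample_html)
print(summary.tags.most_common())
print(dict(summary.components))
```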

### Extracting the related job data from the HTML source

##### Company name, date the job was posted, job title, and hyperlink to the job

In [3]:
def jobdata(url):
    htmlcomplete2 = requests.get(url)
    htmlcontent2 = bs4.BeautifulSoup(htmlcomplete2.content, 'lxml')
    #only getting the tags for organic job postings and not the ones that are sponsored
    tags = htmlcontent2.find_all('div', {'data-tn-component' : "organicJob"})
    #getting the list of companies that have the organic job posting tags
    companies = [x.span.text for x in tags]
    #extracting the features like the company name, complete link, date, etc.
    attributes = [x.h2.a.attrs for x in tags]
    dates = [x.find_all('span', {'class':'date'}) for x in tags]
    
    # update attributes dictionaries with company name and date posted
    for attrs, company, date in zip(attributes, companies, dates):
        attrs.update({'company': company.strip(), 'date posted': date[0].text.strip()})
    return attributes

#### Sample of the attribute dictionary for the first job on the page


In [4]:
jobdata('https://www.indeed.com/jobs?q=data+scientist&l=')[0]


{'class': ['turnstileLink'],
 'company': 'Jvion',
 'data-tn-element': 'jobTitle',
 'date posted': '10 days ago',
 'href': '/company/Jvion/jobs/Junior-Data-Scientist-15a4facb3700bd14?fccid=357536219ef998c7&vjs=3',
 'onclick': 'setRefineByCookie([]); return rclk(this,jobmap[0],true,1);',
 'onmousedown': 'return rclk(this,jobmap[0],1);',
 'rel': ['noopener', 'nofollow'],
 'target': '_blank',
 'title': 'Jr. Data Scientist'}
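
The `href` in this dictionary is relative to the site root, so it has to be turned into a full URL before it can be requested. Here is a small sketch of one way to do that with the standard library's `urljoin`. `BASE_URL` and `absolute_job_url` are hypothetical names, and the `https://www.indeed.com` base is an assumption.

```python
from urllib.parse import urljoin, urlsplit, parse_qs

BASE_URL = 'https://www.indeed.com'  # assumed site root

def absolute_job_url(href, base=BASE_URL):
    """Resolve a relative href from the attribute dictionary against the site root."""
    return urljoin(base, href)

href = '/company/Jvion/jobs/Junior-Data-Scientist-15a4facb3700bd14?fccid=357536219ef998c7&vjs=3'
url = absolute_job_url(href)
print(url)

# The query string can be parsed back into a dict if individual parameters are needed
print(parse_qs(urlsplit(url).query))
```

Unlike plain string concatenation, `urljoin` leaves an `href` untouched when it is already an absolute URL.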

#### List of cities in which we want to search for jobs

In [5]:
#defining a list of cities you want to search jobs in
citylist = ['New+York','Chicago', 'Austin', 'San+Francisco', 'Seattle']#, 'Los+Angeles', 'Philadelphia', 'Atlanta', 'Dallas', 'Pittsburgh', 'Portland', 'Phoenix', 'Denver', 'Houston', 'Miami', 'Washington+DC', 'Boulder']
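
The city names above are written with `+` in place of spaces because they are pasted straight into the query string. As a hedged alternative, the sketch below keeps human-readable names and lets `urllib.parse.urlencode` do the escaping. `search_url` and `cities` are hypothetical names; the `q`, `l`, `sort`, and `start` parameters are the ones used in this notebook's search URLs.

```python
from urllib.parse import urlencode

def search_url(position, city, start=0):
    """Build a search URL; urlencode turns spaces into '+' and escapes other characters."""
    query = urlencode({'q': position, 'l': city, 'sort': 'date', 'start': start})
    return 'https://www.indeed.com/jobs?' + query

cities = ['New York', 'San Francisco', 'Washington DC']
for city in cities:
    print(search_url('data scientist', city, start=10))
```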

#### Looping through Indeed.com and applying the functions defined above to every page

In [6]:
#defining a list to store all the relevant jobs
newjobslist = []

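#defining a new function to go through the most recent jobs (results are sorted by date) for a specific role
#note: the daysago parameter is accepted but is not used anywhere in the function body
#essentially looping over the cities and, for each city, over the result pages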
def newjobs(daysago = 1, startingpage = 0, pagelimit = 20, position = 'data+scientist'):
    for city in citylist:
        indeed_url = 'http://www.indeed.com/jobs?q={0}&l={1}&sort=date&start='.format(position, city)
        
        
        for i in range(startingpage, startingpage + pagelimit):
            print ('URL:', str(indeed_url + str(i*10)), '\n')
        
            attributes = jobdata(indeed_url + str(i*10))
            
            for j in range(0, len(attributes)):
                href = attributes[j]['href']
                title = attributes[j]['title']
                company = attributes[j]['company']
                date_posted = attributes[j]['date posted']
                
                print (repr(company),',', repr(title),',', repr(date_posted))
                
                evaluation = job_score('http://indeed.com' + href)
                
                if evaluation >= 1:
                    newjobslist.append('{0}, {1}, {2}, {3}'.format(company, title, city, 'http://indeed.com' + href))
                    
                print ('\n')
                
            time.sleep(1)
           
    newjobsstring = '\n\n'.join(newjobslist)
    return newjobsstring
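
`newjobs` accepts a `daysago` argument but never applies it. One way such a filter could work is sketched below, assuming the 'date posted' labels look like the ones seen in this notebook ('Just posted', '10 days ago'). The helper names `days_since_posted` and `is_recent` are hypothetical, and the 'Today' and '30+ days ago' forms are assumptions about other labels the site might show.

```python
import re

def days_since_posted(label):
    """Convert a 'date posted' label into a number of days, or None if unrecognized."""
    text = label.strip().lower()
    if text in ('just posted', 'today'):
        return 0
    match = re.match(r'(\d+)\+?\s+days?\s+ago$', text)
    if match:
        return int(match.group(1))
    return None

def is_recent(label, daysago=1):
    """True when the label parses and is at most daysago days old."""
    days = days_since_posted(label)
    return days is not None and days <= daysago

for label in ['Just posted', '1 day ago', '10 days ago', '30+ days ago']:
    print(repr(label), days_since_posted(label), is_recent(label, daysago=1))
```

Inside the job loop, a check such as `is_recent(date_posted, daysago)` could be placed before the call to `job_score`, which would also avoid a request for every stale posting.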

#### Sending an email to myself using the smtplib library

In [7]:
def emailme(from_addr = 'praneeth.bomma401@gmail.com', to_addr = 'pbomma@uncc.edu', subject = 'Daily Data Science Jobs Update Scraped from Indeed', text = None):
    
    message = 'Subject: {0}\n\nJobs: {1}'.format(subject, text)

    # login information
    username = '******'
    password = '******'
    
    # send the message
    server = smtplib.SMTP('smtp.gmail.com:587')
    server.ehlo()
    server.starttls()
    server.ehlo()
    server.login(username, password)
    server.sendmail(from_addr, to_addr, message)
    server.quit()
    print ('Please check your mail')
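
As a hedged variation on the function above, the message can be built with `email.message.EmailMessage` and the credentials read from environment variables instead of being typed into the notebook. `build_message` and `send_message` are hypothetical names, `SMTP_USERNAME` and `SMTP_PASSWORD` are assumed variable names, and the example addresses are placeholders. Only the message construction runs here; nothing is sent.

```python
import os
import smtplib
from email.message import EmailMessage

def build_message(from_addr, to_addr, subject, text):
    """Create the email with proper headers instead of a hand-formatted string."""
    msg = EmailMessage()
    msg['From'] = from_addr
    msg['To'] = to_addr
    msg['Subject'] = subject
    msg.set_content('Jobs: {0}'.format(text))
    return msg

def send_message(msg, host='smtp.gmail.com', port=587):
    """Send msg over STARTTLS; credentials come from assumed environment variables."""
    username = os.environ['SMTP_USERNAME']
    password = os.environ['SMTP_PASSWORD']
    with smtplib.SMTP(host, port) as server:
        server.ehlo()
        server.starttls()
        server.ehlo()
        server.login(username, password)
        server.send_message(msg)

msg = build_message('sender@example.com', 'recipient@example.com',
                    'Daily Data Science Jobs Update', 'Jvion, Jr. Data Scientist')
print(msg['Subject'])
print(msg.get_content())
```

Separating construction from sending makes it possible to inspect the message before any connection to the mail server is opened.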

In [8]:
def main():
    print ('Searching for jobs...')

    starting_page = 0
    page_limit = 1
    datascientist = newjobs(position = 'data+scientist', startingpage = starting_page, pagelimit = page_limit)
    emailme(text = datascientist)

In [None]:
main()


Searching for jobs...
URL: http://www.indeed.com/jobs?q=data+scientist&l=New+York&sort=date&start=0 

'Alliant Insight' , 'Associate Data Scientist' , 'Just posted'
R count: 0 , Python count: 1 , SQL count: 1 , Hadoop count: 1 , Tableau count: 0 ,


'Alliant Insight' , 'Data Scientist' , 'Just posted'
R count: 0 , Python count: 1 , SQL count: 1 , Hadoop count: 1 , Tableau count: 0 ,


'HR Pundits' , 'Sr Java Technical Resource with Data Analysis' , 'Just posted'
R count: 0 , Python count: 0 , SQL count: 1 , Hadoop count: 0 , Tableau count: 0 ,


'Regeneron' , 'Scientist / Staff Scientist, Anti-Tumor Immunotherapy Development' , 'Just posted'
R count: 0 , Python count: 0 , SQL count: 0 , Hadoop count: 0 , Tableau count: 0 ,


'Benenson Strategy Group' , 'Market Research Analyst (Corporate)' , 'Just posted'
R count: 0 , Python count: 0 , SQL count: 0 , Hadoop count: 0 , Tableau count: 0 ,


'CORE Environmental Consultants' , 'Environmental Scientist/Engineer' , 'Just posted'
R count: 0 ,