#Read the README to see exactly what the code is doing.

Here is the breakdown of how I want to build the code. 

- We will use requests to load the first page of the results.
- Store the link to each job description. I don't want to use the in-page view of the job description, since that makes the HTML too complicated.
- Once we have the links, use requests to get the content of each one.
- Use Beautiful Soup to scrape the job description.
- The idea is to let the user choose which skills to summarize. For the initial pass I will provide a list of skills and check that the results look right, then move on to user input.
- Once the code works for, say, 10 or 20 job postings, we can scale it to cover more companies.
- Be ethical. I will keep a cap on how many companies we can search in the end. Let's not be scummy and inundate the Indeed servers.
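
The capped, polite fetching from the last bullet could look something like the sketch below. The names `fetch_politely`, `max_requests`, and `delay_seconds` are hypothetical, and the fetch function is passed in so the loop can be tried without touching the network. In this notebook it would be something like `lambda url: requests.get(url).text`.

```python
import time


def fetch_politely(urls, fetch, max_requests=20, delay_seconds=1.0, sleep=time.sleep):
    """Fetch at most `max_requests` URLs, pausing between requests.

    `fetch` is any callable that takes a URL and returns the page content.
    """
    pages = {}
    for i, url in enumerate(urls[:max_requests]):
        if i > 0:
            sleep(delay_seconds)  # pause so we don't hammer the server
        pages[url] = fetch(url)
    return pages


# Offline demo with a fake fetch function (no real requests are made).
demo_urls = [f"https://example.com/job/{n}" for n in range(5)]
demo_pages = fetch_politely(
    demo_urls,
    fetch=lambda url: f"<html>{url}</html>",
    max_requests=3,
    delay_seconds=0,
    sleep=lambda seconds: None,
)
print(len(demo_pages))
```

The cap and the delay are both just assumptions about what "not scummy" means here; the right values depend on the site's terms and robots.txt.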


In [1]:
#let's begin 

import pyforest  # lazily imports common libraries (e.g. pandas as pd) on first use
import requests 
from bs4 import BeautifulSoup

In [2]:
job_page = requests.get('https://www.indeed.com/jobs?q=data+scientist&l=United+States')
job_page

<Response [200]>

In [3]:
first_page_content = job_page.text

In [4]:
parser = BeautifulSoup(first_page_content, 'html.parser')

In [5]:
test = parser.body.find('div', attrs = {'data-tn-component':'organicJob'})

In [6]:
test_job_title = test.a.text.strip()
test_job_title

'Data Scientist'

In [7]:
job_link = test.a.get('href')

In [8]:
job_link

'/company/WTF-DIGITAL/jobs/Data-Scientist-dda20b6589d3de4f?fccid=8020b1127d57b812&vjs=3'

So we pretty much figured out how to get the job name and link. 

In [9]:
company_name = test.div.span.text.strip()
company_name

'dynamicbitit.com'

Now let's check if the same code can work for the rest of the job listings. 

In [10]:
test2 = parser.body.find_all('div', attrs = {'data-tn-component':'organicJob'})

In [11]:
len(test2)

15

In [12]:
test2[1].a.text.strip()

'Data Scientist (entry level)'

In [13]:
test2[3].a.text.strip()

'Research Data Scientist II'

In [14]:
test2[6].a.text.strip()

'Data Scientist'

Works so far. One issue is that not every job has the data-tn-component attribute, so as of this code some jobs are being scraped and some aren't (Indeed has 15 job listings per page). We can choose to ignore this, since we will be looking at quite a number of jobs. I'm continuing for now and will come back to see if I can change something. OK, let's put this in a loop and see if it works.
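
One way to keep a loop from crashing on cards that are missing a title link or a company span is to check each lookup before using it. This is a minimal sketch on a made-up HTML snippet; the markup is only loosely modelled on the attribute used above and is not Indeed's real page structure. `extract_job` is a hypothetical helper, and it looks up the first `span` anywhere in the card rather than following `job.div.span`.

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration only; the real page is more complex.
sample_html = """
<body>
  <div data-tn-component="organicJob">
    <a href="/rc/clk?jk=abc">Data Scientist</a>
    <div><span>Example Corp</span></div>
  </div>
  <div data-tn-component="organicJob">
    <a href="/rc/clk?jk=def">Data Analyst</a>
  </div>
  <div class="sponsored"><a href="/x">Sponsored job</a></div>
</body>
"""


def extract_job(card):
    """Return (title, organization, href), or None if the card has no usable link."""
    link_tag = card.find("a")
    if link_tag is None or not link_tag.get("href"):
        return None
    span = card.find("span")
    org = span.get_text(strip=True) if span is not None else None
    return link_tag.get_text(strip=True), org, link_tag["href"]


soup = BeautifulSoup(sample_html, "html.parser")
cards = soup.find_all("div", attrs={"data-tn-component": "organicJob"})
jobs = [job for job in map(extract_job, cards) if job is not None]
print(jobs)
```

Cards without the attribute are still skipped here, exactly as in the notebook; the sketch only guards against missing pieces inside the cards that do match.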

____________________________________________________________________________________________

In [15]:
job_table = pd.DataFrame(columns = ['Job Title','Organization','Link'])

In [16]:
job_table

Unnamed: 0,Job Title,Organization,Link


In [17]:
scraped_page = parser.body.find_all('div', attrs = {'data-tn-component':'organicJob'})

In [18]:
titles = []
orgs = []
clean_links = []
    
for job in scraped_page:
    
    title = job.a.text.strip()
    titles.append(title)
    
    org = job.div.span.text.strip()
    orgs.append(org)
    
    link = job.a.get('href')
    
    if link.startswith('https'):
        clean_links.append(link)
    else:
        join_link = "https://www.indeed.com"+link
        clean_links.append(join_link)
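
The standard library's `urllib.parse.urljoin` can take over the absolute-versus-relative check done by hand in the loop above. A small sketch, with `BASE_URL` and `to_absolute` as hypothetical names:

```python
from urllib.parse import urljoin

BASE_URL = "https://www.indeed.com"


def to_absolute(link, base=BASE_URL):
    """Return `link` unchanged if it is already absolute, else resolve it against `base`."""
    return urljoin(base, link)


relative = "/rc/clk?jk=10ce94304e4f3254&fccid=e34a8bfa908cfda2&vjs=3"
absolute = "https://www.indeed.com/rc/clk?jk=10ce94304e4f3254&fccid=e34a8bfa908cfda2&vjs=3"

print(to_absolute(relative))
print(to_absolute(absolute))
```

Both calls print the same full URL, so the `if link.startswith('https')` branch would no longer be needed.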

In [19]:
clean_links

['https://www.indeed.com/company/WTF-DIGITAL/jobs/Data-Scientist-dda20b6589d3de4f?fccid=8020b1127d57b812&vjs=3',
 'https://www.indeed.com/rc/clk?jk=10ce94304e4f3254&fccid=e34a8bfa908cfda2&vjs=3',
 'https://www.indeed.com/company/WithHealth,-Inc./jobs/Senior-Data-Scientist-1b9a2896f416c9f5?fccid=afbf73832175724f&vjs=3',
 'https://www.indeed.com/rc/clk?jk=b612e4d97a48e2fb&fccid=e2a2a5c0f4f84192&vjs=3',
 'https://www.indeed.com/rc/clk?jk=abb4a8359abeeb2e&fccid=fe2d21eef233e94a&vjs=3',
 'https://www.indeed.com/rc/clk?jk=3d5724ad39c214bb&fccid=9993304a3df214bf&vjs=3',
 'https://www.indeed.com/company/CS-Solutions-Inc/jobs/Data-Scientist-fc457dfccb348979?fccid=d7d214065e5de0b6&vjs=3',
 'https://www.indeed.com/company/C2S-Technologies/jobs/Data-Scientist-9d198a5949bdec9c?fccid=eb4bc656c7659573&vjs=3',
 'https://www.indeed.com/company/Eateam/jobs/Data-Scientist-1c9685211454261a?fccid=50b31c1f60e549ba&vjs=3',
 'https://www.indeed.com/company/The-TIE/jobs/Cryptocurrency-Data-Scientist-dd27e5a232

In [20]:
job_table['Job Title'] = titles
job_table['Organization'] = orgs
job_table['Link'] = clean_links

In [21]:
job_table

Unnamed: 0,Job Title,Organization,Link
0,Data Scientist,dynamicbitit.com,https://www.indeed.com/company/WTF-DIGITAL/job...
1,Data Scientist (entry level),Saturn Cloud,https://www.indeed.com/rc/clk?jk=10ce94304e4f3...
2,Sr. Data Scientist,"WithHealth, Inc.","https://www.indeed.com/company/WithHealth,-Inc..."
3,Research Data Scientist II,Cleveland Clinic,https://www.indeed.com/rc/clk?jk=b612e4d97a48e...
4,"Data Scientist, Amazon Studios",Amazon Studios LLC,https://www.indeed.com/rc/clk?jk=abb4a8359abee...
5,Data Scientist - Analytics,Acorn Analytics,https://www.indeed.com/rc/clk?jk=3d5724ad39c21...
6,Data Scientist,CS Solutions Inc,https://www.indeed.com/company/CS-Solutions-In...
7,Data Scientist,C2S Technologies,https://www.indeed.com/company/C2S-Technologie...
8,Data Scientist,Eateam,https://www.indeed.com/company/Eateam/jobs/Dat...
9,Cryptocurrency Data Scientist,The TIE,https://www.indeed.com/company/The-TIE/jobs/Cr...


Now let's move towards scraping the job page. Let's take the first link. 

In [22]:
jd_link = job_table['Link'][0]
jd_link

'https://www.indeed.com/company/WTF-DIGITAL/jobs/Data-Scientist-dda20b6589d3de4f?fccid=8020b1127d57b812&vjs=3'

In [38]:
jd_page = requests.get(jd_link)

In [41]:
jd_page

<Response [200]>

In [39]:
jd_content = jd_page.text

In [40]:
jd_parser = BeautifulSoup(jd_content,'html.parser')

In [96]:
jd_parser.find('div', class_ = 'jobsearch-jobDescriptionText')

<div class="jobsearch-jobDescriptionText" id="jobDescriptionText"><p><b>Dear Professional,</b></p><p>Lost job because of Covid-19...!! do not worry.<br/>Looking for a new job/switch....!! do not worry.</p><p>Top MNC's have come up with vacancies, as everything is being digitalized so there are huge opportunities for skilled professionals.<br/>Till December 2020 WFH option is available</p><p><b>Job brief</b></p><p>We are looking for a Data Scientist to analyze large amounts of raw information to find patterns that will help improve our company. We will rely on you to build data products to extract valuable business insights.</p><p>In this role, you should be highly analytical with a knack for analysis, math and statistics. Critical thinking and problem-solving skills are essential for interpreting data. We also want to see a passion for machine-learning and research.</p><p>Your goal will be to help our company analyze trends to make better decisions.<br/><b>Responsibilities</b></p><ul><

In [97]:
jd = jd_parser.find('div', class_ = 'jobsearch-jobDescriptionText').text

In [98]:
jd

"Dear Professional,Lost job because of Covid-19...!! do not worry.Looking for a new job/switch....!! do not worry.Top MNC's have come up with vacancies, as everything is being digitalized so there are huge opportunities for skilled professionals.Till December 2020 WFH option is availableJob briefWe are looking for a Data Scientist to analyze large amounts of raw information to find patterns that will help improve our company. We will rely on you to build data products to extract valuable business insights.In this role, you should be highly analytical with a knack for analysis, math and statistics. Critical thinking and problem-solving skills are essential for interpreting data. We also want to see a passion for machine-learning and research.Your goal will be to help our company analyze trends to make better decisions.ResponsibilitiesIdentify valuable data sources and automate collection processesUndertake preprocessing of structured and unstructured dataAnalyze large amounts of informa

Some data cleaning
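
Note that `.text` glues neighbouring elements together with no space between them (for example `worry.Looking` in the output above). One possible cleaning step, sketched here on a tiny made-up snippet, is to pass a separator to Beautiful Soup's `get_text` and then collapse repeated whitespace. `clean_description` is a hypothetical helper name.

```python
import re

from bs4 import BeautifulSoup

# Made-up snippet that mimics the structure of a job description.
snippet = (
    "<div><p><b>Job brief</b></p>"
    "<p>We need a Data Scientist.<br/>Python required.</p></div>"
)


def clean_description(html):
    """Extract text with spaces between elements, collapse whitespace, lowercase."""
    soup = BeautifulSoup(html, "html.parser")
    text = soup.get_text(separator=" ")
    return re.sub(r"\s+", " ", text).strip().lower()


print(BeautifulSoup(snippet, "html.parser").text)  # words run together
print(clean_description(snippet))                  # words kept apart
```

This matters less for a plain substring check, but it becomes important as soon as matching is done on whole words.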

In [103]:
jd = jd.lower()
jd

"dear professional,lost job because of covid-19...!! do not worry.looking for a new job/switch....!! do not worry.top mnc's have come up with vacancies, as everything is being digitalized so there are huge opportunities for skilled professionals.till december 2020 wfh option is availablejob briefwe are looking for a data scientist to analyze large amounts of raw information to find patterns that will help improve our company. we will rely on you to build data products to extract valuable business insights.in this role, you should be highly analytical with a knack for analysis, math and statistics. critical thinking and problem-solving skills are essential for interpreting data. we also want to see a passion for machine-learning and research.your goal will be to help our company analyze trends to make better decisions.responsibilitiesidentify valuable data sources and automate collection processesundertake preprocessing of structured and unstructured dataanalyze large amounts of informa

In [110]:
'python' in jd

True

In [131]:
skills_set = ['Python','SQL','Command Line', 'Tableau','Excel', 'HADOOP','proven','knowledge']

In [132]:
test_dic = {}

for skill in skills_set:
    skill = skill.lower()
    
    if skill in jd:
        test_dic[skill] =1
        

In [133]:
test_dic

{'python': 1,
 'sql': 1,
 'tableau': 1,
 'excel': 1,
 'hadoop': 1,
 'proven': 1,
 'do': 1}

Luckily, the data doesn't need much cleaning. OK, now let's test another link.
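
One caveat with the `in` check: it matches substrings, so a short term can be found inside a longer word (`excel` inside `excellent`, or a two-letter term like `do` inside almost anything). A possible refinement, sketched below with a hypothetical `find_skills` helper and a made-up sample sentence, is to match on word boundaries with `re`.

```python
import re


def find_skills(description, skills):
    """Return the skills that appear as whole words or phrases in the description."""
    text = description.lower()
    found = []
    for skill in skills:
        # (?<!\w) and (?!\w) require a non-word character (or the string edge) on each side.
        pattern = r"(?<!\w)" + re.escape(skill.lower()) + r"(?!\w)"
        if re.search(pattern, text):
            found.append(skill.lower())
    return found


sample = "Excellent communication skills. Experience with Python, SQL and the command line."
skills = ["Python", "SQL", "Command Line", "Excel"]

substring_hits = [s.lower() for s in skills if s.lower() in sample.lower()]
print(substring_hits)               # includes 'excel' because of 'Excellent'
print(find_skills(sample, skills))  # 'excel' is no longer matched
```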

In [134]:
jd_link2 = job_table['Link'][1]
jd_link2

'https://www.indeed.com/rc/clk?jk=10ce94304e4f3254&fccid=e34a8bfa908cfda2&vjs=3'

In [138]:
jd_page2 = requests.get(jd_link2).text

In [140]:
jd_parser2 = BeautifulSoup(jd_page2,'html.parser')

In [141]:
jd2 = jd_parser2.find('div', class_ = 'jobsearch-jobDescriptionText').text

In [143]:
jd2 = jd2.lower()

In [144]:
jd2

'overview\nsaturn cloud helps companies perform data science at a new level of scale, with one-click solutions, to solve the world’s hardest problems. our product is a saas platform which equips data science teams with high-leverage automation tools, eliminating hours of traditional, manual work. the platform is user-friendly, scalable and secure.\n\nyou will be an entry-level data scientist for saturn cloud, an exciting new venture founded by the creators of anaconda and core authors of the pydata stack. the role features drafting the first generation of saturn resource materials, tutorials, and technical content.\nresponsibilities\nwork on high-visibility projects (e.g. text and video tutorials, use-case examples, contributions to our engineering blog)\n\nparticipate in research and user engagement to deliver polished technical content, documentation, and resources\n\nconduct various data science analysis and activities in order to generate reproduce-able, valuable, and up-to-date re

In [148]:
test_dic2 = {}

for skill in skills_set:
    skill = skill.lower()
    
    if skill in jd2:
        test_dic2[skill] =1
        

In [149]:
test_dic2

{'python': 1, 'sql': 1, 'excel': 1, 'do': 1}
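
To turn the per-posting dictionaries into the skills summary described at the top, the hits can be added up across postings. A rough sketch using `collections.Counter`; `summarize_skills` and the two short sample descriptions are made up for illustration, and it keeps the notebook's simple substring check.

```python
from collections import Counter


def summarize_skills(descriptions, skills):
    """Count in how many descriptions each skill appears (substring match)."""
    counts = Counter()
    for description in descriptions:
        text = description.lower()
        for skill in skills:
            if skill.lower() in text:
                counts[skill.lower()] += 1
    return counts


sample_descriptions = [
    "We use Python and SQL every day.",
    "Tableau dashboards plus some python scripting.",
]
summary = summarize_skills(sample_descriptions, ["Python", "SQL", "Tableau", "Hadoop"])

for skill, n in summary.most_common():
    print(f"{skill}: {n} of {len(sample_descriptions)} postings")
```

In the notebook, `descriptions` would be the list of lowercased job description strings scraped from each link in `job_table['Link']`.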