<a href="https://colab.research.google.com/github/lifepopkay/Tech-Monies/blob/main/Main_dataScrappingIndeedScript.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Output
#### Main Columns:
---
| Information | Dataset Column | Available | Comment |
|---|---|---|---|
| Job title | `title` | ✅ | Posted job title |
| Description | `jobDesc` | ✅ | All details available in the JD. Use a `print` statement to get formatted output |
| Salary | `salary` | ❌ | Will be extracted from `salaryDesc` |
| Contract Type | `contractType` | ✅ | Extracted from `salaryDesc` |
| Company Name | `company` | ✅ | - |
| Country | `country` | ❌ | Will be extracted from `location` |
| State | `state` | ❌ | Will be extracted from `location` |
| Years of Experience | `yearMinExp` | ❌ | Will be extracted from `jobDesc` |
| Position | `level` | ❌ | Will be extracted from `jobDesc` |
| Industry | `industry` | ❌ | Will be extracted from `jobDesc` |
| Age Required | `ageCriteria` | ❌ | Will be extracted from `jobDesc` |
| Skillset Required | `skills` | ❌ | Will be extracted from `jobDesc` |
| Educational qualification | `eligibility` | ❌ | Will be extracted from `jobDesc` |
| Pay Frequency | `payFrequency` | ❌ | Will be extracted from `jobDesc` |
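
The `state` column is not extracted yet. As a minimal sketch of one way it could be derived from `location`, the snippet below assumes US-style values such as `San Antonio, TX 78233`, where a two-letter state code follows a comma. The helper name `extract_state` and the regular expression are illustrative assumptions, not part of the notebook.

```python
import re

# Assumption: US-style locations where a two-letter state code follows a comma,
# e.g. "San Antonio, TX 78233" or "Rochester, MN+2 locations".
STATE_RE = re.compile(r",\s*([A-Z]{2})\b")

def extract_state(location):
    """Return the two-letter state code found in a location string, or None."""
    if not isinstance(location, str):
        return None  # missing values (None/NaN) carry no state
    match = STATE_RE.search(location)
    return match.group(1) if match else None

for sample in ["San Antonio, TX 78233", "Remote in Willow Grove, PA", "Remote"]:
    print(sample, "->", extract_state(sample))
```

On the scraped DataFrame this could be applied as `df['state'] = df['location'].apply(extract_state)`. Locations from other Indeed country sites would need their own pattern.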

---

The additional columns listed below are also available.

#### Additional Columns:

| Information | Dataset Column | Available | Comment |
|---|---|---|---|
| Job ID | `jobID` | ✅ | - |
| Location | `location` | ✅ | One or more of city, state, country, or PIN code/ZIP code |
| Salary Desc | `salaryDesc` | ✅ | One or more of salary (actual/estimated), job type, shift, etc. |
| JD link | `link` | ✅ | Link to the actual job description provided by Indeed |
| Post Date | `postDate` | ✅ | Recency of the job posting |
| Estimated by Indeed | `estimated` | ✅ | 1 if the salary is estimated by Indeed, otherwise 0 |
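
Because `salaryDesc` is the raw text of the card, salary figures and pay frequency still have to be parsed out of it. The sketch below shows one possible approach, assuming values shaped like `Estimated $59K - $74.6K a yearFull-time` or `$6,000 - $8,000 a month`. The helper `parse_salary` and both patterns are hypothetical and are not part of the notebook.

```python
import re

# Assumption: amounts look like "$59K", "$74.6K" or "$6,000".
AMOUNT_RE = re.compile(r"\$(\d[\d,]*(?:\.\d+)?)(K?)")
# Assumption: frequency is written as "a year", "a month", "an hour", etc.
FREQ_RE = re.compile(r"an? (hour|day|week|month|year)")

def parse_salary(salary_desc):
    """Return (min_salary, max_salary, pay_frequency) parsed from salaryDesc."""
    if not isinstance(salary_desc, str):
        return None, None, None  # card had no salary text
    amounts = []
    for number, suffix in AMOUNT_RE.findall(salary_desc):
        value = float(number.replace(",", ""))
        if suffix == "K":
            value = round(value * 1000, 2)  # "74.6K" -> 74600.0
        amounts.append(value)
    freq_match = FREQ_RE.search(salary_desc)
    frequency = freq_match.group(1) if freq_match else None
    if not amounts:
        return None, None, frequency
    return min(amounts), max(amounts), frequency

print(parse_salary("$6,000 - $8,000 a month"))
```

A single figure such as `$52,000 a year` yields the same value for both minimum and maximum.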

---

### Execute the block
**Instructions:**
1. Enter the job title.
2. Enter the search location.
3. Enter the main website without the `https://` prefix, e.g. www.indeed.com, in.indeed.com or ng.indeed.com.
  1. You can also search on Google: whatever appears in place of ***** in https://*****/?q.... is the value to enter.
4. Enter the number of pages to scrape. If left blank, only the first page is scraped.
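
To make the role of each input concrete, here is a minimal sketch of how they combine into the search URL. It mirrors the URL pattern used by `find_jobs` in the cell below (results sorted by date, `start` advancing by 10 per page), but the helper name `build_search_url` is hypothetical and `urlencode` is swapped in for the notebook's manual string joining.

```python
from urllib.parse import urlencode

def build_search_url(what, where, base_url, page=0, per_page=10):
    """Build the search URL for a zero-based page index (illustrative helper)."""
    params = {"q": what, "l": where, "sort": "date"}
    if page > 0:
        # Later pages are addressed through the `start` offset
        params["start"] = page * per_page
    # urlencode replaces spaces with '+', e.g. "Business Analyst" -> "Business+Analyst"
    return f"https://{base_url}/jobs?{urlencode(params)}"

print(build_search_url("Business Analyst", "USA", "www.indeed.com"))
print(build_search_url("Business Analyst", "USA", "www.indeed.com", page=2))
```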

In [None]:
##@title Imports and Functions { display-mode: "form" }
### Imports
# Data Handling
import numpy as np
import pandas as pd
import re

# Web Element Manipulation
from bs4 import BeautifulSoup
import requests

# timestamping
from datetime import date

# define headers for connection string
headers = {"User-agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36"}

### Functions
def find_jobs(what, where, baseUrl, nPage=1, verbose=True):
  # Handling URL
  jobTitle = '+'.join(what.split())
  location = '+'.join(where.split())
  
  # Initialize list for all jobs data
  jobs = []

  # Initialize stopping criteria
  totalJobs = -1
  mPage = nPage

  # Initialize Page Index
  page = 0

  # starting
  if verbose:
    print('===========| Start |============')
    print('===| Scrapping Job Postings |===')

  # create url for scraping
  while page < min(nPage, mPage):
    if page>0:
      targetUrl = baseUrl+"/jobs?q="+jobTitle+"&l="+location+"&sort=date"+"&start="+str(page * 10)
    else:
      targetUrl = baseUrl+"/jobs?q="+jobTitle+"&l="+location+"&sort=date"

    if verbose:
      print("\n+++++ Extracting Data from Page", page+1, "++++++")
      print("Extracting data from URL:", targetUrl)
    jobs, totalJobs, mPage = scrap_jobs(jobs, targetUrl, verbose, totalJobs, mPage)
    page += 1
  
  # attach JD
  if verbose:
    print('\n===| Attaching Job Description |===')
  
  # attach Job Description
  attach_jd(jobs, baseUrl)

  # drop duplicates
  jobsDF = pd.DataFrame(jobs)
  jobsDF.drop_duplicates(inplace=True)

  # extract & add other columns
  add_cols(jobsDF)

  # finishing
  if verbose:
    print('\n===| Cleaning Up |===')
    print("Total", jobsDF.shape[0], "unique jobs found.")
    print('\n========| Done |=========')
  
  # Return the scraped jobs
  return jobsDF


def scrap_jobs(jobs, url, verbose, totalJobs, mPage):

  # get the static page to scrap everything apart from job description
  response = requests.get(url, headers=headers)
  if verbose:
    if response.ok:
      print("Connected to", url, "Successfully.")
    else:
      print("Connection denied with response code:", response.status_code)
  html = response.text

  # Create soup
  soup = BeautifulSoup(html, 'html.parser')

  # Get Actual value for max Page
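  if totalJobs == -1:
    # The count element may be missing (e.g. blocked request or changed layout)
    countDiv = soup.find('div', {'id': 'searchCountPages'})
    countMatch = re.search(r'of (.*) jobs', countDiv.text) if countDiv is not None else None
    if countMatch:
      totalJobs = int(countMatch[1].replace(",", ""))
      # Estimate the actual number of pages, assuming 15 cards per page
      mPage = (totalJobs//15) + 1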

  # Search Area
  block=soup.find('ul',attrs={'class': re.compile('jobsearch-ResultsList')})
  
  # Collect job cards; empty list if the results list is missing
  jobCards = block.find_all('div', {'class': 'job_seen_beacon'}) if block is not None else []
  # if len(jobCards) < 15:
  #   proceed = 0
  
  # iterate through job cards
  for card in jobCards:
    jobs.append(scrap_cards(card))
  if verbose:
    print("Found Total", len(jobs), "jobs so far.")
  return jobs, totalJobs, mPage
    
def scrap_cards(card):
  # temporary dictionary holding the fields found on this card
  tempDict = dict()

  # Job Title & ID:
  title = card.find('h2',{'class':re.compile('jobTitle')})
  titleLink = title.find('a') if title is not None else None
  if titleLink is not None:
    tempDict['title']=titleLink.text
    tempDict['jobID']=titleLink.attrs.get('id')

  # Company Name:
  company = card.find('span',{'class':'companyName'})
  if company is not None:
    tempDict['company']=company.text

  # Location:
  location = card.find('div',{'class':'companyLocation'})
  if location is not None:
    tempDict['location']=location.text

  # Links: these href links lead to the full job description
  link = card.find('a', {'class': re.compile('jcs-JobTitle')})
  if link is not None:
    tempDict['link']=link['href']

  # Salary & Contract Type, if available:
  # picking all text, cleaning will be done later
  salaryCard = card.find('div',{'class': re.compile('metadataContainer')})
  if salaryCard is not None:
    tempDict['salaryDesc'] = salaryCard.text
    # salary = salaryCard.find('div',{'class': re.compile('salary')})
    # if salary is not None:
    #   tempDict['salary']=salary.text
    # contract = salaryCard.find('div',{'class': 'metadata'})
    # if contract is not None:
    #   tempDict['contractType']=contract.text

  # Job Post Date:
  postDate = card.find('span', attrs={'class': 'date'})
  if postDate is not None:
    tempDict['postDate']=postDate.find(text=True, recursive=False)

  # Contract Type:
  # contractType = card.find('div', attrs={'class': 'attribute_snippet'})
  # if contractType is not None:
  #   tempDict['contractType']=contractType.text

  # Return the fields collected for this card
  return tempDict

def attach_jd(jobs, baseUrl):
  for job in jobs:
    # Default value, overwritten only when the description is found
    job['jobDesc'] = 'Not Available'
    link = job.get('link')
    if not link:
      continue
    response = requests.get(baseUrl+link, headers=headers)
    if response.ok:
      # Create soup
      soup_ = BeautifulSoup(response.text, 'html.parser')
      jdDiv = soup_.find('div',{'class':'jobsearch-jobDescriptionText'})
      if jdDiv is not None:
        job['jobDesc'] = jdDiv.text
  return jobs

def add_cols(jobs):
  # Cards without salary information leave salaryDesc missing or NaN
  if 'salaryDesc' not in jobs.columns:
    jobs['salaryDesc'] = None

  # Job Type:
  def job_type(x):
    if not isinstance(x, str):
      return None
    return 'Contract' if 'Contract' in x else 'FullTime' if 'Full-time' in x else None
  # Salary Range: not extracted yet
  # Estimated Salary:
  def estimated(x):
    return 1 if isinstance(x, str) and 'Estimated' in x else 0
  
  # add Salary related columns
  jobs['contractType'] = jobs.salaryDesc.apply(job_type)
  jobs['estimated'] = jobs.salaryDesc.apply(estimated)

  return jobs

## Inputs from Users
what = input('Enter job title: ')
where = input('Enter job location: ')
baseUrl = 'https://'+input('Enter base url: ')
nPage = input('Pages to Scrap: ')
nPage = 1 if not nPage else int(nPage)

# Scrap Data
df = find_jobs(what, where, baseUrl, nPage)

# Enter Job Title, Location/Country, primary URL, total pages for extraction & if any 
# data = find_jobs(what='Refuse Collector', where='USA', baseUrl='https://www.indeed.com', nPage=10, verbose=True)

## Write to file
fileName = what.replace(' ', '_')+'_'+where.replace(' ', '_')+'_'+str(date.today()).replace('-', '')+'.csv'
print('\nWriting to file:', fileName)
df.to_csv('/content/'+fileName,index=False)
print('\nDone. Please find file:', fileName, 'in left pane. Refresh, if required.')

## Print Outputs
print('\n===| Showing 10 records |===\n')
display(df[['title', 'company', 'location', 'salaryDesc', 'postDate']].head(10))


Enter job title: Business Analyst
Enter job location: USA
Enter base url: www.indeed.com
Pages to Scrap: 
===| Scrapping Job Postings |===

+++++ Extracting Data from Page 1 ++++++
Extracting data from URL: https://www.indeed.com/jobs?q=Business+Analyst&l=USA&sort=date
Connected to https://www.indeed.com/jobs?q=Business+Analyst&l=USA&sort=date Successfully.
Found Total 15 jobs so far.

===| Attaching Job Description |===

===| Cleaning Up |===
Total 15 unique jobs found.


Writing to file: Business_Analyst_USA_20220815.csv

Done. Please find file: Business_Analyst_USA_20220815.csv in left pane. Refresh, if required.

===| Showing 10 records |===



Unnamed: 0,title,company,location,salaryDesc,postDate
0,Business Intelligence Analyst,Elmer's Home Services Heating & Air Conditioning,"San Antonio, TX 78233",Estimated $59K - $74.6K a yearFull-time,Just posted
1,FP&A Analyst (Business Corporate Analyst),JET AVIATION (ASIA PACIFIC) PTE LTD,"Marina, CA","$6,000 - $8,000 a month",Just posted
2,Senior Business Analyst - Referring Provider O...,Mayo Clinic,"Rochester, MN+2 locations","$77,417 - $126,963 a yearFull-timeMonday to Fr...",Just posted
3,Business Analyst - Asplundh Innovate,"Asplundh Tree Expert, LLC - 010","Remote in Willow Grove, PA",Estimated $81.7K - $103K a year,Just posted
4,Business Systems Analyst II - Legal Records Ma...,"City of Portland, OR","Portland, OR 97204 (Downtown area)+2 locations","$87,318 - $122,845 a yearFull-time",Just posted
5,ServiceNow Business Analyst (Open to Location),Slalom Consulting,"Seattle, WA 98104 (Downtown area)",Estimated $92.8K - $118K a year,Just posted
6,Software Business Analyst (FT),Cott Systems,"Columbus, OH",Estimated $61.5K - $77.9K a yearFull-time,Just posted
7,Business Analyst,TGI Direct,"Hybrid remote in Flint, MI 48507",Estimated $66.1K - $83.7K a yearFull-time,Just posted
8,Sales Analyst- Entry Level,"ABC Plumbing, Sewer, Heating, Cooling and Elec...","Arlington Heights, IL 60004","$52,000 a yearFull-timeMonday to Friday",Just posted
9,SAP FICO Business Analyst,Malor & Company,"New York, NY 10022 (Turtle Bay area)",Estimated $84.5K - $107K a yearFull-time,Just posted


### Change Log
Changes made to this notebook:

| Date | Type | User | Details |
|---|---|---|---|
| 2022-08-13 | New Notebook | `@ajmasih0309` | Set up basic functionality to scrape data from job cards and the corresponding job descriptions from Indeed USA and other countries. Basic cleaning of data to get useful output |
| 2022-08-14 | Modified Notebook | `@ajmasih0309` | Basic cleaning of data to get useful output, plus visual aids for users who execute the script for extraction |

---