### Data Dictionary
#### Main Columns:
---
| Information | Dataset Column | Available | Comment |
|---|---|---|---|
| Job title | `title` | ✅ | Posted job title |
| Description | `jobDesc` | ✅ | Full text of the job description. Use a `print` statement to get formatted output |
| Salary | `salary` | ❌ | will be extracted from `salaryDesc` |
| Contract Type | `type` | ❌ | will be extracted from `salaryDesc` |
| Company Name | `company` | ✅ | - |
| Country | `country` | ❌ | will be extracted from `location` |
| State | `state` | ❌ | will be extracted from `location` |
| Years of Experience | `yearMinExp` | ❌ | will be extracted from `jobDesc` |
| Position | `level` | ❌ | will be extracted from `jobDesc` |
| Industry | `industry` | ❌ | will be extracted from `jobDesc` |
| Age Required | `ageCriteria` | ❌ | will be extracted from `jobDesc` |
| Skillset Required | `skills` | ❌ | will be extracted from `jobDesc` |
| Educational qualification | `eligibility` | ❌ | will be extracted from `jobDesc` | 
| Pay Frequency | `payFrequency` | ❌ | will be extracted from `jobDesc` |

---

The dataset also contains the additional columns listed below.

#### Additional Columns:

| Information | Dataset Column | Available | Comment |
|---|---|---|---|
| Job ID | `jobID` | ✅ | - |
| Location | `location` | ✅ | Any combination of city, state, country, and PIN/ZIP code |
| Salary Desc | `salaryDesc` | ✅ | Any combination of salary (actual or estimated), job type, shift, etc. |
| JD link | `link` | ✅ | Link to actual Job Description provided by Indeed |
| Post Date | `postDate` | ✅ | Recency of Job Posting |
| Estimated by Indeed | `estimated` | ❌ | Whether the salary is an Indeed estimate |

---
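
As a rough sketch of how `type` and `estimated` might later be derived from `salaryDesc`, the snippet below uses plain keyword matching, in the spirit of the notebook's commented-out `add_cols` helper. The function names `contract_type` and `is_estimated` and the sample strings are hypothetical and only illustrate the idea; real `salaryDesc` values may need more careful parsing.

```python
from typing import Optional


def contract_type(salary_desc: str) -> Optional[str]:
    """Label a posting from keywords found in its raw salaryDesc text."""
    if 'Contract' in salary_desc:
        return 'Contract'
    if 'Full-time' in salary_desc:
        return 'FullTime'
    return None


def is_estimated(salary_desc: str) -> int:
    """Return 1 when the salary text is marked as an estimate, else 0."""
    return 1 if 'Estimated' in salary_desc else 0


# Made-up salaryDesc strings, for illustration only
samples = [
    'Estimated $60K - $80K a year Full-time',
    '$40 an hour Contract',
    'Day shift',
]
derived = [(contract_type(s), is_estimated(s)) for s in samples]
```

Keyword matching keeps the raw `salaryDesc` text untouched, so stricter rules can be swapped in later without re-scraping.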

### Execute the block
**Instructions:**

1. Enter the job title. 🔴
2. Enter the country abbreviation from the table below. 🔴

| Country | Base Url |
|---|---|
| **USA** | `www.indeed.com` |
| **UK** | `uk.indeed.com` |
| **IND** | `in.indeed.com` |
| **NG** | `ng.indeed.com` |
| **CA** | `ca.indeed.com` |

3. Enter the location. This can be any city, state, or province. Leave blank and hit Enter/Return (↩) to get results across the whole country. 🟢
4. Enter the number of pages to scrape. Leave blank and hit Enter/Return (↩) to get results from the first page only. 🟢

##### 🔴 - Necessary Inputs, 🟢 - Optional Inputs
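
The inputs above end up in a search URL. The sketch below shows roughly how that mapping works, assuming the same URL pattern the notebook's `find_jobs` function builds (`/jobs?q=...&l=...&sort=date`, with `&start=` in steps of 10 for later pages). The helper name `build_search_url` is hypothetical, and it uses `urllib.parse.quote` instead of the notebook's manual `%20` join so that other special characters are encoded as well.

```python
from urllib.parse import quote

# Country abbreviation -> Indeed domain, copied from the table above
DOMAINS = {
    'usa': 'www.indeed.com',
    'uk': 'uk.indeed.com',
    'ind': 'in.indeed.com',
    'ng': 'ng.indeed.com',
    'ca': 'ca.indeed.com',
}


def build_search_url(what: str, country: str, where: str = '', page: int = 0) -> str:
    """Build the search URL for one results page (page is zero-based)."""
    base = 'https://' + DOMAINS[country.strip().lower()]
    # Collapse repeated whitespace, then percent-encode (spaces become %20)
    title = quote(' '.join(what.split()))
    location = quote(' '.join(where.split()))
    url = f'{base}/jobs?q={title}&l={location}&sort=date'
    # Later pages are addressed by a result offset in steps of 10
    if page > 0:
        url += f'&start={page * 10}'
    return url


first_page = build_search_url('data analyst', 'UK')
third_page = build_search_url('data analyst', 'UK', page=2)
```

An empty location simply leaves `l=` blank, which is how a country-wide search is requested.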

In [None]:
#@title Imports and Functions { display-mode: "form" }
### Imports
# Data handling
import re
import time
import numpy as np
import pandas as pd

# Timestamping
from datetime import date

# Web requests and HTML parsing
import requests
from bs4 import BeautifulSoup
from bs4 import BeautifulSoup as bs  # short alias used in openpage()

# Browser automation
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# define headers for connection string
headers = {"User-agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36"}

### Functions
old_page = ''

chrome_options = webdriver.ChromeOptions()
chrome_options.add_experimental_option("prefs", {"profile.default_content_settings.cookies": 2})

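# Assumes Selenium 4, where the driver path is passed through a Service object
from selenium.webdriver.chrome.service import Service

DRIVER_PATH = r'C:\Program Files\Google\Chrome\Application\chromedriver.exe'  # change this to your ChromeDriver location
driver = webdriver.Chrome(service=Service(DRIVER_PATH), options=chrome_options)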

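def openpage(url):
    """Load `url` in the shared driver and return the desktop container as soup."""
    global old_page

    driver.get(url)
    # time.sleep(10)
    cont = driver.find_elements(By.CLASS_NAME, 'is-desktop')
    try:
        element = cont[0].get_attribute('outerHTML')
        page = bs(element, "html.parser")
        old_page = page
    except Exception:
        # Container not found: fall back to the last page parsed successfully
        page = old_page

    # driver.close()
    return page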


def find_jobs(what, where, baseUrl, nPage=1, verbose=True):
    # Handling URL
    jobTitle = '%20'.join(what.split())
    location = '%20'.join(where.split())

    # Initialize list for all jobs data
    jobs = []

    # Initialize stopping criteria
    totalJobs = -1
    mPage = nPage

    # Initialize Page Index
    page = 0

    # starting
    if verbose:
        print('===========| Start |============')
        print('===| Scrapping Job Postings |===')

    # create url for scraping
    while page < min(nPage, mPage):
        if page>0:
            targetUrl = baseUrl+"/jobs?q="+jobTitle+"&l="+location+"&sort=date"+"&start="+str(page * 10)
        else:
            targetUrl = baseUrl+"/jobs?q="+jobTitle+"&l="+location+"&sort=date"

        if verbose:
            print("\n+++++ Extracting Data from Page", page+1, "++++++")
            print("Extracting data from URL:", targetUrl)
        jobs, totalJobs, mPage = scrap_jobs(jobs, targetUrl, verbose, totalJobs, mPage)
        page += 1

    # attach JD
    if verbose:
        print('\n===| Attaching Job Description |===')

    # attach Job Description
    attach_jd(jobs, baseUrl)

    # drop duplicates
    jobsDF = pd.DataFrame(jobs)
    jobsDF.drop_duplicates(inplace=True)

    # extract & add other columns
    # add_cols(jobsDF)

    # finishing
    if verbose:
        print('\n===| Cleaning Up |===')
        print("Total", jobsDF.shape[0], "unique jobs found.")
        print('\n========| Done |=========')

    # Return the scraped jobs
    return jobsDF


def scrap_jobs(jobs, url, verbose, totalJobs, mPage):

    # Legacy static fetch with requests (everything apart from the job
    # description); the page is now rendered through Selenium in openpage().
    # response = requests.get(url, headers=headers)
    # if verbose:
    #     if response.ok:
    #         print("Connected to", url, "Successfully.")
    #     else:
    #         print("Connection denied with response code:", response.status_code)
    # html = response.text
    # soup = BeautifulSoup(html, 'html.parser')
    soup = openpage(url)

    # Get the actual value for the maximum page (first call only)
    if totalJobs == -1:
        countDiv = soup.find('div', {'id': 'searchCountPages'})
        if countDiv is None:
            totalJobs = 0
        else:
            totalJobs = int(re.search(r'of (.*) jobs', countDiv.text)[1].replace(",", ""))
            # Estimate the actual number of pages
            mPage = (totalJobs//15) + 1

    # Search area (None when the results list is missing from the page)
    block = soup.find('ul', attrs={'class': re.compile('jobsearch-ResultsList')})

    # Check for stopping criteria
    #jobCards = block.find_all('div', {'class': 'job_seen_beacon'})
    # if len(jobCards) < 15:
    #   proceed = 0

    # iterate through job cards, skipping pages without a results list
    if block is not None:
        for card in block.find_all('div', {'class': 'job_seen_beacon'}):
            jobs.append(scrap_cards(card))
    if verbose:
        print("Found Total", len(jobs), "jobs so far.")
        
    return jobs, totalJobs, mPage

def scrap_cards(card):
    # temporary dictionary
    tempDict = dict()

    # Job Title & ID:
    title = card.find('h2',{'class':re.compile('jobTitle')})
    if title is not None:
        tempDict['title']=title.find('a').text
        tempDict['jobID']=title.find('a').attrs['id']

    # Company Name:
    company = card.find('span',{'class':'companyName'})
    if company is not None:
        tempDict['company']=company.text

    # Location:
    location = card.find('div',{'class':'companyLocation'})
    if location is not None:
        tempDict['location']=location.text

    # Links: these href links lead to the full job description
    link = card.find('a', {'class': re.compile('jcs-JobTitle')})
    if link is not None:
        tempDict['link']=link['href']

    # Salary & Contract Type, if available:
    # picking all text, cleaning will be done later
    salaryCard = card.find('div',{'class': re.compile('metadataContainer')})
    if salaryCard is not None:
        tempDict['salaryDesc'] = salaryCard.text
        # salary = salaryCard.find('div',{'class': re.compile('salary')})
        # if salary is not None:
        #     tempDict['salary']=salary.text
        # contract = salaryCard.find('div',{'class': 'metadata'})
        # if contract is not None:
        #     tempDict['contractType']=contract.text

    # Job Post Date:
    postDate = card.find('span', attrs={'class': 'date'})
    if postDate is not None:
        tempDict['postDate']=postDate.find(text=True, recursive=False)

    # Contract Type:
    # contractType = card.find('div', attrs={'class': 'attribute_snippet'})
    # if contractType is not None:
    #     tempDict['contractType']=contractType.text

    # Return the fields collected for this job card
    return tempDict

def attach_jd(jobs, baseUrl):
    # Open each job link and attach the job description text
    for job in jobs:
        #response = requests.get(baseUrl+job['link'], headers=headers)
        pager = openpage(baseUrl+job['link'])
        try:
            job['jobDesc'] = pager.find('div',{'class':'jobsearch-jobDescriptionText'}).text
        except Exception:
            # Description block missing or page not loaded
            job['jobDesc'] = 'Not Available'
        # Legacy requests-based version:
        # if response.ok:
        #     html_ = response.text
        #     soup_ = BeautifulSoup(html_, 'html.parser').find('div',{'class':'jobsearch-jobDescriptionText'}).text
        #     job['jobDesc'] = soup_
        # else:
        #     job['jobDesc'] = 'Not Available'

    return jobs

# def add_cols(jobs):
#   # Job Type:
#   def job_type(x):
#     return 'Contract' if x.find('Contract') != -1 else 'FullTime' if x.find('Full-time') != -1 else None
#   # Salary Range:
#   def salary_range(x):
#     return 'Contract' if x.find('Contract') != -1 else 'FullTime' if x.find('Full-time') != -1 else None
#   # Estimated Salary:
#   def estimated(x):
#     return 1 if x.find('Estimated') != -1 else 0
  
#   # add Salary related columns
#   jobs['contractType'] = jobs.salaryDesc.apply(job_type)
#   jobs['estimated'] = jobs.salaryDesc.apply(estimated)

#   return jobs

##### Main Code Section
# available domains
domains = {
    'usa': 'www.indeed.com',
    'uk': 'uk.indeed.com',
    'ind': 'in.indeed.com',
    'ng': 'ng.indeed.com',
    'ca': 'ca.indeed.com'
}
## Inputs from Users
what = input('Enter job title: ')
country = input('Enter country code: ')
where = input('Enter job location: ')
nPage = input('Pages to Scrap: ')
# Transform inputs for the functions
baseUrl = 'https://'+domains[country.strip().lower()]
nPage = 1 if not nPage.strip() else int(nPage)

# Scrap Data
df = find_jobs(what, where, baseUrl, nPage)

# Enter Job Title, Location/Country, primary URL, total pages for extraction & if any 
# data = find_jobs(what='Refuse Collector', where='USA', baseUrl='https://www.indeed.com', nPage=10, verbose=True)

## Write to file
fileName = what.replace(' ', '_')+'_'+where.replace(' ', '_')+'_'+country.upper()+'_'+str(date.today()).replace('-', '')+'.csv'
print('\nWriting to file:', fileName)
df.to_csv(fileName,index=False)
print('\nDone. Please find file:', fileName, 'in left pane. Refresh, if required.')

## Print Outputs
print('\n===| Showing 10 records |===\n')
display(df[['title', 'company', 'location', 'salaryDesc', 'postDate']].head(10))




Enter job title: data analyst
Enter country code: UK
Enter job location: 
Pages to Scrap: 600
===| Scrapping Job Postings |===

+++++ Extracting Data from Page 1 ++++++
Extracting data from URL: https://uk.indeed.com/jobs?q=data%20analyst&l=&sort=date
Found Total 15 jobs so far.

+++++ Extracting Data from Page 2 ++++++
Extracting data from URL: https://uk.indeed.com/jobs?q=data%20analyst&l=&sort=date&start=10
Found Total 30 jobs so far.

+++++ Extracting Data from Page 3 ++++++
Extracting data from URL: https://uk.indeed.com/jobs?q=data%20analyst&l=&sort=date&start=20
Found Total 45 jobs so far.

+++++ Extracting Data from Page 4 ++++++
Extracting data from URL: https://uk.indeed.com/jobs?q=data%20analyst&l=&sort=date&start=30
Found Total 60 jobs so far.

+++++ Extracting Data from Page 5 ++++++
Extracting data from URL: https://uk.indeed.com/jobs?q=data%20analyst&l=&sort=date&start=40
Found Total 75 jobs so far.

+++++ Extracting Data from Page 6 ++++++
Extracting data from URL: http

### Change Log
Changes made to this notebook:

| Date | Type | User | Details |
|---|---|---|---|
| 2022-08-13 | New Notebook | `@ajmasih0309` | Setting up basic functionalities to scrape data from job cards & corresponding job descriptions from indeed USA & other countries. Basic Cleaning of data to get useful output |
| 2022-08-14 | Modified Notebook | `@ajmasih0309` | Basic Cleaning of data to get useful output & visual aid for users who will execute the script for extraction |
| 2022-08-15 | Modified Notebook | `@ajmasih0309` | Added some cleaning steps |
| 2022-08-16 | Modified Notebook | `@ajmasih0309` | Removed cleaning steps. Tested multiple domains & added functionality to select a country using abbreviations & to select a location within the country. |
| 2022-08-17 | Modified Notebook | `@ajmasih0309` | Modified Total Jobs checking technique & Job Descriptions to avoid failure. |
---