## <font color=blue>Introduction</font>

### We scrape several job descriptions from Indeed.com and use NER to extract skills for the skillset list that our model will use.

<hr style="border:2px solid gray">

In [25]:
# Importing Libraries

import pandas as pd
from bs4 import BeautifulSoup
import json
import numpy as np
import requests
from requests.models import MissingSchema
import spacy
import trafilatura
import warnings
warnings.filterwarnings("ignore")


In [46]:
# Defining functions for scraping

def beautifulsoup_extract_text(response_content):
    # Parse the raw HTML and collect every text node in the document
    soup = BeautifulSoup(response_content, 'html.parser')
    text = soup.find_all(string=True)

    # Parent tags whose text is markup or page plumbing, not visible content
    blacklist = {
        '[document]',
        'noscript',
        'header',
        'html',
        'meta',
        'head',
        'input',
        'script',
        'style',
    }
    cleaned_text = ''
    for item in text:
        if item.parent.name not in blacklist:
            cleaned_text += '{} '.format(item)
    cleaned_text = cleaned_text.replace('\t', '')
    return cleaned_text.strip()

def extract_text(url):
    # Try trafilatura first; extract() returns a JSON string, or None on failure
    downloaded_url = trafilatura.fetch_url(url)
    date_params = {'extensive_search': True, 'original_date': True}
    try:
        a = trafilatura.extract(downloaded_url, output_format="json", with_metadata=True,
                                include_comments=False, date_extraction_params=date_params)
    except AttributeError:
        a = trafilatura.extract(downloaded_url, output_format="json", with_metadata=True,
                                date_extraction_params=date_params)

    if a:
        # The result is the JSON document itself, not a file path, so parse it directly
        json_output = json.loads(a)
        return json_output['text']

    # Fall back to a plain HTTP request parsed with BeautifulSoup
    try:
        resp = requests.get(url)
    except MissingSchema:
        return np.nan
    if resp.status_code == 200:
        return beautifulsoup_extract_text(resp.content)
    return np.nan
            

In [20]:
# Scraping a JD from Indeed.com

url = 'https://in.indeed.com/viewjob?jk=909816a7750aa309&tk=1g9kqiksisp5e803&from=serp&vjs=3'
text = extract_text(url=url)
print(text)

Data Analyst - IAU - Gurgaon, Haryana - Indeed.com 
 
 
 
 
 
 
 
 
 
 
 
 
 Find jobs Company reviews Salary Guide Post your resume Sign in Sign in Employers / Post Job Start of main content 
 
 
 
 What Where Find Jobs Data Analyst - IAU Maruti Suzuki India Ltd 767 reviews Gurgaon, Haryana Maruti Suzuki India Ltd 767 reviews Read what people are saying about working here. Job details Job Type Full-time Full Job Description Key Responsibilities 
 
  - 
 
  To identify, design and develop data analytics routines to support the audit activities performed by the internal audit team. 
 
  Maintain an effective system of data analytics and models which provide enhanced insights into risks and controls, establishes an efficient/automated means to analyze and test large volumes of data for outliers, anomalies, patterns and trends and help evaluate the adequacy and effectiveness of controls. 
 
  Create repeatable data analytics to support continuous audit monitoring programs. 
 
  Essential 
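
The scraped text above keeps the page's layout whitespace, so it contains long runs of blank lines. One way to tidy it before running NER is sketched below, assuming that line breaks only separate blocks and carry no other meaning. `collapse_whitespace` is a hypothetical helper and is not part of the notebook.

```python
import re

def collapse_whitespace(raw_text):
    """Squeeze repeated spaces/tabs and drop lines that are empty after stripping."""
    lines = (re.sub(r"[ \t]+", " ", line).strip() for line in raw_text.splitlines())
    return "\n".join(line for line in lines if line)

# Small sample shaped like the scraped output above
sample = "Data Analyst - IAU \n \n \n  Key   Responsibilities \n"
print(collapse_whitespace(sample))
```

It could be applied as `text = collapse_whitespace(text)` before the text is passed on for entity extraction.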

In [24]:
# Defining a helper that applies NER and returns the entities as a JSON-style dict

class EntityGenerator(object):
    __slots__ = ['text']

    def __init__(self, text=None):
        self.text = text

    def get(self):
        """
        Return a JSON-style dict that maps each entity label to the list of
        unique entity texts found with that label.
        """
        nlp = spacy.load("en_core_web_sm")
        doc = nlp(self.text)

        # Keep each entity text once; if a text occurs with several labels,
        # the last label seen wins.
        label_by_text = {ent.text: ent.label_ for ent in doc.ents}

        # One (initially empty) list per label, in order of first appearance
        d = {ent.label_: [] for ent in doc.ents}
        for ent_text, label in label_by_text.items():
            d[label].append(ent_text)
        return d


# Printing skills as dataframe

helper = EntityGenerator(text=text)
response = helper.get()


res = {key: response[key] for key in response.keys()
                               & {'ORG'}}
df = pd.DataFrame.from_dict(res) 
pd.options.display.max_colwidth = 100
df

| | ORG |
|---|---|
| 0 | Data Analyst - IAU - Gurgaon |
| 1 | Salary Guide Post |
| 2 | Sign in Sign in |
| 3 | Jobs Data Analyst - IAU |
| 4 | Maruti Suzuki India Ltd |
| 5 | Haryana Maruti Suzuki India Ltd |
| 6 | SQL |
| 7 | Tableau |
| 8 | Analytical |
| 9 | Apply now Apply |
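
Several of the `ORG` rows above are page navigation text ("Salary Guide Post", "Sign in Sign in", "Apply now Apply"), not skills. A minimal sketch of one way to drop them follows, assuming a hand-maintained blocklist is acceptable. The `NAVIGATION_NOISE` set and `drop_navigation_noise` are hypothetical names, and the blocklist holds only phrases seen in this output.

```python
# Hypothetical blocklist built from navigation phrases seen in the output above
NAVIGATION_NOISE = {"Salary Guide Post", "Sign in Sign in", "Apply now Apply"}

def drop_navigation_noise(entities, noise=NAVIGATION_NOISE):
    """Return the entities that are not known navigation phrases, keeping order."""
    return [entity for entity in entities if entity not in noise]

# ORG values copied from the DataFrame above
org_entities = [
    "Data Analyst - IAU - Gurgaon", "Salary Guide Post", "Sign in Sign in",
    "Jobs Data Analyst - IAU", "Maruti Suzuki India Ltd",
    "Haryana Maruti Suzuki India Ltd", "SQL", "Tableau", "Analytical",
    "Apply now Apply",
]
print(drop_navigation_noise(org_entities))
```

The job title and employer rows still remain after this step, so a blocklist alone is unlikely to yield a clean skill list. It only removes the text that repeats on every page.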


In [26]:
# Scraping a JD from Indeed.com

url1 = 'https://in.indeed.com/viewjob?jk=fcc156433e584ced&tk=1g9kqiksisp5e803&from=serp&vjs=3'
text1 = extract_text(url=url1)
print(text1)

Analyst-Data Visualization - Bengaluru, Karnataka - Indeed.com 
 
 
 
 
 
 
 
 
 
 
 
 
 Find jobs Company reviews Salary Guide Post your resume Sign in Sign in Employers / Post Job Start of main content 
 
 
 
 What Where Find Jobs Analyst-Data Visualization Accenture 22,076 reviews Bengaluru, Karnataka Accenture 22,076 reviews Read what people are saying about working here. Job Company Job details Job Type Full-time Full Job Description 
 
 Skill required:  Data Visualization - SQL (Structured Query Language) 
 
 
 Designation:  Analyst 
 
 
 Job Location:  Bengaluru 
 
 
 Qualifications:  BE/BTech 
 
 
 Years of Experience:  3-5 years 
 
 
 About Accenture 
 
  Accenture is a global professional services company with leading capabilities in digital, cloud and security. Combining unmatched experience and specialized skills across more than 40 industries, we offer Strategy and Consulting, Technology and Operations services, and Accenture Song— all powered by the world’s largest networ

In [27]:
# Printing skills as dataframe

helper1 = EntityGenerator(text=text1)
response1 = helper1.get()


res1 = {key: response1[key] for key in response1.keys()
                               & {'ORG'}}
df1 = pd.DataFrame.from_dict(res1) 
pd.options.display.max_colwidth = 100
df1

| | ORG |
|---|---|
| 0 | Analyst-Data Visualization - Bengaluru |
| 1 | Salary Guide Post |
| 2 | Sign in Sign in |
| 3 | Job Company Job |
| 4 | Structured Query Language |
| 5 | BE/BTech |
| 6 | Strategy and Consulting, Technology and Operations |
| 7 | Advanced Technology and Intelligent Operations |
| 8 | Insights & Intelligence |
| 9 | Artificial Intelligence (AI |


In [31]:
# Scraping a JD from Indeed.com

url2 = 'https://in.indeed.com/viewjob?jk=14afacd055317604&tk=1g9krrlvdjl1r801&from=serp&vjs=3'
text2 = extract_text(url=url2)
print(text2)

Data Analyst 2-8 - Bengaluru, Karnataka - Indeed.com 
 
 
 
 
 
 
 
 
 
 
 
 
 Find jobs Company reviews Salary Guide Post your resume Sign in Sign in Employers / Post Job Start of main content 
 
 
 
 What Where Find Jobs Data Analyst 2-8 PayPal 1,589 reviews Bengaluru, Karnataka PayPal 1,589 reviews Read what people are saying about working here. Job Company Job details Job Type Full-time Benefits Pulled from the full job description Health insurance Full Job Description 
 At PayPal (NASDAQ: PYPL), we believe that every person has the right to participate fully in the global economy. Our mission is to democratize financial services to ensure that everyone, regardless of background or economic standing, has access to affordable, convenient, and secure products and services to take control of their financial lives. 
 Job Description Summary: PayPal is a leading technology platform and digital payments company that enables digital and mobile payments on behalf of consumers and merchants

In [32]:
# Printing skills as dataframe

helper2 = EntityGenerator(text=text2)
response2 = helper2.get()


res2 = {key: response2[key] for key in response2.keys()
                               & {'ORG'}}
df2 = pd.DataFrame.from_dict(res2) 
pd.options.display.max_colwidth = 100
df2

| | ORG |
|---|---|
| 0 | Salary Guide Post |
| 1 | Sign in Sign in |
| 2 | Jobs Data |
| 3 | Job Company Job |
| 4 | PYPL |
| 5 | PayPal, PayPal Credit |
| 6 | Venmo |
| 7 | Collaborative & Team-Oriented |
| 8 | Statistics |
| 9 | Science, Engineering |
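
The three `ORG` lists above overlap (for example, "Sign in Sign in" appears in every one), so building a single skillset list needs a merge step. A minimal sketch follows, assuming duplicates should be matched case-insensitively and first-seen order kept. `merge_skill_candidates` is a hypothetical helper, and the sample lists are short excerpts from the tables above.

```python
def merge_skill_candidates(*entity_lists):
    """Merge several entity lists into one, dropping case-insensitive duplicates."""
    seen = set()
    merged = []
    for entities in entity_lists:
        for entity in entities:
            cleaned = entity.strip()
            key = cleaned.lower()
            if key and key not in seen:
                seen.add(key)
                merged.append(cleaned)
    return merged

# Excerpts from the three ORG columns shown above
first = ["SQL", "Tableau", "Sign in Sign in"]
second = ["Structured Query Language", "Sign in Sign in"]
third = ["Sign in Sign in", "Venmo", "Statistics"]
print(merge_skill_candidates(first, second, third))
```

With the notebook's DataFrames, the inputs could be taken as `df["ORG"].tolist()`, `df1["ORG"].tolist()`, and `df2["ORG"].tolist()`.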


<hr style="border:2px solid gray">