# **Fetching Data using Web Scraping**
Web scraping is the automated extraction of data from websites. It involves writing code that retrieves specific content from web pages, such as text, images, tables, or other structured data. It is commonly used to gather data for research, analysis, data mining, and automation.

Web scraping can range from simple tasks, like retrieving the title of an article from a news website, to more complex tasks, like extracting financial data from multiple web pages for analysis. However, it's important to note that while web scraping can be a powerful tool, it should be used ethically and responsibly. Some websites have terms of use that prohibit or restrict scraping, and improperly scraping data or overloading a website's server with requests can have legal and ethical implications.
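
One practical way to act on this is to check a site's `robots.txt` rules before requesting a page. The following is a minimal sketch using Python's standard `urllib.robotparser`. The rules in `ROBOTS_TXT`, the `example.com` URLs, and the `my-scraper` user agent are hypothetical and chosen so the example runs offline; a real check would load the target site's actual `robots.txt`, and it does not replace reading the site's terms of use.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example runs offline.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

def build_parser(robots_txt):
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser

parser = build_parser(ROBOTS_TXT)

# "my-scraper" is a made-up user agent; it falls under the "*" rules.
allowed = parser.can_fetch("my-scraper", "https://example.com/list-of-companies?page=2")
blocked = parser.can_fetch("my-scraper", "https://example.com/private/data")
delay = parser.crawl_delay("my-scraper")

print(allowed, blocked, delay)
```

For a live site, `RobotFileParser.set_url(...)` followed by `read()` downloads the rules instead of parsing an inline string.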

## **Import Required Libraries**

In [1]:
import pandas as pd
import requests
from bs4 import BeautifulSoup

## **Fetch Webpage from the Server**

In [2]:
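web_url = r"https://www.ambitionbox.com/list-of-companies?page=2"
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.162 Safari/537.36"}
# Use a timeout and stop on HTTP errors instead of parsing an error page
response = requests.get(web_url, headers=headers, timeout=30)
response.raise_for_status()
webpage = response.text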

In [3]:
# Parse the HTML into a BeautifulSoup object (the "lxml" parser requires the lxml package)
soup = BeautifulSoup(webpage, "lxml")

## **Extract Company Names from the Page**

In [4]:
# Collect the names of the first 20 companies from the <h2> tags
h2_tags = soup.find_all("h2")  # search the page once, not on every iteration
companies = []
for i in range(0, 20):
    company = h2_tags[i].text.strip()
    companies.append(company)

In [5]:
# Display all the company names
companies

['Vodafone Idea',
 'Deloitte',
 'Kotak Mahindra Bank',
 "BYJU'S",
 'Reliance Industries',
 'Bharti Airtel',
 'Tata Motors',
 'Flipkart',
 'Ernst & Young',
 'WNS',
 'Mahindra & Mahindra',
 'IndusInd Bank',
 'DXC Technology',
 'Infosys BPM',
 'IDFC FIRST Bank',
 'Maruti Suzuki',
 'Conneqt Business Solutions',
 'iEnergizer',
 'Bajaj Finserv',
 'Hinduja Global Solutions']

In [6]:
# Print the length of companies
len(companies)

20

## **Extract the Ratings of All Companies**

In [7]:
# Collect the ratings of the first 20 companies
rating_tags = soup.find_all("span", class_="companyCardWrapper__companyRatingValue")  # search once
ratings = []
for i in range(0, 20):
    rating = rating_tags[i].text.strip()
    ratings.append(rating)

In [8]:
# Print all the ratings
ratings

['4.2',
 '4.0',
 '3.8',
 '3.2',
 '4.0',
 '4.0',
 '4.2',
 '4.1',
 '3.7',
 '3.6',
 '4.1',
 '3.7',
 '3.9',
 '3.9',
 '4.0',
 '4.2',
 '3.8',
 '4.3',
 '4.0',
 '4.0']

In [9]:
# Print the length of ratings
len(ratings)

20
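
The name and rating lists are built from two independent `find_all` calls, so they line up only if every company card contains both elements. A possible alternative, sketched below, is to look up each field inside its own card so a missing element becomes `None` instead of shifting later values. The HTML snippet is hypothetical (its class names follow the ones this notebook targets, and the second card deliberately has no rating), `extract_cards` is an illustrative helper, and the built-in `html.parser` is used so the example does not depend on `lxml`.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the second card deliberately has no rating.
SAMPLE_HTML = """
<div class="companyCardWrapper__companyDetails">
  <h2 class="companyCardWrapper__companyName"> Alpha Corp </h2>
  <span class="companyCardWrapper__companyRatingValue">4.2</span>
</div>
<div class="companyCardWrapper__companyDetails">
  <h2 class="companyCardWrapper__companyName">Beta Ltd</h2>
</div>
"""

def extract_cards(html):
    """Return one dict per company card, with None for missing fields."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for card in soup.find_all("div", class_="companyCardWrapper__companyDetails"):
        # Search inside the card only, so fields from other cards cannot leak in.
        name_tag = card.find("h2", class_="companyCardWrapper__companyName")
        rating_tag = card.find("span", class_="companyCardWrapper__companyRatingValue")
        rows.append({
            "name": name_tag.get_text(strip=True) if name_tag else None,
            "rating": rating_tag.get_text(strip=True) if rating_tag else None,
        })
    return rows

rows = extract_cards(SAMPLE_HTML)
print(rows)
```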

## **Extract Content of All Companies**

In [10]:
# Collect the details string of each of the first 20 companies
detail_tags = soup.find_all("span", class_="companyCardWrapper__interLinking")  # search once
contents = []
for i in range(0, 20):
    content = detail_tags[i].text.strip()
    contents.append(content)

In [11]:
# Print all the content
print(contents)

['Telecom | 10k-50k Employees | Public | 5 years old | Gandhinagar +600 more', 'Management Consulting | 1 Lakh+ Employees | 178 years old | London +143 more', 'Banking | 10k-50k Employees | Public | 20 years old | Mumbai +506 more', 'EdTech | 1k-5k Employees | Indian Unicorn | 12 years old | Bangalore +285 more', 'Oil & Gas | Public | 46 years old | Navi Mumbai +578 more', 'Telecom | 10k-50k Employees | Public | 28 years old | Gurgaon/Gurugram +589 more', 'Automobile | 10k-50k Employees | Public | 78 years old | Pune +417 more', 'Internet | 10k-50k Employees | Public | 16 years old | Bangalore +491 more', 'Accounting & Auditing | 10k-50k Employees | 21 years old | London +85 more', 'BPO | 10k-50k Employees | 27 years old | Mumbai +30 more', 'Automobile | 1 Lakh+ Employees | Public | 78 years old | Mumbai +447 more', 'Banking | 10k-50k Employees | Public | 29 years old | Gurgaon/Gurugram +625 more', 'IT Services & Consulting | 10k-50k Employees | Public | 6 years old | Minato +57 more',

In [12]:
# Print the length of contents
len(contents)

20
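
Each details string packs several optional fields into one ` | `-separated line, so it has to be split into named fields before it is useful. The sketch below shows one way to do that by matching each part against a pattern. The function name `parse_details`, the field names, and the regular expressions are assumptions inferred from the sample output above, not a documented format of the site.

```python
import re

HQ_SUFFIX = re.compile(r" \+\d+ more$")      # e.g. "Mumbai +506 more"
AGE_PATTERN = re.compile(r"\d+ years? old")  # e.g. "20 years old"

def parse_details(text):
    """Split a ' | '-separated details string into named fields.

    Fields that do not appear in the text are left as None.
    """
    fields = {"industry": None, "employees": None, "type": None,
              "age": None, "headquarter": None}
    for index, part in enumerate(text.split(" | ")):
        part = part.strip()
        if part.endswith(" Employees"):
            fields["employees"] = part[: -len(" Employees")]
        elif AGE_PATTERN.fullmatch(part):
            fields["age"] = part[: -len(" old")]
        elif HQ_SUFFIX.search(part):
            fields["headquarter"] = HQ_SUFFIX.sub("", part)
        elif index == 0:
            # The first part is the industry unless it matched a pattern above.
            fields["industry"] = part
        else:
            fields["type"] = part
    return fields

full = parse_details("Telecom | 10k-50k Employees | Public | 5 years old | Gandhinagar +600 more")
partial = parse_details("Oil & Gas | Public | 46 years old | Navi Mumbai +578 more")
print(full)
print(partial)
```

Treating the first part as the industry only when it matches no other pattern guards against a card that lists no industry, where the employee count would otherwise be stored as the industry.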

## **Write a Function to Automate the Workflow**

In [30]:
# Parse the company cards of one page (an already fetched soup) into a dataframe
def fetchData(soup):

    # Each card holds the name, rating, and details of one company
    companies = soup.find_all("div", class_="companyCardWrapper__companyDetails")

    # Lists that collect one value per company
    companyNames = []
    ratings = []
    industries = []
    employees = []
    ctypes = []
    age = []
    hq = []

    # Iterate over the cards actually found, so a short page does not raise IndexError
    for company in companies:

        # Getting the name of the company
        company_name = company.find("h2", class_="companyCardWrapper__companyName").text.strip()
        companyNames.append(company_name)

        # Extracting the ratings
        rating = company.find("span", class_="companyCardWrapper__companyRatingValue").text.strip()
        ratings.append(rating)

        # Extracting the contents
        content = company.find("span", class_="companyCardWrapper__interLinking").text.strip()
        # Convert the content into a list
        content = content.split(sep=" | ")

        # Extracting the industry
        industry = content[0]
        industries.append(industry)

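        # Optional fields default to empty; not every card lists all of them
        employee = ""
        ctype = ""
        old = ""
        headquarter = ""

        # Classify each remaining field by the keyword it contains
        for j in content[1:]:
            if "Employees" in j:
                employee = j
            elif "old" in j:
                old = j
            elif "more" in j:
                # Drop the trailing "+N more" to keep only the location
                headquarter = j.split(sep=" ")[0:-2]
                headquarter = " ".join(headquarter)
            else:
                # Anything unrecognized is treated as the company type
                ctype = j

        # Strip the " Employees" suffix (10 characters); missing values become "NaN"
        if len(employee) > 0:
            employees.append(employee[0:-10])
        else:
            employees.append("NaN")

        # Strip the " old" suffix (4 characters)
        if len(old) > 0:
            age.append(old[0:-4])
        else:
            age.append("NaN")

        if len(headquarter) > 0:
            hq.append(headquarter)
        else:
            hq.append("NaN")

        if len(ctype) > 0:
            ctypes.append(ctype)
        else:
            ctypes.append("NaN")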
    
    # Create a dataframe
    columns = {"name": companyNames, "rating": ratings, "industry": industries,
               "employee": employees, "type": ctypes, "age": age, "headquarter": hq}
    df = pd.DataFrame(columns)
    
    return df

In [31]:
# Checking the fetchData function
fetchData(soup)

| | name | rating | industry | employee | type | age | headquarter |
|---|---|---|---|---|---|---|---|
| 0 | John Deere | 4.3 | IT Services & Consulting | 5k-10k | Fortune India 500 | 22 years | Moline |
| 1 | L&T Finance | 3.9 | NBFC | 10k-50k | Public | 29 years | Mumbai |
| 2 | Whitehat jr | 3.6 | EdTech | 5k-10k | Startup | 5 years | Mumbai |
| 3 | Intas Pharmaceuticals | 4.2 | Pharma | 10k-50k | Fortune India 500 | 23 years | Ahmedabad |
| 4 | Varun Beverages | 4.0 | Beverage | 10k-50k | Public | 28 years | Gurgaon/Gurugram |
| 5 | Thermax Limited | 4.2 | Industrial Machinery | 5k-10k | Public | 43 years | Pune |
| 6 | Ford Motor | 4.5 | Automobile | 5k-10k | Public | 120 years | Dearborn |
| 7 | UnitedHealth | 4.2 | Healthcare | 10k-50k | Forbes Global 2000 | 46 years | Minnetonka |
| 8 | Sodexo | 4.1 | Facility Management Services | 10k-50k | Forbes Global 2000 | 15 years | Issy les Moulineaux |
| 9 | Hetero Drugs | 3.8 | Pharma | 1k-5k | | 30 years | Hyderabad/Secunderabad |
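
Every column returned by `fetchData` holds strings, including the rating, the age, and the `"NaN"` placeholder for missing values. Before any numeric analysis these would typically be converted. The sketch below shows one possible approach with `pd.to_numeric` and `Series.str.extract`; the two-row dataframe and the `age_years` column name are made up for illustration.

```python
import pandas as pd

# Hypothetical rows in the same shape fetchData produces (all values are strings).
df = pd.DataFrame({
    "name": ["Alpha Corp", "Beta Ltd"],
    "rating": ["4.3", "NaN"],
    "age": ["22 years", "NaN"],
})

# Convert ratings to floats; anything unparseable becomes a real missing value.
df["rating"] = pd.to_numeric(df["rating"], errors="coerce")

# Pull the leading number out of strings such as "22 years".
age_digits = df["age"].str.extract(r"(\d+)", expand=False)
df["age_years"] = pd.to_numeric(age_digits, errors="coerce")

print(df)
```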


## **Fetch the Data for all the Webpages**

In [35]:
import time

# Store the headers in a variable
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.162 Safari/537.36"}

# Collect one dataframe per page and concatenate once at the end
page_dfs = []

# Fetch pages 1 through 499 (range(1, 500) excludes 500)
for i in range(1, 500):
    web_url = "https://www.ambitionbox.com/list-of-companies?page={}".format(i)
    webpage = requests.get(web_url, headers=headers, timeout=30).text

    soup = BeautifulSoup(webpage, "lxml")
    page_dfs.append(fetchData(soup))

    # Pause between requests to avoid overloading the server
    time.sleep(1)

final_df = pd.concat(page_dfs, ignore_index=True)
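
A loop over several hundred pages is likely to hit an occasional network failure, and a single exception would discard everything fetched so far. One common mitigation is to retry each request with exponential backoff. The sketch below is a generic, standard-library version; `fetch_with_retry`, its parameters, and the `flaky_fetch` stand-in are hypothetical names used only for this illustration.

```python
import time

def fetch_with_retry(fetch, url, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url), retrying on exceptions with exponential backoff.

    Waits base_delay, 2 * base_delay, 4 * base_delay, ... between attempts
    and re-raises the last exception once `retries` attempts have failed.
    """
    for attempt in range(retries):
        try:
            return fetch(url)
        except Exception:
            if attempt == retries - 1:
                raise
            sleep(base_delay * 2 ** attempt)

# Demonstration with a stand-in fetcher that fails twice, then succeeds.
calls = []
delays = []

def flaky_fetch(url):
    calls.append(url)
    if len(calls) < 3:
        raise ConnectionError("temporary failure")
    return "<html>ok</html>"

# Record the delays instead of actually sleeping, to keep the demo fast.
html = fetch_with_retry(flaky_fetch, "https://example.com/page", sleep=delays.append)
print(html, delays)
```

In the scraping loop, `fetch` could be a small function that wraps `requests.get(url, headers=headers, timeout=30)` and returns the response text.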

In [37]:
final_df

| | name | rating | industry | employee | type | age | headquarter |
|---|---|---|---|---|---|---|---|
| 0 | TCS | 3.8 | IT Services & Consulting | 1 Lakh+ | Public | 55 years | Mumbai |
| 1 | Accenture | 4.1 | IT Services & Consulting | 1 Lakh+ | Public | 34 years | Dublin |
| 2 | Cognizant | 3.9 | IT Services & Consulting | 1 Lakh+ | Forbes Global 2000 | 29 years | Teaneck. New Jersey. |
| 3 | Wipro | 3.8 | IT Services & Consulting | 1 Lakh+ | Public | 78 years | Bangalore/Bengaluru |
| 4 | ICICI Bank | 4.0 | Banking | 1 Lakh+ | Public | 29 years | Mumbai |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 9975 | Kestone Integrated Marketing Services | 3.4 | Marketing & Advertising | 51-200 | | 26 years | New Delhi |
| 9976 | SSPL | 3.7 | Industrial Automation | 11-50 Employee | | 20 years | Hyderabad |
| 9977 | Magnum | 3.5 | Food Processing | 1k-5k Employee | Public | 30 years | Bhopal |
| 9978 | Crimsoune Club | 4.4 | 1k-5k Employees | | | 18 years | Sonipat |


## **Export the Final Dataframe as a CSV**

In [38]:
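# A regular string with escaped backslashes; a raw string would keep them doubled
output_path = "D:\\Coding\\Datasets\\"
file_name = "company_web_scrapping_data.csv"
# index=False keeps the row index out of the CSV file
final_df.to_csv(output_path + file_name, index=False)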
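
Windows paths are easy to get wrong when they are built from backslash escapes and string concatenation. As an alternative, the sketch below builds the same location with `pathlib`. `PureWindowsPath` is used here only so the result is identical on any operating system; on an actual Windows machine, `pathlib.Path` would be the usual choice, and `to_csv` accepts such path objects directly.

```python
from pathlib import PureWindowsPath

# Forward slashes are accepted as separators and normalized on output.
output_dir = PureWindowsPath("D:/Coding/Datasets")
file_name = "company_web_scrapping_data.csv"

# The / operator joins path components with the correct separator.
csv_path = output_dir / file_name

print(csv_path)
```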