# Web Scraping: Largest Companies in India

## Project Objective
The objective of this project is to scrape data about the largest companies in India
from Wikipedia using Python and BeautifulSoup, convert the data into a structured
format, and store it as a CSV file for further analysis.


## Tools & Libraries Used
- Python
- Requests
- BeautifulSoup
- Pandas


In [2]:
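from bs4 import BeautifulSoup
import requests

url = "https://en.wikipedia.org/wiki/List_of_largest_companies_in_India"

# Send a browser-like User-Agent; some sites reject the library default.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
}

# A timeout keeps the request from hanging indefinitely.
response = requests.get(url, headers=headers, timeout=10)

# Stop early if the page could not be fetched.
if response.status_code != 200:
    raise Exception(f"Failed to fetch the webpage (status {response.status_code})")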



In [5]:
# Name the parser explicitly so the result does not depend on which parsers are installed.
Soup = BeautifulSoup(response.text, 'html.parser')

In [6]:
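# The first table on the page holds the company ranking.
table = Soup.find_all('table')[0]

In [7]:
# Header cells (<th>) carry the column names.
Col_dn = table.find_all('th')

In [8]:
Col_name = [a.text.strip() for a in Col_dn]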
print(Col_name)

['Rank', 'Forbes 2000 rank', 'Name', 'Headquarters', 'Revenue(billions US$)', 'Profit(billions US$)', 'Assets(billions US$)', 'Value(billions US$)', 'Industry']


In [9]:
import pandas as pd

In [10]:
df_companies = pd.DataFrame(columns=Col_name)
df_companies

| | Rank | Forbes 2000 rank | Name | Headquarters | Revenue(billions US$) | Profit(billions US$) | Assets(billions US$) | Value(billions US$) | Industry |
|---|---|---|---|---|---|---|---|---|---|


In [12]:
table_data = table.find_all('tr')

In [13]:
for rows in table_data[1:]:  # skip the header row
    row_data = rows.find_all('td')
    full_data = [b.text.strip() for b in row_data]

    # Append the row at the next free index.
    length = len(df_companies)
    df_companies.loc[length] = full_data
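
Appending with `.loc` one row at a time works for a table of this size, but each assignment grows the DataFrame again. A minimal sketch of an alternative, assuming the same layout of one header row followed by data rows: collect the rows in a plain list and build the DataFrame once. The inline `sample_html` and the helper name `table_to_dataframe` are illustrative and not part of the original notebook.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical miniature of the Wikipedia table, so the sketch runs offline.
sample_html = """
<table>
  <tr><th>Rank</th><th>Name</th><th>Headquarters</th></tr>
  <tr><td>1</td><td>Reliance Industries Limited</td><td>Mumbai</td></tr>
  <tr><td>2</td><td>State Bank of India</td><td>Mumbai</td></tr>
</table>
"""

def table_to_dataframe(table):
    """Convert a <table> tag (header row first) into a DataFrame of strings."""
    columns = [th.text.strip() for th in table.find_all("th")]
    rows = []
    for tr in table.find_all("tr")[1:]:  # skip the header row
        cells = [td.text.strip() for td in tr.find_all("td")]
        if len(cells) == len(columns):  # ignore rows that do not match the header
            rows.append(cells)
    return pd.DataFrame(rows, columns=columns)

sample_table = BeautifulSoup(sample_html, "html.parser").find("table")
sample_df = table_to_dataframe(sample_table)
print(sample_df)
```

The length check is a defensive assumption: it silently drops rows whose cell count differs from the header, which may or may not be the desired behaviour for other tables.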

In [14]:
df_companies

| | Rank | Forbes 2000 rank | Name | Headquarters | Revenue(billions US$) | Profit(billions US$) | Assets(billions US$) | Value(billions US$) | Industry |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 49 | Reliance Industries Limited | Mumbai | 108.8 | 8.4 | 210.5 | 233.1 | Conglomerate |
| 1 | 2 | 55 | State Bank of India | Mumbai | 71.8 | 8.1 | 807.4 | 87.6 | Banking |
| 2 | 3 | 65 | HDFC Bank | Mumbai | 49.3 | 7.7 | 483.2 | 133.6 | Banking |
| 3 | 4 | 70 | Life Insurance Corporation | New Delhi | 98.0 | 4.9 | 561.4 | 73.6 | Insurance |
| 4 | 5 | 142 | ICICI Bank | Mumbai | 28.5 | 5.3 | 283.5 | 95.3 | Banking |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 66 | 65 | 1895 | Dr. Reddy's Laboratories | Hyderabad | 3.4 | 0.7 | 4.6 | 11.6 | Pharmaceuticals |
| 67 | 66 | 1908 | Varun Beverages | Gurgaon | 2.0 | 0.3 | 1.8 | 23.6 | Beverages |
| 68 | 67 | 1949 | CIFCL | Chennai | 2.3 | 0.4 | 18.8 | 13.0 | Financials |
| 69 | 68 | 1957 | NMDC | Hyderabad | 2.5 | 0.8 | 3.9 | 9.7 | Mining |


In [16]:
df_companies.to_csv("largest_companies_india.csv", index=False)
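
To confirm the export, the file can be read back and compared with the frame that was written. The sketch below uses a two-row stand-in frame and a temporary directory rather than the real scraped data; `sample_df`, `csv_path`, and `round_trip_ok` are illustrative names. Passing `dtype=str` keeps every column as text, matching what the scraper produced.

```python
import os
import tempfile

import pandas as pd

# Stand-in for df_companies; the values are strings, as scraped.
sample_df = pd.DataFrame(
    {
        "Rank": ["1", "2"],
        "Name": ["Reliance Industries Limited", "State Bank of India"],
        "Revenue(billions US$)": ["108.8", "71.8"],
    }
)

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "largest_companies_india.csv")
    sample_df.to_csv(csv_path, index=False)      # same call as in the notebook
    reloaded = pd.read_csv(csv_path, dtype=str)  # read everything back as text

# Compare column names and cell values of the written and reloaded frames.
round_trip_ok = (
    list(reloaded.columns) == list(sample_df.columns)
    and reloaded.values.tolist() == sample_df.values.tolist()
)
print(round_trip_ok)
```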

In [18]:
df_companies.head()

| | Rank | Forbes 2000 rank | Name | Headquarters | Revenue(billions US$) | Profit(billions US$) | Assets(billions US$) | Value(billions US$) | Industry |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 49 | Reliance Industries Limited | Mumbai | 108.8 | 8.4 | 210.5 | 233.1 | Conglomerate |
| 1 | 2 | 55 | State Bank of India | Mumbai | 71.8 | 8.1 | 807.4 | 87.6 | Banking |
| 2 | 3 | 65 | HDFC Bank | Mumbai | 49.3 | 7.7 | 483.2 | 133.6 | Banking |
| 3 | 4 | 70 | Life Insurance Corporation | New Delhi | 98.0 | 4.9 | 561.4 | 73.6 | Insurance |
| 4 | 5 | 142 | ICICI Bank | Mumbai | 28.5 | 5.3 | 283.5 | 95.3 | Banking |


In [19]:
df_companies.info()

<class 'pandas.core.frame.DataFrame'>
Index: 71 entries, 0 to 70
Data columns (total 9 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   Rank                   71 non-null     object
 1   Forbes 2000 rank       71 non-null     object
 2   Name                   71 non-null     object
 3   Headquarters           71 non-null     object
 4   Revenue(billions US$)  71 non-null     object
 5   Profit(billions US$)   71 non-null     object
 6   Assets(billions US$)   71 non-null     object
 7   Value(billions US$)    71 non-null     object
 8   Industry               71 non-null     object
dtypes: object(9)
memory usage: 5.5+ KB
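
The `info()` output shows every column stored as `object`, that is, as text. Before any arithmetic or sorting by value, the numeric columns need converting. A minimal sketch, assuming the column names printed earlier and using a small stand-in frame instead of the full scraped data; `errors="coerce"` turns anything unparseable into `NaN` rather than raising.

```python
import pandas as pd

# Stand-in for df_companies with a subset of its columns, all stored as text.
sample_df = pd.DataFrame(
    {
        "Rank": ["1", "2", "3"],
        "Name": ["Reliance Industries Limited", "State Bank of India", "HDFC Bank"],
        "Revenue(billions US$)": ["108.8", "71.8", "49.3"],
        "Profit(billions US$)": ["8.4", "8.1", "7.7"],
    }
)

numeric_cols = ["Rank", "Revenue(billions US$)", "Profit(billions US$)"]

for col in numeric_cols:
    # Drop thousands separators, if any, before parsing.
    cleaned = sample_df[col].str.replace(",", "", regex=False)
    sample_df[col] = pd.to_numeric(cleaned, errors="coerce")

print(sample_df.dtypes)
```

The same loop could be applied to `df_companies` with the full list of numeric column names.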


In [22]:
top_10 = df_companies.head(10)
top_10

| | Rank | Forbes 2000 rank | Name | Headquarters | Revenue(billions US$) | Profit(billions US$) | Assets(billions US$) | Value(billions US$) | Industry |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 49 | Reliance Industries Limited | Mumbai | 108.8 | 8.4 | 210.5 | 233.1 | Conglomerate |
| 1 | 2 | 55 | State Bank of India | Mumbai | 71.8 | 8.1 | 807.4 | 87.6 | Banking |
| 2 | 3 | 65 | HDFC Bank | Mumbai | 49.3 | 7.7 | 483.2 | 133.6 | Banking |
| 3 | 4 | 70 | Life Insurance Corporation | New Delhi | 98.0 | 4.9 | 561.4 | 73.6 | Insurance |
| 4 | 5 | 142 | ICICI Bank | Mumbai | 28.5 | 5.3 | 283.5 | 95.3 | Banking |
| 5 | 6 | 207 | Oil and Natural Gas Corporation | New Delhi | 77.5 | 5.1 | 80.6 | 41.9 | Oil and gas |
| 6 | 7 | 259 | Indian Oil Corporation | New Delhi | 93.8 | 5.0 | 57.8 | 27.8 | Oil and gas |
| 7 | 8 | 284 | Tata Motors | Mumbai | 52.9 | 3.8 | 44.4 | 43.8 | Automotive |
| 8 | 9 | 293 | Axis Bank | Mumbai | 16.7 | 3.2 | 182.0 | 42.3 | Banking |
| 9 | 10 | 372 | NTPC Limited | New Delhi | 21.2 | 2.4 | 54.7 | 42.5 | Utilities |
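
`head(10)` returns the first ten rows in the order the page lists them, so this "top 10" follows the table's own ranking. Ranking by another measure, such as revenue, needs the column converted to numbers first. A minimal sketch on a five-row stand-in frame whose values are copied from the output above; `sample_df` and `top_3_by_revenue` are illustrative names.

```python
import pandas as pd

# Stand-in for df_companies; revenue is text, as scraped.
sample_df = pd.DataFrame(
    {
        "Name": [
            "Reliance Industries Limited",
            "State Bank of India",
            "HDFC Bank",
            "Life Insurance Corporation",
            "ICICI Bank",
        ],
        "Revenue(billions US$)": ["108.8", "71.8", "49.3", "98.0", "28.5"],
    }
)

# Convert the text column to floats so it sorts by value, not alphabetically.
sample_df["Revenue(billions US$)"] = pd.to_numeric(
    sample_df["Revenue(billions US$)"], errors="coerce"
)

# nlargest returns the rows with the highest values, in descending order.
top_3_by_revenue = sample_df.nlargest(3, "Revenue(billions US$)")
print(top_3_by_revenue)
```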


## Conclusion
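This project demonstrates how web data can be collected and structured using
Python: the page is fetched with Requests, parsed with BeautifulSoup, and stored
in a Pandas DataFrame that is exported to CSV. Every column is still stored as
text (`object` dtype), so the numeric fields need converting before the dataset
is used for business or financial analysis.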
