# Scraping Data from a Real Website + Pandas

This project scrapes data from the Wikipedia page "List of largest companies in the United States by revenue" using Python. The objective is to extract information about the largest companies by revenue, including company names, revenue figures, and industry classifications. The extracted data is loaded into a pandas DataFrame and exported to a CSV file so it can be analyzed further.

Key Features:

Web Scraping: Utilize libraries such as requests and BeautifulSoup to fetch and parse HTML content from the Wikipedia page.

Data Extraction: Extract relevant data points including company names, revenue, and industry categories.


Future work:

Data Processing: Use pandas for manipulation and analysis.

Output: Generate a clean, structured analysis through visualizations using Matplotlib.

In [None]:
from bs4 import BeautifulSoup
import requests

# Fetch the page from the URL and parse the HTML into a BeautifulSoup object

url = 'https://en.wikipedia.org/wiki/List_of_largest_companies_in_the_United_States_by_revenue'

page = requests.get(url)

# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(page.text, 'html.parser')

print(soup.prettify())
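
The fetch step above assumes the request succeeds. A minimal sketch of a more defensive version follows; the helper name `fetch_html`, the timeout value, and the User-Agent string are illustrative assumptions, not part of the original notebook. `raise_for_status()` turns HTTP error responses into exceptions, so a failed request is not silently parsed as if it were the page.

```python
import requests


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Return the HTML of a page, raising an exception on HTTP errors."""
    # Hypothetical identifier; some sites reject requests without a descriptive User-Agent.
    headers = {"User-Agent": "companies-scraper-example/0.1"}
    response = requests.get(url, headers=headers, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError for 4xx/5xx responses
    return response.text
```

With this helper, the parsing line could become `BeautifulSoup(fetch_html(url), 'html.parser')`.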

In [None]:
# I located the table I want; now I'll select it from the page by its index

soup.find_all('table')[1]

In [None]:
table = soup.find_all('table')[1]
print(table)
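
Selecting the table by position works until the page layout changes. As a sketch of one alternative, the table can be picked by the header titles it contains. The HTML below is a tiny stand-in, not the real Wikipedia markup, and `find_table_by_headers` is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Tiny stand-in for a page with several tables (not the real Wikipedia markup).
sample_html = """
<table><tr><th>Notes</th></tr><tr><td>sidebar</td></tr></table>
<table>
  <tr><th>Rank</th><th>Name</th></tr>
  <tr><td>1</td><td>Walmart</td></tr>
</table>
"""


def find_table_by_headers(soup, required):
    """Return the first table whose header cells include every required title."""
    for candidate in soup.find_all("table"):
        titles = {th.get_text(strip=True) for th in candidate.find_all("th")}
        if set(required) <= titles:
            return candidate
    return None  # no table matched


sample_soup = BeautifulSoup(sample_html, "html.parser")
sample_table = find_table_by_headers(sample_soup, ["Rank", "Name"])
print([th.get_text(strip=True) for th in sample_table.find_all("th")])
```

The trade-off is that this depends on the header text staying stable, whereas the index depends on the table order staying stable.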

In [27]:
# Getting the table headers

world_titles = table.find_all('th')
print(world_titles)

[<th>Rank
</th>, <th>Name
</th>, <th>Industry
</th>, <th>Revenue <br/>(USD millions)
</th>, <th>Revenue growth
</th>, <th>Employees
</th>, <th>Headquarters
</th>]


In [29]:
# Use a list comprehension to pull out each table title as text,
# cleaning the results with .strip()

world_table_titles = [title.text.strip() for title in world_titles]
print(world_table_titles)

['Rank', 'Name', 'Industry', 'Revenue (USD millions)', 'Revenue growth', 'Employees', 'Headquarters']


In [None]:
# While reviewing the titles, I noticed the code had also pulled headers from the second table on the page, which is not what I want.
# I went back to debug and fixed the error in cell In [27]:
# find_all has to be called on table, whereas it was previously called on soup.

# Now we are ready to load the data into a pandas DataFrame

In [31]:
import pandas as pd

In [39]:
df = pd.DataFrame(columns = world_table_titles)
df

Unnamed: 0,Rank,Name,Industry,Revenue (USD millions),Revenue growth,Employees,Headquarters


In [None]:
# Now I'm going back to the web page to pull the rows from the table

column_data = table.find_all('tr')
print(column_data)

In [93]:
# It's time to fill in the dataframe, but I encountered an error when trying to add each data row to the existing dataframe.
# So I tried a different approach: building a new dataframe with a list comprehension that gathers all the data at once
# and replaces the existing empty dataframe.
# I also encountered another error: an empty row at the top, because the header row has no <td> cells.
# To fix that bug, I sliced column_data so the loop starts at the first data row.

rows_to_add = [
    [data.text.strip() for data in row.find_all('td')]
    for row in column_data[1:101]
]
df = pd.DataFrame(rows_to_add, columns=world_table_titles)
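
The empty row at the top comes from the header row, which contains only `<th>` cells. As an alternative sketch to slicing by index, rows without any `<td>` cells can be skipped explicitly. The HTML here is a small stand-in built from values shown in this notebook, and the `sample_` names are illustrative.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Small stand-in table; the real page has more columns and rows.
sample_html = """
<table>
  <tr><th>Rank</th><th>Name</th><th>Industry</th></tr>
  <tr><td>1</td><td>Walmart</td><td>Retail</td></tr>
  <tr><td>2</td><td>Amazon</td><td>Retail and cloud computing</td></tr>
</table>
"""
sample_table = BeautifulSoup(sample_html, "html.parser").find("table")
sample_titles = [th.get_text(strip=True) for th in sample_table.find_all("th")]

sample_rows = []
for row in sample_table.find_all("tr"):
    cells = [td.get_text(strip=True) for td in row.find_all("td")]
    if cells:  # the header row has only <th> cells, so it yields an empty list
        sample_rows.append(cells)

sample_df = pd.DataFrame(sample_rows, columns=sample_titles)
print(sample_df)
```

Filtering on content avoids hard-coding how many rows the table has.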

In [95]:
df

Unnamed: 0,Rank,Name,Industry,Revenue (USD millions),Revenue growth,Employees,Headquarters
0,1,Walmart,Retail,611289,6.7%,2100000,"Bentonville, Arkansas"
1,2,Amazon,Retail and cloud computing,513983,9.4%,1540000,"Seattle, Washington"
2,3,ExxonMobil,Petroleum industry,413680,44.8%,62000,"Spring, Texas"
3,4,Apple,Electronics industry,394328,7.8%,164000,"Cupertino, California"
4,5,UnitedHealth Group,Healthcare,324162,12.7%,400000,"Minnetonka, Minnesota"
...,...,...,...,...,...,...,...
95,96,Best Buy,Retail,46298,10.6%,71100,"Richfield, Minnesota"
96,97,Bristol-Myers Squibb,Pharmaceutical industry,46159,0.5%,34300,"New York City, New York"
97,98,United Airlines,Airline,44955,82.5%,92795,"Chicago, Illinois"
98,99,Thermo Fisher Scientific,Laboratory instruments,44915,14.5%,130000,"Waltham, Massachusetts"
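
Every value scraped this way is a string, so the data processing listed under future work would likely start with type conversion. The sketch below assumes the values look like the ones in the output above. The sample frame and the new column name `Revenue growth (%)` are illustrative assumptions.

```python
import pandas as pd

# Two rows with values taken from the output above; scraped values are strings.
sample_df = pd.DataFrame({
    "Name": ["Walmart", "Amazon"],
    "Revenue (USD millions)": ["611289", "513983"],
    "Revenue growth": ["6.7%", "9.4%"],
    "Employees": ["2100000", "1540000"],
})

for column in ["Revenue (USD millions)", "Employees"]:
    # Drop thousands separators if any are present, then convert to numbers.
    cleaned = sample_df[column].str.replace(",", "", regex=False)
    sample_df[column] = pd.to_numeric(cleaned)

# Strip the percent sign and store growth as a float in a new column.
sample_df["Revenue growth (%)"] = pd.to_numeric(
    sample_df["Revenue growth"].str.rstrip("%")
)

print(sample_df.dtypes)
```

Once the columns are numeric, sorting, aggregation, and plotting work as expected.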


In [97]:
# It worked perfectly, so it's time to export the dataframe as a CSV file

df.to_csv(r'C:\Users\gusta\Desktop\Dri - Data Bootcamp\Companies.csv', index = False)
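
The export above uses an absolute path that exists only on one machine. As a hedged sketch, `pathlib` can build the path instead, and reading the file back confirms the export round-trips. The temporary directory and the two-row sample frame are assumptions made so the example can run anywhere.

```python
import tempfile
from pathlib import Path

import pandas as pd

sample_df = pd.DataFrame({"Rank": [1, 2], "Name": ["Walmart", "Amazon"]})

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical location for this sketch; a real run would use a project folder.
    csv_path = Path(tmp_dir) / "Companies.csv"
    sample_df.to_csv(csv_path, index=False)
    reloaded = pd.read_csv(csv_path)  # read it back to check the export

print(reloaded)
```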