• Prepare a structured dataset that includes the following fields, where available:
o Company name: Name of the company
o Country: country in which the company has its main headquarters
o Industry: Industry classification based on your source
o Year(s): Year(s) associated with the financial value; please extract the most recent 3
years of the companies’ financials, if available
o Revenue: Revenue figure
o Revenue unit: Unit or currency of the revenue
o (Optional: Add 3–5 additional KPIs of your choice in the same manner as for Revenue.)
• Please ensure that the dataset contains at least 100 companies and no more than 500
companies.

In [100]:
#!pip install kaggle

In [101]:
#!pip install kagglehub[pandas-datasets]


Before the next step I set up the Kaggle API with my token:

export KAGGLE_API_TOKEN=<--Token-->

kaggle competitions list  # this helped me check that the API was working

In [102]:
#!kaggle datasets download -d rm1000/fortune-500-companies

In [103]:
#!unzip fortune-500-companies.zip

First look at the dataset

In [104]:
import kagglehub
from kagglehub import KaggleDatasetAdapter
import pandas as pd

file_path = "/home/nathalia-uribe/Documentos/NATHY 2.0/LEARNING/Networks/Ejercicios Libro/Fortune 500 Companies.csv"

fortune500df = pd.read_csv(file_path)

print("First 5 records:", fortune500df.head())

First 5 records:                          name  rank  year                industry sector  \
0  General Motors Corporation     1  1996  Motor Vehicles & Parts    NaN   
1          Ford Motor Company     2  1996  Motor Vehicles & Parts    NaN   
2           Exxon Corporation     3  1996      Petroleum Refining    NaN   
3       Wal-Mart Stores, Inc.     4  1996   General Merchandisers    NaN   
4                  AT&T Corp.     5  1996      Telecommunications    NaN   

  headquarters_state headquarters_city  market_value_mil  revenue_mil  \
0                 MI               NaN               NaN     168828.6   
1                 MI               NaN               NaN     137137.0   
2                 TX               NaN               NaN     110009.0   
3                 AR               NaN               NaN      93627.0   
4                 NY               NaN               NaN      79609.0   

   profit_mil  asset_mil  employees founder_is_ceo female_ceo  \
0         NaN        N

Checking which metrics are easiest to obtain (share of missing values per column)

In [105]:
fortune500df.isnull().mean()


name                       0.000000
rank                       0.000000
year                       0.000000
industry                   0.000000
sector                     0.677188
headquarters_state         0.000000
headquarters_city          0.462339
market_value_mil           0.628049
revenue_mil                0.000000
profit_mil                 0.605667
asset_mil                  0.605452
employees                  0.677188
founder_is_ceo             0.677188
female_ceo                 0.677188
newcomer_to_fortune_500    0.677188
global_500                 0.713056
dtype: float64
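As a small sketch of how the null shares above could be turned into a column shortlist: the toy frame, the `max_null_share` cutoff and the `usable_cols` name are all hypothetical and only illustrate the idea, they are not part of the notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for fortune500df (all values are made up).
toy = pd.DataFrame({
    "name": ["A Corp", "B Corp", "C Corp", "D Corp"],
    "revenue_mil": [10.0, 20.0, 30.0, 40.0],
    "profit_mil": [1.0, np.nan, np.nan, 4.0],
    "global_500": [np.nan, np.nan, np.nan, "yes"],
})

# Fraction of missing values per column, same call as in the cell above.
null_share = toy.isnull().mean()

# Hypothetical cutoff: keep columns with at most 70% missing values.
max_null_share = 0.7
usable_cols = null_share[null_share <= max_null_share].index.tolist()

print(null_share.sort_values())
print(usable_cols)
```

A cutoff like this only narrows down the candidates; which KPIs to keep still depends on what the assignment asks for.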

Dropping every column that is not needed for the assignment

In [106]:
f500_clean = fortune500df.drop(columns=['founder_is_ceo', 'female_ceo', 'newcomer_to_fortune_500', 'global_500','rank'])

In [107]:
f500_clean.head()

Unnamed: 0,name,year,industry,sector,headquarters_state,headquarters_city,market_value_mil,revenue_mil,profit_mil,asset_mil,employees
0,General Motors Corporation,1996,Motor Vehicles & Parts,,MI,,,168828.6,,,
1,Ford Motor Company,1996,Motor Vehicles & Parts,,MI,,,137137.0,,,
2,Exxon Corporation,1996,Petroleum Refining,,TX,,,110009.0,,,
3,"Wal-Mart Stores, Inc.",1996,General Merchandisers,,AR,,,93627.0,,,
4,AT&T Corp.,1996,Telecommunications,,NY,,,79609.0,,,


Checking how many unique companies I have

In [108]:
f500_clean['name'].nunique()

2255

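Dropping more companies based on the suitability of the name: I removed every name that is numeric only, punctuation only, or too short (like single letters). Note that the pattern requires a name to start with at least two letters, so a name that begins with a digit is removed as well.
I also dropped all the rows of companies that are missing information I need, such as sector, city or state.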

In [109]:
# Remove rows without a company name
f500_clean = f500_clean.dropna(subset=['name'])

# Keep only names that start with at least two letters
valid_names = r'^[A-Za-z]{2,}.*$'
f500_clean = f500_clean[f500_clean['name'].str.match(valid_names, na=False)]

# Remove rows that are missing sector, city or state
f500_clean = f500_clean.dropna(subset=['sector', 'headquarters_city', 'headquarters_state'])
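To see what the name pattern keeps and drops, here is a minimal check on a handful of made-up sample names (the `sample` series is hypothetical and not taken from the dataset):

```python
import pandas as pd

# Same pattern as in the cell above: the name must start with two letters.
valid_names = r'^[A-Za-z]{2,}.*$'

# Made-up sample names, chosen to exercise each case.
sample = pd.Series(["Walmart", "3M", "12345", "...", "X", "AT&T", None])

# str.match anchors at the start of the string; na=False treats missing names as non-matching.
kept = sample[sample.str.match(valid_names, na=False)]
print(kept.tolist())
```

Running the pattern on a few known names like this is a quick way to spot names that are dropped unintentionally before applying it to the full table.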

Checking how many companies I still have...

In [110]:
f500_clean['name'].nunique()

745

In [111]:


f500_clean.head()

Unnamed: 0,name,year,industry,sector,headquarters_state,headquarters_city,market_value_mil,revenue_mil,profit_mil,asset_mil,employees
9440,Walmart,2015,General Merchandisers,Retailing,AR,Bentonville,265344.0,485651.0,16363.0,203706.0,2200000.0
9441,Exxon Mobil,2015,Petroleum Refining,Energy,TX,Irving,356549.0,382597.0,32520.0,349493.0,83700.0
9442,Chevron,2015,Petroleum Refining,Energy,CA,San Ramon,197381.0,203784.0,19241.0,266026.0,64700.0
9443,Berkshire Hathaway,2015,Insurance: Property and Casualty (Stock),Financials,NE,Omaha,357344.0,194673.0,19872.0,526186.0,316000.0
9444,Apple,2015,"Computers, Office Equipment",Technology,CA,Cupertino,724773.0,182795.0,39510.0,231839.0,97200.0


Adding the number of distinct years available for each company after the filters above ...

In [112]:
f500_clean['num_years'] = f500_clean.groupby('name')['year'].transform('nunique')
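A toy illustration of what `transform('nunique')` does here, using made-up rows: every row of a company receives the number of distinct years that company has, so the frame keeps its original shape.

```python
import pandas as pd

# Made-up rows; "A Corp" has a repeated year to show that years are counted once.
toy = pd.DataFrame({
    "name": ["A Corp", "A Corp", "A Corp", "B Corp"],
    "year": [2021, 2022, 2022, 2023],
})

# transform broadcasts the per-company result back onto every row of that company.
toy["num_years"] = toy.groupby("name")["year"].transform("nunique")
print(toy)
```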

Ordering by name and year (most recent first), so I can see them in order

In [113]:
f500_clean.sort_values(["name", "year"], ascending=[True, False])


Unnamed: 0,name,year,industry,sector,headquarters_state,headquarters_city,market_value_mil,revenue_mil,profit_mil,asset_mil,employees,num_years
13914,ABM Industries,2023,Diversified Outsourcing Services,Business Services,NY,New York,2971.0,7807.0,230.0,4869.0,127000.0,7
12901,ABM Industries,2021,Diversified Outsourcing Services,Business Services,NY,New York,3422.0,5988.0,0.0,3777.0,114000.0,7
12401,ABM Industries,2020,Diversified Outsourcing Services,Business Services,NY,New York,1623.0,6499.0,127.0,3693.0,140000.0,7
11902,ABM Industries,2019,Diversified Outsourcing Services,Business Services,NY,New York,2408.0,6442.0,98.0,3628.0,140000.0,7
11437,ABM Industries,2018,Diversified Outsourcing Services,Business Services,NY,New York,2200.0,5454.0,4.0,3813.0,140000.0,7
...,...,...,...,...,...,...,...,...,...,...,...,...
11679,salesforce.com,2019,Computer Software,Technology,CA,San Francisco,122103.0,13282.0,1110.0,30737.0,35000.0,5
11224,salesforce.com,2018,Computer Software,Technology,CA,San Francisco,85074.0,10480.0,128.0,21010.0,29000.0,5
10765,salesforce.com,2017,Computer Software,Technology,CA,San Francisco,58362.0,8392.0,180.0,17585.0,25000.0,5
10325,salesforce.com,2016,Computer Software,Technology,CA,San Francisco,49533.0,6667.0,47.0,12771.0,19742.0,5


Creating a copy of the table without companies that have fewer than 3 years of data ...

In [114]:
f500_clean= f500_clean[f500_clean['num_years'] >= 3].copy()

Checking how many companies I still have...

In [115]:
f500_clean['name'].nunique()

588

In [116]:
f500_clean.sort_values(["name", "year"], ascending=[True, False])

Unnamed: 0,name,year,industry,sector,headquarters_state,headquarters_city,market_value_mil,revenue_mil,profit_mil,asset_mil,employees,num_years
13914,ABM Industries,2023,Diversified Outsourcing Services,Business Services,NY,New York,2971.0,7807.0,230.0,4869.0,127000.0,7
12901,ABM Industries,2021,Diversified Outsourcing Services,Business Services,NY,New York,3422.0,5988.0,0.0,3777.0,114000.0,7
12401,ABM Industries,2020,Diversified Outsourcing Services,Business Services,NY,New York,1623.0,6499.0,127.0,3693.0,140000.0,7
11902,ABM Industries,2019,Diversified Outsourcing Services,Business Services,NY,New York,2408.0,6442.0,98.0,3628.0,140000.0,7
11437,ABM Industries,2018,Diversified Outsourcing Services,Business Services,NY,New York,2200.0,5454.0,4.0,3813.0,140000.0,7
...,...,...,...,...,...,...,...,...,...,...,...,...
11679,salesforce.com,2019,Computer Software,Technology,CA,San Francisco,122103.0,13282.0,1110.0,30737.0,35000.0,5
11224,salesforce.com,2018,Computer Software,Technology,CA,San Francisco,85074.0,10480.0,128.0,21010.0,29000.0,5
10765,salesforce.com,2017,Computer Software,Technology,CA,San Francisco,58362.0,8392.0,180.0,17585.0,25000.0,5
10325,salesforce.com,2016,Computer Software,Technology,CA,San Francisco,49533.0,6667.0,47.0,12771.0,19742.0,5


Ordering the table and dropping the 88 companies with the fewest years. I have 588 companies with all the information I need and the assignment allows at most 500, so I drop the ones that have been on the list for the fewest years.

In [117]:
f500_clean = f500_clean.sort_values(by='num_years', ascending=False)

company_years = f500_clean[['name', 'num_years']].drop_duplicates()
companies_to_drop = company_years.sort_values(by='num_years').head(88)['name']
f500_clean = f500_clean[~f500_clean['name'].isin(companies_to_drop)].copy()
f500_clean.head()

Unnamed: 0,name,year,industry,sector,headquarters_state,headquarters_city,market_value_mil,revenue_mil,profit_mil,asset_mil,employees,num_years
13925,Genworth Financial,2023,"Insurance: Life, Health (Stock)",Financials,VA,Richmond,2477.0,7507.0,609.0,86442.0,2500.0,9
9440,Walmart,2015,General Merchandisers,Retailing,AR,Bentonville,265344.0,485651.0,16363.0,203706.0,2200000.0,9
9441,Exxon Mobil,2015,Petroleum Refining,Energy,TX,Irving,356549.0,382597.0,32520.0,349493.0,83700.0,9
9442,Chevron,2015,Petroleum Refining,Energy,CA,San Ramon,197381.0,203784.0,19241.0,266026.0,64700.0,9
9443,Berkshire Hathaway,2015,Insurance: Property and Casualty (Stock),Financials,NE,Omaha,357344.0,194673.0,19872.0,526186.0,316000.0,9


In [118]:
f500_clean['name'].nunique()

500
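When several companies share the same `num_years` at the cutoff, which of them land in `head(88)` depends on how the sort orders the ties, and that order is not guaranteed to be stable. One possible way to make the selection reproducible is to add a second sort key. The sketch below uses made-up data, and `n_drop`, `to_drop` and `trimmed` are hypothetical names:

```python
import pandas as pd

# Made-up rows: B Corp and C Corp are tied on num_years.
toy = pd.DataFrame({
    "name": ["A Corp", "A Corp", "B Corp", "C Corp", "D Corp", "D Corp"],
    "num_years": [5, 5, 3, 3, 4, 4],
})

n_drop = 1  # the notebook drops 88

# One row per company.
company_years = toy[["name", "num_years"]].drop_duplicates()

# Sorting by num_years and then by name breaks ties the same way on every run.
to_drop = company_years.sort_values(["num_years", "name"]).head(n_drop)["name"]

trimmed = toy[~toy["name"].isin(to_drop)].copy()
print(sorted(trimmed["name"].unique()))
```

Using the name as tie-breaker is an arbitrary choice; a metric such as latest revenue could serve the same purpose.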

Adding extra columns, as requested, with the currency/unit of each metric, and also the country, since the dataset I found has the headquarters city but not the country. The state and city columns were left in intentionally, to make it clear that all headquarters are indeed in the US.

In [119]:
f500_clean['country'] = 'USA'
f500_clean['market_value_currency'] = 'USD'
f500_clean['revenue_currency'] = 'USD'
f500_clean['profit_currency'] = 'USD'
f500_clean['asset_currency'] = 'USD'
f500_clean['employees_metric'] = 'employees'
f500_clean.sort_values(["name", "year"], ascending=[True, False])

Unnamed: 0,name,year,industry,sector,headquarters_state,headquarters_city,market_value_mil,revenue_mil,profit_mil,asset_mil,employees,num_years,country,market_value_currency,revenue_currency,profit_currency,asset_currency,employees_metric
13914,ABM Industries,2023,Diversified Outsourcing Services,Business Services,NY,New York,2971.0,7807.0,230.0,4869.0,127000.0,7,USA,USD,USD,USD,USD,employees
12901,ABM Industries,2021,Diversified Outsourcing Services,Business Services,NY,New York,3422.0,5988.0,0.0,3777.0,114000.0,7,USA,USD,USD,USD,USD,employees
12401,ABM Industries,2020,Diversified Outsourcing Services,Business Services,NY,New York,1623.0,6499.0,127.0,3693.0,140000.0,7,USA,USD,USD,USD,USD,employees
11902,ABM Industries,2019,Diversified Outsourcing Services,Business Services,NY,New York,2408.0,6442.0,98.0,3628.0,140000.0,7,USA,USD,USD,USD,USD,employees
11437,ABM Industries,2018,Diversified Outsourcing Services,Business Services,NY,New York,2200.0,5454.0,4.0,3813.0,140000.0,7,USA,USD,USD,USD,USD,employees
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11679,salesforce.com,2019,Computer Software,Technology,CA,San Francisco,122103.0,13282.0,1110.0,30737.0,35000.0,5,USA,USD,USD,USD,USD,employees
11224,salesforce.com,2018,Computer Software,Technology,CA,San Francisco,85074.0,10480.0,128.0,21010.0,29000.0,5,USA,USD,USD,USD,USD,employees
10765,salesforce.com,2017,Computer Software,Technology,CA,San Francisco,58362.0,8392.0,180.0,17585.0,25000.0,5,USA,USD,USD,USD,USD,employees
10325,salesforce.com,2016,Computer Software,Technology,CA,San Francisco,49533.0,6667.0,47.0,12771.0,19742.0,5,USA,USD,USD,USD,USD,employees


Ordering the columns

In [120]:
f500_clean = f500_clean[
    ['name','num_years','year','industry','sector','country','headquarters_state','headquarters_city',
     'market_value_mil','market_value_currency','revenue_mil','revenue_currency',
     'profit_mil','profit_currency','asset_mil','asset_currency','employees',
     'employees_metric']
]

f500_clean.sort_values(["name", "year"], ascending=[True, False])


Unnamed: 0,name,num_years,year,industry,sector,country,headquarters_state,headquarters_city,market_value_mil,market_value_currency,revenue_mil,revenue_currency,profit_mil,profit_currency,asset_mil,asset_currency,employees,employees_metric
13914,ABM Industries,7,2023,Diversified Outsourcing Services,Business Services,USA,NY,New York,2971.0,USD,7807.0,USD,230.0,USD,4869.0,USD,127000.0,employees
12901,ABM Industries,7,2021,Diversified Outsourcing Services,Business Services,USA,NY,New York,3422.0,USD,5988.0,USD,0.0,USD,3777.0,USD,114000.0,employees
12401,ABM Industries,7,2020,Diversified Outsourcing Services,Business Services,USA,NY,New York,1623.0,USD,6499.0,USD,127.0,USD,3693.0,USD,140000.0,employees
11902,ABM Industries,7,2019,Diversified Outsourcing Services,Business Services,USA,NY,New York,2408.0,USD,6442.0,USD,98.0,USD,3628.0,USD,140000.0,employees
11437,ABM Industries,7,2018,Diversified Outsourcing Services,Business Services,USA,NY,New York,2200.0,USD,5454.0,USD,4.0,USD,3813.0,USD,140000.0,employees
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11679,salesforce.com,5,2019,Computer Software,Technology,USA,CA,San Francisco,122103.0,USD,13282.0,USD,1110.0,USD,30737.0,USD,35000.0,employees
11224,salesforce.com,5,2018,Computer Software,Technology,USA,CA,San Francisco,85074.0,USD,10480.0,USD,128.0,USD,21010.0,USD,29000.0,employees
10765,salesforce.com,5,2017,Computer Software,Technology,USA,CA,San Francisco,58362.0,USD,8392.0,USD,180.0,USD,17585.0,USD,25000.0,employees
10325,salesforce.com,5,2016,Computer Software,Technology,USA,CA,San Francisco,49533.0,USD,6667.0,USD,47.0,USD,12771.0,USD,19742.0,employees


Filtering the years to keep the 3 most recent only

In [121]:
Final_Fortune_500 = f500_clean.groupby('name', group_keys=False).apply(lambda x: x.nlargest(3, 'year'))
Final_Fortune_500.sort_values(["name", "year"], ascending=[True, False])



Unnamed: 0,name,num_years,year,industry,sector,country,headquarters_state,headquarters_city,market_value_mil,market_value_currency,revenue_mil,revenue_currency,profit_mil,profit_currency,asset_mil,asset_currency,employees,employees_metric
13914,ABM Industries,7,2023,Diversified Outsourcing Services,Business Services,USA,NY,New York,2971.0,USD,7807.0,USD,230.0,USD,4869.0,USD,127000.0,employees
12901,ABM Industries,7,2021,Diversified Outsourcing Services,Business Services,USA,NY,New York,3422.0,USD,5988.0,USD,0.0,USD,3777.0,USD,114000.0,employees
12401,ABM Industries,7,2020,Diversified Outsourcing Services,Business Services,USA,NY,New York,1623.0,USD,6499.0,USD,127.0,USD,3693.0,USD,140000.0,employees
13749,AECOM,9,2023,Engineering & Construction,Engineering & Construction,USA,TX,Dallas,11716.0,USD,13496.0,USD,311.0,USD,11139.0,USD,50000.0,employees
13199,AECOM,9,2022,Engineering & Construction,Engineering & Construction,USA,TX,Dallas,10857.0,USD,14112.0,USD,173.0,USD,11734.0,USD,51000.0,employees
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11391,iHeartMedia,5,2018,Entertainment,Media,USA,TX,San Antonio,39.0,USD,6178.0,USD,704.0,USD,12257.0,USD,18700.0,employees
10865,iHeartMedia,5,2017,Entertainment,Media,USA,TX,San Antonio,330.0,USD,6274.0,USD,296.0,USD,12862.0,USD,18700.0,employees
11679,salesforce.com,5,2019,Computer Software,Technology,USA,CA,San Francisco,122103.0,USD,13282.0,USD,1110.0,USD,30737.0,USD,35000.0,employees
11224,salesforce.com,5,2018,Computer Software,Technology,USA,CA,San Francisco,85074.0,USD,10480.0,USD,128.0,USD,21010.0,USD,29000.0,employees
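In some recent pandas versions, `groupby(...).apply(...)` can emit a warning about operating on the grouping columns. A possible alternative that selects the same kind of rows is to sort first and then take the first three rows of each group. This is a sketch on made-up data, not the method used in the notebook:

```python
import pandas as pd

# Made-up rows: A Corp has four years, B Corp only two.
toy = pd.DataFrame({
    "name": ["A Corp"] * 4 + ["B Corp"] * 2,
    "year": [2019, 2023, 2021, 2020, 2022, 2023],
    "revenue_mil": [1.0, 4.0, 3.0, 2.0, 7.0, 8.0],
})

# Sort so the most recent year comes first within each company,
# then keep at most three rows per company.
latest3 = (
    toy.sort_values(["name", "year"], ascending=[True, False])
       .groupby("name")
       .head(3)
)
print(latest3)
```

`groupby(...).head(3)` keeps the row order of the sorted frame, so the result is already ordered by name and by descending year.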


Grouping the rows by 'name', 'num_years', 'industry', 'sector', 'country', 'headquarters_state', 'headquarters_city', so that the yearly values of each company are collected into lists.

And here is the result:

In [122]:
group_cols = ['name', 'num_years', 'industry', 'sector', 'country', 'headquarters_state', 'headquarters_city']
Final_Fortune_500 = Final_Fortune_500.groupby(group_cols).agg(list).reset_index()
Final_Fortune_500.sort_values(["num_years"], ascending=[False])

Unnamed: 0,name,num_years,industry,sector,country,headquarters_state,headquarters_city,year,market_value_mil,market_value_currency,revenue_mil,revenue_currency,profit_mil,profit_currency,asset_mil,asset_currency,employees,employees_metric
530,eBay,9,Internet Services and Retailing,Technology,USA,CA,San Jose,"[2023, 2022, 2021]","[23821.0, 33642.0, 41671.0]","[USD, USD, USD]","[9795.0, 12394.0, 11351.0]","[USD, USD, USD]","[1269.0, 13608.0, 5667.0]","[USD, USD, USD]","[20850.0, 26626.0, 19310.0]","[USD, USD, USD]","[11600.0, 10800.0, 12700.0]","[employees, employees, employees]"
512,Western Digital,9,"Computers, Office Equipment",Technology,USA,CA,San Jose,"[2023, 2022, 2021]","[12029.0, 15536.0, 20432.0]","[USD, USD, USD]","[18793.0, 16922.0, 16736.0]","[USD, USD, USD]","[1500.0, 821.0, 250.0]","[USD, USD, USD]","[26259.0, 26132.0, 25662.0]","[USD, USD, USD]","[65000.0, 65600.0, 63800.0]","[employees, employees, employees]"
510,WestRock,9,"Packaging, Containers",Materials,USA,GA,Atlanta,"[2023, 2022, 2021]","[7759.0, 12379.0, 13716.0]","[USD, USD, USD]","[21257.0, 18746.0, 17579.0]","[USD, USD, USD]","[945.0, 838.0, 691.0]","[USD, USD, USD]","[28406.0, 29254.0, 28780.0]","[USD, USD, USD]","[50500.0, 49900.0, 49300.0]","[employees, employees, employees]"
509,Wells Fargo,9,Commercial Banks,Financials,USA,CA,San Francisco,"[2023, 2022, 2021]","[141188.0, 184225.0, 161521.0]","[USD, USD, USD]","[82859.0, 82407.0, 80303.0]","[USD, USD, USD]","[13182.0, 21548.0, 3301.0]","[USD, USD, USD]","[1881016.0, 1948068.0, 1955163.0]","[USD, USD, USD]","[238000.0, 247848.0, 268531.0]","[employees, employees, employees]"
506,Waste Management,9,Waste Management,Business Services,USA,TX,Houston,"[2023, 2022, 2021]","[66372.0, 65803.0, 54452.0]","[USD, USD, USD]","[19698.0, 17931.0, 15218.0]","[USD, USD, USD]","[2238.0, 1816.0, 1496.0]","[USD, USD, USD]","[31367.0, 29097.0, 29345.0]","[USD, USD, USD]","[49500.0, 48500.0, 48250.0]","[employees, employees, employees]"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
472,Truist Financial,4,Commercial Banks,Financials,USA,NC,Charlotte,"[2023, 2022, 2021]","[45290.0, 75354.0, 78409.0]","[USD, USD, USD]","[25356.0, 23064.0, 24427.0]","[USD, USD, USD]","[6260.0, 6440.0, 4482.0]","[USD, USD, USD]","[555255.0, 541241.0, 509228.0]","[USD, USD, USD]","[53987.0, 51462.0, 53638.0]","[employees, employees, employees]"
76,Bath & Body Works,4,Specialty Retailers: Other,Retailing,USA,OH,Columbus,"[2023, 2022]","[8368.0, 11420.0]","[USD, USD]","[7560.0, 7882.0]","[USD, USD]","[800.0, 1334.0]","[USD, USD]","[5494.0, 6026.0]","[USD, USD]","[33000.0, 32850.0]","[employees, employees]"
490,Univar Solutions,4,Wholesalers: Diversified,Wholesalers,USA,IL,Downers Grove,"[2023, 2022, 2021]","[5521.0, 5460.0, 3651.0]","[USD, USD, USD]","[11475.0, 9536.0, 8265.0]","[USD, USD, USD]","[545.0, 461.0, 53.0]","[USD, USD, USD]","[7146.0, 6778.0, 6355.0]","[USD, USD, USD]","[9746.0, 9450.0, 9457.0]","[employees, employees, employees]"
499,Vistra,4,Energy,Energy,USA,TX,Irving,"[2023, 2022, 2021]","[9155.0, 10435.0, 8512.0]","[USD, USD, USD]","[13728.0, 12077.0, 11443.0]","[USD, USD, USD]","[1227.0, 1274.0, 636.0]","[USD, USD, USD]","[32787.0, 29683.0, 25208.0]","[USD, USD, USD]","[4910.0, 5060.0, 5365.0]","[employees, employees, employees]"
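In the output above, the Bath & Body Works row lists only two years although its `num_years` is 4, and the row index reaches 530 although 500 companies were kept. One possible explanation (an assumption, not verified here) is that a descriptive column such as industry or city changed between years, which would split a company across several groups. The sketch below reproduces that effect on made-up data and shows one way to detect it and to group on the name only:

```python
import pandas as pd

# Made-up rows: B Corp changes city in its oldest year.
toy = pd.DataFrame({
    "name": ["A Corp"] * 3 + ["B Corp"] * 3,
    "headquarters_city": ["City X", "City X", "City X", "City Y", "City Y", "City Z"],
    "year": [2023, 2022, 2021, 2023, 2022, 2021],
    "revenue_mil": [3.0, 2.0, 1.0, 6.0, 5.0, 4.0],
})

# Grouping on name AND city splits B Corp into two rows.
by_all = toy.groupby(["name", "headquarters_city"]).agg(list).reset_index()
counts = by_all["name"].value_counts()
split_names = counts[counts > 1].index.tolist()
print(split_names)

# Possible alternative: group on name only and keep the most recent city.
toy_sorted = toy.sort_values(["name", "year"], ascending=[True, False])
by_name = toy_sorted.groupby("name").agg(
    headquarters_city=("headquarters_city", "first"),
    year=("year", list),
    revenue_mil=("revenue_mil", list),
).reset_index()
print(by_name)
```

Comparing the number of rows after grouping with `nunique()` of the names is a quick way to tell whether any company was split.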


In [123]:
Final_Fortune_500.isnull().mean()

name                     0.0
num_years                0.0
industry                 0.0
sector                   0.0
country                  0.0
headquarters_state       0.0
headquarters_city        0.0
year                     0.0
market_value_mil         0.0
market_value_currency    0.0
revenue_mil              0.0
revenue_currency         0.0
profit_mil               0.0
profit_currency          0.0
asset_mil                0.0
asset_currency           0.0
employees                0.0
employees_metric         0.0
dtype: float64

In [125]:
Final_Fortune_500.to_csv('final_f500.csv', index=False)
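One thing to keep in mind with this export: list-valued cells are written to CSV as text, so they come back as strings when the file is read again. The sketch below, on a made-up one-row frame and an in-memory buffer instead of a real file, shows the effect and one possible way to restore the lists with `ast.literal_eval`:

```python
import ast
import io

import pandas as pd

# Made-up one-row frame with list-valued cells, like the grouped table above.
toy = pd.DataFrame({
    "name": ["A Corp"],
    "year": [[2023, 2022, 2021]],
    "revenue_mil": [[3.0, 2.0, 1.0]],
})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

# Plain read: the list cells are now strings such as "[2023, 2022, 2021]".
raw = pd.read_csv(io.StringIO(csv_text))
print(type(raw["year"].iloc[0]))

# Parse the list columns back into Python lists while reading.
list_cols = ["year", "revenue_mil"]
restored = pd.read_csv(
    io.StringIO(csv_text),
    converters={col: ast.literal_eval for col in list_cols},
)
print(restored["year"].iloc[0])
```

`ast.literal_eval` would fail on a cell that contains `nan`, so this only works when the lists have no missing values. Keeping one row per company and year (long format) avoids the problem altogether.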