In the last lesson, we learned some of the built-in functions and methods that make exploring and analyzing data easier with pandas. In this lesson, we'll continue working with the 2017 Fortune Global 500 dataset as we learn more advanced selection and exploration techniques.

As a reminder, the data dictionary for the main columns in the `f500.csv` file is below:

- `company`: Name of the company.
- `rank`: Global 500 rank for the company.
- `revenues`: Company's total revenue for the fiscal year, in millions of dollars (USD).
- `revenue_change`: Percentage change in revenue between the current and prior fiscal year.
- `profits`: Net income for the fiscal year, in millions of dollars (USD).
- `sector`: Sector in which the company operates.
- `previous_rank`: Global 500 rank for the company for the prior year.
- `country`: Country in which the company is headquartered.
- `hq_location`: City and country (or city and state for the USA) where the company is headquartered.
- `employees`: Total employees (full-time equivalent, if available) at fiscal year-end.

In [1]:
import pandas as pd
import numpy as np

In [2]:
# read the data into a pandas dataframe
f500 = pd.read_csv('f500.csv', index_col=0)
f500.index.name = None

In [3]:
f500.head()

Unnamed: 0,rank,revenues,revenue_change,profits,assets,profit_change,ceo,industry,sector,previous_rank,country,hq_location,website,years_on_global_500_list,employees,total_stockholder_equity
Walmart,1,485873,0.8,13643.0,198825,-7.2,C. Douglas McMillon,General Merchandisers,Retailing,1,USA,"Bentonville, AR",http://www.walmart.com,23,2300000,77798
State Grid,2,315199,-4.4,9571.3,489838,-6.2,Kou Wei,Utilities,Energy,2,China,"Beijing, China",http://www.sgcc.com.cn,17,926067,209456
Sinopec Group,3,267518,-9.1,1257.9,310726,-65.0,Wang Yupu,Petroleum Refining,Energy,4,China,"Beijing, China",http://www.sinopec.com,19,713288,106523
China National Petroleum,4,262573,-12.3,1867.5,585619,-73.7,Zhang Jianhua,Petroleum Refining,Energy,3,China,"Beijing, China",http://www.cnpc.com.cn,17,1512048,301893
Toyota Motor,5,254694,7.7,16899.3,437575,-12.3,Akio Toyoda,Motor Vehicles and Parts,Motor Vehicles & Parts,8,Japan,"Toyota, Japan",http://www.toyota-global.com,23,364445,157210


In [4]:
f500.info()

<class 'pandas.core.frame.DataFrame'>
Index: 500 entries, Walmart to AutoNation
Data columns (total 16 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   rank                      500 non-null    int64  
 1   revenues                  500 non-null    int64  
 2   revenue_change            498 non-null    float64
 3   profits                   499 non-null    float64
 4   assets                    500 non-null    int64  
 5   profit_change             436 non-null    float64
 6   ceo                       500 non-null    object 
 7   industry                  500 non-null    object 
 8   sector                    500 non-null    object 
 9   previous_rank             500 non-null    int64  
 10  country                   500 non-null    object 
 11  hq_location               500 non-null    object 
 12  website                   500 non-null    object 
 13  years_on_global_500_list  500 non-null    int64  
 14  em

In [5]:
f500.loc[f500['previous_rank'] == 0].head()

Unnamed: 0,rank,revenues,revenue_change,profits,assets,profit_change,ceo,industry,sector,previous_rank,country,hq_location,website,years_on_global_500_list,employees,total_stockholder_equity
Legal & General Group,49,105235,442.3,1697.9,577954,3.4,Nigel Wilson,"Insurance: Life, Health (stock)",Financials,0,Britain,"London, Britain",http://www.legalandgeneralgroup.com,17,8939,8579
Uniper,91,74407,,-3557.5,51541,,Klaus Schafer,Energy,Energy,0,Germany,"Dusseldorf, Germany",http://www.uniper.energy,1,12890,12889
Dell Technologies,124,64806,18.1,-1672.0,118206,,Michael S. Dell,"Computers, Office Equipment",Technology,0,USA,"Round Rock, TX",http://www.delltechnologies.com,17,138000,13243
Anbang Insurance Group,139,60800,124.0,3883.9,430040,0.9,Wu Xiaohui,"Insurance: Life, Health (Mutual)",Financials,0,China,"Beijing, China",http://www.anbanggroup.com,1,40707,20372
Albertsons Cos.,141,59678,1.6,-373.3,23755,,Robert G. Miller,Food and Drug Stores,Food & Drug Stores,0,USA,"Boise, ID",http://www.albertsons.com,13,273000,1371


In [6]:
# replace 0 values in the previous_rank column with NaN
f500.loc[f500['previous_rank'] == 0, 'previous_rank'] = np.nan

In [7]:
f500.loc[['Uniper', 'Dell Technologies', 'Nokia']] #Uniper, Dell Technologies, Nokia had 0 as previous rank

Unnamed: 0,rank,revenues,revenue_change,profits,assets,profit_change,ceo,industry,sector,previous_rank,country,hq_location,website,years_on_global_500_list,employees,total_stockholder_equity
Uniper,91,74407,,-3557.5,51541,,Klaus Schafer,Energy,Energy,,Germany,"Dusseldorf, Germany",http://www.uniper.energy,1,12890,12889
Dell Technologies,124,64806,18.1,-1672.0,118206,,Michael S. Dell,"Computers, Office Equipment",Technology,,USA,"Round Rock, TX",http://www.delltechnologies.com,17,138000,13243
Nokia,415,26113,73.4,-847.1,47354,-131.0,Rajeev Suri,Network and Other Communications Equipment,Technology,,Finland,"Espoo, Finland",http://www.nokia.com,19,102687,21192
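
The replacement in In [6] can be tried on a small frame. The sketch below uses a hypothetical three-row dataframe, not the real file. It casts the column to float before assigning, on the assumption that some recent pandas releases may warn when NaN is written straight into an integer column:

```python
import numpy as np
import pandas as pd

# Hypothetical mini-frame; 0 stands for "not ranked in the prior year"
df = pd.DataFrame(
    {"rank": [1, 91, 124], "previous_rank": [1, 0, 0]},
    index=["Walmart", "Uniper", "Dell Technologies"],
)

# NaN is a floating-point value, so make the column float first
df["previous_rank"] = df["previous_rank"].astype(float)

# The boolean mask selects the rows, the label selects the column to overwrite
df.loc[df["previous_rank"] == 0, "previous_rank"] = np.nan

print(df["previous_rank"].dtype)
print(df["previous_rank"].isnull().sum())
```

This is also why previous_rank is displayed with a decimal point (for example 8.0) later in the lesson: once a column holds NaN, pandas stores it as float64.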


In [8]:
f500_selection = f500[['rank', 'revenues', 'revenue_change']].head(5)
f500_selection

Unnamed: 0,rank,revenues,revenue_change
Walmart,1,485873,0.8
State Grid,2,315199,-4.4
Sinopec Group,3,267518,-9.1
China National Petroleum,4,262573,-12.3
Toyota Motor,5,254694,7.7


In [9]:
# read in the csv file but without the index_col parameter
myf500 = pd.read_csv('f500.csv')
myf500[['company', 'rank','revenues']].head()

Unnamed: 0,company,rank,revenues
0,Walmart,1,485873
1,State Grid,2,315199
2,Sinopec Group,3,267518
3,China National Petroleum,4,262573
4,Toyota Motor,5,254694


In [10]:
# replace 0 values in the previous_rank column with NaN
myf500.loc[myf500['previous_rank'] == 0, 'previous_rank'] = np.nan

There are two differences with this approach:

- The company column is now included as a regular column, instead of being used for the index.
- The index labels are now integers starting from 0.

This is the more conventional way to read in a dataframe, and it's the method we'll use from here on. 
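
To see the two differences side by side without the real file, here is a minimal sketch that reads a hypothetical three-row CSV from a string (the values are copied from the rows shown above; `csv_text`, `by_label`, and `by_position` are illustrative names):

```python
import io
import pandas as pd

# Hypothetical stand-in for f500.csv with only three columns
csv_text = (
    "company,rank,revenues\n"
    "Walmart,1,485873\n"
    "State Grid,2,315199\n"
    "Sinopec Group,3,267518\n"
)

# index_col=0 turns the first column (company) into the index labels
by_label = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Without index_col, pandas builds a default integer index starting at 0
by_position = pd.read_csv(io.StringIO(csv_text))

print(by_label.index.tolist())
print(by_position.index.tolist())
print(by_position.columns.tolist())
```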

Just like in NumPy, we can also use integer positions to select data using DataFrame.iloc[] and Series.iloc[]. It's easy to confuse loc[] and iloc[] at first, but the easiest way to tell them apart is to remember the first letter of each method:

- loc: label based selection
- iloc: integer position based selection

Using iloc[] is almost identical to indexing with NumPy, with integer positions starting at 0 like ndarrays and Python lists. 
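
The difference is easiest to see when the index labels are integers that do not match the positions. The sketch below uses a hypothetical series for that purpose; it also shows that a loc[] slice includes its end label while an iloc[] slice excludes its end position:

```python
import pandas as pd

# Hypothetical series whose integer labels differ from the positions
s = pd.Series(["a", "b", "c", "d"], index=[10, 20, 30, 40])

by_label = s.loc[20]         # the value stored under label 20
by_position = s.iloc[2]      # the value at the third position

label_slice = s.loc[10:30]    # labels 10, 20 and 30 (end label included)
position_slice = s.iloc[0:2]  # positions 0 and 1 (end position excluded)

print(by_label, by_position)
print(label_slice.tolist())
print(position_slice.tolist())
```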

In [11]:
#Select just the fifth row of the f500 dataframe. Assign the result to fifth_row.
#Select the value in first row of the company column. Assign the result to company_value.

fifth_row = myf500.iloc[4]
company_value = myf500.iloc[0, 0]

In [12]:
fifth_row

company                                     Toyota Motor
rank                                                   5
revenues                                          254694
revenue_change                                       7.7
profits                                          16899.3
assets                                            437575
profit_change                                      -12.3
ceo                                          Akio Toyoda
industry                        Motor Vehicles and Parts
sector                            Motor Vehicles & Parts
previous_rank                                        8.0
country                                            Japan
hq_location                                Toyota, Japan
website                     http://www.toyota-global.com
years_on_global_500_list                              23
employees                                         364445
total_stockholder_equity                          157210
Name: 4, dtype: object

In [13]:
company_value

'Walmart'

As a reminder, the full syntax for DataFrame.iloc[ ] in pseudocode is:

`df.iloc[row_index,column_index]`

In [14]:
# Select the first three rows of the f500 dataframe. Assign the result to first_three_rows.
first_three_rows = myf500.iloc[:3]
first_three_rows

Unnamed: 0,company,rank,revenues,revenue_change,profits,assets,profit_change,ceo,industry,sector,previous_rank,country,hq_location,website,years_on_global_500_list,employees,total_stockholder_equity
0,Walmart,1,485873,0.8,13643.0,198825,-7.2,C. Douglas McMillon,General Merchandisers,Retailing,1.0,USA,"Bentonville, AR",http://www.walmart.com,23,2300000,77798
1,State Grid,2,315199,-4.4,9571.3,489838,-6.2,Kou Wei,Utilities,Energy,2.0,China,"Beijing, China",http://www.sgcc.com.cn,17,926067,209456
2,Sinopec Group,3,267518,-9.1,1257.9,310726,-65.0,Wang Yupu,Petroleum Refining,Energy,4.0,China,"Beijing, China",http://www.sinopec.com,19,713288,106523


In [15]:
# Select the first and seventh rows and the first five columns of the f500 dataframe. 
# Assign the result to first_seventh_row_slice.

first_seventh_row_slice = myf500.iloc[[0,6], :5]
first_seventh_row_slice

Unnamed: 0,company,rank,revenues,revenue_change,profits
0,Walmart,1,485873,0.8,13643.0
6,Royal Dutch Shell,7,240033,-11.8,4575.0


In the last couple of lessons, we used Python comparison operators like >, <, and == to create boolean masks to select subsets of data. There are also a number of pandas methods that return boolean masks useful for exploring data.

Two examples are the Series.isnull() method and Series.notnull() method. These can be used to select either rows that contain null (or NaN) values or rows that do not contain null values for a certain column.

First, let's use the Series.isnull() method to view rows with null values in the revenue_change column:

In [16]:
myf500['revenue_change'].isnull()

0      False
1      False
2      False
3      False
4      False
       ...  
495    False
496    False
497    False
498    False
499    False
Name: revenue_change, Length: 500, dtype: bool

Just like in NumPy, we can use this series to filter our dataframe, myf500:

In [17]:
rev_is_null = myf500['revenue_change'].isnull()
rev_change_null = myf500[rev_is_null]
rev_change_null[['company', 'country', 'sector']]

Unnamed: 0,company,country,sector
90,Uniper,Germany,Energy
180,Hewlett Packard Enterprise,USA,Technology
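
Series.isnull() and Series.notnull() always produce complementary masks. A minimal sketch on a hypothetical three-row dataframe:

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing revenue_change value
df = pd.DataFrame({
    "company": ["A", "B", "C"],
    "revenue_change": [0.8, np.nan, -4.4],
})

is_null = df["revenue_change"].isnull()
not_null = df["revenue_change"].notnull()

missing = df[is_null]     # rows where the value is NaN
present = df[not_null]    # rows where the value is present

print(missing["company"].tolist())
print(present["company"].tolist())
# Inverting one mask with ~ gives the other
print((is_null == ~not_null).all())
```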


In [18]:
# Use the Series.isnull() method to select all rows from f500 that have a null value for the previous_rank column. 
# Select only the company, rank, and previous_rank columns. Assign the result to null_previous_rank.

null_bool = myf500['previous_rank'].isnull()
null_previous_rank = myf500[null_bool][['company', 'rank', 'previous_rank']]
null_previous_rank.head()

Unnamed: 0,company,rank,previous_rank
48,Legal & General Group,49,
90,Uniper,91,
123,Dell Technologies,124,
138,Anbang Insurance Group,139,
140,Albertsons Cos.,141,


In [19]:
# Assign the first five rows of the null_previous_rank dataframe to the variable top5_null_prev_rank 
# by choosing the correct method out of either loc[] or iloc[].

top5_null_prev_rank = null_previous_rank.iloc[:5]
top5_null_prev_rank

Unnamed: 0,company,rank,previous_rank
48,Legal & General Group,49,
90,Uniper,91,
123,Dell Technologies,124,
138,Anbang Insurance Group,139,
140,Albertsons Cos.,141,


Now that we've identified the rows with null values in the previous_rank column, let's use the Series.notnull() method to exclude them from the next part of our analysis.

previously_ranked = myf500[myf500["previous_rank"].notnull()]

We can then compute rank_change by subtracting the rank column from the previous_rank column:

rank_change = previously_ranked["previous_rank"] - previously_ranked["rank"]
print(rank_change.shape)
print(rank_change.tail(3))

(467,)
496   -70.0
497   -61.0
498   -32.0
dtype: float64

Above, we can see that our rank_change series has 467 rows. A 467-row series has integer positions 0 through 466, yet its last index label is 498, so the index labels no longer align with the integer positions.
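
The same effect can be reproduced on a hypothetical four-row dataframe. Filtering keeps the original labels, so a label and a position can refer to different rows, and a label that was filtered out no longer exists:

```python
import numpy as np
import pandas as pd

# Hypothetical ranks; the second company has no previous rank
ranks = pd.DataFrame({
    "rank": [1, 2, 3, 4],
    "previous_rank": [1.0, np.nan, 4.0, 3.0],
})

kept = ranks[ranks["previous_rank"].notnull()]
change = kept["previous_rank"] - kept["rank"]

print(change.index.tolist())   # labels that survive the filter
print(len(change))             # number of rows, so positions run 0..len-1

last_by_position = change.iloc[2]   # third position
last_by_label = change.loc[3]       # label 3, which is the same row here
label_one_present = 1 in change.index
```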

Suppose now we decided to add the rank_change series to the f500 dataframe as a new column. Its index labels no longer match the index labels in f500, so how could this be done?

Another powerful aspect of pandas is that almost every operation will align on the index labels. Let's look at an example – below we have a dataframe named food and a series named alt_name:

![Align Operations](align_index_1_updated.svg)

The food dataframe and the alt_name series not only have a different number of items, but they also share only two index labels, corn and eggplant, and those appear in different orders. If we want to add alt_name as a new column in our food dataframe, we can use the following code:

food["alt_name"] = alt_name

When we do this, pandas will ignore the order of the alt_name series, and align on the index labels:

![Align Operations](align_index_2_updated.svg)

The pandas library will align on index at every opportunity, no matter if our index labels are strings or integers - this makes working with data from different sources or working with data when we have removed, added, or reordered rows much easier than it would be otherwise.
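
Since the figures are not reproduced here, the sketch below rebuilds the idea with hypothetical contents for food and alt_name. Only the shared labels corn and eggplant come from the text; the other labels and values are assumptions for illustration:

```python
import pandas as pd

# Hypothetical food dataframe with five labeled rows
food = pd.DataFrame(
    {"kind": ["fruit", "veg", "fruit", "veg", "veg"]},
    index=["tomato", "carrot", "lime", "corn", "eggplant"],
)

# Hypothetical series: different length, different order, one label not in food
alt_name = pd.Series({"eggplant": "aubergine", "arugula": "rocket", "corn": "maize"})

# Assignment aligns on index labels, not on order
food["alt_name"] = alt_name

print(food)
```

Rows of food without a matching label receive NaN, and the arugula entry is dropped because food has no row with that label.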

### Exercise:

- Use the Series.notnull() method to select all rows from f500 that have a non-null value for the previous_rank column. Assign the result to previously_ranked.
- From the previously_ranked dataframe, subtract the rank column from the previous_rank column. Assign the result to rank_change.
- Assign the values in rank_change to a new column in the f500 dataframe, "rank_change".

In [20]:
previously_ranked = myf500[myf500['previous_rank'].notnull()]
previously_ranked

Unnamed: 0,company,rank,revenues,revenue_change,profits,assets,profit_change,ceo,industry,sector,previous_rank,country,hq_location,website,years_on_global_500_list,employees,total_stockholder_equity
0,Walmart,1,485873,0.8,13643.0,198825,-7.2,C. Douglas McMillon,General Merchandisers,Retailing,1.0,USA,"Bentonville, AR",http://www.walmart.com,23,2300000,77798
1,State Grid,2,315199,-4.4,9571.3,489838,-6.2,Kou Wei,Utilities,Energy,2.0,China,"Beijing, China",http://www.sgcc.com.cn,17,926067,209456
2,Sinopec Group,3,267518,-9.1,1257.9,310726,-65.0,Wang Yupu,Petroleum Refining,Energy,4.0,China,"Beijing, China",http://www.sinopec.com,19,713288,106523
3,China National Petroleum,4,262573,-12.3,1867.5,585619,-73.7,Zhang Jianhua,Petroleum Refining,Energy,3.0,China,"Beijing, China",http://www.cnpc.com.cn,17,1512048,301893
4,Toyota Motor,5,254694,7.7,16899.3,437575,-12.3,Akio Toyoda,Motor Vehicles and Parts,Motor Vehicles & Parts,8.0,Japan,"Toyota, Japan",http://www.toyota-global.com,23,364445,157210
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
490,National Grid,491,22036,-3.2,10150.6,82310,160.2,John Pettigrew,Utilities,Energy,471.0,Britain,"London, Britain",http://www.nationalgrid.com,12,22132,25463
492,Telecom Italia,493,21941,-17.4,1999.4,74295,,Flavio Cattaneo,Telecommunications,Telecommunications,404.0,Italy,"Milan, Italy",http://www.telecomitalia.com,18,61227,22366
496,New China Life Insurance,497,21796,-13.3,743.9,100609,-45.6,Wan Feng,"Insurance: Life, Health (stock)",Financials,427.0,China,"Beijing, China",http://www.newchinalife.com,2,54378,8507
497,Wm. Morrison Supermarkets,498,21741,-11.3,406.4,11630,20.4,David T. Potts,Food and Drug Stores,Food & Drug Stores,437.0,Britain,"Bradford, Britain",http://www.morrisons.com,13,77210,5111


In [21]:
rank_change = previously_ranked['previous_rank'] - previously_ranked['rank']
rank_change

0       0.0
1       0.0
2       1.0
3      -1.0
4       3.0
       ... 
490   -20.0
492   -89.0
496   -70.0
497   -61.0
498   -32.0
Length: 467, dtype: float64

In [22]:
myf500['rank_change'] = rank_change

In [23]:
myf500[['company', 'rank_change']]

Unnamed: 0,company,rank_change
0,Walmart,0.0
1,State Grid,0.0
2,Sinopec Group,1.0
3,China National Petroleum,-1.0
4,Toyota Motor,3.0
...,...,...
495,Teva Pharmaceutical Industries,
496,New China Life Insurance,-70.0
497,Wm. Morrison Supermarkets,-61.0
498,TUI,-32.0


### Exercise:

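Select all companies with revenues over 100 billion and negative profits from the f500 dataframe. Because revenues are recorded in millions of dollars, 100 billion corresponds to a value of 100000. The result should include all columns.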

- Create a boolean array that selects the companies with revenues greater than 100 billion. Assign the result to large_revenue.
- Create a boolean array that selects the companies with profits less than 0. Assign the result to negative_profits.
- Combine large_revenue and negative_profits. Assign the result to combined.
- Use combined to filter f500. Assign the result to big_rev_neg_profit.


In [24]:
large_revenue = myf500['revenues'] > 100000
negative_profits = myf500['profits'] < 0
combined = large_revenue & negative_profits
big_rev_neg_profit = myf500[combined]
big_rev_neg_profit

Unnamed: 0,company,rank,revenues,revenue_change,profits,assets,profit_change,ceo,industry,sector,previous_rank,country,hq_location,website,years_on_global_500_list,employees,total_stockholder_equity,rank_change
32,Japan Post Holdings,33,122990,3.6,-267.4,2631385,-107.5,Masatsugu Nagato,"Insurance: Life, Health (stock)",Financials,37.0,Japan,"Tokyo, Japan",http://www.japanpost.jp,21,248384,91532,4.0
44,Chevron,45,107567,-18.0,-497.0,260078,-110.8,John S. Watson,Petroleum Refining,Energy,31.0,USA,"San Ramon, CA",http://www.chevron.com,23,55200,145556,-14.0
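
Boolean masks combine with & (and), | (or), and ~ (not). The sketch below uses hypothetical figures. When the comparisons are written inline, each one needs its own parentheses because & binds more tightly than > and <:

```python
import pandas as pd

# Hypothetical companies; revenues and profits in millions of dollars
df = pd.DataFrame({
    "company": ["A", "B", "C", "D"],
    "revenues": [120000, 150000, 90000, 80000],
    "profits": [-50.0, 300.0, -10.0, 25.0],
})

large_revenue = df["revenues"] > 100000
negative_profits = df["profits"] < 0

both = df[large_revenue & negative_profits]         # both conditions hold
either = df[large_revenue | negative_profits]       # at least one holds
neither = df[~(large_revenue | negative_profits)]   # neither holds

# Same filter as `both`, written inline with parentheses around each comparison
inline = df[(df["revenues"] > 100000) & (df["profits"] < 0)]

print(both["company"].tolist())
print(either["company"].tolist())
print(neither["company"].tolist())
```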


### Exercise


- Select all rows for companies whose country value is either Brazil or Venezuela. Assign the result to brazil_venezuela.
- Select the first five companies in the Technology sector for which the country is not the USA from the f500 dataframe. Assign the result to tech_outside_usa.


In [25]:
# Select all rows for companies whose country value is either Brazil or Venezuela. 
# Assign the result to brazil_venezuela.
brazil_venezuela = myf500[(myf500['country'] == 'Brazil') | (myf500['country'] == 'Venezuela')]
brazil_venezuela

Unnamed: 0,company,rank,revenues,revenue_change,profits,assets,profit_change,ceo,industry,sector,previous_rank,country,hq_location,website,years_on_global_500_list,employees,total_stockholder_equity,rank_change
74,Petrobras,75,81405,-16.3,-4838.0,246983,,Pedro Pullen Parente,Petroleum Refining,Energy,58.0,Brazil,"Rio de Janeiro, Brazil",http://www.petrobras.com.br,23,68829,76779,-17.0
112,Itau Unibanco Holding,113,66876,21.4,6666.4,415972,-13.7,Candido Botelho Bracher,Banks: Commercial and Savings,Financials,159.0,Brazil,"Sao Paulo, Brazil",http://www.itau.com.br,4,94779,37680,46.0
150,Banco do Brasil,151,58093,-13.4,2013.8,426416,-52.3,Paulo Rogerio Caffarelli,Banks: Commercial and Savings,Financials,115.0,Brazil,"Brasilia, Brazil",http://www.bb.com.br,23,100622,26551,-36.0
153,Banco Bradesco,154,57443,31.3,5127.9,366418,-5.7,Luiz Carlos Trabuco Cappi,Banks: Commercial and Savings,Financials,209.0,Brazil,"Osasco, Brazil",http://www.bradesco.com.br,21,94541,32369,55.0
190,JBS,191,48825,-0.1,107.7,31605,-92.3,Wesley Mendonca Batista,Food Production,"Food, Beverages & Tobacco",185.0,Brazil,"Sao Paulo, Brazil",http://jbss.infoinvest.com.br,8,237061,7307,-6.0
369,Vale,370,29363,14.7,3982.0,99014,,Fabio Schvartsman,"Mining, Crude-Oil Production",Energy,417.0,Brazil,"Rio de Janeiro, Brazil",http://www.vale.com,11,73062,39042,47.0
441,Mercantil Servicios Financieros,442,24403,50.3,2004.2,148659,-10.5,Gustavo J. Vollmer A.,Banks: Commercial and Savings,Financials,,Venezuela,"Caracas, Venezuela",http://www.msf.com,1,8370,7550,
486,Ultrapar Holdings,487,22167,-2.3,447.5,7426,-0.8,Thilo Mannhardt,Energy,Energy,474.0,Brazil,"Sao Paulo, Brazil",http://www.ultra.com.br,8,15173,2621,-13.0
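
When the same column is compared against several values, Series.isin() is an alternative to chaining == with |. It is not the approach used in this lesson; the sketch below, on a hypothetical four-row dataframe, only checks that both spellings select the same rows:

```python
import pandas as pd

# Hypothetical subset of companies and countries
df = pd.DataFrame({
    "company": ["Petrobras", "Walmart", "Mercantil Servicios Financieros", "Vale"],
    "country": ["Brazil", "USA", "Venezuela", "Brazil"],
})

# The pattern used in this lesson: two comparisons joined with |
with_or = df[(df["country"] == "Brazil") | (df["country"] == "Venezuela")]

# Alternative: test membership in a list of values
with_isin = df[df["country"].isin(["Brazil", "Venezuela"])]

print(with_isin["company"].tolist())
```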


In [26]:
# Select the first five companies in the Technology sector for which the country is not the USA 
# from the f500 dataframe. Assign the result to tech_outside_usa.

tech_outside_usa = myf500[(myf500['sector'] == 'Technology') & (myf500['country'] != 'USA')].head(5)
tech_outside_usa

Unnamed: 0,company,rank,revenues,revenue_change,profits,assets,profit_change,ceo,industry,sector,previous_rank,country,hq_location,website,years_on_global_500_list,employees,total_stockholder_equity,rank_change
14,Samsung Electronics,15,173957,-2.0,19316.5,217104,16.8,Oh-Hyun Kwon,"Electronics, Electrical Equip.",Technology,13.0,South Korea,"Suwon, South Korea",http://www.samsung.com,23,325000,154376,-2.0
26,Hon Hai Precision Industry,27,135129,-4.3,4608.8,80436,-0.4,Terry Gou,"Electronics, Electrical Equip.",Technology,25.0,Taiwan,"New Taipei City, Taiwan",http://www.foxconn.com,13,726772,33476,-2.0
70,Hitachi,71,84558,1.2,2134.3,86742,48.8,Toshiaki Higashihara,"Electronics, Electrical Equip.",Technology,79.0,Japan,"Tokyo, Japan",http://www.hitachi.com,23,303887,26632,8.0
82,Huawei Investment & Holding,83,78511,24.9,5579.4,63837,-5.0,Ren Zhengfei,Network and Other Communications Equipment,Technology,129.0,China,"Shenzhen, China",http://www.huawei.com,8,180000,20159,46.0
104,Sony,105,70170,3.9,676.4,158519,-45.1,Kazuo Hirai,"Electronics, Electrical Equip.",Technology,113.0,Japan,"Tokyo, Japan",http://www.sony.net,23,128400,22415,8.0


In [28]:
# Suppose we wanted to find the company that employs the most people in China. 
# We can accomplish this by first selecting all of the rows where the country column equals China:

selected_rows = myf500[myf500['country'] == 'China']

# Then, we can use the DataFrame.sort_values() method to sort the rows on the employees column
# by passing the column name to the method.
# By default, sort_values() sorts the rows in ascending order, from smallest to largest.
# To put the company with the largest number of employees first,
# we set the ascending parameter to False:
sorted_rows = selected_rows.sort_values('employees', ascending = False)
sorted_rows[['company', 'country', 'employees']]

Unnamed: 0,company,country,employees
3,China National Petroleum,China,1512048
118,China Post Group,China,941211
1,State Grid,China,926067
2,Sinopec Group,China,713288
37,Agricultural Bank of China,China,501368
...,...,...,...
182,Amer International Group,China,17852
128,Tewoo Group,China,17353
438,China National Aviation Fuel Group,China,11739
458,Yango Financial Holding,China,10234
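
As the output above shows, sorting reorders the rows but each row keeps its original index label. A minimal sketch on hypothetical data, which also shows that sort_values() returns a new dataframe and leaves the original untouched:

```python
import pandas as pd

# Hypothetical employers
df = pd.DataFrame({
    "company": ["A", "B", "C"],
    "employees": [500, 1500, 900],
})

ascending = df.sort_values("employees")                     # default: smallest first
descending = df.sort_values("employees", ascending=False)   # largest first

# After a descending sort, position 0 holds the largest employer
top_name = descending.iloc[0]["company"]

print(ascending.index.tolist())
print(descending.index.tolist())
print(top_name)
```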


### Exercise


Find the company headquartered in Japan with the largest number of employees.
- Select only the rows that have a country name equal to Japan.
- Use DataFrame.sort_values() to sort those rows by the employees column in descending order.
- Use DataFrame.iloc[] to select the first row from the sorted dataframe.
- Extract the company name from the index label company from the first row. Assign the result to top_japanese_employer.

In [37]:
japan = myf500[myf500['country'] == 'Japan']
japan_sorted = japan.sort_values('employees', ascending = False)
top_japanese_employer = japan_sorted.iloc[0, 0]
top_japanese_employer

'Toyota Motor'

Suppose we wanted to find the company that employs the most people in each of the 34 countries. Repeating the approach from the previous exercise once per country would be very inefficient, so we'll rely on a technique we haven't used yet with pandas: loops.

We've explicitly avoided using loops in pandas because one of the key benefits of pandas is that it has vectorized methods to work with data more efficiently. We'll learn more advanced techniques in later courses, but for now, we'll learn how to use loops for aggregation.

Aggregation is where we apply a statistical operation to groups of our data. Let's say that we wanted to calculate the average revenue for each country in the data set. Our process might look like this:

- Identify each unique country in the data set.
- For each country:
  - Select only the rows corresponding to that country.
  - Calculate the average revenue for those rows.

To identify the unique countries, we can use the Series.unique() method. This method returns an array of unique values from any series. Then, we can loop over that array and perform our operation. Here's what that looks like:



In [39]:
# Create an empty dictionary to store the results
avg_rev_by_country = {}

# Create an array of unique countries
countries = myf500["country"].unique()

# Use a for loop to iterate over the countries
for c in countries:

    # Use boolean comparison to select only rows that correspond to a specific country
    selected_rows = myf500[myf500["country"] == c]

    # Calculate the mean revenue for just those rows
    mean = selected_rows["revenues"].mean()

    # Assign the mean value to the dictionary, using the country name as the key
    avg_rev_by_country[c] = mean
    
avg_rev_by_country

{'USA': 64218.371212121216,
 'China': 55397.880733944956,
 'Japan': 53164.03921568627,
 'Germany': 63915.0,
 'Netherlands': 61708.92857142857,
 'Britain': 51588.708333333336,
 'South Korea': 49725.6,
 'Switzerland': 51353.57142857143,
 'France': 55231.793103448275,
 'Taiwan': 46364.666666666664,
 'Singapore': 54454.333333333336,
 'Italy': 51899.57142857143,
 'Russia': 65247.75,
 'Spain': 40600.666666666664,
 'Brazil': 52024.57142857143,
 'Mexico': 54987.5,
 'Luxembourg': 56791.0,
 'India': 39993.0,
 'Malaysia': 49479.0,
 'Thailand': 48719.0,
 'Australia': 33688.71428571428,
 'Belgium': 45905.0,
 'Norway': 45873.0,
 'Canada': 31848.0,
 'Ireland': 32819.5,
 'Indonesia': 36487.0,
 'Denmark': 35464.0,
 'Saudi Arabia': 35421.0,
 'Sweden': 27963.666666666668,
 'Finland': 26113.0,
 'Venezuela': 24403.0,
 'Turkey': 23456.0,
 'U.A.E': 22799.0,
 'Israel': 21903.0}
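
One way to sanity-check a loop like this is to compare it with pandas' built-in grouping. groupby() belongs to the more advanced techniques mentioned earlier and is not taught in this lesson; the sketch below, on hypothetical data, uses it only as a cross-check:

```python
import pandas as pd

# Hypothetical revenues for three countries
df = pd.DataFrame({
    "country": ["USA", "China", "USA", "Japan", "China"],
    "revenues": [100, 200, 300, 400, 600],
})

# Loop-based aggregation, following the pattern in this lesson
avg_by_loop = {}
for c in df["country"].unique():
    selected = df[df["country"] == c]
    avg_by_loop[c] = selected["revenues"].mean()

# Vectorized equivalent used here only to verify the loop
avg_by_groupby = df.groupby("country")["revenues"].mean().to_dict()

print(avg_by_loop)
print(avg_by_loop == avg_by_groupby)
```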

We'll practice this pattern to find the company that employs the most people in each country.

In this exercise, we're going to produce the following dictionary of the top employer in each country:

{'USA': 'Walmart',  
 'China': 'China National Petroleum',  
 'Japan': 'Toyota Motor',  
 ...
 'Turkey': 'Koc Holding',  
 'U.A.E': 'Emirates Group',  
 'Israel': 'Teva Pharmaceutical Industries'}

1. Create an empty dictionary, top_employer_by_country, to store the results of the exercise.
2. Use the Series.unique() method to create an array of unique values from the country column.
3. Use a for loop to iterate over the array of unique countries. In each iteration:
  - Select only the rows that have a country name equal to the current iteration.
  - Use DataFrame.sort_values() to sort those rows by the employees column in descending order.
  - Select the first row from the sorted dataframe.
  - Extract the company name from the index label company from the first row.
  - Assign the results to the top_employer_by_country dictionary, using the country name as the key, and the company name as the value.


In [42]:
top_employer_by_country = {}
countries = myf500['country'].unique()

for country in countries:
    filtered_by_country = myf500[myf500['country'] == country]
    most_employees = filtered_by_country.sort_values('employees', ascending=False).iloc[0,0]
    top_employer_by_country[country] = most_employees

top_employer_by_country

{'USA': 'Walmart',
 'China': 'China National Petroleum',
 'Japan': 'Toyota Motor',
 'Germany': 'Volkswagen',
 'Netherlands': 'EXOR Group',
 'Britain': 'Compass Group',
 'South Korea': 'Samsung Electronics',
 'Switzerland': 'Nestle',
 'France': 'Sodexo',
 'Taiwan': 'Hon Hai Precision Industry',
 'Singapore': 'Flex',
 'Italy': 'Poste Italiane',
 'Russia': 'Gazprom',
 'Spain': 'Banco Santander',
 'Brazil': 'JBS',
 'Mexico': 'America Movil',
 'Luxembourg': 'ArcelorMittal',
 'India': 'State Bank of India',
 'Malaysia': 'Petronas',
 'Thailand': 'PTT',
 'Australia': 'Wesfarmers',
 'Belgium': 'Anheuser-Busch InBev',
 'Norway': 'Statoil',
 'Canada': 'George Weston',
 'Ireland': 'Accenture',
 'Indonesia': 'Pertamina',
 'Denmark': 'Maersk Group',
 'Saudi Arabia': 'SABIC',
 'Sweden': 'H & M Hennes & Mauritz',
 'Finland': 'Nokia',
 'Venezuela': 'Mercantil Servicios Financieros',
 'Turkey': 'Koc Holding',
 'U.A.E': 'Emirates Group',
 'Israel': 'Teva Pharmaceutical Industries'}

Now it's time for a challenge to bring everything together! In this challenge we're going to add a new column to our dataframe, and then perform some aggregation using that new column.

The column we create is going to contain a metric called return on assets (ROA). ROA is a business metric that indicates a company's ability to generate profit from its available assets.

$$
\text{return on assets} = \frac{\text{profit}}{\text{assets}}
$$
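
As a worked example, the sketch below applies the formula to the profits and assets of the first two companies shown in f500.head() at the start of the lesson. The dataframe name `sample` is illustrative:

```python
import pandas as pd

# Profits and assets (millions of dollars) copied from the first two rows of the data
sample = pd.DataFrame({
    "company": ["Walmart", "State Grid"],
    "profits": [13643.0, 9571.3],
    "assets": [198825, 489838],
})

# Element-wise division of one column by another
sample["roa"] = sample["profits"] / sample["assets"]

print(sample[["company", "roa"]])
```

Walmart's ROA comes out at roughly 0.0686, meaning about 6.9 cents of profit for every dollar of assets.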

Once we've created the new column, we'll aggregate by sector, and find the company with the highest ROA from each sector. Like previous challenges, we'll provide some guidance in the hints, but try to complete it without them if you can.

Don't be discouraged if this challenge takes a few attempts to get correct. Working iteratively is a great way to work, and this challenge is more difficult than exercises you have previously completed.

### Exercise

1. Create a new column roa in the f500 dataframe, containing the return on assets metric for each company.
2. Aggregate the data by the sector column, and create a dictionary top_roa_by_sector, with:
  - Dictionary keys with the sector name.
  - Dictionary values with the company name with the highest ROA value from that sector.


In [51]:
myf500.columns

Index(['company', 'rank', 'revenues', 'revenue_change', 'profits', 'assets',
       'profit_change', 'ceo', 'industry', 'sector', 'previous_rank',
       'country', 'hq_location', 'website', 'years_on_global_500_list',
       'employees', 'total_stockholder_equity', 'rank_change'],
      dtype='object')

In [54]:
myf500['roa'] = myf500['profits'] / myf500['assets']
myf500[['company', 'roa']]

Unnamed: 0,company,roa
0,Walmart,0.068618
1,State Grid,0.019540
2,Sinopec Group,0.004048
3,China National Petroleum,0.003189
4,Toyota Motor,0.038620
...,...,...
495,Teva Pharmaceutical Industries,0.003542
496,New China Life Insurance,0.007394
497,Wm. Morrison Supermarkets,0.034944
498,TUI,0.070887


In [57]:
top_roa_by_sector = {}

sectors = myf500['sector'].unique()

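for sector in sectors:
    # Sort this sector's rows so the highest ROA comes first
    sector_sorted = myf500[myf500['sector'] == sector].sort_values('roa', ascending=False)
    # Look up the name by its column label rather than by position
    top_roa_by_sector[sector] = sector_sorted.iloc[0]['company']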
    
top_roa_by_sector

{'Retailing': 482,
 'Energy': 491,
 'Motor Vehicles & Parts': 352,
 'Financials': 8,
 'Technology': 305,
 'Wholesalers': 11,
 'Health Care': 358,
 'Telecommunications': 219,
 'Engineering & Construction': 89,
 'Industrials': 361,
 'Food & Drug Stores': 308,
 'Aerospace & Defense': 178,
 'Food, Beverages & Tobacco': 406,
 'Household Products': 150,
 'Transportation': 257,
 'Materials': 363,
 'Chemicals': 374,
 'Media': 161,
 'Apparel': 331,
 'Hotels, Restaurants & Leisure': 436,
 'Business Services': 434}
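
The dictionary above holds integers rather than the company names the exercise asks for. One possible explanation, offered only as an assumption, is that the cell ran against a dataframe whose first column was numeric, since a positional lookup such as .iloc[0, 0] returns whatever column happens to come first. The sketch below uses a hypothetical dataframe in which company is deliberately not the first column, and contrasts the positional lookup with a lookup by column label:

```python
import pandas as pd

# Hypothetical data; note that rank, not company, is the first column
df = pd.DataFrame({
    "rank": [1, 2, 3, 4],
    "company": ["A", "B", "C", "D"],
    "sector": ["Energy", "Energy", "Retailing", "Retailing"],
    "roa": [0.01, 0.05, 0.07, 0.02],
})

top_by_position = {}
top_by_label = {}

for sector in df["sector"].unique():
    ordered = df[df["sector"] == sector].sort_values("roa", ascending=False)
    # Positional: first row, first column (here that is rank)
    top_by_position[sector] = ordered.iloc[0, 0]
    # Label-based: first row, then the company column by name
    top_by_label[sector] = ordered.iloc[0]["company"]

print(top_by_position)
print(top_by_label)
```

Selecting the column by label keeps the result correct even if the column order changes.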