In [1]:
import numpy as np
import pandas as pd

f500 = pd.read_csv("f500.csv")
# This is the more conventional way to read in a dataframe, and it's the method we'll use from here on.


In [2]:
# Replace 0 values in the "previous_rank" column with NaN

f500.loc[f500["previous_rank"] == 0, "previous_rank"] = np.nan

In [3]:
# Select the rank, revenues, and revenue_change columns in f500
f500_selection = f500[["rank","revenues","revenue_change"]].head()
f500_selection

Unnamed: 0,rank,revenues,revenue_change
0,1,485873,0.8
1,2,315199,-4.4
2,3,267518,-9.1
3,4,262573,-12.3
4,5,254694,7.7


When we worked with a dataframe with string index labels, we used `loc[]` to select data by label.

In some scenarios, using labels to make selections makes things easier — in others though, it makes things harder.

Just like in NumPy, we can also select data by integer position, using `DataFrame.iloc[]` and `Series.iloc[]`. It's easy to get `loc[]` and `iloc[]` confused at first; the easiest way to tell them apart is to remember what the first letter of each stands for:

* `loc`: label based selection
* `iloc`: integer position based selection
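To make the distinction concrete, here is a minimal sketch on a small made-up dataframe (the values and the `toy` name are illustrative assumptions, not data from `f500.csv`). It selects the same cell once by label and once by position:

```python
import pandas as pd

# Hypothetical toy data with string index labels; not taken from f500.csv.
toy = pd.DataFrame(
    {"revenues": [300, 200, 100], "country": ["USA", "China", "Japan"]},
    index=["Walmart", "State Grid", "Toyota Motor"],
)

# loc: select by label (row label, column label)
by_label = toy.loc["State Grid", "revenues"]

# iloc: select by integer position (second row, first column)
by_position = toy.iloc[1, 0]

print(by_label, by_position)
```

Both selections point at the same cell; only the way of addressing it differs.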

In [4]:
# Select just the fifth row of the f500 dataframe
fifth_row = f500.iloc[4]

In [5]:
# Select the value in the first row of the company column (the first column).
company_value = f500.iloc[0,0]

`loc[]` and `iloc[]` also handle slicing differently:

- With `loc[]`, the end of the slice is included.
- With `iloc[]`, the end of the slice is not included.
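The difference is easiest to see on a series whose labels happen to equal its positions. The sketch below uses a made-up series `s` (an assumption for illustration) and applies the same slice bounds with both indexers:

```python
import pandas as pd

# Hypothetical toy series with the default integer index 0..4.
s = pd.Series([10, 20, 30, 40, 50])

label_slice = s.loc[1:3]      # labels 1, 2 and 3: the end label is included
position_slice = s.iloc[1:3]  # positions 1 and 2: the end position is excluded

print(len(label_slice), len(position_slice))
```

The same bounds therefore return three items with `loc[]` and two with `iloc[]`.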

In [6]:
# Select the first three rows of the f500 dataframe
first_three_rows = f500[:3] # or f500.iloc[:3]


In [7]:
# Select the first and seventh rows and the first five columns of the f500 dataframe
first_seventh_row_slice = f500.iloc[[0,6],0:5]

When we really only want to access a single location, it is recommended to use the `DataFrame.iat` property. It has the same syntax but doesn't allow ranges.

In [3]:
f500.iat[1, 0]

'State Grid'

Similarly, when we only want a specific cell by label, it is better to use the `DataFrame.at` property, the label-based counterpart of `loc`. Here is an example:

In [4]:
f500.at[1, "rank"]

2

We used Python comparison operators like `>`, `<`, and `==` to create boolean masks to select subsets of data. There are also a number of pandas methods that return boolean masks useful for exploring data.

Two examples are the `Series.isnull()` method and `Series.notnull()` method. These can be used to select either rows that contain null (or NaN) values or rows that do not contain null values for a certain column.
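As a small illustration of what these masks look like, the following sketch uses a made-up three-row dataframe (the `toy` name and its values are assumptions, not part of `f500`):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: company "B" has no previous rank.
toy = pd.DataFrame(
    {"company": ["A", "B", "C"], "previous_rank": [1.0, np.nan, 3.0]}
)

is_null = toy["previous_rank"].isnull()          # boolean mask: False, True, False
missing = toy[is_null]                            # rows where previous_rank is NaN
present = toy[toy["previous_rank"].notnull()]     # rows where previous_rank has a value

print(is_null.tolist())
```

The two masks are exact opposites, so every row ends up in either `missing` or `present`.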

In [8]:
# Use the Series.isnull() method to select all rows from f500 that have a null value for the previous_rank column. 
# Select only the company, rank, and previous_rank columns
null_previous_rank = f500.loc[f500["previous_rank"].isnull(),
                              ["company", "rank", "previous_rank"]]


In [9]:
previously_ranked = f500[f500["previous_rank"].notnull()]
# previous_rank minus rank, so a positive value means the company moved up the list
rank_change = previously_ranked["previous_rank"] - previously_ranked["rank"]

In [10]:
null_previous_rank = f500[f500["previous_rank"].isnull()]
# Assign the first five rows of the null_previous_rank dataframe to the variable top5_null_prev_rank.
# iloc[] is the right choice here because the index labels (48, 90, ...) no longer match the row positions.
top5_null_prev_rank = null_previous_rank.iloc[0:5] # null_previous_rank[0:5] or .head() would also work
top5_null_prev_rank

Unnamed: 0,company,rank,revenues,revenue_change,profits,assets,profit_change,ceo,industry,sector,previous_rank,country,hq_location,website,years_on_global_500_list,employees,total_stockholder_equity
48,Legal & General Group,49,105235,442.3,1697.9,577954,3.4,Nigel Wilson,"Insurance: Life, Health (stock)",Financials,,Britain,"London, Britain",http://www.legalandgeneralgroup.com,17,8939,8579
90,Uniper,91,74407,,-3557.5,51541,,Klaus Schafer,Energy,Energy,,Germany,"Dusseldorf, Germany",http://www.uniper.energy,1,12890,12889
123,Dell Technologies,124,64806,18.1,-1672.0,118206,,Michael S. Dell,"Computers, Office Equipment",Technology,,USA,"Round Rock, TX",http://www.delltechnologies.com,17,138000,13243
138,Anbang Insurance Group,139,60800,124.0,3883.9,430040,0.9,Wu Xiaohui,"Insurance: Life, Health (Mutual)",Financials,,China,"Beijing, China",http://www.anbanggroup.com,1,40707,20372
140,Albertsons Cos.,141,59678,1.6,-373.3,23755,,Robert G. Miller,Food and Drug Stores,Food & Drug Stores,,USA,"Boise, ID",http://www.albertsons.com,13,273000,1371


In [11]:
# Use the Series.notnull() method to select all rows from f500 that have a non-null value for the previous_rank column.
previously_ranked = f500.loc[f500["previous_rank"].notnull()]
print(previously_ranked.shape[0])
print(previously_ranked.tail(1))


467
    company  rank  revenues  revenue_change  profits  assets  profit_change  \
498     TUI   499     21655            -5.5   1151.7   16247          195.5   

                   ceo         industry             sector  previous_rank  \
498  Friedrich Joussen  Travel Services  Business Services          467.0   

     country       hq_location                  website  \
498  Germany  Hanover, Germany  http://www.tuigroup.com   

     years_on_global_500_list  employees  total_stockholder_equity  
498                        23      66779                      3006  


In [12]:
# From the previously_ranked dataframe, subtract the rank column from the previous_rank column.
rank_change = previously_ranked["previous_rank"] - previously_ranked["rank"]

In [13]:
# Assign the values in the rank_change to a new column in the f500 dataframe

f500["rank_change"] = rank_change

Pandas will ignore the order of the `rank_change` series and align its values with the rows of `f500` using the index labels.

Pandas will also:

* Discard any items that have an index that doesn't match the dataframe.
* Fill any remaining rows with `NaN`
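The alignment rules above can be sketched on a tiny made-up example (the `df` and `change` names and their values are assumptions for illustration). The series is deliberately out of order, lacks label `1`, and carries an extra label `9`:

```python
import pandas as pd

# Hypothetical toy dataframe with index labels 0, 1, 2.
df = pd.DataFrame({"rank": [1, 2, 3]}, index=[0, 1, 2])

# Series in a different order, with no value for label 1 and an extra label 9.
change = pd.Series([30.0, 10.0, 99.0], index=[2, 0, 9])

# Assignment aligns on index labels, not on position.
df["rank_change"] = change

print(df)
```

After the assignment, label `0` receives `10.0`, label `2` receives `30.0`, label `1` is filled with `NaN`, and the value for label `9` is dropped because `df` has no such row.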

In [14]:
f500.loc[f500["rank_change"].isnull(), "rank_change"].head()

48    NaN
90    NaN
123   NaN
138   NaN
140   NaN
Name: rank_change, dtype: float64

# Comparison operators in pandas and NumPy

* `==` (equal), 
* `>` (greater than),
* `<` (less than), 
* `!=` (not equal) etc.

# Logical operators in pandas and NumPy

* `&` (and), 
* `|` (or) , 
* `~` (not)

In [15]:
large_revenue = f500["revenues"]>100000 # companies with revenues greater than 100 billion

negative_profits = f500["profits"] < 0 # companies with profits less than 0.

combined = large_revenue & negative_profits

big_rev_neg_profit = f500.loc[combined]
big_rev_neg_profit

Unnamed: 0,company,rank,revenues,revenue_change,profits,assets,profit_change,ceo,industry,sector,previous_rank,country,hq_location,website,years_on_global_500_list,employees,total_stockholder_equity,rank_change
32,Japan Post Holdings,33,122990,3.6,-267.4,2631385,-107.5,Masatsugu Nagato,"Insurance: Life, Health (stock)",Financials,37.0,Japan,"Tokyo, Japan",http://www.japanpost.jp,21,248384,91532,4.0
44,Chevron,45,107567,-18.0,-497.0,260078,-110.8,John S. Watson,Petroleum Refining,Energy,31.0,USA,"San Ramon, CA",http://www.chevron.com,23,55200,145556,-14.0


In [16]:
# Tech companies outside USA
tech_outside_usa = f500[(f500["sector"] == "Technology") & ~ (f500["country"] == "USA")].head()

# We used parentheses around each of our boolean comparisons.
# This is very important: & binds more tightly than == in Python, so our boolean operation will fail without parentheses.
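
Why the parentheses matter can be shown with a short sketch on made-up data (the `toy` frame and the `failed` flag are assumptions for illustration). Because `&` has higher precedence than the comparison operators, the unparenthesized form is parsed as a chained comparison around `100 & toy["profits"]`, which raises an error:

```python
import pandas as pd

# Hypothetical toy data; not taken from f500.csv.
toy = pd.DataFrame({"revenues": [50, 150, 200], "profits": [5, -3, 7]})

# Correct: each comparison is wrapped in parentheses.
mask = (toy["revenues"] > 100) & (toy["profits"] < 0)

# Without parentheses, Python evaluates 100 & toy["profits"] first and then
# tries a chained comparison on whole Series, which raises an exception.
try:
    toy["revenues"] > 100 & toy["profits"] < 0
    failed = False
except (ValueError, TypeError):
    failed = True

print(mask.tolist(), failed)
```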

In [17]:
# Select all rows for companies headquartered in either Brazil or Venezuela.
brazil_venezuela = f500[(f500["country"] == "Brazil") | (f500["country"] == "Venezuela")]

# Find the company headquartered in Japan with the largest number of employees.

In [18]:
sorted_rows = f500[f500["country"] == "Japan"] # Select only the rows that have a country name equal to Japan
sorted_rows = sorted_rows.sort_values("employees", ascending = False) # sort those rows by the employees column in descending order; sort_values() returns a new dataframe, so we assign the result
top_employer = sorted_rows.iloc[0] # select the first row from the sorted dataframe
top_japanese_employer = top_employer["company"]
top_japanese_employer

'Toyota Motor'

The Japanese company that employs the most people is `Toyota Motor`.

We've explicitly avoided using loops in pandas because one of the key benefits of pandas is that it has vectorized methods to work with data more efficiently.

We'll learn how to use loops for aggregation. Aggregation is where we apply a statistical operation to groups of our data.

In [19]:
top_employer_by_country = {}

countries = f500["country"].unique() # To identify the unique countries, we can use the Series.unique() method.

for c in countries:
    df = f500[f500["country"] == c]
    sort_employees = df.sort_values("employees", ascending = False)
    top_employer = sort_employees["company"]
    top_employer_by_country[c] = top_employer.iloc[0]
    
top_employer_by_country 
    

{'USA': 'Walmart',
 'China': 'China National Petroleum',
 'Japan': 'Toyota Motor',
 'Germany': 'Volkswagen',
 'Netherlands': 'EXOR Group',
 'Britain': 'Compass Group',
 'South Korea': 'Samsung Electronics',
 'Switzerland': 'Nestle',
 'France': 'Sodexo',
 'Taiwan': 'Hon Hai Precision Industry',
 'Singapore': 'Flex',
 'Italy': 'Poste Italiane',
 'Russia': 'Gazprom',
 'Spain': 'Banco Santander',
 'Brazil': 'JBS',
 'Mexico': 'America Movil',
 'Luxembourg': 'ArcelorMittal',
 'India': 'State Bank of India',
 'Malaysia': 'Petronas',
 'Thailand': 'PTT',
 'Australia': 'Wesfarmers',
 'Belgium': 'Anheuser-Busch InBev',
 'Norway': 'Statoil',
 'Canada': 'George Weston',
 'Ireland': 'Accenture',
 'Indonesia': 'Pertamina',
 'Denmark': 'Maersk Group',
 'Saudi Arabia': 'SABIC',
 'Sweden': 'H & M Hennes & Mauritz',
 'Finland': 'Nokia',
 'Venezuela': 'Mercantil Servicios Financieros',
 'Turkey': 'Koc Holding',
 'U.A.E': 'Emirates Group',
 'Israel': 'Teva Pharmaceutical Industries'}

In [34]:
# Alternative to the loop above: sort by employees, then take the first company in each country group
top_employer_by_country = f500.sort_values("employees", ascending = False).groupby("country")["company"].first()
top_employer_by_country

country
Australia                            Wesfarmers
Belgium                    Anheuser-Busch InBev
Brazil                                      JBS
Britain                           Compass Group
Canada                            George Weston
China                  China National Petroleum
Denmark                            Maersk Group
Finland                                   Nokia
France                                   Sodexo
Germany                              Volkswagen
India                       State Bank of India
Indonesia                             Pertamina
Ireland                               Accenture
Israel           Teva Pharmaceutical Industries
Italy                            Poste Italiane
Japan                              Toyota Motor
Luxembourg                        ArcelorMittal
Malaysia                               Petronas
Mexico                            America Movil
Netherlands                          EXOR Group
Norway                          

# Challenge

We're going to add a new column to our dataframe, and then perform some aggregation using that new column.

The column we create is going to contain a metric called `return on assets (ROA)`. `ROA` is a business metric that indicates a company's ability to make a profit using its available assets.

`return on assets = profit/assets`

Once we've created the new column, we'll aggregate by `sector` and find the company with the highest ROA in each sector.

In [20]:
roa = f500["profits"]/f500["assets"]
f500["roa"] = roa

top_roa_by_sector = {}

sector = f500["sector"].unique()
for s in sector:
    df = f500[f500["sector"] == s]
    sorted_df_roa = df.sort_values("roa", ascending = False)
    top_roa_company = sorted_df_roa["company"].iloc[0]
    top_roa_by_sector[s] = top_roa_company
    
    

top_roa_by_sector  

{'Retailing': 'H & M Hennes & Mauritz',
 'Energy': 'National Grid',
 'Motor Vehicles & Parts': 'Subaru',
 'Financials': 'Berkshire Hathaway',
 'Technology': 'Accenture',
 'Wholesalers': 'McKesson',
 'Health Care': 'Gilead Sciences',
 'Telecommunications': 'KDDI',
 'Engineering & Construction': 'Pacific Construction Group',
 'Industrials': '3M',
 'Food & Drug Stores': 'Publix Super Markets',
 'Aerospace & Defense': 'Lockheed Martin',
 'Food, Beverages & Tobacco': 'Philip Morris International',
 'Household Products': 'Unilever',
 'Transportation': 'Delta Air Lines',
 'Materials': 'CRH',
 'Chemicals': 'LyondellBasell Industries',
 'Media': 'Disney',
 'Apparel': 'Nike',
 'Hotels, Restaurants & Leisure': 'McDonald’s',
 'Business Services': 'Adecco Group'}
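
The same aggregation can also be written without an explicit loop. The sketch below is one possible approach, not the method used in this notebook: it uses `groupby()` with `idxmax()` on a small made-up dataframe (the `toy` name, companies, and values are assumptions for illustration).

```python
import pandas as pd

# Hypothetical toy data; not taken from f500.csv.
toy = pd.DataFrame({
    "company": ["A", "B", "C", "D"],
    "sector": ["Energy", "Energy", "Media", "Media"],
    "profits": [10.0, 30.0, 5.0, 2.0],
    "assets": [100.0, 150.0, 50.0, 40.0],
})

# return on assets = profits / assets
toy["roa"] = toy["profits"] / toy["assets"]

# For each sector, idxmax() returns the index label of the row with the highest roa.
top_idx = toy.groupby("sector")["roa"].idxmax()

# Look up those rows and build a sector -> company mapping.
top_roa_by_sector = toy.loc[top_idx].set_index("sector")["company"].to_dict()

print(top_roa_by_sector)
```

Applied to `f500`, the same pattern would replace `toy` with `f500`; rows with a missing `roa` are skipped by `idxmax()`.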