In the last two missions, we explored how the NumPy library makes working with data easier. Because we can easily work across multiple dimensions, our code is a lot easier to understand. By using vectorized operations instead of loops, our code runs faster with larger data.

Although NumPy provides fundamental structures and tools that make working with data easier, there are several things that limit its usefulness:

- The lack of support for column names forces us to frame questions as multi-dimensional array operations.
- Support for only one data type per ndarray makes it more difficult to work with data that contains both numeric and string data.
- There are lots of low-level methods, but many common analysis patterns don't have pre-built methods.

The **pandas library** provides solutions to all of these pain points and more. Pandas is not so much a replacement for NumPy as an extension of NumPy. The underlying code for pandas uses the NumPy library extensively, which means the concepts you've been learning will come in handy as you begin to learn more about pandas.

The primary data structure in pandas is called a **dataframe**. Dataframes are the pandas equivalent of a NumPy 2D ndarray, with a few key differences:

- Axis values can have string labels, not just numeric ones.
- Dataframes can contain columns with different data types, including integer, float, and string.
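The two differences above can be seen in a minimal sketch. It builds a two-row dataframe by hand from values in the Fortune data shown later in this section; the variable names `arr` and `df` are illustrative only.

```python
import numpy as np
import pandas as pd

# Mixing text and numbers in one ndarray forces a single dtype:
# every element is stored as a string.
arr = np.array([["Walmart", 485873], ["State Grid", 315199]])
print(arr.dtype)  # a string dtype; the numbers are no longer numeric

# A dataframe keeps one dtype per column and labels both axes.
df = pd.DataFrame(
    {"revenues": [485873, 315199], "country": ["USA", "China"]},
    index=["Walmart", "State Grid"],
)
print(df.dtypes)

# Rows and columns can be addressed by their string labels.
print(df.loc["Walmart", "country"])
```

Because `revenues` stays numeric in the dataframe, arithmetic on that column still works, whereas the ndarray version would first need converting back from strings.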

![Image](Images/Pandas_DF.png)

As we learn pandas, we'll work with a data set from Fortune magazine's 2017 Global 500 list, which ranks the top 500 corporations worldwide by revenue. 

![Image](Images/Fortune500.png)

The data set is a CSV file called f500.csv. Here is a data dictionary for some of the columns in the CSV:

- company: Name of the company.
- rank: Global 500 rank for the company.
- revenues: Company's total revenue for the fiscal year, in millions of dollars (USD).
- revenue_change: Percentage change in revenue between the current and prior fiscal year.
- profits: Net income for the fiscal year, in millions of dollars (USD).
- ceo: Company's Chief Executive Officer.
- industry: Industry in which the company operates.
- sector: Sector in which the company operates.
- previous_rank: Global 500 rank for the company for the prior year.
- country: Country in which the company is headquartered.

In [2]:
import pandas as pd

# index_col=0 uses the first CSV column (company) as the row labels
f500 = pd.read_csv('f500.csv', index_col=0)

f500_type = type(f500)
f500_shape = f500.shape
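The cell above stores the type and shape without displaying them. As a rough sketch of what to expect, the snippet below reads a three-row inline stand-in for `f500.csv` (the text in `csv_text` is an assumption built from rows shown in this section, not the real file) and prints both values.

```python
import io

import pandas as pd

# Hypothetical miniature of f500.csv: first column holds the company names.
csv_text = """company,rank,revenues,country
Walmart,1,485873,USA
State Grid,2,315199,China
Sinopec Group,3,267518,China
"""

# index_col=0 turns the first column into the row labels (the index).
sample = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(type(sample))   # the DataFrame class
print(sample.shape)   # (rows, columns); the index column is not counted
print(sample.index.name)
```

Note that the shape counts only the data columns: the `company` column became the index, so three CSV columns remain.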

In [3]:
f500.head(3)

company,rank,revenues,revenue_change,profits,assets,profit_change,ceo,industry,sector,previous_rank,country,hq_location,website,years_on_global_500_list,employees,total_stockholder_equity
Walmart,1,485873,0.8,13643.0,198825,-7.2,C. Douglas McMillon,General Merchandisers,Retailing,1,USA,"Bentonville, AR",http://www.walmart.com,23,2300000,77798
State Grid,2,315199,-4.4,9571.3,489838,-6.2,Kou Wei,Utilities,Energy,2,China,"Beijing, China",http://www.sgcc.com.cn,17,926067,209456
Sinopec Group,3,267518,-9.1,1257.9,310726,-65.0,Wang Yupu,Petroleum Refining,Energy,4,China,"Beijing, China",http://www.sinopec.com,19,713288,106523


In [4]:
f500.tail(3)

company,rank,revenues,revenue_change,profits,assets,profit_change,ceo,industry,sector,previous_rank,country,hq_location,website,years_on_global_500_list,employees,total_stockholder_equity
Wm. Morrison Supermarkets,498,21741,-11.3,406.4,11630,20.4,David T. Potts,Food and Drug Stores,Food & Drug Stores,437,Britain,"Bradford, Britain",http://www.morrisons.com,13,77210,5111
TUI,499,21655,-5.5,1151.7,16247,195.5,Friedrich Joussen,Travel Services,Business Services,467,Germany,"Hanover, Germany",http://www.tuigroup.com,23,66779,3006
AutoNation,500,21609,3.6,430.5,10060,-2.7,Michael J. Jackson,Specialty Retailers,Retailing,0,USA,"Fort Lauderdale, FL",http://www.autonation.com,12,26000,2310


Another feature that makes pandas better for working with data is that dataframes can contain more than one data type.

We can use the `DataFrame.dtypes` attribute (similar to NumPy's ndarray.dtype attribute) to return information about the types of each column.


In [5]:
f500.dtypes

rank                          int64
revenues                      int64
revenue_change              float64
profits                     float64
assets                        int64
profit_change               float64
ceo                          object
industry                     object
sector                       object
previous_rank                 int64
country                      object
hq_location                  object
website                      object
years_on_global_500_list      int64
employees                     int64
total_stockholder_equity      int64
dtype: object

In [8]:
f500.info()

<class 'pandas.core.frame.DataFrame'>
Index: 500 entries, Walmart to AutoNation
Data columns (total 16 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   rank                      500 non-null    int64  
 1   revenues                  500 non-null    int64  
 2   revenue_change            498 non-null    float64
 3   profits                   499 non-null    float64
 4   assets                    500 non-null    int64  
 5   profit_change             436 non-null    float64
 6   ceo                       500 non-null    object 
 7   industry                  500 non-null    object 
 8   sector                    500 non-null    object 
 9   previous_rank             500 non-null    int64  
 10  country                   500 non-null    object 
 11  hq_location               500 non-null    object 
 12  website                   500 non-null    object 
 13  years_on_global_500_list  500 non-null    int64  
 14  em

### Selecting a column from a DataFrame by Label

Because our axes in pandas have labels, we can select data using those labels — unlike in NumPy, where we needed to know the exact index location. To do this, we can use the DataFrame.loc[] attribute. The syntax for DataFrame.loc[] is:

```
df.loc[row_label, column_label]
```

Notice that we use brackets ([]) instead of parentheses (()) when selecting by location.

Let's select a single column by specifying a single label:

In [9]:
f500.loc[:, 'rank']

company
Walmart                             1
State Grid                          2
Sinopec Group                       3
China National Petroleum            4
Toyota Motor                        5
                                 ... 
Teva Pharmaceutical Industries    496
New China Life Insurance          497
Wm. Morrison Supermarkets         498
TUI                               499
AutoNation                        500
Name: rank, Length: 500, dtype: int64

Notice we used `:` to specify that we wish to select all rows. Also note that the result keeps the same row labels as the original dataframe.
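To check that the row labels really carry over, here is a small sketch on a hand-built three-row frame (`sample` is a stand-in for `f500`, using values shown earlier in this section).

```python
import pandas as pd

sample = pd.DataFrame(
    {"rank": [1, 2, 3], "country": ["USA", "China", "China"]},
    index=pd.Index(["Walmart", "State Grid", "Sinopec Group"], name="company"),
)

# ":" selects every row; "rank" selects one column by its label.
rank_col = sample.loc[:, "rank"]
print(rank_col)

# The selection shares its row labels with the original dataframe...
same_labels = rank_col.index.equals(sample.index)
print(same_labels)

# ...so individual values can still be looked up by company name.
state_grid_rank = rank_col.loc["State Grid"]
print(state_grid_rank)
```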

We can also use the following shortcut to select a single column:

In [11]:
rank_col = f500["rank"]
print(rank_col)

company
Walmart                             1
State Grid                          2
Sinopec Group                       3
China National Petroleum            4
Toyota Motor                        5
                                 ... 
Teva Pharmaceutical Industries    496
New China Life Insurance          497
Wm. Morrison Supermarkets         498
TUI                               499
AutoNation                        500
Name: rank, Length: 500, dtype: int64


In [12]:
print(type(rank_col))

<class 'pandas.core.series.Series'>


## Introduction to Series

In the previous section, we observed that when you select just one column of a dataframe, you get a new pandas type: a series object. Series is the pandas type for one-dimensional objects. Anytime you see a 1D pandas object, it will be a series. Anytime you see a 2D pandas object, it will be a dataframe.

In fact, you can think of a dataframe as a collection of series objects, which is similar to how pandas stores the data behind the scenes.
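The "collection of series" idea can be sketched as follows. The frame `sample` is a hand-built stand-in for `f500`, and rebuilding it from its columns is only an illustration of the mental model, not a description of pandas internals.

```python
import pandas as pd

sample = pd.DataFrame(
    {"rank": [1, 2, 3], "country": ["USA", "China", "China"]},
    index=["Walmart", "State Grid", "Sinopec Group"],
)

# The dataframe is 2D; each column pulled out of it is a 1D series.
print(sample.ndim)
for label in sample.columns:
    column = sample[label]
    print(label, type(column).__name__, column.ndim)

# Putting the individual series back together gives an equal dataframe.
rebuilt = pd.DataFrame({label: sample[label] for label in sample.columns})
print(rebuilt.equals(sample))
```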

![Image](Images/df_series.png)

### Selecting columns from a dataframe by label (cont.)

Below, we use a list of labels to select specific columns:

In [13]:
f500.loc[:, ['country', 'rank']]

company,country,rank
Walmart,USA,1
State Grid,China,2
Sinopec Group,China,3
China National Petroleum,China,4
Toyota Motor,Japan,5
...,...,...
Teva Pharmaceutical Industries,Israel,496
New China Life Insurance,China,497
Wm. Morrison Supermarkets,Britain,498
TUI,Germany,499


In [14]:
f500[['country', 'rank']]

company,country,rank
Walmart,USA,1
State Grid,China,2
Sinopec Group,China,3
China National Petroleum,China,4
Toyota Motor,Japan,5
...,...,...
Teva Pharmaceutical Industries,Israel,496
New China Life Insurance,China,497
Wm. Morrison Supermarkets,Britain,498
TUI,Germany,499


Let's finish by using **a slice object with labels to select specific columns**:

In [15]:
f500.loc[:, 'rank':'profits']

company,rank,revenues,revenue_change,profits
Walmart,1,485873,0.8,13643.0
State Grid,2,315199,-4.4,9571.3
Sinopec Group,3,267518,-9.1,1257.9
China National Petroleum,4,262573,-12.3,1867.5
Toyota Motor,5,254694,7.7,16899.3
...,...,...,...,...
Teva Pharmaceutical Industries,496,21903,11.5,329.0
New China Life Insurance,497,21796,-13.3,743.9
Wm. Morrison Supermarkets,498,21741,-11.3,406.4
TUI,499,21655,-5.5,1151.7


We again get a dataframe object, with all of the columns from the first up until — **and including** — the last column in our slice. Also note there is no shortcut for selecting column slices.
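The inclusive end point is easy to miss, because ordinary Python slices stop before the end. A minimal sketch on a hand-built frame (`sample` stands in for `f500`, with values taken from the output above):

```python
import pandas as pd

sample = pd.DataFrame(
    {
        "rank": [1, 2, 3],
        "revenues": [485873, 315199, 267518],
        "revenue_change": [0.8, -4.4, -9.1],
        "profits": [13643.0, 9571.3, 1257.9],
    },
    index=["Walmart", "State Grid", "Sinopec Group"],
)

# Label slice: the end label "revenue_change" is part of the result.
by_label = sample.loc[:, "rank":"revenue_change"]
print(list(by_label.columns))

# Ordinary Python slice over the same labels: the end position is excluded.
by_position = list(sample.columns)[0:2]
print(by_position)
```

The label slice returns three columns, while the positional slice `0:2` returns only the first two.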

A summary of the techniques we've learned so far is below:

![Image](Images/slicing_optiomns.png)

In [16]:
countries = f500['country']
revenues_years = f500[['revenues', 'years_on_global_500_list']]
ceo_to_sector = f500.loc[:, 'ceo':'sector']

### Selecting rows from a DataFrame by Label

Now that we've learned how to select columns by label, let's learn how to select rows using the labels of the **index** axis. We use the same syntax to select rows from a dataframe as we do for columns:

```
df.loc[row_label, column_label]
```

#### Select a single row

In [21]:
single_row = f500.loc["Sinopec Group"]
print(type(single_row))
single_row

<class 'pandas.core.series.Series'>


rank                                             3
revenues                                    267518
revenue_change                                -9.1
profits                                     1257.9
assets                                      310726
profit_change                                  -65
ceo                                      Wang Yupu
industry                        Petroleum Refining
sector                                      Energy
previous_rank                                    4
country                                      China
hq_location                         Beijing, China
website                     http://www.sinopec.com
years_on_global_500_list                        19
employees                                   713288
total_stockholder_equity                    106523
Name: Sinopec Group, dtype: object
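The `dtype: object` in the output above follows from a row spanning columns of different types. A small sketch, assuming a hand-built frame `sample` with one integer, one float, and one string column:

```python
import pandas as pd

sample = pd.DataFrame(
    {
        "rank": [1, 2, 3],
        "profits": [13643.0, 9571.3, 1257.9],
        "country": ["USA", "China", "China"],
    },
    index=["Walmart", "State Grid", "Sinopec Group"],
)

row = sample.loc["Sinopec Group"]

# A single row comes back as a series whose labels are the column names.
print(type(row).__name__)
print(list(row.index))

# The row mixes int, float, and string values, so pandas falls back to
# the general-purpose object dtype to hold them all.
print(row.dtype)

# The series is named after the row label it was selected with.
print(row.name)
```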

#### Select a list of rows

In [20]:
list_rows = f500.loc[["Toyota Motor", "Walmart"]]
print(type(list_rows))

list_rows

<class 'pandas.core.frame.DataFrame'>


company,rank,revenues,revenue_change,profits,assets,profit_change,ceo,industry,sector,previous_rank,country,hq_location,website,years_on_global_500_list,employees,total_stockholder_equity
Toyota Motor,5,254694,7.7,16899.3,437575,-12.3,Akio Toyoda,Motor Vehicles and Parts,Motor Vehicles & Parts,8,Japan,"Toyota, Japan",http://www.toyota-global.com,23,364445,157210
Walmart,1,485873,0.8,13643.0,198825,-7.2,C. Douglas McMillon,General Merchandisers,Retailing,1,USA,"Bentonville, AR",http://www.walmart.com,23,2300000,77798


#### Select a slice object with labels

To select rows with a slice of labels, we can use the bracket shortcut below, without `loc[]`. This is why the shortcut isn't available for column slices: a slice inside plain brackets is reserved for rows.

In [22]:
slice_rows = f500["State Grid":"Toyota Motor"]
print(type(slice_rows))
slice_rows

<class 'pandas.core.frame.DataFrame'>


company,rank,revenues,revenue_change,profits,assets,profit_change,ceo,industry,sector,previous_rank,country,hq_location,website,years_on_global_500_list,employees,total_stockholder_equity
State Grid,2,315199,-4.4,9571.3,489838,-6.2,Kou Wei,Utilities,Energy,2,China,"Beijing, China",http://www.sgcc.com.cn,17,926067,209456
Sinopec Group,3,267518,-9.1,1257.9,310726,-65.0,Wang Yupu,Petroleum Refining,Energy,4,China,"Beijing, China",http://www.sinopec.com,19,713288,106523
China National Petroleum,4,262573,-12.3,1867.5,585619,-73.7,Zhang Jianhua,Petroleum Refining,Energy,3,China,"Beijing, China",http://www.cnpc.com.cn,17,1512048,301893
Toyota Motor,5,254694,7.7,16899.3,437575,-12.3,Akio Toyoda,Motor Vehicles and Parts,Motor Vehicles & Parts,8,Japan,"Toyota, Japan",http://www.toyota-global.com,23,364445,157210
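As a quick check on how plain brackets behave, the sketch below compares the slice shortcut with the explicit `loc[]` form on a hand-built four-row frame (`sample` is a stand-in for `f500`). It also shows that a single string in plain brackets is looked up among the columns, not the rows.

```python
import pandas as pd

sample = pd.DataFrame(
    {"rank": [1, 2, 3, 4], "country": ["USA", "China", "China", "China"]},
    index=["Walmart", "State Grid", "Sinopec Group", "China National Petroleum"],
)

# A slice in plain brackets selects rows, exactly like the loc[] version.
shortcut = sample["State Grid":"Sinopec Group"]
explicit = sample.loc["State Grid":"Sinopec Group"]
print(shortcut.equals(explicit))

# A single label in plain brackets means "column", so a row label fails.
try:
    sample["State Grid"]
    row_label_found = True
except KeyError:
    row_label_found = False
print(row_label_found)
```

When in doubt, the explicit `loc[]` form avoids having to remember which axis the shortcut applies to.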


More examples:

In [27]:
toyota = f500.loc['Toyota Motor']
drink_companies = f500.loc[['Anheuser-Busch InBev', 'Coca-Cola', 'Heineken Holding']]
middle_companies = f500.loc['Tata Motors':'Nationwide', 'rank':'country']

#### Series vs Dataframes

![Image](Images/df1.png)
![Image](Images/df2.png)
![Image](Images/df3.png)
![Image](Images/df4.png)

### Value Counts Method

Because series and dataframes are two distinct objects, they have their own unique methods. Let's look at an example of a series method next: the `Series.value_counts()` method. This method returns each unique non-null value in a column along with its count, sorted from most to least frequent.

First, we'll select just one column from the f500 dataframe:

In [29]:
sectors = f500["sector"]
print(type(sectors))

<class 'pandas.core.series.Series'>


Next, we'll substitute "Series" in Series.value_counts() with the name of our sectors series, like below:

In [30]:
sectors_value_counts = sectors.value_counts()
print(sectors_value_counts)

Financials                       118
Energy                            80
Technology                        44
Motor Vehicles & Parts            34
Wholesalers                       28
Health Care                       27
Food & Drug Stores                20
Transportation                    19
Telecommunications                18
Retailing                         17
Food, Beverages & Tobacco         16
Materials                         16
Industrials                       15
Aerospace & Defense               14
Engineering & Construction        13
Chemicals                          7
Hotels, Restaurants & Leisure      3
Business Services                  3
Household Products                 3
Media                              3
Apparel                            2
Name: sector, dtype: int64


In the resulting series, we can see each unique non-null value in the column along with its count.
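The "non-null" and ordering behavior can be confirmed on a tiny series. The values below are made-up toy data, not counts from the Fortune data set.

```python
import pandas as pd

# Toy data: seven entries, one of them missing.
sectors = pd.Series(
    ["Energy", "Financials", "Energy", None, "Retailing", "Energy", "Financials"]
)

counts = sectors.value_counts()
print(counts)

# The most frequent value comes first.
print(counts.index[0], counts.iloc[0])

# The missing entry is left out, so the counts sum to 6, not 7.
print(counts.sum(), len(sectors))

# Passing dropna=False keeps the missing entry as its own category.
with_missing = sectors.value_counts(dropna=False)
print(with_missing.sum())
```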


Let's see what happens when we try to use the `Series.value_counts()` method with a dataframe. First, we'll select the `sector` and `industry` columns to create a dataframe named `sectors_industries`:

In [31]:
sectors_industries = f500[["sector", "industry"]]
print(type(sectors_industries))

<class 'pandas.core.frame.DataFrame'>


Then, we'll try to use the `value_counts()` method:

In [32]:
si_value_counts = sectors_industries.value_counts()
print(si_value_counts)

AttributeError: 'DataFrame' object has no attribute 'value_counts'

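In the pandas version used for this notebook, `value_counts()` is available only on series, so calling it on a dataframe raises an `AttributeError`. pandas 1.1.0 and later also provide `DataFrame.value_counts()`, which counts unique rows instead of raising this error.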
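Whether the call above fails depends on the installed pandas version. The sketch below is one hedged way to write code that runs either way; the frame `pairs` is a hand-built stand-in using sector and industry values shown earlier.

```python
import pandas as pd

pairs = pd.DataFrame(
    {
        "sector": ["Energy", "Energy", "Retailing"],
        "industry": ["Utilities", "Petroleum Refining", "General Merchandisers"],
    }
)

print(pd.__version__)

# Older releases have no DataFrame.value_counts; newer ones count unique rows.
has_df_method = hasattr(pairs, "value_counts")
if has_df_method:
    print(pairs.value_counts())
else:
    print("DataFrame.value_counts is not available in this pandas version")

# Selecting one column first gives a series, which works in every version.
per_column = pairs["sector"].value_counts()
print(per_column)
```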

### Selecting items from a Series by Label

As with dataframes, we can use `Series.loc[]` to select items from a series using single labels, a list, or a slice object. We can also omit `loc[]` and use bracket shortcuts for all three:

![Image](Images/series_loc.png)

In [34]:
# Examples

countries = f500['country']
countries_counts = countries.value_counts()

india = countries_counts['India']
north_america = countries_counts[['USA','Canada','Mexico']]
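The three selection styles can be compared side by side on a small series. This is a sketch on a hand-built `countries` series covering only the first five companies, so the counts differ from those of the full data set.

```python
import pandas as pd

countries = pd.Series(
    ["USA", "China", "China", "China", "Japan"],
    index=[
        "Walmart",
        "State Grid",
        "Sinopec Group",
        "China National Petroleum",
        "Toyota Motor",
    ],
)
countries_counts = countries.value_counts()

# Single label: loc[] and the bracket shortcut return the same value.
china = countries_counts.loc["China"]
print(china, countries_counts["China"])

# List of labels: both forms return a smaller series.
usa_japan = countries_counts[["USA", "Japan"]]
print(usa_japan.equals(countries_counts.loc[["USA", "Japan"]]))

# Slice of labels: the end label is included, as with dataframes.
middle = countries["State Grid":"China National Petroleum"]
print(middle)
```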

## Summary

Let's take a look at a summary of all the different label selection mechanisms we've learned in this mission:

![Image](Images/summary_table.png)

In [35]:
# Example:

big_movers = f500.loc[['Aviva','HP','JD.com','BHP Billiton'],['rank','previous_rank']]

bottom_companies = f500.loc['National Grid':'AutoNation', ['rank','sector','country']]
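The same row and column combinations can be tried on a small frame where the results are easy to verify by eye. In this sketch, `sample` is a hand-built stand-in for `f500` using rank values shown earlier; it also adds the single-label case, which returns one value.

```python
import pandas as pd

sample = pd.DataFrame(
    {
        "rank": [1, 2, 3, 4],
        "previous_rank": [1, 2, 4, 3],
        "country": ["USA", "China", "China", "China"],
    },
    index=["Walmart", "State Grid", "Sinopec Group", "China National Petroleum"],
)

# List of rows, list of columns: a 2 x 2 dataframe.
movers = sample.loc[
    ["Sinopec Group", "China National Petroleum"], ["rank", "previous_rank"]
]
print(movers.shape)

# Slice of rows (end label included), list of columns: a 3 x 2 dataframe.
bottom = sample.loc["State Grid":"China National Petroleum", ["rank", "country"]]
print(bottom.shape)

# Single row label and single column label: one value.
walmart_country = sample.loc["Walmart", "country"]
print(walmart_country)
```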