# From Numpy to Pandas

In [1]:
import numpy as np
import pandas as pd

Let's say I have some countries in mind:

In [2]:
country_names = np.array(['Australia', 'Brazil', 'Canada',
                          'China', 'Germany', 'Spain',
                          'France', 'United Kingdom', 'India',
                          'Italy', 'Japan', 'Korea, Rep.',
                          'Mexico', 'Russian Federation', 'United States'])
country_names

array(['Australia', 'Brazil', 'Canada', 'China', 'Germany', 'Spain',
       'France', 'United Kingdom', 'India', 'Italy', 'Japan',
       'Korea, Rep.', 'Mexico', 'Russian Federation', 'United States'],
      dtype='<U18')

For compactness, I'll also want to use the corresponding [standard three-letter code](https://en.wikipedia.org/wiki/ISO_3166-1_alpha-3) for each country, like so:

In [3]:
country_codes = np.array(['AUS', 'BRA', 'CAN', 'CHN',
                          'DEU', 'ESP', 'FRA', 'GBR',
                          'IND', 'ITA', 'JPN', 'KOR',
                          'MEX', 'RUS', 'USA'])
country_codes

array(['AUS', 'BRA', 'CAN', 'CHN', 'DEU', 'ESP', 'FRA', 'GBR', 'IND',
       'ITA', 'JPN', 'KOR', 'MEX', 'RUS', 'USA'], dtype='<U3')

For each of these countries, I have the corresponding Maternal Mortality Rate (MMR). This is the number of women who die per 100,000 live births. With a well-functioning health system, the death of a mother should be an extremely rare event.

In [4]:
mmrs = np.array([6.0, 49.5 , 7.25,
                 28.75, 6.25, 5.0,
                 8.75, 9.25, 185.25,
                 4.0, 5.75, 12.0,
                 40.0, 25.25, 14. ])
mmrs

array([  6.  ,  49.5 ,   7.25,  28.75,   6.25,   5.  ,   8.75,   9.25,
       185.25,   4.  ,   5.75,  12.  ,  40.  ,  25.25,  14.  ])

By the way, these data are real; they come from statistics compiled by the World Bank from 2012 to 2016.  See [the gender data page](https://github.com/odsti/datasets/tree/main/gender_stats) for more detail.

Let's say I also have the amount that each country spends on health care, per person. Call this Health Care Expenditure Per Person (HCEPP). In due course, I'm interested to see whether HCEPP can predict the MMR values.

In [5]:
hcepps = np.array([4256, 1303, 4617,
                   658, 4910, 2964,
                   4388, 3358, 242,
                   3267, 3687, 2385,
                   1081, 1756, 9060.])
hcepps

array([4256., 1303., 4617.,  658., 4910., 2964., 4388., 3358.,  242.,
       3267., 3687., 2385., 1081., 1756., 9060.])

I want a good way to keep it clear which value corresponds to each country.  I'm going to start with the MMR values.

One way of doing that is to make a new data structure that contains the MMR values, but also has a *label* for each value. Pandas has an object for that, called a `Series`. You can construct a Series by passing the values and the labels:

In [6]:
mmr_series = pd.Series(mmrs, index=country_codes)
mmr_series

AUS      6.00
BRA     49.50
CAN      7.25
CHN     28.75
DEU      6.25
ESP      5.00
FRA      8.75
GBR      9.25
IND    185.25
ITA      4.00
JPN      5.75
KOR     12.00
MEX     40.00
RUS     25.25
USA     14.00
dtype: float64

Notice the `index=` named argument. Pandas calls the collection of labels for the values the *index*. Think of the index as you would an index for a book: it's a way to get from, in our case, the country code to the corresponding value. We can get the collection of labels with the `.index` attribute of the Series.

In [7]:
mmr_series.index

Index(['AUS', 'BRA', 'CAN', 'CHN', 'DEU', 'ESP', 'FRA', 'GBR', 'IND', 'ITA',
       'JPN', 'KOR', 'MEX', 'RUS', 'USA'],
      dtype='object')

Of course `mmr_series` also contains the MMR values, accessible with the `.values` attribute:

In [8]:
mmr_series.values

array([  6.  ,  49.5 ,   7.25,  28.75,   6.25,   5.  ,   8.75,   9.25,
       185.25,   4.  ,   5.75,  12.  ,  40.  ,  25.25,  14.  ])

Think then of the Series as an object that associates an array of values (`.values`) with the corresponding labels for each value (`.index`).
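To make that pairing concrete, here is a minimal sketch using a shortened, three-country version of the data above. The names `short_series`, `labels` and `values` are only for illustration; they are not used elsewhere on this page.

```python
import numpy as np
import pandas as pd

# A shortened version of the MMR data, for illustration only.
short_series = pd.Series(np.array([6.0, 49.5, 7.25]),
                         index=['AUS', 'BRA', 'CAN'])

labels = short_series.index   # the collection of labels
values = short_series.values  # the values, as a Numpy array

# There is exactly one label for each value.
print(len(labels) == len(values))

# Pairing them up by position recovers each (label, value) association.
for label, value in zip(labels, values):
    print(label, value)
```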

We can access a value by its label, using the `.loc` accessor, an attribute of the Series object.

In [9]:
mmr_series.loc['MEX']

40.0

`.loc` is an accessor that allows us to pass labels (in the `.index`), and returns the corresponding values.  Here we ask for more than one value, by passing in a list of labels:

In [10]:
mmr_series.loc[['KOR', 'USA']]

KOR    12.0
USA    14.0
dtype: float64

Notice above that passing one label to `.loc` returns the value, but passing two or more labels to `.loc` returns a subset of the Series. Put another way, one label gives a value, but more than one label gives a Series.
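Strictly speaking, what matters is whether you pass a bare label or a *list* of labels, rather than how many labels there are. The sketch below uses a small three-country Series (`short_series`, a name made up for this example) to show that a list containing a single label still gives back a Series.

```python
import pandas as pd

# A three-country subset of the MMR values, for illustration only.
short_series = pd.Series([12.0, 40.0, 14.0], index=['KOR', 'MEX', 'USA'])

one_value = short_series.loc['MEX']         # a bare label
one_row_series = short_series.loc[['MEX']]  # a list containing one label

print(type(one_value))       # a single number
print(type(one_row_series))  # a Series with one element
```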

Indexing with `.loc` is called *label-based indexing*.  You can also index by
position, as you would with a Numpy array.  As a reminder of basic indexing in
Numpy: to get the thirteenth value in the Numpy array of MMR values, one could
run:

In [11]:
mmrs[12]

40.0

Numpy indexing with integers, like the above, is always indexing by position.
Position 12 is the thirteenth element.

You can do the same type of indexing with a Pandas series, with the `.iloc` accessor.  Think of `.iloc` as *integer* indexing, or, if you like, `loc`ating with `i`ntegers.

In [12]:
mmr_series.iloc[12]

40.0

In [13]:
mmr_series.iloc[[11, 14]]

KOR    12.0
USA    14.0
dtype: float64

Notice again that one integer to `.iloc` gives a value, but two or more
integers gives a Series.
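If you need to translate between the two kinds of indexing, the index can tell you the position of a label. This is a small sketch on a three-country subset (the name `short_series` is made up for the example), and it assumes the labels are unique, as they are here.

```python
import pandas as pd

# A three-country subset of the MMR values, for illustration only.
short_series = pd.Series([12.0, 40.0, 14.0], index=['KOR', 'MEX', 'USA'])

# Ask the index for the integer position of a label.
position = short_series.index.get_loc('MEX')
print(position)

# The label-based and position-based lookups return the same value.
print(short_series.loc['MEX'] == short_series.iloc[position])
```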

You can already imagine that this kind of label-based indexing could be
useful, because it is easier to avoid mistakes with:

In [14]:
mmr_series.loc['MEX']

40.0

— than it is to work out the position of Mexico in the array, and do:

In [15]:
mmrs[11]  # Was Mexico at position 11?

12.0

— oh, whoops, I mean:

In [16]:
mmrs[12]  # Ouch, no, it was at position 12.

40.0

As well as making mistakes less likely, label-based indexing makes the code easier to read, and therefore easier to debug.
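For comparison, here is one way you might do a label-style lookup with plain Numpy arrays, by searching the array of codes for a match. It is a sketch on a three-country subset, with made-up names `codes` and `values`; it works, but it takes more steps than `.loc`, and the reader has to work out the intent.

```python
import numpy as np

# Three-country subsets of the codes and MMR values, for illustration only.
codes = np.array(['KOR', 'MEX', 'USA'])
values = np.array([12.0, 40.0, 14.0])

# Without labels attached to the values, we have to search for the position.
is_mexico = codes == 'MEX'        # Boolean array, True where the code matches
mex_value = values[is_mexico][0]  # first (and only) matching value
print(mex_value)
```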

But the real value from this idea comes when you have more than one Series with corresponding labels.

For example, I can also make a Series with the health care expenditure (HCEPP) data, like this:

In [17]:
hcepp_series = pd.Series(hcepps, index=country_codes)
hcepp_series

AUS    4256.0
BRA    1303.0
CAN    4617.0
CHN     658.0
DEU    4910.0
ESP    2964.0
FRA    4388.0
GBR    3358.0
IND     242.0
ITA    3267.0
JPN    3687.0
KOR    2385.0
MEX    1081.0
RUS    1756.0
USA    9060.0
dtype: float64

In [18]:
hcepp_series.loc['MEX']

1081.0

But now imagine I want to look at the corresponding MMR and HCEPP values. I can do this separately, for each Series, like this:

In [19]:
mmr_series.loc['MEX']

40.0

In [20]:
hcepp_series.loc['MEX']

1081.0
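One reason matching labels are useful is that a label lookup does not depend on where each value sits in its Series. In the sketch below, the two small Series (`short_mmr` and `short_hcepp`, names made up for this example) use values from above, but I have deliberately stored them in different orders.

```python
import pandas as pd

# The same three countries, stored in a different order in each Series.
short_mmr = pd.Series([12.0, 40.0, 14.0], index=['KOR', 'MEX', 'USA'])
short_hcepp = pd.Series([9060.0, 2385.0, 1081.0],
                        index=['USA', 'KOR', 'MEX'])

# Label-based lookup finds the right value in each Series.
print(short_mmr.loc['MEX'], short_hcepp.loc['MEX'])

# The same position does not refer to the same country in each Series.
print(short_mmr.index[1], short_hcepp.index[1])
```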

Imagine though, that I'm going to be doing this for multiple countries, and that I have multiple (not just two) values per country.  I would like a way of putting these Series together into a table, where the rows have labels (just as the Series values do), and the columns have names.

Each Series corresponds to one column in this table.  Pandas calls these tables *data frames*.

In [21]:
df = pd.DataFrame({'mmr': mmr_series, 'hcepp': hcepp_series})
df

        mmr   hcepp
AUS    6.00  4256.0
BRA   49.50  1303.0
CAN    7.25  4617.0
CHN   28.75   658.0
DEU    6.25  4910.0
ESP    5.00  2964.0
FRA    8.75  4388.0
GBR    9.25  3358.0
IND  185.25   242.0
ITA    4.00  3267.0
JPN    5.75  3687.0
KOR   12.00  2385.0
MEX   40.00  1081.0
RUS   25.25  1756.0
USA   14.00  9060.0


Think of this data frame as being like a dictionary of Series.

The keys in this dictionary are the column names we provided: `mmr` and `hcepp`; the values are the corresponding Series.
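The dictionary comparison can be taken fairly literally. As a sketch, using a small three-country data frame (`short_df`, a name made up for this example), several familiar dictionary operations work on the column names:

```python
import pandas as pd

# A three-country data frame, for illustration only.
short_df = pd.DataFrame({
    'mmr': pd.Series([12.0, 40.0, 14.0], index=['KOR', 'MEX', 'USA']),
    'hcepp': pd.Series([2385.0, 1081.0, 9060.0], index=['KOR', 'MEX', 'USA']),
})

# Like a dictionary, the data frame has keys: the column names.
print(list(short_df.keys()))

# `in` tests for a column name, as it tests for a key in a dictionary.
print('mmr' in short_df)

# Iterating over the items gives (column name, Series) pairs.
for column_name, column in short_df.items():
    print(column_name, type(column))
```

Notice that `in` looks at the column names, not the row labels, so `'MEX' in short_df` is `False`.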

We can get the `mmr` series by name, by indexing directly into the data frame,
like this:

In [22]:
mmr_from_df = df['mmr']
mmr_from_df

AUS      6.00
BRA     49.50
CAN      7.25
CHN     28.75
DEU      6.25
ESP      5.00
FRA      8.75
GBR      9.25
IND    185.25
ITA      4.00
JPN      5.75
KOR     12.00
MEX     40.00
RUS     25.25
USA     14.00
Name: mmr, dtype: float64

Notice this returns a Series, extracted back from the data frame.

In [23]:
type(mmr_from_df)

pandas.core.series.Series

Notice too that the display of the Series now shows an extra piece of information, the `Name`.

We said above that Series are the association between an array of `.values`, and a corresponding collection of labels, in `.index`.  Now we see that the Series also has a `.name`, that we had not set in our original series:

In [24]:
# This is the Series we've extracted from the data frame.
mmr_from_df.name

'mmr'

We did not set the name of the Series we built above using `pd.Series`, so it has the default `.name` of `None`.

In [25]:
# This was the original series we built with pd.Series.
mmr_series.name is None

True
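If you want a Series to have a name without going through a data frame, you can set it yourself. Here is a minimal sketch on a three-country subset; the variable names `named`, `unnamed` and `renamed` are made up for the example.

```python
import pandas as pd

# Set the name when building the Series ...
named = pd.Series([12.0, 40.0, 14.0], index=['KOR', 'MEX', 'USA'], name='mmr')
print(named.name)

# ... or make a renamed copy of an existing, unnamed Series.
unnamed = pd.Series([12.0, 40.0, 14.0], index=['KOR', 'MEX', 'USA'])
renamed = unnamed.rename('mmr')
print(unnamed.name is None, renamed.name)
```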

We can also use `.loc` and `.iloc` accessors on the data frame, to get rows by label (index value) or by position:

In [26]:
df.loc['MEX']

mmr        40.0
hcepp    1081.0
Name: MEX, dtype: float64

Notice what Pandas did here. As when indexing into a Series, indexing into the data frame *with a single label* returns the *contents* of the row. Pandas, being a general thinker, sees that the contents of the row are values that have labels, where the labels are the column names. It therefore returns the row to you as a new Series, with values from the row, and labels from the column names.

Notice too that, in strict parallel to indexing into a Series, indexing into a data frame with a list of labels returns a subset of the data frame, which is itself a data frame.

In [27]:
df.loc[['KOR', 'USA']]

      mmr   hcepp
KOR  12.0  2385.0
USA  14.0  9060.0
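The examples above used `.loc` on the data frame. As a sketch of the matching position-based indexing with `.iloc`, here is a small three-country data frame (`short_df`, a name made up for this example). It also checks that a row really does come back as a Series labeled by the column names.

```python
import pandas as pd

# A three-country data frame, for illustration only.
short_df = pd.DataFrame({'mmr': [12.0, 40.0, 14.0],
                         'hcepp': [2385.0, 1081.0, 9060.0]},
                        index=['KOR', 'MEX', 'USA'])

# One label: the row comes back as a Series, labeled by column name.
row = short_df.loc['MEX']
print(list(row.index), row.name)
print(row.loc['hcepp'])

# One position: `.iloc` gives the same row as `.loc` with the matching label.
print(short_df.iloc[1].equals(row))

# A list of positions: the result is a data frame.
print(type(short_df.iloc[[0, 2]]))
```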
