[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/joshmaglione/CS102-Jupyter/main?labpath=.%2FWeek06.ipynb) 

<a href="https://colab.research.google.com/github/joshmaglione/CS102-Jupyter/blob/main/Week06.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a> 

[View on GitHub](https://github.com/joshmaglione/CS102-Jupyter/blob/main/Week06.ipynb)

# Week 6: `pandas` Objects

![](imgs/Pandas.png)

We are talking more about data wrangling now.

The package `pandas` is *ubiquitous* in data science and machine learning.

It's a simple package built on top of `NumPy`.

Much of what we have learned with `NumPy` carries over to `pandas`.

I view `pandas` as a 'data science enhancement' of `NumPy`. 

`pandas` offers tremendous flexibility and convenience in the form of robust methods.

The **core objects** are the
- Series
- DataFrame
- Index

In [1]:
import numpy as np
import pandas as pd

## Series

A `pandas` Series is an `ndarray` with labels. 

Here's an example without explicit labels.

In [2]:
a = np.random.random_sample(10)
ser = pd.Series(a)
ser

0    0.463198
1    0.651114
2    0.910421
3    0.591982
4    0.760952
5    0.440205
6    0.956933
7    0.886521
8    0.915643
9    0.606886
dtype: float64

Let's provide explicit labels using the `index` keyword argument.

In [3]:
ser = pd.Series(a, index=[f"x{i}" for i in range(10)])
ser

x0    0.463198
x1    0.651114
x2    0.910421
x3    0.591982
x4    0.760952
x5    0.440205
x6    0.956933
x7    0.886521
x8    0.915643
x9    0.606886
dtype: float64

You might be tempted to say that a Series is a glorified Python dictionary. 

This is perhaps somewhat right, but it misses the point.

- Firstly, labels need not be unique, which is already different from dictionaries.

- Secondly, a type is more than its primitives. The functions and methods are truly time saving.
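Here is a minimal sketch of both points. The Series `dup` and its values are made up for illustration; they are not part of the census data used later.

```python
import pandas as pd

# Hypothetical data: two entries share the label "a".
dup = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "a"])

# Selecting a repeated label returns every matching entry as a Series.
both_a = dup.loc["a"]
print(both_a)

# Selecting a unique label returns a single value.
print(dup.loc["b"])

# The index records whether its labels are unique.
print(dup.index.is_unique)

# Methods save time: sum the entries that share a label in one line.
totals = dup.groupby(level=0).sum()
print(totals)
```

A dictionary would have silently kept only the last value stored under `"a"`.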

In [4]:
ser.index

Index(['x0', 'x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7', 'x8', 'x9'], dtype='object')

You can (still) access the entries of a Series by its explicit index *or its implicit index*.

In [5]:
print(ser['x5'])
print(ser[5])          # deprecated

0.44020507436331324
0.44020507436331324



This, however, can cause confusion when the explicit indices are also integers.

We will say more about this when we get to the Index object.

### Basic operations on Series

You can create a Series from a dictionary.

The following comes from [2022 census data](https://visual.cso.ie/?body=entity/ima/cop/2022). (Full csv files in `data/`.)

In [6]:
pops = pd.Series({
    "Cork": 360152,
    "Dublin": 592713,
    "Fingal": 330506,
    "Galway": 193323,
    "Sligo": 70198
})
print(pops)

Cork      360152
Dublin    592713
Fingal    330506
Galway    193323
Sligo      70198
dtype: int64


As mentioned before, one can think of a Series as an enhanced dictionary.

You can do similar sorts of operations.

In [7]:
print(pops.keys())
print(list(pops.items()))

Index(['Cork', 'Dublin', 'Fingal', 'Galway', 'Sligo'], dtype='object')
[('Cork', 360152), ('Dublin', 592713), ('Fingal', 330506), ('Galway', 193323), ('Sligo', 70198)]


In [8]:
pops["Galway"]

193323

#### Indexing and beyond

It might be helpful to think of Series as a $1$-dimensional `ndarray` to start.

In [9]:
pops['Dublin':'Galway']

Dublin    592713
Fingal    330506
Galway    193323
dtype: int64

Notice the change in slicing convention: the slice includes the ending label `Galway`.

We can also slice with the implicit indices, and that follows the Python convention of excluding the endpoint.

In [10]:
pops[1:3]

Dublin    592713
Fingal    330506
dtype: int64

You can apply techniques like masking and advanced indexing to Series in much of the same way.

(In fact I did this for you in Week05 with the running rainfall example!)

Try applying masking and advanced indexing yourself.
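If you want something to compare against, here is one possible sketch. It rebuilds `pops` so that the block runs on its own; the thresholds are arbitrary choices for illustration.

```python
import pandas as pd

pops = pd.Series({
    "Cork": 360152,
    "Dublin": 592713,
    "Fingal": 330506,
    "Galway": 193323,
    "Sligo": 70198,
})

# Masking: a comparison gives a boolean Series, which selects the True entries.
large = pops[pops > 300_000]
print(large)

# Masks combine with & (and), | (or), ~ (not); the parentheses are required.
middle = pops[(pops > 100_000) & (pops < 400_000)]
print(middle)

# Advanced indexing: pass a list of labels to loc ...
picked = pops.loc[["Sligo", "Cork"]]
# ... or a list of positions to iloc.
picked_pos = pops.iloc[[4, 0]]
print(picked)
```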

#### `loc` and `iloc`

It's good practice to be explicit.

Instead of indexing and slicing via `pops[...]`, use
- `loc`
- `iloc`

In [11]:
pops.loc["Galway"]

193323

In [12]:
pops.iloc[3]

193323

A helpful mnemonic:
- `loc` and label both start with L.
- `iloc` and implicit both start with I.
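Being explicit matters most when the labels are themselves integers. The sketch below uses a made-up Series `s` whose labels do not start at $0$, so that labels and positions disagree.

```python
import pandas as pd

# Hypothetical Series: integer labels 2, 3, 5, 7 at positions 0, 1, 2, 3.
s = pd.Series(["a", "b", "c", "d"], index=[2, 3, 5, 7])

at_label = s.loc[3]    # the entry whose label is 3
at_pos = s.iloc[3]     # the entry at position 3
print(at_label, at_pos)

by_label = s.loc[2:5]  # label slice: includes the endpoint label 5
by_pos = s.iloc[2:5]   # position slice: excludes the endpoint, as in Python
print(by_label)
print(by_pos)
```

With `loc` and `iloc` there is no guessing about which kind of index is meant.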

## DataFrame

The DataFrame object is the primary `pandas` object.

These are $2$-dimensional analogs of the Series object.

They describe tabulated data (e.g. Excel sheets and `csv` files).

### Creating a DataFrame

A common way to build a DataFrame by hand is through a dictionary.

In [13]:
df = pd.DataFrame({
    "Population": [360152, 592713, 330506, 193323, 70198],
    "No cars": [10335, 69661, 10371, 5033, 3350],
    "One car": [44419, 84685, 44495, 23406, 10422],
    "Two cars": [50310, 34861, 36562, 27542, 9055],
    "Three cars": [11243, 6129, 6368, 5593, 1676],
    "Four or more cars": [4908, 1736, 1933, 2291, 567],
}, index=["Cork", "Dublin", "Fingal", "Galway", "Sligo"])
df

| | Population | No cars | One car | Two cars | Three cars | Four or more cars |
|---|---|---|---|---|---|---|
| Cork | 360152 | 10335 | 44419 | 50310 | 11243 | 4908 |
| Dublin | 592713 | 69661 | 84685 | 34861 | 6129 | 1736 |
| Fingal | 330506 | 10371 | 44495 | 36562 | 6368 | 1933 |
| Galway | 193323 | 5033 | 23406 | 27542 | 5593 | 2291 |
| Sligo | 70198 | 3350 | 10422 | 9055 | 1676 | 567 |


If no `index` is given, integers are used starting from $0$.
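A small sketch of the default behaviour, using two rows of the table above and leaving out the `index` argument. The names `plain` and `labelled` are just for this example.

```python
import pandas as pd

# No index given, so pandas labels the rows 0, 1, ...
plain = pd.DataFrame({
    "Population": [360152, 592713],
    "No cars": [10335, 69661],
})
print(plain.index)

# Labels can still be attached afterwards.
labelled = plain.copy()
labelled.index = ["Cork", "Dublin"]
print(labelled)
```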

In [14]:
df.index

Index(['Cork', 'Dublin', 'Fingal', 'Galway', 'Sligo'], dtype='object')

In [15]:
df.columns

Index(['Population', 'No cars', 'One car', 'Two cars', 'Three cars',
       'Four or more cars'],
      dtype='object')

- If `a` is a $2$-dimensional `ndarray`, then `a[i]` is the $i^{th}$ row.
  
- If `df` is a DataFrame, then `df['col-val']` is the column corresponding to `'col-val'` (as a Series).
  
- If you want the row(s) corresponding to the explicit index `'label'` or implicit index `i`:
  - `df.loc['label']`
  - `df.iloc[i]`
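The three bullet points side by side, as a sketch. The array `a` and the DataFrame `small` are cut-down stand-ins built from the first two rows and columns of the table above.

```python
import numpy as np
import pandas as pd

a = np.array([[360152, 10335],
              [592713, 69661]])
small = pd.DataFrame(a, index=["Cork", "Dublin"],
                     columns=["Population", "No cars"])

row_np = a[0]                  # ndarray: a[i] is a row
col = small["Population"]      # DataFrame: [] picks a column (a Series)
row_label = small.loc["Cork"]  # row by explicit label
row_pos = small.iloc[0]        # row by implicit position

print(row_np)
print(col)
print(row_label)
```

Note that a row comes back as a Series whose index is the column names.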

In [16]:
df["Population"]

Cork      360152
Dublin    592713
Fingal    330506
Galway    193323
Sligo      70198
Name: Population, dtype: int64

In [17]:
df.loc["Galway"]

Population           193323
No cars                5033
One car               23406
Two cars              27542
Three cars             5593
Four or more cars      2291
Name: Galway, dtype: int64

In [18]:
df.iloc[0].name

'Cork'

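Columns whose names are strings grow up to become attributes of the DataFrame, provided the name is a valid Python identifier and does not clash with an existing DataFrame attribute or method.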

In [19]:
df.Population

Cork      360152
Dublin    592713
Fingal    330506
Galway    193323
Sligo      70198
Name: Population, dtype: int64
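Attribute access has limits, so here is a hedged sketch of where it works and where it does not. The DataFrames `small` and `clash` are made up for this illustration; `clash` has a column deliberately named after the existing `count` method.

```python
import pandas as pd

small = pd.DataFrame(
    {"Population": [360152, 592713], "No cars": [10335, 69661]},
    index=["Cork", "Dublin"],
)

# "Population" is a valid Python identifier, so both spellings agree.
same = small.Population.equals(small["Population"])
print(same)

# "No cars" contains a space, so `small.No cars` cannot even be written.
print("Population".isidentifier(), "No cars".isidentifier())
no_cars = small["No cars"]     # square brackets always work

# A column named like an existing method is shadowed by that method.
clash = pd.DataFrame({"count": [1, 2]})
print(callable(clash.count))   # the method, not the column
print(clash["count"])          # the column
```

Square brackets work for every column name, which is why I would default to them in scripts.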

We can easily add a new column of data.

In [20]:
# So we don't keep adding to the total if we run this cell more than once.
if "Households" not in df.columns:
    df["Households"] = df.drop(columns="Population").sum(axis=1)
df

| | Population | No cars | One car | Two cars | Three cars | Four or more cars | Households |
|---|---|---|---|---|---|---|---|
| Cork | 360152 | 10335 | 44419 | 50310 | 11243 | 4908 | 121215 |
| Dublin | 592713 | 69661 | 84685 | 34861 | 6129 | 1736 | 197072 |
| Fingal | 330506 | 10371 | 44495 | 36562 | 6368 | 1933 | 99729 |
| Galway | 193323 | 5033 | 23406 | 27542 | 5593 | 2291 | 63865 |
| Sligo | 70198 | 3350 | 10422 | 9055 | 1676 | 567 | 25070 |


Now we can convert the car columns to percentages of households to make them easier to compare.

In [21]:
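# Guard against converting twice if we run this cell more than once.
# (After the conversion the second column is "Households", so checking
# only that column for "%" would not stop a second run.)
if not any("%" in c for c in df.columns):
    for col in [c for c in df.columns if "car" in c]:
        df[col + " (%)"] = df[col] / df["Households"] * 100
        df = df.drop(columns=col)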

In [22]:
df

| | Population | Households | No cars (%) | One car (%) | Two cars (%) | Three cars (%) | Four or more cars (%) |
|---|---|---|---|---|---|---|---|
| Cork | 360152 | 121215 | 8.526173 | 36.644805 | 41.504764 | 9.275255 | 4.049004 |
| Dublin | 592713 | 197072 | 35.347995 | 42.971604 | 17.689474 | 3.110031 | 0.880896 |
| Fingal | 330506 | 99729 | 10.399182 | 44.615909 | 36.661352 | 6.385304 | 1.938253 |
| Galway | 193323 | 63865 | 7.880686 | 36.649182 | 43.125343 | 8.757535 | 3.587254 |
| Sligo | 70198 | 25070 | 13.362585 | 41.5716 | 36.118867 | 6.685281 | 2.261667 |


Combining what we know about indexing in NumPy with how Series work in `pandas`, you now know what to expect when indexing DataFrames.
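To make that concrete, here is a sketch of NumPy-style two-axis indexing on a DataFrame. It rebuilds a three-column version of the original table under the name `cars` (my choice for this example) so that it runs on its own.

```python
import pandas as pd

cars = pd.DataFrame(
    {
        "Population": [360152, 592713, 330506, 193323, 70198],
        "No cars": [10335, 69661, 10371, 5033, 3350],
        "One car": [44419, 84685, 44495, 23406, 10422],
    },
    index=["Cork", "Dublin", "Fingal", "Galway", "Sligo"],
)

# A row label and a column label together give a single entry.
one = cars.loc["Galway", "One car"]
print(one)

# Label slices include both endpoints, for rows and for columns.
block = cars.loc["Dublin":"Galway", "No cars":"One car"]

# Position slices exclude the endpoint, as in NumPy.
block_pos = cars.iloc[1:4, 1:3]
print(block)

# Masking selects rows, just as it selects entries of a Series.
big = cars[cars["Population"] > 300_000]
print(big)
```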

### Other ways to build DataFrames

Rarely does one *type* a DataFrame via a dictionary. 

(Typing data by hand is so error-prone that it might be outright banned at some companies.)

Thankfully, there are standard functions that build DataFrames from files.

#### CSVs

`csv` files are a common format for data.

Let's read in `data/Populations2022.csv` directly.

In [23]:
df_pop = pd.read_csv("data/Populations2022.csv")

Often one does not want to print out the *entire* DataFrame. Use `head` for sanity checking.

In [24]:
df_pop.head()

| | Statistic Label | Census Year | Administrative Counties 2019 | Age | Sex | UNIT | VALUE |
|---|---|---|---|---|---|---|---|
| 0 | Population | 2022 | Carlow County Council | Total | Both Sexes | Number | 61968 |
| 1 | Population | 2022 | Dublin City Council | Total | Both Sexes | Number | 592713 |
| 2 | Population | 2022 | Dún Laoghaire Rathdown County Council | Total | Both Sexes | Number | 233860 |
| 3 | Population | 2022 | Fingal County Council | Total | Both Sexes | Number | 330506 |
| 4 | Population | 2022 | South Dublin County Council | Total | Both Sexes | Number | 301075 |
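`read_csv` has many keyword arguments for shaping the result as it is read. The sketch below uses a small in-memory string in place of a file, with three rows copied from the output above. I am assuming the real file has the same column names; the name `small_pop` is just for this example.

```python
import io
import pandas as pd

# In-memory stand-in for a csv file (three rows from the output above).
csv_text = """Statistic Label,Census Year,Administrative Counties 2019,Age,Sex,UNIT,VALUE
Population,2022,Carlow County Council,Total,Both Sexes,Number,61968
Population,2022,Dublin City Council,Total,Both Sexes,Number,592713
Population,2022,Fingal County Council,Total,Both Sexes,Number,330506
"""

small_pop = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["Administrative Counties 2019", "VALUE"],  # read only these columns
    index_col="Administrative Counties 2019",           # use this one as row labels
)
print(small_pop)
```

Passing a file path such as `"data/Populations2022.csv"` instead of the `StringIO` object works the same way.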


#### Excel files (e.g. `xls` and `xlsx`)

I don't have any files of this type, but `pd.read_excel` works the same way as our `csv` example.

#### Many more!

At this point, type `pd.read` and hit tab so that Jupyter shows the possibilities. 

`pandas` can read. A lot.

In [25]:
# pd.read

## Index Object

Both Series and DataFrame come with an Index object. We've already seen it in the wild.

In [26]:
df.index

Index(['Cork', 'Dublin', 'Fingal', 'Galway', 'Sligo'], dtype='object')

One can think of the Index object as an immutable array.

Index has many of the attributes of a NumPy array.

In [27]:
ind = df.index 
print(ind.shape, ind.size, ind.ndim, ind.dtype)

(5,) 5 1 object


But the array is immutable...

In [28]:
# ind[0] = 'Not Cork'           # naughty naughty
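If you would rather see the error without breaking your notebook, here is a sketch that catches it. It rebuilds the index by hand so that it runs on its own; `renamed` is a name I made up for the modified copy.

```python
import pandas as pd

ind = pd.Index(["Cork", "Dublin", "Fingal", "Galway", "Sligo"])

# Item assignment on an Index raises a TypeError.
try:
    ind[0] = "Not Cork"
    mutated = True
except TypeError as err:
    mutated = False
    print("TypeError:", err)

# Building a modified copy is fine: Index methods return new objects.
renamed = ind.map(lambda name: "Not Cork" if name == "Cork" else name)
print(renamed)
print(ind)   # unchanged
```

Immutability is what makes it safe for several Series and DataFrames to share one Index.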

One can also think of Index as an ordered multiset. 

Or in other words, it also has set-theoretic operations.

In [29]:
ind1 = pd.Index(list(range(0, 13, 2)))
ind2 = pd.Index(list(range(0, 13, 3)))
print(ind1)
print(ind2)

Index([0, 2, 4, 6, 8, 10, 12], dtype='int64')
Index([0, 3, 6, 9, 12], dtype='int64')


We can use the methods `intersection`, `union`, and `symmetric_difference` to do set-theoretic operations.

In [30]:
print(ind1.intersection(ind2))
print(ind1.union(ind2))
print(ind1.symmetric_difference(ind2))

Index([0, 6, 12], dtype='int64')
Index([0, 2, 3, 4, 6, 8, 9, 10, 12], dtype='int64')
Index([2, 3, 4, 8, 9, 10], dtype='int64')
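Two more things worth trying, as a sketch: the `difference` method on the same two indices, and an Index with a repeated label to illustrate the "ordered multiset" remark. The Index `labels` is made up for this example.

```python
import pandas as pd

ind1 = pd.Index(list(range(0, 13, 2)))
ind2 = pd.Index(list(range(0, 13, 3)))

# Elements of ind1 that are not in ind2.
only1 = ind1.difference(ind2)
print(only1)

# Hypothetical labels with a repeat: an Index keeps duplicates and order.
labels = pd.Index(["b", "a", "b", "c"])
print(labels.has_duplicates)
print(labels.unique())          # first occurrences, in their original order
counts = labels.value_counts()  # how often each label appears
print(counts)
```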


## Exercises

1. Load the `IrishLanguage2022.csv` into a DataFrame directly.
2. Simplify the DataFrame by creating a new one and doing the following:
   - The index should be the "Administrative Counties 2019" value.
   - Instead of four rows per county, there should be one row. 
   - The column values should be 
     - "Population" (previously the entry in the "VALUE" column in the row with "Total").
     - "Can Speak" the percentage (as a real number between $0$ and $100$) that can speak
     - "Cannot Speak" the percentage (as a real number between $0$ and $100$) that cannot speak.
3. What are the top three counties with the highest proportion of the population stating that they can speak Irish?