[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/joshmaglione/CS102-Jupyter/main?labpath=.%2FWeek06.ipynb) 

<a href="https://colab.research.google.com/github/joshmaglione/CS102-Jupyter/blob/main/Week06.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a> 

[View on GitHub](https://github.com/joshmaglione/CS102-Jupyter/blob/main/Week06.ipynb)

# Week 6: `pandas` Objects

![](imgs/Pandas.png)

We are talking more about data wrangling now.

The package `pandas` is *ubiquitous* in data science and machine learning.

It's a simple package built on top of `NumPy`.

Almost everything we have learned with `NumPy` carries over to `pandas`.

I view `pandas` as a 'data science enhancement' of `NumPy`. 

`pandas` offers tremendous flexibility and convenience in the form of robust methods.

The **core objects** are the
- Series
- DataFrame
- Index

In [None]:
import numpy as np
import pandas as pd

## Series

A `pandas` Series is an `ndarray` with labels. 

Here's an example without explicit labels.

In [None]:
a = np.random.random_sample(10)
ser = pd.Series(a)
ser

Let's provide explicit labels using the `index` keyword argument.

In [None]:
ser = pd.Series(a, index=[f"x{i}" for i in range(10)])
ser

You might be tempted to say that a Series is a glorified Python dictionary. 

This is perhaps somewhat right, but it misses the point.

- Firstly, labels need not be unique, which is already different from dictionaries.

- Secondly, a type is more than its primitives. The functions and methods are truly time saving.
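
To make the first point concrete, here is a minimal sketch of a Series with a repeated label. The name `dup` and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical data: the label "a" appears twice, which a dict cannot do.
dup = pd.Series([1, 2, 3], index=["a", "a", "b"])

print(dup.index.is_unique)  # False
print(dup.loc["a"])         # a Series containing both entries labelled "a"
print(dup.loc["b"])         # a single value, since "b" is unique
```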

In [None]:
ser.index

You can (still) access the entries of a Series by its explicit index *or its implicit index*.

In [None]:
print(ser['x5'])
print(ser[5])          # deprecated positional fallback; prefer ser.iloc[5]

This however can cause confusion when the explicit indices are also integers. 

We will say more about this when we get to the Index object.
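
A small sketch of the confusion, assuming a made-up Series `s` whose labels are integers in a non-standard order. With an integer index, plain `s[1]` is treated as a label lookup, so being explicit avoids surprises.

```python
import pandas as pd

# Hypothetical Series whose labels are integers, not in positional order.
s = pd.Series([10, 20, 30], index=[5, 3, 1])

by_label = s.loc[1]      # the entry whose label is 1
by_position = s.iloc[1]  # the second entry, whatever its label

print(by_label, by_position)  # 30 20
```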

### Basic operations on Series

You can create a Series from a dictionary.

The following comes from [2022 census data](https://visual.cso.ie/?body=entity/ima/cop/2022). (Full csv files in `data/`.)

In [None]:
pops = pd.Series({
    "Cork": 360152,
    "Dublin": 592713,
    "Fingal": 330506,
    "Galway": 193323,
    "Sligo": 70198
})
print(pops)

As mentioned before, one can think of a Series as an enhanced dictionary.

You can do similar sorts of operations.

In [None]:
print(pops.keys())
print(list(pops.items()))

In [None]:
pops["Galway"]

#### Indexing and beyond

It might be helpful to think of Series as a $1$-dimensional `ndarray` to start.

In [None]:
pops['Dublin':'Galway']

Notice the change in slicing convention: slicing by label includes the end label `Galway`.

We can slice with the implicit indices, and that follows the Python convention (the end is excluded).

In [None]:
pops[1:3]

You can apply techniques like masking and advanced indexing to Series in much of the same way.

(In fact I did this for you in Week05 with the running rainfall example!)

Try applying masking and advanced indexing yourself.
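
If you want something to compare against, here is one possible sketch. The threshold of 300,000 and the names `big` and `picked` are arbitrary choices for illustration.

```python
import pandas as pd

pops = pd.Series({
    "Cork": 360152, "Dublin": 592713, "Fingal": 330506,
    "Galway": 193323, "Sligo": 70198,
})

# Masking: keep the entries where the Boolean condition holds.
big = pops[pops > 300_000]

# Advanced indexing: pick labels in any order via a list.
picked = pops.loc[["Sligo", "Cork"]]

print(big)
print(picked)
```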

#### `loc` and `iloc`

It's good practice to be explicit.

Instead of indexing and slicing via `pops[...]`, use
- `loc`
- `iloc`

In [None]:
pops.loc["Galway"]

In [None]:
pops.iloc[3]

A mnemonic to help remember:
- `loc` and label both start with L
- `iloc` and implicit both start with I.
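
The two accessors also slice differently, matching the conventions seen earlier. A minimal sketch with the same census figures (the names `by_label` and `by_position` are just for illustration):

```python
import pandas as pd

pops = pd.Series({
    "Cork": 360152, "Dublin": 592713, "Fingal": 330506,
    "Galway": 193323, "Sligo": 70198,
})

by_label = pops.loc["Dublin":"Galway"]  # end label is included
by_position = pops.iloc[1:3]            # end position is excluded

print(len(by_label), len(by_position))  # 3 2
```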

## DataFrame

The DataFrame object is the primary `pandas` object.

These are $2$-dimensional analogs of the Series object.

They describe tabulated data (e.g. Excel sheets and `csv` files).

### Creating a DataFrame

A common way to build a DataFrame by hand is through a dictionary.

In [None]:
df = pd.DataFrame({
    "Population": [360152, 592713, 330506, 193323, 70198],
    "No cars": [10335, 69661, 10371, 5033, 3350],
    "One car": [44419, 84685, 44495, 23406, 10422],
    "Two cars": [50310, 34861, 36562, 27542, 9055],
    "Three cars": [11243, 6129, 6368, 5593, 1676],
    "Four or more cars": [4908, 1736, 1933, 2291, 567],
}, index=["Cork", "Dublin", "Fingal", "Galway", "Sligo"])
df

If no `index` is given, integers are used starting from $0$.
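
For comparison, a quick sketch of the default index. The name `df_default` is hypothetical; it reuses three of the population figures without any labels.

```python
import pandas as pd

# No index given, so pandas numbers the rows 0, 1, 2.
df_default = pd.DataFrame({"Population": [360152, 592713, 330506]})

print(df_default.index)  # RangeIndex(start=0, stop=3, step=1)
```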

In [None]:
df.index

In [None]:
df.columns

- If `a` is a $2$-dimensional `ndarray`, then `a[i]` is the $i^{th}$ row.
  
- If `df` is a DataFrame, then `df['col-val']` is the column corresponding to `'col-val'` (as a Series)
  
- If you want the row(s) corresponding to the explicit index `'label'` or implicit index `i`:
  - `df.loc['label']`
  - `df.iloc[i]`
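
Both accessors also take a row and a column together, much like `a[i, j]` for an `ndarray`. A minimal sketch on a two-column slice of the table (the name `small` is made up here):

```python
import pandas as pd

small = pd.DataFrame({
    "Population": [360152, 592713, 330506, 193323, 70198],
    "No cars": [10335, 69661, 10371, 5033, 3350],
}, index=["Cork", "Dublin", "Fingal", "Galway", "Sligo"])

print(small.loc["Galway", "Population"])  # 193323, by labels
print(small.iloc[3, 0])                   # 193323, by positions

# A list of row labels with one column label gives a Series.
print(small.loc[["Cork", "Sligo"], "No cars"])
```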

In [None]:
df["Population"]

In [None]:
df.loc["Galway"]

In [None]:
df.iloc[0].name

Column names that are strings (and valid Python identifiers) grow up to become attributes of the DataFrame.

In [None]:
df.Population
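
Attribute access has limits, which this sketch tries to show. The frames `small` and `clash` are hypothetical: a name with a space cannot be written after a dot, and a name that matches an existing method gives you the method instead of the column.

```python
import pandas as pd

small = pd.DataFrame({"Population": [360152], "No cars": [10335]},
                     index=["Cork"])

# Dot access and bracket access agree for a simple name.
same = small.Population.equals(small["Population"])
print(same)  # True

# "No cars" contains a space, so only bracket access can reach it.
print(small["No cars"])

# A column called "count" is shadowed by the DataFrame.count method.
clash = pd.DataFrame({"count": [1, 2]})
print(callable(clash.count))  # True: this is the method, not the column
```

Bracket access always works, so it is the safer default.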

We can easily add a new column of data.

In [None]:
# So we don't keep adding to the total if we run this cell more than once.
if "Households" not in df.columns:
    # Sum the car columns row by row (everything except "Population").
    df["Households"] = df.drop(columns="Population").sum(axis=1)
df

Now we can convert the car columns to percentages of households to make the table easier to read.

In [None]:
# Guard against re-running: skip if any column has already been converted.
if not any("%" in c for c in df.columns):
    for col in [c for c in df.columns if "car" in c]:
        df[col + " (%)"] = df[col] / df["Households"] * 100
        df = df.drop(columns=col)

In [None]:
df
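
The same conversion can be done without a loop. Below is a sketch using `DataFrame.div` on a toy subset of the table; the names `cars`, `totals` and `pct` are made up, and because only two car columns are included the percentages differ from the full example.

```python
import pandas as pd

# Toy subset: two counties and two of the car columns.
cars = pd.DataFrame({
    "No cars": [10335, 69661],
    "One car": [44419, 84685],
}, index=["Cork", "Dublin"])

totals = cars.sum(axis=1)             # one total per row
pct = cars.div(totals, axis=0) * 100  # divide each row by its own total
pct.columns = [f"{c} (%)" for c in cars.columns]

print(pct.round(1))
```

Here `axis=0` tells `div` to line up `totals` with the row labels, so each county is divided by its own total.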

Combine what we know about indexing in NumPy with how Series work in `pandas`, and you know what to expect when indexing DataFrames.

### Other ways to build DataFrames

Rarely does one *type* a DataFrame via a dictionary. 

(This is so unreliable that it might be outright banned at some companies.)

Thankfully there are standard methods to get you DataFrames from files.

#### CSVs

`csv` files are a common standard for tabular data.

Let's read in `data/Populations2022.csv` directly.

In [None]:
df_pop = pd.read_csv("data/Populations2022.csv")

Often one does not want to print out the *entire* DataFrame. Use `head` for sanity checking.

In [None]:
df_pop.head()
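
If you do not have the data file to hand, the sketch below reads CSV text from memory instead. The column names `County` and `Population` are assumptions for this illustration and need not match the real file; `index_col` tells `read_csv` which column to use as the index.

```python
import io
import pandas as pd

# Hypothetical CSV content, standing in for a file on disk.
csv_text = """County,Population
Cork,360152
Dublin,592713
Sligo,70198
"""

df_small = pd.read_csv(io.StringIO(csv_text), index_col="County")

print(df_small.head(2))  # only the first two rows
```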

#### Excel files (e.g. `xls` and `xlsx`)

I don't have any files of this type, but `pd.read_excel` works the same way as our `csv` example.

#### Many more!

At this point, type `pd.read` and hit tab so that Jupyter shows the possibilities. 

`pandas` can read. A lot.

In [None]:
# pd.read

## Index Object

Both Series and DataFrame come with an Index object. We've already seen it in the wild.

In [None]:
df.index

One can think of the Index object as an immutable array.

Index has many of the attributes of a NumPy array.

In [None]:
ind = df.index 
print(ind.shape, ind.size, ind.ndim, ind.dtype)

But the array is immutable...

In [None]:
# ind[0] = 'Not Cork'           # naughty naughty
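
To see what happens without breaking the notebook, this sketch catches the error on a stand-alone Index. The flag `mutated` is only there to record the outcome.

```python
import pandas as pd

ind = pd.Index(["Cork", "Dublin", "Fingal"])

try:
    ind[0] = "Not Cork"  # item assignment is not supported
    mutated = True
except TypeError as err:
    mutated = False
    print("TypeError:", err)

print(ind[0])  # still "Cork"
```

To change labels you build a new Index instead, for example through `df.rename(index={...})`, which returns a new DataFrame.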

One can also think of Index as an ordered multiset. 

Or in other words, it also has set-theoretic operations.

In [None]:
ind1 = pd.Index(list(range(0, 13, 2)))
ind2 = pd.Index(list(range(0, 13, 3)))
print(ind1)
print(ind2)

We can use Index methods to do set-theoretic operations.

In [None]:
print(ind1.intersection(ind2))
print(ind1.union(ind2))
print(ind1.symmetric_difference(ind2))
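
Two more points in the same spirit, as a sketch: `difference` gives the plain set difference, and an Index may hold repeated values, which is why "multiset" is the better analogy. The name `multi` is made up for illustration.

```python
import pandas as pd

ind1 = pd.Index(list(range(0, 13, 2)))
ind2 = pd.Index(list(range(0, 13, 3)))

# Elements of ind1 that are not in ind2.
only_in_first = ind1.difference(ind2)
print(only_in_first.tolist())  # [2, 4, 8, 10]

# Unlike a Python set, an Index keeps duplicates and order.
multi = pd.Index([3, 1, 1, 2])
print(multi.is_unique)  # False
print(multi.tolist())   # [3, 1, 1, 2]
```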

## Exercises

1. Load the `IrishLanguage2022.csv` into a DataFrame directly.
2. Simplify the DataFrame by creating a new one and doing the following:
   - The index should be the "Administrative Counties 2019" value.
   - Instead of four rows per county, there should be one row. 
   - The column values should be 
     - "Population" (previously the entry in the "VALUE" column in the row with "Total").
     - "Can Speak" the percentage (as a real number between $0$ and $100$) that can speak
     - "Cannot Speak" the percentage (as a real number between $0$ and $100$) that cannot speak.
3. Which three counties have the highest per-capita share of people who state that they can speak Irish?