# Pandas
- Designed to work with **heterogeneous tabular data**
- Two main data structures: `Series` and `DataFrame`

In [None]:
import numpy as np
import pandas as pd
from pandas import Series, DataFrame

## Series

A `Series` is a one-dimensional array-like object containing:
- a sequence of values of possibly heterogeneous types
- a sequence of labels called *index*

In [None]:
s = pd.Series([-3, 4, 5.2, 'c'])
s

In [None]:
s = pd.Series([-3, 4, 5.2, 'c'], index = ['alpha', 'beta', 'gamma', 'delta'])
s

In [None]:
s.index

The attribute `values` returns an `ndarray` containing the values of the `Series` object.

In [None]:
s.values

Indices can be used with the indexing operator `[]` to access values.

In [None]:
s['beta']

We can use the `in` operator to check for the presence of an index.

In [None]:
'beta' in s
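
Note that `in` looks at the index labels, not at the stored values. The sketch below uses a small all-numeric `Series` (values chosen only for illustration) to show the difference; checking the values goes through the `values` attribute.

```python
import pandas as pd

s = pd.Series([-3, 4, 5.2], index=['alpha', 'beta', 'gamma'])

label_found = 'beta' in s      # membership test on the index labels
value_as_label = 4 in s        # 4 is a value, not a label
value_found = 4 in s.values    # membership test on the underlying ndarray

print(label_found, value_as_label, value_found)
```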

We can filter the values of a `Series` using a boolean expression as argument of the indexing operator.

In [None]:
s = pd.Series([-3, 4, 5.2, 8.7], index = ['alpha', 'beta', 'gamma', 'delta'])
s[s > s.median()]

We can apply functions elementwise, as with NumPy's ndarrays.

In [None]:
np.exp(s)

Series can be created from a Python `dict` using the `Series()` constructor.

In [None]:
d = {'b': 1, 'a': 0, 'c': 2}
pd.Series(d)

The `Series()` constructor also accepts two lists as arguments: one for the values and one, passed as `index`, for the labels.

Adding two `Series` whose indices only partially overlap creates a new `Series` whose index is the union of the original indices, with `NaN` values (used by pandas to indicate a missing value) for the labels that are not in the intersection.

In [None]:
s1 = pd.Series([-3, 4, 5], index = ['alpha', 'beta', 'gamma'])
s2 = pd.Series([9, -1, -2], index = ['beta', 'gamma', 'delta'])
s1 + s2
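
If the `NaN` entries are not wanted, one possible option is the `add()` method with `fill_value`, which treats a label missing on one side as the given value. This is a minimal sketch on the same two `Series`; the choice of `0` as fill value is an assumption that only makes sense when a missing entry really means "nothing to add".

```python
import pandas as pd

s1 = pd.Series([-3, 4, 5], index=['alpha', 'beta', 'gamma'])
s2 = pd.Series([9, -1, -2], index=['beta', 'gamma', 'delta'])

plain = s1 + s2                   # NaN for 'alpha' and 'delta'
total = s1.add(s2, fill_value=0)  # a label missing on one side counts as 0

print(total)
```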

Additional attributes of `Series` are `name` and `index.name`.

In [None]:
s = pd.Series([1.72, 1.65, 1.83], index = ['John', 'Bob', 'Jean'])
s.name = 'Height'
s.index.name = 'First name'
s
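
These names are not only cosmetic. As a small illustration, turning the `Series` into a table with `reset_index()` reuses `index.name` and `name` as column labels (the variable name `table` is just for this sketch).

```python
import pandas as pd

s = pd.Series([1.72, 1.65, 1.83], index=['John', 'Bob', 'Jean'])
s.name = 'Height'
s.index.name = 'First name'

# The index becomes a regular column named after index.name,
# and the values become a column named after name
table = s.reset_index()
print(table.columns.tolist())
```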

## DataFrames
- 2-dimensional labeled data structure with columns of potentially different types
- the most commonly used pandas object
- the structure used to store datasets

Type constructor: `pd.DataFrame()`

Construction from `Series` (series can have different lengths, `NaN` used for missing values)

In [None]:
s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
d = {'one' : s1, 'two' : s2}
df = pd.DataFrame(d)
df
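
A side effect worth knowing: since `NaN` is a floating-point value, a column that receives a missing entry is stored as floats even if the original `Series` held integers. The sketch below repeats the construction above and inspects the column types.

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
df = pd.DataFrame({'one': s1, 'two': s2})

print(df.dtypes)        # 'one' is float because of the NaN at label 'd'
print(df['one'].isna()) # True only for label 'd'
```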

Construction from ndarrays/lists (must have the same length)

In [None]:
d = {
    'one' : [1, 2, 3],
    'two' : ['alpha', 'beta', 'gamma']
}
df = pd.DataFrame(d, index = ['a', 'b', 'c'])
df

The row and column labels can be accessed through the `index` and `columns` attributes

In [None]:
df.index

In [None]:
df.columns

DataFrames can be operated on pretty much like a `dict` of `Series` objects that share a common index.

Accessing columns by name

In [None]:
df['two']

Adding a new column

In [None]:
df['three'] = df['one'] > 1
df

Adding a new column using a list

In [None]:
df['four'] = [3.0, 2.9, 2.3]
df

Deleting a column

In [None]:
del df['three']
df

Checking existence of a column by name

In [None]:
'one' in df

Similarly to `Series`, the `values` attribute returns a 2-dimensional `ndarray` containing the `DataFrame` data.

In [None]:
df.values
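
When the columns have different types, the resulting array has to fall back on a common type. The sketch below uses the `to_numpy()` method, which the pandas documentation recommends over `values`, on a small frame mixing integers and strings (the frame is rebuilt here so the example is self-contained).

```python
import pandas as pd

df = pd.DataFrame({'one': [1, 2, 3], 'two': ['alpha', 'beta', 'gamma']})

mixed = df.to_numpy()             # integers and strings: generic object array
numeric = df[['one']].to_numpy()  # a single integer column keeps its type

print(mixed.dtype, numeric.dtype)
```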

Next, we load the Iris dataset. First, we take a look at the raw file using the `cat` shell command.

In [None]:
!cat Datasets/Iris.csv

A dataset in CSV format is loaded via the `pd.read_csv()` function.

The method `info()` provides information about number of rows and columns, name of columns, and type of column entries.

In [None]:
iris = pd.read_csv("Datasets/Iris.csv")
iris.info() # print number of rows and columns, column names and column types
iris

We don't really need the `Id` column.

In [None]:
del iris['Id']
iris.head() # print only the first few rows

The method `describe()` computes some useful statistics over the columns

In [None]:
iris.describe()

Let's check also the `index` and `columns` attributes

In [None]:
iris.index

In [None]:
iris.columns

**Slice rows.** Note that slicing with `iloc` follows the notation of Python as the right endpoint is not included.

In [None]:
iris.iloc[5:10]
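
For comparison, label-based slicing with `loc` includes the right endpoint. The sketch below uses a small hypothetical frame with the default integer index, so that positions and labels coincide and the two slices select the same rows.

```python
import pandas as pd

df = pd.DataFrame({'x': range(10)})

by_position = df.iloc[5:10]  # positions 5..9, right endpoint excluded
by_label = df.loc[5:9]       # labels 5..9, right endpoint included

print(len(by_position), len(by_label))
```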

**Slice columns.**

In [None]:
iris.iloc[:,1:3]

**Slice rows and columns.**

In [None]:
iris.iloc[5:10, 3:5]

This is how the average of a specific column can be computed

In [None]:
iris['SepalLengthCm'].mean()

We can also compute the means of all numeric columns (the first four) at once.

In [None]:
iris.iloc[:,:4].mean()

By default, **aggregation is computed per column**: it runs along `axis=0` (the row index), so the rows are collapsed and we get one value per column.

To aggregate per row instead (along `axis=1`, the column labels), we can pass the axis as an argument.

In [None]:
iris.iloc[:,:4].mean(axis=1)

Aggregations along an axis return `Series` objects.
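
The effect of the `axis` argument is easy to check on a tiny example. The frame below is made up for illustration: the default returns one mean per column, while `axis=1` returns one mean per row.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})

per_column = df.mean()     # default axis=0: Series indexed by column name
per_row = df.mean(axis=1)  # axis=1: Series indexed by row label

print(per_column)
print(per_row)
```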

Other operators are elementwise. Here is an example using `abs()` for computing the absolute value of each dataframe entry.

In [None]:
df = pd.DataFrame({
    'a': [4, 5, 6, -7],
    'b': [10, -20, 30, 40],
    'c': [100, 50, -30, -50]
})
df.abs()

We can select a subset of rows by combining different conditions using the boolean operators `& | ~`.

In [None]:
iris[ (iris['SepalLengthCm'] > iris['SepalLengthCm'].mean()) & (iris['Species'] == 'Iris-versicolor') ]
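
The parentheses around each condition matter, because `&`, `|` and `~` bind more tightly than comparisons such as `>` and `==`. Here is a sketch of the other two operators on a small made-up frame (the column names `length` and `species` are hypothetical).

```python
import pandas as pd

df = pd.DataFrame({
    'length': [4.9, 6.1, 5.8, 7.0, 5.0],
    'species': ['setosa', 'versicolor', 'setosa', 'virginica', 'versicolor']
})

is_long = df['length'] > df['length'].mean()
is_setosa = df['species'] == 'setosa'

long_or_setosa = df[is_long | is_setosa]  # at least one condition holds
not_setosa = df[~is_setosa]               # negation of a condition

print(long_or_setosa)
print(not_setosa)
```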

We can also sample a random row

In [None]:
iris.sample()

Or sample a random subset of $n$ rows

In [None]:
iris.sample(n=3)

Or subsample a pre-specified fraction of rows

In [None]:
iris.sample(frac=.05)
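
Each call to `sample()` returns different rows. If a reproducible sample is needed, the `random_state` parameter can be fixed, as sketched below on a hypothetical frame with 100 rows (the seed `0` is arbitrary).

```python
import pandas as pd

df = pd.DataFrame({'x': range(100)})

first = df.sample(n=3, random_state=0)
second = df.sample(n=3, random_state=0)       # same seed, same rows
subset = df.sample(frac=0.05, random_state=0) # 5% of 100 rows

print(first.equals(second), len(subset))
```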

We can use the method `insert()` to add a column in a given position.

In [None]:
iris.insert(4, 'SepalRatio', iris['SepalWidthCm']/iris['SepalLengthCm'])
iris.head()
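
Keep in mind that `insert()` modifies the `DataFrame` in place and returns `None`, and that by default it refuses to insert a column whose name already exists. A small sketch on a made-up frame:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'c': [5, 6]})

result = df.insert(1, 'b', [3, 4])  # in place: df now has columns a, b, c

try:
    df.insert(0, 'b', [0, 0])       # 'b' already exists
    duplicate_rejected = False
except ValueError:
    duplicate_rejected = True

print(df.columns.tolist(), result, duplicate_rejected)
```

This is also why re-running the cell above on the same `iris` object raises an error: the `SepalRatio` column is already there.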

We now show how to perform some simple data normalization tasks.

First, we show how to rescale the entries in each column to the $[0,1]$ interval

In [None]:
iris2 = iris.iloc[:,:4].copy() # create a copy including only the first four columns
iris2.head()

In [None]:
iris2 = (iris2-iris2.min())/(iris2.max() - iris2.min()) # normalize columns in [0,1]
iris2.head()

In [None]:
iris2.min() # doublecheck

In [None]:
iris2.max() # doublecheck
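
The same rescaling, $(x - \min) / (\max - \min)$ per column, can be tried on a tiny made-up frame. The sketch also shows a caveat: a constant column has $\max = \min$, so the division is $0/0$ and the result is `NaN`. The column names `x` and `k` are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({'x': [1.0, 2.0, 3.0], 'k': [5.0, 5.0, 5.0]})

# min() and max() return one value per column, which is then
# matched to the columns of df by name
scaled = (df - df.min()) / (df.max() - df.min())

print(scaled)
```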

In [None]:
iris.iloc[:,:4] = iris2 # copy normalized columns back into iris
iris.head()

We reload the dataset and this time we change the entries of each column by subtracting their mean.

In [None]:
iris = pd.read_csv("Datasets/Iris.csv")
del iris['Id']

In [None]:
mean = iris.iloc[:,:4].mean() # compute mean of each column except the last one
mean

In [None]:
iris.iloc[:,:4] = iris.iloc[:,:4] - mean # subtract mean from each entry (excluding entries in column Species)
iris.head()

In [None]:
iris.iloc[:,:4].mean().round(2) # doublecheck

Here is a more compact way of doing the same transformation using a lambda function.

In [None]:
iris = pd.read_csv("Datasets/Iris.csv")
del iris['Id']
iris2 = iris.iloc[:,:4].apply(lambda x: x-x.mean())
iris2.head()

In [None]:
iris2.iloc[:,:4].mean().round(2) # doublecheck
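
As a sanity check, the two approaches (direct subtraction of the column means and `apply()` with a lambda) should give the same result. The sketch below compares them on a small made-up frame using `pd.testing.assert_frame_equal`, which raises an error if the frames differ.

```python
import pandas as pd

df = pd.DataFrame({'a': [1.0, 2.0, 6.0], 'b': [10.0, 20.0, 60.0]})

by_subtraction = df - df.mean()                # column means matched by name
by_apply = df.apply(lambda x: x - x.mean())    # lambda called once per column

pd.testing.assert_frame_equal(by_subtraction, by_apply)
print(by_apply)
```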