# Introduction to pandas

* A library to analyze data in **table format**
* Very rich set of features; a fixture of data science.

Table format:
*  Each observation (an individual data point) is in its own row.
*  Each observation’s distinct characteristics, or features, are in separate columns.



## Fundamental data structures in pandas

* `Series`: A column of values of the same type.
* `DataFrame`: The table itself
* `Index`: Labels of data points

A `Series` is a column in a `DataFrame`.

A `DataFrame` is a collection of `Series` objects with a common `Index`.

![](img/series_index_dataframe.png)

In [32]:
# `pd` is the conventional alias for pandas, as `np` is for NumPy
import pandas as pd
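
As a minimal sketch of how the three structures relate, the following builds a small `DataFrame` and checks that one of its columns is a `Series` carrying the same `Index`. The column names, row labels, and values are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data, used only to illustrate the three structures.
df = pd.DataFrame(
    {"height_cm": [172, 165, 180], "weight_kg": [68.5, 59.0, 81.2]},
    index=["ann", "bob", "cem"],
)

column = df["height_cm"]          # selecting one column gives a Series
print(type(df).__name__)          # DataFrame
print(type(column).__name__)      # Series
print(type(df.index).__name__)    # Index

# The column carries the same row labels as the frame it came from.
print(column.index.equals(df.index))
```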

## `Series` objects
A `Series` represents a column of a `DataFrame`; more generally, it can be any 1-dimensional array-like object.

It contains both:
* A sequence of values of the same type.
* A sequence of data labels called the index.

Create a Series named `s`:

In [4]:
s = pd.Series(["hello", "world", "AI111"])
s

0    hello
1    world
2    AI111
dtype: object

Accessing data values within the Series

In [5]:
s.values

array(['hello', 'world', 'AI111'], dtype=object)

Accessing the Index of the Series

In [6]:
s.index

RangeIndex(start=0, stop=3, step=1)

By default, the index of a `Series` is a sequential list of integers beginning from 0. Optionally, a manually specified list of desired indices can be passed to the index argument.

In [7]:
s = pd.Series([-1, 10, 2], index = ["a", "b", "c"])
s

a    -1
b    10
c     2
dtype: int64

Indices can also be changed after initialization.

In [8]:
s.index = ["first", "second", "third"]
s

first     -1
second    10
third      2
dtype: int64

A Series can also be generated from a NumPy array

In [20]:
import numpy as np
pd.Series( np.linspace(0,5,11) )

0     0.0
1     0.5
2     1.0
3     1.5
4     2.0
5     2.5
6     3.0
7     3.5
8     4.0
9     4.5
10    5.0
dtype: float64

Series objects can be used in vectorized operations, like NumPy arrays.

In [129]:
temp_fahrenheit = pd.Series([86, 113, 95, 77, 122])
temp_celsius = 5/9 * (temp_fahrenheit-32)
temp_celsius

0    30.0
1    45.0
2    35.0
3    25.0
4    50.0
dtype: float64

NumPy operations can be used directly with Series:

In [130]:
np.sqrt(temp_celsius)

0    5.477226
1    6.708204
2    5.916080
3    5.000000
4    7.071068
dtype: float64

## Selection in Series

Much like when working with NumPy arrays, we can select a single value or a set of values from a Series. There are three primary ways to specify the selection:

* A single label.
* A list of labels.
* A filtering condition.

To demonstrate this, let’s define a new Series s.

In [9]:
s = pd.Series([4, -2, 0, 6], index = ["a", "b", "c", "d"])
s

a    4
b   -2
c    0
d    6
dtype: int64

A single label: We return the value stored at the index label "a"

In [10]:
s["a"] 

np.int64(4)

A list of labels: We return a Series of the values stored at the index labels "a" and "c"

In [11]:
s[["a", "c"]] 

a    4
c    0
dtype: int64

A filtering condition: First, we apply a boolean operation to the Series. This creates a new Series of boolean values.

In [16]:
s > 0  # Series with Boolean values

a     True
b    False
c    False
d     True
dtype: bool

We then use this boolean condition to index into our original Series. pandas will select only the entries in the original Series that satisfy the condition.

In [14]:
s[s > 0] # Filter condition: select all elements greater than 0

a    4
d    6
dtype: int64
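
Label-based selection assumes the label exists in the index. The sketch below, which reuses the same example values, shows one way to check for a label first and what happens when a missing label is requested. The label `"z"` is chosen only because it is absent from the index.

```python
import pandas as pd

s = pd.Series([4, -2, 0, 6], index=["a", "b", "c", "d"])

# The `in` operator on a Series tests the index labels, not the values.
has_a = "a" in s
has_z = "z" in s
print(has_a, has_z)

# Selecting a label that is not in the index raises KeyError.
try:
    s["z"]
    missing_label_raised = False
except KeyError:
    missing_label_raised = True
print(missing_label_raised)
```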

## Exercise

Let's create a Series with random values.

In [28]:
import numpy as np
np.random.seed(1111) # for repeatable outcomes
s = pd.Series(
    np.random.randint(1,101, 20),
    index = list("abcdefghijklmnopqrst") # converts to a list of characters
)
s

a    29
b    56
c    82
d    13
e    35
f    53
g    25
h    23
i    21
j    12
k    15
l     9
m    13
n    87
o     9
p    63
q    62
r    52
s    43
t    77
dtype: int64

Write expressions to do the following:

* Get the value with the label "g"
* Get the values with labels "a", "f", "m", and "n".
* Get the elements greater than 50.
* Get the elements between 30 and 60

## Data Frames

We can create data frames in several ways.

(a) Using a list (or nested list) of values and column names:

In [33]:
pd.DataFrame([1,2,3], columns=["Numbers"])

Unnamed: 0,Numbers
0,1
1,2
2,3


The list can be nested, each sublist one row of data:

In [35]:
pd.DataFrame([[1,"one"], [2,"two"], [3,"three"]], columns=["number", "description"])

Unnamed: 0,number,description
0,1,one
1,2,two
2,3,three


(b) Using a dictionary where keys are column names and values are the columns.

In [36]:
pd.DataFrame({
    "Fruit": ["Strawberry", "Orange"], 
    "Price": [5.49, 3.99]
})

Unnamed: 0,Fruit,Price
0,Strawberry,5.49
1,Orange,3.99


(c) Using Series objects having the same index

In [37]:
# Notice how our indices, or row labels, are the same

s_a = pd.Series(["a1", "a2", "a3"], index = ["r1", "r2", "r3"])
s_b = pd.Series(["b1", "b2", "b3"], index = ["r1", "r2", "r3"])

In [38]:
s_a

r1    a1
r2    a2
r3    a3
dtype: object

Pass the Series to `pd.DataFrame()` to convert it to a data frame:

In [39]:
pd.DataFrame(s_a)

Unnamed: 0,0
r1,a1
r2,a2
r3,a3


Alternatively, use the `to_frame` method of the Series object.

In [42]:
s_b.to_frame()

Unnamed: 0,0
r1,b1
r2,b2
r3,b3


Combine these two Series into a single data frame using the dictionary syntax.

In [43]:
pd.DataFrame({
    "A-column": s_a, 
    "B-column": s_b
})

Unnamed: 0,A-column,B-column
r1,a1,b1
r2,a2,b2
r3,a3,b3


What if the two Series do not have the same index?

They are still merged into a single data frame with empty (NaN) values.

In [44]:
s_a = pd.Series(["a1", "a2", "a3"], index = ["r1", "r2", "r4"])
s_b = pd.Series(["b1", "b2", "b3"], index = ["r1", "r2", "r3"])

pd.DataFrame({
    "A-column": s_a, 
    "B-column": s_b
})

Unnamed: 0,A-column,B-column
r1,a1,b1
r2,a2,b2
r3,,b3
r4,a3,


## Indexing

The index of the data frame can be changed as required.

Index values do not have to be integers. They do not have to be unique.
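
To illustrate both points, here is a small sketch with string labels, one of which is repeated. The frame and its values are invented for the example.

```python
import pandas as pd

# Hypothetical data: the label "x" appears twice in the index.
df = pd.DataFrame(
    {"score": [10, 20, 30]},
    index=["x", "y", "x"],
)

print(df.index.is_unique)      # False, because "x" is repeated

# A repeated label selects every matching row, so the result is a DataFrame.
rows_x = df.loc["x"]
print(rows_x)

# A label that occurs once selects a single row, returned as a Series.
row_y = df.loc["y"]
print(row_y)
```

Because a repeated label can return several rows, code that expects exactly one row per label may want to check `index.is_unique` first.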

Let's read the grades table using the `read_csv` function.

In [60]:
grades = pd.read_csv("data_ex8_1.csv")
grades

Unnamed: 0,idno,name,math,literature,physics,music
0,27-6234266,Domenic Been,62,71,51,81
1,32-8500006,Carmel Vondrach,58,62,60,61
2,46-4848244,Gabey Stanlock,63,51,50,83
3,40-4452613,Miner Spoure,61,53,55,87
4,49-8047324,Joanie Padbery,49,60,51,68
...,...,...,...,...,...,...
95,91-1396398,Clayson Toma,50,75,62,88
96,39-5570882,Gallard Burston,60,72,46,72
97,69-7809678,Tate Pevsner,52,78,55,72
98,97-5658808,Perice Castri,64,47,60,68


Set the index to the *idno* column:

In [46]:
grades.set_index("idno")

idno,name,math,literature,physics,music
27-6234266,Domenic Been,62,71,51,81
32-8500006,Carmel Vondrach,58,62,60,61
46-4848244,Gabey Stanlock,63,51,50,83
40-4452613,Miner Spoure,61,53,55,87
49-8047324,Joanie Padbery,49,60,51,68
...,...,...,...,...,...
91-1396398,Clayson Toma,50,75,62,88
39-5570882,Gallard Burston,60,72,46,72
69-7809678,Tate Pevsner,52,78,55,72
97-5658808,Perice Castri,64,47,60,68


Note that this operation returns a new data frame. The original data frame is not affected.

In [47]:
grades

Unnamed: 0,idno,name,math,literature,physics,music
0,27-6234266,Domenic Been,62,71,51,81
1,32-8500006,Carmel Vondrach,58,62,60,61
2,46-4848244,Gabey Stanlock,63,51,50,83
3,40-4452613,Miner Spoure,61,53,55,87
4,49-8047324,Joanie Padbery,49,60,51,68
...,...,...,...,...,...,...
95,91-1396398,Clayson Toma,50,75,62,88
96,39-5570882,Gallard Burston,60,72,46,72
97,69-7809678,Tate Pevsner,52,78,55,72
98,97-5658808,Perice Castri,64,47,60,68


To modify *grades*, either reassign to it:

In [48]:
grades = grades.set_index("idno")
grades

idno,name,math,literature,physics,music
27-6234266,Domenic Been,62,71,51,81
32-8500006,Carmel Vondrach,58,62,60,61
46-4848244,Gabey Stanlock,63,51,50,83
40-4452613,Miner Spoure,61,53,55,87
49-8047324,Joanie Padbery,49,60,51,68
...,...,...,...,...,...
91-1396398,Clayson Toma,50,75,62,88
39-5570882,Gallard Burston,60,72,46,72
69-7809678,Tate Pevsner,52,78,55,72
97-5658808,Perice Castri,64,47,60,68


or, use the `inplace` parameter:

In [49]:
grades = pd.read_csv("data_ex8_1.csv")
grades.set_index("idno", inplace = True)
grades

idno,name,math,literature,physics,music
27-6234266,Domenic Been,62,71,51,81
32-8500006,Carmel Vondrach,58,62,60,61
46-4848244,Gabey Stanlock,63,51,50,83
40-4452613,Miner Spoure,61,53,55,87
49-8047324,Joanie Padbery,49,60,51,68
...,...,...,...,...,...
91-1396398,Clayson Toma,50,75,62,88
39-5570882,Gallard Burston,60,72,46,72
69-7809678,Tate Pevsner,52,78,55,72
97-5658808,Perice Castri,64,47,60,68


We can also set the index column while reading the data file:

In [58]:
grades = pd.read_csv("data_ex8_1.csv", index_col="idno")
grades

idno,name,math,literature,physics,music
27-6234266,Domenic Been,62,71,51,81
32-8500006,Carmel Vondrach,58,62,60,61
46-4848244,Gabey Stanlock,63,51,50,83
40-4452613,Miner Spoure,61,53,55,87
49-8047324,Joanie Padbery,49,60,51,68
...,...,...,...,...,...
91-1396398,Clayson Toma,50,75,62,88
39-5570882,Gallard Burston,60,72,46,72
69-7809678,Tate Pevsner,52,78,55,72
97-5658808,Perice Castri,64,47,60,68
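
The two routes to an indexed frame can be compared without the data file. The sketch below reads a few made-up CSV rows from an in-memory string (via `io.StringIO`) instead of `data_ex8_1.csv`. The rows are invented and only mimic the shape of the grades table.

```python
import io
import pandas as pd

# Made-up CSV text standing in for a data file; not taken from data_ex8_1.csv.
csv_text = """idno,name,math
A-001,Student One,70
A-002,Student Two,55
A-003,Student Three,90
"""

# Option 1: read first, then move the "idno" column into the index.
by_set_index = pd.read_csv(io.StringIO(csv_text)).set_index("idno")

# Option 2: choose the index column while reading.
by_index_col = pd.read_csv(io.StringIO(csv_text), index_col="idno")

print(by_index_col)
print(by_set_index.equals(by_index_col))
```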


And, if we’d like, we can revert the index back to the default list of integers.

In [51]:
grades.reset_index(inplace=True)
grades

Unnamed: 0,idno,name,math,literature,physics,music
0,27-6234266,Domenic Been,62,71,51,81
1,32-8500006,Carmel Vondrach,58,62,60,61
2,46-4848244,Gabey Stanlock,63,51,50,83
3,40-4452613,Miner Spoure,61,53,55,87
4,49-8047324,Joanie Padbery,49,60,51,68
...,...,...,...,...,...,...
95,91-1396398,Clayson Toma,50,75,62,88
96,39-5570882,Gallard Burston,60,72,46,72
97,69-7809678,Tate Pevsner,52,78,55,72
98,97-5658808,Perice Castri,64,47,60,68


## DataFrame attributes

The index of the data frame

In [61]:
grades.set_index("idno", inplace=True)
grades.index

Index(['27-6234266', '32-8500006', '46-4848244', '40-4452613', '49-8047324',
       '06-2459518', '35-7239913', '63-7712715', '16-0378773', '04-0828998',
       '84-8664238', '59-7103403', '50-1336034', '53-0435525', '04-7757095',
       '32-2310582', '49-9767932', '47-4087703', '68-3466140', '03-9174358',
       '14-8569680', '06-1148581', '78-7735038', '33-7615591', '01-3377637',
       '01-4681700', '51-8820392', '37-8668008', '22-7818285', '76-9375991',
       '70-9061896', '09-3064687', '93-9984100', '43-7640441', '87-1632149',
       '91-1524711', '38-9010265', '30-7706025', '76-7301665', '64-2963914',
       '37-0162523', '22-7681239', '43-0193918', '26-4326392', '20-7921323',
       '66-4928286', '05-8898294', '69-5776987', '22-9257090', '99-5454981',
       '44-0224192', '68-9308646', '50-0376646', '65-9869687', '58-5892136',
       '58-9041710', '68-1924853', '20-1831473', '87-9366708', '55-5355823',
       '74-4852223', '97-3396835', '28-9811319', '97-5732059', '59-1040545',

The names of the columns as an Index object

In [62]:
grades.columns

Index(['name', 'math', 'literature', 'physics', 'music'], dtype='object')

The `.shape` attribute gives how many rows and columns the data frame has.

In [63]:
grades.shape

(100, 5)
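
Since `.shape` is an ordinary tuple, it can be unpacked into separate row and column counts. A minimal sketch on a made-up frame:

```python
import pandas as pd

# Hypothetical frame with 3 rows and 2 columns.
df = pd.DataFrame({"x": [1, 2, 3], "y": [4.0, 5.0, 6.0]})

n_rows, n_cols = df.shape   # shape is a (rows, columns) tuple
print(n_rows, n_cols)

# len() of a data frame is its number of rows.
print(len(df) == n_rows)
print(len(df.columns) == n_cols)
```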

## Peeking at the data

Sometimes the data frame is too big, and we want to see either some parts of it, or a summary of it.

`.head(n)` gives the first *n* rows (5 by default)

In [71]:
grades = pd.read_csv("data_ex8_1.csv")
grades.head()

Unnamed: 0,idno,name,math,literature,physics,music
0,27-6234266,Domenic Been,62,71,51,81
1,32-8500006,Carmel Vondrach,58,62,60,61
2,46-4848244,Gabey Stanlock,63,51,50,83
3,40-4452613,Miner Spoure,61,53,55,87
4,49-8047324,Joanie Padbery,49,60,51,68


`.tail(n)` gives the last *n* rows (5 by default)

In [72]:
grades.tail(7)

Unnamed: 0,idno,name,math,literature,physics,music
93,64-4445589,Chrystel Ewart,78,72,53,60
94,18-1560214,Brander Durrant,31,69,50,63
95,91-1396398,Clayson Toma,50,75,62,88
96,39-5570882,Gallard Burston,60,72,46,72
97,69-7809678,Tate Pevsner,52,78,55,72
98,97-5658808,Perice Castri,64,47,60,68
99,18-4503877,Sophi Magarrell,66,42,65,91


`.describe()` provides some descriptive statistics over the entire frame.

In [73]:
grades.describe()

Unnamed: 0,math,literature,physics,music
count,100.0,100.0,100.0,100.0
mean,58.19,60.64,50.3,69.05
std,9.730401,9.215183,9.785064,9.575194
min,31.0,35.0,17.0,47.0
25%,52.0,55.0,45.0,62.0
50%,58.0,61.5,50.0,68.0
75%,65.25,66.0,57.0,74.0
max,85.0,82.0,78.0,93.0
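
Note that the summary above has no *idno* or *name* column: by default, `.describe()` summarizes only the numeric columns. The sketch below shows this on a small invented frame that mixes a text column with numeric ones.

```python
import pandas as pd

# Hypothetical frame mixing a text column with numeric columns.
df = pd.DataFrame({
    "name": ["p", "q", "r", "s"],
    "math": [50, 60, 70, 80],
    "music": [65.0, 75.0, 85.0, 95.0],
})

summary = df.describe()
print(summary)

# Only the numeric columns are summarized by default.
print(list(summary.columns))

# The statistics are the row labels of the resulting data frame.
print(list(summary.index))
```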


## Slicing in data frames

We can extract single values, single rows/columns, several rows/columns from data frames.

* `.loc` provides label-based selection
* `.iloc` provides integer-based selection (the numerical order)
* `[]` can do both, depending on the context.

### Label-based extraction with `.loc`

To grab data with `.loc`, we specify the row and column label(s) of the data inside square brackets: the row labels come first, and the column labels come second.

Each of the two arguments to `.loc` can be:
* A single value.
* A slice.
* A list.

For example, to select a single value, we can select the row labeled 0 and the column labeled "name" from the grades data frame.

In [75]:
grades.loc[0, "name"]

'Domenic Been'

Select rows labeled 11, 23, 67 and the "name" column

In [76]:
grades.loc[[11,23,67],"name"]

11      Adara Southern
23    Hailey Blomfield
67      Junette Djekic
Name: name, dtype: object

Note that the output above is a Series. If we give the column label as a one-element list, the result is a DataFrame.

In [77]:
grades.loc[[11,23,67],["name"]]

Unnamed: 0,name
11,Adara Southern
23,Hailey Blomfield
67,Junette Djekic


To select multiple rows and columns, we can use Python slice notation. Here, we select the rows from labels 0 to 3 and the columns from labels "math" to "physics".

In [78]:
grades.loc[0:3,"math":"physics"]

Unnamed: 0,math,literature,physics
0,62,71,51
1,58,62,60
2,63,51,50
3,61,53,55


Notice:
* Unlike plain Python sequences, pandas lets us slice by string labels (in our example, the column labels).
* A `.loc` slice includes its right endpoint.

A bare colon (`:`) selects all labels along that axis.

In [82]:
grades.loc[[0,1,2,4],:]

Unnamed: 0,idno,name,math,literature,physics,music
0,27-6234266,Domenic Been,62,71,51,81
1,32-8500006,Carmel Vondrach,58,62,60,61
2,46-4848244,Gabey Stanlock,63,51,50,83
4,49-8047324,Joanie Padbery,49,60,51,68


In [81]:
grades.loc[:,"idno":"name"]

Unnamed: 0,idno,name
0,27-6234266,Domenic Been
1,32-8500006,Carmel Vondrach
2,46-4848244,Gabey Stanlock
3,40-4452613,Miner Spoure
4,49-8047324,Joanie Padbery
...,...,...
95,91-1396398,Clayson Toma
96,39-5570882,Gallard Burston
97,69-7809678,Tate Pevsner
98,97-5658808,Perice Castri


Like with arrays and Series, we can provide *filtering conditions* with `.loc`.

Select students that have a math score greater than 80:

In [97]:
grades.loc[grades["math"]>80,:]

Unnamed: 0,idno,name,math,literature,physics,music
59,55-5355823,Maryjane Gaze,85,65,49,80
77,71-1841974,Patsy Baird,83,72,37,64


### Integer-based extraction with `.iloc`

Slicing with `.iloc` works similarly to `.loc`. However, `.iloc` uses the index positions of rows and columns rather than the labels.

**l**oc : **l**abel, **i**loc : **i**nteger

The arguments to `.iloc` behave similarly: single values, lists, slices, and any combination of these are permitted.

In [83]:
grades.iloc[0, 1]

'Domenic Been'

In [85]:
grades.iloc[[11,23,67], 1]

11      Adara Southern
23    Hailey Blomfield
67      Junette Djekic
Name: name, dtype: object

As before, if we pass a single column index, the result is a Series. If we pass a list of column indices, we get a data frame.

In [86]:
grades.iloc[[11,23,67], [1]]

Unnamed: 0,name
11,Adara Southern
23,Hailey Blomfield
67,Junette Djekic


Slicing works slightly differently, though:

In [87]:
grades.iloc[0:4, 2:5]

Unnamed: 0,math,literature,physics
0,62,71,51
1,58,62,60
2,63,51,50
3,61,53,55


Unlike `.loc`, `.iloc` follows the usual Python slicing convention: the right end of a slice is not included.
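
The difference is easier to see when the row labels do not coincide with the row positions. The following sketch uses an invented frame whose integer labels are 10, 20, 30, and 40.

```python
import pandas as pd

# Hypothetical frame whose integer labels differ from the row positions.
df = pd.DataFrame(
    {"value": ["p", "q", "r", "s"]},
    index=[10, 20, 30, 40],
)

# .loc slices by label and includes the right endpoint: labels 10, 20, 30.
by_label = df.loc[10:30]

# .iloc slices by position and excludes the right endpoint: positions 0 and 1.
by_position = df.iloc[0:2]

print(list(by_label.index))
print(list(by_position.index))
```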

And just like with `.loc`, we can use a colon (:) with `.iloc` to extract all rows or columns.

In [88]:
grades.iloc[0:4, :]

Unnamed: 0,idno,name,math,literature,physics,music
0,27-6234266,Domenic Been,62,71,51,81
1,32-8500006,Carmel Vondrach,58,62,60,61
2,46-4848244,Gabey Stanlock,63,51,50,83
3,40-4452613,Miner Spoure,61,53,55,87


In [89]:
grades.iloc[:, 2:6]

Unnamed: 0,math,literature,physics,music
0,62,71,51,81
1,58,62,60,61
2,63,51,50,83
3,61,53,55,87
4,49,60,51,68
...,...,...,...,...
95,50,75,62,88
96,60,72,46,72
97,52,78,55,72
98,64,47,60,68


### Indexing with `[]`

The `[]` selection operator takes a single argument, which may be one of the following:

* A slice of row numbers.
* A list of column labels.
* A single-column label.

That is, `[]` is context-dependent. Examples:

A slice of row numbers:

In [90]:
grades[2:7]

Unnamed: 0,idno,name,math,literature,physics,music
2,46-4848244,Gabey Stanlock,63,51,50,83
3,40-4452613,Miner Spoure,61,53,55,87
4,49-8047324,Joanie Padbery,49,60,51,68
5,06-2459518,Zsazsa Shelford,50,68,62,52
6,35-7239913,Dallon Besnard,59,70,44,47


A list of column labels:

In [92]:
grades[["math","music","physics"]]

Unnamed: 0,math,music,physics
0,62,81,51
1,58,61,60
2,63,83,50
3,61,87,55
4,49,68,51
...,...,...,...
95,50,88,62
96,60,72,46
97,52,72,55
98,64,68,60


Notice that we have changed the order of the columns during the extraction.

A single column label:

In [93]:
grades["literature"]

0     71
1     62
2     51
3     53
4     60
      ..
95    75
96    72
97    78
98    47
99    42
Name: literature, Length: 100, dtype: int64

One can also use the attribute notation (`grades.literature`)

In [103]:
grades.literature

0     71
1     62
2     51
3     53
4     60
      ..
95    75
96    72
97    78
98    47
99    42
Name: literature, Length: 100, dtype: int64

However, this has limited applicability: it cannot be used if the column name is not a string or contains a space, for example.
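
A short sketch of two such cases, using invented column names. The second case, a column whose name matches an existing `DataFrame` method, is an additional pitfall not covered above.

```python
import pandas as pd

# Hypothetical frame: one column name contains a space, another is
# named "count", which is also the name of a DataFrame method.
df = pd.DataFrame({
    "final grade": [90, 75],
    "count": [3, 5],
})

# Bracket notation works for any column label.
print(df["final grade"].tolist())
print(df["count"].tolist())

# df.count refers to the DataFrame.count method, not to the "count" column.
print(callable(df.count))
```

For this reason, bracket notation is the safer default when column names are not known in advance.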

Conditional filtering can be applied:

In [100]:
grades[grades["math"]>80]

Unnamed: 0,idno,name,math,literature,physics,music
59,55-5355823,Maryjane Gaze,85,65,49,80
77,71-1841974,Patsy Baird,83,72,37,64


Conditions can be combined with `&` (and), `|` (or), and `~` (not). Put parentheses around each condition, because these operators bind more tightly than comparisons such as `>`.

In [142]:
grades[(grades["math"]>80) | (grades["literature"]>80)]

Unnamed: 0,idno,name,math,literature,physics,music
15,32-2310582,Nathan Tilt,75,82,41,54
59,55-5355823,Maryjane Gaze,85,65,49,80
77,71-1841974,Patsy Baird,83,72,37,64
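
The other two operators work the same way. The sketch below uses a small invented frame to show `&` and `~`, and what happens when the parentheses are left out.

```python
import pandas as pd

# Hypothetical scores, used only to illustrate combining conditions.
df = pd.DataFrame({
    "math": [85, 60, 40, 90],
    "physics": [70, 65, 80, 30],
})

# AND: both conditions must hold.
both_high = df[(df["math"] > 50) & (df["physics"] > 50)]

# NOT: invert a condition with ~.
math_not_high = df[~(df["math"] > 50)]

print(list(both_high.index))
print(list(math_not_high.index))

# Without parentheses, & is evaluated before the comparisons,
# and the expression raises an error instead of filtering.
try:
    df[df["math"] > 50 & df["physics"] > 50]
    raised = False
except ValueError:
    raised = True
print(raised)
```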


## Exercises

Use the `grades` data frame to do the following:

1. Select the "math" and "music" scores for labels 10 through 20
2. Select all rows where both "math" and "physics" scores are greater than 60.
3. Create a Series named `average` that stores the average of scores of every student. Extract each column using its name, add them up, and divide by 4.
4. Repeat the one above, but the resulting Series must be indexed by "idno" values.

For the following, use the Wholesale Customers Data (file `data_ex8_3.csv`)

1. Get basic descriptive statistics on the data set using the `describe` method.
2. Select rows where the "Channel" value is equal to 2.
3. Get descriptive statistics of the subset where "Channel" is equal to 2 and "Region" is equal to 3.
4. Get the first 10 rows where "Channel" is equal to 2 and "Region" is equal to 3, *excluding* the "Channel" and "Region" columns.