# Introduction to Pandas: Data exploration and indexing
Lukas Jarosch

### Modules and import syntax
Last week, we looked at basic syntax patterns and data types in Python. While these concepts form the foundation, much of Python's power comes from the large number of libraries that provide additional functionality. Some of these libraries come bundled with Python itself (the "standard library"), but most others, such as pandas, have to be installed separately. Many libraries like pandas and numpy also implement their performance-critical parts in compiled code (e.g. C), which makes them much faster than implementing the same functionality in pure Python. Modules are imported with the `import` statement, and their functions are accessed with dot notation (like `module.function()`). We will demonstrate the import syntax with the `math` module from Python's standard library, which contains additional functions for mathematical operations:

In [1]:
# import the module
import math

# call a function from the module
math.sqrt(25)

5.0

You can also import a module under a custom name. Programmers often shorten module names to keep the code compact and save themselves unnecessary typing. In Python, there is an informal consensus on how to abbreviate many popular modules. For example, you will almost always see pandas imported as `pd` and numpy imported as `np`.

In [2]:
# same code as above but with a shorter module name
import math as m

m.sqrt(25)

5.0

If you only need specific functions from a module, it can be cleaner to import only those and avoid importing the whole module and cluttering up your namespace.

In [3]:
from math import sqrt, sin
print(sqrt(25), sin(3.14))

5.0 0.0015926529164868282
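
As a small sanity check of the three import styles above, the following sketch shows that they all give access to the same underlying function object and differ only in which names end up in your namespace. The variable name `result` is only used for illustration.

```python
import math
import math as m
from math import sqrt

# All three names point at the very same function object
print(math.sqrt is m.sqrt is sqrt)   # True

# Names from the different import styles can be mixed freely
result = sqrt(16) + math.floor(2.7)
print(result)                        # 6.0
```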


Now we are ready to import pandas, the most popular package for data wrangling in Python, and look at some of its basic concepts.

In [4]:
import pandas as pd

pd.__version__

'1.3.4'

### DataFrames
Coming from base Python, a natural way to store data would be a dictionary of lists. For example, you might have data on several students and could store it in a dictionary like this:

In [5]:
student_data = {
    "first name": ["Peter", "Hanna", "Tom", "Sarah", "Lisa", "Steven", "James"],
    "last name": ["Smith", "Jones", "Williams", "Taylor", "Brown", "Davies", "Evans"],
    "age": [20, 23, 24, 24, 22, 21, 20],
    "major": ["History", "English", "Chemistry", "Physics", "Engineering", "Biology", "Computer Science"],
    "average grade": [2.5, 2.9, 1.5, 1.2, 1.1, 2.3, 2.1],
    "student ID": ['fj233', 'vc404', 'qd119', 'pr426', 'gx486', 'im401', 'rb231'],
}

student_data

{'first name': ['Peter', 'Hanna', 'Tom', 'Sarah', 'Lisa', 'Steven', 'James'],
 'last name': ['Smith',
  'Jones',
  'Williams',
  'Taylor',
  'Brown',
  'Davies',
  'Evans'],
 'age': [20, 23, 24, 24, 22, 21, 20],
 'major': ['History',
  'English',
  'Chemistry',
  'Physics',
  'Engineering',
  'Biology',
  'Computer Science'],
 'average grade': [2.5, 2.9, 1.5, 1.2, 1.1, 2.3, 2.1],
 'student ID': ['fj233', 'vc404', 'qd119', 'pr426', 'gx486', 'im401', 'rb231']}

However, this is not very pleasant to work with, and we would have to implement a lot of functionality for dealing with our data manually. With pandas, we can instead use a `DataFrame`, an object specifically designed for handling tabular data in Python. To convert our data dictionary into a DataFrame, we can simply pass it to the `pd.DataFrame()` constructor. DataFrames are 2D data structures whose rows and columns both carry labels: the row labels are called the "index", and the column labels are the column names.

In [6]:
df = pd.DataFrame(student_data)

# print the dataframe
print(df)

# print the dataframe class
print(type(df))

# use Jupyter's renderer
df

  first name last name  age             major  average grade student ID
0      Peter     Smith   20           History            2.5      fj233
1      Hanna     Jones   23           English            2.9      vc404
2        Tom  Williams   24         Chemistry            1.5      qd119
3      Sarah    Taylor   24           Physics            1.2      pr426
4       Lisa     Brown   22       Engineering            1.1      gx486
5     Steven    Davies   21           Biology            2.3      im401
6      James     Evans   20  Computer Science            2.1      rb231
<class 'pandas.core.frame.DataFrame'>


Unnamed: 0,first name,last name,age,major,average grade,student ID
0,Peter,Smith,20,History,2.5,fj233
1,Hanna,Jones,23,English,2.9,vc404
2,Tom,Williams,24,Chemistry,1.5,qd119
3,Sarah,Taylor,24,Physics,1.2,pr426
4,Lisa,Brown,22,Engineering,1.1,gx486
5,Steven,Davies,21,Biology,2.3,im401
6,James,Evans,20,Computer Science,2.1,rb231


Conveniently, Jupyter renders DataFrames as formatted tables, which are easier to read than the plain-text output of `print()`.

**Tip:** If you ever want to use Jupyter's way of displaying data for something that is not the last line of the current cell, you can also call the `display()` function explicitly.

A few basic DataFrame attributes are `.columns` and `.index`, which return the column and row labels, and `.shape`, which returns the number of rows and columns as a tuple. Python's `len()` function also works on DataFrames and returns the number of rows.

In [7]:
print(df.columns)   # column labels
print(df.index)     # row labels

Index(['first name', 'last name', 'age', 'major', 'average grade',
       'student ID'],
      dtype='object')
RangeIndex(start=0, stop=7, step=1)


In [8]:
print(df.shape)     # number of rows and columns
print(len(df))      # number of rows

(7, 6)
7
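
Because `.shape` is a plain tuple, it can be unpacked directly. The following minimal sketch uses a small made-up DataFrame (`df_small` is a hypothetical name, not part of the student data) to show how the different size-related values relate to each other.

```python
import pandas as pd

# Small illustrative DataFrame with 3 rows and 2 columns
df_small = pd.DataFrame({"x": [1, 2, 3], "y": ["a", "b", "c"]})

# .shape is a (rows, columns) tuple and can be unpacked
n_rows, n_cols = df_small.shape
print(n_rows, n_cols)            # 3 2

# len() of the column labels gives the number of columns
print(len(df_small.columns))     # 2

# .size is the total number of cells (rows * columns)
print(df_small.size)             # 6
```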


You can also use custom row labels when creating a dataframe, or modify them later on.

In [9]:
# dataframe with a custom index
df2 = pd.DataFrame(student_data, index=["A", "B", "C", "D", "E", "F", "G"])

df2

Unnamed: 0,first name,last name,age,major,average grade,student ID
A,Peter,Smith,20,History,2.5,fj233
B,Hanna,Jones,23,English,2.9,vc404
C,Tom,Williams,24,Chemistry,1.5,qd119
D,Sarah,Taylor,24,Physics,1.2,pr426
E,Lisa,Brown,22,Engineering,1.1,gx486
F,Steven,Davies,21,Biology,2.3,im401
G,James,Evans,20,Computer Science,2.1,rb231


In [10]:
# change the index of an existing dataframe
df2.index = list("hijklmn")

df2

Unnamed: 0,first name,last name,age,major,average grade,student ID
h,Peter,Smith,20,History,2.5,fj233
i,Hanna,Jones,23,English,2.9,vc404
j,Tom,Williams,24,Chemistry,1.5,qd119
k,Sarah,Taylor,24,Physics,1.2,pr426
l,Lisa,Brown,22,Engineering,1.1,gx486
m,Steven,Davies,21,Biology,2.3,im401
n,James,Evans,20,Computer Science,2.1,rb231
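
Instead of assigning a list of labels, an existing column can also serve as the index. The sketch below uses the `set_index()` and `reset_index()` methods on a shortened, illustrative version of the student data (the variable names `students`, `by_id`, and `restored` are assumptions made for this example).

```python
import pandas as pd

students = pd.DataFrame({
    "first name": ["Peter", "Hanna", "Tom"],
    "student ID": ["fj233", "vc404", "qd119"],
})

# set_index() returns a new DataFrame that uses the column as row labels
by_id = students.set_index("student ID")
print(by_id.index.tolist())       # ['fj233', 'vc404', 'qd119']

# The original DataFrame is left unchanged
print(students.columns.tolist())  # ['first name', 'student ID']

# reset_index() moves the index back into a regular column
restored = by_id.reset_index()
print(restored.columns.tolist())  # ['student ID', 'first name']
```

Note that both methods return a new object by default, so the result has to be assigned to a variable to be kept.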


### Creating DataFrames from 2D data and files
As DataFrames are 2D structures, you can also construct them from a matrix-like format that contains the DataFrame values and add the column labels through the `columns` keyword.

In [11]:
data_matrix = [
    ["Hanna", "Jones", 23, "English", 2.8, "vc404"],
    ["Peter", "Smith", 20, "History", 2.4, "fj233"],
    ["Tom", "Williams", 24, "Chemistry", 1.4, "qd119"],
]

pd.DataFrame(data_matrix, columns=["first name", "last name", "age", "major", "average grade", "student ID"])

Unnamed: 0,first name,last name,age,major,average grade,student ID
0,Hanna,Jones,23,English,2.8,vc404
1,Peter,Smith,20,History,2.4,fj233
2,Tom,Williams,24,Chemistry,1.4,qd119
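
A related row-oriented layout is a list of dictionaries, where each dictionary describes one row. The following sketch (with made-up records) shows that pandas then takes the column labels from the dictionary keys and fills missing entries with `NaN`.

```python
import pandas as pd

# Each dictionary is one row; the last record has no "age" entry
records = [
    {"first name": "Hanna", "age": 23},
    {"first name": "Peter", "age": 20},
    {"first name": "Tom"},
]

df_records = pd.DataFrame(records)
print(df_records)

# The missing value shows up as NaN
print(df_records["age"].isna().tolist())   # [False, False, True]
```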


Often, your data will be stored in files rather than in an existing Python dictionary. A very common file format is `.csv`, which contains comma-separated values. You can read csv files with the `pd.read_csv()` function.

**Tip**: If your data is separated with a different character (e.g. tab-separated or semicolon-separated) you can change the `sep` keyword in the `read_csv()` function, which is set to "," by default.

In [12]:
df = pd.read_csv("../data/student_data.csv")
df

Unnamed: 0,first name,last name,age,major,average grade,student ID
0,Peter,Smith,20,History,2.5,fj233
1,Hanna,Jones,23,English,2.9,vc404
2,Tom,Williams,24,Chemistry,1.5,qd119
3,Sarah,Taylor,24,Physics,1.2,pr426
4,Lisa,Brown,22,Engineering,1.1,gx486
5,Steven,Davies,21,Biology,2.3,im401
6,James,Evans,20,Computer Science,2.1,rb231
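
To illustrate the `sep` keyword from the tip above without needing a file on disk, the sketch below reads semicolon-separated text from an in-memory buffer via `io.StringIO`. The text content and variable names are made up for this example.

```python
import io
import pandas as pd

# Semicolon-separated text standing in for the content of a file
raw_text = "first name;age;major\nPeter;20;History\nHanna;23;English\n"

# With the matching separator, the three columns are parsed correctly
df_semicolon = pd.read_csv(io.StringIO(raw_text), sep=";")
print(df_semicolon.shape)   # (2, 3)

# With the default sep=",", every line ends up in one single column
df_wrong = pd.read_csv(io.StringIO(raw_text))
print(df_wrong.shape)       # (2, 1)
```

A DataFrame with only one unexpectedly wide column is therefore a typical sign that the separator does not match the file.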


### Series
Similarly to how our data dict was a dictionary of lists, you can view a pandas DataFrame as a super-charged dictionary of pandas `Series` objects, which are themselves a super-charged combination of list and dictionary (very roughly put). You can also create a `Series` directly, and it will have `name` and `index` attributes, similar to the column labels and index of the DataFrame.

In [13]:
# a Series with integers and a custom index and name
s1 = pd.Series([1, 2, 3, 4], index=["A", "B", "C", "D"], name="numbers")

print(s1)
print(s1.name)
print(s1.index)

A    1
B    2
C    3
D    4
Name: numbers, dtype: int64
numbers
Index(['A', 'B', 'C', 'D'], dtype='object')
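
The "list and dictionary" analogy can be made concrete with a short sketch: a Series supports label-based lookup like a dictionary and positional lookup like a list, and selecting one column from a DataFrame returns a Series. The names `s`, `frame`, and `col` are only used for illustration.

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=["A", "B", "C"], name="numbers")

# Dictionary-like: look up a value by its label
print(s["B"])        # 20

# List-like: look up a value by its position
print(s.iloc[0])     # 10

# Membership tests check the labels, like the keys of a dictionary
print("C" in s)      # True

# A single DataFrame column is a Series named after the column
frame = pd.DataFrame({"numbers": [10, 20, 30]})
col = frame["numbers"]
print(type(col).__name__, col.name)   # Series numbers
```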


### Vectorized operations and label alignment
When you use standard operators like `+` or `-` with pandas Series or DataFrames, pandas automatically applies the operation element-wise to all values. This means that explicitly looping through values is usually not necessary (and also a lot less efficient). Below are some examples of vectorized operations to show how they work in practice:

In [14]:
df = pd.DataFrame({"A": [1, 2, 3, 4], "B": [5, 6, 7, 8]})

display(df, df * 5, df + 5, df / 5)


Unnamed: 0,A,B
0,1,5
1,2,6
2,3,7
3,4,8


Unnamed: 0,A,B
0,5,25
1,10,30
2,15,35
3,20,40


Unnamed: 0,A,B
0,6,10
1,7,11
2,8,12
3,9,13


Unnamed: 0,A,B
0,0.2,1.0
1,0.4,1.2
2,0.6,1.4
3,0.8,1.6


In [15]:
# example with Series and string values
s = pd.Series(["A", "B", "C"])

print(s, s*2, sep="\n")

0    A
1    B
2    C
dtype: object
0    AA
1    BB
2    CC
dtype: object
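
To connect this to the loops you know from base Python, the following sketch computes the same result once with an explicit loop and once with a vectorized operation. The values in `grades` are made up for this example.

```python
import pandas as pd

grades = pd.Series([2.5, 2.9, 1.5])

# Explicit loop: build a new list value by value
looped = pd.Series([g * 2 for g in grades])

# Vectorized: one expression applied to every value at once
vectorized = grades * 2

# Both approaches produce identical Series
print(looped.equals(vectorized))   # True
```

The vectorized version is shorter and delegates the looping to pandas' compiled internals, which is where the speed advantage on large data comes from.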


It is also possible to add Series or DataFrames together. Pandas automatically aligns the data on the row and column labels before the computation. Any label that appears in only one of the two objects produces a missing value (`NaN`) in the result, so make sure that the labels of your two objects match.

In [16]:
# matching labels
df1 = pd.DataFrame({"A": [1, 2, 3, 4], "B": [1, 2, 3, 4]})
df2 = pd.DataFrame({"A": [1, 1, 1, 1], "B": [2, 2, 2, 2]})

display(df1, df2, df1 + df2)

Unnamed: 0,A,B
0,1,1
1,2,2
2,3,3
3,4,4


Unnamed: 0,A,B
0,1,2
1,1,2
2,1,2
3,1,2


Unnamed: 0,A,B
0,2,3
1,3,4
2,4,5
3,5,6


In [17]:
# partially matching labels
df1 = pd.DataFrame({"A": [1, 2, 3, 4], "B": [1, 2, 3, 4]}, index=[0, 1, 2, 3])
df2 = pd.DataFrame({"A": [1, 1, 1, 1], "C": [2, 2, 2, 2]}, index=[1, 2, 3, 4])

display(df1, df2, df1 + df2)


Unnamed: 0,A,B
0,1,1
1,2,2
2,3,3
3,4,4


Unnamed: 0,A,C
1,1,2
2,1,2
3,1,2
4,1,2


Unnamed: 0,A,B,C
0,,,
1,3.0,,
2,4.0,,
3,5.0,,
4,,,
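
If the missing values from non-matching labels are not what you want, one option is the `.add()` method with its `fill_value` keyword, which substitutes a default for values that are missing in only one of the two objects. The sketch below repeats the partially matching example from above; treating missing entries as 0 is an assumption that will not suit every dataset.

```python
import pandas as pd

df1 = pd.DataFrame({"A": [1, 2, 3, 4], "B": [1, 2, 3, 4]}, index=[0, 1, 2, 3])
df2 = pd.DataFrame({"A": [1, 1, 1, 1], "C": [2, 2, 2, 2]}, index=[1, 2, 3, 4])

# Plain addition: NaN wherever a label exists in only one object
plain = df1 + df2

# .add() with fill_value=0: a value missing on one side counts as 0
filled = df1.add(df2, fill_value=0)
print(filled)

# Cells that are missing in BOTH objects still stay NaN,
# e.g. column "B" at index 4 and column "C" at index 0
print(filled.isna().sum().sum())   # 2
```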


### Indexing
One of the most important operations for dealing with data is accessing specific values from a DataFrame. Pandas provides many ways of indexing data, and we will give a brief overview of them here. A more thorough explanation can be found in the [official pandas documentation](https://pandas.pydata.org/docs/user_guide/indexing.html).

#### Indexing with []
The `[]` operator from base Python works in pandas as well. Here, it is usually used to get the values of specific columns and can also be used for reassignment.

In [18]:
# get the student data from above again with a custom index
df = pd.DataFrame(student_data, index=[1, 2, 3, 4, 5, 6, 7])
df

Unnamed: 0,first name,last name,age,major,average grade,student ID
1,Peter,Smith,20,History,2.5,fj233
2,Hanna,Jones,23,English,2.9,vc404
3,Tom,Williams,24,Chemistry,1.5,qd119
4,Sarah,Taylor,24,Physics,1.2,pr426
5,Lisa,Brown,22,Engineering,1.1,gx486
6,Steven,Davies,21,Biology,2.3,im401
7,James,Evans,20,Computer Science,2.1,rb231


In [19]:
# single column access
df["first name"]

1     Peter
2     Hanna
3       Tom
4     Sarah
5      Lisa
6    Steven
7     James
Name: first name, dtype: object

In [20]:
# multi-column access
df[["first name", "age", "major"]]

Unnamed: 0,first name,age,major
1,Peter,20,History
2,Hanna,23,English
3,Tom,24,Chemistry
4,Sarah,24,Physics
5,Lisa,22,Engineering
6,Steven,21,Biology
7,James,20,Computer Science


In [21]:
# reassignment
df["average grade"] = df["average grade"] - 0.1
df

Unnamed: 0,first name,last name,age,major,average grade,student ID
1,Peter,Smith,20,History,2.4,fj233
2,Hanna,Jones,23,English,2.8,vc404
3,Tom,Williams,24,Chemistry,1.4,qd119
4,Sarah,Taylor,24,Physics,1.1,pr426
5,Lisa,Brown,22,Engineering,1.0,gx486
6,Steven,Davies,21,Biology,2.2,im401
7,James,Evans,20,Computer Science,2.0,rb231


In [22]:
# create a new column
df["semester"] = [1, 3, 3, 1, 1, 5, 5]

df

Unnamed: 0,first name,last name,age,major,average grade,student ID,semester
1,Peter,Smith,20,History,2.4,fj233,1
2,Hanna,Jones,23,English,2.8,vc404,3
3,Tom,Williams,24,Chemistry,1.4,qd119,3
4,Sarah,Taylor,24,Physics,1.1,pr426,1
5,Lisa,Brown,22,Engineering,1.0,gx486,1
6,Steven,Davies,21,Biology,2.2,im401,5
7,James,Evans,20,Computer Science,2.0,rb231,5


#### Attribute indexing
Pandas also exposes columns as attributes that can be accessed with dot notation. However, this only works for column names that are valid Python identifiers (e.g. "first name" won't work because of the space) and that do not collide with an existing DataFrame attribute or method, so it is usually cleaner to use the bracket notation above.

In [23]:
# access the "age" column with attribute notation
df.age

1    20
2    23
3    24
4    24
5    22
6    21
7    20
Name: age, dtype: int64
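
The following sketch illustrates why bracket notation is the safer choice. It uses a made-up DataFrame (`df_demo`) with a column called "count", which collides with the existing `DataFrame.count()` method.

```python
import pandas as pd

df_demo = pd.DataFrame({
    "age": [20, 23],
    "first name": ["Peter", "Hanna"],
    "count": [1, 2],
})

# Works: "age" is a valid identifier and does not clash with anything
print(df_demo.age.tolist())        # [20, 23]

# Pitfall: df_demo.count is the DataFrame method, not the column
print(callable(df_demo.count))     # True

# Bracket notation always returns the column,
# including names with spaces
print(df_demo["count"].tolist())       # [1, 2]
print(df_demo["first name"].tolist())  # ['Peter', 'Hanna']
```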

#### .loc and .iloc
`.loc` and `.iloc` are specialized pandas indexers that are more flexible than the approaches above and allow selecting rows and columns simultaneously. `.loc` follows the pattern `.loc[rows, columns]` and works on the row and column labels; label slices are **inclusive** of both endpoints.

In [24]:
# print the df again to make clear what happens below
df

Unnamed: 0,first name,last name,age,major,average grade,student ID,semester
1,Peter,Smith,20,History,2.4,fj233,1
2,Hanna,Jones,23,English,2.8,vc404,3
3,Tom,Williams,24,Chemistry,1.4,qd119,3
4,Sarah,Taylor,24,Physics,1.1,pr426,1
5,Lisa,Brown,22,Engineering,1.0,gx486,1
6,Steven,Davies,21,Biology,2.2,im401,5
7,James,Evans,20,Computer Science,2.0,rb231,5


In [25]:
# get the value at index 7 and column "average grade"
df.loc[7, "average grade"]

2.0

In [26]:
# get the columns from "first name" till "age" for index 1-4
df.loc[1:4, "first name":"age"]

Unnamed: 0,first name,last name,age
1,Peter,Smith,20
2,Hanna,Jones,23
3,Tom,Williams,24
4,Sarah,Taylor,24


In [27]:
# get the values at index 1, 5, 7 and columns "age" and "major"
df.loc[[1, 5, 7], ["age", "major"]]

Unnamed: 0,age,major
1,20,History
5,22,Engineering
7,20,Computer Science


`.iloc` (integer-location) in turn works only on the raw positional indices, and does not look at the row or column labels. `.iloc` is therefore similar to the indexing syntax you already know from lists, and it will use 0-based indexing with a non-inclusive stop index.

In [28]:
# get the value at the first row and first column
df.iloc[0, 0]

'Peter'

In [29]:
# get the first two rows and the first three columns
df.iloc[:2, :3]

Unnamed: 0,first name,last name,age
1,Peter,Smith,20
2,Hanna,Jones,23
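
The difference between the two becomes most visible when the same slice is passed to both. The sketch below uses a small made-up DataFrame (`df_idx`) whose index starts at 1, like the student data above.

```python
import pandas as pd

df_idx = pd.DataFrame({"name": list("abcdefg")}, index=[1, 2, 3, 4, 5, 6, 7])

# .loc slices by label and includes both endpoints
by_label = df_idx.loc[1:4]
print(by_label.index.tolist())      # [1, 2, 3, 4]

# .iloc slices by position: 0-based, stop position excluded
by_position = df_idx.iloc[1:4]
print(by_position.index.tolist())   # [2, 3, 4]
```

With the default index that starts at 0, labels and positions coincide, so the inclusive stop of `.loc` is then the only visible difference between the two.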


#### Boolean indexing
`[]`, `.loc`, and `.iloc` also accept boolean lists or arrays for indexing (`[]` and `.loc` additionally accept boolean Series). For `[]`, the boolean sequence must have one entry per row of the DataFrame, and the result contains the rows where the entry is `True`.

In [30]:
df

Unnamed: 0,first name,last name,age,major,average grade,student ID,semester
1,Peter,Smith,20,History,2.4,fj233,1
2,Hanna,Jones,23,English,2.8,vc404,3
3,Tom,Williams,24,Chemistry,1.4,qd119,3
4,Sarah,Taylor,24,Physics,1.1,pr426,1
5,Lisa,Brown,22,Engineering,1.0,gx486,1
6,Steven,Davies,21,Biology,2.2,im401,5
7,James,Evans,20,Computer Science,2.0,rb231,5


In [31]:
bool_index = [True, True, False, True, False, False, False]

df[bool_index]

Unnamed: 0,first name,last name,age,major,average grade,student ID,semester
1,Peter,Smith,20,History,2.4,fj233,1
2,Hanna,Jones,23,English,2.8,vc404,3
4,Sarah,Taylor,24,Physics,1.1,pr426,1


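`.loc` and `.iloc` behave in a similar way, but they additionally accept a boolean sequence that selects columns. For plain boolean lists or arrays, `.loc` and `.iloc` return the same results and are interchangeable. A boolean Series, however, only works with `.loc`, because `.loc` aligns it on the labels while `.iloc` rejects it.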

In [32]:
row_index = [True, True, False, False, False, False, True]
col_index = [False, False, False, False, True, True, True]

# for plain boolean lists, .loc and .iloc give the same result
display(df.loc[row_index, col_index], df.iloc[row_index, col_index])

Unnamed: 0,average grade,student ID,semester
1,2.4,fj233,1
2,2.8,vc404,3
7,2.0,rb231,5


Unnamed: 0,average grade,student ID,semester
1,2.4,fj233,1
2,2.8,vc404,3
7,2.0,rb231,5


Boolean indices are very useful for filtering data based on specific conditions, and we will show more of that in the next section.
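
As a brief preview, a boolean index rarely has to be written by hand: comparing a column with a value is itself a vectorized operation and returns a boolean Series. The sketch below uses a shortened, illustrative version of the student data, and the threshold of 21 is an arbitrary choice.

```python
import pandas as pd

students = pd.DataFrame(
    {
        "first name": ["Peter", "Hanna", "Tom", "Sarah"],
        "age": [20, 23, 24, 24],
        "major": ["History", "English", "Chemistry", "Physics"],
    },
    index=[1, 2, 3, 4],
)

# The comparison returns a boolean Series with the same index
mask = students["age"] > 21
print(mask.tolist())           # [False, True, True, True]

# Use the mask with [] to keep only the matching rows
older = students[mask]
print(older.index.tolist())    # [2, 3, 4]

# .loc can combine the row mask with a column selection
subset = students.loc[mask, ["first name", "age"]]

# .iloc needs a plain boolean array, so convert the Series first
positional = students.iloc[mask.to_numpy()]
print(positional.equals(older))   # True
```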