## Lecture Notes: Pandas Basics

____

Goals:

Introduce Pandas and its key concepts. By the end, you should be able to:

* Create and manipulate Pandas Series and DataFrames
* Use essential functions to filter, sort and summarize data
* Differentiate between .iloc and .loc for indexing and selecting data
* Handle real-world datasets using CSV files

____

**Create a new environment**

Begin by creating a new environment; let's call it *databehandling*.

In your terminal, type the following:

        conda create --name databehandling python=3.13

After it's created, **activate** it (`conda activate databehandling`) and run the following to install some necessary libraries for this course:

        pip install numpy pandas openpyxl ipykernel seaborn plotly_express nbformat

In [2]:
import random

import numpy as np
import pandas as pd

---

## Pandas Series

A pandas Series is a one-dimensional, array-like object that can hold any data type (integers, floats, strings, etc.).

**Creating a Series**

You can create a Series from e.g., a list, an array or a dictionary.

In [3]:
# create a Series using a list

number_list = [x for x in range(30, 40)]

print(number_list)

[30, 31, 32, 33, 34, 35, 36, 37, 38, 39]


In [4]:
my_first_series = pd.Series(number_list)  # creates a Series-object from the provided list

my_first_series

0    30
1    31
2    32
3    33
4    34
5    35
6    36
7    37
8    38
9    39
dtype: int64

In [5]:
isinstance(my_first_series, pd.Series)

True
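
The list-based example above covers one of the three sources mentioned. As a minimal sketch of the other two (the names `array_series` and `dict_series`, and the fruit values, are made up for illustration), a Series can also be built from a NumPy array or a dictionary:

```python
import numpy as np
import pandas as pd

# from a NumPy array: the index defaults to 0, 1, 2, ...
array_series = pd.Series(np.array([1.5, 2.5, 3.5]))

# from a dictionary: the keys become the index labels
# (illustrative values, not part of the course data)
dict_series = pd.Series({'apples': 3, 'pears': 5, 'plums': 2})

print(array_series)
print(dict_series)
print(dict_series['pears'])   # look up a value by its label
```

Note that the dictionary keys end up as the index, so values can be retrieved by label rather than by position.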

Here are some basic class methods for Series:

In [11]:
print(f'Minimum value of the Series      : {my_first_series.min()}')
print(f'Maximum value of the Series      : {my_first_series.max()}')
print(f'Mean of the Series               : {my_first_series.mean()}')
print(f'Standard deviation of the Series : {my_first_series.std()}')

Minimum value of the Series      : 30
Maximum value of the Series      : 39
Mean of the Series               : 34.5
Standard deviation of the Series : 3.0276503540974917
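
One detail worth knowing about the standard deviation printed above: pandas computes the *sample* standard deviation by default (`ddof=1`, dividing by n - 1), whereas NumPy's `np.std` defaults to the population version (`ddof=0`). The sketch below, using illustrative variable names, compares the two on the same numbers:

```python
import numpy as np
import pandas as pd

values = pd.Series(range(30, 40))

sample_std = values.std()                # pandas default: ddof=1 (divide by n - 1)
population_std = values.std(ddof=0)      # divide by n instead
numpy_std = np.std(values.to_numpy())    # NumPy default: ddof=0

print(sample_std)
print(population_std)
print(numpy_std)
```

If results from pandas and NumPy differ slightly, the `ddof` setting is a likely cause.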


**Interaction with other Series objects**

How does a Series object interact with other objects of the same class under e.g., addition or multiplication?

In [16]:
list_one = [1,2,3]
list_two = [5,6,7]

list_one + list_two 

[1, 2, 3, 5, 6, 7]

In [18]:
series_one = pd.Series(list_one)
series_two = pd.Series(list_two)

# elementwise addition
series_one + series_two

# NOTE: returns a new Series-type object!

0     6
1     8
2    10
dtype: int64

In [25]:
# elementwise multiplication
series_one*series_two

0     5
1    12
2    21
dtype: int64

In [23]:
# elementwise division
series_one/series_two

0    0.200000
1    0.333333
2    0.428571
dtype: float64

In [27]:
# elementwise exponentiation
series_one**series_two

0       1
1      64
2    2187
dtype: int64
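
The examples above line up neatly because both Series share the default index 0, 1, 2. In general, arithmetic between Series matches values by index *label*, not by position. A small sketch with made-up labels shows what happens when the indices only partly overlap:

```python
import pandas as pd

left = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
right = pd.Series([10, 20, 30], index=['b', 'c', 'd'])

# values are matched by label; labels present in only one Series give NaN
total = left + right
print(total)

# the .add() method accepts fill_value to treat a missing label as 0
filled = left.add(right, fill_value=0)
print(filled)
```

Because NaN is a float, the result of `left + right` is converted to `float64` even though both inputs hold integers.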

We can extract individual elements from a Series by simple indexing

In [32]:
my_first_series[1]

np.int64(31)

We can also extract multiple elements simultaneously

In [36]:
my_first_series[:4]

0    30
1    31
2    32
3    33
dtype: int64
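
The plain square-bracket lookups above work smoothly because the Series has the default integer index. When the index holds other labels, how an integer key is interpreted may depend on the pandas version, so being explicit is a safer habit. A minimal sketch, with an illustrative Series called `prices`:

```python
import pandas as pd

prices = pd.Series([30, 31, 32], index=['x', 'y', 'z'])

by_label = prices['y']           # label-based lookup
by_position = prices.iloc[0]     # position-based lookup
first_two = prices.iloc[:2]      # positional slice, end excluded

print(by_label)
print(by_position)
print(first_two)
```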

Read more about these operations and methods in the documentation!

---

## Pandas DataFrame

A DataFrame is a 2D labeled data structure in Pandas, similar to a table or a spreadsheet.

Each column can hold a different type of data (integers, floats, strings, etc.).

Let's create our first DataFrame.

In [57]:
names = ['Amir', 'Sawash', 'Rozann', 'Ali']
ages = [2.5, 5, 31, 35]
eye_colors = ['blue', 'brown', 'green', 'brown']

# we can easily create DataFrames from dictionaries

family_dict = {'name': names, 
               'age': ages, 
               'eye color': eye_colors}

family_dict


{'name': ['Amir', 'Sawash', 'Rozann', 'Ali'],
 'age': [2.5, 5, 31, 35],
 'eye color': ['blue', 'brown', 'green', 'brown']}

In [58]:
# the _df suffix is a common naming convention signalling that the object is a DataFrame

family_df = pd.DataFrame(family_dict)

family_df

     name   age  eye color
0    Amir   2.5       blue
1  Sawash   5.0      brown
2  Rozann  31.0      green
3     Ali  35.0      brown


In [60]:
isinstance(family_df, pd.DataFrame)

True

In [61]:
# an important method is .info(); it gives us general metadata about the contents of the DataFrame
# note that Dtype "object" usually signifies a string column or a mixed-type column (e.g., strings and integers)

family_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   name       4 non-null      object 
 1   age        4 non-null      float64
 2   eye color  4 non-null      object 
dtypes: float64(1), object(2)
memory usage: 228.0+ bytes
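
Besides `.info()`, a few other attributes and methods give a quick overview of a DataFrame. The sketch below rebuilds the family data from above so that it runs on its own and shows `.shape`, `.dtypes` and `.describe()`:

```python
import pandas as pd

family_df = pd.DataFrame({
    'name': ['Amir', 'Sawash', 'Rozann', 'Ali'],
    'age': [2.5, 5, 31, 35],
    'eye color': ['blue', 'brown', 'green', 'brown'],
})

print(family_df.shape)    # (number of rows, number of columns)
print(family_df.dtypes)   # data type of each column

# summary statistics; by default only numeric columns are included
summary = family_df.describe()
print(summary)
```

`.describe()` returns a DataFrame itself, so individual statistics can be picked out with, e.g., `summary.loc['mean', 'age']`.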


We can select specific columns from our DataFrame using the column names


In [62]:
# note that the returned column is given as a Series object!

family_df['name']

0      Amir
1    Sawash
2    Rozann
3       Ali
Name: name, dtype: object

We can also extract individual columns by calling on their names as attributes

In [56]:
family_df.name

0      Amir
1    Sawash
2    Rozann
3       Ali
Name: name, dtype: object

In [66]:
# note that accessing columns as attributes is error-prone,
# for instance when the column names contain blank spaces


family_df.eye color

SyntaxError: invalid syntax (2509278268.py, line 5)

In [67]:
family_df['eye color']

0     blue
1    brown
2    green
3    brown
Name: eye color, dtype: object

We can index several columns at once by passing a list of column names to the DataFrame

In [70]:
family_df[['name', 'age']]

     name   age
0    Amir   2.5
1  Sawash   5.0
2  Rozann  31.0
3     Ali  35.0
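
A subtle point in the selections above is the return type: a single column name gives back a Series, while a list of names gives back a DataFrame, even when the list contains only one name. A short sketch (variable names are illustrative):

```python
import pandas as pd

family_df = pd.DataFrame({
    'name': ['Amir', 'Sawash', 'Rozann', 'Ali'],
    'age': [2.5, 5, 31, 35],
})

as_series = family_df['age']      # single brackets: a Series
as_frame = family_df[['age']]     # double brackets: a one-column DataFrame

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)
```

This matters when a later step expects a two-dimensional object, or when you want to keep the column header in the output.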


---

## Further indexing

In Pandas, selecting specific rows and columns is essential for analyzing data. Pandas offers two primary methods to do this:

.iloc[] # integer location (position-based)

.loc[]  # location (label-based)

In [72]:
family_df

     name   age  eye color
0    Amir   2.5       blue
1  Sawash   5.0      brown
2  Rozann  31.0      green
3     Ali  35.0      brown


In [73]:
family_df[0]    # this does NOT work for a DataFrame

KeyError: 0
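
The error above arises because plain square brackets on a DataFrame look for a *column* with that label, and there is no column called `0`. As a minimal sketch (using a shortened version of the family data), the error can be caught and the first row fetched by position instead:

```python
import pandas as pd

family_df = pd.DataFrame({'name': ['Amir', 'Sawash'], 'age': [2.5, 5]})

try:
    family_df[0]                  # searches the column labels for 0
except KeyError as error:
    message = f'KeyError: {error}'
    print(message)

first_row = family_df.iloc[0]     # first row, selected by position
print(first_row)
```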

In [75]:
# let's create some new data
data = {
        'Name' : ['Alice', 'Bob', 'Charlie', 'David', 'Eva', 'Frank', 'Grace', 'Helen', 'Ian'],
        'Age' : [25, 30, 35, 40, 22, 29, 28, 22, 31], 
        'Salary': [50000, 60000, 70000, 80000, 52000, 62000, 75000, 55000, 100000]
        }

df = pd.DataFrame(data)

df

      Name  Age  Salary
0    Alice   25   50000
1      Bob   30   60000
2  Charlie   35   70000
3    David   40   80000
4      Eva   22   52000
5    Frank   29   62000
6    Grace   28   75000
7    Helen   22   55000
8      Ian   31  100000


Before proceeding, remember that a single column is a Series, so all Series methods become available once we select a column

In [82]:
print(df['Salary'].mean().round(3))

67111.111
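
The same pattern covers the sorting and summarizing mentioned in the goals. The sketch below rebuilds the data so that it runs on its own and uses `median()`, `idxmax()` and `sort_values()`; the variable names are illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva', 'Frank', 'Grace', 'Helen', 'Ian'],
    'Age': [25, 30, 35, 40, 22, 29, 28, 22, 31],
    'Salary': [50000, 60000, 70000, 80000, 52000, 62000, 75000, 55000, 100000],
})

median_age = df['Age'].median()          # middle value of the Age column
top_earner_row = df['Salary'].idxmax()   # index label of the highest salary

# sort the whole DataFrame by Salary, highest first, and keep the top three rows
top_three = df.sort_values('Salary', ascending=False).head(3)

print(median_age)
print(top_earner_row)
print(top_three)
```

`sort_values()` returns a new, sorted DataFrame and leaves `df` unchanged.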


**.iloc[] position-based indexing**

Use .iloc[] to select data based on the position of rows and columns

The general syntax is

.iloc[row_indexer, column_indexer]

In [91]:
# if we provide only one index, it'll be understood by Pandas to be the row index

df.iloc[0]   # select the first row

Name      Alice
Age          25
Salary    50000
Name: 0, dtype: object

In [93]:
# we can use slicing to select multiple rows

df.iloc[1:5]   # rows 1 to 4 (5 not included)

      Name  Age  Salary
1      Bob   30   60000
2  Charlie   35   70000
3    David   40   80000
4      Eva   22   52000


In [94]:
my_slice_df = df.iloc[1:5]

my_slice_df

      Name  Age  Salary
1      Bob   30   60000
2  Charlie   35   70000
3    David   40   80000
4      Eva   22   52000


Let's now also give the column index

In [99]:
df

      Name  Age  Salary
0    Alice   25   50000
1      Bob   30   60000
2  Charlie   35   70000
3    David   40   80000
4      Eva   22   52000
5    Frank   29   62000
6    Grace   28   75000
7    Helen   22   55000
8      Ian   31  100000


In [101]:
print(df.iloc[3, 2])  # select the value at row position 3 and column position 2 (Salary)

print(df.iloc[6, 1])  # select the value at row position 6 and column position 1 (Age)

80000
28


In [102]:
df

      Name  Age  Salary
0    Alice   25   50000
1      Bob   30   60000
2  Charlie   35   70000
3    David   40   80000
4      Eva   22   52000
5    Frank   29   62000
6    Grace   28   75000
7    Helen   22   55000
8      Ian   31  100000


In [104]:
df.iloc[1:4, 0:2]   # rows 1 to 3 (4 not included), columns 0 and 1 (2 not included)

      Name  Age
1      Bob   30
2  Charlie   35
3    David   40


In [None]:
df.iloc[2:4, 1:]   # 2:4 means rows 2 and 3 (4 not included), 1: means all columns from column 1 

   Age  Salary
2   35   70000
3   40   80000


In [109]:
df.iloc[1:4, :]   # : means all columns

      Name  Age  Salary
1      Bob   30   60000
2  Charlie   35   70000
3    David   40   80000


We can also pass lists of row and/or column positions that we want to select

In [110]:
df

      Name  Age  Salary
0    Alice   25   50000
1      Bob   30   60000
2  Charlie   35   70000
3    David   40   80000
4      Eva   22   52000
5    Frank   29   62000
6    Grace   28   75000
7    Helen   22   55000
8      Ian   31  100000


In [116]:
df.iloc[[1, 2, 7], [1, 2]]

   Age  Salary
1   30   60000
2   35   70000
7   22   55000
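
Since `.iloc[]` follows the usual Python conventions for positions, negative positions and step sizes work as they do for lists. A small sketch on a shortened version of the data (variable names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Age': [25, 30, 35, 40, 22],
    'Salary': [50000, 60000, 70000, 80000, 52000],
})

last_row = df.iloc[-1]             # negative positions count from the end
every_second = df.iloc[::2]        # step of 2: positions 0, 2 and 4
tail_salaries = df.iloc[-3:, -1]   # last three rows of the last column

print(last_row)
print(every_second)
print(tail_salaries)
```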


**.loc[] label-based indexing**

Use .loc[] to select data based on labels (row index and column names)

In [123]:
data = {
        'Name' : ['Alice', 'Bob', 'Charlie', 'David', 'Eva', 'Frank', 'Grace', 'Helen', 'Ian'],
        'Age' : [25, 30, 35, 40, 22, 29, 28, 22, 31], 
        'Salary': [50000, 60000, 70000, 80000, 52000, 62000, 75000, 55000, 100000]
        }

df = pd.DataFrame(data, index=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i'])


df

      Name  Age  Salary
a    Alice   25   50000
b      Bob   30   60000
c  Charlie   35   70000
d    David   40   80000
e      Eva   22   52000
f    Frank   29   62000
g    Grace   28   75000
h    Helen   22   55000
i      Ian   31  100000


Note that even though we have a custom index, .iloc[] still works

In [124]:
df.iloc[1:4, 0:2]

      Name  Age
b      Bob   30
c  Charlie   35
d    David   40


The syntax for .loc is the same as for .iloc, but instead of using integer indices, we use labels.

The general syntax is 

.loc[row_label, column_label]

In [126]:
df.loc['e']   # select the row with label 'e'

Name        Eva
Age          22
Salary    52000
Name: e, dtype: object

In [None]:
print(df.loc['d', 'Salary'])  # select the element at row 'd' and column 'Salary'

80000


In [130]:
df.loc[['a', 'c', 'e'], ['Name', 'Salary']]   # select rows 'a', 'c', 'e' and columns 'Name', 'Salary'

      Name  Salary
a    Alice   50000
c  Charlie   70000
e      Eva   52000


In [None]:
df.loc['a':'d']  # label slicing includes BOTH endpoints ('d' is returned), unlike .iloc; it relies on pandas being able to locate the labels unambiguously in the index

      Name  Age  Salary
a    Alice   25   50000
b      Bob   30   60000
c  Charlie   35   70000
d    David   40   80000
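
The difference between the two slicing styles is easy to miss: a `.loc[]` slice includes its end label, whereas an `.iloc[]` slice excludes its end position. A minimal side-by-side sketch on a small, illustrative DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'Age': [25, 30, 35, 40]}, index=['a', 'b', 'c', 'd'])

by_label = df.loc['a':'c']      # label slice: 'c' IS included
by_position = df.iloc[0:2]      # positional slice: position 2 is NOT included

print(by_label)
print(by_position)
```

Here `by_label` has three rows ('a', 'b', 'c') while `by_position` has two ('a', 'b').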


---

## Masking

Masking is a powerful feature in Pandas that allows you to filter data based on certain conditions.

Masking is often used to filter data, perform calculations or create subsets of data for further analysis.

In [132]:
family_df

     name   age  eye color
0    Amir   2.5       blue
1  Sawash   5.0      brown
2  Rozann  31.0      green
3     Ali  35.0      brown


In [135]:
# the above df has 4 rows, so let's create a list of booleans of the same size

my_mask = [True, False, True, False]

family_df[my_mask]         # the rows in which the mask is True are returned

     name   age  eye color
0    Amir   2.5       blue
2  Rozann  31.0      green


In [142]:
# we have extreme flexibility here, and can use any condition we want to create our mask

eye_color_mask = [color == 'brown' for color in family_df['eye color']]

print(eye_color_mask)

family_df[eye_color_mask]  # the rows where eye color is brown are returned

[False, True, False, True]


     name   age  eye color
1  Sawash   5.0      brown
3     Ali  35.0      brown


In [146]:
family_df['eye color'] == 'brown'    # elementwise comparison, returns a Series of booleans

0    False
1     True
2    False
3     True
Name: eye color, dtype: bool

In [147]:
my_color_mask = family_df['eye color'] == 'brown'

family_df[my_color_mask]

     name   age  eye color
1  Sawash   5.0      brown
3     Ali  35.0      brown


In [None]:
family_df[ family_df['eye color']=='brown' ]

     name   age  eye color
1  Sawash   5.0      brown
3     Ali  35.0      brown
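
Masks can also be combined. The sketch below rebuilds the family data and joins conditions with `&` (and), `|` (or) and `~` (not); the age threshold of 18 and the variable names are chosen only for illustration. Each condition needs its own parentheses, because these operators bind more tightly than comparisons such as `==`.

```python
import pandas as pd

family_df = pd.DataFrame({
    'name': ['Amir', 'Sawash', 'Rozann', 'Ali'],
    'age': [2.5, 5, 31, 35],
    'eye color': ['blue', 'brown', 'green', 'brown'],
})

is_brown = family_df['eye color'] == 'brown'
is_adult = family_df['age'] >= 18

brown_adults = family_df[is_brown & is_adult]        # both conditions hold
brown_or_adult = family_df[is_brown | is_adult]      # at least one holds

# .loc accepts a mask for the rows and a label for the column
not_brown_names = family_df.loc[~is_brown, 'name']   # ~ negates the mask

print(brown_adults)
print(brown_or_adult)
print(not_brown_names)
```

Python's own `and`, `or` and `not` keywords do not work elementwise on a Series, which is why the symbol operators are used here.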
