## Lecture Notes: Pandas Basics

____

Goals:

Introduce Pandas and its key concepts. By the end, you should be able to:

* Create and manipulate Pandas Series and DataFrames
* Use essential functions to filter, sort and summarize data
* Differentiate between .iloc and .loc for indexing and selecting data
* Handle real-world datasets using CSV files

____

**Create a new environment**

Begin by creating a new environment, let's call it *databehandling*.

In your terminal, type the following:

        conda create --name databehandling python=3.12

After it's created, activate it (`conda activate databehandling`) and run the following to install the libraries needed for this course:

        pip install numpy pandas openpyxl ipykernel seaborn plotly_express nbformat

In [1]:
import random

import numpy as np
import pandas as pd

_____

## Pandas Series

A Pandas Series is a one-dimensional, array-like object that can hold any data type (integers, floats, strings, etc.).

**Creating a Series**

You can create a Series from a list, an array, or a dictionary.

In [None]:
# create a Series using a list

numbers_list = [x for x in range(30,40)]
print(numbers_list)

In [3]:
my_first_series = pd.Series(numbers_list) # create a Series from the list

In [None]:
my_first_series

In [None]:
isinstance(my_first_series, pd.Series)
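The cells above build a Series from a list. As a minimal sketch of the other two options mentioned, the example below builds one from a NumPy array and one from a dictionary. The fruit names and prices are made-up illustration data, not part of the course material.

```python
import numpy as np
import pandas as pd

# From a NumPy array: the default index is 0, 1, 2, ...
array_series = pd.Series(np.array([1.5, 2.5, 3.5]))

# From a dictionary: the keys become the index labels
# (hypothetical data, only for illustration)
price_series = pd.Series({"apple": 10, "banana": 5, "cherry": 25})

print(array_series)
print(price_series)
print(price_series["banana"])  # look up a value by its label
```

Note that the dictionary keys end up as the index, which is why `price_series["banana"]` works as a lookup.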

Here are some basic methods of a Series:

In [None]:
print(f'Minimum value of the Series : {my_first_series.min()}')
print(f'Maximum value of the Series : {my_first_series.max()}')
print(f'Mean value of the Series    : {my_first_series.mean()}')


**Interaction with other series**

How does a Series object interact with other Series objects under, e.g., addition or multiplication?

In [None]:
list_one = [1,2,3]
list_two = [5,6,7]

list_one + list_two

In [None]:
series_one = pd.Series(list_one)
series_two = pd.Series(list_two)

# elementwise addition
series_one + series_two

In [None]:
# elementwise multiplication
series_one * series_two

In [None]:
# elementwise division
series_one / series_two
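The Series above share the same default index (0, 1, 2), so elements pair up by position. As a small sketch of what happens otherwise, the example below uses two Series with made-up, partly different labels: Pandas matches elements by label, and labels that exist in only one Series produce `NaN`.

```python
import pandas as pd

# Hypothetical Series whose labels only partly overlap
left = pd.Series([1, 2, 3], index=["a", "b", "c"])
right = pd.Series([10, 20, 30], index=["b", "c", "d"])

# Addition is aligned on the index labels, not on position
total = left + right

print(total)
```

Only the labels `b` and `c` exist in both Series, so only those two rows get a numeric sum.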

We can extract individual elements from a Series by simple indexing.

In [None]:
my_first_series[1]

We can also extract multiple elements from a Series by slicing.

In [None]:
my_first_series[:4]

Read more about these operations and methods in the documentation!

____

## Pandas DataFrame

A DataFrame is a 2D labeled data structure in Pandas, similar to a table or a spreadsheet.

Each column can hold different types of data (integers, floats, strings etc).

Let's create our first DataFrame

In [None]:
names = ['Amir', 'Sawash', 'Rozann', 'Ali']
age = [1.5, 4, 28, 34]
eye_color = ['blue', 'brown', 'green', 'brown']

# we can very easily create DataFrames from dictionaries

family_dict = {'name': names, 
               'age': age, 
               'eye color': eye_color}

family_dict

In [None]:
# _df is standard naming convention to signify that it's a DataFrame

family_df = pd.DataFrame(family_dict) # create a DataFrame from the dictionary

family_df

In [None]:
isinstance(family_df, pd.DataFrame)

In [None]:
# an important method is .info(), it gives us general meta-data about the contents of the dataframe
# note that Dtype "object" signifies either "string" or mixed type column (e.g., strings and integers)

family_df.info()

We can select specific columns from our DataFrame using the column names.

In [None]:
# Note that the returned column is given as a Series object!

family_df['name']

We can also get to individual columns by calling on their names as attributes.

In [None]:
family_df.name

In [None]:

# note that accessing columns as attributes is error-prone, e.g., when a column name contains blank spaces
# the line below is intentionally invalid: Python raises a SyntaxError
family_df.eye color

In [None]:
family_df['eye color']

We can index several columns at once by passing a list of column names to the DataFrame.

In [None]:
family_df[['name', 'eye color']]
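One detail worth seeing side by side is the type that comes back. The sketch below rebuilds a small version of the family table and compares selecting with a string against selecting with a one-element list.

```python
import pandas as pd

family_df = pd.DataFrame({
    "name": ["Amir", "Sawash", "Rozann", "Ali"],
    "eye color": ["blue", "brown", "green", "brown"],
})

single = family_df["name"]      # a string selects one column and returns a Series
as_frame = family_df[["name"]]  # a one-element list returns a DataFrame with one column

print(type(single).__name__)
print(type(as_frame).__name__)
print(as_frame.shape)
```

This matters later on, since Series and DataFrames do not offer exactly the same methods.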

_____

## Further indexing

In Pandas, selecting specific rows and columns is essential for analyzing data. Pandas offers two primary methods to do this:

.iloc and .loc

In [None]:
family_df[0] # raises a KeyError: df[...] looks for a column labeled 0, not for row 0

In [None]:
# let's create some new data

data = {
        'Name' : ['Alice', 'Bob', 'Charlie', 'David', 'Eva', 'Frank', 'Grace', 'Helen', 'Ian'],
        'Age' : [25, 30, 35, 40, 22, 29, 28, 22, 31], 
        'Salary': [50000, 60000, 70000, 80000, 52000, 62000, 75000, 55000, 100000]
        }

df = pd.DataFrame(data)

df

Before proceeding, note that selecting a single column returns a Series, so all Series methods are available on it.

In [None]:
print(df['Salary'].mean())

**.iloc[] - Position-based Indexing**

Use .iloc[] to select data based on the position of rows and columns.

The general syntax is

.iloc[row_indexer, column_indexer]


In [None]:
# if we only provide one index, it'll be understood that it is the row index

df.iloc[0]   # select the first row

In [None]:
# we can use slicing to select multiple rows

df.iloc[1:5] # rows 1 to 4 (5 not included)

In [None]:
my_slice_df = df.iloc[1:5]

my_slice_df

In [None]:
print(df.iloc[0, 2]) # select the value in the 0th row and 2nd column
print(df.iloc[5, 1]) # select the value in the 5th row and 1st column

We can also give slices for both rows and columns.

In [None]:
df.iloc[0:3, 0:2] # rows 0 to 2 (3 not included), columns 0 to 1 (2 not included)

In [None]:
df.iloc[2:4, 1:] # 2:4 means rows 2 to 3 (4 not included), 1: means all columns from column 1 

In [None]:
df.iloc[1:4, :] # : means all columns

We can also give lists of the row and column positions we want to select.


In [None]:
df.iloc[[0, 2, 4], [0, 2]] # rows 0, 2, 4, columns 0, 2

**.loc[] - Label-based Indexing**

Use .loc[] to select data based on labels (row index or column names).

In [None]:
data = {
        'Name' : ['Alice', 'Bob', 'Charlie', 'David', 'Eva', 'Frank', 'Grace', 'Helen', 'Ian'],
        'Age' : [25, 30, 35, 40, 22, 29, 28, 22, 31], 
        'Salary': [50000, 60000, 70000, 80000, 52000, 62000, 75000, 55000, 100000]
        }

df = pd.DataFrame(data, index=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i']) # we can provide our own index here, if we'd like

df

Note that even though we have a custom index, .iloc still works

In [None]:
df.iloc[1:4, 0:2]

The syntax for .loc is the same as for .iloc, but instead of integer positions, we use labels.

The general syntax is 

.loc[row_label, column_label]

In [None]:
df.loc['e'] # select the row with label 'e'

In [None]:
print(df.loc['b', 'Age']) # select the element at row 'b' and column 'Age'

In [None]:
df.loc[['a', 'c', 'e'], ['Name', 'Salary']] # select rows 'a', 'c', 'e' and columns 'Name', 'Salary'

In [None]:
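# label slicing includes BOTH endpoints ('a', 'b' and 'c'), unlike .iloc, which excludes the stop position
# it also relies on the index labels (e.g., both labels existing in the index), whereas positional slicing with .iloc always works
df.loc['a':'c']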
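To make the difference between the two slicing rules concrete, here is a minimal sketch on a shortened version of the table. Positional slices exclude the stop position, while label slices include the stop label.

```python
import pandas as pd

df = pd.DataFrame(
    {"Name": ["Alice", "Bob", "Charlie", "David"], "Age": [25, 30, 35, 40]},
    index=["a", "b", "c", "d"],
)

by_position = df.iloc[0:2]  # positions 0 and 1; the stop position 2 is excluded
by_label = df.loc["a":"c"]  # labels 'a', 'b' and 'c'; the stop label is included

print(by_position)
print(by_label)
```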

_____

## Masking

Masking is a powerful feature in Pandas that allows you to filter data based on certain conditions.

Masking is often used to filter data, perform calculations or create subsets of data for further analysis.

In [None]:
family_df

In [None]:
# the above df has 4 rows, so let's create a list of booleans of the same size

my_mask = [True, False, True, False]

family_df[my_mask] # the rows in which the mask is True are returned

In [None]:
# we have extreme flexibility here, and can use any condition we want to create our mask

eye_color_mask = [color == 'brown' for color in family_df['eye color']]
print(eye_color_mask)

family_df[eye_color_mask]

Series and DataFrames have built-in support for creating masks

In [None]:
family_df['eye color'] == 'brown' # elementwise comparison, returns a Series of booleans

In [None]:
my_color_mask = family_df['eye color'] == 'brown'

family_df[my_color_mask]

In [None]:
# if we feel very confident, we can create and provide the mask directly in the index (not recommended though)

family_df[family_df['eye color'] == 'brown']

In [None]:
family_df[family_df['age'] > 25]

In [None]:
age_mask = family_df['age'] > 25

adults_df = family_df[age_mask]

adults_df

In [124]:
adults_df = adults_df.reset_index(drop=True) # the selected rows kept their original indices 2 and 3; this renumbers them 0 and 1

In [None]:
adults_df

We can combine filters too!

Only rows that satisfy all masks will then be returned.

In [None]:
print(age_mask)
print(eye_color_mask)

In [None]:
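# use the Series mask (my_color_mask) rather than the plain list, so both operands of & are Pandas objects
family_df[age_mask & my_color_mask] # only the last row satisfies both masks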

Just a little bit more fun

In [None]:
short_names_mask = [len(name) < 4 for name in df['Name']]

df[short_names_mask]

BTW, you can also NEGATE masks - i.e., we can get the opposite of the mask

In [None]:
family_df[~age_mask]
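Besides `&` (and) and `~` (not), masks can be combined with `|` (or). The sketch below, on a rebuilt copy of the family table, shows `|` with conditions written inline and the `.isin()` method for matching against several allowed values. Note the parentheses around each condition, which are needed because `|` and `&` bind tighter than comparisons in Python.

```python
import pandas as pd

family_df = pd.DataFrame({
    "name": ["Amir", "Sawash", "Rozann", "Ali"],
    "age": [1.5, 4, 28, 34],
    "eye color": ["blue", "brown", "green", "brown"],
})

# | means "or"; each condition needs its own parentheses
young_or_green = family_df[(family_df["age"] < 5) | (family_df["eye color"] == "green")]

# .isin() checks membership in a list of allowed values
blue_or_green = family_df[family_df["eye color"].isin(["blue", "green"])]

print(young_or_green)
print(blue_or_green)
```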

_____

## Read Excel

In [None]:
calories_df = pd.read_excel('../data/calories.xlsx')

calories_df.head() # shows the first 5 rows of the DataFrame

In [None]:
calories_df.head(10) # default is 5, but we can change it

In [None]:
calories_df.tail() # shows the last 5 rows by default, but this can be changed as well

In [None]:
calories_df.info()

In [None]:
# see how many unique values a given column has

calories_df['FoodCategory'].nunique() # nunique stands for number of unique

In [None]:
# print the unique values of a given column

print(calories_df['FoodCategory'].unique())

In [None]:
# see the number of times each value appears

calories_df['FoodCategory'].value_counts()

In [None]:
calories_df.iloc[224]

In [None]:
calories_df.iloc[224:229, 3:5]

In [None]:
calories_df[calories_df['FoodCategory']=='FastFood']
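The same inspection methods work on data read from CSV files with `pd.read_csv`. The sketch below is self-contained: it reads from an in-memory string instead of a file on disk, and the rows are made-up values that only borrow the column names used above.

```python
import io

import pandas as pd

# Made-up CSV text standing in for a file on disk (hypothetical rows)
csv_text = """FoodCategory,FoodItem,Cals_per100grams
Fruits,Apple,52 cal
Fruits,Banana,89 cal
FastFood,Cheeseburger,263 cal
"""

# With a real file, pass its path instead of the StringIO object
foods_df = pd.read_csv(io.StringIO(csv_text))

print(foods_df.head())
print(foods_df["FoodCategory"].value_counts())
```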

_____

## Rename columns

In [None]:
calories_df

In [163]:
calories_df.rename(columns={"FoodItem":"Food"}, inplace=True) # inplace=True means that the changes are made to the original DataFrame

In [None]:
calories_df

____

## A bit of data cleaning

We will very often need to handle and manipulate data in DataFrames, e.g., rename columns, change element values, create new columns, and handle missing data.

In [None]:
calories_df.head()

In [None]:
# we can index elements in a string Series by using .str[]

calories_df['Cals_per100grams'].str[:4]

In [None]:
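# we can convert the datatype to int (if all elements allow it)
# note: a fixed-width slice like [:2] truncates longer numbers, e.g., '100 cal' would become 10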

calories_df['Cals_per100grams'].str[:2].astype(int)

In [188]:
# take the first whitespace-separated token (the number) and convert it to int
calories_df['cals/100g in integers'] = calories_df['Cals_per100grams'].str.split().str[0].astype(int)

In [None]:
calories_df
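Another way to pull the number out of such strings is a regular expression via `.str.extract`. This is a sketch on made-up values in the same "number, space, unit" style, not the course dataset; it assumes every string starts with digits.

```python
import pandas as pd

# Hypothetical values in the same "<number> cal" style as the column above
cals = pd.Series(["62 cal", "100 cal", "1050 cal"])

# Capture the first run of digits, whatever its length, then convert to int
cals_int = cals.str.extract(r"(\d+)", expand=False).astype(int)

print(cals_int.tolist())
```

Unlike a fixed-width slice, this keeps working when the numbers have different lengths.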

_____

## Sort dataframe

In [None]:
calories_df.sort_values(by='cals/100g in integers') # by default, ascending=True

In [None]:
calories_df.sort_values(by='cals/100g in integers', ascending=False)
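`sort_values` also accepts a list of columns, with one sort direction per column. The sketch below uses a shortened version of the salary table from earlier: rows are ordered by age, and ties in age are broken by salary in descending order.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Eva", "Helen", "Bob"],
    "Age": [25, 22, 22, 30],
    "Salary": [50000, 52000, 55000, 60000],
})

# Sort by Age ascending; within equal ages, sort by Salary descending
sorted_df = df.sort_values(by=["Age", "Salary"], ascending=[True, False])

print(sorted_df)
```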

In [None]:
calories_df.iloc[1621]

In [None]:
calories_df

____

## Read Excel with several sheets and choose the header row

In [None]:
population_df = pd.read_excel('../data/komtopp50_2020.xlsx', sheet_name='Totalt', header=6)

population_df

In [None]:
sorted_df = population_df.sort_values(by=2020).reset_index(drop=True)

sorted_df

____

## Assigning and reassigning columns in a DataFrame

In [None]:
family_df

In [None]:
family_df['gender'] = ['male', 'male', 'female', 'male']

family_df

In [None]:
family_df['age'] = family_df['age'] + 1

family_df

In [None]:
family_df['age'] = family_df['age']*2

family_df['age']

In [None]:
new_colors = ['purple', 'indigo', 'violet', 'cyan']

family_df['eye color'] = new_colors

family_df

_____

## Concatenate two DataFrames

In [None]:
more_data = {'name': ['john', 'jane', 'jim'],
             'age': [23, 24, 25], 
             'eye color': ['r', 'g', 'b'], 
             'gender' : ['male', 'female', 'male']}

strangers_df = pd.DataFrame(more_data)

strangers_df

In [None]:
pd.concat([family_df, strangers_df])
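By default, `pd.concat` keeps each DataFrame's original index, so the result can contain repeated index values. The sketch below, on two small made-up tables, compares the default with `ignore_index=True`, which renumbers the rows.

```python
import pandas as pd

first_df = pd.DataFrame({"name": ["Amir", "Ali"], "age": [1.5, 34]})
second_df = pd.DataFrame({"name": ["john", "jane"], "age": [23, 24]})

stacked = pd.concat([first_df, second_df])                        # keeps the original indices
renumbered = pd.concat([first_df, second_df], ignore_index=True)  # builds a fresh 0..n-1 index

print(stacked)
print(renumbered)
```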

_____

## Plot data with Matplotlib

You're going to get **really** good at visualising data in this course. Here's just a soft start.

In [None]:
old_colors = ['blue', 'brown', 'green', 'brown']

family_df['eye color'] = old_colors

family_df

In [None]:
family_df['eye color'].value_counts()

In [None]:
# the .values attribute for series returns an array with the values of that series

counts = family_df['eye color'].value_counts().values

print(counts)

In [None]:
# the .index attribute of a Series returns its Index object; .values converts it to a NumPy array

indices = family_df['eye color'].value_counts().index.values

print(indices)

In [None]:
import matplotlib.pyplot as plt

plt.bar(indices, counts)
plt.show()
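As a small step toward more readable plots, the sketch below draws the same kind of bar chart with axis labels and a title, using Matplotlib's `Axes` interface. The label texts are only suggestions, and the data is rebuilt inline so the example runs on its own.

```python
import matplotlib.pyplot as plt
import pandas as pd

eye_colors = pd.Series(["blue", "brown", "green", "brown"])
color_counts = eye_colors.value_counts()

fig, ax = plt.subplots()
ax.bar(color_counts.index.values, color_counts.values)  # one bar per eye color
ax.set_xlabel("Eye color")
ax.set_ylabel("Number of people")
ax.set_title("Eye colors in the family")
```

In a notebook the figure is displayed when the cell finishes; in a plain script, add `plt.show()` at the end.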

______

## Two Important Notes

* It's *VERY* important that you put a lot of time into being mindful not only of what you visualise, but also of HOW you visualise it.
* Furthermore, I'm going to require you to read the documentation of the libraries that we use, including Pandas, Matplotlib, Seaborn, etc.

We will not have nearly enough time to cover every aspect of the libraries, so it's your responsibility to seek out the information that you need.

_____