
<a href="https://colab.research.google.com/github/aleylani/Databehandling-AI25/blob/main/lectures/L1_pandas_basics.ipynb" target="_parent"><img align="left" src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a> &nbsp;

## Lecture Notes: Pandas Basics

____

**Introduction**

Pandas is a powerful Python library for data manipulation and analysis. It provides easy-to-use data structures like Series and DataFrame, which allow for efficient handling of structured data.

This lecture will introduce key Pandas concepts and operations, providing practical examples along the way. By the end, you'll understand how to:

* Create and manipulate Pandas Series and DataFrames
* Use essential functions to filter, sort, and summarize data
* Differentiate between .iloc and .loc for indexing and selecting data
* Load and handle real-world datasets from Excel files

____

**Create a new environment**

Begin by creating a new environment for this course. Let's call it *databehandling*.

Use the following command.

        conda create --name databehandling python=3.12

After it's created, activate it and run the following to install some necessary libraries for this course.

        pip install numpy pandas matplotlib openpyxl ipykernel seaborn plotly_express nbformat 

In [84]:
import random

import numpy as np
import pandas as pd

____

## Pandas Series

A Pandas Series is a one-dimensional array-like object that can hold any data type (integers, floats, strings, etc.). 

What makes a Series powerful is that it comes with labeled indices.

**Creating a Series**

We can create a series from a list, array, dictionary or scalar value.

In [None]:
# using a simple list

a_list = [x for x in range(30, 40)]
print(a_list)

In [None]:
my_first_series = pd.Series(a_list)

print(my_first_series)

Let's check that this really is an instance of the Series class.

In [None]:
isinstance(my_first_series, pd.Series)
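A list is only one of the possible inputs mentioned above. As a small sketch (the variable names and values here are made up for illustration), this is how a Series can be built from a NumPy array, a dictionary, and a scalar:

```python
import numpy as np
import pandas as pd

# from a NumPy array: the default index is 0, 1, 2, ...
from_array = pd.Series(np.array([1.5, 2.5, 3.5]))

# from a dictionary: the keys become the index labels
from_dict = pd.Series({'a': 10, 'b': 20, 'c': 30})

# from a scalar: the value is repeated once for every label in the index
from_scalar = pd.Series(7, index=['x', 'y', 'z'])

print(from_array)
print(from_dict)
print(from_scalar)
```

Note that the scalar case needs an explicit index, since that is what tells Pandas how many times to repeat the value.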

You'll have ample time to learn more about methods and attributes of the Series class in this course. 

I highly recommend that you explore the official documentation:

https://pandas.pydata.org/docs/reference/api/pandas.Series.html

Here are some basic methods:

In [None]:
print(f'Minimum value of the Series : {my_first_series.min()}')
print(f'Maximum value of the Series : {my_first_series.max()}')
print(f'Mean value of the Series    : {my_first_series.mean()}')

**Interaction with other Series**

How do two Series interact with each other under addition, multiplication, division, and so on?

In [None]:
another_list = [x*10 for x in range(1, 11)]

my_second_series = pd.Series(another_list)

print(my_second_series)

In [None]:
# elementwise addition

my_first_series + my_second_series

In [None]:
# elementwise multiplication 

my_first_series * my_second_series

In [None]:
# elementwise division

my_first_series / my_second_series
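The two Series above share the same default index, so the operations pair up elements position by position. More generally, Pandas matches elements by index label. The following sketch uses two small made-up Series with only partly overlapping labels to show what happens then:

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([10, 20, 30], index=['b', 'c', 'd'])

# elements are matched by label, not by position
# labels that exist in only one of the Series give NaN in the result
total = s1 + s2

print(total)
```

Only the labels 'b' and 'c' exist in both Series, so only those two rows get a numeric result.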

We can extract individual elements from a Series by simple indexing.

In [None]:
my_first_series[0] # 0 is the index of the first element

We can also extract multiple elements from a Series by slicing.

In [None]:
my_first_series[2:5] # returns all elements from index 2 to 4 (5 not included)

Read more about Series in the official documentation linked above!

____

## Pandas DataFrame

A DataFrame is a 2D labeled data structure in Pandas, similar to a table or spreadsheet. 

Each column can hold different data types (integers, floats, strings, etc.).

In [None]:
# note: the values in our dictionary do NOT have to be lists; you can freely mix lists, arrays and tuples, among others

# Let's create a dictionary with some data.
# note that the 3 lists are of equal length

names = ['Ali', 'Amir', 'Rozann', 'Sawash']
age = [34, 1.6, 28, 4]
eye_color = ['brown', 'blue', 'green', 'brown']

my_dictionary = {'person' : names, 
                 'age': age, 
                 'eye color': eye_color}

for key, value in my_dictionary.items():

    print(f'{key}: {value}')

We now have a dictionary. We can instantly create a DataFrame from this dictionary.

In [None]:
# the _df suffix is a common naming convention to signify that this is a DataFrame

family_df = pd.DataFrame(my_dictionary)

family_df
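The dictionary values do not all have to be lists. A minimal sketch, using made-up column names, that mixes a list, a tuple and a NumPy array of equal length:

```python
import numpy as np
import pandas as pd

mixed = {
    'as_list': [1, 2, 3],
    'as_tuple': ('x', 'y', 'z'),
    'as_array': np.array([0.1, 0.2, 0.3]),
}

# each key becomes a column name, each value becomes that column's data
mixed_df = pd.DataFrame(mixed)

print(mixed_df)
print(mixed_df.dtypes)
```

What matters is that all values have the same length, since every column needs one entry per row.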

In [None]:
# we can run the .info() method to get some information about the DataFrame
# note that Dtype "object" signifies either a string or a mixed type column (e.g. strings and integers)

family_df.info()

We can select specific columns from the DataFrame using the column names.

In [None]:
family_df['person']

Note that each individual column of a DataFrame is a Series!

In [None]:
isinstance(family_df['person'], pd.Series)

In [None]:
# We can use simple indexing to get e.g., the first element of a column

family_df['person'][0]

We can also access individual columns by using their names as attributes.

In [None]:
# note that this method is prone to error if the name of the column contains a space or special characters

family_df.person

# this won't work, because the column name contains a space
# family_df.eye color

We can index several columns at once by passing a list of column names to the DataFrame.

In [None]:
family_df[['person', 'eye color']]
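The number of brackets matters for the type we get back. In this small sketch (the `people` DataFrame is made up for illustration), a single column name gives a Series, while a list of names gives a DataFrame, even if the list holds only one name:

```python
import pandas as pd

people = pd.DataFrame({'person': ['Ali', 'Amir'],
                       'eye color': ['brown', 'blue']})

as_series = people['eye color']    # a single name returns a Series
as_frame = people[['eye color']]   # a list of names returns a DataFrame

print(type(as_series).__name__)
print(type(as_frame).__name__)
```

Bracket notation also works for column names with spaces, which attribute access cannot handle.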

____

## Further indexing

In Pandas, selecting specific rows and columns is essential for analyzing data. Pandas offers two primary methods to do this: 

.loc[] and .iloc[].

In [None]:
# let's create some new data

data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva', 'Frank', 'Grace', 'Helen', 'Ian'],
    'Age': [25, 30, 35, 40, 22, 29, 38, 42, 31],
    'Salary': [50000, 60000, 70000, 80000, 52000, 58000, 62000, 75000, 55000]
}

# create a DataFrame from the data

df = pd.DataFrame(data)

df

Before proceeding, note that DataFrames are very flexible! For example, we can compute a statistic directly on a column.

In [None]:
print(df['Salary'].mean())

**.iloc[] - Position-based Indexing**

Use .iloc[] to select data based on the position of rows and columns.

In [None]:
df.iloc[0] # Select the first row

In [None]:
# we can use slicing to select multiple rows

df.iloc[0:5] # Select rows from index 0 to 4

.iloc[] can take two indexers: the first for rows and the second for columns.

The syntax is df.iloc[row_indexer, column_indexer].

In [None]:
print(df.iloc[0, 1]) # Select the element at the first row and second column
print(df.iloc[5, 2]) # Select the element at the sixth row and third column

We can also use slicing to select multiple rows and columns.

In [None]:
df.iloc[0:3, 0:2] # Select rows 0 to 2 and columns 0 to 1

In [None]:
df.iloc[2:4, 1:3] # Select rows 2 to 3 and columns 1 to 2

In [None]:
df.iloc[1:4, :] # Select rows 1 to 3 and all columns

We can also pass lists of the row and column positions we want to select.

In [None]:
df.iloc[[0, 2, 4], [1, 2]] # Select specific rows and columns

**.loc[] - Label-based Indexing**

Use .loc[] to select rows and columns by their labels. This is more intuitive when working with labeled data.

In [None]:
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva', 'Frank', 'Grace', 'Helen', 'Ian'],
    'Age': [25, 30, 35, 40, 22, 29, 38, 42, 31],
    'Salary': [50000, 60000, 70000, 80000, 52000, 58000, 62000, 75000, 55000]
}

# create a DataFrame from the data
# NOTE that we can pass our own custom index here

df = pd.DataFrame(data, index=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i'])

df

A custom index does not affect .iloc[], so we can use it exactly as we did above.

In [None]:
df.iloc[4:8, 0:2]

In [None]:
df.loc['b'] # Select the row with label 'b'

In [None]:
print(df.loc['b', 'Age']) # Select the element at row 'b' and column 'Age'

In [None]:
df.loc[['b', 'd', 'f'], ['Age', 'Salary']] # Select specific rows and columns
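One difference between the two indexers is easy to miss: a slice with .loc[] includes the end label, while a slice with .iloc[] excludes the end position. A short sketch with a made-up `scores` DataFrame:

```python
import pandas as pd

scores = pd.DataFrame({'score': [10, 20, 30, 40]},
                      index=['a', 'b', 'c', 'd'])

# label-based slice: BOTH endpoints are included
by_label = scores.loc['b':'c']

# position-based slice: the end position is NOT included
by_position = scores.iloc[1:2]

print(by_label)
print(by_position)
```

Here the .loc[] slice returns rows 'b' and 'c', while the .iloc[] slice returns only the row at position 1.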

____

## Masking

Masking is a powerful feature in Pandas that allows you to filter data based on certain conditions. It's a way to select rows or columns that meet specific criteria. Masking is often used to filter data, perform calculations, or create subsets of data for further analysis.

In [None]:
family_df

In [None]:
# the above df has 4 rows, so let's create a list of booleans, of the same size

my_mask = [True, False, True, False]

family_df[my_mask]  # will only show rows where our list value is True

In [None]:
# we have extreme flexibility here, we can use any condition we want to create our mask

my_color_mask = [color == 'brown' for color in family_df['eye color']]

my_color_mask

In [None]:
family_df[my_color_mask]

In [None]:
# Series and DataFrames have built-in support for elementwise comparisons

color_filter = family_df['eye color'] == 'brown'  # this creates a Series of booleans, which we can also use as a mask

color_filter

In [None]:
family_df[color_filter]

In [None]:
# we can also create and use a mask in one step (not best practice though)

family_df[family_df['age'] > 10]

In [None]:
# if we want to store the filtered information in a separate df, we can easily do so.

age_mask = family_df['age'] > 10

mature_df = family_df[age_mask]

mature_df

You can combine filters too!

Only rows that satisfy both conditions will then be selected.

In [None]:
print(age_mask)

print(color_filter)

In [None]:
family_df[age_mask & color_filter]
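Besides `&` (and), masks can be combined with `|` (or). The sketch below uses a made-up `pets` DataFrame. When the conditions are written inline, each comparison needs its own parentheses, because `&` and `|` bind more tightly than `>` and `==` in Python:

```python
import pandas as pd

pets = pd.DataFrame({'name': ['Tom', 'Rex', 'Mia', 'Bo'],
                     'age': [5, 2, 1, 7],
                     'kind': ['cat', 'dog', 'cat', 'dog']})

# rows where BOTH conditions hold
both = pets[(pets['age'] > 3) & (pets['kind'] == 'cat')]

# rows where AT LEAST ONE condition holds
either = pets[(pets['age'] > 3) | (pets['kind'] == 'cat')]

print(both)
print(either)
```

Storing each condition in a named mask first, as done above with `age_mask` and `color_filter`, avoids the parentheses and is often easier to read.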

A bit more fun, just to show how much flexibility we have when creating masks:

In [None]:
df

In [None]:
short_names_filter = [len(name) < 4 for name in df['Name']]

df[short_names_filter]
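The same kind of mask can be written without a Python loop by using the `.str` accessor of a string column. A minimal sketch on a made-up `names_df`:

```python
import pandas as pd

names_df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Eva', 'Charlie']})

# .str.len() returns the length of each string as a Series of integers,
# and comparing it with 4 gives a boolean Series we can use as a mask
short_mask = names_df['Name'].str.len() < 4

short = names_df[short_mask]

print(short)
```

The result is a boolean Series rather than a list, so it can also be negated with `~` or combined with other masks.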

We can also NEGATE a boolean Series. This is very handy when we want to get the opposite of a mask.

We negate a Series using the ~ symbol.

In [None]:
age_mask

In [None]:
~age_mask # note that all True have been turned into False, and vice versa

In [None]:
family_df[~age_mask] # this now gives us the opposite of the original mask

____

## Read Excel

In [None]:
calories_df = pd.read_excel('../data/calories.xlsx')

calories_df

In [None]:
calories_df.info()

In [None]:
# see the first 5 rows of the dataframe

calories_df.head()

# calories_df.head(10)

# calories_df.tail(10)

In [None]:
# print the unique values of a column

print(calories_df['FoodCategory'].unique())

In [None]:
# see how many unique values this column contains

calories_df['FoodCategory'].nunique()

In [None]:
# see the amount of each unique value in a column

calories_df['FoodCategory'].value_counts()
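If we want proportions rather than raw counts, `value_counts()` accepts `normalize=True`. The sketch below uses a small made-up Series instead of the calories data:

```python
import pandas as pd

categories = pd.Series(['Fruit', 'Fruit', 'Meat', 'Fruit'])

counts = categories.value_counts()                # number of rows per unique value
shares = categories.value_counts(normalize=True)  # fraction of rows per unique value

print(counts)
print(shares)
```

In both cases the unique values end up in the index of the returned Series.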

In [None]:
calories_df.iloc[224]

In [None]:
calories_df[calories_df['FoodCategory']=='FastFood']

____

## Rename columns

In [None]:
calories_df

In [141]:
calories_df = calories_df.rename( columns={'FoodItem':'Food'} )

___

## A bit of data cleaning 

We will very often need to handle and manipulate data in DataFrames, e.g., change column names, change element values, create new columns, and handle missing data.

Let's say we want to create a new column in which cals/100g is stored as integers rather than strings, as it is now.

In [None]:
calories_df.info()

In [None]:
# we can index elements in a string Series by using .str[]

calories_df['Cals_per100grams'].str[:-4] # by doing this, we remove the last 4 characters of each string in the Series which in this case is ' cal'.

In [None]:
calories_df['Cals_per100grams'].str[:-4].astype(int) # we've now also converted the datatypes to integers.

Another way of doing the same thing is perhaps the following

In [None]:
# loop over each value in the Series and keep only the number. Exploit the fact that we can split on a blank space here.
# also, directly convert the number from strings to integers

only_number_portion = [int(x.split()[0]) for x in calories_df['Cals_per100grams']]

print(only_number_portion)

In [146]:
# add the new list as a new column to the dataframe

calories_df['cals/100g in integers'] = only_number_portion

In [None]:
calories_df.head()

In [None]:
calories_df.info()
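The split-and-convert approach can also be written without a loop by chaining `.str` methods. This is only a sketch: the sample strings are made up and assume the same "number, space, cal" format as the column above.

```python
import pandas as pd

cals = pd.Series(['62 cal', '48 cal', '130 cal'])

# .str.split() splits each string on whitespace into a list,
# .str[0] picks the first piece (the number, still as text),
# .astype(int) converts that text to integers
as_int = cals.str.split().str[0].astype(int)

print(as_int)
```

Unlike slicing off a fixed number of characters, this does not depend on the exact length of the suffix, only on the space before it.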

____

## Sort DataFrame

In [None]:
calories_df.sort_values(by='cals/100g in integers') # by default, ascending=True

In [None]:
calories_df.sort_values(by='cals/100g in integers', ascending=False)

In [None]:
calories_df.head(20)

In [None]:
calories_df.tail(20)
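Note that `sort_values()` returns a new, sorted DataFrame and leaves the original untouched, which is why `head()` and `tail()` above still show the original order. A small sketch with a made-up `foods` DataFrame:

```python
import pandas as pd

foods = pd.DataFrame({'food': ['apple', 'butter', 'rice'],
                      'cals': [52, 717, 130]})

# the sorted result is a new object, so we store it in a new variable
sorted_foods = foods.sort_values(by='cals', ascending=False)

print(foods['food'].tolist())         # original order is unchanged
print(sorted_foods['food'].tolist())  # highest calories first
```

To keep the sorted order, assign the result to a variable, either a new one or the original name.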

____

## Read Excel with several sheets and choose the header row

In [None]:
population_df = pd.read_excel('../data/komtopp50_2020.xlsx', sheet_name='Totalt', header=6)

population_df

Sort by the 2020 column, reset the index of the sorted result, and assign it to a new variable.

In [153]:
sorted_df = population_df.sort_values(by=2020, ascending=False).reset_index(drop=True)

In [None]:
sorted_df
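To see what `reset_index(drop=True)` does, it helps to look at the index before and after. The `towns` DataFrame in this sketch is made up for illustration:

```python
import pandas as pd

towns = pd.DataFrame({'town': ['A', 'B', 'C'],
                      'population': [5, 20, 10]})

# after sorting, each row keeps its original index label
ranked = towns.sort_values(by='population', ascending=False)
print(ranked.index.tolist())

# drop=True discards the old labels and numbers the rows 0, 1, 2, ...
renumbered = ranked.reset_index(drop=True)
print(renumbered.index.tolist())

# without drop=True, the old labels are kept as a new column named 'index'
with_old_index = ranked.reset_index()
print(with_old_index.columns.tolist())
```

Keeping the old index as a column can be useful when we later need to trace rows back to their original position.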

____

## Assigning and re-assigning columns in a DataFrame

In [None]:
family_df

In [159]:
family_df['gender'] = ['male', 'male', 'female', 'male']

In [None]:
family_df

In [None]:
# Series support elementwise operations

# elementwise addition below

family_df['age'] = family_df['age'] + 1

family_df

In [None]:
family_df['age'] = family_df['age']*2

family_df

In [None]:
new_colors = ['purple', 'indigo', 'violet', 'cyan']

family_df['eye color'] = new_colors

family_df

____

## Concatenate two DataFrames

In [None]:
# create and concat two dfs

more_data = {'person':['john', 'jane', 'jim'], 
             'age':[23, 24, 25], 
             'eye color':['dark', 'darker', 'lightest']}

strangers_df = pd.DataFrame(more_data)
strangers_df

In [None]:
pd.concat( [family_df, strangers_df] )
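Two details of `pd.concat()` are worth knowing: each DataFrame keeps its own index labels by default, and a column that is missing from one of the DataFrames is filled with NaN. The sketch below shows both with two small made-up DataFrames, `left` and `right`:

```python
import pandas as pd

left = pd.DataFrame({'person': ['Ali', 'Amir'],
                     'age': [34, 2],
                     'gender': ['male', 'male']})

right = pd.DataFrame({'person': ['john'],
                      'age': [23]})

# by default the original index labels are kept, so labels can repeat
combined = pd.concat([left, right])
print(combined.index.tolist())

# ignore_index=True gives the result a fresh 0, 1, 2, ... index
renumbered = pd.concat([left, right], ignore_index=True)
print(renumbered)
```

In `renumbered`, the last row has NaN in the `gender` column, since `right` has no such column.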

____

## Plot data with Matplotlib

You're going to get **really** good at plotting in this course. Here's a soft start.

In [None]:
old_colors = ['blue', 'brown', 'green', 'brown']

family_df['eye color'] = old_colors

family_df

In [None]:
family_df['eye color'].value_counts()

In [None]:
# we can get the values of a Series using .values

counts = family_df['eye color'].value_counts().values

print(counts)

In [None]:
# likewise, we can get the index of a Series using .index.values

colors = family_df['eye color'].value_counts().index.values

print(colors)

In [None]:
import matplotlib.pyplot as plt

plt.bar(colors, counts)
plt.show()

____

## Two Important Notes 



* It's VERY important that you put a lot of time into being mindful of what you visualize, and also HOW you visualize it.

* Furthermore, I'm going to require you to read the documentation of the libraries we use, including Pandas, Matplotlib and more.

We will not have nearly enough time to cover every aspect of the libraries, so it's your responsibility to seek out the information you need.

____

## Read more

- [documentation - Pandas](https://pandas.pydata.org/pandas-docs/stable/getting_started/index.html)

- [documentation - Series](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html#pandas.Series)

- [documentation - DataFrame](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html?highlight=dataframe#pandas.DataFrame)

- [documentation - read_csv](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)

- [documentation - indexing](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html)

- [documentation - masking](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.mask.html)

- [documentation - read_excel](https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html)

---