
<a href="https://colab.research.google.com/github/kokchun/Databehandling-21/blob/main/Lectures/L0-pandas-basics.ipynb" target="_parent"><img align="left" src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a> &nbsp; for interacting with the code

---
# Lecture notes - Pandas basics

---
These are the lecture notes for **Pandas basics**. They build on content from the previous course:
- Python programming

<p class = "alert alert-info" role="alert"><b>Note</b> that this lecture note gives only a brief introduction to pandas. I encourage you to read further about pandas.</p>

Read more 

- [documentation - Series](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html#pandas.Series)

- [documentation - pandas](https://pandas.pydata.org/pandas-docs/stable/getting_started/index.html)

- [documentation - DataFrame](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html?highlight=dataframe#pandas.DataFrame)

- [documentation - read_csv](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)

- [documentation - indexing](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html)

- [documentation - masking](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.mask.html)

- [documentation - read_excel](https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html)
---

## Pandas Series

A Series is a 1D array with flexible indices. It can be seen as a "typed dictionary": all values share one dtype, which makes it more efficient than a plain dictionary in certain computations.
- create from dictionary 
- create from list 
- create from array 

In [5]:
import pandas as pd 
data = dict(AI = 25, NET = 30 , APP = 30, Java = 27) # number of students

series_programs = pd.Series(data=data)
print(series_programs)

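# extract values by position
# (newer pandas versions deprecate integer keys in [] on a labelled Series; prefer .iloc[0] and .iloc[-1])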
print(f"series_programs[0] -> {series_programs[0]}")
print(f"series_programs[-1] -> {series_programs[-1]}")

# get the keys
print(f"series_programs.keys() -> {series_programs.keys()}") 
print(f"series_programs.keys()[2] -> {series_programs.keys()[2]}") 

AI      25
NET     30
APP     30
Java    27
dtype: int64
series_programs[0] -> 25
series_programs[-1] -> 27
series_programs.keys() -> Index(['AI', 'NET', 'APP', 'Java'], dtype='object')
series_programs.keys()[2] -> APP
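
Since `series_programs` has string labels, it can help to be explicit about whether a value is looked up by position or by label. A minimal sketch, reusing the data from the cell above:

```python
import pandas as pd

# same data as above: number of students per program
series_programs = pd.Series(dict(AI=25, NET=30, APP=30, Java=27))

first = series_programs.iloc[0]   # by position
last = series_programs.iloc[-1]   # by position, counted from the end
app = series_programs["APP"]      # by label, like a dictionary lookup

print(first, last, app)
```

`.iloc` always means position, so the code keeps working even if the index later becomes numeric.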


In [6]:
import random as rnd
rnd.seed(42)

# create Series using list
dice_series = pd.Series([rnd.randint(1,6) for _ in range(5)])
print(dice_series)

# some useful methods
print(f"Min value {dice_series.min()}")
print(f"Mean value {dice_series.mean()}")
print(f"Median value {dice_series.median()}")

0    6
1    1
2    1
3    6
4    3
dtype: int64
Min value 1
Mean value 3.4
Median value 3.0
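
The list at the top of this section also mentions creating a Series from an array, which the cells do not show. A small sketch, where the values and the labels `a`, `b`, `c` are made up for illustration:

```python
import numpy as np
import pandas as pd

# a NumPy array becomes a Series; passing an index is optional
values = np.array([1.5, 2.0, 3.5])
series_from_array = pd.Series(values, index=["a", "b", "c"])

print(series_from_array.dtype)  # one dtype for all values

# arithmetic is applied element-wise, without an explicit loop
doubled = series_from_array * 2
print(doubled)
```

The single dtype is what allows this kind of element-wise computation to run without a Python loop.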


---
## DataFrame
A DataFrame is the analog of a 2D NumPy array with flexible row indices and column names. It can also be seen as a specialized dictionary where each column name is mapped to a Series object.

In [7]:
df_programs = pd.DataFrame(series_programs,columns=("Num students",))
df_programs

|      | Num students |
| :--- | -----------: |
| AI   | 25 |
| NET  | 30 |
| APP  | 30 |
| Java | 27 |


In [8]:
# create 2 Series objects using dictionary
students = pd.Series(dict(AI = 25, NET = 30 , APP = 30, Java = 27))
language = pd.Series(dict(AI="Python", NET="C#", APP="Kotlin", Java = "Java"))

# create a DataFrame from 2 Series objects using dictionary
df_programs = pd.DataFrame({"Students":students, "Language":language}) # key becomes col name
df_programs

|      | Students | Language |
| :--- | -------: | :------- |
| AI   | 25 | Python |
| NET  | 30 | C# |
| APP  | 30 | Kotlin |
| Java | 27 | Java |
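
When a DataFrame is built from several Series, pandas aligns them on their indices. The sketch below uses two deliberately mismatched Series (shortened, hypothetical versions of the ones above) to show that labels missing from one Series are filled with `NaN`:

```python
import pandas as pd

# hypothetical, incomplete data: the two Series only share the label "AI"
students_partial = pd.Series({"AI": 25, "NET": 30})
language_partial = pd.Series({"AI": "Python", "Java": "Java"})

df_partial = pd.DataFrame({"Students": students_partial, "Language": language_partial})
print(df_partial)

# each column is still a Series
print(type(df_partial["Students"]))
```

The resulting DataFrame has one row per label found in either Series, which is why matching indices matter when combining data this way.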


In [9]:
import numpy as np
# can also be created directly
df_programs = pd.DataFrame({
    "Students": np.array((25, 30, 30, 27)),
    "Language": np.array(("Python", "C#", "Kotlin", "Java"))},
    index = ["AI", ".NET", "APP", "Java"])
df_programs

|      | Students | Language |
| :--- | -------: | :------- |
| AI   | 25 | Python |
| .NET | 30 | C# |
| APP  | 30 | Kotlin |
| Java | 27 | Java |


In [10]:
df_programs.index # dtype object is used for text or for a mix of numeric and non-numeric values

Index(['AI', '.NET', 'APP', 'Java'], dtype='object')

---
## Data selection
- dictionary-style indexing
- attribute-style indexing
    - can give unexpected results when a column name matches an existing DataFrame attribute or method (e.g. `count`), because the attribute takes precedence over the column

In [46]:
# gives a Series object of Students 
df_programs["Students"] # dictionary indexing

AI      25
.NET    30
APP     30
Java    27
Name: Students, dtype: int64

In [50]:
# select multiple columns using list 
df_programs[["Language", "Students"]]

|      | Language | Students |
| :--- | :------- | -------: |
| AI   | Python | 25 |
| .NET | C# | 30 |
| APP  | Kotlin | 30 |
| Java | Java | 27 |


In [20]:
df_programs.Language # attribute indexing

AI      Python
.NET        C#
APP     Kotlin
Java      Java
Name: Language, dtype: object
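
The caveat about attribute-style indexing can be illustrated with a column named `count`, which is also the name of a DataFrame method. The DataFrame below is a made-up example:

```python
import pandas as pd

# hypothetical data with a column whose name collides with DataFrame.count
df_clash = pd.DataFrame({"count": [3, 5], "Language": ["Python", "C#"]})

by_key = df_clash["count"]    # dictionary-style: always the column
by_attribute = df_clash.count  # attribute-style: the method, not the column

print(type(by_key))
print(callable(by_attribute))
```

Dictionary-style indexing is therefore the safer default, and it is also the only style that works for column names containing spaces.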

In [24]:
df_programs["Language"][".NET"] # selects the Language Series and indexes .NET

'C#'

---
## Indexers
Indexers give a slicing interface for the indices. `loc` and `iloc` are attributes of Series and DataFrame objects.

| Indexer | Description                                                                       |
| :-----: | --------------------------------------------------------------------------------- |
|   loc   | slicing and indexing by the explicit index (labels); the slice end is included    |
|  iloc   | slicing and indexing by Python-style integer position; the slice end is excluded  |

In [40]:
print(df_programs.loc["Java"])

# index multiple rows
df_programs.loc[["Java", "APP"]]

Students      27
Language    Java
Name: Java, dtype: object


|      | Students | Language |
| :--- | -------: | :------- |
| Java | 27 | Java |
| APP  | 30 | Kotlin |


In [45]:
# slicing with array-style indices
df_programs.iloc[1:3]

|      | Students | Language |
| :--- | -------: | :------- |
| .NET | 30 | C# |
| APP  | 30 | Kotlin |
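
A detail worth seeing side by side is how the two indexers treat the end of a slice. The sketch below rebuilds `df_programs` so that it runs on its own:

```python
import pandas as pd

df_programs = pd.DataFrame(
    {"Students": [25, 30, 30, 27], "Language": ["Python", "C#", "Kotlin", "Java"]},
    index=["AI", ".NET", "APP", "Java"],
)

by_label = df_programs.loc[".NET":"APP"]  # label slice: "APP" is included
by_position = df_programs.iloc[1:3]       # position slice: position 3 is excluded

print(by_label.equals(by_position))

# both indexers also accept a row and a column at once
kotlin = df_programs.loc["APP", "Language"]
python = df_programs.iloc[0, 1]
print(kotlin, python)
```

Both slices return the same two rows here, even though the label slice names its last row and the position slice stops before its end position.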


---
## Masking
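Selects the rows where the condition is True. The condition is a boolean Series (the mask) with the same index as the DataFrame. Note that the related `DataFrame.mask` method does something different: it replaces the values where the condition is True.

```py
df = df[conditions]
```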

In [56]:
print(df_programs["Students"] > 25) # this gives a pandas Series of type bool 

df_over_25 = df_programs[df_programs["Students"]>25]
df_over_25

AI      False
.NET     True
APP      True
Java     True
Name: Students, dtype: bool


|      | Students | Language |
| :--- | -------: | :------- |
| .NET | 30 | C# |
| APP  | 30 | Kotlin |
| Java | 27 | Java |
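
Several conditions can be combined with `&` (and) and `|` (or), with each condition in parentheses. The sketch below rebuilds `df_programs` so that it runs on its own, and contrasts boolean indexing with the `mask` method:

```python
import pandas as pd

df_programs = pd.DataFrame(
    {"Students": [25, 30, 30, 27], "Language": ["Python", "C#", "Kotlin", "Java"]},
    index=["AI", ".NET", "APP", "Java"],
)

# boolean indexing: keep only the rows where both conditions hold
over_25_not_java = df_programs[
    (df_programs["Students"] > 25) & (df_programs["Language"] != "Java")
]
print(over_25_not_java)

# mask: keep every row, but replace values where the condition is True (NaN by default)
masked_students = df_programs["Students"].mask(df_programs["Students"] > 25)
print(masked_students)
```

Boolean indexing changes the number of rows, while `mask` keeps the shape and only changes values.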


---
## Read excel data
- reads an .xlsx-file and stores it as DataFrame object

In [42]:
import pandas as pd 
import matplotlib.pyplot as plt 
import seaborn as sns # used for plotting 

df = pd.read_excel("../Data/calories.xlsx")
df.head() # see the first n rows of DataFrame, n = 5 by default 

|     | FoodCategory | FoodItem | per100grams | Cals_per100grams | KJ_per100grams |
| --: | :----------- | :------- | :---------- | :--------------- | :------------- |
| 0 | CannedFruit | Applesauce | 100g | 62 cal | 260 kJ |
| 1 | CannedFruit | Canned Apricots | 100g | 48 cal | 202 kJ |
| 2 | CannedFruit | Canned Blackberries | 100g | 92 cal | 386 kJ |
| 3 | CannedFruit | Canned Blueberries | 100g | 88 cal | 370 kJ |
| 4 | CannedFruit | Canned Cherries | 100g | 54 cal | 227 kJ |


In [43]:
df.info() # prints info about df (dtypes, non-null vals and memory usage)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2225 entries, 0 to 2224
Data columns (total 5 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   FoodCategory      2225 non-null   object
 1   FoodItem          2225 non-null   object
 2   per100grams       2225 non-null   object
 3   Cals_per100grams  2225 non-null   object
 4   KJ_per100grams    2225 non-null   object
dtypes: object(5)
memory usage: 87.0+ KB


## Data cleaning 

- notice that all data types are object, so the numeric columns must be converted to int before calculating with them

Strategy:
- change column names
- convert Cals_per100grams to int to make calculations with it

In [44]:
df = df.rename(dict(Cals_per100grams = "Calories", per100grams="per100",KJ_per100grams="kJ"), axis="columns")

df.head()

|     | FoodCategory | FoodItem | per100 | Calories | kJ |
| --: | :----------- | :------- | :----- | :------- | :- |
| 0 | CannedFruit | Applesauce | 100g | 62 cal | 260 kJ |
| 1 | CannedFruit | Canned Apricots | 100g | 48 cal | 202 kJ |
| 2 | CannedFruit | Canned Blackberries | 100g | 92 cal | 386 kJ |
| 3 | CannedFruit | Canned Blueberries | 100g | 88 cal | 370 kJ |
| 4 | CannedFruit | Canned Cherries | 100g | 54 cal | 227 kJ |


In [45]:
# convert Calories to int: drop the last three characters ("cal"), then cast
df["Calories"] = df["Calories"].str[:-3].astype(int)
df.head()

|      | FoodCategory | FoodItem | per100 | Calories | kJ |
| ---: | :----------- | :------- | :----- | -------: | :- |
| 0 | CannedFruit | Applesauce | 100g | 62 | 260 kJ |
| 1 | CannedFruit | Canned Apricots | 100g | 48 | 202 kJ |
| 2 | CannedFruit | Canned Blackberries | 100g | 92 | 386 kJ |
| 3 | CannedFruit | Canned Blueberries | 100g | 88 | 370 kJ |
| 4 | CannedFruit | Canned Cherries | 100g | 54 | 227 kJ |
| ... | ... | ... | ... | ... | ... |
| 2220 | Spreads | Sunflower Butter | 100g | 617 | 2591 kJ |
| 2221 | Spreads | Tapenade | 100g | 233 | 979 kJ |
| 2222 | Spreads | Unsalted Butter | 100g | 717 | 3011 kJ |
| 2223 | Spreads | Vegemite | 100g | 180 | 756 kJ |


In [47]:
df.Calories.head()

0    62
1    48
2    92
3    88
4    54
Name: Calories, dtype: int32
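
The same idea carries over to the `kJ` column. As a sketch that does not depend on the Excel file, the DataFrame below is typed in by hand from the first three rows shown above, and the unit is removed by splitting on whitespace rather than by slicing (an alternative to the approach in the cell above, not a replacement for it):

```python
import pandas as pd

# first three rows of the renamed data, entered by hand
df_small = pd.DataFrame({
    "FoodItem": ["Applesauce", "Canned Apricots", "Canned Blackberries"],
    "Calories": ["62 cal", "48 cal", "92 cal"],
    "kJ": ["260 kJ", "202 kJ", "386 kJ"],
})

for column in ["Calories", "kJ"]:
    # "62 cal" -> ["62", "cal"] -> "62" -> 62
    df_small[column] = df_small[column].str.split().str[0].astype(int)

print(df_small.dtypes)
print(df_small["Calories"].mean())  # calculations now work on the numeric column
```

Splitting on whitespace does not depend on how many characters the unit has, so one loop can handle both columns.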

---

Kokchun Giang

[LinkedIn][linkedIn_kokchun]

[GitHub portfolio][github_portfolio]

[linkedIn_kokchun]: https://www.linkedin.com/in/kokchungiang/
[github_portfolio]: https://github.com/kokchun/Portfolio-Kokchun-Giang

---