# Pandas for Data Analysis

## Learning Objectives
- Understand the purpose and importance of the pandas library
- Learn about core pandas data structures: Series and DataFrames
- Load and inspect real-world datasets
- Apply indexing, selection, filtering, and transformation operations
- Handle missing data and perform aggregation using groupby
---

## 1. Introduction to pandas

**pandas** is an open-source Python library for data manipulation and analysis. It provides flexible, high-performance data structures for working with structured data such as tables, spreadsheets, and database query results.

Key features include:
- Tabular data representation using `DataFrame` objects
- Efficient data indexing and selection
- Support for handling missing values
- Built-in functions for statistical analysis and grouping
- Easy input/output (I/O) for CSV, Excel, SQL, etc.

---

## 2. Core Data Structures
### 2.1 Series

A `Series` is a one-dimensional labeled array. Think of it as a single column from a table.  
A `DataFrame` is a two-dimensional data structure — essentially a table with rows and columns, similar to an Excel spreadsheet.

<br>

<p align="center">
  <img src="fig7.png" alt="Series and DataFrame illustration" width="500"/>
</p>

<p align="center">
  <em>Figure: A pandas Series (left) and DataFrame (right).</em>
</p>

In [1]:
import pandas as pd # importing the pandas library, which is used for data manipulation and analysis

In [2]:
s = pd.Series([100, 200, 300], index=['a', 'b', 'c'])
print(s)

a    100
b    200
c    300
dtype: int64
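
A short sketch of what the labels are good for, reusing the Series from the cell above. The variable names `value_b`, `doubled`, and `total` are introduced here only for illustration.

```python
import pandas as pd

# Same Series as above: three values with string labels
s = pd.Series([100, 200, 300], index=['a', 'b', 'c'])

value_b = s['b']   # access a single element by its label
doubled = s * 2    # arithmetic is applied element-wise, labels are kept
total = s.sum()    # built-in aggregation over all values

print(value_b)
print(doubled)
print(total)
```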


### 2.2 DataFrame

A `DataFrame` is a two-dimensional data structure — essentially a table with rows and columns, similar to an Excel spreadsheet.

<br>

<p align="center">
  <img src="figures/fig8.png" alt="Pandas DataFrame illustration" width="600"/>
</p>

<p align="center">
  <em>Figure: A pandas DataFrame.</em>
</p>


Each column is a Series; the full table is the DataFrame.
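
The relationship between the two structures can be checked directly. The following is a minimal sketch with a small hypothetical table (`df_demo` and its values are made up for illustration):

```python
import pandas as pd

# Small illustrative table (hypothetical values)
df_demo = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})

age = df_demo['Age']  # selecting one column returns a Series

print(type(df_demo).__name__)
print(type(age).__name__)
print(age.index.equals(df_demo.index))  # the column shares the table's row index
```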


#### Create an empty DataFrame

In [18]:

df = pd.DataFrame()  # create an empty DataFrame with no columns or rows
print("empty DataFrame:", df)

empty DataFrame: Empty DataFrame
Columns: []
Index: []


##### Add columns to the DataFrame

In [19]:
df['ColumnName'] = [3, 10, 21, 3.23]
print("DataFrame with one column:", df)

DataFrame with one column:    ColumnName
0        3.00
1       10.00
2       21.00
3        3.23


In [5]:
# add a column with text values
df['TextColumn'] = ['a', 'b', 'c', 'd']
print("DataFrame with two columns:", df)

DataFrame with two columns:    ColumnName TextColumn
0        3.00          a
1       10.00          b
2       21.00          c
3        3.23          d


#### Create a DataFrame from a Dictionary

<div align="center">
  <img src="figures/pandasdataframe.png" alt="Dataframe from dictionary" width="450"/>
  <p style="font-size:small;">
    DataFrame from dictionary
  </p>
</div>


In [15]:
# create a dictionary with two key-value pairs;
# each key becomes a column name and each list holds that column's values
mydict = {'Column': [3, 10, 21, 3.23], 'TextColumn': ['a', 'b', 'c', 'd']}

# create a DataFrame from the dictionary
df2 = pd.DataFrame(mydict)
print("DataFrame from dictionary:", df2)

DataFrame from dictionary:    Column TextColumn
0    3.00          a
1   10.00          b
2   21.00          c
3    3.23          d


In [16]:
data = {
    'Name': ['Alice', 'Bob'],
    'Age': [25, 30]
}
# stored under a separate name so that the `df` built above is not overwritten
df_people = pd.DataFrame(data)
print(df_people)

    Name  Age
0  Alice   25
1    Bob   30


#### Add Rows

In [6]:
# add rows with the .loc indexer
df.loc[4] = [100, 'x']  # add a new row with index label 4

df.loc[len(df)] = [200, 'y']  # add a new row whose label equals the current number of rows
df.loc[len(df)] = [300, 'z']  # add another row in the same way
print("DataFrame after adding new rows:", df)

DataFrame after adding new rows:    ColumnName TextColumn
0        3.00          a
1       10.00          b
2       21.00          c
3        3.23          d
4      100.00          x
5      200.00          y
6      300.00          z
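
When several rows arrive at once, `pd.concat` is an alternative to repeated `.loc` assignments. The sketch below uses two small hypothetical frames (`df_base`, `new_rows`) with the same column names as above; it is one possible approach, not the method used in this notebook.

```python
import pandas as pd

# Hypothetical frames that share the same columns
df_base = pd.DataFrame({'ColumnName': [3, 10], 'TextColumn': ['a', 'b']})
new_rows = pd.DataFrame({'ColumnName': [100, 200], 'TextColumn': ['x', 'y']})

# ignore_index=True renumbers the rows 0..n-1 instead of keeping both old indexes
df_combined = pd.concat([df_base, new_rows], ignore_index=True)
print(df_combined)
```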


#### Importing and Exporting DataFrames

In [13]:
# import a DataFrame from an external file
df_imported = pd.read_csv('input.dat', index_col=None, header=0, sep=';', decimal='.')

In [15]:
df_imported

Unnamed: 0,Zeile1
0,Zeile2
1,Test1!Test2
2,79:958!79:958
3,22:203!22:203
4,29:216!29:216
...,...
1016,19:188!19:188
1017,17:542!17:542
1018,66:565!66:565
1019,77:928!77:928


In [12]:
# export the current DataFrame to a CSV file
df_imported.to_csv('output.csv', index=False)
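
The `sep` and `decimal` arguments only help when they match the file. As a sketch, the example below reads hypothetical text (not the contents of `input.dat`) from an in-memory buffer, assuming `;` as column separator and `,` as decimal mark, and then writes it back out as standard CSV.

```python
import io
import pandas as pd

# Hypothetical file content: ';' separates columns, ',' is the decimal mark
raw_text = "Test1;Test2\n1,5;2,25\n3,75;4,0\n"

df_demo = pd.read_csv(io.StringIO(raw_text), sep=';', decimal=',', header=0)
print(df_demo.dtypes)  # both columns should be parsed as floating-point numbers

# to_csv() without a path returns the CSV text; index=False omits the row index
csv_text = df_demo.to_csv(index=False)
print(csv_text)
```

If the separator or decimal mark does not match the file, pandas typically reads each line as a single text column instead of raising an error, so checking `dtypes` after import is a cheap safeguard.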

### 2.3 DataFrame Handling


In [7]:
# use the values of 'TextColumn' as the row index
df.set_index(df['TextColumn'], inplace=True)  # passing the Series (not the column name) keeps 'TextColumn' as a column

In [8]:
# drop the 'TextColumn' from the DataFrame
df.drop(["TextColumn"], axis=1, inplace=True)  # remove 'TextColumn' from the DataFrame

In [9]:
# drop the row with index z
df.drop("z", axis=0, inplace=True)  # remove the row with index 'z' from the DataFrame

In [10]:
# add a new row with index "z" and None values
df.loc["z"] = None

In [11]:
# drop all rows that contain at least one missing value
df.dropna(inplace=True)  # remove rows with any missing values

In [16]:
# Calculate the rolling average (window size 10) for column "Test1"
#rolling_avg = df_imported["Zeile1"].rolling(window=10).mean()
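
The commented-out line above needs a numeric column to work. As a minimal sketch of what a rolling average with window size 10 does, the example below uses a hypothetical Series holding the numbers 1 to 20 (`values` is an assumption, not data from the imported file).

```python
import pandas as pd

# Hypothetical numeric data: the numbers 1..20
values = pd.Series(range(1, 21), dtype='float64')

# Each result is the mean of the current value and the 9 values before it;
# the first 9 positions are NaN because the window is not yet full
rolling_avg = values.rolling(window=10).mean()
print(rolling_avg.head(12))
```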

---

## 3. Loading a Real Dataset
We will use the **Palmer Penguins** dataset. It contains measurements for three penguin species from Antarctica.

In [32]:
url = "https://raw.githubusercontent.com/JohnMount/Penguins/main/penguins.csv"
df = pd.read_csv(url)
df.head()

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,male
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,female
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,female
3,Adelie,Torgersen,,,,,
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,female


`pd.read_csv()` explanation:
- `url` is a direct link to the CSV file online
- This function returns a `DataFrame`

Common parameters:
- `index_col`: use a column as the row index
- `usecols`: load only specific columns
- `na_values`: treat specific strings as NaN
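
The three parameters can be tried out without a file or network access. The sketch below reads hypothetical CSV text from an in-memory buffer; the `id` column and the placeholder string `missing` are assumptions made for this illustration.

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for a file or URL
raw_text = (
    "id,species,island,body_mass_g\n"
    "1,Adelie,Torgersen,3750\n"
    "2,Adelie,Torgersen,missing\n"
    "3,Gentoo,Biscoe,5700\n"
)

df_demo = pd.read_csv(
    io.StringIO(raw_text),
    index_col='id',                            # use the 'id' column as the row index
    usecols=['id', 'species', 'body_mass_g'],  # skip the 'island' column
    na_values=['missing'],                     # read the string 'missing' as NaN
)
print(df_demo)
```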


### Exploring the dataframe

In [34]:
# shape and columns of DF
print("Shape of DataFrame:", df.shape)  # prints the shape of the DataFrame (number of rows, number of columns)
print("Columns in DataFrame:", df.columns)  # prints the names of the columns in the DataFrame

Shape of DataFrame: (344, 7)
Columns in DataFrame: Index(['species', 'island', 'bill_length_mm', 'bill_depth_mm',
       'flipper_length_mm', 'body_mass_g', 'sex'],
      dtype='object')


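#### Information about the DataFrame: `DataFrame.info()`

- Column names, non-null counts, and data types
- Memory usage of the DataFrame
- For mean, std, min, and max of numeric columns, or counts and top values of object/categorical columns, use `DataFrame.describe()` instead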

In [None]:

df.info()  # prints a concise summary (data types, non-null counts); it returns None, so it should not be wrapped in print()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 344 entries, 0 to 343
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   species            344 non-null    object 
 1   island             344 non-null    object 
 2   bill_length_mm     342 non-null    float64
 3   bill_depth_mm      342 non-null    float64
 4   flipper_length_mm  342 non-null    float64
 5   body_mass_g        342 non-null    float64
 6   sex                333 non-null    object 
dtypes: float64(4), object(3)
memory usage: 18.9+ KB
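
Summary statistics come from `describe()` rather than `info()`. The following sketch uses a small hypothetical table (`df_demo`, four made-up rows) to show both the numeric and the text variant.

```python
import pandas as pd

# Hypothetical mini-table with one text and one numeric column
df_demo = pd.DataFrame({
    'species': ['Adelie', 'Adelie', 'Adelie', 'Gentoo'],
    'body_mass_g': [3750.0, 3800.0, 3250.0, 5700.0],
})

numeric_summary = df_demo.describe()          # count, mean, std, min, quartiles, max
text_summary = df_demo['species'].describe()  # count, unique, top, freq

print(numeric_summary)
print(text_summary)
```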


#### Detecting and Handling Missing Values

Why this matters:
Missing values (NaN = Not a Number) can break computations or skew analysis. You must inspect and treat them early.
```python
df.isna()
```
- Returns True where values are missing, else False.
- `.isnull()` is equivalent.

```python
df.isna().sum()
```
- Count missing values per column

In [36]:
df.isna()

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
0,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False
3,False,False,True,True,True,True,True
4,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...
339,False,False,False,False,False,False,False
340,False,False,False,False,False,False,False
341,False,False,False,False,False,False,False
342,False,False,False,False,False,False,False


In [44]:
df.isna().sum() # counts the number of missing values in each column

species               0
island                0
bill_length_mm        2
bill_depth_mm         2
flipper_length_mm     2
body_mass_g           2
sex                  11
dtype: int64

#### Remove rows with missing data

```python
df_clean = df.dropna()
```
- Drops any row with a NaN
- Use `subset` argument to limit to specific columns

<div align="center">
  <img src="figures/dropna.png" alt="dropna() function" width="450"/>
  <p style="font-size:small;">
    dropna() function
  </p>
</div>


In [45]:
df_clean = df.dropna()  # drops any row with a NaN value

In [46]:
df_clean.isna().sum()  # checks for missing values in the cleaned DataFrame

species              0
island               0
bill_length_mm       0
bill_depth_mm        0
flipper_length_mm    0
body_mass_g          0
sex                  0
dtype: int64
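
Dropping every incomplete row can discard more data than necessary. As a sketch of the `subset` argument mentioned above, the example below uses a hypothetical three-row table in which one row lacks only the `sex` value.

```python
import pandas as pd

# Hypothetical table: row 1 misses two values, row 2 misses only 'sex'
df_demo = pd.DataFrame({
    'species': ['Adelie', 'Adelie', 'Gentoo'],
    'body_mass_g': [3750.0, None, 5700.0],
    'sex': ['male', None, None],
})

all_complete = df_demo.dropna()                      # keeps only fully populated rows
mass_known = df_demo.dropna(subset=['body_mass_g'])  # ignores NaNs in other columns

print(len(all_complete), len(mass_known))
```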

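#### Fill missing values
```python
df_filled = df.fillna({'sex': 'Unknown'})
```
- Fills NaNs only in the specified columns
- Choose a fill value that matches the column type: a text placeholder suits the text column `sex`, whereas writing a string into a numeric column such as `body_mass_g` would mix text into the numbers
- Use `df.ffill()` or `df.bfill()` for forward/backward fill (the `method=` argument of `fillna()` is deprecated in recent pandas versions)



In [48]:
df_filled = df.fillna({'sex': 'Unknown'})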
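
A minimal sketch of column-wise filling and forward filling on a small hypothetical table. Using the column mean for the numeric column is an assumption made for illustration, not a recommendation from this notebook.

```python
import pandas as pd

# Hypothetical table with one gap in each column
df_demo = pd.DataFrame({
    'body_mass_g': [3750.0, None, 5700.0],
    'sex': ['male', None, 'female'],
})

# Fill each column with a value that matches its type
df_filled_demo = df_demo.fillna({
    'body_mass_g': df_demo['body_mass_g'].mean(),  # numeric column: column mean
    'sex': 'Unknown',                              # text column: placeholder label
})

# Forward fill: copy the last valid value downwards
df_ffilled = df_demo.ffill()

print(df_filled_demo)
print(df_ffilled)
```

Both calls return new DataFrames; `df_demo` itself still contains its missing values.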

#### Accessing and Selecting Data

In [50]:
# access a column
df['species']  # returns the 'species' column as a Series


0         Adelie
1         Adelie
2         Adelie
3         Adelie
4         Adelie
         ...    
339    Chinstrap
340    Chinstrap
341    Chinstrap
342    Chinstrap
343    Chinstrap
Name: species, Length: 344, dtype: object

In [52]:
df[['species']]  # returns the 'species' column as a DataFrame

Unnamed: 0,species
0,Adelie
1,Adelie
2,Adelie
3,Adelie
4,Adelie
...,...
339,Chinstrap
340,Chinstrap
341,Chinstrap
342,Chinstrap


#### Access by label: `.loc[]`
```python
df.loc[0, 'body_mass_g']
```
- Use when you know the index and column names
- Allows label-based indexing and slicing

<p align="center">
  <img src="figures/fig9.png" alt="Series and DataFrame illustration" width="500"/>
</p>

<p align="center">
  <em>loc.</em>
</p>




In [53]:
df.loc[0, 'body_mass_g']

3750.0
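
Beyond single cells, `.loc[]` also accepts lists and slices of labels. The sketch below uses a hypothetical four-row table (`df_demo`) with the default integer row labels.

```python
import pandas as pd

# Hypothetical table with default row labels 0..3
df_demo = pd.DataFrame({
    'species': ['Adelie', 'Adelie', 'Gentoo', 'Gentoo'],
    'island': ['Torgersen', 'Dream', 'Biscoe', 'Biscoe'],
    'body_mass_g': [3750.0, 3800.0, 5700.0, 5400.0],
})

one_value = df_demo.loc[0, 'body_mass_g']                   # single cell
some_cells = df_demo.loc[[0, 2], ['species', 'body_mass_g']]  # lists of row and column labels
col_range = df_demo.loc[:, 'species':'island']              # all rows, label slice over columns

print(one_value)
print(some_cells)
print(col_range)
```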

#### Access by position: `.iloc[]`
```python
df.iloc[0, 5]
```
- Use when selecting rows and columns by integer position
- Useful when column names are unknown

<p align="center">
  <img src="figures/fig10.png" alt="Series and DataFrame illustration" width="500"/>
</p>

<p align="center">
  <em>iloc.</em>
</p>



In [55]:
df.iloc[0, 5]  # access the first row and sixth column by position

3750.0

Common mistake: confusing the two indexers.
- `.iloc[]` expects integer positions
- `.loc[]` expects labels (column names, row index values)
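
The difference is easiest to see when the row labels are not integers. The following sketch uses a hypothetical DataFrame with the string labels `a`, `b`, `c`. It also shows that label slices include their end point while position slices do not.

```python
import pandas as pd

# Hypothetical table with string row labels
df_demo = pd.DataFrame(
    {'ColumnName': [3.0, 10.0, 21.0]},
    index=['a', 'b', 'c'],
)

by_label = df_demo.loc['b', 'ColumnName']  # label-based access
by_position = df_demo.iloc[1, 0]           # position-based access to the same cell

label_slice = df_demo.loc['a':'b']         # includes the end label 'b' (2 rows)
position_slice = df_demo.iloc[0:1]         # excludes position 1 (1 row)

try:
    df_demo.iloc['b', 0]                   # a label passed to .iloc
except (TypeError, ValueError, IndexError) as err:
    print('iloc rejects labels:', type(err).__name__)
```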

<div align="center">
  <img src="figures/loc_vs_iloc.png" alt="loc vs iloc" width="450"/>
  <p style="font-size:small;">
    loc vs iloc
  </p>
</div>


#### Filtering Rows with Conditions
Boolean indexing
- Returns all rows where the condition is true

In [57]:
df[df['species'] == 'Adelie']  # filter rows where species is 'Adelie'

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,male
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,female
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,female
3,Adelie,Torgersen,,,,,
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,female
...,...,...,...,...,...,...,...
147,Adelie,Dream,36.6,18.4,184.0,3475.0,female
148,Adelie,Dream,36.0,17.8,195.0,3450.0,female
149,Adelie,Dream,37.8,18.1,193.0,3750.0,male
150,Adelie,Dream,36.0,17.1,187.0,3700.0,female


#### Combine multiple conditions
- Use `&` for AND, `|` for OR
- Always put parentheses around each condition, because `&` and `|` bind more tightly than comparison operators such as `==` and `>`

In [58]:
df[(df['species'] == 'Gentoo') & (df['body_mass_g'] > 5000)]

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
153,Gentoo,Biscoe,50.0,16.3,230.0,5700.0,male
155,Gentoo,Biscoe,50.0,15.2,218.0,5700.0,male
156,Gentoo,Biscoe,47.6,14.5,215.0,5400.0,male
159,Gentoo,Biscoe,46.7,15.3,219.0,5200.0,male
161,Gentoo,Biscoe,46.8,15.4,215.0,5150.0,male
...,...,...,...,...,...,...,...
267,Gentoo,Biscoe,55.1,16.0,230.0,5850.0,male
269,Gentoo,Biscoe,48.8,16.2,222.0,6000.0,male
273,Gentoo,Biscoe,50.4,15.7,222.0,5750.0,male
274,Gentoo,Biscoe,45.2,14.8,212.0,5200.0,female
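
Storing each condition in its own variable keeps combined filters readable. The sketch below uses a hypothetical four-row table and additionally shows `~` for negation and `.isin()` for membership tests, both of which go beyond the cell above.

```python
import pandas as pd

# Hypothetical table
df_demo = pd.DataFrame({
    'species': ['Adelie', 'Chinstrap', 'Gentoo', 'Gentoo'],
    'body_mass_g': [3750.0, 3500.0, 5700.0, 4900.0],
})

is_gentoo = df_demo['species'] == 'Gentoo'  # Boolean Series (mask)
is_heavy = df_demo['body_mass_g'] > 5000

both = df_demo[is_gentoo & is_heavy]        # AND: both conditions hold
either = df_demo[is_gentoo | is_heavy]      # OR: at least one condition holds
not_gentoo = df_demo[~is_gentoo]            # NOT: invert the mask with ~
selected = df_demo[df_demo['species'].isin(['Adelie', 'Chinstrap'])]  # membership test

print(both)
print(either)
```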


#### Aggregation and Grouping
Average body mass by species
- Groups by species
- Calculates mean of body mass within each group

In [59]:
df.groupby('species')['body_mass_g'].mean()

species
Adelie       3700.662252
Chinstrap    3733.088235
Gentoo       5076.016260
Name: body_mass_g, dtype: float64

In [61]:
df.groupby(['island', 'species']).size()
# Returns number of records in each group

island     species  
Biscoe     Adelie        44
           Gentoo       124
Dream      Adelie        56
           Chinstrap     68
Torgersen  Adelie        52
dtype: int64
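
A group can be summarized with more than one statistic at a time. The sketch below uses a hypothetical four-row table; the output column names `mean_mass` and `max_flipper` are chosen freely for this illustration.

```python
import pandas as pd

# Hypothetical table with two species
df_demo = pd.DataFrame({
    'species': ['Adelie', 'Adelie', 'Gentoo', 'Gentoo'],
    'body_mass_g': [3700.0, 3800.0, 5000.0, 5200.0],
    'flipper_length_mm': [181.0, 186.0, 215.0, 219.0],
})

# Several statistics for one column
mass_stats = df_demo.groupby('species')['body_mass_g'].agg(['mean', 'min', 'max'])

# Named aggregation: one (column, function) pair per output column
summary = df_demo.groupby('species').agg(
    mean_mass=('body_mass_g', 'mean'),
    max_flipper=('flipper_length_mm', 'max'),
)

print(mass_stats)
print(summary)
```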

## Summary of Key Concepts

| Concept           | Use Case                                          |
|-------------------|---------------------------------------------------|
| `pd.read_csv()`   | Load external dataset                             |
| `.shape`          | Get dataset dimensions                            |
| `.info()`         | Inspect columns and datatypes                     |
| `.isna()`         | Detect missing data                               |
| `.dropna()`       | Remove incomplete rows                            |
| `.fillna()`       | Replace missing values                            |
| `.loc[]`          | Access rows/columns by label                      |
| `.iloc[]`         | Access rows/columns by integer position           |
| Boolean filtering | Extract subset of rows matching a condition       |
| `.groupby()`      | Aggregate and summarize across categories         |
| `.sort_values()`  | Sort DataFrame by values                          |
| `.reset_index()`  | Replace the index with the default integer index  |
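
The last two table entries are not demonstrated in the cells above. A minimal sketch with a hypothetical three-row table:

```python
import pandas as pd

# Hypothetical table
df_demo = pd.DataFrame({
    'species': ['Adelie', 'Gentoo', 'Chinstrap'],
    'body_mass_g': [3750.0, 5700.0, 3500.0],
})

# Sort from heaviest to lightest; the original row labels travel with the rows
sorted_df = df_demo.sort_values('body_mass_g', ascending=False)
print(sorted_df.index.tolist())  # [1, 0, 2]

# drop=True discards the old labels and numbers the rows 0..n-1 again
renumbered = sorted_df.reset_index(drop=True)
print(renumbered)
```

Without `drop=True`, `reset_index()` keeps the old labels as an extra column named `index`.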


