# EDV-Coaching - Python
## Introduction to Pandas
***
This notebook covers:
- Creating DataFrames and Series
- Loading and saving data
- Data selection and filtering
- Basic data analysis
- Data transformation
- Grouping and aggregation
- Missing values
- Time series
***
# What is Pandas?

Pandas is a powerful library for data manipulation and analysis. The name derives from "Panel Data", and the library is specifically optimized for working with structured data. <br>

Important features of Pandas are: <br>
- DataFrame object for intuitive handling of tabular data <br>
- Efficient data input and output in various formats (CSV, Excel, SQL, etc.) <br>
- Powerful tools for data cleaning and transformation <br>
- Flexible grouping and aggregation of data <br>
- Integrated tools for time series analysis <br>

Pandas has become the standard in data analysis because it effectively bridges the gap between raw data and statistical analysis. <br>

## 1 Creating DataFrames and Series

Pandas has two main data structures: Series (1D) and DataFrame (2D): <br>

In [None]:
import pandas as pd
import numpy as np

# Create Series
s = pd.Series([1, 3, 5, np.nan, 6, 8])

# Create DataFrame from dictionary
data = {
    'Name': ['Anna', 'Ben', 'Clara', 'David'],
    'Alter': [25, 30, 22, 35],
    'Stadt': ['Berlin', 'Hamburg', 'München', 'Berlin'],
    'Gehalt': [45000, 55000, 35000, 65000]
}
df = pd.DataFrame(data)

print("Series:")
print(s)
print("\nDataFrame:")
print(df)

Series:
0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64

DataFrame:
    Name  Alter    Stadt  Gehalt
0   Anna     25   Berlin   45000
1    Ben     30  Hamburg   55000
2  Clara     22  München   35000
3  David     35   Berlin   65000
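A DataFrame can also be built from a list of records, and a Series can carry labels instead of the default 0, 1, 2, ... index. The following is a minimal sketch; the names `records`, `df_records` and `ages` are illustrative and not part of the notebook above.

```python
import pandas as pd

# Hypothetical records mirroring the first two rows of the example data
records = [
    {'Name': 'Anna', 'Alter': 25},
    {'Name': 'Ben', 'Alter': 30},
]
df_records = pd.DataFrame(records)

# A Series with a custom, label-based index
ages = pd.Series([25, 30], index=['Anna', 'Ben'], name='Alter')

print(df_records.shape)    # (number of rows, number of columns)
print(df_records.dtypes)   # data type of each column
print(ages['Ben'])         # access by label instead of position
```

Each dictionary becomes one row, and the dictionary keys become the column names.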


## 2 Loading and saving data

Pandas can read and write various data formats: <br>

In [None]:
# Create and read CSV file (relative path: written to the current working directory)
df.to_csv('beispiel.csv', index=False)
df_csv = pd.read_csv('beispiel.csv')

# Create and read Excel file (requires an Excel engine such as openpyxl)
#df.to_excel('beispiel.xlsx', index=False)
#df_excel = pd.read_excel('beispiel.xlsx')

# JSON format
#df.to_json('beispiel.json')
#df_json = pd.read_json('beispiel.json')

print("Loaded csv data:")
print(df_csv)

Loaded csv data:
    Name  Alter    Stadt  Gehalt
0   Anna     25   Berlin   45000
1    Ben     30  Hamburg   55000
2  Clara     22  München   35000
3  David     35   Berlin   65000
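To try out the CSV round trip without writing to disk, the same functions also accept a file-like object. The sketch below assumes an in-memory `io.StringIO` buffer; `df_demo`, `buffer` and `df_back` are illustrative names.

```python
import io
import pandas as pd

df_demo = pd.DataFrame({'Name': ['Anna', 'Ben'], 'Gehalt': [45000, 55000]})

# Write the CSV text into an in-memory buffer instead of a file
buffer = io.StringIO()
df_demo.to_csv(buffer, index=False)

# Rewind the buffer and read the same text back
buffer.seek(0)
df_back = pd.read_csv(buffer)

print(buffer.getvalue())         # the raw CSV text
print(df_back.equals(df_demo))   # compares values and dtypes
```

With `index=False` the row index is not written, so the file contains only the data columns.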


## 3 Selecting and filtering data

There are various ways to select data: <br>

In [None]:
# Select columns
names = df['Name']
info = df[['Name', 'Stadt']]

# Select rows by position
first_row = df.iloc[0]
block = df.iloc[0:2, 1:3]

# Select rows by condition
berliner = df[df['Stadt'] == 'Berlin']
good_pay = df[df['Gehalt'] > 50000]

# Combined conditions
young_berlin = df[(df['Stadt'] == 'Berlin') & (df['Alter'] < 30)]

print("Employees from Berlin:")
print(berliner)

Employees from Berlin:
    Name  Alter   Stadt  Gehalt
0   Anna     25  Berlin   45000
3  David     35  Berlin   65000
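The selections above can be complemented by label-based access with `.loc`, membership tests with `isin`, and string-based filters with `query`. This is a minimal sketch on a standalone copy of the example data; `df_demo` and the result names are illustrative.

```python
import pandas as pd

df_demo = pd.DataFrame({
    'Name': ['Anna', 'Ben', 'Clara', 'David'],
    'Alter': [25, 30, 22, 35],
    'Stadt': ['Berlin', 'Hamburg', 'München', 'Berlin'],
    'Gehalt': [45000, 55000, 35000, 65000],
})

# .loc: rows by condition, columns by name
names_over_50k = df_demo.loc[df_demo['Gehalt'] > 50000, 'Name']

# isin: keep rows whose value is one of several candidates
berlin_or_hamburg = df_demo[df_demo['Stadt'].isin(['Berlin', 'Hamburg'])]

# query: the same kind of filter written as a string expression
young = df_demo.query('Alter < 30')

print(names_over_50k.tolist())    # ['Ben', 'David']
print(len(berlin_or_hamburg))     # 3
print(young['Name'].tolist())     # ['Anna', 'Clara']
```

When combining conditions in boolean indexing, each condition needs its own parentheses and the operators are `&` and `|`, not `and` and `or`.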


## 4 Basic data analysis

Pandas offers many functions for descriptive statistics: <br>

In [None]:
# Statistical summary
summary = df.describe()

# Single statistical measures
average = df['Gehalt'].mean()
median = df['Gehalt'].median()
maximum = df['Gehalt'].max()

# Value counts
stadt_counts = df['Stadt'].value_counts()

# Correlations
correlations = df.corr(numeric_only=True)   # restrict to numeric columns; text columns cannot be correlated

print("Statistical summary:")
print(summary)
print("\nCity distribution:")
print(stadt_counts)
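The individual measures can be checked by hand on the four example salaries. The sketch below rebuilds the two numeric columns as standalone Series (`gehalt` and `alter` are illustrative names) and spells out the arithmetic in the comments.

```python
import pandas as pd

gehalt = pd.Series([45000, 55000, 35000, 65000], name='Gehalt')
alter = pd.Series([25, 30, 22, 35], name='Alter')

# Mean: (45000 + 55000 + 35000 + 65000) / 4
print(gehalt.mean())      # 50000.0

# Median: average of the two middle values 45000 and 55000
print(gehalt.median())    # 50000.0

# Pearson correlation between two numeric Series
r = alter.corr(gehalt)
print(round(r, 3))
```

In this small sample the correlation is close to 1, because the older employees also have the higher salaries.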

## 5 Data transformation

Preparing and transforming data: <br>

In [None]:
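# Work on a copy so that df itself stays unchanged for the following sections
df_transformed = df.copy()

# Add new columns
df_transformed['Bonus'] = df_transformed['Gehalt'] * 0.1
df_transformed['Gesamtgehalt'] = df_transformed['Gehalt'] + df_transformed['Bonus']

# Change data type
df_transformed['Alter'] = df_transformed['Alter'].astype(float)

# Replace values
df_transformed['Stadt'] = df_transformed['Stadt'].replace('Berlin', 'BER')

# Categorical data
df_transformed['Stadt_Kategorie'] = pd.Categorical(df_transformed['Stadt'])

print("Transformed data:")
print(df_transformed)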
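Derived columns can also be added without modifying the original DataFrame by using `assign`, and values can be recoded through a lookup dictionary with `map`. This is a minimal sketch; `df_demo`, `city_codes` and the column `Stadt_Code` are hypothetical names chosen for illustration.

```python
import pandas as pd

df_demo = pd.DataFrame({
    'Name': ['Anna', 'Ben'],
    'Stadt': ['Berlin', 'Hamburg'],
    'Gehalt': [45000, 55000],
})

# Hypothetical lookup table from city name to a short code
city_codes = {'Berlin': 'BER', 'Hamburg': 'HAM'}

# assign() returns a new DataFrame and leaves df_demo unchanged
df_new = df_demo.assign(
    Bonus=lambda d: d['Gehalt'] * 0.1,
    Stadt_Code=lambda d: d['Stadt'].map(city_codes),
)

print(df_new)
print(list(df_demo.columns))   # still only the original columns
```

Unlike `replace`, `map` yields NaN for values that are missing from the lookup dictionary, which makes incomplete mappings easy to spot.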

## 6 Grouping and aggregation

Grouping and summarizing data: <br>

In [None]:
# Group by Stadt
nach_stadt = df.groupby('Stadt')

# Various aggregations
stadt_statistics = nach_stadt.agg({
    'Gehalt': ['mean', 'min', 'max'],
    'Alter': 'mean'
})

# Grouping by multiple columns
multi_group = df.groupby(['Stadt', 'Alter'])['Gehalt'].mean()

print("Statistics per city:")
print(stadt_statistics)


Statistics per city:
          Gehalt               Alter
            mean    min    max  mean
Stadt                               
Berlin   55000.0  45000  65000  30.0
Hamburg  55000.0  55000  55000  30.0
München  35000.0  35000  35000  22.0
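The dictionary form of `agg` produces a two-level column header, as the output above shows. Named aggregation is an alternative that gives flat column names. This is a minimal sketch on a standalone copy of the data; the result names `gehalt_mean`, `gehalt_max` and `n` are chosen for illustration.

```python
import pandas as pd

df_demo = pd.DataFrame({
    'Stadt': ['Berlin', 'Hamburg', 'München', 'Berlin'],
    'Alter': [25, 30, 22, 35],
    'Gehalt': [45000, 55000, 35000, 65000],
})

# Named aggregation: output_column=(input_column, function)
per_city = df_demo.groupby('Stadt').agg(
    gehalt_mean=('Gehalt', 'mean'),
    gehalt_max=('Gehalt', 'max'),
    n=('Alter', 'count'),
)

print(per_city)
print(per_city.loc['Berlin', 'gehalt_mean'])   # 55000.0
```

The Berlin mean of 55000.0 is (45000 + 65000) / 2 and matches the value in the table above.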


## 7 Missing values

Dealing with missing values (NaN): <br>

In [None]:
# Introduce a missing value (cast to float first so the column can hold NaN)
df['Gehalt'] = df['Gehalt'].astype(float)
df.loc[1, 'Gehalt'] = np.nan

# Detect missing values
missing = df.isna()
missing_sum = df.isna().sum()

# Handle missing values
df_clean = df.dropna()           # Remove rows with NA
df_filled = df.fillna(0)         # Fill NAs with 0
df_mean = df.fillna({'Gehalt': df['Gehalt'].mean()})   # Fill NAs in 'Gehalt' with the column mean

print("Number of missing values per column:")
print(missing_sum)

Number of missing values per column:
Name      0
Alter     0
Stadt     0
Gehalt    1
dtype: int64
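Missing values are usually handled per column, because a fill value that suits one column rarely suits another. The sketch below assumes a small standalone DataFrame (`df_demo` is an illustrative name) and shows a column-specific fill and a column-specific drop.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({
    'Name': ['Anna', 'Ben', 'Clara'],
    'Gehalt': [45000.0, np.nan, 35000.0],
})

# mean() skips NaN by default: (45000 + 35000) / 2
gehalt_mean = df_demo['Gehalt'].mean()

# Fill only the 'Gehalt' column; all other columns stay untouched
df_filled = df_demo.fillna({'Gehalt': gehalt_mean})

# Drop only rows where 'Gehalt' is missing
df_dropped = df_demo.dropna(subset=['Gehalt'])

print(gehalt_mean)                     # 40000.0
print(df_filled['Gehalt'].tolist())    # [45000.0, 40000.0, 35000.0]
print(df_dropped['Name'].tolist())     # ['Anna', 'Clara']
```

Both calls return new DataFrames, so `df_demo` still contains the NaN afterwards.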


## 8 Time series

Pandas has special functions for time series: <br>

In [None]:
# Create time series index
dates = pd.date_range('20240101', periods=6)
ts = pd.Series(np.random.randn(6), index=dates)

# Time series operations
monthly = ts.resample('ME').mean()   # 'ME' = month end (pandas >= 2.2; older versions use 'M')
rolling = ts.rolling(window=3).mean()
shift = ts.shift(1)

print("Time series:")
print(ts)
print("\nRolling mean:")
print(rolling)

Time series:
2024-01-01   -0.258858
2024-01-02   -0.577659
2024-01-03    1.182864
2024-01-04    0.252935
2024-01-05   -1.542787
2024-01-06    0.822953
Freq: D, dtype: float64

Rolling mean:
2024-01-01         NaN
2024-01-02         NaN
2024-01-03    0.115449
2024-01-04    0.286047
2024-01-05   -0.035663
2024-01-06   -0.155633
Freq: D, dtype: float64
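In the output above, the first two rolling values are NaN because a window of 3 is not yet full, and the third value, 0.115449, is the mean of the first three observations. The sketch below repeats the operations on fixed values instead of random numbers, so the results are easy to verify by hand; the Series `ts_demo` is an illustrative stand-in.

```python
import pandas as pd

dates = pd.date_range('2024-01-01', periods=6)
ts_demo = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], index=dates)

# Rolling mean over 3 days: the first complete window is (1 + 2 + 3) / 3
rolling = ts_demo.rolling(window=3).mean()

# shift(1) moves every value one day later; the first entry becomes NaN
shifted = ts_demo.shift(1)

# Day-over-day change, built from the shifted series
change = ts_demo - shifted

print(rolling)
print(change)
```

Subtracting the shifted series gives the change from the previous day, which is what `ts_demo.diff()` computes directly.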



## Conclusion:

Pandas offers: <br>
- Flexible data structures for tabular data <br>
- Extensive import/export capabilities <br>
- Powerful data analysis tools <br>
- Efficient data transformation and cleaning <br>
- Advanced grouping and aggregation functions <br>

These features make Pandas the standard tool for data analysis in Python. <br>