# Vega-Altair

Altair is a declarative statistical visualization library for Python, based on Vega and Vega-Lite. It offers a powerful and concise grammar that enables you to quickly build a wide range of statistical visualizations. 

You can install Altair with the terminal command: 

`pip install "altair[all]"`

* More info: Check the [installation guide](https://altair-viz.github.io/getting_started/installation.html) 
* Inspiration: Browse the [Example gallery](https://altair-viz.github.io/gallery/index.html#example-gallery) 
* Great YouTube tutorial: [By the guy who coded Altair](https://youtu.be/ms29ZPUKxbU?feature=shared&t=6)

## 01 Plot Types

In [22]:
#import libraries
import altair as alt
import pandas as pd


In [23]:
#load data
df_raw = pd.read_csv("data/pokemon.csv")
df_raw.head(2)

Unnamed: 0,abilities,against_bug,against_dark,against_dragon,against_electric,against_fairy,against_fight,against_fire,against_flying,against_ghost,...,pokedex_number,sp_attack,sp_defense,speed,type1,type2,weight_kg,generation,is_legendary,year
0,"['Overgrow', 'Chlorophyll']",1.0,1.0,1.0,0.5,0.5,0.5,2.0,2.0,1.0,...,1,65,65,45,grass,poison,6.9,1,0,1996
1,"['Overgrow', 'Chlorophyll']",1.0,1.0,1.0,0.5,0.5,0.5,2.0,2.0,1.0,...,2,80,80,60,grass,poison,13.0,1,0,1996


### Data Cleaning

In [24]:
df_raw.columns

Index(['abilities', 'against_bug', 'against_dark', 'against_dragon',
       'against_electric', 'against_fairy', 'against_fight', 'against_fire',
       'against_flying', 'against_ghost', 'against_grass', 'against_ground',
       'against_ice', 'against_normal', 'against_poison', 'against_psychic',
       'against_rock', 'against_steel', 'against_water', 'attack',
       'base_egg_steps', 'base_happiness', 'base_total', 'capture_rate',
       'classfication', 'defense', 'experience_growth', 'height_m', 'hp',
       'japanese_name', 'name', 'percentage_male', 'pokedex_number',
       'sp_attack', 'sp_defense', 'speed', 'type1', 'type2', 'weight_kg',
       'generation', 'is_legendary', 'year'],
      dtype='object')

In [25]:
# select the columns we really want

df = df_raw[['name', 'type1', 'type2', 'classfication', 'height_m', 'weight_kg', 'year',
             'is_legendary', 'attack', 'speed', 'hp', 'base_happiness', 'capture_rate']]

In [26]:
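# we only look at first-generation Pokémon
# .copy() gives an independent dataframe, so later column assignments don't raise a SettingWithCopyWarning
df = df[df['year'] == 1996].copy()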

In [27]:
# do all columns have the right data type?
df.dtypes

name               object
type1              object
type2              object
classfication      object
height_m          float64
weight_kg         float64
year                int64
is_legendary        int64
attack              int64
speed               int64
hp                  int64
base_happiness      int64
capture_rate       object
dtype: object

In [28]:
# ... capture_rate could be an integer.

df['capture_rate'] = df['capture_rate'].astype(int)
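
`astype(int)` raises a `ValueError` as soon as a single value cannot be parsed as an integer. As a minimal sketch on made-up values (not taken from the Pokémon file), `pd.to_numeric` with `errors="coerce"` turns unparseable entries into `NaN`, so they can be inspected before converting:

```python
import pandas as pd

# Made-up example values; the last entry is deliberately not a clean number.
raw = pd.Series(["45", "255", "30 (unknown)"], name="capture_rate")

# errors="coerce" replaces values that cannot be parsed with NaN
# instead of raising an error.
parsed = pd.to_numeric(raw, errors="coerce")

# List the rows that failed to parse before deciding how to handle them.
bad_rows = raw[parsed.isna()]
print(bad_rows)
```

Whether such a check is needed depends on the data; in this notebook the plain `astype(int)` call is kept.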

In [29]:
# save the df as CSV for later notebooks
df.to_csv('data/pokemon_gen1.csv', index=False)

## Aside: more pandas functions

In [30]:
df['type1'].value_counts()

type1
water       28
normal      22
poison      14
grass       12
fire        12
bug         12
electric     9
rock         9
ground       8
psychic      8
fighting     7
ghost        3
dragon       3
fairy        2
ice          2
Name: count, dtype: int64

In [31]:
df['type1'].value_counts().reset_index()

Unnamed: 0,type1,count
0,water,28
1,normal,22
2,poison,14
3,grass,12
4,fire,12
5,bug,12
6,electric,9
7,rock,9
8,ground,8
9,psychic,8


`groupby()` applies a function (e.g. mean, maximum, sum, etc.) per category. Format: <br>
`df.groupby(<category>)[<values>].<function>()`

In [32]:
# groupby
df.groupby('type1')['weight_kg'].mean()

type1
bug         22.991667
dragon      76.600000
electric    32.012500
fairy       23.750000
fighting    54.285714
fire        54.650000
ghost       13.566667
grass       19.627273
ground      80.500000
ice         48.000000
normal      57.983333
poison      26.866667
psychic     51.562500
rock        60.583333
water       57.967857
Name: weight_kg, dtype: float64
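
Several functions can also be computed in one step with `agg()`. The following is a small sketch on a handful of first-generation rows from this dataset; the names `sample` and `summary` are only illustrative:

```python
import pandas as pd

# A few first-generation rows (name, type1, weight_kg) from the Pokémon data.
sample = pd.DataFrame({
    "name": ["Onix", "Gyarados", "Lapras", "Dragonair",
             "Dragonite", "Snorlax", "Kangaskhan"],
    "type1": ["rock", "water", "water", "dragon",
              "dragon", "normal", "normal"],
    "weight_kg": [210.0, 235.0, 220.0, 16.5, 210.0, 460.0, 80.0],
})

# agg() accepts a list of function names and returns one column per function.
summary = sample.groupby("type1")["weight_kg"].agg(["mean", "max", "sum"])
print(summary)
```

The result is a dataframe with one row per `type1` and the columns `mean`, `max` and `sum`.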

## Plotting




### Anatomy of an Altair chart
An Altair chart always follows this pattern:

    alt.Chart(df).mark_bar().encode(
        x = 'column_A',
        y = 'column_B'
    )

`Chart()`: takes the dataframe whose data should be plotted <br>
`mark_bar()`: chooses the form (mark type) of the plot <br>
`encode()`: maps dataframe columns to visual channels such as `x` and `y`


##### Bar chart

In [33]:
df_top_height = df.sort_values('height_m', ascending = False).head(10)
df_top_height

Unnamed: 0,name,type1,type2,classfication,height_m,weight_kg,year,is_legendary,attack,speed,hp,base_happiness,capture_rate
94,Onix,rock,ground,Rock Snake Pokémon,8.8,210.0,1996,0,45,70,35,70,45
129,Gyarados,water,flying,Atrocious Pokémon,6.5,235.0,1996,0,155,81,95,70,45
147,Dragonair,dragon,,Dragon Pokémon,4.0,16.5,1996,0,84,70,61,35,45
23,Arbok,poison,,Cobra Pokémon,3.5,65.0,1996,0,95,80,60,70,90
130,Lapras,water,ice,Transport Pokémon,2.5,220.0,1996,0,85,60,130,70,45
114,Kangaskhan,normal,,Parent Pokémon,2.2,80.0,1996,0,125,100,105,70,45
148,Dragonite,dragon,flying,Dragon Pokémon,2.2,210.0,1996,0,134,80,91,35,45
142,Snorlax,normal,,Sleeping Pokémon,2.1,460.0,1996,0,110,30,160,70,25
22,Ekans,poison,,Snake Pokémon,2.0,6.9,1996,0,60,55,35,70,255
2,Venusaur,grass,poison,Seed Pokémon,2.0,100.0,1996,0,100,80,80,70,45


In [34]:
# Bar chart
alt.Chart(df_top_height).mark_bar().encode(
    x = 'height_m',
    y = alt.Y('name', sort='-x' )
)

### Plot types

Altair offers, among others, the following mark types:
* `mark_bar`
* `mark_line`
* `mark_area` 
* `mark_point`
* `mark_boxplot`
* `mark_square`

Not covered in this notebook:
* `mark_arc` <- [Donut-Chart](https://altair-viz.github.io/gallery/donut_chart.html)


##### Line chart
Line charts are especially useful for showing developments over time.

In [35]:
# we use the full dataframe from the beginning again (before selecting generation 1)
df_raw.head(2)

Unnamed: 0,abilities,against_bug,against_dark,against_dragon,against_electric,against_fairy,against_fight,against_fire,against_flying,against_ghost,...,pokedex_number,sp_attack,sp_defense,speed,type1,type2,weight_kg,generation,is_legendary,year
0,"['Overgrow', 'Chlorophyll']",1.0,1.0,1.0,0.5,0.5,0.5,2.0,2.0,1.0,...,1,65,65,45,grass,poison,6.9,1,0,1996
1,"['Overgrow', 'Chlorophyll']",1.0,1.0,1.0,0.5,0.5,0.5,2.0,2.0,1.0,...,2,80,80,60,grass,poison,13.0,1,0,1996


In [36]:
# how many Pokémon were there per generation?
df_generation = df_raw.value_counts('year').reset_index().sort_values('year')
df_generation

Unnamed: 0,year,count
1,1996,151
4,2000,100
2,2003,135
3,2007,107
0,2011,156
6,2013,72
5,2016,80


In [None]:
# Line chart
alt.Chart(df_generation).mark_line().encode(
    x = 'year',
    y = 'count'
)

### Area chart

In [42]:
# Area chart
alt.Chart(df_generation).mark_area().encode(
    x = 'year',
    y = 'count'
)

#### Scatterplot
Scatterplots are well suited to showing relationships between numeric values.

In [None]:
df.head(2)

In [None]:
# The taller, the heavier?

alt.Chart(df).mark_point().encode(
    x = 'weight_kg',
    y = 'height_m', 
    tooltip = ['name', 'weight_kg', 'height_m']
)

In [None]:
# Pick two columns of your own and compare them (fill in the empty strings)

alt.Chart(df).mark_point().encode(
    x = '',
    y = '', 
    tooltip = ['name', '', '']
)

### Boxplot
Boxplots show the distribution of values per category.

In [None]:
df.head(1)

In [None]:
alt.Chart(df).mark_boxplot().encode(
    x = 'type1', 
    y = 'attack', 
    tooltip=['name', 'type1']
)

### Heatmap
Heatmaps are well suited to representing values as color when the two axes are directly related to each other. They are used mainly for temporal data, e.g. to show values over a year by day and month.

In [None]:
# Our dataframe has no time data yet, so we add fictitious dates for when we caught each Pokémon.
import numpy as np

# series of random dates, one per row
df_raw["capture_date"] = pd.to_datetime(
    np.random.choice(pd.date_range("2024-01-01", "2024-12-31"), size=len(df_raw))
)

# working copy used in the following cells
df_capture = df_raw.copy()

In [None]:
df_raw.head(2)

In [None]:
# make sure "capture_date" has a datetime type
df_raw.dtypes

In [None]:
df_capture_grouped = df_capture.groupby('capture_date')['name'].count().reset_index()
df_capture_grouped = df_capture_grouped.rename(columns={'name':'count'})

In [None]:
df_capture_grouped

In [None]:
alt.Chart(df_capture_grouped).mark_rect().encode(
    x = alt.X('date(capture_date):O'),
    y = alt.Y('month(capture_date):O'),
    color = alt.Color('count'), 
    tooltip = ['capture_date', 'count']
)

## Quiz


In [None]:
df

1. Create a bar chart of the 10 slowest Pokémon. 
* x-axis: speed
* y-axis: name


2. Show the average attack values by year using the df `df_mean_attack_year`

In [None]:
# New dataframe that groups by year and computes the mean of 'attack'
df_mean_attack_year = df_raw.groupby('year')['attack'].mean().reset_index()
df_mean_attack_year

In [None]:
alt.Chart(df_mean_attack_year).mark_line().encode(
    x = 'year', 
    y = 'attack'
)

In [None]:
df

3. Create a scatterplot of speed vs. hp. (based on: df)
* Add a tooltip with name, speed & hp.

In [None]:
alt.Chart(df).mark_point().encode(
    x = 'speed',
    y = 'hp',
    tooltip=['name', 'hp', 'speed']
)

4. Create a boxplot with type2 on the x-axis and capture_rate on the y-axis (based on: df)

In [None]:
df.head(1)

In [None]:
alt.Chart(df).mark_boxplot().encode(
    x = 'type2', 
    y = 'capture_rate', 
    tooltip='name'
)

5. Bonus: create a heatmap with type1 and type2 as axes. <br>
Tip: use `color = alt.Color('count(name)')` to color each cell by the number of values it contains

In [None]:
alt.Chart(df).mark_rect().encode(
    x = 'type1', 
    y = 'type2', 
    color = alt.Color('count(name)'),
    tooltip = ['count(name)']
)