<a href="https://colab.research.google.com/github/mehrnazh/PythonVisualization/blob/main/Introdution_to_Data_Visualization_with_Altair.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction to Data Visualization with Altair

By Mehrnaz Hosseinzadeh M.D., National Brain Centre, Mental Health Research Centre, IUMS


---



# Installation

In [None]:
!pip install altair
!pip install vega_datasets



We are going to use datasets from the vega_datasets package, which the commands above install along with Altair. Next, import Altair and pandas:

In [None]:
# Importing altair and pandas library
import altair as alt
import pandas as pd

# Example

## Example 1: Simple Bar Chart

Let's create a pandas DataFrame and inspect it.

In [None]:
# Making a Pandas DataFrame
score_data = pd.DataFrame({
    'Website': ['StackOverflow', 'FreeCodeCamp',
                'GeeksForGeeks', 'MDN', 'CodeAcademy'],
    'Score': [65, 50, 99, 75, 33]
})

score_data

|   | Website | Score |
|---|---------------|-------|
| 0 | StackOverflow | 65 |
| 1 | FreeCodeCamp | 50 |
| 2 | GeeksForGeeks | 99 |
| 3 | MDN | 75 |
| 4 | CodeAcademy | 33 |


Now let's make an Altair chart for it.

In [None]:
# Making the Simple Bar Chart
alt.Chart(score_data).mark_bar().encode(
    # Mapping the Website column to x-axis
    x='Website',
    # Mapping the Score column to y-axis
    y='Score'
)

Think!
* What is a DataFrame in Pandas? Why do we use it?
* How would you add a new column to this DataFrame, for example, a column for the type of resource (forum, tutorial, documentation)?
* Can you modify the 'Score' values to change the data?
* What does alt.Chart() do?
* What is the purpose of mark_bar() in the code?
* How does the encode() function work in Altair? What does mapping columns to axes mean?


Experimentation
* Add a new column called 'Type' with values ['Forum', 'Tutorial', 'Tutorial', 'Documentation', 'Tutorial'].
* Modify the chart to display the 'Type' column on the x-axis and 'Score' on the y-axis.
* Change the color of the bars based on the 'Website' column.
* How would you sort the bars in descending order of the 'Score'?
* Can you think of other ways to visualize this data?

In [None]:
# start experimenting with the code here

In [None]:
# @title click for solution

score_data['Type'] = ['Forum', 'Tutorial', 'Tutorial', 'Documentation', 'Tutorial']

alt.Chart(score_data).mark_bar().encode(
    x='Type',
    y='Score',
    color='Type'
)

## Example 2: Scatter Plot

In this example, we will visualize the iris dataset from the vega_datasets library as a scatter plot. The mark method used for a scatter plot is mark_point(). For this bivariate analysis, we map the sepalLength and petalLength columns to the x and y encodings. To differentiate the points from each other, we map the color encoding to the species column (the shape encoding would work as well).

In [None]:
# Import data object from vega_datasets
from vega_datasets import data

# Selecting the data
iris = data.iris()

# Making the Scatter Plot
alt.Chart(iris).mark_point().encode(
    # Map the sepalLength to x-axis
    x='sepalLength',
    # Map the petalLength to y-axis
    y='petalLength',
    # Map the species to color
    color='species'   # you can also encode species as a different shape
)

Think!
* What does the vega_datasets library provide? Why do we use it?
* What kind of data does the iris dataset contain? What are its features?
* How would you describe the purpose of a scatter plot?
* What is the purpose of mark_point() in the code?
* How does the encode() function work in Altair? What does mapping columns to visual properties mean?
* How does encoding the species column as color or shape help in visualizing the data?

Experimentation
1. Print the first few rows of the iris dataset to understand its structure.
2. Modify the scatter plot to map the species column to the shape of the points instead of the color.
3. Combine both color and shape encodings for the species column to see how it enhances the visualization.
4. Change the x-axis to sepalWidth and observe how the scatter plot changes.
5. Add tooltips to the scatter plot to display sepalLength, sepalWidth, and species when hovering over a point.

In [None]:
# start experimenting with the code here

In [None]:
# @title click for solution 1-3
print(iris.head())

alt.Chart(iris).mark_point().encode(
    x='sepalLength',
    y='petalLength',
    color='species',
    shape='species'
)

   sepalLength  sepalWidth  petalLength  petalWidth species
0          5.1         3.5          1.4         0.2  setosa
1          4.9         3.0          1.4         0.2  setosa
2          4.7         3.2          1.3         0.2  setosa
3          4.6         3.1          1.5         0.2  setosa
4          5.0         3.6          1.4         0.2  setosa


In [None]:
# @title click for solution 4
alt.Chart(iris).mark_point().encode(
    x='sepalWidth',
    y='petalLength',
    color='species'
)


In [None]:
# @title click for solution 5
alt.Chart(iris).mark_point().encode(
    x='sepalLength',
    y='petalLength',
    color='species',
    tooltip=['sepalLength', 'sepalWidth', 'species']
)

## Bonus: Explore Vega Datasets

https://cdn.jsdelivr.net/npm/vega-datasets@v1.29.0/data/

[Descriptions](https://github.com/vega/vega-datasets/blob/main/SOURCES.md)

# Grouped Bar charts

## Grouped Bar charts example 1

Let's take an example: suppose we want to compare the runs made by two players across three formats.
Here,
* runs scored by the players act as **values**,
* Player name acts as a **series** and
* the format of the game acts as the **categories**.

The runs scored by each player always have the same color across the different formats.


As discussed, we need at least three columns (values, series, and categories). Let's start by creating a dataset with three columns using the pandas library.

In [None]:
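# creating a custom dataframe
# note: this rebinds the name `data`, which shadows the `data` object imported from vega_datasets above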
data = pd.DataFrame([[264, 'Rohit', 'ODI'],
                     [183, 'Virat', 'ODI'],
                     [118, 'Rohit', 'T20'],
                     [94, 'Virat', 'T20'],
                     [212, 'Rohit','Test'],
                     [254, 'Virat','Test']],
                     columns=['Highest Score', 'Player', 'Format'])
data

|   | Highest Score | Player | Format |
|---|---------------|--------|--------|
| 0 | 264 | Rohit | ODI |
| 1 | 183 | Virat | ODI |
| 2 | 118 | Rohit | T20 |
| 3 | 94 | Virat | T20 |
| 4 | 212 | Rohit | Test |
| 5 | 254 | Virat | Test |


Now we have a dataset containing three columns, and we want to compare the highest score (values) of two players (series) across different formats (categories).

In [None]:
# start here; hint: one encoding that is used in Altair is "column"

In [None]:
# @title click for solution

gp_chart = alt.Chart(data).mark_bar().encode(
    alt.Column('Format'),
    alt.X('Player'),  # here we pass channel objects (alt.X, alt.Y, ...) as positional arguments rather than keyword arguments such as x='Player'
    alt.Y('Highest Score', axis=alt.Axis(grid=False)),  # as you see, channel objects are more explicit and allow additional customization through their parameters
    alt.Color('Player'))

gp_chart.display() # display the chart


In [None]:
# @title simpler way to make it
gp_chart = alt.Chart(data).mark_bar().encode(
    column='Format',  # defining encodings this way uses keyword arguments
    x='Player',
    y='Highest Score',  # note that a plain string shorthand offers less customization
    color='Player'
)

gp_chart

Think!
* Can we use both positional arguments and keyword arguments when defining one chart?
* How does using positional arguments differ from keyword arguments in the encode() function? Why might you use one over the other?
* What does alt.Column('Format') do? How does it affect the layout of the chart?
* How does alt.Axis(grid=False) modify the appearance of the Y-axis?
* Why is alt.Color('Player') used in this chart? How does it enhance the visualization?


Experimentation


*   Add tooltips to the bars to display Player, Format, and Highest Score when hovering over a bar.
*   Apply sorting to the X-axis by Highest Score to display players in order of their scores.
*   Change the X-axis to display Format and the columns to display Player.
*   Modify the chart to use alt.Row('Player') instead of alt.Column('Player') and observe how the layout changes.
*   Experiment with different mark types, such as mark_circle(), to see how the visualization changes.





### Answer to Think 1:
```
alt.Chart(df).mark_bar().encode(
    x=alt.X('Variable 1', sort='-y'),
    y='Variable 2'
)
```



In [None]:
# @title click for solution
gp_chart = alt.Chart(data).mark_bar().encode(
    alt.Row('Player'),
    alt.X('Format', sort='-y'),
    alt.Y('Highest Score', axis=alt.Axis(grid=False)),
    alt.Color('Player'),
    alt.Tooltip(['Player', 'Format', 'Highest Score'])
    )

gp_chart.display()