<a href="https://colab.research.google.com/github/mehrnazh/PythonVisualization/blob/main/Python_Guide_to_data__visualization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Visualization using Altair and Plotly


By Mehrnaz Hosseinzadeh M.D., National Brain Centre, Mental Health Research Centre, IUMS

## Data Visualization
Data visualization is the graphic representation of data. It involves producing images that communicate relationships among the represented data to viewers. Visualizing data is an essential part of data analysis and machine learning, but choosing the right type of visualization is often challenging. This guide provides an introduction to popular data visualization techniques by presenting sample use cases and providing code examples in Python.

Types of graphs covered:

 - Line graph
 - Scatter plot
 - Histogram and Frequency Distribution
 - Heatmap
 - Box Plot
 - Bar Chart

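Altair assigns each data field a type, written as a one-letter code after the field name (for example, `Year:O`):

T: Temporal (date-time)

Q: Quantitative

O: Ordinal (ordered)

N: Nominal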

## Import libraries

- [Matplotlib](https://matplotlib.org/): Plotting and visualization library for Python. We'll use the `pyplot` module from `matplotlib`. By convention, it is imported as `plt`.
- [Seaborn](https://seaborn.pydata.org/): An easy-to-use visualization library that builds on top of Matplotlib and lets you create beautiful charts with just a few lines of code.

In [None]:
# Install the required libraries (comment out the next line if they are already installed)
!pip install altair plotly --upgrade --quiet

In [2]:
import pandas as pd
import numpy as np
import altair as alt
import plotly.express as px
import plotly.graph_objects as go
import matplotlib.pyplot as plt
import seaborn as sns

## Line Chart
A line chart displays information as a series of data points or markers connected by straight lines. You can customize the shape, size, color and other aesthetic elements of the markers and lines for better visual clarity.

### Example

We'll create a line chart to compare the expression of 2 genes over 12 years in an imaginary population.

In [None]:
# Sample data
years = range(2000, 2012)
gene_a = [0.895, 0.91, 0.919, 0.926, 0.929, 0.931, 0.934, 0.936, 0.937, 0.9375, 0.9372, 0.939]
gene_b = [0.962, 0.941, 0.930, 0.923, 0.918, 0.908, 0.907, 0.904, 0.901, 0.898, 0.9, 0.896]


# Preparing data for Altair
data = pd.DataFrame({
    'Year': list(years) * 2,
    'Expression Level': gene_a + gene_b,
    'Gene': ['Gene A'] * len(gene_a) + ['Gene B'] * len(gene_b)
})

data

Unnamed: 0,Year,Expression Level,Gene
0,2000,0.895,Gene A
1,2001,0.91,Gene A
2,2002,0.919,Gene A
3,2003,0.926,Gene A
4,2004,0.929,Gene A
5,2005,0.931,Gene A
6,2006,0.934,Gene A
7,2007,0.936,Gene A
8,2008,0.937,Gene A
9,2009,0.9375,Gene A


In [None]:
# Creating the Altair plot
chart = alt.Chart(data).mark_line(point=True).encode(
    x='Year:O',
    y='Expression Level:Q',
    color='Gene:N'
)
chart

In [None]:
# Customized version: fixed y-axis range, a dash pattern per gene, a title, and larger fonts
chart = alt.Chart(data).mark_line(point=True).encode(
    x='Year:O',
    y=alt.Y('Expression Level:Q', scale=alt.Scale(domain=[0.8, 1])),
    color='Gene:N',
    strokeDash='Gene:N'
).properties(
    title='Gene Expression Levels Over Time'
).configure_mark(
    size=12,  # Marker size
    strokeWidth=4  # Line width
).configure_title(
    fontSize=18  # Title font size
).configure_axis(
    labelFontSize=14,
    titleFontSize=14
)

chart

## Scatter Plot
In a scatter plot, the values of 2 variables are plotted as points on a 2-dimensional grid. Additionally, you can use a third variable to determine the size or color of the points.

### Example
The [cars dataset](https://github.com/altair-viz/vega_datasets/tree/master/vega_datasets/_data) provides features of different cars. It ships with the `vega_datasets` package and can be loaded as a `pandas` dataframe.

In [None]:
import altair as alt
from vega_datasets import data

# Load the cars dataset
cars = data.cars()

# Create a scatter plot
scatter_plot = alt.Chart(cars).mark_circle(size=100).encode(
    x='Horsepower:Q',         # X-axis
    y='Miles_per_Gallon:Q',   # Y-axis
    color='Origin:N'          # Dot color based on car origin
).properties(
    title="Car Horsepower vs. MPG"
)

# Display the plot
scatter_plot




## Histogram and Frequency Distribution

A histogram represents the distribution of data by forming bins along the range of the data and then drawing bars to show the number of observations that fall in each bin.
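
The binning step described above can be sketched with NumPy. The values below are made up for illustration, and the choice of three equal-width bins over the range 10 to 40 is an assumption made to keep the output short.

```python
import numpy as np

# Hypothetical fuel-economy readings, invented for this sketch
values = np.array([12, 15, 18, 18, 21, 24, 24, 25, 31, 38])

# Split the range 10-40 into three equal-width bins and count the values in each
counts, edges = np.histogram(values, bins=3, range=(10, 40))

for left, right, n in zip(edges[:-1], edges[1:], counts):
    print(f"{left:.0f}-{right:.0f}: {n} observations")
```

Each printed line corresponds to one bar of a histogram: the bin gives the bar's position and the count gives its height.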

### Example
We can use a histogram to visualize how the values of Miles per Gallon are distributed.

In [None]:
# Create a distribution (histogram) of the 'Miles_per_Gallon' column
histogram = alt.Chart(cars).mark_bar(opacity=0.6).encode(
    alt.X('Miles_per_Gallon:Q', bin=True),  # X-axis: Bin the data for a histogram
    y='count()',                            # Y-axis: Count of records in each bin
).properties(
    title="Distribution of Miles per Gallon"
)

# Display the plot
histogram

We can immediately see that the cars' MPG values fall in the range 10 to 45. We can also look at this data as a frequency **distribution**, where the values on the Y-axis are percentages instead of counts.
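
Turning counts into percentages only requires dividing each bin count by the total number of observations. A minimal sketch, using hypothetical bin counts rather than the real MPG data:

```python
import numpy as np

# Hypothetical bin counts, invented for this sketch
bin_counts = np.array([4, 4, 2])

# Share of all observations that falls in each bin, in percent
percentages = 100 * bin_counts / bin_counts.sum()

print(percentages)
```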

Kernel Density Estimate (KDE) is a statistical technique for estimating the probability density function of a continuous random variable. It produces a smooth, continuous curve that represents the distribution of the data, which gives a smoothed alternative to the histogram.

**What is KDE?**

KDE helps you understand the distribution of a variable by smoothing the individual data points into a continuous curve.

**How It Works:** A small smooth bump (the kernel) is centered on every data point, and the bumps are added together. The resulting curve is high where data points are concentrated and low where they are sparse.


**Visualization:**

KDE plots are often used to visualize the distribution of data. They provide a clearer view of the data’s structure compared to histograms, especially when comparing multiple distributions.
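
To make the idea concrete, here is a minimal sketch of a Gaussian KDE written with NumPy. The sample values and the bandwidth of 2.0 are assumptions picked by hand for illustration, and the helper name `gaussian_kde_curve` is hypothetical. Plotting libraries normally choose the bandwidth for you.

```python
import numpy as np

def gaussian_kde_curve(samples, grid, bandwidth):
    """Average of Gaussian bumps centered on each sample, evaluated on grid."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # Standardized distance from every grid point to every sample
    z = (grid[:, None] - samples[None, :]) / bandwidth
    bumps = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)
    # Sum the bumps and normalize so the curve integrates to about 1
    return bumps.sum(axis=1) / (len(samples) * bandwidth)

samples = [14, 15, 18, 22, 23, 24, 31]   # hypothetical values
grid = np.linspace(0, 45, 451)           # points where the curve is evaluated
density = gaussian_kde_curve(samples, grid, bandwidth=2.0)

# Approximate area under the curve; it should be close to 1
area = density.sum() * (grid[1] - grid[0])
print(round(area, 3))
```

A larger bandwidth gives a smoother, flatter curve, while a smaller one follows the individual data points more closely.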

In [None]:
# Kernel Density Estimate (KDE) plot for the distribution of miles per gallon
kde = alt.Chart(cars).transform_density(
    density='Miles_per_Gallon',
    as_=['Miles_per_Gallon', 'density']
).mark_line().encode(
    x='Miles_per_Gallon:Q',
    y='density:Q'
).properties(
    title="Density of Miles_per_Gallon"
)

kde

In [None]:
# Combine both charts using '+' operator
combined_chart = histogram + kde

# Adjust properties
combined_chart = combined_chart.properties(
    title="Distribution and Density of Miles per Gallon"
).resolve_scale(
    y='independent'  # Use independent y-scales for both charts
).configure_title(
    fontSize=14,
    anchor='middle',
    color='black'
)

combined_chart

## Heatmap

A heatmap is used to visualize 2-dimensional data like a matrix or a table using colors.

### Example
We'll use the Seattle weather dataset from `vega_datasets` to visualize the average daily maximum temperature for each month from 2012 to 2015.

In [None]:
from vega_datasets import data
weather = data.seattle_weather()
weather

Unnamed: 0,date,precipitation,temp_max,temp_min,wind,weather
0,2012-01-01,0.0,12.8,5.0,4.7,drizzle
1,2012-01-02,10.9,10.6,2.8,4.5,rain
2,2012-01-03,0.8,11.7,7.2,2.3,rain
3,2012-01-04,20.3,12.2,5.6,4.7,rain
4,2012-01-05,1.3,8.9,2.8,6.1,rain
...,...,...,...,...,...,...
1456,2015-12-27,8.6,4.4,1.7,2.9,fog
1457,2015-12-28,1.5,5.0,1.7,1.3,fog
1458,2015-12-29,0.0,7.2,0.6,2.6,fog
1459,2015-12-30,0.0,5.6,-1.0,3.4,sun


In [None]:
# extract month and year information
weather['month'] = weather['date'].dt.month
weather['year'] = weather['date'].dt.year

# Aggregate the data to get average temperature for each combination of month and year
aggregated_data = weather.groupby(['month', 'year']).agg({'temp_max': 'mean'}).reset_index()

# Create a heatmap with month and year on the axes, and average temperature represented by color
heatmap = alt.Chart(aggregated_data).mark_rect().encode(
    x=alt.X('year:O', title='Year'),
    y=alt.Y('month:O', title='Month'),
    color=alt.Color('temp_max:Q', scale=alt.Scale(scheme='viridis'), title='Avg Max Temp (°C)'),
    tooltip=['year:O', 'month:O', 'temp_max:Q']
).properties(
    title="Heatmap of Average Maximum Temperature by Month and Year"
)

# Display the chart
heatmap.configure_title(
    fontSize=16,
    anchor='middle',
    color='black'
).configure_axis(
    labelFontSize=12,
    titleFontSize=14
)


In [None]:
# Create a heatmap with mean Acceleration as the color
heatmap = alt.Chart(cars).mark_rect().encode(
    x=alt.X('Horsepower:Q', bin=alt.Bin(maxbins=20), title='Horsepower'),
    y=alt.Y('Miles_per_Gallon:Q', bin=alt.Bin(maxbins=20), title='Miles per Gallon'),
    color=alt.Color('mean(Acceleration):Q', scale=alt.Scale(scheme='blues'), title='Mean Acceleration'),
    # Only aggregated fields here: raw fields would be added to the grouping and split the cells
    tooltip=['mean(Acceleration):Q', 'count()']
).properties(
    title="Heatmap of Horsepower vs Miles per Gallon with Acceleration as Color"
)


# Overlay the mean acceleration of each cell as a text label
text = heatmap.mark_text(baseline='middle').encode(
    text=alt.Text('mean(Acceleration):Q', format='.1f'),
    color=alt.value('black')
)

# Layer the two charts first, then configure and display the combined chart
(heatmap + text).configure_title(
    fontSize=16,
    anchor='middle',
    color='black'
).configure_axis(
    labelFontSize=12,
    titleFontSize=14
).interactive()



## Box Plot
A box plot shows the distribution of data along a single axis, using a "box" and "whiskers". The lower end of the box represents the 1st quartile (i.e. 25% of values are below it), and the upper end of the box represents the 3rd quartile (i.e. 25% of values are above it). The median value is represented via a line inside the box. The "whiskers" represent the minimum & maximum values (sometimes excluding outliers, which are represented as dots).
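
The quantities a box plot draws can be computed directly. The sketch below uses made-up values and the common 1.5 × IQR rule for flagging outliers. That rule is an assumption here, since the exact whisker definition varies between tools.

```python
import numpy as np

# Hypothetical values, invented for this sketch
values = np.array([14, 15, 18, 22, 23, 24, 31, 44])

# Box: 1st quartile, median, and 3rd quartile
q1, median, q3 = np.percentile(values, [25, 50, 75])

# Points beyond 1.5 * IQR from the box are treated as outliers
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(q1, median, q3)
print(outliers)
```

Here the whiskers would stop at the smallest and largest values that still lie inside the fences, and the outliers would be drawn as separate dots.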

### Example
We'll use the sample cars dataset to generate a simple box plot of Miles_per_Gallon (MPG) grouped by the number of cylinders in each car.

In [None]:
import altair as alt
from vega_datasets import data

# Load the cars dataset
cars = data.cars()

# Create a basic box plot of MPG by number of cylinders
box_plot = alt.Chart(cars).mark_boxplot().encode(
    x=alt.X('Cylinders:O', title='Number of Cylinders'),
    y=alt.Y('Miles_per_Gallon:Q', title='Miles Per Gallon')
).properties(
    title="Box Plot of MPG by Number of Cylinders"
)

# Display the chart
box_plot

## Bar Chart
A bar chart represents categorical data with rectangular bars whose lengths are proportional to the values they represent.

### Example
We'll plot the average MPG for each number of cylinders in the cars dataset.


In [None]:
import altair as alt
from vega_datasets import data

# Load the cars dataset
cars = data.cars()

# Create a bar chart of average MPG by number of cylinders
bar_chart = alt.Chart(cars).mark_bar().encode(
    x=alt.X('Cylinders:O', title='Number of Cylinders'),
    y=alt.Y('mean(Miles_per_Gallon):Q', title='Average Miles Per Gallon'),
    color=alt.Color('Cylinders:O', scale=alt.Scale(scheme='set1'), title='Number of Cylinders'),
    tooltip=['Cylinders:O', 'mean(Miles_per_Gallon):Q']
).properties(
    title="Average MPG by Number of Cylinders"
)

# Display the chart
bar_chart.configure_title(
    fontSize=16,
    anchor='middle',
    color='black'
).configure_axis(
    labelFontSize=12,
    titleFontSize=14
)


In [None]:
cars

Unnamed: 0,Name,Miles_per_Gallon,Cylinders,Displacement,Horsepower,Weight_in_lbs,Acceleration,Year,Origin
0,chevrolet chevelle malibu,18.0,8,307.0,130.0,3504,12.0,1970-01-01,USA
1,buick skylark 320,15.0,8,350.0,165.0,3693,11.5,1970-01-01,USA
2,plymouth satellite,18.0,8,318.0,150.0,3436,11.0,1970-01-01,USA
3,amc rebel sst,16.0,8,304.0,150.0,3433,12.0,1970-01-01,USA
4,ford torino,17.0,8,302.0,140.0,3449,10.5,1970-01-01,USA
...,...,...,...,...,...,...,...,...,...
401,ford mustang gl,27.0,4,140.0,86.0,2790,15.6,1982-01-01,USA
402,vw pickup,44.0,4,97.0,52.0,2130,24.6,1982-01-01,Europe
403,dodge rampage,32.0,4,135.0,84.0,2295,11.6,1982-01-01,USA
404,ford ranger,28.0,4,120.0,79.0,2625,18.6,1982-01-01,USA


In [None]:
import altair as alt
from vega_datasets import data

# Load the cars dataset
cars = data.cars()

# Create a bar chart of average MPG by number of cylinders, segmented by origin
bar_chart = alt.Chart(cars).mark_bar().encode(
    x=alt.X('Cylinders:O', title='Number of Cylinders'),
    y=alt.Y('mean(Miles_per_Gallon):Q', title='Average Miles Per Gallon'),
    color=alt.Color('Origin:N', scale=alt.Scale(scheme='category10'), title='Origin'),
    column=alt.Column('Origin:N', title='Origin'),
    tooltip=['Cylinders:O', 'mean(Miles_per_Gallon):Q', 'Origin:N']
).properties(
    title="Average MPG by Number of Cylinders and Origin"
).configure_title(
    fontSize=16,
    anchor='middle',
    color='black'
).configure_axis(
    labelFontSize=12,
    titleFontSize=14
)

# Display the chart
bar_chart


In [None]:
# Create a bar chart of average MPG by origin, faceted into one column per number of cylinders
bar_chart = alt.Chart(cars).mark_bar().encode(
    x=alt.X('Origin:N', title='Origin'),
    y=alt.Y('mean(Miles_per_Gallon):Q', title='Average Miles Per Gallon'),
    color=alt.Color('Cylinders:O', scale=alt.Scale(scheme='category10'), title='number of cylinders'),
    column=alt.Column('Cylinders:O', title='number of cylinders'),
    tooltip=['Cylinders:O', 'mean(Miles_per_Gallon):Q', 'Origin:N']
).properties(
    title="Average MPG by Number of Cylinders and Origin"
).configure_title(
    fontSize=16,
    anchor='middle',
    color='black'
).configure_axis(
    labelFontSize=12,
    titleFontSize=14
)

# Display the chart
bar_chart

In [None]:
from sklearn.datasets import load_wine

wine = load_wine()

wine_df = pd.DataFrame(wine.data, columns=wine.feature_names)

wine_df["Category"] = ["Category_%d"%(cat+1) for cat in wine.target]

wine_df.head()

Unnamed: 0,alcohol,malic_acid,ash,alcalinity_of_ash,magnesium,total_phenols,flavanoids,nonflavanoid_phenols,proanthocyanins,color_intensity,hue,od280/od315_of_diluted_wines,proline,Category
0,14.23,1.71,2.43,15.6,127.0,2.8,3.06,0.28,2.29,5.64,1.04,3.92,1065.0,Category_1
1,13.2,1.78,2.14,11.2,100.0,2.65,2.76,0.26,1.28,4.38,1.05,3.4,1050.0,Category_1
2,13.16,2.36,2.67,18.6,101.0,2.8,3.24,0.3,2.81,5.68,1.03,3.17,1185.0,Category_1
3,14.37,1.95,2.5,16.8,113.0,3.85,3.49,0.24,2.18,7.8,0.86,3.45,1480.0,Category_1
4,13.24,2.59,2.87,21.0,118.0,2.8,2.69,0.39,1.82,4.32,1.04,2.93,735.0,Category_1


In [None]:
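# Average every measurement per wine category (must be defined before plotting)
avg_wine_df = wine_df.groupby(by="Category").mean().reset_index()

alt.Chart(avg_wine_df).mark_bar(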
    color='tomato'
).encode(
    x = 'Category', y = 'malic_acid'
).properties(
    width=300, height=300,
    title="Avg Malic Acid per Wine Category"
)

In [None]:
avg_wine_df = wine_df.groupby(by="Category").mean().reset_index()

alt.Chart(avg_wine_df).mark_bar(
    color='dodgerblue'
).encode(
    x = 'proline', y = 'Category'
).properties(
    width=300, height=300,
    title="Avg Proline per Wine Category"
)

In [None]:
melted_wine_df = wine_df.melt(id_vars=['Category'],
                              value_vars=["malic_acid", "total_phenols", "flavanoids", "hue", "color_intensity", "proanthocyanins", ],
                              var_name="Ingredients", value_name="Value")

melted_wine_df.head()

Unnamed: 0,Category,Ingredients,Value
0,Category_1,malic_acid,1.71
1,Category_1,malic_acid,1.78
2,Category_1,malic_acid,2.36
3,Category_1,malic_acid,1.95
4,Category_1,malic_acid,2.59


In [None]:
alt.Chart(melted_wine_df).mark_bar().encode(
    x = 'Category', y = 'mean(Value)', color="Ingredients",
).properties(
    height=350, width=350,
    title="Average Ingredients per Wine Category"
)

In [None]:
ingredients = ["Category", "malic_acid", "total_phenols", "flavanoids", "hue", "color_intensity", "proanthocyanins", ]
avg_wine_df = wine_df[ingredients].groupby(by="Category").mean().T.reset_index().rename(columns={"index": "Ingredients"})
avg_wine_df = avg_wine_df[:11]

avg_wine_df

Category,Ingredients,Category_1,Category_2,Category_3
0,malic_acid,2.010678,1.932676,3.33375
1,total_phenols,2.840169,2.258873,1.67875
2,flavanoids,2.982373,2.080845,0.781458
3,hue,1.062034,1.056282,0.682708
4,color_intensity,5.528305,3.08662,7.39625
5,proanthocyanins,1.899322,1.630282,1.153542


In [None]:
alt.Chart(avg_wine_df).mark_arc().encode(
    theta=alt.Theta(field="Category_1", type="quantitative"),
    color=alt.Color(field="Ingredients", type="nominal"),
    tooltip=["Ingredients", "Category_1"] ## Displays tooltip
).properties(
    height=400, width=400,
    title="Avg. Ingredients Distribution for Category 1 Wine"
)

In [None]:
alt.Chart(avg_wine_df).mark_arc(innerRadius=80).encode(
    theta=alt.Theta(field="Category_1", type="quantitative"),
    color=alt.Color(field="Ingredients", type="nominal"),
    tooltip=["Ingredients", "Category_1"] ## Displays tooltip
).properties(
    height=400, width=400,
    title="Avg. Ingredients Distribution for Category 1 Wine"
)

In [None]:
alt.Chart(avg_wine_df).mark_arc(innerRadius=15, stroke="#fff").encode(
    theta=alt.Theta("Category_1", stack=True),
    radius=alt.Radius("Category_1", scale=alt.Scale(type="sqrt", zero=True)),
    color=alt.Color("Ingredients"),
    tooltip=["Ingredients", "Category_1"] ## Displays tooltip
).properties(
    height=400, width=400,
    title="Avg. Ingredients Distribution for Category 1 Wine"
)

In [None]:
alt.Chart(wine_df).mark_circle().encode(
    alt.X(alt.repeat("column"), type='quantitative', scale=alt.Scale(zero=False)),
    alt.Y(alt.repeat("row"), type='quantitative', scale=alt.Scale(zero=False)),
    color='Category:N'
).properties(
    width=150,
    height=150,
).repeat(
    row=['alcohol', 'malic_acid', 'proline'],
    column=['alcohol', 'malic_acid', 'proline']
).properties(
    title="ScatterMatrix of 'alcohol', 'malic_acid', 'proline'"
).interactive()

## Gantt Chart
This example shows how to make a simple Gantt chart.

In [3]:
source = pd.DataFrame([
    {"task": "A", "start": 1, "end": 3},
    {"task": "B", "start": 3, "end": 8},
    {"task": "C", "start": 8, "end": 10}
])

alt.Chart(source).mark_bar().encode(
    x='start',
    x2='end',
    y='task'
)



In [None]:
source = alt.topo_feature(data.world_110m.url, 'countries')

base = alt.Chart(source).mark_geoshape(
    fill='#666666',
    stroke='white'
).properties(
    width=300,
    height=180
)

projections = ['equirectangular', 'mercator', 'orthographic', 'gnomonic']
charts = [base.project(proj).properties(title=proj)
          for proj in projections]

alt.concat(*charts, columns=2)

In [None]:
source = alt.topo_feature(data.world_110m.url, 'countries')

background = alt.Chart(source).mark_geoshape(
    fill='lightgray',
    stroke='white'
).properties(
    width=500,
    height=300
).project('naturalEarth1')

background

In [None]:
starbucks_locations = pd.read_csv("~/datasets/starbucks_store_locations.csv")
starbucks_locations.head()

In [None]:
# Select the numeric columns before averaging so text columns are not passed to mean()
mean_long_lat = starbucks_locations.groupby(by="State/Province")[["Longitude", "Latitude"]].mean()
count_per_state  = starbucks_locations.groupby(by="State/Province").count()[["Store Number"]].rename(columns={"Store Number":"Count"})

count_per_state = count_per_state.join(mean_long_lat).reset_index()
count_per_state.head()

In [None]:
# Use the longitude/latitude channels so the points follow the map projection of the background
points  = alt.Chart(count_per_state).mark_circle(
    color="tomato"
).encode(
    longitude="Longitude:Q", latitude="Latitude:Q", size="Count:Q",
    tooltip = ["State/Province", "Count"]
)

In [None]:
background + points


In [None]:
airports = data.airports.url
states = alt.topo_feature(data.us_10m.url, feature='states')

# US states background
background = alt.Chart(states).mark_geoshape(
    fill='lightgray',
    stroke='white'
).properties(
    width=500,
    height=300
).project('albersUsa')

# airport positions on background
points = alt.Chart(airports).transform_aggregate(
    latitude='mean(latitude)',
    longitude='mean(longitude)',
    count='count()',
    groupby=['state']
).mark_circle().encode(
    longitude='longitude:Q',
    latitude='latitude:Q',
    size=alt.Size('count:Q', title='Number of Airports'),
    color=alt.value('steelblue'),
    tooltip=['state:N','count:Q']
).properties(
    title='Number of airports in US'
)

background + points