# Introduction to Data Visualization in Python

This is a very brief introduction to data visualization in Python. GSB 544 will cover this material in much more detail.

We'll use the cereals data set.


In [None]:
import pandas as pd

In [None]:
df_cereals = pd.read_csv("https://raw.githubusercontent.com/kevindavisross/msba-workshop/refs/heads/main/cereals.csv")

In [None]:
df_cereals

## Pandas/Matplotlib

Pandas has some basic plotting capability built in, based on the [Matplotlib](https://matplotlib.org/) package.

### Bar plots

In [None]:
df_cereals["mfr"].value_counts()

In [None]:
df_cereals["mfr"].value_counts().plot.bar()

In [None]:
df_cereals["mfr"].value_counts().sort_index().plot.bar()

In [None]:
df_cereals["protein"].value_counts().plot.bar()

In [None]:
df_cereals["protein"].value_counts().sort_index().plot.bar()
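The paired bar plots above differ only in the order of the bars. A minimal sketch on toy data suggests why: `value_counts()` returns counts sorted from most to least frequent, while `.sort_index()` reorders them by label, which is usually the more readable order for a numeric variable such as `protein`. The Series `mfr_toy` below is hypothetical and made up for illustration; it is not taken from the cereals file.

```python
import pandas as pd

# Hypothetical toy data standing in for a categorical column such as "mfr".
mfr_toy = pd.Series(["K", "G", "K", "P", "G", "K"], name="mfr")

# value_counts() orders the result by count, largest first.
counts = mfr_toy.value_counts()

# sort_index() reorders the same counts by category label instead.
counts_by_label = counts.sort_index()

print(counts.index.tolist())           # ['K', 'G', 'P']
print(counts_by_label.index.tolist())  # ['G', 'K', 'P']
```

Calling `.plot.bar()` on either result draws the bars in the order of its index.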

### Histograms

In [None]:
df_cereals["rating"].plot.hist()

### Scatterplots

In [None]:
df_cereals.plot.scatter(x = "sugars", y = "rating")

# Grammar of Graphics

The **grammar of graphics** organizes principles of data visualization into a coherent philosophy. Roughly, the grammar of graphics says that every plot can be described by just a few components:

- the data
- aesthetics, more precisely, a mapping of the data to aesthetic elements
- geometric objects (e.g., points, lines, bars)
- statistical transformations (e.g., binning and counting for a histogram)
- and a few other things (scales, coordinate system, etc.)



![](https://github.com/kevindavisross/data301/blob/main/images/bertin-graphics.jpg?raw=1)

Source: Jacques Bertin, *Semiology of Graphics*, 1967.
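To make the "statistical transformation" component in the list above concrete, here is a rough sketch of the binning and counting that sits behind a histogram, done by hand with `numpy.histogram`. The `ratings_toy` values and the choice of four bins on the range 0 to 100 are assumptions for illustration only; they are not the cereals data or any package's default.

```python
import numpy as np

# Hypothetical ratings, made up for illustration.
ratings_toy = np.array([18.0, 29.5, 33.2, 40.1, 41.0, 52.3, 59.4, 68.4, 93.7])

# Statistical transformation: split the range 0-100 into 4 equal-width bins
# and count how many values fall into each bin.
counts, edges = np.histogram(ratings_toy, bins=4, range=(0, 100))

print(edges.tolist())   # [0.0, 25.0, 50.0, 75.0, 100.0]
print(counts.tolist())  # [1, 4, 3, 1]
```

A histogram is then just a bar for each bin: the geometric object is the bar, and the aesthetics map the bin edges to `x` and the counts to bar height.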

## Plotnine

Many software packages are built on the grammar of graphics. The best known is probably `ggplot2` in R. The Python package `plotnine` is basically `ggplot2` in Python.


In [None]:
from plotnine import ggplot, geom_point, aes, stat_smooth, facet_wrap, geom_boxplot, geom_bar, geom_histogram

### Bar plots

In [None]:
(ggplot(df_cereals,
        aes(x = "mfr"))
+ geom_bar()
)

In [None]:
(ggplot(df_cereals,
        aes(x = "protein"))
+ geom_bar()
)

### Histograms

In [None]:
(ggplot(df_cereals,
        aes(x = "rating"))
+ geom_histogram()
)

In [None]:
(ggplot(df_cereals,
        aes(x = "rating"))
+ geom_histogram(bins = 20)
)

### Scatterplots

In [None]:
(ggplot(df_cereals,
        aes(x = "sugars", y = "rating"))
 + geom_point()
)

### And much more!!!

In [None]:
(ggplot(df_cereals,
        aes(x = "sugars", y = "rating", color = "shelf"))
 + geom_point()
)
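If `shelf` is stored as integers (an assumption; check `df_cereals.dtypes`), plotting libraries will typically treat it as a quantity and map it to a continuous color gradient rather than to distinct colors. One possible workaround, sketched here on a hypothetical stand-in DataFrame `df_toy`, is to add a text version of the column. The column name `shelf_label` is made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for df_cereals; assumes shelf is coded as integers.
df_toy = pd.DataFrame({
    "sugars": [6, 8, 5, 0, 12, 3],
    "rating": [68.4, 34.0, 59.4, 93.7, 29.5, 52.3],
    "shelf": [3, 3, 3, 1, 2, 1],
})

# Store shelf as text so it is treated as a label rather than a number.
df_toy["shelf_label"] = df_toy["shelf"].astype(str)

print(pd.api.types.is_numeric_dtype(df_toy["shelf"]))        # True
print(pd.api.types.is_numeric_dtype(df_toy["shelf_label"]))  # False
```

Mapping `color` to the text column should then give one distinct color per shelf. In plotnine, writing `color = "factor(shelf)"` inside `aes()` has a similar effect without adding a column.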

In [None]:
(ggplot(df_cereals,
        aes(x = "sugars", y = "rating", color = "shelf"))
 + geom_point()
 + facet_wrap("~shelf")
 )

In [None]:
(ggplot(df_cereals,
        aes(x = "sugars", y = "rating", color = "shelf"))
 + geom_point()
 + facet_wrap("~shelf")
 + stat_smooth(method = "lm")
 )
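In the faceted plot above, `stat_smooth(method = "lm")` is another statistical transformation: roughly, it fits a separate least-squares line within each group and draws it, by default with a confidence band. The sketch below imitates only the line-fitting step with `numpy.polyfit` on hypothetical data. `df_toy` and `fits` are made-up names, and this is not plotnine's internal code.

```python
import numpy as np
import pandas as pd

# Hypothetical data, made up so that each group lies exactly on a line.
df_toy = pd.DataFrame({
    "shelf":  [1, 1, 1, 2, 2, 2],
    "sugars": [0, 1, 2, 0, 2, 4],
    "rating": [10.0, 8.0, 6.0, 5.0, 5.0, 5.0],
})

# Fit rating = intercept + slope * sugars separately for each shelf.
fits = {}
for shelf, group in df_toy.groupby("shelf"):
    slope, intercept = np.polyfit(
        group["sugars"].to_numpy(), group["rating"].to_numpy(), deg=1
    )
    fits[shelf] = (slope, intercept)

for shelf, (slope, intercept) in fits.items():
    print(f"shelf {shelf}: slope = {slope:.2f}, intercept = {intercept:.2f}")
```

On this toy data, the fitted slope is about -2 for shelf 1 and about 0 for shelf 2, so each group gets its own line rather than one shared line.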

In [None]:
(ggplot(df_cereals,
        aes(y = "sugars", x = "mfr", fill = "mfr"))
 + geom_boxplot()
 )

## Plotly

Plotly is another popular graphics package. Its [Plotly Express](https://plotly.com/python/plotly-express/) module contains functions that streamline the code for producing many common figures. As a bonus, Plotly figures are interactive by default; hover your mouse over the scatterplot below.

In [None]:
import plotly.express as px

In [None]:
px.scatter(df_cereals,
           x = "sugars",
           y = "rating",
           color = "shelf")

In [None]:
px.scatter(df_cereals,
           x = "sugars",
           y = "rating",
           color = "shelf",
           facet_col = "shelf")