## Imports

In [None]:
# Some imports
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

## Help in Jupyter

If you type `?function_name`, Jupyter prints the function's docstring, which typically includes its parameters, descriptions, and examples.

In [None]:
?plt.boxplot
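The `?` prefix is an IPython/Jupyter feature, so it will not work in a plain Python script. As a rough sketch, the standard library gives similar information anywhere Python runs; `plt.boxplot` is reused here only as an example target.

```python
import inspect

import matplotlib.pyplot as plt

# inspect.signature shows the parameters a function accepts
signature = inspect.signature(plt.boxplot)
print(signature)

# inspect.getdoc returns the cleaned-up docstring as a string
doc = inspect.getdoc(plt.boxplot)
print(doc.splitlines()[0])  # only the first line of the docstring

# help(plt.boxplot) would print the whole docstring, much like ?plt.boxplot
```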

## Pandas

### Filtering
In pandas, you can use a boolean series along either axis to perform more complicated data filtering and selection.

To construct a boolean series you perform logical operations on a pandas series, which can be a column or row from a dataframe.

In [None]:
# Let's get the HCEPDB data and select 500 random rows
data = pd.read_csv('http://faculty.washington.edu/dacb/HCEPDB_moldata.zip')
df = data.sample(500, random_state=42)

In [None]:
# View the first 10 rows
df.head(10)

In [None]:
# Let's say we want to find materials with mass less than 400 and pce greater than 2
# First construct the boolean series
bool_series = (df["mass"]<400)&(df["pce"]>2)
# Now, select all the rows where this is true
low_mass_high_pce = df[bool_series]
low_mass_high_pce.describe()

In [None]:
# This can all be done in a single step
df[(df["mass"]<400)&(df["pce"]>2)].describe()
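Boolean series can be combined with `&` (and), `|` (or), and `~` (not). The sketch below uses a small made-up table (`toy` is hypothetical, not the HCEPDB data) so that each result is easy to check by eye. Each comparison needs its own parentheses because `&` and `|` bind more tightly than `<` and `>`.

```python
import pandas as pd

# Hypothetical toy data, only for illustration
toy = pd.DataFrame({
    "mass": [250.0, 380.0, 410.0, 450.0],
    "pce": [1.5, 3.2, 4.0, 1.0],
})

# Both conditions must hold
mask = (toy["mass"] < 400) & (toy["pce"] > 2)
both = toy[mask]

# At least one condition holds
either = toy[(toy["mass"] < 400) | (toy["pce"] > 2)]

# Rows where the combined condition is False
rest = toy[~mask]

# DataFrame.query expresses the same filter as a string
same_rows = toy.query("mass < 400 and pce > 2")
```

`query` can be easier to read for long conditions, while explicit boolean series are easier to reuse and inspect.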

### Cheatsheet
Pandas has made a "cheatsheet" to help with remembering how to perform data manipulation with pandas which can be found [here](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf).

## Data Visualization

Data visualization lets you understand relationships between variables much more easily than raw numbers alone. It can happen at the end of a project as a way to share your results, but it can also be an important part of exploring and understanding the data, which is known as "exploratory data analysis".


What is the relationship between voc and jsc?

In [None]:
df[["voc","jsc"]] # You can select multiple columns and only show those

From the pure numbers this is not easy to determine. But what if we make use of visualization to try and get a better understanding?

In [None]:
plt.scatter(df["voc"], df["jsc"])

Though the above plot can give you some clues about the relationship, it still isn't easy to tell what is going on. Maybe another variable from the data would help; we can add this additional information using another 'encoding'. Encodings are ways to visually represent a variable. So far we have used position to encode the information, but adding another encoding, like color, lets us see how another variable affects this distribution.

In [None]:
small_mass = df[df["mass"]<300]
large_mass = df[df["mass"]>=300]
fig, ax = plt.subplots()

ax.scatter(small_mass['jsc'], small_mass["voc"], color = 'red', label = "Small Mass")
ax.scatter(large_mass['jsc'], large_mass["voc"], color = 'blue', label = "Large Mass")

ax.set_xlabel("Jsc")
ax.set_ylabel("VOC")
ax.set_title("VOC vs Jsc")

plt.legend()

In [None]:
fig, ax = plt.subplots()
scatter = ax.scatter(df["voc"], df["jsc"], c= df["mass"])
legend = ax.legend(*scatter.legend_elements(num=7), loc="upper right", title="Mass")
ax.add_artist(legend)
plt.show()

Using the additional encoding, we are able to visualize more information on this single plot, and get a little bit more understanding of the relationship between these variables. 
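For a continuous variable such as mass, a colorbar is a common alternative to a discrete legend. The following is a minimal sketch on randomly generated stand-in values (the arrays `voc`, `jsc`, and `mass` here are synthetic, not the HCEPDB columns).

```python
import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-in data, only for illustration
rng = np.random.default_rng(0)
voc = rng.uniform(0, 2, 200)
jsc = rng.uniform(0, 300, 200)
mass = rng.uniform(200, 600, 200)

fig, ax = plt.subplots()
# Position encodes voc and jsc; color encodes mass
points = ax.scatter(voc, jsc, c=mass, cmap="viridis", s=12)
# The colorbar maps colors back to mass values
colorbar = fig.colorbar(points, ax=ax, label="mass")
ax.set_xlabel("voc")
ax.set_ylabel("jsc")
```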

Let's look at another dataset, this one about flower species (because it lends itself well to some easy visualizations). Iris is a dataset that records several variables about irises of different species. It is very commonly used for practicing machine learning techniques, and it is available as one of the toy datasets in sklearn (a library we will talk much more about later).

In [None]:
# Don't worry too much about this cell, I am just grabbing some data from sklearn
from sklearn import datasets
iris_data = datasets.load_iris() # Load the dataset, which is a dictionary
iris = pd.DataFrame(iris_data['data'], columns=iris_data['feature_names']) # Change the data into a dataframe
target_dict = {i:iris_data["target_names"][i] for i in range(len(iris_data["target_names"]))} # Dict comprehension
target_series = pd.Series(iris_data["target"]).replace(target_dict) # Change from numbers to species names
iris["species"] = target_series # Add the species names
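As an aside, recent versions of scikit-learn can return the data as a dataframe directly. This is a sketch of an alternative, assuming a scikit-learn version that supports `as_frame=True`; the name `iris_alt` is used only so the `iris` dataframe above is left untouched.

```python
from sklearn import datasets

# as_frame=True asks sklearn for pandas objects instead of numpy arrays
bunch = datasets.load_iris(as_frame=True)

# bunch.frame holds the four feature columns plus a numeric "target" column
iris_alt = bunch.frame.copy()

# Map the numeric target (0, 1, 2) to the species names
code_to_name = dict(enumerate(bunch.target_names))
iris_alt["species"] = iris_alt["target"].map(code_to_name)
```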

<div>
    <img src = "https://www.sciencefacts.net/wp-content/uploads/2021/12/Sepals.jpg" width =200/>
</div>

In [None]:
sns.pairplot(data = iris, hue = "species") # We will talk more about Seaborn in a bit

#### Types of Data

There are different types of data that you may want to represent:
* Nominal (labels/categories)
* Ordinal (ordered labels)
* Quantitative (continuous data)
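One way these three types can show up in pandas is sketched below. The `samples` table and its `size_class` column are made up for illustration; they are not part of the iris data.

```python
import pandas as pd

# Hypothetical samples, only for illustration
samples = pd.DataFrame({
    # Nominal: categories with no inherent order
    "species": pd.Categorical(["setosa", "virginica", "setosa"]),
    # Ordinal: categories with an explicit order
    "size_class": pd.Categorical(
        ["small", "large", "medium"],
        categories=["small", "medium", "large"],
        ordered=True,
    ),
    # Quantitative: plain floating point numbers
    "petal_length": [1.4, 5.5, 1.3],
})

# Ordered categories support comparisons; unordered ones do not
bigger_than_small = samples["size_class"] > "small"
```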

#### Encodings

The pair plot above lets us explore the relationships between several variables at once, again using a color encoding to pack more information into a single plot. There are many more encodings than just color, though; the figure below shows several of them.

<div>
    <img src = "https://ltb.itc.utwente.nl/uploads/studyarea/509/Pics_2015_jpg/Fig10_11.jpg" width =700/>
</div>
<a href=https://ltb.itc.utwente.nl/509/concept/88863>https://ltb.itc.utwente.nl/509/concept/88863 </a>

#### Question:
What types of data can be represented by which types of encodings? Are there encodings which are better at representing different types of data?

### Expressiveness vs Effectiveness
* Expressiveness: How much information we can convey
* Effectiveness: How easy is the information to digest

We can include a huge amount of information in a single plot, using a different encoding for each variable, but the result can get very confusing.

In [None]:
df["jsc_cut"] = pd.cut(df["jsc"], 5)

sns.scatterplot(data=df, x = "pce", y="voc", hue="mass", size = "e_homo_alpha", style="jsc_cut",
               palette = sns.color_palette("vlag", as_cmap=True))

As seen above, a plot can contain a lot of information (pce, voc, mass, e_homo_alpha, jsc) and still be bad at conveying a message because it is hard to digest. On the other hand, the earlier plots were too simple to convey much of anything. This is a tradeoff that you should keep in mind. Just adding information to a plot won't make it better, and being thoughtful about how you combine different encodings can make what you are trying to convey much clearer.
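One possible way to trade a little expressiveness for effectiveness is to split a variable across several small panels instead of adding another encoding. The sketch below uses synthetic data (`toy` and its columns are stand-ins, not the HCEPDB dataframe) and bins mass into three groups, one panel each.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in data, only for illustration
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "pce": rng.uniform(0, 10, 300),
    "voc": rng.uniform(0, 2, 300),
    "mass": rng.uniform(200, 600, 300),
})
# Three equal-width mass bins with readable labels
toy["mass_bin"] = pd.cut(toy["mass"], 3, labels=["low", "mid", "high"])

# One panel per bin; shared axes keep the panels comparable
fig, axes = plt.subplots(1, 3, figsize=(9, 3), sharex=True, sharey=True)
for ax, (name, group) in zip(axes, toy.groupby("mass_bin", observed=True)):
    ax.scatter(group["pce"], group["voc"], s=8)
    ax.set_title(f"mass: {name}")
    ax.set_xlabel("pce")
axes[0].set_ylabel("voc")
fig.tight_layout()
```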

### Python Plotting Options
* Directly from pandas using DataFrame.plot()
* Using Matplotlib 
* Using Seaborn
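As a rough comparison, the same scatter plot can be drawn with each of the three options. The `toy` dataframe is made up for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical toy data, only for illustration
toy = pd.DataFrame({"x": [1, 2, 3, 4], "y": [1, 4, 9, 16]})

fig, axes = plt.subplots(1, 3, figsize=(9, 3))

# 1. Directly from pandas
toy.plot(kind="scatter", x="x", y="y", ax=axes[0], title="pandas")

# 2. Matplotlib
axes[1].scatter(toy["x"], toy["y"])
axes[1].set_title("matplotlib")

# 3. Seaborn
sns.scatterplot(data=toy, x="x", y="y", ax=axes[2])
axes[2].set_title("seaborn")

fig.tight_layout()
```

Both pandas and seaborn draw onto matplotlib axes, which is why all three can share one figure.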

### Examples of Data Visualization (The Good, The Bad, The Ugly)

#### Below examples from: [https://www.syntaxtechs.com/blog/data-visualization-examples](https://www.syntaxtechs.com/blog/data-visualization-examples)

Overly complex graphics are hard to understand

![Overly Complex Graphics](https://assets.website-files.com/60078f9b9c5ea6f60974b74b/61bdf49e384b99811891b691_Blog%2049.8..png)

Plots with too much information can be bad at conveying their purpose

![Overly Expressive Pie Chart](https://assets.website-files.com/60078f9b9c5ea6f60974b74b/61bdf4c6ea186da5af170265_Blog%2049.10..png)

Pie charts should always add up to 100%  

![Pie Chart](https://assets.website-files.com/60078f9b9c5ea6f60974b74b/61bdf4f1649499ea50046dfa_Blog%2049.12..png)

Data Visualization can be misleading  

![Bad Bar Chart](https://assets.website-files.com/60078f9b9c5ea6f60974b74b/61bdf500194ffa8799a8400c_Blog%2049.13..jpg)

#### Examples Below From [https://www.oldstreetsolutions.com/good-and-bad-data-visualization](https://www.oldstreetsolutions.com/good-and-bad-data-visualization)

Bad Heat Map: Darker colors should represent higher values (notice none is darker than 1-100)

![Bad Heat Map](https://www.oldstreetsolutions.com/wp-content/uploads/2021/05/Bad-Heat-Map.png)

Even simple graphics can have low effectiveness (for example, areas are hard to compare by eye)

![Weird Groceries Chart](https://www.oldstreetsolutions.com/wp-content/uploads/2021/05/Groceries-Weird-Chart.jpg)

Including more information can make a visualization clearer

![Population Density](https://www.oldstreetsolutions.com/wp-content/uploads/2021/05/Proper-Heat-Map.png)

#### From [https://en.wikipedia.org/wiki/1854_Broad_Street_cholera_outbreak](https://en.wikipedia.org/wiki/1854_Broad_Street_cholera_outbreak)

<div>
    <img src = "https://upload.wikimedia.org/wikipedia/commons/thumb/2/27/Snow-cholera-map-1.jpg/1024px-Snow-cholera-map-1.jpg" width =700/>
</div>

### Plotting Zoo

#### Line Plot

In [None]:
df.plot(kind='line', x='jsc', y='voc')

#### Scatter

In [None]:
df.plot(kind='scatter', x='jsc', y='voc')

#### Bar plot

In [None]:
df["has_si"] = df["SMILES_str"].apply(lambda s: 'Si' in s) # Add column that indicates if material contains silicon

In [None]:
numeric_cols = ["mass", "pce", "voc", "jsc", "e_homo_alpha", "e_gap_alpha", 
                "e_lumo_alpha"]
df.groupby('has_si')[numeric_cols].mean().reset_index().plot(kind="bar", x="has_si", y="jsc")

#### Histogram

In [None]:
df.plot(kind='hist', y="mass")

#### Box and Whisker Plot

In [None]:
df.plot(kind="box", by="has_si", column=["mass", "jsc"], subplots=True)

#### Density Estimate

In [None]:
df.plot(kind="kde", y="mass")

In [None]:
df.plot(kind = "kde", y = ["e_homo_alpha", "e_lumo_alpha"])

#### Pie

In [None]:
df.groupby("has_si").count().plot(kind="pie", y="id")

### Seaborn

Seaborn is a wrapper library around matplotlib that allows you to easily create more complex plots. Everything in seaborn can be done in matplotlib, but it will take longer. 

In [None]:
df["mass_cut"] = pd.cut(df["mass"], 5)
fig, ax = plt.subplots()

sns.kdeplot(data=df, x='e_gap_alpha', hue="mass_cut", common_norm=False, palette="viridis", ax=ax)
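The `pd.cut` call used above splits a column into equal-width bins, so some bins may hold far fewer rows than others. The sketch below contrasts it with `pd.qcut`, which bins by quantiles instead; the `masses` values are made up for illustration.

```python
import pandas as pd

# Hypothetical values with one large outlier
masses = pd.Series([100, 150, 200, 400, 900])

# Equal-width bins: each bin covers the same range of values
equal_width = pd.cut(masses, 2)

# Quantile bins: each bin holds roughly the same number of rows
equal_count = pd.qcut(masses, 2)

print(equal_width.value_counts(sort=False))
print(equal_count.value_counts(sort=False))
```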

In [None]:
fig, ax = plt.subplots()
plot = sns.barplot(data=df, x='mass_cut', y="e_gap_alpha", hue="has_si")
plt.xticks(rotation=30)

You can also easily set default styling.

In [None]:
sns.set_style('whitegrid') # Colors, grids, etc.
sns.set_context("talk") # Font and line sizes
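The two calls above change the defaults for every later plot. If a style should apply to a single figure only, seaborn's `axes_style` and `plotting_context` can be used as context managers, as in this minimal sketch:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Remember the current grid setting so we can compare afterwards
grid_before = plt.rcParams["axes.grid"]

# Style and context apply only inside the with block
with sns.axes_style("whitegrid"), sns.plotting_context("talk"):
    fig, ax = plt.subplots()
    ax.plot([0, 1, 2], [0, 1, 4])
    grid_inside = plt.rcParams["axes.grid"]

# Outside the block the previous settings are restored
grid_after = plt.rcParams["axes.grid"]
```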

#### Heatmap

In [None]:
sns.heatmap(data=df[numeric_cols].corr())
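Correlations range from -1 to 1, so a diverging palette centered at 0 with fixed limits can make the heatmap easier to read. This is a sketch on synthetic data (`toy` and its columns `a`, `b`, `c` are invented), using options that `sns.heatmap` provides:

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic data: b follows a, c moves opposite to a
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.5, size=200),
    "c": -a + rng.normal(scale=0.5, size=200),
})
corr = toy.corr()

fig, ax = plt.subplots()
# Fixed limits and center=0 make positive and negative values symmetric in color;
# annot=True writes each correlation into its cell
sns.heatmap(corr, vmin=-1, vmax=1, center=0, cmap="vlag",
            annot=True, fmt=".2f", ax=ax)
```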

#### 2D KDE

In [None]:
sns.kdeplot(data = df, x="jsc", y="voc") 

Note how this shows the high density around 0? This might be due to some missing data...

#### Joint Plot

In [None]:
sns.jointplot(data=df, x='jsc', y="voc")

#### Violin Plot

In [None]:
sns.violinplot(data=df, x="e_gap_alpha", y="mass_cut")

#### Pair Plot

In [None]:
sns.pairplot(data=df[numeric_cols], kind="scatter")

## Exercises

1. Filter the HCEPDB data for data where voc is between 0.5 and 0.75

2. Create a plot of JSC vs PCE using square markers of size 2 using matplotlib

hint: [Documentation for Matplotlib Scatter](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.scatter.html)

3. Create a plot to show how the HOMO-LUMO energy gap (e_gap_alpha) is affected by molar mass

4. Recreate the pairplot for the HCEPDB above, except filter out data where voc is approximately 0

hint: np.isclose can be used to check if a number is close to 0
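As a quick illustration of the hint, here is `np.isclose` on a small made-up array (not the HCEPDB data). The default absolute tolerance is `1e-8`, and it can be changed with `atol`.

```python
import numpy as np

# Hypothetical values, only for illustration
values = np.array([0.0, 1e-9, 0.3, -2e-10])

# True where a value is within the default tolerance of 0
near_zero = np.isclose(values, 0)

# A larger absolute tolerance treats more values as "close"
loose = np.isclose(values, 0, atol=0.5)

# ~ inverts the boolean array, keeping values that are not near 0
kept = values[~near_zero]
```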

5. Pick any three variables in the HCEPDB data set, and create a plot or plots to explore the relationship between these variables

6. Below I create a dataframe `diabetes` from another sklearn dataset. Use different plots to explore the relationships between the variables

Data Description: Ten baseline variables, age, sex, body mass index, average blood pressure, and six blood serum measurements were obtained for each of n = 442 diabetes patients, as well as the response of interest, a quantitative measure of disease progression one year after baseline.

Variables:
- age: age in years
- sex
- bmi: body mass index
- bp: average blood pressure
- s1: tc, total serum cholesterol
- s2: ldl, low-density lipoproteins
- s3: hdl, high-density lipoproteins
- s4: tch, total cholesterol / HDL
- s5: ltg, possibly log of serum triglycerides level
- s6: glu, blood sugar level
- progression: Quantitative measure of disease progression



In [None]:
diabetes_data = datasets.load_diabetes()
diabetes = pd.DataFrame(diabetes_data["data"], columns = diabetes_data["feature_names"])
diabetes["progression"] = diabetes_data["target"]

Some possible questions to explore (feel free to create your own instead): 
 - Do the men and women in the dataset have similar age distributions?
 - How are the serum measurements affected by age?
 - What is the relationship between age, bp, and progression?
 - Create age groups, then create subplots of bmi vs s2 for each of the age groups (try using just matplotlib, try with [Seaborn Facetgrid](https://seaborn.pydata.org/generated/seaborn.FacetGrid.html))

## Other Visualization Libraries and Methods

### Altair:
(Available in this Jupyter Hub)  
A plotting library for declaratively creating data visualizations. It is built around the Vega-Lite grammar, the development of which is led by alumni and members of the University of Washington Interactive Data Lab (UW IDL).  
[Altair Documentation](https://altair-viz.github.io/)

In [None]:
# import altair with an abbreviated alias
import altair as alt

# make the chart
alt.Chart(diabetes).mark_point().encode(
    x='age',
    y='bp',
    color='progression',
).interactive()

From: [https://altair-viz.github.io/getting_started/overview.html#overview](https://altair-viz.github.io/getting_started/overview.html#overview)

### Plotly
(Available in this Jupyter Hub)  
Available for python (and a large number of other languages)  
Based on the Plotly JavaScript library. It is very good at making a variety of interactive plots, which can also easily be turned into web apps for others to explore.  
[Plotly Documentation](https://plotly.com/python/)  
[Dash User Guide](https://dash.plotly.com/)

In [None]:
import plotly.express as px
gapminder = px.data.gapminder().query("year==2007")  # separate name so the HCEPDB `df` is not overwritten
fig = px.scatter_geo(gapminder, locations="iso_alpha", color="continent",
                     hover_name="country", size="pop",
                     projection="natural earth")
fig.show()

From [https://plotly.com/python/bubble-maps/](https://plotly.com/python/bubble-maps/)

In [None]:
# Import data
import time
import numpy as np

from skimage import io

vol = io.imread("https://s3.amazonaws.com/assets.datacamp.com/blog_assets/attention-mri.tif")
volume = vol.T
r, c = volume[0].shape

# Define frames
import plotly.graph_objects as go
nb_frames = 68

fig = go.Figure(frames=[go.Frame(data=go.Surface(
    z=(6.7 - k * 0.1) * np.ones((r, c)),
    surfacecolor=np.flipud(volume[67 - k]),
    cmin=0, cmax=200
    ),
    name=str(k) # you need to name the frame for the animation to behave properly
    )
    for k in range(nb_frames)])

# Add data to be displayed before animation starts
fig.add_trace(go.Surface(
    z=6.7 * np.ones((r, c)),
    surfacecolor=np.flipud(volume[67]),
    colorscale='Gray',
    cmin=0, cmax=200,
    colorbar=dict(thickness=20, ticklen=4)
    ))


def frame_args(duration):
    return {
            "frame": {"duration": duration},
            "mode": "immediate",
            "fromcurrent": True,
            "transition": {"duration": duration, "easing": "linear"},
        }

sliders = [
            {
                "pad": {"b": 10, "t": 60},
                "len": 0.9,
                "x": 0.1,
                "y": 0,
                "steps": [
                    {
                        "args": [[f.name], frame_args(0)],
                        "label": str(k),
                        "method": "animate",
                    }
                    for k, f in enumerate(fig.frames)
                ],
            }
        ]

# Layout
fig.update_layout(
         title='Slices in volumetric data',
         width=600,
         height=600,
         scene=dict(
                    zaxis=dict(range=[-0.1, 6.8], autorange=False),
                    aspectratio=dict(x=1, y=1, z=1),
                    ),
         updatemenus = [
            {
                "buttons": [
                    {
                        "args": [None, frame_args(50)],
                        "label": "&#9654;", # play symbol
                        "method": "animate",
                    },
                    {
                        "args": [[None], frame_args(0)],
                        "label": "&#9724;", # pause symbol
                        "method": "animate",
                    },
                ],
                "direction": "left",
                "pad": {"r": 10, "t": 70},
                "type": "buttons",
                "x": 0.1,
                "y": 0,
            }
         ],
         sliders=sliders
)

fig.show()

From [https://plotly.com/python/visualizing-mri-volume-slices/](https://plotly.com/python/visualizing-mri-volume-slices/)

### Bokeh
(Available in this Jupyter Hub)  
Based on the Bokeh JavaScript library. It is good at making interactive plots and can also be used to create web apps.  
[Bokeh Website](http://bokeh.org/)  
[Bokeh Documentation](https://docs.bokeh.org/en/latest/)  
[Bokeh Server](https://docs.bokeh.org/en/latest/docs/user_guide/server.html)  

In [None]:
# activate Bokeh output in Jupyter notebook
from bokeh.io import output_notebook

output_notebook()

# create a complex chart with mouse-over tooltips

from bokeh.palettes import HighContrast3
from bokeh.plotting import figure, show

fruits = ["Apples", "Pears", "Nectarines", "Plums", "Grapes", "Strawberries"]
years = ["2015", "2016", "2017"]

# Named fruit_data so the HCEPDB `data` loaded earlier is not overwritten
fruit_data = {"fruits": fruits, "2015": [2, 1, 4, 3, 2, 4], "2016": [5, 3, 4, 2, 4, 6], "2017": [3, 2, 4, 4, 5, 3]}

p = figure(x_range=fruits, height=250, title="Fruit Counts by Year", toolbar_location=None, tools="hover", tooltips="$name @fruits: @$name")

p.vbar_stack(years, x="fruits", width=0.9, color=HighContrast3, source=fruit_data, legend_label=years)

p.y_range.start = 0
p.x_range.range_padding = 0.1
p.xgrid.grid_line_color = None
p.axis.minor_tick_line_color = None
p.outline_line_color = None
p.legend.location = "top_left"
p.legend.orientation = "horizontal"

show(p)

From tutorial at [http://bokeh.org/](http://bokeh.org/)

### Shiny
(Not available in this Jupyter Hub)  
Available for python and R  
A library for creating interactive web visualizations. It can be a good way to share data so that people unfamiliar with Python can explore it more easily.  
[Shiny Example Gallery](https://shiny.posit.co/py/gallery/)  
[Shiny Example App](https://shinylive.io/py/app/#orbit-simulation)  
[Shiny Documentation](https://shiny.posit.co/py/docs/overview.html)  

This is just a small sample of other visualization libraries; there are a huge number of others. In this class we will focus only on matplotlib and seaborn, but if you want to create interactive visualizations and/or data dashboards, these libraries provide some options you may want to use later.