![Callysto.ca Banner](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-top.jpg?raw=true)

# Choosing the right type of visualization for your data

A visualization is only useful if it helps us to understand our data set better or communicate information about it more accurately and powerfully.

Not every type of visualization is useful for every data set. When choosing how to represent data visually, it helps to remember what question we are trying to answer, and to ask ourselves whether what the visualization presents is relevant to that question.

Let's get some practice representing our `pets` data set from the previous module in different ways.

Run the following code in your Jupyter notebook to import the pandas library and recreate the `pets` DataFrame. 

In [None]:
#load "pandas" library under the alias "pd"
import pandas as pd

#identify the location of our online data
url = "https://raw.githubusercontent.com/callysto/online-courses/master/CallystoAndDataScience/data/pets-bootstrap.csv"

#read csv file from url and create a dataframe
pets = pd.read_csv(url)

#display the head of the data
pets.head()

### A note about code comments

Have you noticed that some lines in code begin with a `#`, and seem to be written in plain English?

In Python, lines that begin with `#` are **comments**: notes for any humans looking at the code. The `#` tells the Python interpreter to ignore the rest of the line, as it is not intended to be read by a computer.

Comments are useful when teaching others what specific parts of our code do. They can also help us keep our code organized, remind us about important information, and allow people less familiar with our code to understand it more easily.
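
As a small illustration (the variable name `total_pets` is made up for this example), the sketch below shows that Python skips comments whether they sit on their own line or follow code on the same line.

```python
# This whole line is a comment, so Python ignores it.
total_pets = 3 + 2  # A comment can also follow code on the same line.

# print("This line is commented out, so it never runs.")
print(total_pets)
```

Only the last `print` runs, because the earlier one has been turned into a comment.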

## Grouping Variables

Our `pets` data set contains much more data than the lists we used to create data visualizations in Unit 1. For example, the head of the DataFrame shows us that for each animal eight different variables have been recorded.

Let's start by picking a few columns we are interested in, and get counts of the different values within those columns.

* Gender
* Species
* Age (in years)

To do this, we'll use the pandas method **groupby**, which lets us split data into groups and give those groups names so they can be easily referenced.

Run the code below to count how many pets fall into each group in the `Gender`, `Species`, and `Age (years)` columns of the `pets` DataFrame.

In [None]:
# Group by different Categories: Gender, Species, Age (years)
gender = pets.groupby("Gender").size().reset_index(name="Count")
species = pets.groupby("Species").size().reset_index(name="Count")
age = pets.groupby("Age (years)").size().reset_index(name="Count")
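
To see what each step of this grouping code does, here is a minimal sketch on a tiny made-up table. The `sample_pets` DataFrame and its values are invented for illustration and are not part of the real `pets` data.

```python
import pandas as pd

# A small, made-up table (not the real pets data)
sample_pets = pd.DataFrame({
    "Name": ["Ada", "Bo", "Cleo", "Dot", "Eli", "Fig"],
    "Species": ["Cat", "Dog", "Cat", "Rabbit", "Dog", "Cat"],
})

# Step 1: split the rows into groups, one group per species
grouped = sample_pets.groupby("Species")

# Step 2: count how many rows are in each group
sizes = grouped.size()

# Step 3: turn the result back into a table with a column called "Count"
species_counts = sizes.reset_index(name="Count")

print(species_counts)
```

The resulting table has one row per species plus a `Count` column, which are the same column names used for plotting later in this notebook.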

Now that we've created our groups, let's call each group name and see what it looks like as a table.  

In [None]:
gender

In [None]:
species

In [None]:
age

We can then visualize the data in multiple ways using tools from the Plotly Express library.

In [None]:
import plotly.express as px

In [None]:
# Display the data in multiple ways

# Visualizing the Species table
fig1 = px.scatter(species,x="Species", y="Count",title='Species Scatter plot')
fig1.show()

fig2 = px.bar(species,x="Species",y="Count",title="Species Bar chart")
fig2.show()

fig3 = px.pie(species,values='Count', names='Species', title="Species Pie chart")
fig3.show()

## Activity

Change the code below to visualize `age` data instead of `gender` data.

In [None]:
table_to_visualize = gender
x_value = "Gender"

fig1 = px.scatter(table_to_visualize,x=x_value, y="Count",title=str(x_value)+' Scatter plot')
fig1.show()

fig2 = px.bar(table_to_visualize,x=x_value,y="Count",title=str(x_value)+" Bar chart")
fig2.show()

fig3 = px.pie(table_to_visualize,values='Count', names=x_value, title=str(x_value)+" Pie chart")
fig3.show()

## Creating clear and useful visualizations

Sometimes a particular visualization represents data better than others.

Let's look at some more examples created with Plotly Express.

Suppose we wanted to see the relationship between age and time to adoption for the pets in our data set. The code below allows us to generate a bar graph to compare these variables.

Try running it now.

In [None]:
# Create bar plot
bar_pet = px.bar(pets,
                 x="Time to Adoption (weeks)",
                 y="Age (years)",
                 title="Age (in years) and Time to Adoption (weeks) for each pet")

# Display within our Jupyter notebook 
bar_pet.show()

Looking at this bar graph, we can see that there does seem to be a relationship between age and time to adoption, but some aspects of the visualization are not clear. For example, most of the bars are segmented, but there is no explanation of what the segments represent.
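
The segments appear because Plotly Express draws one bar segment per row of the DataFrame and, by default, stacks segments that share the same x value. A minimal sketch with made-up numbers (the `sample` table is hypothetical, not the real `pets` data) shows what the total height of each stacked bar would then represent.

```python
import pandas as pd

# Made-up values for illustration only (not the real pets data)
sample = pd.DataFrame({
    "Name": ["Ada", "Bo", "Cleo", "Dot"],
    "Time to Adoption (weeks)": [2, 2, 5, 5],
    "Age (years)": [1, 3, 4, 6],
})

# Each row becomes one segment; segments with the same x value are stacked,
# so the full bar height is the sum of the ages at that adoption time.
bar_heights = sample.groupby("Time to Adoption (weeks)")["Age (years)"].sum()

print(bar_heights)
```

In this sketch the bar at 2 weeks would reach a height of 4, even though neither pet is 4 years old, which is part of why the stacked bars are hard to read.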

Let’s update our data visualization to include more information. We'll use colour to represent pet species and add labels with the names of each pet. 

Run the code below. 

In [None]:
# Create coloured bar chart
bar_pet = px.bar(pets,
                 x="Time to Adoption (weeks)",
                 y="Age (years)",
                 title="Age (in years) and Time to Adoption (weeks) for each pet",
                 color="Species",
                 text="Name")

bar_pet.show()

These updates have made the bar chart easier to interpret. However, there's a lot of information represented here, and the chart still doesn't clearly display the relationship between age and time to adoption.

Some elements are also confusing. For example, some pets have tiny name labels and bar segments compared to others, but this doesn't communicate anything useful about their data. The chart type simply allows less room when different pets have similar ages and times to adoption.

Let’s try a different type of visualization.

Run the code below to create a **scatter plot**. In this visualization, each dot will represent a single pet and the dot's colour will represent the species.


In [None]:
# Create scatter plot
scatter_pet = px.scatter(pets,
                         x="Time to Adoption (weeks)",
                         y="Age (years)",
                         title="Age (in years) and Time to Adoption (weeks) for each pet",
                         color="Species",
                         hover_name="Name")

scatter_pet.show()

This scatter plot communicates our information much more clearly than our bar chart. It represents each pet equally, shows the strength of the relationship between our two variables, and lets us easily identify outliers and see at a glance the different species of pets in our data set.
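
If we want a number to go with the visual impression, one option is the correlation coefficient, which pandas can compute with `Series.corr`. This is an optional extra rather than part of the lesson itself, and the `sample` values below are made up for illustration.

```python
import pandas as pd

# Made-up values for illustration only (not the real pets data)
sample = pd.DataFrame({
    "Age (years)": [1, 2, 3, 4, 5],
    "Time to Adoption (weeks)": [2, 4, 5, 8, 10],
})

# Pearson correlation: values near 1 or -1 suggest a strong linear
# relationship, while values near 0 suggest a weak one.
r = sample["Age (years)"].corr(sample["Time to Adoption (weeks)"])

print(round(r, 2))
```

To try this on the real data, the same call could be run on the matching columns of the `pets` DataFrame.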

## Conclusion

Not all data visualizations are equally useful. Depending on how we design them, they might actually make our data harder to understand or mislead us into assuming things that are untrue.
When creating visualizations, our goal should be to present information relevant to the question we wish to answer as clearly as possible.

However, there is also no hard and fast rule about which method is best. Different data calls for different approaches, and the most important thing is to be able to easily see patterns and draw conclusions.

Keep in mind that there won't always be an obvious best answer, and different people may choose different representations for the same data.

[![Callysto.ca License](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-bottom.jpg?raw=true)](https://github.com/callysto/curriculum-notebooks/blob/master/LICENSE.md)