# M5 - Introduction to Data Visualization

----

## Distribution Analysis

Often you are interested in the distribution of a variable/attribute/column in a dataset. We can compute summary statistics and create visuals. For distributional analysis we want to understand the "shape" of the variable. Two common aspects of a distribution's shape are central location and dispersion/spread. Central location is often described by the mean, median, and/or mode. Dispersion is often described by the standard deviation, variance, range, and/or inter-quartile range.
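
As a quick illustration of these measures, the sketch below computes each one with `pandas` on a small made-up `Series`. The values are invented for illustration and do not come from any dataset used in this notebook.

```python
import pandas as pd

# Hypothetical sample values, invented purely for illustration
values = pd.Series([2, 4, 4, 5, 7, 9, 11])

# Central location
mean = values.mean()
median = values.median()
mode = values.mode().tolist()  # mode() returns a Series because ties are possible

# Dispersion
std = values.std()  # sample standard deviation (ddof=1)
var = values.var()  # sample variance (ddof=1)
value_range = values.max() - values.min()
iqr = values.quantile(0.75) - values.quantile(0.25)

print(mean, median, mode)
print(std, var, value_range, iqr)
```

Note that `pandas` uses the sample versions of the standard deviation and variance by default, dividing by $n - 1$ rather than $n$.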

### An Example - The Normal Distribution

Remember the Normal distribution from your statistics classes? It is a **family** of distributions described by two parameters, a mean ($\mu$) and a standard deviation ($\sigma$). For a given $\mu$, the larger the $\sigma$, the more dispersed the data are, meaning the curve is flatter and wider. Let's plot two different Normal distributions with the same $\mu$ but two different $\sigma$s.

In [None]:
# Import statements
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats

# Give the range for x from -10 to +10
xMin = -10.0
xMax = 10.0

# Set the mean at 0 and then set two different standard deviations
mean = 0.0 
std1 = 2.0
std2 = 3.0

# Return 100 evenly spaced numbers between -10 and +10.
x = np.linspace(xMin, xMax, num=100)

# Create the y variables for each distribution
y = scipy.stats.norm.pdf(x, mean, std1)
y2 = scipy.stats.norm.pdf(x, mean, std2)

# Plot the Normal with mean 0 and standard deviation 2 in coral
plt.plot(x, y, color="coral")

# Plot the Normal with mean 0 and standard deviation 3 in blue
plt.plot(x, y2, color="blue")
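
To make the effect of $\sigma$ concrete, recall that the Normal density peaks at $x = \mu$ with height $1 / (\sigma\sqrt{2\pi})$, so a larger $\sigma$ gives a lower peak. The sketch below checks this numerically for the two standard deviations used above. The variable names are illustrative.

```python
import numpy as np
import scipy.stats

mean = 0.0
stds = [2.0, 3.0]

for std in stds:
    # Density evaluated at the mean, where the Normal curve is highest
    peak = scipy.stats.norm.pdf(mean, mean, std)
    # Closed-form peak height: 1 / (sigma * sqrt(2 * pi))
    expected = 1.0 / (std * np.sqrt(2.0 * np.pi))
    print(f"sigma={std}: peak={peak:.4f}, closed form={expected:.4f}")
```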

-----

### Automobile Data

We have the file `auto.csv` which contains information on different vehicles. The data contains the following variables:

- mpg: miles per gallon
- cylinders: number of cylinders for the engine (between 4 and 8)
- displacement: engine displacement (cubic inches)
- horsepower: engine horsepower
- weight: vehicle weight (lbs.)
- acceleration: time to accelerate from 0 to 60 mph (seconds)
- year: model year
- origin: origin of the vehicle (1=American, 2=European, 3=Japanese)
- name: vehicle name

In [None]:
# import pandas
import pandas as pd

# Read the file in
cars = pd.read_csv("auto.csv")
cars.info()

In [None]:
# Sample
cars.sample(7)

In [None]:
# summary statistics
cars.describe()

### Horsepower?

Why didn't we get summary statistics for horsepower? How do we fix it?

In [None]:
# Check to see if we have missing values
cars.isnull().sum()

We don't have any missing values. However, the output of `info()` showed that `horsepower` has dtype `object`, which is why we do not have any summary statistics for it. By default, `describe()` only provides summary statistics for numerical columns. We can look at `value_counts()` for `horsepower` and see if that helps.

In [None]:
cars.horsepower.value_counts()

Unfortunately, that did not help much. We would have to print out all 94 unique values to see if we notice anything. One way to do this is to convert the index of the `Series` that resulted from calling `value_counts()` to an array and print that out. Our hope here is that we see something that is making `horsepower` show up as an object.

In [None]:
# See all of the unique values for horsepower
cars.horsepower.value_counts().index.array

In [None]:
# Perhaps a simpler way
# We could also use .unique()
cars.horsepower.unique()

Ah! We found it -- there is a `?` in at least one of the rows of our data in the column `horsepower`. Let's find them.

In [None]:
cars[cars.horsepower == "?"]

At this point you have to decide whether you want to replace the `?` with data, delete those rows of data, etc. We saw how to fill in missing data in the last module, but this is slightly different. We don't have "missing" data, but rather a character that represents that the data are indeed missing. The "simplest" thing to do is to make a subset that contains only the rows with a "real" value for `horsepower`. We lose six rows of data, but this approach should then allow us to see the summary statistics with `describe` for the column.

In [None]:
# Make a new DataFrame that only contains rows *without* horsepower == ?
subsetCars = cars[cars.horsepower != "?"]
subsetCars.info()

Notice that the field/variable/attribute `horsepower` is still an `object`, so we will not be able to get summary statistics with `.describe()`. We can force that column of the `DataFrame` to be an integer with the following code.

In [None]:
subsetCars = subsetCars.astype({"horsepower": int})
subsetCars.info()

In [None]:
# Now describe() should give us summary statistics on horsepower
subsetCars.describe()
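
An alternative to filtering on the literal `?` is to let `pandas` treat it as missing. The sketch below uses a tiny in-memory CSV (made-up rows, not the real `auto.csv`) to show two options: passing `na_values="?"` to `read_csv`, or calling `pd.to_numeric` with `errors="coerce"` on an existing column. Either way the column becomes numeric with `NaN` where the `?` was, and `dropna` can then remove those rows.

```python
import io
import pandas as pd

# Made-up rows standing in for auto.csv; "?" marks a missing horsepower
csv_text = "mpg,horsepower\n18.0,130\n25.0,?\n31.0,65\n"

# Option 1: tell read_csv that "?" means missing
df1 = pd.read_csv(io.StringIO(csv_text), na_values="?")

# Option 2: read as-is, then coerce non-numeric strings to NaN
df2 = pd.read_csv(io.StringIO(csv_text))
df2["horsepower"] = pd.to_numeric(df2["horsepower"], errors="coerce")

# Drop the rows with missing horsepower, then convert to integer
clean = df1.dropna(subset=["horsepower"]).astype({"horsepower": int})
print(clean.dtypes)
print(df2["horsepower"].isnull().sum())
```

The advantage of this route is that the missing values become visible to `isnull()`, so the techniques from the last module apply directly.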

In [None]:
# Create a histogram of mpg using pandas
subsetCars.mpg.plot(kind="hist")

In [None]:
# Use matplotlib.pyplot to create a histogram
plt.hist(subsetCars.mpg)

You should have noticed that when you use `matplotlib.pyplot` and call `.hist()`, it returns a `tuple` with three elements: the counts (frequencies) for each bin, the edges of the bins along the x-axis, and the bar objects that were drawn.
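
The sketch below, using made-up data rather than `subsetCars`, unpacks the value returned by `plt.hist` into its three parts: the counts, the bin edges, and the drawn bar objects (patches). With the default of 10 bins there are 10 counts and 11 edges.

```python
import matplotlib.pyplot as plt
import numpy as np

# Made-up data for illustration: 200 draws from a Normal distribution
rng = np.random.default_rng(0)
data = rng.normal(loc=25.0, scale=5.0, size=200)

# plt.hist returns (counts, bin edges, patches)
counts, edges, patches = plt.hist(data)

print(len(counts))   # one count per bin
print(len(edges))    # one more edge than there are bins
print(counts.sum())  # counts add up to the number of observations
```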

-----

## Frequency Polygon

`matplotlib` does not have a built-in way to draw a frequency polygon. One option is to use the argument `histtype="step"`. Another option is to write our own function to create the frequency polygon. To write our own function, notice that at the top of the previous output cell we saw a tuple with three elements. We are going to use that output to help us create a frequency polygon.

First, let's try the "step" option.

In [None]:
plt.hist(subsetCars.mpg, histtype="step")

#### Not Bad

Overall, that is not too bad. Generally, though, a true frequency polygon uses a single point for each bin. Below, I have a function that finds the middle of each bin and plots the frequency/count for that bin at that point.

In [None]:
def getXandYForFreqPlot(valuesAndEdges):
    """
    This function will return a tuple containing x and y
    values to create a frequency plot. The input must be 
    a tuple created from calling pyplot.hist. It will find
    the center of each bin and use that in newX. 
    
    Parameters
    ----------
    valuesAndEdges - a tuple created from calling pyplot.hist
    
    Returns
    -------
    (newX, theCount) - a list of x-coordinates (bin midpoints) and an array of counts
    """
    theCount = valuesAndEdges[0]
    xValue = valuesAndEdges[1].tolist()

    newX = []
    beginning = xValue.pop(0)
    for i in xValue:
        # find the midpoint of the bin and put in newX
        newX.append(beginning + ((i - beginning) / 2))
        beginning = i

    return (newX, theCount)
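
The same midpoints can be computed without a loop. As a sketch, assuming the bin edges are a NumPy array (which is what `plt.hist` and `np.histogram` return), averaging each edge with its right-hand neighbor gives the bin centers. The data here is invented for illustration.

```python
import numpy as np

# Invented data for illustration
data = np.array([1.0, 2.0, 2.5, 3.0, 4.0, 4.5, 5.0, 9.0])

# np.histogram computes counts and edges without drawing anything
counts, edges = np.histogram(data, bins=4)

# Midpoint of each bin: average of its left and right edges
midpoints = (edges[:-1] + edges[1:]) / 2

print(edges)
print(midpoints)
print(counts)
```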

In [None]:
# Call plt.hist and store the resulting tuple in n
n = plt.hist(subsetCars.mpg)

# Get the x and y values to create the frequency plot
x, y = getXandYForFreqPlot(n)

# Plot the frequency plot on top of the histogram
plt.plot(x, y)

In [None]:
# Just the frequency polygon
plt.plot(x, y)

----

### Histograms Based on Another Variable

Many times we want to create histograms for a particular variable but visualize it based on another variable, often a categorical variable. In this particular dataset, the column `origin` is numeric but represents three different regions of origin (see the data dictionary at the beginning of this notebook). We would like to plot a histogram of the `mpg` for each `origin`. Using `matplotlib` we can simply call `plt.hist()` for each origin and they will all plot on the same graph.

Let's try it.

In [None]:
plt.hist(subsetCars[subsetCars.origin == 1].mpg)
plt.hist(subsetCars[subsetCars.origin == 2].mpg)
plt.hist(subsetCars[subsetCars.origin == 3].mpg)

We immediately see a potential issue with this approach. Some of the data can be "hidden" when we overlay the multiple histograms. There are several alternative approaches that we could try. One is to simply use the `step` option. Another is to use the frequency polygon method that we created earlier. A third is to use `seaborn` to create the plot. Let's look at each of these in turn.

#### Using `step` for Multiple Histograms

In [None]:
# Try using step
plt.hist(subsetCars[subsetCars.origin == 1].mpg, histtype="step")
plt.hist(subsetCars[subsetCars.origin == 2].mpg, histtype="step")
plt.hist(subsetCars[subsetCars.origin == 3].mpg, histtype="step")

#### Using Our Function

In [None]:
# Use our user-defined function to create a frequency polygon
# Call plt.hist and store the resulting tuple in n1
n1 = plt.hist(subsetCars[subsetCars.origin == 1].mpg)
# Get the x and y values to create the frequency plot
x1, y1 = getXandYForFreqPlot(n1)

# for second
n2 = plt.hist(subsetCars[subsetCars.origin == 2].mpg)
x2, y2 = getXandYForFreqPlot(n2)

# for third
n3 = plt.hist(subsetCars[subsetCars.origin == 3].mpg)
x3, y3 = getXandYForFreqPlot(n3)

In [None]:
# Plot the frequency plots
plt.plot(x1, y1)
plt.plot(x2, y2)
plt.plot(x3, y3)
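
One caveat with the per-origin plots above is that each call to `plt.hist` chooses its own bin edges from that subset's minimum and maximum, so the bins do not line up across groups. A possible remedy, sketched here with invented data and illustrative names, is to compute one set of edges from all of the values with `np.histogram_bin_edges` and pass it to every call through the `bins` argument.

```python
import matplotlib.pyplot as plt
import numpy as np

# Invented mpg-like values for two hypothetical groups
group_a = np.array([14.0, 15.0, 18.0, 20.0, 22.0, 25.0])
group_b = np.array([24.0, 27.0, 30.0, 32.0, 36.0, 40.0])

# One common set of bin edges (10 bins) spanning both groups
all_values = np.concatenate([group_a, group_b])
shared_edges = np.histogram_bin_edges(all_values, bins=10)

# Passing the same edges makes the histograms directly comparable
counts_a, edges_a, _ = plt.hist(group_a, bins=shared_edges, histtype="step")
counts_b, edges_b, _ = plt.hist(group_b, bins=shared_edges, histtype="step")
```

Because both groups now share the same edges, the midpoints returned by `getXandYForFreqPlot` would also coincide, which makes overlaid frequency polygons easier to compare.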

#### Using `seaborn`

In [None]:
import seaborn as sns

In [None]:
# Histogram for all of the data
sns.histplot(subsetCars.mpg)

We can also use `seaborn` to create a histogram by group. In this case, we use a single line of code and send in the grouping variable to the argument `hue`.

In [None]:
sns.histplot(subsetCars, x="mpg", hue="origin")

Because `origin` is a numerical variable in the dataset, `seaborn` treats it as continuous and colors the groups along a gradient. To get distinct colors for each category, you need a categorical variable, for example one with words. We can create a new column in our `DataFrame` that converts the numerical `origin` into a categorical variable.

In [None]:
subsetCars.loc[subsetCars.origin == 1, "catOrigin"] = "American"
subsetCars.loc[subsetCars.origin == 2, "catOrigin"] = "European"
subsetCars.loc[subsetCars.origin == 3, "catOrigin"] = "Japanese"
    
subsetCars.sample(5)
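
A more compact way to build the same kind of column is `Series.map` with a dictionary from code to label. This is a sketch on a small made-up `DataFrame`; the names `df` and `originLabels` are illustrative.

```python
import pandas as pd

# Made-up rows using the same origin codes as the data dictionary
df = pd.DataFrame({"origin": [1, 3, 2, 1], "mpg": [18.0, 31.0, 26.0, 15.0]})

# Dictionary from numeric code to label
originLabels = {1: "American", 2: "European", 3: "Japanese"}

# map() looks up each code; codes missing from the dictionary become NaN
df["catOrigin"] = df["origin"].map(originLabels)
print(df)
```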

In [None]:
sns.histplot(subsetCars, x="mpg", hue="catOrigin")

We can also use the `step` functionality with `seaborn`. Notice the argument name here is `element`.

In [None]:
sns.histplot(subsetCars, x="mpg", hue="catOrigin", element="step")

### Rug Plot

With `seaborn` we can add what is called a "rug plot". What it does is plot marginal distributions by drawing ticks for each individual observation along the $x$ and/or $y$ axes. In `seaborn` the function is called [`rugplot`](https://seaborn.pydata.org/generated/seaborn.rugplot.html).

This function is intended to complement other plots by showing the location of individual observations in an unobtrusive way.

We can add it to a histogram in two ways. First, we can call `rugplot` immediately after calling `histplot`. We can also use the function `displot` and tell it to add a `rugplot` in a single line of code. Let's look at both approaches.

In [None]:
sns.histplot(subsetCars.mpg)
sns.rugplot(subsetCars.mpg)

In [None]:
# Use displot with a single line of code
sns.displot(subsetCars.mpg, rug=True)

Sometimes we want a smooth estimate of the shape of the variable of interest. To do this, we use **kernel density estimation**, abbreviated KDE. Adding a `rugplot` to a `kdeplot` is also quite common. Let's do it.

In [None]:
# Create a kde plot with a rug plot of the mpg variable
sns.displot(subsetCars.mpg, kind="kde", rug=True)
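
Conceptually, a kernel density estimate places a small smooth bump (the kernel) on each observation and adds the bumps together. The sketch below uses `scipy.stats.gaussian_kde` on made-up data to compute such a curve directly. It is meant to illustrate the idea and is not guaranteed to reproduce the exact curve that `seaborn` draws.

```python
import numpy as np
import scipy.stats

# Made-up sample for illustration
rng = np.random.default_rng(1)
sample = rng.normal(loc=23.0, scale=7.0, size=300)

# Fit a Gaussian KDE (bandwidth chosen by SciPy's default rule)
kde = scipy.stats.gaussian_kde(sample)

# Evaluate the smooth density on a grid wide enough to cover the tails
grid = np.linspace(sample.min() - 30.0, sample.max() + 30.0, 1000)
density = kde(grid)

# A density should integrate to roughly 1; approximate with a Riemann sum
dx = grid[1] - grid[0]
area = (density * dx).sum()
print(round(area, 3))
```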

-----

## Box Plots

Another visual that we use to examine the distribution of a variable is the box plot (or box and whisker plot). The "box" spans the middle 50% of the distribution, from the first quartile to the third quartile, with a line at the median. The box plot can also help us identify outliers present in the data.

In [None]:
# Create a box plot of mpg
subsetCars.mpg.plot(kind="box")
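
To see what the box plot is drawing, the sketch below computes the pieces by hand on made-up data: the quartiles that form the box, the inter-quartile range (IQR), and the conventional fences at 1.5 times the IQR beyond the quartiles, which is the default whisker rule in `matplotlib`. Points outside the fences are the ones drawn as outliers.

```python
import pandas as pd

# Made-up values; 60 is deliberately far from the rest
values = pd.Series([10, 12, 14, 15, 16, 18, 20, 22, 60])

q1 = values.quantile(0.25)      # bottom of the box
median = values.quantile(0.50)  # line inside the box
q3 = values.quantile(0.75)      # top of the box
iqr = q3 - q1

# Fences: points beyond these are flagged as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(q1, median, q3, iqr)
print(outliers.tolist())
```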

You can also break it out by `origin` or `catOrigin`. You do so by calling `.boxplot()` on the **`DataFrame`** (not the column).

In [None]:
# Group by origin
subsetCars.boxplot(column="mpg", by="origin")

In [None]:
# Group by catOrigin
subsetCars.boxplot(column="mpg", by="catOrigin")

We can also use `seaborn` to create box plots. By default, when you send in the variable to argument `x`, you will get a horizontal box plot. If you want a vertical box plot for a single variable, you can use the argument `data` instead.

In [None]:
sns.boxplot(x=subsetCars.mpg)

In [None]:
# Make it vertical
sns.boxplot(data=subsetCars.mpg)

In [None]:
# To break it out by origin, put origin on the x-axis
sns.boxplot(data=subsetCars, x="origin", y="mpg")

In [None]:
# To break it out by catOrigin, put catOrigin on the x-axis
sns.boxplot(data=subsetCars, x="catOrigin", y="mpg", order=["American", "European", "Japanese"])

----

<font color='red' size = '5'> Student Exercise </font>

You have been given a .csv file that contains spending across different marketing channels and the sales during each time period (i.e., each row) in the file `advertising.csv`. In the **Code** cells below, do the following:

1. Read the data from .csv file into a `DataFrame` called `ads`.
2. Sample `ads` to see what the data looks like.
3. Print out the summary statistics for `ads`.
4. Create a histogram of `sales`. Describe its shape.
5. Create a boxplot of `sales`. 

-----

In [None]:
# 1. Read in the data file to ads


# 2. Sample 5 rows of data


In [None]:
# 3. Summary statistics


In [None]:
# 4. Create a histogram of sales


In [None]:
# 5. Create a box plot


**&copy; 2021 - Present: Matthew D. Dean, Ph.D.   
Clinical Associate Professor of Business Analytics at William \& Mary.**