# Lab 11 - Part 2 

## Visualizing Numerical Variables

### DATA 1201, Fall 2023


In [None]:
from datascience import * # datascience has plotting features built in
import numpy as np
import seaborn as sns

import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use("ggplot")

import warnings
warnings.filterwarnings('ignore')
Table.interactive_plots() 

## Review: Categorical Variables

To review categorical variables and bar charts, we're going to use a table containing data about tips at a restaurant.

In [None]:
tips = Table.from_df(sns.load_dataset('tips'))
tips

In [None]:
# You don't need to know how this works yet
tips_by_mealtime = tips.group('time')
tips_by_mealtime

Remember that we can use a bar chart to visualize the numerical values corresponding to different categories (like meal time: lunch vs. dinner).

In [None]:
...  # Create a bar chart of tip counts by meal time using the `tips_by_mealtime` table

# `np.arange`

Before we dive into visualizing numerical variables, it will be important to understand how `np.arange()` works. This function generates an **array range**, which is a sequence of values. The easiest way to use `np.arange` is to generate an array of consecutive integers starting from 0 and going up to (but not including) the number you input:

In [None]:
np.arange(10)

Note that the array above has 10 elements, but starts at 0 and ends at 9.

The function can also take two arguments. In the form `np.arange(start, stop)`, the function generates an array of consecutive integers starting at `start` and going up to (but not including) `stop`.

In [None]:
arr = np.arange(3, 9)
arr

Finally, we can add a third argument to specify the **step size** we want to increment by (`np.arange(start, stop, step)`). This will generate an array starting with `start`, going up by increments of `step` and stopping _before_ the number `stop`.

In [None]:
np.arange(3, 9, 3)

Notice that the array above doesn't include the number 9. The next value after 6 would be 6 + 3 = 9, but the `stop` argument tells `np.arange` to go up to, but not include, 9.
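The `start`, `stop`, and `step` rules mirror Python's built-in `range`, which can be a useful way to check your understanding. The cell below is only an illustrative sketch; the variable names (`from_arange`, `from_range`, `with_endpoint`) are made up for this example.

```python
import numpy as np

start, stop, step = 3, 9, 3  # same values as the cell above

# For integer arguments, np.arange follows the same rules as Python's range
from_arange = np.arange(start, stop, step)
from_range = np.array(range(start, stop, step))
print(from_arange)                              # [3 6]
print(np.array_equal(from_arange, from_range))  # True

# To include `stop` itself, push the stop argument one step further
with_endpoint = np.arange(start, stop + step, step)
print(with_endpoint)                            # [3 6 9]
```

Pushing `stop` one step further is the same trick used later in this lab when the largest bin edge needs to be included.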

Below are some more examples of how you can use `np.arange`:

In [None]:
arr = np.arange(1, 11)
arr

In [None]:
np.sum(arr)

In [None]:
np.arange(3, 11, 2)

In [None]:
np.arange(10, 1, -3)

## Quick Check 1

What does each of these cells output when run? Try to figure out the answers before running the code yourself.

In [None]:
np.arange(5)

In [None]:
np.arange(3, 13, 3)

In [None]:
2 ** np.arange(8)

In [None]:
np.sum(np.arange(4)** 2)

# Histograms

A histogram visualizes the distribution of a numerical variable.

Returning to our `tips` data...

In [None]:
tips

...we can visualize the size of tips very easily with a histogram.

In [None]:
... # Create a histogram of tip amounts

### Why Do We Need `density = False`?

Look at the histogram that results if we don't set `density = False`.

In [None]:
tips.hist('tip')

This is a perfectly valid histogram too, but it's not one that we will study in this class.
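To see roughly what changes between the two versions, here is a minimal sketch that uses `np.histogram` as a stand-in. It is not necessarily what `Table.hist` does internally, and the `datascience` plot may label its density axis in different units (for example, percent). The `values` array is made up and is not the real tips data.

```python
import numpy as np

# Made-up values, not the real tips data
values = np.array([1.0, 1.5, 2.0, 2.5, 2.5, 3.0, 3.5, 4.5])
bins = np.arange(1, 6, 2)  # bin edges: 1, 3, 5

# Counts: how many values land in each bin
counts, edges = np.histogram(values, bins=bins)
print(counts)   # [5 3]

# Density: bar height times bar width gives the proportion of values in that bin
density, _ = np.histogram(values, bins=bins, density=True)
print(density)  # [0.3125 0.1875]

# The total area of the density bars is 1
total_area = np.sum(density * np.diff(edges))
print(total_area)  # 1.0
```

With counts, the bar heights answer "how many?" directly, which is why this lab sticks to `density = False`.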

### Quick Check 2

In [None]:
heights = Table().with_columns(
    'Height', np.array([72, 61, 63, 74, 68, 67, 65, 73, 65,
                        62, 66, 69, 75, 61, 61, 61, 65, 60, 64])
)
heights.show(5)

In [None]:
heights.hist('Height', density = False, bins = [60, 64, 68, 72, 76])

Using the histogram above, answer the following questions. If you don't believe it's possible to answer the question given the information you have, write "can't tell".

1. How many heights are between 60 inches (inclusive) and 64 inches (exclusive)?
2. How many heights are between 62 inches (inclusive) and 68 inches (exclusive)?

# Choosing Bins

Python will generate default bins for us. But often, we will want to specify custom bins. We can do this by passing an array into the optional `bins` argument.

In [None]:
tips.hist('tip', 
          density = False, 
          bins = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]))

### `np.arange`, revisited

The reason `np.arange` is useful for histograms is that it allows us to easily specify custom bins. The code below will generate the same histogram as above, but in a much more succinct way.

In [None]:
tips.hist('tip', 
          density = False, 
          bins = np.arange(12))
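As a quick check that the two bin specifications really are the same array, the sketch below compares them directly. The names `explicit_bins` and `arange_bins` are made up for this example.

```python
import numpy as np

explicit_bins = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11])
arange_bins = np.arange(12)

# Same edges, so the histograms are the same
print(np.array_equal(explicit_bins, arange_bins))  # True

# 12 edges define 11 bins
num_bins = len(arange_bins) - 1
print(num_bins)  # 11
```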

Let's look at another column.

In [None]:
tips.hist('total_bill', density = False)

Before setting bins, it's a good idea to look at the smallest and largest values in the column.

In [None]:
tips.column('total_bill').min()

In [None]:
tips.column('total_bill').max()

In [None]:
bins_3 = np.arange(3, 54, 3)
bins_3

If the first line of the cell above said `bins_3 = np.arange(3, 51, 3)` instead, the array would end at 48 and not include 51, because the `stop` argument in `np.arange` is _exclusive_.
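One way to avoid this off-by-one mistake is to compute the edges from the smallest and largest values. The helper below, `make_bins`, is hypothetical (it is not part of `numpy` or `datascience`), and the values `3.07` and `50.81` are stand-ins assumed for illustration; use the minimum and maximum you found in the cells above.

```python
import numpy as np

def make_bins(lo, hi, width):
    """Hypothetical helper: equal-width bin edges that cover lo through hi."""
    start = np.floor(lo / width) * width  # round lo down to a multiple of width
    stop = np.ceil(hi / width) * width    # round hi up to a multiple of width
    # np.arange excludes its stop argument, so go one width further
    return np.arange(start, stop + width, width)

# Stand-in minimum and maximum, assumed for illustration
edges = make_bins(3.07, 50.81, 3)
print(edges[0], edges[-1])  # first and last edges
```

With these stand-in values, the edges match `np.arange(3, 54, 3)`: the first edge is at or below the minimum and the last edge is at or above the maximum.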

In [None]:
tips.hist('total_bill', 
          density = False, 
          bins = bins_3)

In [None]:
bins_7 = np.arange(3, 53, 7)
bins_7

In [None]:
tips.hist('total_bill', 
          density = False, 
          bins = bins_7)

In [None]:
bins_10 = np.arange(3, 63, 10)
bins_10

In [None]:
tips.hist('total_bill', 
          density = False, 
          bins = bins_10)

Notice that smaller bin widths (and therefore more bins) generate a more granular (albeit choppy) histogram. Wider bins make for a much smoother shape, but we lose some useful information about the distribution along the way.
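The trade-off can also be seen in numbers. The sketch below uses synthetic data (`fake_bills`, generated at random, not the real `total_bill` column) and `np.histogram` as a stand-in for the plot. The same 200 values are split into more or fewer bins depending on the width.

```python
import numpy as np

rng = np.random.default_rng(0)
fake_bills = rng.uniform(3, 51, size=200)  # synthetic data, not the tips table

bin_summary = {}
for width, stop in [(3, 54), (7, 53), (10, 63)]:  # same settings as bins_3, bins_7, bins_10
    edges = np.arange(3, stop, width)
    counts, _ = np.histogram(fake_bills, bins=edges)
    bin_summary[width] = (len(counts), int(counts.sum()))
    print("width", width, "->", len(counts), "bins,", counts.sum(), "values in total")
```

Every version accounts for all 200 values; only how finely they are divided changes.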

# Overlaid and Side-by-Side Histograms

In [None]:
tips

Looking at the `tips` table, we can see that one potentially informative category is `'time'` – we can make separate histograms for every unique value in `'time'`. As a reminder, there are two unique times, `'Lunch'` and `'Dinner'`, so we should expect to see two histograms.

In [None]:
... # Create an overlaid histogram of tip amounts grouped by the `time` column

In [None]:
... # Create an overlaid histogram of tips grouped by the `time` column with custom bins

If we want these on separate axes:

In [None]:
tips.hist('total_bill', density = False, group = 'time', overlay = False)
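Conceptually, grouping splits the rows by category and builds one histogram per category using the same bins. The sketch below illustrates that idea with made-up arrays and `np.histogram`; it is an assumption about the general idea, not a description of how `Table.hist` is implemented.

```python
import numpy as np

# Made-up data: one tip amount and one meal-time label per row
tip = np.array([1.0, 2.5, 3.0, 3.5, 2.0, 5.0, 4.0, 1.5])
time = np.array(['Lunch', 'Dinner', 'Dinner', 'Lunch',
                 'Lunch', 'Dinner', 'Dinner', 'Lunch'])

shared_bins = np.arange(0, 7, 2)  # edges 0, 2, 4, 6

counts_by_time = {}
for label in np.unique(time):
    group_values = tip[time == label]  # keep only the rows for this category
    counts, _ = np.histogram(group_values, bins=shared_bins)
    counts_by_time[str(label)] = counts
    print(label, counts)
```

Using the same bins for every group is what makes the resulting histograms comparable.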

We could separate by other columns, like `'day'`.

In [None]:
tips

In [None]:
... # Create a histogram of tip amounts grouped by day of the week

There's too much going on there – but you can click the legend to hide certain days.

## Documentation

If you're ever confused about how a certain function works or what arguments a function takes, you can use this syntax to get more information:

In [None]:
# Run this cell
tips.hist?
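The `?` syntax only works inside a notebook or IPython. Python's built-in `help` function shows the same kind of information and works anywhere; the sketch below uses `np.arange` as the example.

```python
import numpy as np

# help() prints the documentation for any function or object
help(np.arange)
```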