# Chapter 1.4 and 1.5: Visualization and Testing

Goal: Create appropriate visualizations and write assertions to verify your analysis code.

### Topics:
- Choosing the right visualization type
- Creating histograms, scatter plots, and bar plots with Seaborn
- Writing `assert` statements as sanity checks
- Verifying data after filtering operations

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

## Loading the Data

We'll use the Tips dataset, which comes built into Seaborn. It contains information about restaurant bills and tips, including the total bill, tip amount, day of the week, and whether the customer was a smoker.

In [None]:
# Load the tips dataset from Seaborn
tips = sns.load_dataset('tips')
tips.head()

In [None]:
# How many rows and columns?
tips.shape

## Choosing the Right Visualization

Different questions need different visualizations:
- **Histogram**: "What's the distribution of X?" (one numeric variable)
- **Scatter plot**: "What's the relationship between X and Y?" (two numeric variables)
- **Bar plot**: "How does Y compare across categories?" (one category, one numeric)

Let's see each in action.

### Histograms

A histogram shows how values are distributed. For example: What's the distribution of total bills?

In [None]:
# Histogram of total_bill
sns.histplot(data=tips, x='total_bill')

Most bills are between $10 and $25, with a few expensive outliers above $40.

It's good practice to give our graphs axis labels and a title. When you pass a column name (e.g. `x='total_bill'`), Seaborn automatically uses that name as the x-axis label. However, you may want to use a nicer name of your own. Also, since no y variable is specified, Seaborn falls back to a generic default label ("Count") on the y-axis. Let's set both labels ourselves and add a title.

In [None]:
# Histogram of total_bill
sns.histplot(data=tips, x='total_bill')
plt.title('Histogram of Total Bills')
plt.xlabel('Total Bill ($)')
plt.ylabel('Count')
plt.show()  # This is optional, but it makes sure the graph displays

### Scatter Plots

A scatter plot shows the relationship between two numeric variables. For example: How does tip relate to total bill?

In [None]:
# Scatter plot of total_bill vs tip
sns.scatterplot(data=tips, x='total_bill', y='tip')
# Add a title, x and y axis labels
...

See how tips generally increase with larger bills? That's a positive relationship.

### Bar Plots

A bar plot compares values across categories. For example: What's the average tip on each day of the week?

In [None]:
# Bar plot of average tip by day
sns.barplot(data=tips, x='day', y='tip')
# Add nice title, x and y axis labels
...

The height of each bar is the average tip for that day. The black lines on top of the bars show the confidence interval (95% by default). They tell you how uncertain we are about the true average.
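To get a feel for where an interval like this comes from, here is a rough sketch of a bootstrap estimate, which is the general idea Seaborn uses by default for these bars. The tip values below are made up for illustration (they are not taken from the Tips dataset), and the names `tips_one_day` and `boot_means` are hypothetical.

```python
import random
import statistics

# Hypothetical tip amounts for a single day (not from the tips dataset)
tips_one_day = [1.5, 2.0, 2.5, 3.0, 3.0, 3.5, 4.0, 5.0]

rng = random.Random(0)  # fixed seed so the result is repeatable

# Resample with replacement many times and record the mean of each resample
boot_means = sorted(
    statistics.mean(rng.choices(tips_one_day, k=len(tips_one_day)))
    for _ in range(1000)
)

# The middle 95% of the resampled means approximates the confidence interval
lower = boot_means[24]   # roughly the 2.5th percentile
upper = boot_means[974]  # roughly the 97.5th percentile
sample_mean = statistics.mean(tips_one_day)

print(f"mean = {sample_mean:.2f}, 95% CI is roughly ({lower:.2f}, {upper:.2f})")
```

With only a handful of values the interval is wide. With more data it usually shrinks, which is why days with fewer transactions tend to have longer black lines.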

## Practice: Visualizations

1. Create a histogram of `tip` amounts
2. Create a scatter plot of `total_bill` vs `tip`, colored by `smoker`
3. Create a bar plot showing average `total_bill` by `day`
4. Add a title and axis labels to your bar plot

In [None]:
# 1. Create a histogram of tip amounts
# Step 1: Use sns.histplot() with data=tips and x='tip'
# Step 2: Add plt.show() at the end


In [None]:
# 2. Create a scatter plot of total_bill vs tip, colored by smoker
# Step 1: Use sns.scatterplot() with data=tips, x='total_bill', y='tip'
# Step 2: Add hue='smoker' to color by smoking status
# Step 3: Add plt.show()


In [None]:
# 3. Create a bar plot showing average total_bill by day
# Step 1: Use sns.barplot() with data=tips, x='day', y='total_bill'
# Step 2: Add plt.show()


In [None]:
# 4. Add a title and axis labels to your bar plot
# Step 1: Copy your bar plot code from above
# Step 2: Add plt.title('Your Title Here')
# Step 3: Add plt.xlabel('X Label') and plt.ylabel('Y Label')
# Step 4: Add plt.show()


## Introduction to `assert`

When you're analyzing data, things can go wrong in subtle ways. Maybe you filtered incorrectly, or a calculation gave unexpected results. `assert` statements help catch these errors early.

An `assert` statement says: "This should be true. If it's not, something is wrong."

In [None]:
# This passes silently (the condition is true)
assert len(tips) > 0

In [None]:
# This would fail (the condition is false)
assert len(tips) > 1000

You may have noticed that when the `assert` fails, it just says "AssertionError". If you wrote this code a week ago, you may not remember why you thought `len(tips)` should be greater than 1000. That means spending a lot of time figuring out why the check seemed important and why it's not passing.

`assert` statements allow us to add a message that is displayed if the check fails. This is useful in exactly these kinds of circumstances. To add a message, put it after the condition, separated by a comma:

```python
assert condition, "My message"
```

In [None]:
assert len(tips) > 1000, "Start with 10k rows, after cleaning should be around 2k rows"

The syntax is:
```python
assert condition, "Error message if condition is False"
```
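To see the message without stopping the notebook, one option is to catch the `AssertionError` and print it. This is only a small sketch: the list `bills` and the message text are made up for illustration.

```python
# Hypothetical stand-in for a dataset with only a few rows
bills = [12.5, 18.0, 23.4]

try:
    assert len(bills) > 1000, "Expected more than 1000 rows after loading"
except AssertionError as err:
    # str(err) is the message written after the comma in the assert
    message = str(err)
    print(f"AssertionError: {message}")
```

In everyday analysis code you would normally let the error stop the cell, since that is the point of the check.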

Some useful patterns:
- `assert len(df) > 0` - DataFrame isn't empty
- `assert (df['col'] > 0).all()` - All values in column are positive
- `assert df['col'].notna().all()` - No missing values
- `assert len(filtered_df) < len(original_df)` - Filter actually removed rows
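Here is how those patterns might look together on a tiny hand-built DataFrame. The data is hypothetical (a stand-in for the Tips dataset so the example runs on its own), and the error messages are just suggestions.

```python
import pandas as pd

# Hypothetical miniature stand-in for the tips data
df = pd.DataFrame({
    "tip": [1.0, 2.5, 3.0, 4.2],
    "day": ["Sat", "Sun", "Sun", "Thur"],
})

# DataFrame isn't empty
assert len(df) > 0, "DataFrame is empty"

# All values in the column are positive
assert (df["tip"] > 0).all(), "Found a tip that is zero or negative"

# No missing values
assert df["tip"].notna().all(), "Found missing tip values"

# Filter actually removed rows
weekend = df[df["day"] != "Thur"]
assert len(weekend) < len(df), "Filter did not remove any rows"

print(f"All checks passed; {len(weekend)} of {len(df)} rows kept")
```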

### Why use asserts?

They're like sanity checks. After filtering data, you can assert that the filter worked correctly. Consider the code below, which seems completely reasonable.

In [None]:
# Remove Sunday
non_sunday_tips = tips[tips['day'] != 'Sunday']

print(f"Days besides Sunday have {len(non_sunday_tips)} transactions out of {len(tips)} total")

Everything runs. However, if you actually look at `non_sunday_tips`, you'll realize Sunday is still there! The problem is that this code spells out "Sunday", but that's not how it's stored in the data: the dataset uses the abbreviation `"Sun"`. The filter removed rows where the day is `"Sunday"`, and since there are no rows like that, it didn't remove anything! Adding an `assert` right after the filter catches this immediately:

In [None]:
# Remove Sunday
non_sunday_tips = tips[tips['day'] != 'Sunday']

# Sanity check: we should have fewer rows than the original
# (this fails here, because 'Sunday' never matches the stored value 'Sun')
assert len(non_sunday_tips) < len(tips), "Something's wrong - No rows were removed!"

print(f"Days besides Sunday have {len(non_sunday_tips)} transactions out of {len(tips)} total")
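A complementary check is to confirm that the value you're filtering on actually appears in the column before you filter. The sketch below wraps both checks in a helper. The function `drop_category` and the small `orders` table are hypothetical, written only for this illustration.

```python
import pandas as pd

# Hypothetical miniature table; the column mimics `time` in the tips data
orders = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68],
    "time": ["Dinner", "Dinner", "Lunch", "Lunch"],
})

def drop_category(df, column, value):
    """Return df without the rows where df[column] equals value."""
    # Fail early if the value is misspelled or stored differently
    assert (df[column] == value).any(), f"{value!r} not found in column {column!r}"
    filtered = df[df[column] != value]
    # The filter should have removed at least one row
    assert len(filtered) < len(df), "No rows were removed"
    return filtered

lunch_only = drop_category(orders, "time", "Dinner")
print(f"Kept {len(lunch_only)} of {len(orders)} rows")

# A misspelled value trips the first assert instead of silently keeping every row
try:
    drop_category(orders, "time", "dinner")
    caught = False
except AssertionError as err:
    caught = True
    print(f"AssertionError: {err}")
```

Checking the value first gives a more specific error message than the row-count check alone, which can help when the mistake is a spelling or capitalization difference.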

## Practice: Writing Assertions

1. After filtering to smokers only, assert that all values in the `smoker` column are "Yes"
2. Assert that all `tip` values in the original dataset are positive
3. After filtering to Saturday only, assert the row count is less than the original
4. Assert that the mean tip is between $1 and $10 (a reasonable sanity check)

In [None]:
# 1. Filter to smokers and assert all are "Yes"
# Step 1: Create smokers_only = tips[tips['smoker'] == 'Yes']
# Step 2: Assert that (smokers_only['smoker'] == 'Yes').all() is True

smokers_only = tips[tips['smoker'] == 'Yes']


In [None]:
# 2. Assert that all tip values are positive
# Step 1: Check if (tips['tip'] > 0).all() is True
# Step 2: Write the assert with a helpful error message


In [None]:
# 3. Filter to Saturday and assert row count is smaller
# Step 1: Create saturday_only = tips[tips['day'] == 'Sat']
# Step 2: Assert that len(saturday_only) < len(tips)

saturday_only = tips[tips['day'] == 'Sat']


In [None]:
# 4. Assert that mean tip is between $1 and $10
# Step 1: Compute mean_tip = tips['tip'].mean()
# Step 2: Assert that 1 < mean_tip < 10

mean_tip = tips['tip'].mean()


## Wrap-up

Today you learned:
- How to choose the right visualization (histogram, scatter, bar)
- How to create visualizations with Seaborn
- How to use `assert` statements to catch bugs in your analysis

Get in the habit of adding assertions after any filtering or data transformation. Your future self will thank you!