# Data Science: Bridging Principle and Practice
## Part 5: Categorizing Data

<img src="images/ab_test.PNG" style="width: 900px; height: 250px;" />


### Table of Contents


*In this notebook, we will learn how to manipulate and visualize categorical data.*

<ol start=5>
    <li><a href="#section5">DataFrames II: Categorizing Data</a></li>
    <ol>
        <li><a href="#subsection5a">Groupby</a></li>
        <li><a href="#subsection5b">Visualization: Bar Plots</a></li>
        <li><a href="#subsection5c">Pivots</a></li>
    </ol>
    </ol>

In [None]:
# dependencies: THIS CELL MUST BE RUN
import pandas as pd
import seaborn as sns
%matplotlib inline

In the previous notebook, we learned some basic Python and DataFrame operations and applied them to the Rocket Fuel data to answer questions about the profitability of the ad campaign.

Here, we'll build on those previous skills to dive deeper into the Rocket Fuel case. We'll explore questions including:

- How did conversions relate to different *hours of the day* and *days of the week*?
- How did the *total number of ads seen* relate to how often users converted?
- Was the difference in conversion proportions between the experimental and control groups *statistically significant*?

To answer these questions, we'll need to know a few additional DataFrame operations.

# 5. DataFrames II: Categorizing Data <a id='section5'></a>

As a reminder, our data looks like this:

In [None]:
# load the rocket fuel data
ads = pd.read_csv('data/rocketfuel_data_renamed.csv', index_col=0)

# display the first five rows (the default for head)
ads.head()

This DataFrame shows the conversion behavior of each user in the study. But we're not interested in individual user behavior right now: we want to know the behavior of *all users in a specific category*, like everyone who saw the most ads on a Monday, or everyone who saw 200 ads in total.

In this section, we'll talk about:
1. The definition of **categorical data**
2. How to use DataFrame methods to **group data into categories**
3. Using **visualizations** to view and compare categories



Let's say we have data about some of the handbags TaskBella sells, including the color, price, and rating of each bag, collected from different stores.

In [None]:
# create the example handbag DataFrame
handbags = pd.DataFrame(data={"color":["black", "red", "red", "brown", "black"],
                               "price":[115.99, 130, 124.95, 144.99, 120.05],
                               "rating":[4, 4, 5, 3, 4]})
handbags

#### Data Types: Numerical and Categorical
This table has two different types of data: **numerical** and **categorical**.

Price and rating are **numerical**: they have numbers for values, and we can order those values along a scale from least to most.

Color is **categorical**: it has strings (text) for values, and those values can't really be ordered from least to most.

The type of data affects the kind of analysis we can do, in addition to how we visualize it. For now, we're going to focus on categorical data.
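
One way to check which kind of data each column holds is to ask pandas directly. The sketch below assumes the `handbags` DataFrame defined above and uses `pd.api.types.is_numeric_dtype` to flag the numerical columns; the name `is_numeric` is only for illustration.

```python
import pandas as pd

# the example handbag data from above
handbags = pd.DataFrame(data={"color": ["black", "red", "red", "brown", "black"],
                              "price": [115.99, 130, 124.95, 144.99, 120.05],
                              "rating": [4, 4, 5, 3, 4]})

# for each column, check whether pandas stores it as a numeric type
is_numeric = {col: pd.api.types.is_numeric_dtype(handbags[col])
              for col in handbags.columns}
print(is_numeric)
```

Columns that are not numeric, like color, are the ones we treat as categorical in this notebook.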



### 5a. Groupby <a id='subsection5a'></a>

A question we might ask is how price or rating is different for different categories of color. To conduct this analysis, we want to do something like this:

1. Find all the possible colors
2. Sort all the rows of the DataFrame into groups, one for each unique color
3. Return a new DataFrame with one row for each color and information about that color of bag
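
To make the three steps concrete, here is a rough by-hand sketch of the same idea, assuming the `handbags` DataFrame from above. The names `colors`, `groups`, `manual_counts`, and `n_bags` are made up for this illustration, and the summary information chosen here (the number of bags) is just one possibility.

```python
import pandas as pd

# the example handbag data from above
handbags = pd.DataFrame(data={"color": ["black", "red", "red", "brown", "black"],
                              "price": [115.99, 130, 124.95, 144.99, 120.05],
                              "rating": [4, 4, 5, 3, 4]})

# step 1: find all the possible colors
colors = sorted(handbags["color"].unique())

# step 2: sort the rows into groups, one for each unique color
groups = {color: handbags[handbags["color"] == color] for color in colors}

# step 3: build a new DataFrame with one row per color
# (here the information kept about each color is how many bags it has)
manual_counts = pd.DataFrame({"n_bags": [len(groups[color]) for color in colors]},
                             index=pd.Index(colors, name="color"))
print(manual_counts)
```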

In pandas, the `groupby` method handles the grouping. To use `groupby`, call it on a DataFrame using dot notation and pass the name of the column you want to group on as the argument.

In [None]:
# group by handbag color
handbags.groupby("color")

The output above looks pretty strange. After we've told the computer to group the data by color, it doesn't know what to do with the groups. Should it count the items in the group? Should it take the average?
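
The strange-looking output is a grouped object waiting for instructions. As a rough sketch (assuming the `handbags` DataFrame from above; `grouped` is an illustrative name), we can peek inside it by looping over the groups:

```python
import pandas as pd

# the example handbag data from above
handbags = pd.DataFrame(data={"color": ["black", "red", "red", "brown", "black"],
                              "price": [115.99, 130, 124.95, 144.99, 120.05],
                              "rating": [4, 4, 5, 3, 4]})

# grouping alone does not compute anything yet
grouped = handbags.groupby("color")

# how many groups were formed?
print(grouped.ngroups)

# each group is a (category, rows) pair; print the size of each
for color, rows in grouped:
    print(color, len(rows))
```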

So, when we group items in a DataFrame, we also must say how we want to *aggregate* the groups by specifying an *aggregation function*. For this example, let's get the counts of each color of bag.

In [None]:
# group by handbag color and return the count of each
handbags.groupby("color").count()

We now have a new DataFrame with the number of handbags of each color. Notice that the "price" and "rating" columns are the same: `count` tallies the non-missing values in each column for each group, and every bag here has both a price and a rating.
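
The columns would only differ if some values were missing. The sketch below uses a hypothetical copy of the handbag data in which one red bag has no recorded price (this missing value is an assumption made up for the example, not part of the original data). It compares `count` with `size`, which counts rows regardless of missing values.

```python
import numpy as np
import pandas as pd

# hypothetical variant of the handbag data: one red bag is missing its price
handbags_missing = pd.DataFrame(data={"color": ["black", "red", "red", "brown", "black"],
                                      "price": [115.99, np.nan, 124.95, 144.99, 120.05],
                                      "rating": [4, 4, 5, 3, 4]})

# count tallies the non-missing values in each column, per group
counts = handbags_missing.groupby("color").count()

# size counts the rows in each group, missing values or not
sizes = handbags_missing.groupby("color").size()

print(counts)
print(sizes)
```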

<br/>
<div class="alert alert-warning">
    <b>EXERCISE:</b> Use `groupby` to group the handbags by `"rating"`.
    </div>

In [None]:
# group by rating
handbags.groupby("...").count()

There are several aggregation methods besides `count`.

For example, if we want to know the average price and average rating of each color of handbag, we can use the `mean` method.

In [None]:
# get average price and rating per color
bag_avgs = handbags.groupby("color").mean()
bag_avgs

A picture might help to understand what just happened.

<img src="images/group_ex.png" style="width: 1000px; height: 400px;" />

<br/>


<div class="alert alert-warning">
<b>EXERCISE:</b> You can use a wide variety of aggregation functions, including `sum`, `min`, `max`, `mean`, and `median`.

Using `groupby`, group the handbags by color and find the minimum price and rating for each group.
</div>

In [None]:
# fill in the ... with the correct code
...

<div class="alert alert-info">
**Aggregation functions and data types:**
The aggregation function you use must also work on the type of data in each group's columns. For example, if you aggregate using `mean` to get the average value and the columns being averaged include text data, you may get an error, since the computer doesn't know how to take the average of a word.

</div>
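
One way to sidestep this, sketched below under the assumption that we are working with the `handbags` DataFrame from above, is to pick out the numeric column before aggregating, or to pass `numeric_only=True` to `mean`. The variable names are illustrative.

```python
import pandas as pd

# the example handbag data from above
handbags = pd.DataFrame(data={"color": ["black", "red", "red", "brown", "black"],
                              "price": [115.99, 130, 124.95, 144.99, 120.05],
                              "rating": [4, 4, 5, 3, 4]})

# grouping by rating leaves the text column "color" among the columns to
# aggregate, so select the numeric "price" column before taking the mean
avg_price_by_rating = handbags.groupby("rating")["price"].mean()
print(avg_price_by_rating)

# alternatively, ask mean to skip any non-numeric columns
avg_numeric = handbags.groupby("rating").mean(numeric_only=True)
print(avg_numeric)
```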

### 5b. Visualization: Bar Plots <a id="subsection5b"></a>

DataFrames provide great ways to organize and display data. But as data sets grow very large (i.e. thousands, tens of thousands, even millions of rows), it becomes harder and harder to understand what's going on with the data just by looking at it in a table.

*Visualizations* are helpful to:
- get a big-picture understanding of a data set
- compare two or more variables
- find the variance of a variable

and much more.

Let's look again at the DataFrame we just made containing the average price and rating for each color.

In [None]:
bag_avgs

One of the best ways to visualize categorical data is with a **bar plot**. Bar plots allow us to compare multiple categories within the same plot. 

To make our plots, we'll be using a software package called **Seaborn**. Seaborn is built to make polished visualizations in Python without a lot of code. The standard abbreviation for Seaborn in Python code is `sns`, after [a character in the TV show *The West Wing*](https://en.wikipedia.org/wiki/Sam_Seaborn).

In [None]:
sns.barplot(x="color", y="rating", data=bag_avgs.reset_index())

Each category is listed on the horizontal axis and represented by a bar. The height of each bar is the average rating.
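
The same call works for the other column. As a minimal sketch, assuming the handbag data from above, the cell below plots the average price for each color instead and adds axis labels. The label text is our own choice, and `ax` is the matplotlib Axes object that `sns.barplot` returns.

```python
import pandas as pd
import seaborn as sns

# the example handbag data and per-color averages from above
handbags = pd.DataFrame(data={"color": ["black", "red", "red", "brown", "black"],
                              "price": [115.99, 130, 124.95, 144.99, 120.05],
                              "rating": [4, 4, 5, 3, 4]})
bag_avgs = handbags.groupby("color").mean()

# after grouping, color is the index; reset_index turns it back into a column
ax = sns.barplot(x="color", y="price", data=bag_avgs.reset_index())

# label the axes so the plot can be read on its own
ax.set_xlabel("Handbag color")
ax.set_ylabel("Average price")
```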

#### References

- Rocket Fuel data and discussion questions adapted from materials by Zsolt Katona and Brian Bell, BerkeleyHaas Case Series

Author: Keeley Takimoto