### Introduction
In this lesson, you will learn about aggregates in Pandas. An aggregate statistic is a single number that summarizes a group of numbers. Common aggregate statistics include the mean, median, and standard deviation.

You will also learn how to rearrange a DataFrame into a pivot table, which is a great way to compare data across two dimensions.

### Calculating Column Statistics
In the previous lesson, you learned how to perform operations on each value in a column using apply.

In this exercise, you will learn how to combine all of the values from a column for a single calculation.

Some examples of this type of calculation include:

The DataFrame customers contains the names and ages of all of your customers. You want to find the median age:

In [1]:
'''
print(customers.age)
>> [23, 25, 31, 35, 35, 46, 62]
print(customers.age.median())
>> 35
'''

The DataFrame shipments contains address information for all shipments that you've sent out in the past year. You want to know how many different states you have shipped to (and how many shipments went to the same state).

In [2]:
'''
print(shipments.state)
>> ['CA', 'CA', 'CA', 'CA', 'NY', 'NY', 'NJ', 'NJ', 'NJ', 'NJ', 'NJ', 'NJ', 'NJ']
print(shipments.state.nunique())
>> 3
'''
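The second half of that question (how many shipments went to each state) is not answered by nunique. A minimal sketch of one way to get both numbers follows; the shipments DataFrame here is a hypothetical stand-in built from the values printed above, and value_counts is a standard Series method.

```python
import pandas as pd

# Hypothetical stand-in for the shipments DataFrame described above
shipments = pd.DataFrame({
    'state': ['CA'] * 4 + ['NY'] * 2 + ['NJ'] * 7
})

# Number of distinct states shipped to
num_states = shipments.state.nunique()

# Number of shipments per state, most frequent state first
shipments_per_state = shipments.state.value_counts()

print(num_states)
print(shipments_per_state)
```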

The DataFrame inventory contains a list of types of t-shirts that your company makes. You want a list of the colors that your shirts come in.

In [3]:
'''
print(inventory.color)
>> ['blue', 'blue', 'blue', 'blue', 'blue', 'green', 'green', 'orange', 'orange', 'orange']
print(inventory.color.unique())
>> ['blue', 'green', 'orange']
'''

The general syntax for these calculations is:

In [4]:
# df.column_name.command()

The following table summarizes some common commands:

    Command  - Description
    mean     - Average of all values in column
    std      - Standard deviation
    median   - Median
    max      - Maximum value in column
    min      - Minimum value in column
    count    - Number of non-missing values in column
    nunique  - Number of unique values in column
    unique   - Array of unique values in column
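To see these commands side by side, here is a small sketch that applies each of them to the customer ages from the earlier example. The customers DataFrame is rebuilt by hand here, so treat it as an illustrative stand-in rather than the lesson's actual data.

```python
import pandas as pd

# Hand-built stand-in for the customers DataFrame from the earlier example
customers = pd.DataFrame({'age': [23, 25, 31, 35, 35, 46, 62]})

# Each command reduces the whole column to a single value
summary = {
    'mean': customers.age.mean(),
    'std': customers.age.std(),
    'median': customers.age.median(),
    'max': customers.age.max(),
    'min': customers.age.min(),
    'count': customers.age.count(),
    'nunique': customers.age.nunique(),
}
for name, value in summary.items():
    print(name, value)

# unique is the exception: it returns an array of the distinct values
unique_ages = customers.age.unique()
print(unique_ages)
```

Note that unique returns a NumPy array rather than a Python list, so call list() or .tolist() on it if you need a list.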
    
Once more, we'll revisit our orders from ShoeFly.com. Our new batch of orders is in the DataFrame orders. Examine the first 10 rows.

In [5]:
import pandas as pd
orders = pd.read_csv('orders.csv')
print(orders.head(10))

      id first_name    last_name                         email     shoe_type  \
0  41874       Kyle         Peck          KylePeck71@gmail.com  ballet flats   
1  31349  Elizabeth    Velazquez      EVelazquez1971@gmail.com         boots   
2  43416      Keith     Saunders              KS4047@gmail.com       sandles   
3  56054       Ryan      Sweeney     RyanSweeney14@outlook.com       sandles   
4  77402      Donna  Blankenship              DB3807@gmail.com     stilettos   
5  97148     Albert       Dillon       Albert.Dillon@gmail.com        wedges   
6  19998     Judith       Hewitt      JudithHewitt98@gmail.com     stilettos   
7  83290      Kayla       Hardin        Kayla.Hardin@gmail.com     stilettos   
8  77867     Steven  Blankenship  Steven.Blankenship@gmail.com        wedges   
9  54885      Carol   Mclaughlin              CM3415@gmail.com  ballet flats   

  shoe_material shoe_color  price  
0  faux-leather      black  385.0  
1        fabric      brown  388.0  
2       lea

Our finance department wants to know the price of the most expensive pair of shoes purchased. Save your answer to the variable most_expensive.

In [6]:
most_expensive = orders.price.max()
print(most_expensive)

493.0


Our fashion department wants to know how many different colors of shoes we are selling. Save your answer to the variable num_colors.

In [7]:
num_colors = orders.shoe_color.nunique()
print(num_colors)

5


### Calculating Aggregate Functions I
When we have a bunch of data, we often want to calculate aggregate statistics (mean, standard deviation, median, percentiles, etc.) over certain subsets of the data.

In general, we use the following syntax to calculate aggregates:

In [8]:
# df.groupby('column1').column2.measurement()

where:

    column1 is the column that we want to group by (for example, student)
    column2 is the column that we want to perform a measurement on (for example, grade)
    measurement is the measurement function we want to apply (for example, mean)
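As a concrete illustration of this pattern, the sketch below groups a small table of grades by student and takes the mean grade for each one. The grades DataFrame and its values are hypothetical and chosen only to match the student/grade/mean names used above.

```python
import pandas as pd

# Hypothetical data: one row per graded assignment
grades = pd.DataFrame({
    'student': ['Amy', 'Amy', 'Bob', 'Bob', 'Bob'],
    'grade': [90, 80, 70, 75, 95],
})

# column1 = 'student', column2 = grade, measurement = mean
mean_grades = grades.groupby('student').grade.mean()
print(mean_grades)
```

The result has one entry per student, with the student names as the index.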

Let's return to our orders data from ShoeFly.com.

In the previous exercise, our finance department wanted to know the most expensive shoe that we sold.

Now, they want to know the most expensive shoe for each shoe_type (i.e., the most expensive boot, the most expensive ballet flat, etc.).

Save your answer to the variable pricey_shoes.

In [9]:
pricey_shoes = orders.groupby('shoe_type').price.max()
print(pricey_shoes)

shoe_type
ballet flats    481.0
boots           478.0
clogs           493.0
sandles         456.0
stilettos       487.0
wedges          461.0
Name: price, dtype: float64


What type of object is pricey_shoes?

In [10]:
print(type(pricey_shoes))

<class 'pandas.core.series.Series'>


### Calculating Aggregate Functions II
After using groupby, we often need to clean our resulting data.

As we saw in the previous exercise, a groupby aggregation on a single column returns a new Series, not a DataFrame. For our ShoeFly.com example, the index of the Series held the different values of shoe_type, and its name property was price.

Usually, we'd prefer that those index values were an ordinary column. To get that, we can use reset_index(). This will transform our Series into a DataFrame and move the index into its own column.

In practice, you'll usually see a groupby statement followed by reset_index:

In [11]:
# df.groupby('column1').column2.measurement().reset_index()

Modify your code from the previous exercise so that it ends with reset_index, which will change pricey_shoes into a DataFrame.

Now, what type of object is pricey_shoes?

In [12]:
pricey_shoes = orders.groupby('shoe_type').price.max().reset_index()
print(pricey_shoes)
print(type(pricey_shoes))

      shoe_type  price
0  ballet flats  481.0
1         boots  478.0
2         clogs  493.0
3       sandles  456.0
4     stilettos  487.0
5        wedges  461.0
<class 'pandas.core.frame.DataFrame'>


### Calculating Aggregate Functions III
Sometimes, the operation that you want to perform is more complicated than mean or count. In those cases, you can use the apply method and lambda functions, just like we did for individual column operations. Note that the input to our lambda function will always be a Series holding the values of the selected column for one group.

A great example of this is calculating percentiles. Suppose we have a DataFrame of employee information called df that has the following columns:

    id: the employee's id number
    name: the employee's name
    wage: the employee's hourly wage
    category: the type of work that the employee does

Example: If we want to calculate the 75th percentile (i.e., the point at which 75% of employees have a lower wage and 25% have a higher wage) for each category, we can use the following combination of apply and a lambda function:

In [13]:
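'''
# np.percentile can calculate any percentile over an array of values
# (requires: import numpy as np)
# The outer parentheses let the method chain span several lines
high_earners = (
    df.groupby('category').wage
    .apply(lambda x: np.percentile(x, 75))
    .reset_index()
)
'''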

Once more, we'll return to the data from ShoeFly.com. Our Marketing team says that it's important to have some affordably priced shoes available for every color of shoe that we sell.

Let's calculate the 25th percentile for shoe price for each shoe_color to help Marketing decide if we have enough cheap shoes on sale. Save the data to the variable cheap_shoes.

Note: Be sure to use reset_index() at the end of your query so that cheap_shoes is a DataFrame.

In [14]:
import numpy as np

cheap_shoes = orders.groupby('shoe_color').price.apply(lambda x: np.percentile(x, 25)).reset_index()
print(cheap_shoes)

  shoe_color  price
0      black    NaN
1      brown  193.5
2       navy  205.5
3        red  250.0
4      white  196.0
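The NaN for black is worth a second look. One likely cause, though this is an assumption since the underlying rows are not shown, is a missing price among the black shoes: np.percentile returns NaN whenever its input contains a NaN. The sketch below reproduces that behavior on a hypothetical toy table and shows two standard alternatives that skip missing values, np.nanpercentile and the groupby quantile method.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing price in the 'black' group
toy = pd.DataFrame({
    'shoe_color': ['black', 'black', 'black', 'red', 'red', 'red'],
    'price': [100.0, np.nan, 300.0, 200.0, 250.0, 300.0],
})

# np.percentile propagates the NaN, so 'black' comes out as NaN
with_nan = toy.groupby('shoe_color').price.apply(
    lambda x: np.percentile(x, 25)).reset_index()

# np.nanpercentile ignores missing values
skip_nan = toy.groupby('shoe_color').price.apply(
    lambda x: np.nanpercentile(x, 25)).reset_index()

# The built-in quantile method also skips missing values (0.25 = 25th percentile)
quantiles = toy.groupby('shoe_color').price.quantile(0.25).reset_index()

print(with_nan)
print(skip_nan)
print(quantiles)
```

Whether skipping missing prices is appropriate depends on why they are missing, so it is worth inspecting those rows before choosing one of these options.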




### Calculating Aggregate Functions IV
Sometimes, we want to group by more than one column. We can easily do this by passing a list of column names into the groupby method.

Imagine that we run a chain of stores and have data about the number of sales at different locations on different days:
(table not shown here)

We suspect that sales are different at different locations on different days of the week. In order to test this hypothesis, we could calculate the average sales for each store on each day of the week across multiple months. The code would look like this:

In [15]:
# df.groupby(['Location', 'Day of Week'])['Total Sales'].mean().reset_index()
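Since the sales table is not shown, here is a runnable sketch of the same call on a small made-up DataFrame. The column names follow the snippet above; the store names and sales figures are hypothetical.

```python
import pandas as pd

# Hypothetical sales records; only the column names come from the text above
df = pd.DataFrame({
    'Location': ['North', 'North', 'North', 'South', 'South', 'South'],
    'Day of Week': ['Monday', 'Monday', 'Tuesday', 'Monday', 'Tuesday', 'Tuesday'],
    'Total Sales': [400, 300, 450, 100, 200, 300],
})

# Passing a list of column names groups by every combination of their values
avg_sales = df.groupby(['Location', 'Day of Week'])['Total Sales'].mean().reset_index()
print(avg_sales)
```

The result has one row per Location/Day of Week combination that appears in the data.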

At ShoeFly.com, our Purchasing team thinks that certain shoe_type/shoe_color combinations are particularly popular this year (for example, blue ballet flats are all the rage in Paris).

Create a DataFrame with the total number of shoes of each shoe_type/shoe_color combination purchased. Save it to the variable shoe_counts.

You should be able to do this using groupby and count().

Note: When we're using count(), it usually doesn't matter which column we perform the calculation on, as long as that column has no missing values (count() only tallies non-missing entries). You should use id in this example, but we would get the same answer if we used shoe_type or last_name.

Remember to use reset_index() at the end of your code!
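One caveat on the choice of column: count() tallies non-missing values only, so the column stops mattering only when it has no gaps. The sketch below shows the difference on a hypothetical toy table with one missing price, and also shows size(), a standard groupby method that counts rows regardless of missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy orders; one 'boots' order has no recorded price
toy_orders = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'shoe_type': ['boots', 'boots', 'clogs', 'clogs'],
    'price': [100.0, np.nan, 200.0, 250.0],
})

# Counting a complete column gives the number of rows in each group
by_id = toy_orders.groupby('shoe_type')['id'].count()

# Counting a column with a missing value undercounts that group
by_price = toy_orders.groupby('shoe_type')['price'].count()

# size() counts rows directly and ignores missing values entirely
by_size = toy_orders.groupby('shoe_type').size()

print(by_id)
print(by_price)
print(by_size)
```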

In [16]:
shoe_counts = orders.groupby(['shoe_type', 'shoe_color'])['id'].count().reset_index()
print(shoe_counts)

       shoe_type shoe_color  id
0   ballet flats      black   2
1   ballet flats      brown   5
2   ballet flats        red   3
3   ballet flats      white   5
4          boots      black   3
5          boots      brown   5
6          boots       navy   6
7          boots        red   2
8          boots      white   3
9          clogs      black   4
10         clogs      brown   6
11         clogs       navy   1
12         clogs        red   4
13         clogs      white   1
14       sandles      black   1
15       sandles      brown   4
16       sandles       navy   5
17       sandles        red   3
18       sandles      white   4
19     stilettos      black   5
20     stilettos      brown   3
21     stilettos       navy   2
22     stilettos        red   2
23     stilettos      white   2
24        wedges      black   3
25        wedges      brown   4
26        wedges       navy   4
27        wedges        red   5
28        wedges      white   2


### Pivot Tables
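When we perform a groupby across multiple columns, we often want to change how our data is stored. The grouped result has one row per combination of values, which makes it hard to compare one dimension against the other. Rearranging the table so that the values of one column become the rows and the values of another become the columns is called pivoting, and the result is a pivot table.

In Pandas, the command for pivot is: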

In [17]:
'''
df.pivot(columns='ColumnToPivot',
         index='ColumnToBeRows',
         values='ColumnToBeValues')
'''

Just like with groupby, the output of a pivot command is a new DataFrame. The column passed as index becomes the row index rather than an ordinary column, so we usually follow up with .reset_index().
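To make the effect of pivot and reset_index concrete, here is a small sketch on a hypothetical long-format table (store names and figures are made up). It prints the pivoted table before and after reset_index so the change in indexing is visible.

```python
import pandas as pd

# Hypothetical long-format table: one row per Location/Day of Week combination
long_df = pd.DataFrame({
    'Location': ['North', 'North', 'South', 'South'],
    'Day of Week': ['Monday', 'Tuesday', 'Monday', 'Tuesday'],
    'Total Sales': [350, 450, 100, 250],
})

# Locations become the rows, days become the columns
wide = long_df.pivot(columns='Day of Week',
                     index='Location',
                     values='Total Sales')
print(wide)

# reset_index turns the Location index back into an ordinary column
wide_flat = wide.reset_index()
print(wide_flat)
```

Note that pivot raises an error if the same index/column pair appears more than once, which is why the data is aggregated with groupby first.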

In the previous example, you created a DataFrame with the total number of shoes of each shoe_type/shoe_color combination purchased for ShoeFly.com.

The purchasing manager complains that this DataFrame is confusing.

Make it easier for her to compare purchases of different shoe colors of the same shoe type by creating a pivot table. Save your results to the variable shoe_counts_pivot.

In [18]:
orders = pd.read_csv('orders.csv')

shoe_counts = orders.groupby(['shoe_type', 'shoe_color']).id.count().reset_index()

print(shoe_counts)

shoe_counts_pivot = shoe_counts.pivot(
    columns='shoe_color',
    index='shoe_type',
    values='id').reset_index()

print(shoe_counts_pivot)

       shoe_type shoe_color  id
0   ballet flats      black   2
1   ballet flats      brown   5
2   ballet flats        red   3
3   ballet flats      white   5
4          boots      black   3
5          boots      brown   5
6          boots       navy   6
7          boots        red   2
8          boots      white   3
9          clogs      black   4
10         clogs      brown   6
11         clogs       navy   1
12         clogs        red   4
13         clogs      white   1
14       sandles      black   1
15       sandles      brown   4
16       sandles       navy   5
17       sandles        red   3
18       sandles      white   4
19     stilettos      black   5
20     stilettos      brown   3
21     stilettos       navy   2
22     stilettos        red   2
23     stilettos      white   2
24        wedges      black   3
25        wedges      brown   4
26        wedges       navy   4
27        wedges        red   5
28        wedges      white   2
shoe_color     shoe_type  black  brown  

### Review
This lesson introduced you to aggregates in Pandas. You learned:

    How to compute aggregate statistics over groups of rows that share the same value using groupby.
    How to rearrange a DataFrame into a pivot table, a great way to compare data across two dimensions.

Let's examine some more data from ShoeFly.com. This time, we'll be looking at data about user visits to the website (the same dataset that you saw in the introduction to this lesson).

The data is a DataFrame called user_visits. Use print and head() to examine the first few rows of the DataFrame.

In [19]:
user_visits = pd.read_csv('page_visits.csv')
print(user_visits.head(10))

      id first_name   last_name                       email         month  \
0  10043      Louis        Koch       LouisKoch43@gmail.com     3 - March   
1  10150      Bruce        Webb     BruceWebb44@outlook.com     3 - March   
2  10155   Nicholas     Hoffman  Nicholas.Hoffman@gmail.com  2 - February   
3  10178    William         Key     William.Key@outlook.com     3 - March   
4  10208      Karen        Bass            KB4971@gmail.com  2 - February   
5  10260   Benjamin       Ochoa  Benjamin.Ochoa@outlook.com   1 - January   
6  10271     Gerald     Aguilar    Gerald.Aguilar@gmail.com     3 - March   
7  10278    Melissa     Lambert   Melissa.Lambert@gmail.com  2 - February   
8  10320       Adam  Strickland   Adam.Strickland@gmail.com     3 - March   
9  10389      Ethan       Payne    EthanPayne26@outlook.com  2 - February   

  utm_source  
0      yahoo  
1    twitter  
2     google  
3      yahoo  
4     google  
5    twitter  
6     google  
7      email  
8      email  
9 

The column utm_source contains information about how users got to ShoeFly's homepage. For instance, if utm_source = Facebook, then the user came to ShoeFly by clicking on an ad on Facebook.com.

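Use a groupby statement to calculate how many visits came from each of the different sources. Save your answer to the variable click_source. Then count the visits for each utm_source/month combination (click_source_by_month) and pivot the result so that each month becomes a column (click_source_by_month_pivot).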

Remember to use reset_index()!

In [20]:
click_source = user_visits.groupby('utm_source').id.count().reset_index()
print(click_source)

click_source_by_month = user_visits.groupby(['utm_source', 'month']).id.count().reset_index()

click_source_by_month_pivot = click_source_by_month.pivot(
    columns='month',
    index='utm_source',
    values='id').reset_index()

print(click_source_by_month_pivot)

  utm_source   id
0      email  462
1   facebook  823
2     google  543
3    twitter  415
4      yahoo  757
month utm_source  1 - January  2 - February  3 - March
0          email           43           147        272
1       facebook          404           263        156
2         google          127           196        220
3        twitter          164           154         97
4          yahoo          262           240        255
