<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Removing-Duplicates" data-toc-modified-id="Removing-Duplicates-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Removing Duplicates</a></span></li><li><span><a href="#Counting" data-toc-modified-id="Counting-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Counting</a></span></li><li><span><a href="#Proportion" data-toc-modified-id="Proportion-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Proportion</a></span></li><li><span><a href="#Group-By" data-toc-modified-id="Group-By-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Group By</a></span></li><li><span><a href="#Pivot-Table-in-Pandas" data-toc-modified-id="Pivot-Table-in-Pandas-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Pivot Table in Pandas</a></span></li></ul></div>

In [1]:
import pandas as pd
import numpy as np

In [2]:
# Data Frame from cars.csv
cars = pd.read_csv('../datasets/cars.csv', index_col=0)
cars

,cars_per_cap,country,drives_right
US,809,United States,True
AUS,731,Australia,False
JPN,588,Japan,False
IN,18,India,False
RU,200,Russia,True
MOR,70,Morocco,True
EG,45,Egypt,True


## Removing Duplicates

We can drop duplicate rows based on the values in a specific column. For each distinct value, only the first row in which it appears is kept (the default, `keep='first'`).

In [3]:
cars.drop_duplicates(subset='drives_right')

,cars_per_cap,country,drives_right
US,809,United States,True
AUS,731,Australia,False
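
The `keep` argument controls which of the duplicate rows survives. The following is a minimal sketch, assuming the same seven rows as `cars.csv`, rebuilt inline here so the example runs without the file:

```python
import pandas as pd

# Same rows as the cars table above, rebuilt inline (no CSV file needed).
cars = pd.DataFrame(
    {
        "cars_per_cap": [809, 731, 588, 18, 200, 70, 45],
        "country": ["United States", "Australia", "Japan", "India",
                    "Russia", "Morocco", "Egypt"],
        "drives_right": [True, False, False, False, True, True, True],
    },
    index=["US", "AUS", "JPN", "IN", "RU", "MOR", "EG"],
)

# keep="first" (the default) keeps the first row seen for each value.
first_rows = cars.drop_duplicates(subset="drives_right")

# keep="last" keeps the last row seen for each value instead.
last_rows = cars.drop_duplicates(subset="drives_right", keep="last")

# keep=False drops every row whose value occurs more than once.
no_repeats = cars.drop_duplicates(subset="drives_right", keep=False)

print(list(first_rows.index))  # ['US', 'AUS']
print(list(last_rows.index))   # ['IN', 'EG']
print(len(no_repeats))         # 0
```

Both `True` and `False` occur more than once in `drives_right`, which is why `keep=False` leaves no rows at all.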


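To keep unique combinations of values instead, we can pass a list of columns to `subset`. Every country appears only once here, so each (`drives_right`, `country`) pair is already unique and no rows are dropped.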

In [4]:
cars.drop_duplicates(subset=['drives_right', 'country'])

,cars_per_cap,country,drives_right
US,809,United States,True
AUS,731,Australia,False
JPN,588,Japan,False
IN,18,India,False
RU,200,Russia,True
MOR,70,Morocco,True
EG,45,Egypt,True


## Counting

We can count how often each value occurs with the `.value_counts()` method. The optional `sort` argument (default `True`) orders the result from the most frequent to the least frequent value.

In [5]:
# How many times does each country appear in this dataset?
cars['country'].value_counts(sort=True)

Egypt            1
Morocco          1
Russia           1
United States    1
Australia        1
Japan            1
India            1
Name: country, dtype: int64
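
Every country appears exactly once, so the counts above are all 1. A column with repeated values shows the method more clearly. This is a small sketch on `drives_right`, again assuming the cars rows are rebuilt inline:

```python
import pandas as pd

# Same rows as the cars table above, rebuilt inline (no CSV file needed).
cars = pd.DataFrame(
    {
        "cars_per_cap": [809, 731, 588, 18, 200, 70, 45],
        "country": ["United States", "Australia", "Japan", "India",
                    "Russia", "Morocco", "Egypt"],
        "drives_right": [True, False, False, False, True, True, True],
    },
    index=["US", "AUS", "JPN", "IN", "RU", "MOR", "EG"],
)

# Counts per value, most frequent first (sort=True is the default).
counts = cars["drives_right"].value_counts()

# ascending=True reverses the order: least frequent first.
counts_ascending = cars["drives_right"].value_counts(ascending=True)

# The number of distinct countries is a separate question from the
# per-country counts; nunique() answers it directly.
n_countries = cars["country"].nunique()

print(counts.index.tolist(), counts.tolist())  # [True, False] [4, 3]
print(counts_ascending.index.tolist())         # [False, True]
print(n_countries)                             # 7
```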

## Proportion

Passing `normalize=True` to `.value_counts()` returns each value's share of the total instead of its raw count, so the results sum to 1.

In [6]:
# Proportion of the countries in this dataset
cars['country'].value_counts(sort=True, normalize=True)

Egypt            0.142857
Morocco          0.142857
Russia           0.142857
United States    0.142857
Australia        0.142857
Japan            0.142857
India            0.142857
Name: country, dtype: float64
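
Each of the 7 countries appears once, which is why every proportion above is 1/7. As a sketch on a column with repeated values (cars rows rebuilt inline), the proportions of `drives_right` can also be turned into percentages:

```python
import pandas as pd

# Same rows as the cars table above, rebuilt inline (no CSV file needed).
cars = pd.DataFrame(
    {
        "cars_per_cap": [809, 731, 588, 18, 200, 70, 45],
        "country": ["United States", "Australia", "Japan", "India",
                    "Russia", "Morocco", "Egypt"],
        "drives_right": [True, False, False, False, True, True, True],
    },
    index=["US", "AUS", "JPN", "IN", "RU", "MOR", "EG"],
)

# Share of rows for each value: 4/7 for True and 3/7 for False.
shares = cars["drives_right"].value_counts(normalize=True)

# Scale to percentages and round for display.
percentages = (shares * 100).round(1)

print(shares.sum())          # the shares add up to 1 (up to float rounding)
print(percentages.tolist())  # [57.1, 42.9]
```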

## Group By

- We can group by the categorical values within one column then get the summary statistics of each group
- We can use `.agg()` for multiple summary statistics
- **We can read these as: *For each group of X, select the columns, and calculate the functions***

In [7]:
# Group by `drives_right` categories and work on the numbers from `cars_per_cap`
cars.groupby('drives_right')[['cars_per_cap']].agg(['sum', 'mean', 'median', 'min', 'max'])

,cars_per_cap,cars_per_cap,cars_per_cap,cars_per_cap,cars_per_cap
,sum,mean,median,min,max
drives_right,,,,,
False,1337,445.666667,588,18,731
True,1124,281.0,135,45,809
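
If the two-level column header is inconvenient, named aggregation is one alternative: each keyword passed to `.agg()` becomes a flat column name. This is a minimal sketch with the cars rows rebuilt inline; the output names `total`, `average` and `middle` are illustrative choices, not part of the dataset.

```python
import pandas as pd

# Same rows as the cars table above, rebuilt inline (no CSV file needed).
cars = pd.DataFrame(
    {
        "cars_per_cap": [809, 731, 588, 18, 200, 70, 45],
        "country": ["United States", "Australia", "Japan", "India",
                    "Russia", "Morocco", "Egypt"],
        "drives_right": [True, False, False, False, True, True, True],
    },
    index=["US", "AUS", "JPN", "IN", "RU", "MOR", "EG"],
)

# For each group of drives_right, select cars_per_cap and calculate
# the functions. Each keyword is new_column=(source_column, function).
summary = cars.groupby("drives_right").agg(
    total=("cars_per_cap", "sum"),        # hypothetical output name
    average=("cars_per_cap", "mean"),     # hypothetical output name
    middle=("cars_per_cap", "median"),    # hypothetical output name
)

print(list(summary.columns))     # ['total', 'average', 'middle']
print(summary["total"].tolist()) # [1337, 1124] for False, True
```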


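We can also group by multiple columns. Each (`country`, `drives_right`) pair contains a single row in this dataset, so every statistic simply equals that row's value.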

In [8]:
# Group by `country`, then `drives_right` categories and work on the numbers from `cars_per_cap`
cars.groupby(['country', 'drives_right'])[['cars_per_cap']].agg(['sum', 'mean', 'median', 'min', 'max'])

,,cars_per_cap,cars_per_cap,cars_per_cap,cars_per_cap,cars_per_cap
,,sum,mean,median,min,max
country,drives_right,,,,,
Australia,False,731,731,731,731,731
Egypt,True,45,45,45,45,45
India,False,18,18,18,18,18
Japan,False,588,588,588,588,588
Morocco,True,70,70,70,70,70
Russia,True,200,200,200,200,200
United States,True,809,809,809,809,809


**We can sort a grouped table by the values of any column. When the column labels have multiple levels, identify the column with a tuple such as `('cars_per_cap', 'mean')`.**

In [11]:
grouped_cars = cars.groupby(['country', 'drives_right'])[['cars_per_cap']].agg(['sum', 'mean', 'median', 'min', 'max'])
grouped_cars.sort_values(('cars_per_cap', 'mean'), ascending=False)

,,cars_per_cap,cars_per_cap,cars_per_cap,cars_per_cap,cars_per_cap
,,sum,mean,median,min,max
country,drives_right,,,,,
United States,True,809,809,809,809,809
Australia,False,731,731,731,731,731
Japan,False,588,588,588,588,588
Russia,True,200,200,200,200,200
Morocco,True,70,70,70,70,70
Egypt,True,45,45,45,45,45
India,False,18,18,18,18,18
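
To see why a tuple is needed, it helps to look at the column labels themselves. This sketch (cars rows rebuilt inline, grouped by `drives_right` only to keep the table small) prints the labels and sorts by two different statistics:

```python
import pandas as pd

# Same rows as the cars table above, rebuilt inline (no CSV file needed).
cars = pd.DataFrame(
    {
        "cars_per_cap": [809, 731, 588, 18, 200, 70, 45],
        "country": ["United States", "Australia", "Japan", "India",
                    "Russia", "Morocco", "Egypt"],
        "drives_right": [True, False, False, False, True, True, True],
    },
    index=["US", "AUS", "JPN", "IN", "RU", "MOR", "EG"],
)

grouped = cars.groupby("drives_right")[["cars_per_cap"]].agg(["mean", "max"])

# Each column label is a (column, function) tuple.
print(list(grouped.columns))
# [('cars_per_cap', 'mean'), ('cars_per_cap', 'max')]

# Sort by the mean: False (about 445.7) comes before True (281.0).
by_mean = grouped.sort_values(("cars_per_cap", "mean"), ascending=False)

# Sort by the max: True (809) comes before False (731).
by_max = grouped.sort_values(("cars_per_cap", "max"), ascending=False)

print(by_mean.index.tolist())  # [False, True]
print(by_max.index.tolist())   # [True, False]
```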


## Pivot Table in Pandas

- Pivot tables are very similar to `groupby`: both split the rows into groups and aggregate each group
- By default, `pivot_table` computes the mean of each group
  - We can pass one or more other functions through the `aggfunc` argument

In [None]:
# Using Group By
cars.groupby('drives_right')[['cars_per_cap']].mean()

In [None]:
# Using Pivot Table (mean as default)
cars.pivot_table(values='cars_per_cap', index='drives_right')

In [None]:
# Using Pivot Table
cars.pivot_table(values='cars_per_cap', index='drives_right', aggfunc=['mean', 'median'])
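
The cells above can be checked against each other. This sketch (cars rows rebuilt inline) confirms that the default pivot table holds the same means as the `groupby` version, and shows one difference in layout: with a list of functions, `pivot_table` puts the function name on the top column level, the reverse of `groupby(...).agg([...])`.

```python
import pandas as pd

# Same rows as the cars table above, rebuilt inline (no CSV file needed).
cars = pd.DataFrame(
    {
        "cars_per_cap": [809, 731, 588, 18, 200, 70, 45],
        "country": ["United States", "Australia", "Japan", "India",
                    "Russia", "Morocco", "Egypt"],
        "drives_right": [True, False, False, False, True, True, True],
    },
    index=["US", "AUS", "JPN", "IN", "RU", "MOR", "EG"],
)

# Mean of cars_per_cap per drives_right group, computed two ways.
by_groupby = cars.groupby("drives_right")[["cars_per_cap"]].mean()
by_pivot = cars.pivot_table(values="cars_per_cap", index="drives_right")

same_means = (
    by_groupby["cars_per_cap"].tolist() == by_pivot["cars_per_cap"].tolist()
)
print(same_means)  # True

# With several functions the column labels are (function, column) tuples.
multi = cars.pivot_table(
    values="cars_per_cap", index="drives_right", aggfunc=["mean", "median"]
)
print(list(multi.columns))
# [('mean', 'cars_per_cap'), ('median', 'cars_per_cap')]
```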

We can also group by multiple columns with `pivot_table`. The `fill_value` argument replaces `NaN` cells, which appear for combinations that have no data. Setting `margins=True` adds a final row and column, labeled `All` by default, that apply each aggregate function across all the rows and columns.

In [None]:
# Group by `country`, then `drives_right` categories and work on the numbers from `cars_per_cap`
cars.groupby(['country', 'drives_right'])[['cars_per_cap']].agg(['sum', 'mean', 'median', 'min', 'max'])

In [None]:
# Using Pivot Table
cars.pivot_table(
    values='cars_per_cap',  # values to aggregate
    index='country',  # column to group by, shown as the pivot table rows
    columns=['drives_right'],  # column to group by, shown as the pivot table columns
    aggfunc=['sum', 'mean', 'median', 'min', 'max'],  # aggregate functions to calculate
    fill_value='-',  # replace NaN cells with this value
    margins=True  # add an `All` row and column computed over all rows and columns
)
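
To make the effect of `fill_value` and `margins` easier to inspect, here is a reduced sketch with a single aggregate function. It assumes the cars rows rebuilt inline, and it adds a hypothetical `side` column (`'left'`/`'right'`, derived from `drives_right`) only so that the pivot columns have readable string labels.

```python
import pandas as pd

# Same rows as the cars table above, rebuilt inline (no CSV file needed).
cars = pd.DataFrame(
    {
        "cars_per_cap": [809, 731, 588, 18, 200, 70, 45],
        "country": ["United States", "Australia", "Japan", "India",
                    "Russia", "Morocco", "Egypt"],
        "drives_right": [True, False, False, False, True, True, True],
    },
    index=["US", "AUS", "JPN", "IN", "RU", "MOR", "EG"],
)

# Hypothetical helper column (not in cars.csv): readable column labels.
cars["side"] = cars["drives_right"].map({True: "right", False: "left"})

table = cars.pivot_table(
    values="cars_per_cap",
    index="country",
    columns="side",
    aggfunc="sum",
    fill_value=0,   # cells with no data become 0 instead of NaN
    margins=True,   # adds an "All" row and an "All" column
)

# Japan has no "right" row, so that cell holds the fill value.
japan_right = table.loc["Japan", "right"]

# The "All" row sums each column; the corner cell sums everything.
left_total = table.loc["All", "left"]
grand_total = table.loc["All", "All"]

print(japan_right, left_total, grand_total)
```

With `aggfunc="sum"` the margins are plain totals (1337 for `left`, 2461 overall). With another function such as `"mean"`, the `All` cells would instead hold that function applied to all the underlying rows.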