In [1]:
import pandas as pd
import numpy as np

# From Questions to Code

### Understanding data granularity

## Announcements
* On-line Tutorials: Exercise Caution!
    - https://afraenkel.github.io/practical-data-science/introduction
* SettingWithCopy Example.

## Data Granularity: What is an Individual?

<div class="image-txt-container">

- Tables consist of:
    - rows (individuals or observations)
    - columns (measurements)


|Name|State|Color|Food|Date of Survey|
|---|---|---|---|---|
|Jane|NY|blue|Steak|2018|
|Aaron|CA|red|Mango|2017|
|Marina|IL|green|Apple|2015|
|...|...|...|...|...|

</div>    

What is an individual? A person? A survey response?


## Website Visits

<div class="image-txt-container">
    
<img src="imgs/netflix.png" width="50%">


* **Types of Individuals**
* Visits to Netflix (browsing; videos streamed)
* Customer accounts (paid subscriptions)

</div>


## Website Visits

<div class="image-txt-container">
    
<img src="imgs/netflix.png" width="50%">


- 140M streaming hrs/day (-\$)
- ~60M page visits/day (-\$)
- ~140M paid subscribers (+\$)
- 15M devices/day 
    * shared accounts?

</div>




# Design phase

If you can control how your dataset is created then you should determine the granularity of your data *before* collection. 

**Advantages**

- Recording at a fine granularity (e.g., every page visit) lets us move to a coarser one later if needed (how?)
- No need to collect additional data (saves time)


**Disadvantages**

- Expensive to collect
- Takes space
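
One possible answer to the "(how?)" above is an aggregation. The sketch below assumes a hypothetical `visits` table with one row per page visit; the column names `account` and `minutes` are invented for illustration and do not come from any real dataset.

```python
import pandas as pd

# Hypothetical fine-grained data: one row per page visit.
visits = pd.DataFrame({
    "account": ["a1", "a1", "a2", "a2", "a2", "a3"],
    "minutes": [30, 45, 10, 5, 20, 60],
})

# Coarsen the granularity: one row per account instead of one row per visit.
per_account = (
    visits
    .groupby("account")["minutes"]
    .agg(["count", "sum"])
    .rename(columns={"count": "n_visits", "sum": "total_minutes"})
)
print(per_account)
```

Going the other way (from one row per account back to one row per visit) is not possible without collecting the data again.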



# Manipulating Granularity

* The example above showed that the same data can be represented at different levels of granularity.

* To work with data from different tables, we can change the granularity of the data.

* The examples below show techniques for manipulating granularity.



### Discussion Question

Given the table below, for each color, what proportion identify as 'M'?
* Return your answer as a Series, indexed by color.
* Try it first for one color.
* How many passes through the data does your solution require?


In [2]:
people = pd.DataFrame(
    [["Joey", "blue", 42,"M"],
     ["Weiwei","blue", 50,"F"],
     ["Joey", "green", 8,"M"],
     ["Karina", "green",7, "F"],
     ["Fernando", "pink", -9,"M"],
     ["Nhi","blue",3,"F"],
     ["Sam","pink", -42,"X"]], 
    columns = ["Name", "Color", "Number", "Gender"])
people

Unnamed: 0,Name,Color,Number,Gender
0,Joey,blue,42,M
1,Weiwei,blue,50,F
2,Joey,green,8,M
3,Karina,green,7,F
4,Fernando,pink,-9,M
5,Nhi,blue,3,F
6,Sam,pink,-42,X


### Approach 1: 'looping through unique values'

* How many passes through the data?
* What are the space constraints?

In [3]:
colors = {}
for color in people['Color'].unique():
    filtered_for_color = people.loc[people['Color'] == color, : ]   # filter by color
    colors[color] = (filtered_for_color['Gender'] == 'M').mean()    # boolean array for "M" and take the mean

pd.Series(colors)

blue     0.333333
green    0.500000
pink     0.500000
dtype: float64

### Approach 2: 'single pass'

In [4]:
people

Unnamed: 0,Name,Color,Number,Gender
0,Joey,blue,42,M
1,Weiwei,blue,50,F
2,Joey,green,8,M
3,Karina,green,7,F
4,Fernando,pink,-9,M
5,Nhi,blue,3,F
6,Sam,pink,-42,X


In [5]:
colors = {}                                                  # dict
for idx, row in people.iterrows():                            
    print(row)
  

Name      Joey
Color     blue
Number      42
Gender       M
Name: 0, dtype: object
Name      Weiwei
Color       blue
Number        50
Gender         F
Name: 1, dtype: object
Name       Joey
Color     green
Number        8
Gender        M
Name: 2, dtype: object
Name      Karina
Color      green
Number         7
Gender         F
Name: 3, dtype: object
Name      Fernando
Color         pink
Number          -9
Gender           M
Name: 4, dtype: object
Name       Nhi
Color     blue
Number       3
Gender       F
Name: 5, dtype: object
Name       Sam
Color     pink
Number     -42
Gender       X
Name: 6, dtype: object


In [6]:
colors = {}                                                  # dict
for idx, row in people.iterrows():                            
    
    c, is_male = row['Color'], int(row['Gender'] == 'M')
    if c in colors:
        colors[c] += np.array([1, is_male])
    else:
        colors[c] = np.array([1, is_male])

colors

{'blue': array([3, 1]), 'green': array([2, 1]), 'pink': array([2, 1])}

In [7]:
# put in the dataframe and calculate the averages
df = pd.DataFrame(colors, index=['total', 'is_male'])
(df.loc['is_male'] / df.loc['total'])

blue     0.333333
green    0.500000
pink     0.500000
dtype: float64

### Issues with above solutions:

* Ad-hoc solution that depends on the specific problem.
* Loops in *Python* are slow (though the *algorithmic reasoning* is still relevant).

What are the *common patterns* in processing 'groups of data'?

## Grouping and Aggregating Data

**split-apply-combine**

<img src="imgs/image_0.png"/>


**Aggregation** is the process of turning the values of a dataset (or a subset of it) into one single value.


### Pandas `groupby` objects

This makes clear what the `groupby` accomplishes:

* **split** breaks up and groups a `DataFrame` depending on the value of the specified key.
* **apply** computes a function (e.g. aggregate, transformation, or filtering) within the individual groups.
* **combine** merges the results of these operations into an output array.
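
The three steps can be spelled out by hand and compared with `groupby`. This is a minimal sketch on a small table that mirrors the `people` example; the variable names `pieces`, `means`, and `by_hand` are illustrative only, and this is not how pandas implements `groupby` internally.

```python
import pandas as pd

people = pd.DataFrame({
    "Color": ["blue", "blue", "green", "green", "pink", "blue", "pink"],
    "Number": [42, 50, 8, 7, -9, 3, -42],
})

# split: one sub-DataFrame per distinct key
pieces = {
    color: people[people["Color"] == color]
    for color in people["Color"].unique()
}

# apply: reduce each piece to a single value
means = {color: piece["Number"].mean() for color, piece in pieces.items()}

# combine: collect the per-group results into one Series
by_hand = pd.Series(means).sort_index()

# the same computation expressed with groupby
with_groupby = people.groupby("Color")["Number"].mean()

print(by_hand)
print(with_groupby)
```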


### How `groupby` computes

* The `groupby` can (often) do this in a *single* pass over the data, updating the sum, mean, count, min, or other aggregate for each group along the way. 

* `groupby` abstracts away these steps: the user need not think about *how* the computation is done under the hood, but rather thinks about the *operation as a whole*.

* Big data tools use the same design pattern to send the computations on each group across *many computers*.

## `groupby` and `aggregate/apply/transform`

* `groupby`: grouping collections of records over a set of fields for computing quantities over the remaining fields.
    - `groupby` is a dataframe method that returns a `groupby` object.


* `aggregate`: aggregating using one or more operations over the specified axis (also `apply/transform`)
    - `aggregate` is a `groupby` object method that returns a Series/DataFrame.
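
To make the difference between `aggregate` and `transform` concrete, here is a small sketch on a table that mirrors the `people` example. `aggregate` returns one value per group, while `transform` returns one value per original row.

```python
import pandas as pd

people = pd.DataFrame({
    "Name": ["Joey", "Weiwei", "Joey", "Karina", "Fernando", "Nhi", "Sam"],
    "Color": ["blue", "blue", "green", "green", "pink", "blue", "pink"],
    "Number": [42, 50, 8, 7, -9, 3, -42],
})

grouped = people.groupby("Color")["Number"]

# aggregate: one value per group, indexed by the grouping key
group_max = grouped.aggregate("max")

# transform: one value per original row, aligned with the original index
row_group_max = grouped.transform("max")

print(group_max)
print(people.assign(group_max=row_group_max))
```

The `transform` result is convenient when each row needs to be compared against a summary of its own group.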



### `groupby` example

Given the table below, for each color, what proportion identify as 'M'?

In [8]:
people

Unnamed: 0,Name,Color,Number,Gender
0,Joey,blue,42,M
1,Weiwei,blue,50,F
2,Joey,green,8,M
3,Karina,green,7,F
4,Fernando,pink,-9,M
5,Nhi,blue,3,F
6,Sam,pink,-42,X


In [9]:
# add boolean column
people = people.assign(is_male=(people['Gender'] == 'M'))
people


Unnamed: 0,Name,Color,Number,Gender,is_male
0,Joey,blue,42,M,True
1,Weiwei,blue,50,F,False
2,Joey,green,8,M,True
3,Karina,green,7,F,False
4,Fernando,pink,-9,M,True
5,Nhi,blue,3,F,False
6,Sam,pink,-42,X,False


In [10]:
(    
    people
    .groupby('Color')['is_male']  # group by color, then select the is_male column
    .mean()                       # mean of a boolean column = proportion of True
)

Color
blue     0.333333
green    0.500000
pink     0.500000
Name: is_male, dtype: float64

### `groupby` example

* `dataframe.groupby(key)` returns a `DataFrameGroupBy` object.
* `.groups` is a dictionary mapping each grouping key to the index labels of the rows in that group
* `.get_group(key)` method returns a dataframe corresponding to the given key


In [11]:
# The `groupby` operator groups rows in the table that are the same in one or more columns.

grps = people.groupby("Color")
grps

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000001D977C57AC8>

In [12]:
# dictionary of keys and indices
grps.groups

{'blue': Int64Index([0, 1, 5], dtype='int64'),
 'green': Int64Index([2, 3], dtype='int64'),
 'pink': Int64Index([4, 6], dtype='int64')}

In [13]:
# size of each group
grps.size()

Color
blue     3
green    2
pink     2
dtype: int64

In [14]:
# To view the content:

for key, item in grps:
    print('***** %s *****\n' % key,
          grps.get_group(key), 
          "\n\n")


***** blue *****
      Name Color  Number Gender  is_male
0    Joey  blue      42      M     True
1  Weiwei  blue      50      F    False
5     Nhi  blue       3      F    False 


***** green *****
      Name  Color  Number Gender  is_male
2    Joey  green       8      M     True
3  Karina  green       7      F    False 


***** pink *****
        Name Color  Number Gender  is_male
4  Fernando  pink      -9      M     True
6       Sam  pink     -42      X    False 




### `groupby` and column selection

* We will commonly combine `groupby` with column selection:
    - e.g., `df.groupby("Region")["Sales"]` 
* Then add an aggregate calculation on that column:

In [15]:
people

Unnamed: 0,Name,Color,Number,Gender,is_male
0,Joey,blue,42,M,True
1,Weiwei,blue,50,F,False
2,Joey,green,8,M,True
3,Karina,green,7,F,False
4,Fernando,pink,-9,M,True
5,Nhi,blue,3,F,False
6,Sam,pink,-42,X,False


In [16]:
# median of a number for each color
# slice of the number column
people.groupby("Color")["Number"].median()

Color
blue     42.0
green     7.5
pink    -25.5
Name: Number, dtype: float64

In [17]:
people.groupby("Color")["Number"].mean()

Color
blue     31.666667
green     7.500000
pink    -25.500000
Name: Number, dtype: float64

In [18]:
people

Unnamed: 0,Name,Color,Number,Gender,is_male
0,Joey,blue,42,M,True
1,Weiwei,blue,50,F,False
2,Joey,green,8,M,True
3,Karina,green,7,F,False
4,Fernando,pink,-9,M,True
5,Nhi,blue,3,F,False
6,Sam,pink,-42,X,False


### Grouping over multiple columns

In [19]:
# can you predict the output of the groupby?

two_fields = people.groupby(["Color", "Gender"])


In [20]:
# To view the content:

for key, item in two_fields:
    print('***** %s *****\n' % str(key),
          two_fields.get_group(key), 
          "\n\n")

***** ('blue', 'F') *****
      Name Color  Number Gender  is_male
1  Weiwei  blue      50      F    False
5     Nhi  blue       3      F    False 


***** ('blue', 'M') *****
    Name Color  Number Gender  is_male
0  Joey  blue      42      M     True 


***** ('green', 'F') *****
      Name  Color  Number Gender  is_male
3  Karina  green       7      F    False 


***** ('green', 'M') *****
    Name  Color  Number Gender  is_male
2  Joey  green       8      M     True 


***** ('pink', 'M') *****
        Name Color  Number Gender  is_male
4  Fernando  pink      -9      M     True 


***** ('pink', 'X') *****
   Name Color  Number Gender  is_male
6  Sam  pink     -42      X    False 




In [21]:
people

Unnamed: 0,Name,Color,Number,Gender,is_male
0,Joey,blue,42,M,True
1,Weiwei,blue,50,F,False
2,Joey,green,8,M,True
3,Karina,green,7,F,False
4,Fernando,pink,-9,M,True
5,Nhi,blue,3,F,False
6,Sam,pink,-42,X,False


In [22]:
# for every color-gender pair get a max for Name and Number
people.groupby(["Color", "Gender"])[['Name','Number']].max()

Unnamed: 0_level_0,Unnamed: 1_level_0,Name,Number
Color,Gender,Unnamed: 2_level_1,Unnamed: 3_level_1
blue,F,Weiwei,50
blue,M,Joey,42
green,F,Karina,7
green,M,Joey,8
pink,M,Fernando,-9
pink,X,Sam,-42


In [23]:
# ^^ multi-Index

### `groupby` methods: `aggregate`, `apply`
* Aggregates using one or more operations over the specified axis.
* Takes in a dictionary of:
    - keys: names of columns to apply a function to,
    - values: the function to apply.
* There are more [sophisticated ways](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.aggregate.html#pandas.DataFrame.aggregate) to use it.
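
One of those more sophisticated forms is "named aggregation", where each keyword names an output column and gives a `(column, function)` pair. The sketch below assumes pandas 0.25 or later; the output column names (`n_people`, `mean_number`, `max_number`) are made up for illustration.

```python
import pandas as pd

people = pd.DataFrame({
    "Name": ["Joey", "Weiwei", "Joey", "Karina", "Fernando", "Nhi", "Sam"],
    "Color": ["blue", "blue", "green", "green", "pink", "blue", "pink"],
    "Number": [42, 50, 8, 7, -9, 3, -42],
})

# Each keyword is an output column name; each value is (input column, function).
summary = people.groupby("Color").agg(
    n_people=("Name", "count"),
    mean_number=("Number", "mean"),
    max_number=("Number", "max"),
)
print(summary)
```

Compared with the dictionary form, this allows several aggregates of the same column without producing multi-level column labels.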

In [24]:
# aggregate

def avg_str_len(series):
    return series.str.len().mean()  # purpose?

res = (
    people
        .groupby(["Color", "Gender"])
        .aggregate({"Name": avg_str_len, "Number": np.mean})
)

res

Unnamed: 0_level_0,Unnamed: 1_level_0,Name,Number
Color,Gender,Unnamed: 2_level_1,Unnamed: 3_level_1
blue,F,4.5,26.5
blue,M,4.0,42.0
green,F,6.0,7.0
green,M,4.0,8.0
pink,M,8.0,-9.0
pink,X,3.0,-42.0


In [25]:
people

Unnamed: 0,Name,Color,Number,Gender,is_male
0,Joey,blue,42,M,True
1,Weiwei,blue,50,F,False
2,Joey,green,8,M,True
3,Karina,green,7,F,False
4,Fernando,pink,-9,M,True
5,Nhi,blue,3,F,False
6,Sam,pink,-42,X,False


In [26]:
# aggregate with list
people.groupby(['Color', 'Gender']).aggregate([np.min, np.max, 'size'])

Unnamed: 0_level_0,Unnamed: 1_level_0,Name,Name,Name,Number,Number,Number,is_male,is_male,is_male
Unnamed: 0_level_1,Unnamed: 1_level_1,amin,amax,size,amin,amax,size,amin,amax,size
Color,Gender,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2
blue,F,Nhi,Weiwei,2,3,50,2,False,False,2
blue,M,Joey,Joey,1,42,42,1,True,True,1
green,F,Karina,Karina,1,7,7,1,False,False,1
green,M,Joey,Joey,1,8,8,1,True,True,1
pink,M,Fernando,Fernando,1,-9,-9,1,True,True,1
pink,X,Sam,Sam,1,-42,-42,1,False,False,1


`'size'` is passed as a string: `aggregate` accepts the names of certain functions as strings. Originally, `groupby` aggregation was implemented only for a fixed set of functions; the string names are kept for backwards compatibility.

# Stop Monday Lecture

### Animals at the zoo

In [27]:
zoo = pd.read_csv("data/zoo.csv")
zoo.head(10)

Unnamed: 0,animal,age,water_need
0,elephant,23,500
1,elephant,12,600
2,elephant,2,550
3,tiger,14,300
4,tiger,10,320
5,tiger,7,330
6,tiger,12,290
7,tiger,4,310
8,zebra,2,200
9,zebra,1,220


In [28]:
# 1. How much water is needed for all animals in a zoo?

zoo.water_need.sum()

7650

### Discussion Question

What happens if I execute `zoo.sum()`?

|Option|Answer|
|---|---|
|A:| Error|
|B:| sum will be calculated for the first column|
|C:| sum will be calculated for all columns|
|D:| sum will be calculated for the last column|

In [29]:
zoo

Unnamed: 0,animal,age,water_need
0,elephant,23,500
1,elephant,12,600
2,elephant,2,550
3,tiger,14,300
4,tiger,10,320
5,tiger,7,330
6,tiger,12,290
7,tiger,4,310
8,zebra,2,200
9,zebra,1,220


In [30]:
zoo.sum()

animal        elephantelephantelephanttigertigertigertigerti...
age                                                         164
water_need                                                 7650
dtype: object

In [31]:
# Find the average consumption of water for each type of animal

zoo.groupby("animal")['water_need'].mean()

animal
elephant    550.000000
kangaroo    416.666667
lion        477.500000
tiger       310.000000
zebra       184.285714
Name: water_need, dtype: float64

In [32]:
# 4. Find the median consumption of water and the oldest animal within each animal category 
    

z = zoo.groupby(['animal']).aggregate({"age": np.max, "water_need": np.median})

z

Unnamed: 0_level_0,age,water_need
animal,Unnamed: 1_level_1,Unnamed: 2_level_1
elephant,23,550
kangaroo,7,410
lion,12,460
tiger,14,310
zebra,11,220


### Grouping and Indexes

* The `groupby` operation creates an index based on the grouping columns.
* If the grouping was on multiple columns, the result has a `MultiIndex`.
    - Advice: given a `MultiIndex`? Use `.reset_index`!

In [33]:
# reminder

import numpy as np

def avg_str_len(series):
    return series.str.len().mean()  # purpose?

res = (
    people
        .groupby(["Color", "Gender"])
        .aggregate({"Name": avg_str_len, "Number": np.mean})
)

res

Unnamed: 0_level_0,Unnamed: 1_level_0,Name,Number
Color,Gender,Unnamed: 2_level_1,Unnamed: 3_level_1
blue,F,4.5,26.5
blue,M,4.0,42.0
green,F,6.0,7.0
green,M,4.0,8.0
pink,M,8.0,-9.0
pink,X,3.0,-42.0


In [34]:
# In some cases we might want to leave the grouping fields as columns:

(
    people
        .groupby(["Color", "Gender"], as_index=False)
        .aggregate({"Name": "first", "Number": np.mean})
)

Unnamed: 0,Color,Gender,Name,Number
0,blue,F,Weiwei,26.5
1,blue,M,Joey,42.0
2,green,F,Karina,7.0
3,green,M,Joey,8.0
4,pink,M,Fernando,-9.0
5,pink,X,Sam,-42.0


In [35]:
# Or using .reset_index instead

(
people
    .groupby(["Color", "Gender"])
    .aggregate({"Name": "first", "Number": np.mean})
    .reset_index()
)

Unnamed: 0,Color,Gender,Name,Number
0,blue,F,Weiwei,26.5
1,blue,M,Joey,42.0
2,green,F,Karina,7.0
3,green,M,Joey,8.0
4,pink,M,Fernando,-9.0
5,pink,X,Sam,-42.0


# GroupBy

In [36]:
depts = pd.read_csv("data/depts.csv")
depts


Unnamed: 0,dept,class,grade
0,one,1,500
1,one,2,500
2,one,3,500
3,one,4,500
4,one,5,500
...,...,...,...
75,four,16,500
76,four,17,500
77,four,18,500
78,four,19,500


In [37]:
#what is the length of the table?
out = depts.groupby(["dept", "class"]).mean()
out

Unnamed: 0_level_0,Unnamed: 1_level_0,grade
dept,class,Unnamed: 2_level_1
four,1,500
four,2,500
four,3,500
four,4,500
four,5,500
...,...,...
two,16,500
two,17,500
two,18,500
two,19,500


In [38]:
# `out` is already a DataFrame; `to_frame` is a Series method, so this call fails
out.to_frame()

AttributeError: 'DataFrame' object has no attribute 'to_frame'

In [None]:
# You can reshape it:
# Departments as the index
# Courses as columns
# How?

## `pivot` / `pivot_table` methods

* Pivot is used to examine aggregates with respect to two characteristics.
    - e.g. pivot sales data to look at average sales broken down by year and market.
* reshapes the rows *and the columns*  of a table


<img src="imgs/image_1.png" width="75%">

### `pivot`/ `pivot_table` methods reshape dataframes from 'long' to 'wide'
* `.pivot`/`.pivot_table` transforms:
    - a long table of rows 'indexed' by two characteristics,
    - into a wide table with one characteristic per axis.

* `.pivot` is a reshape that often follows a `groupby`.

In [39]:
# `.pivot` is a reshape that often follows a `groupby`.

In [40]:
out.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,grade
dept,class,Unnamed: 2_level_1
four,1,500
four,2,500
four,3,500
four,4,500
four,5,500


In [41]:
# fails: after the groupby, `dept` and `class` are in the index, not in the columns
out.pivot(index = "dept", columns = "class")

KeyError: "None of ['dept', 'class'] are in the columns"

In [42]:
out = depts.groupby(["dept", "class"],as_index=False).mean()
out.head()

Unnamed: 0,dept,class,grade
0,four,1,500
1,four,2,500
2,four,3,500
3,four,4,500
4,four,5,500


In [43]:
out.pivot(index = "dept", columns = "class")

Unnamed: 0_level_0,grade,grade,grade,grade,grade,grade,grade,grade,grade,grade,grade,grade,grade,grade,grade,grade,grade,grade,grade,grade
class,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20
dept,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2,Unnamed: 18_level_2,Unnamed: 19_level_2,Unnamed: 20_level_2
four,500,500,500,500,500,500,500,500,500,500,500,500,500,500,500,500,500,500,500,500
one,500,500,500,500,500,500,500,500,500,500,500,500,500,500,500,500,500,500,500,500
three,500,500,500,500,500,500,500,500,500,500,500,500,500,500,500,500,500,500,500,500
two,500,500,500,500,500,500,500,500,500,500,500,500,500,500,500,500,500,500,500,500


In [44]:
#Another example

people

Unnamed: 0,Name,Color,Number,Gender,is_male
0,Joey,blue,42,M,True
1,Weiwei,blue,50,F,False
2,Joey,green,8,M,True
3,Karina,green,7,F,False
4,Fernando,pink,-9,M,True
5,Nhi,blue,3,F,False
6,Sam,pink,-42,X,False


In [45]:
# Counts of Color/Gender
counts = people.groupby(["Color", "Gender"], as_index=False)['Number'].count()
counts

Unnamed: 0,Color,Gender,Number
0,blue,F,2
1,blue,M,1
2,green,F,1
3,green,M,1
4,pink,M,1
5,pink,X,1


In [46]:
# pivot method merely reshapes the data
counts.pivot(index='Color', columns='Gender', values='Number')

Gender,F,M,X
Color,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
blue,2.0,1.0,
green,1.0,1.0,
pink,,1.0,1.0


In [47]:
# pivot method merely reshapes the data
counts.pivot(index='Gender', columns='Color', values='Number')

Color,blue,green,pink
Gender,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
F,2.0,1.0,
M,1.0,1.0,1.0
X,,,1.0


### The `pivot_table` method can combine `groupby` and `pivot`

* Doing a pivot after a groupby is so common, `pivot_table` can do it!
* `aggfunc='count'` specifies to aggregate by count before pivoting.
* The equivalent of:
```
people.groupby(["Color", "Gender"], as_index=False)['Number'].count().pivot(index='Color', columns='Gender', values='Number')
```

In [48]:
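# For each color and gender, count the number of people
# (keyword arguments: recent pandas versions no longer accept positional arguments to pivot)
people.groupby(["Color", "Gender"], as_index=False)['Number'].count().pivot(index='Color', columns='Gender', values='Number')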

Gender,F,M,X
Color,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
blue,2.0,1.0,
green,1.0,1.0,
pink,,1.0,1.0


In [49]:
people.pivot_table(
    values  = "Number", # the entry to aggregate over
    index   = "Color",  # the row grouping attributes
    columns = "Gender",    # the column grouping attributes
    aggfunc = "count"   # the aggregation function
)

Gender,F,M,X
Color,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
blue,2.0,1.0,
green,1.0,1.0,
pink,,1.0,1.0


### `pivot_table` observations

1. The second "grouping" column (`Gender`) has been **"pivoted" from the rows to the columns**. 
2. There is a missing value for `pink` and `F` since none of the women chose `pink` as their favorite color.
    - Use the `fill_value` keyword argument to specify how missing values are filled in.

In [50]:
people.pivot_table(
    values  = "Number",
    index   = "Color",
    columns = "Gender",
    aggfunc = "count",
    fill_value = 0.0
)

Gender,F,M,X
Color,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
blue,2,1,0
green,1,1,0
pink,0,1,1


### `pivot_table` observations

* Rows/columns do *not* represent individuals/observations.
* The statistical summaries are related to joint/conditional distributions:
    - Joint: 'The distribution of (Color, Gender) pairs'
    - Conditional: 'The distribution of Colors, given Gender=...'
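
As an alternative route to the same summaries, `pd.crosstab` can count and normalize in one call. This is a sketch on a table that mirrors the `people` example, not the approach the lecture itself uses; the `normalize` argument chooses between the joint distribution (`"all"`) and the two conditional distributions (`"columns"` and `"index"`).

```python
import pandas as pd

people = pd.DataFrame({
    "Color": ["blue", "blue", "green", "green", "pink", "blue", "pink"],
    "Gender": ["M", "F", "M", "F", "M", "F", "X"],
})

# raw counts of (Color, Gender) pairs
counts = pd.crosstab(people["Color"], people["Gender"])

# joint distribution: all cells together sum to 1
joint = pd.crosstab(people["Color"], people["Gender"], normalize="all")

# distribution of Color given Gender: each column sums to 1
color_given_gender = pd.crosstab(
    people["Color"], people["Gender"], normalize="columns"
)

# distribution of Gender given Color: each row sums to 1
gender_given_color = pd.crosstab(
    people["Color"], people["Gender"], normalize="index"
)

print(counts)
print(joint)
print(color_given_gender)
print(gender_given_color)
```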

In [51]:
cnts = people.pivot_table(
    values  = "Number",
    index   = "Color",
    columns = "Gender",
    aggfunc = "count",
    fill_value = 0.0
)

cnts

Gender,F,M,X
Color,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
blue,2,1,0
green,1,1,0
pink,0,1,1


In [52]:
# to get a distribution we need to normalize the values
# we will divide by the total number of people

joint = cnts / cnts.sum().sum()
joint

# joint empirical distribution of color and gender

Gender,F,M,X
Color,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
blue,0.285714,0.142857,0.0
green,0.142857,0.142857,0.0
pink,0.0,0.142857,0.142857


In [53]:
cnts.sum().sum()

7

You can also get marginal distributions back from the joint distribution.
We can get back the empirical distribution of `color` or `gender` separately by summing the rows and the columns. 

In [54]:
# marginal distribution of colors (sum across each row)

joint.sum(axis=1).to_frame()

Unnamed: 0_level_0,0
Color,Unnamed: 1_level_1
blue,0.428571
green,0.285714
pink,0.285714


In [55]:
# marginal distribution of genders (sum down each column)

joint.sum(axis=0).to_frame()

Unnamed: 0_level_0,0
Gender,Unnamed: 1_level_1
F,0.428571
M,0.428571
X,0.142857


In [56]:
# conditional distribution of Color given Gender: P(A|B) = P(A and B) / P(B)
# e.g. P(blue | F) = joint probability of (blue, F) divided by the marginal probability of F

joint

Gender,F,M,X
Color,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
blue,0.285714,0.142857,0.0
green,0.142857,0.142857,0.0
pink,0.0,0.142857,0.142857


In [57]:
0.285714/0.428571

0.6666666666666667

In [58]:
cnts

Gender,F,M,X
Color,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
blue,2,1,0
green,1,1,0
pink,0,1,1


In [59]:
# conditional distributions of color given Gender 
# (each column sums to 1)
# divide each column by a total in the column

cnts.apply(lambda x:x / x.sum())

Gender,F,M,X
Color,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
blue,0.666667,0.333333,0.0
green,0.333333,0.333333,0.0
pink,0.0,0.333333,1.0


In [60]:
# conditional distributions of Gender given Color
# (each column sums to 1)
# Either normalize by rows or transpose it
cnts.T.apply(lambda x:x / x.sum())

Color,blue,green,pink
Gender,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
F,0.666667,0.5,0.0
M,0.333333,0.5,0.5
X,0.0,0.0,0.5


### `Pivot` conclusion

* Pivots reshape your data from long to wide.
* Other reshaping dataframe methods:
    - `melt`: un-pivots your data
    - `stack`: pivoting multi-level columns to multi-indices
    - `unstack`: pivoting multi-indices to columns
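
A short sketch of how these methods relate to `pivot`, using a small count table like the one built earlier. The variable names are illustrative; `dropna()` is applied after `melt` and `stack` so that the combinations with no rows are removed again, whichever pandas version is in use.

```python
import pandas as pd

long = pd.DataFrame({
    "Color": ["blue", "blue", "green", "green", "pink", "pink"],
    "Gender": ["F", "M", "F", "M", "M", "X"],
    "Number": [2, 1, 1, 1, 1, 1],
})

# long -> wide
wide = long.pivot(index="Color", columns="Gender", values="Number")

# melt: wide -> long again (empty cells come back as NaN rows, dropped here)
back_to_long = (
    wide.reset_index()
        .melt(id_vars="Color", var_name="Gender", value_name="Number")
        .dropna()
)

# unstack: move the inner level of a MultiIndex into the columns
unstacked = long.set_index(["Color", "Gender"])["Number"].unstack()

# stack: move the column labels into the inner level of the index
stacked = wide.stack().dropna()

print(wide)
print(back_to_long)
print(unstacked)
print(stacked)
```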

# Simpson's 'Paradox'
* Datasets look different at different granularities

<img src="imgs/image_2.png">

# Example 1. How Berkeley was sued for gender discrimination

<img src="imgs/image_3.png">
**Do you agree?**

## Researchers looked more closely within specific departments

<img src="imgs/image_4.png">

**and what did they see?**

(from here: https://medium.com/@dexter.shawn/how-uc-berkeley-almost-got-sued-because-of-lying-data-aaa5d641f571)

### What happened with admission?


<div class="image-txt-container">
    
<img src="imgs/simpsons_berkeley.png" width="50%">


* Most depts admitted women at a HIGHER rate!
* Dept A: few women applicants
* Dept F: many women applicants
* Women apply to harder depts.

    
</div>




### What happened? (by the numbers)

* Overall acceptance rate: 35% (women) vs 44% (men).

* Dept A has an acceptance rate of 82% for women vs 62% for men! 
    - **2%** of all women applied to Dept A.
    - **10%** of all men applied to Dept A.
    
* Dept F has an acceptance rate of 6% for women vs 7% for men! 
    - **8%** of all women applied to Dept F.
    - **4%** of all men applied to Dept F.

**Conclusion:** Women tend to apply to depts with a low-acceptance rate.

## Simpson's Paradox

* When grouped data tells the opposite story of the ungrouped data. 

* This *often* happens because there is a hidden factor (*a confounder*) within the data that influences results.

* What is the "correct" way to summarize your data? What if you had to act on these results?

# Example 2. Hospital Example

* Should I send my elderly relative to Hospital A or B?

<img src="imgs/hospitals.png">
    
[[from here]](https://www.youtube.com/watch?v=sxYrzzy3cq8&feature=youtu.be) 

### Additional observation:

Not all patients arrive with the same health:

<br/>

<div class="image-txt-container">
    
    
    
<img src="imgs/A_poor.png" width="42%">


<img src="imgs/B_poor.png" width="45%">

    
</div>


Calculate the survival rate for those in poor health.

In [61]:
# for A:
print(30*100/100)
# for B:
print(210*100/400)

30.0
52.5


### Question

* What if your relative's health is good? 
* What hospital should you choose, A or B?

Remember that hospital
* A had 900/1000 survivors (30/100 in poor health) 
* B had 800/1000 survivors (210/400 in poor health). 

Talk to each other and vote:

|Option|Answer|
|---|---|
|A| Hospital A|
|B| Hospital B|
|C| Impossible to decide, not enough data|


### Simpson's paradox explanation: hospital example

* The data show opposite trends, depending on how they are grouped. 
* The hidden factor is the relative proportion of patients who arrived in good/poor health.
* *In this case*, how you act depends on which group you are in.
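
The figures quoted earlier (900/1000 overall and 30/100 in poor health for A; 800/1000 overall and 210/400 in poor health for B) are enough to work out the good-health survival rates by subtraction. A minimal sketch of that arithmetic:

```python
# (survivors, patients) overall and for patients who arrived in poor health
hospitals = {
    "A": {"all": (900, 1000), "poor": (30, 100)},
    "B": {"all": (800, 1000), "poor": (210, 400)},
}

rates = {}
for name, h in hospitals.items():
    surv_all, n_all = h["all"]
    surv_poor, n_poor = h["poor"]
    # good-health patients are everyone who did not arrive in poor health
    surv_good, n_good = surv_all - surv_poor, n_all - n_poor
    rates[name] = {
        "overall": surv_all / n_all,
        "poor": surv_poor / n_poor,
        "good": surv_good / n_good,
    }
    print(name, rates[name])
```

Hospital A has the higher overall rate, yet Hospital B has the higher rate within both the poor-health and the good-health group, because A receives far fewer poor-health patients.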

## Caution: Simpson's paradox is merely arithmetic

* Sometimes there are no *relevant* confounders.
* Simpson's paradox is present in ~2% of randomly chosen "grouping distributions".
* The best way to interpret the data depends on what you want from it!


### Restaurant reviews and phone types

* You are deciding between two restaurants with a friend.
* In a new feature, Yelp aggregates reviews by attributes of the reviewers.
* Should you choose restaurant A or B? 

|Phone Type|Stars for A|Stars for B|
|---|---|---|
|Android|4.24|4.0|
|iPhone|2.99|2.79|
|___|___|___|
|All|3.32|3.37|



### Restaurant reviews and phone types
* It's doubtful that your phone-type will *cause* you to prefer one restaurant over another (?)
* If you aggregate again, the inequalities may flip *again* (e.g. phone-type ownership by zip-code)
* Simpson's paradox is merely a property of weighted averages!

* Maybe Android users give better reviews? But you care about relative rank!
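
The "weighted averages" point can be checked against the table of stars. Assuming every reviewer uses either Android or iPhone (an assumption, since the table does not say so), each "All" value is a weighted average of the two rows, and the implied Android share of reviewers can be solved for:

```python
# Star averages taken from the table above.
stars = {
    "A": {"Android": 4.24, "iPhone": 2.99, "All": 3.32},
    "B": {"Android": 4.0, "iPhone": 2.79, "All": 3.37},
}

# All = w * Android + (1 - w) * iPhone
# => w = (All - iPhone) / (Android - iPhone)
android_share = {
    name: (s["All"] - s["iPhone"]) / (s["Android"] - s["iPhone"])
    for name, s in stars.items()
}

for name, w in android_share.items():
    print(f"Restaurant {name}: implied Android share of reviewers = {w:.1%}")
```

Under that assumption, roughly a quarter of A's reviewers and nearly half of B's are Android users, which is why B's overall average can be higher even though A wins within each phone type.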

### Verifying Simpson's paradox
* Suppose we have a dataset of individual ratings
* Can you verify Simpson's paradox?

In [62]:
ratings = pd.read_csv('data/ratings.csv')
ratings.head()

Unnamed: 0,phone,restaurant,rating
0,Android,A,4
1,Android,A,4
2,Android,A,4
3,Android,A,4
4,Android,A,4
