In [None]:
from IPython.core.display import HTML
import pandas as pd 

In [2]:
spending_df = pd.read_csv('https://www.dropbox.com/s/ce9b47nzt3sx7y5/spending_10k.csv?dl=1', index_col="unique_id", dtype={"doctor_id":"object"})
spending_df.head(10)

unique_id,doctor_id,specialty,medication,nb_beneficiaries,spending
NX531425,1255626040,FAMILY PRACTICE,METFORMIN HCL,30,135.24
QG879256,1699761833,FAMILY PRACTICE,ALLOPURINOL,30,715.76
FW363228,1538148804,INTERNAL MEDICINE,LOSARTAN POTASSIUM,146,1056.47
WD733417,1730200619,PSYCHIATRY,OLANZAPINE,13,28226.97
XW149832,1023116894,FAMILY PRACTICE,PRAVASTATIN SODIUM,348,8199.48
QT485324,1952359671,FAMILY PRACTICE,HYDROCHLOROTHIAZIDE,57,247.01
NA293426,1841235223,FAMILY PRACTICE,SEVELAMER CARBONATE,11,4869.32
IF945618,1326095662,INTERNAL MEDICINE,FLUTICASONE/SALMETEROL,20,7832.46
PH384257,1821126830,HEMATOLOGY/ONCOLOGY,ZOLPIDEM TARTRATE,14,65.21
JY407340,1710986088,INTERNAL MEDICINE,MECLIZINE HCL,47,861.67


### Overview


* In this section, we will tackle the handy `groupby` method.

* We also cover the split-apply-combine scheme to:

  * Aggregate data in each group
  * Transform data in each group
  * Filter the data in each group
  * Thin the data in each group

### `groupby` and `DataFrame` groups

* The `groupby()` method is used to group the data using values from one or more columns.

   * `groupby` takes as input one or more column labels, which it uses to group the data.

```python
df_1.groupby("X")
```

![](https://www.dropbox.com/s/86bi697t59zmkdn/groupby.png?dl=1)
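As a minimal sketch of the call above, the snippet below builds a small hypothetical `df_1` (the column values are made up for illustration) and inspects which rows end up in which group. It also shows that a list of labels groups on several columns at once.

```python
import pandas as pd

# Hypothetical toy frame standing in for df_1
df_1 = pd.DataFrame({
    "X": ["a", "b", "a", "c", "b", "a"],
    "Y": [1, 2, 3, 4, 5, 6],
})

grouped = df_1.groupby("X")

# Mapping from each group key to the row labels that belong to it
group_labels = {key: list(idx) for key, idx in grouped.groups.items()}
print(group_labels)

# Grouping on several columns takes a list of labels
df_1["Z"] = ["u", "u", "v", "v", "u", "v"]
n_pairs = df_1.groupby(["X", "Z"]).ngroups
print(n_pairs)
```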



### Identifying Groups from a GroupBy Object


```python
spending_df.groupby('specialty')
```

![](https://www.dropbox.com/s/bs8o34e4s7bdqa8/group_by_specialty.png?dl=1)

* The `groupby` method returns an object of type `DataFrameGroupBy`.
  * This is not a `DataFrame` and therefore does not expose all of the `DataFrame` methods discussed previously.
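The sketch below, on a made-up frame with a `specialty` column, illustrates what the grouped object does offer: it can report how many groups it holds, and iterating over it yields one `(key, sub-DataFrame)` pair per group. The names `toy` and `group_sizes` are illustrative only.

```python
import pandas as pd

toy = pd.DataFrame({
    "specialty": ["CARDIOLOGY", "UROLOGY", "CARDIOLOGY", "UROLOGY", "PODIATRY"],
    "spending": [100.0, 20.0, 50.0, 30.0, 5.0],
})
by_specialty = toy.groupby("specialty")

# The grouped object records how rows map to groups
print(by_specialty.ngroups)

# Iterating yields (key, sub-DataFrame) pairs, in sorted key order by default
group_sizes = {}
for name, group in by_specialty:
    group_sizes[name] = len(group)
print(group_sizes)
```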




In [4]:
x = spending_df.groupby('specialty')


In [5]:
type(x)

pandas.core.groupby.generic.DataFrameGroupBy

In [3]:
spending_by_specialty = spending_df.groupby('specialty')

addiction_med_group = spending_by_specialty.get_group("ADDICTION MEDICINE")
addiction_med_group

unique_id,doctor_id,specialty,medication,nb_beneficiaries,spending
VG585760,1801032297,ADDICTION MEDICINE,LAMOTRIGINE,11,82.62
GJ278932,1134139991,ADDICTION MEDICINE,BUSPIRONE HCL,49,817.88
TX420809,1801032297,ADDICTION MEDICINE,LORAZEPAM,14,19.56


In [13]:
x = pd.Series([1,2,3,4,5])

x / x.sum()

0    0.066667
1    0.133333
2    0.200000
3    0.266667
4    0.333333
dtype: float64

### `groupby` and Group-Specific Processing

* An ideal use-case for `groupby` consists of applying operations to each group independently.

* For instance, to compute the total spending by `specialty`, we need to:
  * Split the data by `specialty`.
  * Sum the total `spending` for each group.
  * Combine the sums for each group into a new `DataFrame`.




### Split-Apply-Combine Paradigm

* `groupby()` is often applied in the context of the data processing paradigm called "split-apply-combine".

  * **Split**: you need to split the data into chunks defined using one or more columns.
    * This is typically done using `groupby`.
  * **Apply**: apply some operation to the chunks generated.
    * Ex. Count the number of rows in each chunk, average the values for a specific column, etc.
  * **Combine**: combine the results of the applied operation into a new `DataFrame`.
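To make the three steps concrete, here is a small sketch that performs split, apply, and combine by hand on a made-up frame, and then compares the result with the one-line `groupby` version. The data and variable names are assumptions for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "specialty": ["A", "B", "A", "B", "C"],
    "spending": [10.0, 20.0, 30.0, 40.0, 50.0],
})

# Split: one sub-DataFrame per distinct specialty
chunks = {
    key: toy[toy["specialty"] == key]
    for key in sorted(toy["specialty"].unique())
}

# Apply: reduce each chunk to a single number
sums = {key: chunk["spending"].sum() for key, chunk in chunks.items()}

# Combine: assemble the per-group results into one object
manual = pd.Series(sums, name="spending")
manual.index.name = "specialty"

# groupby performs the same three steps in one expression
with_groupby = toy.groupby("specialty")["spending"].sum()

print(manual.to_dict() == with_groupby.to_dict())
```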




### Split-Apply-Combine Example

![](https://www.dropbox.com/s/aecufw3mfu2mlah/split_apply_combine_example.png?dl=1)

* The type of Split-Apply-Combine applied here is referred to as aggregation.
  * Aggregations refer to any operation that aggregates (reduces) group data to a single value.

### The 3 (or 3 $\frac{1}{2}$) Classes of Operations on Groups


* There are 3 formal classes of split-apply-combine operations that can be applied to group data.

  * I include a variant ($\frac{1}{2}$ a class) which I think is useful to better classify split-apply-combine operations.


1\.$~~$__Aggregations__ generate a single value for each group.
   * Ex. Sum the spending by specialty.
  
2\.$~~$ __Transformations__ convert the data and generate a group of the same size as the original group.
   * Ex. Convert the currency by country for a dataset that contains medication costs by country.

3\.$~~$ __Filters__ retain or discard a group based on group-specific boolean computations.
   * Ex. Drop a specialty if the sum of its spending is below some threshold.

3$\frac{1}{2}$\.$~$"__Thinning__" drops entries in a group based on some defined logic.
   * Ex. Filter out values in a group that are more than 3 standard deviations above or below the mean.
  


### Aggregations

- __Aggregations__ aggregate the data in each group, i.e., they reduce the data in each group to a single value. 

  * This includes, for instance, computing group sums, means, maximums, minimums, _etc_.



![](https://www.dropbox.com/s/9q54na9szs5syi5/aggregate.png?dl=1)



### Transforming Group Data

* Transform the data in a group-specific way.

  *  Ex. Within each specialty, we may want to recode the column `nb_beneficiaries` as small, medium, or large, depending on whether the value is, respectively, more than `2 * std` below the group mean, within `+/- 2 * std` of the mean, or more than `2 * std` above the mean.


   *  The number of entries per group resulting from a transformation is the same as the number of entries in the group before the transformation.



- The diagram below shows an example where the data in column "Y" is transformed by dividing it by the group mean.

![](https://www.dropbox.com/s/nf8lg0lqk3yxf7k/transform_2.png?dl=1)
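A minimal sketch of that kind of transformation, on made-up values for columns `X` and `Y`: each `Y` is divided by the mean of its own group, and the result has exactly as many rows as the input.

```python
import pandas as pd

df = pd.DataFrame({
    "X": ["a", "a", "b", "b", "b"],
    "Y": [2.0, 6.0, 3.0, 3.0, 9.0],
})

# transform("mean") broadcasts each group's mean back onto its rows
group_mean = df.groupby("X")["Y"].transform("mean")
df["Y_scaled"] = df["Y"] / group_mean

print(df)
```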


### Filtering Group Data

* Filtering consists of dropping or retaining an entire group, depending on a group-specific computation that returns `True` or `False`.

* For instance, we can filter out specialties that don't have enough entries or for which the mean `spending` is below a certain threshold.
  * Groups are either retained or discarded. Groups that are retained are unmodified.


- The diagram below shows an example where groups are filtered if their sum for column `Y` is less than 10.

![](https://www.dropbox.com/s/ncmv2xsupjok7va/filter.png?dl=1)
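The sketch below uses made-up values and assumes one reading of the diagram, namely that groups whose `Y` values sum to less than 10 are dropped. The retained groups keep all of their rows unchanged.

```python
import pandas as pd

df = pd.DataFrame({
    "X": ["a", "b", "a", "c", "b"],
    "Y": [1, 8, 2, 12, 7],
})

# Keep a group only when its Y values sum to 10 or more
kept = df.groupby("X").filter(lambda g: g["Y"].sum() >= 10)
print(kept)
```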

### Thinning Group Data

* Thinning consists of reducing the number of entries in each group using a group-specific operation.

* Thinning can be useful to sub-sample the data at the group level, to return the top `n` entries in each group, etc.

  * As opposed to aggregating functions, thinning does not have to reduce the group to a single entry, although it could.

    
![](https://www.dropbox.com/s/m4p4f5nk55w2ni2/thin.png?dl=1) 
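One possible way to thin a group, sketched on made-up data: sort by `Y` and keep at most two rows per group with the grouped `head` method. Groups with fewer than two rows are returned whole.

```python
import pandas as pd

df = pd.DataFrame({
    "X": ["a", "a", "a", "b", "b", "c"],
    "Y": [5, 1, 9, 4, 7, 3],
})

# Sort so the largest Y comes first, then keep at most two rows per group
top_two = df.sort_values("Y", ascending=False).groupby("X").head(2)
print(top_two.sort_index())
```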


### Aggregating the Data Using `groupby`

* Aggregation is commonly used to compute summary statistics on each of the groups.

* Some of the most useful summary aggregation methods of `DataFrameGroupBy` objects are:

|Methods           |        Description                             |
|:-----------------|:-----------------------------------------------|
| `mean`, `median` | Compute the mean and the median in each group  |
| `min`, `max`     | Compute the min and max in each group          |
| `size`           | Computes the number of rows in each group      |
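The following sketch applies these methods to a made-up `spending` column that contains one missing value. It also includes `count`, which is not in the table, to contrast it with `size`: `size` counts rows, while `count` counts non-missing values.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "specialty": ["A", "A", "B", "B", "B"],
    "spending": [10.0, np.nan, 5.0, 15.0, 25.0],
})
by_specialty = toy.groupby("specialty")["spending"]

summary = pd.DataFrame({
    "mean": by_specialty.mean(),      # NaN values are skipped
    "median": by_specialty.median(),
    "min": by_specialty.min(),
    "max": by_specialty.max(),
    "size": by_specialty.size(),      # number of rows, NaN included
    "count": by_specialty.count(),    # number of non-missing values
})
print(summary)
```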




In [6]:
spending_df.head()

unique_id,doctor_id,specialty,medication,nb_beneficiaries,spending
NX531425,1255626040,FAMILY PRACTICE,METFORMIN HCL,30,135.24
QG879256,1699761833,FAMILY PRACTICE,ALLOPURINOL,30,715.76
FW363228,1538148804,INTERNAL MEDICINE,LOSARTAN POTASSIUM,146,1056.47
WD733417,1730200619,PSYCHIATRY,OLANZAPINE,13,28226.97
XW149832,1023116894,FAMILY PRACTICE,PRAVASTATIN SODIUM,348,8199.48


##### Aggregating the Data Using `groupby` Cont'd 


- The functions above all use the same syntax:
 
```python
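# numeric_only=True restricts the aggregation to the numeric columns
# (recent pandas versions no longer drop text columns automatically)
spending_df.groupby('specialty').sum(numeric_only=True)
# or
spending_df.groupby('specialty').min(numeric_only=True)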
```


In [7]:
spending_df.groupby('specialty').sum(numeric_only=True).head(10)


specialty,nb_beneficiaries,spending
ADDICTION MEDICINE,74,920.06
ALLERGY/IMMUNOLOGY,1063,189174.06
ANESTHESIOLOGY,1673,142804.73
CARDIAC ELECTROPHYSIOLOGY,1041,225543.62
CARDIAC SURGERY,33,12432.92
CARDIOLOGY,29638,1915787.9
CERTIFIED CLINICAL NURSE SPECIALIST,1146,114057.4
CERTIFIED NURSE MIDWIFE,58,14763.47
CLINIC/CENTER,23,3110.16
CLINICAL PSYCHOLOGIST,83,495.95



### Applying Functions to Group Columns

- The `agg` method can be used where complex or custom aggregation logic is required.
  - `agg` takes a function (or a list of functions) and uses it (them) to aggregate the group's column(s).

- For example, we can use `sum_spending_CAD` to return the sum of the spending in Canadian dollars.



```python
def sum_spending_CAD(x):
    return x.sum() * 1.32

spending_by_specialty['spending'].agg(sum_spending_CAD)
```


* `agg` can take either:
  * a dictionary mapping column names to the functions used to aggregate them.
    * Required when different columns need different aggregation functions.

    ```python
    spending_by_specialty.agg({'nb_beneficiaries': sum,
                               'spending': sum_spending_CAD})
    ```

  * a list of functions, each of which is applied to every column.

    ```python
    spending_by_specialty.agg([min, max, sum])
    ```
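As a complement to the dictionary and list forms, the sketch below uses pandas' named aggregation (keyword arguments of the form `output_name=(column, function)`), which lets each result column get an explicit name. The toy data and the output column names are assumptions for illustration. Passing function names as strings such as `"sum"` is also an option, and some recent pandas releases warn when the Python builtins are passed instead.

```python
import pandas as pd

toy = pd.DataFrame({
    "specialty": ["A", "A", "B", "B"],
    "nb_beneficiaries": [10, 20, 30, 40],
    "spending": [100.0, 200.0, 300.0, 400.0],
})

USD_TO_CAD = 1.32  # same illustrative rate as in the text

def sum_spending_CAD(x):
    return x.sum() * USD_TO_CAD

# Named aggregation: output_column=(input_column, function or function name)
result = toy.groupby("specialty").agg(
    total_beneficiaries=("nb_beneficiaries", "sum"),
    max_spending=("spending", "max"),
    total_spending_cad=("spending", sum_spending_CAD),
)
print(result)
```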

In [12]:
def sum_spending_CAD(x):
    return x.sum() * 1.32

# format
(
    spending_by_specialty.agg({ 'nb_beneficiaries': sum, 'spending': sum_spending_CAD })
                         .head()
)



specialty,nb_beneficiaries,spending
ADDICTION MEDICINE,74,1214.4792
ALLERGY/IMMUNOLOGY,1063,249709.7592
ANESTHESIOLOGY,1673,188502.2436
CARDIAC ELECTROPHYSIOLOGY,1041,297717.5784
CARDIAC SURGERY,33,16411.4544


In [14]:
spending_by_specialty.get_group("ADDICTION MEDICINE")

unique_id,doctor_id,specialty,medication,nb_beneficiaries,spending
VG585760,1801032297,ADDICTION MEDICINE,LAMOTRIGINE,11,82.62
GJ278932,1134139991,ADDICTION MEDICINE,BUSPIRONE HCL,49,817.88
TX420809,1801032297,ADDICTION MEDICINE,LORAZEPAM,14,19.56


In [6]:
# note that sum, min and max here are functions
spending_by_specialty['spending'].agg([sum, min, max]).head()

specialty,sum,min,max
ADDICTION MEDICINE,920.06,19.56,817.88
ALLERGY/IMMUNOLOGY,189174.06,109.8,52389.61
ANESTHESIOLOGY,142804.73,35.33,34073.91
CARDIAC ELECTROPHYSIOLOGY,225543.62,69.85,89101.54
CARDIAC SURGERY,12432.92,442.91,11990.01


In [10]:
spending_by_specialty.agg({'nb_beneficiaries' :min,
                           'spending' : max}).head()


specialty,nb_beneficiaries,spending
ADDICTION MEDICINE,11,817.88
ALLERGY/IMMUNOLOGY,11,52389.61
ANESTHESIOLOGY,12,34073.91
CARDIAC ELECTROPHYSIOLOGY,12,89101.54
CARDIAC SURGERY,15,11990.01


In [11]:
spending_by_specialty.agg({'nb_beneficiaries' :[min, sum],
                           'spending' : max}).head()


specialty,nb_beneficiaries (min),nb_beneficiaries (sum),spending (max)
ADDICTION MEDICINE,11,74,817.88
ALLERGY/IMMUNOLOGY,11,1063,52389.61
ANESTHESIOLOGY,12,1673,34073.91
CARDIAC ELECTROPHYSIOLOGY,12,1041,89101.54
CARDIAC SURGERY,15,33,11990.01


### Transforming the Data in `groupby`

- As opposed to aggregations, which reduce the data into a single value, transformations modify the data but don't change the `shape` (dimension) of the groups

- Transformations are useful for applying operations that are group specific



### Transforming the Data in `groupby` Cont'd


- The example below computes the percent contribution of each entry to each specialty by applying a transformation that normalizes the entry's spending over the total spending in that specialty. 

![](https://www.dropbox.com/s/xwomvq1cs90jpg1/transform_spending.png?dl=1)


In [16]:
spending_by_specialty["spending"].get_group("ADDICTION MEDICINE")

unique_id
VG585760    82.620
GJ278932   817.880
TX420809    19.560
Name: spending, dtype: float64

### Applying a Transformation

- Applying a transformation is done using the method called `transform`.


- The method `transform` takes as input a function, which it calls on each group of the `DataFrameGroupBy` object.
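A small sketch of this mechanism on made-up data: the function receives the values of one group at a time and must return something of the same length. The second formulation, which broadcasts the group totals with the built-in `"sum"` name and then divides, is an equivalent alternative that avoids calling a Python function per group.

```python
import pandas as pd

toy = pd.DataFrame({
    "specialty": ["A", "A", "B", "B", "B"],
    "spending": [25.0, 75.0, 10.0, 30.0, 60.0],
})
by_specialty = toy.groupby("specialty")["spending"]

def pct_of_group(x):
    # x is the Series of spending values for one group
    return x / x.sum() * 100

toy["spending_pct"] = by_specialty.transform(pct_of_group)

# Equivalent formulation: broadcast the group totals, then divide
toy["spending_pct_alt"] = toy["spending"] / by_specialty.transform("sum") * 100
print(toy)
```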

In [15]:
def my_function(x):
    # x holds the spending values of one specialty; express each value
    # as a percentage of that specialty's total spending
    return (x / x.sum()) * 100


spending_df["spending_pct"] = spending_by_specialty['spending'].transform(my_function)


In [16]:
spending_df[spending_df['specialty'] == "ADDICTION MEDICINE"]


unique_id,doctor_id,specialty,medication,nb_beneficiaries,spending,spending_pct
VG585760,1801032297,ADDICTION MEDICINE,LAMOTRIGINE,11,82.62,8.979849
GJ278932,1134139991,ADDICTION MEDICINE,BUSPIRONE HCL,49,817.88,88.894203
TX420809,1801032297,ADDICTION MEDICINE,LORAZEPAM,14,19.56,2.125948


In [28]:
spending_df.sort_values(['specialty', 'spending_pct'], ascending=[True, False]).head(10)

unique_id,doctor_id,specialty,medication,nb_beneficiaries,spending,spending_pct
GJ278932,1134139991,ADDICTION MEDICINE,BUSPIRONE HCL,49,817.88,88.894203
VG585760,1801032297,ADDICTION MEDICINE,LAMOTRIGINE,11,82.62,8.979849
TX420809,1801032297,ADDICTION MEDICINE,LORAZEPAM,14,19.56,2.125948
XY715196,1376691626,ALLERGY/IMMUNOLOGY,FLUTICASONE/SALMETEROL,102,52389.61,27.693866
DL492570,1962588053,ALLERGY/IMMUNOLOGY,OMALIZUMAB,12,29153.71,15.411051
UJ888112,1003897851,ALLERGY/IMMUNOLOGY,MOMETASONE FUROATE,55,20759.04,10.973513
GO641321,1255301404,ALLERGY/IMMUNOLOGY,FLUTICASONE/SALMETEROL,35,14277.61,7.54734
JU235992,1003812595,ALLERGY/IMMUNOLOGY,MOMETASONE FUROATE,50,13559.5,7.167737
WE196352,1720080062,ALLERGY/IMMUNOLOGY,FLUTICASONE/SALMETEROL,37,12594.63,6.657694
EW891894,1104888403,ALLERGY/IMMUNOLOGY,DEXLANSOPRAZOLE,32,12411.92,6.561111


### More Complex Transformations

* As the output above shows, the same drug can still appear several times within a `specialty`, once for each `doctor_id` that prescribed it.

  *  Ex. FLUTICASONE/SALMETEROL is prescribed by at least 3 doctors in ALLERGY/IMMUNOLOGY.

- To see the percent spending per `medication`, we need to group on both `specialty` and `medication`, and then sum the `spending_pct` values computed previously.

```python
medication_spendng_pct =  spending_df.groupby(["specialty", "medication"])["spending_pct"].sum()
```
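The sketch below reproduces this two-column grouping on a tiny made-up frame and adds a sanity check that is not in the original notebook: after summing per medication, the percentages within each specialty should still add up to 100.

```python
import pandas as pd

toy = pd.DataFrame({
    "specialty": ["A", "A", "A", "B", "B"],
    "medication": ["m1", "m2", "m1", "m1", "m3"],
    "spending": [10.0, 30.0, 60.0, 5.0, 15.0],
})
toy["spending_pct"] = (
    toy["spending"] / toy.groupby("specialty")["spending"].transform("sum") * 100
)

# One value per (specialty, medication) pair; repeated pairs are summed
pct_by_med = toy.groupby(["specialty", "medication"])["spending_pct"].sum()
print(pct_by_med)

# Sanity check: within each specialty the percentages should add up to 100
totals = pct_by_med.groupby(level="specialty").sum()
print(totals)
```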



In [36]:
medication_spendng_pct.head()

specialty           medication                
ADDICTION MEDICINE  BUSPIRONE HCL                 88.894203
                    LAMOTRIGINE                    8.979849
                    LORAZEPAM                      2.125948
ALLERGY/IMMUNOLOGY  ALBUTEROL SULFATE              3.553199
                    AMOXICILLIN/POTASSIUM CLAV     0.196089
Name: spending_pct, dtype: float64

In [35]:
medication_spendng_pct =  spending_df.groupby(["specialty", "medication"])["spending_pct"].sum()
print(type(medication_spendng_pct))
print("\n" + "*" * 35 + "\n")
print(medication_spendng_pct.index)



<class 'pandas.core.series.Series'>

***********************************

MultiIndex([('ADDICTION MEDICINE',                 'BUSPIRONE HCL'),
            ('ADDICTION MEDICINE',                   'LAMOTRIGINE'),
            ('ADDICTION MEDICINE',                     'LORAZEPAM'),
            ('ALLERGY/IMMUNOLOGY',             'ALBUTEROL SULFATE'),
            ('ALLERGY/IMMUNOLOGY',    'AMOXICILLIN/POTASSIUM CLAV'),
            ('ALLERGY/IMMUNOLOGY',                'AZELASTINE HCL'),
            ('ALLERGY/IMMUNOLOGY',                  'AZITHROMYCIN'),
            ('ALLERGY/IMMUNOLOGY',               'DEXLANSOPRAZOLE'),
            ('ALLERGY/IMMUNOLOGY',                 'DILTIAZEM HCL'),
            ('ALLERGY/IMMUNOLOGY',            'DOXAZOSIN MESYLATE'),
            ...
            (           'UROLOGY',            'SILDENAFIL CITRATE'),
            (           'UROLOGY', 'SULFAMETHOXAZOLE/TRIMETHOPRIM'),
            (           'UROLOGY',                     'TADALAFIL'),
            (

In [53]:
import string
import random

print(string.ascii_letters)
print("\n" + "*" * 52 + "\n")

lc_letters = list(string.ascii_letters[:26])
print(lc_letters)

print("\n" + "*" * 52 + "\n")

print(random.sample(lc_letters, 6))


abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ

****************************************************

['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']

****************************************************

['r', 'z', 'v', 'k', 'i', 'f']


In [48]:
x = pd.Series(random.sample(lc_letters, 6), index=[1,2,3,4,5,6])
x.head()


1    t
2    b
3    d
4    w
5    p
dtype: object

In [56]:
print(x.index)

print("\n" + "*" * 45 + "\n")

print(x[1])

Int64Index([1, 2, 3, 4, 5, 6], dtype='int64')

*********************************************

t


In [57]:
medication_spendng_pct.index

MultiIndex([('ADDICTION MEDICINE',                 'BUSPIRONE HCL'),
            ('ADDICTION MEDICINE',                   'LAMOTRIGINE'),
            ('ADDICTION MEDICINE',                     'LORAZEPAM'),
            ('ALLERGY/IMMUNOLOGY',             'ALBUTEROL SULFATE'),
            ('ALLERGY/IMMUNOLOGY',    'AMOXICILLIN/POTASSIUM CLAV'),
            ('ALLERGY/IMMUNOLOGY',                'AZELASTINE HCL'),
            ('ALLERGY/IMMUNOLOGY',                  'AZITHROMYCIN'),
            ('ALLERGY/IMMUNOLOGY',               'DEXLANSOPRAZOLE'),
            ('ALLERGY/IMMUNOLOGY',                 'DILTIAZEM HCL'),
            ('ALLERGY/IMMUNOLOGY',            'DOXAZOSIN MESYLATE'),
            ...
            (           'UROLOGY',            'SILDENAFIL CITRATE'),
            (           'UROLOGY', 'SULFAMETHOXAZOLE/TRIMETHOPRIM'),
            (           'UROLOGY',                     'TADALAFIL'),
            (           'UROLOGY',                'TAMSULOSIN HCL'),
            (     

In [58]:
medication_spendng_pct[('ADDICTION MEDICINE', 'BUSPIRONE HCL')]

88.89420255200748

In [63]:
medication_spendng_pct[('ADDICTION MEDICINE', )]

medication
BUSPIRONE HCL    88.894203
LAMOTRIGINE       8.979849
LORAZEPAM         2.125948
Name: spending_pct, dtype: float64

In [64]:
medication_spendng_pct[('ALLERGY/IMMUNOLOGY', )]

medication
ALBUTEROL SULFATE                  3.553199
AMOXICILLIN/POTASSIUM CLAV         0.196089
AZELASTINE HCL                     3.646451
AZITHROMYCIN                       0.100500
DEXLANSOPRAZOLE                    6.561111
DILTIAZEM HCL                      0.293666
DOXAZOSIN MESYLATE                 0.104872
ENALAPRIL MALEATE                  0.318553
FLUTICASONE PROPIONATE             2.842509
FLUTICASONE/SALMETEROL            41.898900
HYDROXYZINE HCL                    0.135473
IRBESARTAN                         0.058042
LEVOCETIRIZINE DIHYDROCHLORIDE     1.200698
MOMETASONE FUROATE                18.141250
MOMETASONE/FORMOTEROL              2.566123
OMALIZUMAB                        15.411051
PREGABALIN                         2.679754
RANITIDINE HCL                     0.091477
TRIAMCINOLONE ACETONIDE            0.200281
Name: spending_pct, dtype: float64

### More Complex Transformations Cont'd

* The `MultiIndex` is sometimes inconvenient to work with.
    * For example, it makes it harder to sort on `specialty` and `spending_pct` as we did earlier.

* We can convert the index levels back into regular columns using the method `reset_index`.
  * This allows us to sort on `specialty` and `spending_pct` as we did earlier.
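A minimal sketch of the effect of `reset_index`, on made-up values: the two index levels become ordinary columns, after which sorting on a mix of key and value columns works as usual. The last line shows `as_index=False`, an alternative that keeps the keys as columns from the start.

```python
import pandas as pd

toy = pd.DataFrame({
    "specialty": ["A", "A", "B", "B"],
    "medication": ["m1", "m2", "m1", "m3"],
    "spending_pct": [70.0, 30.0, 25.0, 75.0],
})

indexed = toy.groupby(["specialty", "medication"])["spending_pct"].sum()
print(type(indexed.index).__name__)

# Move the index levels back into regular columns
flat = indexed.reset_index()
print(list(flat.columns))

ranked = flat.sort_values(["specialty", "spending_pct"], ascending=[True, False])
print(ranked)

# Alternative: ask groupby not to build an index from the keys
flat_direct = toy.groupby(["specialty", "medication"], as_index=False)["spending_pct"].sum()
```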





In [65]:
spending_df.groupby(["specialty", "medication"])["spending_pct"].sum().head()

specialty           medication                
ADDICTION MEDICINE  BUSPIRONE HCL                 88.894203
                    LAMOTRIGINE                    8.979849
                    LORAZEPAM                      2.125948
ALLERGY/IMMUNOLOGY  ALBUTEROL SULFATE              3.553199
                    AMOXICILLIN/POTASSIUM CLAV     0.196089
Name: spending_pct, dtype: float64

In [31]:
medication_spendng_pct = spending_df.groupby(["specialty", "medication"])["spending_pct"].sum().reset_index()
medication_spendng_pct.head()


,specialty,medication,spending_pct
0,ADDICTION MEDICINE,BUSPIRONE HCL,88.894203
1,ADDICTION MEDICINE,LAMOTRIGINE,8.979849
2,ADDICTION MEDICINE,LORAZEPAM,2.125948
3,ALLERGY/IMMUNOLOGY,ALBUTEROL SULFATE,3.553199
4,ALLERGY/IMMUNOLOGY,AMOXICILLIN/POTASSIUM CLAV,0.196089


In [34]:
medication_spendng_pct.sort_values(["specialty", "spending_pct"], ascending=[True, False]).head(10)

,specialty,medication,spending_pct
0,ADDICTION MEDICINE,BUSPIRONE HCL,88.894203
1,ADDICTION MEDICINE,LAMOTRIGINE,8.979849
2,ADDICTION MEDICINE,LORAZEPAM,2.125948
12,ALLERGY/IMMUNOLOGY,FLUTICASONE/SALMETEROL,41.8989
16,ALLERGY/IMMUNOLOGY,MOMETASONE FUROATE,18.14125
18,ALLERGY/IMMUNOLOGY,OMALIZUMAB,15.411051
7,ALLERGY/IMMUNOLOGY,DEXLANSOPRAZOLE,6.561111
5,ALLERGY/IMMUNOLOGY,AZELASTINE HCL,3.646451
3,ALLERGY/IMMUNOLOGY,ALBUTEROL SULFATE,3.553199
11,ALLERGY/IMMUNOLOGY,FLUTICASONE PROPIONATE,2.842509


### Filtering Groups

- Filtering a group is done using the method called `filter`.


- The method `filter` takes as input a function, which it calls on each group of the `DataFrameGroupBy` object.
  - The function must return either `True` or `False`.
  - Groups for which the function returns `False` are dropped.


- The resulting `DataFrame` has its entries in the same order as the original `DataFrame`.
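The sketch below illustrates these points on a made-up frame with hypothetical row labels `r1` to `r6`: the function returns one boolean per group, rows of dropped groups disappear, and the surviving rows keep their original order. The boolean-mask version built with `transform` is an equivalent alternative.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "specialty": ["B", "A", "B", "C", "A", "B"],
        "spending": [40_000.0, 1_000.0, 20_000.0, 60_000.0, 2_000.0, 5_000.0],
    },
    index=["r1", "r2", "r3", "r4", "r5", "r6"],
)

def has_high_spending(group):
    # Must return a single True/False for the whole group
    return group["spending"].sum() > 50_000

kept = toy.groupby("specialty").filter(has_high_spending)
print(kept)

# The same selection as a boolean mask built with transform
mask = toy.groupby("specialty")["spending"].transform("sum") > 50_000
same_rows = toy[mask]
```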
 


In [43]:
spending_df['specialty'].unique()

array(['FAMILY PRACTICE', 'INTERNAL MEDICINE', 'PSYCHIATRY',
       'HEMATOLOGY/ONCOLOGY', 'OPHTHALMOLOGY', 'NEUROLOGY',
       'NURSE PRACTITIONER', 'NEPHROLOGY', 'DENTIST', 'SPECIALIST',
       'GENERAL PRACTICE', 'INTERVENTIONAL CARDIOLOGY',
       'OBSTETRICS/GYNECOLOGY', 'PHYSICIAN ASSISTANT', 'CARDIOLOGY',
       'ENDOCRINOLOGY', 'RHEUMATOLOGY', 'OPTOMETRY',
       'STUDENT IN AN ORGANIZED HEALTH CARE EDUCATION/TRAINING PROGRAM',
       'PULMONARY DISEASE', 'DERMATOLOGY',
       'INTERVENTIONAL PAIN MANAGEMENT', 'PSYCHIATRY & NEUROLOGY',
       'GASTROENTEROLOGY', 'GERIATRIC MEDICINE', 'UROLOGY',
       'MEDICAL ONCOLOGY', 'PHYSICAL MEDICINE AND REHABILITATION',
       'EMERGENCY MEDICINE', 'ORTHOPEDIC SURGERY',
       'CARDIAC ELECTROPHYSIOLOGY', 'OTOLARYNGOLOGY', 'ALLERGY/IMMUNOLOGY',
       'PODIATRY', 'CERTIFIED CLINICAL NURSE SPECIALIST',
       'INFECTIOUS DISEASE', 'UNKNOWN PHYSICIAN SPECIALTY CODE',
       'ANESTHESIOLOGY', 'PEDIATRIC MEDICINE', 'PAIN MANAGEMENT',
       

In [68]:

def filter_on_spending(x):
    return x['spending'].sum() > 50_000

high_spending_df = spending_df[["specialty", 'spending']].groupby('specialty').filter(filter_on_spending)



In [38]:
high_spending_df['specialty'].unique() 


array(['FAMILY PRACTICE', 'INTERNAL MEDICINE', 'PSYCHIATRY',
       'HEMATOLOGY/ONCOLOGY', 'OPHTHALMOLOGY', 'NEUROLOGY',
       'NURSE PRACTITIONER', 'NEPHROLOGY', 'GENERAL PRACTICE',
       'INTERVENTIONAL CARDIOLOGY', 'OBSTETRICS/GYNECOLOGY',
       'PHYSICIAN ASSISTANT', 'CARDIOLOGY', 'ENDOCRINOLOGY',
       'RHEUMATOLOGY', 'OPTOMETRY', 'PULMONARY DISEASE', 'DERMATOLOGY',
       'INTERVENTIONAL PAIN MANAGEMENT', 'PSYCHIATRY & NEUROLOGY',
       'GASTROENTEROLOGY', 'GERIATRIC MEDICINE', 'UROLOGY',
       'MEDICAL ONCOLOGY', 'PHYSICAL MEDICINE AND REHABILITATION',
       'EMERGENCY MEDICINE', 'ORTHOPEDIC SURGERY',
       'CARDIAC ELECTROPHYSIOLOGY', 'ALLERGY/IMMUNOLOGY', 'PODIATRY',
       'CERTIFIED CLINICAL NURSE SPECIALIST', 'INFECTIOUS DISEASE',
       'ANESTHESIOLOGY', 'PEDIATRIC MEDICINE', 'PAIN MANAGEMENT',
       'HEMATOLOGY', 'GENERAL SURGERY', 'DIAGNOSTIC RADIOLOGY'],
      dtype=object)

### Thinning Groups

* Thinning the data consists of reducing the number of entries in a group.

* As opposed to aggregating functions, thinning does not have to reduce the group into a single entry
  * Although it could reduce it to a single entry


* Thinning can be used, for instance, to return only the top 3 entries in each category, or to randomly sample a small subset of entries from each category

### Thinning Methods and `apply`

- `pandas` offers a few methods for thinning the data.
  - Ex. `nlargest`, `nsmallest`, etc.
    
    
- However, thinning is most often carried out using a method called `apply`.



- The method `apply` takes as input a function, which it calls on each group of the `DataFrameGroupBy` object.
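As a sketch of thinning with `apply`, the function below returns the two highest-spending rows of each group in a made-up frame. The columns are selected explicitly before calling `apply`, which is an assumption made here so that the grouping column is not passed to the function (its handling differs across pandas versions).

```python
import pandas as pd

toy = pd.DataFrame({
    "specialty": ["A", "A", "A", "B", "B"],
    "medication": ["m1", "m2", "m3", "m1", "m4"],
    "spending": [10.0, 50.0, 30.0, 5.0, 8.0],
})

def top_two_by_spending(group):
    # Return a smaller DataFrame: the two rows with the highest spending
    return group.nlargest(2, "spending")

top_rows = (
    toy.groupby("specialty")[["medication", "spending"]]
       .apply(top_two_by_spending)
)
print(top_rows)
```

The result is indexed by the group key and the original row label, similar to the output of the grouped `nlargest` call on a single column.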


In [54]:
spending_by_specialty['spending'].nlargest(2)

specialty                  unique_id
ADDICTION MEDICINE         GJ278932      817.880
                           VG585760       82.620
ALLERGY/IMMUNOLOGY         XY715196    52389.610
                           DL492570    29153.710
ANESTHESIOLOGY             WD732008    34073.910
                           ZJ839161    33127.750
CARDIAC ELECTROPHYSIOLOGY  XZ523373    89101.540
                           RR251593    59935.970
CARDIAC SURGERY            YC312951    11990.010
                           FK638917      442.910
Name: spending, dtype: float64

In [24]:
spending_by_specialty['spending'].nsmallest(3)

specialty                  unique_id
ADDICTION MEDICINE         TX420809     19.560
                           VG585760     82.620
                           GJ278932    817.880
ALLERGY/IMMUNOLOGY         HQ120242    109.800
                           HN843226    173.050
                           LE617956    190.120
ANESTHESIOLOGY             IS925171     35.330
                           XZ351859     38.960
                           HY359879     56.860
CARDIAC ELECTROPHYSIOLOGY  XR445715     69.850
Name: spending, dtype: float64

### Sub-sampling a DataFrame


- Sub-sampling within each group, rather than over the whole `DataFrame`, is necessary to maintain the group composition.

- This can be achieved using the `DataFrame` method called `sample`.

  - Two parameters are relevant in this scenario: `n`, the number of rows to randomly select, or `frac`, the fraction of the data to return.
  - We are interested in the latter.

```python
spending_df.sample(frac=0.001)
```
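The sketch below contrasts `frac` and `n` on a made-up frame of 30 rows, using `random_state` so the draw is repeatable. It also shows the grouped `sample` method (available in pandas 1.1 and later, to the best of my knowledge) as one way to draw the same fraction from every group.

```python
import pandas as pd

toy = pd.DataFrame({
    "specialty": ["A"] * 20 + ["B"] * 10,
    "spending": range(30),
})

# frac: fraction of the rows; n: exact number of rows
by_frac = toy.sample(frac=0.1, random_state=0)
by_n = toy.sample(n=5, random_state=0)
print(len(by_frac), len(by_n))

# Per-group sampling keeps the A:B proportions of the original data
per_group = toy.groupby("specialty").sample(frac=0.1, random_state=0)
print(per_group["specialty"].value_counts().to_dict())
```

Sampling the whole frame gives no guarantee about how many rows come from each specialty, whereas the per-group version returns 10% of each group.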


In [25]:
# return 1% of the data (about 100 rows if the dataset has 10,000 rows); head() shows the first 5
spending_df.sample(frac=0.01).head()


unique_id,doctor_id,specialty,medication,nb_beneficiaries,spending,spending_pct
UC393942,1942272653,FAMILY PRACTICE,CLONAZEPAM,25,129.94,0.002
IA801487,1962438861,INTERNAL MEDICINE,LIDOCAINE,16,4473.19,0.046
PR105009,1891765079,PHYSICAL MEDICINE AND REHABILITATION,GABAPENTIN,101,2071.38,1.332
AK980177,1245284090,INTERNAL MEDICINE,AZITHROMYCIN,58,367.42,0.004
OM839383,1285694505,INTERNAL MEDICINE,"INSULIN GLARGINE,HUM.REC.ANLOG",54,17829.88,0.183


In [39]:
# return exactly 10 randomly selected entries
spending_df.sample(n=10) 

unique_id,doctor_id,specialty,medication,nb_beneficiaries,spending,spending_pct
DB389404,1194751297,NEPHROLOGY,ROPINIROLE HCL,18,597.62,0.052547
DA812252,1124015888,FAMILY PRACTICE,OMEPRAZOLE,71,815.68,0.012518
LR238505,1639198229,NURSE PRACTITIONER,TOPIRAMATE,11,64.71,0.001938
FH905481,1518923382,EMERGENCY MEDICINE,CEFADROXIL,21,1675.71,0.842024
LH581620,1356340145,INTERNAL MEDICINE,LORAZEPAM,28,262.3,0.002695
HP735529,1659375699,ORTHOPEDIC SURGERY,TRAMADOL HCL,68,234.28,0.360997
TT280891,1275847071,INTERNAL MEDICINE,LEVETIRACETAM,13,1114.71,0.011453
IE414935,1265487839,FAMILY PRACTICE,SULFAMETHOXAZOLE/TRIMETHOPRIM,52,191.68,0.002942
XS778681,1831453786,FAMILY PRACTICE,LISINOPRIL,12,52.37,0.000804
DO782181,1215102074,INTERNAL MEDICINE,AMANTADINE HCL,19,976.97,0.010038


In [69]:
# We sample only 10% of the Data in each category

def sample_10p(x):
    return x.sample(frac=0.1)
    
    
# spending_by_specialty.apply(sample_10p).head()
spending_df.groupby('specialty').apply(sample_10p).head()

specialty,unique_id,doctor_id,specialty,medication,nb_beneficiaries,spending,spending_pct
ALLERGY/IMMUNOLOGY,HX341365,1790855732,ALLERGY/IMMUNOLOGY,ENALAPRIL MALEATE,19,602.62,0.318553
ALLERGY/IMMUNOLOGY,HQ120242,1811919988,ALLERGY/IMMUNOLOGY,IRBESARTAN,12,109.8,0.058042
ANESTHESIOLOGY,RF491526,1356658389,ANESTHESIOLOGY,MORPHINE SULFATE,78,3670.87,2.570552
ANESTHESIOLOGY,ZX773797,1750481461,ANESTHESIOLOGY,VENLAFAXINE HCL,21,744.11,0.521068
ANESTHESIOLOGY,EL864120,1881669554,ANESTHESIOLOGY,GABAPENTIN,419,12842.83,8.993281


In [72]:

print(spending_by_specialty.get_group("CARDIAC ELECTROPHYSIOLOGY").shape)
print(spending_by_specialty.get_group("ANESTHESIOLOGY").shape)
print(spending_by_specialty.get_group("CARDIOLOGY").shape)


(20, 6)
(30, 6)
(445, 6)


In [73]:
subsampled_spending_df = spending_by_specialty.apply(sample_10p)

print(subsampled_spending_df.loc["CARDIAC ELECTROPHYSIOLOGY"].shape)

print(subsampled_spending_df.loc["ANESTHESIOLOGY"].shape)

print(subsampled_spending_df.loc["CARDIOLOGY"].shape)




(2, 6)
(3, 6)
(44, 6)


In [47]:
subsampled_spending_df.head()

specialty,unique_id,doctor_id,specialty,medication,nb_beneficiaries,spending,spending_pct
ALLERGY/IMMUNOLOGY,KL206491,1124085436,ALLERGY/IMMUNOLOGY,AZELASTINE HCL,15,1379.06,0.72899
ALLERGY/IMMUNOLOGY,GO641321,1255301404,ALLERGY/IMMUNOLOGY,FLUTICASONE/SALMETEROL,35,14277.61,7.54734
ANESTHESIOLOGY,HF933304,1528061603,ANESTHESIOLOGY,CITALOPRAM HYDROBROMIDE,57,255.26,0.178748
ANESTHESIOLOGY,ZJ839161,1700893575,ANESTHESIOLOGY,PREGABALIN,90,33127.75,23.197936
ANESTHESIOLOGY,XZ351859,1811096688,ANESTHESIOLOGY,CITALOPRAM HYDROBROMIDE,15,38.96,0.027282
