# <span style='color:green'> Grouping for Aggregation, Filtration, and Transformation </span>

```Grouping data this way is commonly referred to as split-apply-combine.```

```All basic groupby operations have grouping columns, and each unique combination of values
in these columns represents an independent grouping of the data.```

```Structure of a groupby()```

> ``` df.groupby(['list', 'of', 'grouping', 'columns'])```


> ```df.groupby('single_column') # when grouping by a single column```
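
To make the two call shapes above concrete, here is a minimal sketch on a made-up DataFrame (the `toy` data and its column names are assumptions for illustration, not part of the flights dataset used below). It shows that each unique combination of values in the grouping columns forms its own group.

```python
import pandas as pd

# Hypothetical toy data, not the flights dataset
toy = pd.DataFrame({
    'team': ['a', 'a', 'b', 'b', 'b'],
    'year': [1, 2, 1, 1, 2],
    'points': [10, 20, 30, 40, 50],
})

# One grouping column: one group per unique value of 'team'
by_team = toy.groupby('team')
n_team_groups = by_team.ngroups

# Two grouping columns: one group per unique (team, year) combination
by_team_year = toy.groupby(['team', 'year'])
n_team_year_groups = by_team_year.ngroups

# Split by team, apply sum to each group, combine into one Series
points_per_team = by_team['points'].sum()
```

Here `ngroups` reports how many independent groups were formed, which is a quick sanity check before aggregating.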

In [2]:
import numpy as np
import pandas as pd

url = "https://raw.githubusercontent.com/PacktPublishing/Pandas-Cookbook/master/data/flights.csv"
flights = pd.read_csv(url)
flights.head()

Unnamed: 0,MONTH,DAY,WEEKDAY,AIRLINE,ORG_AIR,DEST_AIR,SCHED_DEP,DEP_DELAY,AIR_TIME,DIST,SCHED_ARR,ARR_DELAY,DIVERTED,CANCELLED
0,1,1,4,WN,LAX,SLC,1625,58.0,94.0,590,1905,65.0,0,0
1,1,1,4,UA,DEN,IAD,823,7.0,154.0,1452,1333,-13.0,0,0
2,1,1,4,MQ,DFW,VPS,1305,36.0,85.0,641,1453,35.0,0,0
3,1,1,4,AA,DFW,DCA,1555,7.0,126.0,1192,1935,-7.0,0,0
4,1,1,4,WN,LAX,MCI,1720,48.0,166.0,1363,2225,39.0,0,0


> ```The most common use of the groupby method is to perform an aggregation.```

An aggregation takes place when a sequence of many inputs gets summarized or combined into a single output value.

For example, summing all the values of a column or finding its maximum are common aggregations
applied to a single sequence of data.

> ```An aggregation simply takes many values and converts them down to a single value.```

In [3]:
flights.groupby('AIRLINE').agg({'ARR_DELAY':'mean'}).head() # agg({column to aggregate: aggregating function})

Unnamed: 0_level_0,ARR_DELAY
AIRLINE,Unnamed: 1_level_1
AA,5.542661
AS,-0.833333
B6,8.692593
DL,0.339691
EV,7.03458


In [4]:
# Alternatively, select the aggregating column (ARR_DELAY) with the indexing operator and pass the aggregating function as a string
flights.groupby('AIRLINE')['ARR_DELAY'].agg('mean').head()
# We can also pass a NumPy function (recent pandas versions may warn and suggest the string 'mean' instead)
flights.groupby('AIRLINE')['ARR_DELAY'].agg(np.mean).head()

AIRLINE
AA    5.542661
AS   -0.833333
B6    8.692593
DL    0.339691
EV    7.034580
Name: ARR_DELAY, dtype: float64

In [5]:
# Or, better (well, shorter): call the aggregating method directly
flights.groupby('AIRLINE')['ARR_DELAY'].mean().head()

AIRLINE
AA    5.542661
AS   -0.833333
B6    8.692593
DL    0.339691
EV    7.034580
Name: ARR_DELAY, dtype: float64

```The following is a list of several aggregating functions that may be passed as a
string to agg or chained directly as a method to the groupby object```

>min max mean median sum count std var

>size describe nunique idxmin idxmax
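
As a small sketch of the two calling styles, the snippet below uses a made-up DataFrame (the `toy` data is an assumption for illustration) to check that passing a name as a string to `agg` and chaining the method directly give the same result. It also contrasts `size` with `count`, which differ when values are missing.

```python
import pandas as pd

# Hypothetical toy data with one missing value in group 'x'
toy = pd.DataFrame({'key': ['x', 'x', 'y'], 'val': [1.0, None, 5.0]})
grouped_val = toy.groupby('key')['val']

# Passing the function name as a string and chaining the method are equivalent
via_agg = grouped_val.agg('max')
via_method = grouped_val.max()
same = via_agg.equals(via_method)

# size counts all rows per group; count counts only non-missing values
sizes = grouped_val.size()
counts = grouped_val.count()
```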

## Grouping and aggregating with multiple columns and functions

```As usual with any kind of grouping operation, it helps to identify the three components: the
grouping columns, aggregating columns, and aggregating functions.```

In [6]:
# Find the number of cancelled flights for every airline per weekday:
# sum (the aggregating function) of CANCELLED (the aggregating column), grouped by AIRLINE and WEEKDAY (the grouping columns)
flights.groupby(['AIRLINE', 'WEEKDAY'])['CANCELLED'].agg('sum').head()

AIRLINE  WEEKDAY
AA       1          41
         2           9
         3          16
         4          20
         5          18
Name: CANCELLED, dtype: int64

In [7]:
#Finding the number and percentage of cancelled and diverted flights for every airline per weekday
flights.groupby(['AIRLINE', 'WEEKDAY'])[['CANCELLED', 'DIVERTED']].agg(['sum', 'mean']).head(7)


Unnamed: 0_level_0,Unnamed: 1_level_0,CANCELLED,CANCELLED,DIVERTED,DIVERTED
Unnamed: 0_level_1,Unnamed: 1_level_1,sum,mean,sum,mean
AIRLINE,WEEKDAY,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2
AA,1,41,0.032106,6,0.004699
AA,2,9,0.007341,2,0.001631
AA,3,16,0.011949,2,0.001494
AA,4,20,0.015004,5,0.003751
AA,5,18,0.014151,1,0.000786
AA,6,21,0.018667,9,0.008
AA,7,29,0.021837,1,0.000753


In [8]:
# For each origin and destination, find the total number of flights, the number and
# percentage of cancelled flights, and the mean and variance of the air time
agg_dict = {'CANCELLED':['sum', 'mean', 'size'], 'AIR_TIME': ['mean', 'var']} # a list of aggregating functions per aggregating column
flights.groupby(['ORG_AIR', 'DEST_AIR']).agg(agg_dict).head()

Unnamed: 0_level_0,Unnamed: 1_level_0,CANCELLED,CANCELLED,CANCELLED,AIR_TIME,AIR_TIME
Unnamed: 0_level_1,Unnamed: 1_level_1,sum,mean,size,mean,var
ORG_AIR,DEST_AIR,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2
ATL,ABE,0,0.0,31,96.387097,45.778495
ATL,ABQ,0,0.0,16,170.5,87.866667
ATL,ABY,0,0.0,19,28.578947,6.590643
ATL,ACY,0,0.0,6,91.333333,11.466667
ATL,AEX,0,0.0,40,78.725,47.332692


#### Basically, this is the skeleton of the examples above

```df.groupby(['grouping', 'columns']).agg({'agg_cols1':['list', 'of', 'functions'], 'agg_cols2':['other', 'functions']})```

## Removing the MultiIndex after grouping

In [9]:
#DataFrames with MultiIndexes are more difficult to navigate and occasionally have confusing column names as well.
dictio = {'DIST': ['sum', 'mean'], 'ARR_DELAY':['min', 'max']}

airline = flights.groupby(['AIRLINE', 'WEEKDAY']).agg(dictio).head()
airline

Unnamed: 0_level_0,Unnamed: 1_level_0,DIST,DIST,ARR_DELAY,ARR_DELAY
Unnamed: 0_level_1,Unnamed: 1_level_1,sum,mean,min,max
AIRLINE,WEEKDAY,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2
AA,1,1455386,1139.691464,-60.0,551.0
AA,2,1358256,1107.87602,-52.0,725.0
AA,3,1496665,1117.74832,-45.0,473.0
AA,4,1452394,1089.567892,-46.0,349.0
AA,5,1427749,1122.444182,-41.0,732.0


In [10]:
# How do we deal with a MultiIndex in the columns? With columns.get_level_values
level0 = airline.columns.get_level_values(0) # level 0 is the topmost column level
level0

Index(['DIST', 'DIST', 'ARR_DELAY', 'ARR_DELAY'], dtype='object')

In [11]:
level1 = airline.columns.get_level_values(1)
level1

Index(['sum', 'mean', 'min', 'max'], dtype='object')

In [12]:
#Now, we unite them
airline.columns = level0 + '_' + level1

In [13]:
airline # easier to work with, BUT AIRLINE and WEEKDAY are still in the index

Unnamed: 0_level_0,Unnamed: 1_level_0,DIST_sum,DIST_mean,ARR_DELAY_min,ARR_DELAY_max
AIRLINE,WEEKDAY,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
AA,1,1455386,1139.691464,-60.0,551.0
AA,2,1358256,1107.87602,-52.0,725.0
AA,3,1496665,1117.74832,-45.0,473.0
AA,4,1452394,1089.567892,-46.0,349.0
AA,5,1427749,1122.444182,-41.0,732.0


In [14]:
#reset index
airline.reset_index(inplace=True)

In [15]:
airline # now it is easy to work with

Unnamed: 0,AIRLINE,WEEKDAY,DIST_sum,DIST_mean,ARR_DELAY_min,ARR_DELAY_max
0,AA,1,1455386,1139.691464,-60.0,551.0
1,AA,2,1358256,1107.87602,-52.0,725.0
2,AA,3,1496665,1117.74832,-45.0,473.0
3,AA,4,1452394,1089.567892,-46.0,349.0
4,AA,5,1427749,1122.444182,-41.0,732.0


```By default, at the end of a groupby operation, pandas puts all of the grouping columns in the index.``` 

```The as_index parameter in the groupby method can be set to False to avoid this behavior.```

In [16]:
dictio = {'DIST': ['sum', 'mean'], 'ARR_DELAY':['min', 'max']}

airline1 = flights.groupby(['AIRLINE', 'WEEKDAY'], as_index=False).agg(dictio).head()

In [17]:
airline1 # BUT we still need to flatten the two column levels (level 0: column names, level 1: aggregating functions)

Unnamed: 0_level_0,AIRLINE,WEEKDAY,DIST,DIST,ARR_DELAY,ARR_DELAY
Unnamed: 0_level_1,Unnamed: 1_level_1,Unnamed: 2_level_1,sum,mean,min,max
0,AA,1,1455386,1139.691464,-60.0,551.0
1,AA,2,1358256,1107.87602,-52.0,725.0
2,AA,3,1496665,1117.74832,-45.0,473.0
3,AA,4,1452394,1089.567892,-46.0,349.0
4,AA,5,1427749,1122.444182,-41.0,732.0
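
One way to avoid the two column levels altogether is named aggregation, where each output column is given as `new_name=(column, function)`. The sketch below assumes a pandas version that supports this keyword syntax (0.25 or later) and uses a made-up three-row DataFrame rather than the full flights data.

```python
import pandas as pd

# Hypothetical toy data mimicking three columns of the flights dataset
toy = pd.DataFrame({
    'AIRLINE': ['AA', 'AA', 'DL'],
    'WEEKDAY': [1, 1, 2],
    'DIST': [100, 300, 500],
})

# Each keyword names an output column: new_name=(source column, function)
flat = toy.groupby(['AIRLINE', 'WEEKDAY'], as_index=False).agg(
    DIST_sum=('DIST', 'sum'),
    DIST_mean=('DIST', 'mean'),
)
flat_columns = list(flat.columns)
```

Combined with `as_index=False`, the result has a single level of column names and the grouping columns as ordinary columns, so no `get_level_values` or `reset_index` step is needed.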


## Customizing an aggregation function

In [6]:
url = "https://raw.githubusercontent.com/PacktPublishing/Pandas-Cookbook/master/data/college.csv"
college = pd.read_csv(url)

In [7]:
#creating a normal function:
#the maximum number of standard deviations away from the mean for any one institution.
def max_deviation(s):
    std_score = (s - s.mean())/s.std()
    return std_score.abs().max()

In [8]:
college.groupby('STABBR')['UGDS'].agg(max_deviation).round(1).head()

STABBR
AK    2.6
AL    5.8
AR    6.3
AS    NaN
AZ    9.9
Name: UGDS, dtype: float64

In [20]:
# It is possible to apply our customized function to multiple aggregating columns.
college.groupby('STABBR', as_index=False)[['UGDS', 'SATVRMID', 'SATMTMID']].agg(max_deviation).round(1).head()

  


Unnamed: 0,STABBR,UGDS,SATVRMID,SATMTMID
0,AK,2.6,,
1,AL,5.8,1.6,1.8
2,AR,6.3,2.2,2.3
3,AS,,,
4,AZ,9.9,1.9,1.4


In [21]:
# We can pass built-in aggregating functions together with our customized function
college.groupby(['STABBR', 'RELAFFIL'])[['UGDS', 'SATVRMID', 'SATMTMID']].agg([max_deviation, 'mean', 'std']).round(1).head()

  


Unnamed: 0_level_0,Unnamed: 1_level_0,UGDS,UGDS,UGDS,SATVRMID,SATVRMID,SATVRMID,SATMTMID,SATMTMID,SATMTMID
Unnamed: 0_level_1,Unnamed: 1_level_1,max_deviation,mean,std,max_deviation,mean,std,max_deviation,mean,std
STABBR,RELAFFIL,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2
AK,0,2.1,3508.9,4539.5,,,,,,
AK,1,1.1,123.3,132.9,,555.0,,,503.0,
AL,0,5.2,3248.8,5102.4,1.6,514.9,56.5,1.7,515.8,56.7
AL,1,2.4,979.7,870.8,1.5,498.0,53.0,1.4,485.6,61.4
AR,0,5.8,1793.7,3401.6,1.9,481.1,37.9,2.0,503.6,39.0


```Notice that pandas uses the name of the function as the name for the returned column. You
can change the column name directly with the rename method or you can modify the
special function attribute __name__:```

In [22]:
max_deviation.__name__

'max_deviation'

In [23]:
max_deviation.__name__ = 'Max Deviation'
college.groupby(['STABBR', 'RELAFFIL'])[['UGDS', 'SATVRMID', 'SATMTMID']].agg([max_deviation, 'mean', 'std']).round(1).head()

  


Unnamed: 0_level_0,Unnamed: 1_level_0,UGDS,UGDS,UGDS,SATVRMID,SATVRMID,SATVRMID,SATMTMID,SATMTMID,SATMTMID
Unnamed: 0_level_1,Unnamed: 1_level_1,Max Deviation,mean,std,Max Deviation,mean,std,Max Deviation,mean,std
STABBR,RELAFFIL,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2
AK,0,2.1,3508.9,4539.5,,,,,,
AK,1,1.1,123.3,132.9,,555.0,,,503.0,
AL,0,5.2,3248.8,5102.4,1.6,514.9,56.5,1.7,515.8,56.7
AL,1,2.4,979.7,870.8,1.5,498.0,53.0,1.4,485.6,61.4
AR,0,5.8,1793.7,3401.6,1.9,481.1,37.9,2.0,503.6,39.0
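
The other option mentioned above, the `rename` method, leaves the function itself untouched. The following is a minimal sketch on made-up data (the `toy` DataFrame is an assumption for illustration); it relies on the `level` parameter of `DataFrame.rename` to target the inner column level that holds the function names.

```python
import pandas as pd

def max_deviation(s):
    # Largest absolute z-score within the group
    std_score = (s - s.mean()) / s.std()
    return std_score.abs().max()

# Hypothetical toy data, not the college dataset
toy = pd.DataFrame({
    'state': ['AK', 'AK', 'AK', 'AL', 'AL', 'AL'],
    'ugds': [1.0, 2.0, 6.0, 10.0, 20.0, 30.0],
    'sat': [400.0, 500.0, 600.0, 450.0, 550.0, 650.0],
})

result = toy.groupby('state')[['ugds', 'sat']].agg([max_deviation, 'mean'])

# Rename only the inner column level; max_deviation.__name__ is unchanged
renamed = result.rename(columns={'max_deviation': 'Max Deviation'}, level=1)
renamed_columns = list(renamed.columns)
```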


## Customizing aggregating functions with *args and **kwargs

In [10]:
college.head()

Unnamed: 0,INSTNM,CITY,STABBR,HBCU,MENONLY,WOMENONLY,RELAFFIL,SATVRMID,SATMTMID,DISTANCEONLY,...,UGDS_2MOR,UGDS_NRA,UGDS_UNKN,PPTUG_EF,CURROPER,PCTPELL,PCTFLOAN,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
0,Alabama A & M University,Normal,AL,1.0,0.0,0.0,0,424.0,420.0,0.0,...,0.0,0.0059,0.0138,0.0656,1,0.7356,0.8284,0.1049,30300,33888.0
1,University of Alabama at Birmingham,Birmingham,AL,0.0,0.0,0.0,0,570.0,565.0,0.0,...,0.0368,0.0179,0.01,0.2607,1,0.346,0.5214,0.2422,39700,21941.5
2,Amridge University,Montgomery,AL,0.0,0.0,0.0,1,,,1.0,...,0.0,0.0,0.2715,0.4536,1,0.6801,0.7795,0.854,40100,23370.0
3,University of Alabama in Huntsville,Huntsville,AL,0.0,0.0,0.0,0,595.0,590.0,0.0,...,0.0172,0.0332,0.035,0.2146,1,0.3072,0.4596,0.264,45500,24097.0
4,Alabama State University,Montgomery,AL,1.0,0.0,0.0,0,425.0,430.0,0.0,...,0.0098,0.0243,0.0137,0.0892,1,0.7347,0.7554,0.127,26600,33118.5


In [11]:
grouped = college.groupby(['STABBR', 'RELAFFIL'])
grouped.head()

Unnamed: 0,INSTNM,CITY,STABBR,HBCU,MENONLY,WOMENONLY,RELAFFIL,SATVRMID,SATMTMID,DISTANCEONLY,...,UGDS_2MOR,UGDS_NRA,UGDS_UNKN,PPTUG_EF,CURROPER,PCTPELL,PCTFLOAN,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
0,Alabama A & M University,Normal,AL,1.0,0.0,0.0,0,424.0,420.0,0.0,...,0.0000,0.0059,0.0138,0.0656,1,0.7356,0.8284,0.1049,30300,33888
1,University of Alabama at Birmingham,Birmingham,AL,0.0,0.0,0.0,0,570.0,565.0,0.0,...,0.0368,0.0179,0.0100,0.2607,1,0.3460,0.5214,0.2422,39700,21941.5
2,Amridge University,Montgomery,AL,0.0,0.0,0.0,1,,,1.0,...,0.0000,0.0000,0.2715,0.4536,1,0.6801,0.7795,0.8540,40100,23370
3,University of Alabama in Huntsville,Huntsville,AL,0.0,0.0,0.0,0,595.0,590.0,0.0,...,0.0172,0.0332,0.0350,0.2146,1,0.3072,0.4596,0.2640,45500,24097
4,Alabama State University,Montgomery,AL,1.0,0.0,0.0,0,425.0,430.0,0.0,...,0.0098,0.0243,0.0137,0.0892,1,0.7347,0.7554,0.1270,26600,33118.5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7366,Montpelier Center - Closed July 2013,Montpelier,VT,,,,1,,,,...,,,,,0,,,,39600,18750
7367,New England Center,Brattleboro,VT,,,,1,,,,...,,,,,1,,,,39600,18750
7404,University of the Virgin Islands-Albert A. Sheen,St. Croix,VI,,,,1,,,,...,,,,,1,,,,31800,15150
7419,Computer Career Center-Las Cruces,Las Cruces,NM,,,,1,,,,...,,,,,1,,,,21300,14250


In [12]:
import inspect
inspect.signature(grouped.agg)

<Signature (func=None, *args, engine=None, engine_kwargs=None, **kwargs)>

The *args parameter allows you to pass an arbitrary number of non-keyword arguments to
your customized aggregation function.

Similarly, **kwargs allows you to pass an arbitrary
number of keyword arguments.

In [13]:
def pct_between_1_3k(s):
    return s.between(1000, 3000).mean()

In [15]:
college.groupby(['STABBR', 'RELAFFIL'])['UGDS'].agg(pct_between_1_3k).head()

STABBR  RELAFFIL
AK      0           0.142857
        1           0.000000
AL      0           0.236111
        1           0.333333
AR      0           0.279412
Name: UGDS, dtype: float64

In [16]:
# This function works fine, but it doesn't give the user any flexibility to choose the lower and upper bounds.
# Create a function that lets the user define the bounds

def pct_between(s, low, high):
    return s.between(low, high).mean()

In [19]:
college.groupby(['STABBR', 'RELAFFIL'])['UGDS'].agg(pct_between, 1000, 10000).head()

# The bounds can also be passed more explicitly as keyword arguments
#college.groupby(['STABBR', 'RELAFFIL'])['UGDS'].agg(pct_between, high=10000, low=1000).head(9)

STABBR  RELAFFIL
AK      0           0.428571
        1           0.000000
AL      0           0.458333
        1           0.375000
AR      0           0.397059
Name: UGDS, dtype: float64

```Unfortunately, pandas does not have a direct way to use these additional arguments when using multiple aggregation functions together.```

college.groupby(['STABBR', 'RELAFFIL'])['UGDS'].agg(['mean', pct_between], low=100, high=1000)

TypeError: pct_between() missing 2 required positional arguments: 'low' and 'high'


```To work around this, wrap the function in a closure that supplies the extra arguments```

def make_agg_func(func, name, *args, **kwargs):
    def wrapper(x):
        return func(x, *args, **kwargs)
    wrapper.__name__ = name
    return wrapper


```Then we call the factory and assign the resulting function to a variable```
> my_agg1 = make_agg_func(pct_between, 'pct_1_3k', low=1000, high=3000)

> The make_agg_func function acts as a factory to create customized aggregation functions.
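
To show the factory in use alongside a built-in aggregation, here is a self-contained sketch. The `toy` data and the second function name `pct_2_5k` are assumptions made for illustration; `pct_between` and `make_agg_func` are repeated from above so the block runs on its own.

```python
import pandas as pd

def pct_between(s, low, high):
    # Fraction of values within [low, high], both ends inclusive
    return s.between(low, high).mean()

def make_agg_func(func, name, *args, **kwargs):
    # Bind the extra arguments now so agg only has to pass the Series
    def wrapper(x):
        return func(x, *args, **kwargs)
    wrapper.__name__ = name
    return wrapper

# Hypothetical toy data, not the college dataset
toy = pd.DataFrame({
    'state': ['AK', 'AK', 'AL', 'AL'],
    'ugds': [500, 2000, 1500, 2500],
})

pct_1_3k = make_agg_func(pct_between, 'pct_1_3k', low=1000, high=3000)
pct_2_5k = make_agg_func(pct_between, 'pct_2_5k', 2000, 5000)

# The wrapped functions can now sit in a list next to string names
result = toy.groupby('state')['ugds'].agg(['mean', pct_1_3k, pct_2_5k])
```

Because each wrapper carries its own `__name__`, the resulting columns are labelled `pct_1_3k` and `pct_2_5k` rather than sharing the name of the underlying function.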


## Examining the groupby object

In [20]:
#the number of groups 
grouped.ngroups

112

In [21]:
#To find the uniquely identifying labels for each group, look in the groups attribute
groups = list(grouped.groups.keys())
groups

[('AK', 0),
 ('AK', 1),
 ('AL', 0),
 ('AL', 1),
 ('AR', 0),
 ('AR', 1),
 ('AS', 0),
 ('AZ', 0),
 ('AZ', 1),
 ('CA', 0),
 ('CA', 1),
 ('CO', 0),
 ('CO', 1),
 ('CT', 0),
 ('CT', 1),
 ('DC', 0),
 ('DC', 1),
 ('DE', 0),
 ('DE', 1),
 ('FL', 0),
 ('FL', 1),
 ('FM', 0),
 ('GA', 0),
 ('GA', 1),
 ('GU', 0),
 ('GU', 1),
 ('HI', 0),
 ('HI', 1),
 ('IA', 0),
 ('IA', 1),
 ('ID', 0),
 ('ID', 1),
 ('IL', 0),
 ('IL', 1),
 ('IN', 0),
 ('IN', 1),
 ('KS', 0),
 ('KS', 1),
 ('KY', 0),
 ('KY', 1),
 ('LA', 0),
 ('LA', 1),
 ('MA', 0),
 ('MA', 1),
 ('MD', 0),
 ('MD', 1),
 ('ME', 0),
 ('ME', 1),
 ('MH', 0),
 ('MI', 0),
 ('MI', 1),
 ('MN', 0),
 ('MN', 1),
 ('MO', 0),
 ('MO', 1),
 ('MP', 0),
 ('MS', 0),
 ('MS', 1),
 ('MT', 0),
 ('MT', 1),
 ('NC', 0),
 ('NC', 1),
 ('ND', 0),
 ('ND', 1),
 ('NE', 0),
 ('NE', 1),
 ('NH', 0),
 ('NH', 1),
 ('NJ', 0),
 ('NJ', 1),
 ('NM', 0),
 ('NM', 1),
 ('NV', 0),
 ('NV', 1),
 ('NY', 0),
 ('NY', 1),
 ('OH', 0),
 ('OH', 1),
 ('OK', 0),
 ('OK', 1),
 ('OR', 0),
 ('OR', 1),
 ('PA', 0),
 ('P

In [23]:
# Retrieve a single group with the get_group method by passing it a tuple of an exact group label
grouped.get_group(('FL',1)).head()

Unnamed: 0,INSTNM,CITY,STABBR,HBCU,MENONLY,WOMENONLY,RELAFFIL,SATVRMID,SATMTMID,DISTANCEONLY,...,UGDS_2MOR,UGDS_NRA,UGDS_UNKN,PPTUG_EF,CURROPER,PCTPELL,PCTFLOAN,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
712,The Baptist College of Florida,Graceville,FL,0.0,0.0,0.0,1,545.0,465.0,0.0,...,0.0308,0.0,0.0507,0.2291,1,0.5878,0.5602,0.3531,30800.0,20052
713,Barry University,Miami,FL,0.0,0.0,0.0,1,470.0,462.0,0.0,...,0.0164,0.0741,0.0841,0.1518,1,0.5045,0.6733,0.4361,44100.0,28250
714,Gooding Institute of Nurse Anesthesia,Panama City,FL,0.0,0.0,0.0,1,,,0.0,...,,,,,0,,,,,PrivacySuppressed
715,Bethune-Cookman University,Daytona Beach,FL,1.0,0.0,0.0,1,405.0,395.0,0.0,...,0.0198,0.0205,0.019,0.0523,1,0.7758,0.8867,0.0647,29400.0,36250
724,Johnson University Florida,Kissimmee,FL,0.0,0.0,0.0,1,480.0,470.0,0.0,...,0.0045,0.0045,0.0136,0.1636,1,0.6689,0.7384,0.2185,26300.0,20199


## Filtering for states with a minority majority

In [24]:
college.set_index('INSTNM', inplace= True)

In [25]:
grouped = college.groupby('STABBR')
grouped.ngroups
#college['STABBR'].nunique() #for verifying the same number

59

```The grouped variable has a filter method, which accepts a custom function
that determines whether a group is kept or not.```
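
As a minimal sketch of how `filter` behaves, the example below uses a made-up DataFrame and a hypothetical helper `total_above` (both assumptions for illustration). The function receives each group as a DataFrame and returns a single boolean; rows of groups that return `False` are dropped, and the surviving rows keep their original index.

```python
import pandas as pd

# Hypothetical toy data, not the college dataset
toy = pd.DataFrame({
    'state': ['AK', 'AK', 'AL', 'AL', 'AZ'],
    'ugds': [100, 200, 1000, 3000, 50],
})

def total_above(df, threshold):
    # Keep the group only if its total enrollment exceeds the threshold
    return df['ugds'].sum() > threshold

# Extra keyword arguments are forwarded to the function
kept = toy.groupby('state').filter(total_above, threshold=250)
kept_states = sorted(kept['state'].unique())
```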

In [26]:
def check_minority(df, threshold):
    minority_pct = 1 - df['UGDS_WHITE']
    total_minority = (df['UGDS'] * minority_pct).sum()
    total_ugds = df['UGDS'].sum()
    total_minority_pct = total_minority / total_ugds
    return total_minority_pct > threshold

In [27]:
#threshold of 50% to find all states that have a minority majority
college_filtered = grouped.filter(check_minority, threshold =.5)
college_filtered.head()

Unnamed: 0_level_0,CITY,STABBR,HBCU,MENONLY,WOMENONLY,RELAFFIL,SATVRMID,SATMTMID,DISTANCEONLY,UGDS,...,UGDS_2MOR,UGDS_NRA,UGDS_UNKN,PPTUG_EF,CURROPER,PCTPELL,PCTFLOAN,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
INSTNM,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Everest College-Phoenix,Phoenix,AZ,0.0,0.0,0.0,1,,,0.0,4102.0,...,0.0373,0.0,0.1026,0.4749,0,0.8291,0.7151,0.67,28600,9500
Collins College,Phoenix,AZ,0.0,0.0,0.0,0,,,0.0,83.0,...,0.0241,0.0,0.3855,0.3373,0,0.7205,0.8228,0.4764,25700,47000
Empire Beauty School-Paradise Valley,Phoenix,AZ,0.0,0.0,0.0,1,,,0.0,25.0,...,0.04,0.0,0.0,0.16,0,0.6349,0.5873,0.4651,17800,9588
Empire Beauty School-Tucson,Tucson,AZ,0.0,0.0,0.0,0,,,0.0,126.0,...,0.0,0.0,0.0079,0.2222,1,0.7962,0.6615,0.4229,18200,9833
Thunderbird School of Global Management,Glendale,AZ,0.0,0.0,0.0,0,,,0.0,1.0,...,0.0,0.0,0.0,1.0,0,0.0,0.0,0.0,118900,PrivacySuppressed


## Transforming through a weight loss bet

In [28]:
url = "https://raw.githubusercontent.com/PacktPublishing/Pandas-Cookbook/master/data/weight_loss.csv"
weight_loss = pd.read_csv(url)
weight_loss.head()

Unnamed: 0,Name,Month,Week,Weight
0,Bob,Jan,Week 1,291
1,Amy,Jan,Week 1,197
2,Bob,Jan,Week 2,288
3,Amy,Jan,Week 2,189
4,Bob,Jan,Week 3,283


In [29]:
weight_loss[weight_loss['Month'] == 'Jan']
#or weight_loss.query('Month == "Jan"')

Unnamed: 0,Name,Month,Week,Weight
0,Bob,Jan,Week 1,291
1,Amy,Jan,Week 1,197
2,Bob,Jan,Week 2,288
3,Amy,Jan,Week 2,189
4,Bob,Jan,Week 3,283
5,Amy,Jan,Week 3,189
6,Bob,Jan,Week 4,283
7,Amy,Jan,Week 4,190


In [30]:
# To determine the winner for each month, we only need to compare weight loss from the first week 
# to the last week of each month.
def find_perc_loss(s):
    return (s - s.iloc[0]) / s.iloc[0]

In [31]:
bob_jan = weight_loss.query('Name=="Bob" and Month=="Jan"')
find_perc_loss(bob_jan['Weight'])

0    0.000000
2   -0.010309
4   -0.027491
6   -0.027491
Name: Weight, dtype: float64

In [32]:
#We can apply this function to every single combination of person and week to get the
#weight loss per week in relation to the first week of the month

pcnt_loss = weight_loss.groupby(['Name', 'Month'])['Weight'].transform(find_perc_loss)
pcnt_loss.head()

0    0.000000
1    0.000000
2   -0.010309
3   -0.040609
4   -0.027491
Name: Weight, dtype: float64
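
The key property used here is that `transform` returns a result with the same length and index as the original data, unlike `agg`, which returns one value per group. A small sketch on made-up data (the `toy` DataFrame and the helper name `pct_change_from_first` are assumptions for illustration):

```python
import pandas as pd

# Hypothetical toy data, not the weight_loss dataset
toy = pd.DataFrame({
    'name': ['Bob', 'Amy', 'Bob', 'Amy'],
    'weight': [200, 100, 190, 95],
})

def pct_change_from_first(s):
    # Change relative to the first value of each group
    return (s - s.iloc[0]) / s.iloc[0]

# agg collapses each group to one value
aggregated = toy.groupby('name')['weight'].agg('min')

# transform keeps one value per original row, aligned on the original index
transformed = toy.groupby('name')['weight'].transform(pct_change_from_first)
```

This alignment is what makes it safe to assign the transformed Series back to the DataFrame as a new column.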

In [33]:
# Add the result to the DataFrame as a new column
weight_loss['Perc weight loss'] = pcnt_loss.round(3)
weight_loss.query('Name == "Bob" and Month in ["Jan", "Feb"]')

Unnamed: 0,Name,Month,Week,Weight,Perc weight loss
0,Bob,Jan,Week 1,291,0.0
2,Bob,Jan,Week 2,288,-0.01
4,Bob,Jan,Week 3,283,-0.027
6,Bob,Jan,Week 4,283,-0.027
8,Bob,Feb,Week 1,283,0.0
10,Bob,Feb,Week 2,275,-0.028
12,Bob,Feb,Week 3,268,-0.053
14,Bob,Feb,Week 4,268,-0.053


In [34]:
# The percentage resets at the start of every month. Only the final week decides each month's winner, so select Week 4
week4 = weight_loss.query('Week == "Week 4"')
week4

Unnamed: 0,Name,Month,Week,Weight,Perc weight loss
6,Bob,Jan,Week 4,283,-0.027
7,Amy,Jan,Week 4,190,-0.036
14,Bob,Feb,Week 4,268,-0.053
15,Amy,Feb,Week 4,173,-0.089
22,Bob,Mar,Week 4,261,-0.026
23,Amy,Mar,Week 4,170,-0.017
30,Bob,Apr,Week 4,250,-0.042
31,Amy,Apr,Week 4,161,-0.053


In [35]:
#But who won? We can make a pivot table
winner = week4.pivot(index = 'Month', columns= 'Name', values='Perc weight loss')
winner

Name,Amy,Bob
Month,Unnamed: 1_level_1,Unnamed: 2_level_1
Apr,-0.053,-0.042
Feb,-0.089,-0.053
Jan,-0.036,-0.027
Mar,-0.017,-0.026


In [36]:
# Make it easier to see who won: the more negative value means more weight lost
winner['Winner'] = np.where(winner['Amy'] < winner['Bob'], 'Amy', 'Bob') # np.where(condition, value if true, value if false)
winner.style.highlight_min(axis=1, color= 'green')

Name,Amy,Bob,Winner
Month,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Apr,-0.053,-0.042,Amy
Feb,-0.089,-0.053,Amy
Jan,-0.036,-0.027,Amy
Mar,-0.017,-0.026,Bob


## Calculating weighted mean SAT scores per state with apply

The groupby object has four methods that accept a function (or functions) to perform a
calculation on each group. These four methods are agg, filter, transform, and apply.


> ```apply: passes each group to the function as a whole DataFrame, so the function can combine several columns at once```

>```agg: passes each column of each group to the function separately as a Series, and the function must return a single value```

In [31]:
# For this, we need to drop the rows with missing values in any of these columns
subset = ['UGDS', 'SATMTMID', 'SATVRMID']
college2 = college.dropna(subset = subset)

In [32]:
def weighted_math_average(df):
    weighted_math = df['UGDS'] * df['SATMTMID']
    return int(weighted_math.sum() / df['UGDS'].sum())

In [33]:
college2.groupby('STABBR').apply(weighted_math_average).head()

STABBR
AK    503
AL    536
AR    529
AZ    569
CA    564
dtype: int64

In [40]:
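# Using agg here raises an error: agg passes each column to the function as a Series,
# so the function cannot look up df['UGDS'] and df['SATMTMID'] together
#college2.groupby('STABBR').agg(weighted_math_average).head()
#college2.groupby('STABBR')['SATMTMID'].agg(weighted_math_average)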

```A nice feature of apply is that you can create multiple new columns by returning
a Series.```
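
A minimal sketch of that feature, using made-up data and a hypothetical helper `weighted_averages` (both assumptions for illustration): the function returns a Series whose index labels become the column names of the combined result.

```python
import pandas as pd

# Hypothetical toy data, not the college dataset
toy = pd.DataFrame({
    'state': ['AK', 'AK', 'AL', 'AL'],
    'ugds': [100, 300, 200, 200],
    'math': [500, 600, 400, 500],
    'verbal': [450, 650, 420, 480],
})

def weighted_averages(df):
    # Enrollment-weighted mean of each score; the Series labels become columns
    weights = df['ugds']
    return pd.Series({
        'w_math': (df['math'] * weights).sum() / weights.sum(),
        'w_verbal': (df['verbal'] * weights).sum() / weights.sum(),
    })

# Select the needed columns first so only they are passed to the function
result = toy.groupby('state')[['ugds', 'math', 'verbal']].apply(weighted_averages)
```

The result is a DataFrame with one row per state and one column per label in the returned Series.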

## Grouping by continuous variables

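```Continuous numeric columns usually have few repeated values, so grouping by them directly
would produce nearly one group per row. However, if we can transform columns with continuous
values into a discrete column by placing each value into a bin, rounding them, or using
some other mapping, then grouping with them makes sense.```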

> To accomplish this, we use the pandas ***cut*** function to discretize the distance of each flight flown.
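
Before applying it to the flights data, here is a small sketch of how `pd.cut` assigns values to bins, using a handful of made-up distances (an assumption for illustration). By default the intervals are closed on the right, so a value equal to a bin edge falls into the lower bin.

```python
import numpy as np
import pandas as pd

# Hypothetical distances chosen to land on and around the bin edges
dist = pd.Series([150, 200, 201, 750, 2500], name='DIST')
bins = [-np.inf, 200, 500, 1000, 2000, np.inf]

# Each value is replaced by the interval (bin) that contains it
binned = pd.cut(dist, bins=bins)

# Position of each value's bin among the five categories
bin_codes = binned.cat.codes.tolist()
```

Note that 200 lands in the first bin while 201 lands in the second, which matters when choosing edges such as 200 or 500 miles.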

In [42]:
flights.head()

Unnamed: 0,MONTH,DAY,WEEKDAY,AIRLINE,ORG_AIR,DEST_AIR,SCHED_DEP,DEP_DELAY,AIR_TIME,DIST,SCHED_ARR,ARR_DELAY,DIVERTED,CANCELLED
0,1,1,4,WN,LAX,SLC,1625,58.0,94.0,590,1905,65.0,0,0
1,1,1,4,UA,DEN,IAD,823,7.0,154.0,1452,1333,-13.0,0,0
2,1,1,4,MQ,DFW,VPS,1305,36.0,85.0,641,1453,35.0,0,0
3,1,1,4,AA,DFW,DCA,1555,7.0,126.0,1192,1935,-7.0,0,0
4,1,1,4,WN,LAX,MCI,1720,48.0,166.0,1363,2225,39.0,0,0


In [43]:
#define the bins for distance
bins = [-np.inf, 200, 500, 1000, 2000, np.inf]
cuts = pd.cut(flights['DIST'], bins = bins) #"cuts" the data into those bins (or also called: groups)
cuts.head()

0     (500.0, 1000.0]
1    (1000.0, 2000.0]
2     (500.0, 1000.0]
3    (1000.0, 2000.0]
4    (1000.0, 2000.0]
Name: DIST, dtype: category
Categories (5, interval[float64]): [(-inf, 200.0] < (200.0, 500.0] < (500.0, 1000.0] < (1000.0, 2000.0] < (2000.0, inf]]

In [44]:
#The cuts Series can now be used to form groups.
flights.groupby(cuts)['AIRLINE'].value_counts(normalize=True).round(3).head(15)

DIST            AIRLINE
(-inf, 200.0]   OO         0.326
                EV         0.289
                MQ         0.211
                DL         0.086
                AA         0.052
                UA         0.027
                WN         0.009
(200.0, 500.0]  WN         0.194
                DL         0.189
                OO         0.159
                EV         0.156
                MQ         0.100
                AA         0.071
                UA         0.062
                VX         0.028
Name: AIRLINE, dtype: float64

```Some interesting insights can be drawn from this result. Looking at the full result, SkyWest
is the leading airline for under 200 miles but has no flights over 2,000 miles.```

```In contrast, American Airlines has the fifth highest total for flights under 200 miles but has by far the
most flights between 1,000 and 2,000 miles.```

In [46]:
#We can use the cuts for grouping anything
#For instance, we can find the 25th, 50th, and 75th percentile airtime for each distance grouping. (in minutes)
flights.groupby(cuts)['AIR_TIME'].quantile(q = [.25, .5,.75]).round(2).head() #add .div(60) for hours

DIST                
(-inf, 200.0]   0.25    26.0
                0.50    30.0
                0.75    34.0
(200.0, 500.0]  0.25    46.0
                0.50    55.0
Name: AIR_TIME, dtype: float64

> We can use this information to create informative string labels when using the cut function.

> These labels replace the interval notation. 

> We can also chain the unstack method which transposes the inner index level to column names:

In [51]:
labels=['Under an Hour', '1 Hour', '1-2 Hours', '2-4 Hours', '4+ Hours']

cuts2 = pd.cut(flights['DIST'], bins= bins, labels= labels)

flights.groupby(cuts2)['AIRLINE'].value_counts(normalize = True).round(3).unstack().style.highlight_max(axis = 1, color = 'green')

AIRLINE,AA,AS,B6,DL,EV,F9,HA,MQ,NK,OO,UA,US,VX,WN
DIST,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
Under an Hour,0.052,,,0.086,0.289,,,0.211,,0.326,0.027,,,0.009
1 Hour,0.071,0.001,0.007,0.189,0.156,0.005,,0.1,0.012,0.159,0.062,0.016,0.028,0.194
1-2 Hours,0.144,0.023,0.003,0.206,0.101,0.038,,0.051,0.03,0.106,0.131,0.025,0.004,0.138
2-4 Hours,0.264,0.016,0.003,0.165,0.016,0.031,,0.003,0.045,0.046,0.199,0.04,0.012,0.16
4+ Hours,0.212,0.012,0.08,0.171,,0.004,0.028,,0.019,,0.289,0.065,0.074,0.046


## Counting the total number of flights between cities

In [84]:
# Count the total number of flights between two cities, regardless of which one is the origin or destination.

# First, find the number of flights between each origin and destination airport (size gives the number of rows per group)

flights_ct = flights.groupby(['ORG_AIR', 'DEST_AIR']).size()
flights_ct

ORG_AIR  DEST_AIR
ATL      ABE          31
         ABQ          16
         ABY          19
         ACY           6
         AEX          40
                    ... 
SFO      SNA         122
         STL          20
         SUN          10
         TUS          20
         XNA           2
Length: 1130, dtype: int64

In [85]:
# Select the total number of flights between Houston (IAH) and Atlanta (ATL), in both directions
flights_ct.loc[[('ATL', 'IAH'), ('IAH', 'ATL')]].sum()

269

In [None]:
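# Adding up both directions by hand for every pair of airports does not scale to large data
# np.sort is a fast way to sort the two airport codes within each row, so both directions share the same pair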
data_sorted = np.sort(flights[['ORG_AIR', 'DEST_AIR']])
flights_sort2 = pd.DataFrame(data_sorted, columns=['AIR1', 'AIR2'])

In [74]:
%%timeit
data_sorted = np.sort(flights[['ORG_AIR', 'DEST_AIR']])
flights_sort2 = pd.DataFrame(data_sorted, columns=['ORG_AIR', 'DEST_AIR'])

30.3 ms ± 3.92 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
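
To finish the idea, the sorted pairs can be grouped and counted so that each route is tallied once regardless of direction. The sketch below uses a made-up four-row DataFrame (an assumption for illustration) and converts to a NumPy array explicitly with `to_numpy()` before sorting.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the flights dataset
toy = pd.DataFrame({
    'ORG_AIR': ['IAH', 'ATL', 'ATL', 'SFO'],
    'DEST_AIR': ['ATL', 'IAH', 'IAH', 'ATL'],
})

# Sort within each row so the two airport codes are in alphabetical order
pairs = np.sort(toy[['ORG_AIR', 'DEST_AIR']].to_numpy(), axis=1)
sorted_pairs = pd.DataFrame(pairs, columns=['AIR1', 'AIR2'])

# Both directions now share one label, so size counts them together
route_counts = sorted_pairs.groupby(['AIR1', 'AIR2']).size()
```

After sorting, `('ATL', 'IAH')` covers flights in either direction, so a single lookup replaces the two-tuple selection and sum used earlier.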
