In this video, we will learn how to use the groupby method to split data into groups and aggregate it. We will break groupby into its parts to see how it works, demonstrate it with statistical and other methods, and iterate over the grouped data to do interesting things with each group.

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
%cd /content/drive/My Drive/Colab Notebooks

/content/drive/My Drive/Colab Notebooks


In [3]:
import pandas as pd
data = pd.read_csv('data-zillow1.csv')
data.head()

Unnamed: 0,Date,RegionID,RegionName,State,Metro,County,SizeRank,Price
0,2017-05-31,6181,New York,NY,New York,Queens,0,672400
1,2017-05-31,12447,Los Angeles,CA,Los Angeles-Long Beach-Anaheim,Los Angeles,1,629900
2,2017-05-31,17426,Chicago,IL,Chicago,Cook,2,222700
3,2017-05-31,13271,Philadelphia,PA,Philadelphia,Philadelphia,3,137300
4,2017-05-31,40326,Phoenix,AZ,Phoenix,Maricopa,4,211300


Let's start by asking a question and seeing whether pandas' groupby method can help us answer it. The question is: what is the mean Price for every State?

In [4]:
grouped_data = data[['State', 'Price']].groupby('State').mean()
grouped_data.head()

State,Price
AK,237783.333333
AL,137645.637584
AR,136331.707317
AZ,232353.921569
CA,617425.392297


Here we used the groupby method to aggregate the data by State and got the mean Price per State. In the background, groupby split the data into groups, the mean function was applied to each group, and the results were combined into a single table.

Let's break this code into its individual pieces to see how it works. First, the data is split into groups:
grouped_data = data[['State', 'Price']].groupby('State')

We selected a subset of the data that has only the State and Price columns. We then called the groupby method on this subset and passed it the State column, because that is the column we want to group by.
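To see what the split step produces before any function is applied, the GroupBy object can be inspected directly. The following is a minimal sketch on a small made-up DataFrame (the rows are illustrative and are not taken from data-zillow1.csv; the names sample, grouped, and row_labels are hypothetical). It uses the standard ngroups, groups, and get_group members of a pandas GroupBy object.

```python
import pandas as pd

# Tiny illustrative frame; values are made up, not read from data-zillow1.csv
sample = pd.DataFrame({
    'State': ['NY', 'CA', 'CA', 'IL'],
    'Price': [672400, 629900, 500000, 222700],
})

# Split step only: no aggregation has been applied yet
grouped = sample[['State', 'Price']].groupby('State')

# Number of distinct groups found in the State column
n_groups = grouped.ngroups

# Mapping from each State label to the row labels that belong to it
row_labels = {state: list(idx) for state, idx in grouped.groups.items()}

# Pull out the rows of one group as an ordinary DataFrame
ca_rows = grouped.get_group('CA')

print(n_groups)
print(row_labels)
print(ca_rows)
```

Nothing is computed per group until a method such as mean is called; the object only records which rows belong to which State.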


Now we have the data grouped by State. Next, we apply a function to the split data and display the combined result.


Here we use the mean method to get the mean Price of each group once the data has been split.
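The apply and combine steps can also be written out by hand, which may help make the one-line form less opaque. This is only a sketch on made-up rows (sample, manual, and built_in are hypothetical names), not a description of how pandas implements groupby internally.

```python
import pandas as pd

# Tiny illustrative frame; values are made up, not read from data-zillow1.csv
sample = pd.DataFrame({
    'State': ['NY', 'CA', 'CA', 'IL'],
    'Price': [672400, 629900, 500000, 222700],
})

# Split: iterate over one sub-frame per State
# Apply: compute the mean Price of each sub-frame
# Combine: collect the per-group results into a single Series
manual = pd.Series(
    {state: group['Price'].mean() for state, group in sample.groupby('State')},
    name='Price',
)

# The one-line form used in the notebook, reduced to its Price column
built_in = sample[['State', 'Price']].groupby('State').mean()['Price']

print(manual)
print(built_in)
```

Both results hold the same mean Price per State; the built-in form simply performs the three steps in one call.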


We can also group by multiple columns. For example, here we group by both the State and RegionName columns.

In [6]:
grouped_data = data[['State', 'RegionName', 'Price']].groupby(['State', 'RegionName']).mean()

In [7]:
grouped_data.head()

State,RegionName,Price
AK,Anchor Point,175800.0
AK,Anchorage,293900.0
AK,Fairbanks,221000.0
AK,Juneau,323100.0
AK,Kenai,206500.0
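Grouping by two columns gives a result whose index has two levels (State and RegionName). A short sketch of how such a result might be queried, using made-up rows and hypothetical variable names, with the standard loc and reset_index methods:

```python
import pandas as pd

# Tiny illustrative frame; values are made up, not read from data-zillow1.csv
sample = pd.DataFrame({
    'State': ['AK', 'AK', 'CA'],
    'RegionName': ['Anchorage', 'Juneau', 'Los Angeles'],
    'Price': [293900, 323100, 629900],
})

by_state_region = sample.groupby(['State', 'RegionName']).mean()

# Look up a single (State, RegionName) pair in the two-level index
juneau_price = by_state_region.loc[('AK', 'Juneau'), 'Price']

# Select every region within one State
alaska = by_state_region.loc['AK']

# Turn the index levels back into ordinary columns
flat = by_state_region.reset_index()

print(juneau_price)
print(alaska)
print(flat)
```

Calling reset_index is a convenient choice when the grouped result needs to be written out or merged with other tables.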


In [9]:
# We can also get the number of records per State with the groupby and size methods
grouped_data = data.groupby(['State']).size()

In [10]:
list(grouped_data)

[12,
 149,
 82,
 102,
 701,
 166,
 183,
 1,
 26,
 528,
 266,
 43,
 63,
 33,
 496,
 418,
 52,
 118,
 148,
 334,
 318,
 1,
 461,
 155,
 202,
 135,
 20,
 293,
 12,
 20,
 190,
 577,
 26,
 26,
 717,
 507,
 70,
 164,
 1208,
 41,
 138,
 350,
 397,
 81,
 228,
 275,
 270,
 17,
 10]

In [11]:
grouped_data.head()

State
AK     12
AL    149
AR     82
AZ    102
CA    701
dtype: int64
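Because size returns a Series indexed by State, the counts can be looked up by label or sorted rather than read from a plain list. The sketch below uses made-up rows, including one missing Price added purely to illustrate how size differs from count; all variable names are hypothetical.

```python
import pandas as pd

# Tiny illustrative frame; the missing Price is deliberate and made up
sample = pd.DataFrame({
    'State': ['CA', 'CA', 'NY', 'CA', 'NY', 'AK'],
    'Price': [629900, 500000, 672400, None, 410000, 293900],
})

counts = sample.groupby('State').size()

# Look up one State's record count by label instead of by list position
ca_count = counts['CA']

# Rank States from most to fewest records
ranked = counts.sort_values(ascending=False)

# count() skips missing values, so it can be smaller than size()
non_missing_prices = sample.groupby('State')['Price'].count()

print(ranked)
print(non_missing_prices)
```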

In all the code so far, we have grouped rows. However, we can also group columns, for example by their data type.

In [12]:
grouped_data = data.groupby(data.dtypes, axis=1)
list(grouped_data)

[(dtype('int64'),        RegionID  SizeRank   Price
  0          6181         0  672400
  1         12447         1  629900
  2         17426         2  222700
  3         13271         3  137300
  4         40326         4  211300
  ...         ...       ...     ...
  10825     26077     10825  392200
  10826     24105     10826  191900
  10827    737788     10827  231100
  10828    182023     10828  230800
  10829     51793     10829  296400
  
  [10830 rows x 3 columns]), (dtype('O'),              Date  ...        County
  0      2017-05-31  ...        Queens
  1      2017-05-31  ...   Los Angeles
  2      2017-05-31  ...          Cook
  3      2017-05-31  ...  Philadelphia
  4      2017-05-31  ...      Maricopa
  ...           ...  ...           ...
  10825  2017-05-31  ...     Tillamook
  10826  2017-05-31  ...     Galveston
  10827  2017-05-31  ...     Hunterdon
  10828  2017-05-31  ...       Henrico
  10829  2017-05-31  ...    Rockingham
  
  [10830 rows x 5 columns])]
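Recent pandas releases (2.1 and later) deprecate the axis argument of groupby, so the call above may emit a warning or fail depending on the installed version. One possible alternative, sketched here on made-up rows with hypothetical names (dtype_names, by_dtype), groups the column names by dtype and then slices the frame once per group. This is a swapped-in approach, not the notebook's own method.

```python
import pandas as pd

# Tiny illustrative frame; values are made up, not read from data-zillow1.csv
sample = pd.DataFrame({
    'State': ['NY', 'CA', 'CA'],
    'SizeRank': [0, 1, 2],
    'Price': [672400, 629900, 500000],
})

# Series mapping each column name to the string name of its dtype
dtype_names = sample.dtypes.astype(str)

# Group the column names by dtype, then take the matching columns per group
by_dtype = {
    name: sample[list(cols.index)]
    for name, cols in dtype_names.groupby(dtype_names)
}

for name, frame in by_dtype.items():
    print(name, list(frame.columns))
```

Each value in by_dtype is an ordinary DataFrame holding only the columns of one data type, which mirrors the pairs produced by list(grouped_data) above.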

In [13]:
# We can also iterate over the split groups and do interesting things with them

for state, grouped_data in data.groupby('State'):
  print(state, '\n', grouped_data)

AK 
              Date  RegionID  ... SizeRank   Price
57     2017-05-31     23482  ...       57  293900
842    2017-05-31     38465  ...      842  221000
1793   2017-05-31     36906  ...     1793  247800
1830   2017-05-31     29910  ...     1830  213100
1974   2017-05-31      5365  ...     1974  323100
3756   2017-05-31     52742  ...     3756  206500
3869   2017-05-31     39281  ...     3869  270700
4450   2017-05-31    102611  ...     4450  224700
5229   2017-05-31     32296  ...     5229  207500
5996   2017-05-31    395445  ...     5996  249700
9622   2017-05-31     54367  ...     9622  219600
10162  2017-05-31     28124  ...    10162  175800

[12 rows x 8 columns]
AL 
              Date  RegionID         RegionName  ...      County SizeRank   Price
71     2017-05-31     32900             Mobile  ...      Mobile       71  112100
121    2017-05-31     10417         Birmingham  ...   Jefferson      121   61900
154    2017-05-31     12014         Huntsville  ...     Madison      154  

Here we iterate over the groups by State and print each result, with the State as a heading followed by a table of all the records from that State.
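Iteration is also useful when each group needs custom handling rather than printing. As one illustration, the sketch below collects the most expensive region in each State. The rows are made up and the names (sample, most_expensive, top_row) are hypothetical; idxmax and loc are standard pandas calls.

```python
import pandas as pd

# Tiny illustrative frame; values are made up, not read from data-zillow1.csv
sample = pd.DataFrame({
    'State': ['AK', 'AK', 'CA', 'CA'],
    'RegionName': ['Anchorage', 'Juneau', 'Los Angeles', 'Fresno'],
    'Price': [293900, 323100, 629900, 210000],
})

most_expensive = {}
for state, group in sample.groupby('State'):
    # idxmax returns the row label of the highest Price within this group
    top_row = group.loc[group['Price'].idxmax()]
    most_expensive[state] = (top_row['RegionName'], int(top_row['Price']))

print(most_expensive)
```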


In this video, we learned how to use the groupby method to split data into groups and aggregate it. We explored how groupby works by breaking it into its pieces, demonstrated it with statistical and other methods, and saw how to do interesting things by iterating over the grouped data.