## What Is Grouping?

__Grouping__ is the process of grouping the rows of a dataframe by a particular column or feature and summarizing each group with an aggregation such as __mean__, __sum__, or __std__.

<i>Grouping</i> consists of three steps performed in order: <i>splitting</i>, <i>applying</i>, and <i>combining</i>.
* <i>Splitting</i>: Separating the data into groups based on a particular column
* <i>Applying</i>: Performing an operation on the data within each of those groups
* <i>Combining</i>: Merging the per-group results into a new data structure
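
The three steps above can be sketched by hand on a small example. This is only an illustration: the `toy` dataframe and its values are made up and are not taken from the video game dataset.

```python
import pandas as pd

# Hypothetical toy data, not taken from the video game dataset
toy = pd.DataFrame({
    "Genre": ["Action", "Sports", "Action", "Sports", "Puzzle"],
    "Sales": [1.0, 2.0, 3.0, 4.0, 5.0],
})

# Splitting: one sub-dataframe per distinct Genre value
groups = {genre: part for genre, part in toy.groupby("Genre")}

# Applying: compute the mean of Sales inside each group
means = {genre: part["Sales"].mean() for genre, part in groups.items()}

# Combining: collect the per-group results into a new Series
manual = pd.Series(means, name="Sales").rename_axis("Genre")

# groupby() followed by an aggregation performs the same three steps at once
automatic = toy.groupby("Genre")["Sales"].mean()

print(manual)
print(automatic)
```

Both printed Series should hold the same per-genre means, which is why the one-line `groupby()` form is normally preferred.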

In Python, the pandas <code>groupby()</code> method is used to <i>group</i> data. Before using it, <i>load</i> the dataset first.

## Load dataset

The dataset used is <a href='https://www.kaggle.com/gregorut/videogamesales'>Video Games Sales</a>.

<b>Attribute Information</b>

* __Rank :__  Ranking of overall sales
* __Name :__  The game's name
* __Platform :__  Platform of the game's release (e.g. PC, PS4, etc.)
* __Year :__  Year of the game's release
* __Genre :__  Genre of the game
* __Publisher :__  Publisher of the game
* __NA_Sales :__  Sales in North America (in millions)
* __EU_Sales :__  Sales in Europe (in millions)
* __JP_Sales :__  Sales in Japan (in millions)
* __Other_Sales :__  Sales in the rest of the world (in millions)
* __Global_Sales :__  Total worldwide sales.


First import pandas, then <i>load</i> the data into a dataframe using <code>read_csv()</code>.

In [1]:
import pandas as pd

In [2]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [3]:
%cd /content/gdrive/My Drive/Colab Notebooks/DATASET/datasets_py/

/content/gdrive/My Drive/Colab Notebooks/DATASET/datasets_py


In [4]:
df = pd.read_csv("video-game-sales.csv")
df.head()

| | Rank | Name | Platform | Year | Genre | Publisher | NA_Sales | EU_Sales | JP_Sales | Other_Sales | Global_Sales |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Wii Sports | Wii | 2006.0 | Sports | Nintendo | 41.49 | 29.02 | 3.77 | 8.46 | 82.74 |
| 1 | 2 | Super Mario Bros. | NES | 1985.0 | Platform | Nintendo | 29.08 | 3.58 | 6.81 | 0.77 | 40.24 |
| 2 | 3 | Mario Kart Wii | Wii | 2008.0 | Racing | Nintendo | 15.85 | 12.88 | 3.79 | 3.31 | 35.82 |
| 3 | 4 | Wii Sports Resort | Wii | 2009.0 | Sports | Nintendo | 15.75 | 11.01 | 3.28 | 2.96 | 33.0 |
| 4 | 5 | Pokemon Red/Pokemon Blue | GB | 1996.0 | Role-Playing | Nintendo | 11.27 | 8.89 | 10.22 | 1.0 | 31.37 |


Check the <code>info()</code> of the dataframe.

In [5]:
# Check the dataframe info

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16598 entries, 0 to 16597
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Rank          16598 non-null  int64  
 1   Name          16598 non-null  object 
 2   Platform      16598 non-null  object 
 3   Year          16327 non-null  float64
 4   Genre         16598 non-null  object 
 5   Publisher     16540 non-null  object 
 6   NA_Sales      16598 non-null  float64
 7   EU_Sales      16598 non-null  float64
 8   JP_Sales      16598 non-null  float64
 9   Other_Sales   16598 non-null  float64
 10  Global_Sales  16598 non-null  float64
dtypes: float64(6), int64(1), object(4)
memory usage: 1.4+ MB


## Grouping by a single column

Data can be <i>grouped</i> by a single column, showing either all columns or only a few selected columns.

### Grouping all columns by a single column

The following applies <i>grouping</i> to all columns by the <code>Genre</code> column with the __mean__ aggregation.

Consider the following code.

In [6]:
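# Group all numeric columns by the Genre column with the mean aggregation
# numeric_only=True skips text columns such as Name, which pandas 2.0+ no longer drops silently

df.groupby('Genre').mean(numeric_only=True)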

| Genre | Rank | Year | NA_Sales | EU_Sales | JP_Sales | Other_Sales | Global_Sales |
|---|---|---|---|---|---|---|---|
| Action | 7973.879071 | 2007.909929 | 0.264726 | 0.158323 | 0.048236 | 0.056508 | 0.5281 |
| Adventure | 11532.787714 | 2008.130878 | 0.082271 | 0.049868 | 0.04049 | 0.013072 | 0.185879 |
| Fighting | 7646.511792 | 2004.630383 | 0.263667 | 0.119481 | 0.103007 | 0.043255 | 0.529375 |
| Misc | 8561.847039 | 2007.25848 | 0.235906 | 0.124198 | 0.061967 | 0.043312 | 0.465762 |
| Platform | 6927.251693 | 2003.820776 | 0.504571 | 0.227573 | 0.147596 | 0.058228 | 0.938341 |
| Puzzle | 9627.381443 | 2005.243433 | 0.21268 | 0.087251 | 0.098471 | 0.021564 | 0.420876 |
| Racing | 7961.515612 | 2004.840131 | 0.287766 | 0.190865 | 0.045388 | 0.061865 | 0.586101 |
| Role-Playing | 8086.174731 | 2007.055744 | 0.219946 | 0.126384 | 0.236767 | 0.04006 | 0.623233 |
| Shooter | 7369.367939 | 2005.918877 | 0.444733 | 0.239137 | 0.029221 | 0.078389 | 0.791885 |
| Simulation | 8626.085352 | 2006.567568 | 0.21143 | 0.130773 | 0.073472 | 0.036355 | 0.452364 |


The code above produces a dataframe grouped by the <code>Genre</code> column, where each <i>cell</i> holds the __mean__ of that column for that genre. Only numeric columns appear because the __mean__ aggregation applies only to numeric columns. Note that pandas 2.0 and later no longer drops non-numeric columns such as <code>Name</code> automatically and raises a <code>TypeError</code> instead, so <code>mean(numeric_only=True)</code> is needed there.

### Grouping selected columns by a single column

To show only particular columns, select them after <code>groupby()</code>. This example shows only the <code>NA_Sales</code>, <code>EU_Sales</code>, <code>JP_Sales</code>, <code>Other_Sales</code>, and <code>Global_Sales</code> columns.

Consider how the code is written below.

In [7]:
# Group selected columns by Genre with the mean aggregation

df.groupby('Genre')[['NA_Sales', 'EU_Sales', 'JP_Sales', 'Other_Sales', 'Global_Sales']].mean()

| Genre | NA_Sales | EU_Sales | JP_Sales | Other_Sales | Global_Sales |
|---|---|---|---|---|---|
| Action | 0.264726 | 0.158323 | 0.048236 | 0.056508 | 0.5281 |
| Adventure | 0.082271 | 0.049868 | 0.04049 | 0.013072 | 0.185879 |
| Fighting | 0.263667 | 0.119481 | 0.103007 | 0.043255 | 0.529375 |
| Misc | 0.235906 | 0.124198 | 0.061967 | 0.043312 | 0.465762 |
| Platform | 0.504571 | 0.227573 | 0.147596 | 0.058228 | 0.938341 |
| Puzzle | 0.21268 | 0.087251 | 0.098471 | 0.021564 | 0.420876 |
| Racing | 0.287766 | 0.190865 | 0.045388 | 0.061865 | 0.586101 |
| Role-Playing | 0.219946 | 0.126384 | 0.236767 | 0.04006 | 0.623233 |
| Shooter | 0.444733 | 0.239137 | 0.029221 | 0.078389 | 0.791885 |
| Simulation | 0.21143 | 0.130773 | 0.073472 | 0.036355 | 0.452364 |
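
The double brackets in the column selection matter. As a minimal sketch on made-up data (the `toy` dataframe is hypothetical), a list of column names keeps the result a dataframe, while a single column name returns a Series:

```python
import pandas as pd

# Hypothetical toy data that only mimics the dataset's column names
toy = pd.DataFrame({
    "Genre": ["Action", "Action", "Puzzle"],
    "NA_Sales": [1.0, 3.0, 2.0],
    "EU_Sales": [0.5, 1.5, 1.0],
})

# A list of column names (double brackets) keeps the result a DataFrame
as_frame = toy.groupby("Genre")[["NA_Sales", "EU_Sales"]].mean()

# A single column name (single brackets) returns a Series instead
as_series = toy.groupby("Genre")["NA_Sales"].mean()

print(type(as_frame).__name__)
print(type(as_series).__name__)
```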


Next, <i>group</i> by the <code>Year</code> column using <code>max()</code>, the maximum value, as the aggregation.

In [8]:
# Group selected columns by Year with the max aggregation

df.groupby('Year')[['NA_Sales', 'EU_Sales', 'JP_Sales', 'Other_Sales', 'Global_Sales']].max()

| Year | NA_Sales | EU_Sales | JP_Sales | Other_Sales | Global_Sales |
|---|---|---|---|---|---|
| 1980.0 | 4.0 | 0.26 | 0.0 | 0.05 | 4.31 |
| 1981.0 | 4.21 | 0.24 | 0.0 | 0.05 | 4.5 |
| 1982.0 | 7.28 | 0.45 | 0.0 | 0.08 | 7.81 |
| 1983.0 | 1.22 | 0.12 | 2.35 | 0.02 | 3.2 |
| 1984.0 | 26.93 | 0.63 | 2.46 | 0.47 | 28.31 |
| 1985.0 | 29.08 | 3.58 | 6.81 | 0.77 | 40.24 |
| 1986.0 | 3.74 | 0.93 | 2.65 | 1.51 | 6.51 |
| 1987.0 | 2.19 | 0.5 | 2.41 | 0.08 | 4.38 |
| 1988.0 | 9.54 | 3.44 | 3.84 | 0.46 | 17.28 |
| 1989.0 | 23.2 | 2.71 | 4.22 | 0.58 | 30.26 |


## Grouping by two columns

Data can also be <i>grouped</i> by two or more columns, as in the code below.

In [9]:
# Group by the Year and Genre columns

df.groupby(['Year', 'Genre'])[['NA_Sales', 'EU_Sales', 'JP_Sales']].mean()

| Year | Genre | NA_Sales | EU_Sales | JP_Sales |
|---|---|---|---|---|
| 1980.0 | Action | 0.320000 | 0.020000 | 0.000000 |
| 1980.0 | Fighting | 0.720000 | 0.040000 | 0.000000 |
| 1980.0 | Misc | 0.632500 | 0.037500 | 0.000000 |
| 1980.0 | Shooter | 3.280000 | 0.215000 | 0.000000 |
| 1980.0 | Sports | 0.460000 | 0.030000 | 0.000000 |
| ... | ... | ... | ... | ... |
| 2016.0 | Sports | 0.120263 | 0.193684 | 0.020526 |
| 2016.0 | Strategy | 0.011000 | 0.032000 | 0.005000 |
| 2017.0 | Action | 0.000000 | 0.000000 | 0.010000 |
| 2017.0 | Role-Playing | 0.000000 | 0.000000 | 0.020000 |
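
Grouping by two keys produces a two-level index of (`Year`, `Genre`) pairs. The sketch below, on a made-up `toy` dataframe, shows one way to read a single row from such a result and two ways to get the keys back as ordinary columns:

```python
import pandas as pd

# Hypothetical toy data that only mimics the dataset's column names
toy = pd.DataFrame({
    "Year": [2006, 2006, 2006, 2008],
    "Genre": ["Sports", "Sports", "Racing", "Racing"],
    "NA_Sales": [4.0, 2.0, 1.0, 3.0],
})

# Grouping by two keys yields a two-level MultiIndex (Year, Genre)
nested = toy.groupby(["Year", "Genre"])[["NA_Sales"]].mean()

# A single row is addressed with a (Year, Genre) tuple
sports_2006 = nested.loc[(2006, "Sports"), "NA_Sales"]

# reset_index() turns the index levels back into ordinary columns
flat = nested.reset_index()

# as_index=False gives the flat layout directly
flat_direct = toy.groupby(["Year", "Genre"], as_index=False)[["NA_Sales"]].mean()

print(sports_2006)
print(flat.columns.tolist())
```

The flat layout is usually easier to filter, plot, or export, while the nested layout is more compact to read.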


## Grouping with several aggregations

To group data with several aggregations at once, for example mean, median, and max, use the <code>agg()</code> method and pass it a list of the aggregation names.

### Grouping all columns with several aggregations

The following applies <i>grouping</i> to all columns by the <code>Genre</code> column with the __mean__ and __median__ aggregations.

In [10]:
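# Group with the mean and median aggregations
# Select the numeric columns first; pandas 2.0+ raises a TypeError for text columns such as Name

numeric_cols = df.select_dtypes('number').columns.tolist()
df.groupby('Genre')[numeric_cols].agg(['mean', 'median'])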

| Genre | Rank (mean) | Rank (median) | Year (mean) | Year (median) | NA_Sales (mean) | NA_Sales (median) | EU_Sales (mean) | EU_Sales (median) | JP_Sales (mean) | JP_Sales (median) | Other_Sales (mean) | Other_Sales (median) | Global_Sales (mean) | Global_Sales (median) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Action | 7973.879071 | 7759.0 | 2007.909929 | 2009.0 | 0.264726 | 0.1 | 0.158323 | 0.03 | 0.048236 | 0.0 | 0.056508 | 0.01 | 0.5281 | 0.19 |
| Adventure | 11532.787714 | 12766.0 | 2008.130878 | 2009.0 | 0.082271 | 0.0 | 0.049868 | 0.0 | 0.04049 | 0.01 | 0.013072 | 0.0 | 0.185879 | 0.06 |
| Fighting | 7646.511792 | 7483.0 | 2004.630383 | 2005.0 | 0.263667 | 0.08 | 0.119481 | 0.03 | 0.103007 | 0.01 | 0.043255 | 0.01 | 0.529375 | 0.21 |
| Misc | 8561.847039 | 8624.0 | 2007.25848 | 2008.0 | 0.235906 | 0.08 | 0.124198 | 0.01 | 0.061967 | 0.0 | 0.043312 | 0.01 | 0.465762 | 0.16 |
| Platform | 6927.251693 | 6233.0 | 2003.820776 | 2004.0 | 0.504571 | 0.14 | 0.227573 | 0.05 | 0.147596 | 0.0 | 0.058228 | 0.01 | 0.938341 | 0.28 |
| Puzzle | 9627.381443 | 10342.0 | 2005.243433 | 2007.0 | 0.21268 | 0.05 | 0.087251 | 0.01 | 0.098471 | 0.0 | 0.021564 | 0.01 | 0.420876 | 0.11 |
| Racing | 7961.515612 | 7811.0 | 2004.840131 | 2005.0 | 0.287766 | 0.1 | 0.190865 | 0.04 | 0.045388 | 0.0 | 0.061865 | 0.01 | 0.586101 | 0.19 |
| Role-Playing | 8086.174731 | 8002.5 | 2007.055744 | 2008.0 | 0.219946 | 0.04 | 0.126384 | 0.01 | 0.236767 | 0.05 | 0.04006 | 0.01 | 0.623233 | 0.185 |
| Shooter | 7369.367939 | 7069.5 | 2005.918877 | 2007.0 | 0.444733 | 0.12 | 0.239137 | 0.05 | 0.029221 | 0.0 | 0.078389 | 0.02 | 0.791885 | 0.23 |
| Simulation | 8626.085352 | 8681.0 | 2006.567568 | 2008.0 | 0.21143 | 0.07 | 0.130773 | 0.01 | 0.073472 | 0.0 | 0.036355 | 0.01 | 0.452364 | 0.16 |
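
Passing a list to `agg()` gives the result two header levels: the original column name and the aggregation name. As a sketch on made-up data (the `toy` dataframe and the joined names such as `NA_Sales_mean` are illustrative choices, not part of the original notebook), one statistic can be read with a tuple, and the headers can be flattened if a single level is easier to work with:

```python
import pandas as pd

# Hypothetical toy data that only mimics the dataset's column names
toy = pd.DataFrame({
    "Genre": ["Action", "Action", "Puzzle", "Puzzle", "Puzzle"],
    "NA_Sales": [1.0, 3.0, 2.0, 4.0, 9.0],
    "EU_Sales": [0.5, 1.5, 1.0, 1.0, 4.0],
})

summary = toy.groupby("Genre")[["NA_Sales", "EU_Sales"]].agg(["mean", "median"])

# Columns are (column, aggregation) pairs, so one statistic is a tuple lookup
na_median = summary[("NA_Sales", "median")]

# Optionally flatten the two header levels into single names
flat = summary.copy()
flat.columns = ["_".join(pair) for pair in summary.columns]

print(na_median.to_dict())
print(flat.columns.tolist())
```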


### Grouping selected columns with several aggregations

To show only a few columns with several aggregations, select the columns before calling <code>agg()</code>.

In [11]:
# Group selected columns with the max and min aggregations

df.groupby('Genre')[['NA_Sales', 'EU_Sales', 'JP_Sales', 'Global_Sales']].agg(['max', 'min'])

| Genre | NA_Sales (max) | NA_Sales (min) | EU_Sales (max) | EU_Sales (min) | JP_Sales (max) | JP_Sales (min) | Global_Sales (max) | Global_Sales (min) |
|---|---|---|---|---|---|---|---|---|
| Action | 9.63 | 0.0 | 9.27 | 0.0 | 3.96 | 0.0 | 21.4 | 0.01 |
| Adventure | 6.16 | 0.0 | 2.79 | 0.0 | 2.69 | 0.0 | 11.18 | 0.01 |
| Fighting | 6.75 | 0.0 | 2.61 | 0.0 | 2.87 | 0.0 | 13.04 | 0.01 |
| Misc | 14.97 | 0.0 | 9.26 | 0.0 | 4.16 | 0.0 | 29.02 | 0.01 |
| Platform | 29.08 | 0.0 | 9.23 | 0.0 | 6.81 | 0.0 | 40.24 | 0.01 |
| Puzzle | 23.2 | 0.0 | 5.36 | 0.0 | 5.32 | 0.0 | 30.26 | 0.01 |
| Racing | 15.85 | 0.0 | 12.88 | 0.0 | 4.13 | 0.0 | 35.82 | 0.01 |
| Role-Playing | 11.27 | 0.0 | 8.89 | 0.0 | 10.22 | 0.0 | 31.37 | 0.01 |
| Shooter | 26.93 | 0.0 | 5.88 | 0.0 | 1.44 | 0.0 | 28.31 | 0.01 |
| Simulation | 9.07 | 0.0 | 11.0 | 0.0 | 5.33 | 0.0 | 24.76 | 0.01 |


### Grouping with a different aggregation per column

A different aggregation can be specified for each column when <i>grouping</i> by passing a dictionary to <code>agg()</code>. The example below <i>groups</i> by <code>Genre</code> and aggregates <code>NA_Sales</code> with its maximum (__max__), <code>Global_Sales</code> with its average (__mean__), and <code>Year</code> with its minimum (__min__).

In [12]:
# Group with a different aggregation for each column

df.groupby('Genre').agg({'NA_Sales':'max', 'Global_Sales':'mean', 'Year':'min'})

| Genre | NA_Sales | Global_Sales | Year |
|---|---|---|---|
| Action | 9.63 | 0.5281 | 1980.0 |
| Adventure | 6.16 | 0.185879 | 1983.0 |
| Fighting | 6.75 | 0.529375 | 1980.0 |
| Misc | 14.97 | 0.465762 | 1980.0 |
| Platform | 29.08 | 0.938341 | 1981.0 |
| Puzzle | 23.2 | 0.420876 | 1981.0 |
| Racing | 15.85 | 0.586101 | 1981.0 |
| Role-Playing | 11.27 | 0.623233 | 1986.0 |
| Shooter | 26.93 | 0.791885 | 1980.0 |
| Simulation | 9.07 | 0.452364 | 1981.0 |
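
With the dictionary form, the output columns keep their original names, so the table does not say which aggregation each column holds. An alternative worth knowing is pandas named aggregation, sketched here on a made-up `toy` dataframe; the output names `max_na_sales`, `mean_global_sales`, and `first_year` are illustrative choices:

```python
import pandas as pd

# Hypothetical toy data that only mimics the dataset's column names
toy = pd.DataFrame({
    "Genre": ["Action", "Action", "Puzzle"],
    "Year": [1999.0, 2005.0, 2010.0],
    "NA_Sales": [1.0, 3.0, 2.0],
    "Global_Sales": [2.0, 6.0, 5.0],
})

# keyword = (source column, aggregation); the keyword becomes the output column name
summary = toy.groupby("Genre").agg(
    max_na_sales=("NA_Sales", "max"),
    mean_global_sales=("Global_Sales", "mean"),
    first_year=("Year", "min"),
)

print(summary)
```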


## Grouping with a custom function

Besides the aggregation functions that pandas already provides, you can write your own <i>function</i> with <code>def</code> to define how each group is aggregated.

Consider the following code.

In [13]:
# Grouping with a custom function

def selisih_max_min(x):
    # Difference between the largest and smallest value in the group
    return x.max() - x.min()

df.groupby('Genre').agg({'Global_Sales': selisih_max_min, 'NA_Sales': 'std'})

| Genre | Global_Sales | NA_Sales |
|---|---|---|
| Action | 21.39 | 0.56689 |
| Adventure | 11.17 | 0.274674 |
| Fighting | 13.03 | 0.516148 |
| Misc | 29.01 | 0.690878 |
| Platform | 40.23 | 1.502039 |
| Puzzle | 30.25 | 1.057669 |
| Racing | 35.81 | 0.742523 |
| Role-Playing | 31.36 | 0.672721 |
| Shooter | 28.3 | 1.201147 |
| Simulation | 24.75 | 0.466698 |
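
A custom aggregation function receives the values of one column for one group as a Series and should return a single value. The sketch below, on a made-up `toy` dataframe, repeats the max-minus-min rule and also writes it as a `lambda` with a named output column; `sales_range` is an illustrative name, not something from the original notebook:

```python
import pandas as pd

# Hypothetical toy data that only mimics the dataset's column names
toy = pd.DataFrame({
    "Genre": ["Action", "Action", "Puzzle", "Puzzle"],
    "Global_Sales": [0.5, 2.0, 1.0, 4.0],
})

def selisih_max_min(x):
    # x is the Series of Global_Sales values belonging to one genre
    return x.max() - x.min()

by_function = toy.groupby("Genre")["Global_Sales"].agg(selisih_max_min)

# The same rule as an anonymous function, with a readable output column name
by_lambda = toy.groupby("Genre").agg(
    sales_range=("Global_Sales", lambda x: x.max() - x.min())
)

print(by_function.to_dict())
print(by_lambda["sales_range"].to_dict())
```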




---


Hopefully this is useful, and don't forget to visit <a href="https://nurpurwanto.github.io/">**nurpurwanto**</a>. Thank you.

---


