# Session 19 GroupBy Object in Pandas

In [1]:
import numpy as np
import pandas as pd

In [2]:
movies = pd.read_csv("imdb-top-1000.csv")
movies.head(2)

Unnamed: 0,Series_Title,Released_Year,Runtime,Genre,IMDB_Rating,Director,Star1,No_of_Votes,Gross,Metascore
0,The Shawshank Redemption,1994,142,Drama,9.3,Frank Darabont,Tim Robbins,2343110,28341469.0,80.0
1,The Godfather,1972,175,Crime,9.2,Francis Ford Coppola,Marlon Brando,1620367,134966411.0,100.0


## **GroupBy Object in Pandas**

### **Introduction**

The `groupby()` function in pandas is one of the most powerful tools for **data aggregation and analysis**. It allows us to **split the data into groups** based on certain column values, apply operations independently to each group, and then combine the results.

---

### **Concept**

* The **GroupBy** operation involves **splitting**, **applying**, and **combining**:

  1. **Split** – Divide the dataset into groups based on one or more **categorical columns**.
  2. **Apply** – Perform operations (like `sum()`, `mean()`, `count()`, etc.) on each group.
  3. **Combine** – Merge the results back into a new structure (often a DataFrame).

---

### **How Grouping Works**

When we call:

```python
df.groupby('column_name')
```

* Pandas creates a **`DataFrameGroupBy`** object.
* This object **does not return a subset of the DataFrame** immediately.
* Instead, it acts as an **intermediate object** that stores information about how the data is split internally.
* We can then perform **aggregate**, **filter**, or **transform** operations on this object.
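
The three kinds of operations can be sketched on a tiny frame. The `toy` DataFrame and its `Team` and `Score` columns are made up for illustration and are not part of the movies dataset.

```python
import pandas as pd

# Hypothetical toy data, used only to illustrate the three operation types
toy = pd.DataFrame({
    'Team': ['red', 'red', 'blue', 'blue', 'green'],
    'Score': [50, 55, 40, 42, 60],
})

grouped = toy.groupby('Team')

# aggregate: reduces each group to a single value
mean_score = grouped['Score'].mean()

# filter: keeps only the rows of groups that satisfy a condition
multi_member = grouped.filter(lambda g: len(g) > 1)

# transform: returns one value per original row, aligned with toy's index
team_mean = grouped['Score'].transform('mean')

print(mean_score)
print(multi_member)
print(team_mean)
```

Aggregation shrinks the data to one row per group, filtering drops whole groups, and transformation keeps the original shape.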

---

### **Grouping Basis**

Grouping is generally done based on **categorical columns**, since:

* **Categorical columns** represent groups, categories, or labels (like "Gender", "Region", "Department").
* **Numerical columns** contain data on which we can perform aggregations (like sum, mean, max, etc.).

For example:

```python
import pandas as pd

data = {
    'Department': ['Sales', 'Sales', 'HR', 'HR', 'IT'],
    'Employee': ['A', 'B', 'C', 'D', 'E'],
    'Salary': [50000, 55000, 40000, 42000, 60000]
}

df = pd.DataFrame(data)

grouped = df.groupby('Department')
print(grouped)
```

**Output:**

```
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x...>
```

This output confirms that `grouped` is a **GroupBy object**, not a DataFrame.

---

### **Accessing Aggregated Results**

We can perform operations on the grouped data:

```python
# Get mean salary of each department
df.groupby('Department')['Salary'].mean()
```

**Output:**

```
Department
HR        41000.0
IT        60000.0
Sales     52500.0
Name: Salary, dtype: float64
```

---

### **Summary**

| Aspect                | Description                                                                         |
| --------------------- | ----------------------------------------------------------------------------------- |
| **Purpose**           | To group data based on categorical values and perform operations on numerical data. |
| **Returned Object**   | `DataFrameGroupBy` (not a DataFrame itself)                                         |
| **Common Operations** | `sum()`, `mean()`, `count()`, `max()`, `min()`, `aggregate()`                       |
| **Grouping Basis**    | Usually on categorical columns.                                                     |

## Some Tasks using GroupBy

### **Task 1**: Find the top 3 genres by total earning, i.e. `Gross`

In [None]:
# strategy 1
movies.groupby('Genre').sum().sort_values(by='Gross', ascending=False).head(3)['Gross']

Genre
Drama     3.540997e+10
Action    3.263226e+10
Comedy    1.566387e+10
Name: Gross, dtype: float64

In [32]:
# strategy 2 (faster)
movies.groupby('Genre')['Gross'].sum().sort_values(ascending=False).head(3)

Genre
Drama     3.540997e+10
Action    3.263226e+10
Comedy    1.566387e+10
Name: Gross, dtype: float64

The second strategy is faster because it selects the required column first, so the aggregation runs on that one column instead of on every column of the DataFrame.
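
A rough way to check this claim is to time both strategies. The sketch below uses a synthetic frame (random values and made-up sizes, not the real dataset), and the timings will vary with the machine and the pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic stand-in for the movies table (values are random, not real data)
rng = np.random.default_rng(0)
n = 10_000
toy = pd.DataFrame({
    'Genre': rng.choice(['Drama', 'Action', 'Comedy'], size=n),
    'Gross': rng.random(n),
    'Runtime': rng.integers(60, 200, size=n),
    'No_of_Votes': rng.integers(1_000, 1_000_000, size=n),
})

def all_columns():
    # strategy 1: aggregate every column, then pick one
    return toy.groupby('Genre').sum()['Gross']

def one_column():
    # strategy 2: pick the column, then aggregate only that
    return toy.groupby('Genre')['Gross'].sum()

print('all columns:', timeit.timeit(all_columns, number=20))
print('one column :', timeit.timeit(one_column, number=20))
```

Both functions return the same totals; only the amount of work differs.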

### **Task 2**: Find the genre with highest avg IMDB Rating
As discussed above, we follow the efficient strategy and select the **IMDB_Rating** column before aggregating.

In [34]:
movies.groupby('Genre')['IMDB_Rating'].mean().sort_values(ascending=False).head(1)

Genre
Western    8.35
Name: IMDB_Rating, dtype: float64

### **Task 3**: Find the most popular director.
Popularity is measured by the total **No_of_Votes** across a director's movies.

In [42]:
movies.groupby('Director')['No_of_Votes'].sum().sort_values(ascending=False)

Director
Christopher Nolan    11578345
Quentin Tarantino     8123208
Steven Spielberg      7817166
David Fincher         6607859
Martin Scorsese       6513530
                       ...   
J. Lee Thompson         26457
Peter Mullan            25938
René Laloux             25229
Francis Lee             25198
Kaige Chen              25088
Name: No_of_Votes, Length: 548, dtype: int64

### **Task 4**: Find the highest rated movies of each genre.

In [58]:
movies.groupby('Genre')[['Genre', 'Series_Title', 'IMDB_Rating']].head(1)

Unnamed: 0,Genre,Series_Title,IMDB_Rating
0,Drama,The Shawshank Redemption,9.3
1,Crime,The Godfather,9.2
2,Action,The Dark Knight,9.0
7,Biography,Schindler's List,8.9
12,Western,"Il buono, il brutto, il cattivo",8.8
19,Comedy,Gisaengchung,8.6
21,Adventure,Interstellar,8.6
23,Animation,Sen to Chihiro no kamikakushi,8.6
49,Horror,Psycho,8.5
69,Mystery,Memento,8.4


In [75]:
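# alternative: first() returns one row per genre, indexed by Genre
# (like head(1), it gives the top-rated movie only because movies is already sorted by IMDB_Rating, descending)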
movies.groupby('Genre')[['Series_Title', 'IMDB_Rating']].first()

Unnamed: 0_level_0,Series_Title,IMDB_Rating
Genre,Unnamed: 1_level_1,Unnamed: 2_level_1
Action,The Dark Knight,9.0
Adventure,Interstellar,8.6
Animation,Sen to Chihiro no kamikakushi,8.6
Biography,Schindler's List,8.9
Comedy,Gisaengchung,8.6
Crime,The Godfather,9.2
Drama,The Shawshank Redemption,9.3
Family,E.T. the Extra-Terrestrial,7.8
Fantasy,Das Cabinet des Dr. Caligari,8.1
Film-Noir,The Third Man,8.1
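
Both `head(1)` and `first()` depend on the row order of the data. A sketch that does not rely on sorting uses `idxmax()`. The `toy` frame below is hypothetical and deliberately unsorted.

```python
import pandas as pd

# Hypothetical rows, deliberately NOT sorted by rating
toy = pd.DataFrame({
    'Series_Title': ['A', 'B', 'C', 'D'],
    'Genre': ['Drama', 'Drama', 'Crime', 'Crime'],
    'IMDB_Rating': [8.1, 9.3, 9.2, 8.0],
})

# idxmax gives, for each genre, the index label of its highest-rated row
best_idx = toy.groupby('Genre')['IMDB_Rating'].idxmax()

# use those labels to pull the full rows out of the original frame
best = toy.loc[best_idx]
print(best)
```

Note that `idxmax()` returns only the first row when several movies tie for the top rating.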


### **Task 5**: Find the number of movies done by each actor.

In [None]:
# simple strategy without groupby
movies['Star1'].value_counts()

Star1
Tom Hanks              12
Robert De Niro         11
Al Pacino              10
Clint Eastwood         10
Humphrey Bogart         9
                       ..
Richard Farnsworth      1
Junko Iwao              1
Fernanda Montenegro     1
Eli Marienthal          1
Preity Zinta            1
Name: count, Length: 660, dtype: int64

In [65]:
movies.groupby('Star1')['Series_Title'].count().sort_values(ascending=False)

Star1
Tom Hanks             12
Robert De Niro        11
Al Pacino             10
Clint Eastwood        10
Humphrey Bogart        9
                      ..
Günes Sensoy           1
Haluk Bilginer         1
Harriet Andersson      1
Harry Dean Stanton     1
Griffin Dunne          1
Name: Series_Title, Length: 660, dtype: int64

#### Storing the GroupBy object in a variable to simplify the code ahead.

In [4]:
genres = movies.groupby('Genre')

In [6]:
# applying builtin aggregation functions on groupby objects
genres.sum()

Unnamed: 0_level_0,Series_Title,Released_Year,Runtime,IMDB_Rating,Director,Star1,No_of_Votes,Gross,Metascore
Genre,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
Action,The Dark KnightThe Lord of the Rings: The Retu...,2008200320102001200219991980197719621954200019...,22196,1367.3,Christopher NolanPeter JacksonChristopher Nola...,Christian BaleElijah WoodLeonardo DiCaprioElij...,72282412,32632260000.0,10499.0
Adventure,InterstellarBack to the FutureInglourious Bast...,2014198520091981196819621959201319751963194819...,9656,571.5,Christopher NolanRobert ZemeckisQuentin Tarant...,Matthew McConaugheyMichael J. FoxBrad PittJürg...,22576163,9496922000.0,5020.0
Animation,Sen to Chihiro no kamikakushiThe Lion KingHota...,2001199419882016201820172008199719952019200920...,8166,650.3,Hayao MiyazakiRoger AllersIsao TakahataMakoto ...,Daveigh ChaseRob MinkoffTsutomu TatsumiRyûnosu...,21978630,14631470000.0,6082.0
Biography,Schindler's ListGoodfellasHamiltonThe Intoucha...,1993199020202011200220171995198420182013201320...,11970,698.6,Steven SpielbergMartin ScorseseThomas KailOliv...,Liam NeesonRobert De NiroLin-Manuel MirandaÉri...,24006844,8276358000.0,6023.0
Comedy,GisaengchungLa vita è bellaModern TimesCity Li...,2019199719361931200919641940200120001973196019...,17380,1224.7,Bong Joon HoRoberto BenigniCharles ChaplinChar...,Kang-ho SongRoberto BenigniCharles ChaplinChar...,27620327,15663870000.0,9840.0
Crime,The GodfatherThe Godfather: Part II12 Angry Me...,1972197419571994200219991995199120192006199519...,13524,857.8,Francis Ford CoppolaFrancis Ford CoppolaSidney...,Marlon BrandoAl PacinoHenry FondaJohn Travolta...,33533615,8452632000.0,6706.0
Drama,The Shawshank RedemptionFight ClubForrest Gump...,1994199919941975202019981946201420061998198819...,36049,2299.7,Frank DarabontDavid FincherRobert ZemeckisMilo...,Tim RobbinsBrad PittTom HanksJack NicholsonSur...,61367304,35409970000.0,19208.0
Family,E.T. the Extra-TerrestrialWilly Wonka & the Ch...,19821971,215,15.6,Steven SpielbergMel Stuart,Henry ThomasGene Wilder,551221,439110600.0,158.0
Fantasy,Das Cabinet des Dr. CaligariNosferatu,19201922,170,16.0,Robert WieneF.W. Murnau,Werner KraussMax Schreck,146222,782726700.0,0.0
Film-Noir,The Third ManThe Maltese FalconShadow of a Doubt,194919411943,312,23.9,Carol ReedJohn HustonAlfred Hitchcock,Orson WellesHumphrey BogartTeresa Wright,367215,125910500.0,287.0


In [7]:
genres.min()

Unnamed: 0_level_0,Series_Title,Released_Year,Runtime,IMDB_Rating,Director,Star1,No_of_Votes,Gross,Metascore
Genre,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
Action,300,1924,45,7.6,Abhishek Chaubey,Aamir Khan,25312,3296.0,33.0
Adventure,2001: A Space Odyssey,1925,88,7.6,Akira Kurosawa,Aamir Khan,29999,61001.0,41.0
Animation,Akira,1940,71,7.6,Adam Elliot,Adrian Molina,25229,128985.0,61.0
Biography,12 Years a Slave,1928,93,7.6,Adam McKay,Adrien Brody,27254,21877.0,48.0
Comedy,(500) Days of Summer,1921,68,7.6,Alejandro G. Iñárritu,Aamir Khan,26337,1305.0,45.0
Crime,12 Angry Men,1931,80,7.6,Akira Kurosawa,Ajay Devgn,27712,6013.0,47.0
Drama,1917,1925,64,7.6,Aamir Khan,Abhay Deol,25088,3600.0,28.0
Family,E.T. the Extra-Terrestrial,1971,100,7.8,Mel Stuart,Gene Wilder,178731,4000000.0,67.0
Fantasy,Das Cabinet des Dr. Caligari,1920,76,7.9,F.W. Murnau,Max Schreck,57428,337574718.0,
Film-Noir,Shadow of a Doubt,1941,100,7.8,Alfred Hitchcock,Humphrey Bogart,59556,449191.0,94.0


### What does the above aggregation method do?
`genres.sum()` adds up every column within each genre. *e.g.*, the sum of **IMDB_Rating** over all the movies of the Action genre is **1367.3**. For text columns such as **Series_Title**, "summing" concatenates the strings, which is why those cells look garbled.

`genres.min()` works the same way but keeps the smallest value of each column, so **Released_Year** shows the earliest release in each genre. Every column is reduced independently (text columns alphabetically), so a row of the result does not describe a single movie.

### **Task 6**: Count the number of groups formed by using `groupby()` on **Genre** column.

In [66]:
len(movies.groupby('Genre'))

14

In [68]:
# or this way
movies['Genre'].nunique()

14

### **Task 7**: Find the number of rows within each group.

In [69]:
movies.groupby('Genre').size()

Genre
Action       172
Adventure     72
Animation     82
Biography     88
Comedy       155
Crime        107
Drama        289
Family         2
Fantasy        2
Film-Noir      3
Horror        11
Mystery       12
Thriller       1
Western        4
dtype: int64

In [70]:
# or this way
movies.Genre.value_counts()

Genre
Drama        289
Action       172
Comedy       155
Crime        107
Biography     88
Animation     82
Adventure     72
Mystery       12
Horror        11
Western        4
Film-Noir      3
Fantasy        2
Family         2
Thriller       1
Name: count, dtype: int64

In [None]:
# storing the groupby object in var for further code ahead
genres = movies.groupby('Genre')

### **Task 8**: Find the first, last, and nth item of the Group.

In [79]:
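genres.first()   # first row of each group
genres.last()    # last row of each group
genres.nth(6)    # 7th row of each group (positions start at 0); groups with fewer rows are skipped
# only the result of the last expression is displayed below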

Unnamed: 0,Series_Title,Released_Year,Runtime,Genre,IMDB_Rating,Director,Star1,No_of_Votes,Gross,Metascore
16,Star Wars: Episode V - The Empire Strikes Back,1980,124,Action,8.7,Irvin Kershner,Mark Hamill,1159315,290475067.0,82.0
27,Se7en,1995,127,Crime,8.6,David Fincher,Morgan Freeman,1445096,100125643.0,65.0
32,It's a Wonderful Life,1946,130,Drama,8.6,Frank Capra,James Stewart,405801,82385199.0,89.0
66,WALL·E,2008,98,Animation,8.4,Andrew Stanton,Ben Burtt,999790,223808164.0,95.0
83,The Great Dictator,1940,125,Comedy,8.4,Charles Chaplin,Charles Chaplin,203150,288475.0,
102,Braveheart,1995,178,Biography,8.3,Mel Gibson,Mel Gibson,959181,75600000.0,68.0
118,North by Northwest,1959,136,Adventure,8.3,Alfred Hitchcock,Cary Grant,299198,13275000.0,98.0
420,Sleuth,1972,138,Mystery,8.0,Joseph L. Mankiewicz,Laurence Olivier,44748,4081254.0,
724,Get Out,2017,104,Horror,7.7,Jordan Peele,Daniel Kaluuya,492851,176040665.0,85.0


## `get_group()`
Returns a DataFrame containing all the rows that belong to the given group.

In [80]:
genres.get_group('Horror')

Unnamed: 0,Series_Title,Released_Year,Runtime,Genre,IMDB_Rating,Director,Star1,No_of_Votes,Gross,Metascore
49,Psycho,1960,109,Horror,8.5,Alfred Hitchcock,Anthony Perkins,604211,32000000.0,97.0
75,Alien,1979,117,Horror,8.4,Ridley Scott,Sigourney Weaver,787806,78900000.0,89.0
271,The Thing,1982,109,Horror,8.1,John Carpenter,Kurt Russell,371271,13782838.0,57.0
419,The Exorcist,1973,122,Horror,8.0,William Friedkin,Ellen Burstyn,362393,232906145.0,81.0
544,Night of the Living Dead,1968,96,Horror,7.9,George A. Romero,Duane Jones,116557,89029.0,89.0
707,The Innocents,1961,100,Horror,7.8,Jack Clayton,Deborah Kerr,27007,2616000.0,88.0
724,Get Out,2017,104,Horror,7.7,Jordan Peele,Daniel Kaluuya,492851,176040665.0,85.0
844,Halloween,1978,91,Horror,7.7,John Carpenter,Donald Pleasence,233106,47000000.0,87.0
876,The Invisible Man,1933,71,Horror,7.7,James Whale,Claude Rains,30683,298791505.0,87.0
932,Saw,2004,103,Horror,7.6,James Wan,Cary Elwes,379020,56000369.0,46.0


The same rows can also be obtained with ordinary boolean filtering:

In [81]:
movies[movies['Genre'] == 'Horror']

Unnamed: 0,Series_Title,Released_Year,Runtime,Genre,IMDB_Rating,Director,Star1,No_of_Votes,Gross,Metascore
49,Psycho,1960,109,Horror,8.5,Alfred Hitchcock,Anthony Perkins,604211,32000000.0,97.0
75,Alien,1979,117,Horror,8.4,Ridley Scott,Sigourney Weaver,787806,78900000.0,89.0
271,The Thing,1982,109,Horror,8.1,John Carpenter,Kurt Russell,371271,13782838.0,57.0
419,The Exorcist,1973,122,Horror,8.0,William Friedkin,Ellen Burstyn,362393,232906145.0,81.0
544,Night of the Living Dead,1968,96,Horror,7.9,George A. Romero,Duane Jones,116557,89029.0,89.0
707,The Innocents,1961,100,Horror,7.8,Jack Clayton,Deborah Kerr,27007,2616000.0,88.0
724,Get Out,2017,104,Horror,7.7,Jordan Peele,Daniel Kaluuya,492851,176040665.0,85.0
844,Halloween,1978,91,Horror,7.7,John Carpenter,Donald Pleasence,233106,47000000.0,87.0
876,The Invisible Man,1933,71,Horror,7.7,James Whale,Claude Rains,30683,298791505.0,87.0
932,Saw,2004,103,Horror,7.6,James Wan,Cary Elwes,379020,56000369.0,46.0


## `groups` attribute of a **GroupBy Object**
The `groups` attribute returns a dictionary-like object with the group labels as **keys** and the index labels of the rows in each group as the corresponding **values**.
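
A minimal sketch of what the attribute holds, using a made-up four-row frame rather than the movies dataset:

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    'Genre': ['Drama', 'Crime', 'Drama', 'Horror'],
    'Series_Title': ['A', 'B', 'C', 'D'],
})

groups = toy.groupby('Genre').groups

# each key is a group label; each value is an Index of row labels
for genre, index in groups.items():
    print(genre, list(index))
```

The row labels can be passed to `toy.loc[...]` to recover the rows of a group.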

## `describe`, `sample`, `nunique`
We have already seen these functions on Series and DataFrames. They also work on GroupBy objects, where they are computed separately for each group.

In [84]:
genres.describe()

Unnamed: 0_level_0,Runtime,Runtime,Runtime,Runtime,Runtime,Runtime,Runtime,Runtime,IMDB_Rating,IMDB_Rating,...,Gross,Gross,Metascore,Metascore,Metascore,Metascore,Metascore,Metascore,Metascore,Metascore
Unnamed: 0_level_1,count,mean,std,min,25%,50%,75%,max,count,mean,...,75%,max,count,mean,std,min,25%,50%,75%,max
Genre,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2,Unnamed: 18_level_2,Unnamed: 19_level_2,Unnamed: 20_level_2,Unnamed: 21_level_2
Action,172.0,129.046512,28.500706,45.0,110.75,127.5,143.25,321.0,172.0,7.949419,...,267443700.0,936662225.0,143.0,73.41958,12.421252,33.0,65.0,74.0,82.0,98.0
Adventure,72.0,134.111111,33.31732,88.0,109.0,127.0,149.0,228.0,72.0,7.9375,...,199807000.0,874211619.0,64.0,78.4375,12.345393,41.0,69.75,80.5,87.25,100.0
Animation,82.0,99.585366,14.530471,71.0,90.0,99.5,106.75,137.0,82.0,7.930488,...,252061200.0,873839108.0,75.0,81.093333,8.813646,61.0,75.0,82.0,87.5,96.0
Biography,88.0,136.022727,25.514466,93.0,120.0,129.0,146.25,209.0,88.0,7.938636,...,98299240.0,753585104.0,79.0,76.240506,11.028187,48.0,70.5,76.0,84.5,97.0
Comedy,155.0,112.129032,22.946213,68.0,96.0,106.0,124.5,188.0,155.0,7.90129,...,81078090.0,886752933.0,125.0,78.72,11.82916,45.0,72.0,79.0,88.0,99.0
Crime,107.0,126.392523,27.689231,80.0,106.5,122.0,141.5,229.0,107.0,8.016822,...,71021630.0,790482117.0,87.0,77.08046,13.099102,47.0,69.5,77.0,87.0,100.0
Drama,289.0,124.737024,27.74049,64.0,105.0,121.0,137.0,242.0,289.0,7.957439,...,116446100.0,924558264.0,241.0,79.701245,12.744687,28.0,72.0,82.0,89.0,100.0
Family,2.0,107.5,10.606602,100.0,103.75,107.5,111.25,115.0,2.0,7.8,...,327332900.0,435110554.0,2.0,79.0,16.970563,67.0,73.0,79.0,85.0,91.0
Fantasy,2.0,85.0,12.727922,76.0,80.5,85.0,89.5,94.0,2.0,8.0,...,418257700.0,445151978.0,0.0,,,,,,,
Film-Noir,3.0,104.0,4.0,100.0,102.0,104.0,106.0,108.0,3.0,7.966667,...,62730680.0,123353292.0,3.0,95.666667,1.527525,94.0,95.0,96.0,96.5,97.0


In [None]:
# one random row from each group
genres.sample()

Unnamed: 0,Series_Title,Released_Year,Runtime,Genre,IMDB_Rating,Director,Star1,No_of_Votes,Gross,Metascore
152,V for Vendetta,2005,132,Action,8.2,James McTeigue,Hugo Weaving,1032749,70511035.0,62.0
675,Back to the Future Part II,1989,108,Adventure,7.8,Robert Zemeckis,Michael J. Fox,481918,118500000.0,57.0
395,The Nightmare Before Christmas,1993,76,Animation,8.0,Henry Selick,Danny Elfman,300208,75082668.0,82.0
650,October Sky,1999,108,Biography,7.8,Joe Johnston,Jake Gyllenhaal,82855,32481825.0,71.0
780,Lost in Translation,2003,102,Comedy,7.7,Sofia Coppola,Bill Murray,415074,44585453.0,89.0
556,Strangers on a Train,1951,101,Crime,7.9,Alfred Hitchcock,Farley Granger,123341,7630000.0,88.0
297,Jungfrukällan,1960,89,Drama,8.1,Ingmar Bergman,Max von Sydow,26697,1526000.0,
688,E.T. the Extra-Terrestrial,1982,115,Family,7.8,Steven Spielberg,Henry Thomas,372490,435110554.0,91.0
321,Das Cabinet des Dr. Caligari,1920,76,Fantasy,8.1,Robert Wiene,Werner Krauss,57428,337574718.0,
712,Shadow of a Doubt,1943,108,Film-Noir,7.8,Alfred Hitchcock,Teresa Wright,59556,123353292.0,94.0


In [86]:
genres.nunique()

Unnamed: 0_level_0,Series_Title,Released_Year,Runtime,IMDB_Rating,Director,Star1,No_of_Votes,Gross,Metascore
Genre,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
Action,172,61,78,15,123,121,172,172,50
Adventure,72,49,58,10,59,59,72,72,33
Animation,82,35,41,11,51,77,82,82,29
Biography,88,44,56,13,76,72,88,88,40
Comedy,155,72,70,11,113,133,155,155,44
Crime,106,56,65,14,86,85,107,107,39
Drama,289,83,95,14,211,250,288,287,52
Family,2,2,2,1,2,2,2,2,2
Fantasy,2,2,2,2,2,2,2,2,0
Film-Noir,3,3,3,3,3,3,3,3,3


In [92]:
genres['Released_Year'].max()

Genre
Action       2019
Adventure      PG
Animation    2020
Biography    2020
Comedy       2020
Crime        2019
Drama        2020
Family       1982
Fantasy      1922
Film-Noir    1949
Horror       2017
Mystery      2012
Thriller     1967
Western      1976
Name: Released_Year, dtype: object
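
The `PG` reported for Adventure is a sign that **Released_Year** is stored as text (`dtype: object`) and contains at least one non-year entry, so `max()` compares strings rather than numbers. One possible fix, sketched here on a made-up frame, is to convert the column with `pd.to_numeric` before grouping. The `year_num` column name is an assumption for this example.

```python
import pandas as pd

# Hypothetical rows reproducing the problem: one year entry is not a year
toy = pd.DataFrame({
    'Genre': ['Adventure', 'Adventure', 'Drama'],
    'Released_Year': ['1995', 'PG', '2020'],
})

# strings compare character by character, so 'PG' sorts after any digit string
str_max = toy.groupby('Genre')['Released_Year'].max()
print(str_max)

# errors='coerce' turns unparseable entries such as 'PG' into NaN
toy['year_num'] = pd.to_numeric(toy['Released_Year'], errors='coerce')

# numeric max skips NaN by default
year_max = toy.groupby('Genre')['year_num'].max()
print(year_max)
```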

## `agg()`
This method is used when we want to apply different aggregation functions to different columns of the groups. We pass a dictionary whose **keys** are the column names and whose **values** are the functions to apply to those columns.

In [95]:
genres.agg(
    {
        'Runtime': 'mean',
        'IMDB_Rating': 'mean',
        'No_of_Votes': 'sum',
        'Gross': 'sum',
        'Metascore': 'min'
    }
)

Unnamed: 0_level_0,Runtime,IMDB_Rating,No_of_Votes,Gross,Metascore
Genre,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Action,129.046512,7.949419,72282412,32632260000.0,33.0
Adventure,134.111111,7.9375,22576163,9496922000.0,41.0
Animation,99.585366,7.930488,21978630,14631470000.0,61.0
Biography,136.022727,7.938636,24006844,8276358000.0,48.0
Comedy,112.129032,7.90129,27620327,15663870000.0,45.0
Crime,126.392523,8.016822,33533615,8452632000.0,47.0
Drama,124.737024,7.957439,61367304,35409970000.0,28.0
Family,107.5,7.8,551221,439110600.0,67.0
Fantasy,85.0,8.0,146222,782726700.0,
Film-Noir,104.0,7.966667,367215,125910500.0,94.0


Alternatively, passing a list of function names to the same method applies every function to every column, producing a DataFrame with two-level column labels:

In [97]:
genres.agg(['min', 'max'])

Unnamed: 0_level_0,Series_Title,Series_Title,Released_Year,Released_Year,Runtime,Runtime,IMDB_Rating,IMDB_Rating,Director,Director,Star1,Star1,No_of_Votes,No_of_Votes,Gross,Gross,Metascore,Metascore
Unnamed: 0_level_1,min,max,min,max,min,max,min,max,min,max,min,max,min,max,min,max,min,max
Genre,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2,Unnamed: 18_level_2
Action,300,Yôjinbô,1924,2019,45,321,7.6,9.0,Abhishek Chaubey,Zack Snyder,Aamir Khan,Yun-Fat Chow,25312,2303232,3296.0,936662225.0,33.0,98.0
Adventure,2001: A Space Odyssey,Zombieland,1925,PG,88,228,7.6,8.6,Akira Kurosawa,Ömer Faruk Sorak,Aamir Khan,Yves Montand,29999,1512360,61001.0,874211619.0,41.0,100.0
Animation,Akira,Ôkami kodomo no Ame to Yuki,1940,2020,71,137,7.6,8.6,Adam Elliot,Yoshifumi Kondô,Adrian Molina,Yôji Matsuda,25229,999790,128985.0,873839108.0,61.0,96.0
Biography,12 Years a Slave,Zerkalo,1928,2020,93,209,7.6,8.9,Adam McKay,Tom McCarthy,Adrien Brody,Éric Toledano,27254,1213505,21877.0,753585104.0,48.0,97.0
Comedy,(500) Days of Summer,Zindagi Na Milegi Dobara,1921,2020,68,188,7.6,8.6,Alejandro G. Iñárritu,Zoya Akhtar,Aamir Khan,Ömer Faruk Sorak,26337,939631,1305.0,886752933.0,45.0,99.0
Crime,12 Angry Men,À bout de souffle,1931,2019,80,229,7.6,9.2,Akira Kurosawa,Yavuz Turgul,Ajay Devgn,Vincent Cassel,27712,1826188,6013.0,790482117.0,47.0,100.0
Drama,1917,Zwartboek,1925,2020,64,242,7.6,9.3,Aamir Khan,Çagan Irmak,Abhay Deol,Çetin Tekindor,25088,2343110,3600.0,924558264.0,28.0,100.0
Family,E.T. the Extra-Terrestrial,Willy Wonka & the Chocolate Factory,1971,1982,100,115,7.8,7.8,Mel Stuart,Steven Spielberg,Gene Wilder,Henry Thomas,178731,372490,4000000.0,435110554.0,67.0,91.0
Fantasy,Das Cabinet des Dr. Caligari,Nosferatu,1920,1922,76,94,7.9,8.1,F.W. Murnau,Robert Wiene,Max Schreck,Werner Krauss,57428,88794,337574718.0,445151978.0,,
Film-Noir,Shadow of a Doubt,The Third Man,1941,1949,100,108,7.8,8.1,Alfred Hitchcock,John Huston,Humphrey Bogart,Teresa Wright,59556,158731,449191.0,123353292.0,94.0,97.0


Both syntaxes can also be combined: a dictionary value may itself be a list of functions.

In [98]:
genres.agg(
    {
        'Runtime': ['min', 'mean'],
        'IMDB_Rating': 'mean',
        'No_of_Votes': 'sum',
        'Gross': 'sum',
        'Metascore': 'min'
    }
)

Unnamed: 0_level_0,Runtime,Runtime,IMDB_Rating,No_of_Votes,Gross,Metascore
Unnamed: 0_level_1,min,mean,mean,sum,sum,min
Genre,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2
Action,45,129.046512,7.949419,72282412,32632260000.0,33.0
Adventure,88,134.111111,7.9375,22576163,9496922000.0,41.0
Animation,71,99.585366,7.930488,21978630,14631470000.0,61.0
Biography,93,136.022727,7.938636,24006844,8276358000.0,48.0
Comedy,68,112.129032,7.90129,27620327,15663870000.0,45.0
Crime,80,126.392523,8.016822,33533615,8452632000.0,47.0
Drama,64,124.737024,7.957439,61367304,35409970000.0,28.0
Family,100,107.5,7.8,551221,439110600.0,67.0
Fantasy,76,85.0,8.0,146222,782726700.0,
Film-Noir,100,104.0,7.966667,367215,125910500.0,94.0
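
When the two-level column labels are inconvenient, pandas also offers named aggregation, where each output column is given as `new_name=(column, function)`. The sketch below uses a made-up frame, and the output names (`min_runtime` etc.) are our own choice.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    'Genre': ['Drama', 'Drama', 'Crime'],
    'Runtime': [142, 120, 175],
    'Gross': [10.0, 20.0, 30.0],
})

# each keyword becomes a flat column name in the result
summary = toy.groupby('Genre').agg(
    min_runtime=('Runtime', 'min'),
    mean_runtime=('Runtime', 'mean'),
    total_gross=('Gross', 'sum'),
)
print(summary)
```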


## Looping on Groups

In [107]:
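# collect the top-rated row(s) of every genre, then concatenate once
top_rated = []

for group, data in genres:
    # ties are kept: every row matching the genre's max rating is selected
    top_rated.append(data[data['IMDB_Rating'] == data['IMDB_Rating'].max()])

df = pd.concat(top_rated)
df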


Unnamed: 0,Series_Title,Released_Year,Runtime,Genre,IMDB_Rating,Director,Star1,No_of_Votes,Gross,Metascore
2,The Dark Knight,2008,152,Action,9.0,Christopher Nolan,Christian Bale,2303232,534858444.0,84.0
21,Interstellar,2014,169,Adventure,8.6,Christopher Nolan,Matthew McConaughey,1512360,188020017.0,74.0
23,Sen to Chihiro no kamikakushi,2001,125,Animation,8.6,Hayao Miyazaki,Daveigh Chase,651376,10055859.0,96.0
7,Schindler's List,1993,195,Biography,8.9,Steven Spielberg,Liam Neeson,1213505,96898818.0,94.0
19,Gisaengchung,2019,132,Comedy,8.6,Bong Joon Ho,Kang-ho Song,552778,53367844.0,96.0
26,La vita è bella,1997,116,Comedy,8.6,Roberto Benigni,Roberto Benigni,623629,57598247.0,59.0
1,The Godfather,1972,175,Crime,9.2,Francis Ford Coppola,Marlon Brando,1620367,134966411.0,100.0
0,The Shawshank Redemption,1994,142,Drama,9.3,Frank Darabont,Tim Robbins,2343110,28341469.0,80.0
688,E.T. the Extra-Terrestrial,1982,115,Family,7.8,Steven Spielberg,Henry Thomas,372490,435110554.0,91.0
698,Willy Wonka & the Chocolate Factory,1971,100,Family,7.8,Mel Stuart,Gene Wilder,178731,4000000.0,67.0


## split -> apply -> combine
This is the mental model behind every GroupBy operation: split the data into groups, apply a function to each group, and combine the results. The diagram below illustrates it.

          +--------------------+
          |     Original DF    |
          +--------------------+
                     │
                     ▼
             [Split by column]
                     │
         ┌───────────┼────────────┐
         ▼           ▼            ▼
      Group 1     Group 2      Group 3
         │           │            │
         ▼           ▼            ▼
     apply(sum)   apply(sum)   apply(sum)
         │           │            │
         └───────────┼────────────┘
                     ▼
             [Combine Results]
                     │
                     ▼
          +----------------------+
          |  Aggregated DataFrame |
          +----------------------+
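
The three stages in the diagram can be written out by hand to see what `groupby()` does for us. This is only an illustration on a made-up frame, not how pandas implements grouping internally.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    'Genre': ['Drama', 'Crime', 'Drama', 'Crime'],
    'Gross': [10.0, 30.0, 20.0, 5.0],
})

# split: one sub-DataFrame per distinct genre
pieces = {genre: toy[toy['Genre'] == genre] for genre in toy['Genre'].unique()}

# apply: reduce each piece to a single number
totals = {genre: piece['Gross'].sum() for genre, piece in pieces.items()}

# combine: assemble the per-group results into one Series
manual = pd.Series(totals, name='Gross').sort_index()

# groupby performs the same three steps in one call
auto = toy.groupby('Genre')['Gross'].sum()

print(manual)
print(auto)
```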


In [142]:
genres.apply(min)


Unnamed: 0_level_0,Series_Title,Released_Year,Runtime,Genre,IMDB_Rating,Director,Star1,No_of_Votes,Gross,Metascore
Genre,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
Action,300,1924,45,Action,7.6,Abhishek Chaubey,Aamir Khan,25312,3296.0,
Adventure,2001: A Space Odyssey,1925,88,Adventure,7.6,Akira Kurosawa,Aamir Khan,29999,61001.0,
Animation,Akira,1940,71,Animation,7.6,Adam Elliot,Adrian Molina,25229,128985.0,
Biography,12 Years a Slave,1928,93,Biography,7.6,Adam McKay,Adrien Brody,27254,21877.0,
Comedy,(500) Days of Summer,1921,68,Comedy,7.6,Alejandro G. Iñárritu,Aamir Khan,26337,1305.0,
Crime,12 Angry Men,1931,80,Crime,7.6,Akira Kurosawa,Ajay Devgn,27712,6013.0,
Drama,1917,1925,64,Drama,7.6,Aamir Khan,Abhay Deol,25088,3600.0,
Family,E.T. the Extra-Terrestrial,1971,100,Family,7.8,Mel Stuart,Gene Wilder,178731,4000000.0,67.0
Fantasy,Das Cabinet des Dr. Caligari,1920,76,Fantasy,7.9,F.W. Murnau,Max Schreck,57428,337574718.0,
Film-Noir,Shadow of a Doubt,1941,100,Film-Noir,7.8,Alfred Hitchcock,Humphrey Bogart,59556,449191.0,94.0


### **Task 9**: Find the number of movies starting with the letter 'A' in each group.

In [143]:
def moviesWithA(group):
    return group['Series_Title'].str.startswith('A').sum()

In [144]:
genres.apply(moviesWithA)


Genre
Action       10
Adventure     2
Animation     2
Biography     9
Comedy       14
Crime         4
Drama        21
Family        0
Fantasy       0
Film-Noir     0
Horror        1
Mystery       0
Thriller      0
Western       0
dtype: int64
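
The same count can be obtained without `apply()` by building the boolean column first and grouping it by genre. The titles in this sketch are invented placeholders.

```python
import pandas as pd

# Hypothetical titles, chosen only to exercise the logic
toy = pd.DataFrame({
    'Genre': ['Drama', 'Drama', 'Drama', 'Crime', 'Crime'],
    'Series_Title': ['Alpha', 'Beta', 'Apex', 'Atlas', 'Gamma'],
})

# True where the title starts with 'A'
starts_with_a = toy['Series_Title'].str.startswith('A')

# group the boolean Series by genre; summing booleans counts the True values
counts = starts_with_a.groupby(toy['Genre']).sum()
print(counts)
```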

### **Task 10**: Rank each movie within its genre according to IMDB rating.

In [145]:
def rank_movie(group):
    group['genre_rank'] = group['IMDB_Rating'].rank(ascending=False)
    return group

In [146]:
genres.apply(rank_movie)


Unnamed: 0_level_0,Unnamed: 1_level_0,Series_Title,Released_Year,Runtime,Genre,IMDB_Rating,Director,Star1,No_of_Votes,Gross,Metascore,genre_rank
Genre,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
Action,2,The Dark Knight,2008,152,Action,9.0,Christopher Nolan,Christian Bale,2303232,534858444.0,84.0,1.0
Action,5,The Lord of the Rings: The Return of the King,2003,201,Action,8.9,Peter Jackson,Elijah Wood,1642758,377845905.0,94.0,2.0
Action,8,Inception,2010,148,Action,8.8,Christopher Nolan,Leonardo DiCaprio,2067042,292576195.0,74.0,3.5
Action,10,The Lord of the Rings: The Fellowship of the Ring,2001,178,Action,8.8,Peter Jackson,Elijah Wood,1661481,315544750.0,92.0,3.5
Action,13,The Lord of the Rings: The Two Towers,2002,179,Action,8.7,Peter Jackson,Elijah Wood,1485555,342551365.0,87.0,6.0
...,...,...,...,...,...,...,...,...,...,...,...,...
Thriller,700,Wait Until Dark,1967,108,Thriller,7.8,Terence Young,Audrey Hepburn,27733,17550741.0,81.0,1.0
Western,12,"Il buono, il brutto, il cattivo",1966,161,Western,8.8,Sergio Leone,Clint Eastwood,688390,6100000.0,90.0,1.0
Western,48,Once Upon a Time in the West,1968,165,Western,8.5,Sergio Leone,Henry Fonda,302844,5321508.0,80.0,2.0
Western,115,Per qualche dollaro in più,1965,132,Western,8.3,Sergio Leone,Clint Eastwood,232772,15000000.0,74.0,3.0


In [147]:
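# rank() on the selected column ranks within each genre; the default is ascending=True,
# so here the lowest rating in a genre gets rank 1
genres['IMDB_Rating'].rank()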

0      289.0
1      107.0
2      172.0
3      105.5
4      105.5
       ...  
995      9.0
996     17.5
997     17.5
998     17.5
999      7.0
Name: IMDB_Rating, Length: 1000, dtype: float64
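
Because the ranked Series keeps the original row index, it can be assigned straight back as a new column, which avoids the custom function and `apply()`. A sketch on a made-up frame:

```python
import pandas as pd

# Hypothetical toy data with one tie inside the Drama group
toy = pd.DataFrame({
    'Genre': ['Drama', 'Drama', 'Drama', 'Crime'],
    'IMDB_Rating': [9.3, 8.8, 8.8, 9.2],
})

# ascending=False gives rank 1 to the highest rating in each genre;
# tied ratings share the average of their positions (the default method)
toy['genre_rank'] = toy.groupby('Genre')['IMDB_Rating'].rank(ascending=False)
print(toy)
```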

### **Task 11**: Find the min-max normalized IMDB rating within each genre.

In [154]:
def normal(group):
    group['norm_rating'] = (group['IMDB_Rating'] - group['IMDB_Rating'].min()) / (group['IMDB_Rating'].max() - group['IMDB_Rating'].min())
    return group


In [155]:
genres.apply(normal)


Unnamed: 0_level_0,Unnamed: 1_level_0,Series_Title,Released_Year,Runtime,Genre,IMDB_Rating,Director,Star1,No_of_Votes,Gross,Metascore,norm_rating
Genre,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
Action,2,The Dark Knight,2008,152,Action,9.0,Christopher Nolan,Christian Bale,2303232,534858444.0,84.0,1.000000
Action,5,The Lord of the Rings: The Return of the King,2003,201,Action,8.9,Peter Jackson,Elijah Wood,1642758,377845905.0,94.0,0.928571
Action,8,Inception,2010,148,Action,8.8,Christopher Nolan,Leonardo DiCaprio,2067042,292576195.0,74.0,0.857143
Action,10,The Lord of the Rings: The Fellowship of the Ring,2001,178,Action,8.8,Peter Jackson,Elijah Wood,1661481,315544750.0,92.0,0.857143
Action,13,The Lord of the Rings: The Two Towers,2002,179,Action,8.7,Peter Jackson,Elijah Wood,1485555,342551365.0,87.0,0.785714
...,...,...,...,...,...,...,...,...,...,...,...,...
Thriller,700,Wait Until Dark,1967,108,Thriller,7.8,Terence Young,Audrey Hepburn,27733,17550741.0,81.0,
Western,12,"Il buono, il brutto, il cattivo",1966,161,Western,8.8,Sergio Leone,Clint Eastwood,688390,6100000.0,90.0,1.000000
Western,48,Once Upon a Time in the West,1968,165,Western,8.5,Sergio Leone,Henry Fonda,302844,5321508.0,80.0,0.700000
Western,115,Per qualche dollaro in più,1965,132,Western,8.3,Sergio Leone,Clint Eastwood,232772,15000000.0,74.0,0.500000
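
The normalization can also be sketched with `transform()`, which broadcasts each group's minimum and maximum back to the rows. The frame below is made up. It also shows why a single-movie group such as Thriller gets `NaN`: its maximum equals its minimum, so the formula divides zero by zero.

```python
import pandas as pd

# Hypothetical toy data; Thriller has a single row on purpose
toy = pd.DataFrame({
    'Genre': ['Western', 'Western', 'Western', 'Thriller'],
    'IMDB_Rating': [8.8, 8.3, 7.8, 7.8],
})

rating = toy.groupby('Genre')['IMDB_Rating']
group_min = rating.transform('min')   # per-row copy of the genre minimum
group_max = rating.transform('max')   # per-row copy of the genre maximum

# (x - min) / (max - min), computed within each genre
toy['norm_rating'] = (toy['IMDB_Rating'] - group_min) / (group_max - group_min)
print(toy)
```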


## GroupBy on multiple columns
`Case`: Suppose we want each distinct pair of **Director** and **Star1** to form its own group. To do this, we pass a list of column names to `groupby()`.

In [15]:
# groupby on multiple cols
duo = movies.groupby(['Director', 'Star1'])
# size
duo.size()
# get_group
duo.get_group(('Aamir Khan', 'Amole Gupte'))

Unnamed: 0,Series_Title,Released_Year,Runtime,Genre,IMDB_Rating,Director,Star1,No_of_Votes,Gross,Metascore
65,Taare Zameen Par,2007,165,Drama,8.4,Aamir Khan,Amole Gupte,168895,1223869.0,


### **Task 12**: Find the highest-earning actor-director combo.

In [20]:
duo['Gross'].sum().sort_values(ascending=False).head(1)

Director        Star1         
Akira Kurosawa  Toshirô Mifune    2.999877e+09
Name: Gross, dtype: float64

### **Task 13**: Find the best actor-genre combo in terms of average Metascore.

In [31]:
actGen_grp = movies.groupby(['Genre', 'Star1'])
actGen_grp['Metascore'].mean().reset_index().sort_values(by='Metascore', ascending=False).head(1)

Unnamed: 0,Genre,Star1,Metascore
163,Adventure,Peter O'Toole,100.0


In [52]:
duo.agg(
    {
        'Runtime': ['min', 'max', 'mean']
    }
)

Unnamed: 0_level_0,Unnamed: 1_level_0,Runtime,Runtime,Runtime
Unnamed: 0_level_1,Unnamed: 1_level_1,min,max,mean
Director,Star1,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
Aamir Khan,Amole Gupte,165,165,165.0
Aaron Sorkin,Eddie Redmayne,129,129,129.0
Abdellatif Kechiche,Léa Seydoux,180,180,180.0
Abhishek Chaubey,Shahid Kapoor,148,148,148.0
Abhishek Kapoor,Amit Sadh,130,130,130.0
...,...,...,...,...
Zaza Urushadze,Lembit Ulfsak,87,87,87.0
Zoya Akhtar,Hrithik Roshan,155,155,155.0
Zoya Akhtar,Vijay Varma,154,154,154.0
Çagan Irmak,Çetin Tekindor,112,112,112.0
