<a href="https://colab.research.google.com/github/kavithachitriki/SCALER/blob/main/PostRead_Apply_Pandas_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Creating our required dataframe (This portion is similar to the lecture)

In [None]:
# Importing the necessary libraries
import pandas as pd
import numpy as np

In [None]:
# Downloading the required datasets
!gdown 1s2TkjSpzNc4SyxqRrQleZyDIHlc7bxnd
!gdown 1Ws-_s1fHZ9nHfGLVUQurbHDvStePlEJm

Downloading...
From: https://drive.google.com/uc?id=1s2TkjSpzNc4SyxqRrQleZyDIHlc7bxnd
To: /content/movies.csv
100% 112k/112k [00:00<00:00, 64.8MB/s]
Downloading...
From: https://drive.google.com/uc?id=1Ws-_s1fHZ9nHfGLVUQurbHDvStePlEJm
To: /content/directors.csv
100% 65.4k/65.4k [00:00<00:00, 56.7MB/s]


As in the lecture, we have two datasets: movies and directors.


In [None]:
movies = pd.read_csv('movies.csv', index_col=0)
movies.head()

Unnamed: 0,id,budget,popularity,revenue,title,vote_average,vote_count,director_id,year,month,day
0,43597,237000000,150,2787965087,Avatar,7.2,11800,4762,2009,Dec,Thursday
1,43598,300000000,139,961000000,Pirates of the Caribbean: At World's End,6.9,4500,4763,2007,May,Saturday
2,43599,245000000,107,880674609,Spectre,6.3,4466,4764,2015,Oct,Monday
3,43600,250000000,112,1084939099,The Dark Knight Rises,7.6,9106,4765,2012,Jul,Monday
5,43602,258000000,115,890871626,Spider-Man 3,5.9,3576,4767,2007,May,Tuesday


In [None]:
directors = pd.read_csv('directors.csv',index_col=0)
directors.head()

Unnamed: 0,director_name,id,gender
0,James Cameron,4762,Male
1,Gore Verbinski,4763,Male
2,Sam Mendes,4764,Male
3,Christopher Nolan,4765,Male
4,Andrew Stanton,4766,Male


Now let's merge the two datasets, drop the redundant key columns, and create a copy to build our final dataset.

In [None]:
data = movies.merge(directors, how='left', left_on='director_id',right_on='id')
data.head()

Unnamed: 0,id_x,budget,popularity,revenue,title,vote_average,vote_count,director_id,year,month,day,director_name,id_y,gender
0,43597,237000000,150,2787965087,Avatar,7.2,11800,4762,2009,Dec,Thursday,James Cameron,4762,Male
1,43598,300000000,139,961000000,Pirates of the Caribbean: At World's End,6.9,4500,4763,2007,May,Saturday,Gore Verbinski,4763,Male
2,43599,245000000,107,880674609,Spectre,6.3,4466,4764,2015,Oct,Monday,Sam Mendes,4764,Male
3,43600,250000000,112,1084939099,The Dark Knight Rises,7.6,9106,4765,2012,Jul,Monday,Christopher Nolan,4765,Male
4,43602,258000000,115,890871626,Spider-Man 3,5.9,3576,4767,2007,May,Tuesday,Sam Raimi,4767,Male


In [None]:
data.drop(['director_id','id_y'],axis=1,inplace=True)
data.head()

Unnamed: 0,id_x,budget,popularity,revenue,title,vote_average,vote_count,year,month,day,director_name,gender
0,43597,237000000,150,2787965087,Avatar,7.2,11800,2009,Dec,Thursday,James Cameron,Male
1,43598,300000000,139,961000000,Pirates of the Caribbean: At World's End,6.9,4500,2007,May,Saturday,Gore Verbinski,Male
2,43599,245000000,107,880674609,Spectre,6.3,4466,2015,Oct,Monday,Sam Mendes,Male
3,43600,250000000,112,1084939099,The Dark Knight Rises,7.6,9106,2012,Jul,Monday,Christopher Nolan,Male
4,43602,258000000,115,890871626,Spider-Man 3,5.9,3576,2007,May,Tuesday,Sam Raimi,Male


In [None]:
df = data.copy(deep=True)
df.head()

Unnamed: 0,id_x,budget,popularity,revenue,title,vote_average,vote_count,year,month,day,director_name,gender
0,43597,237000000,150,2787965087,Avatar,7.2,11800,2009,Dec,Thursday,James Cameron,Male
1,43598,300000000,139,961000000,Pirates of the Caribbean: At World's End,6.9,4500,2007,May,Saturday,Gore Verbinski,Male
2,43599,245000000,107,880674609,Spectre,6.3,4466,2015,Oct,Monday,Sam Mendes,Male
3,43600,250000000,112,1084939099,The Dark Knight Rises,7.6,9106,2012,Jul,Monday,Christopher Nolan,Male
4,43602,258000000,115,890871626,Spider-Man 3,5.9,3576,2007,May,Tuesday,Sam Raimi,Male


## Finding risky movies using *apply*

Now, our goal is to find the movies that are risky. We call a movie risky if its budget is more than the average revenue of its director's movies. To do so, we can group the dataframe by the director's name and use the apply function to compute this for each group.
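
To make the rule concrete, here is a minimal sketch on a made-up four-row frame. The `toy` data and its values are hypothetical and not taken from movies.csv. It uses `transform('mean')`, which is a different route from the `apply` approach this notebook develops, to repeat each director's average revenue on that director's rows.

```python
import pandas as pd

# Hypothetical toy data: two directors, two movies each
toy = pd.DataFrame({
    "director_name": ["B", "A", "B", "A"],
    "budget": [100, 50, 10, 80],
    "revenue": [60, 40, 20, 100],
})

# Average revenue per director, repeated on every row of that director
avg_revenue = toy.groupby("director_name")["revenue"].transform("mean")

# A movie is flagged as risky when its budget exceeds that average
toy["risky"] = toy["budget"] > avg_revenue
print(toy)
```

Director A averages 70 in revenue and director B averages 40, so only the movies with budgets 80 (A) and 100 (B) are flagged.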

### Using a function

Let's first define a function that checks whether each movie's budget exceeds the average revenue of its group.

In [None]:
def calc_risk(x):
  return x['budget'] - x['revenue'].mean() > 0

Now let's apply it to the grouped dataframe:

In [None]:
df.groupby('director_name').apply(calc_risk)

director_name      
Adam McKay     176     False
               323     False
               366     False
               505     False
               839     False
                       ...  
Zhang Yimou    590     False
               604     False
               1217    False
               1223    False
               1389    False
Name: budget, Length: 1465, dtype: bool

This works, but as we can see, the output has a MultiIndex. What issue can this create? Suppose we try to assign it to a new column, "risky":

In [None]:
df['risky'] = df.groupby('director_name').apply(calc_risk)

TypeError: ignored

The error essentially says that, since the output has a MultiIndex, it cannot be aligned with the dataframe's single-level index, so it cannot be assigned as a column.

We need the output as a Series with a single-level index.
There are two ways to get it.

#### Method 1: Creating the column and assigning it in the function itself

In [None]:
def calc_risk(x):
  x['risky'] = x['budget'] - x['revenue'].mean() > 0
  return x

df = df.groupby('director_name').apply(calc_risk)
df.head()

Unnamed: 0,id_x,budget,popularity,revenue,title,vote_average,vote_count,year,month,day,director_name,gender,risky
0,43597,237000000,150,2787965087,Avatar,7.2,11800,2009,Dec,Thursday,James Cameron,Male,False
1,43598,300000000,139,961000000,Pirates of the Caribbean: At World's End,6.9,4500,2007,May,Saturday,Gore Verbinski,Male,False
2,43599,245000000,107,880674609,Spectre,6.3,4466,2015,Oct,Monday,Sam Mendes,Male,False
3,43600,250000000,112,1084939099,The Dark Knight Rises,7.6,9106,2012,Jul,Monday,Christopher Nolan,Male,False
4,43602,258000000,115,890871626,Spider-Man 3,5.9,3576,2007,May,Tuesday,Sam Raimi,Male,False


As we see, this works and gives us the required column.
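
Whether the frame returned by this pattern keeps the original flat index can depend on the pandas version: newer releases may add the group key as an extra index level even when the function returns the whole group. Below is a minimal sketch on hypothetical toy data (`toy` and `add_risky` are illustrative names), assuming that passing `group_keys=False` keeps the original row labels.

```python
import pandas as pd

# Hypothetical toy data: two directors, two movies each
toy = pd.DataFrame({
    "director_name": ["B", "A", "B", "A"],
    "budget": [100, 50, 10, 80],
    "revenue": [60, 40, 20, 100],
})

def add_risky(group):
    group = group.copy()  # avoid modifying the object handed in by groupby
    group["risky"] = group["budget"] - group["revenue"].mean() > 0
    return group

out = (
    toy.groupby("director_name", group_keys=False)[["budget", "revenue"]]
    .apply(add_risky)
)

# The returned rows carry their original labels, so the column aligns back
toy["risky"] = out["risky"]
print(toy)
```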

#### Method 2: Using the parameter group_keys = False in groupby

We know that groupby sorts the groups by key, and that by default apply adds the group key as an extra outer level of the result's index. This extra level is what creates the problem while assigning the values. Hence, we can use ```group_keys = False``` while using groupby, so that the result keeps only the original row index. Let's look at its implementation.

In [None]:
def calc_risk_new(x):
  return x['budget'] - x['revenue'].mean() > 0

df['risky_new'] = df.groupby('director_name', group_keys = False).apply(calc_risk_new)
df.head()

Unnamed: 0,id_x,budget,popularity,revenue,title,vote_average,vote_count,year,month,day,director_name,gender,risky,risky_new
0,43597,237000000,150,2787965087,Avatar,7.2,11800,2009,Dec,Thursday,James Cameron,Male,False,False
1,43598,300000000,139,961000000,Pirates of the Caribbean: At World's End,6.9,4500,2007,May,Saturday,Gore Verbinski,Male,False,False
2,43599,245000000,107,880674609,Spectre,6.3,4466,2015,Oct,Monday,Sam Mendes,Male,False,False
3,43600,250000000,112,1084939099,The Dark Knight Rises,7.6,9106,2012,Jul,Monday,Christopher Nolan,Male,False,False
4,43602,258000000,115,890871626,Spider-Man 3,5.9,3576,2007,May,Tuesday,Sam Raimi,Male,False,False


This worked just fine. What actually happens, though?
If we print the values returned by calc_risk_new, we can see what is being assigned.

In [None]:
df.groupby('director_name', group_keys = False).apply(calc_risk_new)

176     False
323     False
366     False
505     False
839     False
        ...  
590     False
604     False
1217    False
1223    False
1389    False
Name: budget, Length: 1465, dtype: bool

With group_keys = True, which is the default value, we get the MultiIndex result.

In [None]:
df.groupby('director_name', group_keys = True).apply(calc_risk_new)

director_name      
Adam McKay     176     False
               323     False
               366     False
               505     False
               839     False
                       ...  
Zhang Yimou    590     False
               604     False
               1217    False
               1223    False
               1389    False
Name: budget, Length: 1465, dtype: bool

Hence, to create and assign these values to a column, we need to either assign the column in the function itself, or use the parameter ```group_keys = False```

In [None]:
# Check if both the columns values are same
np.all(df['risky'] == df['risky_new'])

True

### Using lambda

The same can be achieved using a lambda function too, though in this case we need to use ```group_keys = False```, since a single-expression lambda cannot contain the assignment statement that adds a column to the dataframe.

Let's try it out, first with the default ```group_keys = True```:

In [None]:
df.groupby('director_name', group_keys = True).apply(lambda x: x.budget - x['revenue'].mean()) > 0

director_name      
Adam McKay     176     False
               323     False
               366     False
               505     False
               839     False
                       ...  
Zhang Yimou    590     False
               604     False
               1217    False
               1223    False
               1389    False
Name: budget, Length: 1465, dtype: bool

This matches the result we got earlier with the named function.
Let's print the index that each group returns:

In [None]:
df.groupby('director_name').apply((lambda x: (x.budget - x['revenue'].mean() > 0).index))

director_name
Adam McKay                     Int64Index([176, 323, 366, 505, 839, 916], dty...
Adam Shankman                  Int64Index([265, 300, 350, 404, 458, 843, 999,...
Alejandro González Iñárritu    Int64Index([106, 749, 1015, 1034, 1077, 1405],...
Alex Proyas                    Int64Index([95, 159, 514, 671, 873], dtype='in...
Alexander Payne                Int64Index([793, 1006, 1101, 1211, 1281], dtyp...
                                                     ...                        
Wes Craven                     Int64Index([620, 651, 714, 734, 887, 932, 952,...
Wolfgang Petersen              Int64Index([65, 87, 132, 235, 515, 872, 1216],...
Woody Allen                    Int64Index([ 799,  895,  985, 1038, 1044, 1046...
Zack Snyder                    Int64Index([5, 10, 97, 187, 317, 396, 842], dt...
Zhang Yimou                    Int64Index([192, 590, 604, 1217, 1223, 1389], ...
Length: 199, dtype: object
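
The same per-group row labels can also be read from the `groups` attribute of the groupby object, without calling `apply`. A minimal sketch on hypothetical toy data (`toy` and `labels` are illustrative names):

```python
import pandas as pd

# Hypothetical toy data: two directors, two movies each
toy = pd.DataFrame({
    "director_name": ["B", "A", "B", "A"],
    "budget": [100, 50, 10, 80],
    "revenue": [60, 40, 20, 100],
})

# .groups maps each group key to the index labels of its rows
labels = {
    name: idx.tolist()
    for name, idx in toy.groupby("director_name").groups.items()
}
print(labels)
```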

This confirms the source of the MultiIndex: each group returns its own set of row labels, and the director's name is added on top as an outer level. We can simply use group_keys = False to avoid this.

In [None]:
df.groupby('director_name', group_keys = False).apply(lambda x: x.budget - x['revenue'].mean()) > 0

176     False
323     False
366     False
505     False
839     False
        ...  
590     False
604     False
1217    False
1223    False
1389    False
Name: budget, Length: 1465, dtype: bool

In [None]:
df['lambda_risky'] = df.groupby('director_name', group_keys = False).apply(lambda x: x['budget'] - x['revenue'].mean()) > 0
df.head()

Unnamed: 0,id_x,budget,popularity,revenue,title,vote_average,vote_count,year,month,day,director_name,gender,risky,risky_new,lambda_risky
0,43597,237000000,150,2787965087,Avatar,7.2,11800,2009,Dec,Thursday,James Cameron,Male,False,False,False
1,43598,300000000,139,961000000,Pirates of the Caribbean: At World's End,6.9,4500,2007,May,Saturday,Gore Verbinski,Male,False,False,False
2,43599,245000000,107,880674609,Spectre,6.3,4466,2015,Oct,Monday,Sam Mendes,Male,False,False,False
3,43600,250000000,112,1084939099,The Dark Knight Rises,7.6,9106,2012,Jul,Monday,Christopher Nolan,Male,False,False,False
4,43602,258000000,115,890871626,Spider-Man 3,5.9,3576,2007,May,Tuesday,Sam Raimi,Male,False,False,False


In [None]:
np.all(df['risky'] == df['lambda_risky'])

True

Thus, we can achieve our required result in any of the ways above.

### Key Takeaways

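- Using a function or a lambda with apply after groupby on a dataframe gives MultiIndex output by default when the function returns a Series for each group.
- To assign the result to a column, we need to either return the whole dataframe from the function (after creating the new column in it) or pass ```group_keys = False``` to groupby.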