### Pandas Apply Lambda

Always remember the Zen of Python!!!

In [1]:
import this

The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!


In [2]:
import pandas as pd

In [3]:
df = pd.read_csv('../data/input/IMDB-Movie-Data.csv')

In [4]:
df.head()

Unnamed: 0,Rank,Title,Genre,Description,Director,Actors,Year,Runtime (Minutes),Rating,Votes,Revenue (Millions),Metascore
0,1,Guardians of the Galaxy,"Action,Adventure,Sci-Fi",A group of intergalactic criminals are forced ...,James Gunn,"Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...",2014,121,8.1,757074,333.13,76.0
1,2,Prometheus,"Adventure,Mystery,Sci-Fi","Following clues to the origin of mankind, a te...",Ridley Scott,"Noomi Rapace, Logan Marshall-Green, Michael Fa...",2012,124,7.0,485820,126.46,65.0
2,3,Split,"Horror,Thriller",Three girls are kidnapped by a man with a diag...,M. Night Shyamalan,"James McAvoy, Anya Taylor-Joy, Haley Lu Richar...",2016,117,7.3,157606,138.12,62.0
3,4,Sing,"Animation,Comedy,Family","In a city of humanoid animals, a hustling thea...",Christophe Lourdelet,"Matthew McConaughey,Reese Witherspoon, Seth Ma...",2016,108,7.2,60545,270.32,59.0
4,5,Suicide Squad,"Action,Adventure,Fantasy",A secret government agency recruits some of th...,David Ayer,"Will Smith, Jared Leto, Margot Robbie, Viola D...",2016,123,6.2,393727,325.02,40.0


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 12 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Rank                1000 non-null   int64  
 1   Title               1000 non-null   object 
 2   Genre               1000 non-null   object 
 3   Description         1000 non-null   object 
 4   Director            1000 non-null   object 
 5   Actors              1000 non-null   object 
 6   Year                1000 non-null   int64  
 7   Runtime (Minutes)   1000 non-null   int64  
 8   Rating              1000 non-null   float64
 9   Votes               1000 non-null   int64  
 10  Revenue (Millions)  872 non-null    float64
 11  Metascore           936 non-null    float64
dtypes: float64(3), int64(4), object(5)
memory usage: 93.9+ KB


## 🐼 Challenge 1. Using a single argument

We want to bin movies according to the number of votes they have received. To do so, we will create a new column named 'Category' that tags every movie as follows:

From 0 to 999 ==> 'cat_1' 

From 1000 to 9999 ==> 'cat_2'

From 10000 to 99999 ==> 'cat_3'

From 100000 to 999999 ==> 'cat_4'

1000000 or more ==> 'cat_5'

In [6]:
# Define the category function, which takes the vote count v as its argument.
# It returns the label of the range that v falls into.

def categoria(v):
    if v <= 999:
        return "cat_1"
    elif 1000 <= v <= 9999:
        return "cat_2"
    elif 10000 <= v <= 99999:
        return "cat_3"
    elif 100000 <= v <= 999999:
        return "cat_4"
    elif v >= 1000000:
        return "cat_5"

In [7]:
# Test

x = categoria(999998)
print(x)

cat_4


In [8]:
# Apply the function to create a new column called "Category". The lambda runs once per row
# and calls categoria with that row's Votes value.
df["Category"] = df.apply(lambda row: categoria(row["Votes"]), axis=1)
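
Since `categoria` only needs the vote count, the row-wise lambda is not strictly required. A minimal sketch, assuming a small made-up frame named `sample` in place of the IMDB data, suggests that calling `apply` on the `Votes` column alone gives the same labels:

```python
import pandas as pd

def categoria(v):
    """Map a vote count to its bin label."""
    if v <= 999:
        return "cat_1"
    elif v <= 9999:
        return "cat_2"
    elif v <= 99999:
        return "cat_3"
    elif v <= 999999:
        return "cat_4"
    return "cat_5"

# Hypothetical sample; the notebook itself reads IMDB-Movie-Data.csv.
sample = pd.DataFrame({"Votes": [757074, 60545, 512, 1_200_000]})

# Row-wise version, as in the cell above.
row_wise = sample.apply(lambda row: categoria(row["Votes"]), axis=1)

# Column-wise version: the function receives each value directly.
column_wise = sample["Votes"].apply(categoria)

print(column_wise.tolist())
```

The column-wise form avoids building a full row object for every movie, which tends to matter on larger frames.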

In [9]:
df.head()    # Check the result

Unnamed: 0,Rank,Title,Genre,Description,Director,Actors,Year,Runtime (Minutes),Rating,Votes,Revenue (Millions),Metascore,Category
0,1,Guardians of the Galaxy,"Action,Adventure,Sci-Fi",A group of intergalactic criminals are forced ...,James Gunn,"Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...",2014,121,8.1,757074,333.13,76.0,cat_4
1,2,Prometheus,"Adventure,Mystery,Sci-Fi","Following clues to the origin of mankind, a te...",Ridley Scott,"Noomi Rapace, Logan Marshall-Green, Michael Fa...",2012,124,7.0,485820,126.46,65.0,cat_4
2,3,Split,"Horror,Thriller",Three girls are kidnapped by a man with a diag...,M. Night Shyamalan,"James McAvoy, Anya Taylor-Joy, Haley Lu Richar...",2016,117,7.3,157606,138.12,62.0,cat_4
3,4,Sing,"Animation,Comedy,Family","In a city of humanoid animals, a hustling thea...",Christophe Lourdelet,"Matthew McConaughey,Reese Witherspoon, Seth Ma...",2016,108,7.2,60545,270.32,59.0,cat_3
4,5,Suicide Squad,"Action,Adventure,Fantasy",A secret government agency recruits some of th...,David Ayer,"Will Smith, Jared Leto, Margot Robbie, Viola D...",2016,123,6.2,393727,325.02,40.0,cat_4
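
As an alternative to a hand-written function, the same bins can probably be expressed with `pd.cut`. This is only a sketch under assumptions: the `votes` series below is made up, and the edges are chosen as left-closed intervals so that 1000 falls in 'cat_2' and 1000000 in 'cat_5', as in the list above.

```python
import numpy as np
import pandas as pd

# Hypothetical vote counts sitting on or near the bin boundaries.
votes = pd.Series([999, 1000, 60545, 757074, 1_000_000], name="Votes")

# Left-closed bins: [0, 1000), [1000, 10000), ..., [1000000, inf)
edges = [0, 1_000, 10_000, 100_000, 1_000_000, np.inf]
labels = ["cat_1", "cat_2", "cat_3", "cat_4", "cat_5"]

# right=False makes each interval include its left edge and exclude its right edge.
bins = pd.cut(votes, bins=edges, labels=labels, right=False)

print(bins.astype(str).tolist())
```

The result is a categorical column, so the labels keep their order when sorting or grouping.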


## üêº üêº Challenge 2. Using two arguments

We want to know the revenue per minute for every movie.

In [10]:
# Should return the result of revenue / minutes

In [11]:
def division(minutes, revenue):
    # Renamed from min/rev so the parameter does not shadow the built-in min()
    return revenue / minutes

In [12]:
x = division(121, 333.13)
print(x)

2.7531404958677688


In [13]:
df["Revenue per minute (Millions)"] = df.apply(lambda row: division(row["Runtime (Minutes)"], row["Revenue (Millions)"]), axis=1)

In [14]:
df.head()    # I first applied the lambda to a column called "Revenue per minute"; when I reran it
             # so the column would be named "Revenue per minute (Millions)", the earlier column was left behind.

Unnamed: 0,Rank,Title,Genre,Description,Director,Actors,Year,Runtime (Minutes),Rating,Votes,Revenue (Millions),Metascore,Category,Revenue per minute (Millions)
0,1,Guardians of the Galaxy,"Action,Adventure,Sci-Fi",A group of intergalactic criminals are forced ...,James Gunn,"Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...",2014,121,8.1,757074,333.13,76.0,cat_4,2.75314
1,2,Prometheus,"Adventure,Mystery,Sci-Fi","Following clues to the origin of mankind, a te...",Ridley Scott,"Noomi Rapace, Logan Marshall-Green, Michael Fa...",2012,124,7.0,485820,126.46,65.0,cat_4,1.019839
2,3,Split,"Horror,Thriller",Three girls are kidnapped by a man with a diag...,M. Night Shyamalan,"James McAvoy, Anya Taylor-Joy, Haley Lu Richar...",2016,117,7.3,157606,138.12,62.0,cat_4,1.180513
3,4,Sing,"Animation,Comedy,Family","In a city of humanoid animals, a hustling thea...",Christophe Lourdelet,"Matthew McConaughey,Reese Witherspoon, Seth Ma...",2016,108,7.2,60545,270.32,59.0,cat_3,2.502963
4,5,Suicide Squad,"Action,Adventure,Fantasy",A secret government agency recruits some of th...,David Ayer,"Will Smith, Jared Leto, Margot Robbie, Viola D...",2016,123,6.2,393727,325.02,40.0,cat_4,2.642439


In [15]:
df = df.drop(['Revenue per minute'], axis=1)    # Drop the column I created earlier by mistake

KeyError: "['Revenue per minute'] not found in axis"
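
The `KeyError` means no column named 'Revenue per minute' exists in this session. When a cleanup cell may run against a frame that does or does not have the column, `DataFrame.drop` accepts `errors="ignore"`. A minimal sketch, assuming a made-up one-row frame named `frame`:

```python
import pandas as pd

# Hypothetical frame that only has the correctly named column.
frame = pd.DataFrame({
    "Title": ["Sing"],
    "Revenue per minute (Millions)": [2.502963],
})

# The label is absent; with the default errors="raise" this would raise KeyError.
cleaned = frame.drop(columns=["Revenue per minute"], errors="ignore")

print(list(cleaned.columns))
```

With `errors="ignore"` only the labels that exist are dropped, so the cell can be rerun safely.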

In [None]:
df.head()
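
For a plain arithmetic operation such as this one, dividing the two columns directly should give the same values as the row-wise lambda. The sketch below assumes a small frame named `movies` built from the first two rows shown above plus one made-up row with missing revenue:

```python
import numpy as np
import pandas as pd

movies = pd.DataFrame({
    "Runtime (Minutes)": [121, 124, 100],
    "Revenue (Millions)": [333.13, 126.46, np.nan],  # third row is hypothetical
})

# Element-wise division; rows with missing revenue simply stay NaN.
movies["Revenue per minute (Millions)"] = (
    movies["Revenue (Millions)"] / movies["Runtime (Minutes)"]
)

print(movies)
```

Missing revenue propagates as `NaN` instead of raising, which matches how `division` behaves when it receives a `NaN`.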

## üêº üêº üêº Challenge 3. A bit more complicated

We want to create a new rating where we add 1 point if the genre is thriller but subtract 1 point if the genre is comedy.

In [None]:
# New column called 'New Rating'
# Go through each row of "Genre"; if "Thriller" appears, add 1
# If "Comedy" appears, subtract 1

In [16]:
def ranqueo(texto, value):
    if "Thriller" in texto:
        value += 1
    elif "Comedy" in texto:
        value -= 1
    return value

In [17]:
x = "Animation, Comedy, Family"
y = ranqueo(x, 0)
print(y)

-1


In [18]:
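# Note: this passes row["Rank"], so the output below is the Rank adjusted by +1/-1;
# pass row["Rating"] instead to adjust the rating as the challenge describes.
df["New Rating"] = df.apply(lambda row: ranqueo(row["Genre"], row["Rank"]), axis=1)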

In [19]:
df.head()

Unnamed: 0,Rank,Title,Genre,Description,Director,Actors,Year,Runtime (Minutes),Rating,Votes,Revenue (Millions),Metascore,Category,Revenue per minute (Millions),New Rating
0,1,Guardians of the Galaxy,"Action,Adventure,Sci-Fi",A group of intergalactic criminals are forced ...,James Gunn,"Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...",2014,121,8.1,757074,333.13,76.0,cat_4,2.75314,1
1,2,Prometheus,"Adventure,Mystery,Sci-Fi","Following clues to the origin of mankind, a te...",Ridley Scott,"Noomi Rapace, Logan Marshall-Green, Michael Fa...",2012,124,7.0,485820,126.46,65.0,cat_4,1.019839,2
2,3,Split,"Horror,Thriller",Three girls are kidnapped by a man with a diag...,M. Night Shyamalan,"James McAvoy, Anya Taylor-Joy, Haley Lu Richar...",2016,117,7.3,157606,138.12,62.0,cat_4,1.180513,4
3,4,Sing,"Animation,Comedy,Family","In a city of humanoid animals, a hustling thea...",Christophe Lourdelet,"Matthew McConaughey,Reese Witherspoon, Seth Ma...",2016,108,7.2,60545,270.32,59.0,cat_3,2.502963,3
4,5,Suicide Squad,"Action,Adventure,Fantasy",A secret government agency recruits some of th...,David Ayer,"Will Smith, Jared Leto, Margot Robbie, Viola D...",2016,123,6.2,393727,325.02,40.0,cat_4,2.642439,5
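
The cell above feeds `Rank` into `ranqueo`, while the challenge talks about the rating. A possible vectorized variant on the `Rating` column is sketched here. It rests on two assumptions: the adjustments stack (a movie tagged both Comedy and Thriller gets +1 and -1, unlike the `elif` in `ranqueo`), and the last row of the `movies` frame is made up to show that case.

```python
import pandas as pd

movies = pd.DataFrame({
    "Genre": [
        "Horror,Thriller",
        "Animation,Comedy,Family",
        "Adventure,Mystery,Sci-Fi",
        "Comedy,Thriller",          # hypothetical row with both genres
    ],
    "Rating": [7.3, 7.2, 7.0, 6.0],
})

# Boolean masks: plain substring match, no regular expressions.
is_thriller = movies["Genre"].str.contains("Thriller", regex=False)
is_comedy = movies["Genre"].str.contains("Comedy", regex=False)

# True/False become 1/0, so each mask adds or removes one point.
movies["New Rating"] = (
    movies["Rating"] + is_thriller.astype(int) - is_comedy.astype(int)
)

print(movies)
```

If a Thriller tag should take priority over Comedy, as in `ranqueo`, the comedy mask can be restricted with `is_comedy & ~is_thriller`.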


## üêº üêº üêº üêº Challenge 4. A bit too weird...

We want to know whether the integer part of the sum of the ASCII values of every character in the movie title, divided by the number of votes, is a prime number (remember that prime numbers are integers).

In [None]:
# The sum of the ASCII values of the title's characters is an integer
# Check whether that sum, divided by the number of votes, gives a prime number

In [43]:
# Test: get the total of the ASCII values

string = "Guardians of the Galaxy"                       
ascii_value = sum(ord(ch) for ch in string)   # Sum the ASCII value of every character in string
print(ascii_value)

2170


In [50]:
votes = 757074
diff = ascii_value / votes  # quotient whose integer part we test for primality
print(diff)

0.002866298406760766


In [51]:
# Function to check whether a number is prime
def esprimo(n):
    # A prime is divisible only by itself and 1
    if n <= 1:
        return False
    elif n == 2:
        return True
    else: 
        for i in range(2, n): 
            if n % i == 0:
                return False
        return True 
    
print(esprimo(diff))    # Test with the previous result; for the first movie it should be False

False


In [52]:
# Function to check whether the integer part of (ASCII sum / votes) is prime
def ascii_primo(texto, voto): 
    ascii_value = sum(ord(ch) for ch in texto)    # Sum of the ASCII values of the given text
    diff = ascii_value / voto                     # Quotient of the ASCII sum and the vote count
    return esprimo(int(diff))             # Delegate to the previous function, which returns True or False


In [53]:
df["Prime number?"] = df.apply(lambda row: ascii_primo(row["Title"], row["Votes"]), axis=1)
# Apply to the df, creating a new column "Prime number?"
# The lambda runs ascii_primo, which in turn calls esprimo internally.

In [57]:
df.head()

Unnamed: 0,Rank,Title,Genre,Description,Director,Actors,Year,Runtime (Minutes),Rating,Votes,Revenue (Millions),Metascore,Category,Revenue per minute (Millions),New Rating,Prime number?
0,1,Guardians of the Galaxy,"Action,Adventure,Sci-Fi",A group of intergalactic criminals are forced ...,James Gunn,"Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...",2014,121,8.1,757074,333.13,76.0,cat_4,2.75314,1,False
1,2,Prometheus,"Adventure,Mystery,Sci-Fi","Following clues to the origin of mankind, a te...",Ridley Scott,"Noomi Rapace, Logan Marshall-Green, Michael Fa...",2012,124,7.0,485820,126.46,65.0,cat_4,1.019839,2,False
2,3,Split,"Horror,Thriller",Three girls are kidnapped by a man with a diag...,M. Night Shyamalan,"James McAvoy, Anya Taylor-Joy, Haley Lu Richar...",2016,117,7.3,157606,138.12,62.0,cat_4,1.180513,4,False
3,4,Sing,"Animation,Comedy,Family","In a city of humanoid animals, a hustling thea...",Christophe Lourdelet,"Matthew McConaughey,Reese Witherspoon, Seth Ma...",2016,108,7.2,60545,270.32,59.0,cat_3,2.502963,3,False
4,5,Suicide Squad,"Action,Adventure,Fantasy",A secret government agency recruits some of th...,David Ayer,"Will Smith, Jared Leto, Margot Robbie, Viola D...",2016,123,6.2,393727,325.02,40.0,cat_4,2.642439,5,False
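
Every row shown is False because the ASCII sum of a title (2170 for the first one) is far smaller than its vote count, so the integer part of the quotient is 0. The check itself can be tightened by testing divisors only up to the square root. The following is a sketch, not the notebook's method: `is_prime` and `title_votes_is_prime` are hypothetical names, and the votes are assumed to be positive integers.

```python
import math

def is_prime(n: int) -> bool:
    """Trial division up to the integer square root of n."""
    if n < 2:
        return False
    if n < 4:            # 2 and 3 are prime
        return True
    if n % 2 == 0:
        return False
    for d in range(3, math.isqrt(n) + 1, 2):
        if n % d == 0:
            return False
    return True

def title_votes_is_prime(title: str, votes: int) -> bool:
    """Is the integer part of (ASCII sum of title / votes) prime?"""
    ascii_sum = sum(ord(ch) for ch in title)
    # Floor division equals the integer part for positive operands.
    return is_prime(ascii_sum // votes)

print(title_votes_is_prime("Guardians of the Galaxy", 757074))
print(title_votes_is_prime("AB", 1))   # 65 + 66 = 131, which is prime
```

A vote count of 0 would raise `ZeroDivisionError`, so such rows would need to be filtered out first.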


## üêº üêº üêº üêº üêº Challenge 5. And finally some fantasy

Feel free to propose your own ranking based on aggregations of at least 3 columns of the dataset.

In [74]:
# Which genres produce the most votes and the most revenue?
# Group the df by the Genre column and select the Votes and Revenue (Millions) columns
# Then sum the rows that share a genre and sort in descending order by revenue
df1 = df.groupby(['Genre'])[['Votes', 'Revenue (Millions)']].sum().sort_values(by='Revenue (Millions)', ascending=False)

In [75]:
df1.head()

Unnamed: 0_level_0,Votes,Revenue (Millions)
Genre,Unnamed: 1_level_1,Unnamed: 2_level_1
"Action,Adventure,Sci-Fi",18582076,10461.51
"Animation,Adventure,Comedy",5913065,5754.75
"Action,Adventure,Fantasy",7816851,5248.29
"Adventure,Family,Fantasy",2640649,2201.47
Comedy,3685529,1941.81


In [77]:
# What are the mean votes and revenue per director?
# Here I group the df by Director and keep Votes and Revenue (Millions)
df2 = df.groupby(['Director'])[['Votes', 'Revenue (Millions)']].mean().sort_values(by='Revenue (Millions)', ascending=False)

In [78]:
df2.head()

Unnamed: 0_level_0,Votes,Revenue (Millions)
Director,Unnamed: 1_level_1,Unnamed: 2_level_1
James Cameron,935408.0,760.51
Colin Trevorrow,455169.0,652.18
Joss Whedon,781241.5,541.135
Lee Unkrich,586669.0,414.98
Gary Ross,382749.5,408.0
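
One way to turn per-director aggregations into a single ranking over three columns is sketched below. Everything in it is an assumption made for illustration: the directors "A", "B" and "C" and their numbers are made up, and the equal-weight average of percentile ranks is just one possible scoring rule.

```python
import pandas as pd

# Hypothetical data; the real frame has the same three metric columns.
movies = pd.DataFrame({
    "Director": ["A", "A", "B", "C"],
    "Rating": [8.0, 7.0, 6.5, 9.0],
    "Votes": [100_000, 300_000, 50_000, 80_000],
    "Revenue (Millions)": [200.0, 100.0, 50.0, 10.0],
})

# Named aggregation: one output column per (source column, function) pair.
per_director = movies.groupby("Director").agg(
    mean_rating=("Rating", "mean"),
    total_votes=("Votes", "sum"),
    total_revenue=("Revenue (Millions)", "sum"),
)

# Percentile rank of each metric (1.0 = best), averaged with equal weights.
per_director["score"] = per_director.rank(pct=True).mean(axis=1)

ranking = per_director.sort_values("score", ascending=False)
print(ranking)
```

Ranking by percentiles keeps the three metrics on a comparable scale, so raw vote counts do not swamp ratings.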


In [79]:
# Spoiler: I don't use three columns here, but I wanted to play with .idxmax
# Which director generates the most revenue?
df3 = df.groupby(['Director'])[['Revenue (Millions)']].sum()

In [80]:
print("The director who generates the most revenue is: ",df3['Revenue (Millions)'].idxmax())
# Since df3 is already grouped by director, idxmax on the Revenue column returns the director's name

The director who generates the most revenue is:  J.J. Abrams


In [82]:
print("The director who generates the least revenue is: ",df3['Revenue (Millions)'].idxmin())

The director who generates the least revenue is:  Adam Leon
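
A caveat on the minimum: `Revenue (Millions)` has missing values, and a grouped `sum` skips them, so a director whose movies all lack revenue data sums to 0 and may be picked by `idxmin`. Whether that applies to the director printed above is not checked here. A small sketch with made-up directors "A", "B" and "C" shows how `min_count=1` keeps such groups as `NaN`:

```python
import numpy as np
import pandas as pd

# Hypothetical data: director "A" has no known revenue at all.
movies = pd.DataFrame({
    "Director": ["A", "B", "B", "C"],
    "Revenue (Millions)": [np.nan, 5.0, 7.0, 3.0],
})
grouped = movies.groupby("Director")["Revenue (Millions)"]

default_sum = grouped.sum()             # the all-NaN group becomes 0.0
strict_sum = grouped.sum(min_count=1)   # the all-NaN group stays NaN

print(default_sum.idxmin())
print(strict_sum.idxmin())
```

`idxmin` skips `NaN` by default, so the second result only considers directors with at least one known revenue figure.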


## üêº üêº üêº üêº üêº üêº Bonus challenge. Freaky bonus

We want to know which movies might have hidden patterns in their description. One way to find out is to look for movies where the sum of all the numeric values in the SHA256 hash of the description string lies between their revenue and their number of votes.
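
The statement leaves some room for interpretation, so the sketch below fixes a few assumptions: "numeric values" means the decimal digit characters of the hexadecimal digest, the bounds are inclusive, revenue is compared as given (in millions), and rows with missing revenue count as no match. The function names and the sample row are hypothetical.

```python
import hashlib

import pandas as pd

def sha256_digit_sum(text: str) -> int:
    """Sum the decimal digits (0-9) found in the SHA256 hex digest of text."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return sum(int(ch) for ch in digest if ch.isdigit())

def has_hidden_pattern(description: str, revenue: float, votes: int) -> bool:
    """True when the digit sum lies between revenue and votes (inclusive)."""
    if pd.isna(revenue):
        return False  # assumption: unknown revenue means no match
    total = sha256_digit_sum(description)
    low, high = min(revenue, votes), max(revenue, votes)
    return bool(low <= total <= high)

# Hypothetical row; on the real frame the column names are the same.
movies = pd.DataFrame({
    "Description": ["A made-up description."],
    "Revenue (Millions)": [0.0],
    "Votes": [10_000],
})
movies["Hidden pattern?"] = movies.apply(
    lambda row: has_hidden_pattern(
        row["Description"], row["Revenue (Millions)"], row["Votes"]
    ),
    axis=1,
)
print(movies)
```

A hex digest has 64 characters, so the digit sum can never exceed 576; movies with revenue above that figure can only match if their vote count is lower than the sum.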