In [1]:
import pandas as pd
import numpy as np

In [2]:
cars = pd.read_csv("https://raw.githubusercontent.com/juliandnl/redi_ss20/master/cars.csv")

In [3]:
cars.head()

Unnamed: 0,Make,Model,Year,Variant,Kms,Price,Doors,Kind,Location
0,Volkswagen,Vento,2012,2.5 Luxury 170cv,99950,360000,4.0,Sedán,Córdoba
1,Ford,Ranger,2012,2.3 Cd Xl Plus 4x2,140000,320000,2.0,Pick-Up,Entre Ríos
2,Volkswagen,Fox,2011,1.6 Trendline,132000,209980,5.0,Hatchback,Bs.as. G.b.a. Sur
3,Ford,Ranger,2017,3.2 Cd Xls Tdci 200cv Automática,13000,798000,4.0,Pick-Up,Neuquén
4,Volkswagen,Gol,2013,1.4 Power 83cv 3 p,107000,146000,3.0,Hatchback,Córdoba


# Part 1 - Transformations

The following exercises are about row-wise transformations of string columns. See the pandas [docs](https://pandas.pydata.org/docs/user_guide/text.html#) on working with string data and the list of [built-in functions](https://pandas.pydata.org/docs/reference/series.html#string-handling) for string columns. Note that the exercises cover different implementation variations.

As a rule of thumb it is better to use dataframe built-in functionality if available for the logic you want to implement. Lower-level abstractions like `apply` can help if the functionality is not available.

In [16]:
#Exercise 1. Create a new column `has_power` that indicates (with booleans) if the value in the `Variant` column contains the word "Power"
# 1. Use `apply` with a `def` function and anonymous `lambda` function
#do it at home
'''
def has_power(variant):
    # Receives one value of the `Variant` column, not the whole dataframe
    return "Power" in variant
cars['has_power'] = cars['Variant'].apply(has_power)
or with lambda
cars['has_power'] = cars['Variant'].apply(lambda x: "Power" in x)
'''
# 2. Use only dataframe built-in functionality
cars['has_power'] = cars['Variant'].str.contains("Power")
display(cars.head())

Unnamed: 0,Make,Model,Year,Variant,Kms,Price,price_euros,Doors,Kind,Location,has_power,abbrev_brand
0,Volkswagen,Vento,2012,2.5 Luxury 170cv,99950,360000,972.0,4.0,Sedán,Córdoba,False,VOL
1,Ford,Ranger,2012,2.3 Cd Xl Plus 4x2,140000,320000,864.0,2.0,Pick-Up,Entre Ríos,False,FOR
2,Volkswagen,Fox,2011,1.6 Trendline,132000,209980,566.946,5.0,Hatchback,Bs.as. G.b.a. Sur,False,VOL
3,Ford,Ranger,2017,3.2 Cd Xls Tdci 200cv Automática,13000,798000,2154.6,4.0,Pick-Up,Neuquén,False,FOR
4,Volkswagen,Gol,2013,1.4 Power 83cv 3 p,107000,146000,394.2,3.0,Hatchback,Córdoba,True,VOL


In [5]:
## Exercise 2
# Create an abbreviation of the `Make` of a car. Create a new column `abbrev_brand` that contains the first 3 letters in uppercase of the value from the `Make` columns. For example, if `Make` is `Chrysler`, `abbrev_brand` should be `CHR`.
# 1. Use `apply` with a regular `def` function and anonymous `lambda` function
'''
def abbrev(make):
    # Receives one value of the `Make` column, not the whole dataframe
    return make[:3].upper()
cars['abbrev_brand'] = cars['Make'].apply(abbrev)
or with lambda
cars['abbrev_brand'] = cars['Make'].apply(lambda x: x[:3].upper())
'''
# 2. Use only dataframe built-in functionality
cars['abbrev_brand'] = cars['Make'].str[:3].str.upper()
display(cars.head())

Unnamed: 0,Make,Model,Year,Variant,Kms,Price,Doors,Kind,Location,has_power,abbrev_brand
0,Volkswagen,Vento,2012,2.5 Luxury 170cv,99950,360000,4.0,Sedán,Córdoba,False,VOL
1,Ford,Ranger,2012,2.3 Cd Xl Plus 4x2,140000,320000,2.0,Pick-Up,Entre Ríos,False,FOR
2,Volkswagen,Fox,2011,1.6 Trendline,132000,209980,5.0,Hatchback,Bs.as. G.b.a. Sur,False,VOL
3,Ford,Ranger,2017,3.2 Cd Xls Tdci 200cv Automática,13000,798000,4.0,Pick-Up,Neuquén,False,FOR
4,Volkswagen,Gol,2013,1.4 Power 83cv 3 p,107000,146000,3.0,Hatchback,Córdoba,True,VOL


# Part 2 - Joins

The following exercises cover different join types. Check the [merge](https://pandas.pydata.org/docs/reference/api/pandas.merge.html) docs for help on the `how` parameter.
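
As a quick reference for the `how` parameter, here is a minimal sketch on made-up toy data. The frames `left` and `right` and their values are assumptions for illustration only; they merely mimic the roles of `cars` and `avg_model_prices`.

```python
import pandas as pd

# Toy stand-ins for `cars` and `avg_model_prices` (values are made up)
left = pd.DataFrame({"Make": ["Ford", "Ford", "Mercedes Benz"],
                     "Model": ["Ranger", "Ranger", "ML"]})
right = pd.DataFrame({"Make": ["Ford", "Ford"],
                      "Model": ["Ranger", "Lo"],
                      "avg_price": [1500.0, 158703.34]})

# Number of result rows for each join type
row_counts = {
    how: len(pd.merge(left, right, on=["Make", "Model"], how=how))
    for how in ["left", "right", "inner", "outer"]
}
print(row_counts)

# `indicator=True` adds a `_merge` column that tells where each row came from
outer = pd.merge(left, right, on=["Make", "Model"], how="outer", indicator=True)
print(outer[["Make", "Model", "_merge"]])
```

The `_merge` column takes the values `both`, `left_only` and `right_only`, which can make it easier to spot rows that exist in only one of the two dataframes.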

In [6]:
# We create the same dataframe as seen in the class to use for the joins. It is computed from `cars` and has the following properties
# - Average price **in euros €** of car models per brand (brand = Make). Columns: `Make, Model, avg_price`
# - 3 additional "invented" rows
# - All `Mercedes Benz` cars are removed
Price_Euro = cars['Price'] * 0.0027
cars.insert(6, 'price_euros', Price_Euro)

avg_model_prices = cars.groupby(['Make', 'Model'])['price_euros'].mean().reset_index()
avg_model_prices = avg_model_prices.rename(columns={'price_euros': 'avg_price'})
avg_model_prices = avg_model_prices.loc[avg_model_prices.loc[:, 'Make'] != 'Mercedes Benz', :]

invented_rows = pd.DataFrame(
    data = [('Ford', 'Lo', 158703.340), ('Ford', 'Hi', 324235.670), ('Ford', 'Cheap', 6533.700)],
    columns=['Make', 'Model', 'avg_price']
)
avg_model_prices = pd.concat([invented_rows, avg_model_prices], axis=0, ignore_index=True)

avg_model_prices.shape

(50, 3)

In [7]:
## Exercise 1 Do a RIGHT OUTER JOIN on the dataset and compare the result to the LEFT OUTER JOIN we've done during the class.
right_join = pd.merge(cars, avg_model_prices, on=['Make', 'Model'], how='right')
# 1. How do the `Mercedes Benz` rows compare?
'''display(right_join[right_join['Make'] == 'Mercedes Benz'])
They are missing from right_join, because avg_model_prices (the right side of the join) has no Mercedes Benz rows.'''
# 2. How do the "invented rows" from `avg_model_prices` compare?
'''They are present, because avg_model_prices is the right side and a right join keeps all of its rows. Their columns from cars are empty (NaN).'''
# 3. What's the shape of the result? Explain what you think it should be and then check it.
'''All columns of cars plus avg_price, and all rows except the Mercedes Benz ones.
display(cars.head(1), right_join.head(1))
display(cars.shape, avg_model_prices.shape, right_join.shape)
That gives 10000 - 256 Mercedes Benz rows + 3 invented rows = 9747 rows.'''
# 4. How can you achieve the behaviour of the LEFT OUTER JOIN from the class, but doing the RIGHT OUTER JOIN (`merge` with the `how='right'` parameter)?
'''right_join = pd.merge(avg_model_prices, cars, on=['Make', 'Model'], how='right')'''

"right_join = pd.merge(avg_model_prices, cars, on=['Make', 'Model'], how='right')"

## Exercise 2
Do a FULL OUTER JOIN and an INNER JOIN on the datasets and answer the following questions.
  1. Check the shape of the result and compare it to the shape of cars/avg_model_prices.
    * What do you observe and how can you explain it? The full outer join has shape (10003, 13): all 10000 rows of cars plus the 3 invented rows, and all columns of cars plus `avg_price`.
  2. Check how the `Mercedes Benz` rows look in the result and explain why. They are in the full outer join (with an empty `avg_price`) but not in the inner join, because an inner join keeps only rows whose `Make`/`Model` appear in both dataframes, and `avg_model_prices` has no Mercedes Benz rows.
  3. Check how the "invented rows" from `avg_model_prices` look in the result and explain why. They appear only in the full outer join (with empty `cars` columns); the inner join drops them because they have no match in cars.

In [8]:
inner_join = pd.merge(cars, avg_model_prices, on=['Make', 'Model'], how='inner')
full_outer_join = pd.merge(cars, avg_model_prices, on=['Make', 'Model'], how='outer')

In [9]:
full_outer_join.tail()
# display(full_outer_join[full_outer_join['Make'] == 'Mercedes Benz'])  # 256 rows

Unnamed: 0,Make,Model,Year,Variant,Kms,Price,price_euros,Doors,Kind,Location,has_power,abbrev_brand,avg_price
9998,Mercedes Benz,ML,2013.0,3.5 Ml350 4matic Sport B.efficiency,119000.0,49000.0,132.3,5.0,SUV,Bs.as. G.b.a. Norte,False,MER,
9999,Volkswagen,New Beetle,2011.0,1.8 Turbo Sport,82000.0,299999.0,809.9973,2.0,Cabriolet,Capital Federal,False,VOL,809.9973
10000,Ford,Lo,,,,,,,,,,,158703.34
10001,Ford,Hi,,,,,,,,,,,324235.67
10002,Ford,Cheap,,,,,,,,,,,6533.7


In [10]:
display(cars.shape, avg_model_prices.shape, right_join.shape)

(10000, 12)

(50, 3)

(9747, 13)

In [11]:
display(inner_join)
# display(inner_join[inner_join['Make'] == 'Mercedes Benz'])  # empty: an inner join keeps only rows present in both dataframes

Unnamed: 0,Make,Model,Year,Variant,Kms,Price,price_euros,Doors,Kind,Location,has_power,abbrev_brand,avg_price
0,Volkswagen,Vento,2012,2.5 Luxury 170cv,99950,360000,972.0000,4.0,Sedán,Córdoba,False,VOL,1042.82909
1,Volkswagen,Vento,2013,2.0 Sportline Tsi 200cv,68200,530000,1431.0000,4.0,Sedán,Bs.as. G.b.a. Sur,False,VOL,1042.82909
2,Volkswagen,Vento,2016,2.5 Advance Plus 170cv Tiptronic,59000,500000,1350.0000,4.0,Sedán,Santa Fe,False,VOL,1042.82909
3,Volkswagen,Vento,2014,2.5 Luxury 170cv Tiptronic,65000,470000,1269.0000,4.0,Sedán,Córdoba,False,VOL,1042.82909
4,Volkswagen,Vento,2013,2.5 Luxury 170cv Tiptronic,95000,387500,1046.2500,4.0,Sedán,Bs.as. G.b.a. Norte,False,VOL,1042.82909
...,...,...,...,...,...,...,...,...,...,...,...,...,...
9739,Volkswagen,Virtus,2018,1.6 Msi Trendline,9900,499000,1347.3000,4.0,Sedán,Capital Federal,False,VOL,1412.10000
9740,Volkswagen,Virtus,2018,1.6 Msi Trendline,15000,520000,1404.0000,4.0,Sedán,Bs.as. G.b.a. Norte,False,VOL,1412.10000
9741,Volkswagen,Tiguan Allspace,2018,2.0 Tsi Highline Dsg,16000,45000,121.5000,5.0,SUV,Buenos Aires Interior,False,VOL,1829.25000
9742,Volkswagen,Tiguan Allspace,2018,1.4 Tsi Trendline 150cv Dsg,10000,1310000,3537.0000,5.0,SUV,Bs.as. G.b.a. Oeste,False,VOL,1829.25000


## Exercise 3
Compute the price difference (based on the euros price) of each car, compared to the average price of each brand (`Make`). Do a join as part of the solution. Which type of join do you use? Explain your choice.
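
One possible approach, sketched on made-up toy data rather than the real `cars` dataframe. The names `cars_toy`, `avg_make_prices`, `avg_make_price` and `price_diff` are assumptions chosen for this sketch. It uses a left join so that every car is kept; since the averages are computed from the same dataframe, an inner join would give the same rows here.

```python
import pandas as pd

# Toy stand-in for `cars` (prices are made up)
cars_toy = pd.DataFrame({
    "Make": ["Ford", "Ford", "Volkswagen"],
    "Model": ["Ranger", "Ka", "Gol"],
    "price_euros": [1000.0, 500.0, 400.0],
})

# Average price per brand, as a dataframe with one row per `Make`
avg_make_prices = (
    cars_toy.groupby("Make")["price_euros"].mean()
    .reset_index()
    .rename(columns={"price_euros": "avg_make_price"})
)

# Left join: every car keeps its row and gets the average of its brand
with_avg = pd.merge(cars_toy, avg_make_prices, on="Make", how="left")

# Positive values mean the car is more expensive than its brand average
with_avg["price_diff"] = with_avg["price_euros"] - with_avg["avg_make_price"]
print(with_avg)
```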

# Bonus Dataset: IMDB movie dataset
We'll look at the movie datasets provided by [IMDb](https://www.imdb.com/). The datasets are described [here](https://developer.imdb.com/non-commercial-datasets/). We'll work with a sample of the `title.basics` dataset (which includes only the top 10,000 most-voted movies) and the `title.ratings` dataset.


In [30]:
movies = pd.read_csv("https://raw.githubusercontent.com/obreit/redi/master/imdb_data/top_voted.tsv", sep='\t', header=0)

## Exercise 1
The goal is to get long-running movies (i.e. with `runtimeMinutes` more than 2 hours). Notice that the obvious expression `movies['runtimeMinutes'] > 120` crashes. Try it yourself and try to explain why this is not working (for a hint, see the [details](https://developer.imdb.com/non-commercial-datasets/#imdb-dataset-details) section of the imdb page). Before going to the next hints, try to implement a solution to this problem by yourself.

In [82]:
movies.info() 
display(movies.head())


<class 'pandas.core.frame.DataFrame'>
Index: 10001 entries, 0 to runtimeMinutes
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   tconst          10001 non-null  object 
 1   titleType       10001 non-null  object 
 2   primaryTitle    10001 non-null  object 
 3   originalTitle   10001 non-null  object 
 4   isAdult         10001 non-null  object 
 5   startYear       10001 non-null  object 
 6   endYear         10001 non-null  object 
 7   runtimeMinutes  9998 non-null   float64
 8   genres          10001 non-null  object 
dtypes: float64(1), object(8)
memory usage: 1.0+ MB


Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
0,tt0111161,movie,The Shawshank Redemption,The Shawshank Redemption,0,1994,\N,142.0,Drama
1,tt0468569,movie,The Dark Knight,The Dark Knight,0,2008,\N,152.0,"Action,Crime,Drama"
2,tt1375666,movie,Inception,Inception,0,2010,\N,148.0,"Action,Adventure,Sci-Fi"
3,tt0137523,movie,Fight Club,Fight Club,0,1999,\N,139.0,Drama
4,tt0109830,movie,Forrest Gump,Forrest Gump,0,1994,\N,142.0,"Drama,Romance"


In [79]:
movies['runtimeMinutes'] = movies['runtimeMinutes'].replace('\\N', pd.NA)

In [80]:
movies['runtimeMinutes'] = pd.to_numeric(movies['runtimeMinutes'], errors='coerce')

In [83]:
display(movies[movies['runtimeMinutes'] > 120])

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
0,tt0111161,movie,The Shawshank Redemption,The Shawshank Redemption,0,1994,\N,142.0,Drama
1,tt0468569,movie,The Dark Knight,The Dark Knight,0,2008,\N,152.0,"Action,Crime,Drama"
2,tt1375666,movie,Inception,Inception,0,2010,\N,148.0,"Action,Adventure,Sci-Fi"
3,tt0137523,movie,Fight Club,Fight Club,0,1999,\N,139.0,Drama
4,tt0109830,movie,Forrest Gump,Forrest Gump,0,1994,\N,142.0,"Drama,Romance"
...,...,...,...,...,...,...,...,...,...
9989,tt15680228,movie,Bheed,Bheed,0,2023,\N,124.0,"Drama,History"
9990,tt0099426,movie,Bullet in the Head,Dip huet gai tau,0,1990,\N,136.0,"Action,Crime,Drama"
9994,tt5456546,movie,Judwaa 2,Judwaa 2,0,2017,\N,145.0,"Action,Comedy"
9996,tt6793580,movie,Champions,Campeones,0,2018,\N,124.0,"Comedy,Drama,Family"


The `runtimeMinutes` column isn't of numeric type, because it contains `\N` values that can't be interpreted as numbers. So we need to somehow update those values and make a numeric column out of it.
1. Replace all rows where `runtimeMinutes` contains the imdb encoding for a missing value (`\N`) with `pd.NA` (the pandas type which represents a missing/null value). Note that you might run into some syntax issues. Try to find out how to overcome those.
2. Try to cast the column to a numeric type like this: `.astype(int)`. Why doesn't it work?
3. Try to find alternative ways to make the casting work.
4. Filter the movies that are longer than 2 hours
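
The four steps above can be sketched on a tiny made-up dataframe. `movies_toy` and its values are assumptions for illustration; only the `\N` marker and the column name come from the real data.

```python
import pandas as pd

# Toy stand-in for `movies`; "\\N" is the Python spelling of the two characters \N
movies_toy = pd.DataFrame({
    "primaryTitle": ["A", "B", "C"],
    "runtimeMinutes": ["142", "\\N", "95"],
})

# Step 1: replace the IMDb missing-value marker with pd.NA
runtime = movies_toy["runtimeMinutes"].replace("\\N", pd.NA)

# Step 2: a plain int column cannot hold a missing value, so this cast fails
try:
    runtime.astype(int)
    cast_failed = False
except (TypeError, ValueError):
    cast_failed = True
print("astype(int) failed:", cast_failed)

# Step 3: to_numeric converts the strings and leaves missing values as NaN
movies_toy["runtimeMinutes"] = pd.to_numeric(runtime, errors="coerce")

# Step 4: comparisons with NaN are False, so missing runtimes are dropped
long_movies = movies_toy[movies_toy["runtimeMinutes"] > 120]
print(long_movies)
```

The result of step 3 is a float column, because NaN is a float. pandas also offers the nullable integer dtype `Int64`, which may be an alternative if integer values are preferred.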

## Exercise 2
Count how many movies there are per genre. Notice that the `genres` column contains the information about the genre of a movie. But a movie can be assigned to multiple genres. We want to count the movie for every genre.

For example, if we have two movies `sad movie with genres: drama, action` and `dramedy with genres: drama, comedy` then the result counts should be `action: 1, comedy: 1, drama: 2`.

However, the `genres` column is not in a format that makes this counting easy (it's simply a string column). You'll need to transform the `genres` column in a way to simplify the counting. Part of this transformation will be to use the [explode](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.explode.html) function.   
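
A minimal sketch of this transformation, using the two example movies from above as made-up data (`movies_toy` is a hypothetical stand-in for `movies`):

```python
import pandas as pd

# The two example movies from the text
movies_toy = pd.DataFrame({
    "primaryTitle": ["sad movie", "dramedy"],
    "genres": ["drama,action", "drama,comedy"],
})

# Split the comma-separated string into a list, then give each genre its own row
genre_rows = movies_toy.assign(
    genres=movies_toy["genres"].str.split(",")
).explode("genres")

# Count the rows per genre; sort_index orders the genres alphabetically
genre_counts = genre_rows["genres"].value_counts().sort_index()
print(genre_counts)
```

After `explode`, a movie with two genres appears in two rows, so counting rows per genre counts the movie once for each of its genres.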

## Exercise 3
Find out some information about the movie ratings. Specifically, answer the following questions.
1. What's the movie with the highest average rating?
2. What's the movie with the lowest average rating among the movies with at least 1 million votes? (If there are multiple with the same rating, return all of them.)
3. Look at the last 10 years and count the number of movies per year as well as the average rating per year

In order to answer the questions, you need to combine the `ratings` dataset with the movies dataset.
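
A rough sketch of how the combination could look, on made-up toy data. The column names `tconst`, `averageRating` and `numVotes` follow the IMDb dataset description; `movies_toy`, `ratings_toy` and all values are assumptions for illustration.

```python
import pandas as pd

# Toy stand-ins for `movies` and `ratings` (values are made up)
movies_toy = pd.DataFrame({
    "tconst": ["tt1", "tt2", "tt3"],
    "primaryTitle": ["A", "B", "C"],
    "startYear": [2020, 2021, 2021],
})
ratings_toy = pd.DataFrame({
    "tconst": ["tt1", "tt2", "tt3", "tt4"],
    "averageRating": [9.0, 6.5, 7.5, 8.0],
    "numVotes": [2_000_000, 1_500_000, 900, 10],
})

# Inner join on the title id: keep only movies that have a rating
rated = pd.merge(movies_toy, ratings_toy, on="tconst", how="inner")

# 1. Highest average rating (keeps all movies that tie for the maximum)
best = rated[rated["averageRating"] == rated["averageRating"].max()]

# 2. Lowest average rating among movies with at least 1 million votes
popular = rated[rated["numVotes"] >= 1_000_000]
worst_popular = popular[popular["averageRating"] == popular["averageRating"].min()]

# 3. Number of movies and average rating per year
per_year = rated.groupby("startYear").agg(
    n_movies=("tconst", "count"),
    avg_rating=("averageRating", "mean"),
)
print(best, worst_popular, per_year, sep="\n\n")
```

In the real data `startYear` is loaded as an object column (see the `info()` output above), so it would likely need the same numeric conversion as `runtimeMinutes` before filtering to the last 10 years; that filter is left out of this sketch.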

In [13]:
ratings = pd.read_csv("https://datasets.imdbws.com/title.ratings.tsv.gz", sep='\t', header=0)