# TMDB Movie Data Sampling Script

## Author: Leonardo Ferreira

## Objective
Process TMDB movie data to create a balanced dataset of movie profitability from 2015 onwards.

## Key Steps
1. Calculate movie profits (revenue - budget)
2. Keep only movies released in 2015 or later
3. Sample 1000 movies across three profit categories:
   - Top 333 most profitable movies
   - 334 movies near average profitability
   - Bottom 333 least profitable movies
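
The steps above can be sketched as a single function. This is a minimal illustration on made-up data, not the notebook's actual code: the helper name `sample_by_profit`, its parameters, and the toy rows are all assumptions chosen for the example, while the column names (`Release Date`, `Budget`, `Revenue`) follow the source file.

```python
import pandas as pd


def sample_by_profit(movies, n_top, n_middle, n_bottom, min_year=2015):
    """Hypothetical helper: return top, near-mean, and bottom profit buckets."""
    out = movies.dropna().copy()
    out["Profit"] = out["Revenue"] - out["Budget"]
    # Unparsable dates become NaT and are excluded by the year filter below.
    out["Year"] = pd.to_datetime(out["Release Date"], errors="coerce").dt.year
    out = out[out["Year"] >= min_year].sort_values("Profit", ascending=False)
    # The mean is taken over the filtered rows only.
    out["Distance From Mean"] = (out["Profit"] - out["Profit"].mean()).abs()
    top = out.head(n_top)
    bottom = out.tail(n_bottom)
    middle = out.sort_values("Distance From Mean").head(n_middle)
    return pd.concat([top, middle, bottom])


# Toy data (invented for illustration only).
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D", "E", "F"],
    "Release Date": ["2016-01-01", "2017-05-05", "2014-03-03",
                     "2018-07-07", "2019-09-09", "2020-02-02"],
    "Budget": [10, 20, 5, 30, 40, 50],
    "Revenue": [110, 70, 500, 40, 30, 0],
})

sample = sample_by_profit(toy, n_top=1, n_middle=1, n_bottom=1)
print(sample[["title", "Profit", "Distance From Mean"]])
```

Movie "C" has the highest profit in the toy data but is dropped by the year filter, so it influences neither the buckets nor the mean.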

## Output
- File: `tmdb_1000_sample.csv`
- Contains a diverse representation of movie financial performance

## Data Processing
- Source: `tmdb_enriched_movies.csv`

## Considerations
- We focus only on recent movies (2015 onwards) because the goal is to analyze social media hype ahead of each movie's release

In [42]:
import pandas as pd

In [43]:
# read the TMDB data
df = pd.read_csv('tmdb_enriched_movies.csv')

print(df.shape)
df.head()

(13831, 17)


Unnamed: 0,movie_ID,IMDB_ID,title,vote_average,vote_count,status,Release Date,Budget,Revenue,Popularity,Runtime,Language,Genres,Production Companies,Director,Top Cast,Keywords
0,27205,tt1375666,Inception,8.369,37309,Released,2010-07-15,160000000,839030630,26.3533,148,en,"Action, Science Fiction, Adventure","Legendary Pictures, Syncopy, Warner Bros. Pict...",Christopher Nolan,"Leonardo DiCaprio, Joseph Gordon-Levitt, Ken W...","rescue, mission, dreams, airplane, paris, fran..."
1,157336,tt0816692,Interstellar,8.453,36903,Released,2014-11-05,165000000,746606706,44.7344,169,en,"Adventure, Drama, Science Fiction","Legendary Pictures, Syncopy, Lynda Obst Produc...",Christopher Nolan,"Matthew McConaughey, Anne Hathaway, Michael Caine","rescue, future, spacecraft, race against time,..."
2,155,tt0468569,The Dark Knight,8.519,33688,Released,2008-07-16,185000000,1004558444,34.4859,152,en,"Drama, Action, Crime, Thriller","Warner Bros. Pictures, Legendary Pictures, Syn...",Christopher Nolan,"Christian Bale, Heath Ledger, Aaron Eckhart","joker, sadism, chaos, secret identity, crime f..."
3,19995,tt0499549,Avatar,7.588,32126,Released,2009-12-15,237000000,2923706026,32.3911,162,en,"Action, Adventure, Fantasy, Science Fiction","Dune Entertainment, Lightstorm Entertainment, ...",James Cameron,"Sam Worthington, Zoe Saldaña, Sigourney Weaver","paraplegic, attachment to nature, culture clas..."
4,24428,tt0848228,The Avengers,7.735,31521,Released,2012-04-25,220000000,1518815515,35.1429,143,en,"Science Fiction, Action, Adventure",Marvel Studios,Joss Whedon,"Robert Downey Jr., Chris Evans, Mark Ruffalo","new york city, superhero, shield, based on com..."


In [44]:
# calculate profit for each movie
df['Profit'] = df['Revenue'] - df['Budget']

# sort by profit
df_by_profit = df.sort_values('Profit', ascending=False).reset_index(drop=True)

profit_stats = df['Profit'].describe()
print(profit_stats)

count    1.383100e+04
mean     3.180047e+07
std      1.140805e+08
min     -4.470000e+08
25%     -3.478500e+03
50%      1.000000e+04
75%      1.560251e+07
max      2.686706e+09
Name: Profit, dtype: float64


In [45]:
# Some movies have NaN fields and cannot be found online, so rows with any missing value are removed.
df = df.dropna()

# filter by date: keep only recent movies, since the analysis relies on social media hype
# convert release date to datetime (unparsable values become NaT)
df['Release Date'] = pd.to_datetime(df['Release Date'], errors='coerce')

# year column to filter
df['Year'] = df['Release Date'].dt.year

# filter by year
df = df[df['Year'] >= 2015].copy()

# mean profit of the filtered (2015+) movies
profit_mean = df['Profit'].mean()

# absolute distance of each movie's profit from that mean
df['Distance From Mean'] = abs(df['Profit'] - profit_mean)
df.head()

Unnamed: 0,movie_ID,IMDB_ID,title,vote_average,vote_count,status,Release Date,Budget,Revenue,Popularity,Runtime,Language,Genres,Production Companies,Director,Top Cast,Keywords,Profit,Year,Distance From Mean
5,293660,tt1431045,Deadpool,7.622,31501,Released,2016-02-09,58000000,782837347,17.4256,108,en,"Action, Adventure, Comedy","20th Century Fox, The Donners' Company, Genre ...",Tim Miller,"Ryan Reynolds, Morena Baccarin, Ed Skrein","superhero, anti hero, mercenary, based on comi...",724837347,2016,657956000.0
6,299536,tt4154756,Avengers: Infinity War,8.235,30394,Released,2018-04-25,300000000,2052415039,34.0939,149,en,"Adventure, Action, Science Fiction",Marvel Studios,Joe Russo,"Robert Downey Jr., Chris Evans, Chris Hemsworth","sacrifice, magic, superhero, based on comic, s...",1752415039,2018,1685534000.0
15,299534,tt4154796,Avengers: Endgame,8.238,26214,Released,2019-04-24,356000000,2799439100,26.9856,181,en,"Adventure, Science Fiction, Action",Marvel Studios,Anthony Russo,"Robert Downey Jr., Chris Evans, Mark Ruffalo","superhero, time travel, space travel, time mac...",2443439100,2019,2376558000.0
18,475557,tt7286456,Joker,8.139,26318,Released,2019-10-01,55000000,1078958629,19.7364,122,en,"Crime, Thriller, Drama","Warner Bros. Pictures, Joint Effort, Village R...",Todd Phillips,"Joaquin Phoenix, Robert De Niro, Zazie Beetz","dreams, street gang, society, psychopath, clow...",1023958629,2019,957077300.0
23,99861,tt2395427,Avengers: Age of Ultron,7.271,23350,Released,2015-04-22,365000000,1405403694,17.9154,141,en,"Action, Adventure, Science Fiction",Marvel Studios,Joss Whedon,"Robert Downey Jr., Chris Hemsworth, Mark Ruffalo","artificial intelligence (a.i.), saving the wor...",1040403694,2015,973522400.0
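
The date handling in the cell above relies on `errors='coerce'`, which turns values that cannot be parsed into `NaT` instead of raising. A small sketch of that behaviour, using invented date strings rather than rows from the dataset:

```python
import pandas as pd

# Invented examples: one valid recent date, one unparsable string,
# one missing value, and one valid date before the cutoff.
dates = pd.Series(["2016-02-09", "not a date", None, "2014-11-05"])

parsed = pd.to_datetime(dates, errors="coerce")  # bad or missing -> NaT
years = parsed.dt.year                           # NaT -> NaN
mask = years >= 2015                             # NaN compares as False

print(pd.DataFrame({"raw": dates, "parsed": parsed, "year": years, "keep": mask}))
```

Because a comparison against `NaN` is `False`, rows with unparsable dates fall out of the `>= 2015` filter without a separate cleanup step. In the notebook, missing dates are already removed by `dropna()` beforehand, so only malformed strings would reach this path.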


In [46]:
# sorting by profit
df = df.sort_values('Profit', ascending=False)

# get top 333 by profit
top_movies = df.head(333)

# get bottom 333 by profit
bottom_movies = df.tail(333)

In [47]:
# get movies with profit closest to the mean; we treat these as neither a success nor a failure
middle_movies = df.sort_values('Distance From Mean')
middle_movies = middle_movies.head(334)

middle_movies

Unnamed: 0,movie_ID,IMDB_ID,title,vote_average,vote_count,status,Release Date,Budget,Revenue,Popularity,Runtime,Language,Genres,Production Companies,Director,Top Cast,Keywords,Profit,Year,Distance From Mean
1564,303858,tt2241351,Money Monster,6.433,2996,Released,2016-05-12,27000000,93282604,2.2130,98,en,"Thriller, Drama","TriStar Pictures, LStar Capital, Smoke House P...",Jodie Foster,"George Clooney, Julia Roberts, Jack O'Connell","sniper, new york city, hostage, wall street, h...",66282604,2016,5.987163e+05
549,328387,tt3531824,Nerve,7.016,6822,Released,2016-06-27,19000000,85241496,4.4937,96,en,"Mystery, Adventure, Crime","Lionsgate, Allison Shearmur Productions, Keep ...",Henry Joost,"Emma Roberts, Dave Franco, Emily Meade","new york city, technology, internet, hacking, ...",66241496,2016,6.398243e+05
6955,441909,tt5956100,Tiger Zinda Hai,6.100,151,Released,2017-12-18,20000000,87530000,1.3889,160,hi,"Action, Adventure, Thriller","Yash Raj Films, Cine Tirol, NSA Picture",Ali Abbas Zafar,"Salman Khan, Katrina Kaif, Girish Karnad","hostage, hostage situation, iran hostage crisi...",67530000,2017,6.486797e+05
4016,256040,tt2631186,Bāhubali: The Beginning,7.511,835,Released,2015-07-10,25130430,90747520,5.2068,159,te,"Action, Adventure, Fantasy, Drama",Arka Media Works,S. S. Rajamouli,"Prabhas, Rana Daggubati, Tamannaah Bhatia","kingdom, medieval india, ancient india, bilingual",65617090,2015,1.264230e+06
352,515001,tt2584384,Jojo Rabbit,8.010,9744,Released,2019-10-18,14000000,82468705,8.7722,108,en,"Comedy, War, Drama","Fox Searchlight Pictures, Defender Films, Piki...",Taika Waititi,"Roman Griffin Davis, Thomasin McKenzie, Scarle...","based on novel or book, world war ii, jew pers...",68468705,2019,1.587385e+06
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1565,458897,tt5033998,Charlie's Angels,6.413,3004,Released,2019-11-14,48000000,73279888,5.3700,118,en,"Action, Adventure, Comedy","Columbia Pictures, Brownstone Productions, 2.0...",Elizabeth Banks,"Kristen Stewart, Naomi Scott, Ella Balinska","training, high technology, betrayal, female pr...",25279888,2019,4.160143e+07
7810,353501,tt5297256,Pulimurugan,6.336,55,Released,2016-10-07,3800000,29000000,0.4701,161,ml,"Action, Adventure",Mulakuppadam Films,Vysakh,"Mohanlal, Jagapati Babu, Lal","sibling relationship, hunter, tiger, fighter, ...",25200000,2016,4.168132e+07
5963,436391,tt5969696,The Battleship Island,7.100,265,Released,2017-07-26,21000000,46183383,1.8717,132,ko,"Action, History, Drama, War","Filmmaker R&K, CJ Entertainment, filmK",Ryoo Seung-wan,"Hwang Jung-min, So Ji-sub, Song Joong-ki","forced labour, korean resistance, japanese occ...",25183383,2017,4.169794e+07
841,316727,tt4094724,The Purge: Election Year,6.382,5295,Released,2016-06-29,10000000,118587880,5.6287,109,en,"Action, Horror, Thriller","Why Not Productions, Universal Pictures, Blumh...",James DeMonaco,"Elizabeth Mitchell, Frank Grillo, Mykelti Will...","dystopia, sequel, calvary, legalized murder, a...",108587880,2016,4.170656e+07


In [48]:
# combine the datasets
final_sample = pd.concat([top_movies, middle_movies, bottom_movies])

final_sample.head()

Unnamed: 0,movie_ID,IMDB_ID,title,vote_average,vote_count,status,Release Date,Budget,Revenue,Popularity,Runtime,Language,Genres,Production Companies,Director,Top Cast,Keywords,Profit,Year,Distance From Mean
15,299534,tt4154796,Avengers: Endgame,8.238,26214,Released,2019-04-24,356000000,2799439100,26.9856,181,en,"Adventure, Science Fiction, Action",Marvel Studios,Anthony Russo,"Robert Downey Jr., Chris Evans, Mark Ruffalo","superhero, time travel, space travel, time mac...",2443439100,2019,2376558000.0
281,76600,tt1630029,Avatar: The Way of Water,7.609,12445,Released,2022-12-14,460000000,2320250281,24.4638,192,en,"Science Fiction, Adventure, Action","20th Century Studios, Lightstorm Entertainment",James Cameron,"Sam Worthington, Zoe Saldaña, Sigourney Weaver","dying and death, loss of loved one, alien life...",1860250281,2022,1793369000.0
56,140607,tt2488496,Star Wars: The Force Awakens,7.3,19667,Released,2015-12-15,245000000,2068223624,13.0107,136,en,"Adventure, Action, Science Fiction","Lucasfilm Ltd., Bad Robot",J.J. Abrams,"Harrison Ford, Mark Hamill, Carrie Fisher","android, spacecraft, space opera, requel",1823223624,2015,1756342000.0
6,299536,tt4154756,Avengers: Infinity War,8.235,30394,Released,2018-04-25,300000000,2052415039,34.0939,149,en,"Adventure, Action, Science Fiction",Marvel Studios,Joe Russo,"Robert Downey Jr., Chris Evans, Chris Hemsworth","sacrifice, magic, superhero, based on comic, s...",1752415039,2018,1685534000.0
57,634649,tt10872600,Spider-Man: No Way Home,7.946,20666,Released,2021-12-15,200000000,1921847111,30.3377,148,en,"Action, Adventure, Science Fiction","Marvel Studios, Pascal Pictures, Columbia Pict...",Jon Watts,"Tom Holland, Zendaya, Benedict Cumberbatch","new york city, hero, showdown, magic, loss of ...",1721847111,2021,1654966000.0


In [49]:
# check whether any title appears more than once in the filtered data
has_duplicate_titles = not df["title"].is_unique
has_duplicate_titles

True
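
The check above looks at titles in `df`, and a repeated title does not necessarily mean a repeated row, since different films can share a name. A possible complementary check is sketched below on invented data. It assumes `movie_ID` uniquely identifies a film and that the three buckets could share rows. Neither assumption has been verified against this dataset.

```python
import pandas as pd

# Invented buckets: movie 2 appears in both top and middle,
# while "Alpha" is the title of two different films (IDs 1 and 4).
top = pd.DataFrame({"movie_ID": [1, 2], "title": ["Alpha", "Beta"]})
middle = pd.DataFrame({"movie_ID": [2, 3], "title": ["Beta", "Gamma"]})
bottom = pd.DataFrame({"movie_ID": [4, 5], "title": ["Alpha", "Delta"]})

combined = pd.concat([top, middle, bottom])

# Count rows whose ID was already seen earlier in the combined sample.
dup_ids = int(combined["movie_ID"].duplicated().sum())

# Keep the first occurrence of each film.
deduped = combined.drop_duplicates(subset="movie_ID")

print(dup_ids, len(deduped), deduped["title"].is_unique)
```

After deduplicating by ID, the titles are still not unique in this toy example, which shows why an ID-based check can be more reliable than a title-based one.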

In [50]:
# save the data
final_sample.to_csv('tmdb_1000_sample.csv', index=False)