# INFO 2950 Final Project - Phase II

# Research Questions:
How do different genres correlate with IMDb ratings over time? Do movies that belong to multiple genres have higher or lower ratings than those classified under a single genre?

In this assignment, we explore the correlation between movie genres and IMDb ratings, focusing on films released in 2000 or later. We examine whether a movie's genre influences the rating it receives and whether the number of genres a movie is associated with has a positive or negative effect on its rating. Our analysis will involve training a linear regression model to predict a movie's IMDb rating from its genres and the number of genres it belongs to, and examining the weight the model assigns to each.

# Data Description:

For this project, we are using two datasets from IMDb: title.basics.tsv.gz and title.ratings.tsv.gz, sourced from IMDb's dataset repository. These datasets provide information about movie titles, their ratings, and other attributes, allowing us to explore the relationship between genres and IMDb ratings.

title.basics.tsv.gz: This file contains basic information about titles. The relevant columns are listed below:

- tconst: A unique identifier for each title.
- titleType: The format of the title (e.g., movie, short, tvseries).
- primaryTitle: The title commonly used for promotional purposes.
- startYear: The year the title was released.
- runtimeMinutes: The duration of the title in minutes.
- genres: Up to three genres associated with the title.
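
Because `genres` packs up to three labels into one comma-separated string, both quantities in our research question (which genres a movie has, and how many) have to be derived from it. A minimal sketch on made-up rows; the column name `numGenres` and the toy values are assumptions for illustration, not part of the IMDb data:

```python
import pandas as pd

# Toy rows mimicking the genres column; '\N' marks a missing value.
toy = pd.DataFrame(
    {
        "tconst": ["tt9990001", "tt9990002", "tt9990003"],
        "genres": ["Comedy,Fantasy,Romance", "Drama", "\\N"],
    }
)

# Treat '\N' as missing, then split the remaining strings on commas.
genres = toy["genres"].where(toy["genres"] != "\\N")
genre_lists = genres.str.split(",")

# Number of genres per title (0 when the field is missing).
toy["numGenres"] = genre_lists.str.len().fillna(0).astype(int)

# One indicator column per genre; rows with missing genres are all zeros.
genre_dummies = genres.str.get_dummies(sep=",")

print(toy)
print(genre_dummies)
```

The indicator columns and the count could later serve as inputs to a regression, but how missing genres should be treated there is a separate modelling decision.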

title.ratings.tsv.gz: This file contains user rating information for each title, including:
- tconst: A unique identifier matching that in title.basics.tsv.gz.
- averageRating: The average IMDb rating for the title.
- numVotes: The number of votes used to calculate the average rating.

Link to the datasets: https://datasets.imdbws.com/

## Data Cleaning:

The original data files come in tab-separated values (TSV) format, a simple, text-based file format for storing tabular data. We didn't have to convert them to CSV because TSV is an acceptable data format that pandas can read directly.

Each dataset is contained in a gzipped, tab-separated-values (TSV) formatted file in the UTF-8 character set. The first line in each file contains headers that describe what is in each column. A '\N' is used to denote that a particular field is missing or null for that title/name. IMDb provides 7 TSV files, covering actors, movies, TV series, and other related information.
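
The format described above maps onto a few `pandas.read_csv` options. The sketch below is illustrative only: it writes a tiny made-up gzipped TSV to a temporary directory (the file name `sample.tsv.gz` and its two rows are assumptions, not IMDb data) and reads it back with `\N` treated as missing.

```python
import gzip
import os
import tempfile

import pandas as pd

# Two made-up rows in the IMDb layout: tab-separated, header first, '\N' for nulls.
sample = (
    "tconst\tstartYear\tgenres\n"
    "tt9990001\t2005\tComedy,Drama\n"
    "tt9990002\t\\N\t\\N\n"
)

tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, "sample.tsv.gz")  # hypothetical file name
with gzip.open(sample_path, "wt", encoding="utf-8") as handle:
    handle.write(sample)

# sep="\t" parses the TSV, na_values turns '\N' into NaN, and the .gz
# extension lets pandas infer gzip compression.
sample_df = pd.read_csv(sample_path, sep="\t", na_values="\\N", encoding="utf-8")
print(sample_df)
print(sample_df.isna().sum())
```

Reading `\N` as NaN at load time is one option; the notebook below instead keeps `\N` as a literal string and filters on it explicitly.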


## Our Python imports

In [32]:
import numpy as np
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
import duckdb

# Importing our dataset:


In [33]:
# Load the IMDb title basics data
movies = pd.read_csv('data/title.basics.tsv', delimiter="\t")




Here we look at the first 10 rows of our dataset so we can decide how to clean it.

In [34]:
print(movies.shape)
movies.head(10)

(11169285, 9)


Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
0,tt0000001,short,Carmencita,Carmencita,0,1894,\N,1,"Documentary,Short"
1,tt0000002,short,Le clown et ses chiens,Le clown et ses chiens,0,1892,\N,5,"Animation,Short"
2,tt0000003,short,Poor Pierrot,Pauvre Pierrot,0,1892,\N,5,"Animation,Comedy,Romance"
3,tt0000004,short,Un bon bock,Un bon bock,0,1892,\N,12,"Animation,Short"
4,tt0000005,short,Blacksmith Scene,Blacksmith Scene,0,1893,\N,1,"Comedy,Short"
5,tt0000006,short,Chinese Opium Den,Chinese Opium Den,0,1894,\N,1,Short
6,tt0000007,short,Corbett and Courtney Before the Kinetograph,Corbett and Courtney Before the Kinetograph,0,1894,\N,1,"Short,Sport"
7,tt0000008,short,Edison Kinetoscopic Record of a Sneeze,Edison Kinetoscopic Record of a Sneeze,0,1894,\N,1,"Documentary,Short"
8,tt0000009,movie,Miss Jerry,Miss Jerry,0,1894,\N,45,Romance
9,tt0000010,short,Leaving the Factory,La sortie de l'usine Lumière à Lyon,0,1895,\N,1,"Documentary,Short"


In the following cell we filter out rows that are irrelevant to our research question. We drop titles whose startYear is missing (marked as "\N" in the original dataset). Since our focus is on movies, we also drop short films, TV series, and the other title types in the original dataset. Finally, we limit our analysis to movies released in 2000 or later so that it is more relevant to our generation. Missing values in other columns, such as runtimeMinutes and genres, are left in place for now.

Because the dataset is very large, we apply several filtering steps to obtain a clean and much smaller dataset that is easier to analyze in later parts of this project. We save the filtered dataset to the CSV file "cleaned_data.csv" and reload it later, including for the combined analysis with the movie ratings dataset.


In [35]:
# Drop rows where startYear is missing ('\N'); .copy() avoids a
# SettingWithCopyWarning when the column is converted below
df = movies[movies['startYear'] != '\\N'].copy()

# Convert startYear from string to int
df['startYear'] = df['startYear'].astype(int)

# Keep only movies (drops shorts, TV series, and other title types)
df = df[df['titleType'] == "movie"]

# Keep only titles released in 2000 or later
df = df[df['startYear'] >= 2000]

# Save the cleaned data for later analysis
df.to_csv('cleaned_data.csv', index=False)
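
With more than eleven million rows, the same filters can also be applied while the file is being read, so the full table never has to sit in memory at once. This is an alternative technique, not what this notebook does; the helper name `filter_movies_in_chunks`, the chunk size, and the toy file below are assumptions for illustration.

```python
import os
import tempfile

import pandas as pd


def filter_movies_in_chunks(path, min_year=2000, chunksize=500_000):
    """Hypothetical helper: keep movies with a known startYear >= min_year."""
    kept = []
    # chunksize makes read_csv yield DataFrames of at most that many rows.
    reader = pd.read_csv(path, sep="\t", na_values="\\N", dtype=str, chunksize=chunksize)
    for chunk in reader:
        # Missing or malformed years become NaN and fail the comparison below.
        years = pd.to_numeric(chunk["startYear"], errors="coerce")
        kept.append(chunk[(chunk["titleType"] == "movie") & (years >= min_year)])
    result = pd.concat(kept, ignore_index=True)
    result["startYear"] = result["startYear"].astype(int)
    return result


# Toy file standing in for title.basics.tsv (made-up rows).
toy_tsv = (
    "tconst\ttitleType\tstartYear\tgenres\n"
    "tt9990001\tshort\t2004\tShort\n"
    "tt9990002\tmovie\t1999\tDrama\n"
    "tt9990003\tmovie\t\\N\tComedy\n"
    "tt9990004\tmovie\t2010\tAction,Crime\n"
)
toy_path = os.path.join(tempfile.mkdtemp(), "toy_basics.tsv")
with open(toy_path, "w", encoding="utf-8") as handle:
    handle.write(toy_tsv)

filtered = filter_movies_in_chunks(toy_path, chunksize=2)
print(filtered)
```

Only the rows that pass the filters are kept from each chunk, so peak memory depends on the chunk size rather than on the size of the whole file.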


Here we reload the cleaned dataset from the CSV file we saved above.

In [36]:
cleaned_movies_df = pd.read_csv('cleaned_data.csv')

Here, we show the first 10 entries of our new cleaned dataset to visualize it and make sure we did our cleaning right:

In [37]:
cleaned_movies_df.head(10)

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
0,tt0011801,movie,Tötet nicht mehr,Tötet nicht mehr,0,2019,\N,\N,"Action,Crime"
1,tt0015414,movie,La tierra de los toros,La tierra de los toros,0,2000,\N,60,\N
2,tt0035423,movie,Kate & Leopold,Kate & Leopold,0,2001,\N,118,"Comedy,Fantasy,Romance"
3,tt0062336,movie,The Tango of the Widower and Its Distorting Mi...,El tango del viudo y su espejo deformante,0,2020,\N,70,Drama
4,tt0067758,movie,"Simón, contamos contigo","Simón, contamos contigo",0,2015,\N,81,"Comedy,Drama"
5,tt0069049,movie,The Other Side of the Wind,The Other Side of the Wind,0,2018,\N,122,Drama
6,tt0070596,movie,Socialist Realism,El realismo socialista,0,2023,\N,78,Drama
7,tt0077684,movie,Histórias de Combóios em Portugal,Histórias de Combóios em Portugal,0,2022,\N,46,Documentary
8,tt0082328,movie,Embodiment of Evil,Encarnação do Demônio,0,2008,\N,94,Horror
9,tt0088751,movie,The Naked Monster,The Naked Monster,0,2005,\N,100,"Comedy,Horror,Sci-Fi"
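
One way to check the cleaning beyond inspecting the first rows is to assert the filter conditions directly. A small sketch on a stand-in built from the first three rows shown above (the name `check_df` is an assumption for illustration); it also counts the `\N` placeholders that remain in columns we did not filter on:

```python
import pandas as pd

# Stand-in with the relevant columns of the first three cleaned rows.
check_df = pd.DataFrame(
    {
        "titleType": ["movie", "movie", "movie"],
        "startYear": [2019, 2000, 2001],
        "runtimeMinutes": ["\\N", "60", "118"],
        "genres": ["Action,Crime", "\\N", "Comedy,Fantasy,Romance"],
    }
)

# Both filters should hold for every row.
assert (check_df["titleType"] == "movie").all()
assert (check_df["startYear"] >= 2000).all()

# Count the '\N' placeholders left in each column.
remaining_nulls = (check_df == "\\N").sum()
print(remaining_nulls)
```

Running the same checks on the full `cleaned_movies_df` would show how many movies still lack a runtime or a genre, which matters for any analysis based on genres.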


Here we check the size of our newly cleaned dataset. It is much smaller than the original: 338,244 rows compared with 11,169,285.

In [38]:
cleaned_movies_df.shape

(338244, 9)

Here we import another file, which contains the average rating and the number of votes for each title.

In [39]:
ratings_df = pd.read_csv("data/title.ratings.tsv", delimiter = "\t")

Here we print the first 10 rows of ratings_df to visualize it better:


In [40]:
print(ratings_df.shape)
ratings_df.head(10)

(1488213, 3)


Unnamed: 0,tconst,averageRating,numVotes
0,tt0000001,5.7,2096
1,tt0000002,5.6,283
2,tt0000003,6.5,2103
3,tt0000004,5.4,183
4,tt0000005,6.2,2839
5,tt0000006,5.0,197
6,tt0000007,5.4,889
7,tt0000008,5.4,2243
8,tt0000009,5.4,215
9,tt0000010,6.8,7728


Here we combine both tables using SQL (via DuckDB) so that further analysis can use each movie's rating together with the details in cleaned_movies_df.

In [41]:

joined_df = duckdb.sql("SELECT * FROM cleaned_movies_df LEFT JOIN ratings_df ON cleaned_movies_df.tconst = ratings_df.tconst").to_df()
joined_df.head(10)

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres,tconst_1,averageRating,numVotes
0,tt0757139,movie,49 Days,Sai chiu,0,2006,\N,93,"Drama,Mystery,Thriller",tt0757139,5.1,251.0
1,tt0757157,movie,A Millionaire's First Love,Baekmanjangja-ui cheot-sarang,0,2006,\N,116,"Drama,Romance",tt0757157,7.2,5439.0
2,tt0757165,movie,Between the Lines - Indiens drittes Geschlecht,Between the Lines - Indiens drittes Geschlecht,0,2005,\N,95,Documentary,tt0757165,7.4,48.0
3,tt0757166,movie,Big Time,Big Time,0,2005,\N,105,"Action,Comedy",tt0757166,7.9,42.0
4,tt0757171,movie,Bye Bye Life,Bye Bye Life,0,2008,\N,90,Documentary,tt0757171,7.3,18.0
5,tt0757193,movie,Esperanza,Esperanza,0,2006,\N,90,"Comedy,Drama",tt0757193,6.0,34.0
6,tt0757194,movie,Forbidden Quest,Eum-lan-seo-seng,0,2006,\N,142,"Comedy,Drama,Romance",tt0757194,6.2,419.0
7,tt0757201,movie,Fracassés,Fracassés,0,2008,\N,88,Comedy,tt0757201,4.6,25.0
8,tt0757210,movie,Oh! My God,Guseju,0,2006,\N,104,"Action,Comedy,Romance",tt0757210,5.2,239.0
9,tt0757214,movie,All the Invisible Things,Heile Welt,0,2007,\N,89,"Crime,Drama",tt0757214,6.1,109.0
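
Because this is a LEFT JOIN, every movie is kept even when it has no entry in the ratings table; those rows get missing values for `averageRating` and `numVotes`, which is likely why `numVotes` appears as a float above. The same kind of join can be written with `DataFrame.merge`, sketched here on made-up rows (the toy frames are assumptions for illustration only):

```python
import pandas as pd

movies_toy = pd.DataFrame(
    {"tconst": ["tt9990001", "tt9990002"], "genres": ["Drama", "Comedy"]}
)
ratings_toy = pd.DataFrame(
    {"tconst": ["tt9990001"], "averageRating": [7.2], "numVotes": [5439]}
)

# how="left" keeps every movie; indicator=True adds a _merge column that
# says whether a matching ratings row was found.
merged = movies_toy.merge(ratings_toy, on="tconst", how="left", indicator=True)
print(merged)

# Movies without any rating have NaN in averageRating.
unrated = merged["averageRating"].isna().sum()
print(f"Unrated movies: {unrated}")
```

Unlike the SQL version, merging on `tconst` keeps a single key column instead of producing a duplicate `tconst_1`. Whether unrated movies should be dropped before modelling is a choice for the later analysis.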
