
# INFO 2950 Final Project - Phase IV



## Research Question:

Are movies from different genres in the IMDb dataset more influenced by longer runtimes or longer movie titles?

Movies combine artistic creativity with audience appeal, and many factors contribute to their identity. Among these, runtime and title length can shape how a film is perceived and received. This investigation uses the IMDb dataset to explore whether movies from different genres are more influenced by their runtime or by the length of their titles. By analyzing these patterns, we aim to see how genres balance storytelling duration with the impact of a compelling title, and how these film characteristics align with genre expectations and audience engagement. We also hope to discover patterns in how the public interacts with these movies: do audiences vote more for movies with longer titles or longer runtimes? How does this interplay change across genres? And how do movies that belong to multiple genres differ from the patterns observed in movies categorized under a single genre?


# Data Description and Cleaning


For this project, we are using **two** datasets from IMDb: title.basics.tsv and title.ratings.tsv, sourced from IMDb's dataset repository. These datasets provide information about movie titles, their ratings, and other attributes, allowing us to explore the relationship between genres and IMDb ratings in relation to our question.

Attributes of file 1 (title.basics.tsv):
- tconst: A unique identifier for each title.
- primaryTitle: The title commonly used for promotional purposes.
- startYear: The year the title was released.
- runtimeMinutes: The duration of the title in minutes.
- genres: Up to three genres associated with the title.

Data origin: Both datasets were created and funded by IMDb and are made available to customers for personal and non-commercial use.

Attributes of file 2:
- averageRating: The average IMDb rating for the title.
- numVotes: The number of votes used to calculate the average rating.

Data cleaning: The IMDb datasets are distributed as well-formatted TSV files, so no additional steps were needed to read them. For processing, we restricted the data to movies released after the year 2000 to keep rating standards consistent, because the audiences rating these movies likely belong to the same or adjacent generations. We kept only the fields relevant to our analysis: movie title, genre, runtime in minutes, release year, average rating, and number of votes. We also kept the title ID (tconst) so that records from the two datasets could be joined.
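
The cleaning steps described above could be sketched roughly as follows. This is a minimal illustration and not the notebook linked below: the function name `clean_and_join`, the `titleType == "movie"` filter, and the definition of `titleLength` as the character count of `primaryTitle` are all assumptions made for the example. The tiny frames at the bottom are made-up rows used only to show the function running.

```python
import pandas as pd

def clean_and_join(basics: pd.DataFrame, ratings: pd.DataFrame,
                   min_year: int = 2000) -> pd.DataFrame:
    """Keep movies released after `min_year` and attach their ratings."""
    keep = ["tconst", "primaryTitle", "startYear", "runtimeMinutes", "genres"]
    # Assumption: title.basics.tsv has a titleType column that marks movies
    movies = basics.loc[basics["titleType"] == "movie", keep].copy()

    # IMDb writes missing values as the literal string "\N"; coerce them to NaN
    for col in ["startYear", "runtimeMinutes"]:
        movies[col] = pd.to_numeric(movies[col], errors="coerce")
    movies = movies[movies["genres"] != "\\N"]
    movies = movies.dropna(subset=["startYear", "runtimeMinutes", "genres"])

    movies = movies[movies["startYear"] > min_year]
    # Assumption: title length is measured in characters
    movies["titleLength"] = movies["primaryTitle"].str.len()

    # tconst is the shared key between the two files
    return movies.merge(ratings, on="tconst", how="inner")

# Made-up rows, for illustration only
basics_demo = pd.DataFrame({
    "tconst": ["tt1", "tt2", "tt3", "tt4"],
    "titleType": ["movie", "movie", "short", "movie"],
    "primaryTitle": ["Alpha", "Beta Movie", "Gamma", "Delta"],
    "startYear": ["2005", "1999", "2010", "\\N"],
    "runtimeMinutes": ["120", "90", "10", "100"],
    "genres": ["Drama", "Comedy", "Short", "Drama"],
})
ratings_demo = pd.DataFrame({
    "tconst": ["tt1", "tt2", "tt3", "tt4"],
    "averageRating": [7.5, 6.0, 5.0, 8.0],
    "numVotes": [100, 50, 10, 20],
})
demo = clean_and_join(basics_demo, ratings_demo)
print(demo)
```

With the real files, the two inputs would come from something like `pd.read_csv("title.basics.tsv", sep="\t", low_memory=False)`, applied to both TSVs.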

Source: These datasets can be found on the official website of IMDb through the following link: https://developer.imdb.com/non-commercial-datasets/


Link to data cleaning and cleaned dataset to reproduce our results here:
https://github.com/mohdLabadi/info2950-finalproject





# Pre Registration Statement 1.
- Hypothesis: Across genres in the IMDb dataset, runtime has a stronger influence on a movie's average rating than title length does.
- Why we chose this hypothesis: We think longer movies are sometimes rated better because they have more time to tell a full story, whether the movie is a romance or a drama. However, a longer runtime might also mean that the movie gets boring at some point. Given this observation, the scope of our dataset, and the tools we have learned, we think this is a valid hypothesis to test.
- Proposed analysis:
     - Dual Factor Analysis: Include both movie runtime and title length as independent variables in regression models for each genre. Assess the relative influence of runtime and title length by comparing the size and significance of their coefficients.

     - Genre-Specific Comparisons: Run separate models for drama, romance, action, and comedy to evaluate whether these factors have a more pronounced effect in certain genres. Test interaction effects between genre and runtime/title length to identify genre-specific patterns.

     - Non-Linear Relationships: Add quadratic terms for runtime and title length to test if their effects vary at extreme values (e.g., very long movies or overly long titles). Compare linear and non-linear models to determine the best fit for each genre.

     - Segmented Ratings Analysis: Conduct separate analyses for critic and audience ratings to determine whether these groups are differently influenced by runtime and title length.

# Pre Registration Statement 2.
- Hypothesis: The average movie rating has increased over time at a similar rate across all genres.
- Why we chose this hypothesis: In our EDA, we noticed a strange trend: average movie ratings increased suddenly at some point in time for the four genres we explored, and they seemed to do so at similar rates. We wrote this hypothesis to find out whether that is actually the case for the genres of interest.
- Proposed analysis: Conduct a linear regression where movie start year is the independent variable and movie rating is the dependent variable. Separate models will be run for each genre. We will test whether the coefficient for movie start year (β) is positive for all genres, or whether it differs across genres.
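
The per-genre regressions described above could look roughly like the sketch below, which uses `scipy.stats.linregress` to collect the start-year slope (β) for each genre. The helper name `year_slopes` is hypothetical, and the data is synthetic with slopes chosen by hand, so the printed numbers are illustrative only.

```python
import numpy as np
import pandas as pd
from scipy import stats

def year_slopes(df: pd.DataFrame, genres: list) -> pd.DataFrame:
    """Fit averageRating ~ startYear separately for each genre."""
    rows = []
    for genre in genres:
        subset = df[df["genre"] == genre]
        fit = stats.linregress(subset["startYear"], subset["averageRating"])
        rows.append({
            "genre": genre,
            "slope": fit.slope,      # estimated change in rating per year
            "stderr": fit.stderr,    # standard error of the slope
            "pvalue": fit.pvalue,    # test of slope == 0
        })
    return pd.DataFrame(rows).set_index("genre")

# Synthetic data (assumption): two genres with different built-in trends
rng = np.random.default_rng(1)
frames = []
for genre, slope in [("Drama", 0.02), ("Comedy", 0.06)]:
    years = rng.integers(2001, 2025, 300)
    ratings = 6 + slope * (years - 2001) + rng.normal(0, 0.3, 300)
    frames.append(pd.DataFrame(
        {"genre": genre, "startYear": years, "averageRating": ratings}
    ))
demo = pd.concat(frames, ignore_index=True)

slopes = year_slopes(demo, ["Drama", "Comedy"])
print(slopes)
```

Positive slopes with small p-values would support the "increased over time" part of the hypothesis. Whether the rates are similar is a separate question: comparing slopes against their standard errors gives a first look, and a single model with a genre-by-year interaction term would be a more formal check.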


# Data analysis


In [None]:
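import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats
import statsmodels.api as sm
# scikit-learn estimators used by the regression cells below
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Later cells refer to the cleaned movie table as `cleaned_df`
cleaned_df = pd.read_csv('cleaned_data.csv')
joined_df = pd.read_csv('joined_data.csv')

# Assumption: if the cleaned file has no titleLength column, derive it from primaryTitle
if 'titleLength' not in cleaned_df.columns:
    cleaned_df['titleLength'] = cleaned_df['primaryTitle'].str.len()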

1. Correlation Analysis

In [None]:
correlation_matrix = cleaned_df[['runtimeMinutes', 'titleLength', 'averageRating']].corr()
print(correlation_matrix)

sns.heatmap(correlation_matrix, annot=True, cmap="coolwarm")
plt.title("Correlation Matrix")
plt.show()

2. Genre-Specific Analysis

In [None]:
# Split genres into individual rows
genres_df = cleaned_df.assign(genre=cleaned_df['genres'].str.split(',')).explode('genre')

# Check unique genres
print(genres_df['genre'].value_counts())

In [None]:
sns.boxplot(x='genre', y='runtimeMinutes', data=genres_df)
plt.title("Runtime Distribution by Genre")
plt.xlabel("Genre")
plt.ylabel("Runtime (Minutes)")
plt.xticks(rotation=45)
plt.show()

sns.boxplot(x='genre', y='averageRating', data=genres_df)
plt.title("Rating Distribution by Genre")
plt.xlabel("Genre")
plt.ylabel("Average Rating")
plt.xticks(rotation=45)
plt.show()


3. Regression Analysis

In [None]:
# Define a function to run linear regression for a specific genre
def run_linear_regression(genre):
    subset = genres_df[genres_df['genre'] == genre]
    X = subset[['runtimeMinutes']]
    y = subset['averageRating']

    model = LinearRegression()
    model.fit(X, y)

    coef = model.coef_[0]
    intercept = model.intercept_
    print(f"Linear Regression for {genre}: Coefficient={coef:.4f}, Intercept={intercept:.4f}")

    sns.regplot(x='runtimeMinutes', y='averageRating', data=subset, line_kws={'color': 'red'})
    plt.title(f"Linear Relationship Between Runtime and Rating in {genre}")
    plt.xlabel("Runtime (Minutes)")
    plt.ylabel("Average Rating")
    plt.show()

# Test for drama
run_linear_regression('Drama')


In [None]:
# Polynomial regression for a genre
def run_polynomial_regression(genre, degree=2):
    subset = genres_df[genres_df['genre'] == genre]
    X = subset[['runtimeMinutes']]
    y = subset['averageRating']

    poly = PolynomialFeatures(degree)
    X_poly = poly.fit_transform(X)

    model = LinearRegression()
    model.fit(X_poly, y)

    print(f"Polynomial Regression for {genre} (Degree {degree}): Coefficients={model.coef_}, Intercept={model.intercept_}")

    sns.scatterplot(x='runtimeMinutes', y='averageRating', data=subset, alpha=0.6)
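    # Build the prediction grid as a DataFrame so its column name matches the fitted data
    x_range = pd.DataFrame({'runtimeMinutes': np.linspace(X['runtimeMinutes'].min(), X['runtimeMinutes'].max(), 100)})
    y_poly_pred = model.predict(poly.transform(x_range))
    plt.plot(x_range['runtimeMinutes'], y_poly_pred, color='red')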
    plt.title(f"Polynomial Relationship Between Runtime and Rating in {genre}")
    plt.xlabel("Runtime (Minutes)")
    plt.ylabel("Average Rating")
    plt.show()

# Test for drama
run_polynomial_regression('Drama', degree=2)


4. Combined Analysis of Runtime and Title Length:

In [None]:
# Multi-variable regression for drama
drama_subset = genres_df[genres_df['genre'] == 'Drama']
X = drama_subset[['runtimeMinutes', 'titleLength']]
y = drama_subset['averageRating']

model = LinearRegression()
model.fit(X, y)

print(f"Coefficients: {model.coef_}, Intercept: {model.intercept_}")

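# Visualize each predictor on its own x-axis, since runtime and title length are on different scales
fig, axes = plt.subplots(1, 2, figsize=(12, 4), sharey=True)
sns.scatterplot(x='runtimeMinutes', y='averageRating', data=drama_subset, alpha=0.6, ax=axes[0])
axes[0].set_title("Runtime vs. Rating (Drama)")
sns.scatterplot(x='titleLength', y='averageRating', data=drama_subset, alpha=0.6, ax=axes[1])
axes[1].set_title("Title Length vs. Rating (Drama)")
plt.tight_layout()
plt.show()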


# Evaluation of significance

# Conclusions

# Limitations


Data Sources:
The recorded data may be influenced by how it was collected. IMDb gathers its data from various sources, including direct input from filmmakers, studios, and users, as well as its own internal data collection processes.

Rating Bias:
User-submitted data and ratings can introduce bias, because the information may be incomplete or may not accurately describe a title. In addition, the demographics of IMDb users are skewed by gender, geography, age, culture, and language, so ratings might not reflect the views of the full audience.

Policy Restrictions:
Lastly, the data could be influenced by IMDb's editorial policies and the submission guidelines it provides to contributors. Certain titles may lack sufficient data if contributors fail to follow the submission guidelines, and the policies may prioritize professionally verified information over crowd-sourced data, which limits coverage of smaller or independent films.



# Acknowledgements and Bibliography

IMDb. "IMDb Non-Commercial Datasets: title.basics.tsv." Accessed November 21, 2024. https://developer.imdb.com/non-commercial-datasets/

IMDb. "IMDb Non-Commercial Datasets: title.ratings.tsv." Accessed November 21, 2024. https://developer.imdb.com/non-commercial-datasets/