## Phase I Project Proposal
### Can movie genre, release date, and audience score be used to predict Rotten Tomatoes critics scores?

#### Name: Ian Menachery, DS 3000


### Introduction

Can movie genre, release date, and audience scores help predict Rotten Tomatoes critic scores? In this project, I want to examine how these factors relate to the way critics rate movies. A model that predicts critic scores from these characteristics could help filmmakers and studios understand what resonates with both audiences and critics. Exploring these relationships might also reveal other factors that shape critical opinion and give a better idea of what makes a movie successful in the eyes of reviewers.

### Data Collection

I plan to use web scraping to collect movie data from Rotten Tomatoes, focusing on genre, release date, critic score, and audience score. I'll then run calculations to assess how genre and release date relate to the critic and audience scores, and how the two scores relate to each other. To gain further insight into audience reception, I may also explore additional movie metadata, such as box office performance or awards. By combining these data points, I aim to identify trends or patterns that could help predict Rotten Tomatoes scores, offering insight into how reviewers score films and potentially revealing bias in reviews.

import requests
from bs4 import BeautifulSoup
import pandas as pd

# Copied all genres from the website so they are accessible
genres_array = ["Action", "Adventure", "Animation", "Anime", "Biography", "Comedy", "Crime", "Documentary", "Drama", "Entertainment", 
                "Faith & Spirituality", "Fantasy", "Game Show", "LGBTQ+", "Health & Wellness", "History", "Holiday", "Horror", "House & Garden", 
                "Kids & Family", "Music", "Musical", "Mystery & Thriller", "Nature", "News", "Reality", "Romance", "Sci-Fi", "Short", "Soap", 
                "Special Interest", "Sports", "Stand-Up", "Talk Show", "Travel", "Variety", "War", "Western"]

genre = genres_array[0] # chooses action movies to look at
url = f'https://www.rottentomatoes.com/browse/movies_at_home/critics:certified_fresh~genres:{genre}'
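response = requests.get(url, timeout=10)
response.raise_for_status()  # stop early if the request failed

# Name the parser explicitly so BeautifulSoup does not have to guess (and warn)
soup = BeautifulSoup(response.text, 'html.parser')
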
# Find all movie elements
movie_elements = soup.find_all('div', class_='js-tile-link')

# Lists to hold movie details
movie_names = []
release_dates = []
critic_scores = []
audience_scores = []

# Loop through each movie element to extract relevant details
for movie in movie_elements:
    # Get movie name (guard against tiles that lack a title element)
    name_tag = movie.find('span', {'data-qa': 'discovery-media-list-item-title'})
    movie_names.append(name_tag.get_text(strip=True) if name_tag else 'N/A')

    # Get release date
    release_tag = movie.find('span', class_='smaller')
    release_dates.append(release_tag.get_text(strip=True) if release_tag else 'N/A')

    # Get critic score
    critic_score = movie.find('rt-text', {'slot': 'criticsScore'})
    critic_scores.append(critic_score.get_text(strip=True) if critic_score else 'N/A')

    # Get audience score
    audience_score = movie.find('rt-text', {'slot': 'audienceScore'})
    audience_scores.append(audience_score.get_text(strip=True) if audience_score else 'N/A')

# Creates DataFrame to store the movie details
df = pd.DataFrame({
    'Movie Name': movie_names,
    'Release': release_dates,
    'Critic Score': critic_scores,
    'Audience Score': audience_scores,
    'Genre': genre  # the same genre is broadcast to every row
})

print(df)
print(f"Movies found: {len(df)}")




                           Movie Name                 Release Critic Score  \
0                    Transformers One  Streaming Oct 22, 2024          89%   
1                Deadpool & Wolverine   Streaming Oct 1, 2024          78%   
2                         Rebel Ridge   Streaming Sep 6, 2024          95%   
3                             Hit Man   Streaming Jun 7, 2024          95%   
4                            Twisters  Streaming Aug 13, 2024          75%   
5                           Civil War  Streaming May 24, 2024          81%   
6                  Godzilla Minus One   Streaming Jun 1, 2024          99%   
7                          Monkey Man  Streaming Apr 23, 2024          89%   
8                           John Wick   Streaming Jun 7, 2016          86%   
9                        The Fall Guy  Streaming May 21, 2024          81%   
10                         The Batman  Streaming Apr 19, 2022          85%   
11                          Gladiator  Streaming Jun 15, 2011   
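
Since the cell above handles only one genre, one way to prepare for combining several genres is to move the parsing into a function that takes the page HTML and the genre label. The sketch below is only an illustration under stated assumptions: `text_or_na` and `parse_tiles` are hypothetical helper names, and `sample_html` is made-up markup that mimics the selectors used above rather than a real Rotten Tomatoes page.

```python
from bs4 import BeautifulSoup
import pandas as pd


def text_or_na(tag):
    """Return the stripped text of a tag, or 'N/A' if the tag was not found."""
    return tag.get_text(strip=True) if tag else "N/A"


def parse_tiles(html, genre):
    """Parse one browse page's HTML into a DataFrame, tagging each row with genre."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tile in soup.find_all("div", class_="js-tile-link"):
        rows.append({
            "Movie Name": text_or_na(
                tile.find("span", {"data-qa": "discovery-media-list-item-title"})
            ),
            "Release": text_or_na(tile.find("span", class_="smaller")),
            "Critic Score": text_or_na(tile.find("rt-text", {"slot": "criticsScore"})),
            "Audience Score": text_or_na(tile.find("rt-text", {"slot": "audienceScore"})),
            "Genre": genre,
        })
    return pd.DataFrame(rows)


# Hypothetical markup for illustration only (not copied from the real site)
sample_html = """
<div class="js-tile-link">
  <span data-qa="discovery-media-list-item-title"> Example Movie </span>
  <span class="smaller">Streaming Jan 1, 2024</span>
  <rt-text slot="criticsScore">90%</rt-text>
</div>
"""

# In the real workflow each genre would have its own downloaded page;
# here the same sample page is reused for two genres.
frames = [parse_tiles(sample_html, g) for g in ["Action", "Adventure"]]
combined = pd.concat(frames, ignore_index=True)
print(combined)
```

Keeping the download step (`requests.get`) separate from the parsing step also makes it possible to check the parser against a saved page without hitting the site repeatedly.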

### Data Usage and Remaining Issues

The DataFrame still requires some cleaning and organization. While the genre is straightforward in this dataset, the scores are stored as strings with percent signs and will need to be converted to integers before analysis. I will also convert the Release column to a datetime object, which first requires stripping the "Streaming" prefix from each value. Movies can belong to multiple genres, so if I combine results from different genres I may need to remove repeated titles. As I progress, I plan to investigate the relationships between the attributes, focusing on how genre and scores correlate. Although I haven’t yet covered machine learning models in my studies, I see potential for using regression to predict scores from movie features, or classification techniques to analyze patterns within genres. Exploring these methods should deepen the analysis and may help uncover bias in the film review world.
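
The cleaning steps described above could look roughly like the following. This is a minimal sketch, not the final pipeline: `clean_movies` is a hypothetical helper name, the first two sample rows reuse values from the printed output, and the last two rows (a repeated title under another genre and a row with missing values) are invented for illustration. Scores come out as floats rather than ints so that missing values can be stored as `NaN`.

```python
import pandas as pd


def clean_movies(df, score_columns=("Critic Score", "Audience Score")):
    """Return a cleaned copy: numeric scores, parsed dates, no repeated titles."""
    out = df.copy()
    for col in score_columns:
        if col in out.columns:
            # "89%" -> 89.0; "N/A" or any other non-numeric text -> NaN
            out[col] = pd.to_numeric(out[col].str.rstrip("%"), errors="coerce")
    # "Streaming Oct 22, 2024" -> 2024-10-22; unparseable values -> NaT
    dates = out["Release"].str.replace("Streaming ", "", regex=False)
    out["Release"] = pd.to_datetime(dates, format="%b %d, %Y", errors="coerce")
    # A movie listed under several genres appears once per genre; keep the first
    out = out.drop_duplicates(subset="Movie Name", keep="first")
    return out.reset_index(drop=True)


sample = pd.DataFrame({
    "Movie Name": ["Transformers One", "John Wick", "John Wick", "Untitled Example"],
    "Release": ["Streaming Oct 22, 2024", "Streaming Jun 7, 2016",
                "Streaming Jun 7, 2016", "N/A"],
    "Critic Score": ["89%", "86%", "86%", "N/A"],
    # The second John Wick row and "Untitled Example" are hypothetical
    "Genre": ["Action", "Action", "Mystery & Thriller", "Action"],
})

cleaned = clean_movies(sample)
print(cleaned)
print(cleaned.dtypes)
```

Dropping duplicates by title keeps only the first genre seen for each movie. If genre is meant to be a predictor, an alternative is to keep every (movie, genre) pair and deduplicate only when computing score statistics.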
 
