## Goal  
The goal of this project is to explore the connection between movie ratings and movie attributes such as genre, release year, and runtime by combining a ratings dataset with a metadata dataset. By linking user ratings to descriptive movie information, we can analyze trends in viewer preferences and build a simple recommendation system.  

## Research Questions  
1. Do certain genres consistently receive higher average ratings?  
2. How does the release year of a movie influence its ratings?  


## Team  

### Sheryl John  
- Responsible for data cleaning and quality assessment.  
- Handles tasks such as removing missing values and standardizing data types.  
- Leads documentation of cleaning steps, ethical considerations, metadata, and reproducibility notes.  
- Prepares visualizations that illustrate patterns in genres, release years, and rating distributions.  

### Xing  
- Responsible for data integration.  
- Implements the dataset joining process, ensuring correct mapping between MovieLens `movieId` and Kaggle `id` (via `links.csv`).  
- Builds the reproducible pipeline (Jupyter Notebook) that automates loading, cleaning, joining, and analyzing the data.  

### Collaboration  
Both of us will work together on exploratory analysis, refining research questions, and preparing the final presentation.  


## Datasets  

This project will use two publicly available, complementary datasets that are widely used in research and teaching.  

The first dataset is the MovieLens Ratings Dataset from GroupLens Research at the University of Minnesota. This dataset is widely accepted as one of the most consistent benchmark datasets for recommender system projects and has been referenced in a large number of academic studies. The version we are using contains approximately 100,000 ratings from about 600 users on roughly 9,000 distinct movies. Each rating is recorded on a scale from 0.5 to 5.0. The dataset includes four columns: `userId`, an anonymized identifier for the user; `movieId`, the identifier for the movie being rated; `rating`, the score given; and `timestamp`, which records when the rating was made. Because the dataset is fully anonymized, it contains no personally identifiable information, making it ethically safe to use. Its relatively small size and well-organized structure also make it ideal for classroom analysis, while still being rich enough to support meaningful insights.  
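
The column layout described above can be checked with a short pandas sketch before any analysis. This is only an illustration: the helper name `check_ratings`, the derived `rated_at` column, and the inline sample rows are hypothetical, and the check assumes ratings move in half-star steps and that `timestamp` holds Unix seconds.

```python
import pandas as pd

# Columns named in the dataset description.
EXPECTED_COLUMNS = ["userId", "movieId", "rating", "timestamp"]


def check_ratings(ratings: pd.DataFrame) -> pd.DataFrame:
    """Validate the ratings table and add a readable datetime column."""
    missing = [c for c in EXPECTED_COLUMNS if c not in ratings.columns]
    if missing:
        raise ValueError(f"missing columns: {missing}")

    # Assumption: ratings lie between 0.5 and 5.0 in half-star steps.
    in_range = ratings["rating"].between(0.5, 5.0)
    half_steps = (ratings["rating"] * 2) % 1 == 0
    if not (in_range & half_steps).all():
        raise ValueError("found ratings outside the 0.5-5.0 half-star scale")

    out = ratings.copy()
    # Assumption: timestamps are seconds since the Unix epoch.
    out["rated_at"] = pd.to_datetime(out["timestamp"], unit="s")
    return out


# Hypothetical sample rows standing in for the real ratings file.
sample = pd.DataFrame(
    {
        "userId": [1, 1, 2],
        "movieId": [1, 3, 1],
        "rating": [4.0, 0.5, 5.0],
        "timestamp": [964982703, 964981247, 964982224],
    }
)
checked = check_ratings(sample)
print(checked[["userId", "movieId", "rating", "rated_at"]])
```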

The second dataset is the Movies Metadata Dataset, hosted on Kaggle and originally constructed from The Movie Database (TMDB), a regularly updated source of movie information. This dataset contains information on approximately 45,000 films and provides descriptive features that can be used to enrich the MovieLens ratings. It includes columns such as `id` (a unique identifier for every film), `title`, `genres`, `release_date`, `runtime`, `budget`, and `revenue`. Although some values are incomplete, the breadth of the dataset makes it well suited for enrichment. For example, the release year (derived from `release_date`) and the `genres` column allow us to group ratings by film type and period, while `runtime` and `budget` add useful context for understanding viewer preferences.  
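
Before grouping by genre or year, the `genres` and `release_date` columns need to be turned into usable values. The sketch below shows one possible approach. It assumes that `genres` is stored as a stringified list of dictionaries with a `name` key, which should be verified against the downloaded file. The helper `parse_genres`, the derived columns `genre_list` and `release_year`, and the sample rows are hypothetical.

```python
import ast

import pandas as pd


def parse_genres(raw) -> list:
    """Return a list of genre names, or an empty list if the cell is unusable."""
    if not isinstance(raw, str):
        return []
    try:
        items = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        return []
    if not isinstance(items, list):
        return []
    return [g["name"] for g in items if isinstance(g, dict) and "name" in g]


# Hypothetical rows standing in for the real metadata file.
metadata = pd.DataFrame(
    {
        "id": ["101", "102", "103"],
        "title": ["Film A", "Film B", "Film C"],
        "genres": [
            "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]",
            "[]",
            None,
        ],
        "release_date": ["1995-10-30", "1995-12-15", "not a date"],
    }
)

metadata["genre_list"] = metadata["genres"].apply(parse_genres)
# Unparseable dates become NaT, so their year is missing rather than wrong.
metadata["release_year"] = pd.to_datetime(
    metadata["release_date"], errors="coerce"
).dt.year

print(metadata[["title", "genre_list", "release_year"]])
```

Keeping unparseable cells as empty lists or missing years, rather than dropping the rows immediately, makes it easier to report how much metadata was unusable.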

The two datasets will be combined through the MovieLens `links.csv` file, which maps each MovieLens `movieId` to a TMDB identifier (`tmdbId`); that identifier is then matched against the `id` column in the Kaggle dataset. Because the Kaggle `id` column is not always stored neatly as an integer, it will need to be standardized and cleaned before performing the join. Once integrated, the merged dataset will link behavioral data (user ratings) with descriptive attributes (genres, years, runtimes), producing a more complete picture of film trends and enabling us to address the research questions effectively.  
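
A minimal sketch of this integration step is shown here, assuming the join goes through the `tmdbId` column of `links.csv`. The function name `join_ratings_with_metadata` and the small inline tables are hypothetical, and the real pipeline may handle duplicates and mismatches differently.

```python
import pandas as pd


def join_ratings_with_metadata(
    ratings: pd.DataFrame, links: pd.DataFrame, metadata: pd.DataFrame
) -> pd.DataFrame:
    """Attach Kaggle metadata to MovieLens ratings via links.csv."""
    meta = metadata.copy()
    # Non-numeric ids (for example stray dates) become NaN and are dropped.
    meta["id"] = pd.to_numeric(meta["id"], errors="coerce")
    meta = meta.dropna(subset=["id"]).drop_duplicates(subset="id")
    meta["id"] = meta["id"].astype("int64")

    # Some MovieLens movies have no TMDB identifier; they cannot be joined.
    link = links.dropna(subset=["tmdbId"]).copy()
    link["tmdbId"] = link["tmdbId"].astype("int64")

    with_ids = ratings.merge(link[["movieId", "tmdbId"]], on="movieId", how="inner")
    return with_ids.merge(meta, left_on="tmdbId", right_on="id", how="inner")


# Hypothetical sample tables standing in for the real files.
ratings = pd.DataFrame(
    {"userId": [1, 1, 2], "movieId": [1, 2, 3], "rating": [4.0, 3.5, 5.0]}
)
links = pd.DataFrame({"movieId": [1, 2, 3], "tmdbId": [101.0, 102.0, None]})
metadata = pd.DataFrame(
    {"id": ["101", "1997-08-20", "102"], "title": ["Film A", "Broken row", "Film B"]}
)

merged = join_ratings_with_metadata(ratings, links, metadata)
print(merged[["userId", "movieId", "tmdbId", "title", "rating"]])
```

Inner joins are used here so that only fully matched ratings remain; a left join would keep unmatched ratings with empty metadata instead.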


## Timeline  

In week one of the project, we will download both datasets from their original sources and review their terms of use and ethical guidelines to confirm that they can be used in an academic setting. During week two, we will set up the GitHub repository with a clear directory structure and place the datasets in a dedicated data directory, so that cleaning can start from an organized baseline.  

Sheryl will begin pre-cleaning the datasets in week three by handling missing values, normalizing identifiers, and standardizing date formats. Xing will merge the two datasets in week four by mapping the MovieLens `movieId` column to the Kaggle `id` column through `links.csv`, resolving mismatches and documenting the integration process.  

After the data has been consolidated, the fifth week will focus on exploratory data analysis, when both team members will examine trends in ratings across genres, years of release, and runtimes. In the sixth week, Xing will develop a basic recommender system that displays the top-rated movies by genre and year, demonstrating how consolidated data can be utilized to generate insights. The seventh week will be spent on workflow automation, where Xing will develop a reproducible Jupyter Notebook pipeline that executes the entire process from data loading to visualization. Finally, during the eighth week, Sheryl will complete the project documentation, including ethical implications, metadata descriptions, and a final report. 
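
The week-six recommender could start from something as simple as the sketch below, which ranks movies by mean rating within one genre and release year. It is an illustration under stated assumptions, not the final design: the function `top_rated`, the `min_ratings` threshold, and the columns `genre_list` and `release_year` are hypothetical names for values derived during cleaning.

```python
import pandas as pd


def top_rated(
    merged: pd.DataFrame, genre: str, year: int, min_ratings: int = 2, n: int = 5
) -> pd.DataFrame:
    """Return the n best-rated titles for one genre and release year."""
    # One row per (rating, genre) pair so multi-genre films count in each genre.
    exploded = merged.explode("genre_list").rename(columns={"genre_list": "genre"})
    subset = exploded[
        (exploded["genre"] == genre) & (exploded["release_year"] == year)
    ]
    stats = (
        subset.groupby("title")["rating"]
        .agg(mean_rating="mean", num_ratings="count")
        .reset_index()
    )
    # Ignore titles with too few ratings for the mean to be meaningful.
    stats = stats[stats["num_ratings"] >= min_ratings]
    ranked = stats.sort_values(["mean_rating", "num_ratings"], ascending=False)
    return ranked.head(n).reset_index(drop=True)


# Hypothetical merged rows: one row per individual rating.
merged_sample = pd.DataFrame(
    {
        "title": ["Film A", "Film A", "Film B", "Film B", "Film C", "Film D"],
        "genre_list": [
            ["Comedy"],
            ["Comedy"],
            ["Comedy", "Drama"],
            ["Comedy", "Drama"],
            ["Comedy"],
            ["Comedy"],
        ],
        "release_year": [1995, 1995, 1995, 1995, 1995, 1996],
        "rating": [4.0, 5.0, 3.0, 4.0, 5.0, 5.0],
    }
)

result = top_rated(merged_sample, genre="Comedy", year=1995)
print(result)
```

The minimum-count filter keeps a single 5.0 rating from outranking films with many ratings; the right threshold depends on how sparse the merged data turns out to be.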

## Constraints  

This project only uses publicly available datasets, so our scope of analysis is entirely dependent on what has been made public by MovieLens and Kaggle. While this makes the data accessible and straightforward to use, it also prevents us from incorporating proprietary or more detailed datasets that may be available elsewhere. In addition, the MovieLens dataset is fully anonymized, which means that we cannot analyze any demographic information about users such as gender, age, or location. This limitation restricts the types of inferences we can make, as our study will focus only on rating activity without the possibility of connecting it to user backgrounds. Finally, the movies metadata file may contain missing or inconsistent values for features like release dates, runtimes, budgets, or revenues, which could reduce the completeness and reliability of certain aspects of our analysis.  


## Gaps  

Some gaps in the project scope stem from the nature of the data. For example, the datasets do not include external information such as advertising budgets, promotional campaigns, or critic reviews, all of which are typically influential drivers of a movie's success and its audience ratings. Without these factors, our analysis will necessarily take a more limited focus on viewer ratings and basic movie attributes. Another gap lies in the merging process itself: some movie IDs will not align perfectly between the MovieLens dataset and the Kaggle metadata, even when the linking file is used. As a result, a subset of movies will be excluded from the final merged dataset, which may reduce the inclusiveness and comprehensiveness of our findings.  
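
The size of this matching gap can be measured rather than guessed. The sketch below uses the `indicator` option of pandas `merge` to count how many MovieLens movies find a metadata row. The function `match_report` and the sample values are hypothetical, and it assumes matching goes through the `tmdbId` column of `links.csv`.

```python
import pandas as pd


def match_report(links: pd.DataFrame, metadata_ids: pd.Series) -> dict:
    """Count MovieLens movies that do or do not find a metadata row."""
    left = links[["movieId", "tmdbId"]].copy()
    # Clean the metadata ids the same way as for the real join.
    right = (
        pd.DataFrame({"tmdbId": pd.to_numeric(metadata_ids, errors="coerce")})
        .dropna()
        .drop_duplicates()
    )
    merged = left.merge(right, on="tmdbId", how="left", indicator=True)
    matched = int((merged["_merge"] == "both").sum())
    total = len(merged)
    return {"total": total, "matched": matched, "unmatched": total - matched}


# Hypothetical sample: one movie lacks a tmdbId, one id is absent from metadata.
links_sample = pd.DataFrame(
    {"movieId": [1, 2, 3, 4], "tmdbId": [101.0, 102.0, None, 999.0]}
)
metadata_ids = pd.Series(["101", "102", "bad"])

report = match_report(links_sample, metadata_ids)
print(report)
```

Reporting these counts alongside the results makes it clear how much of the original catalog the findings actually cover.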
