# 1 Business Understanding

## 1.0 Overview
With the rapid expansion of internet streaming platforms, users are overwhelmed by the sheer volume of available movies. Providing personalized recommendations is critical for increasing user engagement, satisfaction, and retention. The MovieLens (ml-latest-small) dataset, from [MovieLens](http://movielens.org) (a movie recommendation service), includes user-generated 5-star ratings and free-text tags that can be used to build a recommendation engine.


## 1.1 Problem Statement
With thousands of movies accessible on streaming platforms, customers struggle to discover ones they'll like. This choice overload frequently results in dissatisfaction, decision fatigue, and low user engagement.
Many systems rely on generic rankings or trending lists that do not consider individual preferences. This leads to irrelevant movie suggestions, longer search times, lower user satisfaction, and lower retention as users switch to competitors with better recommendations.

The goal is to create a movie recommendation system that can predict user preferences based on previous interactions. The specific objectives are listed below.

## 1.2 Objectives
*   &#x2611; Analyze user ratings and tags to find patterns and trends.
*   &#x2611; Create a recommendation model that incorporates collaborative filtering, content-based filtering, or hybrid approaches.
*   &#x2611; Address critical issues such as data sparsity, the cold-start problem, and bias in user ratings.
*   &#x2611; Assess model performance using relevant measures such as RMSE, MAE, and Precision@K.
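
To make the evaluation measures named in the last objective concrete, here is a minimal sketch of RMSE, MAE, and Precision@K using NumPy. The function names and the toy values are illustrative assumptions, not part of the project code, and Precision@K is computed here as hits divided by `k`, which is one common convention.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error between true and predicted ratings."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def mae(y_true, y_pred):
    """Mean absolute error between true and predicted ratings."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))


def precision_at_k(recommended, relevant, k):
    """Share of the top-k recommended items that the user found relevant."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    top_k = list(recommended)[:k]
    relevant = set(relevant)
    hits = sum(1 for item in top_k if item in relevant)
    return hits / k


# Toy values for illustration only (not taken from the dataset).
true_ratings = [4.0, 3.0, 5.0, 2.0]
predicted_ratings = [3.5, 3.0, 4.0, 3.0]

print(rmse(true_ratings, predicted_ratings))
print(mae(true_ratings, predicted_ratings))
print(precision_at_k([10, 20, 30, 40], {20, 40, 50}, k=3))
```

RMSE penalizes large errors more heavily than MAE, while Precision@K looks only at the quality of the ranked list rather than at the predicted rating values.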

## 1.3 Proposed Solution
The objective is to examine and use the dataset to boost user engagement by developing a movie recommendation engine. Potential applications include:

&#x2611; **Personalized movie recommendations**: Predict user preferences based on previous ratings.

&#x2611; **User segmentation and clustering**: Customize content by grouping users with similar preferences.

&#x2611; **Trend analysis and insights**: Discover popular genres, top-rated movies, and viewing behaviors.

&#x2611; **Tag-based sentiment analysis**: Gain insight into how users perceive movies.


&#x2713; **Next Steps**

&#x2A39; Exploratory Data Analysis (EDA) – Understand rating distributions, popular tags, and trends.

&#x2A3B; Feature Engineering – Transform text tags into meaningful numerical features.

&#x2A39; Model Selection – Choose between collaborative filtering, content-based filtering, or hybrid models.

&#x2A3B; Evaluation Metrics – Use RMSE, MAE, or Precision@K to assess recommendation performance.
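
As an illustration of the model-selection step, the sketch below shows one possible form of item-based collaborative filtering using `cosine_similarity`. The toy ratings, the `recommend_items` helper, and the choice to treat unrated movies as 0 are all simplifying assumptions made for this example, not the project's final method.

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Toy ratings in the same shape as ratings.csv (illustrative values only).
toy_ratings = pd.DataFrame({
    "userId":  [1, 1, 2, 2, 2, 3, 3, 3],
    "movieId": [10, 20, 10, 20, 30, 20, 30, 40],
    "rating":  [5.0, 4.0, 4.0, 5.0, 1.0, 1.0, 5.0, 4.0],
})

# User x movie matrix; unrated cells are filled with 0 (a simplification).
user_item = toy_ratings.pivot(
    index="userId", columns="movieId", values="rating"
).fillna(0.0)

# Item-item cosine similarity computed on the movie columns.
item_sim = pd.DataFrame(
    cosine_similarity(user_item.T.to_numpy()),
    index=user_item.columns,
    columns=user_item.columns,
)


def recommend_items(user_id, user_item, item_sim, n=5):
    """Rank movies the user has not rated by similarity-weighted score."""
    user_ratings = user_item.loc[user_id]
    # Each movie's score is the sum of the user's ratings weighted by similarity.
    scores = item_sim.dot(user_ratings)
    unseen = user_ratings[user_ratings == 0].index
    return scores.loc[unseen].sort_values(ascending=False).head(n)


print(recommend_items(1, user_item, item_sim))
```

The scores are only meaningful for ranking; they are not predicted star ratings. A content-based or hybrid model would replace or complement `item_sim` with similarities derived from genres or tags.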

📌 Note 

### Challenges & Considerations
* Data Sparsity: Not all users have rated all movies, leading to gaps in the dataset.
* Cold Start Problem: New users/movies lack enough data for accurate recommendations.
* Bias in Ratings: Some users may consistently rate higher or lower than others.
* Scalability: The model should be efficient enough to handle large datasets in real-world applications.
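
Two of these challenges can be quantified directly. The sketch below, on a small made-up ratings table, computes the sparsity of the user-movie matrix and removes per-user rating bias by mean-centering. Column names follow `ratings.csv`; the values and the `rating_centered` column are assumptions for illustration, and the sparsity formula assumes each user rates a movie at most once.

```python
import pandas as pd

# Toy ratings in the same shape as ratings.csv (illustrative values only).
toy_ratings = pd.DataFrame({
    "userId":  [1, 1, 1, 2, 2, 3],
    "movieId": [10, 20, 30, 10, 30, 20],
    "rating":  [5.0, 4.0, 4.5, 2.0, 1.0, 3.0],
})

# Sparsity: share of user-movie pairs that have no rating.
n_users = toy_ratings["userId"].nunique()
n_movies = toy_ratings["movieId"].nunique()
sparsity = 1.0 - len(toy_ratings) / (n_users * n_movies)
print(f"Sparsity: {sparsity:.2%}")

# Rating bias: subtract each user's mean so generous and harsh raters
# are placed on a comparable scale.
user_mean = toy_ratings.groupby("userId")["rating"].transform("mean")
toy_ratings["rating_centered"] = toy_ratings["rating"] - user_mean
print(toy_ratings)
```

The same two steps could be applied to the full ratings table once it is loaded.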

# 2 Data Understanding

The MovieLens dataset (ml-latest-small) comprises user-generated movie ratings and free-text tags. It is commonly used to build and evaluate recommendation systems, which aim to improve the customer experience by suggesting movies that match user interests. Companies like Netflix, Hulu, and Amazon Prime Video use similar data to boost user engagement, retention, and satisfaction through personalized suggestions.

The dataset consists of 5-star ratings and free-text tags from MovieLens, an online movie recommendation service.

- Users rate movies on a 5-star scale in half-star increments, from 0.5 to 5.0 (higher ratings indicate greater satisfaction).
- Tags are free-text labels that users apply to movies (e.g., "thriller," "comedy," "Oscar-winning").
- The dataset is anonymized (users are represented by IDs).

|File Name | Description|
|----------|------------|
|ratings.csv | User ratings for movies (0.5-5.0 scale, half-star increments).|
|movies.csv | Movie metadata: titles and genres.|
|tags.csv | Free-text tags assigned by users to movies.|
|links.csv | Mappings to external movie databases (IMDb, TMDb).|


**Identifiers**
- *userId*: Anonymized ID that identifies a user. MovieLens users were selected at random for inclusion. User IDs are consistent between `ratings.csv` and `tags.csv` (i.e., the same ID refers to the same user across the two files).

- *movieId*: ID that identifies a movie. Only movies with at least one rating or tag are included, so the IDs do not form a complete sequence. Movie IDs are consistent between `ratings.csv`, `tags.csv`, `movies.csv`, and `links.csv` (i.e., the same ID refers to the same movie across these four data files).
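
Because `movieId` is shared across files, the tables can be joined on it. The following sketch uses small stand-in frames (illustrative values, hypothetical names `toy_movies` and `toy_ratings`) to show a left merge that also checks the assumption that every rated movie has a metadata row.

```python
import pandas as pd

# Stand-ins for movies.csv and ratings.csv (illustrative values only).
toy_movies = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title": ["Toy Story (1995)", "Jumanji (1995)", "Grumpier Old Men (1995)"],
})
toy_ratings = pd.DataFrame({
    "userId": [1, 1, 2],
    "movieId": [1, 3, 2],
    "rating": [4.0, 4.0, 3.5],
})

# validate="many_to_one" raises if movieId is duplicated in toy_movies;
# indicator=True adds a _merge column showing where each row was found.
merged = toy_ratings.merge(
    toy_movies, on="movieId", how="left",
    validate="many_to_one", indicator=True,
)

# Ratings whose movieId has no match in the movie metadata.
unmatched = merged[merged["_merge"] == "left_only"]
print(merged)
print(f"Ratings without movie metadata: {len(unmatched)}")
```

An empty `unmatched` frame indicates that the identifiers line up as described.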

**Movies Data columns:**
- *movieId:* Unique movie identifier.
- *title:* Movie title, including the year of release in parentheses.
- *genres:* A pipe-separated list of genres selected from the following: Action, Adventure, Animation, Children, Comedy, Crime, Documentary, Drama, Fantasy, Film-Noir, Horror, Musical, Mystery, Romance, Sci-Fi, Thriller, War, Western, and "(no genres listed)".
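
Since genres are pipe-separated and the release year sits inside the title, both usually need to be unpacked before modeling. A minimal sketch, assuming the title ends with a four-digit year in parentheses; the `toy_movies` frame and the `year` column are illustrative names.

```python
import pandas as pd

# Stand-in for movies.csv, using rows in the documented format.
toy_movies = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title": ["Toy Story (1995)", "Jumanji (1995)", "Grumpier Old Men (1995)"],
    "genres": [
        "Adventure|Animation|Children|Comedy|Fantasy",
        "Adventure|Children|Fantasy",
        "Comedy|Romance",
    ],
})

# One 0/1 indicator column per genre.
genre_flags = toy_movies["genres"].str.get_dummies(sep="|")

# Pull the trailing "(YYYY)" out of the title; titles without a year become <NA>.
year_text = toy_movies["title"].str.extract(r"\((\d{4})\)\s*$", expand=False)
toy_movies["year"] = pd.to_numeric(year_text, errors="coerce").astype("Int64")

print(genre_flags)
print(toy_movies[["title", "year"]])
```

Note that a value such as "(no genres listed)" would become its own indicator column under this approach and may need separate handling.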

**Ratings Data columns:**
- *userId:* Unique identifier for each user.
- *movieId:* Unique identifier for each movie.
- *rating:* User rating on a 5-star scale with half-star increments (0.5 to 5.0 stars).
- *timestamp:* Time when the rating was given.
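
The `timestamp` values are large integers rather than readable dates. The sketch below assumes they are Unix epoch seconds (UTC), which is how MovieLens documents them, and converts them with `pd.to_datetime`. The column names `rated_at` and `rating_year` are illustrative.

```python
import pandas as pd

# Two ratings in the ratings.csv format (timestamps are integer seconds).
toy_ratings = pd.DataFrame({
    "userId": [1, 1],
    "movieId": [1, 3],
    "rating": [4.0, 4.0],
    "timestamp": [964982703, 964981247],
})

# Assumption: timestamps are seconds since 1970-01-01 00:00 UTC.
toy_ratings["rated_at"] = pd.to_datetime(
    toy_ratings["timestamp"], unit="s", utc=True
)
toy_ratings["rating_year"] = toy_ratings["rated_at"].dt.year

print(toy_ratings[["timestamp", "rated_at", "rating_year"]])
```

A datetime column makes it easier to study how rating activity changes over time or to split data chronologically.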

**Links Data columns:**
- *movieId:* Unique identifier for each movie.
- *imdbId:* An identifier for movies used by <http://www.imdb.com>.
- *tmdbId:* An identifier for movies used by <https://www.themoviedb.org>.

**Tags Data columns:**
- *userId:* Unique user identifier.
- *movieId:* Movie being tagged.
- *tag:* Free-text tag (e.g., "thrilling," "mind-blowing," "classic").
- *timestamp:* Time when the tag was added.
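
Free-text tags have to be turned into numbers before a model can use them. One common option, shown here as a sketch rather than the project's chosen method, is to join each movie's tags into a single document and apply TF-IDF with scikit-learn's `TfidfVectorizer`. The `toy_tags` frame is a small stand-in in the `tags.csv` format.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Stand-in for tags.csv (userId and timestamp omitted for brevity).
toy_tags = pd.DataFrame({
    "movieId": [60756, 60756, 60756, 89774, 89774],
    "tag": ["funny", "Highly quotable", "will ferrell", "Boxing story", "MMA"],
})

# One "document" per movie: all of its tags joined by spaces.
tag_docs = toy_tags.groupby("movieId")["tag"].agg(" ".join)

# TF-IDF lowercases text by default and L2-normalizes each row.
vectorizer = TfidfVectorizer()
tag_matrix = vectorizer.fit_transform(tag_docs)

# With L2-normalized rows, the linear kernel equals cosine similarity.
similarity = linear_kernel(tag_matrix, tag_matrix)

print(tag_docs)
print(tag_matrix.shape)
print(similarity)
```

Movies that share no tag words, as in this toy example, end up with a similarity of 0, which is worth keeping in mind given how few movies carry tags.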


In [None]:
# Data understanding

# Import data manipulation libraries
import pandas as pd
import numpy as np

# Import data visualisation libraries
import matplotlib.pyplot as plt
import seaborn as sns

# Import modeling libraries
import scipy.sparse as sp
from sklearn.model_selection import train_test_split

# Import similarity metrics
from sklearn.metrics.pairwise import linear_kernel, cosine_similarity

In [None]:
# Load the datasets
movie_data = pd.read_csv("Data/movies.csv")

rating_data = pd.read_csv("Data/ratings.csv")

link_data = pd.read_csv("Data/links.csv")

tag_data = pd.read_csv("Data/tags.csv")

In [15]:
# Explore movie data
movie_data.head()

| | movieId | title | genres |
|---|---|---|---|
| 0 | 1 | Toy Story (1995) | Adventure\|Animation\|Children\|Comedy\|Fantasy |
| 1 | 2 | Jumanji (1995) | Adventure\|Children\|Fantasy |
| 2 | 3 | Grumpier Old Men (1995) | Comedy\|Romance |
| 3 | 4 | Waiting to Exhale (1995) | Comedy\|Drama\|Romance |
| 4 | 5 | Father of the Bride Part II (1995) | Comedy |


In [16]:
movie_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9742 entries, 0 to 9741
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   movieId  9742 non-null   int64 
 1   title    9742 non-null   object
 2   genres   9742 non-null   object
dtypes: int64(1), object(2)
memory usage: 228.5+ KB


In [17]:
movie_data.describe()

| | movieId |
|---|---|
| count | 9742.0 |
| mean | 42200.353623 |
| std | 52160.494854 |
| min | 1.0 |
| 25% | 3248.25 |
| 50% | 7300.0 |
| 75% | 76232.0 |
| max | 193609.0 |


Understand ratings data:

In [18]:

rating_data.head()

| | userId | movieId | rating | timestamp |
|---|---|---|---|---|
| 0 | 1 | 1 | 4.0 | 964982703 |
| 1 | 1 | 3 | 4.0 | 964981247 |
| 2 | 1 | 6 | 4.0 | 964982224 |
| 3 | 1 | 47 | 5.0 | 964983815 |
| 4 | 1 | 50 | 5.0 | 964982931 |


In [19]:

rating_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100836 entries, 0 to 100835
Data columns (total 4 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   userId     100836 non-null  int64  
 1   movieId    100836 non-null  int64  
 2   rating     100836 non-null  float64
 3   timestamp  100836 non-null  int64  
dtypes: float64(1), int64(3)
memory usage: 3.1 MB


In [20]:

rating_data.describe()

| | userId | movieId | rating | timestamp |
|---|---|---|---|---|
| count | 100836.0 | 100836.0 | 100836.0 | 100836.0 |
| mean | 326.127564 | 19435.295718 | 3.501557 | 1205946000.0 |
| std | 182.618491 | 35530.987199 | 1.042529 | 216261000.0 |
| min | 1.0 | 1.0 | 0.5 | 828124600.0 |
| 25% | 177.0 | 1199.0 | 3.0 | 1019124000.0 |
| 50% | 325.0 | 2991.0 | 3.5 | 1186087000.0 |
| 75% | 477.0 | 8122.0 | 4.0 | 1435994000.0 |
| max | 610.0 | 193609.0 | 5.0 | 1537799000.0 |


Understand links data:

In [21]:
link_data.head()

| | movieId | imdbId | tmdbId |
|---|---|---|---|
| 0 | 1 | 114709 | 862.0 |
| 1 | 2 | 113497 | 8844.0 |
| 2 | 3 | 113228 | 15602.0 |
| 3 | 4 | 114885 | 31357.0 |
| 4 | 5 | 113041 | 11862.0 |


In [22]:

link_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9742 entries, 0 to 9741
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   movieId  9742 non-null   int64  
 1   imdbId   9742 non-null   int64  
 2   tmdbId   9734 non-null   float64
dtypes: float64(1), int64(2)
memory usage: 228.5 KB
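
The `info()` output above shows 9734 non-null `tmdbId` values out of 9742 rows, so 8 entries are missing, which is also why pandas reads the column as `float64`. The sketch below shows one way such gaps could be counted and handled. The `toy_links` frame is a stand-in, and its missing value is invented purely for illustration.

```python
import numpy as np
import pandas as pd

# Stand-in for links.csv; the NaN is made up to demonstrate the handling.
toy_links = pd.DataFrame({
    "movieId": [1, 2, 3],
    "imdbId": [114709, 113497, 113228],
    "tmdbId": [862.0, np.nan, 15602.0],
})

# Count missing values per column.
missing_per_column = toy_links.isna().sum()
print(missing_per_column)

# Option 1: keep the rows, but store IDs as nullable integers instead of floats.
toy_links["tmdbId"] = toy_links["tmdbId"].astype("Int64")

# Option 2: drop rows without a TMDb ID if that ID is required downstream.
links_complete = toy_links.dropna(subset=["tmdbId"])
print(links_complete)
```

Which option fits depends on whether the TMDb identifier is needed for the recommendation model at all. If it is only used for external lookups, keeping the rows is usually harmless.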


In [23]:

link_data.describe()

| | movieId | imdbId | tmdbId |
|---|---|---|---|
| count | 9742.0 | 9742.0 | 9734.0 |
| mean | 42200.353623 | 677183.9 | 55162.123793 |
| std | 52160.494854 | 1107228.0 | 93653.481487 |
| min | 1.0 | 417.0 | 2.0 |
| 25% | 3248.25 | 95180.75 | 9665.5 |
| 50% | 7300.0 | 167260.5 | 16529.0 |
| 75% | 76232.0 | 805568.5 | 44205.75 |
| max | 193609.0 | 8391976.0 | 525662.0 |


Understand tags data:

In [24]:
tag_data.head()

| | userId | movieId | tag | timestamp |
|---|---|---|---|---|
| 0 | 2 | 60756 | funny | 1445714994 |
| 1 | 2 | 60756 | Highly quotable | 1445714996 |
| 2 | 2 | 60756 | will ferrell | 1445714992 |
| 3 | 2 | 89774 | Boxing story | 1445715207 |
| 4 | 2 | 89774 | MMA | 1445715200 |


In [25]:

tag_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3683 entries, 0 to 3682
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   userId     3683 non-null   int64 
 1   movieId    3683 non-null   int64 
 2   tag        3683 non-null   object
 3   timestamp  3683 non-null   int64 
dtypes: int64(3), object(1)
memory usage: 115.2+ KB


In [26]:

tag_data.describe()

| | userId | movieId | timestamp |
|---|---|---|---|
| count | 3683.0 | 3683.0 | 3683.0 |
| mean | 431.149335 | 27252.013576 | 1320032000.0 |
| std | 158.472553 | 43490.558803 | 172102500.0 |
| min | 2.0 | 1.0 | 1137179000.0 |
| 25% | 424.0 | 1262.5 | 1137521000.0 |
| 50% | 474.0 | 4454.0 | 1269833000.0 |
| 75% | 477.0 | 39263.0 | 1498457000.0 |
| max | 610.0 | 193565.0 | 1537099000.0 |


# 3 Data Preparation



## 3.1 Data Cleaning


### Handling Missing Values


### Duplicate Values


### Column Editing



# 4 Exploring the Dataset

## Univariate

## Bivariate

## Multivariate

# 5 Modeling

# 6 Model Evaluation


# 7 Conclusion

# 8 Recommendations