# 📘 Data Dictionary

This data dictionary documents the datasets used and produced in our project: the raw MovieLens files, the metadata files, and the merged dataset used for analysis and for building recommendation models.

---

# 1. MovieLens Ratings Dataset (`ratings.csv`)

**Source:** MovieLens 100k (GroupLens Research)  
**Description:** Contains user ratings for movies on a 0.5–5.0 scale.

| Column       | Type                | Description |
|--------------|---------------------|-------------|
| `userId`     | integer             | Anonymous user ID. |
| `movieId`    | integer             | Unique movie identifier; join key across datasets. |
| `rating`     | float               | Rating from 0.5 to 5.0. |
| `timestamp`  | integer → datetime  | UNIX timestamp; converted to datetime during cleaning. |

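The `timestamp` conversion noted in the table can be sketched as follows. This is a minimal illustration using only the Python standard library, not the project's documented cleaning code: the function name `to_datetime` and the choice to interpret timestamps as UTC are assumptions, and the sample value is illustrative.

```python
from datetime import datetime, timezone


def to_datetime(unix_seconds: int) -> datetime:
    """Convert a UNIX timestamp (seconds since 1970-01-01 UTC) to an aware datetime."""
    # Assumption: timestamps are interpreted as UTC.
    return datetime.fromtimestamp(unix_seconds, tz=timezone.utc)


# Illustrative value, not taken from the project's data.
example = to_datetime(964982703)
print(example.isoformat())  # 2000-07-30T18:45:03+00:00
```
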
---

# 2. MovieLens Movies Dataset (`movies.csv`)

**Source:** MovieLens 100k (GroupLens Research)  
**Description:** Basic movie metadata including title and genres.

| Column      | Type        | Description |
|-------------|-------------|-------------|
| `movieId`   | integer     | Unique movie identifier. |
| `title`     | string      | Movie title, often includes release year (e.g., _"Toy Story (1995)"_). |
| `genres`    | string      | Pipe-separated genre list (e.g., `"Comedy\|Romance"`). |

### Derived Columns (Created During Cleaning)

| Column          | Type             | Description |
|-----------------|------------------|-------------|
| `genres_list`   | list of strings  | Cleaned list of genres (e.g., `["Comedy", "Romance"]`). |
| `release_year`  | integer/float    | Extracted from the title using regex. |

---
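The two derived columns above could be produced along these lines. This is a sketch under stated assumptions rather than the project's actual code: the helper names are hypothetical, and the regex assumes the year appears as four digits in parentheses at the end of the title. A title without a year yields `None`; in a pandas column such gaps typically force a float dtype, which may be why the type is listed as integer/float.

```python
import re
from typing import List, Optional

# Assumption: the year is four digits in parentheses at the end of the title.
YEAR_PATTERN = re.compile(r"\((\d{4})\)\s*$")


def extract_release_year(title: str) -> Optional[int]:
    """Return the trailing four-digit year, or None when the title has none."""
    match = YEAR_PATTERN.search(title)
    return int(match.group(1)) if match else None


def split_genres(genres: str) -> List[str]:
    """Split a pipe-separated genre string into a list of trimmed names."""
    return [g.strip() for g in genres.split("|") if g.strip()]


print(extract_release_year("Toy Story (1995)"))  # 1995
print(split_genres("Comedy|Romance"))            # ['Comedy', 'Romance']
```
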

# 3. Kaggle Movie Metadata (`movies_metadata.csv`, `credits.csv`, etc.)

**Source:** The Movies Dataset (Kaggle, TMDB metadata)  
**Note:** Large files are stored on Box because of GitHub file size limits.

| Column          | Type                | Description |
|-----------------|---------------------|-------------|
| `id`            | string/integer      | TMDB movie ID; cleaned and cast to numeric. |
| `title`         | string              | Movie title from TMDB. |
| `genres`        | JSON-like string    | TMDB genres; available but not always directly used. |
| `budget`        | float               | Production budget in USD. |
| `revenue`       | float               | Worldwide revenue in USD. |
| `runtime`       | float               | Movie duration in minutes. |
| `release_date`  | string/date         | Theatrical release date. |

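As an illustration of the `id` and `genres` handling described in the table, the sketch below casts an ID to an integer and pulls genre names out of a JSON-like string. It assumes the `genres` field holds a Python-literal-style list of records with a `name` key and that malformed IDs should become missing values; both helper names are hypothetical and the sample values are made up for the example.

```python
import ast
from typing import List, Optional


def parse_tmdb_id(raw) -> Optional[int]:
    """Cast an id field to int; return None when it is not a whole number."""
    try:
        return int(str(raw).strip())
    except ValueError:
        return None


def parse_genre_names(raw: str) -> List[str]:
    """Extract genre names from a JSON-like string of {'id', 'name'} records."""
    try:
        records = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        return []  # not a parseable literal
    if not isinstance(records, list):
        return []
    return [r["name"] for r in records if isinstance(r, dict) and "name" in r]


print(parse_tmdb_id("862"))         # 862
print(parse_tmdb_id("1997-08-20"))  # None
print(parse_genre_names("[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]"))
# ['Animation', 'Comedy']
```
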

---

# 4. Cleaned Datasets

These are standardized versions of the raw datasets.

### `ratings_clean.csv`
- Converted timestamps to datetime  
- Removed invalid rows  
- Ensured numeric consistency  

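One way to read the three steps above is as a per-row check, sketched here with the standard library. What counts as an "invalid row" is not specified in this document, so the rules below (every field must parse, and `rating` must lie on the documented 0.5–5.0 scale) are assumptions, as is the function name `clean_rating_row`. The sample rows are invented for the example.

```python
from datetime import datetime, timezone
from typing import Dict, Optional


def clean_rating_row(row: Dict[str, str]) -> Optional[dict]:
    """Return a typed copy of one ratings row, or None if the row is invalid."""
    try:
        cleaned = {
            "userId": int(row["userId"]),
            "movieId": int(row["movieId"]),
            "rating": float(row["rating"]),
            "timestamp": datetime.fromtimestamp(int(row["timestamp"]), tz=timezone.utc),
        }
    except (KeyError, ValueError, TypeError, OverflowError, OSError):
        return None  # missing, non-numeric, or out-of-range field
    # Assumed validity rule: rating must lie on the documented 0.5-5.0 scale.
    if not 0.5 <= cleaned["rating"] <= 5.0:
        return None
    return cleaned


rows = [
    {"userId": "1", "movieId": "1", "rating": "4.0", "timestamp": "964982703"},
    {"userId": "2", "movieId": "x", "rating": "3.5", "timestamp": "964982703"},  # bad movieId
    {"userId": "3", "movieId": "5", "rating": "7.0", "timestamp": "964982703"},  # off the scale
]
clean_rows = [c for c in map(clean_rating_row, rows) if c is not None]
print(len(clean_rows))  # 1
```
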
### `movies_clean.csv`
- Extracted release years  
- Standardized titles  
- Converted genres into lists  

### `metadata_clean.csv`
- Selected metadata used for integration  
- Cleaned numeric and date fields  

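Cleaning the numeric and date fields might look like the sketch below. The helper names are hypothetical, the `YYYY-MM-DD` date format is an assumption about the raw file, and turning unparseable values into `None` is one possible policy rather than the project's documented behavior.

```python
from datetime import date, datetime
from typing import Optional


def to_float(raw) -> Optional[float]:
    """Parse a numeric field (budget, revenue, runtime); None when unparseable."""
    try:
        return float(raw)
    except (TypeError, ValueError):
        return None


def to_date(raw, fmt: str = "%Y-%m-%d") -> Optional[date]:
    """Parse a release date; the default YYYY-MM-DD format is an assumption."""
    try:
        return datetime.strptime(raw.strip(), fmt).date()
    except (AttributeError, ValueError):
        return None  # missing value or unexpected format


# Illustrative values only.
print(to_float("30000000"))   # 30000000.0
print(to_float(""))           # None
print(to_date("1995-10-30"))  # 1995-10-30
print(to_date("unknown"))     # None
```
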
---

# 5. Integrated Dataset (`merged_pipeline_output.csv`)

Contains MovieLens ratings merged with movie metadata.

| Column           | Type              | Description |
|------------------|-------------------|-------------|
| `userId`         | integer           | User who rated the movie. |
| `movieId`        | integer           | Movie identifier. |
| `rating`         | float             | User rating. |
| `timestamp`      | datetime          | Rating date. |
| `title`          | string            | Movie title. |
| `genres_list`    | list of strings   | Cleaned list of genres. |
| `release_year`   | integer/float     | Extracted release year. |
| `runtime`        | float             | Runtime (if integrated). |
| `budget`         | float             | Movie budget (if integrated). |
| `revenue`        | float             | Revenue (if integrated). |

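The core of the merge, joining ratings to movie rows on `movieId`, can be sketched as a left join over plain dictionaries. The function name and sample rows are hypothetical. The "(if integrated)" columns would additionally need a mapping from `movieId` to the TMDB `id`, which this dictionary does not describe, so that step is left out here.

```python
from typing import Dict, Iterable, List


def merge_on_movie_id(ratings: Iterable[dict], movies: Iterable[dict]) -> List[dict]:
    """Left-join rating rows to movie rows on `movieId`."""
    movies_by_id: Dict[int, dict] = {m["movieId"]: m for m in movies}
    merged = []
    for r in ratings:
        movie = movies_by_id.get(r["movieId"], {})
        # Rating fields win on key clashes; unmatched ratings keep only their own fields.
        merged.append({**movie, **r})
    return merged


# Illustrative rows only.
ratings = [
    {"userId": 1, "movieId": 1, "rating": 4.0},
    {"userId": 1, "movieId": 99, "rating": 3.0},  # no matching movie row
]
movies = [
    {"movieId": 1, "title": "Toy Story (1995)", "genres_list": ["Comedy"], "release_year": 1995},
]
merged = merge_on_movie_id(ratings, movies)
print(merged[0]["title"])    # Toy Story (1995)
print("title" in merged[1])  # False
```
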
---

# 6. Recommendation Outputs

### 6.1 Genre-Based Recommendations (`top_drama_recommendations.csv`)

| Column         | Type  | Description |
|----------------|-------|-------------|
| `movieId`      | int   | Movie identifier. |
| `title`        | str   | Movie title. |
| `genres_list`  | str   | Selected genre ("Drama"). |
| `rating`       | float | Average rating. |

### 6.2 Year-Based Recommendations (`top_1995_recommendations.csv`)

| Column           | Type  | Description |
|------------------|-------|-------------|
| `movieId`        | int   | Movie ID. |
| `title`          | str   | Title. |
| `release_year`   | int   | Selected year (e.g., 1995). |
| `rating`         | float | Average rating. |

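Both output files in this section could be derived from the merged dataset with one filter-then-average step, sketched below. The document does not state how the recommendations are ranked, so ordering by mean rating is an assumption suggested by the "Average rating" column. The function name `top_rated`, the `keep` predicate, and the sample rows are all hypothetical.

```python
from collections import defaultdict
from typing import Callable, Iterable, List, Tuple


def top_rated(
    rows: Iterable[dict], keep: Callable[[dict], bool], n: int = 10
) -> List[Tuple[int, str, float]]:
    """Return up to n (movieId, title, mean rating) tuples among rows passing `keep`."""
    totals = defaultdict(lambda: [0.0, 0])  # movieId -> [rating sum, rating count]
    titles = {}
    for row in rows:
        if keep(row):
            totals[row["movieId"]][0] += row["rating"]
            totals[row["movieId"]][1] += 1
            titles[row["movieId"]] = row["title"]
    means = [(mid, titles[mid], s / c) for mid, (s, c) in totals.items()]
    return sorted(means, key=lambda item: item[2], reverse=True)[:n]


# Illustrative rows only.
rows = [
    {"movieId": 1, "title": "Movie A (1995)", "genres_list": ["Drama"], "release_year": 1995, "rating": 4.0},
    {"movieId": 1, "title": "Movie A (1995)", "genres_list": ["Drama"], "release_year": 1995, "rating": 5.0},
    {"movieId": 2, "title": "Movie B (1995)", "genres_list": ["Comedy"], "release_year": 1995, "rating": 3.0},
    {"movieId": 3, "title": "Movie C (1996)", "genres_list": ["Drama", "Romance"], "release_year": 1996, "rating": 5.0},
]
top_drama = top_rated(rows, lambda r: "Drama" in r["genres_list"])
top_1995 = top_rated(rows, lambda r: r["release_year"] == 1995)
print(top_drama)  # [(3, 'Movie C (1996)', 5.0), (1, 'Movie A (1995)', 4.5)]
print(top_1995)   # [(1, 'Movie A (1995)', 4.5), (2, 'Movie B (1995)', 3.0)]
```
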
---

