# Random Split (80/20) - Method 1

## Overview

This notebook implements **Method 1: Random Split** for train/test data preparation.

### Methodology

- **Approach**: Random sampling without considering temporal order
- **Split Ratio**: 80% training, 20% testing
- **Random State**: 42 (for reproducibility)
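
The split ratio determines the exact row counts. Below is a minimal sketch of the arithmetic, assuming scikit-learn rounds a fractional `test_size` up to the next whole row and gives the remainder to the training set. The row count is the one reported in Step 1, and `split_sizes` is a hypothetical helper name.

```python
import math

def split_sizes(n_rows: int, test_size: float) -> tuple[int, int]:
    """Return (n_train, n_test) for a fractional test_size.

    Assumption: the test set is rounded up and the training set
    receives whatever is left, as train_test_split does when only
    test_size is given.
    """
    n_test = math.ceil(test_size * n_rows)
    return n_rows - n_test, n_test

n_train, n_test = split_sizes(26_024_289, 0.2)
print(f"train={n_train:,} test={n_test:,}")
# train=20,819,431 test=5,204,858
```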

### Use Case

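A random split is a baseline approach with the following properties:
- ✅ Spreads each user's and movie's ratings across both sets in roughly 80/20 proportion (not guaranteed for users or movies with few ratings)
- ✅ Simple to implement and understand
- ✅ Standard approach for initial model development
- ⚠️ Ignores timestamps, so a model can be trained on ratings made after the ones it is tested on, which does not reflect real-world deployment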
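
The temporal caveat can be made concrete. The sketch below uses made-up data to measure how many test ratings are older than the newest training rating after a random split. The toy frame, the seed, and the name `frac_test_before_last_train` are illustrative assumptions, and `DataFrame.sample` stands in for `train_test_split`.

```python
import pandas as pd

# Toy ratings with strictly increasing timestamps (illustrative data only).
toy = pd.DataFrame({
    "userId": [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
    "movieId": [10, 11, 10, 12, 11, 13, 12, 14, 13, 15],
    "rating": [4.0, 3.5, 5.0, 2.0, 4.5, 3.0, 1.0, 4.0, 3.5, 5.0],
    "timestamp": range(1_000, 1_010),
})

# Random 80/20 split; DataFrame.sample stands in for train_test_split here.
toy_test = toy.sample(frac=0.2, random_state=42)
toy_train = toy.drop(toy_test.index)

# Share of test ratings that predate the newest training rating.
# A strictly temporal split would give 0.0 for this quantity.
last_train_ts = toy_train["timestamp"].max()
frac_test_before_last_train = (toy_test["timestamp"] < last_train_ts).mean()
print(f"Test ratings older than the newest train rating: {frac_test_before_last_train:.0%}")
```

A value above zero means the model would be evaluated on ratings that precede some of its training data.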

### Output

- `train_ratings.csv`: Training set (~80% of data)
- `test_ratings.csv`: Test set (~20% of data)

## Step 1: Load Cleaned Ratings Dataset

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

# Load cleaned ratings
print("Loading cleaned ratings dataset...")
ratings = pd.read_csv('../../datasets/output/cleaned_datasets/cleaned_ratings.csv', low_memory=False)

print(f"✓ Dataset loaded successfully")
print(f"\nDataset Shape: {ratings.shape}")
print(f"Columns: {list(ratings.columns)}")
print(f"\nData Types:")
print(ratings.dtypes)
print(f"\nSample Data:")
print(ratings.head(10))

Loading cleaned ratings dataset...
✓ Dataset loaded successfully

Dataset Shape: (26024289, 4)
Columns: ['userId', 'movieId', 'rating', 'timestamp']

Data Types:
userId         int64
movieId        int64
rating       float64
timestamp      int64
dtype: object

Sample Data:
   userId  movieId  rating   timestamp
0       1      110     1.0  1425941529
1       1      147     4.5  1425942435
2       1      858     5.0  1425941523
3       1     1221     5.0  1425941546
4       1     1246     5.0  1425941556
5       1     1968     4.0  1425942148
6       1     2762     4.5  1425941300
7       1     2918     5.0  1425941593
8       1     2959     4.0  1425941601
9       1     4226     4.0  1425942228
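
With about 26 million rows, the default 64-bit columns are wider than the values need. The sketch below passes narrower dtypes to `pd.read_csv`, assuming the IDs fit in 32 bits. Half-star ratings are exactly representable in `float32`. The inline CSV stands in for the real file.

```python
import io
import pandas as pd

# A few rows copied from the sample output above, standing in for the real file.
csv_text = """userId,movieId,rating,timestamp
1,110,1.0,1425941529
1,147,4.5,1425942435
1,858,5.0,1425941523
"""

# Assumed column widths: int32 covers the ID ranges seen here, and float32
# represents half-star ratings exactly. The timestamp column stays int64.
dtypes = {"userId": "int32", "movieId": "int32", "rating": "float32", "timestamp": "int64"}

default = pd.read_csv(io.StringIO(csv_text))
compact = pd.read_csv(io.StringIO(csv_text), dtype=dtypes)

print(compact.dtypes)
print(f"Bytes per row: {default.memory_usage(index=False).sum() // len(default)} -> "
      f"{compact.memory_usage(index=False).sum() // len(compact)}")
```

Under these assumptions each row shrinks from 32 to 20 bytes of column data.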


## Step 2: Explore Dataset Statistics

In [2]:
# Basic statistics
print("Dataset Statistics:\n")
print(f"Total Ratings: {len(ratings):,}")
print(f"Unique Users: {ratings['userId'].nunique():,}")
print(f"Unique Movies: {ratings['movieId'].nunique():,}")
print(f"\nRating Distribution:")
print(ratings['rating'].value_counts().sort_index())
print(f"\nRating Statistics:")
print(ratings['rating'].describe())

Dataset Statistics:

Total Ratings: 26,024,289
Unique Users: 270,896
Unique Movies: 45,115

Rating Distribution:
rating
0.5     404897
1.0     843310
1.5     403607
2.0    1762440
2.5    1255358
3.0    5256722
3.5    3116213
4.0    6998802
4.5    2170441
5.0    3812499
Name: count, dtype: int64

Rating Statistics:
count    2.602429e+07
mean     3.528090e+00
std      1.065443e+00
min      5.000000e-01
25%      3.000000e+00
50%      3.500000e+00
75%      4.000000e+00
max      5.000000e+00
Name: rating, dtype: float64
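
These counts also show how sparse the user-movie matrix is, which matters for the collaborative filtering models named in the next steps. Here is a short worked calculation from the figures printed above:

```python
# Counts taken from the output above.
n_ratings = 26_024_289
n_users = 270_896
n_movies = 45_115

possible_pairs = n_users * n_movies          # every user rating every movie
density = n_ratings / possible_pairs         # share of pairs that have a rating

print(f"Possible user-movie pairs: {possible_pairs:,}")
print(f"Density: {density:.4%}  (sparsity: {1 - density:.4%})")
print(f"Mean ratings per user:  {n_ratings / n_users:.1f}")
print(f"Mean ratings per movie: {n_ratings / n_movies:.1f}")
```

Only about 0.21% of the possible user-movie pairs carry a rating.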


## Step 3: Perform Random 80/20 Split

In [3]:
# Perform random split
print("Performing random 80/20 train/test split...\n")

train, test = train_test_split(
    ratings,
    test_size=0.2,      # 20% for test
    random_state=42,    # For reproducibility
    shuffle=True        # Ensure random sampling
)

print("✓ Split completed successfully\n")
print(f"Training Set:")
print(f"  - Rows: {len(train):,} ({len(train)/len(ratings)*100:.2f}%)")
print(f"  - Users: {train['userId'].nunique():,}")
print(f"  - Movies: {train['movieId'].nunique():,}")

print(f"\nTest Set:")
print(f"  - Rows: {len(test):,} ({len(test)/len(ratings)*100:.2f}%)")
print(f"  - Users: {test['userId'].nunique():,}")
print(f"  - Movies: {test['movieId'].nunique():,}")

Performing random 80/20 train/test split...

✓ Split completed successfully

Training Set:
  - Rows: 20,819,431 (80.00%)
  - Users: 269,710
  - Movies: 43,326

Test Set:
  - Rows: 5,204,858 (20.00%)
  - Users: 253,107
  - Movies: 31,629
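
The unique counts above imply that some test rows have no training history. Every user and movie appears in at least one of the two sets, so 270,896 - 269,710 = 1,186 users and 45,115 - 43,326 = 1,789 movies occur only in the test set. The sketch below flags such cold-start rows, using small stand-in frames rather than the real `train` and `test`. `cold_start_mask` is a hypothetical helper.

```python
import pandas as pd

def cold_start_mask(train_df: pd.DataFrame, test_df: pd.DataFrame) -> pd.Series:
    """Mark test rows whose user or movie never appears in train_df."""
    unseen_user = ~test_df["userId"].isin(train_df["userId"])
    unseen_movie = ~test_df["movieId"].isin(train_df["movieId"])
    return unseen_user | unseen_movie

# Stand-in data (illustrative only).
toy_train = pd.DataFrame({"userId": [1, 1, 2], "movieId": [10, 11, 10], "rating": [4.0, 3.0, 5.0]})
toy_test = pd.DataFrame({"userId": [1, 3, 2], "movieId": [12, 10, 11], "rating": [2.0, 4.5, 3.5]})

mask = cold_start_mask(toy_train, toy_test)
print(toy_test.assign(cold_start=mask))
print(f"Cold-start test rows: {mask.sum()} of {len(toy_test)}")
```

A model that cannot score unseen users or movies needs a fallback for these rows, such as the global mean. Alternatively, they can be reported separately.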


## Step 4: Validate Split Integrity

In [4]:
print("Validating split integrity...\n")

# Check 1: No overlap between train and test indices
train_indices = set(train.index)
test_indices = set(test.index)
overlap = train_indices.intersection(test_indices)

print(f"Check 1: No row overlap")
print(f"  - Train indices: {len(train_indices):,}")
print(f"  - Test indices: {len(test_indices):,}")
print(f"  - Overlap: {len(overlap)} {'✓ PASS' if len(overlap) == 0 else '✗ FAIL'}")

# Check 2: Verify proportions
train_pct = len(train) / len(ratings) * 100
test_pct = len(test) / len(ratings) * 100
total_pct = train_pct + test_pct

print(f"\nCheck 2: Verify 80/20 proportions")
print(f"  - Train: {train_pct:.2f}% (expected ~80%)")
print(f"  - Test: {test_pct:.2f}% (expected ~20%)")
print(f"  - Total: {total_pct:.2f}% (expected 100%)")
print(f"  - {'✓ PASS' if 79.5 <= train_pct <= 80.5 and 19.5 <= test_pct <= 20.5 else '✗ FAIL'}")

# Check 3: All columns preserved
print(f"\nCheck 3: All columns preserved")
print(f"  - Original columns: {list(ratings.columns)}")
print(f"  - Train columns: {list(train.columns)}")
print(f"  - Test columns: {list(test.columns)}")
print(f"  - {'✓ PASS' if list(train.columns) == list(test.columns) == list(ratings.columns) else '✗ FAIL'}")

# Check 4: Rating distribution similarity
print(f"\nCheck 4: Rating distribution comparison")
print(f"\nOriginal distribution:")
original_dist = ratings['rating'].value_counts(normalize=True).sort_index() * 100
print(original_dist.to_string())

print(f"\nTrain distribution:")
train_dist = train['rating'].value_counts(normalize=True).sort_index() * 100
print(train_dist.to_string())

print(f"\nTest distribution:")
test_dist = test['rating'].value_counts(normalize=True).sort_index() * 100
print(test_dist.to_string())

print(f"\n✓ All validation checks completed")

Validating split integrity...

Check 1: No row overlap
  - Train indices: 20,819,431
  - Test indices: 5,204,858
  - Overlap: 0 ✓ PASS

Check 2: Verify 80/20 proportions
  - Train: 80.00% (expected ~80%)
  - Test: 20.00% (expected ~20%)
  - Total: 100.00% (expected 100%)
  - ✓ PASS

Check 3: All columns preserved
  - Original columns: ['userId', 'movieId', 'rating', 'timestamp']
  - Train columns: ['userId', 'movieId', 'rating', 'timestamp']
  - Test columns: ['userId', 'movieId', 'rating', 'timestamp']
  - ✓ PASS

Check 4: Rating distribution comparison

Original distribution:
rating
0.5     1.555843
1.0     3.240473
1.5     1.550886
2.0     6.772289
2.5     4.823794
3.0    20.199292
3.5    11.974248
4.0    26.893346
4.5     8.340059
5.0    14.649772

Train distribution:
rating
0.5     1.555153
1.0     3.240651
1.5     1.549586
2.0     6.775910
2.5     4.820569
3.0    20.196426
3.5    11.978132
4.0    26.891700
4.5     8.339594
5.0    14.652278

Test distribution:
rating
0.5     1.558
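
The comparison can be reduced to a single number. The sketch below computes the largest gap between the original and train shares, using the percentages printed above. `max_abs_gap` is an illustrative name.

```python
# Percent shares copied from the output above, ordered by rating 0.5 .. 5.0.
original = [1.555843, 3.240473, 1.550886, 6.772289, 4.823794,
            20.199292, 11.974248, 26.893346, 8.340059, 14.649772]
train = [1.555153, 3.240651, 1.549586, 6.775910, 4.820569,
         20.196426, 11.978132, 26.891700, 8.339594, 14.652278]

# Largest absolute difference in percentage points across rating values.
max_abs_gap = max(abs(o - t) for o, t in zip(original, train))
print(f"Largest train-vs-original gap: {max_abs_gap:.4f} percentage points")
```

Because the test set is the remaining 20% of the rows, its gaps should be roughly four times as large and in the opposite direction. That is still under 0.02 percentage points.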

## Step 5: Save Train and Test Datasets

In [5]:
import os

# Define output paths
output_dir = '../../datasets/output/split_and_train_datasets/80-20/'
train_path = os.path.join(output_dir, 'train_ratings.csv')
test_path = os.path.join(output_dir, 'test_ratings.csv')

# Create the output directory if it does not exist yet
os.makedirs(output_dir, exist_ok=True)

# Save datasets
print("Saving train and test datasets...\n")

print(f"Saving training set to: {train_path}")
train.to_csv(train_path, index=False)
print(f"✓ Training set saved ({len(train):,} rows)")

print(f"\nSaving test set to: {test_path}")
test.to_csv(test_path, index=False)
print(f"✓ Test set saved ({len(test):,} rows)")

# Verify files were created and report their sizes
print("\nVerifying saved files:")
train_size_mb = os.path.getsize(train_path) / (1024 * 1024)
test_size_mb = os.path.getsize(test_path) / (1024 * 1024)
print(f"  - {train_path}")
print(f"    Size: {train_size_mb:.2f} MB")
print(f"  - {test_path}")
print(f"    Size: {test_size_mb:.2f} MB")

print("\n✓ All files saved successfully")

Saving train and test datasets...

Saving training set to: ../../datasets/output/split_and_train_datasets/80-20/train_ratings.csv
✓ Training set saved (20,819,431 rows)

Saving test set to: ../../datasets/output/split_and_train_datasets/80-20/test_ratings.csv
✓ Test set saved (5,204,858 rows)

Verifying saved files:
  - ../../datasets/output/split_and_train_datasets/80-20/train_ratings.csv
    Size: 521.49 MB
  - ../../datasets/output/split_and_train_datasets/80-20/test_ratings.csv
    Size: 130.37 MB

✓ All files saved successfully
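
File size alone does not confirm that the contents survived the write. Here is a minimal round-trip sketch, assuming a temporary directory and a tiny stand-in frame instead of the real output paths:

```python
import os
import tempfile

import pandas as pd

# Stand-in for the train set (illustrative only).
toy_train = pd.DataFrame({
    "userId": [1, 1, 2],
    "movieId": [110, 147, 858],
    "rating": [1.0, 4.5, 5.0],
    "timestamp": [1425941529, 1425942435, 1425941523],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "train_ratings.csv")
    toy_train.to_csv(path, index=False)

    # Read the file back and compare it with what was written.
    reloaded = pd.read_csv(path)

round_trip_ok = reloaded.equals(toy_train)
print(f"Rows written: {len(toy_train)}, rows read: {len(reloaded)}, identical: {round_trip_ok}")
```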


## Summary

### Random Split (80/20) Completed

**Method**: Random sampling without temporal considerations

**Output Files**:
- `train_ratings.csv`: Training set (~80% of data)
- `test_ratings.csv`: Test set (~20% of data)

**Key Characteristics**:
- Random state: 42 (reproducible)
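- No row appears in both sets (index overlap verified in Step 4); temporal leakage is still possible because timestamps are ignored
- All four columns preserved
- Rating distributions nearly identical: each train share is within 0.004 percentage points of the full-dataset share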

**Next Steps**:
1. Use `train_ratings.csv` for model training (collaborative filtering)
2. Use `test_ratings.csv` for evaluation (RMSE, MAE, Precision@K)
3. Compare with temporal split results (Method 2)
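
As a reference point for the evaluation step, a global-mean predictor gives an error that any trained model should beat. The sketch below uses stand-in ratings. The `rmse` and `mae` helpers are written here for illustration and are not taken from a library.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Stand-in ratings (illustrative only).
train_ratings = [4.0, 3.5, 5.0, 2.0, 3.0]
test_ratings = [4.5, 3.0, 1.0]

# Baseline: predict the training mean for every test rating.
global_mean = sum(train_ratings) / len(train_ratings)
baseline = [global_mean] * len(test_ratings)

print(f"Global mean: {global_mean:.2f}")
print(f"Baseline RMSE: {rmse(test_ratings, baseline):.4f}")
print(f"Baseline MAE:  {mae(test_ratings, baseline):.4f}")
```

On this dataset the rating standard deviation is about 1.065 (Step 2), so the global-mean baseline RMSE on the test set should land close to that value.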

In [6]:
# Final summary statistics
print("="*60)
print("RANDOM SPLIT (80/20) - FINAL SUMMARY")
print("="*60)
print(f"\nOriginal Dataset: {len(ratings):,} ratings")
print(f"\nTraining Set: {len(train):,} ratings ({len(train)/len(ratings)*100:.2f}%)")
print(f"  - Users: {train['userId'].nunique():,}")
print(f"  - Movies: {train['movieId'].nunique():,}")
print(f"  - Avg rating: {train['rating'].mean():.4f}")

print(f"\nTest Set: {len(test):,} ratings ({len(test)/len(ratings)*100:.2f}%)")
print(f"  - Users: {test['userId'].nunique():,}")
print(f"  - Movies: {test['movieId'].nunique():,}")
print(f"  - Avg rating: {test['rating'].mean():.4f}")

print(f"\nOutput Location: {output_dir}")
print(f"  - train_ratings.csv ({train_size_mb:.2f} MB)")
print(f"  - test_ratings.csv ({test_size_mb:.2f} MB)")
print("\n" + "="*60)
print("✓ Random split completed successfully")
print("="*60)

RANDOM SPLIT (80/20) - FINAL SUMMARY

Original Dataset: 26,024,289 ratings

Training Set: 20,819,431 ratings (80.00%)
  - Users: 269,710
  - Movies: 43,326
  - Avg rating: 3.5281

Test Set: 5,204,858 ratings (20.00%)
  - Users: 253,107
  - Movies: 31,629
  - Avg rating: 3.5279

Output Location: ../../datasets/output/split_and_train_datasets/80-20/
  - train_ratings.csv (521.49 MB)
  - test_ratings.csv (130.37 MB)

✓ Random split completed successfully
