# Temporal Split (80/20) - Method 2

## Overview

This notebook implements **Method 2: Temporal Split** for train/test data preparation.

### Methodology

- **Approach**: Time-based split - train on past, test on future
- **Split Ratio**: 80% training (older ratings), 20% testing (newer ratings)
- **Ordering**: Sort by timestamp before splitting
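
The three bullets above can be sketched on a toy table. This is a minimal illustration, not the notebook's own cell: `temporal_split` is a hypothetical helper name and the ten rows are made up.

```python
import pandas as pd

# Toy ratings table; the values are made up for illustration only.
toy = pd.DataFrame({
    "userId":    [1, 2, 1, 3, 2, 3, 1, 2, 3, 1],
    "movieId":   [10, 10, 20, 20, 30, 30, 40, 40, 50, 50],
    "rating":    [4.0, 3.5, 5.0, 2.0, 4.5, 3.0, 1.0, 4.0, 3.5, 5.0],
    "timestamp": [50, 10, 90, 30, 70, 20, 100, 40, 80, 60],
})

def temporal_split(df, train_frac=0.8):
    """Sort by timestamp, then cut at the train_frac position (hypothetical helper)."""
    ordered = df.sort_values("timestamp", kind="stable").reset_index(drop=True)
    cut = int(len(ordered) * train_frac)
    return ordered.iloc[:cut], ordered.iloc[cut:]

toy_train, toy_test = temporal_split(toy)
print(len(toy_train), len(toy_test))
print(toy_train["timestamp"].max(), toy_test["timestamp"].min())
```

The oldest eight rows form the training set and the newest two form the test set, so every training timestamp precedes every test timestamp.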

### Why Temporal Split is Superior

A temporal split reflects how a recommender system is used in production better than a random split does:
- ✅ **Realistic evaluation**: The model predicts future user behavior from past patterns
- ✅ **Prevents temporal leakage**: The model never trains on future ratings to predict past ones
- ✅ **Common in practice**: Time-ordered evaluation is widely used for production recommenders
- ✅ **Captures dynamics**: Measures how well the model copes with changing user tastes
- ✅ **Reveals cold-start**: Exposes users and items that first appear in the test period
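
To make the leakage point concrete, the sketch below measures how many training rows are newer than the earliest test row under each splitting strategy. It uses synthetic data, and `future_leak_fraction` is a name introduced here for illustration only.

```python
import numpy as np
import pandas as pd

def future_leak_fraction(train_df, test_df):
    """Share of training rows that are newer than the earliest test row."""
    earliest_test = test_df["timestamp"].min()
    return float((train_df["timestamp"] > earliest_test).mean())

# Synthetic data: 1,000 ratings with distinct, increasing timestamps.
demo = pd.DataFrame({"timestamp": np.arange(1000), "rating": 3.0})

# Random 80/20 split (seed fixed so the example is repeatable).
rand_train = demo.sample(frac=0.8, random_state=0)
rand_test = demo.drop(rand_train.index)

# Temporal 80/20 split on the same data.
ordered = demo.sort_values("timestamp")
temp_train, temp_test = ordered.iloc[:800], ordered.iloc[800:]

random_leak = future_leak_fraction(rand_train, rand_test)
temporal_leak = future_leak_fraction(temp_train, temp_test)
print(f"random split:   {random_leak:.2%} of training rows postdate the first test row")
print(f"temporal split: {temporal_leak:.2%}")
```

With a random split, most training rows are newer than at least one test row. With a temporal split, none are.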

### Real-World Simulation

```
Training Period: All ratings from January 1995 to late January 2015
  → Learn user preferences and item patterns

Test Period: All ratings from late January 2015 to August 2017
  → Predict recent/future ratings based on historical patterns

Question: Can the model recommend movies in 2015-2017
         based on patterns learned from 1995-2015?
```
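
An alternative to a ratio-based cut is to split at an explicit calendar date. The sketch below assumes timestamps are Unix seconds interpreted as UTC, which matches how `pd.to_datetime(..., unit='s')` reads them. `split_at_date` is a hypothetical helper. The timestamps are taken from this notebook's output, and the other column values are placeholders.

```python
import pandas as pd

def split_at_date(df, cutoff):
    """Rows strictly before `cutoff` go to train; the rest go to test."""
    # Convert the calendar date to Unix seconds (timestamps are treated as UTC).
    cutoff_ts = int(pd.Timestamp(cutoff, tz="UTC").timestamp())
    before = df["timestamp"] < cutoff_ts
    return df[before], df[~before]

# Timestamps: last training second, first test second, and one later rating.
rows = pd.DataFrame({
    "userId": [1, 1, 1],
    "movieId": [110, 147, 858],
    "rating": [1.0, 4.5, 5.0],
    "timestamp": [1422580676, 1422580677, 1425941529],
})
early, late = split_at_date(rows, "2015-01-30 01:17:57")
print(len(early), len(late))
```

A date-based cut keeps the boundary easy to explain, but the resulting train/test ratio is no longer fixed at 80/20.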

### Output

- `train_ratings.csv`: Training set (older 80% of ratings)
- `test_ratings.csv`: Test set (newer 20% of ratings)

## Step 1: Load Cleaned Ratings Dataset

In [1]:
import pandas as pd

# Load cleaned ratings
print("Loading cleaned ratings dataset...")
ratings = pd.read_csv('../../datasets/output/cleaned_datasets/cleaned_ratings.csv', low_memory=False)

print(f"✓ Dataset loaded successfully")
print(f"\nDataset Shape: {ratings.shape}")
print(f"Columns: {list(ratings.columns)}")
print(f"\nData Types:")
print(ratings.dtypes)
print(f"\nSample Data:")
print(ratings.head(10))

# Verify timestamp column exists
if 'timestamp' not in ratings.columns:
    raise ValueError("ERROR: timestamp column not found in ratings dataset!")
else:
    print(f"\n✓ Timestamp column verified")

Loading cleaned ratings dataset...
✓ Dataset loaded successfully

Dataset Shape: (26024289, 4)
Columns: ['userId', 'movieId', 'rating', 'timestamp']

Data Types:
userId         int64
movieId        int64
rating       float64
timestamp      int64
dtype: object

Sample Data:
   userId  movieId  rating   timestamp
0       1      110     1.0  1425941529
1       1      147     4.5  1425942435
2       1      858     5.0  1425941523
3       1     1221     5.0  1425941546
4       1     1246     5.0  1425941556
5       1     1968     4.0  1425942148
6       1     2762     4.5  1425941300
7       1     2918     5.0  1425941593
8       1     2959     4.0  1425941601
9       1     4226     4.0  1425942228

✓ Timestamp column verified
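
With about 26 million rows, the default `int64`/`float64` columns use a fair amount of memory. One possible mitigation, sketched below on a three-row stand-in for the CSV, is to pass explicit narrower dtypes to `pd.read_csv`. The chosen widths are an assumption and should be checked against the real ID ranges before use.

```python
import io
import pandas as pd

# A few rows copied from the sample output above, standing in for the real CSV.
csv_text = """userId,movieId,rating,timestamp
1,110,1.0,1425941529
1,147,4.5,1425942435
1,858,5.0,1425941523
"""

# Narrower dtypes than the int64/float64 defaults. These widths are an
# assumption: confirm that the real ID ranges fit before relying on them.
compact_dtypes = {
    "userId": "int32",
    "movieId": "int32",
    "rating": "float32",
    "timestamp": "int64",
}

default_df = pd.read_csv(io.StringIO(csv_text))
compact_df = pd.read_csv(io.StringIO(csv_text), dtype=compact_dtypes)

default_bytes = default_df.memory_usage(index=False).sum()
compact_bytes = compact_df.memory_usage(index=False).sum()
print(default_bytes, compact_bytes)
```

Ratings on a half-star scale are exactly representable in `float32`, so the narrower type does not change their values.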


## Step 2: Analyze Temporal Distribution

In [2]:
# Convert Unix timestamp to datetime
print("Analyzing temporal distribution...\n")

ratings['datetime'] = pd.to_datetime(ratings['timestamp'], unit='s')
ratings['year'] = ratings['datetime'].dt.year
ratings['month'] = ratings['datetime'].dt.to_period('M')

# Display temporal range
print(f"Temporal Range:")
print(f"  - Earliest rating: {ratings['datetime'].min()}")
print(f"  - Latest rating: {ratings['datetime'].max()}")
print(f"  - Time span: {(ratings['datetime'].max() - ratings['datetime'].min()).days} days")

# Ratings by year
print(f"\nRatings by Year:")
year_counts = ratings['year'].value_counts().sort_index()
print(year_counts)

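# Calculate the 80% split point.
# The row at position split_idx is the first row of the test set, so the
# training period ends just before split_timestamp.
split_idx = int(len(ratings) * 0.8)
ratings_sorted = ratings.sort_values('timestamp')
split_timestamp = ratings_sorted.iloc[split_idx]['timestamp']
split_datetime = pd.to_datetime(split_timestamp, unit='s')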

print(f"\n80/20 Split Point:")
print(f"  - Split at index: {split_idx:,}")
print(f"  - Split timestamp: {split_timestamp}")
print(f"  - Split date: {split_datetime}")
print(f"\nTraining period: {ratings_sorted['datetime'].min()} to {split_datetime}")
print(f"Test period: {split_datetime} to {ratings_sorted['datetime'].max()}")

Analyzing temporal distribution...

Temporal Range:
  - Earliest rating: 1995-01-09 11:46:44
  - Latest rating: 2017-08-04 06:57:50
  - Time span: 8242 days

Ratings by Year:
year
1995          4
1996    1733263
1997     763932
1998     329720
1999    1231263
2000    2034030
2001    1239718
2002     910659
2003    1079511
2004    1202081
2005    1850589
2006    1211577
2007    1096103
2008    1211067
2009     993934
2010     983023
2011     835260
2012     793710
2013     634778
2014     586094
2015    1913720
2016    2094961
2017    1295292
Name: count, dtype: int64

80/20 Split Point:
  - Split at index: 20,819,431
  - Split timestamp: 1422580677
  - Split date: 2015-01-30 01:17:57

Training period: 1995-01-09 11:46:44 to 2015-01-30 01:17:57
Test period: 2015-01-30 01:17:57 to 2017-08-04 06:57:50
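
A positional cut can separate ratings that share the boundary timestamp. If that matters, one option is to cut on the timestamp value instead, as in the sketch below. `split_on_timestamp` is a hypothetical helper and the data is made up.

```python
import pandas as pd

def split_on_timestamp(df, train_frac=0.8):
    """Cut on a timestamp value so rows with equal timestamps stay together."""
    ordered = df.sort_values("timestamp", kind="stable")
    cut_ts = ordered["timestamp"].iloc[int(len(ordered) * train_frac)]
    train_part = ordered[ordered["timestamp"] < cut_ts]
    test_part = ordered[ordered["timestamp"] >= cut_ts]
    return train_part, test_part

# Made-up data in which three ratings share timestamp 7, straddling the 80% mark.
tied = pd.DataFrame({"timestamp": [1, 2, 3, 4, 5, 6, 7, 7, 7, 8],
                     "rating":    [3.0] * 10})
tied_train, tied_test = split_on_timestamp(tied)
print(len(tied_train), len(tied_test))
```

The trade-off is that the training share can fall below the requested fraction when many rows share the boundary timestamp, as it does in this toy case.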


## Step 3: Summarize Temporal Patterns (Optional)

In [3]:
# Display ratings distribution over time
print("Ratings Distribution by Year:\n")

year_stats = ratings.groupby('year').agg({
    'rating': ['count', 'mean'],
    'userId': 'nunique',
    'movieId': 'nunique'
})
year_stats.columns = ['Total Ratings', 'Avg Rating', 'Unique Users', 'Unique Movies']
print(year_stats)

# Identify split year
split_year = split_datetime.year
print(f"\n✓ Split occurs in year: {split_year}")
print(f"Training data: Up to {split_datetime.strftime('%Y-%m-%d')}")
print(f"Test data: From {split_datetime.strftime('%Y-%m-%d')} onwards")

Ratings Distribution by Year:

      Total Ratings  Avg Rating  Unique Users  Unique Movies
year                                                        
1995              4    3.750000             2              4
1996        1733263    3.554189         34448           1392
1997         763932    3.592388         16768           1679
1998         329720    3.522659          6880           2309
1999        1231263    3.617325         14223           3054
2000        2034030    3.577626         25358           3934
2001        1239718    3.535594         16430           4848
2002         910659    3.484433         11661           5805
2003        1079511    3.471338         11899           6873
2004        1202081    3.426694         10829           8032
2005        1850589    3.431955         15616           8512
2006        1211577    3.459998         12975           8844
2007        1096103    3.463976         13044           9383
2008        1211067    3.527408         15182         

## Step 4: Perform Temporal Split

In [4]:
# Sort by timestamp (critical for temporal split!)
print("Performing temporal 80/20 train/test split...\n")
print("Step 1: Sorting by timestamp...")

ratings_sorted = ratings.sort_values('timestamp').reset_index(drop=True)

print(f"✓ Sorted {len(ratings_sorted):,} ratings by timestamp")

# Perform split
print("\nStep 2: Splitting at 80% mark...")
split_idx = int(len(ratings_sorted) * 0.8)

train = ratings_sorted.iloc[:split_idx].copy()
test = ratings_sorted.iloc[split_idx:].copy()

print(f"\n✓ Split completed successfully\n")

# Display split statistics
print(f"Training Set:")
print(f"  - Rows: {len(train):,} ({len(train)/len(ratings)*100:.2f}%)")
print(f"  - Temporal range: {train['datetime'].min()} to {train['datetime'].max()}")
print(f"  - Users: {train['userId'].nunique():,}")
print(f"  - Movies: {train['movieId'].nunique():,}")
print(f"  - Avg rating: {train['rating'].mean():.4f}")

print(f"\nTest Set:")
print(f"  - Rows: {len(test):,} ({len(test)/len(ratings)*100:.2f}%)")
print(f"  - Temporal range: {test['datetime'].min()} to {test['datetime'].max()}")
print(f"  - Users: {test['userId'].nunique():,}")
print(f"  - Movies: {test['movieId'].nunique():,}")
print(f"  - Avg rating: {test['rating'].mean():.4f}")

Performing temporal 80/20 train/test split...

Step 1: Sorting by timestamp...
✓ Sorted 26,024,289 ratings by timestamp

Step 2: Splitting at 80% mark...

✓ Split completed successfully

Training Set:
  - Rows: 20,819,431 (80.00%)
  - Temporal range: 1995-01-09 11:46:44 to 2015-01-30 01:17:56
  - Users: 227,222
  - Movies: 25,636
  - Avg rating: 3.5210

Test Set:
  - Rows: 5,204,858 (20.00%)
  - Temporal range: 2015-01-30 01:17:57 to 2017-08-04 06:57:50
  - Users: 48,464
  - Movies: 42,721
  - Avg rating: 3.5566
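
The same statistics are printed for both splits here and again in the final summary. A small helper could collect them once, as in this sketch. `describe_split` is a name introduced here, and the frames are tiny made-up stand-ins for `train` and `test`.

```python
import pandas as pd

def describe_split(df):
    """Collect the per-split statistics that the cell above prints."""
    return {
        "rows": len(df),
        "first_ts": int(df["timestamp"].min()),
        "last_ts": int(df["timestamp"].max()),
        "users": int(df["userId"].nunique()),
        "movies": int(df["movieId"].nunique()),
        "avg_rating": round(float(df["rating"].mean()), 4),
    }

# Tiny made-up frames standing in for `train` and `test`.
demo_train = pd.DataFrame({"userId": [1, 1, 2], "movieId": [10, 20, 10],
                           "rating": [4.0, 3.0, 5.0], "timestamp": [1, 2, 3]})
demo_test = pd.DataFrame({"userId": [2, 3], "movieId": [20, 30],
                          "rating": [2.0, 4.0], "timestamp": [4, 5]})

summary = pd.DataFrame({"train": describe_split(demo_train),
                        "test": describe_split(demo_test)})
print(summary)
```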


## Step 5: Validate Temporal Ordering

In [5]:
print("Validating temporal split integrity...\n")

# Check 1: Strict temporal ordering (all train timestamps < all test timestamps)
max_train_timestamp = train['timestamp'].max()
min_test_timestamp = test['timestamp'].min()

print(f"Check 1: Temporal ordering")
print(f"  - Latest train timestamp: {pd.to_datetime(max_train_timestamp, unit='s')}")
print(f"  - Earliest test timestamp: {pd.to_datetime(min_test_timestamp, unit='s')}")
print(f"  - Max train < Min test: {max_train_timestamp < min_test_timestamp}")

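# Note: the split is positional, so ratings that share the boundary timestamp can
# fall on both sides of the cut. Sorting guarantees that no train row is newer
# than any test row, so a tie at the boundary is the only possible overlap.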
if max_train_timestamp >= min_test_timestamp:
    overlap_count = len(train[train['timestamp'] == max_train_timestamp])
    print(f"  ⚠️  Boundary overlap: {overlap_count} ratings share the split timestamp")
    print(f"      This is normal and acceptable for temporal splits")
else:
    print(f"  ✓ PASS - Perfect temporal separation")

# Check 2: No index overlap (holds by construction after reset_index and iloc
# slicing; kept as a sanity check)
train_indices = set(train.index)
test_indices = set(test.index)
overlap = train_indices.intersection(test_indices)

print(f"\nCheck 2: No row overlap")
print(f"  - Train indices: {len(train_indices):,}")
print(f"  - Test indices: {len(test_indices):,}")
print(f"  - Overlap: {len(overlap)} {'✓ PASS' if len(overlap) == 0 else '✗ FAIL'}")

# Check 3: Verify proportions
train_pct = len(train) / len(ratings) * 100
test_pct = len(test) / len(ratings) * 100
total_pct = train_pct + test_pct

print(f"\nCheck 3: Verify 80/20 proportions")
print(f"  - Train: {train_pct:.2f}% (expected ~80%)")
print(f"  - Test: {test_pct:.2f}% (expected ~20%)")
print(f"  - Total: {total_pct:.2f}% (expected 100%)")
print(f"  - {'✓ PASS' if 79.5 <= train_pct <= 80.5 and 19.5 <= test_pct <= 20.5 else '✗ FAIL'}")

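# Check 4: All columns preserved
# (`ratings` now includes the helper columns datetime, year, and month, hence 7.)
print(f"\nCheck 4: All columns preserved")
print(f"  - Original columns: {len(ratings.columns)}")
print(f"  - Train columns: {len(train.columns)}")
print(f"  - Test columns: {len(test.columns)}")
# Compare column names, not just counts, so a renamed or reordered column fails the check
print(f"  - {'✓ PASS' if list(train.columns) == list(test.columns) == list(ratings.columns) else '✗ FAIL'}")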

# Check 5: Rating distribution comparison
print(f"\nCheck 5: Rating distribution comparison")
print(f"\nTrain distribution:")
train_dist = train['rating'].value_counts(normalize=True).sort_index() * 100
print(train_dist.to_string())

print(f"\nTest distribution:")
test_dist = test['rating'].value_counts(normalize=True).sort_index() * 100
print(test_dist.to_string())

print(f"\n✓ All validation checks completed")

Validating temporal split integrity...

Check 1: Temporal ordering
  - Latest train timestamp: 2015-01-30 01:17:56
  - Earliest test timestamp: 2015-01-30 01:17:57
  - Max train < Min test: True
  ✓ PASS - Perfect temporal separation

Check 2: No row overlap
  - Train indices: 20,819,431
  - Test indices: 5,204,858
  - Overlap: 0 ✓ PASS

Check 3: Verify 80/20 proportions
  - Train: 80.00% (expected ~80%)
  - Test: 20.00% (expected ~20%)
  - Total: 100.00% (expected 100%)
  - ✓ PASS

Check 4: All columns preserved
  - Original columns: 7
  - Train columns: 7
  - Test columns: 7
  - ✓ PASS

Check 5: Rating distribution comparison

Train distribution:
rating
0.5     1.255193
1.0     3.471853
1.5     1.424693
2.0     7.204083
2.5     4.395034
3.0    21.515521
3.5    10.831309
4.0    27.729495
4.5     7.549313
5.0    14.623507

Test distribution:
rating
0.5     2.758442
1.0     2.314953
1.5     2.055656
2.0     5.045114
2.5     6.538834
3.0    14.934375
3.5    16.546004
4.0    23.548750
4.5
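
The printed checks rely on someone reading the output. A stricter variant, sketched here under the assumption that a failed invariant should stop the notebook, raises an `AssertionError` instead. `validate_temporal_split` and its tolerance parameter are hypothetical.

```python
import pandas as pd

def validate_temporal_split(full_df, train_df, test_df, train_frac=0.8, tol=0.005):
    """Raise AssertionError if the split breaks a basic invariant."""
    # Every row lands in exactly one split.
    assert len(train_df) + len(test_df) == len(full_df), "row count mismatch"
    # No training timestamp is later than any test timestamp (ties allowed).
    assert train_df["timestamp"].max() <= test_df["timestamp"].min(), "temporal order violated"
    # The training share is within `tol` of the requested fraction.
    share = len(train_df) / len(full_df)
    assert abs(share - train_frac) <= tol, f"train share {share:.4f} outside tolerance"
    # Both splits expose the same columns as the source frame.
    assert list(train_df.columns) == list(test_df.columns) == list(full_df.columns), "column mismatch"
    return True

# Made-up ten-row frame, already sorted by timestamp.
full = pd.DataFrame({"timestamp": range(10), "rating": [3.0] * 10})
ok = validate_temporal_split(full, full.iloc[:8], full.iloc[8:])
print(ok)
```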

## Step 6: Cold-Start Analysis

In [6]:
print("Analyzing cold-start scenarios...\n")

# User overlap analysis
train_users = set(train['userId'].unique())
test_users = set(test['userId'].unique())

common_users = train_users.intersection(test_users)
new_users_in_test = test_users - train_users
users_only_in_train = train_users - test_users

print(f"User Analysis:")
print(f"  - Train users: {len(train_users):,}")
print(f"  - Test users: {len(test_users):,}")
print(f"  - Common users (in both): {len(common_users):,} ({len(common_users)/len(test_users)*100:.2f}% of test)")
print(f"  - NEW users in test (cold-start): {len(new_users_in_test):,} ({len(new_users_in_test)/len(test_users)*100:.2f}% of test)")
print(f"  - Users only in train: {len(users_only_in_train):,}")

# Movie overlap analysis  
train_movies = set(train['movieId'].unique())
test_movies = set(test['movieId'].unique())

common_movies = train_movies.intersection(test_movies)
new_movies_in_test = test_movies - train_movies
movies_only_in_train = train_movies - test_movies

print(f"\nMovie Analysis:")
print(f"  - Train movies: {len(train_movies):,}")
print(f"  - Test movies: {len(test_movies):,}")
print(f"  - Common movies (in both): {len(common_movies):,} ({len(common_movies)/len(test_movies)*100:.2f}% of test)")
print(f"  - NEW movies in test (cold-start): {len(new_movies_in_test):,} ({len(new_movies_in_test)/len(test_movies)*100:.2f}% of test)")
print(f"  - Movies only in train: {len(movies_only_in_train):,}")

# Cold-start impact on test ratings
test_coldstart_users = test[test['userId'].isin(new_users_in_test)]
test_coldstart_movies = test[test['movieId'].isin(new_movies_in_test)]

print(f"\nCold-Start Impact on Test Set:")
print(f"  - Ratings from NEW users: {len(test_coldstart_users):,} ({len(test_coldstart_users)/len(test)*100:.2f}% of test)")
print(f"  - Ratings for NEW movies: {len(test_coldstart_movies):,} ({len(test_coldstart_movies)/len(test)*100:.2f}% of test)")

print(f"\n✓ Cold-start analysis completed")
print(f"\nNote: Cold-start items/users are realistic in temporal splits.")
print(f"They represent new users/movies appearing in the test period.")
print(f"This reflects production scenarios where the model must handle new entities.")

Analyzing cold-start scenarios...

User Analysis:
  - Train users: 227,222
  - Test users: 48,464
  - Common users (in both): 4,790 (9.88% of test)
  - NEW users in test (cold-start): 43,674 (90.12% of test)
  - Users only in train: 222,432

Movie Analysis:
  - Train movies: 25,636
  - Test movies: 42,721
  - Common movies (in both): 23,242 (54.40% of test)
  - NEW movies in test (cold-start): 19,479 (45.60% of test)
  - Movies only in train: 2,394

Cold-Start Impact on Test Set:
  - Ratings from NEW users: 4,645,394 (89.25% of test)
  - Ratings for NEW movies: 401,578 (7.72% of test)

✓ Cold-start analysis completed

Note: Cold-start items/users are realistic in temporal splits.
They represent new users/movies appearing in the test period.
This reflects production scenarios where the model must handle new entities.
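
Because most test ratings come from users who are absent from training, it can help to report metrics separately for warm and cold rows. The sketch below partitions a test frame accordingly. `split_warm_cold` is a hypothetical helper and the frames are made up.

```python
import pandas as pd

def split_warm_cold(train_df, test_df):
    """Partition test rows by whether both the user and the movie occur in train."""
    known_user = test_df["userId"].isin(train_df["userId"].unique())
    known_movie = test_df["movieId"].isin(train_df["movieId"].unique())
    warm_mask = known_user & known_movie
    return test_df[warm_mask], test_df[~warm_mask]

# Made-up frames: user 3 and movie 30 never appear in training.
tr = pd.DataFrame({"userId": [1, 1, 2], "movieId": [10, 20, 10], "rating": [4.0, 3.0, 5.0]})
te = pd.DataFrame({"userId": [1, 2, 3, 2], "movieId": [10, 20, 10, 30], "rating": [3.5, 4.0, 2.0, 5.0]})

warm, cold = split_warm_cold(tr, te)
print(f"warm: {len(warm)} rows, cold: {len(cold)} rows")
```

A collaborative filtering model can only be scored meaningfully on the warm subset. The cold subset needs a fallback such as a popularity or global-mean predictor.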


## Step 7: Save Train and Test Datasets

In [7]:
# Remove temporary datetime columns before saving
columns_to_save = ['userId', 'movieId', 'rating', 'timestamp']

train_final = train[columns_to_save].copy()
test_final = test[columns_to_save].copy()

# Define output paths
import os

output_dir = '../../datasets/output/split_and_train_datasets/temporal_split/'
os.makedirs(output_dir, exist_ok=True)  # create the output folder if it is missing
train_path = output_dir + 'train_ratings.csv'
test_path = output_dir + 'test_ratings.csv'

# Save datasets
print("Saving train and test datasets...\n")

print(f"Saving training set to: {train_path}")
train_final.to_csv(train_path, index=False)
print(f"✓ Training set saved ({len(train_final):,} rows)")

print(f"\nSaving test set to: {test_path}")
test_final.to_csv(test_path, index=False)
print(f"✓ Test set saved ({len(test_final):,} rows)")

# Verify files were created
print(f"\nVerifying saved files:")
train_size_mb = os.path.getsize(train_path) / (1024 * 1024)
test_size_mb = os.path.getsize(test_path) / (1024 * 1024)
print(f"  - {train_path}")
print(f"    Size: {train_size_mb:.2f} MB")
print(f"    Temporal range: {train['datetime'].min()} to {train['datetime'].max()}")
print(f"  - {test_path}")
print(f"    Size: {test_size_mb:.2f} MB")
print(f"    Temporal range: {test['datetime'].min()} to {test['datetime'].max()}")

print(f"\n✓ All files saved successfully")

Saving train and test datasets...

Saving training set to: ../../datasets/output/split_and_train_datasets/temporal_split/train_ratings.csv
✓ Training set saved (20,819,431 rows)

Saving test set to: ../../datasets/output/split_and_train_datasets/temporal_split/test_ratings.csv
✓ Test set saved (5,204,858 rows)

Verifying saved files:
  - ../../datasets/output/split_and_train_datasets/temporal_split/train_ratings.csv
    Size: 517.09 MB
    Temporal range: 1995-01-09 11:46:44 to 2015-01-30 01:17:56
  - ../../datasets/output/split_and_train_datasets/temporal_split/test_ratings.csv
    Size: 134.77 MB
    Temporal range: 2015-01-30 01:17:57 to 2017-08-04 06:57:50

✓ All files saved successfully
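
File size alone does not confirm that the contents survived the write. A round-trip check is one way to verify this, sketched here with a temporary directory and two rows from the sample output standing in for `train_final`.

```python
import os
import tempfile
import pandas as pd

# Two rows from the sample output, standing in for `train_final`.
frame = pd.DataFrame({
    "userId": [1, 1],
    "movieId": [110, 147],
    "rating": [1.0, 4.5],
    "timestamp": [1425941529, 1425942435],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "train_ratings.csv")
    frame.to_csv(out_path, index=False)
    # Read the file back and compare it with the frame that was written.
    restored = pd.read_csv(out_path)

round_trip_ok = restored.equals(frame)
print(round_trip_ok)
```

On the full dataset a complete comparison may be slow, so comparing row counts and a sample of rows is a reasonable lighter alternative.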


## Summary

### Temporal Split (80/20) Completed

**Method**: Time-based split - training on past ratings, testing on future ratings

**Output Files**:
- `train_ratings.csv`: Training set (older 80% of ratings)
- `test_ratings.csv`: Test set (newer 20% of ratings)

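**Key Characteristics**:
- Strict temporal ordering (latest train timestamp 2015-01-30 01:17:56 < earliest test timestamp 2015-01-30 01:17:57)
- No temporal leakage (no future ratings in training)
- Saved files keep the four original columns (`userId`, `movieId`, `rating`, `timestamp`)
- Heavy cold-start in the test set: 90.12% of test users and 45.60% of test movies never appear in training, accounting for 89.25% and 7.72% of test ratings respectively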

**Advantages Over Random Split**:
- ✅ Simulates real-world deployment (predict future from past)
- ✅ Prevents temporal leakage (never trains on future to predict past)
- ✅ Reveals model's ability to handle new users/items
- ✅ Widely used evaluation protocol for production recommenders
- ✅ More honest/conservative performance estimates

**Next Steps**:
1. Use `train_ratings.csv` for model training (collaborative filtering)
2. Use `test_ratings.csv` for evaluation (RMSE, MAE, Precision@K)
3. Compare with random split to demonstrate temporal split benefits
4. Address cold-start challenges in "Improvement Opportunities" section
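
As a reference point for steps 1 and 2, a global-mean baseline can be scored with RMSE and MAE before any collaborative filtering model is trained. The sketch below uses made-up ratings in place of the saved files, and `rmse` and `mae` are small helpers defined here, not library functions.

```python
import numpy as np
import pandas as pd

def rmse(actual, predicted):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((np.asarray(actual) - np.asarray(predicted)) ** 2)))

def mae(actual, predicted):
    """Mean absolute error."""
    return float(np.mean(np.abs(np.asarray(actual) - np.asarray(predicted))))

# Made-up ratings standing in for the saved train and test files.
train_demo = pd.DataFrame({"rating": [3.0, 4.0, 5.0, 4.0]})
test_demo = pd.DataFrame({"rating": [3.0, 5.0]})

# Baseline: predict the training-set mean for every test rating.
global_mean = train_demo["rating"].mean()
predictions = np.full(len(test_demo), global_mean)

baseline_rmse = rmse(test_demo["rating"], predictions)
baseline_mae = mae(test_demo["rating"], predictions)
print(f"RMSE={baseline_rmse:.4f}  MAE={baseline_mae:.4f}")
```

A global-mean predictor needs no user or item history, so it also covers the cold-start rows and gives a floor that later models should beat.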

In [8]:
# Final summary statistics
print("="*70)
print("TEMPORAL SPLIT (80/20) - FINAL SUMMARY")
print("="*70)
print(f"\nOriginal Dataset: {len(ratings):,} ratings")
print(f"Temporal Range: {ratings['datetime'].min()} to {ratings['datetime'].max()}")

print(f"\nTraining Set: {len(train_final):,} ratings ({len(train_final)/len(ratings)*100:.2f}%)")
print(f"  - Period: {train['datetime'].min()} to {train['datetime'].max()}")
print(f"  - Users: {train['userId'].nunique():,}")
print(f"  - Movies: {train['movieId'].nunique():,}")
print(f"  - Avg rating: {train['rating'].mean():.4f}")

print(f"\nTest Set: {len(test_final):,} ratings ({len(test_final)/len(ratings)*100:.2f}%)")
print(f"  - Period: {test['datetime'].min()} to {test['datetime'].max()}")
print(f"  - Users: {test['userId'].nunique():,}")
print(f"  - Movies: {test['movieId'].nunique():,}")
print(f"  - Avg rating: {test['rating'].mean():.4f}")

print(f"\nCold-Start Statistics:")
print(f"  - New users in test: {len(new_users_in_test):,} ({len(new_users_in_test)/len(test_users)*100:.2f}% of test users)")
print(f"  - New movies in test: {len(new_movies_in_test):,} ({len(new_movies_in_test)/len(test_movies)*100:.2f}% of test movies)")

print(f"\nOutput Location: {output_dir}")
print(f"  - train_ratings.csv ({train_size_mb:.2f} MB)")
print(f"  - test_ratings.csv ({test_size_mb:.2f} MB)")

print("\n" + "="*70)
print("✓ Temporal split completed successfully")
print("✓ Data ready for time-aware model evaluation")
print("="*70)

TEMPORAL SPLIT (80/20) - FINAL SUMMARY

Original Dataset: 26,024,289 ratings
Temporal Range: 1995-01-09 11:46:44 to 2017-08-04 06:57:50

Training Set: 20,819,431 ratings (80.00%)
  - Period: 1995-01-09 11:46:44 to 2015-01-30 01:17:56
  - Users: 227,222
  - Movies: 25,636
  - Avg rating: 3.5210

Test Set: 5,204,858 ratings (20.00%)
  - Period: 2015-01-30 01:17:57 to 2017-08-04 06:57:50
  - Users: 48,464
  - Movies: 42,721
  - Avg rating: 3.5566

Cold-Start Statistics:
  - New users in test: 43,674 (90.12% of test users)
  - New movies in test: 19,479 (45.60% of test movies)

Output Location: ../../datasets/output/split_and_train_datasets/temporal_split/
  - train_ratings.csv (517.09 MB)
  - test_ratings.csv (134.77 MB)

✓ Temporal split completed successfully
✓ Data ready for time-aware model evaluation
