# 🏆 DecryptX Round 3 - Data Cleaning Contest

Welcome to the final round of DecryptX! Your challenge is to **clean the FIFA21 dataset** and build a model to predict player Overall Ratings (OVA).

## Rules
- **5 submission attempts** total (lifetime limit)
- **1 minute cooldown** between submissions
- **Fixed train/test split**: `random_state=42`, `test_size=0.2`
- **Metric**: RMSE (lower is better)
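
For reference, RMSE is the square root of the mean squared difference between the true and predicted ratings. The sketch below is only an illustration of the formula in plain Python; the helper name `rmse` is hypothetical, and the score that counts is whatever the contest's own `evaluate` function reports.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences of numbers."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("cannot compute RMSE of empty sequences")
    # Square each error, average them, then take the square root
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Two players: errors of 3 and 4 rating points
example_score = rmse([80, 70], [83, 66])
print(f"Example RMSE: {example_score:.4f}")
```

Because the errors are squared before averaging, a few badly mispredicted players cost more than many slightly mispredicted ones.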

## Workflow
1. Install the library and log in
2. Load and explore the raw data
3. **Clean the data** (your main task!)
4. Train a model
5. Evaluate and submit

---

## Step 1: Install the DecryptX Helper Library

In [None]:
# Install the DecryptX helper library from GitHub
!pip install -q git+https://github.com/keanesc/decryptx-helper.git

## Step 2: Log In with Your Team Credentials

Enter your team name and password below. This authenticates you with the DecryptX server.

In [None]:
from decryptx import login

# Enter your team credentials
TEAM_NAME = ""  # <-- Enter your team name
PASSWORD = ""   # <-- Enter your password

# Login to the server
session = login(TEAM_NAME, PASSWORD)

## Step 3: Load the Raw Dataset

The FIFA21 dataset contains various data quality issues. Your job is to clean it!

In [None]:
from decryptx import load_data
import pandas as pd

# Load the raw dataset
df_raw = load_data()

# Take a look at the data
print(f"Dataset shape: {df_raw.shape}")
df_raw.head()

In [None]:
# Explore the columns
print("Columns:")
print(df_raw.columns.tolist())

In [None]:
# Check data types and missing values
df_raw.info()

## Step 4: Clean Your Data 🧹

This is where your work goes! The dataset has several quality issues:

### Common Issues to Address:
- **Multi-line cell values**: Club names may have embedded newlines
- **Currency values**: "€103.5M", "€43K" need to be converted to numbers
- **Height/Weight**: "170cm", "72kg" need to be parsed
- **Missing values**: Various columns have NaN values
- **Special characters**: Star ratings like "4 ★"
- **Non-numeric columns**: Need encoding or removal
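
Two of the issues above (embedded newlines and values with trailing units or symbols) can be handled with small string helpers. This is a minimal sketch, not the expected solution: the function names are hypothetical, and it assumes each cell holds a single number followed by a unit or symbol. If a column mixes units, check its distinct values first, since this would read every number the same way.

```python
import re


def collapse_whitespace(val):
    """Join multi-line cell text into a single line; leave non-strings untouched."""
    if not isinstance(val, str):
        return val
    # str.split() with no arguments splits on any run of whitespace, including newlines
    return " ".join(val.split())


def parse_leading_number(val):
    """Extract the first number from strings such as '4 ★', '170cm' or '72kg'.

    Returns None when no number is found (including missing values).
    """
    if val is None:
        return None
    match = re.search(r"\d+(?:\.\d+)?", str(val))
    return float(match.group()) if match else None


print(collapse_whitespace("\n\n\nFC Barcelona\n"))
print(parse_leading_number("4 ★"), parse_leading_number("170cm"))
```

With pandas, either helper can be applied to a column via `Series.apply`, in the same way as the commented-out `parse_currency` example in the cleaning cell.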

### Tips:
- Don't just drop all rows with missing values!
- Consider which features are actually useful for predicting OVA
- A simple model on clean data beats a complex model on dirty data
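
As one alternative to dropping rows, missing numeric values can be filled with the column median. The sketch below uses a toy DataFrame with made-up column names and values; `impute_numeric_with_median` is a hypothetical helper, and median imputation is only one reasonable choice among several.

```python
import pandas as pd


def impute_numeric_with_median(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with NaNs in numeric columns replaced by each column's median."""
    out = df.copy()
    numeric_cols = out.select_dtypes(include="number").columns
    # median() skips NaN by default, so it is computed from the observed values only
    out[numeric_cols] = out[numeric_cols].fillna(out[numeric_cols].median())
    return out


# Toy data (illustrative only): one missing number and one missing string
toy = pd.DataFrame({"Age": [21.0, None, 30.0], "Club": ["A", None, "C"]})
filled = impute_numeric_with_median(toy)

print(f"Rows kept by dropna(): {len(toy.dropna())}")
print(f"Rows kept by imputation: {len(filled)}")
```

Non-numeric columns are left as they are here, so they still need encoding, a separate fill strategy, or removal.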

In [None]:
# YOUR DATA CLEANING CODE GOES HERE
# ===================================

# Start with a copy of the raw data
df_clean = df_raw.copy()

# Example: Remove columns that aren't useful for prediction
# df_clean = df_clean.drop(columns=['photoUrl', 'playerUrl', ...])

# Example: Parse currency values
# def parse_currency(val):
#     if pd.isna(val):
#         return None
#     val = str(val).replace('€', '').strip()
#     if val.endswith('M'):
#         return float(val[:-1]) * 1_000_000
#     elif val.endswith('K'):
#         return float(val[:-1]) * 1_000
#     return float(val)
#
# df_clean['Value'] = df_clean['Value'].apply(parse_currency)

# Example: Parse height
# def parse_height(val):
#     if pd.isna(val):
#         return None
#     val = str(val).replace('cm', '').strip()
#     return float(val)
#
# df_clean['Height'] = df_clean['Height'].apply(parse_height)

# TODO: Add your cleaning code here!
# ...

print(f"Cleaned dataset shape: {df_clean.shape}")

In [None]:
# Validate your cleaned data
from decryptx.data import validate_data

report = validate_data(df_clean)
print("Validation Report:")
print(f"  Valid: {report['valid']}")
print(f"  Stats: {report['stats']}")
if report['warnings']:
    print(f"  Warnings: {report['warnings']}")
if report['errors']:
    print(f"  Errors: {report['errors']}")

## Step 5: Split the Data

**Important**: The split uses fixed parameters (`random_state=42`, `test_size=0.2`) to ensure fairness. Do not modify these!

In [None]:
from decryptx import get_train_test_split

# Split the data (fixed parameters: random_state=42, test_size=0.2)
X_train, X_test, y_train, y_test = get_train_test_split(df_clean)

## Step 6: Train Your Model

Train any regression model you like! The focus is on **data cleaning**, not model complexity.

In [None]:
from sklearn.ensemble import RandomForestRegressor

# Example: Simple Random Forest
model = RandomForestRegressor(
    n_estimators=100,
    random_state=42,
    n_jobs=-1
)

# Train the model
print("Training model...")
model.fit(X_train, y_train)
print("Training complete!")

## Step 7: Evaluate Your Model

This computes your RMSE score on the test set.

In [None]:
from decryptx import evaluate

# Evaluate the model
score, run_id, metadata = evaluate(model, X_test, y_test)

print(f"\n🎯 Your RMSE Score: {score:.4f}")
print("   (Lower is better!)")

## Step 8: Submit Your Score 🚀

**Remember:**
- You have **5 attempts total**
- **1 minute cooldown** between submissions
- Make sure you're happy with your score before submitting!

In [None]:
from decryptx import submit

# Uncomment the line below when you're ready to submit!
# result = submit(session, score, run_id, metadata)

print("⚠️  Submission is commented out. Uncomment the line above when ready!")
print(f"   Your current score: {score:.4f}")
print(f"   Remaining attempts: {session.get('remainingAttempts', 5)}/5")

In [None]:
# ACTUALLY SUBMIT (uncomment when ready!)
# ========================================

# result = submit(session, score, run_id, metadata)
# print(f"\n✅ Submitted! Remaining attempts: {result['remainingAttempts']}/5")

---

## 💡 Need Help?

### Data Cleaning Ideas:
1. **Currency Parsing**: Convert "€103.5M" → 103500000
2. **Height/Weight**: Convert "170cm" → 170, "72kg" → 72
3. **Star Ratings**: Parse "4 ★" → 4
4. **Work Rate**: Split "High/Low" into two columns
5. **Missing Values**: Consider imputation vs. dropping
6. **Feature Selection**: Remove URLs, names, IDs that don't help prediction
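
The work-rate idea (item 4) could look roughly like the sketch below. It assumes the values are two levels separated by a slash, as in "High/Low", and maps each level to an ordinal number so a model can use it directly. The mapping, the function name, and the treatment of malformed values are all assumptions made for illustration.

```python
# Assumed ordinal encoding of the three work-rate levels
WORK_RATE_LEVELS = {"low": 0, "medium": 1, "high": 2}


def split_work_rate(val):
    """Split 'High/Low' into a pair of ordinal codes, e.g. (2, 0).

    Returns (None, None) for missing or malformed values; an unknown level
    in either position maps to None.
    """
    if not isinstance(val, str) or "/" not in val:
        return (None, None)
    first, second = (part.strip().lower() for part in val.split("/", 1))
    return (WORK_RATE_LEVELS.get(first), WORK_RATE_LEVELS.get(second))


print(split_work_rate("High/Low"))
```

The two codes can then be stored in two new columns and the original text column dropped.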

### Model Ideas:
- `RandomForestRegressor`
- `GradientBoostingRegressor`
- `XGBRegressor` (if installed)
- `Ridge` / `Lasso` for simpler models

### Remember:
**Clean data > Complex model**

Good luck! 🍀