# PSG Diaspora Dataset - Data Exploration

Analysis of professional footballers born in Île-de-France (IDF) between 1980 and 2006.

**Key Questions:**
1. What is the diaspora composition of IDF footballers?
2. Which départements produce the most players?
3. How has production evolved over time?

In [None]:
import json
import pandas as pd
from pathlib import Path
from collections import Counter

# Load data
DATA_DIR = Path("..") / "data"

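# Read as UTF-8 explicitly: names and places contain accented characters
with open(DATA_DIR / "raw" / "wikidata" / "idf_footballers.json", encoding="utf-8") as f:
    raw_data = json.load(f)

with open(DATA_DIR / "processed" / "analysis_results.json", encoding="utf-8") as f:
    analysis = json.load(f)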

players = raw_data["players"]
print(f"Total players: {len(players)}")

## 1. Overview Statistics

In [None]:
# Key metrics
print("="*50)
print("KEY FINDINGS")
print("="*50)
print(f"Total players: {analysis['demographics']['total_players']}")
print(f"Dual nationals: {analysis['demographics']['dual_nationals']} ({analysis['demographics']['dual_national_pct']}%)")
print(f"African diaspora: {analysis['diaspora']['total_diaspora']} ({analysis['diaspora']['diaspora_pct']}%)")
print("="*50)

## 2. Diaspora Breakdown

In [None]:
# Diaspora regions
print("\nDiaspora by Region:")
print("-"*30)
for region, count in sorted(analysis['diaspora']['by_region'].items(), key=lambda x: -x[1]):
    pct = 100 * count / analysis['demographics']['total_players']
    print(f"{region:25} {count:4} ({pct:.1f}%)")

In [None]:
# Top countries
print("\nTop Origin Countries (besides France):")
print("-"*40)
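# Sort by count so the top 10 does not depend on the key order in the JSON file
top_countries = sorted(analysis['diaspora']['by_country'].items(), key=lambda x: -x[1])[:10]
for country, count in top_countries: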
    print(f"{country:35} {count:4}")

## 3. Geographic Distribution

In [None]:
# By département
print("\nPlayers by Département:")
print("-"*40)
for dept, info in analysis['geographic']['by_department'].items():
    print(f"{info['name']:20} ({dept}) {info['count']:4}")

print("\n⚠️ WARNING: 93 and 95 data missing (rate limited)")
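
The warning in the cell above is hard-coded. As a minimal sketch, the same check could be derived from the data, assuming `by_department` keeps the shape used above (département code mapped to a dict with `name` and `count`). The helper name `missing_departments` and the sample values are hypothetical; the eight codes are the INSEE codes of the Île-de-France départements.

```python
# The eight départements of Île-de-France, by INSEE code.
IDF_DEPARTMENTS = {"75", "77", "78", "91", "92", "93", "94", "95"}


def missing_departments(by_department):
    """Return the IDF département codes that have no entry or a zero count."""
    present = {
        str(code)
        for code, info in by_department.items()
        if info.get("count", 0) > 0
    }
    return sorted(IDF_DEPARTMENTS - present)


# Hypothetical sample shaped like analysis['geographic']['by_department'].
sample = {
    "75": {"name": "Paris", "count": 10},
    "92": {"name": "Hauts-de-Seine", "count": 5},
    "93": {"name": "Seine-Saint-Denis", "count": 0},
}
print(missing_departments(sample))
```

In the notebook, `missing_departments(analysis['geographic']['by_department'])` would list whichever codes are absent at the time the cell runs.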

## 4. Temporal Trends

In [None]:
# Birth year distribution
print("\nBirth Year Distribution:")
print("-"*40)
for period, count in sorted(analysis['temporal']['by_5year'].items(), key=lambda x: int(x[0])):
    bar = "█" * (count // 10)
    print(f"{period}-{int(period)+4}: {count:3} {bar}")
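
The keys of `by_5year` appear to be the first year of each five-year period. The sketch below shows one way such bins could be computed from raw birth years. It assumes periods aligned to multiples of five (1980, 1985, ...), which is not confirmed by the analysis pipeline; `five_year_bins` and the sample years are hypothetical.

```python
from collections import Counter


def five_year_bins(birth_years):
    """Count birth years per five-year period, keyed by the period's first year."""
    counts = Counter((year // 5) * 5 for year in birth_years)
    # String keys mirror the JSON layout the cell above iterates over.
    return {str(start): n for start, n in sorted(counts.items())}


# Hypothetical birth years, for illustration only.
print(five_year_bins([1980, 1984, 1985, 1999, 2000, 2006]))
```

Under this binning, the last period starts in 2005 and is printed as 2005-2009 by the cell above, even though the dataset stops at 2006.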

## 5. Sample Players

In [None]:
# Show some example players
df = pd.DataFrame(players)

# Extract birthplace name
df['birthplace_name'] = df['birthplace'].apply(lambda x: x.get('name', '') if isinstance(x, dict) else '')

# Show dual nationals (rows without a nationality list are skipped)
dual_nationals = df[df['nationalities'].apply(lambda x: isinstance(x, (list, tuple)) and len(x) > 1)]
print(f"\nSample Dual National Players ({len(dual_nationals)} total):")
print("-"*60)
for _, row in dual_nationals.head(10).iterrows():
    nats = ", ".join(row['nationalities'])
    print(f"{row['name']:30} | {row['birthplace_name'][:20]:20} | {nats}")

## 6. Visualizations

Charts are pre-generated in `docs/figures/`. View them there or regenerate them from the repository root:

```bash
./venv/bin/python src/visualization/charts.py
```

In [None]:
from IPython.display import Image, display

figures_dir = Path("..") / "docs" / "figures"

print("Summary Infographic:")
# Pass the path as a string for compatibility with older IPython versions
display(Image(filename=str(figures_dir / "summary_infographic.png"), width=800))

## 7. Data Export

Data is available in multiple formats in `data/huggingface/`:
- `idf_footballers.csv` - CSV format
- `idf_footballers.parquet` - Parquet format (recommended)
- `idf_footballers.jsonl` - JSON Lines format

In [None]:
# Load the processed dataset
df_hf = pd.read_csv(DATA_DIR / "huggingface" / "idf_footballers.csv")
print(f"Dataset shape: {df_hf.shape}")
df_hf.head()
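
The cell above reads the CSV export. As a sketch, the other two formats can be read with standard pandas readers; the helper `load_export` is hypothetical and only dispatches on the file extension. Reading Parquet requires an engine such as `pyarrow` or `fastparquet` to be installed.

```python
from pathlib import Path

import pandas as pd

# One reader per file extension; all three are standard pandas functions.
READERS = {
    ".csv": pd.read_csv,
    ".parquet": pd.read_parquet,  # needs pyarrow or fastparquet installed
    ".jsonl": lambda path: pd.read_json(path, lines=True),
}


def load_export(path):
    """Load one exported file, choosing the reader from its extension."""
    path = Path(path)
    try:
        reader = READERS[path.suffix]
    except KeyError:
        raise ValueError(f"Unsupported export format: {path.suffix}") from None
    return reader(path)
```

For example, `load_export(DATA_DIR / "huggingface" / "idf_footballers.parquet")` would load the Parquet file, assuming it sits alongside the CSV as listed above.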

---

## Limitations

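1. **Missing 93/95**: Data for Seine-Saint-Denis (93) and Val-d'Oise (95) was not collected because requests were rate limited, so all totals and percentages exclude these two départements
2. **Wikidata only**: Only players with a Wikidata entry are included
3. **Birthplace ≠ Training**: A player's birthplace may differ from where they trained
4. **Diaspora estimation**: Diaspora status is inferred from nationality data, so players whose heritage is not reflected in their recorded nationalities may be missed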
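
To illustrate the last limitation, here is a minimal sketch of a nationality-based flag. It is not the project's actual classifier: the function name, the `"France"` label, and both records are hypothetical, and only the `name` and `nationalities` fields mirror the player records used in this notebook.

```python
def has_foreign_nationality(nationalities, home="France"):
    """Return True if any listed nationality differs from `home`."""
    return any(nat != home for nat in nationalities)


# Hypothetical records, not taken from the dataset.
records = [
    {"name": "Player A", "nationalities": ["France", "Senegal"]},
    {"name": "Player B", "nationalities": ["France"]},  # heritage not visible here
]

flags = {r["name"]: has_foreign_nationality(r["nationalities"]) for r in records}
print(flags)
```

A player like the second record is counted as non-diaspora regardless of family background, which is why the diaspora figures are best read as a lower bound.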