# Arsenal FC Match Data & AI Simulation - Complete Notebook

This notebook walks through collecting Arsenal FC match data, simulating matches, and analyzing, visualizing, and exporting the results.

## Table of Contents
1. [Setup & Installation](#setup)
2. [Real Data Collection](#real-data)
3. [AI Match Simulation](#simulation)
4. [Data Analysis](#analysis)
5. [Visualization](#visualization)
6. [Advanced Tactical Simulation](#tactical)
7. [Export & Save](#export)

---

## 1. Setup & Installation <a id='setup'></a>

First, let's install required dependencies and set up the environment.

In [None]:
# Install required packages (run this once)
# %pip installs into the environment of the running kernel
%pip install -q requests pandas numpy pydantic python-dotenv scikit-learn matplotlib seaborn

In [None]:
# Import all required modules
import sys
import os
import json
from datetime import datetime, timedelta
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Add src directory to path
sys.path.insert(0, os.path.join(os.getcwd(), 'src'))

# Import custom modules
from data_schema import MatchData, TeamStats, Dataset
from data_collector import FootballDataCollector, create_dataset_from_api, save_dataset
from simulator import FootballMatchSimulator, create_simulated_dataset, PREMIER_LEAGUE_TEAMS

# Set plotting style
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

print("✅ Setup complete! All modules loaded successfully.")

---
## 2. Real Data Collection <a id='real-data'></a>

Collect real Arsenal match data from football-data.org API or use mock data.
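
Rather than pasting the key into a cell, it can be read from the environment. The following is a minimal sketch, assuming the key is stored in a variable named `FOOTBALL_DATA_API_KEY`. Both that name and the `load_api_key` helper are hypothetical choices for this example and are not part of the project's modules.

```python
import os


def load_api_key(env_var: str = "FOOTBALL_DATA_API_KEY"):
    """Return the API key from the environment, or None if it is unset or blank.

    The variable name is an assumption made for this sketch.
    """
    value = os.environ.get(env_var, "").strip()
    return value or None


API_KEY = load_api_key()
print("API key found" if API_KEY else "No API key set")
```

The resulting `API_KEY` (either a string or `None`) has the same shape as the value assigned by hand in the next cell.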

In [None]:
# Optional: Set your API key here (get from https://www.football-data.org/client/register)
API_KEY = None  # or "your_api_key_here"

# Create dataset from API
print("Fetching Arsenal match data...")
real_dataset = create_dataset_from_api(api_key=API_KEY, season="2023")

print(f"\n✅ Collected {len(real_dataset.matches)} matches")
print(f"Source: {real_dataset.source}")
print(f"Dataset: {real_dataset.dataset_name}")

In [None]:
# Display sample matches
print("\n📋 Sample Matches:")
for i, match in enumerate(real_dataset.matches[:3], 1):
    print(f"{i}. {match.date}: {match.home_team} {match.home_score}-{match.away_score} {match.away_team}")

---
## 3. AI Match Simulation <a id='simulation'></a>

Generate synthetic match data using AI-powered simulation.

### 3.1 Simulate Single Match

In [None]:
# Create simulator
simulator = FootballMatchSimulator(seed=42)

# Simulate Arsenal vs Manchester City
match = simulator.simulate_match(
    home_team="Arsenal",
    away_team="Manchester City",
    date="2024-03-31"
)

print("🎮 AI Simulated Match:")
print(f"\n{match.home_team} {match.home_score} - {match.away_score} {match.away_team}")
print(f"\nVenue: {match.venue}")
print(f"Attendance: {match.attendance:,}")
print(f"\nStats:")
print(f"  Possession: {match.home_stats.possession}% - {match.away_stats.possession}%")
print(f"  Shots: {match.home_stats.shots} - {match.away_stats.shots}")
print(f"  Shots on Target: {match.home_stats.shots_on_target} - {match.away_stats.shots_on_target}")
print(f"  xG: {match.home_stats.xg} - {match.away_stats.xg}")
print(f"  Corners: {match.home_stats.corners} - {match.away_stats.corners}")

### 3.2 Simulate Arsenal Season

In [None]:
# Simulate a full season (38 matches)
print("🎮 Simulating Arsenal's season...")
sim_dataset = create_simulated_dataset(
    num_matches=38,
    team="Arsenal",
    season="2023-24",
    seed=42
)

print(f"\n✅ Simulated {len(sim_dataset.matches)} matches")

# Calculate season statistics
wins = sum(1 for m in sim_dataset.matches if (
    (m.is_arsenal_home and m.home_score > m.away_score) or
    (not m.is_arsenal_home and m.away_score > m.home_score)
))
draws = sum(1 for m in sim_dataset.matches if m.home_score == m.away_score)
losses = len(sim_dataset.matches) - wins - draws

goals_for = sum(
    m.home_score if m.is_arsenal_home else m.away_score 
    for m in sim_dataset.matches
)
goals_against = sum(
    m.away_score if m.is_arsenal_home else m.home_score 
    for m in sim_dataset.matches
)

points = wins * 3 + draws

print(f"\n📊 Season Statistics:")
print(f"  Record: {wins}W {draws}D {losses}L")
print(f"  Points: {points}")
print(f"  Goals Scored: {goals_for}")
print(f"  Goals Conceded: {goals_against}")
print(f"  Goal Difference: {goals_for - goals_against:+d}")

### 3.3 Simulate Full League Round

In [None]:
# Simulate all 10 matches in a Premier League round
print("🎮 Simulating full Premier League matchday...")
league_matches = simulator.simulate_league_round(
    date="2024-02-10",
    season="2023-24"
)

print(f"\n📋 Match Results:")
for match in league_matches:
    print(f"  {match.home_team:20s} {match.home_score}-{match.away_score}  {match.away_team}")

### 3.4 Create Large Training Dataset

In [None]:
# Generate large dataset for machine learning
print("🎮 Creating large training dataset...")
training_dataset = create_simulated_dataset(
    num_matches=200,
    team="Arsenal",
    season="2023-24",
    seed=123
)

print(f"\n✅ Created training dataset with {len(training_dataset.matches)} matches")
print("This can be used for ML model training!")

---
## 4. Data Analysis <a id='analysis'></a>

Analyze match data using pandas.

### 4.1 Convert to DataFrame

In [None]:
# Convert simulated dataset to pandas DataFrame
def dataset_to_dataframe(dataset):
    """Convert Dataset to pandas DataFrame."""
    records = []
    for match in dataset.matches:
        record = {
            'match_id': match.match_id,
            'date': match.date,
            'competition': match.competition,
            'home_team': match.home_team,
            'away_team': match.away_team,
            'home_score': match.home_score,
            'away_score': match.away_score,
            'is_arsenal_home': match.is_arsenal_home,
            'venue': match.venue,
            'attendance': match.attendance,
        }
        
        if match.home_stats:
            record.update({
                'home_possession': match.home_stats.possession,
                'home_shots': match.home_stats.shots,
                'home_shots_on_target': match.home_stats.shots_on_target,
                'home_xg': match.home_stats.xg,
                'home_corners': match.home_stats.corners,
            })
        
        if match.away_stats:
            record.update({
                'away_possession': match.away_stats.possession,
                'away_shots': match.away_stats.shots,
                'away_shots_on_target': match.away_stats.shots_on_target,
                'away_xg': match.away_stats.xg,
                'away_corners': match.away_stats.corners,
            })
        
        records.append(record)
    
    return pd.DataFrame(records)

# Create DataFrame
df = dataset_to_dataframe(sim_dataset)
print("✅ Converted to DataFrame")
print(f"\nShape: {df.shape}")
df.head()
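
Because `dataset_to_dataframe` only adds the stats keys when `home_stats` or `away_stats` is present, matches without stats end up with `NaN` in those columns, and a dataset where no match has stats will not have those columns at all. A small illustration with hand-made records (the values are invented for the example):

```python
import pandas as pd

# Two records shaped like the dicts built in dataset_to_dataframe;
# the second one has no stats keys
records = [
    {"match_id": "m1", "home_score": 2, "home_xg": 1.8},
    {"match_id": "m2", "home_score": 0},
]

frame = pd.DataFrame(records)
print(frame)

# pandas fills the missing key with NaN
print(frame["home_xg"].isna().tolist())
```

This matters for the later cells that read `home_possession`, `home_shots`, and `home_xg`: it may be worth checking `df.columns` first when working with a dataset that might lack stats.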

### 4.2 Calculate Arsenal Results

In [None]:
# Add Arsenal-specific columns
df['arsenal_score'] = df.apply(
    lambda row: row['home_score'] if row['is_arsenal_home'] else row['away_score'],
    axis=1
)

df['opponent_score'] = df.apply(
    lambda row: row['away_score'] if row['is_arsenal_home'] else row['home_score'],
    axis=1
)

df['opponent'] = df.apply(
    lambda row: row['away_team'] if row['is_arsenal_home'] else row['home_team'],
    axis=1
)

df['result'] = df.apply(
    lambda row: 'Win' if row['arsenal_score'] > row['opponent_score'] 
                else ('Draw' if row['arsenal_score'] == row['opponent_score'] else 'Loss'),
    axis=1
)

df['points'] = df['result'].map({'Win': 3, 'Draw': 1, 'Loss': 0})

print("✅ Added Arsenal-specific columns")
df[['date', 'opponent', 'arsenal_score', 'opponent_score', 'result']].head(10)
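
The row-wise `apply` calls above are easy to read, but the same columns can be derived with vectorized NumPy operations, which tends to be faster on large simulated datasets. A sketch on a tiny hand-made frame that reuses the notebook's column names (the matches and scores are invented):

```python
import numpy as np
import pandas as pd

# Tiny stand-in for the notebook's df
demo = pd.DataFrame({
    "home_team": ["Arsenal", "Chelsea", "Arsenal"],
    "away_team": ["Everton", "Arsenal", "Fulham"],
    "home_score": [2, 1, 0],
    "away_score": [0, 1, 1],
    "is_arsenal_home": [True, False, True],
})

home = demo["is_arsenal_home"]

# np.where picks the home or away value per row based on the boolean mask
demo["arsenal_score"] = np.where(home, demo["home_score"], demo["away_score"])
demo["opponent_score"] = np.where(home, demo["away_score"], demo["home_score"])
demo["opponent"] = np.where(home, demo["away_team"], demo["home_team"])

# np.select evaluates the conditions in order and falls back to the default
diff = demo["arsenal_score"] - demo["opponent_score"]
demo["result"] = np.select([diff > 0, diff == 0], ["Win", "Draw"], default="Loss")
demo["points"] = demo["result"].map({"Win": 3, "Draw": 1, "Loss": 0})

print(demo[["opponent", "arsenal_score", "opponent_score", "result", "points"]])
```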

### 4.3 Summary Statistics

In [None]:
print("📊 Arsenal Season Summary Statistics:\n")

# Results distribution
print("Results:")
print(df['result'].value_counts())
print(f"\nWin Rate: {(df['result'] == 'Win').sum() / len(df) * 100:.1f}%")

# Goals
print(f"\nGoals:")
print(f"  Scored: {df['arsenal_score'].sum()}")
print(f"  Conceded: {df['opponent_score'].sum()}")
print(f"  Difference: {df['arsenal_score'].sum() - df['opponent_score'].sum():+d}")
print(f"  Average per match: {df['arsenal_score'].mean():.2f}")

# Points
print(f"\nPoints: {df['points'].sum()}")

# Home vs Away
print(f"\nHome Record:")
home_df = df[df['is_arsenal_home']]
print(f"  Matches: {len(home_df)}")
print(f"  Wins: {(home_df['result'] == 'Win').sum()}")
print(f"  Goals/match: {home_df['arsenal_score'].mean():.2f}")

print(f"\nAway Record:")
away_df = df[~df['is_arsenal_home']]
print(f"  Matches: {len(away_df)}")
print(f"  Wins: {(away_df['result'] == 'Win').sum()}")
print(f"  Goals/match: {away_df['arsenal_score'].mean():.2f}")

### 4.4 Advanced Statistics

In [None]:
# Add Arsenal possession, shots, and xG columns
df['arsenal_possession'] = df.apply(
    lambda row: row['home_possession'] if row['is_arsenal_home'] else row['away_possession'],
    axis=1
)

df['arsenal_shots'] = df.apply(
    lambda row: row['home_shots'] if row['is_arsenal_home'] else row['away_shots'],
    axis=1
)

df['arsenal_xg'] = df.apply(
    lambda row: row['home_xg'] if row['is_arsenal_home'] else row['away_xg'],
    axis=1
)

print("📊 Advanced Statistics:\n")

print(f"Average Possession: {df['arsenal_possession'].mean():.1f}%")
print(f"Average Shots: {df['arsenal_shots'].mean():.1f}")
print(f"Average xG: {df['arsenal_xg'].mean():.2f}")
print(f"\nxG vs Actual Goals:")
print(f"  Expected: {df['arsenal_xg'].sum():.1f}")
print(f"  Actual: {df['arsenal_score'].sum()}")
print(f"  Difference: {df['arsenal_score'].sum() - df['arsenal_xg'].sum():+.1f}")

---
## 5. Visualization <a id='visualization'></a>

Create insightful visualizations of the match data.

### 5.1 Results Distribution

In [None]:
# Results pie chart
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Pie chart
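result_counts = df['result'].value_counts()
# value_counts() orders by frequency, so pick each color by label rather than by position
color_map = {'Win': '#00ff87', 'Draw': '#FFD700', 'Loss': '#ff4444'}
colors = [color_map[label] for label in result_counts.index]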
ax1.pie(result_counts.values, labels=result_counts.index, autopct='%1.1f%%',
        startangle=90, colors=colors)
ax1.set_title('Arsenal Results Distribution', fontsize=14, fontweight='bold')

# Bar chart
result_counts.plot(kind='bar', ax=ax2, color=colors)
ax2.set_title('Arsenal Results Count', fontsize=14, fontweight='bold')
ax2.set_ylabel('Number of Matches')
ax2.set_xlabel('Result')
ax2.tick_params(axis='x', rotation=0)

plt.tight_layout()
plt.show()

### 5.2 Goals Scored Over Season

In [None]:
# Goals timeline
fig, ax = plt.subplots(figsize=(14, 6))

df['match_number'] = range(1, len(df) + 1)
df['cumulative_goals'] = df['arsenal_score'].cumsum()
df['cumulative_points'] = df['points'].cumsum()

ax.plot(df['match_number'], df['cumulative_goals'], marker='o', linewidth=2, 
        label='Cumulative Goals', color='#00ff87')
ax.fill_between(df['match_number'], df['cumulative_goals'], alpha=0.3, color='#00ff87')

ax.set_xlabel('Match Number', fontsize=12)
ax.set_ylabel('Cumulative Goals', fontsize=12)
ax.set_title('Arsenal Goals Progression Throughout Season', fontsize=14, fontweight='bold')
ax.grid(True, alpha=0.3)
ax.legend()

plt.tight_layout()
plt.show()

### 5.3 Home vs Away Performance

In [None]:
# Home vs Away comparison
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Goals scored
home_away_goals = df.groupby('is_arsenal_home')['arsenal_score'].mean()
axes[0, 0].bar(['Away', 'Home'], home_away_goals.values, color=['#ff9999', '#00ff87'])
axes[0, 0].set_title('Average Goals Scored', fontweight='bold')
axes[0, 0].set_ylabel('Goals per Match')

# Possession
home_away_poss = df.groupby('is_arsenal_home')['arsenal_possession'].mean()
axes[0, 1].bar(['Away', 'Home'], home_away_poss.values, color=['#ff9999', '#00ff87'])
axes[0, 1].set_title('Average Possession', fontweight='bold')
axes[0, 1].set_ylabel('Possession %')

# Win rate
home_wins = (df[df['is_arsenal_home']]['result'] == 'Win').sum() / len(df[df['is_arsenal_home']]) * 100
away_wins = (df[~df['is_arsenal_home']]['result'] == 'Win').sum() / len(df[~df['is_arsenal_home']]) * 100
axes[1, 0].bar(['Away', 'Home'], [away_wins, home_wins], color=['#ff9999', '#00ff87'])
axes[1, 0].set_title('Win Rate', fontweight='bold')
axes[1, 0].set_ylabel('Win %')

# Shots
home_away_shots = df.groupby('is_arsenal_home')['arsenal_shots'].mean()
axes[1, 1].bar(['Away', 'Home'], home_away_shots.values, color=['#ff9999', '#00ff87'])
axes[1, 1].set_title('Average Shots', fontweight='bold')
axes[1, 1].set_ylabel('Shots per Match')

plt.tight_layout()
plt.show()

### 5.4 xG vs Actual Goals

In [None]:
# xG analysis
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Scatter plot: xG vs Goals
ax1.scatter(df['arsenal_xg'], df['arsenal_score'], alpha=0.6, s=100, color='#00ff87')
ax1.plot([0, df['arsenal_xg'].max()], [0, df['arsenal_xg'].max()], 'r--', 
         label='Perfect prediction line')
ax1.set_xlabel('Expected Goals (xG)', fontsize=12)
ax1.set_ylabel('Actual Goals', fontsize=12)
ax1.set_title('xG vs Actual Goals', fontsize=14, fontweight='bold')
ax1.legend()
ax1.grid(True, alpha=0.3)

# Bar chart: Total comparison
totals = pd.DataFrame({
    'Metric': ['Expected Goals (xG)', 'Actual Goals'],
    'Value': [df['arsenal_xg'].sum(), df['arsenal_score'].sum()]
})
ax2.bar(totals['Metric'], totals['Value'], color=['#FFD700', '#00ff87'])
ax2.set_title('Season Total: xG vs Goals', fontsize=14, fontweight='bold')
ax2.set_ylabel('Total')

# Add value labels on bars
for i, v in enumerate(totals['Value']):
    ax2.text(i, v + 0.5, f'{v:.1f}', ha='center', fontweight='bold')

plt.tight_layout()
plt.show()

### 5.5 Form Guide (Last 10 Matches)

In [None]:
# Form guide visualization
last_10 = df.tail(10).copy()

fig, ax = plt.subplots(figsize=(14, 3))

colors_map = {'Win': '#00ff87', 'Draw': '#FFD700', 'Loss': '#ff4444'}
colors = [colors_map[r] for r in last_10['result']]

ax.bar(range(len(last_10)), [1]*len(last_10), color=colors, width=0.8)

# Add result labels
for i, (idx, row) in enumerate(last_10.iterrows()):
    result_letter = row['result'][0]  # W, D, or L
    ax.text(i, 0.5, result_letter, ha='center', va='center', 
            fontsize=16, fontweight='bold', color='white')
    ax.text(i, -0.3, f"{row['arsenal_score']}-{row['opponent_score']}", 
            ha='center', va='top', fontsize=10)

ax.set_xlim(-0.5, len(last_10) - 0.5)
ax.set_ylim(-0.5, 1.5)
ax.set_xticks(range(len(last_10)))
ax.set_xticklabels([f"Match {i+1}" for i in range(len(last_10))], rotation=0)
ax.set_yticks([])
ax.set_title('Arsenal Form Guide - Last 10 Matches', fontsize=14, fontweight='bold')
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.spines['left'].set_visible(False)

plt.tight_layout()
plt.show()
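
The form guide shows individual results. A rolling points total expresses the same idea as a single number per match. A minimal sketch using `pandas` rolling windows, with an invented points sequence standing in for `df['points']` and a window of 5 chosen arbitrarily:

```python
import pandas as pd

# Invented points sequence (3 = win, 1 = draw, 0 = loss), standing in for df['points']
points = pd.Series([3, 1, 0, 3, 3, 0, 1, 3], name="points")

# Points collected over the most recent 5 matches; min_periods=1 lets the
# first few matches report a partial window instead of NaN
rolling_form = points.rolling(window=5, min_periods=1).sum()

print(rolling_form.tolist())
```

Applied to the notebook's data, the same call on `df['points']` would give a column that could be plotted against the match order.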

---
## 6. Advanced Tactical Simulation <a id='tactical'></a>

Use the advanced tactical simulator for event-level, minute-by-minute match simulation with formations and playing styles.

In [None]:
# Import tactical simulator
from tactical_simulator import TacticalMatchSimulator, ADVANCED_PL_TEAMS

# Create tactical simulator
tactical_sim = TacticalMatchSimulator(seed=42)

print("✅ Tactical simulator loaded!")

### 6.1 Simulate Tactical Match with Events

In [None]:
# Simulate Arsenal vs Liverpool with minute-by-minute events
print("🎮 Simulating tactical match: Arsenal vs Liverpool\n")

match_data, match_state = tactical_sim.simulate_tactical_match(
    home_team="Arsenal",
    away_team="Liverpool",
    date="2024-04-03",
    detailed_events=True  # Enable event tracking
)

print(tactical_sim.get_match_summary())

---
## 7. Export & Save <a id='export'></a>

Save datasets in various formats for further use.

### 7.1 Save as JSON

In [None]:
# Save simulated dataset as JSON
os.makedirs('data/processed', exist_ok=True)

save_dataset(sim_dataset, format="json", output_dir="data/processed")
print("✅ Dataset saved as JSON")

### 7.2 Save as CSV

In [None]:
# Save simulated dataset as CSV
save_dataset(sim_dataset, format="csv", output_dir="data/processed")
print("✅ Dataset saved as CSV")

### 7.3 Save DataFrame with Analysis

In [None]:
# Save enriched DataFrame
output_file = 'data/processed/arsenal_analysis.csv'
df.to_csv(output_file, index=False)
print(f"✅ Analysis DataFrame saved to {output_file}")
print(f"\nColumns saved: {list(df.columns)}")
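
One way to sanity-check a CSV export is to read it back and inspect the column types. The sketch below writes a small invented frame to a temporary directory, so it does not touch `data/processed`. Note that dates come back as plain strings unless `parse_dates` is passed to `read_csv`.

```python
import os
import tempfile

import pandas as pd

# Small invented frame using a few of the notebook's column names
demo = pd.DataFrame({
    "date": pd.to_datetime(["2024-03-31", "2024-04-03"]),
    "opponent": ["Manchester City", "Liverpool"],
    "arsenal_score": [1, 2],
    "is_arsenal_home": [True, False],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "arsenal_analysis.csv")
    demo.to_csv(path, index=False)
    # parse_dates restores the datetime column instead of leaving it as text
    restored = pd.read_csv(path, parse_dates=["date"])

print(restored.dtypes)
print(restored)
```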

### 7.4 Export Summary Report

In [None]:
# Create summary report
summary_report = f"""
ARSENAL FC SEASON ANALYSIS REPORT
{'=' * 50}

Dataset: {sim_dataset.dataset_name}
Source: {sim_dataset.source}
Generated: {sim_dataset.last_updated}

OVERALL STATISTICS
{'-' * 50}
Matches Played: {len(df)}
Wins: {(df['result'] == 'Win').sum()}
Draws: {(df['result'] == 'Draw').sum()}
Losses: {(df['result'] == 'Loss').sum()}
Win Rate: {(df['result'] == 'Win').sum() / len(df) * 100:.1f}%

Points: {df['points'].sum()}
Goals Scored: {df['arsenal_score'].sum()}
Goals Conceded: {df['opponent_score'].sum()}
Goal Difference: {df['arsenal_score'].sum() - df['opponent_score'].sum():+d}

HOME RECORD
{'-' * 50}
Matches: {len(df[df['is_arsenal_home']])}
Wins: {(df[df['is_arsenal_home']]['result'] == 'Win').sum()}
Goals/Match: {df[df['is_arsenal_home']]['arsenal_score'].mean():.2f}
Possession: {df[df['is_arsenal_home']]['arsenal_possession'].mean():.1f}%

AWAY RECORD
{'-' * 50}
Matches: {len(df[~df['is_arsenal_home']])}
Wins: {(df[~df['is_arsenal_home']]['result'] == 'Win').sum()}
Goals/Match: {df[~df['is_arsenal_home']]['arsenal_score'].mean():.2f}
Possession: {df[~df['is_arsenal_home']]['arsenal_possession'].mean():.1f}%

ADVANCED METRICS
{'-' * 50}
Average Possession: {df['arsenal_possession'].mean():.1f}%
Average Shots: {df['arsenal_shots'].mean():.1f}
Average xG: {df['arsenal_xg'].mean():.2f}
Total xG: {df['arsenal_xg'].sum():.1f}
xG Difference: {df['arsenal_score'].sum() - df['arsenal_xg'].sum():+.1f}
"""

# Save report
report_file = 'data/processed/arsenal_season_report.txt'
with open(report_file, 'w', encoding='utf-8') as f:
    f.write(summary_report)

print(summary_report)
print(f"\n✅ Report saved to {report_file}")

---
## 🎉 Complete!

You now have:
- ✅ Real match data collection capability
- ✅ AI-powered match simulation
- ✅ Event-level tactical simulation
- ✅ Comprehensive data analysis
- ✅ Beautiful visualizations
- ✅ Multiple export formats

### Next Steps:
1. Modify parameters to generate different datasets
2. Use the data for machine learning models
3. Create custom analyses and visualizations
4. Compare simulated vs real data

### Quick Commands:
```python
# Generate 1000 matches for ML training
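large_dataset = create_simulated_dataset(num_matches=1000, seed=42)
# Export through save_dataset, as in the Export & Save section
os.makedirs('data/training', exist_ok=True)
save_dataset(large_dataset, format="csv", output_dir="data/training")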

# Simulate specific matchup
match = simulator.simulate_match("Arsenal", "Tottenham", "2024-04-28")

# Get team profiles
print(PREMIER_LEAGUE_TEAMS["Arsenal"])
```