# Group Project part 01

#### Deadline for the code submission: October 10th at 08:59 am CET

#### Reminder
- your group is the one assigned to you by the University.
- one goal of this project is to learn how to work as a group, which is the standard in the tech industry. Therefore you need to resolve group issues on your own, as a group.
- if you cannot resolve a group issue on your own, escalate to the teacher early, not at the last minute.
- if the group splits, the whole group receives a 0.

**Penalty for unexcused absence or lateness**:
- If you are absent or late on presentation day without an official excuse, you will receive 0 for the presentation part of the group project.
- If you are late without an official excuse and can still make it to the presentation of your team, you will still receive 0 for the presentation part of the group project.

## Objective
In this project, you will use your skills to:
- collect data through multiple APIs and open source datasets, for both quantitative and qualitative data
- merge data from different sources
- describe and analyse datasets
- uncover patterns and insights
- calculate aggregate measures and statistics
- create compelling data visualisations
- write clean code
- tell a story and convince your audience

Each group must pick exactly one of the following scenarios.

Pick a topic that allows enough data collection and analysis to showcase all the skills listed above.

### Scenario 01: Become a Business Manager

Your task is to design a local business that leverages data from various APIs to make informed, strategic decisions. Whether you're launching a street food stand, a drink shop, or another local venture, your team will gather and analyze relevant data (such as foot traffic, weather patterns, customer trends, or competitor insights) to shape your business plan. Your final deliverable will be a data-supported report and/or presentation to a management board, demonstrating how your findings guide key decisions in operations, marketing, or product offerings. The ultimate goal: to optimize performance and increase the chances of business success. Will your business thrive in today’s data-driven world?

Examples:
- lemonade stand business
- food truck business
- delivery service

### Scenario 02: Fact-Check a Popular Belief

You are part of a fact-checking research team investigating common beliefs, trending opinions, or viral social media claims (e.g. “drinking lemon water boosts metabolism” or “blue light ruins your sleep”). Your goal is to dig into reliable sources, data, and expert opinions to determine whether these beliefs hold up under scrutiny. Use data to challenge or prove real-world claims with clear, persuasive insights. Drawing on research, statistics, and visual evidence, your team will present a well-supported explanation to help your audience separate fact from fiction.

You may also choose to divide the group into two sides—one defending the belief and the other challenging it—before presenting your findings in a debate or side-by-side analysis.

Examples:
- Electric cars are always better for the environment.
- Areas with more green space have better physical and mental health outcomes.
- Public sentiment on social media predicts stock market trends.

## 01 - Getting Ready: first questions

Depending on the scenario you picked, please consider the following questions to help you get started.

### Scenario 01: Become a Business Manager

- What kind of business do we run? What do we sell? The choice of business must be original and unique to your group.
- What do we name our business?
- When do we operate? Is it a year-round business or a seasonal one? If seasonal, which seasons? Which months / weeks / days / hours of the day do we operate?
- Where do we operate? In which countries / cities are we currently active? Where do we want to expand in the future? Decide where to set up your business based on weather conditions, local attractions, or events.
- Which datasets will help us make our business as successful as possible?

### Scenario 02: Fact-Check a Popular Belief

- What specific belief or claim do you want to investigate?
- Why is this belief important or worth fact-checking?
- What evidence or data supports or contradicts the belief?
- Will you split the team into two groups (in favor / against)?
- What real-world impact does this belief have on people?
- What are the consequences if people continue believing or acting on this (true or false) idea?

## 02 - Collect data from multiple APIs, the more the merrier

Integrate with as many APIs as you can, e.g.:
- OpenWeatherMap API
- Google Maps
- TripAdvisor
- News API
- Yelp
- Wikipedia
- Booking
- Amadeus Travel API
- Foursquare
- etc. (do your own research and be original!)

Each API can provide different types of information. Pick the ones that best suit your scenario.

After collecting all the data you need, save it.
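
One possible way to save raw API responses is sketched below. The helper name `save_raw`, the sample payload, and the temporary folder are all illustrative assumptions, not part of any API listed above. Keeping the untouched response with a timestamp means the analysis can be re-run later without calling the API again.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def save_raw(payload, source, out_dir):
    """Write one API response to <out_dir>/<source>_<UTC timestamp>.json and return the path."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    path = out_dir / f"{source}_{stamp}.json"
    with path.open("w", encoding="utf-8") as f:
        json.dump(payload, f, ensure_ascii=False, indent=2)
    return path


# Hypothetical payload standing in for a real API response.
sample_payload = {"city": "Lisbon", "temp_c": 21.5, "conditions": ["clear"]}

# A temporary folder keeps the example side-effect free; a project might use "data/raw".
demo_dir = Path(tempfile.mkdtemp())
saved_path = save_raw(sample_payload, "weather", demo_dir)

# Reading the file back should give exactly what was saved.
reloaded = json.loads(saved_path.read_text(encoding="utf-8"))
print(saved_path.name, reloaded == sample_payload)
```

Saving the raw response separately from any cleaned table makes it easier to fix a parsing mistake later without collecting the data again.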

## 03 - Collect data from open source data sources

The dataset must align with your end goal and serve its purpose. Possible sources:

- government websites
- statistics institutes
- etc.

At the end of this step, you should have collected both quantitative and qualitative data.

# Understat API
https://github.com/collinb9/understatAPI
https://pypi.org/project/understatapi/
This is a Python API for scraping data from understat.com. Understat is a website with football data for 6 European leagues for every season since 2014/15. The leagues available are the Premier League, La Liga, Ligue 1, Serie A, Bundesliga and the Russian Premier League.

**How this library works (the mental model)**

*   It mirrors pages on understat.com. There are exactly four entry points and they map 1–1 to the website: **league(<name>), team(<name>), player(<id>), match(<id>).** Each exposes methods that correspond to the tables you see on those pages.
*   Names vs IDs. league() and team() take human-readable (string) names (e.g., "EPL", "Manchester_United"). **player() and match() require numeric IDs** (you discover these via a league or team call first).


**What you can fetch (typical methods):**

*   LeagueEndpoint: get_team_data(season), get_player_data(season), get_match_data(season)
*   TeamEndpoint: get_player_data(season), get_match_data(season), get_context_data(season)
*   PlayerEndpoint: get_season_data(), get_match_data(), get_shot_data()
*   MatchEndpoint: get_roster_data(), get_shot_data(), get_match_info()


**Sessions + context manager.**
Use UnderstatClient() as a context manager to reuse a session and get nicer errors. 



In [None]:
# install once
%pip install understatapi


In [None]:
from understatapi import UnderstatClient
import pandas as pd
import json, os
from time import sleep
from pathlib import Path
print("Imports OK. pandas:", pd.__version__)

In [None]:
# testing it out first

LEAGUE = "EPL"              # one of: EPL, La_Liga, Bundesliga, Serie_A, Ligue_1, RFPL
SEASON = "2023"             # Understat uses season start year as string

with UnderstatClient() as us:
    # Discover IDs
    players = us.league(LEAGUE).get_player_data(season=SEASON)
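    # Example: pick by exact name (case/spacing must match Understat)
    pid = next((p["id"] for p in players if p["player_name"] == "Kevin De Bruyne"), None)  # pid = player id
    if pid is None:
        raise ValueError("Player not found in this league/season; check the spelling.")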

    # Player detail
    season_stats = us.player(pid).get_season_data()   # per-season summary
    match_stats  = us.player(pid).get_match_data()    # per-match stats
    shots        = us.player(pid).get_shot_data()     # shot-by-shot with xG and locations

    # Team or match routes (optional)
    united_matches = us.team("Manchester_United").get_match_data(season=SEASON)
    first_match_id = united_matches[0]["id"]
    roster = us.match(first_match_id).get_roster_data()
    match_shots = us.match(first_match_id).get_shot_data()

In [None]:
print(players)
print(pid)
print(season_stats)
print(match_stats)
print(shots)

Now that we know it works, let's pull some data, store it and structure it.

Quick brief (so that I understand the data):

> League: “Bundesliga” (Germany’s top division).

> Seasons on Understat use the start year. So season "2024" means 2024/25.

> Useful tables to start with:

> Players (season aggregates): one row per player per season (xG, xA, shots, minutes, etc.). Great for scouting/overview.

> Matches (metadata): match id, date, home/away, goals, xG. Good for timelines or joining later.

> Teams (season aggregates): team-level KPIs you might want for context.
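
The start-year convention can be made explicit with a small helper. This is a minimal sketch; the name `season_label` is illustrative and not part of understatapi.

```python
def season_label(start_year: str) -> str:
    """Turn an Understat start-year string such as "2024" into "2024/25"."""
    year = int(start_year)
    # The second half of the label is the last two digits of the following year.
    return f"{year}/{(year + 1) % 100:02d}"


for s in ["2014", "2019", "2024"]:
    print(s, "->", season_label(s))
```

A label like this is handy for chart axes and table headers, while the start-year string stays the value passed to the API.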

The data pull saves each table as both CSV and Parquet. Unlike CSV, which stores data in rows, Parquet organizes data by columns. This columnar storage format allows for more efficient querying and compression, making it ideal for large-scale datasets. Parquet is also more suitable for distributed systems, while CSV is better suited for simpler, smaller datasets.

Good to know, since xG is a metric in the data pulled: in football, xG (Expected Goals) is a performance metric that measures the probability of a shot resulting in a goal. Each shot is assigned a score between 0 and 1, representing its likelihood of scoring, based on factors like shot distance, angle, player position, and shot type. Summing the xG values of all of a team's shots in a match gives the number of goals that team would be expected to score from those chances.
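
The summing step can be illustrated with a few lines of Python. The shot values are made up for illustration, and the second calculation treats shots as independent events, which is a simplification rather than something Understat states.

```python
# Illustrative xG values for one team's shots in a single match (made-up numbers).
shot_xg = [0.76, 0.05, 0.12, 0.31, 0.08]

# Summing the per-shot probabilities gives the expected number of goals.
team_xg = sum(shot_xg)

# Probability of scoring at least once, if shots are treated as independent.
p_no_goal = 1.0
for x in shot_xg:
    p_no_goal *= 1.0 - x
p_at_least_one = 1.0 - p_no_goal

print(f"team xG: {team_xg:.2f}")
print(f"P(at least one goal): {p_at_least_one:.2f}")
```

Comparing a total like this with the goals actually scored is the basis for the over- and underperformance readings used later.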

## Background
Meaning of the columns npg, npxG, xGChain and xGBuildup:

*   **npg - Non-Penalty Goals**
Goals scored, excluding penalties. Why exclude penalties? Because they’re relatively “easy” and don’t reflect open-play performance.

Use case: Compare attackers fairly — a player with 20 goals including 10 penalties might be less impressive than one with 15 all from open play.

*   **npxG - Non-Penalty Expected Goals**
Sum of expected goals excluding penalties

xG = the probability of a shot becoming a goal based on distance, angle, body part, defensive pressure, etc.

npxG removes penalties (each worth roughly 0.76 xG) to show open-play shot quality.

Why it’s important: If a striker has 12 npg but npxG of 9.0 → he’s overperforming (finishing very well).
If npxG is 11.5 but only 7 npg → underperforming (poor finishing or bad luck).

*  **xGChain - Expected Goals Chain** (midfield and attack)
It measures how much a player contributes to dangerous attacks through passing, movement and buildup, not just shooting.

Example: Player A doesn’t shoot much but is involved in every build-up leading to high-quality chances.
His xGChain might be very high even if his personal xG is low → he’s crucial in creating those chances.

*   **xGBuildup - Expected Goals Buildup** (defensive midfielders and defenders)
Similar to xGChain, but excludes shots, key passes, and assists.

It’s purely about involvement in the earlier parts of the attack — passing, ball progression, linking play.

Example: A deep-lying midfielder or centre-back might have a high xGBuildup even if they never shoot or assist — showing their importance in starting attacking moves.



**How to use them in analysis:**

> Finishing skill: Compare npg vs. npxG

> Chance involvement: Sort players by xGChain

> Deep playmaking / buildup role: Sort by xGBuildup

For example:

> A striker: high npg, slightly above npxG

> A creative midfielder: modest npg, high xGChain

> A deep midfielder: low npg, high xGBuildup
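
These comparisons can be sketched in pandas. The rows below are made up and only mimic the column names described above. The values are stored as strings on the assumption that Understat-style JSON delivers numbers as text, which is worth checking on a real pull with `df.dtypes`.

```python
import pandas as pd

# Made-up rows that mimic the columns described above; values are strings
# on the assumption that the scraped JSON delivers numbers as text.
players = pd.DataFrame(
    {
        "player_name": ["Striker A", "Midfielder B", "Defender C"],
        "npg": ["12", "3", "0"],
        "npxG": ["9.0", "2.5", "0.4"],
        "xGChain": ["14.2", "18.7", "12.0"],
        "xGBuildup": ["3.1", "9.8", "11.0"],
    }
)

# Convert the metric columns to numbers; unparseable cells become NaN.
metric_cols = ["npg", "npxG", "xGChain", "xGBuildup"]
players[metric_cols] = players[metric_cols].apply(pd.to_numeric, errors="coerce")

# Finishing skill: positive means more non-penalty goals than npxG predicted.
players["finishing_delta"] = players["npg"] - players["npxG"]

# Chance involvement and deep playmaking: rank by the chain and buildup metrics.
by_chain = players.sort_values("xGChain", ascending=False)
by_buildup = players.sort_values("xGBuildup", ascending=False)

print(players[["player_name", "finishing_delta"]])
print("Top xGChain:", by_chain.iloc[0]["player_name"])
print("Top xGBuildup:", by_buildup.iloc[0]["player_name"])
```

On real data it usually makes sense to filter by minutes played first, since small samples can produce extreme deltas.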

# Now doing the same for the other leagues: Premier League, La Liga, Ligue 1, Serie A


In [None]:
# Multi-league, multi-season pull 
# Leagues: Premier League, La Liga, Ligue 1, Serie A, Bundesliga
# Keeps it simple, JSON-encodes nested fields, and stores per season in clean folders.

# Map your human names -> Understat accepted keys
LEAGUES = {
    "Premier_League": "EPL",
    "La_Liga": "La_Liga",
    "Ligue_1": "Ligue_1",
    "Serie_A": "Serie_A",
    "Bundesliga": "Bundesliga",
}

# Choose seasons (Understat uses start year strings; "2020" -> 2020/21)
SEASONS = [str(y) for y in range(2019, 2025)]  # adjust as you like (5–10 years is fine)

BASE_DIR = Path("data")  # all output goes here

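def to_df(records):
    """Build a DataFrame from a list of dicts, or from a dict of dicts keyed by id."""
    # Some endpoints (e.g. team data) may return a dict keyed by id rather than a list.
    if isinstance(records, dict):
        records = list(records.values())
    return pd.DataFrame.from_records(records) if records else pd.DataFrame()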

def jsonify_objects(df: pd.DataFrame) -> pd.DataFrame:
    """Convert dict/list/tuple cells to JSON strings so CSV/Parquet save cleanly."""
    if df.empty:
        return df
    df = df.copy()
    for c in df.select_dtypes(include=["object"]).columns:
        df[c] = df[c].apply(
            lambda v: json.dumps(v, ensure_ascii=False) if isinstance(v, (dict, list, tuple)) else v
        )
    return df

def save(df: pd.DataFrame, base_path: Path, csv=True, parquet=True):
    """Save a DataFrame to CSV/Parquet at base_path (without extension)."""
    base_path.parent.mkdir(parents=True, exist_ok=True)
    if df.empty:
        print(f"[warn] {base_path}: empty, nothing saved.")
        return
    out = jsonify_objects(df)
    if csv:
        out.to_csv(base_path.with_suffix(".csv"), index=False)
    if parquet:
        try:
            out.to_parquet(base_path.with_suffix(".parquet"), index=False)
        except Exception as e:
            print(f"[warn] Parquet save failed for {base_path.name}: {e}")
    print(f"[ok] saved {base_path.name} (in {base_path.parent})")

with UnderstatClient() as us:
    for nice_name, understat_key in LEAGUES.items():
        print(f"\n================ {nice_name} ================")

        # Accumulators for combined files (per league)
        all_players, all_matches, all_teams = [], [], []

        for season in SEASONS:
            print(f"\n--- Pulling {nice_name} {season} ---")
            # Pull league data
            players_raw = us.league(understat_key).get_player_data(season=season)
            matches_raw = us.league(understat_key).get_match_data(season=season)
            teams_raw   = us.league(understat_key).get_team_data(season=season)

            # To DataFrames
            df_players = to_df(players_raw)
            df_matches = to_df(matches_raw)
            df_teams   = to_df(teams_raw)

            # Tag for clarity in downstream analysis
            for df in (df_players, df_matches, df_teams):
                if not df.empty:
                    df["league"] = nice_name
                    df["season"] = season  # start-year format

            # Save per-season into data/<league>/<season>/
            season_dir = BASE_DIR / nice_name / season
            save(df_players, season_dir / "players")
            save(df_matches, season_dir / "matches")
            save(df_teams,   season_dir / "teams")

            # Collect for combined
            if not df_players.empty: all_players.append(df_players)
            if not df_matches.empty: all_matches.append(df_matches)
            if not df_teams.empty:   all_teams.append(df_teams)

            # Gentle pause (keeps things polite if you later extend to heavier endpoints)
            sleep(0.6)

        # Save combined per-league across seasons into data/<league>/combined/
        combined_dir = BASE_DIR / nice_name / "combined"
        if all_players:
            save(pd.concat(all_players, ignore_index=True), combined_dir / "players_allseasons")
        if all_matches:
            save(pd.concat(all_matches, ignore_index=True), combined_dir / "matches_allseasons")
        if all_teams:
            save(pd.concat(all_teams, ignore_index=True), combined_dir / "teams_allseasons")

print("\nDone.")
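
When the saved files are read back, the nested fields arrive as JSON text and need decoding before analysis. The sketch below shows that round trip under stated assumptions: match rows carry nested goals and xG dicts keyed by h (home) and a (away), the sample rows are made up, and the CSV round trip runs in memory rather than on the files written above.

```python
import io
import json

import pandas as pd

# Made-up match rows with nested fields (the exact keys are an assumption).
matches = pd.DataFrame(
    {
        "id": ["1001", "1002"],
        "goals": [{"h": "2", "a": "1"}, {"h": "0", "a": "0"}],
        "xG": [{"h": "1.8", "a": "0.9"}, {"h": "0.4", "a": "0.7"}],
    }
)

# Encode nested cells as JSON text, mirroring what jsonify_objects does before saving.
encoded = matches.copy()
for col in ["goals", "xG"]:
    encoded[col] = encoded[col].apply(json.dumps)

# Round-trip through CSV in memory (stands in for a file on disk).
buffer = io.StringIO()
encoded.to_csv(buffer, index=False)
buffer.seek(0)
loaded = pd.read_csv(buffer, dtype={"id": str})

# Decode the JSON text back into dicts, then flatten into numeric columns.
for col in ["goals", "xG"]:
    loaded[col] = loaded[col].apply(json.loads)
loaded["home_goals"] = loaded["goals"].apply(lambda d: int(d["h"]))
loaded["home_xg"] = loaded["xG"].apply(lambda d: float(d["h"]))

print(loaded[["id", "home_goals", "home_xg"]])
```

Reading ids with `dtype=str` keeps them as text, so they can be passed straight back to player() or match() calls.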
