**If you lost points on the last checkpoint you can get them back by responding to TA/IA feedback**  

Update/change the relevant sections where you lost those points, make sure you respond on GitHub Issues to your TA/IA to call their attention to the changes you made here.

Please update your Timeline... no battle plan survives contact with the enemy, so make sure we understand how your plans have changed.

# COGS 108 - Data Checkpoint

# Names
- Vishnu Babu Guturu
- Karthik Sankaran
- Shaan Bhakta
- Orlev Kuknariev
- Akshat Alurkar

# Research Question

How do player statistics (club, position, nationality, age, wins, losses, goals, assists, shooting accuracy, tackle success percentage, duels won, successful 50-50s, passes per match, big chances created) affect soccer player market value among Premier League midfielders from 2018-2020?
- Which statistics are the best measures of value?
- Can we use those statistics to predict a player's market value?

## Background and Prior Work

Soccer, like most sports, is heavily data-driven. Unlike in many other sports, however, many of the qualities of a good player are hard to capture with simple statistics. In basketball, a superstar who scores 0 points almost certainly had a bad game; a soccer superstar who scores 0 goals did not necessarily have one. For this reason, our team wants to find out whether some commonly used and available statistic (or combination of statistics) can help determine a player’s market value or quality.

Some members of our group watch soccer heavily. The article "Decoding TransfrMarket: Analyzing Player Values Versus Player Performance"<a name="cite_ref-1"></a>[<sup>1</sup>](#cite_note-1) reaffirms many of our opinions about how market value is influenced. It analyzes how well soccer players’ Transfermarkt values reflect their on-field performance, focusing on the 2023–2024 Premier League season. By comparing player ratings from WhoScored and key performance metrics such as goals, passes, and interceptions with market value, the author finds a generally positive correlation between better performance and higher valuations. However, positional differences are significant: attackers are valued for scoring and key passes, midfielders for dribbling and long balls, and defenders for consistency and accurate passing rather than tackles. Younger players also tend to be more expensive. These findings, from a sports analytics club at Berkeley, match what we believe as a team. One more idea we would like to put forward is that a player's club and nation tend to inflate their value. For example, there is a phenomenon in soccer known informally as the “Brazil/inho Tax,” where Brazilian players are held to higher standards than players of other nationalities because of their nation's pedigree in the sport and are thus valued more highly. People often joke that Manchester City right winger Savinho’s nickname change from Savio to Savinho (a more Brazilian-sounding name) increased his transfer fee by $10 million.

One GitHub repository that demonstrates a similar (but considerably more complex) concept is Football-Player-Market-Value-Prediction<a name="cite_ref-2"></a>[<sup>2</sup>](#cite_note-2).

This project focuses on predicting the market value of professional football players using machine learning techniques. Data was collected through web scraping with BeautifulSoup in Python from two websites, covering over 20 tables and resulting in a dataset of 350+ players from top leagues such as the Premier League, La Liga, Bundesliga, Serie A, and Ligue 1. With a few exceptions, the statistics the author used were a superset of the ones we plan to work with. After cleaning and transforming the data, multiple models were tested, with the Random Forest Regressor achieving the best performance at around 90% accuracy and a 5% error margin. The project applied preprocessing methods for outlier detection and null-value handling, and presented its results with highly detailed visualizations.

These prior works are helpful to our project because they establish a proof of concept and show feasibility of predicting player market value using available performance metrics and qualitative factors. The Transfermarkt analysis reinforces the idea that while statistics like goals, assists, and passes are important, market value is also heavily influenced by other factors like a player's position, age, and even their nationality or club affiliation. This information encourages us to embrace a more multifactorial approach to analysis.  Meanwhile, the GitHub project provides a blueprint for how such a predictive model can be constructed, from data collection and preprocessing to model selection and evaluation. Although the model used in that project is more complex, it offers us a useful benchmark for what can be achieved and motivates us to pursue a model that balances interpretability and intuitiveness with predictive power.

1. <a name="cite_note-1"></a> [^](#cite_ref-1) Wang, Ethan. (16 May 2024) Decoding TransfrMarket: Analyzing Player Values Versus Player Performance. *Sports Analytics Group Berkeley*. https://sportsanalytics.studentorg.berkeley.edu/articles/transfer-values.html 
2. <a name="cite_note-2"></a> [^](#cite_ref-2) GitHub repository by akarshsinghh. https://github.com/akarshsinghh/Football-Player-Market-Value-Prediction


# Hypothesis


We predict that a player's market value will correlate negatively with age (younger players being valued more highly) and strongly positively with goals, assists, and big chances created. These metrics reflect offensive productivity and creativity, which are highly valued traits in modern soccer. We also anticipate that goals and big chances created will show the strongest correlations with market value, as these directly contribute to a team's success and are often used to evaluate a player's impact.

Our prediction is based on observed trends in the soccer transfer market, where young attacking players, particularly those who consistently contribute to goal-scoring opportunities, tend to command higher market values. In contrast, older players or those in primarily defensive roles typically have lower market values due to perceived lower resale value and shorter remaining careers.
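
As a minimal sketch of how the correlations in this hypothesis could be checked, the snippet below computes Pearson's r by hand. The numbers are made-up toy values, not drawn from either dataset, and the variable names are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Toy values for illustration only (NOT real player data).
toy_goals = [1, 3, 5, 8, 12]
toy_age = [33, 30, 27, 24, 21]
toy_value = [4.0, 9.0, 15.0, 22.0, 40.0]  # market value, millions of pounds

r_goals = pearson_r(toy_goals, toy_value)
r_age = pearson_r(toy_age, toy_value)
print(f"goals vs value: r = {r_goals:.2f}")
print(f"age vs value:   r = {r_age:.2f}")
```

In the real analysis, a positive r for goals together with a negative r for age would be consistent with the hypothesis; the toy data are constructed to show that pattern and say nothing about actual players.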



# Data

For our research question, the ideal dataset would be a single player-season table spanning every Premier League campaign, with one row per player per season and columns for demographics and context (age, nationality, position, club), attacking and defensive statistics (goals, assists, big chances created, shooting accuracy, passes per match, tackle-success %, duels and 50-50s won, etc.), and three target variables: market value, FPL value, and FPL points. Because no public file contains all of that, we will merge two complementary Kaggle sources:

1. Dataset Name: English Premier League Players Statistics
   
> Link to the dataset: https://www.kaggle.com/code/desalegngeb/english-premier-league-players-statistics/input

> Number of observations: 571 player-season rows (2019-20 EPL)

> Description: Contains the full box-score feature set we need—club, position, nationality, age, wins, losses, goals, assists, shooting-accuracy, tackle-success %, duels won, successful 50-50s, passes per match, and big chances created. This file supplies all explanatory variables and will be read into a pandas DataFrame.

2. Dataset Name: English Premier League Players Dataset
   
> Link to the dataset: https://www.kaggle.com/datasets/mauryashubham/english-premier-league-players-dataset

> Number of observations: 461 player-season rows (2017-18 EPL)

> Description: Provides the three valuation targets: market value (£ m), FPL value (£ m), and FPL points, along with basic demographics. We will merge these targets onto the statistics dataframe using player name. Because the two datasets contain different numbers of observations, we will keep only the players who appear in both.

One problem with these datasets is that the seasons do not align (2017-18 vs. 2019-20): the valuations predate the performance statistics by two seasons. We expect this to introduce a small amount of temporal noise rather than systematic bias, because between 2017-18 and 2019-20 some players' market values will have risen while others' will have fallen, so the shifts should partly offset one another in aggregate. Together, the two datasets supply both the explanatory variables and the valuation targets our analysis needs.
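
The name-based merge described above could look roughly like the following sketch. The column names (`Name`, `name`, `market_value`) and the toy rows are assumptions for illustration; the actual Kaggle files may use different headers. Names are normalized first (accents stripped, lowercased, whitespace collapsed) so that spelling variants of the same player still match, and an inner join keeps only players present in both tables.

```python
import unicodedata

import pandas as pd


def normalize_name(name: str) -> str:
    """Lowercase, strip accents, and collapse whitespace in a player name."""
    decomposed = unicodedata.normalize("NFKD", name)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    return " ".join(ascii_only.lower().split())


def merge_on_player(stats: pd.DataFrame, values: pd.DataFrame) -> pd.DataFrame:
    """Inner-join valuation targets onto the stats table by normalized name.

    Assumes (hypothetically) that the stats table has a "Name" column and
    the valuation table has "name" and "market_value" columns.
    """
    stats = stats.assign(name_key=stats["Name"].map(normalize_name))
    values = values.assign(name_key=values["name"].map(normalize_name))
    # Keep one valuation row per player so the join cannot multiply rows.
    values = values.drop_duplicates(subset="name_key")
    return stats.merge(
        values[["name_key", "market_value"]],
        on="name_key",
        how="inner",
        validate="many_to_one",
    )


# Toy rows for illustration only (NOT real players or values).
toy_stats = pd.DataFrame(
    {"Name": ["José Example", "Sam Sample", "Only InStats"], "Goals": [5, 2, 0]}
)
toy_values = pd.DataFrame(
    {
        "name": ["Jose Example", "sam  sample", "Only InValues"],
        "market_value": [20.0, 8.5, 3.0],
    }
)
merged = merge_on_player(toy_stats, toy_values)
print(merged)
```

An inner join silently drops unmatched players, so it may be worth comparing `len(merged)` against the sizes of the two inputs to see how many rows were lost to the season mismatch or to name spelling differences.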

## Dataset English Premier League Players Statistics

In [None]:
## YOUR CODE TO LOAD/CLEAN/TIDY/WRANGLE THE DATA GOES HERE
## FEEL FREE TO ADD MULTIPLE CELLS PER SECTION 
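
One possible starting point for this cell is sketched below. The file name `epl_player_stats.csv` and the column names (`Position`, `Shooting accuracy %`, `Tackle success %`) are hypothetical stand-ins; they should be replaced with whatever headers the downloaded CSV actually uses. The sketch converts percentage strings such as `"32%"` to numbers and keeps only midfielders, matching the research question.

```python
import io

import pandas as pd

# Assumed column names; adjust to the real CSV headers.
PERCENT_COLUMNS = ["Shooting accuracy %", "Tackle success %"]


def clean_stats(raw: pd.DataFrame) -> pd.DataFrame:
    """Convert percent strings to floats and keep only midfielders."""
    df = raw.copy()
    for col in PERCENT_COLUMNS:
        # "32%" -> 32.0; missing or unparseable entries become NaN.
        stripped = df[col].astype(str).str.rstrip("%")
        df[col] = pd.to_numeric(stripped, errors="coerce")
    position = df["Position"].fillna("").str.strip().str.lower()
    return df[position == "midfielder"].reset_index(drop=True)


# In the notebook this would be something like:
#   raw_stats = pd.read_csv("epl_player_stats.csv")  # hypothetical file name
# A tiny in-memory CSV stands in here so the sketch runs on its own.
toy_csv = io.StringIO(
    "Name,Position,Shooting accuracy %,Tackle success %\n"
    "Player A,Midfielder,32%,61%\n"
    "Player B,Forward,45%,\n"
    "Player C,Midfielder,,70%\n"
)
stats = clean_stats(pd.read_csv(toy_csv))
print(stats)
```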



## Dataset English Premier League Players Dataset

In [None]:
## YOUR CODE TO LOAD/CLEAN/TIDY/WRANGLE THE DATA GOES HERE
## FEEL FREE TO ADD MULTIPLE CELLS PER SECTION 

# Ethics & Privacy

All variables in our merged dataset are drawn from open, publicly posted sources: Kaggle CSVs containing match statistics scraped from the Premier League website, crowdsourced market prices from Transfermarkt, and data from the official FPL API. We will store raw files in a private Git repository, publish only aggregated or model-ready tables, and respect each site’s non-commercial licence when downloading updates.

The larger ethical challenge is bias in the two target variables. Transfermarkt valuations are crowd-edited and have been shown to underrate defenders, goalkeepers, and players from lower-profile nationalities; FPL prices embed the game designers’ scoring heuristics, thereby privileging goal involvement over less visible defensive work. These systematic skews could make our models appear accurate while merely reproducing community bias. Moreover, our market value predictions will be based only on attacking statistics such as goals scored and shooting accuracy, which may bias them against defenders. To limit this, we will only compare market values for attackers and midfielders. To surface the remaining issues we will:

* examine model residuals by position, nationality group, and age band; a pattern of under- or over-prediction in any subgroup will be explicitly reported.
* compute mean-absolute-error gaps across those subgroups and include them in the results table.
* state in the paper that outputs are descriptive of Transfermarkt/FPL perceptions, not “true” player worth, and should not be used for hiring, wage, or contract decisions.
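
The subgroup audit in the bullets above could be sketched as follows. The record fields (`position`, `actual`, `predicted`) and the toy numbers are hypothetical; real values would come from the fitted model. For each subgroup, the sketch reports the mean residual (which reveals systematic under- or over-prediction) and the mean absolute error, then the MAE gap between the best- and worst-served subgroups.

```python
from collections import defaultdict


def subgroup_errors(records, group_key):
    """Summarize residuals (actual - predicted) for each subgroup."""
    residuals = defaultdict(list)
    for rec in records:
        residuals[rec[group_key]].append(rec["actual"] - rec["predicted"])
    summary = {}
    for group, res in residuals.items():
        summary[group] = {
            "n": len(res),
            # > 0 means the model under-predicts this group on average.
            "mean_residual": sum(res) / len(res),
            "mae": sum(abs(r) for r in res) / len(res),
        }
    return summary


# Toy predictions for illustration only (NOT real model output).
toy_records = [
    {"position": "Midfielder", "actual": 20.0, "predicted": 18.0},
    {"position": "Midfielder", "actual": 10.0, "predicted": 12.0},
    {"position": "Forward", "actual": 30.0, "predicted": 24.0},
    {"position": "Forward", "actual": 15.0, "predicted": 11.0},
]
report = subgroup_errors(toy_records, "position")
maes = [stats["mae"] for stats in report.values()]
mae_gap = max(maes) - min(maes)

for group, stats in report.items():
    print(group, stats)
print("MAE gap across subgroups:", mae_gap)
```

The same function can be reused with `group_key` set to a nationality-group or age-band field, assuming those fields are added to each record.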

By limiting ourselves to public performance data, honouring source licences, and auditing residuals for subgroup error, we address privacy obligations while making the model’s potential biases visible and interpretable.

# Team Expectations 

- Primary channels – We will use a dedicated Messenger group chat for quick updates and questions, with a weekly Zoom call (30 min, Fridays 5 PM) for progress checks and task-planning. Important files and deadlines will be mirrored in a shared Google Drive folder and posted to the Canvas discussion thread as a backup so nothing gets lost.

- Response time – Everyone agrees to acknowledge messages within 24 hours on weekdays; if someone anticipates being unavailable they will post a heads-up.

- Work allocation – Coding, writing, and literature search will be divided so every member contributes to each area, but weights can vary with comfort and skill. We will revisit the task list during the weekly call to rebalance workloads when needed.

- Feedback style – Direct, constructive critique is welcome, but comments must stay respectful and specific rather than personal.

- Conflict resolution – First, raise the issue privately with the teammate concerned; if unresolved, bring it to the next Zoom call for group discussion. Persistent problems will be taken forward to the instructional staff.

- Commitment – By adding our names to the project submission we confirm that we have read the COGS 108 Team Policies, accept the expectations above, and intend to contribute reliably throughout the quarter.

# Project Timeline Proposal

| Meeting Date  | Meeting Time| Completed Before Meeting  | Discuss at Meeting |
|---|---|---|---|
| 4/30  |  4 PM | Read COGS 108 policies and rubric; Brainstormed topics  | Picked final topic (EPL player desirability via performance metrics); opened Google Drive & Messenger group; drafted research question and hypothesis | 
| 5/8  |  5 PM | Make sure everyone is up to speed with the datasets; brainstorm ideas for data analysis | Discuss strategy and method to analyze data effectively; divide tasks for data wrangling | 
| 5/14  | 5 PM  | Finish up data wrangling; Stats dataset (Karthik, Orlev, Akshat); Market value dataset (Vishnu, Shaan) | Finalize strategy for analysis and discuss prediction methods; Divide tasks for EDA |
| 5/22  | 6 PM  | Finish up EDA; start working on prediction | Review/Edit wrangling/EDA; Discuss plan for predicting market value |
| 5/28  | 12 PM  | Finalize wrangling/EDA (Shaan, Orlev, Akshat); Begin prediction analysis (Vishnu, Karthik) | Complete project check-in |
| 6/3  | 12 PM  | Complete analysis; Draft results/conclusion/discussion (Everyone)| Discuss/edit full project |
| 6/11  | 11 AM  | NA | Turn in Final Project & Group Project Surveys |