# **UCL Winners**
Collaborators: Archit Shankar, Iris Guo, Justenn Wang

---




## **Introduction**

Soccer, known as football in some parts of the world, is more than just a sport; it is a global phenomenon that unites millions of fans through its thrilling matches and historic tournaments. Among these tournaments, the UEFA Champions League stands out as one of the most prestigious and fiercely contested competitions. Predicting the winner of such a high-stakes event is not only a fascinating challenge but also one that has significant implications for fans, clubs, and the sports betting industry.

### Why is Predicting UEFA Champions League Winners Important?

Sports predictions, particularly for high-profile events like the UEFA Champions League, hold immense importance. Accurate predictions can influence betting markets, enhance fan engagement, and provide strategic insights for clubs and analysts. Beyond the excitement of guessing the winner, predicting outcomes in sports involves analyzing vast amounts of data and identifying patterns, which is a perfect application of data science.

In recent years, the use of machine learning in sports analytics has grown tremendously. By leveraging historical data and advanced algorithms, we can uncover the factors that contribute to a team's success and build models that predict future outcomes with impressive accuracy. This project will take you through the entire data science pipeline to create a machine learning model, specifically a decision tree, to predict UEFA Champions League winners.

### Objective
The main goal of this project is to develop a decision tree model to predict the winners of the UEFA Champions League. We will guide you through each step of the data science lifecycle, from data collection and preprocessing to exploratory data analysis (EDA), model training, and evaluation. By the end of this tutorial, you will have a comprehensive understanding of the processes involved in building a predictive model and the factors that influence the outcomes of football matches.

In this tutorial, we will go through the data science lifecycle as follows:

1. [Data Collection](https://colab.research.google.com/drive/1pL1pOrxOkhi0ABoYasop5EI1GKqDCJDE#scrollTo=YrJKn9PjajIX&line=14&uniqifier=1)
2. [Data Processing](https://colab.research.google.com/drive/1pL1pOrxOkhi0ABoYasop5EI1GKqDCJDE#scrollTo=XFavjgb4am21)
3. [Exploratory Analysis & Data Visualization](https://colab.research.google.com/drive/1pL1pOrxOkhi0ABoYasop5EI1GKqDCJDE#scrollTo=jn5X9tUOaoNm)
4. [Model: Analysis, Hypothesis Testing, & ML](https://colab.research.google.com/drive/1pL1pOrxOkhi0ABoYasop5EI1GKqDCJDE#scrollTo=7aUHEqbtbHl-)
5. [Interpretation: Insights Learned](https://colab.research.google.com/drive/1pL1pOrxOkhi0ABoYasop5EI1GKqDCJDE#scrollTo=RwWLqzXbbNVx)

## **Data Collection**

We will collect data that gives us the following information about each team we look at:

- Average domestic league position during the past 3 years

- All-time club ranking

- Seeding for that year

- Net transfer spend over the past 3 years

- Net market value of the players on the team



At this stage of the data science lifecycle, we look for datasets related to our topic. Just as with any scientific paper, we have to pay attention to the legitimacy of each source.

This notebook reads three CSV files, `AllTimeRankingByClub.csv`, `spi_global_rankings.csv`, and `matches.csv`, which should be placed in your project directory. No single dataset covers every feature listed above, so the remaining tables are scraped from Transfermarkt (team market values and transfer spend), UEFA (club coefficient rankings), and Wikipedia (the list of finals).


Throughout this project we use Python and develop in a Jupyter Notebook. If you haven't used Jupyter notebooks before, you can learn more about them [here](https://jupyter.org/).

Just like any other Python project, we need to import some libraries. Here are some of the libraries we will be using throughout this tutorial:

In [None]:
import os
import warnings
import datetime
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import requests
from bs4 import BeautifulSoup as bs

from sklearn.svm import SVC
from scipy.stats import norm
from sklearn import linear_model
from IPython.display import HTML
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_squared_error, r2_score

One of the main libraries that we will be using throughout this project is Pandas. [Pandas](https://pandas.pydata.org/) is an open-source data analysis library built on top of Python, and it lets us manipulate data in an easy and flexible way. With its large collection of tools, you can transform data very easily, as you will see below.

Another library that helps with efficiency is [NumPy](https://numpy.org/). It provides fast computation on large datasets and offers another way to store and manipulate information.

We will also need to install Selenium and use a WebDriver to scrape data. Run
```
sudo pip3 install selenium
```

Then import the following libraries:

In [None]:
!pip3 install selenium  # needed on Google Colab only
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys



In [None]:
# Here we collect data for teams that may reach the knockout stages. Our predictions will be based on this data.

# Load the all-time club rankings into a DataFrame

rank = pd.read_csv("AllTimeRankingByClub.csv", encoding='UTF-16')

rank


Unnamed: 0,Position,Club,Country,Participated,Titles,Played,Win,Draw,Loss,Goals For,Goals Against,Pts,Goal Diff
0,1,Real Madrid CF,ESP,53,14,476,285,81,110,1047,521.0,651.0,526.0
1,2,FC Bayern München,GER,39,6,382,229,76,77,804,373.0,534.0,431.0
2,3,FC Barcelona,ESP,33,5,339,197,76,66,667,343.0,470.0,324.0
3,4,Manchester United,ENG,30,3,293,160,69,64,533,284.0,389.0,249.0
4,5,Juventus,ITA,37,2,301,153,70,78,479,301.0,376.0,178.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
531,532,CS Stade Dudelange,LUX,1,0,2,0,0,2,0,18.0,0.0,-18.0
532,533,Rabat Ajax FC,MLT,2,0,4,0,0,4,0,20.0,0.0,-20.0
533,534,Keflavík,ISL,4,0,8,0,0,8,5,35.0,0.0,-30.0
534,535,US Luxembourg,LUX,5,0,10,0,0,10,3,43.0,0.0,-40.0


In [None]:
#here we are looking at spi rankings

spi = pd.read_csv("spi_global_rankings.csv", encoding='UTF-8')

spi

Unnamed: 0,rank,prev_rank,name,league,off,def,spi
0,1,1,Manchester City,Barclays Premier League,2.79,0.28,92.00
1,2,2,Bayern Munich,German Bundesliga,3.04,0.68,87.66
2,3,3,Barcelona,Spanish Primera Division,2.45,0.43,86.40
3,4,4,Real Madrid,Spanish Primera Division,2.56,0.60,84.41
4,5,5,Liverpool,Barclays Premier League,2.63,0.67,83.93
...,...,...,...,...,...,...,...
636,637,637,AFC Wimbledon,English League Two,0.24,2.30,6.96
637,638,638,Doncaster Rovers,English League Two,0.20,2.35,6.06
638,639,639,Forest Green Rovers,English League One,0.20,2.38,5.91
639,640,640,Crawley Town,English League Two,0.20,2.41,5.75


Next, we collect net team values for the seasons starting in 2023, 2022, 2021, 2020, and 2019.

In [None]:
#here we are collecting data for net team value

#This is the data for 2023-2024 Season
#!pip install selenium

# URL of the webpage to scrape
url = 'https://www.transfermarkt.us/uefa-champions-league/teilnehmer/pokalwettbewerb/CL/saison_id/2023'

# Set up the WebDriver (ensure the path to the chromedriver is correct)
driver = webdriver.Chrome()

# Open the URL
driver.get(url)

# Get the page source after JavaScript has rendered
html = driver.page_source

# Parse the HTML content using BeautifulSoup
soup = bs(html, 'html.parser')
table = pd.read_html(html)
twentythree = pd.concat(table)
twentythree

#(When you run this locally it works)

SessionNotCreatedException: Message: session not created: Chrome failed to start: exited normally.
  (session not created: DevToolsActivePort file doesn't exist)
  (The process started from chrome location /root/.cache/selenium/chrome/linux64/125.0.6422.60/chrome is no longer running, so ChromeDriver is assuming that Chrome has crashed.)


In [None]:
#This is the data for 2022-2023 Season
#!pip install selenium

# URL of the webpage to scrape
url = 'https://www.transfermarkt.us/uefa-champions-league/teilnehmer/pokalwettbewerb/CL/saison_id/2022'

# Set up the WebDriver (ensure the path to the chromedriver is correct)
driver = webdriver.Chrome()

# Open the URL
driver.get(url)

# Get the page source after JavaScript has rendered
html = driver.page_source

# Parse the HTML content using BeautifulSoup
soup = bs(html, 'html.parser')
table = pd.read_html(html)
twentytwo = pd.concat(table)
twentytwo

#(When you run this locally it works)

In [None]:
#This is the data for 2021-2022 Season
#!pip install selenium

# URL of the webpage to scrape
url = 'https://www.transfermarkt.us/uefa-champions-league/teilnehmer/pokalwettbewerb/CL/saison_id/2021'

# Set up the WebDriver (ensure the path to the chromedriver is correct)
driver = webdriver.Chrome()

# Open the URL
driver.get(url)

# Get the page source after JavaScript has rendered
html = driver.page_source

# Parse the HTML content using BeautifulSoup
soup = bs(html, 'html.parser')
table = pd.read_html(html)
twentyone = pd.concat(table)
twentyone

#(When you run this locally it works)

In [None]:
#This is the data for 2020-2021 Season
#!pip install selenium


# URL of the webpage to scrape
url = 'https://www.transfermarkt.us/uefa-champions-league/teilnehmer/pokalwettbewerb/CL/saison_id/2020'

# Set up the WebDriver (ensure the path to the chromedriver is correct)
driver = webdriver.Chrome()

# Open the URL
driver.get(url)

# Get the page source after JavaScript has rendered
html = driver.page_source

# Parse the HTML content using BeautifulSoup
soup = bs(html, 'html.parser')
table = pd.read_html(html)
twenty = pd.concat(table)
twenty

#(When you run this locally it works)

In [None]:
#This is the data for 2019-2020 Season
#!pip install selenium

# URL of the webpage to scrape
url = 'https://www.transfermarkt.us/uefa-champions-league/teilnehmer/pokalwettbewerb/CL/saison_id/2019'

# Set up the WebDriver (ensure the path to the chromedriver is correct)
driver = webdriver.Chrome()

# Open the URL
driver.get(url)

# Get the page source after JavaScript has rendered
html = driver.page_source

# Parse the HTML content using BeautifulSoup
soup = bs(html, 'html.parser')
table = pd.read_html(html)

nineteen = pd.concat(table)
nineteen

#(When you run this locally it works)
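
The five cells above differ only in the season ID at the end of the URL, so the repeated steps could be wrapped in a helper. The sketch below is one possible way to do that, not the code that produced this notebook's tables: `participants_url` and `fetch_tables` are hypothetical names, and the headless Chrome flags are an assumption about what an environment without a display, such as Colab, needs.

```python
from io import StringIO

import pandas as pd

BASE_URL = ('https://www.transfermarkt.us/uefa-champions-league/'
            'teilnehmer/pokalwettbewerb/CL/saison_id/')


def participants_url(season: int) -> str:
    """Build the participants-page URL for a season's starting year."""
    return f'{BASE_URL}{season}'


def fetch_tables(url: str) -> pd.DataFrame:
    """Render a page in headless Chrome and return all of its HTML tables."""
    # Imported here so the URL helper works even without Selenium installed
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    # Assumed flags for machines without a display (e.g. Colab)
    options.add_argument('--headless=new')
    options.add_argument('--no-sandbox')
    options.add_argument('--disable-dev-shm-usage')

    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        html = driver.page_source
    finally:
        driver.quit()  # always close the browser, even if the page fails to load

    # Wrap the string so pandas reads it as a file-like object
    return pd.concat(pd.read_html(StringIO(html)))


# One URL per season, keyed by the season's starting year
season_urls = {season: participants_url(season) for season in range(2019, 2024)}
print(season_urls[2023])
```

Calling `fetch_tables(season_urls[2023])` would then play the role of the `twentythree` cell. Whether the flags are sufficient depends on Chrome being installed in the environment.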

Next, we collect net transfer spend data for each season.

In [None]:
# Here we collect net transfer spend data, one season per cell

#This is the data for 2023-2024 Season
#!pip install selenium

# URL of the webpage to scrape
url = 'https://www.transfermarkt.us/transfers/einnahmenausgaben/statistik/plus/0?ids=a&sa=&saison_id=2023&saison_id_bis=2023&land_id=&nat=&kontinent_id=&pos=&altersklasse=&w_s=&leihe=&intern=0&plus=0'

# Set up the WebDriver (ensure the path to the chromedriver is correct)
driver = webdriver.Chrome()

# Open the URL
driver.get(url)

# Get the page source after JavaScript has rendered
html = driver.page_source

# Parse the HTML content using BeautifulSoup
soup = bs(html, 'html.parser')
table = pd.read_html(html)

nts23 = table[1]
nts23

#(When you run this locally it works)

In [None]:
#This is the data for 2022-2023 Season
#!pip install selenium

# URL of the webpage to scrape
url = 'https://www.transfermarkt.us/transfers/einnahmenausgaben/statistik/plus/0?ids=a&sa=&saison_id=2022&saison_id_bis=2022&land_id=&nat=&kontinent_id=&pos=&altersklasse=&w_s=&leihe=&intern=0&plus=0'

# Set up the WebDriver (ensure the path to the chromedriver is correct)
driver = webdriver.Chrome()

# Open the URL
driver.get(url)

# Get the page source after JavaScript has rendered
html = driver.page_source

# Parse the HTML content using BeautifulSoup
soup = bs(html, 'html.parser')
table = pd.read_html(html)

nts22 = table[1]
nts22

#(When you run this locally it works)

In [None]:
#This is the data for 2021-2022 Season
#!pip install selenium

# URL of the webpage to scrape
url = 'https://www.transfermarkt.us/transfers/einnahmenausgaben/statistik/plus/0?ids=a&sa=&saison_id=2021&saison_id_bis=2021&land_id=&nat=&kontinent_id=&pos=&altersklasse=&w_s=&leihe=&intern=0&plus=0'

# Set up the WebDriver (ensure the path to the chromedriver is correct)
driver = webdriver.Chrome()

# Open the URL
driver.get(url)

# Get the page source after JavaScript has rendered
html = driver.page_source

# Parse the HTML content using BeautifulSoup
soup = bs(html, 'html.parser')
table = pd.read_html(html)

nts21 = table[1]
nts21

#(When you run this locally it works)

In [None]:
#This is the data for 2020-2021 Season
#!pip install selenium

# URL of the webpage to scrape
url = 'https://www.transfermarkt.us/transfers/einnahmenausgaben/statistik/plus/0?ids=a&sa=&saison_id=2020&saison_id_bis=2020&land_id=&nat=&kontinent_id=&pos=&altersklasse=&w_s=&leihe=&intern=0&plus=0'

# Set up the WebDriver (ensure the path to the chromedriver is correct)
driver = webdriver.Chrome()

# Open the URL
driver.get(url)

# Get the page source after JavaScript has rendered
html = driver.page_source

# Parse the HTML content using BeautifulSoup
soup = bs(html, 'html.parser')
table = pd.read_html(html)

nts20 = table[1]
nts20

#(When you run this locally it works)

In [None]:
#This is the data for 2019-2020 Season
#!pip install selenium

# URL of the webpage to scrape
url = 'https://www.transfermarkt.us/transfers/einnahmenausgaben/statistik/plus/0?ids=a&sa=&saison_id=2019&saison_id_bis=2019&land_id=&nat=&kontinent_id=&pos=&altersklasse=&w_s=&leihe=&intern=0&plus=0'

# Set up the WebDriver (ensure the path to the chromedriver is correct)
driver = webdriver.Chrome()

# Open the URL
driver.get(url)

# Get the page source after JavaScript has rendered
html = driver.page_source

# Parse the HTML content using BeautifulSoup
soup = bs(html, 'html.parser')
table = pd.read_html(html)

nts19 = table[1]
nts19

#(When you run this locally it works)

In [None]:
#This is the data for 2018-2019 Season
#!pip install selenium

# URL of the webpage to scrape
url = 'https://www.transfermarkt.us/transfers/einnahmenausgaben/statistik/plus/0?ids=a&sa=&saison_id=2018&saison_id_bis=2018&land_id=&nat=&kontinent_id=&pos=&altersklasse=&w_s=&leihe=&intern=0&plus=0'

# Set up the WebDriver (ensure the path to the chromedriver is correct)
driver = webdriver.Chrome()

# Open the URL
driver.get(url)

# Get the page source after JavaScript has rendered
html = driver.page_source

# Parse the HTML content using BeautifulSoup
soup = bs(html, 'html.parser')
table = pd.read_html(html)

nts18 = table[1]
nts18

#(When you run this locally it works)

Now, we are going to gather UEFA club coefficient rankings. These rankings come from UEFA itself and rank European teams based on their performance over recent seasons. This coefficient also determines seeding in the tournament's draws.

In [None]:
#This is the data for 2023-2024 Season
#!pip install selenium

# URL of the webpage to scrape
url = 'https://www.uefa.com/nationalassociations/uefarankings/club/?year=2024'

# Set up the WebDriver (ensure the path to the chromedriver is correct)
driver = webdriver.Chrome()

# Open the URL
driver.get(url)

# Get the page source after JavaScript has rendered
html = driver.page_source

# Parse the HTML content using BeautifulSoup
soup = bs(html, 'html.parser')
table = pd.read_html(html)

uefa23 = pd.concat(table)
uefa23

#(When you run this locally it works)

In [None]:
url = 'https://en.wikipedia.org/wiki/List_of_European_Cup_and_UEFA_Champions_League_finals'
response = requests.get(url)

soup = bs(response.content, "html.parser")

tables = soup.find_all('table')

selected_table = tables[2]

df = pd.read_html(str(selected_table))[0]
df

Unnamed: 0,Season,Country,Winners,Score,Runners-up,Country.1,Venue,Attend­ance[14]
0,1955–56,Spain,Real Madrid,4–3,Reims,France,"Parc des Princes, Paris, France",38239
1,1956–57,Spain,Real Madrid,2–0,Fiorentina,Italy,"Santiago Bernabéu, Madrid, Spain",124000
2,1957–58,Spain,Real Madrid,3–2†,Milan,Italy,"Heysel Stadium, Brussels, Belgium",67000
3,1958–59,Spain,Real Madrid,2–0,Reims,France,"Neckarstadion, Stuttgart, West Germany",72000
4,1959–60,Spain,Real Madrid,7–3,Eintracht Frankfurt,West Germany,"Hampden Park, Glasgow, Scotland",127621
...,...,...,...,...,...,...,...,...
68,2022–23,England,Manchester City,1–0,Inter Milan,Italy,"Atatürk Olympic Stadium, Istanbul, Turkey",71412
69,Upcoming finals,Upcoming finals,Upcoming finals,Upcoming finals,Upcoming finals,Upcoming finals,Upcoming finals,Upcoming finals
70,Season,Country,Finalist,Match,Finalist,Country,Venue,Venue
71,2023–24,Germany,Borussia Dortmund,v,Real Madrid,Spain,"Wembley Stadium, London, England","Wembley Stadium, London, England"
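
The scraped table still contains the "Upcoming finals" separator, a repeated header row, and a fixture that has not been played yet (rows 69 to 71 above). Below is a minimal sketch of one way to trim them, assuming everything from the separator onward should be discarded. `played_finals` is a hypothetical helper, and the small frame is toy data that mirrors a few of the columns shown.

```python
import pandas as pd


def played_finals(finals: pd.DataFrame) -> pd.DataFrame:
    """Keep only the rows that appear before the 'Upcoming finals' separator."""
    is_separator = finals['Season'].eq('Upcoming finals')
    if not is_separator.any():
        return finals.copy()
    first_separator = is_separator.to_numpy().argmax()  # position of the first True
    return finals.iloc[:first_separator].copy()


toy = pd.DataFrame({
    'Season': ['1959–60', '2022–23', 'Upcoming finals', 'Season', '2023–24'],
    'Winners': ['Real Madrid', 'Manchester City', 'Upcoming finals',
                'Finalist', 'Borussia Dortmund'],
    'Score': ['7–3', '1–0', 'Upcoming finals', 'Match', 'v'],
})

cleaned = played_finals(toy)
print(cleaned)
```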


Match data extraction below:

In [None]:
matches = pd.read_csv("matches.csv", encoding='UTF-8')

# Convert the 'datetime' column to datetime format
matches['datetime'] = pd.to_datetime(matches['datetime'])

# Define the cutoff date for filtering
cutoff_date = pd.Timestamp('2018-07-01 00:00:00+00:00')

# Filter the DataFrame to include only matches from the cutoff date onwards
filtered_matches = matches[matches['datetime'] >= cutoff_date]

filtered_matches

Unnamed: 0.1,Unnamed: 0,datetime,team1,team2,team1_code,team2_code,round,score1,score2,adj_score1,adj_score2,chances1,chances2,moves1,moves2,group,matchday
250,250,2018-09-18 16:55:00+00:00,Internazionale,Tottenham Hotspur,INT,TOT,g,2,1,2.100,1.05,1.400,0.882,1.722,0.899,B,
251,251,2018-09-18 16:55:00+00:00,Barcelona,PSV,BAR,PSV,g,4,0,3.483,0.00,2.034,0.669,2.524,0.400,B,
252,252,2018-09-18 19:00:00+00:00,Club Brugge,Borussia Dortmund,CBKV,DOR,g,0,1,0.000,1.05,0.511,0.946,0.833,1.726,A,
253,253,2018-09-18 19:00:00+00:00,Liverpool,Paris Saint-Germain,LIV,PSG,g,3,2,3.150,2.10,2.377,1.201,1.927,0.711,C,
254,254,2018-09-18 19:00:00+00:00,Schalke 04,FC Porto,SCH,POR,g,1,1,1.050,1.05,0.886,2.374,0.906,0.886,D,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
739,739,2022-04-26 19:00:00+00:00,Manchester City,Real Madrid,MNC,MAD,s,4,3,4.200,3.15,3.307,1.620,1.739,0.980,,
740,740,2022-04-27 19:00:00+00:00,Liverpool,Villarreal,LIV,VLR,s,2,0,2.100,0.00,1.843,0.090,2.481,0.186,,
741,741,2022-05-03 19:00:00+00:00,Villarreal,Liverpool,VLR,LIV,s,2,3,2.100,3.15,1.478,1.587,0.860,1.607,,
742,742,2022-05-04 19:00:00+00:00,Real Madrid,Manchester City,MAD,MNC,s,3,1,3.150,1.05,2.602,1.731,2.397,1.841,,


In [None]:
# Knockout rounds are coded in the 'round' column as 'k' (round of 16), 'q', 's', and 'f'
knockout_rounds = ['k', 'q', 's', 'f']
knockout_matches = matches[matches['round'].isin(knockout_rounds)]

# Map each round code to the name of its output column
round_order = {
    'k': 'ro16 reached',
    'q': 'qf reached',
    's': 'sf reached',
    'f': 'final reached'
}

# Create a list to store the results
results = []

# Iterate over each year
for year in range(2019, 2023):
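    # Filter matches for the current season (1 July to 30 June). Note that the
    # 2019-20 knockout rounds finished in August 2020, so with these cutoffs
    # those matches fall into the 2021 window.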
    season_start = pd.Timestamp(f'{year - 1}-07-01 00:00:00+00:00')
    season_end = pd.Timestamp(f'{year}-06-30 00:00:00+00:00')
    season_matches = filtered_matches[(filtered_matches['datetime'] >= season_start) & (filtered_matches['datetime'] <= season_end)]

    teams = set(season_matches['team1']).union(set(season_matches['team2']))

    for team in teams:
        team_data = {'team': team, 'year': year, 'ro16 reached': 0, 'qf reached': 0, 'sf reached': 0, 'final reached': 0, 'final won': 0}

        # Check if the team reached each round
        for round_name, column_name in round_order.items():
            if any((season_matches['team1'] == team) & (season_matches['round'] == round_name)) or \
               any((season_matches['team2'] == team) & (season_matches['round'] == round_name)):
                team_data[column_name] = 1

        # Check if the team won the final
        if any((season_matches['team1'] == team) & (season_matches['round'] == 'f') & (season_matches['score1'] > season_matches['score2'])) or \
           any((season_matches['team2'] == team) & (season_matches['round'] == 'f') & (season_matches['score2'] > season_matches['score1'])):
            team_data['final won'] = 1

        # Append the team's data to the results list
        results.append(team_data)

# Create a DataFrame from the results list
results_df = pd.DataFrame(results)

# Print the final DataFrame
results_df


Unnamed: 0,team,year,ro16 reached,qf reached,sf reached,final reached,final won
0,PSV,2019,0,0,0,0,0
1,Manchester City,2019,1,1,0,0,0
2,Ajax,2019,1,1,1,0,0
3,TSG Hoffenheim,2019,0,0,0,0,0
4,Lokomotiv Moscow,2019,0,0,0,0,0
...,...,...,...,...,...,...,...
125,Lille,2022,1,0,0,0,0
126,Malmo FF,2022,0,0,0,0,0
127,Dynamo Kiev,2022,0,0,0,0,0
128,Shakhtar Donetsk,2022,0,0,0,0,0
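
The nested loop above checks each team and round one at a time. As an alternative sketch, the same "round reached" flags for a single season can be computed with a cross-tabulation. `rounds_reached` is a hypothetical helper, the toy matches are invented, and the round codes are the ones used in `round_order`.

```python
import pandas as pd

ROUND_COLUMNS = {'k': 'ro16 reached', 'q': 'qf reached',
                 's': 'sf reached', 'f': 'final reached'}


def rounds_reached(season_matches: pd.DataFrame) -> pd.DataFrame:
    """Return one row per team with a 0/1 flag for each knockout round."""
    # Stack both sides so each row is one (team, round) appearance
    long = pd.concat([
        season_matches[['team1', 'round']].rename(columns={'team1': 'team'}),
        season_matches[['team2', 'round']].rename(columns={'team2': 'team'}),
    ], ignore_index=True)
    # Count appearances per team and round, then reduce counts to 0/1 flags
    flags = pd.crosstab(long['team'], long['round']).gt(0).astype(int)
    # Keep only knockout codes, adding all-zero columns for rounds nobody played
    flags = flags.reindex(columns=list(ROUND_COLUMNS), fill_value=0)
    return flags.rename(columns=ROUND_COLUMNS).reset_index()


toy_matches = pd.DataFrame({
    'team1': ['A', 'C', 'A', 'A'],
    'team2': ['B', 'D', 'C', 'D'],
    'round': ['g', 'g', 'k', 'q'],
})
toy_flags = rounds_reached(toy_matches)
print(toy_flags)
```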


In [None]:
filtered_results_df = results_df[results_df['ro16 reached'] == 1]
filtered_results_df

Unnamed: 0,team,year,ro16 reached,qf reached,sf reached,final reached,final won
1,Manchester City,2019,1,1,0,0,0
2,Ajax,2019,1,1,1,0,0
5,Liverpool,2019,1,1,1,1,1
9,Paris Saint-Germain,2019,1,0,0,0,0
10,Manchester United,2019,1,1,0,0,0
...,...,...,...,...,...,...,...
115,Villarreal,2022,1,1,1,0,0
120,FC Salzburg,2022,1,0,0,0,0
124,Benfica,2022,1,1,0,0,0
125,Lille,2022,1,0,0,0,0


In [None]:
wodupe = filtered_results_df.drop_duplicates(subset='team')

wodupe

Unnamed: 0,team,year,ro16 reached,qf reached,sf reached,final reached,final won
1,Manchester City,2019,1,1,0,0,0
2,Ajax,2019,1,1,1,0,0
5,Liverpool,2019,1,1,1,1,1
9,Paris Saint-Germain,2019,1,0,0,0,0
10,Manchester United,2019,1,1,0,0,0
11,Barcelona,2019,1,1,1,0,0
13,Juventus,2019,1,1,0,0,0
14,Schalke 04,2019,1,0,0,0,0
16,Real Madrid,2019,1,0,0,0,0
17,Lyon,2019,1,0,0,0,0
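
`drop_duplicates(subset='team')` keeps only the first row for each team, which in `filtered_results_df` is that team's earliest season. If the goal were instead one row per team summarising every season, a grouped maximum is one option. This is a sketch under that assumption, using toy rows in the same layout as the table above.

```python
import pandas as pd

round_cols = ['ro16 reached', 'qf reached', 'sf reached',
              'final reached', 'final won']

toy_results = pd.DataFrame({
    'team': ['Liverpool', 'Ajax', 'Liverpool'],
    'year': [2019, 2019, 2022],
    'ro16 reached': [1, 1, 1],
    'qf reached': [1, 1, 1],
    'sf reached': [1, 1, 1],
    'final reached': [1, 0, 1],
    'final won': [1, 0, 0],
})

# For each team, a flag is 1 if it was 1 in any of that team's seasons
best_per_team = toy_results.groupby('team')[round_cols].max().reset_index()
print(best_per_team)
```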


## **Data Processing**

In this section we:

1. Put the data into DataFrames
2. Remove unnecessary columns
3. Filter the data as needed
4. Fix formatting, such as dates and years
5. Apply any other modifications or merges that are needed
6. Handle missing data
7. Display the resulting data

Before you start the data analysis, choose how you want to modify the cleaned data for the problems you are trying to solve.

### Putting Net Value Data into one DataFrame and filtering

We have 5 different tables for all of the Net Values from different years, but we want to put it all into one table for further analysis.

In [None]:
# Add a year label to each of the datasets
twentythree['Year'] = "2023"
twentytwo['Year'] = "2022"
twentyone['Year'] = "2021"
twenty['Year'] = "2020"
nineteen['Year'] = "2019"

# List of DataFrames to concatenate
pds = [twentythree, twentytwo, twentyone, twenty, nineteen]
net_vals = pd.concat(pds, ignore_index=True)

net_vals

Now we are going to do some filtering. We will start by dropping columns that don't have data useful to us.

In [None]:
net_vals = net_vals.drop(net_vals.columns[[0, 1, 2, 4, 5, 6, 7, 9]], axis=1)
net_vals.rename(columns={'ø-Age': 'Market Value', 'Club.1' : 'Squad Size'}, inplace=True)
net_vals = net_vals.drop(0)

net_vals.head(5)

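Now we convert these market values to numeric values.

In [None]:
def clean_currency(column):
    # Parse strings such as '€1.26bn' or '€850.50m' into a number of euros
    multipliers = {'bn': 1e9, 'm': 1e6, 'k': 1e3}
    parts = column.astype(str).str.extract(r'([\d.]+)\s*(bn|m|k)?')
    return parts[0].astype(float) * parts[1].map(multipliers).fillna(1.0)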

net_vals['Market Value'] = clean_currency(net_vals['Market Value'])

#Normalize the data
scaler = MinMaxScaler()

net_vals['Market Value Normalized'] = scaler.fit_transform(net_vals[['Market Value']])

#Drop our market value column
net_vals.drop(columns = ['Market Value'], inplace = True)

net_vals.head(50)
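
`MinMaxScaler` rescales a column to the range [0, 1] using x' = (x - min) / (max - min). In the cell above, the minimum and maximum are taken over all five seasons pooled together. If values should instead be compared within each season, the same formula can be applied per `Year` group. The sketch below shows that variant on made-up numbers; it is an alternative to consider, not what the notebook does, and `min_max` is a hypothetical helper.

```python
import pandas as pd

toy_vals = pd.DataFrame({
    'Club': ['A', 'B', 'C', 'A', 'B', 'C'],
    'Year': ['2022', '2022', '2022', '2023', '2023', '2023'],
    'Market Value': [100.0, 300.0, 500.0, 200.0, 400.0, 1000.0],
})


def min_max(values: pd.Series) -> pd.Series:
    """Rescale a series to [0, 1]; a constant series maps to 0."""
    span = values.max() - values.min()
    if span == 0:
        return values * 0.0
    return (values - values.min()) / span


# Pooled scaling: all seasons share one minimum and maximum
toy_vals['Pooled'] = min_max(toy_vals['Market Value'])
# Per-season scaling: each Year gets its own minimum and maximum
toy_vals['Per Season'] = toy_vals.groupby('Year')['Market Value'].transform(min_max)
print(toy_vals)
```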

Next, we process the net transfer spend data from the different seasons, collapsing it into a single table for analysis.

In [None]:
# Add a year label to each of the datasets
nts23['Year'] = "2023"
nts22['Year'] = "2022"
nts21['Year'] = "2021"
nts20['Year'] = "2020"
nts19['Year'] = "2019"

# List of DataFrames to concatenate
pds = [nts23, nts22, nts21, nts20, nts19]
net_transfer_vals = pd.concat(pds, ignore_index=True)

net_transfer_vals

Here, we are going to drop some unnecessary columns so our data is more readable

In [None]:
net_transfer_vals = net_transfer_vals.drop(net_transfer_vals.columns[[0, 1, 4, 5, 6, 7, 8]], axis=1)

net_transfer_vals.head(5)

Here, we rename the columns to get a better idea of what exactly it is we are looking at

In [None]:
new_col_names = ['Club', 'Expenditures', 'Year']

net_transfer_vals.columns = new_col_names

net_transfer_vals.head(5)

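Now we convert these into numerical values

In [None]:
def clean_currency(column):
    # Parse strings such as '€85.50m' into a number of euros
    multipliers = {'bn': 1e9, 'm': 1e6, 'k': 1e3}
    parts = column.astype(str).str.extract(r'([\d.]+)\s*(bn|m|k)?')
    return parts[0].astype(float) * parts[1].map(multipliers).fillna(1.0)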

net_transfer_vals['Expenditures'] = clean_currency(net_transfer_vals['Expenditures'])

#Normalize the data
scaler = MinMaxScaler()

net_transfer_vals['Expenditures Normalized'] = scaler.fit_transform(net_transfer_vals[['Expenditures']])

#Drop our expenditures column
net_transfer_vals.drop(columns = ['Expenditures'], inplace = True)

net_transfer_vals.head(50)

Now we normalize and process the SPI data

In [None]:
spi = spi.drop(spi.columns[[0, 1, 3, 4, 5]], axis=1)

spi.rename(columns = {'name' : "Club"}, inplace = True)

#Normalize the spi
scaler = MinMaxScaler()

spi['SPI Normalized'] = scaler.fit_transform(spi[['spi']])

#Drop our spi column
spi.drop(columns = ['spi'], inplace = True)

spi.head(5)

Now we process our All Time Club Ranking Data

In [None]:
rank['Played per UCL'] = rank['Played'] / rank['Participated']
rank['Won per UCL'] = rank['Win'] / rank['Participated']
rank['Winrate'] = rank['Win'] / rank['Played']
rank['Lossrate'] = rank['Loss'] / rank['Played']
rank['GD per Game'] = rank['Goal Diff'] / rank['Played']
rank['Pts per UCL'] = rank['Pts'] / rank['Participated']
rank['Titles per UCL'] = rank['Titles'] / rank['Participated']

selected_columns = ['Club', 'Played per UCL', 'Won per UCL', 'Pts per UCL', 'Titles per UCL', 'Winrate', 'Lossrate', 'GD per Game']
new_rank = rank[selected_columns].copy()

# Replace any infinite values caused by a division by zero with NaN so they are dropped below
new_rank.replace([np.inf, -np.inf], np.nan, inplace=True)
new_rank.dropna(inplace=True)

new_rank.head(5)

Now we normalize the processed data

In [None]:
scaler = MinMaxScaler()
new_rank['Matches per UCL Normalized'] = scaler.fit_transform(new_rank[['Played per UCL']])
new_rank['Wins per UCL Normalized'] = scaler.fit_transform(new_rank[['Won per UCL']])
new_rank['Pts per UCL Normalized'] = scaler.fit_transform(new_rank[['Pts per UCL']])
new_rank['Titles per UCL Normalized'] = scaler.fit_transform(new_rank[['Titles per UCL']])
new_rank['Winrate Normalized'] = scaler.fit_transform(new_rank[['Winrate']])
new_rank['Lossrate Normalized'] = scaler.fit_transform(new_rank[['Lossrate']])
new_rank['GD per game Normalized'] = scaler.fit_transform(new_rank[['GD per Game']])

#Drop our un-normalized column
new_rank.drop(columns = ['Played per UCL', 'Won per UCL', 'Pts per UCL', 'Titles per UCL', 'Winrate', 'Lossrate', 'GD per Game'], inplace = True)

# Display the new DataFrame
new_rank.head(50)

Here we define a manual mapping so that club names are consistent across the DataFrames

In [None]:
standard_names = pd.DataFrame({
    'Standard Name': ['Real Madrid', 'Bayern Munich', 'Barcelona'],
    'Alias': ['Real Madrid CF', 'FC Bayern München', 'FC Barcelona']
})
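
The mapping table is defined above but still has to be applied to each DataFrame before merging. One way to do that, sketched here with a hypothetical `standardize_clubs` helper and toy rows, is to turn the table into an alias-to-name dictionary and pass it to `Series.replace`.

```python
import pandas as pd

# A two-entry mapping in the same layout as standard_names
names = pd.DataFrame({
    'Standard Name': ['Real Madrid', 'Bayern Munich'],
    'Alias': ['Real Madrid CF', 'FC Bayern München'],
})


def standardize_clubs(frame: pd.DataFrame, mapping: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of frame whose 'Club' aliases are replaced by standard names."""
    alias_to_standard = dict(zip(mapping['Alias'], mapping['Standard Name']))
    result = frame.copy()
    # Names without an entry in the mapping are left unchanged
    result['Club'] = result['Club'].replace(alias_to_standard)
    return result


toy_rank = pd.DataFrame({'Club': ['Real Madrid CF', 'FC Bayern München', 'Juventus']})
standardized = standardize_clubs(toy_rank, names)
print(standardized['Club'].tolist())
```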

Now, we will merge our data into a single dataframe

In [None]:
# Chain the merges so that each step keeps the columns added by the previous one
merged_data = pd.merge(net_vals, net_transfer_vals, on=['Club', 'Year'], how='left')
merged_data = pd.merge(merged_data, spi, on=['Club'], how='left')
merged_data = pd.merge(merged_data, new_rank, on=['Club'], how='left')

merged_data.head(50)
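
Because the merges are left joins on club names, any spelling mismatch silently produces missing values instead of an error. A small check with the `indicator=True` option of `pd.merge` can list the clubs that found no partner. The frames below are toy data, and `unmatched` is just an illustrative variable name.

```python
import pandas as pd

left = pd.DataFrame({'Club': ['Real Madrid', 'Bayern Munich', 'FC Barcelona'],
                     'Market Value Normalized': [1.0, 0.8, 0.7]})
right = pd.DataFrame({'Club': ['Real Madrid', 'Bayern Munich', 'Barcelona'],
                      'SPI Normalized': [0.9, 0.95, 0.93]})

# indicator=True adds a '_merge' column saying where each row came from
audit = pd.merge(left, right, on='Club', how='left', indicator=True)
unmatched = audit.loc[audit['_merge'] == 'left_only', 'Club'].tolist()
print(unmatched)
```

Any club listed in `unmatched` is a candidate for a new entry in the name mapping.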

## **Exploratory Analysis & Data Visualization**

In this section of the data science lifecycle, we graph the data to understand it better. We also perform statistical analyses to find mathematical evidence for any trends we discover. In other words, as the title indicates, we explore the data further.

Get Data we want to analyze

Visualize it in graphs etc
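
As a hedged starting point, one simple exploration is to compare a feature between teams that did and did not reach a given round, and to look at how strongly each feature correlates with that outcome. The numbers below are invented purely to show the mechanics; the column names follow the ones built earlier, but real values would come from `merged_data` joined with `results_df`.

```python
import pandas as pd

toy = pd.DataFrame({
    'Market Value Normalized': [0.9, 0.8, 0.4, 0.3, 0.7, 0.2],
    'SPI Normalized': [0.95, 0.85, 0.5, 0.4, 0.6, 0.3],
    'qf reached': [1, 1, 0, 0, 1, 0],
})

# Average feature value for teams that did (1) and did not (0) reach the quarter-finals
group_means = toy.groupby('qf reached').mean()
# Pearson correlation of each feature with the outcome flag
correlations = toy.corr()['qf reached'].drop('qf reached')

print(group_means)
print(correlations)
```

The full matrix from `toy.corr()` could also be passed to `sns.heatmap` for a visual overview.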



## **Model: Analysis, Hypothesis Testing, & ML**

During this phase of the data science lifecycle, we apply modeling techniques (such as linear regression, logistic regression, or decision trees) to obtain a predictive model of our data. This allows us to make predictions for cases outside the data we have collected. For example, we can train a decision tree on past seasons to predict how far a team will go in a future UEFA Champions League campaign.
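
Since the objective of this project is a decision tree, the modeling step could look roughly like the sketch below. It uses scikit-learn's `DecisionTreeClassifier` on randomly generated stand-in data, so the feature values, the label rule, and the hyperparameters (`max_depth=3`, a 25% test split) are all assumptions for illustration rather than results from this project.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n_teams = 200

# Stand-in features in [0, 1], mimicking the normalized columns built earlier
X = pd.DataFrame({
    'Market Value Normalized': rng.random(n_teams),
    'SPI Normalized': rng.random(n_teams),
    'Winrate Normalized': rng.random(n_teams),
})
# Synthetic label: stronger teams are more likely to reach the quarter-finals
strength = X.mean(axis=1) + rng.normal(0, 0.1, n_teams)
y = (strength > 0.55).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

# A shallow tree keeps the model readable and limits overfitting
model = DecisionTreeClassifier(max_depth=3, random_state=0)
model.fit(X_train, y_train)

predictions = model.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
importances = pd.Series(model.feature_importances_, index=X.columns)
print(f'accuracy: {accuracy:.2f}')
print(importances)
```

With real data, `X` would hold the merged feature columns and `y` one of the "round reached" flags; the accuracy printed here says nothing about how well the approach works on actual seasons.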


## **Interpretation: Insights Learned**

This is the part of the lifecycle where we attempt to utilize our data analysis to draw conclusions and potentially infer certain portions of our data.

Based on our observations throughout our analysis and modeling, we can safely say that:

1-

2-

Overall, this data and analysis can provide insights to fans, clubs, and analysts who follow the UEFA Champions League.
With further research, we would expand our dataset.
We hope that seeing a data science pipeline from data processing ➡ exploratory data analysis ➡ hypothesis testing ➡ ML and analysis has given you some insight into how you can leverage data.

To learn more about a given topic check the following links:
1. [Data processing](https://medium.com/better-programming/data-engineering-101-from-batch-processing-to-streaming-54f8c0da66fb)
2. [EDA](https://towardsdatascience.com/exploratory-data-analysis-eda-a-practical-guide-and-template-for-structured-data-abfbf3ee3bd9)
3. [Hypothesis testing intuition](https://towardsdatascience.com/hypothesis-testing-the-what-why-and-how-867d382b99ca)
4. [Hypothesis testing](https://towardsdatascience.com/hypothesis-testing-in-machine-learning-using-python-a0dc89e169ce)
5. [ML articles gold mine](https://medium.com/machine-learning-in-practice/over-200-of-the-best-machine-learning-nlp-and-python-tutorials-2018-edition-dd8cf53cb7dc)