# Predicting football match results (data collection)

In this project, I will build a model that tries to predict the outcome of a football match: home team win, away team win, or draw. I will use data on 20 seasons of the English Premier League, from 2001/2002 to 2020/2021. The source data include datasets from Kaggle as well as data scraped from other websites.
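
The three outcomes can be made concrete with a small labelling function. This is only a sketch: the name `match_outcome` and the codes `'H'`, `'A'`, and `'D'` are chosen for illustration and are not taken from the source datasets.

```python
def match_outcome(home_goals, away_goals):
    """Label a match as 'H' (home win), 'A' (away win), or 'D' (draw)."""
    if home_goals > away_goals:
        return 'H'
    if home_goals < away_goals:
        return 'A'
    return 'D'


# hypothetical scorelines, for illustration only
labels = [match_outcome(2, 1), match_outcome(0, 3), match_outcome(1, 1)]
print(labels)
```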

I will start by creating several dataframes that I will use to generate features. These will include:


1. *Matches* dataframe with the results of all matches from the 2001/2002 season onward and the starting lineups of the home and away teams. The data is taken from [here](https://www.kaggle.com/josephvm/english-premier-league-game-events-and-results).
2. *Final tables* dataframe with the final standings of each season from 2001/2002 onward, including the final ranking of teams, the numbers of games won, lost, and drawn, and goal difference. The data is taken from [here](https://www.kaggle.com/josephvm/english-premier-league-game-events-and-results).
3. *Players* dataframes with key attributes of players at Premier League teams, including ratings of their attacking, midfield, and defensive skills. The attribute ratings are created by the developer of the FIFA video games and are taken from [here](https://www.kaggle.com/justdhia/fifa-players) and [here](https://www.kaggle.com/cashncarry/fifa-22-complete-player-dataset).
4. *Teams* dataframe with ratings of Premier League clubs, also created by the developer of the FIFA video games. The data is taken from [here](https://www.fifaindex.com/).
5. *Managers* dataframes with information about managers of Premier League clubs and their ratings. The data about managers was scraped from [here](https://en.wikipedia.org/wiki/List_of_Premier_League_managers). The manager ratings are created by the owners of the Football World Ranking website and were scraped from [here](https://www.clubworldranking.com/ranking-coaches).

I will organize the project into three notebooks:

* data collection (current notebook);
* data cleaning, data analysis, and feature engineering;
* model building, training, validation, and outcome prediction.

In this notebook, I will scrape the data about football managers, which are not available online as a ready-to-use dataset.

In [1]:
# importing necessary libraries
import numpy as np  # provides np.nan, used by the extraction functions below
import pandas as pd
import requests
from bs4 import BeautifulSoup
from tqdm.notebook import tqdm

I will start by scraping the data about the names and tenures of managers of English Premier League clubs from Wikipedia.

In [2]:
# scraping data about football managers from Wikipedia
url = 'https://en.wikipedia.org/wiki/List_of_Premier_League_managers'
html_doc = requests.get(url).text

soup = BeautifulSoup(html_doc, 'html.parser')

In [3]:
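# creating managers dataframe
from io import StringIO

# take the first table on the page that has these CSS classes
managers_table = soup.find_all('table', class_='wikitable sortable plainrowheaders')[0]
# wrap the HTML in StringIO: recent pandas versions deprecate passing a literal HTML string
managers = pd.read_html(StringIO(str(managers_table)))[0]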

In [9]:
# saving to csv
managers.to_csv('managers.csv', index=False)

Next, I will scrape the rankings of football managers from https://www.clubworldranking.com/. The scraping code is adapted from [here](https://github.com/gonzaferreiro/Market_value_football_players/blob/master/Team_and_national_teams_ranking_scraps-Final.ipynb).

In [4]:
# function for extracting managers' names
def extract_managers(soup):
    # skip the first 'col-name' div (assumed to be the column header)
    cells = soup.find_all('div', attrs={'class': 'col-name'})[1:]
    return [cell.text.strip() for cell in cells]

In [5]:
# function for extracting managers' rankings
def extract_rankings(soup):
    rankings = []
    for each in soup.find_all('div', attrs={'class': 'points RankingRight'}):
        try:
            rankings.append(int(each.text.strip()))
        except ValueError:
            # non-numeric cell: keep the row but mark the value as missing
            rankings.append(np.nan)
    return rankings
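
The fallback in `extract_rankings` can be checked in isolation, without any network access. The sketch below assumes that a ranking cell holds plain integer text; `parse_points` is a hypothetical helper that is not part of the notebook, and it uses `math.nan` instead of `np.nan` only to stay dependency-free.

```python
import math


def parse_points(text):
    """Convert a scraped cell to an int, or NaN when it is not an integer."""
    try:
        return int(text.strip())
    except ValueError:
        return math.nan


# made-up cell contents: padded number, empty cell, placeholder text
samples = [' 9876 ', '', 'n/a']
parsed = [parse_points(s) for s in samples]
print(parsed)
```

Catching only `ValueError` keeps unrelated problems, such as a misspelled variable name, from being silently turned into missing values.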

In [6]:
# function to create dataframes with managers' rankings
def create_managers_df(week, year):
    results = {'Manager': [], 'Ranking': []}

    # request 40 result pages, stepping the index by 25
    for start in tqdm(range(0, 1000, 25)):
        url = f'https://www.clubworldranking.com/ranking-coaches?wd={week}&yr={year}&index={start}'
        # a timeout keeps a stalled request from hanging the whole loop
        r = requests.get(url, timeout=30)
        soup = BeautifulSoup(r.text, 'html.parser')
        results['Manager'] += extract_managers(soup)
        results['Ranking'] += extract_rankings(soup)

    return pd.DataFrame(results)
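
To see which pages `create_managers_df` requests for one snapshot, the URL construction can be pulled out and inspected offline. This is a sketch: `page_urls`, `page_size`, and `max_rows` are hypothetical names, and the page size of 25 rows is only inferred from the step of the loop.

```python
BASE_URL = 'https://www.clubworldranking.com/ranking-coaches'


def page_urls(week, year, page_size=25, max_rows=1000):
    """Build the list of paginated URLs for one (week, year) snapshot."""
    return [
        f'{BASE_URL}?wd={week}&yr={year}&index={start}'
        for start in range(0, max_rows, page_size)
    ]


urls = page_urls(49, 2011)
print(len(urls))   # number of pages requested per snapshot
print(urls[0])
print(urls[-1])
```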

In [7]:
# one ranking snapshot per year; the arguments are (week, year)
managers_2011 = create_managers_df(49, 2011)
managers_2012 = create_managers_df(20, 2012)
managers_2013 = create_managers_df(21, 2013)
managers_2014 = create_managers_df(20, 2014)
managers_2015 = create_managers_df(22, 2015)
managers_2016 = create_managers_df(21, 2016)
managers_2017 = create_managers_df(21, 2017)
managers_2018 = create_managers_df(20, 2018)
managers_2019 = create_managers_df(20, 2019)
managers_2020 = create_managers_df(14, 2020)

In [12]:
# saving to csv
managers_dict = {
    'managers_2011': managers_2011,
    'managers_2012': managers_2012,
    'managers_2013': managers_2013,
    'managers_2014': managers_2014,
    'managers_2015': managers_2015,
    'managers_2016': managers_2016,
    'managers_2017': managers_2017,
    'managers_2018': managers_2018,
    'managers_2019': managers_2019,
    'managers_2020': managers_2020    
}
    
for name, df in managers_dict.items():
    df.to_csv(name+'.csv', index=False)
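
The ten snapshot calls and the dictionary used for saving could also be generated from a single mapping of year to week. The sketch below is one possible arrangement, not the notebook's actual code: `SNAPSHOT_WEEKS`, `collect_snapshots`, and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for `create_managers_df` so that the example runs without network access.

```python
# year -> week number of the ranking snapshot, copied from the calls above
SNAPSHOT_WEEKS = {
    2011: 49, 2012: 20, 2013: 21, 2014: 20, 2015: 22,
    2016: 21, 2017: 21, 2018: 20, 2019: 20, 2020: 14,
}


def collect_snapshots(fetch, snapshot_weeks=SNAPSHOT_WEEKS):
    """Call fetch(week, year) once per snapshot and key the results by name."""
    return {
        f'managers_{year}': fetch(week, year)
        for year, week in snapshot_weeks.items()
    }


def fake_fetch(week, year):
    """Offline stand-in for create_managers_df (illustration only)."""
    return {'week': week, 'year': year}


snapshots = collect_snapshots(fake_fetch)
filenames = [f'{name}.csv' for name in snapshots]
print(filenames[0], filenames[-1])
```

Passing `create_managers_df` instead of `fake_fetch` would produce a dictionary of the same shape as `managers_dict`.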

I have scraped the necessary data about football managers and saved the resulting dataframes to several CSV files. I will use these files in the next notebook for data analysis and feature engineering.