# Predicting the Results of Soccer Games Using Recent Previous Results for Home Teams

### Importing my improved data (from the prototype), which covers the 2016/2017 season through the 2021/2022 season
#### It is a significantly larger dataset than the one I first used.

In [1]:
import pandas as pd
import numpy as np

#importing my data file
games = pd.read_csv("EPLMATCHES.csv", index_col=0)
games.head()

Div,Date,HomeTeam,AwayTeam,FTHG,FTAG,FTR,HTHG,HTAG,HTR,Referee,...,Unnamed: 95,Unnamed: 96,Unnamed: 97,Unnamed: 98,Unnamed: 99,Unnamed: 100,Unnamed: 101,Unnamed: 102,Unnamed: 103,Unnamed: 104
E0,13/08/16,Burnley,Swansea,0,1,A,0,0,D,J Moss,...,,,,,,,,,,
E0,13/08/16,Crystal Palace,West Brom,0,1,A,0,0,D,C Pawson,...,,,,,,,,,,
E0,13/08/16,Everton,Tottenham,1,1,D,1,0,H,M Atkinson,...,,,,,,,,,,
E0,13/08/16,Hull,Leicester,2,1,H,1,0,H,M Dean,...,,,,,,,,,,
E0,13/08/16,Man City,Sunderland,2,1,H,1,0,H,R Madley,...,,,,,,,,,,


### Checking how often each team appears in the dataset. Teams are relegated and promoted between seasons, so some teams appear less often than others across the seasons from 2016/2017 to 2021/2022.

In [2]:
games["HomeTeam"].value_counts()

Burnley             114
Tottenham           114
Everton             114
Man City            114
Southampton         114
Arsenal             114
Liverpool           114
Chelsea             114
Man United          114
Leicester           114
West Ham            114
Crystal Palace      114
Newcastle            95
Brighton             95
Watford              95
Bournemouth          76
Wolves               76
West Brom            57
Aston Villa          57
Swansea              38
Stoke                38
Huddersfield         38
Fulham               38
Norwich              38
Sheffield United     38
Leeds                38
Sunderland           19
Middlesbrough        19
Hull                 19
Cardiff              19
Brentford            19
Name: HomeTeam, dtype: int64

### Checking whether the feature columns have data types that my model can actually use.

In [3]:
games.dtypes

Date             object
HomeTeam         object
AwayTeam         object
FTHG              int64
FTAG              int64
                 ...   
Unnamed: 100    float64
Unnamed: 101    float64
Unnamed: 102    float64
Unnamed: 103    float64
Unnamed: 104    float64
Length: 104, dtype: object

### Converting the Date column from object to datetime so it can be used for date comparisons when splitting the data. When I check again it is type datetime64, which is exactly what I needed.

In [4]:
games["Date"] = pd.to_datetime(games["Date"])
games.dtypes


Date            datetime64[ns]
HomeTeam                object
AwayTeam                object
FTHG                     int64
FTAG                     int64
                     ...      
Unnamed: 100           float64
Unnamed: 101           float64
Unnamed: 102           float64
Unnamed: 103           float64
Unnamed: 104           float64
Length: 104, dtype: object
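
The Date strings in this file are day-first (the first rows read 13/08/16, meaning 13 August 2016), and a few rows further down (for example Wolves vs Arsenal dated 2022-10-02) look as if they were read month-first. A minimal sketch of an explicit day-first parse is shown here. The sample values are made up for illustration, and the `%d/%m/%y` format is an assumption: it only holds if every row uses two-digit years, so a file that mixes year styles would need a different approach.

```python
import pandas as pd

# Illustrative day-first strings in the same dd/mm/yy style as the Date column.
sample = pd.Series(["13/08/16", "10/02/22", "03/05/22"])

# An explicit format removes any ambiguity between day and month.
parsed = pd.to_datetime(sample, format="%d/%m/%y")
print(parsed.dt.strftime("%Y-%m-%d").tolist())
```

With an explicit format, a value such as 10/02/22 can only be read as 10 February 2022, which matters here because both the rolling averages and the train/test split depend on the date order.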

### Here I start creating basic predictors to test my model with, and choosing the models I will use
### To begin, I encoded the home and away teams as integer ids, defined my classification target, and put them into my Random Forest as a first test. The model classifies whether the home team wins or doesn't win (which includes a tie).
### I split the training and testing sets by date and chose a date right around the 70/30 mark. The training set is every game before 11/21/2020 and the testing set is every game after it. Because both comparisons are strict, any games played on 11/21/2020 itself fall in neither set.

In [5]:
games["away_team_id"] = games["AwayTeam"].astype("category").cat.codes
games["home_team_id"] = games["HomeTeam"].astype("category").cat.codes
games["target"] = (games["FTR"] == "H")
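
As a small illustration of what `astype("category").cat.codes` produces, the sketch below encodes a made-up list of team names. The names and the variable `teams` are only for demonstration. Categories are sorted alphabetically, so the ids are alphabetical ranks rather than anything meaningful about team strength.

```python
import pandas as pd

# Hypothetical team names, just to show how the codes are assigned.
teams = pd.Series(["Burnley", "Arsenal", "Wolves", "Arsenal"])

# Each distinct name becomes an integer id, in alphabetical order of the names.
codes = teams.astype("category").cat.codes
print(dict(zip(teams, codes)))
```

Because the home and away columns are encoded separately, the ids only line up between the two columns when both contain the same set of teams. Tree-based models can split on these ids freely, while distance-based models such as KNN may treat the alphabetical order as if it were a real scale.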

In [6]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=50, min_samples_split=10, random_state = 1)
train = games[games["Date"] < '2020-11-21']
test = games[games["Date"] > '2020-11-21']
predictors = ["away_team_id", "home_team_id"]
rf.fit(train[predictors], train["target"])
preds = rf.predict(test[predictors])

from sklearn.metrics import accuracy_score

## Here I test the accuracy of the Random Forest using sklearn's accuracy_score. Using only the two predictors, home team and away team, it returned an accuracy of 61%. The breakdown is shown in the table below.

In [7]:
accuracy_score(test["target"], preds)

0.612094395280236

In [8]:
table = pd.DataFrame(dict(actual = test["target"], prediction = preds))
pd.crosstab(index = table["actual"], columns = table["prediction"])

actual,predicted False,predicted True
False,264,136
True,127,151


## My next task is to create new predictors: rolling averages for each home team of stats like Home Team Goals, Shots, Shots on Target, Corners, Fouls, Yellow Cards, and Red Cards.

## I started by grouping matches by home team and sorting each group by date, so that each rolling average pulls from that team's previous 8 home games (the current game is excluded). The first 8 home games for each team were removed because they didn't have 8 earlier games to average over, so they were treated as invalid.

## Then I chose the columns for which I produced rolling averages and created a new dataset with them. I cleaned up the new dataset by removing the group-level index and making sure there is a unique index for each match listed in my dataset.

In [9]:
grouped_games = games.groupby("HomeTeam")

def rolling_averages(curgroup, cols, roll_cols):
    curgroup = curgroup.sort_values("Date")
    rolling_stats = curgroup[cols].rolling(8, closed='left').mean()
    curgroup[roll_cols] = rolling_stats
    curgroup = curgroup.dropna(subset= roll_cols)
    return curgroup

## These are the specific stats being used for the rolling averages of the home side
cols = ["FTHG", "HS", "HST", "HC","HF", "HY", "HR"]
roll_cols = [f"{c}_rolling" for c in cols]
roll_cols

['FTHG_rolling',
 'HS_rolling',
 'HST_rolling',
 'HC_rolling',
 'HF_rolling',
 'HY_rolling',
 'HR_rolling']
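
To make the effect of `closed='left'` concrete, here is a toy sketch on a made-up goals series. It uses a window of 3 instead of 8 so the output stays readable. The names `goals` and `prev_mean` are only for this illustration.

```python
import pandas as pd

# Made-up goals in five consecutive home games.
goals = pd.Series([1, 2, 3, 4, 5])

# closed="left" leaves the current game out of its own window,
# so each value is the mean of the 3 games before it.
prev_mean = goals.rolling(3, closed="left").mean()
print(prev_mean.tolist())
```

The first three entries are NaN because fewer than three earlier games exist, which is the same reason the first 8 home games of each team are dropped by `dropna`. Excluding the current game keeps the match result itself out of the predictors.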

### Cleaning up the new dataset "games_rolling" with rolling averages

In [10]:
games_rolling = games.groupby("HomeTeam").apply(lambda x: rolling_averages(x, cols, roll_cols))

In [11]:
games_rolling

HomeTeam,Div,Date,HomeTeam,AwayTeam,FTHG,FTAG,FTR,HTHG,HTAG,HTR,Referee,...,away_team_id,home_team_id,target,FTHG_rolling,HS_rolling,HST_rolling,HC_rolling,HF_rolling,HY_rolling,HR_rolling
Arsenal,E0,2016-12-26,Arsenal,West Brom,1,0,H,0,0,D,N Swarbrick,...,28,0,True,2.250,14.625,4.250,6.250,9.625,1.250,0.125
Arsenal,E0,2017-01-01,Arsenal,Crystal Palace,2,0,H,1,0,H,A Marriner,...,8,0,True,2.250,16.000,5.375,6.125,9.625,1.375,0.125
Arsenal,E0,2017-01-10,Arsenal,Brighton,2,0,H,1,0,H,K Friend,...,4,0,True,2.125,17.625,5.625,6.500,9.125,1.125,0.125
Arsenal,E0,2017-01-22,Arsenal,Burnley,2,1,H,0,0,D,J Moss,...,5,0,True,2.000,19.000,6.500,6.125,8.875,1.125,0.125
Arsenal,E0,2017-01-31,Arsenal,Watford,1,2,A,0,2,A,A Marriner,...,27,0,False,2.000,19.875,7.250,6.625,9.125,1.000,0.250
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Wolves,E0,2022-05-03,Wolves,Crystal Palace,0,2,A,0,2,A,A Madley,...,8,30,False,1.375,10.875,3.875,4.875,10.000,2.000,0.125
Wolves,E0,2022-05-15,Wolves,Norwich,1,1,D,0,1,A,T Harrington,...,20,30,False,1.250,10.500,3.875,4.875,10.875,1.875,0.125
Wolves,E0,2022-10-02,Wolves,Arsenal,0,1,A,0,1,A,M Oliver,...,0,30,False,1.250,11.125,4.000,4.750,10.750,1.625,0.125
Wolves,E0,2022-10-03,Wolves,Watford,4,0,H,2,0,H,D England,...,27,30,True,1.250,12.500,4.375,5.125,11.125,1.875,0.125


In [12]:
games_rolling =  games_rolling.droplevel('HomeTeam')


In [13]:
games_rolling.index = range(games_rolling.shape[0])

In [14]:
games_rolling

,Date,HomeTeam,AwayTeam,FTHG,FTAG,FTR,HTHG,HTAG,HTR,Referee,...,away_team_id,home_team_id,target,FTHG_rolling,HS_rolling,HST_rolling,HC_rolling,HF_rolling,HY_rolling,HR_rolling
0,2016-12-26,Arsenal,West Brom,1,0,H,0,0,D,N Swarbrick,...,28,0,True,2.250,14.625,4.250,6.250,9.625,1.250,0.125
1,2017-01-01,Arsenal,Crystal Palace,2,0,H,1,0,H,A Marriner,...,8,0,True,2.250,16.000,5.375,6.125,9.625,1.375,0.125
2,2017-01-10,Arsenal,Brighton,2,0,H,1,0,H,K Friend,...,4,0,True,2.125,17.625,5.625,6.500,9.125,1.125,0.125
3,2017-01-22,Arsenal,Burnley,2,1,H,0,0,D,J Moss,...,5,0,True,2.000,19.000,6.500,6.125,8.875,1.125,0.125
4,2017-01-31,Arsenal,Watford,1,2,A,0,2,A,A Marriner,...,27,0,False,2.000,19.875,7.250,6.625,9.125,1.000,0.250
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2027,2022-05-03,Wolves,Crystal Palace,0,2,A,0,2,A,A Madley,...,8,30,False,1.375,10.875,3.875,4.875,10.000,2.000,0.125
2028,2022-05-15,Wolves,Norwich,1,1,D,0,1,A,T Harrington,...,20,30,False,1.250,10.500,3.875,4.875,10.875,1.875,0.125
2029,2022-10-02,Wolves,Arsenal,0,1,A,0,1,A,M Oliver,...,0,30,False,1.250,11.125,4.000,4.750,10.750,1.625,0.125
2030,2022-10-03,Wolves,Watford,4,0,H,2,0,H,D England,...,27,30,True,1.250,12.500,4.375,5.125,11.125,1.875,0.125


## Next I test all of my models on the new dataset that has the rolling averages: a Random Forest, a Support Vector Machine, a Decision Tree, and a K-Nearest Neighbors model.

### First is the Random Forest

In [15]:
def randomForestPredictor(data, predictors):
    # Same date-based split as before: train on earlier games, test on later ones
    train = data[data["Date"] < '2020-11-21']
    test = data[data["Date"] > '2020-11-21']
    rf.fit(train[predictors], train["target"])
    preds = rf.predict(test[predictors])
    # Fraction of test games that were classified correctly
    accuracy = accuracy_score(test["target"], preds)
    print(f'Test accuracy: {accuracy:.3f}')
    return accuracy

In [16]:
rfAcc = randomForestPredictor(games_rolling, predictors + roll_cols)
rfAcc

Test accuracy: 0.627


0.6266866566716641

### Random Forest Accuracy: 62.7%

## Next is the SVM...

In [17]:
from sklearn import svm

svmModel = svm.SVC(kernel='linear')

def SVM_predictor(data, predictors):
    train = data[data["Date"] < '2020-11-21']
    test = data[data["Date"] > '2020-11-21']
    svmModel.fit(train[predictors], train["target"])
    preds = svmModel.predict(test[predictors])

    # Evaluate the model's performance on the held-out games
    acc = accuracy_score(test["target"], preds)
    print(f'Test accuracy: {acc:.3f}')
    return acc


SVM_acc = SVM_predictor(games_rolling, predictors + roll_cols)
SVM_acc

Test accuracy: 0.628


0.6281859070464768

### SVM Accuracy: 62.8%

## Next is the Decision Tree...

In [18]:
from sklearn.tree import DecisionTreeClassifier

# No random_state is set, so the accuracy can vary slightly between runs
dt = DecisionTreeClassifier()

# Specifically for the Decision Tree
def DTree_Predictor(data, predictors):
    train = data[data["Date"] < '2020-11-21']
    test = data[data["Date"] > '2020-11-21']
    dt.fit(train[predictors], train["target"])
    preds = dt.predict(test[predictors])

    # Evaluate the model's performance on the held-out games
    acc = accuracy_score(test["target"], preds)
    print(f'Test accuracy: {acc:.3f}')
    return acc


DT_acc = DTree_Predictor(games_rolling, predictors + roll_cols)
DT_acc

Test accuracy: 0.538


0.5382308845577212

### Decision Tree Accuracy: 53.8%

## And last is k-nearest Neighbors

In [19]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=5)

# Specifically for KNN
def knn_predictor(data, predictors):
    train = data[data["Date"] < '2020-11-21']
    test = data[data["Date"] > '2020-11-21']
    knn.fit(train[predictors], train["target"])
    preds = knn.predict(test[predictors])

    # Evaluate the model's performance on the held-out games
    acc = accuracy_score(test["target"], preds)
    print(f'Test accuracy: {acc:.3f}')
    return acc


knn_acc = knn_predictor(games_rolling, predictors + roll_cols)
knn_acc

Test accuracy: 0.597


0.5967016491754122

### K-Nearest Neighbors Accuracy: 59.7% 
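
The four predictor functions above share the same split, fit, predict, and score steps, so they could be folded into one helper that takes the model as an argument. The sketch below is one possible way to do that, not the code that produced the numbers above. `evaluate_model` and the `demo` frame are hypothetical names, and `demo` is random synthetic data standing in for `games_rolling`.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier


def evaluate_model(model, data, predictors, split_date="2020-11-21"):
    """Fit on games before split_date and return accuracy on games after it."""
    train = data[data["Date"] < split_date]
    test = data[data["Date"] > split_date]
    model.fit(train[predictors], train["target"])
    preds = model.predict(test[predictors])
    return accuracy_score(test["target"], preds)


# Synthetic stand-in for games_rolling (random values, for illustration only).
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    "Date": pd.date_range("2020-01-01", periods=n, freq="3D"),
    "home_team_id": rng.integers(0, 5, n),
    "FTHG_rolling": rng.random(n) * 3,
})
demo["target"] = demo["FTHG_rolling"] > 1.5

models = {
    "Random Forest": RandomForestClassifier(
        n_estimators=50, min_samples_split=10, random_state=1
    ),
    "KNN": KNeighborsClassifier(n_neighbors=5),
}
scores = {
    name: evaluate_model(model, demo, ["home_team_id", "FTHG_rolling"])
    for name, model in models.items()
}
for name, acc in scores.items():
    print(f"{name}: {acc:.3f}")
```

Keeping the split in one place means every model is scored on exactly the same games, and a change to the split date only has to be made once.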

# Final Write-Up

# Intro
## In this project I used machine learning algorithms to predict the results of home teams' soccer games in the English Premier League from the 2016/2017 season through the 2021/2022 season. My goal was to see whether rolling averages of specific statistics would improve the predictions. More specifically, I used rolling averages over each team's last eight home games instead of the five I originally intended, because I found that eight produced slightly better results. I used four different models in the process: Random Forest, SVM, Decision Tree, and K-Nearest Neighbors. I wanted to see whether some performed significantly better than others, which I discuss in the results.

## The statistics I used for the rolling averages were: Home Team Goals, Shots, Shots on Target, Corners, Fouls, Yellow Cards, and Red Cards. These were stats provided for the home team in the dataset, so I used them all.

# Results

## Of the four models I used, I found that the Support Vector Machine and the Random Forest performed better than the Decision Tree and the KNN model. The SVM and Random Forest had accuracies of 62.8% and 62.7% respectively, which I was impressed with, considering how hard it is to predict soccer games even with advanced statistics. The KNN model performed slightly worse at 59.7%, and the Decision Tree was the weakest at 53.8%. Without the rolling averages, my Random Forest performed nearly as well (61.2%), using only the home and away teams as predictors.


# Conclusions:
## Overall, I was pleased with the performance of the models and think that they could be improved by using a dataset with more advanced statistics. I believe a dataset with more statistics would also help me determine which predictors are more important than others. The dataset I used wasn't bad, but it was lacking some advanced predictors, which could be extremely helpful in predicting a winner. I was excited to see the models perform relatively well, though, and I hope to keep experimenting with this in the future because it is something I really enjoy.
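
One way to start judging which predictors matter is the `feature_importances_` attribute of a fitted Random Forest. The sketch below uses made-up data in which one column drives the target and the other is noise, so the column names are borrowed from the notebook but the values and the resulting ranking are purely illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: one informative column and one noise column.
rng = np.random.default_rng(1)
n = 500
X = pd.DataFrame({
    "FTHG_rolling": rng.random(n) * 3,  # drives the target in this toy setup
    "HY_rolling": rng.random(n) * 3,    # pure noise in this toy setup
})
y = X["FTHG_rolling"] > 1.5

forest = RandomForestClassifier(n_estimators=50, min_samples_split=10, random_state=1)
forest.fit(X, y)

# Impurity-based importances; they sum to 1 across all predictors.
importances = pd.Series(forest.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))
```

On the real data the same two lines after `fit` would rank the team ids and the rolling columns. These impurity-based scores can favor columns with many distinct values, so they are best read as a rough guide rather than a final answer.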

# Suggestions/Future Work:
## Machine learning is already being used in soccer and will continue to be. I know many data scientists use machine learning to create advanced statistics which are used to predict matches, league winners, and tournament winners. It is cool how it is being used in all sports, and I think that in the future it can be used within teams as well to predict potential injuries, which can be helpful for rotations and coaching. There are many possible applications for it, and I hope I will be engaged in something like that in the future, considering how much I love soccer.


# References:
### I got some inspiration for this project from this article, whose author used machine learning to make predictions for games in a Brazilian league.

### https://towardsdatascience.com/machine-learning-algorithms-for-football-prediction-using-statistics-from-brazilian-championship-51b7d4ea0bc8
