# Machine learning model LaQuiniela

In this notebook we develop a machine learning model to predict the results of LaLiga matches: (1) home win, (2) visitor win, (X) tie.

First, we import the libraries we are going to use.
The `sqlite3` library reads `.sqlite` files, in particular `laliga.sqlite`, which holds the match data, and `clasification.sqlite`, where we have stored the data from exercise 10.

The `sklearn` library provides the machine learning model. The variable to be predicted is discrete, so we need a classification model; we have selected the `RandomForestClassifier`. The other functions imported from this library help us measure the performance of our model.

In [45]:
import pandas as pd
import numpy as np
import sqlite3
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.preprocessing import LabelEncoder

The following variables indicate which seasons are used for training and which matches will be predicted.

In [41]:
# to train
seasons_train = "2000:2010"
# to predict
division = 1
season = "2019-2020"
matchday = 6

Inside the main folder (la-quiniela-main) there is a file named `clasification.sqlite`, which contains the data from exercise 10: the classification of every team for every season, division, matchday, etc.
The file `laliga.sqlite`, which contains the match data, is also there.

In [3]:
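# The `with` block only manages the transaction, so the connection is closed explicitly afterwards.
with sqlite3.connect("../laliga.sqlite") as conn:
    matches = pd.read_sql("SELECT * FROM Matches", conn)
conn.close()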
matches

Unnamed: 0,season,division,matchday,date,time,home_team,away_team,score
0,1928-1929,1,1,2/10/29,,Arenas Club,Athletic Madrid,2:3
1,1928-1929,1,1,2/10/29,,Espanyol,Real Unión,3:2
2,1928-1929,1,1,2/10/29,,Real Madrid,Catalunya,5:0
3,1928-1929,1,1,2/10/29,,Donostia,Athletic,1:1
4,1928-1929,1,1,2/12/29,,Racing,Barcelona,0:2
...,...,...,...,...,...,...,...,...
48775,2021-2022,2,42,5/29/22,,Real Oviedo,UD Ibiza,
48776,2021-2022,2,42,5/29/22,,Real Sociedad B,Real Zaragoza,
48777,2021-2022,2,42,5/29/22,,Sporting Gijón,UD Las Palmas,
48778,2021-2022,2,42,5/29/22,,CD Tenerife,FC Cartagena,


In [4]:
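# The `with` block only manages the transaction, so the connection is closed explicitly afterwards.
with sqlite3.connect("../clasification.sqlite") as conn:
    classification = pd.read_sql("SELECT * FROM clasification", conn)
conn.close()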
classification

Unnamed: 0,season,division,rank,team,matchday,GF,GA,GD,W,L,T,Pts
0,2021-2022,1,1.0,Real Madrid,1,4.0,1.0,3.0,1.0,0.0,0.0,3.0
1,2021-2022,1,2.0,Sevilla FC,1,3.0,0.0,3.0,1.0,0.0,0.0,3.0
2,2021-2022,1,3.0,Barcelona,1,4.0,2.0,2.0,1.0,0.0,0.0,3.0
3,2021-2022,1,4.0,Atlético Madrid,1,2.0,1.0,1.0,1.0,0.0,0.0,3.0
4,2021-2022,1,5.0,Valencia,1,1.0,0.0,1.0,1.0,0.0,0.0,3.0
...,...,...,...,...,...,...,...,...,...,...,...,...
97549,1928-1929,1,6.0,Athletic Madrid,18,43.0,41.0,2.0,8.0,8.0,2.0,26.0
97550,1928-1929,1,7.0,Espanyol,18,32.0,38.0,-6.0,7.0,7.0,4.0,25.0
97551,1928-1929,1,8.0,Catalunya,18,45.0,49.0,-4.0,6.0,8.0,4.0,22.0
97552,1928-1929,1,9.0,Real Unión,18,40.0,42.0,-2.0,5.0,11.0,2.0,17.0


One of the most important steps is cleaning the data to obtain two dataframes with all the relevant data: one for training and one for testing.

We begin by adding a new column to `matches` with the result of each match: (1) home team, (X) tie or (2) visitor. If the game hasn't been played yet, the result is `None`, and the match is then dropped.

In [46]:
# Split "home:away" into numeric goal columns so the comparison is numeric,
# not lexicographic (as strings, "10" sorts before "2").
goals = matches["score"].str.split(":", expand=True).apply(pd.to_numeric, errors="coerce")

matches["result"] = None
matches.loc[goals[0] > goals[1], "result"] = '1'
matches.loc[goals[0] == goals[1], "result"] = 'X'
matches.loc[goals[0] < goals[1], "result"] = '2'

# Drop the matches that have not been played yet (no score)
matches.dropna(subset=["score"], inplace=True)
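The labelling rule can be checked on individual score strings. The sketch below uses a hypothetical helper, `match_result`, which is not part of the notebook. It converts both sides of the score to integers before comparing them, because comparing the raw strings is lexicographic and would treat `"10"` as smaller than `"2"`.

```python
from typing import Optional


def match_result(score: Optional[str]) -> Optional[str]:
    """Return '1', 'X' or '2' for a 'home:away' score, or None if unplayed."""
    if not score:
        return None
    # Convert to integers so that 10 goals compare as more than 2 goals.
    home, away = (int(goals) for goals in score.split(":"))
    if home > away:
        return "1"
    if home < away:
        return "2"
    return "X"


print(match_result("2:3"))   # visitor win
print(match_result("1:1"))   # tie
print(match_result("10:2"))  # home win, even though "10" < "2" as strings
print(match_result(None))    # match not played yet
```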

In this example, `seasons_train = "2000:2010"`, which means we train on the seasons from `2000-2001` to `2009-2010`.

In [6]:
# Classification dataframe restricted to the training seasons
classification_train = classification

classification_train["season"] = classification_train["season"].astype(str)
classification_train = classification_train.loc[(classification_train["season"].str.split("-").str[0] >= seasons_train.split(":")[0]) &
                    (classification_train["season"].str.split("-").str[1] <= seasons_train.split(":")[1])]


In [7]:
# Matches dataframe restricted to the training seasons
matches_train = matches

matches_train.loc[:,"season"] = matches_train["season"].astype(str)
matches_train = matches_train.loc[(matches_train["season"].str.split("-").str[0] >= seasons_train.split(":")[0]) &
                    (matches_train["season"].str.split("-").str[1] <= seasons_train.split(":")[1])]
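The season filter keeps a season when its first year is not before the start of the range and its second year is not after the end. A minimal sketch of that rule on single values, assuming a hypothetical helper `season_in_range` (not defined in the notebook) and comparing years as integers rather than strings:

```python
def season_in_range(season: str, seasons_train: str) -> bool:
    """Check whether a 'YYYY-YYYY' season lies inside a 'YYYY:YYYY' range."""
    first_year, last_year = (int(year) for year in seasons_train.split(":"))
    start, end = (int(year) for year in season.split("-"))
    return start >= first_year and end <= last_year


for season in ["1999-2000", "2000-2001", "2009-2010", "2010-2011"]:
    print(season, season_in_range(season, "2000:2010"))
```

With `"2000:2010"` the first and last seasons kept are `2000-2001` and `2009-2010`, as stated in the text.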

In [48]:
def merge_and_clean_home(matches, classification):
    df_train = pd.merge(matches.copy(), classification.copy(), 
                             left_on=['home_team', "season", "division", "matchday"],
                             right_on=['team', "season", "division", "matchday"])
    df_train.dropna(subset=["home_team"], inplace=True)
    df_train.drop("team", axis=1, inplace=True)

    return df_train

In [49]:
def merge_and_clean_visitor(df_train, classification):
    df_train = pd.merge(df_train.copy(), classification.copy(), 
                             left_on=['away_team', "season", "division", "matchday"],
                             right_on=['team', "season", "division", "matchday"],
                             suffixes=("_home", "_away"))

    df_train.dropna(subset=["away_team"], inplace=True)
    df_train.drop("team", axis=1, inplace=True)

    return df_train
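To see what the two merges produce, the sketch below runs the same pattern on a single match. The toy dataframes `toy_matches` and `toy_table` are illustrative names; their values are taken from the matchday 6 data of season 2019-2020, and only the `rank` and `Pts` columns are kept for brevity.

```python
import pandas as pd

keys = ["season", "division", "matchday"]

toy_matches = pd.DataFrame({
    "season": ["2019-2020"], "division": [1], "matchday": [6],
    "home_team": ["Real Madrid"], "away_team": ["CA Osasuna"], "result": ["1"],
})
toy_table = pd.DataFrame({
    "season": ["2019-2020"] * 2, "division": [1, 1], "matchday": [6, 6],
    "team": ["Real Madrid", "CA Osasuna"], "rank": [1, 12], "Pts": [14, 7],
})

# First merge: attach the home team's standings (columns keep their plain names).
home = toy_matches.merge(
    toy_table, left_on=["home_team"] + keys, right_on=["team"] + keys
).drop(columns="team")

# Second merge: attach the away team's standings. The columns present on both
# sides (rank, Pts) now receive the suffixes "_home" and "_away".
both = home.merge(
    toy_table, left_on=["away_team"] + keys, right_on=["team"] + keys,
    suffixes=("_home", "_away"),
).drop(columns="team")

print(both[["home_team", "away_team", "rank_home", "rank_away", "Pts_home", "Pts_away"]])
```

The suffixes only appear in the second merge because that is the first time the standings columns exist on both sides of the join.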

In [10]:
df_train = merge_and_clean_home(matches_train, classification_train)
df_train = merge_and_clean_visitor(df_train, classification_train)
df_train

Unnamed: 0,season,division,matchday,date,time,home_team,away_team,score,result,rank_home,...,T_home,Pts_home,rank_away,GF_away,GA_away,GD_away,W_away,L_away,T_away,Pts_away
0,2000-2001,1,1,9/9/00,8:15 PM,Real Sociedad,Racing,2:2,X,10.0,...,1.0,1.0,9.0,2.0,2.0,0.0,0.0,0.0,1.0,1.0
1,2000-2001,1,1,9/9/00,9:00 PM,Real Zaragoza,Espanyol,1:2,2,14.0,...,0.0,0.0,6.0,2.0,1.0,1.0,1.0,0.0,0.0,3.0
2,2000-2001,1,1,9/9/00,9:00 PM,Barcelona,Málaga CF,2:1,1,5.0,...,0.0,3.0,13.0,1.0,2.0,-1.0,0.0,1.0,0.0,0.0
3,2000-2001,1,1,9/9/00,9:00 PM,Dep. La Coruña,Athletic,2:0,1,4.0,...,0.0,3.0,17.0,0.0,2.0,-2.0,0.0,1.0,0.0,0.0
4,2000-2001,1,1,9/9/00,9:00 PM,Real Madrid,Valencia,2:1,1,7.0,...,0.0,3.0,15.0,1.0,2.0,-1.0,0.0,1.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8415,2009-2010,2,42,6/19/10,6:00 PM,Cádiz CF,CD Numancia,4:2,1,20.0,...,14.0,50.0,8.0,55.0,53.0,2.0,16.0,15.0,11.0,59.0
8416,2009-2010,2,42,6/19/10,6:00 PM,Celta de Vigo,SD Huesca,0:1,2,14.0,...,13.0,52.0,13.0,36.0,40.0,-4.0,12.0,14.0,16.0,52.0
8417,2009-2010,2,42,6/19/10,6:00 PM,Elche CF,Real Sociedad,4:1,1,6.0,...,9.0,63.0,1.0,53.0,37.0,16.0,20.0,8.0,14.0,74.0
8418,2009-2010,2,42,6/19/10,6:00 PM,UD Las Palmas,Gimnàstic,1:0,1,17.0,...,15.0,51.0,18.0,42.0,55.0,-13.0,14.0,19.0,9.0,51.0


In [11]:
features = ['rank_home', 'GD_home', "W_home", 
            "Pts_home", 'rank_away', 'GD_away', 
            "W_away", "Pts_away"]
target = "result"

In [12]:
X_train = df_train[features]  
y_train = df_train[target]  

In [13]:
def select_data(df, season, division, matchday):
    df_test = df[(df["season"] == season) & 
                (df["division"] == division) & 
                (df["matchday"] == matchday)]
    return df_test

In [14]:
classification_test = select_data(classification, season, division, matchday)
matches_test = select_data(matches, season, division, matchday)

In [15]:
df_test = merge_and_clean_home(matches_test,classification_test)
df_test = merge_and_clean_visitor(df_test, classification_test)
df_test

Unnamed: 0,season,division,matchday,date,time,home_team,away_team,score,result,rank_home,...,T_home,Pts_home,rank_away,GF_away,GA_away,GD_away,W_away,L_away,T_away,Pts_away
0,2019-2020,1,6,9/24/19,7:00 PM,Real Valladolid,Granada CF,1:1,X,14.0,...,3.0,6.0,5.0,12.0,6.0,6.0,3.0,1.0,2.0,11.0
1,2019-2020,1,6,9/24/19,8:00 PM,Real Betis,Levante,3:1,1,9.0,...,2.0,8.0,11.0,7.0,8.0,-1.0,2.0,3.0,1.0,7.0
2,2019-2020,1,6,9/24/19,9:00 PM,Barcelona,Villarreal,2:1,1,6.0,...,1.0,10.0,8.0,13.0,10.0,3.0,2.0,2.0,2.0,8.0
3,2019-2020,1,6,9/25/19,7:00 PM,CD Leganés,Athletic,1:1,X,20.0,...,2.0,2.0,4.0,7.0,2.0,5.0,3.0,0.0,3.0,12.0
4,2019-2020,1,6,9/25/19,7:00 PM,RCD Mallorca,Atlético Madrid,0:2,2,19.0,...,1.0,4.0,3.0,7.0,4.0,3.0,4.0,1.0,1.0,13.0
5,2019-2020,1,6,9/25/19,8:00 PM,Valencia,Getafe,3:3,X,13.0,...,3.0,6.0,10.0,10.0,9.0,1.0,1.0,1.0,4.0,7.0
6,2019-2020,1,6,9/25/19,9:00 PM,Real Madrid,CA Osasuna,2:0,1,1.0,...,2.0,14.0,12.0,4.0,5.0,-1.0,1.0,1.0,4.0,7.0
7,2019-2020,1,6,9/26/19,7:00 PM,SD Eibar,Sevilla FC,3:2,1,16.0,...,2.0,5.0,7.0,7.0,5.0,2.0,3.0,2.0,1.0,10.0
8,2019-2020,1,6,9/26/19,8:00 PM,Celta de Vigo,Espanyol,1:1,X,15.0,...,3.0,6.0,18.0,4.0,10.0,-6.0,1.0,3.0,2.0,5.0
9,2019-2020,1,6,9/26/19,9:00 PM,Real Sociedad,Alavés,3:0,1,2.0,...,1.0,13.0,17.0,2.0,7.0,-5.0,1.0,3.0,2.0,5.0


In [16]:
X_test = df_test[features]  
y_test = df_test[target]

In [17]:
rf_classifier = RandomForestClassifier()

param_grid = {
    'n_estimators': [50, 100, 150],
    'max_depth': [None, 10],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2]
}

grid_search = GridSearchCV(estimator=rf_classifier, param_grid=param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

rf_model = grid_search.best_estimator_
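After fitting, `GridSearchCV` also exposes the selected hyperparameters and their mean cross-validated score. The sketch below shows how to read them. It runs on synthetic data from `make_classification` rather than on the match data, uses a smaller grid to stay fast, and fixes `random_state`, which the notebook does not do; all of these are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for X_train / y_train: 8 features, 3 classes.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_classes=3, random_state=0
)

search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [20, 50], "max_depth": [None, 5]},
    cv=5,
    scoring="accuracy",
)
search.fit(X_demo, y_demo)

print(search.best_params_)           # hyperparameters of the selected model
print(round(search.best_score_, 3))  # mean 5-fold accuracy of that model
```

`best_score_` is computed on held-out folds of the training data, so it can serve as a first estimate of accuracy before looking at a separate test set.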

In [18]:
# Predict the results for the test set
y_pred = rf_model.predict(X_test)
y_pred

array(['2', '1', 'X', '2', '2', 'X', '1', 'X', 'X', '1'], dtype=object)

In [19]:
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
confusion_mat = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)
predictions = rf_model.predict_proba(X_test)

print(f"Accuracy: {accuracy}\n")
print(f"Confusion Matrix:\n{confusion_mat}\n")
print(f"Classification Report:\n{class_report}\n")
print(f"Predictions probability:\n{predictions}\n")

Accuracy: 0.6

Confusion Matrix:
[[3 0 2]
 [0 1 0]
 [0 2 2]]

Classification Report:
              precision    recall  f1-score   support

           1       1.00      0.60      0.75         5
           2       0.33      1.00      0.50         1
           X       0.50      0.50      0.50         4

    accuracy                           0.60        10
   macro avg       0.61      0.70      0.58        10
weighted avg       0.73      0.60      0.62        10


Predictions probability:
[[0.17534664 0.42222806 0.4024253 ]
 [0.52678088 0.19205422 0.2811649 ]
 [0.41866251 0.15105064 0.43028685]
 [0.0262123  0.80049269 0.17329501]
 [0.13460212 0.63007586 0.23532202]
 [0.21119386 0.30599373 0.48281241]
 [0.79309803 0.03725464 0.16964732]
 [0.20734706 0.38845885 0.40419409]
 [0.39699408 0.14809176 0.45491417]
 [0.87921138 0.02198939 0.09879922]]
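The raw arrays above are easier to read with labels. The sketch below rebuilds the confusion matrix from the actual and predicted results of the ten matches, and labels the probability columns. The column order `1`, `2`, `X` is an assumption here (scikit-learn stores classes in sorted order; it can be confirmed with `rf_model.classes_`), and the probabilities are copied from the output above, rounded to three decimals.

```python
import pandas as pd

labels = ["1", "2", "X"]
y_true = ["X", "1", "1", "X", "2", "X", "1", "1", "X", "1"]  # `result` column of df_test
y_pred = ["2", "1", "X", "2", "2", "X", "1", "X", "X", "1"]  # output of rf_model.predict

# Rows are the actual results, columns the predicted ones.
table = pd.crosstab(pd.Series(y_true, name="actual"),
                    pd.Series(y_pred, name="predicted"))
print(table)

proba = pd.DataFrame(
    [[0.175, 0.422, 0.402], [0.527, 0.192, 0.281], [0.419, 0.151, 0.430],
     [0.026, 0.800, 0.173], [0.135, 0.630, 0.235], [0.211, 0.306, 0.483],
     [0.793, 0.037, 0.170], [0.207, 0.388, 0.404], [0.397, 0.148, 0.455],
     [0.879, 0.022, 0.099]],
    columns=labels,
)
# The predicted label is the column with the highest probability.
most_likely = proba.idxmax(axis=1)
print((most_likely == pd.Series(y_pred)).all())
```

Read this way, the first row of the table says that of the five actual home wins, three were predicted as `1` and two as `X`.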



# Overall accuracy

In [32]:
df_train2 = merge_and_clean_home(matches, classification)
df_train2 = merge_and_clean_visitor(df_train2, classification)
df_train2.dropna(subset="score", inplace=True)

X_train2 = df_train2[features]
y_train2 = df_train2[target]

In [33]:
rf_classifier = RandomForestClassifier()

param_grid = {
    'n_estimators': [50, 100, 150],
    'max_depth': [None, 10],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2]
}

grid_search = GridSearchCV(estimator=rf_classifier, param_grid=param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train2, y_train2)

rf_model2 = grid_search.best_estimator_

In [22]:
df_test2 = merge_and_clean_home(matches,classification)
df_test2 = merge_and_clean_visitor(df_test2, classification)
df_test2.dropna(subset="score", inplace=True)

In [23]:
# Predict the results for the test set
X_test2 = df_test2[features]
y_test2 = df_test2[target]
y_pred2 = rf_model2.predict(X_test2)

In [24]:
# Evaluate the model
accuracy = accuracy_score(y_test2, y_pred2)
confusion_mat = confusion_matrix(y_test2, y_pred2)
class_report = classification_report(y_test2, y_pred2)
predictions = rf_model2.predict_proba(X_test2)

print(f"Accuracy: {accuracy}\n")
print(f"Confusion Matrix:\n{confusion_mat}\n")
print(f"Classification Report:\n{class_report}\n")
print(f"Predictions probability:\n{predictions}\n")

Accuracy: 0.6277708333333333

Confusion Matrix:
[[21518  2940   859]
 [ 3850  5837   716]
 [ 6548  2954  2778]]

Classification Report:
              precision    recall  f1-score   support

           1       0.67      0.85      0.75     25317
           2       0.50      0.56      0.53     10403
           X       0.64      0.23      0.33     12280

    accuracy                           0.63     48000
   macro avg       0.60      0.55      0.54     48000
weighted avg       0.63      0.63      0.60     48000


Predictions probability:
[[0.08746701 0.64622069 0.2663123 ]
 [0.81683962 0.05600774 0.12715264]
 [0.85901624 0.03557315 0.1054106 ]
 ...
 [0.03598797 0.52206884 0.4419432 ]
 [0.39353163 0.16639106 0.44007731]
 [0.00744167 0.73499036 0.25756797]]
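The figures in the report can be recomputed by hand from the confusion matrix printed above. The sketch below hard-codes that matrix, with rows as actual results and columns as predicted results, both assumed to be in the order `1`, `2`, `X`. Note that `rf_model2` is scored here on the same matches it was fitted on, so these numbers describe in-sample fit rather than performance on unseen matches.

```python
labels = ["1", "2", "X"]
confusion = [
    [21518, 2940, 859],   # actual 1
    [3850, 5837, 716],    # actual 2
    [6548, 2954, 2778],   # actual X
]

total = sum(sum(row) for row in confusion)
correct = sum(confusion[i][i] for i in range(len(labels)))
accuracy = correct / total

# Recall: correct predictions divided by the row total (actual count).
recall = {label: confusion[i][i] / sum(confusion[i])
          for i, label in enumerate(labels)}
# Precision: correct predictions divided by the column total (predicted count).
precision = {label: confusion[i][i] / sum(row[i] for row in confusion)
             for i, label in enumerate(labels)}

print(f"accuracy = {accuracy:.4f}")
for label in labels:
    print(f"{label}: precision = {precision[label]:.2f}, recall = {recall[label]:.2f}")
```

The low recall for `X` comes from the third row: most actual ties are predicted as home wins.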

