# Comparing different machine learning techniques to predict football results
- Authors and affiliations
- Abstract
- 1. Introduction

# 1. Introduction

As one of the most popular sports on the planet, football has always been followed
very closely by a large number of people. In recent years, new types of data have
been collected for many games in various countries, such as play-by-play data including information on each shot or pass made in a match.

Information like this can be found in datasets such as [the Football Events dataset](https://datasetsearch.research.google.com/search?docid=beC2NjeMuiLj9GvLAAAAAA%3D%3D), available on Kaggle.

The objective of this project is to use the above-mentioned dataset to predict the number of goals in a football game using regression algorithms.
This paper starts with a description and analysis of the dataset, followed by an explanation of the chosen approach.

The developed models are then tested, and an explanation of the results is given at the end.

# 2. Description of the problem/dataset

Our dataset consists of two data files and one dictionary. The first file (events.csv) contains all the
recorded events: 941,009 events from 9,074 games. The second file (ginf.csv) gives the details of the
odds for the games recorded in the first file. For each league, we have information on the seasons
from 2012 to 2017, except for the English league, for which information only starts in the 2014 season.
The dictionary helps us understand the values in some of the columns of the events table.

We will focus on the first table, which is the most useful one since we are not using odds.
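Because the target is the number of goals in a game, it can in principle be derived from the events table alone. The following is a minimal sketch of that aggregation, assuming the columns are named `id_odsp` and `is_goal` in the loaded DataFrame; the helper name `goals_per_game` and the sample rows are made up for illustration and are not part of the dataset.

```python
import pandas as pd


def goals_per_game(events: pd.DataFrame) -> pd.Series:
    """Sum the is_goal flag over all events that share a game identifier."""
    return events.groupby("id_odsp")["is_goal"].sum().rename("total_goals")


# Tiny made-up sample, only to illustrate the shape of the data.
sample = pd.DataFrame(
    {
        "id_odsp": ["g1", "g1", "g1", "g2", "g2"],
        "is_goal": [1, 0, 1, 0, 0],
    }
)

targets = goals_per_game(sample)
print(targets)
```

Note that a game with no recorded events at all would not appear in the result, so the index should be checked against the list of games if that case matters.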

## 2.1. events.csv Columns
1. id odsp : Unique identifier of the game (odsp stands for oddsportal.com)
2. id event : Unique identifier of event (id odsp + sort order)
3. sort order : Chronological sequence of events in a game
4. time : Minute of the game when the event happened
5. text : Text commentary
6. event type : Primary event. 11 unique events (1-Attempt(shot), 2-Corner, 3-Foul, 4-Yellow Card, 5-Second Yellow Card, 6-(Straight) Red Card, 7-Substitution, 8-Free Kick Won, 9-Offside, 10-Hand Ball, 11-Penalty conceded)
7. event type2 : Secondary event. 4 unique events (12-Key Pass, 13-Failed through ball, 14-Sending off, 15-Own goal)
8. side : 1-Home, 2-Away
9. event team : Team that produced the event. In case of Own goals, event team is the team that benefited from the own goal
10. opponent : Team that the event happened against
11. player : Name of the player involved in main event (converted to lowercase and special
chars were removed)
12. player2 : Name of player involved in secondary event
13. player in : Player that came in (only applies to substitutions)
14. player out : Player substituted (only applies to substitutions)
15. shot place : Placement of the shot (13 possible placement locations, available in the
dictionary, only applies to shots)
16. shot outcome : 4 possible outcomes (1-On target, 2-Off target, 3-Blocked, 4-Hit the post)
17. is goal : Binary variable if the shot resulted in a goal (own goals included)
18. location : Location on the pitch where the event happened (19 possible locations, available
in the dictionary)
19. bodypart : (1- right foot, 2-left foot, 3-head)
20. assist method : In case of an assisted shot, 5 possible assist methods (details in the dictionary)
21. situation : 4 types: 1-Open Play, 2-Set piece (excluding Direct Free kicks), 3-Corner, 4-Free kick
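The integer codes listed above are easier to read once they are mapped back to labels. The sketch below does this for `event_type` using the eleven codes from the column description; the column name `event_type`, the helper `label_event_types`, and the new column `event_type_label` are assumptions made for this example.

```python
import pandas as pd

# Codes and labels taken from the event type description above.
EVENT_TYPE_LABELS = {
    1: "Attempt",
    2: "Corner",
    3: "Foul",
    4: "Yellow card",
    5: "Second yellow card",
    6: "Red card",
    7: "Substitution",
    8: "Free kick won",
    9: "Offside",
    10: "Hand ball",
    11: "Penalty conceded",
}


def label_event_types(events: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of events with a human-readable event_type_label column."""
    out = events.copy()
    out["event_type_label"] = out["event_type"].map(EVENT_TYPE_LABELS)
    return out


# Made-up sample rows, only to show the mapping.
sample = pd.DataFrame({"event_type": [1, 3, 7, 11]})
labelled = label_event_types(sample)
print(labelled)
```

The same pattern would apply to the other coded columns (for example `shot_outcome` or `bodypart`), with the mapping taken from the dictionary file.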

## 2.2 Loading the Data

In [4]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import pyplot
import scipy as sp
import seaborn as sb

From the website we also know that missing values are encoded as 'NA', so we should treat them as such.

In [8]:

# Read both files, treating the string 'NA' as a missing value
events = pd.read_csv('football-events/events.csv', na_values=['NA'])
games = pd.read_csv('football-events/ginf.csv', na_values=['NA'])
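To illustrate what the `na_values` argument does, here is a small self-contained sketch that reads an in-memory CSV instead of the real files. The three columns and their values are made up for the example. pandas already treats the string `NA` as missing by default, so passing it explicitly mainly documents the intent.

```python
import io

import pandas as pd

# Made-up CSV text standing in for a few rows of events.csv.
csv_text = "id_odsp,player,shot_place\ng1,john doe,NA\ng1,NA,5\n"

df = pd.read_csv(io.StringIO(csv_text), na_values=["NA"])

# Share of missing values per column (0.0 means no missing values).
missing_share = df.isna().mean()
print(missing_share)
```

Running the same `isna().mean()` check on the full `events` table would show which columns only apply to a subset of events, such as shots or substitutions.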

In [10]:
events.describe()

| statistic | sort_order | time | event_type | event_type2 | side | shot_place | shot_outcome | is_goal | location | bodypart | assist_method | situation | fast_break |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 941009.0 | 941009.0 | 941009.0 | 214293.0 | 941009.0 | 227459.0 | 228498.0 | 941009.0 | 467067.0 | 229185.0 | 941009.0 | 229137.0 | 941009.0 |
| mean | 53.858826 | 49.663663 | 4.326575 | 12.233764 | 1.48117 | 5.733693 | 1.926555 | 0.025978 | 6.209073 | 1.624831 | 0.264332 | 1.281316 | 0.004876 |
| std | 32.014268 | 26.488977 | 2.995313 | 0.46885 | 0.499646 | 3.3261 | 0.797055 | 0.159071 | 5.421736 | 0.7404 | 0.655501 | 0.709394 | 0.069655 |
| min | 1.0 | 0.0 | 1.0 | 12.0 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 |
| 25% | 27.0 | 27.0 | 2.0 | 12.0 | 1.0 | 2.0 | 1.0 | 0.0 | 2.0 | 1.0 | 0.0 | 1.0 | 0.0 |
| 50% | 53.0 | 51.0 | 3.0 | 12.0 | 1.0 | 5.0 | 2.0 | 0.0 | 3.0 | 1.0 | 0.0 | 1.0 | 0.0 |
| 75% | 79.0 | 73.0 | 8.0 | 12.0 | 2.0 | 9.0 | 3.0 | 0.0 | 11.0 | 2.0 | 0.0 | 1.0 | 0.0 |
| max | 180.0 | 100.0 | 11.0 | 15.0 | 2.0 | 13.0 | 4.0 | 1.0 | 19.0 | 3.0 | 4.0 | 4.0 | 1.0 |
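A few quantities can be read directly off this summary. Since `is_goal` is binary, its mean times the number of events approximates the total number of goals, and dividing by the 9,074 games quoted in the dataset description gives a rough goals-per-game figure. The `count` row also shows how many events have no `shot_place`. The sketch below only redoes this arithmetic with the rounded values from the summary, so the results are approximate; the variable names are chosen for this example.

```python
# Values copied from the describe() summary and the dataset description.
n_events = 941_009
n_games = 9_074
is_goal_mean = 0.025978      # rounded mean of the binary is_goal column
shot_place_count = 227_459   # non-missing shot_place values

# Mean of a 0/1 column times the number of rows approximates the number of ones.
approx_goals = is_goal_mean * n_events
approx_goals_per_game = approx_goals / n_games

# Share of events without a shot_place value.
shot_place_missing = 1 - shot_place_count / n_events

print(f"approximate goals: {approx_goals:.0f}")
print(f"approximate goals per game: {approx_goals_per_game:.2f}")
print(f"share of events without shot_place: {shot_place_missing:.1%}")
```

This gives roughly 2.7 goals per game (own goals included), and about three quarters of the events have no `shot_place`, which is consistent with that column applying only to shots.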


# 3. Approach