# Linear Regression project - Football matches

This notebook uses the Matches CSV file from [Club-Football-Match-Data-2000-2025](https://github.com/xgabora/Club-Football-Match-Data-2000-2025) to build a linear regression model that predicts the score of a match from a set of match-level factors.

In [1]:
import pandas as pd

In [33]:
dtypes = {'MatchTime':"str"}
parse_dates = ['MatchDate']
df = pd.read_csv('Matches.csv', dtype=dtypes, parse_dates=parse_dates)

In [19]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 230557 entries, 0 to 230556
Data columns (total 48 columns):
 #   Column       Non-Null Count   Dtype         
---  ------       --------------   -----         
 0   Division     230557 non-null  object        
 1   MatchDate    230557 non-null  datetime64[ns]
 2   MatchTime    99072 non-null   object        
 3   HomeTeam     230557 non-null  object        
 4   AwayTeam     230557 non-null  object        
 5   HomeElo      141597 non-null  float64       
 6   AwayElo      141528 non-null  float64       
 7   Form3Home    229057 non-null  float64       
 8   Form5Home    229057 non-null  float64       
 9   Form3Away    229057 non-null  float64       
 10  Form5Away    229057 non-null  float64       
 11  FTHome       230554 non-null  float64       
 12  FTAway       230554 non-null  float64       
 13  FTResult     230554 non-null  object        
 14  HTHome       175977 non-null  float64       
 15  HTAway       175977 non-null  floa

In [34]:
df['Division'].unique()

array(['F1', 'F2', 'T1', 'D1', 'D2', 'B1', 'E2', 'E1', 'N1', 'P1', 'E0',
       'I2', 'SP2', 'SP1', 'I1', 'E3', 'SC0', 'SC1', 'SC2', 'SC3', 'G1',
       'EC', 'USA', 'SWE', 'NOR', 'IRL', 'BRA', 'ARG', 'MEX', 'JAP',
       'RUS', 'POL', 'DEN', 'ROM', 'AUT', 'SUI', 'FIN', 'CHN'],
      dtype=object)

Columns 29 onward describe betting odds, so we can ignore them. We also won't keep the division, match date/time, team names or Elo ratings. Finally, since older matches are missing some data, let's look at roughly the last 5 years only (matches from 2020-01-01 onward) and restrict the data to the divisions E0, F1, SP1, G1 and I1.

In [36]:
df = df[df['MatchDate'] >= '2020-01-01']
df = df[df['Division'].isin(['E0', 'F1', 'SP1', 'G1', 'I1'])]
# df[0:28] slices rows, not columns, so this is the same as df.head()
df.head()

Unnamed: 0,Division,MatchDate,MatchTime,HomeTeam,AwayTeam,HomeElo,AwayElo,Form3Home,Form5Home,Form3Away,...,MaxUnder25,HandiSize,HandiHome,HandiAway,C_LTH,C_LTA,C_VHD,C_VAD,C_HTB,C_PHB
168413,E0,2020-01-01,12:30:00,Brighton,Chelsea,1659.32,1855.04,3.0,5.0,6.0,...,2.23,0.5,1.88,2.02,0.0145,0.4539,0.1097,0.022,0.3704,0.0296
168414,E0,2020-01-01,12:30:00,Burnley,Aston Villa,1683.98,1615.34,3.0,6.0,3.0,...,2.03,-0.8,2.06,1.84,0.0413,0.0693,0.0099,0.8274,0.0389,0.0132
168416,E0,2020-01-01,15:00:00,Newcastle,Leicester,1708.02,1830.2,3.0,6.0,3.0,...,1.99,0.8,2.03,1.87,0.0336,0.1672,0.0099,0.6525,0.0552,0.0815
168417,E0,2020-01-01,15:00:00,Southampton,Tottenham,1674.35,1840.87,7.0,7.0,4.0,...,2.33,0.3,2.0,1.9,0.1501,0.0342,0.01,0.6776,0.1141,0.014
168418,E0,2020-01-01,15:00:00,Watford,Wolves,1662.98,1774.71,7.0,8.0,6.0,...,1.91,0.3,1.89,2.01,0.0598,0.0173,0.8781,0.0099,0.025,0.0099
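
As a side note, the date and division filters above can be combined into one boolean mask. The sketch below is only an illustration: `filter_matches` is a hypothetical helper that is not part of the notebook, and the small `demo` frame is made-up data standing in for `Matches.csv`.

```python
import pandas as pd

def filter_matches(df, start_date, divisions):
    """Keep rows on or after start_date whose Division is in divisions.

    Hypothetical helper, not part of the original notebook.
    """
    mask = (df["MatchDate"] >= pd.Timestamp(start_date)) & df["Division"].isin(divisions)
    # .copy() returns an independent frame, so later column assignments
    # do not trigger pandas' SettingWithCopyWarning.
    return df.loc[mask].copy()

# Tiny made-up frame, only to show the behaviour (not real match data).
demo = pd.DataFrame({
    "Division": ["E0", "D1", "F1", "E0"],
    "MatchDate": pd.to_datetime(
        ["2019-12-28", "2020-02-01", "2020-03-07", "2021-05-23"]
    ),
})

recent = filter_matches(demo, "2020-01-01", ["E0", "F1", "SP1", "G1", "I1"])
print(recent)
```

Building the mask once avoids creating an intermediate frame, and the explicit `.copy()` makes it clear that the result is meant to be modified independently of the full dataset.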


In [37]:
df = df[[
 'Form3Home',
 'Form5Home',
 'Form3Away',
 'Form5Away',
 'FTHome',
 'FTAway',
 'FTResult',
 'HTHome',
 'HTAway',
 'HTResult',
 'HomeShots',
 'AwayShots',
 'HomeTarget',
 'AwayTarget',
 'HomeFouls',
 'AwayFouls',
 'HomeCorners',
 'AwayCorners',
 'HomeYellow',
 'AwayYellow',
 'HomeRed',
 'AwayRed'           
]]
df.head()

Unnamed: 0,Form3Home,Form5Home,Form3Away,Form5Away,FTHome,FTAway,FTResult,HTHome,HTAway,HTResult,...,HomeTarget,AwayTarget,HomeFouls,AwayFouls,HomeCorners,AwayCorners,HomeYellow,AwayYellow,HomeRed,AwayRed
168413,3.0,5.0,6.0,6.0,1.0,1.0,D,0.0,1.0,A,...,5.0,5.0,8.0,15.0,5.0,3.0,2.0,3.0,0.0,0.0
168414,3.0,6.0,3.0,3.0,1.0,2.0,A,0.0,2.0,A,...,1.0,6.0,12.0,10.0,8.0,4.0,1.0,1.0,0.0,0.0
168416,3.0,6.0,3.0,7.0,0.0,3.0,A,0.0,2.0,A,...,2.0,10.0,8.0,12.0,4.0,5.0,1.0,1.0,0.0,0.0
168417,7.0,7.0,4.0,10.0,1.0,0.0,H,1.0,0.0,H,...,3.0,5.0,21.0,8.0,6.0,9.0,3.0,4.0,0.0,0.0
168418,7.0,8.0,6.0,7.0,2.0,1.0,H,1.0,0.0,H,...,3.0,4.0,12.0,6.0,4.0,7.0,3.0,1.0,1.0,0.0


In [39]:
df.isnull().sum() / len(df)

Form3Home      0.012698
Form5Home      0.012698
Form3Away      0.012698
Form5Away      0.012698
FTHome         0.000000
FTAway         0.000000
FTResult       0.000000
HTHome         0.000000
HTAway         0.000000
HTResult       0.000000
HomeShots      0.000000
AwayShots      0.000000
HomeTarget     0.000000
AwayTarget     0.000000
HomeFouls      0.000000
AwayFouls      0.000000
HomeCorners    0.000000
AwayCorners    0.000000
HomeYellow     0.000000
AwayYellow     0.000000
HomeRed        0.000000
AwayRed        0.000000
dtype: float64
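
The fraction of missing values per column can also be written as the mean of the boolean null mask, since `True` counts as 1. A minimal sketch on a made-up frame (the `demo` data is an assumption for illustration, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Made-up frame: column "a" has one missing value out of four rows.
demo = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [1.0, 2.0, 3.0, 4.0],
})

by_sum = demo.isnull().sum() / len(demo)  # form used in the cell above
by_mean = demo.isnull().mean()            # equivalent shorthand

# Number of rows with at least one missing value.
rows_with_nulls = int(demo.isnull().any(axis=1).sum())

print(by_mean)
print(rows_with_nulls)
```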

In [40]:
df[df['Form3Home'].isnull()]

Unnamed: 0,Form3Home,Form5Home,Form3Away,Form5Away,FTHome,FTAway,FTResult,HTHome,HTAway,HTResult,...,HomeTarget,AwayTarget,HomeFouls,AwayFouls,HomeCorners,AwayCorners,HomeYellow,AwayYellow,HomeRed,AwayRed
168555,,,,,1.0,2.0,A,0.0,1.0,A,...,7.0,3.0,14.0,21.0,6.0,2.0,1.0,3.0,0.0,0.0
168560,,,,,4.0,2.0,H,2.0,1.0,H,...,5.0,5.0,19.0,21.0,3.0,5.0,1.0,2.0,0.0,0.0
168562,,,,,2.0,1.0,H,0.0,0.0,D,...,3.0,4.0,14.0,19.0,4.0,1.0,4.0,2.0,0.0,0.0
168572,,,,,3.0,1.0,H,1.0,1.0,D,...,9.0,3.0,24.0,18.0,8.0,1.0,4.0,4.0,0.0,0.0
168579,,,,,0.0,1.0,A,0.0,1.0,A,...,3.0,2.0,11.0,14.0,8.0,2.0,3.0,4.0,0.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
172127,,,,,2.0,4.0,A,0.0,2.0,A,...,7.0,8.0,8.0,19.0,6.0,6.0,1.0,2.0,0.0,0.0
172143,,,,,0.0,0.0,D,0.0,0.0,D,...,3.0,2.0,19.0,9.0,4.0,2.0,2.0,2.0,0.0,0.0
172144,,,,,2.0,0.0,H,2.0,0.0,H,...,6.0,2.0,19.0,9.0,2.0,2.0,3.0,2.0,0.0,0.0
172245,,,,,0.0,0.0,D,0.0,0.0,D,...,2.0,3.0,22.0,25.0,6.0,2.0,1.0,2.0,0.0,0.0
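
In the rows shown above, all four form columns are missing together, and they affect only about 1.3% of the filtered data. One possible way to handle them, assuming that dropping these matches is acceptable for the model, is to remove every row with a missing form value. This is a sketch rather than the method the notebook necessarily uses; `drop_missing_form` and the `demo` frame are hypothetical.

```python
import numpy as np
import pandas as pd

FORM_COLS = ["Form3Home", "Form5Home", "Form3Away", "Form5Away"]

def drop_missing_form(df):
    """Return a new frame without rows where any form column is NaN.

    Hypothetical helper, not part of the original notebook.
    """
    return df.dropna(subset=FORM_COLS).reset_index(drop=True)

# Made-up rows: the second one has no form data, like the rows listed above.
demo = pd.DataFrame({
    "Form3Home": [3.0, np.nan, 7.0],
    "Form5Home": [5.0, np.nan, 8.0],
    "Form3Away": [6.0, np.nan, 6.0],
    "Form5Away": [6.0, np.nan, 7.0],
    "FTHome": [1.0, 4.0, 2.0],
})

cleaned = drop_missing_form(demo)
print(cleaned)
```

Restricting `dropna` to `subset=FORM_COLS` means a row is only dropped because of the form columns, not because of a gap in some other column.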
