![image](https://user-images.githubusercontent.com/92790663/189541779-82e3ea35-da9c-444d-b5d6-3dfe4456c09c.png)

### BUSINESS PROBLEM

Approximately two million managers play the Fantasy Premier League (FPL) across 38 game weeks every season. Every game week, the big question on these managers' minds is which soccer players will provide the maximum return on investment (ROI). A predictive model of each player's weekly ROI is therefore crucial information for FPL managers. This project investigates a model based on historical data of players' performance against their opponents.

#### BUSINESS OBJECTIVES
- Create a model that predicts points for each player weekly and evaluate the model's accuracy.
- Predict and select players with high returns on fantasy points before every game week.
- Compare players using analytics.

#### DATA SOURCES 
[Link 1](https://www.fantasynutmeg.com)

- This source provides historical data from the 2016-17 season to the current season. The extracted data contains only players with double-digit fantasy points across fixtures in the respective seasons. Extracting every player's performance for every fixture from 2016-17 onward is possible and is left to the optimization phase of this project.


[Link 2](https://fantasy.premierleague.com/api/)

- This source is the official FPL API. It contains only current-season data: players' performance, players' positions and all fixtures.
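
As a rough sketch of how current-season data could be pulled from this API: the `bootstrap-static/` endpoint and the `elements` and `teams` keys below are assumptions about the API's layout, not something this notebook shows, and `fetch_bootstrap` and `players_frame` are hypothetical helper names. The sample payload is made up so the helper can be tried offline.

```python
import pandas as pd
import requests

# Assumed endpoint; it is not used elsewhere in this notebook.
BOOTSTRAP_URL = "https://fantasy.premierleague.com/api/bootstrap-static/"


def fetch_bootstrap(url=BOOTSTRAP_URL, timeout=30):
    """Download the summary payload (hypothetical helper, needs network access)."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    return response.json()


def players_frame(payload):
    """Build one row per player and attach the club name from the teams list."""
    players = pd.DataFrame(payload["elements"])
    team_names = {team["id"]: team["name"] for team in payload["teams"]}
    players["club_name"] = players["team"].map(team_names)
    return players


# Made-up payload with the assumed shape, used instead of a live request.
sample_payload = {
    "elements": [
        {"id": 1, "web_name": "Player A", "team": 1, "total_points": 12},
        {"id": 2, "web_name": "Player B", "team": 2, "total_points": 7},
    ],
    "teams": [{"id": 1, "name": "Arsenal"}, {"id": 2, "name": "Everton"}],
}
sample_players = players_frame(sample_payload)
```

With network access, `players_frame(fetch_bootstrap())` would be the equivalent live call, provided the payload has the assumed keys.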


#### PERFORMANCE METRICS
- Accuracy
- R-Squared ($R^2$) score (Coefficient of determination).
- RMSE (Root Mean Squared Error)
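
To make the two regression metrics concrete, here is a minimal sketch that computes RMSE and $R^2$ with NumPy from their textbook definitions. The points values are invented for illustration, and `rmse` and `r_squared` are hypothetical helper names, not functions used elsewhere in this notebook.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error: square root of the mean squared residual."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)


# Invented weekly points for five players and a model's predictions.
actual_points = [2, 6, 1, 9, 0]
predicted_points = [3, 5, 1, 7, 1]

toy_rmse = rmse(actual_points, predicted_points)
toy_r2 = r_squared(actual_points, predicted_points)
```

RMSE is expressed in fantasy points, so it reads as a typical size of the prediction error, while $R^2$ is unitless and compares the model against always predicting the mean.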

### IMPORT NECESSARY LIBRARIES

In [1]:
import requests
import numpy as np
import pandas as pd

### ACCESSING DATA

In [2]:
# Read data.
df_allseasons = pd.read_csv('cleaned_merged_seasons.csv', index_col = 'Unnamed: 0')
df_allseasons.head()



Unnamed: 0,season_x,name,position,team_x,assists,bonus,bps,clean_sheets,creativity,element,...,team_h_score,threat,total_points,transfers_balance,transfers_in,transfers_out,value,was_home,yellow_cards,GW
0,2016-17,Aaron Cresswell,DEF,,0,0,0,0,0.0,454,...,2.0,0.0,0,0,0,0,55,False,0,1
1,2016-17,Aaron Lennon,MID,,0,0,6,0,0.3,142,...,1.0,0.0,1,0,0,0,60,True,0,1
2,2016-17,Aaron Ramsey,MID,,0,0,5,0,4.9,16,...,3.0,23.0,2,0,0,0,80,True,0,1
3,2016-17,Abdoulaye Doucouré,MID,,0,0,0,0,0.0,482,...,1.0,0.0,0,0,0,0,50,False,0,1
4,2016-17,Adam Forshaw,MID,,0,0,3,0,1.3,286,...,1.0,0.0,1,0,0,0,45,True,1,1


In [3]:
# Print all columns.
df_allseasons.columns

Index(['season_x', 'name', 'position', 'team_x', 'assists', 'bonus', 'bps',
       'clean_sheets', 'creativity', 'element', 'fixture', 'goals_conceded',
       'goals_scored', 'ict_index', 'influence', 'kickoff_time', 'minutes',
       'opponent_team', 'opp_team_name', 'own_goals', 'penalties_missed',
       'penalties_saved', 'red_cards', 'round', 'saves', 'selected',
       'team_a_score', 'team_h_score', 'threat', 'total_points',
       'transfers_balance', 'transfers_in', 'transfers_out', 'value',
       'was_home', 'yellow_cards', 'GW'],
      dtype='object')

In [4]:
# Descriptive information on features.
df_allseasons.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 98402 entries, 0 to 98401
Data columns (total 37 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   season_x           98402 non-null  object 
 1   name               98402 non-null  object 
 2   position           98402 non-null  object 
 3   team_x             48930 non-null  object 
 4   assists            98402 non-null  int64  
 5   bonus              98402 non-null  int64  
 6   bps                98402 non-null  int64  
 7   clean_sheets       98402 non-null  int64  
 8   creativity         98402 non-null  float64
 9   element            98402 non-null  int64  
 10  fixture            98402 non-null  int64  
 11  goals_conceded     98402 non-null  int64  
 12  goals_scored       98402 non-null  int64  
 13  ict_index          98402 non-null  float64
 14  influence          98402 non-null  float64
 15  kickoff_time       98402 non-null  object 
 16  minutes            984

In [5]:
# Check for any missing values.
df_allseasons.isnull().values.any()

True

In [6]:
# Check missing values for each feature.
df_allseasons.isna().sum()

season_x                 0
name                     0
position                 0
team_x               49472
assists                  0
bonus                    0
bps                      0
clean_sheets             0
creativity               0
element                  0
fixture                  0
goals_conceded           0
goals_scored             0
ict_index                0
influence                0
kickoff_time             0
minutes                  0
opponent_team            0
opp_team_name            0
own_goals                0
penalties_missed         0
penalties_saved          0
red_cards                0
round                    0
saves                    0
selected                 0
team_a_score            49
team_h_score            49
threat                   0
total_points             0
transfers_balance        0
transfers_in             0
transfers_out            0
value                    0
was_home                 0
yellow_cards             0
GW                       0
dtype: int64
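
The counts above show `team_x` missing in 49,472 of 98,402 rows (roughly half) and the two score columns missing in 49 rows each. A small sketch of how such counts could be turned into a percentage summary follows; `missing_summary` is a hypothetical helper and the toy frame is invented.

```python
import numpy as np
import pandas as pd


def missing_summary(df):
    """Return missing counts and percentages for columns with at least one gap."""
    counts = df.isna().sum()
    summary = pd.DataFrame(
        {"missing": counts, "percent": (counts / len(df) * 100).round(2)}
    )
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Invented four-row frame that mimics the pattern of gaps seen above.
toy = pd.DataFrame(
    {
        "team_x": [np.nan, np.nan, "Leicester", "Brighton"],
        "team_h_score": [2.0, np.nan, 1.0, 0.0],
        "minutes": [0, 90, 45, 90],
    }
)
toy_summary = missing_summary(toy)
```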

In [7]:
df_allseasons.team_x

0                NaN
1                NaN
2                NaN
3                NaN
4                NaN
            ...     
98397      Leicester
98398      Newcastle
98399    Southampton
98400       Brighton
98401       West Ham
Name: team_x, Length: 98402, dtype: object

In [8]:
# Check for duplicates on each row.
df_allseasons.duplicated().value_counts()

False    98402
dtype: int64

In [9]:
# Check for unique values.
df_allseasons.nunique()

season_x                 6
name                   989
position                 4
team_x                  23
assists                  5
bonus                    4
bps                    113
clean_sheets             2
creativity             860
element                737
fixture                380
goals_conceded          10
goals_scored             5
ict_index              273
influence              528
kickoff_time          1428
minutes                 91
opponent_team           20
opp_team_name           31
own_goals                2
penalties_missed         2
penalties_saved          3
red_cards                2
round                   47
saves                   14
selected             65713
team_a_score             9
team_h_score            10
threat                 149
total_points            31
transfers_balance    32217
transfers_in         24344
transfers_out        26734
value                  100
was_home                 2
yellow_cards             2
GW                      47
dtype: int64

In [10]:
# Descriptive statistics.
df_allseasons.describe()

Unnamed: 0,assists,bonus,bps,clean_sheets,creativity,element,fixture,goals_conceded,goals_scored,ict_index,...,team_a_score,team_h_score,threat,total_points,transfers_balance,transfers_in,transfers_out,value,yellow_cards,GW
count,98402.0,98402.0,98402.0,98402.0,98402.0,98402.0,98402.0,98402.0,98402.0,98402.0,...,98353.0,98353.0,98402.0,98402.0,98402.0,98402.0,98402.0,98402.0,98402.0,98402.0
mean,0.045873,0.122599,6.825359,0.120993,5.352928,311.321701,196.188248,0.542845,0.051279,1.963898,...,1.262097,1.491708,6.127121,1.541798,1318.144,13950.5,12631.75,52.49687,0.057814,20.718309
std,0.22768,0.520794,10.252218,0.326121,11.305636,181.148434,108.6632,0.995002,0.247819,3.218001,...,1.224245,1.310472,14.476371,2.658725,58594.15,50342.2,42870.93,13.123029,0.233392,11.605966
min,0.0,0.0,-18.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,-7.0,-1857821.0,0.0,0.0,37.0,0.0,1.0
25%,0.0,0.0,0.0,0.0,0.0,157.0,103.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,-1721.0,64.0,185.0,45.0,0.0,11.0
50%,0.0,0.0,0.0,0.0,0.0,306.0,200.0,0.0,0.0,0.0,...,1.0,1.0,0.0,0.0,-78.0,587.0,1422.0,49.0,0.0,21.0
75%,0.0,0.0,12.0,0.0,3.9,459.0,290.0,1.0,0.0,2.9,...,2.0,2.0,4.0,2.0,144.0,5894.0,8614.75,55.0,0.0,30.0
max,4.0,3.0,128.0,1.0,170.9,737.0,380.0,9.0,4.0,35.8,...,9.0,9.0,186.0,29.0,1983733.0,2104464.0,1872898.0,136.0,1.0,47.0


#### Observation
- Some values are missing: `team_x` in 49,472 rows, and `team_a_score` and `team_h_score` in 49 rows each.
- There are no duplicate observations.

#### FEATURE ENGINEERING

In [22]:
# Make a copy of the original piece of data.
df_allseasons_clean = df_allseasons.copy()

> To engineer two new features named `club_name` and `form`, we collect data from the `fantasynutmeg` API (Link 1 in the business problem statement), compare the columns of the collected data with the `df_allseasons` dataframe and extract the features highlighted above.

In [11]:
# Get yearly historical data from the endpoint for each available season, then inspect the keys of one response (2016-17 as an example).
Y2016= requests.get('https://www.fantasynutmeg.com/api/history/season/2016-17').json()
Y2017= requests.get('https://www.fantasynutmeg.com/api/history/season/2017-18').json()
Y2018= requests.get('https://www.fantasynutmeg.com/api/history/season/2018-19').json()
Y2019= requests.get('https://www.fantasynutmeg.com/api/history/season/2019-20').json()
Y2020= requests.get('https://www.fantasynutmeg.com/api/history/season/2020-21').json()
Y2021= requests.get('https://www.fantasynutmeg.com/api/history/season/2021-22').json()
Y2022= requests.get('https://www.fantasynutmeg.com/api/history/season/2022-23').json()

Y2016.keys()

dict_keys(['dd_agg_fixture', 'dd_agg_player', 'dd_hauls', 'history'])

In [13]:
# Convert history data dictionary to a pandas dataframe.
hist16_df = pd.DataFrame(Y2016['history'])
hist17_df = pd.DataFrame(Y2017['history'])
hist18_df = pd.DataFrame(Y2018['history'])
hist19_df = pd.DataFrame(Y2019['history'])
hist20_df = pd.DataFrame(Y2020['history'])
hist21_df = pd.DataFrame(Y2021['history'])

In [14]:
# Engineer feature to highlight each season year (a scalar is broadcast to every row).
hist16_df['year'] = "2016-17"
hist17_df['year'] = "2017-18"
hist18_df['year'] = "2018-19"
hist19_df['year'] = "2019-20"
hist20_df['year'] = "2020-21"
hist21_df['year'] = "2021-22"

In [15]:
# Concatenate all history data across years.
hist_df = [hist16_df, hist17_df, hist18_df, hist19_df, hist20_df, hist21_df]

hist = pd.concat(hist_df, axis = 0, ignore_index=True)
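
The per-season steps above (build a frame, tag it with its season, concatenate) repeat the same pattern six times. A compact sketch of the same idea with a dictionary keyed by season is shown below; the frames and their contents are invented stand-ins for the `history` tables, and `season_frames` and `hist_toy` are hypothetical names.

```python
import pandas as pd

# Invented stand-ins for the per-season history tables.
season_frames = {
    "2016-17": pd.DataFrame(
        {"first_name": ["Player", "Player"], "second_name": ["One", "Two"]}
    ),
    "2017-18": pd.DataFrame({"first_name": ["Player"], "second_name": ["One"]}),
}

# Tag each frame with its season label, then stack them into one table.
hist_toy = pd.concat(
    [frame.assign(year=season) for season, frame in season_frames.items()],
    axis=0,
    ignore_index=True,
)
```

Keeping the season label as the dictionary key avoids the label and the frame drifting apart when a new season is added.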

In [16]:
# Preview history data.
hist.head()

Unnamed: 0,assists,bonus,bps,chance_of_playing_next_round,chance_of_playing_this_round,clean_sheets,code,cost_change_event,cost_change_event_fall,cost_change_start,...,influence_rank,influence_rank_type,threat_rank,threat_rank_type,corners_and_indirect_freekicks_order,corners_and_indirect_freekicks_text,direct_freekicks_order,direct_freekicks_text,penalties_order,penalties_text
0,0,0,18,100,100,0,48844,0,0,-3,...,,,,,,,,,,
1,0,2,660,100,100,12,11334,0,0,-1,...,,,,,,,,,,
2,1,19,723,0,75,10,51507,0,0,1,...,,,,,,,,,,
3,0,0,5,100,100,0,17127,0,0,-2,...,,,,,,,,,,
4,0,2,296,75,100,5,158074,0,0,-2,...,,,,,,,,,,


In [17]:
# Engineer feature to highlight the form of the players.
hist['form'] = hist['total_points']/38 
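
Note that `form` as defined here uses the full-season total, so it is a single constant per player and season. As a hedged alternative, and not what this notebook does, a rolling form could be built from earlier gameweeks only, which avoids using a fixture's own points to describe it. The sketch assumes per-gameweek rows with `name`, `GW` and `total_points` columns; the data is invented and `rolling_form` is a hypothetical helper. On the real data the grouping would probably need the season as well as the name.

```python
import pandas as pd

# Invented per-gameweek points for two players.
toy = pd.DataFrame(
    {
        "name": ["A", "A", "A", "A", "B", "B"],
        "GW": [1, 2, 3, 4, 1, 2],
        "total_points": [2, 6, 1, 9, 5, 3],
    }
)


def rolling_form(df, window=3):
    """Mean points over up to `window` previous gameweeks, per player."""
    ordered = df.sort_values(["name", "GW"])
    return ordered.groupby("name")["total_points"].transform(
        # shift(1) excludes the current gameweek from its own average.
        lambda points: points.shift(1).rolling(window, min_periods=1).mean()
    )


toy["rolling_form"] = rolling_form(toy)
```

The first gameweek of each player has no history, so its rolling form is missing and would need a fill strategy before modelling.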

In [18]:
# Print all columns.
hist.columns

Index(['assists', 'bonus', 'bps', 'chance_of_playing_next_round',
       'chance_of_playing_this_round', 'clean_sheets', 'code',
       'cost_change_event', 'cost_change_event_fall', 'cost_change_start',
       'cost_change_start_fall', 'creativity', 'dreamteam_count', 'ea_index',
       'element_type', 'ep_next', 'ep_this', 'event_points', 'first_name',
       'form', 'goals_conceded', 'goals_scored', 'ict_index', 'id',
       'in_dreamteam', 'influence', 'loaned_in', 'loaned_out', 'loans_in',
       'loans_out', 'minutes', 'news', 'now_cost', 'own_goals',
       'penalties_missed', 'penalties_saved', 'photo', 'points_per_game',
       'position', 'red_cards', 'saves', 'second_name', 'selected_by_percent',
       'special', 'squad_number', 'status', 'team', 'team_code', 'team_name',
       'threat', 'total_points', 'transfers_in', 'transfers_in_event',
       'transfers_out', 'transfers_out_event', 'value_form', 'value_season',
       'web_name', 'yellow_cards', 'year', 'news_added', 

We have the historical data for the 2016-17 to 2021-22 seasons. We proceed as follows:
1. Create a `name_season` column in the historical data from the first name, last name and season.
2. Combine the name and the season in the `df_allseasons` dataframe to build the same key.
3. Map the two dataframes on this key and extract the needed features.
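
The three steps can be sketched on toy data as follows. All names and values are invented, and `hist_toy`, `gw_toy` and `lookup` are hypothetical names. The sketch also lists keys that fail to match, since any player whose name is spelled differently in the two sources ends up with missing `club_name` and `form`.

```python
import pandas as pd

# Invented season-level history (one row per player and season).
hist_toy = pd.DataFrame(
    {
        "first_name": ["Player", "Player"],
        "second_name": ["One", "Two"],
        "year": ["2016-17", "2016-17"],
        "team_name": ["AAA", "BBB"],
        "total_points": [76, 38],
    }
)
# Invented gameweek-level rows to enrich.
gw_toy = pd.DataFrame(
    {
        "name": ["Player One", "Player Two", "Player Three"],
        "season_x": ["2016-17", "2016-17", "2016-17"],
    }
)

# Steps 1 and 2: build the same key on both sides.
hist_toy["name_season"] = (
    hist_toy["first_name"] + " " + hist_toy["second_name"] + "_" + hist_toy["year"]
)
gw_toy["name_season"] = gw_toy["name"] + "_" + gw_toy["season_x"]

# Step 3: map features through a lookup indexed by the (unique) key.
lookup = hist_toy.drop_duplicates("name_season").set_index("name_season")
gw_toy["club_name"] = gw_toy["name_season"].map(lookup["team_name"])
gw_toy["form"] = gw_toy["name_season"].map(lookup["total_points"] / 38)

# Keys with no match in the history table stay missing.
unmatched = gw_toy.loc[gw_toy["club_name"].isna(), "name_season"].tolist()
```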

In [19]:
# Engineer a feature that combines each player's name with the season they played in.
hist['name_season'] = hist['first_name'] + ' ' + hist['second_name'] + '_' + hist['year']

In [20]:
# Display a sample of name_season column.
hist.name_season.head()

0                David Ospina_2016-17
1                   Petr Cech_2016-17
2           Laurent Koscielny_2016-17
3             Per Mertesacker_2016-17
4    Gabriel Armando de Abreu_2016-17
Name: name_season, dtype: object

In [21]:
# Data Quality Checks.
subset  = ['Mohamed Salah_2018-19']
check = hist[hist.name_season.isin(subset)]
check.form

1582    6.815789
Name: form, dtype: float64

In [24]:
# Engineer a feature that combines each player's name with the season they played in.
df_allseasons_clean['name_season'] = df_allseasons_clean['name'] + '_' + df_allseasons_clean['season_x']

In [25]:
# Engineer a feature to highlight the club of the player.
teams=dict(zip(hist.name_season, hist.team_name))

df_allseasons_clean['club_name'] = df_allseasons_clean['name_season'].map(teams)

In [26]:
# Engineer a feature to highlight the form of the player.
forms = dict(zip(hist.name_season, hist.form))

df_allseasons_clean['form'] = df_allseasons_clean['name_season'].map(forms)

In [27]:
# Preview dataframe.
df_allseasons_clean.head()

Unnamed: 0,season_x,name,position,team_x,assists,bonus,bps,clean_sheets,creativity,element,...,transfers_balance,transfers_in,transfers_out,value,was_home,yellow_cards,GW,name_season,club_name,form
0,2016-17,Aaron Cresswell,DEF,,0,0,0,0,0.0,454,...,0,0,0,55,False,0,1,Aaron Cresswell_2016-17,WHU,1.578947
1,2016-17,Aaron Lennon,MID,,0,0,6,0,0.3,142,...,0,0,0,60,True,0,1,Aaron Lennon_2016-17,EVE,0.578947
2,2016-17,Aaron Ramsey,MID,,0,0,5,0,4.9,16,...,0,0,0,80,True,0,1,Aaron Ramsey_2016-17,ARS,1.473684
3,2016-17,Abdoulaye Doucouré,MID,,0,0,0,0,0.0,482,...,0,0,0,50,False,0,1,Abdoulaye Doucouré_2016-17,WAT,1.0
4,2016-17,Adam Forshaw,MID,,0,0,3,0,1.3,286,...,0,0,0,45,True,1,1,Adam Forshaw_2016-17,MID,2.026316


In [28]:
# Data Quality Checks.
subset  = ['Marcus Rashford_2020-21']
check = df_allseasons_clean[df_allseasons_clean.name_season.isin(subset)]
check.form

50237    4.578947
50787    4.578947
51344    4.578947
51917    4.578947
52611    4.578947
53207    4.578947
53807    4.578947
54408    4.578947
55010    4.578947
55566    4.578947
56157    4.578947
56764    4.578947
57374    4.578947
57988    4.578947
58498    4.578947
59044    4.578947
59480    4.578947
60327    4.578947
60328    4.578947
61026    4.578947
61674    4.578947
62336    4.578947
63001    4.578947
63780    4.578947
64523    4.578947
65608    4.578947
65609    4.578947
65890    4.578947
66643    4.578947
67606    4.578947
68293    4.578947
68985    4.578947
69666    4.578947
70880    4.578947
70881    4.578947
70882    4.578947
72475    4.578947
73181    4.578947
Name: form, dtype: float64

In [29]:
# Print all columns.
df_allseasons_clean.columns

Index(['season_x', 'name', 'position', 'team_x', 'assists', 'bonus', 'bps',
       'clean_sheets', 'creativity', 'element', 'fixture', 'goals_conceded',
       'goals_scored', 'ict_index', 'influence', 'kickoff_time', 'minutes',
       'opponent_team', 'opp_team_name', 'own_goals', 'penalties_missed',
       'penalties_saved', 'red_cards', 'round', 'saves', 'selected',
       'team_a_score', 'team_h_score', 'threat', 'total_points',
       'transfers_balance', 'transfers_in', 'transfers_out', 'value',
       'was_home', 'yellow_cards', 'GW', 'name_season', 'club_name', 'form'],
      dtype='object')

In [30]:
# Engineer feature to highlight the game dates from kickoff_time.
df_allseasons_clean['game_date'] = df_allseasons_clean['kickoff_time'].str.replace('T', ' ')
df_allseasons_clean['game_date'] = df_allseasons_clean['game_date'].str.replace(':00Z', '')

In [32]:
# Preview series.
df_allseasons_clean.game_date.head()

0    2016-08-15 19:00
1    2016-08-13 14:00
2    2016-08-14 15:00
3    2016-08-13 14:00
4    2016-08-13 14:00
Name: game_date, dtype: object

In [33]:
# Convert game_date feature to appropriate dtype.
df_allseasons_clean['game_date'] = pd.to_datetime(df_allseasons_clean['game_date'])

In [34]:
# Preview series.
df_allseasons_clean.game_date.head()

0   2016-08-15 19:00:00
1   2016-08-13 14:00:00
2   2016-08-14 15:00:00
3   2016-08-13 14:00:00
4   2016-08-13 14:00:00
Name: game_date, dtype: datetime64[ns]
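
As a possible shortcut, `pd.to_datetime` can usually parse ISO 8601 strings such as `2016-08-15T19:00:00Z` directly, without the two string replacements. The sketch below assumes every `kickoff_time` value follows that format; the two sample strings are taken from the preview above.

```python
import pandas as pd

kickoff = pd.Series(["2016-08-15T19:00:00Z", "2016-08-13T14:00:00Z"])

# Parse as UTC, then drop the timezone to get naive timestamps like the ones above.
game_date = pd.to_datetime(kickoff, utc=True).dt.tz_convert(None)
```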

In [35]:
# Engineer a game_weather feature by mapping the kickoff month to a season code (1: Dec-Feb, 2: Mar-May, 3: Jun-Aug, 4: Sep-Nov).
seasons = [1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 1]

month_to_season = dict(zip(range(1,13), seasons))
df_allseasons_clean['game_weather'] = df_allseasons_clean.game_date.dt.month.map(month_to_season) 

In [36]:
# Data Quality Check.
df_allseasons_clean.game_weather.value_counts()

1    36939
2    26810
4    24533
3    10120
Name: game_weather, dtype: int64
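
The month-to-code mapping can also be written out explicitly, which makes the December wrap-around easier to see. In this sketch the season names in `SEASON_LABELS` are an assumption (the notebook only uses the numeric codes), and the three dates are sample kickoff times.

```python
import pandas as pd

# Same mapping as above, spelled out month by month.
MONTH_TO_SEASON = {
    12: 1, 1: 1, 2: 1,
    3: 2, 4: 2, 5: 2,
    6: 3, 7: 3, 8: 3,
    9: 4, 10: 4, 11: 4,
}
# Assumed human-readable names for the four codes.
SEASON_LABELS = {1: "winter", 2: "spring", 3: "summer", 4: "autumn"}

dates = pd.Series(
    pd.to_datetime(["2016-08-13 14:00", "2016-12-10 12:30", "2022-05-22 15:00"])
)
codes = dates.dt.month.map(MONTH_TO_SEASON)
labels = codes.map(SEASON_LABELS)
```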

In [37]:
# Engineer a feature that labels games starting before 13:00 as early starts (0) and games starting at 13:00 or later as late starts (1).
df_allseasons_clean['start_label'] = np.where((df_allseasons_clean['game_date'].dt.hour) < 13, 0, 1)

In [38]:
# Quality Check.
df_allseasons_clean[['game_date', 'start_label']].head(20)

Unnamed: 0,game_date,start_label
0,2016-08-15 19:00:00,1
1,2016-08-13 14:00:00,1
2,2016-08-14 15:00:00,1
3,2016-08-13 14:00:00,1
4,2016-08-13 14:00:00,1
5,2016-08-14 15:00:00,1
6,2016-08-15 19:00:00,1
7,2016-08-14 15:00:00,1
8,2016-08-13 14:00:00,1
9,2016-08-14 15:00:00,1
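
Because the preview only shows late starts, a small check of the boundary can help. This sketch applies the same `np.where` rule to four illustrative kickoff times and shows that a 13:00 start is labelled 1, since the comparison uses the hour only.

```python
import numpy as np
import pandas as pd

kickoffs = pd.Series(
    pd.to_datetime(
        ["2016-08-13 11:30", "2016-12-10 12:30", "2016-08-13 13:00", "2016-08-13 14:00"]
    )
)

# Hours below 13 are early starts (0); 13:00 and later are late starts (1).
start_label = np.where(kickoffs.dt.hour < 13, 0, 1)
```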


#### Data Quality
Data quality issues generally fall into four categories:
- Completeness: do we have all of the records that we should? Do we have missing records or not? Are there specific rows, columns, or cells missing?
- Validity: we have the records, but they're not valid, i.e., they don't conform to a defined schema. A schema is a defined set of rules for data. These rules can be real-world constraints (e.g. negative height is impossible) and table-specific constraints (e.g. unique key constraints in tables).
- Accuracy: inaccurate data is wrong data that is valid. It adheres to the defined schema, but it is still incorrect.
- Consistency: inconsistent data is both valid and accurate, but there are multiple correct ways of referring to the same thing. Consistency, i.e., a standard format, in columns that represent the same data across tables and/or within tables is desired.
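
Three of these categories can be probed with short pandas checks; accuracy usually needs an external reference and is not covered. The sketch below uses an invented frame, and the rule that `minutes` must lie between 0 and 90 is a hypothetical schema constraint chosen for illustration.

```python
import numpy as np
import pandas as pd

# Invented rows with one issue of each kind.
toy = pd.DataFrame(
    {
        "minutes": [90, 0, 120, 45],
        "team_h_score": [2.0, np.nan, 1.0, 0.0],
        "opp_team_name": ["Spurs", "Tottenham", "Spurs", "Stoke"],
    }
)

# Completeness: count missing cells per column.
missing_per_column = toy.isna().sum()

# Validity: rows that break the hypothetical 0-90 minutes rule.
invalid_minutes = toy[~toy["minutes"].between(0, 90)]

# Consistency: list distinct spellings so aliases of one team can be spotted.
team_spellings = sorted(toy["opp_team_name"].unique())
```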

After assessing the data, we have the following issues:

1. Missing data (`team_a_score`, `team_h_score`).

#### Data Tidiness
There are three main requirements for tidiness.

1. Each variable forms a column,
2. Each observation forms a row, and
3. Each type of observational unit forms a table.

The dataset reasonably meets all three criteria.

### CLEANING DATA

In [2]:
# Make a copy of the original piece of data.
df_allseasons_clean = df_allseasons.copy()

#### Quality issues

#### Issue #1:
- Missing data (`team_a_score`, `team_h_score`)

#### Define
- Drop rows for players who did not play (`minutes == 0`), then drop any remaining rows with missing values.

#### Code

In [25]:
# Inspect rows where the player did not play (minutes == 0).
df_allseasons_clean[df_allseasons_clean.minutes == 0]

Unnamed: 0,season_x,name,position,assists,bonus,bps,clean_sheets,creativity,element,fixture,...,transfers_balance,transfers_in,transfers_out,value,was_home,yellow_cards,GW,name_season,club_name,form
0,2016-17,Aaron Cresswell,DEF,0,0,0,0,0.0,454,10,...,0,0,0,55,False,0,1,Aaron Cresswell_2016-17,WHU,1.578947
3,2016-17,Abdoulaye Doucouré,MID,0,0,0,0,0.0,482,7,...,0,0,0,50,False,0,1,Abdoulaye Doucouré_2016-17,WAT,1.000000
8,2016-17,Alex McCarthy,GK,0,0,0,0,0.0,101,7,...,0,0,0,45,True,0,1,Alex McCarthy_2016-17,SOU,0.000000
10,2016-17,Andreas Pereira,MID,0,0,0,0,0.0,263,9,...,0,0,0,45,False,0,1,Andreas Pereira_2016-17,MUN,0.000000
15,2016-17,Angelo Ogbonna,DEF,0,0,0,0,0.0,456,10,...,0,0,0,50,False,0,1,Angelo Ogbonna_2016-17,WHU,1.184211
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
98393,2021-22,Jack Grieves,FWD,0,0,0,0,0.0,734,375,...,4094,5073,979,45,False,0,38,Jack Grieves_2021-22,WAT,0.000000
98396,2021-22,Josh Martin,MID,0,0,0,0,0.0,330,380,...,0,0,0,50,True,0,38,Josh Martin_2021-22,NOR,0.000000
98397,2021-22,Wilfred Ndidi,MID,0,0,0,0,0.0,216,377,...,-202,22,224,48,True,0,38,Wilfred Ndidi_2021-22,LEI,1.078947
98400,2021-22,Mathew Ryan,GK,0,0,0,0,0.0,65,373,...,-2,0,2,45,True,0,38,Mathew Ryan_2021-22,BHA,0.000000


In [26]:
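# Collect the index labels of the rows where the player did not play, to drop them in the next cell.
zero_minutes = df_allseasons_clean[df_allseasons_clean.minutes == 0].index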

In [27]:
df_allseasons_clean.drop(zero_minutes, axis = 0, inplace=True)
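
Dropping by index labels works, and the row counts in the outputs show the effect: 98,402 rows before and 49,231 after, so 49,171 zero-minute rows are removed. An equivalent and slightly shorter pattern is a boolean mask, sketched here on invented data with the hypothetical names `played` and `rows_removed`.

```python
import pandas as pd

toy = pd.DataFrame({"name": ["A", "B", "C"], "minutes": [0, 15, 90]})

# Keep only rows where the player was on the pitch; copy to avoid a view.
played = toy[toy["minutes"] != 0].copy()
rows_removed = len(toy) - len(played)
```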

In [28]:
# Drop the round column.
df_allseasons_clean.drop(['round'], axis=1, inplace=True)

In [29]:
df_allseasons_clean.kickoff_time

1        2016-08-13T14:00:00Z
2        2016-08-14T15:00:00Z
4        2016-08-13T14:00:00Z
5        2016-08-14T15:00:00Z
6        2016-08-15T19:00:00Z
                 ...         
98392    2022-05-22T15:00:00Z
98394    2022-05-22T15:00:00Z
98395    2022-05-22T15:00:00Z
98398    2022-05-22T15:00:00Z
98399    2022-05-22T15:00:00Z
Name: kickoff_time, Length: 49231, dtype: object

In [30]:
df_allseasons_clean['game_date'] = df_allseasons_clean['kickoff_time'].str.replace('T', ' ')
df_allseasons_clean['game_date'] = df_allseasons_clean['game_date'].str.replace(':00Z', '')

In [31]:
df_allseasons_clean['game_date'] = pd.to_datetime(df_allseasons_clean['game_date'])

In [32]:
df_allseasons_clean.game_date

1       2016-08-13 14:00:00
2       2016-08-14 15:00:00
4       2016-08-13 14:00:00
5       2016-08-14 15:00:00
6       2016-08-15 19:00:00
                ...        
98392   2022-05-22 15:00:00
98394   2022-05-22 15:00:00
98395   2022-05-22 15:00:00
98398   2022-05-22 15:00:00
98399   2022-05-22 15:00:00
Name: game_date, Length: 49231, dtype: datetime64[ns]

In [33]:
seasons = [1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 1]

month_to_season = dict(zip(range(1,13), seasons))
month_to_season

{1: 1, 2: 1, 3: 2, 4: 2, 5: 2, 6: 3, 7: 3, 8: 3, 9: 4, 10: 4, 11: 4, 12: 1}

In [34]:
df_allseasons_clean['game_weather'] = df_allseasons_clean.game_date.dt.month.map(month_to_season) 

In [35]:
df_allseasons_clean.game_weather.value_counts()

1    18277
4    12980
2    12248
3     5726
Name: game_weather, dtype: int64

In [36]:
df_allseasons_clean.game_date.value_counts()

2022-05-22 15:00:00    280
2021-05-23 15:00:00    278
2020-07-26 15:00:00    275
2019-05-12 14:00:00    207
2018-05-13 14:00:00    175
                      ... 
2016-12-05 20:00:00      7
2016-12-10 12:30:00      7
2016-11-27 12:00:00      6
2016-10-16 12:30:00      6
2016-08-21 12:30:00      3
Name: game_date, Length: 1427, dtype: int64

In [37]:
df_allseasons_clean.game_date.dt.hour.value_counts()

15    10796
14     9845
19     6339
20     4359
16     4165
17     3741
12     3266
11     2365
18     2189
13     2166
Name: game_date, dtype: int64

In [38]:
# Label early starts (before 13:00) as 0 and later starts as 1; numpy is already imported as np.
df_allseasons_clean['start_label'] = np.where((df_allseasons_clean['game_date'].dt.hour) < 13, 0, 1)

In [39]:
df_allseasons_clean.start_label.value_counts()

1    43600
0     5631
Name: start_label, dtype: int64

In [40]:
df_allseasons_clean[['game_date', 'start_label']].head(50)

Unnamed: 0,game_date,start_label
1,2016-08-13 14:00:00,1
2,2016-08-14 15:00:00,1
4,2016-08-13 14:00:00,1
5,2016-08-14 15:00:00,1
6,2016-08-15 19:00:00,1
7,2016-08-14 15:00:00,1
9,2016-08-14 15:00:00,1
11,2016-08-13 11:30:00,0
12,2016-08-13 14:00:00,1
13,2016-08-13 14:00:00,1


In [41]:
df_allseasons_clean.set_index('season_x', inplace=True)

df_allseasons_clean

season_x,name,position,assists,bonus,bps,clean_sheets,creativity,element,fixture,goals_conceded,...,value,was_home,yellow_cards,GW,name_season,club_name,form,game_date,game_weather,start_label
2016-17,Aaron Lennon,MID,0,0,6,0,0.3,142,3,0,...,60,True,0,1,Aaron Lennon_2016-17,EVE,0.578947,2016-08-13 14:00:00,3,1
2016-17,Aaron Ramsey,MID,0,0,5,0,4.9,16,8,3,...,80,True,0,1,Aaron Ramsey_2016-17,ARS,1.473684,2016-08-14 15:00:00,3,1
2016-17,Adam Forshaw,MID,0,0,3,0,1.3,286,6,1,...,45,True,1,1,Adam Forshaw_2016-17,MID,2.026316,2016-08-13 14:00:00,3,1
2016-17,Adam Lallana,MID,1,2,33,0,33.7,205,8,3,...,70,False,1,1,Adam Lallana_2016-17,LIV,3.657895,2016-08-14 15:00:00,3,1
2016-17,Adrián San Miguel del Castillo,GK,0,0,16,0,0.0,450,10,2,...,50,False,0,1,Adrián San Miguel del Castillo_2016-17,WHU,1.684211,2016-08-15 19:00:00,3,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2021-22,John Ruddy,GK,0,0,11,0,0.0,452,378,2,...,43,False,0,38,John Ruddy_2021-22,WOL,0.078947,2022-05-22 15:00:00,2,1
2021-22,Mohammed Salisu,DEF,0,0,13,0,0.0,351,377,4,...,45,False,0,38,Mohammed Salisu_2021-22,SOU,1.526316,2022-05-22 15:00:00,2,1
2021-22,N'Golo Kanté,MID,0,0,12,0,0.0,130,375,1,...,49,True,0,38,N'Golo Kanté_2021-22,CHE,2.078947,2022-05-22 15:00:00,2,1
2021-22,Matt Ritchie,DEF,0,0,3,0,0.0,292,374,0,...,49,False,0,38,Matt Ritchie_2021-22,NEW,0.526316,2022-05-22 15:00:00,2,1


In [42]:
df_allseasons_clean.drop(['game_date'], axis=1, inplace=True)

df_allseasons_clean

season_x,name,position,assists,bonus,bps,clean_sheets,creativity,element,fixture,goals_conceded,...,transfers_out,value,was_home,yellow_cards,GW,name_season,club_name,form,game_weather,start_label
2016-17,Aaron Lennon,MID,0,0,6,0,0.3,142,3,0,...,0,60,True,0,1,Aaron Lennon_2016-17,EVE,0.578947,3,1
2016-17,Aaron Ramsey,MID,0,0,5,0,4.9,16,8,3,...,0,80,True,0,1,Aaron Ramsey_2016-17,ARS,1.473684,3,1
2016-17,Adam Forshaw,MID,0,0,3,0,1.3,286,6,1,...,0,45,True,1,1,Adam Forshaw_2016-17,MID,2.026316,3,1
2016-17,Adam Lallana,MID,1,2,33,0,33.7,205,8,3,...,0,70,False,1,1,Adam Lallana_2016-17,LIV,3.657895,3,1
2016-17,Adrián San Miguel del Castillo,GK,0,0,16,0,0.0,450,10,2,...,0,50,False,0,1,Adrián San Miguel del Castillo_2016-17,WHU,1.684211,3,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2021-22,John Ruddy,GK,0,0,11,0,0.0,452,378,2,...,120,43,False,0,38,John Ruddy_2021-22,WOL,0.078947,2,1
2021-22,Mohammed Salisu,DEF,0,0,13,0,0.0,351,377,4,...,529,45,False,0,38,Mohammed Salisu_2021-22,SOU,1.526316,2,1
2021-22,N'Golo Kanté,MID,0,0,12,0,0.0,130,375,1,...,2468,49,True,0,38,N'Golo Kanté_2021-22,CHE,2.078947,2,1
2021-22,Matt Ritchie,DEF,0,0,3,0,0.0,292,374,0,...,253,49,False,0,38,Matt Ritchie_2021-22,NEW,0.526316,2,1


In [43]:
df_allseasons_clean.drop(['name_season', 'name', 'opponent_team', 'fixture', 'kickoff_time'], axis=1, inplace=True)

df_allseasons_clean.info()

<class 'pandas.core.frame.DataFrame'>
Index: 49231 entries, 2016-17 to 2021-22
Data columns (total 34 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   position           49231 non-null  object 
 1   assists            49231 non-null  int64  
 2   bonus              49231 non-null  int64  
 3   bps                49231 non-null  int64  
 4   clean_sheets       49231 non-null  int64  
 5   creativity         49231 non-null  float64
 6   element            49231 non-null  int64  
 7   goals_conceded     49231 non-null  int64  
 8   goals_scored       49231 non-null  int64  
 9   ict_index          49231 non-null  float64
 10  influence          49231 non-null  float64
 11  minutes            49231 non-null  int64  
 12  opp_team_name      49231 non-null  object 
 13  own_goals          49231 non-null  int64  
 14  penalties_missed   49231 non-null  int64  
 15  penalties_saved    49231 non-null  int64  
 16  red_cards          

In [44]:
# df_allseasons["club_name"].replace({"WHU": "West Ham", "EVE": "Everton", "ARS": "Arsenal", "WAT": "Watford", "MID": "Middlesbrough",
# "LIV": "Liverpool", "SOU": "Southampton", "MUN": "Man Utd", "HUL": "Hull", "BUR": "Burnley", "CRY": "Crystal Palace", "TOT": "Spurs",
# "LEI": "Leicester", "WBA": "West Brom", "CHE": "Chelsea", "BOU": "Bournemouth", "STK": "Stoke",
# "MCI": "Man City", "SWA": "Swansea", "SUN": "Sunderland", "HUD": "Huddersfield", "BHA": "Brighton",
# "NEW": "Newcastle", "FUL": "Fulham", "WOL": "Wolves", "CAR": "Cardiff", "AVL": "Aston Villa",
# "NOR": "Norwich", "SHU": "Sheffield Utd", "LEE": "Leeds", "BRE": "Brentford"}, inplace=True)

In [45]:
df_allseasons_clean.isna().sum()

position              0
assists               0
bonus                 0
bps                   0
clean_sheets          0
creativity            0
element               0
goals_conceded        0
goals_scored          0
ict_index             0
influence             0
minutes               0
opp_team_name         0
own_goals             0
penalties_missed      0
penalties_saved       0
red_cards             0
saves                 0
selected              0
team_a_score          0
team_h_score          0
threat                0
total_points          0
transfers_balance     0
transfers_in          0
transfers_out         0
value                 0
was_home              0
yellow_cards          0
GW                    0
club_name            37
form                 37
game_weather          0
start_label           0
dtype: int64

In [46]:
df_allseasons_clean.dropna(inplace=True)
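
Here `dropna` removes the 37 rows where the name-and-season key found no match, taking the frame from 49,231 to 49,194 rows. A sketch of the same step that restricts the check to the affected columns and records how many rows were lost is shown below; the toy values are invented.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "club_name": ["EVE", None, "ARS"],
        "form": [0.58, np.nan, 1.47],
        "minutes": [15, 30, 90],
    }
)

before = len(toy)
# Only rows missing club_name or form are dropped; other columns are not checked.
cleaned = toy.dropna(subset=["club_name", "form"])
dropped = before - len(cleaned)
```

Recording `dropped` makes it easier to notice if a future data refresh starts losing far more rows than expected.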

### ENCODING CATEGORICAL FEATURES

In [52]:
# Categorical features to encode:
# position
# opp_team_name
# club_name
# was_home

In [47]:
df_allseasons_clean.position = df_allseasons_clean.position.astype(str)
df_allseasons_clean.opp_team_name = df_allseasons_clean.opp_team_name.astype(str)
df_allseasons_clean.club_name = df_allseasons_clean.club_name.astype(str)
df_allseasons_clean.was_home = df_allseasons_clean.was_home.astype(str)

In [48]:
df_allseasons_clean.info()

<class 'pandas.core.frame.DataFrame'>
Index: 49194 entries, 2016-17 to 2021-22
Data columns (total 34 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   position           49194 non-null  object 
 1   assists            49194 non-null  int64  
 2   bonus              49194 non-null  int64  
 3   bps                49194 non-null  int64  
 4   clean_sheets       49194 non-null  int64  
 5   creativity         49194 non-null  float64
 6   element            49194 non-null  int64  
 7   goals_conceded     49194 non-null  int64  
 8   goals_scored       49194 non-null  int64  
 9   ict_index          49194 non-null  float64
 10  influence          49194 non-null  float64
 11  minutes            49194 non-null  int64  
 12  opp_team_name      49194 non-null  object 
 13  own_goals          49194 non-null  int64  
 14  penalties_missed   49194 non-null  int64  
 15  penalties_saved    49194 non-null  int64  
 16  red_cards          

In [49]:
df_allseasons_clean_dict = df_allseasons_clean.to_dict(orient='records')

In [50]:
df_allseasons_clean_dict[0]

{'position': 'MID',
 'assists': 0,
 'bonus': 0,
 'bps': 6,
 'clean_sheets': 0,
 'creativity': 0.3,
 'element': 142,
 'goals_conceded': 0,
 'goals_scored': 0,
 'ict_index': 0.9,
 'influence': 8.2,
 'minutes': 15,
 'opp_team_name': 'Spurs',
 'own_goals': 0,
 'penalties_missed': 0,
 'penalties_saved': 0,
 'red_cards': 0,
 'saves': 0,
 'selected': 13918,
 'team_a_score': 1.0,
 'team_h_score': 1.0,
 'threat': 0.0,
 'total_points': 1,
 'transfers_balance': 0,
 'transfers_in': 0,
 'transfers_out': 0,
 'value': 60,
 'was_home': 'True',
 'yellow_cards': 0,
 'GW': 1,
 'club_name': 'EVE',
 'form': 0.5789473684210527,
 'game_weather': 3,
 'start_label': 1}

In [51]:
# DictVectorizer.
from sklearn.feature_extraction import DictVectorizer

# Instantiate a DictVectorizer for the list of row dictionaries.
# sparse=False returns a dense NumPy array instead of a SciPy sparse matrix.
dv = DictVectorizer(sparse=False)

df_encoded = dv.fit_transform(df_allseasons_clean_dict)

In [52]:
df_encoded

array([[ 1.,  0.,  0., ...,  0.,  1.,  0.],
       [ 1.,  0.,  0., ...,  0.,  1.,  0.],
       [ 1.,  0.,  0., ...,  0.,  1.,  1.],
       ...,
       [38.,  0.,  0., ...,  0.,  1.,  0.],
       [38.,  0.,  0., ...,  1.,  0.,  0.],
       [38.,  0.,  0., ...,  1.,  0.,  0.]])
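
To make the encoding step concrete, here is a minimal pure-Python sketch of the idea behind `DictVectorizer`: string values become one-hot `name=value` columns, numeric values pass through unchanged, and columns are ordered by sorted feature name. The helper `one_hot_records` and the toy records are hypothetical illustrations, not scikit-learn's implementation.

```python
def one_hot_records(records):
    """Encode a list of dicts: strings -> one-hot columns, numbers -> as-is."""
    # Collect column names; a string value contributes a "key=value" column.
    names = set()
    for rec in records:
        for key, value in rec.items():
            names.add(f"{key}={value}" if isinstance(value, str) else key)

    feature_names = sorted(names)
    vocabulary = {name: i for i, name in enumerate(feature_names)}

    # Fill one dense row per record.
    rows = []
    for rec in records:
        row = [0.0] * len(feature_names)
        for key, value in rec.items():
            if isinstance(value, str):
                row[vocabulary[f"{key}={value}"]] = 1.0
            else:
                row[vocabulary[key]] = float(value)
        rows.append(row)
    return feature_names, rows


# Toy records (hypothetical), shaped like the row dictionaries above.
records = [
    {"position": "MID", "GW": 1, "minutes": 15},
    {"position": "GK", "GW": 2, "minutes": 90},
]
feature_names, rows = one_hot_records(records)
print(feature_names)
print(rows)
```

This is why the categorical columns were cast to `str` beforehand: only string values are expanded into separate indicator columns such as `position=MID`.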

In [53]:
# vocabulary
vocab = dv.vocabulary_

# show vocab
vocab

{'position=MID': 82,
 'assists': 1,
 'bonus': 2,
 'bps': 3,
 'clean_sheets': 4,
 'creativity': 36,
 'element': 37,
 'goals_conceded': 40,
 'goals_scored': 41,
 'ict_index': 42,
 'influence': 43,
 'minutes': 44,
 'opp_team_name=Spurs': 68,
 'own_goals': 76,
 'penalties_missed': 77,
 'penalties_saved': 78,
 'red_cards': 83,
 'saves': 84,
 'selected': 85,
 'team_a_score': 87,
 'team_h_score': 88,
 'threat': 89,
 'total_points': 90,
 'transfers_balance': 91,
 'transfers_in': 92,
 'transfers_out': 93,
 'value': 94,
 'was_home=True': 96,
 'yellow_cards': 97,
 'GW': 0,
 'club_name=EVE': 14,
 'form': 38,
 'game_weather': 39,
 'start_label': 86,
 'opp_team_name=Liverpool': 60,
 'club_name=ARS': 5,
 'opp_team_name=Stoke': 69,
 'club_name=MID': 22,
 'opp_team_name=Arsenal': 45,
 'was_home=False': 95,
 'club_name=LIV': 20,
 'position=GK': 81,
 'opp_team_name=Chelsea': 52,
 'club_name=WHU': 34,
 'position=DEF': 79,
 'opp_team_name=Leicester': 59,
 'club_name=HUL': 17,
 'position=FWD': 80,
 'opp_tea

In [55]:
df_transformed = pd.DataFrame(df_encoded, columns=dv.feature_names_)

df_transformed.tail()

Unnamed: 0,GW,assists,bonus,bps,clean_sheets,club_name=ARS,club_name=AVL,club_name=BHA,club_name=BOU,club_name=BRE,...,team_h_score,threat,total_points,transfers_balance,transfers_in,transfers_out,value,was_home=False,was_home=True,yellow_cards
49189,38.0,0.0,0.0,11.0,0.0,0.0,0.0,0.0,0.0,0.0,...,3.0,0.0,1.0,-23.0,97.0,120.0,43.0,1.0,0.0,0.0
49190,38.0,0.0,0.0,13.0,0.0,0.0,0.0,0.0,0.0,0.0,...,4.0,0.0,0.0,-131.0,398.0,529.0,45.0,1.0,0.0,0.0
49191,38.0,0.0,0.0,12.0,0.0,0.0,0.0,0.0,0.0,0.0,...,2.0,0.0,2.0,-390.0,2078.0,2468.0,49.0,0.0,1.0,0.0
49192,38.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,1.0,143.0,396.0,253.0,49.0,1.0,0.0,0.0
49193,38.0,0.0,0.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,...,4.0,0.0,2.0,455.0,683.0,228.0,59.0,1.0,0.0,0.0


In [56]:
df_transformed.shape

(49194, 98)

In [57]:
df_corr = df_transformed.corr()['total_points'].abs().sort_values(ascending=False).drop('total_points')
df_corr

bps              0.859812
bonus            0.779137
influence        0.734977
goals_scored     0.686719
ict_index        0.621022
                   ...   
club_name=SWA    0.002328
club_name=SUN    0.002304
team_h_score     0.001931
start_label      0.000347
club_name=BOU    0.000142
Name: total_points, Length: 97, dtype: float64

In [58]:
# Keep every feature whose absolute correlation with the target
# (total_points) exceeds 0.3.
features = df_corr[df_corr > 0.3].index.to_list()

In [59]:
features

['bps',
 'bonus',
 'influence',
 'goals_scored',
 'ict_index',
 'clean_sheets',
 'assists',
 'threat',
 'form',
 'minutes']
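
The filter above ranks features by the absolute value of their Pearson correlation with `total_points`, so a strongly negative relationship is kept as readily as a strongly positive one. A minimal pure-Python sketch of that idea follows; the helpers `pearson` and `filter_by_correlation` and the toy columns are hypothetical, and the notebook itself uses `DataFrame.corr()`.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def filter_by_correlation(columns, target, threshold):
    """Return feature names with |r| > threshold, strongest first."""
    scores = {name: abs(pearson(values, target)) for name, values in columns.items()}
    kept = [name for name, score in scores.items() if score > threshold]
    return sorted(kept, key=lambda name: scores[name], reverse=True)


# Toy data (hypothetical): 'a' tracks the target, 'b' moves against it,
# and 'c' is uncorrelated with it.
target = [1, 2, 3, 4, 5]
columns = {
    "a": [2, 4, 6, 8, 10],
    "b": [5, 3, 4, 1, 2],
    "c": [1, -1, 0, -1, 1],
}
selected = filter_by_correlation(columns, target, threshold=0.3)
print(selected)
```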

In [67]:
df_allseasons_clean.columns

Index(['position', 'assists', 'bonus', 'bps', 'clean_sheets', 'creativity',
       'element', 'goals_conceded', 'goals_scored', 'ict_index', 'influence',
       'minutes', 'opp_team_name', 'own_goals', 'penalties_missed',
       'penalties_saved', 'red_cards', 'saves', 'selected', 'team_a_score',
       'team_h_score', 'threat', 'total_points', 'transfers_balance',
       'transfers_in', 'transfers_out', 'value', 'was_home', 'yellow_cards',
       'GW', 'club_name', 'form', 'game_weather', 'start_label'],
      dtype='object')

In [77]:
features2 = df_transformed.drop('total_points', axis=1)
target = df_transformed['total_points']

### FEATURE SELECTION (VIF)

In [69]:
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Utility function to return the VIF value for each feature provided
def compute_vif(features, df):
    """
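    Return a DataFrame of features and their variance inflation factors (VIF).
    features: list of features whose multicollinearity is to be checked
    df: DataFrame of the data under review
    """
    # Copy so that adding the intercept column does not modify df
    # (and does not trigger a SettingWithCopyWarning).
    X = df[features].copy()
    # statsmodels does not add a constant term itself, so add one explicitly.
    X['intercept'] = 1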
    # Create dataframe to store vif values
    vif = pd.DataFrame()
    vif['Feature'] = X.columns
    vif['Vif Factor'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
    vif = vif[vif['Feature']!='intercept']
    return vif

In [70]:
def select_features(df, threshold):
    """
    Returns two objects:
    1. a DataFrame containing the remaining features and their variance inflation factors, and
    2. a pandas Index of the features that remain once every feature with a VIF at or above
       the supplied threshold has been dropped.
    df: the dataset whose multicollinearity is to be checked
    threshold: VIF value at or above which a feature is dropped.
    """
    data = df.copy()
    flag = True
    while flag:
        features_to_consider = data.columns
        # Call the compute_vif utility and sort the resulting VIF DataFrame from highest to lowest
        sorted_vif_df = (compute_vif(features_to_consider, data) 
                         .sort_values('Vif Factor', ascending=False).reset_index().drop('index', axis=1))

        # Get the highest vif value to compare against a threshold
        highest_vif = sorted_vif_df.at[0, 'Vif Factor']
        
        # Compare the highest VIF with the threshold supplied by the caller
        if highest_vif >= threshold: # or highest_vif=='inf':
            # Select the feature corresponding to the highest_vif (index 0 for both)
            feature = sorted_vif_df.at[0, 'Feature'] 
            # Drop the feature
            data.drop(feature, axis=1, inplace=True) 
            
        else:
            flag = False
    return sorted_vif_df, data.columns

In [71]:
vif_df, selected_features = select_features(df_transformed[features], 50)

In [63]:
vif_df

Unnamed: 0,Feature,Vif Factor
0,ict_index,8.620392
1,influence,6.891249
2,bps,5.714201
3,threat,5.047271
4,goals_scored,3.55474
5,bonus,2.144341
6,minutes,1.740977
7,assists,1.450045
8,clean_sheets,1.397063
9,form,1.23884


In [64]:
selected_features

Index(['bps', 'bonus', 'influence', 'goals_scored', 'ict_index',
       'clean_sheets', 'assists', 'threat', 'form', 'minutes'],
      dtype='object')
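
For intuition on the numbers above: the VIF of feature $j$ is $1 / (1 - R_j^2)$, where $R_j^2$ comes from regressing feature $j$ on the remaining features. With only two predictors, $R_j^2$ is simply their squared Pearson correlation, which allows a small pure-Python sketch. The helper `vif_two_features` and the toy data are hypothetical; the notebook itself relies on statsmodels.

```python
def vif_two_features(x1, x2):
    """VIF shared by two predictors: 1 / (1 - r^2), r = their Pearson correlation."""
    n = len(x1)
    mean1 = sum(x1) / n
    mean2 = sum(x2) / n
    cov = sum((a - mean1) * (b - mean2) for a, b in zip(x1, x2))
    ss1 = sum((a - mean1) ** 2 for a in x1)
    ss2 = sum((b - mean2) ** 2 for b in x2)
    r_squared = cov ** 2 / (ss1 * ss2)
    return 1.0 / (1.0 - r_squared)


# Toy predictors (hypothetical) with a correlation of 0.8.
x1 = [1, 2, 3, 4, 5]
x2 = [2, 1, 4, 3, 5]
vif = vif_two_features(x1, x2)
print(vif)
```

A VIF of 1 means a feature is uncorrelated with the others, and the value grows without bound as $R_j^2$ approaches 1. Since the largest VIF in the table above is below the threshold of 50 passed to `select_features`, no feature was dropped.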

### MODELING

In [65]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LassoCV
from sklearn.linear_model import RidgeCV
from sklearn import linear_model
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import r2_score, mean_squared_error

In [80]:
X_train, X_test, y_train, y_test = train_test_split(features2, target, test_size=0.3, shuffle=False)
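
With `shuffle=False`, the split above keeps rows in their original order, so the final 30% of rows become the test set. The evaluation loop in this section also prints a 'Validation set' block, which would need its own slice of the data; one way to carve it out is sketched below. The function `chronological_split` and the 60/10/30 proportions are assumptions for illustration, not part of the notebook.

```python
def chronological_split(rows, val_frac=0.1, test_frac=0.3):
    """Split rows, in order, into train, validation and test slices."""
    n = len(rows)
    n_test = round(n * test_frac)
    n_val = round(n * val_frac)
    n_train = n - n_val - n_test
    train = rows[:n_train]
    val = rows[n_train:n_train + n_val]
    test = rows[n_train + n_val:]
    return train, val, test


# Toy stand-in (hypothetical) for 20 rows ordered by season and game week.
rows = list(range(20))
train, val, test = chronological_split(rows)
print(len(train), len(val), len(test))
```

Keeping the slices in order means the model is always evaluated on later rows than the ones it was trained on.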


In [79]:
# Utility function
def evaluate_model(model, x, y):
    """
    Utility function to print the model performance (RMSE and R-squared scores).
    model: fitted model
    x: features of the evaluation set
    y: target values of the evaluation set
    """
    predicted = model.predict(x)  # get predictions
    # RMSE is the square root of the MSE; computing it with np.sqrt avoids the
    # `squared=False` argument, which newer scikit-learn releases no longer accept.
    RMSE_score = np.sqrt(mean_squared_error(y_true=y, y_pred=predicted))
    R2_score = r2_score(y, predicted)
    
    print('RMSE:', RMSE_score)
    print('R-Squared:', R2_score)
    print()
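
The two metrics printed by `evaluate_model` can be written out directly. This is a minimal pure-Python sketch of the standard definitions, with RMSE as the square root of the mean squared residual and $R^2 = 1 - SS_{res} / SS_{tot}$. The function names and toy values are illustrative only.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy values (hypothetical): three exact predictions and one that is off by 1.
y_true = [1, 2, 3, 4]
y_pred = [1, 2, 3, 5]
print(rmse(y_true, y_pred), r_squared(y_true, y_pred))
```

RMSE is expressed in the same unit as the target (fantasy points here), while $R^2$ compares the model against always predicting the mean and can turn negative when the model does worse than that baseline.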

In [92]:
# creating a dictionary of Regressors to be experimented on.
models_dict = {'Linear Reg': LinearRegression(), 'DT Regressor': DecisionTreeRegressor(random_state=0),
          'RF Regressor':RandomForestRegressor(random_state=0), 'Lasso': LassoCV(random_state=0), 'Ridge Regressor': RidgeCV(),
          'BayesianRidge': linear_model.BayesianRidge(),'Gradient Boost': GradientBoostingRegressor(random_state=0), 'SGDRegressor': SGDRegressor(random_state=0)
         }

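# Loop through the regressors, fitting each on the training data and evaluating it.
# Note: only a train/test split was made, so the 'Validation set' and 'Test set'
# blocks below both score X_test and print identical numbers.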
for key, model in models_dict.items():
    model.fit(X_train, y_train)
    print(f'Performance of {key} on Validation and Test:')
    print('=='*24)
    print ( 'Validation set:')
    print("**"*8)
    evaluate_model(model, X_test, y_test)
    print ( 'Test set:')
    print("**"*8)
    evaluate_model(model, X_test, y_test)

Performance of Linear Reg on Validation and Test:
Validation set:
****************
RMSE: 0.781438991087919
R-Squared: 0.933581631595072

Test set:
****************
RMSE: 0.781438991087919
R-Squared: 0.933581631595072

Performance of DT Regressor on Validation and Test:
Validation set:
****************
RMSE: 0.47392901718238484
R-Squared: 0.975569933545721

Test set:
****************
RMSE: 0.47392901718238484
R-Squared: 0.975569933545721

Performance of RF Regressor on Validation and Test:
Validation set:
****************
RMSE: 0.31502954475095446
R-Squared: 0.989205547501355

Test set:
****************
RMSE: 0.31502954475095446
R-Squared: 0.989205547501355

Performance of Lasso on Validation and Test:
Validation set:
****************
RMSE: 2.984109947045489
R-Squared: 0.031437413946643344

Test set:
****************
RMSE: 2.984109947045489
R-Squared: 0.031437413946643344

Performance of Ridge Regressor on Validation and Test:
Validation set:
****************
RMSE: 98.1575070930821
R-Sq

In [97]:
# creating a dictionary of Regressors to be experimented on.
models_dict1 = {'Linear Reg': LinearRegression(), 'DT Regressor': DecisionTreeRegressor(random_state=0),
          'RF Regressor':RandomForestRegressor(random_state=0), 'Lasso': LassoCV(random_state=0), 'Ridge Regressor': RidgeCV(),
          'BayesianRidge': linear_model.BayesianRidge(),'Gradient Boost': GradientBoostingRegressor(random_state=0), 'SGDRegressor': SGDRegressor(random_state=0)
         }

# Loop through the fresh regressors, fitting each on the VIF-selected features and evaluating it.
# Note: both the 'Validation set' and 'Test set' blocks below score X_test.
for key1, model1 in models_dict1.items():
    model1.fit(X_train[selected_features], y_train)
    print(f'Performance of {key1} on Validation and Test:')
    print('=='*24)
    print ( 'Validation set:')
    print("**"*8)
    evaluate_model(model1, X_test[selected_features], y_test)
    print ( 'Test set:')
    print("**"*8)
    evaluate_model(model1, X_test[selected_features], y_test)

Performance of Linear Reg on Validation and Test:
Validation set:
****************
RMSE: 0.8729997065711557
R-Squared: 0.9171053993198663

Test set:
****************
RMSE: 0.8729997065711557
R-Squared: 0.9171053993198663

Performance of DT Regressor on Validation and Test:
Validation set:
****************
RMSE: 0.9608855481562827
R-Squared: 0.899575108424597

Test set:
****************
RMSE: 0.9608855481562827
R-Squared: 0.899575108424597

Performance of RF Regressor on Validation and Test:
Validation set:
****************
RMSE: 0.7034263081577232
R-Squared: 0.9461810448365885

Test set:
****************
RMSE: 0.7034263081577232
R-Squared: 0.9461810448365885

Performance of Lasso on Validation and Test:
Validation set:
****************
RMSE: 0.925713231510437
R-Squared: 0.906792471806779

Test set:
****************
RMSE: 0.925713231510437
R-Squared: 0.906792471806779

Performance of Ridge Regressor on Validation and Test:
Validation set:
****************
RMSE: 0.8730010617626606
R-Squa