# Bolivian Football League Betting Odds

This project uses historical "Professional Bolivian Football League" data scraped from oddsportal.com with Selenium and webscraper.io. The data is cleaned, enriched with engineered features, and fed into predictive models in order to make better-informed sports betting decisions.


## 1. Data Cleaning & Feature Engineering

In [74]:
# Importing necessary packages
import pandas as pd
import numpy as np
import matplotlib.style as style
import matplotlib.pyplot as plt
import seaborn as sns
import string
import matplotlib.ticker as ticker
import re
from collections import Counter
style.use('fivethirtyeight')



from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
%matplotlib inline
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)


# Reading and concatenating datasets
odds_16_08 = pd.read_csv('odds_data/raw/odds_portal_2016_2008.csv', parse_dates=['date'])
odds_17 = pd.read_csv('odds_data/raw/odds_portal_2017.csv', parse_dates=['date'])
odds_18 = pd.read_csv('odds_data/raw/odds_portal_2018.csv', parse_dates=['date'])
odds_19 = pd.read_csv('odds_data/raw/odds_portal_2019.csv', parse_dates=['date'])

# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
odds = pd.concat([odds_16_08, odds_17, odds_18, odds_19])
odds.head()
odds.info()

Unnamed: 0,web-scraper-order,web-scraper-start-url,match,match-href,Bookmakers,Home Odds,Draw Odds,Away Odds,Payout,results,date,time
0,1574834581-19424,https://www.oddsportal.com/soccer/bolivia/liga...,Blooming - San Jose,https://www.oddsportal.com/soccer/bolivia/liga...,Unibet,2.29,3.15,2.8,90.0%,"Final result 0:0 (0:0, 0:0)",2013-08-04,00:00
1,1574838770-26010,https://www.oddsportal.com/soccer/bolivia/liga...,The Strongest - Ciclon,https://www.oddsportal.com/soccer/bolivia/liga...,Unibet,1.25,5.25,9.0,90.8%,"Final result 4:2 (2:0, 2:2)",2016-04-17,19:00
2,1574835238-21061,https://www.oddsportal.com/soccer/bolivia/liga...,Blooming - Real Potosi,https://www.oddsportal.com/soccer/bolivia/liga...,bwin,1.7,3.4,4.33,89.8%,"Final result 2:0 (1:0, 1:0)",2014-04-12,21:00
3,1574826415-14585,https://www.oddsportal.com/soccer/bolivia/liga...,Real Potosi - Blooming,https://www.oddsportal.com/soccer/bolivia/liga...,bwin,1.45,3.75,6.5,90.1%,Final result 3:1,2009-05-31,20:00
4,1574828821-15323,https://www.oddsportal.com/soccer/bolivia/liga...,The Strongest - Guabira,https://www.oddsportal.com/soccer/bolivia/liga...,Unibet,1.55,3.5,5.5,89.9%,"Final result 2:1 (1:1, 1:0)",2010-03-31,22:00


<class 'pandas.core.frame.DataFrame'>
Int64Index: 25447 entries, 0 to 4428
Data columns (total 12 columns):
web-scraper-order        25447 non-null object
web-scraper-start-url    25447 non-null object
match                    25447 non-null object
match-href               25447 non-null object
Bookmakers               22441 non-null object
Home Odds                22441 non-null float64
Draw Odds                22441 non-null float64
Away Odds                22441 non-null float64
Payout                   22441 non-null object
results                  25447 non-null object
date                     25447 non-null datetime64[ns]
time                     25447 non-null object
dtypes: datetime64[ns](1), float64(3), object(8)
memory usage: 2.5+ MB


The 'results' and 'match' columns each pack several values into a single string (full-time score, per-half scores, home and away team), so they can be split into separate variables during the initial cleaning of the dataset.
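
As a minimal sketch of that split, the snippet below parses one `results` string and one `match` string. The helper names `parse_result` and `split_match` are hypothetical and not part of the notebook, and the pattern assumes only the two layouts visible in the sample rows above ("Final result 4:2 (2:0, 2:2)" and "Final result 3:1"); any other layout returns `None`.

```python
import re

# Assumed layout: "Final result H:A" optionally followed by "(h1:a1, h2:a2)".
RESULT_RE = re.compile(
    r"Final result (\d+):(\d+)"             # full-time score
    r"(?: \((\d+):(\d+), (\d+):(\d+)\))?"   # optional per-half scores
)

def parse_result(text):
    """Return a dict of goal counts, or None if the text has another layout."""
    m = RESULT_RE.match(text)
    if m is None:
        return None
    home, away, h1_home, h1_away, h2_home, h2_away = m.groups()
    parsed = {"home_goals": int(home), "away_goals": int(away)}
    if h1_home is not None:  # per-half scores are present
        parsed.update(
            first_half_home=int(h1_home), first_half_away=int(h1_away),
            second_half_home=int(h2_home), second_half_away=int(h2_away),
        )
    return parsed

def split_match(match):
    """Split 'Home - Away' on the spaced hyphen into (home, away)."""
    home, away = match.split(" - ", 1)
    return home.strip(), away.strip()

print(parse_result("Final result 4:2 (2:0, 2:2)"))
print(parse_result("Final result 3:1"))
print(split_match("The Strongest - Ciclon"))
```

Matching whole numbers rather than fixed character positions also copes with double-digit scores; that is a choice made for this sketch, not something the notebook does.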

### Cleaning the dataframe

Splitting strings and dropping NaNs.


In [68]:
# Making column names easier to work with
odds.columns = [c.lower().replace(" ", "_").replace("-", "_") for c in odds.columns]
odds.columns

# Payout column
odds['payout'] = round(pd.to_numeric(odds.payout.str.replace('%',''), errors='coerce')/100,2)

# Dropping NA's
odds.dropna(subset=['bookmakers','payout'],axis=0, inplace=True)

# CREATING NEW COLUMNS

# Season Year
odds['season_year'] = odds.date.dt.year
# Teams
odds['home_team'] = [i[0].strip() for i in odds.match.str.split("-")]
odds['away_team'] = [i[1].strip() for i in odds.match.str.split("-")]

# Implied probabilities (1 / decimal odds; they still include the bookmaker's margin)
# ------> Could use these values to compare against model output
odds['implied_home_odds'] = round(1/odds.home_odds,3)
odds['implied_draw_odds'] = round(1/odds.draw_odds,3)
odds['implied_away_odds'] = round(1/odds.away_odds,3)

# Final Results
odds['final_result'] = [c[0:3] for c in [i.replace('Final result ', '') for i in odds.results]]
odds['home_goals'] = [i[0] for i in odds.final_result.str.split(':')]
odds['away_goals'] = [i[-1] for i in odds.final_result.str.split(':')]

# Half times results
odds['halftime'] = odds.results.str.extract(r"\((.*?)\)", expand=False).str.split(',')

odds.dropna(subset=['halftime'],inplace=True) # dropping rows that don't have half time info

odds['first_half_home'] = [c[0] for c in [i[0] for i in odds.halftime]]
odds['first_half_away'] = [c[2] for c in [i[0] for i in odds.halftime]]
odds['second_half_home'] = [c[0] for c in [i[1].strip() for i in odds.halftime]]
odds['second_half_away'] = [c[2] for c in [i[1].strip() for i in odds.halftime]]

# Fixing unique cases in score (.loc avoids chained assignment on a copy)
odds.loc[odds.home_goals == 'Wil', 'home_goals'] = 7
odds.loc[odds.away_goals == 'Wil', 'away_goals'] = 0

# Dropping rows that involve irrelevant teams (<10 matches played)
irrelevant_teams = ['Industrial Aviles', 'Bermejo']
odds = odds.loc[~(odds['home_team'].isin(irrelevant_teams) |
                  odds['away_team'].isin(irrelevant_teams))].copy()

# Dropping columns
odds.drop(['halftime','web_scraper_start_url',
           'results', 'web_scraper_order'], axis=1, inplace=True)
odds.shape

Index(['web_scraper_order', 'web_scraper_start_url', 'match', 'match_href',
       'bookmakers', 'home_odds', 'draw_odds', 'away_odds', 'payout',
       'results', 'date', 'time'],
      dtype='object')


(20776, 22)
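
The `implied_*_odds` columns created above are the reciprocals of the decimal odds. Because the bookmaker builds a margin into its prices, the three reciprocals add up to more than 1, and for the sample row used here the inverse of that sum agrees with the scraped payout figure. The sketch below uses the Unibet odds for Blooming - San Jose from the first output table; `remove_margin` is a hypothetical helper name, and proportional rescaling is only one of several ways to strip the margin.

```python
def remove_margin(home_odds, draw_odds, away_odds):
    """Return (raw implied probabilities, payout, rescaled probabilities)."""
    raw = [1 / o for o in (home_odds, draw_odds, away_odds)]
    booksum = sum(raw)                 # exceeds 1 because of the bookmaker margin
    payout = 1 / booksum               # share of stakes returned to bettors
    fair = [p / booksum for p in raw]  # proportional rescaling so the three sum to 1
    return raw, payout, fair

raw, payout, fair = remove_margin(2.29, 3.15, 2.80)
print([round(p, 3) for p in raw])   # [0.437, 0.317, 0.357]
print(round(payout, 2))             # 0.9
print([round(p, 3) for p in fair])
```

The rescaled values are the more natural ones to compare against model probabilities, since a model's three outcome probabilities also sum to 1.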

### Basic Feature Creation
Create simple features based on the goals scored throughout each individual match to facilitate feature engineering later on.

In [69]:
# Creating new features

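# Transforming values into numeric type where possible
# (errors='ignore' is deprecated in recent pandas; try/except keeps the same behaviour)
for col in odds.columns[odds.columns!='date']:
    try:
        odds[col] = pd.to_numeric(odds[col])
    except (ValueError, TypeError):
        pass  # leave non-numeric columns (teams, bookmakers, time, ...) unchanged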
    
# Goals allowed
odds['home_goals_allowed'] = odds['away_goals']
odds['away_goals_allowed'] = odds['home_goals']

# Total number of goals
odds['total_goals'] = odds.home_goals + odds.away_goals

# Over/Under 2.5 (1 if the match had more than 2.5 total goals, else 0)
odds['over'] = (odds['total_goals'] > 2.5).astype(int)


# First half goals
odds['total_first_half'] = odds.first_half_home + odds.first_half_away

# Second half goals
odds['total_second_half'] = odds.second_half_home + odds.second_half_away

# Penalties
#odds.results[odds.results.str.contains('penalties')]

# win_home_or_away (1 for home win, 2 for away win, 0 for draw)
def win_column(dataframe):
    if dataframe['home_goals'] > dataframe['away_goals']:
        return 1
    if dataframe['home_goals'] < dataframe['away_goals']:
        return 2
    else:
        return 0    
odds['win_home_or_away'] = odds.apply(lambda x: win_column(x) , axis=1)

# Winning team (team name, or 'Draw' when nobody won)
odds['winner'] = np.select(
    [odds['win_home_or_away'] == 1, odds['win_home_or_away'] == 2],
    [odds['home_team'], odds['away_team']],
    default='Draw')
        
        
# Creating 4 binary columns for home win/loss, and away win/loss
odds['home_win'] = (odds['win_home_or_away'] == 1).astype(int)
odds['away_win'] = (odds['win_home_or_away'] == 2).astype(int)
odds['home_loss'] = odds['away_win']  # the home team loses exactly when the away team wins
odds['away_loss'] = odds['home_win']  # the away team loses exactly when the home team wins
        

# First and second half winners (1 if the team outscored its opponent in that half)
odds['home_win_first_h'] = (odds['first_half_home'] > odds['first_half_away']).astype(int)
odds['away_win_first_h'] = (odds['first_half_away'] > odds['first_half_home']).astype(int)
odds['home_win_second_h'] = (odds['second_half_home'] > odds['second_half_away']).astype(int)
odds['away_win_second_h'] = (odds['second_half_away'] > odds['second_half_home']).astype(int)
        

# Creating points columns for home and away teams (3 for a win, 1 for a draw, 0 for a loss)
odds['home_points'] = odds['win_home_or_away'].map({1: 3, 0: 1, 2: 0})
odds['away_points'] = odds['win_home_or_away'].map({1: 0, 0: 1, 2: 3})

# Creating games_played column for home and away teams for feature engineering purposes
odds['home_games_played'] = 1
odds['away_games_played'] = 1

# Create 'upset' variable ---> 0 for no upset, 1 for upset
# Upset defined as the home team losing even though the bookmaker rated it
# more likely to win than the away team (higher implied probability)
odds['upset'] = ((odds['implied_home_odds'] > odds['implied_away_odds'])
                 & (odds['home_goals'] < odds['away_goals'])).astype(int)


        

# Saving the cleaned dataframe to .csv for future notebooks
# (relative path, matching the read_csv calls above)
odds.to_csv('odds_data/bolivian_football_odds_clean.csv', index=False)


In [70]:
odds.head(100)

Unnamed: 0,match,match_href,bookmakers,home_odds,draw_odds,away_odds,payout,date,time,season_year,home_team,away_team,implied_home_odds,implied_draw_odds,implied_away_odds,final_result,home_goals,away_goals,first_half_home,first_half_away,second_half_home,second_half_away,home_goals_allowed,away_goals_allowed,total_goals,over,total_first_half,total_second_half,win_home_or_away,winner,home_win,away_win,home_loss,away_loss,home_win_first_h,away_win_first_h,home_win_second_h,away_win_second_h,home_points,away_points,home_games_played,away_games_played,upset
0,Blooming - San Jose,https://www.oddsportal.com/soccer/bolivia/liga...,Unibet,2.29,3.15,2.8,0.9,2013-08-04,00:00,2013,Blooming,San Jose,0.437,0.317,0.357,0:0,0,0,0,0,0,0,0,0,0,0,0,0,0,Draw,0,0,0,0,0,0,0,0,1,1,1,1,0
1,The Strongest - Ciclon,https://www.oddsportal.com/soccer/bolivia/liga...,Unibet,1.25,5.25,9.0,0.91,2016-04-17,19:00,2016,The Strongest,Ciclon,0.8,0.19,0.111,4:2,4,2,2,0,2,2,2,4,6,1,2,4,1,The Strongest,1,0,0,1,1,0,0,0,3,0,1,1,0
2,Blooming - Real Potosi,https://www.oddsportal.com/soccer/bolivia/liga...,bwin,1.7,3.4,4.33,0.9,2014-04-12,21:00,2014,Blooming,Real Potosi,0.588,0.294,0.231,2:0,2,0,1,0,1,0,0,2,2,0,1,1,1,Blooming,1,0,0,1,1,0,1,0,3,0,1,1,0
4,The Strongest - Guabira,https://www.oddsportal.com/soccer/bolivia/liga...,Unibet,1.55,3.5,5.5,0.9,2010-03-31,22:00,2010,The Strongest,Guabira,0.645,0.286,0.182,2:1,2,1,1,1,1,0,1,2,3,1,2,1,1,The Strongest,1,0,0,1,0,0,1,0,3,0,1,1,0
5,Blooming - Oriente Petrolero,https://www.oddsportal.com/soccer/bolivia/liga...,bwin,2.4,3.1,2.7,0.9,2012-04-29,22:30,2012,Blooming,Oriente Petrolero,0.417,0.323,0.37,2:2,2,2,2,2,0,0,2,2,4,1,4,0,0,Draw,0,0,0,0,0,0,0,0,1,1,1,1,0
6,Wilstermann - Blooming,https://www.oddsportal.com/soccer/bolivia/liga...,William Hill,1.73,3.4,4.0,0.89,2015-08-23,21:15,2015,Wilstermann,Blooming,0.578,0.294,0.25,3:2,3,2,1,0,2,2,2,3,5,1,1,4,1,Wilstermann,1,0,0,1,1,0,0,0,3,0,1,1,0
7,Nacional Potosi - The Strongest,https://www.oddsportal.com/soccer/bolivia/liga...,18bet,3.95,3.56,1.71,0.89,2015-04-30,00:00,2015,Nacional Potosi,The Strongest,0.253,0.281,0.585,4:3,4,3,2,1,2,2,3,4,7,1,3,4,1,Nacional Potosi,1,0,0,1,1,0,0,0,3,0,1,1,0
9,The Strongest - Oriente Petrolero,https://www.oddsportal.com/soccer/bolivia/liga...,William Hill,2.25,3.4,2.88,0.92,2015-02-22,00:00,2015,The Strongest,Oriente Petrolero,0.444,0.294,0.347,4:3,4,3,0,1,4,2,3,4,7,1,1,6,1,The Strongest,1,0,0,1,0,1,1,0,3,0,1,1,0
11,U. Sucre - San Jose,https://www.oddsportal.com/soccer/bolivia/liga...,bet365,2.0,3.3,3.6,0.92,2011-12-01,00:00,2011,U. Sucre,San Jose,0.5,0.303,0.278,0:1,0,1,0,0,0,1,1,0,1,0,0,1,2,San Jose,0,1,1,0,0,0,0,1,0,3,1,1,1
12,Aurora - San Jose,https://www.oddsportal.com/soccer/bolivia/liga...,bet365,1.85,3.25,3.75,0.9,2010-11-07,20:00,2010,Aurora,San Jose,0.541,0.308,0.267,4:1,4,1,4,0,0,1,1,4,5,1,4,1,1,Aurora,1,0,0,1,1,0,0,1,3,0,1,1,0
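
The derived columns can be sanity-checked against rows of the output above. The sketch below rebuilds a few of them with vectorized pandas and NumPy operations on three matches copied from that table (indices 1, 0 and 11). The frame name `toy` and the reduced column set are assumptions made for illustration, and the upset rule mirrors the notebook's definition: the home team is priced as more likely to win than the away team, yet loses.

```python
import numpy as np
import pandas as pd

# Three matches copied from the table above; only the columns needed here.
toy = pd.DataFrame({
    "home_team": ["The Strongest", "Blooming", "U. Sucre"],
    "away_team": ["Ciclon", "San Jose", "San Jose"],
    "home_odds": [1.25, 2.29, 2.00],
    "away_odds": [9.00, 2.80, 3.60],
    "home_goals": [4, 0, 0],
    "away_goals": [2, 0, 1],
})

toy["total_goals"] = toy["home_goals"] + toy["away_goals"]
toy["over"] = (toy["total_goals"] > 2.5).astype(int)

# 1 = home win, 2 = away win, 0 = draw
diff = toy["home_goals"] - toy["away_goals"]
toy["win_home_or_away"] = np.select([diff > 0, diff < 0], [1, 2], default=0)

# 3 points for a win, 1 for a draw, 0 for a loss
toy["home_points"] = toy["win_home_or_away"].map({1: 3, 0: 1, 2: 0})
toy["away_points"] = toy["win_home_or_away"].map({1: 0, 0: 1, 2: 3})

# Upset: home team has the higher implied probability but loses
home_favoured = (1 / toy["home_odds"]) > (1 / toy["away_odds"])
toy["upset"] = (home_favoured & (diff < 0)).astype(int)

print(toy[["home_team", "away_team", "over", "win_home_or_away",
           "home_points", "away_points", "upset"]])
```

For these three rows the computed values agree with the `over`, `win_home_or_away`, `home_points`, `away_points` and `upset` columns shown in the table.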
