In [382]:
import requests
import lxml.html as lh
import pandas as pd
from bs4 import BeautifulSoup
import numpy as np
from sklearn.metrics import mean_squared_error

# Predicting the Winner for 2022

We have taken inspiration from the winner of the first Kaggle competition, who tried to [predict](https://web.archive.org/web/20160629205138/http://blog.kaggle.com/2010/06/09/computer-scientist-jure-zbontars-method-for-winning-the-eurovision-challenge/) the scores awarded by each country in the 2010 Eurovision Contest. 

We used the information on past voting and the approximate betting odds to predict the rankings for 2022. 

## Learning Voting Patterns

To predict the rankings, we use past voting data from 1975-2021. We calculate the average points each country has awarded to every other country, using both jury and televoting numbers, to reveal voting patterns. Since we only want to predict the winner, our analysis uses only the points awarded in each year's final.

In [1004]:
#cleaning 1975-2022 data
esc_scores = pd.read_csv("https://github.com/jackmheller/modernDataAnalytics/blob/main/Data/eurovision_song_contest_1975_2022.csv?raw=true")
# esc_scores = pd.concat([esc_scores,esc_2021,esc_2022],ignore_index=True)

esc_scores_final = esc_scores.loc[esc_scores['(semi-) final'] == 'f']   #using only finals scores
esc_scores_final =  esc_scores_final.loc[esc_scores_final['Points'] != 0]  #removing 0 points 
esc_scores_final.replace(['F.Y.R. Macedonia', 'North Macedonia'], 'Macedonia', inplace = True)
esc_scores_final.replace(['The Netherands'], 'The Netherlands', inplace = True)
esc_scores_final.replace(['The Netherlands'], 'Netherlands', inplace = True)

#calculating rankings for each year 
total_scores = esc_scores_final.groupby(['Year','To country']).sum().reset_index().sort_values(by=['Year','Points'],ascending=False)

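true_rankings={}
#contest years with odds data (2020 was cancelled); urls_years is defined again in the scraping cell below
urls_years = [2022,2021,2019,2018,2017,2016,2015]
for year in urls_years:
    true_rankings[year] = total_scores[total_scores['Year']==year]['To country'].tolist()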

#calculate average scores 
avg_scores = esc_scores_final.groupby(['From country','To country']).mean(numeric_only=True).reset_index()   #numeric_only avoids errors on text columns
# avg_scores["Final Scores_with Odds"] = ' '
avg_scores = avg_scores.drop(columns=['Year'])


### Average scores awarded by Greece

Based on past voting data, Greece has awarded the most points on average to Cyprus, a pattern that may reflect cultural and geographical ties.

In [1009]:
avg_scores[avg_scores['From country']=='Greece'].sort_values(by='Points',ascending=False)

Unnamed: 0.1,From country,To country,Unnamed: 0,Points,Duplicate
805,Greece,Cyprus,22343.366667,11.033333,
815,Greece,Ireland,4222.727273,9.0,
824,Greece,Monaco,962.5,8.0,
795,Greece,Albania,33161.083333,7.916667,
803,Greece,Bulgaria,38368.833333,7.333333,
796,Greece,Armenia,30294.8,7.1,
833,Greece,Serbia & Montenegro,14920.0,7.0,
809,Greece,Finland,10432.8,6.5,
807,Greece,Denmark,10101.333333,6.5,
831,Greece,San Marino,49834.0,6.5,


## Using Betting Odds 

Betting odds (here, decimal odds) give the total payout, stake included, for every $1 wagered. The lower a country's betting odds, the higher its implied probability of winning. The odds were scraped from eurovisionworld.com/odds/eurovision; we use only the data from 2015 onwards.
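
To make the link between decimal odds and win probability concrete, here is a minimal sketch. It assumes the usual convention that the implied probability is the reciprocal of the decimal odds, and it rescales the values so that they sum to one, since bookmaker margins make the raw reciprocals sum to more than one. The function name `implied_probabilities` and the sample odds are illustrative only and are not part of the notebook.

```python
def implied_probabilities(decimal_odds):
    """Convert decimal odds to normalized implied win probabilities."""
    # Reciprocal of decimal odds = raw implied probability
    raw = {country: 1.0 / odd for country, odd in decimal_odds.items()}
    # Rescale so the probabilities sum to 1 (removes the bookmaker margin)
    total = sum(raw.values())
    return {country: p / total for country, p in raw.items()}

# Hypothetical three-country market, for illustration only
sample_odds = {"A": 1.25, "B": 5.0, "C": 10.0}
probs = implied_probabilities(sample_odds)
for country, p in probs.items():
    print(f"{country}: {p:.3f}")
```

The country with the lowest odds ends up with the largest share, which is the ordering the rest of the analysis relies on.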

In [1010]:
#scraping the websites

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.64 Safari/537.36'}
urls=['https://eurovisionworld.com/odds/eurovision','https://eurovisionworld.com/odds/eurovision-2021','https://eurovisionworld.com/odds/eurovision-2019','https://eurovisionworld.com/odds/eurovision-2018','https://eurovisionworld.com/odds/eurovision-2017','https://eurovisionworld.com/odds/eurovision-2016','https://eurovisionworld.com/odds/eurovision-2015']
urls_years = [2022,2021,2019,2018,2017,2016,2015]

def parse_table(url_list):
    
    odds_data = {}
    try:
        for url_id,url in enumerate(url_list):
            #Create a handle, page, to handle the contents of the website
            page = requests.get(url,headers=headers)

            #extracting table 

            # parser-lxml = Change html to Python friendly format
            # Obtain page's information
            soup = BeautifulSoup(page.text, 'lxml')

            # Obtain information from tag <table>
            table1 = soup.find('table', {"class":'o_table'})

            # Create a for loop to fill mydata
            data_set = []
            for i_j,j in enumerate(table1.find_all('tr')[1:]):
                if i_j>0:
                    row_data = j.find_all('td')
                    row = [i.text for i in row_data]
                    data_set.append(row)

            # Obtain every title of columns with tag <th>
            table_headers = []
            for i in table1.find_all('th'):
                title = i.text
                table_headers.append(title)

            #modifying headers manually (the leading columns differ by year)
            year_token = url.split("-")[-1]
            #default layout: current page and 2018 onwards
            header_prefix = ["Rank","Country","Song Name","Winning Chance"]
            if year_token == "2017":
                header_prefix = ["Rank","Country","Song Name"]
            elif year_token in ("2016","2015"):
                header_prefix = ["Rank","Song Name"]
            headers_final = header_prefix + table_headers[:table_headers.index('BETFAIREXCHANGE')+1]

            final_oddsdata = pd.DataFrame(data_set) #converting to df
            final_oddsdata.columns = headers_final #setting header names
            final_oddsdata.drop(final_oddsdata.tail(1).index,inplace=True) #removing last row

            #obtaining country name
            i=0
            for n in final_oddsdata['Song Name']:
                final_oddsdata.at[i, 'Country'] = n.split()[0]
                i+=1

            #manually correcting country names
            i=0
            for country in final_oddsdata['Country']:
                if country == 'UKUnited':
                    final_oddsdata.at[i, 'Country'] = 'United Kingdom'
                if country == 'Netherlands':
                    final_oddsdata.at[i, 'Country'] = 'The Netherlands'
                if country == 'Czech':
                    final_oddsdata.at[i, 'Country'] = 'Czech Republic'
                i+=1

            odds_data[urls_years[url_id]] = final_oddsdata

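    except Exception as e:
        #report which page failed and why, instead of silently swallowing the error
        print("Failed to parse", url, e)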
    return odds_data
    
all_urls_parsed = parse_table(urls) 

### Viewing 2022 Betting Odds

Ukraine had the lowest betting odds, 1.25 on BET365, which meant it had the highest probability of winning.
At the other end, Lithuania had the lowest probability of winning, with betting odds of 1001.

In [1011]:
all_urls_parsed[2022]   #change year here to see different years' odds

Unnamed: 0,Rank,Country,Song Name,Winning Chance,BET365,UNIBET,888SPORT,BETFRED,COOLBET,WILLIAMHILL,...,BETWAY,BETFAIRSPORT,BOYLESPORTS,SMARKETS,SKYBET,10BET,COMEON,BETSTARS,BWIN,BETFAIREXCHANGE
0,1,Ukraine,Ukraine Kalush Orchestra - Stefania,62%,1.25,1.22,1.29,1.25,1.34,1.33,...,1.29,1.25,1.25,1.34,1.29,1.3,1.3,2.2,1.3,1.34
1,2,Sweden,Sweden Cornelia Jakobs - Hold Me Closer,14%,5.0,5.1,7.25,6.5,6.0,7.0,...,8.0,5.5,5.5,7.6,7.0,4.3,4.3,6.5,5.0,8.0
2,3,Spain,Spain Chanel - SloMo,6%,10.0,13.0,13.0,15.0,13.0,8.0,...,15.0,14.0,10.0,19.0,11.0,10.0,10.0,31.0,10.0,18.0
3,4,United Kingdom,UKUnited Kingdom Sam Ryder - Space Man,6%,13.0,15.0,8.75,13.0,15.0,13.0,...,7.0,10.0,11.0,22.0,12.0,12.0,12.0,18.0,15.0,25.0
4,5,Italy,Italy Mahmood & Blanco - Brividi,3%,41.0,51.0,14.0,34.0,37.0,34.0,...,15.0,23.0,41.0,55.0,34.0,41.0,41.0,4.33,34.0,55.0
5,6,Norway,Norway Subwoolfer - Give That Wolf a Banana,1%,101.0,101.0,73.0,67.0,81.0,101.0,...,67.0,23.0,41.0,200.0,67.0,41.0,41.0,21.0,67.0,210.0
6,7,Poland,Poland Ochman - River,1%,101.0,101.0,58.0,67.0,81.0,67.0,...,51.0,81.0,51.0,300.0,81.0,67.0,67.0,21.0,51.0,300.0
7,8,Serbia,Serbia Konstrakta - In Corpore Sano,1%,67.0,101.0,51.0,101.0,131.0,81.0,...,51.0,51.0,51.0,300.0,81.0,41.0,41.0,81.0,67.0,300.0
8,9,Greece,Greece Amanda Tenfjord - Die Together,1%,101.0,101.0,69.0,101.0,101.0,67.0,...,51.0,56.0,81.0,260.0,81.0,67.0,67.0,17.0,101.0,270.0
9,10,Moldova,Moldova Zdob şi Zdub & Advahov Brothers - Tre...,1%,151.0,81.0,101.0,51.0,81.0,81.0,...,67.0,71.0,67.0,230.0,51.0,67.0,67.0,151.0,101.0,160.0


### Modifying the Odds

To make the odds comparable to the average points, we transform them with the following formula, where $\ln$ is the natural logarithm:

$\text{modified odds} = \frac{a}{\ln(\text{odds})} + b$

$a$ and $b$ were chosen experimentally through cross-validation as detailed below.
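
As a quick illustration of the transformation, the sketch below applies it to a few decimal odds. It assumes the natural logarithm and uses $a=9$, $b=-3$ purely as example values; `modify_odds` is a hypothetical helper name, not a function from the notebook. Odds of exactly 1 would make the logarithm zero, so the sketch rejects them.

```python
import math

def modify_odds(odd, a, b):
    """Map decimal odds onto a points-like scale: a / ln(odd) + b."""
    if odd <= 1.0:
        raise ValueError("decimal odds must be greater than 1")
    return a / math.log(odd) + b

# Example values only: favourites (low odds) get large scores,
# long shots (high odds) approach b from above when a > 0.
for odd in (1.25, 5.0, 10.0, 1001.0):
    print(odd, round(modify_odds(odd, a=9, b=-3), 2))
```

Because the score grows very quickly as the odds approach 1, a strong favourite can dominate the historical voting averages it is added to.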

In [1012]:
#function to modify odds

def calc_odds(urls,a,b):
    
    #using only BET365
    odds_dict ={}
    for url_id,url in enumerate(urls):
        final_oddsdata = all_urls_parsed[urls_years[url_id]]
        odds = final_oddsdata[['Country','BET365']].copy()   #explicit copy so the scraped table is never modified
        odds.columns = ['Country','Odds']

        #converting odds to be comparable w avg points 
        recalc_odds=[]
        for odd in odds.Odds:
            o=float(odd)
            recalc_odds.append((1/np.log(o))*a + b)   
            
        odds.insert(2, "Modified_Odds", recalc_odds)

        odds_dict[urls_years[url_id]] = odds
        
    return odds_dict

### Evaluation Metrics

We evaluate our predictions with the Mean Squared Error (MSE) between the actual and predicted positions of the top 5 countries in each year's actual ranking.

To choose the parameters $a$ and $b$, we iterate over every value within a range and calculate the MSE at each step. The values that give the minimum MSE are used for our final predictions.
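
The metric can be sketched in a few lines of plain Python. This is an illustrative version under the assumption that positions are zero-based list indices and that countries missing from the prediction are skipped; `top_k_rank_mse` and the rankings are hypothetical.

```python
def top_k_rank_mse(true_ranking, predicted_ranking, k=5):
    """MSE between actual and predicted positions of the top-k actual finishers."""
    # Keep only countries that appear in both rankings, in actual finishing order
    pairs = [
        (true_pos, predicted_ranking.index(country))
        for true_pos, country in enumerate(true_ranking)
        if country in predicted_ranking
    ][:k]
    return sum((t - p) ** 2 for t, p in pairs) / len(pairs)

# Hypothetical rankings, for illustration only
actual = ["A", "B", "C", "D", "E", "F"]
predicted = ["A", "C", "B", "D", "F", "E"]
print(top_k_rank_mse(actual, predicted))
```

Since the errors are squared, a single country that is badly misplaced inflates the score far more than several small swaps.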

### Cross-Validation
For each year of the contest, the average scores are calculated without that year's data, so that we can evaluate the model's performance on unseen data.
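
The leave-one-year-out idea can be sketched without pandas as follows. The vote records, the country labels and the helper name `pair_averages` are hypothetical and serve only to show how the held-out year is excluded before averaging.

```python
from collections import defaultdict

def pair_averages(votes, held_out_year):
    """Average points per (from, to) pair, ignoring one year's votes."""
    totals = defaultdict(lambda: [0.0, 0])  # pair -> [sum of points, count]
    for year, giver, receiver, points in votes:
        if year == held_out_year:
            continue  # leave this year out so it stays unseen
        totals[(giver, receiver)][0] += points
        totals[(giver, receiver)][1] += 1
    return {pair: s / n for pair, (s, n) in totals.items()}

# Hypothetical votes: (year, from country, to country, points)
votes = [
    (2019, "X", "Y", 12),
    (2021, "X", "Y", 10),
    (2022, "X", "Y", 8),
]
print(pair_averages(votes, held_out_year=2022))
```

Repeating this for every year gives one set of averages per held-out year, each of which is scored against that year's actual ranking.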

In [1014]:
#function to calculate MSE of predicted rankings
def MSE_predict(urls_years,odds_dict,cross_validating = True):
    
    predictions = {}
    total_MSE=0
    for year in urls_years:

        #removing current year from data if cross-validating
        if cross_validating:
            esc_scores_temp = esc_scores_final[esc_scores_final['Year']!=year]
        else:
            esc_scores_temp = esc_scores_final[esc_scores_final['Year']<2022]
        #calculate average scores 
        avg_scores_temp = esc_scores_temp.groupby(['From country','To country']).mean(numeric_only=True).reset_index()
        avg_scores_temp["Final Scores_with Odds"] = ' '


        odds = odds_dict[year]
        #adding avg scores received with odds for each country
        for i,p in enumerate(avg_scores_temp['Points']):
            country = avg_scores_temp.loc[i,'To country']
            if country in list(odds.Country):
                odd = odds['Modified_Odds'][odds['Country']==country]
                avg_scores_temp.at[i,"Final Scores_with Odds"] = p + float(odd.iloc[0])   #take the single matching value
            else:
                avg_scores_temp.at[i,"Final Scores_with Odds"] = 0

        avg_scores_temp = avg_scores_temp.astype({'Final Scores_with Odds': 'float64'})

        #summing all points received for each country
        final_score_prediction = avg_scores_temp.groupby(['To country']).sum()
        final_score_prediction = final_score_prediction.sort_values(by=['Final Scores_with Odds'],ascending=False)
        
        predictions[year] = final_score_prediction.index.tolist()
        y_true = []
        y_pred = []
        for i,t in enumerate(true_rankings[year]):
            if t in final_score_prediction.index.tolist():
                y_true.append(i)
                y_pred.append(final_score_prediction.index.tolist().index(t))
        mse = mean_squared_error(y_true[:5], y_pred[:5])   #mse for top 5 rankings only 
        
        if not cross_validating:
            print("MSE for {} = {}".format(year,mse ))
        total_MSE+=mse
        
    if not cross_validating:  
        print("Total MSE for all years = {}".format(total_MSE))
        print("Average MSE for all years = {}".format(total_MSE/len(urls_years)))
    
    return total_MSE,predictions

In [1017]:
#cross validation to find odds parameters:

all_parameters=[]
for a in range(-10,10):
    for b in np.arange(-3,3,.2):
        odds_dict = calc_odds(urls,a,b)
        total_MSE,p = MSE_predict(urls_years[1:],odds_dict)      

        #note: the average below divides by all 7 years, although 2022 is excluded from this search
        all_parameters.append([a,b,total_MSE,total_MSE/len(urls_years)])

### Choosing $a$ and $b$ 

We found that $a=9$ and $b=-3$ gave the lowest total MSE. Both values lie on the edge of the search grid, so a wider range might lower the MSE further.

In [1015]:
print("Parameters with minimum total MSE-")

print("[a, b, Total MSE, Average MSE] = {}".format(all_parameters[np.argmin(np.array(all_parameters),axis=0)[2]]))

Parameters with minimum total MSE-
[a, b, Total MSE, Average MSE] = [9, -3.0, 108.4, 15.485714285714286]
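
The same kind of search can be written compactly with `itertools.product` and `min`. The sketch below uses a stand-in objective, `toy_mse`, in place of the notebook's `MSE_predict`, so the numbers it produces say nothing about the real model; only the search pattern is the point. The grid mirrors the ranges used above.

```python
import itertools

def toy_mse(a, b):
    """Stand-in objective (hypothetical); the notebook would call MSE_predict here."""
    return (a - 4) ** 2 + (b - 1.0) ** 2

def grid_search(objective, a_values, b_values):
    """Return (a, b, score) for the parameter pair with the lowest score."""
    results = [(a, b, objective(a, b)) for a, b in itertools.product(a_values, b_values)]
    return min(results, key=lambda row: row[2])

a_grid = range(-10, 10)
b_grid = [round(-3 + 0.2 * i, 1) for i in range(30)]  # -3.0, -2.8, ..., 2.8
best_a, best_b, best_score = grid_search(toy_mse, a_grid, b_grid)
print(best_a, best_b, best_score)
```

Keeping the full `results` list, rather than only the minimum, would also make it easy to inspect how sensitive the score is to each parameter.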


In [901]:
odds_dict = calc_odds(urls,9,-3)

### MSE for each year's predictions

The model predicts the rankings for most years with relatively low MSE, except for 2017 and 2018.

In 2017, the error was driven by Moldova's unexpected 3rd place, for which the model predicted a rank of 15.
In 2018, it was driven by Austria's 3rd place, for which the model predicted a rank of 21.

In [902]:
total_MSE,predictions = MSE_predict(urls_years,odds_dict,cross_validating=False)

MSE for 2022 = 5.6
MSE for 2021 = 2.8
MSE for 2019 = 1.6
MSE for 2018 = 73.0
MSE for 2017 = 33.8
MSE for 2016 = 11.6
MSE for 2015 = 0.0
Total MSE for all years = 128.4
Average MSE for all years = 18.342857142857145


## 2022 Predictions

Our model correctly predicts the winner for 2022, Ukraine. However, this is not surprising, since the betting odds already gave Ukraine a very high probability of winning.

Below, we see the predicted Top 5 rankings for Eurovision 2022. 

In [917]:
print("Predicted Top 5:", predictions[2022][:5])
print("Actual Top 5:", true_rankings[2022][:5])

Predicted Top 5: ['Ukraine', 'Sweden', 'Italy', 'United Kingdom', 'Spain']
Actual Top 5: ['Ukraine', 'United Kingdom', 'Spain', 'Sweden', 'Serbia']
