# DAZN Sport Similarity Task

Jack Boylan

In [1]:
from google.colab import drive
import os
drive.mount('/content/drive')
path = "/content/drive/My Drive/DAZN_Task_jackboylan/" # directory where I have downloaded data to

os.chdir(path)

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


# Import Necessary Libraries

In [2]:

import warnings
warnings.filterwarnings('ignore')

# import libraries
import re
import sys
import copy
import time
import datetime
import scipy.stats  # explicit submodule import so scipy.stats.zscore is available below
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# pd.options.display.max_rows = None
# pd.options.display.max_columns = None


# Load Data

In [3]:
# load data
df = pd.read_csv('DAZN_Data_Scientist_Homework_Dataset.csv', sep='~')

# summarize shape
print(df.shape)

print(df.dtypes)


(37132, 15)
viewer_id                   int64
stream_start_time          object
stream_end_time            object
live_or_on_demand          object
customer_country           object
match_title                object
match_date                 object
competition_name           object
sport_name                 object
device_category            object
home_contestant_country    object
away_contestant_country    object
tournament_start_date      object
tournament_end_date        object
venue_country              object
dtype: object


In [49]:
df.head()

Unnamed: 0,viewer_id,stream_start_time,stream_end_time,live_or_on_demand,customer_country,match_title,match_date,competition_name,sport_name,device_category,home_contestant_country,away_contestant_country,tournament_start_date,tournament_end_date,venue_country,DayOfWeek,Month,stream_length_minutes,tournament_length_days,clean_text,card2vec
0,5049,2021-06-11 09:41:37,2021-06-11 12:02:18,Live,Japan,Lions v Dragons,2021-06-11 08:45:00,NPB,Baseball,Web,Japan,Japan,2021-03-02,2021-11-29,Japan,4,6,140.683333,272.0,5049 live japan lions v dragons npb web japan ...,"[0.07975250482559204, 0.08638769388198853, -0...."
1,9062,2021-06-13 20:13:36,2021-06-13 20:14:05,Live,Canada,Brazil vs. Canada,2021-06-13 19:10:00,FIBA AmeriCup (W),Basketball,Unknown,Brazil,Canada,2021-06-11,2021-06-19,Puerto Rico,6,6,0.483333,8.0,9062 live canada brazil vs. canada fiba americ...,"[0.16239094734191895, 0.004213003441691399, -0..."
2,9062,2021-05-30 12:53:38,2021-05-30 12:53:56,Live,Canada,MLB Network,2018-01-01 00:00:00,MLB,Baseball,Unknown,Unknown,Unknown,2018-01-28,2018-12-30,England,0,1,0.3,336.0,9062 live canada mlb network mlb unknown unkno...,"[0.029683837667107582, -0.043965209275484085, ..."
3,9062,2021-05-30 20:24:00,2021-05-30 20:24:34,Live,Canada,MLB Network,2018-01-01 00:00:00,MLB,Baseball,Unknown,Unknown,Unknown,2018-01-28,2018-12-30,England,0,1,0.566667,336.0,9062 live canada mlb network mlb unknown unkno...,"[0.027660829946398735, -0.05212020501494408, -..."
4,211,2021-05-01 20:01:47,2021-05-01 20:07:08,Live,Italy,Milan v Benevento,2021-05-01 18:45:00,Serie A,Soccer,Mobile,Italy,Italy,2020-09-19,2021-05-23,Italy,5,5,5.35,246.0,211 live italy milan v benevento serie a mobil...,"[0.00797139760106802, -0.02422248013317585, -0..."


# Data Exploration/Cleaning

We will first take a look at the data and check for things like missing values, extreme outliers, etc.

In [5]:
df.nunique()

viewer_id                  10000
stream_start_time          36169
stream_end_time            35822
live_or_on_demand              3
customer_country               7
match_title                 2044
match_date                  1178
competition_name             124
sport_name                    22
device_category                4
home_contestant_country       67
away_contestant_country       78
tournament_start_date         79
tournament_end_date           66
venue_country                 51
dtype: int64

In [None]:
sports = df['sport_name'].unique()
print('number of unique sports:', len(sports), '\n', sports)

number of unique sports: 22 
 ['Baseball' 'Basketball' 'Soccer' 'Tennis' 'Motorsport'
 'Mixed Martial Arts' 'Cycling Road' 'Golf' 'American Football'
 'Horse Racing' 'Rugby Union' 'Boxing' 'Ice Hockey' 'Darts' 'Handball'
 'Pool' 'WWE' 'OTHER' 'Futsal' 'Snooker' 'Field Hockey' 'E-Sports']


In [6]:
df.isnull().sum()

viewer_id                  0
stream_start_time          0
stream_end_time            0
live_or_on_demand          0
customer_country           0
match_title                0
match_date                 0
competition_name           0
sport_name                 0
device_category            0
home_contestant_country    0
away_contestant_country    0
tournament_start_date      0
tournament_end_date        0
venue_country              0
dtype: int64

# How will we define similarity? (Assumptions/Model Definition)

Looking at the variables in our data, there are a number of ways we could say that one sport is 'similar' to another, and much of this comes down to personal preference or domain knowledge.

For our purposes, we will use the 

>['viewer_id', 'live_or_on_demand', 'customer_country', 'match_title',
       'competition_name', 'device_category',
       'home_contestant_country', 'away_contestant_country', 'venue_country'] 
       
features as our text description of each event. We'll use [Doc2Vec](https://radimrehurek.com/gensim/auto_examples/tutorials/run_doc2vec_lee.html#sphx-glr-auto-examples-tutorials-run-doc2vec-lee-py) to convert our text to vectors.

We can also extract numerical features 

> ['DayOfWeek', 'Month', 'stream_length_minutes', 'tournament_length_days']

to create a dataset suitable for computing similarity. I think these are important features to add from a viewer standpoint, as they tell us about the time commitments required for a particular sport.

We are assuming that these features compose a significant portion of the sport's characteristics, and that sports that share similar values among these features should be considered similar.

We will use cosine similarity to measure similarity between our entries. From this we can create an aggregate sports matrix that shows us the mean similarity measure between one sport and another.

We will create our functions to find the most similar/dissimilar sports using these similarity metrics, and also the N most similar sports to a given sport.
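
As a rough illustration of the aggregation described above, the sketch below computes cosine similarity between toy event vectors and averages it over every pair of events drawn from two sports. The vectors and the names `cosine`, `sport_a` and `sport_b` are hypothetical and only stand in for the real feature rows.

```python
import numpy as np

def cosine(u, v):
    """Cosine of the angle between two 1-D vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# hypothetical feature vectors: two events per sport
sport_a = np.array([[1.0, 0.0], [2.0, 0.0]])
sport_b = np.array([[0.0, 3.0], [-1.0, 0.0]])

# mean similarity over every (event in A, event in B) pair
mean_a_b = np.mean([cosine(u, v) for u in sport_a for v in sport_b])
print(mean_a_b)
```

Two of the four pairs are orthogonal (score 0) and two are opposite (score -1), so the mean here is -0.5. The magnitude of a vector does not matter, only its direction.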

# Extract and Standardise Numerical Features

In [7]:
# extract values for day and month of match
df['match_date'] = pd.to_datetime(df['match_date'], format='%Y-%m-%dT%H:%M:%S.000Z')
df['DayOfWeek'] = df['match_date'].dt.dayofweek
df['Month'] = df['match_date'].dt.month

# extract stream length value
df['stream_start_time'] = pd.to_datetime(df['stream_start_time'], format='%Y-%m-%dT%H:%M:%S.000Z')
df['stream_end_time'] = pd.to_datetime(df['stream_end_time'], format='%Y-%m-%dT%H:%M:%S.000Z')

def get_time_diff(row):
    # total_seconds() keeps the day component, which .seconds would silently drop
    return (row['stream_end_time'] - row['stream_start_time']).total_seconds() / 60.0

df['stream_length_minutes'] = df.apply(get_time_diff, axis=1)

# extract tournament length value
df['tournament_start_date'] = pd.to_datetime(df['tournament_start_date'], format='%Y-%m-%dT%H:%M:%S.000Z')
df['tournament_end_date'] = pd.to_datetime(df['tournament_end_date'], format='%Y-%m-%dT%H:%M:%S.000Z')

def get_tournament_len(row):
    return (row['tournament_end_date'] - row['tournament_start_date']).total_seconds() / 86400.0

df['tournament_length_days'] = df.apply(get_tournament_len, axis=1)

# replace the placeholder value '0' (whole value, not the character) with 'Unknown'
df['home_contestant_country'] = df['home_contestant_country'].replace('0', 'Unknown')

df.head()

Unnamed: 0,viewer_id,stream_start_time,stream_end_time,live_or_on_demand,customer_country,match_title,match_date,competition_name,sport_name,device_category,home_contestant_country,away_contestant_country,tournament_start_date,tournament_end_date,venue_country,DayOfWeek,Month,stream_length_minutes,tournament_length_days
0,5049,2021-06-11 09:41:37,2021-06-11 12:02:18,Live,Japan,Lions v Dragons,2021-06-11 08:45:00,NPB,Baseball,Web,Japan,Japan,2021-03-02,2021-11-29,Japan,4,6,140.683333,272.0
1,9062,2021-06-13 20:13:36,2021-06-13 20:14:05,Live,Canada,Brazil vs. Canada,2021-06-13 19:10:00,FIBA AmeriCup (W),Basketball,Unknown,Brazil,Canada,2021-06-11,2021-06-19,Puerto Rico,6,6,0.483333,8.0
2,9062,2021-05-30 12:53:38,2021-05-30 12:53:56,Live,Canada,MLB Network,2018-01-01 00:00:00,MLB,Baseball,Unknown,Unknown,Unknown,2018-01-28,2018-12-30,England,0,1,0.3,336.0
3,9062,2021-05-30 20:24:00,2021-05-30 20:24:34,Live,Canada,MLB Network,2018-01-01 00:00:00,MLB,Baseball,Unknown,Unknown,Unknown,2018-01-28,2018-12-30,England,0,1,0.566667,336.0
4,211,2021-05-01 20:01:47,2021-05-01 20:07:08,Live,Italy,Milan v Benevento,2021-05-01 18:45:00,Serie A,Soccer,Mobile,Italy,Italy,2020-09-19,2021-05-23,Italy,5,5,5.35,246.0
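
The row-wise `apply` calls above can also be written as column arithmetic, which is usually faster on a frame of this size. The sketch below is a minimal illustration on a hypothetical two-row `sample` frame that copies the first two rows shown above; it is not part of the original notebook.

```python
import pandas as pd

# hypothetical two-row sample using the dataset's column names
sample = pd.DataFrame({
    'stream_start_time': pd.to_datetime(['2021-06-11 09:41:37', '2021-06-13 20:13:36']),
    'stream_end_time': pd.to_datetime(['2021-06-11 12:02:18', '2021-06-13 20:14:05']),
    'tournament_start_date': pd.to_datetime(['2021-03-02', '2021-06-11']),
    'tournament_end_date': pd.to_datetime(['2021-11-29', '2021-06-19']),
})

# subtracting two datetime columns gives a timedelta column;
# .dt.total_seconds() converts the whole column at once
stream_len = sample['stream_end_time'] - sample['stream_start_time']
sample['stream_length_minutes'] = stream_len.dt.total_seconds() / 60.0

tournament_len = sample['tournament_end_date'] - sample['tournament_start_date']
sample['tournament_length_days'] = tournament_len.dt.total_seconds() / 86400.0

print(sample[['stream_length_minutes', 'tournament_length_days']])
```

The resulting values match the `stream_length_minutes` and `tournament_length_days` columns of the first two rows in the output above.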


In [8]:

numeric_cols = ['DayOfWeek', 'Month', 'stream_length_minutes', 'tournament_length_days']
numeric_df = df[numeric_cols]

# standardise our data
standardised = scipy.stats.zscore(numeric_df)
numeric_df = pd.DataFrame(standardised, columns = numeric_cols)
numeric_arr = numeric_df.to_numpy()
numeric_df.head()

Unnamed: 0,DayOfWeek,Month,stream_length_minutes,tournament_length_days
0,-0.141244,1.203701,2.254073,0.280458
1,0.966163,1.203701,-0.545479,-2.438076
2,-2.356059,-6.785115,-0.54914,0.939497
3,-2.356059,-6.785115,-0.543815,0.939497
4,0.412459,-0.394062,-0.4483,0.012724


# Create Word Vectors from Text Data

In [9]:
!pip install texthero



In [12]:
# clean text

import texthero as hero
from texthero import preprocessing
custom_pipeline = [preprocessing.fillna,
                   preprocessing.lowercase,
                   preprocessing.remove_whitespace,
                   preprocessing.remove_diacritics,
                   #preprocessing.remove_brackets
                  ]


features = ['viewer_id',
       'live_or_on_demand', 'customer_country', 'match_title',
       'competition_name', 'device_category',
       'home_contestant_country', 'away_contestant_country', 'venue_country']

df['clean_text'] = hero.clean(df[features[0]], custom_pipeline)

for feature in features[1:]:
    df['clean_text'] += ' ' + hero.clean(df[feature], custom_pipeline)


df['clean_text'] = [n.replace('{','') for n in df['clean_text']]
df['clean_text'] = [n.replace('}','') for n in df['clean_text']]
df['clean_text'] = [n.replace('(','') for n in df['clean_text']]
df['clean_text'] = [n.replace(')','') for n in df['clean_text']]

df.head(2)




Unnamed: 0,viewer_id,stream_start_time,stream_end_time,live_or_on_demand,customer_country,match_title,match_date,competition_name,sport_name,device_category,home_contestant_country,away_contestant_country,tournament_start_date,tournament_end_date,venue_country,DayOfWeek,Month,stream_length_minutes,tournament_length_days,clean_text
0,5049,2021-06-11 09:41:37,2021-06-11 12:02:18,Live,Japan,Lions v Dragons,2021-06-11 08:45:00,NPB,Baseball,Web,Japan,Japan,2021-03-02,2021-11-29,Japan,4,6,140.683333,272.0,5049 live japan lions v dragons npb web japan ...
1,9062,2021-06-13 20:13:36,2021-06-13 20:14:05,Live,Canada,Brazil vs. Canada,2021-06-13 19:10:00,FIBA AmeriCup (W),Basketball,Unknown,Brazil,Canada,2021-06-11,2021-06-19,Puerto Rico,6,6,0.483333,8.0,9062 live canada brazil vs. canada fiba americ...


In [13]:
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
#tokenize and tag the card text
card_docs = [TaggedDocument(doc.split(' '), [i]) 
             for i, doc in enumerate(df.clean_text)]


In [14]:
#instantiate model
model = Doc2Vec(vector_size=64, window=2, min_count=1, workers=8, epochs = 40)
#build vocab
model.build_vocab(card_docs)
#train model
model.train(card_docs, total_examples=model.corpus_count, epochs=model.epochs)

In [15]:
#generate vectors
card2vec = [model.infer_vector((df['clean_text'][i].split(' '))) 
            for i in range(0,len(df['clean_text']))]


In [16]:
# Create a list of lists
dtv_arr = np.array(card2vec)
dtv = np.array(card2vec).tolist()
#set list to dataframe column
df['card2vec'] = dtv

In [17]:
arr = np.hstack((dtv_arr, numeric_arr))
# pd.DataFrame(arr).to_csv("dazn_streaming_data_features.csv")
arr[0]

array([ 7.97525048e-02,  8.63876939e-02, -1.47419661e-01, -8.38291943e-02,
       -1.33435056e-01, -1.86939031e-01, -1.45122662e-01, -1.16793498e-01,
        1.26969904e-01,  1.10280037e-01,  2.41244640e-02,  6.31890818e-02,
       -1.25219434e-01,  1.01557232e-01,  1.89303048e-02,  1.08136870e-01,
        3.84503566e-02,  5.03317714e-02,  7.78801069e-02,  1.47796422e-01,
       -8.50944892e-02,  1.37648359e-01,  2.32642889e-02, -5.36002666e-02,
        2.35394798e-02, -5.72783090e-02, -2.28557433e-03,  1.26002222e-01,
        2.49266508e-04,  1.80692729e-02, -1.15765303e-01,  1.32756770e-01,
        1.15410006e-02, -6.73440695e-02, -4.14405651e-02,  1.50490046e-01,
       -2.35171363e-01, -6.54827356e-02, -9.96426120e-02, -9.39504579e-02,
        1.31078586e-01,  1.64996475e-01, -5.31563722e-02, -7.19943866e-02,
        6.74366532e-03, -1.23612836e-01, -1.20284416e-01,  2.78975256e-03,
        2.22104713e-02, -9.96744111e-02, -1.67574093e-01,  2.13400479e-02,
        1.06387421e-01, -

# Computing Cosine Similarity Matrix

In [18]:
# define cosine similarity func

def np_cosine_similarity(u, v):
  u = np.expand_dims(u, 1)
  n = np.sum(u * v, axis=2)
  d = np.linalg.norm(u, axis=2) * np.linalg.norm(v, axis=1)

  return n / d

In [None]:
# Compute cosine similarity
x = arr

sports_similarity_mat = np.zeros((len(sports), len(sports)))
count_sports_similarity_mat = np.zeros((len(sports), len(sports)))
sports = list(sports)


# get list of sport names for each entry
compared_sport_name_lst = df['sport_name'].tolist()
# get array of sport indices, one per entry
compared_sport_idx_lst = np.array([sports.index(compared_sport_name) for compared_sport_name in compared_sport_name_lst])

# this loop takes a single row from the array and computes cosine
# similarity against all rows in the array
for k, row in enumerate(x):

    if k > 4999 and k%5000 == 0:
        print("finished row {}".format(k))

    sport_index = compared_sport_idx_lst[k]

    # Calculate cosine similarity in NumPy, flattened from shape (n, 1) to (n,)
    results = np_cosine_similarity(x, [row]).ravel()

    # update our aggregate sport similarity matrix;
    # np.add.at accumulates correctly when a sport index repeats
    np.add.at(sports_similarity_mat, (sport_index, compared_sport_idx_lst), results)
    np.add.at(count_sports_similarity_mat, (sport_index, compared_sport_idx_lst), 1)
    

In [22]:
sim_scores = np.nan_to_num(sports_similarity_mat / count_sports_similarity_mat)
sim_scores_df = pd.DataFrame(sim_scores)
# sim_scores_df.to_csv("sports_cosine_sim_matrix.csv")
sim_scores_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21
0,0.134254,0.003657,0.008424,0.004036,0.095479,0.131223,0.090985,-0.03954,-0.072586,0.261381,-0.014348,-0.079346,-0.026907,0.170657,0.086171,0.061314,0.192888,0.149956,0.03898,0.074622,0.135574,0.123026
1,0.003657,0.149167,0.070671,0.180489,0.071033,0.007286,0.082673,0.162299,-0.032018,-0.099296,0.201021,0.182387,0.157711,-0.110408,0.083365,0.065353,-0.041156,-0.058281,0.076327,-0.072878,-0.161047,0.057066
2,0.008424,0.070671,0.105886,-0.020442,0.065531,0.02817,0.06751,-0.073909,0.197342,-0.016266,0.032775,0.088728,0.042006,0.084761,0.06932,0.010353,0.023008,0.149955,0.111642,0.054518,0.040513,0.17215
3,0.004036,0.180489,-0.020442,0.693496,-0.185019,-0.355527,0.190135,0.693844,-0.468075,-0.105214,0.536113,0.244738,0.388233,-0.342706,-0.017155,0.361619,-0.230576,-0.486913,0.050626,-0.348261,-0.536571,-0.16358
4,0.095479,0.071033,0.065531,-0.185019,0.367926,0.431175,0.033475,-0.152519,0.06297,0.109279,-0.088339,0.018531,-0.092548,0.067234,0.148798,-0.157235,0.264165,0.264565,-0.018861,0.150766,0.158248,0.17491
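
The same sport-by-sport mean can be obtained without looping over every pair of rows. Once each row is scaled to unit length, the cosine of two rows is their dot product, so the sum of all pairwise cosines between two sports equals the dot product of the two sports' summed unit vectors. The sketch below is an alternative under that identity, not the method used above. The function name `mean_group_cosine` is hypothetical, and it assumes that no row has zero norm and that every sport index occurs at least once.

```python
import numpy as np

def mean_group_cosine(features, group_idx, n_groups):
    """Mean pairwise cosine similarity between every pair of groups.

    features : (n_rows, n_dims) array, one feature vector per row
    group_idx: (n_rows,) integer array, group (sport) index of each row
    """
    # scale every row to unit length so a dot product equals cosine similarity
    unit = features / np.linalg.norm(features, axis=1, keepdims=True)

    # sum of the unit vectors in each group, shape (n_groups, n_dims)
    group_sums = np.zeros((n_groups, features.shape[1]))
    np.add.at(group_sums, group_idx, unit)

    # number of rows in each group
    counts = np.bincount(group_idx, minlength=n_groups)

    # summed pairwise cosines between groups, divided by the number of pairs
    return (group_sums @ group_sums.T) / np.outer(counts, counts)

# hypothetical toy data: six rows spread over three groups
toy_features = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0],
                         [3.0, 0.0], [0.0, -1.0], [-1.0, 1.0]])
toy_groups = np.array([0, 1, 2, 0, 1, 2])
print(mean_group_cosine(toy_features, toy_groups, 3))
```

With the notebook's variables this would be called as `mean_group_cosine(arr, np.array(compared_sport_idx_lst), len(sports))`. Like the loop above, it includes each row's similarity with itself in the diagonal entries.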


# SIMILAR SPORTS TO X FUNCTION

Looking at the sports most similar to Golf, I would tend to agree that Tennis and Rugby Union should rank high.

In [None]:
# # load in saved files instead of re running everything

# sim_scores_df = pd.read_csv("sports_cosine_sim_matrix.csv", index_col=0, sep=',')
# sim_scores = sim_scores_df.to_numpy()

# saved_df = pd.read_csv('DAZN_Data_Scientist_Homework_Dataset.csv', sep='~')
# sports = list(saved_df['sport_name'].unique())

In [46]:
def find_similar_sports(sport_name, sport_similarity_mat=sim_scores, N=3, sports=sports):

    sports = list(sports)
    sport_ind = sports.index(sport_name)

    # similarity of the chosen sport to every sport
    row = sport_similarity_mat[sport_ind]

    # indices sorted by descending similarity, excluding the sport itself
    top_inds = [i for i in np.argsort(row)[::-1] if i != sport_ind][:N]

    print('MOST SIMILAR SPORTS TO {}:\n'.format(sport_name))

    for ind in top_inds:
        print(sports[ind], row[ind])


find_similar_sports('Golf', N=5)

MOST SIMILAR SPORTS TO Golf:

Tennis 0.6938436603780744
Rugby Union 0.5177537817872859
Ice Hockey 0.3514912384695351
Pool 0.27092902307628053
Boxing 0.2605019725794412


# MOST SIMILAR PAIRS OF SPORTS FUNCTION

Golf and Tennis are the most similar sports according to our work.

Field Hockey and 'OTHER' are harder to judge, because we don't know exactly which sports fall under 'OTHER'.

From the little I know about either sport, I wouldn't say that Field Hockey and Darts are similar.

In [47]:
# we want to avoid matching sports with themselves or repeating combinations of sports

def n_most_similar(a=sim_scores, sports=sports, n=2):
    m = a.shape[0]
    r = np.arange(m)
    mask = r[:,None] > r
    idx = a[mask].argpartition(-n)[-n:]

    clens = np.arange(m).cumsum()    
    grp_start = clens[:-1]
    grp_stop = clens[1:]-1    

    rows = np.searchsorted(grp_stop, idx)+1    
    cols  = idx - grp_start[rows-1]
    coords = list(zip(rows, cols))

    sorted_coords = []
    for c in coords:
        # read the score from the matrix passed in as `a`, not the global sim_scores
        sorted_coords.append( ("{} and {}".format(sports[c[0]], sports[c[1]]) , 
                            a[c[0]][c[1]]) )

    sorted_coords.sort(key=lambda x: x[1], reverse=True)
    return sorted_coords


coords = n_most_similar(n=5)

print("MOST SIMILAR SPORTS:\n")
for a in coords:
  print("{}: {}".format(a[0], a[1]))

MOST SIMILAR SPORTS:

Golf and Tennis: 0.6938436603780744
Field Hockey and OTHER: 0.6162379725442297
Field Hockey and Darts: 0.6014892487637196
OTHER and Darts: 0.5941237692192245
Rugby Union and Tennis: 0.5361129851666161


# MOST DISSIMILAR PAIRS OF SPORTS FUNCTION

In [48]:
def n_least_similar(a=sim_scores, sports=sports, n=2):
    m = a.shape[0]
    r = np.arange(m)
    mask = r[:,None] > r
    idx = a[mask].argpartition(n)[:n]

    clens = np.arange(m).cumsum()    
    grp_start = clens[:-1]
    grp_stop = clens[1:]-1    

    rows = np.searchsorted(grp_stop, idx)+1    
    cols  = idx - grp_start[rows-1]
    coords = list(zip(rows, cols))

    sorted_coords = []
    for c in coords:
        # read the score from the matrix passed in as `a`, not the global sim_scores
        sorted_coords.append( ("{} and {}".format(sports[c[0]], sports[c[1]]) , 
                            a[c[0]][c[1]]) )

    sorted_coords.sort(key=lambda x: x[1])
    return sorted_coords


coords = n_least_similar(n=5)

print("MOST DISSIMILAR SPORTS:\n")
for a in coords:
  print("{}: {}".format(a[0], a[1]))

MOST DISSIMILAR SPORTS:

Field Hockey and Golf: -0.6426338880747827
OTHER and Golf: -0.6148627963271553
American Football and Golf: -0.5690987107171803
Darts and Golf: -0.5390285654434096
Field Hockey and Tennis: -0.5365713365966149


Field Hockey and Golf are the most dissimilar sports, followed by 'OTHER' and Golf, then American Football and Golf. I have no evidence to disagree with any of these results.

It's worth noting here that the cosine similarity score lies in the range [-1, 1], where
- −1 indicates opposite vectors
- 0 indicates orthogonal (unrelated) vectors
- 1 indicates vectors pointing in the same direction.

Intermediate values indicate the degree of similarity. This function returns the lowest cosine scores, in this case the 'most opposite' sports. Another way to get dissimilar sports might be to take the scores closest to 0, giving us sports whose feature vectors are unrelated rather than opposed.
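
The closest-to-zero alternative could look roughly like the sketch below. The function name `n_closest_to_zero` and the toy matrix are hypothetical; the function is assumed to receive a symmetric similarity matrix such as `sim_scores` and a matching list of sport names.

```python
import numpy as np

def n_closest_to_zero(sim, names, n=2):
    """Return the n pairs whose similarity score is closest to 0."""
    # indices of the strict lower triangle: each pair once, no self-pairs
    rows, cols = np.tril_indices(sim.shape[0], k=-1)
    scores = sim[rows, cols]

    # positions of the n smallest absolute scores, closest to 0 first
    order = np.argsort(np.abs(scores))[:n]

    return [("{} and {}".format(names[rows[i]], names[cols[i]]), float(scores[i]))
            for i in order]

# hypothetical 3x3 similarity matrix, for illustration only
toy_names = ['A', 'B', 'C']
toy_sim = np.array([
    [1.0, -0.6, 0.05],
    [-0.6, 1.0, 0.4],
    [0.05, 0.4, 1.0],
])
print(n_closest_to_zero(toy_sim, toy_names, n=2))
```

On the notebook's data the call would be `n_closest_to_zero(sim_scores, sports, n=5)`. Whether 'unrelated' or 'opposite' is the better reading of dissimilarity remains a modelling choice.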