# NMF Preprocessing for player recommendations

## Overview

## 1 - NMF model explanation

## 2 - NMF hyperparameter tuning

### 2.1 tuning max_iter 

### 2.2 tuning number of skill set groups "c"

## 3 - Model output 

### 3.1 Label Weight matrix skill set groups

### 3.2 Create ranking data from skill set values

### 3.3 Save output and model 


In [12]:
## import packages and tools
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import NMF 
from sklearn.preprocessing import Normalizer ,MaxAbsScaler
from sklearn.pipeline import make_pipeline
from utils import save_file

In [45]:
## load data
df = pd.read_csv('../data/final_per_90_and_pAdj.csv')

In [48]:
df.columns.values

array(['fb_id', 'player_name', 'Squad_from_op', 'Opposing_Poss',
       'Opposing_Touches', 'Player', 'Nation', 'Position',
       'Tackle_pct_won', 'Nineties', 'xA', 'Key_pass', 'Comp_prog',
       'Total_Cmp', 'Total_Att', 'Total_Cmp_pct', 'Total_Tot_Dist',
       'Total_Prg_Dist', 'Short_Cmp', 'Short_Att', 'Short_Cmp_pct',
       'Medium_Cmp', 'Medium_Att', 'Medium_Cmp_pct', 'Long_Cmp',
       'Long_Att', 'Long_Cmp_pct', 'touches', 'touches_def_box',
       'touches_def_3rd', 'touches_mid_3rd', 'touches_att_3rd',
       'touches_att_box', 'touch_live', 'dribble_Succ', 'dribble_Att',
       'dribble_Succ_pct', 'num_dribble_past', 'dribble_megs', 'carries',
       'carries_dist', 'carries_prog_dist', 'carries_prog',
       'carries_att_3rd', 'carries_att_box', 'miss_control', 'dispossed',
       'recep_targ', 'recep_succ', 'recept_pct', 'Fouls_drawn',
       'Offsides', 'Crosses', 'PK_won', 'Aerial_win_pct', 'Gls', 'Sh',
       'SoT', 'SoT%', 'Sh/90', 'SoT/90', 'G/Sh', 'G/SoT', 'Dist'

In [49]:
df['Opposing_Poss'].unique()

array([49.1, 54. , 49.9, 46.5, 51. , 48.3, 47.1, 42.6, 50.2, 51.4, 53.5,
       51.7, 55.1, 53.8, 48.9, 45.7, 46. , 51.8, 47.4, 45. , 51.9, 53.9,
       50.6, 54.9])

In [78]:
## set index and drop negative and categorical columns
X = df.set_index('player_name')

# remove categorical and features with negative values
X = X.drop(columns= ['Base Salary','Player', 'Club',
                     'Nation', 'Position','fb_id', 'Squad_from_op','G-xG','np:G-xG'])
X.shape

(705, 100)

In [79]:
## select key features to input into NMF 
X = X[[ 'Nineties', 'xA',
       'Key_pass', 'Comp_prog', 'Total_Cmp', 'Total_Att', 'Total_Cmp_pct',
       'Total_Tot_Dist', 'Total_Prg_Dist', 'Short_Cmp', 'Short_Att',
        'Medium_Cmp', 'Medium_Att', 
       'Long_Cmp', 'Long_Att', 'Long_Cmp_pct', 'touches', 'touches_def_box',
       'touches_def_3rd', 'touches_mid_3rd', 'touches_att_3rd',
       'touches_att_box', 'touch_live', 'dribble_Succ', 'dribble_Att',
       'dribble_Succ_pct', 'num_dribble_past', 'dribble_megs', 'carries',
       'carries_dist', 'carries_prog_dist', 'carries_prog', 'carries_att_3rd',
       'carries_att_box', 'miss_control', 'dispossed', 'recep_targ',
       'recep_succ', 'recept_pct', 'Fouls_drawn', 'Offsides', 'Crosses',
       'PK_won', 'Aerial_win_pct', 'Gls', 'Sh', 'SoT', 'SoT%', 'Sh/90',
       'SoT/90', 'G/Sh', 'G/SoT', 'Dist', 'FK', 'PK', 'PKatt', 'xG', 'npxG',
       'npxG/Sh', 'pAdj_Total_tackles', 'pAdj_Tackles_Won',
       'pAdj_Tackles_Def_3rd', 'pAdj_Tackles_Mid_3rd', 'pAdj_Tackles_Att_3rd',
       'pAdj_Num_Dribblers_tackled', 'pAdj_Num_Dribbled_past', 
       'pAdj_Blocked_shots', 'pAdj_Blocked_SOT', 'Tackle_pct_won','pAdj_Blocked_pass',
       'pAdj_Interceptions', 'pAdj_Tackles_and_Ints', 'pAdj_Clearences',
      'pAdj_Red_cards','pAdj_Fouls', 'pAdj_def_interceptions', 'pAdj_Recoveries',
       'pAdj_Aerial_Duels_lost', 'lost_tackles', 'True_tackle_pct',
       'pAdj_Tackle_int_blocks', 'avg_shrt+med_pass_pct', 'pct_long_balls',
       'prog_carry+lng_comp+crosses', 'attacking_touches']]
# work with players who have played more than five full 90s (Nineties > 5)
X = X.loc[(X['Nineties']>5)]
X.shape

(541, 85)

In [80]:
X.to_csv('../data/NMF_X.csv')

## 1 - NMF model explanation - Non-Negative Matrix Factorization

NMF, or Non-Negative Matrix Factorization, is widely used for topic modeling and document clustering, but it has also been used to cluster other kinds of data, such as classifying companies on extra-financial criteria (https://towardsdatascience.com/using-nmf-to-classify-companies-a77e176f276f). I am going to use NMF to classify my MLS 2021 data set by "topic", a grouping of player stats that I will call a "skill set group". I will use the skill set groups to classify players in the league and then use a normalized version of the decomposed matrix to identify players who are closest to the key players identified in my EDA. At the end, every player should be classified by a key skill set group. The players closest to the identified target players show how franchises could use NMF to find other players who fit the roles played by current MLS players.

### X matrix - original player matrix
The player matrix X has n players (rows) and f features (player stats), so X is n x f.


### Weight matrix - weights of each player for each skill set group or topic
c = number of skill set groups that will be generated by NMF.
The weight matrix has one row per player and one column per skill set group generated by the NMF algorithm.
W = n x c

### Hidden layer matrix - feature values for each skill set group
This matrix has one row per skill set group and one column per player feature (it is `nmf.components_` in the code below).
H = c x f

### The matrix product W (weight) x H (hidden) approximates the original X player matrix

The two matrices W and H, when multiplied together, approximate the original X matrix. This allows player data to be observed in a lower-dimensional space by comparing player skill set values in the weight matrix.
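
As a minimal sketch of the factorization described above, the example below fits scikit-learn's `NMF` to a small made-up matrix (the values and the `*_toy` names are illustrative assumptions, not the player data) and checks the shapes of W and H and how closely their product reconstructs X.

```python
import numpy as np
from sklearn.decomposition import NMF

# Toy non-negative "player x stat" matrix (random values, for illustration only).
rng = np.random.default_rng(0)
X_toy = rng.random((6, 4))            # n = 6 players, f = 4 stats

c = 2                                 # number of skill set groups
toy_nmf = NMF(n_components=c, init='nndsvda', max_iter=500, random_state=42)
W_toy = toy_nmf.fit_transform(X_toy)  # (n x c) weight of each player per group
H_toy = toy_nmf.components_           # (c x f) loading of each stat per group

X_approx = W_toy @ H_toy              # (n x f) approximation of X_toy
print(W_toy.shape, H_toy.shape, X_approx.shape)

# With the default Frobenius loss, reconstruction_err_ is the norm of the residual.
print(np.linalg.norm(X_toy - X_approx), toy_nmf.reconstruction_err_)
```

The reconstruction is not exact: choosing c smaller than f forces the model to summarize the stats into fewer groups, and the residual norm measures what is lost.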

### The W weight matrix is used to find players closest to the target player in question

The distance between the skill set (topic) values of players in the W matrix can be used to measure who is most similar, so recommendations can be built from the players closest to the target player.

The skill set group with the highest weight for each player is used to classify, or label, groups of players who are similar in their player stat features.
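
The two uses of W described above can be sketched on a tiny hand-written weight matrix. The player names, group names, weights, and the `most_similar` helper are all hypothetical; cosine similarity is used here as one reasonable choice of closeness measure.

```python
import numpy as np
import pandas as pd

# Made-up skill set weights for four hypothetical players (not real data).
W_demo = pd.DataFrame(
    [[0.9, 0.1, 0.0],
     [0.8, 0.2, 0.1],
     [0.0, 0.1, 0.9],
     [0.1, 0.9, 0.1]],
    index=['Player A', 'Player B', 'Player C', 'Player D'],
    columns=['Passing', 'Dribbling', 'Defending'],
)

# Label each player with the skill set group that has the highest weight.
labels = W_demo.idxmax(axis=1)

def most_similar(W_df, target, top_n=2):
    """Rank the other players by cosine similarity to the target's row of W."""
    unit = W_df.div(np.linalg.norm(W_df.values, axis=1), axis=0)  # unit-length rows
    sims = unit @ unit.loc[target]   # dot product of unit vectors = cosine similarity
    return sims.drop(target).sort_values(ascending=False).head(top_n)

print(labels)
print(most_similar(W_demo, 'Player A'))
```

In this toy data, Player B has almost the same weight profile as Player A and so comes out as the closest match, while Player C (a defender profile) is the least similar.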



In [81]:
# function to visualize skill groups and the players most associated with each skill group

def display_features(H,W,feature_names, X_matrix ,no_top_features, no_top_players):
    """ visualize skill set groups and the highest ranked players in each group """
    topics = {}
    # iterate through the topics (rows) of the topic-feature matrix 'H';
    # H is the hidden layer, with shape (C x F): topics by features
    for topic_idx, topic in enumerate(H):
        print("Topic %d:" % (topic_idx))
        print(" ".join([ (feature_names[i] + " (" + str(topic[i].round(2)) + ")")
          for i in topic.argsort()[:-no_top_features - 1:-1]]))
        
        # add features to topics dictionary for later assessment. 
        
        topics[topic_idx] = [ (feature_names[i] + " (" + str(topic[i].round(2)) + ")") 
                             for i in topic.argsort()[:-(no_top_features+3) - 1:-1]]
        
        top_player_indicies = np.argsort( W[:,topic_idx] )[::-1][0:no_top_players]
        for p_index in top_player_indicies:
            
            print(p_index," ",X_matrix.index[p_index])
    
    return topics 


def add_skill_group(X,W):
    """ add skill group to player feature to use for classification"""
    df_new = X.copy()
    # Get the top predicted topic and add to df copy 
    df_new['pred_topic_num']= [np.argsort(each)[::-1][0] for each in W]
  
    return df_new

## 2 - NMF hyperparameter tuning

### 2.1 tuning of max_iter parameter to allow for model convergence 

In [82]:
# try increasing number of max_iter to allow for model convergence
for test in np.arange(100,1000,100):
    c = 11
    # Create a MaxAbsScaler: scaler
    transformer = MaxAbsScaler().fit(X)

    # scale data
    scaled_X= transformer.transform(X)

    # Create an NMF model: nmf

    nmf = NMF(n_components=c,max_iter=test,init='nndsvda',  random_state=42)

    W = nmf.fit_transform(scaled_X)
    H = nmf.components_
    err = nmf.reconstruction_err_

    print(nmf,'error :',err, '\n', ' W shape',W.shape,'H shape ',H.shape)           


# topics = display_features(H,W,feature_names, X, no_top_features,no_top_players)



NMF(init='nndsvda', max_iter=100, n_components=11, random_state=42) error : 15.267031395899144 
  W shape (541, 11) H shape  (11, 85)
NMF(init='nndsvda', n_components=11, random_state=42) error : 15.241489440260304 
  W shape (541, 11) H shape  (11, 85)
NMF(init='nndsvda', max_iter=300, n_components=11, random_state=42) error : 15.237397443966058 
  W shape (541, 11) H shape  (11, 85)
NMF(init='nndsvda', max_iter=400, n_components=11, random_state=42) error : 15.2373182856613 
  W shape (541, 11) H shape  (11, 85)
NMF(init='nndsvda', max_iter=500, n_components=11, random_state=42) error : 15.2373182856613 
  W shape (541, 11) H shape  (11, 85)
NMF(init='nndsvda', max_iter=600, n_components=11, random_state=42) error : 15.2373182856613 
  W shape (541, 11) H shape  (11, 85)
NMF(init='nndsvda', max_iter=700, n_components=11, random_state=42) error : 15.2373182856613 
  W shape (541, 11) H shape  (11, 85)
NMF(init='nndsvda', max_iter=800, n_components=11, random_state=42) error : 15.23731

### max_iter tuning 
- The reconstruction error stops decreasing once max_iter reaches 400, so more than 300 iterations are required for the model to converge. I will use at least 500 iterations going forward.
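
An alternative to sweeping `max_iter` by hand is to give the solver a generous budget and read the fitted `n_iter_` attribute, which reports how many iterations were actually run. The sketch below uses made-up random data and a hypothetical `budget` value, so the iteration count it prints says nothing about the player data.

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
X_toy = rng.random((30, 8))   # made-up non-negative data

budget = 2000                 # generous iteration budget (illustrative value)
toy_nmf = NMF(n_components=3, init='nndsvda', max_iter=budget, random_state=42)
toy_nmf.fit(X_toy)

# If n_iter_ is below the budget, the solver stopped because the tolerance (tol)
# was reached; if it equals the budget, the model ran out of iterations.
print('iterations used:', toy_nmf.n_iter_)
print('stopped before budget:', toy_nmf.n_iter_ < budget)
```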

### 2.2 tuning number of skill set groups "c"

#### 2.2.1 Initial NMF attempt with c = 11 skill group topics 

In [83]:

c = 11
# Create a MaxAbsScaler: scaler
transformer = MaxAbsScaler().fit(X)

# scale data
scaled_X= transformer.transform(X)

# Create an NMF model: nmf

nmf = NMF(n_components=c,max_iter=600,init='nndsvda', random_state=42)

W = nmf.fit_transform(scaled_X)
H = nmf.components_
err = nmf.reconstruction_err_

print(nmf,'error :',err, '\n', ' W shape',W.shape,'H shape ',H.shape)

            
feature_names = X.columns.values
no_top_features = 4
no_top_players = 3

topics = display_features(H,W,feature_names, X, no_top_features,no_top_players)

NMF(init='nndsvda', max_iter=600, n_components=11, random_state=42) error : 15.2373182856613 
  W shape (541, 11) H shape  (11, 85)
Topic 0:
recept_pct (12.25) avg_shrt+med_pass_pct (11.91) touches_def_box (10.31) pct_long_balls (10.06)
526   Zac MacMath
488   Brad Guzan
518   Maxime Crépeau
Topic 1:
touches_att_box (4.97) npxG (4.36) SoT (4.34) xG (4.22)
192   Adam Buksa
45   Valentín Castellanos
106   Ola Kamara
Topic 2:
pAdj_Tackles_Mid_3rd (3.56) pAdj_Total_tackles (3.18) pAdj_Tackles_Won (3.17) pAdj_Tackles_and_Ints (2.65)
409   Judson
462   Franco Ibarra
246   Eric Remedi
Topic 3:
xA (1.93) Key_pass (1.74) Comp_prog (1.53) Long_Att (1.48)
2   Emanuel Reynoso
0   Carles Gil
6   Lucas Zelarayán
Topic 4:
pAdj_Clearences (2.18) pAdj_Blocked_shots (1.8) Aerial_win_pct (1.49) touches_def_3rd (1.45)
426   Nathan Cardoso
392   Jon Bell
388   Wyatt Omsberg
Topic 5:
Total_Cmp_pct (1.1) Short_Cmp (1.08) Short_Att (1.07) avg_shrt+med_pass_pct (1.04)
266   Bryce Duke
499   Kamohelo Mokotjo
54

In [84]:
# check groups 
df_new = add_skill_group(X,W)
df_new['pred_topic_num'].value_counts()

10    121
8     120
7     109
5      64
6      52
4      33
9      33
0       8
3       1
Name: pred_topic_num, dtype: int64

#### 2.2.1 Tuning insight 
- With c = 11, only 9 skill groups end up as any player's top group (see the value counts above), and the group sizes are unbalanced. I will reduce c to get more balanced clusters. 
- The highest-rated players for each skill set are related in playing style, which suggests that the player associations built by NMF look promising. 
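
One way to compare candidate values of c more systematically is to record, for each value, the reconstruction error and how evenly players fall into their top skill group. This is a sketch on random stand-in data; the `summary` table and its column names are assumptions introduced for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import NMF
from sklearn.preprocessing import MaxAbsScaler

rng = np.random.default_rng(0)
X_toy = rng.random((60, 10))                       # made-up stand-in for the player matrix
scaled_toy = MaxAbsScaler().fit_transform(X_toy)   # stays non-negative, column max = 1

rows = []
for c in range(2, 7):
    model = NMF(n_components=c, init='nndsvda', max_iter=1000, random_state=42)
    W_c = model.fit_transform(scaled_toy)
    # number of players whose highest weight falls in each skill group
    sizes = np.bincount(W_c.argmax(axis=1), minlength=c)
    rows.append({'c': c,
                 'error': model.reconstruction_err_,
                 'used_groups': int((sizes > 0).sum()),
                 'smallest_group': int(sizes.min())})

summary = pd.DataFrame(rows)
print(summary)
```

A value of c where `used_groups` is lower than c, or where `smallest_group` is very small, points to the same kind of imbalance seen with c = 11.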

#### 2.2.2 Attempting c ('skill groups') = 9 to look for balanced groups

In [87]:
c = 9

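# note: l1_ratio only takes effect when a non-zero regularization strength is set; none is set here
nmf = NMF(n_components=c,max_iter=1000,init='nndsvda', l1_ratio=.05, random_state=42)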

W = nmf.fit_transform(scaled_X)
H = nmf.components_
err = nmf.reconstruction_err_

print(nmf,'error :',err, '\n', ' W shape',W.shape,'H shape ',H.shape)
df_new = add_skill_group(X,W)
print(df_new['pred_topic_num'].value_counts())
            
feature_names = X.columns.values
no_top_features = 4
no_top_players = 5

topics = display_features(H,W,feature_names, X, no_top_features,no_top_players)

NMF(init='nndsvda', l1_ratio=0.05, max_iter=1000, n_components=9,
    random_state=42) error : 16.097065673959523 
  W shape (541, 9) H shape  (9, 85)
7    99
8    98
5    87
6    79
4    70
3    48
2    40
0    18
1     2
Name: pred_topic_num, dtype: int64
Topic 0:
recept_pct (9.36) avg_shrt+med_pass_pct (8.98) pct_long_balls (8.19) touches_def_box (7.97)
488   Brad Guzan
518   Maxime Crépeau
526   Zac MacMath
530   David Ochoa
428   J.T. Marcinkowski
Topic 1:
dribble_Succ (2.71) num_dribble_past (2.63) dribble_Att (2.52) dispossed (2.31)
51   Yeferson Soteldo
2   Emanuel Reynoso
136   Brian Rodríguez
6   Lucas Zelarayán
0   Carles Gil
Topic 2:
pAdj_Clearences (2.37) pAdj_Blocked_shots (2.05) touches_def_3rd (1.81) pAdj_def_interceptions (1.49)
426   Nathan Cardoso
348   Alan Franco
417   Daniel Steres
365   Rudy Camacho
422   Francisco Calvo
Topic 3:
Total_Cmp_pct (1.71) avg_shrt+med_pass_pct (1.66) recept_pct (1.44) Long_Cmp_pct (1.28)
499   Kamohelo Mokotjo
459   Ralph Priso-Mbongu

## 3 - Model output 

### 3.1 Label Weight matrix skill set groups
- use top features in each skill set group to create skill set names.

In [88]:
skill_sets = pd.DataFrame(topics )
columns =['Defensive passing','Dribbling','Defensive actions','Short passing','Playing time','Tackling','Attacking play','Passing volume','Progressive passing']
skill_sets.columns = columns
skill_sets

| | Defensive passing | Dribbling | Defensive actions | Short passing | Playing time | Tackling | Attacking play | Passing volume | Progressive passing |
|---|---|---|---|---|---|---|---|---|---|
| 0 | recept_pct (9.36) | dribble_Succ (2.71) | pAdj_Clearences (2.37) | Total_Cmp_pct (1.71) | Nineties (2.1) | pAdj_Tackles_Mid_3rd (1.1) | touches_att_box (1.37) | touches_mid_3rd (0.87) | Crosses (1.07) |
| 1 | avg_shrt+med_pass_pct (8.98) | num_dribble_past (2.63) | pAdj_Blocked_shots (2.05) | avg_shrt+med_pass_pct (1.66) | Total_Cmp_pct (1.69) | pAdj_Total_tackles (0.97) | SoT (1.3) | recep_succ (0.84) | touches_att_3rd (0.7) |
| 2 | pct_long_balls (8.19) | dribble_Att (2.52) | touches_def_3rd (1.81) | recept_pct (1.44) | avg_shrt+med_pass_pct (1.62) | pAdj_Tackles_Won (0.97) | npxG (1.23) | Total_Cmp (0.76) | attacking_touches (0.65) |
| 3 | touches_def_box (7.97) | dispossed (2.31) | pAdj_def_interceptions (1.49) | Long_Cmp_pct (1.28) | dribble_Succ_pct (1.45) | pAdj_Tackles_and_Ints (0.84) | xG (1.19) | Total_Att (0.76) | Short_Att (0.54) |
| 4 | Long_Att (7.87) | carries_prog (1.95) | pAdj_Interceptions (1.48) | dribble_Succ_pct (1.26) | Long_Cmp_pct (1.39) | pAdj_Num_Dribbled_past (0.82) | Sh (1.18) | carries (0.75) | xA (0.52) |
| 5 | Total_Prg_Dist (7.76) | attacking_touches (1.84) | Aerial_win_pct (1.39) | Dist (1.16) | recept_pct (1.39) | pAdj_Num_Dribblers_tackled (0.81) | Offsides (1.03) | recep_targ (0.73) | recept_pct (0.51) |
| 6 | touches_def_3rd (7.18) | dribble_megs (1.83) | pAdj_Tackle_int_blocks (1.39) | Short_Att (0.87) | Short_Att (0.7) | pAdj_Tackle_int_blocks (0.76) | Gls (1.02) | Total_Tot_Dist (0.72) | pAdj_Blocked_pass (0.5) |


### 3.2 Create ranking data from skill set values

In [89]:
dfw= pd.DataFrame(W,columns=columns,index = X.index)
dfw.head()

| player_name | Defensive passing | Dribbling | Defensive actions | Short passing | Playing time | Tackling | Attacking play | Passing volume | Progressive passing |
|---|---|---|---|---|---|---|---|---|---|
| Carles Gil | 0.013093 | 0.224208 | 0.0 | 0.0 | 0.172804 | 0.0 | 0.084986 | 0.580714 | 0.495527 |
| Julian Gressel | 0.02568 | 0.032906 | 0.041975 | 0.0 | 0.118464 | 0.103169 | 0.112529 | 0.208273 | 0.720363 |
| Emanuel Reynoso | 0.002525 | 0.300686 | 0.0 | 0.0 | 0.212611 | 0.280021 | 0.123508 | 0.548108 | 0.20096 |
| Albert Rusnák | 0.01443 | 0.078649 | 0.0 | 0.0 | 0.324585 | 0.0 | 0.178034 | 0.359962 | 0.276517 |
| Maximiliano Moralez | 0.00974 | 0.058921 | 0.0 | 0.0 | 0.229654 | 0.189513 | 0.244651 | 0.536516 | 0.39802 |


### Use skill set group values to create ranks to use for clustering and comparisons

In [90]:
for col in dfw.columns.values:
    title = col +' rank'
    dfw[title] = dfw[col].rank(pct=True,na_option = 'bottom',ascending=True)
dfw[['Defensive actions','Defensive actions rank']].nlargest(5,columns='Defensive actions rank')

| player_name | Defensive actions | Defensive actions rank |
|---|---|---|
| Nathan Cardoso | 0.355321 | 1.0 |
| Alan Franco | 0.315812 | 0.998152 |
| Daniel Steres | 0.308983 | 0.996303 |
| Rudy Camacho | 0.305769 | 0.994455 |
| Francisco Calvo | 0.301305 | 0.992606 |
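
To make the percentile ranks easier to read, here is a small sketch of `rank(pct=True)` on four made-up scores (the labels A to D are hypothetical). Each value's rank is divided by the number of rows, which is why the second-best defender above scores 0.998152, or 540/541. Tied values, such as the many zeros in W, share the average of their ranks under the default tie method.

```python
import pandas as pd

# Made-up skill set weights, including two tied zeros.
scores = pd.Series([0.30, 0.00, 0.10, 0.00], index=['A', 'B', 'C', 'D'])

# Ascending percentile rank: the highest value gets 1.0.
pct_rank = scores.rank(pct=True, ascending=True)
print(pct_rank)
# A    1.000
# B    0.375
# C    0.750
# D    0.375
```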


### Defensive action rank insight 
- SJE's new defensive signing Francisco Calvo appears in the top 5 ranks, alongside Nathan Cardoso, for the defensive actions skill set group.

### 3.3 Saving dataframes and models 
- save rank matrix
- save normalized weight matrix for cosine similarity
- save pickled nmf model 

In [93]:
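# save dfw for modeling and comparison
# note: index=False drops the player_name index from W_ranks.csv, so rows can only be matched to players by position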

dfw.to_csv('../data/W_ranks.csv', encoding='utf-8', index=False)
skill_sets.to_csv('../data/skill_sets.csv', encoding='utf-8', index=False)

In [92]:
# save normalized W matrix for calculating cosine similarity between players

transformer = Normalizer().fit(W)
normed_W= transformer.transform(W)
# can add back club and salary here 
normed_W = pd.DataFrame(normed_W,index=X.index.values, columns=columns)

normed_W.to_csv('../data/normed_W.csv',encoding='utf-8')
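
The reason for normalizing W before saving can be checked on a small example: after `Normalizer` rescales each row to unit L2 length, a plain dot product between two rows equals their cosine similarity. The weights below are made up, and the all-zero last row is included on purpose to show that such a row stays at zero rather than raising an error.

```python
import numpy as np
from sklearn.preprocessing import Normalizer
from sklearn.metrics.pairwise import cosine_similarity

# Made-up weight rows; the last row is all zeros on purpose.
W_demo = np.array([[0.58, 0.50, 0.00],
                   [0.21, 0.72, 0.10],
                   [0.00, 0.00, 0.00]])

normed_demo = Normalizer().fit_transform(W_demo)   # L2 norm per row by default
row_norms = np.linalg.norm(normed_demo, axis=1)    # 1.0 for non-zero rows
dot_sims = normed_demo @ normed_demo.T             # pairwise dot products

print(row_norms)
print(np.allclose(dot_sims, cosine_similarity(W_demo)))
```

A player whose weights are all zero would therefore have a similarity of 0 to everyone, so it may be worth checking for such rows before building recommendations.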

In [15]:
## save nmf model for later use
            
save_file(data= nmf, fname='nmf.pkl',dname='../models/')

A file already exists with this name.



Do you want to overwrite? (Y/N) Y


Writing file.  "../models/nmf.pkl"
