In [2]:
import pandas as pd
import pymongo

from time import mktime
from dateutil import parser
from time import strptime
from pymongo import MongoClient
from tqdm import tqdm
from collections import Counter
from bson.objectid import ObjectId
from collections import OrderedDict 

from sklearn.preprocessing import MinMaxScaler

In [8]:
client = MongoClient()
db = client.wyscout
print("The database exists?\t%s" %( 'wyscout' in client.list_database_names() ))

The database exists?	True


# <font color=red> 1. Data manipulation on the collections</font>

If you have imported the initial collections into your MongoDB database and this is the first time you run this notebook, first execute the functions in this section to add all the fields needed for the analysis. Note that some functions take a few minutes to run, and in some cases about an hour.

## <font color=orange>1.1 Decoding information</font>

### <font color=orange>1.1.1 Player names</font>

This function takes as input the players collection ```pl_col``` and decodes the attributes ```shortName```, ```firstName``` and ```lastName```, in which some characters are still encoded (stored as escape sequences).

Before the conversion, only 2264 of the 3603 players in our collection (62.84%) have a ```shortName``` that matches a player name in the FIFA dataset. The other 1339 players (37.16%) have a name that finds no match.

After the conversion, 2800 players match (77.71%), while the remaining 803 players still have some 'problems' with their names.
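The match rates quoted above can be reproduced with a simple set lookup. The sketch below is only an illustration: the helper `match_rate` and the sample names are hypothetical, while the counts 2264, 2800 and 3603 come from the text.

```python
def match_rate(collection_names, fifa_names):
    """Return (number of matched names, percentage of the collection that matched)."""
    fifa_set = set(fifa_names)
    matched = sum(1 for name in collection_names if name in fifa_set)
    return matched, round(100 * matched / len(collection_names), 2)

# Hypothetical sample: three of the four collection names appear in the FIFA list.
collection_names = ["L. Messi", "Neymar", "S. Ag\\u00fcero", "M. Salah"]
fifa_names = ["L. Messi", "Neymar", "M. Salah", "S. Agüero"]
matched, pct = match_rate(collection_names, fifa_names)

# Percentages reported in the text, recomputed from the raw counts.
pct_before = round(100 * 2264 / 3603, 2)
pct_after = round(100 * 2800 / 3603, 2)
```

In the sample, the name that still contains a literal escape sequence is the one that fails to match, which is the situation the decoding step is meant to repair.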

In [442]:
def decodePlayerNames (pl_col):
    for docId in tqdm([ e['_id'] for e in pl_col.find({ }, {'_id'}) ]):
        doc = pl_col.find_one({'_id': docId})
        decodedShortName = doc['shortName'].encode().decode('unicode_escape').encode()
        decodedFirstName = doc['firstName'].encode().decode('unicode_escape').encode()
        decodedLastName = doc['lastName'].encode().decode('unicode_escape').encode()
        pl_col.update_one({ '_id': docId }, 
                          { '$set': {
                              'shortName': decodedShortName,
                              'firstName': decodedFirstName,
                              'lastName': decodedLastName
                          } })
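To see what the `unicode_escape` round trip does, here is a minimal sketch on a made-up name (the sample strings are hypothetical, not taken from the collection). It also shows that, under Python 3, a trailing `.encode()` like the one used above yields `bytes` rather than `str`, which matters if the field is later compared with plain strings.

```python
# Hypothetical raw value: the accented letter is stored as a literal "\u00e1" escape sequence.
raw_name = "L. Su\\u00e1rez"

# Interpret the escape sequences; the result is a regular str.
decoded = raw_name.encode().decode("unicode_escape")
print(decoded == "L. Suárez")  # True

# A further .encode() converts the text to UTF-8 bytes.
as_bytes = decoded.encode()
print(type(as_bytes).__name__)  # bytes

# Caveat: unicode_escape reads non-escape bytes as Latin-1, so a name that already
# contains real accented characters is altered instead of being left unchanged.
already_decoded = "Suárez"
round_trip = already_decoded.encode().decode("unicode_escape")
print(round_trip == already_decoded)  # False
```

Because of the caveat in the last lines, running the decoding step twice on the same collection is probably not safe.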

In [443]:
#decodePlayerNames(db.players)

100%|█████████████████████████████████████████████████████████████████████████████| 3603/3603 [00:06<00:00, 536.69it/s]


### <font color=orange>1.1.2 Teams</font>

In [10]:
def decodeTeamNames (tm_col):
    for docId in tqdm([ e['_id'] for e in tm_col.find({ }, {'_id'}) ]):
        doc = tm_col.find_one({'_id': docId})
        cityDec = doc['city'].encode().decode('unicode_escape').encode()
        nameDec = doc['name'].encode().decode('unicode_escape').encode()
        offNameDec = doc['officialName'].encode().decode('unicode_escape').encode()
        tm_col.update_one({ '_id': docId }, 
                          { '$set': {
                              'city': cityDec,
                              'name': nameDec,
                              'officialName': offNameDec
                          } })

In [11]:
#decodeTeamNames(db.teams)

100%|███████████████████████████████████████████████████████████████████████████████| 142/142 [00:00<00:00, 510.92it/s]


## <font color=orange>1.2 Add information from the FIFA dataset</font>

In [160]:
df_Fifa = pd.read_csv("C:/Users/riky9/Desktop/BDA/data/fifa.csv")

In [5]:
def addOverallAndPotential (pl_col, df):
    for i in tqdm(range(df.shape[0])):
        pl_col.update_one(
            { 'shortName': df['Name'].iloc[i], 
             'passportArea.name': df['Nationality'].iloc[i] 
            }, { '$set': {
                'overall': int(df['Overall'].iloc[i]),
                'potential': int(df['Potential'].iloc[i])
            }}
        )

In [446]:
addOverallAndPotential(db.players, df_Fifa)

100%|███████████████████████████████████████████████████████████████████████████| 18207/18207 [01:16<00:00, 237.96it/s]


## <font color=orange>1.3 Add features to `matches` and `events` collections</font>

To evaluate `meanPlayerOverall` and `meanPlayerPotential`, we add to each match the field `timestamp`, obtained by converting the existing field `date`, which indicates when the match was played.

Moreover, the only time parameter of each event is currently the field `eventSec`, which represents the seconds elapsed from the **beginning of the period** (first half, second half, first extra time, second extra time) to the moment the event happened. To select the two time windows (the one we want to analyse and the one on which we want to predict the target variables), we need a feature that represents the seconds elapsed from the **beginning of the match** to the moment the event happened. For this reason we add the feature `secFromStart` to each event.

Consider an event _eH2_ that happened during the second half of a match. To evaluate the `secFromStart` of _eH2_, we add to the value stored in its `eventSec` field the maximum `eventSec` value among all the events of the first half. We then apply the same idea to the events of the extra time periods.

First we compute 
- _x = max(e[eventSec])_ over the events _e_ in the first half of the match, 
- _y = x + max(e[eventSec])_ over the events _e_ in the second half of the match, 
- _z = y + max(e[eventSec])_ over the events _e_ in the first extra time period and 
- _h = z + max(e[eventSec])_ over the events _e_ in the second extra time period. 

<img src="timeline.png" alt="Timeline of the match periods" title="Timeline of the match periods" />

Then we store the new time information, measured from the beginning of the match, in each event document, and we add some features to each document in the matches collection: `secEndH1` = _x_ is the second at which we consider the first half ended, `secEndH2` = _y_ is the second at which we consider the second half ended, `secEndE1` = _z_ is the second at which we consider the first extra time period ended, and `secEnd` = _h_ is the second at which we consider the match ended. Note that in all the matches without extra time `secEndE1` and `secEnd` are both equal to _y_, because the code below initializes _z_ and _h_ to _y_.
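The offsets described above can be sketched in plain Python, without a database. The period labels follow the code in this section; the helper `period_offsets` and the event values are hypothetical and only illustrate the arithmetic.

```python
from collections import defaultdict

# Hypothetical events of one match as (matchPeriod, eventSec) pairs.
events = [
    ("1H", 10.0), ("1H", 2700.0),
    ("2H", 5.0), ("2H", 2820.0),
    ("E1", 30.0), ("E1", 900.0),
    ("E2", 15.0), ("E2", 930.0),
]

def period_offsets(events, order=("1H", "2H", "E1", "E2")):
    """Return the start offset and the cumulative end second of each period."""
    last_sec = defaultdict(float)
    for period, sec in events:
        last_sec[period] = max(last_sec[period], sec)
    starts, ends, elapsed = {}, {}, 0.0
    for period in order:
        starts[period] = elapsed
        elapsed += last_sec[period]  # a period without events adds 0
        ends[period] = elapsed
    return starts, ends

starts, ends = period_offsets(events)
# ends holds x, y, z, h under the keys "1H", "2H", "E1", "E2"
sec_from_start = [(period, sec, sec + starts[period]) for period, sec in events]
```

With these values `ends` is 2700, 5520, 6420 and 7350 for the four periods, and the last event (930 seconds into the second extra time period) gets `secFromStart` 7350. For a match without extra time, the `E1` and `E2` end seconds simply stay equal to _y_.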

In [593]:
def addSecFromStart (ev_col, mt_col):
    matchIds = [ m['wyId'] for m in mt_col.find({ }, {'_id':0, 'wyId':1}) ] #match id of each match in the matches collection
    for match in tqdm(matchIds):
        matchDoc = mt_col.find_one({'wyId': match})
        timestamp = int(mktime(parser.parse(matchDoc['date']).timetuple())) #we produce the timestamp from the date of the match
        matchEvents = [e for e in ev_col.find( { 'matchId': int(match) } )] #all events happened during the match

        evH1 = [ e for e in matchEvents if e['matchPeriod'] == '1H' ]
        evH2 = [ e for e in matchEvents if e['matchPeriod'] == '2H' ]
        evE1 = [ e for e in matchEvents if e['matchPeriod'] == 'E1' ]
        evE2 = [ e for e in matchEvents if e['matchPeriod'] == 'E2' ]
        evP = [ e for e in matchEvents if e['matchPeriod'] == 'P' ]

        x = max( e['eventSec'] for e in evH1 )
        y = max( e['eventSec'] for e in evH2 ) + x
        z = y
        h = z
        if len(evE1) > 0:
            z = max( e['eventSec'] for e in evE1 ) + y
            h = max( e['eventSec'] for e in evE2 ) + z

        evSecDict = { e['_id']: e['eventSec'] for e in matchEvents }
        
        #add secFromStart to events in H1 
        for eId in [ e['_id'] for e in evH1 ]:
            ev_col.update_one({ '_id': eId }, { '$set': { 'secFromStart': evSecDict[eId] } } )
        
        #add secFromStart to events in H2 
        for eId in [ e['_id'] for e in evH2 ]:
            ev_col.update_one({ '_id': eId }, { '$set': { 'secFromStart': evSecDict[eId] + x } } )

        #notice that we enter the following loops only if the match has extra time
        #add secFromStart to events in E1
        for eId in [ e['_id'] for e in evE1 ]:
            ev_col.update_one({ '_id': eId }, { '$set': { 'secFromStart': evSecDict[eId] + y } } )

        #add secFromStart to events in E2
        for eId in [ e['_id'] for e in evE2 ]:
            ev_col.update_one({ '_id': eId }, { '$set': { 'secFromStart': evSecDict[eId] + z } } )
        
        #add secFromStart to events in P
        for eId in [ e['_id'] for e in evP ]:
            ev_col.update_one({ '_id': eId }, { '$set': { 'secFromStart': evSecDict[eId] + h } } )


        mt_col.update_one({ 'wyId': match }, { 
            '$set': {
                'timestamp': timestamp,
                'secEndH1': x, 'secEndH2': y, 'secEndE1': z, 'secEnd': h 
            } } )

## <font color=orange>1.4 Add `playerankScoreNorm` based on playeRank collection info</font>

playeRank 2.0

To use the `playerankScore` of the collection `playeRank` correctly, we first compute the minutes that each player actually played in the match from the information available in the collection `matches`, and store the result in the field `realMinPlayed`. Then we apply the following three normalizations to the field `playerankScore`:

1. Min-Max normalization to have values in the range [0, 1]
2. Normalization based on _'how many minutes the player played in the match'_ compared to _'how many minutes the game lasted'_.
3. Min-Max normalization to have values in the range [0, 1]

The result is the field `playerankScoreNorm`, which is added to the collection `playeRank`.
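The two steps above (minutes actually played, then the three normalizations) can be illustrated on toy data. This is a sketch under stated assumptions: the helpers `real_minutes` and `min_max`, the player ids and the scores are all hypothetical, and a hand-written min-max rescaling stands in for `MinMaxScaler`.

```python
def real_minutes(lineup, bench, substitutions, match_minutes):
    """Minutes played per player: starters get the whole match, bench players 0,
    and each substitution splits the match at its minute."""
    minutes = {p: 0 for p in bench}
    minutes.update({p: match_minutes for p in lineup})
    for sub in substitutions:
        minutes[sub["playerOut"]] = sub["minute"]
        minutes[sub["playerIn"]] = match_minutes - sub["minute"]
    return minutes

def min_max(values):
    """Rescale a list of numbers linearly onto [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:  # avoid dividing by zero when all values coincide
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical match lasting 95 minutes: player 2 is replaced by player 3 at minute 60.
match_minutes = 95
minutes = real_minutes(lineup=[1, 2], bench=[3, 4],
                       substitutions=[{"playerOut": 2, "playerIn": 3, "minute": 60}],
                       match_minutes=match_minutes)

# Hypothetical raw scores for the same four players.
scores = {1: 0.010, 2: 0.030, 3: -0.005, 4: 0.020}
players = sorted(scores)

step1 = min_max([scores[p] for p in players])                               # 1. min-max
step2 = [s * minutes[p] / match_minutes for s, p in zip(step1, players)]    # 2. share of minutes
step3 = min_max(step2)                                                      # 3. min-max again
score_norm = dict(zip(players, step3))
```

In this toy match, player 4 has a good raw score but never leaves the bench, so the second step brings the normalized score down to 0.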

In [705]:
def addRealMinPlayed (mt_col, rank_col):
    matchesIdRank = list(set([ e['matchId'] for e in rank_col.find() ]))
    
    for match in tqdm(matchesIdRank):
        matchDoc = mt_col.find_one({'wyId': match})
        minMatchDuration = int(matchDoc['secEnd']/60)
        teamIds = list(matchDoc['teamsData'].keys()) #team id for both teams in the match mt

        for team in teamIds:
            teamDoc = matchDoc['teamsData'][team]

            minPlayers = { d['playerId']: minMatchDuration * int(x == 'lineup') for x in ['bench', 'lineup'] for d in teamDoc['formation'][x] }

            subs = teamDoc['formation']['substitutions']
            for d in (subs if subs != 'null' else []): #'null' marks a team that made no substitutions
                minPlayers.update({ d['playerOut']: d['minute'] })
                minPlayers.update({ d['playerIn']: minMatchDuration - d['minute'] })

            for plId in [k for k in minPlayers.keys()]:
                rank_col.update_one({ 'playerId': plId, 'matchId': match }, { '$set': {'realMinPlayed': minPlayers[plId] } })


def addPlayerankScoreNorm (mt_col, rank_col):
    #df from Rank collection
    dfRank = pd.DataFrame(list(rank_col.find({ }, {'_id':1, 'matchId':1, 'playerankScore':1, 'realMinPlayed': 1})))
    dfRank['realMinPlayed'] = dfRank['realMinPlayed'].values.astype(int)

    #df from Match collection
    dfMatch = pd.DataFrame(list(mt_col.find({ }, {'_id':0, 'wyId':1, 'secEnd':1})))
    dfMatch = dfMatch.rename(columns={'wyId':'matchId'})
    dfMatch['matchMinDuration'] = (dfMatch['secEnd']/60).values.astype(int)

    #df final
    df = pd.merge(dfRank, dfMatch, on='matchId')
    scaler = MinMaxScaler()
    df['playerankScoreNorm1'] = scaler.fit_transform(df[['playerankScore']]) #use the merged df so that the rows stay aligned
    df['playerankScoreNorm2'] = df['playerankScoreNorm1']*df['realMinPlayed']/df['matchMinDuration']
    df['playerankScoreNorm3'] = scaler.fit_transform(df[['playerankScoreNorm2']])

    #update the collection
    normedValsDict = {row['_id']: row['playerankScoreNorm3'] for index, row in df.iterrows()}
    for docId in tqdm([k for k in normedValsDict.keys()]):
        rank_col.update_one({ '_id': docId }, { '$set': { 'playerankScoreNorm': normedValsDict[docId] } })

In [706]:
addPlayerankScoreNorm(db.matches, db.playerank2)

100%|██████████████████████████████████████████████████████████████████████████| 46897/46897 [00:41<00:00, 1142.44it/s]


## <font color=orange> 1.5 Add `teamrankScore` to each match document </font>

We use the collection `playeRank` to evaluate the rank of each player. The collection stores one document for each player who has played in a match. The field `playerankScoreNorm` is a value from 0 to 1 that indicates how well that player performed in that match, while the field `realMinPlayed` represents the number of minutes played by that player in that match.

From this information we add to each player in the `players` collection the field `meanPlayerank`, the mean of all the values stored in `playerankScoreNorm` for that player. Note that the function below names the stored fields `meanRank` and `meanRankMins`, and that it reads `playerankScore` and `minutesPlayed`.
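The per-player mean can be sketched without a database as follows. The helper `mean_player_rank` and the records are hypothetical; the -1 placeholder for players without any rank document mirrors the one used in the code of this section.

```python
from collections import defaultdict

def mean_player_rank(rank_records, player_ids, missing=-1):
    """Average playerankScoreNorm per player; players without records get `missing`."""
    scores = defaultdict(list)
    for rec in rank_records:
        scores[rec["playerId"]].append(rec["playerankScoreNorm"])
    return {p: (sum(scores[p]) / len(scores[p]) if p in scores else missing)
            for p in player_ids}

# Hypothetical rank documents: player 1 has two matches, player 2 one, player 3 none.
rank_records = [
    {"playerId": 1, "matchId": 10, "playerankScoreNorm": 0.4},
    {"playerId": 1, "matchId": 11, "playerankScoreNorm": 0.6},
    {"playerId": 2, "matchId": 10, "playerankScoreNorm": 0.2},
]
means = mean_player_rank(rank_records, player_ids=[1, 2, 3])
```

Grouping all records in one pass avoids issuing one query per player, which may help on large collections.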

In [216]:
def addPlayeRank (pl_col, rank_col):
    plIdRank = list(set( [ e['playerId'] for e in rank_col.find({ }) ] ))
    plIdPlayers = list(set( [ e['wyId'] for e in pl_col.find({ }) ] ))
    
    for p in tqdm(plIdPlayers):
        meanRank = -1 #-1 marks a player without any rank document
        meanRankMins = -1
        
        if p in plIdRank:
            plDocs = [ d for d in rank_col.find({ 'playerId': p }) ]    
            ranksMins = [(d['playerankScore'], d['minutesPlayed']) for d in plDocs ] #list of (rank, minutes) of each match in which the player p has played

            ranks = [ c[0] for c in ranksMins ]
            meanRank = sum(ranks)/len(ranks)
            #ASSUMPTION: `maxMinPlayed` is never defined in this notebook; here it is taken as the
            #longest appearance of the player, so that each weight lies in (0, 1]
            maxMinPlayed = max( c[1] for c in ranksMins )
            ranksXmins = [ c[0]*c[1]/maxMinPlayed for c in ranksMins if c[1] > 0 ]
            if len(ranksXmins) > 0: #skip players who never spent a minute on the pitch
                meanRankMins = sum(ranksXmins)/len(ranksXmins)
        
        pl_col.update_one({ 'wyId': int(p) }, { 
            '$set' : { 
                'meanRank': meanRank,
                'meanRankMins': meanRankMins
            } })
        
    

In [202]:
plRankDict = addPlayeRank(db.players, db.playerank)

100%|████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 30.31it/s]


## <font color=orange>1.6 Add `activePlayerank` subdocument to each `match` document</font>

In [1087]:
def addActivePlayers (mt_col, rank_col):
    docIds = [ m['_id'] for m in mt_col.find({ }, {'_id':1}) ] #match id of each match in the matches collection
    
    for docId in tqdm(docIds):
        matchDoc = mt_col.find_one({ '_id': docId })
        teamIds = list(matchDoc['teamsData'].keys()) #team id for both teams in the match mt
        
        matchPlsRank = [ d for d in rank_col.find({ 'matchId': matchDoc['wyId'] }) ]
        plIdsAvailable = [ str(d['playerId']) for d in matchPlsRank ]
        
        playerankDict = {}
        for team in teamIds:
            teamDoc = matchDoc['teamsData'][team]
            
            lineupPls = [ str(d['playerId']) for d in teamDoc['formation']['lineup'] ]
            subs = teamDoc['formation']['substitutions'] #'null' when a team does not substitute any player during the match
            subsPls = [ str(d['playerIn']) for d in subs ] if subs != 'null' else []
            
            for p in lineupPls + subsPls:
                if p in plIdsAvailable:
                    playerankDict.update({ p: [e['playerankScoreNorm'] for e in matchPlsRank if e['playerId'] == int(p)][0] })
                else:
                    playerankDict.update({ p: -1 })
                    
        mt_col.update_one({'_id': docId}, { '$set': { 'activePlayerank': playerankDict } })

In [1088]:
addActivePlayers(db.matches, db.playerank2)

 89%|█████████████████████████████████████████████████████████████████████▋        | 1734/1941 [01:37<00:16, 12.23it/s]

##  <font color=orange>1.7 Add `teamPlayerank` field to each `match` document</font>

In [1112]:
def addTeamPlayerank (mt_col, pl_col):
    matchDocs = [e for e in mt_col.find() ]
    matchIds = [ d['wyId'] for d in matchDocs ]
    plIds = [ p['wyId'] for p in pl_col.find() ]
    #for each player p in the pl_col we store the list of matches in which p have played
    playerMatchesPlayed = { p: [ match['wyId'] for match in matchDocs if str(p) in match['activePlayerank'].keys() ] for p in tqdm(plIds) }
    
    for match in tqdm(matchIds):
        matchDoc = mt_col.find_one({'wyId': match})
        teamIds = list(matchDoc['teamsData'].keys()) #team id for both teams in the match mt
        actTimestamp = matchDoc['timestamp']
        
        teamPlayerank = {}
        for team in teamIds:
            teamDoc = matchDoc['teamsData'][team]
            teamPlayerIds = [ d['playerId'] for x in ['bench', 'lineup'] for d in teamDoc['formation'][x] ] #ids as integers

            listPlayerank = []
            for p in teamPlayerIds:
                if p in plIds:
                    ranksPrevMathces = [ m['activePlayerank'][str(p)] for m in [ d for d in matchDocs if (d['wyId'] in playerMatchesPlayed[p]) and (d['timestamp'] < actTimestamp) ] if (m['activePlayerank'][str(p)] >= 0) ]
                    if len(ranksPrevMathces) > 0:
                        listPlayerank.append( pd.Series.ewm(pd.Series(ranksPrevMathces), span=len(ranksPrevMathces)).mean().mean() )

            if len(listPlayerank) > 0:
                teamPlayerank.update( { str(team): sum(listPlayerank)/len(listPlayerank) } )
            else:
                teamPlayerank.update( { str(team): -1 } )
        
        mt_col.update_one({ 'wyId': match }, { '$set': { 'teamPlayerank': teamPlayerank } })
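For each player, the function above averages an exponentially weighted mean of the scores from earlier matches. The sketch below reproduces that arithmetic in plain Python. It assumes the weighting that pandas applies by default (`adjust=True`, with alpha = 2 / (span + 1)); the helpers `ewm_means` and `player_form` and the sample history are hypothetical. Unlike the notebook code, the sketch sorts the history by timestamp before smoothing, which is a choice of this example.

```python
def ewm_means(values, span):
    """Exponentially weighted means with weights (1 - alpha)**i, alpha = 2 / (span + 1)."""
    alpha = 2.0 / (span + 1.0)
    out, num, den = [], 0.0, 0.0
    for v in values:
        num = v + (1 - alpha) * num   # weighted sum of the values seen so far
        den = 1 + (1 - alpha) * den   # sum of the corresponding weights
        out.append(num / den)
    return out

def player_form(history, before_ts):
    """Mean of the smoothed scores from matches played before `before_ts`.
    `history` is a list of (timestamp, score); negative scores mean 'not available'."""
    prev = [score for ts, score in sorted(history) if ts < before_ts and score >= 0]
    if not prev:
        return None
    smoothed = ewm_means(prev, span=len(prev))
    return sum(smoothed) / len(smoothed)

# Hypothetical history: the third match has no score (-1), the last one is the current match.
history = [(100, 0.2), (200, 0.4), (300, -1), (400, 0.6), (500, 0.9)]
form = player_form(history, before_ts=500)
```

A team value is then the plain average of `player_form` over the players for whom at least one earlier score exists, with -1 used when none does, as in the function above.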

In [1113]:
addTeamPlayerank(db.matches, db.players)












  0%|                                                                                         | 0/3603 [00:00<?, ?it/s]










  1%|▊                                                                              | 39/3603 [00:00<00:09, 386.14it/s]










  2%|█▌                                                                             | 70/3603 [00:00<00:09, 357.16it/s]










  3%|██▎                                                                           | 107/3603 [00:00<00:09, 359.86it/s]










  4%|███▏                                                                          | 145/3603 [00:00<00:09, 363.57it/s]










  5%|███▉                                                                          | 182/3603 [00:00<00:09, 365.48it/s]










  6%|████▋                                                                         | 214/3603 [00:00<00:09, 348.24it/s]










  7%|█████▍                                                                    

 77%|███████████████████████████████████████████████████████████▌                 | 2789/3603 [00:06<00:01, 453.96it/s]










 79%|████████████████████████████████████████████████████████████▌                | 2836/3603 [00:06<00:01, 440.47it/s]










 80%|█████████████████████████████████████████████████████████████▌               | 2882/3603 [00:06<00:01, 443.57it/s]










 81%|██████████████████████████████████████████████████████████████▌              | 2927/3603 [00:06<00:01, 441.55it/s]










 82%|███████████████████████████████████████████████████████████████▌             | 2972/3603 [00:07<00:01, 444.02it/s]










 84%|████████████████████████████████████████████████████████████████▍            | 3018/3603 [00:07<00:01, 446.11it/s]










 85%|█████████████████████████████████████████████████████████████████▌           | 3065/3603 [00:07<00:01, 451.72it/s]










 86%|██████████████████████████████████████████████████████████████████▍          | 3111/3

  6%|████▋                                                                          | 116/1941 [00:14<02:34, 11.80it/s]










  6%|████▊                                                                          | 118/1941 [00:14<02:46, 10.97it/s]










  6%|████▉                                                                          | 120/1941 [00:14<02:55, 10.38it/s]










  6%|████▉                                                                          | 122/1941 [00:14<03:01, 10.02it/s]










  6%|█████                                                                          | 124/1941 [00:15<03:05,  9.79it/s]










  6%|█████                                                                          | 125/1941 [00:15<03:16,  9.25it/s]










  6%|█████▏                                                                         | 126/1941 [00:15<03:24,  8.89it/s]










  7%|█████▏                                                                         | 127/

 13%|██████████                                                                     | 246/1941 [00:30<03:40,  7.68it/s]










 13%|██████████                                                                     | 247/1941 [00:30<03:41,  7.65it/s]










 13%|██████████                                                                     | 248/1941 [00:30<03:38,  7.73it/s]










 13%|██████████▏                                                                    | 249/1941 [00:31<03:37,  7.78it/s]










 13%|██████████▏                                                                    | 250/1941 [00:31<03:34,  7.90it/s]










 13%|██████████▏                                                                    | 251/1941 [00:31<03:32,  7.95it/s]










 13%|██████████▎                                                                    | 252/1941 [00:31<03:32,  7.94it/s]










 13%|██████████▎                                                                    | 253/

 19%|███████████████▏                                                               | 372/1941 [00:46<03:25,  7.65it/s]










 19%|███████████████▏                                                               | 373/1941 [00:46<03:25,  7.65it/s]










 19%|███████████████▏                                                               | 374/1941 [00:46<03:25,  7.63it/s]










 19%|███████████████▎                                                               | 375/1941 [00:47<03:22,  7.75it/s]










 19%|███████████████▎                                                               | 376/1941 [00:47<03:21,  7.77it/s]










 19%|███████████████▎                                                               | 377/1941 [00:47<03:17,  7.93it/s]










 19%|███████████████▍                                                               | 378/1941 [00:47<03:18,  7.88it/s]










 20%|███████████████▍                                                               | 379/

 26%|████████████████████▌                                                          | 506/1941 [01:02<02:45,  8.68it/s]










 26%|████████████████████▋                                                          | 507/1941 [01:03<03:09,  7.58it/s]










 26%|████████████████████▋                                                          | 508/1941 [01:03<03:36,  6.63it/s]










 26%|████████████████████▋                                                          | 509/1941 [01:03<03:36,  6.60it/s]










 26%|████████████████████▊                                                          | 510/1941 [01:03<03:24,  7.00it/s]










 26%|████████████████████▊                                                          | 511/1941 [01:03<03:15,  7.30it/s]










 26%|████████████████████▊                                                          | 512/1941 [01:03<03:12,  7.44it/s]










 99%|████████████████████████████████████████████████████████████████████████████▉ | 1916/1941 [04:10<00:02,  8.34it/s]

## <font color=orange>1.8 Add `teamPoints` subdocument to each `match` document</font>

In [1202]:
def addTeamPoints (mt_col, cm_col):
    #for every competition, store in each match document the points that the two teams
    #had collected in that competition BEFORE the match was played
    compIds = [ d['wyId'] for d in cm_col.find({ }, { '_id':0, 'wyId':1 }) ]

    for comp in tqdm(compIds):
        matchesOfComp = [ e for e in mt_col.find({ 'competitionId': comp }) ]
        allTeamIds = set([ z for y in [list(x['teamsData'].keys()) for x in matchesOfComp] for z in y ])
        teamPoints = { t:0  for t in allTeamIds } #running standings of the competition
        #match ids sorted by timestamp, so that the points are accumulated in chronological order
        matchIds = [ e[1] for e in sorted([ (match['timestamp'], match['wyId']) for match in matchesOfComp ]) ]
        for match in matchIds:
            matchDoc = mt_col.find_one({ 'wyId': match })
            teamIds = list(matchDoc['teamsData'].keys()) #team id for both teams in the match
            teamA = teamIds[0]
            teamB = teamIds[1]

            #save the standings before updating them with the result of this match
            mt_col.update_one({ 'wyId': match }, { '$set':  { 
                'teamPoints': { str(teamA): teamPoints[teamA], 
                               str(teamB): teamPoints[teamB] } } })

            scoreA = matchDoc['teamsData'][str(teamA)]['score']
            scoreB = matchDoc['teamsData'][str(teamB)]['score']

            #when the match went beyond regular time, compare the scores at the end of extra time
            if matchDoc['duration'] in ['ExtraTime', 'Penalties']:
                scoreA = matchDoc['teamsData'][str(teamA)]['scoreET']
                scoreB = matchDoc['teamsData'][str(teamB)]['scoreET']

            #3 points to the winner, 1 point to each team for a draw
            if scoreA > scoreB:
                teamPoints.update({ teamA: teamPoints[teamA] + 3 })
            elif scoreA < scoreB:
                teamPoints.update({ teamB: teamPoints[teamB] + 3 })
            else:
                teamPoints.update({ teamA: teamPoints[teamA] + 1, teamB: teamPoints[teamB] + 1 })

In [1203]:
addTeamPoints(db.matches, db.competitions)












100%|██████████████████████████████████████████████████████████████████████████
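The bookkeeping done by `addTeamPoints` can be tried without a database. The following is a minimal sketch, assuming that matches are plain dictionaries with the same `wyId`, `timestamp` and `teamsData` fields used above; the helper `points_before_each_match` and the toy matches are hypothetical, and the extra-time case is left out for brevity.

```python
def points_before_each_match(matches):
    """Return {match wyId: {team id: points collected before that match}}."""
    # every team starts the competition with zero points
    standings = {team: 0 for m in matches for team in m['teamsData']}
    result = {}
    # visit the matches in chronological order
    for m in sorted(matches, key=lambda m: m['timestamp']):
        team_a, team_b = list(m['teamsData'].keys())
        # snapshot taken BEFORE the result of this match is counted
        result[m['wyId']] = {team_a: standings[team_a], team_b: standings[team_b]}
        score_a = m['teamsData'][team_a]['score']
        score_b = m['teamsData'][team_b]['score']
        if score_a > score_b:
            standings[team_a] += 3
        elif score_a < score_b:
            standings[team_b] += 3
        else:
            standings[team_a] += 1
            standings[team_b] += 1
    return result

# toy data (hypothetical): two matches between teams '1' and '2', given out of order
toy_matches = [
    {'wyId': 20, 'timestamp': 200, 'teamsData': {'1': {'score': 1}, '2': {'score': 1}}},
    {'wyId': 10, 'timestamp': 100, 'teamsData': {'1': {'score': 2}, '2': {'score': 0}}},
]
points = points_before_each_match(toy_matches)
print(points)
```

The snapshot is stored before the standings are updated, which is why the first match of every team always shows zero points.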

# <font color=red>2. Computation of our results collection</font>

The function takes as input
the events collection ```ev_col``` of all the available events,
the matches collection ```mt_col``` of all the matches played,
the players collection ```pl_col``` of all the Wyscout players,
the results collection ```res_col``` that, after the function execution, contains all the computed results, and
the lengths in minutes ```xMinutes``` and ```yMinutes``` of the analyzed time window _TW_ and of the prediction time window _TWP_ that follows it.


The documents in the collection `res_col` contain the following fields:
- `matchId`: corresponds to the wyscout identifier of the match
- `teamId`: corresponds to the wyscout identifier of the team
- `isHome`: 1 if the team is playing at home, 0 otherwise 


- `teamPoints`: points of the team in the standings of the competition before the match
- `teamPlayerank`: mean of the normalized `playrankScore` of all the players available for that team in that match
- `meanPlayerOverall`: mean of the players' overall rating given by the FIFA 19 dataset
- `meanPlayerPotential`: mean of the players' potential rating given by the FIFA 19 dataset


- `meanPrevScore`, `meanPrevScoreET`, `meanPrevScoreHT` and `meanPrevScoreP`: each is the mean of the team's score, taken respectively at the end of the match, at the end of extra time, at the end of the first half and at the end of the penalties, over the matches played by the team before that match


- `numPass`, `numDuel`, `numFoul`, `numFreeKick`, `numGoalkeeperLeavingLine`, `numInterruption`, `numOffside`, `numOthersOnTheBall`, `numSaveAttempt` and `numShot`: each field is the number of events done by the team during the time window _TW_ and named respectively _Pass, Duel, Foul, Free Kick, Goalkeeper Leaving Line, Interruption, Offside, Others On The Ball, Save Attempt, Shot_.


- `rateAccPass`: number of events named _Pass_ and tagged as accurate over `numPass` 
- `rateAccFreeKick`: number of events named _Free Kick_ and tagged as accurate over `numFreeKick`
- `rateAccShot`: number of events named _Shot_ and tagged as accurate over `numShot`


- `numYellowCard`, `numSecondYellowCard` and `numRedCard`: each field represents the number of events respectively tagged as _yellow card, second yellow card_ and _red card_

- `percBallPoss`: percentage of all the events of the match that are "interesting" events (_Pass, Duel, Free Kick, Others On The Ball, Shot_) done by the team; we use it as a proxy of the ball possession of the team
- `percOppHalfField`: percentage of all the events of the match that are done by the team in the opponent's half of the field


- `numGoalsTW` and `numOwnGoalsTW`: number of goals and own goals scored by the team in the analyzed time window _TW_; they count the events that are tagged respectively as _goal_ and _own goal_ and that are not named _Save Attempt_
- `scoreTW`: score of the team at the end of the analyzed time window _TW_; it is the sum of the field `numGoalsTW` of that team and the field `numOwnGoalsTW` of the opponent team in that match
- `numGoalsTWP` and `numOwnGoalsTWP`: number of goals and own goals scored by the team in the prediction time window _TWP_; they count the events that are tagged respectively as _goal_ and _own goal_ and that are not named _Save Attempt_
- `numGoals1H`, `numGoals2H` and `numGoalsET`: each represents the number of goals scored by the team during respectively the first half of the match, the second half of the match and the extra time
- `finalScore`: score of the team at the end of the match


- `isWinner`: 1 if the team is the winner of that match, 0 otherwise
- `didScoreInTWP`: 1 if the sum of the field `numGoalsTWP` of that team and the field `numOwnGoalsTWP` of the opponent team in that match is greater than zero, 0 otherwise
- `goalsDiff`: difference between the final score of that team and the final score of the opponent team
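To make the two time windows concrete, here is a small sketch of how `scoreTW` and `didScoreInTWP` follow from the events. It assumes events shaped like the documents of the events collection (`teamId`, `secFromStart`, `eventName`, `tags`) and the tag ids `101` (goal) and `102` (own goal) checked by the `computation` function; the helpers `count_goals` and `window_summary` and the toy events are hypothetical.

```python
GOAL, OWN_GOAL = 101, 102  # tag ids for goal and own goal

def count_goals(events, team, tag_id, start, end):
    """Count the events of `team` with start < secFromStart <= end that carry `tag_id`,
    skipping 'Save attempt' events as the computation function does."""
    return len([e for e in events
                if e['teamId'] == team
                and start < e['secFromStart'] <= end
                and e['eventName'] != 'Save attempt'
                and tag_id in [t['id'] for t in e['tags']]])

def window_summary(events, team, opponent, x_minutes, y_minutes):
    x_end = x_minutes * 60          # end of the analyzed window TW, in seconds
    y_end = x_end + y_minutes * 60  # end of the prediction window TWP, in seconds
    tw = (float('-inf'), x_end)     # TW: everything up to x_end
    twp = (x_end, y_end)            # TWP: after x_end, up to y_end
    # a team's score counts its own goals plus the own goals of the opponent
    score_tw = count_goals(events, team, GOAL, *tw) + count_goals(events, opponent, OWN_GOAL, *tw)
    score_twp = count_goals(events, team, GOAL, *twp) + count_goals(events, opponent, OWN_GOAL, *twp)
    return {'scoreTW': score_tw, 'didScoreInTWP': int(score_twp > 0)}

# toy events (hypothetical) for a match between teams 1 and 2, with TW = 30 and TWP = 20 minutes
window_events = [
    {'teamId': 1, 'secFromStart': 600,  'eventName': 'Shot', 'tags': [{'id': 101}]},
    {'teamId': 2, 'secFromStart': 601,  'eventName': 'Save attempt', 'tags': [{'id': 101}]},
    {'teamId': 2, 'secFromStart': 2000, 'eventName': 'Others on the ball', 'tags': [{'id': 102}]},
]
summary_team1 = window_summary(window_events, 1, 2, 30, 20)
summary_team2 = window_summary(window_events, 2, 1, 30, 20)
print(summary_team1, summary_team2)
```

In this toy match team 1 scores in _TW_ and benefits from an own goal of team 2 in _TWP_, while the _Save Attempt_ event of team 2 is not counted as a goal.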

In [6]:
def computation (ev_col, mt_col, pl_col, res_col, xMinutes, yMinutes):
    
    if (xMinutes <= 0 or yMinutes <= 0):
        print('The length of each time window (given in minutes) must be greater than zero!')
        return
    
    xSecEnd = xMinutes * 60
    ySecEnd = xSecEnd + yMinutes * 60 

    allPlayersId = [ p['wyId'] for p in pl_col.find({ }) ]
    interestingEvNames = ['Pass', 'Duel', 'Free Kick',  'Others on the ball', 'Shot']
    matchIds = [ m['wyId'] for m in mt_col.find({ }, {'_id':0, 'wyId':1}) ] #match id of each match in the matches collection
    
    for match in tqdm(matchIds):
        matchDoc = mt_col.find_one({'wyId': match})
        teamIds = list(matchDoc['teamsData'].keys()) #team id for both teams in the match mt
        
        matchEvents = [e for e in ev_col.find( { 'matchId': int(match) } )] #all events happened during the match
        
        for team in teamIds:
            teamDoc = matchDoc['teamsData'][team]
            
            #events of the actual team
            teamEventsAll = [ e for e in matchEvents if e['teamId'] == int(team) ]#events done by the team in the match
            #teamEvents = [ e for e in teamEventsAll if e['matchPeriod'] == '1H'] #events done by the team in the the time window
            
            teamEventsTW = [ e for e in teamEventsAll if e['secFromStart'] <= xSecEnd] #events done by the team in the time window that we analyse
            teamEventsTWP = [ e for e in teamEventsAll if e['secFromStart'] > xSecEnd and e['secFromStart'] <= ySecEnd ] #events done by the team in the time window that we want to predict
    
            #FIELD: isHome
            isHome = int(teamDoc['side'] == 'home')
            #FIELD: isWinner
            isWinner = int(int(matchDoc['winner']) == int(team))
            
            #FIELDS: numGoals..
            numGoalsHT = teamDoc['scoreHT']
            numGoals2HT = teamDoc['score'] - teamDoc['scoreHT']
            numGoalsET = max(teamDoc['scoreET'] - teamDoc['score'], 0)
            numGoalsP = teamDoc['scoreP']
            numGoalsTot = numGoalsHT + numGoals2HT + numGoalsET + numGoalsP
            finalScore = numGoalsHT + numGoals2HT + numGoalsET
            
            #FIELDS: numEvent
            ev_cnt = Counter([ e['eventName'] for e in teamEventsTW ])
            
            evAcc = { x: 
                     len([e for e in teamEventsTW if (e['eventName'] == x) and (1801 in [tag['id'] for tag in e['tags']]) ]) 
                     for x in ['Pass', 'Shot', 'Free Kick'] }
            numRedCard = len([e for e in teamEventsTW if (1701 in [tag['id'] for tag in e['tags']]) ])
            numYelCard = len([e for e in teamEventsTW if (1702 in [tag['id'] for tag in e['tags']]) ])
            numYelSecCard = len([e for e in teamEventsTW if (1703 in [tag['id'] for tag in e['tags']]) ])
            numGoalsTW = len([e for e in teamEventsTW if (e['eventName'] != 'Save attempt') and (101 in [tag['id'] for tag in e['tags']]) ])
            numOwnGoalsTW = len([e for e in teamEventsTW if (e['eventName'] != 'Save attempt') and (102 in [tag['id'] for tag in e['tags']]) ])
            
            #FIELDS: numGoalsTWP + numOwnGoalsTWP
            numGoalsTWP = len([e for e in teamEventsTWP if (e['eventName'] != 'Save attempt') and (101 in [tag['id'] for tag in e['tags']]) ])
            numOwnGoalsTWP = len([e for e in teamEventsTWP if (e['eventName'] != 'Save attempt') and (102 in [tag['id'] for tag in e['tags']]) ])
            
            #FIELDS: percentage ball possession + percentage opposite half field
            interesting_ev_cnt = len([e for e in teamEventsAll if e['eventName'] in interestingEvNames])
            oppHalfField_ev_cnt = len([e for e in teamEventsAll if e['positions'][0]['x']>50])
            percBallPoss = interesting_ev_cnt/len(matchEvents) * 100
            percOppHalfField = oppHalfField_ev_cnt/len(matchEvents) * 100
            
            #FIELDS: meanPrevScore..
            game_time = matchDoc['timestamp']
            prev_allScores = list(mt_col.find({
                'teamsData.{}'.format(str(team)): {'$exists': True},
                'timestamp': {'$lt': game_time}
            },
            {'_id': 0, 
             'teamsData.{}.score'.format(str(team)): 1, 
             'teamsData.{}.scoreET'.format(str(team)): 1,
             'teamsData.{}.scoreHT'.format(str(team)): 1,
             'teamsData.{}.scoreP'.format(str(team)): 1} ))

            prev_scores = [ x['teamsData'][str(team)]['score'] for x in prev_allScores ]
            prev_scoresET = [ x['teamsData'][str(team)]['scoreET'] for x in prev_allScores ]
            prev_scoresHT = [ x['teamsData'][str(team)]['scoreHT'] for x in prev_allScores ]
            prev_scoresP = [ x['teamsData'][str(team)]['scoreP'] for x in prev_allScores ]

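            #mean of each kind of previous score (a key is omitted when the team has no previous match)
            prev_scores_dict = dict()
            for name, values in [('prev_scores', prev_scores), ('prev_scoresET', prev_scoresET),
                                 ('prev_scoresHT', prev_scoresHT), ('prev_scoresP', prev_scoresP)]:
                if len(values) > 0:
                    prev_scores_dict.update({'mean_' + name: sum(values)/len(values)})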
            
            #FIELDS: meanPlayerOverall + meanPlayerPotential
            teamPlayerIds = [ d['playerId'] for x in ['bench', 'lineup'] for d in teamDoc['formation'][x] ]
            allOv = []
            allPot = []
            for playerId in teamPlayerIds:
                if playerId in allPlayersId:
                    playerDoc = pl_col.find_one({'wyId': playerId}) #use the collection given in input, with one query per player
                    allOv.append(playerDoc.get('overall', 0))
                    allPot.append(playerDoc.get('potential', 0))

            #fall back to 0 when no player of the formation is in the players collection (avoids a division by zero)
            meanOv = sum(allOv)/len(allOv) if allOv else 0
            meanPot = sum(allPot)/len(allPot) if allPot else 0
            
            res_col.insert_one({
                'matchId': str(match), 
                'teamId': str(team),
                'isHome': isHome,
                'teamPoints': matchDoc['teamPoints'][team],
                'teamPlayerank': matchDoc['teamPlayerank'][team],
                'meanPlayerOverall': meanOv,
                'meanPlayerPotential': meanPot,
                'meanPrevScore': prev_scores_dict.get('mean_prev_scores',0),
                'meanPrevScoreET': prev_scores_dict.get('mean_prev_scoresET',0),
                'meanPrevScoreHT': prev_scores_dict.get('mean_prev_scoresHT',0),
                'meanPrevScoreP': prev_scores_dict.get('mean_prev_scoresP',0),
                'numPass': ev_cnt.get('Pass',0),
                'numDuel': ev_cnt.get('Duel',0),
                'numFoul': ev_cnt.get('Foul',0),
                'numFreeKick': ev_cnt.get('Free Kick',0),
                'numGoalkeeperLeavingLine': ev_cnt.get('Goalkeeper leaving line',0),
                'numInterruption': ev_cnt.get('Interruption',0),
                'numOffside': ev_cnt.get('Offside',0),
                'numOthersOnTheBall': ev_cnt.get('Others on the ball',0),
                'numSaveAttempt': ev_cnt.get('Save attempt',0),
                'numShot': ev_cnt.get('Shot',0),
                'rateAccPass': evAcc['Pass']/ev_cnt.get('Pass',1),
                'rateAccFreeKick': evAcc['Free Kick']/ev_cnt.get('Free Kick',1),
                'rateAccShot': evAcc['Shot']/ev_cnt.get('Shot',1),
                'numYellowCard': numYelCard,
                'numSecondYellowCard': numYelSecCard,
                'numRedCard': numRedCard,
                'percBallPoss': percBallPoss,
                'percOppHalfField': percOppHalfField,
                'numGoalsTW': numGoalsTW,
                'numOwnGoalsTW': numOwnGoalsTW,
                'scoreTW': 0,
                'numGoalsTWP': numGoalsTWP,
                'numOwnGoalsTWP': numOwnGoalsTWP,
                'numGoals1H': numGoalsHT,
                'numGoals2H': numGoals2HT,
                'numGoalsET': numGoalsET,
                'finalScore': finalScore,
                'isWinner': isWinner,
                'didScoreInTWP': 0 # int(numGoals2HT > 0)
            })
        
        #FIELDS: scoreTW + didScoreInTWP + goalsDiff (they need the values of both teams)
        teamA = str(teamIds[0])
        teamB = str(teamIds[1])
        fields = ['numGoalsTW', 'numOwnGoalsTW', 'numGoalsTWP', 'numOwnGoalsTWP', 'finalScore']
        info = { t: { f: res_col.find_one({ 'matchId': str(match), 'teamId': t })[f] for f in fields } for t in teamIds }
        
        #the score of a team counts its goals plus the own goals of the opponent
        res_col.update_one({ 'matchId': str(match), 'teamId': teamA }, { '$set': { 
            'scoreTW': info[teamA]['numGoalsTW'] + info[teamB]['numOwnGoalsTW'],
            'didScoreInTWP': int(( info[teamA]['numGoalsTWP'] + info[teamB]['numOwnGoalsTWP'] ) > 0),
            'goalsDiff': info[teamA]['finalScore'] - info[teamB]['finalScore']
        } })
        res_col.update_one({ 'matchId': str(match), 'teamId': teamB }, { '$set': { 
            'scoreTW': info[teamB]['numGoalsTW'] + info[teamA]['numOwnGoalsTW'],
            'didScoreInTWP': int(( info[teamB]['numGoalsTWP'] + info[teamA]['numOwnGoalsTWP'] ) > 0),
            'goalsDiff': info[teamB]['finalScore'] - info[teamA]['finalScore']
        } })
        
        

In [330]:
#db.resIt.delete_many({ })
computation(db.events_Italy, db.matches_Italy, db.players, db.resIt, 30, 20)

100%|████████████████████████████████████████████████████████████████████████████████| 380/380 [04:57<00:00,  1.27it/s]
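The accuracy rates `rateAccPass`, `rateAccFreeKick` and `rateAccShot` are ratios between tagged events and all the events with the same name. A minimal sketch of that ratio, assuming events shaped like those of the events collection and the tag id `1801` that `computation` checks for accurate events; the helper `accuracy_rate` and the toy events are hypothetical.

```python
from collections import Counter

ACCURATE = 1801  # tag id checked by the computation function for accurate events

def accuracy_rate(events, event_name):
    """Share of the events named `event_name` that carry the accurate tag."""
    counts = Counter(e['eventName'] for e in events)
    accurate = len([e for e in events
                    if e['eventName'] == event_name
                    and ACCURATE in [t['id'] for t in e['tags']]])
    # the default 1 is used only when the event never occurs, so the rate becomes 0/1
    return accurate / counts.get(event_name, 1)

# toy events (hypothetical): four passes, three of them tagged as accurate, and one duel
rate_events = [
    {'eventName': 'Pass', 'tags': [{'id': 1801}]},
    {'eventName': 'Pass', 'tags': [{'id': 1802}]},  # any other tag id: not counted as accurate
    {'eventName': 'Pass', 'tags': [{'id': 1801}]},
    {'eventName': 'Pass', 'tags': [{'id': 1801}]},
    {'eventName': 'Duel', 'tags': []},
]
rate_pass = accuracy_rate(rate_events, 'Pass')
rate_shot = accuracy_rate(rate_events, 'Shot')
print(rate_pass, rate_shot)
```

A team with no event of a given name in the time window gets a rate of 0, which cannot be told apart from a team that tried and always failed; this is worth keeping in mind when reading the correlations.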


In [331]:
db.resIt.find_one()

{'_id': ObjectId('5dc45fdfa7d2fe19e8d63226'),
 'matchId': '2576335',
 'teamId': '3162',
 'isHome': 1,
 'meanPlayerOverall': 53.391304347826086,
 'meanPlayerPotential': 55.130434782608695,
 'meanPrevScore': 2.3513513513513513,
 'meanPrevScoreET': 0.0,
 'meanPrevScoreHT': 1.0810810810810811,
 'meanPrevScoreP': 0.0,
 'numPass': 98,
 'numDuel': 65,
 'numFoul': 2,
 'numFreeKick': 14,
 'numGoalkeeperLeavingLine': 0,
 'numInterruption': 0,
 'numOffside': 1,
 'numOthersOnTheBall': 22,
 'numSaveAttempt': 3,
 'numShot': 5,
 'rateAccPass': 0.8367346938775511,
 'rateAccFreeKick': 0.7857142857142857,
 'rateAccShot': 0.2,
 'numYellowCard': 1,
 'numSecondYellowCard': 0,
 'numRedCard': 0,
 'percBallPoss': 40.74074074074074,
 'percOppHalfField': 20.123456790123456,
 'numGoalsTW': 0,
 'numOwnGoalsTW': 0,
 'scoreTW': 1,
 'numGoalsTWP': 1,
 'numOwnGoalsTWP': 0,
 'numGoals1H': 2,
 'numGoals2H': 0,
 'numGoalsET': 0,
 'finalScore': 2,
 'isWinner': 0,
 'didScoreInTWP': 1,
 'goalsDiff': -1}

## <font color=orange>2.1 Pandas DataFrame from our results collection</font>

In [1211]:
orderedCols = ['matchId', 'teamId', 'isHome',
               'teamPoints', 'teamPlayerank',
               'meanPlayerOverall', 'meanPlayerPotential', 'meanPrevScore', 'meanPrevScoreET', 'meanPrevScoreHT', 'meanPrevScoreP', 
               'numDuel', 'numFoul', 'numFreeKick', 'numGoalkeeperLeavingLine', 'numInterruption', 'numOffside', 'numOthersOnTheBall', 'numPass', 'numSaveAttempt', 'numShot',
               'rateAccFreeKick', 'rateAccPass', 'rateAccShot',
               'numYellowCard', 'numSecondYellowCard', 'numRedCard',
               'percBallPoss', 'percOppHalfField',
               'numGoalsTW', 'numOwnGoalsTW', 'scoreTW',
               'numGoalsTWP', 'numOwnGoalsTWP',
               'numGoals1H', 'numGoals2H', 'numGoalsET',
               'goalsDiff', 'finalScore', 'isWinner', 'didScoreInTWP'
              ]

def dfFromCursor (cursor):
    df = pd.DataFrame(list(cursor))
    res = pd.DataFrame(columns=orderedCols)
    for c in orderedCols:
        res[c] = df[c]
    
    return res

In [1212]:
res = dfFromCursor(db.res1.find({ 'matchId': '2058015' }, { '_id': 0 }))
res.transpose()

| | 0 | 1 |
|---|---|---|
| matchId | 2058015.0 | 2058015.0 |
| teamId | 2413.0 | 9598.0 |
| isHome | 0.0 | 1.0 |
| teamPoints | 10.0 | 11.0 |
| teamPlayerank | 0.672707 | 0.75577 |
| meanPlayerOverall | 52.5217 | 75.1364 |
| meanPlayerPotential | 54.3478 | 78.0909 |
| meanPrevScore | 2.2 | 1.8 |
| meanPrevScoreET | 0.2 | 0.6 |
| meanPrevScoreHT | 1.4 | 0.6 |
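The selection and ordering of the columns can also be left to pandas in a single step. A minimal sketch, assuming that the cursor yields dictionaries like the documents of the results collection; the helper `ordered_frame`, the shortened column list and the toy records are hypothetical. Note that `reindex` fills a column that is missing from the data with `NaN`, while the loop in `dfFromCursor` raises a `KeyError`.

```python
import pandas as pd

def ordered_frame(records, columns):
    """Build a DataFrame from an iterable of dicts, keeping only `columns`, in that order."""
    return pd.DataFrame(list(records)).reindex(columns=columns)

# toy records (hypothetical values), with the keys in a different order and no 'teamPoints'
toy_records = [
    {'teamId': '2413', 'isHome': 0, 'matchId': '2058015', 'numPass': 98},
    {'teamId': '9598', 'isHome': 1, 'matchId': '2058015', 'numPass': 120},
]
frame = ordered_frame(toy_records, ['matchId', 'teamId', 'isHome', 'teamPoints'])
print(frame)
```

Whether a silent `NaN` column or an explicit error is preferable depends on the use: the error is safer while the results collection is still changing.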


<img src="corrMatTriang.png" alt="Triangular correlation matrix" title="Correlation matrix" />

# <font color=red>4. Model</font>

Suggestions for the choice of our models:
1. start with a model that performs really well, even if it is difficult to explain
2. then choose a simpler model that is easier to work with

Choose the validation metric (__scoring__: precision, recall, accuracy or F1) and explain the choice on the basis of the real needs.

Do the Hyperparameter Tuning (HT) using __GridSearchCV__ on 20% of the dataset, obtaining a decision tree (DT) that we then evaluate with cross validation (CV) on the remaining 80% of the dataset.

Note that both the function for GridSearchCV and the function for CV (on the 80% of the dataset) take the scoring parameter as input.
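The plan above can be sketched with scikit-learn as follows. This is only an illustration under stated assumptions: the data is synthetic (generated with `make_classification` as a stand-in for our features and a binary target such as `didScoreInTWP`), and the parameter grid, the `f1` scoring and the 5 folds are hypothetical choices.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

scoring = 'f1'  # the same metric is given to both steps

# synthetic stand-in for the feature matrix and the binary target
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# 20% of the dataset for the hyperparameter tuning, 80% for the cross validation
X_cv, X_ht, y_cv, y_ht = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)

# hypothetical grid of hyperparameters for the decision tree
param_grid = {'max_depth': [2, 4, 8], 'min_samples_leaf': [1, 5, 20]}
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, scoring=scoring, cv=5)
search.fit(X_ht, y_ht)

# cross validation, on the remaining 80%, of a tree with the selected hyperparameters
best_tree = DecisionTreeClassifier(random_state=0, **search.best_params_)
cv_scores = cross_val_score(best_tree, X_cv, y_cv, scoring=scoring, cv=5)
print(search.best_params_, cv_scores.mean())
```

Keeping the tuning data separate from the data used for the cross validation avoids scoring the tree on the same samples that selected its hyperparameters.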