<font color='red' size=5>**Version with raw data**</font>

This notebook contains all the functions needed to compute the predictions of the targets `isWinner` and `didScoreInTWP` (whether a team scored in the time window of prediction) for a specific match.

We call the selected match, for which we want to make the prediction, **MP** (Match Prediction). The available data are the raw data given at the beginning of the project and the extra datasets that we incorporated during data preparation. The raw data consist of the MongoDB collections `competitions_raw`, `teams_raw`, `players_raw`, `matches_raw` and `events_raw`, obtained from Wyscout. The extra data are a MongoDB collection with the rank of each player in each match he played, `playerank`, and the information stored in a Fifa dataset of 2019. Before processing these data we select a subset of them:
- from `competitions` we select only the document of the competition in which MP is played,
- from `teams` we select all the teams that play in this competition,
- from `players` we select all the players in the raw data collection,
- from `matches` we select MP plus all the matches played before it in this competition,
- from `events` we select all the events stored for the selected matches,
- from `playerank` we select all the documents stored for the selected matches.

The prediction is made in three phases: (1) the initialization phase, with the selection of data described above, (2) the feature computation phase, in which the data are manipulated to generate the new features, and (3) the generation of the dataframe with the prediction results.

In [1]:
%load_ext autoreload
%autoreload 2

from bson.objectid import ObjectId
from collections import Counter
from dateutil import parser
from hyperparameters import params as all_params
from hyperparameters import nanSet
import numpy as np
import pandas as pd
import pickle
import pymongo
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import shuffle
import time
from time import mktime
from tqdm import tqdm
import warnings
warnings.filterwarnings('ignore')

***

# Initialization of data (Phase 1)

## Information decoding

The function `decodePlayerNames` takes as input the players collection `pl_col` and decodes the attributes `shortName`, `firstName` and `lastName`, in which some characters are stored as escaped Unicode sequences (the code decodes them with Python's `unicode_escape` codec).

Before the conversion, only 2264 of the 3603 players in our collection (62.84%) have a `shortName` that matches a name in the Fifa dataset. The other 1339 players (37.16%) have no matching name.

After the conversion, 2800 players match (77.71%), while the other 803 players (22.29%) still have a name that does not match.

The function `decodeTeamNames` takes as input the teams collection `tm_col` and decodes the attributes `city`, `name` and `officialName` in the same way.

In [2]:
def decodePlayerNames (pl_col):
    for docId in tqdm([ e['_id'] for e in pl_col.find({ }, {'_id'}) ]):
        doc = pl_col.find_one({'_id': docId})
        decodedShortName = doc['shortName'].encode().decode('unicode_escape')
        decodedFirstName = doc['firstName'].encode().decode('unicode_escape')
        decodedLastName = doc['lastName'].encode().decode('unicode_escape')
        pl_col.update_one({ '_id': docId }, 
                          { '$set': {
                              'shortName': decodedShortName,
                              'firstName': decodedFirstName,
                              'lastName': decodedLastName
                          } })

def decodeTeamNames (tm_col):
    for docId in tqdm([ e['_id'] for e in tm_col.find({ }, {'_id'}) ]):
        doc = tm_col.find_one({'_id': docId})
        cityDec = doc['city'].encode().decode('unicode_escape')
        nameDec = doc['name'].encode().decode('unicode_escape')
        offNameDec = doc['officialName'].encode().decode('unicode_escape')
        tm_col.update_one({ '_id': docId }, 
                          { '$set': {
                              'city': cityDec,
                              'name': nameDec,
                              'officialName': offNameDec
                          } })
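
The decoding step used in both functions can be tried on plain strings. The sketch below is only illustrative: the helper name `decode_escaped` and the sample name are assumptions, not part of the notebook. It also shows a caveat of this approach: a string that already contains non-ASCII characters is altered when decoded again, so the decoding should probably run only once on freshly copied raw data.

```python
def decode_escaped(text):
    """Turn literal escape sequences such as '\\u00e9' into the characters they stand for."""
    return text.encode().decode('unicode_escape')

# hypothetical raw value: it holds the six characters \ u 0 0 e 9 instead of 'é'
raw_name = 'S. Man\\u00e9'
decoded_name = decode_escaped(raw_name)
print(decoded_name)

# caveat: a second pass reads the UTF-8 bytes of 'é' as two Latin-1 characters
twice_decoded = decode_escaped(decoded_name)
print(twice_decoded == decoded_name)
```

Plain ASCII names pass through unchanged, so the step is safe for documents that never had escaped characters.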

## Initialization of soccer prediction

Imagine being in the second half of a specific match and wanting to predict whether a team will win the game and whether it will score in the second half. The available information consists of all the data collected from the beginning of the competition in which the match is played, plus everything known about the players, teams and coaches.

The function `initialization` takes as input the match identifier `mpId_` and the name `dbName` of the MongoDB database in which all the raw data collections are stored. It also takes the boolean `rebuildDB`, which is true if you want to clean and rebuild all the data for the match. The function selects the data as explained at the top of this notebook, initializing the collections needed for the evaluation and decoding the information in the teams and players collections.

From now on the following global variables are available:
- `mpId`: the identifier of the match for which we want to make the prediction,
- `mp`: the document of that match,
- `mpT1` and `mpT2`: the identifiers of the two teams that play that match,
- `client`: the MongoDB client instance of PyMongo,
- `db`: the MongoDB database instance in which all our collections are stored,
- `matchIds`: the list of the identifiers of all the matches in the matches collection,
- `teamIds`: the list of the identifiers of all the teams in the teams collection.

In [3]:
def initialization (mpId_, dbName, rebuildDB):
    global mpId, mp, mpT1, mpT2, client, db, matchIds, teamIds
    
    client = pymongo.MongoClient()
    db = client[dbName]
    
    mpId = int(mpId_)
    mp = db.matches_raw.find_one({'wyId':mpId})
    mpCompId = mp['competitionId']
    
    if rebuildDB:        
        #competitions - only the competition in which the match is played
        db.competitions.drop()
        db.competitions.insert_many([e for e in db.competitions_raw.find({ 'wyId': mpCompId },  { '_id':0 })])
        
        #teams - all the teams that play in the competition
        db.teams.drop()
        teamIds = list(set([ int(x) for e in db.matches_raw.find({ 'competitionId': mpCompId }) for x in e['teamsData'].keys() ]))
        db.teams.insert_many([e for e in db.teams_raw.find({ 'wyId': { '$in' : teamIds } },  { '_id':0 })])
        
        #players - all the players in raw data
        db.players.drop()
        db.players.insert_many([e for e in db.players_raw.find({ },  { '_id':0 })])
        
        #matches - the match selected plus all the matches played before in this competition
        db.matches.drop()
        db.matches.insert_many([e for e in db.matches_raw.find({ 'competitionId': mpCompId, 
                                                                 'date': { '$lte' : mp['date'] } }, { '_id':0 })])
        matchIds = [ e['wyId'] for e in db.matches.find() ]
        
        #events - all the events of all the matches selected
        db.events.drop()
        db.events.insert_many([e for e in db.events_raw.find({ 'matchId': { '$in' : matchIds } }, { '_id':0 })])
        
        #playerank - all the playerank of all the matches selected
        db.playerank.drop()
        db.playerank.insert_many([e for e in db.playerank_raw.find({ 'matchId': { '$in' : matchIds } },  { '_id':0 })])
        
        #decoding
        decodeTeamNames(db.teams)
        decodePlayerNames(db.players)
    
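    #matchIds and teamIds are read back from the collections, so they are defined even when rebuildDB is False
    matchIds = [ e['wyId'] for e in db.matches.find({ }, { '_id':0, 'wyId':1 }) ]
    teamIds = [ e['wyId'] for e in db.teams.find({ }, { '_id':0, 'wyId':1 }) ]
    
    mpTeamIds = list(mp['teamsData'].keys())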
    mpT1 = int(mpTeamIds[0])
    mpT2 = int(mpTeamIds[1])
    mpT1Name = db.teams.find_one({'wyId':mpT1})['officialName']
    mpT2Name = db.teams.find_one({'wyId':mpT2})['officialName']
    
    print('Match: {}\nTeam 1: {} ({})\nTeam 2: {} ({})'.format(mpId, mpT1, mpT1Name, mpT2, mpT2Name))
    print('Date: {}\nCompetition: {} ({})'.format(mp['date'], mp['competitionId'], db.competitions.find_one({ 'wyId': mpCompId })['name']))
    print('\nINFO\nNumber of matches: {}\nNumber of events: {}\n'.format(db.matches.count_documents({ }), db.events.count_documents({ })))
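
The selection rule for the matches (MP plus the earlier matches of the same competition) can be illustrated without a database. This is a minimal sketch on hypothetical in-memory documents: the helper `select_matches`, the sample data and the date format are assumptions. Unlike the `$lte` query above, which compares the stored `date` values directly, the sketch parses the dates before comparing them; a direct comparison is chronological only if the stored format sorts chronologically.

```python
from datetime import datetime

# hypothetical, simplified match documents; real documents have many more fields
matches_raw = [
    {'wyId': 1, 'competitionId': 10, 'date': '2018-01-05 20:45:00'},
    {'wyId': 2, 'competitionId': 10, 'date': '2018-01-12 18:00:00'},
    {'wyId': 3, 'competitionId': 10, 'date': '2018-01-19 20:45:00'},
    {'wyId': 4, 'competitionId': 20, 'date': '2018-01-06 15:00:00'},
]

def select_matches(all_matches, mp_id, date_format='%Y-%m-%d %H:%M:%S'):
    """Return MP plus the earlier matches of the same competition, oldest first."""
    def parsed_date(match):
        return datetime.strptime(match['date'], date_format)

    mp_doc = next(m for m in all_matches if m['wyId'] == mp_id)
    selected = [m for m in all_matches
                if m['competitionId'] == mp_doc['competitionId']
                and parsed_date(m) <= parsed_date(mp_doc)]
    return sorted(selected, key=parsed_date)

selected_ids = [m['wyId'] for m in select_matches(matches_raw, mp_id=2)]
print(selected_ids)
```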

***

# Computation of new features (Phase 2)

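## Add information from the Fifa dataset

The function `addOverallAndPotential` takes as input the players collection `pl_col` and the dataframe `df` with the Fifa information that we want to add to each player document. For every row of `df`, the player whose `shortName` equals `Name` and whose `passportArea.name` equals `Nationality` receives the fields `overall` and `potential`.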

In [4]:
def addOverallAndPotential (pl_col, df):
    for i in tqdm(range(df.shape[0])):
        pl_col.update_one(
            { 'shortName': df['Name'].iloc[i], 
             'passportArea.name': df['Nationality'].iloc[i] 
            }, { '$set': {
                'overall': int(df['Overall'].iloc[i]),
                'potential': int(df['Potential'].iloc[i])
            }}
        )

## Add features to `matches` and `events` collections

In order to evaluate `meanPlayerOverall` and `meanPlayerPotential` we need to add the field `timestamp`, obtained by converting the existing field `date`, which indicates when a match was played.

Moreover, the only time parameter of each event is the field `eventSec`, which represents the seconds elapsed from the **beginning of the period** (first half, second half, first extra time, second extra time) to the moment the event happened. To select the two time windows, the one that we want to analyse and the one on which we want to predict the target variables, we need a feature that represents the seconds elapsed from the **beginning of the match** to the moment the event happened. For this reason we add the feature `secFromStart` to each event.

Consider an event _eH2_ that happened during the second half of a match. To evaluate the `secFromStart` of _eH2_ we add to the value stored in its field `eventSec` the maximum `eventSec` over all the events that happened during the first half. We then apply the same idea to the events that happened during the extra time periods.

First we compute
- _x = max(e[eventSec])_ over the events _e_ in the first half of the match,
- _y = x + max(e[eventSec])_ over the events _e_ in the second half of the match,
- _z = y + max(e[eventSec])_ over the events _e_ in the first extra time period and
- _h = z + max(e[eventSec])_ over the events _e_ in the second extra time period.

<img src="timeline.png" alt="Alt text that describes the graphic" title="Title text" />

Then we store the new time information, measured from the beginning of the match, in each event document, and we add some features to each document in the matches collection: `secEndH1` = _x_ is the second at which we consider the first half ended, `secEndH2` = _y_ is the second at which we consider the second half ended, `secEndE1` = _z_ is the second at which we consider the first extra time period ended and `secEnd` = _h_ is the second at which we consider the match ended. In the matches without extra time, `secEndE1` and `secEnd` are both equal to _y_.

In [5]:
def addSecFromStart (ev_col, mt_col):
    matchIds = [ m['wyId'] for m in mt_col.find({ }, {'_id':0, 'wyId':1}) ] #match id of each match in the matches collection
    for match in tqdm(matchIds):
        matchDoc = mt_col.find_one({'wyId': match})
        timestamp = int(mktime(parser.parse(matchDoc['date']).timetuple())) #we produce the timestamp from the date of the match
        matchEvents = [e for e in ev_col.find( { 'matchId': int(match) } )] #all events happened during the match

        evH1 = [ e for e in matchEvents if e['matchPeriod'] == '1H' ]
        evH2 = [ e for e in matchEvents if e['matchPeriod'] == '2H' ]
        evE1 = [ e for e in matchEvents if e['matchPeriod'] == 'E1' ]
        evE2 = [ e for e in matchEvents if e['matchPeriod'] == 'E2' ]
        evP = [ e for e in matchEvents if e['matchPeriod'] == 'P' ]

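        #default=0 keeps the function from failing on a period without stored events
        x = max(( e['eventSec'] for e in evH1 ), default=0)
        y = max(( e['eventSec'] for e in evH2 ), default=0) + x
        #without extra time both z and h stay equal to y
        z = y
        h = z
        if len(evE1) > 0:
            z = max( e['eventSec'] for e in evE1 ) + y
            h = max(( e['eventSec'] for e in evE2 ), default=0) + z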

        evSecDict = { e['_id']: e['eventSec'] for e in matchEvents }
        
        #add secFromStart to events in H1 
        for eId in [ e['_id'] for e in evH1 ]:
            ev_col.update_one({ '_id': eId }, { '$set': { 'secFromStart': evSecDict[eId] } } )
        
        #add secFromStart to events in H2 
        for eId in [ e['_id'] for e in evH2 ]:
            ev_col.update_one({ '_id': eId }, { '$set': { 'secFromStart': evSecDict[eId] + x } } )

        #notice that we enter the following loops only if the match has extra time
        #add secFromStart to events in E1
        for eId in [ e['_id'] for e in evE1 ]:
            ev_col.update_one({ '_id': eId }, { '$set': { 'secFromStart': evSecDict[eId] + y } } )

        #add secFromStart to events in E2
        for eId in [ e['_id'] for e in evE2 ]:
            ev_col.update_one({ '_id': eId }, { '$set': { 'secFromStart': evSecDict[eId] + z } } )
        
        #add secFromStart to events in P
        for eId in [ e['_id'] for e in evP ]:
            ev_col.update_one({ '_id': eId }, { '$set': { 'secFromStart': evSecDict[eId] + h } } )


        mt_col.update_one({ 'wyId': match }, { 
            '$set': {
                'timestamp': timestamp,
                'secEndH1': x, 'secEndH2': y, 'secEndE1': z, 'secEnd': h 
            } } )
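
The offsets added by `addSecFromStart` can be checked on a handful of events. The following is a small sketch on hypothetical data, without MongoDB; the helper `period_offsets` is an assumption introduced only for illustration. Each period starts at the sum of the maximum `eventSec` of the periods before it, which corresponds to the values _0, x, y, z, h_ described above.

```python
# hypothetical events of one match without extra time
events = [
    {'matchPeriod': '1H', 'eventSec': 10.0},
    {'matchPeriod': '1H', 'eventSec': 2820.0},
    {'matchPeriod': '2H', 'eventSec': 5.0},
    {'matchPeriod': '2H', 'eventSec': 2940.0},
]

def period_offsets(match_events, periods=('1H', '2H', 'E1', 'E2', 'P')):
    """Map each period to the seconds elapsed before that period starts."""
    offsets, elapsed = {}, 0
    for period in periods:
        offsets[period] = elapsed
        # a period without events adds nothing, so the later offsets stay unchanged
        elapsed += max((e['eventSec'] for e in match_events
                        if e['matchPeriod'] == period), default=0)
    return offsets

offsets = period_offsets(events)
sec_from_start = [e['eventSec'] + offsets[e['matchPeriod']] for e in events]
print(offsets)
print(sec_from_start)
```

In this example the first event of the second half, 5 seconds into the period, ends up 2825 seconds after the start of the match.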

## Add features to `playerank` collection

To evaluate correctly the `playerankScore` of the collection `playerank` we compute the minutes that each player actually played in the match, using the information available in the collection `matches`, and we store them in the field `realMinPlayed`. Then we apply the following three normalizations to the field `playerankScore`:

1. Min-Max normalization to have values in the range [0, 1]
2. Scaling by the fraction of the match the player was on the pitch, i.e. _'how many minutes the player played in the match'_ divided by _'how many minutes the game lasted'_.
3. Min-Max normalization to have values in the range [0, 1]

The functions `addRealMinPlayed` and `addPlayerankScoreNorm` both take as input the matches collection `mt_col` and the playerank collection `rank_col`. To each document of the playerank collection the first function adds the field `realMinPlayed`, while the second adds the field `playerankScoreNorm`.


In [6]:
def addRealMinPlayed (mt_col, rank_col):
    matchesIdRank = list(set([ e['matchId'] for e in rank_col.find() ]))
    
    for match in tqdm(matchesIdRank):
        matchDoc = mt_col.find_one({'wyId': match})
        minMatchDuration = int(matchDoc['secEnd']/60)
        teamIds = list(matchDoc['teamsData'].keys()) #team id for both teams in the match mt

        for team in teamIds:
            teamDoc = matchDoc['teamsData'][team]

            minPlayers = { d['playerId']: minMatchDuration * int(x == 'lineup') for x in ['bench', 'lineup'] for d in teamDoc['formation'][x] }

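            subs = teamDoc['formation']['substitutions']
            if subs == 'null': #as in addActivePlayers, a team without substitutions stores the string 'null'
                subs = []
            for d in subs:
                minPlayers.update({ d['playerOut']: d['minute'] })
                minPlayers.update({ d['playerIn']: minMatchDuration - d['minute'] })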

            for plId in [k for k in minPlayers.keys()]:
                rank_col.update_one({ 'playerId': plId, 'matchId': match }, { '$set': {'realMinPlayed': minPlayers[plId] } })
                
def addPlayerankScoreNorm (mt_col, rank_col):
    #df from Rank collection
    dfRank = pd.DataFrame(list(rank_col.find({ }, {'_id':1, 'matchId':1, 'playerankScore':1, 'realMinPlayed': 1})))
    dfRank['realMinPlayed'] = dfRank['realMinPlayed'].values.astype(int)

    #df from Match collection
    dfMatch = pd.DataFrame(list(mt_col.find({ }, {'_id':0, 'wyId':1, 'secEnd':1})))
    dfMatch = dfMatch.rename(columns={'wyId':'matchId'})
    dfMatch['matchMinDuration'] = (dfMatch['secEnd']/60).values.astype(int)

    #df final
    df = pd.merge(dfRank, dfMatch, on='matchId')
    scaler = MinMaxScaler()
    df['playerankScoreNorm1'] = scaler.fit_transform(df[['playerankScore']]) #fit on the merged df so that the rows stay aligned
    df['playerankScoreNorm2'] = df['playerankScoreNorm1']*df['realMinPlayed']/df['matchMinDuration']
    df['playerankScoreNorm3'] = scaler.fit_transform(df[['playerankScoreNorm2']])

    #update the collection
    normedValsDict = {row['_id']: row['playerankScoreNorm3'] for index, row in df.iterrows()}
    for docId in tqdm([k for k in normedValsDict.keys()]):
        rank_col.update_one({ '_id': docId }, { '$set': { 'playerankScoreNorm': normedValsDict[docId] } })
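
The three normalization steps can be followed on a few numbers. The sketch below uses plain Python instead of `MinMaxScaler`, and the helper `min_max` and the sample rows are assumptions made only for illustration.

```python
def min_max(values):
    """Rescale values linearly so that the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant input: avoid dividing by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# hypothetical rows: (playerankScore, realMinPlayed, matchMinDuration)
rows = [(0.010, 95, 95), (0.030, 60, 95), (-0.010, 35, 95), (0.050, 0, 95)]

norm1 = min_max([score for score, _, _ in rows])                       # step 1
norm2 = [n * played / duration
         for n, (_, played, duration) in zip(norm1, rows)]             # step 2
norm3 = min_max(norm2)                                                 # step 3
print(norm3)
```

The last row has the highest raw score but zero minutes played, so step 2 brings it down to 0. This is the effect the second normalization is meant to have.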

## Add playerank features to `matches` collection

The function `addActivePlayers` takes as input the matches collection `mt_col` and the playerank collection `rank_col` and adds to each match document the field `activePlayerank`, a dictionary that stores, for each player who actually played in the match, his `playerankScoreNorm` in that match (or -1 if no playerank document is available for him). The players who actually played are those in the lineup plus those who came on as substitutes, according to the information stored in each match document.

In [7]:
def addActivePlayers (mt_col, rank_col):
    docIds = [ m['_id'] for m in mt_col.find({ }, {'_id':1}) ] #match id of each match in the matches collection
    
    for docId in tqdm(docIds):
        matchDoc = mt_col.find_one({ '_id': docId })
        teamIds = list(matchDoc['teamsData'].keys()) #team id for both teams in the match mt
        
        matchPlsRank = [ d for d in rank_col.find({ 'matchId': matchDoc['wyId'] }) ]
        plIdsAvailable = [ str(d['playerId']) for d in matchPlsRank ]
        
        playerankDict = {}
        for team in teamIds:
            teamDoc = matchDoc['teamsData'][team]
            
            lineupPls = [ str(d['playerId']) for d in teamDoc['formation']['lineup'] ]
            subs = teamDoc['formation']['substitutions'] # exist teams that don't substitute any player during the match
            subsPls = [ str(d['playerIn']) for d in subs if subs != 'null' ]
            
            for p in lineupPls + subsPls:
                if p in plIdsAvailable:
                    playerankDict.update({ p: [e['playerankScoreNorm'] for e in matchPlsRank if e['playerId'] == int(p)][0] })
                else:
                    playerankDict.update({ p: -1 })
                    
        mt_col.update_one({'_id': docId}, { '$set': { 'activePlayerank': playerankDict } })

The function `addTeamPlayerank` takes as input the matches collection `mt_col` and the players collection `pl_col` and adds to each match document the field `teamPlayerank`, a dictionary that stores, for each team that plays in the match, the mean of the rank values of its players. The rank value of a player is computed from the `activePlayerank` values of his previous matches with an exponentially weighted mean; a team without any available value gets -1.

In [8]:
def addTeamPlayerank (mt_col, pl_col):
    matchDocs = [e for e in mt_col.find() ]
    matchIds = [ d['wyId'] for d in matchDocs ]
    plIds = [ p['wyId'] for p in pl_col.find() ]
    #for each player p in the pl_col we store the list of matches in which p have played
    playerMatchesPlayed = { p: [ match['wyId'] for match in matchDocs if str(p) in match['activePlayerank'].keys() ] for p in tqdm(plIds) }
    
    for match in tqdm(matchIds):
        matchDoc = mt_col.find_one({'wyId': match})
        teamIds = list(matchDoc['teamsData'].keys()) #team id for both teams in the match mt
        actTimestamp = matchDoc['timestamp']
        
        teamPlayerank = {}
        for team in teamIds:
            teamDoc = matchDoc['teamsData'][team]
            teamPlayerIds = [ d['playerId'] for x in ['bench', 'lineup'] for d in teamDoc['formation'][x] ] #ids as integers

            listPlayerank = []
            for p in teamPlayerIds:
                if p in plIds:
                    ranksPrevMathces = [ m['activePlayerank'][str(p)] for m in [ d for d in matchDocs if (d['wyId'] in playerMatchesPlayed[p]) and (d['timestamp'] < actTimestamp) ] if (m['activePlayerank'][str(p)] >= 0) ]
                    if len(ranksPrevMathces) > 0:
                        listPlayerank.append( pd.Series.ewm(pd.Series(ranksPrevMathces), span=len(ranksPrevMathces)).mean().mean() )

            if len(listPlayerank) > 0:
                teamPlayerank.update( { str(team): sum(listPlayerank)/len(listPlayerank) } )
            else:
                teamPlayerank.update( { str(team): -1 } )
        
        mt_col.update_one({ 'wyId': match }, { '$set': { 'teamPlayerank': teamPlayerank } })
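
The per-player value appended to `listPlayerank` is `ewm(span=n).mean().mean()` over the player's previous ranks. The sketch below reproduces that computation in plain Python, assuming the weighting that pandas documents for `ewm(span=..., adjust=True)`, which is the default. The helpers `ewm_means` and `player_rank` and the sample values are hypothetical.

```python
def ewm_means(values, span):
    """Exponentially weighted means with alpha = 2 / (span + 1) and normalised weights."""
    alpha = 2.0 / (span + 1)
    result = []
    for t in range(len(values)):
        # the most recent value has weight 1, older values decay by (1 - alpha)
        weights = [(1 - alpha) ** (t - i) for i in range(t + 1)]
        weighted_sum = sum(w * v for w, v in zip(weights, values[:t + 1]))
        result.append(weighted_sum / sum(weights))
    return result

def player_rank(prev_ranks):
    """Average of the exponentially weighted means, ignoring the -1 placeholders."""
    valid = [r for r in prev_ranks if r >= 0]
    if not valid:
        return None
    means = ewm_means(valid, span=len(valid))
    return sum(means) / len(means)

# hypothetical ranks of one player in three previous matches (-1 = no playerank available)
print(player_rank([0.2, -1, 0.4]))
```

The result depends on the order of the previous matches, so the sketch assumes they are listed from the oldest to the most recent.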

The function `addTeamPoints` takes as input the matches collection `mt_col` and the competitions collection `cm_col` and adds to each match document the field `teamPoints`, a dictionary that stores, for each team that plays in the match, the points earned from the beginning of the competition up to the previous match played by that team. To assign the points we apply the standard method: at the end of each match we update the points of both teams with the following rules:

- plus 3 points if the team is the winner
- 0 points if the team is the loser
- plus 1 point if the match ends in a draw

In [9]:
def addTeamPoints (mt_col, cm_col):
    compIds = [ d['wyId'] for d in cm_col.find({ }, { '_id':0, 'wyId':1 }) ]

    for comp in tqdm(compIds):
        matchesOfComp = [ e for e in mt_col.find({ 'competitionId': comp }) ]
        allTeamIds = set([ z for y in [list(x['teamsData'].keys()) for x in matchesOfComp] for z in y ])
        teamPoints = { t:0  for t in allTeamIds }
        matchIds = [ e[1] for e in sorted([ (match['timestamp'], match['wyId']) for match in matchesOfComp ]) ]
        for match in matchIds:
            matchDoc = mt_col.find_one({ 'wyId': match })
            teamIds = list(matchDoc['teamsData'].keys()) #team id for both teams in the match mt
            teamA = teamIds[0]
            teamB = teamIds[1]

            mt_col.update_one({ 'wyId': match }, { '$set':  { 
                'teamPoints': { str(teamA): teamPoints[teamA], 
                               str(teamB): teamPoints[teamB] } } })

            scoreA = matchDoc['teamsData'][str(teamA)]['score']
            scoreB = matchDoc['teamsData'][str(teamB)]['score']

            if matchDoc['duration'] in ['ExtraTime', 'Penalties']:
                scoreA = matchDoc['teamsData'][str(teamA)]['scoreET']
                scoreB = matchDoc['teamsData'][str(teamB)]['scoreET']

            if scoreA > scoreB:
                teamPoints.update({ teamA: teamPoints[teamA] + 3 })
            elif scoreA < scoreB:
                teamPoints.update({ teamB: teamPoints[teamB] + 3 })
            else:
                teamPoints.update({ teamA: teamPoints[teamA] + 1, teamB: teamPoints[teamB] + 1 })
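
The important detail of `addTeamPoints` is that the points are stored before the result of the match is counted, so `teamPoints` never contains information from the match itself. A minimal sketch on a hypothetical mini-competition follows; the helper `points_before_each_match` and the results are assumptions.

```python
def points_before_each_match(results):
    """results: chronologically ordered (teamA, scoreA, teamB, scoreB) tuples.

    Returns, for every match, the points each team had before kick-off.
    """
    points = {}
    before = []
    for team_a, score_a, team_b, score_b in results:
        points.setdefault(team_a, 0)
        points.setdefault(team_b, 0)
        # snapshot first, update afterwards
        before.append({team_a: points[team_a], team_b: points[team_b]})
        if score_a > score_b:
            points[team_a] += 3
        elif score_a < score_b:
            points[team_b] += 3
        else:
            points[team_a] += 1
            points[team_b] += 1
    return before

# hypothetical results of three matches among the teams A, B and C
results = [('A', 2, 'B', 0), ('B', 1, 'C', 1), ('A', 0, 'C', 1)]
print(points_before_each_match(results))
```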

## Computation of features

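The function `computation` takes as input
the events collection ```ev_col``` of all the available events,
the matches collection ```mt_col``` of all the matches played,
the players collection ```pl_col``` of all the Wyscout players, and
the results collection ```res_col``` that, after the function execution, contains all the computed results. The optional integers `xMinutes` and `yMinutes` give the length in minutes of the analyzed time window _TW_ and of the prediction time window _TWP_; if they are not both integers, _TW_ is the first half and _TWP_ is the second half of the match.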


The documents in the collection `res_col` contain the following fields:
- `matchId`: corresponds to the wyscout identifier of the match
- `teamId`: corresponds to the wyscout identifier of the team
- `isHome`: 1 if the team is playing at home, 0 otherwise 


- `teamPoints`: the points of the team in the competition standings before the match
- `teamPlayerank`: the mean of the normalized `playerankScore` of the players available for that team in that match
- `meanPlayerOverall`: mean of player overall rating given by the Fifa 19 dataset 
- `meanPlayerPotential`: mean of player potential rating given by the Fifa 19 dataset 


- `meanPrevScore`, `meanPrevScoreET`, `meanPrevScoreHT` and `meanPrevScoreP`: each represents the mean of the scores respectively at the end of the match, at the end of extra time, at the end of the first half and at the end of the penalties of the matches played by the team before that match


- `numPass`, `numDuel`, `numFoul`, `numFreeKick`, `numGoalkeeperLeavingLine`, `numInterruption`, `numOffside`, `numOthersOnTheBall`, `numSaveAttempt` and `numShot`: each field corresponds to the number of events done by the team during the match and named respectively as _Pass, Duel, Foul, Free Kick, Goalkeeper Leaving Line, Interruption, Offside, Others On The Ball, Save Attempt, Shot_.


- `rateAccPass`: number of events named _Pass_ and tagged as accurate over `numPass` 
- `rateAccFreeKick`: number of events named _Free Kick_ and tagged as accurate over `numFreeKick`
- `rateAccShot`: number of events named _Shot_ and tagged as accurate over `numShot`


- `numYellowCard`, `numSecondYellowCard` and `numRedCard`: each field represents the number of events respectively tagged as _yellow card, second yellow card_ and _red card_

- `percBallPoss`: the percentage of all the events of the match that are "interesting" events (_Pass, Duel, Free Kick, Others on the ball, Shot_) done by the team, used as a proxy for ball possession
- `percOppHalfField`: the percentage of all the events of the match that are done by the team in the opponent's half of the field


- `numGoalsTW` and `numOwnGoalsTW`: the number of goals and own goals done by the team in the analyzed time window _TW_, i.e. the events tagged respectively as _goal_ and _own goal_ that are not named _Save Attempt_
- `scoreTW`: the score of that team at the end of the analyzed time window _TW_, i.e. the sum of the field `numGoalsTW` of that team and the field `numOwnGoalsTW` of the opponent team in that match
- `numGoalsTWP` and `numOwnGoalsTWP`: the number of goals and own goals done by the team in the prediction time window _TWP_, i.e. the events tagged respectively as _goal_ and _own goal_ that are not named _Save Attempt_
- `numGoals1H`, `numGoals2H` and `numGoalsET`: each represents the number of goals scored by the team during respectively the first half of the match, the second half of the match and the extra time
- `finalScore`: score of the team at the end of the match


- `isWinner`: 1 if the team is the winner of that match, 0 otherwise
- `didScoreInTWP`: 1 if the sum of the field `numGoalsTWP` of that team and the field `numOwnGoalsTWP` of the opponent team in that match is greater than zero, 0 otherwise
- `goalsDiff`: goal difference between that team and the opponent team
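
Several of the fields above are counts of tagged events. The sketch below shows how an accuracy rate and the target `didScoreInTWP` can be derived from event lists. It uses the tag ids that appear in `computation` (1801 accurate, 101 goal, 102 own goal), while the helper names, the sample events and the choice of returning 0 when there are no events of a given name are assumptions.

```python
ACCURATE, GOAL, OWN_GOAL = 1801, 101, 102  # tag ids used in `computation`

def has_tag(event, tag_id):
    return tag_id in [tag['id'] for tag in event['tags']]

def accuracy_rate(events, event_name):
    """Share of the events called event_name that carry the 'accurate' tag (0 if there are none)."""
    named = [e for e in events if e['eventName'] == event_name]
    if not named:
        return 0.0
    return sum(has_tag(e, ACCURATE) for e in named) / len(named)

def count_tagged(events, tag_id):
    """Number of events with the tag, skipping the 'Save attempt' events."""
    return sum(1 for e in events
               if e['eventName'] != 'Save attempt' and has_tag(e, tag_id))

# hypothetical events of the two teams in the same time window
team_events = [
    {'eventName': 'Pass', 'tags': [{'id': 1801}]},
    {'eventName': 'Pass', 'tags': [{'id': 1802}]},  # any tag other than 1801
    {'eventName': 'Shot', 'tags': [{'id': 101}, {'id': 1801}]},
]
opponent_events = [
    {'eventName': 'Save attempt', 'tags': [{'id': 101}]},        # not counted as a goal
    {'eventName': 'Others on the ball', 'tags': [{'id': 102}]},  # own goal of the opponent
]

rate_acc_pass = accuracy_rate(team_events, 'Pass')
goals_for = count_tagged(team_events, GOAL) + count_tagged(opponent_events, OWN_GOAL)
did_score = int(goals_for > 0)
print(rate_acc_pass, goals_for, did_score)
```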

In [10]:
def computation (ev_col, mt_col, pl_col, res_col, xMinutes='', yMinutes=''):
    
    #length of the time windows given as integers
    if (isinstance(xMinutes, int) and isinstance(yMinutes, int)):
        if (xMinutes <= 0 or yMinutes <= 0):
            print('The length of the time window (given in minutes) must be greater than zero!')
            return

        xSecEnd = xMinutes * 60
        ySecEnd = xSecEnd + yMinutes * 60 

    allPlayersId = [ p['wyId'] for p in pl_col.find({ }) ]
    interestingEvNames = ['Pass', 'Duel', 'Free Kick',  'Others on the ball', 'Shot']
    matchIds = [ m['wyId'] for m in mt_col.find({ }, {'_id':0, 'wyId':1}) ] #match id of each match in the matches collection
    
    for match in tqdm(matchIds):
        matchDoc = mt_col.find_one({'wyId': match})
        teamIds = list(matchDoc['teamsData'].keys()) #team id for both teams in the match mt
        
        matchEvents = [e for e in ev_col.find( { 'matchId': int(match) } )] #all events happened during the match
        
        for team in teamIds:
            teamDoc = matchDoc['teamsData'][team]
            
            #events of the actual team
            teamEventsAll = [ e for e in matchEvents if e['teamId'] == int(team) ]#events done by the team in the match
            teamEventsTW, teamEventsTWP = [], []
            
            #length of the time windows given as integers
            if (isinstance(xMinutes, int) and isinstance(yMinutes, int)):
                teamEventsTW = [ e for e in teamEventsAll if e['secFromStart'] <= xSecEnd] #events done by the team in the time window that we analyse
                teamEventsTWP = [ e for e in teamEventsAll if e['secFromStart'] > xSecEnd and e['secFromStart'] <= ySecEnd ] #events done by the team in the time window that we want to predict
            else:
                teamEventsTW = [ e for e in teamEventsAll if e['matchPeriod'] == '1H'] #events done by the team in the time window that we analyse (first half of the match)
                teamEventsTWP = [ e for e in teamEventsAll if e['matchPeriod'] == '2H'] #events done by the team in the time window that we want to predict (second half of the match)
            
    
            #FIELD: isHome
            isHome = int(teamDoc['side'] == 'home')
            #FIELD: isWinner
            isWinner = int(int(matchDoc['winner']) == int(team))
            
            #FIELDS: numGoals..
            numGoalsHT = teamDoc['scoreHT']
            numGoals2HT = teamDoc['score'] - teamDoc['scoreHT']
            numGoalsET = max(teamDoc['scoreET'] - teamDoc['score'], 0)
            numGoalsP = teamDoc['scoreP']
            numGoalsTot = numGoalsHT + numGoals2HT + numGoalsET + numGoalsP
            finalScore = numGoalsHT + numGoals2HT + numGoalsET
            
            #FIELDS: numEvent
            ev_cnt = Counter([ e['eventName'] for e in teamEventsTW ])
            
            evAcc = { x: 
                     len([e for e in teamEventsTW if (e['eventName'] == x) and (1801 in [tag['id'] for tag in e['tags']]) ]) 
                     for x in ['Pass', 'Shot', 'Free Kick'] }
            numRedCard = len([e for e in teamEventsTW if (1701 in [tag['id'] for tag in e['tags']]) ])
            numYelCard = len([e for e in teamEventsTW if (1702 in [tag['id'] for tag in e['tags']]) ])
            numYelSecCard = len([e for e in teamEventsTW if (1703 in [tag['id'] for tag in e['tags']]) ])
            numGoalsTW = len([e for e in teamEventsTW if (e['eventName'] != 'Save attempt') and (101 in [tag['id'] for tag in e['tags']]) ])
            numOwnGoalsTW = len([e for e in teamEventsTW if (e['eventName'] != 'Save attempt') and (102 in [tag['id'] for tag in e['tags']]) ])
            
            #FIELDS: numGoalsTWP + numOwnGoalsTWP
            numGoalsTWP = len([e for e in teamEventsTWP if (e['eventName'] != 'Save attempt') and (101 in [tag['id'] for tag in e['tags']]) ])
            numOwnGoalsTWP = len([e for e in teamEventsTWP if (e['eventName'] != 'Save attempt') and (102 in [tag['id'] for tag in e['tags']]) ])
            
            #FIELDS: percentage ball possession + percentage opposite half field
            interesting_ev_cnt = len([e for e in teamEventsAll if e['eventName'] in interestingEvNames])
            oppHalfField_ev_cnt = len([e for e in teamEventsAll if e['positions'][0]['x']>50])
            percBallPoss = interesting_ev_cnt/len(matchEvents) * 100
            percOppHalfField = oppHalfField_ev_cnt/len(matchEvents) * 100
            
            #FIELD: prevScore..
            game_time = matchDoc['timestamp']
            prev_allScores = list(mt_col.find({
                'teamsData.{}'.format(str(team)): {'$exists': True},
                'timestamp': {'$lt': game_time}
            },
            {'_id': 0, 
             'teamsData.{}.score'.format(str(team)): 1, 
             'teamsData.{}.scoreET'.format(str(team)): 1,
             'teamsData.{}.scoreHT'.format(str(team)): 1,
             'teamsData.{}.scoreP'.format(str(team)): 1} ))

            prev_scores = [ x['teamsData'][str(team)]['score'] for x in prev_allScores ]
            prev_scoresET = [ x['teamsData'][str(team)]['scoreET'] for x in prev_allScores ]
            prev_scoresHT = [ x['teamsData'][str(team)]['scoreHT'] for x in prev_allScores ]
            prev_scoresP = [ x['teamsData'][str(team)]['scoreP'] for x in prev_allScores ]

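            # mean of each list of previous scores; empty lists are skipped,
            # so the .get(..., 0) defaults used at insert time apply instead
            prev_scores_dict = dict()
            for name, values in [('prev_scores', prev_scores), ('prev_scoresET', prev_scoresET),
                                 ('prev_scoresHT', prev_scoresHT), ('prev_scoresP', prev_scoresP)]:
                if len(values) > 0:
                    prev_scores_dict['mean_' + name] = sum(values)/len(values)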
            
            #FIELDS: meanPlayerOverall + meanPlayerPotential
            teamPlayerIds = [ d['playerId'] for x in ['bench', 'lineup'] for d in teamDoc['formation'][x] ]
            allOv = []
            allPot = []
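            for playerId in teamPlayerIds:
                if playerId in allPlayersId:
                    # a single lookup per player serves both fields
                    playerDoc = db.players.find_one({'wyId': playerId})
                    allOv.append(playerDoc.get('overall', 0))
                    allPot.append(playerDoc.get('potential', 0))

            # fall back to 0 when no player of the formation is known (avoids ZeroDivisionError)
            meanOv = sum(allOv)/len(allOv) if allOv else 0
            meanPot = sum(allPot)/len(allPot) if allPot else 0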
            
            res_col.insert_one({
                'matchId': str(match), 
                'teamId': str(team),
                'isHome': isHome,
                'teamPoints': matchDoc['teamPoints'][team],
                'teamPlayerank': matchDoc['teamPlayerank'][team],
                'meanPlayerOverall': meanOv,
                'meanPlayerPotential': meanPot,
                'meanPrevScore': prev_scores_dict.get('mean_prev_scores',0),
                'meanPrevScoreET': prev_scores_dict.get('mean_prev_scoresET',0),
                'meanPrevScoreHT': prev_scores_dict.get('mean_prev_scoresHT',0),
                'meanPrevScoreP': prev_scores_dict.get('mean_prev_scoresP',0),
                'numPass': ev_cnt.get('Pass',0),
                'numDuel': ev_cnt.get('Duel',0),
                'numFoul': ev_cnt.get('Foul',0),
                'numFreeKick': ev_cnt.get('Free Kick',0),
                'numGoalkeeperLeavingLine': ev_cnt.get('Goalkeeper leaving line',0),
                'numInterruption': ev_cnt.get('Interruption',0),
                'numOffside': ev_cnt.get('Offside',0),
                'numOthersOnTheBall': ev_cnt.get('Others on the ball',0),
                'numSaveAttempt': ev_cnt.get('Save attempt',0),
                'numShot': ev_cnt.get('Shot',0),
                'rateAccPass': evAcc['Pass']/ev_cnt.get('Pass',1),
                'rateAccFreeKick': evAcc['Free Kick']/ev_cnt.get('Free Kick',1),
                'rateAccShot': evAcc['Shot']/ev_cnt.get('Shot',1),
                'numYellowCard': numYelCard,
                'numSecondYellowCard': numYelSecCard,
                'numRedCard': numRedCard,
                'percBallPoss': percBallPoss,
                'percOppHalfField': percOppHalfField,
                'numGoalsTW': numGoalsTW,
                'numOwnGoalsTW': numOwnGoalsTW,
                'scoreTW': 0,
                'numGoalsTWP': numGoalsTWP,
                'numOwnGoalsTWP': numOwnGoalsTWP,
                'numGoals1H': numGoalsHT,
                'numGoals2H': numGoals2HT,
                'numGoalsET': numGoalsET,
                'finalScore': finalScore,
                'isWinner': isWinner,
                'didScoreInTWP': 0 # int(numGoals2HT > 0)
            })
        
        #FIELDS: scoreTW + didScoreInTWP + goalsDiff
        teamA = str(teamIds[0])
        teamB = str(teamIds[1])
        fields = ['numGoalsTW', 'numOwnGoalsTW', 'scoreTW', 'numGoalsTWP', 'numOwnGoalsTWP', 'finalScore']
        # key `info` by the string ids, matching how 'teamId' is stored at insert time
        info = { t: { f: res_col.find_one({ 'matchId': str(match), 'teamId': t })[f] for f in fields } for t in [teamA, teamB] }
        
        res_col.update_one({ 'matchId': str(match), 'teamId': teamA }, { '$set': { 
            'scoreTW': info[teamA]['numGoalsTW'] + info[teamB]['numOwnGoalsTW'],
            'didScoreInTWP': int(( info[teamA]['numGoalsTWP'] + info[teamB]['numOwnGoalsTWP'] ) > info[teamA]['scoreTW']),
            'goalsDiff': info[teamA]['finalScore'] - info[teamB]['finalScore']
        } })
        res_col.update_one({ 'matchId': str(match), 'teamId': teamB }, { '$set': { 
            'scoreTW': info[teamB]['numGoalsTW'] + info[teamA]['numOwnGoalsTW'],
            'didScoreInTWP': int(( info[teamB]['numGoalsTWP'] + info[teamA]['numOwnGoalsTWP'] ) > info[teamB]['scoreTW']),
            'goalsDiff': info[teamB]['finalScore'] - info[teamA]['finalScore']
        } })
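
The tag-based goal counting and the cross-team score update in the cell above can be sketched without a database. This is only a minimal illustration: the tag ids (`101` for a goal, `102` for an own goal) and the exclusion of `Save attempt` events are taken from the cell above, while the helper names `count_tagged` and `score_in_window` and the toy events are hypothetical.

```python
# Hypothetical helpers; only the tag ids and the 'Save attempt' exclusion
# come from the notebook cell above.
GOAL, OWN_GOAL = 101, 102

def count_tagged(events, tag_id):
    """Count the events carrying tag_id, skipping 'Save attempt' events
    in the same way as the cell above."""
    return sum(
        1 for e in events
        if e['eventName'] != 'Save attempt'
        and tag_id in [tag['id'] for tag in e['tags']]
    )

def score_in_window(own_events, opp_events):
    """Goals credited to a team: its own goals plus the opponent's own goals."""
    return count_tagged(own_events, GOAL) + count_tagged(opp_events, OWN_GOAL)

# Made-up events of two teams inside one time window
team_a = [
    {'eventName': 'Shot', 'tags': [{'id': 101}, {'id': 1801}]},   # goal
    {'eventName': 'Pass', 'tags': [{'id': 1801}]},
]
team_b = [
    {'eventName': 'Save attempt', 'tags': [{'id': 101}]},         # skipped
    {'eventName': 'Others on the ball', 'tags': [{'id': 102}]},   # own goal
]

score_a = score_in_window(team_a, team_b)
score_b = score_in_window(team_b, team_a)
print(score_a, score_b)
```

Team A is credited with its own goal and with the own goal of team B, which is why the update needs the documents of both teams.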
        
        

***

# Target prediction (Phase 3)

## Dataframe creation from results

The function `dfFromCursor` takes a `cursor` as input and produces a pandas dataframe from it, with the columns in the order given by the list `orderedCols`.

It is used to obtain a pandas dataframe from a MongoDB collection of documents.
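
As a rough illustration of the idea, the sketch below builds a dataframe from a list of dictionaries that stands in for the documents returned by a cursor, and keeps the columns in a fixed order. The names `docs`, `wanted_cols` and `df_from_docs` are hypothetical and the values are made up.

```python
import pandas as pd

# Stand-in for list(cursor): each document is a plain dict (made-up values)
docs = [
    {'teamId': '1', 'isWinner': 0, 'matchId': '10'},
    {'teamId': '2', 'isWinner': 1, 'matchId': '10'},
]
wanted_cols = ['matchId', 'teamId', 'isWinner']

def df_from_docs(documents, columns):
    """Build a dataframe and keep only `columns`, in that order.
    Raises KeyError if a requested column is missing from the documents."""
    df = pd.DataFrame(list(documents))
    # .copy() returns an independent dataframe that can be extended later
    return df[columns].copy()

df = df_from_docs(docs, wanted_cols)
print(list(df.columns))
```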

In [11]:
orderedCols = ['matchId', 'teamId', 'isHome',
               'teamPoints', 'teamPlayerank',
               'meanPlayerOverall', 'meanPlayerPotential', 'meanPrevScore', 'meanPrevScoreET', 'meanPrevScoreHT', 'meanPrevScoreP', 
               'numDuel', 'numFoul', 'numFreeKick', 'numGoalkeeperLeavingLine', 'numInterruption', 'numOffside', 'numOthersOnTheBall', 'numPass', 'numSaveAttempt', 'numShot',
               'rateAccFreeKick', 'rateAccPass', 'rateAccShot',
               'numYellowCard', 'numSecondYellowCard', 'numRedCard',
               'percBallPoss', 'percOppHalfField',
               'numGoalsTW', 'numOwnGoalsTW', 'scoreTW',
               'numGoalsTWP', 'numOwnGoalsTWP',
               'numGoals1H', 'numGoals2H', 'numGoalsET',
               'goalsDiff', 'finalScore', 'isWinner', 'didScoreInTWP'
              ]

def dfFromCursor (cursor):
    df = pd.DataFrame(list(cursor))
    res = pd.DataFrame(columns=orderedCols)
    for c in orderedCols:
        res[c] = df[c]
    
    return res

## Dataframe update with target predictions

The function `dfPrediction` takes as input
- a dataframe `df` with all the features needed for the prediction,
- the directory `inDir` from which the classifiers are loaded,
- the names of the classifiers for the first target variable, `clf1_fileName`, and for the second target variable, `clf2_fileName`,

and the optional parameters
- `save`, a boolean flag that tells whether to save the resulting dataframe,
- `outDir`, the directory in which to save it,
- `outFileName`, the name of the output file.

The function loads the classifiers from `./<inDir>/<name>.pkl` and predicts the two target variables from the columns of the input dataframe listed in `cols`. Its output is the input dataframe extended with two columns, `isWinner_predicted` and `didScoreInTWP_predicted`; when `save` is `True` the result is also written to `./<outDir>/<outFileName>.json`.
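
The load-and-predict pattern described above can be sketched on toy data. The example assumes that the pickled models are scikit-learn estimators (the `DT_` prefix of the file names used later suggests decision trees, but this is a guess); the feature subset `feature_cols`, the file name `toy_clf.pkl` and the training data are hypothetical.

```python
import os
import pickle
import tempfile

import pandas as pd
from sklearn.tree import DecisionTreeClassifier

feature_cols = ['isHome', 'scoreTW']   # hypothetical subset of the features

# Made-up training data
train = pd.DataFrame({
    'isHome':   [0, 1, 0, 1],
    'scoreTW':  [0, 2, 1, 3],
    'isWinner': [0, 1, 0, 1],
})

model_dir = tempfile.mkdtemp()
model_path = os.path.join(model_dir, 'toy_clf.pkl')

# Fit a toy model and pickle it
clf = DecisionTreeClassifier(random_state=0)
clf.fit(train[feature_cols], train['isWinner'])
with open(model_path, 'wb') as f:
    pickle.dump(clf, f)

# Load it back and append the predictions as a new column
with open(model_path, 'rb') as f:
    loaded = pickle.load(f)

df_pred = train[feature_cols].copy()
df_pred['isWinner_predicted'] = loaded.predict(df_pred[feature_cols]).astype(int)
print(df_pred)
```

At prediction time the model should receive the same feature columns, in the same order, that were used for training.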

In [12]:
cols = ['isHome', 'meanPlayerOverall', 'meanPlayerPotential', 'meanPrevScore',
       'meanPrevScoreET', 'meanPrevScoreHT', 'meanPrevScoreP', 'numDuel',
       'numFoul', 'numFreeKick', 'numGoalkeeperLeavingLine', 'numGoalsTW',
       'numInterruption', 'numOffside', 'numOthersOnTheBall', 'numOwnGoalsTW',
       'numPass', 'numRedCard', 'numSaveAttempt', 'numSecondYellowCard',
       'numShot', 'numYellowCard', 'percBallPoss', 'percOppHalfField',
       'rateAccFreeKick', 'rateAccPass', 'rateAccShot', 'scoreTW',
       'teamPlayerank', 'teamPoints']

def dfPrediction (df, inDir, clf1_fileName, clf2_fileName, save=False, outDir='', outFileName=''):
    
    #loading models (the context managers close the files after reading)
    with open('./{}/{}.pkl'.format(inDir, clf1_fileName), 'rb') as f:
        clf1 = pickle.load(f)
    with open('./{}/{}.pkl'.format(inDir, clf2_fileName), 'rb') as f:
        clf2 = pickle.load(f)
    
    #predict targets on a copy, so that the input dataframe is left unchanged
    df_pred = df.copy()
    df_pred['isWinner_predicted'] = clf1.predict(df[cols]).astype(int)
    df_pred['didScoreInTWP_predicted'] = clf2.predict(df[cols]).astype(int)
    
    if save:
        df_pred.to_json('./{}/{}.json'.format(outDir, outFileName), orient='records')
    
    return df_pred

***

# Execution

## Phase 1

In [13]:
%%time
initialization('2575999', 'soccerdb', True)

100%|█████████████████████████████████████████████████████████████████████████████████| 80/80 [00:00<00:00, 498.11it/s]
100%|█████████████████████████████████████████████████████████████████████████████| 3603/3603 [00:05<00:00, 605.96it/s]


Match: 2575999
Team 1: 3197 (FC Crotone)
Team 2: 3172 (Atalanta Bergamasca Calcio)
Date: September 20, 2017 at 8:45:00 PM GMT+2
Competition: 524 (Italian first division)

INFO
Number of matches: 367
Number of events: 625572

Wall time: 2min 15s


## Phase 2

In [14]:
df_Fifa = pd.read_csv("C:/Users/riky9/Desktop/BDA/data/fifa.csv")
addOverallAndPotential(db.players, df_Fifa)

100%|███████████████████████████████████████████████████████████████████████████| 18207/18207 [01:15<00:00, 241.69it/s]


In [15]:
addSecFromStart(db.events, db.matches)

100%|████████████████████████████████████████████████████████████████████████████████| 367/367 [11:15<00:00,  1.97s/it]


In [16]:
addRealMinPlayed(db.matches, db.playerank)

100%|████████████████████████████████████████████████████████████████████████████████| 367/367 [01:51<00:00,  2.51it/s]


In [17]:
addPlayerankScoreNorm(db.matches, db.playerank)

100%|████████████████████████████████████████████████████████████████████████████| 8969/8969 [00:07<00:00, 1207.65it/s]


In [18]:
addActivePlayers(db.matches, db.playerank)

100%|████████████████████████████████████████████████████████████████████████████████| 367/367 [00:04<00:00, 89.21it/s]


In [19]:
addTeamPlayerank(db.matches, db.players)

100%|████████████████████████████████████████████████████████████████████████████| 3603/3603 [00:00<00:00, 4818.75it/s]
100%|████████████████████████████████████████████████████████████████████████████████| 367/367 [00:15<00:00, 23.38it/s]


In [20]:
addTeamPoints(db.matches, db.competitions)

100%|████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:01<00:00,  1.03s/it]


In [28]:
computation(db.events, db.matches, db.players, db.res)

100%|████████████████████████████████████████████████████████████████████████████████| 367/367 [04:10<00:00,  1.47it/s]


## Phase 3

In [29]:
res = dfFromCursor(db.res.find({ 'matchId': str(mpId) }, { '_id': 0 }))
res.transpose()

| | 0 | 1 |
|---|---|---|
| matchId | 2575999.0 | 2575999.0 |
| teamId | 3197.0 | 3172.0 |
| isHome | 0.0 | 1.0 |
| teamPoints | 1.0 | 4.0 |
| teamPlayerank | 0.268414 | 0.274712 |
| meanPlayerOverall | 45.2609 | 52.6364 |
| meanPlayerPotential | 48.087 | 57.8182 |
| meanPrevScore | 0.0 | 1.0 |
| meanPrevScoreET | 0.0 | 0.0 |
| meanPrevScoreHT | 0.0 | 0.5 |

In [30]:
resPred = dfPrediction(res, 'NewModels6', 'DT_1_3_10', 'DT_2_3_10')

In [31]:
resPred[['matchId', 'teamId', 'isWinner', 'isWinner_predicted', 'didScoreInTWP', 'didScoreInTWP_predicted']]

| | matchId | teamId | isWinner | isWinner_predicted | didScoreInTWP | didScoreInTWP_predicted |
|---|---|---|---|---|---|---|
| 0 | 2575999 | 3197 | 0 | 0 | 1 | 0 |
| 1 | 2575999 | 3172 | 1 | 1 | 1 | 1 |


***