# This is the second part of my kernel, covering only the feature engineering and the LightGBM model. The kernel is split into two parts because of the memory and time limits of Kaggle kernels. For the exploratory data analysis and the base model, see the first part <a href='https://www.kaggle.com/iamarjunchandra/part-1-pubg-eda-base-model'>here</a>.

In [1]:
#Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sb
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
import lightgbm as lgb
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_absolute_error
import gc

#Figures Inline and Visualization style
%matplotlib inline
sb.set()

In [2]:
train = pd.read_csv('../input/train_V2.csv')
test = pd.read_csv('../input/test_V2.csv')
train.dropna(inplace=True)

# **3. FEATURE ENGINEERING**

Let's inspect the categorical column 'matchType'.

In [3]:
train['matchType'].value_counts()

squad-fpp           1756186
duo-fpp              996691
squad                626526
solo-fpp             536761
duo                  313591
solo                 181943
normal-squad-fpp      17174
crashfpp               6287
normal-duo-fpp         5489
flaretpp               2505
normal-solo-fpp        1682
flarefpp                718
normal-squad            516
crashtpp                371
normal-solo             326
normal-duo              199
Name: matchType, dtype: int64

'groupId' and 'matchId' are available in the data. From these, the number of players in each team and the total number of players in each match can be derived.

In [4]:
train['teamPlayers']=train.groupId.map(train.groupId.value_counts())
test['teamPlayers']=test.groupId.map(test.groupId.value_counts())
train['gamePlayers']=train.matchId.map(train.matchId.value_counts())
test['gamePlayers']=test.matchId.map(test.matchId.value_counts())
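
To see what the `map(value_counts())` pattern above does, here is a minimal sketch on a made-up five-row frame (the ids `m1`, `g1`, etc. are invented for illustration). Note that counting by 'groupId' alone assumes a group id never repeats across matches.

```python
import pandas as pd

# Toy data: two matches, three groups. All ids are made up.
toy = pd.DataFrame({
    "matchId": ["m1", "m1", "m1", "m2", "m2"],
    "groupId": ["g1", "g1", "g2", "g3", "g3"],
})

# value_counts() gives a Series indexed by id; map() looks up each row's id in it.
toy["teamPlayers"] = toy["groupId"].map(toy["groupId"].value_counts())
toy["gamePlayers"] = toy["matchId"].map(toy["matchId"].value_counts())
toy["enemyPlayers"] = toy["gamePlayers"] - toy["teamPlayers"]
print(toy)
```

Each row ends up carrying the size of its own team and of its own match, so no explicit join is needed.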

Let's create a new column with the number of enemy players, that is, everyone in the match outside the player's own team.

In [5]:
train['enemyPlayers']=train['gamePlayers']-train['teamPlayers']
test['enemyPlayers']=test['gamePlayers']-test['teamPlayers']

Let's create a new column for the total distance (ride + swim + walk) covered by the player in the game.

In [6]:
train['totalDistance']=train['rideDistance']+train['swimDistance']+train['walkDistance']
test['totalDistance']=test['rideDistance']+test['swimDistance']+test['walkDistance']

New column 'enemyDamage', which is the sum of assists and kills.

In [7]:
train['enemyDamage']=train['assists']+train['kills']
test['enemyDamage']=test['assists']+test['kills']

New column containing the total kills by the team. Rows are grouped by 'matchId' and 'groupId', and the 'kills' within each group are summed.

In [8]:
totalKills = train.groupby(['matchId','groupId']).agg({'kills': lambda x: x.sum()})
totalKills.rename(columns={"kills": "squadKills"}, inplace=True)
train = train.join(other=totalKills, on=['matchId', 'groupId'])
totalKills = test.groupby(['matchId','groupId']).agg({'kills': lambda x: x.sum()})
totalKills.rename(columns={"kills": "squadKills"}, inplace=True)
test = test.join(other=totalKills, on=['matchId', 'groupId'])
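
As an aside, the same per-team sum can probably be written more compactly with `groupby(...).transform("sum")`, which returns a result aligned to the original rows so no rename or join is needed. This is only a sketch on invented data, not what the kernel ran.

```python
import pandas as pd

# Toy data with made-up ids and kill counts.
toy = pd.DataFrame({
    "matchId": ["m1", "m1", "m1", "m2"],
    "groupId": ["g1", "g1", "g2", "g3"],
    "kills":   [2, 3, 1, 4],
})

# transform("sum") broadcasts each group's total back onto its member rows.
toy["squadKills"] = toy.groupby(["matchId", "groupId"])["kills"].transform("sum")
print(toy)
```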

Let's create some more columns and see whether any of them helps improve the model's predictions.

In [9]:
train['medicKits']=train['heals']+train['boosts']
test['medicKits']=test['heals']+test['boosts']

In [10]:
train['medicPerKill'] = train['medicKits']/train['enemyDamage']
test['medicPerKill'] = test['medicKits']/test['enemyDamage']

In [11]:
train['distancePerHeals'] = train['totalDistance']/train['heals']
test['distancePerHeals'] = test['totalDistance']/test['heals']

In [12]:
train['headShotKillRatio']=train['headshotKills']/train['kills']
test['headShotKillRatio']=test['headshotKills']/test['kills']

In [13]:
train['headshotKillRate'] = train['headshotKills'] / train['kills']
test['headshotKillRate'] = test['headshotKills'] / test['kills']

In [14]:
train['killPlaceOverMaxPlace'] = train['killPlace'] / train['maxPlace']
test['killPlaceOverMaxPlace'] = test['killPlace'] / test['maxPlace']

In [15]:
train['kills/distance']=train['kills']/train['totalDistance']
test['kills/distance']=test['kills']/test['totalDistance']

In [16]:
train['kills/walkDistance']=train['kills']/train['walkDistance']
test['kills/walkDistance']=test['kills']/test['walkDistance']

In [17]:
train['avgKills'] = train['squadKills']/train['teamPlayers']
test['avgKills'] = test['squadKills']/test['teamPlayers']

In [18]:
train['damageRatio'] = train['damageDealt']/train['enemyDamage']
test['damageRatio'] = test['damageDealt']/test['enemyDamage']

In [19]:
train['distTravelledPerGame'] = train['totalDistance']/train['matchDuration']
test['distTravelledPerGame'] = test['totalDistance']/test['matchDuration']

In [20]:
train['killPlacePerc'] = train['killPlace']/train['gamePlayers']
test['killPlacePerc'] = test['killPlace']/test['gamePlayers']

In [21]:
train["playerSkill"] = train["headshotKills"]+ train["roadKills"]+train["assists"]-(5*train['teamKills']) 
test["playerSkill"] = test["headshotKills"]+ test["roadKills"]+test["assists"]-(5*test['teamKills'])

In [22]:
train['gamePlacePerc'] = train['killPlace']/train['maxPlace']
test['gamePlacePerc'] = test['killPlace']/test['maxPlace']

The newly created features contain missing values and infinite values, because several of them divide by columns that can be zero. Let's replace these with 0.

In [23]:
train.fillna(0,inplace=True)
train.replace(np.inf, 0, inplace=True)
test.fillna(0,inplace=True)
test.replace(np.inf, 0, inplace=True)
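
A small sketch of where those values come from, using three invented rows: 0/0 gives NaN and x/0 gives inf in pandas. The variant below also maps -inf to 0, which is an extra safety step I am assuming is harmless here, since the ratio features above divide non-negative columns.

```python
import numpy as np
import pandas as pd

# Toy rows: a player with kills, a player with nothing, a player who only healed.
toy = pd.DataFrame({
    "headshotKills": [1, 0, 0],
    "kills":         [2, 0, 0],
    "heals":         [0, 0, 3],
    "totalDistance": [100.0, 0.0, 30.0],
})

toy["headshotKillRate"] = toy["headshotKills"] / toy["kills"]   # 0.5, NaN, NaN
toy["distancePerHeals"] = toy["totalDistance"] / toy["heals"]   # inf, NaN, 10.0

# Turn +/-inf into NaN first, then fill every NaN with 0.
cleaned = toy.replace([np.inf, -np.inf], np.nan).fillna(0)
print(cleaned[["headshotKillRate", "distancePerHeals"]])
```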

In [24]:
train.count()

Id                       4446965
groupId                  4446965
matchId                  4446965
assists                  4446965
boosts                   4446965
damageDealt              4446965
DBNOs                    4446965
headshotKills            4446965
heals                    4446965
killPlace                4446965
killPoints               4446965
kills                    4446965
killStreaks              4446965
longestKill              4446965
matchDuration            4446965
matchType                4446965
maxPlace                 4446965
numGroups                4446965
rankPoints               4446965
revives                  4446965
rideDistance             4446965
roadKills                4446965
swimDistance             4446965
teamKills                4446965
vehicleDestroys          4446965
walkDistance             4446965
weaponsAcquired          4446965
winPoints                4446965
winPlacePerc             4446965
teamPlayers              4446965
gamePlayer

From the heat map, killPoints, rankPoints, winPoints and maxPlace were found to have little significance in determining winPlacePerc. So let's remove these features from the data set.

In [25]:
train.drop(columns=['killPoints','rankPoints','winPoints','maxPlace'],inplace=True)
test.drop(columns=['killPoints','rankPoints','winPoints','maxPlace'],inplace=True)

In [26]:
def reduce_mem_usage(df):
    """ iterate through all the columns of a dataframe and modify the data type
        to reduce memory usage (taken from Kaggle).
    """
    start_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))

    for col in df.columns:
        col_type = df[col].dtype
        
        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
                    
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)

    end_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))
    return df
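
One thing worth keeping in mind with this helper is that float16 trades precision for memory. The sketch below (toy values, not the competition data) compares it with `pd.to_numeric(..., downcast=...)`, which picks the smallest integer dtype but never goes below float32 for floats.

```python
import numpy as np
import pandas as pd

s_int = pd.Series([0, 5, 100], dtype="int64")
s_float = pd.Series([0.5, 2049.0], dtype="float64")

# pandas' built-in downcasting: smallest integer dtype, float32 at minimum for floats.
small_int = pd.to_numeric(s_int, downcast="integer")
small_float = pd.to_numeric(s_float, downcast="float")

# Casting to float16 is smaller still, but 2049 is not exactly representable.
half = s_float.astype(np.float16)

print(small_int.dtype, small_float.dtype, half.dtype)
print(float(half.iloc[1]))  # 2048.0: the value was rounded
```

So large-valued columns such as distances can lose a little accuracy when they land in float16; whether that matters for the model is something to check rather than assume.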

In PUBG, if a player wins, their teammates are also winners. So instead of finding winPlacePerc for individual players, let's find the winPlacePerc for each group in a match. Let's write a function that creates new columns holding the group-wise mean, max and min of all the current features, ranks each of those within the match, and also adds the match-wise mean.

In [27]:
def feature(df):
    features = list(df.columns)
    features.remove("Id")
    features.remove("matchId")
    features.remove("groupId")
    features.remove("matchType")
    condition='False'
    
    if 'winPlacePerc' in df.columns:
        y = np.array(df.groupby(['matchId','groupId'])['winPlacePerc'].agg('mean'), dtype=np.float64)
        features.remove("winPlacePerc")
        condition='True'
        
    print("get group mean feature")
    agg = df.groupby(['matchId','groupId'])[features].agg('mean')
    agg_rank = agg.groupby('matchId')[features].rank(pct=True).reset_index()
    df_out = agg.reset_index()[['matchId','groupId']]
    df_out = df_out.merge(agg.reset_index(), suffixes=["", ""], how='left', on=['matchId', 'groupId'])
    df_out = df_out.merge(agg_rank, suffixes=["_mean", "_mean_rank"], how='left', on=['matchId', 'groupId'])
        
    print("get group max feature")
    agg = df.groupby(['matchId','groupId'])[features].agg('max')
    agg_rank = agg.groupby('matchId')[features].rank(pct=True).reset_index()
    df_out = df_out.merge(agg.reset_index(), suffixes=["", ""], how='left', on=['matchId', 'groupId'])
    df_out = df_out.merge(agg_rank, suffixes=["_max", "_max_rank"], how='left', on=['matchId', 'groupId'])
    
    print("get group min feature")
    agg = df.groupby(['matchId','groupId'])[features].agg('min')
    agg_rank = agg.groupby('matchId')[features].rank(pct=True).reset_index()
    df_out = df_out.merge(agg.reset_index(), suffixes=["", ""], how='left', on=['matchId', 'groupId'])
    df_out = df_out.merge(agg_rank, suffixes=["_min", "_min_rank"], how='left', on=['matchId', 'groupId'])
    
    print("get match mean feature")
    agg = df.groupby(['matchId'])[features].agg('mean').reset_index()
    df_out = df_out.merge(agg, suffixes=["", "_match_mean"], how='left', on=['matchId'])
    df_id=df_out[["matchId", "groupId"]].copy()
    df_out.drop(["matchId", "groupId"], axis=1, inplace=True)
    
    del df, agg, agg_rank
    gc.collect()
    if condition=='True':
        return df_out,pd.DataFrame(y),df_id
    else:
        return df_out,df_id
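
To make the aggregate-then-rank step easier to follow, here is a minimal sketch of the same idea for a single feature on invented data. It uses `join` with suffixes instead of the chained merges above, purely to keep the example short.

```python
import pandas as pd

# Toy data: match m1 has three groups, match m2 has two. Ids are made up.
toy = pd.DataFrame({
    "matchId": ["m1", "m1", "m1", "m1", "m2", "m2"],
    "groupId": ["g1", "g1", "g2", "g3", "g4", "g5"],
    "kills":   [2, 4, 1, 5, 0, 3],
})

# One row per (matchId, groupId) with the group mean of the feature.
group_mean = toy.groupby(["matchId", "groupId"])[["kills"]].agg("mean")

# Percentile rank of each group's mean among the groups of the same match.
group_rank = group_mean.groupby("matchId")[["kills"]].rank(pct=True)

out = group_mean.join(group_rank, lsuffix="_mean", rsuffix="_mean_rank").reset_index()
print(out)
```

The rank column is what lets the model compare a team with the other teams in its own match rather than with the whole data set.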

In [28]:
x,y,id_train=feature(reduce_mem_usage(train))

Memory usage of dataframe is 1560.67 MB
Memory usage after optimization is: 436.82 MB
Decreased by 72.0%
get group mean feature
get group max feature
get group min feature
get match mean feature


In [29]:
x_test,id_test=feature(reduce_mem_usage(test))

Memory usage of dataframe is 649.29 MB
Memory usage after optimization is: 171.55 MB
Decreased by 73.6%
get group mean feature
get group max feature
get group min feature
get match mean feature


In [30]:
del train,test
gc.collect()

21

# **4. GRADIENT BOOSTING MODEL**

Split the data into training and validation sets.

In [31]:
x['matchId']=id_train['matchId']
x['groupId']=id_train['groupId']
# Train test split
x_train,x_val,y_train,y_val=train_test_split(reduce_mem_usage(x),y,test_size=.1)
x_test=reduce_mem_usage(x_test)
id_val=x_val[['matchId','groupId']]
x_val.drop(['matchId','groupId'],axis=1,inplace=True)
x_train.drop(['matchId','groupId'],axis=1,inplace=True)
x.drop(['matchId','groupId'],axis=1,inplace=True)
del y
gc.collect()

Memory usage of dataframe is 2957.27 MB
Memory usage after optimization is: 1055.34 MB
Decreased by 64.3%
Memory usage of dataframe is 1279.61 MB
Memory usage after optimization is: 447.10 MB
Decreased by 65.1%



7

In [32]:
params = {
        "objective" : "regression", 
        "metric" : "mae", 
        "num_leaves" : 149, 
        "learning_rate" : 0.03, 
        "bagging_fraction" : 0.9,
        "bagging_seed" : 0, 
        "num_threads" : 4,
        "colsample_bytree" : 0.5,
        'min_data_in_leaf':1900, 
        'min_split_gain':0.00011,
        'lambda_l2':9
}

In [33]:
# create dataset for lightgbm
lgb_train = lgb.Dataset(x_train, y_train,
                       free_raw_data=False)
lgb_eval = lgb.Dataset(x_val, y_val, reference=lgb_train,
                      free_raw_data=False)

In [34]:
model = lgb.train(params,
                lgb_train,
                num_boost_round=22000,
                valid_sets=lgb_eval,
                early_stopping_rounds=10,
                verbose_eval=1000)

Training until validation scores don't improve for 10 rounds.
[1000]	valid_0's l1: 0.0278541
[2000]	valid_0's l1: 0.0271067
[3000]	valid_0's l1: 0.0266859
[4000]	valid_0's l1: 0.026414
[5000]	valid_0's l1: 0.0261878
[6000]	valid_0's l1: 0.0260055
[7000]	valid_0's l1: 0.0258675
[8000]	valid_0's l1: 0.0257378
Early stopping, best iteration is:
[8362]	valid_0's l1: 0.0257023


# **6. POST PROCESSING**

Now that we have trained the model, let's see whether we can tweak the predictions to improve them. First, let's merge the predicted values back onto the player rows of the train data, matching on matchId and groupId.

In [35]:
y_pred_val = model.predict(x, num_iteration=model.best_iteration)
id_train['win_pred']=y_pred_val
id_train.set_index(['matchId','groupId'])
train = reduce_mem_usage(pd.read_csv("../input/train_V2.csv"))

df=pd.merge(train,id_train,on=['matchId','groupId'],how='right')
df

Memory usage of dataframe is 983.90 MB
Memory usage after optimization is: 288.39 MB
Decreased by 70.7%


Unnamed: 0,Id,groupId,matchId,assists,boosts,damageDealt,DBNOs,headshotKills,heals,killPlace,killPoints,kills,killStreaks,longestKill,matchDuration,matchType,maxPlace,numGroups,rankPoints,revives,rideDistance,roadKills,swimDistance,teamKills,vehicleDestroys,walkDistance,weaponsAcquired,winPoints,winPlacePerc,win_pred
0,7f96b2f878858a,4d4b580de459be,a10357fd1a4a91,0,0,0.000000,0,0,0,60,1241,0,0,0.000000,1306,squad-fpp,28,26,-1,0,0.000000,0,0.000000,0,0,244.750000,1,1466,0.444336,0.470399
1,7516514fbd1091,4d4b580de459be,a10357fd1a4a91,0,0,0.000000,0,0,0,62,1232,0,0,0.000000,1306,squad-fpp,28,26,-1,0,0.000000,0,0.000000,0,0,48.281250,1,1465,0.444336,0.470399
2,c56d45be16aa86,4d4b580de459be,a10357fd1a4a91,0,0,318.000000,2,1,0,6,1185,4,1,27.656250,1306,squad-fpp,28,26,-1,0,0.000000,0,0.000000,0,0,342.750000,2,1476,0.444336,0.470399
3,100eef17c4d773,4d4b580de459be,a10357fd1a4a91,0,0,90.750000,0,0,0,61,1344,0,0,0.000000,1306,squad-fpp,28,26,-1,0,0.000000,0,0.000000,0,0,96.062500,1,1498,0.444336,0.470399
4,eef90569b9d03c,684d5656442f9e,aeb375fc57110c,0,0,91.500000,0,0,0,57,0,0,0,0.000000,1777,squad-fpp,26,25,1484,0,0.004501,0,11.039062,0,0,1434.000000,5,0,0.640137,0.625581
5,3dcf4259b62f66,684d5656442f9e,aeb375fc57110c,2,1,300.750000,0,0,6,18,0,2,1,78.000000,1777,squad-fpp,26,25,1476,0,5416.000000,0,0.000000,0,0,1847.000000,5,0,0.640137,0.625581
6,5406618d55edc0,684d5656442f9e,aeb375fc57110c,0,2,0.000000,0,0,2,56,0,0,0,0.000000,1777,squad-fpp,26,25,1480,0,3676.000000,0,0.000000,0,0,1855.000000,8,0,0.640137,0.625581
7,fb37a556eb6a3d,684d5656442f9e,aeb375fc57110c,0,3,179.000000,2,0,5,17,0,2,2,5.679688,1777,squad-fpp,26,25,1491,0,3694.000000,0,0.000000,0,0,2436.000000,6,0,0.640137,0.625581
8,1eaf90ac73de72,6a4a42c3245a74,110163d8bb94ae,1,0,68.000000,0,0,0,47,0,0,0,0.000000,1318,duo,50,47,1491,0,0.000000,0,0.000000,0,0,161.750000,2,0,0.775391,0.752669
9,3d588ea15ea8ba,6a4a42c3245a74,110163d8bb94ae,0,3,146.625000,1,0,2,18,0,2,1,10.843750,1318,duo,50,47,1494,0,340.500000,0,0.000000,0,0,1119.000000,2,0,0.775391,0.752669


In [36]:
print('The mae score is {}'.format(mean_absolute_error(df['winPlacePerc'],df['win_pred'])))
df = df[["Id", "matchId", "groupId", "maxPlace", "numGroups",'winPlacePerc', 'win_pred']]


The mae score is 0.019728279709704075


Let's keep only one row per (matchId, groupId) pair, since winPlacePerc is almost the same for every player in a team. Then rank the groups within each match by their predicted value. A higher rank corresponds to a higher winPlacePerc.

In [37]:
df_grouped = df.groupby(["matchId", "groupId"]).first().reset_index()
df_grouped["team_place"] = df_grouped.groupby(["matchId"])["win_pred"].rank()
df_grouped

Unnamed: 0,matchId,groupId,Id,maxPlace,numGroups,winPlacePerc,win_pred,team_place
0,0000a43bce5eec,18b16ec699d8b6,023a9418cf67b0,28,28,0.333252,0.329069,10.0
1,0000a43bce5eec,236ab9e9c081b9,5a3afae17b53c0,28,28,0.036987,0.032605,2.0
2,0000a43bce5eec,3a6addfa0df938,fc62a751955351,28,28,0.000000,-0.003673,1.0
3,0000a43bce5eec,4bf06994bd4c9a,29ffb3ea02be3e,28,28,0.370361,0.388923,11.0
4,0000a43bce5eec,4d1bbbc19b9084,10ed15afafb7ec,28,28,1.000000,0.911555,26.0
5,0000a43bce5eec,599d924f8a02db,45a789e6675d36,28,28,0.592773,0.661414,17.0
6,0000a43bce5eec,6620b219ed2ee2,a638435c730f4e,28,28,0.777832,0.863013,24.0
7,0000a43bce5eec,6c44ef4381fe8d,644c42b440c778,28,28,0.703613,0.664647,18.0
8,0000a43bce5eec,767819928e6279,8bc0d095f488b4,28,28,0.259277,0.268977,8.0
9,0000a43bce5eec,7bd08592bb25e2,a121348062f67a,28,28,0.666504,0.736476,20.0


The rank of a team within its match (team_place) turns out to track winPlacePerc closely, so team_place can be used as the main factor in estimating winPlacePerc. Let's model winPlacePerc as the ratio of team_place to numGroups. team_place is never zero, but winPlacePerc can be, so let's subtract 1 from team_place, which returns zero when team_place = 1. numGroups is reduced by 1 as well, so that the ratio can reach 1.

In [38]:
df_grouped["win_perc"] = (df_grouped["team_place"] - 1) / (df_grouped["numGroups"]-1)
df = df.merge(df_grouped[["win_perc","matchId", "groupId"]], on=["matchId", "groupId"], how="left")
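
A quick worked example of the formula on one invented match with four teams. The `win_pred` values are made up and only their order matters.

```python
import pandas as pd

# One toy match with four teams and made-up predictions.
teams = pd.DataFrame({
    "matchId":   ["m1", "m1", "m1", "m1"],
    "groupId":   ["g1", "g2", "g3", "g4"],
    "numGroups": [4, 4, 4, 4],
    "win_pred":  [0.10, 0.95, 0.40, 0.55],
})

# Rank 1 is the lowest prediction, rank 4 the highest.
teams["team_place"] = teams.groupby("matchId")["win_pred"].rank()

# (rank - 1) / (numGroups - 1) maps the ranks 1..4 onto 0, 1/3, 2/3, 1.
teams["win_perc"] = (teams["team_place"] - 1) / (teams["numGroups"] - 1)
print(teams)
```

When numGroups is 1 the denominator is zero and the result is NaN, so that case has to be handled separately.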

Let's post-process the new win_perc. winPlacePerc should not exceed 1 and should not drop below 0. Also, maxPlace = 0 is impossible in a real game and means there is no team, so win_perc = 0. Similarly, maxPlace = 1 means there is only one team, so win_perc = 1. If maxPlace > 1 but only one group is present, win_perc = 0. Any value still missing is filled with the raw prediction win_pred.

In [39]:
df.loc[df['maxPlace'] == 0, "win_perc"] = 0
df.loc[df['maxPlace'] == 1, "win_perc"] = 1
df.loc[(df['maxPlace'] > 1) & (df['numGroups'] == 1), "win_perc"] = 0
df.loc[df['win_perc'] < 0,"win_perc"] = 0
df.loc[df['win_perc'] > 1,"win_perc"] = 1
df['win_perc'].fillna(df['win_pred'],inplace=True)

In [40]:
df_grouped[df_grouped['maxPlace']>1][['winPlacePerc','win_perc','maxPlace','numGroups','team_place']]

Unnamed: 0,winPlacePerc,win_perc,maxPlace,numGroups,team_place
0,0.333252,0.333333,28,28,10.0
1,0.036987,0.037037,28,28,2.0
2,0.000000,0.000000,28,28,1.0
3,0.370361,0.370370,28,28,11.0
4,1.000000,0.925926,28,28,26.0
5,0.592773,0.592593,28,28,17.0
6,0.777832,0.851852,28,28,24.0
7,0.703613,0.629630,28,28,18.0
8,0.259277,0.259259,28,28,8.0
9,0.666504,0.703704,28,28,20.0


# I got this idea from similar kernels published publicly during the competition, and the credit goes to <a href='https://www.kaggle.com/anycode/simple-nn-baseline-3'>this kernel</a>. It snaps each prediction to the nearest multiple of 1/(maxPlace - 1), which changes the predicted value by a few decimal points and improves the MAE score.

In [41]:
subset = df.loc[df['maxPlace'] > 1]
gap = 1 / (subset['maxPlace'].values-1)
new_perc = np.around(subset['win_perc'].values / gap) * gap
df.loc[df.maxPlace > 1, "win_perc"] = new_perc

In [42]:
print('The new mae score is {}'.format(mean_absolute_error(df['winPlacePerc'],df['win_perc'])))

The new mae score is 0.016860036160491636



# Woahh!!! The score has improved a lot, from about 0.0197 to about 0.0169.

In [43]:
del x,train,df
gc.collect()

107

# SUBMISSION

In [44]:
y_pred = model.predict(x_test, num_iteration=model.best_iteration)
id_test['win_pred']=y_pred
id_test.set_index(['matchId','groupId'])
del x_train,x_val,y_train,y_val,x_test
gc.collect()

test = reduce_mem_usage(pd.read_csv("../input/test_V2.csv"))
df=pd.merge(test,id_test,on=['matchId','groupId'],how='right')
del id_test,test
gc.collect()
df

Memory usage of dataframe is 413.18 MB
Memory usage after optimization is: 121.74 MB
Decreased by 70.5%


Unnamed: 0,Id,groupId,matchId,assists,boosts,damageDealt,DBNOs,headshotKills,heals,killPlace,killPoints,kills,killStreaks,longestKill,matchDuration,matchType,maxPlace,numGroups,rankPoints,revives,rideDistance,roadKills,swimDistance,teamKills,vehicleDestroys,walkDistance,weaponsAcquired,winPoints,win_pred
0,9329eb41e215eb,676b23c24e70d6,45b576ab7daa7f,0,0,51.468750,0,0,0,73,0,0,0,0.000000,1884,squad-fpp,28,28,1500,0,0.00,0,0.000,0,0,588.000000,1,0,0.198595
1,d6267a32c5709c,676b23c24e70d6,45b576ab7daa7f,0,0,0.000000,0,0,0,71,0,0,0,0.000000,1884,squad-fpp,28,28,1368,0,2694.00,0,0.000,0,0,549.500000,0,0,0.198595
2,b896f8954a92e2,676b23c24e70d6,45b576ab7daa7f,1,0,74.187500,1,0,0,72,0,0,0,0.000000,1884,squad-fpp,28,28,1429,0,0.00,0,0.000,0,0,386.250000,7,0,0.198595
3,2f134f2c7be198,676b23c24e70d6,45b576ab7daa7f,0,0,0.000000,0,0,0,70,0,0,0,0.000000,1884,squad-fpp,28,28,1488,0,0.00,0,0.000,0,0,913.000000,2,0,0.198595
4,639bd0dcd7bda8,430933124148dd,42a9a0b906c928,0,4,179.125000,0,0,2,11,0,2,1,362.000000,1811,duo-fpp,48,47,1503,2,4668.00,0,0.000,0,0,2017.000000,6,0,0.945626
5,ef362b46754f2a,430933124148dd,42a9a0b906c928,1,6,597.500000,4,1,5,3,0,5,2,192.625000,1811,duo-fpp,48,47,1519,0,4672.00,0,0.000,0,0,2476.000000,7,0,0.945626
6,63d5c8ef8dfe91,0b45f5db20ba99,87e7e4477a048e,1,0,23.406250,0,0,4,49,0,0,0,0.000000,1793,squad-fpp,28,27,1565,0,0.00,0,0.000,0,0,788.000000,4,0,0.827120
7,77a4a76df633e3,0b45f5db20ba99,87e7e4477a048e,0,6,1058.000000,5,4,9,1,0,8,2,236.750000,1793,squad-fpp,28,27,1527,0,2118.00,0,0.000,0,0,2352.000000,2,0,0.827120
8,bfe3f0a1c67d92,0b45f5db20ba99,87e7e4477a048e,2,0,77.625000,0,0,0,50,0,0,0,0.000000,1793,squad-fpp,28,27,1527,0,0.00,0,0.000,0,0,186.250000,1,0,0.827120
9,9b5a556b0c8b30,0b45f5db20ba99,87e7e4477a048e,0,3,331.000000,3,0,0,3,0,4,1,17.625000,1793,squad-fpp,28,27,1551,0,454.50,0,0.000,0,0,1072.000000,8,0,0.827120


In [45]:
df = df[["Id", "matchId", "groupId", "maxPlace", "numGroups",'win_pred']]

Let's keep only one row per (matchId, groupId) pair, since the predicted value is the same for every player in a team. Then rank the groups within each match. A higher rank corresponds to a higher predicted winPlacePerc.

In [46]:
df_grouped = df.groupby(["matchId", "groupId"]).first().reset_index()
df_grouped["team_place"] = df_grouped.groupby(["matchId"])["win_pred"].rank()
df_grouped

Unnamed: 0,matchId,groupId,Id,maxPlace,numGroups,win_pred,team_place
0,0008c31a9be4a7,01fb9c20f6abc2,f52eb1f272953a,30,30,0.075019,3.0
1,0008c31a9be4a7,0943c3f283b976,b710749e3445c7,30,30,0.810417,25.0
2,0008c31a9be4a7,11b26f1f710257,be85c709aca6e3,30,30,1.005092,30.0
3,0008c31a9be4a7,1568e092a99583,8e3823fd6aae05,30,30,0.703921,20.0
4,0008c31a9be4a7,26d4045668cf95,04ff89eb138f2d,30,30,0.108250,4.0
5,0008c31a9be4a7,298bb0348ccd3a,146e1e2705d68c,30,30,0.213319,7.0
6,0008c31a9be4a7,3cd4258cebec3d,e0df27c83e3086,30,30,0.247626,8.0
7,0008c31a9be4a7,3f91e7fec60224,e93d8ca57b6a4e,30,30,0.809311,24.0
8,0008c31a9be4a7,44f84d3bba50e9,3e7671eccbbc31,30,30,0.143747,5.0
9,0008c31a9be4a7,46579fc2b1245d,8517ab3e8829eb,30,30,0.484603,14.0


In [47]:
df_grouped["win_perc"] = (df_grouped["team_place"] - 1) / (df_grouped["numGroups"]-1)
df = df.merge(df_grouped[["win_perc", "matchId", "groupId"]], on=["matchId", "groupId"], how="left")

In [48]:
df.loc[df.maxPlace == 0, "win_perc"] = 0
df.loc[df.maxPlace == 1, "win_perc"] = 1
df.loc[(df.maxPlace > 1) & (df.numGroups == 1), "win_perc"] = 0
df.loc[df['win_perc'] < 0,"win_perc"] = 0
df.loc[df['win_perc'] > 1,"win_perc"] = 1
df['win_perc'].fillna(df['win_pred'],inplace=True)

In [49]:
subset = df.loc[df['maxPlace'] > 1]
gap = 1 / (subset['maxPlace'].values-1)
new_perc = np.around(subset['win_perc'].values / gap) * gap
df.loc[df.maxPlace > 1, "win_perc"] = new_perc
df['winPlacePerc']=df['win_perc']

In [50]:
df=df[['Id','winPlacePerc']]
df.to_csv("submission_final.csv", index=False)

# If you liked the kernel, DO upvote! 