# Building a Machine Learning Pipeline Using Amazon SageMaker

##### Disclaimer: This notebook was created for academic purposes only. The project is not affiliated with any institution.

## Project Brief

- This project extends previous projects that scraped tennis data from the web and built a machine learning model to predict the outcome of a tennis match.
- The focus here is on building the end-to-end pipeline for that model.
- The pipeline runs on AWS infrastructure: Amazon SageMaker, Amazon API Gateway, AWS Lambda, and Amazon S3.

- The overall workflow consists of five key steps: <br>
    1) Store the scraped data in an S3 bucket.<br>
    2) Split the data into training, validation, and test sets and store the training and validation sets in the S3 bucket.<br>
    3) Build, train, and deploy the model as an Amazon SageMaker endpoint.<br>
    4) Create a Lambda function that calls the endpoint.<br>
    5) Create an API Gateway API attached to the Lambda function.

## Import Packages

In [38]:
# Import packages for preparing sagemaker environment
import sagemaker
import boto3
from sagemaker import image_uris
from sagemaker.session import Session
from sagemaker.inputs import TrainingInput
from sagemaker.serializers import CSVSerializer
from sagemaker import get_execution_role

# Package for saving dataset into S3 bucket
import os

# Import python data processing basic packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import io
import datetime as dt
import pickle as pkl

# Import package for modelling data preparation
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.preprocessing import OneHotEncoder

# Import package from sklearn for model evaluation
from sklearn import metrics
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_auc_score

In [39]:
# Define region
my_region=boto3.session.Session().region_name
print(my_region)

eu-west-2


In [40]:
# Set an output path where the trained model will be saved 
bucket_name='atpbucketv3' # insert bucket name
prefix_output='xgboost-as-a-built-in-algo'
output_path='s3://{}/{}/output'.format(bucket_name,prefix_output)
print(output_path)

s3://atpbucketv3/xgboost-as-a-built-in-algo/output


## Reading Dataset from S3 Bucket

Note: The web scraping results were saved as CSV files and uploaded to the S3 bucket manually. In future work, the scraping ETL process could be automated as part of the pipeline.

In [41]:
bucket_name='atpbucketv3/' # insert bucket name
prefix_input='webscraperesult/'
try:
    df_serve_leaderboard=pd.read_csv('s3://' + bucket_name + prefix_input + 'serve_leaderboard_scrapped.csv')
    df_return_leaderboard=pd.read_csv('s3://' + bucket_name + prefix_input + 'return_leaderboard_scrapped.csv')
    df_under_pressure_leaderboard=pd.read_csv('s3://' + bucket_name + prefix_input + 'under_pressure_leaderboard_scrapped.csv')
    df_rankings=pd.read_csv('s3://' + bucket_name + prefix_input + 'rankings_scrapped.csv')
    df_atp_match=pd.read_csv('s3://' + bucket_name + prefix_input + 'atp_match.csv')
    print('Download success')
except Exception as e:
    print('Download error: ',e)

Download success
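The cells above build S3 paths by string concatenation, once with the bucket written as `atpbucketv3` and once as `atpbucketv3/`. A small helper can make the two spellings interchangeable. This is a minimal sketch; the function name `s3_uri` is hypothetical and not part of boto3 or the SageMaker SDK.

```python
def s3_uri(bucket, *parts):
    """Join a bucket name and key parts into an s3:// URI, ignoring stray slashes."""
    cleaned = [bucket.strip("/")]
    # Drop empty parts so that "" or "/" never produces a double slash.
    cleaned += [p.strip("/") for p in parts if p.strip("/")]
    return "s3://" + "/".join(cleaned)


# Both spellings of the bucket name used in this notebook give the same URI.
uri_a = s3_uri("atpbucketv3/", "webscraperesult/", "atp_match.csv")
uri_b = s3_uri("atpbucketv3", "webscraperesult", "atp_match.csv")
print(uri_a)
```

With such a helper, `bucket_name` would only need to be defined once for the whole notebook.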


## Data Cleaning and Feature Engineering

### Data Cleaning

In this section, we will:
1. Merge the ATP match dataset with the player feature datasets scraped from the ATP website.
2. Check for and handle missing values.
3. Remove columns that are not needed for model building.

In [42]:
# Merge all player feature datasets
temp_merge_1=pd.merge(df_serve_leaderboard,df_rankings,on='player',how='inner')
temp_merge_2=pd.merge(temp_merge_1,df_return_leaderboard,on='player',how='inner')
temp_merge_3=pd.merge(temp_merge_2,df_under_pressure_leaderboard,on='player',how='inner')

df_player_feature=temp_merge_3.copy()

In [44]:
# Identify missing value
percent_missing = df_player_feature.isnull().sum() * 100 / len(df_player_feature)
missing_value_data = pd.DataFrame({'percent_missing': percent_missing})
missing_value_data = missing_value_data.sort_values('percent_missing', ascending = False)
missing_value_data['percent_missing'] = pd.Series(["{0:.2f}%".format(val) for val in missing_value_data['percent_missing']], index = missing_value_data.index)
missing_value_data

| | percent_missing |
|---|---|
| player | 0.00% |
| serve_rating | 0.00% |
| rate_tie_break | 0.00% |
| rate_break_points_saved | 0.00% |
| rate_break_points_converted | 0.00% |
| pressure_rating | 0.00% |
| break_point | 0.00% |
| return_game_won | 0.00% |
| rate_2nd_return_points | 0.00% |
| rate_1st_return_points | 0.00% |


In [45]:
# Make copy from original dataset for data preparation

df_fp = df_player_feature.copy()
df_match = df_atp_match.copy()

In [46]:
# Get player's last name to be the identifier when combining match data
df_fp['player_last_name']=df_fp['player'].apply(lambda x: x.split(" ")[-1])

In [47]:
# Split winner and loser names into last name and the remainder (initials)
df_atp_match[['winner_last','winner_first']]=df_atp_match['winner'].str.split(' ',n=1, expand=True)
df_atp_match[['loser_last','loser_first']]=df_atp_match['loser'].str.split(' ',n=1, expand=True)

In [48]:
# Merge match dataset to player's feature dataset

df_merge_winner=pd.merge(df_atp_match, df_fp, how='left', left_on=['winner_last'], right_on=['player_last_name'])
df_merge_loser=pd.merge(df_merge_winner, df_fp, how='left', left_on=['loser_last'], right_on=['player_last_name'])

df_main=df_merge_loser.copy()
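Joining on last name alone can attach one match row to several players if they share a surname, which would duplicate rows in `df_main`. The sketch below shows one way to list such surnames before merging. It uses only the standard library; the function name and the third player in the sample list are made up for illustration.

```python
from collections import defaultdict


def ambiguous_last_names(full_names):
    """Return {last_name: [full names]} for last names shared by several players."""
    by_last = defaultdict(list)
    for name in full_names:
        last = name.split(" ")[-1]  # same rule as player_last_name above
        by_last[last].append(name)
    return {last: names for last, names in by_last.items() if len(names) > 1}


# The last entry is a made-up player, added only to trigger a clash.
players = ["David Goffin", "Emil Ruusuvuori", "Sample Goffin"]
clashes = ambiguous_last_names(players)
print(clashes)
```

In the notebook this could be called as `ambiguous_last_names(df_fp['player'])`; an empty result would suggest the last-name join is safe for this dataset.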

In [49]:
df_main.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 23732 entries, 0 to 23731
Data columns (total 80 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   atp                            23732 non-null  int64  
 1   location                       23732 non-null  object 
 2   tournament                     23732 non-null  object 
 3   date                           23732 non-null  object 
 4   series                         23732 non-null  object 
 5   court                          23732 non-null  object 
 6   surface                        23732 non-null  object 
 7   round                          23732 non-null  object 
 8   best of                        23732 non-null  int64  
 9   winner                         23732 non-null  object 
 10  loser                          23732 non-null  object 
 11  wrank                          23732 non-null  float64
 12  lrank                          23732 non-null 

In [50]:
# Identify missing value
percent_missing = df_main.isnull().sum() * 100 / len(df_main)
missing_value_data = pd.DataFrame({'percent_missing': percent_missing})
missing_value_data = missing_value_data.sort_values('percent_missing', ascending = False)
missing_value_data['percent_missing'] = pd.Series(["{0:.2f}%".format(val) for val in missing_value_data['percent_missing']], index = missing_value_data.index)
missing_value_data

| | percent_missing |
|---|---|
| player_last_name_y | 22.16% |
| next_best_y | 22.16% |
| player_y | 22.16% |
| serve_rating_y | 22.16% |
| rate_1st_serve_y | 22.16% |
| ... | ... |
| winner_last | 0.00% |
| winner_first | 0.00% |
| loser_last | 0.00% |
| loser_first | 0.00% |


In [51]:
# Drop rows that have a missing value in any of the player feature columns listed below
removed_rows=['rate_deciding_set_y','tourn_played_y','player_y','serve_rating_y','rate_1st_serve_y',
'rate_1st_serve_points_y','rate_2nd_serve_points_y','avg_aces_per_match_y',
'avg_double_faults_per_match_y','age_y','points_y','rate_service_game_won_y',
'next_best_y','rate_1st_return_points_y','rate_2nd_return_points_y','return_game_won_y',
'break_point_y','pressure_rating_y','rate_break_points_converted_y','rate_break_points_saved_y',
'rate_tie_break_y','return_rating_y','break_point_x','next_best_x','return_rating_x',
'rate_1st_return_points_x','rate_2nd_return_points_x','return_game_won_x',
'rate_deciding_set_x','pressure_rating_x','rate_break_points_converted_x',
'rate_break_points_saved_x','rate_tie_break_x','points_x','tourn_played_x','age_x',
'avg_aces_per_match_x','rate_service_game_won_x','rate_2nd_serve_points_x','rate_1st_serve_points_x',
'rate_1st_serve_x','serve_rating_x','player_x','avg_double_faults_per_match_x']

df_main = df_main.dropna(how = 'any', subset = removed_rows)

Note: the suffix _x refers to the winner and _y to the loser. The columns are renamed in the next section.

## Feature Engineering

There are three processes in the following sections: 
1. Feature Selection
    - Variables tied to the match score are removed because they are unknown before a match starts. They were kept for data exploration only.
    - Variables that do not affect a player's chance of winning are removed, e.g. winner_ioc, loser_ioc, tourney_id, tourney_name.
    - Variables already described by other variables are removed, e.g. winner_id and loser_id, which are determined by winner_name and loser_name.

2. Create Dependent Variable
    - The current dataset has no target variable.
    - A new binary feature, 'win_result', is created for the model to predict:
        - '0' if Player 1 wins
        - '1' if Player 2 wins
    - To avoid the bias of Player 1 always being the winner, the two players in each match are assigned to Player 1 and Player 2 in alphabetical order of their names, and win_result is set accordingly.
    - The data is split into two subsets according to the win_result value.
    - In each subset, the winner and loser columns are renamed to Player 1 and Player 2 columns, with the mapping reversed between win_result 0 and win_result 1.
    - The two subsets are concatenated to form one complete dataset again.
    - Class balance is checked to confirm that Player 1 wins roughly half of the matches and Player 2 the other half.

3. Feature Transformation
    - The categorical features are encoded.
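
The Player 1 / Player 2 relabelling in step 2 can be illustrated on a single match record. This is a minimal sketch using a plain dictionary rather than a DataFrame; the function name is hypothetical and the rank values in the example are made up, but the `win_result` rule is the one this notebook applies (1 if the winner's name sorts after the loser's, else 0).

```python
def to_player_slots(match):
    """Relabel a winner/loser record as Player 1 / Player 2 by alphabetical order."""
    winner, loser = match["player_x"], match["player_y"]
    # 1 if the winner's name sorts after the loser's, so the winner becomes Player 2.
    win_result = 1 if winner > loser else 0
    if win_result == 0:
        p1, p2 = winner, loser
        rank_p1, rank_p2 = match["wrank"], match["lrank"]
    else:
        p1, p2 = loser, winner
        rank_p1, rank_p2 = match["lrank"], match["wrank"]
    return {"p1_name": p1, "p2_name": p2,
            "rank_p1": rank_p1, "rank_p2": rank_p2,
            "win_result": win_result}


# Ranks below are illustrative values, not taken from the dataset.
example = to_player_slots({"player_x": "Stefano Travaglia",
                           "player_y": "Miomir Kecmanovic",
                           "wrank": 70, "lrank": 40})
print(example)
```

Because the winner's name sorts after the loser's here, the winner is placed in the Player 2 slot, the loser's rank becomes `rank_p1`, and `win_result` is 1.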

In [52]:
# make a copy for model building
df_model=df_main.copy()

### Feature Selection

In [53]:
# Drop unnecessary/repeating columns

df_model = df_model.drop(['atp','location','tournament','comment','winner_last','winner_first','loser_last','loser_first',
                       'player_last_name_x','player_last_name_y','w1','w2','l1','l2','wsets','lsets',
                         ],axis = 1)

### Create Dependent Variable

In [54]:
# Assign value to win_result based on the alphabetical order

df_model["win_result"] = df_model.apply(lambda row: 1 if row["player_x"] > row["player_y"] else 0, axis=1)
df_model[["winner","player_x", "loser","player_y", "win_result"]].head(10)

| | winner | player_x | loser | player_y | win_result |
|---|---|---|---|---|---|
| 2 | Ruusuvuori E. | Emil Ruusuvuori | Vesely J. | Jiri Vesely | 0 |
| 3 | Bublik A. | Alexander Bublik | Caruso S. | Salvatore Caruso | 0 |
| 4 | Goffin D. | David Goffin | Herbert P.H. | Pierre-Hugues Herbert | 0 |
| 5 | Travaglia S. | Stefano Travaglia | Kecmanovic M. | Miomir Kecmanovic | 1 |
| 11 | Chardy J. | Jeremy Chardy | Albot R. | Radu Albot | 0 |
| 18 | Travaglia S. | Stefano Travaglia | Ruusuvuori E. | Emil Ruusuvuori | 1 |
| 20 | Chardy J. | Jeremy Chardy | Fognini F. | Fabio Fognini | 1 |
| 23 | Goffin D. | David Goffin | Travaglia S. | Stefano Travaglia | 0 |
| 24 | Chardy J. | Jeremy Chardy | Struff J.L. | Jan-Lennard Struff | 1 |
| 25 | Bublik A. | Alexander Bublik | Berrettini M. | Matteo Berrettini | 0 |


### Class Balance

In [55]:
# Calculate current class balance
print ('Number of Player 1 wins: {}'.format((df_model.win_result == 0).sum()))
print ('Share of Player 1 wins: {:.2f} %'.format((df_model.win_result == 0).sum()/len(df_model)*100))
print ('Number of Player 2 wins: {}'.format((df_model.win_result == 1).sum()))
print ('Share of Player 2 wins: {:.2f} %'.format((df_model.win_result == 1).sum()/len(df_model)*100))

Number of Player 1 wins: 7729
Share of Player 1 wins: 49.97 %
Number of Player 2 wins: 7737
Share of Player 2 wins: 50.03 %


In [56]:
# The two classes are already balanced, so we continue building the dependent variable
# Create a subset for each win_result value
df_subset_a=df_model.loc[df_model['win_result']==0]
df_subset_b=df_model.loc[df_model['win_result']==1]

In [57]:
# Rename columns in each subset according to its win_result value
# Rename subset for win_result = 0
df_subset_a=df_subset_a.rename(columns={
    'player_x':'p1_name','serve_rating_x':'serve_rating_p1','rate_1st_serve_x':'rate_1st_serve_p1',
    'rate_1st_serve_points_x':'rate_1st_serve_points_p1','rate_2nd_serve_points_x':'rate_2nd_serve_points_p1',
    'rate_service_game_won_x':'rate_service_game_won_p1','avg_aces_per_match_x':'avg_aces_per_match_p1',
    'avg_double_faults_per_match_x':'avg_double_faults_per_match_p1','age_x':'age_p1',
    'points_x':'points_p1','tourn_played_x':'tourn_played_p1','next_best_x':'next_best_p1',
    'return_rating_x':'return_rating_p1','rate_1st_return_points_x':'rate_1st_return_points_p1',
    'rate_2nd_return_points_x':'rate_2nd_return_points_p1','return_game_won_x':'return_game_won_p1',
    'break_point_x':'break_point_p1','pressure_rating_x':'pressure_rating_p1',
    'rate_break_points_converted_x':'rate_break_points_converted_p1',
    'rate_break_points_saved_x':'rate_break_points_saved_p1','rate_tie_break_x':'rate_tie_break_p1',
    'rate_deciding_set_x':'rate_deciding_set_p1','player_y':'p2_name',
    'serve_rating_y':'serve_rating_p2','rate_1st_serve_y':'rate_1st_serve_p2',
    'rate_1st_serve_points_y':'rate_1st_serve_points_p2','rate_2nd_serve_points_y':'rate_2nd_serve_points_p2',
    'rate_service_game_won_y':'rate_service_game_won_p2','avg_aces_per_match_y':'avg_aces_per_match_p2',
    'avg_double_faults_per_match_y':'avg_double_faults_per_match_p2',
    'age_y':'age_p2','points_y':'points_p2','tourn_played_y':'tourn_played_p2','next_best_y':'next_best_p2',
    'return_rating_y':'return_rating_p2','rate_1st_return_points_y':'rate_1st_return_points_p2',
    'rate_2nd_return_points_y':'rate_2nd_return_points_p2','return_game_won_y':'return_game_won_p2',
    'break_point_y':'break_point_p2','pressure_rating_y':'pressure_rating_p2',
    'rate_break_points_converted_y':'rate_break_points_converted_p2','rate_break_points_saved_y':'rate_break_points_saved_p2',
    'rate_tie_break_y':'rate_tie_break_p2','rate_deciding_set_y':'rate_deciding_set_p2',
    'wrank':'rank_p1','lrank':'rank_p2','b365w':'b365_p1','b365l':'b365_p2',
    'psw':'ps_p1','psl':'ps_p2','maxw':'max_p1','maxl':'max_p2','avgw':'avg_p1','avgl':'avg_p2',
    'wpts':'pts_p1','lpts':'pts_p2'})

# Rename subset for win_result = 1 (inverse of df_subset_a)
df_subset_b=df_subset_b.rename(columns={
    'player_x':'p2_name','serve_rating_x':'serve_rating_p2','rate_1st_serve_x':'rate_1st_serve_p2',
    'rate_1st_serve_points_x':'rate_1st_serve_points_p2','rate_2nd_serve_points_x':'rate_2nd_serve_points_p2',
    'rate_service_game_won_x':'rate_service_game_won_p2','avg_aces_per_match_x':'avg_aces_per_match_p2',
    'avg_double_faults_per_match_x':'avg_double_faults_per_match_p2','age_x':'age_p2',
    'points_x':'points_p2','tourn_played_x':'tourn_played_p2','next_best_x':'next_best_p2',
    'return_rating_x':'return_rating_p2','rate_1st_return_points_x':'rate_1st_return_points_p2',
    'rate_2nd_return_points_x':'rate_2nd_return_points_p2','return_game_won_x':'return_game_won_p2',
    'break_point_x':'break_point_p2','pressure_rating_x':'pressure_rating_p2',
    'rate_break_points_converted_x':'rate_break_points_converted_p2',
    'rate_break_points_saved_x':'rate_break_points_saved_p2','rate_tie_break_x':'rate_tie_break_p2',
    'rate_deciding_set_x':'rate_deciding_set_p2','player_y':'p1_name',
    'serve_rating_y':'serve_rating_p1','rate_1st_serve_y':'rate_1st_serve_p1',
    'rate_1st_serve_points_y':'rate_1st_serve_points_p1','rate_2nd_serve_points_y':'rate_2nd_serve_points_p1',
    'rate_service_game_won_y':'rate_service_game_won_p1','avg_aces_per_match_y':'avg_aces_per_match_p1',
    'avg_double_faults_per_match_y':'avg_double_faults_per_match_p1',
    'age_y':'age_p1','points_y':'points_p1','tourn_played_y':'tourn_played_p1','next_best_y':'next_best_p1',
    'return_rating_y':'return_rating_p1','rate_1st_return_points_y':'rate_1st_return_points_p1',
    'rate_2nd_return_points_y':'rate_2nd_return_points_p1','return_game_won_y':'return_game_won_p1',
    'break_point_y':'break_point_p1','pressure_rating_y':'pressure_rating_p1',
    'rate_break_points_converted_y':'rate_break_points_converted_p1','rate_break_points_saved_y':'rate_break_points_saved_p1',
    'rate_tie_break_y':'rate_tie_break_p1','rate_deciding_set_y':'rate_deciding_set_p1',
    'wrank':'rank_p2','lrank':'rank_p1','b365w':'b365_p2','b365l':'b365_p1',
    'psw':'ps_p2','psl':'ps_p1','maxw':'max_p2','maxl':'max_p1','avgw':'avg_p2','avgl':'avg_p1',
    'wpts':'pts_p2','lpts':'pts_p1'})

# Concatenate subsets
df_model=pd.concat([df_subset_a,df_subset_b])

# Reorder by date so the rows stay close to the original ordering
df_model = df_model.sort_values(by='date', ascending=True)

# Reset indexing
df_model=df_model.reset_index(drop=True)

# Remove other unnecessary columns
df_model = df_model.drop(['date','winner','loser','p1_name','p2_name'],axis = 1)
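
The two rename dictionaries above are mirror images of each other, so they can in principle be generated from one rule instead of being typed out twice. The following is a sketch of that idea under the naming pattern visible in this notebook (`_x`/`_y` suffixes for winner/loser features, `w`/`l` markers on the match-level columns). The names `MATCH_COLUMNS` and `build_rename_map` are hypothetical.

```python
# Match-level columns that carry a winner/loser marker.
# Key: original column, value: (new base name, "w" for winner or "l" for loser).
MATCH_COLUMNS = {
    "wrank": ("rank", "w"), "lrank": ("rank", "l"),
    "wpts": ("pts", "w"), "lpts": ("pts", "l"),
    "b365w": ("b365", "w"), "b365l": ("b365", "l"),
    "psw": ("ps", "w"), "psl": ("ps", "l"),
    "maxw": ("max", "w"), "maxl": ("max", "l"),
    "avgw": ("avg", "w"), "avgl": ("avg", "l"),
}


def build_rename_map(columns, winner_slot):
    """Map winner/loser columns to p1/p2 names; winner_slot is 'p1' or 'p2'."""
    loser_slot = "p2" if winner_slot == "p1" else "p1"
    slot = {"w": winner_slot, "x": winner_slot, "l": loser_slot, "y": loser_slot}
    mapping = {}
    for col in columns:
        if col in MATCH_COLUMNS:
            base, side = MATCH_COLUMNS[col]
            mapping[col] = f"{base}_{slot[side]}"
        elif col in ("player_x", "player_y"):
            mapping[col] = f"{slot[col[-1]]}_name"
        elif col.endswith(("_x", "_y")):
            mapping[col] = f"{col[:-2]}_{slot[col[-1]]}"
        # Columns without a winner/loser marker (e.g. 'surface') are left unchanged.
    return mapping


sample_columns = ["player_x", "serve_rating_y", "wrank", "b365l", "surface"]
map_a = build_rename_map(sample_columns, "p1")  # subset with win_result == 0
map_b = build_rename_map(sample_columns, "p2")  # subset with win_result == 1
print(map_a)
print(map_b)
```

Used in the notebook, this would look like `df_subset_a.rename(columns=build_rename_map(df_subset_a.columns, "p1"))`, with `"p2"` for `df_subset_b`.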

In [58]:
df_model.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15466 entries, 0 to 15465
Data columns (total 60 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   series                          15466 non-null  object 
 1   court                           15466 non-null  object 
 2   surface                         15466 non-null  object 
 3   round                           15466 non-null  object 
 4   best of                         15466 non-null  int64  
 5   rank_p1                         15466 non-null  float64
 6   rank_p2                         15466 non-null  float64
 7   pts_p1                          15466 non-null  float64
 8   pts_p2                          15466 non-null  float64
 9   b365_p1                         15466 non-null  float64
 10  b365_p2                         15466 non-null  float64
 11  ps_p1                           15466 non-null  float64
 12  ps_p2                           

### Encoding Categorical Features
To build the prediction model, the categorical features must be encoded numerically. We use one-hot encoding with scikit-learn's OneHotEncoder.

In [59]:
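# Create function to encode categorical features using OneHotEncoder
def cat_encoding(cat_cols):
    ohe=OneHotEncoder()
    cat_encoded=ohe.fit_transform(cat_cols)
    # get_feature_names_out replaces get_feature_names, which was removed in scikit-learn 1.2
    columns=ohe.get_feature_names_out(list(cat_cols))
    # Keep the input index so the encoded rows align when concatenated with the other columns
    cat_cols=pd.DataFrame(cat_encoded.toarray(),columns=columns,index=cat_cols.index)
    return cat_cols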

In [60]:
# Define list of columns with categorical value
cat_cols=['series','court','surface','round','best of']

# Separate columns that will not be transformed
df_not_cat=df_model.drop(columns=cat_cols)
df_cat=df_model[cat_cols]

# Encode features
df_cat_encoded=cat_encoding(df_cat)

# Concatenate original and transformed dataframe
df_final=pd.concat([df_not_cat,df_cat_encoded],axis=1)

In [61]:
df_final.head()

Unnamed: 0,rank_p1,rank_p2,pts_p1,pts_p2,b365_p1,b365_p2,ps_p1,ps_p2,max_p1,max_p2,...,round_1st Round,round_2nd Round,round_3rd Round,round_4th Round,round_Quarterfinals,round_Round Robin,round_Semifinals,round_The Final,best of_3,best of_5
0,19.0,18.0,1840.0,1855.0,3.0,1.36,2.93,1.45,3.65,1.46,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
1,41.0,10.0,1030.0,3005.0,3.75,1.25,3.43,1.36,3.8,1.36,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
2,42.0,8.0,1030.0,3325.0,2.75,1.4,2.66,1.53,2.85,1.55,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
3,22.0,9.0,1555.0,3150.0,4.5,1.18,4.96,1.21,5.4,1.29,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
4,46.0,1.0,951.0,11120.0,21.0,1.01,23.14,1.01,28.0,1.02,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0


SageMaker's built-in XGBoost algorithm expects CSV training data to have the target variable in the first column and no header row. We therefore move win_result to the first column.

In [62]:
# Shift column 'win_result' as the dependent variable to be the first column - sagemaker identifies dependent variable in the first column
first_column = df_final.pop('win_result')
df_final.insert(0, 'win_result', first_column)

# Data frame after shifting the column
print("After shifting column to first position")
display(df_final)

After shifting column to first position


Unnamed: 0,win_result,rank_p1,rank_p2,pts_p1,pts_p2,b365_p1,b365_p2,ps_p1,ps_p2,max_p1,...,round_1st Round,round_2nd Round,round_3rd Round,round_4th Round,round_Quarterfinals,round_Round Robin,round_Semifinals,round_The Final,best of_3,best of_5
0,0,19.0,18.0,1840.0,1855.0,3.00,1.36,2.93,1.45,3.65,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
1,1,41.0,10.0,1030.0,3005.0,3.75,1.25,3.43,1.36,3.80,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
2,1,42.0,8.0,1030.0,3325.0,2.75,1.40,2.66,1.53,2.85,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
3,1,22.0,9.0,1555.0,3150.0,4.50,1.18,4.96,1.21,5.40,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
4,1,46.0,1.0,951.0,11120.0,21.00,1.01,23.14,1.01,28.00,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
15461,0,7.0,73.0,4110.0,856.0,1.16,5.00,1.18,5.63,1.21,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
15462,0,11.0,5.0,3330.0,5980.0,1.66,2.20,1.72,2.22,1.74,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
15463,0,8.0,62.0,3865.0,906.0,1.33,3.40,1.33,3.70,1.38,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
15464,1,26.0,1.0,1566.0,8340.0,4.00,1.25,4.62,1.24,4.62,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0


In [63]:
df_final.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15466 entries, 0 to 15465
Data columns (total 75 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   win_result                      15466 non-null  int64  
 1   rank_p1                         15466 non-null  float64
 2   rank_p2                         15466 non-null  float64
 3   pts_p1                          15466 non-null  float64
 4   pts_p2                          15466 non-null  float64
 5   b365_p1                         15466 non-null  float64
 6   b365_p2                         15466 non-null  float64
 7   ps_p1                           15466 non-null  float64
 8   ps_p2                           15466 non-null  float64
 9   max_p1                          15466 non-null  float64
 10  max_p2                          15466 non-null  float64
 11  avg_p1                          15466 non-null  float64
 12  avg_p2                          

## Model Building

### Split Dataset into Train, Validation, and Test Set

We split the dataset into train, validation, and test sets, then upload the train and validation sets to the S3 bucket, where the training job reads them. The test set is kept in the notebook for prediction against the deployed endpoint.

In [64]:
# Prepare dataset for model training 
dataset=pd.concat([df_final['win_result'],df_final.drop(['win_result'],axis=1)],
               axis=1)

In [65]:
# Shuffle the rows, then cut at 70% and 90% to get train (70%), validation (20%), and test (10%) sets
train_data, validation_data, test_data = np.split(dataset.sample(frac=1, random_state=1729), [int(0.7 * len(dataset)), int(0.9 * len(dataset))])
train_data.to_csv('train.csv', header=False, index=False)
validation_data.to_csv('validation.csv', header=False, index=False)
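
The two cut points passed to `np.split` determine the size of each set. The arithmetic can be checked without touching the data; the sketch below applies the same `int(fraction * n)` rule to the 15466 rows reported by `df_final.info()`. The function name is hypothetical.

```python
def split_sizes(n_rows, first_cut=0.7, second_cut=0.9):
    """Sizes of the three parts produced by cutting n_rows at two fractions."""
    i = int(first_cut * n_rows)   # end of the training slice
    j = int(second_cut * n_rows)  # end of the validation slice
    return i, j - i, n_rows - j


train_n, validation_n, test_n = split_sizes(15466)
print(train_n, validation_n, test_n)  # 10826 3093 1547
```

The test set should therefore hold 1547 rows, which is the number of predictions expected back from the endpoint later on.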

In [66]:
# Upload the train and validation sets to the S3 bucket (the test set stays in the notebook)
bucket_name='atpbucketv3' # insert bucket name
prefix_output='xgboost-as-a-built-in-algo'

boto3.Session().resource('s3').Bucket(bucket_name).Object(os.path.join(prefix_output, 'train/train.csv')).upload_file('train.csv')
boto3.Session().resource('s3').Bucket(bucket_name).Object(os.path.join(prefix_output, 'validation/validation.csv')).upload_file('validation.csv')

s3_input_train = TrainingInput(s3_data='s3://{}/{}/train'.format(bucket_name, prefix_output), content_type='csv')
s3_input_validation = TrainingInput(s3_data='s3://{}/{}/validation/'.format(bucket_name, prefix_output), content_type='csv')

### Model Training

In [67]:
# Define xgboost container 
xgboost_container = sagemaker.image_uris.retrieve("xgboost", my_region, "1.2-2")

sess = sagemaker.Session()
role = get_execution_role()

In [68]:
# Define the parameter to run the training of the xgboost model
xgb = sagemaker.estimator.Estimator(image_uri=xgboost_container,
                                    role=sagemaker.get_execution_role(), 
                                    instance_count=1, 
                                    instance_type='ml.m4.xlarge',
                                    output_path='s3://{}/{}/output'.format(bucket_name, prefix_output),
                                    sagemaker_session=sess,
                                    max_run=300,
                                    #max_wait=600,
                                    )

### Hyperparameter Configuration

In [2]:
# Set fixed hyperparameters and train the model (no automated tuning job is run)
xgb.set_hyperparameters(eta=0.1, objective='binary:logistic', num_round=25) 
xgb.fit({'train': s3_input_train, 'validation': s3_input_validation})

### Model Deployment

In [70]:
# Deploy model 
xgb_predictor = xgb.deploy(
    initial_instance_count = 1,
    instance_type = 'ml.m4.xlarge',
    serializer = CSVSerializer())

------!

### Prediction using Deployed Model

In [71]:
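# Prediction using the deployed model 
test_data_array=test_data.drop(['win_result'],axis=1).values # features only, loaded into an array
#xgb_predictor.content_type = 'text/csv' # set data type for an inference
xgb_predictor.serializer=CSVSerializer() # set serializer type

predictions=xgb_predictor.predict(test_data_array).decode('utf-8')
# Note: if the endpoint separates predictions with newlines rather than commas,
# parsing with sep=',' reads only the first value. That would be consistent with
# the recorded shape of (1,) instead of one prediction per test row.
predictions_array=np.fromstring(predictions[1:],sep=',') # turn prediction to array

print(predictions_array.shape) # print output for test set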

(1,)
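
A shape of `(1,)` means only one value was parsed, not one per test row. One possible cause, stated here as an assumption, is that the response text is newline-delimited while the parser expects commas. The sketch below parses either layout and then scores the result against the true labels. It uses only the standard library, the function names are hypothetical, and the sample response strings are made up rather than taken from the real endpoint. With `objective='binary:logistic'`, each returned value is a probability of class 1, which in this notebook means Player 2 wins.

```python
import re


def parse_scores(response_text):
    """Split an endpoint response on commas and/or whitespace into floats."""
    tokens = re.split(r"[,\s]+", response_text.strip())
    return [float(t) for t in tokens if t]


def accuracy_at_threshold(scores, labels, threshold=0.5):
    """Share of rows where (score > threshold) matches the 0/1 label."""
    predicted = [1 if s > threshold else 0 for s in scores]
    correct = sum(p == y for p, y in zip(predicted, labels))
    return correct / len(labels)


# Made-up response text in both layouts; not output from the real endpoint.
comma_style = "0.91,0.12,0.55"
newline_style = "0.91\n0.12\n0.55\n"
scores = parse_scores(newline_style)
print(scores, accuracy_at_threshold(scores, [1, 0, 0]))
```

In the notebook, `parse_scores(predictions)` could replace the `np.fromstring` call, and `test_data['win_result']` would supply the labels. The `accuracy_score`, `confusion_matrix`, and `roc_auc_score` functions imported at the top could then be applied to the same values.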




### Endpoint Configuration

In [37]:
# Check endpoint name for API gateway configuration 
xgb_predictor.endpoint_name

'sagemaker-xgboost-2022-05-02-23-01-28-966'
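
The endpoint name is what the Lambda function in step 4 of the workflow needs in order to call the model. The notebook does not include the Lambda code, so the following is only a sketch of what such a handler could look like. It assumes an API Gateway Lambda proxy integration (the request body arrives as a JSON string), a request shape of `{"data": "<comma-separated features>"}`, and an environment variable named `ENDPOINT_NAME`; all three are assumptions. The `runtime_client` parameter is a hypothetical hook that allows testing without AWS access.

```python
import json
import os


def lambda_handler(event, context, runtime_client=None):
    """Forward comma-separated features from an API request to the endpoint."""
    if runtime_client is None:
        import boto3  # imported lazily so the handler can be exercised without AWS
        runtime_client = boto3.client("sagemaker-runtime")

    # ENDPOINT_NAME is an assumed environment variable set on the Lambda function.
    endpoint_name = os.environ["ENDPOINT_NAME"]
    body = json.loads(event["body"])  # assumed shape: {"data": "19.0,18.0,..."}

    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="text/csv",
        Body=body["data"],
    )
    # One input row is sent, so one probability is expected back.
    score = float(response["Body"].read().decode("utf-8").strip())
    return {
        "statusCode": 200,
        "body": json.dumps({"p2_win_probability": score,
                            "win_result": int(score > 0.5)}),
    }
```

The feature string must list the columns in the same order as the training CSV, without the `win_result` column.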

In [36]:
# Delete endpoint - only for this project, to cut the billing charges
#xgb_predictor.delete_endpoint()

### Limitations and Future Work

Some parts of the workflow still require manual configuration. A more versatile orchestration tool such as Amazon MWAA (Amazon Managed Workflows for Apache Airflow) could automate the pipeline by running a DAG that covers the end-to-end machine learning task. It could also bridge the gap between the web scraping, which runs in Docker, and the AWS environment that receives its results. In addition, this analysis trains only an XGBoost model. Future work could explore other models, such as other ensemble methods, that may predict better.

### Reference List

• AWS. Evaluate the Model Deployed to SageMaker Hosting Services. [online] Available at: https://docs.aws.amazon.com/sagemaker/latest/dg/ex1-test-model.html<br>
• AWS. Call an Amazon SageMaker model endpoint using Amazon API Gateway and AWS Lambda. [online] Available at: https://aws.amazon.com/blogs/machine-learning/call-an-amazon-sagemaker-model-endpoint-using-amazon-api-gateway-and-aws-lambda/<br>
• AWS. XGBoost Algorithm. [online] Available at: https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost.html<br>
• maldyvinandar. MSIN0166 Data Engineering Group Project. [online] Available at: https://github.com/maldyvinandar/data-eng-group-coursework.git<br>
• sagemaker. Use Version 2.x of the SageMaker Python SDK. [online] Available at: https://sagemaker.readthedocs.io/en/stable/v2.html<br>
• Sela E., S. Saksham, P. Yash, & Zhuang Y. Simplify machine learning with XGBoost and Amazon SageMaker. [online] Available at: https://aws.amazon.com/blogs/machine-learning/simplify-machine-learning-with-xgboost-and-amazon-sagemaker/