# Artificial Dataset Creation

Because it is infeasible to have human players play the ping pong game long enough (that is, for enough matches to gather sufficient data), I decided to create a synthetic dataset by pitting two computer players against each other. The data is stored in multiple files, collected during separate runs with different ball speeds.

### 1) Read and combine all the different data

In [1]:
""" Importing necessary packages """
import numpy as np
import pandas as pd
import json

In [26]:
""" Creating a list of filenames that need to be opened """
list_of_jsons = ['trainingSet' + str(i + 1) + '.json' for i in range(5)]

list_of_jsons

['trainingSet1.json',
 'trainingSet2.json',
 'trainingSet3.json',
 'trainingSet4.json',
 'trainingSet5.json']

In [27]:
""" Merge rows from all files """

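# DataFrame.append was removed in pandas 2.0, so collect the per-file frames and concatenate once
frames_X = []
frames_Y = []
for json_name in list_of_jsons:
    with open(json_name, 'r') as file:
        data = json.load(file)
    print(json_name)
    frames_X.append(pd.DataFrame(data['xs']))
    frames_Y.append(pd.DataFrame(data['ys']))

# ignore_index=True renumbers the rows so that df_X and df_Y share the same index
df_X = pd.concat(frames_X, ignore_index=True)
df_Y = pd.concat(frames_Y, ignore_index=True)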

trainingSet1.json
trainingSet2.json
trainingSet3.json
trainingSet4.json
trainingSet5.json


In [28]:
""" Check the sizes of X and Y """
print("Data X:", df_X.shape)
print("Data Y:", df_Y.shape)

Data X: (467925, 10)
Data Y: (467925, 3)
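
The merge step can be sketched on in-memory data. This is only an illustration: the two runs below are made up, and the layout (an object with an `xs` list of 10-value rows and a `ys` list of 3-value rows) is an assumption inferred from the keys and shapes used above.

```python
import pandas as pd

# Two hypothetical runs, mimicking the assumed layout of a trainingSetN.json file.
run_1 = {"xs": [[0.0] * 10, [1.0] * 10], "ys": [[1, 0, 0], [0, 1, 0]]}
run_2 = {"xs": [[2.0] * 10], "ys": [[0, 0, 1]]}

frames_X = [pd.DataFrame(run["xs"]) for run in (run_1, run_2)]
frames_Y = [pd.DataFrame(run["ys"]) for run in (run_1, run_2)]

# ignore_index=True renumbers rows 0..N-1 so X and Y share one index.
toy_X = pd.concat(frames_X, ignore_index=True)
toy_Y = pd.concat(frames_Y, ignore_index=True)

print(toy_X.shape, toy_Y.shape)  # (3, 10) (3, 3)
```

Matching row counts in X and Y, as in the real output above, are what make the later index-based drop safe.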


### 2) Find and delete duplicates

In [39]:
""" Get a boolean Series that is True for duplicate rows, keeping only the first occurrence of each duplicate """
duplicates = df_X.duplicated(keep='first')
print("Number of duplicate rows:", duplicates.sum())

Number of duplicate rows: 440305
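
To make the `keep='first'` behaviour concrete, here is a small sketch on a toy frame (the frame and its column names are made up for illustration, not taken from the dataset). It also works through the arithmetic for the counts printed above.

```python
import pandas as pd

# Toy stand-in for df_X; columns 'a' and 'b' are hypothetical.
toy_X = pd.DataFrame({"a": [1, 1, 2, 1], "b": [5, 5, 6, 5]})

# keep='first' marks every repeat of a row except its first occurrence.
toy_duplicates = toy_X.duplicated(keep="first")
print(toy_duplicates.tolist())  # [False, True, False, True]

# Arithmetic for the real counts reported above.
total_rows = 467925
duplicate_rows = 440305
unique_rows = total_rows - duplicate_rows
print(unique_rows)  # 27620
print(round(100 * duplicate_rows / total_rows, 1))  # 94.1
```

So roughly 94% of the collected rows repeat an earlier row, and 27620 unique rows remain after deduplication.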


In [45]:
""" Get the index labels of the rows flagged as duplicates """
duplicate_indices = duplicates[duplicates].index
duplicate_indices

Int64Index([   443,    444,    445,    446,    447,    448,    449,    450,
               451,    452,
            ...
            467915, 467916, 467917, 467918, 467919, 467920, 467921, 467922,
            467923, 467924],
           dtype='int64', length=440305)

In [46]:
""" Drop duplicate rows """
df_X.drop(duplicate_indices, axis=0, inplace=True)
df_Y.drop(duplicate_indices, axis=0, inplace=True)
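
Dropping the same index labels from both frames keeps features and labels paired only if `df_X` and `df_Y` share an index, which holds here because both were built with `ignore_index=True` from the same files. A minimal sketch on toy data (hypothetical column names) that checks this and shows an equivalent boolean-mask form:

```python
import pandas as pd

# Toy stand-ins for df_X and df_Y; column names are hypothetical.
toy_X = pd.DataFrame({"ball_x": [0.1, 0.1, 0.3], "ball_y": [0.2, 0.2, 0.4]})
toy_Y = pd.DataFrame({"action": [0, 0, 1]})

toy_duplicates = toy_X.duplicated(keep="first")
toy_indices = toy_duplicates[toy_duplicates].index

# Same approach as above: drop identical labels from both frames.
dropped_X = toy_X.drop(toy_indices, axis=0)
dropped_Y = toy_Y.drop(toy_indices, axis=0)

# Equivalent form: keep the rows that are not marked as duplicates.
masked_X = toy_X[~toy_duplicates]
masked_Y = toy_Y[~toy_duplicates]

# Both frames should still have matching indices after the drop.
assert dropped_X.index.equals(dropped_Y.index)
assert dropped_X.equals(masked_X) and dropped_Y.equals(masked_Y)
print(len(dropped_X), len(dropped_Y))  # 2 2
```

Note that duplicates are detected on the features only, so if two rows share the same features but carry different labels, only the first label is kept.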

In [63]:
""" Store into final_dataset_X.json and final_dataset_Y.json """
# to_json already produces JSON; passing its string result through json.dump would encode it
# a second time, so write the records straight to the files instead
df_X.to_json('final_dataset_X.json', orient='records')
df_Y.to_json('final_dataset_Y.json', orient='records')
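
One way to check that an exported file holds a plain JSON array of records is to load it back. A minimal sketch, assuming a toy frame and a temporary file (both hypothetical) and letting `DataFrame.to_json` write to the path directly:

```python
import json
import os
import tempfile

import pandas as pd

# Toy frame standing in for the deduplicated data; columns are hypothetical.
toy = pd.DataFrame({"a": [1, 2], "b": [3.5, 4.5]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_dataset.json")
    # to_json writes the records straight to the given path.
    toy.to_json(path, orient="records")

    # The top-level JSON value should be a list of row objects, not a string.
    with open(path, "r") as file:
        loaded = json.load(file)

    # pandas can rebuild the frame from the same file.
    restored = pd.read_json(path, orient="records")

print(type(loaded).__name__, len(loaded))  # list 2
```

If `json.load` returns a `str` instead of a `list`, the file was JSON-encoded twice and needs a second decoding step before use.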