# Full Data Approach
In this approach, we use every column from the two files, "badmintondata.csv" and "badmintondata2.csv", for both training and testing. Unlike the trajectory-based approach, which is limited to trajectory coordinates, little additional feature engineering is needed here because most of the existing columns are already informative and can be used directly. Before continuing, install the dependencies listed in requirements.txt.

Two datasets are provided:
- badminton serving data (badmintondata.csv): shuttlecock data points recorded along serving trajectories
- badminton rallying data (badmintondata2.csv): shuttlecock data points recorded along rallying trajectories

## Importing libraries and dependencies

In [2]:
# preprocessing and feature generation
import pandas as pd

# feature significance
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# data mining and model evaluation
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, mean_squared_error
from sklearn.model_selection import learning_curve

# visualization
import matplotlib.pyplot as plt
import plotly.graph_objects as go
from plotly.offline import plot

## Data Loading
General observations about the data:
- Serving trajectories start from a height of around 1.6 or 2.2 metres
- Rallying trajectories start from a height of either 2.6 or 3.2 metres
- Blank rows are used to separate observations
- Noise rows (human player X position of 0) are filtered out
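
The start heights listed above are what the preprocessing later relies on to assign a starting height to each trajectory. As a minimal sketch of that idea, the snippet below snaps a first recorded height to the closest reference value. The helper `trajectory_kind` and the sample heights are hypothetical, introduced only for illustration.

```python
# Reference start heights in metres, taken from the observations above.
START_HEIGHTS = (1.6, 2.2, 2.6, 3.2)
SERVING_HEIGHTS = {1.6, 2.2}


def nearest_start_height(first_z):
    """Snap the first recorded shuttlecock height to the closest reference height."""
    return min(START_HEIGHTS, key=lambda h: abs(h - first_z))


def trajectory_kind(first_z):
    """Hypothetical helper: label a trajectory by its snapped start height."""
    if nearest_start_height(first_z) in SERVING_HEIGHTS:
        return "serving"
    return "rallying"


# Made-up sample heights, not values from the datasets.
for z in (1.58, 2.3, 3.05):
    print(z, nearest_start_height(z), trajectory_kind(z))
```

A value exactly halfway between two reference heights resolves to whichever `min` encounters first, so borderline trajectories may deserve a manual check.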

Derived columns added during preprocessing:
- "OBSERVATION GROUP NUMBER" - the observation group that an observation belongs to
- "OBSERVATION NUMBER" - the sequence number of an observation within its observation group
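
To make the two derived columns concrete, here is a small vectorized sketch on toy data. It is an alternative illustration, not the notebook's own row-by-row loop, and it assumes every row has a human player X position of either 0 (noise) or 4 (valid). The shortened column name `HUMAN_X` is a stand-in for the real header.

```python
import pandas as pd

# Toy stand-in for a raw file: 4 marks a valid row, 0 marks a noise row.
toy = pd.DataFrame({"HUMAN_X": [4, 4, 0, 4, 4, 4, 0, 0, 4]})

valid = toy["HUMAN_X"].eq(4)
# A group starts at a valid row whose previous row is not valid.
prev_valid = valid.shift(fill_value=False).astype(bool)
starts = valid & ~prev_valid

# Keep only valid rows, then number the groups and the rows inside each group.
labelled = toy[valid].copy()
labelled["OBSERVATION GROUP NUMBER"] = starts.cumsum()[valid]
labelled["OBSERVATION NUMBER"] = (
    labelled.groupby("OBSERVATION GROUP NUMBER").cumcount() + 1
)
print(labelled)
```

Each run of consecutive valid rows becomes one group, and the observation number restarts at 1 whenever a new group begins.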

### Load and Preprocess Serving Data

In [29]:
# read csv data from badmintondata.csv
serving_data = pd.read_csv('../badmintondata.csv')

# Initialize variables
observation_group_num = 0
observation_num = 0
is_group = False

# Process rows
for index, row in serving_data.iterrows():
    # If the human player X position is 0 (invalid serve, noise), end the current group and remove the row
    if row['HUMAN PLAYER POSITION (X) metres'] == 0:
        if is_group:
            is_group = False
        serving_data.drop(index, inplace=True)
    # If the human player X position is 4 (valid serve), number the row within its observation group
    elif row['HUMAN PLAYER POSITION (X) metres'] == 4:
        # if the row is part of the same observation group
        if is_group:
            observation_num+=1
        # if the row is part of a new observation group
        else:
            is_group = True
            # increase observation group number by 1
            observation_group_num+=1
            # resets observation sequence number to 1
            observation_num = 1
        # Assign the observation number and observation group number to the row
        serving_data.at[index, 'OBSERVATION NUMBER'] = observation_num
        serving_data.at[index, 'OBSERVATION GROUP NUMBER'] = observation_group_num

# Set 'OBSERVATION GROUP NUMBER' as int
serving_data['OBSERVATION GROUP NUMBER'] = serving_data['OBSERVATION GROUP NUMBER'].astype(int)

# Set the human player's Z position to the reference start height (1.6, 2.2, 2.6 or 3.2)
# that is closest to the first recorded shuttlecock Z value in the group
def calculate_human_position(group):
    first_z = group.iloc[0]['SHUTTLECOCK POSITIION IN AIR(Z) metres']
    rounded_value_z = min(1.6, 2.2, 2.6, 3.2, key=lambda x: abs(x - first_z))
    group['HUMAN PLAYER POSITION (Z) metres'] = rounded_value_z
    return group

serving_data = serving_data.groupby('OBSERVATION GROUP NUMBER').apply(calculate_human_position).reset_index(drop=True)
serving_data.to_csv('cleaned_data/serving_data.csv', index=False)
print("Number of Observation Groups from Serving Data:", serving_data["OBSERVATION GROUP NUMBER"].nunique())

### Load and Preprocess Rallying Data

In [30]:
# read csv data from badmintondata2.csv
rallying_data = pd.read_csv('../badmintondata2.csv')

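# Initialize variables
# observation_group_num is not reset here: it continues from the serving cell above,
# which keeps group numbers unique across both datasets (run the serving cell first)
observation_num = 0
is_group = False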

# Process rows
for index, row in rallying_data.iterrows():
    if row['HUMAN PLAYER POSITION (X) metres'] == 0:
        if is_group:
            is_group = False
        rallying_data.drop(index, inplace=True)
    elif row['HUMAN PLAYER POSITION (X) metres'] == 4:
        if is_group:
            observation_num+=1
        else:
            is_group = True
            # increase observation group number by 1
            observation_group_num+=1
            # resets observation sequence number to 1
            observation_num = 1
        rallying_data.at[index, 'OBSERVATION NUMBER'] = observation_num
        rallying_data.at[index, 'OBSERVATION GROUP NUMBER'] = observation_group_num

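# Set 'OBSERVATION GROUP NUMBER' as int, as done for the serving data
rallying_data['OBSERVATION GROUP NUMBER'] = rallying_data['OBSERVATION GROUP NUMBER'].astype(int)

rallying_data = rallying_data.groupby('OBSERVATION GROUP NUMBER').apply(calculate_human_position).reset_index(drop=True)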
rallying_data.to_csv('cleaned_data/rallying_data.csv', index=False)

print("Number of Observation Groups from Rallying Data:", rallying_data["OBSERVATION GROUP NUMBER"].nunique())

Number of Observation Groups from Serving Data: 1212


### Combine Serving and Rallying Data

In [32]:
serving_data = pd.read_csv("cleaned_data/serving_data.csv")
rallying_data = pd.read_csv("cleaned_data/rallying_data.csv")

# combining the serving and rally dataframes
entire_badminton_data = pd.concat([serving_data, rallying_data], ignore_index=True)
# rename two columns: remove the stray space in "(X )" and mark the slant angle as an initial value
entire_badminton_data = entire_badminton_data.rename(columns={
    'SHUTTLECOCK POSITIION IN AIR(X ) metres': 'SHUTTLECOCK POSITIION IN AIR(X) metres',
    'SHUTTELCOCK SLANT ANGLE TO SIDELINE(DEGREE)': 'INITIAL SHUTTELCOCK SLANT ANGLE TO SIDELINE(DEGREE)',
})

entire_badminton_data.to_csv('cleaned_data/entire_data.csv', index=False)
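
After combining the two frames, a quick consistency check can catch numbering mistakes before modelling. The sketch below is one possible check on toy data, not part of the original notebook. The function name `check_observation_groups` and the toy frames are hypothetical; only the two derived column names come from the preprocessing above.

```python
import pandas as pd


def check_observation_groups(df):
    """Return True if every group's OBSERVATION NUMBER runs 1, 2, ..., n in row order."""
    expected = df.groupby("OBSERVATION GROUP NUMBER").cumcount() + 1
    return bool((df["OBSERVATION NUMBER"] == expected).all())


# Toy frames standing in for the cleaned serving and rallying data.
serving = pd.DataFrame(
    {"OBSERVATION GROUP NUMBER": [1, 1, 2], "OBSERVATION NUMBER": [1, 2, 1]}
)
rallying = pd.DataFrame(
    {"OBSERVATION GROUP NUMBER": [3, 3], "OBSERVATION NUMBER": [1, 2]}
)
combined = pd.concat([serving, rallying], ignore_index=True)

# Group numbers should not be shared between the two sources.
shared = set(serving["OBSERVATION GROUP NUMBER"]) & set(
    rallying["OBSERVATION GROUP NUMBER"]
)
print(check_observation_groups(combined), shared)
```

If the set of shared group numbers is not empty, the two preprocessing cells were probably run out of order or with the group counter reset in between.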