### Problem Statement
This project aims to predict whether a horse racing bet results in "Win" or "Lose". Accurate forecasts can improve decision-making for bookmakers and bettors, which may lead to better financial results and betting strategies. Using Keras/TensorFlow, we build a machine learning pipeline that preprocesses the data, engineers relevant features, and trains a deep neural network.

In [14]:
import pandas as p
from scikeras.wrappers import KerasClassifier
import keras
from keras.models import Sequential
from keras.layers import Dense, Dropout
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer, f1_score, accuracy_score
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split

### Data Exploration

In [26]:
data_frame = p.read_csv('/Users/prerna/Downloads/tips.csv',encoding='unicode_escape')

In [27]:
data_frame.head(5)

Unnamed: 0,UID,ID,Tipster,Date,Track,Horse,Bet Type,Odds,Result,TipsterActive
0,1,1,Tipster A,24/07/2015,Ascot,Fredricka,Win,8.0,Lose,True
1,2,2,Tipster A,24/07/2015,Thirsk,Spend A Penny,Win,4.5,Lose,True
2,3,3,Tipster A,24/07/2015,York,Straightothepoint,Win,7.0,Lose,True
3,4,4,Tipster A,24/07/2015,Newmarket,Miss Inga Sock,Win,5.0,Lose,True
4,5,5,Tipster A,25/07/2015,Ascot,Peril,Win,4.33,Win,True


Before modelling, we need to encode the categorical variables, derive features from the dates, and convert the Result column into a binary target that can be used for classification.

### Data Preprocessing and Feature Engineering
- Encode the categorical columns (Tipster, Track, Horse, and Bet Type) as integers using LabelEncoder.
- Extract the year, month, and day from the Date column as separate features.
- Drop the UID and ID columns, which are row identifiers with no predictive value, and the raw Date column once its components have been extracted.
- Standardise the features with StandardScaler, fitted on the training split only.
- Convert Result to a binary target (1 for "Lose" and 0 for "Win").
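
The Date values in the preview are written day-first (24/07/2015), which matters when extracting the month and day. As a small illustration, the sketch below parses a made-up ambiguous date (not taken from the dataset) with pandas' default settings and with an explicit day-first format. The variable names are hypothetical.

```python
import pandas as pd

# Hypothetical ambiguous date: 5 July 2015 written day-first.
raw = "05/07/2015"

# By default pandas reads an ambiguous string month-first.
default_parsed = pd.to_datetime(raw)

# An explicit format removes the ambiguity.
dayfirst_parsed = pd.to_datetime(raw, format="%d/%m/%Y")

print(default_parsed.month, default_parsed.day)
print(dayfirst_parsed.month, dayfirst_parsed.day)
```

Stating the format (or passing `dayfirst=True`) keeps the derived Month and Day features from being swapped on dates where both readings are valid.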

In [28]:
label_encoders = {}
categorical_columns = ['Tipster', 'Track', 'Horse', 'Bet Type']
for col in categorical_columns:
    label_encoders[col] = LabelEncoder()
    data_frame[col] = label_encoders[col].fit_transform(data_frame[col].astype(str))

# Dates are written day-first (e.g. 24/07/2015), so tell pandas explicitly.
data_frame['Date'] = p.to_datetime(data_frame['Date'], dayfirst=True)
data_frame['Year'] = data_frame['Date'].dt.year
data_frame['Month'] = data_frame['Date'].dt.month
data_frame['Day'] = data_frame['Date'].dt.day

data_frame.drop(['UID', 'ID', 'Date'], axis=1, inplace=True)

X = data_frame.drop('Result', axis=1)
y = data_frame['Result']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)



### Model Training 
Using Keras, we will train a neural network. The model stacks dense layers with ReLU activation, with dropout layers in between to reduce overfitting. For binary classification, the output layer uses a sigmoid activation function.
To select hyperparameters, we will use GridSearchCV with 2-fold cross-validation, searching over the optimizer, the dropout rate, and the number of epochs. The search is scored by accuracy; the F1 score is computed afterwards on the held-out test set.
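
To make the size of the search concrete, the sketch below enumerates the grid in plain Python. The parameter values mirror the `param_grid` defined in the following cell; the fit count assumes `cv=2` and scikit-learn's default behaviour of refitting the best configuration once on the full training set.

```python
from itertools import product

# Same values as the notebook's param_grid.
param_grid = {
    "model__optimizer": ["adam", "rmsprop"],
    "model__dropout_rate": [0.0, 0.1],
    "epochs": [10, 20],
}
cv_folds = 2  # matches cv=2 in the GridSearchCV call

# Every combination of one value per parameter.
keys = list(param_grid)
combinations = [dict(zip(keys, values)) for values in product(*param_grid.values())]

n_combinations = len(combinations)
# One fit per combination per fold, plus one final refit of the best combination.
n_fits = n_combinations * cv_folds + 1

print(n_combinations, n_fits)
```

Each combination becomes one row of `cv_results_`, which is why the results table further down has eight rows.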

In [29]:
y_train = [1 if i == "Lose" else 0 for i in y_train ]
y_test = [1 if i == "Lose" else 0 for i in y_test ]

In [30]:
def create_model(optimizer='adam', dropout_rate=0.0):
    model = Sequential()
    model.add(keras.Input(shape=(X_train.shape[1],)))  # declare the input shape explicitly
    model.add(Dense(128, activation='relu'))
    model.add(Dropout(dropout_rate))
    model.add(Dense(64, activation='relu'))
    model.add(Dropout(dropout_rate))
    model.add(Dense(1, activation='sigmoid'))  

    model.compile(loss='binary_crossentropy', optimizer=optimizer, metrics=['accuracy'])
    return model

model = KerasClassifier(model=create_model, verbose=0)

param_grid = {
    'model__optimizer': ['adam', 'rmsprop'],
    'model__dropout_rate': [0.0, 0.1],
    'epochs': [10, 20]
}


scorer = make_scorer(accuracy_score)

grid = GridSearchCV(estimator=model, param_grid=param_grid, scoring=scorer, cv=2)

grid_result = grid.fit(X_train, y_train)


We now build a pandas DataFrame from the grid search results. The cv_results_ attribute holds the details of every parameter combination tried during the search, one row per combination.
Next, we compute the accuracy and F1 score on the held-out test set. Note that the code below evaluates grid_result.best_estimator_ for every row, so the two added columns hold the same value throughout: they describe the refitted best model, not each individual parameter combination.

In [31]:
results = p.DataFrame(grid_result.cv_results_)


def get_f1_score(estimator, X, y):
    y_pred = estimator.predict(X)
    return f1_score(y, y_pred)

def get_accuracy(estimator, X, y):
    y_pred = estimator.predict(X)
    return accuracy_score(y, y_pred)

# best_estimator_ is the single model refitted with the best parameters, so
# these hold-out scores are identical for every row of the results table.
best_model = grid_result.best_estimator_
results['mean_test_f1_score'] = get_f1_score(best_model, X_test, y_test)
results['mean_test_accuracy'] = get_accuracy(best_model, X_test, y_test)

# Both sort keys are constant, so this sort does not reorder the rows.
sorted_results = results.sort_values(by=['mean_test_f1_score', 'mean_test_accuracy'], ascending=False)
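
If a separate F1 score per parameter combination is wanted, one option is scikit-learn's multi-metric scoring, which records each metric for every combination during cross-validation. The sketch below is not the notebook's pipeline: it substitutes a LogisticRegression on synthetic data (both assumptions, chosen to keep the example fast) purely to show the resulting shape of `cv_results_`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data; the notebook would pass X_train and y_train instead.
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=42)

demo_grid = GridSearchCV(
    estimator=LogisticRegression(max_iter=1000),
    param_grid={"C": [0.1, 1.0, 10.0]},
    scoring={"accuracy": "accuracy", "f1": "f1"},  # one set of columns per metric
    refit="accuracy",  # metric used to choose best_estimator_
    cv=2,
)
demo_grid.fit(X_demo, y_demo)

demo_results = pd.DataFrame(demo_grid.cv_results_)
# Each row now carries its own cross-validated accuracy and F1.
print(demo_results[["param_C", "mean_test_accuracy", "mean_test_f1", "rank_test_accuracy"]])
```

With the notebook's encoding, the built-in `"f1"` scorer would treat label 1 ("Lose") as the positive class.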


In [32]:
sorted_results

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_epochs,param_model__dropout_rate,param_model__optimizer,params,split0_test_score,split1_test_score,mean_test_score,std_test_score,rank_test_score,mean_test_f1_score,mean_test_accuracy
0,1.667311,0.008008,0.147305,0.000444,10,0.0,adam,"{'epochs': 10, 'model__dropout_rate': 0.0, 'mo...",0.798549,0.798876,0.798712,0.000163,2,0.888128,0.799346
1,1.448735,0.000625,0.146618,0.000751,10,0.0,rmsprop,"{'epochs': 10, 'model__dropout_rate': 0.0, 'mo...",0.795019,0.797568,0.796294,0.001275,7,0.888128,0.799346
2,1.882945,0.032232,0.148129,0.001044,10,0.1,adam,"{'epochs': 10, 'model__dropout_rate': 0.1, 'mo...",0.798745,0.798614,0.79868,6.5e-05,3,0.888128,0.799346
3,1.658055,0.000483,0.147589,0.000727,10,0.1,rmsprop,"{'epochs': 10, 'model__dropout_rate': 0.1, 'mo...",0.798745,0.798941,0.798843,9.8e-05,1,0.888128,0.799346
4,2.964836,0.02967,0.149595,0.000563,20,0.0,adam,"{'epochs': 20, 'model__dropout_rate': 0.0, 'mo...",0.798353,0.797046,0.797699,0.000654,5,0.888128,0.799346
5,2.665077,0.00727,0.148712,0.000341,20,0.0,rmsprop,"{'epochs': 20, 'model__dropout_rate': 0.0, 'mo...",0.796915,0.797568,0.797242,0.000327,6,0.888128,0.799346
6,3.445149,0.024496,0.149858,1.6e-05,20,0.1,adam,"{'epochs': 20, 'model__dropout_rate': 0.1, 'mo...",0.798157,0.798941,0.798549,0.000392,4,0.888128,0.799346
7,3.058427,0.001688,0.147308,0.000523,20,0.1,rmsprop,"{'epochs': 20, 'model__dropout_rate': 0.1, 'mo...",0.796457,0.795477,0.795967,0.00049,8,0.888128,0.799346


### Result
- The top-ranked configuration (rank 1) uses the rmsprop optimizer, a dropout rate of 0.1, and 10 epochs, with a mean cross-validated accuracy of 0.798843. On the held-out test set, the refitted best model reaches an accuracy of 0.799346 and an F1 score of 0.888128.
- All eight configurations perform very similarly: mean cross-validated accuracy ranges from 0.795967 to 0.798843, suggesting that the choice of optimizer, dropout rate, and epoch count has little effect on this dataset. The F1 and accuracy columns are constant only because they were computed once for the best model, so they do not compare configurations.
- The rmsprop and adam optimizers perform comparably, and neither dropout rate (0.0 or 0.1) changes accuracy noticeably.
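
As a sanity check on these figures, it can help to compare them with a trivial baseline that always predicts "Lose" (the class encoded as 1). The sketch below works through that arithmetic under an assumed share of losing bets. The notebook does not report the actual class balance, so `lose_share` is a hypothetical value to be replaced with the real proportion in `y_test`.

```python
def always_positive_scores(positive_share):
    """Accuracy and F1 of a classifier that labels every sample as positive."""
    precision = positive_share  # every prediction is positive
    recall = 1.0                # so every true positive is found
    accuracy = positive_share
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, f1

# Hypothetical share of "Lose" outcomes; not taken from the dataset.
lose_share = 0.80
baseline_accuracy, baseline_f1 = always_positive_scores(lose_share)

print(round(baseline_accuracy, 4), round(baseline_f1, 4))
```

If the real baseline lands close to the model's test scores, that would suggest the network is doing little more than predicting the majority class.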

### Summary
- Cross-validated accuracy is remarkably consistent across the range of hyperparameter values tried.
- The best result was obtained with 10 epochs, a dropout rate of 0.1, and the rmsprop optimizer, while the alternative settings worked almost as well.
- Based on this consistency across parameter values, the model and preprocessing procedures appear to be reliable for the dataset.