# Machine Learning

In [None]:
import pandas as pd, altair as alt
import numpy as np, seaborn as sns
sns.set(color_codes=True)
%matplotlib inline

cc_df = pd.read_csv("combined_results.csv", index_col=0)

### Data Transformation

In [None]:
cc_df[["Place_Section", "float_Time_Section"]].hist(figsize=(10,5))

These histograms show our two most important features, `Place_Section` and `float_Time_Section`. Both appear right-skewed, so we log-transform them to bring their distributions closer to normal.

In [None]:
cc_df.Place_Section = np.log10(cc_df.Place_Section)
cc_df.float_Time_Section = np.log10(cc_df.float_Time_Section)

In [None]:
cc_df[["Place_Section", "float_Time_Section"]].hist(figsize=(10,5))

Empirically, many models tend to perform better when their input features are approximately normally distributed.
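As a rough numeric check of what the log transform does, the sketch below measures skewness before and after `np.log10`. It uses synthetic lognormal data as a stand-in for a right-skewed feature; the data and variable names are assumptions made for illustration, not values from `combined_results.csv`.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed data (hypothetical stand-in for a feature
# such as float_Time_Section; not taken from the real dataset)
rng = np.random.default_rng(0)
raw = pd.Series(rng.lognormal(mean=3.0, sigma=0.6, size=5000))

# Same transformation as applied to the real columns
logged = np.log10(raw)

# Skewness near 0 indicates a roughly symmetric distribution
skew_before = raw.skew()
skew_after = logged.skew()
print("skew before: {:.2f}".format(skew_before))
print("skew after:  {:.2f}".format(skew_after))
```

A clearly positive skew that shrinks toward zero after the transform is what the second pair of histograms should reflect visually.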

### Preliminary Modelling

In [None]:
from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import RobustScaler
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split

X = (cc_df[["Division", "Grade", "Place_Section", "float_Time_Section", "Year", "Sex"]]
          .to_dict(orient="records"))
y = cc_df["float_Time_State"]

X_train, X_test, y_train, y_test = (
    train_test_split(X, y, test_size=.2))

vec = DictVectorizer(sparse=False)
scaler = RobustScaler()
model = KNeighborsRegressor()

pipeline = Pipeline([("vectorizer", vec), 
                     ("scaler", scaler), 
                     ("model", model)])

To set an initial benchmark, we start with a basic K-Nearest Neighbors model. We will later compare the model we end up choosing against these results.
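The first pipeline step may be less familiar than the others, so here is a small sketch of how `DictVectorizer` treats a list of records. The two records are made up for illustration and use only a subset of the columns; `vec_demo` is a hypothetical name to avoid clashing with `vec`.

```python
from sklearn.feature_extraction import DictVectorizer

# Two made-up records (illustrative values, not rows from the dataset)
records = [
    {"Sex": "M", "Grade": 11, "float_Time_Section": 1.25},
    {"Sex": "F", "Grade": 12, "float_Time_Section": 1.30},
]

vec_demo = DictVectorizer(sparse=False)
encoded = vec_demo.fit_transform(records)

# String values become one indicator column per category ("Sex=F", "Sex=M");
# numeric values are passed through unchanged
print(vec_demo.feature_names_)
print(encoded)
```

This is why the categorical columns can be passed to the pipeline as-is: the vectorizer one-hot encodes them before the scaler and the model see the data.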

In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_squared_error

cv_scores = []
test_scores = []
ks = range(1, 10)

for k in ks:
    # Build a fresh pipeline for each candidate number of neighbors
    model = KNeighborsRegressor(n_neighbors=k)
    pipeline = Pipeline([("vectorizer", vec),
                         ("scaler", scaler),
                         ("model", model)])

    # cross_val_score returns negated MSE, so flip the sign back
    cv_scores.append(-cross_val_score(pipeline, X_train, y_train, cv=5,
                                      scoring="neg_mean_squared_error").mean())
    pipeline.fit(X_train, y_train)
    # mean_squared_error expects (y_true, y_pred)
    test_scores.append(mean_squared_error(y_test, pipeline.predict(X_test)))

In [None]:
import matplotlib.pyplot as plt

pd.DataFrame({"cross-val": cv_scores, "test": test_scores}, index=ks).plot()
# np.argmin returns a 0-based position; the matching K is min_k + 1
min_k = np.argmin(test_scores)
plt.title("KNearest Optimal K")
plt.xlabel("K")
plt.ylabel("Error")
# Annotate the plot with both errors at the selected K
plt.text(x=5, y=0.72, s="$Test$ = {:3.2f}".format(test_scores[min_k]))
plt.text(x=5, y=0.7, s="$CV$ = {:3.2f}".format(cv_scores[min_k]))
# Red line joins the test and cross-validation curves at the selected K
plt.vlines(x=min_k + 1, ymin=test_scores[min_k], ymax=cv_scores[min_k], colors="red")

From the graph, we can see that our initial results are pretty promising. The red line connecting the two curves marks the optimal value of K, chosen as the K with the lowest observed test error. As for the results, a mean squared error of `0.5` implies that the model's predictions of the state time (`float_Time_State`) were typically off by roughly `0.7` of a minute (the square root of the MSE), or about `42` seconds.
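Two small helpers make the reasoning above concrete. The first picks K from a list of errors; passing the cross-validation errors instead of the test errors is a common alternative that keeps the test set untouched for the final evaluation. The second converts an MSE in squared minutes into a typical error in seconds. Both function names and the demo scores are hypothetical, introduced only for illustration.

```python
import math
import numpy as np

def best_k(ks, scores):
    """Return the K whose error score (lower is better) is smallest."""
    return list(ks)[int(np.argmin(scores))]

def mse_to_seconds(mse_minutes_squared):
    """Convert an MSE in squared minutes to an RMSE expressed in seconds."""
    return math.sqrt(mse_minutes_squared) * 60

# Hypothetical cross-validation errors for K = 1..5 (illustration only)
demo_ks = range(1, 6)
demo_cv_scores = [0.80, 0.62, 0.55, 0.50, 0.53]

chosen_k = best_k(demo_ks, demo_cv_scores)
print(chosen_k)                       # 4
print(round(mse_to_seconds(0.5), 1))  # 42.4
```

In the notebook itself, `best_k(ks, cv_scores)` would give the K favored by cross-validation, which may or may not coincide with the K that minimizes the test error.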

### Gradient Boosting Regression

In [None]:
from sklearn.ensemble import GradientBoostingRegressor

model = GradientBoostingRegressor(min_samples_leaf=3)
pipeline = Pipeline([("vectorizer", vec), 
                     ("scaler", scaler), 
                     ("model", model)])
    
cv_score = (-cross_val_score(pipeline, X_train, y_train, cv=5, scoring="neg_mean_squared_error").mean())
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
test_score = (mean_squared_error(y_test, y_pred))

We chose a Gradient Boosting model because of its ensemble design, which combines many weak learners into a stronger one, and because it has historically been a powerful model for learning problems like this.
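To illustrate the ensemble idea, here is a minimal sketch of boosting with squared error: each shallow tree is fit to the residuals left by the trees before it. This is a simplified illustration on synthetic data, not the internals of `GradientBoostingRegressor`; the data, learning rate, and number of rounds are arbitrary assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic one-feature regression problem (illustration only)
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
# Start from a constant prediction: the mean of the targets
prediction = np.full_like(y_demo, y_demo.mean())
initial_mse = np.mean((y_demo - prediction) ** 2)

for _ in range(50):
    # For squared error, the residuals are the negative gradient of the loss
    residuals = y_demo - prediction
    tree = DecisionTreeRegressor(max_depth=2, min_samples_leaf=3)
    tree.fit(X_demo, residuals)
    # Each weak learner nudges the running prediction a small step
    prediction += learning_rate * tree.predict(X_demo)

final_mse = np.mean((y_demo - prediction) ** 2)
print("training MSE before boosting: {:.3f}".format(initial_mse))
print("training MSE after boosting:  {:.3f}".format(final_mse))
```

No single depth-2 tree fits the curve well, but the sum of many small corrections does, which is the intuition behind preferring an ensemble over a single model.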

In [None]:
print("Test = {:3.2f}".format(test_score))
print("CV = {:3.2f}".format(cv_score))

According to the results, the Gradient Boosting model is a better fit for our problem: its error is approximately 20-25% lower than that of the K-Nearest Neighbors baseline.
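The percentage comparison is a relative reduction in error. A small sketch of that calculation follows; the function name is hypothetical and the two MSE values are illustrative placeholders, not the scores printed above.

```python
def relative_reduction(baseline_error, new_error):
    """Fraction by which new_error is lower than baseline_error."""
    return (baseline_error - new_error) / baseline_error

# Placeholder MSE values for illustration only
print("{:.0%}".format(relative_reduction(0.50, 0.40)))  # 20%
```

In the notebook, the baseline would be the K-Nearest Neighbors score at the selected K and the new error would be `test_score` or `cv_score` from the Gradient Boosting cell.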

In [None]:
results = pd.DataFrame({"True": y_test, "Predicted": y_pred})
sns.jointplot(data=results, x="True", y="Predicted",
              ci=100, kind='reg',
              joint_kws={'line_kws':{'color':'orange'}})
plt.text(x=15, y=25, s="$R^2$ = {:3.2f}".format(pipeline.score(X_test, y_test)))

The plot shows the distribution of the true test values (True) against the values predicted by our model (Predicted). Our $R^2$ (coefficient of determination) is high, which means the model explains most of the variance in the state times and its predictions track the true values closely.
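For reference, $R^2$ is one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below computes it by hand and compares it against `sklearn.metrics.r2_score`; the arrays are made-up values for illustration, not the notebook's predictions.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up true and predicted times (illustration only)
y_true_demo = np.array([17.2, 18.5, 19.1, 20.4, 22.0])
y_pred_demo = np.array([17.0, 18.9, 19.0, 20.9, 21.5])

# Residual sum of squares: error left over after prediction
ss_res = np.sum((y_true_demo - y_pred_demo) ** 2)
# Total sum of squares: error of always predicting the mean
ss_tot = np.sum((y_true_demo - y_true_demo.mean()) ** 2)

r2_manual = 1 - ss_res / ss_tot
print("manual R^2:  {:.3f}".format(r2_manual))
print("sklearn R^2: {:.3f}".format(r2_score(y_true_demo, y_pred_demo)))
```

An $R^2$ of 1 means perfect predictions, while 0 means the model does no better than predicting the mean of the true values.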