# Classifying MLB Free Agents

Now we'll actually try classifying. We'll load the final data set (which may change from time to time), format it for scikit-learn, and then try a few machine learning classifiers.

In [1]:
# Bring in data
import pandas as pd
import pickle

with open('final_data.pickle', 'rb') as file:
    final_data = pickle.load(file)

In [2]:
final_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1098 entries, 4 to 1712
Data columns (total 22 columns):
Unnamed: 0     1098 non-null int64
Age            1098 non-null int64
Destination    1098 non-null object
Origin         1098 non-null object
WAR_3          1098 non-null float64
nameFirst      1098 non-null object
nameLast       1098 non-null object
Dollars        1098 non-null float64
Length         1098 non-null int64
Name           1098 non-null object
Position_x     1098 non-null object
playerID       1098 non-null object
yearID         1098 non-null int64
G              1098 non-null float64
OBP            1098 non-null float64
SLG            1098 non-null float64
HR             1098 non-null float64
RBI            1098 non-null float64
Position_y     1098 non-null object
name           1098 non-null object
teamID         1098 non-null object
label          1098 non-null int32
dtypes: float64(7), int32(1), int64(4), object(10)
memory usage: 193.0+ KB


In [5]:
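# For features, drop all names, player/year IDs, position, Destination, and Origin,
# plus the label itself (it is the target, so it must not leak into the features).
# Note: info() above lists Name, Position_x, and Position_y rather than Position;
# the list below has to match the columns the loaded frame actually contains.
X = final_data.drop(['playerID', 'nameFirst', 'nameLast', 'name',
                     'Origin', 'Position', 'yearID', 'Destination',
                     'teamID', 'label'], axis=1).values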

y = final_data['label'].values


# Split the data
from sklearn.model_selection import train_test_split

# Split into training and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, stratify=y)
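
Listing the columns to drop by hand is easy to get out of sync with the frame. One alternative, sketched below on a made-up toy frame, is to keep only the numeric columns and then remove the target. The frame `toy` and its values are assumptions for illustration, not the real `final_data`; on the real frame, numeric ID-like columns such as `Unnamed: 0` and `yearID` would still have to be dropped explicitly. The sketch also fixes `random_state` so the stratified split is repeatable.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for final_data; all values are made up for illustration
toy = pd.DataFrame({
    'playerID': ['p{}'.format(i) for i in range(20)],
    'Name': ['Player {}'.format(i) for i in range(20)],
    'Age': [25 + i % 10 for i in range(20)],
    'WAR_3': [0.5 * i for i in range(20)],
    'Dollars': [1e6 * (i + 1) for i in range(20)],
    'label': [1] * 10 + [2] * 10,
})

# Keep only numeric columns, then remove the target from the features
feature_frame = toy.select_dtypes(include='number').drop(columns=['label'])
X_toy = feature_frame.values
y_toy = toy['label'].values

# stratify keeps the class proportions the same in both splits;
# random_state makes the split reproducible between runs
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=0)

print(list(feature_frame.columns))
print(X_tr.shape, X_te.shape)
```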

## The naive method

As a baseline, classify every player as the most common team: the Yankees. Any real classifier has to beat this accuracy to be worth using.

In [6]:
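# Count of the most common label (not echoed: a cell only displays its last expression)
final_data['label'].value_counts().max()
# Total number of rows
final_data['label'].shape[0]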

1897

In [7]:
# Calculate accuracy based on this
most_freq = float(final_data['label'].value_counts().max())
total_freq = float(final_data['label'].shape[0])


print("Naive Accuracy = {}".format(most_freq/total_freq))

Naive Accuracy = 0.44069583552978386
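
The same kind of baseline can also be expressed with scikit-learn's `DummyClassifier`, which makes it easy to score the baseline on the same test split as the real models. This is a minimal sketch on made-up labels (`y_demo` and `X_demo` are hypothetical, not taken from `final_data`):

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Made-up labels: class 4 is the most common (5 of 10)
y_demo = np.array([1, 1, 1, 2, 3, 4, 4, 4, 4, 4])
# The 'most_frequent' strategy ignores the features, so zeros are enough
X_demo = np.zeros((len(y_demo), 1))

baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_demo, y_demo)

# Accuracy of always predicting the most common class
baseline_acc = baseline.score(X_demo, y_demo)
print("Naive Accuracy = {}".format(baseline_acc))
```

With real data, one would typically fit the baseline on `X_train, y_train` and score it on `X_test, y_test`, so that it is compared against the other classifiers on identical samples.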


## Attempt 1: K-Nearest Neighbors

This is probably the simplest approach; let's see how well it does.

In [8]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report

# Create a k-NN classifier with 6 neighbors
knn = KNeighborsClassifier(n_neighbors = 6)

# Fit the classifier to the data
knn.fit(X_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=6, p=2,
           weights='uniform')

In [9]:
print(knn.score(X_test, y_test))

y_pred = knn.predict(X_test)

print(classification_report(y_test, y_pred))

0.35
             precision    recall  f1-score   support

          1       0.31      0.48      0.38       115
          2       0.00      0.00      0.00        30
          3       0.09      0.04      0.06        50
          4       0.43      0.45      0.44       168
          5       0.00      0.00      0.00        17

avg / total       0.30      0.35      0.32       380
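
k-NN works on raw distances, so a feature with a large numeric range (a dollar amount, for example) can dominate features on a small scale such as OBP or SLG. A common remedy, which is not something the cells above do, is to standardize the features inside a pipeline. The sketch below uses synthetic data from `make_classification` with one column inflated to a dollar-like scale; all names and numbers here are assumptions for illustration, and no claim is made about how much this would help on the free-agent data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 3-class data standing in for the real features
X_syn, y_syn = make_classification(n_samples=300, n_features=6,
                                   n_informative=4, n_classes=3,
                                   random_state=0)
X_syn[:, 0] *= 1e6  # mimic a column measured in dollars

X_a, X_b, y_a, y_b = train_test_split(
    X_syn, y_syn, test_size=0.2, stratify=y_syn, random_state=0)

# The scaler is fit on the training data only, then applied to the test data
scaled_knn = make_pipeline(StandardScaler(),
                           KNeighborsClassifier(n_neighbors=6))
scaled_knn.fit(X_a, y_a)

scaled_score = scaled_knn.score(X_b, y_b)
print(scaled_score)
```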





In [10]:
# Get the predicted class probabilities for each test sample
probs = knn.predict_proba(X_test)
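
For k-NN with uniform weights, `predict_proba` returns one row per sample and one column per class, where each entry is the fraction of the k nearest neighbors that belong to that class. The columns follow the order of the fitted `classes_` attribute. A tiny sketch on hand-made one-dimensional data (hypothetical values) shows this:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Two well-separated groups on a line, labeled 1 and 4
X_small = np.array([[0.0], [0.1], [0.2], [1.0], [1.1], [1.2]])
y_small = np.array([1, 1, 1, 4, 4, 4])

knn_small = KNeighborsClassifier(n_neighbors=3)
knn_small.fit(X_small, y_small)

# First query sits inside the class-1 group; the second sits between groups,
# where its 3 nearest neighbors are 0.2 and 0.1 (class 1) and 1.0 (class 4)
proba = knn_small.predict_proba(np.array([[0.05], [0.58]]))

print(knn_small.classes_)  # column order of proba
print(proba)
```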

## Attempt 2: Random Forest Classifier

In [11]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier()

# Fit the classifier to the data
rf.fit(X_train, y_train)

print(rf.score(X_test, y_test))

y_pred = rf.predict(X_test)

print(classification_report(y_test, y_pred))

0.328947368421
             precision    recall  f1-score   support

          1       0.27      0.34      0.30       115
          2       0.08      0.03      0.05        30
          3       0.19      0.14      0.16        50
          4       0.43      0.46      0.44       168
          5       0.00      0.00      0.00        17

avg / total       0.30      0.33      0.31       380
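
The forest above uses default settings, and the default number of trees differs between scikit-learn versions, so scores can vary from run to run and from install to install. A sketch that sets `n_estimators` and `random_state` explicitly, and then inspects `feature_importances_`, is shown below on synthetic data; the data and the parameter values are assumptions for illustration, not tuned choices for the free-agent problem.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic 3-class data standing in for the real features
X_rf, y_rf = make_classification(n_samples=200, n_features=5,
                                 n_informative=3, n_redundant=0,
                                 n_classes=3, n_clusters_per_class=1,
                                 random_state=0)

# Explicit tree count and seed so repeated runs give the same model
rf_demo = RandomForestClassifier(n_estimators=100, random_state=0)
rf_demo.fit(X_rf, y_rf)

# One importance value per feature; the values sum to 1
importances = rf_demo.feature_importances_
for idx, value in enumerate(importances):
    print("feature {}: {:.3f}".format(idx, value))
```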

