## Basketball teams classification

Let's first load required libraries:

In [None]:
import itertools
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
from matplotlib.ticker import NullFormatter
import seaborn as sns
from sklearn import preprocessing, metrics, tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from IPython.display import Image
%matplotlib inline

## About dataset
This dataset describes the performance of basketball teams. The cbb.csv file covers five seasons of 354 basketball teams and includes the following fields:

| Field | Description |
| --- | --- |
| TEAM | The Division I college basketball school |
| CONF | The athletic conference in which the school participates (A10 = Atlantic 10, ACC = Atlantic Coast Conference, AE = America East, Amer = American, ASun = ASUN, B10 = Big Ten, B12 = Big 12, BE = Big East, BSky = Big Sky, BSth = Big South, BW = Big West, CAA = Colonial Athletic Association, CUSA = Conference USA, Horz = Horizon League, Ivy = Ivy League, MAAC = Metro Atlantic Athletic Conference, MAC = Mid-American Conference, MEAC = Mid-Eastern Athletic Conference, MVC = Missouri Valley Conference, MWC = Mountain West, NEC = Northeast Conference, OVC = Ohio Valley Conference, P12 = Pac-12, Pat = Patriot League, SB = Sun Belt, SC = Southern Conference, SEC = South Eastern Conference, Slnd = Southland Conference, Sum = Summit League, SWAC = Southwestern Athletic Conference, WAC = Western Athletic Conference, WCC = West Coast Conference) |
| G | Number of games played |
| W | Number of games won |
| ADJOE | Adjusted Offensive Efficiency (an estimate of the offensive efficiency (points scored per 100 possessions) a team would have against the average Division I defense) |
| ADJDE | Adjusted Defensive Efficiency (an estimate of the defensive efficiency (points allowed per 100 possessions) a team would have against the average Division I offense) |
| BARTHAG | Power Rating (chance of beating an average Division I team) |
| EFG_O | Effective Field Goal Percentage Shot |
| EFG_D | Effective Field Goal Percentage Allowed |
| TOR | Turnover Percentage Allowed (Turnover Rate) |
| TORD | Turnover Percentage Committed (Steal Rate) |
| ORB | Offensive Rebound Percentage |
| DRB | Defensive Rebound Percentage |
| FTR | Free Throw Rate (how often the given team shoots free throws) |
| FTRD | Free Throw Rate Allowed |
| 2P_O | Two-Point Shooting Percentage |
| 2P_D | Two-Point Shooting Percentage Allowed |
| 3P_O | Three-Point Shooting Percentage |
| 3P_D | Three-Point Shooting Percentage Allowed |
| ADJ_T | Adjusted Tempo (an estimate of the tempo (possessions per 40 minutes) a team would have against a team that wants to play at an average Division I tempo) |
| WAB | Wins Above Bubble (the bubble refers to the cut-off between making the NCAA March Madness Tournament and not making it) |
| POSTSEASON | Round where the given team was eliminated or where its season ended (R68 = First Four, R64 = Round of 64, R32 = Round of 32, S16 = Sweet Sixteen, E8 = Elite Eight, F4 = Final Four, 2ND = Runner-up, Champion = Winner of the NCAA March Madness Tournament for that given year) |
| SEED | Seed in the NCAA March Madness Tournament |
| YEAR | Season |

## Load Data From CSV File
Let's load the dataset [NB Need to provide link to csv file]

In [None]:
df = pd.read_csv('cbb.csv')
df.head()
df.shape
(1406, 24)

## Add Column
Next we'll add a column that contains the string 'True' if a team's wins above bubble (WAB) exceed 7 and 'False' otherwise. We'll call this column the win index, or "windex" for short.


In [None]:
df['windex'] = np.where(df.WAB > 7, 'True', 'False')
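
The labelling rule can be tried on a few rows before applying it to the full table. The sketch below assumes a made-up `toy` frame with invented WAB values; it also makes visible that `np.where` here writes the strings 'True' and 'False', not booleans, and that a WAB of exactly 7 falls on the 'False' side.

```python
import numpy as np
import pandas as pd

# Hypothetical WAB values, invented for illustration only
toy = pd.DataFrame({"WAB": [-3.2, 7.0, 7.1, 11.5]})

# Same rule as the cell above: strictly greater than 7 gets the label 'True'
toy["windex"] = np.where(toy.WAB > 7, "True", "False")

print(toy)
print(toy["windex"].tolist())  # ['False', 'False', 'True', 'True']
```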

## Data visualization and pre-processing
Next we'll filter the data set to the teams that made the Sweet Sixteen, the Elite Eight, and the Final Four in the post season. We'll also create a new dataframe that will hold the values with the new column.

In [None]:
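# Keep only teams whose season ended in the Sweet Sixteen, Elite Eight, or Final Four;
# .copy() makes df1 independent of df so later column edits are safe
df1 = df.loc[df['POSTSEASON'].str.contains('F4|S16|E8', na=False)].copy()
df1.head()
df1['POSTSEASON'].value_counts()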

32 teams made it into the Sweet Sixteen, 16 into the Elite Eight, and 8 made it into the Final Four over 5 seasons.
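
The filter above matches the labels with a regular expression. As a sketch of an equivalent approach, assuming the POSTSEASON codes are exact strings such as 'F4', `Series.isin` selects the same rows without regex matching. The `toy` frame and its team names are hypothetical.

```python
import pandas as pd

# Hypothetical rows, invented for illustration only
toy = pd.DataFrame({
    "TEAM": ["A", "B", "C", "D", "E"],
    "POSTSEASON": ["F4", "S16", "R64", None, "E8"],
})

# Regex match, as in the cell above; na=False treats missing values as non-matches
by_regex = toy.loc[toy["POSTSEASON"].str.contains("F4|S16|E8", na=False)]

# Exact-label match; a missing value is never a member of the list
by_isin = toy.loc[toy["POSTSEASON"].isin(["F4", "S16", "E8"])]

print(by_regex["TEAM"].tolist())
print(by_isin["TEAM"].tolist())
```

The regex version would also match any longer label that happens to contain one of the three codes, so `isin` is the stricter of the two.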

Let's plot some columns to understand the data better:

In [None]:
# Note: installing seaborn might take a few minutes
!conda install -c anaconda seaborn -y

In [None]:
import seaborn as sns

bins = np.linspace(df1.BARTHAG.min(), df1.BARTHAG.max(), 10)
g = sns.FacetGrid(df1, col="windex", hue="POSTSEASON", palette="Set1", col_wrap=6)
g.map(plt.hist, 'BARTHAG', bins=bins, ec="k")

g.axes[-1].legend()
plt.show()

In [None]:
bins = np.linspace(df1.ADJOE.min(), df1.ADJOE.max(), 10)
g = sns.FacetGrid(df1, col="windex", hue="POSTSEASON", palette="Set1", col_wrap=2)
g.map(plt.hist, 'ADJOE', bins=bins, ec="k")

g.axes[-1].legend()
plt.show()

## Pre-processing: Feature selection/extraction
Let's look at how Adjusted Defensive Efficiency (ADJDE) is distributed:

In [None]:
bins = np.linspace(df1.ADJDE.min(), df1.ADJDE.max(), 10)
g = sns.FacetGrid(df1, col="windex", hue="POSTSEASON", palette="Set1", col_wrap=2)
g.map(plt.hist, 'ADJDE', bins=bins, ec="k")
g.axes[-1].legend()
plt.show()

In this plot, ADJDE does not appear to separate the teams that reached the Final Four from the rest.

## Convert Categorical features to numerical values
Let's look at the postseason:

In [None]:
df1.groupby(['windex'])['POSTSEASON'].value_counts(normalize=True)

About 13% of teams with a WAB of 7 or less made it into the Final Four, compared with 17% of teams with a WAB above 7.
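
To see where such percentages come from, here is a minimal sketch on an invented six-row frame. With `normalize=True`, `value_counts` returns the share of each POSTSEASON label within its windex group, so the shares in each group sum to 1.

```python
import pandas as pd

# Hypothetical rows, invented for illustration only
toy = pd.DataFrame({
    "windex": ["False", "False", "False", "False", "True", "True"],
    "POSTSEASON": ["S16", "S16", "E8", "F4", "S16", "F4"],
})

shares = toy.groupby("windex")["POSTSEASON"].value_counts(normalize=True)
print(shares)

# Share of the 'False' group that reached the Final Four: 1 of 4 rows
print(shares.loc[("False", "F4")])  # 0.25
```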

Let's convert windex to a number: 0 for a WAB of 7 or less and 1 for a WAB above 7:

In [None]:
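# Assign the result back instead of using inplace=True on a selected column,
# which newer pandas versions may apply to a temporary copy
df1['windex'] = df1['windex'].map({'False': 0, 'True': 1})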
df1.head()

## Feature selection
Let's define feature sets, X:

In [None]:
X = df1[['G', 'W', 'ADJOE', 'ADJDE', 'BARTHAG', 'EFG_O', 'EFG_D',
       'TOR', 'TORD', 'ORB', 'DRB', 'FTR', 'FTRD', '2P_O', '2P_D', '3P_O',
       '3P_D', 'ADJ_T', 'WAB', 'SEED', 'windex']]
X[0:5]
y = df1['POSTSEASON'].values
y[0:5]

## Normalize the data

In [None]:
X= preprocessing.StandardScaler().fit(X).transform(X)
X[0:5]
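
As a quick check of what the scaling step does, the sketch below applies `StandardScaler` to a hypothetical three-row matrix and compares it with the z-score computed by hand. Each column ends up with mean 0 and unit variance, so features measured on different scales contribute comparably to distance-based models such as KNN.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical feature matrix: two columns on very different scales
toy_X = np.array([[30.0, 0.90],
                  [35.0, 0.80],
                  [40.0, 0.70]])

scaled = StandardScaler().fit(toy_X).transform(toy_X)

# StandardScaler divides by the population standard deviation (ddof=0)
manual = (toy_X - toy_X.mean(axis=0)) / toy_X.std(axis=0)

print(scaled)
print(np.allclose(scaled, manual))
```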

## Training and test data

In [None]:
# We split the X into train and test to find the best k
from sklearn.model_selection import train_test_split
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=4)
print ('Train set:', X_train.shape,  y_train.shape)
print ('Validation set:', X_val.shape,  y_val.shape)
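
With only a few dozen rows and unbalanced classes, a random split can leave a class thinly represented in the validation set. One option, not used in the cell above, is the `stratify` argument of `train_test_split`, which keeps class proportions roughly equal in both parts. The sketch uses hypothetical labels in a 4:2:1 ratio.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical labels in a 4:2:1 ratio, invented for illustration only
toy_y = np.array(["S16"] * 20 + ["E8"] * 10 + ["F4"] * 5)
toy_X = np.arange(len(toy_y)).reshape(-1, 1)

X_tr, X_va, y_tr, y_va = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=4, stratify=toy_y
)

# Count how many rows of each class landed in the validation part
labels, counts = np.unique(y_va, return_counts=True)
print(dict(zip(labels, counts)))
```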

## Classification

### KNN

In [None]:
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier
k = 5
#Train Model and Predict  
neigh = KNeighborsClassifier(n_neighbors = k).fit(X_train,y_train)
neigh

yhat = neigh.predict(X_val)
yhat[0:5]

from sklearn import metrics
print("Train set Accuracy: ", metrics.accuracy_score(y_train, neigh.predict(X_train)))
print("Validation set Accuracy: ", metrics.accuracy_score(y_val, yhat))

### Accuracy

In [None]:
Ks = 16
mean_acc = np.zeros((Ks-1))
std_acc = np.zeros((Ks-1))

for n in range(1,Ks):
    
    #Train Model and Predict  
    neigh = KNeighborsClassifier(n_neighbors = n).fit(X_train,y_train)
    yhat=neigh.predict(X_val)
    mean_acc[n-1] = metrics.accuracy_score(y_val, yhat)

    
    std_acc[n-1]=np.std(yhat==y_val)/np.sqrt(yhat.shape[0])

mean_acc
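
Once the accuracy for each k has been collected, the best k can be read off with `argmax`. The sketch below repeats the loop on synthetic stand-in data from `make_classification` (an assumption, not the basketball data) so that it runs on its own.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data; not the basketball dataset
X_toy, y_toy = make_classification(
    n_samples=200, n_features=6, n_informative=4, n_classes=3, random_state=0
)
X_tr, X_va, y_tr, y_va = train_test_split(X_toy, y_toy, test_size=0.2, random_state=4)

Ks = 16
mean_acc = np.zeros(Ks - 1)
for n in range(1, Ks):
    model = KNeighborsClassifier(n_neighbors=n).fit(X_tr, y_tr)
    mean_acc[n - 1] = accuracy_score(y_va, model.predict(X_va))

# argmax returns a 0-based position, so add 1 to recover k
best_k = int(mean_acc.argmax()) + 1
print("Best validation accuracy:", mean_acc.max(), "at k =", best_k)
```

When several values of k tie, `argmax` returns the smallest of them.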

### Decision tree

In [None]:
Tree = DecisionTreeClassifier(criterion="entropy", max_depth = 4)
Tree # it shows the default parameters
Tree.fit(X_train,y_train)
predTree = Tree.predict(X_val)
print(predTree[0:5])
print(y_val[0:5])
print("Decision tree accuracy: ", metrics.accuracy_score(y_val, predTree))
tree.plot_tree(Tree)
plt.show()
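
Because the features were scaled into a plain NumPy array, a plotted tree labels its splits by column position rather than by name. One way to get readable rules is `sklearn.tree.export_text` with explicit feature names. The sketch below uses the iris data bundled with scikit-learn as a stand-in; for the basketball data the names would be the column list used to build `X`.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Stand-in data bundled with scikit-learn; not the basketball dataset
iris = load_iris()
toy_tree = DecisionTreeClassifier(criterion="entropy", max_depth=2, random_state=0)
toy_tree.fit(iris.data, iris.target)

# One line per split or leaf, indented by depth
rules = export_text(toy_tree, feature_names=list(iris.feature_names))
print(rules)
```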

### Support vector machine - SVM

In [None]:
from sklearn import svm
from sklearn.metrics import jaccard_score
from sklearn.metrics import f1_score
#RBF
clf = svm.SVC(kernel='rbf')
clf.fit(X_train, y_train) 
yhat = clf.predict(X_val)
yhat [0:5]

f1_score(y_val, yhat, average='weighted') 
print (classification_report(y_val, yhat))

In [None]:
#poly
clf = svm.SVC(kernel='poly')
clf.fit(X_train, y_train) 
yhat = clf.predict(X_val)
yhat [0:5]

f1_score(y_val, yhat, average='weighted')
print(classification_report(y_val, yhat))

In [None]:
#sigmoid
clf = svm.SVC(kernel='sigmoid')
clf.fit(X_train, y_train) 
yhat = clf.predict(X_val)
yhat [0:5]

f1_score(y_val, yhat, average='weighted')

# In this run, the sigmoid kernel gave the best accuracy on the validation set
print (classification_report(y_val, yhat))

In [None]:
#linear
clf = svm.SVC(kernel='linear')
clf.fit(X_train, y_train) 
yhat = clf.predict(X_val)
yhat [0:5]

f1_score(y_val, yhat, average='weighted') 
print (classification_report(y_val, yhat))
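
The four kernel cells above differ only in the `kernel` argument, so the comparison can also be written as a loop that collects one score per kernel. This is a sketch on synthetic stand-in data (an assumption, not the basketball data); the ranking it prints says nothing about which kernel is best for the real features.

```python
from sklearn import svm
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; not the basketball dataset
X_toy, y_toy = make_classification(
    n_samples=200, n_features=6, n_informative=4, n_classes=3, random_state=0
)
X_tr, X_va, y_tr, y_va = train_test_split(X_toy, y_toy, test_size=0.2, random_state=4)

scores = {}
for kernel in ("rbf", "poly", "sigmoid", "linear"):
    clf = svm.SVC(kernel=kernel).fit(X_tr, y_tr)
    scores[kernel] = f1_score(y_va, clf.predict(X_va), average="weighted")

# Print kernels from highest to lowest weighted F1
for kernel, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{kernel:8s} {score:.3f}")
```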

### Logistic regression

In [None]:
from sklearn.linear_model import LogisticRegression
LR = LogisticRegression(C=0.01, solver='liblinear').fit(X_train,y_train)
LR
yhat = LR.predict(X_val)
yhat
array(['F4', 'S16', 'E8', 'E8', 'E8', 'E8', 'S16', 'F4', 'E8', 'S16',
       'S16', 'S16'], dtype=object)
yhat_prob = LR.predict_proba(X_val)
yhat_prob
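
The columns of `predict_proba` follow the order of the fitted model's `classes_` attribute, and each row sums to 1. The sketch below shows this on synthetic stand-in data with postseason-style string labels. It uses the default solver rather than liblinear, an assumption made only to keep the example short and portable across scikit-learn versions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data; not the basketball dataset
X_toy, y_num = make_classification(
    n_samples=150, n_features=6, n_informative=4, n_classes=3, random_state=0
)
# Map the integer labels 0, 1, 2 to postseason-style strings
y_toy = np.array(["E8", "F4", "S16"])[y_num]

toy_lr = LogisticRegression(C=0.01).fit(X_toy, y_toy)
proba = toy_lr.predict_proba(X_toy[:5])

print(toy_lr.classes_)      # column order of predict_proba
print(proba.sum(axis=1))    # each row sums to 1
# predict() returns the class whose column holds the highest probability
print(toy_lr.classes_[proba.argmax(axis=1)], toy_lr.predict(X_toy[:5]))
```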

### Model evaluation

In [None]:
from sklearn.metrics import f1_score
from sklearn.metrics import log_loss
# f1_score needs an explicit `average` for multiclass targets;
# the test cells below pass average='weighted'

def jaccard_index(predictions, true):
    """Return matches / (len(predictions) + len(true) - matches), or -1 if the lengths differ."""
    if len(predictions) != len(true):
        return -1
    intersect = 0
    for x, y in zip(predictions, true):
        if x == y:
            intersect += 1
    return intersect / (len(predictions) + len(true) - intersect)
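
A small worked example may help in reading this index. With n predictions of which m match, the function returns m / (2n - m). Writing accuracy as a = m / n, that equals a / (2 - a), so the index is a monotone transform of accuracy and is always less than or equal to it. For instance, the KNN test accuracy reported further down, 0.6286 (22 of 35), corresponds to 22 / 48 ≈ 0.4583. The labels in the sketch are hypothetical.

```python
def jaccard_index(predictions, true):
    # Same definition as the cell above, written compactly
    if len(predictions) != len(true):
        return -1
    intersect = sum(1 for x, y in zip(predictions, true) if x == y)
    return intersect / (len(predictions) + len(true) - intersect)

# Hypothetical labels: 3 of the 4 predictions match
pred = ["S16", "E8", "F4", "S16"]
true = ["S16", "E8", "F4", "E8"]

acc = sum(p == t for p, t in zip(pred, true)) / len(true)  # 3 / 4
jac = jaccard_index(pred, true)                            # 3 / (4 + 4 - 3)

print(acc, jac)                            # 0.75 0.6
print(abs(jac - acc / (2 - acc)) < 1e-12)  # True
```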

## Test set evaluation

In [None]:
# on_bad_lines='skip' (pandas 1.3+) replaces error_bad_lines=False, which pandas 2.0 removed
test_df = pd.read_csv('basketball_train.csv', on_bad_lines='skip')
test_df.head()

In [None]:
test_df['windex'] = np.where(test_df.WAB > 7, 'True', 'False')
test_df1 = test_df[test_df['POSTSEASON'].str.contains('F4|S16|E8', na=False)]
# .copy() so the windex edit below acts on an independent frame
test_Feature = test_df1[['G', 'W', 'ADJOE', 'ADJDE', 'BARTHAG', 'EFG_O', 'EFG_D',
       'TOR', 'TORD', 'ORB', 'DRB', 'FTR', 'FTRD', '2P_O', '2P_D', '3P_O',
       '3P_D', 'ADJ_T', 'WAB', 'SEED', 'windex']].copy()
test_Feature['windex'] = test_Feature['windex'].map({'False': 0, 'True': 1})
test_X = test_Feature
# Note: this fits a new scaler on the test features rather than reusing the training one
test_X = preprocessing.StandardScaler().fit(test_X).transform(test_X)
test_X[0:5]
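
The cell above standardizes the test features with their own mean and standard deviation. A common alternative, sketched here on hypothetical one-column matrices, is to fit the scaler once on the training data and reuse it for new data, so both sets are expressed in the same units. This is not what the notebook does, and switching to it would change the reported scores.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical train and test feature matrices, invented for illustration only
train_X = np.array([[10.0], [20.0], [30.0]])
new_X = np.array([[20.0], [40.0]])

scaler = StandardScaler().fit(train_X)  # learn mean and std from training data only
train_scaled = scaler.transform(train_X)
new_scaled = scaler.transform(new_X)    # reuse the training mean and std

print(scaler.mean_, scaler.scale_)
print(new_scaled.ravel())
```

A test value equal to the training mean (20 here) maps to 0, which would not generally hold if the test set were scaled by its own statistics.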

In [None]:
test_y = test_df1['POSTSEASON'].values
test_y[0:5]

### KNN

In [None]:
k = 5
#Train Model and Predict  
neigh = KNeighborsClassifier(n_neighbors = k).fit(X_train,y_train)
yhat = neigh.predict(test_X)

print("Test set Accuracy: ", metrics.accuracy_score(test_y, yhat))
print("Test F1-score: ", f1_score(test_y, yhat, average='weighted'))
print("Test Jaccard index: ", jaccard_index(test_y, yhat))
Test set Accuracy:  0.6285714285714286
Test F1-score:  0.62031212484994
Test Jaccard index:  0.4583333333333333

### Decision tree

In [None]:
Tree = DecisionTreeClassifier(criterion="entropy", max_depth = 2)
Tree # it shows the default parameters
Tree.fit(X_train,y_train)
predTree = Tree.predict(test_X)
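# Score the decision tree's own predictions (predTree), not yhat left over from the KNN cell
print("Test set Accuracy: ", metrics.accuracy_score(test_y, predTree))
print("Test F1-score: ", f1_score(test_y, predTree, average='weighted'))
print("Test Jaccard index: ", jaccard_index(test_y, predTree))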
Test set Accuracy:  0.6
Test F1-score:  0.5353383458646617
Test Jaccard index:  0.42857142857142855

### SVM

In [None]:
clf = svm.SVC(kernel='sigmoid')
clf.fit(X_train, y_train) 
yhat = clf.predict(test_X)
print("Test set Accuracy: ", metrics.accuracy_score(test_y, yhat))
print("Test F1-score: ", f1_score(test_y, yhat, average='weighted'))
print("Test Jaccard index: ", jaccard_index(test_y, yhat))
Test set Accuracy:  0.6
Test F1-score:  0.5353383458646617
Test Jaccard index:  0.42857142857142855

### Logistic Regression

In [None]:
LR = LogisticRegression(C=0.01, solver='liblinear').fit(X_train,y_train)
yhat = LR.predict(test_X)
yhat_prob = LR.predict_proba(test_X)

print("Test set Accuracy: ", metrics.accuracy_score(test_y, yhat))
print("Test F1-score: ", f1_score(test_y, yhat, average='weighted'))
print("Test Jaccard index: ", jaccard_index(test_y, yhat))
print("Test LogLoss: ",log_loss(test_y, yhat_prob))
Test set Accuracy:  0.6857142857142857
Test F1-score:  0.6899251963841629
Test Jaccard index:  0.5217391304347826
Test LogLoss:  1.03718699059278