# Predicting Whether the Home Team Will Win an Indian Premier League Cricket Match


Name : Chandra Kumar Basavaraju

Objective : To predict whether the home team will win an Indian Premier League (IPL) cricket match.

Project Description :
The goal is to predict whether the home team will win an IPL game, based on the past performance of the teams and their players.

Information required about the game for the prediction :
1. Home team 
2. Visiting team
3. Who won the toss?
4. What did the toss winner opt to do (bat first or field first)?

Data Collected :
1. Details of all the games played in the league from its first season in 2008 to date.
2. IPL records of all the players representing the current teams in the league.

Features extracted and used in the model :
1. Did home team win the toss?
2. Did home team bat first?
3. Home team's batting average
4. Home team's bowling average
5. Home team's winning rate
6. Visiting team's batting average
7. Visiting team's bowling average
8. Visiting team's winning rate
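
The two toss-related features follow directly from the four pre-match inputs listed earlier. The sketch below shows one way to derive them. The function name, argument names, and the 'bat'/'field' encoding are hypothetical and are not taken from the project's data file.

```python
def toss_features(home_team, toss_winner, toss_decision):
    """Return (home_won_toss, home_bat_first) as 0/1 flags.

    toss_decision is assumed to be 'bat' or 'field' (hypothetical encoding).
    """
    home_won_toss = toss_winner == home_team
    toss_winner_bats = toss_decision == 'bat'
    # The home team bats first if it won the toss and chose to bat,
    # or if it lost the toss and the visiting team chose to field.
    home_bat_first = home_won_toss == toss_winner_bats
    return int(home_won_toss), int(home_bat_first)


# Visiting team wins the toss and fields, so the home team bats first.
example = toss_features('Chennai Super Kings', 'Mumbai Indians', 'field')
print(example)
```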


Target :
1. Did the home team win the game?


# Performance Metric : Recall/True Positive Rate/Sensitivity
(Of the actual positives, how many were predicted positive?)

Recall seemed appropriate because, for each game, we are predicting whether the home team will win.
Recall therefore tells us, out of all the matches that the home team actually won, how many the model predicted correctly.
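
As a small illustration of the definition, recall is TP / (TP + FN), where TP counts the actual positives predicted as positive and FN counts the actual positives predicted as negative. The sketch below computes it from two label lists. The labels are made up for the example and are not taken from the IPL data.

```python
def recall(y_true, y_pred, positive=1):
    """Fraction of actual positives that were predicted positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if tp + fn == 0:
        return 0.0  # no actual positives, so recall is undefined; report 0
    return tp / (tp + fn)


# Illustrative labels only: 4 actual home wins, 3 of them predicted as wins.
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 0, 1, 1, 0, 1]
demo_recall = recall(y_true, y_pred)
print('recall : %0.3f' % demo_recall)
```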


In [207]:
import numpy as np
import csv
import random
from collections import Counter

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score,precision_score,recall_score,roc_auc_score,f1_score,confusion_matrix
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB


class IPL:
    """Data loader and naive baseline predictors for the IPL data set."""

    @staticmethod
    def load_ipl():
        # The first CSV row holds the instance and feature counts;
        # the second row holds the column names (features, then target).
        with open('DATA_FINAL5.csv') as csv_file:
            data_file = csv.reader(csv_file)
            temp = next(data_file)
            a = int(temp[0])
            b = int(temp[1])
            data = np.empty((a,b),dtype=object)
            target = np.empty((a,),dtype=object)
            feature_names = next(data_file)
            print('Number of Instances : ',a)
            print('Number of features :',b)
            print('Input classes : ',feature_names[:-1])
            print('Output class : ',feature_names[-1])

            # Every remaining row is one match: feature values, then the label.
            for i,ir in enumerate(data_file):
                data[i] = np.asarray(ir[:-1],dtype=object)
                target[i] = ir[-1]

            return data, target,feature_names

    @staticmethod
    def predict_random(X):
        # Baseline: predict 0 or 1 uniformly at random for every match.
        r = len(X)
        Y = np.empty((r,),dtype=int)
        for i in range(r):
            Y[i] = random.randint(0,1)
        return Y

    @staticmethod
    def predict_majority_class(X):
        # Baseline: always predict class 1, the majority class in the target.
        r = len(X)
        Y = np.empty((r,),dtype=int)
        for i in range(r):
            Y[i] = 1
        return Y

X, Y, feature_names = IPL.load_ipl()

Number of Instances :  572
Number of features : 8
Input classes :  ['Did home team win the toss?', 'Did Home team bat first?', "Home Team's Winning Rate", "Away Team's Winning Rate", "Home Team's Batting Average", "Home team's Bowling Average", "Away Team's Batting Average", "Away team's Bowling Average"]
Output class :  Did home team win the match?


In [208]:
from sklearn import preprocessing

le = preprocessing.LabelEncoder()
le.fit(Y)
#print(le.classes_)
target = le.transform(Y)

le1 = preprocessing.LabelEncoder()
le1.fit(X[:,0])
T1 = le1.transform(X[:,0])
T1 = np.transpose(T1)

le2 = preprocessing.LabelEncoder()
le2.fit(X[:,1])
T2 = le2.transform(X[:,1])
T2 = np.transpose(T2)

temp = np.column_stack((T1,T2))

temp = np.column_stack((temp,np.transpose(X[:,2])))
temp = np.column_stack((temp,np.transpose(X[:,3])))
#le3 = preprocessing.LabelEncoder()
#le3.fit(X[:,2])
#V = le3.transform(X[:,2])
#V = np.transpose(V)

#temp = np.column_stack((temp,V))

#le4 = preprocessing.LabelEncoder()
#le4.fit(X[:,2])
#HT = le4.transform(X[:,2])
#HT = np.transpose(HT)

#temp = np.column_stack((temp,HT))

#le5 = preprocessing.LabelEncoder()
#le5.fit(X[:,3])
#ST = le5.transform(X[:,3])
#ST = np.transpose(ST)

#temp = np.column_stack((temp,ST))

#le6 = preprocessing.LabelEncoder()
#le6.fit(X[:,4])
#TW = le6.transform(X[:,4])
#TW = np.transpose(TW)

#temp = np.column_stack((temp,TW))

#le7 = preprocessing.LabelEncoder()
#le7.fit(X[:,5])
#TD = le7.transform(X[:,5])
#TD = np.transpose(TD)

#temp = np.column_stack((temp,TD))

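# Standardize the remaining features (columns 4 onwards) to zero mean and unit variance.
i = 4
nof = len(X[0])
for k in range(i,nof):
    # Values are read from the CSV as strings, so convert them before scaling.
    kthFeature = X[:,k].astype(float)
    kthFeature_scaled = preprocessing.scale(kthFeature)
    a = np.transpose(kthFeature_scaled)
    temp = np.column_stack((temp,a))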

input_preprocessed = temp


distribution_of_target_class = Counter(target)
print('Distribution of target class : ',distribution_of_target_class)

#predicting randomly
Y_pred_rand = IPL.predict_random(input_preprocessed)

print('\nPerformance when prediction is random : \n')

pred_rand_accuracy = accuracy_score(target, Y_pred_rand)
print('pred_rand_accuracy : %0.3f'%pred_rand_accuracy)

pred_rand_precision = precision_score(target, Y_pred_rand)
print('pred_rand_precision : %0.3f'%pred_rand_precision)

pred_rand_recall = recall_score(target, Y_pred_rand)
print('pred_rand_recall : %0.3f'%pred_rand_recall)

pred_rand_roc_auc = roc_auc_score(target,Y_pred_rand)
print('pred_rand_roc_auc : %0.3f'%pred_rand_roc_auc)

pred_rand_f1 = f1_score(target,Y_pred_rand)
print('pred_rand_f1 : %0.3f'%pred_rand_f1)



#predicting the majority class always, i.e. 1 (the most frequent label in the target)
Y_pred_majority = IPL.predict_majority_class(input_preprocessed)

print('\nPerformance when predicting the majority class always : \n')

pred_majority_accuracy = accuracy_score(target,Y_pred_majority)
print('pred_majority_accuracy : %0.3f'%pred_majority_accuracy)

pred_majority_precision = precision_score(target,Y_pred_majority)
print('pred_majority_precision : %0.3f'%pred_majority_precision)

pred_majority_recall = recall_score(target,Y_pred_majority)
print('pred_majority_recall : %0.3f'%pred_majority_recall)

pred_majority_roc_auc = roc_auc_score(target,Y_pred_majority)
print('pred_majority_roc_auc : %0.3f'%pred_majority_roc_auc)

pred_majority_f1 = f1_score(target,Y_pred_majority)
print('pred_majority_f1 : %0.3f'%pred_majority_f1)

Distribution of target class :  Counter({1: 309, 0: 263})

Performance when prediction is random : 

pred_rand_accuracy : 0.512
pred_rand_precision : 0.549
pred_rand_recall : 0.547
pred_rand_roc_auc : 0.509
pred_rand_f1 : 0.548

Performance when predicting the majority class always : 

pred_majority_accuracy : 0.540
pred_majority_precision : 0.540
pred_majority_recall : 1.000
pred_majority_roc_auc : 0.500
pred_majority_f1 : 0.701
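
The majority-class baseline figures can be reproduced by hand from the class counts alone, which is a useful sanity check on the metric code. The sketch below uses only the printed target distribution (309 ones, 263 zeros). Recall reaches 1.000 by construction here, since a predictor that always outputs 1 never misses a positive, so it is worth reading recall alongside precision and accuracy.

```python
n_pos, n_neg = 309, 263  # class counts from the printed target distribution

# Predicting 1 for every match: every positive is a hit,
# every negative is a false positive.
tp, fp, fn, tn = n_pos, n_neg, 0, 0

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print('accuracy : %0.3f' % accuracy)
print('precision : %0.3f' % precision)
print('recall : %0.3f' % recall)
print('f1 : %0.3f' % f1)
```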




In [209]:
A = 'accuracy'
P = 'precision'
R = 'recall'
F1 = 'f1'
ROC = 'roc_auc'
criterions = ['entropy','gini']

#decision tree classifier
print("\nPerformance of decision tree classifier : \n")
for cr in criterions: 
    print("\nCriterion for Decision Tree is : ",cr)
    dt = DecisionTreeClassifier(random_state=None,criterion=cr)
    dt_acc = cross_val_score(dt, input_preprocessed, target, cv=10,scoring=A)
    print('Accuracy of DT with CV : %0.3f'%np.mean(dt_acc))

    dt_pre = cross_val_score(dt, input_preprocessed, target, cv=10,scoring=P)
    print('Precision of DT with CV : %0.3f'%np.mean(dt_pre))

    dt_rec = cross_val_score(dt, input_preprocessed, target, cv=10,scoring=R)
    print('Recall of DT with CV : %0.3f'%np.mean(dt_rec))

    dt_f1 = cross_val_score(dt, input_preprocessed, target, cv=10,scoring=F1)
    print('F1 of DT with CV : %0.3f'%np.mean(dt_f1))

    dt_auc = cross_val_score(dt, input_preprocessed, target, cv=10,scoring=ROC)
    print("AUC of DT with CV : %0.3f"%np.mean(dt_auc))

#Gaussian naive Bayes classifier
print("\nPerformance of Gaussian Naive Bayes classifier : \n")
gnb = GaussianNB()

gnb_acc = cross_val_score(gnb,input_preprocessed,target,scoring=A,cv=10)
print('Accuracy of GNB with CV : %0.3f'%np.mean(gnb_acc))

gnb_pre = cross_val_score(gnb,input_preprocessed,target,scoring=P,cv=10)
print('Precision of GNB with CV : %0.3f'%np.mean(gnb_pre))

gnb_rec = cross_val_score(gnb,input_preprocessed,target,scoring=R,cv=10)
print('Recall of GNB with CV : %0.3f'%np.mean(gnb_rec))

gnb_f1 = cross_val_score(gnb,input_preprocessed,target,scoring=F1,cv=10)
print('F1 of GNB with CV : %0.3f'%np.mean(gnb_f1))

gnb_auc = cross_val_score(gnb, input_preprocessed, target,scoring=ROC,cv=10)
print("AUC of GNB with CV : %0.3f"%np.mean(gnb_auc))




Performance of decision tree classifier : 


Criterion for Decision Tree is :  entropy
Accuracy of DT with CV : 0.561
Precision of DT with CV : 0.600
Recall of DT with CV : 0.531
F1 of DT with CV : 0.560
AUC of DT with CV : 0.554

Criterion for Decision Tree is :  gini
Accuracy of DT with CV : 0.551
Precision of DT with CV : 0.601
Recall of DT with CV : 0.534
F1 of DT with CV : 0.555
AUC of DT with CV : 0.551

Performance of Gaussian Naive Bayes classifier : 

Accuracy of GNB with CV : 0.593
Precision of GNB with CV : 0.603
Recall of GNB with CV : 0.735
F1 of GNB with CV : 0.661
AUC of GNB with CV : 0.596
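
The cells above call `cross_val_score` once per metric, which refits the model on the same folds five times. A possible alternative, sketched here, is scikit-learn's `cross_validate`, which accepts a list of scorers and evaluates them all in one pass. The data below is synthetic (`X_demo` and `y_demo` are hypothetical stand-ins for `input_preprocessed` and `target`), so the printed numbers are not results for the IPL data.

```python
import numpy as np
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in data: 100 rows, 8 features, binary target.
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(100, 8))
y_demo = (X_demo[:, 0] + rng.normal(scale=0.5, size=100) > 0).astype(int)

scoring = ['accuracy', 'precision', 'recall', 'f1', 'roc_auc']

# One call fits the model once per fold and records every metric.
cv_results = cross_validate(GaussianNB(), X_demo, y_demo, cv=10, scoring=scoring)

# Results are stored under 'test_<metric>' keys, one value per fold.
mean_scores = {name: float(np.mean(cv_results['test_' + name])) for name in scoring}
for name, value in mean_scores.items():
    print('%s of GNB with CV : %0.3f' % (name, value))
```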


In [210]:
#Logistic Regression linear model
print("\nPerformance of Logistic Regression model :\n")
Cs = [0.001, 0.01, 0.1, 1, 10, 100]
penalties = ['l1','l2']
for p in penalties:
    print("\nPenalty = ",p)
    for c in Cs:
        print('\nc = ',c)
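        # liblinear supports both the l1 and l2 penalties; the default lbfgs solver in recent scikit-learn releases rejects l1
        lr = LogisticRegression(random_state=0,penalty=p,C=c,solver='liblinear')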
        lr_acc = cross_val_score(lr, input_preprocessed, target, cv=10,scoring=A)
        print('Accuracy of LR with CV : %0.3f'%np.mean(lr_acc))

        lr_pre = cross_val_score(lr, input_preprocessed, target, cv=10,scoring=P)
        print('Precision of LR with CV : %0.3f'%np.mean(lr_pre))

        lr_recall = cross_val_score(lr, input_preprocessed, target, cv=10,scoring=R)
        print('Recall of LR with CV : %0.3f'%np.mean(lr_recall))

        lr_f1 = cross_val_score(lr, input_preprocessed, target, cv=10,scoring=F1)
        print('F1 of LR with CV : %0.3f'%np.mean(lr_f1))

        lr_auc = cross_val_score(lr, input_preprocessed, target, cv=10,scoring=ROC)
        print("AUC of LR with CV : %0.3f"%np.mean(lr_auc))




Performance of Logistic Regression model :


Penalty =  l1

c =  0.001
Accuracy of LR with CV : 0.460
Precision of LR with CV : 0.000
Recall of LR with CV : 0.000
F1 of LR with CV : 0.000
AUC of LR with CV : 0.500

c =  0.01
Accuracy of LR with CV : 0.460
Precision of LR with CV : 0.000
Recall of LR with CV : 0.000
F1 of LR with CV : 0.000
AUC of LR with CV : 0.500

c =  0.1
Accuracy of LR with CV : 0.526
Precision of LR with CV : 0.536
Recall of LR with CV : 0.909
F1 of LR with CV : 0.674
AUC of LR with CV : 0.512

c =  1
Accuracy of LR with CV : 0.605
Precision of LR with CV : 0.609
Recall of LR with CV : 0.754
F1 of LR with CV : 0.674
AUC of LR with CV : 0.614

c =  10
Accuracy of LR with CV : 0.603
Precision of LR with CV : 0.612
Recall of LR with CV : 0.728
F1 of LR with CV : 0.664
AUC of LR with CV : 0.627

c =  100
Accuracy of LR with CV : 0.603
Precision of LR with CV : 0.612
Recall of LR with CV : 0.725
F1 of LR with CV : 0.663
AUC of LR with CV : 0.627

Penalty =  l2

c =  0.001
Accuracy of LR with CV : 0.531
Precision of LR with CV : 0.538
Recall of LR with CV : 0.932
F1 of LR with CV : 0.682
AUC of LR with CV : 0.525

c =  0.01
Accuracy of LR with CV : 0.538
Precision of LR with CV : 0.545
Recall of LR with CV : 0.877
F1 of LR with CV : 0.672
AUC of LR with CV : 0.533

c =  0.1
Accuracy of LR with CV : 0.560


In [211]:
print("Chosen Model : Logistic Regression with parameter penalty = l2 and C=0.001")
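# solver set explicitly: liblinear was the scikit-learn default when these results were produced
lr1 = LogisticRegression(random_state=0,penalty='l2',C=0.001,solver='liblinear')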

lr1_recall = cross_val_score(lr1, input_preprocessed, target, cv=10,scoring=R)
print('Recall of LR with CV : %0.3f'%np.mean(lr1_recall))


lr1.fit(input_preprocessed,target)

print('\nTarget Class : ',feature_names[::-1][:1])

weights = lr1.coef_[0]
sorted_indices = np.argsort(weights)

print('\nTop positive features and their weights : \n')
for i in sorted_indices[::-1][:3]:
    print(feature_names[i],'\t',weights[i])

print('\nTop negative features and their weights : \n')
for i in sorted_indices[:3]:
    print(feature_names[i],'\t',weights[i])

Chosen Model : Logistic Regression with parameter penalty = l2 and C=0.001
Recall of LR with CV : 0.932

Target Class :  ['Did home team win the match?']

Top positive features and their weights : 

Home Team's Winning Rate 	 0.0129331147595
Did home team win the toss? 	 0.0109756188513
Away Team's Winning Rate 	 0.00722168796417

Top negative features and their weights : 

Home team's Bowling Average 	 -0.0158391398166
Away Team's Batting Average 	 -0.00724794223902
Did Home team bat first? 	 -0.00510842086515
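
One way to read these weights: in logistic regression, each weight is the change in the log-odds of the positive class for a one-unit increase in that feature, with the other features held fixed, so `exp(weight)` is the corresponding odds ratio. The sketch below applies this to three of the printed weights (copied from the output above). With `C=0.001` the regularization is strong, so the weights sit close to zero and the odds ratios close to 1.

```python
import math

# Weights copied from the printed output of the fitted model.
weights = {
    "Home Team's Winning Rate": 0.0129331147595,
    "Did home team win the toss?": 0.0109756188513,
    "Home team's Bowling Average": -0.0158391398166,
}

# exp(weight) is the multiplicative change in the odds of the positive class
# per one-unit increase in the feature.
odds_ratios = {name: math.exp(w) for name, w in weights.items()}

for name, ratio in odds_ratios.items():
    print('%s\t%0.4f' % (name, ratio))
```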
