# Logistic Regression

We will fit a logistic regression model to our data.
First, we load the data.

In [1]:
import pandas as pd
# Load the match statistics
data = pd.read_csv("deneme.csv")
# Drop the second player's statistics and the other columns we do not use
data = data.drop(["Player2", "FSP.2", "FSW.2", "SSP.2", "SSW.2", "ACE.2", "DBF.2", "WNR.2", "UFE.2", "BPC.2", "BPW.2", "TPW.2", "FNL1", "FNL2", "NPA.1", "NPW.1", "ST1.1", "ST2.1", "ST3.1", "ST4.1", "ST5.1", "NPA.2", "NPW.2", "ST1.2", "ST2.2", "ST3.2", "ST4.2", "ST5.2"], axis=1)
# Remove the ".1" suffix from the first player's columns
data.rename(columns={"Player1": "Player", "FSP.1": "FSP", "FSW.1": "FSW", "SSP.1": "SSP", "SSW.1": "SSW", "ACE.1": "ACE", "DBF.1": "DBF", "WNR.1": "WNR", "UFE.1": "UFE", "BPC.1": "BPC", "BPW.1": "BPW", "TPW.1": "TPW"}, inplace=True)
data.head(10)

Unnamed: 0,Player,Round,Result,FSP,FSW,SSP,SSW,ACE,DBF,WNR,UFE,BPC,BPW,TPW
0,Lukas Lacko,1,0,61,35,39,18,5,1.0,17,29,1,3,70
1,Leonardo Mayer,1,1,61,31,39,13,13,1.0,13,1,7,14,80
2,Marcos Baghdatis,1,0,52,53,48,20,8,4.0,37,50,1,9,106
3,Dmitry Tursunov,1,1,53,39,47,24,8,6.0,8,6,6,9,104
4,Juan Monaco,1,0,76,63,24,12,0,4.0,16,35,3,12,128
5,Santiago Giraldo,1,0,65,51,35,22,9,3.0,35,41,2,7,108
6,Dudi Sela,1,0,68,73,32,24,5,3.0,41,50,9,17,173
7,Fabio Fognini,1,1,47,18,53,15,3,4.0,21,31,6,20,78
8,David Guez,1,0,64,26,36,12,3,,20,39,3,7,67
9,Nikolay Davydenko,1,1,77,76,23,11,6,4.0,6,4,7,24,162


In [2]:
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt
from patsy import dmatrices
from sklearn.linear_model import LogisticRegression
# train_test_split and cross_val_score live in sklearn.model_selection;
# the old sklearn.cross_validation module was removed in scikit-learn 0.20
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn import metrics
%matplotlib inline



Next, we build the response `y` and the design matrix `x` as DataFrames.

In [23]:
y, x = dmatrices('Result ~ FSP + FSW + SSP + SSW +ACE + DBF + WNR + UFE + BPC + BPW',
                  data, return_type="dataframe")

scikit-learn expects `y` as a 1D array, so we flatten it.

In [24]:
y = np.ravel(y)


Now we fit the model to our data.

In [25]:
model = LogisticRegression()
model = model.fit(x, y)
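
As a rough illustration of what the fitted model computes: logistic regression passes a weighted sum of the features through the sigmoid function to get a probability for class 1. The sketch below uses made-up weights; `weights`, `bias`, and `predict_probability` are hypothetical names, not values or methods taken from our fitted model.

```python
import math

def sigmoid(z):
    """Map any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_probability(features, weights, bias):
    """Probability of class 1 for one row of features."""
    z = bias + sum(w * f for w, f in zip(weights, features))
    return sigmoid(z)

# Hypothetical weights for two features, for illustration only
weights = [0.5, -0.25]
bias = 0.0
p = predict_probability([2.0, 4.0], weights, bias)
print(p)  # 0.5, because the weighted sum is 0.5*2 - 0.25*4 = 0
```

A weighted sum above zero gives a probability above 0.5, which the classifier reports as class 1.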

The accuracy on the training data is fairly high (about 87%), but note that it is measured on the same data the model was fitted on.

In [26]:
model.score(x,y)

0.872
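
For a classifier, `score` reports the mean accuracy on the data passed to it. The sketch below shows the same calculation by hand on toy labels (the `accuracy` helper and the labels are made up for illustration, not taken from our data).

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# Toy labels, not taken from our data
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
print(accuracy(y_true, y_pred))  # 0.75 (6 of 8 correct)
```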

Let's evaluate the model on a held-out test set.
First we split the data into training and test sets (70% / 30%), and then fit a new model on the training set only.

In [27]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=0)
model2 = LogisticRegression()
model2.fit(x_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
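
To make the idea of a hold-out split concrete, here is a simplified sketch in plain Python. `holdout_split` is a hypothetical helper written for this illustration; it will not reproduce the exact shuffling of scikit-learn's `train_test_split`, only the principle of shuffling with a fixed seed and setting a fraction aside.

```python
import random

def holdout_split(rows, labels, test_size=0.3, seed=0):
    """Shuffle the row indices and hold out a fraction for testing."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)  # fixed seed makes the split repeatable
    n_test = int(round(len(rows) * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    train = ([rows[i] for i in train_idx], [labels[i] for i in train_idx])
    test = ([rows[i] for i in test_idx], [labels[i] for i in test_idx])
    return train, test

# Toy data, not taken from our data set
rows = [[i, i * 2] for i in range(10)]
labels = [i % 2 for i in range(10)]
(train_x, train_y), (test_x, test_y) = holdout_split(rows, labels)
print(len(train_x), len(test_x))  # 7 3
```

Fixing the seed plays the same role as `random_state=0` in the cell above: rerunning the notebook gives the same split.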

Next, we predict class labels and class probabilities for the test set.

In [28]:
predicted = model2.predict(x_test)
probs = model2.predict_proba(x_test)

Let's generate evaluation metrics.

In [29]:
print (metrics.accuracy_score(y_test, predicted))
print (metrics.roc_auc_score(y_test, probs[:, 1]))

0.8
0.913229018492


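The test accuracy is 0.80 and the ROC AUC is about 0.91, so the model still performs well on unseen data, although the accuracy is lower than the 0.872 measured on the training data.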
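
The ROC AUC can be read as the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one. A minimal sketch of that definition, using toy labels and scores (the `roc_auc` helper is hypothetical and written for illustration; it is not how scikit-learn computes the value internally):

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    A tie between a positive and a negative score counts as half.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Toy labels and scores, not taken from our data
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(y_true, scores))  # 0.75 (3 of 4 pairs ranked correctly)
```

Unlike accuracy, this measure depends only on how the probabilities rank the examples, not on the 0.5 decision threshold.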

Let's also look at the confusion matrix and a classification report with other metrics.

In [30]:
print (metrics.confusion_matrix(y_test, predicted))
print (metrics.classification_report(y_test, predicted))

[[30  7]
 [ 8 30]]
             precision    recall  f1-score   support

        0.0       0.79      0.81      0.80        37
        1.0       0.81      0.79      0.80        38

avg / total       0.80      0.80      0.80        75
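
The numbers in the report follow directly from the confusion matrix, whose rows are the true classes and whose columns are the predicted classes. The sketch below recomputes them by hand (`class_report` is a hypothetical helper written for this check):

```python
# Confusion matrix from the output above: rows = true class, columns = predicted class
cm = [[30, 7],
      [8, 30]]

def class_report(cm, k):
    """Precision, recall and F1 for class k of a square confusion matrix."""
    true_pos = cm[k][k]
    predicted_k = sum(row[k] for row in cm)  # column sum: everything predicted as k
    actual_k = sum(cm[k])                    # row sum: everything that really is k
    precision = true_pos / predicted_k
    recall = true_pos / actual_k
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

for k in (0, 1):
    precision, recall, f1 = class_report(cm, k)
    print(k, round(precision, 2), round(recall, 2), round(f1, 2))

total = sum(sum(row) for row in cm)
accuracy = (cm[0][0] + cm[1][1]) / total
print(accuracy)  # 0.8 (60 of 75 test matches classified correctly)
```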



Let's make some predictions :)

Let's try to predict the result of the match between Andy Murray and Benjamin Becker (Wimbledon 2013, Round 1).

# There may be a problem here

In [40]:
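# My first attempt was this:
# model.predict_proba(np.array([59, 29, 41, 14, 5, 1, 26, 18, 5, 1]))
# but it did not work: x has 11 columns, because dmatrices adds an Intercept column first.

# The leading 0 fills the Intercept column (note that this column is 1 in the training data).
# The input is wrapped in an extra list because predict_proba expects a 2D array.
model.predict_proba(np.array([[0, 59, 29, 41, 14, 5, 1, 26, 18, 5, 1]]))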



array([[ 0.0671674,  0.9328326]])

So the estimated probability of Benjamin Becker winning the match is about 93%.
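
The extra leading value is needed because `dmatrices` adds an `Intercept` column to `x` by default, so the model was fitted on 11 columns rather than 10. A small helper can make the column order explicit. This is only a sketch: `COLUMNS`, `make_row`, and the default `intercept=1.0` are assumptions made for illustration (the Intercept column is 1 for every training row, whereas the prediction cell above passes 0).

```python
# Column order of the design matrix built from the formula above
# (the Intercept column comes first)
COLUMNS = ["Intercept", "FSP", "FSW", "SSP", "SSW", "ACE",
           "DBF", "WNR", "UFE", "BPC", "BPW"]

def make_row(stats, intercept=1.0):
    """Return a 2D, single-row input in the same column order as x."""
    missing = [c for c in COLUMNS[1:] if c not in stats]
    if missing:
        raise KeyError("missing statistics: " + ", ".join(missing))
    return [[intercept] + [float(stats[c]) for c in COLUMNS[1:]]]

# Values taken from the prediction cell above
becker = {"FSP": 59, "FSW": 29, "SSP": 41, "SSW": 14, "ACE": 5,
          "DBF": 1, "WNR": 26, "UFE": 18, "BPC": 5, "BPW": 1}
row = make_row(becker)
print(len(row), len(row[0]))  # 1 11
```

Passing `row` to `model.predict_proba` would then match the shape the model expects; the resulting probability may differ from the output above, since the intercept entry is 1 instead of 0.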

In [41]:
model.predict_proba(np.array([[0, 57, 39, 43, 20, 11, 2, 38, 16, 10, 5]]))



array([[  1.78375527e-04,   9.99821624e-01]])

The estimated probability of Andy Murray winning the match is about 99.98%.

Since Murray's probability is higher, we predict that Andy Murray won the match, and this is correct :)
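
The decision rule used here is simply "pick the player with the higher estimated probability". Each probability comes from a separate call to the model, so the two values do not have to sum to 1. A minimal sketch of the rule (`predict_winner` is a hypothetical helper):

```python
def predict_winner(prob_by_player):
    """Return the player with the highest estimated win probability."""
    return max(prob_by_player, key=prob_by_player.get)

# Probabilities of class 1 copied from the two outputs above
probs = {"Benjamin Becker": 0.9328326, "Andy Murray": 0.999821624}
print(predict_winner(probs))  # Andy Murray
```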

In [43]:
model.predict_proba(np.array([[0, 65, 40, 35, 15, 4, 4, 31, 40, 13, 4]]))



array([[  8.90994609e-05,   9.99910901e-01]])

The estimated probability of Novak Djokovic winning the match is 99.9910901%.

In [44]:
model.predict_proba(np.array([[0, 64, 48, 36, 16, 9, 2, 36, 21, 17, 7]]))



array([[  3.57635351e-07,   9.99999642e-01]])

The estimated probability of Andy Murray winning the match is 99.999964%.

Since Andy Murray's probability is higher, we predict that Murray won the match, and this is correct :)