<img src="IPL_Logo.png" width="240" height="360" />

## Exploratory Data Analysis of Indian Premier League Matches and Predicting Match Scores

## Table of Contents

1. [Problem Statement](#section1)<br>
2. [Data Loading and Description](#section2)
3. [Data Profiling](#section3)
    - 3.1 [Understanding the Dataset](#section301)<br/>
    - 3.2 [Pre Profiling](#section302)<br/>
    - 3.3 [Preprocessing](#section303)<br/>
    - 3.4 [Post Profiling](#section304)<br/>
4. [Questions](#section4)
    - 4.1 [Which features play a key role in deciding the final score?](#section401)<br/>
    - 4.2 [What type of correlation exists between the features used to predict the match score?](#section402)<br/>
5. [Preparing X (independent features) for the model building.](#section5)
    - 5.1 [Check for the type and shape of X.](#section501)<br/>
6. [Extract y (dependent variable) for model building.](#section6)
    - 6.1 [Check for the type and shape of y.](#section601)<br/>
7. [Split the value of X and y into train and test datasets](#section7)
8. [Feature Scaling](#section8)
9. [Check the shape of X and y of train dataset.](#section9)
10. [Check the shape of X and y of test dataset.](#section10)
11. [Instantiate Linear regression model using scikit-learn.](#section11)
12. [Fit the linear model on X_train and y_train.](#section12)
13. [Interpret the Model Coefficients.](#section13)
14. [Zip the features to pair the feature names with the coefficients.](#section14)
15. [Predict the train value using the built model.](#section15)
16. [Predict the test value using the built model.](#section16)
17. [Evaluate the model](#section17)
    - 17.1 [Using Mean Absolute Error metrics for both train and test.](#section1701)<br/>
    - 17.2 [Using Mean Squared Error for both train and test.](#section1702)<br/>
    - 17.3 [Using Root Mean Squared Error for both train and test.](#section1703)<br/>
    - 17.4 [Using R-square value for both train and test.](#section1704)<br/>    
18. [Conclusions](#section18)<br/> 

<a id=section1></a>

### 1. Problem Statement

In ODI and T-20 cricket, many factors play a key role in deciding what the final score will be.  Let’s look at some of the key factors:

- Number of wickets left
- Number of balls left
- Current scores of the batsmen at the crease
- Runs scored by the team in the last 5 overs
- Wickets lost by the team in the last 5 overs
- The nature of the pitch
- Strength of the batting and bowling teams

Let's use some of these factors __to predict the match score with machine learning algorithms trained on past data__, supported by visualizations of that data. 
We use regression analysis to predict the final score of an ODI or T-20 match.
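
Two of the factors listed above, balls left and wickets left, are not stored directly in the data but can be derived from an overs count and a wickets count. The sketch below assumes the overs value uses the usual over.ball notation (so 10.3 means 10 overs and 3 balls) and a 20-over innings; the helper names are hypothetical.

```python
BALLS_PER_OVER = 6
INNINGS_OVERS = 20      # assumption: a T20 / IPL innings
MAX_WICKETS = 10


def balls_bowled(overs: float) -> int:
    """Convert over.ball notation (e.g. 10.3) into a ball count."""
    completed = int(overs)
    # The first decimal digit is the ball number within the current over
    extra = round((overs - completed) * 10)
    return completed * BALLS_PER_OVER + extra


def balls_left(overs: float) -> int:
    """Balls remaining in a 20-over innings."""
    return INNINGS_OVERS * BALLS_PER_OVER - balls_bowled(overs)


def wickets_left(wickets: int) -> int:
    """Wickets the batting side still has in hand."""
    return MAX_WICKETS - wickets


print(balls_bowled(10.3), balls_left(10.3), wickets_left(2))
```

Derived columns like these are an alternative encoding of the same information, so they would replace the raw counts in a model rather than sit alongside them.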

<a id=section2></a>

### 2. Data Loading and Description

I have used the `ipl.csv` data file here to predict scores in IPL cricket matches. 
The source dataset contains ball-by-ball coverage of:

- 1188 ODI matches: data/odi.csv
- 1474 T-20 matches: data/t20.csv
- 617 IPL matches: data/ipl.csv

The dataset contains detailed ball-by-ball information about each match, such as venue, bat_team, bowl_team, runs, wickets, overs, etc.
- The dataset comprises __76014 observations of 15 columns__. Below is a table listing the columns and their descriptions.

| Column Name        | Description                                             |
| ------------------ |:-------------------------------------------------------------:|
| mid                | Identity of match                                             | 
| date               | When the match happened                                       |  
| venue              | Stadium where match is being played                           | 
| bat_team           | Batting team name                                             |   
| bowl_team          | Bowling team name                                             |
| batsman            | Batsman name who faced that ball                              |
| bowler             | Bowler who bowled that ball                                   |
| runs               | Total runs scored by team at that instance                    | 
| wickets            | Total wickets fallen at that instance                         |
| overs              | Total overs bowled at that instance                           |
| runs_last_5        | Total runs scored in last 5 overs                             |
| wickets_last_5     | Total wickets that fell in last 5 overs                       |
| striker            | max(runs scored by striker, runs scored by non-striker)       |
| non-striker        | min(runs scored by striker, runs scored by non-striker)       |
| total              | Total runs scored by batting team after first innings         |

#### Source :
https://cricsheet.org/downloads/


#### Importing packages                                          

In [None]:
import numpy as np                                                 # Implements multi-dimensional arrays and matrices
import pandas as pd                                                # For data manipulation and analysis
import pandas_profiling                                            # Generates a profiling report from a DataFrame
import matplotlib.pyplot as plt                                    # Plotting library for Python and its numerical mathematics extension NumPy
import seaborn as sns                                              # Provides a high-level interface for drawing attractive and informative statistical graphics

%matplotlib inline
sns.set()

from subprocess import check_output



#### Importing the Dataset

In [None]:
matches_data = pd.read_csv("data/ipl.csv")     # Importing training dataset using pd.read_csv

<a id=section3></a>

### 3. Data Profiling

- In the upcoming sections we will first __understand our dataset__ using various pandas functionalities.
- Then with the help of __pandas profiling__ we will find which columns of our dataset need preprocessing.
- In __preprocessing__ we will deal with erroneous and missing values in the columns. 
- Then we will run __pandas profiling__ again to see how preprocessing has transformed our dataset.

<a id=section301></a>

### 3.1 Understanding the Dataset

To gain insights from the data we must look at each aspect of it carefully. We will start by observing a few rows and columns from both the start and the end of the data.

Let us check the basic information of the dataset. The very basic information to know is the dimension of the dataset – rows and columns – that’s what we find out with the method __shape__.

In [None]:
matches_data.shape

matches_data has __76014 rows and 15 columns.__

In [None]:
matches_data.columns

In [None]:
matches_data.head()

In [None]:
matches_data.tail()

In [None]:
matches_data.info()

In [None]:
matches_data.describe(include='all')

In [None]:
matches_data.isnull().sum()

In [None]:
matches_data.count()

<a id=section302></a>

### 3.2 Pre Profiling

- Pandas profiling generates an __interactive HTML report__ which contains information about the columns of the dataset, such as the __counts and type__ of each _column_, detailed information about each column, the __correlation between different columns__ and a sample of the dataset.<br/>
- It gives us a __visual interpretation__ of each column in the data.
- The _spread of the data_ can be better understood from the distribution plot. 
- It supports _granular level_ analysis of each column.

In [None]:
profile = pandas_profiling.ProfileReport(matches_data)
profile.to_file(outputfile="matches_data_before_preprocessing.html")

Here, we have done Pandas Profiling before preprocessing our dataset, so we have named the html file as __matches_data_before_preprocessing.html__. Take a look at the file and see what useful insight you can develop from it. <br/>
Now we will process our data to better understand it.

<a id=section303></a>

### 3.3 Preprocessing

- Dealing with missing and incorrect values<br/>
    - Replace incorrect and duplicate spellings of the Pune team name in the bat_team and bowl_team columns.
    - Values in the overs column should be below 20, i.e. at most 19.6.
    - Since the dataset contains ball-by-ball match details for the second innings, check that the chasing score in the runs column is consistent with the total runs.
    - Values in the wickets and wickets_last_5 columns should be less than or equal to 10.
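
The range checks listed above can be written as simple rules. The following is a minimal sketch in plain Python on hand-made rows; the sample values and the `find_problems` helper are hypothetical. The last rule, that the running score never exceeds the total, is only one reading of the third check and assumes `total` is the final score of the same innings. On the real data the same conditions could be expressed as boolean masks over `matches_data`.

```python
def find_problems(row):
    """Return a list of rule violations for one ball-by-ball record."""
    problems = []
    if not 0 <= row["overs"] < 20:
        problems.append("overs out of range")
    if not 0 <= row["wickets"] <= 10:
        problems.append("wickets out of range")
    if not 0 <= row["wickets_last_5"] <= 10:
        problems.append("wickets_last_5 out of range")
    # Assumption: total is the final score of the innings the row belongs to
    if row["runs"] > row["total"]:
        problems.append("runs exceed total")
    return problems


# Made-up records: the first is valid, the second breaks three rules
rows = [
    {"overs": 19.6, "wickets": 7, "wickets_last_5": 2, "runs": 160, "total": 160},
    {"overs": 20.1, "wickets": 11, "wickets_last_5": 1, "runs": 180, "total": 170},
]
for r in rows:
    print(find_problems(r))
```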
    

In [None]:
incorrect_names = ['rising pune supergiant', 'pune warriors']
pune_team_name = 'Rising Pune Supergiants'

# Normalise the Pune team name in both team columns.
# A vectorised mask avoids looping over every row with iterrows()
# and avoids chained in-place replacement on a column.
for team_col in ['bat_team', 'bowl_team']:
    mask = matches_data[team_col].str.lower().isin(incorrect_names)
    matches_data.loc[mask, team_col] = pune_team_name
        

In [None]:
matches_data.overs.sort_values(ascending=False).unique()

In [None]:
matches_data.total.sort_values(ascending=False).unique()

In [None]:
matches_data.runs.sort_values(ascending=False).unique()

In [None]:
matches_data.wickets_last_5.sort_values(ascending=False).unique()

In [None]:
matches_data.wickets.sort_values(ascending=False).unique()

In [None]:
matches_data.count()

<a id=section304></a>

### 3.4 Post Profiling

In [None]:
#import pandas_profiling
profile = pandas_profiling.ProfileReport(matches_data)
profile.to_file(outputfile="matches_data_after_preprocessing.html")

Now that we have preprocessed the data, the dataset does not contain missing values. You can compare the two reports, i.e. __matches_data_after_preprocessing.html__ and __matches_data_before_preprocessing.html__.<br/>
Observations from the __matches_data_after_preprocessing.html__ report:
- In the Dataset info, Total __Missing(%)__ = __0.0%__ 
- Number of __variables__ = __13__ 

__Utils functions__

In [None]:
class ChartType:
    bar_chart = 1
    bar_chart_horizontal = 2
    line_chart = 3
    histogram_chart = 4
    stack_chart = 5
    scatter_chart = 6
    area_chart = 7
    pie_chart = 8


In [None]:
def showChart(data, chart_type, xlabel, ylabel, title=None, figsize=None, axis=None):
    '''
    data : data frame or series to plot.
    chart_type : One of the ChartType constants.
    xlabel : The label text for x axis.
    ylabel : The label text for y axis.
    title : The label text for title of chart.
    figsize : tuple of integers, optional, default: None
    axis : The axis limits to be set. Either none or all of the limits must
    be given.
    '''
    # Set figure size of chart
    if figsize is not None:
        plt.figure(figsize=figsize)

    # Set x & y axis limit
    if axis is not None:
        plt.axis(axis)

    # Draw the requested chart type
    if ChartType.bar_chart == chart_type:
        data.plot.bar()
    elif ChartType.bar_chart_horizontal == chart_type:
        data.plot.barh()
    elif ChartType.stack_chart == chart_type:
        data.plot.bar(stacked=True)
    elif ChartType.line_chart == chart_type:
        data.plot.line()
    elif ChartType.histogram_chart == chart_type:
        data.plot.hist()
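    elif ChartType.scatter_chart == chart_type:
        # Scatter plots need explicit x and y columns.
        # Assumption: use the first two columns of the data frame.
        data.plot.scatter(x=data.columns[0], y=data.columns[1])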
    elif ChartType.area_chart == chart_type:
        data.plot.area()
    elif ChartType.pie_chart == chart_type:
        plt.pie(data.values,
                       labels=data.index,
                       autopct='%1.2f', startangle=90)
        
#         explode = (0.2, 0, 0, 0, 0, 0)
#         plt.explode = explode
#         plt.autopct='%1.1f%%'
        plt.legend(data.index, loc="best")
        plt.axis('equal')
#         plt.pctdistance=1.1
#         plt.labeldistance=1.2
#         data.plot.pie()
        
    # Set title of chart, y & x axis
    if title is not None:
        plt.title(title, fontsize=20)
        
    if xlabel is not None:
        plt.xlabel(xlabel, fontsize=10)

    if ylabel is not None:
        plt.ylabel(ylabel, fontsize=10)

    # Custom ticks for x axis
    plt.tick_params(axis='x', colors='black', direction='out', length=5, width=1, labelsize='large')
    
    # Custom ticks for y axis
    plt.tick_params(axis='y', colors='black', direction='in', length=5, width=1, labelsize='large')
    
    # Show chart
    plt.show()
    

<a id=section4></a>

### 4. Questions

<a id=section401></a>

__4.1 Which features play a key role in deciding the final score?__

In [None]:
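# Correlation is only defined for numeric columns, so select them first
corr = matches_data.select_dtypes(include='number').corr()
plt.figure(figsize=(10,10))
sns.heatmap(corr, vmax=.8, linewidths=.01, square=True, annot=True, cmap='YlGnBu', linecolor='black')
plt.title('Correlation between features')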

Let's do the same analysis using the seaborn library on the unfiltered dataset.

<a id=section402></a>

__4.2 What type of correlation exists between the features used to predict the match score?__

In [None]:
sns.pairplot(matches_data, x_vars=["overs", "wickets", "striker", "non-striker"], y_vars=["runs"], height=5, aspect=.8);

From the graph above, we can see a linear relationship between the overs and runs features.
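
A visual impression of linearity can be backed by a number. One common choice is the Pearson correlation coefficient, which is also what a correlation heatmap typically displays. The sketch below computes it by hand on made-up overs and runs values; the data and the `pearson_r` helper are illustrative only, not taken from the dataset.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (proportional to the covariance)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (proportional to the variances)
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up innings: score after each of the first five overs
overs = [1, 2, 3, 4, 5]
runs = [7, 15, 22, 31, 38]
print(round(pearson_r(overs, runs), 3))
```

A value close to 1 indicates a strong positive linear relationship, a value close to 0 indicates none.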

<a id=section5></a>

### 5. Preparing X (independent features) for the model building. 

During preprocessing, we found that the other features made little difference to the results, except for the variables below, which are correlated with the match score and will help predict it:
- runs
- wickets
- overs
- striker
- non-striker
    

In [None]:
# Alternate way to fetch data from data set
#X = matches_data.iloc[:,[7,8,9,12,13]].values

# Define the columns to use as features
feature_cols = ['runs','wickets','overs','striker','non-striker']                
def extract_X_Independent(matches_data):
    X = matches_data[feature_cols]
    return X

# Prepare X data set
X = extract_X_Independent(matches_data)

print(X.head())


<a id=section501></a>

### 5.1 Check for the type and shape of X.

In [None]:
def lr():
    print(type(X))
    print(X.shape)
lr()

<a id=section6></a>

### 6. Extract y (dependent variable) for model building.

In [None]:
# Alternate way to fetch data from data set
# y = matches_data.iloc[:, 14].values

# Define the column to predict
def extract_y_Dependent(data_set):
    y = data_set['total']
    return y

# Prepare y data set
y = extract_y_Dependent(matches_data)
print(y.head())

<a id=section601></a>

### 6.1 Check for the type and shape of y.

In [None]:
def lr():
    print(type(y))
    print(y.shape)
lr()

<a id=section7></a>

### 7. Split the value of X and y into train and test datasets

In [None]:
from sklearn.model_selection import train_test_split
def split_X_y_Into_Train_Test():
    return train_test_split(X, y, test_size=0.30, random_state=1)
X_train, X_test, y_train, y_test = split_X_y_Into_Train_Test()
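
To know what shapes to expect in sections 9 and 10, the split sizes can be worked out by hand. The sketch below assumes that, for a fractional `test_size` with no `train_size` given, scikit-learn rounds the test count up and assigns the remaining rows to the training set; the `split_sizes` helper is hypothetical.

```python
from math import ceil


def split_sizes(n_samples: int, test_size: float):
    """Row counts for a fractional test_size (test count rounded up)."""
    n_test = ceil(test_size * n_samples)
    n_train = n_samples - n_test
    return n_train, n_test


# 76014 is the row count reported for matches_data
n_train, n_test = split_sizes(76014, 0.30)
print(n_train, n_test)
```

Since `random_state=1` is fixed, the same rows land in each part on every run, which keeps the later metrics reproducible.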

<a id=section8></a>

### 8. Feature Scaling

In [None]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
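
Standardisation rescales each feature to zero mean and unit variance using z = (x - mean) / std, where the statistics come from the training data only and are then reused for the test data. A minimal sketch of that arithmetic in plain Python, with made-up values for a single feature:

```python
from statistics import fmean, pstdev

train_overs = [2.0, 6.0, 10.0, 14.0, 18.0]   # made-up training values
test_overs = [4.0, 16.0]                      # made-up test values

# "fit": learn the mean and population standard deviation from the training data only
mu = fmean(train_overs)
sigma = pstdev(train_overs)


def standardize(values):
    """Apply z = (x - mu) / sigma with the training statistics."""
    return [(v - mu) / sigma for v in values]


train_scaled = standardize(train_overs)
test_scaled = standardize(test_overs)   # "transform": reuse the training statistics
print(train_scaled)
print(test_scaled)
```

Fitting on the training data alone matters because statistics computed on the full dataset would leak information about the test rows into the model.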

<a id=section9></a>

### 9. Check the shape of X and y of train dataset.

In [None]:
def lr():
    print(X_train.shape)
    print(y_train.shape)
lr()

<a id=section10></a>

### 10. Check the shape of X and y of test dataset.

In [None]:
def lr():
    print(X_test.shape)
    print(y_test.shape)
lr()

<a id=section11></a>

### 11. Instantiate Linear regression model using scikit-learn

In [None]:
# Using LinearRegression
from sklearn.linear_model import LinearRegression
def lr():
    linreg = LinearRegression()
    return linreg
linreg = lr()

In [None]:
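# Using RandomForestRegressor (alternative to the LinearRegression cell above; run only one of the two)
# Note: a random forest has no intercept_ or coef_, so sections 13 and 14 apply to the linear model only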
from sklearn.ensemble import RandomForestRegressor
def lr():
    linreg = RandomForestRegressor(n_estimators=100,max_features=None)
    return linreg
linreg = lr()

<a id=section12></a>

### 12. Fit the linear model on X_train and y_train.

In [None]:
def lr():
    linreg.fit(X_train, y_train)  
lr()

<a id=section13></a>

### 13. Interpret the Model Coefficients.

In [None]:
def lr():
    print('Intercept:',linreg.intercept_)                                            
    print('Coefficients:',linreg.coef_)
lr()
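
Because the features were standardised in section 8, each coefficient of the linear model is the change in the predicted total for a one standard deviation increase in that feature, not for one extra run or over. The sketch below shows how such coefficients could be converted back to original units; all numbers are invented and the helper names are hypothetical.

```python
def to_original_units(intercept, coefs, means, stds):
    """Rewrite y = b0 + sum(b_j * (x_j - mean_j) / std_j) in terms of raw x_j."""
    raw_coefs = [b / s for b, s in zip(coefs, stds)]
    raw_intercept = intercept - sum(b * m / s for b, m, s in zip(coefs, means, stds))
    return raw_intercept, raw_coefs


def predict_scaled(x, intercept, coefs, means, stds):
    """Prediction from standardised inputs."""
    return intercept + sum(b * (v - m) / s for b, v, m, s in zip(coefs, x, means, stds))


def predict_raw(x, raw_intercept, raw_coefs):
    """Prediction from raw inputs using the converted coefficients."""
    return raw_intercept + sum(b * v for b, v in zip(raw_coefs, x))


# Invented values for two features (runs, overs)
intercept, coefs = 160.0, [20.0, -12.0]
means, stds = [75.0, 9.8], [48.0, 5.7]

raw_intercept, raw_coefs = to_original_units(intercept, coefs, means, stds)
x = [70.0, 10.0]
print(predict_scaled(x, intercept, coefs, means, stds))
print(predict_raw(x, raw_intercept, raw_coefs))
```

Both calls give the same prediction; only the interpretation of the coefficients changes.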

<a id=section14></a>

### 14. Zip the features to pair the feature names with the coefficients.

In [None]:
def lr():
    # Build the names in a new list so feature_cols is not modified when the cell is re-run
    names = ['Intercept'] + feature_cols
    coef = [linreg.intercept_] + linreg.coef_.tolist()
    for c1, c2 in zip(names, coef):
        print(c1, c2)
lr()

<a id=section15></a>

### 15. Predict the train value using the built model.

In [None]:
y_pred_train = linreg.predict(X_train)
pred= pd.DataFrame(y_pred_train)
def lr():  
    pred.columns = ['total']
    head = pred.head()
    return head
lr()

<a id=section16></a>

### 16. Predict the test value using the built model.

In [None]:
y_pred_test = linreg.predict(X_test)    
pred_test= pd.DataFrame(y_pred_test)
def lr():
    pred_test.columns=['total']
    head = pred_test.head()
    return head
lr()

<a id=section17></a>

### 17. Evaluate the model 

<a id=section1701></a>

#### 17.1 Using Mean Absolute Error metrics for both train and test.

In [None]:
from sklearn import metrics
def lr():
    MAE_train = metrics.mean_absolute_error(y_train, y_pred_train)
    MAE_test = metrics.mean_absolute_error(y_test, y_pred_test)
    print('MAE for training set is {}'.format(MAE_train))
    print('MAE for test set is {}'.format(MAE_test))
lr()

<a id=section1702></a>

#### 17.2 Evaluate the model using Mean Squared Error for both train and test.

In [None]:
def lr():
    MSE_train = metrics.mean_squared_error(y_train, y_pred_train)
    MSE_test = metrics.mean_squared_error(y_test, y_pred_test)
    print('MSE for training set is {}'.format(MSE_train))
    print('MSE for test set is {}'.format(MSE_test))
lr()

<a id=section1703></a>

#### 17.3 Evaluate the model using Root Mean Squared Error for both train and test.

In [None]:
import numpy as np
def lr():
    RMSE_train = np.sqrt( metrics.mean_squared_error(y_train, y_pred_train))
    RMSE_test = np.sqrt(metrics.mean_squared_error(y_test, y_pred_test))
    print('RMSE for training set is {}'.format(RMSE_train))
    print('RMSE for test set is {}'.format(RMSE_test))
lr()

<a id=section1704></a>

#### 17.4 Evaluate the model using R-square value for both train and test.

In [None]:
from sklearn.metrics import r2_score
def lr():
    R2_train = r2_score(y_train, y_pred_train) 
    R2_test = r2_score(y_test, y_pred_test) 
    print('R2 for training set is {}'.format(R2_train))
    print('R2 for test set is {}'.format(R2_test))
lr()
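
To make the four metrics above concrete, the sketch below computes them by hand from their standard definitions on a handful of made-up scores. The values and the `regression_metrics` helper are illustrative only.

```python
from math import sqrt


def regression_metrics(y_true, y_pred):
    """MAE, MSE, RMSE and R2 computed from their definitions."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n          # mean absolute error
    mse = sum(e * e for e in errors) / n           # mean squared error
    rmse = sqrt(mse)                               # same unit as the target (runs)
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)            # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1 - ss_res / ss_tot                       # share of variance explained
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "R2": r2}


# Made-up final scores and predictions
y_true = [150, 160, 170, 180]
y_pred = [155, 158, 172, 170]
print(regression_metrics(y_true, y_pred))
```

MAE and RMSE are both expressed in runs, which makes them easier to interpret than MSE; RMSE penalises large misses more heavily than MAE.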

### Testing with a custom input

In [None]:
X.head()

##### Input:
- Total runs scored by team at that instance
- Total wickets fallen at that instance
- Total overs bowled at that instance
- Max(runs scored by striker, runs scored by non-striker)
- Min(runs scored by striker, runs scored by non-striker)

##### Output:
- Match score

In [None]:
# Testing with a custom input
# Order: runs, wickets, overs, striker score, non-striker score
import numpy as np
new_prediction = linreg.predict(sc.transform(np.array([[70,0,10,25,0]])))
print("Predicted score:", new_prediction)

<a id=section18></a>

### 18. Conclusions

R-squared is a statistic that gives some information about the goodness of fit of a model. In regression, the R-squared coefficient of determination is a statistical measure of how well the regression predictions approximate the real data points. An R-squared value of 1 indicates that the regression predictions perfectly fit the data.

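Below are the R2 observations using the LinearRegression model:
- R2 for training set is __0.5041672457142679__.
- R2 for test set is __0.5090320453008903__.

Below are the R2 observations using the RandomForestRegressor model:
- R2 for training set is __0.895591249283709__.
- R2 for test set is __0.668770359654934__.

The random forest therefore explains more of the variance in the test set than linear regression (about 0.67 against 0.51), although the gap between its training and test R2 suggests some overfitting.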