

## Exploratory Data Analysis of IPL Matches

## Table of Contents

1. [Problem Statement](#section1)<br>
2. [Data Loading and Description](#section2)
3. [Data Profiling](#section3)
    - 3.1 [Understanding the Dataset](#section301)<br/>
    - 3.2 [Pre Profiling](#section302)<br/>
    - 3.3 [Preprocessing](#section303)<br/>
    - 3.4 [Post Profiling](#section304)<br/>

<a id=section1></a>

### 1. Problem Statement

#### Some Background Information
The Indian Premier League (IPL) is a Twenty20 cricket league held in India during April and May every year, in which top players from all over the world take part. The IPL is the most-attended cricket league in the world and ranks sixth among all sports leagues.

The team has some world-class players but has not lived up to the expectations of its supporters. Its poor showing in the IPL has left everyone disappointed.

__I wanted to__
- Analyze the reasons behind their poor performance and suggest recommendations for future auctions and player choices.
- Predict the winner of the next IPL season based on past data, visualizations, perspectives, etc.

The notebook contains:
- Basic analysis, like teams with maximum matches, wins, etc.
- Batsman Analysis
- Bowler Analysis


<a id=section2></a>

### 2. Data Loading and Description

- The dataset contains detailed information about the matches played between 2008 and 2018, such as location, contesting teams, umpires, results, etc.
- The dataset comprises __696 observations of 18 columns__. The table below lists the names of all the columns and their descriptions.

| Column Name        | Description                                        |
| ------------------ |:--------------------------------------------------:|
| id                 | Unique identifier of the match                     |
| season             | Season (year) of the match                         |
| city               | City in which the match was played                 |
| date               | Date on which the match was played                 |
| team1              | Name of team one                                   |
| team2              | Name of team two                                   |
| toss_winner        | Name of the team that won the toss                 |
| toss_decision      | Decision (bat or field) taken by the toss winner   |
| result             | Result of the match (normal, tie or no result)     |
| dl_applied         | Whether the Duckworth-Lewis (D/L) rule was applied |
| winner             | Name of the winning team                           |
| win_by_runs        | Margin of victory in runs                          |
| win_by_wickets     | Margin of victory in wickets                       |
| player_of_match    | Player of the match                                |
| venue              | Venue of the match                                 |
| umpire1            | Name of umpire one                                 |
| umpire2            | Name of umpire two                                 |
| umpire3            | Name of the third umpire                           |

#### Source :
https://github.com/insaid2018/Term-1/blob/master/Data/Projects/matches.csv


#### Importing packages                                          

In [279]:
import numpy as np                                                 # Implements multi-dimensional arrays and matrices
import pandas as pd                                                # For data manipulation and analysis
import pandas_profiling                                            # Generates profiling reports from a DataFrame
import matplotlib.pyplot as plt                                    # Plotting library for Python and its numerical mathematics extension NumPy
import seaborn as sns                                              # Provides a high-level interface for drawing attractive and informative statistical graphics

%matplotlib inline
sns.set()

from subprocess import check_output



#### Importing the Dataset

In [280]:
# Load the matches dataset directly from the raw GitHub URL using pd.read_csv
matches_data = pd.read_csv("https://raw.githubusercontent.com/insaid2018/Term-1/master/Data/Projects/matches.csv")

<a id=section3></a>

## 3. Data Profiling

- In the upcoming sections we will first __understand our dataset__ using various pandas functionalities.
- Then, with the help of __pandas profiling__, we will find which columns of our dataset need preprocessing.
- In __preprocessing__ we will deal with the erroneous and missing values of those columns.
- Finally, we will run __pandas profiling__ again to see how preprocessing has transformed our dataset.

<a id=section301></a>

### 3.1 Understanding the Dataset

To gain insights from the data, we must look at each aspect of it carefully. We will start by observing a few rows and columns from both the start and the end of the dataset.

Let us first check the basic information about the dataset. The most basic thing to know is its dimensions (rows and columns), which we get from the __shape__ attribute.

In [None]:
matches_data.shape

matches_data has __696 rows and 18 columns.__

In [None]:
matches_data.columns

In [None]:
matches_data.head()

In [None]:
matches_data.tail()

In [None]:
matches_data.info()

In [309]:
matches_data.describe(include='all')

Unnamed: 0,id,season,city,date,team1,team2,toss_winner,toss_decision,result,dl_applied,winner,win_by_runs,win_by_wickets,player_of_match,venue,umpire1,umpire2,umpire3
count,696.0,696.0,689,696,696,696,696,696,696,696.0,693,696.0,696.0,693,696,695,695,60
unique,,,32,498,14,14,14,2,3,,14,,,214,35,55,58,18
top,,,Mumbai,2017-04-08,Mumbai Indians,Royal Challengers Bangalore,Mumbai Indians,field,normal,,Mumbai Indians,,,CH Gayle,M Chinnaswamy Stadium,HDPK Dharmasena,S Ravi,C Shamshuddin
freq,,,94,2,91,90,90,413,686,,98,,,20,73,73,56,6
mean,974.103448,2012.965517,,,,,,,,0.027299,,13.472701,3.349138,,,,,
std,2143.239623,3.069266,,,,,,,,0.16307,,23.607994,3.411398,,,,,
min,1.0,2008.0,,,,,,,,0.0,,0.0,0.0,,,,,
25%,174.75,2010.0,,,,,,,,0.0,,0.0,0.0,,,,,
50%,348.5,2013.0,,,,,,,,0.0,,0.0,3.0,,,,,
75%,522.25,2016.0,,,,,,,,0.0,,19.0,6.0,,,,,


In [281]:
matches_data.isnull().sum()

id                   0
season               0
city                 7
date                 0
team1                0
team2                0
toss_winner          0
toss_decision        0
result               0
dl_applied           0
winner               3
win_by_runs          0
win_by_wickets       0
player_of_match      3
venue                0
umpire1              1
umpire2              1
umpire3            636
dtype: int64
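
The raw counts above are easier to compare as a share of all rows; for instance, umpire3 is missing in 636 of 696 rows, which is roughly 91%. The following is a minimal sketch of such a summary. It runs on a small made-up frame (`toy` and its values are illustrative assumptions, not rows from the dataset); the same expressions should apply to `matches_data`.

```python
import pandas as pd

# Toy frame standing in for matches_data (values are illustrative only)
toy = pd.DataFrame({
    "city": ["Mumbai", None, "Delhi", "Pune"],
    "winner": ["A", "B", None, "C"],
    "umpire3": [None, None, None, "X"],
})

# Count of missing values and their percentage share per column
missing_summary = pd.DataFrame({
    "missing": toy.isnull().sum(),
    "missing_pct": (toy.isnull().mean() * 100).round(1),
}).sort_values("missing", ascending=False)

print(missing_summary)
```

Sorting by the count puts the columns that need the most attention at the top of the summary.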

<a id=section302></a>

### 3.2 Pre Profiling

- Pandas profiling generates an __interactive HTML report__ that contains information about every column of the dataset, such as the __counts and type__ of each _column_, detailed per-column information, the __correlation between different columns__, and a sample of the dataset.<br/>
- It gives us a __visual interpretation__ of each column in the data.
- The _spread of the data_ can be better understood from the distribution plot.
- It allows a _granular-level_ analysis of each column.

In [None]:
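profile = pandas_profiling.ProfileReport(matches_data)
# Note: `outputfile` is the keyword used by older pandas_profiling releases (1.x); newer releases name it `output_file`
profile.to_file(outputfile="matches_data_before_preprocessing.html")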

Here, we ran pandas profiling before preprocessing the dataset, so the HTML file is named __matches_data_before_preprocessing.html__. Take a look at the file and see what useful insights you can draw from it. <br/>
Now we will preprocess the data to understand it better.

<a id=section303></a>

### 3.3 Preprocessing

- Dealing with missing and inconsistent values<br/>
    - Replacing missing entries of __city__ using the __venue__ information.
    - Dropping the column __umpire3__, as it has too many __null__ values.
    - Standardising the values of the __date__ column to the __YYYY-MM-DD__ format, as the column contains two different formats, “YYYY-MM-DD” and “DD/MM/YY”.
    - Finding and validating the missing values of the __winner__ and __player_of_match__ columns.
    - Standardising the values of the __id__ column.
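
The first two steps in the list above can be sketched as follows. This is only one possible approach, shown on a made-up frame: `toy`, its venue and city names, and the helper `venue_to_city` are all illustrative assumptions. The idea is to learn a venue-to-city lookup from rows where the city is known and use it to fill the gaps.

```python
import pandas as pd

# Toy frame standing in for matches_data (venue and city names are invented)
toy = pd.DataFrame({
    "venue": ["Stadium A", "Stadium A", "Stadium B", "Stadium B", "Stadium C"],
    "city": ["Alpha", None, "Beta", "Beta", None],
    "umpire3": [None, None, None, "X", None],
})

# Build a venue -> city lookup from the rows where city is known
venue_to_city = (
    toy.dropna(subset=["city"])
       .drop_duplicates(subset="venue")
       .set_index("venue")["city"]
)

# Fill missing cities from the lookup; venues never seen with a city stay missing
toy["city"] = toy["city"].fillna(toy["venue"].map(venue_to_city))

# Drop the mostly empty umpire3 column
toy = toy.drop(columns=["umpire3"])

print(toy)
```

A venue that never appears with a city (Stadium C in the toy data) cannot be resolved this way and would need a manually supplied city.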

In [None]:
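# Columns that contain at least one missing value, excluding umpire3 (planned to be dropped)
null_column = matches_data.columns[matches_data.isnull().any()]
null_column = null_column.drop('umpire3')
print(null_column)
print(matches_data[null_column].isnull().sum())     # Missing values per remaining column
matches_data[null_column].head(10)                  # First rows of those columns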

In [289]:
matches_data[matches_data.winner.isnull()]

Unnamed: 0,id,season,city,date,team1,team2,toss_winner,toss_decision,result,dl_applied,winner,win_by_runs,win_by_wickets,player_of_match,venue,umpire1,umpire2,umpire3
300,301,2011,Delhi,2011-05-21,Delhi Daredevils,Pune Warriors,Delhi Daredevils,bat,no result,0,,0,0,,Feroz Shah Kotla,SS Hazare,RJ Tucker,
545,546,2015,Bangalore,2015-04-29,Royal Challengers Bangalore,Rajasthan Royals,Rajasthan Royals,field,no result,0,,0,0,,M Chinnaswamy Stadium,JD Cloete,PG Pathak,
570,571,2015,Bangalore,2015-05-17,Delhi Daredevils,Royal Challengers Bangalore,Royal Challengers Bangalore,field,no result,0,,0,0,,M Chinnaswamy Stadium,HDPK Dharmasena,K Srinivasan,


In [308]:
matches_data.date.unique()

array(['2017-04-05', '2017-04-06', '2017-04-07', '2017-04-08',
       '2017-04-09', '2017-04-10', '2017-04-11', '2017-04-12',
       '2017-04-13', '2017-04-14', '2017-04-15', '2017-04-16',
       '2017-04-17', '2017-04-18', '2017-04-19', '2017-04-20',
       '2017-04-21', '2017-04-22', '2017-04-23', '2017-04-24',
       '2017-04-26', '2017-04-27', '2017-04-28', '2017-04-29',
       '2017-04-30', '2017-05-01', '2017-05-02', '2017-05-03',
       '2017-05-04', '2017-05-05', '2017-05-06', '2017-05-07',
       '2017-05-08', '2017-05-09', '2017-05-10', '2017-05-11',
       '2017-05-12', '2017-05-13', '2017-05-14', '2017-05-16',
       '2017-05-17', '2017-05-19', '2017-05-21', '2008-04-18',
       '2008-04-19', '2008-04-20', '2008-04-21', '2008-04-22',
       '2008-04-23', '2008-04-24', '2008-04-25', '2008-04-26',
       '2008-04-27', '2008-04-28', '2008-04-29', '2008-04-30',
       '2008-05-01', '2008-05-02', '2008-05-25', '2008-05-03',
       '2008-05-04', '2008-05-05', '2008-05-06', '2008-

In [288]:
matches_data.groupby('result').size()

result
no result      3
normal       686
tie            7
dtype: int64

In [None]:
# Rows where both winner and umpire3 are missing
matches_data.loc[matches_data.winner.isnull() & matches_data.umpire3.isnull()]
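
The outputs above suggest that the three matches without a winner are exactly the three 'no result' matches. A minimal sketch of how that could be checked programmatically is given below. The frame `toy` and the flag names `winner_gaps_explained` and `potm_gaps_explained` are illustrative assumptions; the same comparison should work on `matches_data`.

```python
import pandas as pd

# Toy frame standing in for matches_data (values are illustrative only)
toy = pd.DataFrame({
    "result": ["normal", "no result", "tie", "no result"],
    "winner": ["A", None, "B", None],
    "player_of_match": ["p1", None, "p2", None],
})

no_result = toy["result"].eq("no result")

# True only if a value is missing exactly where the match had no result
winner_gaps_explained = bool((toy["winner"].isnull() == no_result).all())
potm_gaps_explained = bool((toy["player_of_match"].isnull() == no_result).all())

print(winner_gaps_explained, potm_gaps_explained)
```

If both flags are true, the missing values are legitimate and can be left as they are or filled with an explicit label, rather than being treated as data errors.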


In [None]:
matches_data['id'].max()
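
In the summary statistics, the mean of id (974.10) is well above its 75th percentile (522.25) even though there are only 696 rows, which hints that some ids are much larger than the row count. One possible way to standardise them, sketched below under that assumption, is to keep the original id and add a sequential one. The frame `toy`, its id values and the column name `match_id` are hypothetical.

```python
import pandas as pd

# Toy frame standing in for matches_data (the large ids are invented for illustration)
toy = pd.DataFrame({
    "id": [1, 2, 3, 7894, 7895],
    "season": [2017, 2017, 2008, 2018, 2018],
})

# If ids were a clean 1..n sequence, the maximum would equal the row count
has_gaps = bool(toy["id"].max() != len(toy))

# Sequential identifier in the current row order; the original id is kept
toy["match_id"] = range(1, len(toy) + 1)

print(has_gaps)
print(toy)
```

Keeping the original column alongside the new one means any external reference to the old ids can still be resolved.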

In [None]:
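# Standardise every date to YYYY-MM-DD, the target format chosen for this column
# Caution: ambiguous DD/MM/YY strings may be read month-first unless the format is given explicitly
matches_data['clean_date'] = matches_data.date.apply(lambda x: pd.to_datetime(x).strftime('%Y-%m-%d'))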
matches_data.clean_date.unique()
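
Because the column mixes “YYYY-MM-DD” and “DD/MM/YY”, letting the parser guess the format can swap day and month for ambiguous values. A more explicit alternative, sketched here on a few made-up strings (`raw` and its values are illustrative, not taken from the dataset), is to parse each format separately and combine the results.

```python
import pandas as pd

# Made-up sample mixing the two formats described in the preprocessing list
raw = pd.Series(["2017-04-05", "07/04/18", "2008-05-25", "27/05/18"])

# Parse each format explicitly; values that do not match become NaT
iso = pd.to_datetime(raw, format="%Y-%m-%d", errors="coerce")
dmy = pd.to_datetime(raw, format="%d/%m/%y", errors="coerce")

# Prefer the ISO parse and fall back to the day-first parse
parsed = iso.fillna(dmy)
clean = parsed.dt.strftime("%Y-%m-%d")

print(clean.tolist())
```

Any value that matches neither format stays NaT in `parsed`, so unexpected inputs surface as missing values instead of being silently misread.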

In [None]:
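# drop() returns a new DataFrame and leaves matches_data unchanged, so this cell only previews the result.
# Assign the result back to matches_data to actually remove the column.
matches_data.drop(columns=['date'])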

In [None]:
matches_data.team1.unique()

In [None]:
print('Total Matches Played:', matches_data.shape[0])
print('\nCities Played In:', matches_data['city'].unique())
print('\nTeams:', matches_data['team1'].unique())

__How many seasons does the dataset cover?__

In [None]:
season = matches_data['season'].unique()
type(season)
season.sort()
season

In [None]:
season = np.sort(matches_data['season'].unique())     # np.sort returns a sorted copy; ndarray.sort() sorts in place and returns None
season.shape

In [None]:
number_of_missing_values_city = len(matches_data.city) - matches_data.city.count()
number_of_missing_values_city

In [None]:
number_of_missing_values_umpire3 = len(matches_data.umpire3) - matches_data.umpire3.count()
number_of_missing_values_umpire3


In [None]:
matches_data.date.unique()

<a id=section304></a>

### 3.4 Post Profiling

In [None]:
profile = pandas_profiling.ProfileReport(matches_data)
# Note: `outputfile` is the keyword used by older pandas_profiling releases (1.x); newer releases name it `output_file`
profile.to_file(outputfile="matches_data_after_preprocessing.html")

With the data preprocessed, the profiling report generated afterwards should give more useful insights than the first one. Compare the two reports, i.e. __matches_data_after_preprocessing.html__ and __matches_data_before_preprocessing.html__.<br/>
In the __matches_data_after_preprocessing.html__ report, check the following:
- In the Dataset info, whether Total __Missing(%)__ has dropped compared with the first report.
- How the number of __variables__ has changed.
- The newly created variable __clean_date__: click on Toggle details to get more detailed information about it.