<img src="IPL_Logo.png" width="240" height="360" />

## Exploratory Data Analysis of Indian Premier League Matches

## Table of Contents

1. [Problem Statement](#section1)<br>
2. [Data Loading and Description](#section2)
3. [Data Profiling](#section3)
    - 3.1 [Understanding the Dataset](#section301)<br/>
    - 3.2 [Pre Profiling](#section302)<br/>
    - 3.3 [Preprocessing](#section303)<br/>
    - 3.4 [Post Profiling](#section304)<br/>
4. [Questions](#section4)
    - 4.1 [Which season had the most matches?](#section401)<br/>
    - 4.2 [Which five cities have hosted the most matches?](#section402)<br/>
    - 4.3 [Maximum Toss Winners](#section403)<br/>
    - 4.4 [Has Toss-winning helped in Match-winning?](#section404)<br/>
    - 4.5 [Toss Decisions across seasons](#section405)<br/>
    - 4.6 [Which team won by max runs and the best defending team?](#section406)<br/>
    - 4.7 [Which team won by max wickets and best chasing team?](#section407)<br/>
    - 4.8 [In which cities did weather affect matches?](#section408)<br/>
    - 4.9 [Which are the best chasing venues to win a match in IPL?](#section409)<br/>
    - 4.10 [Which are the best defending venues to win a match in IPL?](#section410)<br/>
    - 4.11 [Who are the favourite umpires?](#section411)<br/>
    - 4.12 [Which team won maximum matches?](#section412)<br/>
    - 4.13 [Which player won the most Player of the Match awards in IPL?](#section413)<br/>
5. [Conclusions](#section5)<br/> 

<a id=section1></a>

### 1. Problem Statement

#### Some Background Information
The Indian Premier League (IPL) is a Twenty20 cricket league held in India during April and May every year, in which top players from all over the world take part. The IPL is the most-attended cricket league in the world and ranks sixth among all sports leagues.

The team has got some world class players but has not been able to live up to the expectations of their supporters. Their poor show in IPL has left everyone disappointed. 

__I wanted to analyze__
- The reason behind their performance and suggest any recommendation for future auctions and player choices.
- Predicting the winner of the next season of IPL based on past data, Visualizations, Perspectives, etc.

The notebook contains:
- Basic analysis such as teams with maximum matches, wins, etc.


<a id=section2></a>

### 2. Data Loading and Description

- The dataset contains detailed information related to the matches, such as location, contesting teams, umpires, results, etc., between 2008 and 2018.
- The dataset comprises __696 observations of 18 columns__. Below is a table showing the names of all the columns and their descriptions.

| Column Name        | Description                                       |
| ------------------ |:-------------------------------------------------:|
| id                 | Identity of match                                 |
| season             | Season                                            |
| city               | City in which match played                        |
| date               | Date on which match played                        |
| team1              | Name of Team One                                  |
| team2              | Name of Team Two                                  |
| toss_winner        | Name of team who won toss                         |
| toss_decision      | Toss decision                                     |
| result             | Result of match                                   |
| dl_applied         | Is dl rule applied                                |
| winner             | Name of winner team                               |
| win_by_runs        | Win by runs                                       |
| win_by_wickets     | Win by wickets                                    |
| player_of_match    | Name of player who received Player of match award |
| venue              | Venue of match                                    |
| umpire1            | Name of umpire one                                |
| umpire2            | Name of umpire two                                |
| umpire3            | Name of third umpire                              |

#### Source :
https://github.com/insaid2018/Term1/blob/master/Data/Projects/matches.csv


#### Importing packages                                          

In [None]:
import numpy as np                                                 # Implements multi-dimensional arrays and matrices
import pandas as pd                                                # For data manipulation and analysis
import pandas_profiling                                            # Generates profile reports from a pandas DataFrame
import matplotlib.pyplot as plt                                    # Plotting library for Python and its numerical mathematics extension NumPy
import seaborn as sns                                              # Provides a high-level interface for drawing attractive and informative statistical graphics

%matplotlib inline
sns.set()

from subprocess import check_output



#### Importing the Dataset

In [None]:
matches_data = pd.read_csv("https://raw.githubusercontent.com/insaid2018/Term-1/master/Data/Projects/matches.csv")     # Importing training dataset using pd.read_csv

<a id=section3></a>

## 3. Data Profiling

- In the upcoming sections we will first __understand our dataset__ using various pandas functionalities.
- Then with the help of __pandas profiling__ we will find which columns of our dataset need preprocessing.
- In __preprocessing__ we will deal with erroneous and missing values of columns.
- We will then run __pandas profiling__ again to see how preprocessing has transformed our dataset.

<a id=section301></a>

### 3.1 Understanding the Dataset

To gain insights from the data we must look into each aspect of it carefully. We will start by observing a few rows and columns from both the beginning and the end of the data.

Let us check the basic information of the dataset. The most basic thing to know is the dimension of the dataset (rows and columns), which we find with the __shape__ attribute.

In [None]:
matches_data.shape

matches_data has __696 rows and 18 columns.__

In [None]:
matches_data.columns

In [None]:
matches_data.head()

In [None]:
matches_data.tail()

In [None]:
matches_data.info()

In [None]:
matches_data.describe(include='all')

In [None]:
matches_data.isnull().sum()

In [None]:
matches_data.count()

<a id=section302></a>

### 3.2 Pre Profiling

- Pandas profiling generates an __interactive HTML report__ which contains all the information about the columns of the dataset, like the __counts and type__ of each _column_, detailed information about each column, the __correlation between different columns__ and a sample of the dataset.<br/>
- It gives us a __visual interpretation__ of each column in the data.
- The _spread of the data_ can be better understood from the distribution plot.
- _Granular-level_ analysis of each column.

In [None]:
profile = pandas_profiling.ProfileReport(matches_data)
profile.to_file(outputfile="matches_data_before_preprocessing.html")

Here, we have done Pandas Profiling before preprocessing our dataset, so we have named the html file as __matches_data_before_preprocessing.html__. Take a look at the file and see what useful insight you can develop from it. <br/>
Now we will process our data to better understand it.

<a id=section303></a>

### 3.3 Preprocessing

- Dealing with missing and inconsistent values<br/>
    - Replacing missing entries of __city__ using the __venue__ information.
    - Dropping the column __umpire3__ as it has too many __null__ values.
    - Standardising values of the __date__ column to the __YYYY-MM-DD__ format, as the column contains two different formats, “YYYY-MM-DD” and “DD/MM/YY”.
    - Finding and validating missing values of the winner and player of match columns.
    - Standardising values of the __id__ column and removing duplicate records from the data set.
    - Replacing incorrect and multiple entries of the Pune team name.

In [None]:
# Filter rows where the city is missing
missing_city_data_set = matches_data.loc[matches_data.city.isnull(), ]

# Print for observation
print("Missing entries for city : {} ".format(missing_city_data_set.city.size))
print("Venue names are : {} ".format(", ".join(missing_city_data_set.venue.unique())))

# Display the rows to see which venues the missing cities belong to
missing_city_data_set



There are __7__ missing entries for city, and all of these rows have the same venue, i.e. "Dubai International Cricket Stadium", which means the matches were played in Dubai. So we can replace the missing entries for city with "Dubai".

Let's replace the missing entries of city with "Dubai".

In [None]:
matches_data['city'] = matches_data['city'].fillna('Dubai')
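
Filling every missing city with one constant works here because all the affected rows share a single venue. A slightly more general approach, sketched below on toy data, is to fill missing cities from a venue-to-city lookup. The frame `toy_matches` and the dictionary `venue_to_city` are hypothetical names introduced only for this illustration; they are not part of the original notebook.

```python
import pandas as pd

# Toy data standing in for matches_data (values are illustrative only).
toy_matches = pd.DataFrame({
    "venue": ["Dubai International Cricket Stadium", "Eden Gardens",
              "Dubai International Cricket Stadium"],
    "city": [None, "Kolkata", None],
})

# Hypothetical lookup table: list only venues whose city is known.
venue_to_city = {"Dubai International Cricket Stadium": "Dubai"}

# Fill only the missing cities, using the city mapped from each row's venue.
toy_matches["city"] = toy_matches["city"].fillna(
    toy_matches["venue"].map(venue_to_city)
)

print(toy_matches)
```

Rows whose venue is not in the lookup keep their missing value, so an unexpected venue is not silently labelled with the wrong city.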

Let's drop the column umpire3 as it has too many null values.

In [None]:
matches_data.drop('umpire3', axis = 1, inplace = True)


Standardise the values of the date column to the YYYY-MM-DD format, as the column contains two different formats, “YYYY-MM-DD” and “DD/MM/YY”.

In [None]:
matches_data.date.unique()

In [None]:
# Parse each of the two date formats explicitly and rewrite every value as YYYY-MM-DD
matches_data['date'] = matches_data.date.apply(
    lambda x: pd.to_datetime(x, format='%d/%m/%y' if '/' in x else '%Y-%m-%d').strftime('%Y-%m-%d')
)

# Check unique values again
matches_data.date.unique()

Find and validate missing values of winner & player of match column.

In [None]:
matches_data.loc[matches_data.winner.isnull(), ]


In [None]:
# Get all record of winner & player column
matches_data[
    matches_data.winner.isnull() & 
    matches_data.player_of_match.isnull()
]

set(matches_data.winner).intersection(set(matches_data.player_of_match))

There are 3 records with missing values for winner and player of match. Judging by the __result__ column for these records, it seems those matches were cancelled or ended without a result.

Let's replace the missing values of winner and player of match with some meaningful text, i.e. "draw".

In [None]:
# Replace missing values with some meaningful text.
matches_data['winner'] = matches_data['winner'].fillna('draw')
matches_data['player_of_match'] = matches_data['player_of_match'].fillna('draw')

Standardise the values of the id column and remove duplicate records from the data set.

In [None]:
# drop_duplicates returns a new data frame, so assign the result back
matches_data = matches_data.drop_duplicates(subset=['id'])
matches_data.shape
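
As a small illustration of how `drop_duplicates` behaves, the sketch below uses a toy frame (the name `toy` and its values are made up for this example). It shows that the method returns a new frame and leaves the original unchanged unless the result is assigned.

```python
import pandas as pd

# Toy frame with one repeated id (illustrative values only).
toy = pd.DataFrame({"id": [1, 2, 2, 3], "winner": ["A", "B", "B", "C"]})

# drop_duplicates returns a new frame; `toy` itself is not modified.
deduplicated = toy.drop_duplicates(subset=["id"])

rows_before = len(toy)
rows_after = len(deduplicated)
print(rows_before, rows_after)
```

Comparing the row count before and after is a quick check on how many duplicate ids were actually present.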

In [None]:
matches_data.isnull().sum()

In [None]:
matches_data[matches_data.umpire1.isnull()]

In [None]:
matches_data['umpire1'] = matches_data['umpire1'].fillna('Unknown')
matches_data['umpire2'] = matches_data['umpire2'].fillna('Unknown')

Replacing incorrect and multiple entries of the Pune team name in the team1, team2, toss_winner and winner columns.

In [None]:
incorrect_names = ['rising pune supergiant', 'pune warriors']
pune_team_name = 'Rising Pune Supergiants'

# Replace every Pune name variant (matched case-insensitively) in each team column
for column in ['team1', 'team2', 'winner', 'toss_winner']:
    is_pune_variant = matches_data[column].str.lower().isin(incorrect_names)
    matches_data.loc[is_pune_variant, column] = pune_team_name

matches_data.team1.unique()


In [None]:
matches_data.count()

<a id=section304></a>

### 3.4 Post Pandas Profiling

In [None]:
#import pandas_profiling
profile = pandas_profiling.ProfileReport(matches_data)
profile.to_file(outputfile="matches_data_after_preprocessing.html")

Now that we have preprocessed the data, the dataset does not contain missing values. You can compare the two reports, i.e. __matches_data_after_preprocessing.html__ and __matches_data_before_preprocessing.html__.<br/>
In the __matches_data_after_preprocessing.html__ report, observations:
- In the Dataset info, Total __Missing(%)__ = __0.0%__ 
- Number of __variables__ = __13__ 

<a id=section304></a>

__Utility functions__

In [None]:
class ChartType:
    bar_chart = 1
    bar_chart_horizontal = 2
    line_chart = 3
    histogram_chart = 4
    stack_chart = 5
    scatter_chart = 6
    area_chart = 7
    pie_chart = 8


In [None]:
def showChart(data, chart_type, xlabel, ylabel, title=None, figsize=None, axis=None):
    '''
    Draw a chart of the requested type from a pandas Series or DataFrame.

    data : pandas Series or DataFrame to plot.
    chart_type : one of the ChartType constants.
    xlabel : The label text for x axis.
    ylabel : The label text for y axis.
    title : The label text for title of chart.
    figsize : tuple of integers, optional, default: None
    axis : The axis limits to be set. Either none or all of the limits must
    be given.
    '''
    # Set figure size of chart
    if figsize is not None:
        plt.figure(figsize=figsize)

    # Set x & y axis limits
    if axis is not None:
        plt.axis(axis)

    # Draw the requested chart type
    if ChartType.bar_chart == chart_type:
        data.plot.bar()
    elif ChartType.bar_chart_horizontal == chart_type:
        data.plot.barh()
    elif ChartType.stack_chart == chart_type:
        data.plot.bar(stacked=True)
    elif ChartType.line_chart == chart_type:
        data.plot.line()
    elif ChartType.histogram_chart == chart_type:
        data.plot.hist()
    elif ChartType.scatter_chart == chart_type:
        # Plot the values against the index as points (assumes a Series)
        plt.scatter(data.index, data.values)
    elif ChartType.area_chart == chart_type:
        data.plot.area()
    elif ChartType.pie_chart == chart_type:
        # Flatten the values so a single-column DataFrame also works
        plt.pie(np.ravel(data.values),
                labels=data.index,
                autopct='%1.2f', startangle=90)
        plt.legend(data.index, loc="best")
        plt.axis('equal')

    # Set title of chart, y & x axis
    if title is not None:
        plt.title(title, fontsize=20)

    if xlabel is not None:
        plt.xlabel(xlabel, fontsize=10)

    if ylabel is not None:
        plt.ylabel(ylabel, fontsize=10)

    # Custom ticks for x axis
    plt.tick_params(axis='x', colors='black', direction='out', length=5, width=1, labelsize='large')

    # Custom ticks for y axis
    plt.tick_params(axis='y', colors='black', direction='in', length=5, width=1, labelsize='large')

    # Show chart
    plt.show()

    

<a id=section4></a>

### 4. Questions

<a id=section401></a>

__4.1 Which season had most number of matches?__

In [None]:
# Get data frame for seasons for number of matches
season_df = matches_data.groupby(matches_data.season).season.count().sort_values(ascending=False)

# Show information in barchart
showChart(season_df, ChartType.bar_chart_horizontal, 'Number of matches', 'Season', None, (12,10), None)


Let's do the same analysis using the seaborn library, without grouping the data set first.

In [None]:
# Show countplot for number of matches per season using data frame
sns.countplot(x=matches_data.season, data=matches_data)
plt.show()

The seasons __2011-2013__ had more matches than the other seasons, and the __2013__ season had the most matches.

<a id=section402></a>

__4.2 Which five cities have hosted the most matches?__

In [None]:
# Get data frame of matches played in different city
matches_per_city_df = matches_data.groupby([matches_data.city]).city.count().sort_values(ascending=False).head()

# Show information in pie chart
showChart(matches_per_city_df, ChartType.pie_chart, None, None, None, (10,10), None)



From the chart above, the highest number of matches was played in __Mumbai__, followed by __Delhi, Kolkata and Bangalore__ in that order.

<a id=section403></a>

__4.3 Maximum Toss Winners__

In [None]:
# Show countplot for team wise toss winner using data frame
ax = matches_data['toss_winner'].value_counts().plot.bar(width=0.8)
for p in ax.patches:
    ax.annotate(format(p.get_height()), (p.get_x()+0.15, p.get_height()+1))
plt.show()


From the graph above, __Mumbai Indians__ have won the most tosses, followed by __Kolkata Knight Riders__. __Kochi Tuskers Kerala__ had the fewest toss wins, as they have also played the fewest matches. These counts do not indicate a higher chance of winning the toss, because the number of matches played by each team is uneven.
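
Because teams have played different numbers of matches, raw toss counts are hard to compare. One possible adjustment, sketched here on toy data, is to divide each team's toss wins by the number of matches it played. The frame `toy` and the variable names are assumptions made for this illustration; only the column names `team1`, `team2` and `toss_winner` come from the dataset.

```python
import pandas as pd

# Toy fixtures (illustrative values only).
toy = pd.DataFrame({
    "team1": ["A", "A", "B", "C"],
    "team2": ["B", "C", "C", "A"],
    "toss_winner": ["A", "C", "B", "A"],
})

# Matches played per team: appearances as team1 plus appearances as team2.
matches_played = pd.concat([toy["team1"], toy["team2"]]).value_counts()

# Toss wins per team, aligned to the same index (0 for teams with no toss wins).
toss_wins = toy["toss_winner"].value_counts().reindex(matches_played.index, fill_value=0)

# Share of matches in which each team won the toss.
toss_win_rate = (toss_wins / matches_played).sort_values(ascending=False)
print(toss_win_rate)
```

A rate close to 0.5 for every team would be consistent with the toss being a fair coin flip, whatever the raw counts suggest.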

Next, let's check whether winning the toss has helped teams win the match.

<a id=section404></a>

__4.4 Has Toss-winning helped in Match-winning?__

In [None]:
# Get data frame for toss winning
toss_winning_df = matches_data.groupby([matches_data.toss_winner, matches_data.winner]).toss_winner.count().sort_values(ascending=False).head(10)

# Show information in barchart
showChart(toss_winning_df, ChartType.bar_chart_horizontal, 'Number of matches', 'Teams Names', None, (6,8), [0, 55, 0, 1])


In [None]:
# Show pie chart to analyse how often the toss winner also won the match
winning_matches = matches_data[matches_data['toss_winner'] == matches_data['winner']]
toss_winner_slices = [len(winning_matches), len(matches_data) - len(winning_matches)]
labels = ['yes', 'no']

# Create a pandas Series (one value per label) to draw the pie chart
toss_result_series = pd.Series(toss_winner_slices, index=labels)

# Show information in pie chart
showChart(toss_result_series, ChartType.pie_chart, None, None, None, (5,5), None)

The pie chart above shows how often the toss-winning team also won the match. The split reported for this analysis was about __60%-40%__ in favour of the toss winner; since the share depends on the total number of matches used as the denominator, it is worth re-reading from the chart. Even a higher win share does not mean that the toss winner will necessarily be the match winner.
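
The same share can be computed directly, without slicing the frame or hardcoding a total, by taking the mean of a boolean column. The sketch below uses toy data; `toy`, `toss_winner_won` and `toss_win_share` are names introduced only for this example.

```python
import pandas as pd

# Toy results (illustrative values only).
toy = pd.DataFrame({
    "toss_winner": ["A", "B", "A", "C"],
    "winner": ["A", "A", "A", "B"],
})

# True where the team that won the toss also won the match.
toss_winner_won = toy["toss_winner"] == toy["winner"]

# The mean of a boolean Series is the fraction of True values.
toss_win_share = toss_winner_won.mean()
print("Share of matches won by the toss winner:", toss_win_share)
```

Because the denominator is the length of the frame itself, the share stays correct if rows are added or removed during preprocessing.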

<a id=section405></a>

__4.5 Toss Decisions across seasons__

In [None]:
# Show countplot to analysis toss decisions by seasons
plt.figure(figsize=(14,10))
sns.countplot(x='season',hue='toss_decision',data=matches_data)
plt.show()

The decision for batting or fielding varies largely across the seasons. In some seasons, the probability that toss winners opt for batting is high, while it is not the case in other seasons. In 2016 though, the majority of toss winners opted for batting.

In [None]:
# Show countplot to analyse toss decisions team-wise
plt.figure(figsize=(14,10))
sns.set_style('whitegrid')
t = sns.countplot(x="toss_winner", hue="toss_decision", data=matches_data)
plt.xticks(rotation = 90)
t.set(ylabel='Number of Matches', xlabel='Team')
plt.tight_layout()
plt.show()


Based on the chart above, we can roughly partition the teams into two groups: those that prefer fielding and those that prefer batting. __Gujarat Lions__ take up fielding almost all the time. On the other hand, __Chennai Super Kings__ usually take up batting. __Mumbai Indians__ have picked batting and fielding an almost equal number of times.

<a id=section406></a>

__4.6 Which team won by max runs and the best defending team?__

In [None]:
# Show swarmplot to analysis team won by max runs and the best defending team
plt.figure(figsize=(10,12))
sns.swarmplot(x=matches_data.win_by_runs, y=matches_data.winner, data=matches_data)
plt.show()


In [None]:
# Get data frame for win by max run
win_by_max_run_df = matches_data.sort_values(['win_by_runs'], ascending=[False]).head().groupby(matches_data.winner).winner.count().head()

# Show information in barchart
showChart(win_by_max_run_df, ChartType.bar_chart_horizontal, 'Frequency', 'Winner Teams Name', None, None, None)



As we can observe, __Royal Challengers Bangalore__ (considered a batting-heavy side) seem to win many matches by a huge margin. __Mumbai Indians and Kolkata Knight Riders__ (both considered balanced sides) show almost similar behaviour, with victories distributed quite evenly.

__Mumbai Indians__ also appear to hold the record for the biggest win by runs and to be the best defending team.


<a id=section407></a>

__4.7 Which team won by max wickets and best chasing team?__

In [None]:
# Show swarmplot to analysis team won by max wickets and the best chasing team
plt.figure(figsize=(10,12))
sns.swarmplot(x=matches_data.win_by_wickets, y=matches_data.winner, data=matches_data)
plt.show()


In [None]:
# Get data frame for win by max wickets
win_by_max_wickets_df = matches_data.sort_values(['win_by_wickets'], ascending=False).head(10).groupby(matches_data.winner).winner.count().head(10)

# Show information in barchart
showChart(win_by_max_wickets_df, ChartType.bar_chart_horizontal, 'Frequency', 'Winner Teams Name', None, None, None)



There is probably not much to gain from the overall distribution; no specific trends can be observed.

From the swarm plot data above, however, __Royal Challengers Bangalore__ appear to be the best chasing team, having won three times by the maximum wicket margin.

<a id=section408></a>

__4.8 In which cities did weather affect matches?__

In [None]:
# Show count plot of the cities where the Duckworth-Lewis method was applied
plt.figure(figsize=(14,10))
sns.countplot(x=matches_data.city[matches_data.dl_applied == 1])
plt.show()


If rain or another interruption shortens a match after the team batting first has started its innings, the Duckworth-Lewis (DL) method can be applied. It calculates a revised target for the team batting second based on the overs that can still be bowled, as decided by the umpires. From the graph above, the DL method has been applied most often at __Kolkata__, followed by __Delhi and Bangalore__.

<a id=section409></a>

__4.9 Which are the best chasing venues to win a match in IPL?__

In [None]:
plt.figure(figsize=(5,12))
sns.swarmplot(x=matches_data.win_by_wickets, y=matches_data.venue, data=matches_data)

plt.show()

__Eden Gardens__ & __M Chinnaswamy Stadium__ were the best venues for chasing.

<a id=section410></a>

__4.10 Which are the best defending venues to win a match in IPL?__

In [None]:
plt.figure(figsize=(8,12))
sns.swarmplot(x=matches_data.win_by_runs, y=matches_data.venue, data=matches_data)

plt.show()

From above swarm plot, we can say that __MA Chidambaram Stadium, Chepauk__ was the best defending stadium.
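
Swarm plots can be hard to read when many venues are shown together. As a complementary view, the sketch below computes, per venue, the share of matches won by the defending side (`win_by_runs > 0`) and by the chasing side (`win_by_wickets > 0`). It runs on toy data; `toy`, `venue_summary` and the `defended`/`chased` columns are hypothetical names used only in this example.

```python
import pandas as pd

# Toy results (illustrative values only).
toy = pd.DataFrame({
    "venue": ["X", "X", "X", "Y", "Y"],
    "win_by_runs": [10, 0, 0, 25, 3],
    "win_by_wickets": [0, 5, 7, 0, 0],
})

# Flag each match as won while defending or won while chasing,
# then average the flags per venue to get shares.
venue_summary = (
    toy.assign(
        defended=toy["win_by_runs"] > 0,
        chased=toy["win_by_wickets"] > 0,
    )
    .groupby("venue")[["defended", "chased"]]
    .mean()
)
print(venue_summary)
```

Shares based on only a handful of matches are unreliable, so it may help to look at the match count per venue alongside them.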

<a id=section411></a>

__4.11 Who are the favourite umpires in IPL?__

In [None]:
ump = pd.concat([matches_data['umpire1'],matches_data['umpire2']]) 
ax = ump.value_counts().head(10).plot.bar(width=0.8,color='y')
for p in ax.patches:
    ax.annotate(format(p.get_height()), (p.get_x()+0.15, p.get_height()+0.25))
plt.show()


From the chart above, it seems that __S Ravi__ has umpired the most IPL matches.

<a id=section412></a>

__4.12 Which team won maximum matches?__

In [None]:
# Get data frame for match winners
winner_df =  matches_data.groupby(matches_data.winner).winner.count().sort_values(ascending=False).head(10)

# Show information in barchart
showChart(winner_df, ChartType.bar_chart_horizontal, 'Frequency', 'Winner Teams Name', None, None, None)


From the graph above, __Mumbai Indians__ are clearly the most successful team, followed by __Chennai Super Kings__ and __Kolkata Knight Riders__. __Sunrisers Hyderabad__, __Gujarat Lions__ and __Deccan Chargers__ (which is now defunct) are trailing, but this is because they have played fewer seasons. Excluding those teams, __Delhi Daredevils__ appear to be the least successful IPL team.
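
Since total wins favour teams that have played more seasons, one way to compare teams more fairly is wins per season played. The sketch below shows the idea on toy data; `toy`, `appearances`, `seasons_played` and `wins_per_season` are names introduced for this illustration only.

```python
import pandas as pd

# Toy results (illustrative values only).
toy = pd.DataFrame({
    "season": [2008, 2008, 2009, 2009],
    "team1": ["A", "A", "A", "C"],
    "team2": ["B", "B", "C", "A"],
    "winner": ["A", "A", "A", "C"],
})

# One row per (season, team) appearance, regardless of team1/team2 position.
appearances = toy.melt(id_vars="season", value_vars=["team1", "team2"],
                       value_name="team")

# Number of distinct seasons each team took part in.
seasons_played = appearances.groupby("team")["season"].nunique()

# Wins per team, aligned to the teams above (0 for teams without a win).
wins = toy["winner"].value_counts().reindex(seasons_played.index, fill_value=0)

wins_per_season = wins / seasons_played
print(wins_per_season)
```

Aligning the wins to the team index also drops placeholder values such as "draw" from the winner column, since they are not team names.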

<a id=section413></a>

__4.13 Which player won the most Player of the Match awards in IPL?__

In [None]:
# Get data frame for player of match awards per season
season_df = matches_data.groupby([matches_data.season, matches_data.player_of_match]).player_of_match.count().sort_values(ascending=False).head()

# Show information in barchart
showChart(season_df, ChartType.bar_chart_horizontal, 'Number of matches', 'Season', None, (15,10), None)


__Maximum player of match__

In [None]:
# Get data frame for player of match
player_of_match_df = matches_data.groupby(matches_data.player_of_match).player_of_match.count().sort_values(ascending=False).head()

# Show information in barchart
showChart(player_of_match_df, ChartType.bar_chart_horizontal, 'Frequency', 'Players Name', None, None, [0, 25, 0, 1])




In IPL matches, __Chris Gayle__ leads with the highest number of Player of the Match awards, and __AB de Villiers__ is placed second in the list.

Long-time IPL fans will know about Chris Gayle's heroics in the first few seasons.


In [None]:
# Use a heat map to check the correlation between the numeric columns
cor = matches_data.select_dtypes(include='number').corr()
plt.figure(figsize=(12,10))
sns.heatmap(cor, annot=True, cmap='coolwarm')
plt.show()


From the heatmap above, there is a negative correlation between win by wickets and win by runs. This is expected: a match is won either by runs or by wickets, so whenever one value is non-zero the other is zero.
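
The negative correlation follows from how the two columns are constructed. The sketch below illustrates this on toy data in which, as in the dataset, a match never has both a runs margin and a wickets margin. The frame `toy` and its values are invented for the example.

```python
import pandas as pd

# Toy margins: each match is won either by runs or by wickets, never both.
toy = pd.DataFrame({
    "win_by_runs": [10, 0, 35, 0, 0, 4],
    "win_by_wickets": [0, 6, 0, 8, 3, 0],
})

# No row has both margins greater than zero.
both_positive = ((toy["win_by_runs"] > 0) & (toy["win_by_wickets"] > 0)).sum()

# Pearson correlation between the two mutually exclusive margins.
correlation = toy["win_by_runs"].corr(toy["win_by_wickets"])
print(both_positive, correlation)
```

Because the product of the two columns is zero in every row while both means are positive, the covariance, and hence the correlation, is negative. The heatmap therefore reflects the data encoding rather than a cricketing pattern.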

<a id=section5></a>

## 5. Conclusion 

- The seasons __2011-2013__ had more matches than the other seasons, and the __2013__ season had the most matches.
- The highest number of matches was played in __Mumbai__, followed by __Delhi, Kolkata and Bangalore__ in that order.
- __Mumbai Indians__ have won the most tosses, followed by __Kolkata Knight Riders__. __Kochi Tuskers Kerala__ had the fewest toss wins, as they have also played the fewest matches.
- The split between matches won and lost by the toss-winning team was about __60%-40%__.
- The decision for batting or fielding varies largely across the seasons. In some seasons, the probability that toss winners opt for __batting is high__, while it is not the case in other seasons. In 2016 though, the majority of toss winners opted for batting.
- __Gujarat Lions__ take up fielding almost all the time. On the other hand, __Chennai Super Kings__ usually take up batting. __Mumbai Indians__ have picked batting and fielding an almost equal number of times.
- __Royal Challengers Bangalore__ (considered a batting-heavy side) seem to win many matches by a huge margin. __Mumbai Indians and Kolkata Knight Riders__ (both considered balanced sides) show almost similar behaviour, with victories distributed quite evenly. __Mumbai Indians__ also appear to hold the record for the biggest win by runs and to be the best defending team.
- __Royal Challengers Bangalore__ appear to be the best chasing team, having won __three__ times by the maximum wicket margin.
- The Duckworth-Lewis method has been applied most often at __Kolkata__, followed by Delhi and Bangalore.
- __S Ravi__ has umpired the most IPL matches.
- __Mumbai Indians__ are clearly the most successful team, followed by __Chennai Super Kings__ and __Kolkata Knight Riders__. __Sunrisers Hyderabad__, __Gujarat Lions__ and __Deccan Chargers__ (which is now defunct) are trailing, but this is because they have played fewer seasons. Excluding those teams, __Delhi Daredevils__ appear to be the least successful IPL team.
- __Chris Gayle__ leads with the highest number of Player of the Match awards, and __AB de Villiers__ is placed second in the list.