<a href="https://colab.research.google.com/github/karthikdoguparthi/IPLAnalysis/blob/main/IPL_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - IPL Analysis



##### **Project Type**    - EDA
##### **Contribution**    - Individual
##### **Team Member 1 -** Karthik Doguparthi

# **Project Summary -**

In this project, I set out to analyze historical data from the Indian Premier League (IPL) to uncover meaningful insights about teams, players, and match outcomes. The IPL is one of the most followed sporting events in the world, and I wanted to use data analytics to better understand its patterns, trends, and key performers. To achieve this, I worked with ball-by-ball data as well as match-level datasets, applying Python libraries such as Pandas for data processing and Matplotlib and Seaborn for visualizations.

My first step was to clean and merge the datasets so I could create a single source of truth for analysis. Once that was ready, I began exploring different aspects of the game. I looked at toss outcomes and their influence on match results, examining whether winning the toss conferred a significant advantage. I also analyzed matches where teams won the toss but ended up losing, which suggested that strategy and execution can matter more than the coin flip.

On the batting side, I explored the players with the highest overall runs, the best single match performances, and the consistency of top scorers across seasons. This gave me insights into how some players dominate not only in a single season but across multiple years. Similarly, on the bowling front, I identified bowlers with the most wickets overall, the best bowling figures in a single match, and those who consistently bowled maiden overs. These statistics showcased the bowlers who had the greatest impact in turning games around.

Fielding is often overlooked in T20 cricket, so I also studied dismissals by catches and identified the players most frequently dismissed in this manner as well as the fielders contributing the most. Additionally, I examined extras conceded by bowlers, which revealed how discipline can make a difference in close matches.

Beyond players, I analyzed umpiring trends, including the number of matches officiated by top umpires across seasons. I also studied the frequency of Duckworth-Lewis (DL) affected matches and normal results across seasons to see how external factors like weather shaped the tournament.

By combining all these perspectives, I was able to build a holistic view of IPL history. This project not only strengthened my technical skills in data analytics and visualization but also gave me the satisfaction of connecting statistical findings with real cricketing insights. It showed me how data can tell stories that go beyond numbers and bring new appreciation for the sport I enjoy.

# **GitHub Link -**

https://github.com/karthikdoguparthi/IPLAnalysis

# **Problem Statement**


The objective of this project is to analyze historical Indian Premier League (IPL) data using data visualization and statistical techniques. The project focuses on uncovering insights about players, teams, umpires, and match outcomes through ball-by-ball and match-level datasets. The aim is to identify performance trends, winning patterns, and key contributors, thereby providing a data-driven perspective on the IPL.

#### **Define Your Business Objective?**

1. Franchises & Coaches: Identify top performing players, evaluate consistency, and refine strategies (e.g., batting order, bowler usage, toss impact).

2. Sponsors & Broadcasters: Highlight star performers and engaging storylines to enhance viewer engagement.

3. Analysts & Streamers: Gain a deeper understanding of IPL trends, such as highest run scorers, wicket-takers, or impactful matches.

4. Cricket Boards (BCCI/IPL): Use data to optimize tournament formats and scheduling decisions.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that the following format is answered for each chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

### Dataset Loading

In [None]:
# Load Dataset
matches= pd.read_csv('/content/matches.csv')
deliveries= pd.read_csv('/content/deliveries.csv')

In [None]:
#Renaming column name from id to match_id
matches.rename(columns={'id':'match_id'},inplace=True)

In [None]:
# Combining Datasets
merged=pd.merge(matches,deliveries,on='match_id')

### Dataset First View

In [None]:
# Dataset First Look
merged

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
merged.shape

### Dataset Information

In [None]:
# Dataset Info
merged.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
merged.duplicated().sum()

In [None]:
# show duplicate records
merged[merged.duplicated()]

In [None]:
# Deleting duplicate records
merged.drop_duplicates(inplace=True)

In [None]:
# Dataset Duplicate Value Count
merged.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
merged.isnull().sum()

In [None]:
# print match_id where umpire1 is blank
merged[merged['umpire1'].isnull()]['match_id']

In [None]:
# Visualizing the missing values
plt.figure(figsize=(10,5))
merged.isnull().sum().plot(kind='bar')
plt.tight_layout()
plt.show()

### What did you know about your dataset?

It is a comprehensive dataset covering both macro-level (match outcomes) and micro-level (ball-by-ball events) details.

It allowed me to analyze not just who won, but how they won: by runs, by wickets, or in close finishes.

Since it had both batting and bowling data at the ball level, I could calculate advanced stats like maiden overs, highest individual scores, and top bowling figures.

The dataset also covered external influences like toss decisions, Duckworth-Lewis affected matches, and umpire involvement.

It was well structured for relational analysis since both datasets used match_id as a key, enabling smooth merging.
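
As a minimal sketch of why the shared key matters, the toy frames below (made-up rows, not the real CSV files) are merged on `match_id` in the same way as in this notebook. The `validate` argument of `pd.merge` is an optional addition here, not part of the original loading code; it is one way to confirm that each match row is unique before it is repeated across its deliveries.

```python
import pandas as pd

# Toy stand-ins for matches.csv and deliveries.csv (hypothetical rows)
toy_matches = pd.DataFrame({
    "match_id": [1, 2],
    "winner": ["Team A", "Team B"],
})
toy_deliveries = pd.DataFrame({
    "match_id": [1, 1, 2],
    "total_runs": [4, 0, 1],
})

# validate="one_to_many" raises a MergeError if match_id is duplicated
# on the matches side, which would silently multiply delivery rows
toy_merged = pd.merge(
    toy_matches, toy_deliveries,
    on="match_id", validate="one_to_many",
)

# After a clean merge there is exactly one row per delivery
print(toy_merged.shape)
print(len(toy_merged) == len(toy_deliveries))
```

If the row count after merging differs from the number of deliveries, some deliveries had no matching match row (or a key was duplicated), which is worth checking before any aggregation.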

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
merged.columns

In [None]:
# Dataset Describe
merged.describe(include='all').transpose()

### Variables Description

**Matches Dataset Columns**

id: Unique identifier for each match (renamed to match_id in this notebook before merging).

season: The year the IPL season took place.

city: The city where the match was played.

date: The date on which the match was played.

team1: The name of one of the teams that played the match.

team2: The name of the other team that played the match.

toss_winner: The team that won the toss.

toss_decision: The decision made by the toss-winning team (field or bat).

result: The result of the match (e.g., normal, tie).

dl_applied: Indicates if the Duckworth-Lewis method was applied (1 if applied, 0 if not).

winner: The team that won the match.

win_by_runs: Margin of victory in runs if the match was won by runs.

win_by_wickets: Margin of victory in wickets if the match was won by wickets.

player_of_match: The player who was awarded the Player of the Match.

venue: The name of the stadium where the match was played.

umpire1: The name of the first umpire.

umpire2: The name of the second umpire.

umpire3: The name of the third umpire (if any).

**Deliveries Dataset Columns**

match_id: Unique identifier for each match (same as in the matches dataset).

inning: The inning number (1 or 2).

batting_team: The team that is currently batting.

bowling_team: The team that is currently bowling.

over: The over number in the current inning.

ball: The ball number in the current over.

batsman: The name of the batsman facing the delivery.

non_striker: The name of the batsman at the non-striker end.

bowler: The name of the bowler delivering the ball.

is_super_over: Indicates if the ball was delivered in a Super Over (1 if true, 0 if not).

wide_runs: Number of runs scored as wides in the delivery.

bye_runs: Number of runs scored as byes in the delivery.

legbye_runs: Number of runs scored as leg byes in the delivery.

noball_runs: Number of runs scored as no-balls in the delivery.

penalty_runs: Number of penalty runs awarded in the delivery.

batsman_runs: Number of runs scored by the batsman on the delivery.

extra_runs: Total extra runs awarded in the delivery (sum of wide, bye, legbye, noball, and penalty runs).

total_runs: Total runs scored in the delivery (sum of batsman runs and extra runs).

player_dismissed: Name of the player dismissed in the delivery (if any).

dismissal_kind: The method of dismissal (e.g., caught, bowled, run-out).

fielder: Name of the fielder involved in the dismissal (if any).
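
The two derived columns described above (`extra_runs` and `total_runs`) can be sanity-checked directly. The sketch below uses a few made-up deliveries rather than the real file and assumes only the column names listed in this section.

```python
import pandas as pd

# Hypothetical deliveries using the column names described above
toy = pd.DataFrame({
    "wide_runs":    [0, 1, 0],
    "bye_runs":     [0, 0, 0],
    "legbye_runs":  [0, 0, 1],
    "noball_runs":  [0, 0, 0],
    "penalty_runs": [0, 0, 0],
    "batsman_runs": [4, 0, 0],
    "extra_runs":   [0, 1, 1],
    "total_runs":   [4, 1, 1],
})

extra_cols = ["wide_runs", "bye_runs", "legbye_runs", "noball_runs", "penalty_runs"]

# extra_runs should equal the sum of its five components
extras_ok = bool((toy[extra_cols].sum(axis=1) == toy["extra_runs"]).all())

# total_runs should equal batsman_runs + extra_runs
totals_ok = bool((toy["batsman_runs"] + toy["extra_runs"] == toy["total_runs"]).all())

print(extras_ok, totals_ok)
```

Running the same two checks on the merged frame would show whether the real data follows these definitions on every row.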


### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
merged.nunique()

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.


# Replace blanks in the umpire1, umpire2 and umpire3 columns with 'Unknown'
# (assigning the result back avoids chained-assignment issues with inplace=True)
umpire_cols = ['umpire1', 'umpire2', 'umpire3']
merged[umpire_cols] = merged[umpire_cols].fillna('Unknown')


#replace Rising Pune Supergiant with Rising Pune Supergiants in any where that occurs
merged = merged.replace("Rising Pune Supergiant", "Rising Pune Supergiants")

# Filling blanks in city column with Team 1 homeground
team_city_map = {
    "Chennai Super Kings": "Chennai",
    "Deccan Chargers": "Hyderabad",
    "Delhi Daredevils": "Delhi",
    "Gujarat Lions": "Rajkot",
    "Kings XI Punjab": "Mohali",
    "Kochi Tuskers Kerala": "Kochi",
    "Kolkata Knight Riders": "Kolkata",
    "Mumbai Indians": "Mumbai",
    "Pune Warriors": "Pune",
    "Rajasthan Royals": "Jaipur",
    "Rising Pune Supergiants": "Pune",
    "Royal Challengers Bangalore": "Bangalore",
    "Sunrisers Hyderabad": "Hyderabad"
}

# Vectorized fill: use team1's mapped home city only where city is missing
merged['city'] = merged['city'].fillna(merged['team1'].map(team_city_map))

# Replace blanks in winner and player_of_match with the placeholder 'Tie Match'
# (these are matches with no recorded winner)
merged['winner'] = merged['winner'].fillna('Tie Match')
merged['player_of_match'] = merged['player_of_match'].fillna('Tie Match')

# Ensure one row per match
unique_matches = merged.drop_duplicates(subset="match_id")

In [None]:
merged.isnull().sum()

In [None]:
# download merged csv
merged.to_csv('merged.csv',index=False)

### What all manipulations have you done and insights you found?

**Handled Missing Umpire Data**

Replaced blanks (NaN) in the umpire1, umpire2, and umpire3 columns with "Unknown".

This ensured umpire-related analysis (like number of matches officiated) wouldn’t fail due to missing values.

**Standardized Team Names**

Replaced "Rising Pune Supergiant" with "Rising Pune Supergiants".

This removed duplication of team names and prevented splitting stats between two variations of the same team.

**Imputed Missing Cities**

Used a mapping dictionary (team_city_map) to fill missing values in the city column based on Team 1’s home ground.

This gave me complete city information, which later allowed season-wise city analysis and home-ground comparisons.

**Handled Missing Match Results**

Replaced blanks in winner and player_of_match with the placeholder "Tie Match".

This allowed me to keep matches without a recorded winner in the analysis without leaving missing values, and also flagged when no single winner or Player of the Match existed.

**Ensured One Row per Match**

Used drop_duplicates(subset="match_id") to keep a single row per match, since the merged data repeats the match-level columns on every ball-by-ball row.

This was critical for match-level insights like winners, toss analysis, and season-wise counts.
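
To illustrate why the one-row-per-match step matters, the sketch below counts winners on a made-up merged frame before and after `drop_duplicates`. The team names and row counts are placeholders, not values from the dataset.

```python
import pandas as pd

# Hypothetical merged frame: match 1 has three deliveries, match 2 has two
toy = pd.DataFrame({
    "match_id": [1, 1, 1, 2, 2],
    "winner": ["Team A", "Team A", "Team A", "Team B", "Team B"],
})

# Counting on the ball-level frame counts deliveries, not matches
ball_level = toy["winner"].value_counts()

# Keeping one row per match gives the intended match-level count
match_level = toy.drop_duplicates(subset="match_id")["winner"].value_counts()

print(ball_level.to_dict())
print(match_level.to_dict())
```

The ball-level count would credit Team A with three "wins" simply because its match had more deliveries, which is why the match-level charts read from the deduplicated frame.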

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code

#Teams that won toss and won match
match_results = unique_matches[['match_id', 'toss_winner', 'winner']]

# Filter matches where same team won toss and match
won_both = match_results[match_results['toss_winner'] == match_results['winner']]

# Count per team
team_wins = won_both['winner'].value_counts()

# Plot
plt.figure(figsize=(10,6))
team_wins.plot(kind='bar', color='skyblue')
plt.title("Teams That Won Both Toss and Match", fontsize=14)
plt.xlabel("Team")
plt.ylabel("Number of Matches")
plt.xticks(rotation=45,ha ='right')
plt.tight_layout()
for i, value in enumerate(team_wins):
    plt.text(i, value, str(value), ha='center', va='bottom', fontsize=10, fontweight='bold')
plt.show()

##### 1. Why did you pick the specific chart?

I picked a bar chart because it clearly shows categorical comparisons between IPL teams in terms of how often they won both the toss and the match. Since each team is a distinct category and the number of matches is a straightforward count, a bar chart is the most effective way to represent the data. It allows quick identification of which teams have benefited the most from converting toss wins into match wins.

##### 2. What is/are the insight(s) found from the chart?

1. The Mumbai Indians lead significantly, having won both the toss and the match on 48 occasions, which indicates their ability to convert the initial advantage of the toss into a winning outcome more often than other teams.

2. Newer or less successful teams like Pune Warriors (3), Kochi Tuskers Kerala (4), and Rising Pune Supergiants (8) have very low counts, suggesting that either they did not win enough tosses or were unable to capitalize on them to secure victories.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Knowing which teams successfully convert toss wins into match wins can help franchises evaluate and refine their toss-based strategies.
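
Raw counts favour teams that have played, and won the toss in, more matches. One possible complement, sketched here on made-up rows that reuse the `toss_winner` and `winner` column names, is the share of toss wins that each team converted into a match win. The `converted` column and `conversion_rate` name are introduced only for this illustration.

```python
import pandas as pd

# Hypothetical match-level rows (one row per match)
toy = pd.DataFrame({
    "toss_winner": ["Team A", "Team A", "Team A", "Team B", "Team B"],
    "winner":      ["Team A", "Team B", "Team A", "Team B", "Team A"],
})

# True where the toss winner also won the match
toy["converted"] = toy["toss_winner"] == toy["winner"]

# The mean of a boolean column is the fraction of True values per team
conversion_rate = toy.groupby("toss_winner")["converted"].mean()

print(conversion_rate)
```

A rate like this puts long-running and short-lived franchises on the same scale, at the cost of being noisy for teams with very few toss wins.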

#### Chart - 2

In [None]:
# Chart - 2 visualization code
#Teams that won toss but lost match
match_results = unique_matches[['match_id', 'toss_winner', 'winner']]

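# Filter matches where the toss winner is not the match winner
# (note: matches with no recorded winner, labelled 'Tie Match', also fall in this group)
lost_after_toss = match_results[match_results['toss_winner'] != match_results['winner']]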

# Count per team (toss winner who lost match)
team_losses = lost_after_toss['toss_winner'].value_counts()

# Plot
plt.figure(figsize=(10,6))
team_losses.plot(kind='bar', color='salmon')
plt.title("Teams That Won Toss but Lost Match", fontsize=14)
plt.xlabel("Team")
plt.ylabel("Number of Matches")
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
for i, value in enumerate(team_losses):
    plt.text(i, value, str(value), ha='center', va='bottom', fontsize=10, fontweight='bold')
plt.show()


##### 1. Why did you pick the specific chart?

I chose a bar chart because it is the most effective way to compare categorical data across teams. It allows a clear visual comparison of how many times each IPL team won the toss but failed to win the match. Since the focus is on team-wise frequency, the bar chart provides an intuitive, easy-to-read representation.

##### 2. What is/are the insight(s) found from the chart?

1. Kings XI Punjab have the highest number of instances (40 matches) where they won the toss but lost the match. This indicates that winning the toss did not always translate into a winning strategy for them.


2. Teams like Delhi Daredevils (39) and Mumbai Indians (37) also show a high number of such occurrences, which highlights that even successful teams can sometimes fail to capitalize on the advantage of winning the toss.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

1. Franchises can use these insights to evaluate and refine their post-toss strategies.
2. Broadcasters and analysts can use such insights to create engaging narratives.

#### Chart - 3

In [None]:
# Chart - 3 visualization code
# Filter only caught dismissals
catches = merged[merged['dismissal_kind'] == 'caught']

# Count catches per fielding team
team_catches = catches['bowling_team'].value_counts()

# Plot
plt.figure(figsize=(10,6))
team_catches.plot(kind='bar', color='lightgreen')
plt.title("Total Catches by Teams", fontsize=14)
plt.xlabel("Team")
plt.ylabel("Number of Catches")
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
for i, value in enumerate(team_catches):
    plt.text(i, value, str(value), ha='center', va='bottom', fontsize=10, fontweight='bold')
plt.show()


##### 1. Why did you pick the specific chart?

I picked a bar chart because it provides a clear comparison of total catches taken by different IPL teams. Since catches are discrete counts associated with categorical variables (teams), a bar chart makes it easy to quickly identify which teams have been the most effective in the field.

##### 2. What is/are the insight(s) found from the chart?

1. Royal Challengers Bangalore (555) and Mumbai Indians (554) are almost neck and neck as the teams with the highest number of catches, showing their consistent fielding contributions across seasons.

2. Teams like Kochi Tuskers Kerala (35) and Gujarat Lions (94) have the fewest catches, which can be attributed to their short stints in the IPL, highlighting how longevity in the league directly impacts cumulative stats.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

1. Teams with high catch counts demonstrate strong fielding standards.
2. Highlighting such fielding statistics builds narratives beyond just batting and bowling.

#### Chart - 4

In [None]:
# Chart - 4 visualization code
# Filter only bowled dismissals
bowled = merged[merged['dismissal_kind'] == 'bowled']

# Count bowled wickets per bowling team
team_bowled = bowled['bowling_team'].value_counts()

# Plot
plt.figure(figsize=(10,6))
team_bowled.plot(kind='bar', color='grey')
plt.title("Total Bowled Wickets by Teams", fontsize=14)
plt.xlabel("Team")
plt.ylabel("Number of Wickets")
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
# Annotate each bar with its count
for i, value in enumerate(team_bowled):
    plt.text(i, value, str(value), ha='center', va='bottom', fontsize=10, fontweight='bold')
plt.show()


##### 1. Why did you pick the specific chart?

I selected a bar chart because it effectively showcases the number of wickets taken by each IPL team through the “bowled” mode of dismissal. Since the metric involves team-wise comparison of counts, the bar chart makes it easy to identify which teams have been most effective in bowling out batsmen directly.

##### 2. What is/are the insight(s) found from the chart?

1. Mumbai Indians lead with 192 bowled wickets, which points to the accuracy of their bowling attack over the years, although the total also reflects their extended participation across seasons.

2. Teams like Kochi Tuskers Kerala (16) and Gujarat Lions (21) have the lowest number of bowled wickets, which aligns with their short stints in the IPL, showing that longevity in the league plays a big role in accumulating such stats.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

1. This data helps franchises understand the bowling strengths of different teams.
2. Bowled dismissals are considered one of the most exciting moments in cricket.

#### Chart - 5

In [None]:
# Chart - 5 visualization code
# Calculate total runs scored by each team
team_runs = merged.groupby('batting_team')['total_runs'].sum().sort_values(ascending=False)

# Plot
plt.figure(figsize=(12,6))
team_runs.plot(kind='bar', color='orange')
plt.title("Total Runs Scored by Teams", fontsize=14)
plt.xlabel("Team")
plt.ylabel("Total Runs")
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
for i, value in enumerate(team_runs):
    plt.text(i, value, str(value), ha='center', va='bottom', fontsize=10, fontweight='bold')
plt.show()


##### 1. Why did you pick the specific chart?

I chose a bar chart because it is the most effective way to compare the aggregate runs scored by different IPL teams. Since runs are cumulative numerical values and teams are categorical variables, a bar chart clearly highlights which teams have been the most prolific scorers in the league.

##### 2. What is/are the insight(s) found from the chart?

1. The Mumbai Indians (24,520 runs) lead all teams, reflecting both their long term consistency in batting and their extended participation across seasons.

2. At the bottom, short-lived franchises like Kochi Tuskers Kerala (1,901 runs), Rising Pune Supergiants (4,533 runs), and Gujarat Lions (4,862 runs) scored the fewest runs, which is expected given their brief presence in the league.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

1. Teams with lower run totals can identify gaps in their batting lineups and invest in stronger top-order or power-hitting players during auctions.
2. This creates exciting storylines that boost fan loyalty and viewer engagement, ultimately driving higher commercial returns.

#### Chart - 6

In [None]:
# Chart - 6 visualization code
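# Calculate total runs scored while each batsman was on strike
# (note: total_runs includes extras on the delivery; batsman_runs holds runs off the bat only)
batsman_runs = merged.groupby('batsman')['total_runs'].sum().sort_values(ascending=False)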

# Plot top 10 batsmen
plt.figure(figsize=(10,6))
batsman_runs.head(10).plot(kind='bar', color='purple')
plt.title("Top 10 Run Scorers", fontsize=14)
plt.xlabel("Batsman")
plt.ylabel("Total Runs")
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
for i, value in enumerate(batsman_runs.head(10)):
    plt.text(i, value, str(value), ha='center', va='bottom', fontsize=10, fontweight='bold')
plt.show()


##### 1. Why did you pick the specific chart?

I selected a bar chart because it provides the clearest and most direct way to compare players by total runs. Since each player is a distinct category and the values are numerical, a bar chart allows quick identification of top performers.

##### 2. What is/are the insight(s) found from the chart?

1. Suresh Raina leads the list with 4745 runs, highlighting his consistency as a top-order batsman across multiple seasons.

2. Virat Kohli (4588) and Rohit Sharma (4334) follow closely, showing that these players have been central to their teams’ batting success.
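
According to the variable descriptions, `total_runs` is the sum of `batsman_runs` and `extra_runs`, so summing it per batsman also counts extras bowled while that batsman was on strike. The sketch below, on made-up deliveries with placeholder player names, shows how the two aggregations can differ for the same player.

```python
import pandas as pd

# Hypothetical deliveries; the second ball is a wide while Player X is on strike
toy = pd.DataFrame({
    "batsman":      ["Player X", "Player X", "Player Y", "Player Y"],
    "batsman_runs": [4, 0, 1, 6],
    "extra_runs":   [0, 1, 0, 0],
})
toy["total_runs"] = toy["batsman_runs"] + toy["extra_runs"]

# Sum both columns per batsman to compare the two definitions of "runs"
per_batsman = toy.groupby("batsman")[["batsman_runs", "total_runs"]].sum()

print(per_batsman)
```

For a leaderboard of runs credited to the batsman, grouping on `batsman_runs` is the stricter choice; the ranking may or may not change, but the totals will be lower wherever extras occurred.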

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

1. From the batting and bowling insights, franchises can use this data to make informed decisions during auctions, targeting consistent run scorers and wicket takers who provide proven value.

2. The analysis of extras conceded highlights a key area of improvement: discipline in bowling. Addressing this through training and analytics can save crucial runs, potentially altering match outcomes and improving a team’s win rate.

#### Chart - 7

In [None]:
# Chart - 7 visualization code
# Calculate total extras conceded per bowler
bowler_extras = merged.groupby('bowler')['extra_runs'].sum().sort_values(ascending=False)

# Plot top 10 bowlers
plt.figure(figsize=(10,6))
bowler_extras.head(10).plot(kind='bar', color='red')
plt.title("Top 10 Bowlers Who Conceded Most Extras", fontsize=14)
plt.xlabel("Bowler")
plt.ylabel("Total Extras Conceded")
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
for i, value in enumerate(bowler_extras.head(10)):
    plt.text(i, value, str(value), ha='center', va='bottom', fontsize=10, fontweight='bold')
plt.show()


##### 1. Why did you pick the specific chart?

I selected a bar chart because it provides the clearest and most direct way to compare bowlers by extras conceded. Since each player is a distinct category and the values are numerical, a bar chart allows quick identification of the least disciplined bowlers.

##### 2. What is/are the insight(s) found from the chart?

1. Lasith Malinga tops the list with 268 extras conceded, which shows the flip side of his success as a bowler: he has been expensive in terms of discipline.

2. Prominent names like P Kumar (236) and U Yadav (186) also appear, reflecting areas where bowlers need to improve control to reduce free runs.
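
One caveat on this metric: `extra_runs` also includes byes and leg byes, which are conventionally not charged to the bowler. A hedged alternative, sketched on made-up rows, is to count only wides and no-balls per bowler. The `bowler_fault_extras` column is a name introduced here for illustration, and the bowler names are placeholders.

```python
import pandas as pd

# Hypothetical deliveries using the extras columns from the deliveries dataset
toy = pd.DataFrame({
    "bowler":       ["Bowler P", "Bowler P", "Bowler Q", "Bowler Q"],
    "wide_runs":    [1, 0, 0, 2],
    "noball_runs":  [0, 1, 0, 0],
    "bye_runs":     [0, 0, 4, 0],
    "legbye_runs":  [0, 0, 0, 0],
    "penalty_runs": [0, 0, 0, 0],
})
extra_cols = ["wide_runs", "noball_runs", "bye_runs", "legbye_runs", "penalty_runs"]
toy["extra_runs"] = toy[extra_cols].sum(axis=1)

# Only wides and no-balls are treated as the bowler's responsibility here
toy["bowler_fault_extras"] = toy["wide_runs"] + toy["noball_runs"]

summary = toy.groupby("bowler")[["extra_runs", "bowler_fault_extras"]].sum()
print(summary)
```

In this toy data Bowler Q looks far worse on `extra_runs` because of four byes, while both bowlers are level once only wides and no-balls are counted.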

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

1. From the batting and bowling insights, franchises can use this data to make informed decisions during auctions, targeting consistent run scorers and wicket takers who provide proven value.

2. The analysis of extras conceded highlights a key area of improvement: discipline in bowling. Addressing this through training and analytics can save crucial runs, potentially altering match outcomes and improving a team’s win rate.

#### Chart - 8

In [None]:
# Chart - 8 visualization code
# Filter only dismissals credited to bowlers
valid_dismissals = merged[
    merged['dismissal_kind'].isin([
        'bowled', 'caught', 'lbw', 'stumped', 'caught and bowled', 'hit wicket'])]

# Count dismissals per bowler
bowler_wickets = valid_dismissals.groupby('bowler')['player_dismissed'].count().sort_values(ascending=False)

# Plot top 10 bowlers
plt.figure(figsize=(10,6))
bowler_wickets.head(10).plot(kind='bar', color='teal')
plt.title("Top 10 Bowlers by Dismissals", fontsize=14)
plt.xlabel("Bowler")
plt.ylabel("Total Dismissals")
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
for i,value in enumerate(bowler_wickets.head(10)):
    plt.text(i, value, str(value), ha='center', va='bottom', fontsize=10, fontweight='bold')
plt.show()


##### 1. Why did you pick the specific chart?

I selected a bar chart because it provides the clearest and most direct way to compare bowlers by dismissals. Since each player is a distinct category and the values are numerical, a bar chart allows quick identification of top performers.

##### 2. What is/are the insight(s) found from the chart?

1. Again, Lasith Malinga dominates with 154 dismissals, proving his immense value as a strike bowler despite conceding extras.

2. Amit Mishra (134) and Harbhajan Singh (127) also rank high, emphasizing the importance of spinners in IPL success.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

1. From the batting and bowling insights, franchises can use this data to make informed decisions during auctions, targeting consistent run scorers and wicket takers who provide proven value.

2. The analysis of extras conceded highlights a key area of improvement: discipline in bowling. Addressing this through training and analytics can save crucial runs, potentially altering match outcomes and improving a team’s win rate.

#### Chart - 9

In [None]:
# Chart - 9 visualization code
# Count Man of the Match awards per player
mom_counts = unique_matches['player_of_match'].value_counts()

# Plot top 10 players
plt.figure(figsize=(10,6))
mom_counts.head(10).plot(kind='bar', color='gold')
plt.title("Top 10 Players by Man of the Match Awards", fontsize=14)
plt.xlabel("Player")
plt.ylabel("Number of Awards")
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
for i, value in enumerate(mom_counts.head(10)):
    plt.text(i, value, str(value), ha='center', va='bottom', fontsize=10, fontweight='bold')
plt.show()

##### 1. Why did you pick the specific chart?

The chart showcasing the top 10 players by Man of the Match awards was chosen because it clearly highlights individual performances and player impact within a competitive environment. This type of visualization helps in quickly identifying star players whose contributions are consistently recognized, which is a key metric in sports analytics.

##### 2. What is/are the insight(s) found from the chart?

1. Chris Gayle stands out with the highest number of Man of the Match awards (18).
2. The difference in awards between subsequent players (e.g., between Pathan, Warner, and de Villiers) is quite narrow.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

1. Identifying players with the most Man of the Match awards can guide teams and sponsors in investing in high-impact individuals.
2. Recognizing the most popular and impactful players enables tailored marketing campaigns and merchandise strategies.

#### Chart - 10

In [None]:
# Chart - 10 visualization code
#Most wide runs by bowler

# Count wide runs per bowler
bowler_wide_runs = merged.groupby('bowler')['wide_runs'].sum().sort_values(ascending=False)

# Plot
plt.figure(figsize=(10,6))
bowler_wide_runs.head(10).plot(kind='bar', color='green')
plt.title("Top 10 Bowlers with Most Wide Runs", fontsize=14)
plt.xlabel("Bowler")
plt.ylabel("Total Wide Runs")
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
for i, value in enumerate(bowler_wide_runs.head(10)):
    plt.text(i, value, str(value), ha='center', va='bottom', fontsize=10, fontweight='bold')
plt.show()

##### 1. Why did you pick the specific chart?

This chart was chosen because it reveals which bowlers have conceded the highest number of wide runs, an important metric in understanding discipline and control in bowling, key aspects that influence match outcomes.

##### 2. What is/are the insight(s) found from the chart?

1. P Kumar and SL Malinga stand out, with 144 and 143 wide runs respectively, indicating recurring issues with control or aggressive tactics that result in more extras.

2. After the top two, the numbers drop significantly (DW Steyn at 105), showing a considerable gap in wide runs given between the leaders and other bowlers.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

1. Teams can target specific bowlers for remedial coaching, improving discipline and reducing unnecessary runs.
2. Identifying bowlers with chronic issues can prevent teams from making costly recruitment decisions.

#### Chart - 11

In [None]:
# Chart - 11 visualization code
# Result of the Match
result_type = unique_matches['result'].value_counts()

#Plot
plt.figure(figsize=(10,6))
result_type.plot(kind='bar', color='seagreen')
plt.title("Result Types", fontsize=14)
plt.xlabel("Result")
plt.ylabel("Number of Matches")
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
for i, value in enumerate(result_type):
    plt.text(i, value, str(value), ha='center', va='bottom', fontsize=10, fontweight='bold')
plt.show()

##### 1. Why did you pick the specific chart?

This chart provides an at-a-glance breakdown of match results, helping stakeholders quickly gauge how often matches end in a normal result, a tie, or no result.

##### 2. What is/are the insight(s) found from the chart?

1. The overwhelming majority of matches conclude with normal results (626), with very few ties (7) and no results (3).
2. The very low count of "no result" matches suggests that scheduling and contingency planning have been effective, although the chart alone cannot confirm the cause.
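
The counts above can also be expressed as shares of all matches. This is a small worked example that uses only the three figures quoted in the insight; the series labels are illustrative.

```python
import pandas as pd

# Counts quoted in the insight above (labels are illustrative)
result_counts = pd.Series({'normal': 626, 'tie': 7, 'no result': 3})

# Percentage share of each result type
result_share = (result_counts / result_counts.sum() * 100).round(2)
print(result_share)
```

On the notebook's data, `unique_matches['result'].value_counts(normalize=True)` would give the same proportions directly.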

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

1. Organizers can confidently schedule more matches, knowing that completion rates are high, which reassures broadcasters, advertisers, and ticket buyers.

2. The data offers proof to stakeholders and insurers that event disruptions are rare.

#### Chart - 12

In [None]:
# Chart - 12 visualization code

# Ensure one row per match and filter only matches where DL method was applied
dl_matches = unique_matches[unique_matches['dl_applied'] == 1]

# Count how many times each team was involved in DL matches
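# Use add(..., fill_value=0) so a team that appears in only one column is not turned into NaN
dl_counts = dl_matches['team1'].value_counts().add(dl_matches['team2'].value_counts(), fill_value=0).astype(int).sort_values(ascending=False)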

# Plot
plt.figure(figsize=(10,6))
dl_counts.plot(kind='bar', color='steelblue')
plt.title("Number of Matches Affected by DL Method (per Team)", fontsize=14)
plt.xlabel("Team")
plt.ylabel("DL-Applied Matches")
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
for i, value in enumerate(dl_counts):
    plt.text(i, value, str(value), ha='center', va='bottom', fontsize=10, fontweight='bold')
plt.show()



##### 1. Why did you pick the specific chart?

This chart highlights which teams experienced the most matches affected by the Duckworth-Lewis (DL) method, revealing trends in weather or interruptions that influence tournament fairness and strategic planning.

##### 2. What is/are the insight(s) found from the chart?

1. Kolkata Knight Riders encountered DL conditions in 7 matches, more than any other team, signalling a potential geographic or scheduling pattern.

2. Several teams experienced DL interventions (ranging from 3 to 7 matches), but not all teams are equally affected.
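
Per-team counts like these depend on combining the `team1` and `team2` columns carefully. The sketch below uses made-up team names to show why adding two `value_counts()` results with a plain `+` can lose teams, and how stacking the columns with `pd.concat` avoids that.

```python
import pandas as pd

# Toy stand-in for the DL-affected rows of `unique_matches` (team names are placeholders)
toy_dl = pd.DataFrame({
    'team1': ['X', 'X', 'Y'],
    'team2': ['Y', 'Z', 'Z'],
})

# Stack both columns so every appearance is counted, whichever side the team was listed on
appearances = pd.concat([toy_dl['team1'], toy_dl['team2']]).value_counts()

# Plain `+` aligns on the index and yields NaN for teams missing from either column
naive = toy_dl['team1'].value_counts() + toy_dl['team2'].value_counts()

print(appearances)
print(naive)
```

In the toy data every team plays two matches, yet the naive sum keeps a valid count only for the team present in both columns.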


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

1. Organizers can adjust future schedules or venue selections to minimize weather-affected matches and improve fairness.
2. Teams heavily affected can better inform sponsors and fans about contingencies.

#### Chart - 13

In [None]:
# Chart - 13 visualization code
# Ensure one row per match and filter only matches won by runs
won_by_runs = unique_matches[unique_matches['win_by_runs'] > 0]

# Count number of such wins per team and sort descending
runs_wins = won_by_runs['winner'].value_counts().sort_values(ascending=False)

# Plot
plt.figure(figsize=(10,6))
runs_wins.plot(kind='bar', color='navy')
plt.title("Matches Won by Runs (per Team)", fontsize=14)
plt.xlabel("Team")
plt.ylabel("Matches Won by Runs")
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
for i, value in enumerate(runs_wins):
    plt.text(i, value, str(value), ha='center', va='bottom', fontsize=10, fontweight='bold')
plt.show()


##### 1. Why did you pick the specific chart?

This chart was selected because it highlights the teams that have most frequently won matches by defending a total, which is a direct indicator of bowling and fielding strength as well as strategic acumen in high-pressure situations.

##### 2. What is/are the insight(s) found from the chart?

1. Mumbai Indians (47) and Chennai Super Kings (46) have the most wins by runs.

2. There is a substantial difference between the top teams and others, with third-placed Kings XI Punjab winning only 32 times by runs.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

1. Teams with strong defensive records can attract sponsors and fans.
2. Lower ranked teams can identify a need to improve their defending capabilities.

#### Chart - 14

In [None]:
# Chart - 14 visualization code
# Ensure one row per match and filter only matches won by wickets
won_by_wickets = unique_matches[unique_matches['win_by_wickets'] > 0]

# Count number of such wins per team and sort descending
wicket_wins = won_by_wickets['winner'].value_counts().sort_values(ascending=False)

# Plot
plt.figure(figsize=(10,6))
wicket_wins.plot(kind='bar', color='pink')
plt.title("Matches Won by Wickets (per Team)", fontsize=14)
plt.xlabel("Team")
plt.ylabel("Matches Won by Wickets")
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
for i, value in enumerate(wicket_wins):
    plt.text(i, value, str(value), ha='center', va='bottom', fontsize=10, fontweight='bold')
plt.show()

##### 1. Why did you pick the specific chart?

This chart reveals which teams are most successful in chasing totals, showcasing the batting depth and composure of line-ups under pressure and aiding both fans and management in understanding team strengths.

##### 2. What is/are the insight(s) found from the chart?

1. Kolkata Knight Riders Lead in Chases: Kolkata Knight Riders (46) top the list for wins while chasing, suggesting they excel under scoreboard pressure.

2. Multiple teams (e.g., MI, RCB, DD, RR) have 40+ wicket wins.

#### Chart - 15

In [None]:
# Chart - 15 visualization code

# Calculate runs conceded per over by each bowler in each match
over_runs = (merged.groupby(['match_id', 'bowler', 'over'])['total_runs'].sum().reset_index())

# Filter only maiden overs and Count maiden overs per bowler
maiden_overs = over_runs[over_runs['total_runs'] == 0]
bowler_maidens = maiden_overs['bowler'].value_counts().sort_values(ascending=False)

# Plot
plt.figure(figsize=(10,6))
bowler_maidens.head(10).plot(kind='bar', color='teal')
plt.title("Top 10 Bowlers by Maiden Overs", fontsize=14)
plt.xlabel("Bowler")
plt.ylabel("Maiden Overs")
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
for i, value in enumerate(bowler_maidens.head(10)):
    plt.text(i, value, str(value), ha='center', va='bottom', fontsize=10, fontweight='bold')
plt.show()


##### 1. Why did you pick the specific chart?

This chart was chosen because it displays which bowlers are most successful at delivering maiden overs, a critical metric for pressure-building and game control in T20 cricket.


##### 2. What is/are the insight(s) found from the chart?

1. P Kumar’s Outstanding Record: P Kumar tops the chart with 12 maiden overs.

2. Only a select group of bowlers (IK Pathan, Malinga, Steyn) have delivered more than 8 maiden overs.
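
The counts above treat any over with zero `total_runs` as a maiden. A stricter definition requires six legal deliveries and ignores byes and leg-byes, which are not charged to the bowler. The sketch below is one way to express that, assuming the deliveries data has `batsman_runs`, `wide_runs`, and `noball_runs` columns (`noball_runs` is an assumption here, since it is not used elsewhere in this notebook); `strict_maidens` is a hypothetical helper name.

```python
import pandas as pd

def strict_maidens(df):
    """Count overs with six legal balls and no runs charged to the bowler."""
    d = df.copy()
    # Runs charged to the bowler: off the bat, wides and no-balls
    d['bowler_runs'] = d['batsman_runs'] + d['wide_runs'] + d['noball_runs']
    # A delivery is legal when it is neither a wide nor a no-ball
    d['legal'] = (d['wide_runs'] == 0) & (d['noball_runs'] == 0)
    overs = d.groupby(['match_id', 'inning', 'bowler', 'over']).agg(
        runs=('bowler_runs', 'sum'),
        legal_balls=('legal', 'sum'),
    )
    maidens = overs[(overs['runs'] == 0) & (overs['legal_balls'] == 6)]
    return maidens.groupby('bowler').size()

# Toy data: bowler A completes a scoreless over, bowler B bowls only three scoreless balls
toy = pd.DataFrame({
    'match_id':     [1] * 9,
    'inning':       [1] * 9,
    'bowler':       ['A'] * 6 + ['B'] * 3,
    'over':         [1] * 6 + [2] * 3,
    'batsman_runs': [0] * 9,
    'wide_runs':    [0] * 9,
    'noball_runs':  [0] * 9,
})

maiden_counts = strict_maidens(toy)
print(maiden_counts)
```

Only bowler A is credited with a maiden here, because bowler B's over is incomplete. Applying the stricter rule to the real data may change the counts shown in the chart.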



#### Chart - 16

In [None]:
# Chart - 16 visualization code

# Filter only caught dismissals
caught_dismissals = merged[merged['dismissal_kind'].isin(['caught', 'caught and bowled'])]

# Count dismissals per batsman
caught_counts = caught_dismissals['player_dismissed'].value_counts().sort_values(ascending=False)

# Plot top 10
plt.figure(figsize=(10,6))
caught_counts.head(10).plot(kind='bar', color='skyblue')
plt.title("Top 10 Batsmen Most Dismissed by Catches", fontsize=14)
plt.xlabel("Batsman")
plt.ylabel("Number of Caught Dismissals")
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
for i, value in enumerate(caught_counts.head(10)):
    plt.text(i, value, str(value), ha='center', va='bottom', fontsize=10, fontweight='bold')
plt.show()


##### 1. Why did you pick the specific chart?

This chart reveals which batsmen are most frequently dismissed by being caught, an important indicator of batting style, risk-taking, and potentially areas for technical improvement.


##### 2. What is/are the insight(s) found from the chart?

1. SK Raina leads significantly with 101 dismissals by catches, suggesting an aggressive or high-risk style at the crease.

2. Many top run-getters (Uthappa, Sharma, Gambhir) also feature, suggesting that heavy run-scoring is often paired with more chances taken and therefore more caught dismissals.

#### Chart - 17

In [None]:
# Chart - 17 visualization code

# Filter retired dismissals
retired_dismissals = merged[merged['dismissal_kind'].str.contains('retired', case=False, na=False)]

# Count players who were retired
retired_players = retired_dismissals['player_dismissed'].value_counts()

# Plot
if not retired_players.empty:
    plt.figure(figsize=(8,5))
    retired_players.plot(kind='bar', color='orange')
    plt.title("Players Retired Hurt")
    plt.xlabel("Player")
    plt.ylabel("Number of Retired Dismissals")
    plt.xticks(rotation=45, ha='right')
    plt.tight_layout()
    for i, value in enumerate(retired_players):
        plt.text(i, value, str(value), ha='center', va='bottom', fontsize=10, fontweight='bold')
    plt.show()

##### 1. Why did you pick the specific chart?

This chart was picked because it highlights the rare but impactful event of players retiring hurt, providing insights into player vulnerability and physical risk areas in the game.

##### 2. What is/are the insight(s) found from the chart?

1. Harbhajan Singh and SR Tendulkar top the list, each with more retired-hurt instances than the other listed players, who have only one each.

2. Retired hurt dismissals affect both legendary and emerging players, indicating no clear link to playing role or experience but rather the unpredictable nature of injuries.

#### Chart - 18

In [None]:
# Chart - 18 visualization code

# Season with most runs
season_runs = merged.groupby('season')['total_runs'].sum().sort_values(ascending=False)

# Plot
plt.figure(figsize=(10,6))
season_runs.plot(kind='bar', color='purple')
plt.title("Total Runs Scored per Season", fontsize=14)
plt.xlabel("Season")
plt.ylabel("Total Runs")
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
for i, value in enumerate(season_runs):
    plt.text(i, value + 200, str(value), ha='center', va='bottom', fontsize=10, fontweight='bold')

plt.show()

##### 1. Why did you pick the specific chart?

This chart was chosen because it provides a season-wise breakdown of run-scoring trends, essential for understanding the evolution of the tournament’s scoring dynamics and batting dominance.


##### 2. What is/are the insight(s) found from the chart?

1. The highest total runs were scored in the 2013 (22,602) and 2012 (22,453) seasons, showing those as exceptionally high-scoring years.

2. There is a general upward trend in run scoring from 2008 to 2013, reflecting evolving strategies or better batting.
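
Season totals depend on how many matches were played, and a chart sorted by value hides the chronological order. The sketch below shows one way to compute runs per match in season order, assuming `season`, `match_id`, and `total_runs` columns as in `merged`; the toy values are made up.

```python
import pandas as pd

# Toy stand-in for the ball-by-ball `merged` frame (values are made up)
toy = pd.DataFrame({
    'season':     [2008, 2008, 2008, 2009, 2009],
    'match_id':   [1, 1, 2, 3, 3],
    'total_runs': [4, 6, 2, 1, 3],
})

per_season = toy.groupby('season').agg(
    runs=('total_runs', 'sum'),
    matches=('match_id', 'nunique'),
)
per_season['runs_per_match'] = per_season['runs'] / per_season['matches']

# Sort by season rather than by value so trends read left to right in time
per_season = per_season.sort_index()
print(per_season)
```

Plotting `per_season['runs_per_match']` would separate genuine changes in scoring from seasons that simply had more fixtures.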

#### Chart - 19

In [None]:
# Chart - 19 visualization code

# Season with most wickets
season_wickets = merged.groupby('season')['player_dismissed'].count().sort_values(ascending=False)

# Plot
plt.figure(figsize=(10,6))
season_wickets.plot(kind='bar', color='cyan')
plt.title("Total Dismissals per Season", fontsize=14)
plt.xlabel("Season")
plt.ylabel("Total Dismissals")
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
for i, value in enumerate(season_wickets):
    plt.text(i, value, str(value), ha='center', va='bottom', fontsize=10, fontweight='bold')
plt.show()

##### 1. Why did you pick the specific chart?

This chart was selected to illustrate trends in the number of dismissals per season, which offers a lens into the balance between bat and ball, and the effectiveness of bowlers and fielders over time.


##### 2. What is/are the insight(s) found from the chart?

1. The 2013 season saw 912 dismissals, the most across all seasons, suggesting either more matches or more aggressive play styles.

2. Most seasons after 2010 show a relatively stable number of dismissals (~660–720), reflecting the establishment of a mature competition structure.

#### Chart - 20

In [None]:
# Chart - 20 visualization code

# Combine umpire1 and umpire2 into a single series
umpires = pd.concat([unique_matches['umpire1'], unique_matches['umpire2']]).value_counts().sort_values(ascending=False)

# Plot
plt.figure(figsize=(10,6))
umpires.head(10).plot(kind='bar', color='brown')
plt.title("Top 10 Umpires by Matches Officiated", fontsize=14)
plt.xlabel("Umpire")
plt.ylabel("Number of Matches")
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
for i, value in enumerate(umpires.head(10)):
    plt.text(i, value, str(value), ha='center', va='bottom', fontsize=10, fontweight='bold')
plt.show()

##### 1. Why did you pick the specific chart?

This chart was chosen because umpire experience and consistency materially affect match quality, making it valuable to identify the umpires appointed most often.


##### 2. What is/are the insight(s) found from the chart?

1. HDPK Dharmasena and S Ravi are the most experienced, having officiated over 85 matches each.

2. There is a notable drop to third place (AK Chaudhary with 58 matches), highlighting a significant experience gap between the leading pair and the rest.

#### Chart - 21

In [None]:
# Chart - 21 visualization code

# Most runs in 1st innings in a season
first_innings_runs = merged[merged['inning'] == 1].groupby('season')['total_runs'].sum().sort_values(ascending=False)

#plot
plt.figure(figsize=(10,6))
first_innings_runs.plot(kind='bar', color='green')
plt.title("Total Runs in 1st Innings per Season", fontsize=14)
plt.xlabel("Season")
plt.ylabel("Total Runs in 1st Innings")
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
for i, value in enumerate(first_innings_runs):
    plt.text(i, value, str(value), ha='center', va='bottom', fontsize=10, fontweight='bold')
plt.show()

##### 1. Why did you pick the specific chart?

This chart was picked because it gives an overview of how scoring in the first innings has evolved across seasons, helping analysts understand batting trends and conditions for teams setting targets.


##### 2. What is/are the insight(s) found from the chart?

1. The highest total runs in the first innings occurred in 2013 (11,848) and 2012 (11,658).

2. From 2011 to 2017 (excluding 2014), first-innings runs consistently remained above 9,700.

#### Chart - 22

In [None]:
# Chart - 22 visualization code

# Most runs in 2nd innings in a season
second_innings_runs = merged[merged['inning'] == 2].groupby('season')['total_runs'].sum().sort_values(ascending=False)

# Plot
plt.figure(figsize=(10,6))
second_innings_runs.plot(kind='bar', color='orange')
plt.title("Total Runs in 2nd Innings per Season", fontsize=14)
plt.xlabel("Season")
plt.ylabel("Total Runs in 2nd Innings")
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
for i, value in enumerate(second_innings_runs):
    plt.text(i, value, str(value), ha='center', va='bottom', fontsize=10, fontweight='bold')
plt.show()

##### 1. Why did you pick the specific chart?

This chart was chosen because it highlights trends in teams' performance while chasing, essential for understanding evolving strategies and pressure scenarios in the second innings.


##### 2. What is/are the insight(s) found from the chart?

1. The 2012 (10,795) and 2013 (10,693) seasons had the most second-innings runs, in line with the overall high-scoring nature of those years.

2. There is a visible drop after 2013, suggesting possible improvements in bowling under pressure or changing pitch dynamics in later seasons.

#### Chart - 23

In [None]:
# Chart - 23 visualization code

# Filter only matches with normal result
normal_matches = unique_matches[unique_matches['result'] == 'normal']

# Count matches per season
normal_counts = normal_matches['season'].value_counts().sort_values(ascending=False)

# Plot
plt.figure(figsize=(10,6))
normal_counts.plot(kind='bar', color='teal')
plt.title("Normal Result Matches per Season", fontsize=14)
plt.xlabel("Season")
plt.ylabel("Number of Normal Result Matches")
plt.xticks(rotation=45, ha='right')

# Add labels
for i, value in enumerate(normal_counts):
    plt.text(i, value + 0.5, str(value), ha='center', fontsize=10, fontweight='bold')

plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

This chart was selected because it visualizes the reliability of the league in delivering undisturbed matches, which matters to fans, broadcasters, and event planners.


##### 2. What is/are the insight(s) found from the chart?

1. Most seasons have 56–74 matches ending with normal results, with 2012 and 2013 at the peak (74 each).

2. Occasional dips (e.g., 56 in 2015 and 2009) could be analyzed further for causes, such as weather disruptions or scheduling tweaks.
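
Because seasons differ in length, a dip in the raw count may simply reflect fewer fixtures. A share of normal results per season is one way to check this. The sketch below uses `pd.crosstab` with `normalize='index'` on made-up data, assuming `season` and `result` columns as in `unique_matches`; the `'tie'` and `'no result'` labels are illustrative.

```python
import pandas as pd

# Toy stand-in for `unique_matches` (values are made up)
toy_matches = pd.DataFrame({
    'season': [2009, 2009, 2009, 2010, 2010],
    'result': ['normal', 'normal', 'no result', 'normal', 'tie'],
})

# Row-normalised table: each season's results sum to 1
result_share = pd.crosstab(toy_matches['season'], toy_matches['result'], normalize='index')

# Share of matches per season that ended with a normal result, in season order
normal_share = result_share['normal'].sort_index()
print(normal_share)
```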

#### Chart - 24

In [None]:
# Chart - 24 visualization code
# Calculate runs scored by each batsman in each match
player_match_runs = deliveries.groupby(['match_id','batsman'])['batsman_runs'].sum().reset_index()

# Get Top 10 highest scores
# .copy() avoids a SettingWithCopyWarning when the label column is added below
top10_innings = player_match_runs.sort_values('batsman_runs', ascending=False).head(10).copy()

# Create a unique label: Player + Match ID
top10_innings['label'] = top10_innings['batsman'] + " (Match " + top10_innings['match_id'].astype(str) + ")"

# Plot
plt.figure(figsize=(10,6))
bars = plt.bar(top10_innings['label'], top10_innings['batsman_runs'], color='crimson')

plt.title("Top 10 Highest Individual Scores in a Single Match", fontsize=14)
plt.xlabel("Player (with Match ID)")
plt.ylabel("Runs")
plt.xticks(rotation=45, ha='right')

# Annotate values on top of bars
for bar in bars:
    plt.text(bar.get_x() + bar.get_width()/2,
             bar.get_height(),
             str(bar.get_height()),
             ha='center', va='bottom', fontsize=10, fontweight='bold')

plt.tight_layout()
plt.show()



##### 1. Why did you pick the specific chart?

This chart was picked because individual big scores are standout performances that shape the narrative and excitement of the league, often drawing fans’ attention and impacting match outcomes.

##### 2. What is/are the insight(s) found from the chart?

1. CH Gayle’s 175 is by far the highest individual score, an outlier that sets a league benchmark for explosive batting.

2. Several other players, including de Villiers, Sehwag, and Warner, also appear with very large scores, reflecting exceptional individual batting.

#### Chart - 25

In [None]:
# Chart - 25 visualization code
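# Filter only wickets credited to the bowler (exclude run outs, retired hurt, etc.)
not_bowler_wickets = ['run out', 'retired hurt', 'obstructing the field']
valid_wickets = merged[merged['dismissal_kind'].notnull() & ~merged['dismissal_kind'].isin(not_bowler_wickets)]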

# Count wickets per bowler per match
bowler_match_wickets = valid_wickets.groupby(['match_id','bowler'])['dismissal_kind'].count().reset_index()
bowler_match_wickets.rename(columns={'dismissal_kind':'wickets'}, inplace=True)

# Get Top 10 bowling performances
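top10_bowling = bowler_match_wickets.sort_values('wickets', ascending=False).head(10).copy()

# Create a unique label (Bowler + Match ID) so a bowler appearing twice gets separate bars
top10_bowling['label'] = top10_bowling['bowler'] + " (Match " + top10_bowling['match_id'].astype(str) + ")"

# Plot
plt.figure(figsize=(10,6))
bars = plt.bar(top10_bowling['label'], top10_bowling['wickets'], color='navy')

plt.title("Top 10 Bowling Performances (Most Wickets in a Match)", fontsize=14)
plt.xlabel("Bowler (with Match ID)")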
plt.ylabel("Wickets")
plt.xticks(rotation=45, ha='right')
for bar in bars:
    plt.text(bar.get_x() + bar.get_width()/2, bar.get_height(), str(bar.get_height()), ha='center', va='bottom', fontsize=10, fontweight='bold')

plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

This chart was chosen because it highlights the top individual bowling performances, specifically the most wickets taken in a single match, which is a key metric for evaluating game-changing impact and identifying standout bowlers in the league.


##### 2. What is/are the insight(s) found from the chart?

1. Adam Zampa and Sohail Tanvir are the only bowlers to have taken 6 wickets in a match, distinguishing themselves from others who have a maximum of 5 wickets in their best performances.

2. Eight other bowlers (Kumble, Patel, Sharma, Harbhajan Singh, Tye, Unadkat, Faulkner, Narine) each have five wickets in a match, showcasing the high caliber and competitiveness among top bowlers in the tournament.
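
Wicket counts alone cannot separate the many five-wicket hauls. Bowling figures are conventionally ranked by wickets first and then by fewest runs conceded. The sketch below illustrates that tie-break on made-up data; `bowling_figures` is a hypothetical helper, and the `noball_runs` column is an assumption about the deliveries data.

```python
import pandas as pd

def bowling_figures(df, not_bowler=('run out', 'retired hurt', 'obstructing the field')):
    """Return per-match bowling figures sorted by most wickets, then fewest runs."""
    d = df.copy()
    # Only dismissals credited to the bowler count as wickets
    d['wicket'] = d['dismissal_kind'].notna() & ~d['dismissal_kind'].isin(not_bowler)
    # Runs charged to the bowler; byes and leg-byes are excluded
    d['bowler_runs'] = d['batsman_runs'] + d['wide_runs'] + d['noball_runs']
    figs = d.groupby(['match_id', 'bowler']).agg(
        wickets=('wicket', 'sum'),
        runs=('bowler_runs', 'sum'),
    ).reset_index()
    return figs.sort_values(['wickets', 'runs'], ascending=[False, True])

# Toy data: both bowlers take two wickets, but B concedes fewer runs
toy = pd.DataFrame({
    'match_id':       [1, 1, 1, 1, 1, 1, 1],
    'bowler':         ['A', 'A', 'A', 'B', 'B', 'B', 'B'],
    'batsman_runs':   [4, 0, 1, 0, 2, 0, 0],
    'wide_runs':      [0, 0, 0, 0, 1, 0, 0],
    'noball_runs':    [0, 0, 0, 0, 0, 0, 0],
    'dismissal_kind': [None, 'bowled', 'caught', 'bowled', None, 'lbw', 'run out'],
})

figures = bowling_figures(toy)
print(figures)
```

Bowler B ranks first with 2 wickets for 3 runs, ahead of bowler A with 2 for 5; the run out in B's spell is not credited as a wicket.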


## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.

Based on the above analysis, I suggest the client focus on leveraging star players and high-impact performances to strengthen fan engagement and brand partnerships, as these create strong emotional connections and increase viewership. Investing in targeted marketing campaigns around top performers like CH Gayle and bowlers with standout match-winning spells can lift merchandise sales, sponsorship value, and ticket revenues. At the same time, improving team strategies to reduce extras and strengthen the defence of totals should increase match-winning consistency, further building a winning reputation that attracts corporate sponsorships. Finally, expanding the league's presence by highlighting record-breaking seasons and individual achievements can open new markets and revenue streams, balancing entertainment with strong sports management to maximize long-term business success.

# **Conclusion**

Based on the detailed analysis and insights derived from the data, I conclude that the project has highlighted the key performance drivers and trends in the league that matter for business growth. Understanding top player performances, match outcomes, and season-wise statistics provides actionable intelligence for optimizing fan engagement, sponsorship leverage, and team strategy. Data-driven decision making, focused on showcasing star players and improving match quality, can unlock new revenue streams and strengthen brand loyalty. Overall, this analysis provides a sound basis for crafting targeted business strategies that align sporting excellence with commercial success for sustained growth.

### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***