# Data Analysis of all-time top bowlers in test matches

The dataset lists the greatest bowlers of all time, one player per row, ordered by the total number of wickets they took over their careers. Along with wickets, there are other attributes that measure each player's performance over his career.

**Attributes/Columns:**
 - Player - name of the player, followed by the team(s) in parentheses
 - Span - starting and ending year of the player's career
 - Mat - matches played by the player
 - Inns - innings bowled in by the player
 - Balls - balls bowled by the player
 - Runs - runs conceded by the player
 - Wkts - wickets taken by the player
 - BBI - best bowling figures in an innings (wickets/runs)
 - BBM - best bowling figures in a match (wickets/runs)
 - Ave - bowling average: runs conceded per wicket taken
 - Econ - economy rate: average number of runs conceded per over by the player
 - SR - strike rate: average number of balls bowled per wicket taken
 - 5 - number of times the player took 5 or more wickets in an innings
 - 10 - number of times the player took 10 or more wickets in a match
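
The three rate columns (Ave, Econ, SR) can be derived from the raw totals. The following is a minimal sketch of those definitions, assuming six-ball overs throughout (an assumption; the source does not state the over length). The helper name `bowling_rates` is hypothetical and used only for illustration.

```python
def bowling_rates(balls: int, runs: int, wickets: int) -> dict:
    """Derive Ave, Econ and SR from raw career totals.

    Assumes six-ball overs. `bowling_rates` is an illustrative helper,
    not part of the dataset or of any library.
    """
    overs = balls / 6
    return {
        "Ave": runs / wickets,   # runs conceded per wicket
        "Econ": runs / overs,    # runs conceded per over
        "SR": balls / wickets,   # balls bowled per wicket
    }


# Career totals for M Muralitharan, taken from the first row of the dataset
rates = bowling_rates(balls=44039, runs=18180, wickets=800)
print(rates)
```

For this row the results land close to the listed values of 22.72, 2.47 and 55.0. Small differences in the last digit can come from how the source rounds.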

**Reference** 
<br>Data Source: https://stats.espncricinfo.com/ci/content/records/93276.html

**Importing packages**

In [1]:
import numpy as np # linear Algebra
import pandas as pd # Data Processing

**Reading the dataset**

In [2]:
# Naming the DataFrame - df
# Reading the .xlsx file using pandas: pd.read_excel("<location of dataset>")
# Reading the sheet named 'wickets'

df = pd.read_excel("test_cricket.xlsx", sheet_name="wickets")

# Displaying the first 10 rows of the DataFrame
display(df.head(10))

Unnamed: 0,Player,Span,Mat,Inns,Balls,Runs,Wkts,BBI,BBM,Ave,Econ,SR,5,10
0,M Muralitharan (ICC/SL),1992-2010,133,230,44039,18180,800,1951-09-01 00:00:00,16/220,22.72,2.47,55.0,67,22
1,SK Warne (AUS),1992-2007,145,273,40705,17995,708,1971-08-01 00:00:00,12/128,25.41,2.65,57.4,37,10
2,A Kumble (INDIA),1990-2008,132,236,40850,18355,619,1974-10-01 00:00:00,14/149,29.65,2.69,65.9,35,8
3,JM Anderson (ENG),2003-2021,162,301,34791,16457,617,1942-07-01 00:00:00,1971-11-01 00:00:00,26.67,2.83,56.3,30,3
4,GD McGrath (AUS),1993-2007,124,243,29248,12186,563,2021-08-24 00:00:00,2021-10-27 00:00:00,21.64,2.49,51.9,29,3
5,SCJ Broad (ENG),2007-2021,148,272,29713,14502,523,2021-08-15 00:00:00,11/121,27.72,2.92,56.8,18,3
6,CA Walsh (WI),1984-2001,132,242,30019,12688,519,1937-07-01 00:00:00,13/55,24.44,2.53,57.8,22,3
7,DW Steyn (SA),2004-2019,93,171,18608,10077,439,1951-07-01 00:00:00,1960-11-01 00:00:00,22.95,3.24,42.3,26,5
8,N Kapil Dev (INDIA),1978-1994,131,227,27740,12867,434,1983-09-01 00:00:00,11/146,29.64,2.78,63.9,23,2
9,HMRKB Herath (SL),1999-2018,93,170,25993,12157,433,9/127,14/184,28.07,2.8,60.0,34,9
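
Several BBI and BBM values in the output above appear as dates rather than wickets/runs figures, which suggests that a spreadsheet application interpreted entries such as `9/51` as dates before the file was read. The sketch below is one possible heuristic for turning such values back into text. It assumes that a figure was stored as month/year (with day 1) when the runs part was not a valid day, and as month/day of the year the sheet was created otherwise. The function name `restore_figures` and the default `sheet_year=2021` are assumptions, not something stated in the source.

```python
import datetime as dt


def restore_figures(value, sheet_year=2021):
    """Turn a date-mangled bowling figure back into 'wickets/runs' text.

    Heuristic sketch: assumes 'W/R' was read as month/year (stored as
    day 1) or as month/day of `sheet_year`. `sheet_year` is a guess.
    """
    if not isinstance(value, (dt.datetime, dt.date)):
        return str(value)  # already text such as '9/127'
    if value.day == 1 and value.year != sheet_year:
        return f"{value.month}/{value.year % 100}"
    return f"{value.month}/{value.day}"


# Values as they appear in the BBI column of the output above
samples = [dt.datetime(1951, 9, 1), dt.datetime(2021, 8, 24), "9/127"]
print([restore_figures(v) for v in samples])
# → ['9/51', '8/24', '9/127']
```

If the cells hold datetime objects (pandas timestamps are a subclass of `datetime.datetime`), the function could be applied with `df["BBM"].map(restore_figures)`. The results should be checked against the data source, since the heuristic cannot tell a genuine date from a mangled figure.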


**Number of rows and columns**

In [3]:
# <name_of_DataFrame>.shape
df.shape

# output - (total number of rows, total number of columns)

(79, 14)

The DataFrame contains 79 rows (players) and 14 columns (attributes).

**Descriptive Statistics**
<br>Output shows the following details of all numerical attributes in the dataset: 
- **count** (number of observations) 
- **mean** (average of all values)
- **std** (standard deviation)
- **min** (minimum value among all the observations)
- **25%** (the value at the 25th percentile, i.e. 25% of observations are less than or equal to this value)
- **50%** (the median, or the value at the 50th percentile, i.e. 50% of observations are less than or equal to this value)
- **75%** (the value at the 75th percentile, i.e. 75% of observations are less than or equal to this value)
- **max** (maximum value among all the observations)

In [4]:
# <name_of_DataFrame>.describe()
df.describe()

Unnamed: 0,Mat,Inns,Balls,Runs,Wkts,Ave,Econ,SR,5,10
count,79.0,79.0,79.0,79.0,79.0,79.0,79.0,79.0,79.0,79.0
mean,80.101266,144.797468,18630.303797,8595.506329,317.101266,27.466456,2.806582,59.187342,16.35443,2.797468
std,28.537692,51.04231,7190.036515,3080.256645,121.731587,3.657561,0.351666,9.349337,9.642372,3.235935
min,37.0,67.0,8785.0,4846.0,200.0,20.94,1.98,41.2,3.0,0.0
25%,60.5,110.0,13580.0,6456.5,229.0,24.425,2.6,53.3,9.5,1.0
50%,71.0,129.0,16498.0,7742.0,266.0,28.0,2.82,57.4,14.0,2.0
75%,93.0,169.0,21742.5,9756.0,374.5,29.87,3.08,63.95,20.5,3.5
max,166.0,301.0,44039.0,18355.0,800.0,34.79,3.46,91.9,67.0,22.0
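
Some quartiles in this output, such as 60.5 for Mat, are not whole numbers even though the column holds integers. This is because `describe()` interpolates linearly between neighbouring sorted values by default. The sketch below illustrates the calculation on toy values that are not taken from the dataset.

```python
import pandas as pd

# Toy values, not taken from the dataset
toy = pd.Series([10, 20, 30, 40])

# Quartile as computed by pandas (linear interpolation by default)
q1 = toy.quantile(0.25)

# The same value by hand: position = q * (n - 1) in the sorted data
pos = 0.25 * (len(toy) - 1)  # 0.75, between index 0 and index 1
lower, upper = toy.sort_values().iloc[[0, 1]]
by_hand = lower + (upper - lower) * (pos - 0)

print(q1, by_hand, toy.describe()["25%"])
```

With 79 rows, the 25th percentile sits at position 0.25 × 78 = 19.5 in the sorted data, halfway between two neighbouring values, which is how a value ending in .5 can arise.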


**Overview:**
1. The top bowlers played 80 matches on average, with a minimum of 37 matches and a maximum of 166.
2. The bowlers conceded about 8596 runs on average. While the highest total is 18355, only 25% of bowlers conceded more than 9756 runs.
3. The bowlers took 317 wickets on average, with 75% of bowlers taking fewer than 375.

**Data Types and Missing Values**

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 79 entries, 0 to 78
Data columns (total 14 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Player  79 non-null     object 
 1   Span    79 non-null     object 
 2   Mat     79 non-null     int64  
 3   Inns    79 non-null     int64  
 4   Balls   79 non-null     int64  
 5   Runs    79 non-null     int64  
 6   Wkts    79 non-null     int64  
 7   BBI     79 non-null     object 
 8   BBM     79 non-null     object 
 9   Ave     79 non-null     float64
 10  Econ    79 non-null     float64
 11  SR      79 non-null     float64
 12  5       79 non-null     int64  
 13  10      79 non-null     int64  
dtypes: float64(3), int64(7), object(4)
memory usage: 8.8+ KB


**Overview:**
1. Total number of observations/players: 79
2. Total number of attributes/variables: 14
3. Columns with the object (string/mixed) data type: 4 (Player, Span, BBI, BBM)
4. Columns with the integer (positive/negative/zero) data type: 7 (Mat, Inns, Balls, Runs, Wkts, 5, 10)
5. Columns with the float (floating point number) data type: 3 (Ave, Econ, SR)
6. No missing data
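
The counts in this overview can also be obtained programmatically instead of being read off the `info()` output. The sketch below shows one way to do this on a small frame built from the first three rows of the dataset with a subset of its columns. The variable names are illustrative only.

```python
import pandas as pd

# First three rows of the dataset, with a subset of columns
sample = pd.DataFrame({
    "Player": ["M Muralitharan (ICC/SL)", "SK Warne (AUS)", "A Kumble (INDIA)"],
    "Mat": [133, 145, 132],
    "Wkts": [800, 708, 619],
    "Ave": [22.72, 25.41, 29.65],
})

# Number of columns per data type
dtype_counts = sample.dtypes.astype(str).value_counts()

# Number of missing values per column, and in total
missing_per_column = sample.isna().sum()
total_missing = int(missing_per_column.sum())

print(dtype_counts)
print(missing_per_column)
print(total_missing)
```

Applied to the full DataFrame, the same calls would be `df.dtypes.value_counts()` and `df.isna().sum()`.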

**Renaming columns**

In [6]:
# <name_of_DataFrame> = <name_of_DataFrame>.rename(columns={'<name_of_column>': '<new_name>'})
# The labels 5 and 10 are integers in this DataFrame, so the keys are written without quotes
df = df.rename(columns={'Mat': 'Matches',
                        'Inns': 'Innings',
                        'Wkts': 'Wickets',
                        'Ave': 'Average',
                        5: '5_wickets',
                        10: '10_wickets'})

display(df.head())

Unnamed: 0,Player,Span,Matches,Innings,Balls,Runs,Wickets,BBI,BBM,Average,Econ,SR,5_wickets,10_wickets
0,M Muralitharan (ICC/SL),1992-2010,133,230,44039,18180,800,1951-09-01 00:00:00,16/220,22.72,2.47,55.0,67,22
1,SK Warne (AUS),1992-2007,145,273,40705,17995,708,1971-08-01 00:00:00,12/128,25.41,2.65,57.4,37,10
2,A Kumble (INDIA),1990-2008,132,236,40850,18355,619,1974-10-01 00:00:00,14/149,29.65,2.69,65.9,35,8
3,JM Anderson (ENG),2003-2021,162,301,34791,16457,617,1942-07-01 00:00:00,1971-11-01 00:00:00,26.67,2.83,56.3,30,3
4,GD McGrath (AUS),1993-2007,124,243,29248,12186,563,2021-08-24 00:00:00,2021-10-27 00:00:00,21.64,2.49,51.9,29,3
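
The keys `5` and `10` in the rename call are integers because the column labels themselves are integers. By default `rename` ignores keys that match no label, so using the strings `'5'` and `'10'` would leave the columns unchanged without any warning. The sketch below demonstrates this on a one-row frame, where `demo` is an illustrative name.

```python
import pandas as pd

# Minimal frame whose last two column labels are integers
demo = pd.DataFrame({"Wkts": [800], 5: [67], 10: [22]})

# String keys do not match integer labels, so nothing is renamed
unchanged = demo.rename(columns={"5": "5_wickets", "10": "10_wickets"})

# Integer keys match the labels
renamed = demo.rename(columns={5: "5_wickets", 10: "10_wickets"})

# errors="raise" turns an unmatched key into a KeyError
try:
    demo.rename(columns={"5": "5_wickets"}, errors="raise")
    raised = False
except KeyError:
    raised = True

print(list(unchanged.columns))  # → ['Wkts', 5, 10]
print(list(renamed.columns))    # → ['Wkts', '5_wickets', '10_wickets']
print(raised)                   # → True
```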


**Removing columns**

In [7]:
# <name_of_DataFrame>.drop('<name_of_column>', axis=<0 or 1>)
# axis=1 drops columns; axis=0 drops rows
# inplace=True modifies the original DataFrame instead of returning a new one
df.drop('BBI', axis=1, inplace=True)

display(df.head())

Unnamed: 0,Player,Span,Matches,Innings,Balls,Runs,Wickets,BBM,Average,Econ,SR,5_wickets,10_wickets
0,M Muralitharan (ICC/SL),1992-2010,133,230,44039,18180,800,16/220,22.72,2.47,55.0,67,22
1,SK Warne (AUS),1992-2007,145,273,40705,17995,708,12/128,25.41,2.65,57.4,37,10
2,A Kumble (INDIA),1990-2008,132,236,40850,18355,619,14/149,29.65,2.69,65.9,35,8
3,JM Anderson (ENG),2003-2021,162,301,34791,16457,617,1971-11-01 00:00:00,26.67,2.83,56.3,30,3
4,GD McGrath (AUS),1993-2007,124,243,29248,12186,563,2021-10-27 00:00:00,21.64,2.49,51.9,29,3
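
An alternative to `inplace=True` is to let `drop` return a new DataFrame, which keeps the original available for later steps. The sketch below uses the `columns=` keyword, which has the same effect as passing the label with `axis=1`. The one-row frame reuses the Herath row from the dataset, and the variable names are illustrative.

```python
import pandas as pd

# One row from the dataset, with a subset of columns
demo = pd.DataFrame({
    "Player": ["HMRKB Herath (SL)"],
    "BBI": ["9/127"],
    "BBM": ["14/184"],
})

# Without inplace=True, drop() returns a new DataFrame and leaves `demo` as is
trimmed = demo.drop(columns="BBI")  # same effect as demo.drop("BBI", axis=1)

print(list(demo.columns))     # → ['Player', 'BBI', 'BBM']
print(list(trimmed.columns))  # → ['Player', 'BBM']
```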
