# Test bowler data analysis with Python

**Objectives:**
- Import the test_cricket.xlsx file in a Jupyter notebook and read the sheet 'wickets'.
- Display the first 10 rows of the dataframe.
- Create a markdown cell and explain the meaning of each column.
- Find the number of rows and columns in the dataframe.
- Compute summary statistics and check the data types.
- Check for any missing values present in the dataset.
- Rename the column names appropriately.
- Remove a column from the dataframe.

Dataset reference: https://stats.espncricinfo.com/ci/content/records/93276.html

### Import required libraries

In [22]:
import pandas as pd
import numpy as np

### Reading the Excel file and displaying the required sheet

In [23]:
# Read the 'wickets' sheet of the Excel file with pandas and store it in a variable
df = pd.read_excel("test_cricket.xlsx", sheet_name = "wickets")

#Display the first ten rows of the dataset
display(df.head(10))

,Player,Span,Mat,Inns,Balls,Runs,Wkts,BBI,BBM,Ave,Econ,SR,5,10
0,M Muralitharan (ICC/SL),1992-2010,133,230,44039,18180,800,1951-09-01 00:00:00,16/220,22.72,2.47,55.0,67,22
1,SK Warne (AUS),1992-2007,145,273,40705,17995,708,1971-08-01 00:00:00,12/128,25.41,2.65,57.4,37,10
2,A Kumble (INDIA),1990-2008,132,236,40850,18355,619,1974-10-01 00:00:00,14/149,29.65,2.69,65.9,35,8
3,JM Anderson (ENG),2003-2021,162,301,34791,16457,617,1942-07-01 00:00:00,1971-11-01 00:00:00,26.67,2.83,56.3,30,3
4,GD McGrath (AUS),1993-2007,124,243,29248,12186,563,2021-08-24 00:00:00,2021-10-27 00:00:00,21.64,2.49,51.9,29,3
5,SCJ Broad (ENG),2007-2021,148,272,29713,14502,523,2021-08-15 00:00:00,11/121,27.72,2.92,56.8,18,3
6,CA Walsh (WI),1984-2001,132,242,30019,12688,519,1937-07-01 00:00:00,13/55,24.44,2.53,57.8,22,3
7,DW Steyn (SA),2004-2019,93,171,18608,10077,439,1951-07-01 00:00:00,1960-11-01 00:00:00,22.95,3.24,42.3,26,5
8,N Kapil Dev (INDIA),1978-1994,131,227,27740,12867,434,1983-09-01 00:00:00,11/146,29.64,2.78,63.9,23,2
9,HMRKB Herath (SL),1999-2018,93,170,25993,12157,433,9/127,14/184,28.07,2.8,60.0,34,9


#### Explanation of dataset
Overall, the dataset has 14 features, which are explained as follows:
<br>**1. Player:** Name of each bowler, with the national team enclosed in brackets. ***This column holds string type data.***
<br>**2. Span:** First and most recent year of each player's Test career, separated by a dash. ***So, the column holds string type data.***
<br>**3. Mat:** Total number of matches played in the career. ***It has integer type values.***
<br>**4. Inns:** Total number of innings played in the career. ***It has integer type values.***
<br>**5. Balls:** Total number of balls delivered during the career span. ***The column holds integer type values.***
<br>**6. Runs:** Total runs conceded by each bowler. ***This column holds integer type data.***
<br>**7. Wkts:** Total wickets taken by each bowler. ***This column holds integer type data.***
<br>**8. BBI:** Best bowling figures in a single innings. The first number is the number of wickets taken and the second is the number of runs conceded, separated by a slash. ***So, this is string type data.***
<br>**9. BBM:** Best bowling figures in a single match. The first number is the number of wickets taken and the second is the number of runs conceded, separated by a slash. ***So, this is string type data.***
<br>**10. Ave:** Bowling average, that is, runs conceded per wicket taken. ***It holds float type data.***
<br>**11. Econ:** Economy rate, that is, total runs conceded divided by total overs bowled. ***Therefore, this holds float type data.***
<br>**12. SR:** Strike rate, that is, balls bowled per wicket taken. ***It holds float type values.***
<br>**13. 5:** Number of innings in which the bowler took five or more wickets. ***This holds integer type data.***
<br>**14. 10:** Number of matches in which the bowler took ten or more wickets. ***This holds integer type data.***

To conclude, the dataset has 4 string type columns, 7 integer type columns, and 3 float type columns.
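
In the displayed rows, several `BBI` and `BBM` entries appear as timestamps (for example `1951-09-01 00:00:00`) rather than as wickets/runs figures, which suggests the spreadsheet stored values such as 9/51 as dates. The following is a minimal sketch of one way to convert them back. It assumes the workbook was saved in 2021 with month/day date parsing, so that a timestamp in that year encodes wickets/runs as month/day and any other year encodes them as month/two-digit year. The helper name `restore_figures` and its `import_year` parameter are hypothetical, and the results should be checked against the source page.

```python
from datetime import datetime

import pandas as pd


def restore_figures(value, import_year=2021):
    """Return a 'wickets/runs' string for a value the spreadsheet turned into a date.

    Assumption: figures such as 8/24 became 24 August of import_year,
    while figures such as 9/51 became 1 September 1951.
    """
    if isinstance(value, datetime):  # pd.Timestamp is a datetime subclass
        if value.year == import_year:
            return f"{value.month}/{value.day}"
        return f"{value.month}/{value.year % 100}"
    return value  # already a string such as '9/127'


# Small illustrative sample using values visible in the displayed rows
sample = pd.Series(
    [datetime(1951, 9, 1), datetime(2021, 8, 24), "9/127"], dtype=object
)
restored = sample.map(restore_figures)
print(restored.tolist())
```

On the real dataframe this could be applied as `df['BBI'] = df['BBI'].map(restore_figures)`, and likewise for `BBM`.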

### Number of rows and columns

In [24]:
# number of rows
print("number of rows = ", df.shape[0])

# number of columns
print("number of columns = ", df.shape[1])

number of rows =  79
number of columns =  14


### Checking data statistics

In [25]:
df.describe()

,Mat,Inns,Balls,Runs,Wkts,Ave,Econ,SR,5,10
count,79.0,79.0,79.0,79.0,79.0,79.0,79.0,79.0,79.0,79.0
mean,80.101266,144.797468,18630.303797,8595.506329,317.101266,27.466456,2.806582,59.187342,16.35443,2.797468
std,28.537692,51.04231,7190.036515,3080.256645,121.731587,3.657561,0.351666,9.349337,9.642372,3.235935
min,37.0,67.0,8785.0,4846.0,200.0,20.94,1.98,41.2,3.0,0.0
25%,60.5,110.0,13580.0,6456.5,229.0,24.425,2.6,53.3,9.5,1.0
50%,71.0,129.0,16498.0,7742.0,266.0,28.0,2.82,57.4,14.0,2.0
75%,93.0,169.0,21742.5,9756.0,374.5,29.87,3.08,63.95,20.5,3.5
max,166.0,301.0,44039.0,18355.0,800.0,34.79,3.46,91.9,67.0,22.0


**The table above summarises each numerical column: the number of observations (count), the mean, the standard deviation, the minimum, the 25th, 50th (median), and 75th percentiles, and the maximum.**
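
`describe()` summarises only the numeric columns by default, which is why `Player`, `Span`, `BBI`, and `BBM` are absent from the table. As a small sketch on a made-up three-row frame (the values are illustrative and not a substitute for the real dataset), passing `include="all"` also reports the count, number of unique values, most frequent value, and its frequency for the text columns:

```python
import pandas as pd

# Made-up frame with one text column and one numeric column
toy = pd.DataFrame(
    {
        "Player": ["A", "B", "B"],
        "Wkts": [800, 708, 619],
    }
)

numeric_only = toy.describe()             # numeric columns only
everything = toy.describe(include="all")  # text and numeric columns

print(numeric_only.columns.tolist())
print(everything.index.tolist())
```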

### Checking for missing values

In [26]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 79 entries, 0 to 78
Data columns (total 14 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Player  79 non-null     object 
 1   Span    79 non-null     object 
 2   Mat     79 non-null     int64  
 3   Inns    79 non-null     int64  
 4   Balls   79 non-null     int64  
 5   Runs    79 non-null     int64  
 6   Wkts    79 non-null     int64  
 7   BBI     79 non-null     object 
 8   BBM     79 non-null     object 
 9   Ave     79 non-null     float64
 10  Econ    79 non-null     float64
 11  SR      79 non-null     float64
 12  5       79 non-null     int64  
 13  10      79 non-null     int64  
dtypes: float64(3), int64(7), object(4)
memory usage: 8.8+ KB


**There are 79 observations and each of the 14 columns has 79 non-null values, so there are no missing values in the dataframe.**
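
A more direct check than reading the `info()` output is to count nulls per column. The sketch below uses a small made-up frame with one deliberately missing value to show what the result looks like. On the real data, `df.isnull().sum()` would be expected to report zero for every column, in line with the `info()` output.

```python
import numpy as np
import pandas as pd

# Made-up frame with one missing value in 'Ave'
toy = pd.DataFrame(
    {
        "Player": ["A", "B", "C"],
        "Ave": [22.72, np.nan, 29.65],
    }
)

missing_per_column = toy.isnull().sum()        # nulls counted column by column
total_missing = int(missing_per_column.sum())  # nulls in the whole frame

print(missing_per_column)
print("total missing =", total_missing)
```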

### Renaming column names

In [27]:
# Rename columns using a dictionary (the labels 5 and 10 are integers, not strings)
df = df.rename(columns={'Mat':'Match', 
                        'Inns':'Innings',
                        'Wkts': 'Wickets',
                        'BBI': 'Best_innings_bowling',
                        'BBM': 'Best_match_bowling',
                        'Ave': 'Average',
                        'Econ': 'Economy',
                        'SR': 'Strike_rate',
                         5: 'Five_wickets',
                         10: 'Ten_wickets'})

display(df.head())

,Player,Span,Match,Innings,Balls,Runs,Wickets,Best_innings_bowling,Best_match_bowling,Average,Economy,Strike_rate,Five_wickets,Ten_wickets
0,M Muralitharan (ICC/SL),1992-2010,133,230,44039,18180,800,1951-09-01 00:00:00,16/220,22.72,2.47,55.0,67,22
1,SK Warne (AUS),1992-2007,145,273,40705,17995,708,1971-08-01 00:00:00,12/128,25.41,2.65,57.4,37,10
2,A Kumble (INDIA),1990-2008,132,236,40850,18355,619,1974-10-01 00:00:00,14/149,29.65,2.69,65.9,35,8
3,JM Anderson (ENG),2003-2021,162,301,34791,16457,617,1942-07-01 00:00:00,1971-11-01 00:00:00,26.67,2.83,56.3,30,3
4,GD McGrath (AUS),1993-2007,124,243,29248,12186,563,2021-08-24 00:00:00,2021-10-27 00:00:00,21.64,2.49,51.9,29,3
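
One detail worth noting is that `rename` silently skips keys that do not match any column label. Because the last two labels here are the integers `5` and `10`, a mapping keyed by the strings `'5'` and `'10'` would leave them unchanged without any warning. A minimal sketch on a made-up frame, using the `errors="raise"` option of `DataFrame.rename` to surface such mismatches:

```python
import pandas as pd

# Made-up frame with one string label and one integer label
toy = pd.DataFrame({"SR": [55.0, 57.4], 5: [67, 37]})

# Integer key matches the integer column label
renamed = toy.rename(columns={5: "Five_wickets"}, errors="raise")
print(renamed.columns.tolist())

# String key does not match; the default errors="ignore" changes nothing
unchanged = toy.rename(columns={"5": "Five_wickets"})
print(unchanged.columns.tolist())

# With errors="raise" the mismatch becomes a KeyError
try:
    toy.rename(columns={"5": "Five_wickets"}, errors="raise")
    mismatch_detected = False
except KeyError:
    mismatch_detected = True
print(mismatch_detected)
```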


### Removing a column

In [28]:
# Remove the Strike_rate column with df.drop
df = df.drop('Strike_rate', axis=1)
display(df.head())

,Player,Span,Match,Innings,Balls,Runs,Wickets,Best_innings_bowling,Best_match_bowling,Average,Economy,Five_wickets,Ten_wickets
0,M Muralitharan (ICC/SL),1992-2010,133,230,44039,18180,800,1951-09-01 00:00:00,16/220,22.72,2.47,67,22
1,SK Warne (AUS),1992-2007,145,273,40705,17995,708,1971-08-01 00:00:00,12/128,25.41,2.65,37,10
2,A Kumble (INDIA),1990-2008,132,236,40850,18355,619,1974-10-01 00:00:00,14/149,29.65,2.69,35,8
3,JM Anderson (ENG),2003-2021,162,301,34791,16457,617,1942-07-01 00:00:00,1971-11-01 00:00:00,26.67,2.83,30,3
4,GD McGrath (AUS),1993-2007,124,243,29248,12186,563,2021-08-24 00:00:00,2021-10-27 00:00:00,21.64,2.49,29,3
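
`drop` also accepts a `columns=` keyword, which avoids having to remember the `axis` number, and `errors="ignore"` makes the cell safe to re-run after the column is already gone. A small sketch on a made-up frame:

```python
import pandas as pd

# Made-up single-row frame
toy = pd.DataFrame({"Average": [22.72], "Strike_rate": [55.0]})

# Equivalent to toy.drop('Strike_rate', axis=1)
dropped = toy.drop(columns="Strike_rate")

# Dropping a column that no longer exists is a no-op with errors="ignore"
dropped_again = dropped.drop(columns="Strike_rate", errors="ignore")
print(dropped_again.columns.tolist())
```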
