<img src="channels4_profile.jpg" width="400" height='400' align='centre'/>



## Background:

As part of the analytics team at ESPN, we continuously strive to enhance our understanding of sports dynamics and player performances to provide our audience with in-depth analyses and insights. Cricket, being one of the world's most popular sports, has a rich dataset of player performances that can be leveraged to uncover trends, potentials, and areas of improvement.

## Objective:

The primary objective of this analysis is to perform an elementary Exploratory Data Analysis (EDA) on the provided dataset of cricket players. The dataset encompasses various performance metrics and personal attributes of the players. Through this EDA, we aim to uncover patterns, correlations, and insights that could contribute to more nuanced reporting, commentary, and overall understanding of the game.

## Goal:

Perform EDA and present the resulting insights with visualizations.

## Approach:
* Imports
* Removing redundancies:
    - remove the '+' sign from column values
    - remove duplicates
    - remove/replace nulls based on univariate and multivariate analysis
    - check data types
    - Feature Engineering: create a new column "Out/Not_Out" based on the Highest Score innings

* Removing nulls

In [1]:
# Import libraries
import pandas as pd
import re

pd.set_option('display.width', 400)
pd.set_option('display.max_columns', 100)
pd.set_option('display.max_rows', 100)

In [2]:
# Read the file
df = pd.read_excel('/Users/dawny/Desktop/trial.xlsx')
df.head()

Unnamed: 0,Player,Span,Mat,Inns,NO,Runs,HS,Ave,BF,SR,100,50,0,4s,6s
0,DG Bradman (AUS),1928-1948,52,80,10,6996,334,99.94,9800+,58.6,29,13,7,626+,6
1,HC Brook (ENG),2022-2023,12,20,1,1181,186,62.15,1287,91.76,4,7,1,141,23
2,AC Voges (AUS),2015-2016,20,31,7,1485,269*,61.87,2667,55.68,5,4,2,186,5
3,RG Pollock (SA),1963-1970,23,41,4,2256,274,60.97,1707+,54.48,7,11,1,246+,11
4,GA Headley (WI),1930-1954,22,40,4,2190,270*,60.83,416+,56.0,10,5,2,104+,1


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 65 entries, 0 to 64
Data columns (total 15 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Player  65 non-null     object 
 1   Span    65 non-null     object 
 2   Mat     65 non-null     object 
 3   Inns    65 non-null     int64  
 4   NO      65 non-null     int64  
 5   Runs    65 non-null     int64  
 6   HS      65 non-null     object 
 7   Ave     65 non-null     float64
 8   BF      62 non-null     object 
 9   SR      62 non-null     float64
 10  100     65 non-null     int64  
 11  50      65 non-null     int64  
 12  0       65 non-null     int64  
 13  4s      65 non-null     object 
 14  6s      65 non-null     object 
dtypes: float64(2), int64(6), object(7)
memory usage: 7.7+ KB
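
The `info()` output lists `Mat` as `object` even though it holds match counts, which suggests at least one entry is not a plain number. A minimal sketch of one way to locate such entries is shown here; the toy Series and its `"12*"` value are hypothetical and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical stand-in for the 'Mat' column; the real offending value is unknown.
matches = pd.Series([52, "12*", 20], name="Mat")

# Values that cannot be parsed become NaN instead of raising an error.
converted = pd.to_numeric(matches, errors="coerce")

# Rows that were present originally but failed the conversion.
offenders = matches[converted.isna() & matches.notna()]
print(offenders)
```

Inspecting `offenders` before cleaning shows which characters need to be stripped, rather than coercing them silently to nulls.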


In [4]:
df.describe()

Unnamed: 0,Inns,NO,Runs,Ave,SR,100,50,0
count,65.0,65.0,65.0,65.0,62.0,65.0,65.0,65.0
mean,137.138462,14.138462,6488.015385,53.556154,49.965323,19.646154,28.569231,7.984615
std,84.839716,10.942859,3940.275425,7.108072,11.341566,12.171214,18.195097,5.495715
min,20.0,1.0,990.0,48.0,25.59,1.0,3.0,1.0
25%,62.0,6.0,2748.0,49.37,43.4475,8.0,13.0,3.0
50%,137.0,12.0,6806.0,51.62,51.45,21.0,29.0,7.0
75%,193.0,19.0,8848.0,56.67,55.56,29.0,41.0,12.0
max,329.0,49.0,15921.0,99.94,91.76,51.0,68.0,22.0



In [6]:
# Rename multiple columns using a dict that maps old names to new names
df = df.rename(columns= {'NO': 'Not_Outs',
                    'Ave': 'Average',
                    'Mat': 'Matches',
                    'HS': "Highest_Inn_Score", 
                    'BF': 'Balls_Faced', 
                    'SR': 'Strike_Rate'} )

df.head()

Unnamed: 0,Player,Span,Matches,Inns,Not_Outs,Runs,Highest_Inn_Score,Average,Balls_Faced,Strike_Rate,100,50,0,4s,6s
0,DG Bradman (AUS),1928-1948,52,80,10,6996,334,99.94,9800+,58.6,29,13,7,626+,6
1,HC Brook (ENG),2022-2023,12,20,1,1181,186,62.15,1287,91.76,4,7,1,141,23
2,AC Voges (AUS),2015-2016,20,31,7,1485,269*,61.87,2667,55.68,5,4,2,186,5
3,RG Pollock (SA),1963-1970,23,41,4,2256,274,60.97,1707+,54.48,7,11,1,246+,11
4,GA Headley (WI),1930-1954,22,40,4,2190,270*,60.83,416+,56.0,10,5,2,104+,1


### Rectifying the columns with special characters + and *:
* Balls_Faced
* 4s
* 6s
* Highest_Inn_Score

In [7]:
# Convert string objects to numbers: Balls_Faced and 4s columns


df['Balls_Faced'] = df['Balls_Faced'].astype(str)

df['Balls_Faced']  = df['Balls_Faced'].str.replace('+', '', regex = False)

df['Balls_Faced'] = pd.to_numeric(df['Balls_Faced'], errors='coerce')

In [8]:
df['4s'] = df['4s'].astype(str)

df['4s'] = df['4s'].str.replace('+', '', regex = False)

df['4s'] = pd.to_numeric(df['4s'], errors='coerce')
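
The same three steps (cast to string, strip `+`, convert to numeric) are repeated for each column, so they could be wrapped in a small helper. The sketch below assumes a hypothetical function name `strip_plus_to_numeric` and a toy frame; it is not part of the original notebook.

```python
import pandas as pd

def strip_plus_to_numeric(frame, columns):
    """Return a copy of `frame` with '+' removed from `columns` and values cast to numbers."""
    cleaned = frame.copy()
    for col in columns:
        as_text = cleaned[col].astype(str)                   # mixed ints/strings -> strings
        as_text = as_text.str.replace("+", "", regex=False)  # literal '+', not a regex
        cleaned[col] = pd.to_numeric(as_text, errors="coerce")  # unparseable -> NaN
    return cleaned

# Toy data shaped like the notebook's columns (values chosen for illustration).
toy = pd.DataFrame({
    "Balls_Faced": ["9800+", 1287, None],
    "4s": ["626+", 141, 186],
})
toy_clean = strip_plus_to_numeric(toy, ["Balls_Faced", "4s"])
print(toy_clean.dtypes)
```

Working on a copy keeps the raw frame available for comparison if a conversion produces unexpected nulls.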

In [9]:
df.info()

# Next: rectify the 6s column

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 65 entries, 0 to 64
Data columns (total 15 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Player             65 non-null     object 
 1   Span               65 non-null     object 
 2   Matches            65 non-null     object 
 3   Inns               65 non-null     int64  
 4   Not_Outs           65 non-null     int64  
 5   Runs               65 non-null     int64  
 6   Highest_Inn_Score  65 non-null     object 
 7   Average            65 non-null     float64
 8   Balls_Faced        62 non-null     float64
 9   Strike_Rate        62 non-null     float64
 10  100                65 non-null     int64  
 11  50                 65 non-null     int64  
 12  0                  65 non-null     int64  
 13  4s                 65 non-null     int64  
 14  6s                 65 non-null     object 
dtypes: float64(3), int64(7), object(5)
memory usage: 7.7+ KB


In [10]:
# Convert the column to string to search for potential + signs
df['6s'] = df['6s'].astype(str)

In [11]:
# Boolean mask to return the rows that contain a '+':
mask = df['6s'].apply(lambda x: bool(re.search(r'\+', x)))
df[mask]

Unnamed: 0,Player,Span,Matches,Inns,Not_Outs,Runs,Highest_Inn_Score,Average,Balls_Faced,Strike_Rate,100,50,0,4s,6s
10,GS Sobers (WI),1954-1974,93,160,21,8032,365*,57.78,4063.0,53.58,26,30,12,593,32+
11,GS Sobers (WI),1954-1974,93,160,21,8032,365*,57.78,4063.0,53.58,26,30,12,593,32+
63,SJ McCabe (AUS),1930-1938,39,62,5,2748,232,48.21,3217.0,60.02,6,13,4,241,5+


In [12]:
# Remove the literal '+' and convert the column to numbers
df['6s'] = df['6s'].str.replace('+', '', regex=False)

df['6s'] = pd.to_numeric(df['6s'], errors='coerce')

In [13]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 65 entries, 0 to 64
Data columns (total 15 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Player             65 non-null     object 
 1   Span               65 non-null     object 
 2   Matches            65 non-null     object 
 3   Inns               65 non-null     int64  
 4   Not_Outs           65 non-null     int64  
 5   Runs               65 non-null     int64  
 6   Highest_Inn_Score  65 non-null     object 
 7   Average            65 non-null     float64
 8   Balls_Faced        62 non-null     float64
 9   Strike_Rate        62 non-null     float64
 10  100                65 non-null     int64  
 11  50                 65 non-null     int64  
 12  0                  65 non-null     int64  
 13  4s                 65 non-null     int64  
 14  6s                 65 non-null     int64  
dtypes: float64(3), int64(8), object(4)
memory usage: 7.7+ KB
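
How `str.replace` reads its pattern depends on the `regex` flag: with `regex=False` the pattern is matched character for character, so an escaped `'\+'` searches for a backslash followed by a plus. The sketch below illustrates the difference on a toy Series (illustrative values only).

```python
import pandas as pd

s = pd.Series(["32+", "5+", "11"])

# regex=False with an escaped pattern: looks for the two characters '\' and '+',
# finds nothing, and leaves the values unchanged.
literal_escaped = s.str.replace("\\+", "", regex=False)

# regex=False with a plain '+': removes the plus sign.
literal_plain = s.str.replace("+", "", regex=False)

# regex=True with an escaped pattern: also removes the plus sign.
regex_escaped = s.str.replace(r"\+", "", regex=True)

# Values that still carry a '+' are coerced to NaN by to_numeric.
print(pd.to_numeric(literal_escaped, errors="coerce"))
print(pd.to_numeric(literal_plain, errors="coerce"))
```

Comparing the non-null count before and after `pd.to_numeric(..., errors='coerce')` is a quick way to notice when a replacement did not match.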


In [15]:
df['Highest_Inn_Score'] = df['Highest_Inn_Score'].astype(str)

# Step 1: Create 'Out/Not_Out' column based on presence of '*'
df['Out/Not_Out'] = df['Highest_Inn_Score'].apply(lambda x: 'Not Out' if '*' in x else 'Out')

# Step 2: Remove '*' and convert the score to float
df['Highest_Inn_Score'] = df['Highest_Inn_Score'].str.replace('*', '', regex=False).astype(float)
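
As an alternative to the `apply` with a lambda, the flag and the numeric score can both be derived with vectorised string methods. A minimal sketch on a toy Series, assuming the `*` only ever appears as a trailing marker:

```python
import pandas as pd

# Toy stand-in for 'Highest_Inn_Score' after casting to string.
scores = pd.Series(["334", "269*", "270*"], name="Highest_Inn_Score")

# A trailing '*' marks an innings in which the batter was not dismissed.
not_out = scores.str.endswith("*")
status = not_out.map({True: "Not Out", False: "Out"})

# Strip the marker and convert what remains to numbers.
numeric = pd.to_numeric(scores.str.rstrip("*"))
print(pd.DataFrame({"score": numeric, "Out/Not_Out": status}))
```

The flag has to be computed before the marker is stripped, which is the same ordering the cell above follows.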

In [16]:
pd.DataFrame(df.isnull().sum(), columns = ['Missing'])

Unnamed: 0,Missing
Player,0
Span,0
Matches,0
Inns,0
Not_Outs,0
Runs,0
Highest_Inn_Score,0
Average,0
Balls_Faced,3
Strike_Rate,3


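### The special characters have been removed

`Balls_Faced` and `Strike_Rate` each still have 3 missing values, `Matches` is still stored as `object`, and the duplicated GS Sobers row (indices 10 and 11) has not been dropped yet.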
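
The approach lists duplicate removal as a step, and the `6s` check earlier returned GS Sobers at both index 10 and index 11. A minimal sketch of how such rows could be inspected and dropped, using a toy frame with a subset of the columns:

```python
import pandas as pd

# Toy frame that mimics the repeated GS Sobers row (subset of columns).
toy = pd.DataFrame({
    "Player": ["GS Sobers (WI)", "GS Sobers (WI)", "SJ McCabe (AUS)"],
    "Span": ["1954-1974", "1954-1974", "1930-1938"],
    "Runs": [8032, 8032, 2748],
})

# keep=False marks every copy of a duplicated row, which helps with inspection.
dupes = toy[toy.duplicated(keep=False)]
print(dupes)

# Keep the first occurrence of each row and rebuild a contiguous index.
deduped = toy.drop_duplicates().reset_index(drop=True)
print(len(toy), "->", len(deduped))
```

On the real frame the equivalent call would be `df.drop_duplicates()`. Whether rows should be compared on all columns or only on `Player` and `Span` is a judgment call that depends on how the duplicates arose.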