## A study to understand how teams, players and umpires have used referrals in the World Test Championship

### Possible insights desired

**PHASE 1**:

- How effective was the DRS call in extending the survival at the crease?

- No. of referrals innings wise

- Which batting partner has probably assisted the most in DRS calls?

**PHASE 2**:

- How many recognized batsmen were left? And was there a missed opportunity because DRS was taken recklessly earlier?

- Missed reviews by teams: did the number of remaining reviews have a say?


### Data points collected (Phase 1)
-  Match in Series 
-  Series Name to produce facets 
-  Match Venue
-  Match Date (Month_Year)
-  Over of referral
-  Innings of referral in game
- Team taking review
- Team Batting/Bowling
- Umpire at time of review
- Batsman at time of review
- Outcome of review 
- Innings wise dismissal data
- Active wicket partnership
- Innings wise referral data (scraped from match notes, needs a bit of formatting)
- Commentary of that particular referral ball


### Data points (Phase 2)

- No. of recognized batsmen still to come vs missed opportunities for them


### Required libraries 

In [1]:
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
import pandas as pd
from collections import defaultdict
import glob

In [2]:
url_list=pd.read_csv('data/URLs.csv')

url_list

Unnamed: 0,Cricinfo_URL,Cricbuzz_URL
0,https://www.espncricinfo.com/series/19430/scor...,https://www.cricbuzz.com/cricket-scores/20715/...
1,https://www.espncricinfo.com/series/19430/scor...,https://www.cricbuzz.com/cricket-scores/22954/...
2,https://www.espncricinfo.com/series/19430/scor...,https://www.cricbuzz.com/cricket-scores/20716/...
3,https://www.espncricinfo.com/series/19430/scor...,https://www.cricbuzz.com/cricket-scores/22955/...
4,https://www.espncricinfo.com/series/19430/scor...,https://www.cricbuzz.com/cricket-scores/20717/...
5,https://www.espncricinfo.com/series/19430/scor...,https://www.cricbuzz.com/cricket-scores/22859/...
6,https://www.espncricinfo.com/series/19430/scor...,https://www.cricbuzz.com/cricket-scores/22860/...
7,https://www.espncricinfo.com/series/19430/scor...,https://www.cricbuzz.com/cricket-scores/20718/...
8,https://www.espncricinfo.com/series/19430/scor...,https://www.cricbuzz.com/cricket-scores/20719/...
9,https://www.espncricinfo.com/series/19430/scor...,https://www.cricbuzz.com/cricket-scores/22743/...


In [3]:
row=2
cricinfo_match_notes_url=(url_list.iloc[row]['Cricinfo_URL'])
cricbuzz_match_url=(url_list.iloc[row]['Cricbuzz_URL'])

cricinfo_match_notes_url

'https://www.espncricinfo.com/series/19430/scorecard/1152847/england-vs-australia-2nd-test-icc-world-test-championship-2019-2021'

### Chromedriver initialisation for Selenium

https://dev.to/razgandeanu/selenium-cheat-sheet-9lc

In [4]:
##Path of chromedriver
chromedriver="./data/chromedriver.exe"

### Scrape cricinfo match notes for DRS events

In [5]:
def scrape_cricinfo_match_notes(cricinfo_match_notes_url):
    '''Function to scrape Cricinfo match notes given a match URL. Returns soup of match notes for a particular match'''
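    # Note: executable_path is the Selenium 3 style; Selenium 4 passes the driver path through a Service object instead
    driver = webdriver.Chrome(executable_path=chromedriver)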
    driver.get(cricinfo_match_notes_url)
    cricinfo_matchnotes_soup = BeautifulSoup(driver.page_source, 'html.parser')
    driver.quit()
    return cricinfo_matchnotes_soup
    

### Process cricinfo match notes to obtain day wise referrals with outcome


In [6]:
def process_cricinfo_match_notes(cricinfo_matchnotes_soup):
    '''Process the information from the cricinfo matchnotes soup and obtain day wise referral information'''
    m_notes=cricinfo_matchnotes_soup.find('h1',text='Match Notes')
    all_days=[d for d in m_notes.next_element.next.find_all('ul',{'class':'bulleted-list'})]
    all_days.reverse()
    return all_days
    

### Bucket match wise referral information according to order of innings in match


Match notes follow the format shown below. Parse the notes in day order and, whenever a new innings begins, bucket all subsequent referrals under that innings.

![alt text](./reports/C1.png)
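
The bucketing idea described above can be sketched on its own, independent of the scraped soup. This is a minimal illustration, not the notebook's implementation: the function name `bucket_reviews_by_innings` and the sample notes (teams, umpires, batters) are hypothetical, and the note layout is assumed from what the parsing code in this notebook expects. Unlike `create_innings_df`, the sketch checks for the `Over` prefix first, so a review note that happens to contain the word "innings" is not treated as an innings marker.

```python
from collections import defaultdict

def bucket_reviews_by_innings(notes):
    """Group 'Over ...' review notes under the most recent innings marker."""
    buckets = defaultdict(list)
    current_innings = None
    for note in notes:
        if note.startswith("Over"):
            # A review note: attach it to the innings currently in progress
            if current_innings is not None:
                buckets[current_innings].append(note)
        elif "innings" in note:
            # An innings marker: start a new bucket
            current_innings = note
            buckets[current_innings]  # touch the key so innings without reviews still appear
    return dict(buckets)

# Hypothetical notes, already in chronological order
sample_notes = [
    "Team A 1st innings",
    "Over 10.3: Review by Team B (Bowling), Umpire - Umpire X, Batsman - Batter A (Struck down)",
    "Team B 1st innings",
    "Over 4.1: Review by Team B (Batting), Umpire - Umpire Y, Batsman - Batter C (Upheld)",
    "Over 22.5: Review by Team A (Bowling), Umpire - Umpire X, Batsman - Batter D (Umpires Call - Struck down)",
]
buckets = bucket_reviews_by_innings(sample_notes)
```

Each key of `buckets` is an innings marker and each value is the list of review notes recorded during that innings.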

In [7]:
def create_innings_df(all_days):
    '''Process list of day wise reviews and return a neater dataframe'''
    innings_list=[] 
    innings_reviews=defaultdict(list)
    innings_list.append([a.text for a in all_days[0] if "innings" in a.text and len(a.text)<=30])
    ##There are instances where cricinfo match notes has the word 'innings' which might not be due to an innings beginning
    
    day_wise_reviews=[a.text for a in all_days[0] if a.text.startswith("Over") or "innings" in a.text]
    ##Append from day2 onwards to the day 1 list
    
    if len(all_days)>=2:
        for ad in all_days[1:]:
            daylist=[a.text for a in ad if a.text.startswith("Over") or "innings" in a.text]
            innings_list.append([a.text for a in ad if "innings" in a.text and len(a.text)<=30])
            for d in daylist:
                day_wise_reviews.append(d)
    
    idxs = [i for i,x in enumerate(day_wise_reviews) if 'innings' in x]
    start_end_idxs=list(map(list, zip(idxs, idxs[1:])))
    for s in start_end_idxs:
        innings_reviews[day_wise_reviews[s[0]]]=day_wise_reviews[s[0]+1:s[1]]
    innings_reviews[day_wise_reviews[max(idxs)]]=day_wise_reviews[max(idxs)+1:]
    
    ##Make the innings_wise_reviews dictionary to dataframe
    
    idf=pd.DataFrame.from_dict(innings_reviews,orient='index')
    idf.reset_index(inplace=True)
    idf.fillna('',inplace=True)
    innings_df=idf.melt(id_vars='index',value_name='reviews')
    
    ##In this case, the variable column does not add any value. Hence it can be dropped
    innings_df.drop(columns='variable',inplace=True) 
    innings_df.columns=['innings','reviews']
    innings_df=innings_df[innings_df.reviews!='']
    
    ###Augment Innings_df with the referral notes
    
    over=[]
    review_team=[]
    review_umpire=[]
    review_batsman=[]
    review_outcome=[]
    
    
    for review in innings_df.reviews:
        over.append(review.split('Over')[1].split(':')[0].strip())
        review_team.append(review.split(':')[1].strip().split('by ')[1].split(',')[0])
        review_umpire.append(review.split(':')[1].strip().split('Umpire - ')[1].split(',')[0])
        review_batsman.append(review.split(':')[1].strip().split('Batsman -')[1].strip().split('(')[0].strip())
        review_outcome.append(review.split(':')[1].strip().split('Batsman -')[1].strip().split('(')[1].split(')')[0].strip())

    innings_df['Over']=over
    innings_df['Review_team']=review_team
    innings_df['Review_batsman']=review_batsman
    innings_df['Review_umpire']=review_umpire
    innings_df['Review_outcome']=review_outcome
    innings_df['Umpires_call']=innings_df['Review_outcome'].apply(lambda x:"Umpire" in x)
    innings_df['index']=range(len(innings_df.reviews))
    ##Flatten the day wise lists of innings markers into a single list (empty days contribute nothing)
    innings_list_updated=[note for day in innings_list for note in day]
            
    ##Remove duplicates if any, from innings list 
    
    innings_list_updated=list(dict.fromkeys(innings_list_updated))
    innings_list_updated.reverse()
    
    innings_df.set_index('index',inplace=True,drop=False)
    return innings_df,innings_list_updated
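
The chained `split` calls above raise an `IndexError` as soon as a note deviates from the expected layout. One possible alternative, shown here only as a sketch, is a single regular expression that returns `None` for notes it cannot parse. The pattern, the helper name `parse_review_note` and the sample note are assumptions inferred from the splitting logic, not a confirmed description of the Cricinfo format.

```python
import re

# Assumed layout: "Over <over>: Review by <team>, Umpire - <umpire>, Batsman - <batsman> (<outcome>)"
REVIEW_PATTERN = re.compile(
    r"Over\s+(?P<over>[\d.]+):\s*Review by\s+(?P<team>[^,]+),\s*"
    r"Umpire\s*-\s*(?P<umpire>[^,]+),\s*"
    r"Batsman\s*-\s*(?P<batsman>[^(]+)\((?P<outcome>[^)]*)\)"
)

def parse_review_note(note):
    """Return the fields of one review note as a dict, or None if the layout differs."""
    match = REVIEW_PATTERN.search(note)
    if match is None:
        return None
    fields = {key: value.strip() for key, value in match.groupdict().items()}
    # Same rule as the Umpires_call column: the outcome text mentions "Umpire"
    fields["umpires_call"] = "Umpire" in fields["outcome"]
    return fields

# Hypothetical note used purely for illustration
parsed = parse_review_note(
    "Over 22.5: Review by Team A (Bowling), Umpire - Umpire X, "
    "Batsman - Batter D (Umpires Call - Struck down)"
)
```

As in the original code, the team field keeps the bracketed role, so `parsed["team"]` is the full text up to the first comma.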


### Map break in partnerships to referral events 

In [23]:
def analyze_partnership_breaks(cricinfo_matchnotes_soup,innings_df,innings_list):
    '''Analyze break in partnerships using Fall of wicket data and augment innings_df'''
    fow_text=[fow.text for fow in cricinfo_matchnotes_soup.find_all('div',{"class":"wrap dnb"}) if "Fall of wickets:" in fow.text]
    innings_fow=defaultdict(list)
    for a,inn in enumerate(reversed(innings_list)):
        innings_fow[inn]=[f.split(')')[0].strip().split(' ')[0] for i,f in enumerate(fow_text[a].split(':')[1].strip().split(',')) if i%2!=0 and f.split(')')[0].strip().split(' ')[0]!='retired']
    
    innings_fow_df=pd.DataFrame.from_dict(innings_fow,orient='index')
    innings_fow_df.reset_index(inplace=True)
    innings_fow_df.fillna('',inplace=True)
    innings_fow_df=innings_fow_df.melt(id_vars='index',value_name='wickets')
    ##In this case, the variable refers to the fall of wicket. 
    
    innings_fow_df['variable']=innings_fow_df['variable']+1
    innings_fow_df.columns=['innings','active_partnership','Over']
    innings_fow_df=innings_fow_df[innings_fow_df.Over!='']
    pbreak_innings=pd.merge(innings_df,innings_fow_df,on=['innings','Over'],how='inner')['index']
    innings_df['Partnership_broken']=False
    innings_df.loc[pbreak_innings,'Partnership_broken']=True
    innings_df.Over=innings_df.Over.astype('float')
    return innings_df,innings_fow_df
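
The list comprehension that extracts the over of each wicket is dense, so here is a small standalone sketch of the same idea using a regular expression. The helper name `fall_of_wicket_overs` and the sample string are hypothetical, and the "Fall of wickets" layout is assumed from the splitting logic above. The sketch simply ignores entries without an over, which is assumed to cover the `retired` case the original filters out.

```python
import re

def fall_of_wicket_overs(fow_text):
    """Extract the over at which each wicket fell, in order of dismissal."""
    # Keep only the part after the "Fall of wickets:" label
    _, _, body = fow_text.partition(":")
    # Each dismissal is assumed to end with ", <over> ov)"
    return re.findall(r",\s*(\d+(?:\.\d+)?)\s*ov\)", body)

# Hypothetical fall-of-wickets line used purely for illustration
sample = "Fall of wickets: 1-0 (Batter A, 1.2 ov), 2-26 (Batter B, 7.5 ov), 3-92 (Batter C, 32.4 ov)"
overs = fall_of_wicket_overs(sample)

# The position in the list gives the wicket number, which the notebook
# stores as active_partnership (1 for the first wicket, and so on)
wicket_numbers = list(range(1, len(overs) + 1))
```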




### Augment every ball with Cricbuzz commentary (for future use cases)

In [9]:
def parse_cricbuzz_commentary(cricbuzz_match_url):
    '''Parse ball by ball commentary from Cricbuzz match URL'''
    driver = webdriver.Chrome(executable_path=chromedriver)
    driver.get(cricbuzz_match_url)
    cricbuzz_match_soup = BeautifulSoup(driver.page_source, 'html.parser')
    driver.quit()
    return cricbuzz_match_soup


### Process cricbuzz commentary to map innings and Over

In [10]:
def merge_commentary(text):
    return '###'.join(text)

In [18]:
def process_cricbuzz_commentary(cricbuzz_match_soup,innings_list):
    '''Process cricbuzz soup and create cricbuzz commentary dataframe'''
    commentary_text=[c.text for c in cricbuzz_match_soup.find_all('p',{'class':'cb-col cb-col-90 cb-com-ln'})]
    over_text=[o.text for o in cricbuzz_match_soup.find_all('span',{'cb-col cb-col-8 text-bold'})]
    inngs_breaks=[index for index, value in enumerate(over_text) if value == '0.1']

    if len(inngs_breaks)>=5:
        inngs_idx_to_remove=inngs_breaks.index([s for s,t in list(zip(inngs_breaks,inngs_breaks[1:])) if t-s<=6][0])
        inngs_breaks.pop(inngs_idx_to_remove)
        
    cricbuzz_commentary_df=pd.DataFrame({'Over':over_text,'commentary':commentary_text})

    start=0

    for nib, ib in enumerate(inngs_breaks):
        cricbuzz_commentary_df.loc[start:ib,'innings']=innings_list[nib]
        start=ib+1

    cricbuzz_commentary_df['commentary']=(cricbuzz_commentary_df.groupby(['Over','innings'],as_index=False)['commentary'].transform(merge_commentary).reset_index(drop=True))

    cricbuzz_commentary_df.Over=cricbuzz_commentary_df.Over.astype('float')

    cricbuzz_commentary_df.drop_duplicates(inplace=True)
        
    return cricbuzz_commentary_df
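
The innings labelling above relies on two observations: the commentary is listed latest ball first, and an over value of `0.1` marks the first ball of an innings. A minimal sketch of that labelling, assuming every innings contains exactly one `0.1` row (the notebook handles a repeated `0.1` separately), could look like this. The function name `label_innings` and the sample values are hypothetical.

```python
import pandas as pd

def label_innings(over_text, innings_names):
    """Assign an innings name to each commentary row (rows are latest ball first)."""
    overs = pd.Series(over_text)
    # Every '0.1' closes a block; shifting by one row keeps the '0.1' row
    # inside the block it closes instead of opening the next one
    block_id = (overs == "0.1").cumsum().shift(fill_value=0)
    return [innings_names[i] if i < len(innings_names) else None for i in block_id]

# Hypothetical commentary: the later innings appears first
over_text = ["1.2", "1.1", "0.2", "0.1", "0.3", "0.2", "0.1"]
labels = label_innings(over_text, ["Team B 1st innings", "Team A 1st innings"])
```

The innings names must be passed in the same latest-first order as the commentary rows.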


In [19]:
def process_innings_state(cricbuzz_commentary_df,innings_fow_df):
    '''Processing unbroken partnerships'''
    innings_fow_df.Over=innings_fow_df.Over.astype('float')
    cdf=cricbuzz_commentary_df.groupby('innings')['Over'].max()
    idf=innings_fow_df.groupby('innings')['Over'].max()
    idf2=pd.DataFrame(cdf.loc[cdf!=idf]).reset_index()
    if idf2.shape[0]>=1:
        ## To handle unbroken partnerships
        idf_wicket=innings_fow_df.groupby('innings')['active_partnership'].max()
        idf_wicket=pd.DataFrame(idf_wicket[idf_wicket<=9])
        fow_df_to_add=idf2.merge(idf_wicket,on=['innings'])
        fow_df_to_add['active_partnership']=fow_df_to_add['active_partnership']+1
        innings_fow_df=pd.concat([innings_fow_df,fow_df_to_add],axis=0)
        innings_fow_df.reset_index(inplace=True,drop=True)
    innings_state=pd.merge(cricbuzz_commentary_df,innings_fow_df,on=['Over','innings'],how='left')
    is_active=innings_state.groupby(['innings']).ffill()
    innings_state['active_partnership']=is_active['active_partnership']
    
    return innings_state
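
The key step in `process_innings_state` is the left merge followed by a forward fill within each innings. Because the rows are latest ball first, the forward fill carries each wicket number down to the balls bowled before that wicket fell. The toy frames below are hypothetical and only illustrate this mechanism.

```python
import pandas as pd

# Hypothetical ball-by-ball rows, latest ball first
balls = pd.DataFrame({
    "innings": ["A"] * 6,
    "Over": [1.2, 1.1, 0.4, 0.3, 0.2, 0.1],
})

# Hypothetical fall of wickets: 1st wicket at 0.3, 2nd wicket at 1.2
fow = pd.DataFrame({
    "innings": ["A", "A"],
    "Over": [1.2, 0.3],
    "active_partnership": [2, 1],
})

# Only balls on which a wicket fell get a value after the left merge
state = balls.merge(fow, on=["innings", "Over"], how="left")

# Forward fill within each innings so every ball knows which partnership was active
state["active_partnership"] = state.groupby("innings")["active_partnership"].ffill()
```

Balls bowled after the last wicket of an innings have nothing above them to fill from, which is why the notebook adds an extra row for unbroken partnerships before merging.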

In [20]:
def compile_referral_data_step1(cricinfo_match_notes_url,cricbuzz_match_url):
    '''Compile all data from given cricinfo match notes url and cricbuzz match URL'''
    cricinfo_matchnotes_soup=scrape_cricinfo_match_notes(cricinfo_match_notes_url)
    all_days=process_cricinfo_match_notes(cricinfo_matchnotes_soup)
    cricbuzz_match_soup=parse_cricbuzz_commentary(cricbuzz_match_url)
    return cricinfo_matchnotes_soup,all_days,cricbuzz_match_soup

cricinfo_matchnotes_soup,all_days,cricbuzz_match_soup=compile_referral_data_step1(cricinfo_match_notes_url,cricbuzz_match_url)

In [24]:
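def compile_referral_data_step2(cricinfo_matchnotes_soup,all_days,cricbuzz_match_soup):
    '''Build the ball by ball referral dataframe for one match. Note: reads the notebook level variable cricbuzz_match_url to label the match'''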
    innings_df,innings_list=create_innings_df(all_days)
    innings_df_updated,innings_fow_df=analyze_partnership_breaks(cricinfo_matchnotes_soup,innings_df,innings_list)
    cricbuzz_commentary_df=process_cricbuzz_commentary(cricbuzz_match_soup,innings_list)
    innings_state=process_innings_state(cricbuzz_commentary_df,innings_fow_df)
    innings_df.Over=innings_df.Over.astype('float')
    reviews_match=pd.merge(innings_state,innings_df_updated,how='left')
    reviews_match.fillna('',inplace=True)
    reviews_match['match']=cricbuzz_match_url.split('/')[-1]
    return reviews_match
    

In [25]:
reviews_match=compile_referral_data_step2(cricinfo_matchnotes_soup,all_days,cricbuzz_match_soup)

reviews_match

Unnamed: 0,Over,commentary,innings,active_partnership,reviews,Review_team,Review_batsman,Review_umpire,Review_outcome,Umpires_call,index,Partnership_broken,match
0,47.3,"Jack Leach to Pat Cummins, no run, slides in f...",Australia 2nd innings,7,,,,,,,,,eng-vs-aus-2nd-test-the-ashes-2019
1,47.2,"Jack Leach to Pat Cummins, no run, drifting in...",Australia 2nd innings,7,,,,,,,,,eng-vs-aus-2nd-test-the-ashes-2019
2,47.1,"Jack Leach to Pat Cummins, no run, smothered b...",Australia 2nd innings,7,,,,,,,,,eng-vs-aus-2nd-test-the-ashes-2019
3,46.6,"Denly to Head, no run, smoothly driven by Head...",Australia 2nd innings,7,,,,,,,,,eng-vs-aus-2nd-test-the-ashes-2019
4,46.5,"Denly to Head, no run, defended past silly point",Australia 2nd innings,7,,,,,,,,,eng-vs-aus-2nd-test-the-ashes-2019
5,46.4,"Denly to Head, no run, overpitched outside off...",Australia 2nd innings,7,,,,,,,,,eng-vs-aus-2nd-test-the-ashes-2019
6,46.3,"Denly to Head, no run, whenever there's a chan...",Australia 2nd innings,7,,,,,,,,,eng-vs-aus-2nd-test-the-ashes-2019
7,46.2,"Denly to Head, no run, almost shaves the outsi...",Australia 2nd innings,7,,,,,,,,,eng-vs-aus-2nd-test-the-ashes-2019
8,46.1,"Denly to Head, no run, overpitched outside off...",Australia 2nd innings,7,,,,,,,,,eng-vs-aus-2nd-test-the-ashes-2019
9,45.6,"Jack Leach to Pat Cummins, no run, back round ...",Australia 2nd innings,7,,,,,,,,,eng-vs-aus-2nd-test-the-ashes-2019


In [26]:
reviews_match.to_csv('review_data/reviews_match_{0}.csv'.format(cricbuzz_match_url.split('/')[-1]),index=False)
print("Match parsed",cricbuzz_match_url.split('/')[-1])

Match parsed eng-vs-aus-2nd-test-the-ashes-2019
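
Once several matches have been written to `review_data/`, the per-match files can be combined for the Phase 1 question on referrals per innings. The sketch below assumes the file naming used in the cell above and the column names of `reviews_match`. The helper names `load_all_reviews` and `referrals_per_innings` are hypothetical and not part of the notebook.

```python
import glob
import os

import pandas as pd

def load_all_reviews(folder="review_data"):
    """Concatenate every per-match CSV found in the folder."""
    paths = sorted(glob.glob(os.path.join(folder, "reviews_match_*.csv")))
    if not paths:
        return pd.DataFrame()
    frames = [pd.read_csv(path) for path in paths]
    return pd.concat(frames, ignore_index=True, sort=False)

def referrals_per_innings(all_reviews):
    """Count rows that carry a referral note, per match and innings."""
    # Balls without a referral were saved as empty strings, which read back as NaN
    referred = all_reviews[all_reviews["reviews"].notna()]
    return referred.groupby(["match", "innings"]).size()
```

A typical call would be `referrals_per_innings(load_all_reviews())`, which returns a Series indexed by match and innings.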
