## A study to understand how teams, players and umpires have used referrals in the World Test Championship

### Possible insights desired

**PHASE 1**:

- How effective was the DRS call in extending the survival at the crease?

- No. of referrals innings wise

- Which batting partner has probably assisted the most in DRS calls?

**PHASE 2**:

- How many recognized batsmen were left? And was there a missed opportunity because a DRS review was taken recklessly earlier?

- Missed reviews by teams: did the number of remaining reviews have a say?


### Data points collected (Phase 1)
-  Match in Series 
-  Series Name to produce facets 
-  Match Venue
-  Match Date (Month_Year)
-  Over of referral
-  Innings of referral in game
- Team taking review
- Team Batting/Bowling
- Umpire at time of review
- Batsman at time of review
- Outcome of review 
- Innings wise dismissal data
- Active wicket partnership
- Innings wise referral data (scraped from match notes, needs a bit of formatting)
- Commentary of that particular referral ball


### Data points (Phase 2)

- Number of recognized batsmen still to come vs missed opportunities for them


### Required libraries 

In [40]:
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
import pandas as pd
from collections import defaultdict

### Chromedriver initialisation for Selenium 

https://dev.to/razgandeanu/selenium-cheat-sheet-9lc

In [41]:
##Path of chromedriver
chromedriver="C:/Users/k.shridhar/Documents/chromedriver.exe"

### Scrape cricinfo match notes for DRS events

In [42]:
def scrape_cricinfo_match_notes(cricinfo_match_notes_url):
    '''Function to scrape Cricinfo match notes given a match URL. Returns soup of match notes for a particular match'''
    driver = webdriver.Chrome(executable_path=chromedriver)
    driver.get(cricinfo_match_notes_url)
    cricinfo_matchnotes_soup = BeautifulSoup(driver.page_source, 'html.parser')
    driver.quit()
    return cricinfo_matchnotes_soup
    

### Process cricinfo match notes to obtain day wise referrals with outcome


In [43]:
def process_cricinfo_match_notes(cricinfo_matchnotes_soup):
    '''Process the information from the cricinfo matchnotes soup and obtain day wise referral information'''
    m_notes=cricinfo_matchnotes_soup.find('h1',text='Match Notes')
    all_days=[d for d in m_notes.next_element.next.find_all('ul',{'class':'bulleted-list'})]
    all_days.reverse()
    return all_days
    

### Bucket match wise referral information according to order of innings in match


Match notes follow the format shown below. Parse the notes in day order; whenever a new innings begins, bucket all subsequent referrals under it.

![alt text](./reports/C1.png)
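
The bucketing idea can be sketched on plain strings as follows. This is a simplified illustration under stated assumptions, not the notebook's implementation: the helper name `bucket_reviews_by_innings` and the sample notes are hypothetical, and any short note containing "innings" is treated as an innings marker.

```python
def bucket_reviews_by_innings(notes):
    """Group review notes under the most recent innings marker.

    Assumes notes are already in chronological order, that an innings
    marker is a short line containing 'innings', and that review notes
    start with 'Over' (both conventions are inferred, not guaranteed).
    """
    buckets = {}
    current = None
    for note in notes:
        if "innings" in note and len(note) <= 30:
            current = note
            buckets.setdefault(current, [])
        elif note.startswith("Over") and current is not None:
            buckets[current].append(note)
    return buckets


# Hypothetical notes, for illustration only
sample_notes = [
    "Team A 1st innings",
    "Over 20.3: Review by Team B (Bowling)",
    "Over 39.1: Review by Team A (Batting)",
    "Team B 1st innings",
    "Over 4.6: Review by Team B (Batting)",
]
bucketed = bucket_reviews_by_innings(sample_notes)
```

Reviews that appear before any innings marker are skipped here, which may or may not be the desired behaviour for real match notes.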

In [44]:
def create_innings_df(all_days):
    '''Process list of day wise reviews and return a neater dataframe'''
    innings_list=[] 
    innings_reviews=defaultdict(list)
    innings_list.append([a.text for a in all_days[0] if "innings" in a.text and len(a.text)<=30])
    ##There are instances where cricinfo match notes has the word 'innings' which might not be due to an innings beginning
    
    day_wise_reviews=[a.text for a in all_days[0] if a.text.startswith("Over") or "innings" in a.text]
    ##Append from day2 onwards to the day 1 list
    
    if len(all_days)>=2:
        for ad in all_days[1:]:
            daylist=[a.text for a in ad if a.text.startswith("Over") or "innings" in a.text]
            innings_list.append([a.text for a in ad if "innings" in a.text and len(a.text)<=30])
            for d in daylist:
                day_wise_reviews.append(d)
    
    idxs = [i for i,x in enumerate(day_wise_reviews) if 'innings' in x]
    start_end_idxs=list(map(list, zip(idxs, idxs[1:])))
    for s in start_end_idxs:
        innings_reviews[day_wise_reviews[s[0]]]=day_wise_reviews[s[0]+1:s[1]]
    innings_reviews[day_wise_reviews[max(idxs)]]=day_wise_reviews[max(idxs)+1:]
    
    ##Make the innings_wise_reviews dictionary to dataframe
    
    idf=pd.DataFrame.from_dict(innings_reviews,orient='index')
    idf.reset_index(inplace=True)
    idf.fillna('',inplace=True)
    innings_df=idf.melt(id_vars='index',value_name='reviews')
    
    ##In this case, the variable column does not add any value. Hence it can be dropped
    innings_df.drop(columns='variable',inplace=True) 
    innings_df.columns=['innings','reviews']
    innings_df=innings_df[innings_df.reviews!='']
    
    ###Augment Innings_df with the referral notes
    
    over=[]
    review_team=[]
    review_umpire=[]
    review_batsman=[]
    review_outcome=[]
    
    
    for review in innings_df.reviews:
        over.append(review.split('Over')[1].split(':')[0].strip())
        review_team.append(review.split(':')[1].strip().split('by ')[1].split(',')[0])
        review_umpire.append(review.split(':')[1].strip().split('Umpire - ')[1].split(',')[0])
        review_batsman.append(review.split(':')[1].strip().split('Batsman -')[1].strip().split('(')[0].strip())
        review_outcome.append(review.split(':')[1].strip().split('Batsman -')[1].strip().split('(')[1].split(')')[0].strip())

    innings_df['Over']=over
    innings_df['Review_team']=review_team
    innings_df['Review_batsman']=review_batsman
    innings_df['Review_umpire']=review_umpire
    innings_df['Review_outcome']=review_outcome
    innings_df['Umpires_call']=innings_df['Review_outcome'].apply(lambda x:"Umpire" in x)
    innings_df['index']=range(len(innings_df.reviews))
    innings_list_updated=[]
    for i in innings_list:
        if(i):
            for a in range(len(i)):
                innings_list_updated.append(i[a])
            
    ##Remove duplicates if any, from innings list 
    
    innings_list_updated=list(dict.fromkeys(innings_list_updated))
    innings_list_updated.reverse()
    
    innings_df.set_index('index',inplace=True,drop=False)
    return innings_df,innings_list_updated
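
The chained `split` calls above depend on every note having exactly the same shape. As an alternative sketch, a single regular expression can pull out the same fields and return `None` when a note does not match. The note layout assumed here ("Over X: Review by TEAM, Umpire - NAME, Batsman - NAME (OUTCOME)") is inferred from the splitting logic, and `parse_review_note` is a hypothetical helper.

```python
import re

# Layout inferred from the split logic; treat it as an assumption.
REVIEW_PATTERN = re.compile(
    r"Over\s+(?P<over>[\d.]+)\s*:\s*Review by (?P<team>[^,]+),"
    r"\s*Umpire - (?P<umpire>[^,]+),"
    r"\s*Batsman -\s*(?P<batsman>[^(]+)\((?P<outcome>[^)]*)\)"
)


def parse_review_note(note):
    """Return the review fields as a dict, or None if the note does not match."""
    match = REVIEW_PATTERN.search(note)
    if match is None:
        return None
    fields = {key: value.strip() for key, value in match.groupdict().items()}
    # Same rule as the Umpires_call column: flag outcomes mentioning 'Umpire'
    fields["umpires_call"] = "Umpire" in fields["outcome"]
    return fields


# Made-up note, for illustration only
parsed = parse_review_note(
    "Over 4.6: Review by Team A (Batting), Umpire - Umpire X, "
    "Batsman - Batter B (Struck down - Umpires Call)"
)
```

Returning `None` for unmatched notes makes malformed lines easy to log and inspect instead of raising an `IndexError` mid-loop.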


### Map break in partnerships to referral events 

In [45]:
def analyze_partnership_breaks(cricinfo_matchnotes_soup,innings_df,innings_list):
    '''Analyze break in partnerships using Fall of wicket data and augment innings_df'''
    fow_text=[fow.text for fow in cricinfo_matchnotes_soup.find_all('div',{"class":"wrap dnb"}) if "Fall of wickets:" in fow.text]
    innings_fow=defaultdict(list)
    for a,inn in enumerate(reversed(innings_list)):
        innings_fow[inn]=[f.split(')')[0].strip().split(' ')[0] for i,f in enumerate(fow_text[a].split(':')[1].strip().split(',')) if i%2!=0]
    
    innings_fow_df=pd.DataFrame.from_dict(innings_fow,orient='index')
    innings_fow_df.reset_index(inplace=True)
    innings_fow_df.fillna('',inplace=True)
    innings_fow_df=innings_fow_df.melt(id_vars='index',value_name='wickets')
    ##In this case, the variable refers to the fall of wicket. 
    
    innings_fow_df['variable']=innings_fow_df['variable']+1
    innings_fow_df.columns=['innings','active_partnership','Over']
    innings_fow_df=innings_fow_df[innings_fow_df.Over!='']
    pbreak_innings=pd.merge(innings_df,innings_fow_df,on=['innings','Over'],how='inner')['index']
    innings_df['Partnership_broken']=False
    innings_df.loc[pbreak_innings,'Partnership_broken']=True
    return innings_df,innings_fow_df
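
To make the fall-of-wickets parsing easier to follow, here is a small regex-based sketch. It assumes each entry looks like "1-32 (Batter A, 6.6 ov)", which is inferred from the splitting logic above; `parse_fall_of_wickets` and the sample string are hypothetical.

```python
import re

# wicket-score (batsman, over ov); the layout is an assumption
FOW_PATTERN = re.compile(r"(\d+)-(\d+)\s*\(([^,]+),\s*([\d.]+)\s*ov\)")


def parse_fall_of_wickets(fow_text):
    """Extract wicket number, score, batsman and over from a fall-of-wickets line."""
    return [
        {
            "wicket": int(wicket),
            "score": int(score),
            "batsman": name.strip(),
            # Kept as a string so it can be joined against the 'Over' column
            "over": over,
        }
        for wicket, score, name, over in FOW_PATTERN.findall(fow_text)
    ]


# Made-up line, for illustration only
wickets = parse_fall_of_wickets(
    "Fall of wickets: 1-32 (Batter A, 6.6 ov), 2-46 (Batter B, 15.3 ov)"
)
```

Entries that deviate from this layout (for example, retirements) would simply be skipped by `findall`, so it may be worth comparing the number of parsed wickets with the scorecard.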




### Augment every ball with Cricbuzz commentary (for future use cases)

In [46]:
def parse_cricbuzz_commentary(cricbuzz_match_url):
    '''Parse ball by ball commentary from Cricbuzz match URL'''
    driver = webdriver.Chrome(executable_path=chromedriver)
    driver.get(cricbuzz_match_url)
    cricbuzz_match_soup = BeautifulSoup(driver.page_source, 'html.parser')
    driver.quit()
    return cricbuzz_match_soup


### Process cricbuzz commentary to map innings and Over

In [47]:
def process_cricbuzz_commentary(cricbuzz_match_soup,innings_list):
    '''Process cricbuzz soup and create cricbuzz commentary dataframe'''
    commentary_text=[c.text for c in cricbuzz_match_soup.find_all('p',{'class':'cb-col cb-col-90 cb-com-ln'})]
    over_text=[o.text for o in cricbuzz_match_soup.find_all('span',{'cb-col cb-col-8 text-bold'})]
    inngs_breaks=[index for index, value in enumerate(over_text) if value == '0.1']
    current_innings=[]
    current_commentary=[]
    current_over=[]
    end=0
    for i,b in enumerate(inngs_breaks):
        start=b
        for on,o in enumerate(over_text[end:start+1]):
            current_innings.append(innings_list[i])
            current_over.append(o)
            current_commentary.append(commentary_text[end:start+1][on])
        end=start+1
    cricbuzz_commentary_df=pd.DataFrame({'innings':current_innings,'Over':current_over,'Commentary':current_commentary})
    return cricbuzz_commentary_df
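
The innings-labelling step relies on two assumptions: commentary is listed latest ball first, and an over value of `0.1` marks the first ball of an innings. A stripped-down sketch of that idea is below; `label_innings` and the sample values are hypothetical, and `innings_latest_first` must hold the innings names in the same latest-first order.

```python
def label_innings(over_text, innings_latest_first):
    """Assign an innings label to each ball, reading commentary latest-first.

    Every '0.1' closes the current innings block, so the next ball
    (further down the page) belongs to the previous innings.
    """
    labels = []
    innings_idx = 0
    for over in over_text:
        labels.append(innings_latest_first[innings_idx])
        if over == "0.1":
            innings_idx += 1
    return labels


# Made-up overs: the second innings of the match appears first on the page
overs = ["0.2", "0.1", "1.1", "0.6", "0.5", "0.4", "0.3", "0.2", "0.1"]
labels = label_innings(overs, ["Team B 1st innings", "Team A 1st innings"])
```

A re-bowled first ball (wide or no-ball) could produce two `0.1` entries in one innings, which this sketch does not handle.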


### Compile all tables

In [48]:
##SNIPPET TO EXTRACT ALL WTC SCORECARD URLS OF CRICINFO

# page_url='https://www.espncricinfo.com/scores/series/19430/season/2019/icc-world-test-championship'
# r1=requests.get(page_url)
# bs_main=BeautifulSoup(r1.text,'html.parser')
# urllist=[]
# for link in bs_main.find_all('a',href=True,text='SCORECARD'):
#     urllist.append('https://www.espncricinfo.com'+link['href'])
# pd.DataFrame(urllist).to_csv('url.csv')


In [49]:
def compile_referral_data(cricinfo_match_notes_url,cricbuzz_match_url):
    '''Compile all data from given cricinfo match notes url and cricbuzz match URL'''
    cricinfo_matchnotes_soup=scrape_cricinfo_match_notes(cricinfo_match_notes_url)
    cricbuzz_match_soup=parse_cricbuzz_commentary(cricbuzz_match_url)
    all_days=process_cricinfo_match_notes(cricinfo_matchnotes_soup)
    innings_df,innings_list=create_innings_df(all_days)
    innings_df_updated,innings_fow_df=analyze_partnership_breaks(cricinfo_matchnotes_soup,innings_df,innings_list)
    cricbuzz_commentary_df=process_cricbuzz_commentary(cricbuzz_match_soup,innings_list)
    innings_state=pd.merge(cricbuzz_commentary_df,innings_fow_df,on=['innings','Over'],how='left').ffill()
    last_fow_idx=min(innings_state[~pd.isnull(innings_state.active_partnership)].index)
    ##Process FOW for innings which were declared or not all out or won
    if last_fow_idx>=1:
        last_fow_val=(innings_state[~pd.isnull(innings_state.active_partnership)]['active_partnership'])[last_fow_idx]+1
        ##Single .loc call so the assignment hits innings_state itself, not a temporary copy
        innings_state.loc[0:last_fow_idx-1,'active_partnership']=last_fow_val
    reviews_match=pd.merge(innings_state,innings_df_updated,how='left')
    reviews_match.fillna('',inplace=True)
    reviews_match['match']=cricbuzz_match_url.split('/')[-1]
    return reviews_match
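
The merge-and-fill step is easier to see on a toy frame. The sketch below uses made-up data and mirrors the logic in spirit only: a left merge attaches the wicket number to the ball where it fell, `ffill` carries it down to earlier balls (commentary is latest-first), and the balls after the last recorded wicket get the next partnership number.

```python
import pandas as pd

# Made-up commentary, latest ball first
commentary = pd.DataFrame({
    "innings": ["Team A 1st innings"] * 5,
    "Over": ["10.4", "10.3", "10.2", "10.1", "9.6"],
})
# Made-up fall of wicket: the 3rd wicket fell at over 10.2
fow = pd.DataFrame({
    "innings": ["Team A 1st innings"],
    "active_partnership": [3],
    "Over": ["10.2"],
})

state = pd.merge(commentary, fow, on=["innings", "Over"], how="left")
state["active_partnership"] = state["active_partnership"].ffill()

# Rows above the first known value were bowled after the last wicket
first_known = state["active_partnership"].first_valid_index()
if first_known is not None and first_known >= 1:
    next_partnership = state.loc[first_known, "active_partnership"] + 1
    state.loc[0:first_known - 1, "active_partnership"] = next_partnership
```

Note that a plain `ffill` also carries values across innings boundaries, so grouping by `innings` before filling may be safer on real data.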
    
    

### Read match wise Cricinfo and Cricbuzz URLs

In [50]:
url_list=pd.read_csv('review_data/URLs.csv')

url_list

Unnamed: 0,Cricinfo_URL,Cricbuzz_URL
0,https://www.espncricinfo.com/series/19430/scor...,https://www.cricbuzz.com/cricket-scores/20715/...
1,https://www.espncricinfo.com/series/19430/scor...,https://www.cricbuzz.com/cricket-scores/22954/...
2,https://www.espncricinfo.com/series/19430/scor...,https://www.cricbuzz.com/cricket-scores/20716/...
3,https://www.espncricinfo.com/series/19430/scor...,https://www.cricbuzz.com/cricket-scores/22955/...
4,https://www.espncricinfo.com/series/19430/scor...,https://www.cricbuzz.com/cricket-scores/20717/...
5,https://www.espncricinfo.com/series/19430/scor...,https://www.cricbuzz.com/cricket-scores/22859/...
6,https://www.espncricinfo.com/series/19430/scor...,https://www.cricbuzz.com/cricket-scores/22860/...
7,https://www.espncricinfo.com/series/19430/scor...,https://www.cricbuzz.com/cricket-scores/20718/...
8,https://www.espncricinfo.com/series/19430/scor...,https://www.cricbuzz.com/cricket-scores/20719/...
9,https://www.espncricinfo.com/series/19430/scor...,https://www.cricbuzz.com/cricket-scores/22743/...


### Debug

In [51]:
def compile_referral_data_debug(cricinfo_match_notes_url,cricbuzz_match_url):
    '''Compile all data from given cricinfo match notes url and cricbuzz match URL'''
    cricinfo_matchnotes_soup=scrape_cricinfo_match_notes(cricinfo_match_notes_url)
    cricbuzz_match_soup=parse_cricbuzz_commentary(cricbuzz_match_url)
    all_days=process_cricinfo_match_notes(cricinfo_matchnotes_soup)
    innings_df,innings_list=create_innings_df(all_days)
    return cricinfo_matchnotes_soup,cricbuzz_match_soup,all_days,innings_df,innings_list
    

### Run match by match [to avoid blocking of IPs]

In [52]:
# row=7
# cricinfo_match_notes_url=(url_list.iloc[row]['Cricinfo_URL'])
# cricbuzz_match_url=(url_list.iloc[row]['Cricbuzz_URL'])

# reviews_match=compile_referral_data(cricinfo_match_notes_url,cricbuzz_match_url)
# reviews_match.to_csv('review_data/reviews_match_{0}.csv'.format(cricbuzz_match_url.split('/')[-1]),index=False)
# print("Match parsed",cricbuzz_match_url.split('/')[-1])

# ##FOR DEBUGGING USE BELOW SNIPPET

# #cricinfo_matchnotes_soup,cricbuzz_match_soup,all_days,innings_df,innings_list=compile_referral_data_debug(cricinfo_match_notes_url,cricbuzz_match_url)
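
If the matches are ever processed in a loop instead of by hand, a pause between requests may reduce the chance of being blocked. The sketch below is one possible shape for that loop; `run_throttled`, its `delay_seconds` default and the failure handling are assumptions, and `process_match` stands in for a function such as `compile_referral_data`.

```python
import time


def run_throttled(url_pairs, process_match, delay_seconds=30.0):
    """Process (cricinfo_url, cricbuzz_url) pairs one by one with a pause between them.

    Returns two dicts keyed by the last segment of the Cricbuzz URL:
    successful results and the repr of any exception raised.
    """
    results, failures = {}, {}
    for i, (cricinfo_url, cricbuzz_url) in enumerate(url_pairs):
        if i > 0:
            time.sleep(delay_seconds)  # wait before hitting the sites again
        match_id = cricbuzz_url.split("/")[-1]
        try:
            results[match_id] = process_match(cricinfo_url, cricbuzz_url)
        except Exception as exc:  # keep going so one bad page does not stop the run
            failures[match_id] = repr(exc)
    return results, failures
```

Collecting failures separately makes it simple to re-run only the matches that did not parse.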

### Collect all match wise referral DF into one

In [53]:
import glob

reviews_df=pd.DataFrame()

##Use a separate loop variable so the glob module is not overwritten
for csv_path in glob.glob("data/*.csv"):
    df=pd.read_csv(csv_path)
    reviews_df=pd.concat([df,reviews_df])
    
# reviews_df.to_csv('reviews.csv')

In [54]:
reviews_df=pd.read_csv('review_data/reviews.csv')
reviews_df['innings_action']=(reviews_df['Review_team'].astype('str')).apply(lambda x: "Batting" if "Batting" in x else "Bowling")

In [55]:
reviews_df.shape

(29481, 14)

### Process review events- to add overturned, review loss, 1st review loss, 2nd review loss logics

In [56]:
##Take an explicit copy so new columns can be added without a SettingWithCopyWarning
df=reviews_df.loc[(~pd.isnull(reviews_df.reviews))].copy()

# df=df[df.match=='eng-vs-aus-3rd-test-the-ashes-2019']
# ##Split into innings
# reviews_df['Review_team']=reviews_df['Review_team'].astype('str')
# df=reviews_df
# df.fillna('',inplace=True)
##Keep start review count =2
## Overturn event logic. bowling-upheld and p.broken=True [OR] batting-upheld with p.broken=False 
##1st review lost at =
##2nd review lost at = 
## Review loss logic: for each innings, take the batting team and check whether the review was upheld, struck down, or struck down on umpire's call. Add 1 if struck down.
## For the bowling team, apply the same check. Add 1 if struck down.
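
The overturn and review-loss rules in the comments above can also be written as vectorised boolean masks. This is a sketch on the assumption that `Partnership_broken` is boolean and that `Review_outcome` holds exactly "Upheld", "Struck down" or "Struck down - Umpires Call"; `flag_review_outcomes` is a hypothetical helper name.

```python
import pandas as pd


def flag_review_outcomes(reviews):
    """Add 'overturned' and 'review_lost' columns to a copy of the reviews frame."""
    out = reviews.copy()
    team = out["Review_team"].astype(str)
    bowling = team.str.contains("Bowling", regex=False)
    batting = team.str.contains("Batting", regex=False)
    upheld = out["Review_outcome"].eq("Upheld")
    # Exact match, so 'Struck down - Umpires Call' does not count as a lost review
    struck_down = out["Review_outcome"].eq("Struck down")
    broken = out["Partnership_broken"].astype(bool)

    out["overturned"] = (bowling & upheld & broken) | (batting & upheld & ~broken)
    out["review_lost"] = (bowling & struck_down & ~broken) | (batting & struck_down & broken)
    return out


# Made-up rows, for illustration only
toy_reviews = pd.DataFrame({
    "Review_team": ["Team A (Batting)", "Team B (Bowling)", "Team B (Bowling)",
                    "Team A (Batting)", "Team A (Batting)"],
    "Review_outcome": ["Upheld", "Struck down", "Upheld",
                       "Struck down - Umpires Call", "Struck down"],
    "Partnership_broken": [False, False, True, True, True],
})
flagged = flag_review_outcomes(toy_reviews)
```

Working on a copy avoids modifying a slice of the original frame, and the masks avoid a row-wise `apply`.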

In [57]:
df['overturned']=df[['Review_team','Review_outcome','Partnership_broken']].apply(lambda x: True if 
(('Bowling' in x.Review_team and x.Review_outcome=='Upheld' and x.Partnership_broken==True) 
 or ('Batting' in x.Review_team and x.Review_outcome=='Upheld' and x.Partnership_broken==False)) else False,axis=1)



In [58]:
df['review_lost']=df[['Review_team','Review_outcome','Partnership_broken']].apply(lambda x: True if 
(('Bowling' in x.Review_team and x.Review_outcome=='Struck down' and x.Partnership_broken==False) 
 or ('Batting' in x.Review_team and x.Review_outcome=='Struck down' and x.Partnership_broken==True)) else False,axis=1)




In [59]:
df['first_review_lost']=(df.loc[df.review_lost==True].sort_values('Over').groupby(["match","innings","innings_action"])['review_lost'].rank('first')==1)

df['second_review_lost']=(df.loc[df.review_lost==True].sort_values('Over').groupby(["match","innings","innings_action"])['review_lost'].rank('first')==2)
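
An alternative way to number lost reviews within each innings is `cumcount`, sketched below. It assumes `Over` is numeric (so that sorting is chronological within an innings) and `review_lost` is boolean. Unlike the cell above, rows that are not lost reviews get `False` instead of `NaN`. `number_lost_reviews` is a hypothetical helper.

```python
import pandas as pd


def number_lost_reviews(reviews):
    """Flag the first and second lost review per match, innings and side."""
    out = reviews.copy()
    lost = out[out["review_lost"]].sort_values("Over")
    order = lost.groupby(["match", "innings", "innings_action"]).cumcount() + 1
    # Assignment aligns on the index; rows that were not lost stay NaN here
    out["lost_review_number"] = order
    out["first_review_lost"] = out["lost_review_number"].eq(1)
    out["second_review_lost"] = out["lost_review_number"].eq(2)
    return out


# Made-up rows, for illustration only
toy_lost = pd.DataFrame({
    "match": ["match-1"] * 3,
    "innings": ["Team A 1st innings"] * 3,
    "innings_action": ["Bowling"] * 3,
    "Over": [61.3, 20.3, 39.1],
    "review_lost": [True, True, False],
})
numbered = number_lost_reviews(toy_lost)
```

If `Over` is stored as text, it would need converting to a number first, since string sorting would place "115.2" before "20.3".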




In [60]:
df.fillna('',inplace=True)

In [61]:
cols_to_add_from_df=['overturned','review_lost','first_review_lost','second_review_lost']

for col in cols_to_add_from_df:
    reviews_df[col]=df.loc[df.index,col]

In [62]:
reviews_df.fillna('',inplace=True)

In [63]:
# reviews_df.to_csv('review_data/reviews_updated.csv')

In [64]:
reviews_df=pd.read_csv('review_data/reviews_updated.csv')

reviews_df.shape

(29481, 19)

### Visualisations for blog

- Batsman who took the most reviews (initiated by batsman)
- Whom did teams take a review against the most? 
- Umpire effectiveness (# of overturned decisions/total number of DRS calls for an umpire) Bar chart
- Review effectiveness by teams overall (Who had most partnership breaks when overturned)
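
The umpire effectiveness measure from the list above (overturned decisions divided by total DRS calls for an umpire) could be computed along these lines. This is a sketch that assumes a boolean `overturned` column and one row per review; `umpire_effectiveness` is a hypothetical helper and the sample rows are made up.

```python
import pandas as pd


def umpire_effectiveness(reviews):
    """Per-umpire totals and the share of reviewed decisions that were overturned."""
    summary = reviews.groupby("Review_umpire")["overturned"].agg(
        total_reviews="count", overturned="sum"
    )
    summary["overturn_ratio"] = summary["overturned"] / summary["total_reviews"]
    return summary.sort_values("overturn_ratio", ascending=False)


# Made-up rows, for illustration only
toy_umpires = pd.DataFrame({
    "Review_umpire": ["Umpire X", "Umpire X", "Umpire Y", "Umpire Y", "Umpire Y", "Umpire Y"],
    "overturned": [True, False, True, True, False, True],
})
effectiveness = umpire_effectiveness(toy_umpires)
```

The resulting frame can be passed straight to a bar chart, with `overturn_ratio` on the value axis.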

In [65]:
##Take an explicit copy so new columns can be added without a SettingWithCopyWarning
df=reviews_df[~pd.isnull(reviews_df.reviews)].copy()

df.shape

print(df.columns)

Index(['Unnamed: 0', 'innings', 'Over', 'Commentary', 'active_partnership',
       'reviews', 'Review_team', 'Review_batsman', 'Review_umpire',
       'Review_outcome', 'Umpires_call', 'index', 'Partnership_broken',
       'match', 'innings_action', 'overturned', 'review_lost',
       'first_review_lost', 'second_review_lost'],
      dtype='object')


### Overturned decisions per match

In [66]:
overturned_per_match=df.groupby('match')['overturned'].sum()

### Total decisions per match

In [67]:
total_per_match=df.groupby('match')['overturned'].count()

In [68]:
ot_decisions=[]

tot_avg_decisions=[]

In [69]:
for r in range((df.shape[0])):
    ot_decisions.append(overturned_per_match.loc[df.match.iloc[r]])
    tot_avg_decisions.append(overturned_per_match.loc[df.match.iloc[r]]/total_per_match.loc[df.match.iloc[r]])
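
The per-row loop above can be replaced by `groupby(...).transform`, which broadcasts each match's total back to its rows. The sketch below assumes `overturned` is boolean with no missing values, so the mean equals overturned divided by total; `add_match_overturn_stats` is a hypothetical helper.

```python
import pandas as pd


def add_match_overturn_stats(reviews):
    """Attach per-match overturned counts and ratios to every review row."""
    out = reviews.copy()
    per_match = out.groupby("match")["overturned"]
    out["overturned_decisions"] = per_match.transform("sum")
    out["ratio_overturned"] = per_match.transform("mean")
    return out


# Made-up rows, for illustration only
toy_matches = pd.DataFrame({
    "match": ["match-1", "match-1", "match-1", "match-1", "match-2", "match-2"],
    "overturned": [True, False, False, True, False, False],
})
with_stats = add_match_overturn_stats(toy_matches)
```

This keeps the computation in pandas and avoids building intermediate Python lists.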

In [70]:
df['overturned_decisions']=ot_decisions



In [71]:
df['ratio_overturned']=tot_avg_decisions



In [72]:
# fig = px.bar(df,x='match',y='overturned_decisions')
# # fig.update_xaxes(categoryorder='sum descending')

# fig.show()

In [73]:
df['team']=df['Review_team'].apply(lambda x:x.split('(')[0].strip())

##Split on ' to ' (with spaces) so names containing "to", e.g. Stokes, are not cut short
df['bowler']=df['Commentary'].apply(lambda x:x.split(' to ')[0].strip())



In [74]:
df.head()

Unnamed: 0.1,Unnamed: 0,innings,Over,Commentary,active_partnership,reviews,Review_team,Review_batsman,Review_umpire,Review_outcome,...,match,innings_action,overturned,review_lost,first_review_lost,second_review_lost,overturned_decisions,ratio_overturned,team,bowler
1560,1560,India 1st innings,115.2,"Holder to Hanuma Vihari, no run, he's given ou...",8,"Over 115.2: Review by India (Batting), Umpire ...",India (Batting),GH Vihari,PR Reiffel,Upheld,...,wi-vs-ind-2nd-test-india-tour-of-west-indies-2019,Batting,True,False,,,2,0.222222,India,Holder
4748,4748,India 1st innings,61.3,"Roach to Hanuma Vihari, leg byes, 1 run, horri...",5,"Over 61.3: Review by West Indies (Bowling), Um...",West Indies (Bowling),GH Vihari,PR Reiffel,Struck down,...,wi-vs-ind-2nd-test-india-tour-of-west-indies-2019,Bowling,False,True,False,True,2,0.222222,West Indies,Roach
4882,4882,India 1st innings,39.1,"Roach to Agarwal, no run, gone! Is he? Agarwal...",3,"Over 39.1: Review by India (Batting), Umpire -...",India (Batting),MA Agarwal,PR Reiffel,Upheld,...,wi-vs-ind-2nd-test-india-tour-of-west-indies-2019,Batting,True,False,,,2,0.222222,India,Roach
4995,4995,India 1st innings,20.3,"Cornwall to Kohli, no run, a huge lbw appeal a...",3,"Over 20.3: Review by West Indies (Bowling), Um...",West Indies (Bowling),V Kohli,RA Kettleborough,Struck down,...,wi-vs-ind-2nd-test-india-tour-of-west-indies-2019,Bowling,False,True,True,False,2,0.222222,West Indies,Cornwall
19570,19570,India 2nd innings,4.6,"Roach to Agarwal, out Lbw! Looks like he's gon...",1,"Over 4.6: Review by India (Batting), Umpire - ...",India (Batting),MA Agarwal,RA Kettleborough,Struck down - Umpires Call,...,wi-vs-ind-2nd-test-india-tour-of-west-indies-2019,Batting,False,False,,,2,0.222222,India,Roach


In [75]:
df.shape

(168, 23)

In [76]:
# df.to_csv('review_data/reviews_15dec.csv',index=False)

In [77]:
df1=df.loc[df.review_lost==True]

In [78]:
# df1.to_csv("review_data/rev_events.csv",index=False)

#### Fixes and enhancements:

- Active partnership is sometimes incremented for additional rows, due to retired-not-out batsmen, 1st innings declarations, etc.
- Add wicket keeper at review event