This notebook demonstrates a Python script that extracts Indian Premier League (IPL) match results from https://www.espncricinfo.com/ using web scraping.
It uses the requests library to fetch the HTML content and BeautifulSoup to parse it and extract the desired information.
At the end, the fetched data is saved as a CSV file.

Click [link](https://www.espncricinfo.com/records/trophy/team-match-results/indian-premier-league-117) to visit the source page on ESPNcricinfo.

In [1]:
from bs4 import BeautifulSoup
import pandas as pd

import requests

In [2]:
# URL of the ESPNcricinfo page to scrape
url = "https://www.espncricinfo.com/records/trophy/team-match-results/indian-premier-league-117"

# Send an HTTP GET request to the website
response = requests.get(url)
# Stop early if the server returned an error status (e.g. 403 or 404)
response.raise_for_status()

# Parse the HTML content using BeautifulSoup
soup = BeautifulSoup(response.content, 'html.parser')

The page above contains a single table. The column names are read from table -> thead -> tr -> td, and the records from table -> tbody -> tr -> td.
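
To make the selector paths concrete, here is a minimal sketch that runs them against a tiny hand-written table instead of the live page. The HTML snippet, team names and link paths are made up for illustration; only the `thead tr td` and `tbody tr` selectors mirror the approach described above.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the results table (not real page content)
html = """
<table>
  <thead>
    <tr><td>Team 1</td><td>Team 2</td><td>Scorecard</td></tr>
  </thead>
  <tbody>
    <tr><td>A</td><td>B</td><td><a href="/match/1">T20</a></td></tr>
    <tr><td>C</td><td>D</td><td><a href="/match/2">T20</a></td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# table -> thead -> tr -> td gives the column names
columns = [td.get_text() for td in soup.select("table thead tr td")]

# table -> tbody -> tr gives one row per match; each td is one field
rows = [
    [td.get_text() for td in tr.select("td")]
    for tr in soup.select("table tbody tr")
]

print(columns)
print(rows)
```

Trying the selectors on a small snippet like this is a quick way to check them before pointing the script at the real page.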

In [3]:
for table in soup.select('table'):

    # Column names come from the single header row
    for thead in table.select('thead'):
        if len(thead.select('tr')) != 1:
            print('Please check HTML contents.')

        column_list = [cell.get_text() for cell in thead.select('tr td')]

    # Each body row becomes one record
    record = []
    for tbody in table.select('tbody'):

        for row in tbody.select('tr'):
            # The second link in the row points to the match scorecard
            link = row.find_all('a')[1].get('href')
            item_list = [cell.get_text() for cell in row.select('td')]
            # Store the scorecard URL instead of the link text
            item_list[-1] = link
            record.append(item_list)

    table_df = pd.DataFrame(record, columns=column_list)

# table_df

In [4]:
match_results = table_df.copy()
match_results.head()

Unnamed: 0,Team 1,Team 2,Winner,Margin,Ground,Match Date,Scorecard
0,Titans,Super Kings,Super Kings,5 wickets,Ahmedabad,"May 28-29, 2023",/series/indian-premier-league-2023-1345038/guj...
1,Titans,Mumbai,Titans,62 runs,Ahmedabad,"May 26, 2023",/series/indian-premier-league-2023-1345038/guj...
2,Super Giants,Mumbai,Mumbai,81 runs,Chennai,"May 24, 2023",/series/indian-premier-league-2023-1345038/luc...
3,Super Kings,Titans,Super Kings,15 runs,Chennai,"May 23, 2023",/series/indian-premier-league-2023-1345038/che...
4,RCB,Titans,Titans,6 wickets,Bengaluru,"May 21, 2023",/series/indian-premier-league-2023-1345038/roy...


In [5]:
# Rename columns
match_results.columns = ['team_1','team_2','winner','margin','ground','match_date','scorecard']
# match_results.head()
match_results.shape

(1025, 7)

**Note:**
- **scorecard** holds the link to each match's details; the link also contains a unique ID that serves as the match_id.
- **winner** contains *no result* for some matches, indicating that the match was cancelled for some reason; these rows can be ignored or dropped.
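
As a rough sketch of the second point, rows without a winner could be dropped with a boolean filter. The small frame below is a stand-in for match_results with made-up rows; whether 'tied' matches should also be excluded is a separate decision not made here.

```python
import pandas as pd

# Toy stand-in for match_results (illustrative rows only)
toy = pd.DataFrame({
    "team_1": ["Titans", "RCB", "Royals"],
    "team_2": ["Mumbai", "Royals", "KKR"],
    "winner": ["Titans", "no result", "tied"],
})

# Keep every row whose winner is not the 'no result' marker
decided = toy[toy["winner"] != "no result"].reset_index(drop=True)

print(decided)
```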

In [6]:
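# Polishing match_date: keep only the first day of a multi-day entry
# (e.g. "May 28-29, 2023" becomes "May 28, 2023")
def polish_match_date(match_date):
    parts = match_date.split()
    # parts[1][:2] is the first day; a single-digit day keeps its comma here
    polished = parts[0] + ' ' + parts[1][:2] + ', ' + parts[2]
    # Collapse the doubled comma left behind by single-digit days
    return polished.replace(",,", ",")

match_results.match_date = match_results.match_date.apply(polish_match_date)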

# Adding a new feature named match_id 
match_results['match_id'] = match_results.scorecard.apply(lambda scorecard: \
                                                          scorecard.split('/')[-2].split('-')[-1])
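
The date clean-up above works by slicing the string. A possible alternative, shown here only as a sketch, is to strip the day range with a regular expression and parse the result into a real date. The helper name `first_match_day` is hypothetical, and the format string assumes English three-letter month abbreviations such as "May".

```python
import re
from datetime import datetime


def first_match_day(match_date: str) -> datetime:
    """Return the first day of a date string such as 'May 28-29, 2023'."""
    # Turn a day range like '28-29,' into just the first day, '28,'
    cleaned = re.sub(r"(\d{1,2})-\d{1,2},", r"\1,", match_date)
    # Assumed format: abbreviated month, day, comma, four-digit year
    return datetime.strptime(cleaned, "%b %d, %Y")


multi_day = first_match_day("May 28-29, 2023")
single_day = first_match_day("May 5, 2023")
print(multi_day.date(), single_day.date())
```

Parsing into a date type makes later sorting and filtering by season easier than working with strings.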

In [7]:
match_results.head()

Unnamed: 0,team_1,team_2,winner,margin,ground,match_date,scorecard,match_id
0,Titans,Super Kings,Super Kings,5 wickets,Ahmedabad,"May 28, 2023",/series/indian-premier-league-2023-1345038/guj...,1370353
1,Titans,Mumbai,Titans,62 runs,Ahmedabad,"May 26, 2023",/series/indian-premier-league-2023-1345038/guj...,1370352
2,Super Giants,Mumbai,Mumbai,81 runs,Chennai,"May 24, 2023",/series/indian-premier-league-2023-1345038/luc...,1370351
3,Super Kings,Titans,Super Kings,15 runs,Chennai,"May 23, 2023",/series/indian-premier-league-2023-1345038/che...,1370350
4,RCB,Titans,Titans,6 wickets,Bengaluru,"May 21, 2023",/series/indian-premier-league-2023-1345038/roy...,1359544


In [8]:
match_results.winner.value_counts()

Mumbai          138
Super Kings     131
KKR             119
RCB             114
Royals          101
Kings XI         85
Sunrisers        78
Daredevils       67
Capitals         38
Chargers         29
Titans           23
Punjab Kings     19
Super Giants     17
tied             14
Guj Lions        13
Warriors         12
Supergiant       10
no result         6
Kochi             6
Supergiants       5
Name: winner, dtype: int64

In [9]:
match_results.scorecard.tolist()[1]

'/series/indian-premier-league-2023-1345038/gujarat-titans-vs-mumbai-indians-qualifier-2-1370352/full-scorecard'
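
The structure of this link can be taken apart with plain string methods. The sketch below uses the exact URL printed above; the variable names are illustrative, and reading the trailing number of the series segment as a series ID is an assumption based on the URL layout.

```python
scorecard = (
    "/series/indian-premier-league-2023-1345038/"
    "gujarat-titans-vs-mumbai-indians-qualifier-2-1370352/full-scorecard"
)

segments = scorecard.split("/")

# The second-to-last segment describes the match and ends with its ID
match_slug = segments[-2]
match_id = match_slug.rsplit("-", 1)[-1]

# The segment before it describes the series (assumed to end with a series ID)
series_slug = segments[-3]
series_id = series_slug.rsplit("-", 1)[-1]

print(match_id, series_id)
```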

There are some conflicts in the data:
- The same team can appear under more than one name, e.g. Supergiant and Supergiants. There are also some 'no result' records.
- The scorecard column holds the link to each match's scorecard; the link includes the names of the two teams and the unique match ID.

**Note:** All of these conflicts will be resolved in later steps.
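
One way such naming conflicts might be handled later is with a lookup table that maps alternate labels to a single canonical label. This is only a sketch on a toy frame: the `TEAM_ALIASES` mapping is an assumption for illustration, and which labels really belong to the same team should be verified against the data before use.

```python
import pandas as pd

# Illustrative mapping from alternate labels to one canonical label.
# Which labels belong together is an assumption; verify before use.
TEAM_ALIASES = {
    "Supergiants": "Supergiant",
    "Kings XI": "Punjab Kings",
}

# Toy stand-in for match_results (made-up rows)
toy = pd.DataFrame({
    "team_1": ["Supergiants", "Kings XI", "Mumbai"],
    "team_2": ["Mumbai", "RCB", "Supergiant"],
    "winner": ["Supergiants", "RCB", "no result"],
})

# Apply the same mapping to every column that holds a team label
for column in ["team_1", "team_2", "winner"]:
    toy[column] = toy[column].replace(TEAM_ALIASES)

print(toy)
```

Labels that are not in the mapping, including 'no result', pass through unchanged.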

In [10]:
match_results.to_csv("ipl_match_results.csv", index=False)
print("ipl_match_results.csv downloaded !!!")

ipl_match_results.csv downloaded !!!
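
When the CSV is read back later, pandas will infer match_id as an integer column unless told otherwise. The sketch below shows one way to keep it as text, using an in-memory buffer and a one-row toy frame instead of the real file.

```python
import io

import pandas as pd

# One-row stand-in for match_results (illustrative values)
toy = pd.DataFrame({"winner": ["Titans"], "match_id": ["1370352"]})

# Write to an in-memory buffer rather than a file on disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

# dtype keeps match_id as a string instead of an inferred integer
restored = pd.read_csv(buffer, dtype={"match_id": str})

print(restored["match_id"].tolist())
```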
