# Scraping the IPL (Indian Premier League) Points Table Using Selenium & Python

## Introduction
The Indian Premier League (IPL), officially known as TATA IPL for sponsorship reasons, is a professional men's Twenty20 cricket league contested by ten teams based in seven Indian cities and three Indian states. The league was founded by the Board of Control for Cricket in India (BCCI) in 2007.

IPL 2023 kicks off in a few months and, as with every season, data science enthusiasts will be itching to get their hands on match data. Most will want to slice and dice it for insights into teams and players; the more ambitious will try to predict match results.

# ![](https://assets.iplt20.com/bcci/articles/1662579908_np-iplImg.png)

## What is Web Scraping?

Web scraping lets us collect data from a source ourselves, in the format we want. The source imposes some limitations, but we keep greater control because we decide what data to scrape and how.
We will scrape the IPL points tables from iplt20.com, which allows scraping with some restrictions.
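
A site's scraping restrictions are usually published in its `robots.txt` file. The sketch below shows one way to check a URL against such rules with Python's standard `urllib.robotparser`. The rules in `SAMPLE_ROBOTS_TXT` are hypothetical and are not the actual rules published by iplt20.com, and `is_allowed` is a helper name introduced here for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules, used only for illustration.
# These are NOT the actual rules published by iplt20.com.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
"""


def is_allowed(robots_txt, url, user_agent="*"):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


points_table_ok = is_allowed(SAMPLE_ROBOTS_TXT, "https://www.iplt20.com/points-table/men/2022")
admin_ok = is_allowed(SAMPLE_ROBOTS_TXT, "https://www.iplt20.com/admin/login")
print(points_table_ok, admin_ok)
```

Against a live site, one would typically point the parser at the real file with `set_url(...)` and `read()` instead of passing the text to `parse`.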

The infographic below gives the gist of how we will approach this task.

## ![](https://i.imgur.com/HUwa4dj.jpeg)

# Outline of the Project:

1. Understanding the structure of the iplt20.com website.
2. Installing and importing the required libraries.
3. Simulating the page and extracting the URLs of all the years from the website using Selenium.
4. Accessing each year and building a URL (15 years in total, as shown in the figure above).
5. Parsing the points table details into 13 fields (Year, Position, Team, Played, Won, Lost, Tied, No Result, Net Run Rate, For, Against, Points, Form) using helper functions.
6. Storing the extracted data in a dictionary.
7. Compiling all the data into a DataFrame using pandas and saving it to a CSV file.

By the end of the project, we will have created a DataFrame in the following format.
## ![](https://i.imgur.com/eMAW51z.jpeg)

The Selenium web driver is hosted on a Replit server. To run the code below, open the following link:

https://replit.com/@iyumrahul/selinium-iplt20-Scraping#main.py

## Steps followed:
We have broken the work down into 3 steps:

1. First, we extract all the year links from the IPL homepage, using Selenium to gather them.

2. We then navigate the tags on each page and scrape the points tables one by one using the helper functions.

3. With the outer and inner tags in hand, we read the data from each points table page by calling the helper functions, collect it in a dictionary, load it into a DataFrame, and finally export it to a CSV file.

### We now move on to the code for the three steps mentioned above.

Install and import the required libraries.

In [5]:
!pip install jovian --upgrade --quiet
import jovian
# Execute this to save new versions of the notebook
jovian.commit(project="web-scraping-iplt20")

!pip install selenium --upgrade --quiet
!pip install pandas --upgrade --quiet
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
import pandas as pd

[jovian] Updating notebook "gowdarahul54/web-scraping-iplt20" on https://jovian.ai
[jovian] Committed successfully! https://jovian.ai/gowdarahul54/web-scraping-iplt20


Gathering the list of all the years in which the IPL was held.

In [11]:
#year list

year_list = [2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022]

In [12]:
# defining functions to get all the links

def get_url(year):
    # Build the points-table URL for a single season
    # (avoids the original variable name `input`, which shadows the built-in)
    base_url = 'https://www.iplt20.com/points-table/men/'
    return base_url + str(year)

def url_of_years():
    # Build one points-table URL per season in year_list
    return [get_url(year) for year in year_list]

In [13]:
url=url_of_years()
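
As an alternative to typing every season by hand, the list of URLs can be generated from a range of years. This is a minimal sketch, assuming the same URL pattern that `get_url` uses; `season_urls` is a hypothetical helper name introduced here.

```python
# Same base URL as in get_url
BASE_URL = 'https://www.iplt20.com/points-table/men/'


def season_urls(first_year, last_year):
    """Return one points-table URL per season, first_year to last_year inclusive."""
    return [BASE_URL + str(year) for year in range(first_year, last_year + 1)]


urls = season_urls(2008, 2022)
print(len(urls))
print(urls[0])
print(urls[-1])
```

Extending the scrape to a later season would then only require changing the second argument.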

We use the Selenium WebDriver to navigate to a web page. With WebDriver, the browser loads all of the website's resources (JavaScript files, images, CSS files, etc.).

In [None]:
def get_driver():
    chrome_options = Options()
    chrome_options.add_argument('--headless')
    chrome_options.add_argument('--no-sandbox')
    chrome_options.add_argument('--disable-dev-shm-usage')
    driver = webdriver.Chrome(options=chrome_options)
    return driver

In [None]:
print('Creating Driver')
driver = get_driver()
print('Fetching the page')
# Load the first season's page so that driver.title has something to report
driver.get(url[0])
print('Page title:', driver.title)
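
Because the browser has to run the page's JavaScript, the table rows may not be present the instant a page is requested. Selenium provides `WebDriverWait` for this situation; the sketch below is a library-independent illustration of the same polling idea, not Selenium's own implementation. `wait_until` and `fake_find_rows` are hypothetical names used only for this example.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds have passed.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f'condition not met within {timeout} seconds')
        time.sleep(poll_interval)


# Stand-in for a page whose rows only appear on the third check.
attempts = {'count': 0}


def fake_find_rows():
    attempts['count'] += 1
    return ['row'] if attempts['count'] >= 3 else []


rows = wait_until(fake_find_rows, timeout=2.0, poll_interval=0.01)
print(rows, attempts['count'])
```

With a real driver, the condition could be something like `lambda: driver.find_elements(By.CLASS_NAME, 'team0')`, since `find_elements` returns an empty list while nothing matches.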

In [None]:
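# Defining a function to get the row tags of the points table

def get_outer_tag():
    # The first element with class 'ih-pt-tab-bg' is the outer tag
    outer_tag = driver.find_elements(By.CLASS_NAME, 'ih-pt-tab-bg')[0]
    # Each inner tag with class 'team0' holds the <td> cells of one row
    inner_tag = outer_tag.find_elements(By.CLASS_NAME, 'team0')
    return inner_tag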

In [None]:
# defining a function to get the details of the position column
def get_position_col():
    tags = get_outer_tag()
    Place = []
    for i in tags:
        place = i.find_elements(By.TAG_NAME, 'td')[0].text
        Place.append(place)
    return Place

In [None]:
# defining a function to get the details of the teams column
def get_Team_col():
    tags = get_outer_tag()
    Team = []
    for i in tags:
        team = i.find_elements(By.TAG_NAME, 'td')[1].text
        Team.append(team)
    return Team

In [None]:
# defining a function to get the details of the matches played column
def get_Pld_col():
    tags = get_outer_tag()
    Pld = []
    for i in tags:
        pld = i.find_elements(By.TAG_NAME, 'td')[2].text
        Pld.append(pld)
    return Pld

In [None]:
# defining a function to get the details of the matches won column
def get_Won_col():
    tags = get_outer_tag()
    Won = []
    for i in tags:
        won = i.find_elements(By.TAG_NAME, 'td')[3].text
        Won.append(won)
    return Won

In [None]:
# defining a function to get the details of the matches lost column
def get_Lost_col():
    tags = get_outer_tag()
    Lost = []
    for i in tags:
        lost = i.find_elements(By.TAG_NAME, 'td')[4].text
        Lost.append(lost)
    return Lost

In [None]:
# defining a function to get the details of the matches tied column
def get_Tied_col():
    tags = get_outer_tag()
    Tied = []
    for i in tags:
        tied = i.find_elements(By.TAG_NAME, 'td')[5].text
        Tied.append(tied)
    return Tied

In [None]:
# defining a function to get the details of the column for matches that ended with no result
def get_No_Result_col():
    tags = get_outer_tag()
    No_Result = []
    for i in tags:
        noResult = i.find_elements(By.TAG_NAME, 'td')[6].text
        No_Result.append(noResult)
    return No_Result

In [None]:
# defining a function to get the details of the net run rate column
def get_Net_RR_col():
    tags = get_outer_tag()
    Net_RR = []
    for i in tags:
        net_rr = i.find_elements(By.TAG_NAME, 'td')[7].text
        Net_RR.append(net_rr)
    return Net_RR

In [None]:
# defining a function to get the details of the "for" column (score over the number of overs played)
def get_For_col():
    tags = get_outer_tag()
    _For = []
    for i in tags:
        _for = i.find_elements(By.TAG_NAME, 'td')[8].text
        _For.append(_for)
    return _For

In [None]:
# defining a function to get the details of the "against" column (score over the number of overs played)
def get_Against_col():
    tags = get_outer_tag()
    Against = []
    for i in tags:
        against = i.find_elements(By.TAG_NAME, 'td')[9].text
        Against.append(against)
    return Against

In [None]:
# defining a function to get the details of the points column
def get_Pts_col():
    tags = get_outer_tag()
    Pts = []
    for i in tags:
        points = i.find_elements(By.TAG_NAME, 'td')[10].text
        Pts.append(points)
    return Pts

In [None]:
# defining a function to get the details of the form column (last 5 games)
def get_Form_col():
    tags = get_outer_tag()
    Form = []
    for i in tags:
        form = i.find_elements(By.TAG_NAME,'td')[11].text.replace('\n','')
        Form.append(form)
    return Form
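
The helper functions above each look up the rows again and read a single cell per row. The same table could be assembled in one pass over the rows. The following is a minimal sketch of that idea, assuming each row has already been reduced to a list of its 12 cell texts (for example with `[td.text for td in row.find_elements(By.TAG_NAME, 'td')]`). `rows_to_columns` is a hypothetical helper, and the sample rows are made up for illustration.

```python
# Column names in the order the helper functions above read the <td> cells
COLUMNS = ['position_column', 'Team', 'Pld', 'Won', 'Lost', 'Tied',
           'No_result', 'Net_RR', 'For', 'Against', 'Points', 'Form']


def rows_to_columns(rows):
    """Turn a list of rows (each a list of 12 cell texts) into a dict of column lists."""
    table = {name: [] for name in COLUMNS}
    for cells in rows:
        if len(cells) < len(COLUMNS):
            raise ValueError(f'expected {len(COLUMNS)} cells, got {len(cells)}')
        for name, text in zip(COLUMNS, cells):
            # Same clean-up as get_Form_col: drop newlines inside the Form cell
            table[name].append(text.replace('\n', '') if name == 'Form' else text)
    return table


# Made-up rows for illustration only; not real IPL results.
sample_rows = [
    ['1', 'Team A', '14', '10', '4', '0', '0', '+0.316',
     '2670/268.3', '2615/270.5', '20', 'W\nL\nW\nW\nW'],
    ['2', 'Team B', '14', '9', '5', '0', '0', '+0.298',
     '2567/270.0', '2488/270.0', '18', 'L\nW\nW\nL\nW'],
]
table = rows_to_columns(sample_rows)
print(table['Team'], table['Form'])
```

Reading every cell of a row in one go keeps the columns aligned and avoids querying the page once per column.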

In [None]:
# defining a function to get the details of the year column
def get_year_col(year):
    Year=[]
    for i in year:
        if i==2022 or i==2011:
            for j in range(10):
                Year.append(i)
        elif i==2012 or i==2013:
            for j in range(9):
                Year.append(i)  
        else:
            for j in range(8):
                Year.append(i)   
    return Year

Years=get_year_col(year_list)   
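
`get_year_col` hard-codes how many teams played in each season. A possible alternative is to count the rows actually found on each page and repeat the year that many times. The sketch below assumes those counts are collected while scraping (for example `len(get_outer_tag())` after each `driver.get(...)`); `years_for_rows` is a hypothetical helper name.

```python
def years_for_rows(row_counts):
    """Given (year, number_of_rows) pairs in scrape order, return one year per row."""
    years = []
    for year, count in row_counts:
        years.extend([year] * count)
    return years


# These counts mirror the team counts that get_year_col hard-codes for these seasons.
measured = [(2010, 8), (2011, 10), (2012, 9)]
years = years_for_rows(measured)
print(len(years))
```

Counting rows at scrape time means the year column stays the same length as the other columns even if a new season has a different number of teams.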

In [None]:
# Defining the main function that calls all the other functions and collects their results in a dictionary

def get_data(url,years):
    dictonary={'Year':years,'position_column':[],'Team': [],'Pld':[],'Won': [],'Lost':[],'Tied':[],
               'No_result': [],'Net_RR': [],'For': [],'Against': [],'Points':[],'Form':[] }
    for i in url:
        # Load the points table page of one season, then append each of its columns
        driver.get(i)
        dictonary['position_column'].extend(get_position_col())
        dictonary['Team'].extend(get_Team_col())
        dictonary['Pld'].extend(get_Pld_col())
        dictonary['Won'].extend(get_Won_col())
        dictonary['Lost'].extend(get_Lost_col())
        dictonary['Tied'].extend(get_Tied_col())
        dictonary['No_result'].extend(get_No_Result_col())
        dictonary['Net_RR'].extend(get_Net_RR_col())
        dictonary['For'].extend(get_For_col())
        dictonary['Against'].extend(get_Against_col())
        dictonary['Points'].extend(get_Pts_col())
        dictonary['Form'].extend(get_Form_col())
    return dictonary

### The final step: exporting the extracted data to CSV format


In [None]:
# store the resulting output in a variable
dictonary_data = get_data(url,Years)

# build a DataFrame holding the points tables of all years in tabular form
df = pd.DataFrame(dictonary_data)

# the DataFrame is then written to a CSV file
df.to_csv('IPLT20-points-Table.csv', index=False)
print(df)
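
Since Selenium's `.text` returns strings, every scraped column holds text, including the numeric ones. The sketch below shows one way the numeric columns could be converted before analysis. It assumes values formatted like `'14'` and `'+0.316'`; `convert_types` is a hypothetical helper, and the sample values are made up for illustration.

```python
INT_COLUMNS = ['Pld', 'Won', 'Lost', 'Tied', 'No_result', 'Points']


def convert_types(table):
    """Return a copy of the column dict with numeric columns converted from text."""
    converted = {name: list(values) for name, values in table.items()}
    for name in INT_COLUMNS:
        converted[name] = [int(value) for value in table[name]]
    # float() accepts a leading sign, so '+0.316' and '-1.107' both parse
    converted['Net_RR'] = [float(value) for value in table['Net_RR']]
    return converted


# Made-up values for illustration only; not real IPL results.
sample = {
    'Team': ['Team A', 'Team B'],
    'Pld': ['14', '14'], 'Won': ['10', '4'], 'Lost': ['4', '10'],
    'Tied': ['0', '0'], 'No_result': ['0', '0'], 'Points': ['20', '8'],
    'Net_RR': ['+0.316', '-1.107'],
}
clean = convert_types(sample)
print(clean['Points'], clean['Net_RR'])
```

The converted dictionary can be passed to `pd.DataFrame` in the same way as `dictonary_data`.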

# Summary

- We request each webpage URL using the Selenium WebDriver.

- The scraping was done using Python libraries: Selenium for extracting the data and pandas for the DataFrame.

- We scraped the points table details for the past 15 years from the iplt20.com website into 13 fields (Year, Position, Team, Played, Won, Lost, Tied, No Result, Net Run Rate, For, Against, Points, Form) using helper functions.

- We collected the scraped data for every year since 2008 in a dictionary and flattened it into tabular form, giving a total of 126 rows and 13 columns.

- After that, we pre-process the data as required (for example, stripping newlines from the Form column) and finally export the processed data to the IPLT20-points-Table.csv file.

# Future work

- Extracting more details of the project and creator by accessing the project links and creator links

- Code optimization

- Improving the documentation part of the project

- The data collection can be extended considerably in the future by appending new years to the year list

# References

- IPLT20 points table: https://www.iplt20.com/points-table/men/2022

- Selenium documentation: https://www.selenium.dev/selenium/docs/api/py/api.html


In [None]:
jovian.commit()
