# Web Scraping Through Pandas

## 1. Basics of HTML for Web Scraping

- Why HTML
- How to Access HTML Data
- Common Elements & Tags Used in Scraping
- Elements, Attributes, Divs and Classes

## 2. Required Libraries

- Pandas
- requests
- BeautifulSoup

## 3. Authentication, Soup and HTML Parsing

- HTTP Status Codes
- Ethical Policies
- Bytes to HTML
- prettify
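
One common part of scraping etiquette is checking a site's `robots.txt` before fetching pages. The sketch below uses the standard-library `urllib.robotparser` on a made-up rule set. The rules, the `example.com` URLs, the `my-scraper` user agent, and the helper name `is_allowed` are all illustrative assumptions, not the policy of any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real one is served at <site>/robots.txt.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse() takes an iterable of lines
    return parser.can_fetch(user_agent, url)


allowed_public = is_allowed(
    SAMPLE_ROBOTS_TXT, "my-scraper", "https://example.com/scores/today"
)
allowed_private = is_allowed(
    SAMPLE_ROBOTS_TXT, "my-scraper", "https://example.com/private/data"
)
print(allowed_public, allowed_private)
```

A site's terms of use may add restrictions that `robots.txt` does not express, so this check is a starting point rather than a full answer.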

## 4. Finding Data

- find & find_all
- get
- text
- Lists
- Hyperlinks
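
The lookups listed above can be tried on a small inline HTML snippet before pointing them at a live page. This is a minimal sketch: the markup, the `player-card` class, and the `/profiles/...` paths are invented for illustration and do not come from any real site.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a downloaded page.
SAMPLE_HTML = """
<div class="squad">
  <a class="player-card" href="/profiles/1/player-one">Player One</a>
  <a class="player-card" href="/profiles/2/player-two">Player Two</a>
  <a class="player-card">Player Three</a>
</div>
"""

sample_soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find returns the first match (or None); find_all returns every match.
first_card = sample_soup.find("a", class_="player-card")
all_cards = sample_soup.find_all("a", class_="player-card")

# .text gives the visible text; .get reads an attribute and
# returns None when the attribute is missing.
names = [card.text.strip() for card in all_cards]
links = [card.get("href") for card in all_cards]

print(first_card.text, names, links)
```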

## 5. Cleaning Data Using Python

- Python Data Structures
- Strings, Lists, Dictionaries
- Pandas operations
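
As an example of cleaning with plain strings, lists and dictionaries, the sketch below strips tags such as `(C)` or `(WK)` from scraped names and turns them into flags. The helper `clean_player_name` and the output keys are hypothetical names chosen for this illustration. It assumes the tag always sits in parentheses at the end of the name.

```python
import re

# Matches a trailing parenthesised tag such as "(C)", "(WK)" or "(C & WK)".
TAG_PATTERN = re.compile(r"\s*\(([^)]*)\)\s*$")


def clean_player_name(raw_name: str) -> dict:
    """Split a scraped name into the bare name plus captain/keeper flags."""
    match = TAG_PATTERN.search(raw_name)
    tags = []
    if match:
        # "C & WK" becomes ["C", "WK"]
        tags = [tag.strip() for tag in match.group(1).split("&")]
    return {
        "name": TAG_PATTERN.sub("", raw_name).strip(),
        "is_captain": "C" in tags,
        "is_keeper": "WK" in tags,
    }


cleaned_rows = [
    clean_player_name(name)
    for name in ["Rishabh Pant (C & WK)", "Ishan Kishan (WK)", "Rohit Sharma"]
]
print(cleaned_rows)
```

A list of dictionaries like `cleaned_rows` can be passed directly to `pd.DataFrame` when the cleaned data needs to go into pandas.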

## 6. Export to DataFrame and CSV

## Practice

In [3]:
url = 'https://www.cricbuzz.com/cricket-match-squads/89763/mi-vs-dc-20th-match-indian-premier-league-2024' # squads page listing player names

In [4]:
### Libraries

In [5]:
import pandas as pd
import requests
from bs4 import BeautifulSoup

In [6]:
### Sending Request to webpage

In [7]:
request = requests.get(url)

In [8]:
request

<Response [200]>

In [9]:
type(request.content)

bytes
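
The response body arrives as raw `bytes`, and text only appears once those bytes are decoded. The sketch below shows the difference on a hand-written sample. The sample string and the UTF-8 encoding are assumptions for illustration. With `requests`, the decoded form is also available as the `.text` attribute, and `BeautifulSoup` accepts either form.

```python
# Stand-in for request.content: bytes as they would arrive over the network.
raw_bytes = "<p>Café</p>".encode("utf-8")

# Decoding turns the bytes into a str that can be searched or printed as text.
html_text = raw_bytes.decode("utf-8")

# The accented character takes two bytes in UTF-8 but counts as one character.
print(type(raw_bytes).__name__, len(raw_bytes))
print(type(html_text).__name__, len(html_text))
```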

In [10]:
request.status_code # 200 means the request succeeded

200
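
To make the meaning of a status code explicit, the standard-library `http.HTTPStatus` enum can translate it into its reason phrase. The helper `describe_status` is a hypothetical name for this sketch. It treats any 2xx code as success and raises `ValueError` for codes the enum does not know.

```python
from http import HTTPStatus


def describe_status(code: int) -> tuple:
    """Return (reason phrase, is_success) for an HTTP status code."""
    phrase = HTTPStatus(code).phrase
    return phrase, 200 <= code < 300


for code in (200, 403, 404):
    phrase, ok = describe_status(code)
    print(code, phrase, "success" if ok else "failure")
```

In a script, calling `raise_for_status()` on the `requests` response is a common way to stop early when the code signals an error.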

In [11]:
soup = BeautifulSoup(request.content,'html.parser')

In [12]:
MI_players = soup.find_all(class_='cb-player-name-left')
DC_players = soup.find_all(class_='cb-player-name-right')

In [13]:
MI_players

[<div class="cb-player-name-left"> <div> Rohit Sharma <br/> <span class="cb-font-12 text-gray">Batter</span> </div> </div>,
 <div class="cb-player-name-left"> <div> Ishan Kishan (WK) <br/> <span class="cb-font-12 text-gray">WK-Batter</span> </div> </div>,
 <div class="cb-player-name-left"> <div> Suryakumar Yadav <span class="cb-rank-diff-up cb-ico cb-plr-in-out"></span> <br/> <span class="cb-font-12 text-gray">Batter</span> </div> </div>,
 <div class="cb-player-name-left"> <div> Tilak Varma <br/> <span class="cb-font-12 text-gray">Batter</span> </div> </div>,
 <div class="cb-player-name-left"> <div> Hardik Pandya (C) <br/> <span class="cb-font-12 text-gray">Batting Allrounder</span> </div> </div>,
 <div class="cb-player-name-left"> <div> Tim David <br/> <span class="cb-font-12 text-gray">Batter</span> </div> </div>,
 <div class="cb-player-name-left"> <div> Mohammad Nabi <span class="cb-rank-diff-up cb-ico cb-plr-in-out"></span> <br/> <span class="cb-font-12 text-gray">Bowling Allrounde

In [14]:
Players = []
Role = []

In [15]:
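def extract_player_infor(team):
    """Append each player's name and role to the global Players and Role lists."""
    for i in team:
        # Name and role are separated by extra whitespace, so splitting on a
        # single space leaves an empty string at the boundary between them.
        player = i.text.strip().split(' ')
        sep = player.index('')  # first empty string marks the name/role boundary
        player_name = " ".join(player[:sep]).strip()
        role = " ".join(player[sep:]).strip()
        print(player_name, role)
        Players.append(player_name)
        Role.append(role)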
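
The same name/role split can also be written as a small pure function that returns its result instead of appending to global lists, which makes it easier to test on plain strings. This is a sketch under the assumption that the name and role are separated by at least two whitespace characters, as the card markup above suggests. `split_name_and_role` is a hypothetical helper name.

```python
import re


def split_name_and_role(card_text: str):
    """Split the text of one player card into (name, role)."""
    # Split once on the first run of two or more whitespace characters.
    parts = re.split(r"\s{2,}", card_text.strip(), maxsplit=1)
    if len(parts) == 2:
        return parts[0], parts[1]
    return parts[0], None  # no role found after the name


print(split_name_and_role("  Rohit Sharma  Batter  "))
print(split_name_and_role("Rishabh Pant (C & WK)   WK-Batter"))
```

Unlike `player.index('')`, this version does not raise an error when a card has no role; it returns `None` for the role instead.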

In [16]:
extract_player_infor(DC_players)

David Warner Batter
Prithvi Shaw Batter
Abishek Porel WK-Batter
Rishabh Pant (C & WK) WK-Batter
Tristan Stubbs WK-Batter
Axar Patel Bowling Allrounder
Lalit Yadav Batting Allrounder
Jhye Richardson Bowler
Anrich Nortje Bowler
Ishant Sharma Bowler
Khaleel Ahmed Bowler
Kumar Kushagra WK-Batter
Yash Dhull Batter
Jake Fraser-McGurk Batter
Sumit Kumar Bowling Allrounder
Praveen Dubey Bowler
Mukesh Kumar Bowler
Kuldeep Yadav Bowler
Ricky Bhui Batter
Shai Hope WK-Batter
Vicky Ostwal Bowler
Swastik Chikara Batting Allrounder
Mitchell Marsh Batting Allrounder
Rasikh Dar Salam Bowler
Ricky Ponting Head coach
Biju George Fielding coach
Pravin Amre Assistant coach
Sourav Ganguly Director of Cricket
James Hopes Fast Bowling coach
Gnaneswara Rao Assistant Fielding coach


In [17]:
Players[:5]

['David Warner',
 'Prithvi Shaw',
 'Abishek Porel',
 'Rishabh Pant (C & WK)',
 'Tristan Stubbs']

In [18]:
extract_player_infor(MI_players)

Rohit Sharma Batter
Ishan Kishan (WK) WK-Batter
Suryakumar Yadav Batter
Tilak Varma Batter
Hardik Pandya (C) Batting Allrounder
Tim David Batter
Mohammad Nabi Bowling Allrounder
Romario Shepherd Bowler
Piyush Chawla Bowler
Gerald Coetzee Bowler
Jasprit Bumrah Bowler
Akash Madhwal Bowler
Kwena Maphaka Bowler
Naman Dhir Batter
Nehal Wadhera Batter
Shams Mulani Batting Allrounder
Dewald Brevis Batter
Nuwan Thushara Bowler
Vishnu Vinod WK-Batter
Shreyas Gopal Bowling Allrounder
Luke Wood Bowler
Arjun Tendulkar Bowling Allrounder
Kumar Kartikeya Bowling Allrounder
Shivalik Sharma Batter
Anshul Kamboj Bowler
Mark Boucher Head coach
Kieron Pollard Batting coach
James Pamment Fielding coach
Sachin Tendulkar ICON
Lasith Malinga Bowling Coach
Jagadeesh Arunkumar Assistant Batting Coach


In [19]:
Players[-5:]

['Kieron Pollard',
 'James Pamment',
 'Sachin Tendulkar',
 'Lasith Malinga',
 'Jagadeesh Arunkumar']

### URL extraction

In [20]:
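player_urls = []
common_url = 'https://www.cricbuzz.com'
def player_info_extraction(team):
    """Append the absolute profile URL of each card in team to player_urls."""
    for i in team:
        player_url = i.get('href')  # None when the element has no href attribute
        if player_url is None:
            new_url = None  # keep a placeholder rather than building an invalid URL
        else:
            new_url = common_url + player_url
        player_urls.append(new_url)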
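
As an alternative to string concatenation, the standard-library `urllib.parse.urljoin` can build the absolute URL and also copes with links that are already absolute. In this sketch the helper name `absolute_url` and the `/profiles/0000/sample-player` path are hypothetical, used only to show the call.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.cricbuzz.com"


def absolute_url(href):
    """Return an absolute URL for href, or None when href is missing."""
    if not href:
        return None
    return urljoin(BASE_URL, href)


print(absolute_url("/profiles/0000/sample-player"))
print(absolute_url(None))
```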

In [22]:
Full_info = soup.find_all(class_='cb-col cb-col-100 pad10 cb-player-card-left')
player_info_extraction(Full_info)

In [23]:
Full_info = soup.find_all(class_='cb-col cb-col-100 pad10 cb-player-card-right')
player_info_extraction(Full_info)

In [24]:
len(Players),len(Role),len(player_urls)

(61, 61, 50)
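
The counts differ (61 names and roles, 50 URLs), so the URL list cannot be matched to players by position. One way to keep the columns aligned is to read name, role and link from the same card element in a single pass. The sketch below does this on hand-written sample markup. It assumes each card element wraps a `cb-player-name-left` block and may or may not carry an `href`. The sample HTML, the `/profiles/0000/sample-player` path, and the helper `card_to_record` are illustrative assumptions, not the page's confirmed structure.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup modelled on the class names used above.
SAMPLE_HTML = """
<a class="cb-col cb-col-100 pad10 cb-player-card-left" href="/profiles/0000/sample-player">
  <div class="cb-player-name-left"> <div> Sample Player <br/> <span class="cb-font-12 text-gray">Batter</span> </div> </div>
</a>
<div class="cb-col cb-col-100 pad10 cb-player-card-left">
  <div class="cb-player-name-left"> <div> Sample Coach <br/> <span class="cb-font-12 text-gray">Head coach</span> </div> </div>
</div>
"""


def card_to_record(card):
    """Build one row (name, role, link) from a single card element."""
    name_block = card.find(class_="cb-player-name-left")
    parts = re.split(r"\s{2,}", name_block.get_text().strip(), maxsplit=1)
    return {
        "Player": parts[0],
        "Role": parts[1] if len(parts) == 2 else None,
        "URL": card.get("href"),  # None when the card has no link
    }


sample_soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
records = [
    card_to_record(card)
    for card in sample_soup.find_all(class_="cb-player-card-left")
]
print(records)
```

Because every record carries its own URL (or `None`), a list like `records` always yields columns of equal length when passed to `pd.DataFrame`.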

In [25]:
df = pd.DataFrame({
    'Player': Players,
    'Role': Role,
})

In [26]:
df.to_csv(r'F:\Freelance\MI_DC_Players.csv')  # raw string keeps the backslashes literal
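
By default `to_csv` also writes the row index as an unnamed first column. The sketch below writes a tiny sample frame to an in-memory buffer with `index=False` to show the resulting layout. The two-row `sample_df` is an illustrative stand-in for the scraped data, and the buffer is used here only so the example does not depend on a particular folder existing.

```python
import io

import pandas as pd

# Small stand-in for the scraped Players and Role lists.
sample_df = pd.DataFrame(
    {
        "Player": ["Rohit Sharma", "David Warner"],
        "Role": ["Batter", "Batter"],
    }
)

buffer = io.StringIO()
sample_df.to_csv(buffer, index=False)  # index=False drops the row-number column

csv_lines = buffer.getvalue().splitlines()
print(csv_lines)
```

Passing a file path instead of the buffer writes the same content to disk.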