# Web Scraping Using the BeautifulSoup Library
## Python for Data Analytics


#### 1. BeautifulSoup Library
Overview:

1. Purpose: BeautifulSoup is a Python library for parsing HTML and XML documents. It builds a parse tree from a page's source code, which makes it easy to extract data from HTML pages.

2. Key Features:

   1. Parsing HTML/XML: Converts complex HTML/XML documents into a tree structure that is easy to navigate and search.

   2. Tag Navigation: Lets you move between HTML tags and search their contents.

   3. Compatibility: Works with several parsers, including html.parser, lxml, and html5lib.

3. Common Usage:

   1. Extracting Data: Extract data from specific tags, attributes, or text.

   2. Navigating HTML Structure: Traverse the HTML document tree using BeautifulSoup’s methods.
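To make the overview concrete, here is a minimal sketch of parsing, searching, and navigating. The HTML string is a small hand-written sample (not taken from any real page), and the variable names are only illustrative.

```python
from bs4 import BeautifulSoup

# A tiny hand-written HTML document (hypothetical, for illustration only)
html = """
<html><body>
  <h1 id="title">Teams</h1>
  <ul>
    <li class="team">Boston Bruins</li>
    <li class="team">Buffalo Sabres</li>
  </ul>
  <a href="/pages/forms/">Hockey table</a>
</body></html>
"""

# Parse the string into a tree using Python's built-in parser
soup = BeautifulSoup(html, "html.parser")

# Extract text from a tag
title = soup.find("h1").text

# Extract text from every tag that matches a name and a CSS class
teams = [li.text for li in soup.find_all("li", class_="team")]

# Extract an attribute value
link = soup.find("a")["href"]

# Navigate the tree: move from a tag to its parent
parent_tag = soup.find("li").parent.name

print(title, teams, link, parent_tag)
```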


#### 2. Requests Library

Overview:

1. Purpose: requests is a Python library for making HTTP requests. It simplifies sending requests to web servers and handling their responses.

2. Key Features:

   1. Sending HTTP Requests: Easily send GET, POST, PUT, DELETE, and other HTTP requests.

   2. Handling Responses: Access the content of responses, check the status code, and manage headers and cookies.

   3. Error Handling: Raises exceptions for connection problems and offers raise_for_status() to turn HTTP error status codes into exceptions.

3. Common Usage:

   1. Fetching Web Pages: Retrieve the HTML content of web pages.

   2. Handling Data: Extract data from responses and manage request parameters.
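The points above can be sketched as follows. `fetch_html` is a hypothetical helper written for this example, the 10-second timeout is an arbitrary choice, and the `q` query parameter is an assumed name used only to show how parameters are encoded. The request is prepared but never sent, so the sketch runs without network access.

```python
import requests


def fetch_html(url, params=None, timeout=10):
    """Return the body of a page as text (hypothetical helper)."""
    response = requests.get(url, params=params, timeout=timeout)
    # Turn 4xx/5xx status codes into an exception instead of parsing an error page
    response.raise_for_status()
    return response.text


# Preparing a request without sending it shows how parameters end up in the URL
prepared = requests.Request(
    "GET",
    "https://www.scrapethissite.com/pages/forms/",
    params={"q": "boston"},  # assumed parameter name, for illustration only
).prepare()

print(prepared.url)
```

Calling `fetch_html(url)` would perform the actual download, so it is best wrapped in a `try`/`except requests.RequestException` block.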


### 1. Importing Required Libraries

In [1]:
import pandas as pd 
from bs4 import BeautifulSoup
import requests

### 2. Target Website

In [15]:
# We will try to scrape the following page

url = 'https://www.scrapethissite.com/pages/forms/'

### 3. Let's Begin Scraping

In [9]:
# Download the page and parse its HTML into a searchable tree
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')

In [30]:
# The page contains a single table, so find() is enough
a = soup.find('table', class_='table')

In [34]:
# find_all() returns every matching tag; find() returns only the first match
headers = a.find_all('th')

In [51]:
# Keep only the text of each header cell, without surrounding whitespace
headings = [i.text.strip() for i in headers]
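The difference between find() and find_all(), and the reason for calling strip(), is easier to see on a small example. The table below is a hand-written stand-in for the real page, so the sketch runs offline. The class name `no-such-class` is made up to show what happens when nothing matches.

```python
from bs4 import BeautifulSoup

# Hand-written sample table (hypothetical, mimics the layout of the target page)
html = """
<table class="table">
  <tr><th>
      Team Name
  </th><th> Year </th></tr>
  <tr><td> Boston Bruins </td><td> 1990 </td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", class_="table")

first_th = table.find("th")    # a single tag: the first match
all_th = table.find_all("th")  # a list-like result with every match

raw = first_th.text            # still contains newlines and spaces
headings = [th.text.strip() for th in all_th]

# find() returns None when nothing matches, so check before chaining calls
missing = soup.find("table", class_="no-such-class")

print(repr(raw), headings, missing)
```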

In [54]:
# Start with an empty DataFrame whose columns are the scraped headings
df = pd.DataFrame(columns=headings)

In [57]:
# Every <tr> is one table row; the first one holds the headings
column_data = a.find_all('tr')

In [64]:
for row in column_data[1:]:  # skip the heading row
    row_data = row.find_all('td')
    individual_row_data = [data.text.strip() for data in row_data]

    # Append the row at the next free index of the DataFrame
    length = len(df)
    df.loc[length] = individual_row_data

In this loop, each "tr" tag holds one table row and its values sit inside "td" tags, so we call find_all on "td". The slice column_data[1:] skips the first row, which contains the headings. On every iteration the DataFrame grows by one row, because len(df) is always the next unused index.
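An alternative to growing the DataFrame row by row is to collect the rows in a plain list and build the DataFrame once at the end, which usually scales better for large tables. This is only a sketch of that variant, not the approach used in this notebook. It runs on a hand-written sample table, and `rows` and `df_sketch` are illustrative names.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hand-written sample table (hypothetical, mimics the layout of the target page)
html = """<table class="table">
<tr><th>Team Name</th><th>Year</th><th>Wins</th></tr>
<tr><td>Boston Bruins</td><td>1990</td><td>44</td></tr>
<tr><td>Buffalo Sabres</td><td>1990</td><td>31</td></tr>
</table>"""

table = BeautifulSoup(html, "html.parser").find("table", class_="table")
headings = [th.text.strip() for th in table.find_all("th")]

rows = []
for tr in table.find_all("tr"):
    cells = [td.text.strip() for td in tr.find_all("td")]
    if cells:  # the heading row has no <td> cells, so it is skipped here
        rows.append(cells)

# Build the DataFrame in one step instead of appending with df.loc
df_sketch = pd.DataFrame(rows, columns=headings)
print(df_sketch)
```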

In [66]:
df.head()

Unnamed: 0,Team Name,Year,Wins,Losses,OT Losses,Win %,Goals For (GF),Goals Against (GA),+ / -
0,Boston Bruins,1990,44,24,,0.55,299,264,35
1,Buffalo Sabres,1990,31,30,,0.388,292,278,14
2,Calgary Flames,1990,46,26,,0.575,344,263,81
3,Chicago Blackhawks,1990,49,23,,0.613,284,211,73
4,Detroit Red Wings,1990,34,38,,0.425,273,298,-25
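Every value scraped through .text is a string, and the OT Losses column is empty for these rows. Before doing arithmetic, the numeric columns can be converted. The sketch below uses a small hand-built DataFrame with values copied from the output above. The choice of `errors="coerce"`, which turns unparseable cells into NaN, is an assumption about how missing values should be handled.

```python
import pandas as pd

# A few scraped-style values (all strings), copied from the output above
scraped = pd.DataFrame(
    {
        "Team Name": ["Boston Bruins", "Detroit Red Wings"],
        "Wins": ["44", "34"],
        "OT Losses": ["", ""],
        "+ / -": ["35", "-25"],
    }
)

numeric_cols = ["Wins", "OT Losses", "+ / -"]

# Convert each column; cells that cannot be parsed (such as "") become NaN
scraped[numeric_cols] = scraped[numeric_cols].apply(pd.to_numeric, errors="coerce")

print(scraped.dtypes)
print(scraped["Wins"].sum())
```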


In [67]:
# Save the DataFrame as a CSV file ('location' is a placeholder for the output path)
df.to_csv('location', index=False)
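As a quick check of what index=False does, the sketch below writes a tiny DataFrame to a temporary folder and reads it back. The file name `hockey_teams.csv` is hypothetical. With index=False the row index is not written, so no extra "Unnamed: 0" column appears when the file is loaded again.

```python
import os
import tempfile

import pandas as pd

teams = pd.DataFrame({"Team Name": ["Boston Bruins"], "Wins": [44]})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "hockey_teams.csv")  # hypothetical file name
    teams.to_csv(csv_path, index=False)  # do not write the row index
    restored = pd.read_csv(csv_path)     # read it back while the file still exists

print(restored)
```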


# Bonus Part

Using pd.read_html()

We can also call pd.read_html('url'). It returns every table on that web page as a list of DataFrames, and we can pick the one we need by indexing.
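The behaviour described above can be tried offline on a small HTML string. The two tables below are hand-written samples, and the sketch assumes an HTML parser backend for pandas (lxml, or BeautifulSoup with html5lib) is installed. The `match` argument keeps only tables whose text matches the given string or regex, which helps when a page has many tables.

```python
from io import StringIO

import pandas as pd

# Two hand-written tables (hypothetical sample, not the real Wikipedia page)
html = """
<table><tr><th>Topic</th></tr><tr><td>Games</td></tr></table>
<table>
  <tr><th>Team</th><th>Gold</th></tr>
  <tr><td>Algeria (ALG)</td><td>5</td></tr>
  <tr><td>Argentina (ARG)</td><td>21</td></tr>
</table>
"""

# read_html returns a list with one DataFrame per <table>
tables = pd.read_html(StringIO(html))
medals = tables[1]  # pick a table by its position in the list

# Keep only the tables that contain the text "Gold"
matched = pd.read_html(StringIO(html), match="Gold")

print(len(tables), len(matched))
print(medals)
```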

In [79]:
# Despite its name, df is a list of DataFrames, one per table on the page
df = pd.read_html('https://en.wikipedia.org/wiki/All-time_Olympic_Games_medal_table')


In [81]:
df[0]

Unnamed: 0,Olympic Games
0,Main topics
1,Bids Boycotts Ceremonies Charter Host cities I...
2,Games
3,Summer Winter Youth Esports
4,Regional games
5,Asian African European Pacific Pan-American
6,Defunct games
7,Ancient Intercalated
8,vte


In [82]:
df[1]

Unnamed: 0_level_0,Team,Summer Olympic Games,Summer Olympic Games,Summer Olympic Games,Summer Olympic Games,Summer Olympic Games,Winter Olympic Games,Winter Olympic Games,Winter Olympic Games,Winter Olympic Games,Winter Olympic Games,Combined total,Combined total,Combined total,Combined total,Combined total
Unnamed: 0_level_1,Team (IOC code),No.,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,No.,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,No.,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1
0,Afghanistan (AFG),15,0,0,2,2,0,0,0,0,0,15,0,0,2,2
1,Algeria (ALG),14,5,4,8,17,3,0,0,0,0,17,5,4,8,17
2,Argentina (ARG),25,21,26,30,77,20,0,0,0,0,45,21,26,30,77
3,Armenia (ARM),7,2,8,8,18,8,0,0,0,0,15,2,8,8,18
4,Australasia (ANZ)[ANZ],2,3,4,5,12,0,0,0,0,0,2,3,4,5,12
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
152,Zimbabwe (ZIM)[ZIM],14,3,4,1,8,1,0,0,0,0,15,3,4,1,8
153,Independent Olympic Athletes (IOA)[IOA],3,1,0,1,2,0,0,0,0,0,3,1,0,1,2
154,Independent Olympic Participants (IOP)[IOP],1,0,1,2,3,0,0,0,0,0,1,0,1,2,3
155,Mixed team (ZZX)[ZZX],3,11,6,8,25,0,0,0,0,0,3,11,6,8,25


In [83]:
# Number of tables found on the page, not the number of rows
len(df)

55
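The medal table above has a two-level header, which pandas loads as MultiIndex columns. One possible way to make such columns easier to work with is to join both levels into a single name. The sketch below rebuilds a small slice of the table by hand, using values from the output above. The ` | ` separator is an arbitrary choice.

```python
import pandas as pd

# Rebuild a small slice of the two-level header by hand
columns = pd.MultiIndex.from_tuples(
    [
        ("Team", "Team (IOC code)"),
        ("Summer Olympic Games", "No."),
        ("Winter Olympic Games", "No."),
    ]
)
medals = pd.DataFrame(
    [["Afghanistan (AFG)", 15, 0], ["Algeria (ALG)", 14, 3]],
    columns=columns,
)

# Join the two header levels into one flat column name
flat = medals.copy()
flat.columns = [f"{top} | {bottom}" for top, bottom in medals.columns]

print(flat.columns.tolist())
```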