### Author : Sanjoy Biswas
### Topic : Web Scraping with Python Using BeautifulSoup 
### Email : sanjoy.eee32@gmail.com

#### How Does Web Scraping Work?

When we scrape the web, we write code that sends a request to the server hosting the page we specified. Our code then downloads that page’s source code, just as a browser would. But instead of displaying the page visually, it searches the page for the HTML elements we’ve specified and extracts whatever content we’ve asked for.

For example, if we wanted to get all of the titles inside H2 tags on a website, we could write some code to do that. Our code would request the site’s content from its server and download it. Then it would go through the page’s HTML looking for H2 tags. Whenever it found one, it would copy the text inside the tag and output it in whatever format we specified.
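The H2 example can be sketched with BeautifulSoup, the library used in the rest of this notebook. This is a minimal illustration, not a full scraper: the HTML string is a made-up stand-in for a downloaded page, and the built-in `html.parser` is used so that no extra parser has to be installed.

```python
import bs4 as bs

# Hypothetical HTML standing in for a page we have already downloaded
html = """
<html><body>
  <h1>Site name</h1>
  <h2>First headline</h2>
  <p>Some body text.</p>
  <h2>Second headline</h2>
</body></html>
"""

soup = bs.BeautifulSoup(html, "html.parser")

# Collect the text of every H2 tag, with surrounding whitespace removed
h2_titles = [h2.get_text(strip=True) for h2 in soup.find_all("h2")]
print(h2_titles)
```

The same two calls, `find_all` and `get_text`, are applied to real pages in the cells further down.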

#### The Components of a Web Page

HTML: contains the main content of the page.

CSS: adds styling to make the page look nicer.

JS: JavaScript files add interactivity to web pages.

Images: image formats such as JPG and PNG allow web pages to show pictures.

#### Import Libraries

In [2]:
import numpy as np
import pandas as pd
from requests import get

In [87]:
import bs4 as bs
import urllib.request

#### Specify the URL to Scrape

In [88]:
src = urllib.request.urlopen('https://www.nytimes.com/').read()
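Some servers refuse requests that carry the default Python user agent, and a slow server can make `urlopen` wait for a long time. A minimal sketch of a more defensive fetch is shown here. The helper names `build_request` and `fetch_html`, the user agent string, and the 10 second timeout are all illustrative assumptions, not part of the original notebook.

```python
import urllib.request

# Hypothetical user agent string; use one that identifies your own script
USER_AGENT = "Mozilla/5.0 (compatible; scraping-tutorial)"


def build_request(url):
    """Create a Request object that sends an explicit User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


def fetch_html(url, timeout=10):
    """Download the raw bytes of a page, giving up after `timeout` seconds.

    urlopen raises urllib.error.HTTPError or urllib.error.URLError on failure.
    """
    with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
        return response.read()
```

With these helpers, `src = fetch_html('https://www.nytimes.com/')` would stand in for the direct `urlopen(...).read()` call.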

In [89]:
bsoup = bs.BeautifulSoup(src, 'lxml')

In [1]:
#print(bsoup)

In [92]:
# .text returns only the text inside the tag, without the markup
print(bsoup.title.text)

The New York Times - Breaking News, World News & Multimedia


In [93]:
for link in bsoup.find_all('a'):
    print(link.get('href'))

#site-content
#site-index
/
/
https://www.nytimes.com/es/
https://cn.nytimes.com
https://www.nytimes.com/subscription/multiproduct/lp8HYKU.html?campaignId=6W74R
https://myaccount.nytimes.com/auth/login?response_type=cookie&client_id=vi
https://myaccount.nytimes.com/auth/login?response_type=cookie&client_id=vi
https://www.nytimes.com/section/todayspaper
/
/
https://www.nytimes.com/pages/world/index.html
https://www.nytimes.com/pages/national/index.html
https://www.nytimes.com/pages/politics/index.html
https://www.nytimes.com/pages/nyregion/index.html
https://www.nytimes.com/pages/business/index.html
https://www.nytimes.com/pages/opinion/index.html
https://www.nytimes.com/pages/technology/index.html
https://www.nytimes.com/section/science
https://www.nytimes.com/pages/health/index.html
https://www.nytimes.com/pages/sports/index.html
https://www.nytimes.com/pages/arts/index.html
https://www.nytimes.com/section/books
https://www.nytimes.com/section/style
https://www.nytimes.com/pages/dinin
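The output above mixes absolute URLs with relative paths (`/`) and in-page anchors (`#site-content`). A minimal sketch of one way to normalise such a list uses `urljoin` from the standard library. The helper name `absolute_links` is hypothetical, and the sample list is a small subset of the hrefs printed above, plus a `None` to represent an `<a>` tag with no `href` attribute.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.nytimes.com/"


def absolute_links(hrefs, base_url=BASE_URL):
    """Resolve hrefs against base_url, skipping missing values,
    in-page anchors, and duplicates."""
    links = []
    for href in hrefs:
        # link.get('href') returns None when the tag has no href attribute
        if not href or href.startswith("#"):
            continue
        url = urljoin(base_url, href)
        if url not in links:
            links.append(url)
    return links


sample = [
    "#site-content",
    "/",
    None,
    "https://cn.nytimes.com",
    "/",
    "https://www.nytimes.com/section/science",
]
print(absolute_links(sample))
```

In the notebook, the input would be `[link.get('href') for link in bsoup.find_all('a')]`.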

In [94]:
print(bsoup.p)

<p class="css-gs67ux e1n8kpyg0">Here’s what you need to know at the end of the day.</p>


#### Find All Elements with a Particular Tag

In [95]:
print(bsoup.find_all('p'))

[<p class="css-gs67ux e1n8kpyg0">Here’s what you need to know at the end of the day.</p>, <p class="css-gs67ux e1n8kpyg0">The Mueller report is released.</p>, <p class="css-gs67ux e1n8kpyg0">A trip to deep in South Texas brush country, where the terrain is deadly.</p>, <p class="css-1pfq5u e1n8kpyg0">Sarah Huckabee Sanders, the White House press secretary, is seemingly unfazed by blows to her credibility.</p>, <p class="css-1pfq5u e1n8kpyg0">The findings revealed that some claims in the dossier appeared to be false while others were impossible to prove.</p>, <p class="css-1pfq5u e1n8kpyg0">When that woman is Meghan Markle, and the baby is a Royal Baby, it’s “generally seen as harmless fun.”</p>, <p class="css-1pfq5u e1n8kpyg0">The Mueller report is a good reminder of how important it is to prevent foreign interference in American elections.</p>, <p class="css-1pfq5u e1n8kpyg0">Mueller laid out the evidence for members of Congress to take action against President Trump. Will they?</p>, 

In [98]:
print(bsoup.find('p').get_text())

Here’s what you need to know at the end of the day.


In [99]:
ptags = bsoup.find_all('p')
for p in ptags:
    print(p.text)

Here’s what you need to know at the end of the day.
The Mueller report is released.
A trip to deep in South Texas brush country, where the terrain is deadly.
Sarah Huckabee Sanders, the White House press secretary, is seemingly unfazed by blows to her credibility.
The findings revealed that some claims in the dossier appeared to be false while others were impossible to prove.
When that woman is Meghan Markle, and the baby is a Royal Baby, it’s “generally seen as harmless fun.”
The Mueller report is a good reminder of how important it is to prevent foreign interference in American elections.
Mueller laid out the evidence for members of Congress to take action against President Trump. Will they?
Experts say mirroring another person’s facial expressions is essential for recognizing emotion, and also for feeling it.
In “The Absent Hand,” the writer Suzannah Lessard dissects a diverse swath of America, looking to understand the green expanses and urban sprawl that surround us.
In the latest

In [100]:
src = urllib.request.urlopen('http://www.espn.com/nba/statistics/player/_/stat/assists/sort/avgAssists/').read()
bsoup = bs.BeautifulSoup(src, 'lxml')
tbl = bsoup.find('table')

In [101]:
tbl_rows = tbl.find_all('tr')
for tr in tbl_rows:
    td = tr.find_all('td')
    row = [i.text for i in td]
    print(row)

['RK', 'PLAYER', 'TEAM', 'GP', 'MPG', 'AST', 'APG', 'TO', 'TOPG', 'AP48M', 'AST/TO']
['1', 'Russell Westbrook, PG', 'OKC', '2', '37.5', '21', '10.5', '10', '5.0', '13.4', '2.10']
['2', 'James Harden, PG', 'HOU', '2', '33.0', '20', '10.0', '11', '5.5', '14.5', '1.82']
['3', 'Nikola Jokic, C', 'DEN', '3', '36.0', '29', '9.7', '7', '2.3', '12.9', '4.14']
['4', 'Lou Williams, SG', 'LAC', '3', '28.3', '26', '8.7', '9', '3.0', '14.7', '2.89']
['\xa0', 'Draymond Green, PF', 'GS', '3', '34.0', '26', '8.7', '10', '3.3', '12.2', '2.60']
['6', 'Ben Simmons, PG', 'PHI', '3', '33.3', '24', '8.0', '9', '3.0', '11.5', '2.67']
['7', 'Kyle Lowry, PG', 'TOR', '2', '35.5', '15', '7.5', '4', '2.0', '10.1', '3.75']
['\xa0', 'Ricky Rubio, PG', 'UTAH', '2', '31.5', '15', '7.5', '4', '2.0', '11.4', '3.75']
['9', 'Kyrie Irving, PG', 'BOS', '2', '36.5', '14', '7.0', '5', '2.5', '9.2', '2.80']
['10', 'Reggie Jackson, PG', 'DET', '2', '23.5', '13', '6.5', '2', '1.0', '13.3', '6.50']
['RK', 'PLAYER', 'TEAM', 'GP',
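The printed rows show two quirks: the header row repeats partway through the table, and some rank cells contain only a non-breaking space (`'\xa0'`). Judging by the matching APG values, a blank rank appears to mark a tie with the row above, but that is an assumption. The sketch below cleans rows under that assumption. The function name `clean_rows` is hypothetical, and the sample keeps only four of the columns printed above.

```python
def clean_rows(rows):
    """Turn scraped table rows into a list of dicts keyed by the header.

    Repeated header rows are dropped, and a blank rank is filled with
    the previous rank (assumed to indicate a tie).
    """
    rows = [row for row in rows if row]  # skip rows with no cells
    header, records, last_rank = rows[0], [], ""
    for row in rows[1:]:
        if row == header:
            continue  # the header repeats inside the table
        cells = [cell.strip() for cell in row]  # strip() also removes '\xa0'
        if cells[0]:
            last_rank = cells[0]
        else:
            cells[0] = last_rank
        records.append(dict(zip(header, cells)))
    return records


# Abbreviated sample built from the rows printed above
rows = [
    ["RK", "PLAYER", "TEAM", "APG"],
    ["4", "Lou Williams, SG", "LAC", "8.7"],
    ["\xa0", "Draymond Green, PF", "GS", "8.7"],
    ["RK", "PLAYER", "TEAM", "APG"],
]
records = clean_rows(rows)
for record in records:
    print(record)
```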

In [102]:
type(row)

list

In [103]:
import pandas as pd 
data = pd.read_html("http://www.espn.com/nba/statistics/player/_/stat/assists/sort/avgAssists/")
for df in data:
    print(df)

     0                          1     2   3     4    5     6   7     8   \
0    RK                     PLAYER  TEAM  GP   MPG  AST   APG  TO  TOPG   
1     1      Russell Westbrook, PG   OKC   2  37.5   21  10.5  10   5.0   
2     2           James Harden, PG   HOU   2  33.0   20  10.0  11   5.5   
3     3            Nikola Jokic, C   DEN   3  36.0   29   9.7   7   2.3   
4     4           Lou Williams, SG   LAC   3  28.3   26   8.7   9   3.0   
5   NaN         Draymond Green, PF    GS   3  34.0   26   8.7  10   3.3   
6     6            Ben Simmons, PG   PHI   3  33.3   24   8.0   9   3.0   
7     7             Kyle Lowry, PG   TOR   2  35.5   15   7.5   4   2.0   
8   NaN            Ricky Rubio, PG  UTAH   2  31.5   15   7.5   4   2.0   
9     9           Kyrie Irving, PG   BOS   2  36.5   14   7.0   5   2.5   
10   10         Reggie Jackson, PG   DET   2  23.5   13   6.5   2   1.0   
11   RK                     PLAYER  TEAM  GP   MPG  AST   APG  TO  TOPG   
12   11           Monte M
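In the DataFrame printed above, the real column names sit in row 0 and appear again further down as ordinary rows (row 11, for instance), so every column holds strings. A minimal sketch of one way to tidy such a frame follows. The function name `tidy_stats` is hypothetical, and the sample frame is a small hand-built stand-in for one element of `data`, with the header repeat placed early for brevity.

```python
import pandas as pd


def tidy_stats(raw):
    """Promote the first row to column names, drop repeated header rows,
    and convert the APG column to numbers."""
    header = raw.iloc[0].tolist()
    body = raw.iloc[1:]
    # Keep only rows whose first cell is not the header label
    body = body[~body.iloc[:, 0].eq(header[0])].copy()
    body.columns = header
    body["APG"] = pd.to_numeric(body["APG"])
    return body.reset_index(drop=True)


# Hand-built sample using values from the table above
raw = pd.DataFrame([
    ["RK", "PLAYER", "TEAM", "APG"],
    ["1", "Russell Westbrook, PG", "OKC", "10.5"],
    ["RK", "PLAYER", "TEAM", "APG"],
    ["2", "James Harden, PG", "HOU", "10.0"],
])
stats = tidy_stats(raw)
print(stats)
```

In the notebook, this would be called as `tidy_stats(data[0])`, assuming the statistics table is the first one that `read_html` finds on the page.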

#### Thank You All