# Welcome to the Web Scraping Tutorial

# Import the requests library to download pages from the web

In [1]:
import requests

In [10]:
page = requests.get("http://dataquestio.github.io/web-scraping-pages/simple.html")
page1 = requests.get("https://github.com/mohitsharma44official/Python-Web-Scraping-/blob/master/Basic%20HTML.htm")

## The requests library makes a GET request to a web server, which downloads the HTML contents of a given web page for us

In [18]:
page

<Response [200]>

### After running our request, we get a Response object. This object has a status_code property, which indicates if the page was downloaded successfully:

In [19]:
page.status_code

200

### A status_code of 200 means that the page downloaded successfully. A status code starting with 2 generally indicates success, and a code starting with 4 or 5 indicates an error.
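
The rule of thumb above can be sketched as a small helper. This is only an illustration: `status_category` is a hypothetical name, not part of requests, and the reason phrases come from the standard library's `http.HTTPStatus`.

```python
from http import HTTPStatus


def status_category(code: int) -> str:
    """Classify an HTTP status code by its range (hypothetical helper)."""
    if 200 <= code < 300:
        return "success"
    if 400 <= code < 500:
        return "client error"
    if 500 <= code < 600:
        return "server error"
    return "other"


# In the notebook you would pass page.status_code; literal codes are used here
# so the example runs without a network connection.
for code in (200, 404, 500):
    print(code, HTTPStatus(code).phrase, "->", status_category(code))
```

requests also offers `page.raise_for_status()`, which raises an exception for 4xx and 5xx responses, if you prefer to fail early instead of checking the code by hand.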

In [21]:
page.content

b'<!DOCTYPE html>\n<html>\n    <head>\n        <title>A simple example page</title>\n    </head>\n    <body>\n        <p>Here is some simple content for this page.</p>\n    </body>\n</html>'

#### You can see that we extracted the content, but as raw bytes it is hard to read.
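
The `b'...'` prefix in the output shows that `page.content` is a bytes object, which is why the line breaks appear as `\n`. A minimal sketch of decoding it, using a shortened copy of the bytes above and assuming the page is UTF-8 encoded:

```python
# A shortened copy of the bytes shown above, so no download is needed.
raw = b'<!DOCTYPE html>\n<html>\n    <body>\n        <p>Here is some simple content for this page.</p>\n    </body>\n</html>'

# Decode the bytes into a str (assumes the page is UTF-8 encoded).
text = raw.decode("utf-8")

print(type(raw).__name__)   # bytes
print(type(text).__name__)  # str
print(text)                 # the newlines are now printed as real line breaks
```

requests exposes a decoded version as `page.text`, but BeautifulSoup accepts the bytes directly, so the rest of this tutorial keeps using `page.content`.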

## We use the BeautifulSoup library to parse the document and extract the text in a readable form

In [22]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(page.content,'html.parser')

In [23]:
soup

<!DOCTYPE html>

<html>
<head>
<title>A simple example page</title>
</head>
<body>
<p>Here is some simple content for this page.</p>
</body>
</html>

#### Now it looks better!

#### To make it even more readable, we can use the prettify method

In [25]:
print(soup.prettify())

<!DOCTYPE html>
<html>
 <head>
  <title>
   A simple example page
  </title>
 </head>
 <body>
  <p>
   Here is some simple content for this page.
  </p>
 </body>
</html>


#### Note - If you evaluate soup.prettify() directly instead of wrapping it in print(), the notebook shows the raw string with escaped \n characters, which is hard to read!

In [26]:
soup.prettify()

'<!DOCTYPE html>\n<html>\n <head>\n  <title>\n   A simple example page\n  </title>\n </head>\n <body>\n  <p>\n   Here is some simple content for this page.\n  </p>\n </body>\n</html>'

#### Like this ...
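
The difference comes from how the notebook displays values: a bare expression is shown using its `repr`, while `print()` writes the string itself. A minimal sketch with a short stand-in string (not the real page):

```python
# A short stand-in for the string returned by soup.prettify().
pretty = "<p>\n Here is some simple content for this page.\n</p>"

# What a notebook shows for a bare expression: the repr, with escaped newlines.
print(repr(pretty))

# What print() shows: the string itself, with real line breaks.
print(pretty)
```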

### Now, we can select all the elements at the top level of the page using the children property of soup. Note that children returns a list iterator, so we need to call the list function on it to see its contents

In [28]:
soup.children

<list_iterator at 0x5067588>

In [29]:
list(soup.children)

['html', '\n', <html>
 <head>
 <title>A simple example page</title>
 </head>
 <body>
 <p>Here is some simple content for this page.</p>
 </body>
 </html>]

In [30]:
list(soup.children)[0]

'html'

In [31]:
list(soup.children)[1]

'\n'

In [33]:
list(soup.children)[2]

<html>
<head>
<title>A simple example page</title>
</head>
<body>
<p>Here is some simple content for this page.</p>
</body>
</html>

In [34]:
# Check the type of each top-level element
[type(item) for item in soup.children]

[bs4.element.Doctype, bs4.element.NavigableString, bs4.element.Tag]
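
Only `Tag` objects contain nested elements, so it can help to filter the other types out before descending instead of relying on list positions. A minimal sketch, assuming the same HTML as the downloaded page is inlined as a string (`html_doc`, `demo_soup`, `top_level_tags`, and `html_tag` are illustrative names):

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# The same HTML as the page downloaded above, inlined so no request is needed.
html_doc = (
    "<!DOCTYPE html>\n<html>\n    <head>\n"
    "        <title>A simple example page</title>\n    </head>\n"
    "    <body>\n        <p>Here is some simple content for this page.</p>\n"
    "    </body>\n</html>"
)
demo_soup = BeautifulSoup(html_doc, "html.parser")

# Keep only Tag children; the Doctype and the whitespace strings are skipped.
top_level_tags = [child for child in demo_soup.children if isinstance(child, Tag)]
print([tag.name for tag in top_level_tags])

# The same filter one level down, inside <html>.
html_tag = top_level_tags[0]
print([child.name for child in html_tag.children if isinstance(child, Tag)])
```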

In [35]:
html = list(soup.children)[2]

In [37]:
list(html.children)

['\n', <head>
 <title>A simple example page</title>
 </head>, '\n', <body>
 <p>Here is some simple content for this page.</p>
 </body>, '\n']

#### As you can see above, there are two tags here: head and body. We want to extract the text inside the p tag, so we'll dive into the body

In [47]:
body = list(html.children)[3]

In [48]:
body

<body>
<p>Here is some simple content for this page.</p>
</body>

In [49]:
list(body.children)

['\n', <p>Here is some simple content for this page.</p>, '\n']

In [50]:
list(body.children)[1]

<p>Here is some simple content for this page.</p>

In [51]:
p = list(body.children)[1]
p.get_text()

'Here is some simple content for this page.'

#### We can use the get_text method to extract all of the text inside the tag
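
`get_text` also works on a tag that contains other tags, in which case it joins the text of everything inside, whitespace included. A small sketch on an inlined copy of the body (`body_html` and `demo_body` are illustrative names; `strip=True` is a BeautifulSoup option that trims each piece of text):

```python
from bs4 import BeautifulSoup

# An inlined copy of the <body> element from the page above.
body_html = "<body>\n        <p>Here is some simple content for this page.</p>\n    </body>"
demo_body = BeautifulSoup(body_html, "html.parser").body

# Calling get_text on the parent keeps the surrounding newlines and indentation.
print(repr(demo_body.get_text()))

# strip=True trims each text fragment and drops the ones that are only whitespace.
print(repr(demo_body.get_text(strip=True)))
```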

### What we did above was useful for figuring out how to navigate a page, but it took a lot of commands to do something fairly simple. To extract a tag directly, we can instead use the find_all method, which finds all the instances of a tag on a page

In [52]:
soup = BeautifulSoup(page.content, 'html.parser')
soup.find_all('p')

[<p>Here is some simple content for this page.</p>]

In [53]:
# Note: find_all returns a list, so we use list indexing to get a single element

In [54]:
soup.find_all('p')[0]

<p>Here is some simple content for this page.</p>

In [55]:
soup.find_all('p')[0].get_text()

'Here is some simple content for this page.'
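
On a page with more than one paragraph, `find_all` returns every match in document order. A sketch with a hypothetical three-paragraph page (this HTML is made up for illustration and is not the tutorial page):

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, invented for this example.
multi_html = """
<html>
  <body>
    <p>First paragraph.</p>
    <p>Second paragraph.</p>
    <div><p>Nested paragraph.</p></div>
  </body>
</html>
"""
multi_soup = BeautifulSoup(multi_html, "html.parser")

# find_all searches the whole tree, so the nested <p> tag is included too.
paragraphs = [p.get_text() for p in multi_soup.find_all("p")]
print(paragraphs)
```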

#### You can also use find(), which returns only the first matching tag

In [56]:
soup.find('p')

<p>Here is some simple content for this page.</p>

In [57]:
soup.find('p').get_text()

'Here is some simple content for this page.'
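
One thing to watch for: `find` returns `None` when nothing matches, so calling `get_text` on the result can fail. A minimal sketch of guarding against that, using an inlined one-paragraph page (`small_soup` and the fallback message are illustrative):

```python
from bs4 import BeautifulSoup

small_soup = BeautifulSoup(
    "<html><body><p>Here is some simple content for this page.</p></body></html>",
    "html.parser",
)

# find returns the first match, or None if nothing matches.
heading = small_soup.find("h1")
paragraph = small_soup.find("p")

# Guard against None before calling get_text.
heading_text = heading.get_text() if heading is not None else "(no h1 tag found)"
print(heading_text)
print(paragraph.get_text())
```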

#### Now you have a good idea of how to do web scraping. I highly recommend checking out how web scraping is done on real web pages using Python. Refer to this - https://themenyouwanttobe.wordpress.com