# Web Scraping
- Techniques for automating the gathering of data from a website

## Rules of web scraping 
- Always try to get permission before scraping
- If you make too many requests in a short time, your IP address could be blocked
- Some sites automatically block scraping software
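
These rules can be partly automated. The sketch below uses the standard library's `urllib.robotparser` to check whether a URL is allowed before requesting it, and pauses between requests so the server is not flooded. The robots.txt content, the `my-scraper` user agent name, the `example.com` URLs, and the one-second default delay are all illustrative assumptions.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real script would load the site's own
# file with parser.set_url(".../robots.txt") followed by parser.read().
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

def build_parser(robots_txt):
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser

def allowed_urls(parser, urls, user_agent="my-scraper", delay_seconds=1.0):
    """Yield only the URLs that robots.txt permits, pausing between them."""
    for url in urls:
        if parser.can_fetch(user_agent, url):
            yield url
            time.sleep(delay_seconds)  # throttle before moving to the next URL

parser = build_parser(ROBOTS_TXT)
candidates = [
    "https://example.com/catalogue/page-1.html",
    "https://example.com/private/admin.html",
]
# Only the catalogue URL survives; the /private/ path is disallowed.
print(list(allowed_urls(parser, candidates, delay_seconds=0)))
```

A robots.txt check does not replace asking for permission; it only tells you which paths the site owner has asked crawlers to avoid.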

## Limitations of web scraping
- Every website is unique, so every scraping script has to be written for its target site
- A slight change or update to the website may completely break a scraping script
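
Because a layout change can silently break a selector, one common defensive pattern is to treat a missing element as an expected case instead of letting the script crash with an `IndexError`. The helper below is a hypothetical sketch (the name `first_text` and the sample HTML are invented for illustration). It relies on Beautiful Soup's `select_one`, which returns `None` when nothing matches, and uses the built-in `html.parser` so no extra parser is needed.

```python
from bs4 import BeautifulSoup

def first_text(soup, selector, default=None):
    """Return the stripped text of the first match, or `default` if none."""
    element = soup.select_one(selector)
    if element is None:
        return default
    return element.get_text(strip=True)

# Made-up page used only to demonstrate the helper.
html = "<html><body><h1> Website Header </h1></body></html>"
soup = BeautifulSoup(html, "html.parser")

print(first_text(soup, "h1"))                     # Website Header
print(first_text(soup, ".price", default="n/a"))  # n/a
```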

## Main front end components 
- HTML : HyperText Markup Language - used to create the basic structure and content of
         a webpage
- CSS : Cascading Style Sheets - used for the design and style (color, font) of a webpage
- JS : JavaScript - used to add interactivity to a webpage

# Deeper dive into HTML code 
```
<!DOCTYPE Html>
<html>
   <head>
     <title> Title on Browser Tab </title>
   </head>
   <body>
       <h1> Website Header </h1>
       <p> Some Paragraph </p>
   </body>
</html>
```

# Deeper dive into CSS 
- CSS uses selectors (tag names, ids, classes) to define which HTML elements will be styled
- `styles.css` is a separate file that the HTML page links to
- '#' refers to an id and '.' refers to a class
```
<!DOCTYPE Html>
<html>
   <head>
     <link rel="stylesheet" href="styles.css">
     <title> Title on Browser Tab </title>
   </head>
   <body>
       <h1> Website Header </h1>
       <p id ='para2'> Some Paragraph </p>
   </body>
</html>
```
- what is within styles.css?
```
p{
   color:red;
   font-family:courier;
   font-size:160%;
}
.someclass{
   color:green;
   font-family:verdana;
   font-size:300%;
}
#someid{
   color:blue;
}
#para2{
   color:red;
}
.cool{
   color:green;
   font-family:verdana;
   font-size:300%;
}
```

# Libraries to install 
- `conda install requests`
- `conda install lxml` - the parser Beautiful Soup uses when `"lxml"` is passed as the parser name
- `conda install bs4` - installs Beautiful Soup (imported as `bs4`)

In [15]:
# Grabbing the title of a page
# result : a response object [requests.models.Response]
# result.text : the entire HTML document as one raw string, not yet parsed
# soup : the parsed document as a BeautifulSoup object, which is easy to search
# soup.select() : searches with a CSS selector; the output is a list of matching tags
# .getText() : method used to extract the text from a tag
import requests
import bs4
result = requests.get("https://www.kristaseiden.com")
soup = bs4.BeautifulSoup(result.text,"lxml")
soup.select('title')[0].getText()

# How to use soup.select()?
soup.select('div')          # all elements with the 'div' tag
soup.select('#some_id')     # elements with id = 'some_id'
soup.select('.some_class')  # elements with class = 'some_class'
soup.select('div span')     # any span elements anywhere within a div element
soup.select('div > span')   # any span elements directly within a div element, with
                            # nothing in between
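
To see how these selector forms differ, the sketch below runs each of them against a small, made-up HTML fragment. The fragment and its `some_id` / `some_class` names are assumptions for illustration. The built-in `html.parser` is used so the example does not depend on `lxml`.

```python
from bs4 import BeautifulSoup

# Made-up HTML used only to illustrate the selector forms.
html = """
<div id="some_id">
  <span class="some_class">direct child</span>
  <p><span>nested deeper</span></p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

by_tag = soup.select("div")
by_id = soup.select("#some_id")
by_class = soup.select(".some_class")
descendants = soup.select("div span")  # spans at any depth below the div
children = soup.select("div > span")   # spans that are direct children only

print(len(by_tag), len(by_id), len(by_class))  # 1 1 1
print([s.get_text() for s in descendants])     # ['direct child', 'nested deeper']
print([s.get_text() for s in children])        # ['direct child']
```

The last two lines show the practical difference: `div span` also matches the span wrapped in a `<p>`, while `div > span` does not.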

In [21]:
# Grabbing all elements of a class 
import requests
import bs4
result = requests.get("https://www.kristaseiden.com")
soup = bs4.BeautifulSoup(result.text,"lxml")

for widgets in soup.select('.widget-head'):
    print(widgets.text)


In [26]:
# Grabbing an image and saving it locally
# Beautiful Soup can scan a page, locate the <img> tags and grab their URLs
# We can then download the URLs as images and write them to disk
# Always check copyright permissions before downloading and using an image from a website

import requests
import bs4

result = requests.get("https://en.wikipedia.org/wiki/Deep_Blue_(chess_computer)")
soup = bs4.BeautifulSoup(result.text,"lxml")
first_image = soup.select('.thumbimage')[0]

# The src attribute has no scheme here, so "https:" is prepended to it
image_src = "https:" + first_image['src']
image_link = requests.get(image_src)
with open("deep blue processor.jpg", "wb") as f:
    f.write(image_link.content)
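
Prefixing `"https:"` works only when the `src` attribute is protocol-relative (starts with `//`). A more general approach, sketched here with the standard library's `urllib.parse.urljoin`, resolves the `src` against the page URL whatever form it takes. The three `src` values below are invented examples, not values taken from the page.

```python
from urllib.parse import urljoin

page_url = "https://en.wikipedia.org/wiki/Deep_Blue_(chess_computer)"

# Invented src values covering the three common forms.
protocol_relative = "//upload.wikimedia.org/example/image.jpg"
root_relative = "/static/images/logo.png"
absolute = "https://example.com/picture.jpg"

print(urljoin(page_url, protocol_relative))  # https://upload.wikimedia.org/example/image.jpg
print(urljoin(page_url, root_relative))      # https://en.wikipedia.org/static/images/logo.png
print(urljoin(page_url, absolute))           # https://example.com/picture.jpg
```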

In [None]:
# Project : https://toscrape.com/
# Task : Get the title of every book with a 2-star rating

import requests
import bs4

book_title_2_star_rating = []
for i in range(1,51):
    result = requests.get(f"http://books.toscrape.com/catalogue/page-{i}.html")
    soup = bs4.BeautifulSoup(result.text,"lxml")
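    for product in soup.select('.product_pod'):
        # A non-empty result means this product carries the two-star class
        if product.select('.star-rating.Two'):
            # The second <a> in each product holds the title attribute
            book_title_2_star_rating.append(product.select('a')[1]['title'])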
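
The same idea can be generalised to any star rating. The sketch below is a hypothetical helper (`titles_with_rating` and `RATING_WORDS` are invented names) that reads the rating word from the element's class list instead of hard-coding `.Two`. The HTML fragment is made up and only imitates the layout the loop above assumes (a `.star-rating` element plus a title link inside an `<h3>`); the real page may differ.

```python
from bs4 import BeautifulSoup

RATING_WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}

def titles_with_rating(soup, stars):
    """Return titles of products whose star-rating class equals `stars`."""
    titles = []
    for product in soup.select(".product_pod"):
        rating_tag = product.select_one(".star-rating")
        title_link = product.select_one("h3 a")
        if rating_tag is None or title_link is None:
            continue  # skip products that do not match the assumed layout
        # The class attribute is a list such as ["star-rating", "Two"].
        words = [c for c in rating_tag.get("class", []) if c in RATING_WORDS]
        if words and RATING_WORDS[words[0]] == stars:
            titles.append(title_link.get("title"))
    return titles

# Made-up snippet imitating the assumed markup.
html = """
<article class="product_pod">
  <p class="star-rating Two"></p>
  <h3><a href="a.html" title="Book A">Book A</a></h3>
</article>
<article class="product_pod">
  <p class="star-rating Five"></p>
  <h3><a href="b.html" title="Book B">Book B</a></h3>
</article>
"""
soup = BeautifulSoup(html, "html.parser")
print(titles_with_rating(soup, 2))  # ['Book A']
```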

In [181]:
import requests 
import bs4

result = requests.get("http://quotes.toscrape.com/")
soup = bs4.BeautifulSoup(result.text, "lxml")


In [183]:
# To get unique authors 
author = set()
for authors in soup.select(".author"):
    author.add(authors.text)
author

{'Albert Einstein',
 'André Gide',
 'Eleanor Roosevelt',
 'J.K. Rowling',
 'Jane Austen',
 'Marilyn Monroe',
 'Steve Martin',
 'Thomas A. Edison'}

In [184]:
auth_quotes = []
for quotes in soup.select("div>span"):
    quote = quotes.getText()
    auth_quotes.append(quote)
auth_quotes
    

['“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”',
 'by Albert Einstein\n(about)\n',
 '“It is our choices, Harry, that show what we truly are, far more than our abilities.”',
 'by J.K. Rowling\n(about)\n',
 '“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”',
 'by Albert Einstein\n(about)\n',
 '“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”',
 'by Jane Austen\n(about)\n',
 "“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”",
 'by Marilyn Monroe\n(about)\n',
 '“Try not to become a man of success. Rather become a man of value.”',
 'by Albert Einstein\n(about)\n',
 '“It is better to be hated for what you are than to be loved for what you are not.”',
 'by André Gide\n(about)\n',
 "“I have not failed. I've just found 10,000 

In [185]:
for item in soup.select(".tag-item"):
    print(item.text)


love


inspirational


life


humor


books


reading


friendship


friends


truth


simile
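
The blank lines in the output above come from whitespace inside each matched element. Beautiful Soup's `get_text(strip=True)` removes it. The HTML below is a made-up fragment used for illustration, not the real page source.

```python
from bs4 import BeautifulSoup

# Made-up fragment: each tag item wraps its link in surrounding whitespace.
html = """
<span class="tag-item">
  <a class="tag" href="/tag/love/">love</a>
</span>
<span class="tag-item">
  <a class="tag" href="/tag/life/">life</a>
</span>
"""
soup = BeautifulSoup(html, "html.parser")

# strip=True trims every text fragment and drops the empty ones.
tags = [item.get_text(strip=True) for item in soup.select(".tag-item")]
print(tags)  # ['love', 'life']
```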



In [203]:
# Solution 1: loop over a fixed range of pages and skip the empty ones
author = set()
for i in range(1, 15):
    result = requests.get(f"http://quotes.toscrape.com/page/{i}/")
    if 'No quotes found!' not in result.text:
        soup = bs4.BeautifulSoup(result.text, "lxml")
        for authors in soup.select(".author"):
            author.add(authors.text)
author

{'Albert Einstein',
 'Alexandre Dumas fils',
 'Alfred Tennyson',
 'Allen Saunders',
 'André Gide',
 'Ayn Rand',
 'Bob Marley',
 'C.S. Lewis',
 'Charles Bukowski',
 'Charles M. Schulz',
 'Douglas Adams',
 'Dr. Seuss',
 'E.E. Cummings',
 'Eleanor Roosevelt',
 'Elie Wiesel',
 'Ernest Hemingway',
 'Friedrich Nietzsche',
 'Garrison Keillor',
 'George Bernard Shaw',
 'George Carlin',
 'George Eliot',
 'George R.R. Martin',
 'Harper Lee',
 'Haruki Murakami',
 'Helen Keller',
 'J.D. Salinger',
 'J.K. Rowling',
 'J.M. Barrie',
 'J.R.R. Tolkien',
 'James Baldwin',
 'Jane Austen',
 'Jim Henson',
 'Jimi Hendrix',
 'John Lennon',
 'Jorge Luis Borges',
 'Khaled Hosseini',
 "Madeleine L'Engle",
 'Marilyn Monroe',
 'Mark Twain',
 'Martin Luther King Jr.',
 'Mother Teresa',
 'Pablo Neruda',
 'Ralph Waldo Emerson',
 'Stephenie Meyer',
 'Steve Martin',
 'Suzanne Collins',
 'Terry Pratchett',
 'Thomas A. Edison',
 'W.C. Fields',
 'William Nicholson'}

In [None]:
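# Solution 2: keep requesting pages until the site reports no more quotes

# Base URL, same pattern as in Solution 1; the page number is appended to it
url = "http://quotes.toscrape.com/page/"
authors = set()
page = 1

while True:

    # Concatenate to get new page URL
    page_url = url + str(page) + "/"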
    
    # Obtain Request
    res = requests.get(page_url)
    
    # Check to see if we're on the last page
    if "No quotes found!" in res.text:
        break
    
    # Turn into Soup
    soup = bs4.BeautifulSoup(res.text,'lxml')
    
    # Add Authors to our set
    for name in soup.select(".author"):
        authors.add(name.text)
        
    # Go to Next Page
    page += 1
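
A third option, sketched below, is to follow the page's own "next" link instead of guessing page numbers. This assumes the pagination link sits in an `<li class="next">` element, which is an assumption about the markup and should be confirmed in the browser's inspector. The `collect_authors` name, the injected `fetch` callable, and the in-memory `FAKE_PAGES` site are hypothetical and exist so the sketch runs without network access.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def collect_authors(start_url, fetch):
    """Follow 'next' links from start_url and return the set of author names.

    `fetch` is any callable that takes a URL and returns HTML text,
    for example: lambda u: requests.get(u, timeout=10).text
    """
    authors = set()
    seen = set()
    url = start_url
    while url and url not in seen:
        seen.add(url)  # guard against pagination loops
        soup = BeautifulSoup(fetch(url), "html.parser")
        for name in soup.select(".author"):
            authors.add(name.get_text(strip=True))
        next_link = soup.select_one("li.next a")  # assumed markup
        url = urljoin(url, next_link["href"]) if next_link else None
    return authors

# Tiny in-memory "site" standing in for real HTTP responses.
FAKE_PAGES = {
    "http://example.test/page/1/": (
        '<small class="author">Author A</small>'
        '<li class="next"><a href="/page/2/">Next</a></li>'
    ),
    "http://example.test/page/2/": '<small class="author">Author B</small>',
}

result = collect_authors("http://example.test/page/1/", FAKE_PAGES.__getitem__)
print(sorted(result))  # ['Author A', 'Author B']
```

Passing the fetch function in as an argument keeps the parsing logic testable offline and makes it easy to add delays or custom headers later.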

In [None]:
# Project : Scrape product Stats from live website (https://www.mydeal.com.au/sale/sensor-bins)

import random 
import requests 
from bs4 import BeautifulSoup
import pandas as pd

# Define Global Variables

required_url = 'https://www.mydeal.com.au/sale/sensor-bins'

user_agent_list = [
    # This list could be automated with new user agent periodically 
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.1 Safari/605.1.15',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 13_1) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.1 Safari/605.1.15'
]

def get_html_text(required_url, user_agent_list):
    """
    Retrieve the HTML document text for a particular URL.
    Each user agent is tried in turn until the server answers successfully.

    Args:
    required_url (str) : The URL from which we want to scrape data
    user_agent_list (lst) : A list of user agent strings

    Output:
    response (str) : HTML text, or None if every attempt failed
    """
    for user_agent in user_agent_list:
        headers = {'User-Agent': user_agent}
        try:
            response = requests.get(required_url, headers=headers, timeout=10)
        except requests.RequestException as e:
            print(f"Check code :{e} - error was detected")
            continue
        if response.ok:
            # response.text : the entire HTML document as one raw string
            return response.text
        # Report why this attempt failed, then try the next user agent
        print(response.reason)
    return None


def get_product_title(soup):
    """
    This function is used to extract product title 
    Args:
    soup : Beautiful soup created text which is easily parsable

    Output:
    product_title (lst) - List of product titles from the page
    """
    try :
        product_title = []
        for title in soup.select('.deal-title'):
            product_title.append(title.getText().replace('\n', '').replace('\r', '').strip())
        return product_title
    except Exception as e :
        print(f"Check code :{e} - error was detected")

def get_product_price(soup):
    """
    This function is used to extract product price
    Args:
    soup : Beautiful soup created text which is easily parsable

    Output:
    product_price (lst) - List of product prices from the page
    """
    try :
        product_price = []
        product_tag = []

        # Build each price selector from the title link's id by swapping
        # 'dealtitle' for 'dealprice'
        for i in soup.select('.deal-title'):
            product_tag.append('#' + i.find_all('a')[0].get('id').replace('dealtitle', 'dealprice'))

        for tag in product_tag:
            product_price.append(soup.select(tag)[0].getText())
        return product_price
    except Exception as e :
        print(f"Check code :{e} - error was detected")
        
def get_product_rating(soup):
    """
    This function is used to extract product rating
    Args:
    soup : Beautiful soup created text which is easily parsable

    Output:
    product_rating (lst) - List of product ratings from the page
    """
    try :
        product_rating = []

        for rating in soup.select('.product-review-link'):
            product_rating.append(int(rating.getText().strip('()')))
        return product_rating
    except Exception as e :
        print(f"Check code :{e} - error was detected")

def main():
    html_text = get_html_text(required_url, user_agent_list)
    if html_text is None:
        # No request succeeded, so there is nothing to parse
        return pd.DataFrame()

    soup = BeautifulSoup(html_text, "lxml")
    data = {}
    data['last_updated_date'] = pd.to_datetime('today').strftime("%Y-%m-%d")
    data['product_title'] = get_product_title(soup)
    data['product_price'] = get_product_price(soup)
    prod_data = pd.DataFrame(data)
    return prod_data


# When a Python script is executed, Python sets the module's __name__ attribute
# to "__main__" if the script is the entry point of execution.
# If the script is imported as a module into another script, __name__ is set to
# the name of the script/module.
if __name__ == "__main__":
    prod_data = main()
    print(prod_data)
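
The script imports `random` but always tries the user agents in list order. One possible use of that import, sketched below, is to pick a user agent at random for each request. The `random_headers` helper is a hypothetical name, and the shortened list stands in for the full `user_agent_list` defined earlier.

```python
import random

# Shortened, illustrative list; the full user_agent_list could be used instead.
user_agents = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36",
]

def random_headers(agents, rng=random):
    """Build a headers dict with a randomly chosen User-Agent."""
    return {"User-Agent": rng.choice(agents)}

headers = random_headers(user_agents)
print(headers["User-Agent"] in user_agents)  # True
# The dict can then be passed along as requests.get(url, headers=headers).
```

Accepting `rng` as a parameter lets a seeded `random.Random` instance be passed in when reproducible behaviour is needed.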
