## Lynda - Python for Automation - Web Scraping

* Course Home 
    * https://www.linkedin.com/learning/using-python-for-automation
* Web Scraping Section
    * https://www.linkedin.com/learning/using-python-for-automation/the-value-of-web-scraping
    
**Additional Resources:**
* https://towardsdatascience.com/data-science-skills-web-scraping-using-python-d1a85ef607ed
* https://learn.datacamp.com/courses/web-scraping-with-python
* http://www.pythonforbeginners.com/beautifulsoup/beautifulsoup-4-python
* https://www.crummy.com/software/BeautifulSoup/bs4/doc/
* https://doc.scrapy.org/en/latest/intro/tutorial.html
    * https://scrapy.org/
* file:///C:/Users/jpkee/Desktop/PythonProjects/Python%20Cheat%20Sheets/web%20scrapping.pdf
* https://medium.com/analytics-vidhya/web-scraping-wiki-tables-using-beautifulsoup-and-python-6b9ea26d8722
* https://www.youtube.com/watch?v=87Gx3U0BDlo


**Section 3**

Basic Steps
1. Send a GET query to the website
2. Save the HTML document that is returned
3. Parse the returned data

We'll need these libraries:
1. Beautiful Soup
    * Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.
2. lxml
    * lxml is the most feature-rich and easy-to-use library for processing XML and HTML in the Python language.
    * https://lxml.de/ 
3. requests
    * Requests is an elegant and simple HTTP library for Python, built for human beings.
    * https://requests.readthedocs.io/en/master/user/quickstart/
4. urllib2 
    * A Python 2 module for fetching URLs (Uniform Resource Locators). It offers a very simple interface, in the form of the urlopen function. In Python 3 it was split into urllib.request and urllib.error.
    * https://docs.python.org/2/library/urllib2.html
    
We'll use this site for practice:
    * http://quotes.toscrape.com

**Some More Details**
https://www.pythonforbeginners.com/beautifulsoup/beautifulsoup-4-python

Pass a string to find all the b tags:
* soup.find_all('b')

Pass a list to find all the x and y tags:
* print(soup.find_all(["x", "y"]))



The find_all method is one of the most commonly used methods in Beautiful Soup.
Examples of find_all:
1. soup.find_all('title')
2. soup.find_all('p', 'title')
3. soup.find_all('a')
4. soup.find_all(id='link4')
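
The filters listed above can be tried without touching the network. This is a minimal sketch on a made-up HTML snippet (the markup is invented for illustration), and it uses Python's built-in `html.parser` instead of lxml so nothing extra needs to be installed:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, invented for this illustration
html = """
<html><head><title>Demo page</title></head>
<body>
<p class="title"><b>Heading</b></p>
<p class="story">Some <b>bold</b> text and
<a href="/one" id="link1">one</a>
<a href="/four" id="link4">four</a></p>
</body></html>
"""

soup = BeautifulSoup(html, 'html.parser')

bold_tags = soup.find_all('b')                  # filter by tag name
mixed = soup.find_all(['title', 'a'])           # any tag whose name is in the list
title_paragraphs = soup.find_all('p', 'title')  # <p> tags with CSS class "title"
link4 = soup.find_all(id='link4')               # filter by id attribute

print([tag.text for tag in bold_tags])
print([tag.name for tag in mixed])
print(title_paragraphs[0].text)
print(link4[0]['href'])
```

Each call returns a list-like result, so indexing and looping work the same way for all four filters.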


Navigation
1. https://www.crummy.com/software/BeautifulSoup/bs4/doc/#navigating-the-tree
2. https://www.pythonforbeginners.com/beautifulsoup/beautifulsoup-4-python



In [14]:
# import the libraries
import requests
from bs4 import BeautifulSoup

# some logging here for fun
# import logging
# logging.warning('\nThis is the logging.warning bit')  # will print a message to the console
# logging.info('I told you so')  # will not print anything

# create a variable for the site
url = 'http://quotes.toscrape.com'

# create the request; the response object carries the HTTP status code
response = requests.get(url)
# if response.status_code == 200:
#     print(response)
# else:
#     print('not getting a response')
# 'response.text' returns the content of the response as a string
# we'll include the lxml parser
# soup = BeautifulSoup(response.text, 'lxml')

# let's see if it worked correctly by printing soup
# print(soup)



if response.status_code == 200:
    print('looks good')   
else:
    print('looks bad') 



looks good


The simplest way to navigate the parse tree is to say the name of the tag you want. 
If you want the <head> tag, just say:
    
    soup.head
or, for the <title> tag:
    
    soup.title
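
As a small illustration of navigating by tag name, here is a sketch on an invented one-line page (the HTML is made up, and `html.parser` is used so lxml isn't required):

```python
from bs4 import BeautifulSoup

# Hypothetical markup, invented for this illustration
html = "<html><head><title>Quotes to Scrape</title></head><body><p>Hello</p></body></html>"
soup = BeautifulSoup(html, 'html.parser')

head = soup.head         # the first <head> tag in the document
title = soup.title       # the first <title> tag, wherever it sits
print(head)
print(title.string)      # the text inside the tag
print(soup.body.p.text)  # attribute access can be chained
```

Tag-name access only ever returns the first matching tag. When more than one is needed, find_all is the tool to reach for.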


repr(object)
Returns a string containing a printable representation of an object.
It is sometimes useful to be able to access this operation as an ordinary function. 
For many types, this function makes an attempt to return a string that would yield an object with the same value when passed to eval(); 
otherwise the representation is a string enclosed in angle brackets that contains the name of the type of the object together with additional information often including the name and address of the object. 
A class can control what this function returns for its instances by defining a __repr__() method.


>>> s = "String:\tA"
>>> print(s.encode('unicode_escape').decode())
String:\tA
>>> print(repr(s))
'String:\tA'

In [None]:


# loop over the strings
for string in soup.strings:
    print(repr(string))
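
Printing with repr is handy here because it makes whitespace-only strings visible. A minimal sketch on an invented snippet, using `html.parser` so lxml isn't required, which also shows `stripped_strings` as one way to skip that whitespace:

```python
from bs4 import BeautifulSoup

# Hypothetical markup with newlines and padding, invented for this illustration
html = "<div>\n  <span>first</span>\n  <span> second </span>\n</div>"
soup = BeautifulSoup(html, 'html.parser')

# .strings yields every text node, including the whitespace between tags
all_strings = list(soup.strings)
for string in all_strings:
    print(repr(string))

# .stripped_strings drops whitespace-only nodes and trims the rest
clean_strings = list(soup.stripped_strings)
print(clean_strings)
```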

In [None]:
# More on response
import requests 

# Making a get request 
response = requests.get('https://api.github.com') 
  
# printing request text 
print(response.text)

In [None]:
# # find all the b tags
# # soup.find_all('b')
# # so can I find all trs?

# this grabbed all of my column headers
soup.find_all('th')

In [None]:
# finding all anchor tags
soup.find_all('a')

In [None]:
# response.json will give you nicer output
response.json()

In [None]:
# import requests
# r = requests.get('https://api.github.com/events')
# r.json()

In [None]:
# so far we have ALL the code, but now how do we grab just what we want?
# right click on any element in the page and inspect it
# <span class="text"... has all of our bits so will be good to use; it sits within the div and body tags

# reminder: html, head and body are the big tags

In [None]:
# import the libraries
import requests
from bs4 import BeautifulSoup as bs

# create a variable for the site
url = 'http://quotes.toscrape.com'

# create the request
response = requests.get(url)

# we'll include the lxml parser
soup = bs(response.text, 'lxml')

# create a variable for the quotes
    # and use the find_all function
quotes = soup.find_all('span', class_='text')
# this above will work, but still grab some extra html

# so let's create a loop to print each quote
for quote in quotes:
    print(quote.text +'\n')

# print(quotes)

In [None]:
# now let's grab all the authors
# these live in the small tag and author class 


# import the libraries
import requests
from bs4 import BeautifulSoup


# create a variable for the site
url = 'http://quotes.toscrape.com'

# create the request
response = requests.get(url)

# we'll include the lxml parser
soup = BeautifulSoup(response.text, 'lxml')

# create a variable for the quotes
# and use the find_all function
quotes = soup.find_all('span', class_='text')
# authors = soup.find_all('small', class_='author')
for quote in quotes:
    print(quote.text +'\n')
# for author in authors:
#     print(author.text)

# print(quotes)

In [None]:
authors = soup.find_all('small', class_='author')

for author in authors:
    print(author.text +'\n')

In [None]:
# So now update the for loop to print the authors and quotes
import requests
from bs4 import BeautifulSoup

url = 'http://quotes.toscrape.com'

response = requests.get(url)
soup = BeautifulSoup(response.text, 'lxml')
quotes = soup.find_all('span', class_='text')
authors = soup.find_all('small', class_='author')

# so use the range function with the length of the quotes variable to stick them together
for i in range(0, len(quotes)):
    print(authors[i].text +':')
    print(quotes[i].text+'\n')
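
An alternative to indexing with range is to pair the two lists with zip. This is only a sketch on invented sample markup (not the live site), parsed with `html.parser`, and it assumes the authors and quotes appear in the same order and in equal numbers:

```python
from bs4 import BeautifulSoup

# Hypothetical markup mimicking the quote/author structure, invented for illustration
html = """
<div class="quote"><span class="text">"Quote one."</span>
<small class="author">Author A</small></div>
<div class="quote"><span class="text">"Quote two."</span>
<small class="author">Author B</small></div>
"""
soup = BeautifulSoup(html, 'html.parser')
quotes = soup.find_all('span', class_='text')
authors = soup.find_all('small', class_='author')

# zip walks both lists in step, so no index variable is needed
pairs = [(a.text, q.text) for a, q in zip(authors, quotes)]
for author, quote in pairs:
    print(author + ':')
    print(quote + '\n')
```

zip stops at the shorter list, so a missing author would silently drop a quote rather than raise an IndexError.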



In [None]:
# # now let's get the corresponding tags ('deep-thoughts', 'change', etc)
# # but that's using class 'tag', but there is >1 tag per quote so...
# # go a step broader 
# # let's grab the div tag and class tag section
# # each quote only has one tags section
# # add this line to get the tags
# tags = soup.find_all('div', class_='tags')

In [None]:
# import requests
# from bs4 import BeautifulSoup

# url = 'http://quotes.toscrape.com'

# response = requests.get(url)
# soup = BeautifulSoup(response.text, 'lxml')
# quotes = soup.find_all('span', class_='text')
# authors = soup.find_all('small', class_='author')
# # add this line to get the tags
# tags = soup.find_all('div', class_='tags')

# for i in range(0, len(quotes)):
#     print(authors[i].text +':')
#     print(quotes[i].text)
#     print(tags[i].text)


In [None]:
# The above works, but the instructor's solution was like this:

import requests
from bs4 import BeautifulSoup

url = 'http://quotes.toscrape.com'

response = requests.get(url)
soup = BeautifulSoup(response.text, 'lxml')
quotes = soup.find_all('span', class_='text')
authors = soup.find_all('small', class_='author')
# add this line to get the tags
tags = soup.find_all('div', class_='tags')

for i in range(0, len(quotes)):
    print(authors[i].text +':')
    print(quotes[i].text)
    quoteTags = tags[i].find_all('a', class_='tag')
    # iterate thru all the quote tags and print the attributes
    print('  Tags:')
    for quoteTag in quoteTags:
        print('   ' + quoteTag.text)
    print('\n')



In [None]:
# Here's the HTML
#     <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">

#         <span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>
#         <span>by <small class="author" itemprop="author">Albert Einstein</small>
#         <a href="/author/Albert-Einstein">(about)</a>
#         </span>
#         <div class="tags">
#             Tags:
#             <meta class="keywords" itemprop="keywords" content="change,deep-thoughts,thinking,world" />
#             <a class="tag" href="/tag/change/page/1/">change</a>
#             <a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>
#             <a class="tag" href="/tag/thinking/page/1/">thinking</a>
#             <a class="tag" href="/tag/world/page/1/">world</a>
#         </div>
#     </div>

# #  and here's how we match it up
# this is the actual quote
    # quotes = soup.find_all('span', class_='text')
        ## <span class="text" itemprop="text">
# this is the author
    # authors = soup.find_all('small', class_='author')
        ## <span>by <small class="author" itemprop="author">Albert Einstein</small>
# here are the tags
    # tags = soup.find_all('div', class_='tags')
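
Since every quote lives in its own `<div class="quote">`, another option is to loop over those containers and search inside each one, which keeps the text, author, and tags together. A minimal sketch on invented markup shaped like the HTML above, parsed with `html.parser`; the `records` name is just for illustration:

```python
from bs4 import BeautifulSoup

# Hypothetical markup shaped like the quote blocks, invented for this illustration
html = """
<div class="quote">
  <span class="text">"First quote."</span>
  <span>by <small class="author">Author A</small></span>
  <div class="tags">
    <a class="tag" href="/tag/change/page/1/">change</a>
    <a class="tag" href="/tag/world/page/1/">world</a>
  </div>
</div>
<div class="quote">
  <span class="text">"Second quote."</span>
  <span>by <small class="author">Author B</small></span>
  <div class="tags"></div>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')

records = []
for block in soup.find_all('div', class_='quote'):
    # find / find_all called on a tag only search inside that tag
    records.append({
        'text': block.find('span', class_='text').text,
        'author': block.find('small', class_='author').text,
        'tags': [a.text for a in block.find_all('a', class_='tag')],
    })

for record in records:
    print(record)
```

A quote with no tags simply gets an empty list, so the three pieces never drift out of alignment.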


In [None]:
# help(range)

In [None]:
# Confirm 200 Response
# response
# or print(response)

# or even better, get the code
# check for a 200
  ## print(VariableHere.status_code)
print(response.status_code)

In [None]:
from IPython.display import Image
Image('C:/Users/jpkee/Desktop/PythonProjects/Pictures/BeautifulSoupClass.JPG')


**Preparing for paginated scraping**

* https://www.linkedin.com/learning/using-python-for-automation/preparing-for-paginated-scraping
* We'll use this site: https://scrapingclub.com/

In [None]:
# Use this challenge: https://scrapingclub.com/exercise/list_basic
# check out the pagination at the bottom

# grab your modules
import requests
from bs4 import BeautifulSoup

# create request to the site, pointing to the specific page
scrapySite = 'https://scrapingclub.com/exercise/list_basic/?page=1'
response2 = requests.get(scrapySite)

soup2 = BeautifulSoup(response2.text, 'lxml')
items2 = soup2.find_all('div', class_='col-lg-4 col-md-6 mb-4')

count = 1
for i in items2:
    itemName = i.find('h4', class_='card-title').text.strip('\n')
    itemPrice = i.find('h5').text
    print('%s ) Price: %s, Item Name: %s' % (count, itemPrice, itemName))
    count = count + 1

In [None]:
# Image('C:/Users/jpkee/Desktop/PythonProjects/Pictures/BeautifulSoupCardTitle.JPG')


In [None]:
# now let's add multipaging!
# check out the href (hypertext reference) for the pages, page link etc
baseUrl = 'https://scrapingclub.com/exercise/list_basic/'
pages = soup2.find('ul', class_='pagination')
urls = []
links = pages.find_all('a', class_='page-link')

# iterate thru all the page link elements
for link in links:
    # make sure the page number is a digit
    # isdigit? https://www.tutorialspoint.com/python/string_isdigit.htm
    pageNum = int(link.text) if link.text.isdigit() else None
    # if the page number is not None, add its href to the urls list
    if pageNum is not None:
        x = link.get('href')
        urls.append(x)
        # you'll get '?page=7' if you print x
# print(urls)
count = 1
for pageHref in urls:
    # build the full page URL from this site's base URL (not the quotes site 'url')
    newUrl = baseUrl + pageHref
    response2 = requests.get(newUrl)
    # parse the response for this page, not the page 1 response
    soup2 = BeautifulSoup(response2.text, 'lxml')
    items2 = soup2.find_all('div', class_='col-lg-4 col-md-6 mb-4')

    for i in items2:
        itemName = i.find('h4', class_='card-title').text.strip('\n')
        itemPrice = i.find('h5').text
        # Just the formatting bit here
        print('%s ) Price: %s, Item Name: %s' % (count, itemPrice, itemName))
        count = count + 1
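
The link-collecting step can be checked offline. This sketch uses an invented pagination snippet (the real page's markup may differ) and the standard library's `urljoin` to turn each relative `?page=N` href into a full URL. It also skips duplicates, which is an addition not in the cell above:

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

base_url = 'https://scrapingclub.com/exercise/list_basic/'

# Hypothetical pagination markup, invented for this illustration
html = """
<ul class="pagination">
  <li><a class="page-link" href="?page=1">1</a></li>
  <li><a class="page-link" href="?page=2">2</a></li>
  <li><a class="page-link" href="?page=3">3</a></li>
  <li><a class="page-link" href="?page=2">Next</a></li>
</ul>
"""
soup = BeautifulSoup(html, 'html.parser')
pagination = soup.find('ul', class_='pagination')

page_urls = []
for link in pagination.find_all('a', class_='page-link'):
    # keep only numbered links, so "Next" / "Previous" are ignored
    if link.text.strip().isdigit():
        full_url = urljoin(base_url, link.get('href'))
        if full_url not in page_urls:
            page_urls.append(full_url)

print(page_urls)
```

Each entry in `page_urls` could then be passed to `requests.get` in a loop like the one in the cell above.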


In [None]:
Image('C:/Users/jpkee/Desktop/PythonProjects/Pictures/BeautifulSoupCardPage.JPG')

In [None]:
response2

In [None]:
# soup2

In [None]:
x

**Getting Velodrome Info from Wikipedia**

In [None]:
# # Now, try to get the velodrome details here: https://en.wikipedia.org/wiki/List_of_cycling_tracks_and_velodromes
# # grab your modules
# import requests
# from bs4 import BeautifulSoup
# import prettify

# # create request to the site, pointing to the specific page
# velodromes = 'https://en.wikipedia.org/wiki/List_of_cycling_tracks_and_velodromes'
# veloResponse = requests.get(velodromes)
# # Grab the page contents
# tracks = BeautifulSoup(veloResponse.text, 'lxml')

# # So the tables are in wikitable sortable
# # Per https://medium.com/analytics-vidhya/web-scraping-wiki-tables-using-beautifulsoup-and-python-6b9ea26d8722
# #     # we need to put this into a dictionary
    

# # find the table
# veloTable = tracks.find('table', {'class': 'wikitable sortable'})
# # find all the anchors
# veloLinks = veloTable.find_all('a')





# # # I want all the track names so, extract the track names into a list
# # # Create the empty list
# # trackNames = []
# # # Loop thru the cells and grab all the names, putting them into the empty 'trackNames' list



# # <span class="mw-headline" id="Velodromes_currently_in_use">Velodromes currently in use</span>




# # print(tracks.prettify())
# # print(veloTable)
# print(veloLinks)
# # note: veloLinks is a ResultSet (a list of tags), so it has no prettify() method;
# # call prettify() on an individual tag instead, e.g. veloLinks[0].prettify()
# # print(tracks)

In [None]:
# print(tracks)

In [None]:
# find vs find_all
## per https://linuxhint.com/python-beautifulsoup-tutorial-for-beginners/

# The find method searches for the first tag with the needed name and returns an object of type bs4.element.Tag.

# The find_all method on the other hand, searches for all tags with the needed tag name and 
# returns them as a list of type bs4.element.ResultSet. All the items in the list are 
# of type bs4.element.Tag, so we can carry out indexing on the list and continue our beautifulsoup exploration.
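
The difference in return types is easy to confirm on a tiny example. A minimal sketch on an invented table row, parsed with `html.parser`, which also shows what each method returns when nothing matches:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag, ResultSet

# Hypothetical markup, invented for this illustration
html = "<table><tr><th>Country</th><th>Track</th></tr></table>"
soup = BeautifulSoup(html, 'html.parser')

first = soup.find('th')           # a single Tag: the first match
every = soup.find_all('th')       # a ResultSet (list-like) of Tags
missing = soup.find('td')         # None when nothing matches
none_found = soup.find_all('td')  # an empty ResultSet when nothing matches

print(type(first), first.text)
print(type(every), [tag.text for tag in every])
print(missing, len(none_found))
```

Because find returns None on a miss, chaining `.text` straight onto it raises an AttributeError, while looping over an empty find_all result simply does nothing.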

In [None]:
# this is cell 'A1' basically
# <a href="/wiki/Argentina" title="Argentina">Argentina</a>


**More on Scraping Tools**
1. Beautiful Soup
2. Scrapy
3. Selenium

In [None]:
Image('C:/Users/jpkee/Desktop/PythonProjects/Pictures/BeautifulSoupVsSeleniumVsScrapy.JPG')

In [None]:
from bs4 import BeautifulSoup
import requests


source_code = requests.get('http://en.wikipedia.org/wiki/Taylor_Swift_discography')
# name the parser explicitly so Beautiful Soup doesn't have to guess (and warn)
soup = BeautifulSoup(source_code.content, 'lxml')

# grab the table
table = soup.find('span', id='Singles').parent.find_next_sibling('table')

# grab all the columns from that table
for single in table.find_all('th', scope='col'):
    print(single.text)

In [None]:
from bs4 import BeautifulSoup
import requests


source_code = requests.get('http://en.wikipedia.org/wiki/Taylor_Swift_discography')
# name the parser explicitly so Beautiful Soup doesn't have to guess (and warn)
soup = BeautifulSoup(source_code.content, 'lxml')

table = soup.find('span', id='Singles').parent.find_next_sibling('table')
for single in table.find_all('th', scope='row'):
    print(single.text)

In [None]:
# now try to get Taylor Swift's studio albums, which is a similar table structure I think
from bs4 import BeautifulSoup
import requests


source_codeAlbum = requests.get('http://en.wikipedia.org/wiki/Taylor_Swift_discography')
# parse the album response (not the earlier source_code) and name the parser explicitly
soupAlbum = BeautifulSoup(source_codeAlbum.content, 'lxml')

tableAlbum = soupAlbum.find('span', id='Studio_albums').parent.find_next_sibling('table')
for single in tableAlbum.find_all('th', scope='row'):
    print(single.text)


# now how to get the 2nd column?
# the column headers sit in the same table and use scope="col" (not "column"), e.g.
#     <th scope="col" rowspan="2" style="width:18em;">Album details</th>
for header in tableAlbum.find_all('th', scope='col'):
    print(header.text)


# this is how it lines up
    # quotes = soup.find_all('span', class_='text')
        ##                   <span class="text" itemprop="text">
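
The "find the heading, then step to the next table" pattern can be tried on a small stand-in page. Everything in this snippet (album names, years, layout) is an invented placeholder and not real Wikipedia content, and it is parsed with `html.parser`:

```python
from bs4 import BeautifulSoup

# Hypothetical markup with placeholder data, invented for this illustration
html = """
<h3><span id="Studio_albums">Studio albums</span></h3>
<table>
  <tr><th scope="col">Title</th><th scope="col">Album details</th></tr>
  <tr><th scope="row">Album One</th><td>Released: 2006</td></tr>
  <tr><th scope="row">Album Two</th><td>Released: 2008</td></tr>
</table>
"""
soup = BeautifulSoup(html, 'html.parser')

# locate the heading span, go up to its parent, then forward to the next <table>
heading = soup.find('span', id='Studio_albums')
table = heading.parent.find_next_sibling('table')

column_headers = [th.text for th in table.find_all('th', scope='col')]
row_headers = [th.text for th in table.find_all('th', scope='row')]
# the second column holds plain <td> cells, one per data row
details = [row.find('td').text for row in table.find_all('tr') if row.find('td')]

print(column_headers)
print(row_headers)
print(details)
```

The first column is marked up as `th scope="row"` and the rest as `td`, which is why the second column needs a different selector from the first.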

In [None]:
import requests
website_url = requests.get('https://en.wikipedia.org/wiki/List_of_cycling_tracks_and_velodromes').text

from bs4 import BeautifulSoup
soup = BeautifulSoup(website_url,'lxml')
print(soup.prettify())

My_table = soup.find('table',{'class':'wikitable sortable'})
links = My_table.find_all('a')


In [None]:
Countries = []
for link in links:
    Countries.append(link.get('title'))
    
print(Countries)

In [None]:
import pandas as pd
df = pd.DataFrame()
df['Country'] = Countries

df.head()
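
Collecting every link title mixes countries with anything else that is linked in the table. One possible refinement is to walk the table row by row and build one record per row. This is a sketch on an invented two-row table (the track names are placeholders, and the real page's columns may differ), parsed with `html.parser`:

```python
from bs4 import BeautifulSoup

# Hypothetical wikitable with placeholder data, invented for this illustration
html = """
<table class="wikitable sortable">
  <tr><th>Country</th><th>Track</th></tr>
  <tr><td><a href="/wiki/Argentina" title="Argentina">Argentina</a></td><td>Example Track A</td></tr>
  <tr><td><a href="/wiki/Australia" title="Australia">Australia</a></td><td>Example Track B</td></tr>
</table>
"""
soup = BeautifulSoup(html, 'html.parser')
table = soup.find('table', {'class': 'wikitable sortable'})

rows = table.find_all('tr')
# the first row holds the column names
headers = [th.text.strip() for th in rows[0].find_all('th')]

records = []
for row in rows[1:]:
    cells = [td.text.strip() for td in row.find_all('td')]
    records.append(dict(zip(headers, cells)))

print(records)
```

A list of dicts like `records` can be passed straight to `pd.DataFrame(records)` to get one column per header.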