In [4]:
'''

If you visit http://quotes.toscrape.com/, on the top right-hand side of the page, you will see a list with
the heading "Top Ten tags".

Using the BeautifulSoup library, write a Python program to scrape these tags (for example: love, inspirational,
life, humor).

Use pandas and/or NumPy to create a tabular representation of your data, with the column heading being 
TAGS. Each row in your tabular representation will have a tag (for ex: love). 

(a) Display this tabular representation. 

(b) In addition, print the HTML TAG OBJECT immediately surrounding the first tag (love).

(c) In addition, print the HTML TAG OBJECT immediately surrounding the last tag (simile).

Note:
You should NOT save the data as a CSV file.
You should not use exception handling.
'''

import requests
from bs4 import BeautifulSoup
import time
import pandas as pd
import numpy as np

page = requests.get("http://quotes.toscrape.com/")

soup = BeautifulSoup(page.content, 'html.parser')
tags = soup.find("div", class_="tags-box").find_all("a", class_="tag")

print("HTML TAG OBJECT for the first tag:", tags[0])
print("HTML TAG OBJECT for the last tag:", tags[-1])

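# Collect the visible text of each tag link.
data = []

for tag in tags:
    label = tag.get_text().strip()
    data.append(label)

# A one-dimensional array becomes a single-column DataFrame.
df = pd.DataFrame(np.array(data), columns=['TAGS'])
print(df)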

HTML TAG OBJECT for the first tag: <a class="tag" href="/tag/love/" style="font-size: 28px">love</a>
HTML TAG OBJECT for the last tag: <a class="tag" href="/tag/simile/" style="font-size: 6px">simile</a>
            TAGS
0           love
1  inspirational
2           life
3          humor
4          books
5        reading
6     friendship
7        friends
8          truth
9         simile
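
The same tag links can also be picked out with a CSS selector. The following is a minimal sketch that runs on a small hand-written HTML snippet instead of the live page; `sample_html` and the other names are assumptions made for illustration, modeled on the output above.

```python
from bs4 import BeautifulSoup

# Hypothetical, hand-written stand-in for the "Top Ten tags" box (not the live page).
sample_html = """
<div class="tags-box">
  <h2>Top Ten tags</h2>
  <span class="tag-item"><a class="tag" href="/tag/love/">love</a></span>
  <span class="tag-item"><a class="tag" href="/tag/life/">life</a></span>
  <span class="tag-item"><a class="tag" href="/tag/simile/">simile</a></span>
</div>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# "div.tags-box a.tag" matches every <a class="tag"> nested inside the tags box.
tag_links = sample_soup.select("div.tags-box a.tag")
labels = [link.get_text(strip=True) for link in tag_links]

print(labels)
print(tag_links[0])   # tag object surrounding the first label
print(tag_links[-1])  # tag object surrounding the last label
```

Whether `select` or the chained `find(...).find_all(...)` reads better is largely a matter of taste; both return the same tag objects for markup of this shape.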


In [5]:
'''

Imagine that a data scientist started writing a web scraping script to collect data from 
https://www.bookdepository.com/bestsellers by handling pagination. As a first step, the data scientist wrote a 
"for" loop to create a list of all main pagination pages (1, 2, 3, and so on until 34). However, there are issues 
with the code that prevent the data scientist from creating a list of all main pagination pages. I have provided the
code below. Please fix the issues with the code so that the code creates a list of all main pagination pages (1, 
2, 3, and so on until 34).
'''
main_page_list = []

# range() excludes its stop value, so 35 is needed to include page 34.
for i in range(1, 35):
    main_page = 'https://www.bookdepository.com/bestsellers?page=' + str(i)
    print(main_page)
    main_page_list.append(main_page)

print(main_page_list)

https://www.bookdepository.com/bestsellers?page=1
https://www.bookdepository.com/bestsellers?page=2
https://www.bookdepository.com/bestsellers?page=3
https://www.bookdepository.com/bestsellers?page=4
https://www.bookdepository.com/bestsellers?page=5
https://www.bookdepository.com/bestsellers?page=6
https://www.bookdepository.com/bestsellers?page=7
https://www.bookdepository.com/bestsellers?page=8
https://www.bookdepository.com/bestsellers?page=9
https://www.bookdepository.com/bestsellers?page=10
https://www.bookdepository.com/bestsellers?page=11
https://www.bookdepository.com/bestsellers?page=12
https://www.bookdepository.com/bestsellers?page=13
https://www.bookdepository.com/bestsellers?page=14
https://www.bookdepository.com/bestsellers?page=15
https://www.bookdepository.com/bestsellers?page=16
https://www.bookdepository.com/bestsellers?page=17
https://www.bookdepository.com/bestsellers?page=18
https://www.bookdepository.com/bestsellers?page=19
https://www.bookdepository.com/bestselle
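
The loop above can also be written as a list comprehension. This is a minimal sketch that assumes the same URL pattern and page count as the cell above; `base_url` and `page_links` are illustrative names.

```python
# Illustrative names; the URL pattern and the 34-page count come from the cell above.
base_url = 'https://www.bookdepository.com/bestsellers?page='

# range(1, 35) yields 1 through 34 because the stop value is excluded.
page_links = [base_url + str(page_number) for page_number in range(1, 35)]

print(len(page_links))
print(page_links[0])
print(page_links[-1])
```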

In [6]:
'''

Use pagination approach 1 to answer this question.

If you take a closer look at http://quotes.toscrape.com/, you will see a "Next →" button at the bottom right-hand
corner of the page. This indicates pagination, i.e., the website is divided into a sequence of webpages,
with each page holding some quotations. Using the BeautifulSoup library, write a Python program that extracts all
the quotations along with their authors and tags (for example: change, deep-thoughts, thinking, world) from all of
these webpages.

Use pandas and/or NumPy to create a tabular representation of your data, with the column 
headings being quote, author, and tags. Each row in your tabular representation will have a quotation, its
author, and its tags. Display this tabular data. Save the data as a .csv file named question6a.csv, but do not
attach the question6a.csv file when you submit your assignment.

Additional Requirements:
(1) Halt the program execution for two seconds after each time you visit a webpage so that your program doesn’t
place undue load on the server of http://quotes.toscrape.com/.
(2) Use a print function in such a way that each time your program is scraping a particular webpage, it displays 
something like “Now scraping:” followed by the URL of the page being scraped.

'''
data = []

current_listingpage_link = "https://quotes.toscrape.com/"

next_page_tag = True 

while next_page_tag:
    print("*****NOW SCRAPING LINKS FROM PRODUCT LISTING PAGE:",current_listingpage_link)
    print("\n")

    current_listingpage_resp = requests.get(current_listingpage_link)

    current_listingpage_soup = BeautifulSoup(current_listingpage_resp.content, "html.parser")

    quotes = current_listingpage_soup.find_all('div', class_ ='quote')

    for block in quotes:
        print("Quote:")
        # find() returns None for a missing element, so .get_text() raises AttributeError.
        try:
            text = block.find('span', class_="text").get_text().strip()
        except AttributeError:
            text = "NA"
        print(text)
        try:
            author = block.find('small', class_="author").get_text().strip()
        except AttributeError:
            author = "NA"
        print("Author:", author)
        try:
            tags = block.find('div', class_="tags").get_text().strip()
        except AttributeError:
            tags = "NA"
        print(tags)
        print("\n")
        data.append((text, author, tags))

    # Pause once per page (not once per quote), as the assignment requires.
    time.sleep(2)

    next_page_tag = current_listingpage_soup.find("li", class_="next")

    if next_page_tag:
        print("Next Page Exists:")
        # Extract the relative link of the next page and make it absolute.
        current_listingpage_link = "https://quotes.toscrape.com" + next_page_tag.find("a").attrs['href']

    else:
        print("No More Pagination")


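# Each (quote, author, tags) tuple becomes one row.
df = pd.DataFrame(data, columns=['Quote', 'Author', 'Tags'])

print(df)

# index=False keeps the row numbers out of the saved file.
df.to_csv('question6a.csv', index=False)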

*****NOW SCRAPING LINKS FROM PRODUCT LISTING PAGE: https://quotes.toscrape.com/


Quote:
“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
Author: Albert Einstein
Tags:
            
change
deep-thoughts
thinking
world


Quote:
“It is our choices, Harry, that show what we truly are, far more than our abilities.”
Author: J.K. Rowling
Tags:
            
abilities
choices


Quote:
“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”
Author: Albert Einstein
Tags:
            
inspirational
life
live
miracle
miracles


Quote:
“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”
Author: Jane Austen
Tags:
            
aliteracy
books
classic
humor


Quote:
“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”
Author: Marilyn Monroe
Tags:
     

Quote:
“All you need is love. But a little chocolate now and then doesn't hurt.”
Author: Charles M. Schulz
Tags:
            
chocolate
food
humor


Quote:
“We read to know we're not alone.”
Author: William Nicholson
Tags:
            
misattributed-to-c-s-lewis
reading


Quote:
“Any fool can know. The point is to understand.”
Author: Albert Einstein
Tags:
            
knowledge
learning
understanding
wisdom


Quote:
“I have always imagined that Paradise will be a kind of library.”
Author: Jorge Luis Borges
Tags:
            
books
library


Quote:
“It is never too late to be what you might have been.”
Author: George Eliot
Tags:
            
inspirational


Next Page Exists:
*****NOW SCRAPING LINKS FROM PRODUCT LISTING PAGE: https://quotes.toscrape.com/page/5/


Quote:
“A reader lives a thousand lives before he dies, said Jojen. The man who never reads lives only one.”
Author: George R.R. Martin
Tags:
            
read
readers
reading
reading-books


Quote:
“You can never get a cup of 

Quote:
“′Classic′ - a book which people praise and don't read.”
Author: Mark Twain
Tags:
            
books
classic
reading


Next Page Exists:
*****NOW SCRAPING LINKS FROM PRODUCT LISTING PAGE: https://quotes.toscrape.com/page/9/


Quote:
“Anyone who has never made a mistake has never tried anything new.”
Author: Albert Einstein
Tags:
            
mistakes


Quote:
“A lady's imagination is very rapid; it jumps from admiration to love, from love to matrimony in a moment.”
Author: Jane Austen
Tags:
            
humor
love
romantic
women


Quote:
“Remember, if the time should come when you have to make a choice between what is right and what is easy, remember what happened to a boy who was good, and kind, and brave, because he strayed across the path of Lord Voldemort. Remember Cedric Diggory.”
Author: J.K. Rowling
Tags:
            
integrity


Quote:
“I declare after all there is no enjoyment like reading! How much sooner one tires of any thing than of a book! -- When I have a house of
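
The "follow the Next link until it disappears" loop used above can be tried without touching the network. The sketch below is only an illustration: `fake_site` is a hypothetical two-page site held in memory, `example.com` is a placeholder host, and `urljoin` from the standard library stands in for the manual string concatenation.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Hypothetical two-page "site" held in memory so the sketch runs offline.
fake_site = {
    "https://example.com/": (
        '<div class="quote"></div>'
        '<ul class="pager"><li class="next"><a href="/page/2/">Next</a></li></ul>'
    ),
    "https://example.com/page/2/": '<div class="quote"></div>',
}

visited = []
current_link = "https://example.com/"

while current_link:
    visited.append(current_link)
    page_soup = BeautifulSoup(fake_site[current_link], "html.parser")

    # The last page has no <li class="next">, so find() returns None and the loop ends.
    next_item = page_soup.find("li", class_="next")
    if next_item:
        # urljoin resolves the relative href against the page it was found on.
        current_link = urljoin(current_link, next_item.find("a")["href"])
    else:
        current_link = None

print(visited)
```

In a real run, the dictionary lookup would be replaced by `requests.get(...)` followed by a pause, as in the cell above.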

In [7]:
'''

Use pagination approach 2 to answer question 6 above. Save the data as a .csv file named question6b.csv,
but do not attach the question6b.csv file when you submit your assignment.
'''
main_page_list = []

# Pages 1 through 10; range() excludes its stop value.
for i in range(1, 11):
    each_pagination_link = "https://quotes.toscrape.com/page/" + str(i) + "/"
    main_page_list.append(each_pagination_link)
    
data = []

for current_listingpage_link in main_page_list:

    print("*****NOW SCRAPING LINKS FROM PRODUCT LISTING PAGE:",current_listingpage_link)

    current_listingpage_resp = requests.get(current_listingpage_link)

    current_listingpage_soup = BeautifulSoup(current_listingpage_resp.content, "html.parser")

    quotes = current_listingpage_soup.find_all('div', class_ ='quote')
    for block in quotes:
        print("Quote:")
        # find() returns None for a missing element, so .get_text() raises AttributeError.
        try:
            text = block.find('span', class_="text").get_text().strip()
        except AttributeError:
            text = "NA"
        print(text)
        try:
            author = block.find('small', class_="author").get_text().strip()
        except AttributeError:
            author = "NA"
        print("Author:", author)
        try:
            tags = block.find('div', class_="tags").get_text().strip()
        except AttributeError:
            tags = "NA"
        print(tags)
        print("\n")
        data.append((text, author, tags))

    # Pause once per page (not once per quote), as the assignment requires.
    time.sleep(2)

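# Each (quote, author, tags) tuple becomes one row.
df = pd.DataFrame(data, columns=['Quote', 'Author', 'Tags'])

print(df)

# index=False keeps the row numbers out of the saved file.
df.to_csv('question6b.csv', index=False)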

*****NOW SCRAPING LINKS FROM PRODUCT LISTING PAGE: https://quotes.toscrape.com/page/1/
Quote:
“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
Author: Albert Einstein
Tags:
            
change
deep-thoughts
thinking
world


Quote:
“It is our choices, Harry, that show what we truly are, far more than our abilities.”
Author: J.K. Rowling
Tags:
            
abilities
choices


Quote:
“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”
Author: Albert Einstein
Tags:
            
inspirational
life
live
miracle
miracles


Quote:
“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”
Author: Jane Austen
Tags:
            
aliteracy
books
classic
humor


Quote:
“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”
Author: Marilyn Monroe
Tags:


KeyboardInterrupt: 
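
The `Tags:` label and the blank lines in the output above appear because `get_text()` on the whole `div` with class `tags` returns the label together with every tag name. One possible alternative, sketched here on a hand-written snippet (an assumption modeled on the structure the code above expects, not fetched from the site), reads each tag link separately.

```python
from bs4 import BeautifulSoup

# Hypothetical quote block, hand-written to resemble the structure the code above expects.
sample_quote_html = """
<div class="quote">
  <span class="text">Sample quote text.</span>
  <small class="author">Sample Author</small>
  <div class="tags">
    Tags:
    <a class="tag" href="/tag/change/">change</a>
    <a class="tag" href="/tag/world/">world</a>
  </div>
</div>
"""

sample_block = BeautifulSoup(sample_quote_html, "html.parser").find("div", class_="quote")

# Whole-div text: includes the "Tags:" label and the line breaks between links.
raw_tags = sample_block.find("div", class_="tags").get_text().strip()

# Per-link text: one clean string per tag, joined into a single CSV-friendly cell.
tag_names = [link.get_text(strip=True) for link in sample_block.find_all("a", class_="tag")]
joined_tags = ", ".join(tag_names)

print(repr(raw_tags))
print(joined_tags)
```

Storing `joined_tags` instead of `raw_tags` would keep the label and the embedded newlines out of the `Tags` column.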

In [8]:
'''

Imagine that you want to write a script that searches indeed.com for different roles and extracts the
job listings that result from each search.

Imagine that you want to search the following roles:

--> data scientist
--> marketing
--> information systems

Each of these roles needs to be searched in the following locations:

--> chicago
--> manhattan
--> san diego

Note that the roles and locations should not be searched in isolation. Instead, a role should be paired with a 
location. For example, one of the searches would be searching for data scientist in chicago.

Implement the Following Steps:
(1) Create an empty list that will eventually hold all the created links.
(2) Create lists for roles and locations.
(3) The first step in scraping through search is to create links for each search. Use a "for" loop to create
    links for each search (given above). 
(4) Add each link from step 3 above to the list you created in step 1.
(5) Print each link.
(6) Print the list with all the created links.

You DO NOT NEED to scrape this web page.
'''

# Step 1
urls = []

# Step 2
roles = ['data+scientist','marketing','information+systems']
locations = ['Chicago','Manhattan','San+Diego']

# Steps 3-5: build one link per (role, location) pair, print it, and store it.
for role in roles:
    for location in locations:
        created_link = 'https://www.indeed.com/jobs?q=' + role + '&l=' + location + '&from=searchOnHP&vjk=d4f7ddfdababc3e4'
        print(created_link)
        urls.append(created_link)

# Step 6
print(urls)

https://www.indeed.com/jobs?q=data+scientist&l=Chicago&from=searchOnHP&vjk=d4f7ddfdababc3e4
https://www.indeed.com/jobs?q=data+scientist&l=Manhattan&from=searchOnHP&vjk=d4f7ddfdababc3e4
https://www.indeed.com/jobs?q=data+scientist&l=San+Diego&from=searchOnHP&vjk=d4f7ddfdababc3e4
https://www.indeed.com/jobs?q=marketing&l=Chicago&from=searchOnHP&vjk=d4f7ddfdababc3e4
https://www.indeed.com/jobs?q=marketing&l=Manhattan&from=searchOnHP&vjk=d4f7ddfdababc3e4
https://www.indeed.com/jobs?q=marketing&l=San+Diego&from=searchOnHP&vjk=d4f7ddfdababc3e4
https://www.indeed.com/jobs?q=information+systems&l=Chicago&from=searchOnHP&vjk=d4f7ddfdababc3e4
https://www.indeed.com/jobs?q=information+systems&l=Manhattan&from=searchOnHP&vjk=d4f7ddfdababc3e4
https://www.indeed.com/jobs?q=information+systems&l=San+Diego&from=searchOnHP&vjk=d4f7ddfdababc3e4
['https://www.indeed.com/jobs?q=data+scientist&l=Chicago&from=searchOnHP&vjk=d4f7ddfdababc3e4', 'https://www.indeed.com/jobs?q=data+scientist&l=Manhattan&from=s
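
The role/location pairing above can also be expressed with the standard library. This is a minimal sketch, not the assignment's required form: it keeps only the `q` and `l` parameters (the `from` and `vjk` parameters are left out as an assumption that they are optional), and `role_names`, `location_names`, and `search_links` are illustrative names.

```python
from itertools import product
from urllib.parse import urlencode

# Plain-text search terms; urlencode() takes care of the "+" for spaces.
role_names = ['data scientist', 'marketing', 'information systems']
location_names = ['Chicago', 'Manhattan', 'San Diego']

# product() yields every (role, location) pair, replacing the nested loops.
search_links = [
    'https://www.indeed.com/jobs?' + urlencode({'q': role, 'l': location})
    for role, location in product(role_names, location_names)
]

for link in search_links:
    print(link)
```

Letting `urlencode` build the query string avoids hand-writing `+` into every search term and also escapes any other special characters a term might contain.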

In [9]:
import requests
from bs4 import BeautifulSoup
import time
import pandas as pd
import numpy as np

data = []

search_strings = ['social+media','golf','hockey','the+world+cup','USA','NFL','MLB','NBA','twitter','Yankees']

search_string_links = []

# Build one search URL per search string.
for string in search_strings:
    created_link = 'https://www.reuters.com/search/news?sortBy=&dateRange=&blob=' + string
    search_string_links.append(created_link)

for current_page_link in search_string_links:
    print("Now Scraping:",current_page_link)

    current_page_resp = requests.get(current_page_link)

    soup = BeautifulSoup(current_page_resp.content, "html.parser")

    articles = soup.find_all('div', class_='search-result-content')

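    if articles:

        for article in articles:

            # find() returns None for a missing element, so .get_text() raises AttributeError.
            try:
                headline = article.find('h3', class_='search-result-title').get_text().strip()
            except AttributeError:
                headline = "NA"

            try:
                timestamp = article.find('h5', class_='search-result-timestamp').get_text().strip()
            except AttributeError:
                timestamp = "NA"

            # A missing <a> raises AttributeError; a missing href makes the concatenation raise TypeError.
            try:
                hyperlink = 'https://www.reuters.com' + article.find('a').get('href')
            except (AttributeError, TypeError):
                hyperlink = "NA"

            data.append((headline, timestamp, hyperlink))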
            
    time.sleep(3)
                
if data:

    df = pd.DataFrame(data, columns=['Headline', 'Date & Time', 'Hyperlink'])

    df.to_csv('question9.csv', index=False)

    # Print inside the guard: df is only created here when at least one article was scraped.
    print(df)

Now Scraping: https://www.reuters.com/search/news?sortBy=&dateRange=&blob=social+media
Now Scraping: https://www.reuters.com/search/news?sortBy=&dateRange=&blob=golf
Now Scraping: https://www.reuters.com/search/news?sortBy=&dateRange=&blob=hockey
Now Scraping: https://www.reuters.com/search/news?sortBy=&dateRange=&blob=the+world+cup
Now Scraping: https://www.reuters.com/search/news?sortBy=&dateRange=&blob=USA
Now Scraping: https://www.reuters.com/search/news?sortBy=&dateRange=&blob=NFL
Now Scraping: https://www.reuters.com/search/news?sortBy=&dateRange=&blob=MLB
Now Scraping: https://www.reuters.com/search/news?sortBy=&dateRange=&blob=NBA
Now Scraping: https://www.reuters.com/search/news?sortBy=&dateRange=&blob=twitter
Now Scraping: https://www.reuters.com/search/news?sortBy=&dateRange=&blob=Yankees
                                             Headline  \
0   Biden administration appeals ban on social med...   
1   Utah governor signs laws curbing social media ...   
2   Utah governor 
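
When a search returns no articles, the list of rows is empty. The sketch below illustrates how passing the column names to the `DataFrame` constructor behaves for both an empty and a non-empty list; the single row is made-up sample data, not scraped output.

```python
import pandas as pd

columns = ['Headline', 'Date & Time', 'Hyperlink']

# Passing the rows and the column names together also works for zero rows.
empty_df = pd.DataFrame([], columns=columns)

# Hypothetical row, used only to show the non-empty case.
filled_df = pd.DataFrame(
    [('Sample headline', 'Sample timestamp', 'https://www.reuters.com/sample')],
    columns=columns,
)

print(empty_df.shape)
print(filled_df.shape)
```

Building the frame this way would let the table be constructed unconditionally, leaving any emptiness check to the code that saves or prints it.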