
# Web Scraping Project
## Introduction
In this project you'll build a quote-guessing game. When run, your program scrapes a website for a collection of quotes, picks one at random, and displays it. The player gets four chances to guess who said the quote. After every wrong guess they get a hint about the author's identity.

## Requirements

* Create a file called `scraping_project.py` which, when run, grabs data on every quote from the website http://quotes.toscrape.com

* You can use `bs4` and `requests` to get the data. For each quote you should grab the text of the quote, the name of the person who said the quote, and the href of the link to the person's bio. Store all of this information in a list.

* Next, display the quote to the user and ask who said it. The player starts with four guesses.

* After each incorrect guess, the number of remaining guesses decreases by one. If the player runs out of guesses without identifying the author, the player loses and the game ends. If the player correctly identifies the author, the player wins!

* After every incorrect guess, the player receives a hint about the author.
  1. For the first hint, make another request to the author's bio page (this is why we originally scrape this data), and tell the player the author's birth date and location.

  2. The next two hints are up to you! Some ideas: the first letter of the author's first name, the first letter of the author's last name, the number of letters in one of the names, etc.

* When the game is over, ask the player if they want to play again. If yes, restart the game with a new quote. If no, the program is complete.

Good luck!
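
The name-based hints suggested in the requirements need no extra requests. The sketch below is one possible approach and is not part of the solution that follows; `name_hints` is a hypothetical helper and the sample author is only there for illustration.

```python
def name_hints(author):
    """Build two simple hints from an author's name (hypothetical helper)."""
    parts = author.split()
    # For a single-word name, first and last are the same word.
    first, last = parts[0], parts[-1]
    return [
        f"The author's first name starts with {first[0]} and has {len(first)} letters.",
        f"The author's last name starts with {last[0]} and has {len(last)} letters.",
    ]


for line in name_hints("Jane Austen"):
    print(line)
```

Keeping the hints in a list makes it easy to hand out one per wrong guess by index.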

In [None]:
# my solution

import requests
from bs4 import BeautifulSoup
import csv
from time import sleep
from random import choice
import pickle
from os.path import exists

!pip install pyfiglet

#----------------------------------------DATA SCRAPING MODULES------------------------------------------------------

data_list = []

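def scraper(url):
  """Scrape one page of quotes and append each quote's data to data_list."""

  response = requests.get(url)
  data = response.text
  soup = BeautifulSoup(data,'html.parser')
  quotes = soup.select('.quote')
  global data_list

  for quote in quotes:
    quote_text = quote.find(class_='text').text
    quote_author = quote.find(class_='author').text
    bio_link = quote.find('a')['href'] # first link in a quote block is the "(about)" bio link
    data_list.append(
        {
            'quote':quote_text,
            'author':quote_author,
            'bio_link':bio_link
        }
    )
  return data_list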


def crawler():

  if exists('data_list.pickle'): # checks whether file exists in directory
    print()
    print('Data file found. Loading data...','\n')

    with open('data_list.pickle','rb') as file: # 'rb' means read binary
      data_list = pickle.load(file)
    #return data_list

  else:
    print()
    print('No file found. Collecting site data. Please wait...','\n')
    s_url = 'http://quotes.toscrape.com'
    url = s_url

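    while True:
      data_list = scraper(url) # appends this page's quotes to the global list
      # fetch the page again to look for the "Next" pagination link
      response = requests.get(url)
      data = response.text
      soup = BeautifulSoup(data,'html.parser')

      try:
        next_href = soup.find(class_='next').contents[1]['href']
        url = s_url + next_href
      except AttributeError: # no element with class 'next' on the last page
        break
      sleep(2) # pause between requests to be polite to the server
    print('Data gathering completed successfully!','\n')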

    with open('data_list.pickle','wb') as file: # saved in .pickle format - 'wb' means write in binary
      pickle.dump(data_list,file)
  
  return data_list


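def hint(author):
  """Return three hints: birth details, a masked bio sentence, and the name's initials."""

  # use a stored 'bio_link' for this author when data_list has one; otherwise build the URL from
  # the name, which only works when the site's slug is the name with spaces replaced by hyphens
  bio_link = next((q['bio_link'] for q in data_list if q['author'] == author and 'bio_link' in q),None)
  author = author.replace(' ','-')
  url = f'https://quotes.toscrape.com{bio_link}' if bio_link else f'https://quotes.toscrape.com/author/{author}/'
  response = requests.get(url)
  data = response.text
  soup = BeautifulSoup(data,'html.parser')
  author_bday = soup.find(class_='author-born-date').text
  author_loc = soup.find(class_='author-born-location').get_text()
  author_des = soup.find(class_='author-description').get_text()
  mask_names = author.split('-')
  mask_names.extend([author.replace('-',' ')])

  # hide the author's name wherever it appears in the description
  for name in mask_names:
    author_des = author_des.replace(name,'$$$$')
    author_des = author_des.replace('$$$$ $$$$','$$$$')

  # pick a random non-empty sentence, skipping the opening one when there are others
  sentences = [s.strip() for s in author_des.split('. ') if s.strip()]
  candidates = sentences[1:10] or sentences
  author_des = choice(candidates) if candidates else ''

  # show only the first letter of each part of the name
  words = [[char if ind == 0 else '_' for ind,char in enumerate(name)] for name in author.split('-')]
  author_initials = '  '.join([' '.join(word) for word in words])
  return (f'Born on {author_bday} {author_loc}',author_des,author_initials)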


#--------------------------------------------------GAME ENGINE-------------------------------------------------------------

url = 'https://raw.githubusercontent.com/kaushanr/python3-docs/main/docs/heading_art.py'
r = requests.get(url)
with open('/content/heading_art.py', 'w') as f:
    f.write(r.text)

from heading_art import heading

data_list = crawler()

print(heading('* TRIVIA TIME *'))
print('Hi there!...let\'s play "guess who said it..."','\n')

def play_game():

  attempts = 4
  _quote_data = choice(data_list)
  quote = _quote_data['quote']
  _author = _quote_data['author']

  hint_result = hint(_author)

  print('Here\'s the quote.','\n')
  print(quote,'\n')

  for i in range(attempts):
    print(f'Guesses remaining : {attempts}')
    attempts -= 1
    user_input = input('Take a guess : ').strip()
    if user_input.lower() == _author.lower(): # case-insensitive comparison
      print('You guessed correctly! Congratulations!')
      break
    elif attempts > 0:
      print('Nope, try again!')
      print(f'Here\'s a hint : {hint_result[i]}')
    else:
      print(f'Sorry, you\'ve run out of guesses. The answer was {_author}')

  play_again = input('Do you want to play again (Y/N)? : ').lower()
  while play_again not in ('y','n'):
    print('Please enter a valid input!')
    play_again = input('Do you want to play again (Y/N)? : ').lower()

  if play_again == 'n':
    print('Okay, goodbye.')
  else:
    print()
    print('New Game','\n')
    play_game() # start a fresh round with a new quote

play_game()

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/

Data file found. Loading data... 

        _____ ____  _____     _____    _      _____ ___ __  __ _____
__/\__ |_   _|  _ \|_ _\ \   / /_ _|  / \    |_   _|_ _|  \/  | ____| __/\__
\    /   | | | |_) || | \ \ / / | |  / _ \     | |  | || |\/| |  _|   \    /
/_  _\   | | |  _ < | |  \ V /  | | / ___ \    | |  | || |  | | |___  /_  _\
  \/     |_| |_| \_\___|  \_/  |___/_/   \_\   |_| |___|_|  |_|_____|   \/  
                                                                            
Hi there!...let's play "guess who said it..." 

Here's the quote. 

“I'm the one that's got to die when it's time for me to die, so let me live my life the way I want to.” 

Guesses remaining : 4
Take a guess : gdss
Nope, try again!
Here's a hint : Born on November 27, 1942 in Seattle, Washington, The United States
Guesses remaining : 3
Take a guess : 
Nope, try again!
Here's a hint : King,

In [37]:
# Web Crawlers with Scrapy

#!pip install scrapy

In [35]:
# write a python file to the working directory - %%writefile overwrites the file on every run (use %%writefile -a to append instead)

%%writefile book_scraper.py 

import scrapy

class BookSpider(scrapy.Spider):

  name = 'bookspider'  
  start_urls = ['http://books.toscrape.com/']

  def parse(self,response):
    for article in response.css('article.product_pod'): # article tags with class product_pod - CSS selector syntax
      yield {
          'price': article.css('.price_color::text').extract_first(),
          'title': article.css('h3 > a::attr(title)').extract_first() # title attribute of the link inside h3
      }
    # look for the next-page link once per page, after all items have been yielded
    next_page = response.css('.next > a::attr(href)').extract_first()
    if next_page:
      yield response.follow(next_page,self.parse) # schedule the next page with parse as its callback

Overwriting book_scraper.py
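
`response.follow` accepts a relative href and resolves it against the URL of the current response. The same resolution can be reproduced with the standard library's `urllib.parse.urljoin`, as sketched below. The `resolve` helper is hypothetical, and the href values are illustrative assumptions about what the site's pagination links look like, not output captured from the crawl.

```python
from urllib.parse import urljoin


def resolve(page_url, href):
    """Resolve an href found on page_url into an absolute URL."""
    return urljoin(page_url, href)


# Assumed example hrefs; in the spider the real values come from '.next > a::attr(href)'.
first = resolve('http://books.toscrape.com/', 'catalogue/page-2.html')
second = resolve(first, 'page-3.html')
print(first)
print(second)
```

Concatenating a fixed base URL with the href only works when every href is relative to the site root, so resolving against the current page URL is the safer default.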


In [36]:
!scrapy runspider -O books.csv book_scraper.py # command line syntax, -O overwrites, -o appends to file



2022-09-10 16:25:17 [scrapy.utils.log] INFO: Scrapy 2.6.2 started (bot: scrapybot)
2022-09-10 16:25:17 [scrapy.utils.log] INFO: Versions: lxml 4.9.1.0, libxml2 2.9.14, cssselect 1.1.0, parsel 1.6.0, w3lib 2.0.1, Twisted 22.8.0, Python 3.7.13 (default, Apr 24 2022, 01:04:09) - [GCC 7.5.0], pyOpenSSL 22.0.0 (OpenSSL 3.0.5 5 Jul 2022), cryptography 38.0.1, Platform Linux-5.10.133+-x86_64-with-Ubuntu-18.04-bionic
2022-09-10 16:25:17 [scrapy.crawler] INFO: Overridden settings:
{'SPIDER_LOADER_WARN_ONLY': True}
2022-09-10 16:25:17 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2022-09-10 16:25:17 [scrapy.extensions.telnet] INFO: Telnet Password: 15b45df633108f18
2022-09-10 16:25:17 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.feedexport.FeedExporter',
 'scrapy.extensions.logstats.LogStats']
2022-09-10 16:25:1