In [29]:
from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))

## Web Scraping in Python:

> - I recently became interested in web scraping.
> - I had heard of it many times but never used it until now, so I'm excited to learn it.
> - I hope to scrape some useful sites (where scraping is permitted) and gather insights from the data.
> - This is the first notebook, and it contains basic web scraping code in Python. I learnt this from [Tinkernet on YouTube](https://www.youtube.com/watch?v=QhD015WUMxE).

![Web scraping banner image](https://images.unsplash.com/photo-1484417894907-623942c8ee29?auto=format&fit=crop&q=60&w=600&ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxzZWFyY2h8Mnx8d2ViJTIwc2NyYXBpbmd8ZW58MHx8MHx8fDA%3D)

### Index:

- [Method 1: Basic Web Scraping using BeautifulSoup](#method1)
- [How to Export Scraped Data to a CSV File](#export_to_csv)
- [Selenium](#selenium)

<a id="method1"></a>
### Method 1: Basic Web Scraping using BeautifulSoup

In [30]:
#IMPORT LIBRARIES

from bs4 import BeautifulSoup
import requests

In [31]:
#REQUEST WEBPAGE AND STORE IT AS A VARIABLE
page_to_scrape = requests.get("http://quotes.toscrape.com")
page_to_scrape 

# the output should be Response [200]

<Response [200]>
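
Rather than checking the `Response [200]` output by eye, the request can fail loudly when something goes wrong. The helper below is a minimal sketch: `fetch_html` and its `get` parameter are hypothetical names introduced here (the `get` parameter only exists so the function can be tried without network access), while `timeout` and `raise_for_status()` are standard parts of `requests`.

```python
import requests


def fetch_html(url, get=requests.get, timeout=10):
    """Return the page body as text, raising if the request failed."""
    # A timeout stops the call from hanging forever on an unresponsive server
    response = get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx/5xx responses
    response.raise_for_status()
    return response.text


# Example call (needs network access):
# html = fetch_html("http://quotes.toscrape.com")
```

The returned string can then be passed to `BeautifulSoup(..., 'html.parser')` in the same way as `page_to_scrape.text`.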

In [32]:
#USE BEAUTIFULSOUP TO PARSE THE HTML AND STORE IT AS A VARIABLE
soup = BeautifulSoup(page_to_scrape.text, 'html.parser')
soup

<!DOCTYPE html>

<html lang="en">
<head>
<meta charset="utf-8"/>
<title>Quotes to Scrape</title>
<link href="/static/bootstrap.min.css" rel="stylesheet"/>
<link href="/static/main.css" rel="stylesheet"/>
</head>
<body>
<div class="container">
<div class="row header-box">
<div class="col-md-8">
<h1>
<a href="/" style="text-decoration: none">Quotes to Scrape</a>
</h1>
</div>
<div class="col-md-4">
<p>
<a href="/login">Login</a>
</p>
</div>
</div>
<div class="row">
<div class="col-md-8">
<div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
<span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>
<span>by <small class="author" itemprop="author">Albert Einstein</small>
<a href="/author/Albert-Einstein">(about)</a>
</span>
<div class="tags">
            Tags:
            <meta class="keywords" content="change,deep-thoughts,thinking,world" itemprop="keywords"/>
<a class="

In [33]:
#FIND ALL THE <span> ELEMENTS IN THE PAGE WITH A CLASS ATTRIBUTE OF 'text'
#AND STORE THE LIST AS A VARIABLE
#(find_all IS THE CURRENT NAME; findAll IS A LEGACY ALIAS)
quotes = soup.find_all('span', attrs={'class':'text'})
quotes

[<span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>,
 <span class="text" itemprop="text">“It is our choices, Harry, that show what we truly are, far more than our abilities.”</span>,
 <span class="text" itemprop="text">“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”</span>,
 <span class="text" itemprop="text">“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”</span>,
 <span class="text" itemprop="text">“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”</span>,
 <span class="text" itemprop="text">“Try not to become a man of success. Rather become a man of value.”</span>,
 <span class="text" itemprop="text">“It is better to be hated for what you are than to be loved for what you are not.

In [34]:
# FIND ALL THE <small> ELEMENTS IN THE PAGE WITH A CLASS ATTRIBUTE OF 'author'
# AND STORE THE LIST AS A VARIABLE

authors = soup.find_all('small', attrs={"class":"author"})
authors

[<small class="author" itemprop="author">Albert Einstein</small>,
 <small class="author" itemprop="author">J.K. Rowling</small>,
 <small class="author" itemprop="author">Albert Einstein</small>,
 <small class="author" itemprop="author">Jane Austen</small>,
 <small class="author" itemprop="author">Marilyn Monroe</small>,
 <small class="author" itemprop="author">Albert Einstein</small>,
 <small class="author" itemprop="author">André Gide</small>,
 <small class="author" itemprop="author">Thomas A. Edison</small>,
 <small class="author" itemprop="author">Eleanor Roosevelt</small>,
 <small class="author" itemprop="author">Steve Martin</small>]

In [35]:
# How to remove tags? Loop through the elements and use .text on each one.
# The search returns a list-like ResultSet, which has no .text of its own.

for i in authors:
    print(i.text)

Albert Einstein
J.K. Rowling
Albert Einstein
Jane Austen
Marilyn Monroe
Albert Einstein
André Gide
Thomas A. Edison
Eleanor Roosevelt
Steve Martin
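
The same loop can collect the names into a list instead of printing them, which is handy for later steps. This is a small sketch on a hand-written HTML snippet (the `sample_html` string and the variable names are made up for illustration, modelled on the author tags above), using `get_text(strip=True)` to trim surrounding whitespace.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet shaped like the author tags on the scraped page
sample_html = """
<small class="author" itemprop="author">Albert Einstein</small>
<small class="author" itemprop="author"> J.K. Rowling </small>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")
author_tags = sample_soup.find_all("small", class_="author")

# .text / get_text() live on each Tag, not on the ResultSet,
# so a list comprehension visits the tags one by one
author_names = [tag.get_text(strip=True) for tag in author_tags]
print(author_names)
```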


In [36]:
# LOOP THROUGH BOTH LISTS USING THE 'ZIP' FUNCTION
# AND PRINT AND FORMAT THE RESULTS

for quote, author in zip(quotes, authors):
    print(quote.text + "-" + author.text)

“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”-Albert Einstein
“It is our choices, Harry, that show what we truly are, far more than our abilities.”-J.K. Rowling
“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”-Albert Einstein
“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”-Jane Austen
“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”-Marilyn Monroe
“Try not to become a man of success. Rather become a man of value.”-Albert Einstein
“It is better to be hated for what you are than to be loved for what you are not.”-André Gide
“I have not failed. I've just found 10,000 ways that won't work.”-Thomas A. Edison
“A woman is like a tea bag; you never know how strong it is until it's in hot water.”-Eleanor Roosevelt
“A day witho
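
Pairing two separately collected lists works here because every quote on the page has exactly one author. A possible alternative, sketched below on a hand-written snippet, is to loop over each `div` with class `quote` (the wrapper visible in the parsed HTML above) and look up the text and author inside it, so a missing element cannot shift the pairs. The `sample_html` content and variable names are illustrative assumptions.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: the second quote block has no author tag
sample_html = """
<div class="quote">
  <span class="text">“First quote.”</span>
  <span>by <small class="author">Author One</small></span>
</div>
<div class="quote">
  <span class="text">“Second quote.”</span>
</div>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

rows = []
for block in sample_soup.find_all("div", class_="quote"):
    # Search only inside this quote block, not the whole page
    text_tag = block.find("span", class_="text")
    author_tag = block.find("small", class_="author")
    rows.append((
        text_tag.get_text(strip=True) if text_tag else None,
        author_tag.get_text(strip=True) if author_tag else None,
    ))

for quote_text, author_name in rows:
    print(quote_text, "-", author_name)
```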

> I'm happy to have learnt this simple web scrape. I also got to know the zip function; I have been coding for 4 years and had no clue about it.

<a id="export_to_csv"></a>
### How to Export Scraped Data to a CSV File

In [37]:
#IMPORT CSV LIBRARY

import csv

In [38]:
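#OPEN A NEW CSV FILE. IT CAN BE CALLED ANYTHING
#newline='' AVOIDS BLANK ROWS ON WINDOWS; utf-8 KEEPS NON-ASCII CHARACTERS PORTABLE
file = open('scraped_quotes_2.csv', 'w', newline='', encoding='utf-8')
file

<_io.TextIOWrapper name='scraped_quotes_2.csv' mode='w' encoding='utf-8'>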

In [39]:
#CREATE A VARIABLE FOR WRITING TO THE CSV
writer = csv.writer(file)
writer

<_csv.writer at 0x23b8583e540>

In [40]:
#CREATE THE HEADER ROW OF THE CSV

writer.writerow(['Quote', 'Author'])

14

In [41]:
for quote, author in zip(quotes, authors):
    print(quote.text + "-" + author.text)
    writer.writerow([quote.text, author.text]) #WRITE EACH ITEM AS A NEW ROW IN THE CSV

file.close() #CLOSE THE CSV FILE

“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”-Albert Einstein
“It is our choices, Harry, that show what we truly are, far more than our abilities.”-J.K. Rowling
“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”-Albert Einstein
“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”-Jane Austen
“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”-Marilyn Monroe
“Try not to become a man of success. Rather become a man of value.”-Albert Einstein
“It is better to be hated for what you are than to be loved for what you are not.”-André Gide
“I have not failed. I've just found 10,000 ways that won't work.”-Thomas A. Edison
“A woman is like a tea bag; you never know how strong it is until it's in hot water.”-Eleanor Roosevelt
“A day witho
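
The export steps above can also be written with a `with` block, which closes the file automatically even if an error occurs midway. This is a sketch with made-up rows and a temporary file path (both assumptions for illustration); it writes the CSV and reads it back to check the result.

```python
import csv
import os
import tempfile

# Made-up rows standing in for the scraped (quote, author) pairs
sample_rows = [
    ("“First quote.”", "André Gide"),
    ("“Second, with a comma.”", "Author Two"),
]

# Temporary location so the example does not overwrite any real file
csv_path = os.path.join(tempfile.mkdtemp(), "scraped_quotes_example.csv")

# newline="" is what the csv module expects; utf-8 handles curly quotes and accents
with open(csv_path, "w", newline="", encoding="utf-8") as csv_file:
    writer = csv.writer(csv_file)
    writer.writerow(["Quote", "Author"])  # header row
    writer.writerows(sample_rows)         # one row per pair

# Read the file back to confirm what was written
with open(csv_path, newline="", encoding="utf-8") as csv_file:
    rows_read = list(csv.reader(csv_file))

print(rows_read)
```

Fields containing commas are quoted by `csv.writer`, so they come back as a single column when read.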

<a id="selenium"></a>
### Selenium 

In [1]:
import csv
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
import time
import getpass
from selenium.common.exceptions import NoSuchElementException

In [43]:
# pip install selenium 
# already installed

Collecting selenium
  Downloading selenium-4.14.0-py3-none-any.whl (9.9 MB)
Collecting trio-websocket~=0.9
  Downloading trio_websocket-0.11.1-py3-none-any.whl (17 kB)
Collecting trio~=0.17
  Downloading trio-0.22.2-py3-none-any.whl (400 kB)
Collecting certifi>=2021.10.8
  Downloading certifi-2023.7.22-py3-none-any.whl (158 kB)
Collecting exceptiongroup>=1.0.0rc9
  Downloading exceptiongroup-1.1.3-py3-none-any.whl (14 kB)
Collecting outcome
  Downloading outcome-1.3.0-py2.py3-none-any.whl (10 kB)
Collecting wsproto>=0.14
  Downloading wsproto-1.2.0-py3-none-any.whl (24 kB)
Collecting h11<1,>=0.9.0
  Downloading h11-0.14.0-py3-none-any.whl (58 kB)
Installing collected packages: outcome, h11, exceptiongroup, wsproto, trio, trio-websocket, certifi, selenium
  Attempting uninstall: certifi
    Found existing installation: certifi 2020.12.5
    Uninstalling certifi-2020.12.5:
      Successfully uninstalled certifi-2020.12.5
Successfully installed certifi-2023.7.22 exceptiongroup-1.1.3 h11-0

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
conda-repo-cli 1.0.4 requires pathlib, which is not installed.
anaconda-project 0.9.1 requires ruamel-yaml, which is not installed.
