# Scrape Google Forever

## Amazon SQS

![Basic workflow of Amazon SQS](https://do38he9wic6d8.cloudfront.net/sqsconsole-20220308192726570/assets/images/5e3f44ce52788a4fb8b8432e4441bf3f-SQS-diagram.svg "Basic workflow of Amazon SQS")

### Producers

### Queues


> Amazon SQS provides queues for high-throughput, system-to-system messaging. You can use queues to decouple heavyweight processes and to buffer and batch work. Amazon SQS stores messages until microservices and serverless applications process them.

#### How to?

#### SQS Configuration

> The visibility timeout begins when Amazon SQS returns a message. If the consumer fails to process and delete the message before the visibility timeout expires, the message becomes visible to other consumers. If a message must be received only once, your consumer must delete it within the duration of the visibility timeout.
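
To make the receive-then-delete contract concrete, here is a minimal sketch of one polling step. It assumes `sqs` is a boto3 SQS client (or any object exposing the same `receive_message` and `delete_message` calls); `process_one_batch` and `handler` are hypothetical names used only for this illustration.

```python
def process_one_batch(sqs, queue_url, handler, visibility_timeout=60):
    """Receive up to 10 messages, handle each one, and delete those that succeed.

    `sqs` is assumed to be a boto3 SQS client; `handler` is a hypothetical
    callable that takes the message body and raises on failure.
    Returns the number of messages deleted.
    """
    response = sqs.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=10,   # SQS returns at most 10 messages per call
        WaitTimeSeconds=5,        # long polling
        VisibilityTimeout=visibility_timeout,
    )
    deleted = 0
    # Use .get because an empty queue may yield no "Messages" key
    for message in response.get("Messages", []):
        try:
            handler(message["Body"])
        except Exception as exc:
            # Not deleted: the message becomes visible again after the timeout
            print(f"processing failed, message will be retried: {exc}")
            continue
        sqs.delete_message(
            QueueUrl=queue_url,
            ReceiptHandle=message["ReceiptHandle"],
        )
        deleted += 1
    return deleted
```

The delete call is what tells SQS the message is done. If `handler` takes longer than the visibility timeout, another consumer may receive the same message before it is deleted.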

#### Dead Letter Queue (DLQ)
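
A dead-letter queue is attached to a source queue through the `RedrivePolicy` queue attribute: once a message has been received more than `maxReceiveCount` times without being deleted, SQS moves it to the queue identified by `deadLetterTargetArn`. The sketch below only builds the attribute dictionary; the helper name, the example ARN, and the limit of 5 receives are assumptions for illustration.

```python
import json


def redrive_attributes(dlq_arn, max_receive_count=5):
    """Build the Attributes dict that points a source queue at a DLQ."""
    policy = {
        "deadLetterTargetArn": dlq_arn,
        "maxReceiveCount": str(max_receive_count),
    }
    # SQS expects the policy as a JSON string, not as a nested dict
    return {"RedrivePolicy": json.dumps(policy)}


# Hypothetical ARN; substitute the ARN of your own dead-letter queue
attrs = redrive_attributes("arn:aws:sqs:us-west-2:123456789012:my-new-queue-dlq")
print(attrs)
```

The resulting dictionary can be passed as `Attributes=` to `create_queue` or `set_queue_attributes` on a boto3 SQS client.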

### Alternatively: create the queue with boto3

In [3]:
import boto3

def create_queue():
    # Create a standard queue; attribute values must be passed as strings
    sqs_client = boto3.client("sqs", region_name="us-west-2")
    response = sqs_client.create_queue(
        QueueName="my-new-queue",
        Attributes={
            "DelaySeconds": "0",
            "VisibilityTimeout": "60",  # 60 seconds
        }
    )
    print(response)
    return response

### Produce Messages

In [1]:
import boto3

# Create SQS client
sqs = boto3.client('sqs')

queue_url = 'SQS_QUEUE_URL'  # placeholder; replace with your queue's URL

# Send message to SQS queue
response = sqs.send_message(
    QueueUrl=queue_url,
    DelaySeconds=10,
    MessageAttributes={
        'Title': {
            'DataType': 'String',
            'StringValue': 'The Whistler'
        },
        'Author': {
            'DataType': 'String',
            'StringValue': 'John Grisham'
        },
        'WeeksOn': {
            'DataType': 'Number',
            'StringValue': '6'
        }
    },
    MessageBody=(
        'Information about current NY Times fiction bestseller for '
        'week of 12/11/2016.'
    )
)

print(response['MessageId'])

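QueueDoesNotExist: An error occurred (AWS.SimpleQueueService.NonExistentQueue) when calling the SendMessage operation: The specified queue does not exist for this wsdl version.

The call fails because `'SQS_QUEUE_URL'` is only a placeholder and does not point to an existing queue.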
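
One way to avoid hard-coding the URL is to look it up from the queue name with `get_queue_url`. This is a minimal sketch, assuming `sqs` is a boto3 SQS client and that a queue with the given name already exists in the client's region; `resolve_queue_url` is a hypothetical helper name.

```python
def resolve_queue_url(sqs, queue_name):
    """Return the URL of an existing queue, looked up by its name.

    `sqs` is assumed to be a boto3 SQS client, e.g. boto3.client("sqs").
    """
    response = sqs.get_queue_url(QueueName=queue_name)
    return response["QueueUrl"]
```

With real credentials this would be called as `queue_url = resolve_queue_url(boto3.client("sqs"), "my-new-queue")`.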

In [2]:
!pip install boto3

Collecting boto3
  Downloading boto3-1.21.23-py3-none-any.whl (132 kB)
Collecting s3transfer<0.6.0,>=0.5.0
  Downloading s3transfer-0.5.2-py3-none-any.whl (79 kB)
Collecting jmespath<2.0.0,>=0.7.1
  Downloading jmespath-1.0.0-py3-none-any.whl (23 kB)
Collecting botocore<1.25.0,>=1.24.23
  Downloading botocore-1.24.23-py3-none-any.whl (8.6 MB)
Installing collected packages: jmespath, botocore, s3transfer, boto3
Successfully installed boto3-1.21.23 botocore-1.24.23 jmespath-1.0.0 s3transfer-0.5.2


Hard-coding your AWS access key and secret key in a notebook is not a good idea. It's better to configure AWS with the [AWS CLI](https://aws.amazon.com/cli/).

Once AWS is configured on your system, you can select the profile and region through `os.environ` variables, as shown below, and boto3 picks up the credentials from that profile.

In [2]:
import boto3
import os

os.environ['AWS_PROFILE'] = "default"
os.environ['AWS_DEFAULT_REGION'] = "us-west-1"

# Create SQS client
sqs = boto3.client("sqs")

In [3]:
queue_url = 'https://sqs.us-west-1.amazonaws.com/XXXXXXXXXXX/google_query' # 'SQS_QUEUE_URL'

In [4]:
# Send message to SQS queue
response = sqs.send_message(
    QueueUrl=queue_url,
    DelaySeconds=10,
    MessageBody=(
        'Query Google infinitely'
    )
)

print(response['MessageId'])

c4e0e462-fe87-4947-97b0-260ac0ea47fd


That's our first message added to the queue.
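
When many queries need to be queued, `send_message_batch` sends up to 10 messages per call. The sketch below is one possible way to chunk a list of queries; `build_batches` and `send_queries` are hypothetical helper names, and `sqs` is assumed to be a boto3 SQS client.

```python
def build_batches(queries, batch_size=10):
    """Split queries into lists of send_message_batch entries.

    SQS accepts at most 10 entries per send_message_batch call, and each
    entry needs an Id that is unique within its batch.
    """
    batches = []
    for start in range(0, len(queries), batch_size):
        chunk = queries[start:start + batch_size]
        batches.append(
            [{"Id": str(i), "MessageBody": q} for i, q in enumerate(chunk)]
        )
    return batches


def send_queries(sqs, queue_url, queries):
    """Send every query and return the entries SQS reported as failed."""
    failed = []
    for entries in build_batches(queries):
        response = sqs.send_message_batch(QueueUrl=queue_url, Entries=entries)
        # "Failed" lists rejected entries; .get covers the case where it is absent
        failed.extend(response.get("Failed", []))
    return failed
```

A call such as `send_queries(sqs, queue_url, ["aws lambda certifications", "amazon sqs pricing"])` would enqueue both queries in a single request.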

### Consumer: Lambda function

In [40]:
import re
import urllib.parse  # quote_plus lives in the urllib.parse submodule

import requests
from requests_html import HTMLSession

class ScrapeGoogle:
    def get_source(self, url):
        """Fetch a URL and return the response, or None if the request fails."""
        try:
            session = HTMLSession()
            response = session.get(url)
            return response

        except requests.exceptions.RequestException as e:
            print(e)
            return None
        
        
    def get_results(self, query):
        # URL-encode the query and return the response for the Google search page
        query = urllib.parse.quote_plus(query)
        response = self.get_source("https://www.google.com/search?q=" + query)
        return response


    def parse_results(self, response):
        # Return an empty list (the same type as a normal result) on a failed request
        if not response:
            return []

        css_identifier_result = ".tF2Cxc"
        css_identifier_title = "h3"
        css_identifier_link = ".yuRUbf a"
        css_identifier_text = ".IsZvec"

        results = response.html.find(css_identifier_result)

        output = []

        for result in results:
            title = result.find(css_identifier_title, first=True)
            title = title.text if title is not None else ""
            link = result.find(css_identifier_link, first=True)
            link = link.attrs['href'] if link is not None else ""
            text = result.find(css_identifier_text, first=True)
            text = text.text if text is not None else ""

            item = {
                "title": title,
                "link": link,
                "text": text
            }

            output.append(item)

        return output

    def google_search(self, query):
        response = self.get_results(query)
        return self.parse_results(response)

    # Turn an arbitrary string into a valid filename
    def get_valid_filename(self, query):
        # Adapted from Django: https://github.com/django/django/blob/main/django/utils/text.py
        s = str(query).strip().replace(' ', '_')
        return re.sub(r'(?u)[^-\w.]', '', s)


In [53]:
scraper = ScrapeGoogle()

In [59]:
res = scraper.google_search('aws lambda certifications')

In [43]:
scraper.get_valid_filename('sofmsd salfu08q95lku /af4352436')

'sofmsd_salfu08q95lku_af4352436'

In [65]:
import json

import boto3

# Lambda entry point: scrape Google for each SQS message and store the results in S3
def lambda_handler(event, context):

    scraper = ScrapeGoogle()
    bucket_name = "google-scraped-json"
    s3 = boto3.resource("s3")
    s3_paths = []  # keys of all objects written during this invocation

    for record in event['Records']:
        payload = record["body"]  # the SQS message body is the search query
        scraped_json = scraper.google_search(payload)
        scraped_json = json.dumps({"results": scraped_json})
        s3_path = "Scraped/" + scraper.get_valid_filename(payload[:40]) + ".json"
        s3.Bucket(bucket_name).put_object(Key=s3_path, Body=scraped_json)
        s3_paths.append(s3_path)
    return {
        'statusCode': 200,
        'body': json.dumps('files created in: ' + ', '.join(s3_paths))
    }
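
To try the handler outside Lambda, it helps to have an event shaped like the one SQS delivers. The sketch below builds a stripped-down version; `make_sqs_event` is a hypothetical helper, and real SQS events carry more fields (receipt handle, attributes, and so on) than the handler above reads.

```python
def make_sqs_event(bodies):
    """Build a minimal SQS-style Lambda event with one record per body."""
    return {
        "Records": [
            {
                "messageId": f"local-{i}",  # made-up id for local testing
                "body": body,
                "eventSource": "aws:sqs",
            }
            for i, body in enumerate(bodies)
        ]
    }


event = make_sqs_event(["aws lambda certifications", "amazon sqs pricing"])
for record in event["Records"]:
    print(record["body"])
```

Calling `lambda_handler(event, None)` with such an event still needs network access and write permission on the S3 bucket, so it is a smoke test rather than a unit test.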