# Creating a Google Cloud Platform (GCP) Data Scraper

For this exercise, which we will build on next week, we are going to stand up GCP resources to scrape [Reddit](https://www.reddit.com/).
The scraping will be done by tapping into Reddit's RSS feeds.

[Read about RSS here](https://en.wikipedia.org/wiki/RSS).

### Prerequisites

 1. The Labs for GCP Compute Engine
 1. Tutorial link for GCP Cloud Buckets
 1. Prior labs in AWS where you installed software on VMs
 1. The GCP Storage Practice, which covers installing software on VMs and manipulating Storage buckets programmatically.

### Overview

 1. Create a Storage Bucket to collect the data
 1. Create a preemptible Compute Engine
 1. Install software on compute engine
 1. Write additional code modules to collect data into the bucket
 1. Collect data and write to storage bucket
 
#### Data Scraper Concept Overview
 
![DataScraperStructure_Mini_Project1.png MISSING](../images/DataScraperStructure_Mini_Project1.png)


**Note:** Please use the <span style="background:yellow">**us-central1**</span> region for all activities!


# 1. Create a Storage Bucket

Link: https://console.cloud.google.com/storage/
 * Name: **dsa_mini_project_your_pawprint**
 * Select a Regional storage class

# 2. Create a Preemptible Compute Engine (VM)

Link: https://console.cloud.google.com/compute/instances
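 * Name: **dsa-mini-project-your-pawprint** (Compute Engine instance names allow only lowercase letters, digits, and hyphens, so use hyphens rather than underscores)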
 * Select Micro Instance
![DataScraperVM_Instance.png MISSING](../images/DataScraperVM_Instance.png)


**BE SURE TO MAKE IT PREEMPTIBLE**

# 3. Install software on compute engine

**You will need to install software to your compute engine (VM)**

###  https://cloud.google.com/python/setup

 * [RSS Feed Libraries](https://wiki.python.org/moin/RssLibraries)
 * [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/)
 

Also read through this helpful information about accessing Reddit RSS Feeds: 
https://www.reddit.com/r/pathogendavid/comments/tv8m9/pathogendavids_guide_to_rss_and_reddit/
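
As a small illustration of the feed URLs that guide discusses, the sketch below builds an RSS URL for a single subreddit. It assumes Reddit's convention of appending `.rss` to a listing URL. The helper name `subreddit_rss_url` and its parameters are hypothetical and not part of any library.

```python
from urllib.parse import quote


def subreddit_rss_url(subreddit, listing="new"):
    """Build an RSS feed URL for one subreddit listing (hypothetical helper).

    Assumes Reddit's convention that appending '.rss' to a listing URL
    returns that listing as a feed.
    """
    # Trim whitespace and stray slashes, then escape anything unusual
    name = quote(subreddit.strip().strip("/"), safe="")
    return "https://www.reddit.com/r/{}/{}/.rss".format(name, listing)


print(subreddit_rss_url("chemhelp"))
```

The resulting string can be passed to `feedparser.parse()` in place of the site-wide feed URL.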

##### Here is some sample Python code to pull the Reddit feed and just print it out.

In [15]:
import feedparser
from bs4 import BeautifulSoup
from bs4.element import Comment

# Functions from: https://stackoverflow.com/questions/1936466/beautifulsoup-grab-visible-webpage-text

def tag_visible(element):
    if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
        return False
    if isinstance(element, Comment):
        return False
    return True

def text_from_html(body):
    soup = BeautifulSoup(body, 'html.parser')
    texts = soup.find_all(string=True)
    visible_texts = filter(tag_visible, texts)  
    return u" ".join(t.strip() for t in visible_texts)

# Define URL of the RSS Feed I want
a_reddit_rss_url = 'http://www.reddit.com/new/.rss?sort=new'

feed = feedparser.parse( a_reddit_rss_url )


if (feed['bozo'] == 1):
    print("Error Reading/Parsing Feed XML Data")    
else:
    for item in feed[ "items" ]:
        dttm = item[ "date" ]
        title = item[ "title" ]
        summary_text = text_from_html(item[ "summary" ])
        link = item[ "link" ]
        
        print("====================")
        print("Title: {} ({})\nTimestamp: {}".format(title,link,dttm))
        print("--------------------\nSummary:\n{}".format(summary_text))
     
              
# ------------ Create file string
              
def reddit_post_string(feed):
    """
    Function to build a single text string from the Reddit RSS entries
    """    
    
    file_str = ""
              
    if (feed['bozo'] == 1):
        print("Error Reading/Parsing Feed XML Data")    
    else:
        for item in feed[ "items" ]:
            dttm = item[ "date" ]
            title = item[ "title" ]
            summary_text = text_from_html(item[ "summary" ])
            link = item[ "link" ]
        
            file_str += "====================\n"
            file_str += "Title: {} ({})\nTimestamp: {}\n".format(title,link,dttm)
            file_str += "--------------------\nSummary:\n{}".format(summary_text)

    return file_str

#test_string = reddit_post_string(feed)

#print(test_string)

Title: Iron + Oxygen + Water = Rust. Where did Hydrogen go? Also how did 2 Iron come? [ See Pic] (https://www.reddit.com/r/chemhelp/comments/r6y6x9/iron_oxygen_water_rust_where_did_hydrogen_go_also/)
Timestamp: 2021-12-02T03:59:15+00:00
--------------------
Summary:
Image - https://imgur.com/a/1qElgmm  ​  I don't know anything about chemistry. I'm self learning chemistry.  I know that there is law of conservation of mass which means when chemical reaction happens, the mass is preserved from before the reaction, meaning "mass before reaction = mass after reaction"  ​  After Iron reactions with Oxygen in the presence of water, we get rust.  In the chemical reaction things always balance. like if there is two oxygen on left, there will be two oxygen in right  ​  In this case there is 1 Fe on left, but Fe2 on right. How?  Also there is Hydrogen on LHS but no Hydrogen on RHS. How??  ​  Kindly clear my basic understanding  ​  Thank You  /u/infiltratorshepard r/chemhelp [link] [comments]
Titl

In [16]:
test_string = reddit_post_string(feed)

print(test_string)

Title: Iron + Oxygen + Water = Rust. Where did Hydrogen go? Also how did 2 Iron come? [ See Pic] (https://www.reddit.com/r/chemhelp/comments/r6y6x9/iron_oxygen_water_rust_where_did_hydrogen_go_also/)
Timestamp: 2021-12-02T03:59:15+00:00
--------------------
Summary:
Title: Hello from my spot in sunny NZ! I love this sub, and a good colour story 🥰💕💚 (https://www.reddit.com/r/entwives/comments/r6y6x8/hello_from_my_spot_in_sunny_nz_i_love_this_sub/)
Timestamp: 2021-12-02T03:59:15+00:00
--------------------
Summary:
Title: Did not receive WETH after wrapping through Opensea (https://www.reddit.com/r/opensea/comments/r6y6x7/did_not_receive_weth_after_wrapping_through/)
Timestamp: 2021-12-02T03:59:15+00:00
--------------------
Summary:
Title: What is something that you stopped doing after the pandemic started, and you're never doing again, even if the pandemic ends? (https://www.reddit.com/r/AskReddit/comments/r6y6x2/what_is_something_that_you_stopped_doing_after/)
Timestamp: 2021-12-02T03:5

In [10]:
feed

{'feed': {'tags': [{'term': ' reddit.com',
    'scheme': None,
    'label': 'r/ reddit.com'}],
  'updated': '2021-12-02T03:56:44+00:00',
  'updated_parsed': time.struct_time(tm_year=2021, tm_mon=12, tm_mday=2, tm_hour=3, tm_min=56, tm_sec=44, tm_wday=3, tm_yday=336, tm_isdst=0),
  'id': 'https://www.reddit.com/new/.rss?sort=new',
  'guidislink': True,
  'link': 'https://www.reddit.com/new/?sort=new',
  'links': [{'rel': 'self',
    'href': 'https://www.reddit.com/new/.rss?sort=new',
    'type': 'application/atom+xml'},
   {'rel': 'alternate',
    'href': 'https://www.reddit.com/new/?sort=new',
    'type': 'text/html'}],
  'title': 'newest submissions : reddit.com',
  'title_detail': {'type': 'text/plain',
   'language': None,
   'base': 'https://www.reddit.com/new/.rss?sort=new',
   'value': 'newest submissions : reddit.com'}},
 'entries': [{'authors': [{'name': '/u/InternationalRun9832',
     'href': 'https://www.reddit.com/user/InternationalRun9832'}],
   'author_detail': {'name': '/

# 4. Write additional code modules to collect data into the bucket

Since you have created a preemptible VM, it may disappear at any time.

### ADD MORE "Raw NBConvert" Cells as needed for code you want to save
 * In other words, write code here and save it often.  Then copy it up to the VM.

### This is also part of your submitted work for this module

### <span style="background:yellow">To-Do</span>

Build on the RSS feed scraping code above to write a JSON-formatted file containing the fields shown there (title, URL, summary, date/time) each time the code runs.
Each file should get a unique name; one option is to build the file name from the run time.

In [8]:
# Sample code to create run time file name

import time
timestr = time.strftime("%Y%m%d-%H%M%S")
print(timestr)

file_name = "reddit-rss-" + timestr + ".json"
print(file_name)

20211201-215109
reddit-rss-20211201-215109.json
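
`time.strftime` formats the machine's local time, and the VM's time zone may differ from your laptop's. As a minimal sketch, assuming you prefer UTC-stamped names, the hypothetical helper `run_file_name` below takes the time as a parameter so the result is reproducible.

```python
from datetime import datetime, timezone


def run_file_name(now=None, prefix="reddit-rss-", ext=".json"):
    """Return a file name stamped with the run time in UTC (hypothetical helper)."""
    if now is None:
        now = datetime.now(timezone.utc)
    return prefix + now.strftime("%Y%m%d-%H%M%S") + ext


# A fixed time makes the result reproducible
fixed = datetime(2021, 12, 1, 21, 51, 9, tzinfo=timezone.utc)
print(run_file_name(fixed))  # reddit-rss-20211201-215109.json
```

Because the stamp runs from year down to second, the names sort chronologically in a bucket listing.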


In [17]:
# Testing writing to a json file

import json
with open(file_name, 'w') as f:
    json.dump(test_string, f, ensure_ascii=False)
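
Dumping `test_string` writes the whole report as one JSON string. If you would rather have one JSON object per post, a sketch along the following lines may help. It assumes each entry exposes `title`, `link`, `summary`, and `date`, as in the sample code above. The helper `entries_to_records` is hypothetical, and a plain dict stands in for the parsed feed so the sketch runs without network access.

```python
import json


def entries_to_records(feed):
    """Turn feed entries into a list of plain dicts (hypothetical helper).

    Assumes `feed` is a mapping with an "entries" list whose items expose
    "title", "link", "summary" and "date", as in the sample code above.
    """
    records = []
    for item in feed.get("entries", []):
        records.append({
            "title": item.get("title"),
            "url": item.get("link"),
            "summary": item.get("summary"),
            "timestamp": item.get("date"),
        })
    return records


# Made-up stand-in for a parsed feed so the sketch runs offline
sample_feed = {
    "entries": [
        {
            "title": "Example post",
            "link": "https://www.reddit.com/r/example/comments/abc123/",
            "summary": "Example summary text",
            "date": "2021-12-02T03:59:15+00:00",
        }
    ]
}

records = entries_to_records(sample_feed)
json_text = json.dumps(records, ensure_ascii=False, indent=2)
print(json_text)
```

With a real feed, the summary could first be passed through `text_from_html` to strip the HTML markup.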

### Helpful Link for Writing to the Cloud Storage

https://cloud.google.com/appengine/docs/standard/python/googlecloudstorageclient/read-write-to-cloud-storage
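
The link above describes the App Engine client. From a Compute Engine VM, one option is the separate `google-cloud-storage` package (`pip install google-cloud-storage`). The sketch below is a minimal example under that assumption. The helper `upload_json` and its `client` parameter are hypothetical, and it assumes the VM's default credentials are allowed to write to the bucket.

```python
import json


def upload_json(bucket_name, blob_name, payload, client=None):
    """Serialize `payload` to JSON and upload it as one object (hypothetical helper).

    Assumes the google-cloud-storage package is installed and that default
    credentials (for example the VM's service account) can write to the bucket.
    """
    if client is None:
        # Imported lazily so this module still loads without the package
        from google.cloud import storage
        client = storage.Client()
    blob = client.bucket(bucket_name).blob(blob_name)
    blob.upload_from_string(
        json.dumps(payload, ensure_ascii=False),
        content_type="application/json",
    )
    return "gs://{}/{}".format(bucket_name, blob_name)
```

A call might look like `upload_json("dsa_mini_project_your_pawprint", "reddit-rss-20211201-215109.json", {"posts": []})`. Passing `client` explicitly makes the function easy to exercise with a stand-in object before touching a real bucket.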

# 5. Collect data and write to storage bucket

### 5.1 Package your scraping code into a script: `data_scrape1.py`.
You can either write this locally and upload or create the script directly on VM.


### 5.2 Run the script a few times, a few minutes apart

### 5.3 Get a listing of the contents of your bucket and paste into the cell below.
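
One way to produce the listing from Python is sketched below, assuming the `google-cloud-storage` package is installed on the VM. The helper `list_bucket` and its `client` parameter are hypothetical, and the default prefix assumes your files are named `reddit-rss-...` as in the file name sample earlier.

```python
def list_bucket(bucket_name, prefix="reddit-rss-", client=None):
    """Return (name, size) pairs for objects in the bucket (hypothetical helper).

    Assumes the google-cloud-storage package is installed and that default
    credentials can read the bucket.
    """
    if client is None:
        # Imported lazily so this module still loads without the package
        from google.cloud import storage
        client = storage.Client()
    return [
        (blob.name, blob.size)
        for blob in client.list_blobs(bucket_name, prefix=prefix)
    ]
```

Printing each pair from `list_bucket("dsa_mini_project_your_pawprint")` gives text you can paste into the cell.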

#### Optionally
You can grab a screenshot of the bucket contents from the console and embed it below using
```
![my screenshot](screen_shot.png)
```
and changing the cell to _Markdown_.




### Paste Bucket listing here ... or 
# Use ![my screenshot](screen_shot.png) to embed an uploaded image named "screen_shot.png"









# You can now go to the VM console and Stop your instance.

Then you can restart it next week to continue building it up instead of starting it over.


---


# Where is this exercise going?

In the next module you will be introduced to various GCP Cloud APIs for things like Vision, Natural Language, etc.

You will be extending this scraper to utilize an API or two to process the data in the buckets.
The processing will produce analytical information that will feed into BigQuery tables, thereby facilitating analytics and visualizations!



# Save your Notebook!