# Final Project Notebook

### NOTE: If you build and run a notebook in the cloud, just copy it down in place of this one!  
#### Be sure to have all your output captured within the notebook!
#### <span style="background:yellow">Be sure to save your work early and often!</span>

![Specific_Project_1.png MISSING](../images/Specific_Project_1.png)

# Add code as needed in the cells below to produce your analytical products

In [None]:
## setting up postgres database in GCP
gcloud sql connect finalproject-kg37m --user=postgres

CREATE DATABASE reddit;

\connect reddit;

CREATE TABLE entries (item VARCHAR(500), date TIMESTAMP, title VARCHAR(250), summary TEXT, link VARCHAR(500), sentiment_score VARCHAR(50), sentiment_magnitude FLOAT);

In [None]:
## setting up VM machine
sudo apt update && sudo apt install python3 python3-dev python3-venv
sudo apt install wget
wget https://bootstrap.pypa.io/get-pip.py
sudo python3 get-pip.py
mkdir testproj
cd testproj
python3 -m venv env
source env/bin/activate
pip3 install --upgrade google-cloud-storage google-cloud-language
pip3 install feedparser beautifulsoup4
vim pythonscraper.py

In [None]:
## scrape Reddit RSS into JSON files
import json
import random
import logging
import os
import feedparser
from bs4 import BeautifulSoup
from bs4.element import Comment
from google.cloud import storage

a_reddit_rss_url = 'http://www.reddit.com/new/.rss?sort=new'

def tag_visible(element):
    if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
        return False
    if isinstance(element, Comment):
        return False
    return True

def text_from_html(body):
    # strip the HTML markup and return only the text a reader would see
    soup = BeautifulSoup(body, 'html.parser')
    texts = soup.find_all(string=True)
    visible_texts = filter(tag_visible, texts)
    return " ".join(t.strip() for t in visible_texts)

project = 'UMC DSA 8420 FS2021'
bucket = 'reddit-kg37m'
feed = feedparser.parse(a_reddit_rss_url)

dict_keys = ['dttm', 'title', 'summary_text', 'link']
temp_dict = dict()
for key in dict_keys:
    temp_dict[key] = []
    
if (feed['bozo'] == 1):
    print("Error Reading/Parsing Feed XML Data")

else:
    for item in feed[ "items" ]:
        temp_dict['dttm'].append(item[ "date" ])
        temp_dict['title'].append(item[ "title" ])
        temp_dict['summary_text'].append(text_from_html(item[ "summary" ]))
        temp_dict['link'].append(item[ "link" ])

reddit_string = json.dumps(temp_dict)

filename = 'file'+str(random.randint(1,1000))+'.json'

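# 'bucket' holds only the bucket name, so look up a Bucket object through a client first
storage_client = storage.Client(project=project)
blob = storage_client.bucket(bucket).blob(filename)

blob.upload_from_string(reddit_string)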
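
`random.randint(1,1000)` allows only 1000 distinct file names, so repeated scraper runs can overwrite earlier uploads in the bucket. One possible alternative, sketched below, builds the name from the UTC time plus a short random suffix. The helper `make_filename` and the name layout are assumptions for illustration, not part of the original script.

```python
import uuid
from datetime import datetime, timezone

def make_filename(prefix='file'):
    """Build a JSON file name from the current UTC time plus a random suffix."""
    stamp = datetime.now(timezone.utc).strftime('%Y%m%dT%H%M%S')
    suffix = uuid.uuid4().hex[:8]  # 8 random hex characters
    return f'{prefix}-{stamp}-{suffix}.json'

filename = make_filename()
print(filename)
```

The timestamp also keeps the files in scrape order when the bucket is listed by name.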

In [None]:
## more VM setup
sudo apt-get install apt-transport-https ca-certificates gnupg
echo "deb https://packages.cloud.google.com/apt cloud-sdk main" | sudo tee -a /etc/apt/sources.list.d/google-cloud-sdk.list
curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo apt-key add -
sudo apt-get update && sudo apt-get install google-cloud-sdk
gcloud init
wget https://dl.google.com/dl/cloudsdk/channels/rapid/downloads/google-cloud-sdk-182.0.0-linux-x86_64.tar.gz
gcloud auth application-default login

In [None]:
## process files from the Storage Bucket with Natural Language API
from google.cloud import storage
from google.cloud import language_v1
import json
import csv

project = 'UMC DSA 8420 FS2021'
bucket = 'reddit-kg37m'

def get_blob(bucket_name, blob_name):
    # download one object from the bucket and return its contents as text
    storage_client = storage.Client(project=project)
    blob = storage_client.get_bucket(bucket_name).get_blob(blob_name)
    return blob.download_as_bytes().decode('utf-8')

def list_blobs(bucket_name):
    storage_client = storage.Client(project=project)
    blobs = storage_client.get_bucket(bucket_name).list_blobs()
    return blobs

client = language_v1.LanguageServiceClient()

for blob in list_blobs(bucket):
    text = get_blob(bucket, blob.name)
    document = language_v1.Document(content=text, type_=language_v1.Document.Type.PLAIN_TEXT)
    sentiment = client.analyze_sentiment(request={"document": document}).document_sentiment
    # each file is scored as a single document, so all of its entries share one score
    record = json.loads(text)
    record['sentiment_score'] = sentiment.score
    record['sentiment_magnitude'] = sentiment.magnitude
    # keep a local copy named after the blob, with the sentiment values added
    with open(blob.name, 'w') as file:
        json.dump(record, file)
    
## put files together for upload
# the local copies are named after the blobs, so collect the blob names
files = [blob.name for blob in list_blobs(bucket)]

def merge_JsonFiles(filenames):
    result = list()
    for f1 in filenames:
        with open(f1, 'r') as infile:
            result.append(json.load(infile))

    with open('allblobs.json', 'w') as output_file:
        json.dump(result, output_file)

merge_JsonFiles(files)

with open('allblobs.json') as json_file:
    data = json.load(json_file)
 
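# one CSV row per feed entry; every entry in a file shares that file's sentiment values
header = ['dttm', 'title', 'summary_text', 'link', 'sentiment_score', 'sentiment_magnitude']

with open('data_file.csv', 'w', newline='') as csv_file:
    # create the csv writer object on the output file
    csv_writer = csv.writer(csv_file)
    csv_writer.writerow(header)

    for entry in data:
        # each file stores parallel lists, so zip them back into per-entry rows
        rows = zip(entry['dttm'], entry['title'], entry['summary_text'], entry['link'])
        for dttm, title, summary_text, link in rows:
            csv_writer.writerow([dttm, title, summary_text, link,
                                 entry.get('sentiment_score'),
                                 entry.get('sentiment_magnitude')])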

In [None]:
## populate database with files
gcloud sql connect finalproject-kg37m --user=postgres

\connect reddit;

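-- \copy reads the file from the client machine; columns are listed in CSV order, item stays NULL
\copy entries (date, title, summary, link, sentiment_score, sentiment_magnitude) FROM 'data_file.csv' DELIMITER ',' CSV HEADER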

In [None]:
## connect to postgres instance
wget http://ipv4.whatismyv6.com/ -O getip
grep -a1 "Address of" getip | grep '[0-9]\{1,3\}\.[0-9]\{1,3\}\.[0-9]\{1,3\}\.[0-9]\{1,3\}'

psql -h 35.223.210.188 -U postgres postgres

In [None]:
import getpass
#mypasswd = 'PASSWORD'
mypasswd = getpass.getpass()

In [None]:
## pull data in tabular format
import psycopg2
import pandas as pd
connection = psycopg2.connect(database = 'reddit', 
                              user = 'postgres', 
                              host = '35.223.210.188', 
                              password = mypasswd)
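with connection, connection.cursor() as cursor:
    cursor.execute("SELECT * FROM entries")
    results = cursor.fetchall()
    # name the DataFrame columns after the table columns so they can be selected by name
    columns = [desc[0] for desc in cursor.description]
    df = pd.DataFrame(results, columns=columns)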
    
df.head()

In [None]:
## tabular data analysis
df['sentiment_score'].value_counts()

In [None]:
## display data visualization
import matplotlib.pyplot as plt

# the timestamp column is named 'date' in the entries table
df = df.sort_values('date')

plt.plot(df['date'], df['sentiment_magnitude'])
plt.show()

## Preparing your submission

### Deliverables: 
   1. This or a replacement Notebook
   1. An aggregation of data in tabular format that conveys something interesting about the Reddit RSS feed during your scraping.
     * The table can be embedded or uploaded into this folder (CSV or Excel)
   1. One or more data visualizations
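
As one illustration of the tabular deliverable, the sketch below groups entries by the hour they were posted and reports the entry count and mean sentiment magnitude per hour. The sample rows and the helper `magnitude_by_hour` are hypothetical; in the notebook the rows would come from the `entries` table.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean

# Hypothetical sample rows in the shape (date, sentiment_magnitude).
rows = [
    (datetime(2021, 12, 1, 9, 5), 0.4),
    (datetime(2021, 12, 1, 9, 40), 0.8),
    (datetime(2021, 12, 1, 10, 15), 1.2),
]

def magnitude_by_hour(rows):
    """Group rows by posting hour and average the sentiment magnitude."""
    buckets = defaultdict(list)
    for posted_at, magnitude in rows:
        hour = posted_at.replace(minute=0, second=0, microsecond=0)
        buckets[hour].append(magnitude)
    # one summary row per hour: (hour, number of entries, mean magnitude)
    return [(hour, len(vals), mean(vals)) for hour, vals in sorted(buckets.items())]

summary = magnitude_by_hour(rows)
for hour, count, avg in summary:
    print(hour.isoformat(), count, round(avg, 2))
```

The same grouping can be done on the notebook's DataFrame with pandas `groupby`; the standard-library version is shown only to keep the sketch dependency-free.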

Embed your image into this page by saving your data visualization as: `FINAL_PROJECT_IMAGE.png`  
Upload it to the `module8/exercises/` folder.

If you need to, change the file type to `.jpg` or `.jpeg` or ... whatever, then update the link in this cell (double click to edit).  
Then re-run this markdown cell to see it.
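
A minimal sketch of saving a plot under the expected file name, assuming matplotlib is available as in the visualization cell. The sample values are hypothetical stand-ins for the query results.

```python
import matplotlib.pyplot as plt

# Hypothetical sample values standing in for the notebook's query results.
timestamps = [1, 2, 3, 4]
magnitudes = [0.4, 0.8, 1.2, 0.6]

fig, ax = plt.subplots()
ax.plot(timestamps, magnitudes)
ax.set_xlabel('entry time')
ax.set_ylabel('sentiment magnitude')

# savefig writes the figure to disk; the file extension selects the image format
fig.savefig('FINAL_PROJECT_IMAGE.png')
```

Saving through the figure object rather than after `plt.show()` avoids writing an empty image on backends that clear the figure once it has been shown.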

![FINAL_PROJECT_IMAGE.png MISSING](./exercises/FINAL_PROJECT_IMAGE.png)

---
## Summarize in the fields below
 1. Describe the overall process and components you used for the project.
 2. What is the key insight from the tabularization?
 3. What is the key insight from the visualization?


# Save your Notebook!