<a href="https://colab.research.google.com/github/msaadsadiq/BigDataCourse/blob/master/Assignment_3_PySpark.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Enabling APIs

# ECE 795 - Big Data
## Assignment #3 - Twitter Sentiment Analysis using Distributed Computing on Dataproc

### Provide your credentials to the runtime

In [0]:
# Authenticate your student profile

from google.colab import auth
auth.authenticate_user()
print('Authenticated')

Authenticated


### Set the Project ID and Enable APIs

In [0]:
project_id = 'ece795'

In [0]:
from google.cloud import 

#### GCP offers many services, including Compute Engine, Cloud Storage, BigQuery, Cloud SQL, and Cloud Dataproc. Before you can use any of them in your project, you must first enable it.

![alt text](https://cdn-images-1.medium.com/max/1600/1*rYZZH8w9iScxIXG27qG-ww.png)

#### Hover over “APIs & Services” in the left-side menu, then click “Library”. For this project, enable three APIs: Cloud Dataproc, Compute Engine, and Cloud Storage.

![alt text](https://cdn-images-1.medium.com/max/1600/1*qH5u_JSH2JLZW_SQTcetSQ.png)
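The same three APIs can also be enabled from the command line with `gcloud services enable`. The sketch below only builds and prints the command; the service identifiers are assumptions and should be checked against the API Library, and the project ID is the one set earlier in this notebook.

```python
import shlex

# Assumed service identifiers for the three APIs named above; confirm them
# in the API Library before running the command.
SERVICES = [
    "dataproc.googleapis.com",  # Cloud Dataproc
    "compute.googleapis.com",   # Compute Engine
    "storage.googleapis.com",   # Cloud Storage (assumed identifier)
]


def build_enable_command(project_id, services=SERVICES):
    """Return the gcloud argument list that enables the given services."""
    return ["gcloud", "services", "enable", *services, "--project", project_id]


command = build_enable_command("ece795")
print(shlex.join(command))
# To run it from the notebook (requires an authenticated gcloud):
# import subprocess; subprocess.check_call(command)
```

Building the argument list separately from running it makes it easy to inspect the command before it touches the project.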

### Running Example 1: Word Count

#### This word count example is similar to the one introduced earlier and uses the Shakespeare dataset in BigQuery. The only difference is that it uses PySpark, the Python API for Spark, instead of Hadoop.

Step 1: Create the output table in BigQuery. We need a table to store the output of our MapReduce procedure.
- Select your project or create a new one, and remember to enable billing
- Go to BigQuery
- Create a dataset, then a table using the following schema

![alt text](https://)
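As a rough guide, a schema for the output table can be inferred from the columns the word count script writes (`word` and `word_count`). The field types and modes below are assumptions based on that script, not a copy of the original schema, and the dataset and table names are taken from the script's output parameters. The sketch prints the schema as JSON together with equivalent `bq mk` commands.

```python
import json

# Assumed schema, inferred from the columns the job writes
# (word, word_count); types and modes are a guess, not the original schema.
SCHEMA = [
    {"name": "word", "type": "STRING", "mode": "NULLABLE"},
    {"name": "word_count", "type": "INTEGER", "mode": "NULLABLE"},
]


def inline_schema(fields):
    """Format fields as the name:TYPE,name:TYPE string the bq CLI accepts."""
    return ",".join("{}:{}".format(f["name"], f["type"]) for f in fields)


dataset, table = "wordcount_dataset", "wordcount_output"
print(json.dumps(SCHEMA, indent=2))
print("bq mk --dataset {}".format(dataset))
print("bq mk --table {}.{} {}".format(dataset, table, inline_schema(SCHEMA)))
```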

### Running Example 2: Twitter Sentiment

#### Get the Data

In [0]:
!wget http://cs.stanford.edu/people/alecmgo/trainingandtestdata.zip
!unzip trainingandtestdata.zip   
!rm trainingandtestdata.zip.1 test*.csv


URL transformed to HTTPS due to an HSTS policy
--2019-03-04 22:47:26--  https://cs.stanford.edu/people/alecmgo/trainingandtestdata.zip
Resolving cs.stanford.edu (cs.stanford.edu)... 171.64.64.64
Connecting to cs.stanford.edu (cs.stanford.edu)|171.64.64.64|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 81363704 (78M) [application/zip]
Saving to: ‘trainingandtestdata.zip.1’


2019-03-04 22:47:28 (68.3 MB/s) - ‘trainingandtestdata.zip.1’ saved [81363704/81363704]

Archive:  trainingandtestdata.zip
  inflating: testdata.manual.2009.06.14.csv  
  inflating: training.1600000.processed.noemoticon.csv  


In [0]:
import pandas as pd
import numpy as np

# Column names for the training CSV, which has no header row
cols = ['sentiment', 'id', 'date', 'query_string', 'user', 'text']

def main():
    # Seed NumPy first so both the shuffle and the split are reproducible
    np.random.seed(777)
    # Read the training data with ISO-8859-1 encoding and the column names above
    df = pd.read_csv('training.1600000.processed.noemoticon.csv',
                     encoding='ISO-8859-1', names=cols)
    # Shuffle the rows
    df = df.sample(frac=1).reset_index(drop=True)
    # Split train and test at roughly a 99:1 ratio using a random mask
    msk = np.random.rand(len(df)) < 0.99
    train = df[msk].reset_index(drop=True)
    test = df[~msk].reset_index(drop=True)
    # Save both splits as CSV files (the row index becomes the first column)
    train.to_csv('pyspark_sa_train_data.csv')
    test.to_csv('pyspark_sa_test_data.csv')

# Run the split; without this call the cell only defines main()
main()
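
The mask-based split above draws one uniform random number per row and sends the row to the training set when the draw is below 0.99, so the 99:1 ratio holds only approximately. A minimal sketch of the same idea on a small stand-in list, using the standard library's `random.Random` in place of `np.random.rand` so it runs without pandas or NumPy:

```python
import random


def split_by_mask(rows, train_fraction=0.99, seed=777):
    """Assign each row to train or test with an independent random draw."""
    rng = random.Random(seed)  # seeded so the split is repeatable
    train, test = [], []
    for row in rows:
        # Mirrors `np.random.rand(len(df)) < 0.99`: one uniform draw per row.
        (train if rng.random() < train_fraction else test).append(row)
    return train, test


rows = list(range(10000))  # stand-in for the 1.6M tweet rows
train, test = split_by_mask(rows)
print(len(train), len(test))
```

Calling `split_by_mask` again with the same seed returns the same partition, which is the property the fixed seed is meant to provide.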

In [0]:
#!/usr/bin/python
"""BigQuery I/O PySpark example."""
from __future__ import absolute_import
import json
import pprint
import subprocess
import pyspark
from pyspark.sql import SQLContext

sc = pyspark.SparkContext()

# Use the Cloud Storage bucket for temporary BigQuery export data used
# by the InputFormat. This assumes the Cloud Storage connector for
# Hadoop is configured.
bucket = sc._jsc.hadoopConfiguration().get('fs.gs.system.bucket')
project = sc._jsc.hadoopConfiguration().get('fs.gs.project.id')
input_directory = 'gs://{}/hadoop/tmp/bigquery/pyspark_input'.format(bucket)

conf = {
    # Input Parameters.
    'mapred.bq.project.id': project,
    'mapred.bq.gcs.bucket': bucket,
    'mapred.bq.temp.gcs.path': input_directory,
    'mapred.bq.input.project.id': 'publicdata',
    'mapred.bq.input.dataset.id': 'samples',
    'mapred.bq.input.table.id': 'shakespeare',
}

# Output Parameters.
output_dataset = 'wordcount_dataset'
output_table = 'wordcount_output'

# Load data in from BigQuery.
table_data = sc.newAPIHadoopRDD(
    'com.google.cloud.hadoop.io.bigquery.JsonTextBigQueryInputFormat',
    'org.apache.hadoop.io.LongWritable',
    'com.google.gson.JsonObject',
    conf=conf)

# Perform word count.
word_counts = (
    table_data
    .map(lambda record: json.loads(record[1]))
    .map(lambda x: (x['word'].lower(), int(x['word_count'])))
    .reduceByKey(lambda x, y: x + y))

# Display 10 results.
pprint.pprint(word_counts.take(10))

# Stage data formatted as newline-delimited JSON in Cloud Storage.
output_directory = 'gs://{}/hadoop/tmp/bigquery/pyspark_output'.format(bucket)
output_files = output_directory + '/part-*'

# Creating the SQLContext makes toDF() available on RDDs.
sql_context = SQLContext(sc)
(word_counts
 .toDF(['word', 'word_count'])
 .write.format('json').save(output_directory))

# Shell out to bq CLI to perform BigQuery import.
subprocess.check_call(
    'bq load --source_format NEWLINE_DELIMITED_JSON '
    '--replace '
    '--autodetect '
    '{dataset}.{table} {files}'.format(
        dataset=output_dataset, table=output_table, files=output_files
    ).split())

# Manually clean up the staging directories; otherwise the temporary
# BigQuery files will remain in Cloud Storage indefinitely.
input_path = sc._jvm.org.apache.hadoop.fs.Path(input_directory)
input_path.getFileSystem(sc._jsc.hadoopConfiguration()).delete(input_path, True)
output_path = sc._jvm.org.apache.hadoop.fs.Path(output_directory)
output_path.getFileSystem(sc._jsc.hadoopConfiguration()).delete(
    output_path, True)
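
To see what the `map` / `map` / `reduceByKey` chain in the script computes, the following sketch reproduces it locally in plain Python. The records are hypothetical stand-ins shaped like the (key, JSON string) pairs the BigQuery input format yields; no Spark cluster is involved.

```python
import json
from collections import defaultdict

# Hypothetical records shaped like the (key, JSON string) pairs the
# BigQuery input format yields; the values here are made up.
records = [
    (0, '{"word": "The", "word_count": "3"}'),
    (1, '{"word": "the", "word_count": "2"}'),
    (2, '{"word": "Crown", "word_count": "1"}'),
]


def count_words(pairs):
    """Local equivalent of the map -> map -> reduceByKey chain."""
    totals = defaultdict(int)
    for _, payload in pairs:
        row = json.loads(payload)            # first map: parse the JSON
        key = row["word"].lower()            # second map: normalize the word
        value = int(row["word_count"])       # second map: cast the count
        totals[key] += value                 # reduceByKey(lambda x, y: x + y)
    return dict(totals)


print(count_words(records))  # {'the': 5, 'crown': 1}
```

Lowercasing before the reduction is what merges "The" and "the" into a single key.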

### Question 1. 

1. Crawl 10,000 tweets about any topic you like (English language; it must be a divisive topic, i.e. one with more than one point of view)
2. Upload them to a storage bucket
3. Using PySpark, run sentiment analysis on those tweets with the model you trained earlier

### Question 2. 

1. Perform a word count on your tweets
2. Perform a word count on the different stances
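
One possible shape for the per-stance count, sketched locally in plain Python rather than PySpark. The stance labels, the sample tweets, and the simple regular-expression tokenizer are all assumptions for illustration; in the assignment the labels would come from your sentiment model's predictions.

```python
import re
from collections import Counter, defaultdict

# Hypothetical (stance, text) pairs; labels and tweets are invented for
# illustration only.
labelled_tweets = [
    ("positive", "Love the new policy, love it"),
    ("negative", "The new policy is a mistake"),
    ("positive", "Great policy"),
]


def tokenize(text):
    """Lowercase the text and keep runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def word_counts_by_stance(pairs):
    """Return {stance: Counter of words} for the given labelled tweets."""
    counts = defaultdict(Counter)
    for stance, text in pairs:
        counts[stance].update(tokenize(text))
    return counts


counts = word_counts_by_stance(labelled_tweets)
print(counts["positive"].most_common(2))
```

In PySpark the same grouping could be expressed by keying each word on a (stance, word) pair before reducing, in the style of the word count script above.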