# Google Cloud Tutorial - PySpark and DataProc

# Setup

We assume that the Google Cloud SDK is installed ([tutorial](https://cloud.google.com/sdk/docs/install)). After creating a project (i.e., gcloud projects create PROJECT_ID; the commands below use the project ID dataproc-334718), we create a cluster in Dataproc:

!gcloud dataproc clusters create CLUSTER_NAME (e.g., ex-dataproc)
* --enable-component-gateway (enables access to the web interfaces of default and selected optional components, e.g., Jupyter, on the cluster)
* --region REGION_NAME --zone ZONE_NAME
* --master-machine-type MACHINE_TYPE (e.g., n1-standard-4)
* --num-workers 2 (i.e., 2 worker machines)
* --optional-components JUPYTER (several other options: Anaconda, Docker, Solr)
* --project PROJECT_ID (e.g., dataproc-334718)


In [None]:
!gcloud dataproc clusters create ex-dataproc --enable-component-gateway --region us-central1 --zone us-central1-c --master-machine-type n1-standard-4 --master-boot-disk-size 500 --num-workers 2 --worker-machine-type n1-standard-4 --worker-boot-disk-size 500 --image-version 2.0-debian10 --optional-components JUPYTER --project dataproc-334718
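
The long command above is easy to mistype, so it can help to assemble it from named parameters. The following is a minimal sketch that only builds and prints the command without running it; `build_create_cluster_command` is a hypothetical helper name, and the default values are simply the ones used in this tutorial.

```python
import shlex


def build_create_cluster_command(
    cluster,
    project,
    region,
    zone,
    num_workers=2,
    machine_type="n1-standard-4",
    boot_disk_gb=500,
    image_version="2.0-debian10",
    optional_components=("JUPYTER",),
):
    """Return the gcloud argument list for creating a Dataproc cluster."""
    return [
        "gcloud", "dataproc", "clusters", "create", cluster,
        "--enable-component-gateway",
        "--region", region,
        "--zone", zone,
        "--master-machine-type", machine_type,
        "--master-boot-disk-size", str(boot_disk_gb),
        "--num-workers", str(num_workers),
        "--worker-machine-type", machine_type,
        "--worker-boot-disk-size", str(boot_disk_gb),
        "--image-version", image_version,
        "--optional-components", ",".join(optional_components),
        "--project", project,
    ]


cmd = build_create_cluster_command(
    cluster="ex-dataproc",
    project="dataproc-334718",
    region="us-central1",
    zone="us-central1-c",
)
# Print a copy-pasteable shell command; nothing is executed here.
print(shlex.join(cmd))
```

To actually create the cluster from Python, the list could be passed to `subprocess.run(cmd, check=True)`, assuming `gcloud` is on the PATH and authenticated.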

After the cluster is created, there is one master machine (ex-dataproc-m) and two workers (ex-dataproc-w-0 and ex-dataproc-w-1). We can open an SSH tunnel on a local port (e.g., 1080) that acts as a SOCKS proxy, and then point Chrome at that proxy to reach a web interface on the cluster (here, the YARN ResourceManager UI on port 8088).

In [None]:
!gcloud compute ssh ex-dataproc-m --project=dataproc-334718 --zone=us-central1-c -- -D 1080 -N

"/Applications/Google Chrome.app/Contents/MacOS/Google Chrome" --proxy-server="socks5://localhost:1080" --user-data-dir="/tmp/ex-dataproc-m" http://ex-dataproc-m:8088
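
Chrome can only reach the cluster UI if the tunnel from the previous command is already up. The following is a minimal sketch, assuming the tunnel listens on localhost port 1080 as above, that checks the local port before the browser is launched; `tunnel_is_listening` is a hypothetical helper name and is not part of gcloud.

```python
import socket


def tunnel_is_listening(host="localhost", port=1080, timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False


if tunnel_is_listening():
    print("Tunnel is up; Chrome can use socks5://localhost:1080")
else:
    print("Nothing is listening on port 1080; start the SSH tunnel first")
```

This only confirms that the port accepts connections, not that the process behind it is the SSH tunnel.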

We create a storage bucket (e.g., ex-dataproc-bucket) to hold all of our data (e.g., CSV files); see this [tutorial](https://cloud.google.com/storage/docs/creating-buckets#storage-create-bucket-console). The general form of the command is:

gsutil mb -p PROJECT_ID -c STORAGE_CLASS -l BUCKET_LOCATION -b on gs://BUCKET_NAME

In [None]:
!gsutil mb -p dataproc-334718 -b on gs://ex-dataproc-bucket

# Upload data

We use the [dataset](https://www.kaggle.com/c/nlp-getting-started) from the Kaggle competition "Natural Language Processing with Disaster Tweets".

In [None]:
!gsutil cp nlpDisasterTweets.csv gs://ex-dataproc-bucket

Note that if the data is split across multiple files (e.g., data-1.csv, data-2.csv), you can put all of these files into one folder in the bucket (e.g., /data). The location of the dataset is then "gs://ex-dataproc-bucket/data".
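
Spark loads every file in a directory when the path passed to `load` is a folder, so the multi-file case needs no loop. The following is a minimal sketch, assuming a `spark` session already exists, as it does in a Dataproc Jupyter notebook. `gcs_uri` and `read_csv_folder` are hypothetical helper names, and the reader options mirror the ones used for the single file in this tutorial.

```python
def gcs_uri(bucket, *parts):
    """Join a bucket name and path components into a gs:// URI."""
    path = "/".join(p.strip("/") for p in parts if p.strip("/"))
    return f"gs://{bucket}/{path}" if path else f"gs://{bucket}"


def read_csv_folder(spark, folder_uri):
    """Load all CSV files under folder_uri into a single DataFrame.

    `spark` is an existing SparkSession; it is passed in rather than
    created here so the function works inside a notebook as-is.
    """
    return (
        spark.read.format("csv")
        .options(header="true", inferSchema="true", multiLine=True)
        .load(folder_uri)
    )


# In a Dataproc notebook, where `spark` is predefined:
# data = read_csv_folder(spark, gcs_uri("ex-dataproc-bucket", "data"))
print(gcs_uri("ex-dataproc-bucket", "data"))  # gs://ex-dataproc-bucket/data
```

All files in the folder are assumed to share the same header and column layout.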

## Works like a charm

In [None]:
data = spark.read.format('csv').options(header='true', inferSchema='true', multiLine=True).load("gs://ex-dataproc-bucket/nlpDisasterTweets.csv")
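
The `multiLine=True` option matters for free-form text such as tweets, where a quoted field may contain line breaks. The sketch below uses Python's standard `csv` module, not Spark, and a made-up two-record sample to show the difference between counting physical lines and parsing quoted fields. Spark's CSV reader with `multiLine` enabled treats quoted line breaks in a comparable way.

```python
import csv
import io

# Hypothetical sample: the second record has a line break inside a
# quoted field, as free-form text often does.
sample = (
    "id,text,target\n"
    '1,"Forest fire near town",1\n'
    '2,"Line one\nline two",0\n'
)

# Naive approach: treat every physical line after the header as a record.
naive_rows = sample.strip().split("\n")[1:]

# Quote-aware parsing: a quoted field may span several physical lines.
parsed_rows = list(csv.DictReader(io.StringIO(sample)))

print(len(naive_rows))   # 3
print(len(parsed_rows))  # 2
```

If the row count reported by Spark looks too high for a text dataset, a missing `multiLine` option is one plausible cause worth checking.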

In [None]:
print('Number of rows in Data:', data.count())

Number of rows in Data: 7613

In [None]:
data.show(5)

+---+-------+--------+--------------------+------+
| id|keyword|location|                text|target|
+---+-------+--------+--------------------+------+
|  1|   null|    null|Our Deeds are the...|     1|
|  4|   null|    null|Forest fire near ...|     1|
|  5|   null|    null|All residents ask...|     1|
|  6|   null|    null|13,000 people rec...|     1|
|  7|   null|    null|Just got sent thi...|     1|
+---+-------+--------+--------------------+------+
only showing top 5 rows

