# Uploading files to Cloud Storage with `gsutil`

In [3]:
!gsutil -m cp -r data/pq/ gs://dtc-zoomcamp-spark/pq

If you experience problems with multiprocessing on MacOS, they might be related to https://bugs.python.org/issue33725. You can disable multiprocessing by editing your .boto config or by adding the following flag to your command: `-o "GSUtil:parallel_process_count=1"`. Note that multithreading is still available even if you disable multiprocessing.

Copying file://data/pq/green/2021/03/part-00002-aff415c5-c661-4cce-b493-1c4345bf7216-c000.snappy.parquet [Content-Type=application/octet-stream]...
Copying file://data/pq/green/2021/03/._SUCCESS.crc [Content-Type=application/octet-stream]...
Copying file://data/pq/green/2021/03/.part-00000-aff415c5-c661-4cce-b493-1c4345bf7216-c000.snappy.parquet.crc [Content-Type=application/octet-stream]...
Copying file://data/pq/green/2021/03/.part-00003-aff415c5-c661-4cce-b493-1c4345bf7216-c000.snappy.parquet.crc [Content-Type=application/octet-stream]...
Copying file://data/pq/green/2021/03/part-00003-aff415c5-c661-4cce-b493-1c4345bf7216-c000.snappy.parq
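
The warning in the output above suggests disabling multiprocessing with `-o "GSUtil:parallel_process_count=1"`. Below is a minimal sketch that assembles the same upload command from Python with that option as a switch. The helper name `build_gsutil_cp` is hypothetical, and the command is only built here, not executed.

```python
import shlex


def build_gsutil_cp(src, dst, single_process=False):
    """Assemble a recursive, multithreaded `gsutil cp` command as an argument list.

    Hypothetical helper for illustration; it does not run gsutil.
    """
    cmd = ["gsutil", "-m"]
    if single_process:
        # Top-level option quoted from the gsutil warning: disables
        # multiprocessing while leaving multithreading (-m) available.
        cmd += ["-o", "GSUtil:parallel_process_count=1"]
    cmd += ["cp", "-r", src, dst]
    return cmd


cmd = build_gsutil_cp("data/pq/", "gs://dtc-zoomcamp-spark/pq", single_process=True)
print(shlex.join(cmd))
```

To actually run the upload, the list can be passed to `subprocess.run(cmd, check=True)`.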

# Configuring Spark with the GCS connector

## Download the corresponding version of the connector

In [5]:
!gsutil cp gs://hadoop-lib/gcs/gcs-connector-hadoop3-2.2.5.jar data/lib/gcs-connector-hadoop3-2.2.5.jar

Copying gs://hadoop-lib/gcs/gcs-connector-hadoop3-2.2.5.jar...
| [1 files][ 30.1 MiB/ 30.1 MiB]
Operation completed over 1 objects/30.1 MiB.
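
A wrong jar path in `spark.jars` may only surface later, when Spark tries to load the connector classes. As a hedged sketch, the check below fails early instead. The helper name `connector_jar_path` and its default `version` argument are assumptions for illustration; the file name pattern follows the jar downloaded above.

```python
from pathlib import Path


def connector_jar_path(lib_dir, version="2.2.5"):
    """Return the local path of the GCS connector jar, or raise if it is missing."""
    jar = Path(lib_dir) / f"gcs-connector-hadoop3-{version}.jar"
    if not jar.is_file():
        raise FileNotFoundError(f"GCS connector jar not found: {jar}")
    return str(jar)
```

With the download above in place, `connector_jar_path("data/lib")` would return the value to pass to `spark.jars`.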


In [6]:
import pyspark
from pyspark.sql import SparkSession
from pyspark.conf import SparkConf
from pyspark.context import SparkContext

In [7]:
# Path to the service-account JSON key; adjust to the location on your machine.
credentials_location = '/Users/ola/Downloads/coherent-ascent-379901-f8984bc6c655.json'

# Local Spark config: load the GCS connector jar and authenticate with the key file.
conf = SparkConf() \
    .setMaster('local[*]') \
    .setAppName('test') \
    .set("spark.jars", "./data/lib/gcs-connector-hadoop3-2.2.5.jar") \
    .set("spark.hadoop.google.cloud.auth.service.account.enable", "true") \
    .set("spark.hadoop.google.cloud.auth.service.account.json.keyfile", credentials_location)

In [8]:
sc = SparkContext(conf=conf)

# _jsc is a private PySpark handle to the underlying JavaSparkContext.
hadoop_conf = sc._jsc.hadoopConfiguration()

# Register the connector classes for the gs:// scheme and point them at the key file.
hadoop_conf.set("fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
hadoop_conf.set("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
hadoop_conf.set("fs.gs.auth.service.account.json.keyfile", credentials_location)
hadoop_conf.set("fs.gs.auth.service.account.enable", "true")

24/02/29 22:16:14 WARN Utils: Your hostname, oladeMacBook-Pro.local resolves to a loopback address: 127.0.0.1; using 192.168.0.102 instead (on interface en0)
24/02/29 22:16:14 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
24/02/29 22:16:14 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/02/29 22:16:29 WARN GarbageCollectionMetrics: To enable non-built-in garbage collector(s) List(G1 Concurrent GC), users should configure it(them) to spark.eventLog.gcMetrics.youngGenerationGarbageCollectors or spark.eventLog.gcMetrics.oldGenerationGarbageCollectors
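
The four `hadoop_conf.set(...)` calls above can also be kept as data and applied in a loop, which makes the configuration easier to inspect or reuse. This is only a sketch: the function names `gcs_hadoop_settings` and `apply_settings` are hypothetical, while the property keys and class names are copied from the cell above.

```python
def gcs_hadoop_settings(keyfile):
    """Hadoop properties the notebook sets for the gs:// scheme, as a dict."""
    return {
        "fs.AbstractFileSystem.gs.impl": "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS",
        "fs.gs.impl": "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem",
        "fs.gs.auth.service.account.json.keyfile": keyfile,
        "fs.gs.auth.service.account.enable": "true",
    }


def apply_settings(hadoop_conf, settings):
    """Call hadoop_conf.set(key, value) for every entry in settings."""
    for key, value in settings.items():
        hadoop_conf.set(key, value)
```

In the notebook this would be used as `apply_settings(sc._jsc.hadoopConfiguration(), gcs_hadoop_settings(credentials_location))`.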


In [9]:
spark = SparkSession.builder \
    .config(conf=sc.getConf()) \
    .getOrCreate()

In [10]:
# The */* glob matches the year/month subdirectories under green/.
df_green = spark.read.parquet('gs://dtc-zoomcamp-spark/pq/green/*/*')

In [11]:
df_green.show()

+--------+--------------------+---------------------+------------------+----------+------------+------------+---------------+-------------+-----------+-----+-------+----------+------------+---------+---------------------+------------+------------+---------+--------------------+
|VendorID|lpep_pickup_datetime|lpep_dropoff_datetime|store_and_fwd_flag|RatecodeID|PULocationID|DOLocationID|passenger_count|trip_distance|fare_amount|extra|mta_tax|tip_amount|tolls_amount|ehail_fee|improvement_surcharge|total_amount|payment_type|trip_type|congestion_surcharge|
+--------+--------------------+---------------------+------------------+----------+------------+------------+---------------+-------------+-----------+-----+-------+----------+------------+---------+---------------------+------------+------------+---------+--------------------+
|       2| 2020-01-14 13:28:09|  2020-01-14 13:35:19|                 N|         1|          74|          75|              1|         1.35|        7.0|  0.0|    0.

In [12]:
df_green.count()

2304517