<a href="https://colab.research.google.com/github/lucprosa/dataeng-basic-course/blob/main/spark_streaming/dataproc/producer_collab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Reading/Writing data to Google Cloud Storage

- Authenticate to Google Cloud
- Install gcsfs
- Install gcsfuse and mount the bucket as a file system
- Read data from the bucket
- Write data to the bucket


In [1]:
from google.colab import auth
auth.authenticate_user()

project_id = 'data-eng-dev-437916'
!gcloud config set project {project_id}

Updated property [core/project].


In [2]:
!pip install gcsfs



In [3]:
!echo "deb http://packages.cloud.google.com/apt gcsfuse-bionic main" > /etc/apt/sources.list.d/gcsfuse.list
!curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key add -
!apt -qq update
!apt -qq install gcsfuse

OK
40 packages can be upgraded. Run 'apt list --upgradable' to see them.
W: http://packages.cloud.google.com/apt/dists/gcsfuse-bionic/InRelease: Key is stored in legacy trusted.gpg keyring (/etc/apt/trusted.gpg), see the DEPRECATION section in apt-key(8) for details.
W: Skipping acquire of configured file 'main/source/Sources' as repository 'https://r2u.stat.illinois.edu/ubuntu jammy InRelease' does not seem to provide it (sources.list entry misspelt?)
The following NEW packages will be installed:
  gcsfuse
0 upgraded, 1 newly installed, 0 to remove and 40 not upgraded.
Need to get 14.8 MB of archives.
After this operation, 0 B of additional dis

In [5]:
!mkdir edit-data-eng-dev
!gcsfuse edit-data-eng-dev edit-data-eng-dev

{"timestamp":{"seconds":1750434234,"nanos":191703237},"severity":"INFO","message":"Running gcsfuse/3.0.0 (Go version go1.24.0)"}
{"timestamp":{"seconds":1750434234,"nanos":197459282},"severity":"INFO","message":"Start gcsfuse/3.0.0 (Go version go1.24.0) for app \"\" using mount point: /content/edit-data-eng-dev\n"}
{"timestamp":{"seconds":1750434234,"nanos":197506393},"severity":"INFO","message":"GCSFuse config","config":{"AppName":"","CacheDir":"","Debug":{"ExitOnInvariantViolation":false,"Fuse":false,"Gcs":false,"LogMutex":false},"DisableAutoconfig":false,"EnableAtomicRenameObject":false,"EnableHns":true,"EnableNewReader":false,"FileCache":{"CacheFileForRangeRead":false,"DownloadChunkSizeMb":200,"EnableCrc":false,"EnableODirect":false,"EnableParallelDownloads":false,"ExperimentalParallelDownloadsDefaultOn":true,"MaxParallelDownloads":16,"MaxSizeMb":-1,"ParallelDownloadsPerFile":16,"WriteBufferSize":4194304},"FileSystem":{"DirMode":"755","DisableParallelDirops":false,"FileMode":"644",

In [7]:
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .appName("ColabGCS") \
    .config("spark.hadoop.google.cloud.auth.service.account.enable", "true") \
    .config("spark.hadoop.fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem") \
    .config("spark.hadoop.fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS") \
    .getOrCreate()

In [8]:
import os
# Set JAVA_HOME; on a fresh runtime, run this before the cell that creates the SparkSession
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-11-openjdk-amd64"
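
If the JDK directory is missing on the runtime, PySpark typically fails only when the session is created, with an error that does not mention the path. A small sketch that checks the directory before exporting the variable; the default path is the one used above, and `set_java_home` is a hypothetical helper name, not part of PySpark.

```python
import os


def set_java_home(path: str = "/usr/lib/jvm/java-11-openjdk-amd64") -> bool:
    """Export JAVA_HOME only if `path` is an existing directory (hypothetical helper)."""
    if not os.path.isdir(path):
        print(f"JDK directory not found: {path}; JAVA_HOME left unchanged")
        return False
    os.environ["JAVA_HOME"] = path
    return True
```

Calling `set_java_home()` before building the session gives an early, readable message when the JDK is not where the notebook expects it.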

In [16]:

# define paths
bucket_name = "edit-data-eng-dev"
lake_path = "datalake/bronze"
table_path = "basic_pays"
final_path = f"gs://{bucket_name}/{lake_path}/{table_path}"
final_path

'gs://edit-data-eng-dev/datalake/bronze/basic_pays'
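
The read that follows goes through the gcsfuse mount rather than through `final_path`. A minimal sketch of mapping one to the other, assuming the bucket is mounted at `/content/<bucket_name>` as in the `gcsfuse` cell above; `gs_to_mount_path` and its `mount_root` parameter are hypothetical names used only for illustration.

```python
from pathlib import PurePosixPath


def gs_to_mount_path(gs_uri: str, mount_root: str = "/content") -> str:
    """Map gs://bucket/key to <mount_root>/bucket/key (hypothetical helper)."""
    prefix = "gs://"
    if not gs_uri.startswith(prefix):
        raise ValueError(f"not a gs:// URI: {gs_uri!r}")
    # Split "bucket/key/parts" into the bucket name and the object key
    bucket, _, key = gs_uri[len(prefix):].partition("/")
    if not bucket:
        raise ValueError(f"missing bucket name in {gs_uri!r}")
    return str(PurePosixPath(mount_root) / bucket / key)


print(gs_to_mount_path("gs://edit-data-eng-dev/datalake/bronze/basic_pays"))
```

With this, `spark.read.parquet(gs_to_mount_path(final_path))` would read the same files through the mount without repeating the path by hand.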

In [9]:
df = spark.read.parquet("/content/edit-data-eng-dev/datalake/bronze/basic_pays")
df.show()

+-----------------+----------+------+
|    employee_name|department|salary|
+-----------------+----------+------+
|   Julie Firrelli|     Sales|  9181|
|  Steve Patterson|     Sales|  9441|
|   Foon Yue Tseng|     Sales|  6660|
|    George Vanauf|     Sales| 10563|
|      Loui Bondur|       SCM| 10449|
| Gerard Hernandez|       SCM|  6949|
|  Pamela Castillo|       SCM| 11303|
|       Larry Bott|       SCM| 11798|
|      Barry Jones|       SCM| 10586|
|     Diane Murphy|Accounting|  8435|
|   Mary Patterson|Accounting|  9998|
|    Jeff Firrelli|Accounting|  8992|
|William Patterson|Accounting|  8870|
|    Gerard Bondur|Accounting| 11472|
|      Anthony Bow|Accounting|  6627|
|  Leslie Jennings|        IT|  8113|
|  Leslie Thompson|        IT|  5186|
+-----------------+----------+------+



In [10]:
df.write.format("parquet").save("/content/edit-data-eng-dev/datalake/bronze3")

In [2]:
!mkdir -p /content/temp/

In [None]:
import datetime
import json
import time

import requests


def ingest_from_api(url: str, landing_path: str):
    """Fetch JSON from the API, stage it locally, then copy it to the landing path."""
    print(f"fetching data from {url}...")
    response = requests.get(url)
    timestamp = datetime.datetime.now().strftime("%Y%m%d%H%M%S")
    temp_file = f"/content/temp/vehicles_{timestamp}.json"

    if response.status_code == 200:
        data = response.json()
        print("Writing into temp location...")
        with open(temp_file, "w") as f:
            json.dump(data, f)
        # Copy only after the file is closed, so the whole payload is on disk
        !gsutil cp {temp_file} {landing_path}
    else:
        print(f"Request failed with status {response.status_code}")


def producer_vehicles(loop: int, interval_time: int, landing_path: str):
    """Call the API `loop` times, sleeping `interval_time` seconds between calls."""
    for i in range(loop):
        print(f"Producer running...{i}")
        ingest_from_api("https://api.carrismetropolitana.pt/vehicles", landing_path)
        time.sleep(interval_time)


if __name__ == '__main__':
    print("# Starting process #")
    bucket_name = "edit-data-eng-dev"
    table = "vehicles"
    landing_path = f"gs://{bucket_name}/datalake/landing/{table}"
    print(f"Landing path: {landing_path}")
    producer_vehicles(10, 30, landing_path)
    print("# Process done #")

# Starting process #
Landing path: gs://edit-data-eng-dev/datalake/landing/vehicles
Producer running...0
fetching data from https://api.carrismetropolitana.pt/vehicles...
Writing into temp location...
Copying file:///content/temp/vehicles_20241129141154.json [Content-Type=application/json]...
-
Operation completed over 1 objects/287.8 KiB.                                    
Producer running...1
fetching data from https://api.carrismetropolitana.pt/vehicles...
Writing into temp location...
Copying file:///content/temp/vehicles_20241129141229.json [Content-Type=application/json]...
-
Operation completed over 1 objects/287.8 KiB.                                    
Producer running...2
fetching data from https://api.carrismetropolitana.pt/vehicles...
Writing into temp location...
Copying file:///content/temp/vehicles_20241129141304.json [Content-Type=application/json]...
-
Operation completed over 1 objects/279.9 KiB.                                    
Producer running...3
fetching data
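
The `!gsutil` line in the producer cell only works inside IPython. A minimal sketch of the same stage-then-copy step as plain Python functions, with the file closed before the copy starts. It assumes `gsutil` is on the PATH; `stage_payload` and `copy_to_landing` are hypothetical helper names.

```python
import datetime
import json
import subprocess
from pathlib import Path


def stage_payload(data, staging_dir: str, prefix: str = "vehicles") -> Path:
    """Write `data` as JSON to a timestamped file and return its path."""
    timestamp = datetime.datetime.now().strftime("%Y%m%d%H%M%S")
    target = Path(staging_dir) / f"{prefix}_{timestamp}.json"
    target.parent.mkdir(parents=True, exist_ok=True)
    with open(target, "w") as f:
        json.dump(data, f)
    # The file is closed at this point, so the payload has been flushed to disk
    return target


def copy_to_landing(local_file: Path, landing_path: str) -> None:
    """Copy a staged file with gsutil; raises CalledProcessError if the copy fails."""
    subprocess.run(["gsutil", "cp", str(local_file), landing_path], check=True)
```

Inside the producer this would be used as `copy_to_landing(stage_payload(response.json(), "/content/temp"), landing_path)`, which also makes the function usable outside a notebook.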

In [21]:
from google.colab import auth
auth.authenticate_user()

project_id = 'data-eng-dev-437916'
!gcloud config set project {project_id}

!pip install gcsfs

import os
# Set JAVA_HOME before creating the SparkSession
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-11-openjdk-amd64"

from pyspark.sql import SparkSession


Updated property [core/project].


In [40]:

# Define the GCS connector jar
# The shaded hadoop3 build of the connector (version 2.2.9) is loaded straight from Maven Central
GCS_SHADED = "https://repo1.maven.org/maven2/com/google/cloud/bigdataoss/gcs-connector/hadoop3-2.2.9/gcs-connector-hadoop3-2.2.9-shaded.jar"

spark = (SparkSession.builder
    .appName("ColabGCS")
    .config("spark.jars", GCS_SHADED)
    .config("spark.hadoop.google.cloud.auth.service.account.enable", "true")
    .config("spark.hadoop.fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
    .config("spark.hadoop.fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
    .getOrCreate())

# define paths
bucket_name = "edit-data-eng-dev"
lake_path = "datalake/bronze"
table_path = "basic_pays"
final_path = f"gs://{bucket_name}/{lake_path}/{table_path}"
print(f"Final path: {final_path}")
df = spark.read.parquet(final_path)


Final path: gs://edit-data-eng-dev/datalake/bronze/basic_pays


Py4JJavaError: An error occurred while calling o127.parquet.
: org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "gs"
	at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:3443)
	at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3466)
	at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:174)
	at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3574)
	at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3521)
	at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:540)
	at org.apache.hadoop.fs.Path.getFileSystem(Path.java:365)
	at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$checkAndGlobPathIfNecessary$1(DataSource.scala:724)
	at scala.collection.immutable.List.map(List.scala:293)
	at org.apache.spark.sql.execution.datasources.DataSource$.checkAndGlobPathIfNecessary(DataSource.scala:722)
	at org.apache.spark.sql.execution.datasources.DataSource.checkAndGlobPathIfNecessary(DataSource.scala:551)
	at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:404)
	at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:229)
	at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:211)
	at scala.Option.getOrElse(Option.scala:189)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
	at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:563)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
	at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
	at java.base/java.lang.Thread.run(Thread.java:829)
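
`SparkSession.builder.getOrCreate()` returns the already-running session when one exists, and settings such as `spark.jars` are applied when the Spark context starts. One plausible cause of the error above is therefore that the session created earlier in this notebook was reused without the connector jar. A sketch under that assumption: keep the settings in one place and refuse to build when a session is already active, so the runtime can be restarted first. `gcs_spark_conf` and `new_gcs_session` are hypothetical helper names; the jar URL and config keys are the ones from the cell above.

```python
GCS_SHADED_JAR = (
    "https://repo1.maven.org/maven2/com/google/cloud/bigdataoss/gcs-connector/"
    "hadoop3-2.2.9/gcs-connector-hadoop3-2.2.9-shaded.jar"
)


def gcs_spark_conf(jar: str = GCS_SHADED_JAR) -> dict:
    """Collect the connector settings from the cell above in one dict."""
    return {
        "spark.jars": jar,
        "spark.hadoop.google.cloud.auth.service.account.enable": "true",
        "spark.hadoop.fs.gs.impl": "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem",
        "spark.hadoop.fs.AbstractFileSystem.gs.impl": "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS",
    }


def new_gcs_session(app_name: str = "ColabGCS"):
    """Build a session with the connector, but only if none is running yet."""
    from pyspark.sql import SparkSession  # lazy import: needs pyspark and Java

    if SparkSession.getActiveSession() is not None:
        # getOrCreate() would hand back this session and skip the new jar setting
        raise RuntimeError("A Spark session is already active; restart the runtime first.")

    builder = SparkSession.builder.appName(app_name)
    for key, value in gcs_spark_conf().items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

Whether this resolves the error depends on the actual cause; restarting the Colab runtime and running this before any other Spark cell is the conservative way to test the assumption.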


In [25]:
final_path

'gs://edit-data-eng-dev/datalake/bronze/basic_pays'

In [28]:
import gcsfs

# Build the object path once and reuse it
path = f"{final_path}/vehicles_20241130122355.json"
fs = gcsfs.GCSFileSystem()

with fs.open(path) as f:
    print(f.read().decode())

FileNotFoundError: b/edit-data-eng-dev/o/datalake%2Fbronze%2Fbasic_pays%2Fvehicles_20241130122355.json
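
The path above combines the `basic_pays` prefix with a `vehicles_*.json` file name, while the producer cell writes those files under `datalake/landing/vehicles`; that mismatch may be why the object is not found. A small sketch that checks for the object and lists its parent prefix before giving up. It assumes an fsspec-style filesystem object such as `gcsfs.GCSFileSystem()`, using only its `exists`, `open` and `ls` methods; `read_text_if_exists` is a hypothetical helper name.

```python
import posixpath


def read_text_if_exists(fs, path: str):
    """Return the decoded object, or None after listing the parent prefix."""
    if fs.exists(path):
        with fs.open(path, "rb") as f:
            return f.read().decode()

    parent = posixpath.dirname(path)
    print(f"{path} not found; objects under {parent}:")
    try:
        for entry in fs.ls(parent):
            print("  ", entry)
    except FileNotFoundError:
        # The parent prefix itself does not exist
        print("   (prefix does not exist)")
    return None
```

In the notebook this would be called as `read_text_if_exists(gcsfs.GCSFileSystem(), path)`, and the listing shows what is actually stored next to the expected file.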

In [30]:
!gcloud init

Welcome! This command will take you through the configuration of gcloud.

Settings from your current configuration [default] are:
component_manager:
  disable_update_check: 'True'
core:
  account: lucas.rosa@weareedit.io
  project: data-eng-dev-437916

Pick configuration to use:
 [1] Re-initialize this configuration [default] with new settings 
 [2] Create a new configuration
Please enter your numeric choice:  1

Your current configuration has been set to: [default]

You can skip diagnostics next time by using the following flag:
  gcloud init --skip-diagnostics

Network diagnostic detects and fixes local network connection issues.
Reachability Check passed.
Network diagnostic passed (1/1 checks passed).

Choose the account you want to use for this configuration.
To use a federated user account, exit this command and sign in to the gcloud CLI
 with your login configuration file, then run this command again.

Select an account:
 [1] lucas.rosa@weareedit.io
 [2] Sign in with a new Google

In [33]:
!gsutil cp gs://edit-data-eng-dev/datalake/bronze/basic_pays/* gs://edit-data-eng-dev/datalake/bronze2/*


Copying gs://edit-data-eng-dev/datalake/bronze/basic_pays/_SUCCESS [Content-Type=application/octet-stream]...
Copying gs://edit-data-eng-dev/datalake/bronze/basic_pays/part-00000-7167837e-0da5-43d6-81e3-8ee960243b86-c000.snappy.parquet [Content-Type=application/octet-stream]...
Copying gs://edit-data-eng-dev/datalake/bronze/basic_pays/part-00001-7167837e-0da5-43d6-81e3-8ee960243b86-c000.snappy.parquet [Content-Type=application/octet-stream]...
- [3 files][  2.4 KiB/  2.4 KiB]                                                
Operation completed over 3 objects/2.4 KiB.                                      
