<a href="https://colab.research.google.com/github/lucprosa/dataeng-basic-course/blob/main/spark_streaming/dataproc/producer_collab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Reading/Writing data to Google Storage

- Authenticate to Google
- Install gcsfs and pyspark
- Install gcsfuse and mount the bucket as a file system
- Create a Spark session
- Read data from bucket
- Write data to bucket
- Copy data with gsutil


# Authenticate to Google

In [1]:
from google.colab import auth
auth.authenticate_user()

project_id = 'data-eng-dev-437916'
!gcloud config set project {project_id}

Updated property [core/project].


# Install gcsfs and pyspark

In [15]:
!pip install gcsfs
!pip install pyspark



# Install gcsfuse

In [3]:
!echo "deb http://packages.cloud.google.com/apt gcsfuse-bionic main" > /etc/apt/sources.list.d/gcsfuse.list
!curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key add -
!apt -qq update
!apt -qq install gcsfuse

OK
40 packages can be upgraded. Run 'apt list --upgradable' to see them.
W: http://packages.cloud.google.com/apt/dists/gcsfuse-bionic/InRelease: Key is stored in legacy trusted.gpg keyring (/etc/apt/trusted.gpg), see the DEPRECATION section in apt-key(8) for details.
W: Skipping acquire of configured file 'main/source/Sources' as repository 'https://r2u.stat.illinois.edu/ubuntu jammy InRelease' does not seem to provide it (sources.list entry misspelt?)
The following NEW packages will be installed:
  gcsfuse
0 upgraded, 1 newly installed, 0 to remove and 40 not upgraded.
Need to get 14.8 MB of archives.
After this operation, 0 B of additional dis

# Create local folder and mount the bucket as a file system

In [5]:
!mkdir edit-data-eng-dev
!gcsfuse edit-data-eng-dev edit-data-eng-dev

{"timestamp":{"seconds":1750434234,"nanos":191703237},"severity":"INFO","message":"Running gcsfuse/3.0.0 (Go version go1.24.0)"}
{"timestamp":{"seconds":1750434234,"nanos":197459282},"severity":"INFO","message":"Start gcsfuse/3.0.0 (Go version go1.24.0) for app \"\" using mount point: /content/edit-data-eng-dev\n"}
{"timestamp":{"seconds":1750434234,"nanos":197506393},"severity":"INFO","message":"GCSFuse config","config":{"AppName":"","CacheDir":"","Debug":{"ExitOnInvariantViolation":false,"Fuse":false,"Gcs":false,"LogMutex":false},"DisableAutoconfig":false,"EnableAtomicRenameObject":false,"EnableHns":true,"EnableNewReader":false,"FileCache":{"CacheFileForRangeRead":false,"DownloadChunkSizeMb":200,"EnableCrc":false,"EnableODirect":false,"EnableParallelDownloads":false,"ExperimentalParallelDownloadsDefaultOn":true,"MaxParallelDownloads":16,"MaxSizeMb":-1,"ParallelDownloadsPerFile":16,"WriteBufferSize":4194304},"FileSystem":{"DirMode":"755","DisableParallelDirops":false,"FileMode":"644",
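Before reading from the mount point, it can help to confirm that the mount actually succeeded, since an unmounted local folder looks the same as an empty bucket. The following is a minimal sketch using only the standard library. The helper name `is_mounted` is hypothetical, and the path assumes the `/content/edit-data-eng-dev` mount point reported in the log above.

```python
import os


def is_mounted(path: str) -> bool:
    """Return True if `path` is an existing directory and a mount point (e.g. a gcsfuse mount)."""
    return os.path.isdir(path) and os.path.ismount(path)


# Assumed mount point, taken from the gcsfuse log output.
mount_point = "/content/edit-data-eng-dev"
print(f"{mount_point} mounted: {is_mounted(mount_point)}")
```

If this prints `False`, the folder exists only locally and anything written to it will not reach the bucket.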

# Create Spark Session

In [21]:
from pyspark.sql import SparkSession

# The GCS connector settings below are only needed to read gs:// paths directly.
# They stay disabled here because the bucket is read through the gcsfuse mount.
# .config("spark.hadoop.google.cloud.auth.service.account.enable", "true")
# .config("spark.hadoop.fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
# .config("spark.hadoop.fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")

spark = (
    SparkSession.builder
    .appName("ColabGCS")
    .getOrCreate()
)

In [23]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-11-openjdk-amd64"
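
PySpark launches a JVM when the session is created, so `JAVA_HOME` only takes effect if it is set beforehand. The following is a minimal sketch of a guarded setup. The helper name `set_java_home` is hypothetical, and the JDK path is the one used above, which may differ on other runtimes.

```python
import os


def set_java_home(jdk_dir: str) -> bool:
    """Set JAVA_HOME to `jdk_dir` if that directory exists; return whether it was set."""
    if os.path.isdir(jdk_dir):
        os.environ["JAVA_HOME"] = jdk_dir
        return True
    return False


# Path copied from the cell above; run this before building the SparkSession.
if not set_java_home("/usr/lib/jvm/java-11-openjdk-amd64"):
    print("JDK directory not found; leaving JAVA_HOME unchanged")
```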

In [25]:

# define paths
bucket_name = "edit-data-eng-dev"
lake_path = "datalake/bronze"
table_path = "basic_pays"
final_path = f"gs://{bucket_name}/{lake_path}/{table_path}"

# Since the bucket is mounted as a file system, the path to use is:
# "/content/edit-data-eng-dev/datalake/bronze/basic_pays"
# instead of
# "gs://edit-data-eng-dev/datalake/bronze/basic_pays"
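
The mapping from a `gs://` URI to its mounted path can be written as a small helper rather than typed by hand. This is a minimal sketch: the name `gs_to_mount_path` is hypothetical, and it assumes the bucket is mounted in a folder named after the bucket under `/content`, as done in this notebook.

```python
from pathlib import PurePosixPath


def gs_to_mount_path(gs_uri: str, mount_root: str = "/content") -> str:
    """Map gs://<bucket>/<key> to <mount_root>/<bucket>/<key>."""
    prefix = "gs://"
    if not gs_uri.startswith(prefix):
        raise ValueError(f"not a gs:// URI: {gs_uri!r}")
    return str(PurePosixPath(mount_root) / gs_uri[len(prefix):])


print(gs_to_mount_path("gs://edit-data-eng-dev/datalake/bronze/basic_pays"))
# prints: /content/edit-data-eng-dev/datalake/bronze/basic_pays
```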


# Read data from the bucket

In [30]:
df = spark.read.parquet("/content/edit-data-eng-dev/datalake/bronze/basic_pays")
df.show()

34
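
Spark treats a Parquet table as a directory of part files plus marker files such as `_SUCCESS`. Through the mount these are ordinary files, so the directory can be inspected before or after reading it. The following is a minimal sketch; the helper name `list_parquet_parts` is hypothetical.

```python
from pathlib import Path


def list_parquet_parts(table_dir: str) -> list[str]:
    """Return the sorted names of Parquet part files in `table_dir`, skipping markers like _SUCCESS."""
    return sorted(p.name for p in Path(table_dir).glob("*.parquet") if p.is_file())


# Mounted table path used in the read cell above.
table_dir = "/content/edit-data-eng-dev/datalake/bronze/basic_pays"
if Path(table_dir).is_dir():
    for name in list_parquet_parts(table_dir):
        print(name)
```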

# Write data to the bucket

In [10]:
df.write.format("parquet").save("/content/edit-data-eng-dev/datalake/bronze/basic_pays_new")
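
By default Spark's writer fails if the target path already exists (save mode `errorifexists`), so re-running the cell above raises an error once `basic_pays_new` has been written. The following is a minimal sketch of a wrapper that makes the mode explicit. `write_parquet` is a hypothetical helper, and `df` is assumed to be a Spark DataFrame such as the one read earlier.

```python
def write_parquet(df, path: str, overwrite: bool = False) -> None:
    """Write `df` as Parquet to `path`, replacing existing data only when `overwrite` is True."""
    mode = "overwrite" if overwrite else "errorifexists"
    df.write.mode(mode).parquet(path)


# Example call (assumes `df` from the read step above):
# write_parquet(df, "/content/edit-data-eng-dev/datalake/bronze/basic_pays_new", overwrite=True)
```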

In [2]:
!mkdir -p /content/temp/

# Copying data through gsutil

In [28]:
!gsutil cp gs://edit-data-eng-dev/datalake/bronze/basic_pays/* gs://edit-data-eng-dev/datalake/bronze3/

Copying gs://edit-data-eng-dev/datalake/bronze/basic_pays/_SUCCESS [Content-Type=application/octet-stream]...
Copying gs://edit-data-eng-dev/datalake/bronze/basic_pays/part-00000-7167837e-0da5-43d6-81e3-8ee960243b86-c000.snappy.parquet [Content-Type=application/octet-stream]...
Copying gs://edit-data-eng-dev/datalake/bronze/basic_pays/part-00001-7167837e-0da5-43d6-81e3-8ee960243b86-c000.snappy.parquet [Content-Type=application/octet-stream]...
\ [3 files][  2.4 KiB/  2.4 KiB]                                                
Operation completed over 3 objects/2.4 KiB.                                      
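
`gsutil cp` copies objects between bucket locations directly. Because the bucket is also mounted locally, a similar copy could be done with ordinary file operations, although in that case the data passes through the Colab VM. The following is a minimal sketch, not a replacement for `gsutil`. The helper name `copy_files` is hypothetical, and the paths mirror the `basic_pays` and `bronze3` prefixes used above.

```python
import shutil
from pathlib import Path


def copy_files(src_dir: Path, dst_dir: Path) -> int:
    """Copy every regular file directly under `src_dir` into `dst_dir`; return the number copied."""
    dst_dir.mkdir(parents=True, exist_ok=True)
    copied = 0
    for src in sorted(src_dir.iterdir()):
        if src.is_file():
            # copyfile copies contents only, without permission bits or metadata.
            shutil.copyfile(src, dst_dir / src.name)
            copied += 1
    return copied


mount = Path("/content/edit-data-eng-dev")
source = mount / "datalake/bronze/basic_pays"
if source.is_dir():
    n = copy_files(source, mount / "datalake/bronze3")
    print(f"copied {n} files")
```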
