# AWS Glue Code Explanation


In this notebook, we walk through an AWS Glue script that performs an ETL operation: it joins sales data with customer data and writes the resulting dataset to an S3 bucket. We go through the code step by step to understand what each part does.




```python
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

glueContext = GlueContext(SparkContext.getOrCreate())

salesDF = glueContext.create_dynamic_frame.from_catalog(
             database="dojodatabase",
             table_name="sales")
customerDF = glueContext.create_dynamic_frame.from_catalog(
             database="dojodatabase",
             table_name="customers")

customersalesDF = Join.apply(salesDF, customerDF, 'customerid', 'customerid')
customersalesDF = customersalesDF.drop_fields(['customerid'])

glueContext.write_dynamic_frame.from_options(
    customersalesDF,
    connection_type="s3",
    connection_options={"path": "s3://dojo-data-lake/data/customer-sales"},
    format="json")
```


## Code Explanation

### Importing Libraries and Creating Glue Context

```python
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
```

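These lines import the libraries and modules the script relies on. The `awsglue.transforms` module provides the built-in transform classes, including `Join`, which is used later. The `awsglue.context` module provides `GlueContext`, which is needed to create DynamicFrames. Note that `sys`, `getResolvedOptions`, and `Job` are imported but never used in this snippet; they usually appear in Glue job scripts that read job parameters and initialize a `Job` object.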

```python
glueContext = GlueContext(SparkContext.getOrCreate())
```

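Here, a Glue context is created from a Spark context. The Spark context is the entry point to Spark functionality; `GlueContext` wraps it and adds Glue-specific features, such as reading Data Catalog tables into DynamicFrames. `SparkContext.getOrCreate()` returns the existing Spark context if one is already running and creates a new one otherwise.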
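The snippet imports `getResolvedOptions` and `Job` without using them. As a hedged sketch of where they usually fit, the function below shows the common Glue job pattern: resolve the `JOB_NAME` argument, initialize a `Job`, run the ETL steps, and commit. The function name `run_job` is hypothetical, and the Glue imports are placed inside the function only so that the definition can be loaded outside a Glue environment.

```python
import sys


def run_job(argv):
    """Sketch of the usual Glue job lifecycle (only runs inside a Glue environment)."""
    # Imported lazily so that defining this function does not require awsglue.
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext

    # Read the job name that Glue passes on the command line.
    args = getResolvedOptions(argv, ["JOB_NAME"])

    glue_context = GlueContext(SparkContext.getOrCreate())

    # Initialize the job so that Glue can track this run.
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    # ... the read, join, and write steps from this notebook would go here ...

    # Signal that the run finished.
    job.commit()
```

Inside a Glue job script this would typically be invoked as `run_job(sys.argv)`.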

### Reading Data from the AWS Glue Data Catalog

```python
salesDF = glueContext.create_dynamic_frame.from_catalog(
             database="dojodatabase",
             table_name="sales")
customerDF = glueContext.create_dynamic_frame.from_catalog(
             database="dojodatabase",
             table_name="customers")
```

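These lines read the `sales` and `customers` tables of the `dojodatabase` database from the AWS Glue Data Catalog into DynamicFrames. The Data Catalog is a centralized repository of metadata (such as schema, format, and location) about data stored elsewhere; the data itself stays in the underlying store. Despite the `DF` suffix in their names, both variables hold Glue DynamicFrames, not Spark DataFrames.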
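Before joining, it can help to check what was actually loaded. The helper below is a minimal sketch, with the hypothetical name `describe_frame`, that relies only on the `printSchema()` and `count()` methods that DynamicFrames expose.

```python
def describe_frame(frame, name):
    """Print the schema of a DynamicFrame-like object and return its record count."""
    print(f"Schema of {name}:")
    frame.printSchema()          # prints the inferred schema
    rows = frame.count()         # number of records in the frame
    print(f"{name} contains {rows} records")
    return rows
```

In the notebook this could be called as `describe_frame(salesDF, "sales")` and `describe_frame(customerDF, "customers")`.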

### Joining the Sales and Customer Data

```python
customersalesDF = Join.apply(salesDF, customerDF, 'customerid', 'customerid')
```

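This line joins the sales data with the customer data on the `customerid` field. `Join.apply` takes the two DynamicFrames followed by the key field(s) of the first frame and the key field(s) of the second, and performs an equality join: a record appears in the result only where the key values match in both frames.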
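To make the join semantics concrete, here is a plain-Python sketch of an equality join over lists of dictionaries. It assumes inner-join behavior, and the sample records and the `equality_join` helper are hypothetical; it illustrates the idea and is not how Glue implements the transform.

```python
def equality_join(left, right, left_key, right_key):
    """Return merged records for every pair whose key values are equal."""
    # Index the right-hand records by key for fast lookup.
    index = {}
    for row in right:
        index.setdefault(row[right_key], []).append(row)

    joined = []
    for lrow in left:
        for rrow in index.get(lrow[left_key], []):
            # Merging dicts keeps a single copy of the shared key; a joined
            # DynamicFrame may instead carry the key from both inputs.
            joined.append({**rrow, **lrow})
    return joined


# Hypothetical sample data.
sales = [
    {"customerid": 1, "product": "laptop", "amount": 1200},
    {"customerid": 2, "product": "phone", "amount": 650},
    {"customerid": 4, "product": "desk", "amount": 300},   # no matching customer
]
customers = [
    {"customerid": 1, "name": "Ana"},
    {"customerid": 2, "name": "Ben"},
    {"customerid": 3, "name": "Cho"},                      # no matching sale
]

customer_sales = equality_join(sales, customers, "customerid", "customerid")
for row in customer_sales:
    print(row)
```

Only customers 1 and 2 appear in the result, because records without a match on either side are dropped.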

### Dropping Unnecessary Fields

```python
customersalesDF = customersalesDF.drop_fields(['customerid'])
```

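This line drops the `customerid` field from the joined DynamicFrame. Because both input tables carry the join key, the intent is to remove the redundant key column from the output. Cleaning up redundant columns is a common step after a join; it is worth checking the resulting schema to confirm that only the intended field was removed.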

### Writing the Resulting Data to S3

```python
glueContext.write_dynamic_frame.from_options(
    customersalesDF,
    connection_type="s3",
    connection_options={"path": "s3://dojo-data-lake/data/customer-sales"},
    format="json")
```

This line writes the joined and cleaned data to S3 in JSON format. `write_dynamic_frame.from_options` takes the DynamicFrame to write, the connection type (`"s3"`), the connection options (here the target path `s3://dojo-data-lake/data/customer-sales`), and the output format.
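As a rough picture of the output, the sketch below writes records as one JSON object per line. This layout is an assumption based on how Spark-based writers commonly emit JSON; the sample records and the `write_json_lines` helper are hypothetical, and an in-memory buffer stands in for an S3 object.

```python
import io
import json


def write_json_lines(records, stream):
    """Write each record as a single JSON object on its own line."""
    for record in records:
        stream.write(json.dumps(record) + "\n")


# Hypothetical joined records with the key field already dropped.
records = [
    {"name": "Ana", "product": "laptop", "amount": 1200},
    {"name": "Ben", "product": "phone", "amount": 650},
]

buffer = io.StringIO()   # stands in for one output file under the S3 path
write_json_lines(records, buffer)
print(buffer.getvalue())
```

A real job typically produces several such files under the target path rather than a single one.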

### Summary

This AWS Glue code snippet is a straightforward example of extracting, transforming, and loading data using AWS Glue and PySpark. It reads two tables through the AWS Glue Data Catalog, joins them on `customerid`, drops the redundant key field, and writes the result to an S3 bucket as JSON.
