## Lab Overview
- Participants will learn how to fetch JSON data from a RESTful API (https://jsonplaceholder.typicode.com/posts/), convert it into a PySpark DataFrame, analyze its schema and content, and then efficiently save it into a MySQL database table. Through this lab, participants will gain practical skills in data ingestion, transformation, and persistence using Python, Spark, and MySQL, essential for building scalable data pipelines in real-world scenarios.
### Learning Objectives
- Describe how to retrieve JSON data from an API endpoint using Python and load it into a PySpark DataFrame for further processing.
- Become familiar with essential PySpark concepts, including SparkSession initialization, DataFrame creation from JSON data, and schema inspection.
- Explore data persistence options: methods for persisting PySpark DataFrames into a MySQL database, including defining connection properties, specifying JDBC URLs, and selecting appropriate write modes.
- Gain hands-on experience building a simple, effective data pipeline, from data retrieval to persistence, using requests, PySpark, and the MySQL JDBC driver (Connector/J).
- Build a foundation for scalable data pipelines by integrating PySpark with external data sources and databases, which is essential for real-world data engineering tasks.
### Dataset
- The “classicmodels” database.
- API Endpoint URL = "https://jsonplaceholder.typicode.com/posts/"

In [2]:
import findspark
findspark.init()
import pyspark
from pyspark.sql import SparkSession
import requests

In [4]:
# Initialize SparkSession
spark = SparkSession.builder.appName("Read and Save JSON Data with PySpark SQL").getOrCreate()

# API Endpoint URL
api_url = "https://jsonplaceholder.typicode.com/posts/"

# Fetch data from API
response = requests.get(api_url)
if response.status_code == 200:

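    # Convert the JSON response to a DataFrame. Pass the raw JSON text:
    # spark.read.json expects an RDD of JSON strings, not parsed Python objects.
    json_df = spark.read.json(spark.sparkContext.parallelize([response.text]))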

    # Show DataFrame schema and content
    json_df.printSchema()
    json_df.show()

    # Define MySQL connection properties
    mysql_props = {
        "user": "root",
        "password": "password",
        "driver": "com.mysql.cj.jdbc.Driver"
    }
    # JDBC URL for MySQL
    mysql_url = "jdbc:mysql://localhost:3306/usersdb"
    # Save DataFrame to MySQL table
    json_df.write.jdbc(url=mysql_url, table="json_data_table", mode="overwrite", properties=mysql_props)
else:
    print(f"Failed to fetch data from API. Status code: {response.status_code}")
# Stop SparkSession
spark.stop()

root
 |-- body: string (nullable = true)
 |-- id: long (nullable = true)
 |-- title: string (nullable = true)
 |-- userId: long (nullable = true)

+--------------------+---+--------------------+------+
|                body| id|               title|userId|
+--------------------+---+--------------------+------+
|quia et suscipit\...|  1|sunt aut facere r...|     1|
|est rerum tempore...|  2|        qui est esse|     1|
|et iusto sed quo ...|  3|ea molestias quas...|     1|
|ullam et saepe re...|  4|eum et est occaecati|     1|
|repudiandae venia...|  5|  nesciunt quas odio|     1|
|ut aspernatur cor...|  6|dolorem eum magni...|     1|
|dolore placeat qu...|  7|magnam facilis autem|     1|
|dignissimos aperi...|  8|dolorem dolore es...|     1|
|consectetur animi...|  9|nesciunt iure omn...|     1|
|quo et expedita m...| 10|optio molestias i...|     1|
|delectus reiciend...| 11|et ea vero quia l...|     2|
|itaque id aut mag...| 12|in quibusdam temp...|     2|
|aut dicta possimu...| 13|do
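
`spark.read.json` accepts an RDD of JSON strings, so one way to hand records that have already been parsed in Python back to Spark is to serialize each record as its own JSON line. The following is a minimal sketch, not part of the original lab: `to_json_lines` is a hypothetical helper, and the sample records are made up, reusing only the field names from the schema above. It illustrates why `json.dumps` is safer than Python's default string conversion, which would write `None` and single quotes instead of valid JSON.

```python
import json


def to_json_lines(records):
    """Serialize each record (a dict) to one JSON string."""
    return [json.dumps(record) for record in records]


# Hypothetical sample data; only the field names come from the lab's schema.
sample = [
    {"userId": 1, "id": 1, "title": "first post", "body": None},
    {"userId": 1, "id": 2, "title": "it's a title", "body": "text"},
]

for line in to_json_lines(sample):
    print(line)

# With an active SparkSession, the strings could then be loaded like this:
# json_df = spark.read.json(spark.sparkContext.parallelize(to_json_lines(sample)))
```

Each string is a complete JSON object, so Spark can infer the schema from the records without depending on how Python prints dictionaries.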

### Explanation of Process

- Setting up PySpark:
    - Uses findspark to initialize the PySpark environment.
    - Imports pyspark and SparkSession from pyspark.sql.
- Initializing SparkSession:
    - Initializes a SparkSession named "Read and Save JSON Data with PySpark SQL".
- Fetching Data from API:
    - Makes an HTTP GET request to the specified API endpoint (https://jsonplaceholder.typicode.com/posts/) using the `requests.get()` method.
    - If the request is successful (status code 200), converts the JSON response into a PySpark DataFrame using `spark.read.json()`.
- Data Inspection:
    - The DataFrame schema and content are printed using `json_df.printSchema()` and `json_df.show()` respectively.
- Saving Data to MySQL:
    - Defines the MySQL connection properties (`mysql_props`), including username, password, and JDBC driver.
    - Specifies the JDBC URL for connecting to the MySQL database.
    - Saves the DataFrame into a MySQL table named "json_data_table" using the `write.jdbc()` method, with the mode set to "overwrite" so that any existing table contents are replaced.
- Error Handling:
    - Checks the HTTP response status code and prints an error message if the request fails.
- Environment Cleanup:
    - Stops the SparkSession to release resources.
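
The connection settings described under "Saving Data to MySQL" can also be assembled by small helper functions, which keeps the password out of the notebook. This is a sketch under stated assumptions rather than the lab's own method: `build_mysql_url` and `build_mysql_props` are hypothetical helpers, and the environment variable names `MYSQL_USER` and `MYSQL_PASSWORD` are assumptions chosen for illustration.

```python
import os


def build_mysql_url(host, port, database):
    """Return a JDBC URL in the same form as the lab's mysql_url."""
    return f"jdbc:mysql://{host}:{port}/{database}"


def build_mysql_props(env=None):
    """Build JDBC properties from environment variables.

    The variable names MYSQL_USER and MYSQL_PASSWORD are assumptions.
    """
    env = os.environ if env is None else env
    return {
        "user": env.get("MYSQL_USER", "root"),
        "password": env["MYSQL_PASSWORD"],  # raises KeyError if unset
        "driver": "com.mysql.cj.jdbc.Driver",
    }


# Same host, port, and database as the lab's JDBC URL.
mysql_url = build_mysql_url("localhost", 3306, "usersdb")
print(mysql_url)
```

With a DataFrame in hand, the write step would then read `json_df.write.jdbc(url=mysql_url, table="json_data_table", mode="overwrite", properties=build_mysql_props())`. Besides "overwrite", the `mode` argument also accepts "append", "ignore", and "error" (or "errorifexists"), which control what happens when the target table already exists.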
