In [None]:
# Coforge DE interview skills (Spark and SQL), 11 Jan 24

# 1. What is the Spark architecture?
"""Apache Spark Architecture:
1. Driver: The driver is the main control process. It creates the SparkContext (or SparkSession), connects to the cluster manager, translates the user program into jobs, stages, and tasks, and schedules those tasks on executors.
2. Executors: Executors are JVM processes launched on worker nodes; they are not the nodes themselves. Each executor runs the tasks assigned by the driver and provides in-memory storage for RDDs that the user program caches.
3. Cluster Manager: The cluster manager is responsible for managing resources across the cluster. Spark supports Standalone, Hadoop YARN, Kubernetes, and Apache Mesos (deprecated since Spark 3.2). The cluster manager allocates resources for the driver and executors.
4. RDD (Resilient Distributed Dataset): RDD is the fundamental data structure of Spark. It is an immutable distributed collection of objects that can be processed in parallel. RDDs support two types of operations: transformations (e.g., map, filter), which are lazy, and actions (e.g., count, collect), which trigger execution.
5. DAG (Directed Acyclic Graph): Spark represents the sequence of transformations applied to the data as a DAG. When an action is called, the DAG scheduler splits the graph into stages at shuffle boundaries and submits each stage's tasks to the task scheduler.
6. Task Scheduler: The task scheduler is responsible for scheduling tasks on the executors. It takes the stages from the DAG scheduler and assigns tasks to executors based on data locality and resource availability.
In summary, the Spark architecture consists of a driver, executors, a cluster manager, RDDs, the DAG scheduler, and the task scheduler, working together to process large-scale data in a distributed and fault-tolerant manner.
"""

# How does Spark distribute tasks for processing a 1 GB file, and how are tasks and executors allocated during execution?
'''Spark splits the 1 GB file into input partitions. With the default split size of 128 MB (the HDFS block size, and also the default of spark.sql.files.maxPartitionBytes), a splittable 1 GB file yields about 8 partitions (1024 MB / 128 MB). Each partition is processed by one task per stage.

Executors: Each executor runs as many tasks concurrently as it has cores.
Tasks per Executor: An executor with 4 cores runs 4 tasks in parallel; any remaining tasks wait until a core is free.'''
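'''
The arithmetic above can be sketched in plain Python. This is only an illustration: the function name plan_tasks and the cluster shape (2 executors with 4 cores each) are assumptions, not values from the question.

```python
import math

def plan_tasks(file_mb, partition_mb=128, executors=2, cores_per_executor=4):
    """Estimate partitions, parallel task slots, and scheduling waves."""
    # One input partition per split of the file
    partitions = math.ceil(file_mb / partition_mb)
    # Each core runs one task at a time
    slots = executors * cores_per_executor
    # Tasks beyond the available slots wait for the next wave
    waves = math.ceil(partitions / slots)
    return partitions, slots, waves

print(plan_tasks(1024))               # (8, 8, 1): all 8 tasks run at once
print(plan_tasks(1024, executors=1))  # (8, 4, 2): 8 tasks run in 2 waves of 4
```
'''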

# What is spark-submit? Explain in detail.
'''
spark-submit is the command-line tool used to launch Spark applications locally or on a cluster. It packages the application and its configuration, requests resources from the cluster manager (YARN, Kubernetes, Mesos, or standalone), and starts the driver.

Key Responsibilities:
Distribute Application Code: Sends your Spark application JAR or Python script to the cluster.
Resource Requests: Passes the memory and CPU settings for the driver and executors to the cluster manager, which performs the actual allocation.
Configuration Management: Sets properties like application name, Spark master, deploy mode, and more.
Job Launch: Starts the driver, which then schedules the tasks and tracks job progress.
Command Syntax:

spark-submit \
  --master <master-url> \
  --deploy-mode <deploy-mode> \
  --class <main-class> \
  --executor-memory <memory> \
  --executor-cores <cores> \
  application.jar <application-arguments>
Common Parameters:
Parameter	Description
--master	Specifies the cluster manager (local, yarn, etc.)
--deploy-mode	Deployment mode (client or cluster)
--class	Main class of the application (for Java/Scala)
--executor-memory	Memory per executor (e.g., 2G)
--executor-cores	Number of cores per executor
--num-executors	Number of executors to launch (YARN and Kubernetes)
--conf	Set configuration properties (e.g., spark.executor.memory)
Example Usage:
Running a Spark Application Locally:

spark-submit \
  --master "local[2]" \
  my_spark_app.py

Running on a YARN Cluster:

spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --executor-memory 4G \
  --executor-cores 4 \
  my_spark_app.py
Execution Modes:
Client Mode: Driver runs on the machine where spark-submit is executed. Suitable for interactive work, local testing, and debugging.
Cluster Mode: Driver runs inside the cluster, on a worker node. Suited to production jobs, because the job does not depend on the submitting machine staying connected.

Important Points:
Driver Program: Manages job execution and task scheduling.
Executors: Run the actual tasks on worker nodes and store intermediate data.
Partitions: Spark splits input data into smaller partitions for distributed processing.
spark-submit itself only launches the application; the scalability and parallelism come from the driver distributing one task per partition across the executors.
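
When spark-submit is launched from a Python wrapper (for example from a scheduler), the command can be assembled as an argument list. The sketch below is a hypothetical helper, build_spark_submit, that uses only the flags listed in the table above; it builds and prints the command and does not run it.

```python
import shlex

def build_spark_submit(app, master, deploy_mode=None, executor_memory=None,
                       executor_cores=None, num_executors=None, conf=None,
                       app_args=()):
    """Assemble a spark-submit command as a list of arguments."""
    cmd = ["spark-submit", "--master", master]
    if deploy_mode:
        cmd += ["--deploy-mode", deploy_mode]
    if executor_memory:
        cmd += ["--executor-memory", executor_memory]
    if executor_cores:
        cmd += ["--executor-cores", str(executor_cores)]
    if num_executors:
        cmd += ["--num-executors", str(num_executors)]
    # Each extra property becomes its own --conf key=value pair
    for key, value in (conf or {}).items():
        cmd += ["--conf", f"{key}={value}"]
    # The application file comes last, followed by its own arguments
    cmd.append(app)
    cmd += list(app_args)
    return cmd

cmd = build_spark_submit("my_spark_app.py", "yarn", deploy_mode="cluster",
                         executor_memory="4G", executor_cores=4)
print(shlex.join(cmd))
```

Passing the list to subprocess.run(cmd, check=True) would execute it; a list avoids shell quoting problems with values such as local[2].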
'''

In [31]:
# Synechron Interview Questions on Database for Data Engineer

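# Make an Oracle DB connection in Python and run a query.

# %pip install cx_Oracle   (cx_Oracle has since been renamed python-oracledb)
import cx_Oracle

# Build the DSN from host, port, and service name
dsn_tns = cx_Oracle.makedsn('localhost', 1521, service_name='orcl')
# Supply your own credentials when connecting; avoid hard-coding real passwords
conn = cx_Oracle.connect(user='SivaAcademy', password='pass', dsn=dsn_tns)
c = conn.cursor()
c.execute('select * from emp')
print(c.fetchone())  # first row only; use c.fetchall() for every row,
# or iterate over the cursor:
# for row in c:
#     print(row)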



(7839, 'KING', 'PRESIDENT', None, datetime.datetime(1981, 11, 17, 0, 0), 5100.0, None, 20)


In [None]:
# 1. Find the sales amount for each year, product-wise  [Sales_data table]
# Solution 1: window function, then de-duplicate
query='''WITH cte AS 
(
    SELECT product_id, 
           EXTRACT(YEAR FROM sales_date) AS sale_year, 
           SUM(amount) OVER (PARTITION BY product_id, EXTRACT(YEAR FROM sales_date)) AS yearly_sales
    FROM Sales_data
)
SELECT DISTINCT product_id, sale_year, yearly_sales
FROM cte
ORDER BY product_id, sale_year'''
c.execute(query)
for i in c:
    print(i)


In [33]:
# Solution 2: plain GROUP BY (simpler, no de-duplication needed)
query='''SELECT product_id AS prod_id, 
    EXTRACT(YEAR FROM sales_date) AS sale_year, 
    SUM(amount) AS total_sales
FROM Sales_data
GROUP BY product_id, EXTRACT(YEAR FROM sales_date)
ORDER BY product_id, sale_year'''
c.execute(query)
for i in c:
    print(i)
'''
Ways to get the year from a date in Oracle (for a sales_date in 2025):
EXTRACT(YEAR FROM sales_date): 2025 (a number)
TO_CHAR(sales_date, 'YYYY'): '2025' (a string)
TRUNC(sales_date, 'YYYY'): 01-JAN-2025 (a date: the first day of the year)
'''

('P1', 2023, 1301.15)
('P10', 2023, 1085.16)
('P2', 2023, 1310.25)
('P3', 2023, 1235.66)
('P4', 2023, 1064.48)
('P5', 2023, 1259.51)
('P6', 2023, 1310.55)
('P7', 2023, 1154.54)
('P8', 2023, 1220.01)
('P9', 2023, 1383.75)


In [45]:
#2. Find products that were not sold in the last 7 months.
query='''SELECT DISTINCT product_id
FROM Sales_data
WHERE product_id NOT IN (
    SELECT product_id 
    FROM Sales_data 
    WHERE sales_date >= ADD_MONTHS(SYSDATE, -7))'''
c.execute(query)
for i in c:
    print(i)

'''
Analysis:
Suppose I had written this query instead:
SELECT DISTINCT product_id 
FROM Sales_data 
WHERE sales_date <= ADD_MONTHS(SYSDATE, -7)

This returns every product that has at least one sale older than 7 months. It does not exclude products that were also sold within the last 7 months, so a product with even one recent sale can still appear in the result.
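
The difference between the two filters can be reproduced on a tiny in-memory SQLite table. This is a sketch with made-up rows and a fixed cutoff string standing in for ADD_MONTHS(SYSDATE, -7); the names demo_conn and cutoff are illustrative, and SQLite is used only so the example runs without an Oracle instance.

```python
import sqlite3

demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE Sales_data (product_id TEXT, sales_date TEXT)")
demo_conn.executemany(
    "INSERT INTO Sales_data VALUES (?, ?)",
    [
        ("P1", "2024-01-10"),  # only an old sale
        ("P2", "2024-02-05"),  # an old sale ...
        ("P2", "2024-12-20"),  # ... and a recent one
    ],
)
cutoff = "2024-06-15"  # hypothetical "7 months ago"; ISO dates compare correctly as text

# Naive filter: any product with at least one old sale
naive = demo_conn.execute(
    "SELECT DISTINCT product_id FROM Sales_data "
    "WHERE sales_date <= ? ORDER BY product_id",
    (cutoff,),
).fetchall()

# Correct filter: products with no sale on or after the cutoff
correct = demo_conn.execute(
    """SELECT DISTINCT product_id FROM Sales_data
       WHERE product_id NOT IN (
           SELECT product_id FROM Sales_data WHERE sales_date >= ?)
       ORDER BY product_id""",
    (cutoff,),
).fetchall()

print(naive)    # [('P1',), ('P2',)]  P2 is wrongly included
print(correct)  # [('P1',)]
```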
'''

('P6',)
('P8',)
('P5',)
('P1',)
('P2',)
('P3',)
('P9',)
('P10',)
('P4',)
('P7',)


In [None]:
# Berribot interview questions on ETL and PySpark for Wipro
# Script to schedule an AWS Databricks Spark job daily at midnight
'''
To schedule an AWS Databricks Spark job to run daily at midnight, use Databricks Workflows (preferred) or an external scheduler, such as a CloudWatch schedule rule that invokes an AWS Lambda function, or cron on an EC2 instance. Each approach is outlined below:

1. Using Databricks Workflows (Recommended)
Create a Databricks Job:
- Go to Databricks Workspace.
- Navigate to Workflows > Jobs.
- Click Create Job.
- Add your notebook or Spark job details.
- Configure job clusters and libraries if required.

Schedule the Job:
- In the Schedule tab, set the schedule to Daily with start time 00:00.
- Save the workflow.

2. Using AWS CloudWatch with Databricks REST API
Prerequisites:
- A Databricks personal access token.
- Databricks job ID or notebook path.

Python Script:

import requests

# Databricks workspace URL
DATABRICKS_HOST = "https://<databricks-instance>.cloud.databricks.com"

# Databricks personal access token (load it from a secret store in practice)
DATABRICKS_TOKEN = "<your-databricks-token>"

# ID of the job to run
JOB_ID = "<your-job-id>"

def trigger_job():
    # Start an immediate run of the job through the Jobs API
    url = f"{DATABRICKS_HOST}/api/2.1/jobs/run-now"
    headers = {
        "Authorization": f"Bearer {DATABRICKS_TOKEN}",
        "Content-Type": "application/json"
    }
    payload = {"job_id": JOB_ID}

    # A timeout keeps the script from hanging if the workspace is unreachable
    response = requests.post(url, json=payload, headers=headers, timeout=30)
    if response.status_code == 200:
        print("Job triggered successfully.")
    else:
        print(f"Failed to trigger job: {response.status_code} {response.text}")

# Run the function
trigger_job()

Schedule via AWS CloudWatch (scheduled rules are now part of Amazon EventBridge):
- Open AWS Management Console.
- Go to CloudWatch > Rules.
- Create a new rule:
  - Event Source: Schedule.
  - Schedule Expression: cron(0 0 * * ? *) (daily at 00:00 UTC).
  - Target: AWS Lambda.
- Add the Python script to your Lambda function.
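
A Lambda function needs a handler entry point, and the requests package is not part of the default Lambda Python runtime unless it is bundled with the deployment. The sketch below is one possible adaptation that uses only the standard library. The environment variable names (DATABRICKS_HOST, DATABRICKS_TOKEN, DATABRICKS_JOB_ID) and the helper build_run_now_request are assumptions made for illustration.

```python
import json
import os
import urllib.request

def build_run_now_request(host, token, job_id):
    """Build the POST request for the Jobs API run-now endpoint."""
    payload = json.dumps({"job_id": job_id}).encode("utf-8")
    return urllib.request.Request(
        url=f"{host}/api/2.1/jobs/run-now",
        data=payload,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

def lambda_handler(event, context):
    # Settings are read from (hypothetical) Lambda environment variables
    request = build_run_now_request(
        os.environ["DATABRICKS_HOST"],
        os.environ["DATABRICKS_TOKEN"],
        int(os.environ["DATABRICKS_JOB_ID"]),
    )
    # urlopen raises an HTTPError on non-2xx responses, which fails the invocation
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read())
```

Separating request construction from the network call keeps the handler easy to test without reaching a real workspace.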

3. Using Crontab on EC2
To run the script from an EC2 instance instead:

- SSH into the EC2 instance.
- Edit the crontab:

crontab -e

- Add the following entry to trigger the script at midnight (in the instance's local time zone):

0 0 * * * python3 /path/to/trigger_databricks_job.py
'''


# Write an AWS Databricks notebook to calculate a rolling average over time periods for a time series dataset.

# Input: time period '7 days', dataset 'stock_prices'

# Output: DataFrame with 7-day rolling averages

'''The following Databricks notebook code calculates the 7-day rolling average for a time series dataset (stock_prices) using PySpark.

Code for the Databricks Notebook

from pyspark.sql import SparkSession
from pyspark.sql.functions import avg, col, to_date
from pyspark.sql.window import Window

# Initialize SparkSession (this is usually already set up in Databricks)
spark = SparkSession.builder.appName("RollingAverage").getOrCreate()

# Sample dataset: Replace this with your actual dataset
data = [
    ("2025-01-01", 100),
    ("2025-01-02", 110),
    ("2025-01-03", 105),
    ("2025-01-04", 120),
    ("2025-01-05", 130),
    ("2025-01-06", 125),
    ("2025-01-07", 140),
    ("2025-01-08", 145),
    ("2025-01-09", 150),
    ("2025-01-10", 155)
]

# Define column names and create a DataFrame
columns = ["date", "price"]
stock_prices_df = spark.createDataFrame(data, schema=columns)

# Convert the 'date' column from string to DateType
stock_prices_df = stock_prices_df.withColumn("date", to_date(col("date"), "yyyy-MM-dd"))

# Window of the current row plus the 6 preceding rows, ordered by date.
# This equals 7 calendar days only when there is exactly one row per day.
# No partitionBy is used, so Spark moves all rows into a single partition.
rolling_window = Window.orderBy("date").rowsBetween(-6, 0)

# Add rolling average column
result_df = stock_prices_df.withColumn("7_day_avg", avg("price").over(rolling_window))

# Show the results
result_df.show()

Explanation of the Code
1. Dataset Initialization: Replace the data variable with your actual dataset from stock_prices.
2. Date Conversion: Convert the date column to DateType for proper time series handling.
3. Window Specification: Window.orderBy("date").rowsBetween(-6, 0) covers the current row and the 6 previous rows, which is a 7-day window when the data has one row per day.
4. Rolling Average: The avg function computes the average of price over that window.
5. Result: A new column 7_day_avg is added to the DataFrame showing the rolling average.

Sample Output
date	price	7_day_avg
2025-01-01	100	100.0
2025-01-02	110	105.0
2025-01-03	105	105.0
2025-01-04	120	108.75
2025-01-05	130	113.0
2025-01-06	125	115.0
2025-01-07	140	118.57
2025-01-08	145	125.0
2025-01-09	150	130.71
2025-01-10	155	137.86
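
The row-based window can be cross-checked without Spark. The sketch below is a plain-Python equivalent of rowsBetween(-6, 0) applied to the sample prices; the helper name rolling_average is illustrative, and the values are rounded to two decimals for display.

```python
from collections import deque

def rolling_average(values, window=7):
    """Average of the current value and up to window - 1 preceding values."""
    recent = deque(maxlen=window)  # old values drop out automatically
    averages = []
    for value in values:
        recent.append(value)
        averages.append(round(sum(recent) / len(recent), 2))
    return averages

prices = [100, 110, 105, 120, 130, 125, 140, 145, 150, 155]
print(rolling_average(prices))
# [100.0, 105.0, 105.0, 108.75, 113.0, 115.0, 118.57, 125.0, 130.71, 137.86]
```

The first six averages use fewer than 7 values because the window is still filling, which matches how Spark treats rows near the start of the ordering.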
'''


# Partition a large DataFrame in Databricks by a given column and write to different Parquet files.

# Input: column: 'region'
# Output: Partitioned Parquet files by 'region'

# Eg 2 Input: column: 'category'
# Output: Partitioned Parquet files by 'category'

'''
Here’s how to partition a large DataFrame in Databricks by a given column (e.g., region or category) and write it to different Parquet files:

Code for Partitioning and Writing to Parquet

from pyspark.sql import SparkSession

# Initialize SparkSession (this is usually pre-configured in Databricks)
spark = SparkSession.builder.appName("PartitionedParquet").getOrCreate()

# Sample dataset: Replace this with your actual dataset
data = [
    ("East", "Electronics", 100),
    ("West", "Furniture", 200),
    ("East", "Clothing", 150),
    ("North", "Electronics", 300),
    ("South", "Clothing", 250),
    ("West", "Electronics", 400),
    ("East", "Furniture", 350)
]

# Define schema and create a DataFrame
columns = ["region", "category", "sales"]
df = spark.createDataFrame(data, schema=columns)

# Aside: instead of passing column names, an explicit schema can be supplied.
# Separate variable names are used here so the df created above is not overwritten.
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Define schema explicitly
people_schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True)
])

people_data = [("Alice", 25), ("Bob", 30)]

# Apply the schema
people_df = spark.createDataFrame(people_data, people_schema)
people_df.printSchema()

# Partition column: Replace 'region' with 'category' as needed
partition_column = "region"  # Change to "category" for the second example

# Output path for Parquet files (e.g., S3, DBFS)
output_path = "dbfs:/mnt/output/partitioned_data"  # Change as required

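# Write DataFrame to partitioned Parquet files.
# The default save mode raises an error if output_path already exists;
# add .mode("overwrite") before .parquet(...) to replace existing data.
df.write.partitionBy(partition_column).parquet(output_path)

print(f"Data written to Parquet files partitioned by {partition_column}")
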
Explanation
1. Input Dataset: Replace the sample data with your actual DataFrame from a source like S3, Delta Lake, or a SQL database.
2. Partitioning Column: Set partition_column to the column by which you want to partition the Parquet files (e.g., region or category).
3. Output Path: Specify the output_path where the partitioned Parquet files should be written. This could be:
   - dbfs:/mnt/output/... for Databricks File System.
   - s3://bucket-name/... for AWS S3.
4. Partition and Write: partitionBy() groups the data by the specified column and creates a folder for each unique value in it. The Parquet files for each partition are written inside the corresponding folder.

Output Directory Structure
If partition_column = 'region', the Parquet files are saved as follows (file names are simplified here; Spark generates longer names that include a unique ID and the compression codec):


/mnt/output/partitioned_data/
├── region=East/
│   ├── part-00000.parquet
│   └── part-00001.parquet
├── region=West/
│   ├── part-00000.parquet
│   └── part-00001.parquet
├── region=North/
│   └── part-00000.parquet
└── region=South/
    └── part-00000.parquet
For partition_column = 'category', the folder names will follow category=Electronics, category=Furniture, etc.
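
How rows map to these column=value folders can be illustrated in plain Python. This is not Spark code, only a sketch of the layout rule: the helper partition_layout is hypothetical, and it mirrors the fact that the partition column's value moves into the folder name instead of being stored inside the data files.

```python
from collections import defaultdict

def partition_layout(rows, columns, partition_column, base_path):
    """Group rows under Hive-style 'column=value' directories."""
    key_index = columns.index(partition_column)
    layout = defaultdict(list)
    for row in rows:
        directory = f"{base_path}/{partition_column}={row[key_index]}"
        # The partition value lives in the path, so it is dropped from the row
        remaining = tuple(v for i, v in enumerate(row) if i != key_index)
        layout[directory].append(remaining)
    return dict(layout)

rows = [
    ("East", "Electronics", 100),
    ("West", "Furniture", 200),
    ("East", "Clothing", 150),
    ("North", "Electronics", 300),
    ("South", "Clothing", 250),
    ("West", "Electronics", 400),
    ("East", "Furniture", 350),
]
layout = partition_layout(rows, ["region", "category", "sales"],
                          "region", "/mnt/output/partitioned_data")
for directory in sorted(layout):
    print(directory, layout[directory])
```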

Verify Written Data
To verify the Parquet files:


# Load the partitioned Parquet files
partitioned_df = spark.read.parquet(output_path)

# Show data to verify
partitioned_df.show()
'''


In [None]:
# You are a data engineer. Explain this project in detail, using PySpark and Python pandas (no SQL) wherever required, with a focus on the Horizon Insights Oracle database.
'''
This note explains the Horizon Insights data engineering project and suggests an implementation using PySpark and pandas, covering the architecture first and then code examples.

The project appears to be a comprehensive data platform that has been in service since 2017, with 800+ regular users, an average throughput of 5 queries/second, and 18TB+ of accumulated data. Its key components are:

1. Data Sources & Ingestion:
Let's handle the various data sources using PySpark:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import *

# Initialize Spark Session
spark = SparkSession.builder \
    .appName("HorizonInsights") \
    .config("spark.jars", "oracle-jdbc.jar") \
    .getOrCreate()

# Read raw data for a given source system into a DataFrame
def ingest_service_data(source_type, path):
    if source_type in ("jenkins", "jira"):
        # Jenkins data and Jira REST API exports are both read as JSON
        df = spark.read.json(path)
    elif source_type == "qtest":
        # qTest data is read as CSV with a header row
        df = spark.read.format("csv").option("header", "true").load(path)
    else:
        # Fail fast instead of returning an undefined DataFrame
        raise ValueError(f"Unsupported source type: {source_type}")
    return df
```

2. Data Transformation & Loading:
Given the metrics (140K+ stories/month, 18K+ bugs/month), we need robust transformation:

```python
import pandas as pd

def transform_metrics_data(df, metric_type):
    if metric_type == "stories":
        # Transform stories data
        df = df.withColumn("month", month(col("created_date"))) \
               .withColumn("year", year(col("created_date"))) \
               .groupBy("month", "year") \
               .agg(count("story_id").alias("story_count"))
    
    elif metric_type == "jenkins_jobs":
        # Transform Jenkins jobs data
        df = df.withColumn("status", when(col("result") == "SUCCESS", 1).otherwise(0)) \
               .groupBy("job_name") \
               .agg(
                   sum("status").alias("successful_builds"),
                   count("*").alias("total_builds")
               )
    return df
```

3. Oracle Database Integration:
For the Horizon Insights DB, which appears to be Oracle-based:

```python
def write_to_oracle(df, table_name):
    # Oracle connection properties (placeholders: replace with real values)
    oracle_properties = {
        "driver": "oracle.jdbc.driver.OracleDriver",
        "url": "jdbc:oracle:thin:@//hostname:port/service",
        "user": "username",
        "password": "password"
    }
    
    # Append the DataFrame to the Oracle table over JDBC
    df.write \
        .format("jdbc") \
        .option("url", oracle_properties["url"]) \
        .option("driver", oracle_properties["driver"]) \
        .option("dbtable", table_name) \
        .option("user", oracle_properties["user"]) \
        .option("password", oracle_properties["password"]) \
        .mode("append") \
        .save()
```

4. Reporting Pipeline:
For the reporting requirements (MetricsHQ, SDLC, Compliance):

```python
def generate_metrics_report():
    # Monthly Jenkins job counts (assumes a jenkins_jobs table or temp view is registered)
    jenkins_metrics = spark.sql("""
        SELECT 
            DATE_TRUNC('month', timestamp) as month,
            COUNT(*) as total_jobs,
            SUM(CASE WHEN status = 'SUCCESS' THEN 1 ELSE 0 END) as successful_jobs
        FROM jenkins_jobs
        GROUP BY DATE_TRUNC('month', timestamp)
    """)
    
    # Convert to pandas once (each toPandas() call runs the query again)
    metrics_df = jenkins_metrics.toPandas()
    
    # Success rate per month
    metrics_df['jenkins_success_rate'] = metrics_df['successful_jobs'] / \
                                         metrics_df['total_jobs']
    
    return metrics_df
```

Key Features to Note:
1. Volume Handling:
- Processing 470K+ Jenkins jobs/month
- Managing 350K+ Jobs/Month from Tower
- Handling 300K+ qTest runs/month

2. Performance Optimization:
```python
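# Shuffle partitions can be changed on a running session
spark.conf.set("spark.sql.shuffle.partitions", "200")
# Executor and driver memory are fixed when the application starts, so set them at launch,
# e.g. spark-submit --executor-memory 8g --driver-memory 4g, or in the cluster configuration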
```

3. Data Quality Checks:
```python
def data_quality_check(df, table_name):
    # Check for nulls
    null_counts = df.select([count(when(col(c).isNull(), c)).alias(c) for c in df.columns])
    
    # Check for duplicates
    duplicate_count = df.count() - df.dropDuplicates().count()
    
    # Log quality metrics
    print(f"Quality checks for {table_name}")
    null_counts.show()
    print(f"Duplicate records: {duplicate_count}")
```
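
For small extracts, or for unit-testing the logic, the same two checks can be expressed in plain Python. The sketch below assumes rows arrive as a list of dictionaries with hashable values; the helper name quality_summary is illustrative and is not part of the project.

```python
from collections import Counter

def quality_summary(rows):
    """Return per-column null counts and the number of duplicate rows."""
    columns = list(rows[0]) if rows else []
    # Count missing values per column
    null_counts = {c: sum(1 for r in rows if r.get(c) is None) for c in columns}
    # Rows are duplicates when every column value matches
    row_counts = Counter(tuple(sorted(r.items())) for r in rows)
    duplicates = sum(n - 1 for n in row_counts.values())
    return null_counts, duplicates

sample = [
    {"job_id": 1, "result": "SUCCESS"},
    {"job_id": 1, "result": "SUCCESS"},  # exact duplicate of the first row
    {"job_id": 2, "result": None},       # missing result
]
print(quality_summary(sample))  # ({'job_id': 0, 'result': 1}, 1)
```

The duplicate count follows the same definition as the Spark version: total rows minus distinct rows.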

The system seems to be processing a significant amount of DevOps and testing data, with various downstream consumers like Jira, Release Manager, and TechABS. The architecture uses SSIS for data ingestion, suggesting this is likely an enterprise-level solution.

'''