### Start with PySpark

In [0]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Spark DataFrames").getOrCreate()
print("Spark session created:", spark.version)

### Understanding what a DataFrame is

In [0]:
# Create a Spark DataFrame from a list of tuples and a list of column names
data = [('Anurag', 200000), ('Bhavesh', 30000), ('Chetan', 40000), ('Dhruv', 50000)]
column = ['Name', 'Salary']
spark_df = spark.createDataFrame(data, column)
display(spark_df)

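The notebook reports the following types for the two DataFrames (`df` is the pandas DataFrame created in the next cell):

- spark_df: pyspark.sql.connect.dataframe.DataFrame
  - Name: string
  - Salary: long
- df: pandas.core.frame.DataFrame
  - Name: object
  - Salary: int64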

In [0]:
import pandas as pd
data = [('Anurag', 200000), ('Bhavesh', 30000), ('Chetan', 40000), ('Dhruv', 50000)]
column = ['Name', 'Salary']
df = pd.DataFrame(data, columns=column)
display(df)

In [0]:
spark_df.printSchema()


In [0]:
print('Count of rows in spark_df:', spark_df.count())


### Real-life use cases
In practice, Spark DataFrames are used for:

- Ingesting data from files
- Transforming data at scale
- Writing results to Delta tables or Parquet files

In [0]:
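# Define an explicit schema instead of relying on type inference.
# Reuses the `data` list from the earlier cells; True marks a column as nullable.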
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
schema = StructType([StructField('Name', StringType(), True),
                     StructField('Salary', IntegerType(), True)])
spark_df = spark.createDataFrame(data, schema)
display(spark_df)
spark_df.printSchema()

**Q: What's the difference between SparkSession and SparkContext?**

- **SparkSession** is the unified entry point for the DataFrame and SQL APIs, introduced in Spark 2.0. It replaces the separate `SQLContext` and `HiveContext` entry points and wraps a `SparkContext`.
- **SparkContext** represents the connection to the cluster. It was the main entry point before Spark 2.0 and is still the entry point for low-level RDD operations.
- In modern Spark applications, you typically work with `SparkSession`, which creates and manages a `SparkContext` internally.

**Q: What is HiveContext?**

- **HiveContext** is the Spark 1.x entry point that extends `SQLContext` with Hive integration: it lets Spark run queries in Hive's query language (HiveQL), call Hive UDFs, and read Hive tables.
- It supports Hive features such as SerDes, UDFs, and the Hive metastore.
- Since Spark 2.0, `HiveContext` is deprecated; the same functionality is available through a `SparkSession` built with `enableHiveSupport()`.

In [0]:
json_data = [
    {"Name": "David", "Salary": 70000},
    {"Name": "Eva", "Salary": 65000}
]

df_json = spark.createDataFrame(json_data)
display(df_json)