# PySpark SQL
---

## Spark SQL
- Module in Apache Spark for structured data processing
- Allows us to run SQL queries alongside data processing tasks
- Seamless combination of Python and SQL in one application
- Works through the DataFrame interface: SQL queries run against DataFrames and return DataFrames

### Starting with Spark SQL
- Initialize a session
- Create a DataFrame

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Spark SQL Example").getOrCreate()

# data
# ... data about people: `data` holds the rows, `columns` the column names
df = spark.createDataFrame(data, schema=columns)
```
- Create a temp view
    - Temporary views exist only for the current session, making them ideal for quick, session-based exploration
```python
df.createOrReplaceTempView("people")
```
- Query using SQL
```python
result = spark.sql("SELECT Name, Age FROM people WHERE Age > 30")
result.show()
```
Note that the query refers to the DataFrame by the view name it was registered under, `"people"`.

### Temp Views

- Temp views let you run analytics with SQL without changing the underlying data: a query returns a new DataFrame and leaves the source DataFrame as it was
- Loading from a CSV file

```python
df = spark.read.csv("path")  # pass header=True if the first row holds column names
df.createOrReplaceTempView("employees")  # This allows SQL-based interaction
```

## Casting

Suppose a column of integers was loaded as strings. Cast it to an integer type before running numeric aggregations such as `sum`:
```python
data = [('HR', '3000'), ('IT', '40000'), ('Finance', '350000')]
columns = ["Department", "Salary"]

df = spark.createDataFrame(data, schema=columns)

# Replace the string column with an integer version of itself
df = df.withColumn("Salary", df["Salary"].cast("int"))

# The sum now operates on integers
df.groupBy("Department").sum("Salary").show()
```