## Intro PySpark Example

In [1]:
# Import SparkSession, the entry point to the DataFrame API
from pyspark.sql import SparkSession

# Create a Spark session
spark = SparkSession.builder.appName("PySparkExample").getOrCreate()

# Create a DataFrame with sample data
data = [("Big Data", 10), ("Machine Learning", 10), ("Deep Learning", 10)]
df = spark.createDataFrame(data, ["Course", "Score"])

# Show the DataFrame
df.show()

# Stop the Spark session
spark.stop()

Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
25/12/08 15:06:15 WARN Utils: Your hostname, Xuan-Nguyet-Nguyen.local, resolves to a loopback address: 127.0.0.1; using 172.21.56.185 instead (on interface en0)
25/12/08 15:06:15 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/12/08 15:06:17 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
                                                                                

+----------------+-----+
|          Course|Score|
+----------------+-----+
|        Big Data|   10|
|Machine Learning|   10|
|   Deep Learning|   10|
+----------------+-----+



## PySpark CSV Experiment

This section demonstrates how to import multiple CSV files into PySpark, perform basic transformations, and write the results back to local storage.


### 1. Spark Session

In [16]:
# Import SparkSession, the entry point to the DataFrame API
from pyspark.sql import SparkSession

# Create a Spark session
spark = SparkSession.builder.appName("PySparkCSVExperiment").getOrCreate()

### 2. Load CSV Files into PySpark DataFrames

In [19]:
# Load each CSV file into a PySpark DataFrame, treating the first line as the header
customers_100 = spark.read.option("header", True).csv("customers-100.csv")
customers_1000 = spark.read.option("header", True).csv("customers-1000.csv")

In [20]:
# Show the first 5 rows of DataFrame 1
print("DataFrame 1:")
customers_100.show(5)

DataFrame 1:
+-----+---------------+----------+---------+--------------------+-----------------+--------------------+--------------------+--------------------+--------------------+-----------------+--------------------+
|Index|    Customer Id|First Name|Last Name|             Company|             City|             Country|             Phone 1|             Phone 2|               Email|Subscription Date|             Website|
+-----+---------------+----------+---------+--------------------+-----------------+--------------------+--------------------+--------------------+--------------------+-----------------+--------------------+
|    1|DD37Cf93aecA6Dc|    Sheryl|   Baxter|     Rasmussen Group|     East Leonard|               Chile|        229.077.5154|    397.884.0519x718|zunigavanessa@smi...|       2020-08-24|http://www.stephe...|
|    2|1Ef7b82A4CAAD10|   Preston|   Lozano|         Vega-Gentry|East Jimmychester|            Djibouti|          5153435776|    686-620-1820x944|     vmata@co

In [21]:
# Show the first 5 rows of DataFrame 2
print("DataFrame 2:")
customers_1000.show(5)

DataFrame 2:
+-----+---------------+----------+---------+--------------------+----------------+----------------+--------------------+-------------------+--------------------+-----------------+--------------------+
|Index|    Customer Id|First Name|Last Name|             Company|            City|         Country|             Phone 1|            Phone 2|               Email|Subscription Date|             Website|
+-----+---------------+----------+---------+--------------------+----------------+----------------+--------------------+-------------------+--------------------+-----------------+--------------------+
|    1|dE014d010c7ab0c|    Andrew|  Goodman|       Stewart-Flynn|     Rowlandberg|           Macao|   846-790-4623x4715|(422)787-2331x71127|marieyates@gomez-...|       2021-07-26|http://www.shea.biz/|
|    2|2B54172c8b65eC3|     Alvin|     Lane|Terry, Proctor an...|        Bethside|Papua New Guinea|  124-597-8652x05682|  321.441.0588x6218|alexandra86@mccoy...|       2021-06-24|http

### 3. Inspect Schema

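Check the schema to understand column names and types. Every column comes back as `string` because the files were read without the `inferSchema` option or an explicit schema.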

In [23]:
print("Schema of DataFrame 1:")
customers_100.printSchema()

print("Schema of DataFrame 2:")
customers_1000.printSchema()

Schema of DataFrame 1:
root
 |-- Index: string (nullable = true)
 |-- Customer Id: string (nullable = true)
 |-- First Name: string (nullable = true)
 |-- Last Name: string (nullable = true)
 |-- Company: string (nullable = true)
 |-- City: string (nullable = true)
 |-- Country: string (nullable = true)
 |-- Phone 1: string (nullable = true)
 |-- Phone 2: string (nullable = true)
 |-- Email: string (nullable = true)
 |-- Subscription Date: string (nullable = true)
 |-- Website: string (nullable = true)

Schema of DataFrame 2:
root
 |-- Index: string (nullable = true)
 |-- Customer Id: string (nullable = true)
 |-- First Name: string (nullable = true)
 |-- Last Name: string (nullable = true)
 |-- Company: string (nullable = true)
 |-- City: string (nullable = true)
 |-- Country: string (nullable = true)
 |-- Phone 1: string (nullable = true)
 |-- Phone 2: string (nullable = true)
 |-- Email: string (nullable = true)
 |-- Subscription Date: string (nullable = true)
 |-- Website: string (

### 4. Combine DataFrames

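Both files share the same twelve columns in the same order, so we can stack their rows with `union`, which matches columns by position rather than by name.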

In [24]:
df_combined = customers_100.union(customers_1000)
print("Combined DataFrame:")
df_combined.show(5)

Combined DataFrame:
+-----+---------------+----------+---------+--------------------+-----------------+--------------------+--------------------+--------------------+--------------------+-----------------+--------------------+
|Index|    Customer Id|First Name|Last Name|             Company|             City|             Country|             Phone 1|             Phone 2|               Email|Subscription Date|             Website|
+-----+---------------+----------+---------+--------------------+-----------------+--------------------+--------------------+--------------------+--------------------+-----------------+--------------------+
|    1|DD37Cf93aecA6Dc|    Sheryl|   Baxter|     Rasmussen Group|     East Leonard|               Chile|        229.077.5154|    397.884.0519x718|zunigavanessa@smi...|       2020-08-24|http://www.stephe...|
|    2|1Ef7b82A4CAAD10|   Preston|   Lozano|         Vega-Gentry|East Jimmychester|            Djibouti|          5153435776|    686-620-1820x944|     v

### 5. Basic Transformation

#### 5.1. Filtered Data

Keep only the customers whose `Country` is Chile.

In [27]:
df_combined.filter(df_combined.Country == "Chile").show()

+-----+---------------+----------+----------+------------------+------------+-------+--------------------+--------------------+--------------------+-----------------+--------------------+
|Index|    Customer Id|First Name| Last Name|           Company|        City|Country|             Phone 1|             Phone 2|               Email|Subscription Date|             Website|
+-----+---------------+----------+----------+------------------+------------+-------+--------------------+--------------------+--------------------+-----------------+--------------------+
|    1|DD37Cf93aecA6Dc|    Sheryl|    Baxter|   Rasmussen Group|East Leonard|  Chile|        229.077.5154|    397.884.0519x718|zunigavanessa@smi...|       2020-08-24|http://www.stephe...|
|  199|7fB6124FC680839|     Cindy|Valenzuela|         Rojas LLC|  Maychester|  Chile|+1-860-035-9154x2...|001-489-685-6257x790|maryforbes@oliver...|       2020-04-13|http://www.holmes...|
|  579|895c3d6c2B7f017|    Sergio|   Marquez|      Arroyo-Br

#### 5.2. Count Data

Count customers per city. No sort is applied, so the order of the rows in the output is not defined.

In [30]:
df_combined.groupBy("City").count().show(10)

+-----------------+-----+
|             City|count|
+-----------------+-----+
|        Burchbury|    1|
|        Huangfort|    1|
|    Zimmermanland|    1|
|       Selenabury|    1|
|        Coreybury|    1|
|        Judymouth|    1|
|   North Kerriton|    1|
|       Thomasfurt|    1|
|North Jillianview|    1|
|       North Drew|    1|
+-----------------+-----+
only showing top 10 rows


#### 5.3. Sort Data

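Sort by `Subscription Date` in descending order. The column is a string, but dates written as `yyyy-MM-dd` sort lexicographically in the same order as they do chronologically.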

In [31]:
df_combined.orderBy("Subscription Date", ascending=False).show(5)

+-----+---------------+----------+---------+--------------------+-----------------+--------------------+--------------------+--------------------+--------------------+-----------------+--------------------+
|Index|    Customer Id|First Name|Last Name|             Company|             City|             Country|             Phone 1|             Phone 2|               Email|Subscription Date|             Website|
+-----+---------------+----------+---------+--------------------+-----------------+--------------------+--------------------+--------------------+--------------------+-----------------+--------------------+
|  198|a8FfE4fbd7910b9|   Bethany|  Barrera|Swanson, Figueroa...|       Vickietown|South Georgia and...|    001-411-057-3486|          6232251109|rhonda48@castro.info|       2022-05-29|http://www.cortez...|
|  498|E040edB499A6132|    Amanda|   Santos|        Camacho-Lamb|      Freemanberg| Antigua and Barbuda|   092.983.8391x0219|  626-158-4763x92618|slivingston@cherr...|     

### 6. Save Data Locally

In [32]:
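# Save the combined data. Spark writes a directory of part files, not a single CSV.
# Include the header row so the column names are kept in the output.
df_combined.write.mode("overwrite").option("header", True).csv("output/customers_combined")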

### 7. Stop Spark

In [33]:
spark.stop()