# Setting Up and Exploring a Spark DataFrame in PySpark (on Mac M1)

This notebook shows how to download a dataset, create a Spark DataFrame, and explore its contents with PySpark. pandas reads the CSV file from a URL, and the result is then converted to a Spark DataFrame for distributed processing.

Important Note:
- Java and environment variables: Spark runs on the JVM, so on Apple Silicon the `JAVA_HOME` environment variable must point to a compatible OpenJDK installation. If Java is not installed yet, `brew install openjdk` installs it.
- Virtual environments: Use a virtual environment for PySpark projects. It isolates project-specific packages, avoids dependency conflicts, and keeps the rest of your development environment clean.
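The `JAVA_HOME` step can be made a little more defensive by checking that the directory really contains a Java binary before Spark starts. The sketch below is one way to do that with the standard library only. The helper name `find_java_home` and the second candidate path are assumptions for illustration, not part of the original notebook.

```python
import os
from pathlib import Path

# Candidate locations are assumptions; adjust them to match your machine.
CANDIDATE_JAVA_HOMES = [
    "/opt/homebrew/opt/openjdk",  # Homebrew on Apple Silicon (path used in this notebook)
    "/usr/local/opt/openjdk",     # Homebrew on Intel Macs
]


def find_java_home(candidates):
    """Return the first candidate directory that contains bin/java, else None."""
    for candidate in candidates:
        if (Path(candidate) / "bin" / "java").is_file():
            return candidate
    return None


java_home = find_java_home(CANDIDATE_JAVA_HOMES)
if java_home is None:
    print("No OpenJDK found in the candidate paths; install one or edit the list.")
else:
    os.environ["JAVA_HOME"] = java_home
    print(f"JAVA_HOME set to {java_home}")
```

Running this before creating the `SparkSession` surfaces a wrong path as a readable message instead of a Java gateway error later on.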

In [12]:
# Set the `JAVA_HOME` environment variable to point to the OpenJDK installation directory
# This is necessary for Spark to interact with the Java runtime on macOS with Apple Silicon
# Replace '/opt/homebrew/opt/openjdk' with your actual installation path if different
import os
os.environ['JAVA_HOME'] = '/opt/homebrew/opt/openjdk'

import pandas as pd

# Create a SparkSession named 'SparkBasics'
# This is the entry point for working with Spark SQL
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('SparkBasics').getOrCreate()

In [13]:
# Download and Load the HR Employee Dataset

# Use pandas to read the CSV data from a URL
df_pandas = pd.read_csv('https://github.com/YBIFoundation/BigData/raw/main/HR50k.csv')

# Convert the pandas DataFrame to a Spark DataFrame
df_spark = spark.createDataFrame(df_pandas)

## Explore the Spark DataFrame

show() is a method of Spark DataFrames. By default, it prints the first 20 rows in tabular format and truncates string values longer than 20 characters. This gives a quick look at the structure and contents of the data.

In [7]:
df_spark.show()

24/06/09 13:32:45 WARN SparkStringUtils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.

+---+---------+-----------------+---------+--------------------+----------------+---------+----------------+-------------+--------------+-----------------------+------+----------+--------------+--------+--------------------+---------------+-------------+-------------+-----------+------------------+------+--------+-----------------+-----------------+------------------------+-------------+----------------+-----------------+---------------------+---------------+--------------+------------------+-----------------------+--------------------+
|Age|Attrition|   BusinessTravel|DailyRate|          Department|DistanceFromHome|Education|  EducationField|EmployeeCount|EmployeeNumber|EnvironmentSatisfaction|Gender|HourlyRate|JobInvolvement|JobLevel|             JobRole|JobSatisfaction|MaritalStatus|MonthlyIncome|MonthlyRate|NumCompaniesWorked|Over18|OverTime|PercentSalaryHike|PerformanceRating|RelationshipSatisfaction|StandardHours|StockOptionLevel|TotalWorkingYears|TrainingTimesLastYear|WorkLifeBa

printSchema() is a method of Spark DataFrames. It prints the schema as a tree, which includes:
- Column names
- Data type of each column (e.g., long, string)
- Whether each column may contain null values

In [14]:
df_spark.printSchema()

root
 |-- Age: long (nullable = true)
 |-- Attrition: string (nullable = true)
 |-- BusinessTravel: string (nullable = true)
 |-- DailyRate: long (nullable = true)
 |-- Department: string (nullable = true)
 |-- DistanceFromHome: long (nullable = true)
 |-- Education: long (nullable = true)
 |-- EducationField: string (nullable = true)
 |-- EmployeeCount: long (nullable = true)
 |-- EmployeeNumber: long (nullable = true)
 |-- EnvironmentSatisfaction: long (nullable = true)
 |-- Gender: string (nullable = true)
 |-- HourlyRate: long (nullable = true)
 |-- JobInvolvement: long (nullable = true)
 |-- JobLevel: long (nullable = true)
 |-- JobRole: string (nullable = true)
 |-- JobSatisfaction: long (nullable = true)
 |-- MaritalStatus: string (nullable = true)
 |-- MonthlyIncome: long (nullable = true)
 |-- MonthlyRate: long (nullable = true)
 |-- NumCompaniesWorked: long (nullable = true)
 |-- Over18: string (nullable = true)
 |-- OverTime: string (nullable = true)
 |-- PercentSalaryHike: 

describe() is a method of Spark DataFrames. It computes summary statistics for the numeric and string columns of the DataFrame (for string columns, the mean and standard deviation are null). The statistics are:

- Count: The number of non-null values in the column
- Mean: The average value
- Stddev: The standard deviation
- Min: The minimum value
- Max: The maximum value

Because describe() returns a new DataFrame, show() is called on the result to display the statistics in tabular format.

In [15]:
df_spark.describe().show()

+-------+------------------+---------+--------------+-----------------+----------+------------------+------------------+----------------+-------------+-----------------+-----------------------+------+------------------+------------------+------------------+--------------------+------------------+-------------+------------------+-----------------+------------------+------+--------+------------------+-----------------+------------------------+-------------+-----------------+------------------+---------------------+------------------+----------------+------------------+-----------------------+--------------------+
|summary|               Age|Attrition|BusinessTravel|        DailyRate|Department|  DistanceFromHome|         Education|  EducationField|EmployeeCount|   EmployeeNumber|EnvironmentSatisfaction|Gender|        HourlyRate|    JobInvolvement|          JobLevel|             JobRole|   JobSatisfaction|MaritalStatus|     MonthlyIncome|      MonthlyRate|NumCompaniesWorked|Over18|OverTim

## Setting Spark configuration

- `spark.conf.set`: This method sets a runtime configuration property for the current Spark session.
- `spark.sql.repl.eagerEval.enabled`: This property controls how Spark DataFrames are displayed in interactive environments such as Jupyter notebooks.
- `True`: Enables eager evaluation. In a notebook or REPL, a DataFrame that is the last expression of a cell is then rendered as a table automatically, with no explicit call to `show()`.

In [17]:
spark.conf.set('spark.sql.repl.eagerEval.enabled', True)
df_spark

Age,Attrition,BusinessTravel,DailyRate,Department,DistanceFromHome,Education,EducationField,EmployeeCount,EmployeeNumber,EnvironmentSatisfaction,Gender,HourlyRate,JobInvolvement,JobLevel,JobRole,JobSatisfaction,MaritalStatus,MonthlyIncome,MonthlyRate,NumCompaniesWorked,Over18,OverTime,PercentSalaryHike,PerformanceRating,RelationshipSatisfaction,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
31,No,Non-Travel,158,Software,7,3,Medical,1,1,3,Male,42,2,3,Developer,1,Married,42682,298774,2,Y,No,20,4,1,80,2,15,1,2,12,4,10,11
38,No,Travel_Rarely,985,Human Resources,33,5,Life Sciences,1,2,1,Female,66,2,4,Healthcare Repres...,3,Single,45252,45252,8,Y,No,2,1,3,80,4,5,4,3,1,1,1,1
59,Yes,Non-Travel,1273,Sales,5,2,Technical Degree,1,3,4,Female,96,1,3,Manufacturing Dir...,2,Married,46149,507639,7,Y,Yes,39,3,2,80,2,9,5,1,6,6,4,3
52,Yes,Travel_Rarely,480,Support,2,5,Marketing,1,4,4,Female,71,2,4,Human Resources,1,Married,27150,27150,4,Y,No,16,3,2,80,2,22,4,4,10,9,5,6
32,No,Non-Travel,543,Human Resources,7,5,Human Resources,1,5,2,Male,122,3,3,Manager,2,Divorced,15894,47682,6,Y,Yes,42,3,4,80,2,30,3,4,29,27,9,7
19,Yes,Non-Travel,779,Hardware,43,1,Medical,1,6,2,Female,195,4,3,Research Director,3,Married,41552,1246560,3,Y,Yes,15,4,3,80,1,33,4,2,16,4,14,3
42,Yes,Non-Travel,934,Support,26,4,Human Resources,1,7,2,Female,80,3,5,Sales Executive,4,Divorced,5303,148484,3,Y,No,45,4,1,80,1,4,3,4,2,1,1,2
30,No,Travel_Rarely,380,Support,19,3,Marketing,1,8,4,Male,165,1,4,Human Resources,4,Single,28555,571100,2,Y,Yes,35,3,2,80,1,2,2,2,2,2,2,2
41,No,Travel_Frequently,1464,Software,16,1,Life Sciences,1,9,3,Male,134,1,2,Manager,4,Divorced,3241,87507,7,Y,No,1,1,3,80,2,8,1,2,2,1,2,2
45,No,Travel_Frequently,1020,Human Resources,17,5,Life Sciences,1,10,4,Female,137,2,4,Manager,2,Married,4323,116721,4,Y,Yes,32,1,3,80,4,6,4,4,5,3,4,1
