## 🔧 1. Installation

You can install PySpark via pip:
```
pip install pyspark
```
To verify the installation, run the following command in Python:

In [2]:
import pyspark
print(pyspark.__version__)

3.5.4


Before initializing PySpark, make sure Java 8 or later is installed on your computer. Newer Java releases (9+) can sometimes cause issues; Spark 3.5 runs on Java 8, 11, and 17, so those are the safest choices. To check your Java version, run:
```{sh}
java -version
```
You should see something like:
```
java version "1.8.0_281"
```
If Java is not installed, install it through one of these links:
-  MacBook: [Install Java on MacBook](https://www.java.com/en/download/) (Check whether your Mac has an Intel or Apple M-series CPU first)
-  Windows: [Install Java on PC](https://www.java.com/download/ie_manual.jsp)
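
The version string can be confusing because Java 8 reports itself as `1.8`, while newer releases report `11`, `17`, and so on. The sketch below shows one way to read the major version from that string. The helper name `java_major_version` is hypothetical and not part of PySpark. To feed it real output, you could capture `subprocess.run(["java", "-version"], capture_output=True, text=True).stderr`, since `java -version` prints to standard error.

```python
import re


def java_major_version(version_output: str) -> int:
    """Extract the major Java version from `java -version` output (hypothetical helper)."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        raise ValueError("no Java version string found")
    first, second = match.group(1), match.group(2)
    # Legacy scheme: "1.8.0_281" means Java 8; modern scheme: "17.0.2" means Java 17
    if first == "1" and second is not None:
        return int(second)
    return int(first)


print(java_major_version('java version "1.8.0_281"'))             # legacy scheme
print(java_major_version('openjdk version "17.0.2" 2022-01-18'))  # modern scheme
```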

## 🚀 2. Start a Spark Session

In [3]:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("PySpark_Tutorial") \
    .master("local[*]") \
    .getOrCreate()

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/05/18 23:15:28 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


## 📂 3. Load and Explore Data
The data required for this example is saved at s3://de300spring2025/dinglin_xia/data/adult.csv.

Assuming you have downloaded the file as "adult.csv" into a local `data` folder (the `DATA_FOLDER` used below), we'll walk through several important steps:

1. Feature Engineering

2. Bias Analysis on Marital Status

3. Joining with Supplemental Gender Data

4. Exporting Final Data

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, FloatType
from pyspark.sql.dataframe import DataFrame
from pyspark.sql.functions import col, when, isnan, isnull, count, avg, trim
import os

In [None]:
DATA_FOLDER = "data"
# source https://www.statista.com/statistics/242030/marital-status-of-the-us-population-by-sex/
# In each row, the first value is for males and the second is for females
MARITAL_STATUS_BY_GENDER = [
    ["Never-married", 47.35, 41.81],
    ["Married-AF-spouse", 67.54, 68.33],
    ["Widowed", 3.58, 11.61],
    ["Divorced", 10.82, 15.09]
]
MARITAL_STATUS_BY_GENDER_COLUMNS = ["marital_status_statistics", "male", "female"]

In [None]:
def read_data(spark: SparkSession) -> DataFrame:
    """
    read data based on the given schema; this is much faster than spark determining the schema
    """
    
    # Define the schema for the dataset
    schema = StructType([
        StructField("age", IntegerType(), True),
        StructField("workclass", StringType(), True),
        StructField("fnlwgt", FloatType(), True),
        StructField("education", StringType(), True),
        StructField("education_num", FloatType(), True),
        StructField("marital_status", StringType(), True),
        StructField("occupation", StringType(), True),
        StructField("relationship", StringType(), True),
        StructField("race", StringType(), True),
        StructField("sex", StringType(), True),
        StructField("capital_gain", FloatType(), True),
        StructField("capital_loss", FloatType(), True),
        StructField("hours_per_week", FloatType(), True),
        StructField("native_country", StringType(), True),
        StructField("income", StringType(), True)
    ])

    # Read the dataset
    data = spark.read \
        .schema(schema) \
        .option("header", "false") \
        .option("inferSchema", "false") \
        .csv(os.path.join(DATA_FOLDER,"*.csv")) 

    data = data.repartition(8)

    # Cast every float column to an integer column
    float_columns = [f.name for f in data.schema.fields if isinstance(f.dataType, FloatType)]
    for v in float_columns:
        data = data.withColumn(v, data[v].cast(IntegerType()))

    # Get the names of all StringType columns
    string_columns = [f.name for f in data.schema.fields if isinstance(f.dataType, StringType)]

    # Remove leading and trailing spaces in all string columns
    for column in string_columns:
        data = data.withColumn(column, trim(data[column]))

    # Show the first 5 rows of the dataset
    data.show(5)

    return data

In [None]:
def missing_values(data: DataFrame) -> DataFrame:
    """
    count the number of samples with missing values for each column
    remove such samples
    """

    missing_values = data.select([count(when(isnan(c) | isnull(c), c)).alias(c) for c in data.columns])

    # Show the missing values count per column
    missing_values.show()

    # Get the number of samples in the DataFrame
    num_samples = data.count()

    # Print the number of samples
    print("Number of samples:", num_samples)  

    data = data.dropna()      
    
    return data
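
The counting expression in `missing_values` works column by column, while `dropna()` works row by row. The plain-Python sketch below mimics both on a few hypothetical rows, with `None` standing in for a Spark null. It only illustrates the logic and is not how Spark executes it. The helper names are made up for this example.

```python
import math

# Hypothetical toy rows; None plays the role of a Spark null
rows = [
    {"age": 39, "workclass": "State-gov", "capital_gain": 2174.0},
    {"age": None, "workclass": "Private", "capital_gain": float("nan")},
    {"age": 52, "workclass": None, "capital_gain": 0.0},
]


def is_missing(value) -> bool:
    """True for None or a float NaN, mirroring isnan(c) | isnull(c)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def missing_per_column(rows):
    """Count missing entries column by column."""
    columns = rows[0].keys()
    return {c: sum(is_missing(r[c]) for r in rows) for c in columns}


def drop_incomplete(rows):
    """Keep only rows without any missing entry, roughly what dropna() does by default."""
    return [r for r in rows if not any(is_missing(v) for v in r.values())]


missing_counts = missing_per_column(rows)
complete_rows = drop_incomplete(rows)
print(missing_counts)      # one count per column
print(len(complete_rows))  # rows that survive the drop
```

Note that the number of dropped rows can be smaller than the sum of the per-column counts, because a single row may be missing several values at once.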

### 📌 Feature Engineering Function
📖 Explanation:
This function programmatically finds all integer features and creates new features by multiplying each pair. It helps capture interaction effects for models or exploratory analysis.

In [None]:
def feature_engineering(data: DataFrame) -> DataFrame:
    """
    Calculate the product of each pair of integer features
    """
    # Identify all integer-type columns in the dataset
    integer_columns = [f.name for f in data.schema.fields if isinstance(f.dataType, IntegerType)]
    
    # For each pair of integer columns, compute a new column that is their product
    for i, col1 in enumerate(integer_columns):
        for col2 in integer_columns[i:]:  # Start at i to avoid duplicate pairs (each column is also paired with itself)
            product_col_name = f"{col1}_x_{col2}"
            data = data.withColumn(product_col_name, col(col1) * col(col2))
    
    # Preview first 5 rows to check new columns
    data.show(5)
    
    return data
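
To see which columns the nested loop produces, the sketch below enumerates the same pairs with `itertools.combinations_with_replacement` on a hypothetical subset of three integer columns. Because the inner loop starts at index `i`, each column is also paired with itself, so `n` integer columns yield `n * (n + 1) / 2` new columns. With the six integer columns that `read_data` returns, that would be 21.

```python
from itertools import combinations_with_replacement

# Hypothetical subset of the integer columns, for illustration only
integer_columns = ["age", "education_num", "hours_per_week"]

# Same pairs as the nested loop over integer_columns[i:], self-pairs included
pairs = list(combinations_with_replacement(integer_columns, 2))
product_names = [f"{a}_x_{b}" for a, b in pairs]

n = len(integer_columns)
expected_count = n * (n + 1) // 2
print(product_names)
print(len(product_names), expected_count)
```

If squared features are not wanted, `itertools.combinations` (without replacement) would list only pairs of distinct columns.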

### 📌 Analyze Bias by Marital Status
📖 Explanation:
This function compares the average capital_gain across marital_status groups. It also filters and previews one specific subgroup, divorced individuals, as a starting point for further analysis or visualization.

In [None]:
def bias_marital_status(data: DataFrame):
    """
    Analyze if there's a bias in capital gain by marital status
    """
    # Group by marital status and compute average capital gain
    average_capital_gain = data.groupBy("marital_status").agg(avg("capital_gain").alias("average_capital_gain"))
    average_capital_gain.show()

    # Filter only rows with marital_status == "Divorced"
    divorced_data = data.filter(data.marital_status == "Divorced")
    divorced_data.show(5)

### 📌 Join with External Gender Statistics
📖 Explanation:
This function enriches the dataset by joining it with an external table of U.S. marital status statistics by gender. The `outer` join keeps unmatched records from both sides and fills their missing columns with nulls.

In [None]:
def join_with_US_gender(spark: SparkSession, data: DataFrame):
    """
    Join with external data about marital status statistics by gender
    """
    # Build a DataFrame from the constants MARITAL_STATUS_BY_GENDER and MARITAL_STATUS_BY_GENDER_COLUMNS defined above
    us_df = spark.createDataFrame(MARITAL_STATUS_BY_GENDER, MARITAL_STATUS_BY_GENDER_COLUMNS)

    # Outer join on marital_status
    return data.join(us_df, data.marital_status == us_df.marital_status_statistics, 'outer')
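
The effect of an outer join is easiest to see on a tiny example. The plain-Python sketch below imitates a full outer join on hypothetical toy rows: one person whose marital status is absent from the statistics, and one statistics row that matches no person. The function `full_outer_join` is written for this illustration only and is not a Spark API.

```python
# Hypothetical toy rows standing in for the two DataFrames
people = [
    {"age": 39, "marital_status": "Never-married"},
    {"age": 50, "marital_status": "Married-civ-spouse"},  # absent from the statistics
]
stats = [
    {"marital_status_statistics": "Never-married", "male": 47.35, "female": 41.81},
    {"marital_status_statistics": "Widowed", "male": 3.58, "female": 11.61},  # matches no person
]


def full_outer_join(left, right, left_key, right_key):
    """Naive full outer join on left[left_key] == right[right_key]."""
    left_cols = list(left[0].keys())
    right_cols = list(right[0].keys())
    joined, matched_right = [], set()
    for left_row in left:
        hits = [i for i, r in enumerate(right) if r[right_key] == left_row[left_key]]
        if hits:
            for i in hits:
                matched_right.add(i)
                joined.append({**left_row, **right[i]})
        else:
            # Left row without a partner: right-hand columns become None
            joined.append({**left_row, **dict.fromkeys(right_cols)})
    for i, right_row in enumerate(right):
        if i not in matched_right:
            # Right row without a partner: left-hand columns become None
            joined.append({**dict.fromkeys(left_cols), **right_row})
    return joined


joined = full_outer_join(people, stats, "marital_status", "marital_status_statistics")
for row in joined:
    print(row)
```

In the joined result, unmatched rows carry `None` in the columns of the other side, which is worth keeping in mind before computing statistics on the `male` and `female` columns.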

### ✅ Main Pipeline
📖 Explanation:
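This function ties all preprocessing steps together and writes the result in CSV format. Note that Spark saves `saved.csv` as a directory of part files rather than as a single file. You can run the pipeline by calling main() in a PySpark environment.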

In [None]:
def main():
    spark = SparkSession.builder.appName("Read Adult Dataset").getOrCreate()
    
    data = read_data(spark)               # Load dataset
    data = missing_values(data)           # Report and drop rows with missing values
    data = feature_engineering(data)      # Add interaction features
    bias_marital_status(data)             # Analyze capital gain by marital status
    data = join_with_US_gender(spark, data)  # Enrich with gender statistics
    
    data.show(5)                          # Preview final data
    data.write.format('csv').option('header', 'true').mode('overwrite').save('saved.csv')  # Save

In [None]:
main()
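
Because Spark writes one `part-*.csv` file per partition into the `saved.csv` directory, reading the result back outside Spark means collecting those part files. The sketch below is one possible way to do that with the standard library. The helper `read_spark_csv_dir` is hypothetical, and the directory it reads here is simulated with two small hand-written part files rather than real pipeline output.

```python
import csv
import glob
import os
import tempfile


def read_spark_csv_dir(path: str) -> list:
    """Read every part-*.csv file in a Spark output directory into a list of dicts."""
    rows = []
    for part in sorted(glob.glob(os.path.join(path, "part-*.csv"))):
        with open(part, newline="") as handle:
            # With option('header', 'true'), each part file starts with the header row
            rows.extend(csv.DictReader(handle))
    return rows


# Simulate a Spark output directory with two part files (hypothetical contents)
out_dir = os.path.join(tempfile.mkdtemp(), "saved.csv")
os.makedirs(out_dir)
for i, body in enumerate(["39,Never-married\n", "50,Divorced\n"]):
    part_path = os.path.join(out_dir, f"part-0000{i}-demo.csv")
    with open(part_path, "w", newline="") as handle:
        handle.write("age,marital_status\n" + body)

rows = read_spark_csv_dir(out_dir)
print(rows)
```

All values come back as strings here, since the `csv` module does not infer types. Within Spark, `spark.read.csv` accepts the directory path directly and reads all part files at once.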

## Lab Assignment
### Word Count
1. Save only the words whose count is greater than or equal to 3.