# Dataset Overview

Total Records: 50 students

Columns: 7 → id, name, age, gender, math, science, english

No missing values

# Demographics

Age: 18 – 25 years (average ≈ 21.5)

Gender: 29 Female, 21 Male

# Academic Performance

# Math:
Range: 40 – 100

Mean: 68.9

Std. Dev.: 17.6 (high variation)

# Science:
Range: 44 – 99

Mean: 70.2

Std. Dev.: 14.6 (moderate variation)

# English:

Range: 42 – 100

Mean: 69.4

Std. Dev.: 18.7 (highest variation)

# Key Insights

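Science is the strongest subject on average (mean 70.2), although the three subject means lie within 1.3 points of each other.

English has the most variation in performance (std. dev. 18.7); Science has the least (14.6).

Performance is not uniform across subjects: the spread of scores differs more than the averages do.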
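
One way to compare spread across subjects is the coefficient of variation (standard deviation divided by mean). This is an added illustration, not part of the original analysis. The sketch below uses only the means and standard deviations reported above; the `summary` and `cv` names are hypothetical.

```python
# Means and standard deviations as reported in the summary above.
summary = {
    "math":    {"mean": 68.9, "std": 17.6},
    "science": {"mean": 70.2, "std": 14.6},
    "english": {"mean": 69.4, "std": 18.7},
}

# Coefficient of variation = std / mean (unitless, so comparable across subjects).
cv = {subject: s["std"] / s["mean"] for subject, s in summary.items()}

highest_mean = max(summary, key=lambda subject: summary[subject]["mean"])
most_variable = max(cv, key=cv.get)

# Print subjects from most to least variable.
for subject, value in sorted(cv.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{subject:8s} CV = {value:.3f}")
print("Highest mean:", highest_mean)
print("Most variable:", most_variable)
```

With these figures the ranking by relative spread is English, then Math, then Science, which agrees with the insights listed above.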

In [1]:
sc

# Perform a simple data transformation, such as filtering even numbers from a given list, using a PySpark RDD

In [2]:
import random

In [3]:
random_numbers = [random.randint(1, 1000) for _ in range(100)]
print("Original List:")
print(random_numbers)

Original List:
[63, 274, 908, 61, 985, 635, 493, 776, 99, 301, 27, 804, 484, 197, 886, 173, 41, 289, 815, 19, 550, 667, 122, 14, 485, 611, 862, 391, 787, 954, 315, 470, 197, 528, 569, 311, 599, 57, 107, 162, 994, 939, 245, 32, 551, 403, 506, 730, 439, 399, 616, 382, 286, 41, 778, 438, 338, 548, 56, 654, 891, 28, 906, 104, 638, 752, 910, 692, 298, 576, 222, 484, 224, 168, 347, 593, 784, 23, 199, 765, 285, 69, 506, 422, 855, 245, 244, 823, 920, 946, 362, 474, 707, 513, 944, 261, 818, 456, 766, 132]


In [4]:
numbers_rdd = sc.parallelize(random_numbers)

In [5]:
even_numbers_rdd = numbers_rdd.filter(lambda x: x % 2 == 0)

In [7]:
even_numbers = even_numbers_rdd.collect()
print("\nEven Numbers:")
print(even_numbers)


Even Numbers:
[274, 908, 776, 804, 484, 886, 550, 122, 14, 862, 954, 470, 528, 162, 994, 32, 506, 730, 616, 382, 286, 778, 438, 338, 548, 56, 654, 28, 906, 104, 638, 752, 910, 692, 298, 576, 222, 484, 224, 168, 784, 506, 422, 244, 920, 946, 362, 474, 944, 818, 456, 766, 132]


# Summary

Demonstrates data transformation using PySpark RDDs.

Focuses on applying RDD operations (transformations & actions) for big data handling.

# Operations Performed

# 1. Setup
Imported PySpark libraries.

Created a SparkContext to work with RDDs.

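Prepared sample data: in this notebook, a Python list of 100 random integers between 1 and 1000 (text or CSV files are other common sources).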

# 2. RDD Creation
Data is converted into an RDD with sc.parallelize() (from an in-memory collection) or sc.textFile() (from a file, one element per line).

# 3. Transformations
Operations that define a new RDD but do not execute immediately (lazy evaluation):

map() → apply function to each element.

filter() → filter elements based on condition.

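flatMap() → map each element to zero or more output elements and flatten the results.

distinct() → remove duplicates.

union() / intersection() → combine two RDDs: union() keeps all elements of both (including duplicates), intersection() keeps only the elements common to both.

groupByKey() / reduceByKey() → group and aggregate values by key in a pair RDD; reduceByKey() is usually preferred for aggregation because it combines values within each partition before shuffling.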

# 4. Actions
Operations that trigger execution and return results:

collect() → return all elements to the driver (use with care on large RDDs).

count() → count records.

first() → first element.

take(n) → first n elements.

reduce() → aggregate values with a binary function.

# 5. Data Transformation Examples
Converting strings to key-value pairs.

Filtering based on conditions (e.g., ages > 20).

Aggregating numbers (sum, average, min, max).

Word count (common beginner example).

# 6. Output & Verification
Displaying transformed data with .collect().

Checking counts, sums, or sample records.