# Simple data transformation with a PySpark RDD: filtering even numbers from a list

In [1]:
sc

## 📊 Dataset Overview
### Total Records: 50 students
### Columns: 7 → id, name, age, gender, math, science, english
### No missing values

## 👥 Demographics
### Age: 18 – 25 years (average ≈ 21.5)
### Gender: 29 Female, 21 Male

## 📚 Academic Performance
### Math
#### Range: 40 – 100
#### Mean: 68.9
#### Std. Dev.: 17.6 (high variation)

### Science
#### Range: 44 – 99
#### Mean: 70.2
#### Std. Dev.: 14.6 (moderate variation)

### English
#### Range: 42 – 100
#### Mean: 69.4
#### Std. Dev.: 18.7 (highest variation)

## Key Insights
### Science has the highest mean (70.2), narrowly ahead of English (69.4) and Math (68.9).
### English has the most variation in performance (std. dev. 18.7).
### Performance is not uniform across subjects: both the means and the spreads differ.

In [1]:
import random

In [2]:
# Generate 100 random integers between 1 and 1000 (inclusive)
random_numbers = [random.randint(1, 1000) for _ in range(100)]
print("Original List:")
print(random_numbers)

Original List:
[837, 397, 457, 566, 447, 608, 269, 627, 653, 153, 565, 330, 309, 827, 15, 909, 971, 869, 934, 109, 124, 435, 724, 901, 914, 322, 945, 75, 949, 562, 936, 308, 689, 530, 797, 73, 152, 527, 750, 453, 808, 302, 436, 237, 143, 531, 68, 492, 36, 535, 859, 62, 985, 622, 350, 139, 763, 26, 439, 159, 244, 661, 7, 25, 701, 532, 968, 177, 156, 379, 748, 854, 999, 447, 686, 423, 762, 380, 628, 496, 540, 553, 548, 700, 422, 8, 961, 229, 332, 436, 189, 425, 737, 708, 963, 43, 610, 997, 210, 672]


In [3]:
# Distribute the local list as an RDD
numbers_rdd = sc.parallelize(random_numbers)

In [4]:
# filter() is a lazy transformation: it only records the step, nothing runs yet
even_numbers_rdd = numbers_rdd.filter(lambda x: x % 2 == 0)

In [6]:
# collect() is an action: it runs the filter and returns the results to the driver
even_numbers = even_numbers_rdd.collect()
print("\nEven Numbers:")
print(even_numbers)


Even Numbers:
[566, 608, 330, 934, 124, 724, 914, 322, 562, 936, 308, 530, 152, 750, 808, 302, 436, 68, 492, 36, 62, 622, 350, 26, 244, 532, 968, 156, 748, 854, 686, 762, 380, 628, 496, 540, 548, 700, 422, 8, 332, 436, 708, 610, 210, 672]
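
As a quick sanity check, the RDD result can be compared with a plain Python list comprehension on the driver. The sketch below is not part of the original notebook: the helper name `local_even_filter` is hypothetical, and the sample is simply the first ten values of the run shown above. It assumes the list is small enough to process locally.

```python
def local_even_filter(values):
    """Return the even integers in `values`, preserving their order."""
    return [x for x in values if x % 2 == 0]


# First ten values of the original list printed above, used as a small sample.
sample = [837, 397, 457, 566, 447, 608, 269, 627, 653, 153]
local_evens = local_even_filter(sample)
print(local_evens)

# In the notebook session, the same check against the RDD result would be:
#   assert even_numbers == local_even_filter(random_numbers)
```

Because `parallelize()`, `filter()`, and `collect()` keep elements in their original order, the two lists should match element for element, not just as sets.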


In [2]:
# Stop SparkContext
# sc.stop()

## Summary
#### Demonstrates a simple data transformation with PySpark RDDs: keeping only the even numbers from a list of 100 random integers.

#### Focuses on applying RDD operations (transformations & actions) for big data handling.

## ⚙️ Operations Performed
### 1. Setup
#### Imported Python's random module; no PySpark import was needed.

#### Used the existing SparkContext (sc) provided by the notebook session.

#### Generated sample data: 100 random integers between 1 and 1000.

### 2. RDD Creation
#### The list was converted into an RDD with sc.parallelize(). (sc.textFile() is the corresponding entry point for text files; it is not used here.)
### 3. Transformations
#### Operations that define a new RDD but do not execute immediately (lazy evaluation). This notebook uses only filter(); the others are listed for reference:

#### map() → apply a function to each element.
#### filter() → keep only the elements that satisfy a condition.
#### flatMap() → map each element to zero or more outputs and flatten the results.
#### distinct() → remove duplicates.
#### union() / intersection() → combine two RDDs (union() keeps duplicates).
#### groupByKey() / reduceByKey() → group or aggregate values by key in a pair RDD.

### 4. Actions
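#### Operations that trigger execution and return results to the driver. This notebook uses only collect():
#### collect() → return all elements as a Python list (use only when the result fits in driver memory).
#### count() → count records.
#### first() → first element.
#### take(n) → first n elements.
#### reduce() → combine all elements with a binary function.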

### 5. Data Transformation Examples
#### Converting strings to key-value pairs.
#### Filtering based on conditions (e.g., ages > 20, or even numbers as in this notebook).

#### Aggregating numbers (sum, average, min, max).

#### Word count (common beginner example).

### 6. Output & Verification
#### Displaying transformed data with .collect().

#### Checking counts, sums, or sample records.