## Create a DataFrame in PySpark and apply basic operations such as viewing data and selecting columns

In [1]:
sc

## 📊 Dataset Overview
- Total records: 50 students
- Columns: 7 (id, name, age, gender, math, science, english)
- No missing values

## 👥 Demographics
- Age: 18 – 25 years (average ≈ 21.5)
- Gender: 29 female, 21 male

## 📚 Academic Performance
### Math
- Range: 40 – 100
- Mean: 68.9
- Std. Dev.: 17.6 (high variation)

### Science
- Range: 44 – 99
- Mean: 70.2
- Std. Dev.: 14.6 (moderate variation)

### English
- Range: 42 – 100
- Mean: 69.4
- Std. Dev.: 18.7 (highest variation)

## Key Insights
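- Science has the highest mean score (70.2), although the three subject means differ by less than 1.5 points.
- English has the most variation in performance (std. dev. 18.7).
- Individual students score unevenly across subjects; for example, Alice scores 92 in science but 44 in english.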

In [2]:
from pyspark.sql import SparkSession
# Create a Spark session, or reuse the active one if it already exists
spark = SparkSession.builder.appName("BasicDataFrameOps").getOrCreate()

In [3]:
df = spark.read.csv("C:/Users/Poojitha/Desktop/students.csv", header=True, inferSchema=True)

In [4]:
print("=== First 5 rows ===")
df.show(5)

=== First 5 rows ===
+---+-------+---+------+----+-------+-------+
| id|   name|age|gender|math|science|english|
+---+-------+---+------+----+-------+-------+
|  1|  Alice| 20|     F|  66|     92|     44|
|  2|    Bob| 20|     M|  82|     52|     77|
|  3|Charlie| 22|     F|  43|     57|     76|
|  4|  David| 19|     M|  95|     69|     46|
|  5|    Eva| 19|     F|  62|     44|     96|
+---+-------+---+------+----+-------+-------+
only showing top 5 rows


In [5]:
print("=== Schema ===")
df.printSchema()

=== Schema ===
root
 |-- id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)
 |-- gender: string (nullable = true)
 |-- math: integer (nullable = true)
 |-- science: integer (nullable = true)
 |-- english: integer (nullable = true)



In [7]:
print("=== Select name and math columns ===")
df.select("name", "math").show(5)

=== Select name and math columns ===
+-------+----+
|   name|math|
+-------+----+
|  Alice|  66|
|    Bob|  82|
|Charlie|  43|
|  David|  95|
|    Eva|  62|
+-------+----+
only showing top 5 rows


In [8]:
print("=== Students with math >= 80 ===")
df.filter(df.math >= 80).show(5)

=== Students with math >= 80 ===
+---+------+---+------+----+-------+-------+
| id|  name|age|gender|math|science|english|
+---+------+---+------+----+-------+-------+
|  2|   Bob| 20|     M|  82|     52|     77|
|  4| David| 19|     M|  95|     69|     46|
| 11| Kathy| 25|     M|  85|     71|     89|
| 12|   Leo| 24|     M|  97|     84|     83|
| 15|Olivia| 18|     M|  87|     90|     87|
+---+------+---+------+----+-------+-------+
only showing top 5 rows


In [9]:
print("=== Sorted by science (desc) ===")
df.orderBy(df.science.desc()).show(5)

=== Sorted by science (desc) ===
+---+------+---+------+----+-------+-------+
| id|  name|age|gender|math|science|english|
+---+------+---+------+----+-------+-------+
| 27| Aaron| 25|     F|  81|     99|     44|
| 32| Fiona| 22|     F|  48|     96|     48|
| 33|George| 22|     M|  66|     95|     84|
| 29|  Carl| 22|     F|  53|     92|     52|
|  1| Alice| 20|     F|  66|     92|     44|
+---+------+---+------+----+-------+-------+
only showing top 5 rows


In [10]:
print("Total rows in dataset:", df.count())

Total rows in dataset: 50


In [11]:
print("Columns:", df.columns)

Columns: ['id', 'name', 'age', 'gender', 'math', 'science', 'english']


### Summary
- Initialization: The notebook initializes a Spark session and reads a CSV file named students.csv into a DataFrame.
- Data viewing: It shows the first 5 rows of the DataFrame.
- Schema and columns: It prints the schema of the DataFrame, showing the column names and data types, and also lists all column names.
- Column selection: It selects and displays a subset of the data, specifically the name and math columns.
- Filtering: It filters the DataFrame to show only students with a math score greater than or equal to 80.
- Sorting: It sorts the students in descending order of science score and displays the top 5 results.
- Counting: It counts and prints the total number of rows in the dataset, which is 50.