Create a PySpark DataFrame by reading a CSV file (students.csv), then explore its structure and contents and perform basic data manipulation: selecting columns, filtering rows, adding a calculated column, sorting, and aggregating by gender.

Dataset Overview

The dataset is a CSV file named students.csv.

It contains the following fields:

    ID – Unique identifier of the student
    Name – Student’s name
    Age – Age of the student
    Gender – Gender (M/F)
    Math – Marks scored in Mathematics
    Science – Marks scored in Science
    English – Marks scored in English

In [1]:
sc

In [2]:
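from pyspark.sql import SparkSession
# Note: importing round and max by name shadows Python's built-in round() and max() in this notebook
from pyspark.sql.functions import col, avg, round, max

# Initialize Spark Session
spark = SparkSession.builder.appName("StudentsDataFrameExample").getOrCreate()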


In [4]:
# Step 1: Read CSV file into DataFrame
df = spark.read.csv("C:\\Users\\user\\Downloads\\students.csv", header=True, inferSchema=True)

In [5]:
# Step 2: Explore dataset
print("=== First 10 rows ===")
df.show(10)


=== First 10 rows ===
+---+-------+---+------+----+-------+-------+
| id|   name|age|gender|math|science|english|
+---+-------+---+------+----+-------+-------+
|  1|  Alice| 20|     F|  66|     92|     44|
|  2|    Bob| 20|     M|  82|     52|     77|
|  3|Charlie| 22|     F|  43|     57|     76|
|  4|  David| 19|     M|  95|     69|     46|
|  5|    Eva| 19|     F|  62|     44|     96|
|  6|  Frank| 22|     F|  70|     78|     94|
|  7|  Grace| 24|     F|  67|     66|     93|
|  8|  Henry| 21|     F|  53|     82|     60|
|  9|    Ivy| 19|     M|  64|     52|     46|
| 10|   Jack| 19|     F|  44|     59|     60|
+---+-------+---+------+----+-------+-------+
only showing top 10 rows


In [6]:
print("=== Schema ===")
df.printSchema()

=== Schema ===
root
 |-- id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)
 |-- gender: string (nullable = true)
 |-- math: integer (nullable = true)
 |-- science: integer (nullable = true)
 |-- english: integer (nullable = true)



In [7]:
print("=== Datatypes ===")
print(df.dtypes)


=== Datatypes ===
[('id', 'int'), ('name', 'string'), ('age', 'int'), ('gender', 'string'), ('math', 'int'), ('science', 'int'), ('english', 'int')]


Conclusion:
The program demonstrated fundamental PySpark DataFrame operations on the student dataset: reading the CSV file, viewing the schema and summary statistics, and selecting specific columns. It filtered students by age (≥21) and math score (≥70), added a calculated 'average' marks column, and filtered and sorted students by that average (≥75). Finally, it computed the average marks (math, science, English, and overall) grouped by gender.