<a href="https://colab.research.google.com/github/nikitaj832/Training/blob/main/DataFrame_Operations.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **1. Create a DataFrame and filter rows based on conditions.**

In [2]:
from pyspark.sql import SparkSession

# Create Spark session
spark = SparkSession.builder.appName("FilterExample").getOrCreate()

# Sample data
data = [("Niku", 25), ("Anchal", 30), ("ritu", 22), ("geeta", 35)]
columns = ["Name", "Age"]

# Create DataFrame
df = spark.createDataFrame(data, schema=columns)

# Show original DataFrame
print("Original DataFrame:")
df.show()

# Filter rows where Age > 25
filtered_df = df.filter(df.Age > 25)

# Show filtered DataFrame
print("Filtered DataFrame (Age > 25):")
filtered_df.show()


Original DataFrame:
+------+---+
|  Name|Age|
+------+---+
|  Niku| 25|
|Anchal| 30|
|  ritu| 22|
| geeta| 35|
+------+---+

Filtered DataFrame (Age > 25):
+------+---+
|  Name|Age|
+------+---+
|Anchal| 30|
| geeta| 35|
+------+---+



## **2. Show the DataFrame content, collect data into a list, and count rows.**

In [8]:
from pyspark.sql import SparkSession

# Get a Spark session (getOrCreate() returns the running one if it already exists)
spark = SparkSession.builder.appName("StudentData").getOrCreate()

# Sample student data
students = [
    ("Nikita", "IT", 8.2),
    ("Rahul", "CSE", 7.5),
    ("Anjali", "ECE", 8.9),
    ("Ravi", "ME", 6.8)
]

columns = ["Name", "Department", "CGPA"]

# Create DataFrame
df = spark.createDataFrame(students, columns)

# 1. Show the DataFrame content
print("DataFrame content:")
df.show()

# 2. Collect data into a Python list of Row objects
#    (collect() brings every row to the driver, so use it only on small DataFrames)
data_list = df.collect()
print("\nCollected Data:")
for row in data_list:
    print(row.asDict())

# 3. Count number of rows
total_rows = df.count()
print(f"\nTotal number of rows: {total_rows}")


DataFrame content:
+------+----------+----+
|  Name|Department|CGPA|
+------+----------+----+
|Nikita|        IT| 8.2|
| Rahul|       CSE| 7.5|
|Anjali|       ECE| 8.9|
|  Ravi|        ME| 6.8|
+------+----------+----+


Collected Data:
{'Name': 'Nikita', 'Department': 'IT', 'CGPA': 8.2}
{'Name': 'Rahul', 'Department': 'CSE', 'CGPA': 7.5}
{'Name': 'Anjali', 'Department': 'ECE', 'CGPA': 8.9}
{'Name': 'Ravi', 'Department': 'ME', 'CGPA': 6.8}

Total number of rows: 4


## **3. Add a new column with `withColumn` and drop an existing column.**

In [9]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Create Spark session
spark = SparkSession.builder.appName("ModifyColumns").getOrCreate()

# Sample data
data = [
    ("Nikita", 80),
    ("Rahul", 60),
    ("Anjali", 90),
    ("Ravi", 45)
]
columns = ["Name", "Marks"]

# Create DataFrame
df = spark.createDataFrame(data, columns)

# 1. Add a boolean column 'Result' that is true when Marks >= 60
df_with_result = df.withColumn("Result", col("Marks") >= 60)

# 2. Drop the 'Marks' column
df_final = df_with_result.drop("Marks")

# Show final DataFrame
df_final.show()


+------+------+
|  Name|Result|
+------+------+
|Nikita|  true|
| Rahul|  true|
|Anjali|  true|
|  Ravi| false|
+------+------+

