1. Install the PySpark package
2. Read the CSV with pandas and display basic info:
   - Load and display the DataFrame
   - Check the object type
   - Show DataFrame info/structure

3. PySpark operations:
   - Create a Spark session named "Practise"
   - Read the CSV a first time with default options (no header row, no schema inference)
   - Read the CSV a second time with:
     - Headers enabled
     - Schema auto-inference
   - Display data and schema
   - Check the object type
   - Show the first 3 rows

Note: The code shows two equivalent ways to read CSV with PySpark:
- Using direct parameters: `header=True, inferSchema=True`
- Using chained options: `.option("header", "true").option("inferSchema", "true")`

In [None]:
! pip install pyspark --upgrade

Defaulting to user installation because normal site-packages is not writeable
Collecting pyspark
  Downloading pyspark-3.5.3.tar.gz (317.3 MB)

In [2]:
import pyspark
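
The cells in this notebook read `test1.csv`, which is not included with the notebook. A minimal sketch for recreating it, assuming the file holds exactly the six rows and four columns shown in the cell outputs; the helper name `write_sample_csv` is hypothetical and not part of the original code.

```python
import csv
from pathlib import Path

# Column names and rows copied from the outputs shown in this notebook.
HEADER = ["Name", "age", "Experience", "Salary"]
ROWS = [
    ["Krish", 31, 10, 30000],
    ["Sudhanshu", 30, 8, 25000],
    ["Sunny", 29, 4, 20000],
    ["Paul", 24, 3, 20000],
    ["Harsha", 21, 1, 15000],
    ["Shubham", 23, 2, 18000],
]


def write_sample_csv(path="test1.csv"):
    """Write the header and the six sample rows to `path`; return it as a Path."""
    target = Path(path)
    # newline="" lets the csv module control line endings itself.
    with target.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(HEADER)
        writer.writerows(ROWS)
    return target
```

Calling `write_sample_csv()` once before the pandas cell creates the file in the working directory.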

In [None]:
import pandas as pd

df: pd.DataFrame = pd.read_csv(filepath_or_buffer="test1.csv")

display(df)

print(type(df))

df.info()

        Name  age  Experience  Salary
0      Krish   31          10   30000
1  Sudhanshu   30           8   25000
2      Sunny   29           4   20000
3       Paul   24           3   20000
4     Harsha   21           1   15000
5    Shubham   23           2   18000
5,Shubham,23,2,18000


<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Name        6 non-null      object
 1   age         6 non-null      int64 
 2   Experience  6 non-null      int64 
 3   Salary      6 non-null      int64 
dtypes: int64(3), object(1)
memory usage: 324.0+ bytes


In [4]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Practise").getOrCreate()

In [5]:
spark

In [9]:
df_pyspark = spark.read.csv("test1.csv")

df_pyspark.show()

df_pyspark.printSchema()

+---------+---+----------+------+
|      _c0|_c1|       _c2|   _c3|
+---------+---+----------+------+
|     Name|age|Experience|Salary|
|    Krish| 31|        10| 30000|
|Sudhanshu| 30|         8| 25000|
|    Sunny| 29|         4| 20000|
|     Paul| 24|         3| 20000|
|   Harsha| 21|         1| 15000|
|  Shubham| 23|         2| 18000|
+---------+---+----------+------+

root
 |-- _c0: string (nullable = true)
 |-- _c1: string (nullable = true)
 |-- _c2: string (nullable = true)
 |-- _c3: string (nullable = true)
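
The `_c0` to `_c3` column names, the header line appearing as a data row, and the all-string schema follow from the reader's defaults: no header row is assumed and no types are inferred. As a rough analogy in plain Python (an illustration of the idea only, not how Spark implements it), the sketch below reads the same kind of text both ways. The function names and the `SAMPLE` string are hypothetical, and the integer check is a deliberately naive stand-in for schema inference.

```python
import csv
import io

# Two data rows taken from the notebook output, plus the header line.
SAMPLE = "Name,age,Experience,Salary\nKrish,31,10,30000\nSudhanshu,30,8,25000\n"


def is_int(value):
    """Return True if `value` parses as an integer."""
    try:
        int(value)
    except ValueError:
        return False
    return True


def read_raw(text):
    """Default-style read: every line is data and every field stays a string."""
    rows = list(csv.reader(io.StringIO(text)))
    names = [f"_c{i}" for i in range(len(rows[0]))]
    return names, rows


def read_with_header(text):
    """Header-style read: the first line names the columns, and a column is
    treated as integer only when every one of its values parses as an int."""
    rows = list(csv.reader(io.StringIO(text)))
    names, data = rows[0], rows[1:]
    columns = list(zip(*data))
    types = ["integer" if all(is_int(v) for v in col) else "string" for col in columns]
    typed = [[int(v) if t == "integer" else v for v, t in zip(row, types)] for row in data]
    return names, types, typed
```

`read_raw(SAMPLE)` keeps the header line as the first data row, while `read_with_header(SAMPLE)` separates it out and converts the numeric columns.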



In [18]:
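# Option 1: pass the options directly as keyword arguments.
df_pyspark = spark.read.csv("test1.csv", header=True, inferSchema=True)

# Option 2 (equivalent): set the same options through chained .option() calls.
# This reassigns df_pyspark; either statement alone is enough.
df_pyspark = (
    spark.read.option("header", "true").option("inferSchema", "true").csv("test1.csv")
)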

df_pyspark.show()

df_pyspark.printSchema()

+---------+---+----------+------+
|     Name|age|Experience|Salary|
+---------+---+----------+------+
|    Krish| 31|        10| 30000|
|Sudhanshu| 30|         8| 25000|
|    Sunny| 29|         4| 20000|
|     Paul| 24|         3| 20000|
|   Harsha| 21|         1| 15000|
|  Shubham| 23|         2| 18000|
+---------+---+----------+------+

root
 |-- Name: string (nullable = true)
 |-- age: integer (nullable = true)
 |-- Experience: integer (nullable = true)
 |-- Salary: integer (nullable = true)
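
One way to check that the keyword-argument form and the chained `.option()` form give the same result is to compare the schemas and collected rows of the two DataFrames. The sketch below assumes an existing `SparkSession` is passed in; the helper names `read_csv_both_ways` and `same_result` are hypothetical, and the row comparison relies on both reads returning rows in the same order, which is reasonable for a small single-file CSV but not guaranteed in general.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:  # imported for type hints only, so no Spark is needed at import time
    from pyspark.sql import DataFrame, SparkSession


def read_csv_both_ways(spark: "SparkSession", path: str) -> "tuple[DataFrame, DataFrame]":
    """Read `path` once with keyword arguments and once with chained options."""
    direct = spark.read.csv(path, header=True, inferSchema=True)
    chained = (
        spark.read.option("header", "true").option("inferSchema", "true").csv(path)
    )
    return direct, chained


def same_result(left: "DataFrame", right: "DataFrame") -> bool:
    """Return True when both DataFrames share a schema and the same collected rows."""
    return left.schema == right.schema and left.collect() == right.collect()
```

In the notebook this could be used as `same_result(*read_csv_both_ways(spark, "test1.csv"))`.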



In [12]:
print(type(df_pyspark))

<class 'pyspark.sql.dataframe.DataFrame'>


In [17]:
print(df_pyspark.head(3))

[Row(Name='Krish', age=31, Experience=10, Salary=30000), Row(Name='Sudhanshu', age=30, Experience=8, Salary=25000), Row(Name='Sunny', age=29, Experience=4, Salary=20000)]
