## Load CSV files into a PySpark DataFrame

When working with large data files, PySpark is a useful tool for processing the data efficiently. The package itself, however, is not the focus of this course.

We won't dive into the inner workings of PySpark. We only need to know how to create a Spark session, which is the entry point for loading data into DataFrames.

Many operations on a PySpark DataFrame closely resemble their pandas counterparts, which makes the transition easier for users who already know pandas.
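
As a rough illustration of that similarity, the sketch below writes the same filter-and-select step once for pandas and once for PySpark. The column names `name` and `age` and both function names are made up for this example; they do not come from the course data. One difference worth keeping in mind: PySpark transformations are lazy, so nothing is computed until an action such as `show()` or `count()` is called.

```python
# Hypothetical columns "name" and "age"; any DataFrame with them will do.

def adults_pandas(pdf):
    """pandas: boolean-mask the rows, then keep one column."""
    return pdf[pdf["age"] > 30][["name"]]


def adults_pyspark(sdf):
    """PySpark: the same idea, spelled with filter() and select()."""
    return sdf.filter(sdf["age"] > 30).select("name")
```

Given a pandas DataFrame `pdf` and a PySpark DataFrame `sdf` with those columns, `adults_pandas(pdf)` returns its result immediately, while `adults_pyspark(sdf)` returns a new DataFrame that is only evaluated once you call something like `.show()` on it.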

![image.png](attachment:image.png)


In [1]:
!pip install pyspark

Collecting pyspark
  Downloading pyspark-3.5.0.tar.gz (316.9 MB)


In [None]:
from pyspark.sql import SparkSession

# Initialize SparkSession
spark = SparkSession.builder \
    .appName("CSV to DataFrame") \
    .getOrCreate()

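# Load the CSV file into a DataFrame (replace the placeholder path with your own file)
# header=True uses the first row as column names; inferSchema=True guesses column types
csv_path = "path/to/your/csvfile.csv"
df_csv = spark.read.csv(csv_path, header=True, inferSchema=True)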

# Show DataFrame schema
df_csv.printSchema()

# Perform transformations or actions on the DataFrame as needed
# For example, filter the rows first, then keep only the column you need:
# df_csv = df_csv.filter(df_csv["another_column"] > 10).select("column_name")

# Stop the SparkSession
spark.stop()
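
The cell above relies on `inferSchema=True`, which makes Spark take an extra pass over the file to guess the column types. When the columns are known in advance, a schema can be declared instead. The following is a minimal sketch under stated assumptions: the column names, the sample rows, the schema string `PEOPLE_SCHEMA`, and both helper functions are hypothetical and only serve the illustration.

```python
import csv
import os
import tempfile

# Hypothetical column layout, written as a Spark DDL schema string.
PEOPLE_SCHEMA = "name STRING, age INT, salary DOUBLE"


def write_sample_csv(path):
    """Write a tiny, made-up CSV file so the example has something to read."""
    rows = [("alice", 34, 55000.0), ("bob", 29, 48000.5)]
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["name", "age", "salary"])  # header row
        writer.writerows(rows)
    return len(rows)


def read_csv_with_schema(spark, path, schema=PEOPLE_SCHEMA):
    """Read a CSV file with a declared schema instead of inferSchema=True."""
    # `spark` is an active SparkSession, created as in the cell above.
    return spark.read.csv(path, header=True, schema=schema)


# Create the sample file in a temporary directory.
sample_path = os.path.join(tempfile.mkdtemp(), "people.csv")
n_rows = write_sample_csv(sample_path)
```

With a session that is still running (that is, before `spark.stop()`), `read_csv_with_schema(spark, sample_path).printSchema()` should report the declared types rather than inferred ones.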


## Load Parquet files (e.g. written by PyArrow) into a PySpark DataFrame

In [None]:
from pyspark.sql import SparkSession

# Initialize SparkSession
spark = SparkSession.builder \
    .appName("PyArrow/Parquet to DataFrame") \
    .getOrCreate()

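# Load a Parquet file or directory into a DataFrame (replace the placeholder path)
# Parquet stores the schema with the data, so no header/inferSchema options are needed
parquet_path = "path/to/your/parquetfile"
df_parquet = spark.read.format("parquet").load(parquet_path)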

# Show DataFrame schema
df_parquet.printSchema()

# Perform transformations or actions on the DataFrame as needed
# For example, filter the rows first, then keep only the column you need:
# df_parquet = df_parquet.filter(df_parquet["another_column"] > 10).select("column_name")

# Stop the SparkSession
spark.stop()
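
If you do not have a Parquet file at hand, one way to get one is to convert a CSV file with Spark itself. The sketch below assumes an active SparkSession passed in as `spark`; the function name `csv_to_parquet` and both path parameters are hypothetical and not part of the course material.

```python
def csv_to_parquet(spark, csv_path, parquet_path):
    """Convert a CSV file to Parquet, then read the Parquet copy back."""
    # Read the CSV the same way as in the first section.
    df = spark.read.csv(csv_path, header=True, inferSchema=True)

    # mode("overwrite") replaces any existing output at parquet_path.
    # Spark writes a directory of part files, not a single file.
    df.write.mode("overwrite").parquet(parquet_path)

    # The schema comes from the Parquet metadata; nothing is inferred here.
    return spark.read.parquet(parquet_path)
```

Calling `csv_to_parquet(spark, csv_path, parquet_path)` before `spark.stop()` returns a DataFrame backed by the new Parquet output, which can then be inspected with `printSchema()` as above.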
