Based on the [WafaStudies](https://www.youtube.com/@WafaStudies) PySpark [tutorial](https://www.youtube.com/playlist?list=PLMWaZteqtEaJFiJ2FyIKK0YEuXwQ9YIS_).

## Imports

In [1]:
!apt-get install openjdk-11-jdk-headless -qq > /dev/null
!wget -q https://archive.apache.org/dist/spark/spark-3.5.0/spark-3.5.0-bin-hadoop3.tgz
!tar xf spark-3.5.0-bin-hadoop3.tgz
!pip -q install findspark

In [2]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-11-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.5.0-bin-hadoop3"

In [3]:
import findspark
findspark.init()

In [4]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
spark = SparkSession.builder\
                    .appName('Spark')\
                    .master("local[*]")\
                    .getOrCreate()

## Generate data

In [5]:
!pip install faker

Collecting faker
  Downloading Faker-19.6.2-py3-none-any.whl (1.7 MB)
Installing collected packages: faker
Successfully installed faker-19.6.2


In [6]:
!mkdir -p data

In [7]:
import csv
import random
from faker import Faker

faker = Faker()
fieldnames = ['id', 'name', 'gender', 'salary']

# Write two small CSV files, each with a header row and five fake employees.
for path in ['data/employees1.csv', 'data/employees2.csv']:
    with open(path, 'w', newline='') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        writer.writeheader()

        # emp_id avoids shadowing the built-in id()
        for emp_id in range(1, 6):
            writer.writerow({
                'id': emp_id,
                'name': faker.name(),
                'gender': random.choice(['Male', 'Female']),
                'salary': random.randint(1000, 10000),
            })

## Reading CSV files

In [8]:
help(spark.read.csv)

Help on method csv in module pyspark.sql.readwriter:

csv(path: Union[str, List[str]], schema: Union[pyspark.sql.types.StructType, str, NoneType] = None, sep: Optional[str] = None, encoding: Optional[str] = None, quote: Optional[str] = None, escape: Optional[str] = None, comment: Optional[str] = None, header: Union[bool, str, NoneType] = None, inferSchema: Union[bool, str, NoneType] = None, ignoreLeadingWhiteSpace: Union[bool, str, NoneType] = None, ignoreTrailingWhiteSpace: Union[bool, str, NoneType] = None, nullValue: Optional[str] = None, nanValue: Optional[str] = None, positiveInf: Optional[str] = None, negativeInf: Optional[str] = None, dateFormat: Optional[str] = None, timestampFormat: Optional[str] = None, maxColumns: Union[str, int, NoneType] = None, maxCharsPerColumn: Union[str, int, NoneType] = None, maxMalformedLogPerPartition: Union[str, int, NoneType] = None, mode: Optional[str] = None, columnNameOfCorruptRecord: Optional[str] = None, multiLine: Union[bool, str, NoneType] 

In [9]:
df = spark.read.csv(path='data/employees1.csv')
df.show()
df.printSchema()

+---+-------------+------+------+
|_c0|          _c1|   _c2|   _c3|
+---+-------------+------+------+
| id|         name|gender|salary|
|  1|  Holly Brown|  Male|  5513|
|  2|Charles Baker|  Male|  8067|
|  3|Regina Crosby|  Male|  4562|
|  4|  Mark Flores|  Male|  6676|
|  5|Daniel Snyder|  Male|  3155|
+---+-------------+------+------+

root
 |-- _c0: string (nullable = true)
 |-- _c1: string (nullable = true)
 |-- _c2: string (nullable = true)
 |-- _c3: string (nullable = true)



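By default, Spark reads a CSV file without a header and treats every column as a string. That is why the header row appears as data above and the columns get the generated names ```_c0``` to ```_c3```.

To avoid this, pass:

```header=True```: the first line is used as the column names

```inferSchema=True```: Spark infers the data type of each column from the data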



In [10]:
df = spark.read.csv(path='data/employees1.csv', header=True, inferSchema=True)
df.show()
df.printSchema()

+---+-------------+------+------+
| id|         name|gender|salary|
+---+-------------+------+------+
|  1|  Holly Brown|  Male|  5513|
|  2|Charles Baker|  Male|  8067|
|  3|Regina Crosby|  Male|  4562|
|  4|  Mark Flores|  Male|  6676|
|  5|Daniel Snyder|  Male|  3155|
+---+-------------+------+------+

root
 |-- id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- salary: integer (nullable = true)



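```inferSchema``` requires an extra pass over the data, which costs time and processing power, so we can instead give Spark the schema explicitly. Here ```salary``` is declared as ```double``` rather than the inferred ```integer```, so its values are displayed with a decimal point: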

In [11]:
schema = 'id integer, name string, gender string, salary double'

In [12]:
df = spark.read.csv(path='data/employees1.csv', header=True, schema=schema)

df.show()
df.printSchema()

+---+-------------+------+------+
| id|         name|gender|salary|
+---+-------------+------+------+
|  1|  Holly Brown|  Male|5513.0|
|  2|Charles Baker|  Male|8067.0|
|  3|Regina Crosby|  Male|4562.0|
|  4|  Mark Flores|  Male|6676.0|
|  5|Daniel Snyder|  Male|3155.0|
+---+-------------+------+------+

root
 |-- id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- salary: double (nullable = true)



We can also read multiple files into one DataFrame by passing a list of paths:

In [13]:
df = spark.read.csv(path=['data/employees1.csv', 'data/employees2.csv'], header=True, schema=schema)

df.show()
df.printSchema()

+---+--------------+------+------+
| id|          name|gender|salary|
+---+--------------+------+------+
|  1|Veronica Davis|  Male|3838.0|
|  2|   Misty Young|Female|8519.0|
|  3| David Sanchez|  Male|5335.0|
|  4|Patricia Huber|Female|3183.0|
|  5|    Ann Jensen|Female|8023.0|
|  1|   Holly Brown|  Male|5513.0|
|  2| Charles Baker|  Male|8067.0|
|  3| Regina Crosby|  Male|4562.0|
|  4|   Mark Flores|  Male|6676.0|
|  5| Daniel Snyder|  Male|3155.0|
+---+--------------+------+------+

root
 |-- id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- salary: double (nullable = true)



If all the files are in the same folder, it's possible to use the folder path:

In [14]:
df = spark.read.csv(path='data/', header=True, schema=schema)

df.show()
df.printSchema()

+---+--------------+------+------+
| id|          name|gender|salary|
+---+--------------+------+------+
|  1|Veronica Davis|  Male|3838.0|
|  2|   Misty Young|Female|8519.0|
|  3| David Sanchez|  Male|5335.0|
|  4|Patricia Huber|Female|3183.0|
|  5|    Ann Jensen|Female|8023.0|
|  1|   Holly Brown|  Male|5513.0|
|  2| Charles Baker|  Male|8067.0|
|  3| Regina Crosby|  Male|4562.0|
|  4|   Mark Flores|  Male|6676.0|
|  5| Daniel Snyder|  Male|3155.0|
+---+--------------+------+------+

root
 |-- id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- salary: double (nullable = true)

