## Import libraries and Create a SparkSession Object

In [1]:
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("read-csv-data")
         .master("spark://spark-master:7077")
         .config("spark.executor.memory", "512m")
         .getOrCreate())

# Only log errors from here on; log levels are conventionally written in upper case.
spark.sparkContext.setLogLevel("ERROR")

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/03/07 15:26:35 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


## Reading the CSV data

In [4]:
df = (spark.read.format("csv")
      .option("header","true")
      .load("../../data/netflix_titles.csv"))

## Display sample data

In [5]:
df.show(5)

+-------+-------+--------------------+---------------+--------------------+-------------+------------------+------------+------+---------+--------------------+--------------------+
|show_id|   type|               title|       director|                cast|      country|        date_added|release_year|rating| duration|           listed_in|         description|
+-------+-------+--------------------+---------------+--------------------+-------------+------------------+------------+------+---------+--------------------+--------------------+
|     s1|  Movie|Dick Johnson Is Dead|Kirsten Johnson|                null|United States|September 25, 2021|        2020| PG-13|   90 min|       Documentaries|As her father nea...|
|     s2|TV Show|       Blood & Water|           null|Ama Qamata, Khosi...| South Africa|September 24, 2021|        2021| TV-MA|2 Seasons|International TV ...|After crossing pa...|
|     s3|TV Show|           Ganglands|Julien Leclercq|Sami Bouajila, Tr...|         null|Septem

## Check the data types of the columns (Schema)

In [7]:
df.printSchema()

root
 |-- show_id: string (nullable = true)
 |-- type: string (nullable = true)
 |-- title: string (nullable = true)
 |-- director: string (nullable = true)
 |-- cast: string (nullable = true)
 |-- country: string (nullable = true)
 |-- date_added: string (nullable = true)
 |-- release_year: string (nullable = true)
 |-- rating: string (nullable = true)
 |-- duration: string (nullable = true)
 |-- listed_in: string (nullable = true)
 |-- description: string (nullable = true)



## Read the csv data with an explicit schema

In [10]:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DateType




##### I want to avoid typing out every column by hand, so I build a dictionary from the existing schema and edit it

### First I create a dictionary from the column names and datatypes

In [20]:
schema_dict = {field.name : field.dataType for field in df.schema.fields}
schema_dict

{'show_id': StringType(),
 'type': StringType(),
 'title': StringType(),
 'director': StringType(),
 'cast': StringType(),
 'country': StringType(),
 'date_added': StringType(),
 'release_year': StringType(),
 'rating': StringType(),
 'duration': StringType(),
 'listed_in': StringType(),
 'description': StringType()}

### I modify the dictionary with my required datatypes

In [21]:
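modifications = {
    'date_added': DateType(),
    # A bare year such as "2020" is read as the date 2020-01-01 (see the output
    # further down); IntegerType() may be a better fit for this column.
    'release_year': DateType()
}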


In [23]:
schema_dict.update(modifications)

schema_dict

{'show_id': StringType(),
 'type': StringType(),
 'title': StringType(),
 'director': StringType(),
 'cast': StringType(),
 'country': StringType(),
 'date_added': DateType(),
 'release_year': DateType(),
 'rating': StringType(),
 'duration': StringType(),
 'listed_in': StringType(),
 'description': StringType()}

### Then I create the schema 

In [29]:
from pprint import pprint

In [30]:
schema = StructType([StructField(name, dtype, True)  for name, dtype in schema_dict.items()])

pprint(schema)

StructType([StructField('show_id', StringType(), True), StructField('type', StringType(), True), StructField('title', StringType(), True), StructField('director', StringType(), True), StructField('cast', StringType(), True), StructField('country', StringType(), True), StructField('date_added', DateType(), True), StructField('release_year', DateType(), True), StructField('rating', StringType(), True), StructField('duration', StringType(), True), StructField('listed_in', StringType(), True), StructField('description', StringType(), True)])


## Read csv data with explicitly defined schema

In [39]:
df = spark.read.format('csv')\
                .option("header", "true")\
                .schema(schema)\
                .load("../../data/netflix_titles.csv")

In [36]:
df.printSchema()

root
 |-- show_id: string (nullable = true)
 |-- type: string (nullable = true)
 |-- title: string (nullable = true)
 |-- director: string (nullable = true)
 |-- cast: string (nullable = true)
 |-- country: string (nullable = true)
 |-- date_added: date (nullable = true)
 |-- release_year: date (nullable = true)
 |-- rating: string (nullable = true)
 |-- duration: string (nullable = true)
 |-- listed_in: string (nullable = true)
 |-- description: string (nullable = true)



In [37]:
df.show(5)

+-------+-------+--------------------+---------------+--------------------+-------------+----------+------------+------+---------+--------------------+--------------------+
|show_id|   type|               title|       director|                cast|      country|date_added|release_year|rating| duration|           listed_in|         description|
+-------+-------+--------------------+---------------+--------------------+-------------+----------+------------+------+---------+--------------------+--------------------+
|     s1|  Movie|Dick Johnson Is Dead|Kirsten Johnson|                null|United States|      null|  2020-01-01| PG-13|   90 min|       Documentaries|As her father nea...|
|     s2|TV Show|       Blood & Water|           null|Ama Qamata, Khosi...| South Africa|      null|  2021-01-01| TV-MA|2 Seasons|International TV ...|After crossing pa...|
|     s3|TV Show|           Ganglands|Julien Leclercq|Sami Bouajila, Tr...|         null|      null|  2021-01-01| TV-MA| 1 Season|Crime

## Reading CSV data when the delimiter appears inside field values

In [40]:
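# Fields that contain the delimiter must be wrapped in the quote character ('"' is the default).
# Setting "escape" to '"' makes a doubled quote ("") inside such a field read as a literal quote.
df = spark.read.format("csv")\
                .option("header", "true")\
                .option("nullValue", "null")\
                .option("quote", "\"")\
                .option("escape", "\"")\
                .schema(schema)\
                .load("../../data/netflix_titles.csv")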

In [41]:
df.show(5)

+-------+-------+--------------------+---------------+--------------------+-------------+----------+------------+------+---------+--------------------+--------------------+
|show_id|   type|               title|       director|                cast|      country|date_added|release_year|rating| duration|           listed_in|         description|
+-------+-------+--------------------+---------------+--------------------+-------------+----------+------------+------+---------+--------------------+--------------------+
|     s1|  Movie|Dick Johnson Is Dead|Kirsten Johnson|                null|United States|      null|  2020-01-01| PG-13|   90 min|       Documentaries|As her father nea...|
|     s2|TV Show|       Blood & Water|           null|Ama Qamata, Khosi...| South Africa|      null|  2021-01-01| TV-MA|2 Seasons|International TV ...|After crossing pa...|
|     s3|TV Show|           Ganglands|Julien Leclercq|Sami Bouajila, Tr...|         null|      null|  2021-01-01| TV-MA| 1 Season|Crime

## Using a different date format

In [46]:
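# Note: the pattern has to match the text exactly. "LLLL d,y" has no space after the
# comma, so values such as "September 25, 2021" most likely fail to parse. The same
# pattern is also applied to release_year, which holds a bare year. Both columns
# therefore come back null in the output below.
df = spark.read.format("csv")\
                .option("header", "true")\
                .option("nullValue", "null")\
                .option("dateFormat", "LLLL d,y")\
                .option("quote", "\"")\
                .option("escape", "\"")\
                .schema(schema)\
                .load("../../data/netflix_titles.csv")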

In [47]:
df.show(5)

+-------+-------+--------------------+---------------+--------------------+-------------+----------+------------+------+---------+--------------------+--------------------+
|show_id|   type|               title|       director|                cast|      country|date_added|release_year|rating| duration|           listed_in|         description|
+-------+-------+--------------------+---------------+--------------------+-------------+----------+------------+------+---------+--------------------+--------------------+
|     s1|  Movie|Dick Johnson Is Dead|Kirsten Johnson|                null|United States|      null|        null| PG-13|   90 min|       Documentaries|As her father nea...|
|     s2|TV Show|       Blood & Water|           null|Ama Qamata, Khosi...| South Africa|      null|        null| TV-MA|2 Seasons|International TV ...|After crossing pa...|
|     s3|TV Show|           Ganglands|Julien Leclercq|Sami Bouajila, Tr...|         null|      null|        null| TV-MA| 1 Season|Crime

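Since Spark 3.0, `dateFormat` and `timestampFormat` use Spark's own datetime patterns, which are based on Java's `DateTimeFormatter` (earlier versions used `SimpleDateFormat`). The pattern must match the input exactly, including spaces and punctuation. Some common patterns:

| Format                       | Example Input                 | Pattern                        |
|------------------------------|-------------------------------|--------------------------------|
| Full Month Name, Day, Year   | December 25, 2020             | `"MMMM d, yyyy"`               |
| Short Month Name, Day, Year  | Dec 25, 2020                  | `"MMM d, yyyy"`                |
| Day-Month-Year               | 25-12-2020                    | `"dd-MM-yyyy"`                 |
| Year-Month-Day (ISO Format)  | 2020-12-25                    | `"yyyy-MM-dd"`                 |
| Month/Day/Year               | 12/25/2020                    | `"MM/dd/yyyy"`                 |
| Day/Month/Year               | 25/12/2020                    | `"dd/MM/yyyy"`                 |
| Year-Month-Day Hour:Minute   | 2020-12-25 14:30              | `"yyyy-MM-dd HH:mm"`           |
| Full Date & Time             | Fri, 25 Dec 2020 14:30:00 GMT | `"EEE, d MMM yyyy HH:mm:ss z"` |

The last two patterns include a time component, so they belong with `timestampFormat` rather than `dateFormat`. According to Spark's datetime pattern documentation, `E` (day of week) is supported for formatting only, so the last pattern is suited to `date_format` output rather than to parsing CSV input.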

## Infer the schema with PySpark

## Handling missing data

## Handling malformed data

## Working with large CSV files

## Read large CSV files as a stream