### Spark DataFrame

The DataFrame API provides a SQL-like interface for querying and processing (semi-)structured datasets.

DataFrames are used when:
- high-level tabular operations (filtering, grouping, aggregating, joining) are needed

#### Example: Game of Thrones Transcripts Analysis

In this example, we will explore Game of Thrones transcripts.

- **Input**: CSV file of transcripts
- **Output**: an exploratory data analysis (EDA) that answers the following questions
    - Which character has the most dialogue in Season 4?
    - Which episode has the most dialogue in Season 4?
    - Which episode in Season 4 has the highest word count in dialogue?
    - How does Season 8 (The boring one) compare to other seasons regarding:
        - Number of characters present.
        - **TODO**: Dialogue volume (measured by sentence count).

In [3]:
%spark

val inputFile = "file:///tmp/Game_of_Thrones_Script.csv"

In [4]:
%spark

val transcriptDF = spark.read
    .format("csv")
    .options(Map("header" -> "true", "inferSchema" -> "true", "delimiter" -> ","))
    .load(inputFile)

In [5]:
transcriptDF.printSchema() // prints column names and inferred types; transcriptDF.dtypes returns them as (name, type) pairs

In [6]:
%spark

transcriptDF.show(5, false)

In [7]:
%spark

transcriptDF.createOrReplaceTempView("TRANSCRIPT_TBL")
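
Once a DataFrame is registered as a temporary view, it can be queried with SQL from Scala through `spark.sql`, which returns another DataFrame. The sketch below illustrates this on a few made-up rows. The local session, the `SAMPLE_TBL` view name, and the sample values are assumptions for illustration; only the column names are taken from the queries used in this section.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for a standalone run. In the notebook,
// `spark` is already provided by the interpreter.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("temp-view-sketch")
  .getOrCreate()
import spark.implicits._

// Made-up rows standing in for the transcript file.
val sampleDF = Seq(
  ("Season 4", "Episode 1", "character a", "First line."),
  ("Season 4", "Episode 1", "character b", "Second line."),
  ("Season 4", "Episode 2", "character a", "Third line.")
).toDF("Season", "Episode", "Name", "Sentence")

// Register the DataFrame under a name that SQL queries can refer to.
sampleDF.createOrReplaceTempView("SAMPLE_TBL")

// spark.sql returns a DataFrame, so the result can be chained further.
val linesPerCharacterDF = spark.sql(
  """SELECT Name, COUNT(*) AS c
    |FROM SAMPLE_TBL
    |GROUP BY Name
    |ORDER BY c DESC""".stripMargin)

linesPerCharacterDF.show(false)
```

The view lives only as long as the Spark session that created it, which is why it has to be registered again after an interpreter restart.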

### Which character has the most dialogue in Season 4?

#### Spark (Scala API)

In [11]:
%spark

transcriptDF
    .where("Season = 'Season 4'")
    .groupBy("Name")
    .agg(count("Sentence").as("count"))
    .orderBy(desc("count"))
    .show(15, false)

#### Spark SQL

In [12]:
%sql

SELECT Name, COUNT(*) AS C
FROM TRANSCRIPT_TBL
WHERE Season = 'Season 4'
GROUP BY Name
ORDER BY C DESC
LIMIT 5


## The whole golden-haired squad dominated the dialogue: Tyrion alone couldn’t stop talking!

A wise man once said a true history of the world is a history of great **conversations** in elegant rooms.

### Which episode has the most dialogue in Season 4?

In [15]:
%spark

transcriptDF.where("Season = 'Season 4'")
    .groupBy("Episode")
    .agg(count("*").as("count"))
    .orderBy(desc("count"))
    .show(10, false)

In [16]:
%sql

SELECT Episode, COUNT(*) as C
FROM TRANSCRIPT_TBL
WHERE Season = 'Season 4'
GROUP BY Episode
ORDER BY C DESC

## Episode 9: Least chatter, most drama!

### Which episode in Season 4 has the highest word count in dialogue?

In [19]:
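val regex = """\w+""".r

// UDF (user-defined function): splits a sentence into its word tokens.
// Option(...) guards against null sentences, which would otherwise throw a NullPointerException.
val extractWordsUdf = udf((sentence: String) =>
    Option(sentence).map(s => regex.findAllIn(s).toList).getOrElse(List.empty[String]))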

In [20]:
val season4TranscriptDF = transcriptDF
    // where(...) is an alias of filter(...); either form works here
    .filter("Season = 'Season 4'")
    .select("Episode", "Sentence")

In [21]:
season4TranscriptDF
    .withColumn("Sentence", lower(col("Sentence")))
    .withColumn("Words", extractWordsUdf(col("Sentence")))
    .show(5, false)

In [22]:
val season4WordcountDF = season4TranscriptDF
    .withColumn("Sentence", lower(col("Sentence")))
    .withColumn("Words", extractWordsUdf(col("Sentence")))
    .withColumn("Word", explode(col("Words")))
    .select("Episode", "Word")

In [23]:
season4WordcountDF.createOrReplaceTempView("SEASON4_TBL")

In [24]:
season4WordcountDF
    .groupBy("Episode")
    .agg(
        count("*").as("count")
        )
    .orderBy(desc("count"))
    .show(25, false)

In [25]:
%sql

SELECT Episode, count(*) c
FROM SEASON4_TBL
GROUP BY Episode
ORDER BY c DESC
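
The UDF-plus-`explode` pipeline above produces one row per word, which is convenient for the word distribution queries. If only the per-episode totals are needed, the same count can be sketched with built-in column functions, which keeps the work inside Spark's optimizer instead of an opaque UDF. The example below is a minimal sketch, assuming Spark 3.0 or later (for the `filter` higher-order function) and using made-up sample rows; it is an alternative to the notebook's approach, not a replacement for it.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, filter, lower, size, split, sum}

// Assumption: a local session for a standalone run. In the notebook,
// `spark` and `season4TranscriptDF` already exist.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("wordcount-sketch")
  .getOrCreate()
import spark.implicits._

// Made-up rows with the same two columns as season4TranscriptDF.
val sampleDF = Seq(
  ("Episode 1", "Hello, world!"),
  ("Episode 1", "Winter is coming."),
  ("Episode 2", "Hodor.")
).toDF("Episode", "Sentence")

// Split on runs of non-word characters, then drop the empty strings that
// split leaves at sentence boundaries. The remaining tokens are the same
// ones the \w+ regex would match.
val words = filter(split(lower(col("Sentence")), "\\W+"), w => w =!= "")

val wordCountsDF = sampleDF
  .withColumn("WordCount", size(words))   // words per sentence
  .groupBy("Episode")
  .agg(sum("WordCount").as("count"))      // words per episode
  .orderBy(col("count").desc)

wordCountsDF.show(false)
```

Rows with a null `Sentence` would need separate handling in this variant (for example, a filter on `col("Sentence").isNotNull`), because `size` does not return 0 for a null array.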

### Let's explore the word distribution

In [27]:
%sql

SELECT Word, count(*) c
FROM SEASON4_TBL
GROUP BY Word
ORDER BY C DESC
LIMIT 1000

In [28]:
%sql

SELECT Word, count(*) c
FROM SEASON4_TBL
GROUP BY Word
ORDER BY C DESC
LIMIT 10

#### Number of characters present in Season 8 compared to other seasons

In [30]:
transcriptDF.printSchema()

#### Number of characters present in Season 8

In [33]:
%sql

SELECT COUNT(Name) AS num_characters
FROM (
    SELECT Name, COUNT(*) c
    FROM TRANSCRIPT_TBL
    WHERE Season = 'Season 8'
    GROUP BY Name HAVING c > 1 -- keep only characters with more than one sentence
)

#### Number of characters present in other seasons

In [34]:
%sql

SELECT Season, COUNT(1) AS num_characters
FROM (
    SELECT Season, Name, COUNT(*) c
    FROM TRANSCRIPT_TBL
    WHERE Season != 'Season 8'
    GROUP BY Name, Season HAVING c > 1 -- keep only characters with more than one sentence
)
GROUP BY Season
ORDER BY num_characters DESC
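
For comparison with the SQL cells, the same per-season character count can be sketched with the Scala DataFrame API. This version covers every season in a single query, so Season 8 appears as one row next to the others. The sample rows and the local session are assumptions for illustration; the `c > 1` cutoff mirrors the `HAVING` clause above.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, count, desc}

// Assumption: a local session for a standalone run. In the notebook,
// `spark` and `transcriptDF` already exist.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("characters-per-season-sketch")
  .getOrCreate()
import spark.implicits._

// Made-up rows: one row per spoken sentence.
val sampleDF = Seq(
  ("Season 7", "a"), ("Season 7", "a"),
  ("Season 7", "b"), ("Season 7", "b"),
  ("Season 7", "c"),
  ("Season 8", "a"), ("Season 8", "a"),
  ("Season 8", "b")
).toDF("Season", "Name")

val charactersPerSeasonDF = sampleDF
  .groupBy("Season", "Name")
  .agg(count("*").as("c"))               // sentences per character per season
  .where(col("c") > 1)                   // same cutoff as HAVING c > 1
  .groupBy("Season")
  .agg(count("*").as("num_characters"))  // remaining characters per season
  .orderBy(desc("num_characters"))

charactersPerSeasonDF.show(false)
```

In the notebook, replacing `sampleDF` with `transcriptDF` should give the Season 8 figure and the other seasons in one table.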

## Fewer faces, fewer names
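
The remaining open question from the list at the top, dialogue volume measured by sentence count, could be approached along the following lines. This is a minimal sketch on made-up rows, assuming that each row of the transcript holds exactly one sentence; if a row can contain several sentences, the `Sentence` text would have to be split first.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{count, desc}

// Assumption: a local session for a standalone run. In the notebook,
// `spark` and `transcriptDF` already exist.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("dialogue-volume-sketch")
  .getOrCreate()
import spark.implicits._

// Made-up rows: one row per spoken sentence.
val sampleDF = Seq(
  ("Season 7", "First line."),
  ("Season 7", "Second line."),
  ("Season 7", "Third line."),
  ("Season 8", "Fourth line.")
).toDF("Season", "Sentence")

// count("Sentence") skips null sentences, unlike count("*").
val sentencesPerSeasonDF = sampleDF
  .groupBy("Season")
  .agg(count("Sentence").as("num_sentences"))
  .orderBy(desc("num_sentences"))

sentencesPerSeasonDF.show(false)
```

Raw sentence counts are not directly comparable if seasons differ in length, so dividing by the number of distinct episodes per season may give a fairer comparison.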