
## Overview

This notebook shows how to create and query a table or DataFrame from a file that you uploaded to DBFS. [DBFS](https://docs.databricks.com/user-guide/dbfs-databricks-file-system.html), the Databricks File System, lets you store data for querying inside Databricks. The notebook assumes that the file you want to read is already in DBFS.

This notebook is written in **Python**, so the default cell type is Python. You can switch to another language in a cell with the `%LANGUAGE` syntax; Python, Scala, SQL, and R are all supported.

In [0]:
# File location and type
file_location = "/FileStore/tables/pixar_films-1.csv"
file_type = "csv"

# CSV options
infer_schema = "True"
first_row_is_header = "True"
delimiter = ","

# The applied options are for CSV files. For other file types, these will be ignored.
df = spark.read.format(file_type) \
  .option("inferSchema", infer_schema) \
  .option("header", first_row_is_header) \
  .option("sep", delimiter) \
  .load(file_location)

display(df)

number,film,release_date,run_time,film_rating,plot
1,Toy Story,1995-11-22,81,G,A cowboy doll is profoundly threatened and jealous when a new spaceman action figure supplants him as top toy in a boy's bedroom.
2,A Bug's Life,1998-11-25,95,G,"""A misfit ant, looking for """"warriors"""" to save his colony from greedy grasshoppers"
3,Toy Story 2,1999-11-24,92,G,"When Woody is stolen by a toy collector, Buzz and his friends set out on a rescue mission to save Woody before he becomes a museum toy property with his roundup gang Jessie, Prospector, and Bullseye."
4,"Monsters, Inc.",2001-11-02,92,G,"In order to power the city, monsters have to scare children so that they scream. However, the children are toxic to the monsters, and after a child gets through, two monsters realize things may not be what they think."
5,Finding Nemo,2003-05-30,100,G,"After his son is captured in the Great Barrier Reef and taken to Sydney, a timid clownfish sets out on a journey to bring him home."
6,The Incredibles,2004-11-05,115,PG,"While trying to lead a quiet suburban life, a family of undercover superheroes are forced into action to save the world."
7,Cars,2006-06-09,116,G,"On the way to the biggest race of his life, a hotshot rookie race car gets stranded in a rundown town and learns that winning isn't everything in life."
8,Ratatouille,2007-06-29,111,G,A rat who can cook makes an unusual alliance with a young kitchen worker at a famous Paris restaurant.
9,WALL-E,2008-06-27,98,G,"A robot who is responsible for cleaning a waste-covered Earth meets another robot and falls in love with her. Together, they set out on a journey that will alter the fate of mankind."
10,Up,2009-05-29,96,PG,"78-year-old Carl Fredricksen travels to South America in his house equipped with balloons, inadvertently taking a young stowaway."


In [0]:
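# Drop duplicate rows, then print one of the remaining rows.
# distinct() shuffles the data, so the first row is not necessarily film number 1.
rdd=df.rdd.distinct()
row=rdd.first()
print(row)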


Row(number=5, film='Finding Nemo', release_date=datetime.date(2003, 5, 30), run_time=100, film_rating='G', plot='After his son is captured in the Great Barrier Reef and taken to Sydney, a timid clownfish sets out on a journey to bring him home.')


In [0]:
rdd=df.rdd
# collect() would pull every row to the driver; take(5) fetches only the first five
for row in rdd.take(5):
    print(row)

Row(number=1, film='Toy Story', release_date=datetime.date(1995, 11, 22), run_time=81, film_rating='G', plot="A cowboy doll is profoundly threatened and jealous when a new spaceman action figure supplants him as top toy in a boy's bedroom.")
Row(number=2, film="A Bug's Life", release_date=datetime.date(1998, 11, 25), run_time=95, film_rating='G', plot='"A misfit ant, looking for ""warriors"" to save his colony from greedy grasshoppers')
Row(number=3, film='Toy Story 2', release_date=datetime.date(1999, 11, 24), run_time=92, film_rating='G', plot='When Woody is stolen by a toy collector, Buzz and his friends set out on a rescue mission to save Woody before he becomes a museum toy property with his roundup gang Jessie, Prospector, and Bullseye.')
Row(number=4, film='Monsters, Inc.', release_date=datetime.date(2001, 11, 2), run_time=92, film_rating='G', plot='In order to power the city, monsters have to scare children so that they scream. However, the children are toxic to the monsters, a

In [0]:
rdd=df.rdd
# row[3] is the run_time column (minutes)
rdd_map=rdd.map(lambda row:row[3])
for row in rdd_map.take(10):
    print(row)

81
95
92
92
100
115
116
111
98
96


In [0]:
rdd=df.rdd
# row[2] is release_date, a datetime.date; keep only films released in 2020
rdd_filter=rdd.filter(lambda row:row[2].year==2020)
for row in rdd_filter.collect():
    print(row)



Row(number=22, film='Onward', release_date=datetime.date(2020, 3, 6), run_time=102, film_rating='PG', plot='Teenage elf brothers Ian and Barley embark on a magical quest to spend one more day with their late father. Like any good adventure, their journey is filled with cryptic maps, impossible obstacles and unimaginable discoveries.')
Row(number=23, film='Soul', release_date=datetime.date(2020, 12, 25), run_time=100, film_rating='PG', plot="Joe is a middle-school band teacher whose life hasn't quite gone the way he expected. His true passion is jazz. But when he travels to another realm to help someone find their passion, he soon discovers what it means to have soul.")


In [0]:
rdd=df.rdd
# Keep only films released in June of any year
rdd_filter=rdd.filter(lambda row:row[2].month==6)
for row in rdd_filter.collect():
    print(row)

Row(number=7, film='Cars', release_date=datetime.date(2006, 6, 9), run_time=116, film_rating='G', plot="On the way to the biggest race of his life, a hotshot rookie race car gets stranded in a rundown town and learns that winning isn't everything in life.")
Row(number=8, film='Ratatouille', release_date=datetime.date(2007, 6, 29), run_time=111, film_rating='G', plot='A rat who can cook makes an unusual alliance with a young kitchen worker at a famous Paris restaurant.')
Row(number=9, film='WALL-E', release_date=datetime.date(2008, 6, 27), run_time=98, film_rating='G', plot='A robot who is responsible for cleaning a waste-covered Earth meets another robot and falls in love with her. Together, they set out on a journey that will alter the fate of mankind.')
Row(number=11, film='Toy Story 3', release_date=datetime.date(2010, 6, 18), run_time=103, film_rating='G', plot="The toys are mistakenly delivered to a day-care center instead of the attic right before Andy leaves for college, and it'

In [0]:
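rdd=df.rdd
# flatMap flattens whatever the lambda returns; row[1] (film) is a string,
# so each title is split into its individual characters
flat_rdd=rdd.flatMap(lambda row:row[1])
flat_rdd.take(10)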




Out[47]: ['T', 'o', 'y', ' ', 'S', 't', 'o', 'r', 'y', 'A']
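
The output above is a list of single characters because `flatMap` iterates over whatever the lambda returns, and iterating over a string yields its characters. The following is a minimal pure-Python sketch of that behavior, not Spark code: `flat_map` is a hypothetical helper that stands in for `RDD.flatMap`, and the two titles are taken from the table above.

```python
from itertools import chain

# Two film titles from the table above
films = ["Toy Story", "A Bug's Life"]

def flat_map(func, items):
    """Plain-Python model of flatMap: apply func, then flatten one level."""
    return list(chain.from_iterable(func(item) for item in items))

# A str iterates character by character
chars = flat_map(lambda title: title, films)
# A list of words iterates word by word
words = flat_map(lambda title: title.split(" "), films)
# Wrapping the title in a list keeps it whole
titles = flat_map(lambda title: [title], films)

print(chars[:10])  # ['T', 'o', 'y', ' ', 'S', 't', 'o', 'r', 'y', 'A']
print(words)       # ['Toy', 'Story', 'A', "Bug's", 'Life']
print(titles)      # ['Toy Story', "A Bug's Life"]
```

If the goal is one output element per row, `map` is the simpler choice; `flatMap` fits cases where each row should expand into several elements, such as words.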

In [0]:
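rdd=df.rdd
# Extract run_time (column index 3) from every row
reduce_rdd=rdd.map(lambda row:row[3])
# Total run time of all films, in minutes
result=reduce_rdd.reduce(lambda x,y:x+y)
print(result)

# Longest and shortest run time; the built-ins max and min already take two arguments
max_val=reduce_rdd.reduce(max)
print(max_val)
min_val=reduce_rdd.reduce(min)
print(min_val)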

2811
118
81
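
`reduce` combines elements pairwise with a two-argument function until a single value remains; Spark expects that function to be commutative and associative, because partitions are reduced independently. As a small local sketch, the same three reductions can be tried with Python's `functools.reduce` on the ten run times printed by the earlier `map` cell. This runs on a plain list rather than an RDD, so the totals differ from the full-dataset output above.

```python
from functools import reduce

# Run times (minutes) of the first ten films, as printed by the map cell above
run_times = [81, 95, 92, 92, 100, 115, 116, 111, 98, 96]

total = reduce(lambda x, y: x + y, run_times)  # pairwise sum
longest = reduce(max, run_times)               # built-in max works as a two-argument function
shortest = reduce(min, run_times)              # same for min

print(total, longest, shortest)  # 996 116 81
```

For these common cases PySpark RDDs also provide the `sum()`, `max()`, and `min()` actions, which avoid writing the lambda by hand.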


In [0]:
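rdd=df.rdd
# Key each row by film title, with the release date as the value
rdd1=rdd.map(lambda row:(row[1],row[2]))
# Keep the latest date per title; titles are unique here, so every pair passes through unchanged
reduce_rdd=rdd1.reduceByKey(lambda x,y:max(x,y))
for film,date in reduce_rdd.collect():
    print(film,date)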

Toy Story 1995-11-22
A Bug's Life 1998-11-25
Toy Story 2 1999-11-24
Monsters, Inc. 2001-11-02
Finding Nemo 2003-05-30
The Incredibles 2004-11-05
Cars 2006-06-09
Ratatouille 2007-06-29
WALL-E 2008-06-27
Up 2009-05-29
Toy Story 3 2010-06-18
Cars 2 2011-06-24
Brave 2012-06-22
Monsters University 2013-06-21
Inside Out 2015-06-19
The Good Dinosaur 2015-11-25
Finding Dory 2016-06-17
Cars 3 2017-06-16
Coco 2017-11-22
Incredibles 2 2018-06-15
Toy Story 4 2019-06-21
Onward 2020-03-06
Soul 2020-12-25
Luca 2021-06-18
Turning Red 2022-03-11
Lightyear 2022-06-17
Elemental 2023-06-16
Inside Out 2 2024-06-14
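
Every film title occurs once in this dataset, so the `reduceByKey` call above never has two values to merge and the output simply repeats the input pairs. The sketch below models the operation in plain Python to show the difference between unique and repeated keys. `reduce_by_key` is a hypothetical stand-in for `RDD.reduceByKey`, and keying by `film_rating` is an illustrative choice that the notebook itself does not make.

```python
from datetime import date

def reduce_by_key(func, pairs):
    """Plain-Python model of reduceByKey: merge values that share a key with func."""
    merged = {}
    for key, value in pairs:
        # func is only called from the second occurrence of a key onward
        merged[key] = func(merged[key], value) if key in merged else value
    return merged

# (film, film_rating, release_date) for four films from the table above
films = [
    ("Toy Story", "G", date(1995, 11, 22)),
    ("Finding Nemo", "G", date(2003, 5, 30)),
    ("The Incredibles", "PG", date(2004, 11, 5)),
    ("Up", "PG", date(2009, 5, 29)),
]

# Unique keys (film titles): nothing is merged, every pair comes back unchanged
by_film = reduce_by_key(max, [(film, released) for film, _, released in films])

# Repeated keys (ratings): max keeps the latest release date per rating
by_rating = reduce_by_key(max, [(rating, released) for _, rating, released in films])

for rating, released in by_rating.items():
    print(rating, released)
# G 2003-05-30
# PG 2009-05-29
```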


In [0]:
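rdd=df.rdd
# Pair each film title with its rating
map_rdd=rdd.map(lambda row:(row[1],row[4]))
# groupByKey gathers all values that share a key into one iterable
groupby_rdd=map_rdd.groupByKey()
for key,value in groupby_rdd.collect():
    print(key,list(value))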

Toy Story ['G']
A Bug's Life ['G']
Toy Story 2 ['G']
Monsters, Inc. ['G']
Finding Nemo ['G']
The Incredibles ['PG']
Cars ['G']
Ratatouille ['G']
WALL-E ['G']
Up ['PG']
Toy Story 3 ['G']
Cars 2 ['G']
Brave ['PG']
Monsters University ['G']
Inside Out ['PG']
The Good Dinosaur ['PG']
Finding Dory ['PG']
Cars 3 ['G']
Coco ['PG']
Incredibles 2 ['PG']
Toy Story 4 ['G']
Onward ['PG']
Soul ['PG']
Luca ['PG']
Turning Red ['PG']
Lightyear ['PG']
Elemental ['PG']
Inside Out 2 ['PG']
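
Because the key is the film title, each group above holds exactly one rating. Grouping becomes more visible when the key repeats. The following plain-Python sketch models `groupByKey` with the rating as the key; `group_by_key` is a hypothetical helper, not a Spark function, and the four pairs are taken from the output above.

```python
from collections import defaultdict

def group_by_key(pairs):
    """Plain-Python model of groupByKey: collect all values per key into a list."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)

# (film_rating, film) pairs for four films from the output above
pairs = [
    ("G", "Toy Story"),
    ("G", "Finding Nemo"),
    ("PG", "The Incredibles"),
    ("PG", "Up"),
]

films_by_rating = group_by_key(pairs)
for rating, titles in films_by_rating.items():
    print(rating, titles)
# G ['Toy Story', 'Finding Nemo']
# PG ['The Incredibles', 'Up']
```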


In [0]:
rdd=df.rdd
# Key by film title; the value is a (run_time, film_rating) tuple
rdd_map=rdd.map(lambda row:(row[1],(row[3],row[4])))
# Per key, keep the longest run time together with the rating that belongs to it
rdd_reduce=rdd_map.reduceByKey(lambda x,y : (max(x[0],y[0]),x[1] if x[0] >= y[0] else y[1]))
for film, (max_run_time, rating) in rdd_reduce.take(10):
    print(film, max_run_time, rating)

Toy Story 81 G
A Bug's Life 95 G
Toy Story 2 92 G
Monsters, Inc. 92 G
Finding Nemo 100 G
The Incredibles 115 PG
Cars 116 G
Ratatouille 111 G
WALL-E 98 G
Up 96 PG


In [0]:
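rdd = df.rdd
# Key by film title; the value is a (run_time, film_rating) tuple
rdd_map = rdd.map(lambda row: (row[1], (row[3], row[4])))
# Per key, keep the shortest run time together with the rating that belongs to it
rdd_reduce = rdd_map.reduceByKey(lambda x, y: (min(x[0], y[0]), x[1] if x[0] <= y[0] else y[1]))
for film, (min_run_time, rating) in rdd_reduce.take(5):
    print(film, min_run_time, rating)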

Toy Story 81 G
A Bug's Life 95 G
Toy Story 2 92 G
Monsters, Inc. 92 G
Finding Nemo 100 G


In [0]:
from pyspark.sql import Row
# Two hand-built rows, so this cell does not depend on the CSV file
data = [
    Row(number=1, film="Toy Story", release_date="1995-11-22", run_time=81, film_rating="G"),
    Row(number=2, film="A Bug's Life", release_date="1998-11-25", run_time=95, film_rating="G"),
]
rdd=sc.parallelize(data)
# mapPartitions calls the function once per partition and passes it an iterator over that partition's rows
rdd1=rdd.mapPartitions(lambda partition:[(row.film,row.run_time*2,row.film_rating) for row in partition])
print(rdd1.collect())


[('Toy Story', 162, 'G'), ("A Bug's Life", 190, 'G')]
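
Unlike `map`, which calls its function once per row, `mapPartitions` calls it once per partition with an iterator over that partition's rows. The sketch below models this in plain Python under the assumption that the two rows sit in two separate partitions; `map_partitions`, `double_run_time`, and `call_log` are hypothetical names used only for illustration.

```python
from itertools import chain

def map_partitions(func, partitions):
    """Plain-Python model of mapPartitions: call func once per partition, then concatenate."""
    return list(chain.from_iterable(func(iter(part)) for part in partitions))

# Assumed layout: one (film, run_time, film_rating) row per partition
partitions = [
    [("Toy Story", 81, "G")],
    [("A Bug's Life", 95, "G")],
]

call_log = []  # one entry per call, to show how often the function runs

def double_run_time(rows):
    call_log.append("called")
    return [(film, run_time * 2, rating) for film, run_time, rating in rows]

result = map_partitions(double_run_time, partitions)
print(result)         # [('Toy Story', 162, 'G'), ("A Bug's Life", 190, 'G')]
print(len(call_log))  # 2, one call per partition
```

This per-partition call pattern is why `mapPartitions` suits work with a setup cost, such as opening a connection, that should be paid once per partition rather than once per row.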


In [0]:
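from pyspark.sql.functions import year

# Derive the release year from release_date
df_with_year = df.withColumn("year", year("release_date"))

# Pivot: one row per year, one column per film_rating, each cell the number of films
pivot_df = df_with_year.groupBy("year") \
    .pivot("film_rating") \
    .count()

pivot_df.show()

# Unpivot: stack() turns the G and PG columns back into (film_rating, count) rows;
# year/rating combinations with no films are null and are filtered out
unpivot_df = pivot_df.selectExpr("year",
    "stack(2, 'G', G, 'PG', PG) as (film_rating, count)"
).where("count is not null")

unpivot_df.show()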

+----+----+----+
|year|   G|  PG|
+----+----+----+
|2003|   1|null|
|2007|   1|null|
|2018|null|   1|
|2015|null|   2|
|2023|null|   1|
|2006|   1|null|
|2022|null|   2|
|2013|   1|null|
|2019|   1|null|
|2004|null|   1|
|1998|   1|null|
|2020|null|   2|
|2012|null|   1|
|2009|null|   1|
|2016|null|   1|
|1995|   1|null|
|2001|   1|null|
|2024|null|   1|
|2010|   1|null|
|2011|   1|null|
+----+----+----+
only showing top 20 rows

+----+-----------+-----+
|year|film_rating|count|
+----+-----------+-----+
|2003|          G|    1|
|2007|          G|    1|
|2018|         PG|    1|
|2015|         PG|    2|
|2023|         PG|    1|
|2006|          G|    1|
|2022|         PG|    2|
|2013|          G|    1|
|2019|          G|    1|
|2004|         PG|    1|
|1998|          G|    1|
|2020|         PG|    2|
|2012|         PG|    1|
|2009|         PG|    1|
|2016|         PG|    1|
|1995|          G|    1|
|2001|          G|    1|
|2024|         PG|    1|
|2010|          G|    1|
|2011|          