## Overview

This notebook shows how to create and query a table or DataFrame from a file that you uploaded to DBFS. [DBFS](https://docs.databricks.com/user-guide/dbfs-databricks-file-system.html), the Databricks File System, lets you store data for querying inside Databricks. The notebook assumes that the file you want to read is already in DBFS.

This notebook is written in **Python**, so the default cell type is Python. You can use a different language in a cell with the `%LANGUAGE` syntax; Python, Scala, SQL, and R are all supported.

In [2]:
# File location and type
file_location = "/FileStore/tables/corona_dataset_latest.csv"
file_type = "csv"

# CSV options
infer_schema = "true"
first_row_is_header = "true"
delimiter = ","

# The applied options are for CSV files. For other file types, these will be ignored.
corona_df = spark.read.format(file_type) \
  .option("inferSchema", infer_schema) \
  .option("header", first_row_is_header) \
  .option("sep", delimiter) \
  .load(file_location)

display(corona_df)

_c0,State,Country,Lat,Long,Date,Confirmed,Death,Recovered,state_cleaned,City
0,,Thailand,15.0,101.0,2020-01-22T00:00:00.000+0000,2,0,0,Bangkok,
1,,Japan,36.0,138.0,2020-01-22T00:00:00.000+0000,2,0,0,Hiraide,
2,,Singapore,1.2833,103.8333,2020-01-22T00:00:00.000+0000,0,0,0,Singapore,
3,,Nepal,28.1667,84.25,2020-01-22T00:00:00.000+0000,0,0,0,Kathmandu,
4,,Malaysia,2.5,112.5,2020-01-22T00:00:00.000+0000,0,0,0,Sarawak,
5,British Columbia,Canada,49.2827,-123.1207,2020-01-22T00:00:00.000+0000,0,0,0,British Columbia,
6,New South Wales,Australia,-33.8688,151.2093,2020-01-22T00:00:00.000+0000,0,0,0,New South Wales,
7,Victoria,Australia,-37.8136,144.9631,2020-01-22T00:00:00.000+0000,0,0,0,Victoria,
8,Queensland,Australia,-28.0167,153.4,2020-01-22T00:00:00.000+0000,0,0,0,Queensland,
9,,Cambodia,11.55,104.9167,2020-01-22T00:00:00.000+0000,0,0,0,Phnom Penh,


In [3]:
# Create a view or table

temp_table_name = "corona"

corona_df.createOrReplaceTempView(temp_table_name)

In [4]:
%sql

/* Query the created temp table in a SQL cell */

select * from `corona`

_c0,State,Country,Lat,Long,Date,Confirmed,Death,Recovered,state_cleaned,City
0,,Thailand,15.0,101.0,2020-01-22T00:00:00.000+0000,2,0,0,Bangkok,
1,,Japan,36.0,138.0,2020-01-22T00:00:00.000+0000,2,0,0,Hiraide,
2,,Singapore,1.2833,103.8333,2020-01-22T00:00:00.000+0000,0,0,0,Singapore,
3,,Nepal,28.1667,84.25,2020-01-22T00:00:00.000+0000,0,0,0,Kathmandu,
4,,Malaysia,2.5,112.5,2020-01-22T00:00:00.000+0000,0,0,0,Sarawak,
5,British Columbia,Canada,49.2827,-123.1207,2020-01-22T00:00:00.000+0000,0,0,0,British Columbia,
6,New South Wales,Australia,-33.8688,151.2093,2020-01-22T00:00:00.000+0000,0,0,0,New South Wales,
7,Victoria,Australia,-37.8136,144.9631,2020-01-22T00:00:00.000+0000,0,0,0,Victoria,
8,Queensland,Australia,-28.0167,153.4,2020-01-22T00:00:00.000+0000,0,0,0,Queensland,
9,,Cambodia,11.55,104.9167,2020-01-22T00:00:00.000+0000,0,0,0,Phnom Penh,


In [5]:
# File location and type
file_location = "/FileStore/tables/tweets.csv"
file_type = "csv"

# CSV options
infer_schema = "true"
first_row_is_header = "true"
delimiter = ","

# The applied options are for CSV files. For other file types, these will be ignored.
twitter_df = spark.read.format(file_type) \
  .option("inferSchema", infer_schema) \
  .option("header", first_row_is_header) \
  .option("sep", delimiter) \
  .load(file_location)

display(twitter_df)

_c0,geo,text,user,location,entities,sentiment,country
0,,What is God saying to us about #coronavirus ?,petodinice,Lagos,"[('about #', 'CARDINAL')]","{'neg': 0.0, 'neu': 0.769, 'pos': 0.231, 'compound': 0.2732}",Nigeria
1,,"""BREAKING: """"this is disappointing","but i took the test"""". Arsenal's Mikel Arteta tests positive for #coronavirus","-… """,JerryfranksonJF,"Abuja, Nigeria","""[(""""Arsenal's Mikel Arteta"""""
2,,#Coronavirus testing must be made free to the public if we are going to understand the scope of this crisis. Anything le…,cek422,"Pennsylvania, USA",[],"{'neg': 0.173, 'neu': 0.71, 'pos': 0.117, 'compound': -0.3767}",USA
3,,Get ready for mass event crowd cancellations across the World starting this weekend: cricket in #Australia in empty st…,InfectiousDz,NYC,"[('World', 'ORG'), ('this weekend', 'DATE'), '']","{'neg': 0.085, 'neu': 0.798, 'pos': 0.117, 'compound': 0.1779}",USA
4,,The #coronavirus pandemic is revealing just how closely we are all bound together...[A thread],vic_gibson,,[''],"{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}",
5,,"COVID-19 update as of this morning:1. Death toll in Italy passes 1,0002. Arsenal's head coach Arteta tests positive3. US…",StewartNgilana,Durban | Port Elizabeth,"[('Italy', 'GPE'), ('1,0002', 'CARDINAL'), ('Arsenal', 'ORG'), ('Arteta', 'ORG'), ('US', 'GPE')]","{'neg': 0.178, 'neu': 0.822, 'pos': 0.0, 'compound': -0.5994}",USA
6,,"It’s painful to say, but as an ER doc who talked to about healthcare, I believe he's uniquely unqualified to…",BWheatnyc,Florida,[],"{'neg': 0.098, 'neu': 0.902, 'pos': 0.0, 'compound': -0.2382}",USA
7,,"📽️Friends, I will drop another video on Twitter tonight around 7:30pm to answer your questions about #coronavirus.Plea…",LorseaR,"New South Wales, Australia","[('Twitter', 'GPE'), ('tonight', 'TIME')]","{'neg': 0.123, 'neu': 0.877, 'pos': 0.0, 'compound': -0.2732}",Australia
8,,"Questions about COVID-19? See this resource and info page from NAfME for music educators, including music instrument hygiene: ht…",straightj23,"Columbus, OH","[('NAfME', 'CARDINAL')]","{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}",USA
9,,"How they’re dealing with #coronavirus in a public facility in Portsmouth, Uk. Yes, that’s a coronavirus warning sig…",_______coolio,"South East, England","['', ('Portsmouth', 'GPE'), ('Uk', 'GPE')]","{'neg': 0.126, 'neu': 0.733, 'pos': 0.141, 'compound': 0.0772}",UK
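Row 1 in the output above appears to be split across the wrong columns. One plausible cause, assumed here rather than confirmed, is a quoted field that contains commas and doubled quotes (`""`). The sketch below uses Python's standard `csv` module, not Spark, on a hypothetical line to show how a parser that understands doubled quotes keeps such a field in one piece.

```python
import csv
import io

# Hypothetical line in the style of the tweets file: the third field
# contains a comma and doubled quotes inside a quoted value.
line = '1,,"He said ""this is disappointing, but i took the test"".",some_user\n'

# csv.reader treats "" inside a quoted field as one literal quote
# (doublequote=True is the default dialect setting).
fields = next(csv.reader(io.StringIO(line)))

print(len(fields))  # number of parsed columns
print(fields[2])    # the text field, kept intact
```

In Spark, the analogous reader settings to try would be `.option("quote", "\"")` together with `.option("escape", "\"")`, plus `.option("multiLine", "true")` if tweets contain line breaks. Whether these repair this particular file has not been verified here.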


In [6]:
# Create a view or table

temp_table_name = "twitter"

twitter_df.createOrReplaceTempView(temp_table_name)

In [7]:
%sql

/* Query the created temp table in a SQL cell */

select * from `twitter`

_c0,geo,text,user,location,entities,sentiment,country
0,,What is God saying to us about #coronavirus ?,petodinice,Lagos,"[('about #', 'CARDINAL')]","{'neg': 0.0, 'neu': 0.769, 'pos': 0.231, 'compound': 0.2732}",Nigeria
1,,"""BREAKING: """"this is disappointing","but i took the test"""". Arsenal's Mikel Arteta tests positive for #coronavirus","-… """,JerryfranksonJF,"Abuja, Nigeria","""[(""""Arsenal's Mikel Arteta"""""
2,,#Coronavirus testing must be made free to the public if we are going to understand the scope of this crisis. Anything le…,cek422,"Pennsylvania, USA",[],"{'neg': 0.173, 'neu': 0.71, 'pos': 0.117, 'compound': -0.3767}",USA
3,,Get ready for mass event crowd cancellations across the World starting this weekend: cricket in #Australia in empty st…,InfectiousDz,NYC,"[('World', 'ORG'), ('this weekend', 'DATE'), '']","{'neg': 0.085, 'neu': 0.798, 'pos': 0.117, 'compound': 0.1779}",USA
4,,The #coronavirus pandemic is revealing just how closely we are all bound together...[A thread],vic_gibson,,[''],"{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}",
5,,"COVID-19 update as of this morning:1. Death toll in Italy passes 1,0002. Arsenal's head coach Arteta tests positive3. US…",StewartNgilana,Durban | Port Elizabeth,"[('Italy', 'GPE'), ('1,0002', 'CARDINAL'), ('Arsenal', 'ORG'), ('Arteta', 'ORG'), ('US', 'GPE')]","{'neg': 0.178, 'neu': 0.822, 'pos': 0.0, 'compound': -0.5994}",USA
6,,"It’s painful to say, but as an ER doc who talked to about healthcare, I believe he's uniquely unqualified to…",BWheatnyc,Florida,[],"{'neg': 0.098, 'neu': 0.902, 'pos': 0.0, 'compound': -0.2382}",USA
7,,"📽️Friends, I will drop another video on Twitter tonight around 7:30pm to answer your questions about #coronavirus.Plea…",LorseaR,"New South Wales, Australia","[('Twitter', 'GPE'), ('tonight', 'TIME')]","{'neg': 0.123, 'neu': 0.877, 'pos': 0.0, 'compound': -0.2732}",Australia
8,,"Questions about COVID-19? See this resource and info page from NAfME for music educators, including music instrument hygiene: ht…",straightj23,"Columbus, OH","[('NAfME', 'CARDINAL')]","{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}",USA
9,,"How they’re dealing with #coronavirus in a public facility in Portsmouth, Uk. Yes, that’s a coronavirus warning sig…",_______coolio,"South East, England","['', ('Portsmouth', 'GPE'), ('Uk', 'GPE')]","{'neg': 0.126, 'neu': 0.733, 'pos': 0.141, 'compound': 0.0772}",UK


In [8]:
twitter_df.filter("country='USA'").show()

In [9]:
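# Both predicates go in one SQL expression; a Python `and` between two strings would keep only the second one
twitter_df.filter("country = 'USA' and location like 'New%'").show()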
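One thing to watch when combining filter conditions: joining two condition strings with Python's `and`, as in `filter("a" and "b")`, does not combine them. Python evaluates the `and` before `filter` is called, and for two non-empty strings the result is simply the second string. A minimal plain-Python sketch of that behavior (no Spark needed):

```python
first = "country='USA'"
second = "location like 'New%'"

# Non-empty strings are truthy, so `and` returns its second operand.
passed_to_filter = first and second
print(passed_to_filter)

# Writing both predicates into one SQL expression keeps both conditions.
combined = f"{first} and {second}"
print(combined)
```

Comparing the output of `explain()` for the two variants is a quick way to confirm which predicates actually reach the query plan.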

In [10]:
tw_filter_df=twitter_df.filter("country='USA'")

In [11]:
tw_filter_df.explain()

In [12]:
# Both predicates go in one SQL expression so that both appear in the plan
tw_filter_df = twitter_df.filter("country = 'USA' and location like 'New%'")

In [13]:
tw_filter_df.explain()

In [14]:
twitter_df

In [15]:
twitter_df.first()

In [16]:
twitter_df.take(5)

In [17]:
twitter_df.select('text').show()

In [18]:
twitter_df.select('text','user').show()

In [19]:
corona_df.show()

In [20]:
corona_df.filter("Country = 'US'").show()

In [21]:
from pyspark.sql.functions import col
corona_df.filter("Country = 'US'").sort(col('Date'),ascending=False).show()

In [22]:
corona_df.filter("Country = 'US'").sortWithinPartitions([col('Date'),col('Confirmed')],ascending=False).show()

In [23]:
corona_df.describe().show()

In [24]:
display(corona_df.describe())

summary,_c0,State,Country,Lat,Long,Confirmed,Death,Recovered,state_cleaned,City
count,28143.0,19116,28143,28143.0,28143.0,28143.0,28143.0,28143.0,28143,14573
mean,14071.0,,,30.965553459118837,-34.57031257861667,161.88245744945456,5.494368048893153,60.17290267562094,,
stddev,8124.328649186959,,,19.36547282659765,80.78375872452575,2519.847217725942,109.29475709869877,1346.6597829124426,,
min,0.0,"Adams, IN",Afghanistan,-41.4545,-157.8584,0.0,0.0,0.0,Abu Dhabi,Adams
max,28142.0,Zhejiang,Zimbabwe,71.7069,178.065,67800.0,4032.0,58382.0,erdészet,Yolo County


In [25]:
corona_df.printSchema()

In [26]:
corona_df.filter('Confirmed >1000').sort(col('Confirmed')).show()

In [27]:
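# The third argument is the relative error (0 = exact), not a confidence level; a small value such as 0.05 gives meaningful quantiles
corona_df.filter('Confirmed > 10000').approxQuantile('Confirmed', [0.25, 0.50, 0.75, 0.95], 0.05)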
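The last argument of `approxQuantile` is a relative error on the rank of the returned value. For N rows and probability p, Spark's documentation states that the returned value has a rank between floor((p - err) * N) and ceil((p + err) * N). The sketch below computes that window in plain Python for a hypothetical row count. `rank_window` is an illustrative helper, not a Spark function.

```python
from fractions import Fraction
import math

def rank_window(n_rows, probability, relative_error):
    """Return the (lowest, highest) rank allowed for the result, clamped to [0, n_rows]."""
    p = Fraction(probability)        # exact arithmetic avoids float rounding
    err = Fraction(relative_error)
    low = math.floor((p - err) * n_rows)
    high = math.ceil((p + err) * n_rows)
    return max(low, 0), min(high, n_rows)

n_rows = 1000  # hypothetical row count
for err in ("0.95", "0.05", "0"):
    print(err, rank_window(n_rows, "0.5", err))
```

With an error of 0.95 the window for the median covers every rank, so any value in the column would be an acceptable answer. An error of 0 requests the exact quantile at a higher computational cost.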

In [28]:
corona_df.agg({'Date': 'max','confirmed':'max'}).collect()

In [29]:
max_date=corona_df.agg({'Date': 'max'})

In [30]:
max_date.show()

In [31]:
import pyspark.sql.functions as f
corona_df.groupBy('country','state_cleaned').agg(f.max('Date')).show()

In [32]:
corona_df.join(corona_df.groupBy('country','state_cleaned').agg(f.max('Date').alias('Date')),on=['country','state_cleaned','Date'],how='inner').show()

In [33]:
corona_df.join(corona_df.groupBy('country','state_cleaned').agg(f.max('Date').alias('Date')),on=['country','state_cleaned','Date'],how='inner').sort('Confirmed',ascending=False).show()

In [34]:
from pyspark.sql.functions import row_number
from pyspark.sql.window import Window

ws=Window().partitionBy('Country','state_cleaned').orderBy(col('Date').desc())

corona_df.withColumn('row number',row_number().over(ws)).show()

In [35]:
corona_df.withColumn('row number',row_number().over(ws)).where(col('row number')==1).show()
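The window pattern above numbers the rows inside each (`Country`, `state_cleaned`) partition from newest to oldest and keeps row 1, that is, the latest record per region. A plain-Python sketch of the same idea on a few hypothetical rows (the values are made up for illustration):

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical rows; ISO date strings sort chronologically.
rows = [
    {"Country": "A", "state_cleaned": "X", "Date": "2020-03-18", "Confirmed": 10},
    {"Country": "B", "state_cleaned": "Z", "Date": "2020-03-20", "Confirmed": 7},
    {"Country": "A", "state_cleaned": "X", "Date": "2020-03-20", "Confirmed": 30},
    {"Country": "A", "state_cleaned": "Y", "Date": "2020-03-19", "Confirmed": 5},
    {"Country": "B", "state_cleaned": "Z", "Date": "2020-03-19", "Confirmed": 6},
]

partition_key = itemgetter("Country", "state_cleaned")

# Newest first, then a stable sort by partition so each group stays date-descending.
ordered = sorted(sorted(rows, key=itemgetter("Date"), reverse=True), key=partition_key)

latest = []
for _, group in groupby(ordered, key=partition_key):
    for row_number, row in enumerate(group, start=1):  # numbering restarts per partition
        if row_number == 1:
            latest.append(row)

for row in latest:
    print(row["Country"], row["state_cleaned"], row["Date"], row["Confirmed"])
```

This yields the same rows as the group-by-max-date self join used earlier in the notebook, provided each region has a single row per date.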

In [36]:
display(corona_df.groupBy('Country').pivot('Date').agg(f.sum('Confirmed')))

Country,2020-01-22 00:00:00,2020-01-23 00:00:00,2020-01-24 00:00:00,2020-01-25 00:00:00,2020-01-26 00:00:00,2020-01-27 00:00:00,2020-01-28 00:00:00,2020-01-29 00:00:00,2020-01-30 00:00:00,2020-01-31 00:00:00,2020-02-01 00:00:00,2020-02-02 00:00:00,2020-02-03 00:00:00,2020-02-04 00:00:00,2020-02-05 00:00:00,2020-02-06 00:00:00,2020-02-07 00:00:00,2020-02-08 00:00:00,2020-02-09 00:00:00,2020-02-10 00:00:00,2020-02-11 00:00:00,2020-02-12 00:00:00,2020-02-13 00:00:00,2020-02-14 00:00:00,2020-02-15 00:00:00,2020-02-16 00:00:00,2020-02-17 00:00:00,2020-02-18 00:00:00,2020-02-19 00:00:00,2020-02-20 00:00:00,2020-02-21 00:00:00,2020-02-22 00:00:00,2020-02-23 00:00:00,2020-02-24 00:00:00,2020-02-25 00:00:00,2020-02-26 00:00:00,2020-02-27 00:00:00,2020-02-28 00:00:00,2020-02-29 00:00:00,2020-03-01 00:00:00,2020-03-02 00:00:00,2020-03-03 00:00:00,2020-03-04 00:00:00,2020-03-05 00:00:00,2020-03-06 00:00:00,2020-03-07 00:00:00,2020-03-08 00:00:00,2020-03-09 00:00:00,2020-03-10 00:00:00,2020-03-11 00:00:00,2020-03-12 00:00:00,2020-03-13 00:00:00,2020-03-14 00:00:00,2020-03-15 00:00:00,2020-03-16 00:00:00,2020-03-17 00:00:00,2020-03-18 00:00:00,2020-03-19 00:00:00,2020-03-20 00:00:00
Chad,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1
Russia,0,0,0,0,0,0,0,0,0,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,3,3,3,4,13,13,17,17,20,20,28,45,59,63,90,114,147,199,253
Paraguay,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,5,5,6,6,6,8,9,11,11,13
Senegal,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,2,4,4,4,4,4,4,4,4,4,10,10,24,24,26,31,31,38
Sweden,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,7,7,12,14,15,21,35,94,101,161,203,248,355,500,599,814,961,1022,1103,1190,1279,1439,1639
Cabo Verde,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
Guyana,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,4,4,7,7,7,7
Philippines,0,0,0,0,0,0,0,0,1,1,1,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,5,6,10,20,33,49,52,64,111,140,142,187,202,217,230
Djibouti,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1
Malaysia,0,0,0,3,4,4,4,7,8,8,8,8,8,10,12,12,12,16,16,18,18,18,19,19,22,22,22,22,22,22,22,22,22,22,22,22,23,23,25,29,29,36,50,50,83,93,99,117,129,149,149,197,238,428,566,673,790,900,1030


In [37]:
corona_df.filter("Country = 'Italy'").sort('Date',ascending=False).show()

In [38]:
corona_max_df=corona_df.join(corona_df.groupBy('country','state_cleaned').agg(f.max('Date').alias('Date')),on=['country','state_cleaned','Date'],how='inner').sort('Confirmed',ascending=False)

In [39]:
corona_max_df.show()

In [40]:
corona_max_df.withColumn('Active',corona_max_df.Confirmed - corona_max_df.Recovered - corona_max_df.Death).sort('Active', ascending=False).show()

In [41]:
corona_max_df=corona_max_df.withColumn('Active',corona_max_df.Confirmed - corona_max_df.Recovered - corona_max_df.Death)

In [42]:
corona_max_df.show()

In [43]:
corona_max_df.groupBy('Country').agg(f.sum('Active').alias('Total')).orderBy('Total',ascending=False).show()

In [44]:
corona_max_df.show()

In [45]:
corona_max_df.select('Country','state_cleaned','Confirmed','Death','Recovered').filter(col('Country').isin ('Australia','China')).groupBy('Country').sum().show(50)

In [46]:
corona_max_df.select('Country','state_cleaned','Confirmed','Death','Recovered').filter(col('Country').isin ('Australia','China')).cube('Country').sum().show()

In [47]:
corona_max_df.select('Country','state_cleaned','Confirmed','Death','Recovered').filter(col('Country').isin ('Australia','Canada')).cube('Country','state_cleaned').sum().sort('Country').show(100)

In [48]:
corona_max_df.select('Country','state_cleaned','Confirmed','Death','Recovered').filter(col('Country').isin ('Australia','Canada')).rollup('Country','state_cleaned').sum().sort('Country').show(100)
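The difference between `cube` and `rollup` is which column combinations get subtotals. `cube` aggregates over every subset of the grouping columns, while `rollup` aggregates only over the left-to-right prefixes. A small plain-Python sketch that lists those grouping sets (the helper names are illustrative, not part of PySpark):

```python
from itertools import combinations

def cube_grouping_sets(columns):
    """Every subset of the columns, from all of them down to the grand total ()."""
    return [combo
            for size in range(len(columns), -1, -1)
            for combo in combinations(columns, size)]

def rollup_grouping_sets(columns):
    """Only the left-to-right prefixes of the columns, ending with the grand total ()."""
    return [tuple(columns[:size]) for size in range(len(columns), -1, -1)]

cols = ["Country", "state_cleaned"]
print(cube_grouping_sets(cols))
print(rollup_grouping_sets(cols))
```

For these two columns, `cube` adds a per-`state_cleaned` subtotal across countries that `rollup` omits, which is why the `cube` result contains rows where `Country` is null but `state_cleaned` is not.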

In [49]:
corona_max_df.cache()

In [50]:
%timeit corona_max_df.count()

In [51]:
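from pyspark import StorageLevel
# cache() above already persisted this DataFrame; unpersist first so the explicit storage level is the one applied
corona_max_df.unpersist()
corona_max_df.persist(StorageLevel.MEMORY_AND_DISK)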

In [52]:
%timeit corona_max_df.count()

In [53]:
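# Use a name other than `pd`, which would shadow the conventional pandas alias
corona_pdf = corona_df.toPandas()

In [54]:
# Correlation between the numeric columns (on pandas 2.x, pass numeric_only=True to skip string columns)
corona_pdf.corr()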

Unnamed: 0,_c0,Lat,Long,Confirmed,Death,Recovered
_c0,1.0,-0.001594,-0.005303,0.046789,0.043155,0.042857
Lat,-0.001594,1.0,-0.383389,0.008531,0.007041,0.000778
Long,-0.005303,-0.383389,1.0,0.098893,0.074918,0.079047
Confirmed,0.046789,0.008531,0.098893,1.0,0.963376,0.853924
Death,0.043155,0.007041,0.074918,0.963376,1.0,0.857292
Recovered,0.042857,0.000778,0.079047,0.853924,0.857292,1.0


In [55]:
corona_max_df.createOrReplaceTempView('corona')

In [56]:
spark.sql('select * from corona').show()

In [57]:
%sql
select Country, state_cleaned, sum(Confirmed), sum(Recovered)
from corona
where Country in ('China', 'Italy')
group by rollup (Country, state_cleaned)
order by Country

Country,state_cleaned,sum(Confirmed),sum(Recovered)
,,128271,75706
China,Chongqing,576,570
China,,81250,71266
China,Shanghai,371,326
China,Ningxia,75,75
China,Shandong,762,748
China,Hong Kong,256,98
China,Shanxi,133,133
China,Hainan,168,161
China,Fujian,299,295
