**A note on what we have to import**
* Have you spotted the rule for what we import? We import standalone functions (col, expr, lit...) and classes (Row, StructField, StructType, StringType, LongType). We don't import methods that belong to a DataFrame.
* count() is an excellent example of the difference: the count() function from pyspark.sql.functions (used inside a transformation) must be imported, whereas the DataFrame method df.count() (an action) needs no import.

#### Schemas *(Additional)*

In [0]:
from pyspark.sql import Row
from pyspark.sql.types import StructField, StructType, StringType, LongType
myManualSchema = StructType([
    StructField("some", StringType(), True),
    StructField("col", StringType(), True),
    StructField("names", LongType(), False)
])
myRow = Row("Hello", None, 1)
myDf = spark.createDataFrame([myRow], myManualSchema)
myDf.show()

+-----+----+-----+
| some| col|names|
+-----+----+-----+
|Hello|null|    1|
+-----+----+-----+



### 1. Data Sources, import and export
#### CSV

In [0]:
# Let a data source define the schema
df = (spark.read
      .format("csv")
      .option("header", "true") # add this in a second step; the default is false
      .load("/databricks-datasets/definitive-guide/data/flight-data/csv/2015-summary.csv")
     )

In [0]:
df.show(2)
df.schema # ask the trainees: attribute or method?
df.printSchema() # don't forget the () after .printSchema()

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|    United States|            Romania|   15|
|    United States|            Croatia|    1|
+-----------------+-------------------+-----+
only showing top 2 rows

Out[23]: StructType([StructField('DEST_COUNTRY_NAME', StringType(), True), StructField('ORIGIN_COUNTRY_NAME', StringType(), True), StructField('count', StringType(), True)])

In [0]:
# we can define it explicitly ourselves
myManualSchema = StructType([
    StructField("DEST_COUNTRY_NAME", StringType(), True),
    StructField("ORIGIN_COUNTRY_NAME", StringType(), True),
    StructField("count", LongType(), False)
])

df = (spark.read
      .format("csv")
      .option("header", "true")
      .schema(myManualSchema)
      .load("/databricks-datasets/definitive-guide/data/flight-data/csv/2015-summary.csv")
     )

In [0]:
df.schema

Out[15]: StructType([StructField('DEST_COUNTRY_NAME', StringType(), True), StructField('ORIGIN_COUNTRY_NAME', StringType(), True), StructField('count', LongType(), True)])

**We can reuse the schema of an existing DataFrame**

In [0]:
df_2014 = (spark.read
      .format("csv")
      .option("header", "true")
#       .schema(myManualSchema)
      .load("/databricks-datasets/definitive-guide/data/flight-data/csv/2014-summary.csv")
     )

In [0]:
# We can reuse the schema of an existing DataFrame
df_2014 = (spark.read
      .format("csv")
      .option("header", "true")
      .schema(df.schema)
      .load("/databricks-datasets/definitive-guide/data/flight-data/csv/2014-summary.csv")
     )

In [0]:
df_2014.schema

Out[22]: StructType([StructField('DEST_COUNTRY_NAME', StringType(), True), StructField('ORIGIN_COUNTRY_NAME', StringType(), True), StructField('count', LongType(), True)])

#### JSON
JSON has fewer options than CSV; for example, it has no "header" option.

In [0]:
df_Json = (
    spark.read.format("json")
#     .schema(df.schema) # no need to specify a schema: type inference works better with JSON than with CSV
    .load("/databricks-datasets/definitive-guide/data/flight-data/json/2015-summary.json")
)

df1 = spark.read.format("json").load("/databricks-datasets/definitive-guide/data/flight-data/json/") # loads every file in the folder; works with or without the trailing /
df.count() # 256
df1.count() # 1502

In [0]:
df_Json.printSchema()

root
 |-- DEST_COUNTRY_NAME: string (nullable = true)
 |-- ORIGIN_COUNTRY_NAME: string (nullable = true)
 |-- count: long (nullable = true)



In [0]:
%who
del df1

df	 df_Json	 os	 


#### Export and delete

In [0]:
df.coalesce(1).write.format('csv').mode('overwrite').option('header', 'true').save('dbfs:/FileStore/df/export_fg.csv')
# df.coalesce(1).write.format('csv').option('header', 'true').save('/dbfs/FileStore/NJ/wrtdftodbfs.txt')

# list contents
dbutils.fs.ls("FileStore/tables")

# delete element
dbutils.fs.rm('FileStore/df/Sample.csv',True)

Out[65]: [FileInfo(path='dbfs:/FileStore/tables/2015_summary-1.json', name='2015_summary-1.json', size=21368, modificationTime=1679832308000),
 FileInfo(path='dbfs:/FileStore/tables/2015_summary-2.json', name='2015_summary-2.json', size=21368, modificationTime=1679832404000),
 FileInfo(path='dbfs:/FileStore/tables/2015_summary-3.json', name='2015_summary-3.json', size=21368, modificationTime=1679832512000),
 FileInfo(path='dbfs:/FileStore/tables/2015_summary.json', name='2015_summary.json', size=21368, modificationTime=1679831665000),
 FileInfo(path='dbfs:/FileStore/tables/AABA_2006_01_01_to_2018_01_01.csv', name='AABA_2006_01_01_to_2018_01_01.csv', size=145792, modificationTime=1675007933000),
 FileInfo(path='dbfs:/FileStore/tables/AMZN_2006_01_01_to_2018_01_01.csv', name='AMZN_2006_01_01_to_2018_01_01.csv', size=151374, modificationTime=1675008319000),
 FileInfo(path='dbfs:/FileStore/tables/GOOGL_2006_01_01_to_2018_01_01.csv', name='GOOGL_2006_01_01_to_2018_01_01.csv', size=158672, m

#### Parquet

Parquet is an open source column-oriented data store that provides a variety of storage optimizations, especially for analytics workloads. It provides columnar compression, which saves storage space and allows for reading individual columns instead of entire files. We recommend writing data out to Parquet for long-term storage because reading from a Parquet file is generally more efficient than reading JSON or CSV.

There are very few Parquet options (precisely two, in fact) because it has a well-defined specification that aligns closely with the concepts in Spark.

In [0]:
df_parquet = spark.read.format("parquet").load("/databricks-datasets/definitive-guide/data/flight-data/parquet/2010-summary.parquet")

df_parquet.show(2)

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|    United States|            Romania|    1|
|    United States|            Ireland|  264|
+-----------------+-------------------+-----+
only showing top 2 rows



### 2. Rows, Columns and Expressions

#### Rows

In [0]:
from pyspark.sql import Row
myRow = Row("Hello", None, 1, False)
type(myRow)
myRow[0]
myRow[2]

Out[168]: 1

#### Columns
Use the col(), column() or expr() functions to construct and refer to columns.

In [0]:
from pyspark.sql.functions import col, column, expr

#### select and selectExpr

In [0]:
df.select('ORIGIN_COUNTRY_NAME').show(2)

+-------------------+
|ORIGIN_COUNTRY_NAME|
+-------------------+
|            Romania|
|            Croatia|
+-------------------+
only showing top 2 rows



In [0]:
df.select('ORIGIN_COUNTRY_NAME', 'DEST_COUNTRY_NAME').show(2)

+-------------------+-----------------+
|ORIGIN_COUNTRY_NAME|DEST_COUNTRY_NAME|
+-------------------+-----------------+
|            Romania|    United States|
|            Croatia|    United States|
+-------------------+-----------------+
only showing top 2 rows



In [0]:
from pyspark.sql.functions import expr
df.select(col('DEST_COUNTRY_NAME').alias('destination'), 
          column('DEST_COUNTRY_NAME'), 
          expr('DEST_COUNTRY_NAME as destination'), 
          'DEST_COUNTRY_NAME').show(2)
df.selectExpr("DEST_COUNTRY_NAME as destination", "count").show(2)
# selectExpr is recommended for its more flexible syntax

# col(), column() and expr() are generally used inside other methods

+-------------+-----------------+-------------+-----------------+
|  destination|DEST_COUNTRY_NAME|  destination|DEST_COUNTRY_NAME|
+-------------+-----------------+-------------+-----------------+
|United States|    United States|United States|    United States|
|United States|    United States|United States|    United States|
+-------------+-----------------+-------------+-----------------+
only showing top 2 rows



### 3. DataFrame Transformations
Spark DataFrames are inherently unordered and do not support random access. (There is no concept of a built-in index as there is in pandas).

#### 3.1 Creating SQL table (view)

In [0]:
df.createGlobalTempView("dfTable") # "Global": shared across the sessions of the same application and kept alive until the application ends; query it as global_temp.dfTable

df.createOrReplaceTempView("dfTable") # visible only within the current Spark session

In [0]:
# Once the DataFrame is registered as a temporary view, we can query it with SQL
spark.sql('SELECT * FROM dfTable').show()

+--------------------+-------------------+-----+
|   DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+--------------------+-------------------+-----+
|       United States|            Romania|   15|
|       United States|            Croatia|    1|
|       United States|            Ireland|  344|
|               Egypt|      United States|   15|
|       United States|              India|   62|
|       United States|          Singapore|    1|
|       United States|            Grenada|   62|
|          Costa Rica|      United States|  588|
|             Senegal|      United States|   40|
|             Moldova|      United States|    1|
|       United States|       Sint Maarten|  325|
|       United States|   Marshall Islands|   39|
|              Guyana|      United States|   64|
|               Malta|      United States|    1|
|            Anguilla|      United States|   41|
|             Bolivia|      United States|   30|
|       United States|           Paraguay|    6|
|             Algeri

In [0]:
%sql select * from dfTable

#### 3.2 Add Constant/Literals Column
Sometimes we need to pass explicit values into Spark that are just a value (rather than a new column).

In SQL, we can write "select *, 5, "five", 5.0 from table".

But in the DataFrame API, we have to convert native types to Spark types with the lit() function.

The PySpark lit() function:
* converts native types to Spark types: it converts a value from another language (here Python) into its corresponding Spark representation
* can therefore add a constant or literal value as a new column to the DataFrame

In [0]:
from pyspark.sql.functions import lit
df.select(lit(5), lit("five"), lit(5.0)) 

# in SQL
# SELECT 5, "five", 5.0

In [0]:
# EXO 1
df.select(expr("*"), lit(10).alias("REF")).show(2)

#### 3.3 Adding Columns: .withColumn('col_name', expression) method

In [0]:
df.withColumn('REF', lit(5.0)).show(2)

+-----------------+-------------------+-----+---+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|REF|
+-----------------+-------------------+-----+---+
|    United States|            Romania|   15|5.0|
|    United States|            Croatia|    1|5.0|
+-----------------+-------------------+-----+---+
only showing top 2 rows



In [0]:
# EXO 2
df.withColumn("withinCountry", expr("ORIGIN_COUNTRY_NAME == DEST_COUNTRY_NAME")).show(2)

+-----------------+-------------------+-----+-------------+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|withinCountry|
+-----------------+-------------------+-----+-------------+
|    United States|            Romania|   15|        false|
|    United States|            Croatia|    1|        false|
+-----------------+-------------------+-----+-------------+
only showing top 2 rows



In [0]:
# EXO 3: "rename" by copying a column under a new name (col(), column() or expr()); the original column is kept
df.withColumn("Destination", col("DEST_COUNTRY_NAME")).columns

Out[9]: ['DEST_COUNTRY_NAME', 'ORIGIN_COUNTRY_NAME', 'count', 'Destination']

#### 3.4 Renaming Columns : .withColumnRenamed()

*Question for the trainees: why is there no need to call col()?*

There is no need to call col() because the column's content is not transformed; only its name changes.

In [0]:
df.withColumnRenamed("DEST_COUNTRY_NAME", "Destination").columns

Out[12]: ['Destination', 'ORIGIN_COUNTRY_NAME', 'count']

#### 3.5 Removing Columns : .drop()
Removing columns with missing values is covered in the next notebook.

In [0]:
df.drop("ORIGIN_COUNTRY_NAME", "count").columns

Out[98]: ['DEST_COUNTRY_NAME']

#### 3.6 Changing a column’s type: cast()

In [0]:
df.withColumn("count", col("count").cast("float")).show() # reusing the name of an existing column casts it in place


#### 3.7 Filtering Rows
* Use .filter() or .where(); the two are equivalent
* The condition can be a PySpark Column expression or a SQL expression string

In [0]:
# The following filters are equivalent
df.filter(col("count")<2).show(10)
df.where("count < 2").show(10)

+--------------------+-------------------+-----+
|   DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+--------------------+-------------------+-----+
|       United States|            Croatia|    1|
|       United States|          Singapore|    1|
|             Moldova|      United States|    1|
|               Malta|      United States|    1|
|       United States|          Gibraltar|    1|
|Saint Vincent and...|      United States|    1|
|            Suriname|      United States|    1|
|       United States|             Cyprus|    1|
|        Burkina Faso|      United States|    1|
|            Djibouti|      United States|    1|
+--------------------+-------------------+-----+
only showing top 10 rows



**NOTE**: you might want to put multiple filters into the same expression. Although this is
possible, it is not always useful, because Spark automatically performs all filtering operations at
the same time regardless of the filter ordering (we already mentioned this in the **Catalyst Optimizer** part). To apply several AND filters, you can simply chain them one after another.

In [0]:
# df.filter((col("count")<2) & (col("ORIGIN_COUNTRY_NAME") != "Croatia")).show(10)

df.filter(col("count")<5).where(col("ORIGIN_COUNTRY_NAME") == "Croatia").show(5)

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|    United States|            Croatia|    1|
+-----------------+-------------------+-----+



#### 3.8 Getting Unique Rows: .distinct()

In [0]:
df.count() # 256
# df.select("ORIGIN_COUNTRY_NAME", "DEST_COUNTRY_NAME").distinct().count() # 256
df.select("ORIGIN_COUNTRY_NAME").distinct().count() # 125

Out[113]: 125

#### 3.9 Random Samples: .sample()

In [0]:
df.sample(withReplacement=False, fraction=0.5, seed=42).count()

# prepare the union operation
df_test = df.sample(withReplacement=False, fraction=0.5, seed=42)

df_test = df_test.repartition(5)
df_test.rdd.getNumPartitions()

Out[114]: 132

#### 3.10 Random Splits : .randomSplit()

In [0]:
seed = 42 # any fixed integer makes the split reproducible (the value 42 is an arbitrary choice)
type(df.randomSplit([0.25, 0.75], seed))
Test_df, train_df = df.randomSplit([0.25, 0.75], seed)

Out[117]: list

#### 3.11 Concatenating and Appending Rows (UNION ALL in SQL): .union()

In [0]:
# case 1: union with df
df.union(df_test).count()

In [0]:
# case 2: union with native rows
schema = df.schema

newRows = [
    Row("new country 1", "other country 1", 5),
    Row("new country 2", "other country 2", 2)
]

newDF = spark.createDataFrame(newRows, schema)
newDF.rdd.getNumPartitions()

# There is no longer any need to call the parallelize method on a SparkContext first.
# The older approach built an RDD from the local collection, then a DataFrame from that RDD:
# parallelizedRows = spark.sparkContext.parallelize(newRows) # parallelizedRows is an RDD
# newDF = spark.createDataFrame(parallelizedRows, schema)


(df.union(newDF)
 .filter("ORIGIN_COUNTRY_NAME like 'other country%'")
 .show(5))

#### 3.12 Sorting Rows: .sort() or .orderBy()

In [0]:
from pyspark.sql.functions import asc, desc
df.sort(desc("count"), "DEST_COUNTRY_NAME").show(5)  # no need to use col() function 

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|    United States|        The Bahamas|  986|
|      The Bahamas|      United States|  955|
|    United States|             France|  952|
|           France|      United States|  935|
|    United States|              China|  920|
+-----------------+-------------------+-----+
only showing top 5 rows



In [0]:
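df.orderBy(desc("count"), asc("DEST_COUNTRY_NAME")).show()
# Note: in the output below 986 comes before 8483 because count was read as a string here, so the order is lexicographic;
# read the file with an explicit schema (or cast count to a number) to sort numerically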


+-----------------+--------------------+-----+
|DEST_COUNTRY_NAME| ORIGIN_COUNTRY_NAME|count|
+-----------------+--------------------+-----+
|    United States|         The Bahamas|  986|
|      The Bahamas|       United States|  955|
|    United States|              France|  952|
|           France|       United States|  935|
|    United States|               China|  920|
|          Curacao|       United States|   90|
|         Colombia|       United States|  873|
|    United States|            Colombia|  867|
|           Brazil|       United States|  853|
|    United States|              Canada| 8483|
|           Canada|       United States| 8399|
|     Saudi Arabia|       United States|   83|
|    United States|             Curacao|   83|
|    United States|         South Korea|  827|
|    United States|British Virgin Is...|   80|
|      Netherlands|       United States|  776|
|            China|       United States|  772|
|    United States|         New Zealand|   74|
|    United S

For optimization purposes, it’s sometimes advisable to sort within each partition before another
set of transformations.

In [0]:
df = df.repartition(5)
df.rdd.getNumPartitions()

df.sortWithinPartitions(desc("count")) # 0.12 seconds

# df.sort(desc("count")) 

Out[19]: DataFrame[DEST_COUNTRY_NAME: string, ORIGIN_COUNTRY_NAME: string, count: string]

#### 3.13 .repartition() and .coalesce()
* .repartition(): incurs a full shuffle of the data. If you know that you’re going to be filtering by a certain column often, it can be worth repartitioning based on that column

* .coalesce(): will not incur a full shuffle and will try to combine partitions.

In [0]:
df.rdd.getNumPartitions()
df= df.repartition(5) # by default, Spark spreads the rows evenly across the partitions

In [0]:
from pyspark.sql.functions import spark_partition_id, asc, desc
df\
    .withColumn("partitionId", spark_partition_id())\
    .groupBy("partitionId")\
    .count()\
    .orderBy(asc("count"))\
    .show()

+-----------+-----+
|partitionId|count|
+-----------+-----+
|          1|   33|
|          0|   43|
|          2|   45|
|          3|   52|
|          4|   83|
+-----------+-----+



In [0]:
df= df.repartition(5, col("count")) # This operation will shuffle your data into five partitions based on the count column
# If you know that you’re going to be filtering by a certain column often, it can be worth repartitioning based on that column
# we can also pass the column name "count" directly; col() is not necessary

# [re-run the per-partition count command above]

In [0]:
df = df.coalesce(2) # will not incur a full shuffle and will try to combine partitions
df.rdd.getNumPartitions() # 2

# then re-run the previous per-partition count command

Out[36]: 2