## **Create a simple DataFrame**
A Spark DataFrame behaves like an in-memory table with named columns. To see how it works, let us load a custom dataset based on the Olympics 2024 medal table.

In [2]:
# Custom data from the Olympics 2024 medal table

data = [("United States","US",40,44,42,126),
("China","CHN",40,27,24,91),
("Japan","JPN",20,12,13,45),
("Australia","AUS",18,19,16,53),
("France","FRA",16,26,22,64),
("Netherlands","NED",15,7,12,34),
("Great Britain","GBG",14,22,29,65),
("South Korea","KOR",13,9,10,32),
("Italy","ITA",12,13,15,40),
("Germany","GER",12,13,8,33),
("New Zealand","NZ",10,7,3,20)
]

columns = ["Country","Code","Gold","Sliver","Bronze","Total"]
# Only column names are supplied, so Spark infers each column's type from the data
df = spark.createDataFrame(data=data, schema=columns)

df.printSchema()
df.show(truncate=False)




root
 |-- Country: string (nullable = true)
 |-- Code: string (nullable = true)
 |-- Gold: long (nullable = true)
 |-- Sliver: long (nullable = true)
 |-- Bronze: long (nullable = true)
 |-- Total: long (nullable = true)

+-------------+----+----+------+------+-----+
|Country      |Code|Gold|Sliver|Bronze|Total|
+-------------+----+----+------+------+-----+
|United States|US  |40  |44    |42    |126  |
|China        |CHN |40  |27    |24    |91   |
|Japan        |JPN |20  |12    |13    |45   |
|Australia    |AUS |18  |19    |16    |53   |
|France       |FRA |16  |26    |22    |64   |
|Netherlands  |NED |15  |7     |12    |34   |
|Great Britain|GBG |14  |22    |29    |65   |
|South Korea  |KOR |13  |9     |10    |32   |
|Italy        |ITA |12  |13    |15    |40   |
|Germany      |GER |12  |13    |8     |33   |
|New Zealand  |NZ  |10  |7     |3     |20   |
+-------------+----+----+------+------+-----+



## **Check the DataFrame's type**

In [3]:
type(df)


pyspark.sql.dataframe.DataFrame

## **DataFrame using StructType**
**Defining DataFrame Schemas:** _StructType_ is commonly used to define the schema when creating a DataFrame, particularly for structured data with fields of different data types.

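Defining the schema yourself keeps column types consistent from run to run and lets Spark skip schema inference, which otherwise has to inspect the data to work out the types.

**Explicit Schema Definition:**

You can define the schema of a DataFrame explicitly using _StructType_ and _StructField_. Each _StructField_ gives a column's name, its data type, and whether the column may contain nulls.

Let us use the same custom dataset from the Olympics 2024 medal table.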



In [10]:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

data = [("United States","US",40,44,42,126),
("China","CHN",40,27,24,91),
("Japan","JPN",20,12,13,45),
("Australia","AUS",18,19,16,53),
("France","FRA",16,26,22,64),
("Netherlands","NED",15,7,12,34),
("Great Britain","GBG",14,22,29,65),
("South Korea","KOR",13,9,10,32),
("Italy","ITA",12,13,15,40),
("Germany","GER",12,13,8,33),
("New Zealand","NZ",10,7,3,20)
]

columns = StructType([ 
    StructField("Country", StringType(),True),
    StructField("Code", StringType(),True),
    StructField("Gold", IntegerType(), True),
    StructField("Sliver", IntegerType(), True),
    StructField("Bronze", IntegerType(), True),
    StructField("Total", IntegerType(), True)
    ])

# Pass the StructType so Spark uses the declared types instead of inferring them
df = spark.createDataFrame(data=data, schema=columns)

df.printSchema()
df.show(truncate=False)


root
 |-- Country: string (nullable = true)
 |-- Code: string (nullable = true)
 |-- Gold: integer (nullable = true)
 |-- Sliver: integer (nullable = true)
 |-- Bronze: integer (nullable = true)
 |-- Total: integer (nullable = true)

+-------------+----+----+------+------+-----+
|Country      |Code|Gold|Sliver|Bronze|Total|
+-------------+----+----+------+------+-----+
|United States|US  |40  |44    |42    |126  |
|China        |CHN |40  |27    |24    |91   |
|Japan        |JPN |20  |12    |13    |45   |
|Australia    |AUS |18  |19    |16    |53   |
|France       |FRA |16  |26    |22    |64   |
|Netherlands  |NED |15  |7     |12    |34   |
|Great Britain|GBG |14  |22    |29    |65   |
|South Korea  |KOR |13  |9     |10    |32   |
|Italy        |ITA |12  |13    |15    |40   |
|Germany      |GER |12  |13    |8     |33   |
|New Zealand  |NZ  |10  |7     |3     |20   |
+-------------+----+----+------+------+-----+



## **Data Definition Language**
A schema can also be written as a single string, such as "Country STRING, Code STRING, Gold INT, Silver INT, Bronze INT, Total INT". A schema written this way is called a DDL (Data Definition Language) string, or DDL schema string.

This method allows you to specify the schema in a concise, SQL-like format, making it easy to define the data types for each column.

In [3]:
columns_ddl = "Country STRING, Code STRING, Gold INT, Sliver INT, Bronze INT, Total INT"
df_with_ddl_schema = spark.createDataFrame(data=data, schema=columns_ddl)

# Inspect the DataFrame built from the DDL string, not the earlier df
df_with_ddl_schema.printSchema()
df_with_ddl_schema.show(truncate=False)

root
 |-- Country: string (nullable = true)
 |-- Code: string (nullable = true)
 |-- Gold: integer (nullable = true)
 |-- Sliver: integer (nullable = true)
 |-- Bronze: integer (nullable = true)
 |-- Total: integer (nullable = true)

+-------------+----+----+------+------+-----+
|Country      |Code|Gold|Sliver|Bronze|Total|
+-------------+----+----+------+------+-----+
|United States|US  |40  |44    |42    |126  |
|China        |CHN |40  |27    |24    |91   |
|Japan        |JPN |20  |12    |13    |45   |
|Australia    |AUS |18  |19    |16    |53   |
|France       |FRA |16  |26    |22    |64   |
|Netherlands  |NED |15  |7     |12    |34   |
|Great Britain|GBG |14  |22    |29    |65   |
|South Korea  |KOR |13  |9     |10    |32   |
|Italy        |ITA |12  |13    |15    |40   |
|Germany      |GER |12  |13    |8     |33   |
|New Zealand  |NZ  |10  |7     |3     |20   |
+-------------+----+----+------+------+-----+



## **Understanding and Playing with DataFrames**

In [4]:
# Show the DataFrame (by default, show() prints the first 20 rows and truncates strings longer than 20 characters)
df.show()


+-------------+----+----+------+------+-----+
|      Country|Code|Gold|Sliver|Bronze|Total|
+-------------+----+----+------+------+-----+
|United States|  US|  40|    44|    42|  126|
|        China| CHN|  40|    27|    24|   91|
|        Japan| JPN|  20|    12|    13|   45|
|    Australia| AUS|  18|    19|    16|   53|
|       France| FRA|  16|    26|    22|   64|
|  Netherlands| NED|  15|     7|    12|   34|
|Great Britain| GBG|  14|    22|    29|   65|
|  South Korea| KOR|  13|     9|    10|   32|
|        Italy| ITA|  12|    13|    15|   40|
|      Germany| GER|  12|    13|     8|   33|
|  New Zealand|  NZ|  10|     7|     3|   20|
+-------------+----+----+------+------+-----+



_**display(df)**_ shows the DataFrame in an interactive table, in notebook environments that provide the display function. It offers features like sorting and filtering directly in the UI.

<mark><u>Reminder to showcase: columns, charts, and download</u></mark>

In [None]:
display(df)

In [None]:
# head(2) returns the first two rows as a list of Row objects, not a DataFrame
display(df.head(2))

In [None]:
# tail(2) returns the last two rows as a list of Row objects, not a DataFrame
display(df.tail(2))

In [None]:
# Return the schema as a StructType object
df.schema