# Olympics & Country GDP Bronze Layer

## Sources
We picked two datasets from Kaggle: the Olympic data and the world GDP dataset.

### Olympic Data
  [Source: olympic-data](https://www.kaggle.com/datasets/bhanupratapbiswas/olympic-data)

### GDP
  [Source: country-regional-and-world-gdp](https://www.kaggle.com/datasets/bhanupratapbiswas/olympic-data)

## Olympic

The Olympic Games are an international multi-sport event held every four years in which thousands of athletes from around the world participate in various sports competitions. The Olympics are one of the most significant and prestigious sporting events globally, promoting unity, friendship, and fair play among nations.

### Olympic Medals
Gold, silver, and bronze medals are awarded to the top three athletes or teams in each event.

In [1]:
# Initialize spark by loading utils package
from utils import spark, save_parquet, load_parquet, path

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/07/29 16:56:20 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


### Read csv files from storage

In [2]:
from pyspark.sql.types import StructField, StructType, IntegerType, StringType, DoubleType

olympics_schema = StructType([
    StructField("ID", IntegerType()),
    StructField("Name", StringType()),
    StructField("Sex", StringType()),
    StructField("Age", IntegerType()),
    StructField("Height", DoubleType()),
    StructField("Weight", DoubleType()),
    StructField("Team", StringType()),
    StructField("NOC", StringType()),
    StructField("Games", StringType()),
    StructField("Year", IntegerType()),
    StructField("Season", StringType()),
    StructField("City", StringType()),
    StructField("Sport", StringType()),
    StructField("Event", StringType()),
    StructField("Medal", StringType())
])

df_olympics = (
    spark.read
    .format("csv")
    .option("header", True)        # first line of the file holds the column names
    .option("inferSchema", False)  # types come from the explicit schema below
    .schema(olympics_schema)
    .load(path["csv"] + "dataset_olympics.csv")
)

df_olympics.printSchema()

root
 |-- ID: integer (nullable = true)
 |-- Name: string (nullable = true)
 |-- Sex: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- Height: double (nullable = true)
 |-- Weight: double (nullable = true)
 |-- Team: string (nullable = true)
 |-- NOC: string (nullable = true)
 |-- Games: string (nullable = true)
 |-- Year: integer (nullable = true)
 |-- Season: string (nullable = true)
 |-- City: string (nullable = true)
 |-- Sport: string (nullable = true)
 |-- Event: string (nullable = true)
 |-- Medal: string (nullable = true)



In [3]:
df_olympics.show(5)

+---+--------------------+---+----+------+------+--------------+---+-----------+----+------+---------+-------------+--------------------+-----+
| ID|                Name|Sex| Age|Height|Weight|          Team|NOC|      Games|Year|Season|     City|        Sport|               Event|Medal|
+---+--------------------+---+----+------+------+--------------+---+-----------+----+------+---------+-------------+--------------------+-----+
|  1|           A Dijiang|  M|NULL| 180.0|  80.0|         China|CHN|1992 Summer|1992|Summer|Barcelona|   Basketball|Basketball Men's ...| NULL|
|  2|            A Lamusi|  M|NULL| 170.0|  60.0|         China|CHN|2012 Summer|2012|Summer|   London|         Judo|Judo Men's Extra-...| NULL|
|  3| Gunnar Nielsen Aaby|  M|NULL|  NULL|  NULL|       Denmark|DEN|1920 Summer|1920|Summer|Antwerpen|     Football|Football Men's Fo...| NULL|
|  4|Edgar Lindenau Aabye|  M|NULL|  NULL|  NULL|Denmark/Sweden|DEN|1900 Summer|1900|Summer|    Paris|   Tug-Of-War|Tug-Of-War Men's ...
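
Every `Age` value in the rows shown is `NULL`, even where height and weight are present. One possible cause, which is only an assumption because the raw file is not shown here, is that ages are stored in a form such as `24.0` that does not parse as an integer; in its default permissive mode, Spark's CSV reader sets fields it cannot parse to null. The plain-Python sketch below mimics that effect on made-up values. `parse_int_or_none` is a hypothetical helper, not part of PySpark.

```python
from typing import Optional


def parse_int_or_none(raw: str) -> Optional[int]:
    """Strict integer parse: return None instead of raising on bad input."""
    try:
        return int(raw)
    except ValueError:
        return None


# Made-up sample values for illustration, not taken from the dataset.
samples = ["24", "24.0", "NA", ""]
parsed = [parse_int_or_none(s) for s in samples]
print(parsed)  # [24, None, None, None]
```

If inspecting the raw CSV confirms this, declaring `Age` as `DoubleType()` in the schema would be one way to keep the values.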

### Write dataframe to parquet

In [4]:
save_parquet(df_olympics, filename="olympics")

'./data/parquet/olympics.parquet'
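
The returned string shows where `save_parquet` wrote the file. The `utils` module itself is not shown in this notebook, so the following is only a sketch of how such a helper might derive that location. Only the `./data/parquet/` prefix and the `.parquet` suffix come from the output above; the `csv` entry of `path` and the `parquet_path` name are assumptions.

```python
# Hypothetical sketch: utils.py is not shown in this notebook.
# Only "./data/parquet/<name>.parquet" is taken from the cell output;
# the "csv" entry and the helper name are assumptions.
path = {
    "csv": "./data/csv/",          # assumed location of the raw CSV files
    "parquet": "./data/parquet/",  # matches the path returned above
}


def parquet_path(filename: str) -> str:
    """Build the parquet location for a logical dataset name."""
    return f"{path['parquet']}{filename}.parquet"


print(parquet_path("olympics"))
```

Keeping the directory layout in one dictionary means the read and write helpers can share the same naming rule.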

### Read dataframe from parquet

In [5]:
load_parquet("olympics").show(5)

+---+--------------------+---+----+------+------+--------------+---+-----------+----+------+---------+-------------+--------------------+-----+
| ID|                Name|Sex| Age|Height|Weight|          Team|NOC|      Games|Year|Season|     City|        Sport|               Event|Medal|
+---+--------------------+---+----+------+------+--------------+---+-----------+----+------+---------+-------------+--------------------+-----+
|  1|           A Dijiang|  M|NULL| 180.0|  80.0|         China|CHN|1992 Summer|1992|Summer|Barcelona|   Basketball|Basketball Men's ...| NULL|
|  2|            A Lamusi|  M|NULL| 170.0|  60.0|         China|CHN|2012 Summer|2012|Summer|   London|         Judo|Judo Men's Extra-...| NULL|
|  3| Gunnar Nielsen Aaby|  M|NULL|  NULL|  NULL|       Denmark|DEN|1920 Summer|1920|Summer|Antwerpen|     Football|Football Men's Fo...| NULL|
|  4|Edgar Lindenau Aabye|  M|NULL|  NULL|  NULL|Denmark/Sweden|DEN|1900 Summer|1900|Summer|    Paris|   Tug-Of-War|Tug-Of-War Men's ...

## GDP

### Read table from catalog
We've used the data ingestion feature in Databricks to import the CSV into a Delta table. The cell below, however, reads the CSV directly from storage and lets Spark infer the schema.

In [6]:
df_gdp = spark.read.csv(path["csv"] + '/dataset_gdp.csv', header=True, inferSchema=True)
df_gdp.printSchema()

root
 |-- Country Name: string (nullable = true)
 |-- Country Code: string (nullable = true)
 |-- Year: integer (nullable = true)
 |-- Value: double (nullable = true)



### Write dataframe to parquet

In [7]:
save_parquet(df_gdp, filename="gdp")

'./data/parquet/gdp.parquet'

### Read dataframe from parquet

In [8]:
load_parquet("gdp").show(5)

+------------+------------+----+-------------------+
|Country Name|Country Code|Year|              Value|
+------------+------------+----+-------------------+
|  Arab World|         ARB|1968|2.57606830410857E10|
|  Arab World|         ARB|1969|2.84342036154829E10|
|  Arab World|         ARB|1970|3.13854996640672E10|
|  Arab World|         ARB|1971|3.64269098883928E10|
|  Arab World|         ARB|1972|4.33160566154562E10|
+------------+------------+----+-------------------+
only showing top 5 rows
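
`Value` is displayed in scientific notation, which is hard to read at a glance. The plain-Python sketch below converts the first value from the output above to a more familiar scale. `to_billions` is a hypothetical helper, and the unit of `Value` is not stated in this notebook, so none is assumed here.

```python
def to_billions(raw: str) -> float:
    """Parse a value printed in scientific notation and scale it to billions."""
    return float(raw) / 1e9


raw = "2.57606830410857E10"  # first Value from the output above
print(f"{float(raw):,.0f}")           # 25,760,683,041
print(f"{to_billions(raw):.2f} billion")  # 25.76 billion
```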

