# PySpark YouTube Course - Lesson 01

This notebook collects the materials and notes from the PySpark course lesson 'PySpark - Aula 01 - Fundamentos - Tutorial em Português na Prática' (PySpark - Lesson 01 - Fundamentals - Hands-on Tutorial in Portuguese) by DataDev Academy. The video is available [on YouTube](https://www.youtube.com/watch?v=ycZZs4371us).

## Summary

- [PySpark YouTube Course - Lesson 01](#pyspark-youtube-course---lesson-01)
  - [Summary](#summary)
  - [Importing PySpark](#importing-pyspark)
  - [Starting PySpark Session](#starting-pyspark-session)
  - [Reading a File](#reading-a-file)
  - [Displaying the DataFrame](#displaying-the-dataframe)
  - [Checking Column Types](#checking-column-types)
  - [Renaming Columns](#renaming-columns)
  - [Checking for Null Values](#checking-for-null-values)
  - [Selecting Columns](#selecting-columns)
    - [Alias](#alias)
  - [Organizing Select Statement](#organizing-select-statement)
  - [Filtering DataFrame](#filtering-dataframe)
    - [AND Condition (&)](#and-condition-)
    - [OR Condition (|)](#or-condition-)
    - [Combined AND and OR Conditions](#combined-and-and-or-conditions)
  - [Creating New Columns](#creating-new-columns)
    - [Lit function](#lit-function)
    - [Substring function](#substring-function)
    - [Concat function](#concat-function)
  - [Changing Column Type](#changing-column-type)
  - [Challenge: Birth Date Column Transformation](#challenge-birth-date-column-transformation)

## Importing PySpark

In [1]:
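# Importing only the names this notebook uses
# (a wildcard import of pyspark.sql.functions would shadow Python built-ins such as sum, min, and max)
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, concat, concat_ws, lit, split, substring
from pyspark.sql.types import DateType, IntegerType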

## Starting PySpark Session

In [2]:
# Creating a Spark session
# - master: sets the master URL; "local" runs Spark on this machine with a single worker thread
#   ("local[*]" would use all available cores)
# - appName: sets the application name shown in the Spark UI
spark = SparkSession.builder \
    .master("local") \
    .appName("pyspark-01") \
    .getOrCreate()

## Reading a File

In [3]:
# Reading a CSV file into a DataFrame
# - path: specifies the location of the file
# - header: indicates if the file contains a header row
# - inferSchema: automatically infers the data types of the columns
df = spark.read.csv("./data/wc2018-players.csv", header=True, inferSchema=True)

## Displaying the DataFrame

In [4]:
# - show(5): shows the first 5 rows of the DataFrame
df.show(5)

+---------+---+----+------------------+----------+----------+--------------------+------+------+
|     Team|  #|Pos.| FIFA Popular Name|Birth Date|Shirt Name|                Club|Height|Weight|
+---------+---+----+------------------+----------+----------+--------------------+------+------+
|Argentina|  3|  DF|TAGLIAFICO Nicolas|31.08.1992|TAGLIAFICO|      AFC Ajax (NED)|   169|    65|
|Argentina| 22|  MF|    PAVON Cristian|21.01.1996|     PAVÓN|CA Boca Juniors (...|   169|    65|
|Argentina| 15|  MF|    LANZINI Manuel|15.02.1993|   LANZINI|West Ham United F...|   167|    66|
|Argentina| 18|  DF|    SALVIO Eduardo|13.07.1990|    SALVIO|    SL Benfica (POR)|   167|    69|
|Argentina| 10|  FW|      MESSI Lionel|24.06.1987|     MESSI|  FC Barcelona (ESP)|   170|    72|
+---------+---+----+------------------+----------+----------+--------------------+------+------+
only showing top 5 rows



## Checking Column Types

In [5]:
# Checking the column types
# - printSchema(): prints the schema of the DataFrame, including column names and data types
df.printSchema()

root
 |-- Team: string (nullable = true)
 |-- #: integer (nullable = true)
 |-- Pos.: string (nullable = true)
 |-- FIFA Popular Name: string (nullable = true)
 |-- Birth Date: string (nullable = true)
 |-- Shirt Name: string (nullable = true)
 |-- Club: string (nullable = true)
 |-- Height: integer (nullable = true)
 |-- Weight: integer (nullable = true)



## Renaming Columns

In [6]:
# Renaming columns in the DataFrame
# - withColumnRenamed("oldName", "newName"): renames a column from oldName to newName
df = df.withColumnRenamed("Pos.", "Position") \
       .withColumnRenamed("#", "Number")

# Displaying the updated DataFrame to verify changes
df.show(5)

+---------+------+--------+------------------+----------+----------+--------------------+------+------+
|     Team|Number|Position| FIFA Popular Name|Birth Date|Shirt Name|                Club|Height|Weight|
+---------+------+--------+------------------+----------+----------+--------------------+------+------+
|Argentina|     3|      DF|TAGLIAFICO Nicolas|31.08.1992|TAGLIAFICO|      AFC Ajax (NED)|   169|    65|
|Argentina|    22|      MF|    PAVON Cristian|21.01.1996|     PAVÓN|CA Boca Juniors (...|   169|    65|
|Argentina|    15|      MF|    LANZINI Manuel|15.02.1993|   LANZINI|West Ham United F...|   167|    66|
|Argentina|    18|      DF|    SALVIO Eduardo|13.07.1990|    SALVIO|    SL Benfica (POR)|   167|    69|
|Argentina|    10|      FW|      MESSI Lionel|24.06.1987|     MESSI|  FC Barcelona (ESP)|   170|    72|
+---------+------+--------+------------------+----------+----------+--------------------+------+------+
only showing top 5 rows



## Checking for Null Values

In [7]:
# Option 1 (small datasets only)
# - toPandas(): collects every row to the driver and converts the DataFrame to a Pandas DataFrame
# - isna().sum(): counts the number of null values in each column
df.toPandas().isna().sum()

Team                 0
Number               0
Position             0
FIFA Popular Name    0
Birth Date           0
Shirt Name           0
Club                 0
Height               0
Weight               0
dtype: int64

In [8]:
# Option 2 (also works for large datasets): the counting stays inside Spark
# Iterating through each column to count null values; each count() runs a separate Spark job
for column in df.columns:
    # Filter rows where the column value is null, then count them
    null_count = df.filter(col(column).isNull()).count()
    print(column, null_count)

Team 0
Number 0
Position 0
FIFA Popular Name 0
Birth Date 0
Shirt Name 0
Club 0
Height 0
Weight 0


## Selecting Columns

In [9]:
# Selecting specific columns from the DataFrame by name
# - select("Team", "FIFA Popular Name"): returns a new DataFrame with only these two columns
df.select("Team", "FIFA Popular Name").show(5)

+---------+------------------+
|     Team| FIFA Popular Name|
+---------+------------------+
|Argentina|TAGLIAFICO Nicolas|
|Argentina|    PAVON Cristian|
|Argentina|    LANZINI Manuel|
|Argentina|    SALVIO Eduardo|
|Argentina|      MESSI Lionel|
+---------+------------------+
only showing top 5 rows



In [10]:
# Selecting specific columns using col() function
# - col("Team"): selects the 'Team' column
# - col("FIFA Popular Name"): selects the 'FIFA Popular Name' column
df.select(col("Team"), col("FIFA Popular Name")).show(5)

+---------+------------------+
|     Team| FIFA Popular Name|
+---------+------------------+
|Argentina|TAGLIAFICO Nicolas|
|Argentina|    PAVON Cristian|
|Argentina|    LANZINI Manuel|
|Argentina|    SALVIO Eduardo|
|Argentina|      MESSI Lionel|
+---------+------------------+
only showing top 5 rows



In [11]:
# Selecting specific columns using DataFrame indexing
# - df["Team"]: selects the 'Team' column
# - df["FIFA Popular Name"]: selects the 'FIFA Popular Name' column
df.select(df["Team"], df["FIFA Popular Name"]).show(5)

+---------+------------------+
|     Team| FIFA Popular Name|
+---------+------------------+
|Argentina|TAGLIAFICO Nicolas|
|Argentina|    PAVON Cristian|
|Argentina|    LANZINI Manuel|
|Argentina|    SALVIO Eduardo|
|Argentina|      MESSI Lionel|
+---------+------------------+
only showing top 5 rows



### Alias

In [12]:
# Selecting a column with an alias
# - col("Team").alias("National Team"): selects the 'Team' column and renames it to 'National Team'
df.select(col("Team").alias("National Team")).show(5)

+-------------+
|National Team|
+-------------+
|    Argentina|
|    Argentina|
|    Argentina|
|    Argentina|
|    Argentina|
+-------------+
only showing top 5 rows



In [13]:
# Selecting a column with an alias using DataFrame indexing
# - df["Team"].alias("National Team"): selects the 'Team' column and renames it to 'National Team'
df.select(df["Team"].alias("National Team")).show(5)

+-------------+
|National Team|
+-------------+
|    Argentina|
|    Argentina|
|    Argentina|
|    Argentina|
|    Argentina|
+-------------+
only showing top 5 rows



## Organizing Select Statement

In [14]:
# Selecting columns in a custom order: the output follows the order passed to select()
df.select("FIFA Popular Name", "Weight", "Height").show(5)

+------------------+------+------+
| FIFA Popular Name|Weight|Height|
+------------------+------+------+
|TAGLIAFICO Nicolas|    65|   169|
|    PAVON Cristian|    65|   169|
|    LANZINI Manuel|    66|   167|
|    SALVIO Eduardo|    69|   167|
|      MESSI Lionel|    72|   170|
+------------------+------+------+
only showing top 5 rows



## Filtering DataFrame

In [15]:
# Filtering the DataFrame with a condition written as a SQL expression string
df.filter("Team = 'Brazil'").show(10)

+------+------+--------+-----------------+----------+-----------+--------------------+------+------+
|  Team|Number|Position|FIFA Popular Name|Birth Date| Shirt Name|                Club|Height|Weight|
+------+------+--------+-----------------+----------+-----------+--------------------+------+------+
|Brazil|    18|      MF|             FRED|05.03.1993|       FRED|FC Shakhtar Donet...|   169|    64|
|Brazil|    21|      FW|           TAISON|13.01.1988|     TAISON|FC Shakhtar Donet...|   172|    64|
|Brazil|    17|      MF|      FERNANDINHO|04.05.1985|FERNANDINHO|Manchester City F...|   179|    67|
|Brazil|    22|      DF|           FAGNER|11.06.1989|     FAGNER|SC Corinthians (BRA)|   168|    67|
|Brazil|    10|      FW|           NEYMAR|05.02.1992|  NEYMAR JR|Paris Saint-Germa...|   175|    68|
|Brazil|    11|      MF|PHILIPPE COUTINHO|12.06.1992|P. COUTINHO|  FC Barcelona (ESP)|   172|    68|
|Brazil|     7|      FW|    DOUGLAS COSTA|14.09.1990|   D. COSTA|   Juventus FC (ITA)|   18

In [16]:
# Filtering the DataFrame based on a column condition using col() function
df.filter(col("FIFA Popular Name") == "FRED").show()

+------+------+--------+-----------------+----------+----------+--------------------+------+------+
|  Team|Number|Position|FIFA Popular Name|Birth Date|Shirt Name|                Club|Height|Weight|
+------+------+--------+-----------------+----------+----------+--------------------+------+------+
|Brazil|    18|      MF|             FRED|05.03.1993|      FRED|FC Shakhtar Donet...|   169|    64|
+------+------+--------+-----------------+----------+----------+--------------------+------+------+



In [17]:
# Displaying the DataFrame
df.show(5)

+---------+------+--------+------------------+----------+----------+--------------------+------+------+
|     Team|Number|Position| FIFA Popular Name|Birth Date|Shirt Name|                Club|Height|Weight|
+---------+------+--------+------------------+----------+----------+--------------------+------+------+
|Argentina|     3|      DF|TAGLIAFICO Nicolas|31.08.1992|TAGLIAFICO|      AFC Ajax (NED)|   169|    65|
|Argentina|    22|      MF|    PAVON Cristian|21.01.1996|     PAVÓN|CA Boca Juniors (...|   169|    65|
|Argentina|    15|      MF|    LANZINI Manuel|15.02.1993|   LANZINI|West Ham United F...|   167|    66|
|Argentina|    18|      DF|    SALVIO Eduardo|13.07.1990|    SALVIO|    SL Benfica (POR)|   167|    69|
|Argentina|    10|      FW|      MESSI Lionel|24.06.1987|     MESSI|  FC Barcelona (ESP)|   170|    72|
+---------+------+--------+------------------+----------+----------+--------------------+------+------+
only showing top 5 rows



### AND Condition (&)

In [18]:
# Filtering the DataFrame based on multiple conditions using logical AND (&) operator
df.filter((col("Team") == "Argentina") & (col("Height") > 180) & (col("Weight") >= 85)).show(5)

+---------+------+--------+-----------------+----------+----------+--------------------+------+------+
|     Team|Number|Position|FIFA Popular Name|Birth Date|Shirt Name|                Club|Height|Weight|
+---------+------+--------+-----------------+----------+----------+--------------------+------+------+
|Argentina|     6|      DF|   FAZIO Federico|17.03.1987|     FAZIO|       AS Roma (ITA)|   199|    85|
|Argentina|    12|      GK|    ARMANI Franco|16.10.1986|    ARMANI|CA River Plate (ARG)|   189|    85|
|Argentina|     1|      GK|    GUZMAN Nahuel|10.02.1986|    GUZMÁN|   Tigres UANL (MEX)|   192|    90|
+---------+------+--------+-----------------+----------+----------+--------------------+------+------+



In [19]:
# Filtering the DataFrame with chained filter operations
df.filter(col("Team") == "Brazil").filter(col("Number") > 20).show(5)

+------+------+--------+-----------------+----------+----------+--------------------+------+------+
|  Team|Number|Position|FIFA Popular Name|Birth Date|Shirt Name|                Club|Height|Weight|
+------+------+--------+-----------------+----------+----------+--------------------+------+------+
|Brazil|    21|      FW|           TAISON|13.01.1988|    TAISON|FC Shakhtar Donet...|   172|    64|
|Brazil|    22|      DF|           FAGNER|11.06.1989|    FAGNER|SC Corinthians (BRA)|   168|    67|
|Brazil|    23|      GK|          EDERSON|17.08.1993|   EDERSON|Manchester City F...|   188|    86|
+------+------+--------+-----------------+----------+----------+--------------------+------+------+



### OR Condition (|)

In [20]:
# Filtering the DataFrame based on multiple OR conditions using logical OR (|) operator
df.filter((col("FIFA Popular Name") == "MESSI Lionel") | 
          (col("FIFA Popular Name") == "SALVIO Eduardo") | 
          (col("Height") == 199)).show(5)

+---------+------+--------+-----------------+----------+----------+------------------+------+------+
|     Team|Number|Position|FIFA Popular Name|Birth Date|Shirt Name|              Club|Height|Weight|
+---------+------+--------+-----------------+----------+----------+------------------+------+------+
|Argentina|    18|      DF|   SALVIO Eduardo|13.07.1990|    SALVIO|  SL Benfica (POR)|   167|    69|
|Argentina|    10|      FW|     MESSI Lionel|24.06.1987|     MESSI|FC Barcelona (ESP)|   170|    72|
|Argentina|     6|      DF|   FAZIO Federico|17.03.1987|     FAZIO|     AS Roma (ITA)|   199|    85|
|  Belgium|     1|      GK| COURTOIS Thibaut|11.05.1992|  COURTOIS|  Chelsea FC (ENG)|   199|    91|
+---------+------+--------+-----------------+----------+----------+------------------+------+------+



### Combined AND and OR Conditions

In [21]:
# Filtering the DataFrame with combined AND (&) and OR (|) conditions
df.filter(((col("Team") == "Brazil") & (col("Position") == "DF")) |
          ((col("Height") == 199) & (col("Team") == "Belgium"))).show()

+-------+------+--------+-----------------+----------+-----------+--------------------+------+------+
|   Team|Number|Position|FIFA Popular Name|Birth Date| Shirt Name|                Club|Height|Weight|
+-------+------+--------+-----------------+----------+-----------+--------------------+------+------+
|Belgium|     1|      GK| COURTOIS Thibaut|11.05.1992|   COURTOIS|    Chelsea FC (ENG)|   199|    91|
| Brazil|    22|      DF|           FAGNER|11.06.1989|     FAGNER|SC Corinthians (BRA)|   168|    67|
| Brazil|     6|      DF|      FILIPE LUIS|09.08.1985|FILIPE LUIS|Atletico Madrid (...|   182|    73|
| Brazil|    13|      DF|       MARQUINHOS|14.05.1994| MARQUINHOS|Paris Saint-Germa...|   183|    75|
| Brazil|     3|      DF|          MIRANDA|07.09.1984|    MIRANDA|FC Internazionale...|   186|    78|
| Brazil|    14|      DF|           DANILO|15.07.1991|     DANILO|Manchester City F...|   184|    78|
| Brazil|     2|      DF|     THIAGO SILVA|22.09.1984|   T. SILVA|Paris Saint-Germ

## Creating New Columns

### Lit function

In [22]:
# Creating new columns in the DataFrame using lit()
df.withColumn("World Cup", lit(2018)).show(5)

+---------+------+--------+------------------+----------+----------+--------------------+------+------+---------+
|     Team|Number|Position| FIFA Popular Name|Birth Date|Shirt Name|                Club|Height|Weight|World Cup|
+---------+------+--------+------------------+----------+----------+--------------------+------+------+---------+
|Argentina|     3|      DF|TAGLIAFICO Nicolas|31.08.1992|TAGLIAFICO|      AFC Ajax (NED)|   169|    65|     2018|
|Argentina|    22|      MF|    PAVON Cristian|21.01.1996|     PAVÓN|CA Boca Juniors (...|   169|    65|     2018|
|Argentina|    15|      MF|    LANZINI Manuel|15.02.1993|   LANZINI|West Ham United F...|   167|    66|     2018|
|Argentina|    18|      DF|    SALVIO Eduardo|13.07.1990|    SALVIO|    SL Benfica (POR)|   167|    69|     2018|
|Argentina|    10|      FW|      MESSI Lionel|24.06.1987|     MESSI|  FC Barcelona (ESP)|   170|    72|     2018|
+---------+------+--------+------------------+----------+----------+--------------------

In [23]:
# Creating a new column from a computed value (Height minus Weight)
# - lit() is redundant here: the subtraction is already a Column, and lit() returns a Column unchanged
df.withColumn("New Column", lit(col("Height") - col("Weight"))).show(5)

+---------+------+--------+------------------+----------+----------+--------------------+------+------+----------+
|     Team|Number|Position| FIFA Popular Name|Birth Date|Shirt Name|                Club|Height|Weight|New Column|
+---------+------+--------+------------------+----------+----------+--------------------+------+------+----------+
|Argentina|     3|      DF|TAGLIAFICO Nicolas|31.08.1992|TAGLIAFICO|      AFC Ajax (NED)|   169|    65|       104|
|Argentina|    22|      MF|    PAVON Cristian|21.01.1996|     PAVÓN|CA Boca Juniors (...|   169|    65|       104|
|Argentina|    15|      MF|    LANZINI Manuel|15.02.1993|   LANZINI|West Ham United F...|   167|    66|       101|
|Argentina|    18|      DF|    SALVIO Eduardo|13.07.1990|    SALVIO|    SL Benfica (POR)|   167|    69|        98|
|Argentina|    10|      FW|      MESSI Lionel|24.06.1987|     MESSI|  FC Barcelona (ESP)|   170|    72|        98|
+---------+------+--------+------------------+----------+----------+------------

### Substring function

In [24]:
# Creating a new column with the first 3 characters of 'Team'
# - substring(column, pos, len): pos is 1-based, so (1, 3) takes characters 1 to 3
df.withColumn("Sub", substring(col("Team"), 1, 3)).show(5)

+---------+------+--------+------------------+----------+----------+--------------------+------+------+---+
|     Team|Number|Position| FIFA Popular Name|Birth Date|Shirt Name|                Club|Height|Weight|Sub|
+---------+------+--------+------------------+----------+----------+--------------------+------+------+---+
|Argentina|     3|      DF|TAGLIAFICO Nicolas|31.08.1992|TAGLIAFICO|      AFC Ajax (NED)|   169|    65|Arg|
|Argentina|    22|      MF|    PAVON Cristian|21.01.1996|     PAVÓN|CA Boca Juniors (...|   169|    65|Arg|
|Argentina|    15|      MF|    LANZINI Manuel|15.02.1993|   LANZINI|West Ham United F...|   167|    66|Arg|
|Argentina|    18|      DF|    SALVIO Eduardo|13.07.1990|    SALVIO|    SL Benfica (POR)|   167|    69|Arg|
|Argentina|    10|      FW|      MESSI Lionel|24.06.1987|     MESSI|  FC Barcelona (ESP)|   170|    72|Arg|
+---------+------+--------+------------------+----------+----------+--------------------+------+------+---+
only showing top 5 rows



In [25]:
# Creating a new column with the last 3 characters of 'Team'
# - a negative pos counts from the end of the string, so (-3, 3) takes the last 3 characters
df.withColumn("Sub", substring(col("Team"), -3, 3)).show(5)

+---------+------+--------+------------------+----------+----------+--------------------+------+------+---+
|     Team|Number|Position| FIFA Popular Name|Birth Date|Shirt Name|                Club|Height|Weight|Sub|
+---------+------+--------+------------------+----------+----------+--------------------+------+------+---+
|Argentina|     3|      DF|TAGLIAFICO Nicolas|31.08.1992|TAGLIAFICO|      AFC Ajax (NED)|   169|    65|ina|
|Argentina|    22|      MF|    PAVON Cristian|21.01.1996|     PAVÓN|CA Boca Juniors (...|   169|    65|ina|
|Argentina|    15|      MF|    LANZINI Manuel|15.02.1993|   LANZINI|West Ham United F...|   167|    66|ina|
|Argentina|    18|      DF|    SALVIO Eduardo|13.07.1990|    SALVIO|    SL Benfica (POR)|   167|    69|ina|
|Argentina|    10|      FW|      MESSI Lionel|24.06.1987|     MESSI|  FC Barcelona (ESP)|   170|    72|ina|
+---------+------+--------+------------------+----------+----------+--------------------+------+------+---+
only showing top 5 rows



In [26]:
# Creating a new column 'Year' in the DataFrame using the substring function
# - substring(col("Birth Date"), -4, 4): extracts the last 4 characters (year) from the 'Birth Date' column
df = df.withColumn("Year", substring(col("Birth Date"), -4, 4))

# Displaying the DataFrame
df.show(5)

+---------+------+--------+------------------+----------+----------+--------------------+------+------+----+
|     Team|Number|Position| FIFA Popular Name|Birth Date|Shirt Name|                Club|Height|Weight|Year|
+---------+------+--------+------------------+----------+----------+--------------------+------+------+----+
|Argentina|     3|      DF|TAGLIAFICO Nicolas|31.08.1992|TAGLIAFICO|      AFC Ajax (NED)|   169|    65|1992|
|Argentina|    22|      MF|    PAVON Cristian|21.01.1996|     PAVÓN|CA Boca Juniors (...|   169|    65|1996|
|Argentina|    15|      MF|    LANZINI Manuel|15.02.1993|   LANZINI|West Ham United F...|   167|    66|1993|
|Argentina|    18|      DF|    SALVIO Eduardo|13.07.1990|    SALVIO|    SL Benfica (POR)|   167|    69|1990|
|Argentina|    10|      FW|      MESSI Lionel|24.06.1987|     MESSI|  FC Barcelona (ESP)|   170|    72|1987|
+---------+------+--------+------------------+----------+----------+--------------------+------+------+----+
only showing top 5 

### Concat function

In [27]:
# Creating a new column in the DataFrame using concat()
# - concat(col("Team"), col("FIFA Popular Name")): joins the 'Team' and 'FIFA Popular Name' values with no separator
df.withColumn("Concat", concat(col("Team"), col("FIFA Popular Name"))).show(5)

+---------+------+--------+------------------+----------+----------+--------------------+------+------+----+--------------------+
|     Team|Number|Position| FIFA Popular Name|Birth Date|Shirt Name|                Club|Height|Weight|Year|              Concat|
+---------+------+--------+------------------+----------+----------+--------------------+------+------+----+--------------------+
|Argentina|     3|      DF|TAGLIAFICO Nicolas|31.08.1992|TAGLIAFICO|      AFC Ajax (NED)|   169|    65|1992|ArgentinaTAGLIAFI...|
|Argentina|    22|      MF|    PAVON Cristian|21.01.1996|     PAVÓN|CA Boca Juniors (...|   169|    65|1996|ArgentinaPAVON Cr...|
|Argentina|    15|      MF|    LANZINI Manuel|15.02.1993|   LANZINI|West Ham United F...|   167|    66|1993|ArgentinaLANZINI ...|
|Argentina|    18|      DF|    SALVIO Eduardo|13.07.1990|    SALVIO|    SL Benfica (POR)|   167|    69|1990|ArgentinaSALVIO E...|
|Argentina|    10|      FW|      MESSI Lionel|24.06.1987|     MESSI|  FC Barcelona (ESP)| 

In [28]:
# Creating a new column in the DataFrame using concat_ws() (concat with separator)
# - concat_ws(" - ", col("Team"), col("Number"), col("Position")): joins 'Team', 'Number', and 'Position' with the separator " - "
df.withColumn("Concat", concat_ws(" - ", col("Team"), col("Number"), col("Position"))).show(5)

+---------+------+--------+------------------+----------+----------+--------------------+------+------+----+-------------------+
|     Team|Number|Position| FIFA Popular Name|Birth Date|Shirt Name|                Club|Height|Weight|Year|             Concat|
+---------+------+--------+------------------+----------+----------+--------------------+------+------+----+-------------------+
|Argentina|     3|      DF|TAGLIAFICO Nicolas|31.08.1992|TAGLIAFICO|      AFC Ajax (NED)|   169|    65|1992| Argentina - 3 - DF|
|Argentina|    22|      MF|    PAVON Cristian|21.01.1996|     PAVÓN|CA Boca Juniors (...|   169|    65|1996|Argentina - 22 - MF|
|Argentina|    15|      MF|    LANZINI Manuel|15.02.1993|   LANZINI|West Ham United F...|   167|    66|1993|Argentina - 15 - MF|
|Argentina|    18|      DF|    SALVIO Eduardo|13.07.1990|    SALVIO|    SL Benfica (POR)|   167|    69|1990|Argentina - 18 - DF|
|Argentina|    10|      FW|      MESSI Lionel|24.06.1987|     MESSI|  FC Barcelona (ESP)|   170| 

## Changing Column Type

In [29]:
# Displaying the DataFrame
df.show(5)

+---------+------+--------+------------------+----------+----------+--------------------+------+------+----+
|     Team|Number|Position| FIFA Popular Name|Birth Date|Shirt Name|                Club|Height|Weight|Year|
+---------+------+--------+------------------+----------+----------+--------------------+------+------+----+
|Argentina|     3|      DF|TAGLIAFICO Nicolas|31.08.1992|TAGLIAFICO|      AFC Ajax (NED)|   169|    65|1992|
|Argentina|    22|      MF|    PAVON Cristian|21.01.1996|     PAVÓN|CA Boca Juniors (...|   169|    65|1996|
|Argentina|    15|      MF|    LANZINI Manuel|15.02.1993|   LANZINI|West Ham United F...|   167|    66|1993|
|Argentina|    18|      DF|    SALVIO Eduardo|13.07.1990|    SALVIO|    SL Benfica (POR)|   167|    69|1990|
|Argentina|    10|      FW|      MESSI Lionel|24.06.1987|     MESSI|  FC Barcelona (ESP)|   170|    72|1987|
+---------+------+--------+------------------+----------+----------+--------------------+------+------+----+
only showing top 5 

In [30]:
# Checking the column types
df.printSchema()

root
 |-- Team: string (nullable = true)
 |-- Number: integer (nullable = true)
 |-- Position: string (nullable = true)
 |-- FIFA Popular Name: string (nullable = true)
 |-- Birth Date: string (nullable = true)
 |-- Shirt Name: string (nullable = true)
 |-- Club: string (nullable = true)
 |-- Height: integer (nullable = true)
 |-- Weight: integer (nullable = true)
 |-- Year: string (nullable = true)



In [31]:
# Changing the column type in the DataFrame
# - col("Year").cast(IntegerType()): casts the 'Year' column to IntegerType
df = df.withColumn("Year", col("Year").cast(IntegerType()))

# Checking the column types
df.printSchema()

root
 |-- Team: string (nullable = true)
 |-- Number: integer (nullable = true)
 |-- Position: string (nullable = true)
 |-- FIFA Popular Name: string (nullable = true)
 |-- Birth Date: string (nullable = true)
 |-- Shirt Name: string (nullable = true)
 |-- Club: string (nullable = true)
 |-- Height: integer (nullable = true)
 |-- Weight: integer (nullable = true)
 |-- Year: integer (nullable = true)



## Challenge: Birth Date Column Transformation

In [32]:
# Displaying the DataFrame
df.show(5)

+---------+------+--------+------------------+----------+----------+--------------------+------+------+----+
|     Team|Number|Position| FIFA Popular Name|Birth Date|Shirt Name|                Club|Height|Weight|Year|
+---------+------+--------+------------------+----------+----------+--------------------+------+------+----+
|Argentina|     3|      DF|TAGLIAFICO Nicolas|31.08.1992|TAGLIAFICO|      AFC Ajax (NED)|   169|    65|1992|
|Argentina|    22|      MF|    PAVON Cristian|21.01.1996|     PAVÓN|CA Boca Juniors (...|   169|    65|1996|
|Argentina|    15|      MF|    LANZINI Manuel|15.02.1993|   LANZINI|West Ham United F...|   167|    66|1993|
|Argentina|    18|      DF|    SALVIO Eduardo|13.07.1990|    SALVIO|    SL Benfica (POR)|   167|    69|1990|
|Argentina|    10|      FW|      MESSI Lionel|24.06.1987|     MESSI|  FC Barcelona (ESP)|   170|    72|1987|
+---------+------+--------+------------------+----------+----------+--------------------+------+------+----+
only showing top 5 

In [33]:
# Transforming the 'Birth Date' string (DD.MM.YYYY) into a DateType column

# Splitting 'Birth Date' on the literal dot into integer Day, Month, and Year columns
# (split() takes a regular expression, so the dot is escaped as "\\.")
df = df.withColumn("Day", split(col("Birth Date"), "\\.").getItem(0).cast(IntegerType()))
df = df.withColumn("Month", split(col("Birth Date"), "\\.").getItem(1).cast(IntegerType()))
df = df.withColumn("Year", split(col("Birth Date"), "\\.").getItem(2).cast(IntegerType()))

# Joining the parts as 'Year-Month-Day' and casting the resulting string to a date
# (the output below shows that unpadded parts such as month 8 are still parsed)
df = df.withColumn("BirthDate", concat_ws("-", col("Year"), col("Month"), col("Day")).cast("date"))

# Displaying the updated DataFrame to verify changes
df.show(5)

+---------+------+--------+------------------+----------+----------+--------------------+------+------+----+---+-----+----------+
|     Team|Number|Position| FIFA Popular Name|Birth Date|Shirt Name|                Club|Height|Weight|Year|Day|Month| BirthDate|
+---------+------+--------+------------------+----------+----------+--------------------+------+------+----+---+-----+----------+
|Argentina|     3|      DF|TAGLIAFICO Nicolas|31.08.1992|TAGLIAFICO|      AFC Ajax (NED)|   169|    65|1992| 31|    8|1992-08-31|
|Argentina|    22|      MF|    PAVON Cristian|21.01.1996|     PAVÓN|CA Boca Juniors (...|   169|    65|1996| 21|    1|1996-01-21|
|Argentina|    15|      MF|    LANZINI Manuel|15.02.1993|   LANZINI|West Ham United F...|   167|    66|1993| 15|    2|1993-02-15|
|Argentina|    18|      DF|    SALVIO Eduardo|13.07.1990|    SALVIO|    SL Benfica (POR)|   167|    69|1990| 13|    7|1990-07-13|
|Argentina|    10|      FW|      MESSI Lionel|24.06.1987|     MESSI|  FC Barcelona (ESP)| 

In [34]:
# Creating a new column 'Birth' the same way, passing the DateType() object to cast() instead of the string "date"
df = df.withColumn("Birth", concat_ws("-", col("Year"), col("Month"), col("Day")).cast(DateType()))

# Displaying the updated DataFrame to verify changes
df.show(5)

+---------+------+--------+------------------+----------+----------+--------------------+------+------+----+---+-----+----------+----------+
|     Team|Number|Position| FIFA Popular Name|Birth Date|Shirt Name|                Club|Height|Weight|Year|Day|Month| BirthDate|     Birth|
+---------+------+--------+------------------+----------+----------+--------------------+------+------+----+---+-----+----------+----------+
|Argentina|     3|      DF|TAGLIAFICO Nicolas|31.08.1992|TAGLIAFICO|      AFC Ajax (NED)|   169|    65|1992| 31|    8|1992-08-31|1992-08-31|
|Argentina|    22|      MF|    PAVON Cristian|21.01.1996|     PAVÓN|CA Boca Juniors (...|   169|    65|1996| 21|    1|1996-01-21|1996-01-21|
|Argentina|    15|      MF|    LANZINI Manuel|15.02.1993|   LANZINI|West Ham United F...|   167|    66|1993| 15|    2|1993-02-15|1993-02-15|
|Argentina|    18|      DF|    SALVIO Eduardo|13.07.1990|    SALVIO|    SL Benfica (POR)|   167|    69|1990| 13|    7|1990-07-13|1990-07-13|
|Argentina|  

In [35]:
# Checking the column types
df.printSchema()

root
 |-- Team: string (nullable = true)
 |-- Number: integer (nullable = true)
 |-- Position: string (nullable = true)
 |-- FIFA Popular Name: string (nullable = true)
 |-- Birth Date: string (nullable = true)
 |-- Shirt Name: string (nullable = true)
 |-- Club: string (nullable = true)
 |-- Height: integer (nullable = true)
 |-- Weight: integer (nullable = true)
 |-- Year: integer (nullable = true)
 |-- Day: integer (nullable = true)
 |-- Month: integer (nullable = true)
 |-- BirthDate: date (nullable = true)
 |-- Birth: date (nullable = true)

