# Initial Handling and Shaping of Data via _PySpark_ ✨

## The PySpark Session

Start the PySpark session:

In [None]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName( 'BabyNamesZurich' ).getOrCreate()

## The PySpark DataFrame

Read from the CSV file and put the data into a PySpark DataFrame:

_(From here on, the DataFrame is referred to as "df".)_

In [2]:
data_file = '../datasets/baby_vornamen.csv'
df = spark.read.csv( data_file, header=True, inferSchema=True )
# Show first 5 lines of data
df.show(5)

+---------------+--------+--------+----------+
|StichtagDatJahr| Vorname| SexLang|AnzGebuWir|
+---------------+--------+--------+----------+
|           1993|  Abarna|weiblich|         1|
|           1993| Abetare|weiblich|         1|
|           1993|    Abir|weiblich|         1|
|           1993| Abirami|weiblich|         1|
|           1993|Adelaide|weiblich|         1|
+---------------+--------+--------+----------+
only showing top 5 rows



### Reshape the data a little bit 🔮🪄

Rename the columns:

In [3]:
df = df.withColumnRenamed( 'StichtagDatJahr', 'year' ) \
        .withColumnRenamed( 'Vorname', 'name' ) \
        .withColumnRenamed( 'SexLang', 'sex' ) \
        .withColumnRenamed( 'AnzGebuWir', 'freq' )

Recode the values of the 'sex' column:

In [4]:
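from pyspark.sql.functions import when

# Recode the German labels to single letters (case-insensitive match).
# Without an .otherwise() clause, any other value becomes null.
df = df.withColumn( 'sex',
         when( df.sex.ilike( 'weiblich' ), 'F' )
        .when( df.sex.ilike( 'männlich' ), 'M' )
)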

#### A column value can also be changed with the _regexp_replace_ function in PySpark.

An example use case:

```python
from pyspark.sql.functions import when, regexp_replace

df = df.withColumn( 'sex',
         when( df.sex.ilike( 'weiblich' ), regexp_replace( df.sex, 'weiblich', 'F' ) )
        .when( df.sex.ilike( 'männlich' ), regexp_replace( df.sex, 'männlich', 'M' ) )
)

# == OR ==

# Chain one regexp_replace per label; values matching neither pattern stay unchanged
df = df.withColumn( 'sex',
         regexp_replace( regexp_replace( df.sex, 'weiblich', 'F' ), 'männlich', 'M' )
)
```

However, _regexp_replace_ is a better fit for replacing only part of a value.

Here the whole value is replaced, so it is not needed this time.

#### Let's see the updated data  👀

In [5]:
df.show(5)
df.printSchema()

+----+--------+---+----+
|year|    name|sex|freq|
+----+--------+---+----+
|1993|  Abarna|  F|   1|
|1993| Abetare|  F|   1|
|1993|    Abir|  F|   1|
|1993| Abirami|  F|   1|
|1993|Adelaide|  F|   1|
+----+--------+---+----+
only showing top 5 rows

root
 |-- year: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- sex: string (nullable = true)
 |-- freq: integer (nullable = true)



## Hello SQL 👋

Register the PySpark df as a temporary view so that it can be queried with SQL:

In [6]:
df.createOrReplaceTempView( 'NEWBORNS' )

Check that the view was registered:

In [7]:
spark.sql( "SHOW TABLES" ).show()

+---------+---------+-----------+
|namespace|tableName|isTemporary|
+---------+---------+-----------+
|         | newborns|       true|
+---------+---------+-----------+

