# Special characters in Parquet files

A Parquet file does not store plain text; it stores binary data. String values are kept as raw bytes, and Parquet itself does not check or convert their encoding. Interpreting those bytes as characters is left to the reading application (pandas, Spark, etc.), based on the encoding that application expects.
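
To make this concrete, here is a minimal sketch in plain Python, with no Parquet involved: the same bytes yield different characters depending on the encoding the reader assumes. The sample string and variable names are chosen for illustration only.

```python
# The same stored bytes, interpreted with two different encodings.
stored = "Johnçë".encode("utf-8")      # bytes as a UTF-8 writer would produce them

as_utf8 = stored.decode("utf-8")       # reader assumes UTF-8
as_latin1 = stored.decode("latin-1")   # reader assumes Latin-1

print(stored)      # b'John\xc3\xa7\xc3\xab'
print(as_utf8)     # Johnçë
print(as_latin1)   # JohnÃ§Ã«
```

The bytes never change; only their interpretation does.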

If you're experiencing issues with textual data in Parquet files, **the cause is most likely the way the data was initially written to the Parquet files**, rather than how pandas or Spark reads it.

> Ensure that the data is written to the Parquet files with the appropriate encoding in the first place. If the data was written correctly, it can be read correctly as well.
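
As a sketch of what "written correctly" can mean in practice, assume a hypothetical legacy export that delivers names as Latin-1 bytes. The text has to be decoded with the encoding the source really used before it is handed to a Parquet writer. The Latin-1 source and all variable names here are assumptions made for illustration.

```python
# Hypothetical raw input: the name "Johnçë" exported by a legacy system in Latin-1.
raw = "Johnçë".encode("latin-1")

# Wrong: assume UTF-8 and replace whatever fails to decode.
# The accented characters become U+FFFD and cannot be recovered later.
lossy = raw.decode("utf-8", errors="replace")

# Right: decode with the encoding the source actually used.
text = raw.decode("latin-1")

# Bytes a writer would store if it encodes this string as UTF-8.
stored = text.encode("utf-8")

print(text)    # Johnçë
print(stored)  # b'John\xc3\xa7\xc3\xab'
```

Once the string is correct in memory, the choice of reader no longer matters for this value.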

Below are two examples:

- In example 1, we read a Parquet file that was written with a bad encoding.
- In example 2, we create a Parquet file that contains special characters and is written with a correct encoding.

In [32]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
import os
import pandas as pd

In [None]:
# Set to False to run against a Kubernetes cluster instead of a local session.
local = True
if local:
    spark = SparkSession.builder.master("local[4]") \
                  .appName("ReadWriteParquet").getOrCreate()
else:
    # Cluster-specific values are read from environment variables.
    spark = SparkSession.builder \
                      .master("k8s://https://kubernetes.default.svc:443") \
                      .appName("ReadWriteParquet") \
                      .config("spark.kubernetes.container.image", "inseefrlab/jupyter-datascience:master") \
                      .config("spark.kubernetes.authenticate.driver.serviceAccountName", os.environ['KUBERNETES_SERVICE_ACCOUNT']) \
                      .config("spark.kubernetes.namespace", os.environ['KUBERNETES_NAMESPACE']) \
                      .config("spark.executor.instances", "4") \
                      .config("spark.executor.memory", "8g") \
                      .config("spark.kubernetes.driver.pod.name", os.environ["POD_NAME"]) \
                      .config('spark.jars.packages', 'org.postgresql:postgresql:42.2.24') \
                      .getOrCreate()

In [33]:
! wget https://s3.amazonaws.com/duckdb-md-dataset-121/netflix_daily_top_10.parquet

--2023-08-29 13:16:58--  https://s3.amazonaws.com/duckdb-md-dataset-121/netflix_daily_top_10.parquet
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.38.8, 52.217.86.198, 52.217.164.176, ...
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.38.8|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 92595 (90K) [binary/octet-stream]
Saving to: ‘netflix_daily_top_10.parquet.1’


2023-08-29 13:17:00 (150 KB/s) - ‘netflix_daily_top_10.parquet.1’ saved [92595/92595]



## Example 1

### Read the downloaded Parquet file with pandas

In [34]:
filePath = "/home/onyxia/work/netflix_daily_top_10.parquet"

pdf = pd.read_parquet(filePath)

In [35]:
filteredPdf = pdf[pdf['Title'].str.contains('Queen', case=False)]
filteredPdf.head(50)

| | As of | Rank | Year to Date Rank | Last Week Rank | Title | Type | Netflix Exclusive | Netflix Release Date | Days In Top 10 | Viewership Score |
|---|---|---|---|---|---|---|---|---|---|---|
| 699 | 2020-06-09 | 10 | - | - | Queen of the South | TV Show | | May 9, 2017 | 1 | 1 |
| 709 | 2020-06-10 | 10 | 10 | - | Queen of the South | TV Show | | May 9, 2017 | 2 | 2 |
| 718 | 2020-06-11 | 9 | 10 | - | Queen of the South | TV Show | | May 9, 2017 | 3 | 4 |
| 728 | 2020-06-12 | 9 | 9 | - | Queen of the South | TV Show | | May 9, 2017 | 4 | 6 |
| 1631 | 2020-09-11 | 2 | - | - | The Babysitter: Killer Queen | Movie | Yes | Sep 10, 2020 | 1 | 9 |
| 1641 | 2020-09-12 | 2 | 2 | - | The Babysitter: Killer Queen | Movie | Yes | Sep 10, 2020 | 2 | 18 |
| 1652 | 2020-09-13 | 3 | 2 | - | The Babysitter: Killer Queen | Movie | Yes | Sep 10, 2020 | 3 | 26 |
| 1663 | 2020-09-14 | 4 | 3 | - | The Babysitter: Killer Queen | Movie | Yes | Sep 10, 2020 | 4 | 33 |
| 1675 | 2020-09-15 | 6 | 4 | - | The Babysitter: Killer Queen | Movie | Yes | Sep 10, 2020 | 5 | 38 |
| 1685 | 2020-09-16 | 6 | 6 | - | The Babysitter: Killer Queen | Movie | Yes | Sep 10, 2020 | 6 | 43 |


### Read the Parquet file with Spark

In [36]:
df = spark.read.parquet(filePath)
df.show(5)

+----------+----+-----------------+--------------+--------------------+-------+-----------------+--------------------+--------------+----------------+
|     As of|Rank|Year to Date Rank|Last Week Rank|               Title|   Type|Netflix Exclusive|Netflix Release Date|Days In Top 10|Viewership Score|
+----------+----+-----------------+--------------+--------------------+-------+-----------------+--------------------+--------------+----------------+
|2020-04-01|   1|                1|             1|Tiger King: Murde...|TV Show|              Yes|        Mar 20, 2020|             9|              90|
|2020-04-01|   2|                2|             -|               Ozark|TV Show|              Yes|        Jul 21, 2017|             5|              45|
|2020-04-01|   3|                3|             2|        All American|TV Show|             null|        Mar 28, 2019|             9|              76|
|2020-04-01|   4|                4|             -|        Blood Father|  Movie|             nu

In [37]:
filterDf = df.filter(col("Title").like("%Queen%"))
filterDf.show(50,truncate=False)

+----------+----+-----------------+--------------+----------------------------+-------+-----------------+--------------------+--------------+----------------+
|As of     |Rank|Year to Date Rank|Last Week Rank|Title                       |Type   |Netflix Exclusive|Netflix Release Date|Days In Top 10|Viewership Score|
+----------+----+-----------------+--------------+----------------------------+-------+-----------------+--------------------+--------------+----------------+
|2020-06-09|10  |-                |-             |Queen of the South          |TV Show|null             |May 9, 2017         |1             |1               |
|2020-06-10|10  |10               |-             |Queen of the South          |TV Show|null             |May 9, 2017         |2             |2               |
|2020-06-11|9   |10               |-             |Queen of the South          |TV Show|null             |May 9, 2017         |3             |4               |
|2020-06-12|9   |9                |-          

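> Notice that the Spark result is wrong too, just like the pandas one: the problem comes from how the file was written, not from the reader.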
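
When a column is suspected of having been written with the wrong encoding, a quick heuristic can flag affected values. The sketch below uses a hypothetical helper, `looks_like_mojibake`, which checks for one common failure mode only: UTF-8 bytes that were decoded as Latin-1 before being written. It is not a general detector and can produce false positives; the sample values are made up for illustration.

```python
def looks_like_mojibake(value: str) -> bool:
    """Heuristic: True if value looks like UTF-8 text that was decoded as Latin-1."""
    try:
        # Undo the suspected wrong decoding, then decode the bytes as UTF-8.
        repaired = value.encode("latin-1").decode("utf-8")
    except UnicodeError:
        # Not representable in Latin-1, or the bytes are not valid UTF-8:
        # the value does not match this failure mode.
        return False
    return repaired != value


samples = ["JohnÃ§Ã«", "Johnçë", "Alice", "âêîôû"]
flagged = [s for s in samples if looks_like_mojibake(s)]
print(flagged)  # ['JohnÃ§Ã«']
```

With pandas, the same helper could be applied to a string column, for example `pdf['Title'].map(looks_like_mojibake)`, assuming the column contains no missing values.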

## Example 2

Now let's create a Parquet file with special characters.

In [38]:
data_list = [('Johnçë', 28),
             ('Alice', 24),
             ('âêîôû', 32)]
rdd = spark.sparkContext.parallelize(data_list)
specDf = spark.createDataFrame(rdd, ["Name", "Age"])

In [39]:
specDf.show()

                                                                                

+------+---+
|  Name|Age|
+------+---+
|Johnçë| 28|
| Alice| 24|
| âêîôû| 32|
+------+---+



In [26]:
filePath2 = "/home/onyxia/work/out_parquet"

In [23]:
specDf.coalesce(1).write.mode("overwrite").parquet(filePath2)


> We write the DataFrame to a Parquet file.

In [40]:
! ls /home/onyxia/work/out_parquet

part-00000-4ce45397-dc47-4713-8866-d94b62896e88-c000.snappy.parquet  _SUCCESS


### Read the Parquet file with Spark

In [28]:
newDf = spark.read.parquet(filePath2)
newDf.show(5)

                                                                                

+------+---+
|  Name|Age|
+------+---+
|Johnçë| 28|
| Alice| 24|
| âêîôû| 32|
+------+---+



### Read the Parquet file with pandas

In [30]:
pdf1 = pd.read_parquet(filePath2)

In [31]:
pdf1.head()

| | Name | Age |
|---|---|---|
| 0 | Johnçë | 28 |
| 1 | Alice | 24 |
| 2 | âêîôû | 32 |
