# 02. Data transformation

In the previous notebook, we saw how to download tweets. Note that we only stored the original response; we have not done any cleaning or transformation yet.

If the data ingestion process is run by different people using different tools or API versions, the tweets may have different schemas.

To use these tweets, we need to clean them and transform them into a single common schema that later steps can rely on.

In this notebook, we will show you how to do that.



Raw data sources:
- s3a://pengfei/diffusion/demo_prod/current/2021
- s3a://pengfei/diffusion/demo_prod/current_short/2021
- s3a://pengfei/diffusion/demo_prod/old/2010_2021

In [1]:
from pyspark.sql import SparkSession, DataFrame
from pyspark.sql.types import StructField, StructType, StringType, LongType, IntegerType
from pyspark.sql.functions import lit, col, when, concat, udf
import os

In [2]:
local=False
if local:
    spark=SparkSession.builder.master("local[4]") \
                  .config('spark.jars.packages', 'org.postgresql:postgresql:42.2.24') \
                  .appName("Twitter_Data_Transformation").getOrCreate()
else:
    spark=SparkSession.builder \
                      .master("k8s://https://kubernetes.default.svc:443") \
                      .appName("Twitter_Data_Transformation") \
                      .config("spark.kubernetes.container.image",os.environ["IMAGE_NAME"]) \
                      .config("spark.kubernetes.authenticate.driver.serviceAccountName",os.environ['KUBERNETES_SERVICE_ACCOUNT']) \
                      .config("spark.kubernetes.driver.pod.name", os.environ['KUBERNETES_POD_NAME'])\
                      .config("spark.kubernetes.namespace", os.environ['KUBERNETES_NAMESPACE']) \
                      .config("spark.executor.instances", "4") \
                      .config("spark.executor.memory","8g") \
                      .config('spark.jars.packages','org.postgresql:postgresql:42.2.24') \
                      .getOrCreate()
# Setting spark.kubernetes.driver.pod.name lets Kubernetes garbage-collect the executor pods once the driver pod is deleted



:: loading settings :: url = jar:file:/opt/spark/jars/ivy-2.5.0.jar!/org/apache/ivy/core/settings/ivysettings.xml


Ivy Default Cache set to: /home/jovyan/.ivy2/cache
The jars for the packages stored in: /home/jovyan/.ivy2/jars
org.postgresql#postgresql added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-978bc624-3cca-4c47-a53b-48e5fafb3e8c;1.0
	confs: [default]
	found org.postgresql#postgresql;42.2.24 in central
	found org.checkerframework#checker-qual;3.5.0 in central
:: resolution report :: resolve 349ms :: artifacts dl 4ms
	:: modules in use:
	org.checkerframework#checker-qual;3.5.0 from central in [default]
	org.postgresql#postgresql;42.2.24 from central in [default]
	---------------------------------------------------------------------
	|                  |            modules            ||   artifacts   |
	|       conf       | number| search|dwnlded|evicted|| number|dwnlded|
	---------------------------------------------------------------------
	|      default     |   2   |   0   |   0   |   0   ||   2   |   0   |
	-------------------------------------------

## 2.1 Explore the different data sets

In [3]:
path1 = "s3a://pengfei/diffusion/demo_prod/current/2021"
path2 = "s3a://pengfei/diffusion/demo_prod/current_short/2021"
path3 = "s3a://pengfei/diffusion/demo_prod/old/2010_2021"

### 2.1.1 Check data of path1

In [4]:
df1=spark.read.parquet(path1)

2021-12-08 12:49:18,508 WARN util.package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.


In [5]:
df1.count()

                                                                                

13158

In [17]:
df1.printSchema()

root
 |-- contributors: string (nullable = true)
 |-- coordinates: string (nullable = true)
 |-- created_at: string (nullable = true)
 |-- display_text_range: array (nullable = true)
 |    |-- element: long (containsNull = true)
 |-- entities: struct (nullable = true)
 |    |-- hashtags: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- indices: array (nullable = true)
 |    |    |    |    |-- element: long (containsNull = true)
 |    |    |    |-- text: string (nullable = true)
 |    |-- media: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- additional_media_info: struct (nullable = true)
 |    |    |    |    |-- monetizable: boolean (nullable = true)
 |    |    |    |-- description: string (nullable = true)
 |    |    |    |-- display_url: string (nullable = true)
 |    |    |    |-- expanded_url: string (nullable = true)
 |    |    |    |-- id: long (nullable = true)
 |    |    |    |-- i

### 2.1.2 Check data of path2

In [6]:
df2=spark.read.parquet(path2)

In [7]:
df2.count()

700

In [8]:
df2.printSchema()

root
 |-- name: string (nullable = true)
 |-- date: string (nullable = true)
 |-- text: string (nullable = true)
 |-- __index_level_0__: long (nullable = true)



### 2.1.3 Check data of path3

In [31]:
df3=spark.read.parquet(path3)

In [32]:
df3.count()

                                                                                

2062

In [11]:
df3.printSchema()

root
 |-- created_at: string (nullable = true)
 |-- id: long (nullable = true)
 |-- id_str: string (nullable = true)
 |-- text: string (nullable = true)
 |-- source: string (nullable = true)
 |-- truncated: boolean (nullable = true)
 |-- in_reply_to_status_id: double (nullable = true)
 |-- in_reply_to_status_id_str: string (nullable = true)
 |-- in_reply_to_user_id: double (nullable = true)
 |-- in_reply_to_user_id_str: string (nullable = true)
 |-- in_reply_to_screen_name: string (nullable = true)
 |-- geo: integer (nullable = true)
 |-- coordinates: integer (nullable = true)
 |-- place: integer (nullable = true)
 |-- contributors: integer (nullable = true)
 |-- is_quote_status: boolean (nullable = true)
 |-- quote_count: long (nullable = true)
 |-- reply_count: long (nullable = true)
 |-- retweet_count: long (nullable = true)
 |-- favorite_count: long (nullable = true)
 |-- favorited: boolean (nullable = true)
 |-- retweeted: boolean (nullable = true)
 |-- filter_level: string (nulla

**Note that the three data sets all have different schemas.**
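
One way to make the differences concrete is to compare the top-level column names of two data frames. The sketch below uses a hypothetical helper, `schema_diff` (not part of PySpark), that works on plain lists of names such as those returned by `DataFrame.columns`. The df3 list contains only the first four columns of the schema printed above.

```python
def schema_diff(cols_a, cols_b):
    """Compare two lists of column names (hypothetical helper, not a PySpark API)."""
    a, b = set(cols_a), set(cols_b)
    return {
        "shared": sorted(a & b),
        "only_a": sorted(a - b),
        "only_b": sorted(b - a),
    }


# Column names copied from the printed schemas above.
df2_cols = ["name", "date", "text", "__index_level_0__"]
df3_cols_head = ["created_at", "id", "id_str", "text"]  # first four columns only

diff = schema_diff(df2_cols, df3_cols_head)
print(diff["shared"])  # ['text']
```

In the notebook itself, the same comparison could be run as `schema_diff(df2.columns, df3.columns)`.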

## 2.2 Merge the three data sets

In [20]:
df1_tr=df1.select(col("user.name"),df1.text,df1.created_at)
df1_tr.show(2,truncate=False)


+--------------+--------------------------------------------------------------------------------------------------------------------------------------------+------------------------------+
|name          |text                                                                                                                                        |created_at                    |
+--------------+--------------------------------------------------------------------------------------------------------------------------------------------+------------------------------+
|Marc Botte    |RT @remigodeau: «Les disparités salariales ont légèrement diminué en 2 décennies, du fait de leur réduction parmi la moitié des salariés le…|Fri Apr 09 07:11:09 +0000 2021|
|Hugues Bismuth|RT @InseeFr: En mai 2020, un quart (23 %) des personnes âgées de 15 ans ou plus déclarent que la situation financière de leur foyer s’est d…|Fri Apr 09 07:11:09 +0000 2021|
+--------------+---------------------------------------

                                                                                

In [25]:
print(df1_tr.count())



13158


                                                                                

In [22]:
df2_tr=df2.select(col("name"),col("text"),col("date"))
df2_tr.show(2,truncate=False)

+------------------+----------------------------------------------------------------------------------------------------------------------------------------------+------------------------------+
|name              |text                                                                                                                                          |date                          |
+------------------+----------------------------------------------------------------------------------------------------------------------------------------------+------------------------------+
|MeidasTouch.com   |Facebook just suspended our account for 24 hours because of this video. Apparently on Facebook, you can spread Covi… https://t.co/skL8phRzYM  |Sun Nov 28 20:20:40 +0000 2021|
|lenika ✨ at PTD LA|idk the question but here's his answer:\n\nSG: you asked about whether the fears and hesitations we once expressed ha… https://t.co/lgY4MuBs52|Sun Nov 28 22:35:50 +0000 2021|
+------------------+-----

In [43]:
from pyspark.sql.types import StringType
# Cast the numeric tweet id to string so its type matches the name column of df1_tr and df2_tr
df3_tr=df3.select(col("id").cast(StringType()),col("text"),col("created_at"))
df3_tr.show(2, truncate=False)
df3_tr.printSchema()


+-----------+--------------------------------------------------------------------------------------------------------------------------------------------+------------------------------+
|id         |text                                                                                                                                        |created_at                    |
+-----------+--------------------------------------------------------------------------------------------------------------------------------------------+------------------------------+
|26489279972|RT @krstv :Notez bien les noms des gens qui daubent sur "Qui veut épouser mon fils?".75%d'entre eux livetweeteront l'émission(INSEE/Mixbeat)|Tue Oct 05 20:50:42 +0000 2010|
|26488007455|Insee and Pelosi are two peas in a pod.  If you like Pelosi, then vote for Inslee!! #tcot #teaparty #gop #sgp #wcot #wasen                  |Tue Oct 05 20:31:29 +0000 2010|
+-----------+---------------------------------------------------------

                                                                                

In [46]:
union_1=df2_tr.union(df1_tr).union(df3_tr)
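
Note that `union` matches columns by position rather than by name, and the result takes its column names from the first data frame (here `df2_tr`), so the three selections above must list their columns in the same order. A minimal sketch of a more explicit alternative: project each source onto the same named columns, then combine them with `unionByName`. The helper names `common_schema_exprs` and `to_common_schema` are hypothetical, and the source columns are assumed to be simple identifiers such as `user.name`.

```python
def common_schema_exprs(name_col, text_col, date_col):
    """Build Spark SQL expressions projecting a source onto (name, text, date)."""
    targets = {"name": name_col, "text": text_col, "date": date_col}
    # Cast everything to STRING so the three sources end up with identical types.
    return [f"CAST({src} AS STRING) AS `{dst}`" for dst, src in targets.items()]


def to_common_schema(df, name_col, text_col, date_col):
    """Apply the projection to a Spark DataFrame (hypothetical helper)."""
    return df.selectExpr(*common_schema_exprs(name_col, text_col, date_col))


exprs = common_schema_exprs("user.name", "text", "created_at")
print(exprs[0])  # CAST(user.name AS STRING) AS `name`
```

With the data frames of this notebook, the merge could then be written as `to_common_schema(df2, "name", "text", "date").unionByName(to_common_schema(df1, "user.name", "text", "created_at")).unionByName(to_common_schema(df3, "id", "text", "created_at"))`.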

In [48]:
union_1.show(5,truncate=False)

+--------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------+------------------------------+
|name                            |text                                                                                                                                          |date                          |
+--------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------+------------------------------+
|MeidasTouch.com                 |Facebook just suspended our account for 24 hours because of this video. Apparently on Facebook, you can spread Covi… https://t.co/skL8phRzYM  |Sun Nov 28 20:20:40 +0000 2021|
|lenika ✨ at PTD LA              |idk the question but here's his answer:\n\nSG: you asked about whether the fears and hesitations we once expressed ha… https://t.c

In [49]:
union_1.count()

                                                                                

15920
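
Because `union` keeps duplicate rows (it behaves like SQL `UNION ALL`), the merged count should equal the sum of the three input counts. A quick check, using the counts printed earlier in this notebook:

```python
# Row counts reported by df1.count(), df2.count() and df3.count() above.
input_counts = {"df1": 13158, "df2": 700, "df3": 2062}
expected_total = sum(input_counts.values())
print(expected_total)  # 15920, which matches union_1.count()
```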

In [50]:
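output_path="s3a://formation/mise-en-prod/data"
# coalesce(1) merges the result into one partition, so a single Parquet part file is written;
# mode('overwrite') replaces any data already present at output_path
union_1.coalesce(1).write.mode('overwrite').parquet(output_path)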

                                                                                