# Data Wrangling and Transformation

This wrangling is based on the exploratory data analysis presented in the EDA-initial notebook, which is in the same folder as this one. Some additional ad hoc exploration has guided some of the decisions. Both notebooks will be updated from time to time as more data is collected.

In a production system, it would be better to specify all the processing steps before requesting any output, because Spark evaluates transformations lazily and its optimizer can then take all of the steps into account at once.

In the interest of clarity and observability, I am breaking up the steps here.
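
For comparison, a single-chain version of the same steps might look like the sketch below. It is an illustration, not the code this notebook runs: the names `build_clean_df`, `COLUMNS_TO_DROP`, and `DERIVED_COLUMNS` and the use of `selectExpr` are assumptions made here, while the expressions themselves mirror the cells that follow. Because the function never calls an action such as `count()`, nothing executes until the result is written or collected.

```python
# Hypothetical single-chain version of the cleaning steps in this notebook.
COLUMNS_TO_DROP = ["filename", "image_url", "localpath", "title_page", "title_rss"]

# SQL expressions for the derived columns; each one reads only original columns.
DERIVED_COLUMNS = [
    "CAST(date_publish AS timestamp) AS published",
    "CASE WHEN text IS NULL THEN description ELSE text END AS text_or_desc",
    "CASE WHEN (language IS NULL AND source_domain NOT IN ('bbc.com')) "
    "THEN 'en' ELSE language END AS language_guess",
]


def build_clean_df(raw_df):
    """Return the cleaned DataFrame without triggering any Spark job."""
    return (
        raw_df
        .drop(*COLUMNS_TO_DROP)
        .dropDuplicates(["url", "date_publish"])
        .selectExpr("*", *DERIVED_COLUMNS)
        .na.drop(subset=["date_publish", "published", "text_or_desc", "title"])
        .where("language_guess = 'en'")
    )
```

Calling `build_clean_df(raw_df).explain()` would print the plan Spark builds for the whole chain, which is one way to see the steps being planned together.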



## Initialization

In [1]:
spark

Starting Spark application


| ID | YARN Application ID | Kind | State | Spark UI | Driver log | Current session? |
|---|---|---|---|---|---|---|
| 21 | application_1575479622115_0022 | pyspark | idle |  |  | ✔ |


SparkSession available as 'spark'.


<pyspark.sql.session.SparkSession object at 0x7febf1f19908>

In [2]:
%%info

| ID | YARN Application ID | Kind | State | Spark UI | Driver log | Current session? |
|---|---|---|---|---|---|---|
| 21 | application_1575479622115_0022 | pyspark | idle | Link | Link | ✔ |


In [3]:
from pyspark.sql.functions import col, input_file_name, udf, expr
from pyspark.sql.functions import sum as spark_sum

## Read in all data from the bucket

In [4]:
raw_df = spark.read.json("s3://topic-sentiment-1/combined/*.json")

In [5]:
raw_df.count()

1429172

First, we get rid of columns we don't need.

In [6]:
columns_to_drop = ['filename', 'image_url', 'localpath', 'title_page', 'title_rss']
clean_df = raw_df.drop(*columns_to_drop)

Next, we remove duplicates. The crawling process often fetches the same URL more than once, because we run it under different conditions and incremental crawls overlap. We treat two rows as duplicates if their `url` and `date_publish` values are the same.

In [7]:
clean_df = clean_df.dropDuplicates(['url', 'date_publish'])

In [8]:
clean_df.count()

1153452
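
As a rough illustration of the deduplication rule, here is a plain-Python sketch on a few made-up rows. The URLs, dates, and the `crawl` field are hypothetical. The sketch keeps the first row it sees for each key, whereas Spark's `dropDuplicates` does not promise which of the duplicate rows survives.

```python
def drop_duplicates(rows, keys=("url", "date_publish")):
    """Keep one row per distinct combination of the key fields."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row.get(k) for k in keys)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Made-up rows: the first two share both url and date_publish.
rows = [
    {"url": "https://example.com/a", "date_publish": "2019-12-01 08:00:00", "crawl": "full"},
    {"url": "https://example.com/a", "date_publish": "2019-12-01 08:00:00", "crawl": "incremental"},
    {"url": "https://example.com/a", "date_publish": "2019-12-02 09:30:00", "crawl": "incremental"},
]
deduplicated = drop_duplicates(rows)
```

The third row is kept because its `date_publish` differs, so a re-published article at the same URL counts as a separate record.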

Next, we create a `published` column that holds the `date_publish` string converted to a timestamp.

In [9]:
clean_df = clean_df.withColumn("published", (col("date_publish").cast("timestamp")))
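
The behavior worth keeping in mind here is what happens to strings that do not parse. Assuming ANSI mode is off (the default in Spark 2.x and 3.x), the cast yields null rather than raising an error. The plain-Python sketch below imitates that behavior; the function name and the input format are assumptions, since the actual layout of `date_publish` is not shown in this notebook.

```python
from datetime import datetime
from typing import Optional

# Assumed input layout; the real format of date_publish is not shown here.
ASSUMED_FORMAT = "%Y-%m-%d %H:%M:%S"


def to_timestamp_or_none(value: Optional[str]) -> Optional[datetime]:
    """Parse a date string, returning None instead of raising on bad input."""
    if value is None:
        return None
    try:
        return datetime.strptime(value, ASSUMED_FORMAT)
    except ValueError:
        return None


parsed = to_timestamp_or_none("2019-12-04 17:13:42")  # a datetime value
unparsed = to_timestamp_or_none("yesterday")          # None, not an exception
```

A null produced this way is only removed if a later step drops rows where `published` is null.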

We create a new column, `text_or_desc`, that will be used in the analysis. It takes its value from the `text` column when present and falls back to `description` when `text` is null.

In [10]:
clean_df = clean_df.withColumn("text_or_desc",
                           expr("case when text IS NULL THEN description ELSE text END"))
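
The `CASE` expression has the same effect as SQL's `coalesce(text, description)`. The plain-Python sketch below spells out the rule on single values; the function name and sample strings are made up for illustration. Note that only a missing value triggers the fallback, so an empty string in `text` is kept as is.

```python
def text_or_desc(text, description):
    """Return text unless it is missing (None), else fall back to description."""
    return description if text is None else text


from_text = text_or_desc("Full article body", "Short summary")  # uses text
from_desc = text_or_desc(None, "Short summary")                 # falls back
empty_kept = text_or_desc("", "Short summary")                  # "" is not None
```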

As noted in the EDA-initial notebook, we assume that the language is English (`en`) when no language is specified. We exclude bbc.com from this rule, because the assumption did not hold for that source.

In [11]:
clean_df = clean_df.withColumn("language_guess",
                          expr("case when (language IS NULL AND source_domain NOT IN ('bbc.com')) THEN 'en' ELSE language END"))
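
One subtlety in this expression is SQL's three-valued logic: when `source_domain` is itself null, `source_domain NOT IN ('bbc.com')` evaluates to null rather than true, so the `CASE` falls through to the `ELSE` branch. The plain-Python sketch below mirrors that behavior on single values; the function name and the `example.com` domain are hypothetical.

```python
def language_guess(language, source_domain):
    """Mirror the CASE expression, including its handling of missing values."""
    # A missing source_domain makes the NOT IN test unknown, which the CASE
    # treats as not true, so no guess is made in that case.
    domain_allows_guess = source_domain is not None and source_domain != "bbc.com"
    if language is None and domain_allows_guess:
        return "en"
    return language


guessed = language_guess(None, "example.com")   # "en": language missing, domain allowed
bbc = language_guess(None, "bbc.com")           # None: bbc.com is excluded
declared = language_guess("de", "example.com")  # "de": declared language is kept
no_domain = language_guess(None, None)          # None: domain unknown, no guess
```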

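Now we drop any rows where the publication date, the title, or the text/description is missing. Including `published` in the subset also drops rows whose `date_publish` string did not convert to a timestamp.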

In [12]:
clean_df = clean_df.na.drop(subset=['date_publish', 'published', 'text_or_desc', 'title'])
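
With its default `how='any'`, `na.drop` removes a row as soon as any one of the listed columns is null. The plain-Python sketch below shows the same rule on dictionaries; the helper name and the sample rows are made up for illustration.

```python
REQUIRED = ("date_publish", "published", "text_or_desc", "title")


def has_required_fields(row, required=REQUIRED):
    """True when every required field is present and not None."""
    return all(row.get(name) is not None for name in required)


# Made-up rows: the second one has no title and would be dropped.
sample = [
    {"date_publish": "2019-12-01 08:00:00", "published": "2019-12-01 08:00:00",
     "text_or_desc": "Body text", "title": "Headline"},
    {"date_publish": "2019-12-01 08:00:00", "published": "2019-12-01 08:00:00",
     "text_or_desc": "Body text", "title": None},
]
kept = [row for row in sample if has_required_fields(row)]
```

As in Spark, an empty string is not treated as missing here, so a row with `title` set to `""` would be kept.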

In [13]:
clean_df.count()

1095746

Finally, we keep only English articles for our analysis.

In [14]:
clean_df = clean_df.where("language_guess = 'en'")

In [15]:
clean_df.count()

1036546
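
To put the four counts reported in this notebook side by side, the short calculation below works out how many rows each step removed. The stage labels and the helper name are chosen here for illustration; the numbers are the `count()` outputs shown above.

```python
# Row counts copied from the count() outputs in this notebook.
counts = [
    ("raw", 1429172),
    ("after dropDuplicates", 1153452),
    ("after dropping nulls", 1095746),
    ("English only", 1036546),
]


def stage_losses(stage_counts):
    """Return (stage, rows_removed, fraction_of_previous_stage) for each step."""
    losses = []
    for (_, before), (name, after) in zip(stage_counts, stage_counts[1:]):
        removed = before - after
        losses.append((name, removed, removed / before))
    return losses


for name, removed, fraction in stage_losses(counts):
    print(f"{name}: {removed} rows removed ({fraction:.1%})")

retained = counts[-1][1] / counts[0][1]  # share of raw rows that survive
```

Deduplication removes 275720 rows (about 19% of the raw data), dropping nulls removes 57706, and the English filter removes 59200, leaving roughly 72.5% of the raw rows.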

## Output

Write the results to a bucket.

In [16]:
clean_df.write.parquet("s3://topic-sentiment-1/clean-data", mode="overwrite")
