## 10. Challenges of working with dates and timestamps

Let's read in the supermarket sales dataset attached to the lecture and look at some of the issues that can come up when working with date and timestamp values.

In [1]:
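import findspark
findspark.init()  # make the local Spark installation importable before importing pyspark

from pyspark.sql import SparkSession

def create_spark(appname):
    """Create a SparkSession.

    Args:
        appname: Name of the Spark application.

    Returns:
        SparkSession: A session that can be used to perform Spark operations.
    """
    # The LEGACY time parser policy restores the pre-Spark-3.0 date parsing behaviour
    spark = (
        SparkSession.builder
        .appName(appname)
        .config("spark.sql.legacy.timeParserPolicy", "LEGACY")
        .getOrCreate()
    )
    return spark

# Forward slashes avoid the invalid "\s" escape sequence in the original path
dataset_path = 'dataset/supermarket_sales - Sheet1.csv'
spark = create_spark('supermarketSale')

# header=True takes the column names from the first row; every column is read as a string
sales = spark.read.csv(dataset_path, header=True)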

## About this dataset

Supermarkets are growing in most populated cities, and market competition is high. This dataset contains the historical sales of a supermarket company, recorded at 3 different branches over 3 months.

Attribute information:

 - Invoice id: Computer generated sales slip invoice identification number
 - Branch: Branch of supercenter (3 branches are available identified by A, B and C).
 - City: Location of supercenters
 - Customer type: Type of customer, recorded as Member for customers using a member card and Normal for customers without one.
 - Gender: Gender type of customer
 - Product line: General item categorization groups - Electronic accessories, Fashion accessories, Food and beverages, Health and beauty, Home and lifestyle, Sports and travel
 - Unit price: Price of each product in USD
 - Quantity: Number of products purchased by customer
 - Tax: 5% tax charged on the purchase
 - Total: Total price including tax
 - Date: Date of purchase (Record available from January 2019 to March 2019)
 - Time: Purchase time (10am to 9pm)
 - Payment: Payment used by customer for purchase (3 methods are available – Cash, Credit card and Ewallet)
 - COGS: Cost of goods sold
 - Gross margin percentage: Gross margin percentage
 - Gross income: Gross income
 - Rating: Customer satisfaction rating of their overall shopping experience (on a scale of 1 to 10)

**Source:** https://www.kaggle.com/aungpyaeap/supermarket-sales

### View dataframe and schema as usual

In [2]:
# view data
sales.show()

+-----------+------+---------+-------------+------+--------------------+----------+--------+-------+--------+---------+-----+-----------+------+-----------------------+------------+------+
| Invoice ID|Branch|     City|Customer type|Gender|        Product line|Unit price|Quantity| Tax 5%|   Total|     Date| Time|    Payment|  cogs|gross margin percentage|gross income|Rating|
+-----------+------+---------+-------------+------+--------------------+----------+--------+-------+--------+---------+-----+-----------+------+-----------------------+------------+------+
|750-67-8428|     A|   Yangon|       Member|Female|   Health and beauty|     74.69|       7|26.1415|548.9715| 1/5/2019|13:08|    Ewallet|522.83|            4.761904762|     26.1415|   9.1|
|226-31-3081|     C|Naypyitaw|       Normal|Female|Electronic access...|     15.28|       5|   3.82|   80.22| 3/8/2019|10:29|       Cash|  76.4|            4.761904762|        3.82|   9.6|
|631-41-3108|     A|   Yangon|       Normal|  Male|  Ho

In [3]:
# view schema
sales.printSchema()

root
 |-- Invoice ID: string (nullable = true)
 |-- Branch: string (nullable = true)
 |-- City: string (nullable = true)
 |-- Customer type: string (nullable = true)
 |-- Gender: string (nullable = true)
 |-- Product line: string (nullable = true)
 |-- Unit price: string (nullable = true)
 |-- Quantity: string (nullable = true)
 |-- Tax 5%: string (nullable = true)
 |-- Total: string (nullable = true)
 |-- Date: string (nullable = true)
 |-- Time: string (nullable = true)
 |-- Payment: string (nullable = true)
 |-- cogs: string (nullable = true)
 |-- gross margin percentage: string (nullable = true)
 |-- gross income: string (nullable = true)
 |-- Rating: string (nullable = true)



### Convert date field to date type

The schema shows that every column, including Date, was read as a string, so we need to convert the Date field to a date type. Let's go ahead and do that.

In [4]:
from pyspark.sql.functions import col, to_date

# Dates look like 1/5/2019 (no zero padding), so use single-letter M and d,
# which accept both one- and two-digit months and days
sales = sales.withColumn("Date", to_date(col("Date"), "M/d/yyyy"))

### How can we extract the month value from the date field?

If you had trouble converting the date field in the previous step, think about a more creative way to extract the month from that field.

In [5]:
from pyspark.sql.functions import month, col

# month() returns the month number of a date column
sales = sales.withColumn("Month", month(col("Date")))

# Display the DataFrame with the new column
sales.show()

+-----------+------+---------+-------------+------+--------------------+----------+--------+-------+--------+----------+-----+-----------+------+-----------------------+------------+------+-----+
| Invoice ID|Branch|     City|Customer type|Gender|        Product line|Unit price|Quantity| Tax 5%|   Total|      Date| Time|    Payment|  cogs|gross margin percentage|gross income|Rating|Month|
+-----------+------+---------+-------------+------+--------------------+----------+--------+-------+--------+----------+-----+-----------+------+-----------------------+------------+------+-----+
|750-67-8428|     A|   Yangon|       Member|Female|   Health and beauty|     74.69|       7|26.1415|548.9715|2019-01-05|13:08|    Ewallet|522.83|            4.761904762|     26.1415|   9.1|    1|
|226-31-3081|     C|Naypyitaw|       Normal|Female|Electronic access...|     15.28|       5|   3.82|   80.22|2019-03-08|10:29|       Cash|  76.4|            4.761904762|        3.82|   9.6|    3|
|631-41-3108|     A|

## 11.0 Working with Arrays

Here is a dataframe of reviews of the movie The Dark Knight.

In [6]:
from pyspark.sql.functions import *

values = [(5, 'Epic. This is the best movie I have EVER seen'),
          (4, 'Pretty good, but I would have liked to seen better special effects'),
          (3, 'So so. Casting could have been improved'),
          (5, 'The most EPIC movie of the year! Casting was awesome. Special effects were so intense.'),
          (4, 'Solid but I would have liked to see more of the love story'),
          (5, 'THE BOMB!!!!!!!')]
reviews = spark.createDataFrame(values, ['rating', 'review_txt'])

reviews.show(6, False)

+------+--------------------------------------------------------------------------------------+
|rating|review_txt                                                                            |
+------+--------------------------------------------------------------------------------------+
|5     |Epic. This is the best movie I have EVER seen                                         |
|4     |Pretty good, but I would have liked to seen better special effects                    |
|3     |So so. Casting could have been improved                                               |
|5     |The most EPIC movie of the year! Casting was awesome. Special effects were so intense.|
|4     |Solid but I would have liked to see more of the love story                            |
|5     |THE BOMB!!!!!!!                                                                       |
+------+--------------------------------------------------------------------------------------+



## 11.1 Let's see if we can create an array off of the review text column and then derive some meaningful results from it.

**But first** we need to clean the review_txt column so that the later analysis can find the words it is looking for. So let's do the following:

1. Remove all punctuation
2. Lowercase everything
3. Remove leading and trailing white space (trim)
4. Then finally, split the string
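
To see what each step does before applying it to a whole column, the same four steps can be tried on a single string in plain Python. This is only a sketch for intuition: `clean_and_split` is a hypothetical helper, and Python's `re` module is not identical to the regex engine Spark uses, although the two agree on plain English text like these reviews.

```python
import re

def clean_and_split(text):
    """Apply the four cleaning steps to one string and return a list of words."""
    # 1. Remove punctuation: drop anything that is not a word character or whitespace
    no_punct = re.sub(r"[^\w\s]", "", text)
    # 2. Lowercase everything
    lowered = no_punct.lower()
    # 3. Trim leading and trailing whitespace
    trimmed = lowered.strip()
    # 4. Split on single spaces
    return trimmed.split(" ")

words = clean_and_split("The most EPIC movie of the year! Casting was awesome.")
print(words)
```

Lowercasing before splitting matters for the later search: a word that appears as `Epic` in one review and `EPIC` in another ends up as the same array element.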

In [7]:
from pyspark.sql.functions import col, regexp_replace, lower, trim, split

def clean_review_text(df):
    """
    Cleans the review_txt column by removing punctuation, converting to lowercase,
    and trimming whitespace, then splits the cleaned string into words.

    Args:
        df: A Spark DataFrame with a column named 'review_txt'.

    Returns:
        A new Spark DataFrame in which 'review_txt' is replaced by
        'original_review' (the cleaned text) and 'review_words' (the array of words).
    """

    # Remove punctuation: drop every character that is not a word character or whitespace
    df = df.withColumn("review_txt", regexp_replace(col("review_txt"), r"[^\w\s]", ""))

    # Lowercase everything
    df = df.withColumn("review_txt", lower(col("review_txt")))

    # Trim leading and trailing whitespace
    df = df.withColumn("review_txt", trim(col("review_txt")))

    # Split the string into an array of words
    df = df.withColumn("review_words", split(col("review_txt"), " "))

    # Rename the cleaned text column
    df = df.withColumnRenamed("review_txt", "original_review")

    return df

# Apply the cleaning function
cleaned_df = clean_review_text(reviews)

# Show the cleaned DataFrame
cleaned_df.show(truncate=False)


+------+-----------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------+
|rating|original_review                                                                    |review_words                                                                                       |
+------+-----------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------+
|5     |epic this is the best movie i have ever seen                                       |[epic, this, is, the, best, movie, i, have, ever, seen]                                            |
|4     |pretty good but i would have liked to seen better special effects                  |[pretty, good, but, i, would, have, liked, to, seen, better, special, effects]                     |
|3     |so so casting could have be

## 11.2 Alright now let's see if we can find which reviews contain the word 'Epic'

In [8]:
from pyspark.sql.functions import col, array_contains

# Filter reviews containing the word "epic"; review_words was lowercased, so "Epic" and "EPIC" match too
epic_reviews = cleaned_df.filter(array_contains(col("review_words"), "epic"))

# Show the reviews containing "Epic"
epic_reviews.show(truncate=False)


+------+-----------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------+
|rating|original_review                                                                    |review_words                                                                                       |
+------+-----------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------+
|5     |epic this is the best movie i have ever seen                                       |[epic, this, is, the, best, movie, i, have, ever, seen]                                            |
|5     |the most epic movie of the year casting was awesome special effects were so intense|[the, most, epic, movie, of, the, year, casting, was, awesome, special, effects, were, so, intense]|
+------+---------------------------