
# Manipulating Data in DataFrames HW


#### Let's get started applying what we learned in the lecture!

I've provided several questions below to help test and expand your knowledge from the code-along lecture. So let's see what you've got!

First, create your Spark session, as we need to do at the start of every project.

In [None]:
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("FirstSpark").getOrCreate()
spark

## Read in our Republican vs. Democrats Tweet DataFrame

Attached to the lecture

## About this dataframe

Extracted tweets from all of the representatives (latest 200 as of May 17th 2018)

**Source:** https://www.kaggle.com/kapastor/democratvsrepublicantweets#ExtractedTweets.csv

Use either .show() or .toPandas() to check out the first few rows of the dataframe to get an idea of what we are working with.

In [None]:
!rm -rf pyspark2

# Clone the repository again
!git clone https://github.com/khaledn66/pyspark2.git

Cloning into 'pyspark2'...
remote: Enumerating objects: 53, done.
remote: Counting objects: 100% (53/53), done.
remote: Compressing objects: 100% (51/51), done.
remote: Total 53 (delta 22), reused 0 (delta 0), pack-reused 0 (from 0)
Receiving objects: 100% (53/53), 7.41 MiB | 5.47 MiB/s, done.
Resolving deltas: 100% (22/22), done.


**Prevent Truncation of view**

If the view you produced above truncated some of the longer tweets, see if you can prevent that so you can read the whole tweet.

In [None]:
file_path = './pyspark2/Rep_vs_Dem_tweets.csv'
tweets = spark.read.csv(file_path, inferSchema=True, header=True)
tweets.show(5)

+--------------------+-------------+--------------------+
|               Party|       Handle|               Tweet|
+--------------------+-------------+--------------------+
|            Democrat|RepDarrenSoto|Today, Senate Dem...|
|            Democrat|RepDarrenSoto|RT @WinterHavenSu...|
|            Democrat|RepDarrenSoto|RT @NBCLatino: .@...|
|Congress has allo...|         NULL|                NULL|
|            Democrat|RepDarrenSoto|RT @NALCABPolicy:...|
+--------------------+-------------+--------------------+
only showing top 5 rows



**Print Schema**

First, check the schema to make sure the datatypes are accurate.

In [None]:
tweets.printSchema()

root
 |-- Party: string (nullable = true)
 |-- Handle: string (nullable = true)
 |-- Tweet: string (nullable = true)


## 1. Can you identify any tweet that mentions the handle @LatinoLeader using regexp_extract?

It doesn't matter how you identify the row; any identifier will do. You can test your script on row 5 from this dataset. That row contains @LatinoLeader.
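Before writing the Spark version, it can help to try the pattern in plain Python first. The sketch below uses Python's `re` module on the text of the row mentioned above (trailing characters dropped). Spark evaluates patterns with Java's regex engine, so this is only an approximation, but for this ASCII example the word-boundary rule is assumed to behave the same way. The variable names are illustrative.

```python
import re

# Text of the sample row that mentions the handle
tweet = ("RT @NALCABPolicy: Meeting with @RepDarrenSoto . Thanks for taking "
         "the time to meet with @LatinoLeader ED Marucci Guzman.")

# "\b" in front of "@" only matches if a word character comes right before
# the "@". Here a space precedes it, so this pattern finds nothing.
strict = re.search(r"\b@LatinoLeader\b", tweet)

# Dropping the leading "\b" keeps the trailing boundary, so "@LatinoLeaders"
# would still be rejected while "@LatinoLeader" is found.
loose = re.search(r"@LatinoLeader\b", tweet)

print(strict)          # None
print(loose.group(0))  # @LatinoLeader
```

In PySpark the same pattern can be passed to `regexp_extract(column, pattern, 0)`, which returns an empty string for rows without a match.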

In [None]:
tweets.select("Handle","Tweet").where(tweets.Tweet.like("%@LatinoLeader %")).show(5, False)

+-------------+--------------------------------------------------------------------------------------------------------------------------------------------+
|Handle       |Tweet                                                                                                                                       |
+-------------+--------------------------------------------------------------------------------------------------------------------------------------------+
|RepDarrenSoto|RT @NALCABPolicy: Meeting with @RepDarrenSoto . Thanks for taking the time to meet with @LatinoLeader ED Marucci Guzman. #NALCABPolicy2018.…|
+-------------+--------------------------------------------------------------------------------------------------------------------------------------------+



In [None]:
from pyspark.sql.functions import regexp_replace, regexp_extract

In [None]:
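#pattern = r"(@LatinoLeader\b)"
# Note: \b before "@" only matches when a word character precedes the "@",
# so this pattern finds nothing when the handle follows a space.
pattern = r"\b@LatinoLeader\b"

# Use regexp_extract to pull out the exact value
# (this overwrites the original Handle column)
tweets = tweets.withColumn("Handle", regexp_extract(tweets.Tweet, pattern, 0))

# Show the results
tweets.show(truncate=False)

# Extract the username; regexp_extract returns an empty string when nothing matches
tweets = tweets.withColumn("Handle", regexp_extract(tweets.Handle, pattern, 1))

tweets.show(truncate=False)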

+----------------------------------------------------+------+--------------------------------------------------------------------------------------------------------------------------------------------+
|Party                                               |Handle|Tweet                                                                                                                                       |
+----------------------------------------------------+------+--------------------------------------------------------------------------------------------------------------------------------------------+
|Democrat                                            |      |Today, Senate Dems vote to #SaveTheInternet. Proud to support similar #NetNeutrality legislation here in the House… https://t.co/n3tggDLU1L |
|Democrat                                            |      |RT @WinterHavenSun: Winter Haven resident / Alta Vista teacher is one of several recognized by @RepDarrenSoto for National Te

In [None]:
# show() returns None, so there is nothing useful to assign here
tweets.withColumn('Handle', regexp_extract(tweets.Handle, r"\b@LatinoLeader\b", 1)).show()

+--------------------+------+--------------------+
|               Party|Handle|               Tweet|
+--------------------+------+--------------------+
|            Democrat|      |Today, Senate Dem...|
|            Democrat|      |RT @WinterHavenSu...|
|            Democrat|      |RT @NBCLatino: .@...|
|Congress has allo...|  NULL|                NULL|
|            Democrat|      |RT @NALCABPolicy:...|
|            Democrat|      |RT @Vegalteno: Hu...|
|            Democrat|      |RT @EmgageActionF...|
|            Democrat|      |Hurricane Maria l...|
|            Democrat|      |RT @Tharryry: I a...|
|            Democrat|      |RT @HispanicCaucu...|
|            Democrat|      |RT @RepStephMurph...|
|            Democrat|      |RT @AllSaints_FL:...|
|            Democrat|      |.@realDonaldTrump...|
|            Democrat|      |Thank you to my m...|
|            Democrat|      |We paid our respe...|
|Sgt Sam Howard - ...|  NULL|                NULL|
|            Democrat|      |RT

In [None]:
from pyspark.sql.functions import col

# Filter for the exact username @LatinoLeader in the Tweet column
tweets_filtered = tweets.filter(col("Tweet").contains("@LatinoLeader"))

tweets_filtered.show(5, truncate=False)


+--------+------+--------------------------------------------------------------------------------------------------------------------------------------------+
|Party   |Handle|Tweet                                                                                                                                       |
+--------+------+--------------------------------------------------------------------------------------------------------------------------------------------+
|Democrat|      |RT @NALCABPolicy: Meeting with @RepDarrenSoto . Thanks for taking the time to meet with @LatinoLeader ED Marucci Guzman. #NALCABPolicy2018.…|
+--------+------+--------------------------------------------------------------------------------------------------------------------------------------------+



## 2. Replace any value other than 'Democrat' or 'Republican' with 'Other' in the Party column.

We can see from the earlier output that there are several values other than 'Democrat' or 'Republican' in the Party column. We are assuming that this is dirty data that needs to be cleaned up.

In [None]:
from pyspark.sql.functions import when, col

# Replace values in the 'Party' column
tweets = tweets.withColumn("Party",
                           when(col("Party") == "Democrat", "Democrat")
                           .when(col("Party") == "Republican", "Republican")
                           .otherwise("Other"))

tweets.show(truncate=False)


+--------+------+--------------------------------------------------------------------------------------------------------------------------------------------+
|Party   |Handle|Tweet                                                                                                                                       |
+--------+------+--------------------------------------------------------------------------------------------------------------------------------------------+
|Democrat|      |Today, Senate Dems vote to #SaveTheInternet. Proud to support similar #NetNeutrality legislation here in the House… https://t.co/n3tggDLU1L |
|Democrat|      |RT @WinterHavenSun: Winter Haven resident / Alta Vista teacher is one of several recognized by @RepDarrenSoto for National Teacher Apprecia…|
|Democrat|      |RT @NBCLatino: .@RepDarrenSoto noted that Hurricane Maria has left approximately $90 billion in damages.                                    |
|Other   |NULL  |NULL                     

## 3. Delete all embedded links (i.e. "https://...")

For example, see the first row in the tweets dataframe.

*Note: this may require a Google search :)*
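As a quick way to experiment with a link-matching pattern before applying it to the whole dataframe, here is a minimal plain-Python sketch. It uses `re.sub` on a shortened version of the first tweet; the pattern is one common choice (anything from `http://` or `https://` up to the next whitespace), not the only possible one, and Spark's Java regex engine is assumed to treat it the same way.

```python
import re

tweet = ("Proud to support #NetNeutrality legislation here in the House "
         "https://t.co/n3tggDLU1L")

# Remove anything that starts with http:// or https:// up to the next whitespace
no_links = re.sub(r"https?://\S+", "", tweet)

# The space that preceded the link is left behind at the end of the string
print(repr(no_links))

# Stripping whitespace afterwards cleans that up
print(repr(no_links.strip()))
```

The leftover trailing space is one reason a trim step is worth running after removing links.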

In [None]:
from pyspark.sql.functions import regexp_replace

# Replace URLs starting with 'http://' or 'https://' with an empty string
tweets = tweets.withColumn("Tweet",
                           regexp_replace(col("Tweet"), r"https?://\S+", ""))

tweets.show(truncate=False)


+--------+------+--------------------------------------------------------------------------------------------------------------------------------------------+
|Party   |Handle|Tweet                                                                                                                                       |
+--------+------+--------------------------------------------------------------------------------------------------------------------------------------------+
|Democrat|      |Today, Senate Dems vote to #SaveTheInternet. Proud to support similar #NetNeutrality legislation here in the House…                         |
|Democrat|      |RT @WinterHavenSun: Winter Haven resident / Alta Vista teacher is one of several recognized by @RepDarrenSoto for National Teacher Apprecia…|
|Democrat|      |RT @NBCLatino: .@RepDarrenSoto noted that Hurricane Maria has left approximately $90 billion in damages.                                    |
|Other   |NULL  |NULL                     

## 4. Remove any leading or trailing white space in the tweet column

In [None]:
from pyspark.sql.functions import trim, col

# Remove leading and trailing white spaces in the 'Tweet' column
tweets = tweets.withColumn("Tweet", trim(col("Tweet")))

tweets.show(truncate=False)


+--------+------+--------------------------------------------------------------------------------------------------------------------------------------------+
|Party   |Handle|Tweet                                                                                                                                       |
+--------+------+--------------------------------------------------------------------------------------------------------------------------------------------+
|Democrat|      |Today, Senate Dems vote to #SaveTheInternet. Proud to support similar #NetNeutrality legislation here in the House…                         |
|Democrat|      |RT @WinterHavenSun: Winter Haven resident / Alta Vista teacher is one of several recognized by @RepDarrenSoto for National Teacher Apprecia…|
|Democrat|      |RT @NBCLatino: .@RepDarrenSoto noted that Hurricane Maria has left approximately $90 billion in damages.                                    |
|Other   |NULL  |NULL                     

## 5. Rename the 'Party' column to 'Dem_Rep'

No real reason here :) just wanted you to get practice doing this.

In [None]:
# Rename the 'Party' column to 'Dem_Rep'
tweets = tweets.withColumnRenamed("Party", "Dem_Rep")

tweets.show(truncate=False)


+--------+------+--------------------------------------------------------------------------------------------------------------------------------------------+
|Dem_Rep |Handle|Tweet                                                                                                                                       |
+--------+------+--------------------------------------------------------------------------------------------------------------------------------------------+
|Democrat|      |Today, Senate Dems vote to #SaveTheInternet. Proud to support similar #NetNeutrality legislation here in the House…                         |
|Democrat|      |RT @WinterHavenSun: Winter Haven resident / Alta Vista teacher is one of several recognized by @RepDarrenSoto for National Teacher Apprecia…|
|Democrat|      |RT @NBCLatino: .@RepDarrenSoto noted that Hurricane Maria has left approximately $90 billion in damages.                                    |
|Other   |NULL  |NULL                     

## 6. Concatenate the Party and Handle columns

Silly yes... but good practice.

`pyspark.sql.functions.concat_ws(sep, *cols)` <br>
Concatenates multiple input string columns together into a single string column, using the given separator.
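One practical difference between `concat` and `concat_ws` is how they treat missing values. The plain-Python sketch below mimics that behaviour as I understand it from the Spark documentation: `concat` yields NULL when any input is NULL, while `concat_ws` skips NULL inputs. The helper names `concat_like` and `concat_ws_like` are made up for illustration and are not part of PySpark.

```python
from typing import Optional


def concat_like(*values: Optional[str]) -> Optional[str]:
    """Mimics concat: the result is None (NULL) if any input is None."""
    if any(v is None for v in values):
        return None
    return "".join(values)


def concat_ws_like(sep: str, *values: Optional[str]) -> str:
    """Mimics concat_ws: None (NULL) inputs are skipped, the rest are joined with sep."""
    return sep.join(v for v in values if v is not None)


print(concat_like("Democrat", "RepDarrenSoto"))          # DemocratRepDarrenSoto
print(concat_like("Other", None))                        # None
print(concat_ws_like("_", "Democrat", "RepDarrenSoto"))  # Democrat_RepDarrenSoto
print(concat_ws_like("_", "Other", None))                # Other
```

With rows that contain NULL handles, as this dataset does, the choice between the two functions decides whether the combined column ends up NULL or keeps the party value.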

In [None]:
from pyspark.sql.functions import concat, col

# Concatenate the 'Dem_Rep' (formerly 'Party') and 'Handle' columns
tweets = tweets.withColumn("Party_Handle", concat(col("Dem_Rep"), col("Handle")))

tweets.show(truncate=False)

+--------+------+--------------------------------------------------------------------------------------------------------------------------------------------+------------+
|Dem_Rep |Handle|Tweet                                                                                                                                       |Party_Handle|
+--------+------+--------------------------------------------------------------------------------------------------------------------------------------------+------------+
|Democrat|      |Today, Senate Dems vote to #SaveTheInternet. Proud to support similar #NetNeutrality legislation here in the House…                         |Democrat    |
|Democrat|      |RT @WinterHavenSun: Winter Haven resident / Alta Vista teacher is one of several recognized by @RepDarrenSoto for National Teacher Apprecia…|Democrat    |
|Democrat|      |RT @NBCLatino: .@RepDarrenSoto noted that Hurricane Maria has left approximately $90 billion in damages.               

## Challenge Question

Let's imagine that we want to analyze the hashtags that are used in these tweets. Can you extract all the hashtags you see?
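To get a feel for what "all the hashtags" should look like for one tweet, here is a small plain-Python sketch using `re.findall` on the text of the first tweet (link and trailing characters dropped). It also shows why simply splitting on spaces is not quite enough: punctuation stays attached to the token. Treat it as an illustration of the pattern, not as the Spark solution itself.

```python
import re

tweet = ("Today, Senate Dems vote to #SaveTheInternet. Proud to support "
         "similar #NetNeutrality legislation here in the House")

# "#" followed by one or more word characters (letters, digits, underscore)
hashtags = re.findall(r"#\w+", tweet)
print(hashtags)  # ['#SaveTheInternet', '#NetNeutrality']

# Splitting on spaces keeps trailing punctuation attached to the hashtag
by_split = [word for word in tweet.split(" ") if word.startswith("#")]
print(by_split)  # ['#SaveTheInternet.', '#NetNeutrality']
```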

In [None]:
from pyspark.sql.functions import regexp_extract_all, col, lit

# Regular expression to match hashtags (# followed by word characters)
pattern = r"#\w+"

# Extract all hashtags in each tweet.
# regexp_extract_all resolves a plain Python string as a column name
# (raising UNRESOLVED_COLUMN), so the pattern is wrapped in lit().
# idx=0 returns the whole match because the pattern has no capture group.
tweets = tweets.withColumn("Hashtags", regexp_extract_all(col("Tweet"), lit(pattern), 0))

tweets.show(truncate=False)


In [None]:
from pyspark.sql.functions import col, explode, split
# Python's built-in filter cannot iterate over a Column; use Spark's array filter
from pyspark.sql.functions import filter as array_filter

# Split each tweet on spaces to get an array of tokens
tokens = split(col("Tweet"), " ")

# Keep only the tokens that start with "#", then explode to one row per hashtag.
# The result goes into a new variable so that tweets keeps one row per tweet.
hashtag_rows = tweets.withColumn(
    "Hashtags",
    explode(array_filter(tokens, lambda x: x.startswith("#"))))

hashtag_rows.show(truncate=False)

In [None]:
from pyspark.sql.functions import col, split, array

# Split the text into words
tweets = tweets.withColumn("Hashtags", split(col("Tweet"), " "))

# Keep only the first 10 words of each tweet.
# Note: this does not filter for hashtags; it just truncates the word list.
tweets = tweets.withColumn("Hashtags",
                           array(*[col("Hashtags")[i] for i in range(0, 10)]))

tweets.show(truncate=False)


+--------+------+--------------------------------------------------------------------------------------------------------------------------------------------+------------+--------------------------------------------------------------------------------------------------+
|Dem_Rep |Handle|Tweet                                                                                                                                       |Party_Handle|Hashtags                                                                                          |
+--------+------+--------------------------------------------------------------------------------------------------------------------------------------------+------------+--------------------------------------------------------------------------------------------------+
|Democrat|      |Today, Senate Dems vote to #SaveTheInternet. Proud to support similar #NetNeutrality legislation here in the House…                         |Democrat    |[Today,, Senat

# Let's create our own dataset to work with real dates

This is a dataset of patient visits from a medical office. It contains the patients' first and last names, dates of birth, and the dates of their first 3 visits.

In [None]:
from pyspark.sql.types import *

# Each tuple: first name, last name, date of birth, visit 1, visit 2, visit 3
md_office = [('Mohammed','Alfasy','1987-4-8','2016-1-7','2017-2-3','2018-3-2'),
             ('Marcy','Wellmaker','1986-4-8','2015-1-7','2017-1-3','2018-1-2'),
             ('Ginny','Ginger','1986-7-10','2014-8-7','2015-2-3','2016-3-2'),
             ('Vijay','Doberson','1988-5-2','2016-1-7','2018-2-3','2018-3-2'),
             ('Orhan','Gelicek','1987-5-11','2016-5-7','2017-1-3','2018-9-2'),
             ('Sarah','Jones','1956-7-6','2016-4-7','2017-8-3','2018-10-2'),
             ('John','Johnson','2017-10-12','2018-1-2','2018-10-3','2018-3-2')]

df = spark.createDataFrame(md_office,['first_name','last_name','dob','visit1','visit2','visit3'])

# Check to make sure it worked
df.show()
df.printSchema()

+----------+---------+----------+--------+---------+---------+
|first_name|last_name|       dob|  visit1|   visit2|   visit3|
+----------+---------+----------+--------+---------+---------+
|  Mohammed|   Alfasy|  1987-4-8|2016-1-7| 2017-2-3| 2018-3-2|
|     Marcy|Wellmaker|  1986-4-8|2015-1-7| 2017-1-3| 2018-1-2|
|     Ginny|   Ginger| 1986-7-10|2014-8-7| 2015-2-3| 2016-3-2|
|     Vijay| Doberson|  1988-5-2|2016-1-7| 2018-2-3| 2018-3-2|
|     Orhan|  Gelicek| 1987-5-11|2016-5-7| 2017-1-3| 2018-9-2|
|     Sarah|    Jones|  1956-7-6|2016-4-7| 2017-8-3|2018-10-2|
|      John|  Johnson|2017-10-12|2018-1-2|2018-10-3| 2018-3-2|
+----------+---------+----------+--------+---------+---------+

root
 |-- first_name: string (nullable = true)
 |-- last_name: string (nullable = true)
 |-- dob: string (nullable = true)
 |-- visit1: string (nullable = true)
 |-- visit2: string (nullable = true)
 |-- visit3: string (nullable = true)


Oh no! The dates are still stored as text... let's try converting them again and see if we have any issues this time.

## 7. Can you calculate a variable showing the length of time between patient visits?

Compare visit1 to visit2 and visit2 to visit3 for all patients and see what the average length of time is between visits. Create an alias for it as well.
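To sanity-check the numbers Spark produces, the same calculation can be sketched in plain Python for a couple of patients. The example below uses the first two rows of the dataset and the standard `datetime` module; `parse_ymd` is a small helper written just for this illustration.

```python
from datetime import date


def parse_ymd(text: str) -> date:
    """Parse 'yyyy-M-d' strings such as '2016-1-7' (no zero padding needed)."""
    year, month, day = (int(part) for part in text.split("-"))
    return date(year, month, day)


# (visit1, visit2, visit3) for the first two patients in the dataset
visits = [
    ("2016-1-7", "2017-2-3", "2018-3-2"),
    ("2015-1-7", "2017-1-3", "2018-1-2"),
]

# Days between consecutive visits for each patient
gap_1_2 = [(parse_ymd(v2) - parse_ymd(v1)).days for v1, v2, _ in visits]
gap_2_3 = [(parse_ymd(v3) - parse_ymd(v2)).days for _, v2, v3 in visits]

# Average gap across these patients
avg_gap_1_2 = sum(gap_1_2) / len(gap_1_2)
avg_gap_2_3 = sum(gap_2_3) / len(gap_2_3)

print(gap_1_2, gap_2_3)          # [393, 727] [392, 364]
print(avg_gap_1_2, avg_gap_2_3)  # 560.0 378.0
```

In Spark, the alias the question asks for can be attached with `.alias(...)` on the aggregate expression, for example `F.avg("time_visit1_to_visit2").alias("avg_days_visit1_to_visit2")`; the alias name here is only illustrative.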

In [None]:
from pyspark.sql import functions as F

# Note: use the DataFrame df here; md_office is the plain Python list it was
# built from and has no withColumn method.

# Calculate the time difference between visit1 and visit2
df = df.withColumn("time_visit1_to_visit2",
                   F.datediff(F.col("visit2"), F.col("visit1")))

# Calculate the time difference between visit2 and visit3
df = df.withColumn("time_visit2_to_visit3",
                   F.datediff(F.col("visit3"), F.col("visit2")))

# Calculate the average time difference for all patients
average_time_visit1_to_visit2 = df.agg(F.avg("time_visit1_to_visit2")).collect()[0][0]
average_time_visit2_to_visit3 = df.agg(F.avg("time_visit2_to_visit3")).collect()[0][0]

# Show the DataFrame with the calculated time differences
df.show(truncate=False)

# Output the average times between visits
print(f"Average time between visit1 and visit2: {average_time_visit1_to_visit2} days")
print(f"Average time between visit2 and visit3: {average_time_visit2_to_visit3} days")

In [None]:
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, DateType

# Your list of data
md_office = [
    ('Mohammed', 'Alfasy', '1987-4-8', '2016-1-7', '2017-2-3', '2018-3-2'),
    ('Marcy', 'Wellmaker', '1986-4-8', '2015-1-7', '2017-1-3', '2018-1-2'),
    ('Ginny', 'Ginger', '1986-7-10', '2014-8-7', '2015-2-3', '2016-3-2'),
    ('Vijay', 'Doberson', '1988-5-2', '2016-1-7', '2018-2-3', '2018-3-2'),
    ('Orhan', 'Gelicek', '1987-5-11', '2016-5-7', '2017-1-3', '2018-9-2'),
    ('Sarah', 'Jones', '1956-7-6', '2016-4-7', '2017-8-3', '2018-10-2'),
    ('John', 'Johnson', '2017-10-12', '2018-1-2', '2018-10-3', '2018-3-2')
]

# Define the schema for the DataFrame
schema = StructType([
    StructField("first_name", StringType(), True),
    StructField("last_name", StringType(), True),
    StructField("dob", StringType(), True),
    StructField("visit1", StringType(), True),
    StructField("visit2", StringType(), True),
    StructField("visit3", StringType(), True)
])

# Create a DataFrame from the list and schema
df = spark.createDataFrame(md_office, schema)

# Convert the date columns to DateType
df = df.withColumn("dob", F.to_date("dob", "yyyy-M-d"))
df = df.withColumn("visit1", F.to_date("visit1", "yyyy-M-d"))
df = df.withColumn("visit2", F.to_date("visit2", "yyyy-M-d"))
df = df.withColumn("visit3", F.to_date("visit3", "yyyy-M-d"))

# Calculate the time difference between visit1 and visit2
df = df.withColumn("time_visit1_to_visit2", F.datediff("visit2", "visit1"))

# Calculate the time difference between visit2 and visit3
df = df.withColumn("time_visit2_to_visit3", F.datediff("visit3", "visit2"))

# Show the updated DataFrame
df.show(truncate=False)

# Check the schema to verify the column types
df.printSchema()


+----------+---------+----------+----------+----------+----------+---------------------+---------------------+
|first_name|last_name|dob       |visit1    |visit2    |visit3    |time_visit1_to_visit2|time_visit2_to_visit3|
+----------+---------+----------+----------+----------+----------+---------------------+---------------------+
|Mohammed  |Alfasy   |1987-04-08|2016-01-07|2017-02-03|2018-03-02|393                  |392                  |
|Marcy     |Wellmaker|1986-04-08|2015-01-07|2017-01-03|2018-01-02|727                  |364                  |
|Ginny     |Ginger   |1986-07-10|2014-08-07|2015-02-03|2016-03-02|180                  |393                  |
|Vijay     |Doberson |1988-05-02|2016-01-07|2018-02-03|2018-03-02|758                  |27                   |
|Orhan     |Gelicek  |1987-05-11|2016-05-07|2017-01-03|2018-09-02|241                  |607                  |
|Sarah     |Jones    |1956-07-06|2016-04-07|2017-08-03|2018-10-02|483                  |425                  |
|

In [None]:
from pyspark.sql import functions as F

# Calculate the time difference between visit1 and visit2
df = df.withColumn("time_visit1_to_visit2",
                           F.datediff(col("visit2"), col("visit1")))

# Calculate the time difference between visit2 and visit3
df = df.withColumn("time_visit2_to_visit3",
                           F.datediff(col("visit3"), col("visit2")))

# Calculate the average time difference for all patients
average_time_visit1_to_visit2 = df.agg(F.avg("time_visit1_to_visit2")).collect()[0][0]
average_time_visit2_to_visit3 = df.agg(F.avg("time_visit2_to_visit3")).collect()[0][0]

# Show the DataFrame with the calculated time differences
df.show(truncate=False)

# Output the average times between visits
print(f"Average time between visit1 and visit2: {average_time_visit1_to_visit2} days")
print(f"Average time between visit2 and visit3: {average_time_visit2_to_visit3} days")


+----------+---------+----------+----------+----------+----------+---------------------+---------------------+
|first_name|last_name|dob       |visit1    |visit2    |visit3    |time_visit1_to_visit2|time_visit2_to_visit3|
+----------+---------+----------+----------+----------+----------+---------------------+---------------------+
|Mohammed  |Alfasy   |1987-04-08|2016-01-07|2017-02-03|2018-03-02|393                  |392                  |
|Marcy     |Wellmaker|1986-04-08|2015-01-07|2017-01-03|2018-01-02|727                  |364                  |
|Ginny     |Ginger   |1986-07-10|2014-08-07|2015-02-03|2016-03-02|180                  |393                  |
|Vijay     |Doberson |1988-05-02|2016-01-07|2018-02-03|2018-03-02|758                  |27                   |
|Orhan     |Gelicek  |1987-05-11|2016-05-07|2017-01-03|2018-09-02|241                  |607                  |
|Sarah     |Jones    |1956-07-06|2016-04-07|2017-08-03|2018-10-02|483                  |425                  |
|

## 8. Can you calculate the age of each patient?
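Dividing a day count by 365 is a handy approximation, but it ignores leap days and can be off by one close to a birthday. The plain-Python sketch below compares that approximation with a calendar-based calculation for one made-up date of birth; `age_on` and the example dates are illustrative and not part of the dataset.

```python
from datetime import date


def age_on(dob: date, as_of: date) -> int:
    """Whole years completed between dob and as_of."""
    # Subtract one if the birthday has not yet occurred in the as_of year
    before_birthday = (as_of.month, as_of.day) < (dob.month, dob.day)
    return as_of.year - dob.year - before_birthday


dob = date(2000, 1, 1)
as_of = date(2019, 12, 28)

approx_age = (as_of - dob).days // 365  # days / 365 ignores leap days
exact_age = age_on(dob, as_of)

print(approx_age)  # 20
print(exact_age)   # 19
```

In Spark, one comparable option is `F.floor(F.months_between(F.current_date(), "dob") / 12)`, which works on calendar months rather than a fixed 365-day year.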

In [None]:
from pyspark.sql import functions as F

# Calculate age by subtracting dob from the current date and dividing by 365 (approximately)
df = df.withColumn("age", F.floor(F.datediff(F.current_date(), df.dob) / 365))

# Show the updated DataFrame
df.show(truncate=False)

# Check the schema to verify the column types
df.printSchema()


+----------+---------+----------+----------+----------+----------+---------------------+---------------------+---+
|first_name|last_name|dob       |visit1    |visit2    |visit3    |time_visit1_to_visit2|time_visit2_to_visit3|age|
+----------+---------+----------+----------+----------+----------+---------------------+---------------------+---+
|Mohammed  |Alfasy   |1987-04-08|2016-01-07|2017-02-03|2018-03-02|393                  |392                  |37 |
|Marcy     |Wellmaker|1986-04-08|2015-01-07|2017-01-03|2018-01-02|727                  |364                  |38 |
|Ginny     |Ginger   |1986-07-10|2014-08-07|2015-02-03|2016-03-02|180                  |393                  |38 |
|Vijay     |Doberson |1988-05-02|2016-01-07|2018-02-03|2018-03-02|758                  |27                   |36 |
|Orhan     |Gelicek  |1987-05-11|2016-05-07|2017-01-03|2018-09-02|241                  |607                  |37 |
|Sarah     |Jones    |1956-07-06|2016-04-07|2017-08-03|2018-10-02|483           

## 9. Can you extract the month from the first visit column and call it "Month"?

In [None]:
from pyspark.sql import functions as F

# Extract the month from visit1 and create a new column "Month"
df = df.withColumn("Month", F.month("visit1"))

# Show the updated DataFrame
df.show(truncate=False)

# Check the schema to verify the column types
df.printSchema()


+----------+---------+----------+----------+----------+----------+---------------------+---------------------+---+-----+
|first_name|last_name|dob       |visit1    |visit2    |visit3    |time_visit1_to_visit2|time_visit2_to_visit3|age|Month|
+----------+---------+----------+----------+----------+----------+---------------------+---------------------+---+-----+
|Mohammed  |Alfasy   |1987-04-08|2016-01-07|2017-02-03|2018-03-02|393                  |392                  |37 |1    |
|Marcy     |Wellmaker|1986-04-08|2015-01-07|2017-01-03|2018-01-02|727                  |364                  |38 |1    |
|Ginny     |Ginger   |1986-07-10|2014-08-07|2015-02-03|2016-03-02|180                  |393                  |38 |8    |
|Vijay     |Doberson |1988-05-02|2016-01-07|2018-02-03|2018-03-02|758                  |27                   |36 |1    |
|Orhan     |Gelicek  |1987-05-11|2016-05-07|2017-01-03|2018-09-02|241                  |607                  |37 |5    |
|Sarah     |Jones    |1956-07-06

## 10. Challenges with working with date and timestamps

Let's read in the supermarket sales dataframe attached to the lecture now and see some of the issues that can come up when working with date and timestamp values.
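As a preview of the kind of issue to look for, here is a plain-Python sketch using the Date and Time values from the first row of the output shown below. It rests on two assumptions that are worth checking against the data: that the Date column is written month/day/year, and that only the time-of-day part of the inferred Time timestamp is meaningful. The variable names are illustrative.

```python
from datetime import datetime

# Values as they appear in the first row of the dataframe
date_text = "1/5/2019"             # Date column, read as a string
time_text = "2024-11-07 13:08:00"  # Time column, inferred as a timestamp

# Assumption: the Date column is month/day/year
purchase_date = datetime.strptime(date_text, "%m/%d/%Y").date()

# Assumption: only the time of day in the Time column is meaningful
purchase_time = datetime.strptime(time_text, "%Y-%m-%d %H:%M:%S").time()

# Combine the two into a single purchase timestamp
purchased_at = datetime.combine(purchase_date, purchase_time)
print(purchased_at)  # 2019-01-05 13:08:00
```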

In [None]:
file_path = './pyspark2/supermarket_sales.csv'
market = spark.read.csv(file_path, inferSchema=True, header=True)
market.show(5)

+-----------+------+---------+-------------+------+--------------------+----------+--------+-------+--------+---------+-------------------+-----------+------+-----------------------+------------+------+
| Invoice ID|Branch|     City|Customer type|Gender|        Product line|Unit price|Quantity| Tax 5%|   Total|     Date|               Time|    Payment|  cogs|gross margin percentage|gross income|Rating|
+-----------+------+---------+-------------+------+--------------------+----------+--------+-------+--------+---------+-------------------+-----------+------+-----------------------+------------+------+
|750-67-8428|     A|   Yangon|       Member|Female|   Health and beauty|     74.69|       7|26.1415|548.9715| 1/5/2019|2024-11-07 13:08:00|    Ewallet|522.83|            4.761904762|     26.1415|   9.1|
|226-31-3081|     C|Naypyitaw|       Normal|Female|Electronic access...|     15.28|       5|   3.82|   80.22| 3/8/2019|2024-11-07 10:29:00|       Cash|  76.4|            4.761904762|      

## About this dataset

Supermarkets are growing in most populated cities, and market competition is high. This dataset contains the historical sales of a supermarket company, recorded at 3 different branches over 3 months.

**Attribute information**

 - Invoice id: Computer generated sales slip invoice identification number
 - Branch: Branch of supercenter (3 branches are available identified by A, B and C).
 - City: Location of supercenters
 - Customer type: Type of customer, recorded as Member for customers using a member card and Normal for those without one.
 - Gender: Gender type of customer
 - Product line: General item categorization groups - Electronic accessories, Fashion accessories, Food and beverages, Health and beauty, Home and lifestyle, Sports and travel
 - Unit price: Price of each product in USD
 - Quantity: Number of products purchased by customer
 - Tax: 5% tax fee charged on the customer's purchase
 - Total: Total price including tax
 - Date: Date of purchase (Record available from January 2019 to March 2019)
 - Time: Purchase time (10am to 9pm)
 - Payment: Payment used by customer for purchase (3 methods are available: Cash, Credit card and Ewallet)
 - COGS: Cost of goods sold
 - Gross margin percentage: Gross margin percentage
 - Gross income: Gross income
 - Rating: Customer stratification rating on their overall shopping experience (On a scale of 1 to 10)

**Source:** https://www.kaggle.com/aungpyaeap/supermarket-sales

### View dataframe and schema as usual

In [None]:
# printSchema() prints the schema itself and returns None, so no print() wrapper is needed
market.printSchema()

root
 |-- Invoice ID: string (nullable = true)
 |-- Branch: string (nullable = true)
 |-- City: string (nullable = true)
 |-- Customer type: string (nullable = true)
 |-- Gender: string (nullable = true)
 |-- Product line: string (nullable = true)
 |-- Unit price: double (nullable = true)
 |-- Quantity: integer (nullable = true)
 |-- Tax 5%: double (nullable = true)
 |-- Total: double (nullable = true)
 |-- Date: string (nullable = true)
 |-- Time: timestamp (nullable = true)
 |-- Payment: string (nullable = true)
 |-- cogs: double (nullable = true)
 |-- gross margin percentage: double (nullable = true)
 |-- gross income: double (nullable = true)
 |-- Rating: double (nullable = true)


### Convert date field to date type

The schema shows that the Date field was read in as a string, so we need to convert it to a date type. Let's go ahead and do that.

In [None]:
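from pyspark.sql import functions as F

# Note: the raw Date strings look like "1/5/2019" (month/day/year with slashes),
# which does not match the "yyyy-M-d" pattern used here, so to_date returns NULL.
# A pattern such as "M/d/yyyy" would match these strings.
market = market.withColumn("Date", F.to_date("Date", "yyyy-M-d"))
market.show(truncate=False)

# Check the schema to verify the column types
market.printSchema()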

+-----------+------+---------+-------------+------+----------------------+----------+--------+-------+--------+----+-------------------+-----------+------+-----------------------+------------+------+
|Invoice ID |Branch|City     |Customer type|Gender|Product line          |Unit price|Quantity|Tax 5% |Total   |Date|Time               |Payment    |cogs  |gross margin percentage|gross income|Rating|
+-----------+------+---------+-------------+------+----------------------+----------+--------+-------+--------+----+-------------------+-----------+------+-----------------------+------------+------+
|750-67-8428|A     |Yangon   |Member       |Female|Health and beauty     |74.69     |7       |26.1415|548.9715|NULL|2024-11-07 13:08:00|Ewallet    |522.83|4.761904762            |26.1415     |9.1   |
|226-31-3081|C     |Naypyitaw|Normal       |Female|Electronic accessories|15.28     |5       |3.82   |80.22   |NULL|2024-11-07 10:29:00|Cash       |76.4  |4.761904762            |3.82        |9.6   |
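
The NULL values in the Date column above come from a pattern mismatch rather than from bad data. Here is a minimal Spark-free sketch of the same effect using Python's `datetime.strptime`, assuming the two sample values shown earlier (`1/5/2019` and `3/8/2019`). Note that `strptime` uses `%`-style directives instead of Spark's pattern letters, and `try_parse` is a hypothetical helper that returns `None` where Spark would return NULL.

```python
from datetime import datetime

# Sample values taken from the Date column shown earlier
raw_dates = ["1/5/2019", "3/8/2019"]

def try_parse(text, pattern):
    """Return a date if text matches pattern, else None (similar to Spark's NULL)."""
    try:
        return datetime.strptime(text, pattern).date()
    except ValueError:
        return None

# Year-first, dash-separated pattern: does not match strings like "1/5/2019"
mismatched = [try_parse(d, "%Y-%m-%d") for d in raw_dates]

# Month/day/year with slashes: matches the raw strings
matched = [try_parse(d, "%m/%d/%Y") for d in raw_dates]

print(mismatched)                        # [None, None]
print([d.isoformat() for d in matched])  # ['2019-01-05', '2019-03-08']
```

Reading the values as month/day/year is consistent with the dataset description, which says the records run from January to March 2019.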


### How can we extract the month value from the date field?

If you had trouble converting the date field in the previous question, think about a more creative solution to extract the month from that field.
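
One possible "creative" route is to treat the date as a plain string and take the part before the first slash. The sketch below shows the idea in plain Python, assuming the month always comes first in the raw values (as in `1/5/2019`); `month_from_string` is a hypothetical helper. In PySpark the same idea could probably be expressed with `F.split("Date", "/").getItem(0).cast("int")` on the original string column.

```python
# Sample values taken from the Date column shown earlier
raw_dates = ["1/5/2019", "3/8/2019"]

def month_from_string(date_text):
    """Extract the month from a 'month/day/year' string without parsing a date."""
    month_part = date_text.split("/")[0]  # text before the first slash
    return int(month_part)

months = [month_from_string(d) for d in raw_dates]
print(months)  # [1, 3]
```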

## 11.0 Working with Arrays

Here is a dataframe of reviews of the movie The Dark Knight.

In [None]:
from pyspark.sql import functions as F

# Each tuple is (rating, review text); no backslash continuations are needed inside brackets
values = [(5, 'Epic. This is the best movie I have EVER seen'),
          (4, 'Pretty good, but I would have liked to seen better special effects'),
          (3, 'So so. Casting could have been improved'),
          (5, 'The most EPIC movie of the year! Casting was awesome. Special effects were so intense.'),
          (4, 'Solid but I would have liked to see more of the love story'),
          (5, 'THE BOMB!!!!!!!')]
reviews = spark.createDataFrame(values, ['rating', 'review_txt'])

reviews.show(6, truncate=False)

+------+--------------------------------------------------------------------------------------+
|rating|review_txt                                                                            |
+------+--------------------------------------------------------------------------------------+
|5     |Epic. This is the best movie I have EVER seen                                         |
|4     |Pretty good, but I would have liked to seen better special effects                    |
|3     |So so. Casting could have been improved                                               |
|5     |The most EPIC movie of the year! Casting was awesome. Special effects were so intense.|
|4     |Solid but I would have liked to see more of the love story                            |
|5     |THE BOMB!!!!!!!                                                                       |
+------+--------------------------------------------------------------------------------------+



## 11.1 Let's see if we can create an array from the review text column and then derive some meaningful results from it.

**But first** we need to clean the review_txt column to make sure we can get what we need from our analysis later on. So let's do the following:

1. Remove all punctuation
2. Lowercase everything
3. Remove leading and trailing white space (trim)
4. Then finally, split the string
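
Before running these steps in Spark, it can help to check what the result should look like. Here is a minimal plain-Python sketch of the four steps, using the same `[^\w\s]` regular expression as the punctuation step; `clean_and_split` is a hypothetical helper name, not something from the lecture.

```python
import re

def clean_and_split(text):
    """Apply the four cleaning steps to one review string."""
    no_punct = re.sub(r"[^\w\s]", "", text)  # 1. remove all punctuation
    lowered = no_punct.lower()               # 2. lowercase everything
    trimmed = lowered.strip()                # 3. trim leading/trailing white space
    return trimmed.split(" ")                # 4. split on single spaces

tokens = clean_and_split("So so. Casting could have been improved")
print(tokens)
# ['so', 'so', 'casting', 'could', 'have', 'been', 'improved']
```

Lowercasing and trimming can be applied in either order, since neither affects the other's result.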

In [None]:
from pyspark.sql import functions as F

# Remove punctuation: drop every character that is neither a word character nor whitespace
df = reviews.withColumn("review_txt", F.regexp_replace("review_txt", r"[^\w\s]", ""))

# Show the DataFrame to check the results
df.show(truncate=False)


+------+-----------------------------------------------------------------------------------+
|rating|review_txt                                                                         |
+------+-----------------------------------------------------------------------------------+
|5     |Epic This is the best movie I have EVER seen                                       |
|4     |Pretty good but I would have liked to seen better special effects                  |
|3     |So so Casting could have been improved                                             |
|5     |The most EPIC movie of the year Casting was awesome Special effects were so intense|
|4     |Solid but I would have liked to see more of the love story                         |
|5     |THE BOMB                                                                           |
+------+-----------------------------------------------------------------------------------+



In [None]:
from pyspark.sql import functions as F

# Lowercase the review text so that later matching is case-insensitive
df = df.withColumn("review_txt", F.lower("review_txt"))

# Show the DataFrame to check the results
df.show(truncate=False)


+------+-----------------------------------------------------------------------------------+
|rating|review_txt                                                                         |
+------+-----------------------------------------------------------------------------------+
|5     |epic this is the best movie i have ever seen                                       |
|4     |pretty good but i would have liked to seen better special effects                  |
|3     |so so casting could have been improved                                             |
|5     |the most epic movie of the year casting was awesome special effects were so intense|
|4     |solid but i would have liked to see more of the love story                         |
|5     |the bomb                                                                           |
+------+-----------------------------------------------------------------------------------+



In [None]:
from pyspark.sql import functions as F

# Trim leading and trailing white space from the review text
df = df.withColumn("review_txt", F.trim("review_txt"))

# Show the DataFrame to check the results
df.show(truncate=False)


+------+-----------------------------------------------------------------------------------+
|rating|review_txt                                                                         |
+------+-----------------------------------------------------------------------------------+
|5     |epic this is the best movie i have ever seen                                       |
|4     |pretty good but i would have liked to seen better special effects                  |
|3     |so so casting could have been improved                                             |
|5     |the most epic movie of the year casting was awesome special effects were so intense|
|4     |solid but i would have liked to see more of the love story                         |
|5     |the bomb                                                                           |
+------+-----------------------------------------------------------------------------------+



In [None]:
from pyspark.sql import functions as F

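# Split the cleaned text on single spaces into an array of words.
# Note: the new column name "review_txt " ends with a space, so the array is
# added as a second column and the original string column is kept.
df = df.withColumn("review_txt ", F.split("review_txt", " "))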

# Show the DataFrame to check the results
df.show(truncate=False)


+------+-----------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------+
|rating|review_txt                                                                         |review_txt                                                                                         |
+------+-----------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------+
|5     |epic this is the best movie i have ever seen                                       |[epic, this, is, the, best, movie, i, have, ever, seen]                                            |
|4     |pretty good but i would have liked to seen better special effects                  |[pretty, good, but, i, would, have, liked, to, seen, better, special, effects]                     |
|3     |so so casting could have be

## 11.2 Alright, now let's see if we can find which reviews contain the word 'Epic'

In [None]:
from pyspark.sql import functions as F

# Keep only the rows whose cleaned review string contains the substring "epic"
df_with_epic = df.filter(F.lower(F.col("review_txt")).contains("epic"))

# Show the DataFrame with reviews that contain the word 'epic'
df_with_epic.show(truncate=False)


+------+-----------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------+
|rating|review_txt                                                                         |review_txt                                                                                         |
+------+-----------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------+
|5     |epic this is the best movie i have ever seen                                       |[epic, this, is, the, best, movie, i, have, ever, seen]                                            |
|5     |the most epic movie of the year casting was awesome special effects were so intense|[the, most, epic, movie, of, the, year, casting, was, awesome, special, effects, were, so, intense]|
+------+---------------------------
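
The filter above does a substring match on the cleaned string column. Since section 11.1 also built an array of words, an exact word match is another option; in PySpark that would likely be `F.array_contains` applied to the array column. The plain-Python sketch below shows how the two checks can differ. The first two reviews come from the cleaned data, while the "epicenter" sentence is a made-up example added only to show the difference, and both helper functions are hypothetical.

```python
def contains_substring(text, word):
    """True if word appears anywhere in text, even inside a longer word."""
    return word in text

def contains_token(text, word):
    """True only if word is one of the space-separated tokens in text."""
    return word in text.split(" ")

cleaned_reviews = [
    "epic this is the best movie i have ever seen",  # from the cleaned data
    "the bomb",                                       # from the cleaned data
    "the epicenter of the hype",                      # hypothetical example
]

substring_hits = [r for r in cleaned_reviews if contains_substring(r, "epic")]
token_hits = [r for r in cleaned_reviews if contains_token(r, "epic")]

print(len(substring_hits))  # 2 (also matches "epicenter")
print(len(token_hits))      # 1 (exact word match only)
```

For this particular set of reviews both approaches return the same two rows, so the choice only matters once longer words containing "epic" show up.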

### That's it! Great Job!