# The datetime form required to transform into sql.types.DateType

Data in the form "Thu Oct 21 07:02:44 +0000 2021" comes through the API.<br>
In this case, you cannot transform the data to sql.types.DateType right away.<br> So, we have to change the form of the data after fetching it.<br> To do that, we have to understand what the right form for sql.types.DateType is.

## Create SparkSession

In [14]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession \
    .builder \
    .appName("Usage of to_date") \
    .getOrCreate()

# About to_date
## to_date(col[, format])    [<a href="https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql.html#data-types">link</a>]
Converts a Column into pyspark.sql.types.DateType using the optionally specified format.<br>
    
## pyspark.sql.types.DateType <a href="https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.types.DateType.html#pyspark.sql.types.DateType">[link]</a>
pyspark.sql.types.DateType corresponds to "datetime.date" in Python, and its type name is "date".

## Load JSON data

In [40]:
df = spark.read.json("./data/sample_data")
print(df)
print(df.take(1))

DataFrame[created_at: string]
[Row(created_at='Thu Oct 21 07:02:44 +0000 2021')]


## Convert datatype from string to date

Here the data is converted from string to date.<br>
However, the to_date(created_at) column is empty (None).<br>

This is probably because the string is not in a form that to_date accepts. When no format is given, to_date follows the casting rules to DateType.

In [45]:
from pyspark.sql.functions import to_date

dateDF = df.select(col("created_at"), to_date(col("created_at")))
print(dateDF)
print(dateDF.take(1))

DataFrame[created_at: string, to_date(created_at): date]
[Row(created_at='Thu Oct 21 07:02:44 +0000 2021', to_date(created_at)=None)]


## Let's find the right form
As noted above, pyspark.sql.types.DateType corresponds to datetime.date in Python.<br>
So, what we have to look up is datetime.date.

###  datetime.date<a href="https://docs.python.org/3/library/datetime.html#datetime.date"> [link]</a> 
class datetime.date(year, month, day)<br>
All arguments are required. Arguments must be integers, in the following ranges:<br>
<br>
MINYEAR <= year <= MAXYEAR<br>
1 <= month <= 12<br>
1 <= day <= number of days in the given month and year<br>
### Conclusion
(1) To transform data from String to Date, the string should contain only the year, month, and day.<br>
(2) The year, month, and day should be numbers, not abbreviations such as "Oct".<br>
(3) date.fromisoformat accepts a date string in the form "year-month-day".
 - I am not sure whether this is the only form that can be converted to the Date type.
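
Point (3) can be checked in plain Python without Spark. The following is a minimal sketch; the variable names are only for illustration. It shows that `datetime.date.fromisoformat` accepts the "year-month-day" form and rejects the raw API form.

```python
import datetime

raw = "Thu Oct 21 07:02:44 +0000 2021"  # form returned by the API
iso = "2021-10-21"                       # "year-month-day" form

# The ISO form parses directly into a datetime.date.
parsed = datetime.date.fromisoformat(iso)
print(parsed)

# The raw API form is rejected with a ValueError.
try:
    datetime.date.fromisoformat(raw)
    raw_is_accepted = True
except ValueError:
    raw_is_accepted = False
print(raw_is_accepted)
```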

# Transform the data form and use to_date() again!

In [57]:
import datetime
x = 'Thu Oct 21 07:02:44 +0000 2021' 

In [61]:
datetime.datetime.strptime(x, "%a %b %d %H:%M:%S %z %Y") 

datetime.datetime(2021, 10, 21, 7, 2, 44, tzinfo=datetime.timezone.utc)
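
The parsed value carries its UTC offset, so the calendar date depends on which time zone you read it in. The sample data always uses +0000, but if another offset ever appeared, the date could differ from the UTC date. A minimal sketch, assuming you want the UTC date; `utc_date` is a hypothetical helper and the second input string is made up for illustration.

```python
import datetime

FMT = "%a %b %d %H:%M:%S %z %Y"

def utc_date(x):
    """Parse the API form and return the calendar date in UTC."""
    dt = datetime.datetime.strptime(x, FMT)
    # Normalize to UTC before dropping the time part.
    return dt.astimezone(datetime.timezone.utc).date()

# Offset +0000: the date is unchanged.
same_day = utc_date("Thu Oct 21 07:02:44 +0000 2021")

# 23:30 at UTC-05:00 is 04:30 on the next day in UTC.
next_day = utc_date("Thu Oct 21 23:30:00 -0500 2021")

print(same_day, next_day)
```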

In [89]:
import datetime
from pyspark.sql.functions import udf

def from_created_at(x):
    """
    parsing format : "https://docs.python.org/3/library/datetime.html#datetime.date"
    
    The value of 'x' has the form 'Thu Oct 21 07:02:44 +0000 2021'
    """
    dt = datetime.datetime.strptime(x, "%a %b %d %H:%M:%S %z %Y")
    return dt.date().isoformat()

from_created_at_udf = udf(lambda x: from_created_at(x))
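
The function above raises an exception if a row is null or does not match the expected form, which would make the Spark job fail. A minimal sketch of a more forgiving variant follows; `from_created_at_safe` is a hypothetical name, and returning None for bad rows is an assumption about what you want. It could be wrapped with udf in the same way as above.

```python
import datetime

def from_created_at_safe(x):
    """Return 'YYYY-MM-DD' for the API form, or None if x cannot be parsed."""
    if x is None:
        return None
    try:
        dt = datetime.datetime.strptime(x, "%a %b %d %H:%M:%S %z %Y")
    except ValueError:
        # Unexpected form: return None instead of raising.
        return None
    return dt.date().isoformat()

print(from_created_at_safe("Thu Oct 21 07:02:44 +0000 2021"))
print(from_created_at_safe(None))
print(from_created_at_safe("not a date"))
```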

In [93]:
from pyspark.sql.functions import to_date

dateDF = df.select(col("created_at"), to_date(from_created_at_udf(col("created_at"))))
print(dateDF)
dateDF.show()

DataFrame[created_at: string, to_date(<lambda>(created_at)): date]
+--------------------+-----------------------------+
|          created_at|to_date(<lambda>(created_at))|
+--------------------+-----------------------------+
|Thu Oct 21 07:02:...|                   2021-10-21|
+--------------------+-----------------------------+



# And. . .

Of course, there is also TimestampType, which corresponds to datetime.datetime <a href="https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.types.TimestampType.html#pyspark.sql.types.TimestampType"> [link] </a>, and its usage would be similar.
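
As a rough sketch of the timestamp case, the same parsing step can produce a "year-month-day hour:minute:second" string instead of a date string. `to_timestamp_str` is a hypothetical helper, and normalizing to UTC is an assumption. Such a string could then be passed to to_timestamp in the way the date string was passed to to_date; how Spark interprets a string without a time zone depends on the session time zone setting, so this is not a complete solution.

```python
import datetime

def to_timestamp_str(x):
    """Convert the API form to 'YYYY-MM-DD HH:MM:SS' in UTC."""
    dt = datetime.datetime.strptime(x, "%a %b %d %H:%M:%S %z %Y")
    # Normalize to UTC, then format without the offset.
    return dt.astimezone(datetime.timezone.utc).strftime("%Y-%m-%d %H:%M:%S")

ts = to_timestamp_str("Thu Oct 21 07:02:44 +0000 2021")
print(ts)
```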
