# <p style="background-color:#9d4f8c; font-family:newtimeroman; color:#FFF9ED; font-size:175%; text-align:center; border-radius:10px 10px;">Pandas (6)</p>

<div class="alert alert-block alert-info alert">  
    
# <span style=" color:red">  Text Methods & Time Methods

## Table of Contents
**1. Text Methods for String Data**
* Split, grab and expand 
* Clean and edit strings
* apply() function
  
**2. Time Methods for Date and Time**
* Converting to datetime: pd.to_datetime()
* European vs American date format
* Custom Time String Formatting
* parse_dates
* Resample: Date Column as an Index
* .dt method

## 1. Text Methods for String Data

* Often text data needs to be cleaned or manipulated for processing.
* While we can always use a custom apply() function for these tasks, Pandas provides many built-in string methods.
* For more details, see https://pandas.pydata.org/docs/user_guide/text.html

In [1]:
import numpy as np
import pandas as pd

In [2]:
email = "karlos@email.com"

In [3]:
# after "email." use tab to see string methods (capitalize, find, lower, upper, split, strip, etc.)

email.split("@")   # split at "@"

['karlos', 'email.com']

In [4]:
# Let's create pandas series
names = pd.Series(["andrew", "bobo", "claire", "david", "5"])   # here, 5 is a string value

In [5]:
names
# See the elements and the data type of this series

0    andrew
1      bobo
2    claire
3     david
4         5
dtype: object

In [6]:
# We can make all values upper case using the "upper" method
names.str.upper() 
# This returns a new Series and leaves "names" unchanged. To keep the result, assign it to a variable.

0    ANDREW
1      BOBO
2    CLAIRE
3     DAVID
4         5
dtype: object

In [7]:
# Use the isdigit method
email.isdigit()  
# False, because the string contains characters that are not digits

False

In [8]:
"5".isdigit()  

True

In [9]:
names.str.isdigit()
# Only the last element, "5", consists entirely of digits.

0    False
1    False
2    False
3    False
4     True
dtype: bool
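
The boolean Series returned by `str.isdigit()` can double as a filter. The following is a minimal sketch of that idea; the variable names `digit_mask`, `only_digits` and `only_words` are chosen here for illustration and are not part of the original notebook.

```python
import pandas as pd

names = pd.Series(["andrew", "bobo", "claire", "david", "5"])

# str.isdigit() returns a boolean Series that can be used as a mask
digit_mask = names.str.isdigit()

only_digits = names[digit_mask]    # rows where the mask is True
only_words = names[~digit_mask]    # "~" inverts the mask

print(only_digits.tolist())
print(only_words.tolist())
```
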

### Split, grab and expand

In [10]:
tech_finance = ["GOOG,APPL,AMZN", "JPM,BAC,GS" ]

In [11]:
# There are two items in this list
len(tech_finance)

2

In [12]:
# Convert it to a pandas series
tickers = pd.Series(tech_finance)
tickers

0    GOOG,APPL,AMZN
1        JPM,BAC,GS
dtype: object

In [13]:
# Let's split them on commas (,)
tickers.str.split(',')

0    [GOOG, APPL, AMZN]
1        [JPM, BAC, GS]
dtype: object

In [14]:
# When we split it, we can call one item from this list
tech = "GOOG,APPL,AMZN"
tech.split(',')[0]

'GOOG'

In [15]:
tickers.str.split(',').str[0]

0    GOOG
1     JPM
dtype: object

In [16]:
# Find "JPM"
tickers.str.split(',').str[0][1]

'JPM'

#### Split on commas and expand the pieces into columns (use the "expand=True" argument)

In [17]:
# expand=True returns a DataFrame with one column per piece
tickers.str.split(',', expand=True)

      0     1     2
0  GOOG  APPL  AMZN
1   JPM   BAC    GS
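
As a small extension of the expand example, the sketch below names the resulting columns and limits the number of splits with the `n` argument of `str.split`. The column names `first`, `second` and `third` are hypothetical labels picked for this illustration.

```python
import pandas as pd

tickers = pd.Series(["GOOG,APPL,AMZN", "JPM,BAC,GS"])

# expand=True returns a DataFrame with one column per piece
split_df = tickers.str.split(",", expand=True)
split_df.columns = ["first", "second", "third"]   # hypothetical column names

# n=1 allows only one split, so everything after the first comma stays together
head_tail = tickers.str.split(",", n=1, expand=True)

print(split_df)
print(head_tail)
```
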


### Clean and edit strings

In [18]:
# Let's correct this series
messy_names = pd.Series(["andrew   ", "bo;bo", "   claire   "])
messy_names

0       andrew   
1           bo;bo
2       claire   
dtype: object

In [19]:
# Although it is hard to see in the output, the first and last items contain extra spaces
# We can check by calling these items
messy_names[0]  # call the first item, "andrew"

'andrew   '

In [20]:
# call the last item, "claire"
messy_names[2] 

'   claire   '

In [21]:
# Let's fix the mistakes, first "bo;bo"
messy_names.str.replace(";","") # replace ";" with nothing, that is, remove it

0       andrew   
1            bobo
2       claire   
dtype: object

In [22]:
# After replacing it we can also strip white spaces with this code
cleaned_names = messy_names.str.replace(";","").str.strip()

In [23]:
# Let's check the items to see whether the white spaces are removed
cleaned_names[0]

'andrew'

In [24]:
cleaned_names[2]

'claire'

In [25]:
# We can also chain the capitalize method, so each item is capitalized after it is cleaned
messy_names.str.replace(";","").str.strip().str.capitalize()
# This is not a permanent change unless we assign it to a variable

0    Andrew
1      Bobo
2    Claire
dtype: object

### An alternative: apply() function

In [26]:
# We could also write this cleaning code as a function, so that we can apply it whenever we need it

def cleanup(name):
    name = name.replace(";", "") # remove semicolon
    name = name.strip()          # remove spaces
    name = name.capitalize()     # capitalize
    return name

In [27]:
# Apply the function
messy_names.apply(cleanup)

0    Andrew
1      Bobo
2    Claire
dtype: object
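
One practical difference between the two approaches shows up with missing values. The sketch below assumes a variant of the messy data with one `None` entry (an assumption added here, not in the original data): the `.str` methods pass the missing value through, while a plain-Python function such as `cleanup` needs an explicit guard.

```python
import pandas as pd

def cleanup(name):
    name = name.replace(";", "")  # remove semicolon
    name = name.strip()           # remove spaces
    name = name.capitalize()      # capitalize
    return name

# Same messy names as above, plus one missing value (assumed for this sketch)
messy_with_gap = pd.Series(["andrew   ", "bo;bo", None, "   claire   "])

# The .str methods propagate missing values instead of failing
via_str = (
    messy_with_gap.str.replace(";", "", regex=False)
    .str.strip()
    .str.capitalize()
)

# cleanup(None) would raise AttributeError, so only call it on real strings
via_apply = messy_with_gap.apply(lambda x: cleanup(x) if isinstance(x, str) else x)

print(via_str)
print(via_apply)
```
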

## 2. Time Methods for Date and Time

* Python's standard library provides a **datetime object** containing date and time information.
* Pandas allows us to easily extract information from a datetime object for use in feature engineering.
* For example, we may have recent timestamped sales data.
* Pandas will allow us to extract information from the timestamp, such as **day of the week**, **weekend vs weekday**, **AM vs PM**
* For details, see https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#converting-to-timestamps
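
To make the feature-engineering idea concrete, here is a minimal sketch that derives the three features mentioned above from a timestamp column. The `orders` DataFrame, its column names and its values are made up for illustration only.

```python
import pandas as pd

# Hypothetical timestamped sales records (made-up values)
orders = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2024-01-05 09:15:00",   # a Friday morning
        "2024-01-06 14:30:00",   # a Saturday afternoon
        "2024-01-08 20:45:00",   # a Monday evening
    ]),
    "amount": [20.0, 35.5, 12.0],
})

# Day of the week as a name
orders["day_of_week"] = orders["timestamp"].dt.day_name()

# Weekend vs weekday: dayofweek counts Monday as 0 and Sunday as 6
orders["is_weekend"] = orders["timestamp"].dt.dayofweek >= 5

# AM vs PM: hours 12 to 23 are PM
orders["is_pm"] = orders["timestamp"].dt.hour >= 12

print(orders)
```
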

In [28]:
import numpy as np
import pandas as pd
from datetime import datetime

In [29]:
# The components of a datetime object, in the order the constructor expects them

myyear = 2024
mymonth = 1
myday = 2
myhour = 5
mymin = 30
mysec = 15

In [30]:
# To see how datetime works, press "shift + tab" inside the datetime parentheses ()
# datetime(year, month, day[, hour[, minute[, second[, microsecond[,tzinfo]]]]])

mydate = datetime(myyear, mymonth, myday)
mydate      # it is now a datetime object.

datetime.datetime(2024, 1, 2, 0, 0)

In [31]:
# Let's add hour, minute and second data 
mydatetime = datetime(myyear, mymonth, myday, myhour, mymin, mysec)
mydatetime

datetime.datetime(2024, 1, 2, 5, 30, 15)

In [32]:
# To see the attributes, press "tab" after the dot in "mydatetime."
# For example, use the "year" attribute to grab the year in this data
mydatetime.year

2024

In [33]:
# Grab the day
mydatetime.day

2

### Converting to datetime

If you use dates which start with the day first (i.e. European style), you can pass the dayfirst flag. Syntax: **pd.to_datetime(["04-01-2012 10:00"], dayfirst=True)**

In [34]:
# Create a Pandas Series

myser = pd.Series(["Jul 31, 2009", "Jan 10, 2010", None])
myser

0    Jul 31, 2009
1    Jan 10, 2010
2            None
dtype: object

In [35]:
# It is not a datetime object yet
# For example, grab the first item and check its year
myser[0].year

AttributeError: 'str' object has no attribute 'year'

#### pd.to_datetime()

In [36]:
# Convert it to datetime... (see how it works using "shift+tab")
pd.to_datetime(myser)


0   2009-07-31
1   2010-01-10
2          NaT
dtype: datetime64[ns]
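
Missing values become `NaT`, as the output above shows. Text that is not a date at all raises an error by default. A minimal sketch of one way to handle that, assuming a hypothetical series `raw` with one bad entry, is to pass `errors="coerce"`:

```python
import pandas as pd

# Hypothetical input with one entry that is not a date
raw = pd.Series(["Jul 31, 2009", "Jan 10, 2010", "not a date"])

# errors="coerce" turns unparseable entries into NaT instead of raising;
# the explicit format applies the same layout to every element
parsed = pd.to_datetime(raw, format="%b %d, %Y", errors="coerce")

print(parsed)
```

Whether silently coercing bad values is appropriate depends on the data; counting the `NaT` entries afterwards is a simple check.
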

In [37]:
# Store the converted series so that we can grab items from it
timeser = pd.to_datetime(myser)

In [38]:
# Grab year from the first item
timeser[0].year

2009

In [39]:
# Grab the day from the second item
timeser[1].day

10

### European vs American date format

In [40]:
# December 31, 2000
obvi_euro_date = "31-12-2000"

In [41]:
pd.to_datetime(obvi_euro_date)
# The output displays the date from the largest unit (year) to the smallest

Timestamp('2000-12-31 00:00:00')

In [42]:
# However, this date might be confusing
# Which one comes first, Day (European) or Month (American)?
euro_date = "10-12-2000"

In [43]:
# How does Pandas read it?
pd.to_datetime(euro_date)
# Pandas reads it month first by default: October 12, 2000.

Timestamp('2000-10-12 00:00:00')

In [44]:
# To have it parsed day first (European style), use "dayfirst=True"
pd.to_datetime(euro_date, dayfirst=True)

Timestamp('2000-12-10 00:00:00')
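
For a whole column of day-first dates, an explicit `format` is an alternative to `dayfirst=True`. The sketch below compares the two on the two example strings used above; the variable names are chosen for illustration.

```python
import pandas as pd

dates = pd.Series(["10-12-2000", "31-12-2000"])

# dayfirst=True is a parsing hint for ambiguous strings
with_hint = pd.to_datetime(dates, dayfirst=True)

# format states the layout exactly: day, month, four-digit year
with_format = pd.to_datetime(dates, format="%d-%m-%Y")

print(with_hint)
print(with_format)
```

Both calls read the first string as December 10, 2000. The explicit format has the advantage that a string in a different layout raises an error instead of being guessed.
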

### Custom Time String Formatting
* Sometimes dates come in a non-standard format. Luckily, we can tell pandas the exact format to expect.
* See the alternative datetime formats: https://docs.python.org/3/library/datetime.html#strftime-and-strptime-format-codes

In [45]:
style_date = "12--Dec--2024"

In [46]:
pd.to_datetime(style_date, format="%d--%b--%Y") # the format string tells pandas how to read this layout

Timestamp('2024-12-12 00:00:00')
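
The same format codes also work in the other direction, turning a timestamp back into text with `strftime`. This is a minimal sketch; the output layouts are arbitrary choices for illustration.

```python
import pandas as pd

stamp = pd.to_datetime("12--Dec--2024", format="%d--%b--%Y")

# strftime uses the same codes to turn a Timestamp into a string
iso_text = stamp.strftime("%Y-%m-%d")
long_text = stamp.strftime("%d %B %Y")

# For a whole Series, the same method is available through .dt
series_text = pd.Series([stamp]).dt.strftime("%d/%m/%Y")

print(iso_text)
print(long_text)
print(series_text[0])
```
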

In [47]:
# Can Pandas understand this date?
custom_date = "12th of Dec 2020"

In [48]:
pd.to_datetime(custom_date)

Timestamp('2020-12-12 00:00:00')

### Import Data and explore Datetime

In [49]:
sales = pd.read_csv("RetailSales_BeerWineLiquor.csv")

In [50]:
sales

           DATE  MRTSSM4453USN
0    1992-01-01           1509
1    1992-02-01           1541
2    1992-03-01           1597
3    1992-04-01           1675
4    1992-05-01           1822
..          ...            ...
335  2019-12-01           6630
336  2020-01-01           4388
337  2020-02-01           4533
338  2020-03-01           5562


In [51]:
# See the DATE column
sales["DATE"] # its data type is object (string), not datetime

0      1992-01-01
1      1992-02-01
2      1992-03-01
3      1992-04-01
4      1992-05-01
          ...    
335    2019-12-01
336    2020-01-01
337    2020-02-01
338    2020-03-01
339    2020-04-01
Name: DATE, Length: 340, dtype: object

In [52]:
# We need to convert DATE column from string to datetime
sales["DATE"] = pd.to_datetime(sales["DATE"])

In [53]:
# Let's check it again
sales["DATE"] # now, it is datetime

0     1992-01-01
1     1992-02-01
2     1992-03-01
3     1992-04-01
4     1992-05-01
         ...    
335   2019-12-01
336   2020-01-01
337   2020-02-01
338   2020-03-01
339   2020-04-01
Name: DATE, Length: 340, dtype: datetime64[ns]

In [54]:
# Let's grab the year of first date object
sales["DATE"][0].year

1992

#### parse_dates

We can assign a column as a date column when we read a file.

In [55]:
# Pass the positional index of the date column... DATE is the first column, so its index is 0
sales = pd.read_csv("RetailSales_BeerWineLiquor.csv", parse_dates=[0]) 

In [56]:
sales["DATE"] # it is directly converted to datetime

0     1992-01-01
1     1992-02-01
2     1992-03-01
3     1992-04-01
4     1992-05-01
         ...    
335   2019-12-01
336   2020-01-01
337   2020-02-01
338   2020-03-01
339   2020-04-01
Name: DATE, Length: 340, dtype: datetime64[ns]

In [57]:
# Grab an item from DATE column according to index location
sales.iloc[3]['DATE']

Timestamp('1992-04-01 00:00:00')

In [58]:
# We can also check its data type
type(sales.iloc[3]['DATE'])

pandas._libs.tslibs.timestamps.Timestamp
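
`parse_dates` also accepts column names instead of positions. The sketch below reads a few rows copied from the output above out of an in-memory string (so it runs without the CSV file) and uses `index_col` to make the parsed column the index in the same call. The names `csv_text` and `sample` are introduced here for illustration.

```python
import io
import pandas as pd

# A few rows copied from the sales data above, held in memory for this sketch
csv_text = """DATE,MRTSSM4453USN
1992-01-01,1509
1992-02-01,1541
1992-03-01,1597
"""

# parse_dates takes the column name; index_col makes DATE the index at read time
sample = pd.read_csv(io.StringIO(csv_text), parse_dates=["DATE"], index_col="DATE")

print(sample.index)
```
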

### Resample: Date Column as an Index

In [59]:
# Let's check "sales" index information
sales.index

RangeIndex(start=0, stop=340, step=1)

In [60]:
# Set DATE column as an index
sales = sales.set_index("DATE")
sales

            MRTSSM4453USN
DATE
1992-01-01           1509
1992-02-01           1541
1992-03-01           1597
1992-04-01           1675
1992-05-01           1822
...                   ...
2019-12-01           6630
2020-01-01           4388
2020-02-01           4533
2020-03-01           5562


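* To resample, we pass a **rule** (the target frequency) and then call an aggregation function on the result.
* The rule is an offset alias ("A" for year end, "B" for business day, etc.) that defines the time bins (daily, monthly, yearly, etc.) over which the aggregation function is applied.
* pandas 2.2 and later deprecate some of the aliases listed below in favor of new spellings (for example "YE" instead of "A", and "ME" instead of "M"); this notebook uses the older spellings.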

<table style="display: inline-block">
    <caption style="text-align: center"><strong>TIME SERIES OFFSET ALIASES</strong></caption>
<tr><th>ALIAS</th><th>DESCRIPTION</th></tr>
<tr><td>B</td><td>business day frequency</td></tr>
<tr><td>C</td><td>custom business day frequency (experimental)</td></tr>
<tr><td>D</td><td>calendar day frequency</td></tr>
<tr><td>W</td><td>weekly frequency</td></tr>
<tr><td>M</td><td>month end frequency</td></tr>
<tr><td>SM</td><td>semi-month end frequency (15th and end of month)</td></tr>
<tr><td>BM</td><td>business month end frequency</td></tr>
<tr><td>CBM</td><td>custom business month end frequency</td></tr>
<tr><td>MS</td><td>month start frequency</td></tr>
<tr><td>SMS</td><td>semi-month start frequency (1st and 15th)</td></tr>
<tr><td>BMS</td><td>business month start frequency</td></tr>
<tr><td>CBMS</td><td>custom business month start frequency</td></tr>
<tr><td>Q</td><td>quarter end frequency</td></tr>
<tr><td></td><td><font color=white>intentionally left blank</font></td></tr></table>

<table style="display: inline-block; margin-left: 40px">
<caption style="text-align: center"></caption>
<tr><th>ALIAS</th><th>DESCRIPTION</th></tr>
<tr><td>BQ</td><td>business quarter end frequency</td></tr>
<tr><td>QS</td><td>quarter start frequency</td></tr>
<tr><td>BQS</td><td>business quarter start frequency</td></tr>
<tr><td>A</td><td>year end frequency</td></tr>
<tr><td>BA</td><td>business year end frequency</td></tr>
<tr><td>AS</td><td>year start frequency</td></tr>
<tr><td>BAS</td><td>business year start frequency</td></tr>
<tr><td>BH</td><td>business hour frequency</td></tr>
<tr><td>H</td><td>hourly frequency</td></tr>
<tr><td>T, min</td><td>minutely frequency</td></tr>
<tr><td>S</td><td>secondly frequency</td></tr>
<tr><td>L, ms</td><td>milliseconds</td></tr>
<tr><td>U, us</td><td>microseconds</td></tr>
<tr><td>N</td><td>nanoseconds</td></tr></table>

In [61]:
# Resample it according to year (A) and an aggregation (mean)
# So it shows average sales in each year
sales.resample(rule = "A").mean()

            MRTSSM4453USN
DATE
1992-12-31    1807.250000
1993-12-31    1794.833333
1994-12-31    1841.750000
1995-12-31    1833.916667
1996-12-31    1929.750000
1997-12-31    2006.750000
1998-12-31    2115.166667
1999-12-31    2206.333333
2000-12-31    2375.583333
2001-12-31    2468.416667
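
The same pattern works with other rules. The sketch below rebuilds the first five monthly values shown earlier (so it runs without the CSV file) and resamples them with "QS" (quarter start), an alias spelled the same way in older and newer pandas releases. It also shows `groupby` on the index year as an alternative way to get a yearly average that does not depend on alias spelling. The names `monthly`, `quarterly_mean` and `yearly_mean` are chosen for illustration.

```python
import pandas as pd

# January to May 1992, copied from the sales output above
idx = pd.date_range("1992-01-01", periods=5, freq="MS")
monthly = pd.Series([1509, 1541, 1597, 1675, 1822], index=idx, name="MRTSSM4453USN")

# Rule "QS" groups the rows into bins that start at each quarter, then mean() aggregates
quarterly_mean = monthly.resample("QS").mean()

# Alternative: group by the year of the DatetimeIndex
yearly_mean = monthly.groupby(monthly.index.year).mean()

print(quarterly_mean)
print(yearly_mean)
```

Note that the second quarter here averages only April and May, because the sample stops in May.
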


### .dt method
* If a column has a datetime data type, we can use the **.dt** accessor to reach its date and time attributes.
* See https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.dt.html

In [62]:
# See how it is used
# help(sales['DATE'].dt)

In [63]:
# Let's import our data again with parse_dates
sales = pd.read_csv("RetailSales_BeerWineLiquor.csv", parse_dates=[0]) 
sales.head()

        DATE  MRTSSM4453USN
0 1992-01-01           1509
1 1992-02-01           1541
2 1992-03-01           1597
3 1992-04-01           1675
4 1992-05-01           1822


In [64]:
sales.info() # DATE column is in datetime format

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 340 entries, 0 to 339
Data columns (total 2 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   DATE           340 non-null    datetime64[ns]
 1   MRTSSM4453USN  340 non-null    int64         
dtypes: datetime64[ns](1), int64(1)
memory usage: 5.4 KB


In [65]:
# Just as text data has the "str" accessor, datetime data has the "dt" accessor
sales["DATE"].dt.year

0      1992
1      1992
2      1992
3      1992
4      1992
       ... 
335    2019
336    2020
337    2020
338    2020
339    2020
Name: DATE, Length: 340, dtype: int32

In [66]:
# Grab a month value according to index
sales["DATE"].dt.month[320]

9
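
The `.dt` attributes can also be used to filter rows. The DATE column above holds 340 month-start dates beginning at 1992-01-01, so the sketch below rebuilds it with `pd.date_range` rather than reading the CSV file; that reconstruction is an assumption based on the output shown, and the variable names are chosen for illustration.

```python
import pandas as pd

# Rebuild the DATE column: 340 month-start dates from 1992-01-01 (assumed from the output above)
dates = pd.Series(pd.date_range("1992-01-01", periods=340, freq="MS"), name="DATE")

# Comparing a .dt attribute gives a boolean mask for filtering
in_2019 = dates[dates.dt.year == 2019]

# The same lookup as above: the month at index 320
month_at_320 = dates.dt.month[320]

print(len(in_2019))
print(month_at_320)
```
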

In [67]:
sales['DATE'].dt.is_leap_year

0       True
1       True
2       True
3       True
4       True
       ...  
335    False
336     True
337     True
338     True
339     True
Name: DATE, Length: 340, dtype: bool

A **leap year** is a year that has **366 days** (instead of 365), with the 29th of February as the extra (intercalary) day. Leap years are the years divisible by four, except for years that are divisible by 100 but not by 400.
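
The rule can be written as a short function and compared with what `.dt.is_leap_year` reports. This is a minimal sketch; the helper `is_leap` and the sample years are chosen here for illustration.

```python
import pandas as pd

def is_leap(year: int) -> bool:
    """Divisible by 4, except century years that are not divisible by 400."""
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)

# 1900 is divisible by 100 but not by 400; 2000 is divisible by 400
years = [1900, 1992, 2000, 2019, 2020]
dates = pd.Series(pd.to_datetime([f"{y}-01-01" for y in years]))

by_rule = [is_leap(y) for y in years]
by_pandas = dates.dt.is_leap_year.tolist()

print(by_rule)
print(by_pandas)
```
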