## Challenge 4 - Parse Dates

A dataset contains a text field with a date embedded in the text. The problem is that the date is represented in a few different ways. For example:

* 16-APR-2005
* Nov 16, 1900
* 4-SEP-00
* Jan 5 2000

The goal is to create a new Date/Time field populated with the dates extracted from the text field, standardized so that every date uses the same format.
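
To make "standardize" concrete, here is a minimal standard-library sketch that converts the four example strings to a single `YYYY-MM-DD` layout. It is not the approach used in the cells below (which rely on `pd.to_datetime`); the `FORMATS` list and the `standardize` helper are hypothetical names, and the sketch assumes English month abbreviations in the active locale.

```python
from datetime import datetime

# Candidate layouts, tried in order (an assumption based on the four examples above).
FORMATS = ["%d-%b-%Y", "%d-%b-%y", "%b %d, %Y", "%b %d %Y"]

def standardize(text):
    """Return the date in `text` as a YYYY-MM-DD string, or None if no layout fits."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue  # this layout did not fit, try the next one
    return None

for raw in ["16-APR-2005", "Nov 16, 1900", "4-SEP-00", "Jan 5 2000"]:
    print(raw, "->", standardize(raw))
```

Listing the formats explicitly is stricter than a general-purpose parser: anything that does not fit one of the layouts comes back as `None` instead of being guessed at.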

In [170]:
import pandas as pd
import re
from datetime import date
from dateutil.relativedelta import relativedelta
pd.options.display.max_colwidth = 200

In [171]:
df = pd.read_csv(r'.\004_input.csv')
df

Unnamed: 0,Field_1
0,He who sleeps on the floor will not fall off the bed.Robert Gronock16-APR-2005This is a valid record.
1,"After all is said and done, more is said than doneAnonymous09-JAN-1856Oops! The RandomInt field is missing. This record should be rejected into a bad file."
2,"I want to see you shoot the way you shoutTeddy Roosevelt Nov 16, 1900This record is valid but the date is in MMMbDD,bYYYY format."
3,get someone else to do it.15-APR-1944This record is valid but has a carriage return in the middle of the quote.
4,Why do they call it rush hour when nothing moves?Mork27-JUN-70This record is valid but has only a two-year date.
5,I'm taking the Ryanair approach to it: subcontracting everythingMichael O'Leary23-MAY-2011This record is valid and has a single quote in the two text fields.
6,"I Xeroxed a mirror. Now I have an extra Xerox machine.""Steven Wright30-JUN-06This record is valid but has a trailing double quote that should be part fo the text field."
7,"Freidrich Engels01-AUG-08This record is missing the first field, the quote. This record should be rejected in a bad file."
8,"'He's so old his social security number is two digits.'Brian Morgan Jan 5 2000This record is valid but the quote is surrounded by single quotes, the date is an unconventional format and the intege..."
9,"""I was the best man at the wedding.So why is she marrying him?""Jerry Seinfeld09-July-2001This record is valid but has a carriage return in the quote and the month is fully specified in the date."


In [172]:
string = '.Darrin Weinberg 21-MAR-2005'

In [173]:
re.search(r'\d+[-\s]\w+[-\s]\d{2,4}', string).group(0)

'21-MAR-2005'
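
The pattern above reads as: one or more digits (the day), a hyphen or whitespace, a run of word characters (the month), another hyphen or whitespace, then two to four digits (the year). A small sketch of what it does and does not catch, using text fragments shortened from the rows shown earlier; `DAY_FIRST` and `find_day_first` are illustrative names only.

```python
import re

# Same pattern as in the cell above: day, separator, month, separator, year.
DAY_FIRST = r'\d+[-\s]\w+[-\s]\d{2,4}'

def find_day_first(text):
    """Return the first day-month-year substring in `text`, or None."""
    m = re.search(DAY_FIRST, text)
    return m.group(0) if m else None

samples = [
    "Mork27-JUN-70This record is valid",                 # two-digit year
    "Jerry Seinfeld09-July-2001This record is valid",    # full month name
    "Teddy Roosevelt Nov 16, 1900This record is valid",  # month-first: no match
]
for s in samples:
    print(repr(find_day_first(s)))
```

The month-first row is not matched because the day is followed by a comma, which is why a second pattern is needed.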

In [174]:
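def get_dates(string):
    """Return the first date found in `string` as a Timestamp, or None if no pattern matches."""
    
    # Pattern 1: day-month-year, e.g. 16-APR-2005, 4-SEP-00, 09-July-2001
    # Pattern 2: month day[,] four-digit year, e.g. Nov 16, 1900 or Jan 5 2000
    patterns = [r'\d+[-\s]\w+[-\s]\d{2,4}', r'\w*[\s\d,]*\d{4}']
    
    for pat in patterns:
        match = re.search(pat, string)  # search once and reuse the match
        if match:
            return pd.to_datetime(match.group(0))
    return None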
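
The order of the two patterns matters: the day-first pattern is tried first, and the looser month-first pattern only acts as a fallback. The sketch below isolates just the extraction step (without the `pd.to_datetime` conversion) so the fallback is easy to inspect; `extract_date_text` is a hypothetical helper, not part of the notebook.

```python
import re

# Same two patterns as in get_dates, tried in this order.
PATTERNS = [r'\d+[-\s]\w+[-\s]\d{2,4}', r'\w*[\s\d,]*\d{4}']

def extract_date_text(text):
    """Return the substring matched by the first pattern that hits, or None."""
    for pat in PATTERNS:
        m = re.search(pat, text)
        if m:
            return m.group(0)
    return None

print(extract_date_text("Teddy Roosevelt Nov 16, 1900This record is valid"))
print(extract_date_text("Brian Morgan Jan 5 2000This record is valid"))
print(extract_date_text("No date in this text."))
print(extract_date_text("Call room 1234 now"))
```

The last call shows a limitation worth keeping in mind: the fallback pattern accepts any four-digit number preceded by a word, so text such as `room 1234` is extracted even though it is not a date. That does not happen in the rows shown here, but it could on messier input.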

In [175]:
df['Date'] = df['Field_1'].apply(get_dates)
df

Unnamed: 0,Field_1,Date
0,He who sleeps on the floor will not fall off the bed.Robert Gronock16-APR-2005This is a valid record.,2005-04-16
1,"After all is said and done, more is said than doneAnonymous09-JAN-1856Oops! The RandomInt field is missing. This record should be rejected into a bad file.",1856-01-09
2,"I want to see you shoot the way you shoutTeddy Roosevelt Nov 16, 1900This record is valid but the date is in MMMbDD,bYYYY format.",1900-11-16
3,get someone else to do it.15-APR-1944This record is valid but has a carriage return in the middle of the quote.,1944-04-15
4,Why do they call it rush hour when nothing moves?Mork27-JUN-70This record is valid but has only a two-year date.,1970-06-27
5,I'm taking the Ryanair approach to it: subcontracting everythingMichael O'Leary23-MAY-2011This record is valid and has a single quote in the two text fields.,2011-05-23
6,"I Xeroxed a mirror. Now I have an extra Xerox machine.""Steven Wright30-JUN-06This record is valid but has a trailing double quote that should be part fo the text field.",2006-06-30
7,"Freidrich Engels01-AUG-08This record is missing the first field, the quote. This record should be rejected in a bad file.",2008-08-01
8,"'He's so old his social security number is two digits.'Brian Morgan Jan 5 2000This record is valid but the quote is surrounded by single quotes, the date is an unconventional format and the intege...",2000-01-05
9,"""I was the best man at the wedding.So why is she marrying him?""Jerry Seinfeld09-July-2001This record is valid but has a carriage return in the quote and the month is fully specified in the date.",2001-07-09


In [176]:
df['Date'].iloc[11]

Timestamp('2069-09-16 00:00:00')

In [177]:
def correct_year(ts):
    """Shift a single date 100 years back if its year is later than the current year.

    A two-digit year can be parsed into the wrong century (for example 2069 instead of 1969).
    """
    year_today = date.today().year
    if ts.year > year_today:
        return ts - relativedelta(years=100)
    return ts
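
The same century correction can be sketched with the standard library only. The `today` parameter is a hypothetical addition that pins the reference date so the result is reproducible; the function name `correct_century` is also illustrative.

```python
from datetime import date

def correct_century(d, today=None):
    """Move `d` back 100 years if its year lies after the reference year."""
    today = today or date.today()
    if d.year > today.year:
        try:
            return d.replace(year=d.year - 100)
        except ValueError:
            # 29 Feb with no counterpart 100 years earlier: fall back to 28 Feb.
            return d.replace(year=d.year - 100, day=28)
    return d

reference = date(2020, 1, 1)  # assumed reference date, for illustration only
print(correct_century(date(2069, 9, 16), today=reference))
print(correct_century(date(2005, 4, 16), today=reference))
```

Like the notebook's version, this only compares years, so a date later in the current year is left untouched; whether that is the right rule depends on the data.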

In [178]:
df['Date'] = df['Date'].apply(correct_year)
df

Unnamed: 0,Field_1,Date
0,He who sleeps on the floor will not fall off the bed.Robert Gronock16-APR-2005This is a valid record.,2005-04-16
1,"After all is said and done, more is said than doneAnonymous09-JAN-1856Oops! The RandomInt field is missing. This record should be rejected into a bad file.",1856-01-09
2,"I want to see you shoot the way you shoutTeddy Roosevelt Nov 16, 1900This record is valid but the date is in MMMbDD,bYYYY format.",1900-11-16
3,get someone else to do it.15-APR-1944This record is valid but has a carriage return in the middle of the quote.,1944-04-15
4,Why do they call it rush hour when nothing moves?Mork27-JUN-70This record is valid but has only a two-year date.,1970-06-27
5,I'm taking the Ryanair approach to it: subcontracting everythingMichael O'Leary23-MAY-2011This record is valid and has a single quote in the two text fields.,2011-05-23
6,"I Xeroxed a mirror. Now I have an extra Xerox machine.""Steven Wright30-JUN-06This record is valid but has a trailing double quote that should be part fo the text field.",2006-06-30
7,"Freidrich Engels01-AUG-08This record is missing the first field, the quote. This record should be rejected in a bad file.",2008-08-01
8,"'He's so old his social security number is two digits.'Brian Morgan Jan 5 2000This record is valid but the quote is surrounded by single quotes, the date is an unconventional format and the intege...",2000-01-05
9,"""I was the best man at the wedding.So why is she marrying him?""Jerry Seinfeld09-July-2001This record is valid but has a carriage return in the quote and the month is fully specified in the date.",2001-07-09


In [179]:
df['Date'].iloc[11]

Timestamp('1969-09-16 00:00:00')