---

_You are currently looking at **version 1.1** of this notebook. To download notebooks and datafiles, as well as get help on Jupyter notebooks in the Coursera platform, visit the [Jupyter Notebook FAQ](https://www.coursera.org/learn/python-text-mining/resources/d9pwm) course resource._

---

# Assignment 1

In this assignment, you'll be working with messy medical data and using regex to extract relevant information from the data.

Each line of the `dates.txt` file corresponds to a medical note. Each note has a date that needs to be extracted, but each date is encoded in one of many formats.

The goal of this assignment is to correctly identify all of the different date variants encoded in this dataset and to properly normalize and sort the dates. 

Here is a list of some of the variants you might encounter in this dataset:
* 04/20/2009; 04/20/09; 4/20/09; 4/3/09
* Mar-20-2009; Mar 20, 2009; March 20, 2009; Mar. 20, 2009; Mar 20 2009
* 20 Mar 2009; 20 March 2009; 20 Mar. 2009; 20 March, 2009
* Mar 20th, 2009; Mar 21st, 2009; Mar 22nd, 2009
* Feb 2009; Sep 2009; Oct 2010
* 6/2008; 12/2009
* 2009; 2010
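
One way to cover the variants listed above is a small set of regular expressions, tried from most to least specific. The sketch below is only an illustration, not the reference solution: the names `MONTH`, `PATTERNS`, and `first_match` are made up for this example, and real notes will likely need extra handling for typos and misspelled months.

```python
import re

# Hypothetical names; the grouping of variants is an assumption for illustration.
MONTH = r'(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*'

# Ordered from most to least specific (dicts keep insertion order in Python 3.7+).
PATTERNS = {
    # 04/20/2009, 4/3/09
    'numeric': re.compile(r'\b\d{1,2}[/-]\d{1,2}[/-]\d{2,4}\b'),
    # Mar-20-2009, Mar. 20, 2009, Mar 20th, 2009
    'month_first': re.compile(MONTH + r'[-., ]+\d{1,2}(?:st|nd|rd|th)?[-., ]+\d{4}'),
    # 20 Mar 2009, 20 March, 2009
    'day_first': re.compile(r'\b\d{1,2}[-., ]+' + MONTH + r'[-., ]+\d{4}'),
    # Feb 2009
    'month_year': re.compile(MONTH + r'[-., ]+\d{4}'),
    # 6/2008
    'num_month_year': re.compile(r'\b\d{1,2}/\d{4}\b'),
    # 2009
    'year': re.compile(r'\b(?:19|20)\d{2}\b'),
}

def first_match(text):
    """Return (pattern_name, matched_text) for the first pattern that matches, else None."""
    for name, pattern in PATTERNS.items():
        match = pattern.search(text)
        if match:
            return name, match.group(0)
    return None

print(first_match('20 March, 2009'))
```

Trying the patterns in a fixed order keeps a bare year such as `2009` from shadowing a fuller date that appears in the same note.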

Once you have extracted these date patterns from the text, the next step is to sort them in ascending chronological order according to the following rules:
* Assume all dates in xx/xx/xx format are mm/dd/yy
* Assume all dates where year is encoded in only two digits are years from the 1900s (e.g. 1/5/89 is January 5th, 1989)
* If the day is missing (e.g. 9/2009), assume it is the first day of the month (e.g. September 1, 2009).
* If the month is missing (e.g. 2010), assume it is the first of January of that year (e.g. January 1, 2010).
* Watch out for potential typos as this is a raw, real-life derived dataset.
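
The defaulting rules above can be sketched for the purely numeric formats as follows. This is a minimal illustration under the stated assumptions (mm/dd/yy order, two-digit years in the 1900s, missing day or month set to 1); the helper names `normalize` and `parse_numeric` are hypothetical, and no typo handling is attempted.

```python
from datetime import date

def normalize(year, month=None, day=None):
    """Apply the assignment's defaults to string parts and return a datetime.date."""
    y = int(year)
    if len(year) == 2:               # two-digit years are assumed to be 19xx
        y += 1900
    m = int(month) if month else 1   # missing month -> January
    d = int(day) if day else 1       # missing day -> first of the month
    return date(y, m, d)

def parse_numeric(text):
    """Parse 'mm/dd/yy', 'mm/dd/yyyy', 'mm/yyyy' or 'yyyy' (also with '-' separators)."""
    parts = text.replace('-', '/').split('/')
    if len(parts) == 3:
        month, day, year = parts
        return normalize(year, month, day)
    if len(parts) == 2:
        month, year = parts
        return normalize(year, month)
    return normalize(parts[0])

print(parse_numeric('1/5/89'))   # 1989-01-05
print(parse_numeric('9/2009'))   # 2009-09-01
print(parse_numeric('2010'))     # 2010-01-01
```

Returning real `date` objects rather than concatenated strings means an impossible date (for example month 50) raises a `ValueError` instead of silently sorting in the wrong place.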

With these rules in mind, find the correct date in each note and return a pandas Series containing the original Series' indices, ordered chronologically by the extracted dates.

For example if the original series was this:

    0    1999
    1    2010
    2    1978
    3    2015
    4    1985

Your function should return this:

    0    2
    1    4
    2    0
    3    1
    4    3
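
The mapping from the example input to the example output can be reproduced with a short pandas snippet. It is a sketch only: the variable names `original` and `order` are chosen for illustration, and plain years stand in for the normalized dates.

```python
import pandas as pd

# The example input from the text; the values stand in for extracted dates.
original = pd.Series([1999, 2010, 1978, 2015, 1985])

# sort_values keeps each value's original index label, so the index of the
# sorted Series lists the original positions in chronological order.
order = pd.Series(list(original.sort_values().index))

print(order.tolist())  # [2, 4, 0, 1, 3]
```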

Your score will be calculated using [Kendall's tau](https://en.wikipedia.org/wiki/Kendall_rank_correlation_coefficient), a correlation measure for ordinal data.
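
To get a feel for the metric, the sketch below computes Kendall's tau for two rankings without ties (the tau-a form). Which variant the grader uses is not stated here, so treat this as an assumption; the function name `kendall_tau` is made up for the example. SciPy's `scipy.stats.kendalltau` is a ready-made alternative.

```python
from itertools import combinations

def kendall_tau(a, b):
    """Kendall's tau-a for two equal-length rankings without ties."""
    concordant = discordant = 0
    for i, j in combinations(range(len(a)), 2):
        # A pair is concordant when both rankings order items i and j the same way.
        sign = (a[i] - a[j]) * (b[i] - b[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(a) * (len(a) - 1) // 2
    return (concordant - discordant) / n_pairs

print(kendall_tau([1, 2, 3, 4], [1, 2, 3, 4]))  # 1.0
print(kendall_tau([1, 2, 3, 4], [4, 3, 2, 1]))  # -1.0
```

A score of 1 means every pair of notes is ordered the same way in both rankings, so a handful of misparsed dates lowers the score only slightly.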

*This function should return a Series of length 500 and dtype int.*

In [1]:
import pandas as pd
import re

doc = []
with open('dates.txt') as file:
    for line in file:
        doc.append(line)

df = pd.Series(doc)
df.head(10)

0         03/25/93 Total time of visit (in minutes):\n
1                       6/18/85 Primary Care Doctor:\n
2    sshe plans to move as of 7/8/71 In-Home Servic...
3                7 on 9/27/75 Audit C Score Current:\n
4    2/6/96 sleep studyPain Treatment Pain Level (N...
5                    .Per 7/06/79 Movement D/O note:\n
6    4, 5/18/78 Patient's thoughts about current su...
7    10/24/89 CPT Code: 90801 - Psychiatric Diagnos...
8                         3/7/86 SOS-10 Total Score:\n
9             (4/10/71)Score-1Audit C Score Current:\n
dtype: object

In [18]:
def date_sorter():
    
    import numpy as np
    list1 = []
    list2 = []
    list3 = []
    list4 = []
    cond1 = r'(?:\d{1,2}[-/])?\d{1,2}[-/]\d{2,4}'
    cond2 = r'(?:\d{1,2})[-., ]+(?:Jan[a-z]{0,}|Feb[a-z]{0,}|Mar[a-z]{0,}|Apr[a-z]{0,}|May|Jun[a-z]{0,}|Jul[a-z]{0,}|Aug[a-z]{0,}|Sep[a-z]{0,}|Oct[a-z]{0,}|Nov[a-z]{0,}|Dec[a-z]{0,})[-., ]+(?:\d{2,4})'
    cond3 = r'(?:Jan[a-z]{0,}|Feb[a-z]{0,}|Mar[a-z]{0,}|Apr[a-z]{0,}|May|Jun[a-z]{0,}|Jul[a-z]{0,}|Aug[a-z]{0,}|Sep[a-z]{0,}|Oct[a-z]{0,}|Nov[a-z]{0,}|Dec[a-z]{0,})(?:[-., ]+(?:\d{1,2})(?:[a-z]{2})?)?[-., ]+(?:\d{2,4})'
    cond4 = r'(19\d\d|20\d\d)'
    for line in df:

        search1 = []
        search2 = []
        search3 = []
        search4 = []

        search1 = re.findall(cond1, line)
        search2 = re.findall(cond2, line)
        search3 = re.findall(cond3, line)
        search4 = re.findall(cond4, line)

        list1.append(search1)
        list2.append(search2)
        list3.append(search3)
        list4.append(search4)
    dff = pd.DataFrame(df)
    dff['cond1']= list1
    dff['cond2'] = list2
    dff['cond3'] = list3
    dff['cond4'] = list4
    dff['cond1'] = dff['cond1'].apply(lambda x: np.nan if x==[] else x)
    dff['cond2'] = dff['cond2'].apply(lambda x: np.nan if x==[] else x)
    dff['cond3'] = dff['cond3'].apply(lambda x: np.nan if x==[] else x)
    dff['cond4'] = dff['cond4'].apply(lambda x: np.nan if x==[] else x)

    dff['cond1'] = dff['cond1'].astype(str)
    dff['cond2'] = dff['cond2'].astype(str)
    dff['cond3'] = dff['cond3'].astype(str)
    dff['cond4'] = dff['cond4'].astype(str)
    dff['cond1']= dff['cond1'].apply(lambda x: np.nan if x=='nan' else x)
    dff['cond2']= dff['cond2'].apply(lambda x: np.nan if x=='nan' else x)
    dff['cond3']= dff['cond3'].apply(lambda x: np.nan if x=='nan' else x)
    dff['cond4']= dff['cond4'].apply(lambda x: np.nan if x=='nan' else x)

    dff['cond1'] = dff['cond1'].replace(r'(\])*(\[)*(\')*(\')*', '', regex=True)
    dff['cond2'] = dff['cond2'].replace(r'(\])*(\[)*(\')*(\')*', '', regex=True)
    dff['cond3'] = dff['cond3'].replace(r'(\])*(\[)*(\')*(\')*', '', regex=True)
    dff['cond4'] = dff['cond4'].replace(r'(\])*(\[)*(\')*(\')*', '', regex=True)

    dff['cond1']= dff['cond1'].str.strip()
    dff['cond2']= dff['cond2'].str.strip()
    dff['cond3']= dff['cond3'].str.strip()
    dff['cond4']= dff['cond4'].str.strip()

    #condition 1

    c1dff = dff['cond1'].str.extractall(r'((\d{1,2}[-/])?(\d{1,2})[-/](\d{2,4}))')
    c1dff = c1dff[0].unstack()
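    # Row 80 yields two matches ('10-15' and '6/29/81'); the second one is the real date.
    c1dff.loc[80, 0] = c1dff.loc[80, 1]
    # Row 272 only matches what appear to be number ranges ('11-16', '14-17'), not dates.
    c1dff.loc[272, 0] = np.nan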
    c1dff.drop(columns=[1], inplace=True)
    c1dff = c1dff[0].str.extractall(r'(?P<date_raw>(?P<month>\d{1,2})[-/](?P<day>\d{1,2}[-/])?(?P<year>\d{2,4}))')
    #c1dff.rename(columns={0:'date_raw', 1:'month', 2:'day', 3: 'year'}, inplace=True)
    c1dff['day']= c1dff['day'].replace(r'[-/. ]', '', regex=True)
    #c1dff['day'] = c1dff.day.astype(str)
    c1dff['day'].fillna('01',inplace=True)
    c1dff['month'] = c1dff['month'].apply(lambda x: '0'+x if len(x)==1 else x)
    c1dff['day'] = c1dff['day'].apply(lambda x: '0'+x if len(x)==1 else x)
    c1dff['year'] = c1dff['year'].apply(lambda x: '19'+x if len(x)==2 else x)

    c1dff['date'] = c1dff['year']+c1dff['month']+c1dff['day'] #reconstruct in yyyymmdd
    c1dff.drop(columns = ['month','day', 'year'], inplace=True)
    c1dff = c1dff.droplevel(level=1, axis=0)

    #Condition 2

    c2dff = dff['cond2'].str.extractall(r'((\d{1,2})[-., ](Jan[a-z]{0,}|Feb[a-z]{0,}|Mar[a-z]{0,}|Apr[a-z]{0,}|May|Jun[a-z]{0,}|Jul[a-z]{0,}|Aug[a-z]{0,}|Sep[a-z]{0,}|Oct[a-z]{0,}|Nov[a-z]{0,}|Dec[a-z]{0,})[-., ]+(\d{2,4}))')
    c2dff.index.unique(level='match')# only match level 0 hence match index can be discarded
    c2dff.rename(columns={0:'raw_date', 1: 'day', 2: 'month', 3: 'year'}, inplace=True)

    # Day field cleaning
    c2dff.fillna('01', inplace=True) # fill NaN with 01
    c2dff['day'].apply(lambda x: x if len(x)<2 else 0).unique() # no single digit date

    # month field cleaning
    c2dff[c2dff['month'].isnull()] # no null fields
    c2dff['month'] = c2dff['month'].apply(lambda x: x[:3] if len(x)>3 else x) # change to 3 character month format
    monthDic = {'Jan':'01', 'Feb':'02', 'Mar':'03', 'Apr':'04', \
                'May':'05', 'Jun':'06', 'Jul':'07' ,'Aug':'08', 'Sep':'09',\
               'Oct':'10', 'Nov':'11', 'Dec':'12'}
    c2dff['month'] = c2dff['month'].apply(lambda x: monthDic[x]) # change to numeric month representation

    c2dff['year'].isnull().unique() # no null fields
    c2dff['year'].unique() # some fields are picked up incorrectly by cond2 but correctly by cond3
    c2dff['date'] = c2dff['year']+c2dff['month']+c2dff['day']

    #Condition 3

    c3dff = dff['cond3'].str.extractall(r'(?P<date_raw>(?P<month>Jan[a-z]{0,}|Feb[a-z]{0,}|Mar[a-z]{0,}|Apr[a-z]{0,}|May|Jun[a-z]{0,}|Jul[a-z]{0,}|Aug[a-z]{0,}|Sep[a-z]{0,}|Oct[a-z]{0,}|Nov[a-z]{0,}|Dec[a-z]{0,})(?P<day>[-., ]+\d{1,2}(?P<th>[a-z]{2})?)?[-., ]+(?P<year>\d{2,4}))')
    c3dff.index.unique(level='match') # only match level 0 hence match index can be discarded

    # Cleaning for Day field. remove nonalphanumeric characters and strip
    c3dff['day']= c3dff['day'].replace(r'[\W]*[\s]*','',regex=True) # clear non-alphanumeric characters(\W) and white spaces(\s) 
    # since each entry could have more than one white space or illegal characters * is specified to match 0 or more times
    c3dff['day'].fillna('01', inplace = True) # fill NaN with 01
    #c3dff['day'].apply(lambda x: x if len(x)!=2 else 0).unique() # verify all entries are legit.

    # Month field cleaning: make sure all months are 3 characters long, then use the dictionary to change to numeric values
    c3dff['month'].unique() # some values need converting to 3 characters, also misspelled months
    c3dff['month'] = c3dff['month'].apply(lambda x: x[:3] if len(x)>3 else x)
    c3dff['month'] = c3dff['month'].apply(lambda x: monthDic[x])

    # Year Field Cleaning
    c3dff['year'].apply(lambda x: x if len(x)!=4 else 0).unique() # all years are 4 characters long
    c3dff['year'].unique() # values seem fine as well, hence no cleaning required
    c3dff['date'] = c3dff['year']+c3dff['month']+c3dff['day']

    #Condition 4

    dff['cond4'].astype(str).apply(lambda x: x if len(x)!=4 else 0).unique()
    c4dff = dff['cond4'].str.extractall(r'(19\d\d|20\d\d)')
    c4dff.index.unique(level='match') # only index level 0, match can be discarded
    c4dff.rename(columns={0:'raw_date'}, inplace=True)
    c4dff['date'] = c4dff['raw_date']+'01'+'01'

    ## Merging back into main DataFrame dff
    dff = dff.merge(c1dff, left_index=True, right_index=True, how='outer')
    dff.rename(columns={'date_raw':'raw_cond1', 'date': 'date_cond1'}, inplace=True)
    c2dff = c2dff.droplevel(level=1, axis=0)
    c3dff = c3dff.droplevel(level=1, axis=0)
    c4dff = c4dff.droplevel(level=1, axis=0)
    c3dff.drop(columns=['day', 'month', 'year', 'th'], inplace=True)
    c2dff.drop(columns=['day', 'month', 'year'], inplace=True)
    dff=dff.merge(c2dff, left_index=True, right_index=True, how='outer')
    dff.rename(columns={'raw_date':'raw_cond2', 'date': 'date_cond2'}, inplace=True)
    dff = dff.merge(c3dff, left_index=True, right_index=True, how='outer')
    dff.rename(columns={'date_raw':'raw_cond3', 'date': 'date_cond3'}, inplace=True)
    dff = dff.merge(c4dff, left_index=True, right_index=True, how='outer')
    dff.rename(columns={'raw_date':'raw_cond4', 'date': 'date_cond4'}, inplace=True)

    # Filtering DataFrame

    allNull = dff['date_cond1'].notnull() & dff['date_cond2'].isnull() & dff['date_cond3'].isnull() & dff['date_cond4'].isnull()
    dff['date_final'] = dff[allNull]['date_cond1']
    #dff['date_cond1'].astype(str).apply(lambda x: x if len(x)!=8 else 0).unique()

    f3 = dff['date_cond1'].notnull() & dff['date_cond3'].notnull()
    dff.loc[f3,'date_final'] = dff['date_cond3']

    #dff[dff['date_cond1'].notnull() & dff['date_cond3'].notnull()]
    f4 = dff['date_cond1'].notnull() & dff['date_cond4'].notnull()
    dff.loc[f4,'date_final'] = dff['date_cond1']

    #Condition two populate
    f5 = dff['date_cond2'].notnull()
    dff.loc[f5,'date_final'] = dff['date_cond2']

    # Condition 3 populate
    f6 = dff['cond2'].isnull() & dff['cond3'].notnull()
    dff.loc[f6, 'date_final'] = dff['date_cond3']

    # Condition 4 populate
    f7 = dff['date_final'].isnull()
    dff.loc[f7, 'date_final'] = dff['date_cond4']

    #return final sorted series in the format specified 
    final = dff['date_final'].sort_values()
    l1 = list(final.index)
    fSer = pd.Series(l1) 
      
    
    return fSer

In [19]:
date_sorter()

0        9
1       84
2        2
3       53
4       28
5      474
6      153
7       13
8      129
9       98
10     111
11     225
12      31
13     171
14     191
15     486
16     335
17     415
18      36
19     405
20     323
21     422
22     375
23     380
24     345
25      57
26     481
27     436
28     104
29     299
      ... 
470    220
471    208
472    243
473    139
474    320
475    383
476    244
477    286
478    480
479    431
480    279
481    198
482    381
483    463
484    366
485    439
486    255
487    401
488    475
489    257
490    152
491    235
492    464
493    253
494    427
495    231
496    141
497    186
498    161
499    413
Length: 500, dtype: int64

## The Python environment on the portal is slightly older, hence the slight changes below to accommodate it
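
The changes in the version below are confined to two pandas calls: dropping columns and dropping an index level. The sketch below shows that the older and newer spellings give the same result on a toy frame. The frame contents and the names `frame`, `old_style`, and `new_style` are made up for illustration; the exact pandas version on the portal is not known here.

```python
import pandas as pd

# Toy frame shaped like the output of str.extractall: (row, match) MultiIndex.
idx = pd.MultiIndex.from_tuples([(0, 0), (1, 0)], names=[None, 'match'])
frame = pd.DataFrame({'date': ['19930325', '19850618'], 'day': ['25', '18']}, index=idx)

# Spellings that also work on older pandas releases.
old_style = frame.drop(['day'], axis=1)
old_style.index = old_style.index.droplevel(level=1)

# Newer spellings, as used in the first version of date_sorter.
new_style = frame.drop(columns=['day']).droplevel(level=1, axis=0)

print(old_style.equals(new_style))  # True
```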

In [1]:
def date_sorter():
    
    
    import numpy as np
    import re

    list1 = []
    list2 = []
    list3 = []
    list4 = []
    cond1 = r'(?:\d{1,2}[-/])?\d{1,2}[-/]\d{2,4}'
    cond2 = r'(?:\d{1,2})[-., ]+(?:Jan[a-z]{0,}|Feb[a-z]{0,}|Mar[a-z]{0,}|Apr[a-z]{0,}|May|Jun[a-z]{0,}|Jul[a-z]{0,}|Aug[a-z]{0,}|Sep[a-z]{0,}|Oct[a-z]{0,}|Nov[a-z]{0,}|Dec[a-z]{0,})[-., ]+(?:\d{2,4})'
    cond3 = r'(?:Jan[a-z]{0,}|Feb[a-z]{0,}|Mar[a-z]{0,}|Apr[a-z]{0,}|May|Jun[a-z]{0,}|Jul[a-z]{0,}|Aug[a-z]{0,}|Sep[a-z]{0,}|Oct[a-z]{0,}|Nov[a-z]{0,}|Dec[a-z]{0,})(?:[-., ]+(?:\d{1,2})(?:[a-z]{2})?)?[-., ]+(?:\d{2,4})'
    cond4 = r'(19\d\d|20\d\d)'
    for line in df:

        search1 = []
        search2 = []
        search3 = []
        search4 = []

        search1 = re.findall(cond1, line)
        search2 = re.findall(cond2, line)
        search3 = re.findall(cond3, line)
        search4 = re.findall(cond4, line)

        list1.append(search1)
        list2.append(search2)
        list3.append(search3)
        list4.append(search4)
    dff = pd.DataFrame(df)
    dff['cond1']= list1
    dff['cond2'] = list2
    dff['cond3'] = list3
    dff['cond4'] = list4
    dff['cond1'] = dff['cond1'].apply(lambda x: np.nan if x==[] else x)
    dff['cond2'] = dff['cond2'].apply(lambda x: np.nan if x==[] else x)
    dff['cond3'] = dff['cond3'].apply(lambda x: np.nan if x==[] else x)
    dff['cond4'] = dff['cond4'].apply(lambda x: np.nan if x==[] else x)

    dff['cond1'] = dff['cond1'].astype(str)
    dff['cond2'] = dff['cond2'].astype(str)
    dff['cond3'] = dff['cond3'].astype(str)
    dff['cond4'] = dff['cond4'].astype(str)
    dff['cond1']= dff['cond1'].apply(lambda x: np.nan if x=='nan' else x)
    dff['cond2']= dff['cond2'].apply(lambda x: np.nan if x=='nan' else x)
    dff['cond3']= dff['cond3'].apply(lambda x: np.nan if x=='nan' else x)
    dff['cond4']= dff['cond4'].apply(lambda x: np.nan if x=='nan' else x)

    dff['cond1'] = dff['cond1'].replace(r'(\])*(\[)*(\')*(\')*', '', regex=True)
    dff['cond2'] = dff['cond2'].replace(r'(\])*(\[)*(\')*(\')*', '', regex=True)
    dff['cond3'] = dff['cond3'].replace(r'(\])*(\[)*(\')*(\')*', '', regex=True)
    dff['cond4'] = dff['cond4'].replace(r'(\])*(\[)*(\')*(\')*', '', regex=True)

    dff['cond1']= dff['cond1'].str.strip()
    dff['cond2']= dff['cond2'].str.strip()
    dff['cond3']= dff['cond3'].str.strip()
    dff['cond4']= dff['cond4'].str.strip()

    #condition 1

    c1dff = dff['cond1'].str.extractall(r'((\d{1,2}[-/])?(\d{1,2})[-/](\d{2,4}))')
    c1dff = c1dff[0].unstack()
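    # Row 80 yields two matches ('10-15' and '6/29/81'); the second one is the real date.
    c1dff.loc[80, 0] = c1dff.loc[80, 1]
    # Row 272 only matches what appear to be number ranges ('11-16', '14-17'), not dates.
    c1dff.loc[272, 0] = np.nan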
    c1dff.drop(1, axis=1, inplace=True)
    c1dff = c1dff[0].str.extractall(r'(?P<date_raw>(?P<month>\d{1,2})[-/](?P<day>\d{1,2}[-/])?(?P<year>\d{2,4}))')
    #c1dff.rename(columns={0:'date_raw', 1:'month', 2:'day', 3: 'year'}, inplace=True)
    c1dff['day']= c1dff['day'].replace(r'[-/. ]', '', regex=True)
    #c1dff['day'] = c1dff.day.astype(str)
    c1dff['day'].fillna('01',inplace=True)
    c1dff['month'] = c1dff['month'].apply(lambda x: '0'+x if len(x)==1 else x)
    c1dff['day'] = c1dff['day'].apply(lambda x: '0'+x if len(x)==1 else x)
    c1dff['year'] = c1dff['year'].apply(lambda x: '19'+x if len(x)==2 else x)

    c1dff['date'] = c1dff['year']+c1dff['month']+c1dff['day'] #reconstruct in yyyymmdd
    c1dff.drop(['month','day', 'year'],axis=1, inplace=True)
    c1dff.index = c1dff.index.droplevel(level=1)
    #Condition 2

    c2dff = dff['cond2'].str.extractall(r'((\d{1,2})[-., ](Jan[a-z]{0,}|Feb[a-z]{0,}|Mar[a-z]{0,}|Apr[a-z]{0,}|May|Jun[a-z]{0,}|Jul[a-z]{0,}|Aug[a-z]{0,}|Sep[a-z]{0,}|Oct[a-z]{0,}|Nov[a-z]{0,}|Dec[a-z]{0,})[-., ]+(\d{2,4}))')
    #c2dff.index.unique(level='match')# only match level 0 hence match index can be discarded
    c2dff.rename(columns={0:'raw_date', 1: 'day', 2: 'month', 3: 'year'}, inplace=True)

    # Day field cleaning
    c2dff.fillna('01', inplace=True) # fill NaN with 01
    c2dff['day'].apply(lambda x: x if len(x)<2 else 0).unique() # no single digit date

    # month field cleaning
    c2dff[c2dff['month'].isnull()] # no null fields
    c2dff['month'] = c2dff['month'].apply(lambda x: x[:3] if len(x)>3 else x) # change to 3 character month format
    monthDic = {'Jan':'01', 'Feb':'02', 'Mar':'03', 'Apr':'04', \
                'May':'05', 'Jun':'06', 'Jul':'07' ,'Aug':'08', 'Sep':'09',\
               'Oct':'10', 'Nov':'11', 'Dec':'12'}
    c2dff['month'] = c2dff['month'].apply(lambda x: monthDic[x]) # change to numeric month representation

    c2dff['year'].isnull().unique() # no null fields
    c2dff['year'].unique() # some fields are picked up incorrectly by cond2 but correctly by cond3
    c2dff['date'] = c2dff['year']+c2dff['month']+c2dff['day']

    #Condition 3

    c3dff = dff['cond3'].str.extractall(r'(?P<date_raw>(?P<month>Jan[a-z]{0,}|Feb[a-z]{0,}|Mar[a-z]{0,}|Apr[a-z]{0,}|May|Jun[a-z]{0,}|Jul[a-z]{0,}|Aug[a-z]{0,}|Sep[a-z]{0,}|Oct[a-z]{0,}|Nov[a-z]{0,}|Dec[a-z]{0,})(?P<day>[-., ]+\d{1,2}(?P<th>[a-z]{2})?)?[-., ]+(?P<year>\d{2,4}))')
    #c3dff.index.unique(level='match') # only match level 0 hence match index can be discarded

    # Cleaning for Day field. remove nonalphanumeric characters and strip
    c3dff['day']= c3dff['day'].replace(r'[\W]*[\s]*','',regex=True) # clear non-alphanumeric characters(\W) and white spaces(\s) 
    # since each entry could have more than one white space or illegal characters * is specified to match 0 or more times
    c3dff['day'].fillna('01', inplace = True) # fill NaN with 01
    #c3dff['day'].apply(lambda x: x if len(x)!=2 else 0).unique() # verify all entries are legit.

    # Month field cleaning: make sure all months are 3 characters long, then use the dictionary to change to numeric values
    c3dff['month'].unique() # some values need converting to 3 characters, also misspelled months
    c3dff['month'] = c3dff['month'].apply(lambda x: x[:3] if len(x)>3 else x)
    c3dff['month'] = c3dff['month'].apply(lambda x: monthDic[x])

    # Year Field Cleaning
    c3dff['year'].apply(lambda x: x if len(x)!=4 else 0).unique() # all years are 4 characters long
    c3dff['year'].unique() # values seem fine as well, hence no cleaning required
    c3dff['date'] = c3dff['year']+c3dff['month']+c3dff['day']

    #Condition 4

    dff['cond4'].astype(str).apply(lambda x: x if len(x)!=4 else 0).unique()
    c4dff = dff['cond4'].str.extractall(r'(19\d\d|20\d\d)')
    #c4dff.index.unique(level='match') # only index level 0, match can be discarded
    c4dff.rename(columns={0:'raw_date'}, inplace=True)
    c4dff['date'] = c4dff['raw_date']+'01'+'01'

    ## Merging back into main DataFrame dff
    dff = dff.merge(c1dff, left_index=True, right_index=True, how='outer')
    dff.rename(columns={'date_raw':'raw_cond1', 'date': 'date_cond1'}, inplace=True)

    c2dff.index = c2dff.index.droplevel(level=1)
    c3dff.index = c3dff.index.droplevel(level=1)
    c4dff.index = c4dff.index.droplevel(level=1)
    c3dff.drop(['day', 'month', 'year', 'th'],axis=1, inplace=True)
    c2dff.drop(['day', 'month', 'year'], axis=1, inplace=True)
    dff=dff.merge(c2dff, left_index=True, right_index=True, how='outer')
    dff.rename(columns={'raw_date':'raw_cond2', 'date': 'date_cond2'}, inplace=True)
    dff = dff.merge(c3dff, left_index=True, right_index=True, how='outer')
    dff.rename(columns={'date_raw':'raw_cond3', 'date': 'date_cond3'}, inplace=True)
    dff = dff.merge(c4dff, left_index=True, right_index=True, how='outer')
    dff.rename(columns={'raw_date':'raw_cond4', 'date': 'date_cond4'}, inplace=True)

    # Filtering DataFrame

    allNull = dff['date_cond1'].notnull() & dff['date_cond2'].isnull() & dff['date_cond3'].isnull() & dff['date_cond4'].isnull()
    dff['date_final'] = dff[allNull]['date_cond1']
    #dff['date_cond1'].astype(str).apply(lambda x: x if len(x)!=8 else 0).unique()

    f3 = dff['date_cond1'].notnull() & dff['date_cond3'].notnull()
    dff.loc[f3,'date_final'] = dff['date_cond3']

    #dff[dff['date_cond1'].notnull() & dff['date_cond3'].notnull()]
    f4 = dff['date_cond1'].notnull() & dff['date_cond4'].notnull()
    dff.loc[f4,'date_final'] = dff['date_cond1']

    #Condition two populate
    f5 = dff['date_cond2'].notnull()
    dff.loc[f5,'date_final'] = dff['date_cond2']

    # Condition 3 populate
    f6 = dff['cond2'].isnull() & dff['cond3'].notnull()
    dff.loc[f6, 'date_final'] = dff['date_cond3']

    # Condition 4 populate
    f7 = dff['date_final'].isnull()
    dff.loc[f7, 'date_final'] = dff['date_cond4']

    #return final sorted series in the format specified 
    final = dff['date_final'].sort_values()
    l1 = list(final.index)
    fSer = pd.Series(l1) 



    
    return fSer

## Old Workings from this point below ** please ignore **

In [630]:
import re
new = []
cond1 = r'(?:\d{1,2}[-/])?\d{1,2}[-/]\d{2,4}'
cond2 = r'(?:\d{1,2})?[-., ]+(?:Jan[a-z]{0,}|Feb[a-z]{0,}|Mar[a-z]{0,}|Apr[a-z]{0,}|May|Jun[a-z]{0,}|Jul[a-z]{0,}|Aug[a-z]{0,}|Sep[a-z]{0,}|Oct[a-z]{0,}|Nov[a-z]{0,}|Dec[a-z]{0,})[-., ]+(?:\d{2,4})'
cond3 = r'(?:Jan[a-z]{0,}|Feb[a-z]{0,}|Mar[a-z]{0,}|Apr[a-z]{0,}|May|Jun[a-z]{0,}|Jul[a-z]{0,}|Aug[a-z]{0,}|Sep[a-z]{0,}|Oct[a-z]{0,}|Nov[a-z]{0,}|Dec[a-z]{0,})(?:[-., ]+(?:\d{1,2})(?:[a-z]{2})?)?[-., ]+(?:\d{2,4})'
cond4 = r'(19\d\d|20\d\d)'
for line in df:
    search = []
    search = re.findall(cond1, line)
    if search == []:
        search = re.findall(cond2, line)
        if search == []:
            search = re.findall(cond3, line)
            if search == []:
                search = re.findall(cond4, line)
                if search ==[]:
                    new.append('0')
                else:
                    new.append(search)
            else:
                new.append(search)
        else:
            new.append(search)
    else:
        new.append(search)




In [648]:
list1 = []
list2 = []
list3 = []
list4 = []
cond1 = r'(?:\d{1,2}[-/])?\d{1,2}[-/]\d{2,4}'
cond2 = r'(?:\d{1,2})?[-., ]+(?:Jan[a-z]{0,}|Feb[a-z]{0,}|Mar[a-z]{0,}|Apr[a-z]{0,}|May|Jun[a-z]{0,}|Jul[a-z]{0,}|Aug[a-z]{0,}|Sep[a-z]{0,}|Oct[a-z]{0,}|Nov[a-z]{0,}|Dec[a-z]{0,})[-., ]+(?:\d{2,4})'
cond3 = r'(?:Jan[a-z]{0,}|Feb[a-z]{0,}|Mar[a-z]{0,}|Apr[a-z]{0,}|May|Jun[a-z]{0,}|Jul[a-z]{0,}|Aug[a-z]{0,}|Sep[a-z]{0,}|Oct[a-z]{0,}|Nov[a-z]{0,}|Dec[a-z]{0,})(?:[-., ]+(?:\d{1,2})(?:[a-z]{2})?)?[-., ]+(?:\d{2,4})'
cond4 = r'(19\d\d|20\d\d)'
for line in df:
 
    search1 = []
    search2 = []
    search3 = []
    search4 = []
    
    search1 = re.findall(cond1, line)
    search2 = re.findall(cond2, line)
    search3 = re.findall(cond3, line)
    search4 = re.findall(cond4, line)
    
    list1.append(search1)
    list2.append(search2)
    list3.append(search3)
    list4.append(search4)

In [663]:
dff = pd.DataFrame(df)
dff['cond1']= list1
dff['cond2'] = list2
dff['cond3'] = list3
dff['cond4'] = list4
dff['cond1'] = dff['cond1'].apply(lambda x: np.nan if x==[] else x)
dff['cond2'] = dff['cond2'].apply(lambda x: np.nan if x==[] else x)
dff['cond3'] = dff['cond3'].apply(lambda x: np.nan if x==[] else x)
dff['cond4'] = dff['cond4'].apply(lambda x: np.nan if x==[] else x)
dff

,0,cond1,cond2,cond3,cond4
0,03/25/93 Total time of visit (in minutes):\n,[03/25/93],,,
1,6/18/85 Primary Care Doctor:\n,[6/18/85],,,
2,sshe plans to move as of 7/8/71 In-Home Servic...,[7/8/71],,,
3,7 on 9/27/75 Audit C Score Current:\n,[9/27/75],,,
4,2/6/96 sleep studyPain Treatment Pain Level (N...,[2/6/96],,,
5,.Per 7/06/79 Movement D/O note:\n,[7/06/79],,,
6,"4, 5/18/78 Patient's thoughts about current su...",[5/18/78],,,
7,10/24/89 CPT Code: 90801 - Psychiatric Diagnos...,[10/24/89],,,
8,3/7/86 SOS-10 Total Score:\n,[3/7/86],,,
9,(4/10/71)Score-1Audit C Score Current:\n,[4/10/71],,,


In [2]:
dff[dff['cond1'].notnull() & dff['cond2'].notnull()]

NameError: name 'dff' is not defined

In [625]:
newdff = pd.DataFrame(new)
#newdff.rename(columns={0:'RawDate', 1:'Empty'}, inplace =True)
newdff

,0,1
0,03/25/93,
1,6/18/85,
2,7/8/71,
3,9/27/75,
4,2/6/96,
5,7/06/79,
6,5/18/78,
7,10/24/89,
8,3/7/86,
9,4/10/71,


In [517]:
newdff[newdff[0]=="0"]

,0,1


In [518]:
newdff.iloc[300:350]

,0,1
300,January 1994,
301,Dec 1992,
302,November 2004,
303,January 1977,
304,Mar 2002,
305,Feb 2000,
306,"May, 2004",
307,July 2006,
308,Feb 1994,
309,April 1977,


In [519]:
new = newdff[0].str.extractall(r'(?P<datetime>(?P<day>\d{1,2})?[-/](?P<month>\d{1,2})[-/](?P<year>\d{2,4}))')
#newdff[0].str.extractall(r'(?P<day>\d{1,2})?[-., ]+((?P<month>Jan[a-z]{0,}|Feb[a-z]{0,}|Mar[a-z]{0,}|Apr[a-z]{0,}|May|Jun[a-z]{0,}|Jul[a-z]{0,}|Aug[a-z]{0,}|Sep[a-z]{0,}|Oct[a-z]{0,}|Nov[a-z]{0,}|Dec[a-z]{0,})[-., ]+(?P<year>\d{2,4}))')
new

,match,datetime,day,month,year
0,0,03/25/93,03,25,93
1,0,6/18/85,6,18,85
2,0,7/8/71,7,8,71
3,0,9/27/75,9,27,75
4,0,2/6/96,2,6,96
5,0,7/06/79,7,06,79
6,0,5/18/78,5,18,78
7,0,10/24/89,10,24,89
8,0,3/7/86,3,7,86
9,0,4/10/71,4,10,71


In [520]:
new['day'] = new['day'].apply(lambda x: '0'+x if len(x) ==1 else x)
new['month'] = new['month'].apply(lambda x: '0'+x if len(x) ==1 else x)
new['year'] = new['year'].apply(lambda x: '19'+x if len(x) ==2 else x)
#    '0' + new['day'].astype(str)
new['date'] = new['year']+new['month']+new['day']

new.drop(columns=(['day', 'month', 'year']), inplace=True)

In [521]:
newdff = newdff.merge(new, how= 'outer' , right_on='datetime', left_on=0)

In [522]:
newdff.loc[300:400]

,0,1,datetime,date
300,January 1994,,,
301,Dec 1992,,,
302,November 2004,,,
303,January 1977,,,
304,Mar 2002,,,
305,Feb 2000,,,
306,"May, 2004",,,
307,July 2006,,,
308,Feb 1994,,,
309,April 1977,,,


In [523]:
new = newdff[0].str.extractall(r'(?P<day>\d{1,2})?[-., ]+(?P<month>Jan[a-z]{0,}|Feb[a-z]{0,}|Mar[a-z]{0,}|Apr[a-z]{0,}|May|Jun[a-z]{0,}|Jul[a-z]{0,}|Aug[a-z]{0,}|Sep[a-z]{0,}|Oct[a-z]{0,}|Nov[a-z]{0,}|Dec[a-z]{0,})[-., ]+(?P<year>\d{2,4})')
new

,match,day,month,year
125,0,24,Jan,2001
126,0,10,Sep,2004
127,0,26,May,1982
128,0,28,June,2002
129,0,06,May,1972
130,0,25,Oct,1987
131,0,14,Oct,1996
132,0,30,Nov,2007
133,0,28,June,1994
134,0,14,Jan,1981


In [524]:
new[new['year'].str.len() != 4]
new['year'] = new['year'].apply(lambda x: '19'+x if len(x) !=4 else x)

In [525]:
#new[new['day'].isnull()== True]['day'] = '01'

In [526]:
new['day'].fillna('01', inplace = True)
new['month'] = new['month'].apply(lambda x: x[:3] if len(x)>3 else x)
new

,match,day,month,year
125,0,24,Jan,2001
126,0,10,Sep,2004
127,0,26,May,1982
128,0,28,Jun,2002
129,0,06,May,1972
130,0,25,Oct,1987
131,0,14,Oct,1996
132,0,30,Nov,2007
133,0,28,Jun,1994
134,0,14,Jan,1981


In [527]:
monthDic = {'Jan':'01', 'Feb':'02', 'Mar':'03', 'Apr':'04', \
            'May':'05', 'Jun':'06', 'Jul':'07' ,'Aug':'08', 'Sep':'09',\
           'Oct':'10', 'Nov':'11', 'Dec':'12'}
new['month'] = new['month'].apply(lambda x: monthDic[x])

In [528]:
new['date'] = new['year']+new['month']+new['day']

In [529]:
new

,match,day,month,year,date
125,0,24,01,2001,20010124
126,0,10,09,2004,20040910
127,0,26,05,1982,19820526
128,0,28,06,2002,20020628
129,0,06,05,1972,19720506
130,0,25,10,1987,19871025
131,0,14,10,1996,19961014
132,0,30,11,2007,20071130
133,0,28,06,1994,19940628
134,0,14,01,1981,19810114


In [530]:
new = new.droplevel(level =1, axis=0)
newdff = newdff.merge(new, how= 'outer' , left_index=True, right_index =True)
newdff['date_x'] = newdff['date_x'].fillna(newdff['date_y'])
newdff.iloc[300:342]

,0,1,datetime,date_x,day,month,year,date_y
300,January 1994,,,19940101.0,1.0,1.0,1994.0,19940101.0
301,Dec 1992,,,19921201.0,1.0,12.0,1992.0,19921201.0
302,November 2004,,,20041101.0,1.0,11.0,2004.0,20041101.0
303,January 1977,,,19770101.0,1.0,1.0,1977.0,19770101.0
304,Mar 2002,,,20020301.0,1.0,3.0,2002.0,20020301.0
305,Feb 2000,,,20000201.0,1.0,2.0,2000.0,20000201.0
306,"May, 2004",,,,,,,
307,July 2006,,,20060701.0,1.0,7.0,2006.0,20060701.0
308,Feb 1994,,,,,,,
309,April 1977,,,19770401.0,1.0,4.0,1977.0,19770401.0


In [531]:
newdff.drop(columns = ['datetime', 'day', 'month', 'year', 'date_y'], inplace=True)
newdff.rename(columns={'date_x': 'date'}, inplace=True)

In [532]:
newdff[newdff['date'].isnull()]

,0,1,date
80,10-15,6/29/81,
194,"April 11, 1990",,
200,"July 26, 1978",,
202,"May 15, 1989",,
203,"September 06, 1995",,
204,"Mar. 10, 1976",,
208,"September 01, 2012",,
209,"July 25, 1983",,
210,"August 11, 1989",,
211,"April 17, 1992",,


In [533]:
dd = newdff[0].str.extractall(r'((Jan[a-z]{0,}|Feb[a-z]{0,}|Mar[a-z]{0,}|Apr[a-z]{0,}|May|Jun[a-z]{0,}|Jul[a-z]{0,}|Aug[a-z]{0,}|Sep[a-z]{0,}|Oct[a-z]{0,}|Nov[a-z]{0,}|Dec[a-z]{0,})([-., ]+(\d{1,2})([a-z]{2})?)?[-., ]+(\d{2,4}))')
#text1 = 'soemsa fians aisn March 2nd 2009 aisdnais  2'
#newdff.iloc[300:350]
dd.loc[200:]

,match,0,1,2,3,4,5
200,0,"July 26, 1978",July,26,26,,1978
201,0,December 23,December,,,,23
202,0,"May 15, 1989",May,15,15,,1989
203,0,"September 06, 1995",September,06,06,,1995
204,0,"Mar. 10, 1976",Mar,. 10,10,,1976
205,0,Jan 27,Jan,,,,27
206,0,October 23,October,,,,23
207,0,August 12,August,,,,12
208,0,"September 01, 2012",September,01,01,,2012
209,0,"July 25, 1983",July,25,25,,1983


In [535]:
dd[2]=dd[2].str.replace(r'[.]+', '', regex=True) # a few entries contain '.', which needs to be removed
dd[2]= dd[2].str.strip() # strip white space in the day field
dd[dd[2].str.len()>2] # make sure there are no more invalid values

,match,0,1,2,3,4,5


In [536]:
dd.rename(columns= {1:"month", 2: 'day', 3:'day_dup', 4:'Null', 5:'year'}, inplace=True)

In [537]:
dd['month'] = dd['month'].apply(lambda x: x[:3] if len(x)>3 else x)
dd['month'] = dd['month'].apply(lambda x: monthDic[x])

In [538]:
dd['day'].fillna('01', inplace = True) # add 01 for all the missing days
dd['year'] = dd['year'].apply(lambda x: '19'+x if len(x)==2 else x) # add 19 to 2 digit years
dd['date'] = dd['year']+dd['month']+dd['day'] # combine date

In [539]:
dd[dd['date'].str.len()<8] # make sure there are no invalid entries

,match,0,month,day,day_dup,Null,year,date


In [540]:
dd.drop(columns = [0, 'month', 'day', 'day_dup', 'Null', 'year'], inplace=True)
dd= dd.droplevel(level=1, axis=0)

In [544]:
dd.loc[194]

date    19900411
Name: 194, dtype: object

In [542]:
newdff = newdff.merge(dd, how= 'outer' , left_index=True, right_index =True)

In [548]:
newdff[newdff['date_x'].isnull()]

,0,1,date_x,date_y
80,10-15,6/29/81,,
248,50-100,,,
271,08-810,,,
272,11-16,14-17,,
343,6/1998,,,
344,6/2005,,,
345,10/1973,,,
346,9/2005,,,
347,03/1980,,,
348,12/2005,,,


In [545]:
newdff['date_x']=newdff['date_x'].fillna(newdff['date_y'])
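
Filling the gaps of one extraction pass with the results of a later pass can also be written without the `date_x`/`date_y` merge suffixes. A minimal sketch using `Series.combine_first`, with made-up sample values and hypothetical variable names:

```python
import pandas as pd

# Dates found by an earlier pass; None marks notes that did not match.
first_pass = pd.Series({0: '19710807', 1: None, 2: None}, name='date')
# Dates found by a later pass, indexed by the same note numbers.
second_pass = pd.Series({1: '19950901'}, name='date')

# Keep first_pass where it has a value, otherwise take second_pass.
combined = first_pass.combine_first(second_pass)
print(combined)
```

Notes that neither pass matched stay null, so they can still be selected with `isnull()` for the next pass.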

In [549]:
fin = newdff[newdff['date_x'].isnull()]
fin[0].apply(lambda x: x if len(x)==4 else 0)
fin

Unnamed: 0,0,1,date_x,date_y
80,10-15,6/29/81,,
248,50-100,,,
271,08-810,,,
272,11-16,14-17,,
343,6/1998,,,
344,6/2005,,,
345,10/1973,,,
346,9/2005,,,
347,03/1980,,,
348,12/2005,,,


In [550]:
fin = fin[0].str.extractall(r'((\d{1,2}[-/])?(\d{1,2}[-/])?(\d{2,4}))') 
#fin[0].str.findall(r'(?:\d{1,2}[-/])?(?:\d{1,2}[-/])?(?:\d{2,4})') #
#newdff[0].str.findall(r'(\d{1,2}[-/]+)?(19\d\d|20\d\d)')
fin

Unnamed: 0_level_0,Unnamed: 1_level_0,0,1,2,3
Unnamed: 0_level_1,match,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
80,0,10-15,10-,,15
248,0,50-100,50-,,100
271,0,08-810,08-,,810
272,0,11-16,11-,,16
343,0,6/1998,6/,,1998
344,0,6/2005,6/,,2005
345,0,10/1973,10/,,1973
346,0,9/2005,9/,,2005
347,0,03/1980,03/,,1980
348,0,12/2005,12/,,2005


In [551]:
#fin[1].apply(lambda x: if re.sub(r'[-/]', '', str(x)))
fin[1]=fin[1].replace(r'[/-]', '', regex=True)
fin

Unnamed: 0_level_0,Unnamed: 1_level_0,0,1,2,3
Unnamed: 0_level_1,match,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
80,0,10-15,10,,15
248,0,50-100,50,,100
271,0,08-810,08,,810
272,0,11-16,11,,16
343,0,6/1998,6,,1998
344,0,6/2005,6,,2005
345,0,10/1973,10,,1973
346,0,9/2005,9,,2005
347,0,03/1980,03,,1980
348,0,12/2005,12,,2005


In [552]:
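# NOTE: for 'm/yyyy' strings such as 6/1998, capture group 1 holds the month, so labelling it
# 'day' (and filling 'month' with '01') yields unpadded, misordered values such as 1998016.
fin.rename(columns={0:'datetime', 1:'day', 2:'month', 3:'year'}, inplace=True)
fin['day'].fillna('01', inplace=True)
fin['month'].fillna('01', inplace=True)
fin['year'] = fin['year'].apply(lambda x: '19'+x if len(x)==2 else x)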

In [555]:
fin['date'] = fin['year']+fin['month']+fin['day']
fin.drop(columns=['datetime', 'day', 'month', 'year'], inplace=True)
fin = fin.droplevel(level=1, axis=0)

Unnamed: 0,datetime,date
80,10-15,19150110
248,50-100,1000150
271,08-810,8100108
272,11-16,19160111
343,6/1998,1998016
344,6/2005,2005016
345,10/1973,19730110
346,9/2005,2005019
347,03/1980,19800103
348,12/2005,20050112
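
Several values in this table are seven characters long (for example 1998016 for 6/1998), and ranges such as 50-100 are treated as dates. One possible way to handle the month/year variant is sketched below. It assumes that only a 1 to 12 month followed by a four-digit 19xx or 20xx year counts as a date; `month_year_to_yyyymmdd` is a hypothetical helper, not part of the notebook.

```python
import re

def month_year_to_yyyymmdd(text):
    """Return YYYYMMDD for 'm/yyyy' strings, or None if the text is not such a date."""
    # Assumption: the year must be four digits starting with 19 or 20.
    m = re.fullmatch(r'(\d{1,2})[-/]((?:19|20)\d{2})', text)
    if m is None:
        return None
    month = int(m.group(1))
    if not 1 <= month <= 12:
        return None
    # A missing day defaults to the first of the month.
    return f'{m.group(2)}{month:02d}01'

for raw in ['6/1998', '03/1980', '50-100', '10-15']:
    print(raw, '->', month_year_to_yyyymmdd(raw))
```

Rows that come back as `None` (the numeric ranges) still need a separate look at the original note text.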


In [567]:
newdff = newdff.merge(fin, how= 'outer' , left_index=True, right_index =True)

In [576]:
newdff['date_x'] = newdff['date_x'].fillna(newdff['date'])
newdff.drop(columns=['date_y', 'date'], inplace = True)
newdff.rename(columns={'date_x': 'date'}, inplace=True)

In [585]:
newdff['date'].sort_values()

248     1000150
226    19070101
214    19100901
198    19111001
207    19120801
217    19130601
215    19140801
223    19141001
219    19141201
80     19150110
272    19160111
197    19180201
196    19180201
206    19231001
201    19231201
199    19240101
212    19240701
220    19250601
205    19270101
195    19300501
2      19710807
9      19711004
53     19711107
28     19711209
84     19711805
474    19720101
153    19720113
129    19720506
225    19720615
171    19721004
         ...   
282    20120501
243    20120901
208    20120901
139    20121021
320    20121101
34     20121205
244    20130101
480    20130101
286    20130101
431     2013014
279    20130901
463    20140101
381     2014011
439    20140110
401    20140112
366     2014017
255    20141001
475    20150101
257    20150901
152    20150928
235    20151001
464    20160101
413    20160111
427     2016015
253    20160201
231    20160501
141    20160530
186    20161013
161    20161019
271     8100108
Name: date, Length: 500,
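
Because `date` holds strings, `sort_values()` compares them character by character, which is why 1000150 sorts first and 8100108 last, and why malformed values pass through unnoticed. A sketch of validating before sorting follows; `to_date` is a hypothetical helper and the sample values are copied from the output above.

```python
from datetime import date
import pandas as pd

def to_date(s):
    """Return a datetime.date for an 8-digit YYYYMMDD string, else None."""
    if len(s) != 8 or not s.isdigit():
        return None
    try:
        return date(int(s[:4]), int(s[4:6]), int(s[6:]))
    except ValueError:          # e.g. month 18 or day 0
        return None

dates = pd.Series(['20121205', '19710807', '1998016', '19711805', '19711004', '8100108'],
                  index=[34, 2, 343, 84, 9, 271])
parsed = dates.map(to_date)

invalid = dates[parsed.isna()]                           # rows that need another look
order = parsed.dropna().sort_values().index.tolist()     # chronological note indices
print(invalid)
print(order)
```

Sorting real date objects instead of strings means the order no longer depends on every string having the same length.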

In [628]:
df[275]

'.   Ex-BF 25 yo, plans to go to sch in URUGUAY September 1984.\n'

In [629]:
newdff.loc[275]

0     September 1984
1               None
Name: 275, dtype: object

In [612]:
import re
new = []
cond1 = r'(?:\d{1,2}[-/])?\d{1,2}[-/]\d{2,4}'
cond3 = r'(?:\d{1,2})?[-., ]+(?:Jan[a-z]{0,}|Feb[a-z]{0,}|Mar[a-z]{0,}|Apr[a-z]{0,}|May|Jun[a-z]{0,}|Jul[a-z]{0,}|Aug[a-z]{0,}|Sep[a-z]{0,}|Oct[a-z]{0,}|Nov[a-z]{0,}|Dec[a-z]{0,})[-., ]+(?:\d{2,4})'
cond2 = r'(?:Jan[a-z]{0,}|Feb[a-z]{0,}|Mar[a-z]{0,}|Apr[a-z]{0,}|May|Jun[a-z]{0,}|Jul[a-z]{0,}|Aug[a-z]{0,}|Sep[a-z]{0,}|Oct[a-z]{0,}|Nov[a-z]{0,}|Dec[a-z]{0,})(?:[-., ]+(?:\d{1,2})(?:[a-z]{2})?)?[-., ]+(?:\d{2,4})'
cond4 = r'(19\d\d|20\d\d)'
for line in df:
    # Try each pattern in priority order and keep the first one that matches.
    for cond in (cond1, cond2, cond3, cond4):
        search = re.findall(cond, line)
        if search:
            new.append(search)
            break
    else:
        # No pattern matched this note.
        new.append('0')
newdff = pd.DataFrame(new)
new = newdff[0].str.extractall(r'(?P<datetime>(?P<day>\d{1,2})?[-/](?P<month>\d{1,2})[-/](?P<year>\d{2,4}))')
new['day'] = new['day'].apply(lambda x: '0'+x if len(x) ==1 else x)
new['month'] = new['month'].apply(lambda x: '0'+x if len(x) ==1 else x)
new['year'] = new['year'].apply(lambda x: '19'+x if len(x) ==2 else x)
#    '0' + new['day'].astype(str)
new['date'] = new['year']+new['month']+new['day']

new.drop(columns=(['day', 'month', 'year']), inplace=True)
newdff = newdff.merge(new, how= 'outer' , right_on='datetime', left_on=0)
new = newdff[0].str.extractall(r'(?P<day>\d{1,2})?[-., ]+(?P<month>Jan[a-z]{0,}|Feb[a-z]{0,}|Mar[a-z]{0,}|Apr[a-z]{0,}|May|Jun[a-z]{0,}|Jul[a-z]{0,}|Aug[a-z]{0,}|Sep[a-z]{0,}|Oct[a-z]{0,}|Nov[a-z]{0,}|Dec[a-z]{0,})[-., ]+(?P<year>\d{2,4})')
new[new['year'].str.len() != 4]
new['year'] = new['year'].apply(lambda x: '19'+x if len(x) !=4 else x)
new['day'].fillna('01', inplace = True)
new['month'] = new['month'].apply(lambda x: x[:3] if len(x)>3 else x)
monthDic = {'Jan':'01', 'Feb':'02', 'Mar':'03', 'Apr':'04',
            'May':'05', 'Jun':'06', 'Jul':'07', 'Aug':'08', 'Sep':'09',
            'Oct':'10', 'Nov':'11', 'Dec':'12'}
new['month'] = new['month'].apply(lambda x: monthDic[x])
new['date'] = new['year']+new['month']+new['day']
new = new.droplevel(level =1, axis=0)
newdff = newdff.merge(new, how= 'outer' , left_index=True, right_index =True)
newdff['date_x'] = newdff['date_x'].fillna(newdff['date_y'])
newdff.drop(columns = ['datetime', 'day', 'month', 'year', 'date_y'], inplace=True)
newdff.rename(columns={'date_x': 'date'}, inplace=True)
dd = newdff[0].str.extractall(r'((Jan[a-z]{0,}|Feb[a-z]{0,}|Mar[a-z]{0,}|Apr[a-z]{0,}|May|Jun[a-z]{0,}|Jul[a-z]{0,}|Aug[a-z]{0,}|Sep[a-z]{0,}|Oct[a-z]{0,}|Nov[a-z]{0,}|Dec[a-z]{0,})([-., ]+(\d{1,2})([a-z]{2})?)?[-., ]+(\d{2,4}))')
dd[2] = dd[2].str.replace(r'[.]+', '', regex=True)  # a few entries contain a stray '.', which is removed here
dd[2]= dd[2].str.strip() # strip white spaces in date field
dd.rename(columns= {1:"month", 2: 'day', 3:'day_dup', 4:'Null', 5:'year'}, inplace=True)
dd['month'] = dd['month'].apply(lambda x: x[:3] if len(x)>3 else x)
dd['month'] = dd['month'].apply(lambda x: monthDic[x])
dd['day'].fillna('01', inplace = True) # add 01 for all the missing days
dd['year'] = dd['year'].apply(lambda x: '19'+x if len(x)==2 else x) # add 19 to 2 digit years
dd['date'] = dd['year']+dd['month']+dd['day'] # combine date
dd.drop(columns = [0, 'month', 'day', 'day_dup', 'Null', 'year'], inplace=True)
dd= dd.droplevel(level=1, axis=0)
newdff = newdff.merge(dd, how= 'outer' , left_index=True, right_index =True)
newdff['date_x']=newdff['date_x'].fillna(newdff['date_y'])
fin = newdff[newdff['date_x'].isnull()]
fin[0].apply(lambda x: x if len(x)==4 else 0)
fin = fin[0].str.extractall(r'((\d{1,2}[-/])?(\d{1,2}[-/])?(\d{2,4}))') 
fin[1]=fin[1].replace(r'[/-]', '', regex=True)
fin.rename(columns={0:'datetime', 1:'day', 2:'month', 3:'year'}, inplace=True)
fin['day'].fillna('01', inplace=True)
fin['month'].fillna('01', inplace=True)
fin['year'] = fin['year'].apply(lambda x: '19'+x if len(x)==2 else x)
fin['date'] = fin['year']+fin['month']+fin['day']
fin.drop(columns=['datetime', 'day', 'month', 'year'], inplace=True)
fin = fin.droplevel(level=1, axis=0)
newdff = newdff.merge(fin, how= 'outer' , left_index=True, right_index =True)
newdff['date_x'] = newdff['date_x'].fillna(newdff['date'])
newdff.drop(columns=['date_y', 'date'], inplace = True)
newdff.rename(columns={'date_x': 'date'}, inplace=True)
newdff['date'].sort_values()

253     1000150
80     19150110
275    19160111
2      19710807
9      19711004
53     19711107
28     19711209
84     19711805
474    19720101
159    19720101
129    19720501
232    19720615
179    19721001
111    19721006
198    19721101
98     19721305
31     19722007
13     19722601
486    19730101
405    19730103
375    19730106
345    19730110
57     19730112
415     1973012
422     1973014
380     1973017
336    19730201
325    19730301
36     19731402
481    19740101
         ...   
215    20120901
248    20120901
140    20121001
322    20121101
34     20121205
480    20130101
289    20130101
249    20130101
431     2013014
282    20130901
205    20131011
463    20140101
381     2014011
439    20140110
401    20140112
366     2014017
260    20141001
475    20150101
158    20150901
157    20150901
241    20151001
464    20160101
413    20160111
427     2016015
258    20160201
143    20160501
142    20160501
169    20161001
168    20161001
274     8100108
Name: date, Length: 500,
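
Values such as 19711805, 19722007 and 19731402 have a month greater than 12. This suggests that the numeric pattern reads the first number as the day, whereas the assignment states that xx/xx/xx dates are mm/dd/yy. A sketch of the extraction with the groups named in that order, using made-up sample notes:

```python
import pandas as pd

# Month first, then day, per the assignment's mm/dd/yy rule.
pattern = r'(?P<month>\d{1,2})[-/](?P<day>\d{1,2})[-/](?P<year>\d{4}|\d{2})\b'

notes = pd.Series(['5/18/71 seen in clinic', 'admitted on 04/20/2009', '4/3/89'])
parts = notes.str.extract(pattern)

parts['month'] = parts['month'].str.zfill(2)   # '5' -> '05'
parts['day'] = parts['day'].str.zfill(2)       # '3' -> '03'
# Two-digit years are assumed to be 19xx, as the assignment specifies.
parts['year'] = parts['year'].where(parts['year'].str.len() == 4, '19' + parts['year'])

dates = parts['year'] + parts['month'] + parts['day']
print(dates.tolist())
```

With both month and day zero-padded, every string is eight characters long, so a plain string sort agrees with chronological order.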