---

_You are currently looking at **version 1.1** of this notebook. To download notebooks and datafiles, as well as get help on Jupyter notebooks in the Coursera platform, visit the [Jupyter Notebook FAQ](https://www.coursera.org/learn/python-text-mining/resources/d9pwm) course resource._

---

# Assignment 1

In this assignment, you'll be working with messy medical data and using regex to extract relevant information from the data.

Each line of the `dates.txt` file corresponds to a medical note. Each note has a date that needs to be extracted, but each date is encoded in one of many formats.

The goal of this assignment is to correctly identify all of the different date variants encoded in this dataset and to properly normalize and sort the dates. 

Here is a list of some of the variants you might encounter in this dataset:
* 04/20/2009; 04/20/09; 4/20/09; 4/3/09
* Mar-20-2009; Mar 20, 2009; March 20, 2009;  Mar. 20, 2009; Mar 20 2009;
* 20 Mar 2009; 20 March 2009; 20 Mar. 2009; 20 March, 2009
* Mar 20th, 2009; Mar 21st, 2009; Mar 22nd, 2009
* Feb 2009; Sep 2009; Oct 2010
* 6/2008; 12/2009
* 2009; 2010

Once you have extracted these date patterns from the text, the next step is to sort them in ascending chronological order according to the following rules:
* Assume all dates in xx/xx/xx format are mm/dd/yy
* Assume all dates where year is encoded in only two digits are years from the 1900's (e.g. 1/5/89 is January 5th, 1989)
* If the day is missing (e.g. 9/2009), assume it is the first day of the month (e.g. September 1, 2009).
* If the month is missing (e.g. 2010), assume it is the first of January of that year (e.g. January 1, 2010).
* Watch out for potential typos as this is a raw, real-life derived dataset.
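
The defaulting rules above can be sketched as a small helper. This is only an illustration under stated assumptions: the function name `normalize_date_parts` is hypothetical and not part of the assignment, and it expects the pieces of a date to have been extracted already (as strings or integers).

```python
from datetime import date

def normalize_date_parts(year, month=None, day=None):
    """Apply the assignment's defaults to possibly incomplete date parts.

    Hypothetical helper: the name and signature are illustrative only.
    """
    year = int(year)
    if year < 100:
        # Two-digit year: the rules say to assume the 1900s
        year += 1900
    # Missing month -> January, missing day -> first of the month
    month = int(month) if month is not None else 1
    day = int(day) if day is not None else 1
    return date(year, month, day)

print(normalize_date_parts("89", "1", "5"))   # 1989-01-05
print(normalize_date_parts("2009", "9"))      # 2009-09-01
print(normalize_date_parts("2010"))           # 2010-01-01
```

Because `date` raises `ValueError` for impossible values such as month 0, a helper like this also surfaces extraction mistakes early.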

With these rules in mind, find the correct date in each note and return a pandas Series containing the original Series' indices, sorted in chronological order of their dates.

For example if the original series was this:

    0    1999
    1    2010
    2    1978
    3    2015
    4    1985

Your function should return this:

    0    2
    1    4
    2    0
    3    1
    4    3
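
One way to reproduce the toy example above is to sort the values and read off the index of the sorted result. This is a minimal sketch using the five years from the example, not the assignment's required implementation.

```python
import pandas as pd

years = pd.Series([1999, 2010, 1978, 2015, 1985])

# sort_values keeps each value's original label, so the index of the
# sorted Series lists the original positions in chronological order.
# kind="mergesort" is a stable sort, so equal dates keep their input order.
order = pd.Series(years.sort_values(kind="mergesort").index)

print(order.tolist())  # [2, 4, 0, 1, 3]
```

Wrapping the index in `pd.Series` gives the result a fresh 0..n-1 index, which matches the shape of the expected output shown above.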

Your score will be calculated using [Kendall's tau](https://en.wikipedia.org/wiki/Kendall_rank_correlation_coefficient), a correlation measure for ordinal data.
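
To build some intuition for the score, here is a plain-Python sketch of Kendall's tau in its simplest form (tau-a, assuming no ties). How the grader actually applies it is not specified in the assignment, so treat the function `kendall_tau` as an illustrative assumption only.

```python
from itertools import combinations

def kendall_tau(a, b):
    """Kendall's tau-a for two equal-length sequences without ties."""
    concordant = discordant = 0
    for i, j in combinations(range(len(a)), 2):
        # A pair is concordant when both sequences order it the same way
        sign = (a[i] - a[j]) * (b[i] - b[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(a) * (len(a) - 1) // 2
    return (concordant - discordant) / n_pairs

truth = [2, 4, 0, 1, 3]
print(kendall_tau(truth, truth))            # 1.0
print(kendall_tau(truth, [4, 2, 0, 1, 3]))  # 0.4 (first two entries swapped)
```

A perfect ordering scores 1.0, and each pair placed in the wrong relative order lowers the score, so a few misparsed dates cost only a little.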

*This function should return a Series of length 500 and dtype int.*

## Imports and Data

In [1]:
import numpy as np
import pandas as pd
import re
pd.set_option('display.max_colwidth', 100)

doc = []
with open('dates.txt') as file:
    for line in file:
        doc.append(line)

s = pd.Series(doc)
s.head()

0                        03/25/93 Total time of visit (in minutes):\n
1                                      6/18/85 Primary Care Doctor:\n
2            sshe plans to move as of 7/8/71 In-Home Services: None\n
3                               7 on 9/27/75 Audit C Score Current:\n
4    2/6/96 sleep studyPain Treatment Pain Level (Numeric Scale): 7\n
dtype: object

## Workspace

In [2]:
df = pd.DataFrame(data=s, columns=['string'])

df['original_index'] = df.index
df = df[['original_index', 'string']]  # Re-order columns (reindex_axis is unavailable in recent pandas)

df.head()

   original_index                                                            string
0               0                      03/25/93 Total time of visit (in minutes):\n
1               1                                    6/18/85 Primary Care Doctor:\n
2               2          sshe plans to move as of 7/8/71 In-Home Services: None\n
3               3                             7 on 9/27/75 Audit C Score Current:\n
4               4  2/6/96 sleep studyPain Treatment Pain Level (Numeric Scale): 7\n


### Rows 0-124
mm/dd/yy OR mm/dd/yyyy

In [3]:
s[0:125].head()

0                        03/25/93 Total time of visit (in minutes):\n
1                                      6/18/85 Primary Care Doctor:\n
2            sshe plans to move as of 7/8/71 In-Home Services: None\n
3                               7 on 9/27/75 Audit C Score Current:\n
4    2/6/96 sleep studyPain Treatment Pain Level (Numeric Scale): 7\n
dtype: object

In [4]:
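# Caution: the leading `.*` is greedy and can swallow the first digit of the month
# (row 0, "03/25/93", comes back with mm = 3); the combined cell further down drops it.
first = df['string'].str.extractall(r'.*(?P<mm>\d{1,2})[/-](?P<dd>\d{1,2})[/-](?P<yyyy>\d{2,4}).*')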
first[0:125]

         mm  dd yyyy
  match
0 0       3  25   93
1 0       6  18   85
2 0       7   8   71
3 0       9  27   75
4 0       2   6   96
5 0       7  06   79
6 0       5  18   78
7 0       0  24   89
8 0       3   7   86
9 0       4  10   71
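
The single-digit months in this output (for example `mm = 0` in row 7) are what a greedy `.*` prefix produces. The sketch below reproduces the effect with the standard `re` module on a made-up note; the text in `note` is hypothetical, not a line from `dates.txt`.

```python
import re

DATE = r'(?P<mm>\d{1,2})[/-](?P<dd>\d{1,2})[/-](?P<yyyy>\d{2,4})'
note = 'seen on 10/24/89 for follow-up'  # hypothetical note text

# With a greedy prefix, `.*` keeps as many characters as it can and leaves
# the month group only the single digit it needs in order to match.
with_prefix = re.search(r'.*' + DATE, note)
without_prefix = re.search(DATE, note)

print(with_prefix.group('mm'))     # 0
print(without_prefix.group('mm'))  # 10
```

Dropping the prefix lets the month group start at the first digit of the date, so both digits are captured.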


### Rows 125-193 AND 228-342
dd mon yyyy OR mon yyyy

In [426]:
s.loc[125:194].head()
s.loc[228:343].head()

125    s The patient is a 44 year old married Caucasian woman, unemployed Decorator, living with husban...
126                                           .10 Sep 2004 - Intake at EEC for IOP but did not follow up\n
127                          see above and APS eval of 26 May 1982 Social History Marital Status: Single\n
128                        Tbooked for intake appointment at Sierra Vista, Chongging, WY on 28 June 2002\n
129                                                                      06 May 1972 SOS-10 Total Score:\n
dtype: object

In [621]:
s.loc[[233, 240, 244, 270, 273, 306, 311, 317, 321, 328, 329, 338, 339]]

233                              Dr. Gloria English, who conducted an initial consultation in July, 1990\n
240            )- Venlafaxine 37.5mg daily: May, 2011: self-discontinued due to side effects (dizziness)\n
244    s Mr. Moss is a 27-year-old, Caucasian, engaged veteran of the Navy. He was previously scheduled...
270                                                                       May, 2006 Primary Care Doctor:\n
273            ) - Zoloft 100 mg daily: February, 2010 : self-discontinued due to side effects (unknown)\n
306                                                                     May, 2004 Hx of Brain Injury: No\n
311               - Prozac 20 mg daily:  February, 1995: self-discontinued due to side effects (unknown)\n
317    . Psychosocial: lives w/ father, looking to get own apartment. Lives in Saluda. Fa. has roommate...
321                                                                   2June, 1999 Audit C Score Current:\n
328    s Pt reports long Hx of drug a

In [680]:
s.loc[328]

's Pt reports long Hx of drug addiction. PLEASE SEE NSC SUPPLEMENTAL NOTE on this same date for details about substance use Hx. Her drugs of choice are opiates and benzodiazepines. She said that she used cocaine about once per month when she was on methadone. She got off of methadone in May, 2001 (7 months ago). She reports using it cocaine 1 x since getting off methadone. She states she has been abstinent from opiates for 7 months. Pt states: "I have no desire to use an opiate or a drug at all. I don\'t have a desire to use heroin or have any craving for it. I can\'t stand that I am all of the place, I can\'t have a conversation, I can\'t sit down and watch a movie - my ADHD is that bad. I just want my ADHD treated. The heroin takes the racing thoughts away."  She is primarily seeking treatment for ADHD. She reps her substance abuse has been to self-medicate her ADHD and believes that if she were treated for ADHD she would stop abusing substances. Currently, Pt\'s PCP, Dr. Michael Yar

In [699]:
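# First attempt: month name followed directly by the year
second = df['string'].str.extractall(r'(?P<dd>\d{1,2} )?(?P<mm>Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)(?:[a-z]* )?(?P<yyyy>\d{4})')

# Second attempt (overwrites the first): also skips non-digits such as ", " between month and year
second = df['string'].str.extractall(r'(?P<dd>\d{1,2} )?(?P<mm>Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)(?:[a-z]* )?(?:\D*)(?P<yyyy>\d{4})')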

# second.loc[125:194]
second.loc[228:343]
second.loc[[233, 240, 244, 270, 273, 306, 311, 317, 321, 328, 329, 338, 339]]

            dd   mm  yyyy
    match
233 0      NaN  Jul  1990
240 0      NaN  May  2011
244 0      NaN  Jan  2013
270 0      NaN  May  2006
273 0      NaN  Feb  2010
306 0      NaN  May  2004
311 0      NaN  Feb  1995
317 0      NaN  Mar  1975
321 0      NaN  Jun  1999
328 0      NaN  May  2001
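
As a cross-check on the `dd mon yyyy` and `mon yyyy` variants, the same idea can be sketched with the standard `re` module. The helper name `parse_month_year` and the exact pattern are assumptions for illustration; this pattern is a little stricter than the one in the cell above because it allows only an optional `.` or `,` plus whitespace between the month and the year.

```python
import re

MONTHS = {'Jan': 1, 'Feb': 2, 'Mar': 3, 'Apr': 4, 'May': 5, 'Jun': 6,
          'Jul': 7, 'Aug': 8, 'Sep': 9, 'Oct': 10, 'Nov': 11, 'Dec': 12}

PATTERN = re.compile(
    r'(?:(?P<dd>\d{1,2})\s+)?'                        # optional day, then whitespace
    r'(?P<mon>' + '|'.join(MONTHS) + r')[a-z]*'       # month abbreviation plus remaining letters
    r'[.,]?\s*'                                       # optional punctuation and spacing
    r'(?P<yyyy>\d{4})'                                # four-digit year
)

def parse_month_year(text):
    """Return (year, month, day), defaulting the day to 1 (hypothetical helper)."""
    m = PATTERN.search(text)
    if m is None:
        return None
    day = int(m.group('dd')) if m.group('dd') else 1
    return int(m.group('yyyy')), MONTHS[m.group('mon')], day

print(parse_month_year('.10 Sep 2004 - Intake at EEC'))         # (2004, 9, 10)
print(parse_month_year('an initial consultation in July, 1990'))  # (1990, 7, 1)
print(parse_month_year('2June, 1999 Audit C Score Current:'))    # (1999, 6, 1)
```

One limitation to keep in mind: any capitalized word that starts with a month abbreviation and is directly followed by a year would also match, so results still need a manual look.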


### Rows 194-227
mon dd yyyy

In [428]:
s.loc[194:228].head()

194                                                  April 11, 1990 CPT Code: 90791: No medical services\n
195    MRI May 30, 2001 empty sella but no problems with endocrine functionPertinent Medical Review of ...
196    .Feb 18, 1994: made a phone call to Mom and Mom commented that he was talking very fast, hard to...
197                                       Brother died February 18, 1981 Parental/Caregiver obligations:\n
198    none; but currently has appt with new HJH PCP Rachel Salas, MD on October. 11, 2013 Other Agency...
dtype: object

In [602]:
third = df['string'].str.extractall(r'(?P<mm>Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)(?:.*) (?P<dd>\d{1,2}).? (?P<yyyy>\d{4})')
third.loc[194:228]

34
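
The `mon dd yyyy` variants listed in the assignment (dashes, a trailing period, ordinal suffixes such as `20th`) can be covered by a tighter pattern than the greedy `(?:.*)` used above. The sketch below is one possible formulation; `parse_month_day_year` is a hypothetical name, and the pattern has only been reasoned through for the sample strings shown.

```python
import re

MONTHS = {'Jan': 1, 'Feb': 2, 'Mar': 3, 'Apr': 4, 'May': 5, 'Jun': 6,
          'Jul': 7, 'Aug': 8, 'Sep': 9, 'Oct': 10, 'Nov': 11, 'Dec': 12}

PATTERN = re.compile(
    r'(?P<mon>' + '|'.join(MONTHS) + r')[a-z]*\.?'    # "Mar", "March" or "Mar."
    r'[\s-]+(?P<dd>\d{1,2})(?:st|nd|rd|th)?,?'        # day, optional ordinal suffix and comma
    r'[\s-]+(?P<yyyy>\d{4})'                          # four-digit year
)

def parse_month_day_year(text):
    """Return (year, month, day) or None (hypothetical helper)."""
    m = PATTERN.search(text)
    if m is None:
        return None
    return int(m.group('yyyy')), MONTHS[m.group('mon')], int(m.group('dd'))

print(parse_month_day_year('Mar-20-2009'))                          # (2009, 3, 20)
print(parse_month_day_year('Mar 21st, 2009'))                       # (2009, 3, 21)
print(parse_month_day_year('MD on October. 11, 2013 Other Agency'))  # (2013, 10, 11)
```

Spelling out the separators keeps the match local to the date, whereas a greedy `.*` between month and day can stretch across unrelated text in long notes.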

### Rows 343-454
mm/yyyy

In [432]:
s[343:455].head()

343                                                                          6/1998 Primary Care Doctor:\n
344    s 52 y/o MWM h/o chronic depression, anxiety, adhd.  Here for psychopharm transfer in the contex...
345                                                                      10/1973 Hx of Brain Injury: Yes\n
346                                                                          9/2005 Primary Care Doctor:\n
347                                                s 03/1980 Positive PPD: treated with INH for 6 months\n
dtype: object

In [490]:
fifth = df['string'].str.extractall(r'(?P<mm>\d{1,2})/(?P<yyyy>\d{4})')
fifth.loc[343:455]

           mm  yyyy
    match
343 0       6  1998
344 0       6  2005
345 0      10  1973
346 0       9  2005
347 0      03  1980
348 0      12  2005
349 0       5  1987
350 0       5  2004
351 0       8  1974
352 0       3  1986
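
When the `mm/yyyy` pattern is run against every row rather than just this slice, it also matches the tail of a full `mm/dd/yyyy` date. A lookbehind is one way to guard against that; the sketch below compares the two patterns on short sample strings and is meant as an illustration, not a drop-in replacement.

```python
import re

LOOSE = re.compile(r'(?P<mm>\d{1,2})/(?P<yyyy>\d{4})')
# Reject matches that are preceded by a digit or a slash, or followed by a digit
MONTH_YEAR = re.compile(r'(?<![\d/])(?P<mm>\d{1,2})/(?P<yyyy>\d{4})(?!\d)')

print(LOOSE.search('04/20/2009').groups())                   # ('20', '2009')
print(MONTH_YEAR.search('04/20/2009'))                       # None
print(MONTH_YEAR.search('s 03/1980 Positive PPD').groups())  # ('03', '1980')
```

Slicing the rows first, as the notebook does, avoids the problem in a different way; the lookbehind only matters if the pattern is applied to the whole Series.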


### Rows 455-499
yyyy

In [434]:
s.loc[455:].head()

455                                                  sHemmorage caused by probe in 1984 Medical History:\n
456    sHas been at MYH since his treaters in NE retired in 2000. Was seen in NE for 20 years. Previouy...
457    Pt joined Army reserves in 2001 and has 3 years left in this commitment.-Mental Status Exam Was ...
458    one sister from whom he is estranged due to her opiate dependence, legal conflict over mother's ...
459                     sSince 1998. Prior medication trials (including efficacy, reasons discontinued):\n
dtype: object

In [489]:
sixth = df['string'].str.extractall(r'[\s\b\w]?(\d{4})[\s\b\w]?')
sixth.loc[455:]

              0
    match
455 0      1984
456 0      2000
457 0      2001
458 0      1982
459 0      1998
460 0      2012
461 0      1991
462 0      1988
463 0      2014
464 0      2016
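
A bare `\d{4}` will also match the first four digits of any longer number. The sketch below shows that effect and a stricter alternative that assumes years fall in the 1900s or 2000s; that range, and the helper name `find_year`, are assumptions made for illustration.

```python
import re

# Assumption: a year starts with 19 or 20 and is not part of a longer digit run
YEAR = re.compile(r'(?<!\d)(?:19|20)\d{2}(?!\d)')

def find_year(text):
    """Return the first plausible year in the text, or None (hypothetical helper)."""
    m = YEAR.search(text)
    return int(m.group()) if m else None

# The looser pattern from the cell above picks up part of a five-digit code
loose = re.search(r'[\s\b\w]?(\d{4})[\s\b\w]?', 'CPT Code: 90791')
print(loose.group(1))                                       # 9079

print(find_year('CPT Code: 90791, seen in 2000'))           # 2000
print(find_year('sHemmorage caused by probe in 1984 Medical History:'))  # 1984
```

Note that inside a character class `\b` means a backspace character rather than a word boundary, so the `[\s\b\w]?` groups in the original pattern do not constrain the match much.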


### Together

In [802]:
dic = {'Jan':1, 'Feb':2, 'Mar':3, 'Apr':4, 'May':5, 'Jun':6, 'Jul':7, 'Aug':8, 'Sep':9, 'Oct':10, 'Nov':11, 'Dec':12}

# 0-124  (mm/dd/yyyy or mm/dd/yy)
first = df[:125]['string'].str.extractall(r'(?P<mm>\d{1,2})[/-](?P<dd>\d{1,2})[/-](?P<yyyy>\d{2,4}).*')
# Add 1900 to two-digit years (column-wise, instead of chained assignment into first.yyyy[i])
first['yyyy'] = first['yyyy'].apply(lambda y: str(int(y) + 1900) if len(y) == 2 else y)

# 125-193 (dd mon yyyy)
second = df[125:194]['string'].str.extractall(r'(?P<dd>\d{1,2} )?(?P<mm>Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)(?:[a-z]*) ?(?P<yyyy>\d{4})')
second.mm = second.mm.map(dic)

# 194-227 (mon dd yyyy)
third = df[194:228]['string'].str.extractall(r'(?P<mm>Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)(?:.*) (?P<dd>\d{1,2}).? (?P<yyyy>\d{4})')
third.mm = third.mm.map(dic)

# 228-342 (mon yyyy)
fourth = df[228:343]['string'].str.extractall(r'(?P<dd>\d{1,2} )?(?P<mm>Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)(?:[a-z]*)?(?:\D*)(?P<yyyy>\d{4})')
fourth.mm = fourth.mm.map(dic)
fourth['dd'] = 1

# 343-454 (mm/yyyy)
fifth = df[343:455]['string'].str.extractall(r'(?P<mm>\d{1,2})/(?P<yyyy>\d{4})')

# 455-499 (yyyy)
sixth = df[455:]['string'].str.extractall(r'[\s\b\w]?(\d{4})[\s\b\w]?')
sixth.columns = ['yyyy']  # Rename

regexs = [first, second, third, fourth, fifth, sixth]

for ex in regexs:
    # Get original index
    ex.reset_index(inplace=True)
    ex.drop('match', axis=1, inplace=True)
    ex.rename(columns={'level_0':'orig_index'}, inplace=True)

    # Fill missing columns and values
    if 'mm' not in ex.columns:
        ex['mm'] = 1
    if 'dd' not in ex.columns:
        ex['dd'] = 1    
    
# Vertical stack into one df
stack = pd.concat([first, second, third, fourth, fifth, sixth])

# Convert columns to one datetime
stack[['mm', 'dd', 'yyyy']] = stack[['mm', 'dd', 'yyyy']].astype('str')
stack['datetime'] = pd.to_datetime(stack.yyyy + ' ' + stack.mm + ' ' + stack.dd, 
                                   infer_datetime_format=True)

stack.sort_values('datetime', ascending=True)

# stack.sort_values('orig_index')

    dd mm  orig_index  yyyy   datetime
9   10  4           9  1971 1971-04-10
84  18  5          84  1971 1971-05-18
2    8  7           2  1971 1971-07-08
53  11  7          53  1971 1971-07-11
28  12  9          28  1971 1971-09-12
19   1  1         474  1972 1972-01-01
28  13  1         153  1972 1972-01-13
13  26  1          13  1972 1972-01-26
4   06  5         129  1972 1972-05-06
98  13  5          98  1972 1972-05-13
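
An alternative to concatenating strings and letting pandas guess the format is to pass numeric `year`, `month` and `day` columns to `pd.to_datetime`, which assembles dates from columns with exactly those names. The sketch below uses a tiny made-up frame; the column names and sample values are assumptions for illustration and differ from the `yyyy`/`mm`/`dd` names used in the cell above.

```python
import pandas as pd

parts = pd.DataFrame({
    'orig_index': [0, 1, 2],
    'year':  ['1993', '2004', '2010'],
    'month': ['3', '9', '1'],
    'day':   ['25', '10 ', '1'],  # trailing space, as a pattern like (\d{1,2} ) would capture
})

cols = ['year', 'month', 'day']
for col in cols:
    # Strip stray whitespace and convert to integers before assembling
    parts[col] = parts[col].str.strip().astype(int)

# errors='coerce' turns impossible dates into NaT instead of raising
parts['datetime'] = pd.to_datetime(parts[cols], errors='coerce')

print(parts['datetime'].dt.strftime('%Y-%m-%d').tolist())
# ['1993-03-25', '2004-09-10', '2010-01-01']
```

Rows that end up as `NaT` are a convenient list of notes whose extracted parts deserve a second look.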


## Final Answer

In [5]:
def date_sorter():
    
    # Skeleton
    import pandas as pd
    import re
    pd.set_option('display.max_colwidth', 100)

    doc = []
    with open('dates.txt') as file:
        for line in file:
            doc.append(line)

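    s = pd.Series(doc)
    # Build the working frame locally so the function does not rely on a notebook global
    df = pd.DataFrame({'string': s})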
    
    
    
    # Regular Expressions
    dic = {'Jan':1, 'Feb':2, 'Mar':3, 'Apr':4, 'May':5, 'Jun':6, 'Jul':7, 'Aug':8, 'Sep':9, 'Oct':10, 'Nov':11, 'Dec':12}

    # 0-124  (mm/dd/yyyy or mm/dd/yy)
    first = df[:125]['string'].str.extractall(r'(?P<mm>\d{1,2})[/-](?P<dd>\d{1,2})[/-](?P<yyyy>\d{2,4}).*')
    # Add 1900 to two-digit years (column-wise, instead of chained assignment into first.yyyy[i])
    first['yyyy'] = first['yyyy'].apply(lambda y: str(int(y) + 1900) if len(y) == 2 else y)

    # 125-193 (dd mon yyyy)
    second = df[125:194]['string'].str.extractall(r'(?P<dd>\d{1,2} )?(?P<mm>Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)(?:[a-z]*) ?(?P<yyyy>\d{4})')
    second.mm = second.mm.map(dic)

    # 194-227 (mon dd yyyy)
    third = df[194:228]['string'].str.extractall(r'(?P<mm>Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)(?:.*) (?P<dd>\d{1,2}).? (?P<yyyy>\d{4})')
    third.mm = third.mm.map(dic)

    # 228-342 (mon yyyy)
    fourth = df[228:343]['string'].str.extractall(r'(?P<dd>\d{1,2} )?(?P<mm>Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)(?:[a-z]*)?(?:\D*)(?P<yyyy>\d{4})')
    fourth.mm = fourth.mm.map(dic)
    fourth['dd'] = 1

    # 343-454 (mm/yyyy)
    fifth = df[343:455]['string'].str.extractall(r'(?P<mm>\d{1,2})/(?P<yyyy>\d{4})')

    # 455-499 (yyyy)
    sixth = df[455:]['string'].str.extractall(r'[\s\b\w]?(\d{4})[\s\b\w]?')
    sixth.columns = ['yyyy']  # Rename

    # Cleaning
    regexs = [first, second, third, fourth, fifth, sixth]

    for ex in regexs:
        # Get original index
        ex.reset_index(inplace=True)
        ex.drop('match', axis=1, inplace=True)
        ex.rename(columns={'level_0':'orig_index'}, inplace=True)

        # Fill missing columns and values
        if 'mm' not in ex.columns:
            ex['mm'] = 1
        if 'dd' not in ex.columns:
            ex['dd'] = 1    

    # Vertical stack into one df
    stack = pd.concat([first, second, third, fourth, fifth, sixth])

    # Convert columns to one datetime
    stack[['mm', 'dd', 'yyyy']] = stack[['mm', 'dd', 'yyyy']].astype('str')
    stack['datetime'] = pd.to_datetime(stack.yyyy + ' ' + stack.mm + ' ' + stack.dd, 
                                       infer_datetime_format=True)

    stack = stack.sort_values('datetime', ascending=True)
    series = pd.Series(stack['orig_index']).reset_index(drop=True)
    
    return series

In [6]:
# date_sorter()

## Sophie's Test

In [7]:
def test():
    import pandas as pd

    fun = date_sorter()
    res = 'Data Type Test (Series?): '
    res += ['Failed\n','Passed\n'][type(fun)==pd.Series]
    
    res += 'Data Shape Test ((500,)?): '
    res += ['Failed\n','Passed\n'][fun.shape==(500,)]
    
    res += 'Index Values Test (range(500)?): '
    res += ['Failed\n','Passed\n'][fun.index.tolist()==list(range(500))]
    
    res += 'Values Test (0-499): '
    res += ['Failed\n','Passed\n'][all((fun<500) & (fun>=0))]
    
    return res

# print(test())   
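
The checks above confirm the type, shape, index and value range, but not that every original index appears exactly once. A further check along these lines could be added; `is_permutation` is a hypothetical helper, not part of the original test.

```python
def is_permutation(values, n=500):
    """True if `values` contains each of 0..n-1 exactly once."""
    return sorted(values) == list(range(n))

print(is_permutation([2, 4, 0, 1, 3], n=5))  # True
print(is_permutation([2, 2, 0, 1, 3], n=5))  # False (2 repeated, 4 missing)

# Possible usage with the function above (left commented out, like the other calls):
# print(is_permutation(date_sorter().tolist()))
```

A failure here would suggest that some note produced more than one match, or none, during extraction.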