In this assignment, you'll be working with messy medical data and using regex to extract relevant information from it.

Each line of the `dates.txt` file corresponds to a medical note. Each note has a date that needs to be extracted, but each date is encoded in one of many formats.

The goal of this assignment is to correctly identify all of the different date variants encoded in this dataset and to properly normalize and sort the dates. 

Here is a list of some of the variants you might encounter in this dataset:
* 04/20/2009; 04/20/09; 4/20/09; 4/3/09
* Mar-20-2009; Mar 20, 2009; March 20, 2009; Mar. 20, 2009; Mar 20 2009
* 20 Mar 2009; 20 March 2009; 20 Mar. 2009; 20 March, 2009
* Mar 20th, 2009; Mar 21st, 2009; Mar 22nd, 2009
* Feb 2009; Sep 2009; Oct 2010
* 6/2008; 12/2009
* 2009; 2010

Once you have extracted these date patterns from the text, the next step is to sort them in ascending chronological order according to the following rules:
* Assume all dates in xx/xx/xx format are mm/dd/yy
* Assume all dates where the year is encoded in only two digits are from the 1900s (e.g. 1/5/89 is January 5th, 1989)
* If the day is missing (e.g. 9/2009), assume it is the first day of the month (e.g. September 1, 2009).
* If the month is missing (e.g. 2010), assume it is the first of January of that year (e.g. January 1, 2010).
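
The rules above can be sketched for the purely numeric formats as follows. This is a minimal illustration, not the assignment's reference solution: the helper name `normalize` is hypothetical, it assumes the date string has already been extracted from the note, and month-name variants are deliberately left out.

```python
import re
from datetime import date

def normalize(text):
    """Turn a numeric date string into a datetime.date using the rules above.

    Hypothetical helper: handles mm/dd/yy, mm/dd/yyyy, mm/yyyy and yyyy only.
    """
    parts = [int(p) for p in re.split(r"[/-]", text)]
    if len(parts) == 3:        # mm/dd/yy or mm/dd/yyyy
        month, day, year = parts
    elif len(parts) == 2:      # mm/yyyy: day is missing, use the 1st
        month, year = parts
        day = 1
    else:                      # yyyy: month and day are missing, use January 1
        (year,) = parts
        month, day = 1, 1
    if year < 100:             # two-digit years are taken to be in the 1900s
        year += 1900
    return date(year, month, day)

print(normalize("1/5/89"))   # 1989-01-05
print(normalize("9/2009"))   # 2009-09-01
print(normalize("2010"))     # 2010-01-01
```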

With these rules in mind, find the correct date in each note and return a pandas Series in chronological order of the original Series' indices.

For example, if the original series was this:
    
    0    1999
    1    2010
    2    1978
    3    2015
    4    1985

Your function should return this:

    0    2
    1    4
    2    0
    3    1
    4    3
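
One way to reproduce this small example, assuming the values are already sortable, is to sort the Series and keep the resulting index. The variable names `years` and `order` are chosen here for illustration only.

```python
import pandas as pd

# The example series from above: position -> year
years = pd.Series([1999, 2010, 1978, 2015, 1985])

# Sort by value, then keep the original index labels in their new order
order = pd.Series(years.sort_values().index)

print(order.tolist())  # [2, 4, 0, 1, 3]
```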

Your score will be calculated using [Kendall's tau](https://en.wikipedia.org/wiki/Kendall_rank_correlation_coefficient), a correlation measure for ordinal data.
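
To give a feel for the measure, here is a small pure-Python sketch of Kendall's tau for two rankings without ties (the tau-a variant). The grader's exact computation is not described here, so treat this only as an illustration; the function name `kendall_tau` and the sample rankings are made up.

```python
from itertools import combinations

def kendall_tau(a, b):
    """Kendall's tau-a for two equal-length rankings without ties."""
    concordant = discordant = 0
    for i, j in combinations(range(len(a)), 2):
        # A pair is concordant when both rankings order it the same way
        sign = (a[i] - a[j]) * (b[i] - b[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(a) * (len(a) - 1) / 2
    return (concordant - discordant) / n_pairs

perfect = [0, 1, 2, 3, 4]
one_swap = [0, 1, 2, 4, 3]     # last two items exchanged
backwards = [4, 3, 2, 1, 0]

print(kendall_tau(perfect, perfect))    # 1.0
print(kendall_tau(perfect, one_swap))   # 0.8
print(kendall_tau(perfect, backwards))  # -1.0
```

A single misplaced pair lowers the score only slightly, so an extraction that gets most dates right still scores close to 1.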

*This function should return a Series of length 500 and dtype int.*

In [1]:
import pandas as pd
import numpy as np

In [2]:
with open('dates.txt','r') as infile:
    data=infile.read()

In [12]:
df=pd.DataFrame(data.splitlines())
df.rename(columns={0:'entry'},inplace=True)

In [25]:
df.entry.loc[477]

'oEnjoys animals, had a dog x 14 yrs who died in 1994 Interpersonal Interactions/ Concerns:'

In [32]:
import calendar

In [35]:
'|'.join(list(calendar.month_abbr))

'|Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec'

In [50]:
month_regex='|'.join(list(calendar.month_name)[1:])+'|'.join(list(calendar.month_abbr))

In [52]:
month_regex

'January|February|March|April|May|June|July|August|September|October|November|December|Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec'
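
The expression above yields a valid alternation only because `calendar.month_abbr` starts with an empty string, which supplies the `|` between `December` and `Jan`. A slightly more explicit way to build the same string, sketched here with the hypothetical names `names`, `abbrs` and `pattern`, drops the empty entry from both lists and joins once. It assumes the default English locale.

```python
import calendar
import re

names = list(calendar.month_name)[1:]   # 'January' ... 'December'
abbrs = list(calendar.month_abbr)[1:]   # 'Jan' ... 'Dec'

# Full names come first so that 'March' is tried before 'Mar'
month_regex = "|".join(names + abbrs)

# Word boundaries keep the match from starting or ending inside a longer word
pattern = re.compile(r"\b(?:" + month_regex + r")\b")
print(pattern.findall("Seen in March, again on 20 Sep 2009"))  # ['March', 'Sep']
```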

In [247]:
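# Raw strings keep the regex backslashes intact; patterns run from most to least specific.
p1=r'(?:\d{1,2}[-/]\d{1,2}[-/]\d{2,4})|'  # 04/20/2009, 4-13-82
p2=r'(?:(?:'+month_regex+r')\.?,?\s?\d{1,2},?\s?\d{2,4})|'  # Mar. 20, 2009
p3=r'(?:\d{1,2}\s?(?:'+month_regex+r'),?\s?\d{2,4})|'  # 20 March, 2009 (day has 1-2 digits)
p4=r'(?:(?:'+month_regex+r')\.?,?\s?\d{2,4})|'  # Feb 2009 (escaped dot, \s instead of a literal s)
p5=r'(?:\d{1,2}/\d{4})|'  # 6/2008
p6=r'(?:[^-]\d{4}[\.,]?)|'  # 2009 preceded by any character except a hyphen
p7=r'(?:^\d{4})'  # 2009 at the very start of the note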

In [248]:
date_ser=df.entry.str.extract(r'('+p1+p2+p3+p4+p5+p6+p7+')',expand=False)  # expand=False returns a Series
date_ser

0        03/25/93
1         6/18/85
2          7/8/71
3         9/27/75
4          2/6/96
5         7/06/79
6         5/18/78
7        10/24/89
8          3/7/86
9         4/10/71
10        5/11/85
11        4/09/75
12        8/01/98
13        1/26/72
14      5/24/1990
15      1/25/2011
16        4/12/82
17     10/13/1976
18        4/24/98
19        5/21/77
20        7/21/98
21       10/21/79
22        3/03/90
23        2/11/76
24     07/25/1984
25        4-13-82
26        9/22/89
27        9/02/76
28        9/12/71
29       10/24/86
          ...    
470         y1983
471         1999.
472         .2010
473         (1975
474         1972.
475          2015
476          1989
477          1994
478         (1993
479          1996
480         2013,
481         y1974
482          1990
483          1995
484          2004
485         1987.
486          1973
487          1992
488          1977
489          1985
490          2007
491          2009
492         1986.
493         r1978
494       

In [250]:
exceptions=date_ser[pd.to_datetime(date_ser,errors='coerce').isnull()].str.slice(1,5)

In [252]:
exceptions.index

Int64Index([462, 466, 470, 472, 473, 478, 481, 493, 496], dtype='int64')

In [259]:
date_ser.loc[exceptions.index]=exceptions
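
The slicing step is needed because the year-only pattern consumes the character in front of the year. As an alternative sketch (not what this notebook does), lookarounds can check the neighbouring characters without capturing them, so no clean-up pass is required. The name `year_only` and the sample strings are hypothetical.

```python
import re

# (?<![-\d]) : the year is not preceded by a hyphen or another digit
# (?!\d)     : the year is not followed by another digit
year_only = re.compile(r"(?<![-\d])\d{4}(?!\d)")

samples = ["y1983 seen", "(1975 onset", "since 1994 Interpersonal", "1999. Stable"]
years = [year_only.search(s).group() for s in samples]
print(years)  # ['1983', '1975', '1994', '1999']
```

Because a negative lookbehind also succeeds at the start of the string, this single pattern covers the case that the separate `^\d{4}` alternative handles, and it leaves trailing punctuation out of the match.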

In [262]:
date_ser.loc[450:500]

450     1/1994
451    12/2004
452     3/2003
453     7/1991
454     7/1982
455       1984
456      2000.
457       2001
458      1982,
459      1998.
460       2012
461      1991,
462       1988
463       2014
464       2016
465      1976,
466       1981
467       2011
468      1997,
469      2003.
470       1983
471      1999.
472       2010
473       1975
474      1972.
475       2015
476       1989
477       1994
478       1993
479       1996
480      2013,
481       1974
482       1990
483       1995
484       2004
485      1987.
486       1973
487       1992
488       1977
489       1985
490       2007
491       2009
492      1986.
493       1978
494       2002
495       1979
496       2006
497       2008
498       2005
499      1980,
Name: entry, dtype: object

In [263]:
df=pd.to_datetime(date_ser)

In [265]:
df.iloc[:10]

0   1993-03-25
1   1985-06-18
2   1971-07-08
3   1975-09-27
4   1996-02-06
5   1979-07-06
6   1978-05-18
7   1989-10-24
8   1986-03-07
9   1971-04-10
Name: entry, dtype: datetime64[ns]

In [267]:
tmp=df.iloc[:10]

In [272]:
tmp

0   1993-03-25
1   1985-06-18
2   1971-07-08
3   1975-09-27
4   1996-02-06
5   1979-07-06
6   1978-05-18
7   1989-10-24
8   1986-03-07
9   1971-04-10
Name: entry, dtype: datetime64[ns]

In [274]:
tmp.sort_values()

9   1971-04-10
2   1971-07-08
3   1975-09-27
6   1978-05-18
5   1979-07-06
1   1985-06-18
8   1986-03-07
7   1989-10-24
0   1993-03-25
4   1996-02-06
Name: entry, dtype: datetime64[ns]

In [276]:
ans=np.argsort(df)
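
The steps explored above could be collected into a single function along the following lines. This is a rough sketch under stated assumptions rather than the notebook's own solution: `MON`, `PATTERNS`, `parse_date`, `date_sorter` and the sample notes are all hypothetical, the patterns are tried one after another instead of as one alternation, and invalid dates (for example a month of 13) are not handled.

```python
import re
import pandas as pd

MONTHS = ["jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec"]
MON = r"\b(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*"

# Tried in order, from most to least specific
PATTERNS = [
    r"(?P<m>\d{1,2})[/-](?P<d>\d{1,2})[/-](?P<y>\d{4}|\d{2})\b",
    r"(?P<mon>" + MON + r")[.,-]?\s*(?P<d>\d{1,2})(?:st|nd|rd|th)?[,-]?\s*(?P<y>\d{4})",
    r"(?P<d>\d{1,2})\s+(?P<mon>" + MON + r")[.,]?\s*(?P<y>\d{4})",
    r"(?P<mon>" + MON + r")[.,]?\s*(?P<y>\d{4})",
    r"(?P<m>\d{1,2})/(?P<y>\d{4})",
    r"(?<!\d)(?P<y>\d{4})(?!\d)",
]

def parse_date(note):
    """Return the first date found in a note as a Timestamp, or NaT."""
    for pat in PATTERNS:
        match = re.search(pat, note)
        if match is None:
            continue
        g = match.groupdict()
        year = int(g["y"])
        if year < 100:                      # two-digit years are in the 1900s
            year += 1900
        if g.get("mon"):                    # month given by name
            month = MONTHS.index(g["mon"][:3].lower()) + 1
        else:                               # numeric month, or January if missing
            month = int(g.get("m") or 1)
        day = int(g.get("d") or 1)          # first of the month if missing
        return pd.Timestamp(year=year, month=month, day=day)
    return pd.NaT

def date_sorter(notes):
    """Original indices of the notes, in ascending chronological order."""
    dates = notes.apply(parse_date)
    return pd.Series(dates.sort_values(kind="mergesort").index)

notes = pd.Series([
    "Seen 04/20/89 for follow-up",
    "Admitted Mar 20th, 2009",
    "Onset 6/2008",
    "had a dog who died in 1994",
    "20 March 1985 initial visit",
])
print(date_sorter(notes).tolist())  # [4, 0, 3, 2, 1]
```

A stable sort (`mergesort`) is used so that notes sharing a date keep their original relative order.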