<font color="green">

## Home task: Text Mining
</font>

The task is to extract relevant information from messy medical data using regular expressions.

Each line of the `dates.txt` file corresponds to a medical note. Each note has a date that needs to be extracted, but each date is encoded in one of many formats.

The goal is to correctly identify all of the different date variants encoded in this dataset and to properly normalize and sort the dates. 

1) Extract the date strings. Here is a list of some of the variants you might encounter in this dataset:
* `04/20/2009`; `04/20/09`; `4/20/09`; `4/3/09`
* `Mar-20-2009`; `Mar 20, 2009`; `March 20, 2009`; `Mar. 20, 2009`; `Mar 20 2009`
* `20 Mar 2009`; `20 March 2009`; `20 Mar. 2009`; `20 March, 2009`
* `Mar 20th, 2009`; `Mar 21st, 2009`; `Mar 22nd, 2009`
* `Feb 2009`; `Sep 2009`; `Oct 2010`
* `6/2008`; `12/2009`
* `2009`; `2010`

2) Normalize the extracted dates according to the following rules:  
* Assume all dates in xx/xx/xx format are mm/dd/yy
* Assume all dates where the year is encoded in only two digits are from the 1900s (e.g. 1/5/89 is January 5th, 1989)
* If the day is missing (e.g. 9/2009), assume it is the first day of the month (e.g. September 1, 2009).
* If the month is missing (e.g. 2010), assume it is the first of January of that year (e.g. January 1, 2010).

3) Sort the records in ascending chronological order.
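
The normalization rules from step 2 can be sketched with the standard library alone. This is a minimal illustration, not the solution used later in this notebook: `normalize` is a hypothetical helper, and it assumes the date has already been split into numeric parts.

```python
from datetime import date


def normalize(year, month=None, day=None):
    """Apply the normalization rules to already-parsed numeric parts.

    Hypothetical helper for illustration only.
    """
    year = int(year)
    if year < 100:
        # Two-digit years are assumed to be from the 1900s
        year += 1900
    # A missing month defaults to January, a missing day to the first of the month
    month = int(month) if month is not None else 1
    day = int(day) if day is not None else 1
    return date(year, month, day)


print(normalize(89, 1, 5).isoformat())   # 1/5/89  -> 1989-01-05
print(normalize(2009, 9).isoformat())    # 9/2009  -> 2009-09-01
print(normalize(2010).isoformat())       # 2010    -> 2010-01-01
```

Returning a `datetime.date` has the side effect of rejecting impossible dates, since the constructor raises `ValueError` for them.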




In [1]:
import os

CWD = os.getcwd()                       # Current working directory
FILE = os.path.join(CWD, 'dates.txt')   # Path to the file containing medical data


with open(FILE, 'r') as f:
    content = f.read()
print(content[:500]) 

03/25/93 Total time of visit (in minutes):
6/18/85 Primary Care Doctor:
sshe plans to move as of 7/8/71 In-Home Services: None
7 on 9/27/75 Audit C Score Current:
2/6/96 sleep studyPain Treatment Pain Level (Numeric Scale): 7
.Per 7/06/79 Movement D/O note:
4, 5/18/78 Patient's thoughts about current substance abuse:
10/24/89 CPT Code: 90801 - Psychiatric Diagnosis Interview
3/7/86 SOS-10 Total Score:
(4/10/71)Score-1Audit C Score Current:
(5/11/85) Crt-1.96, BUN-26; AST/ALT-16/22; WBC_12.6Activ


Note: This snippet is shown only for a high-level overview. You may read the file with pandas or any other convenient tool.

In [2]:
import pandas as pd


# Read lines from the file
with open(FILE, 'r') as f:
    lines = f.readlines()

# Construct pandas Series based on the read lines
data = pd.Series(lines)
data

0           03/25/93 Total time of visit (in minutes):\n
1                         6/18/85 Primary Care Doctor:\n
2      sshe plans to move as of 7/8/71 In-Home Servic...
3                  7 on 9/27/75 Audit C Score Current:\n
4      2/6/96 sleep studyPain Treatment Pain Level (N...
                             ...                        
495    1979 Family Psych History: Family History of S...
496    therapist and friend died in ~2006 Parental/Ca...
497                         2008 partial thyroidectomy\n
498    sPt describes a history of sexual abuse as a c...
499    . In 1980, patient was living in Naples and de...
Length: 500, dtype: object

In [3]:
# Regex patterns to extract dates from text data
EXTRACT_PATTERNS = [
    
    # Matches '04/20/2009', '4/20/89' etc
    r'\d{1,2}[/-]\d{1,2}[/-](?:19|20)?\d{2}',
    
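    # Matches '22nd March 2009', '2 Mar. 2009', '20 Mar 2009' etc
    # Note: a comma after the month ('20 March, 2009') is not matched here; such a date
    # falls through to the month-year pattern below and loses its day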
    r'\d{1,2}(?:st|nd|rd|th)?[ ,-]{1,2}(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\w*[ .-]{1,2}(?:19|20)?\d{2}',
    
    # Matches 'Mar-20-2009', 'Mar 20, 2009', 'Mar 20th, 2009' etc
    r'(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\w*[ .,-]{1,2}\d{1,2}(?:st|nd|rd|th)?[ ,-]{1,2}(?:19|20)?\d{2}',
    
    # Matches 'Mar 2009', 'Mar. 2009', 'March 2009' etc
    r'(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\w*[ .,-]{1,2}(?:19|20)?\d{2}',
    
    # Matches '6/2008', '02/1989' etc
    r'\d{1,2}/(?:19|20)\d{2}',
    
    # Matches '2008', '1989' etc
    r'(?:19|20)\d{2}'
]


def extract_dates(texts):
    '''
    Extracts dates from the text data
    :param texts: text data
    :type texts: pandas Series[str]
    :return: `(dates, na_masks)` where `dates` is pandas Series[str] containing extracted dates,
    and `na_masks` is list of pandas Series[bool] which are NA masks of dates for each pattern
    :rtype: tuple[Series[str], list[Series[bool]]]
    '''
    n_patterns = len(EXTRACT_PATTERNS)

    # Masks that indicate entries not matching a specific pattern (used to access dates matching a pattern)
    na_masks = [pd.Series(True, index=texts.index)]
    extracted = []   
    for i in range(n_patterns):

        # Extract the dates matching this pattern from texts that matched none of the earlier patterns
        dates = texts[pd.concat(na_masks, axis=1).all(axis=1)].str.extract('(' + EXTRACT_PATTERNS[i] + ')').squeeze()
        
        # Add mask for the pattern
        na_masks.append(pd.concat([na_masks[0], dates.isna()], axis=1).all(axis=1))
        
        # Add the extracted dates
        extracted.append(dates[~dates.isna()])

    # Merge all extracted dates together
    return pd.concat(extracted), na_masks[1:]
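
The order of `EXTRACT_PATTERNS` matters: each text is claimed by the first pattern that matches it, so more specific patterns have to come before more general ones. The sketch below illustrates that idea with plain `re` on made-up strings. `PATTERNS` is a simplified subset and `first_match` is a hypothetical helper, neither is the notebook's actual implementation.

```python
import re

# Illustrative, simplified patterns ordered from most to least specific
PATTERNS = [
    r'\d{1,2}/\d{1,2}/\d{2,4}',   # mm/dd/yy or mm/dd/yyyy
    r'\d{1,2}/\d{4}',             # mm/yyyy
    r'\d{4}',                     # yyyy
]


def first_match(text, patterns=PATTERNS):
    """Return (pattern_index, matched_text) for the first matching pattern, else None."""
    for i, pattern in enumerate(patterns):
        m = re.search(pattern, text)
        if m:
            return i, m.group(0)
    return None


print(first_match('seen on 4/20/2009 for follow-up'))   # (0, '4/20/2009')
print(first_match('since 6/2008'))                      # (1, '6/2008')
print(first_match('In 1980, patient moved'))            # (2, '1980')

# With the order reversed, the bare-year pattern claims the text too early
print(first_match('since 6/2008', PATTERNS[::-1]))      # (0, '2008')
```

The last call shows the failure mode the ordering prevents: the month is silently dropped because the year-only pattern ran first.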

In [4]:
# Call the function to extract dates from the text data
dates, masks = extract_dates(data)
dates

0      03/25/93
1       6/18/85
2        7/8/71
3       9/27/75
4        2/6/96
         ...   
495        1979
496        2006
497        2008
498        2005
499        1980
Name: 0, Length: 500, dtype: object

In [5]:
# Check the masks for each regex pattern
pd.concat(masks, axis=1)

         0     1     2     3     4      5
0    False  True  True  True  True   True
1    False  True  True  True  True   True
2    False  True  True  True  True   True
3    False  True  True  True  True   True
4    False  True  True  True  True   True
..     ...   ...   ...   ...   ...    ...
495   True  True  True  True  True  False
496   True  True  True  True  True  False
497   True  True  True  True  True  False
498   True  True  True  True  True  False


In [6]:
# Regex patterns to split dates into 'day', 'month' and 'year' data
SPLIT_PATTERNS = [
    r'(?P<month>\d{1,2})[/-](?P<day>\d{1,2})[/-](?P<year>(?:19|20)?\d{2})',
    r'(?P<day>\d{1,2})(?:st|nd|rd|th)?[ ,-]{1,2}(?P<month>(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\w*)[ .-]{1,2}(?P<year>(?:19|20)?\d{2})',
    r'(?P<month>(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\w*)[ .,-]{1,2}(?P<day>\d{1,2})(?:st|nd|rd|th)?[ ,-]{1,2}(?P<year>(?:19|20)?\d{2})',
    r'(?P<month>(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\w*)[ .,-]{1,2}(?P<year>(?:19|20)?\d{2})',
    r'(?P<month>\d{1,2})/(?P<year>(?:19|20)\d{2})',
    r'(?P<year>(?:19|20)\d{2})'
]


def split_dates(dates, na_masks):
    '''
    Splits dates into three groups: `day`, `month` and `year`
    :param dates: dates in different formats
    :type dates: pandas Series[str]
    :param na_masks: NA masks of dates for each pattern 
    :type na_masks: list[pandas Series[bool]]
    :return: values of day, month and year for corresponding dates
    (for dates not having day or month number NA values are set)
    :rtype: pandas DataFrame with three columns: 'day', 'month' and 'year'
    '''
    n_patterns = len(SPLIT_PATTERNS)

    splitted = []
    for i in range(n_patterns):

        # Use pattern and corresponding mask to split date into 'day', 'month' and 'year' groups
        splitted.append(dates[~na_masks[i]].str.extract(SPLIT_PATTERNS[i]))

    # Merge all split date groups together
    return pd.concat(splitted)
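
Named groups (`(?P<name>...)`) are what let each split pattern label its pieces. The sketch below shows the mechanism with plain `re` on a few of the textual variants from the task description. `PATTERN` is an illustrative pattern written for this example, not one of the `SPLIT_PATTERNS` above.

```python
import re

# Illustrative pattern: month name, day with optional ordinal suffix, four-digit year
PATTERN = re.compile(
    r'(?P<month>[A-Za-z]{3})[a-z]*\.?[ -]'
    r'(?P<day>\d{1,2})(?:st|nd|rd|th)?,?[ -]'
    r'(?P<year>\d{4})'
)

for text in ['Mar-20-2009', 'March 20, 2009', 'Mar. 20, 2009', 'Mar 21st, 2009']:
    # groupdict() maps each group name to the text it captured
    print(text, '->', PATTERN.fullmatch(text).groupdict())
```

In the notebook the same idea runs through `Series.str.extract`, where each named group becomes a column. Patterns without a `day` or `month` group simply produce no such column, and `pd.concat` fills those positions with NaN.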

In [7]:
# Call the function to split the dates into groups
date_groups = split_dates(dates, masks)
date_groups

    month  day  year
0      03   25    93
1       6   18    85
2       7    8    71
3       9   27    75
4       2    6    96
..    ...  ...   ...
495   NaN  NaN  1979
496   NaN  NaN  2006
497   NaN  NaN  2008
498   NaN  NaN  2005


In [8]:
import calendar

# Dictionary mapping the abbreviations of months (first 3 letters) to their numbers
MONTH_NUMBERS = {abbr: str(number) for number, abbr in enumerate(calendar.month_abbr)}


def normalize_dates(dates):
    '''
    Normalizes dates and converts them into ISO date format: `yyyy-mm-dd`
    :param dates: dates represented as groups: `day`, `month` and `year`
    :type dates: pandas DataFrame with columns 'day', 'month', 'year'
    :return: dates converted into ISO format
    :rtype: pandas Series[str]
    '''
    # Replace missing day and month values with '1'
    normalized = dates.fillna('1')

    # Replace names and abbreviations of months with month numbers
    normalized['month'] = normalized['month'].str.replace(r'(?:\b\w{3}\b|\b\w{4,}\b)', lambda x: MONTH_NUMBERS[x.group(0)[:3]], regex=True)

    # Add '19' before years encoded with two digits
    normalized['year'] = normalized['year'].str.replace(r'\b\d{2}\b', lambda x: f'19{x.group(0)}', regex=True)

    # Convert date groups to ISO format yyyy-mm-dd
    return normalized.apply(lambda x: f"{x['year']}-{x['month'].zfill(2)}-{x['day'].zfill(2)}", axis=1)
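
Two caveats about this step are worth a small sketch. `calendar.month_abbr` is locale-dependent, so the lookup only works when the process locale yields English abbreviations, and plain string formatting does not notice impossible dates. The example below is one possible alternative under those assumptions. `MONTH_TO_NUMBER` and `to_iso` are hypothetical names, and the inputs are assumed to be clean strings such as `'March'` or `'03'`.

```python
from datetime import date

# Fixed English abbreviations, independent of the process locale
MONTHS = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
          'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
MONTH_TO_NUMBER = {abbr: number for number, abbr in enumerate(MONTHS, start=1)}


def to_iso(year, month, day):
    """Build a yyyy-mm-dd string; raises ValueError for impossible dates."""
    if month.isalpha():
        # Month names are looked up by their first three letters
        month_number = MONTH_TO_NUMBER[month[:3].title()]
    else:
        month_number = int(month)
    return date(int(year), month_number, int(day)).isoformat()


print(to_iso('2009', 'March', '20'))   # 2009-03-20
print(to_iso('1993', '03', '25'))      # 1993-03-25

try:
    to_iso('2009', 'Feb', '30')
except ValueError as exc:
    print('rejected:', exc)
```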

In [9]:
# Call the function to convert dates into ISO format
iso_dates = normalize_dates(date_groups)
iso_dates

0      1993-03-25
1      1985-06-18
2      1971-07-08
3      1975-09-27
4      1996-02-06
          ...    
495    1979-01-01
496    2006-01-01
497    2008-01-01
498    2005-01-01
499    1980-01-01
Length: 500, dtype: object

In [10]:
# Sort the dates in chronological order
result_dates = pd.DataFrame({'res': iso_dates, 'index': iso_dates.index})
result_dates.sort_values(['res', 'index'], ignore_index=True)

            res  index
0    1971-04-10      9
1    1971-05-18     84
2    1971-07-08      2
3    1971-07-11     53
4    1971-09-12     28
..          ...    ...
495  2016-05-01    427
496  2016-05-30    141
497  2016-10-13    186
498  2016-10-19    161
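
Sorting the ISO strings directly works because zero-padded `yyyy-mm-dd` strings with four-digit years order lexicographically in the same way as the dates they represent. The check below illustrates this on made-up records (not taken from `dates.txt`), with the original row index as a tie-breaker.

```python
from datetime import date

# Made-up sample of (ISO date string, original row index) pairs
records = [('2009-03-20', 4), ('1989-01-05', 2), ('2009-03-20', 1), ('1971-07-08', 3)]

# Tuples compare element by element: ISO string first, then index
by_string = sorted(records)

# Same ordering computed on real date objects
by_date = sorted(records, key=lambda r: (date.fromisoformat(r[0]), r[1]))

print(by_string)
print(by_string == by_date)   # True
```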


<font color="blue">

### Expected Output
</font>

<table align= 'left'>
    <tr><td></td><td>res</td><td>index</td></tr>
    <tr><td>0</td><td>1971-04-10</td><td>9</td></tr>
    <tr><td>1</td><td>1971-05-18</td><td>84</td></tr>
    <tr><td>2</td><td>1971-07-08</td><td>2</td></tr>
    <tr><td>3</td><td>1971-07-11</td><td>53</td></tr>
    <tr><td>4</td><td>1971-09-12</td><td>28</td></tr>
    <tr><td>...</td><td>...</td><td>...</td></tr>
    <tr><td>495</td><td>2016-05-01</td><td>427</td></tr>
    <tr><td>496</td><td>2016-05-30</td><td>141</td></tr>
    <tr><td>497</td><td>2016-10-13</td><td>186</td></tr>
    <tr><td>498</td><td>2016-10-19</td><td>161</td></tr>
    <tr><td>499</td><td>2016-11-01</td><td>413</td></tr>
</table>