https://www.hackerrank.com/challenges/a-text-processing-warmup/problem

In [1]:
import numpy as np
import pandas as pd

This problem will help you warm up and practice basic text and string processing techniques. This will be a first step towards more complex Text and Natural Language Processing and Analysis tasks.

You will be given a fragment of text.

In this fragment, you need to identify the articles used (i.e., 'a', 'an', 'the').

You also need to identify dates, which may be expressed in a variety of ways, such as '15/11/2012', '15/11/12', '15th March 1999', '15th March 99' or '20th of March, 1999'.

You can make the following assumptions:

1) In a date, the year and day will always be in numeric form, so you don't have to worry about "fifteenth" or "twentieth". The month may be either numeric (1-12) or written by name (January-December, Jan-Dec).

2) This is a bit open-ended, and somewhat intentionally so. The aim is for you to write something that recognizes as many of the common patterns in which dates appear in text as possible.

3) Most of the test cases are Wikipedia articles. Looking at the common formats in which dates occur in those will help.

4) Dates could either be in the form: Month followed by Day followed by Year, or Day followed by Month followed by Year.

5) The day could be in the form of either (1,2,3,...31) or (1st, 2nd, 3rd...31st).

A fragment is a valid date if it contains day, month and year information (all three of them should be present). To extract date information, you will need to try detecting different kinds of representations of dates, some of which have been shown above. The more patterns you match and identify correctly, the greater your score will be.

Input Format

The first line contains the number of test cases T. This is followed by T text fragments, each on a single line and each followed by a blank line. Each fragment is a paragraph of text in which you need to identify the articles and dates.

In total, the input file has 2T+1 lines. The last one is the blank line after the last text fragment.
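The input layout described above can be parsed without pandas. The following is a minimal sketch, assuming the whole input is available as one string; `read_fragments` and the sample text are hypothetical names and data made up for illustration, not part of the challenge.

```python
def read_fragments(raw_text):
    """Return the T text fragments from input laid out as described above."""
    lines = raw_text.splitlines()
    expected = int(lines[0].strip())  # first line holds T
    # Blank separator lines carry no information, so drop them.
    fragments = [line.strip() for line in lines[1:] if line.strip()]
    # Keep at most T fragments in case the input has stray trailing lines.
    return fragments[:expected]


# Made-up sample with T = 2, i.e. 2T+1 = 5 lines.
sample = "2\nThis is a first fragment.\n\nAn example dated 15/11/2012.\n\n"
print(read_fragments(sample))
# → ['This is a first fragment.', 'An example dated 15/11/2012.']
```

In a submission the string would typically come from `sys.stdin.read()` rather than from a literal.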

Output Format

4T lines, four lines of output for each test case. First line -> number of occurrences of 'a'. Second line -> number of occurrences of 'an'. Third Line -> number of occurrences of 'the'. Fourth Line -> number of occurrences of date information.

In [41]:
df = pd.read_fwf('input00.txt', header = None)

In [51]:
df = df[~df[0].isnull()][[0]][1:].reset_index(drop = True)

In [52]:
df

Unnamed: 0,0
0,"Delhi, is a metropolitan and the capital regio..."
1,"Mumbai, also known as Bombay, is the capital c..."
2,New York is a state in the Northeastern region...
3,The Indian Rebellion of 1857 began as a mutiny...
4,The Boston Tea Party (referred to in its time ...


In [58]:
import collections

for document in df[0]:   
    counter = collections.Counter([i.lower() for i in document.split(' ')])
    a_cnt = counter['a']
    an_cnt = counter['an']
    the_cnt = counter['the']
    print(a_cnt)
    print(an_cnt)
    print(the_cnt)

1
0
4
1
0
5
1
0
6
1
0
4
3
0
4


In [60]:
df[0][2]

'New York is a state in the Northeastern region of the United States. New York is the 27th-most extensive, the 3rd-most populous, and the 7th-most densely populated of the 50 United States.'

In [62]:
df[0][3]

"The Indian Rebellion of 1857 began as a mutiny of sepoys of the East India Company's army on 10 May 1857, in the town of Meerut, and soon escalated into other mutinies and civilian rebellions largely in the upper Gangetic plain and central India,"

In [65]:
month_list = ['jan', 'feb', 'mar', 'apr', 'may', 'jun', 'jul', 'aug', 'sep', 'oct', 'nov', 'dec',
              'january', 'february', 'march', 'april', 'june', 'july', 'august', 'september', 'october', 'november', 'december']

for document in df[0]:
    words = [i.lower() for i in document.split(" ")]
    counter = collections.Counter(words)
    date_cnt = 0
    
    for month in month_list:
        date_cnt += counter[month]
    
    print(date_cnt)

0
0
0
1
0
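Splitting on spaces leaves punctuation attached to words, so a token such as "May," would not be counted by the loop above. A small sketch of one way around that, assuming it is acceptable to treat any run of letters as a token; `count_month_words` and the sample sentence are made up for illustration.

```python
import re
from collections import Counter

MONTH_WORDS = {
    "jan", "feb", "mar", "apr", "may", "jun", "jul", "aug", "sep", "oct", "nov", "dec",
    "january", "february", "march", "april", "june", "july", "august",
    "september", "october", "november", "december",
}


def count_month_words(text):
    """Count tokens that spell a month name, ignoring case and punctuation."""
    tokens = re.findall(r"[a-z]+", text.lower())  # letters only, punctuation dropped
    counts = Counter(tokens)
    return sum(counts[month] for month in MONTH_WORDS)


# Made-up sentence: both month words are followed by punctuation.
sample = "It started on 10 May, 1857 and ended in December."
print(count_month_words(sample))  # → 2
```

A month word alone is not a date (the verb "may" is counted too, and the day and year are not checked), so this only gives a rough upper bound.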


In [69]:
t1 = df[0][0]
t1

'Delhi, is a metropolitan and the capital region of India which includes the national capital city, New Delhi. It is the second most populous metropolis in India after Mumbai and the largest city in terms of area.'

In [85]:
t2 = '5a5'

In [89]:
import re
re.findall("[^a-z]the[^a-z]", t1)
# [...] is a character class, a-z is the range of lowercase letters, and a leading ^ negates the class.
# So [^a-z] matches exactly one character that is not a lowercase letter (a space, digit, punctuation, ...).
# The pattern therefore needs one such character on each side of "the"; in t1 that is a space both times.


[' the ', ' the ', ' the ', ' the ']

In [86]:
re.findall("[^a-z]a[^a-z]", t2)

['5a5']

In [90]:
a_cnt = re.findall("[^a-z]a[^a-z]", t1)
an_cnt = re.findall("[^a-z]an[^a-z]", t1)
the_cnt = re.findall("[^a-z]the[^a-z]", t1)
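Because `[^a-z]` has to consume a character on each side, the patterns above miss an article at the very start or end of the text, and they are case-sensitive, so the leading "The" in a fragment like `df[0][3]` is not found. A minimal alternative sketch using word boundaries, assuming that capitalised articles should count (the earlier cell lowercases the text, which suggests so); `count_articles` and the sample sentence are hypothetical.

```python
import re

# \b matches at a word boundary without consuming a character.
ARTICLE_PATTERNS = {
    article: re.compile(r"\b" + article + r"\b", re.IGNORECASE)
    for article in ("a", "an", "the")
}


def count_articles(text):
    """Return a dict with the number of occurrences of 'a', 'an' and 'the'."""
    return {
        article: len(pattern.findall(text))
        for article, pattern in ARTICLE_PATTERNS.items()
    }


# Made-up sentence with an article at the start and one before the final word.
sample = "The town grew after an agreement; a fort and the walls remain. The end"
counts = count_articles(sample)
print(counts["a"], counts["an"], counts["the"])  # → 1 1 3
```

Digits count as word characters for `\b`, so a string such as '5a5' no longer produces a match with this approach.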

In [93]:
t3 = df[0][3]
t3

"The Indian Rebellion of 1857 began as a mutiny of sepoys of the East India Company's army on 10 May 1857, in the town of Meerut, and soon escalated into other mutinies and civilian rebellions largely in the upper Gangetic plain and central India,"

In [94]:
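# No match here: this pattern only covers all-numeric dates, while t3 writes the month by name ("10 May 1857").
re.findall(r"[0-3]?[0-9][ /,\-]{1,2}[01]?[0-9][ /,\-]{1,2}\d{2,4}", t3)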



[]

In [95]:
temp = re.compile('[0-3]?')
print(temp)

re.compile('[0-3]?')


In [96]:
p = re.compile('ab*')
p 

re.compile(r'ab*', re.UNICODE)

In [128]:
t5 = '4/28/1991 5-28-1991 88/28/1991 12-19-1993'

re.findall(r"[0-2]?[0-9][ /,\-]{1,2}[01]?[0-9][ /,\-]{1,2}[0-9]{2,4}", t5)



# ? matches the preceding element zero or one time, i.e. it marks that element as optional.
# [0-2]?[0-9] is an optional leading digit 0-2 followed by one digit, so it matches 0-9 and 00-29.

# [ /,\-] matches one separator: a space, /, comma, or -. The backslash escapes - so it is taken literally.

# {1,2} means the preceding element repeats at least 1 and at most 2 times, so one or two separators
# may sit between two numbers.

# Simplified version: we just want three numbers for day, month, and year. Each number can have
# 1-4 digits, and the numbers are separated by one or two separators, e.g. mm/dd/yy, mm-dd-yy, dd/mm/yyyy.
# This version is looser: it also accepts impossible values such as 88/28/1991.

re.findall(r"[0-9]{1,4}[ /,\-]{1,2}[0-9]{1,4}[ /,\-]{1,2}[0-9]{1,4}", t5)

['4/28/1991', '5-28-1991', '88/28/1991', '12-19-1993']
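The simplified pattern accepts '88/28/1991', which is not a calendar date. One possible way to filter such matches is to let `datetime.strptime` validate them. The sketch below assumes only `/` and `-` as separators and tries both day-first and month-first readings; `NUMERIC_DATE`, `is_real_date` and `find_numeric_dates` are hypothetical names.

```python
import re
from datetime import datetime

# Two numbers of 1-2 digits, then a 2- or 4-digit year, separated by / or -.
NUMERIC_DATE = re.compile(r"\b(\d{1,2})[/\-](\d{1,2})[/\-](\d{2}|\d{4})\b")


def is_real_date(first, second, year):
    """True if the numbers form a date as day-month-year or month-day-year."""
    year_format = "%Y" if len(year) == 4 else "%y"
    for layout in ("%d/%m/", "%m/%d/"):
        try:
            datetime.strptime(f"{first}/{second}/{year}", layout + year_format)
            return True
        except ValueError:
            continue  # not valid in this order, try the other one
    return False


def find_numeric_dates(text):
    """Return the numeric date strings in text that pass the calendar check."""
    return [m.group(0) for m in NUMERIC_DATE.finditer(text) if is_real_date(*m.groups())]


print(find_numeric_dates("4/28/1991 5-28-1991 88/28/1991 12-19-1993"))
# → ['4/28/1991', '5-28-1991', '12-19-1993']
```

Whether the judge expects this kind of validation is not stated in the problem, so stricter filtering may or may not improve the score.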

In [11]:
import re 
t6 = '16th-03st-1991'
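# Non-capturing group (?:...) so findall returns whole matches; "th" added to the suffix list.
# Still no match on t6, because the month part "03st" carries a suffix this pattern does not allow.
re.findall(r"[0-3]?[0-9](?:th|st|nd|rd)?[ /,\-]{1,2}[01]?[0-9][ /,\-]{1,2}\d{2,4}", t6)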

re.findall("(?:[0-3]?[0-9])(?:th|st|nd|rd)", t6)

# re.findall("[0-9]{1,2}[st,nd,rd]?[ /,\-]{0,2}[0-9]{1,4}[ /,\-]{1,2}[0-9]{2,4}", t6)



['16th', '03st']

t7 = '17th Jan, 2019'
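# "th" added to the suffixes, optional separators allowed between day and month, and all groups
# made non-capturing so findall returns the whole matched date.
re.findall(r"[0-9]{1,2}(?:th|st|nd|rd)?[ /,\-]{0,2}(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)[ /,\-]{0,2}[0-9]{2,4}", t7)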

In [187]:
t7 = '17th jan, 2019'
re.findall(r"[0-9]{1,2}(th|st|nd|rd)?", t7)

# With a capturing group in the pattern, findall returns the group's text, not the whole match.
re.findall(r"[0-9][0-9](?:th|rd|nd|st)?\s+(jan)?", t7)

['jan']
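The experiments above can be combined into one pattern for dates with a named month, in both day-month-year and month-day-year order. This is a rough sketch, not a complete solution: the names `MONTH`, `DAY`, `YEAR` and `NAMED_DATE` are made up here, and the pattern only checks the shape of the text, not whether the day exists in that month.

```python
import re

MONTH = (r"(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|"
         r"Jul(?:y)?|Aug(?:ust)?|Sep(?:t(?:ember)?)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)")
DAY = r"(?:[0-2]?[0-9]|3[01])(?:st|nd|rd|th)?"  # 1-31 with an optional ordinal suffix
YEAR = r"(?:\d{4}|\d{2})"

# "15th March 1999", "20th of March, 1999", "17th jan, 2019"
DAY_MONTH_YEAR = rf"\b{DAY}\s+(?:of\s+)?{MONTH}\.?,?\s+{YEAR}\b"
# "March 15, 1999"
MONTH_DAY_YEAR = rf"\b{MONTH}\.?\s+{DAY},?\s+{YEAR}\b"

NAMED_DATE = re.compile(f"{DAY_MONTH_YEAR}|{MONTH_DAY_YEAR}", re.IGNORECASE)

# Samples taken from the problem statement and the cells above, plus one
# month-first form and one fragment without a year.
sample = ("15th March 1999; 15th March 99; 20th of March, 1999; "
          "17th jan, 2019; March 15, 1999; 10 May")
print(NAMED_DATE.findall(sample))
# → ['15th March 1999', '15th March 99', '20th of March, 1999', '17th jan, 2019', 'March 15, 1999']
```

All groups are non-capturing so that `findall` returns the full matched date, and "10 May" is skipped because the problem requires day, month and year to be present.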