https://www.hackerrank.com/challenges/punctuation-corrector-its/problem

In a world which survives on food, water, oxygen and text messages, we notice that people often don't bother much about punctuation and grammar in their text messages. This is somewhat understandable, partly because every extra keystroke on a miniature keypad takes a lot more work than it would on a laptop or a desktop.

We'd like to build a small sub-feature of a punctuation corrector which can handle one small but specific, and frequent issue: detecting (and correcting) whether the appropriate form of the word ought to be "its" or "it's" - a common point of confusion in general.

It's is the contraction of "it is" and "it has."  
Its is a possessive form - i.e., it shows ownership.  
You are given sentences which contain one or more occurrences of either 'its' or 'it's'. These occurrences have been replaced by three question marks (???). Display the correct form of each sentence, with the question marks replaced by either 'it's' or 'its', as appropriate.

Input Format
The first line contains an integer N, 1<=N<=200.
This is followed by N lines, each containing one sentence of no more than 300 characters. The occurrences of 'it's' and 'its' have been replaced by '???' in these sentences.

Output Format
N lines, each containing one sentence. The ith output contains the completed version of the ith input, where the '???' have been replaced by either 'it's' or 'its', as appropriate.
Capitalization will be ignored.

Sample Input

10  
This restaurant is known for ??? emphasis on spicy cooking.  
Golfing has lost ??? appeal.  
??? become very difficult to find parking in the downtown areas.  
This shop recently moved from ??? former location near the bus terminus.  
Every dog has ??? day.  
Guess ??? shape.
The jury has reached ??? decision.  
Stop ??? momentum!  
??? time to go.
??? lying over there.

Sample Output

This restaurant is known for its emphasis on spicy cooking.  
Golfing has lost its appeal.  
It's become very difficult to find parking in the downtown areas.  
This shop recently moved from its former location near the bus terminus.  
Every dog has its day.  
Guess its shape.
The jury has reached its decision.  
Stop its momentum!  
It's time to go.
It's lying over there.
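
Before looking at corpus statistics, the task itself can be sketched as a simple baseline. The snippet below is only an illustration under stated assumptions: the cue-word set `CONTRACTION_CUES` and the "-ing" rule are hypothetical choices made for this example, not something derived from the corpus, and they will misfire on phrases such as "its ring". Reading the N input lines is omitted.

```python
import re

# Hypothetical cue words: if one of these follows the gap, guess the contraction.
CONTRACTION_CUES = {"a", "an", "the", "not", "time", "been", "become", "got"}

def fill_gaps(sentence):
    """Replace each '???' with "it's" or "its" using a crude next-word heuristic."""
    def choose(match):
        # Look at the first word after the gap in the original sentence.
        following = re.match(r"\s*([A-Za-z']+)", sentence[match.end():])
        next_word = following.group(1).lower() if following else ""
        is_contraction = next_word in CONTRACTION_CUES or next_word.endswith("ing")
        word = "it's" if is_contraction else "its"
        # Capitalize when the gap opens the sentence, as in the sample output.
        return word.capitalize() if match.start() == 0 else word
    return re.sub(r"\?\?\?", choose, sentence)

print(fill_gaps("Golfing has lost ??? appeal."))  # Golfing has lost its appeal.
print(fill_gaps("??? time to go."))               # It's time to go.
```

This rule happens to reproduce the ten sample sentences, but that says little about how it generalizes; the corpus counts gathered below are one way to replace the hand-picked cues.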

In [53]:
import pandas as pd
import numpy as np
import nltk 
from nltk import word_tokenize, sent_tokenize
from collections import defaultdict

In [18]:
with open('punctuation_corrector_corpus.txt') as f:
    corpus = f.read()

In [29]:
corpus[:200].replace('\n', '. ')

"\ufeffProject Gutenberg's The Secret Cache, by E. C. [Ethel Claire] Brill. . This eBook is for the use of anyone anywhere at no cost and with. almost no restrictions whatsoever.  You may copy it, give it away"

In [35]:
# Each newline becomes '. ' before sentence splitting; bare '.' fragments are dropped.
tokenized_sentence = [sent for sent in sent_tokenize(corpus.replace('\n', '. ')) if sent != '.']

In [36]:
len(tokenized_sentence)

70728
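
The corpus preview shows "and with. almost no restrictions", which suggests the text is hard-wrapped and that replacing every newline with '. ' can cut sentences in the middle of a line. A possible alternative, sketched here with a hypothetical helper `unwrap_lines`, is to join wrapped lines with spaces and treat only blank lines as breaks before calling `sent_tokenize`. Whether this improves the counts on this corpus has not been checked.

```python
import re

def unwrap_lines(text):
    """Join hard-wrapped lines with spaces; keep blank lines as paragraph breaks."""
    text = text.lstrip("\ufeff")              # drop a leading byte-order mark, if any
    paragraphs = re.split(r"\n\s*\n", text)   # blank lines separate paragraphs
    joined = [" ".join(p.split()) for p in paragraphs]
    return "\n\n".join(p for p in joined if p)

sample = ("This eBook is for the use of anyone anywhere at no cost and with\n"
          "almost no restrictions whatsoever.\n\nYou may copy it.")
print(unwrap_lines(sample))
```

Each paragraph returned by `unwrap_lines` could then be passed to `sent_tokenize` separately, so that headings and paragraph ends still act as sentence boundaries.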

In [51]:
# Let's define 'its' as positive and "it's" as negative.
# Note: both filters below are case-sensitive substring tests.

positive_tokenized_sentence = [sent for sent in tokenized_sentence if 'its' in sent and "it's" not in sent]

negative_tokenized_sentence = [sent for sent in tokenized_sentence if "it's" in sent]
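
Because `'its' in sent` and `"it's" in sent` are substring tests, they also match inside other words: "itself" contains "its", and "rabbit's" or "wit's" contain "it's". They also miss capitalized forms such as "Its". A minimal sketch of a stricter filter, using word-boundary regular expressions, is shown below; the helper names `has_its` and `has_contraction` are made up for this example.

```python
import re

# \b requires a word boundary, so "itself", "wits" and "rabbit's" do not match.
ITS_RE = re.compile(r"\bits\b", re.IGNORECASE)
ITS_CONTRACTION_RE = re.compile(r"\bit's\b", re.IGNORECASE)

def has_its(sentence):
    """True if the sentence contains the standalone word 'its'."""
    return bool(ITS_RE.search(sentence))

def has_contraction(sentence):
    """True if the sentence contains the standalone contraction "it's"."""
    return bool(ITS_CONTRACTION_RE.search(sentence))

sentences = [
    "Every dog has its day.",
    "The little squirrels were so astonished at the rabbit's appearance.",
    "Doctor, we've done our work, so it's time we had some play.",
]
# Same split as above: 'its' only is positive, any "it's" is negative.
positive = [s for s in sentences if has_its(s) and not has_contraction(s)]
negative = [s for s in sentences if has_contraction(s)]
print(len(positive), len(negative))  # 1 1
```

Applying these filters to `tokenized_sentence` would change the sentence counts reported in this notebook, so the numbers shown in the outputs apply to the substring version only.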


In [65]:
its_previous_word_cnt = defaultdict(int)
its_previous_word_tag = defaultdict(int)

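for sentence in positive_tokenized_sentence:
    # Lowercase and keep alphabetic tokens only (drops punctuation and numbers).
    tokenized_word_list = [word.lower() for word in word_tokenize(sentence) if word.isalpha()]
    # Start at 1: a sentence-initial 'its' has no previous word to count.
    for index in range(1, len(tokenized_word_list)):
        if tokenized_word_list[index] == 'its':
            previous_word = tokenized_word_list[index-1]
            its_previous_word_cnt[previous_word] += 1
            # `nlp` is the spaCy pipeline loaded in cell In [63];
            # the word is tagged in isolation, without its sentence context.
            its_previous_word_tag[nlp(previous_word)[0].pos_] += 1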

In [60]:
# (word, frequency) pairs, most frequent previous word first.
its_previous_word_cnt_list = sorted(its_previous_word_cnt.items(), key=lambda x: -x[1])

In [68]:
its_previous_word_tag

defaultdict(int,
            {'VERB': 102,
             'NOUN': 40,
             'CCONJ': 34,
             'DET': 20,
             'ADP': 237,
             'PART': 23,
             'ADV': 15,
             'ADJ': 8,
             'INTJ': 1,
             'PRON': 1})
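
The raw counts are easier to compare as shares of the total. The sketch below copies the counts from the output above into a plain dictionary (the names `tag_counts` and `tag_share` are introduced here for illustration) and normalizes them. Adpositions (ADP) and verbs together make up roughly 70% of the words that precede "its" in this sample.

```python
# Counts copied from the its_previous_word_tag output above.
tag_counts = {'VERB': 102, 'NOUN': 40, 'CCONJ': 34, 'DET': 20, 'ADP': 237,
              'PART': 23, 'ADV': 15, 'ADJ': 8, 'INTJ': 1, 'PRON': 1}

total = sum(tag_counts.values())
tag_share = {tag: count / total for tag, count in tag_counts.items()}

# Print tags from most to least frequent, as percentages.
for tag, share in sorted(tag_share.items(), key=lambda item: -item[1]):
    print(f"{tag:<6}{share:6.1%}")
```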

In [63]:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

for token in doc:
    print(token.text, token.pos_)

Apple PROPN
is VERB
looking VERB
at ADP
buying VERB
U.K. PROPN
startup NOUN
for ADP
$ SYM
1 NUM
billion NUM


In [45]:
len(positive_tokenized_sentence)

818

In [50]:
len(negative_tokenized_sentence)

12

In [52]:
negative_tokenized_sentence

["behind instead of standing up straight and stiff, as a rabbit's tail.",
 "The little squirrels were so astonished at the rabbit's appearance.",
 "Doctor, we've done our work, so it's time we had some play.",
 '"It--it\'s not actionable," he.',
 '"It\'s as well," said the old man; "it\'s a question whether I. shall live to the Assizes, so it matters little to me, but I. should wish to spare Alice the shock.',
 "If you don't--it's a fine,.",
 "I was at my wit's end where to get.",
 "of the business; but it's 'Where are the geese?'",
 '"Well, then, you\'ve lost your fiver, for it\'s town bred," snapped.',
 '"Very sorry to knock you up, Watson," said he, "but it\'s the.',
 '"Why, it\'s a dummy," said he.. .',
 "it's a wicked world,."]