# Task Goal

**Input**: String

**Output**: An array of sentences



In [326]:
# Useful Libraries
import json

## 1. One Single Sentence

Each piece of the chat consists of three components that always appear in the same order: (1) hh:mm:ss, (2) customer/agent name, (3) sentence.

The speaker order of the chat (Customer first, then Agent) is not taken into account during extraction, which keeps the solution more general. An additional order-control step could be added at the end, once the extraction is done.
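
The three-component layout described above can also be written as a single regular expression. This is only a minimal sketch, assuming the timestamp is always hh:mm:ss and the separator is exactly " : "; the names `LINE_PATTERN` and `parse_line` are hypothetical and are not part of this notebook.

```python
import re

# Hypothetical pattern: hh:mm:ss, a name, the " : " separator, then the sentence.
LINE_PATTERN = re.compile(
    r"^(?P<date>\d{2}:\d{2}:\d{2}) (?P<type>.+?) : (?P<sentence>.*)$"
)

def parse_line(line):
    """Return the three components as a dict, or None if the layout does not match."""
    match = LINE_PATTERN.match(line)
    if match is None:
        return None
    return match.groupdict()

print(parse_line("14:24:32 Customer : Lorem ipsum dolor sit amet."))
# {'date': '14:24:32', 'type': 'Customer', 'sentence': 'Lorem ipsum dolor sit amet.'}
```

Returning `None` for a non-matching line makes malformed input explicit instead of raising an `IndexError`.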


In [327]:
def extractOneSentence(text):

    # Tokens are split on whitespace
    tokens = text.split()

    # The date is the first token of the sentence
    date = tokens[0]

    # The type is everything after the date and before the first ":"
    Type = text.replace(date + " ", "", 1).split(":")[0].rstrip()

    # The mention is the concatenation of date, type, and " : "
    mention = date + " " + Type + " : "

    # The sentence is everything that comes after the mention
    sentence = text.split(mention, 1)[1]

    # Wrap the extracted key-value pairs in a one-element list,
    # so that the results of several calls can be concatenated
    keyValues = [{'date': date, 'mention': mention, 'sentence': sentence, 'type': Type}]

    return keyValues

**Testing the Output of the first implemented function**




In [328]:
input= "14:24:32 Customer : Lorem ipsum dolor sit amet, consectetur adipiscing elit."

output=extractOneSentence(input)

print("--------------------Input Sentence -------------------------------------")
print(input)
print("--------------------Output Array ---------------------------------------")
print(json.dumps(output, indent = 1))
print("------------------------------------------------------------------------")


--------------------Input Sentence -------------------------------------
14:24:32 Customer : Lorem ipsum dolor sit amet, consectetur adipiscing elit.
--------------------Output Array ---------------------------------------
[
 {
  "date": "14:24:32",
  "mention": "14:24:32 Customer : ",
  "sentence": "Lorem ipsum dolor sit amet, consectetur adipiscing elit.",
  "type": "Customer"
 }
]
------------------------------------------------------------------------


## 2. Two Sentences
The input consists of two sentences separated by a new line character. 

In [329]:
# Define a function that takes the entire text as input and splits it into 2 sentences,
# then calls the extractOneSentence function on each of them

def extractTwoSentences(text):

    lines = text.splitlines()
    sentence1 = lines[0]
    sentence2 = lines[1]
    return extractOneSentence(sentence1) + extractOneSentence(sentence2)


**Testing the Output of the second implemented function**

In [330]:
input= "14:24:32 Customer : Lorem ipsum dolor sit amet, consectetur adipiscing elit.\n14:26:15 Agent : Aliquam non cursus erat, ut blandit lectus."

output=extractTwoSentences(input)
print("--------------------Input Sentence -------------------------------------")
print(input)
print("--------------------Output Array ---------------------------------------")
print(json.dumps(output, indent = 1))
print("------------------------------------------------------------------------")

--------------------Input Sentence -------------------------------------
14:24:32 Customer : Lorem ipsum dolor sit amet, consectetur adipiscing elit.
14:26:15 Agent : Aliquam non cursus erat, ut blandit lectus.
--------------------Output Array ---------------------------------------
[
 {
  "date": "14:24:32",
  "mention": "14:24:32 Customer : ",
  "sentence": "Lorem ipsum dolor sit amet, consectetur adipiscing elit.",
  "type": "Customer"
 },
 {
  "date": "14:26:15",
  "mention": "14:26:15 Agent : ",
  "sentence": "Aliquam non cursus erat, ut blandit lectus.",
  "type": "Agent"
 }
]
------------------------------------------------------------------------


## 3. Two Customer Mentions as Start
The extraction process, as modeled from the very beginning, does not depend on the order in which mentions appear.
So the only change needed is to extend the "extractTwoSentences" function to any number of input sentences.

In [331]:
# The function will first split the sentences based on the newlines
# Then call the extractOneSentence function

def extractSentences(text):
    
    sentences=text.splitlines()
    result=[]
    for s in (sentences):
        result=result+extractOneSentence(s)
    return result

**Testing the Output of the third implemented function**

In [332]:
input= "14:24:32 Customer : Lorem ipsum dolor sit amet, consectetur adipiscing elit.\n14:27:00 Customer : Pellentesque cursus maximus felis, pharetra porta purus aliquet viverra.\n14:27:47 Agent : Vestibulum tempor diam eu leo molestie eleifend.\n14:28:28 Customer : Contrary to popular belief, Lorem Ipsum is not simply random text."

output=extractSentences(input)

print("--------------------Input Sentence -------------------------------------")
print(input)
print("--------------------Output Array ---------------------------------------")
print(json.dumps(output, indent = 1))
print("------------------------------------------------------------------------")

--------------------Input Sentence -------------------------------------
14:24:32 Customer : Lorem ipsum dolor sit amet, consectetur adipiscing elit.
14:27:00 Customer : Pellentesque cursus maximus felis, pharetra porta purus aliquet viverra.
14:27:47 Agent : Vestibulum tempor diam eu leo molestie eleifend.
14:28:28 Customer : Contrary to popular belief, Lorem Ipsum is not simply random text.
--------------------Output Array ---------------------------------------
[
 {
  "date": "14:24:32",
  "mention": "14:24:32 Customer : ",
  "sentence": "Lorem ipsum dolor sit amet, consectetur adipiscing elit.",
  "type": "Customer"
 },
 {
  "date": "14:27:00",
  "mention": "14:27:00 Customer : ",
  "sentence": "Pellentesque cursus maximus felis, pharetra porta purus aliquet viverra.",
  "type": "Customer"
 },
 {
  "date": "14:27:47",
  "mention": "14:27:47 Agent : ",
  "sentence": "Vestibulum tempor diam eu leo molestie eleifend.",
  "type": "Agent"
 },
 {
  "date": "14:28:28",
  "mention": "14:28

## 4. Date Splitting
When sentences are not delimited by new lines, it is enough to treat the full stop as the end-of-sentence marker instead of looking at the date. This assumes that each message contains exactly one full stop, at its end.
For a more general solution, we can create a split function that first splits the text on new lines, if any exist,
and then splits each resulting piece on the full stop.
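
Splitting on every full stop would also cut a message that contains more than one sentence. As an alternative to the full-stop approach, the sketch below splits only where a full stop is immediately followed by an hh:mm:ss timestamp. It is a hedged illustration, not the method used in this notebook: `BOUNDARY` and `split_messages` are hypothetical names, and splitting on a zero-width match requires Python 3.7 or later.

```python
import re

# Zero-width split point: right after a full stop and right before "hh:mm:ss ".
BOUNDARY = re.compile(r"(?<=\.)(?=\d{2}:\d{2}:\d{2} )")

def split_messages(text):
    """Split on new lines first, then on full stops that precede a timestamp."""
    messages = []
    for line in text.splitlines():
        messages.extend(part for part in BOUNDARY.split(line) if part)
    return messages

text = "14:24:32 Customer : One. Two.14:26:15 Agent : Three."
print(split_messages(text))
# ['14:24:32 Customer : One. Two.', '14:26:15 Agent : Three.']
```

Because the split point is zero-width, each message keeps its trailing full stop.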


In [333]:
# A general function that splits the text based on both new lines and full stops.

def splitText(text):
    newLineSentences=text.splitlines()
    finalSentences=[]
    for s in (newLineSentences):
        finalSentences=finalSentences+ s.split(".")
    return finalSentences

# A new version of the extractSentences function
# It first calls the split function and then performs the extraction

def extractAllSentences(text):
    sentences=splitText(text)
    sentences = list(filter(None, sentences))
    result=[]
    for s in (sentences):
        result=result+extractOneSentence(s+".")
    return result

**Testing the Output of the fourth implemented function**

Here we test the new version of the extraction function on both sentences delimited by new lines and sentences delimited only by full stops.

In [334]:
# Sentences with new lines

input= "14:24:32 Customer : Lorem ipsum dolor sit amet, consectetur adipiscing elit.\n14:27:00 Customer : Pellentesque cursus maximus felis, pharetra porta purus aliquet viverra.\n14:27:47 Agent : Vestibulum tempor diam eu leo molestie eleifend.\n14:28:28 Customer : Contrary to popular belief, Lorem Ipsum is not simply random text."
output=extractAllSentences(input)

print("--------------------Input Sentence (with new lines)----------------------")
print(input)
print("--------------------Output Array ---------------------------------------")
print(json.dumps(output, indent = 1))

# Sentences with no new lines

input="14:24:32 Customer : Lorem ipsum dolor sit amet, consectetur adipiscing elit.14:26:15 Agent : Aliquam non cursus erat, ut blandit lectus."
output=extractAllSentences(input)

print("--------------------Input Sentence (with no new lines)------------------")
print(input)
print("--------------------Output Array ---------------------------------------")
print(json.dumps(output, indent = 1))
print("------------------------------------------------------------------------")

--------------------Input Sentence (with new lines)----------------------
14:24:32 Customer : Lorem ipsum dolor sit amet, consectetur adipiscing elit.
14:27:00 Customer : Pellentesque cursus maximus felis, pharetra porta purus aliquet viverra.
14:27:47 Agent : Vestibulum tempor diam eu leo molestie eleifend.
14:28:28 Customer : Contrary to popular belief, Lorem Ipsum is not simply random text.
--------------------Output Array ---------------------------------------
[
 {
  "date": "14:24:32",
  "mention": "14:24:32 Customer : ",
  "sentence": "Lorem ipsum dolor sit amet, consectetur adipiscing elit.",
  "type": "Customer"
 },
 {
  "date": "14:27:00",
  "mention": "14:27:00 Customer : ",
  "sentence": "Pellentesque cursus maximus felis, pharetra porta purus aliquet viverra.",
  "type": "Customer"
 },
 {
  "date": "14:27:47",
  "mention": "14:27:47 Agent : ",
  "sentence": "Vestibulum tempor diam eu leo molestie eleifend.",
  "type": "Agent"
 },
 {
  "date": "14:28:28",
  "mention": "14:2

## 5. Ignore Extra Dates

The solution implemented so far already handles this case, because a date is extracted based on its position in the text: only the first occurrence in a sentence is taken as the date, and any dates occurring later are treated as part of the sentence. We test the extraction function below on an example with an extra date.


In [335]:
input= "14:24:32 Customer : Lorem ipsum dolor sit amet, consectetur adipiscing elit.14:26:15 Agent : I received it at 12:24:48, ut blandit lectus."
output=extractAllSentences(input)

print("--------------------Input Sentence--------------------------------------")
print(input)
print("--------------------Output Array ---------------------------------------")
print(json.dumps(output, indent = 1))

--------------------Input Sentence--------------------------------------
14:24:32 Customer : Lorem ipsum dolor sit amet, consectetur adipiscing elit.14:26:15 Agent : I received it at 12:24:48, ut blandit lectus.
--------------------Output Array ---------------------------------------
[
 {
  "date": "14:24:32",
  "mention": "14:24:32 Customer : ",
  "sentence": "Lorem ipsum dolor sit amet, consectetur adipiscing elit.",
  "type": "Customer"
 },
 {
  "date": "14:26:15",
  "mention": "14:26:15 Agent : ",
  "sentence": "I received it at 12:24:48, ut blandit lectus.",
  "type": "Agent"
 }
]
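
The position-based rule behind this result can be shown in isolation. The sketch below is a simplified illustration, assuming the message starts with its date; `leading_date` is a hypothetical helper, not one of the notebook's functions.

```python
def leading_date(message):
    """Return the first whitespace-separated token and the untouched remainder."""
    # Splitting only once keeps any later timestamp inside the remainder.
    date, rest = message.split(maxsplit=1)
    return date, rest

date, rest = leading_date("14:26:15 Agent : I received it at 12:24:48, ut blandit lectus.")
print(date)  # 14:26:15
print(rest)  # Agent : I received it at 12:24:48, ut blandit lectus.
```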


## 6. Full Names

The extraction of the mention is based on the position of the first ":" that is not part of the date.
Thus, full names should be extracted automatically by the previously implemented methods. Here is a test.

In [336]:
input= "14:24:32 Luca Galasso : Lorem ipsum dolor sit amet, consectetur adipiscing elit.14:26:15 Emanuele Querzola : I received it at 12:24:48, ut blandit lectus."
output=extractAllSentences(input)

print("--------------------Input Sentence--------------------------------------")
print(input)
print("--------------------Output Array ---------------------------------------")
print(json.dumps(output, indent = 1))

--------------------Input Sentence--------------------------------------
14:24:32 Luca Galasso : Lorem ipsum dolor sit amet, consectetur adipiscing elit.14:26:15 Emanuele Querzola : I received it at 12:24:48, ut blandit lectus.
--------------------Output Array ---------------------------------------
[
 {
  "date": "14:24:32",
  "mention": "14:24:32 Luca Galasso : ",
  "sentence": "Lorem ipsum dolor sit amet, consectetur adipiscing elit.",
  "type": "Luca Galasso"
 },
 {
  "date": "14:26:15",
  "mention": "14:26:15 Emanuele Querzola : ",
  "sentence": "I received it at 12:24:48, ut blandit lectus.",
  "type": "Emanuele Querzola"
 }
]
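
The same idea can be sketched with `str.partition`, which cuts at the first occurrence of a separator and therefore accepts a name of any length. This is an illustrative sketch, assuming the separator is exactly " : " and that the sentence itself does not contain that sequence; `split_name_and_sentence` is a hypothetical name.

```python
def split_name_and_sentence(message):
    """Split a message into (date, name, sentence) at the first ' : ' separator."""
    # The date ends at the first space.
    date, _, rest = message.partition(" ")
    # The name is everything before the first " : ", however many words it has.
    name, _, sentence = rest.partition(" : ")
    return date, name, sentence

print(split_name_and_sentence(
    "14:26:15 Emanuele Querzola : I received it at 12:24:48, ut blandit lectus."
))
# ('14:26:15', 'Emanuele Querzola', 'I received it at 12:24:48, ut blandit lectus.')
```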


## 7. Missing colon after the names

Since a "type" can in general be "Customer", "Agent", or a name of any length, it is difficult to guess where the "type" ends. So we need to make an assumption: when the colon is missing, we simply take the first word after the date as the "type". Based on this assumption, we modify the "extractOneSentence" function as follows:


In [337]:
'''
We modify the extractOneSentence function
'''

def extractOneSentenceModified(text):
    
    # Tokens are split by spaces

    tokens= text.split()

    # The date is the first token of the sentence

    date=tokens[0]

    # Remove the date from the original text

    newText= text.replace(date+" ", "")

    # Extract the content of "type"

    typeContent=newText.split(":")

    if ((len(typeContent)% 2) == 0): # an even number of parts: the colon after the name is present
        Type= (typeContent[0]).rstrip()
        mention= date + " " + Type +" : "
        sentence= (text.split(mention))[1]
    else: # an odd number of parts: the colon is missing and any remaining colons belong to extra dates
        Type=newText.split()[0]
        mention= date + " " + Type
        sentence= (text.split(mention))[1]
        mention=mention+" : "

   
    # Create a dictionary data structure that contains all the key-value pairs
    keyValues=[{'date': date, 'mention': mention, 'sentence': sentence, 'type': Type}]
    return keyValues

'''
We modify the extractAllSentences function with the new version extractOneSentenceModified 
'''

def extractAllSentencesModified(text):
    sentences=splitText(text)
    sentences = list(filter(None, sentences)) # remove empty strings due to some split operations
    result=[]
    for s in (sentences):
        result=result+extractOneSentenceModified(s)
    return result
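
The parity test used in "extractOneSentenceModified" can be isolated in a few lines. The sketch below counts colons instead of split parts, which is equivalent; it assumes, as the function above implicitly does, that the only colons in a sentence are those of extra hh:mm:ss dates. `colon_is_missing` is a hypothetical helper name.

```python
def colon_is_missing(text_without_date):
    """Guess whether the ':' after the name is missing, from the colon count."""
    # Each extra hh:mm:ss adds two colons; the separator after the name adds one.
    # An even count therefore means that the separator is absent.
    return text_without_date.count(":") % 2 == 0

print(colon_is_missing("Customer : Lorem ipsum."))                # False
print(colon_is_missing("Customer Lorem ipsum."))                  # True
print(colon_is_missing("Agent : I received it at 12:24:48, ut."))  # False
print(colon_is_missing("Agent I received it at 12:24:48, ut."))    # True
```

A sentence containing any other colon, for example inside a URL, would break this assumption.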


**Testing the Output of the last implemented function**

We first test the implemented function with the example of the missing colon. 

In [338]:
input="14:24:32 Customer Lorem ipsum dolor sit amet, consectetur adipiscing elit.14:26:15 Agent I received it at 12:24:48, ut blandit lectus."
output=extractAllSentencesModified(input)

print("--------------------Input Sentence (missing colon)-----------------------")
print(input)
print("--------------------Output Array ---------------------------------------")
print(json.dumps(output, indent = 1))

--------------------Input Sentence (missing colon)-----------------------
14:24:32 Customer Lorem ipsum dolor sit amet, consectetur adipiscing elit.14:26:15 Agent I received it at 12:24:48, ut blandit lectus.
--------------------Output Array ---------------------------------------
[
 {
  "date": "14:24:32",
  "mention": "14:24:32 Customer : ",
  "sentence": " Lorem ipsum dolor sit amet, consectetur adipiscing elit",
  "type": "Customer"
 },
 {
  "date": "14:26:15",
  "mention": "14:26:15 Agent : ",
  "sentence": " I received it at 12:24:48, ut blandit lectus",
  "type": "Agent"
 }
]



Now we test the new functions on all the previous scenarios to show that the solution covers all the cases described in this Kata task.

In [339]:
input= "14:24:32 Luca Galasso : Lorem ipsum dolor sit amet, consectetur adipiscing elit.14:26:15 Emanuele Querzola : I received it at 12:24:48, ut blandit lectus."
output=extractAllSentencesModified(input)

print("--------------------Input Sentence (full names with extra date)-----------------------")
print(input)
print("--------------------Output Array ---------------------------------------")
print(json.dumps(output, indent = 1))


input= "14:24:32 Customer : Lorem ipsum dolor sit amet, consectetur adipiscing elit.\n14:27:00 Customer : Pellentesque cursus maximus felis, pharetra porta purus aliquet viverra.\n14:27:47 Agent : Vestibulum tempor diam eu leo molestie eleifend.\n14:28:28 Customer : Contrary to popular belief, Lorem Ipsum is not simply random text."
output=extractAllSentencesModified(input)

print("--------------------Input Sentence (new lines)-----------------------")
print(input)
print("--------------------Output Array ---------------------------------------")
print(json.dumps(output, indent = 1))


input="14:24:32 Customer : Lorem ipsum dolor sit amet, consectetur adipiscing elit.14:26:15 Agent : Aliquam non cursus erat, ut blandit lectus."
output=extractAllSentencesModified(input)

print("--------------------Input Sentence (no new line)-----------------------")
print(input)
print("--------------------Output Array ---------------------------------------")
print(json.dumps(output, indent = 1))



--------------------Input Sentence (full names with extra date)-----------------------
14:24:32 Luca Galasso : Lorem ipsum dolor sit amet, consectetur adipiscing elit.14:26:15 Emanuele Querzola : I received it at 12:24:48, ut blandit lectus.
--------------------Output Array ---------------------------------------
[
 {
  "date": "14:24:32",
  "mention": "14:24:32 Luca Galasso : ",
  "sentence": "Lorem ipsum dolor sit amet, consectetur adipiscing elit",
  "type": "Luca Galasso"
 },
 {
  "date": "14:26:15",
  "mention": "14:26:15 Emanuele Querzola : ",
  "sentence": "I received it at 12:24:48, ut blandit lectus",
  "type": "Emanuele Querzola"
 }
]
--------------------Input Sentence (new lines)-----------------------
14:24:32 Customer : Lorem ipsum dolor sit amet, consectetur adipiscing elit.
14:27:00 Customer : Pellentesque cursus maximus felis, pharetra porta purus aliquet viverra.
14:27:47 Agent : Vestibulum tempor diam eu leo molestie eleifend.
14:28:28 Customer : Contrary to popular b