## Build a Chatbot Using the Google NarrativeQA Reading Comprehension Dataset


### This notebook ingests and wrangles the raw Wikipedia summaries into a cleaned summary dataset that is searched for each question asked.
INGEST: The original Wikipedia summary dataset (summaries.csv) provided by Google has the following layout:

* Each Wikipedia article starts with a document_id, which matches the document_id used in the qas.csv dataset. Each subsequent paragraph of that article is a separate row in the CSV.


* For example, the first row of a new Wikipedia article would be something like "0025577043f5090cd603c6aea60f26e236195594,test," Mark Hunter (Slater), a high school ..." The next row in the CSV would be the next paragraph of the same article, such as "Nobody knows the true identity of ..." This continues for the length of the article. The only indication that a new article is beginning is a row that starts with a document_id.

EXPORT:
* I wrangled the dataset so that each Wikipedia article occupies one row rather than several rows. This is exported as summary_list.csv.

* I also reformatted the summaries so that every sentence of each Wikipedia article starts with the document_id (rather than only the first sentence of each article). This is exported as updatedSummaryList.csv.
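
As a rough illustration of the layout described above, the sketch below groups rows into one string per article on a tiny in-memory sample. The helper name `group_rows` and the shortened paragraph text are made up for illustration; only the 40-character document_id convention and the two IDs come from this notebook. It assumes no paragraph row starts with a field that is exactly 40 characters long.

```python
def group_rows(rows):
    """Merge paragraph rows into one string per article (hypothetical helper)."""
    articles = []
    current = None  # fields collected for the article being read
    for row in rows:
        if not row:
            continue  # skip blank rows
        if len(row[0]) == 40:  # assumption: only document_ids are 40 characters
            if current is not None:
                articles.append(" ".join(current))
            current = list(row)
        elif current is not None:
            current.extend(row)
    if current is not None:
        articles.append(" ".join(current))  # keep the last article too
    return articles


# Invented sample in the shape described above: an ID row, then paragraph rows.
rows = [
    ["0025577043f5090cd603c6aea60f26e236195594", "test",
     "Mark Hunter starts a pirate radio station."],
    ["Nobody knows the true identity of the host."],
    ["0619e886fb3167a3d70f4d191754a82b3f9ecf7f", "train",
     "John Ridd is a farmer."],
]
articles = group_rows(rows)
for article in articles:
    print(article)
```

Each printed line corresponds to one row of a file shaped like summary_list.csv.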

In [1]:
import nltk
import re
import os
import io
import pandas as pd
import time
import csv
import pickle
import string
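nltk.download('stopwords')
nltk.download('punkt')  # Punkt models used by nltk.sent_tokenize (newer NLTK releases name this resource 'punkt_tab')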

True

In [2]:
startTime = time.time()

In [3]:
# Clean the raw text of a summary (or of a single sentence from one)

def cleaning_document(document):
    # pattern for anything in () or []; kept for reference, it is not applied below
    regex = r"\([^)]*\)|\[[^\]]*\]"
    
    document = document.lower()
        
    # drop the ,test," / ,train," split marker that follows the document_id
    document = document.replace(",test,\"", " ")
    document = document.replace(",train,\"", " ")
    document = re.sub(' +', ' ', document)   # collapse runs of spaces
    document = document.replace('"',"")      # remove all double quotes
    document = document.replace(" •", "")
    document = document.replace(" .", ".")
    document = document.replace("-", " ")
    document = document.replace("#","")
    document = document.replace("/"," ")
    document = document.replace("\\"," ")
    document = document.replace("TM"," ")    # no effect: the text is already lower-cased
    document = document.replace("\n", " ")
    document = document.replace("\t", " ")
    document = document.replace("\"\"", "\"")  # no effect: quotes were removed above
    document = document.replace("\"", " ")      # no effect: quotes were removed above
    document = document.replace("  ", " ")      # single pass; longer runs may remain
    
    return document
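
`cleaning_document` defines a pattern for parenthesised and bracketed text but does not apply it, which is why strings such as "(slater)" survive in the output. If stripping those asides is wanted, a minimal sketch could look like this. The helper name `strip_bracketed` is hypothetical and this step is not part of the notebook's pipeline.

```python
import re

# Matches "(...)" or "[...]" together with any whitespace directly before it.
BRACKETED = re.compile(r"\s*(\([^)]*\)|\[[^\]]*\])")


def strip_bracketed(text):
    """Remove parenthesised and square-bracketed asides (hypothetical helper)."""
    return BRACKETED.sub("", text)


cleaned = strip_bracketed("mark hunter (slater), a high school student [1] in phoenix")
print(cleaned)  # mark hunter, a high school student in phoenix
```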

In [4]:
# Flatten a list of (possibly nested) lists into a single flat list

def nested_flatten(inputList):
    summary = []
    
    for item in inputList:
        if isinstance(item, list):
            summary += nested_flatten(item)
        else:
            summary += [item]
    return summary
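
To show what this kind of recursive flattening does, here is a small self-contained example. It restates the idea as a generator so the snippet runs on its own; the name `flatten` and the sample list are illustrative, not the notebook's code.

```python
def flatten(items):
    """Yield the leaves of an arbitrarily nested list, left to right."""
    for item in items:
        if isinstance(item, list):
            yield from flatten(item)  # recurse into sub-lists
        else:
            yield item


nested = ["id", ["para one", ["para two"]], "para three"]
flat = list(flatten(nested))
print(flat)  # ['id', 'para one', 'para two', 'para three']
```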

In [5]:
# Wrangle the Google summaries.csv dataset to combine paragraphs associated with 
#  the same article. The exported csv has one Wikipedia article per row. 
#  Exported as summary_list.csv

def main1():
    summary = []
    summary_list = []
    
    # Ingest summaries.csv file
    with open('summaries.csv', newline='') as csvfile:
        spamreader = csv.reader(csvfile, delimiter=',')
        for row in spamreader:
            if not row:  # skip blank rows
                continue
            
            # check if beginning of row is a document_id (new Wikipedia article)
            #  NOTE: this test only matches IDs whose first character is a digit
            if len(row[0].split(',')[0]) == 40 and row[0][0].isnumeric():
                # add the summary of the previous article and all its paragraphs
                #  to the list of all Wikipedia articles (nothing to add before
                #  the first article)
                if summary:
                    summary = nested_flatten(summary) # flatten the lists of list
                    summary = ' '.join(summary)
                    summary = cleaning_document(str(summary)) # clean summary of article
                    summary_list.append(summary) 
                
                # now start a new list b/c this row begins a new Wikipedia article
                #  (b/c it starts with a document_id)
                summary = list(row)
            
            # if this row is not the beginning of a new Wikipedia article, add
            #  the paragraph (row) to the current summary
            else:
                summary = summary + row 
   
    # add the last article, which is not followed by another document_id row
    if summary:
        summary = ' '.join(nested_flatten(summary))
        summary_list.append(cleaning_document(str(summary)))
   
    # Export wrangled summaries to CSV
    df = pd.DataFrame(data={"summaries": summary_list})
    df.to_csv("summary_list.csv", sep=',', encoding='utf-8',index=False)      
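
The check in `main1` treats a row as the start of an article only when its first field is 40 characters long and begins with a digit. The IDs printed in this notebook look like 40-character lowercase hexadecimal strings, which can also begin with `a` to `f`. If that holds for the whole file, a stricter test such as the sketch below would also catch those rows. The helper name `is_document_id` is an assumption and not part of the original code.

```python
import re

# Assumption: a document_id is exactly 40 lowercase hexadecimal characters.
HEX40 = re.compile(r"[0-9a-f]{40}")


def is_document_id(field):
    """Return True if the field looks like a 40-character hex ID (hypothetical helper)."""
    return HEX40.fullmatch(field) is not None


print(is_document_id("0025577043f5090cd603c6aea60f26e236195594"))  # True
print(is_document_id("f" * 40))                                    # True
print(("f" * 40)[0].isnumeric())                                   # False: the digit-only test misses this ID
print(is_document_id("Nobody knows the true identity of ..."))     # False
```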


In [6]:
if __name__ == '__main__':
    main1()

In [7]:
# Restructure each Wikipedia article so that each sentence in each article starts 
#  with the document_id (rather than just the first sentence of each article)
#  Exported as updatedSummaryList.csv

def main2():
    updatedSummaryList = []
        
    with open('summary_list.csv', newline='') as csvfile:
        spamreader = csv.reader(csvfile, delimiter=',')
        next(spamreader, None)  # skip the "summaries" header row
        for row in spamreader:
            if not row:  # skip blank rows
                continue
            
            # identify the document_id (first 40 characters)
            answerID = row[0][0:40]
            
            # identify the Wikipedia article text (everything after the ID
            #  and its separator)
            summy = row[0][41:]
            
            # tokenize each sentence in each article
            sent_tokens = nltk.sent_tokenize(summy)
            
            # reformat each sentence in each article so that each sentence begins
            #  with the document_id
            for sent in sent_tokens:
                updatedSent = answerID + ', ' + sent + ' '
                updatedSent = cleaning_document(str(updatedSent))
                updatedSummaryList.append(updatedSent)

    # Export to CSV
    df = pd.DataFrame(data={"summaries": updatedSummaryList})
    df.to_csv("updatedSummaryList.csv", sep=',', encoding='utf-8',index=False) 

In [8]:
main2()
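
Counting sentences per document_id is a compact way to sanity-check the export. The sketch below runs on an in-memory sample in the same one-column format as updatedSummaryList.csv, using three rows copied from this notebook's output. `count_sentences` is a hypothetical helper, and it assumes every data row begins with a 40-character document_id.

```python
import csv
import io
from collections import Counter


def count_sentences(csvfile):
    """Count rows per document_id in a one-column summaries CSV (hypothetical helper)."""
    reader = csv.reader(csvfile)
    next(reader, None)  # skip the "summaries" header row
    counts = Counter()
    for row in reader:
        if row:
            counts[row[0][:40]] += 1  # the first 40 characters are the document_id
    return counts


sample = io.StringIO(
    "summaries\n"
    '"0619e886fb3167a3d70f4d191754a82b3f9ecf7f, however, monmouth is defeated at '
    'the battle of sedgemoor, and his associates are sought for treason. "\n'
    '"0619e886fb3167a3d70f4d191754a82b3f9ecf7f, john ridd is captured during the revolution. "\n'
    '"0b9c36c9ed7054b8879daec163f52d1491264a55, mike, convinced his job is over, '
    'resolves to play his heart out. "\n'
)
counts = count_sentences(sample)
for doc_id, n in counts.items():
    print(doc_id, n)
```

For the real file, the same function can be called on a handle opened with `open('updatedSummaryList.csv', newline='')`.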

In [9]:
# QA: print the exported rows to check the results

with open('updatedSummaryList.csv', newline='') as csvfile:
    spamreader = csv.reader(csvfile, delimiter=',')
    for row in spamreader:
        print(row)

['summaries']
["0025577043f5090cd603c6aea60f26e236195594, mark hunter (slater), a high school student in a sleepy suburb of phoenix, arizona, starts an fm pirate radio station that broadcasts from the basement of his parents' house. "]
['0025577043f5090cd603c6aea60f26e236195594, mark is a loner, an outsider, whose only outlet for his teenage angst and aggression is his unauthorized radio station. ']
["0025577043f5090cd603c6aea60f26e236195594, his pirate station's theme song is everybody knows by leonard cohen and there are glimpses of cassettes by such alternative musicians as the jesus and mary chain, camper van beethoven, primal scream, soundgarden, ice t, bad brains, concrete blonde, henry rollins, and the pixies. "]
['0025577043f5090cd603c6aea60f26e236195594, by day, mark is seen as a loner, hardly talking to anyone around him; by night, he expresses his outsider views about what is wrong with american society. ']
['0025577043f5090cd603c6aea60f26e236195594, when he speaks his mind 

['0619e886fb3167a3d70f4d191754a82b3f9ecf7f, the doones, abandoning their plan to marry lorna to carver and claim her wealth, side with monmouth in the hope of reclaiming their ancestral lands. ']
['0619e886fb3167a3d70f4d191754a82b3f9ecf7f, however, monmouth is defeated at the battle of sedgemoor, and his associates are sought for treason. ']
['0619e886fb3167a3d70f4d191754a82b3f9ecf7f, john ridd is captured during the revolution. ']
['0619e886fb3167a3d70f4d191754a82b3f9ecf7f, innocent of all charges, he is taken to london by an old friend to clear his name. ']
['0619e886fb3167a3d70f4d191754a82b3f9ecf7f, there, he is reunited with lorna (now lorna dugal), whose love for him has not diminished. ']
["0619e886fb3167a3d70f4d191754a82b3f9ecf7f, when he thwarts an attack on lorna's great uncle and legal guardian earl brandir, john is granted a pardon, a title, and a coat of arms by the king and returns a free man to exmoor. "]
['0619e886fb3167a3d70f4d191754a82b3f9ecf7f, in the meantime, the su

['0b9c36c9ed7054b8879daec163f52d1491264a55, mike, convinced his job is over, resolves to play his heart out. ']
['0b9c36c9ed7054b8879daec163f52d1491264a55, psmith leaves work early, to take his father to the match. ']
['0b9c36c9ed7054b8879daec163f52d1491264a55, mr smith is shocked that the bank does not approve of people leaving to play cricket; psmith persuades him that rather than working at the bank, he should study for the bar. ']
['0b9c36c9ed7054b8879daec163f52d1491264a55, they arrive at the game just as mike, playing well, reaches his century. ']
["0b9c36c9ed7054b8879daec163f52d1491264a55, after the match, psmith tells mike of his plans to study law at cambridge, and also that his father, needing an agent for his estate, is willing to take mike on, having first paid for him to go to the 'varsity too, to study the business. "]
['0b9c36c9ed7054b8879daec163f52d1491264a55, mr bickersdyke, relaxing in his club, overjoyed at the thought of finally being able to sack psmith and mike, is

['0f849890e27fd05b9a8683d111f489515db72ea4, they dance and sing , celebrating dionysus and adding details of his birth and the dionysian rites. ']
['0f849890e27fd05b9a8683d111f489515db72ea4, then tiresias , the blind and elderly seer , appears. ']
['0f849890e27fd05b9a8683d111f489515db72ea4, he knocks on the palace doors and calls for cadmus , the founder and former king of thebes. ']
['0f849890e27fd05b9a8683d111f489515db72ea4, the two venerable old men are planning to join the revelry in the mountains when cadmus’ grandson pentheus , the current king , enters. ']
['0f849890e27fd05b9a8683d111f489515db72ea4, disgusted to find the two old men in festival dress , he scolds them and orders his soldiers to arrest anyone engaging in dionysian worship. ']
["0f849890e27fd05b9a8683d111f489515db72ea4, he wants the foreigner , whom he does n't recognize as dionysus in disguise , to be captured. "]
['0f849890e27fd05b9a8683d111f489515db72ea4, pentheus intends to have him stoned to death. ']
['0f8498

['1343fe0f3a4293a8d5a214cd30e857f9abe77ebb, woot steals a magic apron that opens doors and barriers at the wearer s request , enabling the four to escape. ']
['1343fe0f3a4293a8d5a214cd30e857f9abe77ebb, woot , as a green monkey , narrowly avoids becoming a jaguar s meal by descending further into a den of subterranean dragons. ']
['1343fe0f3a4293a8d5a214cd30e857f9abe77ebb, after escaping that ordeal , woot , the tin woodman as a tin owl , the scarecrow as a straw stuffed bear , and polychrome as a canary turn south into the munchkin country. ']
['1343fe0f3a4293a8d5a214cd30e857f9abe77ebb, they arrive at the farm of jinjur , who renews her acquaintance with them and sends to the emerald city for help. ']
['1343fe0f3a4293a8d5a214cd30e857f9abe77ebb, dorothy and ozma arrive and ozma easily restores the scarecrow and the tin woodman to their rightful forms. ']
['1343fe0f3a4293a8d5a214cd30e857f9abe77ebb, polychrome takes several steps to restore to her true form. ']
['1343fe0f3a4293a8d5a214cd3

['1915f92c4152b867fa9bee83e61b901983f8a3ea, sylvia robson lives happily with her parents on a farm, and is passionately loved by her rather dull quaker cousin philip. ']
['1915f92c4152b867fa9bee83e61b901983f8a3ea, she, however, meets and falls in love with charlie kinraid, a dashing sailor on a whaling vessel, and they become secretly engaged. ']
['1915f92c4152b867fa9bee83e61b901983f8a3ea, when kinraid goes back to his ship, he is forcibly enlisted in the royal navy by a press gang, a scene witnessed by philip. ']
["1915f92c4152b867fa9bee83e61b901983f8a3ea, philip does not tell sylvia of the incident nor relay to her charlie's parting message and, believing her lover is dead, sylvia eventually marries her cousin. "]
["1915f92c4152b867fa9bee83e61b901983f8a3ea, this act is primarily prompted out of gratefulness for philip's assistance during a difficult time following her father's imprisonment and subsequent execution for leading a revengeful raid on press gang collaborators. "]
['1915f9

['1b548ec72908f9447446bdb24e8c179df19a8999, theseus maintains that , since every man must die when his time comes , that it is best to die with a good name and reputation , on good terms with his friends , and having died with honour. ']
['1b548ec72908f9447446bdb24e8c179df19a8999, theseus s comfort to emily and palamon is that arcite died in just such a manner , having acquitted himself well in a feat of arms. ']
["1b82b15048e60b850bcaa1a8c719cf4008d0fbb8, the play opens with the recruiter, captain plume's sergeant kite, recruiting in the town of shrewsbury. "]
["1b82b15048e60b850bcaa1a8c719cf4008d0fbb8, plume arrives, in love with sylvia, closely followed by worthy, a local gentleman who is in love with sylvia's cousin melinda. "]
['1b82b15048e60b850bcaa1a8c719cf4008d0fbb8, worthy asked melinda to become his mistress a year previously, as he believed her to be of inadequate fortune to marry. ']
['1b82b15048e60b850bcaa1a8c719cf4008d0fbb8, but he changes his mind after she comes into an

['2101dfafc654880e081ab5d54326a0fc9d4809f2, lona , vane s love , turns out to be lilith s daughter , and is killed by her own mother. ']
['2101dfafc654880e081ab5d54326a0fc9d4809f2, lilith , however , is captured and brought to adam and eve at the house of death , where they struggle to make her open her hand , fused shut , in which she holds the water the little ones need to grow. ']
['2101dfafc654880e081ab5d54326a0fc9d4809f2, only when she gives it up can lilith join the sleepers in blissful dreams , free of sin. ']
['2101dfafc654880e081ab5d54326a0fc9d4809f2, after a long struggle , lilith bids adam cut her hand from her body ; it is done , lilith sleeps , and vane is sent to bury the hand ; water flows from the hole and washes the land over. ']
['2101dfafc654880e081ab5d54326a0fc9d4809f2, vane is then allowed to join the little ones , already asleep , in their dreaming. ']
['2101dfafc654880e081ab5d54326a0fc9d4809f2, he takes his bed , next to lona s , and finds true life in death. ']


['254ce2e2522625e70a10c88ef265769083049b46, there , the shaggy man s friend johnny dooit builds a sand boat by which they may cross. ']
['254ce2e2522625e70a10c88ef265769083049b46, this is necessary , because physical contact with the desert s sands , as of this book and ozma of oz ( 1907 ) , will turn the travelers to dust. ']
['254ce2e2522625e70a10c88ef265769083049b46, upon reaching oz , dorothy and her companions are warmly welcomed by the mechanical man tik tok and billina the yellow hen. ']
['254ce2e2522625e70a10c88ef265769083049b46, they proceed in company , to come in their travels to the truth pond , where button bright and the shaggy man regain their true heads by bathing in its waters. ']
['254ce2e2522625e70a10c88ef265769083049b46, they meet the tin woodman , the scarecrow , and jack pumpkinhead who journey with them to the imperial capital called emerald city for ozma s grand birthday bash. ']
['254ce2e2522625e70a10c88ef265769083049b46, dorothy meets up with ozma as her chari

['2b7e8df77a6d154c5f957ffc6c9f40bc38ca3cde, though she pushes for the vote , amidala grows frustrated with the corruption in the senate and decides to return to naboo with the jedi. ']
['2b7e8df77a6d154c5f957ffc6c9f40bc38ca3cde, on naboo , padmé reveals herself to the gungans as queen amidala and persuades them into an alliance against the trade federation. ']
['2b7e8df77a6d154c5f957ffc6c9f40bc38ca3cde, jar jar leads his people in a battle against the droid army while padmé leads the hunt for gunray in theed. ']
['2b7e8df77a6d154c5f957ffc6c9f40bc38ca3cde, in a starship hangar , anakin enters a vacant starfighter and inadvertently triggers its autopilot , joining the battle against the federation droid control ship in space. ']
['2b7e8df77a6d154c5f957ffc6c9f40bc38ca3cde, anakin ventures into the ship and destroys it from within , deactivating the droid army. ']
['2b7e8df77a6d154c5f957ffc6c9f40bc38ca3cde, meanwhile , qui gon and obi wan battle darth maul , who mortally wounds qui gon bef

["30337c485ea9d6d657be1ff17823e94b0a531550, london's oldest daughter joan commented that in spite of its tragic ending, the book is often regarded as a 'success' storyâ\xa0... which inspired not only a whole generation of young writers but other different fields who, without aid or encouragement, attained their objectives through great struggle.,living in oakland at the beginning of the 20th century , martin eden struggles to rise above his destitute , proletarian circumstances through an intense and passionate pursuit of self education , hoping to achieve a place among the literary elite. "]
['30337c485ea9d6d657be1ff17823e94b0a531550, his principal motivation is his love for ruth morse. ']
['30337c485ea9d6d657be1ff17823e94b0a531550, because eden is a rough , uneducated sailor from a working class background and the morses are a bourgeois family , a union between them would be impossible unless and until he reached their level of wealth and refinement. ']
['30337c485ea9d6d657be1ff17823

['35592c2abea624d315c5171d67ab5e14794ca071, they have already completed three , and utah predicts they ll attempt the fourth on a rare sea wave phenomenon in france. ']
['35592c2abea624d315c5171d67ab5e14794ca071, after presenting his analysis , utah is sent undercover to france under a field agent named pappas ( ray winstone ). ']
['35592c2abea624d315c5171d67ab5e14794ca071, they reach france and utah gets help from others to surf the tall tube wave. ']
['35592c2abea624d315c5171d67ab5e14794ca071, as he goes in , there is already another surfer in the wave , leaving utah with an unstable wave. ']
['35592c2abea624d315c5171d67ab5e14794ca071, utah gets sucked into the wave and faints , but the other surfer bails and rescues utah. ']
['35592c2abea624d315c5171d67ab5e14794ca071, he wakes aboard a yacht with the surfer , bodhi ( ă\x89dgar ramă\xadrez ) , and his team roach ( clemens schick ) , chowder ( tobias santelmann ) , and grommet ( matias varela ). ']
['35592c2abea624d315c5171d67ab5e1479

['38e24416d39a0a285ef1693adad25c9ed0c94487, upon his release from prison , george violates his parole conditions and heads down to cartagena , colombia to meet up with diego. ']
['38e24416d39a0a285ef1693adad25c9ed0c94487, they meet with cartel officer cesar rosa to negotiate the terms for smuggling 15 kilograms ( 33 lb ) to establish good faith. ']
['38e24416d39a0a285ef1693adad25c9ed0c94487, as the smuggling operation grows , diego gets arrested , leaving george to find a way to sell 50 kg ( 110 lb ) and get the money in time. ']
['38e24416d39a0a285ef1693adad25c9ed0c94487, george reconnects with derek in california , and the two successfully sell all 50 kg in 36 hours , amassing a $ 1.35 million profit. ']
['38e24416d39a0a285ef1693adad25c9ed0c94487, george is then whisked off to medellín , colombia , where he finally meets the group s leader , pablo escobar ( cliff curtis ) , who agrees to go into business with them. ']
['38e24416d39a0a285ef1693adad25c9ed0c94487, with the help of main 

['3d248aa8bba34b3f5199c1aed1443b9fa3395d03, to his horror , it shows him hovering over renée s dismembered body. ']
['3d248aa8bba34b3f5199c1aed1443b9fa3395d03, he is arrested for her murder , tried , found guilty and sentenced to death. ']
['3d248aa8bba34b3f5199c1aed1443b9fa3395d03, shortly after arriving at death row , fred is plagued by frequent headaches and strange visions of the mystery man , a burning cabin in the desert and a strange man driving down a dark highway. ']
['3d248aa8bba34b3f5199c1aed1443b9fa3395d03, during a routine cell check , the prison guard is shocked to find that the man in fred s cell is now pete dayton ( balthazar getty ) , a young auto mechanic. ']
['3d248aa8bba34b3f5199c1aed1443b9fa3395d03, since pete has committed no crime , he is released into the care of his parents , who take him home. ']
['3d248aa8bba34b3f5199c1aed1443b9fa3395d03, pete is then followed by two detectives who are trying to find out more about him. ']
['3d248aa8bba34b3f5199c1aed1443b9fa3

['42d253275a8807aa6ecf57c6c306cb24d76710f1, barnes cuts taylor near his eye with a push dagger before departing. ']
['42d253275a8807aa6ecf57c6c306cb24d76710f1, the platoon is sent back to the combat area to maintain defensive positions, where taylor shares a foxhole with francis. ']
['42d253275a8807aa6ecf57c6c306cb24d76710f1, that night, a major nva assault occurs, and the defensive lines are broken. ']
['42d253275a8807aa6ecf57c6c306cb24d76710f1, much of the platoon, including bunny, junior, and wolfe, are killed in the ensuing battle. ']
['42d253275a8807aa6ecf57c6c306cb24d76710f1, during the attack, an nva sapper, armed with explosives, rushes into battalion headquarters, making a suicide attack and killing everyone inside. ']
['42d253275a8807aa6ecf57c6c306cb24d76710f1, meanwhile, captain harris, the company commander, orders his air support to expend all remaining ordnance inside his perimeter. ']
['42d253275a8807aa6ecf57c6c306cb24d76710f1, during the chaos, taylor encounters barnes,

['47b4955fca17af174e1a45b7ff981ce68f3625f8, ouch ! ']
['47b4955fca17af174e1a45b7ff981ce68f3625f8, jarvis and tweel follow the cart creatures to their destination , a mound with a tunnel leading down below it. ']
['47b4955fca17af174e1a45b7ff981ce68f3625f8, jarvis soon becomes lost in the network of tunnels , and hours or days pass before he and tweel find themselves in a domed chamber near the surface. ']
['47b4955fca17af174e1a45b7ff981ce68f3625f8, there they find the cart creatures depositing their loads beneath a wheel that grinds the stones and plants into dust. ']
['47b4955fca17af174e1a45b7ff981ce68f3625f8, some of the cart creatures also step under the wheel themselves and are pulverized. ']
['47b4955fca17af174e1a45b7ff981ce68f3625f8, beyond the wheel is a shining crystal on a pedestal. ']
['47b4955fca17af174e1a45b7ff981ce68f3625f8, when jarvis approaches it he feels a tingling in his hands and face , and a wart on his left thumb dries up and falls off. ']
['47b4955fca17af174e1a45b

['4d36547aa42b054fa6e8ee99e541acb8b0070fc0, she and her brothers are reunited with their parents in a triumphal celebration, which signifies the heavenly bliss awaiting the wayfaring soul that prevails over trials and travails, whether these are the threats posed by overt evil or the blandishments of temptation.,the plot concerns two brothers and their sister , simply called the lady , lost in a journey through the woods. ']
['4d36547aa42b054fa6e8ee99e541acb8b0070fc0, the lady becomes fatigued , and the brothers wander off in search of sustenance. ']
['4d36547aa42b054fa6e8ee99e541acb8b0070fc0, while alone , she encounters the debauched comus , a character inspired by the god of revelry ( ancient greek : κῶμος ) , who is disguised as a villager and claims he will lead her to her brothers. ']
['4d36547aa42b054fa6e8ee99e541acb8b0070fc0, deceived by his amiable countenance , the lady follows him , only to be captured , brought to his pleasure palace and victimised by his necromancy. ']
['4