<p> <center> <a href="../Start_Here.ipynb">Home Page</a> </center> </p>

 <div>
    <span style="float: left; width: 33%; text-align: left;"><a href="QandA_data_processing.ipynb">Previous Notebook</a></span>
    <span style="float: left; width: 33%; text-align: center;">
        <a href="Overview.ipynb">1</a>
        <a href="General_preprocessing.ipynb">2</a>
        <a href="QandA_data_processing.ipynb">3</a>
        <a >4</a>
        <a href="Summary.ipynb">5</a>
    </span>
    <span style="float: left; width: 33%; text-align: right;"><a href="Summary.ipynb">Next Notebook</a></span>
</div>

# Exercise

---

## Goal
The goal of this exercise is to apply the techniques learned in the previous labs in a single pipeline that generates a question-answering dataset in SQuAD JSON format from text data, thereby solidifying your understanding.
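
For orientation, a SQuAD v2.0-style file nests `data` → `paragraphs` → `qas`. The snippet below is a minimal illustrative sketch with made-up text, not taken from the exercise data. It follows the official SQuAD v2.0 layout, which uses the key `answers` and string ids; the template code later in this notebook writes `answer` and integer ids instead.

```python
import json

# Illustrative example only: the context and question are made up.
context = "The jacket is made of recycled polyester and weighs 300 grams."
answer_text = "recycled polyester"

squad_example = {
    "version": "v2.0",
    "data": [
        {
            "title": "ecommerce",
            "paragraphs": [
                {
                    "context": context,
                    "qas": [
                        {
                            "question": "What is the jacket made of?",
                            "id": "1",
                            "answers": [
                                {
                                    "text": answer_text,
                                    # character offset of the answer inside the context
                                    "answer_start": context.find(answer_text),
                                }
                            ],
                            "is_impossible": False,
                        }
                    ],
                }
            ],
        }
    ],
}

print(json.dumps(squad_example, indent=2))
```

Each answer is stored as a text span plus its character offset, so the answer must appear verbatim in the context.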


## Data 

The sample data come from two sources and cover two topics: ecommerce and climate change. The ecommerce data is sourced from [Kaggle](https://www.kaggle.com/datasets/cclark/product-item-data) and contains 500 actual SKUs (stock-keeping units) from an outdoor apparel brand's product catalog. The climate change text is sourced from [ResearchGate](https://www.researchgate.net/publication/311301385_Climate_Change) and was authored by `Chris Riedy`.


### Exercise 1

**Load CSV file and display text**

In [None]:
import numpy as np
import pandas as pd
import re
import string
#import cudf as df

path = '../source_code/data/ecommerce.csv'
cv_file = pd.read_csv( ) # add path 

cv_file
for row in cv_file['description']:
    print(row)


**Removal of HTML tags and unwanted characters** 

In [None]:
# you are free to modify the entire code in this cell

html_reg_exp = re.compile('<.*?>')

context = []
for row in cv_file['description']:
    _context = html_reg_exp.sub( , row) #add r'' to this line such as: (r'', row)
    context.append(_context)
      
cv_file['description'] = context 
cv_file['description'][0]

In [None]:
##(optional) apply other preprocessing techniques to remove unwanted characters or symbols
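
As one possible approach to the optional cleanup, the sketch below strips tags, decodes HTML entities, and collapses repeated whitespace. The helper name `clean_text` and the sample string are hypothetical and are not part of the lab code. Unlike the cell above, it replaces tags with a space rather than an empty string, which is an assumption made here to keep adjacent list items from fusing into one word.

```python
import html
import re

TAG_RE = re.compile(r"<.*?>")   # non-greedy: matches one tag at a time
SPACE_RE = re.compile(r"\s+")   # any run of whitespace

def clean_text(text):
    """Strip HTML tags, decode entities, and collapse whitespace."""
    text = TAG_RE.sub(" ", text)   # replace tags with a space so words do not fuse
    text = html.unescape(text)     # e.g. "&amp;" becomes "&"
    return SPACE_RE.sub(" ", text).strip()

# Made-up sample resembling a product description
sample = "<ul><li>Soft &amp; warm</li><li>Machine   wash</li></ul>"
print(clean_text(sample))
```

If this fits your data, it could be applied to the whole column with `cv_file['description'].apply(clean_text)`.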



**Save the processed data in CSV file as `ecommerce_1.csv`**

In [None]:
## write back to ecommerce_1.csv

save_path = '../source_code/data/ '  #complete the path
cv_file.to_csv(save_path, index=False)

#cv_file.head()

### Exercise 2

Follow the steps below to complete the exercise.

### Step 1

- Download/open [ecommerce_1.csv](../source_code/data/ecommerce_1.csv) at `source_code/data/ecommerce_1.csv` and the [climate change](../source_code/data/Climatechange.docx) document at `source_code/data/Climatechange.docx`
- Create two files named `exercise.xlsx` and `exercise.csv`
- In `exercise.xlsx`, create three sheets (Sheet0, Sheet1, Sheet2)
- In `Sheet0`, create two columns (tid, title) and insert two rows, (1, ecommerce) and (2, climate change), for tid and title respectively
- In `Sheet1`, create three columns (tid, cid, and context). Please refer to the `On-Code SQuAD Format QA dataset Generation` section in `Lab 3`
- Manually extract <u>10 contexts</u> from each file, for a total of 20 contexts, with `cid` numbered 1-20 and `tid` set to 1 or 2 according to the title mapping in `Sheet0`.
- Create four columns in `Sheet2` (tid, cid, question, answer). Extract 2 questions and answers (QA) from each context in `Sheet1` and insert them into the question and answer columns in `Sheet2`. Also copy the tid and cid of the context each QA was extracted from into `Sheet2`, as shown in `Lab 3`
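
The tid/cid mapping described in the steps above can be checked before building the JSON. The following is a minimal sketch using plain lists of dictionaries with made-up rows standing in for the three sheets. The function name `find_layout_problems` is hypothetical; rows from a real workbook could be obtained with something like `pd.read_excel(...).to_dict('records')`.

```python
# Toy rows mirroring the three sheets; the text is made up for illustration.
sheet0 = [{"tid": 1, "title": "ecommerce"}, {"tid": 2, "title": "climate change"}]
sheet1 = [
    {"tid": 1, "cid": 1, "context": "The jacket is made of recycled polyester."},
    {"tid": 2, "cid": 11, "context": "Greenhouse gases trap heat in the atmosphere."},
]
sheet2 = [
    {"tid": 1, "cid": 1, "question": "What is the jacket made of?", "answer": "recycled polyester"},
    {"tid": 2, "cid": 11, "question": "What do greenhouse gases trap?", "answer": "heat"},
    # deliberately broken row: no context has (tid=2, cid=12) in sheet1
    {"tid": 2, "cid": 12, "question": "Orphan question?", "answer": "none"},
]

def find_layout_problems(titles, contexts, qas):
    """Return a list of human-readable problems with the tid/cid mapping."""
    problems = []
    known_tids = {row["tid"] for row in titles}
    known_pairs = {(row["tid"], row["cid"]) for row in contexts}
    for row in contexts:
        if row["tid"] not in known_tids:
            problems.append(f"context cid={row['cid']} uses unknown tid={row['tid']}")
    for row in qas:
        if (row["tid"], row["cid"]) not in known_pairs:
            problems.append(
                f"question {row['question']!r} points to a missing context "
                f"(tid={row['tid']}, cid={row['cid']})"
            )
    return problems

for problem in find_layout_problems(sheet0, sheet1, sheet2):
    print(problem)
```

An empty result means every question row points at an existing context and every context at an existing title.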



### Step 2

- In `exercise.csv`, create a single column named `document_text`.
- Copy all the contexts from `Sheet1` in `exercise.xlsx` and paste them under the `document_text` column in `exercise.csv`, as shown in the `No-Code SQuAD Format QA dataset Generation with Haystack` section in `Lab 3`
- Copy both files (`exercise.xlsx` and `exercise.csv`) and paste them into the `../source_code/data/` directory
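
If you prefer to create the single-column CSV programmatically rather than by copy and paste, a sketch along these lines may help. The contexts are made up, and the file is written to a temporary folder here as an assumption to keep the example self-contained; the exercise expects the file at `../source_code/data/exercise.csv`.

```python
import csv
import os
import tempfile

# Made-up contexts standing in for the `context` column of Sheet1.
contexts = [
    "The jacket is made of recycled polyester, and it weighs 300 grams.",
    "Greenhouse gases trap heat in the atmosphere.",
]

# Temporary location for this sketch only.
out_path = os.path.join(tempfile.mkdtemp(), "exercise.csv")

with open(out_path, "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["document_text"])   # single header column
    for text in contexts:
        writer.writerow([text])          # commas inside a context are quoted automatically

# Read the file back to confirm the layout.
with open(out_path, newline="", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))

print(len(rows), "contexts written")
```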



### Exercise 3

- Generate a SQuAD-format JSON dataset called `exercise.json` from `exercise.xlsx` by modifying the code cell below. Refer back to `Lab 3` if needed.


In [None]:
# You are free to write your own code instead
import json
import csv
import re, os
import pandas as pd

class MakeDataset():
    def __init__(self, path):
        self.Path = path
        self.final_json = {}
        self.final_json['version'] = "v2.0"
        self.final_json['data'] = []      
    
    # modify this method
    def csv_reader(self):
        xls = pd.ExcelFile(self.Path)
        self.Title = 
        self.Context = 
        self.QandA   =
        
        
    def get_loc(self, answer, content):
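        # re.escape treats the answer as literal text, so characters such as "(" or "$" cannot break the search
        loc = re.search(re.escape(answer.lower()), content.lower())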
        if loc is None:
            return -1
        else:
            return loc.span()[0]
        
    # modify this method
    def make_json(self):
        qid = 1
        for i in range(len(self.Title)):
            self.brace_in_data ={}
            self.brace_in_data['title'] = self.Title['title'][i]
            self.brace_in_data['paragraphs'] = []                              
            for j in range(len(self.Context)):
                if self.Title['tid'][i] == self.Context['tid'][j]:
                    brace_in_paragaraphs = {}
                    brace_in_paragaraphs['context'] =    #add
                    brace_in_paragaraphs['qas'] =        #add    
                    for k in  range(len(self.QandA)):
                        if self.Context['cid'][j] == self.QandA['cid'][k] and self.Title['tid'][i] == self.QandA['tid'][k]:
                            brace_in_qas = {}
                            brace_in_qas['question'] = self.QandA['question'][k]
                            brace_in_qas['id'] = qid
                            loc = self.get_loc(self.QandA['answer'][k], self.Context['context'][j])
                            if loc == -1:
                                brace_in_qas['answer'] =[] 
                                brace_in_qas['is_impossible'] = True
                            else:
                                brace_in_qas['answer'] =[{'text':self.QandA['answer'][k], 'answer_start':loc}]
                                brace_in_qas['is_impossible'] = False
                            qid +=1                                   
                            brace_in_paragaraphs['qas'].append() # add
                    self.brace_in_data['paragraphs'].append(brace_in_paragaraphs)
            self.final_json['data'].append(self.brace_in_data)        
                    
    def save_json(self, filename):
        with open(f"../source_code/data/{filename}.json", "w") as write_file:
            json.dump(self.final_json, write_file, indent=4)
            print("dataset saved in SQuAD json format ....")
    

path = '../source_code/data/exercise.xlsx' 
Obj = MakeDataset(path)
Obj.csv_reader()
Obj.make_json()
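
The `answer_start` value is the character offset of the answer inside its context, with `-1` signalling that the answer was not found. The standalone sketch below illustrates that lookup on a made-up sentence. The function name `find_answer_start` is hypothetical, and the use of `re.escape` with `re.IGNORECASE` is one possible way to do a literal, case-insensitive search.

```python
import re

def find_answer_start(answer, context):
    """Return the character offset of `answer` in `context`, or -1 if absent."""
    # re.escape makes the answer literal, so "$25 (approx.)" is not read as a regex
    match = re.search(re.escape(answer), context, flags=re.IGNORECASE)
    return match.start() if match else -1

# Made-up context containing regex metacharacters
context = "The vest costs $25 (approx.) and ships worldwide."

print(find_answer_start("$25 (approx.)", context))
print(find_answer_start("SHIPS worldwide", context))
print(find_answer_start("free returns", context))
```

Without escaping, an answer containing characters such as `(`, `$`, or `.` would be interpreted as a pattern and could either fail to match or match the wrong span.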

**Save `exercise.json`**

In [None]:
save_filename = 'exercise'  # save_json appends the .json extension

## Add call to save_json function

Obj.final_json

### Exercise 4

Use the T5-Base model to generate question-answer pairs from `exercise.csv`.

**Step 1:** Load the CSV file and add model

In [None]:
import pandas as pd

input_file_path = '../source_code/data/exercise.csv'
cv_file =  #add code

cv_file.head()

In [None]:
import sys
sys.path.append('question_generation')

In [None]:
# add model
from pipelines import pipeline

nlp = pipeline()  #add model to this line

**Step 2:** Write functions to generate SQuAD json format

In [None]:
# You are free to modify or rewrite the code from scratch

import json
import csv
import re, os

def get_loc(answer, content):
    # complete the code
    

def make_json(qa, context):
    qid = 1
    final_json = {}
    final_json['version'] = "v2.0"
    final_json['data'] = []
    brace_in_data ={}
    
    brace_in_data['paragraphs'] = []
    for j in range(len(context)):
        brace_in_paragaraphs = {}
        brace_in_paragaraphs['qas'] = []    
        for row in qa[j]:
            brace_in_qas = {}
            #for row in rows:                
            brace_in_qas['question'] = row['question']
            brace_in_qas['id'] = qid
            loc = get_loc(str(row['answer']), str(context[j]))
            if loc == -1:
                brace_in_qas['answer'] =[] 
                brace_in_qas['is_impossible'] = True
            else:
                brace_in_qas['answer'] =[{'text':row['answer'], 'answer_start':loc}]
                brace_in_qas['is_impossible'] = False
            qid +=1
            brace_in_paragaraphs['qas'].append()  # add missing variable
        brace_in_paragaraphs['context'] = context[j]
        brace_in_data['paragraphs'].append()  # add missing variable
    final_json['data'].append() # add missing variable
    
    return final_json

**Extract question and answer and make json file**

In [None]:
qa = []
for row in cv_file['document_text']:
    qa.append(nlp(row))

json_data = make_json(qa, cv_file['document_text'])
json_data
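
Once a dataset has been generated, it can be useful to confirm that every `answer_start` really points at the answer text. The sketch below is one way to do that on a tiny hand-made dataset. The function name `check_answer_offsets` is hypothetical, and the default key `answer` is an assumption that mirrors the template code in this notebook (official SQuAD files use `answers`).

```python
def check_answer_offsets(dataset, answers_key="answer"):
    """Return ids of questions whose answer_start does not point at the answer text."""
    bad_ids = []
    for entry in dataset["data"]:
        for paragraph in entry["paragraphs"]:
            context = str(paragraph["context"])
            for qa in paragraph["qas"]:
                for ans in qa[answers_key]:
                    text = str(ans["text"])
                    start = ans["answer_start"]
                    span = context[start:start + len(text)]
                    # compare case-insensitively, assuming a case-insensitive lookup produced the offset
                    if span.lower() != text.lower():
                        bad_ids.append(qa["id"])
    return bad_ids

# Tiny hand-made dataset: the second answer_start is deliberately wrong.
toy = {
    "version": "v2.0",
    "data": [{"paragraphs": [{
        "context": "Greenhouse gases trap heat.",
        "qas": [
            {"question": "What is trapped?", "id": 1, "is_impossible": False,
             "answer": [{"text": "heat", "answer_start": 22}]},
            {"question": "What traps heat?", "id": 2, "is_impossible": False,
             "answer": [{"text": "Greenhouse gases", "answer_start": 5}]},
        ],
    }]}],
}

print(check_answer_offsets(toy))
```

Questions marked `is_impossible` carry an empty answer list, so they pass through this check without being flagged.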

---
## Licensing

Copyright © 2022 OpenACC-Standard.org. This material is released by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0 International (CC BY 4.0). These materials include references to hardware and software developed by other entities; all applicable licensing and copyrights apply.
