# ***Question Answering***
![alt text](https://miro.medium.com/max/1206/1*1LzDyOaDaDrIDu9hksV35Q.jpeg)

**Question answering** (QA) is a computer science discipline within the fields of information retrieval and natural language processing (NLP), which is concerned with building systems that automatically answer questions posed by humans in a natural language.

## Introduction

This notebook walks through the steps needed for an NLP question answering project.
Data cleaning is a time-consuming and unenjoyable task, yet a very important one. Keep in mind: "garbage in, garbage out". Feeding dirty data into a model gives meaningless results.

Specifically, we'll be walking through:

1. **Getting the data** - in this case, we'll be scraping data from a website.
2. **Organizing the data** - we will organize the cleaned data so that it is easy to feed into other algorithms, and prepare a corpus.
3. **Answering questions from the data** - we will try to answer questions related to the data.


> **Corpus** - a collection of text

The output of this notebook will be clean, organized data in two forms: a list of paragraphs (`metadata`) and a single corpus string (`corpus`).


As a reminder, our goal is to look at pages about tourism in Ethiopia (called transcripts in the code). We will scrape sample data from the website www.ethiopiaonlinevisa.com, prepare a **corpus** from the data, and try to answer questions from that corpus.



---
# **Phase One:** Getting the data

Essential Libraries in Python

> **Requests** is a Python HTTP library to make HTTP requests.

> **lxml** is a Python library which allows for easy handling of XML and HTML files, and can also be used for web scraping.

> **Beautiful Soup** is a Python library for pulling data out of HTML and XML files.

> **Pickle** in Python is primarily used in serializing and deserializing a Python object structure. In other words, it's the process of converting a Python object into a byte stream to store it in a file/database, maintain program state across sessions, or transport data over the network.
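
As a minimal illustration of the serialization round trip described above, the sketch below pickles a small list to bytes and restores it. The sample paragraphs are made-up stand-ins for scraped text, not data from the site.

```python
import pickle

# Hypothetical stand-in for a list of scraped paragraphs.
paragraphs = ["First paragraph.", "Second paragraph."]

# Serialize the list to a byte stream, then rebuild an equal object from it.
payload = pickle.dumps(paragraphs)
restored = pickle.loads(payload)

print(type(payload).__name__)   # bytes
print(restored == paragraphs)   # True
```

Only unpickle data you trust: loading a pickle can execute arbitrary code.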



In [0]:
# pip install requests
# pip install beautifulsoup4
# pip install lxml
# pickle ships with the Python standard library, so it needs no install

In [0]:
import requests
from bs4 import BeautifulSoup
import pickle

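# Scrapes transcript data from www.ethiopiaonlinevisa.com
def url_to_transcript(url):
    '''Returns the paragraph texts of one article page on www.ethiopiaonlinevisa.com'''
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # stop early on HTTP errors instead of parsing an error page
    soup = BeautifulSoup(response.text, "lxml")
    # The article body is the element whose class attribute is exactly "col-lg-9 col-md-8"
    text = [p.text for p in soup.find(class_="col-lg-9 col-md-8").find_all('p')]
    print(url)
    return text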


# URLs of transcripts in scope
urls = ['https://www.ethiopiaonlinevisa.com/simien-mountains/',
        'https://www.ethiopiaonlinevisa.com/ethiopia-travel-advice-safety-security/',
        'https://www.ethiopiaonlinevisa.com/tourism-grows-ethiopia-evisa/',
        'https://www.ethiopiaonlinevisa.com/danakil-depression/',
        'https://www.ethiopiaonlinevisa.com/churches-lalibela-ethiopia-holy-city/',
        'https://www.ethiopiaonlinevisa.com/ecotourism-in-ethiopia/',
        'https://www.ethiopiaonlinevisa.com/omo-valley-ancient-tribes/',
        'https://www.ethiopiaonlinevisa.com/gondar-tourist-attractions-history/',
        'https://www.ethiopiaonlinevisa.com/axum-ethiopia/',
        'https://www.ethiopiaonlinevisa.com/addis-ababa-capital-largest-city-ethiopia/',
        'https://www.ethiopiaonlinevisa.com/ethiopia-world-heritage-attractions/',
        'https://www.ethiopiaonlinevisa.com/best-time-to-visit-ethiopia/',
        'https://www.ethiopiaonlinevisa.com/ethiopian-calendar/',
        'https://www.ethiopiaonlinevisa.com/amharic-the-ethiopian-language/']

# title names
title = ['simien', 'security', 'visa', 'danakil', 'lalibela', 'ecotourism', 'omo', 'gondar', 'axum', 'addis', 'attractions', 'best-time', 'etcalendar', 'amharic']


In [12]:
# Actually request transcripts (needs an internet connection to run)
import lxml  # not used directly; fails fast if the "lxml" parser is missing
transcripts = [url_to_transcript(u) for u in urls]

https://www.ethiopiaonlinevisa.com/simien-mountains/
https://www.ethiopiaonlinevisa.com/ethiopia-travel-advice-safety-security/
https://www.ethiopiaonlinevisa.com/tourism-grows-ethiopia-evisa/
https://www.ethiopiaonlinevisa.com/danakil-depression/
https://www.ethiopiaonlinevisa.com/churches-lalibela-ethiopia-holy-city/
https://www.ethiopiaonlinevisa.com/ecotourism-in-ethiopia/
https://www.ethiopiaonlinevisa.com/omo-valley-ancient-tribes/
https://www.ethiopiaonlinevisa.com/gondar-tourist-attractions-history/
https://www.ethiopiaonlinevisa.com/axum-ethiopia/
https://www.ethiopiaonlinevisa.com/addis-ababa-capital-largest-city-ethiopia/
https://www.ethiopiaonlinevisa.com/ethiopia-world-heritage-attractions/
https://www.ethiopiaonlinevisa.com/best-time-to-visit-ethiopia/
https://www.ethiopiaonlinevisa.com/ethiopian-calendar/
https://www.ethiopiaonlinevisa.com/amharic-the-ethiopian-language/
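
The `title` and `urls` lists are matched purely by position, so a missing or reordered entry would silently label a transcript with the wrong name. A small sketch of a guard follows, assuming the two lists are defined as in the cell above. Only the first three entries are repeated here to keep the example self-contained, and `pair_titles` is a hypothetical helper, not part of the notebook.

```python
def pair_titles(titles, urls):
    """Map each short title to its URL, refusing mismatched or duplicate input."""
    if len(titles) != len(urls):
        raise ValueError(f"{len(titles)} titles but {len(urls)} URLs")
    if len(set(titles)) != len(titles):
        raise ValueError("titles must be unique, they are used as file names")
    return dict(zip(titles, urls))


# First three entries of the notebook's lists, repeated for a runnable example.
sample_titles = ['simien', 'security', 'visa']
sample_urls = ['https://www.ethiopiaonlinevisa.com/simien-mountains/',
               'https://www.ethiopiaonlinevisa.com/ethiopia-travel-advice-safety-security/',
               'https://www.ethiopiaonlinevisa.com/tourism-grows-ethiopia-evisa/']

pairs = pair_titles(sample_titles, sample_urls)
print(pairs['visa'])
```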


In [0]:
# Just in case we use Colab
#   from google.colab import drive
#   drive.mount('/content/drive')

In [13]:
# import the os module
import os
 
# detect the current working directory and print it
dirpath = os.getcwd()
print("The current working directory is %s" % dirpath)
print("The current working directory is " + dirpath)

foldername = os.path.basename(dirpath)
print("Directory name is : " + foldername)

The current working directory is /content
The current working directory is /content
Directory name is : content


In [14]:
# Shell alternative: !mkdir -p sample_data
import os
# Name of the directory to be created, relative to the working directory
# so that it matches the "sample_data/" paths used in the cells below
path = "sample_data"
try:
    os.makedirs(path, exist_ok=True)  # no error if the directory already exists
except OSError:
    print("Creation of the directory %s failed" % path)
else:
    print("Directory %s is ready" % path)

Directory sample_data is ready


In [0]:
# Pickle files for later use

# Use the sample_data directory to hold the pickled files
# (they are binary pickles despite the .txt extension)

for i, c in enumerate(title):
    with open("sample_data/" + c + ".txt", "wb") as file:
        pickle.dump(transcripts[i], file)

In [0]:
# Load pickled files
data = {}
for c in title:
    with open("sample_data/" + c + ".txt", "rb") as file:
        data[c] = pickle.load(file)

In [18]:
# Double check to make sure data has been loaded properly
data.keys()

dict_keys(['simien', 'security', 'visa', 'danakil', 'lalibela', 'ecotourism', 'omo', 'gondar', 'axum', 'addis', 'attractions', 'best-time', 'etcalendar', 'amharic'])
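
Checking the keys confirms that every file was read, but not that each transcript actually contains text. A sketch of a stricter check follows, assuming `data` maps each title to a list of paragraph strings as above. `find_empty_transcripts` and the sample dictionary are hypothetical.

```python
def find_empty_transcripts(data):
    """Return the titles whose transcript is empty or contains only blank paragraphs."""
    return [name for name, paragraphs in data.items()
            if not any(p.strip() for p in paragraphs)]


# Made-up sample: 'axum' and 'omo' stand in for pages whose scrape returned nothing useful.
sample = {
    'simien': ['Simien Mountains National Park is a dramatic wilderness.'],
    'axum': ['', '   '],
    'omo': [],
}
print(find_empty_transcripts(sample))  # ['axum', 'omo']
```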

In [19]:
# More checks
data['addis']

['There is an incredible amount of things do do in Addis Ababa. Ethiopia’s exciting capital city is not just one of the most vibrant and eclectic cities in the country, it is often referred to as the capital of Africa.',
 'Its name translates to “new flower” in the Amharic language which reflects its youthful, lively charm and personality. Set in the breathtaking Ethiopian highlands, Addis Ababa is 2,665 meters above sea level, which makes the weather a little cool but very pleasant. Visitors can enjoy the city at any time of the year.',
 'Planning a trip to Ethiopia’s capital can be overwhelming as there is so much to see. To help you plan your trip, we’ve compiled a list of the most interesting things to see in Addis Ababa.',
 'There is an exciting range of fascinating things to do in Addis Ababa including churches, markets, and fascinating museums. You can also find some of the world’s best coffee and \xa0mouthwatering range of local and international restaurants and food stalls.',



---
# **Phase Two:** Organizing the data


In [0]:
# Concatenate all the data scraped from the website
metadata = data['simien']+data['security']+data['visa']+data['danakil']+data['amharic']+\
           data['lalibela']+data['ecotourism']+data['best-time']+data['etcalendar']+\
           data['omo']+data['gondar']+data['axum']+data['addis']+data['attractions']



In [21]:
metadata

['Simien Mountains National Park is a dramatic wilderness in the north of Ethiopia. Its jagged landscape, complete with soaring peaks and hidden valleys, is home to alien flowers and rare creatures. It is the perfect adventure for visitors who love trekking, nature, and stunning views.',
 'Read on to discover the best things to see at this UNESCO World Heritage Site, how to get there, and where to stay.',
 'The Simien Mountains (also spelled Semien mountains) are located in the Ethiopian Highlands, often referred to as “the Roof of Africa”. The highest point in the country is the mountain Ras Dashen at 4,550 m, found within the park.',
 'Simien Mountains National Park is located close to the town of Debark, where the park headquarters and main entry point can be found.',
 'Debark is on the main road between the historic cities of Gondar and Axum, within a few hours’ drive of each. There are also buses that run between these two cities that stop at Debark.',
 'Simien Mountain National P

In [22]:
#Check Data type
type(metadata)

list
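
The manual concatenation above has to be updated by hand whenever a page is added. A sketch of an equivalent that walks a list of keys instead, assuming `data` and an ordering list such as `title` exist as in Phase One. The two-entry sample is made up, and `concat_transcripts` is a hypothetical helper. Note that the result follows the order of the key list, which differs from the hand-written order used above.

```python
from itertools import chain


def concat_transcripts(data, order):
    """Flatten the paragraph lists of `data` into one list, following `order`."""
    return list(chain.from_iterable(data[name] for name in order))


# Made-up sample standing in for the loaded transcripts.
sample = {'simien': ['a', 'b'], 'visa': ['c']}
print(concat_transcripts(sample, ['visa', 'simien']))  # ['c', 'a', 'b']
```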

In [23]:
# Convert the list of paragraphs to a single string,
# separating paragraphs with a space so sentences do not run together

corpus = ' '.join(metadata)
print(corpus)

Simien Mountains National Park is a dramatic wilderness in the north of Ethiopia. Its jagged landscape, complete with soaring peaks and hidden valleys, is home to alien flowers and rare creatures. It is the perfect adventure for visitors who love trekking, nature, and stunning views. Read on to discover the best things to see at this UNESCO World Heritage Site, how to get there, and where to stay. The Simien Mountains (also spelled Semien mountains) are located in the Ethiopian Highlands, often referred to as “the Roof of Africa”. The highest point in the country is the mountain Ras Dashen at 4,550 m, found within the park. Simien Mountains National Park is located close to the town of Debark, where the park headquarters and main entry point can be found. Debark is on the main road between the historic cities of Gondar and Axum, within a few hours’ drive of each. There are also buses that run between these two cities that stop at Debark. Simien Mountain National Park is like nowhere else on
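
The scraped paragraphs still contain non-breaking spaces (the `\xa0` visible in the `data['addis']` output) and occasional doubled spaces. A minimal cleaning sketch follows, assuming plain whitespace normalization is enough for this corpus. `clean_text` is a hypothetical helper and was not part of the original run.

```python
import re


def clean_text(text):
    """Replace non-breaking spaces and collapse runs of whitespace to one space."""
    text = text.replace('\xa0', ' ')
    return re.sub(r'\s+', ' ', text).strip()


# Fragment modeled on the scraped Addis Ababa paragraph.
sample = 'best coffee and \xa0mouthwatering range'
print(clean_text(sample))  # best coffee and mouthwatering range
```

It could be applied to each paragraph before joining, for example `[clean_text(p) for p in metadata]`.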

---
# **Phase Three:** Question Answering

**AllenNLP** is an open-source Python framework, built on PyTorch, for building deep learning models for natural language processing. The install log below shows version 0.9.0, the version this notebook was run with.

In [25]:
pip install allennlp

Collecting allennlp
  Downloading https://files.pythonhosted.org/packages/bb/bb/041115d8bad1447080e5d1e30097c95e4b66e36074277afce8620a61cee3/allennlp-0.9.0-py3-none-any.whl (7.6MB)
Collecting pytorch-pretrained-bert>=0.6.0
  Downloading https://files.pythonhosted.org/packages/d7/e0/c08d5553b89973d9a240605b9c12404bcf8227590de62bae27acbcfe076b/pytorch_pretrained_bert-0.6.2-py3-none-any.whl (123kB)
Collecting jsonpickle
  Downloading https://files.pythonhosted.org/packages/07/07/c157520a3ebd166c8c24c6ae0ecae7c3968eb4653ff0e5af369bb82f004d/jsonpickle-1.2-py2.py3-none-any.whl
Collecting responses>=0.7
  Downloading https://files.pythonhosted.org/packages/3e/0c/940781dd49710f4b1f0650c450c9fd8491db0e1bffd99ebc36355607f96d/responses-0.10.9-py2.py3-none-any.whl
Collecting unidecode
  Downloading https://files.pythonhosted.org/packages/d0/42/d9edfed04228bac

**Predictor** is a thin wrapper around an AllenNLP model that handles JSON-to-JSON predictions, which can be used for serving models through a web API or for making predictions in bulk.

In [26]:
from allennlp.predictors.predictor import Predictor
predictor = Predictor.from_path("https://s3-us-west-2.amazonaws.com/allennlp/models/bidaf-model-2017.09.15-charpad.tar.gz")

100%|██████████| 46175392/46175392 [00:03<00:00, 12096705.84B/s]
  "num_layers={}".format(dropout, num_layers))


In [0]:
# Sample questions

"""
What are interesting things to see in Addis Ababa?
Where is a dramatic wilderness in the north of Ethiopia?
What is the first step to visiting the Simien Mountains?
When is the national park open?
Do I Need a Tour Guide to Visit the Omo Valley?
At which times is the national park open?
Which are the best museums in Ethiopia?
Are there any flights from Gondar?
Where is the greatest historical site in Ethiopia?
"""

result = predictor.predict(
    passage=corpus,
    # question="Where is the greatest historical site in Ethiopia?"
    question=input("\033[3;36;47m Question: ")
)
result['best_span_str']
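
To try all of the sample questions in one pass rather than typing them one at a time, the prediction call can be wrapped in a loop. The sketch below assumes only what the cell above shows, namely that `predict(passage=..., question=...)` returns a dictionary with a `best_span_str` entry. `answer_all` and `fake_predict` are hypothetical, and the fake stands in for `predictor.predict` so the example runs without downloading the model.

```python
def answer_all(predict, passage, questions):
    """Call `predict` once per question and collect the best answer spans."""
    return {q: predict(passage=passage, question=q)['best_span_str']
            for q in questions}


def fake_predict(passage, question):
    # Stand-in for predictor.predict: always returns the first sentence.
    return {'best_span_str': passage.split('.')[0]}


answers = answer_all(
    fake_predict,
    'Debark is on the main road between Gondar and Axum. Buses stop there.',
    ['Where is Debark?'],
)
print(answers)
```

With the real model, pass `predictor.predict` and `corpus` in place of the fake function and the sample passage.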