Palpx-Task

1) Problem statement :

Build a question answering system for a text article that returns the answer to a question along with a confidence score.

2) Input :

Input 1. Link to an English Wikipedia article page. Example: article: “https://en.wikipedia.org/wiki/R2-D2”

Input 2. Question in natural language related to that article's content. Example: “When was R2-D2 inducted into the Robot Hall of Fame?”

3) Output :

Extract and show the particular sentence containing the answer to the question, with a confidence score. Example: “R2-D2 was inducted into the Robot Hall of Fame in 2003.”, score: 0.8

4) Code Format :

  • I have used a Jupyter Notebook to present my work.

5) Requirements & Packages :

A. Python 3.7

B. NLTK = Natural Language Toolkit. It can be used for a wide range of text processing tasks. I have used it to split the article into sentences.

$ pip install nltk

After installing nltk, you need to download 'punkt', which enables sentence tokenization:

import nltk

nltk.download('punkt')
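
A quick check of sentence tokenization on the example article's answer sentence:

from nltk.tokenize import sent_tokenize

text = "R2-D2 is a fictional robot. It was inducted into the Robot Hall of Fame in 2003."
print(sent_tokenize(text))
# ['R2-D2 is a fictional robot.', 'It was inducted into the Robot Hall of Fame in 2003.']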

C. BEAUTIFUL SOUP = This library parses the HTML structure of the page we want to work with. We can then use its functions to access specific elements and extract the relevant information.

$ pip install beautifulsoup4
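
For example, the article's paragraph text can be pulled out like this (the requests library, not in the list above, is assumed for fetching the page):

import requests
from bs4 import BeautifulSoup

html = requests.get("https://en.wikipedia.org/wiki/R2-D2").text
soup = BeautifulSoup(html, "html.parser")
# Wikipedia body text lives in <p> tags
article = " ".join(p.get_text() for p in soup.find_all("p"))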

D. FUZZY WUZZY = It uses Levenshtein distance to calculate the similarity between two given strings.

$ pip install fuzzywuzzy
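
For instance, fuzz.ratio gives a 0-100 similarity score between two strings:

from fuzzywuzzy import fuzz

q = "When was R2-D2 inducted into the Robot Hall of Fame?"
s = "R2-D2 was inducted into the Robot Hall of Fame in 2003."
print(fuzz.ratio(q, s))            # plain Levenshtein-based score
print(fuzz.token_set_ratio(q, s))  # ignores word order and duplicates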

E. REGULAR EXPRESSIONS = I have used regular expressions to preprocess the data by removing HTML tags. Python's built-in re module is enough for this; the third-party regex package is a drop-in superset if you prefer it:

$ pip install regex
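
A minimal sketch of what a tag-stripping helper (like the striphtml function used in Step 2) could look like:

import re

def striphtml(raw_html):
    # drop everything between angle brackets, e.g. "<p>", "</a>"
    return re.sub(r"<[^>]+>", "", raw_html)

print(striphtml("<p>R2-D2 was inducted in <b>2003</b>.</p>"))
# R2-D2 was inducted in 2003.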

F. TQDM = It is used to display progress bars for loops in a script.

$ pip install tqdm
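
Wrapping any iterable in tqdm prints a live progress bar, e.g. while scoring sentences:

from tqdm import tqdm

sentences = ["sentence one.", "sentence two.", "sentence three."]
for sentence in tqdm(sentences, desc="Scoring"):
    pass  # score each sentence here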

G. SPACY = spaCy is a free, open-source library for Natural Language Processing in Python. It features NER, POS tagging, dependency parsing, word vectors and more.

$ pip install -U spacy

You also need to download the English language model (on spaCy 3 and later, the model is named en_core_web_sm):

$ python -m spacy download en
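
A quick look at the POS tags and dependency labels spaCy assigns, which Step 5 below relies on:

import spacy

nlp = spacy.load("en_core_web_sm")  # use "en" on spaCy 2.x
doc = nlp("When was R2-D2 inducted into the Robot Hall of Fame?")
for token in doc:
    print(token.text, token.pos_, token.dep_)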

6) Step by Step Procedure :

  • Step 1 : Download and import all the required libraries and packages. Please see the section above for the requirements.

  • Step 2 : We ask the user to provide a reference Wikipedia article that will be used as the input text corpus. We validate the URL and then use the BeautifulSoup library to extract the data (scraped in HTML format). The striphtml function removes all HTML tags and carries out preprocessing to obtain the clean article text. We then tokenize the article into individual sentences with the sent_tokenize function and store them in a list. (A condensed sketch of Steps 2-5 appears after this list.)

  • Step 3 : Here we ask the user to enter the desired question pertaining to the Wikipedia article provided.

  • Step 4 : In this step we find the CONTEXT: we match the words of the question against the list of article sentences, locate the sentence containing the majority of the question's words, and extract it along with the sentences before and after it to form the CONTEXT.

  • Step 5 : Using spaCy's token dependencies, we find the subject, object and root word of the question and store them as the key phrase. This is done in order to understand what topic the question revolves around. Using the key phrase, we then find the sentence in the context that contains the subject, object and root word, as this sentence is expected to contain the answer.
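
A condensed sketch of the Steps 2-5 pipeline is shown below. This is not the notebook's exact code: the function names (fetch_article, find_context, key_phrase) and the simple word-overlap scoring are illustrative stand-ins, and the requests library (not in the requirements above) is assumed for fetching the page.

import re
import requests
import spacy
from bs4 import BeautifulSoup
from nltk.tokenize import sent_tokenize

nlp = spacy.load("en_core_web_sm")

def fetch_article(url):
    # Step 2: scrape the page and keep only the paragraph text
    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    raw = " ".join(p.get_text() for p in soup.find_all("p"))
    clean = re.sub(r"\[\d+\]", "", raw)  # drop citation markers like [3]
    return sent_tokenize(clean)          # assumes nltk's punkt is downloaded

def find_context(sentences, question, window=2):
    # Step 4: locate the sentence sharing the most words with the question,
    # then return it together with its neighbours as the CONTEXT
    q_words = {w.lower() for w in question.split()}
    best = max(range(len(sentences)),
               key=lambda i: len(q_words & {w.lower() for w in sentences[i].split()}))
    return sentences[max(0, best - window): best + window + 1]

def key_phrase(question):
    # Step 5: use spaCy's dependency labels to pull out subject, object and root
    doc = nlp(question)
    return [t.text for t in doc if t.dep_ in ("nsubj", "nsubjpass", "dobj", "pobj", "ROOT")]

question = "When was R2-D2 inducted into the Robot Hall of Fame?"
sentences = fetch_article("https://en.wikipedia.org/wiki/R2-D2")
print(find_context(sentences, question))
print(key_phrase(question))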

7) Future Possible modifications :

Word embeddings could be applied; this would help build a more robust model.

Word embeddings capture the syntactic and semantic similarity between words. Mathematically, words are represented as vectors, and if the angle between two word vectors is close to 0 (cosine similarity close to 1), the words are similar. This would refine the search for the answer: the model could tell whether a word in the question and a word in the context are similar or dissimilar. These vector representations are also the foundation of the model mentioned below.
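
As an illustration, with a spaCy model that ships word vectors (en_core_web_md; the small model does not include real vectors), cosine similarity can be computed directly:

import numpy as np
import spacy

nlp = spacy.load("en_core_web_md")  # the medium English model includes word vectors

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

u = nlp("inducted")[0].vector
v = nlp("honored")[0].vector
print(cosine(u, v))  # angle near 0 gives a value near 1, i.e. similar words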

Using the context obtained above, Google's open-source pretrained model (BERT) could be attached to build a question answering model.

BERT, or Bidirectional Encoder Representations from Transformers, is a method for language representation that obtains state-of-the-art results on a wide array of Natural Language Processing (NLP) tasks. BERT is trained as a general-purpose "language understanding" model on a large text corpus (such as Wikipedia), and the model can then be fine-tuned for downstream NLP tasks we care about (such as question answering).
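
As a sketch of this direction: Hugging Face's transformers library (not among the requirements above) exposes pretrained BERT-style models through a question answering pipeline that returns exactly what this task asks for, an answer plus a confidence score.

from transformers import pipeline

qa = pipeline("question-answering")  # downloads a default pretrained extractive QA model
result = qa(
    question="When was R2-D2 inducted into the Robot Hall of Fame?",
    context="R2-D2 was inducted into the Robot Hall of Fame in 2003.",
)
print(result["answer"], result["score"])  # e.g. "2003" with a score between 0 and 1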
