# 1. Search and Information Retrieval
Apart from the core functions of storing data and ranking search
results, several features in a modern search engine involve NLP:
- Spelling correction
- Related queries
- Snippet extraction
- Biographical information extraction
- Search results classification

Two types of search engines are
distinguished as follows:
- Generic search engines, such as Google and Bing, that crawl
the web and aim to cover as much as possible by constantly
looking for new webpages
- Enterprise search engines, where our search space is
restricted to a smaller set of already existing documents
within an organization

## Components of a Search Engine
![](images/search-engine.png)
 <center>Early architecture of the Google search engine </center>
- Crawler: Collects all the content for the search engine. The crawler’s job is
to traverse the web starting from a set of seed URLs and build its
collection of URLs through them in a breadth-first way. It visits
each URL, saves a copy of the document, detects the outgoing
hyperlinks, then adds them to the list of URLs to be visited next.
- Indexer: Parses and stores the content that the crawler collects and builds
an “index” so it can be searched and retrieved efficiently.
- Searcher: Searches the index and ranks the search results for the user query
based on the relevance of the results to the query.
- Feedback: A fourth component, which is now common in all search engines,
that tracks and analyzes user interactions with the search engine,
such as click-throughs, time spent on searching and on each clicked
result, etc., and uses it for continuous improvement of the search
system.
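
The crawler's breadth-first traversal described above can be sketched in a few lines. This is a minimal illustration, not a production crawler: the `TOY_WEB` dictionary, its URLs, and the `crawl` function are hypothetical stand-ins for real HTTP fetching and link extraction.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
TOY_WEB = {
    "a.example/home": ("welcome page", ["a.example/books", "a.example/about"]),
    "a.example/books": ("book list", ["a.example/home", "a.example/animal-farm"]),
    "a.example/about": ("about us", []),
    "a.example/animal-farm": ("animal farm summary", ["a.example/books"]),
}

def crawl(seed_urls, fetch, max_pages=100):
    """Breadth-first crawl: returns {url: saved page text} in visit order."""
    queue = deque(seed_urls)   # URLs to be visited next
    seen = set(seed_urls)      # avoids queueing the same URL twice
    saved = {}
    while queue and len(saved) < max_pages:
        url = queue.popleft()
        page = fetch(url)
        if page is None:       # unknown or unreachable URL
            continue
        text, links = page
        saved[url] = text      # save a copy of the document
        for link in links:     # detect the outgoing hyperlinks
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return saved

pages = crawl(["a.example/home"], TOY_WEB.get)
print(list(pages))
```

Using a queue (rather than a stack) is what makes the traversal breadth-first: pages one link away from the seeds are visited before pages two links away.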

## A Typical Enterprise Search Pipeline
The pipeline typically consists of the following steps. The first three build the index:
- Crawling/content acquisition
- Text normalization
- Indexing

With the index in place, the remaining steps are:
1. Query processing and execution: The search query is passed
through the text normalization process as above. Once the
query is framed, it’s executed, and results are retrieved and
ranked according to some notion of relevance.
2. Feedback and ranking: To evaluate search results and make
them more relevant to the user, user behavior is recorded and
analyzed, and signals such as click action on result and time
spent on a result page are used to improve the ranking
algorithm.
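
To make the normalization, indexing, and query-execution steps concrete, here is a toy sketch of an inverted index. It is an illustration under simplifying assumptions: the `normalize`, `build_index`, and `search` functions are hypothetical, and ranking by summed term frequency is a deliberately crude notion of relevance.

```python
import re
from collections import defaultdict

def normalize(text):
    """Lowercase and split on non-alphanumeric characters."""
    return [tok for tok in re.split(r"[^a-z0-9]+", text.lower()) if tok]

def build_index(docs):
    """Inverted index: term -> {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for tok in normalize(text):
            index[tok][doc_id] = index[tok].get(doc_id, 0) + 1
    return index

def search(index, query):
    """Rank documents by the summed frequency of the query terms."""
    scores = defaultdict(int)
    for tok in normalize(query):   # same normalization as at index time
        for doc_id, tf in index.get(tok, {}).items():
            scores[doc_id] += tf
    return sorted(scores, key=lambda d: (-scores[d], d))

docs = {
    "d1": "Animal Farm is a story about farm animals.",
    "d2": "A farm, a barn, and a horse.",
    "d3": "A detective story set in London.",
}
index = build_index(docs)
results = search(index, "FARM story")
print(results)
```

The key design point is that the query goes through the same `normalize` function as the documents; otherwise "FARM" would never match the indexed term "farm".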

## Setting Up a Search Engine: An Example
This notebook shows how to use Elasticsearch to index and search through data. We will use the CMU Book Summaries [dataset](http://www.cs.cmu.edu/~dbamman/booksummaries.html).

For this code to work, an Elasticsearch instance has to be running in the background. To start one, follow these steps:

Linux:

1. Go to the `elasticsearch-X.Y.Z/bin` folder on your machine
2. Run `./elasticsearch`

Windows:

1. Download the latest release
2. Run `.\bin\elasticsearch.bat`

[ElasticSearch Documentation](https://www.elastic.co/guide/index.html)

In [None]:
from elasticsearch import Elasticsearch 
from datetime import datetime

In [None]:
#elastic search instance has to be running on the machine. Default port is 9200. 

#Call the Elastic Search instance, and delete any pre-existing index
es=Elasticsearch([{'host':'localhost','port':9200}])
if es.indices.exists(index="myindex"):
    es.indices.delete(index='myindex', ignore=[400, 404]) #Deleting existing index for now 

In [None]:
#Build an index from booksummaries dataset. I am using only 500 documents for now.
path = "booksummaries.txt" #Add your path.
with open(path, encoding="utf-8") as f:
    for count, line in enumerate(f, start=1):
        fields = line.split("\t")
        doc = {'id' : fields[0],
                'title': fields[2],
                'author': fields[3],
                'summary': fields[6]
              }

        res = es.index(index="myindex", id=fields[0], body=doc)
        if count % 100 == 0:
            print("indexed %d documents" % count)
        if count == 500: #stop after 500 documents
            break

In [None]:
#Check how big the index is
res = es.search(index="myindex", body={"query": {"match_all": {}}})
print("Your index has %d entries" % res['hits']['total']['value'])

In [None]:
#Try a test query. The query searches "summary" field which contains the text
#and does a full text query on that field.
res = es.search(index="myindex", body={"query": {"match": {"summary": "animal"}}})
print("Your search returned %d results." % res['hits']['total']['value'])

In [None]:
#Printing the title field and the first 100 characters of the summary field for the 3rd result (index 2)
print(res["hits"]["hits"][2]["_source"]["title"])
print(res["hits"]["hits"][2]["_source"]["summary"][:100])

In [None]:
#A match query analyzes the query text and, by default, works as an OR query over its terms.
#match_phrase requires the terms to appear together, in order, as a phrase.
while True:
    query = input("Enter your search query: ")
    if query == "STOP":
        break
    res = es.search(index="myindex", body={"query": {"match_phrase": {"summary": query}}})
    print("Your search returned %d results:" % res['hits']['total']['value'])
    for hit in res["hits"]["hits"]:
        summary = hit["_source"]["summary"]
        print(hit["_source"]["title"])
        print(summary[:100])
        #to get a snippet 100 characters before and after the match
        loc = summary.lower().find(query.lower())
        if loc != -1: #the analyzed match may not appear verbatim in the raw text
            print(summary[max(loc-100, 0):loc+100])

## A Case Study: Book Store Search
Imagine a scenario where we have a new e-commerce store focused
on books and we have to build its search pipeline. We have metadata
like author, title, and summary. The search functionality we saw earlier
can serve as the baseline at the start. We can set up our own search
engine backend or use online services like Elasticsearch or
Elastic on Azure.

This default search output might have several issues. For instance, it
may rank results with exact query matches in the title or summary
higher than more relevant results that aren’t an exact match. Some of
the exact matches might be poorly written books with bad reviews,
which we’re not accounting for in our search ranking.

We can incorporate real-world metrics that account for this into our
search engine. For instance, the number of times a book is viewed and
sold, the number of reviews, and the book’s rating can all be
incorporated into the search ranking function. 
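
One way to fold such signals into the ranking is to re-score the results returned by the text search. The sketch below is only an illustration: the `rescore` function, the `score`, `rating`, and `sales` field names, and the weights are all assumptions that would need tuning on real data.

```python
import math

def rescore(hits, text_weight=1.0, rating_weight=0.5, sales_weight=0.2):
    """Re-rank hits by text relevance combined with rating and sales signals."""
    def combined(hit):
        return (text_weight * hit["score"]
                + rating_weight * hit["rating"]
                + sales_weight * math.log1p(hit["sales"]))  # log dampens huge sales counts
    return sorted(hits, key=combined, reverse=True)

# Hypothetical search hits: text relevance score plus business metrics.
hits = [
    {"title": "Exact match, poor reviews", "score": 5.0, "rating": 1.5, "sales": 10},
    {"title": "Close match, well reviewed", "score": 4.2, "rating": 4.8, "sales": 5000},
]
reranked = rescore(hits)
for hit in reranked:
    print(hit["title"])
```

With these example weights, the well-reviewed close match moves above the poorly reviewed exact match, which is the behavior the paragraph above asks for.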

We should start collecting user interactions with the search engine to
improve it further.

# 2. Topic Modeling
Topic models are used extensively for
document clustering and organizing large collections of text data.
They’re also useful for text classification.

Topic modeling tries to identify the
“key” words (called “topics”) present in a text corpus without prior
knowledge about it, unlike the rule-based text mining approaches that
use regular expressions or dictionary-based keyword searching
techniques.

![](images/topic-modeling.png)
<center>Illustration of a topic modeling visualization</center>

Topic modeling generally refers to a collection of
unsupervised statistical learning methods to discover latent topics in a
large collection of text documents. Some of the popular topic modeling
algorithms are latent Dirichlet allocation (LDA), latent semantic
analysis (LSA), and probabilistic latent semantic analysis (PLSA). In
practice, the technique that’s most commonly used is LDA.


## Training a Topic Model: An Example

In [3]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from gensim.corpora import Dictionary
from gensim.models import LdaModel
from pprint import pprint

In [4]:
#tokenize, lowercase, remove stopwords and non-alphabetic words
stops = set(stopwords.words('english'))
def preprocess(textstring):
   tokens = word_tokenize(textstring.lower())
   return [token for token in tokens if token.isalpha() and token not in stops]

# This is a sample path to your downloaded dataset.
# Please update it to your actual download path regardless of your choice of operating system.

data_path = "data/booksummaries.txt"

summaries = []
for line in open(data_path, encoding="utf-8"):
   temp = line.split("\t")
   summaries.append(preprocess(temp[6]))

# Create a dictionary representation of the documents.

dictionary = Dictionary(summaries)

# Filter infrequent or too frequent words.

dictionary.filter_extremes(no_below=10, no_above=0.5)
corpus = [dictionary.doc2bow(summary) for summary in summaries]

# Make an index-to-word dictionary.

temp = dictionary[0]  # This is only to "load" the dictionary.
id2word = dictionary.id2token

#Train the topic model

model = LdaModel(corpus=corpus, id2word=id2word,iterations=400, num_topics=10)
top_topics = list(model.top_topics(corpus))
pprint(top_topics)

## What’s Next?
In our experience, some of the use cases for topic models are:
- Summarizing documents, tweets, etc., in the form of keywords based on learned topic distributions
- Detecting social media trends over a period of time
- Designing recommender systems for text

# 3. Text Summarization
Text summarization refers to the task of creating a summary of a
longer piece of text. The goal of this task is to create a coherent
summary that captures the key ideas in the text.

Summarization tasks can be distinguished along the following lines:
- Extractive versus abstractive summarization
- Query-focused versus query-independent summarization
- Single-document versus multi-document summarization

## Summarization Use Cases
The most common use case for text summarization is
single-document, query-independent, extractive summarization. This
is typically used to create short summaries of longer documents for
human readers or a machine (e.g., in a search engine to index
summaries instead of full texts).

![](images/autotdr.png)
<center>Screenshot of Reddit’s autotldr bot</center>

**There are broadly two types of summarization: extractive and abstractive**
1. Extractive: These approaches select the sentences from the source text that best represent it and arrange them to form a summary.
2. Abstractive: These approaches use natural language generation techniques to summarize a text using novel sentences.
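
To illustrate the extractive idea, here is a minimal sketch that scores each sentence by the average corpus frequency of its words and keeps the top-scoring ones in their original order. It is a simplified stand-in, not any particular published algorithm; the `extractive_summary` function and its regex-based sentence splitting are assumptions made for illustration.

```python
import re
from collections import Counter

def extractive_summary(text, num_sentences=1):
    """Pick the sentences whose words are most frequent in the text."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(re.findall(r"[a-z]+", text.lower()))

    def score(sentence):
        tokens = re.findall(r"[a-z]+", sentence.lower())
        return sum(freq[t] for t in tokens) / max(len(tokens), 1)

    top = sorted(sentences, key=score, reverse=True)[:num_sentences]
    return [s for s in sentences if s in top]   # keep the original order

text = ("Search engines index documents. "
        "Search engines rank documents for a query. "
        "Cats sleep.")
summary = extractive_summary(text, num_sentences=1)
print(summary)
```

Note that every output sentence is copied verbatim from the input; an abstractive system would instead generate new sentences.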

## Setting Up a Summarizer: An Example
**Summarization with Sumy**\
Sumy offers several algorithms and methods for summarization such as:
1. Luhn – Heuristic method
2. Latent Semantic Analysis
3. LexRank – Unsupervised approach inspired by the PageRank and HITS algorithms
4. TextRank – Graph-based summarization technique with keyword extraction from the document

In [6]:
#Code to summarize a given webpage using four of Sumy's summarizers (TextRank, LexRank, Luhn, LSA).
from sumy.parsers.html import HtmlParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.text_rank import TextRankSummarizer
from sumy.summarizers.lex_rank import LexRankSummarizer
from sumy.summarizers.luhn import LuhnSummarizer
from sumy.summarizers.lsa import LsaSummarizer

num_sentences_in_summary = 2
url = "https://en.wikipedia.org/wiki/Automatic_summarization"
parser = HtmlParser.from_url(url, Tokenizer("english"))

summarizer_names = ("TextRankSummarizer:", "LexRankSummarizer:", "LuhnSummarizer:", "LsaSummarizer:") #labels printed before each summary
summarizers = [TextRankSummarizer(), LexRankSummarizer(), LuhnSummarizer(), LsaSummarizer()]

for name, summarizer in zip(summarizer_names, summarizers):
    print(name)
    for sentence in summarizer(parser.document, num_sentences_in_summary):
        print(sentence)
    print("-"*30)

TextRankSummarizer:
For text, extraction is analogous to the process of skimming, where the summary (if available), headings and subheadings, figures, the first and last paragraphs of a section, and optionally the first and last sentences in a paragraph are read before one chooses to read the entire document in detail.
Once the graph is constructed, it is used to form a stochastic matrix, combined with a damping factor (as in the "random surfer model"), and the ranking over vertices is obtained by finding the eigenvector corresponding to eigenvalue 1 (i.e., the stationary distribution of the random walk on the graph).
------------------------------
LexRankSummarizer:
An example of a summarization problem is document summarization, which attempts to automatically produce an abstract from a given document.
The main difficulty in supervised extractive summarization is that the known summaries must be manually created by extracting sentences so the sentences in an original training documen

## Summarization example with Gensim

In [7]:
#Note: gensim.summarization was removed in Gensim 4.0, so this cell needs gensim<4.0
from gensim.summarization import summarize,summarize_corpus
from gensim.summarization.textcleaner import split_sentences
from gensim import corpora

text = open("data/nlphistory.txt").read()

#summarize method extracts the most relevant sentences in a text
print("Summarize:\n",summarize(text, word_count=200, ratio = 0.1))


#the summarize_corpus selects the most important documents in a corpus:
sentences = split_sentences(text)# Creates a corpus where each document is a sentence.
tokens = [sentence.split() for sentence in sentences]
dictionary = corpora.Dictionary(tokens)
corpus = [dictionary.doc2bow(sentence_tokens) for sentence_tokens in tokens]

# Extracts the most important documents (shown here in BoW representation)
print("-"*30,"\nSummarize Corpus\n",summarize_corpus(corpus,ratio=0.1))

Summarize:
 Some notably successful natural language processing systems developed in the 1960s were SHRDLU, a natural language system working in restricted "blocks worlds" with restricted vocabularies, and ELIZA, a simulation of a Rogerian psychotherapist, written by Joseph Weizenbaum between 1964 and 1966.
This was due to both the steady increase in computational power (see Moore's law) and the gradual lessening of the dominance of Chomskyan theories of linguistics (e.g. transformational grammar), whose theoretical underpinnings discouraged the sort of corpus linguistics that underlies the machine-learning approach to language processing.[3] Some of the earliest-used machine learning algorithms, such as decision trees, produced systems of hard if-then rules similar to existing hand-written rules.
However, part-of-speech tagging introduced the use of hidden Markov models to natural language processing, and increasingly, research has focused on statistical models, which make soft, proba



We can adjust how much text the summarizer outputs with two parameters, `word_count` and `ratio`:

1. `word_count`: maximum number of words we want in the summary
2. `ratio`: fraction of sentences in the original text that should be returned as output

If both are provided, `ratio` is ignored.

## Practical Advice
There are a few practical issues to keep in mind when deploying a
summarizer:
- Pre-processing steps like sentence splitting (or HTML parsing in the above example) play a very important role in what comes out as the output summary.
- Most summarization algorithms are sensitive to the size of the text given as input. You need to be aware of this limitation when using a summarizer with very large texts. A workaround could be to run the summarizer on partitions of the large text and string the summaries together. Another alternative could be to run the summarizer on the top M% and bottom N% of the text instead of the whole text.
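
Both workarounds for large texts can be sketched independently of any particular summarizer. In the sketch below, `summarize_in_chunks` and `head_and_tail` are hypothetical helpers, and `first_sentence` is a trivial stand-in for a real summarizer that takes a list of sentences and returns a shorter list.

```python
def summarize_in_chunks(sentences, summarize_fn, chunk_size=50):
    """Summarize each partition of a long text and join the partial summaries."""
    partial = []
    for start in range(0, len(sentences), chunk_size):
        chunk = sentences[start:start + chunk_size]
        partial.extend(summarize_fn(chunk))
    return partial

def head_and_tail(sentences, top_pct=10, bottom_pct=10):
    """Keep only the first top_pct% and the last bottom_pct% of the sentences."""
    n = len(sentences)
    head = max(1, n * top_pct // 100)
    tail = max(1, n * bottom_pct // 100)
    if head + tail >= n:          # text is short enough to keep whole
        return list(sentences)
    return sentences[:head] + sentences[-tail:]

# Stand-in summarizer (an assumption): keeps the first sentence of each chunk.
def first_sentence(chunk):
    return chunk[:1]

sentences = ["s%d" % i for i in range(1, 11)]
chunked = summarize_in_chunks(sentences, first_sentence, chunk_size=4)
trimmed = head_and_tail(sentences, top_pct=20, bottom_pct=10)
print(chunked)
print(trimmed)
```

The trimmed list from `head_and_tail` would then be passed to the summarizer in place of the full text.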

# 4. Recommender Systems for Textual Data
A common approach to building recommendation systems is a method
called *collaborative filtering*. It shows recommendations to users
based on their past history and on what users with similar profiles
preferred in the past.
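
As a rough illustration of the idea, the sketch below recommends items liked by users whose history overlaps with ours. It is a toy, user-based variant under stated assumptions: the `history` data, the `jaccard` similarity, and the `recommend` function are all hypothetical, and real systems use far richer signals.

```python
def jaccard(a, b):
    """Overlap between two sets of liked items (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def recommend(user, history, top_n=3):
    """Score unseen items by the similarity of the users who liked them."""
    liked = history[user]
    scores = {}
    for other, items in history.items():
        if other == user:
            continue
        sim = jaccard(liked, items)
        if sim == 0:              # ignore users with nothing in common
            continue
        for item in items - liked:
            scores[item] = scores.get(item, 0.0) + sim
    ranked = sorted(scores, key=lambda i: (-scores[i], i))
    return ranked[:top_n]

# Hypothetical reading histories.
history = {
    "ann": {"Animal Farm", "1984", "Brave New World"},
    "bob": {"Animal Farm", "1984", "Fahrenheit 451"},
    "cat": {"Dracula", "Frankenstein"},
}
recs = recommend("ann", history)
print(recs)
```

Note that this approach needs interaction data but never looks at the text of the books; the example that follows takes the complementary route of comparing the book summaries themselves.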

## Creating a Book Recommender System: An Example

In [1]:
from nltk.tokenize import word_tokenize
from gensim.models.doc2vec import Doc2Vec, TaggedDocument



In [None]:
# Read the dataset’s README to understand the data format. 

data_path = "data/booksummaries.txt"
mydata = {} #titles-summaries dictionary object
for line in open(data_path, encoding="utf-8"):
    temp = line.split("\t")
    mydata[temp[2]] = temp[6]

In [None]:
#prepare the data for doc2vec, build and save a doc2vec model
train_doc2vec = [TaggedDocument((word_tokenize(mydata[t])), tags=[t]) for t in mydata.keys()]
model = Doc2Vec(vector_size=50, alpha=0.025, min_count=10, dm =1, epochs=100)
model.build_vocab(train_doc2vec)
model.train(train_doc2vec, total_examples=model.corpus_count, epochs=model.epochs)
model.save("d2v.model")

In [None]:
#Use the model to look for similar texts
model= Doc2Vec.load("d2v.model")

#This is a sentence from the summary of “Animal Farm” on Wikipedia:
#https://en.wikipedia.org/wiki/Animal_Farm
sample = """
Napoleon enacts changes to the governance structure of the farm, replacing meetings with a committee of pigs who will run the farm.
 """
new_vector = model.infer_vector(word_tokenize(sample))
#In Gensim 4.x, model.docvecs is deprecated in favor of model.dv
sims = model.docvecs.most_similar([new_vector])
print(sims)

## Practical Advice
How do we know our recommendation system is working? In a real-world
project, the impact of recommendations can be measured by
performance indicators, such as user click-through rates, conversion
into a purchase (if relevant), customer engagement on the website, etc.

Pre-processing decisions play a significant role
in the recommendations served by our system. 

# 5. Machine Translation
Machine translation (MT)—translating text from one language to
another automatically—is one of the original problems of NLP
research.

Two example scenarios where MT may be required to develop solutions:
- Our client’s products are used by people around the world
who leave reviews on social media in multiple languages. To analyze them, we can translate
all the reviews into one language and run sentiment analysis for that language.
- We work with a lot of social media data (e.g., tweets) on a
regular basis and notice that it’s unlike the kind of text we
encounter in typical text documents. 

## Using a Machine Translation API: An Example

In [None]:
import os, requests, uuid, json

# You will need a subscription key - you can use the trial version
# The key is specific to your account

subscription_key = "XXXX"
endpoint = "https://api-nam.cognitive.microsofttranslator.com"
path = '/translate?api-version=3.0'
params = '&to=de' #From English to German (de)
constructed_url = endpoint + path + params

headers = {
    'Ocp-Apim-Subscription-Key': subscription_key,
    'Content-type': 'application/json',
    'X-ClientTraceId': str(uuid.uuid4())
}

body = [{'text' : 'How good is Machine Translation?'}]
request = requests.post(constructed_url, headers=headers, json=body)
response = request.json()

print(json.dumps(response, sort_keys=True, indent=4, separators=(',', ': ')))

## Practical Advice
Don’t build your own MT system if you
don’t have to. It’s more practical to make use of translation APIs.
When using such APIs, it’s important to pay close attention to pricing
policies. Considering the costs involved, it might be a good idea to
store the translations of frequently used text (called a translation
memory or a translation cache).
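
A translation cache can be as simple as a dictionary keyed by the source text and target language. The sketch below is a minimal illustration: `CachedTranslator` is a hypothetical wrapper, and `fake_translate` stands in for a paid API call so the example runs without a subscription key.

```python
class CachedTranslator:
    """Wraps a translation function and stores results keyed by (text, target language)."""

    def __init__(self, translate_fn):
        self.translate_fn = translate_fn
        self.cache = {}
        self.api_calls = 0        # counts only calls that reach the API

    def translate(self, text, to_lang):
        key = (text, to_lang)
        if key not in self.cache:
            self.api_calls += 1
            self.cache[key] = self.translate_fn(text, to_lang)
        return self.cache[key]

# Stand-in for a paid API call (an assumption for illustration).
def fake_translate(text, to_lang):
    return "[%s] %s" % (to_lang, text)

translator = CachedTranslator(fake_translate)
first = translator.translate("Add to cart", "de")
second = translator.translate("Add to cart", "de")   # served from the cache
translator.translate("Add to cart", "fr")
print(translator.api_calls)
```

In a real deployment the cache would typically live in a persistent store so that it survives restarts and is shared across processes.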


# 6. Question-Answering Systems
When searching online with a search engine such as Google or Bing,
for some of the queries, we see “answers” along with a bunch of
search results. These answers can be a few words, a list, or a
definition.

NLP plays an important role in understanding the user query,
deciding what kind of question it is and what kind of answer is needed,
and identifying where the answers are in a given document after
retrieving documents relevant to the query.

![](images/answer-extraction.png)
<center>Answer extraction</center>

## Developing a Custom Question-Answering System
Let’s say we’re asked to develop a question-answering system that
answers all user questions about computers. We’ve identified a few
websites with question-and-answer discussions (e.g., Stack Overflow)
and have a crawler in place. The next step could be using
text embeddings and performing a similarity-based search using
Elasticsearch.
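
The similarity-based lookup can be sketched without a search backend. In the illustration below, a bag-of-words count vector stands in for a real text embedding, and an in-memory list stands in for the Elasticsearch index; `embed`, `cosine`, `best_answer`, and the sample question-answer pairs are all hypothetical.

```python
import math
import re
from collections import Counter

def embed(text):
    """Stand-in embedding: a sparse bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0

def best_answer(question, qa_pairs):
    """Return the answer whose stored question is most similar to the user's question."""
    q_vec = embed(question)
    best = max(qa_pairs, key=lambda pair: cosine(q_vec, embed(pair[0])))
    return best[1]

# Hypothetical crawled question-answer pairs.
qa_pairs = [
    ("How do I reset my router password?", "Hold the reset button for ten seconds."),
    ("How do I free up disk space?", "Delete temporary files and empty the trash."),
]
answer = best_answer("how to free disk space on my laptop", qa_pairs)
print(answer)
```

Swapping `embed` for a learned sentence embedding would let the system match paraphrased questions that share few words with the stored ones.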

## Looking for Deeper Answers
Question answering using deep neural networks is very much an active area of
research and is typically studied as a supervised ML problem using
specific datasets designed for this task, such as the SQuAD
dataset. DeepQA, which is a part of AllenNLP, is a popular
library for developing experimental question-answering systems using
DL architectures.