# 1. Search and Information Retrieval
Apart from the core functions of storing data and ranking search
results, several features of a modern search engine involve NLP:
- Spelling correction
- Related queries
- Snippet extraction
- Biographical information extraction
- Search results classification
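
As a rough illustration of the first feature, spelling correction can be approximated by matching each query word against the vocabulary of the indexed documents. The sketch below uses Python's standard `difflib` module. The vocabulary, the `correct_query` helper, and the 0.8 similarity cutoff are assumptions made for illustration, not how any particular search engine implements it.

```python
import difflib

# Hypothetical vocabulary; a real system would derive it from the index or query logs.
VOCABULARY = ["search", "engine", "index", "query", "ranking", "snippet"]

def correct_query(query, vocabulary=VOCABULARY, cutoff=0.8):
    """Replace each query word with its closest vocabulary word, if one is similar enough."""
    corrected = []
    for word in query.lower().split():
        if word in vocabulary:
            corrected.append(word)
            continue
        matches = difflib.get_close_matches(word, vocabulary, n=1, cutoff=cutoff)
        # Keep the original word when nothing in the vocabulary is close enough.
        corrected.append(matches[0] if matches else word)
    return " ".join(corrected)

print(correct_query("serch engin"))  # search engine
```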

Two types of search engines can be
distinguished:
- Generic search engines, such as Google and Bing, that crawl
the web and aim to cover as much as possible by constantly
looking for new webpages
- Enterprise search engines, where our search space is
restricted to a smaller set of already existing documents
within an organization

## Components of a Search Engine
![](images/search-engine.png)
<center>Early architecture of the Google search engine</center>

- Crawler: Collects all the content for the search engine. The crawler’s job is
to traverse the web starting from a set of seed URLs and build its
collection of URLs through them in a breadth-first way. It visits
each URL, saves a copy of the document, detects the outgoing
hyperlinks, then adds them to the list of URLs to be visited next.
- Indexer: Parses and stores the content that the crawler collects and builds
an “index” so it can be searched and retrieved efficiently.
- Searcher: Searches the index and ranks the search results for the user query
based on the relevance of the results to the query.
- Feedback: A fourth component, now common in all search engines,
that tracks and analyzes user interactions with the search engine,
such as click-throughs, time spent on searching and on each clicked
result, etc., and uses these signals for continuous improvement of the
search system.
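
The first three components can be sketched in a few lines. This is a toy illustration under stated assumptions: `TOY_WEB` is a made-up in-memory stand-in for the web, the index is a plain inverted index of term frequencies, and ranking is just a sum of term frequencies. Real crawlers, indexers, and rankers are far more involved.

```python
from collections import defaultdict, deque

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
TOY_WEB = {
    "a.example": ("search engines crawl the web", ["b.example", "c.example"]),
    "b.example": ("an index makes search fast", ["c.example"]),
    "c.example": ("ranking orders search results", ["a.example"]),
}

def crawl(seed_urls, web):
    """Breadth-first traversal from the seed URLs; returns {url: text}."""
    queue, seen, pages = deque(seed_urls), set(seed_urls), {}
    while queue:
        url = queue.popleft()
        if url not in web:
            continue  # link points to a page we cannot fetch
        text, links = web[url]
        pages[url] = text  # save a copy of the document
        for link in links:  # add unseen outgoing links to the list to visit
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return pages

def build_index(pages):
    """Inverted index: term -> {url: term frequency}."""
    index = defaultdict(dict)
    for url, text in pages.items():
        for term in text.lower().split():
            index[term][url] = index[term].get(url, 0) + 1
    return index

def search(index, query):
    """Rank URLs by the summed frequency of the query terms they contain."""
    scores = defaultdict(int)
    for term in query.lower().split():
        for url, tf in index.get(term, {}).items():
            scores[url] += tf
    # Highest score first; ties are broken alphabetically by URL.
    return sorted(scores, key=lambda u: (-scores[u], u))

pages = crawl(["a.example"], TOY_WEB)
index = build_index(pages)
print(search(index, "search index"))  # ['b.example', 'a.example', 'c.example']
```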

## A Typical Enterprise Search Pipeline
The pipeline typically starts with the following steps, which build the index:
- Crawling/content acquisition
- Text normalization
- Indexing

Once the index is built, the pipeline continues with these steps:
1. Query processing and execution: The search query is passed
through the text normalization process as above. Once the
query is framed, it’s executed, and results are retrieved and
ranked according to some notion of relevance.
2. Feedback and ranking: To evaluate search results and make
them more relevant to the user, user behavior is recorded and
analyzed, and signals such as click action on result and time
spent on a result page are used to improve the ranking
algorithm.
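
The query goes through the same text normalization as the documents so that both end up in the same token space. A minimal sketch of that idea follows. The stop-word list and the `normalize` and `matches` helpers are hypothetical, and a real pipeline would usually add steps such as stemming.

```python
import re

# Hypothetical stop-word list, kept tiny for illustration.
STOP_WORDS = {"a", "an", "the", "of", "in", "and"}

def normalize(text):
    """Lowercase, split on non-alphanumeric characters, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def matches(query, document):
    """True if every normalized query token occurs in the normalized document."""
    return set(normalize(query)) <= set(normalize(document))

doc = "The Call of the Wild, a novel by Jack London."
print(normalize(doc))                    # ['call', 'wild', 'novel', 'by', 'jack', 'london']
print(matches("call of the WILD", doc))  # True
```

Without the shared normalization step, the query "call of the WILD" would fail to match on case and stop words alone.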

## Setting Up a Search Engine: An Example
This notebook shows how to use Elasticsearch to index and search through data. We will use the CMU Book Summaries [dataset](http://www.cs.cmu.edu/~dbamman/booksummaries.html).

For this code to work, an Elasticsearch instance has to be running in the background. To start one, follow these steps:

Linux:

1. Go to the elasticsearch-X.Y.Z/bin folder on your machine
2. Run `./elasticsearch`

Windows:

1. Download the latest release
2. Run `.\bin\elasticsearch.bat`

[Elasticsearch documentation](https://www.elastic.co/guide/index.html)

In [None]:
from elasticsearch import Elasticsearch 
from datetime import datetime

In [None]:
#An Elasticsearch instance has to be running on the machine. The default port is 9200.

#Connect to the Elasticsearch instance and delete any pre-existing index
es = Elasticsearch("http://localhost:9200")
if es.indices.exists(index="myindex"):
    es.indices.delete(index="myindex", ignore=[400, 404]) #Delete the existing index for now

In [None]:
#Build an index from the booksummaries dataset. Only the first 500 documents are used for now.
path = "booksummaries.txt" #Add your path.
count = 0
with open(path, encoding="utf-8") as f:
    for line in f:
        #Each line is tab-separated; fields 0, 2, 3 and 6 hold the id, title, author and summary
        fields = line.rstrip("\n").split("\t")
        doc = {'id' : fields[0],
               'title': fields[2],
               'author': fields[3],
               'summary': fields[6]
              }

        res = es.index(index="myindex", id=fields[0], body=doc)
        count += 1
        if count % 100 == 0:
            print("indexed %d documents" % count)
        if count == 500:
            break

In [None]:
#Check how big the index is. Refresh first so that newly indexed documents are searchable.
es.indices.refresh(index="myindex")
res = es.search(index="myindex", body={"query": {"match_all": {}}})
print("Your index has %d entries" % res['hits']['total']['value'])

In [None]:
#Try a test query. The query searches "summary" field which contains the text
#and does a full text query on that field.
res = es.search(index="myindex", body={"query": {"match": {"summary": "animal"}}})
print("Your search returned %d results." % res['hits']['total']['value'])

In [None]:
#Print the title field and the first 100 characters of the summary field for the 3rd result (index 2)
print(res["hits"]["hits"][2]["_source"]["title"])
print(res["hits"]["hits"][2]["_source"]["summary"][:100])

In [None]:
#A match query analyzes the query text and, by default, works as an OR query over its terms.
#match_phrase requires the terms to appear together, in the same order.
while True:
    query = input("Enter your search query: ")
    if query == "STOP":
        break
    res = es.search(index="myindex", body={"query": {"match_phrase": {"summary": query}}})
    print("Your search returned %d results:" % res['hits']['total']['value'])
    for hit in res["hits"]["hits"]:
        summary = hit["_source"]["summary"]
        print(hit["_source"]["title"])
        print(summary[:100])
        #to get a snippet of up to 100 characters before and after the start of the match
        loc = summary.lower().find(query.lower())
        if loc != -1: #the literal query string may differ from the analyzed match
            print(summary[max(0, loc-100):loc+100])
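
Instead of locating the match by hand, we can also ask Elasticsearch to return snippets through the `highlight` section of the search request. The sketch below only builds the request body and reads the fragments back. The helper names are hypothetical, and `fake_response` is a hand-written stand-in with the same shape as a real response, not actual output.

```python
def build_highlight_query(field, phrase, fragment_size=200):
    """Search body for a match_phrase query that also requests one highlighted fragment."""
    return {
        "query": {"match_phrase": {field: phrase}},
        "highlight": {
            "fields": {
                field: {"fragment_size": fragment_size, "number_of_fragments": 1}
            }
        },
    }

def extract_snippets(response, field):
    """Return (title, snippet) pairs; the snippet is None if nothing was highlighted."""
    results = []
    for hit in response["hits"]["hits"]:
        fragments = hit.get("highlight", {}).get(field, [])
        results.append((hit["_source"]["title"], fragments[0] if fragments else None))
    return results

# With a running instance, one would call something like:
#   res = es.search(index="myindex", body=build_highlight_query("summary", "animal farm"))
# Hand-written stand-in for a response, for illustration only:
fake_response = {
    "hits": {
        "hits": [
            {
                "_source": {"title": "Animal Farm"},
                "highlight": {"summary": ["... <em>animal</em> <em>farm</em> ..."]},
            }
        ]
    }
}
print(extract_snippets(fake_response, "summary"))
```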

## A Case Study: Book Store Search
Imagine a scenario where we have a new e-commerce store focused
on books and we have to build its search pipeline. We have metadata
like author, title, and summary. The search functionality we saw earlier
can serve as the baseline at the start. We can set up our own search
engine backend or use a hosted Elasticsearch service such as
Elastic on Azure.

This default search output might have several issues. For instance, it
may rank results with exact query matches in the title or summary
higher than more relevant results that aren’t exact matches. Some of
the exact matches might be poorly written books with bad reviews,
which our search ranking doesn’t account for.

We can incorporate real-world metrics that account for this into our
search engine. For instance, the number of times a book is viewed and
sold, the number of reviews, and the book’s rating can all be
incorporated into the search ranking function. 
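
One way to do this in Elasticsearch is a `function_score` query, which combines the text relevance score with numeric fields of the document. The sketch below only builds the request body. The `rating` and `num_reviews` fields are hypothetical (the index built earlier does not contain them), and the choice of modifiers and score modes is an assumption that would need tuning on real data.

```python
def build_ranked_query(text):
    """Full-text match on the summary, boosted by hypothetical popularity fields."""
    return {
        "query": {
            "function_score": {
                "query": {"match": {"summary": text}},
                "functions": [
                    # Use the rating as is; treat a missing rating as 1.
                    {"field_value_factor": {"field": "rating", "modifier": "none", "missing": 1}},
                    # Dampen the review count so very popular books do not dominate.
                    {"field_value_factor": {"field": "num_reviews", "modifier": "log1p", "missing": 0}},
                ],
                "score_mode": "sum",       # add the two function values together
                "boost_mode": "multiply",  # multiply the text score by that sum
            }
        }
    }

body = build_ranked_query("animal")
print(body["query"]["function_score"]["boost_mode"])  # multiply
# With a running instance: res = es.search(index="myindex", body=body)
```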

We should start collecting user interactions with the search engine to
improve it further.

# 2. Topic Modeling
Topic models are used extensively for
document clustering and organizing large collections of text data.
They’re also useful for text classification.

Topic modeling tries to identify the
“key” words (called “topics”) present in a text corpus without prior
knowledge about it, unlike the rule-based text mining approaches that
use regular expressions or dictionary-based keyword searching
techniques.

![](images/topic-modeling.png)
<center>Illustration of a topic modeling visualization</center>

Topic modeling generally refers to a collection of
unsupervised statistical learning methods to discover latent topics in a
large collection of text documents. Some of the popular topic modeling
algorithms are latent Dirichlet allocation (LDA), latent semantic
analysis (LSA), and probabilistic latent semantic analysis (PLSA). In
practice, the technique that’s most commonly used is LDA.
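
To make LDA a little more concrete, here is a minimal sketch of one common way to fit it, collapsed Gibbs sampling, in plain Python. The toy corpus and the hyperparameter values (`alpha`, `beta`, the number of iterations) are made up for illustration. In practice we would use a library implementation such as gensim's `LdaModel` or scikit-learn's `LatentDirichletAllocation` rather than this loop.

```python
import random

def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA on tokenized documents.

    Returns (doc_topic_counts, topic_word_counts, vocab).
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [[0] * V for _ in range(num_topics)]
    topic_total = [0] * num_topics

    # Start by assigning a random topic to every token and recording the counts.
    assignments = []
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(num_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][word_id[w]] += 1
            topic_total[z] += 1
        assignments.append(zs)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v, z = word_id[w], assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][z] -= 1
                topic_word[z][v] -= 1
                topic_total[z] -= 1
                # P(topic k | all other assignments) is proportional to
                # (doc-topic count + alpha) * (topic-word count + beta) / (topic size + V * beta).
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][v] + beta)
                    / (topic_total[k] + V * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                # Record the newly sampled assignment.
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][v] += 1
                topic_total[z] += 1
    return doc_topic, topic_word, vocab

# Toy corpus, made up for illustration.
docs = [
    "farm crops soil tractor farm".split(),
    "crops soil harvest farm".split(),
    "search engine index query".split(),
    "query ranking search index engine".split(),
]
doc_topic, topic_word, vocab = lda_gibbs(docs, num_topics=2)
for k, counts in enumerate(topic_word):
    top = sorted(range(len(vocab)), key=lambda v: -counts[v])[:3]
    print("topic", k, [vocab[v] for v in top])
```

Each row of `doc_topic` counts how many tokens of a document were assigned to each topic, and each row of `topic_word` counts how often each vocabulary word was assigned to a topic. The most frequent words per topic are what we read as the "topic".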
