
Search Interface for MediaGoblin

Praveen Kumar edited this page May 3, 2013 · 12 revisions

Summary

Search is an important tool for a web application: it lets a user look for a piece of information in a pile of data and improves the usability of a site. MediaGoblin currently lacks a search interface, and since it is a platform for sharing media, providing one is of utmost importance. There are various open source search engines, such as Whoosh, ElasticSearch, and Xapian, that can be integrated with MediaGoblin. This project aims to research which search engine best fits our use cases and preferences, and then to implement it in such a way that it can be easily extended and maintained.

Benefits

This project will add a search interface to MediaGoblin. It will help MediaGoblin users find a needle in a pile of media easily, rather than scanning linearly through the whole media collection, which can be very frustrating. A search interface will enhance the usability of MediaGoblin as a platform.

Deliverables

At the end of GSoC, I will deliver a search interface implemented using one of the search engines I discuss in my plan below. The search interface will be implemented either as a core feature or as a plugin that a user can enable whenever they want. I will also deliver unit tests and documentation for the code I write. The features I propose to implement are discussed in the sections below.

Plan

This project consists of three phases:

  1. Research and Evaluation of Search Engines
  2. Implementation
  3. Testing

Research and Evaluation of Search Engines

The searchable data that MediaGoblin stores consists of the title, description, slug, and some metadata of media types such as image, video, and audio, plus the text of comments, tags, and collections. The models that I consider for this project are MediaEntry, MediaComment, MediaTag, and Collection. If time permits, I will implement hooks so that a new model can be easily indexed.

I list below the searchable attributes of the above models:

  MediaEntry – author, title, slug, description
  MediaComment – author, content
  MediaTag – name, slug
  Collection – title, slug, description

Now, we can create indices for our data in two ways:

1) A single index that indexes all the data of the above four models.

One way of modeling the index is Index(title, slug, description, media_type). We leave the value of an attribute empty if it is not available to us. This way we can index all our data easily. Queries can be evaluated against a single index, so searches will be relatively fast. But the downside is that we won't be able to index models added in the future whose attributes differ completely from those that form our index. Also, the data stored in this index will be sparse.

2) Create a separate index for each data model.

We will create an index for each model, and each index will have the responsibility of staying up to date for the model it indexes. Queries will have to be evaluated against each index that we have, and the results combined to produce the final results. The advantage of this approach is that we can give users the ability to execute complex searches, such as searching for a keyword only in images, or in images and videos, or in images uploaded after a certain date.

Evaluation of Search Engines

Almost all the open source search engines available support complex search queries and are fairly robust. We need to determine which one best satisfies our requirements. So, during the initial phase of this project, I will compute performance and usability metrics and present a report to the community to help choose the best option for a search engine.

I will consider the following metrics for evaluating search engines that we can use:

Performance and Usability Metrics

  1. Indexing speed
  2. Search speed
  3. Size (disk space) of indices generated
  4. Memory footprint of the indexer and searcher
  5. Scalability
  6. Ease of setup and installation of dependencies
  7. Ease of understanding their API

Search engines that I will research:

  1. Whoosh – pure Python
  2. ElasticSearch – written in Java, built on Lucene; Python bindings are available
  3. Solr – written in Java, built on Lucene; Python bindings are available
  4. Xapian – written in C++; Python bindings are available

Benchmarking Approach:

I have implemented some Python scripts for performance benchmarking. The scripts are located here. They implement methods for measuring the execution time of a function and the memory it consumed, generating dummy documents, plotting graphs, etc. I implemented a basic indexer and searcher in Whoosh and evaluated its performance, i.e., computed metrics 1–5. I create dummy documents by selecting words randomly from /usr/share/dict, available on UNIX systems. My index had four attributes: id, title, slug, and description, where id is an integer identifier associated with a document; the values of the remaining attributes were randomly generated. I evaluated Whoosh's index creation performance by creating indices with a varying number of documents and plotting a graph. I also computed the query time and the memory used by the searcher. The outputs from the benchmarking scripts follow.
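The core of such scripts can be sketched with the standard library alone; the helper names below are illustrative, not the actual script API. On Linux, resource.getrusage reports the process's peak resident set size in kilobytes.

```python
import random
import resource
import string
import time

def measure(func, *args, **kwargs):
    """Run func, returning its result, elapsed wall time, and peak RSS."""
    start = time.time()
    result = func(*args, **kwargs)
    elapsed = time.time() - start
    # ru_maxrss is reported in kilobytes on Linux.
    peak_mb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024.0
    return result, elapsed, peak_mb

def dummy_document(num_words=20, word_len=10):
    """Build a document of random words, mimicking /usr/share/dict sampling."""
    return u" ".join(
        u"".join(random.choice(string.ascii_lowercase) for _ in range(word_len))
        for _ in range(num_words))

doc, elapsed, peak = measure(dummy_document, num_words=20, word_len=10)
```

Wrapping the indexer and searcher calls in measure() yields exactly the time and memory columns shown in the tables below.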

Indexing Benchmarks (complete output)

--------------------------------------------------------------------
Platform: Linux-3.5.0-27-generic-i686-with-Ubuntu-12.10-quantal 
Python Version: 2.7.3, Compiler: GCC 4.7.2
Number of CPUs: 4, Total Physical Memory: 1944.617188 MB
--------------------------------------------------------------------
Software Tools used:
  Whoosh 2.4.1, psutil 0.7.0, matplotlib 1.1.1, python 2.7.3

===== Performance of index creation =====
Multiprocessing : Yes
No. of words in each document: 20
Length of each word: 10 chars

No. of Docs  Time(secs)    Memory(MB)      Index Size(MB)
----------------------------------------------------------
100          0.726852      36.297363       0.645767
200          0.750638      37.110352       1.287406
300          1.060829      37.377841       1.934848
400          1.258528      37.540264       2.578110

===== Performance of index creation =====
Multiprocessing : No
No. of words in each document: 20
Length of each word: 10 chars

No. of Docs  Time(secs)    Memory(MB)      Index Size(MB)
----------------------------------------------------------
100          0.696731      38.180176       0.646939
200          1.345728      38.896484       1.269023
300          1.978825      39.830859       1.883274
400          2.625160      39.845052       2.478565

[Graph: index creation performance]

Search Benchmarks (complete output)

--------------------------------------------------------------------
Platform: Linux-3.5.0-27-generic-i686-with-Ubuntu-12.10-quantal 
Python Version: 2.7.3, Compiler: GCC 4.7.2
Number of CPUs: 4, Total Physical Memory: 1944.617188 MB
--------------------------------------------------------------------
Software Tools used:
  Whoosh 2.4.1, psutil 0.7.0, matplotlib 1.1.1, python 2.7.3

===== Performance of searching of simple queries =====
Size of the index: 3.055882 MB
No. of indexed documents: 500 
No of search queries: 10

-------------------------------------------------------------------
Search Word       Time(sec)     Memory(MB)
----------------------------------------------
cueing            0.077259      35.765625
flinging          0.002398      35.769531
hangman           0.003562      35.769531
Erich's           0.004667      35.769531
southern          0.004385      35.769531
homburg           0.005198      35.769531
wildly            0.005923      35.789062
suddenness        0.004473      35.789062
truthfully        0.004479      35.789062
fattening         0.005365      35.789062

Average time taken: 0.011771 secs
Average memory used: 35.776953 MB

===== Performance of searching of complex queries in Whoosh =====
Size of the index: 3.048532 MB
No. of indexed documents: 500 
No of search queries: 10
------------------------------------------------------------------
Search Word       Time(sec)     Memory(MB)
----------------------------------------------
backslash         0.003224      37.835938
Jon's             0.003477      37.835938
ripper's          0.007442      37.835938
bark's            0.006726      37.835938
examiner's        0.006974      37.835938
midgets           0.007964      37.835938
ticks             0.005138      37.835938
candies           0.004225      37.835938
Fletcher          0.003260      37.835938
yips              0.004191      37.835938

Average time taken: 0.005262 secs
Average memory used: 37.835938 MB

Preliminary Analysis of Whoosh

In the worst case of 3000 documents, each consisting of about 2 KB of Unicode text, and without performing any kind of optimization on our end or on Whoosh's, Whoosh took about 15 seconds to complete the indexing and consumed about 15 MB of disk space to store the index. With multiprocessing enabled, the same benchmark took around 8 seconds and consumed around 17 MB of disk space. Simple search queries, in which a keyword was evaluated against only the 'title' field of the index, and complex queries, in which a keyword was evaluated against all the fields, both took an average of about 5 ms. Hence, searches in Whoosh were satisfactorily fast, while indexing was comparatively slow in the experiment I set up. I will perform the same experiment with the other search engines and see how they fare.

Design and Implementation

The exact implementation depends on the search engine we decide to choose, but regardless of that choice, our search interface can be thought of as composed of the following high-level parts:

  1. Client End
  2. Query Processor
  3. Indexer
  4. Index updater
  5. Searcher

Client End:

This is the user-facing end of the search interface. A user supplies their query through this web interface. The interface will have two components. One will be a single search text box which can be integrated on any page we want just by including its template as {% block search %} {% include 'search.html' %} {% endblock %}. The other component will be a separate search page where a user will be able to perform advanced searches, such as filtering a video by its upload date.

Query Processor:

Its job is to process the query passed to it by the Client End and convert it into a form that is recognized by the Searcher. We can support many types of queries, such as wildcard searches, regular expressions, phrase searches, boolean queries, etc. The implementation depends on the search library we choose, as almost all of them provide methods to handle queries. For example, Whoosh provides a qparser.QueryParser class which parses queries to be evaluated against a single field in our index, and a qparser.MultifieldParser class which handles queries that must be evaluated against multiple fields in our indices.

Indexer:

This component generates indices from the data that we provide to it. We define a schema for each of our indices, collect the relevant data, and then pass it on to the Indexer to index.

In MediaGoblin, we have two possible use cases for the search interface:

  1. A user is allowed to search only their own data. In this case, we will have to create indices scoped to a single user, and a query from a user will be evaluated only against that user's indices. To implement this, we will need to store information about the indices belonging to a given user. This can be a path to a directory where the indices are stored or a random string representing a location; the details depend on the search library we use. The model definition can be
    UserSearchMetadata:
        user – reference to the user whose indices are being generated
        index_loc – value representing the location of the indices
        last_updated – time when the index was last updated for that user
        index_size – size of the current index
  2. A user is allowed to search the data of other users, i.e., a user can perform a site-wide search. In this case, we don't need to persist any per-user information.
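As a sketch only, the proposed model could be declared with SQLAlchemy roughly as follows; the column types, defaults, and the plain-integer user reference are my assumptions, not MediaGoblin's actual model layer:

```python
# Hypothetical SQLAlchemy sketch of the proposed UserSearchMetadata model.
import datetime

from sqlalchemy import Column, DateTime, Integer, Unicode, create_engine
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()

class UserSearchMetadata(Base):
    __tablename__ = 'user_search_metadata'

    id = Column(Integer, primary_key=True)
    user = Column(Integer, nullable=False)       # reference to the user
    index_loc = Column(Unicode, nullable=False)  # location of the indices
    last_updated = Column(DateTime, default=datetime.datetime.utcnow)
    index_size = Column(Integer, default=0)      # size of the current index

engine = create_engine("sqlite://")  # in-memory database for the sketch
Base.metadata.create_all(engine)
```

The last_updated default records when a row is created; the Index Updater would refresh it after every reindex of that user's data.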

The indexer should be able to do batch indexing as well as incremental indexing. Batch indexing is necessary if the search system is implemented as a plugin, because then the indexer will have to scan the complete database for data to index. Incremental indexing is needed to update the indices when data is added to or deleted from the database.

Index Updater

Its job is to keep the indices updated at all times. The Index Updater will have to handle updating the indices in the following cases:

  1. A new row is added to a database table.
  2. An existing row is updated in a table.
  3. A row is deleted from a table.
  4. A new table is added to the database.
  5. A table which is already indexed is deleted from the database.

Using Celery to perform index update tasks

We can invoke the Index Updater periodically, i.e., as a cron job. MediaGoblin already uses Celery, and we can easily configure it to run the index updater task at defined intervals. Another way is to trigger the index updater whenever an update is performed on a table related to the indexed models. We can again implement the updater tasks as Celery tasks which run asynchronously; the tasks will have to be added to the task queues from all the methods which change data in the database.

Searcher:

Searcher takes a processed query from the Query Processor, searches it against the indices and returns the results to the Client End.

A small implementation of a search interface in Whoosh

I implemented a very basic search system that indexes media entries in MediaGoblin. My index schema was composed of the id, title, slug, and description attributes of MediaEntry objects. The id was stored with the index because Whoosh returns stored fields in its search results; with the ids in hand, I can query the database for the MediaEntry objects with those ids and return the appropriate results to the user.

Client End:

[Screenshot: the client-end search box]

Template Code (complete code)

[Screenshot: the template code]

Indexer (complete code)

import os
import logging

from whoosh.fields import Schema, TEXT, NUMERIC
from whoosh.index import create_in

from mediagoblin.db.models import MediaEntry

_log = logging.getLogger(__name__)

# Directory where the media index is stored.
MEDIA_INDEX_DIR = 'media_index'


def get_all_media_entries():
    return MediaEntry.query.all()


def create_index():
    schema = Schema(
        title=TEXT,
        description=TEXT,
        media_id=NUMERIC(stored=True),
        slug=TEXT)

    if not os.path.exists(MEDIA_INDEX_DIR):
        os.mkdir(MEDIA_INDEX_DIR)

    ix = create_in(MEDIA_INDEX_DIR, schema)
    media_entries = get_all_media_entries()

    writer = ix.writer()
    _log.info("Creation of index started")
    for media in media_entries:
        writer.add_document(title=unicode(media.title),
                            description=unicode(media.description),
                            media_id=media.id,
                            slug=unicode(media.slug))
        _log.info("Adding document with slug %s", media.slug)
    writer.commit()
    _log.info("Index creation finished.")

Query Processor and Searcher (complete code)

import logging

from whoosh.index import open_dir
from whoosh.qparser import MultifieldParser

from mediagoblin.db.models import MediaEntry
from mediagoblin.tools.response import render_to_response
# forms and indexer are this search plugin's own modules
from mediagoblin.plugins.search import forms, indexer

_log = logging.getLogger(__name__)


def search(request):
    form = forms.SearchForm(request.form)
    context = {
        'form': form,
        'results': None,
        'query': None,
        'search_response': False,
    }

    if request.method == 'POST' and form.validate():
        query = form.query.data
        ix = open_dir(indexer.MEDIA_INDEX_DIR)
        with ix.searcher() as searcher:
            # Evaluate the keywords against all three indexed fields.
            parsed_query = MultifieldParser(
                ['title', 'description', 'slug'], ix.schema).parse(query)
            results = searcher.search(parsed_query)

            res = []
            for result in results:
                entry = MediaEntry.query.get(result['media_id'])
                res.append({
                    'slug': entry.slug,
                    'url': entry.url_for_self(request.urlgen),
                })

            context.update({
                'results': res,
                'search_response': True,
                'query': query,
            })

    # Render the search page for GET requests as well as for results.
    return render_to_response(request, 'mediagoblin/search/search.html', context)

Testing

I have significant unit testing experience with Nose. MediaGoblin uses pytest, which I find easy to pick up. I will write unit tests for my code and ensure high coverage. This will help us reduce bugs and have a reliable search system. I will provide the exact details of the unit test setup along the way, as I cannot yet determine all the tools we will need. At a high level, we will need test search servers and a Celery stub.

Timeline

This is only a tentative timeline. I will prepare a new one once we finalize the search library we want to use to implement the search system. I have devoted about 3 weeks to researching the best search engine for us.

May 27 - June 17 (Community Bonding Period)

  • Learn the coding practices of MediaGoblin
  • Discuss with my mentor the search features we want our search system to have
  • Read documentation of MediaGoblin and get more familiar with the codebase

June 17 - July 8

  • Implement indexers and searchers in Whoosh, Solr, Xapian and ElasticSearch
  • Prepare a report of the performance and usability metrics
  • Get familiar with SQLAlchemy ORM and jinja2 templating system
  • Study how other sites implement their search systems

July 9 - July 24

  • Discuss with my mentor the implementation design of the search system
  • Implement indexer
  • Write tests and documentation

July 25 - August 2 (Mid term evaluation period)

  • Take some break

August 3 - August 18

  • Implement Index Updater
  • Get more familiar with celery
  • Write unit tests and documentation

August 19 - August 26

  • Implement Searcher
  • Write unit tests and documentation

August 27 - September 4

  • Implement the Client End
  • Study some HTML, CSS and JavaScript for implementing the user facing pages

September 5 - September 12

  • Discuss with my mentor how to implement hooks to enable any new model to be indexed
  • Write unit tests and documentation

September 13 - September 20 (Soft pencils down date)

  • Write the integration tests that test the whole search system
  • Write documentation on how to enable and use the search system

September 20 - September 27 (Final pencils down date)

  • A buffer week to handle any unexpected circumstances and delays

Communication

I will communicate with my mentor and other members of the community frequently, since I will constantly need their help to ensure I am working in the right direction. I will be available on IRC and via email as much as possible. I will also send a weekly report to the dev mailing list and ask community members for suggestions.

Qualifications

I chose this project because I like web development and believe I can work on it to the best of my capacity. I love Python and have done a lot of development in Django. I have also implemented a web crawler in Python and built a Facebook app hosted on Amazon AWS for one of my machine learning projects using open source tools. I have contributed over 100 commits to Melange and e-cidadania. I have the necessary skills to complete this project, though I will need to read up more on CSS, JavaScript, and AJAX to implement the client end of the search interface. I will also need to read research papers and books on building good search indices.