# Discussion Test: Search queries generator

#### Test statement
Automated online search bots are responsible for discovering infringing listings.
These bots need to know what to look for and how to reach those listings.
For each product the company protects, the bots must therefore be given a useful set of search queries that lead them to that product.  
These search queries should return results on every site the bots use, with as many ASSET results and as few NON-ASSET results as possible.

#### Assumptions
Since the algorithm has to provide a set of queries that works on every website, our generator should not include site-dependent information such as HTML code, the position of a listing on the page, or site-specific codes for excluding items (for example, selecting which vendor an item comes from on Amazon).
Let's assume for now that our queries consist of text-only information.

#### First approach
We need as many ASSET results and as few NON-ASSET results as possible, which means we want to maximize precision.
The idea is to create a very specific query that describes the ASSET precisely, leaving little room for other items.  
We could start with a query composed of the ASSET's brand, followed by the product name and then the ASSET's top features. For example, if our product is "Mavic Pro", our starting query could be `DJI Mavic Pro 2 Drone Quadcopter 1" CMOS Sensor Camera`.  
This should narrow down the search results. If there are too few results, terms can be removed from the query one at a time until the listing title classifier starts returning NON-ASSET elements. This could be done with an iterative algorithm that starts from a complex query (say, the one from the example) that should return no NON-ASSET titles. If there are not enough ASSET titles, the algorithm removes terms until at least one NON-ASSET title appears (or until their number exceeds a certain threshold). The algorithm then steps back to the previous query, which gives the "least NON-ASSET results possible".
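
The loop described above could be sketched roughly as follows. This is only a minimal illustration: `search` (the bot) and `is_asset` (the listing title classifier) are hypothetical callables, and dropping terms from the end of the query is an assumed ordering, not something fixed by the approach.

```python
from typing import Callable, List


def find_widest_clean_query(
    terms: List[str],
    search: Callable[[str], List[str]],
    is_asset: Callable[[str], bool],
    min_assets: int = 10,
    max_non_assets: int = 0,
) -> str:
    """Drop trailing terms until enough ASSET titles are found,
    stepping back as soon as NON-ASSET titles exceed the threshold."""
    best = " ".join(terms)
    for n in range(len(terms), 0, -1):
        query = " ".join(terms[:n])
        titles = search(query)
        assets = sum(1 for title in titles if is_asset(title))
        non_assets = len(titles) - assets
        if non_assets > max_non_assets:
            # Too noisy: go back to the previous (cleaner) query.
            return best
        best = query
        if assets >= min_assets:
            # Enough ASSET titles and still clean: stop here.
            return best
    return best
```

Here `min_assets` and `max_non_assets` play the role of "enough results" and "a certain threshold"; sensible values would have to be chosen per product.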

#### A step forward: Query Rewriting
*Query rewriting* automatically transforms search queries in order to better represent the searcher's intent.
Specifically, there are two main techniques that aim to increase precision: *Query Segmentation* and *Query Scoping*.
* *Query Segmentation* attempts to divide the search query into a sequence of semantic units.
For example, if our query is "python frameworks", treating it as a quoted phrase can significantly improve precision by avoiding matches for "python" and "frameworks" separately.
* *Query Scoping* improves precision by scoping, or restricting, how different parts of the query match different parts of documents or phrases.

The idea is to start with one or more simple queries and use these techniques to adjust the current query or enlarge the set of queries.
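
As a rough illustration of *Query Segmentation*, the sketch below uses a greedy longest-match against a set of known phrases and then quotes the multi-word segments. The `known_phrases` set is an assumption (it could come from the product catalogue), as is lowercasing the query and the idea that the target sites honor quoted phrases.

```python
from typing import List, Set


def segment_query(query: str, known_phrases: Set[str]) -> List[str]:
    """Greedy longest-match segmentation against a phrase dictionary."""
    tokens = query.lower().split()
    segments: List[str] = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate phrase first, down to two tokens.
        for j in range(len(tokens), i + 1, -1):
            candidate = " ".join(tokens[i:j])
            if candidate in known_phrases:
                segments.append(candidate)
                i = j
                break
        else:
            # No known phrase starts here: keep the single token.
            segments.append(tokens[i])
            i += 1
    return segments


def quote_segments(segments: List[str]) -> str:
    """Wrap multi-word segments in double quotes so they match as phrases."""
    return " ".join(f'"{s}"' if " " in s else s for s in segments)
```

For example, `quote_segments(segment_query("DJI Mavic Pro drone", {"mavic pro"}))` keeps "mavic pro" together as a single quoted unit.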

#### Something more exotic: reinforcement learning
One idea is to create a system that learns the best query (or set of queries) from examples and positive/negative rewards. Let's say we start with a simple query (e.g. the name of the product): the bot collects the listing titles and the ML model predicts ASSET/NON-ASSET for each of them. We could define our goal as reducing the number of NON-ASSET labels. Next, we provide the system with another query (generated pseudo-randomly from the complex query seen before, or simply by introducing terms), the bot runs again and we get another count of NON-ASSET labels. Iteratively, by providing positive feedback (fewer NON-ASSET labels) or negative feedback (more NON-ASSET labels), the system can learn the best query.  
Of course this is just a starting point, and many aspects have to be considered before we can say that this would work.
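
One very simple way to picture this feedback loop is an epsilon-greedy bandit over a fixed list of candidate queries. This is only a sketch of the idea, not a full reinforcement learning setup: `count_non_assets` is a hypothetical function that runs the bot plus the classifier and returns the number of NON-ASSET labels, and the reward is assumed to be simply the negative of that count.

```python
import random
from typing import Callable, Dict, List


def learn_best_query(
    candidates: List[str],
    count_non_assets: Callable[[str], int],
    rounds: int = 50,
    epsilon: float = 0.2,
    seed: int = 0,
) -> str:
    """Epsilon-greedy search for the query with the fewest NON-ASSET labels."""
    rng = random.Random(seed)
    pulls: Dict[str, int] = {q: 0 for q in candidates}
    mean_reward: Dict[str, float] = {q: 0.0 for q in candidates}

    def update(query: str) -> None:
        # Fewer NON-ASSET labels means a higher (less negative) reward.
        reward = -count_non_assets(query)
        pulls[query] += 1
        mean_reward[query] += (reward - mean_reward[query]) / pulls[query]

    for query in candidates:  # try every candidate once
        update(query)
    for _ in range(rounds):
        if rng.random() < epsilon:
            query = rng.choice(candidates)  # explore
        else:
            query = max(candidates, key=mean_reward.get)  # exploit
        update(query)
    return max(candidates, key=mean_reward.get)
```

Note that a reward based only on NON-ASSET counts would also favor queries that return nothing at all, so in practice the reward would probably need to take the number of ASSET results into account as well.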

#### Training set growth
In a scenario with a constantly growing training set, the ideal would be a model that can be trained on the new examples only, without forgetting the previously acquired knowledge. This would give us the benefits of an up-to-date model without spending more and more time on training.
This type of learning is called online learning (or incremental learning).  
It has to be said that some analysis should be performed first, to determine whether the growing training time is as costly as implementing an online learning method.
It also depends on how often we want to retrain our model: every week, every year, etc.
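
To make the idea concrete, here is a minimal sketch of an incrementally trainable title classifier. It assumes a multinomial Naive Bayes model over whitespace tokens, chosen only because its parameters are plain counts that can be updated batch by batch; the class name and the `partial_fit` method name are illustrative, and the actual listing title classifier may be something quite different.

```python
import math
from collections import Counter, defaultdict
from typing import Iterable, Optional, Tuple


class IncrementalNaiveBayes:
    """Multinomial Naive Bayes whose counts are updated batch by batch."""

    def __init__(self) -> None:
        self.class_counts: Counter = Counter()
        self.token_counts = defaultdict(Counter)
        self.vocabulary = set()

    def partial_fit(self, examples: Iterable[Tuple[str, str]]) -> None:
        """Add new (title, label) pairs without revisiting old data."""
        for title, label in examples:
            self.class_counts[label] += 1
            for token in title.lower().split():
                self.token_counts[label][token] += 1
                self.vocabulary.add(token)

    def predict(self, title: str) -> Optional[str]:
        """Return the most likely label (None if nothing was learned yet)."""
        total = sum(self.class_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            score = math.log(count / total)
            label_total = sum(self.token_counts[label].values())
            for token in title.lower().split():
                freq = self.token_counts[label][token]
                # Laplace smoothing so unseen tokens do not zero the score.
                score += math.log((freq + 1) / (label_total + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

Each call to `partial_fit` only touches the new examples, so the cost of an update depends on the size of the new batch rather than on the whole training set.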

#### Changes in naming over time
Techniques like *Stemming* or *Lemmatization*, which remove inflections, can help if the modifications are made at the end of the product name. If they are made at the beginning, we could apply a modified version of these techniques designed specifically for that product. This could of course be time consuming if the number of protected products is large and grows fast.  
A hand-crafted, dictionary-based method could also be used to include fake names in the queries. For example, if we have a real product "mavic" and some fake brands "mavoc" and "maavic", we could create a dictionary `{'mavic': ['mavoc', 'maavic']}` and add some queries that use the fake-brand names.
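
A minimal sketch of the dictionary-based approach, assuming whitespace-separated queries and lowercase dictionary keys (the function name is illustrative):

```python
from typing import Dict, List

# Hand-crafted mapping from a real name to its known fake variants.
FAKE_NAMES: Dict[str, List[str]] = {"mavic": ["mavoc", "maavic"]}


def add_fake_name_queries(query: str, fake_names: Dict[str, List[str]]) -> List[str]:
    """Return the original query plus one variant per known fake name."""
    tokens = query.split()
    queries = [query]
    for i, token in enumerate(tokens):
        for fake in fake_names.get(token.lower(), []):
            variant = tokens[:i] + [fake] + tokens[i + 1:]
            queries.append(" ".join(variant))
    return queries
```

With the dictionary above, the query "DJI Mavic Pro" would also produce "DJI mavoc Pro" and "DJI maavic Pro".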

If the names keep changing over time and this makes the dictionary-based approach infeasible, it would be possible to create a dedicated machine-learning algorithm that takes a name and generates a list of possible fake names, learning from a collection of previously gathered fake names. This could be done, for example, using Recurrent Neural Networks. In order to use a deep learning approach, we should have collected as many fake-brand names as possible.

#### Missing results and metric change
We should also account for Recall, the fraction of relevant instances that were actually retrieved. It drops whenever an item that should have been predicted as positive is flagged as negative.  
If we want a metric that takes both into account (e.g. so that precision does not drop too much), we can use the F1 score, which is the harmonic mean of Precision and Recall.
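
For reference, the three metrics can be computed from the counts of true positives, false positives and false negatives. The helper below is just an illustration; returning 0.0 when a denominator is zero is an assumed convention.

```python
from typing import Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Compute precision, recall and F1 from raw counts.

    tp: ASSETs correctly flagged as ASSET
    fp: NON-ASSETs wrongly flagged as ASSET
    fn: ASSETs that were missed (flagged as NON-ASSET)
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

For instance, with 8 true positives, 2 false positives and 8 false negatives, precision is 0.8 but recall is only 0.5, and F1 sits between the two, closer to the lower value.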

#### When recall is more important than precision
If recall is more important, it means that we also want to keep track of how many ASSETs we wrongly labelled as NON-ASSET. Using one (or a few) best queries is no longer enough, and we should extend our set of queries to include more results.  
Within the *Query rewriting* family of methods, two techniques increase Recall:
* *Query Expansion* broadens the query by adding tokens or phrases. These may be related to the original query terms as synonyms or abbreviations (or even through stemming). If the original query is an AND of tokens, replacing some ANDs with ORs can expand the result space.
* *Query Relaxation* increases recall by removing tokens that may not be necessary to ensure relevance.
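
Both techniques can be sketched in a few lines. The `AND`/`OR` syntax with parentheses is an assumption about what the target search engines accept, and the `synonyms` dictionary and `required` token list are hypothetical inputs that would have to be prepared for each product.

```python
from itertools import combinations
from typing import Dict, List


def expand_query(tokens: List[str], synonyms: Dict[str, List[str]]) -> str:
    """Query Expansion: OR each token with its synonyms, AND the groups."""
    groups = []
    for token in tokens:
        alternatives = [token] + synonyms.get(token, [])
        if len(alternatives) > 1:
            groups.append("(" + " OR ".join(alternatives) + ")")
        else:
            groups.append(token)
    return " AND ".join(groups)


def relaxed_queries(tokens: List[str], required: List[str], max_dropped: int = 1) -> List[str]:
    """Query Relaxation: drop up to `max_dropped` non-required tokens."""
    optional = [i for i, token in enumerate(tokens) if token not in required]
    queries = []
    for k in range(1, max_dropped + 1):
        for dropped in combinations(optional, k):
            kept = [t for i, t in enumerate(tokens) if i not in dropped]
            queries.append(" ".join(kept))
    return queries
```

Keeping a `required` list (for example the product name) is one way to make sure relaxation never removes the tokens that make a result relevant in the first place.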