# Machine Learning Math 2

The next important thing we want you to understand is the math behind the cost function(s) that might be applied when performing linear regression. That will be our next lesson.

* [Usual assumptions for linear regression](http://stats.stackexchange.com/questions/16381/what-is-a-complete-list-of-the-usual-assumptions-for-linear-regression)
* Convexity of cost functions
* Stochastic gradient descent
* Different interpretations of distance (Euclidean, etc.)
* Regularization

## Text Vectorization

### Checkpoint: Text Vectorization

## Applications: Search Scoring

Let's pretend you have tens of thousands of documents stored in a search index. When an end user enters a search query, you'd like to list a subset of those documents, presented in a way that makes them feel relevant to the end user.

We can answer a substantial part of the search question by simply filtering the documents down to the subset that contains the terms entered by the end user. Note that this may produce false negatives (because documents sometimes contain synonyms of the query terms rather than the exact words), but we will temporarily accept that limitation, because we'll need additional math to overcome it.

* [Elasticsearch Bool Query](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-bool-query.html)
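
As a rough illustration of term filtering, here is a minimal sketch in plain Python. It is not how Elasticsearch implements a bool query; the `tokenize` and `filter_documents` helpers and the sample documents are hypothetical, and the whitespace tokenizer is deliberately naive.

```python
def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return set(text.lower().split())


def filter_documents(documents, query):
    """Return the documents that contain every term in the query."""
    query_terms = tokenize(query)
    return [doc for doc in documents if query_terms <= tokenize(doc)]


# Hypothetical sample documents, for illustration only.
documents = [
    "How to install Liferay Portal",
    "Liferay search scoring explained",
    "Set up a portal server",
]

matches = filter_documents(documents, "liferay portal")
synonym_miss = filter_documents(documents, "install portal")
```

The second query shows the false-negative problem: "Set up a portal server" is arguably about installing a portal, but it never contains the exact word "install", so exact-term filtering drops it.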

However, even acknowledging that there are false negatives, there may be thousands of documents that match, and you can't just blindly provide all the documents, because there's no way for the end user to consume that much information!

One solution that stands out is to *summarize* them. Techniques include presenting summary statistics as numbers, presenting data visualizations that condense the results into graphs or other forms that humans can more readily interpret (such as rendering addresses as points on a map), or allowing the end user to "drill down" into the search results with facets.

* [Core Concepts in Data Analysis: Summarization, Correlation, and Visualization](https://www.researchgate.net/publication/232282057_Core_Concepts_in_Data_Analysis_Summarization_Correlation_and_Visualization)

However, this requires you to know a lot about the documents you're getting back, because as you've seen through this example of data exploration for linear regression, how you can summarize the information depends entirely on what kind of data it is.

What if we wanted a more generic way to respond to almost all queries? One such way is to provide the documents in descending order of *relevance*. In theory, this minimizes the time a user must spend finding the result, because the documents they need most are presented first.

Relevance, however, is an abstract concept. You can't actually have the end user tell you how they feel about each of your tens of thousands of documents, and even if you could, those values aren't static over time. If I searched for "liferay" back in 2007, I might expect to get results back on how to install Liferay Portal 4.4. If I searched for "liferay" today, I'd think that same document is not at all relevant to my interests.

So what can we do instead? We can create a *proxy* for relevance which we will call a *score*.

To summarize, our goal now is to create a model that will predict the appropriate *score* for each document, given the context of the search query that the user entered. In Elasticsearch, it turns out that this model is very close to being a *linear combination* of variables derived by performing sparse matrix multiplications of the query vector and the document vectors.

* [Elasticsearch Scoring Theory](https://www.elastic.co/guide/en/elasticsearch/guide/current/scoring-theory.html)
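
To make the "sparse multiplication of the query vector and the document vectors" idea concrete, here is a minimal sketch that represents each vector as a dictionary holding only its non-zero terms. The term weights below are made-up term counts and `sparse_dot` is a hypothetical helper; the variables Elasticsearch actually derives are more involved than raw counts.

```python
def sparse_dot(query_vector, document_vector):
    """Dot product of two sparse vectors stored as {term: weight} dictionaries."""
    # Iterate over the smaller dictionary so the work scales with the sparser vector.
    if len(query_vector) > len(document_vector):
        query_vector, document_vector = document_vector, query_vector
    return sum(
        weight * document_vector[term]
        for term, weight in query_vector.items()
        if term in document_vector
    )


# Hypothetical term-count vectors, for illustration only.
query = {"liferay": 1, "portal": 1}
document_vectors = {
    "doc_a": {"install": 1, "liferay": 2, "portal": 1},
    "doc_b": {"liferay": 1, "search": 3},
    "doc_c": {"server": 1, "setup": 1},
}

raw_scores = {name: sparse_dot(query, vec) for name, vec in document_vectors.items()}
```

Only terms that appear in both the query and the document contribute, which is why the sparse representation is cheap even when the full vocabulary is very large.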

Once we advance our math further, we'll talk about these derived variables in more detail. However, among these variables is something we already understand now that we've finished regression: a *linear combination* of per-field subscores.

* [Elasticsearch Query-Time Boosting](https://www.elastic.co/guide/en/elasticsearch/guide/current/query-time-boosting.html)
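
The linear combination of per-field subscores can be sketched as a weighted sum. This is a simplification under stated assumptions: the field names, subscores, and boost values are invented for illustration, `combined_score` is a hypothetical helper, and a real Elasticsearch score involves additional factors beyond this sum.

```python
def combined_score(subscores, boosts):
    """Weighted sum of per-field subscores; fields without a boost default to 1.0."""
    return sum(boosts.get(field, 1.0) * value for field, value in subscores.items())


def rank(subscores_by_doc, boosts):
    """Document names ordered by descending combined score."""
    return sorted(
        subscores_by_doc,
        key=lambda name: combined_score(subscores_by_doc[name], boosts),
        reverse=True,
    )


# Hypothetical per-field subscores for two documents.
subscores_by_doc = {
    "doc_a": {"title": 0.5, "body": 2.0},
    "doc_b": {"title": 1.5, "body": 0.5},
}

unboosted_order = rank(subscores_by_doc, {})
title_boosted_order = rank(subscores_by_doc, {"title": 2.0, "body": 1.0})
```

With no boosts, `doc_a` ranks first because of its strong body match. Doubling the weight of the title field flips the order, which is exactly the lever that boosting gives you.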

Knowing that, how then can you find the appropriate boosting that will give you the desired ordering of search results?

From this exercise with linear regression, you might already have some ideas on how to derive a model that predicts something from a weighted linear combination of variables, and you'll know how to make the model more meaningful by adding new explanatory variables.

Just one thing is missing. In order to derive the weights (beyond gut instinct choices), you'll need to take a representative subset of your documents and a representative subset of your queries and manually score them so that the model has something to predict.

* [CACM Corpus](http://www.search-engines-book.com/collections/) (free)
* [Web Research Collections](http://ir.dcs.gla.ac.uk/test_collections/access_to_data.html) (non-free)
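
Once you have manually scored query-document pairs, deriving the boosts is an ordinary least-squares problem. The sketch below assumes just two fields and no intercept, and solves the 2x2 normal equations directly. The `fit_two_boosts` helper is hypothetical and the judged scores are toy values constructed for illustration, not real relevance judgments.

```python
def fit_two_boosts(judged_rows):
    """Least-squares fit of score = w_title * title + w_body * body (no intercept).

    Each row is (title_subscore, body_subscore, judged_score).
    """
    a = sum(t * t for t, _, _ in judged_rows)
    b = sum(t * body for t, body, _ in judged_rows)
    d = sum(body * body for _, body, _ in judged_rows)
    p = sum(t * s for t, _, s in judged_rows)
    q = sum(body * s for _, body, s in judged_rows)
    det = a * d - b * b
    if det == 0:
        raise ValueError("subscores are collinear; the boosts are not identifiable")
    # Closed-form solution of the 2x2 normal equations.
    return (d * p - b * q) / det, (a * q - b * p) / det


# Toy judgments: (title subscore, body subscore, manually assigned score).
judged_rows = [
    (0.5, 2.0, 3.0),
    (1.5, 0.5, 3.5),
    (1.0, 1.0, 3.0),
    (0.0, 1.0, 1.0),
]

title_boost, body_boost = fit_two_boosts(judged_rows)
```

The toy scores were built so that a title boost near 2 and a body boost near 1 fit them; real judgments would be noisy, and collecting them is the expensive part of this process.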

However, knowing how much work this is, you can also approximate the score by assigning each document a binary value indicating whether it is relevant (using a subjective threshold of relevance). With this simpler data set, you would instead predict the probability that the document is relevant and use that as an approximation of a score.

In predicting the probability that something belongs to a class rather than predicting a raw score, we transform this from a classic linear regression to binary logistic regression.

* [Logistic Regression](https://onlinecourses.science.psu.edu/stat504/node/149)
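
Here is a minimal sketch of binary logistic regression fitted with batch gradient descent, using only the standard library. The subscores and relevance labels are invented for illustration, and the learning rate and epoch count are arbitrary choices; in practice you would likely reach for an existing library rather than this hand-rolled loop.

```python
import math


def sigmoid(z):
    """Map any real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_probability(weights, bias, x):
    """Predicted probability that a document with features x is relevant."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))


def fit_logistic(features, labels, learning_rate=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    weights = [0.0] * len(features[0])
    bias = 0.0
    n = len(features)
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(features, labels):
            error = predict_probability(weights, bias, x) - y
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
            grad_b += error
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n
    return weights, bias


# Hypothetical (title, body) subscores with binary relevance labels.
features = [(0.5, 2.0), (1.5, 0.5), (1.0, 1.0), (0.2, 0.3), (0.1, 0.1), (0.3, 0.2)]
labels = [1, 1, 1, 0, 0, 0]

weights, bias = fit_logistic(features, labels)
strong_match = predict_probability(weights, bias, (1.5, 1.5))
weak_match = predict_probability(weights, bias, (0.1, 0.2))
```

Sorting documents by the predicted probability then gives an ordering that can stand in for a score, which is the approximation described above.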

## Closing Thoughts

Hopefully you have now become curious about linear regression.

You might wonder about simple extensions to linear regression, such as linear spline regression, which has boundary points (knots) at which the coefficients change.

* [An Introduction to Splines](http://www.statpower.net/Content/313/Lecture%20Notes/Splines.pdf)
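
One common way to write a linear spline is with hinge terms of the form $\max(0, x - k)$ for each knot $k$, which keeps the function continuous while letting the slope change at each knot. The sketch below only evaluates such a function; the knot location and coefficients are made-up values, and `linear_spline` is a hypothetical helper. Because the model is still linear in the hinge terms, those terms could be fed to an ordinary linear regression as extra explanatory variables.

```python
def linear_spline(x, intercept, slope, knots, slope_changes):
    """Evaluate a continuous piecewise-linear function.

    The slope starts at `slope` and changes by slope_changes[i] once x passes knots[i].
    """
    y = intercept + slope * x
    for knot, change in zip(knots, slope_changes):
        y += change * max(0.0, x - knot)
    return y


# Illustrative coefficients: slope 1 up to the knot at x = 2, slope 3 afterwards.
before_knot = linear_spline(1.0, 0.0, 1.0, knots=[2.0], slope_changes=[2.0])
at_knot = linear_spline(2.0, 0.0, 1.0, knots=[2.0], slope_changes=[2.0])
after_knot = linear_spline(3.0, 0.0, 1.0, knots=[2.0], slope_changes=[2.0])
```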

You might also wonder why we talked about cost functions at the start and if choosing a different cost function might change the way the regression works.

* [Quantile Regression: An Introduction](http://www.econ.uiuc.edu/~roger/research/intro/rq3.pdf)
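
A small experiment shows how the cost function changes the answer. The sketch below fits the simplest possible model, a single constant prediction, under two costs: squared error and the pinball (check) loss used in quantile regression. The data values and the brute-force grid search are illustrative assumptions, chosen only to keep the example short.

```python
import statistics


def squared_cost(prediction, values):
    """Sum of squared errors for a constant prediction."""
    return sum((v - prediction) ** 2 for v in values)


def pinball_cost(prediction, values, quantile=0.5):
    """Pinball loss; at quantile 0.5 it is half the sum of absolute errors."""
    return sum(
        max(quantile * (v - prediction), (quantile - 1) * (v - prediction))
        for v in values
    )


# Toy data with one large outlier.
values = [1.0, 2.0, 3.0, 4.0, 100.0]

# Brute-force search over candidate predictions 0.0, 0.1, ..., 100.0.
candidates = [c / 10 for c in range(0, 1001)]
best_squared = min(candidates, key=lambda c: squared_cost(c, values))
best_pinball = min(candidates, key=lambda c: pinball_cost(c, values))
```

The squared cost is minimized at the mean of the data, which the outlier drags upward, while the pinball cost at quantile 0.5 is minimized at the median, which barely notices the outlier. The same model family gives a different fit purely because the cost function changed.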

It's likely that you've also been wondering about applying transformations of the input and output variables in order to overcome the constraints of the linear relationship between variables.

* [Transformations in Regression](http://people.stern.nyu.edu/jsimonof/classes/2301/pdf/transfrm.pdf)

You might be able to follow examples of people looking at input types that we haven't talked about (such as geospatial data) that will allow you to apply linear regression to other problems that involve the prediction of a continuous variable with range $(-\infty, \infty)$.

* [Airbnb Properties in Boston](https://github.com/ResidentMario/boston-airbnb-geo/blob/master/notebooks/boston-airbnb-geo.ipynb)