# The two basic elements of every machine learning problem

Now that we are somewhat familiar with the two basic problems of machine learning, at least at a high level, let's zoom out one more level to look at two important technical elements shared by both of these tasks. These are referred to as *feature design* and *mathematical optimization* respectively. Each will be a constant subject of the course as we discuss machine learning problems in full.

#  1.  Feature design

Remember that the term *feature* means *input* in the parlance of machine learning.  What, then, does it mean to 'design' a feature or set of features?  We will use the phrase 'feature design' to refer to two ideas, each of which is extremely important in the practical application of machine learning:

1.  Selecting a few relevant features from a large pool of candidates - this very commonly occurs in e.g., the financial and genetic applications discussed previously

2.  Mathematically transforming a given set of inputs to capture nonlinearity in a dataset - this is virtually always done with applications in images, text, and speech 



## 1.1  Selecting relevant features

Very often we must select the most appropriate input to a machine learning problem like regression because, while we might have something we wish to predict, we do not know which inputs will give us the greatest insight.  For example, if we wanted to predict the price of a particular stock one month from now, what should we use as our input?  Several possibly useful input features might come to mind - e.g., previous prices, certain economic indicators like the federal funds rate, maybe even the general sentiment of insightful financial journalists if we can get ahold of it - but a single 'silver bullet' input feature, i.e., one that perfectly describes the historical price of a stock, is not apparent.

So, based solely on the ignorance of what particular input would work best, a common approach is to try to find as many input features as possible, dump them into the model, and select the ones that are most indicative of our target output.  

Let's look at a very simple example of doing this.  Suppose we're interested in understanding the total amount of student debt in the United States for the past decade or so, and predicting its future value.  This is a regression problem, and we have already seen in our previous introduction to the machine learning problem of regression that indeed the input feature *time* is a fairly good one for this output, as it correlates quite strongly with student debt.  But suppose we did not know this - because commonly in practice we will not have such insight - and that to compensate for our ignorance we gathered two candidate input features (remember that in practice we would try to gather as many viable inputs as we could).  

Our two candidate input features are 1) time (in years) and 2) the annual sales of the Chiquita banana company.  What in the world do banana sales have to do with student debt?  Likely nothing - but let's take a look.  First let's look at the entire dataset - that is, we use both inputs and the output.  Since we have two input features and one output, the full dataset is 3-dimensional.

<img src="files/student_debt_and_chiquita_3D.png" width=500 height=250/>

Now let's look at each input individually with the output. Unsurprisingly, just glancing at the left panel (where the input feature is time) and the right panel (where it is banana sales), time appears to be a much better input for predicting student debt.  Time is the far better choice of input here because it almost perfectly correlates with the output, whereas the relationship between banana sales and student debt looks vague at best.

<img src="files/student_debt_and_chiquita_2D.png" width=500 height=250/>

The feature design task of *feature selection* - which we will learn about in the course - will allow us to automate the task of selecting the better of these two input features - time - so that we can produce the most useful regression model possible.  More generally it will allow us to determine the best feature or set of features for general regression problems as well.
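To make the idea of comparing candidate inputs concrete, here is a minimal sketch that ranks features by the strength of their linear correlation with the output. This is only one simple scoring rule, chosen here for illustration, and not necessarily the feature selection method developed later in the course. The data is synthetic stand-in data (not the real student debt or Chiquita figures), and the function name `rank_features_by_correlation` is our own.

```python
import numpy as np

def rank_features_by_correlation(features, y):
    """Return (name, |Pearson correlation with y|) pairs, strongest first."""
    scores = {}
    for name, x in features.items():
        r = np.corrcoef(x, y)[0, 1]   # off-diagonal entry of the 2x2 correlation matrix
        scores[name] = abs(r)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Synthetic stand-in data: an output that grows steadily with time,
# and a second candidate input that is unrelated to the output.
rng = np.random.default_rng(0)
time = np.linspace(2004, 2014, 40)
debt = 0.3 + 0.08 * (time - 2004) + 0.01 * rng.standard_normal(40)
banana_sales = 3.0 + 0.2 * rng.standard_normal(40)

ranking = rank_features_by_correlation(
    {"time": time, "banana_sales": banana_sales}, debt
)
best_feature = ranking[0][0]
print(ranking)
```

Note that correlation only measures *linear* association, so a score like this can miss inputs that relate to the output in a nonlinear way.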


## 1.2  Transforming input features to capture nonlinearity

Very often we must transform a given input in order to design the final features we feed into our machine learning model. We do this by leveraging our understanding of the phenomenon under study, and by encoding this knowledge into a tractable mathematical or computational transformation of the given inputs. These transformed features - as we will see - allow for significantly greater learning.

Before diving into the details for a modern problem, let's first discuss a revealing historical example of rule-finding. This will not only set the stage for the typical modern task but will also highlight one of the most critical challenges associated with today's machine learning problems.

###   Galileo and the fundamental rule of gravity

Galileo Galilei - the 17th century scientist and philosopher - is perhaps most famous for his championing of the Copernican model of the solar system (in which the sun, rather than the earth, was the center of the universe - the earth-centered view having been held since the days of Aristotle) in the face of much scrutiny from the Catholic church, the governing institution of his time and place. But Galileo also discovered a huge array of scientific principles in his lifetime, and put other principles that were perhaps philosophically 'intuitive' at the time on more solid ground by creating experimental evidence of their veracity. His experiments in determining the rule of earth-bound gravity - which was later explained by Newton's laws of motion and gravitation - are just such an example. They combine an absolutely brilliant experimental design and approach to data collection with a straightforward application of rule finding of the kind machine learning now performs.


In order to quantify the pull of gravity on an object, Galileo designed an experiment to measure how far an object falls in a given amount of time. The basic idea behind the experiment was to drop an object - like a metal ball - multiple times from a fixed height and measure how long it took the ball to traverse certain portions of the distance. However, because accurate enough timekeeping devices did not yet exist, he had to slow things down in order to measure time precisely enough, and so instead of dropping the ball he rolled it down a smooth ramp starting from the top, as shown figuratively below (taken from [1]). (It was Galileo's own study of pendulums that eventually led to the development of humankind's first accurate timepiece: the pendulum clock. This was the most precise instrument for keeping time for nearly 300 years - from about 1650 until the early 1930s.)

<img src="files/galileo_ramp.png" width=500 height=250/>


Galileo collected data on how long it took the ball to traverse certain portions of the ramp (specifically, he measured how long it took the ball to traverse $\frac{1}{4}$, $\frac{1}{2}$, $\frac{2}{3}$, $\frac{3}{4}$ and the full length of the ramp). Repeating the experiment several times, he averaged the results - leaving a single data point representing the average time it took the ball to travel down each fraction of the ramp - as shown below (this data is actually taken from a modern reenactment of Galileo's experiment - see [Refined]).

<img src="files/galileo_data.png" width=250 height=250/>

From philosophical reflection and visual examination of a dataset very much like this one, Galileo proposed a simple nonlinear rule that appeared to explain, or equivalently generate, this data: that the distance an object travels due to the pull of gravity is *quadratic* in time.  In other words, that

\begin{equation}
\text{distance traveled (as a portion of the ramp)} = \text{constant} \times (\text{time spent traveling})^2
\end{equation}

Fitting such a quadratic to the above dataset (by properly choosing the value of the constant) we can see that it does indeed represent the dataset quite well.

<img src="files/galileo_data_and_fit.png" width=250 height=250/>
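As a rough sketch of what 'properly choosing the value of the constant' can mean, suppose we pick the constant $c$ that minimizes the sum of squared errors $\sum_i (c\,t_i^2 - d_i)^2$ between the quadratic rule and the measured distances. Setting the derivative with respect to $c$ to zero gives $c = \sum_i t_i^2 d_i \,/\, \sum_i t_i^4$. The least-squares criterion is our assumption here, and the times below are made-up illustrative values generated from a hypothetical constant, not Galileo's measurements or the reenactment data.

```python
import numpy as np

def fit_quadratic_constant(t, d):
    """Least-squares constant c for the one-parameter model d = c * t**2."""
    t2 = t ** 2
    return np.sum(t2 * d) / np.sum(t2 ** 2)

# Portions of the ramp at which times are recorded (as in the text).
portion = np.array([0.25, 0.5, 2 / 3, 0.75, 1.0])

# Hypothetical times: generated from an illustrative constant, then
# perturbed slightly to mimic measurement error.
true_c = 0.02
perturbation = np.array([0.01, -0.01, 0.005, -0.005, 0.0])
times = np.sqrt(portion / true_c) * (1 + perturbation)

c_hat = fit_quadratic_constant(times, portion)
residuals = c_hat * times ** 2 - portion
print(c_hat, np.abs(residuals).max())
```

Because the model has a single parameter, the best constant has a closed form; with more parameters we would generally turn to the optimization tools discussed later in this section.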

Moreover this quadratic rule - derived by examining such a simple dataset - was found to be extremely accurate, standing up to both further empirical examination and philosophical study (e.g., it helped lay the groundwork for Newton's laws of motion and gravitation).
 

###  From Galileo to machine learning

In the example above, Galileo determined the quadratic rule for gravity by looking at his dataset and by employing his physical intuition.  Machine learning - in its current state - is a set of tools for replicating this (and only this) part of determining the rules that govern a given system.  That is, machine learning can automatically determine (using a dataset):

1.  The correct nonlinear relationship between the input and output of a system - in other words, the nonlinear function of the input that predicts the output well. In the case of the Galileo example, this is the fact that the relationship between the time an object has been falling and the distance it has traveled is quadratic

2.   A proper value for the parameters of this (potentially) nonlinear relationship so that the rule fits the dataset well - in the case of the Galileo example this consists of a single constant



Note - very importantly - that what is not included here is *how* we get the data itself, an obviously critical component of forging rules.  Machine learning is a substitute for philosophical / scientific understanding and visual examination in the forging of rules, and so relies entirely on having solid datasets to work with.  The severity of this deficiency ranges from problem to problem, and for many of the examples listed in the first part of this section it is not really a problem at all, given that the data in those cases is usually plentiful.  But in a case like Galileo's it is a very serious obstacle - here the data was compiled from a seriously ingenious experiment.  In short, machine learning cannot yet collect or create the 'right' data for determining rules; that part is still very much up to us humans.


But enough of what it cannot do - let's celebrate what machine learning can do!  The fact that machine learning can automatically determine the form of the potentially nonlinear relationship between the inputs and outputs of a dataset, and tune the associated parameters accordingly, gives us incredible power.  This is because there are many instances - as in the examples described in the first part of this section - where we can gather large datasets but the nature of the data - e.g., that it is too high dimensional to visualize - completely prevents us from even proposing a reasonable nonlinear rule.

Take the task of face detection for example - the technology that places a little square around faces when you take a picture with your smartphone (in order to focus the lens on these portions of the captured image).  In order to make this work one first collects a large database of small facial and non-facial images, like those shown below.   

<img src="files/face_detection_data.tif" width=400 height=400/>

In order to make face detection work we want to use such a dataset to derive a rule that distinguishes facial images from non-facial ones.  Remember that a grayscale digital image is made up of many small squares called 'pixels', each of which has a brightness level between 0 (completely black) and 255 (completely white). In other words a grayscale digital image can be thought of as a matrix or array whose $\left(i,j\right)^{th}$ value is equal to the brightness level of the $\left(i,j\right)^{th}$ pixel in the image. (A color image is then just a set of three such matrices, one for each color channel: red, green, and blue.)

<img src="files/nugety_pixels.png" width=500 height=500/>
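To see this array view of an image in code, here is a small sketch using a made-up $4 \times 4$ grayscale 'image'. The pixel values are arbitrary and purely illustrative. Note that NumPy indexes rows and columns starting from 0, whereas the $\left(i,j\right)^{th}$ notation above is usually read as starting from 1.

```python
import numpy as np

# A made-up 4x4 grayscale image: each entry is a brightness level,
# from 0 (completely black) to 255 (completely white).
gray = np.array(
    [[0, 64, 128, 255],
     [32, 96, 160, 224],
     [16, 80, 144, 208],
     [0, 0, 255, 255]],
    dtype=np.uint8,
)

top_right = gray[0, 3]   # brightness of the pixel in the first row, last column

# A color image stacks three such matrices, one per channel (red, green, blue).
# Here all three channels are identical, so the image would still look gray.
color = np.stack([gray, gray, gray], axis=-1)

# Flattening the matrix gives one long vector of pixel values, which is
# one common way to present an image as input features to a learner.
features = gray.flatten()

print(gray.shape, color.shape, features.shape)
```

Even this tiny image already has 16 input dimensions, which hints at why real images cannot simply be plotted and examined by eye.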

In other words - our dataset consists of small *input* images (which are high dimensional arrays of pixel values) and their associated *output* type or *class* - either face or non-face.  Because the output class labels 'face' and 'non-face' are not numeric in nature, these labels are translated into distinct numbers - e.g., +1 for a face image and -1 for a non-face image.  So in order to determine a successful rule distinguishing faces from non-faces we must determine a (potentially) nonlinear function of image pixels which accurately returns +1 if the input image is a face, and -1 otherwise.  That is, we seek some function $f$ that takes in an input image from the database and satisfies

\begin{equation}
\text{class of input image} = f(\text{input image pixels}) = \begin{cases}
+1 & \,\,\text{if input is a face}\\
-1 & \,\,\text{if input is not a face}
\end{cases}
\end{equation}

Machine learning, as we will see, can be used to automatically determine a proper form for the function $f$ and properly tune its parameters.  Just think - even determining a proper function for such a problem would be absolutely impossible to do 'by eye', as we saw Galileo do with the gravity-experiment data, since the image data is far too high dimensional for us to even visualize.

Once machine learning has been used to properly determine a function as well as its parameters, faces are detected in a new full image (as on your smartphone) by passing a small square window over all regions of the input image.  The content of each small windowed image is then passed through the function $f$ to determine if it contains a face or not, as illustrated figuratively below.

<img src="files/face_detection_test.png" width=500 height=500/>
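The window-passing procedure described above can be sketched in a few lines. The classifier used here, `toy_classifier`, is a hypothetical stand-in for a learned function $f$: it simply returns +1 for bright patches and is not a face detector. The window size, stride, and test image are likewise arbitrary choices made for illustration.

```python
import numpy as np

def sliding_window_detect(image, f, window=3, stride=1):
    """Apply f to every window x window patch of a grayscale image.

    Returns the (row, column) of the top-left corner of each patch
    that f labels +1.
    """
    rows, cols = image.shape
    hits = []
    for i in range(0, rows - window + 1, stride):
        for j in range(0, cols - window + 1, stride):
            patch = image[i:i + window, j:j + window]
            if f(patch) == 1:
                hits.append((i, j))
    return hits

def toy_classifier(patch):
    """Hypothetical stand-in for a learned f: +1 if the patch is bright on average."""
    return 1 if patch.mean() > 200 else -1

# A dark 6x6 image containing a single bright 3x3 block.
image = np.zeros((6, 6))
image[2:5, 1:4] = 255

detections = sliding_window_detect(image, toy_classifier)
print(detections)
```

In a practical detector the windows would typically also be tried at several sizes, since faces can appear at different scales in the image.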

### In summary

**Feature design** is the task of selecting the right input or determining the proper nonlinear relationship between the given input and output of a system - in other words, finding a nonlinear function of the input that accurately predicts the output.

**However, gathering or creating proper datasets is still up to us humans.**

# 2.   Mathematical optimization

Every learning problem has parameters that must be tuned properly to ensure optimal learning. For example, there are two parameters that must be properly tuned in the case of linear regression (with one dimensional input): the slope and intercept of the linear model.  These two parameters are tuned by forming a 'cost function' - a continuous function in both parameters - that measures how well the linear model fits a dataset given a value for its slope and intercept.  The proper tuning of these parameters via the cost function corresponds geometrically to finding the values for the parameters that make the cost function as small as possible or, in other words, *minimize* the cost function.  In the image below - taken from [Refined] - you can see how choosing a set of parameters higher on the cost function results in a corresponding linear fit that is poorer than the one corresponding to parameters at the lowest point on the cost surface.

![image](files/bigpicture_regression_optimization.png) 
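As a minimal sketch of this idea, the code below forms one common cost function for linear regression, the mean squared error between the line and the data, and minimizes it with plain gradient descent. Both the specific cost and the specific optimization method are assumptions made here for illustration, since the text above does not commit to either. The data is synthetic and noise-free, and the step size and iteration count are hand-picked values that happen to work for this small example.

```python
import numpy as np

def least_squares_cost(intercept, slope, x, y):
    """Mean squared error of the line intercept + slope * x on the data (x, y)."""
    return np.mean((intercept + slope * x - y) ** 2)

def gradient_descent(x, y, step=0.1, iterations=2000):
    """Minimize the least-squares cost over (intercept, slope), starting from (0, 0)."""
    intercept, slope = 0.0, 0.0
    for _ in range(iterations):
        residual = intercept + slope * x - y
        grad_intercept = 2 * np.mean(residual)      # d(cost)/d(intercept)
        grad_slope = 2 * np.mean(residual * x)      # d(cost)/d(slope)
        intercept -= step * grad_intercept
        slope -= step * grad_slope
    return intercept, slope

# Synthetic, noise-free data lying exactly on the line y = 1 + 2x.
x = np.linspace(0, 1, 20)
y = 1.0 + 2.0 * x

intercept, slope = gradient_descent(x, y)
cost_at_start = least_squares_cost(0.0, 0.0, x, y)
cost_at_end = least_squares_cost(intercept, slope, x, y)
print(intercept, slope, cost_at_start, cost_at_end)
```

Parameters 'higher on the cost surface', such as the starting point (0, 0), give a visibly worse line than the parameters reached at the bottom of the surface.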

This same idea holds true for regression with higher dimensional input, as well as for classification, where we must properly tune the intercept and normal vector of the fitting hyperplane.  Again, the parameters minimizing the cost function provide the better classification result.  This is illustrated for classification below - again taken from [Refined].

![image](files/bigpicture_classification_optimization.png) 

The tuning of these parameters is accomplished by a set of tools known collectively as mathematical optimization. Mathematical optimization is the formal study of how to properly minimize cost functions and is used not only in machine learning, but also in a variety of other fields including operations, logistics, and physics.