# Table of contents
[5.1. Why object recognition is difficult](#object_reg)  
[5.2. Ways to achieve viewpoint invariance](#viewpoint_invar)  
[5.3. Convolutional neural networks for handwritten digit recognition](#hand_written_digit_reg)  
[5.4. Convolutional neural networks for object recognition](#conv_net_obj_reg)  

## 5.1. Why object recognition is difficult
<a id="object_reg"> </a>
- Things that make it hard to recognize objects:  
    - Segmentation: real scenes are cluttered with other objects:  
        - it is hard to tell which pieces go together as parts of the same object.
        - parts of an object can be hidden behind other objects (occlusion).
    - Lighting: pixel intensities are determined as much by the lighting as by the object itself, so the same object can produce very different pixel values under different lighting.
    - Deformation: objects can deform in a variety of non-affine ways (e.g. a handwritten $2$ can have a large loop or just a cusp), so the same object can look very different (for example, a handwritten $2$ or $4$).
    - Affordances: object classes are often defined by how they are used (e.g. chairs are things designed for sitting on, so they come in a wide variety of physical shapes; modern and classic chairs can look wildly different, and recognizing both requires knowing that the thing is meant to be sat on).  
    $\rightarrow$ many objects are defined more by what they are used for than by what they look like.
    - Viewpoint/Transformation: changes in viewpoint cause changes in images that standard learning methods cannot cope with, because information hops between input dimensions (i.e. pixels).
    ![infor_hops](images/infor_hops.png)
    e.g. imagine a medical database in which a patient's age is sometimes stored in the field that normally holds the patient's weight, so that age and weight swap input dimensions at random. This is called "dimension hopping", and it needs to be eliminated before applying ML. Viewpoint changes cause the same kind of "dimension hopping" in images.
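
To make dimension hopping concrete, here is a minimal sketch using a toy 1-D "image". The helper names `shift_right` and `active_dims` are made up for this illustration, and a real viewpoint change is of course more complex than a simple shift.

```python
# Toy illustration (hypothetical helpers, not from the lecture):
# the same feature moves to different input dimensions after a shift.

def shift_right(pixels, k):
    """Shift a 1-D image right by k pixels, padding with zeros on the left."""
    k = min(k, len(pixels))
    return [0] * k + pixels[:len(pixels) - k]

def active_dims(pixels):
    """Indices of the input dimensions that carry the feature (non-zero pixels)."""
    return [i for i, v in enumerate(pixels) if v != 0]

image = [0, 9, 9, 0, 0, 0, 0, 0]   # a bright blob on pixels 1 and 2
shifted = shift_right(image, 3)    # the same blob seen from a shifted viewpoint

print(active_dims(image))    # [1, 2]
print(active_dims(shifted))  # [4, 5]

# A per-dimension comparison finds no overlap, although the object is unchanged.
overlap = sum(a * b for a, b in zip(image, shifted))
print(overlap)  # 0
```

A learner that treats each pixel as a fixed input dimension sees two unrelated inputs here, which is why the approaches in the next section try to undo or tolerate the shift.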

## 5.2. Ways to achieve viewpoint invariance
<a id="viewpoint_invar"> </a>
- A few common approaches:
    - Use redundant invariant features
    - Box objects and normalize pixels: put a box around the object and use normalized pixels
    - Replicated features with pooling (convolutional neural nets - $\color{red}{Lecture\ 5.3}$)
    - Hierarchy of parts that have explicit poses relative to camera ($\color{red}{Lecture\ 5.e}$)
- Details:
    - Invariant feature approach: 
        - Extract a large, redundant set of features that are invariant under transformations. 
        - The underlying assumption comes from the observation that humans effortlessly recognize objects in different poses and lighting conditions, so there must exist properties or features that are invariant to these variations. 
        - With enough invariant features, there is only one way to assemble them into an object (relationships between features do not need to be represented directly, because they are captured by other, overlapping features). 
        - But for recognition, we must avoid forming features from parts of different objects.
    - _Judicious normalization approach_ (boxing/normalizing objects):
        - Put a box around the object and use it as a coordinate frame for a set of normalized pixels.
        - Solves "dimension hopping": if the box is always chosen correctly, the same part of an object always lands on the same normalized pixels.
        - Can provide invariance to many degrees of freedom: $\color{red}{translation,\ rotation,\ scale,\ shear,\ stretch\ ...}$
        - Choosing the box is difficult, however, because of segmentation errors, occlusion, and unusual orientations.
        - We need to recognize the shape in order to box it correctly, which is the very problem we are trying to solve (a chicken-and-egg problem).
    - _Brute-force normalization approach_ (boxing):
        - When training the recognizer, use well-segmented, upright images so that the correct box can be fitted accurately and cleanly.
        - At test time, the data is noisier and unsegmented, so try all possible boxes over a range of positions and scales.
        - This is widely used for detecting upright things such as faces and house numbers in unsegmented images.
        - It is important that the network tolerates some sloppiness in the boxing, so that coarser, less accurate boxing can be done at test time (it is much more efficient if the recognizer can cope with some variation in position and scale, because we can then use a coarse grid when trying all possible boxes).
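
The two boxing approaches can be sketched on toy binary images as follows. This is only an illustration under simplifying assumptions: images are lists of lists, the box is the tight bounding box of the non-zero pixels (so segmentation is trivially solved), resampling is nearest-neighbour, and the names `place`, `bounding_box`, `normalize`, and `candidate_boxes` are hypothetical. No recognizer is included.

```python
def place(pattern, top, left, canvas=6):
    """Draw a small pattern onto an empty canvas x canvas image."""
    img = [[0] * canvas for _ in range(canvas)]
    for r, row in enumerate(pattern):
        for c, v in enumerate(row):
            img[top + r][left + c] = v
    return img

def bounding_box(img):
    """Tight box (top, left, bottom, right) around non-zero pixels; ends exclusive."""
    rows = [r for r, row in enumerate(img) if any(row)]
    cols = [c for c in range(len(img[0])) if any(row[c] for row in img)]
    return rows[0], cols[0], rows[-1] + 1, cols[-1] + 1

def normalize(img, box, size):
    """Resample the box contents onto a size x size grid of normalized pixels."""
    top, left, bottom, right = box
    h, w = bottom - top, right - left
    return [[img[top + (r * h) // size][left + (c * w) // size]
             for c in range(size)] for r in range(size)]

def candidate_boxes(height, width, box_sizes, stride):
    """Brute force: every square box on a grid of positions and scales."""
    for s in box_sizes:
        for top in range(0, height - s + 1, stride):
            for left in range(0, width - s + 1, stride):
                yield top, left, top + s, left + s

# Judicious normalization: the same shape at different positions and scales
# lands on the same normalized pixels once it is boxed correctly.
small = [[1, 1], [1, 0]]
large = [[1, 1, 1, 1], [1, 1, 1, 1], [1, 1, 0, 0], [1, 1, 0, 0]]
a, b, c = place(small, 1, 1), place(small, 3, 2), place(large, 1, 2)
patch_a = normalize(a, bounding_box(a), 4)
patch_b = normalize(b, bounding_box(b), 4)
patch_c = normalize(c, bounding_box(c), 4)
print(patch_a == patch_b == patch_c)  # True

# Brute force: a coarse grid needs far fewer boxes than a fine one.
print(len(list(candidate_boxes(6, 6, [2, 4], stride=2))))  # 13
print(len(list(candidate_boxes(6, 6, [2, 4], stride=1))))  # 34
```

In a brute-force detector, each candidate box would be normalized and passed to the trained recognizer. The coarse grid only works if the recognizer tolerates boxes that are slightly off in position and scale.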

## 5.3. Convolutional neural networks for handwritten digit recognition
<a id="hand_written_digit_reg"> </a>

## 5.4. Convolutional neural networks for object recognition
<a id="conv_net_obj_reg"> </a>