<h1 align="center">Popular Models and Algorithms in Machine Learning (Computer Vision)</h1>

AI and computer vision have become fundamental technologies that impact our daily lives in many ways we may not realize. These technologies let self-driving cars navigate roads safely by identifying obstacles, secure our phones through facial recognition, and power tools like Google Lens that can identify objects, translate text, and retrieve information from photos we take. Snapchat AR filters and VR virtual tours demonstrate how these technologies are transforming entertainment and how we explore digital spaces. Understanding these technologies matters because they shape safety, privacy, and how humans connect and interact with the digital world. 

---

## Abstract

People naturally see shapes, colors, objects, and scenes within an image. However, computers view images as digital data. Digital images are made up of pixels, and each pixel is assigned a numerical value based on its color or intensity. In this way, a computer sees an image as a large grid of numbers. The goal of computer vision is to allow computers to analyze and understand these numerical representations to achieve a (super) human level of perception of visual information. In computer vision, there are three foundational levels. Low-level vision focuses on image processing and pixel level operations, mid-level vision deals with analyzing multiple images, and high-level vision involves recognition and understanding images to perform tasks such as classification, identification, or detection. In the pre-deep learning era of computer vision, systems relied on image processing and feature extraction for tasks like optical character recognition and face detection. However, in 2012, there was a deep learning breakthrough with AlexNet, a convolutional neural network (CNN) that could classify images into 1000 distinct object categories. AlexNet significantly outperformed competing models and established CNNs as the standard approach for modern computer vision. The advancement of AI and computer vision is driven by the improvement of camera technologies, the integration of larger datasets, and higher computing power, all of which contribute to more robust and impactful real-world applications.

---

## Scribe

#### Computer Vision Foundation

**Computer vision** is a technology that enables machines to process, analyze, and extract meaningful information from images and videos. According to Intel, it is "a type of AI that trains computers to emulate how humans see, make sense of what they see, and act on that processed and analyzed information" (Intel Corporation, n.d.). However, the way a human sees and interprets an image is very different from the way a computer does. Humans view a photo as a whole, intuitively recognizing distinct shapes, objects, places, and people. Computers start from the pixels that make up the image. A **pixel** is the smallest unit of a digital image and stores color or intensity information. In a grayscale image, each pixel is represented by a single number indicating its **intensity**, or brightness level, which ranges from 0 (black) to 255 (white) (ApX Machine Learning, 2025). The entire grayscale image forms a matrix of these intensity values, as shown in Figure 1. In a color image, the most common representation is the RGB (Red, Green, Blue) model, in which each pixel stores three values for the intensity of red, green, and blue, each ranging from 0 (no intensity) to 255 (maximum intensity), as shown in Figure 2 (ApX Machine Learning, 2025).



| <img src="https://tse2.mm.bing.net/th/id/OIP.l1BOqS6T-sWBpRKoKRMfwAAAAA?pid=Api&P=0&h=220" alt="Grayscale Image Matrix" width="450">| <img src="https://miro.medium.com/v2/resize:fit:720/format:webp/0*CvTMHL1lYT1S_QnA.png" alt="RGB Color Model" width="600"> |
|:--:|:--:|
|  *Figure 1*. A grayscale image represented as a matrix of numerical values, where each pixel in the image corresponds to an intensity value ranging from 0 to 255. Adapted from "Image Filtering," *Grayscale Pixel Value*. Retrieved October 11, 2025, from https://ai.stanford.edu/~syyeung/cvweb/tutorial1.html | *Figure 2*. A pixel of an RGB image has an intensity value for all three color components. Adapted from "Medium," *Color image channels*. Retrieved October 11, 2025, from https://medium.com/@mjbharmal2002/gray-scaling-with-the-algorithms-b83f87975885 |
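
The pixel representation described above can be sketched in a few lines of NumPy. This is a minimal illustration, not part of the presentation: the array sizes and pixel values are made up, and 8-bit unsigned integers are assumed so that every value falls between 0 and 255.

```python
import numpy as np

# A tiny 3x4 grayscale image: one intensity per pixel, 0 (black) to 255 (white).
gray = np.array(
    [[0, 64, 128, 255],
     [32, 96, 160, 224],
     [255, 128, 64, 0]],
    dtype=np.uint8,
)

# An RGB image of the same size adds a third axis holding (red, green, blue).
rgb = np.zeros((3, 4, 3), dtype=np.uint8)
rgb[0, 0] = (255, 0, 0)      # top-left pixel: pure red
rgb[0, 1] = (255, 255, 255)  # next pixel: white (all channels at maximum)

print(gray.shape)  # (rows, columns)
print(rgb.shape)   # (rows, columns, channels)
print(gray[0, 3])  # intensity of the top-right pixel
```

To the computer, the grayscale image is just the 3x4 matrix and the color image is three such matrices stacked along a channel axis.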

The goal of computer vision is to give computers **(super) human-level perception**, allowing them to interpret images in a way that matches or exceeds human capabilities. However, achieving this goal can be difficult. During the presentation, *Popular Models and Algorithms in Computer Vision*, Dr. Chen noted that in the past, determining whether a photo contained a bird was a difficult research problem, because computers had to extract a high-level concept from a massive matrix of numbers (Chen, 2025).


Computer vision can be divided into low-level, mid-level, and high-level vision. **Low-level vision** involves the **processing and manipulation of pixels** and feature extraction. Pixel manipulation includes resizing an image by zooming in or out, erasing pixels, converting a color image to grayscale, and adjusting the exposure, saturation, or hue of an image. **Feature extraction** includes **edge detection**, which identifies the boundaries and edges of objects within an image; oriented gradients, which capture the direction in which intensity changes at each location; and **image segmentation**, which divides an image into different parts based on color value.

**Mid-level vision** handles patch processing, meaning operations performed on a small block of pixels, and can integrate multiple images. Panorama stitching is a mid-level vision application that finds distinctive features in overlapping images of the same scene and matches them to align the images into a single panorama, as shown in Figure 3 (Chen, 2025). **Multi-view stereo** uses multiple images taken from different viewpoints to infer a 3D model of a scene. This technique relies on **motion parallax**, where objects at different distances shift by different amounts when the viewpoint changes, with closer objects shifting more than distant ones (Milvus, n.d.). Another application of mid-level vision is a **structured light scan**, which projects stripes of light onto an object. The stripes distort based on the object's 3D surface, and by analyzing the distortions, the system can reconstruct a 3D model of the object, as shown in Figure 4. Similarly, LiDAR (Light Detection and Ranging) is a mid-level vision technology that performs range finding by measuring distances with laser beams to create 3D reconstructions of scenes (IBM, 2025). **Optical flow**, another mid-level task, tracks the motion of objects, visualizing how they move across multiple frames, as shown in Figure 5 (Metaeyes, 2024). This technique can be applied to time-lapse videos to compute the alignment of images across time.
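
Two of the low-level operations named above, color-to-grayscale conversion and edge detection, can be sketched as follows. This is an illustrative example under stated assumptions rather than the method shown in the presentation: the grayscale weights are the commonly used BT.601 luma coefficients, the edge detector is the Sobel operator, and the helper names `to_grayscale`, `filter2d_valid`, and `sobel_edges` are hypothetical.

```python
import numpy as np

def to_grayscale(rgb):
    """Convert an (H, W, 3) RGB array to (H, W) using BT.601 luma weights."""
    weights = np.array([0.299, 0.587, 0.114])
    return rgb.astype(np.float64) @ weights

def filter2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding) and sum elementwise products."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            out[r, c] = np.sum(image[r:r + kh, c:c + kw] * kernel)
    return out

# Sobel kernels: SOBEL_X responds to horizontal intensity change, SOBEL_Y to vertical.
SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=np.float64)
SOBEL_Y = SOBEL_X.T

def sobel_edges(gray):
    """Return gradient magnitude (edge strength) and orientation in radians."""
    gx = filter2d_valid(gray, SOBEL_X)
    gy = filter2d_valid(gray, SOBEL_Y)
    return np.hypot(gx, gy), np.arctan2(gy, gx)

# Synthetic image: dark left half, bright right half, so one vertical edge.
rgb = np.zeros((6, 6, 3), dtype=np.uint8)
rgb[:, 3:] = 255
gray = to_grayscale(rgb)
magnitude, orientation = sobel_edges(gray)
print(np.round(magnitude[0]))
```

The magnitude is large only in the columns where the window straddles the dark-to-bright boundary, and the orientation there points along the horizontal axis, which is the direction of intensity change.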

| <img src="https://miro.medium.com/v2/resize:fit:450/1*CXZ804tKLPy2hiikJbYH3w.png" alt="Panorama Image" width="250"> | <img src="https://www.arc-bg.com/arcsite/ckfinder/userfiles/files/Working-principle-of-structured-light-scanner.png" alt="Structured Light Scanner" width="3000"> | <img src="https://i.ytimg.com/vi/LjjJQ81RbX0/hqdefault.jpg" alt="Optical Flow" width="500"> |
|:--------:|:--------:|:--------:|
| *Figure 3*. Stitching images to create a panorama. Adapted from "Medium," *Image Stitching to create a Panorama*. Retrieved October 11, 2025, from https://medium.com/@navekshasood/image-stitching-to-create-a-panorama-5e030ecc8f7 | *Figure 4*. It projects a structured light pattern onto an object, and captures the way in which the object deforms the light pattern. Adapted from "ARC Metrologia," *Structured light scanners*. Retrieved October 11, 2025, from https://www.arc-bg.com/en/253-structured-light-scanners  | *Figure 5*. Uses optical flow to track and detect the hand movements. Adapted from "Vision AI," *Optical Flow Tracking Grid and its use for Real-Time Object Detection*. Retrieved October 11, 2025, from https://www.youtube.com/watch?v=LjjJQ81RbX0 |
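
The motion parallax relationship behind multi-view stereo can be made concrete for the simplest case, a pair of aligned (rectified) cameras. This is a sketch under stated assumptions: the symbols $f$ (focal length in pixels), $B$ (baseline, the distance between the two camera centers), $Z$ (depth of a scene point), and $d$ (disparity, the horizontal shift of that point between the two images) are introduced here for illustration and do not come from the presentation.

$$
d = \frac{f\,B}{Z} \quad\Longleftrightarrow\quad Z = \frac{f\,B}{d}
$$

For example, with an assumed $f = 1000$ pixels and $B = 0.1$ m, a point at $Z = 1$ m shifts by $d = 100$ pixels, while a point at $Z = 10$ m shifts by only $d = 10$ pixels. Closer objects shift more, and measuring the shift lets the system recover depth.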

**High-level vision** returns to the main goal of computer vision, giving computers (super) human-level perception, which involves recognizing and interpreting the content of an image. **Recognition** tasks include verification, detection, identification, classification, and scene categorization. **Verification** determines whether a given object belongs to a specific class. **Detection** locates specific objects in a scene and determines their positions. **Identification** is an advanced form of verification that involves complex inference to distinguish specific instances of objects rather than just their general categories. **Classification** predicts which object classes are present in an image. Scene categorization classifies images based on their environment and contextual information, such as distinguishing between location types and identifying activities or events occurring within the scene (Chen, 2025).

One basic method of object recognition is template matching. **Template matching** is a method for locating a template image within a larger image, as shown in Figure 6. It works by sliding the template over the input image and calculating a similarity score (such as the dot product) between the template and each patch it overlaps (OpenCV, n.d.). While this method works well for detecting specific instances of objects, it performs poorly with generic object categories or when objects appear at different scales or orientations. **SIFT matching** (Scale-Invariant Feature Transform) is another object recognition method. It detects distinctive features in both the template and the input image and is resilient to changes in image scale, orientation, and intensity, as shown in Figure 7. In general, template matching is most effective when the target object maintains a consistent size and orientation, such as a logo, while SIFT matching is better suited to scenarios where objects vary in scale, rotation, or brightness (Albao, 2023).


| <img src="https://docs.opencv.org/4.x/template_ccoeff_1.jpg" alt="Template Matching" width="350"> <img src="https://docs.opencv.org/4.x/messi_face.jpg" alt="Template Matching" width="60">| <img src="https://miro.medium.com/v2/resize:fit:1358/1*bPN9KN1Y7Lkl8_8wfNtQgA.png" alt="SIFT features" width="700"> |
|:--:|:--:|
| *Figure 6*. The left image displays the matching map, where the brightest point indicates the location with the highest similarity between the template and the input image. The middle image shows the detected location of the template image within the input image, marked by a bounding box. The right image is the template image which is a photo of Messi's face. Adapted from "OpenCV," *template_ccoeff_1*, *messi_face*. Retrieved October 11, 2025, from https://docs.opencv.org/4.x/d4/dc6/tutorial_py_template_matching.html | *Figure 7*. There are distinct features in the template image of the truck that correspond to similar features in the input image. Adapted from "Medium," *SIFT Features*. Retrieved October 11, 2025, from https://medium.com/@deepanshut041/introduction-to-sift-scale-invariant-feature-transform-65d7f3a72d40 |
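
The sliding-window idea behind template matching can be sketched directly in NumPy. This is a simplified illustration, not OpenCV's implementation: the function name `match_template` is hypothetical, the image is random test data, and the similarity score is zero-mean normalized cross-correlation rather than a raw dot product, because a raw dot product tends to favor bright regions regardless of their content.

```python
import numpy as np

def match_template(image, template):
    """Return a score map of zero-mean normalized cross-correlation values."""
    th, tw = template.shape
    t = template - template.mean()
    t_norm = np.linalg.norm(t)
    out_h = image.shape[0] - th + 1
    out_w = image.shape[1] - tw + 1
    scores = np.zeros((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            patch = image[r:r + th, c:c + tw]
            p = patch - patch.mean()
            denom = np.linalg.norm(p) * t_norm
            # Flat patches have no structure to compare; leave their score at 0.
            if denom > 0:
                scores[r, c] = np.sum(p * t) / denom
    return scores

rng = np.random.default_rng(0)
image = rng.random((20, 20))
template = image[5:9, 12:16].copy()   # cut the template out of the image

scores = match_template(image, template)
best = tuple(int(i) for i in np.unravel_index(np.argmax(scores), scores.shape))
print(best)  # (row, column) of the top-left corner of the best match
```

The score map plays the role of the matching map in Figure 6: its brightest point marks where the template fits best, which here is the location the template was cut from. Because the comparison is pixel by pixel, a rescaled or rotated copy of the template would score poorly, which is the weakness described above.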

Object recognition faces many challenges that stem from variability, as shown in Figure 8. Differences in camera position, illumination, scale, deformation, occlusion, background clutter, and intra-class variation can all make an object difficult to identify, and the large number of potential object classes makes the task even more complex.



|<img src="https://cs231n.github.io/assets/challenges.jpeg" alt="Variability Challenges" width="800">|
|:--:|
| *Figure 8*. Examples of variability challenges in object recognition. Adapted from "CS231N deep learning for computer vision," *Challenges*. Retrieved October 11, 2025, from https://cs231n.github.io/classification/#summary-applying-knn-in-practice |

During the presentation, *Popular Models and Algorithms in Computer Vision*, Dr. Chen explained the concept of an image manifold. He remarked that the space of all images is very large since even a one-megapixel image can be represented as a vector with about one million elements, making it a very high dimensional vector. However, there exists a **lower dimensional space**, called an **image manifold**, that can describe an object and more easily distinguish it from others. Rather than working directly with the raw input space containing millions of pixels, this lower dimensional space operates at a high-level conceptual level, making it easier to classify images (Chen, 2025). 

This means that while two images of a cat at different angles may be far apart in the raw pixel space, the two images are just different points on the same lower dimensional manifold. The dimension of an image manifold depends on the task. A binary classification task may have a manifold with low dimension, while a task with 100 object categories may have a manifold with much higher dimension. The process of discovering this manifold structure, known as manifold learning, is a method for non-linear dimensionality reduction that focuses on uncovering the underlying manifold structure in high-dimensional data (Activeloop, n.d.). It reduces the dimensions of the data while preserving essential information, making it easier for algorithms to focus on the core aspects of an image. Manifold learning algorithms work well at identifying the underlying structure of objects even when variability such as illumination or viewpoint is present (Riswanto, 2024).
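
The idea that high-dimensional images can lie on a much lower-dimensional structure can be illustrated with a small experiment. This is a sketch under strong assumptions: it uses principal component analysis (PCA), a linear technique, as a stand-in for true non-linear manifold learning, and the data is synthetic, with every "image" built from just two hypothetical hidden factors.

```python
import numpy as np

rng = np.random.default_rng(0)
n_pixels = 32 * 32          # each image flattened to a 1024-element vector

# Hypothetical setup: every image is a mix of two fixed patterns, so the
# data has only 2 underlying degrees of freedom despite 1024 pixel values.
patterns = rng.normal(size=(2, n_pixels))
latent = rng.normal(size=(200, 2))              # 200 images, 2 hidden factors
images = latent @ patterns + 0.01 * rng.normal(size=(200, n_pixels))

# PCA via SVD on mean-centered data.
centered = images - images.mean(axis=0)
_, singular_values, components = np.linalg.svd(centered, full_matrices=False)
explained = singular_values**2 / np.sum(singular_values**2)

# Project each 1024-dimensional image onto the top 2 directions.
embedding = centered @ components[:2].T

print(np.round(explained[:4], 4))  # share of variance captured by each direction
print(embedding.shape)
```

Almost all of the variance falls on the first two directions, so two coordinates per image are enough to describe this dataset. Real image manifolds are curved, which is why non-linear manifold learning methods are used in practice instead of plain PCA.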

#### A Brief History of Computer Vision: Pre and Post Deep Learning

The history of computer vision can be divided into the pre-deep learning phase before 2012 and the modern deep learning era that began with the success of AlexNet. **Machine learning** is a subset of artificial intelligence that encompasses various methods for enabling algorithms to learn patterns from data and apply that knowledge to make predictions on new data. **Deep learning** is a branch of machine learning that uses layers of artificial neural networks to learn complex patterns and relationships in data (Terra, 2025).

During the pre-deep learning era, most computer vision algorithms relied on classical methods, such as template matching, and focused on pixel-level processing. **Optical character recognition** (OCR), the process of extracting text from images, worked well in the pre-deep learning era, especially for recognizing handwritten zip codes. One dataset used for this task was **MNIST** (Modified National Institute of Standards and Technology), which contains handwritten digits designed for zip code recognition. The MNIST dataset originated from the NIST (National Institute of Standards and Technology) dataset, which was commissioned by the U.S. Postal Service to develop solutions for automated digit recognition (Chen, 2025). Other tasks that were successful during this period were face detection, pedestrian detection, and instance-level recognition.

However, in 2012, Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton developed a convolutional neural network called AlexNet, as shown in Figure 9, that marked a turning point in computer vision. **Convolutional neural networks** (CNNs) are specialized artificial neural networks designed to process and extract features from grid-structured data, such as images or other visual data. CNNs analyze images through multiple layers in a hierarchical manner, with early layers detecting simple features like edges and lines, while deeper layers identify complex patterns, shapes, and complete objects (Google, n.d.). The main types of layers are the convolutional layer, pooling layer, and fully connected layer. The convolutional layer uses a **filter window**, a grid that moves across the image to search for features like edges or textures. At each position, it performs calculations that generate an **output matrix**, called a feature map, showing where those features appear in the image. The pooling layer then reduces the size of these feature maps by summarizing their information, making the network more efficient while preserving important features. The fully connected layer combines all the extracted features and uses them to perform classification (IBM, 2025).

|<img src="https://drek4537l1klr.cloudfront.net/elgendy/v-3/Figures/05_04.png" alt="AlexNet" width="800">|
|:--:|
| *Figure 9*. AlexNet's Architecture. Adapted from "Deep Learning for Vision Systems," *Elgendy*. Retrieved October 11, 2025, from https://livebook.manning.com/book/deep-learning-for-vision-systems/chapter-5/ |
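
The three layer types described above can be sketched as plain NumPy functions. This is a toy illustration, not AlexNet: the function names, the 8x8 input, the hand-written vertical-edge filter, and the random untrained weights are all assumptions made for the example. As in most deep learning libraries, the "convolution" here slides the filter without flipping it.

```python
import numpy as np

def conv2d(image, kernel):
    """Slide the filter window over the image; each output is one dot product."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    feature_map = np.zeros((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            feature_map[r, c] = np.sum(image[r:r + kh, c:c + kw] * kernel)
    return feature_map

def max_pool(feature_map, size=2):
    """Keep the largest value in each non-overlapping size x size block."""
    h = feature_map.shape[0] // size * size
    w = feature_map.shape[1] // size * size
    blocks = feature_map[:h, :w].reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))

def fully_connected(features, weights, bias):
    """Combine all features into one score per class."""
    return weights @ features + bias

rng = np.random.default_rng(0)
image = rng.random((8, 8))
kernel = np.array([[1.0, 0.0, -1.0]] * 3)   # responds to vertical edges

feature_map = conv2d(image, kernel)          # 8x8 input, 3x3 filter
pooled = max_pool(feature_map)               # halves each spatial dimension
flat = pooled.ravel()
weights = rng.normal(size=(3, flat.size))    # hypothetical untrained weights, 3 classes
scores = fully_connected(flat, weights, np.zeros(3))
print(feature_map.shape, pooled.shape, scores.shape)
```

In a trained CNN the filter values and the fully connected weights are learned from data rather than written by hand, and many filters are applied in parallel at every layer.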

In 2012, **AlexNet** won the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) and significantly outperformed other models. AlexNet's performance proved CNNs were highly effective for image processing and sparked renewed momentum in deep learning research (Jain, 2024). AlexNet took RGB images as input and produced a 1000-dimensional output to classify them into 1000 distinct object categories. The model contained about 60 million parameters. Model **parameters** are internal values that control how a machine learning model processes information and makes predictions. These values are learned and refined during training, enabling the model to accurately handle new data (Belcic & Stryker, 2025). One innovative aspect of AlexNet was that it was trained across multiple CUDA-enabled NVIDIA GPUs, allowing it to efficiently handle the large model and the massive ImageNet dataset (Chen, 2025).
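
The figure of about 60 million parameters can be checked with simple arithmetic. This is a sketch that assumes the layer shapes reported in the original AlexNet paper, which are not given in this summary; the helper names `conv_params` and `fc_params` are hypothetical.

```python
def conv_params(kernel_h, kernel_w, in_channels, out_channels):
    """Weights plus one bias per output channel."""
    return kernel_h * kernel_w * in_channels * out_channels + out_channels

def fc_params(in_features, out_features):
    """One weight per input-output pair plus one bias per output."""
    return in_features * out_features + out_features

# Layer shapes as reported in the original AlexNet paper (an assumption here,
# not taken from this summary). conv2, conv4, and conv5 see only half of the
# previous layer's channels because the network was split across two GPUs.
layers = {
    "conv1": conv_params(11, 11, 3, 96),
    "conv2": conv_params(5, 5, 48, 256),
    "conv3": conv_params(3, 3, 256, 384),
    "conv4": conv_params(3, 3, 192, 384),
    "conv5": conv_params(3, 3, 192, 256),
    "fc6": fc_params(6 * 6 * 256, 4096),
    "fc7": fc_params(4096, 4096),
    "fc8": fc_params(4096, 1000),
}
total = sum(layers.values())
for name, count in layers.items():
    print(f"{name}: {count:,}")
print(f"total: {total:,}")
```

Under these assumptions, the fully connected layers account for most of the parameters, with the first one alone holding more than half of the total.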

Since 2012, the deep learning era has brought significant advancements in the field of computer vision. Many of the challenging problems that existed in the pre-deep learning era are now considered solved. Modern models can now achieve human-level accuracy in object classification across thousands of categories. A key step toward these achievements was the transformer architecture. A **transformer** is a type of neural network that learns the context of an input sequence to generate an output sequence. It is used in natural language processing (NLP) to understand and generate human language (Amazon Web Services, n.d.a). **Vision transformers** (ViT) apply the transformer architecture to computer vision tasks by dividing an image into small patches and treating the patches as a sequence. They excel in image classification, object detection, semantic segmentation, and single image depth estimation (Boesch, 2023). Semantic segmentation and instance segmentation extend object detection by performing pixel-level classification, as shown in Figure 10. **Semantic segmentation** groups all pixels belonging to the same category together, so all cats in an image would be colored the same. **Instance segmentation** goes further by identifying each individual object separately, so each cat would receive its own unique color even though they belong to the same category (Coursera Inc, n.d.). **Single image depth estimation** predicts the depth of a scene from an image, as shown in Figure 11.
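
The first step of a vision transformer, turning an image into a sequence of patches, can be sketched as a reshape. This is a minimal illustration under stated assumptions: the function name `image_to_patches` is hypothetical, and the 224x224 input with 16x16 patches is a commonly used configuration chosen for the example rather than a detail from the presentation.

```python
import numpy as np

def image_to_patches(image, patch_size):
    """Split an (H, W, C) image into a sequence of flattened square patches."""
    h, w, c = image.shape
    if h % patch_size or w % patch_size:
        raise ValueError("image height and width must be multiples of patch_size")
    rows, cols = h // patch_size, w // patch_size
    patches = image.reshape(rows, patch_size, cols, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)   # (rows, cols, ph, pw, c)
    return patches.reshape(rows * cols, patch_size * patch_size * c)

image = np.arange(224 * 224 * 3, dtype=np.float32).reshape(224, 224, 3)
sequence = image_to_patches(image, patch_size=16)
print(sequence.shape)   # (number of patches, values per patch)
```

Each row of `sequence` plays the role that a word plays in a sentence for a language transformer. A full ViT would then project each patch to an embedding and add position information, which this sketch leaves out.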

Other tasks have also improved significantly through autoencoders, diffusion models, and vision language models. An **autoencoder** is a type of neural network that compresses an input image into its essential features, and outputs a reconstructed image. It can be used for image denoising, image compression, and applying image super-resolution (Petru, 2022). **Image super-resolution** refines low-resolution images into high-quality and high-resolution versions, as shown in Figure 12 (Chen, 2025). **Diffusion models** are generative models that add random noise to input data and then learn to reverse this process to generate new synthetic data. They are used for image, video, and audio generation (Coursera Inc, 2025). **Stable diffusion** is a latent diffusion model that uses a variational autoencoder (VAE) to compress high-dimensional pixel data into a lower-dimensional latent space before applying the diffusion process. This approach makes it faster and more efficient than traditional diffusion models while still generating high-quality images from text prompts (Amazon Web Services, n.d.b). **Vision language models** have similarly progressed, processing and understanding both visual and textual inputs to enable high-quality **image and video generation**, image captioning, and visual question answering (Pandit, 2024).
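
The noise-adding half of a diffusion model can be sketched as follows. This is an illustration under stated assumptions, not the method of any specific model named above: it uses the closed-form noising step and linear noise schedule commonly associated with denoising diffusion models, the schedule values and function names are hypothetical choices, and the learned reverse (denoising) process is not shown.

```python
import numpy as np

def make_alpha_bar(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for a linear noise schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def add_noise(x0, t, alpha_bar, rng):
    """Sample the noisy image x_t directly from the clean image x0."""
    noise = rng.normal(size=x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise
    return x_t, noise

rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar(num_steps=1000)
x0 = rng.uniform(-1.0, 1.0, size=(8, 8))     # stand-in for a normalized image

x_early, _ = add_noise(x0, t=10, alpha_bar=alpha_bar, rng=rng)
x_late, _ = add_noise(x0, t=999, alpha_bar=alpha_bar, rng=rng)

# Correlation with the original image fades as t grows.
corr_early = np.corrcoef(x0.ravel(), x_early.ravel())[0, 1]
corr_late = np.corrcoef(x0.ravel(), x_late.ravel())[0, 1]
print(corr_early, corr_late)
```

After a few steps the image is still close to the original, while at the final step it is almost pure noise. Training teaches a network to predict the added noise so that the process can be run in reverse to generate new images.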

|<img src="https://miro.medium.com/v2/resize:fit:720/format:webp/1*OdEIh5K6qkHSzrPFk5Oa1g.jpeg" alt="Semantic segmentation vs. Instance segmentation" width="800">| <img src="https://www.mlwires.com/wp-content/uploads/2024/07/Depth-Anything-V2-vs-Marigold-1024x713.jpg" alt="Single Image Depth Estimation" width="410">| <img src="https://media.licdn.com/dms/image/v2/D4D12AQGF_t_di8taoA/article-cover_image-shrink_600_2000/article-cover_image-shrink_600_2000/0/1695617252545?e=1762387200&v=beta&t=BlSM_RMHwZ4TqG1D9KQL5yslMFECW3j2sDJUZtNiC54" alt="Image Super Resolution" width="1400"> |
|:--:|:--:|:--:|
| *Figure 10*. An example of semantic segmentation and instance segmentation. Adapted from "Medium," *Semantic Segmentation Vs Instance Segmentation*. Retrieved October 11, 2025, from https://medium.com/@saluem/semantic-segmentation-vs-instance-segmentation-1422d322936b | *Figure 11*. Examples of single image depth estimation with two different models. Adapted from "MLWIRES," *Comparison between Marigold and Depth Anything V2 in open-world images*. Retrieved October 11, 2025, from https://www.mlwires.com/depth-anything-v2-a-highly-capable-depth-estimation-model/  | *Figure 12*. Example of image super-resolution. Adapted from "LinkedIn," *Unlocking Clarity: Revolutionizing Image Super-Resolution with Shallow Neural Networks*. Retrieved October 11, 2025, from https://www.linkedin.com/pulse/unlocking-clarity-revolutionizing-image-shallow-neural-shubham-rout/|
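
The difference between semantic and instance segmentation shows up clearly in the label masks each one produces. The masks below are hand-written for a hypothetical 4x6 scene containing two cats; they are not the output of any model.

```python
import numpy as np

# Semantic mask: one id per category (0 = background, 1 = cat).
semantic = np.array([
    [0, 1, 1, 0, 0, 0],
    [0, 1, 1, 0, 1, 1],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 0, 0, 0],
])

# Instance mask: one id per object (0 = background, 1 = first cat, 2 = second cat).
instance = np.array([
    [0, 1, 1, 0, 0, 0],
    [0, 1, 1, 0, 2, 2],
    [0, 0, 0, 0, 2, 2],
    [0, 0, 0, 0, 0, 0],
])

# Semantic segmentation can only say how many pixels are "cat".
cat_pixels = int(np.sum(semantic == 1))

# Instance segmentation can also say which pixels belong to which cat.
pixels_per_cat = {
    int(i): int(np.sum(instance == i)) for i in np.unique(instance) if i != 0
}
print(cat_pixels)
print(pixels_per_cat)
```

Both masks mark the same pixels as cat, but only the instance mask allows the two animals to be counted and handled separately.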

The success of computer vision depended on advances in **imaging and optics** to provide sufficient visual data for learning. The development of better **data generation tools**, such as high-quality and accessible cameras, made it possible to gather large quantities of images. Equally important were advancements in **computing power**, including high-speed internet and GPUs. GPUs were important because they could process large amounts of data quickly and in parallel. **CUDA**, a general-purpose programming platform for GPUs developed by NVIDIA, further accelerated this progress by allowing developers to harness the full computational power of GPUs. The combination of larger image datasets and higher computing power led to the rapid advancement of modern computer vision (Chen, 2025).

---

The keywords that were omitted were image prediction, image stabilization, low level-space, AI perception, super-human vision, multi-view imaging, and stable fusion models.

- I chose not to include the keyword "image prediction" because I think it was referring to "classification", the process of predicting an object's class in an image, which I've addressed in the detailed summary.

- I chose not to include the keyword "image stabilization" because I think it was referring to "image super-resolution", converting a low-resolution or blurry image into a high-resolution image, which I've addressed in the detailed summary.

- I chose not to include the keyword "low level-space" because I think it was referring to "lower dimensional space", an image manifold, which I've addressed in the detailed summary.

- I chose not to include the keywords "AI perception" and "super-human vision" because I think they were referring to "super human-level perception", the concept of allowing computers to interpret images in a way that matches or exceeds human capabilities, which I've addressed in the detailed summary.

- I chose not to include the keyword "multi-view imaging" because I think it was referring to "multi-view stereo", taking multiple images from different viewpoints to infer a 3D model of a scene, which I've addressed in the detailed summary.

- I chose not to include the keyword "stable fusion models" because I think it was referring to "stable diffusion", which I've addressed in the detailed summary.

---

## Reflection

The ultimate goal of computer vision is to achieve (super) human-level perception that can enhance and expand the capabilities of humans. It creates a partnership where machines take on tasks that need speed, accuracy, and large-scale processing, while humans provide judgment, creativity, and ethical oversight. Over the past decades, computer vision has progressed from performing basic image recognition to complex tasks like semantic and instance segmentation. Problems that were once considered difficult are now embedded in applications people use daily. This progress was made possible through the technological advancements of affordable, high-quality cameras and powerful hardware like GPUs that can process large amounts of image data. 

Today, computer vision is deeply integrated into daily life in both visible and unseen ways. It enables smartphones to recognize faces and translate text, assists doctors in finding diseases earlier, and helps make roads safer with driver-assistance systems. Instead of reducing human capabilities, computer vision strengthens them, giving people new ways to see and understand the world while improving safety, accessibility, and efficiency in many aspects of modern society.

Despite these achievements, many challenges remain. Current models often struggle to perform consistently under changes in lighting, viewing angle, and occlusion, and that consistency is essential for applications like self-driving cars and security systems. Another major challenge is the need for large amounts of labeled data to train these systems, which is expensive and time-consuming to collect, especially in fields like healthcare and industrial inspection. Ethical concerns also play a big role in computer vision. It is important to make sure these systems work fairly for everyone, reduce bias, and remain transparent about how they make decisions.

Just as computer vision relied on many breakthroughs to reach its current state, solving today’s challenges will also require more research and innovation. Progress in hardware, algorithms, data, and theory must work together to create systems that are powerful, reliable, transparent, and fair. These challenges will continue to inspire discovery and improvement, guiding the development of the next era of computer vision.

---

## Citations

Activeloop. (n.d.). *Manifold Learning*. Activeloop. https://www.activeloop.ai/resources/glossary/manifold-learning/

Albao, C. (2023, June 18). *The final frame: Navigating the realms of SIFT, template matching, and RANSAC in our final image processing encounter*. Medium. https://medium.com/@chingalbao/the-final-frame-navigating-the-realms-of-sift-template-matching-and-ransac-in-our-final-image-4dde17fa8bee 

Amazon Web Services. (n.d.a). *What are Transformers in Artificial Intelligence?*. aws. https://aws.amazon.com/what-is/transformers-in-artificial-intelligence/ 

Amazon Web Services. (n.d.b). *What is Stable Diffusion?*. aws. https://aws.amazon.com/what-is/stable-diffusion/ 

ApX Machine Learning. (2025). *Digital Image Representation*. ApX. https://apxml.com/courses/introduction-to-computer-vision/chapter-2-digital-image-fundamentals/digital-image-representation 

Belcic, I., & Stryker, C. (2025, August 21). *What are model parameters?*. IBM. https://www.ibm.com/think/topics/model-parameters 

Boesch, G. (2023, November 25). *Vision Transformers (ViT) in Image Recognition*. viso.ai. https://viso.ai/deep-learning/vision-transformer-vit/ 

*Challenges* [Online image]. CS231N deep learning for computer vision. https://cs231n.github.io/classification/#summary-applying-knn-in-practice 

Chen, H. G. (2025, October 6). *Popular Models and Algorithms in Computer Vision* [PowerPoint slides]. College of Natural Sciences, University of Hawaii at Manoa. https://vimeo.com/1126338536 

*Color image channels* [Online image]. Medium. https://medium.com/@mjbharmal2002/gray-scaling-with-the-algorithms-b83f87975885

*Comparison between Marigold and Depth Anything V2 in open-world images* [Online image]. MLWIRES. https://www.mlwires.com/depth-anything-v2-a-highly-capable-depth-estimation-model/

Coursera Inc. (n.d.). *Semantic segmentation vs. instance segmentation: What’s the difference?* Coursera. https://www.coursera.org/articles/semantic-segmentation-vs-instance-segmentation 

Coursera Inc. (2025, May 27). *What are diffusion models?*. Coursera. https://www.coursera.org/articles/diffusion-models 

*Elgendy* [Online image]. Deep Learning for Vision Systems. https://livebook.manning.com/book/deep-learning-for-vision-systems/chapter-5/

Google. (n.d.). *What is a convolutional neural network?*. Google. https://cloud.google.com/discover/what-are-convolutional-neural-networks 

*Grayscale Pixel Value* [Online image]. Image Filtering. https://ai.stanford.edu/~syyeung/cvweb/tutorial1.html

IBM. (2025, July 22). *What is LIDAR?*. IBM. https://www.ibm.com/think/topics/lidar

IBM. (2025, September 29). *What are convolutional neural networks?*. IBM. https://www.ibm.com/think/topics/convolutional-neural-networks 

*Image Stitching to create a Panorama* [Online image]. Medium. https://medium.com/@navekshasood/image-stitching-to-create-a-panorama-5e030ecc8f7

Intel Corporation. (n.d.). *What is computer vision?*. Intel. https://www.intel.com/content/www/us/en/learn/what-is-computer-vision.html 

Jain, A. (2024, November 2). *Deep learning architecture 2 : Alexnet*. Medium. https://medium.com/@abhishekjainindore24/deep-learning-architecture-2-alexnet-8018f7640161 

*messi_face* [Online image]. OpenCV. https://docs.opencv.org/4.x/d4/dc6/tutorial_py_template_matching.html

Metaeyes. (2024, June 10). *What are the three levels of Computer Vision?*. metaeye. https://www.metaeye.co.uk/what-are-the-three-levels-of-computer-vision

Milvus. (n.d.). *What is the parallax effect in the computer vision?*. Milvus. https://milvus.io/ai-quick-reference/what-is-the-parallax-effect-in-the-computer-vision  

OpenCV. (n.d.). *Template Matching*. OpenCV. https://docs.opencv.org/4.x/d4/dc6/tutorial_py_template_matching.html 

*Optical Flow Tracking Grid and its use for Real-Time Object Detection* [Online image]. Vision AI. https://www.youtube.com/watch?v=LjjJQ81RbX0

Pandit, B. (2024, August 29). *Vision language models (vlms) explained*. DataCamp. https://www.datacamp.com/blog/vlms-ai-vision-language-models 

Petru, P. (2022, October 21). *What is an autoencoder?*. Roboflow Blog. https://blog.roboflow.com/what-is-an-autoencoder-computer-vision/ 

Riswanto, U. (2024, October 2). *Why manifold learning is the future of image recognition*. Medium. https://ujangriswanto08.medium.com/why-manifold-learning-is-the-future-of-image-recognition-4915dc9cbfed 

*Semantic Segmentation Vs Instance Segmentation* [Online image]. Medium. https://medium.com/@saluem/semantic-segmentation-vs-instance-segmentation-1422d322936b

*SIFT Features* [Online image]. Medium. https://medium.com/@deepanshut041/introduction-to-sift-scale-invariant-feature-transform-65d7f3a72d40

*Structured light scanners* [Online image]. ARC Metrologia. https://www.arc-bg.com/en/253-structured-light-scanners 

*template_ccoeff_1* [Online image]. OpenCV. https://docs.opencv.org/4.x/d4/dc6/tutorial_py_template_matching.html

Terra, J. (2025, February 13). *What is deep learning? models, applications, and examples*. Caltech. https://pg-p.ctme.caltech.edu/blog/ai-ml/what-is-deep-learning 

*Unlocking Clarity: Revolutionizing Image Super-Resolution with Shallow Neural Networks* [Online image]. LinkedIn. https://www.linkedin.com/pulse/unlocking-clarity-revolutionizing-image-shallow-neural-shubham-rout/