
arXiv cs.CV -- Mon, 1 Feb 2021

1.General-Purpose OCR Paragraph Identification by Graph Convolution Networks ⬇️

Paragraphs are an important class of document entities. We propose a new approach to paragraph identification using spatial graph convolutional networks (GCN) applied to OCR text boxes. Two steps, line splitting and line clustering, extract paragraphs from the lines in OCR results. Each step uses a beta-skeleton graph constructed from bounding boxes, whose edges provide efficient support for graph convolution operations. Using only pure layout input features, the GCN model is 3 to 4 orders of magnitude smaller than R-CNN based models, while achieving comparable or better accuracy on PubLayNet and other datasets. Furthermore, the GCN models generalize well from synthetic training data to real-world images and adapt well to variable document styles.
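
The beta-skeleton construction is concrete enough to sketch. Below is a minimal, illustrative version for beta = 1 (the Gabriel graph, one member of the beta-skeleton family) over OCR box centers; the function name and test data are our assumptions, not the paper's code.

```python
# Gabriel graph (beta-skeleton with beta = 1): connect i and j iff no third
# point falls inside the circle whose diameter is the segment (i, j).
import numpy as np

def gabriel_edges(points: np.ndarray) -> list[tuple[int, int]]:
    n = len(points)
    edges = []
    for i in range(n):
        for j in range(i + 1, n):
            center = (points[i] + points[j]) / 2
            radius2 = np.sum((points[i] - points[j]) ** 2) / 4
            blocked = any(
                k not in (i, j) and np.sum((points[k] - center) ** 2) < radius2
                for k in range(n)
            )
            if not blocked:
                edges.append((i, j))
    return edges

# Centers of a few OCR text boxes (x, y); the edges then support graph convolutions.
centers = np.array([[0.0, 0.0], [1.0, 0.1], [2.1, 0.0], [0.1, 1.5]])
print(gabriel_edges(centers))
```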

2.Surprisingly Simple Semi-Supervised Domain Adaptation with Pretraining and Consistency ⬇️

Visual domain adaptation involves learning to classify images from a target visual domain using labels available in a different source domain. A range of prior work uses adversarial domain alignment to try to learn a domain-invariant feature space, where a good source classifier can perform well on target data. This, however, can lead to errors where class A features in the target domain get aligned to class B features in the source. We show that in the presence of a few target labels, simple techniques like self-supervision (via rotation prediction) and consistency regularization can be effective without any adversarial alignment in learning a good target classifier. Our Pretraining and Consistency (PAC) approach achieves state-of-the-art accuracy on this semi-supervised domain adaptation task, surpassing multiple adversarial domain alignment methods across multiple datasets. Notably, it outperforms all recent approaches by 3-5% on the large and challenging DomainNet benchmark, showing the strength of these simple techniques in fixing errors made by adversarial alignment.
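
For intuition, here is a hedged sketch of the two ingredients the abstract combines: a rotation-prediction pretext loss for pretraining and a consistency loss between weakly and strongly augmented views. The FixMatch-style thresholding and all module names are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn.functional as F

def rotation_loss(backbone, rot_head, x):
    """Self-supervision: predict which of 4 rotations was applied."""
    rots = torch.cat([torch.rot90(x, k, dims=(2, 3)) for k in range(4)])
    labels = torch.arange(4, device=x.device).repeat_interleave(x.size(0))
    return F.cross_entropy(rot_head(backbone(rots)), labels)

def consistency_loss(model, x_weak, x_strong, threshold=0.9):
    """Pseudo-label confident predictions on the weak view and make the
    strong view match them (a FixMatch-style consistency term, assumed here)."""
    with torch.no_grad():
        probs = F.softmax(model(x_weak), dim=1)
        conf, pseudo = probs.max(dim=1)
    loss = F.cross_entropy(model(x_strong), pseudo, reduction="none")
    return (loss * (conf >= threshold).float()).mean()
```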

3.Gaining Scale Invariance in UAV Bird's Eye View Object Detection by Adaptive Resizing ⬇️

In this work, we introduce a new preprocessing step applicable to UAV bird's eye view imagery, which we call Adaptive Resizing. It is designed to compensate for the vast variance in object scale that is naturally inherent to UAV data sets, and it improves inference speed by four to five times on average. We test this extensively on UAVDT, VisDrone, and a new data set we captured ourselves. On UAVDT, we achieve more than 100% relative improvement in AP50. Moreover, we show how this method can be applied to a general UAV object detection task. Additionally, we successfully test our method on a domain transfer task, where we train on one interval of altitudes and test on a different one. Code will be made available at our website.
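
The abstract does not spell out how the resizing factor is computed; one plausible instantiation (purely our assumption) scales each frame by the ratio of the current altitude to a reference altitude, so that objects appear at a roughly constant scale.

```python
import cv2  # using OpenCV's resize; any image library would do

def adaptive_resize(img, altitude_m, reference_altitude_m=50.0):
    """Hypothetical adaptive resizing: objects shrink roughly linearly with
    altitude, so frames from higher altitudes are scaled up (and lower ones
    down) toward a common apparent object scale."""
    scale = altitude_m / reference_altitude_m
    h, w = img.shape[:2]
    return cv2.resize(img, (int(w * scale), int(h * scale)))
```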

4.Towards Generalising Neural Implicit Representations ⬇️

Neural implicit representations have shown substantial improvements in efficiently storing 3D data compared to conventional formats. However, existing work has mainly focused on storage and subsequent reconstruction. In this work, we argue that training neural representations for reconstruction tasks alongside conventional tasks can produce more general encodings that admit reconstructions of equal quality to single-task training, while providing improved results on conventional tasks compared to single-task encodings. Through multi-task experiments on reconstruction, classification, and segmentation, our approach learns feature-rich encodings that produce high-quality results for each task. We also reformulate the segmentation task, creating a more representative challenge for implicit representation contexts.
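
As a rough illustration of the setup, the sketch below trains a single coordinate network with one reconstruction head and one conventional-task head; all sizes, head choices, and the dummy labels are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTaskImplicit(nn.Module):
    def __init__(self, hidden=128, n_parts=8):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.occupancy = nn.Linear(hidden, 1)           # reconstruction head
        self.segmentation = nn.Linear(hidden, n_parts)  # conventional-task head

    def forward(self, xyz):                             # xyz: (N, 3) query points
        h = self.trunk(xyz)
        return self.occupancy(h).squeeze(1), self.segmentation(h)

model = MultiTaskImplicit()
occ, seg = model(torch.rand(1024, 3))
# Joint objective: reconstruction plus a conventional task (dummy labels here).
loss = (F.binary_cross_entropy_with_logits(occ, torch.rand(1024).round())
        + F.cross_entropy(seg, torch.randint(0, 8, (1024,))))
```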

5.Leveraging domain labels for object detection from UAVs ⬇️

Object detection from Unmanned Aerial Vehicles (UAVs) is of great importance in many aerial vision-based applications. Despite the great success of generic object detection methods, a large performance drop is observed when they are applied to images captured by UAVs. This is due to large variations in imaging conditions, such as varying altitudes, dynamically changing viewing angles, and different capture times. We demonstrate that domain knowledge is a valuable source of information and thus propose domain-aware object detectors that use freely accessible sensor data. By splitting the model into cross-domain and domain-specific parts, substantial performance improvements are achieved on multiple datasets across multiple models and metrics. In particular, we achieve a new state-of-the-art performance on UAVDT for real-time detectors. Furthermore, we create a new airborne image dataset by annotating 13,713 objects in 2,900 images featuring precise altitude and viewing angle annotations.
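
One hedged way to picture the cross-domain/domain-specific split: a shared backbone feeding per-domain heads selected by freely available sensor metadata such as altitude. The bin boundaries and shapes below are illustrative assumptions, not the paper's design.

```python
import torch
import torch.nn as nn

class DomainAwareModel(nn.Module):
    def __init__(self, feat_dim=256, n_domains=3, n_outputs=4):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU())  # cross-domain part
        self.heads = nn.ModuleList(
            nn.Linear(256, n_outputs) for _ in range(n_domains)  # domain-specific parts
        )
        self.bins = torch.tensor([30.0, 70.0])  # assumed altitude bins (metres)

    def forward(self, feats, altitude_m):
        domains = torch.bucketize(altitude_m, self.bins)  # sensor data -> domain id
        h = self.shared(feats)
        return torch.stack([self.heads[int(d)](f) for d, f in zip(domains, h)])

out = DomainAwareModel()(torch.randn(8, 256), torch.full((8,), 42.0))
```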

6.Polynomial Trajectory Predictions for Improved Learning Performance ⬇️

The rising demand for Active Safety systems in automotive applications stresses the need for reliable short- to mid-term trajectory prediction. By anticipating the unfolding paths of road users, one can act to increase overall safety. In this work, we propose to train artificial neural networks for movement understanding by predicting trajectories in their natural form, as a function of time. Predicting polynomial coefficients allows us to increase accuracy and improve generalisation.
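
The core idea is compact enough to sketch: the network emits polynomial coefficients, and positions are recovered by evaluating the polynomial at arbitrary times. Degree, feature sizes, and names below are assumptions for illustration.

```python
import torch
import torch.nn as nn

DEGREE = 3  # assume a cubic trajectory per spatial axis

class PolyPredictor(nn.Module):
    def __init__(self, in_dim=16, axes=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 64), nn.ReLU(),
            nn.Linear(64, axes * (DEGREE + 1)),
        )
        self.axes = axes

    def forward(self, features, t):
        # coeffs: (batch, axes, DEGREE + 1); t: (steps,) query times
        coeffs = self.net(features).view(-1, self.axes, DEGREE + 1)
        basis = torch.stack([t ** k for k in range(DEGREE + 1)])  # (DEGREE+1, steps)
        return torch.einsum("bak,ks->bas", coeffs, basis)          # positions over time

model = PolyPredictor()
t = torch.linspace(0.0, 3.0, steps=30)  # predict 3 s ahead at any resolution
traj = model(torch.randn(4, 16), t)     # (4, 2, 30); fit by MSE to observed futures
```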

7.Open World Compositional Zero-Shot Learning ⬇️

Compositional Zero-Shot Learning (CZSL) requires recognizing state-object compositions unseen during training. In this work, instead of assuming prior knowledge about the unseen compositions, we operate in the open world setting, where the search space includes a large number of unseen compositions, some of which might be unfeasible. In this setting, we start from the cosine similarity between visual features and compositional embeddings. After estimating the feasibility score of each composition, we use these scores either to directly mask the output space or as a margin for the cosine similarity between visual features and compositional embeddings during training. Our experiments on two standard CZSL benchmarks show that all methods suffer severe performance degradation when applied in the open world setting. While our simple CZSL model achieves state-of-the-art performance in the closed world scenario, our feasibility scores boost the performance of our approach in the open world setting, clearly outperforming the previous state of the art.
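
A minimal sketch of the scoring step described above, with feasibility scores applied either as a hard mask at inference or as a similarity margin during training; the threshold, margin value, and tensor names are assumptions.

```python
import torch
import torch.nn.functional as F

def composition_scores(img_feats, comp_embeds, feasibility, mask=True, margin=0.4):
    """img_feats: (B, D), comp_embeds: (C, D), feasibility: (C,) in [0, 1]."""
    sims = F.normalize(img_feats, dim=1) @ F.normalize(comp_embeds, dim=1).T  # (B, C)
    if mask:
        # Open-world inference: remove compositions judged unfeasible.
        return sims.masked_fill(feasibility < 0.5, float("-inf"))
    # Training: shrink the similarity of less feasible compositions by a margin.
    return sims - margin * (1.0 - feasibility)
```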

8.Few-Shot Learning for Road Object Detection ⬇️

Few-shot learning is a problem of high interest in the evolution of deep learning. In this work, we consider the problem of few-shot object detection (FSOD) in a real-world, class-imbalanced scenario. For our experiments, we utilize the India Driving Dataset (IDD), as it includes a class of less-occurring road objects and hence provides a setup suitable for few-shot learning. We evaluate both metric-learning and meta-learning based FSOD methods in two experimental settings: (i) representative (same-domain) splits from IDD, which evaluate the ability of a model to learn in the context of road images, and (ii) object classes with less frequently occurring samples, similar to the open-set setting in the real world. From our experiments, we demonstrate that the metric-learning method outperforms meta-learning on the novel classes by (i) 11.2 mAP points on the same domain and (ii) 1.0 mAP point on the open set. We also show that our extension of object classes in a real-world open dataset offers rich ground for few-shot learning studies.

9.Complementary Pseudo Labels For Unsupervised Domain Adaptation On Person Re-identification ⬇️

In recent years, supervised person re-identification (re-ID) models have received increasing attention. However, these models, trained on a source domain, suffer a dramatic performance drop when tested on an unseen domain. Existing methods primarily use pseudo labels to alleviate this problem. One of the most successful approaches predicts neighbors of each unlabeled image and then uses them to train the model. Although the predicted neighbors are credible, they always miss some hard positive samples, which may hinder the model from discovering important discriminative information of the unlabeled domain. In this paper, to complement these low-recall neighbor pseudo labels, we propose a joint learning framework to learn better feature embeddings via high-precision neighbor pseudo labels and high-recall group pseudo labels. The group pseudo labels are generated by transitively merging neighbors of different samples into a group to achieve higher recall. However, the merging operation may cause subgroups within a group due to imperfect neighbor predictions. To utilize these group pseudo labels properly, we propose a similarity-aggregating loss that mitigates the influence of these subgroups by pulling the input sample towards the most similar embeddings. Extensive experiments on three large-scale datasets demonstrate that our method achieves state-of-the-art performance under the unsupervised domain adaptation re-ID setting.
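
Two of the named ingredients lend themselves to a hedged sketch: transitive merging of neighbor sets into groups, and a similarity-aggregating loss that pulls a sample toward the most similar embeddings in its group. The exact formulation below is our assumption, not the authors' code.

```python
import torch
import torch.nn.functional as F

def merge_groups(neighbors):
    """Union-find over per-sample neighbor lists -> group id per sample."""
    parent = list(range(len(neighbors)))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i
    for i, nbrs in enumerate(neighbors):
        for j in nbrs:
            parent[find(j)] = find(i)      # transitively merge i's neighbors
    return [find(i) for i in range(len(neighbors))]

def similarity_aggregating_loss(feats, group_ids, anchor, top_k=4):
    """Pull `anchor` toward its most similar group members only."""
    same = [i for i, g in enumerate(group_ids)
            if g == group_ids[anchor] and i != anchor]
    sims = F.cosine_similarity(feats[anchor : anchor + 1], feats[same])
    top = sims.topk(min(top_k, len(same))).values
    return (1.0 - top).mean()

groups = merge_groups([[1], [0, 2], [1], [4], [3]])  # -> two groups
```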

10.Efficient-CapsNet: Capsule Network with Self-Attention Routing ⬇️

Deep convolutional neural networks, assisted by architectural design strategies, make extensive use of data augmentation techniques and layers with a high number of feature maps to embed object transformations. That is highly inefficient, and for large datasets it implies a massive redundancy of feature detectors. Even though capsule networks are still in their infancy, they constitute a promising solution to extend current convolutional networks and endow artificial visual perception with a process to encode all feature affine transformations more efficiently. Indeed, a properly working capsule network should theoretically achieve better results with a considerably lower parameter count, due to its intrinsic capability to generalize to novel viewpoints. Nevertheless, little attention has been given to this relevant aspect. In this paper, we investigate the efficiency of capsule networks and, pushing their capacity to the limits with an extreme architecture of barely 160K parameters, we show that the proposed architecture is still able to achieve state-of-the-art results on three different datasets with only 2% of the original CapsNet parameters. Moreover, we replace dynamic routing with a novel non-iterative, highly parallelizable routing algorithm that can easily cope with a reduced number of capsules. Extensive experimentation with other capsule implementations has proved the effectiveness of our methodology and the capability of capsule networks to efficiently embed visual representations that are more amenable to generalization.
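
The abstract does not detail the routing algorithm; the sketch below shows one plausible non-iterative, attention-style routing step in which agreement between prediction vectors is scored by scaled dot products and converted into coupling coefficients in a single pass. It is a reading of the abstract, not the authors' exact algorithm.

```python
import torch

def self_attention_routing(votes):
    """votes: (B, n_in, n_out, d) prediction vectors from input capsules."""
    b, n_in, n_out, d = votes.shape
    v = votes.permute(0, 2, 1, 3)                      # (B, n_out, n_in, d)
    attn = v @ v.transpose(-1, -2) / d ** 0.5          # pairwise agreement scores
    coupling = attn.sum(dim=-1, keepdim=True).softmax(dim=-2)  # over input capsules
    return (coupling * v).sum(dim=-2)                  # (B, n_out, d) output capsules

out = self_attention_routing(torch.randn(2, 8, 4, 16))
```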

11.Self-Supervised Representation Learning for RGB-D Salient Object Detection ⬇️

Existing CNN-based RGB-D Salient Object Detection (SOD) networks all require pre-training on ImageNet to learn hierarchical features that provide a good initialization. However, the collection and annotation of large-scale datasets are time-consuming and expensive. In this paper, we utilize Self-Supervised Representation Learning (SSL) to design two pretext tasks: cross-modal auto-encoding and depth-contour estimation. Our pretext tasks require only a small amount of unlabeled RGB-D data for pre-training, which makes the network capture rich semantic contexts and reduces the gap between the two modalities, thereby providing an effective initialization for the downstream task. In addition, for the inherent problem of cross-modal fusion in RGB-D SOD, we propose a multi-path fusion (MPF) module that splits a single feature fusion into multi-path fusion to achieve an adequate perception of consistent and differential information. The MPF module is general and suitable for both cross-modal and cross-level feature fusion. Extensive experiments on six benchmark RGB-D SOD datasets show that our model, pre-trained on an RGB-D dataset of 6,335 images without any annotations, performs favorably against most state-of-the-art RGB-D methods pre-trained on ImageNet (1,280,000 images with image-level annotations).
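
The multi-path idea can be pictured with a hedged sketch: rather than a single fusion operation, the two modalities are combined along several paths and the results merged. The specific paths below (sum, product, concatenation) are assumptions, not the paper's MPF design.

```python
import torch
import torch.nn as nn

class MultiPathFusion(nn.Module):
    def __init__(self, c):
        super().__init__()
        self.reduce = nn.Conv2d(2 * c, c, 1)
        self.merge = nn.Conv2d(3 * c, c, 1)

    def forward(self, rgb, depth):
        paths = [
            rgb + depth,                                   # consistent information
            rgb * depth,                                   # mutually confirming responses
            self.reduce(torch.cat([rgb, depth], dim=1)),   # differential information
        ]
        return self.merge(torch.cat(paths, dim=1))

out = MultiPathFusion(64)(torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32))
```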

12.Neural networks for semantic segmentation of historical city maps: Cross-cultural performance and the impact of figurative diversity ⬇️

In this work, we present a new semantic segmentation model for historical city maps that surpasses the state of the art in terms of flexibility and performance. Research in automatic map processing has largely focused on homogeneous corpora or even individual maps, leading to inflexible algorithms. Recently, convolutional neural networks have opened new perspectives for the development of more generic tools. Based on two new map corpora, the first centered on Paris and the second gathering cities from all over the world, we first propose a method for operationalizing figuration based on traditional computer vision algorithms, which allows large-scale quantitative analysis. In a second step, we propose a semantic segmentation model based on neural networks and implement several improvements. Finally, we analyze the impact of map figuration on segmentation performance and evaluate future ways to improve the representational flexibility of neural networks. We conclude by showing that these networks are able to efficiently segment map data of very large figurative diversity.

13.The Mind's Eye: Visualizing Class-Agnostic Features of CNNs ⬇️

Visual interpretability of Convolutional Neural Networks (CNNs) has gained significant popularity because of the great challenges that CNN complexity imposes on understanding their inner workings. Although many techniques have been proposed to visualize class features of CNNs, most of them do not provide a correspondence between inputs and the extracted features in specific layers. This prevents discovering the stimuli to which each layer responds most strongly. We propose an approach to visually interpret CNN features given a set of images by creating corresponding images that depict the most informative features of a specific layer. Exploring features in this class-agnostic manner allows for a greater focus on the feature extractor of CNNs. Our method uses a dual-objective activation maximization and distance minimization loss, without requiring a generator network or modifications to the original model. This limits the number of FLOPs to that of the original network. We demonstrate the visualization quality on widely used architectures.
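
A minimal sketch of the dual objective: optimize a synthetic image to maximize a chosen layer's activations while keeping those activations close to the features of a reference image. The layer choice, loss weights, and step count are assumptions (and pretrained weights should be loaded in practice).

```python
import torch
import torchvision.models as models

cnn = models.resnet18(weights=None).eval()  # load pretrained weights in practice
layer = cnn.layer3                          # layer whose features we want to depict

feats = {}
layer.register_forward_hook(lambda m, i, o: feats.update(out=o))

image = torch.randn(1, 3, 224, 224)         # reference input
cnn(image)
target = feats["out"].detach()              # features the synth image should match

synth = torch.randn(1, 3, 224, 224, requires_grad=True)
opt = torch.optim.Adam([synth], lr=0.05)
for _ in range(200):
    opt.zero_grad()
    cnn(synth)
    act = feats["out"]
    # Dual objective: maximize activation, minimize distance to the reference.
    loss = -act.norm() + 0.1 * (act - target).pow(2).mean()
    loss.backward()
    opt.step()
```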

14.Spatiotemporal Dilated Convolution with Uncertain Matching for Video-based Crowd Estimation ⬇️

In this paper, we propose a novel SpatioTemporal convolutional Dense Network (STDNet) to address the video-based crowd counting problem. STDNet combines a decomposition of 3D convolution with 3D spatiotemporal dilated dense convolution to alleviate the rapid growth in model size caused by Conv3D layers. Moreover, since dilated convolution extracts multiscale features, we combine it with a channel attention block to enhance the feature representations. Because labeling crowds is difficult, especially in videos, imprecise or standard-inconsistent labels may lead to poor convergence of the model. To address this issue, we further propose a new patch-wise regression loss (PRL) to improve on the original pixel-wise loss. Experimental results on three video-based benchmarks, i.e., the UCSD, Mall, and WorldExpo'10 datasets, show that STDNet outperforms both image- and video-based state-of-the-art methods. The source codes are released at \url{this https URL}.
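
A patch-wise regression loss is easy to sketch under one common reading: compare patch-level density sums rather than individual pixels, which is more tolerant of imprecise point annotations. The patch size and formulation are assumptions, not necessarily the paper's PRL.

```python
import torch
import torch.nn.functional as F

def patch_wise_regression_loss(pred, target, patch=8):
    """pred, target: (B, 1, H, W) crowd density maps."""
    # Summing within a patch = average pooling times the patch area.
    pred_patches = F.avg_pool2d(pred, patch) * patch * patch
    target_patches = F.avg_pool2d(target, patch) * patch * patch
    return F.mse_loss(pred_patches, target_patches)

loss = patch_wise_regression_loss(torch.rand(2, 1, 64, 64), torch.rand(2, 1, 64, 64))
```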

15.Multi-Threshold Attention U-Net (MTAU) based Model for Multimodal Brain Tumor Segmentation in MRI scans ⬇️

Gliomas are among the most frequent brain tumors and are classified into high-grade and low-grade gliomas. The segmentation of various regions, such as the tumor core and the enhancing tumor, plays an important role in determining severity and prognosis. Here, we have developed a multi-threshold model based on attention U-Net for identifying various regions of the tumor in magnetic resonance imaging (MRI). We propose a multi-path segmentation approach and build three separate models for the different regions of interest. The proposed model achieved mean Dice coefficients of 0.59, 0.72, and 0.61 for enhancing tumor, whole tumor, and tumor core, respectively, on the training dataset. The same model gave mean Dice coefficients of 0.57, 0.73, and 0.61 on the validation dataset and 0.59, 0.72, and 0.57 on the test dataset.

16.[Re] Learning Memory-Guided Normality for Anomaly Detection ⬇️

The authors have introduced a novel method for unsupervised anomaly detection that utilises a newly introduced Memory Module. We validate the authors' claim that this improves performance by helping the network learn prototypical patterns and by using the learnt memory to reduce the representation capacity of Convolutional Neural Networks. Further, we validate the efficacy of the two losses introduced by the authors, the Separateness Loss and the Compactness Loss, which are presented to increase the discriminative power of the memory items and the deeply learned features. We test this efficacy with the help of t-SNE plots of the memory items.
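
Under the usual reading of the original paper, the two losses can be sketched as follows: compactness pulls a query toward its nearest memory item, while separateness pushes the nearest and second-nearest items apart with a margin. Shapes and the margin value are assumptions.

```python
import torch
import torch.nn.functional as F

def memory_losses(queries, memory, margin=1.0):
    """queries: (N, D) encoder features, memory: (M, D) learnable memory items."""
    d = torch.cdist(queries, memory)          # (N, M) pairwise distances
    two = d.topk(2, largest=False).values     # nearest and second-nearest items
    compactness = two[:, 0].pow(2).mean()     # pull query to nearest item
    separateness = F.relu(two[:, 0] - two[:, 1] + margin).mean()  # triplet-style push
    return compactness, separateness

c, s = memory_losses(torch.randn(32, 64), torch.randn(10, 64))
```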

17.NeMo: Neural Mesh Models of Contrastive Features for Robust 3D Pose Estimation ⬇️

3D pose estimation is a challenging but important task in computer vision. In this work, we show that standard deep learning approaches to 3D pose estimation are not robust when objects are partially occluded or viewed from a previously unseen pose. Inspired by the robustness of generative vision models to partial occlusion, we propose to integrate deep neural networks with 3D generative representations of objects into a unified neural architecture that we term NeMo. In particular, NeMo learns a generative model of neural feature activations at each vertex on a dense 3D mesh. Using differentiable rendering we estimate the 3D object pose by minimizing the reconstruction error between NeMo and the feature representation of the target image. To avoid local optima in the reconstruction loss, we train the feature extractor to maximize the distance between the individual feature representations on the mesh using contrastive learning. Our extensive experiments on PASCAL3D+, occluded-PASCAL3D+ and ObjectNet3D show that NeMo is much more robust to partial occlusion and unseen pose compared to standard deep networks, while retaining competitive performance on regular data. Interestingly, our experiments also show that NeMo performs reasonably well even when the mesh representation only crudely approximates the true object geometry with a cuboid, hence revealing that the detailed 3D geometry is not needed for accurate 3D pose estimation. The code is publicly available at this https URL.

18.Deep Triplet Hashing Network for Case-based Medical Image Retrieval ⬇️

Deep hashing methods have been shown to be the most efficient approximate nearest neighbor search techniques for large-scale image retrieval. However, existing deep hashing methods have poor small-sample ranking performance for case-based medical image retrieval: the top-ranked images in the returned query results may be of a different class than the query image. This ranking problem is caused by the loss of classification, region-of-interest (ROI), and small-sample information in the hashing space. To address it, we propose an end-to-end framework, called the Attention-based Triplet Hashing (ATH) network, to learn low-dimensional hash codes that preserve the classification, ROI, and small-sample information. We embed a spatial-attention module into the network structure of our ATH to focus on ROI information. The spatial-attention module aggregates the spatial information of feature maps by jointly utilizing max-pooling, element-wise maximum, and element-wise mean operations along the channel axis. A triplet cross-entropy loss helps map the classification information of images and the similarity between images into the hash codes. Extensive experiments on two case-based medical datasets demonstrate that our proposed ATH improves retrieval performance compared to state-of-the-art deep hashing methods and boosts ranking performance for small samples. Compared to other loss functions, the triplet cross-entropy loss enhances classification performance and hash-code discriminability.
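
The described channel-axis aggregation resembles a CBAM-style spatial attention block, sketched below under that assumption; the kernel size and the exact set of pooling operations are illustrative.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    def __init__(self, kernel=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel, padding=kernel // 2)

    def forward(self, x):                          # x: (B, C, H, W)
        mean_map = x.mean(dim=1, keepdim=True)     # element-wise mean over channels
        max_map = x.amax(dim=1, keepdim=True)      # element-wise max over channels
        attn = torch.sigmoid(self.conv(torch.cat([mean_map, max_map], dim=1)))
        return x * attn                            # re-weight features toward the ROI

out = SpatialAttention()(torch.randn(2, 64, 32, 32))
```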

19.Position, Padding and Predictions: A Deeper Look at Position Information in CNNs ⬇️

In contrast to fully connected networks, Convolutional Neural Networks (CNNs) achieve efficiency by learning weights associated with local filters of finite spatial extent. An implication of this is that a filter may know what it is looking at, but not where it is positioned in the image. In this paper, we first test this hypothesis and reveal that a surprising degree of absolute position information is encoded in commonly used CNNs. We show that zero padding drives CNNs to encode position information in their internal representations, while a lack of padding precludes position encoding. This gives rise to deeper questions about the role of position information in CNNs: (i) What boundary heuristics enable optimal position encoding for downstream tasks? (ii) Does position encoding affect the learning of semantic representations? (iii) Does position encoding always improve performance? To provide answers, we perform the largest case study to date on the role that padding and border heuristics play in CNNs. We design novel tasks which allow us to quantify boundary effects as a function of the distance to the border. Numerous semantic objectives reveal the effect of the border on semantic representations. Finally, we demonstrate the implications of these findings on multiple real-world tasks, showing that position information can either help or hurt performance.
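
The padding effect is easy to demonstrate in miniature: with a constant input, a zero-padded convolution produces position-dependent activations near the border, while a padding-free convolution cannot. (With random weights, the border difference holds almost surely.)

```python
import torch
import torch.nn as nn

x = torch.ones(1, 1, 8, 8)                 # constant image: no content cues at all
padded = nn.Conv2d(1, 1, 3, padding=1, bias=False)
unpadded = nn.Conv2d(1, 1, 3, padding=0, bias=False)
unpadded.weight.data = padded.weight.data.clone()  # identical filters

y_pad = padded(x)[0, 0]
y_nopad = unpadded(x)[0, 0]
print(y_pad.var().item() > 1e-8)    # True: border rows/cols differ from interior
print(y_nopad.var().item() < 1e-8)  # True: every output position is identical
```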

20.D3DLO: Deep 3D LiDAR Odometry ⬇️

LiDAR odometry (LO) describes the task of finding an alignment of subsequent LiDAR point clouds. This alignment can be used to estimate the motion of the platform on which the LiDAR sensor is mounted. Currently, the state-of-the-art algorithms on the well-known KITTI Vision Benchmark Suite are non-learning approaches. We propose a network architecture that learns LO by directly processing 3D point clouds. It is trained on the KITTI dataset in an end-to-end manner without the necessity of pre-defining corresponding pairs of points. An evaluation on the KITTI Vision Benchmark Suite shows similar performance to a previously published work, DeepCLR [1], even though our model uses only around 3.56% of its network parameters. Furthermore, plane point extraction is applied, which leads to a marginal performance decrease while simultaneously reducing the input size by up to 50%.

21.Layer-Peeled Model: Toward Understanding Well-Trained Deep Neural Networks ⬇️

In this paper, we introduce the Layer-Peeled Model, a nonconvex yet analytically tractable optimization program, in a quest to better understand deep neural networks that are trained for a sufficiently long time. As the name suggests, this new model is derived by isolating the topmost layer from the remainder of the neural network, followed by imposing certain constraints separately on the two parts. We demonstrate that the Layer-Peeled Model, albeit simple, inherits many characteristics of well-trained neural networks, thereby offering an effective tool for explaining and predicting common empirical patterns of deep learning training. First, when working on class-balanced datasets, we prove that any solution to this model forms a simplex equiangular tight frame, which in part explains the recently discovered phenomenon of neural collapse in deep learning training [PHD20]. Moreover, when moving to the imbalanced case, our analysis of the Layer-Peeled Model reveals a hitherto unknown phenomenon that we term Minority Collapse, which fundamentally limits the performance of deep learning models on the minority classes. In addition, we use the Layer-Peeled Model to gain insights into how to mitigate Minority Collapse. Interestingly, this phenomenon is first predicted by the Layer-Peeled Model before its confirmation by our computational experiments.
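
For reference, the optimization program the abstract describes can be written, under our reading of the notation, roughly as the following constrained problem:

```latex
% Sketch of the Layer-Peeled program (notation is our assumption): the last-layer
% classifier W and the "peeled" features H are optimized jointly under fixed
% norm budgets E_W and E_H, with K classes and n_k examples in class k.
\min_{W,\,H}\ \frac{1}{N}\sum_{k=1}^{K}\sum_{i=1}^{n_k}
  \mathcal{L}\bigl(W h_{k,i},\, y_k\bigr)
\quad\text{s.t.}\quad
\frac{1}{K}\sum_{k=1}^{K}\lVert w_k\rVert_2^2 \le E_W,\qquad
\frac{1}{N}\sum_{k=1}^{K}\sum_{i=1}^{n_k}\lVert h_{k,i}\rVert_2^2 \le E_H.
```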

22.Automated Deep Learning Analysis of Angiography Video Sequences for Coronary Artery Disease ⬇️

The evaluation of obstructions (stenosis) in coronary arteries is currently done by a physician's visual assessment of coronary angiography video sequences. It is laborious and can be susceptible to interobserver variation. Prior studies have attempted to automate this process, but few have demonstrated an integrated suite of algorithms for the end-to-end analysis of angiograms. We report an automated analysis pipeline based on deep learning to rapidly and objectively assess coronary angiograms, highlight coronary vessels of interest, and quantify potential stenosis. We propose a 3-stage automated analysis method consisting of key frame extraction, vessel segmentation, and stenosis measurement. We combine powerful deep learning approaches such as ResNet and U-Net with traditional image processing and geometrical analysis. We trained and tested our algorithms on the Left Anterior Oblique (LAO) view of the right coronary artery (RCA) using anonymized angiograms obtained from a tertiary cardiac institution, then tested the generalizability of our technique on the Right Anterior Oblique (RAO) view. We demonstrate an overall improvement on previous work, with a key frame extraction top-5 precision of 98.4%, a vessel segmentation F1-score of 0.891, and a stenosis measurement Type I error rate of 20.7%.

23.Robust Representation Learning with Feedback for Single Image Deraining ⬇️

A deraining network may be interpreted as a conditional generator: image degradation produced by the deraining network can be attributed to defective embedding features that serve as conditions. Existing image deraining methods usually ignore uncertainty-caused model errors that lower embedding quality, and embed low-quality features into the model directly. In contrast, we replace low-quality features with latent high-quality features. We borrow the spirit of closed-loop feedback from the field of automatic control to obtain these latent high-quality features, and propose a new method for error detection and feature compensation to address model errors. Extensive experiments on benchmark datasets as well as specific real datasets demonstrate the advantage of the proposed method over recent state-of-the-art methods.

24.A Petri Dish for Histopathology Image Analysis ⬇️

With the rise of deep learning, there has been increased interest in using neural networks for histopathology image analysis, a field that investigates the properties of biopsy or resected specimens that are traditionally manually examined under a microscope by pathologists. In histopathology image analysis, however, challenges such as limited data, costly annotation, and processing high-resolution and variable-size images create a high barrier of entry and make it difficult to quickly iterate over model designs.
Throughout scientific history, many significant research directions have leveraged small-scale experimental setups as petri dishes to efficiently evaluate exploratory ideas, which are then validated in large-scale applications. For instance, the Drosophila fruit fly in genetics and MNIST in computer vision are well-known petri dishes. In this paper, we introduce a minimalist histopathology image analysis dataset (MHIST), an analogous petri dish for histopathology image analysis. MHIST is a binary classification dataset of 3,152 fixed-size images of colorectal polyps, each with a gold-standard label determined by the majority vote of seven board-certified gastrointestinal pathologists and an annotator agreement level. MHIST occupies less than 400 MB of disk space, and a ResNet-18 baseline can be trained to convergence on MHIST in just 6 minutes using 3.5 GB of memory on an NVIDIA RTX 3090. As example use cases, we use MHIST to study natural questions such as how dataset size, network depth, transfer learning, and high-disagreement examples affect model performance.
By introducing MHIST, we hope to not only help facilitate the work of current histopathology imaging researchers, but also make histopathology image analysis more accessible to the general computer vision community. Our dataset is available at this https URL.
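
As a hedged sketch of the kind of small-scale baseline the abstract mentions, the snippet below fine-tunes a ResNet-18 for binary classification; the data path and transform choices are assumptions about how one might lay out MHIST locally, not the authors' training script.

```python
import torch
import torch.nn as nn
from torchvision import datasets, models, transforms

tfm = transforms.Compose([transforms.Resize(224), transforms.ToTensor()])
train = datasets.ImageFolder("mhist/train", transform=tfm)   # hypothetical layout
loader = torch.utils.data.DataLoader(train, batch_size=64, shuffle=True)

net = models.resnet18(weights="IMAGENET1K_V1")   # transfer learning from ImageNet
net.fc = nn.Linear(net.fc.in_features, 2)        # two colorectal-polyp classes

opt = torch.optim.Adam(net.parameters(), lr=3e-4)
for epoch in range(5):
    for x, y in loader:
        opt.zero_grad()
        loss = nn.functional.cross_entropy(net(x), y)
        loss.backward()
        opt.step()
```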

25.Reliable COVID-19 Detection Using Chest X-ray Images ⬇️

Coronavirus disease 2019 (COVID-19) has created the need for computer-aided diagnosis with automatic, accurate, and fast algorithms. Recent studies have applied machine learning algorithms for COVID-19 diagnosis over chest X-ray (CXR) images. However, the data scarcity in these studies prevents reliable evaluation, carries the potential of overfitting, and limits the performance of deep networks. Moreover, these networks usually discriminate COVID-19 pneumonia only from healthy subjects or, occasionally, from a limited set of pneumonia types. Thus, there is a need for a robust and accurate COVID-19 detector evaluated over a large CXR dataset. To address this need, we propose a reliable COVID-19 detection network, ReCovNet, which can discriminate COVID-19 pneumonia from 14 different thoracic diseases and healthy subjects. To accomplish this, we have compiled the largest COVID-19 CXR dataset, QaTa-COV19, with 124,616 images including 4,603 COVID-19 samples. The proposed ReCovNet achieved a detection performance with 98.57% sensitivity and 99.77% specificity.