AutoWebCompat - Automatically detect web compatibility issues


The aim of this project is to create a tool that automatically detects web compatibility issues without human intervention.

Collecting screenshots

The project uses Selenium to collect web page screenshots automatically on Firefox and Chrome.

The crawler loads web pages from the URLs on the webcompat.com tracker and tries to reproduce the reported issues by interacting with the elements of the page. As soon as the page is loaded and after every interaction with the elements, the crawler takes a screenshot.

The crawler repeats the same steps in Firefox and Chrome, generating a set of comparable screenshots.

The data/ directory contains the screenshots generated by the crawler (N.B.: this directory is not present in the repository itself, but it will be created automatically after you set up the project as described in the Setup section).

Labeling

Labeling Guide

Training

Now that we have a labeled dataset, we can train a neural network to automatically detect incompatible screenshots. We are currently using a Siamese architecture with different Convolutional Neural Networks, but we are open to testing other ideas.

We plan to employ three training methodologies:

  1. Training from scratch on the entire training set;
  2. Finetuning a network previously pretrained on ImageNet (or other datasets);
  3. Finetuning a network previously pretrained in an unsupervised fashion.

For the unsupervised training, we are using a related problem for which we already have labels (detecting screenshots belonging to the same website). This pre-training can be helpful because we have plenty of data for it (no manual labeling is required), and we can then fine-tune the pre-trained network for our problem of interest.
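As an illustration of the training objective, Siamese networks are commonly trained with a contrastive loss that pulls the embeddings of matching pairs together and pushes mismatched pairs at least a margin apart. The sketch below is plain NumPy, not the project's actual Keras code, and the margin value is an arbitrary choice:

```python
import numpy as np

def contrastive_loss(emb_a, emb_b, same, margin=1.0):
    """Contrastive loss for one pair of embeddings.

    same=1 for a matching pair (any distance is penalized),
    same=0 for a mismatched pair (penalized only within the margin).
    """
    d = np.linalg.norm(emb_a - emb_b)
    return same * d ** 2 + (1 - same) * max(margin - d, 0.0) ** 2

a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])  # Euclidean distance 5

# Matching pair far apart: large loss (5**2 = 25).
print(contrastive_loss(a, b, same=1))
# Mismatched pair already beyond the margin: zero loss.
print(contrastive_loss(a, b, same=0))
```

During training, this loss is averaged over a batch of screenshot pairs, so matching Firefox/Chrome screenshots end up close in embedding space while differing ones are pushed apart.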

Structure of the project

  • The autowebcompat/utils.py module contains some utility functions;
  • The autowebcompat/network.py module contains the neural network and optimizer definitions, along with the loss and accuracy functions;
  • The collect.py script is the crawler that collects screenshots of web pages in different browsers;
  • The label.py script is a utility that helps label pairs of screenshots (are they the same in the two browsers, or are there differences?);
  • The pretrain.py script trains a neural network on the website screenshots for a slightly different problem (for which we know the solution), so that we can reuse the network weights for the training on the actual problem;
  • The train.py script trains the neural network on the website screenshots to detect compat issues;
  • The data_inconsistencies.py script checks the generated screenshots and takes note of any data inconsistency (e.g. screenshots that were taken in Firefox but not in Chrome).

Setup

Python 3 is required.

  • Install Git Large File Storage, either manually or through a package such as git-lfs, if one is available on your system (e.g. via PackageCloud).
  • Clone the repository with submodules: git lfs clone --recurse-submodules REPO_URL
  • Install the dependencies in requirements.txt: pip install -r requirements.txt.
  • Install the dependencies in test-requirements.txt: pip install -r test-requirements.txt.

Training the network

  • The pretrain.py or train.py script can be run to train the neural network, with the following options:

    -network                  Select which network architecture to use
    -optimizer                Select the optimizer to use
    -classification_type      Either Y vs N + D or Y + N vs D
    --early_stopping          (Optional) Stop training when validation accuracy has stopped improving

Communication

Real-time communication for this project happens on Mozilla's IRC network, irc.mozilla.org, in the #webcompat channel.