microsoft/TableSense

TableSense: Spreadsheet table detection with convolutional neural networks

Spreadsheet table detection is the task of detecting all tables on a given sheet and locating their respective ranges. Automatic table detection is a key enabling technique and an initial step in spreadsheet data intelligence. To enable data-driven models, we annotated a large number of table ranges on real spreadsheet data. Our annotations are based on three public datasets (VEnron2, VEUSES, and VFUSE) that are widely used in the spreadsheet domain. To avoid duplicated labeling effort on near-identical spreadsheets, we use the published versions of these datasets, in which SpreadCluster has already clustered similar sheets:

  1. VEnron2 is built on the Enron email archive by SpreadCluster (MSR 2017). It contains 1,609 evolution groups and 12,254 spreadsheets.
  2. VEUSES is built on EUSES by SpreadCluster (MSR 2017). It contains 177 evolution groups and 363 spreadsheets.
  3. VFUSE is built on FUSE by SpreadCluster (MSR 2017). It contains 188 evolution groups and 1,143 spreadsheets.

Note that the WebSheet dataset introduced by TableSense must resolve compliance issues before it can be published, so we first publish annotations for VEnron2, VEUSES, and VFUSE to facilitate ongoing research.

To process the raw Excel files, we first converted them from .xls to .xlsx. Second, we read and extracted features from the converted files using ClosedXML, excluding any file that failed to convert or process. We then selected one file from each cluster and, for files containing multiple sheets, labeled only the first two sheets. Every sheet was labeled and checked by at least two people, and cases on which the annotators disagreed were excluded. In the end we obtained 2,615 tables from 1,645 spreadsheets. Because VEnron2 has the largest number of clusters, it contributes the most annotated table ranges. The annotation schema looks like the following example:

| File name | Sheet name | Training/testing | File folder | Table region 1 | Table region ... |
| --- | --- | --- | --- | --- | --- |
| 1_AGAVE.x | February | training_set | VEnron2\1027 | B3:F5 | ... |
| ... | ... | ... | ... | ... | ... |
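
As a rough illustration of how a row in this schema might be consumed, the Python sketch below turns one annotation row into a small record and converts an A1-style table region such as `B3:F5` into numeric bounds. The names `SheetAnnotation`, `parse_row`, `parse_range`, and `column_to_index` are hypothetical helpers, not part of this repository. The sketch also assumes the row has already been split into fields, because the on-disk format and delimiter of the annotation files are not described here.

```python
import re
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class SheetAnnotation:
    """One annotation row (hypothetical container, not from the repository)."""
    file_name: str
    sheet_name: str
    split: str            # e.g. "training_set"
    file_folder: str      # e.g. "VEnron2\\1027"
    table_regions: List[str]


def column_to_index(letters: str) -> int:
    """Convert Excel column letters to a 1-based index (A -> 1, Z -> 26, AA -> 27)."""
    index = 0
    for ch in letters.upper():
        index = index * 26 + (ord(ch) - ord("A") + 1)
    return index


# Matches simple A1-style ranges such as "B3:F5" (no sheet prefix, no "$" anchors).
_RANGE_RE = re.compile(r"^([A-Za-z]+)(\d+):([A-Za-z]+)(\d+)$")


def parse_range(a1_range: str) -> Tuple[int, int, int, int]:
    """Return (top_row, left_col, bottom_row, right_col), 1-based and inclusive."""
    match = _RANGE_RE.match(a1_range.strip())
    if match is None:
        raise ValueError(f"Not a simple A1 range: {a1_range!r}")
    left, top, right, bottom = match.groups()
    return int(top), column_to_index(left), int(bottom), column_to_index(right)


def parse_row(fields: List[str]) -> SheetAnnotation:
    """Build a record from pre-split fields; the first four are fixed,
    and every remaining non-empty field is treated as a table region."""
    file_name, sheet_name, split, file_folder, *regions = fields
    return SheetAnnotation(
        file_name, sheet_name, split, file_folder,
        [r.strip() for r in regions if r.strip()],
    )


# The example row from the schema above, with values copied as shown.
annotation = parse_row(
    ["1_AGAVE.x", "February", "training_set", "VEnron2\\1027", "B3:F5"]
)
print(annotation.sheet_name, [parse_range(r) for r in annotation.table_regions])
# prints: February [(3, 2, 5, 6)]
```

Returning inclusive, 1-based bounds keeps the numbers aligned with what a user sees in Excel. A model pipeline that indexes arrays from zero would subtract one from each value.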
