Skip to content

jandrew/Spreadsheet-XLSX-Reader-LibXML

Repository files navigation

NAME

Spreadsheet::XLSX::Reader::LibXML - Read xlsx spreadsheet files DEPRECATED

perl version Build Status Coverage Status github version CPAN version kwalitee

SYNOPSIS

The following uses the 'TestBook.xlsx' file found in the t/test_files/ folder of the package

#!/usr/bin/env perl
use strict;
use warnings;
use Spreadsheet::XLSX::Reader::LibXML;

my $parser   = Spreadsheet::XLSX::Reader::LibXML->new();
my $workbook = $parser->parse( 'TestBook.xlsx' );

if ( !defined $workbook ) {
        die $parser->error(), "\n";
}

for my $worksheet ( $workbook->worksheets() ) {

        my ( $row_min, $row_max ) = $worksheet->row_range();
        my ( $col_min, $col_max ) = $worksheet->col_range();

        for my $row ( $row_min .. $row_max ) {
                for my $col ( $col_min .. $col_max ) {

                        my $cell = $worksheet->get_cell( $row, $col );
                        next unless $cell;

                        print "Row, Col    = ($row, $col)\n";
                        print "Value       = ", $cell->value(),       "\n";
                        print "Unformatted = ", $cell->unformatted(), "\n";
                        print "\n";
                }
        }
        last;# In order not to read all sheets
}

###########################
# SYNOPSIS Screen Output
# 01: Row, Col    = (0, 0)
# 02: Value       = Category
# 03: Unformatted = Category
# 04: 
# 05: Row, Col    = (0, 1)
# 06: Value       = Total
# 07: Unformatted = Total
# 08: 
# 09: Row, Col    = (0, 2)
# 10: Value       = Date
# 11: Unformatted = Date
# 12: 
# 13: Row, Col    = (1, 0)
# 14: Value       = Red
# 16: Unformatted = Red
# 17: 
# 18: Row, Col    = (1, 1)
# 19: Value       = 5
# 20: Unformatted = 5
# 21: 
# 22: Row, Col    = (1, 2)
# 23: Value       = 2017-2-14 #(shows as 2/14/2017 in the sheet)
# 24: Unformatted = 41318
# 25: 
# More intermediate rows ... 
# 82: 
# 83: Row, Col    = (6, 2)
# 84: Value       = 2016-2-6 #(shows as 2/6/2016 in the sheet)
# 85: Unformatted = 40944
###########################

DEPRECATION

This module is being replaced by Spreadsheet::Reader::ExcelXML. The bugs associated with XML::LibXML integration wound up being insurmountable for me. I can only infer for all cases but release v0.38.22 will not build on one of my windows machines and on an AWS server used at $work. Netbsd doesn't have a single passed test since release. I still can't even distil the failure mode down to a case that I can submit but the new package without XML::LibXML is running where this one didn't. I opened the package in PAUSE to co-maintainership and I would be happy to hand full ownership to another party if someone wanted to take this over. For now this is an orphan package. I plan on pulling it from CPAN altogether after a year. The new package does run slower for large files since it uses a pure perl parser. On the other hand a fair amount of the TODO and bugs for this package got fixed in the new release. I invite you to try it if you liked this package.

DESCRIPTION

This is an object oriented just in time Excel spreadsheet reader package that should parse all excel files with the extentions .xlsx, .xlsm, .xml Excel 2003 xml (SpreadsheetML) that can be opened in Excel 2007+ applications. The quick-start example provided in the SYNOPSIS attempts to follow the example from Spreadsheet::ParseExcel (.xls binary file reader) as close as possible. There are additional methods and other approaches that can be used by this package for spreadsheet reading but the basic access to data from newer xml based Excel files can be as simple as above.

The intent is to fully document all public functions but you may need to go to sub modules to find more detailed documentation. This package operates on the Excel file with three primary tiers of classes. Each level provides object methods to access the next level down.

    Workbook level (This doc)

    * General attribute settings that affect parsing of the file in general * The place to set workbook level output formatting * Object methods to retreive document level metadata * Object methods to return specific Worksheet instances for data retrieval * The place to set workbook level formatting

      Worksheet level

      * Object methods to return specific cell instances/data * Access to some worksheet level format information (more access pending) * The place to customize data output formats targeting specific cell ranges

        Cell level

        * Access to the cell contents * Access to the cell formats (more access pending)

There are some differences from the Spreadsheet::ParseExcel package. For instance in the SYNOPSIS the '$parser' and the '$workbook' are actually the same class for this package. You could therefore combine both steps by calling ->new with the 'file' attribute called out. The test for load success would then rely on the method file_opened. Afterward it is still possible to call ->error on the instance. Another difference is the data formatter and specifically date handling. This package allows for a simple pluggable custom output format that is very flexible as well as handling dates older than 1-January-1900. I leveraged coercions from Type::Tiny to do this but anything that follows that general format will work here. Additionally, this is a Moose based package and possible I also use interfaces for each of the sub-roles/classes used in parsing. This should alow you to change only the part you want to perform differently if you have the desire to tinker with the guts. Read the full documentation for all opportunities!

In the realm of extensibility this package uses XML::LibXML which has multiple ways to read an XML file but this release only has an XML::LibXML::Reader parser option. Future iterations could include a DOM parser option but that is a very low priority. Currently this package does not provide the same access to the visual format elements provided in Spreadsheet::ParseExcel. That is on the longish and incomplete TODO list. To skip the why and nitty gritty of design and jump to implementation details go to the Attributes section.

Architecture Choices

This is yet another package for parsing Excel xml or 2007+ workbooks. The goals of this package are five fold. First, as close as possible produce the same output as is visible in an excel spreadsheet with exposure to underlying settings from Excel. Second, adhere as close as is reasonable to the Spreadsheet::ParseExcel API (where it doesn't conflict with the first objective) so that less work would be needed to integrate ParseExcel and this package. An addendum to the second goal is this package will not expose elements of the object hash for use by the consuming program. This package will either return an unblessed hash with the equivalent elements to the Spreadsheet::ParseExcel output (instead of a class instance) or it will provide methods to provide these sets of data. The third goal is to provide an XLSX sheet parser that is built on XML::LibXML. The other two primary options for XLSX parsing on CPAN use either a one-off XML parser (Spreadsheet::XLSX) or XML::Twig (Spreadsheet::ParseXLSX). In general if either of them already work for you without issue then there is no reason to switch to this package. Fourth, excel files get abused in the wild. They get abused by humans and they get abused by scripts. In general the Excel application handles this mangling gracefully. The goal here is to be able to read any xml based spreadsheet Excel can read. Please submit examples where this is not true to my github repo so I can work to improve this package. If you don't want your test case included with the distribution I will use it to improve the package without publishing it. Fifth (and finally), the design of this package is targeted at handling as large of an Excel file as possible. In general this means that design decisions will generally sacrifice speed to keep RAM consumption low. Specifically this spreadsheet parser does not read the file into memory completely when it is opened. Since the data in the sheet is parsed just in time the information that is not contained in the primary meta-data headers will not be available for review until the sheet parses to that point. In cases where the parser has made choices that prioritize speed over RAM savings there will generally be an attribute available to turn that decision off. All in all this package solves many of the issues I found parsing Excel in the wild. I hope it solves some of yours as well.

Warnings

1.This package uses Archive::Zip. Not all versions of Archive::Zip work for everyone. I have tested this with Archive::Zip 1.30. Please let me know if this does not work with a sucessfully installed (read passed the full test suit) version of Archive::Zip newer than that.

2. Not all workbook sheets (tabs) are created equal! Some Excel sheet tabs are only a chart. These tabs are 'chartsheets'. The methods with 'worksheet' in the name only act on the sub set of tabs that are worksheets. Future methods with 'chartsheet' in the name will focus on the subset of sheets that are chartsheets. Methods with just 'sheet' in the name have the potential to act on both. The documentation for the chartsheet level class is found in Spreadsheet::XLSX::Reader::LibXML::Chartsheet (still under construction). All chartsheet classes do not provide access to cells.

3. This package supports reading xlsm files (Macro enabled Excel 2007+ workbooks). xlsm files allow for binaries to be embedded that may contain malicious code. However, other than unzipping the excel file no work is done by this package with the sub-file 'vbaProject.bin' containing the binaries. This package does not provide an API to that sub-file and I have no intention of doing so. Therefore my research indicates there should be no risk of virus activation while parsing even an infected xlsm file with this package but I encourage you to use your own judgement in this area. caveat utilitor!

4. This package will read some files with 'broken' xml. In general this should be transparent but in the case of the maximum row value and the maximum column value for a worksheet it can cause some surprising problems. This includes the possibility that the maximum values are initially stored as 'undef' if the sheet does not provide them in the metadata as expected. The answer to the methods "row_range" in Spreadsheet::XLSX::Reader::LibXML::Worksheet and "col_range" in Spreadsheet::XLSX::Reader::LibXML::Worksheet will then change as more of the sheet is parsed. The parser improves these values as information is available based on the dimensional scope of the users cell parsing. These values are generally never available in Excel 2003 xml files. The primary cause of these broken XML elements in Excel 2007+ files are non-XML applications writing to the excel spreadsheet. You can use the attribute file_boundary_flags or the methods "get_next_value" in Spreadsheet::XLSX::Reader::LibXML::Worksheet or "fetchrow_arrayref" in Spreadsheet::XLSX::Reader::LibXML::Worksheet as alternates for pre-testing for boundaries when iterating.

5. Version v0.40.2 changes the way file caching is turned on and off. It also changes the way it is set when starting an instance of this package. If you did not turn off caching explicitly before this release there should no be a problem with this change. The goal is to automatically differentiate large files and small files and turn caching off in a targeted manner in response to larger file sizes. This should allow larger spreadsheets that may have exceeded available RAM to run (slowly) when they didn't run at all before without forcing small sheets to run too much slower. However, if you do have caching turned off in your code using the old Boolean setting this package will now see it, fix it upon load, and emit a warning. I will still be tweaking this setting over the next few releases. This warning will stay till 3/1/2017 and then the old callout will no longer be supported.

6. Version v0.40.2 introduces the file attribute and will start the deprication of the file_name and file_handle attributes as well as the following methods: set_file_name, set_file_handle, file_handle, and has_file_handle. This change is intended to remove an overly complex set of dependancies that was causing trouble for garbage collection on cleanup. Please use the file attribute and the file_opened methods as replacements moving forward. Support for backwards compatible use of the old attributes and methods will be removed after 3/1/2017.

7. Version v0.40.2 introduces support for SpreadsheetML (Excel 2003) .xml extention documents. These documents should include

xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet"

somewhere in the header to indicate their intended format. This change does introduce a lot of behind the scenes re-plumbing but the top level tests all stayed the same. This means that for .xlsx and .xlsm extentions there should not be any obvious changes or (hopefully) significant new bugs. Note warnings 5 and 6. However, to get this release out and rolling I don't have a full set of tests for the .xml extention paths and Microsofts documentation for that format is spotty in some relevant areas (I still don't know what I don't know) so please submit any cases that appear to behave differently than expected for .xml extention files that are readable by the Excel application. I am also interested in cases where an out of memory error occurs with an .xml extension file. This warning will stay till 3/1/2017.

Attributes

Data passed to new when creating an instance. For modification of these attributes see the listed 'attribute methods'. For general information on attributes see Moose::Manual::Attributes. For ways to manage the workbook when opened see the Primary Methods. For additional lesser used workbook options see Secondary Methods.

Example

$workbook_instance = Spreadsheet::XLSX::Reader::LibXML->new( %attributes )

note: if the file information is not included in the initial %attributes then it must be set by one of the attribute setter methods below or the parse method before the rest of the package can be used.

file_name

    Definition: This attribute holds the full file name and path for the xlsx|xlsm file to be parsed.

    Default no default - either this or a file handle must be provided to read a file

    Range any unencrypted xlsx|xlsm file that can be opened in Microsoft Excel.

    attribute methods Methods provided to adjust this attribute

      set_file_name

        Definition: change the file name value in the attribute (this will reboot the workbook instance)

      has_file_name

        Definition: this is used to see if the workbook loaded correctly using the file_name option to open an Excel .xlsx file.

file_handle

    Definition: This attribute holds a copy of the passed file handle reference.

    Default no default - either this or a file name must be provided to read a file

    Range any unencrypted xlsx file handle that can be opened in Microsoft Excel

    attribute methods Methods provided to adjust this attribute

      set_file_handle

        Definition: change the set file handle (this will reboot the workbook instance)

      has_file_handle

        Definition: this is used to see if the workbook loaded correctly when using the file_handle option to open an Excel .xlsx file.

error_inst

    Definition: This attribute holds an 'error' object instance. It should have several methods for managing errors. Currently no error codes or error language translation options are available but this should make implementation of that easier.

    Default: a Spreadsheet::XLSX::Reader::LibXML::Error instance with the attributes set as;

    ( should_warn => 0 )

    Range: The minimum list of methods to implement for your own instance is;

    error set_error clear_error set_warnings if_warn

    The error instance must be able to extract the error string from a passed error object as well. For now the current implementation will attempt ->as_string first and then ->message if an object is passed.

    attribute methods Methods provided to adjust this attribute

      get_error_inst

        Definition: returns this instance

      error

        Definition: delegated method from the class used to get the most recently logged error string

      set_error

        Definition: delegated method from the class used to set a new error string (or pass an error object for extraction of the error string)

      clear_error

        Definition: delegated method from the class used to clear the current error string

      set_warnings

        Definition: delegated method from the class used to turn on or off real time warnings when errors are set

      if_warn

        Definition: delegated method from the class used to extend this package and see if warnings should be emitted.

      should_spew_longmess

        Definition: delegated method from the class used to turn on or off the Carp 'longmess'for error messages

      spewing_longmess

        Definition: delegated method from the class used to understand the current state the longmess concatenation for error messages

sheet_parser

    Definition: This sets the way the .xlsx file is parsed. For now the only choice is 'reader'.

    Default 'reader'

    Range 'reader'

    attribute methods Methods provided to adjust this attribute

      set_parser_type

        Definition: the way to change the parser type

      get_parser_type

        Definition: returns the currently set parser type

count_from_zero

    Definition: Excel spreadsheets count from 1. Spreadsheet::ParseExcel counts from zero. This allows you to choose either way.

    Default 1

    Range 1 = counting from zero like Spreadsheet::ParseExcel, 0 = Counting from 1 like Excel

    attribute methods Methods provided to adjust this attribute

      counting_from_zero

        Definition: a way to check the current attribute setting

      set_count_from_zero

        Definition: a way to change the current attribute setting

file_boundary_flags

    Definition: When you request data to the right of the last column or below the last row of the data this package can return 'EOR' or 'EOF' to indicate that state. This is especially helpful in 'while' loops. The other option is to return 'undef'. This is problematic if some cells in your table are empty which also returns undef. What is determined to be the last column and row is determined by the attribute empty_is_end.

    Default 1

    Range 1 = return 'EOR' or 'EOF' flags as appropriate, 0 = return undef when requesting a position that is out of bounds

    attribute methods Methods provided to adjust this attribute

      boundary_flag_setting

        Definition: a way to check the current attribute setting

      change_boundary_flag

        Definition: a way to change the current attribute setting

empty_is_end

    Definition: The excel convention is to read the table left to right and top to bottom. Some tables have an uneven number of columns with real data from row to row. This allows the several methods that excersize a 'next' function to wrap after the last element with data rather than going to the max column. This also triggers 'EOR' flags after the last data element and before the sheet max column when not implementing 'next' functionality.

    Default 0

    Range 0 = treat all columns short of the max column for the sheet as being in the table, 1 = treat all cells after the last cell with data as past the end of the row. This will be most visible when boundary flags are turned on or next functionality is used in the context of the values_only attribute is on.

    attribute methods Methods provided to adjust this attribute

      is_empty_the_end

        Definition: a way to check the current attribute setting

      set_empty_is_end

        Definition: a way to set the current attribute setting

values_only

    Definition: Excel will store information about a cell even if it only contains formatting data. In many cases you only want to see cells that actually have values. This attribute will change the package behaviour regarding cells that have formatting stored against that cell but no actual value.

    Default 0

    Range 1 = skip cells with formatting only and treat them as completely empty, 0 = return informat about cells that only contain formatting

    attribute methods Methods provided to adjust this attribute

      get_values_only

        Definition: a way to check the current attribute setting

      set_values_only

        Definition: a way to set the current attribute setting

from_the_edge

    Definition: Some data tables start in the top left corner. Others do not. I don't reccomend that practice but when aquiring data in the wild it is often good to adapt. This attribute sets whether the file reads from the top left edge or from the top row with data and starting from the leftmost column with data.

    Default 1

    Range 1 = treat the top left corner of the sheet as the beginning of rows and columns even if there is no data in the top row or leftmost column, 0 = Set the minimum row and minimum columns to be the first row and first column with data

    attribute methods Methods provided to adjust this attribute

      set_from_the_edge

        Definition: a way to set the current attribute setting

cache_positions

    Definition: This parse can be slow. It does this by trading processing and file storage for RAM usage but that is probably not the average users choice. Currently four of the files can implement selective caching. The setting for this attribute takes a hash ref with the file indicators as keys and the max file size in bytes as the value. When the sub file handle exceeds that size then caching for that subfile is turned off. The default setting shows an example with the four available cached size.

    warning: This behaviour changed with v0.40.2. Prior to that this setting accepted a boolean value that turned all caching on or off universally. If a boolean value is passed a deprication warning will be issued and the input will be changed to this format. 'On' will be converted to the default caching levels. A boolean 'Off' is passed then the package will set all maximum caching levels to 0.

    Default

    {
            sharedStrings => 5242880,# 5 MB
            styles => 5242880,# 5 MB
            worksheet_interface => 5242880,# 5 MB
            chartsheet_interface => 5242880,# 5 MB
    }

    attribute methods Methods provided to adjust this attribute

      get_cache_positions

        Definition: read the attribute

      get_cache_size( $target_file )

        Definition: return the max file size allowed to cache for the indicated $target_file

      set_cache_size( $target_file = $max_file_size )>

        Definition: set the $max_file_size to be cached for the indicated $target_file

      has_cache_size( $target_file )

        Definition: returns true if one of the four allowed files is passed at $target_file

formatter_inst

group_return_type

    Definition: Traditionally ParseExcel returns a cell object with lots of methods to reveal information about the cell. In reality the extra information is not used very much (witness the popularity of Spreadsheet::XLSX). Because many users don't need or want the extra cell formatting information it is possible to get either the raw xml value, the raw visible cell value (seen in the Excel format bar), or the formatted cell value returned either the way the Excel file specified or the way you specify instead of a Cell instance with all the data. . See "custom_formats" in Spreadsheet::XLSX::Reader::LibXML::Worksheet to insert custom targeted formats for use with the parser. All empty cells return undef no matter what.

    Default instance

    Range instance = returns a populated Spreadsheet::XLSX::Reader::LibXML::Cell instance, unformatted = returns just the raw visible value of the cell shown in the Excel formula bar, value = returns just the formatted value stored in the excel cell, xml_value = the raw value for the cell as stored in the sub-xml files

    attribute methods Methods provided to adjust this attribute

      get_group_return_type

        Definition: a way to check the current attribute setting

      set_group_return_type

        Definition: a way to set the current attribute setting

empty_return_type

    Definition: Traditionally Spreadsheet::ParseExcel returns an empty string for cells with unique formatting but no stored value. It may be that the more accurate way of returning undef works better for you. This will turn that behaviour on. If Excel stores an empty string having this attribute set to 'undef_string' will still return the empty string!

    Default empty_string

    Range empty_string = populates the unformatted value with '' even if it is set to undef undef_string = if excel stores undef for an unformatted value it will return undef

    attribute methods Methods provided to adjust this attribute

      get_empty_return_type

        Definition: a way to check the current attribute setting

      set_empty_return_type

        Definition: a way to set the current attribute setting

Primary Methods

These are the primary ways to use this class. They can be used to open an .xlsx workbook. They are also ways to investigate information at the workbook level. For information on how to retrieve data from the worksheets see the Worksheet and Cell documentation. For additional workbook options see the Secondary Methods and the Attributes sections. The attributes section specifically contains all the methods used to adjust the attributes of this class.

All methods are object methods and should be implemented on the object instance.

Example:

my @worksheet_array = $workbook_instance->worksheets;

parse( $file_name|$file_handle, $formatter )

    Definition: This is a convenience method to match "parse($filename, $formatter)" in Spreadsheet::ParseExcel. It only works if the file_name or file_handle attribute was not set with ->new. It is one way to set the 'file_name' or 'file_handle' attribute [and the default_format_list attribute]. You cannot pass both a file name and a file handle simultaneously to this method.

    Accepts:

    $file = a valid xlsx file [or a valid xlsx file handle] (required)
    [$formatter] = see the default_format_list attribute for valid options (optional)

    Returns: itself when passing with the xlsx file loaded to the workbook level or undef for failure.

worksheets

    Definition: This method will return an array (not an array reference) containing a list of references to all worksheets in the workbook. This is not a reccomended method. It is provided for compatibility to Spreadsheet::ParseExcel. For alternatives see the get_worksheet_names method and the worksheet methods. For now it also only returns the tabular worksheets in the workbook. All chart worksheets are ignored! (future inclusion will included a backwards compatibility policy)

    Accepts: nothing

    Returns: an array ref of Worksheet objects for all worksheets in the workbook.

worksheet( $name )

    Definition: This method will return an object to read values in the worksheet. If no value is passed to $name then the 'next' worksheet in physical order is returned. 'next' will NOT wrap It also only iterates through the 'worksheets' in the workbook (but not the 'chartsheets').

    Accepts: the $name string representing the name of the worksheet object you want to open. This name is the word visible on the tab when opening the spreadsheet in Excel. (not the underlying zip member file name - which can be different. It will not accept chart tab names.)

    Returns: a Worksheet object with the ability to read the worksheet of that name. It returns undef and sets the error attribute if a 'chartsheet' is requested. Or in 'next' mode it returns undef if past the last sheet.

    Example: using the implied 'next' worksheet;

    while( my $worksheet = $workbook->worksheet ){
            print "Reading: " . $worksheet->name . "\n";
            # get the data needed from this worksheet
    }

in_the_list

    Definition: This is a predicate method that indicates if the 'next' worksheet function has been implemented at least once.

    Accepts:nothing

    Returns: true = 1, false = 0 once

start_at_the_beginning

    Definition: This restarts the 'next' worksheet at the first worksheet. This method is only useful in the context of the worksheet function.

    Accepts: nothing

    Returns: nothing

worksheet_count

    Definition: This method returns the count of worksheets (excluding charts) in the workbook.

    Accepts:nothing

    Returns: an integer

get_worksheet_names

    Definition: This method returns an array ref of all the worksheet names in the workbook. (It excludes chartsheets.)

    Accepts: nothing

    Returns: an array ref

    Example: Another way to parse a workbook without building all the sheets at once is;

    for $sheet_name ( @{$workbook->worksheet_names} ){
            my $worksheet = $workbook->worksheet( $sheet_name );
            # Read the worksheet here
    }

get_sheet_names

    Definition: This method returns an array ref of all the sheet names (tabs) in the workbook. (It includes chartsheets.)

    Accepts: nothing

    Returns: an array ref

get_chartheet_names

    Definition: This method returns an array ref of all the chartsheet names in the workbook. (It excludes worksheets.)

    Accepts: nothing

    Returns: an array ref

sheet_name( $Int )

    Definition: This method returns the sheet name for a given physical position in the workbook from left to right. It counts from zero even if the workbook is in 'count_from_one' mode. B(It will return chart names but chart tab names cannot currently be converted to worksheets). You may actually want worksheet_name instead of this function.

    Accepts: integers

    Returns: the sheet name (both workbook and worksheet)

    Example: To return only worksheet positions 2 through 4

    for $x (2..4){
            my $worksheet = $workbook->worksheet( $workbook->worksheet_name( $x ) );
            # Read the worksheet here
    }

sheet_count

    Definition: This method returns the count of all sheets in the workbook (worksheets and chartsheets).

    Accepts: nothing

    Returns: a count of all sheets

worksheet_name( $Int )

    Definition: This method returns the worksheet name for a given order in the workbook from left to right. It does not count any 'chartsheet' positions as valid. It counts from zero even if the workbook is in 'count_from_one' mode.

    Accepts: integers

    Returns: the worksheet name

    Example: To return only worksheet positions 2 through 4 and then parse them

    for $x (2..4){
            my $worksheet = $workbook->worksheet( $workbook->worksheet_name( $x ) );
            # Read the worksheet here
    }

worksheet_count

    Definition: This method returns the count of all worksheets in the workbook (not including chartsheets).

    Accepts: nothing

    Returns: a count of all worksheets

chartsheet_name( $Int )

    Definition: This method returns the chartsheet name for a given order in the workbook from left to right. It does not count any 'worksheet' positions as valid. It counts from zero even if the workbook is in 'count_from_one' mode.

    Accepts: integers

    Returns: the chartsheet name

chartsheet_count

    Definition: This method returns the count of all chartsheets in the workbook (not including worksheets).

    Accepts: nothing

    Returns: a count of all chartsheets

error

    Definition: This returns the most recent error message logged by the package. This method is mostly relevant when an unexpected result is returned by some other method.

    Accepts:nothing

    Returns: an error string.

Secondary Methods

These are the additional methods that include ways to extract additional information about the .xlsx file and ways to modify workbook and worksheet parsing that are less common. Note that all methods specifically used to adjust workbook level attributes are listed in the Attribute section. This section primarily contains methods for or delegated from private attributes set up during the workbook load process.

parse_excel_format_string( $format_string )

creator

    Definition: Retrieve the stored creator string from the Excel file.

    Accepts nothing

    Returns A string

date_created

    Definition: returns the date the file was created

    Accepts nothing

    Returns A string

modified_by

    Definition: returns the user name of the person who last modified the file

    Accepts nothing

    Returns A string

date_modified

    Definition: returns the date when the file was last modified

    Accepts nothing

    Returns A string

get_epoch_year

    Definition: This returns the epoch year defined by the Excel workbook.

    Accepts: nothing

    Returns: 1900 = Windows Excel or 1904 = Apple Excel

get_shared_string

get_format_position

set_defined_excel_format_list

change_output_encoding

set_cache_behavior

get_date_behavior

set_date_behavior

FLAGS

The parameter list (attributes) that are possible to pass to ->new is somewhat long. Therefore you may want a shortcut that aggregates some set of attribute settings that are not the defaults but wind up being boilerplate. I have provided possible alternate sets like this and am open to providing others that are suggested. The flags will have a : in front of the identifier and will be passed to the class in the 'use' statement for consumption by the import method. The flags can be stacked and where there is conflict between the flag settings the rightmost passed flag setting is used.

Example;

use Spreadsheet::XLSX::Reader::LibXML v0.34.4 qw( :alt_default :debug );

:alt_default

This is intended for a deep look at data and skip formatting cells.

:just_the_data

This is intended for a shallow look at data and skip formatting.

:just_raw_data

This is intended for a shallow look at raw text and skips all formatting including number formats.

:debug

Turn on "should_warn" in Spreadsheet::XLSX::Reader::LibXML::Error in the Error attribute (instance)

BUILD / INSTALL from Source

0. Please note that using cpanm is much easier than a source build! (but it will not always give the latest github version)

cpanm Spreadsheet::XLSX::Reader::LibXML

And then if you feel kindly

cpanm-reporter

1. This package uses Alien::LibXML to try and ensure that the mandatory prerequisite XML::LibXML will load. The biggest gotcha here is that older (<5.20.0.2) versions of Strawberry Perl and some other Win32 perls may not support the script 'pkg-config' which is required. You can resolve this by installation PkgConfig as 'pkg-config'. I have included the short version of that process below but download the full PkgConfig distribution and read README.win32 file for other options and much more explanation.

    this will conflict with any existing pkg-config installed

    C:\> cpanm PkgConfig --configure-args=--script=pkg-config

It may be that you still need to use a system package manager to load the 'libxml2-devel' library. If this is the case or you experience any other installation issues please submit them to github especially if they occur prior to starting the test suit as these failures will not auto push from CPAN Testers so I won't know to fix them!

2. Download a compressed file with this package code from your favorite source

3. Extract the code from the compressed file.

    If you are using tar on a .tar.gz file this should work:

    tar -zxvf Spreadsheet-XLSX-Reader-LibXML-v0.xx.tar.gz

4. Change (cd) into the extracted directory

5. Run the following

    (for Windows find what version of make was used to compile your perl)

    perl  -V:make

    (then for Windows substitute the correct make function (s/make/dmake/g)? below)

perl Makefile.PL

make

make test

make install # As sudo/root

make clean

SUPPORT

TODO

    1. Add POD for all the new chart methods!

    1. Build an 'Alien::LibXML::Devel' package to load the libxml2-devel libraries from source and require that and Alien::LibXML in the build file. So all needed requirements for XML::LibXML are met

      Both libxml2 and libxml2-devel libraries are required for XML::LibXML

    1. Add an individual test just for Spreadsheet::XLSX::Reader::LibXML::Row (Currently tested in the worksheet test)

    2. Add an individual test just for Spreadsheet::XLSX::Reader::LibXML::ZipReader (Currently only tested in the top level test)

    3. Add individual tests just for the File, Meta, Props, Rels sub workbook interfaces

    4. Add an individual test just for Spreadsheet::XLSX::Reader::LibXML::ZipReader::ExtractFile

    5. Add individual tests just for the XMLReader sub modules NamedStyles, and PositionStyles

    6. Add a pivot table reader (Not just read the values from the sheet)

    7. Add calc chain methods

    8. Add more exposure to workbook/worksheet formatting values

    9. Build a DOM parser alternative for the sheets

      (Theoretically faster than the reader and no longer JIT so it uses more memory)

AUTHOR

    Jed Lund

    jandrew@cpan.org

CONTRIBUTORS

This is the (likely incomplete) list of people who have helped make this distribution what it is, either via code contributions, patches, bug reports, help with troubleshooting, etc. A huge 'thank you' to all of them.

COPYRIGHT

This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

The full text of the license can be found in the LICENSE file included with this module.

This software is copyrighted (c) 2014, 2016 by Jed Lund

DEPENDENCIES

SEE ALSO