The test code tests/data/test_mm.py does not work. #80

Closed
dkkim1005 opened this issue Dec 19, 2023 · 1 comment

Comments

@dkkim1005
Contributor

dkkim1005 commented Dec 19, 2023

Bug

An OSError is raised when running the test module tests/data/test_mm.py. Five of the ten test cases fail, all with the same error.

$ nosetests ./data/test_mm.py -v

test0_get_default_option (data.test_mm.TestMatrixMarket) ... ok                                                                                                                                                                                                                                
test1_is_valid_option (data.test_mm.TestMatrixMarket) ... ok                                                                                                                                                                                                                                   
test2_create (data.test_mm.TestMatrixMarket) ...
[INFO    ] 2023-12-19 04:03:30 [mm.py:247] Create the database from matrix market file.
[DEBUG   ] 2023-12-19 04:03:30 [mm.py:252] Building meta part...
[PROGRESS] 0.00% 0.0/0.0secs 0.00it/s
[INFO    ] 2023-12-19 04:03:30 [base.py:179] File ./mm.h5py exists. To build new database, existing file ./mm.h5py will be deleted.
[ERROR   ] 2023-12-19 04:03:30 [mm.py:162] Cannot create db: Can't write data (no appropriate function for conversion path)                                                                                                                                                                    
[ERROR   ] 2023-12-19 04:03:30 [mm.py:163] Traceback (most recent call last):                                                                                                                                                                                                                  
  File "/home/bc-user/.local/lib/python3.10/site-packages/buffalo/data/mm.py", line 141, in _create                                                                                                                                                                                            
    idmap["rows"][:] = np.loadtxt(fin, dtype=f"S{uid_max_col}")                                                                                                                                                                                                                                
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper                                                                                                                                                                                                                        
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper                                                                                                                                                                                                                        
  File "/home/bc-user/.local/lib/python3.10/site-packages/h5py/_hl/dataset.py", line 999, in __setitem__                                                                                                                                                                                       
    self.id.write(mspace, fspace, val, mtype, dxpl=self._dxpl)                                                                                                                                                                                                                                 
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper                                                                                                                                                                                                                        
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper                                                                                                                                                                                                                        
  File "h5py/h5d.pyx", line 283, in h5py.h5d.DatasetID.write                                                                                                                                                                                                                                   
  File "h5py/_proxy.pyx", line 114, in h5py._proxy.dset_rw                                                                                                                                                                                                                                     
OSError: Can't write data (no appropriate function for conversion path)

......(skip the middle lines)

MatrixMarketDataReader: DEBUG: creating temporary matrix-market data from numpy-kind array
MatrixMarket: INFO: Create the database from matrix market file.
MatrixMarket: DEBUG: Building meta part...
[PROGRESS] 0.00% 0.0/0.0secs 0.00it/s
MatrixMarket: INFO: File ./mm.h5py exists. To build new database, existing file ./mm.h5py will be deleted.
MatrixMarket: ERROR: Cannot create db: Can't write data (no appropriate function for conversion path)
MatrixMarket: ERROR: Traceback (most recent call last):
  File "/home/bc-user/.local/lib/python3.10/site-packages/buffalo/data/mm.py", line 141, in _create
    idmap["rows"][:] = np.loadtxt(fin, dtype=f"S{uid_max_col}")
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "/home/bc-user/.local/lib/python3.10/site-packages/h5py/_hl/dataset.py", line 999, in __setitem__
    self.id.write(mspace, fspace, val, mtype, dxpl=self._dxpl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5d.pyx", line 283, in h5py.h5d.DatasetID.write
  File "h5py/_proxy.pyx", line 114, in h5py._proxy.dset_rw
OSError: Can't write data (no appropriate function for conversion path)

[PROGRESS] 100.00% 0.0/0.0secs 1,137.96it/s

--------------------- >> end captured logging << ---------------------

----------------------------------------------------------------------
Ran 10 tests in 0.041s

FAILED (errors=5)

The cause is a mismatch between the data type of the HDF5 dataset and that of the NumPy array being written to it, as the error log above shows. The current version creates the idmap datasets with "utf-8" encoding only, while MatrixMarket loads the IDs as byte strings (np.loadtxt with dtype S{uid_max_col}), so writing both the user and item ID lists fails. Changing the encoding from "utf-8" to "ascii" might be one feasible fix. I tested the following local patch to buffalo/data/base.py:

# Method in the Data class (buffalo/data/base.py)
def _create_database(self, path, **kwargs):
    # ......
    # [AS-IS] idmap datasets use fixed-length "utf-8" strings
    # idmap.create_dataset("rows", (num_users,), dtype=h5py.string_dtype("utf-8", length=uid_max_col),
    #                      maxshape=(num_users,))
    # idmap.create_dataset("cols", (num_items,), dtype=h5py.string_dtype("utf-8", length=iid_max_col),
    #                      maxshape=(num_items,))
    # ......
    # [TO-BE] idmap datasets use fixed-length "ascii" strings
    idmap.create_dataset("rows", (num_users,), dtype=h5py.string_dtype("ascii", length=uid_max_col),
                         maxshape=(num_users,))
    idmap.create_dataset("cols", (num_items,), dtype=h5py.string_dtype("ascii", length=iid_max_col),
                         maxshape=(num_items,))
    # ......

With this patch applied, all ten tests pass:

test0_get_default_option (data.test_mm.TestMatrixMarket) ... ok
test1_is_valid_option (data.test_mm.TestMatrixMarket) ... ok
test2_create (data.test_mm.TestMatrixMarket) ...
[INFO    ] 2023-12-19 04:54:58 [mm.py:247] Create the database from matrix market file.
[DEBUG   ] 2023-12-19 04:54:58 [mm.py:252] Building meta part...
[PROGRESS] 0.00% 0.0/0.0secs 0.00it/s
[INFO    ] 2023-12-19 04:54:58 [base.py:179] File ./mm.h5py exists. To build new database, existing file ./mm.h5py will be deleted.
[PROGRESS] 100.00% 0.0/0.0secs 742.35it/s
[INFO    ] 2023-12-19 04:54:58 [mm.py:260] Creating working data...
[PROGRESS] 0.00% 0.0/0.0secs 0.00it/s
[PROGRESS] 100.00% 0.0/0.0secs 168,937.24it/s
[DEBUG   ] 2023-12-19 04:54:58 [mm.py:264] Working data is created on /tmp/tmpr5a6iwrk
[INFO    ] 2023-12-19 04:54:58 [mm.py:265] Building data part...
[INFO    ] 2023-12-19 04:54:58 [base.py:417] Building compressed triplets for rowwise...
[INFO    ] 2023-12-19 04:54:58 [base.py:418] Preprocessing...
[INFO    ] 2023-12-19 04:54:58 [base.py:421] In-memory Compressing ...
[INFO    ] 2023-12-19 04:54:59 [base.py:301] Load triplet files. Total job files: 73
[INFO    ] 2023-12-19 04:54:59 [base.py:451] Finished
[INFO    ] 2023-12-19 04:54:59 [base.py:417] Building compressed triplets for colwise...
[INFO    ] 2023-12-19 04:54:59 [base.py:418] Preprocessing...
[INFO    ] 2023-12-19 04:54:59 [base.py:421] In-memory Compressing ...
[INFO    ] 2023-12-19 04:54:59 [base.py:301] Load triplet files. Total job files: 73
[INFO    ] 2023-12-19 04:54:59 [base.py:451] Finished
[INFO    ] 2023-12-19 04:54:59 [mm.py:279] DB built on ./mm.h5py
ok
......(skip the middle lines)
test3_list (data.test_mm.TestMatrixMarketReader) ... [DEBUG   ] 2023-12-19 04:55:01 [mm.py:70] creating temporary matrix-market data from numpy-kind array
ok

----------------------------------------------------------------------
Ran 10 tests in 3.166s

OK

However, this patch breaks w2v training (PR), which relies on "utf-8" characters to train on Korean words. One feasible way to reconcile the conflict is to apply an appropriate encoding rule to each input type: matrix-market files and stream data files.
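
The conflict between the two encodings can be seen without HDF5 at all. The following is a minimal sketch using only the Python standard library; the sample IDs are made up for illustration and are not taken from buffalo's test data. ASCII-only IDs produce identical bytes under both codecs, whereas a Korean word cannot be encoded as ASCII and needs more bytes than it has characters under UTF-8.

```python
# Sample values are illustrative assumptions, not buffalo test data.
ascii_id = "user_42"
korean_word = "한국어"

# ASCII-only text yields the same bytes under both codecs,
# so UTF-8 storage loses nothing for plain matrix-market IDs.
same_bytes = ascii_id.encode("ascii") == ascii_id.encode("utf-8")

# Korean text has no ASCII representation at all.
try:
    korean_word.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

utf8_bytes = korean_word.encode("utf-8")

print(same_bytes)         # True
print(ascii_ok)           # False
print(len(korean_word))   # 3 (characters)
print(len(utf8_bytes))    # 9 (bytes)
```

The last two lines matter for fixed-length string fields: a width measured in characters would be too small for the encoded bytes, so any width such as uid_max_col presumably has to be measured after encoding.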

@chiwanpark
Member

@dkkim1005 We need to unify the type of string data to h5py.string_dtype("utf-8"). Could you send a PR fixing this bug?
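
One possible reading of this suggestion, sketched with the standard library only: normalize every ID to UTF-8 bytes before it is written, and size the fixed-length field by the longest encoded ID. The helper name `to_utf8_ids` and the sample IDs are hypothetical; this is not buffalo's code, and the actual fix may look different.

```python
from typing import Iterable, List, Tuple, Union


def to_utf8_ids(ids: Iterable[Union[str, bytes]]) -> Tuple[List[bytes], int]:
    """Return the IDs as UTF-8 bytes plus the longest encoded length in bytes.

    Hypothetical helper for illustration only.
    """
    encoded: List[bytes] = []
    for raw in ids:
        if isinstance(raw, bytes):
            # Check that pre-encoded IDs are well-formed UTF-8
            # (raises UnicodeDecodeError otherwise).
            raw.decode("utf-8")
            encoded.append(raw)
        else:
            encoded.append(raw.encode("utf-8"))
    max_len = max((len(b) for b in encoded), default=0)
    return encoded, max_len


# Mixed input: bytes as loaded from a file, plain str, and a Korean word.
ids, width = to_utf8_ids([b"item_1", "user_42", "한국어"])
print(width)  # 9
```

Under this assumption, `width` is the value that would be passed as `length` to `h5py.string_dtype("utf-8", length=...)`, so that matrix-market IDs and w2v vocabulary share one string type.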
