Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TST: Create compressed salary testing data #14587

Merged
merged 2 commits into from
Nov 11, 2016

Conversation

dhimmel
Copy link
Contributor

@dhimmel dhimmel commented Nov 4, 2016

Create compressed versions of the salary dataset for testing #14576.

Rename salary.table.csv to salaries.tsv because the dataset is tab rather than comma delimited. Remove the word table because it's implied by the extension. Rename salary.table.gz to salaries.tsv.gz, since compressed files should append to not strip the original extension.

Created new files by running the following commands:

cd pandas/io/tests/parser/data
bzip2 --keep salaries.tsv
xz --keep salaries.tsv
zip salaries.tsv.zip salaries.tsv

There will be CI testing failures since the modified files are not yet in master or on amazon S3. @jorisvandenbossche how do we deal with this?

Create compressed versions of the salary dataset for testing pandas-dev#14576.

Rename `salary.table.csv` to `salaries.tsv` because the dataset is tab rather
than comma delimited. Remove the word table because it's implied by the
extension. Rename `salary.table.gz` to `salaries.tsv.gz`, since compressed files
should append to not strip the original extension.

Created new files by running the following commands:

```sh
cd pandas/io/tests/parser/data
bzip2 --keep salaries.tsv
xz --keep salaries.tsv
zip salaries.tsv.zip salaries.tsv
```
@jreback
Copy link
Contributor

jreback commented Nov 4, 2016

don't use tsv for extension, use csv

@dhimmel
Copy link
Contributor Author

dhimmel commented Nov 4, 2016

don't use tsv for extension, use csv

Why is that? The table isn't comma separated. This leads to issues like GitHub failing to render the table:

salaries-csv

@jreback
Copy link
Contributor

jreback commented Nov 4, 2016

because you are changing something that is standard

if you wish to do that make a new PR but to be honest it's pretty non standard

@dhimmel
Copy link
Contributor Author

dhimmel commented Nov 4, 2016

because you are changing something that is standard

@jreback hmm. Presently, the CSV standard is being violated as RFC 4180 specifies:

Within the header and each record, there may be one or more fields, separated by commas.

However, salary.table.csv does meet the TSV standard:

Fields in a record are separated from each other by a tab character.

So I'm happy to save this for another pull request, but I thought it would be most efficient to address it here... especially since:

  1. we're already standardizing the file names in other ways -- like removing .table and adding the extension to proceed .gz.
  2. changing the name of testing files creates temporary testing failures... so might as well do two birds with one stone?

Just let me know what's the best way forward.

@sinhrks sinhrks added Testing pandas testing functions or related to the test suite IO CSV read_csv, to_csv labels Nov 7, 2016
@dhimmel
Copy link
Contributor Author

dhimmel commented Nov 8, 2016

Tagging @jreback @sinhrks @jorisvandenbossche -- let me know what to do next here, so I can continue with #14576.

@jorisvandenbossche
Copy link
Member

@dhimmel

  • let's keep the files as .csv for now. Although those files meet the tsv standard, this standard is certainly not followed in all fields, and also in the pandas codebase we use the .csv file extension for all text files at the moment, no matter the delimiter that is used.
  • standardizing the file names in the other ways is certainly fine! (the compression suffixes and leaving out the 'table')

For the CI failures, those are indeed not avoidable. But since it doesn't change code, only the file names, it's OK if you can confirm those specific tests pass locally.

@jorisvandenbossche jorisvandenbossche added this to the 0.20.0 milestone Nov 8, 2016
@codecov-io
Copy link

Current coverage is 85.27% (diff: 100%)

Merging #14587 into master will decrease coverage by <.01%

@@             master     #14587   diff @@
==========================================
  Files           140        140          
  Lines         50693      50693          
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
- Hits          43230      43229     -1   
- Misses         7463       7464     +1   
  Partials          0          0          

Powered by Codecov. Last update 7a2bcb6...759d48d

@dhimmel
Copy link
Contributor Author

dhimmel commented Nov 9, 2016

let's keep the files as .csv for now.

Done in 24341b5.

For the CI failures, those are indeed not avoidable. But since it doesn't change code, only the file names, it's OK if you can confirm those specific tests pass locally.

@jorisvandenbossche, here're the errors I get when I run nosetests pandas locally in the development environment (full log):

ERROR: test_url_gz (pandas.io.tests.parser.test_network.TestUrlGz)
ERROR: test_url_gz_infer (pandas.io.tests.parser.test_network.TestUrlGz)
ERROR: test_url (pandas.io.tests.parser.test_parsers.TestCParserHighMemory)
ERROR: test_url (pandas.io.tests.parser.test_parsers.TestCParserLowMemory)
ERROR: test_url (pandas.io.tests.parser.test_parsers.TestPythonParser)

It appears to me that all errors are from remote URLs not resolving... but how do I "confirm these specific tests pass locally". Should I locally change the URLs?

@jorisvandenbossche
Copy link
Member

It appears to me that all errors are from remote URLs not resolving... but how do I "confirm these specific tests pass locally". Should I locally change the URLs?

Ah, yes, of course you have locally the same problem .. No problem, it's ok like this.

@jorisvandenbossche jorisvandenbossche changed the title Create compressed salary testing data TST: Create compressed salary testing data Nov 11, 2016
@jorisvandenbossche jorisvandenbossche merged commit 85a6464 into pandas-dev:master Nov 11, 2016
@jorisvandenbossche
Copy link
Member

@dhimmel Thanks!

@jorisvandenbossche
Copy link
Member

I made a small correction: ed21736

@dhimmel
Copy link
Contributor Author

dhimmel commented Nov 11, 2016

I made a small correction: ed21736

Yeah I should have more explicit that https://s3.amazonaws.com/pandas-test/salary.table.gz would have to be renamed on S3.

So in the longer term, here's what I'm thinking. First, is the S3 test actually valuable in addition to the GitHub URL test?

So if the S3 tests are valuable, then I'd suggest uploading salaries.csv* to S3 and I'll add those tests to #14576 accordingly.

dhimmel added a commit to dhimmel/pandas that referenced this pull request Nov 11, 2016
@jorisvandenbossche
Copy link
Member

@dhimmel that sounds right (the S3 here is just an url (not an S3 test), so github url is just as fine)

@dhimmel
Copy link
Contributor Author

dhimmel commented Dec 4, 2016

@dhimmel that sounds right (the S3 here is just an url (not an S3 test), so github url is just as fine)

@jorisvandenbossche, based on how the reading from URL works, we actually will need to also test reading compressed files from S3 URLs. See bff0f3e

So would it be possible to upload salaries.csv, salaries.csv.bz2, salaries.csv.gz, salaries.csv.xz, and salaries.csv.zip to the S3?

dhimmel added a commit to dhimmel/pandas that referenced this pull request Dec 4, 2016
@jorisvandenbossche
Copy link
Member

@jreback Do you know who has access to the amazon data?

@dhimmel Are you that it is not just an url in this case? (the link you gave does not lead to visible code ("We went looking everywhere, but couldn’t find those commits"), probably because you had to force push after a rebase or so)

@dhimmel
Copy link
Contributor Author

dhimmel commented Dec 4, 2016

probably because you had to force push after a rebase or so

Yep I rebased. See c042391. The point is that a different implementation is used to read URLs that are S3.

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Dec 4, 2016

Yes, but I think this url is not seen as a 'S3 url', but just as a plain url (if you enter this url in a browser, you just get to download the file). A S3 url would more look like "s3://pandas-test/somefile.csv". What you link to is in the s3.py file, not in the main parsing file.

@jorisvandenbossche
Copy link
Member

For example:

In [14]: pd.io.common._is_s3_url('https://s3.amazonaws.com/pandas-test/salary.table.gz')
Out[14]: False

In [15]: pd.io.common._is_s3_url('s3://pandas-test/salary.table.gz')
Out[15]: True

jorisvandenbossche pushed a commit that referenced this pull request Dec 15, 2016
jorisvandenbossche added a commit that referenced this pull request Dec 15, 2016
yarikoptic added a commit to neurodebian/pandas that referenced this pull request Jan 3, 2017
Version 0.19.2

* tag 'v0.19.2': (78 commits)
  RLS: v0.19.2
  DOC: update release notes for 0.19.2
  TST: skip gbq upload test as flakey
  DOC: clean-up v0.19.2 whatsnew
  DOC: update Pandas Cheat Sheet (GH13202)
  DOC: Pandas Cheat Sheet
  TST: matplotlib 2.0 fix in log limits for barplot (GH14808) (pandas-dev#14957)
  flake8 fix import
  Remove test - from 0.20.0 PR slipped in
  PERF: fix getitem unique_check / initialization issue
  cache and remove boxing (pandas-dev#14931)
  CLN: Resubmit of GH14700.  Fixes GH14554.  Errors other than Indexing…
  Clean up construction of Series with dictionary and datetime index
  BUG: .fillna() for datetime64 with tz is passing thru floats
  BUG: Patch read_csv NA values behaviour
  ENH: merge_asof() has type specializations and can take multiple 'by' parameters (pandas-dev#13936)
  [Backport pandas-dev#14886] BUG: regression in DataFrame.combine_first with integer columns (GH14687) (pandas-dev#14886)
  Fixed KDE Plot to drop the missing values (pandas-dev#14820)
  ENH: merge_asof() has left_index/right_index and left_by/right_by (pandas-dev#14253) (pandas-dev#14531)
  TST: correct url for test file on s3 (xref pandas-dev#14587)
  ...
yarikoptic added a commit to neurodebian/pandas that referenced this pull request Jan 3, 2017
* releases: (78 commits)
  RLS: v0.19.2
  DOC: update release notes for 0.19.2
  TST: skip gbq upload test as flakey
  DOC: clean-up v0.19.2 whatsnew
  DOC: update Pandas Cheat Sheet (GH13202)
  DOC: Pandas Cheat Sheet
  TST: matplotlib 2.0 fix in log limits for barplot (GH14808) (pandas-dev#14957)
  flake8 fix import
  Remove test - from 0.20.0 PR slipped in
  PERF: fix getitem unique_check / initialization issue
  cache and remove boxing (pandas-dev#14931)
  CLN: Resubmit of GH14700.  Fixes GH14554.  Errors other than Indexing…
  Clean up construction of Series with dictionary and datetime index
  BUG: .fillna() for datetime64 with tz is passing thru floats
  BUG: Patch read_csv NA values behaviour
  ENH: merge_asof() has type specializations and can take multiple 'by' parameters (pandas-dev#13936)
  [Backport pandas-dev#14886] BUG: regression in DataFrame.combine_first with integer columns (GH14687) (pandas-dev#14886)
  Fixed KDE Plot to drop the missing values (pandas-dev#14820)
  ENH: merge_asof() has left_index/right_index and left_by/right_by (pandas-dev#14253) (pandas-dev#14531)
  TST: correct url for test file on s3 (xref pandas-dev#14587)
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
IO CSV read_csv, to_csv Testing pandas testing functions or related to the test suite
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants