Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Updated repo to python3 #2

Merged
merged 1 commit into from
Jan 1, 2018
Merged

Updated repo to python3 #2

merged 1 commit into from
Jan 1, 2018

Conversation

zjijz
Copy link
Contributor

@zjijz zjijz commented Dec 1, 2017

I needed to test this with a python 3 project, so I updated the logic as-is to be compatible with Python 3. Reran the tests to ensure they all passed.

@jegesh
Copy link
Owner

jegesh commented Dec 2, 2017

Did you run the tests with python 2.x as well? I'm not familiar with the next(reader) syntax you used on line 97, that's the only reason I'm asking.

@carvetighter
Copy link

in RandomAccessReader method _get_line_data() variable current_line needs to start and reset to 1 not 0. Also you need to determine if you want to the "new line" character in the line returned. If you want to keep it then there are no changes. If you want to drop it the length of the line would be current_line - 1.

@jegesh
Copy link
Owner

jegesh commented Dec 7, 2017

@carvetighter Is your comment a continuation of my question, or a separate issue? Are you addressing the current code or the PR code?

@carvetighter
Copy link

carvetighter commented Dec 7, 2017

@jegesh This is a continuation of your question. I copied the code and tested it on test_file.csv and the length of each line wasn't correct. It would often pull "/\n" + the next line. I tested this on Python 3.x not on Python 2.7; on the PR code. I hope that helps.

@carvetighter
Copy link

Also, I tested this on the Airline Dataset (here) and I ran out of memory. It's about 12Gb. The dictionary of lines turns out to be about 128 million entries. On a laptop with 8Gb Ram I get an memory error. I'm currently working on a sampling version based off of your code here.

@jegesh
Copy link
Owner

jegesh commented Dec 10, 2017

@carvetighter It would be really awesome if you could profile that test run with the airline dataset - 128 million entries doesn't sound like it should take up a whole lot of memory.

@zjijz Please see @carvetighter's comment about the lines not parsing correctly.

@zjijz
Copy link
Contributor Author

zjijz commented Dec 11, 2017

Sorry about not responding, github wasn't sending me notifications for whatever reason.

I really only did this because I needed a random access file so I could avoid putting over 10000 images into memory at once. I tried this repo because I needed a project that wouldn't assume lines were the same length, and it worked flawlessly. Thank you for all your work so far!

This will not work with Python 2 due to breaking changes between Python 2 and 3. The Python 3 version would need to be provided as a separate .whl file.

Most of these changes are restructuring of packages but there are also generator and range changes. next(...) is the new syntax for generators since they were converted to language-level constructs.

In terms of @carvetighter's comments, I forked the repo as-is and just changed Python 2 constructs to Python 3 ones. I can update with these changes, but it might be better to update the base repo for Python 2 and I can update this pull request from there.

I disagree that line indexing should start at 1, and that is more Pythonic to start at 0. The newline being included is a matter of preference but note that some OS's use \r\n or just \n to indicate new lines so current_line - 1 might not always strip the newline characters.

@zjijz
Copy link
Contributor Author

zjijz commented Dec 11, 2017

@jegesh Also, it is probably possible to have once merged library for Python 2 and 3 where the Python version is checked at import and the differences in language versions are abstracted and filled in properly for each version. I'm not sure if this is what big libraries like NumPy or TensorFlow did though.

@carvetighter
Copy link

@zjijz I probably miscommunicated; it's not a matter of pythonic or not; it didn't pass the optest with the test data; the line would continuously end short by n + 1 characters. to get it to work I had to change it to start at 1 not 0.

@zjijz
Copy link
Contributor Author

zjijz commented Dec 12, 2017

@carvetighter I'm not having any tests fail on my end. Which tests are failing for you?

@carvetighter
Copy link

carvetighter commented Dec 20, 2017

@zjijz sorry for the delay, just finished my Masters; you are correct I was making a mistake on my testing

In testing on the airline dataset (12GB) I am getting a memory error. Below is a small snipit of some test code which I'm demonstrating the size of the list of dictionaries when mapping the file. For a 10 million line file the list of a dictionaries is 81 MB. If you have a 120 million line file the list may be 972 MB.

I guess my question is how big of a file is too big for this algorithm?

import sys

list_lines = list()
for int_line in range(0, 10000000):
list_lines.append({'position':200, 'length':340})

print(int_line)
print(sys.getsizeof(list_lines))

capture

@jegesh
Copy link
Owner

jegesh commented Dec 20, 2017

@carvetighter The answer, it appears, is that it depends on the free memory on your machine...

@jegesh
Copy link
Owner

jegesh commented Dec 20, 2017

It appears from the conversation that the tests are passing for python 2.7 with the PR code. Can I get confirmation on this?

@carvetighter
Copy link

I think I'm going to copy the repo and write some modifications testing the number of lines of the file. If the number of lines is more than 1 million, completely arbitrary, I'm going to estimate the line length. I'm going to give you @zjijz and @jegesh the credit. I'm going to simplify the code a bit as well.

@zjijz
Copy link
Contributor Author

zjijz commented Dec 21, 2017

@jegesh The tests in this pull request are passing for Python 3, but the same code won't be interpretable by Python 2. What approach would you like to take for merging Python 2 and Python 3 source code?

@carvetighter Are you using this in your thesis?

@carvetighter
Copy link

@zjijz I used it in a project dealing with the airline data; the line estimation worked really well on a large file; for the 12 GB file it took about 1.5 minutes to count the lines in the file; this will also be useful at work in dealing with large file extracts; I finished my code last night an I just need to document it; I broke it up into three main classes vice two (Base Class, Text Sampler, Csv Sampler); it may be hard reading my code since I started coding in the late 90's with C++ so I have the data type in my naming convention and my documentation is verbose; I capped the file length at 1 million rows to use the list of dictionaries as a line index

@jegesh I would recommend two repo's; one for Python 2.x and one for Python 3.x; since Python 2.x isn't actively supported it might be useful to have two repo's; my two cents

@jegesh
Copy link
Owner

jegesh commented Jan 1, 2018

Is two repos somehow better (safer from a development standpoint, easier to maintain) than just maintaining two branches, one for python 2 and one for python 3? I've never had the need to maintain separate versions for different versions of python before. Would the six library come in handy here?

@carvetighter
Copy link

I think it would be easier to have two branches. I don't know how that would work with PIP. The Python 2 code is not compatible with the Python 3 code. You will need to keep them separate somehow. You could start another repo but you would have to maintain another repo. It's really up to you.

@jegesh
Copy link
Owner

jegesh commented Jan 1, 2018

I just downloaded @zjijz 's code and run it with python 2.7. There were two tests that failed. I was able to fix it by adding a single word to the code. Now all the tests pass (with flying colors, I dare say). And here it is time for six to shine, since the word I had to add was unicode(). I'll merge the PR and add a commit to make the code universal again, and then redeploy to pypi.

@jegesh jegesh merged commit 4291d93 into jegesh:master Jan 1, 2018
@jegesh
Copy link
Owner

jegesh commented Jan 1, 2018

I pushed an additional commit which should make it work for both python 2.7 and 3. I tested it with 2.7, I would appreciate if someone could run the tests against the latest commit before I deploy the new version to pypi. Thanks bros

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants