Updated repo to python3 #2

zjijz · 2017-12-01T04:39:04Z

I needed to test this with a python 3 project, so I updated the logic as-is to be compatible with Python 3. Reran the tests to ensure they all passed.

jegesh · 2017-12-02T20:13:54Z

Did you run the tests with python 2.x as well? I'm not familiar with the next(reader) syntax you used on line 97, that's the only reason I'm asking.

carvetighter · 2017-12-07T00:19:36Z

in RandomAccessReader method _get_line_data() variable current_line needs to start and reset to 1 not 0. Also you need to determine if you want to the "new line" character in the line returned. If you want to keep it then there are no changes. If you want to drop it the length of the line would be current_line - 1.

jegesh · 2017-12-07T18:10:32Z

@carvetighter Is your comment a continuation of my question, or a separate issue? Are you addressing the current code or the PR code?

carvetighter · 2017-12-07T20:25:34Z

@jegesh This is a continuation of your question. I copied the code and tested it on test_file.csv and the length of each line wasn't correct. It would often pull "/\n" + the next line. I tested this on Python 3.x not on Python 2.7; on the PR code. I hope that helps.

carvetighter · 2017-12-07T22:06:13Z

Also, I tested this on the Airline Dataset (here) and I ran out of memory. It's about 12Gb. The dictionary of lines turns out to be about 128 million entries. On a laptop with 8Gb Ram I get an memory error. I'm currently working on a sampling version based off of your code here.

jegesh · 2017-12-10T11:20:22Z

@carvetighter It would be really awesome if you could profile that test run with the airline dataset - 128 million entries doesn't sound like it should take up a whole lot of memory.

@zjijz Please see @carvetighter's comment about the lines not parsing correctly.

zjijz · 2017-12-11T20:00:08Z

Sorry about not responding, github wasn't sending me notifications for whatever reason.

I really only did this because I needed a random access file so I could avoid putting over 10000 images into memory at once. I tried this repo because I needed a project that wouldn't assume lines were the same length, and it worked flawlessly. Thank you for all your work so far!

This will not work with Python 2 due to breaking changes between Python 2 and 3. The Python 3 version would need to be provided as a separate .whl file.

Most of these changes are restructuring of packages but there are also generator and range changes. next(...) is the new syntax for generators since they were converted to language-level constructs.

In terms of @carvetighter's comments, I forked the repo as-is and just changed Python 2 constructs to Python 3 ones. I can update with these changes, but it might be better to update the base repo for Python 2 and I can update this pull request from there.

I disagree that line indexing should start at 1, and that is more Pythonic to start at 0. The newline being included is a matter of preference but note that some OS's use \r\n or just \n to indicate new lines so current_line - 1 might not always strip the newline characters.

zjijz · 2017-12-11T20:05:52Z

@jegesh Also, it is probably possible to have once merged library for Python 2 and 3 where the Python version is checked at import and the differences in language versions are abstracted and filled in properly for each version. I'm not sure if this is what big libraries like NumPy or TensorFlow did though.

carvetighter · 2017-12-12T02:34:13Z

@zjijz I probably miscommunicated; it's not a matter of pythonic or not; it didn't pass the optest with the test data; the line would continuously end short by n + 1 characters. to get it to work I had to change it to start at 1 not 0.

zjijz · 2017-12-12T03:27:58Z

@carvetighter I'm not having any tests fail on my end. Which tests are failing for you?

carvetighter · 2017-12-20T16:40:11Z

@zjijz sorry for the delay, just finished my Masters; you are correct I was making a mistake on my testing

In testing on the airline dataset (12GB) I am getting a memory error. Below is a small snipit of some test code which I'm demonstrating the size of the list of dictionaries when mapping the file. For a 10 million line file the list of a dictionaries is 81 MB. If you have a 120 million line file the list may be 972 MB.

I guess my question is how big of a file is too big for this algorithm?

import sys

list_lines = list()
for int_line in range(0, 10000000):
list_lines.append({'position':200, 'length':340})

print(int_line)
print(sys.getsizeof(list_lines))

jegesh · 2017-12-20T20:36:03Z

@carvetighter The answer, it appears, is that it depends on the free memory on your machine...

jegesh · 2017-12-20T20:37:42Z

It appears from the conversation that the tests are passing for python 2.7 with the PR code. Can I get confirmation on this?

carvetighter · 2017-12-20T21:17:44Z

I think I'm going to copy the repo and write some modifications testing the number of lines of the file. If the number of lines is more than 1 million, completely arbitrary, I'm going to estimate the line length. I'm going to give you @zjijz and @jegesh the credit. I'm going to simplify the code a bit as well.

zjijz · 2017-12-21T15:04:02Z

@jegesh The tests in this pull request are passing for Python 3, but the same code won't be interpretable by Python 2. What approach would you like to take for merging Python 2 and Python 3 source code?

@carvetighter Are you using this in your thesis?

carvetighter · 2017-12-21T15:38:41Z

@zjijz I used it in a project dealing with the airline data; the line estimation worked really well on a large file; for the 12 GB file it took about 1.5 minutes to count the lines in the file; this will also be useful at work in dealing with large file extracts; I finished my code last night an I just need to document it; I broke it up into three main classes vice two (Base Class, Text Sampler, Csv Sampler); it may be hard reading my code since I started coding in the late 90's with C++ so I have the data type in my naming convention and my documentation is verbose; I capped the file length at 1 million rows to use the list of dictionaries as a line index

@jegesh I would recommend two repo's; one for Python 2.x and one for Python 3.x; since Python 2.x isn't actively supported it might be useful to have two repo's; my two cents

jegesh · 2018-01-01T21:31:13Z

Is two repos somehow better (safer from a development standpoint, easier to maintain) than just maintaining two branches, one for python 2 and one for python 3? I've never had the need to maintain separate versions for different versions of python before. Would the six library come in handy here?

carvetighter · 2018-01-01T21:37:09Z

I think it would be easier to have two branches. I don't know how that would work with PIP. The Python 2 code is not compatible with the Python 3 code. You will need to keep them separate somehow. You could start another repo but you would have to maintain another repo. It's really up to you.

jegesh · 2018-01-01T22:01:16Z

I just downloaded @zjijz 's code and run it with python 2.7. There were two tests that failed. I was able to fix it by adding a single word to the code. Now all the tests pass (with flying colors, I dare say). And here it is time for six to shine, since the word I had to add was unicode(). I'll merge the PR and add a commit to make the code universal again, and then redeploy to pypi.

jegesh · 2018-01-01T22:14:57Z

I pushed an additional commit which should make it work for both python 2.7 and 3. I tested it with 2.7, I would appreciate if someone could run the tests against the latest commit before I deploy the new version to pypi. Thanks bros

Updated to python3

b947e26

jegesh merged commit 4291d93 into jegesh:master Jan 1, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Updated repo to python3 #2

Updated repo to python3 #2

zjijz commented Dec 1, 2017

jegesh commented Dec 2, 2017

carvetighter commented Dec 7, 2017

jegesh commented Dec 7, 2017

carvetighter commented Dec 7, 2017 •

edited

Loading

carvetighter commented Dec 7, 2017

jegesh commented Dec 10, 2017

zjijz commented Dec 11, 2017 •

edited

Loading

zjijz commented Dec 11, 2017

carvetighter commented Dec 12, 2017

zjijz commented Dec 12, 2017

carvetighter commented Dec 20, 2017 •

edited

Loading

jegesh commented Dec 20, 2017

jegesh commented Dec 20, 2017 •

edited

Loading

carvetighter commented Dec 20, 2017

zjijz commented Dec 21, 2017

carvetighter commented Dec 21, 2017

jegesh commented Jan 1, 2018

carvetighter commented Jan 1, 2018

jegesh commented Jan 1, 2018

jegesh commented Jan 1, 2018

Updated repo to python3 #2

Updated repo to python3 #2

Conversation

zjijz commented Dec 1, 2017

jegesh commented Dec 2, 2017

carvetighter commented Dec 7, 2017

jegesh commented Dec 7, 2017

carvetighter commented Dec 7, 2017 • edited Loading

carvetighter commented Dec 7, 2017

jegesh commented Dec 10, 2017

zjijz commented Dec 11, 2017 • edited Loading

zjijz commented Dec 11, 2017

carvetighter commented Dec 12, 2017

zjijz commented Dec 12, 2017

carvetighter commented Dec 20, 2017 • edited Loading

jegesh commented Dec 20, 2017

jegesh commented Dec 20, 2017 • edited Loading

carvetighter commented Dec 20, 2017

zjijz commented Dec 21, 2017

carvetighter commented Dec 21, 2017

jegesh commented Jan 1, 2018

carvetighter commented Jan 1, 2018

jegesh commented Jan 1, 2018

jegesh commented Jan 1, 2018

carvetighter commented Dec 7, 2017 •

edited

Loading

zjijz commented Dec 11, 2017 •

edited

Loading

carvetighter commented Dec 20, 2017 •

edited

Loading

jegesh commented Dec 20, 2017 •

edited

Loading