
problem with load-tsv function #57

Closed
micrub opened this issue Sep 7, 2014 · 6 comments · Fixed by #62

@micrub

micrub commented Sep 7, 2014

Hi,

I am trying to run the following function on a TSV file with more than 100k lines on my laptop. The function looks like this:

(defn- hashed-data  [file-name]
  (->>
    (pig/load-tsv file-name)
    (pig/map  (fn  [[ & args]]
                [args]))))

Instead of getting the same number of lines as in the input file, I always get only 1000 items.
Am I missing something obvious?

@mapstrchakra
Contributor

You need to change the binding that controls how many results are returned. By default it returns 1000 items in the REPL:

(def ^:dynamic *max-load-records* 1000)

You can wrap your function with a *max-load-records* binding as follows:

(defn- hashed-data [file-name]
  (binding [pigpen.local/*max-load-records* 100000]
    (->>
      (pig/load-tsv file-name)
      (pig/map (fn [[& args]]
                 [args])))))


@mbossenbroek
Contributor

Yeah, I put that cap in there because the version of rx I'm using doesn't unsubscribe properly from the observable. It's kind of a hacky fix, but this prevents you from processing potentially large files just to throw the result away. In general, the REPL should only be used for vetting your code & then you'd run at scale on the cluster, but 100k should be well within the limits of what it can handle locally.

At Netflix, we sample large GB files over the network directly into pigpen - without this limit it was just continuing to download the file on a background thread & slowing down the REPL. This was painful when I just wanted the first 10 records.

The longer term fix is to upgrade the version of rx that I'm using, but they tend to break their API frequently so I've been waiting for v1.0 to be released.

-Matt


@micrub
Author

micrub commented Sep 7, 2014

Thanks for the clarification, though the following wrapping didn't solve the issue in the REPL; I'm still getting 1000 items back:

(binding [pigpen.local/*max-load-records* 100000])

@mbossenbroek
Contributor

Your example shows the trailing paren after the binding expression… To use binding, you need to enclose the code that requires the rebinding.

(binding [pigpen.local/*max-load-records* 100000]
  (->>
    (pig/load-tsv file-name)
    (pig/dump)))

In this case, you'd want to make sure that the code calling pig/dump is what gets wrapped, not the load command. The load command just builds an expression tree.

(def x (pig/load …))

(binding [pigpen.local/*max-load-records* 100000]
  (pig/dump x))
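The distinction can be sketched in plain Clojure, with hypothetical `load-plan`/`dump` stand-ins rather than PigPen's actual internals: the dynamic var is read only when the plan is *executed*, so the binding must be in effect around that call, not around the code that builds the plan.

```clojure
;; Minimal sketch (plain Clojure, not PigPen) of why the binding must
;; wrap the code that runs the query, not the code that builds it.
(def ^:dynamic *max-records* 1000)

;; Builds a description of the work; reads nothing yet.
(defn load-plan [coll]
  {:source coll})

;; Actually runs the plan; this is where *max-records* is read.
(defn dump [plan]
  (take *max-records* (:source plan)))

;; Wrong: the binding has already been popped by the time dump runs.
(def plan (binding [*max-records* 5] (load-plan (range 10000))))
(count (dump plan))
;;=> 1000 (the default, not 5)

;; Right: bind around the call that consumes the plan.
(binding [*max-records* 5]
  (count (dump plan)))
;;=> 5
```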

Let me know if that works for you.

-Matt


@mbossenbroek
Contributor

After thinking about this some more, I'm going to change the default to be unlimited and add this as an option to limit it only if you need it.

@mbossenbroek
Contributor

Fixed entirely in #61. This binding is no longer necessary.
