
unclear reference to lambda baseline #2

Closed
kbenoit opened this issue Jan 7, 2018 · 15 comments

kbenoit (Owner) commented Jan 7, 2018

In https://github.com/kbenoit/sophistication/blob/master/R/predict.R#L138, the code refers to `reference`, but this is left over from older code, before we changed the arguments to `reference_top` and `reference_bottom`.

@ArthurSpirling @kmunger can you recall which one this is supposed to be?

ArthurSpirling (Collaborator) commented Jan 7, 2018 via email

ArthurSpirling (Collaborator) commented Jan 7, 2018 via email

kmunger (Collaborator) commented Jan 12, 2018

That's right: the `reference` call is from older code. The current code has hardcoded top and bottom values that were derived by simply sorting the extreme lambdas on the SOTU corpus. When I did this, I left the older code in place and just added an extra column with the new, hardcoded approach.

The immediate solution is to get rid of the old code, which I can do easily. But the longer-term question is whether these values should be user-definable: should we use the SOTU values as defaults and let users change them if they want?
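
For concreteness, a minimal sketch of that derivation, assuming `sotu_lambdas` is a numeric vector of per-text lambda estimates from the SOTU corpus (a hypothetical object, not something shipped in the package):

```r
## Toy values standing in for per-text SOTU lambda estimates
## (`sotu_lambdas` is hypothetical, not an object in the package)
sotu_lambdas <- c(-2.1, 0.3, 1.8, 2.9, 4.6)

## Sorting and taking the extremes yields the hardcoded endpoints
sorted_lambdas   <- sort(sotu_lambdas)
reference_bottom <- sorted_lambdas[1]                       # lowest lambda
reference_top    <- sorted_lambdas[length(sorted_lambdas)]  # highest lambda
```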

ArthurSpirling (Collaborator) commented Jan 12, 2018 via email

kmunger (Collaborator) commented Jan 12, 2018

OK, I'll get on that.

And I've realized what the problem is: we've confused the `reference` (used to compute the probability scores) with the endpoints used for rescaling. These don't necessarily have to come from the same source, but all three do need to be supplied as defaults (or be user-provided).

I'm currently rewriting the documentation to reflect what we're doing:

The default value for `reference` is the lambda estimated across the fifth-grade texts; our "prob" output thus calculates the probability that a text is easier than these.

The default values for `reference_top` and `reference_bottom` come from the extremes of the SOTU corpus, and are used to rescale texts onto the 0-100 scale.
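
For concreteness, a minimal sketch of how the three values could enter the computation; the function and argument names here are illustrative, not the package's exact API:

```r
## Rescale a lambda estimate onto the 0-100 scale using the SOTU extremes
## (illustrative helper, not the package's actual function)
rescale_lambda <- function(lambda, reference_bottom, reference_top) {
  100 * (lambda - reference_bottom) / (reference_top - reference_bottom)
}

## Probability that a text is easier than the fifth-grade reference,
## treating the lambda estimate as approximately normal; the direction
## of "easier" here is an assumption about the model's parameterization
prob_easier <- function(lambda, lambda_se, reference) {
  1 - pnorm(reference, mean = lambda, sd = lambda_se)
}
```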

Are these the defaults we want?

ArthurSpirling (Collaborator) commented Jan 12, 2018 via email

kmunger (Collaborator) commented Jan 12, 2018 via email

ArthurSpirling (Collaborator) commented Jan 12, 2018 via email

kmunger (Collaborator) commented Jan 13, 2018

Ok, made these changes.

kmunger closed this as completed Jan 13, 2018

kbenoit (Owner, Author) commented Jan 15, 2018

Thanks, I think that corrected it. @kmunger, with e7ac504 the package now passes the CRAN check, except for the too-large data objects.

Note that I removed the article_manuscript and manuscript_chapter folders, since these should only be in the sophistication-papers repository.

ArthurSpirling (Collaborator) commented Jan 15, 2018 via email

kbenoit (Owner, Author) commented Jan 16, 2018

No, we would still need to submit it, but first cut out the large data objects. There is a 5 MB size limit on CRAN packages and we are way over it (26.1 Mb). Most of those objects were only for replicating our analysis, however, and they could be removed from the package.

There are also some documentation and robustness (testing!) issues that need to be addressed before it's released as a general tool. I've spoken to @kmunger about this and am happy to guide work in this area.

ArthurSpirling (Collaborator) commented Jan 16, 2018 via email

kmunger (Collaborator) commented Jan 16, 2018 via email

kbenoit (Owner, Author) commented Jan 16, 2018

The best approach would be to create the replication materials needed for our chapter and paper, removing the larger objects from the package as needed but using the package functions to reproduce the results. Each time you make a data object local, you can remove it from the package.
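
A minimal sketch of that workflow, where `make_data_object()` and `my_corpus` are hypothetical stand-ins for whichever package function and input actually produce a given object:

```r
library(sophistication)

## Reproduce the result with the package's own functions
## (make_data_object() and my_corpus are hypothetical stand-ins)
obj <- make_data_object(my_corpus)

## Save it alongside the replication materials instead of in the package
saveRDS(obj, "replication/obj.rds")

## The bundled copy under data/ can then be deleted from the package,
## shrinking the installed size toward CRAN's 5 MB limit
```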
