
unclear reference to lambda baseline #2

Closed
kbenoit opened this issue Jan 7, 2018 · 15 comments

kbenoit (Owner) commented Jan 7, 2018

In https://github.com/kbenoit/sophistication/blob/master/R/predict.R#L138, the code refers to `reference`, but this is left over from older code, before we changed the arguments to `reference_top` and `reference_bottom`.

@ArthurSpirling @kmunger can you recall which one this is supposed to be?

ArthurSpirling (Collaborator) commented Jan 7, 2018 via email

ArthurSpirling (Collaborator) commented Jan 7, 2018 via email

kmunger (Collaborator) commented Jan 12, 2018

That's right: the `reference` call is from older code. The current code has hardcoded top and bottom values that were derived by simply sorting the extreme lambdas on the SOTU corpus. When I did this, I left the older code in place and just added an extra column with the new, hardcoded approach.

The immediate solution is to get rid of the old code, which I can do easily. But the longer-term question is whether these values should be user-definable: should we use the SOTU values as defaults and let users change them if they want?
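
For concreteness, a minimal sketch of that derivation, assuming `sotu_lambdas` is a numeric vector of per-text lambda estimates from the SOTU corpus (a hypothetical object, not something shipped in the package):

```r
## Toy values standing in for per-text SOTU lambda estimates
## (`sotu_lambdas` is hypothetical, not an object in the package)
sotu_lambdas <- c(-2.1, 0.3, 1.8, 2.9, 4.6)

## Sorting and taking the extremes yields the hardcoded endpoints
sorted_lambdas   <- sort(sotu_lambdas)
reference_bottom <- sorted_lambdas[1]                       # lowest lambda
reference_top    <- sorted_lambdas[length(sorted_lambdas)]  # highest lambda
```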

ArthurSpirling (Collaborator) commented Jan 12, 2018 via email

kmunger (Collaborator) commented Jan 12, 2018

OK, I'll get on that.

And I've realized what the problem is: we've confused the `reference` (used to compute the probability scores) with the endpoints used for rescaling. These don't necessarily have to come from the same source, but all three do need to be supplied as defaults (or be user-provided).

I'm currently rewriting the documentation to reflect what we're doing:

The default value for `reference` is the lambda estimated across the fifth-grade texts; our "prob" output thus calculates the probability that a text is easier than these.

The default values for `reference_top` and `reference_bottom` come from the extremes of the SOTU corpus, and are used to rescale texts onto the 0-100 scale.
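
For concreteness, a minimal sketch of how the three values could enter the computation; the function and argument names here are illustrative, not the package's exact API:

```r
## Rescale a lambda estimate onto the 0-100 scale using the SOTU extremes
## (illustrative helper, not the package's actual function)
rescale_lambda <- function(lambda, reference_bottom, reference_top) {
  100 * (lambda - reference_bottom) / (reference_top - reference_bottom)
}

## Probability that a text is easier than the fifth-grade reference,
## treating the lambda estimate as approximately normal; the direction
## of "easier" here is an assumption about the model's parameterization
prob_easier <- function(lambda, lambda_se, reference) {
  1 - pnorm(reference, mean = lambda, sd = lambda_se)
}
```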

Are these the defaults we want?

ArthurSpirling (Collaborator) commented Jan 12, 2018 via email

kmunger (Collaborator) commented Jan 12, 2018 via email

ArthurSpirling (Collaborator) commented Jan 12, 2018 via email

kmunger (Collaborator) commented Jan 13, 2018

Ok, made these changes.

kmunger closed this as completed Jan 13, 2018

kbenoit (Owner, Author) commented Jan 15, 2018

Thanks, I think that corrected it. @kmunger, with e7ac504 the package now passes the CRAN check, except for the too-large data objects.

Note that I removed the article_manuscript and manuscript_chapter folders, since these should only be in the sophistication-papers repository.

ArthurSpirling (Collaborator) commented Jan 15, 2018 via email

kbenoit (Owner, Author) commented Jan 16, 2018

No, we would still need to submit it, but first cut out the large data objects. There is a 5 MB size limit on CRAN packages and we are way over it (26.1 Mb). Most of those objects were only for replicating our analysis, however, and they could be removed from the package.

There are also some documentation and robustness (testing!) issues that need to be addressed before it's released as a general tool. I've spoken to @kmunger about this and am happy to guide work in this area.

ArthurSpirling (Collaborator) commented Jan 16, 2018 via email

kmunger (Collaborator) commented Jan 16, 2018 via email

kbenoit (Owner, Author) commented Jan 16, 2018

The best approach would be to create the replication materials needed for our chapter and paper, removing the larger objects from the package as needed but using the package functions to reproduce the results. Each time you make a data object local, you can remove it from the package.
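
A minimal sketch of that workflow, where `make_data_object()` and `my_corpus` are hypothetical stand-ins for whichever package function and input actually produce a given object:

```r
library(sophistication)

## Reproduce the result with the package's own functions
## (make_data_object() and my_corpus are hypothetical stand-ins)
obj <- make_data_object(my_corpus)

## Save it alongside the replication materials instead of in the package
saveRDS(obj, "replication/obj.rds")

## The bundled copy under data/ can then be deleted from the package,
## shrinking the installed size toward CRAN's 5 MB limit
```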
