usda data scraping and cleaning process #36
Comments
@dbuona @lizzieinvancouver @DeirdreLoughnan
All cleaning scripts can be found in "scrapeUSDAseedmanual/cleaning" and all intermediate data files can be found in "scrapeUSDAseedmanual/output/earlyIterationDataSheets". Ideally I would have done this all in one script, but along the way we ran into a couple of issues involving data cleaning that Selena was already working on, so we felt it made more sense to build new scripts on top of the older ones so that we wouldn't tamper with changes made by hand in Excel. I forgot to mention that "germinationCleaningFinal.R" produced the output "USDAGerminationCleanedFinal.csv", which is what I used to make the "usdaGerminationMaster.xlsx" file. There's another file in earlyIterationDataSheets called "usdaGerminationJINJJA.csv", which I made as a backup because my RStudio was bugging out on me while I was writing the EGRET correction script. |
@buniwuuu @ngoj1 I've started to combine all of the cleaning scripts into a master source file called clearnmerge_all_usda.R, but I can't get the 3 cleaning scripts to run sequentially. Can you work on getting it to run? |
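For anyone picking this up, here is a minimal sketch of one way a master script could run cleaning scripts in order with source(). It is not the repository's code: the three script names and their contents are hypothetical stand-ins written to a temporary folder, and it assumes each script reads and updates a shared data frame called `d`.

```r
# Hypothetical stand-ins for the three cleaning scripts (not the real files).
scriptDir <- file.path(tempdir(), "cleaning")
dir.create(scriptDir, showWarnings = FALSE)
writeLines("d <- data.frame(x = c(1, NA, 3))", file.path(scriptDir, "step1.R"))
writeLines("d <- d[!is.na(d$x), , drop = FALSE]", file.path(scriptDir, "step2.R"))
writeLines("d$x2 <- d$x * 2", file.path(scriptDir, "step3.R"))

scripts <- file.path(scriptDir, c("step1.R", "step2.R", "step3.R"))

# Run every script in the same environment so each one sees the objects
# created by the previous one.
env <- new.env()
for (s in scripts) {
  message("Running ", basename(s))
  tryCatch(
    source(s, local = env),
    error = function(e) {
      # Report which script failed instead of a bare error message.
      stop("Failed in ", basename(s), ": ", conditionMessage(e), call. = FALSE)
    }
  )
}

d <- get("d", envir = env)
print(d)
```

Sourcing into one shared environment keeps the intermediate objects out of the global workspace, and the tryCatch wrapper names the script that broke, which may help when the scripts depend on each other.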
@FrederikBaumgarten possible helpful pseudocode:
|
Just to summarize here is what needs to happen on this code:
|
Hello,
Sorry for the late reply! I just finished up taking grad photos with my
family and Britany and I will be doing shoot elongation until 2pm, but
after that I will try troubleshooting the code and converting the parts
that were edited manually in Excel into R code.
Best regards,
Justin
On Wed., Jul. 10, 2024, 10:15 a.m. dbuona ***@***.***> wrote:
Just to summarize here is what needs to happen on this code:
- Get all the cleaning files to run without externally manipulating them in Excel
- In germination_cleaning.R combine the cold stratification entries into the chilling columns (i.e. cold stratification and chilling are the same thing, and both should be considered chilling)
|
@dbuona I pushed a new script called "cleanmerge_all_usda_JNVER" ("Justin version", since I wanted to keep your original script as a backup). Instead of using source(), I combined all the code from the three scripts into one file and then reproduced the majority of the changes we made manually in Excel by combing carefully through the columns. I'm hoping there aren't any weird values left, but I may have missed a few. In any case, Selena would have addressed these weird values in Issue #20, and if you ever encounter them and need me to fix them, I can do that in this new script.
My laptop is about to run out of battery as I forgot to bring my charger but I can address this when I get home later tonight! If there are any warnings that pop up in the code please let me know and I will backtrack and figure it out. |
Sorry this took so long! In the "cleanmerge_all_usda_JNVER" script I added a section at the very bottom where I copied all of the cold.strat.dur.XXX (Avg, Min, and Max) column data into new columns called chill.dur.XXX.comb (for "combined") and just ran some tests to make sure that no NAs were being made or data being overwritten. I decided to put this all in a new column just so that we still have the original chill.dur.XXX columns prior to the merge in case we need them separated. |
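The merge described above could look roughly like the sketch below. This is an illustration, not the code in "cleanmerge_all_usda_JNVER": the data values are made up, the helper name `combineChill` is hypothetical, and it assumes a row never carries both a chilling and a cold stratification value (the function stops if that assumption fails rather than overwriting anything).

```r
# Toy data: column names follow the comment above, the values are made up.
d <- data.frame(
  chill.dur.Avg      = c(30, NA, NA, 60),
  cold.strat.dur.Avg = c(NA, 45, NA, NA),
  chill.dur.Min      = c(20, NA, NA, 50),
  cold.strat.dur.Min = c(NA, 40, NA, NA),
  chill.dur.Max      = c(40, NA, NA, 70),
  cold.strat.dur.Max = c(NA, 50, NA, NA)
)
original <- d

# Hypothetical helper: build chill.dur.<stat>.comb from the two source columns.
combineChill <- function(d, stat) {
  chill <- d[[paste0("chill.dur.", stat)]]
  strat <- d[[paste0("cold.strat.dur.", stat)]]

  # Assumption: no row has both values. Stop instead of silently choosing one.
  clash <- !is.na(chill) & !is.na(strat)
  if (any(clash)) {
    stop("Rows with both chilling and cold stratification values: ",
         paste(which(clash), collapse = ", "))
  }

  # Keep the chilling value where present, otherwise take cold stratification.
  comb <- ifelse(is.na(chill), strat, chill)

  # No values lost and no new NAs: the non-missing counts must add up.
  stopifnot(sum(!is.na(comb)) == sum(!is.na(chill)) + sum(!is.na(strat)))

  d[[paste0("chill.dur.", stat, ".comb")]] <- comb
  d
}

for (stat in c("Avg", "Min", "Max")) {
  d <- combineChill(d, stat)
}

# The original columns are left untouched.
stopifnot(isTRUE(all.equal(d[names(original)], original)))
print(d[, c("chill.dur.Avg", "cold.strat.dur.Avg", "chill.dur.Avg.comb")])
```

Writing to separate `.comb` columns, as the comment describes, keeps the pre-merge chill.dur.XXX columns available if they are needed later.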
@dbuona Could you take a look at this and get us down to one functional script? It should be called cleanAllUsda.R ... and please delete all the other scripts and extraneous files. |
Working on understanding the relevant files: @ngoj1 1) Does that seem correct? 2) Does that mean all other R files in the cleaning folder within the scrapeUSDAseedmanual folder are no longer in use now that cleanmerge_all_usda_JNVER.R exists? |
@dbuona Can you make sure:
Thanks! |
@dbuona Yes, the new script combines the three separate scripts and also reproduces the manually changed values, so any other script related to the germination data should be outdated, but I can't speak to the scripts used for phenology cleaning. Sorry for the late reply! I was away at Mt. Rainier this past week. |
@ngoj1 Thank you! Do you know who worked on the phenology scripts? |
@buniwuuu and @selenashew did the phenology and seed data cleaning!! |
@selenashew Can you take a look at @dbuona queries from 3 days ago and reply regarding the phenology part? Thanks. |
Hi everyone, My apologies for the late reply! I was away in Singapore for 2 weeks and must have missed this issue when it was first opened; I am now getting back up to speed. Justin is completely correct and has covered all of the bases in terms of the process & scripts used for the germination data. The process for the phenology and seed data was completely different, since those data could be scraped from the PDF quite cleanly. The steps I decided on were as follows:
|
Hi everyone, As a quick update, I am working on creating the cleaning scripts in R for the phen & seed data and am aiming to have that done by the end of next week! |
@selenashew Sounds good -- thank you! |
The phen & seed data have been cleaned and can be found in the output folder as "cleaned_phen_data_final.csv" and "cleaned_seed_data_final.csv". The cleaning script used is found in the cleaning folder as "phen_seed_data_final_cleaning_script.R". |
@selenashew Thank you! Would you mind updating your files to follow our naming conventions (camelCase)? |
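A small sketch of how the renaming could be scripted, assuming "camelCase" here means dropping each underscore and capitalizing the character that follows it. The helper name `toCamelCase` is hypothetical, and the resulting file names are only what this rule produces, not names confirmed in this thread.

```r
# Hypothetical helper: convert snake_case file names to camelCase.
toCamelCase <- function(fileNames) {
  # "\\U\\1" upper-cases the character captured after each underscore
  # (this replacement syntax requires perl = TRUE).
  gsub("_([A-Za-z0-9])", "\\U\\1", fileNames, perl = TRUE)
}

oldNames <- c(
  "cleaned_phen_data_final.csv",
  "cleaned_seed_data_final.csv",
  "phen_seed_data_final_cleaning_script.R"
)
newNames <- toCamelCase(oldNames)
print(data.frame(old = oldNames, new = newNames))

# To apply the change on disk, from the folder that holds the files:
# file.rename(oldNames, newNames)
```

Any script that reads or writes these files by name would need the same update after renaming.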
@dbuona I found some things in this issue we should do ... from the retreat: @dbuona Can you make sure:
Thanks! And I will add now:
|
I am pulling out stuff on phen and seed cleaning to a new issue #68 so we can use this issue as the (hopefully) last issue relating to cleaning germination data from USDA seed manual. |
I am going to close this |
@dbuona Whoop! |