
usda data scraping and cleaning process #36

Closed
FrederikBaumgarten opened this issue Jul 9, 2024 · 23 comments


@FrederikBaumgarten
Collaborator

No description provided.

@FrederikBaumgarten
Collaborator Author

@dbuona @lizzieinvancouver @DeirdreLoughnan
here is what Justin did so far in his words:
workflow:

  • Started in the scrapeUSDAseedmanual folder

  • in the cleaning folder is "germination_master_spreadsheet.csv" which is the original data

  • Parsed into RStudio with the cleaning script titled "germinationCleaning.R", which does the bulk of the general cleaning: removing stray symbols, converting unreadable NAs into proper NA format, fixing species names, adding new columns for scarification and chilling, etc.

  • The output of this was called germinationCleaned.xlsx, which Selena then went through and manually fixed some issues caused by odd values in the USDA manual PDF

  • This was then saved as "germinationCleaned_official.csv"

  • Parsed this file into R through the cleaning script "germinationCleaningFinal.R", where I changed column names and pivoted the germination response data wider

  • I then got the comment from Deirdre asking for metadata and to make some more changes so it's closer in format to EGRET

  • Thus I converted this to an excel file and made a new sheet for the metadata, saving these two separately as .csv in case anyone wanted them as .csv

  • Parsed the cleaned data into a new cleaning script called "germinationEGRETCorrections.R" for the final round of touch-ups

All cleaning scripts can be found in "scrapeUSDAseedmanual/cleaning" and all intermediate data files can be found in "scrapeUSDAseedmanual/output/earlyIterationDataSheets"

Ideally I would have done this all in 1 script but we came across a couple issues along the way involving data cleaning that Selena was already working on, so we felt it made more sense to start new scripts upon older ones so that we wouldn't be tampering with changes done by hand through Excel.

I forgot to mention that "germinationCleaningFinal.R" gave the output "USDAGerminationCleanedFinal.csv" which was what I used to make the "usdaGerminationMaster.xlsx" file. There's another file in the earlyIterationDataSheets called "usdaGerminationJINJJA.csv" which I made as a backup because my RStudio was bugging out on me during the EGRET correction script writing.
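The general cleaning and pivoting steps described above can be sketched roughly as follows (a minimal illustration only; the column names, symbols, and NA strings are assumptions, not the actual contents of germinationCleaning.R):

```r
## Sketch of the general-cleaning steps described above (column and
## value names are illustrative, not the real ones in the scripts).
library(tidyr)

germ <- read.csv("cleaning/germination_master_spreadsheet.csv")

## strip stray symbols and convert unreadable NA strings to proper NAs
germ$germination.rate <- gsub("[*+#]", "", germ$germination.rate)
germ$germination.rate[germ$germination.rate %in% c("n/a", "-", "")] <- NA

## pivot the germination response data wider, one column per response type
germwide <- pivot_wider(germ, names_from = response.type,
                        values_from = germination.rate)
```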

@dbuona
Collaborator

dbuona commented Jul 10, 2024

@buniwuuu @ngoj1 I've started to combine all of the cleaning scripts into a master source file called cleanmerge_all_usda.R, but I can't get the 3 cleaning scripts to run sequentially. Can you work on getting it to run?
Ideally, it would also be best not to write out intermediate xlsx files between scripts. Let me know if you have any questions.
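One possible shape for that master script, sketched below: source the three cleaning scripts in order and only write the final output. This assumes each script is refactored to operate on a shared data frame in memory (here called `usda`) rather than reading and writing intermediate files, which is not how they currently work.

```r
## cleanmerge_all_usda.R -- hypothetical master script (sketch only).
## Assumes each sourced script modifies the `usda` data frame in the
## global environment instead of writing intermediate .xlsx files.

usda <- read.csv("cleaning/germination_master_spreadsheet.csv")

source("cleaning/germinationCleaning.R")         # general cleaning
source("cleaning/germinationCleaningFinal.R")    # rename columns, pivot wider
source("cleaning/germinationEGRETCorrections.R") # final EGRET-format touch-ups

write.csv(usda, "output/usdaGerminationCleanedFinal.csv", row.names = FALSE)
```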

@lizzieinvancouver
Owner

@FrederikBaumgarten possible helpful pseudocode:

chilldurminnoNA <- usda$species[!is.na(usda$chill.dur.min)]
chilldurmaxnoNA <- usda$species[!is.na(usda$chill.dur.max)]
respvarminnoNA <- usda$species[!is.na(usda$responsevarmin)]
respvarmaxnoNA <- usda$species[!is.na(usda$responsevarmax)]

sppwithminmaxchill <- chilldurminnoNA[chilldurminnoNA %in% chilldurmaxnoNA]
sppwithminmaxresp <- respvarminnoNA[respvarminnoNA %in% respvarmaxnoNA]

@dbuona
Collaborator

dbuona commented Jul 10, 2024

Just to summarize here is what needs to happen on this code:

  • Get all the cleaning files to run without externally manipulating them in excel
  • In germination_cleaning.R combine the cold stratification entries into the chilling columns (i.e., cold stratification and chilling are the same thing and should both be treated as chilling)

@ngoj1
Collaborator

ngoj1 commented Jul 10, 2024 via email

@ngoj1
Collaborator

ngoj1 commented Jul 10, 2024

Get all the cleaning files to run without externally manipulating them in excel

@dbuona I pushed a new script called "cleanmerge_all_usda_JNVER" (Justin version, since I wanted to keep your original script as a backup) where, instead of using source(), I combined all the code from the three scripts and then reproduced the majority of the changes we had made manually in Excel by fine-combing through the columns. I'm hoping there aren't any weird values left, but it's possible I missed a few. In any case, Selena would have addressed these weird values in Issue #20, and if you ever encounter them and need me to fix them, I can do that in this new script I've just pushed.

In germination_cleaning.R combine the cold stratification entries into the chilling columns (i.e., cold stratification and chilling are the same thing and should both be treated as chilling)

My laptop is about to run out of battery as I forgot to bring my charger but I can address this when I get home later tonight!

If there are any warnings that pop up in the code please let me know and I will backtrack and figure it out.

@ngoj1
Collaborator

ngoj1 commented Jul 14, 2024

In germination_cleaning.R combine the cold stratification entries into the chilling columns (i.e., cold stratification and chilling are the same thing and should both be treated as chilling)

Sorry this took so long! In the "cleanmerge_all_usda_JNVER" script I added a section at the very bottom where I copied all of the cold.strat.dur.XXX (Avg, Min, and Max) column data into new columns called chill.dur.XXX.comb (for "combined") and just ran some tests to make sure that no NAs were being made or data being overwritten. I decided to put this all in a new column just so that we still have the original chill.dur.XXX columns prior to the merge in case we need them separated.
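A minimal sketch of that merge, in base R (the exact column names, taken from the description above, are assumptions about the data frame; see cleanmerge_all_usda_JNVER for the real version):

```r
## Sketch: copy cold stratification into combined chilling columns
## without overwriting existing chill values (column names assumed).
for (suffix in c("Avg", "Min", "Max")) {
  chill <- usda[[paste0("chill.dur.", suffix)]]
  strat <- usda[[paste0("cold.strat.dur.", suffix)]]
  combined <- ifelse(is.na(chill), strat, chill)  # prefer existing chill value
  usda[[paste0("chill.dur.", suffix, ".comb")]] <- combined
  ## sanity check: the merge should never lose non-NA values
  stopifnot(sum(!is.na(combined)) >= sum(!is.na(chill)))
}
```

Keeping the originals and writing into new `.comb` columns, as described above, makes the merge reversible if the two sources ever need to be separated again.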

@lizzieinvancouver
Owner

@dbuona Could you take a look at this and get us down to one functional script? It should be called cleanAllUsda.R ... and please delete all the other scripts and extraneous files.

@dbuona
Collaborator

dbuona commented Jul 31, 2024

working on understanding relevant files:
it seems cleanmerge_all_usda_JNVER.R get us from germination_master_spreadsheet.csv (original data) to something akin to usdaGerminationData.csv without all the manual csv manipulation done previously (though no file is ever written out from cleanmerge_all_usda_JNVER.R). Then Freddie's code in usdaClean.R does additional cleaning.

@ngoj1 1)Does that seem correct? 2) Does that mean all other R files in the cleaning folder within the scrapeUSDAseedmanual folder are no longer in used now that cleanmerge_all_usda_JNVER.R exists?

@lizzieinvancouver
Owner

@dbuona Can you make sure:

  • Add a readme (_README.md) to scrapeUSDAseedmanual/cleaning/
  • Delete or move all the csv and xlsx files in this folder.

Thanks!

@ngoj1
Collaborator

ngoj1 commented Aug 3, 2024

@dbuona Yes, the new script combines the three separate scripts and also sifts through the manually changed values, so any other script related to the germination data should be outdated, but I can't speak on the scripts used for phenology cleaning. Sorry for the late reply! I was away at Mt. Rainier this past week.

@lizzieinvancouver
Owner

@ngoj1 Thank you! Do you know who worked on the phenology scripts?

@ngoj1
Collaborator

ngoj1 commented Aug 4, 2024

@buniwuuu and @selenashew did the phenology and seed data cleaning!!

@lizzieinvancouver
Owner

@selenashew Can you take a look at @dbuona queries from 3 days ago and reply regarding the phenology part? Thanks.

@selenashew
Collaborator

selenashew commented Aug 6, 2024

Hi everyone,

My apologies for the late reply! I was away in Singapore for 2 weeks and must have missed this issue when it was first opened; I am now catching myself back up to speed. Justin is completely correct and has covered all of the bases in terms of the process & scripts used for the germination data.

In terms of the phenology and seed data, the process was completely different since they could actually be scraped from the PDF quite cleanly. The steps I took were as follows:

  1. @buniwuuu and I created a script called "phenology_data_preparation_script.R", which can be found at egret/analyses/scrapeUSDAseedmanual/cleaning. The script automates the preparation of the phen data ahead of merging all of the phenology files into a single CSV: it removes all of the confidence scores and quotation marks, adds 2 new columns (pdf_table_number and genus_name), and removes the first row of the CSV in order to access the column names.

  2. The script was then also adjusted so that it could be used to prep all of the seed data for merging.

  3. While I was away, Britany then created another script called phenSeedCleaning.R (which can be found under the same path) that then combines all of the prepped phen data into one CSV file and all of the prepped seed data into another CSV file. The outputs can be found under "scrapeUSDAseedmanual/output/phenologyCombined.csv" and "scrapeUSDAseedmanual/output/seedCombined.csv" respectively.

  4. Where we are currently: I will need to manually go through the PDF to find the actual PDF page numbers for each of the phen and seed data tables, as well as some of the genus names (as they were not scraped), as requested by Deirdre, and add them into each combined CSV file.

  5. We will then create another script to do some final cleaning; there shouldn't be much to this script, as the phen and seed data are quite clean already.
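The prep-and-merge steps above could be sketched roughly like this (an illustration only; the paths, file layout, and column handling are assumptions — see phenology_data_preparation_script.R and phenSeedCleaning.R for the real versions):

```r
## Sketch of the phenology prep + merge described above (paths and
## details are hypothetical, not taken from the actual scripts).
files <- list.files("output/phenologyScraped", pattern = "\\.csv$",
                    full.names = TRUE)

prep_one <- function(f) {
  d <- read.csv(f, skip = 1)                    # drop the junk first row
  d[] <- lapply(d, function(x) gsub('"', "", x))# remove stray quotation marks
  d$pdf_table_number <- NA                      # to be filled in by hand
  d$genus_name <- NA                            # not scraped; added manually
  d
}

phenall <- do.call(rbind, lapply(files, prep_one))
write.csv(phenall, "output/phenologyCombined.csv", row.names = FALSE)
```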

@selenashew
Collaborator

Hi everyone,

As a quick update, I am working on creating the cleaning scripts in R for the phen & seed data and am aiming to have that done by the end of next week!

@lizzieinvancouver
Owner

@selenashew Sounds good -- thank you!

@selenashew
Collaborator

The phen & seed data have been cleaned and can be found in the output folder as "cleaned_phen_data_final.csv" and "cleaned_seed_data_final.csv". The cleaning script used is found in the cleaning folder as "phen_seed_data_final_cleaning_script.R".

@lizzieinvancouver
Owner

@selenashew Thank you! Would you mind updating your files to follow our naming conventions (camelCase)?

@lizzieinvancouver
Owner

lizzieinvancouver commented Aug 28, 2024

@dbuona I found some things in this issue we should do ... from the retreat:

@dbuona Can you make sure:

  • Add a readme (_README.md) to scrapeUSDAseedmanual/cleaning/
  • Delete or move all the csv and xlsx files in this folder (and update paths).

Thanks!

And I will add now:

  • Make the names of files and folders camelCase
  • Delete obsolete files you find.

@lizzieinvancouver
Owner

I am pulling out stuff on phen and seed cleaning to a new issue #68 so we can use this issue as the (hopefully) last issue relating to cleaning germination data from USDA seed manual.

@dbuona
Collaborator

dbuona commented Sep 11, 2024

I am going to close this

@dbuona dbuona closed this as completed Sep 11, 2024
@lizzieinvancouver
Owner

@dbuona Whoop!
