
usda data scraping and cleaning process #36

Closed
FrederikBaumgarten opened this issue Jul 9, 2024 · 23 comments


@FrederikBaumgarten
Collaborator

No description provided.

@FrederikBaumgarten
Collaborator Author

@dbuona @lizzieinvancouver @DeirdreLoughnan
here is what Justin did so far in his words:
workflow:

  • Started in the scrapeUSDAseedmanual folder

  • in the cleaning folder is "germination_master_spreadsheet.csv" which is the original data

  • Parsed into RStudio with the cleaning script titled "germinationCleaning.R", which does the bulk of the general cleaning: removing stray symbols, converting unreadable NAs into proper NA format, fixing species names, adding new columns for scarification and chilling, etc.

  • The output of this was called germinationCleaned.xlsx, which Selena then went through and manually fixed some issues caused by odd values in the USDA manual PDF

  • This was then saved as "germinationCleaned_official.csv"

  • Parsed this file into R through the cleaning script "germinationCleaningFinal.R", where I changed column names and pivoted the germination response data wider

  • I then got the comment from Deirdre asking for metadata and to make some more changes so it's closer in format to EGRET

  • Thus I converted this to an excel file and made a new sheet for the metadata, saving these two separately as .csv in case anyone wanted them as .csv

  • Parsed the cleaned data into a new cleaning script called "germinationEGRETCorrections.R" for the final round of touch-ups

All cleaning scripts can be found in "scrapeUSDAseedmanual/cleaning" and all intermediate data files can be found in "scrapeUSDAseedmanual/output/earlyIterationDataSheets"

Ideally I would have done this all in 1 script but we came across a couple issues along the way involving data cleaning that Selena was already working on, so we felt it made more sense to start new scripts upon older ones so that we wouldn't be tampering with changes done by hand through Excel.

I forgot to mention that "germinationCleaningFinal.R" gave the output "USDAGerminationCleanedFinal.csv" which was what I used to make the "usdaGerminationMaster.xlsx" file. There's another file in the earlyIterationDataSheets called "usdaGerminationJINJJA.csv" which I made as a backup because my RStudio was bugging out on me during the EGRET correction script writing.
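The general cleaning and pivoting steps described above can be sketched roughly as follows (a minimal illustration only; the column names, symbols, and NA strings are assumptions, not the actual contents of germinationCleaning.R):

```r
## Sketch of the general-cleaning steps described above (column and
## value names are illustrative, not the real ones in the scripts).
library(tidyr)

germ <- read.csv("cleaning/germination_master_spreadsheet.csv")

## strip stray symbols and convert unreadable NA strings to proper NAs
germ$germination.rate <- gsub("[*+#]", "", germ$germination.rate)
germ$germination.rate[germ$germination.rate %in% c("n/a", "-", "")] <- NA

## pivot the germination response data wider, one column per response type
germwide <- pivot_wider(germ, names_from = response.type,
                        values_from = germination.rate)
```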

@dbuona
Collaborator

dbuona commented Jul 10, 2024

@buniwuuu @ngoj1 I've started to combine all of the cleaning scripts into a master source file called cleanmerge_all_usda.R, but I can't get the 3 cleaning scripts to run sequentially. Can you work on getting it to run?
Ideally, it would also be best not to write out intermediate xlsx files between scripts. Let me know if you have any questions.
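One possible shape for that master script, sketched below: source the three cleaning scripts in order and only write the final output. This assumes each script is refactored to operate on a shared data frame in memory (here called `usda`) rather than reading and writing intermediate files, which is not how they currently work.

```r
## cleanmerge_all_usda.R -- hypothetical master script (sketch only).
## Assumes each sourced script modifies the `usda` data frame in the
## global environment instead of writing intermediate .xlsx files.

usda <- read.csv("cleaning/germination_master_spreadsheet.csv")

source("cleaning/germinationCleaning.R")         # general cleaning
source("cleaning/germinationCleaningFinal.R")    # rename columns, pivot wider
source("cleaning/germinationEGRETCorrections.R") # final EGRET-format touch-ups

write.csv(usda, "output/usdaGerminationCleanedFinal.csv", row.names = FALSE)
```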

@lizzieinvancouver
Owner

@FrederikBaumgarten possible helpful pseudocode:

chilldurminnoNA <- usda$species[!is.na(usda$chill.dur.min)]
chilldurmaxnoNA <- usda$species[!is.na(usda$chill.dur.max)]
respvarminnoNA <- usda$species[!is.na(usda$responsevarmin)]
respvarmaxnoNA <- usda$species[!is.na(usda$responsevarmax)]

sppwithminmaxchill <- chilldurminnoNA[chilldurminnoNA %in% chilldurmaxnoNA]
sppwithminmaxresp <- respvarminnoNA[respvarminnoNA %in% respvarmaxnoNA]

@dbuona
Collaborator

dbuona commented Jul 10, 2024

Just to summarize here is what needs to happen on this code:

  • Get all the cleaning files to run without externally manipulating them in excel
  • In germination_cleaning.R combine the cold stratification entries into the chilling columns (i.e., cold stratification and chilling are the same thing and should both be treated as chilling)

@ngoj1
Collaborator

ngoj1 commented Jul 10, 2024 via email

@ngoj1
Collaborator

ngoj1 commented Jul 10, 2024

Get all the cleaning files to run without externally manipulating them in excel

@dbuona I pushed a new script called "cleanmerge_all_usda_JNVER" (Justin version, since I wanted to keep your original script as a backup) where, instead of using source(), I combined all the code from the three scripts and then reproduced the majority of the changes we had made manually in Excel by fine-combing through the columns. I'm hoping there aren't any weird values left, but it's possible I missed a few. In any case, Selena would have addressed these weird values in Issue #20, and if you ever encounter them and need me to fix them, I can do that in this new script I've just pushed.

In germination_cleaning.R combine the cold stratification entries into the chilling columns (i.e., cold stratification and chilling are the same thing and should both be treated as chilling)

My laptop is about to run out of battery as I forgot to bring my charger but I can address this when I get home later tonight!

If there are any warnings that pop up in the code please let me know and I will backtrack and figure it out.

@ngoj1
Collaborator

ngoj1 commented Jul 14, 2024

In germination_cleaning.R combine the cold stratification entries into the chilling columns (i.e., cold stratification and chilling are the same thing and should both be treated as chilling)

Sorry this took so long! In the "cleanmerge_all_usda_JNVER" script I added a section at the very bottom where I copied all of the cold.strat.dur.XXX (Avg, Min, and Max) column data into new columns called chill.dur.XXX.comb (for "combined") and just ran some tests to make sure that no NAs were being made or data being overwritten. I decided to put this all in a new column just so that we still have the original chill.dur.XXX columns prior to the merge in case we need them separated.
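A minimal sketch of that merge, in base R (the exact column names, taken from the description above, are assumptions about the data frame; see cleanmerge_all_usda_JNVER for the real version):

```r
## Sketch: copy cold stratification into combined chilling columns
## without overwriting existing chill values (column names assumed).
for (suffix in c("Avg", "Min", "Max")) {
  chill <- usda[[paste0("chill.dur.", suffix)]]
  strat <- usda[[paste0("cold.strat.dur.", suffix)]]
  combined <- ifelse(is.na(chill), strat, chill)  # prefer existing chill value
  usda[[paste0("chill.dur.", suffix, ".comb")]] <- combined
  ## sanity check: the merge should never lose non-NA values
  stopifnot(sum(!is.na(combined)) >= sum(!is.na(chill)))
}
```

Keeping the originals and writing into new `.comb` columns, as described above, makes the merge reversible if the two sources ever need to be separated again.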

@lizzieinvancouver
Owner

@dbuona Could you take a look at this and get us down to one functional script? It should be called cleanAllUsda.R ... and please delete all the other scripts and extraneous files.

@dbuona
Collaborator

dbuona commented Jul 31, 2024

working on understanding relevant files:
it seems cleanmerge_all_usda_JNVER.R get us from germination_master_spreadsheet.csv (original data) to something akin to usdaGerminationData.csv without all the manual csv manipulation done previously (though no file is ever written out from cleanmerge_all_usda_JNVER.R). Then Freddie's code in usdaClean.R does additional cleaning.

@ngoj1 1)Does that seem correct? 2) Does that mean all other R files in the cleaning folder within the scrapeUSDAseedmanual folder are no longer in used now that cleanmerge_all_usda_JNVER.R exists?

@lizzieinvancouver
Owner

@dbuona Can you make sure:

  • Add a readme (_README.md) to scrapeUSDAseedmanual/cleaning/
  • Delete or move all the csv and xlsx files in this folder.

Thanks!

@ngoj1
Collaborator

ngoj1 commented Aug 3, 2024

@dbuona Yes, the new script combines the three separate scripts and also sifts through the manually changed values, so any other script related to the germination data should be outdated, but I can't speak on the scripts used for phenology cleaning. Sorry for the late reply! I was away at Mt. Rainier this past week.

@lizzieinvancouver
Owner

@ngoj1 Thank you! Do you know who worked on the phenology scripts?

@ngoj1
Collaborator

ngoj1 commented Aug 4, 2024

@buniwuuu and @selenashew did the phenology and seed data cleaning!!

@lizzieinvancouver
Owner

@selenashew Can you take a look at @dbuona queries from 3 days ago and reply regarding the phenology part? Thanks.

@selenashew
Collaborator

selenashew commented Aug 6, 2024

Hi everyone,

My apologies for the late reply! I was away in Singapore for 2 weeks and must have missed this issue when it was first opened; I am now catching myself back up to speed. Justin is completely correct and has covered all of the bases in terms of the process & scripts used for the germination data.

In terms of the phenology and seed data, the process was completely different since they could actually be scraped from the PDF quite cleanly. The steps I took were as follows:

  1. @buniwuuu and I created a script called "phenology_data_preparation_script.R", which can be found at egret/analyses/scrapeUSDAseedmanual/cleaning. The script automates the preparation of the phen data ahead of merging all of the phenology files into a single CSV: it removes all of the confidence scores and quotation marks, adds 2 new columns (pdf_table_number and genus_name), and removes the first row of the CSV in order to access the column names.

  2. The script was then also adjusted so that it could be used to prep all of the seed data for merging.

  3. While I was away, Britany then created another script called phenSeedCleaning.R (which can be found under the same path) that then combines all of the prepped phen data into one CSV file and all of the prepped seed data into another CSV file. The outputs can be found under "scrapeUSDAseedmanual/output/phenologyCombined.csv" and "scrapeUSDAseedmanual/output/seedCombined.csv" respectively.

  4. Where we are currently: I will need to manually go through the PDF to find the actual PDF page numbers for each of the phen and seed data tables, as well as some of the genus names (as they were not scraped), as requested by Deirdre, and add them into each combined CSV file.

  5. We will then create another script to do some final cleaning; there shouldn't be much to this script, as the phen and seed data are quite clean already.
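The prep-and-merge steps above could be sketched roughly like this (an illustration only; the paths, file layout, and column handling are assumptions — see phenology_data_preparation_script.R and phenSeedCleaning.R for the real versions):

```r
## Sketch of the phenology prep + merge described above (paths and
## details are hypothetical, not taken from the actual scripts).
files <- list.files("output/phenologyScraped", pattern = "\\.csv$",
                    full.names = TRUE)

prep_one <- function(f) {
  d <- read.csv(f, skip = 1)                    # drop the junk first row
  d[] <- lapply(d, function(x) gsub('"', "", x))# remove stray quotation marks
  d$pdf_table_number <- NA                      # to be filled in by hand
  d$genus_name <- NA                            # not scraped; added manually
  d
}

phenall <- do.call(rbind, lapply(files, prep_one))
write.csv(phenall, "output/phenologyCombined.csv", row.names = FALSE)
```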

@selenashew
Collaborator

Hi everyone,

As a quick update, I am working on creating the cleaning scripts in R for the phen & seed data and am aiming to have that done by the end of next week!

@lizzieinvancouver
Owner

@selenashew Sounds good -- thank you!

@selenashew
Collaborator

The phen & seed data have been cleaned and can be found in the output folder as "cleaned_phen_data_final.csv" and "cleaned_seed_data_final.csv". The cleaning script used is found in the cleaning folder as "phen_seed_data_final_cleaning_script.R".

@lizzieinvancouver
Owner

@selenashew Thank you! Would you mind updating your files to follow our naming conventions (camelCase)?

@lizzieinvancouver
Owner

lizzieinvancouver commented Aug 28, 2024

@dbuona I found some things in this issue we should do ... from the retreat:

@dbuona Can you make sure:

  • Add a readme (_README.md) to scrapeUSDAseedmanual/cleaning/
  • Delete or move all the csv and xlsx files in this folder (and update paths).

Thanks!

And I will add now:

  • Make the names of files and folders camelCase
  • Delete obsolete files you find.

@lizzieinvancouver
Owner

I am pulling out stuff on phen and seed cleaning to a new issue #68 so we can use this issue as the (hopefully) last issue relating to cleaning germination data from USDA seed manual.

@dbuona
Collaborator

dbuona commented Sep 11, 2024

I am going to close this

@dbuona dbuona closed this as completed Sep 11, 2024
@lizzieinvancouver
Owner

@dbuona Whoop!
