Part data tutorial, part data exploration with Evans et al (2020) hydro data for Dartmouth interns
The Women In Science Project gets first-year women at Dartmouth College into research labs, and our lab is working with geospatial data in the Arctic. An essential part of this work will be taking existing data and extracting new topographic or ecological data that might explain historical trends others observe. For that we need to practice acquiring, cleaning and plotting "wide" datasets both as tabular and spatial data.
The following assumes you've had some success setting up a conda environment in Dartmouth's shared-memory computers and opened Jupyter notebook before: https://services.dartmouth.edu/TDClient/1806/Portal/KB/ArticleDet?ID=72886 https://services.dartmouth.edu/TDClient/1806/Portal/KB/ArticleDet?ID=72888 https://services.dartmouth.edu/TDClient/1806/Portal/KB/ArticleDet?ID=124517
-
Use command prompt to
ssh
into Polaris or Andes:ssh xxxxxxx@[polaris/andes]@dartmouth.edu
-
Create a new environment
git-env
to install git (I have had problems installing git on base environments)conda create --name git-env
-
Activate your new environment with
conda activate git-env
and then clone the repository:git clone https://github.com/jmdelvecchio/WISP_arctic_hydro.git
-
Navigate to the
WISP-arctic-hydro
directory (folder) withcd WISP-arctic-hydro
(tip: start typing "WI" and then hit "tab", it'll autocomplete!) -
Create another new environment for your EDA with the yml file:
conda env create -f environment.yml
-
Activate that new environment with
conda activate wisp_eda
-
Launch Jupyter notebook (if you're doing on Andes or Polaris, use no browser mode)
jupyter notebook --no-browser
and don't forget to ssh tunnel for the notebook. -
Open another instance of your prompt and write
ssh -NfL xxxx:localhost:xxxx [your_user_ID]@[andes/polaris].dartmouth.edu
wherexxxx
is the four-digit localhost:xxxx port you see when you launch Jupyter Notebook -
'Splore some data (after importing your libraries!!)! Two hints:
You will need to call your .csv tables from the data_tables
directory, but I would recommend having your Jupyter Notebook file in your main WISP folder. What I do is designate a data_dir
variable with the path of the directory in which I place my data, like so:
data_dir = './data_tables/'
The .
tells you to use the current working directory (WISP) and then look for a directory called data_tables
. I then use os.path.join()
(info here) when loading to pandas:
df = pd.read_csv(os.path.join(data_dir,'evans_hydro_supplement.csv')
BUT! You will see I have been very cruel and copy and pasted directly from the supplement, so pandas will not be happy.
I'll help you out here: read in the csv with a special-characters encoding, and then write a new csv with utf-8 encoding (some realistic data cleaning for you!)
df = pd.read_csv(os.path.join(data_dir,'evans_hydro_supplement.csv'), encoding='windows-1254') df.to_csv(os.path.join(data_dir,'evans_hydro_supplement_new.csv'), encoding='utf-8', index=False)
Now try df = pd.read_csv(os.path.join(data_dir,'evans_hydro_supplement_new.csv'))
and see how it goes.
My idea here with you cloning this repository with git (rather than just me emailing you data) is that as we progess (read: as I figure out how to do things) I will add GIS files and scripts so we can do geospatial data analysis once we get there, and I can push those updates and you can pull them like real programmers...or something.
- Export clean data tables as .csv files
- Make one scatterplot in which you either have sybols' shapes or colors reflect different categorical variables
- Make one map of watershed boundaries in which the polygons' colors reflect different categorical variables