# Mini project 2: primary productivity in coastal waters

In this project you're again given a dataset and some questions. The data for this project come from the [EPA's National Aquatic Resource Surveys](https://www.epa.gov/national-aquatic-resource-surveys/data-national-aquatic-resource-surveys), and in particular the National Coastal Condition Assessment (NCCA); broadly, you'll do an exploratory analysis of primary productivity in coastal waters.

By way of background, chlorophyll A is often used as a proxy for [primary productivity in marine ecosystems](https://en.wikipedia.org/wiki/Marine_primary_production); primary producers are important because they are at the base of the food web. Nitrogen and phosphorus are key nutrients that stimulate primary production. 

In the data folder you'll find water chemistry data, site information, and metadata files. It might be helpful to keep the metadata files open when tidying up the data for analysis. It might also be helpful to keep in mind that these datasets contain a considerable amount of information, not all of which is relevant to answering the questions of interest. Notice that the questions pertain somewhat narrowly to just a few variables. It's recommended that you determine which variables might be useful and drop the rest.

As in the first mini project, there are accurate answers to each question that are mutually consistent with the data, but there aren't uniquely correct answers. You will likely notice that you have even more latitude in this project than in the first, as the questions are slightly broader. Since we've been emphasizing visual and exploratory techniques in class, you are encouraged (but not required) to support your answers with graphics.

The broader goal of these mini projects is to cultivate your problem-solving ability in an unstructured setting. Your work will be evaluated based on the following:
- choice of method(s) used to answer questions;
- clarity of presentation;
- code style and documentation.

Please write up your results separately from your codes; codes should be included at the end of the notebook.

---

## Part 1: dataset

Merge the site information with the chemistry data and tidy it up. Determine which columns to keep based on what you use in answering the questions in part 2; then, print the first few rows here (but *do not include your codes used in tidying the data*) and write a brief description (1-2 paragraphs) of the dataset conveying what you take to be the key attributes. Direct your description to a reader unfamiliar with the data; ensure that in your data preview the columns are named intelligibly.

*Suggestion*: export your cleaned data as a separate `.csv` file and read that directly in below, as in: `pd.read_csv('YOUR DATA FILE').head()`.

In [1]:
# show a few rows of clean data


*Write your description here.*

## Part 2: exploratory analysis

Answer each question below and provide a visualization supporting your answer. A description and interpretation of the visualization should be offered.

*Comment:* you can either designate your plots in the codes section with clear names and reference them in your answers; or you can export your plots as image files and display them in markdown cells.

### What is the apparent relationship between nutrient availability and productivity?

*Comment*: it's fine to examine each nutrient -- nitrogen and phosphorus -- separately, but do consider whether they might be related to each other.

*Write your answer here.*

### Are there any notable differences in available nutrients among U.S. coastal regions?

*Write your answer here.*

### Based on the 2010 data, does productivity seem to vary geographically in some way? 

If so, explain how; If not, explain what options you considered and ruled out.

*Write your answer here.*

### How does primary productivity in California coastal waters change seasonally in 2010, if at all?

Does your result make intuitive sense?

*Write your answer here.*

### Pose and answer one additional question.

*Write your answer here.*

---

# Codes

In [1]:
!ls

data  hw2-seda.ipynb


In [1]:
import pandas as pd
import numpy as np
import altair as alt

ncca_raw = pd.read_csv('assessed_ncca2010_waterchem.csv')
ncca_sites = pd.read_csv('assessed_ncca2010_siteinfo.csv')
# merge on UID or SITE_ID or STATE actually merge on all 3, merge left
# drop QA_CODES AND COMMENT,NPSPARK
# use nutrients and productivity in tidied data
# what is the productivity
# need the values in date_col to match
# RENAME THE DATA 


FileNotFoundError: [Errno 2] No such file or directory: 'assessed_ncca2010_waterchem.csv'

In [3]:
ncca_sites.head()

Unnamed: 0,UID,SITE_ID,STATE,VISIT_NO,DATE_COL,WTBDY_NM,SITESAMP,INDEX_VISIT,EPA_REG,NCCR_REG,...,NPSPARK,PANEL,STATUS10,STRATUM,TNT,WGT_CAT,WGT_NCCA10,RSRC_CLASS,QA_CODES,COMMENT
0,59,NCCA10-1111,CA,1.0,1-Jul-10,Mission Bay,Y,Y,9,West,...,,Base,Target_Sampled,CalP_Other,Target,NCA_CA_CalP_Other,2.503632,NCA_Estuarine_Coastal,,
1,60,NCCA10-1119,CA,1.0,1-Jul-10,San Diego Bay,Y,Y,9,West,...,,Base,Target_Sampled,CalP_Other,Target,NCA_CA_CalP_Other,5.255002,NCA_Estuarine_Coastal,,
2,61,NCCA10-1123,CA,1.0,1-Jul-10,Mission Bay,Y,Y,9,West,...,,Base,Target_Sampled,CalP_Other,Target,NCA_CA_CalP_Other,2.503632,NCA_Estuarine_Coastal,,
3,62,NCCA10-1127,CA,1.0,1-Jul-10,San Diego Bay,Y,Y,9,West,...,,Base,Target_Sampled,CalP_Other,Target,NCA_CA_CalP_Other,5.255002,NCA_Estuarine_Coastal,,
4,63,NCCA10-1133,NC,1.0,9-Jun-10,White Oak River,Y,Y,4,Southeast,...,,Revisit,Target_Sampled,CarP_Albemarle_Pamlico_Sounds,Target,NCA_NC_CarP_Albemarle_Pamlico_Sounds,75.994127,NCA_Estuarine_Coastal,,


In [4]:
ncca_raw.head()
# parameter name is nurtients 

Unnamed: 0,UID,SITE_ID,STATE,DATE_COL,BATCH_ID,PARAMETER,PARAMETER_NAME,RESULT,UNITS,MDL,MRL,PQL,DATE_ANALYZED,HOLDING_TIME,QACODE,LAB_SAMPLE_ID,SAMPLE_ID,METHOD
0,59,NCCA10-1111,CA,7/1/2010,100714.1,NTL,Total Nitrogen,0.4075,mg N/L,0.015,0.03,,7/14/2010,13.0,,1010242.0,568671.0,
1,59,NCCA10-1111,CA,7/1/2010,100708.1,NO3NO2,Nitrate/Nitrite,0.014,mg N/L,0.002,0.004,,7/8/2010,7.0,,1010242.0,568673.0,
2,59,NCCA10-1111,CA,7/1/2010,100708.1,SRP,Dissolved Inorganic Phosphate,0.028,mg P/L,0.0027,0.0054,,7/8/2010,7.0,,1010242.0,568673.0,
3,59,NCCA10-1111,CA,7/1/2010,IM_CALCULATED,DIN,Dissolved Inorganic Nitrogen,0.014,mg N/L,,,,,,Q23,,,
4,59,NCCA10-1111,CA,7/1/2010,100714.1,PTL,Total Phosphorus,0.061254,mg P/L,0.0012,0.0024,,7/14/2010,13.0,,1010242.0,568671.0,


In [5]:
ncca_new = ncca_sites.drop(columns=['NPSPARK','QA_CODES','COMMENT'])
# visit_no to be number of visits, wtbdy_nm to be body of water

In [6]:
data= ncca_new.merge(ncca_sites, how = 'right', on = ['UID','SITE_ID','STATE'].melt(id_vars=['UID','STATE','PARAMETER','PARAMETER_NAME','RESULT'],
                                                                                    var_name='Date Collected').pivot(index=np.append(ncca_new.columns,'Date Collected'),
                                                                                                                                       columns=['PARAMETER','PARAMETER_NAME'],
                                                                                                                                       values='value').sort_values('Date Collected'))
    
    # error because date_col is different for the two dataframes

AttributeError: 'list' object has no attribute 'melt'

In [2]:
ncca_raw.melt(id_vars=['UID','STATE','PARAMETER','PARAMETER_NAME','RESULT'],var_name=['Nitrogen','Phosphorus','Chlorophyll A'],columns=['PARAMETER','PARAMETER_NAME'],
                                                                                                                                       values='value')
              

NameError: name 'ncca_raw' is not defined