Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All).

Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE", as well as your name and collaborators below:

In [None]:
NAME = "" # put your full name here
COLLABORATORS = [] # list anyone you collaborated with on this workbook

---

## Lab 3: Exploratory Data Analysis
**This lab was distributed Monday 9/16/2019 and should be completed by Monday 9/23/2019 at 11:59PM.**

-------------------------------------------

Welcome to your third lab of the semester!<br>

This lab aims to get you started with exploratory data analysis, including using `.count()` and `.groupby()`, understanding different file types, and performing basic plotting.

The data for this lab comes from the State of California's [domestic well groundwater monitoring program](https://data.ca.gov/dataset/ground-water-water-quality-results). In California, up to [2 million people get their water from a private domestic well](https://www.waterboards.ca.gov/gama/docs/wellowner_guide.pdf) as opposed to a public water system. For this lab, we've taken the water monitoring dataset and made some modifications for educational purposes, including splitting the dataset to merge later.

## Setup

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
%matplotlib inline

## Question 1: Understanding the data
### Question 1.1
What sort of files are `gama_wells.txt` and `gama_measurements.csv`? Describe the difference between these two files. You can inspect the files in a text editor to answer this question.

*Your answer here*

### Question 1.2
Load `gama_wells.txt` into a dataframe called `wells`, and `gama_measurements.csv` into a dataframe called `measurements`. You should use the pandas functions for [reading .csv files](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) and [reading .txt files](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_table.html).
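As a minimal sketch of how the two readers differ, the example below parses two small in-memory strings rather than the lab files. The contents and column names are made up for illustration. The only difference between the two calls is the default separator: a comma for `pd.read_csv` and a tab for `pd.read_table`.

```python
import io
import pandas as pd

# Toy, made-up data (not the lab files): one comma-separated, one tab-separated.
comma_text = "id,value\n1,0.5\n2,1.5\n"
tab_text = "id\tvalue\n1\t0.5\n2\t1.5\n"

# read_csv splits on commas by default; read_table splits on tabs by default.
from_commas = pd.read_csv(io.StringIO(comma_text))
from_tabs = pd.read_table(io.StringIO(tab_text))

# Both calls should produce the same two-column dataframe.
print(from_commas.equals(from_tabs))
```

Both functions also accept a `sep` argument if a file uses a different delimiter; which delimiter each lab file uses is something to confirm by inspecting it.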

In [None]:
wells = ...
measurements = ...

In [None]:
wells.head()

In [None]:
measurements.head()

### Question 1.3
How many rows are in `wells`? How about `measurements`?

In [None]:
# get number of rows in wells

In [None]:
# get number of rows in measurements

### Question 1.4

What does each row of `wells` represent? How about each row of `measurements`?

*Your answer here*

### Question 1.5
Check out the documentation for this dataset provided by the [California Data Portal](https://data.ca.gov/dataset/ground-water-water-quality-results) by clicking on "Data Dictionary". Are there any fields in either `wells` or `measurements` that are not documented or easily understandable from looking at the data dictionary?

*Your answer here*

## Question 2: Merging data
For this question, we want to use the method [`.merge()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html) to merge `wells` and `measurements`.<br>
When you use `.merge()`, there are a few fields that you'll have to populate. The `DataFrame` in `DataFrame.merge()` is considered your left dataframe, or the set of data that will show up on the left side of your merged dataframe. The `right` field will contain your right dataframe (the set of data that will show up on the right side of your merged dataframe).<br>
Two of the key fields that you'll have to fill out are `on = ` (the common field that both tables should be matched on when you merge) and `how = ` (the type of merge that you want to perform).<br>
A visual of the different types of merges is shown below ([source](http://www.datasciencemadesimple.com/join-merge-data-frames-pandas-python/)).<br>
<img src="images/joins.png"><br>
An inner merge retains only the records that both tables have in common. An outer merge keeps all records from both tables and fills in `NaN` values for non-overlapping records. A left merge keeps all the records from the left table, and a right merge keeps all the records from the right table; each fills in `NaN` if a particular record is not found in the other table.
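The four merge types can be compared on a pair of tiny tables. This is only a sketch: the tables, the column name `key`, and the value columns are hypothetical and unrelated to the lab data.

```python
import pandas as pd

# Hypothetical toy tables; "key" is an illustrative column name.
left = pd.DataFrame({"key": ["a", "b", "c"], "left_val": [1, 2, 3]})
right = pd.DataFrame({"key": ["b", "c", "d"], "right_val": [20, 30, 40]})

# Run each merge type and record how many rows it produces.
row_counts = {}
for how in ["inner", "outer", "left", "right"]:
    merged = left.merge(right, on="key", how=how)
    row_counts[how] = len(merged)

print(row_counts)
```

Here only `"b"` and `"c"` appear in both tables, so the row counts differ by merge type. When every key appears in both tables, the picture can look quite different.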

### Question 2.1
If we want to link the well measurements in `measurements` to the well characteristics in `wells`, what field do we want to use for `on = `?

*Your answer here*

### Question 2.2
Will using an inner, outer, right, or left merge change the number of records or the number of missing values in our final merged dataset? Why or why not? Feel free to try the different options to see the results. Under what conditions would your choice of merge type (inner, outer, right, or left) matter?

In [None]:
# use this cell for scratch work

*Your answer here*

### Question 2.3
Merge `measurements` and `wells`, keeping `measurements` as the left dataframe. Save the merged dataframe to `measurements_wells`.

In [None]:
measurements_wells = ...

In [None]:
measurements_wells.head()

## Question 3: Groupby

### Question 3.1 
Group `measurements_wells` by "WELL_TYPE", outputting a table that shows the counts of the variables "RESULTS" and "QUALIFER" grouped by "WELL_TYPE". Here, you'll want to use [`.groupby`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html) and `.count()`.
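For reference, here is a minimal sketch of the general `.groupby` plus `.count()` pattern on made-up data. The dataframe `toy` and its column names are assumptions for illustration, not columns from the lab dataset.

```python
import pandas as pd

# Made-up toy data; the column names are illustrative only.
toy = pd.DataFrame({
    "group": ["x", "x", "y", "y", "y"],
    "score": [1.0, 2.0, 3.0, 4.0, 5.0],
    "label": ["p", "q", "p", "q", "q"],
})

# Group by one column, select the columns of interest, then count entries per group.
counts = toy.groupby("group")[["score", "label"]].count()
print(counts)
```

Selecting columns with a list after `.groupby(...)` returns a dataframe with one row per group and one column per selected variable.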

In [None]:
# your code here

### Question 3.2
Does groupby count NaNs? You can check by using the methods [`.isna()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.isna.html) and `.sum()`.
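As a sketch of how these two methods combine, the example below counts missing values per column in a small made-up dataframe (the names `toy`, `group`, and `score` are hypothetical).

```python
import numpy as np
import pandas as pd

# Made-up toy data with one missing value in "score".
toy = pd.DataFrame({
    "group": ["x", "x", "y"],
    "score": [1.0, np.nan, 3.0],
})

# isna() marks each cell True or False; sum() then adds up the True values per column.
missing_per_column = toy.isna().sum()
print(missing_per_column)
```

Comparing totals like these with the number of rows and with the grouped counts is one way to reason about the question.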

In [None]:
# use this cell for scratch work

*Your answer here*

### Question 3.3
Find the average "RESULTS" value for arsenic (AS) measurements, grouped by "WELL_TYPE".
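One common pattern for this kind of task is to filter rows with a boolean mask and then group and average. The sketch below uses made-up data; `toy`, `substance`, `site`, and `amount` are illustrative names, not fields from the lab dataset.

```python
import pandas as pd

# Made-up toy data; the column names are illustrative only.
toy = pd.DataFrame({
    "substance": ["A", "A", "B", "A"],
    "site": ["north", "south", "north", "north"],
    "amount": [2.0, 4.0, 10.0, 6.0],
})

# Keep only the rows for substance "A" using a boolean mask.
only_a = toy[toy["substance"] == "A"]

# Average the "amount" column within each site.
mean_by_site = only_a.groupby("site")["amount"].mean()
print(mean_by_site)
```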

In [None]:
# your code here

## Question 4: Plotting
Now, let's do some basic plotting. According to the [Centers for Disease Control and Prevention, shallower wells are more vulnerable to nitrate contamination from fertilizer, waste, or other sources](https://www.cdc.gov/healthywater/drinking/private/wells/disease/nitrate.html). Let's explore this relationship by first creating a dataframe called `nitrate` that contains only the records that measure nitrate concentrations (NO3N). Then, use [`plt.scatter()`](https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.scatter.html) to create a scatter plot. You only need to fill in the x and y values by referencing the corresponding columns in `nitrate`. Fill in the rest of the functions in the cell below to give the plot an appropriate title and x- and y-axis labels. Make sure to include units in your axis labels (you can inspect `nitrate` to find the appropriate units).
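For orientation, here is a minimal sketch of a labeled scatter plot on unrelated, made-up data. The dataframe `toy` and its columns `height_cm` and `mass_kg` are hypothetical and only show where the x values, y values, title, and axis labels go.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up toy data, unrelated to the lab dataset.
toy = pd.DataFrame({
    "height_cm": [150, 160, 170, 180],
    "mass_kg": [50, 58, 69, 80],
})

# x values first, then y values; marker='.' draws small points.
plt.scatter(toy["height_cm"], toy["mass_kg"], marker=".")
plt.title("Mass versus height (toy data)")
plt.xlabel("Height (cm)")
plt.ylabel("Mass (kg)")
```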

In [None]:
import matplotlib.pyplot as plt

In [None]:
nitrate = ...

plt.scatter(..., ..., marker = '.')
plt.title(...)
plt.xlabel(...)
plt.ylabel(...)

# Hooray, you're done! 

Please remember to submit your lab work, after running all cells, in .html and .ipynb format on bCourses.