# Data Exploration 

## Import libraries
Here we import the Python modules needed for the analysis.

In [1]:
#Import modules
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

In [2]:
#Instruct Jupyter to show plots
%matplotlib inline

## Read in the data
Read in the CSV saved in the last step and convert the data types. 

In [3]:
#Load the saved csv file
df = pd.read_csv('GageData.csv')

In [4]:
#Confirm it looks good by viewing the first 5 records
df.head()

  agency_cd  site_no    datetime  discharge Confidence
0      USGS  2089000  1930-10-01      210.0          A
1      USGS  2089000  1930-10-02      188.0          A
2      USGS  2089000  1930-10-03      200.0          A
3      USGS  2089000  1930-10-04      200.0          A
4      USGS  2089000  1930-10-05      200.0          A


#### Data types
Dataframes are structured so that each column holds values of a single, defined **data type**. Common Pandas data types include `object` (typically text), `int64`, `float64`, `bool`, `datetime64[ns]`, `timedelta64[ns]`, and `category`.

When we import data into a dataframe, Pandas infers the data types from the values in the input file. We can view the data types in a dataframe via the dataframe's `dtypes` property:

In [5]:
#Show the data types of each column
df.dtypes

agency_cd      object
site_no         int64
datetime       object
discharge     float64
Confidence     object
dtype: object

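Note that two columns have incorrectly assigned data types. The `site_no` column should be a string, not a number, because it holds nominal values. Also, the `datetime` column was read as a generic `object` (text) rather than as a date, so we convert it next.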

In [6]:
#Convert datetime to an actual datetime object
df['datetime'] = pd.to_datetime(df['datetime'],format='%Y-%m-%d')
df.dtypes

agency_cd             object
site_no                int64
datetime      datetime64[ns]
discharge            float64
Confidence            object
dtype: object
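The conversion above fixes `datetime` but leaves `site_no` as `int64`. One possible way to handle both columns at load time is to pass `dtype` and `parse_dates` to `pd.read_csv`. The sketch below runs on a small inline sample rather than `GageData.csv`; the sample rows, the leading zero in the site number, and the name `demo` are assumptions made for illustration.

```python
import io
import pandas as pd

# Hypothetical sample standing in for GageData.csv
sample_csv = io.StringIO(
    "agency_cd,site_no,datetime,discharge,Confidence\n"
    "USGS,02089000,1930-10-01,210.0,A\n"
    "USGS,02089000,1930-10-02,188.0,A\n"
)

# Read site_no as text (this keeps any leading zero) and parse datetime as dates
demo = pd.read_csv(sample_csv, dtype={'site_no': str}, parse_dates=['datetime'])
print(demo.dtypes)
```

Reading the identifier as text also means it cannot be averaged or summed by accident in later summaries.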

In [None]:
#Setting the date time as the index allows time slicing
df.index = df.datetime

In [None]:
#Create two new dataframes: one with records before Falls Lake and one after
dfPre = df.loc['1950-01-01':'1979-12-31']
dfPost = df.loc['1984-01-01':'2017-12-31']
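The slicing above relies on the index being a `DatetimeIndex`. A minimal sketch on made-up values (the `demo` frame and its numbers are hypothetical) shows that label-based slices with date strings include both endpoints:

```python
import pandas as pd

# Hypothetical daily record spanning the end of 1979
dates = pd.date_range('1979-12-30', periods=4, freq='D')
demo = pd.DataFrame({'discharge': [100.0, 110.0, 120.0, 130.0]}, index=dates)

# String labels are parsed as dates; both endpoints of a slice are included
before = demo.loc[:'1979-12-31']
after = demo.loc['1980-01-01':]
print(len(before), len(after))
```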

## Create line plots of daily discharge data
https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Python_Matplotlib_Cheat_Sheet.pdf


In [None]:
#Create line plots with a specified X and Y column
plt.figure(figsize=(20,6));
plt.plot(df['datetime'],df['discharge']);
plt.axvline(x=pd.Timestamp('1980-01-01'),color='red',ls='--');
plt.title("Neuse River Near Goldsboro, NC");
plt.ylabel("Discharge (cfs)");

Use `Seaborn` to make prettier plots more easily.

https://seaborn.pydata.org/tutorial/aesthetics.html


In [None]:
#Activate seaborn default aesthetics
sns.set()
#Repeat above
plt.figure(figsize=(20,6));
plt.plot(df['datetime'], df['discharge']);
plt.axvline(x=pd.Timestamp('1980-01-01'),color='red',ls='--');
plt.title("Neuse River Near Goldsboro, NC");
plt.ylabel("Discharge (cfs)");

In [None]:
#Plot the full record in gray with the pre and post periods overlaid in color
plt.figure(figsize=(20,6));
plt.plot(df['datetime'], df['discharge'],color='gray');
plt.plot(dfPre['datetime'], dfPre['discharge'],color='c');
plt.plot(dfPost['datetime'], dfPost['discharge'],color='m');
plt.axvline(x=pd.Timestamp('1979-12-31'),color='red',ls='--');
plt.title("Neuse River Near Goldsboro, NC");
plt.ylabel("Discharge (cfs)");
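The same kind of figure can also be built through Matplotlib's object-oriented interface. This sketch uses a short made-up series (not the gage record) and passes the marker date as a `pd.Timestamp`, which avoids relying on Matplotlib to interpret a date string. The names `demo`, `fig`, and `ax` are illustrative.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical discharge values for illustration only
dates = pd.date_range('1979-12-28', periods=8, freq='D')
demo = pd.DataFrame({'discharge': [90, 95, 120, 160, 150, 130, 110, 100]}, index=dates)

fig, ax = plt.subplots(figsize=(10, 3))
ax.plot(demo.index, demo['discharge'], color='gray')            # data line
ax.axvline(pd.Timestamp('1980-01-01'), color='red', ls='--')    # vertical marker
ax.set_title("Demo series")
ax.set_ylabel("Discharge (cfs)")
```

Working with an explicit `ax` object makes it easier to place several plots in one figure later.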

### Creating derived columns

In [None]:
#Convert from cubic feet per second (cfs) to cubic meters per second, stored as 'mps' (1 cfs = 0.028316847 m^3/s)
df['mps'] = df['discharge'] * 0.028316847

In [None]:
#Replot
plt.plot(df['datetime'],df['mps']);

In [None]:
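#Convert from cfs to mgd (1 cfs = 0.646317 MGD in US gallons; 0.53817 is the imperial-gallon factor)
df['mgd'] = df['discharge'] * 0.646317

In [None]:
#Replot
plt.plot(df['datetime'],df['mgd']);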
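As a check on conversion factors like these, they can be derived from unit definitions. This sketch assumes US gallons for MGD (with imperial gallons the factor comes out near 0.538), and all variable names are hypothetical.

```python
# Derive the conversion factors from unit definitions
M_PER_FT = 0.3048            # meters per foot (exact by definition)
IN3_PER_US_GAL = 231         # cubic inches per US gallon (exact by definition)
IN3_PER_FT3 = 12 ** 3        # cubic inches per cubic foot
SECONDS_PER_DAY = 86400

# Cubic meters per second for each cfs
cfs_to_cms = M_PER_FT ** 3

# Million US gallons per day for each cfs
cfs_to_mgd_us = IN3_PER_FT3 / IN3_PER_US_GAL * SECONDS_PER_DAY / 1e6

print(round(cfs_to_cms, 9), round(cfs_to_mgd_us, 6))
```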

### Summarize and Plot Streamflow data

In [None]:
#Get a count of all records from the dataframe's shape (rows, columns)
df.shape

In [None]:
#Or just show the rows, i.e., the first item in the shape result
df.shape[0]

#### Summarizing records by Confidence code
Here, we group the data by the unique values in a column, namely the `Confidence` column. First, we'll just examine the number of unique values and what those values are. 

In [None]:
#Use nunique on the column to list the number of unique values
df['Confidence'].nunique()

In [None]:
#Use unique to show what the 4 unique values are
df['Confidence'].unique()

Now, we'll group the records by confidence code:

In [None]:
#Create the grouped object
grpConfidence = df.groupby(['Confidence'])

In [None]:
#We can now list the counts of records by confidence code
grpConfidence.count()

In [None]:
#Or we can just show the count by a single column
grpConfidence['discharge'].count()
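Grouping splits the rows by each distinct value of the key column and then applies an aggregate to each group. A small sketch with made-up records (the `demo` frame and its codes are hypothetical) shows `count()` on one column next to `size()`; the two differ when values are missing.

```python
import numpy as np
import pandas as pd

# Hypothetical records; one discharge value is missing
demo = pd.DataFrame({
    'Confidence': ['A', 'A', 'P', 'A', 'P'],
    'discharge': [210.0, np.nan, 150.0, 200.0, 175.0],
})

grouped = demo.groupby('Confidence')
counts = grouped['discharge'].count()   # non-missing values per group
sizes = grouped.size()                  # rows per group, missing or not
print(counts)
print(sizes)
```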

In [None]:
#Plot the counts as a bar chart (see https://pandas.pydata.org/pandas-docs/stable/visualization.html)
count_by_Confidence = grpConfidence['discharge'].count()
#Equivalent call: count_by_Confidence.plot(kind='bar');
count_by_Confidence.plot.bar();

In [None]:
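#Check the type of the result (a pandas Series)
type(count_by_Confidence)
#Matplotlib equivalent of the bar chart: plt.bar(count_by_Confidence.index,count_by_Confidence.values)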

##### Summarizing data with `describe`

In [None]:
#Default describe function results
df['discharge'].describe()

In [None]:
#Setting our own percentiles
df['discharge'].describe(percentiles=[0.1,0.25,0.75,0.9])
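To see how the `percentiles` argument shapes the output, here is a minimal sketch on a made-up series of the integers 0 through 100, chosen so that each percentile equals its own rank. The names `values` and `summary` are hypothetical.

```python
import numpy as np
import pandas as pd

# 0, 1, ..., 100: the q-th percentile of this series is simply q
values = pd.Series(np.arange(101), dtype='float64')

# Each requested percentile becomes a labeled row such as '10%'
summary = values.describe(percentiles=[0.1, 0.5, 0.9])
print(summary)
```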

In [None]:
#Describe discharge before 1980 and after 1984 (dfPre and dfPost were created by index slicing)
sumPre = dfPre['discharge'].describe(percentiles=[0.1,0.25,0.75,0.9])
sumPost = dfPost['discharge'].describe(percentiles=[0.1,0.25,0.75,0.9])

In [None]:
#Combine the pre and post summaries
dfSummary = pd.concat([sumPre,sumPost],axis=1)
dfSummary.columns = ("before","after")
dfSummary[4:-2].plot(kind='bar');
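An alternative to renaming the columns after the fact is to label each summary while concatenating. This sketch uses two short made-up series (the values and the names `before`, `after`, and `summary` are hypothetical) and the `keys` argument of `pd.concat`:

```python
import pandas as pd

# Hypothetical discharge values for two periods
before = pd.Series([1.0, 2.0, 3.0], name='discharge')
after = pd.Series([2.0, 4.0, 6.0], name='discharge')

# keys= supplies the column label for each summary in the combined table
summary = pd.concat([before.describe(), after.describe()], axis=1, keys=['before', 'after'])
print(summary.loc[['mean', 'max']])
```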

In [None]:
#Box plots
dfPre['discharge'].plot(
    kind='box',
    title='My chart'
);

In [None]:
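#Plot pre-1980 discharge using Matplotlib's figure and axes objects
fig = plt.figure()
ax = fig.add_subplot(1,1,1)
ax.plot(dfPre['datetime'],dfPre['discharge']);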

## Monthly plots

In [None]:
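#Add a column holding each record's month number (1-12)
df['Mo'] = df['datetime'].dt.month
df.head()

In [None]:
#Group the records by month
byMonth = df.groupby('Mo')

In [None]:
#Compute the mean discharge for each month (the result is a Series)
monthlyDF = byMonth['discharge'].mean()

In [None]:
#View the monthly means
monthlyDF

In [None]:
#Plot mean discharge by month
plt.plot(monthlyDF);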
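The monthly averaging can also be written without a helper column by grouping directly on the month taken from the datetime values. A minimal sketch on made-up data (the `demo` frame and its values are hypothetical):

```python
import pandas as pd

# Hypothetical daily values: Jan 30, Jan 31, Feb 1, Feb 2
dates = pd.date_range('2000-01-30', periods=4, freq='D')
demo = pd.DataFrame({'datetime': dates, 'discharge': [10.0, 20.0, 30.0, 50.0]})

# Group by the month number extracted from the datetime column
monthly_mean = demo.groupby(demo['datetime'].dt.month)['discharge'].mean()
print(monthly_mean)
```

The resulting Series is indexed by month number, so it can be passed to `plt.plot` in the same way as `monthlyDF`.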