# World Ocean Data Experimentation

In this notebook, we're going to learn how to manipulate the dataset to tell the stories we want to tell.

There's a lot we can do with the dataset once we import it.

Let's do that, along with all the other cool Python libraries we'll need.

In [14]:
# Import the stuff we need
import pandas as pd # Pandas: Turn spreadsheets into useful DataFrame objects
import numpy as np # NumPy: Handy math functions
import seaborn as sns # Seaborn: Pretty graphs/charts
import matplotlib.pyplot as plt # Matplotlib: What pandas and seaborn use under the hood

# Create a dataframe from the raw CSV data file
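df = pd.read_csv("wod.csv").drop("datetime.1", axis=1)
# Use the datetime column as the index (df has to exist before we can refer to df["datetime"])
df = df.set_index(pd.DatetimeIndex(df["datetime"]))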
# Check out the shape of the DataFrame (rows/columns)
print(df.shape)

(24259, 14)


Kind of a lot going on in there already. First we imported a whole bunch of Python modules that we use to work with our dataset. You'll see them in use as we go through the example code. For now, the most important module we need to know about is **pandas**. Pandas is an amazing tool for data science that lets us turn raw data into a special data type known as a **DataFrame**. Think of DataFrames as spreadsheets that you can store and manipulate with code.
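
To make the "spreadsheet in code" idea concrete, here is a minimal sketch that builds a tiny DataFrame by hand. The values are made up for illustration (they are not rows from `wod.csv`), and `toy` is just a placeholder name.

```python
import pandas as pd

# Made-up sample values for illustration; not rows from wod.csv
toy = pd.DataFrame({
    "datetime": ["2010-01-23 11:00:00", "2010-01-23 11:00:00", "2010-02-14 09:00:00"],
    "depth": [0.0, 5.0, 0.0],
    "temperature": [14.32, 14.33, 13.90],
})

# Same idea as the import cell: use the datetime column as the index
toy = toy.set_index(pd.DatetimeIndex(toy["datetime"]))

print(toy.shape)                 # (rows, columns)
print(type(toy.index).__name__)  # the kind of index we ended up with
```

Each dictionary key becomes a column and each list becomes that column's values, which is roughly what `read_csv` does for us from a file.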

Working with spreadsheets is one way of analyzing data, but as your datasets get larger, this kind of "manual" analysis gets more and more unwieldy. 

Take a look up there at what `print(df.shape)` says about our DataFrame. It has 24,259 rows and 14 columns. That's...

In [16]:
24259 * 14 # Don't judge me. I learned programming so I wouldn't have to do arithmetic in my head.

339626

...individual pieces of data. That's just too much to work with in a spreadsheet. Besides, this way, no errant clicks will damage our data!
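
If you'd rather let pandas do that multiplication, the `size` attribute counts the cells for you. A small sketch on a made-up frame (`toy` is a stand-in, not the real dataset):

```python
import pandas as pd

# A made-up 3-row, 2-column frame standing in for the real dataset
toy = pd.DataFrame({
    "depth": [0.0, 5.0, 10.0],
    "temperature": [14.32, 14.33, 14.34],
})

rows, cols = toy.shape   # shape unpacks into (rows, columns)
print(rows * cols)       # doing the multiplication ourselves
print(toy.size)          # pandas' built-in count of cells
```

Note that `size` counts every cell, including any that hold missing values.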

## Working with DataFrames

DataFrames exist to make working with data easier. But like all objects in code, we have to know the **methods** that we can use. [This link](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html) will take you to a full list of what DataFrames can do, but let's give a few useful examples.

### Basic DataFrame Info
It's always good to know how much data you're working with, and what kind. For that, we have the `shape` property and the `info()` method. `shape` shows us a pair of values, indicating how many rows/columns our DataFrame has. Let's try it now.

In [48]:
print(df.shape) # Note that shape is a property, not a function that we call with ()

(24259, 14)


`info()` will give us a rundown of each column in the DataFrame, as well as the DataFrame's **index**.

### Index

A DataFrame's index is like its built-in x-axis. When we go about graphing individual columns, you'll see that we never have to provide an x-axis value, because pandas uses the index by default. Let's look at the info for our DataFrame now:

In [49]:
print(df.info()) # info() prints its own report and returns None, which is why "None" shows up at the end

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 24259 entries, 2010-01-23 11:00:00 to 2016-11-21 08:00:00
Data columns (total 14 columns):
datetime               24259 non-null object
depth                  24259 non-null float64
depth_qc               24259 non-null int64
oxygen                 23413 non-null float64
salinity               23447 non-null float64
salinity_qc_flag       23447 non-null float64
temperature            23875 non-null float64
temperature_qc_flag    23875 non-null float64
latitude               24259 non-null float64
longitude              24259 non-null float64
year                   24259 non-null int64
month                  24259 non-null int64
day                    24259 non-null int64
time                   24259 non-null float64
dtypes: float64(9), int64(4), object(1)
memory usage: 3.4+ MB
None


You'll notice right up top that this DataFrame has a `DatetimeIndex`. If we hadn't set one, pandas would have given us its default, a `RangeIndex`, which is just a row number. `DatetimeIndex` is handy, which we'll see shortly.
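
Here's a rough sketch of the difference between the two index types, using a few made-up readings rather than the real file. It also shows one thing a `DatetimeIndex` makes easy: picking out a whole month with a partial date string.

```python
import pandas as pd

# Made-up readings for illustration
toy = pd.DataFrame({
    "datetime": ["2010-01-23 11:00:00", "2010-02-14 09:00:00", "2011-01-05 08:00:00"],
    "temperature": [14.32, 13.90, 14.10],
})
default_index = type(toy.index).__name__
print(default_index)             # what pandas gives us when we don't set an index

# Swap in a DatetimeIndex built from the datetime column, sorted by date
toy = toy.set_index(pd.DatetimeIndex(toy["datetime"])).sort_index()
print(type(toy.index).__name__)

# With dates in the index, a partial date string selects a whole month
jan_2010 = toy.loc["2010-01"]
print(jan_2010["temperature"])
```

Sorting the index first is a precaution in this sketch; date-based slicing is most predictable on a sorted index.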

You also see what columns are available for use. Looks like we have `depth`, `oxygen`, `salinity`, and `temperature` among others.

Be aware that not all the columns have data in each row. If you don't see a `24259` next to the column name, there's some missing data. Keep that in mind when analyzing.
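
Rather than eyeballing the `info()` output, you can ask pandas to count the gaps directly. A minimal sketch, assuming a made-up frame where `np.nan` marks a missing reading:

```python
import numpy as np
import pandas as pd

# Made-up rows; np.nan marks a missing reading
toy = pd.DataFrame({
    "depth": [0.0, 5.0, 10.0, 15.0],
    "oxygen": [5.65, np.nan, 5.64, np.nan],
    "temperature": [14.32, 14.33, np.nan, 14.33],
})

missing = toy.isna().sum()   # number of missing values in each column
print(missing)

complete = toy.dropna()      # keep only the rows with no missing values
print(complete.shape)
```

Whether to drop incomplete rows or keep them depends on the question you're asking, so treat `dropna()` as one option rather than a default.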

Now that we know the shape of the data, let's take a look at the values themselves. `head()` gives us a quick preview of the first few rows of our DataFrame.

In [50]:
print(df.head())

                                datetime  depth  depth_qc  oxygen  salinity  \
datetime                                                                      
2010-01-23 11:00:00  2010-01-23 11:00:00    0.0         0    5.65    33.380   
2010-01-23 11:00:00  2010-01-23 11:00:00    5.0         0    5.65    33.383   
2010-01-23 11:00:00  2010-01-23 11:00:00   10.0         0    5.64    33.378   
2010-01-23 11:00:00  2010-01-23 11:00:00   15.0         0    5.64    33.378   
2010-01-23 11:00:00  2010-01-23 11:00:00   20.0         0    5.64    33.379   

                     salinity_qc_flag  temperature  temperature_qc_flag  \
datetime                                                                  
2010-01-23 11:00:00               0.0        14.32                  0.0   
2010-01-23 11:00:00               0.0        14.33                  0.0   
2010-01-23 11:00:00               0.0        14.34                  0.0   
2010-01-23 11:00:00               0.0        14.33                  0.0

Looks reasonable. Let's drill down.

## Accessing single columns

If I want to look at _just_ temperature (and the index, of course), I can select that column with the syntax:

```python
df["temperature"]
```
That provides a Series (like a 1-column DataFrame) with just temperature. Check it out:

In [55]:
temps = df["temperature"] # Change this to another column to see what you get!
print(temps.head())

datetime
2010-01-23 11:00:00    14.32
2010-01-23 11:00:00    14.33
2010-01-23 11:00:00    14.34
2010-01-23 11:00:00    14.33
2010-01-23 11:00:00    14.34
Name: temperature, dtype: float64
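
If you want more than one column, you can pass a list of names instead of a single name. A short sketch on a hand-built frame (the three rows reuse values from the `head()` preview above; `toy` is a placeholder name):

```python
import pandas as pd

toy = pd.DataFrame({
    "depth": [0.0, 5.0, 10.0],
    "salinity": [33.380, 33.383, 33.378],
    "temperature": [14.32, 14.33, 14.34],
})

one_col = toy["temperature"]              # a single name gives a Series
two_cols = toy[["depth", "temperature"]]  # a list of names gives a DataFrame

print(type(one_col).__name__)
print(type(two_cols).__name__, two_cols.shape)
```

Note the double brackets in the second case: the outer pair does the selecting, and the inner pair is an ordinary Python list.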


## Filtering Data

Sometimes you don't want _all_ the data. Sometimes you just want a piece. Filtering helps us with that. Let's say I only want data from Januaries. As in, I want to know what things were like across our date range, but _only_ in January.

Helpfully, our DataFrame has a `month` column in it, where numbers `1` to `12` indicate the month of the row's data. Just like we used a column name to get only a single column, we can use a **boolean expression** to filter our data down to just rows that match our expression.

Remember that boolean expressions evaluate to `True` or `False`. So if I were to say "Only January," in our DataFrame, the expression would be:

```python
df["month"] == 1
```

When I put that whole expression inside `df[...]`, I get a filtered DataFrame with just the rows where the expression is `True`. We'll also grab data from Mays for comparison. Try it:

In [67]:
jans = df[df["month"] == 1]
mays = df[df["month"] == 5]
print(mays.shape)
# Average the month number in each filtered set to prove it only contains that month
print("jans average month:", jans["month"].mean()) # Cool trick, right?
print("mays average month:", mays["month"].mean())

(827, 14)
jans average month: 1.0
mays average month: 5.0
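
Filters can also be combined. Here's a minimal sketch on made-up rows showing "and", "or", and `isin()`; the names `shallow_jans` and `jans_or_mays` are just for illustration.

```python
import pandas as pd

# Made-up rows for illustration
toy = pd.DataFrame({
    "month": [1, 1, 5, 5, 8],
    "depth": [0.0, 250.0, 10.0, 300.0, 5.0],
})

# Each condition gets its own parentheses; & means "and", | means "or"
shallow_jans = toy[(toy["month"] == 1) & (toy["depth"] <= 200)]
jans_or_mays = toy[(toy["month"] == 1) | (toy["month"] == 5)]

# isin() is a shorter way to match any value from a list
also_jans_or_mays = toy[toy["month"].isin([1, 5])]

print(shallow_jans.shape, jans_or_mays.shape)
```

The parentheses matter: `&` and `|` bind more tightly than `==` and `<=` in Python, so leaving them out leads to an error or a wrong result.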


We can also filter data based on limits. Let's find out what the maximum, minimum, median, and average depths in our dataset are:

In [69]:
print("Max depth: ", df["depth"].max())
print("Min depth: ", df["depth"].min())
print("Avg depth: ", df["depth"].mean())
print("Median depth: ", df["depth"].median())

Max depth:  3500.0
Min depth:  0.0
Avg depth:  161.7344078486335
Median depth:  80.0


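Sooo, it's pretty clear that most of our data is _not_ down at 3500. Later, we'll do a visual analysis for more clarity on this. But for now, let's keep only the shallower data: rows with a depth of 200 or less. (The World Ocean Database normally reports depth in meters, so read that as 200 m unless your copy of the file was converted.)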

In [74]:
filtered_depth = df[df["depth"] <= 200] # Play with the number to see how the shape changes!
print(filtered_depth.shape)

(17321, 14)
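
To see how the cutoff changes what you keep, you can loop over a few values. This is a sketch on made-up depths, so the numbers it prints won't match the real dataset:

```python
import pandas as pd

# Made-up depths for illustration
toy = pd.DataFrame({"depth": [0.0, 50.0, 150.0, 200.0, 400.0, 3500.0]})

for cutoff in [100, 200, 500]:
    kept = toy[toy["depth"] <= cutoff]
    print(cutoff, kept.shape)

# The mean of a boolean Series is the fraction of True values,
# so this is the share of rows at a depth of 200 or less
share_kept = (toy["depth"] <= 200).mean()
print(share_kept)
```

The same `.mean()` trick works on the real DataFrame if you want to know what fraction of rows a filter keeps before you commit to it.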


## Plotting Data