# Data Storage and Manipulation with Pandas
We've taken a look at using NumPy arrays for data storage; now let's turn to another tool for organizing and manipulating data. This package is called pandas, and it relies on "dataframe" objects, which make labeling, indexing, and operating on large datasets more convenient.

In [None]:
#Importing the package
import pandas as pd
pd.__version__

We can use pandas to create "series" objects, which behave much like one-dimensional arrays in NumPy. Try running the following block of code to see how creating a series object works.

In [None]:
data = pd.Series([-2,-1,0,1,2])

This series is stored with an index alongside its values, and we can extract either one by running data.values or data.index. With the default integer index, indexing into the series works the same way as in NumPy.

In [None]:
print(data,"\n...........")
print(data.values,"\n...........")
print(data.index,"\n...........")
print(data[1])
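
As a small sketch of what "the same way as numpy" covers, the block below reuses the same series and shows slicing, a boolean mask, and an explicit positional lookup with iloc. The variable names are only illustrative.

```python
import pandas as pd

data = pd.Series([-2, -1, 0, 1, 2])

first_three = data[:3]       # slice by position, as with a numpy array
positives = data[data > 0]   # boolean mask keeps only the matching entries
second = data.iloc[1]        # explicit positional lookup

print(first_three.tolist(), positives.tolist(), second)
```

Using iloc makes it explicit that the lookup is by position, which matters once a series has a non-default index.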

## Dataframes
Dataframes provide a nice way to store data with labels. When working with large datasets containing many different variables, it is often helpful to reference columns or rows by a label instead of a numbered index. There are many ways to build a pandas dataframe; let's examine a couple.

In [None]:
#Direct creation of a dataframe with data
df1 = pd.DataFrame([2,4,6,8]) #indices are automatically assigned (0,1,2,...)
df2 = pd.DataFrame([2,4,6,8],index=['human','dog','insect','spider']) #indices are specified by the second argument: pd.DataFrame(data,index)
df3 = pd.DataFrame([2,4,6,8],index=['human','dog','insect','spider'],columns=['Legs']) #columns names the data column

In [None]:
#Creation from numpy arrays
import numpy as np
x = np.arange(-5,5,1)
y = np.arange(10,101,10)
z = np.linspace(0,1,10)
matrix = [x,y,z]

df4 = pd.DataFrame(matrix).transpose()

In [None]:
#Adding Column Names
cols = ['Col1','Col2','Col3']
df4.columns = cols
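
Another common route, sketched here as an alternative rather than a replacement for the cells above, is to pass a dictionary that maps column names to arrays. The column names are set at creation and each column keeps its own dtype, whereas stacking the arrays and transposing generally converts everything to a common type.

```python
import numpy as np
import pandas as pd

# Same three arrays as above, keyed by the column names we want
df = pd.DataFrame({
    'Col1': np.arange(-5, 5, 1),
    'Col2': np.arange(10, 101, 10),
    'Col3': np.linspace(0, 1, 10),
})

print(df.dtypes)  # integer columns stay integer, Col3 is float
```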

In [None]:
#Index manipulation (uncomment to try)
#df4.set_index('Col1')  #returns a new dataframe; pass inplace=True to modify df4 itself
#df4.reset_index() #defaults: drop=False, inplace=False
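
To make the defaults concrete, here is a minimal sketch on a small made-up dataframe (the data and variable names are hypothetical). Both methods return a new dataframe and leave the original untouched unless inplace=True is passed.

```python
import pandas as pd

df = pd.DataFrame({'Col1': [10, 20, 30], 'Col2': ['a', 'b', 'c']})

indexed = df.set_index('Col1')            # new frame; df itself is unchanged
restored = indexed.reset_index()          # index moves back into a column
dropped = indexed.reset_index(drop=True)  # old index is discarded instead

print(indexed)
print(restored)
print(dropped)
```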

In [None]:
#Viewing part of the data
#df4.head()

In [None]:
#Adding a column
c4 = ['zero','one','two','three','four','five','six','seven','eight','nine']
df4['Col4'] = c4

In [None]:
#Indexing into dataframes
df4['Col1']
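
Square brackets with a column name are only one way in. The sketch below rebuilds the small "Legs" dataframe from earlier and shows label-based lookup with loc, position-based lookup with iloc, and row filtering with a boolean mask.

```python
import pandas as pd

df = pd.DataFrame(
    {'Legs': [2, 4, 6, 8]},
    index=['human', 'dog', 'insect', 'spider'],
)

legs = df['Legs']                 # one column, returned as a Series
dog_legs = df.loc['dog', 'Legs']  # row label, then column label
first_row = df.iloc[0]            # first row by position
many_legs = df[df['Legs'] > 4]    # boolean mask selects rows

print(dog_legs)
print(many_legs)
```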

### Pulling in data from other sources
We can also use pandas to pull data from other sources. There are a plethora of ways data can be stored - SQL tables, txt, csv, pkl, etc. Pandas has a reader for most of them (pd.read_csv, pd.read_sql, pd.read_pickle, and so on), and a quick search will usually turn up the right one for a given data source.

For now, let's add some CSV data. Navigate to the GitHub page, download "eddy.csv", and save it to a location on your computer where you can find it again.

![](https://imgur.com/tluBCEZ.png)

![](https://imgur.com/ElrlfMk.png)


In [None]:
#Import csv data to dataframe
source = pd.read_csv('source_data.csv')
tower = pd.read_csv('tower_data.csv')
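
By default read_csv leaves timestamp columns as plain strings. The sketch below, which uses a made-up two-row sample in place of a real file, shows the parse_dates argument converting a column to datetimes on load. The column names follow the ones used in this notebook; the values are invented.

```python
import io
import pandas as pd

# Hypothetical sample standing in for a CSV file on disk
csv_text = """EPOCH_TIME,Local_DT,LI_CO2
1536170400,2018-09-05 18:00:00,410.2
1536170401,2018-09-05 18:00:01,411.0
"""

sample = pd.read_csv(io.StringIO(csv_text), parse_dates=['Local_DT'])

print(sample.dtypes)  # Local_DT is now a datetime column
```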

In [None]:
import matplotlib.pyplot as plt #import the package
fig,ax = plt.subplots() #setup the figure
ax.plot(source['Local_DT'],source['LI_CO2']) #plot the data
ax.xaxis.set_major_locator(plt.MaxNLocator(10)) #set the number of x axis ticks
plt.gcf().autofmt_xdate() #get a nice date format for the x axis
plt.show()

In [None]:
fig,ax = plt.subplots() #setup the figure
ax.plot(tower['Local_DT'],tower['PIC_CO2']) #plot the data
ax.xaxis.set_major_locator(plt.MaxNLocator(10))  #set the number of x axis ticks
plt.gcf().autofmt_xdate() #get a nice date format for the x axis
plt.show()

In [None]:
#Clip the data to a shorter time window
DT1 = '2018-09-05 18:00:00'
DT2 = '2018-09-05 19:30:00'
source_clipped = source.loc[(source['Local_DT']>=DT1)&(source['Local_DT']<=DT2)].reset_index(drop=True)
tower_clipped = tower.loc[(tower['Local_DT']>=DT1)&(tower['Local_DT']<=DT2)].reset_index(drop=True)
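
The comparison above works on strings, which is fine as long as the timestamps are stored as "YYYY-MM-DD HH:MM:SS" text, because that format sorts in time order. A slightly more robust sketch, on a made-up dataframe, converts the column to datetimes first and uses between, which includes both endpoints by default.

```python
import pandas as pd

# Made-up data using the Local_DT column name from the cells above
df = pd.DataFrame({
    'Local_DT': ['2018-09-05 17:59:00', '2018-09-05 18:30:00',
                 '2018-09-05 19:30:00', '2018-09-05 20:00:00'],
    'LI_CO2': [400.0, 405.0, 410.0, 415.0],
})
df['Local_DT'] = pd.to_datetime(df['Local_DT'])

start = pd.Timestamp('2018-09-05 18:00:00')
end = pd.Timestamp('2018-09-05 19:30:00')

# Inclusive at both ends, matching the >= and <= comparison
clipped = df[df['Local_DT'].between(start, end)].reset_index(drop=True)

print(clipped)
```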

In [None]:
import matplotlib.gridspec as grd
fig = plt.figure(figsize=(10,5))
gs = grd.GridSpec(2,1)
ax = fig.add_subplot(gs[0])
ax.plot(source_clipped['EPOCH_TIME'],source_clipped['LI_CO2'],color='blue')
ax = fig.add_subplot(gs[1],sharex=ax)
ax.plot(tower_clipped['EPOCH_TIME'],tower_clipped['PIC_CO2'],color='red')
plt.gcf().autofmt_xdate()
plt.show()


### Concatenation
We can also merge dataframes together by index - in this case let's use the EPOCH_TIME. Concatenating along columns aligns rows by index label, so duplicate index values can raise an error and need to be removed first.

In [None]:
#Set epoch time as the index for concatenation
tower = tower.set_index('EPOCH_TIME')
source = source.set_index('EPOCH_TIME')

#Ensure there are no duplicate epochs - duplicated index values caused an error when concatenating
source = source[~source.index.duplicated(keep='first')]
tower = tower[~tower.index.duplicated(keep='first')]

In [None]:
full_df = pd.concat([tower,source],axis=1).dropna().drop('Local_DT',axis=1).reset_index()
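
To see what the chained call is doing, here is a minimal sketch with two tiny made-up dataframes whose indexes only partly overlap. Concatenating along axis=1 takes the union of the index labels and fills the gaps with NaN; dropna then keeps only the rows present in both.

```python
import pandas as pd

# Invented values; the index plays the role of EPOCH_TIME
tower = pd.DataFrame({'PIC_CO2': [400.0, 401.0, 402.0]},
                     index=pd.Index([1, 2, 3], name='EPOCH_TIME'))
source = pd.DataFrame({'LI_CO2': [410.0, 411.0, 412.0]},
                      index=pd.Index([2, 3, 4], name='EPOCH_TIME'))

aligned = pd.concat([tower, source], axis=1)  # union of indexes, NaN where missing
both = aligned.dropna().reset_index()         # keep only the shared epochs

print(aligned)
print(both)
```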

### Running functions on DataFrames
Often we will want to run functions on data within the dataframe to create a new column, much like what most people are familiar with doing in Excel. Let's take a look at how to do this.

In [None]:
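def xy_speed(row):
    #Magnitude of the horizontal wind vector from its x and y components
    return np.sqrt(row['ANEM_X']**2+row['ANEM_Y']**2)


def direction(row):
    #Angle of the vector in degrees, counterclockwise from the +x axis
    d = np.arctan2(row['ANEM_Y'],row['ANEM_X'])*180.0/np.pi
    #arctan2 returns values in (-180,180]; shift negatives into [0,360)
    if d < 0:
        return 360.0 + d
    else:
        return d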

In [None]:
#axis=1 passes each row to the function
full_df['speed'] = full_df.apply(xy_speed,axis=1)
full_df['direction'] = full_df.apply(direction,axis=1)
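
Row-wise apply is easy to read but calls the Python function once per row. As a hedged alternative, the sketch below computes the same two quantities on whole columns at once with NumPy, using a small made-up dataframe with the ANEM_X and ANEM_Y column names from above. The modulo maps negative angles into the 0 to 360 range, playing the role of the if/else in direction.

```python
import numpy as np
import pandas as pd

# Invented wind components for illustration
df = pd.DataFrame({'ANEM_X': [3.0, 0.0, -1.0, 0.0],
                   'ANEM_Y': [4.0, 1.0, 0.0, -1.0]})

# Vector magnitude, computed column-wise
df['speed'] = np.hypot(df['ANEM_X'], df['ANEM_Y'])

# Angle in degrees, then wrapped so negative values land in [0, 360)
angle = np.degrees(np.arctan2(df['ANEM_Y'], df['ANEM_X']))
df['direction'] = angle % 360.0

print(df)
```

On large dataframes the column-wise form is typically much faster, though the apply version remains a reasonable choice when the per-row logic is harder to express with array operations.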

In [None]:
full_df