<img src="files/img/pandas.png" alt="Pandas" />

# [Pandas](http://pandas.pydata.org/) - Python Data Analysis Library
---
## I shamelessly picked from these chaps:
 - [Daniel Chen - Pandas for Data Analysis](https://www.youtube.com/watch?v=oGzU688xCUs)
 - [Jeff Delaney - 19 Essential Snippets in Pandas](https://jeffdelaney.me/blog/useful-snippets-in-pandas/)
 - [Burke Squires - Intro to Data Analysis with Python](https://github.com/burkesquires/python_biologist/tree/master/05_python_data_analysis)


## General plan:
- what is Pandas all about?
- brief intro to pandas objects and syntax
- numpy dataframe, show the basics
- import gapminder dataset, interactive
---
### Things to show:
- start with building from numpy to a pandas dataframe
- import pandas as pd
- create dataframe (first from np.array)
- explore: head, tail, sample, shape, describe, info
- differentiate series (single vector) and dataframes (multiple vectors)
- change column names (lists)
- add/remove, and reorder a column (mean, conditional logic)
- add/remove and reorder a row (dictionary, key=col name)
- change value in row
- create a plot (matplotlib)
- combine two dataframes (create a new dataframe with same index as the first)
    - (axis=0 and axis=1)
- demonstrate .loc and .iloc
- reindex (change index number)
    
### Gapminder dataset: 
- how to import a (tab)-delimited file as a dataframe
- explore dataset
    - df.duplicated
    - df.unique
    - df.nunique
- filter
    - conditional logic
    - sort
    - groupby
- apply function to every row
    - do this with a loop
- filling NaNs / missing data
- create plot (save?)
- export as .csv, excel sheet, or pickle file

### Extras:
- time series (use in index to sort/order)
- [tidy dataset](http://vita.had.co.nz/papers/tidy-data.pdf)

## Jupyter Notebook Shortcuts
- documentation: [mysterious_function]?
- check function arguments: shift + tab
- run current cell/block: shift + enter 
- insert cell above: esc + a
- delete cell: esc (hold) + d + d (double tap)

In [1]:
type?

<img src="files/img/python-scientific-ecosystem.png" alt="Python Scientific Ecosystem" />

## What is Pandas?
- This is the go-to data analysis library for Python
- The non-clicky version of Excel
- Cousin of R
- Progeny of NumPy
- Best buds with Matplotlib
- Think in vector operations

## Pandas Objects and Syntax:
- DataFrame = Indexed rows and columns of data, like a spreadsheet or database table.
- Series = single column of data
- Shape: [number_of_rows, number_of_columns] in a DataFrame
- Axis: 
    - 0 == Calculate statistic for each column
    - 1 == Calculate statistic for each row
    - *with drop(), the axis names where the labels live: axis=0 removes rows, axis=1 removes columns
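
To make the axis numbers concrete, here is a minimal sketch on a tiny made-up frame (`demo`, `x` and `y` are hypothetical names, not part of the workshop data). It computes a mean along each axis and then uses `drop()` with each axis value.

```python
import pandas as pd

# small hypothetical frame: 3 rows, 2 columns
demo = pd.DataFrame({'x': [1, 2, 3], 'y': [10, 20, 30]})

col_means = demo.mean(axis=0)   # one value per column
row_means = demo.mean(axis=1)   # one value per row

# drop() uses axis to say where the label should be looked up
no_y = demo.drop('y', axis=1)    # remove the column labelled 'y'
no_first = demo.drop(0, axis=0)  # remove the row labelled 0

print(col_means.tolist())  # [2.0, 20.0]
print(row_means.tolist())  # [5.5, 11.0, 16.5]
```

For a statistic, the axis is the direction that gets collapsed. For `drop()`, it is the axis that holds the label being removed.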
    
<img src="files/img/python-operations-across-axes.svg" alt="Operations Across Axes" />

In [2]:
# quick demo of speed and vector operations

# standard python 

# create 3 lists of a million ints
# (in Python 3, range() is lazy, so wrap it in list() to get real lists)
A = list(range(1000000))
B = list(range(1000000))
C = list(range(1000000))

# begin timing the operation
import time
start_time = time.time()

# generate new list based on the A, B, and C lists
Z = []
for idx in range(len(A)):
    Z.append(A[idx] + B[idx] * C[idx])

python_time = time.time() - start_time
print('Took', python_time, 'seconds')

Took 1.4674549102783203 seconds


In [3]:
# repeat with NumPy

# create 3 arrays of a million ints
import numpy as np
A=np.arange(1000000)
B=np.arange(1000000)
C=np.arange(1000000)

# begin timing the operation
start_time = time.time()

# generate new array based on the A, B, and C arrays
Z = A + B * C

numpy_time = time.time() - start_time
print('Took', numpy_time, 'seconds')

# how much faster is NumPy
print('Numpy is', python_time/numpy_time, 'times faster')

Took 0.052857160568237305 seconds
Numpy is 27.76265116216131 times faster
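
A single `time.time()` measurement can be noisy, so the exact speed-up will vary from run to run. As a rough sketch (not the original demo), the standard-library `timeit` module can repeat each version and keep the best run. The helper names `with_loop` and `with_numpy` and the smaller size `n` are assumptions made for illustration.

```python
import timeit
import numpy as np

n = 100000  # smaller than the demo above so the repeats stay quick
a_list = list(range(n))
a_arr = np.arange(n, dtype=np.int64)  # int64 avoids overflow in a * a

def with_loop():
    # element-by-element arithmetic in plain Python
    return [a_list[i] + a_list[i] * a_list[i] for i in range(n)]

def with_numpy():
    # the same arithmetic as one vector operation
    return a_arr + a_arr * a_arr

# the best of several runs is less sensitive to background noise
loop_best = min(timeit.repeat(with_loop, number=1, repeat=5))
numpy_best = min(timeit.repeat(with_numpy, number=1, repeat=5))
print('loop :', loop_best, 'seconds')
print('numpy:', numpy_best, 'seconds')
```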


# Creating A Simple Pandas DataFrame From NumPy
---
## Create A NumPy Array Of Integers

In [4]:
# check numpy version
print('NumPy version:',np.version.version)

NumPy version: 1.13.3


In [8]:
# create a 100x4 numpy ndarray (100 rows, 4 columns) using numpy.random.randint()
np.random.seed(0)
array = np.random.randint(0,100,size=(100,4))
array[:5]

array([[44, 47, 64, 67],
       [67,  9, 83, 21],
       [36, 87, 70, 88],
       [88, 12, 58, 65],
       [39, 87, 46, 88]])

In [9]:
# check the array and the type
type(array)

numpy.ndarray

## Create A Pandas DataFrame

In [10]:
# import the pandas library
import pandas as pd
print('Pandas version:', pd.__version__)

Pandas version: 0.20.3


In [35]:
# create a Pandas DataFrame from the NumPy ndarray
df = pd.DataFrame(data=array, index=None, columns=None, dtype=None)
#df

## Exploring A DataFrame

In [None]:
# check the shape
df.shape

In [None]:
# you can also use the len() function to get the number of rows/observations
len(df)

In [12]:
# get a concise summary of the DataFrame with .info()
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 4 columns):
0    100 non-null int64
1    100 non-null int64
2    100 non-null int64
3    100 non-null int64
dtypes: int64(4)
memory usage: 3.2 KB


In [13]:
# view the top 5 rows
df.head()

|  | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | 44 | 47 | 64 | 67 |
| 1 | 67 | 9 | 83 | 21 |
| 2 | 36 | 87 | 70 | 88 |
| 3 | 88 | 12 | 58 | 65 |
| 4 | 39 | 87 | 46 | 88 |


In [14]:
# view the bottom 5 rows
df.tail()

|  | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 95 | 8 | 79 | 79 | 53 |
| 96 | 11 | 4 | 39 | 92 |
| 97 | 45 | 26 | 74 | 52 |
| 98 | 49 | 91 | 51 | 99 |
| 99 | 18 | 34 | 51 | 30 |


In [15]:
# take a sample of random rows/observations
df.sample(5)

|  | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 51 | 24 | 79 | 41 | 18 |
| 75 | 93 | 84 | 2 | 69 |
| 21 | 6 | 68 | 47 | 3 |
| 28 | 48 | 93 | 3 | 98 |
| 5 | 81 | 37 | 25 | 77 |


In [16]:
# view brief descriptive stats of the DataFrame
df.describe()

|  | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| count | 100.0 | 100.0 | 100.0 | 100.0 |
| mean | 47.71 | 50.51 | 45.38 | 51.85 |
| std | 28.639626 | 27.36897 | 28.634859 | 29.711194 |
| min | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 23.75 | 31.75 | 20.75 | 28.0 |
| 50% | 44.0 | 50.0 | 43.0 | 52.0 |
| 75% | 73.0 | 75.5 | 69.25 | 75.5 |
| max | 99.0 | 99.0 | 99.0 | 99.0 |


## Manipulating DataFrame Columns (Variables)

In [26]:
# view current columns
df.columns

RangeIndex(start=0, stop=4, step=1)

In [27]:
# send current column names to a list
cols = df.columns.tolist()
cols

[0, 1, 2, 3]

In [37]:
# change column names

# create a list of new column names (same length as columns)
cols = ['a', 'b', 'c', 'd']

# set list to column names
df.columns = cols
df.head()

|  | a | b | c | d |
|---|---|---|---|---|
| 0 | 44 | 47 | 64 | 67 |
| 1 | 67 | 9 | 83 | 21 |
| 2 | 36 | 87 | 70 | 88 |
| 3 | 88 | 12 | 58 | 65 |
| 4 | 39 | 87 | 46 | 88 |


In [29]:
# insert a new column
df['new_column'] = 'cheese'
df.head()

|  | a | b | c | d | new_column |
|---|---|---|---|---|---|
| 0 | 44 | 47 | 64 | 67 | cheese |
| 1 | 67 | 9 | 83 | 21 | cheese |
| 2 | 36 | 87 | 70 | 88 | cheese |
| 3 | 88 | 12 | 58 | 65 | cheese |
| 4 | 39 | 87 | 46 | 88 | cheese |
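
A new column does not have to be a constant. The plan mentions columns built from a mean and from conditional logic, so here is a minimal sketch of both on a small made-up frame. The names `demo`, `row_mean` and `bigger` are hypothetical. `np.where` is just one of several ways to express the condition.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({'a': [44, 67, 36], 'b': [47, 9, 87]})

# row-wise mean of the numeric columns
demo['row_mean'] = demo[['a', 'b']].mean(axis=1)

# conditional logic: record which of the two columns holds the larger value
demo['bigger'] = np.where(demo['a'] > demo['b'], 'a', 'b')

print(demo)
```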


In [30]:
# changing the column positions

# set current column order to a list object
cols = df.columns.tolist()
print('Starting column order:', cols)

# manipulate column names as a list object
# reverse column order
rev_order = cols[::-1]
print('Reverse column order:', rev_order)

# move last column to first
new_order = cols[-1:] + cols[:-1]
print('Last to first order:', new_order)

# set the column order (creates new dataframe)
df = df[new_order]
df.head()

Starting column order: ['a', 'b', 'c', 'd', 'new_column']
Reverse column order: ['new_column', 'd', 'c', 'b', 'a']
Last to first order: ['new_column', 'a', 'b', 'c', 'd']


|  | new_column | a | b | c | d |
|---|---|---|---|---|---|
| 0 | cheese | 44 | 47 | 64 | 67 |
| 1 | cheese | 67 | 9 | 83 | 21 |
| 2 | cheese | 36 | 87 | 70 | 88 |
| 3 | cheese | 88 | 12 | 58 | 65 |
| 4 | cheese | 39 | 87 | 46 | 88 |


In [31]:
# delete a column
del df['new_column']
df.head()

|  | a | b | c | d |
|---|---|---|---|---|
| 0 | 44 | 47 | 64 | 67 |
| 1 | 67 | 9 | 83 | 21 |
| 2 | 36 | 87 | 70 | 88 |
| 3 | 88 | 12 | 58 | 65 |
| 4 | 39 | 87 | 46 | 88 |


In [38]:
# alternate way to delete a column (or row): drop() with axis=1 for columns, axis=0 for rows

# save the column to add back later
a = df['a']

# use drop() to remove column
df.drop(['a'], axis=1) # does this really delete the column?

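|  | b | c | d |
|---|---|---|---|
| 0 | 47 | 64 | 67 |
| 1 | 9 | 83 | 21 |
| 2 | 87 | 70 | 88 |
| 3 | 12 | 58 | 65 |
| 4 | 87 | 46 | 88 |
| 5 | 37 | 25 | 77 |
| 6 | 9 | 20 | 80 |
| 7 | 79 | 47 | 64 |
| 8 | 99 | 88 | 49 |
| 9 | 19 | 19 | 14 |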


In [39]:
# recheck to see if column 'a' was dropped
df.head()

|  | a | b | c | d |
|---|---|---|---|---|
| 0 | 44 | 47 | 64 | 67 |
| 1 | 67 | 9 | 83 | 21 |
| 2 | 36 | 87 | 70 | 88 |
| 3 | 88 | 12 | 58 | 65 |
| 4 | 39 | 87 | 46 | 88 |


In [40]:
# drop column 'a' properly
df2 = df.drop(['a'], axis=1)
df2.head()

|  | b | c | d |
|---|---|---|---|
| 0 | 47 | 64 | 67 |
| 1 | 9 | 83 | 21 |
| 2 | 87 | 70 | 88 |
| 3 | 12 | 58 | 65 |
| 4 | 87 | 46 | 88 |


In [43]:
# but we still have the original DataFrame with 4 columns
df.head()

|  | a | b | c | d |
|---|---|---|---|---|
| 0 | 44 | 47 | 64 | 67 |
| 1 | 67 | 9 | 83 | 21 |
| 2 | 36 | 87 | 70 | 88 |
| 3 | 88 | 12 | 58 | 65 |
| 4 | 39 | 87 | 46 | 88 |
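
Because `drop()` returns a new DataFrame, the usual pattern is to assign the result back to a name. A small sketch on a made-up frame (`demo` is hypothetical) shows two spellings. Note the assumption: the `columns=` keyword needs pandas 0.21 or later, which is newer than the 0.20.3 printed above.

```python
import pandas as pd

demo = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})

# option 1: rebind the name to the returned copy
demo = demo.drop(['a'], axis=1)

# option 2: the columns= keyword (pandas 0.21+) avoids the axis number
demo = demo.drop(columns=['b'])

print(demo.columns.tolist())  # ['c']
```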


## Combining DataFrames
Before getting into the manipulation of DataFrame rows, it helps to understand a bit more about index values and combined DataFrames.

In [45]:
# Using pandas.concat() to combine DataFrames
df3 = pd.concat([df, df])
df3.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 200 entries, 0 to 99
Data columns (total 4 columns):
a    200 non-null int64
b    200 non-null int64
c    200 non-null int64
d    200 non-null int64
dtypes: int64(4)
memory usage: 7.8 KB
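
Notice that the summary reports 200 entries but index labels that still run from 0 to 99: `concat()` keeps the original labels, so each label now appears twice. A minimal sketch on a tiny made-up frame (`demo` is hypothetical) shows the effect, and how `ignore_index=True` renumbers the rows instead.

```python
import pandas as pd

demo = pd.DataFrame({'a': [1, 2, 3]})

stacked = pd.concat([demo, demo])                        # labels 0, 1, 2, 0, 1, 2
renumbered = pd.concat([demo, demo], ignore_index=True)  # labels 0 to 5

print(stacked.index.tolist())
print(renumbered.index.tolist())
print(len(stacked.loc[0]))  # 2, because two rows share the label 0
```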


## Manipulating DataFrame Rows (Observations)
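Two important indexers to introduce here are .loc[] and .iloc[] (they take square brackets, not parentheses):
- .loc[] - selects rows (and columns) by index label
- .iloc[] - selects rows (and columns) by integer position

You may come across .ix[] in older code; it is deprecated as of pandas 0.20 and was removed in pandas 1.0, so prefer .loc[] and .iloc[].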

In [None]:
# view column names in list
df.columns.tolist()

In [None]:
# .ix is deprecated; .loc[0] selects the row labelled 0, .iloc[0] selects the first row
df.loc[0]
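
With the default 0, 1, 2, ... index, labels and positions coincide, which hides the difference between selecting by label and by position. A small sketch with a deliberately shuffled index (the frame `demo` is made up for illustration) makes it visible.

```python
import pandas as pd

# index labels are intentionally out of order
demo = pd.DataFrame({'a': [10, 20, 30]}, index=[2, 0, 1])

by_label = demo.loc[0, 'a']      # the row whose label is 0
by_position = demo.iloc[0, 0]    # the first row, whatever its label

print(by_label)     # 20
print(by_position)  # 10
```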

In [None]:
# but is the default calculating mean for each column or each row?
df.mean(axis=0) # axis == 0 (calculate statistic for each column)

In [None]:
# append a row at the next integer label; the list needs one value per column
df.loc[len(df)] = [1, 2, 3, 4]

In [None]:
# one way to get a mean for every row: apply np.mean across each row (axis=1)
# and store the result in a new column
df['newmean'] = df.apply(np.mean, axis=1)

In [None]:
# overwrite the row labelled 4; the list needs one value per column (5 once 'newmean' exists)
df.loc[4] = [1,2,3,4,5]
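
Pulling the row operations together, here is a hedged sketch of three ways to add a row: a list through `.loc`, a dictionary keyed by column name, and a row of column means. The frame `demo` and its values are made up, and `pd.concat` is only one possible approach.

```python
import pandas as pd

demo = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

# enlarge with .loc: the list needs one value per column
demo.loc[len(demo)] = [5, 6]

# append a row given as a dictionary keyed by column name
new_row = pd.DataFrame([{'a': 7, 'b': 8}])
demo = pd.concat([demo, new_row], ignore_index=True)

# append a row of column means
# (to_frame().T turns the Series of means into a one-row DataFrame)
means = demo.mean(axis=0).to_frame().T
demo = pd.concat([demo, means], ignore_index=True)

print(demo)
```

Appending the means row turns the integer columns into floats, since a column holds a single dtype.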

## Create a Pandas DataFrame from a dictionary

In [None]:
my_dict = {'a':['cheese', 'dog', 'goat', '4h'], 'b':['lush','planet', '2017', 'la trance'] }

In [None]:
df2 = pd.DataFrame(my_dict)

In [None]:
df2

In [None]:
df3 = pd.concat([df, df2], axis=1)
df3
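
With `axis=1`, `concat()` places the frames side by side and lines rows up by index label. Labels that exist in only one frame get NaN in the other frame's columns. A minimal sketch with two made-up frames of different lengths (`left` and `right` are hypothetical names):

```python
import pandas as pd

left = pd.DataFrame({'n': [1, 2, 3]})     # labels 0, 1, 2
right = pd.DataFrame({'s': ['x', 'y']})   # labels 0, 1

wide = pd.concat([left, right], axis=1)

print(wide)
print(wide['s'].isna().sum())  # 1, the row labelled 2 has no match in right
```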

In [None]:
# show merge and join?

## Import a .csv as a Pandas DataFrame

In [None]:

# import a .csv file to a DataFrame
df = pd.read_csv('dummydata.csv', sep=',', 
                 header='infer', 
                 names=None,
                 index_col=None, 
                 usecols=None)
# strip-down column names
df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_').str.replace('-','_')
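
Since `dummydata.csv` is not shown here, the column clean-up can be tried on an in-memory stand-in. This is only a sketch: the CSV text and its column names are invented, and `io.StringIO` simply lets `read_csv` treat a string as a file.

```python
import io
import pandas as pd

# hypothetical file contents with untidy column names
raw = io.StringIO(" First Name ,Net-Worth\nAda,1\nBob,2\n")
demo = pd.read_csv(raw)

# strip whitespace, lowercase, and replace spaces and hyphens with underscores
demo.columns = (demo.columns.str.strip().str.lower()
                .str.replace(' ', '_').str.replace('-', '_'))

print(demo.columns.tolist())  # ['first_name', 'net_worth']
```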

In [None]:
import numpy as np
r = np.random.normal(0,1,10000000)
r[0]

In [None]:
# i,j = np.where( a==value )
# np.where(np.logical_and(a>=6, a<=10))
index = np.where(np.logical_and(r>=3, r<=5))
values = r[index]
len(values)

In [None]:
val = (r>3)&(r<5)
val
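
The boolean array above can be used directly as a filter, which gives the same values as the `np.where` route. A small sketch, assuming a seeded generator and a smaller sample than the cell above so that it runs quickly and repeatably:

```python
import numpy as np

rng = np.random.RandomState(0)   # seeded so the run is repeatable
r = rng.normal(0, 1, 100000)

# boolean mask: True where the value lies in [3, 5]
mask = (r >= 3) & (r <= 5)
values_mask = r[mask]

# the same selection through np.where, which returns the matching positions
values_where = r[np.where(np.logical_and(r >= 3, r <= 5))]

print(mask.sum(), len(values_mask))
```

Summing the mask counts the `True` entries, so it is a quick way to see how many values pass the filter.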

In [None]:
x = [0, 1, 2, 5, -1, 3.4]

z = [3, -1, 10, 12., -4.2, 0.]

z2 = [0., 10., -7., 12., 82., 19.]

new_list = []
# loop over pairs from x and z; distinct loop names avoid overwriting the list x
for xi, zi in zip(x, z):
    print(type(xi))
    print(zi)
    print()