<img style="float: left;" src="files/img/pandas.png"  />

# [Pandas](http://pandas.pydata.org/) - Python Data Analysis Library
---
## I shamelessly picked from these chaps:
 - [Daniel Chen - Pandas for Data Analysis](https://www.youtube.com/watch?v=oGzU688xCUs)
 - [Jeff Delaney - 19 Essential Snippets in Pandas](https://jeffdelaney.me/blog/useful-snippets-in-pandas/)

## General plan:
- what is Pandas all about?
- brief intro to pandas objects and syntax
- build a dataframe from a NumPy array, show the basics
- import gapminder dataset, interactive
---
### Things to show:
- start with building from numpy to a pandas dataframe
- import pandas as pd
- create dataframe (first from np.array)
- explore: head, tail, sample, shape, describe, info
- differentiate series (single vector) and dataframes (multiple vectors)
- change column names (lists)
- add/remove, and reorder a column (mean, conditional logic)
- add/remove and reorder a row (dictionary, key=col name)
- change value in row
- create a plot (matplotlib)
- combine two dataframes (create a new dataframe with same index as the first)
    - (axis=0 and axis=1)
- demonstrate .loc and .iloc
- reindex (change index number)
    
### Gapminder dataset: 
- how to import a (tab)-delimited file as a dataframe
- explore dataset
    - df.duplicated
    - df[column].unique
    - df.nunique
- filter
    - conditional logic
    - sort
    - groupby
- apply function to every row
    - do this with a loop
- filling NaNs / missing data
- create plot (save?)
- export as .csv, excel sheet, or pickle file
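
A rough sketch of several items from the list above, run on a tiny made-up DataFrame rather than the gapminder data. The names `toy`, `dupes`, and `filled` are placeholders for illustration only.

```python
import numpy as np
import pandas as pd

# made-up data: one repeated row and one missing value
toy = pd.DataFrame({'name': ['a', 'b', 'b', 'c'],
                    'value': [1.0, 2.0, 2.0, np.nan]})

# explore
dupes = toy.duplicated()          # True where a row repeats an earlier row
names = toy['name'].unique()      # distinct values of one column
n_names = toy['name'].nunique()   # number of distinct values

# filter with conditional logic, then sort
big = toy[toy['value'] > 1].sort_values('value', ascending=False)

# fill missing data (0 is an arbitrary choice here)
filled = toy['value'].fillna(0)

# export: with no path given, to_csv returns the CSV text as a string
csv_text = toy.to_csv(index=False)
print(csv_text)
```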

### Extras:
- time series (use in index to sort/order)
- [tidy dataset](http://vita.had.co.nz/papers/tidy-data.pdf)

## Jupyter Notebook Shortcuts
- documentation: [mysterious_function]?
- check function arguments (brief): shift + tab
- run current cell/block: shift + enter 
- insert cell above: esc + a
- delete cell: esc (hold) + d + d (double tap)

## What is Pandas?
- the go-to data analysis library for Python
- the non-clicky version of Excel
- the fraternal twin of R (have access to the Python universe)
- the progeny of NumPy (but can use heterogeneous data)
- best friends with Matplotlib (used to make pretty plots)

<img src="files/img/python-scientific-ecosystem.png" alt="Python Scientific Ecosystem" />

## Big Concepts:
- NumPy vs. Pandas: 
    - NumPy - arrays are homogeneous (a single data type)
    - Pandas - dataframes can be heterogeneous (multiple data types, like lists)
- think in vector operations
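
A small sketch of both points, assuming made-up values: a NumPy array settles on one dtype for all of its elements, while each DataFrame column keeps its own. The names `mixed_array` and `df_mixed` are invented for this example.

```python
import numpy as np
import pandas as pd

# NumPy: mixing ints and a string forces a single common dtype,
# so every element ends up stored as a string
mixed_array = np.array([1, 2, 'three'])
print(mixed_array.dtype)

# Pandas: each column has its own dtype
df_mixed = pd.DataFrame({'count': [1, 2, 3],
                         'label': ['a', 'b', 'c'],
                         'score': [0.5, 1.5, 2.5]})
print(df_mixed.dtypes)

# vector operation: whole columns are multiplied at once, no explicit loop
df_mixed['scaled'] = df_mixed['count'] * df_mixed['score']
print(df_mixed)
```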

In [6]:
# quick demo of speed and vector operations

# standard python 

# create 3 lists of a million ints
# (in Python 3, range() is lazy, so wrap it in list() to get real lists)
A = list(range(1000000))
B = list(range(1000000))
C = list(range(1000000))

# begin timing the operation
import time
start_time = time.time()

# generate new list based on the A, B, and C lists
Z = []
for idx in range(len(A)):
    Z.append(A[idx] + B[idx] * C[idx])

python_time = time.time() - start_time
print('Took', python_time, 'seconds')

Took 1.0992920398712158 seconds


In [12]:
# repeat with NumPy

# create 3 arrays of a million ints
import numpy as np
A=np.arange(1000000)
B=np.arange(1000000)
C=np.arange(1000000)

# begin timing the operation
start_time = time.time()

# generate new array based on the A, B, and C arrays
Z = A + B * C

numpy_time = time.time() - start_time
print('Took', numpy_time, 'seconds')

# how much faster is NumPy
print('Numpy is', python_time/numpy_time, 'times faster')

Took 0.01576685905456543 seconds
Numpy is 69.72168876926101 times faster
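
A single `time.time()` measurement can be noisy. One possible alternative is the standard-library `timeit` module, sketched below with a smaller array so it runs quickly. The function names and the size `n` are assumptions for this example, and the timings will differ from machine to machine.

```python
import timeit
import numpy as np

n = 100_000  # smaller than the million used above, to keep the sketch fast

list_a = list(range(n))
arr_a = np.arange(n, dtype=np.int64)  # explicit int64 so a * a cannot overflow

def python_loop():
    # element-by-element arithmetic in plain Python
    return [a + a * a for a in list_a]

def numpy_vectorized():
    # the same arithmetic on the whole array at once
    return arr_a + arr_a * arr_a

# run each function a few times and keep the best time
python_best = min(timeit.repeat(python_loop, number=1, repeat=3))
numpy_best = min(timeit.repeat(numpy_vectorized, number=1, repeat=3))

print('Python loop:', python_best, 'seconds')
print('NumPy:', numpy_best, 'seconds')
```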


# Making A Simple Pandas DataFrame From Scratch
---

In [19]:
# use NumPy to generate a random dataset to be imported into a Pandas DataFrame

# import NumPy
import numpy as np
print('NumPy version:',np.version.version)


NumPy version: 1.11.1


In [37]:
# create a 100x4 numpy ndarray (100 rows, 4 columns)
np.random.seed(0) # using numpy.random.seed() generates a reproducible set of random numbers
array = np.random.randint(0,100,size=(100,4))

In [38]:
# check the array and type
type(array)

numpy.ndarray

## Brief Pandas Reference:
- DataFrame = Indexed rows and columns of data, like a spreadsheet or database table.
- Series = single column of data
- Shape: (number_of_rows, number_of_columns) of a DataFrame, returned as a tuple
- Axis: 
    - 0 == Calculate statistic for each column
    - 1 == Calculate statistic for each row
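
A minimal illustration of shape and the two axis values on a made-up 3x2 DataFrame (the name `small` is only for this example):

```python
import pandas as pd

small = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})

print(small.shape)               # (rows, columns)

col_means = small.mean(axis=0)   # one value per column: a -> 2.0, b -> 20.0
row_means = small.mean(axis=1)   # one value per row: 5.5, 11.0, 16.5

print(col_means)
print(row_means)
```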

<img src="files/img/python-operations-across-axes.svg" alt="Operations Across Axes" />

## Create A Pandas DataFrame

In [39]:
# import the pandas library
import pandas as pd
print('Pandas version:', pd.__version__)

Pandas version: 0.20.1


In [45]:
# create a Pandas DataFrame from the NumPy ndarray
df = pd.DataFrame(data=array, index=None, columns=None, dtype=None)
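
The `index` and `columns` arguments can also be filled in when the DataFrame is created. A small sketch, assuming a 5x4 array and invented row labels:

```python
import numpy as np
import pandas as pd

np.random.seed(0)
values = np.random.randint(0, 100, size=(5, 4))

# name the columns and rows up front instead of accepting the default 0..n-1
labelled = pd.DataFrame(data=values,
                        columns=['a', 'b', 'c', 'd'],
                        index=['r1', 'r2', 'r3', 'r4', 'r5'])
print(labelled)
```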

In [None]:
# check out the DataFrame
#df

## Explore A DataFrame

In [32]:
# check the shape of the DataFrame
df.shape

(1704, 6)

In [24]:
# use info for a brief summary of the DataFrame
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1704 entries, 0 to 1703
Data columns (total 6 columns):
country      1704 non-null object
continent    1704 non-null object
year         1704 non-null int64
lifeExp      1704 non-null float64
pop          1704 non-null int64
gdpPercap    1704 non-null float64
dtypes: float64(2), int64(2), object(2)
memory usage: 80.0+ KB


In [41]:
# check the first 5 rows
df.head()

Unnamed: 0,0,1,2,3
0,44,47,64,67
1,67,9,83,21
2,36,87,70,88
3,88,12,58,65
4,39,87,46,88


In [42]:
# check the last 5 rows
df.tail()

Unnamed: 0,0,1,2,3
95,8,79,79,53
96,11,4,39,92
97,45,26,74,52
98,49,91,51,99
99,18,34,51,30


In [44]:
# check 5 sample rows
df.sample(5)

Unnamed: 0,0,1,2,3
14,36,53,5,38
3,88,12,58,65
76,12,44,66,91
87,87,32,19,72
19,14,99,53,12


In [57]:
col = ['a','b','c','d']
df.columns = col

In [56]:
a = df.loc[0]
type(a)
a

0    44
1    47
2    64
3    67
Name: 0, dtype: int64

In [53]:
a = df[0]
type(a)

pandas.core.series.Series
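
A short sketch of the Series/DataFrame distinction, and of `.loc` (label-based) versus `.iloc` (position-based). The DataFrame `demo` and its index labels 10, 20, 30 are made up so that labels and positions differ.

```python
import pandas as pd

demo = pd.DataFrame({'a': [44, 67, 36], 'b': [47, 9, 87]},
                    index=[10, 20, 30])

col_series = demo['a']     # single column name -> Series
col_frame = demo[['a']]    # list of column names -> DataFrame with one column

by_label = demo.loc[10]    # the row whose index label is 10
by_position = demo.iloc[0] # the first row, whatever its label is

print(type(col_series), type(col_frame))
print(by_label)
# demo.loc[0] would raise a KeyError here, because no row is labelled 0
```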

In [48]:
# get some basic descriptive stats
df.describe()

In [9]:
# view column names in list
df.columns.tolist()

[0, 1, 2, 3]

In [10]:
# change column names
cols = ['a','b','c','d']
df.columns = cols
df

Unnamed: 0,a,b,c,d
0,44,47,64,67
1,67,9,83,21
2,36,87,70,88
3,88,12,58,65


In [11]:
# check the mean for each column
df.mean()

a    58.75
b    38.75
c    68.75
d    60.25
dtype: float64

In [12]:
# but is the default calculating mean for each column or each row?
df.mean(axis=0) # axis == 0 (calculate statistic for each column)

a    58.75
b    38.75
c    68.75
d    60.25
dtype: float64

In [13]:
# what about now?
df.mean(axis=1) # axis == 1 (calculate statistic for each row)

0    55.50
1    45.00
2    70.25
3    55.75
dtype: float64

In [14]:
# adding a column with the mean of each row
df['mean'] = df.mean(axis=1)
df

Unnamed: 0,a,b,c,d,mean
0,44,47,64,67,55.5
1,67,9,83,21,45.0
2,36,87,70,88,70.25
3,88,12,58,65,55.75


In [19]:
df.loc[len(df)] = []

ValueError: cannot set a row with mismatched columns

In [15]:
# no obvious way to add a row with just 4 values and have the mean calculated automatically
# maybe use the apply method to compute the mean of the four value columns
df['newmean'] = df[['a','b','c','d']].apply(np.mean, axis=1)

In [102]:
df.loc[4] = [1,2,3,4,5]

ValueError: cannot set a row with mismatched columns
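
The error arises because the list must supply one value per column. One possible workaround, sketched on a small made-up DataFrame (`scores` and `new_values` are names invented here), is to compute the mean for the new row and append it to the values before assigning:

```python
import pandas as pd

scores = pd.DataFrame({'a': [44, 67], 'b': [47, 9],
                       'c': [64, 83], 'd': [67, 21]})
value_cols = ['a', 'b', 'c', 'd']
scores['mean'] = scores[value_cols].mean(axis=1)

new_values = [1, 2, 3, 4]

# the new row needs 5 entries (a, b, c, d, mean), so compute the mean first
new_row = new_values + [sum(new_values) / len(new_values)]
scores.loc[len(scores)] = new_row

print(scores)
```

Recomputing the `mean` column after adding rows would be another option. This sketch simply keeps the row complete at assignment time.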

## Create a Pandas DataFrame from a dictionary

In [None]:
my_dict = {'a':['cheese', 'dog', 'goat', '4h'], 'b':['lush','planet', '2017', 'la trance'] }

In [None]:
df2 = pd.DataFrame(my_dict)

In [None]:
df2

In [None]:
df3 = pd.concat([df, df2], axis=1)
df3
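
A sketch of how `axis` changes `pd.concat`, using made-up frames (`left`, `right`, `more_rows`, and `short` are names invented here). Rows are aligned on the index, so a frame with fewer rows leaves `NaN` in the gaps.

```python
import pandas as pd

left = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
right = pd.DataFrame({'c': ['x', 'y'], 'd': ['z', 'w']})
more_rows = pd.DataFrame({'a': [5], 'b': [6]})
short = pd.DataFrame({'e': [9]})

# axis=1: place frames side by side, matching rows by index
side_by_side = pd.concat([left, right], axis=1)

# axis=0: stack frames on top of each other, matching columns by name
stacked = pd.concat([left, more_rows], axis=0, ignore_index=True)

# index 1 has no match in `short`, so column 'e' is NaN there
uneven = pd.concat([left, short], axis=1)

print(side_by_side)
print(stacked)
print(uneven)
```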

In [None]:
# show merge and join?
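
One way `merge` and `join` could be shown, assuming two small invented tables that share a `city_id` key:

```python
import pandas as pd

people = pd.DataFrame({'name': ['ann', 'bob', 'cat'],
                       'city_id': [1, 2, 3]})
cities = pd.DataFrame({'city_id': [1, 2],
                       'city': ['Oslo', 'Lima']})

# merge matches rows on a shared column
inner = people.merge(cities, on='city_id', how='inner')     # only matching keys
left_join = people.merge(cities, on='city_id', how='left')  # keep every person

# join matches rows on the index instead (a left join by default)
joined = people.set_index('city_id').join(cities.set_index('city_id'))

print(inner)
print(left_join)
print(joined)
```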

## Import a delimited file (.csv, .tsv) as a Pandas DataFrame

In [14]:
# import a tab-delimited file into Pandas as a DataFrame
df = pd.read_csv('data/gapminder.tsv', # path to the data file
                 sep='\t',             # how the entries are separated
                 header='infer',       # what row to use for column names
                 names=None,           # substitute column names
                 index_col=None,       # what column to use as the index
                 usecols=None)         # pull specific columns from the data file

In [15]:
df

Unnamed: 0,country,continent,year,lifeExp,pop,gdpPercap
0,Afghanistan,Asia,1952,28.801,8425333,779.445314
1,Afghanistan,Asia,1957,30.332,9240934,820.853030
2,Afghanistan,Asia,1962,31.997,10267083,853.100710
3,Afghanistan,Asia,1967,34.020,11537966,836.197138
4,Afghanistan,Asia,1972,36.088,13079460,739.981106
5,Afghanistan,Asia,1977,38.438,14880372,786.113360
6,Afghanistan,Asia,1982,39.854,12881816,978.011439
7,Afghanistan,Asia,1987,40.822,13867957,852.395945
8,Afghanistan,Asia,1992,41.674,16317921,649.341395
9,Afghanistan,Asia,1997,41.763,22227415,635.341351
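
To try the `read_csv` arguments without the data file, the same call can be pointed at an in-memory string. The rows below are made up and are not gapminder values. `tsv_text`, `sample`, and `subset` are names invented for this sketch.

```python
import io
import pandas as pd

# made-up tab-delimited text standing in for a file on disk
tsv_text = "country\tyear\tvalue\nAland\t2000\t1.5\nBorduria\t2001\t2.5\n"

# default behaviour: first row becomes the header, index is 0..n-1
sample = pd.read_csv(io.StringIO(tsv_text), sep='\t')

# pull only some columns and use one of them as the index
subset = pd.read_csv(io.StringIO(tsv_text), sep='\t',
                     usecols=['country', 'value'],
                     index_col='country')

print(sample)
print(subset)
```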


### Exercise
1. Explore attributes of the DataFrame
    * What is the shape?
    * What are the variables for each observation?
    * Are there any missing entries?
2. What is the range in years?
3. What is the range in years for Zambia?
4. What country has the greatest number of observations?
5. What continent has the greatest number of observations?
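
The calls below are one possible way to approach these questions, shown on a tiny invented DataFrame so the exercise itself is not given away. The column names mirror the gapminder ones, but the values are made up.

```python
import pandas as pd

toy = pd.DataFrame({
    'country': ['X', 'X', 'Y', 'Y', 'Y'],
    'continent': ['North', 'North', 'South', 'South', 'South'],
    'year': [1990, 2000, 1990, 1995, 2000],
})

# missing entries per column
missing_per_column = toy.isnull().sum()

# range in years, overall and for a single country
year_range = (toy['year'].min(), toy['year'].max())
x_years = toy.loc[toy['country'] == 'X', 'year']
x_range = (x_years.min(), x_years.max())

# number of observations per group
obs_per_country = toy['country'].value_counts()
most_observed = obs_per_country.idxmax()
obs_per_continent = toy.groupby('continent').size()

print(toy.shape, year_range, x_range, most_observed)
print(obs_per_continent)
```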

In [25]:
import numpy as np
r = np.random.normal(0,1,10000000)
r[0]

1.4554442913175125

In [53]:
# i,j = np.where( a==value )
# np.where(np.logical_and(a>=6, a<=10))
index = np.where(np.logical_and(r>=3, r<=5))
values = r[index]
len(values)

13513

In [56]:
val = (r>3)&(r<5)
val

array([False, False, False, ..., False, False, False], dtype=bool)
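
The boolean mask and `np.where` select the same elements when they use the same bounds. A small check, assuming a seeded generator and a smaller sample than above (`samples`, `mask`, and `idx` are illustrative names):

```python
import numpy as np

rng = np.random.RandomState(0)
samples = rng.normal(0, 1, 100000)

# boolean mask: one True/False per element
mask = (samples >= 3) & (samples <= 5)

# np.where turns the mask into the positions of the True entries
idx = np.where(mask)

count_from_mask = mask.sum()           # True counts as 1
count_from_where = len(samples[idx])

print(count_from_mask, count_from_where)
```

Indexing with the mask directly (`samples[mask]`) gives the same values as `samples[idx]`, so `np.where` is only needed when the positions themselves matter.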

In [9]:
x = [0, 1, 2, 5, -1, 3.4]

z = [3, -1, 10, 12., -4.2, 0.]

z2 = [0., 10., -7., 12., 82., 19.]

# iterate over x and z in parallel
# (separate loop names, so the list x is not overwritten)
for xi, zi in zip(x, z):
    print(type(xi))
    print(zi)
    print()

<class 'int'>
3

<class 'int'>
-1

<class 'int'>
10

<class 'int'>
12.0

<class 'int'>
-4.2

<class 'float'>
0.0

