## What is pandas?
### Working with pandas
*Curtis Miller*

This notebook previews the two core pandas objects, `Series` and `DataFrame`.

We load the same data set from disk twice, once with NumPy and once with pandas, and compare the resulting objects.

In [1]:
import numpy as np
import pandas as pd

In [2]:
# np.loadtxt needs a custom (structured) dtype to read a CSV with mixed column types.
# Note: float16 is low precision, so a value such as 5.1 is stored as 5.1015625.
schema = np.dtype([('sepal_length', np.float16),
                   ('sepal_width',  np.float16),
                   ('petal_length', np.float16),
                   ('petal_width',  np.float16),
                   ('species',      '<U16')])

In [4]:
np_data = np.loadtxt("iris.csv", skiprows=1, dtype=schema, delimiter=',')

In [5]:
np_data

array([( 5.1015625 ,  3.5       ,  1.40039062,  0.19995117, 'setosa'),
       ( 4.8984375 ,  3.        ,  1.40039062,  0.19995117, 'setosa'),
       ( 4.69921875,  3.19921875,  1.29980469,  0.19995117, 'setosa'),
       ( 4.6015625 ,  3.09960938,  1.5       ,  0.19995117, 'setosa'),
       ( 5.        ,  3.59960938,  1.40039062,  0.19995117, 'setosa'),
       ( 5.3984375 ,  3.90039062,  1.70019531,  0.39990234, 'setosa'),
       ( 4.6015625 ,  3.40039062,  1.40039062,  0.30004883, 'setosa'),
       ( 5.        ,  3.40039062,  1.5       ,  0.19995117, 'setosa'),
       ( 4.3984375 ,  2.90039062,  1.40039062,  0.19995117, 'setosa'),
       ( 4.8984375 ,  3.09960938,  1.5       ,  0.09997559, 'setosa'),
       ( 5.3984375 ,  3.69921875,  1.5       ,  0.19995117, 'setosa'),
       ( 4.80078125,  3.40039062,  1.59960938,  0.19995117, 'setosa'),
       ( 4.80078125,  3.        ,  1.40039062,  0.09997559, 'setosa'),
       ( 4.30078125,  3.        ,  1.09960938,  0.09997559, 'setosa'),
      

In [6]:
type(np_data)

numpy.ndarray

In [7]:
np_data[:5]    # Slicing operations

array([( 5.1015625 ,  3.5       ,  1.40039062,  0.19995117, 'setosa'),
       ( 4.8984375 ,  3.        ,  1.40039062,  0.19995117, 'setosa'),
       ( 4.69921875,  3.19921875,  1.29980469,  0.19995117, 'setosa'),
       ( 4.6015625 ,  3.09960938,  1.5       ,  0.19995117, 'setosa'),
       ( 5.        ,  3.59960938,  1.40039062,  0.19995117, 'setosa')],
      dtype=[('sepal_length', '<f2'), ('sepal_width', '<f2'), ('petal_length', '<f2'), ('petal_width', '<f2'), ('species', '<U16')])

In [8]:
np_data[:5]['sepal_length']

array([ 5.1015625 ,  4.8984375 ,  4.69921875,  4.6015625 ,  5.        ], dtype=float16)

In [9]:
np_data[:5][['petal_length', 'species']]

array([( 1.40039062, 'setosa'), ( 1.40039062, 'setosa'),
       ( 1.29980469, 'setosa'), ( 1.5       , 'setosa'),
       ( 1.40039062, 'setosa')],
      dtype=[('petal_length', '<f2'), ('species', '<U16')])
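
The structured-array workflow above can be tried without the file on disk. The following is a minimal sketch, not the notebook's own code: `csv_text` and `sample` are hypothetical names, the three rows are made-up stand-ins for `iris.csv`, and `float64` replaces `float16` to avoid the rounding visible in the outputs above.

```python
import io

import numpy as np

# Hypothetical three-row stand-in for iris.csv (illustration only).
csv_text = """sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
7.0,3.2,4.7,1.4,versicolor
6.3,3.3,6.0,2.5,virginica
"""

# Same field names as the schema above, but float64 instead of float16.
schema = np.dtype([('sepal_length', np.float64),
                   ('sepal_width',  np.float64),
                   ('petal_length', np.float64),
                   ('petal_width',  np.float64),
                   ('species',      '<U16')])

# np.loadtxt accepts any file-like object, so StringIO stands in for the file.
sample = np.loadtxt(io.StringIO(csv_text), skiprows=1, dtype=schema, delimiter=',')

print(sample.shape)        # one-dimensional: one record per CSV row
print(sample['species'])   # select a field by name

# A Boolean mask built from one field filters whole records.
long_petals = sample[sample['petal_length'] > 2]
print(long_petals['species'])
```

Note that the array is one-dimensional: each element is a whole record, and "columns" exist only as named fields of the dtype.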

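pandas offers a more convenient way: `pd.read_csv` reads the column names from the header row and infers each column's type, so no custom dtype is needed.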

In [10]:
pd_data = pd.read_csv("iris.csv")

In [11]:
pd_data

   sepal_length  sepal_width  petal_length  petal_width species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa
5           5.4          3.9           1.7          0.4  setosa
6           4.6          3.4           1.4          0.3  setosa
7           5.0          3.4           1.5          0.2  setosa
8           4.4          2.9           1.4          0.2  setosa
9           4.9          3.1           1.5          0.1  setosa


In [12]:
type(pd_data)

pandas.core.frame.DataFrame
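
To see what `pd.read_csv` works out on its own, one can inspect the column labels and dtypes of the result. This is a small sketch under stated assumptions: `csv_text` and `df` are hypothetical names, and the three rows are made-up stand-ins for `iris.csv`.

```python
import io

import pandas as pd

# Hypothetical three-row stand-in for iris.csv (illustration only).
csv_text = """sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
7.0,3.2,4.7,1.4,versicolor
6.3,3.3,6.0,2.5,virginica
"""

# read_csv accepts file-like objects as well as paths.
df = pd.read_csv(io.StringIO(csv_text))

print(list(df.columns))   # labels taken from the header row
print(df.dtypes)          # one inferred dtype per column
print(df.shape)           # (rows, columns)
```

The numeric columns come back as floating-point columns, and the `species` column gets a string-capable dtype (the exact dtype name depends on the pandas version).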

In [13]:
pd_data.head()

   sepal_length  sepal_width  petal_length  petal_width species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa


In [14]:
pd_data.head().sepal_length

0    5.1
1    4.9
2    4.7
3    4.6
4    5.0
Name: sepal_length, dtype: float64

In [15]:
pd_data.head().loc[:, ['petal_length', 'species']]

   petal_length species
0           1.4  setosa
1           1.4  setosa
2           1.3  setosa
3           1.5  setosa
4           1.4  setosa


In [None]:
type(pd_data.sepal_length)    # A single column is a pandas Series
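
The difference between selecting one column and selecting a list of columns can be checked directly. A minimal sketch, using a small hand-built frame (`df`, `col`, and `sub` are hypothetical names, not part of the notebook):

```python
import pandas as pd

# Small hand-built frame standing in for pd_data (illustration only).
df = pd.DataFrame({'petal_length': [1.4, 1.4, 1.3],
                   'species': ['setosa', 'setosa', 'setosa']})

col = df['petal_length']      # a single label returns a Series
sub = df[['petal_length']]    # a list of labels returns a DataFrame

print(type(col).__name__)     # Series
print(type(sub).__name__)     # DataFrame

# Attribute access is shorthand for the single-label form.
print(df.petal_length.equals(col))   # True

# A Series keeps the column label as its name and shares the row index.
print(col.name, col.index.tolist())
```

Attribute access only works when the column label is a valid Python identifier that does not clash with an existing `DataFrame` attribute, so the bracket form is the more general of the two.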

At its core, though, pandas is built on top of NumPy.

In [None]:
pd_data.values    # The underlying data as a NumPy array (newer pandas versions recommend pd_data.to_numpy())

In [None]:
np_pd_data = pd.DataFrame(np_data)    # Converting to a DataFrame
np_pd_data
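
The round trip between the two libraries can be sketched on a tiny example. The names `records`, `frame`, `lengths`, and `mixed` are hypothetical, and the two rows are made up for illustration.

```python
import numpy as np
import pandas as pd

# A tiny structured array standing in for np_data (illustration only).
records = np.array([(5.1, 'setosa'), (7.0, 'versicolor')],
                   dtype=[('sepal_length', np.float64), ('species', '<U16')])

# The field names of the structured dtype become the column labels.
frame = pd.DataFrame(records)
print(list(frame.columns))

# A single numeric column converts back to a plain NumPy array.
lengths = frame['sepal_length'].to_numpy()
print(type(lengths).__name__, lengths.dtype)

# Converting the whole frame yields one two-dimensional array; with mixed
# column types the elements are typically stored as generic Python objects.
mixed = frame.to_numpy()
print(mixed.shape)
```

Unlike the one-dimensional structured array that went in, the array that comes back from the whole frame is two-dimensional, with one row per record and one column per field.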