### True Learning Objectives

- How can I process data in Python?

#### How do I load data into Python?

We use a Python **library** called **Pandas** to load and process data.

A **library** is a collection of functions that add functionality, often for a specific purpose, to the basic Python package.

**Pandas**, the Python Data Analysis Library, is an open source library that provides high-performance, easy-to-use data structures and data analysis tools for the Python programming language.

In [9]:
import pandas

In [10]:
pandas.read_csv('data/combined.csv')

Unnamed: 0,record_id,month,day,year,plot_id,species_id,sex,hindfoot_length,weight,genus,species,taxa,plot_type
0,1,7,16,1977,2,NL,M,32.0,,Neotoma,albigula,Rodent,Control
1,72,8,19,1977,2,NL,M,31.0,,Neotoma,albigula,Rodent,Control
2,224,9,13,1977,2,NL,,,,Neotoma,albigula,Rodent,Control
3,266,10,16,1977,2,NL,,,,Neotoma,albigula,Rodent,Control
4,349,11,12,1977,2,NL,,,,Neotoma,albigula,Rodent,Control
5,363,11,12,1977,2,NL,,,,Neotoma,albigula,Rodent,Control
6,435,12,10,1977,2,NL,,,,Neotoma,albigula,Rodent,Control
7,506,1,8,1978,2,NL,,,,Neotoma,albigula,Rodent,Control
8,588,2,18,1978,2,NL,M,,218.0,Neotoma,albigula,Rodent,Control
9,661,3,11,1978,2,NL,,,,Neotoma,albigula,Rodent,Control


The data is here, but it is not usable. The equivalent in Excel is someone printing out a copy of all the data and handing it to you: informative, but useless. What we need is a name, an identifier, for the data:

In [11]:
data = pandas.read_csv('data/combined.csv')

In [13]:
data

Unnamed: 0,record_id,month,day,year,plot_id,species_id,sex,hindfoot_length,weight,genus,species,taxa,plot_type
0,1,7,16,1977,2,NL,M,32.0,,Neotoma,albigula,Rodent,Control
1,72,8,19,1977,2,NL,M,31.0,,Neotoma,albigula,Rodent,Control
2,224,9,13,1977,2,NL,,,,Neotoma,albigula,Rodent,Control
3,266,10,16,1977,2,NL,,,,Neotoma,albigula,Rodent,Control
4,349,11,12,1977,2,NL,,,,Neotoma,albigula,Rodent,Control
5,363,11,12,1977,2,NL,,,,Neotoma,albigula,Rodent,Control
6,435,12,10,1977,2,NL,,,,Neotoma,albigula,Rodent,Control
7,506,1,8,1978,2,NL,,,,Neotoma,albigula,Rodent,Control
8,588,2,18,1978,2,NL,M,,218.0,Neotoma,albigula,Rodent,Control
9,661,3,11,1978,2,NL,,,,Neotoma,albigula,Rodent,Control


What is data?

In [14]:
print(type(data))

<class 'pandas.core.frame.DataFrame'>


Pandas' DataFrame is one of the two *core* data structures in Pandas (the other is *Series*). 

- A DataFrame is conceptually equivalent to an Excel spreadsheet, but is more powerful and versatile.
- A DataFrame represents a table whose columns are vectors of the same length but possibly different data types.
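
To make the idea of equal-length columns with different data types concrete, here is a minimal sketch that builds a tiny DataFrame by hand. The name `small` is made up for illustration, and the values are copied from the first rows of the output shown earlier.

```python
import pandas

# Each key is a column name and each list is one column.
# All lists must have the same length; their data types may differ.
small = pandas.DataFrame({
    'record_id': [1, 72, 224],              # integers
    'species_id': ['NL', 'NL', 'NL'],       # text
    'hindfoot_length': [32.0, 31.0, None],  # decimals, one value missing
})

# One data type per column
print(small.dtypes)

# A single column of a DataFrame is a Series
print(type(small['record_id']))
```

The missing value in `hindfoot_length` is stored as `NaN`, which is how Pandas marks missing numbers.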

### Notes

- Pick the right tool for the job. Sometimes, Excel is the best tool.
- Using Python/Pandas (and other programming-based data analysis tools) requires a different way of thinking about data.
- You will lose the *physical* interaction with data that you have in Excel.
- Instead, you have to think about what data you want, then ask Python to show it to you (think shopping on Amazon ...).

#### DataFrame structures

We still need some hints from Python to help us *perceive* or *imagine* the data.

In [15]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 34786 entries, 0 to 34785
Data columns (total 13 columns):
record_id          34786 non-null int64
month              34786 non-null int64
day                34786 non-null int64
year               34786 non-null int64
plot_id            34786 non-null int64
species_id         34786 non-null object
sex                33038 non-null object
hindfoot_length    31438 non-null float64
weight             32283 non-null float64
genus              34786 non-null object
species            34786 non-null object
taxa               34786 non-null object
plot_type          34786 non-null object
dtypes: float64(2), int64(5), object(6)
memory usage: 3.5+ MB
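
The non-null counts reported by `info()` tell us that some columns (`sex`, `hindfoot_length`, `weight`) contain missing values. As a rough sketch, the missing values can also be counted directly. The example reads a few rows copied from the output above out of an in-memory string, which is an assumption made so that it runs without `data/combined.csv`; the names `csv_text` and `sample` are illustrative.

```python
import io
import pandas

# A handful of rows from the survey data, with fewer columns for brevity
csv_text = """record_id,species_id,sex,hindfoot_length,weight
1,NL,M,32.0,
72,NL,M,31.0,
224,NL,,,
588,NL,M,,218.0
26966,PL,M,20.0,16.0
"""
sample = pandas.read_csv(io.StringIO(csv_text))

# Number of missing values in each column
missing = sample.isna().sum()
print(missing)

# The same numbers, derived from the non-null counts that info() reports
print(len(sample) - sample.count())
```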


In [16]:
data.describe()

Unnamed: 0,record_id,month,day,year,plot_id,hindfoot_length,weight
count,34786.0,34786.0,34786.0,34786.0,34786.0,31438.0,32283.0
mean,17804.204421,6.473725,16.095987,1990.495832,11.343098,29.287932,42.672428
std,10229.682311,3.398384,8.249405,7.468714,6.794049,9.564759,36.631259
min,1.0,1.0,1.0,1977.0,1.0,2.0,4.0
25%,8964.25,4.0,9.0,1984.0,5.0,21.0,20.0
50%,17761.5,6.0,16.0,1990.0,11.0,32.0,37.0
75%,26654.75,10.0,23.0,1997.0,17.0,36.0,48.0
max,35548.0,12.0,31.0,2002.0,24.0,70.0,280.0


In [19]:
data.shape

(34786, 13)

In [22]:
# Row count
data.shape[0]

34786
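
Since `shape` is a `(rows, columns)` tuple, both numbers can be unpacked in one step. A small sketch on a hand-built DataFrame (the name `small` and its contents are illustrative only):

```python
import pandas

small = pandas.DataFrame({
    'record_id': [1, 72, 224],
    'month': [7, 8, 9],
})

# Unpack the (rows, columns) tuple into two names
n_rows, n_cols = small.shape
print(n_rows, n_cols)

# len() of a DataFrame also gives the row count
print(len(small))
```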

Can we view some rows?

In [24]:
data.head()

Unnamed: 0,record_id,month,day,year,plot_id,species_id,sex,hindfoot_length,weight,genus,species,taxa,plot_type
0,1,7,16,1977,2,NL,M,32.0,,Neotoma,albigula,Rodent,Control
1,72,8,19,1977,2,NL,M,31.0,,Neotoma,albigula,Rodent,Control
2,224,9,13,1977,2,NL,,,,Neotoma,albigula,Rodent,Control
3,266,10,16,1977,2,NL,,,,Neotoma,albigula,Rodent,Control
4,349,11,12,1977,2,NL,,,,Neotoma,albigula,Rodent,Control


In [25]:
data.head(10)

Unnamed: 0,record_id,month,day,year,plot_id,species_id,sex,hindfoot_length,weight,genus,species,taxa,plot_type
0,1,7,16,1977,2,NL,M,32.0,,Neotoma,albigula,Rodent,Control
1,72,8,19,1977,2,NL,M,31.0,,Neotoma,albigula,Rodent,Control
2,224,9,13,1977,2,NL,,,,Neotoma,albigula,Rodent,Control
3,266,10,16,1977,2,NL,,,,Neotoma,albigula,Rodent,Control
4,349,11,12,1977,2,NL,,,,Neotoma,albigula,Rodent,Control
5,363,11,12,1977,2,NL,,,,Neotoma,albigula,Rodent,Control
6,435,12,10,1977,2,NL,,,,Neotoma,albigula,Rodent,Control
7,506,1,8,1978,2,NL,,,,Neotoma,albigula,Rodent,Control
8,588,2,18,1978,2,NL,M,,218.0,Neotoma,albigula,Rodent,Control
9,661,3,11,1978,2,NL,,,,Neotoma,albigula,Rodent,Control


In [26]:
data.tail()

Unnamed: 0,record_id,month,day,year,plot_id,species_id,sex,hindfoot_length,weight,genus,species,taxa,plot_type
34781,26966,10,25,1997,7,PL,M,20.0,16.0,Peromyscus,leucopus,Rodent,Rodent Exclosure
34782,27185,11,22,1997,7,PL,F,21.0,22.0,Peromyscus,leucopus,Rodent,Rodent Exclosure
34783,27792,5,2,1998,7,PL,F,20.0,8.0,Peromyscus,leucopus,Rodent,Rodent Exclosure
34784,28806,11,21,1998,7,PX,,,,Chaetodipus,sp.,Rodent,Rodent Exclosure
34785,30986,7,1,2000,7,PX,,,,Chaetodipus,sp.,Rodent,Rodent Exclosure


In [27]:
data.tail(10)

Unnamed: 0,record_id,month,day,year,plot_id,species_id,sex,hindfoot_length,weight,genus,species,taxa,plot_type
34776,33305,12,15,2001,7,PB,M,29.0,44.0,Chaetodipus,baileyi,Rodent,Rodent Exclosure
34777,34524,7,13,2002,7,PB,M,25.0,16.0,Chaetodipus,baileyi,Rodent,Rodent Exclosure
34778,35382,12,8,2002,7,PB,M,26.0,30.0,Chaetodipus,baileyi,Rodent,Rodent Exclosure
34779,26557,7,29,1997,7,PL,F,20.0,22.0,Peromyscus,leucopus,Rodent,Rodent Exclosure
34780,26787,9,27,1997,7,PL,F,21.0,16.0,Peromyscus,leucopus,Rodent,Rodent Exclosure
34781,26966,10,25,1997,7,PL,M,20.0,16.0,Peromyscus,leucopus,Rodent,Rodent Exclosure
34782,27185,11,22,1997,7,PL,F,21.0,22.0,Peromyscus,leucopus,Rodent,Rodent Exclosure
34783,27792,5,2,1998,7,PL,F,20.0,8.0,Peromyscus,leucopus,Rodent,Rodent Exclosure
34784,28806,11,21,1998,7,PX,,,,Chaetodipus,sp.,Rodent,Rodent Exclosure
34785,30986,7,1,2000,7,PX,,,,Chaetodipus,sp.,Rodent,Rodent Exclosure


In [30]:
list(data)

['record_id',
 'month',
 'day',
 'year',
 'plot_id',
 'species_id',
 'sex',
 'hindfoot_length',
 'weight',
 'genus',
 'species',
 'taxa',
 'plot_type']
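
`list(data)` works because iterating over a DataFrame yields its column names. A more explicit way to get the same information is the `columns` attribute, sketched here on a small hand-built DataFrame (`small` is an illustrative name):

```python
import pandas

small = pandas.DataFrame({
    'record_id': [1, 72],
    'species_id': ['NL', 'NL'],
    'weight': [None, 218.0],
})

names = list(small)                   # iterating a DataFrame yields column names
same_names = small.columns.tolist()   # the explicit equivalent

print(names)
print('weight' in small.columns)      # check whether a column exists
```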

Now that we know the **structure** of our data, how do we access individual subsets of it?
This is called **indexing** or **selection**:

Select a column:

In [33]:
data['record_id']

0            1
1           72
2          224
3          266
4          349
5          363
6          435
7          506
8          588
9          661
10         748
11         845
12         990
13        1164
14        1261
15        1374
16        1453
17        1756
18        1818
19        1882
20        2133
21        2184
22        2406
23        2728
24        3000
25        3002
26        4667
27        4859
28        5048
29        5180
         ...  
34756    21209
34757    25710
34758    26042
34759    26096
34760    26356
34761    26475
34762    26546
34763    26776
34764    26819
34765    28332
34766    28336
34767    28337
34768    28338
34769    28585
34770    28667
34771    29231
34772    30355
34773    32085
34774    32477
34775    33103
34776    33305
34777    34524
34778    35382
34779    26557
34780    26787
34781    26966
34782    27185
34783    27792
34784    28806
34785    30986
Name: record_id, Length: 34786, dtype: int64

Select a row by its index label:

In [40]:
data.loc[0]

record_id                 1
month                     7
day                      16
year                   1977
plot_id                   2
species_id               NL
sex                       M
hindfoot_length          32
weight                  NaN
genus               Neotoma
species            albigula
taxa                 Rodent
plot_type           Control
Name: 0, dtype: object
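
`loc` looks rows up by index *label*, while its sibling `iloc` looks them up by *position*. In our data the two coincide because the index runs 0, 1, 2, and so on. The sketch below uses a hand-built DataFrame with a deliberately different index to show the distinction; `small` and its index values are made up for illustration.

```python
import pandas

small = pandas.DataFrame(
    {'record_id': [1, 72, 224], 'species_id': ['NL', 'NL', 'NL']},
    index=[10, 20, 30],   # labels that are deliberately not 0, 1, 2
)

print(small.loc[10])    # the row whose index label is 10
print(small.iloc[0])    # the row at position 0 (the same row here)

# A single cell: row label first, then column name
print(small.loc[20, 'record_id'])
```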

Slice rows:

In [38]:
data[5:10]

Unnamed: 0,record_id,month,day,year,plot_id,species_id,sex,hindfoot_length,weight,genus,species,taxa,plot_type
5,363,11,12,1977,2,NL,,,,Neotoma,albigula,Rodent,Control
6,435,12,10,1977,2,NL,,,,Neotoma,albigula,Rodent,Control
7,506,1,8,1978,2,NL,,,,Neotoma,albigula,Rodent,Control
8,588,2,18,1978,2,NL,M,,218.0,Neotoma,albigula,Rodent,Control
9,661,3,11,1978,2,NL,,,,Neotoma,albigula,Rodent,Control
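
Note that `data[5:10]` returns rows 5 to 9: as with Python lists, the end of the slice is excluded. Slicing by label with `loc` behaves differently and includes the end. A short sketch on a hand-built DataFrame (`small` is an illustrative name):

```python
import pandas

small = pandas.DataFrame({'record_id': [1, 72, 224, 266, 349, 363]})

by_position = small[1:3]        # positions 1 and 2; position 3 is excluded
also_position = small.iloc[1:3] # the same selection, written explicitly
by_label = small.loc[1:3]       # labels 1, 2 and 3; the end is included

print(by_position)
print(by_label)
```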


Techniques for selecting and indexing a DataFrame are extensive (http://pandas.pydata.org/pandas-docs/version/0.15/indexing.html#indexing). For the sake of simplicity, we will touch only on the specific approaches necessitated by our data mining tasks.
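
As a taste of what else is possible, here is a minimal sketch of three common selections: several columns at once, rows that match a condition, and both together. It reads a few rows copied from the outputs above out of an in-memory string (an assumption made so that it runs without `data/combined.csv`); `csv_text`, `sample`, `heavy` and the weight threshold of 20 are illustrative choices.

```python
import io
import pandas

csv_text = """record_id,species_id,sex,hindfoot_length,weight
1,NL,M,32.0,
72,NL,M,31.0,
224,NL,,,
588,NL,M,,218.0
26966,PL,M,20.0,16.0
27185,PL,F,21.0,22.0
"""
sample = pandas.read_csv(io.StringIO(csv_text))

# Several columns at once: pass a list of column names
print(sample[['record_id', 'species_id', 'weight']])

# Rows that satisfy a condition (rows with a missing weight never match)
heavy = sample[sample['weight'] > 20]
print(heavy)

# Rows and columns together with .loc
pl_rows = sample.loc[sample['species_id'] == 'PL', ['record_id', 'sex']]
print(pl_rows)
```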