In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from pprint import pprint

In [2]:
df = pd.read_csv('./landsend_veg_2007_2012.csv')
print(df.shape)

(3592, 9)


### Helpful synopsis data
1. Relative cover by hit type (bare ground, litter, native plant cover, exotic plant cover)
2. Relative cover of vegetation by guild. A guild is a combination of nativity, life history (annual, perennial), and stature (forb, grass, rush/sedge, shrub, tree).
3. Relative cover of vegetation by stature (size class)
4. Relative cover of vegetation by life history (annual, perennial, biennial, etc.)
5. Relative cover by height category
6. Relative cover by all species, by native species only, and by exotic species only
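
The synopsis figures above all reduce to the same calculation. A minimal sketch, assuming that relative cover means the share of point hits in each category, expressed as a percentage. The `relative_cover` helper and the small sample frame are hypothetical; only the column names come from this dataset.

```python
import pandas as pd

def relative_cover(frame, by):
    """Percentage of rows (point hits) falling in each category of `by`."""
    # value_counts(normalize=True) returns fractions that sum to 1
    return frame[by].value_counts(normalize=True) * 100

# Hypothetical sample rows, not taken from the survey data
sample = pd.DataFrame({
    'Native Status': ['native', 'native', 'exotic', 'native'],
    'Life History': ['perennial', 'annual', 'annual', 'perennial'],
    'Stature': ['shrub', 'forb', 'grass', 'shrub'],
})

cover_by_status = relative_cover(sample, 'Native Status')

# A guild combines nativity, life history, and stature
sample['Guild'] = (sample['Native Status'] + ' ' +
                   sample['Life History'] + ' ' +
                   sample['Stature'])
cover_by_guild = relative_cover(sample, 'Guild')
```

Note that `value_counts` skips missing values by default, so rows without a category are left out of the denominator. Whether that is the right treatment for this survey is a decision the sketch does not settle.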

### Columns

**Site-YearCode, Transect, Point, Height , Species, Plant code, Native Status, Life History, Stature**


#### Site-Year Code
Four-letter abbreviation of the full site name, built from the first two letters of the first word and the first two letters of the second word. If the site name has more than two words, initials are used instead (e.g. Sutro Dunes = SUDO, Navy Memorial Slope = NMS). The year the observations were made is appended to the site code (e.g. all observations taken in 2012 are named xxxx-2012).
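
Because the site and the year share one field, it can help to separate them before grouping. A minimal sketch, assuming every code has the form `SITE-YYYY`; the three example codes appear in this dataset, while the variable names are illustrative.

```python
import pandas as pd

codes = pd.Series(['NMS-2010', 'SUDO-2012', 'EAPO-2011'])

# Split on the last hyphen so the site part is kept whole
parts = codes.str.rsplit('-', n=1, expand=True)
site = parts[0]
year = parts[1].astype(int)
```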

#### Transect Number 
The discrete line along which observations are made. In this study this could be any number between zero and infinity, but should be sequential and at regular intervals.

#### Point Number
Discrete locations along the transect, at predetermined intervals, where observations are made. In this case a dowel rod is dropped to the ground, perpendicular to the tape and parallel to a standing person. In this study this could be any number between zero and infinity, though the numbers should be sequential and at regular intervals.

#### Height
Distance from the ground at which plants cross the point on the transect. In this study, height classes were used:

- Low = 0 to 0.5 meters
- Medium = 0.51 to 2.0 meters
- High = 2.1 to 15 meters
- S (for super high) = 15+ meters
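
The class boundaries can be written as a small lookup. This is a sketch under stated assumptions: the listed ranges leave small gaps (for example between 0.5 and 0.51 meters), so the hypothetical `height_class` function treats each upper bound as inclusive and assigns anything above it to the next class. How the labels are actually spelled in the `Height ` column is not confirmed here.

```python
def height_class(height_m):
    """Map a height in meters to the class labels listed above."""
    if height_m < 0:
        raise ValueError('height must be non-negative')
    if height_m <= 0.5:
        return 'Low'
    if height_m <= 2.0:
        return 'Medium'
    if height_m <= 15:
        return 'High'
    return 'S'  # "super high", above 15 meters
```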

#### Scientific Name
The Latin genus and species assigned to the plant, based on the Jepson Manual of California (1993 edition; it was updated with new names in 2012). The Jepson Manual can be accessed online at ucjeps.berkeley.edu or through CalFlora at www.calflora.org.

#### Plant Code
Four-letter abbreviation of the plant name, built from the first two letters of the genus and the first two letters of the species. If duplicates exist at a site, the USDA PLANTS database (plants.usda.gov) is consulted for the number to append to the code (e.g. TRLA16 is Triteleia laxa, TRLA3 is Trichostema lanatum).
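
The four-letter part of the code follows mechanically from the scientific name. A minimal sketch, using a hypothetical `plant_code` helper; it does not produce the numeric suffix, which requires a lookup in the USDA PLANTS database.

```python
def plant_code(scientific_name):
    """Build the four-letter code from a 'Genus species' name."""
    parts = scientific_name.split()
    if len(parts) < 2:
        raise ValueError('expected a genus and a species')
    genus, species = parts[0], parts[1]
    # First two letters of each word, upper-cased
    return (genus[:2] + species[:2]).upper()
```

For example, `plant_code('Triteleia laxa')` gives `'TRLA'`, the prefix of the `TRLA16` code mentioned above.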

#### Native Status
Whether the plant is considered native or exotic as defined by the Jepson Manual of California (1993).


#### Life History
Describes whether the plant is an annual or perennial plant. If “shrub” is listed, this should be replaced with “perennial.” If “biennial” is listed, it should be replaced with “annual.”
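
The two substitutions described above can be applied in one step. A minimal sketch on hypothetical values, assuming the entries are spelled in lowercase exactly as shown; the spelling used in the actual `Life History` column may differ.

```python
import pandas as pd

# Hypothetical entries, not taken from the survey data
life_history = pd.Series(['annual', 'shrub', 'biennial', 'perennial'])

# Map the two non-standard labels onto the standard ones
normalized = life_history.replace({'shrub': 'perennial',
                                   'biennial': 'annual'})
```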

#### Stature
In other studies this grouping has been called “guild.” In this study the choices are forb, grass, rush/sedge, shrub, or tree. A forb is a soft-bodied plant that does not make a woody stem.

#### Common name
The colloquial name for a plant, separate from its Latin name.

#### Dune
This was added by surveyors as a further subdivision of the Sutro Dunes site. This category can be disregarded.

#### Data Recorder
The person recording the data. Recorded so questions about the point could be addressed to the person who wrote down the data.

#### Reader
The person who “read” the plants on the line (i.e. what plants were touching the dowel at the point). Recorded so questions about the point could be addressed to the person who observed the plants.



In [3]:
colstr = ', '.join(df.columns)
print(colstr)

Site-YearCode, Transect, Point, Height , Species, Plant code, Native Status, Life History, Stature
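
The printed list shows a space before the comma after `Height`, which suggests that the column name itself ends with a space. If so, stripping whitespace from the column names avoids lookups failing on `df['Height']`. A minimal sketch on a hypothetical empty frame that reuses three of the names above:

```python
import pandas as pd

# Hypothetical frame reusing three of the column names printed above
demo = pd.DataFrame(columns=['Transect', 'Height ', 'Species'])

# Remove leading and trailing whitespace from every column name
demo.columns = demo.columns.str.strip()
```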


In [4]:
for col in df.columns:
    print('%-30s %d' % (col, len(df[col].unique())))

Site-YearCode                  14
Transect                       50
Point                          131
Height                         9
Species                        172
Plant code                     119
Native Status                  5
Life History                   4
Stature                        7
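
Some of these counts are higher than the definitions suggest: the height column has 9 distinct values although only four classes are defined. One way to investigate is to list the raw values and check whether stripping whitespace and unifying case collapses them. The sketch below uses hypothetical entries, since the actual values are not shown in this notebook.

```python
import pandas as pd

# Hypothetical raw entries; the real column values are not shown here
raw = pd.Series(['Low', 'low', 'Low ', 'Medium', 'High', 'S'])

# Show every distinct raw value with its frequency, including NaN
print(raw.value_counts(dropna=False))

# Trim whitespace and unify case, then count distinct values again
cleaned = raw.str.strip().str.capitalize()
print(cleaned.nunique())
```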


In [5]:
pprint(sorted(list(df['Site-YearCode'].unique())))

['EAPO-2011',
 'NMS-2010',
 'NMS-2011',
 'NMS-2012',
 'NUWO-2010',
 'NUWO-2011',
 'NUWO-2012',
 'SUDO-2010',
 'SUDO-2011',
 'SUDO-2012',
 'SUDO-2013',
 'SUDO-2014',
 'SUDU-2008',
 'SUDU-2009']


In [6]:
print('total rows: %d' % len(df))
print('\n# null values: ')
print(df.isnull().sum())
print('\n% null values:')
print(df.isnull().sum() / len(df) * 100)

total rows: 3592

# null values: 
Site-YearCode       0
Transect            0
Point               1
Height              0
Species             0
Plant code       1019
Native Status    1621
Life History     1791
Stature          1791
dtype: int64

% null values:
Site-YearCode     0.000000
Transect          0.000000
Point             0.027840
Height            0.000000
Species           0.000000
Plant code       28.368597
Native Status    45.128062
Life History     49.860802
Stature          49.860802
dtype: float64
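
`Life History` and `Stature` have the same number of missing values (1791), which raises the question of whether they are missing in the same rows. A minimal sketch of that check, run on a hypothetical frame rather than the survey data:

```python
import numpy as np
import pandas as pd

# Hypothetical rows; both columns are missing in rows 1 and 3
demo = pd.DataFrame({
    'Life History': ['annual', np.nan, 'perennial', np.nan],
    'Stature': ['forb', np.nan, 'shrub', np.nan],
})

# True only if the two null masks agree on every row
same_rows = bool((demo['Life History'].isnull()
                  == demo['Stature'].isnull()).all())
```

Running the same comparison on `df` would show whether the two columns were left blank together, for instance on rows that record something other than a classified plant.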


In [7]:
from define_column import define_column

In [8]:
define_column('Transect')

The discrete line along which observations are made. 

In this study this could be any number between zero and infinity, but should be sequential and at regular intervals.


'The discrete line along which observations are made. \n\nIn this study this could be any number between zero and infinity, but should be sequential and at regular intervals.'