# pandas

pandas is a Python package providing fast, flexible data structures for data analysis.

The primary data structure in pandas is the DataFrame, which is arranged like a spreadsheet in rows and columns. Each row represents an observation in the dataset and each column represents an attribute. Columns may have different types. Rows and columns are labeled for fast access.
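
As a minimal sketch of this layout, a small data frame can be built by hand. The name `df_small` and the choice of three columns are assumptions made for illustration; the values mirror a few rows of the heart dataset used in this section.

```python
import pandas as pd

# Hypothetical hand-built frame: each key is a column, each list holds
# that column's values, and the index supplies the row labels.
df_small = pd.DataFrame(
    {
        "Age": [63, 67, 37],                                     # integers
        "ChestPain": ["typical", "asymptomatic", "nonanginal"],  # text
        "Oldpeak": [2.3, 1.5, 3.5],                              # floats
    },
    index=pd.Index([1, 2, 4], name="ID"),
)

print(df_small.dtypes)  # one dtype per column
print(df_small.shape)   # (3, 3): three rows, three columns
```

Each column keeps its own type, while the `ID` index labels the rows.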

Below we create a data frame from a CSV file and look at the first few rows.

In [2]:
import pandas as pd
df = pd.read_csv('Heart.csv', index_col='ID')
df.head()

| ID | Age | Sex | ChestPain | RestBP | Chol | Fbs | RestECG | MaxHR | ExAng | Oldpeak | Slope | Ca | Thal | AHD |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 63 | 1 | typical | 145 | 233 | 1 | 2 | 150 | 0 | 2.3 | 3 | 0.0 | fixed | No |
| 2 | 67 | 1 | asymptomatic | 160 | 286 | 0 | 2 | 108 | 1 | 1.5 | 2 | 3.0 | normal | Yes |
| 3 | 67 | 1 | asymptomatic | 120 | 229 | 0 | 2 | 129 | 1 | 2.6 | 2 | 2.0 | reversable | Yes |
| 4 | 37 | 1 | nonanginal | 130 | 250 | 0 | 0 | 187 | 0 | 3.5 | 3 | 0.0 | normal | No |
| 5 | 41 | 0 | nontypical | 130 | 204 | 0 | 2 | 172 | 0 | 1.4 | 1 | 0.0 | normal | No |


### selecting

The following demonstrates two ways to select a column and then index an element of it by its row label.

In [3]:
print(df['Age'][1]) # bracket notation
print(df.Age[1])    # use column attribute (has to be a valid Python identifier)

63
63
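
The restriction on attribute access can be seen in a short sketch. The frame `demo` and the column name `Max HR` (with a space) are hypothetical and are used only to show a name that is not a valid Python identifier.

```python
import pandas as pd

# Hypothetical frame; "Max HR" contains a space on purpose.
demo = pd.DataFrame(
    {"Age": [63, 67], "Max HR": [150, 108]},
    index=pd.Index([1, 2], name="ID"),
)

age_bracket = demo["Age"]  # bracket notation
age_attr = demo.Age        # attribute notation, same column
same = age_bracket.equals(age_attr)
print(same)                # True

# demo.Max HR would be a syntax error, so brackets are required here.
max_hr = demo["Max HR"]
print(max_hr.loc[1])       # 150
```

Bracket notation works for any column name, so it is the safer default when names come from a file.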


### loc and iloc accessors

**loc** indexes using labels and **iloc** indexes using integer positions.

Both take their arguments in row, column order.

In the **loc** example we ask for the element with row label 1 and column label Age. In the **iloc** example we retrieve the same element using integer positions, which start at 0.

If ':' is given in the row position, all rows are selected; if ':' is given in the column position, all columns are selected.

In [4]:
df.loc[1, 'Age']  # row, col order


63

In [5]:
df.iloc[0, 0]

63
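
The ':' rule described above can be sketched on a small stand-in frame. The name `demo` and its three columns are assumptions chosen for illustration; the values match the first rows of the dataset.

```python
import pandas as pd

# Hypothetical stand-in for the heart data, indexed by ID.
demo = pd.DataFrame(
    {"Age": [63, 67, 67], "Sex": [1, 1, 1], "Chol": [233, 286, 229]},
    index=pd.Index([1, 2, 3], name="ID"),
)

all_ages = demo.loc[:, "Age"]  # ':' in the row position: every row of column 'Age'
first_row = demo.iloc[0, :]    # ':' in the column position: every column of row 0

print(all_ages.tolist())       # [63, 67, 67]
print(first_row.tolist())      # [63, 1, 233]
```

Both results are Series: the first is indexed by the row labels, the second by the column labels.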

### selecting certain columns

Passing a list of column names inside the square brackets, [[ ]], ensures that a data frame is returned. The columns come back in the order given in the list, which in this example differs from their order in the original data frame.

In [6]:
df_new = df[['Sex','Age']]
df_new.head()

| ID | Sex | Age |
|---|---|---|
| 1 | 1 | 63 |
| 2 | 1 | 67 |
| 3 | 1 | 67 |
| 4 | 1 | 37 |
| 5 | 0 | 41 |
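
A short sketch confirms that the order of the list controls the column order of the result. The frame `demo` and the name `subset` are hypothetical.

```python
import pandas as pd

# Hypothetical frame with columns in the order Age, Sex, Chol.
demo = pd.DataFrame(
    {"Age": [63, 67], "Sex": [1, 1], "Chol": [233, 286]},
    index=pd.Index([1, 2], name="ID"),
)

subset = demo[["Chol", "Age"]]  # list order, not original order

print(list(subset.columns))     # ['Chol', 'Age']
print(list(demo.columns))       # ['Age', 'Sex', 'Chol'] (original is unchanged)
```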


### slicing data frames

Next we select a single column, which returns a Series. A pandas Series is a one-dimensional labeled array. Here we also slice it to keep just the first 5 rows.

In [7]:
df['Age'][:5]   # this is a series not a data frame

ID
1    63
2    67
3    67
4    37
5    41
Name: Age, dtype: int64

In [11]:
df[['Age']][:5]  # this is a data frame because of the double brackets [[ ]]

| ID | Age |
|---|---|
| 1 | 63 |
| 2 | 67 |
| 3 | 67 |
| 4 | 37 |
| 5 | 41 |
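
The difference between single and double brackets can be checked directly by inspecting the returned types. This is a minimal sketch on a hypothetical frame `demo` holding the five ages shown above.

```python
import pandas as pd

# Hypothetical frame with the first five rows of two columns.
demo = pd.DataFrame(
    {"Age": [63, 67, 67, 37, 41], "Sex": [1, 1, 1, 1, 0]},
    index=pd.Index([1, 2, 3, 4, 5], name="ID"),
)

single = demo["Age"]    # single brackets: Series
double = demo[["Age"]]  # double brackets: DataFrame with one column

print(type(single).__name__)  # Series
print(type(double).__name__)  # DataFrame
print(single.shape)           # (5,)
print(double.shape)           # (5, 1)
```

The shapes show the distinction: the Series is one-dimensional, while the one-column DataFrame is still two-dimensional.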


There is much more to pandas than we have covered here; this section covers just enough to get started with machine learning.