# pandas

pandas is a Python package providing fast, flexible data structures for data analysis. Read the documentation and find installation instructions [here](https://pandas.pydata.org/docs/).

The primary data structure in pandas is the DataFrame, which is arranged like a spreadsheet in rows and columns. Each row represents an observation in the dataset and each column an attribute. Columns may have different types. Rows and columns are labeled for fast access.

Below we create a data frame from a CSV file and look at the first few rows.

The `In [2]` cell reads the entire file. To read only certain columns, pass the `usecols` parameter to `read_csv`:

```
df = pd.read_csv('name.csv', usecols=['colb', 'colx'])
```
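
As a minimal sketch of how `usecols` combines with `index_col`, the snippet below reads a small in-memory CSV instead of a file on disk. The `csv_text` string and the `subset` name are assumptions made for illustration; the values are the first three rows of `Heart.csv` as displayed in this notebook.

```python
import io

import pandas as pd

# Hypothetical in-memory CSV standing in for a file on disk.
csv_text = """ID,Age,Sex,Chol
1,63,1,233
2,67,1,286
3,67,1,229
"""

# Read only the ID and Age columns.
# A column used as index_col must also be listed in usecols.
subset = pd.read_csv(io.StringIO(csv_text), usecols=["ID", "Age"], index_col="ID")
print(subset)
```

Only `Age` remains as a data column here, because `ID` becomes the row index.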

In [2]:
import pandas as pd
df = pd.read_csv('Heart.csv', index_col='ID')
df.head()

ID,Age,Sex,ChestPain,RestBP,Chol,Fbs,RestECG,MaxHR,ExAng,Oldpeak,Slope,Ca,Thal,AHD
1,63,1,typical,145,233,1,2,150,0,2.3,3,0.0,fixed,No
2,67,1,asymptomatic,160,286,0,2,108,1,1.5,2,3.0,normal,Yes
3,67,1,asymptomatic,120,229,0,2,129,1,2.6,2,2.0,reversable,Yes
4,37,1,nonanginal,130,250,0,0,187,0,3.5,3,0.0,normal,No
5,41,0,nontypical,130,204,0,2,172,0,1.4,1,0.0,normal,No


In [3]:
# pandas version in this notebook

pd.__version__

'1.0.3'

### selecting

The following demonstrates two ways to select a column before indexing an element by its row label.

In [4]:
print(df['Age'][1]) # bracket notation
print(df.Age[1])    # use column attribute (has to be a valid Python identifier)

63
63
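
The restriction on attribute access can be illustrated with a small sketch. The `people` frame and its `Max HR` and `count` columns are hypothetical, chosen only to show a name that is not a valid identifier and a name that clashes with an existing DataFrame method.

```python
import pandas as pd

# Hypothetical frame: "Max HR" contains a space, "count" clashes with a method name.
people = pd.DataFrame(
    {"Age": [63, 67], "Max HR": [150, 108], "count": [1, 2]},
    index=pd.Index([1, 2], name="ID"),
)

# For an ordinary column name, both notations return the same series.
same = people["Age"].equals(people.Age)

# Only bracket notation can reach a column whose name is not a valid identifier.
max_hr = people["Max HR"]

# Attribute access finds the DataFrame.count method first, not the column.
count_attr = people.count
count_col = people["count"]

print(same, callable(count_attr), list(count_col))
```

Bracket notation works for any column name, so it is the safer choice when column names come from an external file.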


### loc and iloc accessors

**loc** indexes using labels and **iloc** indexes using integer positions.

Both take their indices in row, column order.

In the **loc** example we ask for the element with row label 1 and column label Age. In the **iloc** example we retrieve the same element using integer positions, which start at 0.

If the row position contains ':' then all rows are selected. If the column position contains ':' then all columns are selected.
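
The rules above can be sketched on a small frame built by hand. The `heart` frame below is an assumption for illustration, holding three columns of the first four rows of `Heart.csv` as shown earlier. The sketch also shows one difference worth knowing: label slices in `loc` include the end label, while position slices in `iloc` exclude the end position.

```python
import pandas as pd

# Hypothetical stand-in for the first rows of Heart.csv; ID labels start at 1.
heart = pd.DataFrame(
    {"Age": [63, 67, 67, 37], "Sex": [1, 1, 1, 1], "Chol": [233, 286, 229, 250]},
    index=pd.Index([1, 2, 3, 4], name="ID"),
)

by_label = heart.loc[1, "Age"]     # row label 1, column label Age
by_position = heart.iloc[0, 0]     # first row, first column

all_rows_one_col = heart.loc[:, "Age"]  # ':' in the row position selects every row
one_row_all_cols = heart.iloc[0, :]     # ':' in the column position selects every column

label_slice = heart.loc[1:2, ["Age", "Sex"]]  # labels 1 and 2, end label included
position_slice = heart.iloc[0:2, 0:2]         # positions 0 and 1, end position excluded

print(by_label, by_position)
print(label_slice.equals(position_slice))
```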

In [5]:
df.loc[1, 'Age']  # row, col order


63

In [6]:
df.loc[1:2, ['Age', 'Sex']]  # multiple row, multiple col

ID,Age,Sex
1,63,1
2,67,1


In [7]:
df.iloc[0, 0]

63

In [8]:
df.iloc[0:2, 0:2]

ID,Age,Sex
1,63,1
2,67,1


### selecting certain columns

Passing a list of column names inside the square brackets, [[ ]], ensures that a data frame is returned. In this example we did not retrieve the columns in the same order as in the original data frame.

In [9]:
df_new = df[['Sex','Age']]
df_new.head()

ID,Sex,Age
1,1,63
2,1,67
3,1,67
4,1,37
5,0,41


### slicing data frames

Next we select only one column, which returns a series. A pandas series is a one-dimensional labeled array. In this case we also select just the first 5 rows.

In [10]:
df['Age'][:5]   # this is a series not a data frame

ID
1    63
2    67
3    67
4    37
5    41
Name: Age, dtype: int64

In [11]:
df[['Age']][:5]  # this is a data frame because of the double brackets [[ ]]

ID,Age
1,63
2,67
3,67
4,37
5,41
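
The difference between single and double brackets can also be checked programmatically. This is a minimal sketch on a hand-built `heart` frame (an assumption for illustration, using the Age and Sex values of the first five rows shown above).

```python
import pandas as pd

# Hypothetical stand-in for the first five rows of Heart.csv.
heart = pd.DataFrame(
    {"Age": [63, 67, 67, 37, 41], "Sex": [1, 1, 1, 1, 0]},
    index=pd.Index([1, 2, 3, 4, 5], name="ID"),
)

single = heart["Age"]         # one column name: a series
double = heart[["Age"]]       # list with one column name: a data frame
reordered = heart[["Sex", "Age"]]  # columns come back in the order requested

print(type(single).__name__, single.shape)
print(type(double).__name__, double.shape)
print(list(reordered.columns))
```

The series has a one-dimensional shape, while the data frame keeps two dimensions even with a single column.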


### Selecting rows from a pandas data frame

In [12]:
df[5:10]

ID,Age,Sex,ChestPain,RestBP,Chol,Fbs,RestECG,MaxHR,ExAng,Oldpeak,Slope,Ca,Thal,AHD
6,56,1,nontypical,120,236,0,0,178,0,0.8,1,0.0,normal,No
7,62,0,asymptomatic,140,268,0,2,160,0,3.6,3,2.0,normal,Yes
8,57,0,asymptomatic,120,354,0,0,163,1,0.6,1,0.0,normal,No
9,63,1,asymptomatic,130,254,0,2,147,0,1.4,2,1.0,reversable,Yes
10,53,1,asymptomatic,140,203,1,2,155,1,3.1,3,0.0,reversable,Yes


In [13]:
df[df.Age > 70]

ID,Age,Sex,ChestPain,RestBP,Chol,Fbs,RestECG,MaxHR,ExAng,Oldpeak,Slope,Ca,Thal,AHD
43,71,0,nontypical,160,302,0,0,162,0,0.4,1,2.0,normal,No
104,71,0,nonanginal,110,265,1,2,130,0,0.0,1,1.0,normal,No
162,77,1,asymptomatic,125,304,0,2,162,1,0.0,1,3.0,normal,Yes
234,74,0,nontypical,120,269,0,2,121,1,0.2,1,1.0,normal,No
258,76,0,nonanginal,140,197,0,1,116,0,1.1,2,0.0,normal,No
274,71,0,asymptomatic,112,149,0,0,125,0,1.6,2,0.0,normal,No


In [14]:
df.sample(frac=0.01)

ID,Age,Sex,ChestPain,RestBP,Chol,Fbs,RestECG,MaxHR,ExAng,Oldpeak,Slope,Ca,Thal,AHD
174,62,0,asymptomatic,140,394,0,2,157,0,1.2,2,0.0,normal,No
25,60,1,asymptomatic,130,206,0,2,132,1,2.4,2,2.0,reversable,Yes
273,46,1,asymptomatic,140,311,0,0,120,1,1.8,2,2.0,reversable,Yes


In [15]:
df.sample(n=5)

ID,Age,Sex,ChestPain,RestBP,Chol,Fbs,RestECG,MaxHR,ExAng,Oldpeak,Slope,Ca,Thal,AHD
248,47,1,asymptomatic,110,275,0,2,118,1,1.0,2,1.0,normal,Yes
159,60,1,asymptomatic,140,293,0,2,170,0,1.2,2,2.0,reversable,Yes
78,51,0,nonanginal,140,308,0,2,142,0,1.5,1,1.0,normal,No
280,58,0,asymptomatic,130,197,0,0,131,0,0.6,2,0.0,normal,No
44,59,1,nonanginal,150,212,1,0,157,0,1.6,1,0.0,normal,No
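
The cells above select rows in three ways: a positional slice, a boolean mask, and random sampling. The sketch below repeats them on a small synthetic frame; the `frame` data and the `random_state=0` seed are assumptions for illustration. Because `sample` draws rows at random, passing `random_state` is one way to make a draw repeatable.

```python
import pandas as pd

# Hypothetical frame: 20 rows, IDs 1..20, ages 30..49.
frame = pd.DataFrame({"Age": range(30, 50)}, index=pd.Index(range(1, 21), name="ID"))

rows_by_position = frame[5:10]   # positions 5..9, which carry the labels 6..10
older = frame[frame.Age > 45]    # boolean mask keeps rows where the condition holds

# The same seed gives the same sample on every call.
first_draw = frame.sample(n=5, random_state=0)
second_draw = frame.sample(n=5, random_state=0)

# frac selects a fraction of the rows: 10% of 20 rows is 2 rows.
tenth = frame.sample(frac=0.1, random_state=0)

print(list(rows_by_position.index))
print(first_draw.equals(second_draw), len(tenth))
```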


There is much more to pandas than we have covered here; we covered just enough to get started with machine learning.