# 03/04 Access rows and columns

In [1]:
import pandas as pd

Let's access some data. First, we'll load our data frame. 

In [2]:
csv_file = 'track.csv'
df = pd.read_csv(csv_file, parse_dates=['time'])

To get a column, you can use **square brackets with the name of the column, so df['lat'].** Here I'm getting the whole latitude column. 

In [3]:
df['lat']

0      32.519585
1      32.519606
2      32.519612
3      32.519654
4      32.519689
         ...    
735    32.517020
736    32.517035
737    32.517087
738    32.517098
739    32.517142
Name: lat, Length: 740, dtype: float64

You can also use df.lat, which is attribute access, but **I don't recommend using the dot notation**: in some cases column names have spaces in them, and then you can't use the dot notation. So **get in the habit of using square brackets, which always work.** 

In [4]:
df.lat # Don't do this!

0      32.519585
1      32.519606
2      32.519612
3      32.519654
4      32.519689
         ...    
735    32.517020
736    32.517035
737    32.517087
738    32.517098
739    32.517142
Name: lat, Length: 740, dtype: float64
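Here is a minimal sketch of where the dot notation breaks down. The frame `demo` and its column names are made up for illustration and are not part of track.csv. A name with a space cannot be written after a dot at all, and a name that matches an existing DataFrame method (here `count`) gives you the method instead of the column.

```python
import pandas as pd

# Hypothetical data: one column name has a space, another matches a method name.
demo = pd.DataFrame({
    'max speed': [10.5, 12.0],
    'count': [3, 4],
})

# Square brackets work for any column name.
speed = demo['max speed']
count_col = demo['count']

# demo.max speed is not valid Python, so there is nothing to run for that case.
# demo.count is the DataFrame.count method, not the 'count' column.
count_attr = demo.count

print(type(count_col).__name__)  # Series
print(callable(count_attr))      # True
```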

You can select **more than a single column**. Here I'm passing a list with two column names, and I get back a data frame with two columns. 

In [5]:
df[['lat', 'lng']]

Unnamed: 0,lat,lng
0,32.519585,35.015021
1,32.519606,35.014954
2,32.519612,35.014871
3,32.519654,35.014824
4,32.519689,35.014776
...,...,...
735,32.517020,35.014387
736,32.517035,35.014355
737,32.517087,35.014279
738,32.517098,35.014264


If you want a specific value, you can first select the column and then the row. **So we select the latitude column and then the first value**. Remember that Python is zero-based, not one-based: the first item is at location 0. 

In [6]:
df['lat'][0]

np.float64(32.519585)
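As a side sketch, the same value can also be fetched in a single step by giving .loc both the row label and the column name, or with .at, which is meant for single values. The small frame `demo` below is a stand-in built from the first two rows shown above, so the snippet runs without track.csv.

```python
import pandas as pd

# Stand-in for the track data: the first two rows shown in the output above.
demo = pd.DataFrame({
    'lat': [32.519585, 32.519606],
    'lng': [35.015021, 35.014954],
})

chained = demo['lat'][0]      # column first, then row label 0
single = demo.loc[0, 'lat']   # row label and column name in one step
scalar = demo.at[0, 'lat']    # .at is meant for reading a single value

print(chained == single == scalar)  # True
```

The one-step forms tend to matter more when you assign values, because the chained form may end up writing to a temporary copy.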

If you want an **entire row, you can use the .loc accessor**, so df.loc[0], and now we get the row. 

In [7]:
df.loc[0]

time      2015-08-20 03:48:07.235000
lat                        32.519585
lng                        35.015021
height                    136.199997
Name: 0, dtype: object

So we see the time, latitude, longitude and height. .loc also works with **slices**, so from 2 to 7. Unlike slices in Python, which are half-open, .loc in pandas slices from the start to the end, including the end. 

In [8]:
df.loc[2:7]

Unnamed: 0,time,lat,lng,height
2,2015-08-20 03:48:25.660,32.519612,35.014871,123.0
3,2015-08-20 03:48:26.819,32.519654,35.014824,120.5
4,2015-08-20 03:48:27.828,32.519689,35.014776,118.900002
5,2015-08-20 03:48:29.720,32.519691,35.014704,119.900002
6,2015-08-20 03:48:30.669,32.519734,35.014657,120.900002
7,2015-08-20 03:48:33.793,32.519719,35.014563,121.699997


You can also combine slices and column selection, so we're **selecting only the latitude and longitude columns, and then getting rows 2 to 7.** This time pandas is going to use Python-style slicing, which is a half-open range, meaning you get from the first index up to but not including the last one. 

I know it's a bit confusing. The best way to make these things sink in is to practice: take some data frames, **slice and dice them until you get comfortable** with the results, and you'll be just fine. 

In [9]:
df[['lat', 'lng']][2:7]

Unnamed: 0,lat,lng
2,32.519612,35.014871
3,32.519654,35.014824
4,32.519689,35.014776
5,32.519691,35.014704
6,32.519734,35.014657
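To see the two slicing rules side by side, here is a small sketch on a made-up ten-row frame (the name `demo` and its values are only for illustration). It also shows that .loc can take the row slice and the column list together.

```python
import pandas as pd

# Hypothetical frame with the default RangeIndex 0..9.
demo = pd.DataFrame({'lat': range(10), 'lng': range(10, 20)})

by_label = demo.loc[2:7, ['lat', 'lng']]   # label slice: end label 7 is included
by_position = demo.iloc[2:7]               # position slice: end position 7 is excluded
by_brackets = demo[['lat', 'lng']][2:7]    # plain brackets with integers slice by position

print(len(by_label))     # 6
print(len(by_position))  # 5
print(len(by_brackets))  # 5
```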


**Every column in pandas is a series.** One of the differences between a series and a regular Python **list is that a series has a labeled axis called an index.** 

All the columns in the data frame share the same index, so we can do **df.index and run the cell: this is a range index starting from 0 and stopping before 740, with a step of 1.** 

In [10]:
df.index

RangeIndex(start=0, stop=740, step=1)

In [11]:
import numpy as np
df1 = pd.DataFrame(
    np.arange(10).reshape(5, 2),
    columns=['x', 'y'],
    index=['a', 'b', 'c', 'd', 'e'],
)
df1

Unnamed: 0,x,y
a,0,1
b,2,3
c,4,5
d,6,7
e,8,9


Let's create a small example. We **create a data frame** with values from NumPy: np.arange(10) reshaped into five rows and two columns. We specify the column names, and **then we say the index is a, b, c, d and e.** When we run this code, you can see that the index is now a, b, c, d and e.

In [12]:
df1.loc[0]

<class 'KeyError'>: 0

And now, if you use df1.loc[0], **this is going to fail because there's no row labeled 0, but there is a row labeled a**, so I can do df1.loc['a']. 

In [13]:
df1.loc['a']

x    0
y    1
Name: a, dtype: int32

I can also slice between a and d, and pandas can handle it just as well. A side note: labels **don't have to be unique. If there are repeating labels, you will get all the rows with this label.** 

In [14]:
df1.loc['a':'d']

Unnamed: 0,x,y
a,0,1
b,2,3
c,4,5
d,6,7
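The side note about repeating labels can be sketched like this. The frame `dup` is a made-up variation of the example above with the label a used twice.

```python
import pandas as pd

# Hypothetical frame where the label 'a' appears twice.
dup = pd.DataFrame(
    {'x': [0, 2, 4], 'y': [1, 3, 5]},
    index=['a', 'b', 'a'],
)

rows_a = dup.loc['a']   # two matching rows, so the result is a data frame
row_b = dup.loc['b']    # one matching row, so the result is a series

print(dup.index.is_unique)    # False
print(len(rows_a))            # 2
print(type(row_b).__name__)   # Series
```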


Sometimes you'd like to **access the first row regardless of the label**. In this case you can use the .iloc accessor, which works by position, so df1.iloc[0]. This is going to work and bring us the first row. 

In [15]:
df1.iloc[0]

x    0
y    1
Name: a, dtype: int32
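The difference between label and position is easiest to see when the labels are integers that are not in order. A minimal sketch, using a made-up frame called `shuffled`:

```python
import pandas as pd

# Hypothetical frame whose integer labels do not match the row positions.
shuffled = pd.DataFrame({'x': [10, 20, 30]}, index=[2, 0, 1])

first_by_position = shuffled.iloc[0]   # first row in the frame, whatever its label
row_labeled_zero = shuffled.loc[0]     # the row whose label is 0

print(first_by_position['x'])  # 10
print(row_labeled_zero['x'])   # 20
```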

Another kind of index you can have is a time-based index. 

In [16]:
df.index

RangeIndex(start=0, stop=740, step=1)

So let's change the data frame index, which currently, if we look at it, is a range index, and we are **going to set it to the time.** If we have a look at it now, it is a DatetimeIndex with all of these values. 

In [17]:
df.index = df['time']
df.index

DatetimeIndex(['2015-08-20 03:48:07.235000', '2015-08-20 03:48:24.734000',
               '2015-08-20 03:48:25.660000', '2015-08-20 03:48:26.819000',
               '2015-08-20 03:48:27.828000', '2015-08-20 03:48:29.720000',
               '2015-08-20 03:48:30.669000', '2015-08-20 03:48:33.793000',
               '2015-08-20 03:48:34.869000', '2015-08-20 03:48:37.708000',
               ...
               '2015-08-20 04:20:18.844000', '2015-08-20 04:20:21.996000',
               '2015-08-20 04:20:22.897000', '2015-08-20 04:20:24.905000',
               '2015-08-20 04:20:25.835000', '2015-08-20 04:20:28.982000',
               '2015-08-20 04:20:29.923000', '2015-08-20 04:20:32.863000',
               '2015-08-20 04:20:33.994000', '2015-08-20 04:20:42.329000'],
              dtype='datetime64[ns]', name='time', length=740, freq=None)
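Assigning to df.index keeps time as a regular column as well, which is why later outputs show it twice. As a hedged alternative, set_index moves the column into the index and returns a new frame. The frame `demo` below is a two-row stand-in, not the real track data.

```python
import pandas as pd

# Two-row stand-in for the track data.
demo = pd.DataFrame({
    'time': pd.to_datetime(['2015-08-20 03:48:07', '2015-08-20 03:48:24']),
    'lat': [32.519585, 32.519606],
})

assigned = demo.copy()
assigned.index = assigned['time']   # 'time' is now the index and still a column

moved = demo.set_index('time')      # 'time' becomes the index and leaves the columns

print(list(assigned.columns))  # ['time', 'lat']
print(list(moved.columns))     # ['lat']
```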

Now, if we run df.loc[0], it's going to **fail because we don't have any row that is labeled 0.** 

In [18]:
df.loc[0]

<class 'KeyError'>: 0

In [19]:
df.loc['2015-08-20 03:48:34']

Unnamed: 0_level_0,time,lat,lng,height
time,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
2015-08-20 03:48:34.869,2015-08-20 03:48:34.869,32.519694,35.014549,121.199997


In [20]:
df.loc['2015-08-20 03:48']

Unnamed: 0_level_0,time,lat,lng,height
time,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
2015-08-20 03:48:07.235,2015-08-20 03:48:07.235,32.519585,35.015021,136.199997
2015-08-20 03:48:24.734,2015-08-20 03:48:24.734,32.519606,35.014954,126.599998
2015-08-20 03:48:25.660,2015-08-20 03:48:25.660,32.519612,35.014871,123.0
2015-08-20 03:48:26.819,2015-08-20 03:48:26.819,32.519654,35.014824,120.5
2015-08-20 03:48:27.828,2015-08-20 03:48:27.828,32.519689,35.014776,118.900002
2015-08-20 03:48:29.720,2015-08-20 03:48:29.720,32.519691,35.014704,119.900002
2015-08-20 03:48:30.669,2015-08-20 03:48:30.669,32.519734,35.014657,120.900002
2015-08-20 03:48:33.793,2015-08-20 03:48:33.793,32.519719,35.014563,121.699997
2015-08-20 03:48:34.869,2015-08-20 03:48:34.869,32.519694,35.014549,121.199997
2015-08-20 03:48:37.708,2015-08-20 03:48:37.708,32.519625,35.014515,121.699997
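The last two cells pass .loc a timestamp string that is less precise than the index, and pandas returns every row that falls inside that second or that minute. Here is a small sketch of the same behavior on made-up data; the frame `demo` and its values are only for illustration.

```python
import pandas as pd

# Hypothetical millisecond-resolution timestamps, sorted in time order.
times = pd.to_datetime([
    '2015-08-20 03:48:07.235',
    '2015-08-20 03:48:34.869',
    '2015-08-20 03:49:01.000',
])
demo = pd.DataFrame({'value': [1.0, 2.0, 3.0]}, index=times)

same_second = demo.loc['2015-08-20 03:48:34']   # every row within that second
same_minute = demo.loc['2015-08-20 03:48']      # every row within that minute

print(len(same_second))  # 1
print(len(same_minute))  # 2
```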
