# 01 Pandas Basic Access

Loading data from CSV and SQLite, and various ways of selecting rows and columns.

* load from CSV
* load from sqlite3 via SQL
* dictionary-like access, label- and position-based indexers, advanced indexing
* rule of thumb: use `loc` for label-based indexing, `iloc` for integer-position-based indexing
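
The rule of thumb matters most when index labels and row positions differ. A minimal sketch, assuming a hypothetical toy frame (`toy` and its values are made up for illustration, not taken from the population file):

```python
import pandas as pd

# Hypothetical toy frame; the index labels are deliberately not 0, 1, 2.
toy = pd.DataFrame(
    {"Year": [1960, 1961, 1962], "Value": [10.0, 11.0, 12.0]},
    index=[2, 0, 1],
)

by_label = toy.loc[0, "Year"]   # row whose index *label* is 0 (the second row)
by_position = toy.iloc[0, 0]    # first row, first column, regardless of labels

print(by_label, by_position)    # 1961 1960
```

With the default `RangeIndex` used in the rest of this notebook, labels and positions coincide, so the difference is easy to overlook.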

## CSV file

Example file: yearly populations per country.

```
data
└── [461K]  population.csv

0 directories, 1 file
```

In [1]:
!tree -sh data

[4.0K]  data
├── [ 11K]  220px-MAN_M2000_Pritschenwagen.jpg
├── [ 65M]  autos.csv
├── [4.0K]  E-PRTR_database_v13
│   ├── [ 75M]  Pollutant releases.xlsx
│   ├── [7.6M]  Pollutant transfers.xlsx
│   └── [ 94M]  Waste transfers.xlsx
├── [4.7K]  iris_dirty.csv
├── [ 35K]  klarchiv_05792_daily_akt.zip
├── [112M]  notMNIST_small.mat
├── [461K]  population.csv
├── [544K]  population.db
├── [ 133]  population.sql
├── [ 55K]  produkt_klima_tag_20161216_20180618_05792_modified.txt
├── [ 75K]  produkt_klima_tag_20161216_20180618_05792.txt
├── [4.0K]  speed-limit-signs
│   ├── [4.0K]  0
│   │   ├── [9.1K]  00000.ppm
│   │   ├── [9.0K]  00001.ppm
│   │   ├── [ 18K]  00002.ppm
│   │   ├── [4.8K]  00003.ppm
│   │   ├── [3.0K]  00004.ppm
│   │   ├── [6.9K]  00005.p

In [2]:
!wc -l data/population.csv

14886 data/population.csv


In [3]:
import numpy as np
import pandas as pd

import random
import sqlite3

%matplotlib inline

In [4]:
df = pd.read_csv("data/population.csv") # https://git.io/fA8aT

In [5]:
df.head()

Unnamed: 0,Country Name,Country Code,Year,Value
0,Arab World,ARB,1960,92490932.0
1,Arab World,ARB,1961,95044497.0
2,Arab World,ARB,1962,97682294.0
3,Arab World,ARB,1963,100411076.0
4,Arab World,ARB,1964,103239902.0


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14885 entries, 0 to 14884
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Country Name  14885 non-null  object 
 1   Country Code  14885 non-null  object 
 2   Year          14885 non-null  int64  
 3   Value         14885 non-null  float64
dtypes: float64(1), int64(1), object(2)
memory usage: 465.3+ KB


This looks like a tidy data set: each row is one observation (a country or region in a given year) and each column is one variable.
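
The loading step can also be tried without the data file. A minimal sketch, assuming a three-row sample typed in by hand to mirror the columns above (`sample_csv` and `sample` are illustrative names):

```python
import io

import pandas as pd

# Hand-written sample with the same columns as population.csv.
sample_csv = """Country Name,Country Code,Year,Value
Arab World,ARB,1960,92490932.0
Arab World,ARB,1961,95044497.0
Arab World,ARB,1962,97682294.0
"""

# read_csv accepts any file-like object, not just a path.
sample = pd.read_csv(io.StringIO(sample_csv))

print(sample.shape)
print(sample.dtypes[["Year", "Value"]])
```

`Year` is inferred as an integer column and `Value` as a float column, matching the `df.info()` output above.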

## Selecting rows and columns

In [7]:
df.columns

Index(['Country Name', 'Country Code', 'Year', 'Value'], dtype='object')

### Selecting a single row by position

* position based indexing (df.iloc)

In [8]:
df.iloc[0]

Country Name    Arab World
Country Code           ARB
Year                  1960
Value           92490932.0
Name: 0, dtype: object

### Selecting multiple rows

* position based indexing (df.iloc)

In [9]:
df.iloc[0:10]

Unnamed: 0,Country Name,Country Code,Year,Value
0,Arab World,ARB,1960,92490932.0
1,Arab World,ARB,1961,95044497.0
2,Arab World,ARB,1962,97682294.0
3,Arab World,ARB,1963,100411076.0
4,Arab World,ARB,1964,103239902.0
5,Arab World,ARB,1965,106174988.0
6,Arab World,ARB,1966,109230593.0
7,Arab World,ARB,1967,112406932.0
8,Arab World,ARB,1968,115680165.0
9,Arab World,ARB,1969,119016542.0


### Selecting a column by key

* dictionary style access

In [10]:
df["Country Name"].head()

0    Arab World
1    Arab World
2    Arab World
3    Arab World
4    Arab World
Name: Country Name, dtype: object

### Advanced Selection

* Selecting one or more columns
* Reordering columns

In [11]:
df[["Country Code", "Year"]].head()

Unnamed: 0,Country Code,Year
0,ARB,1960
1,ARB,1961
2,ARB,1962
3,ARB,1963
4,ARB,1964
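
A single key and a list of keys return different types, and a list also determines the column order. A small sketch on a hypothetical two-row frame (`toy` is made up for illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    "Country Name": ["Arab World", "Arab World"],
    "Country Code": ["ARB", "ARB"],
    "Year": [1960, 1961],
})

as_series = toy["Year"]                     # single key -> Series
as_frame = toy[["Year"]]                    # list with one key -> DataFrame
reordered = toy[["Year", "Country Code"]]   # columns come back in list order

print(type(as_series).__name__, type(as_frame).__name__)  # Series DataFrame
print(list(reordered.columns))              # ['Year', 'Country Code']
```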


### Selection via label-based indexer (df.loc)

* specify row and columns at the same time

In [12]:
df.loc[:, "Country Code":"Year"].head()

Unnamed: 0,Country Code,Year
0,ARB,1960
1,ARB,1961
2,ARB,1962
3,ARB,1963
4,ARB,1964


### Selecting rows and columns

* either with `df.loc` or `df.iloc`
* `df.loc` slices include the end label, `df.iloc` slices exclude the end position

In [13]:
df.loc[2:10, "Country Code"]

2     ARB
3     ARB
4     ARB
5     ARB
6     ARB
7     ARB
8     ARB
9     ARB
10    ARB
Name: Country Code, dtype: object
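
The slice above returns nine rows, labels 2 through 10. A sketch comparing the same slice bounds under `loc` and `iloc`, assuming a hypothetical frame with the default integer index:

```python
import pandas as pd

toy = pd.DataFrame({"Year": range(1960, 1972)})  # default RangeIndex 0..11

label_slice = toy.loc[2:10, "Year"]    # labels 2 through 10, end included
position_slice = toy.iloc[2:10, 0]     # positions 2 through 9, end excluded

print(len(label_slice), len(position_slice))  # 9 8
```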

In [14]:
df.loc[::2, "Year"].head()

0    1960
2    1962
4    1964
6    1966
8    1968
Name: Year, dtype: int64

In [15]:
df.iloc[0:5, 0:2]

Unnamed: 0,Country Name,Country Code
0,Arab World,ARB
1,Arab World,ARB
2,Arab World,ARB
3,Arab World,ARB
4,Arab World,ARB


In [16]:
df.iloc[0:10:2, [1, 0]]

Unnamed: 0,Country Code,Country Name
0,ARB,Arab World
2,ARB,Arab World
4,ARB,Arab World
6,ARB,Arab World
8,ARB,Arab World


In [17]:
df.iloc[:, lambda f: random.randint(0, len(f.columns) - 1)].head()  # callable gets the frame; the column is picked at random, so output varies

0    1960
1    1961
2    1962
3    1963
4    1964
Name: Year, dtype: int64
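
Both `iloc` and `loc` accept a callable that takes the frame being indexed and returns a valid indexer. A deterministic sketch on a hypothetical toy frame (names and values are made up for illustration):

```python
import pandas as pd

toy = pd.DataFrame({"Year": [1960, 1961, 1962], "Value": [10.0, 11.0, 12.0]})

# The callable receives the frame and returns a column position.
last_column = toy.iloc[:, lambda f: len(f.columns) - 1]

# With loc, the callable can return a boolean mask over the rows.
recent = toy.loc[lambda f: f["Year"] >= 1961, "Value"]

print(last_column.name)   # Value
print(recent.tolist())    # [11.0, 12.0]
```

Using the callable's argument instead of an outer variable keeps the expression valid inside method chains, where the intermediate frame has no name.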

## Reading from an SQLite database

SQLite is one of the most widely deployed embedded databases. It supports SQL and is a good fit for storing small to mid-sized relational data sets. For larger, numerical data sets, HDF5 is a popular choice.

In [18]:
conn = sqlite3.connect("data/population.db")
df = pd.read_sql("select * from population", conn)
conn.close()

In [19]:
df.head()

Unnamed: 0,Country Name,Country Code,Year,Value
0,Arab World,ARB,1960,92490932
1,Arab World,ARB,1961,95044497
2,Arab World,ARB,1962,97682294
3,Arab World,ARB,1963,100411076
4,Arab World,ARB,1964,103239902
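
The same round trip can be tried without `population.db`. A minimal sketch, assuming an in-memory database and a hypothetical three-row table (the `ZZZ` row is made up); it also passes the filter value as a query parameter instead of formatting it into the SQL string:

```python
import sqlite3

import pandas as pd

toy = pd.DataFrame({
    "Country Code": ["ARB", "ARB", "ZZZ"],   # "ZZZ" is a made-up code
    "Year": [1960, 1961, 1960],
    "Value": [92490932, 95044497, 100],
})

conn = sqlite3.connect(":memory:")  # in-memory database, nothing touches disk
try:
    toy.to_sql("population", conn, index=False)
    result = pd.read_sql(
        'select "Year", "Value" from population where "Country Code" = ?',
        conn,
        params=("ARB",),
    )
finally:
    conn.close()  # close the connection even if the query fails

print(len(result))  # 2
```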
