# ICS 434: DATA SCIENCE FUNDAMENTALS

## Introduction to Pandas: Series and DataFrame classes

---


### Brief Introduction to Pandas

* `Pandas` is the de facto package for working with tabular data.
  * Think of it as `Excel` on steroids (without the graphical interface and the menus).
* It supports a plethora of file formats (`Excel`, `CSV`, `TSV`, `JSON`, `hdf5`, ...)

### Brief Introduction to Pandas -- Cont'd

<img src="https://www.dropbox.com/s/vi451zvf9sw8jfr/pandas_architecture.png?dl=1" width="800">


### Pandas `Series` and `DataFrames`

* Pandas relies principally on two types of data structures:

1. `Series`: list-like objects that store data in a given order.
    * It helps to think of a `Series` as a column (or row) in `Excel`.

2. `DataFrames`: spreadsheet-like tables that contain one or more `Series`.
    * It helps to think of a `DataFrame` as a table (or worksheet) in `Excel`.



### About the Data

- The examples below use the cost of drugs prescribed to Medicare patients.

- The complete dataset is publicly available on the Centers for Medicare & Medicaid Services ([`CMS` website](https://www.cms.gov/OpenPayments/Explore-the-Data/Dataset-Downloads.html)).

- This toy dataset contains the following columns:


| Column | Description | 
|:----------|-----------|
| `unique_id`| A unique identifier for a Medicare claim to CMS |
| `doctor_id` | The unique identifier of the doctor who <br/> prescribed the medicine |
| `specialty` | The specialty of the doctor who prescribed the medicine |
| `medication` | The medication prescribed |
| `nb_beneficiaries` | The number of beneficiaries the <br/> medicine was prescribed to |
| `spending` | The total cost of the medicine prescribed|



<img src="https://www.dropbox.com/s/ha5s9t1n7yl0yx4/medicare_data.png?dl=1" alt="drawing" style="width:800px">

### `Series`


* A `Series` is conceptually similar to a Python `list`
  * An ordered collection of items

* All the items in a `Series` share the **same datatype**.
    * When the items have mixed types, `pandas` stores them with the generic `object` datatype to maintain this condition
      * This is a very common source of bugs and errors

* Each data entry in a *Series* is associated with a label.
  * Those labels are the `index` of the data.
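
The datatype rule above can be checked directly. The following is a minimal sketch; the variable names `counts` and `mixed` are made up for illustration and do not come from the dataset.

```python
import pandas as pd

# Homogeneous integers: pandas picks a numeric dtype.
counts = pd.Series([226, 103, 13])

# Mixed integers and strings: pandas falls back to the generic object dtype.
mixed = pd.Series([1234, "DIAZEPAM", 3, "$32"])

print(counts.dtype)  # an integer dtype
print(mixed.dtype)   # object

# With object dtype each element keeps its own Python type, so a column
# that looks numeric may in fact hold strings such as "$32".
print(type(mixed.iloc[1]).__name__, type(mixed.iloc[3]).__name__)
```

Checking `dtype` right after loading data is a cheap way to catch columns that were read as text rather than numbers.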



### Example of Series as Rows and Columns

* It helps to think of pandas `Series` as either a row (or a column) of an Excel worksheet



<img src="https://www.dropbox.com/s/imt11t2xnw4f2d0/possible_series.png?dl=1" alt="drawing" style="width:500px">

### Creating a pandas `Series`



```python
import pandas as pd
```

* To create a pandas `Series`, you can call the `Series` constructor and pass it a `list` of values and a `list` of labels (an `index`)
  * The `index` is optional and is automatically set to a `RangeIndex` (0, 1, 2, ...) by default

```python

>>> s = pd.Series(data=[1234, 'DIAZEPAM', 3, '$32'],
...               index=['doctor_id', 'medication', 'nb_beneficiaries', 'spending'])

>>> s
doctor_id               1234
medication          DIAZEPAM
nb_beneficiaries           3
spending                 $32
dtype: object
```


In [2]:
import pandas as pd
s =  pd.Series(data = [1234, 'DIAZEPAM', 3, '$32'], 
               index = ['doctor_id', 'medication', 'nb_beneficiaries', 'spending'])
s

doctor_id               1234
medication          DIAZEPAM
nb_beneficiaries           3
spending                 $32
dtype: object

In [3]:
import pandas as pd
s = pd.Series(data = [1234, 'DIAZEPAM', 3, '$32'])
s

0        1234
1    DIAZEPAM
2           3
3         $32
dtype: object

In [4]:
s.index

RangeIndex(start=0, stop=4, step=1)

In [5]:
s.index = ["label_1", "label_2", "label_3", "label_4"]
s.index

Index(['label_1', 'label_2', 'label_3', 'label_4'], dtype='object')

In [6]:
s.index = ["label_1", "label_2", "label_3", "label_4"]
s

label_1        1234
label_2    DIAZEPAM
label_3           3
label_4         $32
dtype: object

In [7]:
s.values

array([1234, 'DIAZEPAM', 3, '$32'], dtype=object)

In [8]:
s =  pd.Series( data= [1234, 'DIAZEPAM', 3, '$32'], 
               index= ['doctor_id', 'medication', 'nb_beneficiaries', 'spending'])
s

doctor_id               1234
medication          DIAZEPAM
nb_beneficiaries           3
spending                 $32
dtype: object

### Indexing `Series`



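* You can access the data in a `Series` by position using the same approach seen in list indexing
  * Recent `pandas` releases deprecate positional access through plain `[]` when the labels are not integers; `s.iloc[1]` and `s.iloc[-1]` are the explicit positional forms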

```python
>>> s[1]
'DIAZEPAM'

>>> s[-1]
'$32'
```

* You can also access a value in the `Series` using its label

```python
>>> s["medication"]
'DIAZEPAM'

>>> s["spending"]
'$32'
```
* Based on the above, it is fair to think of a `Series` as a hybrid between lists and dictionaries
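
The list/dictionary analogy can be sketched as follows. The example rebuilds the same `s` so that it runs on its own, and it uses `iloc` for the positional access, which is an editorial choice to keep the positional intent explicit.

```python
import pandas as pd

s = pd.Series([1234, "DIAZEPAM", 3, "$32"],
              index=["doctor_id", "medication", "nb_beneficiaries", "spending"])

# List-like behavior: access by position.
second = s.iloc[1]

# Dictionary-like behavior: access by label, test membership, iterate over pairs.
by_label = s["medication"]
has_spending = "spending" in s   # membership checks the labels, not the values
pairs = list(s.items())          # (label, value) tuples, in order

print(second, by_label, has_spending)
print(pairs[0])
```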

In [9]:
print(s[1])
print(s["medication"])

DIAZEPAM
DIAZEPAM


### Subsetting `Series`

- Subsetting a *Series* can also be done using the range (slice) operator (`:`)

```python
>>> s[0:3]
doctor_id               1234
medication          DIAZEPAM
nb_beneficiaries           3
dtype: object
```

### Indexing `Series` 

* We can also index a subset of elements from a *Series* using a list of indexes

```python
>>> s[[0,2,1]]
doctor_id               1234
nb_beneficiaries           3
medication          DIAZEPAM
dtype: object
```

- Note that the line above contains two sets of square brackets `[[ ]]`: the outer pair is the indexing operator, and the inner pair delimits the list.
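
As a hedged sketch of the same two operations written with `iloc`, which states explicitly that the integers are positions rather than labels:

```python
import pandas as pd

s = pd.Series([1234, "DIAZEPAM", 3, "$32"],
              index=["doctor_id", "medication", "nb_beneficiaries", "spending"])

first_three = s.iloc[0:3]      # slice: positions 0, 1, 2 (the stop is excluded)
reordered = s.iloc[[0, 2, 1]]  # list: entries come back in the requested order

print(list(first_three.index))
print(list(reordered.index))
```

A list of positions can repeat or reorder entries, whereas a slice always returns a contiguous run in the original order.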
    

In [10]:
s[0:3]

doctor_id               1234
medication          DIAZEPAM
nb_beneficiaries           3
dtype: object

In [11]:
s[[0,2]]

doctor_id           1234
nb_beneficiaries       3
dtype: object

### Subsetting Series -- Cont'd

* Subsetting *Series* can also be done using lists of labels.

```python
>>> s[["doctor_id", "nb_beneficiaries"]]
doctor_id           1234
nb_beneficiaries       3
dtype: object
```

* Note that the line above also contains two sets of square brackets `[[ ]]`: the outer pair is the indexing operator, and the inner pair delimits the list.
    

In [12]:
s[["doctor_id", "nb_beneficiaries"]]

doctor_id           1234
nb_beneficiaries       3
dtype: object

### `DataFrames`

* It helps to think of `DataFrames` as spreadsheet-like tables made of a set of `Series`
  * The *Series* share the same index
  
<img src="https://www.dropbox.com/s/cc6037l3lz0bc03/df_as_series.png?dl=1" alt="drawing" style="width:800px">

* The example above illustrates a collection of column `Series` as a `DataFrame`, but the analogy holds for row `Series`
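
To make the "set of `Series` sharing an index" idea concrete, here is a small sketch that assembles a `DataFrame` from two column `Series`. It uses three rows taken from the toy dataset; the variable names are illustrative.

```python
import pandas as pd

ids = ["AB789982", "AV967778", "CC128705"]

# Two Series that share the same index (the claim identifiers).
medication = pd.Series(["CLONAZEPAM", "DIAZEPAM", "NADOLOL"], index=ids)
nb_beneficiaries = pd.Series([226, 103, 13], index=ids)

# The dictionary keys become the column labels.
df = pd.DataFrame({"medication": medication,
                   "nb_beneficiaries": nb_beneficiaries})

print(df)
print(type(df["medication"]).__name__)    # a column is a Series
print(type(df.loc["AV967778"]).__name__)  # a row is a Series too
```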

### Reading a `DataFrame` from a File

* A `DataFrame` can be created by reading data from a file

  * `Pandas` supports many input formats, including `Excel`, `CSV`, `TSV`, `JSON`, `SAS`, etc.

* We can read the example comma-delimited (CSV) spending table using:
    
```python
>>> spending_df  = pd.read_csv("data/spending.csv")
```

* Jupyter beautifies `DataFrames` by printing them  as `HTML` tables.
  * Using `print` renders them as text-based output instead. 
  * Try to avoid the explicit `print` by including the name as the last statement of the cell, or use `display`.
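
If `data/spending.csv` is not at hand, `read_csv` can be tried on an in-memory file. The sketch below assumes the CSV layout implied by the column table above and copies only three rows; `csv_text` and `toy_df` are illustrative names.

```python
import io
import pandas as pd

# Three rows in the assumed layout of data/spending.csv.
# Values containing a comma are quoted; the last row has no spending value.
csv_text = """unique_id,doctor_id,specialty,medication,nb_beneficiaries,spending
AB789982,1952310666,Psychiatry,CLONAZEPAM,226,"$1,848.88"
AV967778,1952310666,Psychiatry,DIAZEPAM,103,$662.87
CC128705,1298765423,Cardiology,NADOLOL,13,
"""

# read_csv accepts a file path or any file-like object.
toy_df = pd.read_csv(io.StringIO(csv_text))

print(toy_df)
print(toy_df.columns.tolist())
```

The empty `spending` field is read as a missing value (`NaN`), and the dollar amounts stay as text because of the `$` sign and the thousands separator.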
  

In [13]:
spending_df  = pd.read_csv("data/spending.csv")

spending_df

|  | unique_id | doctor_id | specialty | medication | nb_beneficiaries | spending |
|:--|:--|:--|:--|:--|:--|:--|
| 0 | AB789982 | 1952310666 | Psychiatry | CLONAZEPAM | 226 | \$1,848.88 |
| 1 | AV967778 | 1952310666 | Psychiatry | DIAZEPAM | 103 | \$662.87 |
| 2 | CC128705 | 1298765423 | Cardiology | NADOLOL | 13 | NaN |
| 3 | GH890091 | 1346358827 | Family | HYDROCODONE | 331 | \$8,511.14 |
| 4 | YY219322 | 1548247315 | Psychiatry | ALPRAZOLAM | 28 | \$1,964.49 |
| 5 | YY190561 | 1548247315 | Psychiatry | GABAPENTIN | 86 | \$1,807.16 |
| 6 | YY572610 | 1548247315 | Psychiatry | MIRTAZAPINE | 191 | \$3,131.96 |
| 7 | PL346720 | 1326175365 | Family | OXYCODONE HCL | 87 | \$12,881.04 |
| 8 | GZ129032 | 1518970284 | Hemato-oncology | DIGOXIN | 54 | \$3,766.34 |


### Dimensions of the `DataFrame`

* A useful property of the `DataFrame` is the `shape`

  * This property describes the number of rows and the number of columns in your `DataFrame`
  
* `shape` returns a tuple
  * an immutable sequence; here it holds the number of rows followed by the number of columns

  * Tuples are delimited by `( )` rather than `[ ]`

```python
>>> spending_df.shape
(9, 6)
```

* The above indicates that the `spending_df`  `DataFrame` has 9 rows (table entries) and 6 columns
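
Because `shape` is a tuple, it can be unpacked into two variables. A minimal sketch on a small, made-up `DataFrame`:

```python
import pandas as pd

df = pd.DataFrame({
    "medication": ["CLONAZEPAM", "DIAZEPAM", "NADOLOL"],
    "nb_beneficiaries": [226, 103, 13],
})

n_rows, n_cols = df.shape  # tuple unpacking: (rows, columns)

print(n_rows, n_cols)
print(len(df))             # len() of a DataFrame is its number of rows
print(len(df.columns))     # number of columns
```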

### `DataFrame`, Indexes and Columns

* As opposed to a `Series`, which only has an index, a `DataFrame` has both an `Index` (row labels and positions) and `Columns` (column labels and positions).
  * In my opinion, `Column` is not an ideal name for it... 

* A `DataFrame` can, therefore, be indexed by row index, row label, column index, column label, or by a combination of row and column index/label.

* In the example below, since a row index was not explicitly provided, the labels for the rows are the same as the indices, i.e., 0 through 8.


<img src="https://www.dropbox.com/s/fi9elbiyaealjt3/df_index_cols.png?dl=1" alt="drawing" style="width:500px;"/>


### Index Location

- Row indexing by position can be carried out by passing the `iloc` (integer location) operator a single index or a list of indexes.

  - The `iloc` operator uses `[]` instead of the `()` that methods and functions use.


```python    
>>> spending_df.iloc[3]
```

or

```python    
>>> spending_df.iloc[[1,5]]
```

- When a single index is given, `pandas` returns a `Series`
  - Remember that a row is merely a `Series`
  
- When a list of indices is given, `pandas` returns another `DataFrame` 
  - The returned `DataFrame` is a subset of the original
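
The difference in return types can be verified with `type`. This is a small sketch on a made-up four-row table, not the full dataset.

```python
import pandas as pd

df = pd.DataFrame({
    "medication": ["CLONAZEPAM", "DIAZEPAM", "NADOLOL", "HYDROCODONE"],
    "nb_beneficiaries": [226, 103, 13, 331],
})

one_row = df.iloc[3]          # single position -> Series
two_rows = df.iloc[[1, 3]]    # list of positions -> DataFrame
still_a_frame = df.iloc[[3]]  # one-element list -> one-row DataFrame

print(type(one_row).__name__, type(two_rows).__name__, type(still_a_frame).__name__)
print(one_row["medication"])
```

Wrapping a single position in a list is a handy way to keep the result two-dimensional.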
  

In [15]:
# Returns row, or Series
spending_df.iloc[3]

unique_id              GH890091
doctor_id            1346358827
specialty                Family
medication          HYDROCODONE
nb_beneficiaries            331
spending              $8,511.14
Name: 3, dtype: object

In [16]:
# Returns two rows as a DataFrame

spending_df.iloc[[1,5]]

|  | unique_id | doctor_id | specialty | medication | nb_beneficiaries | spending |
|:--|:--|:--|:--|:--|:--|:--|
| 1 | AV967778 | 1952310666 | Psychiatry | DIAZEPAM | 103 | \$662.87 |
| 5 | YY190561 | 1548247315 | Psychiatry | GABAPENTIN | 86 | \$1,807.16 |


### Indexing by Row and Column Indexes 

* You can pass `iloc` a combination of row and column indexes using the following construct

```python
 data_frame.iloc[row_index_info, column_index_info]
```
* `row_index_info` and `column_index_info` can each be a single index, a list of indexes, or a range such as `0:3`


### Example:

```python
>>> spending_df.iloc[3, 3]
'HYDROCODONE'

>>> spending_df.iloc[2, [1,3]]
doctor_id     1298765423
medication       NADOLOL
Name: 2, dtype: object

>>> spending_df.iloc[[2,4], [1,3]]
```
|  |      doctor_id |   medication |
|:--|:-----------------|:------------|
|2 | 1298765423  | NADOLOL  |
|4 | 1548247315  | ALPRAZOLAM |


### Reading Tables with Index Labels

*  Rather than using the default integer index label, a `DataFrame` can be indexed using one or more columns of data.
  * We will cover complex indexes (more than one column) later.

*  The index may or may not be a column from the data

* Recall that we imported the file using:


```python
spending_df  = pd.read_csv("data/spending.csv")
```

<img src="https://www.dropbox.com/s/qb62v2iov9frp38/default_index.png?dl=1">

### Specifying the Index Labels

* Specify the column name when reading the table
  * E.g., using the `index_col` argument of the `read_csv()` function.
  * For example, we can use the `unique_id` column as index labels


```python
spending_df  = pd.read_csv( "data/spending.csv", index_col="unique_id")
```

<img src="https://www.dropbox.com/s/wmixxjn6dol14b6/custom_index.png?dl=1" alt="drawing" style="width:600px">
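
As a hedged aside, the same result can be reached after loading with `set_index`, which is not covered in these slides. The sketch below reads a three-row in-memory CSV (an assumed, shortened layout) both ways and compares the outcome.

```python
import io
import pandas as pd

csv_text = """unique_id,doctor_id,medication,nb_beneficiaries
AB789982,1952310666,CLONAZEPAM,226
AV967778,1952310666,DIAZEPAM,103
CC128705,1298765423,NADOLOL,13
"""

# Option 1: choose the index column while reading.
at_read = pd.read_csv(io.StringIO(csv_text), index_col="unique_id")

# Option 2: read with the default index, then promote a column to the index.
after_read = pd.read_csv(io.StringIO(csv_text)).set_index("unique_id")

print(at_read.index.name)
print(list(at_read.index) == list(after_read.index))
print("unique_id" in at_read.columns)  # the column has moved into the index
```

Either way, `unique_id` stops being a regular column, so the table has one fewer column than with the default index.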

In [17]:
spending_df  = pd.read_csv("data/spending.csv", index_col='unique_id')
spending_df

| unique_id | doctor_id | specialty | medication | nb_beneficiaries | spending |
|:--|:--|:--|:--|:--|:--|
| AB789982 | 1952310666 | Psychiatry | CLONAZEPAM | 226 | \$1,848.88 |
| AV967778 | 1952310666 | Psychiatry | DIAZEPAM | 103 | \$662.87 |
| CC128705 | 1298765423 | Cardiology | NADOLOL | 13 | NaN |
| GH890091 | 1346358827 | Family | HYDROCODONE | 331 | \$8,511.14 |
| YY219322 | 1548247315 | Psychiatry | ALPRAZOLAM | 28 | \$1,964.49 |
| YY190561 | 1548247315 | Psychiatry | GABAPENTIN | 86 | \$1,807.16 |
| YY572610 | 1548247315 | Psychiatry | MIRTAZAPINE | 191 | \$3,131.96 |
| PL346720 | 1326175365 | Family | OXYCODONE HCL | 87 | \$12,881.04 |
| GZ129032 | 1518970284 | Hemato-oncology | DIGOXIN | 54 | \$3,766.34 |


In [18]:
spending_df_two_idx  = pd.read_csv("data/spending.csv", index_col=['unique_id', "doctor_id"])
spending_df_two_idx

| unique_id | doctor_id | specialty | medication | nb_beneficiaries | spending |
|:--|:--|:--|:--|:--|:--|
| AB789982 | 1952310666 | Psychiatry | CLONAZEPAM | 226 | \$1,848.88 |
| AV967778 | 1952310666 | Psychiatry | DIAZEPAM | 103 | \$662.87 |
| CC128705 | 1298765423 | Cardiology | NADOLOL | 13 | NaN |
| GH890091 | 1346358827 | Family | HYDROCODONE | 331 | \$8,511.14 |
| YY219322 | 1548247315 | Psychiatry | ALPRAZOLAM | 28 | \$1,964.49 |
| YY190561 | 1548247315 | Psychiatry | GABAPENTIN | 86 | \$1,807.16 |
| YY572610 | 1548247315 | Psychiatry | MIRTAZAPINE | 191 | \$3,131.96 |
| PL346720 | 1326175365 | Family | OXYCODONE HCL | 87 | \$12,881.04 |
| GZ129032 | 1518970284 | Hemato-oncology | DIGOXIN | 54 | \$3,766.34 |


### Inspecting DataFrames

* When a `DataFrame` contains a large number of rows, it's common to use the `head` or `tail` method to display the first or the last five rows, respectively.

  * `head(n=5)` shows the top n (5 by default) rows of the `DataFrame`
  * `tail(n=5)` shows the bottom n (5 by default) rows of the `DataFrame`
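
A short sketch of passing `n` explicitly, using only the `nb_beneficiaries` values from the toy dataset:

```python
import pandas as pd

df = pd.DataFrame({"nb_beneficiaries": [226, 103, 13, 331, 28, 86, 191, 87, 54]})

print(df.head(3))  # first three rows
print(df.tail(2))  # last two rows

# Without an argument, both methods return five rows.
print(len(df.head()), len(df.tail()))
```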

In [19]:
spending_df.head()

| unique_id | doctor_id | specialty | medication | nb_beneficiaries | spending |
|:--|:--|:--|:--|:--|:--|
| AB789982 | 1952310666 | Psychiatry | CLONAZEPAM | 226 | \$1,848.88 |
| AV967778 | 1952310666 | Psychiatry | DIAZEPAM | 103 | \$662.87 |
| CC128705 | 1298765423 | Cardiology | NADOLOL | 13 | NaN |
| GH890091 | 1346358827 | Family | HYDROCODONE | 331 | \$8,511.14 |
| YY219322 | 1548247315 | Psychiatry | ALPRAZOLAM | 28 | \$1,964.49 |


In [20]:
spending_df.tail()

| unique_id | doctor_id | specialty | medication | nb_beneficiaries | spending |
|:--|:--|:--|:--|:--|:--|
| YY219322 | 1548247315 | Psychiatry | ALPRAZOLAM | 28 | \$1,964.49 |
| YY190561 | 1548247315 | Psychiatry | GABAPENTIN | 86 | \$1,807.16 |
| YY572610 | 1548247315 | Psychiatry | MIRTAZAPINE | 191 | \$3,131.96 |
| PL346720 | 1326175365 | Family | OXYCODONE HCL | 87 | \$12,881.04 |
| GZ129032 | 1518970284 | Hemato-oncology | DIGOXIN | 54 | \$3,766.34 |


### Indexing `DataFrame` Columns Using Labels

* Accessing data in the `DataFrame` can also be done using row (index) labels or column labels


* Indexing columns takes a column's label or a list of column labels

```python
spending_df["nb_beneficiaries"]
```
or 
```python
spending_df[["doctor_id", "nb_beneficiaries"]]
```



### Indexing `DataFrame` Columns Using Labels -- Cont'd

* When a single label is given, `pandas` returns a `Series`
  * Remember that a column is simply a `Series`

* When a list of labels is given, `pandas` returns another `DataFrame` 
  * The returned `DataFrame` is a subset of the original
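
The two return types can be checked with a small, made-up three-row table:

```python
import pandas as pd

df = pd.DataFrame({
    "doctor_id": [1952310666, 1952310666, 1298765423],
    "medication": ["CLONAZEPAM", "DIAZEPAM", "NADOLOL"],
    "nb_beneficiaries": [226, 103, 13],
})

col = df["nb_beneficiaries"]                 # single label -> Series
sub = df[["doctor_id", "nb_beneficiaries"]]  # list of labels -> DataFrame
single = df[["nb_beneficiaries"]]            # one-element list -> one-column DataFrame

print(type(col).__name__, type(sub).__name__, type(single).__name__)
print(sub.columns.tolist())
```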

In [21]:
spending_df["nb_beneficiaries"]

unique_id
AB789982    226
AV967778    103
CC128705     13
GH890091    331
YY219322     28
YY190561     86
YY572610    191
PL346720     87
GZ129032     54
Name: nb_beneficiaries, dtype: int64

In [22]:
spending_df[["doctor_id", "nb_beneficiaries"]]

| unique_id | doctor_id | nb_beneficiaries |
|:--|:--|:--|
| AB789982 | 1952310666 | 226 |
| AV967778 | 1952310666 | 103 |
| CC128705 | 1298765423 | 13 |
| GH890091 | 1346358827 | 331 |
| YY219322 | 1548247315 | 28 |
| YY190561 | 1548247315 | 86 |
| YY572610 | 1548247315 | 191 |
| PL346720 | 1326175365 | 87 |
| GZ129032 | 1518970284 | 54 |


### Indexing `DataFrame` Rows Using Labels

* Row indexing using labels can be carried out using the `loc` (location) operator.
  * Similar to `iloc`, the `loc` operator uses `[]` instead of the `()` that methods and functions use.

*  `loc` can take a single label or a list of labels.

```python    
spending_df.loc["AV967778"]
```

or

```python    
spending_df.loc[["AV967778","YY219322"]]
```


### Indexing `DataFrame` Rows Using Labels -- Cont'd

* When a single label is given, `pandas` returns a `Series`.
  - Remember that a row is simply a `Series`.
  
* When a list of labels is given, `pandas` returns another `DataFrame`.
  - The returned `DataFrame` is a subset of the original.
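
A minimal sketch of `loc` on a made-up three-row table indexed by `unique_id`. The label `"ZZ000000"` is hypothetical and is only there to show what happens when a label does not exist.

```python
import pandas as pd

df = pd.DataFrame(
    {"medication": ["CLONAZEPAM", "DIAZEPAM", "NADOLOL"],
     "nb_beneficiaries": [226, 103, 13]},
    index=pd.Index(["AB789982", "AV967778", "CC128705"], name="unique_id"),
)

row = df.loc["AV967778"]                 # single label -> Series
rows = df.loc[["AV967778", "CC128705"]]  # list of labels -> DataFrame

# Asking for a label that is not in the index raises a KeyError.
try:
    df.loc["ZZ000000"]
    missing = False
except KeyError:
    missing = True

print(type(row).__name__, type(rows).__name__, missing)
```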

In [23]:
spending_df.loc["AV967778"]

doctor_id           1952310666
specialty           Psychiatry
medication            DIAZEPAM
nb_beneficiaries           103
spending               $662.87
Name: AV967778, dtype: object

In [24]:
spending_df.loc[["AV967778","YY219322"]]

| unique_id | doctor_id | specialty | medication | nb_beneficiaries | spending |
|:--|:--|:--|:--|:--|:--|
| AV967778 | 1952310666 | Psychiatry | DIAZEPAM | 103 | \$662.87 |
| YY219322 | 1548247315 | Psychiatry | ALPRAZOLAM | 28 | \$1,964.49 |


### Subsetting by Range on Labels

* Although not useful or intuitive in this case, the range operator also works with the index and column labels.


* Therefore, the following lines are all valid.

```python    
spending_df.loc["AB789982", "specialty":"spending"]    

spending_df.loc["AB789982":"GZ129032", "specialty"]    

spending_df.loc["AB789982":"GZ129032", "specialty":"nb_beneficiaries"]    
```

* As opposed to ranges in lists, the upper limit of a label range is included in the result.
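
The inclusive upper limit is easy to miss, so here is a sketch that contrasts a label range in `loc` with a positional range in `iloc` on a made-up four-row table:

```python
import pandas as pd

df = pd.DataFrame(
    {"medication": ["CLONAZEPAM", "DIAZEPAM", "NADOLOL", "HYDROCODONE"],
     "nb_beneficiaries": [226, 103, 13, 331]},
    index=["AB789982", "AV967778", "CC128705", "GH890091"],
)

by_label = df.loc["AB789982":"CC128705"]  # label range: the end label is included
by_position = df.iloc[0:2]                # positional range: the end is excluded

print(list(by_label.index))
print(list(by_position.index))
```

The label range returns three rows while the positional range `0:2` returns two, even though both start at the first row.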


In [25]:
spending_df.loc["AB789982":"GZ129032", "specialty":"nb_beneficiaries"]

| unique_id | specialty | medication | nb_beneficiaries |
|:--|:--|:--|:--|
| AB789982 | Psychiatry | CLONAZEPAM | 226 |
| AV967778 | Psychiatry | DIAZEPAM | 103 |
| CC128705 | Cardiology | NADOLOL | 13 |
| GH890091 | Family | HYDROCODONE | 331 |
| YY219322 | Psychiatry | ALPRAZOLAM | 28 |
| YY190561 | Psychiatry | GABAPENTIN | 86 |
| YY572610 | Psychiatry | MIRTAZAPINE | 191 |
| PL346720 | Family | OXYCODONE HCL | 87 |
| GZ129032 | Hemato-oncology | DIGOXIN | 54 |



### Indexing Review

| Syntax                |   Meaning    |
|:--------------------------|:-------------------|
| `DataFrame["col_name"]` | Returns the `Series` for column "col_name" |
| `DataFrame[["col_name_1", "col_name_2"]]` | Returns a `DataFrame` with columns "col_name_1" and "col_name_2" |
| `DataFrame.loc["label"]` | Returns the row labeled "label" |
| `DataFrame.loc["label", ["col_3", "col_5"]]` | Returns the row labeled "label", restricted to columns "col_3" and "col_5" |
| `DataFrame.loc[["label_1", "label_2"], ["col_3", "col_5"]]` | Returns the rows labeled "label_1" and "label_2", restricted to columns "col_3" and "col_5" |
| `DataFrame.iloc[23, [0, 1]]` | Returns the row at position 23, restricted to the columns at positions 0 and 1 |
| `DataFrame.iloc[[1, 2], [0, 1]]` | Returns the rows at positions 1 and 2, restricted to the columns at positions 0 and 1 |
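
The row-and-column forms in the table can be tried on a small example. This is a sketch on a made-up three-row table; the comments state the type each expression returns.

```python
import pandas as pd

df = pd.DataFrame(
    {"specialty": ["Psychiatry", "Psychiatry", "Cardiology"],
     "medication": ["CLONAZEPAM", "DIAZEPAM", "NADOLOL"],
     "nb_beneficiaries": [226, 103, 13]},
    index=["AB789982", "AV967778", "CC128705"],
)

cell = df.loc["AV967778", "medication"]                           # scalar
row_part = df.loc["AV967778", ["medication", "nb_beneficiaries"]] # Series
block = df.loc[["AB789982", "CC128705"], ["specialty", "medication"]]  # DataFrame
by_position = df.iloc[[1, 2], [0, 1]]                             # DataFrame

print(cell)
print(type(row_part).__name__, type(block).__name__, type(by_position).__name__)
print(by_position.columns.tolist())
```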
