# 3. Loading and Handling Pandas Data

## Overview

### Questions

* How are Pandas data structures set up?
* How do you load data into Pandas?
* How do you write data from Pandas to a file?

### Objectives

* Understand the usefulness of Pandas when loading data.
* Understand how to load data and deal with common issues.

## Pandas Data Structures

* The two primary data structures of Pandas:
  * Series: an array-like or list-like collection of values
    * Similar to a single row or a single column in Excel.
  * DataFrame: a table-like structure consisting of a collection of rows and columns
 <img src="https://change-hi.github.io/morea/data-wrangling/fig/E3_1_series_vs_dataframe.png" width="600">
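
As a rough illustration of the two structures, the sketch below builds a `Series` and a `DataFrame` from scratch. The values and the labels (`city`, `temperature`) are made up for this example.

```python
import pandas as pd

# A Series: a one-dimensional, labeled collection of values.
temperature = pd.Series([21.5, 23.0, 19.8], name="temperature")

# A DataFrame: a table in which each column is a Series.
df = pd.DataFrame({
    "city": ["Hilo", "Kona", "Lihue"],
    "temperature": [21.5, 23.0, 19.8],
})

# Selecting a single column of a DataFrame returns a Series.
column = df["temperature"]
print(type(column).__name__)  # Series
print(df.shape)               # (3, 2): 3 rows, 2 columns
```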


### Pandas Data Types

* Series and DataFrames have many functions that facilitate data analysis 
  * Filter or impute missing values in a Series (column)
  * Select rows from a Series or DataFrame using conditional operations
  * Convert DataFrames across formats
  * etc.

* Pandas attempts to assign the correct type to each column based on the type of data it contains. 
  * You can override the data type assigned by Pandas


![](https://www.dropbox.com/s/ir2x9gccdgeomq7/pandas_data_types.png?dl=1)
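
The sketch below shows one way type inference and a type override might look in practice. The data is hypothetical and is read from an in-memory string with `io.StringIO`, so no file is needed.

```python
import io
import pandas as pd

# Hypothetical data: zip codes look like numbers but are really labels.
csv_text = "zip_code,population\n96720,45000\n96740,12000\n"

# Let Pandas infer the types: both columns are read as integers.
inferred = pd.read_csv(io.StringIO(csv_text))
print(inferred.dtypes)

# Override the inferred type of one column while loading...
overridden = pd.read_csv(io.StringIO(csv_text), dtype={"zip_code": str})
print(overridden.dtypes)

# ...or convert a column after loading.
inferred["population"] = inferred["population"].astype(float)
print(inferred.dtypes)
```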


### About File Formats

* There are dozens, if not hundreds, of file formats.
  * Some, such as Excel's format, are binary and are not meant to be read by a human
    * Typically, we use custom formatting to delineate columns and rows.
  * Others are plain text and can be opened and edited in any text editor.

* Plain text file formats fall into one of two categories: delimited and fixed width
  * Delimited files are organized such that columns and rows are separated by a special character or tag called a delimiter
  * Fixed-width files are those where each entry in a column has a fixed number of characters
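
To make the two categories concrete, here is a small sketch that reads the same made-up table from a delimited layout and from a fixed-width layout. The column positions passed to `colspecs` are specific to this example text.

```python
import io
import pandas as pd

# The same made-up table in two plain-text layouts.
delimited_text = "name,age\nAna,34\nBo,7\n"
fixed_width_text = (
    "name  age\n"
    "Ana    34\n"
    "Bo      7\n"
)

# Delimited: a comma marks the boundary between columns.
delimited = pd.read_csv(io.StringIO(delimited_text))

# Fixed width: each column occupies a fixed range of character positions.
# colspecs lists the (start, end) positions of each column.
fixed = pd.read_fwf(io.StringIO(fixed_width_text), colspecs=[(0, 6), (6, 9)])

print(delimited)
print(fixed)
```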

 

### About File Formats - Cont'd

* Plain text formats are the most popular.

* In this workshop, we will use the ‘Comma Separated Values’ (`.csv`) format
  * Entries are delimited by commas and rows are delimited by a new line
  * The file may contain a header with column labels
  * Rows may have an index
  
```bash
,column1,column2,column3
row1,a,b,c
row2,d,e,f
row3,g,h,i
```

* The file above is in CSV (comma-delimited) format. Its header has an empty first field, which suggests that the first column holds the row index.


## Loading and Parsing Data

The following CSV data is stored in a file called `my_data.csv`:

```bash
,column1,column2,column3
row1,a,b,c
row2,d,e,f
row3,g,h,i
```


* The table looks as follows:


```bash
|      | column1  | column2  | column3  |
| ---- | -------- | -------- | -------- |
| row1 | a        | b        | c        |
| row2 | d        | e        | f        |
| row3 | g        | h        | i        |

```
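
As a sketch of how this table could be parsed, the example below inlines the three-column CSV text with `io.StringIO` (an assumption made so the example does not depend on a file on disk) and uses the first column as the row labels.

```python
import io
import pandas as pd

# The CSV text, inlined as a string so the sketch is self-contained.
csv_text = (
    ",column1,column2,column3\n"
    "row1,a,b,c\n"
    "row2,d,e,f\n"
    "row3,g,h,i\n"
)

# index_col=0 tells Pandas to use the first (unnamed) column as row labels.
df = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(df)
print(df.loc["row2", "column3"])  # look up a cell by row and column label
```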



## Import the Pandas Package

* `Series` and `DataFrame` objects can be created from scratch or by loading data from a file.
  * Pandas supports a variety of file formats, such as comma-delimited (CSV), tab-delimited (TSV), Excel, etc.

* Each format is loaded with a dedicated function. For example:
  * Read a CSV file using `read_csv`
  * Read an Excel file using `read_excel`
  * Read a JSON file using `read_json`
  * etc.
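
A minimal sketch of two of these readers, applied to the same made-up table stored as CSV text and as JSON text. Both inputs are wrapped in `io.StringIO` so that no files are required. `read_excel` is left out because it needs an actual Excel file and an additional Excel engine to be installed.

```python
import io
import pandas as pd

# The same two-row table (made up for illustration) in two formats.
csv_text = "name,score\nAna,90\nBo,85\n"
json_text = '[{"name": "Ana", "score": 90}, {"name": "Bo", "score": 85}]'

# Each format has its own reader; both return a DataFrame.
from_csv = pd.read_csv(io.StringIO(csv_text))
from_json = pd.read_json(io.StringIO(json_text))

print(from_csv)
print(from_json)
```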


## Loading and Parsing Data

* Before loading the data we need to import the pandas package.
  * Pandas is typically imported using the `pd` alias
  
```python
import pandas as pd
```
  
* To read a csv file into a variable called `df` we can write:

```python
df = pd.read_csv('data/some_data.csv')

```
* In a notebook, typing the variable name `df` on the last line of a cell displays the table in a user-friendly format.

```python
df
```

![](https://change-hi.github.io/morea/data-wrangling/fig/E3_2_loaded_dataframe.png)





## Loading and Parsing Data

* Each of the Pandas functions for reading a text file provides many parameters to customize how the file is read

* Customize the field delimiter or separator with `sep` (comma by default)

 
```python
df = pd.read_csv('data/weired_format.csv', sep='|')
```
* or 
```python
df = pd.read_csv('data/tsv_example.tsv', sep='\t')
```


* Read in only a small subset of rows with `nrows`
```python
df = pd.read_csv('data/tsv_example.tsv', sep='\t', nrows=5)
```
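
The effect of `sep` and `nrows` can be sketched on a small in-memory example. The pipe-delimited text below is made up and stands in for a file on disk.

```python
import io
import pandas as pd

# Made-up pipe-delimited data, inlined so no file is needed.
pipe_text = "id|fruit\n1|apple\n2|banana\n3|cherry\n4|date\n"

# With the default separator (a comma), each line is read as a single column.
wrong = pd.read_csv(io.StringIO(pipe_text))
print(wrong.shape)  # (4, 1)

# Passing sep="|" splits the columns correctly.
right = pd.read_csv(io.StringIO(pipe_text), sep="|")
print(right.shape)  # (4, 2)

# nrows limits how many data rows are read (the header is not counted).
preview = pd.read_csv(io.StringIO(pipe_text), sep="|", nrows=2)
print(preview)
```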




## Headers and Indexes

* Headers (column labels) and index (row labels) are very useful for indexing into the data
* By default, `read_csv` assumes that: 
    * The first row is the table header
    * Rows are indexed using integer values from 0 to n-1, where n is the number of rows in the data.

* You can change the `read_csv` behaviour to omit the header or rename the columns 
* You can change the `read_csv` behaviour to specify which column to use as the index.
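
A compact, self-contained sketch of these options, using made-up comma-delimited text without a header row. The column labels (`label`, `first`, `second`) are hypothetical.

```python
import io
import pandas as pd

# Made-up data with no header row.
text = "row1,a,b\nrow2,c,d\n"

# Default: the first line is consumed as the header.
default = pd.read_csv(io.StringIO(text))
print(list(default.columns))  # ['row1', 'a', 'b']

# header=None keeps every line as data and numbers the columns.
no_header = pd.read_csv(io.StringIO(text), header=None)
print(no_header.columns.tolist())  # [0, 1, 2]

# names supplies the column labels; index_col picks the row labels.
labeled = pd.read_csv(
    io.StringIO(text),
    names=["label", "first", "second"],
    index_col="label",
)
print(labeled)
```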



* When the file has a header row, the blank first header field is given the name `Unnamed: 0`. The first lines of `data/weired_format.csv`:

```bash
|column1|column2|column3
row1|a|b|c
row2|d|e|f
row3|g|h|i
```

```python
import pandas as pd

df = pd.read_csv("data/weired_format.csv", sep="|")
df
```

|     | Unnamed: 0 | column1 | column2 | column3 |
| --- | ---------- | ------- | ------- | ------- |
| 0   | row1       | a       | b       | c       |
| 1   | row2       | d       | e       | f       |
| 2   | row3       | g       | h       | i       |
| 3   | row4       | j       | k       | l       |
| 4   | row5       | m       | n       | o       |

* When the file has no header row, the first data row is mistaken for the header. The first lines of `data/weired_format_no_header.csv`:

```bash
row1|a|b|c
row2|d|e|f
row3|g|h|i
```

```python
df = pd.read_csv("data/weired_format_no_header.csv",
                 sep="|")
df
```

|     | row1 | a   | b   | c   |
| --- | ---- | --- | --- | --- |
| 0   | row2 | d   | e   | f   |
| 1   | row3 | g   | h   | i   |
| 2   | row4 | j   | k   | l   |
| 3   | row5 | m   | n   | o   |

* `header=None` tells `read_csv` that there is no header row; the columns are numbered instead:

```python
df = pd.read_csv("data/weired_format_no_header.csv",
                 sep="|", header=None)
df
```

|     | 0    | 1   | 2   | 3   |
| --- | ---- | --- | --- | --- |
| 0   | row1 | a   | b   | c   |
| 1   | row2 | d   | e   | f   |
| 2   | row3 | g   | h   | i   |
| 3   | row4 | j   | k   | l   |
| 4   | row5 | m   | n   | o   |

* `nrows` can be combined with the other parameters:

```python
df = pd.read_csv("data/weired_format_no_header.csv",
                 sep="|", header=None, nrows=2)
df
```

|     | 0    | 1   | 2   | 3   |
| --- | ---- | --- | --- | --- |
| 0   | row1 | a   | b   | c   |
| 1   | row2 | d   | e   | f   |

* `names` assigns the column labels. Here only three names are given for four columns, and the first column becomes the index:

```python
df = pd.read_csv("data/weired_format_no_header.csv",
                 sep="|",
                 names=["COL_ONE", "COL_TWO", "COL_THREE"])
df
```

|      | COL_ONE | COL_TWO | COL_THREE |
| ---- | ------- | ------- | --------- |
| row1 | a       | b       | c         |
| row2 | d       | e       | f         |
| row3 | g       | h       | i         |
| row4 | j       | k       | l         |
| row5 | m       | n       | o         |

* `index_col=0` states explicitly that the first column is the index; the result is the same:

```python
df = pd.read_csv("data/weired_format_no_header.csv",
                 sep="|",
                 names=["COL_ONE", "COL_TWO", "COL_THREE"],
                 index_col=0)
df
```

|      | COL_ONE | COL_TWO | COL_THREE |
| ---- | ------- | ------- | --------- |
| row1 | a       | b       | c         |
| row2 | d       | e       | f         |
| row3 | g       | h       | i         |
| row4 | j       | k       | l         |
| row5 | m       | n       | o         |


### Missing Values

* There are often missing values in real-world datasets.
  * E.g. `NA`, `N.A.`, `9999`, `missing`, `''`, etc.
* Some functions depend on properly identifying missing values.
  * What is the average of [1, 2, 3, 'UNKNOWN']?
* Pandas represents missing values as `NaN` (Not a Number)
  * It provides ways to handle missing values in computations.
* `read_csv` can take as a parameter (`na_values`) the value used to represent missing values. For example,

```python
df = pd.read_csv("data/null_values_example.csv", na_values='Null')
df
```



### Without `na_values='Null'`
    
![](https://change-hi.github.io/morea/data-wrangling/fig/E3_5_null_values.png)


### With `na_values='Null'`:
    
![](https://change-hi.github.io/morea/data-wrangling/fig/E3_6_nan_values.png)    
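
The difference can also be sketched without the example file. The measurements below are made up, with `Null` marking a missing value, and are read from an in-memory string.

```python
import io
import pandas as pd

# Made-up measurements where "Null" marks a missing value.
text = "sample,weight\nA,1.0\nB,Null\nC,3.0\n"

# Without na_values, "Null" is kept as text, so the column is not numeric.
raw = pd.read_csv(io.StringIO(text))
print(raw["weight"].tolist())  # ['1.0', 'Null', '3.0']

# With na_values, "Null" becomes NaN and the column is numeric.
clean = pd.read_csv(io.StringIO(text), na_values="Null")
print(clean["weight"].isna().sum())  # 1 missing value
print(clean["weight"].mean())        # 2.0 (NaN is skipped by default)
```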
    

### Writing Data in Text Format

Pandas DataFrames have a collection of `to_<filetype>` methods used to write data to disk.

* For example, `to_csv()` takes a file path and will either create a new file or overwrite an existing file with the same name.
```python
df.to_csv('data/new_file.csv')
```

* `to_csv()` has a number of optional parameters to change the delimiter, control whether the automatically generated numerical index is written, write only certain columns, etc.
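
A short sketch of some of these options on a made-up table. It relies on the fact that `to_csv()` returns the CSV text as a string when no path is given, which makes the output easy to inspect without writing a file.

```python
import pandas as pd

# A small made-up table.
df = pd.DataFrame({"name": ["Ana", "Bo"], "score": [90, 85], "note": ["x", "y"]})

# With no path, to_csv returns the text instead of writing a file.
# By default the automatically generated index (0, 1) is included.
default_text = df.to_csv()
print(default_text)

# index=False drops the index, sep changes the delimiter,
# and columns selects which columns are written.
custom_text = df.to_csv(index=False, sep="\t", columns=["name", "score"])
print(custom_text)
```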



## Key Points

* Pandas contains numerous functions and methods to read data from, and write data to, files of different types.
* `read_csv` is highly customizable and can handle many common issues that arise when loading data.

## 1 - Exercise: Read an Excel File

Try it yourself! Fill in the blanks to load the first 10 rows of the Excel file `'20_sales_records.xlsx'` into a variable called `df` and then display the `DataFrame`.
* Instructions
  * The file is located in the `data` folder.
  * Use the `read_excel` function along with the argument you learned for reading a specified number of rows.
  * This file has missing values that are not automatically detected as `NaN`. They are labeled as `'none'`. Have Pandas interpret these as `NaN` values when loading the dataset.
  * Display the results.