# Introduction to Flat Files

## Flat Files

- Simple
- Data stored as plain text (no formatting)
- One row per line
- Values for different fields are separated by a delimiter
- Most common flat file type: comma-separated values (CSV)
- pandas function to load all kinds of flat files: <code>read_csv()</code>

In [1]:
import pandas as pd

In [2]:
file = 'data/vt_tax_data_2016.csv'

tax_data = pd.read_csv(file)

tax_data.head()



Unnamed: 0,STATEFIPS,STATE,zipcode,agi_stub,N1,mars1,MARS2,MARS4,PREP,N2,...,N10300,A10300,N85530,A85530,N85300,A85300,N11901,A11901,N11902,A11902
0,50,VT,0,1,111580,85090,14170,10740,45360,130630,...,53660,50699,0,0,0,0,10820,9734,88260,138337
1,50,VT,0,2,82760,51960,18820,11310,35600,132950,...,74340,221146,0,0,0,0,12820,20029,68760,151729
2,50,VT,0,3,46270,19540,22650,3620,24140,91870,...,44860,266097,0,0,0,0,10810,24499,34600,90583
3,50,VT,0,4,30070,5830,22190,960,16060,71610,...,29580,264678,0,0,0,0,7320,21573,21300,67045
4,50,VT,0,5,39530,3900,33800,590,22500,103710,...,39170,731963,40,24,0,0,12500,67761,23320,103034


<p>For files that use a delimiter other than a comma, specify the delimiter with the <code>sep</code> keyword argument.</p>
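As a minimal sketch of the <code>sep</code> argument, the cell below reads tab-separated text. The data is hypothetical and held in memory with `io.StringIO` so the example does not depend on a file on disk.

```python
import io
import pandas as pd

# Hypothetical tab-separated data, kept in memory so the example is self-contained
tsv_text = "STATE\tzipcode\tN1\nVT\t5001\t1200\nVT\t5401\t3400\n"

# sep tells read_csv which character separates the fields
tsv_data = pd.read_csv(io.StringIO(tsv_text), sep="\t")

print(tsv_data.shape)
print(list(tsv_data.columns))
```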

## Modifying flat file imports

### Limiting columns

- Choose columns to load with the <code>usecols</code> keyword argument
- The <code>usecols</code> keyword argument accepts a list of column numbers or names, or a function to filter column names

Limiting datasets to only variables of interest makes them more manageable and streamlines pipelines, but make sure you aren't losing confounding data in the process.

In [3]:
# name of the columns we need
col_names = ['STATEFIPS', 'STATE', 'zipcode', 'agi_stub', 'N1']

tax_data = pd.read_csv(file, usecols=col_names)

tax_data.head()


Unnamed: 0,STATEFIPS,STATE,zipcode,agi_stub,N1
0,50,VT,0,1,111580
1,50,VT,0,2,82760
2,50,VT,0,3,46270
3,50,VT,0,4,30070
4,50,VT,0,5,39530


In [4]:
# number of the columns we need
col_numbers = [0,1,2,3,4]

# read the file with just the columns we need
tax_data_cols = pd.read_csv(file, usecols=col_numbers)

tax_data_cols.head()


Unnamed: 0,STATEFIPS,STATE,zipcode,agi_stub,N1
0,50,VT,0,1,111580
1,50,VT,0,2,82760
2,50,VT,0,3,46270
3,50,VT,0,4,30070
4,50,VT,0,5,39530


In [5]:
# verify that the result is the same as the previous one
print(tax_data_cols.equals(tax_data))

True
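
The third form of <code>usecols</code>, a function, is not shown above. A minimal sketch, assuming a small in-memory sample modeled on a few columns of the tax data (the sample itself is only for illustration):

```python
import io
import pandas as pd

# Small illustrative sample with a handful of the dataset's column names
csv_text = (
    "STATEFIPS,STATE,zipcode,agi_stub,N1,A10300,N10300\n"
    "50,VT,0,1,111580,50699,53660\n"
    "50,VT,0,2,82760,221146,74340\n"
)

# The function receives each column name and returns True for columns to keep;
# here it keeps every column whose name starts with "N"
n_cols = pd.read_csv(io.StringIO(csv_text), usecols=lambda name: name.startswith("N"))

print(list(n_cols.columns))
```

A function is handy when the wanted columns follow a naming pattern and listing them by hand would be tedious.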


### Limiting rows

- Limit the number of rows loaded with the <code>nrows</code> keyword argument

In [6]:
first1000_rows = pd.read_csv(file, nrows=1000)

first1000_rows.shape

(1000, 147)

- The arguments <code>nrows</code> and <code>skiprows</code> can be used together to process a file in chunks
- <code>skiprows</code> accepts a list of row numbers, a number of rows, or a function to filter rows
- Set <code>header=None</code> so pandas knows there are no column headers

In [7]:
next1000_rows = pd.read_csv(file, nrows=1000, skiprows=1000, header=None) 
next1000_rows.head(n=1)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,137,138,139,140,141,142,143,144,145,146
0,50,VT,5730,4,20,0,0,0,40,40,...,0,0,0,0,0,0,0,0,0,0
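
One detail worth checking when chunking this way: an integer <code>skiprows</code> counts lines from the top of the file, and the header line is one of them. The sketch below uses a tiny hypothetical file to show the effect, and one possible way to start exactly where the previous chunk ended by passing a list of line numbers instead.

```python
import io
import pandas as pd

# Hypothetical file: one header line followed by six data rows (id 0..5)
csv_text = "id,value\n" + "".join(f"{i},{i * 10}\n" for i in range(6))

first3 = pd.read_csv(io.StringIO(csv_text), nrows=3)

# skiprows=3 skips the header line plus two data rows, so the row with id 2 is read again
overlap = pd.read_csv(io.StringIO(csv_text), nrows=3, skiprows=3, header=None)

# Skipping lines 1..3 keeps the header (line 0) and starts at the fourth data row
next3 = pd.read_csv(io.StringIO(csv_text), nrows=3, skiprows=[1, 2, 3])

print(first3["id"].tolist())
print(overlap[0].tolist())
print(next3["id"].tolist())
```

With the list form the header stays in place, so <code>header=None</code> is not needed for the later chunks.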


### Assigning column names

- Supply column names by passing a list to the <code>names</code> argument
- The list <b>MUST</b> have a name for every column in the dataset
- To rename just a few columns, do it after the import

In [8]:
# get the column names from the previous chunk of the dataset
col_names = list(first1000_rows)

# the variable name now matches the 1000 rows that are actually read
next1000_rows = pd.read_csv(file, nrows=1000, skiprows=1000, header=None, names=col_names)

next1000_rows.head(1)

Unnamed: 0,STATEFIPS,STATE,zipcode,agi_stub,N1,mars1,MARS2,MARS4,PREP,N2,...,N10300,A10300,N85530,A85530,N85300,A85300,N11901,A11901,N11902,A11902
0,50,VT,5730,4,20,0,0,0,40,40,...,0,0,0,0,0,0,0,0,0,0
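
Renaming just a few columns after the import can be sketched with `DataFrame.rename`. The new names `num_returns` and `income_group` are hypothetical, chosen only for illustration, and the sample is a small in-memory copy of the first columns.

```python
import io
import pandas as pd

# Small illustrative sample with the first five columns of the dataset
csv_text = "STATEFIPS,STATE,zipcode,agi_stub,N1\n50,VT,0,1,111580\n50,VT,0,2,82760\n"

sample = pd.read_csv(io.StringIO(csv_text))

# Only the columns listed in the mapping are renamed; the rest keep their names
renamed = sample.rename(columns={"N1": "num_returns", "agi_stub": "income_group"})

print(list(renamed.columns))
```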


## Handling errors and missing data

### Common issues

- Wrong data types
- Missing values 
- Records that can't be read by pandas

In [9]:
tax_data.dtypes

STATEFIPS     int64
STATE        object
zipcode       int64
agi_stub      int64
N1            int64
dtype: object

pandas automatically infers data types, but the inferred type is not always the one you need.

For example, in the dataset above pandas infers that `zipcode` is an integer, but zip codes are identifiers rather than quantities, so they are better stored as strings.

### Specifying data types

- `dtype` is the keyword argument for converting columns to a specific data type on import
- `dtype` takes a dictionary of column names and data types

In [10]:
convert_data_type = pd.read_csv(file, dtype={"zipcode": str})

convert_data_type.dtypes

STATEFIPS     int64
STATE        object
zipcode      object
agi_stub      int64
N1            int64
              ...  
A85300        int64
N11901        int64
A11901        int64
N11902        int64
A11902        int64
Length: 147, dtype: object
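
One practical reason to read zip codes as strings is that integer parsing drops leading zeros. A minimal sketch with hypothetical zero-padded values (the file used above may not store them this way):

```python
import io
import pandas as pd

# Hypothetical rows with zero-padded zip codes
csv_text = "STATE,zipcode,N1\nVT,05001,1200\nVT,05401,3400\n"

as_int = pd.read_csv(io.StringIO(csv_text))
as_str = pd.read_csv(io.StringIO(csv_text), dtype={"zipcode": str})

print(as_int["zipcode"].tolist())  # parsed as integers, leading zeros are lost
print(as_str["zipcode"].tolist())  # kept as text, leading zeros survive
```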

### Customize missing values

- Pandas automatically infers some values as missing
- `na_values` is the argument to set custom missing values
- It accepts a single value, a list of values, or a dictionary that maps column names to values

In [11]:
missing_values_custom = pd.read_csv(file, na_values={"zipcode": 0})

missing_values_custom[missing_values_custom['zipcode'].isna()]

Unnamed: 0,STATEFIPS,STATE,zipcode,agi_stub,N1,mars1,MARS2,MARS4,PREP,N2,...,N10300,A10300,N85530,A85530,N85300,A85300,N11901,A11901,N11902,A11902
0,50,VT,,1,111580,85090,14170,10740,45360,130630,...,53660,50699,0,0,0,0,10820,9734,88260,138337
1,50,VT,,2,82760,51960,18820,11310,35600,132950,...,74340,221146,0,0,0,0,12820,20029,68760,151729
2,50,VT,,3,46270,19540,22650,3620,24140,91870,...,44860,266097,0,0,0,0,10810,24499,34600,90583
3,50,VT,,4,30070,5830,22190,960,16060,71610,...,29580,264678,0,0,0,0,7320,21573,21300,67045
4,50,VT,,5,39530,3900,33800,590,22500,103710,...,39170,731963,40,24,0,0,12500,67761,23320,103034
5,50,VT,,6,9620,600,8150,0,7040,26430,...,9600,894432,3350,4939,4990,20428,3900,93123,2870,39425
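
The three forms accepted by `na_values` can be compared side by side. This is a sketch on hypothetical data, where `9999` and `"unknown"` stand in for whatever placeholder values a real file might use.

```python
import io
import pandas as pd

# Hypothetical data with three different placeholders for "missing"
csv_text = "STATE,zipcode,N1\nVT,0,9999\nVT,5401,3400\nunknown,5001,0\n"

# Single value: 0 counts as missing in every column
single = pd.read_csv(io.StringIO(csv_text), na_values=0)

# List: any of these values counts as missing in every column
several = pd.read_csv(io.StringIO(csv_text), na_values=[0, "unknown"])

# Dictionary: each column gets its own missing-value markers
per_column = pd.read_csv(io.StringIO(csv_text), na_values={"zipcode": [0], "N1": [9999]})

# Number of missing values per column for each variant
print(single.isna().sum().tolist())
print(several.isna().sum().tolist())
print(per_column.isna().sum().tolist())
```

The dictionary form is the most targeted: in `per_column`, the `0` in `N1` stays a real value because only `zipcode` treats `0` as missing.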


### Lines with errors

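Example of an error: a record with more values than there are columns

- Set `error_bad_lines=False` to skip unparseable rows
- Set `warn_bad_lines=True` to see a warning message listing the lines that were skipped
- Both arguments were deprecated in pandas 1.3 and removed in pandas 2.0; newer versions use `on_bad_lines` (`'error'`, `'warn'`, or `'skip'`) instead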
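
A minimal sketch of skipping an unparseable row, assuming pandas 1.3 or later, where the `on_bad_lines` argument is available. The data is hypothetical: its third line carries one value too many.

```python
import io
import pandas as pd

# Hypothetical data: the third line has four values but the header has three columns
csv_text = "STATE,zipcode,N1\nVT,5001,1200\nVT,5401,3400,extra\nVT,5602,560\n"

# Assumes pandas >= 1.3; "skip" drops lines that cannot be parsed,
# while "warn" would also report each skipped line
clean = pd.read_csv(io.StringIO(csv_text), on_bad_lines="skip")

print(clean["zipcode"].tolist())
```

Silently skipping rows can hide data quality problems, so it may be worth comparing the loaded row count with the number of lines in the file.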