# Reading and writing delimited data files with vanilla Python

For a data task of any size, it usually makes sense to use a dedicated third-party library like `pandas`. In general, go with the tool that gets the job done most easily and quickly. Installing third-party dependencies is an extra step, but a specialized tool like `pandas` typically cuts down on development time. Sometimes, though, your task is simple enough to handle with built-in Python tools, or maybe you don't have a stable Internet connection and you're on deadline. Either way, it's good to know that these built-in tools exist.

Python's built-in [`csv`](https://docs.python.org/3/library/csv.html) module has functionality for working with delimited data files of all kinds. Let's take it for a spin. We'll work with the MLB data that lives at `../data/mlb.csv`.

### Import the module

In [3]:
import csv

### Reading data files

We're going to use a `with` block and the built-in `open()` function to open a data file and read the contents.

With Python's `csv` module, you can use `csv.reader()` to read a data file in as a series of lists, one per row. Every value comes back as a string, numbers included:

```python
[
    ['name', 'job', 'age'],
    ['Cody Winchester', 'Training director', '32'],
    ['Robert Caro', 'Author', '82']
]
```

... or as a series of dictionaries, using `csv.DictReader()`. (Technically, the rows are plain `dict` objects as of Python 3.8, and [OrderedDict](https://docs.python.org/3/library/collections.html#collections.OrderedDict) objects in Python 3.6 and 3.7.) Like this:

```python
[
    {'name': 'Cody Winchester', 'job': 'Training director', 'age': '32'},
    {'name': 'Robert Caro', 'job': 'Author', 'age': '82'}
]
```

Which one you use is largely a matter of personal preference. I prefer using `DictReader()` because it allows you to access bits of data by field name instead of list position, which I find more intuitive.
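To make the difference concrete, here is a minimal sketch that reads the same data both ways. It assumes a small in-memory sample (wrapped in `io.StringIO` so it behaves like an open file) rather than a file on disk; the `sample` string and variable names are illustrative only.

```python
import csv
import io

# Hypothetical in-memory sample standing in for a file on disk
sample = (
    "name,job,age\n"
    "Cody Winchester,Training director,32\n"
    "Robert Caro,Author,82\n"
)

# csv.reader yields one list per row, header row included
list_rows = list(csv.reader(io.StringIO(sample)))

# csv.DictReader treats the first row as field names and yields one mapping per data row
dict_rows = list(csv.DictReader(io.StringIO(sample)))

print(list_rows[1][0])       # access by list position
print(dict_rows[0]['name'])  # access by field name

# Values are strings, so convert them before doing any arithmetic
ages = [int(row['age']) for row in dict_rows]
print(sum(ages) / len(ages))
```

Note that the list-based version keeps the header as its first row, while the dictionary-based version consumes it.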

In [14]:
with open('../data/mlb.csv', 'r', newline='') as f:

    # create a reader object
    reader = csv.DictReader(f)

    # print the fieldnames
    print(reader.fieldnames)

    # loop over the reader object and print the team, name and salary
    for row in reader:
        print(row['TEAM'], row['NAME'], row['SALARY'])

['NAME', 'TEAM', 'POS', 'SALARY', 'START_YEAR', 'END_YEAR', 'YEARS']
LAD Clayton Kershaw 33000000
ARI Zack Greinke 31876966
BOS David Price 30000000
DET Miguel Cabrera 28000000
DET Justin Verlander 28000000
CHC Jason Heyward 26055288
LAA Albert Pujols 26000000
SEA Felix Hernandez 25857143
CHC Jon Lester 25000000
NYY CC Sabathia 25000000
SEA Robinson Cano 24000000
TEX Prince Fielder 24000000
SF Johnny Cueto 23500000
MIN Joe Mauer 23000000
BOS Hanley Ramirez 22750000
TEX Cole Hamels 22500000
NYM Yoenis Cespedes 22500000
LAD Adrian Gonzalez 22357142
SF Buster Posey 22177778
WSH Max Scherzer 22142857
DET Justin Upton 22125000
CIN Joey Votto 22000000
NYY Masahiro Tanaka 22000000
NYM Jose Reyes 22000000
WSH Jayson Werth 21571429
ATL Matt Kemp 21500000
BAL Chris Davis 21233006
NYY Jacoby Ellsbury 21142857
CWS James Shields 21000000
ATL Freddie Freeman 20859375
SF Matt Cain 20833333
COL Carlos Gonzalez 20428571
BOS Rick Porcello 20125000
LAA Mike Trout 20083333
TOR Troy Tulowitzki 20000000
...
DET Daniel Norris 545500
SD Robbie Erlin 545500
TB Matt Duffy 545300
TOR Devon Travis 545200
TEX Alex Claudio 545050
TB Blake Snell 545000
COL Chad Bettis 545000
COL Chris Rusin 545000
MIN Danny Santana 545000
ATL Danny Winkler 545000
SF Derek Law 545000
SEA Edwin Diaz 545000
NYY Greg Bird 545000
OAK Josh Phegley 545000
LAD Josh Ravin 545000
OAK Kendall Graveman 545000
OAK Marcus Semien 545000
OAK Mark Canha 545000
BAL Tyler Wilson 545000
TEX Matt Bush 544920
TEX Ryan Rua 544740
SD Ryan Buchter 544700
DET Tyler Collins 544700
NYY Aaron Judge 544500
MIL Travis Shaw 544400
SD Luis Perdomo 544300
SD Travis Jankowski 544300
CLE Abraham Almonte 544200
DET Matt Boyd 544200
ARI Archie Bradley 544100
PHI Aaron Nola 544000
ATL Mike Foltynewicz 544000
SEA Shawn O'Malley 544000
PIT Trevor Williams 544000
NYY Tyler Austin 544000
NYM Seth Lugo 543500
HOU Joe Musgrove 543400
LAA Andrew Heaney 543000
TOR Bo Schultz 543000
TOR Joe Biagini 543000
PHI Tommy Joseph 543000
TOR Ryan Tepera 542700
...
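The same reader handles delimiters other than commas through the `delimiter` keyword argument. The sketch below assumes a hypothetical pipe-delimited sample whose column names mirror `mlb.csv`, and it converts the salary strings to integers before adding them up.

```python
import csv
import io

# Hypothetical pipe-delimited sample; column names mirror mlb.csv
sample = (
    "NAME|TEAM|SALARY\n"
    "Clayton Kershaw|LAD|33000000\n"
    "Zack Greinke|ARI|31876966\n"
)

# Tell the reader which character separates the fields
reader = csv.DictReader(io.StringIO(sample), delimiter='|')
players = list(reader)

# SALARY is read as a string, so cast to int before summing
total = sum(int(player['SALARY']) for player in players)
print(total)
```

For a tab-delimited file, you would pass `delimiter='\t'` instead.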

### Writing data files

We'll use a `with` block again to open a file in write (`w`) mode.

As with reading, so with writing: You can write files using `csv.writer()`, which takes lists of data, or `csv.DictWriter()`, which accepts dictionaries. Again, I prefer the dictionary-based approach.

If you use `csv.writer()`, you'll need to write a list of headers in addition to your data. With `csv.DictWriter()`, you specify the headers by passing a list to the `fieldnames` keyword argument as you create the writer object. Then you can use the object's `writeheader()` method to write out the header row. Important: The keys in your dictionaries need to match the list of headers you pass to the writer object. By default, a key that isn't in `fieldnames` raises a `ValueError`, and a missing key is written as an empty string.

Both writer objects have a `writerow()` method, for writing out a single row of data, and a `writerows()` method (plural) for writing a collection of data.
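For comparison with the dictionary-based approach, here is a minimal sketch of the list-based `csv.writer()` workflow. It writes to an in-memory `io.StringIO` buffer instead of a file so the result is easy to inspect; the `headers` and `rows` values are made up for illustration.

```python
import csv
import io

headers = ['NAME', 'POS', 'SALARY']
rows = [
    ['Jeff Tweedy', 'SP', 1000000],
    ['John Stirratt', '1B', 950000],
]

# An in-memory buffer stands in for a file opened in write mode
buffer = io.StringIO()
writer = csv.writer(buffer)

writer.writerow(headers)  # a single row: the header
writer.writerows(rows)    # a collection of rows: the data

# splitlines() strips the line endings the writer added
lines = buffer.getvalue().splitlines()
print(lines)
```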

In [17]:
new_mlb_players = [
    {'NAME': 'Jeff Tweedy', 'POS': 'SP', 'SALARY': 1000000, 'START_YEAR': 2015, 'END_YEAR': 2019, 'YEARS': 5},
    {'NAME': 'John Stirratt', 'POS': '1B', 'SALARY': 950000, 'START_YEAR': 2015, 'END_YEAR': 2019, 'YEARS': 5},
    {'NAME': 'Nels Cline', 'POS': 'C', 'SALARY': 900000, 'START_YEAR': 2015, 'END_YEAR': 2019, 'YEARS': 5},
    {'NAME': 'Pat Sansone', 'POS': 'SS', 'SALARY': 850000, 'START_YEAR': 2015, 'END_YEAR': 2019, 'YEARS': 5},
    {'NAME': 'Mikael Jorgensen', 'POS': 'RP', 'SALARY': 800000, 'START_YEAR': 2015, 'END_YEAR': 2019, 'YEARS': 5},
    {'NAME': 'Glenn Kotche', 'POS': '3B', 'SALARY': 750000, 'START_YEAR': 2015, 'END_YEAR': 2019, 'YEARS': 5},
]

with open('new-mlb-players.csv', 'w', newline='') as f:
    headers = ['NAME', 'POS', 'SALARY', 'START_YEAR', 'END_YEAR', 'YEARS']
    writer = csv.DictWriter(f, fieldnames=headers)
    
    writer.writeheader()
    writer.writerows(new_mlb_players)
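What happens when dictionary keys and headers don't line up depends on two `DictWriter` keyword arguments, `extrasaction` and `restval`. The sketch below uses in-memory buffers and made-up rows to show the default behavior and one lenient alternative; treat it as an illustration rather than a recommendation.

```python
import csv
import io

headers = ['NAME', 'POS']

# Extra key: 'SALARY' is not in headers, and the default extrasaction='raise'
# makes writerow() raise a ValueError
strict = csv.DictWriter(io.StringIO(), fieldnames=headers)
try:
    strict.writerow({'NAME': 'Jeff Tweedy', 'POS': 'SP', 'SALARY': 1000000})
    raised = False
except ValueError:
    raised = True
print(raised)

# Lenient writer: ignore extra keys, and fill missing ones with restval
buffer = io.StringIO()
lenient = csv.DictWriter(buffer, fieldnames=headers,
                         extrasaction='ignore', restval='N/A')
lenient.writeheader()
lenient.writerow({'NAME': 'Jeff Tweedy', 'SALARY': 1000000})  # no 'POS' key

lines = buffer.getvalue().splitlines()
print(lines)
```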

### Reading _and_ writing data files

You can open multiple files for reading and writing in the same `with` block. A simple use case might be filtering a CSV to get a subset of data, which is what we'll do here.

In [None]:
with open('../data/mlb.csv', 'r', newline='') as infile, open('royals.csv', 'w', newline='') as outfile:
    reader = csv.DictReader(infile)
    headers = reader.fieldnames
    writer = csv.DictWriter(outfile, fieldnames=headers)
    
    writer.writeheader()
    
    for row in reader:
        if row['TEAM'] == 'KC':
            writer.writerow(row)
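If you find yourself filtering often, the same pattern can be wrapped in a small function. This is a sketch under stated assumptions: `filter_rows` is a hypothetical helper (not part of the `csv` module), and the in-memory buffers stand in for the open files you would pass in practice.

```python
import csv
import io


def filter_rows(infile, outfile, field, value):
    """Copy rows whose `field` equals `value` from infile to outfile.

    Returns the number of rows written (header not counted).
    """
    reader = csv.DictReader(infile)
    writer = csv.DictWriter(outfile, fieldnames=reader.fieldnames)
    writer.writeheader()

    count = 0
    for row in reader:
        if row[field] == value:
            writer.writerow(row)
            count += 1
    return count


# In-memory stand-ins for the input and output files
source = io.StringIO(
    "NAME,TEAM,SALARY\n"
    "Clayton Kershaw,LAD,33000000\n"
    "Zack Greinke,ARI,31876966\n"
    "Adrian Gonzalez,LAD,22357142\n"
)
target = io.StringIO()

kept = filter_rows(source, target, 'TEAM', 'LAD')
output_lines = target.getvalue().splitlines()
print(kept, output_lines)
```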

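The equivalent `pandas` code for the Royals filter is much terser:

```python
import pandas as pd

df = pd.read_csv('../data/mlb.csv')
royals = df[df['TEAM'] == 'KC']

# index=False keeps pandas from writing its row index as an extra column
royals.to_csv('royals.csv', index=False)
```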