## Input/Output


### Programming for Data Science
### Last Updated: Jan 16, 2023
---  

### PREREQUISITES
- variables
- data types

### SOURCES 
- JSON  
https://en.wikipedia.org/wiki/JSON


- pandas read_csv()  
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html


- context managers  
https://medium.com/better-programming/context-managers-in-python-go-beyond-with-open-as-file-85a27e392114


- pickle  
https://docs.python.org/3/library/pickle.html

### OBJECTIVES
- Introduce data formats: text, csv, json
- Show how to work with the data formats: read, write
- Discuss `pickle` for serializing/de-serializing objects
- Demonstrate how to manipulate pathnames
- Show how to test if a file exists
- Illustrate how to list the files in a directory

### CONCEPTS

- text file
- csv
- JSON
- `json.loads()`, `json.dump()`
- delimiter
- `with`, `open()`, `close()`
- pickle and unpickle
- using the `os.path` module to manipulate pathnames

---


## JSON (JavaScript Object Notation)
  
- open standard file format
- data interchange format 
- useful human-readable format  
- very popular 
- uses key-value pairs in a hierarchical (tree) format 
- semi-structured, flexible format

**Examples**  
One level of nesting with keys: `name_first`, `name_last`:

```
{"name_first":"james", "name_last":"jordan"}
```


Two levels of nesting; first holds `name`, second holds `name_first`, `name_last`:
```
{"name":{"name_first":"james", "name_last":"jordan"}}
```

or in a tree format:  
```
{"name":
        {
          "name_first":"james", 
          "name_last":"jordan"
        } 
}
```

Note that Python dictionaries have a similar structure to JSON, as both use key:value pairs.
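
The correspondence can be seen by parsing the nested example above with the built-in `json` module. This is a minimal sketch; the variable names `raw` and `record` are chosen here for illustration only.

```python
import json

# JSON text with two levels of nesting (same data as the example above)
raw = '{"name": {"name_first": "james", "name_last": "jordan"}}'

# json.loads() parses a JSON string into Python objects
record = json.loads(raw)

print(type(record))                    # a JSON object becomes a dict
print(record["name"]["name_first"])    # index one level at a time

# json.dumps() goes the other way; indent=2 prints the tree layout
print(json.dumps(record, indent=2))
```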

[Website for editing, representing JSON](http://jsoneditoronline.org/)

## JSON in Python

built-in module `json` imported as:  

```
import json
```

**Read data in JSON format:**

In [5]:
import json

# open and read JSON file, containing line:
# {"name_first":"james", "name_last":"jordan"}

# NOTE: file is an arbitrary name
with open('data_example.json', 'r') as file:
    js = file.read()

# parse json:
di = json.loads(js)

# Print Results
print('original data:')
print(js)
print(type(js))
print('-'*50)
print('parsed python data:')
print(di)
print(type(di))
print(di['name_first'])

original data:
{"name_first":"james", "name_last":"jordan"}
<class 'str'>
--------------------------------------------------
parsed python data:
{'name_first': 'james', 'name_last': 'jordan'}
<class 'dict'>
james


**Writing data to a json file**

Writing mirrors reading: `open()` is called in write mode `'w'` instead of read mode `'r'`.  
`json.dump()` writes the dict to the file.

In [7]:
with open('filename.json', 'w') as file:
    json.dump(di, file)
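
A quick way to check a write is to read the file straight back. The sketch below assumes a throwaway temporary directory and a hypothetical file name `roundtrip.json`, so it leaves no files behind. It also uses `json.load()`, which reads from an open file object, whereas `json.loads()` parses a string.

```python
import json
import os
import tempfile

di = {"name_first": "james", "name_last": "jordan"}

# write to a throwaway directory so no existing files are touched
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "roundtrip.json")

    # json.dump() writes to an open file object
    with open(path, "w") as file:
        json.dump(di, file)

    # json.load() reads from an open file object
    with open(path, "r") as file:
        di_back = json.load(file)

# True if the data survived the round trip unchanged
print(di_back == di)
```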

---

### TRY FOR YOURSELF (UNGRADED EXERCISES)

1) JSON: Copy the above code that reads in `data_example.json`. After parsing to a dict with `json.loads(js)`, do these tasks:
- append a new key:value pair to `di`
- save the updated dict as a json file
- verify the file looks correct

In [9]:
# read the json
with open('data_example.json', 'r') as file:
    js = file.read()

# parse json
di = json.loads(js)
di['new_key'] = 'new_val'
print(di)

with open('data_example_two.json', 'w') as fp:
    json.dump(di, fp)

{'name_first': 'james', 'name_last': 'jordan', 'new_key': 'new_val'}


## Text File Format 

- text files contain textual data (no images)   
- can be saved in plain text or rich text formats
- typical extensions: txt, rtf, log, doc, docx (doc and docx are proprietary Microsoft formats)

## Text in Python

**Read in text file** using `open()`, print the data, and close the file:

In [9]:
f1 = open('data_example.txt','r') # 'r' for read mode
data = f1.read()                  # read file content
print(data)
f1.close()

(Reuters) - President-elect Joe Biden may 
Pope Francis to have
U.S. governors work to


Using `with open()` is preferred, as the file will be closed even in the event of an error.

In [11]:
with open('data_example.txt', "r") as f:
    data = f.read()
    print(data)

# check if file is closed
print('\nFile closed? \n', f.closed)

(Reuters) - President-elect Joe Biden may 
Pope Francis to have
U.S. governors work to

File closed? 
 True
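
To see the "even in the event of an error" behavior, the sketch below raises an exception inside the `with` block and then checks `f.closed`. The scratch file and its contents are made up for the demonstration.

```python
import os
import tempfile

# create a small scratch file to read (made-up content)
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "w") as f:
    f.write("first line\nsecond line")

try:
    with open(path, "r") as f:
        first = f.readline()
        raise ValueError("simulated failure while processing")
except ValueError as err:
    print("caught:", err)

# the with block closed the file on the way out, despite the exception
print("File closed?", f.closed)

os.remove(path)  # clean up the scratch file
```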


**Writing to text file**

`open()` can be used again, in mode `'a'` to append to the end of the file or `'w'` to write (which overwrites any existing content).

In [13]:
# append to the file
with open('data_example.txt', "a") as f:
    f.write('\n' + 'another line')
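
The difference between the two modes is easiest to see side by side. This is a small sketch that uses a temporary directory and a hypothetical file name `modes.txt`.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "modes.txt")

    with open(path, "w") as f:      # 'w' creates the file
        f.write("line one")

    with open(path, "a") as f:      # 'a' adds to the end
        f.write("\n" + "line two")

    with open(path, "r") as f:
        after_append = f.read()

    with open(path, "w") as f:      # 'w' again discards the old content
        f.write("fresh start")

    with open(path, "r") as f:
        after_write = f.read()

print(repr(after_append))
print(repr(after_write))
```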

**Aside**  
The `with` statement works with a *context manager*; here, the file object returned by `open()` plays that role.   
The context manager sets up a temporary context and tears it down after the operations are completed.  
Here, it handles the housekeeping of opening and closing the file.
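
The set-up and tear-down steps can be made visible with a toy context manager. The sketch below uses `contextlib.contextmanager` from the standard library; the function name `demo_context` and the `events` list are illustrative and have nothing to do with how `open()` is implemented.

```python
from contextlib import contextmanager

events = []  # records the order in which things happen

@contextmanager
def demo_context(name):
    events.append("set up " + name)          # runs on entering the with block
    try:
        yield name                           # value bound by "as"
    finally:
        events.append("tear down " + name)   # runs on exit, even after an error

with demo_context("resource") as r:
    events.append("using " + r)

print(events)
```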

## CSV Format

A *comma-separated values* (CSV) file is a plain text file containing rows of data whose fields are separated by a character, generally a comma.  
A header row containing column (field) names may be included. It's often in the first row.  

Example:
```
name,email,phone
laura palmer,lpalmer@twin_peaks,123-456-7890
agent dale cooper,dcooper@twin_peaks,123-454-7899
```

CSV format is very popular, but using a comma as the separator (delimiter) can be problematic:   
if the data itself contains commas, the delimiter won't work properly.

Popular workaround: enclose text containing commas in quotes. This gets awkward when the data contains both commas and quotes, since embedded quotes must then be escaped.  
This leads to alternative delimiters such as pipe `|`, which appears less commonly in data.
Note: pandas reads these files, treating comma as the delimiter by default.
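
The quoting workaround and the pipe alternative can both be tried with the standard-library `csv` module. This is a rough sketch with a made-up row. The `csv` writer quotes fields only where needed and doubles any embedded quote characters.

```python
import csv
import io

# one field contains a comma, another contains a comma and quotes (made-up data)
row = ["laura palmer", "Twin Peaks, WA", 'she said "hi", then left']

# default comma delimiter: fields are quoted as needed,
# and embedded quote characters are doubled
buf = io.StringIO()
csv.writer(buf).writerow(row)
comma_line = buf.getvalue()
print(comma_line)

# reading the line back recovers the original three fields
parsed = next(csv.reader(io.StringIO(comma_line)))
print(parsed == row)

# pipe delimiter: commas in the data no longer need quoting
buf = io.StringIO()
csv.writer(buf, delimiter="|").writerow(row)
pipe_line = buf.getvalue()
print(pipe_line)
```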

## CSV in Python

**Read data in csv format:**

Here we use pandas `read_csv()` to read data in csv format.  
This is the most common method for reading this format.

Some important parameters:
- *delimiter* or *sep*: the field delimiter. default is comma.
- *header*: row number containing column names


[Details](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)
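
As a small illustration of these two parameters, the sketch below reads pipe-delimited text that has no header row. The sample text is made up, and `io.StringIO` stands in for a file on disk.

```python
import io
import pandas as pd

# pipe-delimited text with no header row (made-up sample data)
text = "laura palmer|123-456-7890\nagent dale cooper|123-454-7899\n"

# sep sets the delimiter; header=None says the first row is data
df = pd.read_csv(io.StringIO(text), sep="|", header=None)
print(df)

# names= supplies column names when the file has none
df_named = pd.read_csv(io.StringIO(text), sep="|", header=None,
                       names=["name", "phone"])
print(df_named.columns.tolist())
```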

In [16]:
import pandas as pd

# data source: https://archive.ics.uci.edu/ml/datasets/Wine

wine = pd.read_csv('data_example_wine.csv')

In [18]:
type(wine)

pandas.core.frame.DataFrame

In [20]:
isinstance(wine, pd.DataFrame)

True

In [22]:
# show the first few rows with header

wine.head()

|   | class_id | alcohol | malic_acid | ash | alcalinity | magnesium | phenols | flavanoids | nonflav_phenols | proanthocyanins | color_intensity | hue | OD280_OD315 | proline |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 14.23 | 1.71 | 2.43 | 15.6 | 127 | 2.8 | 3.06 | 0.28 | 2.29 | 5.64 | 1.04 | 3.92 | 1065 |
| 1 | 1 | 13.2 | 1.78 | 2.14 | 11.2 | 100 | 2.65 | 2.76 | 0.26 | 1.28 | 4.38 | 1.05 | 3.4 | 1050 |
| 2 | 1 | 13.16 | 2.36 | 2.67 | 18.6 | 101 | 2.8 | 3.24 | 0.3 | 2.81 | 5.68 | 1.03 | 3.17 | 1185 |
| 3 | 1 | 14.37 | 1.95 | 2.5 | 16.8 | 113 | 3.85 | 3.49 | 0.24 | 2.18 | 7.8 | 0.86 | 3.45 | 1480 |
| 4 | 1 | 13.24 | 2.59 | 2.87 | 21.0 | 118 | 2.8 | 2.69 | 0.39 | 1.82 | 4.32 | 1.04 | 2.93 | 735 |


The data now lives in a pandas DataFrame, ready for a wide range of work.  
We will do a lot of pandas work in the course.

In [24]:
wine.columns

Index(['class_id', 'alcohol', 'malic_acid', 'ash', 'alcalinity', 'magnesium',
       'phenols', 'flavanoids', 'nonflav_phenols', 'proanthocyanins',
       'color_intensity', 'hue', 'OD280_OD315', 'proline'],
      dtype='object')

**Write data to csv format:**


In [26]:
# keep only the first two rows, saving to new csv file

wine.head(2).to_csv('data_example_wine_first_two_rows.csv')
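
One detail worth knowing: by default `to_csv()` also writes the row index as an unnamed first column, and reading such a file back yields an extra column named `Unnamed: 0`. The sketch below shows this on a tiny made-up DataFrame and uses `index=False` to leave the index out. Calling `to_csv()` without a path returns the CSV text as a string.

```python
import io
import pandas as pd

df = pd.DataFrame({"alcohol": [14.23, 13.20], "ash": [2.43, 2.14]})

# default: the index is written as an unnamed first column
with_index = df.to_csv()
print(with_index)

# reading that text back produces an extra 'Unnamed: 0' column
cols_with_index = pd.read_csv(io.StringIO(with_index)).columns.tolist()
print(cols_with_index)

# index=False leaves the index out, so the columns round-trip unchanged
without_index = df.to_csv(index=False)
cols_without_index = pd.read_csv(io.StringIO(without_index)).columns.tolist()
print(cols_without_index)
```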

### TRY FOR YOURSELF (UNGRADED EXERCISES)

2) CSV Exercise

a) Read in a dataset from this URL:  
'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/bezdekIris.data'

note: this URL can be directly passed to `read_csv()`

b) You will notice the first record comes in as a header row.  
Pass a parameter to `read_csv()` so there is no header.

c) Write the data to a file with txt extension, with pipe separator |

In [28]:
import pandas as pd

url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/bezdekIris.data'
iris = pd.read_csv(url, header=None)
iris
iris.to_csv('some_data.txt',sep='|')

## pickle: Python object serialization



Pickling converts a Python object to a byte stream (conversion to bytes is called *serialization*).  
Unpickling is the inverse operation, converting a byte stream back to a Python object (conversion from bytes is called *de-serialization*).

The `pickle` module allows for serializing and de-serializing a Python object structure (model, dataframe, ...).  


Some benefits:  
- pickled data can be compressed (for example with `gzip`)
- easy to save complex data
- easy to use (minimal code)

Some differences between pickle protocols and JSON:

- JSON is a text serialization format, outputting unicode text; pickle is a binary serialization format
- JSON is human-readable; pickle is not
- JSON is interoperable and widely used outside Python; pickle is Python-specific

[Details](https://docs.python.org/3/library/pickle.html)

An alternative to pickle is `joblib`, which we don't discuss here.
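
The text-versus-binary difference can be checked directly. The following sketch pickles a made-up dict containing a set and a tuple, neither of which exists as a JSON type, and then shows that `json.dumps()` refuses the same object. `pickle.dumps()` and `pickle.loads()` are the in-memory counterparts of `pickle.dump()` and `pickle.load()`.

```python
import json
import pickle

# a dict holding a set and a tuple; JSON has neither type
obj = {"ids": {1, 2, 3}, "point": (4.0, 5.0)}

blob = pickle.dumps(obj)          # serialize to bytes (no file needed)
print(type(blob))

restored = pickle.loads(blob)     # de-serialize back to an equal object
print(restored == obj)

try:
    json.dumps(obj)               # sets are not JSON serializable
    json_ok = True
except TypeError as err:
    json_ok = False
    print("json.dumps failed:", err)
```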

**Write to a pickle file, using context manager: `with`**

In [30]:
import pickle

var = 5
with open('test_pickle.pkl', 'wb') as f:
    pickle.dump(var, f)

**Read from a pickle file:**

In [32]:
with open('test_pickle.pkl', 'rb') as f:
    data = pickle.load(f)

print(data)

5
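
Pickled data can also be compressed on the way to disk. One possible approach, sketched here, is to swap `open()` for `gzip.open()` from the standard library. The file names and the sample list are hypothetical, and how much space is saved depends heavily on the data.

```python
import gzip
import os
import pickle
import tempfile

# made-up object, repetitive enough to compress well
data = ["james", "jordan"] * 5000

with tempfile.TemporaryDirectory() as tmp_dir:
    plain_path = os.path.join(tmp_dir, "data.pkl")
    gz_path = os.path.join(tmp_dir, "data.pkl.gz")

    with open(plain_path, "wb") as f:
        pickle.dump(data, f)

    # gzip.open() returns a file object that compresses on write
    with gzip.open(gz_path, "wb") as f:
        pickle.dump(data, f)

    # ... and decompresses on read
    with gzip.open(gz_path, "rb") as f:
        restored = pickle.load(f)

    plain_size = os.path.getsize(plain_path)
    gz_size = os.path.getsize(gz_path)

print(restored == data)
print(plain_size, gz_size)   # sizes in bytes: uncompressed, compressed
```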


## Pathnames & Directory Management

Show the current working directory

In [34]:
import os

In [36]:
os.getcwd()

'C:\\Users\\HUAWEI\\Desktop\\Programming_for_Data_Science\\repo\\hudm5001\\lecture_notes\\python'

Functions in the `os.path` module are helpful for manipulating pathnames

In [38]:
some_path = '/Users/clark_kent/data.csv'  

Get the filename

In [40]:
os.path.basename(some_path)

'data.csv'

Get the directory name

In [42]:
os.path.dirname(some_path)

'/Users/clark_kent'
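
Two related helpers are `os.path.split()`, which returns both pieces at once, and `os.path.splitext()`, which separates the file extension. A short sketch using the same `some_path` follows; the `.json` sibling file is a made-up example.

```python
import os

some_path = '/Users/clark_kent/data.csv'

# split() returns (directory, filename) in one call
directory, filename = os.path.split(some_path)
print(directory, filename)

# splitext() separates the extension from the rest of the path
root, ext = os.path.splitext(some_path)
print(root, ext)

# combine the pieces to name a sibling file with a different extension
json_path = os.path.join(directory, os.path.splitext(filename)[0] + '.json')
print(json_path)
```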

Building a path the proper way: use `os.path.join`

This makes the code portable, as it adjusts for the operating system.

Example on a Windows machine:

In [45]:
dir_path = 'C:\\Users\\bruce_wayne'
file_name = 'joker.csv'

fullpath_to_joker = os.path.join(dir_path, file_name)
fullpath_to_joker

'C:\\Users\\bruce_wayne\\joker.csv'
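
The operating-system adjustment can be observed on any machine, because the standard library ships both implementations of `os.path`: `ntpath` for Windows and `posixpath` for Linux and macOS. The sketch below calls each one directly for comparison; in everyday code, plain `os.path.join` is enough. The Linux/macOS directory is a made-up counterpart of the Windows one.

```python
import ntpath       # the Windows implementation of os.path
import posixpath    # the Linux/macOS implementation of os.path

file_name = 'joker.csv'

# the same join() call, evaluated under each platform's rules
windows_path = ntpath.join('C:\\Users\\bruce_wayne', file_name)
posix_path = posixpath.join('/Users/bruce_wayne', file_name)

print(windows_path)
print(posix_path)
```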

**Test if File Exists**

In [47]:
os.path.exists('/etc/passwd')

False
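
The `False` above is expected on a Windows machine, where `/etc/passwd` does not exist. Since `os.path.exists()` is true for directories as well as files, `os.path.isfile()` and `os.path.isdir()` are useful companions. The sketch below tries all three on a temporary directory with hypothetical file names.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    present = os.path.join(tmp_dir, 'present.txt')
    missing = os.path.join(tmp_dir, 'missing.txt')   # never created

    with open(present, 'w') as f:
        f.write('hello')

    results = {
        'present exists': os.path.exists(present),
        'missing exists': os.path.exists(missing),
        'present is a file': os.path.isfile(present),
        'tmp_dir is a file': os.path.isfile(tmp_dir),
        'tmp_dir is a directory': os.path.isdir(tmp_dir),
    }

print(results)
```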

**Get a directory listing**

Example of checking what is in the working directory

In [49]:
os.listdir()

['.ipynb_checkpoints',
 'classes1.ipynb',
 'classes_servers.ipynb',
 'control_structures.ipynb',
 'dash_basics.ipynb',
 'dash_callback.ipynb',
 'data_example.json',
 'data_example.txt',
 'data_example_two.json',
 'data_example_wine.csv',
 'data_example_wine_first_two_rows.csv',
 'data_types.ipynb',
 'data_types_zzq.ipynb',
 'exception_handling.ipynb',
 'filename.json',
 'functions.ipynb',
 'functions_calling_other_functions.ipynb',
 'input_output-zq.ipynb',
 'input_output.ipynb',
 'input_output.zip',
 'interacting_w_relational_database.py',
 'iris_data.csv',
 'iterables_and_iterators.ipynb',
 'lambda_functions.ipynb',
 'list_and_dict_comprehensions.ipynb',
 'matplotlib.ipynb',
 'numpy-zq.ipynb',
 'numpy.ipynb',
 'operators-zq.ipynb',
 'operators.ipynb',
 'pandas_bridges.ipynb',
 'pandas_dataframes1.ipynb',
 'pandas_dataframes2.ipynb',
 'plotly.ipynb',
 'plotnine_ggplot.ipynb',
 'recursion.ipynb',
 'roadmap.md',
 'some_data.txt',
 'sql_expl.ipynb',
 'statsmodels.ipynb',
 'style.css',
 ...]
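
A listing like this is often filtered down to one file type. The sketch below does so for csv files in a temporary directory; all file names in it are made up. Note that `os.listdir()` returns bare names in arbitrary order, so they are sorted here and joined with the directory before use.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    # create a few empty files with made-up names
    for name in ['b.csv', 'a.csv', 'notes.txt', 'notebook.ipynb']:
        open(os.path.join(tmp_dir, name), 'w').close()

    # listdir() returns names only, in arbitrary order, so sort them
    all_names = sorted(os.listdir(tmp_dir))

    # keep only the csv files
    csv_names = [n for n in all_names if n.endswith('.csv')]

    # join with the directory to get paths that open() can use
    csv_paths = [os.path.join(tmp_dir, n) for n in csv_names]

print(all_names)
print(csv_names)
```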

### TRY FOR YOURSELF (UNGRADED EXERCISES)

3) Locate a data file on your computer and follow these steps:

- create a variable containing the path to the file
- create a variable containing the data file name
- create a variable containing the full path (path + filename), using `os.path.join`.
- load the data file