In [6]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
plt.rcParams['savefig.dpi'] = 144

### Importing and Exporting Files

Most often data will be stored in a file, either locally on the computer or online. We'll learn how to read and write data to files.

### Python file handles (open)

In Python we interact with files on disk using the built-in open function and the file object's close method. We've included a file in the data folder called sample.txt. Let's open it and read its contents.

In [12]:
f = open('./sample.txt', 'r')
data = f.read()
f.close()

print(data)
print(f)

Hello!
Congratulations!
You've read data from file.
<_io.TextIOWrapper name='./sample.txt' mode='r' encoding='cp1252'>


Notice that we open the file, assign it to f, and read the data from f. What is f? It's called a file handle: an object that connects the interpreter to the file we opened. We read the data through this connection, and once we're done we close the connection. It's a good habit to close a file handle as soon as we're done with it, so we usually let Python do it automatically with the with keyword.

In [14]:
#f is automatically closed
#at the end of the body of the with statement
with open('./sample.txt', 'r') as f:
    print(f.read())

Hello!
Congratulations!
You've read data from file.


In [15]:
#using the with command closes the file automatically
f.read()

ValueError: I/O operation on closed file.
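
The with statement behaves roughly like a try/finally block that always calls close(). The sketch below compares the two forms and inspects the handle's closed attribute. It assumes a throwaway temporary file (the name demo.txt and its contents are hypothetical), so it does not depend on sample.txt.

```python
import os
import tempfile

# Create a throwaway file so the example does not depend on sample.txt.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, 'demo.txt')
with open(path, 'w') as f:
    f.write('Hello!\n')

# Manual version: try/finally guarantees close() even if read() raises.
f = open(path, 'r')
try:
    manual_text = f.read()
finally:
    f.close()
manual_closed = f.closed

# Same idea using with.
with open(path, 'r') as g:
    with_text = g.read()
    open_inside = g.closed   # False: still open inside the body
closed_after = g.closed      # True: closed once the body ends

print(manual_text == with_text, manual_closed, open_inside, closed_after)
```

Checking f.closed is a cheap way to confirm the state of a handle before calling read() on it.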

We can also read individual lines of a file with readline.

In [17]:
#each call to readline returns the next line, including its trailing newline
with open('./sample.txt', 'r') as f:
    print(f.readline())
    print(f.readline())

Hello!

Congratulations!



We can also read the entire file at once with readlines.

In [18]:
with open('./sample.txt', 'r') as f:
    print(f.readlines())

['Hello!\n', 'Congratulations!\n', "You've read data from file."]


The difference between read and readlines is that .read() returns a single str, while .readlines() returns a list with one str per line.

In [22]:
with open('./sample.txt', 'r') as f:
    print(type(f.readlines()))

<class 'list'>


In [21]:
with open('./sample.txt', 'r') as f:
    print(type(f.read()))

<class 'str'>
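
A third option is to loop over the file handle itself, which yields one line at a time. The sketch below assumes a small temporary file (lines.txt is a hypothetical name) and shows that iterating and stripping the newline gives the same result as read() followed by str.splitlines().

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'lines.txt')
with open(path, 'w') as f:
    f.write('Hello!\nCongratulations!\nDone.')

# Iterating over the handle yields one line at a time,
# so the whole file never has to sit in memory at once.
with open(path, 'r') as f:
    stripped = [line.rstrip('\n') for line in f]

# str.splitlines() on the result of read() gives the same lines without '\n'.
with open(path, 'r') as f:
    split = f.read().splitlines()

print(stripped)
```

For large files the loop is usually the safer choice, since read() and readlines() both load everything at once.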


Writing to files is very similar. The main difference is that when we open the file, we use the 'w' flag instead of 'r'.

In [26]:
with open('./my_data.txt', 'w') as f:
    f.write('This is a new file.\n')
    f.write('I am practicing writing data to disk.')
    
with open('./my_data.txt', 'r') as f:
    print(f.read())

This is a new file.
I am practicing writing data to disk.


No matter how often we execute the above cell, the same output gets printed, because opening the file with the 'w' flag overwrites its contents. If we want to add to what is already in the file, we have to open the file with the 'a' flag ('a' stands for append).

In [29]:
with open('./my_data.txt', 'a') as f:
    f.write('\nAdding a new line.')
    
with open('./my_data.txt', 'r') as f:
    data = f.read()
    
print(data)

This is a new file.
I am practicing writing data to disk.
 Adding a new line.
 Adding a new line.
Adding a new line.
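
The appended line shows up more than once above because every run of the cell appends again. A minimal sketch of the difference between the two flags, assuming a temporary file and a hypothetical helper write_with_mode:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'modes.txt')

def write_with_mode(mode, text):
    """Open path with the given mode, write text, and return the full contents."""
    with open(path, mode) as f:
        f.write(text)
    with open(path, 'r') as f:
        return f.read()

after_w1 = write_with_mode('w', 'first')
after_w2 = write_with_mode('w', 'second')   # 'w' truncates: 'first' is gone
after_a1 = write_with_mode('a', '\nthird')  # 'a' keeps the old contents
after_a2 = write_with_mode('a', '\nthird')  # rerunning appends again

print(after_a2)
```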


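If we open a file with the 'r+' flag, we can write to it as well as read from it. The file must already exist, though: 'r+' does not create it, so the cell below raises FileNotFoundError.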

In [30]:
with open('./fail.txt', 'r+') as f:
    f.write('This will not fail')
    
with open('./fail.txt', 'r') as f:
    data = f.read()
    
print(data)

FileNotFoundError: [Errno 2] No such file or directory: './fail.txt'
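
The sketch below reproduces this error in a controlled way and then shows 'r+' working on a file that exists. The temporary path and the file contents are hypothetical.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'maybe_missing.txt')

# 'r+' never creates a file, so opening a missing path fails.
try:
    with open(path, 'r+') as f:
        f.write('unreachable')
    missing_error = None
except FileNotFoundError as exc:
    missing_error = exc

# Once the file exists, 'r+' opens it for reading and writing without truncating it.
with open(path, 'w') as f:
    f.write('0123456789')

with open(path, 'r+') as f:
    f.write('abc')        # overwrites from the start of the file
    f.seek(0)             # move back to the beginning before reading
    contents = f.read()

print(type(missing_error).__name__, contents)
```

Note that 'r+' overwrites in place from the current position rather than replacing the whole file, which is why the digits after 'abc' survive.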

### OS module

Python has a module for navigating the computer's file system called os. It contains many useful tools, but two functions are especially useful for finding files: listdir and walk.

In [2]:
import os

In [3]:
#dir(os)

In [4]:
os.listdir('.')

['.ipynb_checkpoints',
 'DS_IO.ipynb',
 'Hash_function.ipynb',
 'my_data.txt',
 'PY_Algorithms.ipynb',
 'PY_OOP.ipynb',
 'PY_Pythonic.ipynb',
 'Roughwork.ipynb',
 'sample.txt',
 'stock_prices.csv',
 'WQU_unit _one.ipynb']

The listdir function is the simpler of the two functions we'll cover. It lists the contents of the directory path we specify. When we pass '.' as the argument, listdir looks in the current directory.
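
Because listdir returns bare names for files and directories alike, a common pattern is to filter the result. The sketch below assumes a temporary folder with hypothetical file names and keeps only regular files ending in .csv.

```python
import os
import tempfile

folder = tempfile.mkdtemp()
for name in ['a.csv', 'b.txt', 'c.csv']:
    with open(os.path.join(folder, name), 'w') as f:
        f.write('')
os.mkdir(os.path.join(folder, 'sub.csv'))  # a directory whose name looks like a file

# listdir returns names only, in arbitrary order, so join each name
# to the folder before testing it and sort the result.
csv_files = sorted(
    name for name in os.listdir(folder)
    if name.endswith('.csv') and os.path.isfile(os.path.join(folder, name))
)
print(csv_files)
```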

In [5]:
# os.walk?

Signature: os.walk(top, topdown=True, onerror=None, followlinks=False)
Docstring:
Directory tree generator.

For each directory in the directory tree rooted at top (including top
itself, but excluding '.' and '..'), yields a 3-tuple

    dirpath, dirnames, filenames

dirpath is a string, the path to the directory.  dirnames is a list of
the names of the subdirectories in dirpath (excluding '.' and '..').
filenames is a list of the names of the non-directory files in dirpath.
Note that the names in the lists are just names, with no path components.
To get a full path (which begins with top) to a file or directory in
dirpath, do os.path.join(dirpath, name).

If optional arg 'topdown' is true or not specified, the triple for a
directory is generated before the triples for any of its subdirectories
(directories are generated top down).  If topdown is false, the triple
for a directory is generated after the triples for all of its
subdirectories (directories are generated bottom up).

When topdown is true, the caller can modify the dirnames list in-place
(e.g., via del or slice assignment), and walk will only recurse into the
subdirectories whose names remain in dirnames; this can be used to prune the
search, or to impose a specific order of visiting.  Modifying dirnames when
topdown is false has no effect on the behavior of os.walk(), since the
directories in dirnames have already been generated by the time dirnames
itself is generated. No matter the value of topdown, the list of
subdirectories is retrieved before the tuples for the directory and its
subdirectories are generated.

By default errors from the os.scandir() call are ignored.  If
optional arg 'onerror' is specified, it should be a function; it
will be called with one argument, an OSError instance.  It can
report the error to continue with the walk, or raise the exception
to abort the walk.  Note that the filename is available as the
filename attribute of the exception object.

By default, os.walk does not follow symbolic links to subdirectories on
systems that support them.  In order to get this functionality, set the
optional argument 'followlinks' to true.

Caution:  if you pass a relative pathname for top, don't change the
current working directory between resumptions of walk.  walk never
changes the current directory, and assumes that the client doesn't
either.


In [19]:
for root, dirs, files in os.walk('.'):
    print('This is root ',root)
    print('This is dirs ',dirs)
    print('This is files ',files)


This is root  .
This is dirs  ['.ipynb_checkpoints', 'data']
This is files  ['DS_IO.ipynb', 'Hash_function.ipynb', 'PY_Algorithms.ipynb', 'PY_OOP.ipynb', 'PY_Pythonic.ipynb', 'Roughwork.ipynb', 'WQU_unit _one.ipynb']
This is root  .\.ipynb_checkpoints
This is dirs  []
This is files  ['DS_IO-checkpoint.ipynb', 'Hash_function-checkpoint.ipynb', 'PY_Algorithms-checkpoint.ipynb', 'PY_OOP-checkpoint.ipynb', 'PY_Pythonic-checkpoint.ipynb', 'Roughwork-checkpoint.ipynb', 'WQU_unit _one-checkpoint.ipynb']
This is root  .\data
This is dirs  []
This is files  ['my_data.txt', 'sample.txt', 'stock_prices.csv']


In [7]:
for root,dirs, files in os.walk('.'):
    print(dirs)

['.ipynb_checkpoints']
[]


In [18]:
for root, dirs, files in os.walk('.'):
    for file in files:
        print(os.path.join(root, file))

.\DS_IO.ipynb
.\Hash_function.ipynb
.\PY_Algorithms.ipynb
.\PY_OOP.ipynb
.\PY_Pythonic.ipynb
.\Roughwork.ipynb
.\WQU_unit _one.ipynb
.\.ipynb_checkpoints\DS_IO-checkpoint.ipynb
.\.ipynb_checkpoints\Hash_function-checkpoint.ipynb
.\.ipynb_checkpoints\PY_Algorithms-checkpoint.ipynb
.\.ipynb_checkpoints\PY_OOP-checkpoint.ipynb
.\.ipynb_checkpoints\PY_Pythonic-checkpoint.ipynb
.\.ipynb_checkpoints\Roughwork-checkpoint.ipynb
.\.ipynb_checkpoints\WQU_unit _one-checkpoint.ipynb
.\data\my_data.txt
.\data\sample.txt
.\data\stock_prices.csv


In [14]:
for root, dirs, files in os.walk('.'):
    for file in files:
        print(file)

DS_IO.ipynb
Hash_function.ipynb
my_data.txt
PY_Algorithms.ipynb
PY_OOP.ipynb
PY_Pythonic.ipynb
Roughwork.ipynb
sample.txt
stock_prices.csv
WQU_unit _one.ipynb
DS_IO-checkpoint.ipynb
Hash_function-checkpoint.ipynb
PY_Algorithms-checkpoint.ipynb
PY_OOP-checkpoint.ipynb
PY_Pythonic-checkpoint.ipynb
Roughwork-checkpoint.ipynb
WQU_unit _one-checkpoint.ipynb


In [9]:
type(root)

str

In [11]:
for i in files:
    print(i)

DS_IO-checkpoint.ipynb
Hash_function-checkpoint.ipynb
PY_Algorithms-checkpoint.ipynb
PY_OOP-checkpoint.ipynb
PY_Pythonic-checkpoint.ipynb
Roughwork-checkpoint.ipynb
WQU_unit _one-checkpoint.ipynb
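
The docstring above mentions that dirnames can be modified in place to prune the search. A minimal sketch of that idea, assuming a small temporary directory tree and a hypothetical helper find_files that skips .ipynb_checkpoints:

```python
import os
import tempfile

top = tempfile.mkdtemp()
os.makedirs(os.path.join(top, 'data'))
os.makedirs(os.path.join(top, '.ipynb_checkpoints'))
for rel in ['notes.ipynb', os.path.join('data', 'prices.csv'),
            os.path.join('.ipynb_checkpoints', 'notes-checkpoint.ipynb')]:
    with open(os.path.join(top, rel), 'w') as f:
        f.write('')

def find_files(start, extension, skip_dirs=('.ipynb_checkpoints',)):
    """Return sorted paths, relative to start, of files ending with extension."""
    found = []
    for root, dirs, files in os.walk(start):
        # Slice assignment edits the list in place, so walk skips the pruned folders.
        dirs[:] = [d for d in dirs if d not in skip_dirs]
        for name in files:
            if name.endswith(extension):
                found.append(os.path.relpath(os.path.join(root, name), start))
    return sorted(found)

notebooks = find_files(top, '.ipynb')
csvs = find_files(top, '.csv')
print(notebooks, csvs)
```

Rebinding dirs with a plain assignment (dirs = [...]) would not prune anything, because walk keeps its own reference to the original list.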


### CSV files

One of the simplest and most common formats for saving data is the Comma Separated Value (CSV) format.

In [28]:
table = []
with open('./data/my_sample.txt', 'r') as f:
    for line in f.readlines():
        table.append(line.strip().split(','))

table

[['Index', ' Name', ' Age'],
 ['0', ' Dylan', ' 28'],
 ['1', ' Terrance', ' 54'],
 ['2', ' Mya', ' 31']]

In [46]:
list_table = []

def parse_str(line):
    if line[0] == "Index":
        return line
    
    return [int(line[0]), line[1], int(line[2])]
        
with open('./data/my_sample.txt', 'r') as f:       
    for line in f.readlines():
        line = line.strip().split(',')
        list_table.append(parse_str(line))

list_table

[['Index', ' Name', ' Age'],
 [0, ' Dylan', 28],
 [1, ' Terrance', 54],
 [2, ' Mya', 31]]
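
The standard library's csv module can do the splitting for us and also remove the blank after each comma. This is a sketch under the assumption that the file looks like the hypothetical text in raw, which mirrors the layout parsed above.

```python
import csv
import io

# Hypothetical text with the same layout as the sample above.
raw = 'Index, Name, Age\n0, Dylan, 28\n1, Terrance, 54\n2, Mya, 31\n'

# skipinitialspace=True drops the blank that follows each comma.
reader = csv.DictReader(io.StringIO(raw), skipinitialspace=True)

# Everything comes out of the reader as str, so convert the numeric columns.
rows = [
    {'Index': int(r['Index']), 'Name': r['Name'], 'Age': int(r['Age'])}
    for r in reader
]
print(rows)
```

With a real file, the io.StringIO object would be replaced by a handle from open(path, 'r', newline='').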

However, we can work with tabular data much more easily in a pandas DataFrame. Pandas provides a read_csv function that reads the data directly into a DataFrame.

In [54]:
import pandas as pd

data = pd.read_csv('./data/my_sample.txt', index_col = 'Index')
data

Index,Name,Age
0,Dylan,28
1,,54
2,Mya,31


The read_csv function is flexible enough to handle the formatting of many different data sets.

We can also use pandas to write a CSV file with the DataFrame's to_csv method.

In [51]:
pd.DataFrame({'a' : [0, 3, 10], 'b' : [True, True, False]}).to_csv('./data/written_to_file.csv')

In [52]:
with open('./data/written_to_file.csv', 'r') as f:
    print(f.read())

,a,b
0,0,True
1,3,True
2,10,False



In [56]:
df = pd.read_csv('./data/written_to_file.csv', index_col=0)
df

,a,b
0,0,True
1,3,True
2,10,False


### JSON

JSON stands for JavaScript Object Notation. JavaScript is a common language for creating web applications, and JSON files are used to collect and transmit information between JavaScript applications. As a result, a lot of data on the internet exists in the JSON file format. For example, Twitter and Google Maps use JSON.

A JSON file is essentially a data structure built out of nested dictionaries and lists. Let's make our own example, and then we'll examine an example downloaded from the internet.

In [4]:
book1 = {'title': 'The Prophet',
        'author': 'Khalil Gibran',
         'genre': 'poetry',
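         # Note: the comma after 'spirituality' is missing, so Python joins the two
         # adjacent string literals into 'spiritualityphilosophy' (see the output below).
         'tags': ['religion', 'spirituality' 'philosophy', 'Lebanon', 'Arabic', 'Middle East'],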
         'book_id': '811.19',
         'copies': [{'edition_year' : 1996,
                     'checkouts' : 486,
                     'borrowed' : False},
                    {'edition_year': 1996,
                    'checkouts' : 443,
                     'borrowed' : False
                    }]
        }

book2 = {'title': 'The Little Prince',
         'author': 'Antoine de Saint-Exupery',
         'genre': 'children',
         'tags': ['fantasy', 'France', 'philosophy', 'illustrated', 'false'],
         'id': '843.912',
         'copies': [{'edition_year':1983,
                     'checkouts': 634,
                     'borrowed': True,
                     'due_date': '2017/02/02'},
                    {'edition_year':2015,
                     'checkouts': 41,
                     'borrowed': False
                    }]
        }

library = [book1, book2]
library

[{'title': 'The Prophet',
  'author': 'Khalil Gibran',
  'genre': 'poetry',
  'tags': ['religion',
   'spiritualityphilosophy',
   'Lebanon',
   'Arabic',
   'Middle East'],
  'book_id': '811.19',
  'copies': [{'edition_year': 1996, 'checkouts': 486, 'borrowed': False},
   {'edition_year': 1996, 'checkouts': 443, 'borrowed': False}]},
 {'title': 'The Little Prince',
  'author': 'Antoine de Saint-Exupery',
  'genre': 'children',
  'tags': ['fantasy', 'France', 'philosophy', 'illustrated', 'false'],
  'id': '843.912',
  'copies': [{'edition_year': 1983,
    'checkouts': 634,
    'borrowed': True,
    'due_date': '2017/02/02'},
   {'edition_year': 2015, 'checkouts': 41, 'borrowed': False}]}]

It's convenient to store the information about the multiple copies as a list of dictionaries within the dictionary about the book, because every copy shares the same title, author, etc.

This structure is typical of JSON files. It has the advantage of reducing redundancy of data. We only store the author and title once, even though there are multiple copies of the book. Also, we don't store a due date for copies that aren't checked out.

If we were to put this data in a table, we would have to duplicate a lot of information. Also, since only one copy in our library is checked out, we would have a column with a lot of missing data.

This is very wasteful. Since JSON files are meant to be shared quickly over the internet, it is important that they are small to reduce the amount of resources needed to store and transmit them.
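
To make the redundancy concrete, the sketch below flattens a trimmed, hypothetical version of the library into one table row per copy. The column choice is an assumption made for illustration.

```python
# A trimmed, hypothetical version of the library structure above.
library = [
    {'title': 'The Little Prince',
     'author': 'Antoine de Saint-Exupery',
     'copies': [
         {'edition_year': 1983, 'borrowed': True, 'due_date': '2017/02/02'},
         {'edition_year': 2015, 'borrowed': False},
     ]},
]

columns = ['title', 'author', 'edition_year', 'borrowed', 'due_date']

# One table row per copy: book-level fields are repeated on every row,
# and due_date is None wherever a copy is not checked out.
rows = [
    [book['title'], book['author'],
     copy['edition_year'], copy['borrowed'], copy.get('due_date')]
    for book in library
    for copy in book['copies']
]

for row in rows:
    print(row)
```

Both rows carry the same title and author, and the second row has an empty due_date cell, which is exactly the duplication and missing data described above.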

We can write our library to disk using Python's json module.

In [12]:
#note : json is just text structured in a particular way
import json

with open('./data/library.json', 'w') as f:
    json.dump(library, f, indent=2)

In [13]:
!cat ./data/library.json

'cat' is not recognized as an internal or external command,
operable program or batch file.


In [15]:
with open('./data/library.json', 'r') as f:
    loaded_library = f.read()
    
print(loaded_library)

[
  {
    "title": "The Prophet",
    "author": "Khalil Gibran",
    "genre": "poetry",
    "tags": [
      "religion",
      "spiritualityphilosophy",
      "Lebanon",
      "Arabic",
      "Middle East"
    ],
    "book_id": "811.19",
    "copies": [
      {
        "edition_year": 1996,
        "checkouts": 486,
        "borrowed": false
      },
      {
        "edition_year": 1996,
        "checkouts": 443,
        "borrowed": false
      }
    ]
  },
  {
    "title": "The Little Prince",
    "author": "Antoine de Saint-Exupery",
    "genre": "children",
    "tags": [
      "fantasy",
      "France",
      "philosophy",
      "illustrated",
      "false"
    ],
    "id": "843.912",
    "copies": [
      {
        "edition_year": 1983,
        "checkouts": 634,
        "borrowed": true,
        "due_date": "2017/02/02"
      },
      {
        "edition_year": 2015,
        "checkouts": 41,
        "borrowed": false
      }
    ]
  }
]


In [16]:
with open('./data/library.json', 'r') as f:
    reloaded_library = json.load(f)
    
reloaded_library

[{'title': 'The Prophet',
  'author': 'Khalil Gibran',
  'genre': 'poetry',
  'tags': ['religion',
   'spiritualityphilosophy',
   'Lebanon',
   'Arabic',
   'Middle East'],
  'book_id': '811.19',
  'copies': [{'edition_year': 1996, 'checkouts': 486, 'borrowed': False},
   {'edition_year': 1996, 'checkouts': 443, 'borrowed': False}]},
 {'title': 'The Little Prince',
  'author': 'Antoine de Saint-Exupery',
  'genre': 'children',
  'tags': ['fantasy', 'France', 'philosophy', 'illustrated', 'false'],
  'id': '843.912',
  'copies': [{'edition_year': 1983,
    'checkouts': 634,
    'borrowed': True,
    'due_date': '2017/02/02'},
   {'edition_year': 2015, 'checkouts': 41, 'borrowed': False}]}]

In [17]:
reloaded_library == library

True
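
The comparison returns True because the library uses only types that JSON can represent directly: dictionaries with string keys, lists, strings, numbers, and booleans. The sketch below uses a hypothetical record to show what happens to other Python values on a round trip.

```python
import json

# Hypothetical record with a None value, a tuple, and an integer key.
record = {'borrowed': False, 'due_date': None, 'shelf': (3, 'B'), 1996: 'edition'}

text = json.dumps(record)      # False -> false, None -> null, tuple -> array
restored = json.loads(text)    # arrays come back as lists, keys as strings

print(text)
print(restored == record)
```

Here the restored object is no longer equal to the original: the tuple became a list and the integer key became the string '1996'.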

Pandas can also read JSON files with read_json.

In [18]:
import pandas as pd

pd.read_json('./data/library.json')

,title,author,genre,tags,book_id,copies,id
0,The Prophet,Khalil Gibran,poetry,"[religion, spiritualityphilosophy, Lebanon, Ar...",811.19,"[{'edition_year': 1996, 'checkouts': 486, 'bor...",
1,The Little Prince,Antoine de Saint-Exupery,children,"[fantasy, France, philosophy, illustrated, false]",,"[{'edition_year': 1983, 'checkouts': 634, 'bor...",843.912


### Compressed files (Gzip)

Another way we save storage and network resources is by using compression. Data sets often contain patterns that can be used to reduce the amount of space needed to store the information.

A simple example is the following list of numbers: 10,10,10,2,3,3,3,3,3,50,50,1,1,50,10,10,10,10

Rather than writing out the full list of numbers (18 integers), we can represent the same information with only 14 numbers, written as (count, value) pairs: (3,10), (1,2), (5,3), (2,50), (2,1), (1,50), (4,10)

We have successfully reduced the amount of numbers we need to represent the same data.
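
This scheme is known as run-length encoding. A minimal sketch using itertools.groupby follows; the function names run_length_encode and run_length_decode are hypothetical.

```python
from itertools import groupby

def run_length_encode(values):
    """Collapse consecutive repeats into (count, value) pairs."""
    return [(len(list(group)), value) for value, group in groupby(values)]

def run_length_decode(pairs):
    """Expand (count, value) pairs back into the original list."""
    return [value for count, value in pairs for _ in range(count)]

numbers = [10, 10, 10, 2, 3, 3, 3, 3, 3, 50, 50, 1, 1, 50, 10, 10, 10, 10]
encoded = run_length_encode(numbers)
decoded = run_length_decode(encoded)
print(encoded)
```

Decoding the pairs gives back the original 18 numbers, so no information is lost.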

In the world of data science, the most common compression format is Gzip (which uses the deflate algorithm). Gzip files end with the extension .gz.

In [6]:
import gzip

# A .docx file is binary (a zip archive), so open it with 'rb'.
# In text mode ('r') this read failed with a UnicodeDecodeError.
with open('./data/test_file.docx', 'rb') as f:
    data = f.read()

# gzip.open in 'wb' mode expects bytes, so no encode step is needed.
with gzip.open('./data/gzip_file', 'wb') as f:
    f.write(data)
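
A self-contained round trip can be sketched as follows. It assumes a temporary file named example.txt.gz and deliberately repetitive sample data, both hypothetical, and compares the size on disk with the size of the original bytes.

```python
import gzip
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'example.txt.gz')
original = ('the same line repeated many times\n' * 1000).encode('utf-8')

# gzip.open in binary mode takes and returns bytes.
with gzip.open(path, 'wb') as f:
    f.write(original)

with gzip.open(path, 'rb') as f:
    restored = f.read()

compressed_size = os.path.getsize(path)
print(len(original), compressed_size, restored == original)
```

How much space is saved depends on the data; highly repetitive text like this shrinks a lot, while already-compressed formats typically shrink very little.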

In [38]:
#!wget -P ./data/ https://archive.org/stream/TheEpicofGilgamesh_201606/eog_djvu.txt

### Serialization : Pickle

Often we will want to save our work in Python and come back to it later. That work might be a machine learning model or some other complex Python object. How do we save such objects? Python has a module for this purpose called pickle. We can use pickle to write a binary file that contains all the information about a Python object. Later we can load that pickle file and reconstruct the object in Python.

In [44]:
pickle_example = ['hello', {'a': 23, 'b': True}, (1, 2, 3), [['dogs', 'cats'], None]]

In [45]:
with open('./data/pickle_example.txt', 'w') as f:
    f.write(pickle_example)

TypeError: write() argument must be str, not list

Trying JSON instead:

In [48]:
with open('./data/pickle_example.txt', 'w') as f:
    json.dump(pickle_example, f)

In [52]:
with open('./data/pickle_example.txt', 'r') as f:
    pickled_json = json.load(f)
    
pickled_json

['hello', {'a': 23, 'b': True}, [1, 2, 3], [['dogs', 'cats'], None]]

Using pickle

Notice that the JSON round trip above turned the tuple (1, 2, 3) into the list [1, 2, 3]. Pickle can represent most Python objects and reconstruct them from disk exactly as they were in memory.

In [55]:
import pickle

with open('./data/pickle_example.pkl', 'wb') as f:
    pickle.dump(pickle_example, f)
    
with open('./data/pickle_example.pkl', 'rb') as f:
    reloaded_example = pickle.load(f)
    
reloaded_example

['hello', {'a': 23, 'b': True}, (1, 2, 3), [['dogs', 'cats'], None]]

In [56]:
pickle_example == reloaded_example

True
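
The same contrast can be shown in memory with pickle.dumps and pickle.loads. The object below is a hypothetical example built from types that json cannot serialize without extra help.

```python
import json
import pickle
from datetime import date

# Hypothetical object containing a set, a tuple, and a date.
obj = {'ids': {1, 2, 3}, 'point': (4, 5), 'due': date(2017, 2, 2)}

# json.dumps raises TypeError for values it does not know how to encode.
try:
    json.dumps(obj)
    json_error = None
except TypeError as exc:
    json_error = exc

# pickle.dumps returns bytes; pickle.loads rebuilds an equal object.
blob = pickle.dumps(obj)
restored = pickle.loads(blob)

print(type(blob).__name__, restored == obj, type(restored['point']).__name__)
```

The pickled result is a bytes object, which is why pickle files are opened with 'wb' and 'rb' rather than the text modes.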

Pickle is an important tool for data scientists. Data processing and training machine learning models can take a long time, and it is useful to save checkpoints.

### NumPy file formats

NumPy also has functions for saving and loading arrays. They are often used when working with image data.

In [60]:
import numpy as np
sample_array = np.random.random((4, 4))
sample_array

array([[0.81466505, 0.94315638, 0.19423007, 0.30790716],
       [0.44646576, 0.14831023, 0.64269213, 0.19052138],
       [0.14457983, 0.59444781, 0.95180546, 0.41446375],
       [0.46247811, 0.71568299, 0.35503824, 0.30609103]])

In [61]:
#to save plain text
np.savetxt('./data/sample_array.txt', sample_array)

In [62]:
print(np.loadtxt('./data/sample_array.txt'))

[[0.81466505 0.94315638 0.19423007 0.30790716]
 [0.44646576 0.14831023 0.64269213 0.19052138]
 [0.14457983 0.59444781 0.95180546 0.41446375]
 [0.46247811 0.71568299 0.35503824 0.30609103]]


To save in NumPy's binary .npy format (np.save does not compress; np.savez_compressed writes a compressed .npz archive):

In [63]:
np.save('./data/sample_array.npy', sample_array)

In [66]:
!cat ./data/sample_array.npy

'cat' is not recognized as an internal or external command,
operable program or batch file.


In [69]:
print(np.load('./data/sample_array.npy'))

[[0.81466505 0.94315638 0.19423007 0.30790716]
 [0.44646576 0.14831023 0.64269213 0.19052138]
 [0.14457983 0.59444781 0.95180546 0.41446375]
 [0.46247811 0.71568299 0.35503824 0.30609103]]


Topics used but not discussed

- Bash commands (!)
- wget
- str.split()
- APIs