In [1]:
# We can obtain data from a variety of sources, such as databases and web APIs, but a common way to obtain data is through
# a good old-fashioned file.
# So in this section, we will use a lot of what we have learnt so far to read data from and write data to files.

# Python can read files easily using only the standard library. It's just a case of specifying where the file is and then
# creating a stream to that location.
# Let's demonstrate with a file called test.csv that sits in a Files folder under the current working directory.
# We build the full path to this comma-separated file from os.getcwd() below.

import os

x = os.getcwd()
x

'C:\\Users\\HAFIFI\\Documents\\Python Jupyter Notebook\\The Python Book'

In [2]:
print(x)

C:\Users\HAFIFI\Documents\Python Jupyter Notebook\The Python Book


In [3]:
file_name = "\\Files\\test.csv"
file_path = x + file_name

file_path

'C:\\Users\\HAFIFI\\Documents\\Python Jupyter Notebook\\The Python Book\\Files\\test.csv'
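The path above is built by concatenating strings with Windows-style separators. A minimal, more portable sketch, assuming the same Files/test.csv layout under the working directory, lets the standard library choose the separator:

```python
import os
from pathlib import Path

base_dir = os.getcwd()

# os.path.join inserts the correct separator for the current operating system.
joined = os.path.join(base_dir, "Files", "test.csv")

# pathlib expresses the same idea with the / operator.
as_path = Path(base_dir) / "Files" / "test.csv"

print(joined)
print(as_path.name)
```

Either form can be passed straight to open.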

In [4]:
f = open(file_path,"r")
f

<_io.TextIOWrapper name='C:\\Users\\HAFIFI\\Documents\\Python Jupyter Notebook\\The Python Book\\Files\\test.csv' mode='r' encoding='cp1252'>

In [5]:
# What we have done here is build a string file_path containing the full path to the file and then call
# open with two arguments: file_path and 'r', the mode to open the file in, which in this case
# means read. We have assigned the return value to the variable f, which is a stream to the
# file. Now, to read in the data from the file we simply run:

data = f.read()
data

'hp,dell,ibm,asus,lenovo,fujitsu'
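The stream opened above is never closed. A common alternative is to open the file in a with block, which closes it automatically. The sketch below writes its own temporary stand-in for test.csv (the temporary location is an assumption made only so the example runs anywhere):

```python
import os
import tempfile

# Create a throwaway stand-in for test.csv so the example is self-contained.
tmp_dir = tempfile.mkdtemp()
file_path = os.path.join(tmp_dir, "test.csv")
with open(file_path, "w") as f:
    f.write("hp,dell,ibm,asus,lenovo,fujitsu")

# The with block closes the file when it ends, even if an error occurs inside it.
with open(file_path, "r") as f:
    data = f.read()

print(data)
print(f.closed)
```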

In [6]:
# What we get back is the whole file as a single string, which isn't that useful. In text files,
# lines are separated by a line return (\n), which means that we can apply
# the split method to break the string into a new list element every time it
# sees a line return. It's easier to demonstrate this with an example on a string:

names = "steve\ntony\nbruce\n"
names

'steve\ntony\nbruce\n'

In [7]:
print(names)

steve
tony
bruce



In [8]:
names.split("\n")

['steve', 'tony', 'bruce', '']

In [9]:
# Our file holds a single comma-separated line, so in this case we split on the comma:

f = open(file_path,"r")
data = f.read().split(",")
data

['hp', 'dell', 'ibm', 'asus', 'lenovo', 'fujitsu']

In [10]:
# With a comma separated file we have each item on a given line separated by a comma.
# So to get each item we need to split again based on a comma. 
# For the following:

names = "steve,rodgers\ntony,stark\nbruce,banner\n"
print(names)

steve,rodgers
tony,stark
bruce,banner



In [11]:
names = names.split("\n")
names

['steve,rodgers', 'tony,stark', 'bruce,banner', '']

In [12]:
for n in names:
    print(n)

steve,rodgers
tony,stark
bruce,banner



In [13]:
len(names)

4

In [14]:
names[0]

'steve,rodgers'

In [15]:
for n in names:
    row = n.split(",")
    print(row)

['steve', 'rodgers']
['tony', 'stark']
['bruce', 'banner']
['']


In [16]:
a = 0

# Loop by index, using the length of the list rather than a hard-coded 4
while a < len(names):
    row = names[a].split(",")
    a += 1
    print(row)

['steve', 'rodgers']
['tony', 'stark']
['bruce', 'banner']
['']


In [17]:
# Initially we split the string on the character \n to create a list containing four items:
# the three names plus a trailing empty string.
# We then loop over the list and for each item we split on the comma
# to create a list containing first and last name.
# We can do this in a single line, as opposed to looping over the names list, with a list comprehension as follows:

names = "steve,rodgers\ntony,stark\nbruce,banner\n"
names

'steve,rodgers\ntony,stark\nbruce,banner\n'

In [18]:
print(names)

steve,rodgers
tony,stark
bruce,banner



In [19]:
names = names.split("\n")
names

['steve,rodgers', 'tony,stark', 'bruce,banner', '']

In [20]:
names = [n.split(",") for n in names]
names

[['steve', 'rodgers'], ['tony', 'stark'], ['bruce', 'banner'], ['']]

In [21]:
# So in a single line we can achieve what we did in the loop.
# Here you can see we have basically moved the loop into a one-liner.
# With both implementations we get a list containing a single empty string as the last element.
# This happens because the text ends with a line return, so splitting on it leaves an empty string after the final \n.
# So with any file where we split on \n we need to account for the empty string. One way to do this
# is with the 'pop' method we introduced earlier:

names

[['steve', 'rodgers'], ['tony', 'stark'], ['bruce', 'banner'], ['']]

In [22]:
names.pop()

['']

In [23]:
names

[['steve', 'rodgers'], ['tony', 'stark'], ['bruce', 'banner']]
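As an alternative to popping the last element, the trailing empty string can be avoided in the first place. A small sketch, using the same names string as above:

```python
names = "steve,rodgers\ntony,stark\nbruce,banner\n"

# splitlines() does not produce a trailing empty string for the final line return.
rows = [line.split(",") for line in names.splitlines()]

# If the text is split on "\n" instead, empty entries can be filtered out.
rows_filtered = [line.split(",") for line in names.split("\n") if line]

print(rows)
```

Both approaches leave the list untouched when the text does not end in a line return, whereas pop would remove a real row in that case.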

In [24]:
# Now, we are able to read files. 
# The next thing to cover is how to write to files. 
# It works in much the same way as for reading from files in that we 
# first need to define the file name and open a stream to write to file.

file_name = "\\Files\\output.csv"
file_path = x + file_name

f = open(file_path,"w")
f

<_io.TextIOWrapper name='C:\\Users\\HAFIFI\\Documents\\Python Jupyter Notebook\\The Python Book\\Files\\output.csv' mode='w' encoding='cp1252'>

In [25]:
# This opens a stream in write mode to output.csv in the Files folder under the current working directory.
# To actually write something to the file, we first need to define what we want to put in it.

out_str = "something to go in the file\n"
f.write(out_str)
f.close()

In [26]:
# What we have done is create a string that we want to be in our file
# and then, using the stream's write method, written the string to the file.
# One thing we missed in the first example, when we read from a file, is that we
# did not close the file stream.
# Here, we see in the last line that we do this using the close method.
# Python will generally tidy up open files when you quit Python or when
# your program ends; however, it's good practice to close them explicitly in your code.

f = open(file_path,"r")
f.read()

'something to go in the file\n'

In [27]:
f = open(file_path,"r")
data = f.read()
data

'something to go in the file\n'

In [28]:
print(data)

something to go in the file



In [29]:
# Next, we will consider how to append to a file.
# This is very similar to writing to a file.
# However, when we open a file in write mode, we overwrite any existing file with the same name.
# With append, we keep the existing file and add whatever we want to the end of it.
# The way we do this is very similar to what we have seen before: we just use the append mode 'a'
# when opening the file, so to append to our output.csv we need to write the following:

file_path

'C:\\Users\\HAFIFI\\Documents\\Python Jupyter Notebook\\The Python Book\\Files\\output.csv'

In [30]:
f = open(file_path,"a")
f

<_io.TextIOWrapper name='C:\\Users\\HAFIFI\\Documents\\Python Jupyter Notebook\\The Python Book\\Files\\output.csv' mode='a' encoding='cp1252'>

In [31]:
out_str = "This is second line\n"
f.write(out_str)
f.close()

In [32]:
f = open(file_path,"r")
data = f.read()
data

'something to go in the file\nThis is second line\n'

In [33]:
print(data)

something to go in the file
This is second line
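The difference between write mode and append mode can be seen side by side in a minimal sketch. It uses a temporary file (an assumption, so that it does not touch the output.csv above):

```python
import os
import tempfile

file_path = os.path.join(tempfile.mkdtemp(), "output.csv")

with open(file_path, "w") as f:
    f.write("first attempt\n")

# Opening in "w" mode again truncates the file, so "first attempt" is lost.
with open(file_path, "w") as f:
    f.write("something to go in the file\n")

# Opening in "a" mode keeps the existing content and adds to the end.
with open(file_path, "a") as f:
    f.write("This is second line\n")

with open(file_path, "r") as f:
    content = f.read()

print(content)
```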



In [34]:
# Let's expand on this by applying reading and writing to a bigger example.
# What we are going to do is import a dataset from sklearn, which is a package.
# Note that load_boston was removed in scikit-learn 1.2, so this cell needs an older version.

from sklearn.datasets import load_boston
boston = load_boston()

In [35]:
# Here we load a dictionary-like object containing a dataset and the relevant details that we
# want to work on. We want to take the data and feature_names keys from this object
# and write them to a csv file.

feature_names = boston["feature_names"]
feature_names

array(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD',
       'TAX', 'PTRATIO', 'B', 'LSTAT'], dtype='<U7')

In [36]:
feature_names = boston["feature_names"][::2]
feature_names

array(['CRIM', 'INDUS', 'NOX', 'AGE', 'RAD', 'PTRATIO', 'LSTAT'],
      dtype='<U7')

In [37]:
# To make things a little more difficult, we take every other name again and drop the last two values.

headers = feature_names[::2][:-2]
headers

array(['CRIM', 'NOX'], dtype='<U7')

In [38]:
# This will give us the values that we want to put into our file.
# So the next thing we will do is open up the file and write the headers to it.

file_name = "\\Files\\boston_output.csv"
file_path = x + file_name

fo = open(file_path,"w")
fo.write(','.join(headers) + '\n')

# Here, the output of 9 refers to the 9 characters that we wrote to the file: the 8 in 'CRIM,NOX' plus the line return.

9

In [39]:
# What we want to do next is write the relevant data referring to the headers to the file.

boston_data = boston['data']

In [40]:
for bd in boston_data:
    row_dict = dict(zip(feature_names, bd))
    val_list = []
    for h in headers:
        val = row_dict[h]
        val_list.append(str(val))
out_str = ','.join(val_list)
fo.write(out_str + '\n')

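# What we have done here is assign the data to boston_data and then loop over it.
# Each row of data is zipped with feature_names to create a dictionary, which lets us
# select the relevant values to write to file: we loop over the headers, look up each
# value by its header key, and append it to a list. The join method is then applied in
# much the same way as for the headers.
# Two things to watch in the code as written. First, the join and write lines sit outside
# the for loop, so only the last row is written (the 14 below is its character count);
# indent them under the loop to write every row. Second, feature_names is the
# every-other-column slice, so zip pairs its seven names with the first seven values of
# each row, and the value labelled NOX is really the third column, INDUS.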

14
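A minimal sketch of a version that writes every row and pairs each value with its own column name is shown below. To stay self-contained it does not import sklearn: the 13 names are copied from the output earlier in this section, the two rows are the first two rows of the dataset typed in by hand, and the temporary output location is an assumption.

```python
import os
import tempfile

# All 13 column names, in their original order.
all_names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD',
             'TAX', 'PTRATIO', 'B', 'LSTAT']
headers = ['CRIM', 'NOX']

# Stand-in data: the first two rows of the dataset.
rows = [
    [0.00632, 18.0, 2.31, 0.0, 0.538, 6.575, 65.2, 4.09, 1.0, 296.0, 15.3, 396.9, 4.98],
    [0.02731, 0.0, 7.07, 0.0, 0.469, 6.421, 78.9, 4.9671, 2.0, 242.0, 17.8, 396.9, 9.14],
]

file_path = os.path.join(tempfile.mkdtemp(), "boston_output.csv")
with open(file_path, "w") as fo:
    fo.write(",".join(headers) + "\n")
    for bd in rows:
        # Zip against the full list of names so each value gets the right key.
        row_dict = dict(zip(all_names, bd))
        val_list = [str(row_dict[h]) for h in headers]
        # The write happens inside the loop, once per row.
        fo.write(",".join(val_list) + "\n")

with open(file_path) as f:
    lines = f.read().splitlines()

print(lines)
```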

In [41]:
# Lastly, we need to close the file. Technically, if we don't do this then
# Python will do it for us; however, it's good practice to do so.

fo.close()

In [42]:
# So, here we have created the output file, written the headers to it and then
# written the data, selecting only the columns that we want.
# Next, we read the file back in to check its contents.

file_path

'C:\\Users\\HAFIFI\\Documents\\Python Jupyter Notebook\\The Python Book\\Files\\boston_output.csv'

In [43]:
f = open(file_path,"r")
data = f.read().split('\n')
data.pop()
data

['CRIM,NOX', '0.04741,11.93']

In [44]:
for d in data:
    print(d)

# That is really about it when it comes to reading, writing, and appending to files.
# It's important to note that what we have shown only works for single-sheet data files.

CRIM,NOX
0.04741,11.93
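The manual split and join calls used so far can also be handed over to the standard library's csv module, which additionally copes with quoted fields. A small sketch, using an in-memory buffer in place of a real file (an assumption made to keep it self-contained):

```python
import csv
import io

rows = [["steve", "rodgers"], ["tony", "stark"], ["bruce", "banner"]]

# Write the rows as comma-separated lines into an in-memory text buffer.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerows(rows)
text = buffer.getvalue()

# Read them back; each line comes out as a list of strings.
read_back = list(csv.reader(io.StringIO(text)))

print(text)
print(read_back)
```

When working with a real file, the csv documentation recommends opening it with newline="" so that line endings are handled by the csv module itself.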


In [45]:
# EXCEL

# A more common type of file that you might want to open in Python is a spreadsheet-like file containing sheets of data.
# This could be in the form of an xls or xlsx file.
# Luckily there is a Python library called openpyxl, which allows us to write data to an Excel xlsx file and read it
# back in, as we will demonstrate. Note that openpyxl does not handle the older xls format.

from sklearn.datasets import load_boston

boston = load_boston()
feature_names = boston["feature_names"]
list(feature_names)

['CRIM',
 'ZN',
 'INDUS',
 'CHAS',
 'NOX',
 'RM',
 'AGE',
 'DIS',
 'RAD',
 'TAX',
 'PTRATIO',
 'B',
 'LSTAT']

In [46]:
feature_names = boston["feature_names"][::2]
list(feature_names)

['CRIM', 'INDUS', 'NOX', 'AGE', 'RAD', 'PTRATIO', 'LSTAT']

In [47]:
headers = list(feature_names)[::2][:-2]
headers

['CRIM', 'NOX']

In [48]:
# Next, we want to export this data into a sheet of an Excel workbook.

from openpyxl import Workbook

wb = Workbook()
sheet1 = wb.create_sheet('boston_data', 0)

# What we do here is import the relevant package and then create a Workbook. 
# For this Workbook we then create a sheet to write to and call it boston_data and insert it into position
# 0 which is the first position of the spreadsheet.

In [49]:
i = 1
for h in headers:
    sheet1.cell(1,i,h)
    i += 1

In [50]:
# In the previous cell we wrote the headers to our sheet. We want to insert the values into the first
# row, so we set a counter i to 1 to start at the first column and then increment it to insert
# subsequent values into the following columns. We use the cell method, passing
# in row, column, and value, with the row fixed at 1.

j = 2
boston_data = boston['data'][0:5]
for bd in boston_data:
    k = 1
    row_dict = dict(zip(feature_names, bd))
    for h in headers:
        val = row_dict[h]
        sheet1.cell(j, k, val)
        k += 1
    j += 1

In [51]:
# In the previous cell we wrote the first five rows of the data to the sheet using the same
# cell method; however, now we need to increment both rows and columns because
# we have multiple rows. To do so, the row counter j is set up outside the loop and the column
# counter k inside it. This is because we need to reset the column for every row, as we
# want to go back to column 1, hence k is set to 1 every time we start
# writing a row.

file_name = "\\Files\\boston.xlsx"
file_path = x + file_name

wb.save(file_path)

# Lastly, to save the data we just use the save method on the workbook and pass in the name
# of the file we want to save.
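The counters i, j, and k can also be produced by enumerate with a start value. The sketch below only computes the (row, column, value) triples that would be passed to sheet.cell, so it runs without openpyxl; the two data rows are stand-ins taken from the values used in this section.

```python
headers = ["CRIM", "NOX"]
data_rows = [[0.00632, 2.31], [0.02731, 7.07]]

cells = []

# Header row: fixed at row 1, with columns numbered from 1.
for col, h in enumerate(headers, start=1):
    cells.append((1, col, h))

# Data rows start at row 2; the column count restarts at 1 for every row.
for row, values in enumerate(data_rows, start=2):
    for col, val in enumerate(values, start=1):
        cells.append((row, col, val))

for triple in cells:
    print(triple)
```

Each triple maps directly onto a call of the form sheet1.cell(row, column, value) as used above.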

In [52]:
from openpyxl import load_workbook
wb = load_workbook(file_path)
wb

<openpyxl.workbook.workbook.Workbook at 0x20a3c7e0820>

In [53]:
wb.worksheets

[<Worksheet "boston_data">, <Worksheet "Sheet">]

In [54]:
sheet = wb['boston_data']
sheet

<Worksheet "boston_data">

In [55]:
# We can then access the specific sheet using dictionary notation treating the sheet name as the key. 
# To get the values we use row and column indexes:

sheet[1][0].value

'CRIM'

In [56]:
sheet[1][1].value

'NOX'

In [57]:
sheet[2][0].value

0.00632

In [58]:
# Note that sheet[1] returns the cells of row 1 as a tuple, so the columns are zero indexed here
# even though we wrote to column 1 in the code that created the file. We get the stored value
# from each cell through its value attribute.

In [59]:
# JSON

# JSON stands for JavaScript Object Notation. It has become very popular as a data format and is widely used.
# It's described as a lightweight data-interchange format.
# But what does that actually mean? It's really a text format for storing data that is easy for us to
# read and write, and also easy for computers to parse and generate.
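To make the format concrete, here is a small sketch that parses a made-up JSON document (its keys and values are purely illustrative) with the standard json module and shows the Python types that come back:

```python
import json

# A hypothetical JSON document: an object holding a string, an array, and a nested object.
text = '{"name": "test", "values": [1, 2.5, null], "nested": {"ok": true}}'

parsed = json.loads(text)

print(type(parsed))            # JSON objects become dictionaries
print(type(parsed["values"]))  # JSON arrays become lists
print(parsed)
```

JSON null and true arrive as Python None and True.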

In [60]:
# For a Python user, JSON will appear to be a mixture of lists and dictionaries in that you can have collections of 
# key value pairs like in a dictionary but also have data stored in a manner like a list. 
# Let’s take the example that we have used previously and create a json representation of the data.

from sklearn.datasets import load_boston

In [61]:
boston = load_boston()

In [62]:
dir(boston)

['DESCR', 'data', 'feature_names', 'filename', 'target']

In [63]:
boston.feature_names

array(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD',
       'TAX', 'PTRATIO', 'B', 'LSTAT'], dtype='<U7')

In [64]:
feature_names = boston['feature_names'][::2]
feature_names

array(['CRIM', 'INDUS', 'NOX', 'AGE', 'RAD', 'PTRATIO', 'LSTAT'],
      dtype='<U7')

In [65]:
list(feature_names)

['CRIM', 'INDUS', 'NOX', 'AGE', 'RAD', 'PTRATIO', 'LSTAT']

In [66]:
headers = list(feature_names)[::2][:-2]
headers

['CRIM', 'NOX']

In [67]:
boston_data = boston['data'][0:5]
boston_data

array([[6.3200e-03, 1.8000e+01, 2.3100e+00, 0.0000e+00, 5.3800e-01,
        6.5750e+00, 6.5200e+01, 4.0900e+00, 1.0000e+00, 2.9600e+02,
        1.5300e+01, 3.9690e+02, 4.9800e+00],
       [2.7310e-02, 0.0000e+00, 7.0700e+00, 0.0000e+00, 4.6900e-01,
        6.4210e+00, 7.8900e+01, 4.9671e+00, 2.0000e+00, 2.4200e+02,
        1.7800e+01, 3.9690e+02, 9.1400e+00],
       [2.7290e-02, 0.0000e+00, 7.0700e+00, 0.0000e+00, 4.6900e-01,
        7.1850e+00, 6.1100e+01, 4.9671e+00, 2.0000e+00, 2.4200e+02,
        1.7800e+01, 3.9283e+02, 4.0300e+00],
       [3.2370e-02, 0.0000e+00, 2.1800e+00, 0.0000e+00, 4.5800e-01,
        6.9980e+00, 4.5800e+01, 6.0622e+00, 3.0000e+00, 2.2200e+02,
        1.8700e+01, 3.9463e+02, 2.9400e+00],
       [6.9050e-02, 0.0000e+00, 2.1800e+00, 0.0000e+00, 4.5800e-01,
        7.1470e+00, 5.4200e+01, 6.0622e+00, 3.0000e+00, 2.2200e+02,
        1.8700e+01, 3.9690e+02, 5.3300e+00]])

In [68]:
# So, what we have done above is what has been done previously, however here we differ by selecting only the 
# first five elements of the data which will allow us to show the data in json representation.

json_list = []

for bd in boston_data:
    row_dict = dict(zip(feature_names, bd))
    val_dict = {}
    for h in headers:
        val = row_dict[h]
        val_dict[h] = val
    json_list.append(val_dict)
print(json_list)

[{'CRIM': 0.00632, 'NOX': 2.31}, {'CRIM': 0.02731, 'NOX': 7.07}, {'CRIM': 0.02729, 'NOX': 7.07}, {'CRIM': 0.03237, 'NOX': 2.18}, {'CRIM': 0.06905, 'NOX': 2.18}]


In [69]:
# The code above gets the data into a format to export to json.
# As mentioned before, we can achieve this via a combination of dictionaries and lists.
# So, initially we create a list to put every row of our data into.
# A row can be represented as a dictionary, which in this case simply holds a key-value pair for each of the two
# feature names that have been assigned to the headers.
# What we end up with is a list of dictionaries, which we will look to export as json.
# Note that because feature_names is the every-other-column slice, zip pairs the names with the first seven values
# of each row, so the value stored under 'NOX' (2.31 in the first row) is actually the third column, INDUS.

import json

file_name = "\\Files\\boston.json"
file_path = x + file_name

with open(file_path, 'w') as write_file:
    json.dump(json_list, write_file, indent=4)

In [70]:
# To create the json output, we can use the json package and the dump method passing in the list and an open file 
# as the arguments.
# The next part we need to cover is how to read the json file back into Python, luckily this is easily achieved using 
# the json library.

import json

file_name = "\\Files\\boston.json"
file_path = x + file_name

with open(file_path, 'r') as read_file:
    data = json.load(read_file)

data

[{'CRIM': 0.00632, 'NOX': 2.31},
 {'CRIM': 0.02731, 'NOX': 7.07},
 {'CRIM': 0.02729, 'NOX': 7.07},
 {'CRIM': 0.03237, 'NOX': 2.18},
 {'CRIM': 0.06905, 'NOX': 2.18}]

In [71]:
type(data)

list

In [72]:
# As we did when writing to file, we open the file, this time in read mode, and pass it to the load method, which
# returns the contents of the file as the data object, here of type list.
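The build, dump, and load steps can be condensed into a short round trip. The sketch below assumes each row is already a dictionary keyed by its true column name (values taken from the first two rows of the dataset) and uses dumps and loads, the string-based counterparts of dump and load, so no file is needed:

```python
import json

headers = ["CRIM", "NOX"]

# Stand-in rows keyed by column name.
row_dicts = [
    {"CRIM": 0.00632, "INDUS": 2.31, "NOX": 0.538},
    {"CRIM": 0.02731, "INDUS": 7.07, "NOX": 0.469},
]

# A dictionary comprehension keeps only the wanted keys for each row.
json_list = [{h: row[h] for h in headers} for row in row_dicts]

# dumps returns the JSON text; loads turns it back into Python objects.
text = json.dumps(json_list, indent=4)
data = json.loads(text)

print(text)
```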

In [73]:
# XML

# XML stands for Extensible Markup Language and, much like JSON, it is a way to store data that is easy for us to
# read and write but at the same time easy for computers to parse and generate.
# Unlike JSON, it doesn't have a natural link to Python data types and so needs a bit more of an introduction to its types
# and how it works.
# Let's explain using the example below.

Now let’s deconstruct the above example:

The first line is the xml declaration and it could have simply been written as follows:

If we had some specific encoding to use in the xml file we could rewrite it as follows:

Next, we have the following:

- The opening <catalog> and closing </catalog> tags mark the root element of the XML, the start and end of the content.
- The name used is arbitrary and in this case just reflects the data we have.
- You will notice the / in the closing tag; this is what distinguishes a closing tag from its opening tag.
- Next, we add in a further level down as follows:

- Here, we have defined a book using the opening <book> and closing </book> tags and, unlike at the root level, we have attached data to this level by adding the attribute **id="bk101"**.
- This is the high-level book data; to add more specific data about the book we can do so as follows:

- Under the book level, we have added elements for author, title, genre, price, publish_date, and description.
- As before, you can see that each element has an opening and a closing tag, as introduced earlier.
- Lastly, to add another book you would do so as follows:

- We can create another book under our initial book in much the same way as we did before.
- The way we distinguish each book is by using its own id.
- What we have shown here is how we can build interesting data structures using XML.
- The next question to address is how we can create and parse XML objects.
- To do this we use lxml which is a Python library that allows the user to take advantage of the C libraries libxml2 and libxslt. 
- These are very fast XML processing libraries that are easily accessible through Python.
- As we have done earlier in the chapter, we will use the same example and show how you can create XML from it.
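The structure described above (a catalog root, a book with an id attribute, and child elements beneath it) can be sketched in code. This sketch uses the standard library's xml.etree.ElementTree as a stand-in, since its Element and SubElement calls work in much the same way as those of lxml.etree; the element names come from the text, while every value is a placeholder.

```python
import xml.etree.ElementTree as ET

# Root element of the document.
catalog = ET.Element("catalog")

# A book one level down, carrying its id as an attribute.
book = ET.SubElement(catalog, "book", id="bk101")

# Placeholder values for the child elements named in the text.
fields = {
    "author": "Placeholder Author",
    "title": "Placeholder Title",
    "genre": "Placeholder Genre",
    "price": "0.00",
    "publish_date": "2000-01-01",
    "description": "Placeholder description.",
}
for name, value in fields.items():
    ET.SubElement(book, name).text = value

xml_text = ET.tostring(catalog, encoding="unicode")
print(xml_text)
```

A second book would be another SubElement of catalog with its own id.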

In [74]:
from sklearn.datasets import load_boston

In [75]:
boston = load_boston()

In [76]:
dir(boston)

['DESCR', 'data', 'feature_names', 'filename', 'target']

In [77]:
feature_names = boston['feature_names'][::2]
feature_names

array(['CRIM', 'INDUS', 'NOX', 'AGE', 'RAD', 'PTRATIO', 'LSTAT'],
      dtype='<U7')

In [78]:
list(feature_names)

['CRIM', 'INDUS', 'NOX', 'AGE', 'RAD', 'PTRATIO', 'LSTAT']

In [79]:
headers = list(feature_names)[::2][:-2]
headers

['CRIM', 'NOX']

In [80]:
boston_data = boston['data'][0:5]
boston_data

array([[6.3200e-03, 1.8000e+01, 2.3100e+00, 0.0000e+00, 5.3800e-01,
        6.5750e+00, 6.5200e+01, 4.0900e+00, 1.0000e+00, 2.9600e+02,
        1.5300e+01, 3.9690e+02, 4.9800e+00],
       [2.7310e-02, 0.0000e+00, 7.0700e+00, 0.0000e+00, 4.6900e-01,
        6.4210e+00, 7.8900e+01, 4.9671e+00, 2.0000e+00, 2.4200e+02,
        1.7800e+01, 3.9690e+02, 9.1400e+00],
       [2.7290e-02, 0.0000e+00, 7.0700e+00, 0.0000e+00, 4.6900e-01,
        7.1850e+00, 6.1100e+01, 4.9671e+00, 2.0000e+00, 2.4200e+02,
        1.7800e+01, 3.9283e+02, 4.0300e+00],
       [3.2370e-02, 0.0000e+00, 2.1800e+00, 0.0000e+00, 4.5800e-01,
        6.9980e+00, 4.5800e+01, 6.0622e+00, 3.0000e+00, 2.2200e+02,
        1.8700e+01, 3.9463e+02, 2.9400e+00],
       [6.9050e-02, 0.0000e+00, 2.1800e+00, 0.0000e+00, 4.5800e-01,
        7.1470e+00, 5.4200e+01, 6.0622e+00, 3.0000e+00, 2.2200e+02,
        1.8700e+01, 3.9690e+02, 5.3300e+00]])

The full code to write the data to xml is as follows:

In [81]:
from lxml import etree

In [82]:
root = etree.Element("root")

In [83]:
for bd in boston_data:
    row_dict = dict(zip(feature_names, bd))
    row = etree.SubElement(root, "row")
    for h in headers:
        child = etree.SubElement(row, h)
        val = row_dict[h]
        child.text = str(val)
        
et = etree.ElementTree(root)

file_name = "\\Files\\boston.xml"
file_path = x + file_name

et.write(file_path, pretty_print=True)

Breaking this down, we first import lxml and then create the root of our xml document.

Next, we have to loop over the data in a similar way that we have done before to put the data into our xml.

- The mechanism of looping over the data is no different to what we have seen: we create the same row_dict and loop over the headers to get the values.
- The difference is in how we set up the xml and where we write to.
- For each iteration across boston_data, we create an element called row under the root using the SubElement method, assigning root as the parent.
- Then, for every value we obtain from looping over the headers, we create another SubElement, this time with row as the parent and the header as its name.
- We assign the value by setting the element's text attribute to the value converted to a string.
- This then gives us the format of the data.

- The last part is to write the data to file so we can make use of the write method by passing the root of the document through ElementTree. 
- Note that we set the pretty_print to be True, which gives the following file:
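The shape of the document that the loop builds can be seen in a minimal sketch. It uses the standard library's xml.etree.ElementTree as a stand-in for lxml, two stand-in rows with the values used in this section, and no pretty printing, so the result is a single line of text:

```python
import xml.etree.ElementTree as ET

headers = ["CRIM", "NOX"]
rows = [{"CRIM": 0.00632, "NOX": 2.31}, {"CRIM": 0.02731, "NOX": 7.07}]

root = ET.Element("root")
for row_dict in rows:
    # One row element per data row, directly under the root.
    row = ET.SubElement(root, "row")
    for h in headers:
        # One child per header, holding the value as text.
        child = ET.SubElement(row, h)
        child.text = str(row_dict[h])

xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```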

Now, we will show how you can read an XML file in using lxml in Python using the example below.

In [84]:
from lxml import objectify

In [85]:
file_name = "\\Files\\boston.xml"
file_path = x + file_name

# Passing the path lets lxml open and close the file itself
xml = objectify.parse(file_path)

In [86]:
root = xml.getroot()

In [87]:
# list(root) gives the child elements; getchildren() is deprecated
children = list(root)

In [88]:
print(children)

[<Element row at 0x20a3a120940>, <Element row at 0x20a3a120880>, <Element row at 0x20a3a1206c0>, <Element row at 0x20a3a120500>, <Element row at 0x20a3a120680>]


In [89]:
for c in children:
    print(c['CRIM'])
    print(c['NOX'])

0.00632
2.31
0.02731
7.07
0.02729
7.07
0.03237
2.18
0.06905
2.18
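The same traversal can be sketched with the standard library's xml.etree.ElementTree, used here as a stand-in for lxml's objectify. The XML text is a two-row stand-in for boston.xml, parsed from a string rather than a file so the example is self-contained:

```python
import xml.etree.ElementTree as ET

xml_text = (
    "<root>"
    "<row><CRIM>0.00632</CRIM><NOX>2.31</NOX></row>"
    "<row><CRIM>0.02731</CRIM><NOX>7.07</NOX></row>"
    "</root>"
)

root = ET.fromstring(xml_text)

# Iterating over an element yields its children, here the row elements.
values = []
for row in root:
    # findtext returns the text of the first matching child as a string.
    values.append((float(row.findtext("CRIM")), float(row.findtext("NOX"))))

print(values)
```

Unlike objectify, this approach returns the text as strings, so the conversion to float is explicit.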


So, what we have done here is to import objectify from lxml, which will be used to read in the XML.

- Here, we are reading in the XML file and parsing it using the parse method of objectify.
- This gives us an XML object which we can then use to try and parse out the information.
- Next, we look to get the root of the document using:

Having obtained the root, we look to get the children of this which represents the next level down which are the rows.

- Now, to access the values, we can loop through the children as that object is simply a list.
- By doing so we can obtain and print the values as follows:

- These refer to the values in the dataset which we created.

This chapter has covered some important concepts relating to files and how to read from and write to them using Python. We have covered a number of different file types and given practical examples of how these work. We will show later in the book other approaches to reading and writing files, but these somewhat low-level approaches are very important when we want a high level of control over how the data is manipulated, and they are a great tool to have in your arsenal.