# 2. Automating Files and Filesystems

In DevOps, you are continually parsing, searching and changing the text in files. Files are a means of persisting the state of data, code and configuration.

Automating configuration updates, rather than following a set of manual instructions, helps to reduce errors and saves time.

# Reading and Writing Files

__Opening files__
* `open` function creates a file object that can read from and write to a file
	- `path`: path of the file
	- `mode`: how to open the file: read (`r`, the default), write (`w`) or append (`a`), optionally combined with `t` for text (the default) or `b` for binary
* `read` method returns the contents of the file as a string

__Readlines__
* `readlines` method splits the content on newline characters and returns a list of strings, one per line; each string keeps its trailing newline

__with statements__
* No need to close the file explicitly; Python closes it when execution leaves the indented block

__Opening Binary files__
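* Windows systems use `\r\n` for newlines, Unix systems use `\n`; in text mode Python translates between them and decodes the bytes into characters
* Binary files, such as _.jpeg_ images, are likely to be corrupted if opened as text
* Appending `b` to the mode (e.g. `rb`) prevents this: the data is read and written as raw bytes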
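
The difference between `read`, `readlines` and binary mode can be seen on a small throwaway file. This is a minimal sketch: the file name `sample.txt` and its contents are made up for the demo and are not part of the notes above.

```python
import tempfile
from pathlib import Path

# Hypothetical sample file, created in a temporary directory for the demo.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = Path(tmp_dir) / "sample.txt"
    sample_path.write_bytes(b"first line\nsecond line\n")

    # read() returns the whole file as a single string.
    with open(sample_path, "r") as open_file:
        whole_text = open_file.read()

    # readlines() returns one string per line, trailing newline included.
    with open(sample_path, "r") as open_file:
        lines = open_file.readlines()

    # Mode "rb" returns raw bytes, with no decoding or newline translation.
    with open(sample_path, "rb") as open_file:
        raw_bytes = open_file.read()

print(repr(whole_text))
print(lines)
print(raw_bytes[0])  # indexing a bytes object gives an integer, not a character
```

Indexing the bytes object returns an integer, which is why the binary example in these notes prints a number rather than a character.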

In [17]:
# Read File
file_path = "hamilton.txt"
open_file = open(file_path, "r")
text = open_file.read()
print(f"Length of text: {len(text)}")
print(text[56])
open_file.close()

# Readlines
file_path = "hamilton.txt"
open_file = open(file_path, "r")
text = open_file.readlines()
print(f"Length of text: {len(text)}")
print(text[56])
open_file.close()

# with statement
with open(file_path, "r") as open_file:
	text = open_file.readlines()
print(text[42])

# Opening a binary file
file_path = "1584529319.jpg"
with open(file_path, "rb") as open_file:
	btext = open_file.read()
print(btext[28])

Length of text: 2965
n
Length of text: 72
Will they know what you overcame?

In New York you can be a new man (just you wait)

0


__Write to file__
* `w` mode: opens the file for writing
* _For DevOps_: the tool `direnv` is used to set up development environments automatically. When you enter a directory, it looks for a file named `.envrc` and uses it to create the environment.
	- Writing this file from a script is a convenient way to specify the type of environment
* `w` mode overwrites the file if it already exists

__Append to existing file__
* `a` mode: appends new text to the end of the file

Similarly, for binary files add `b` to the end of the mode (`wb`, `ab`)
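
A short sketch of how the two modes differ, run against a throwaway `.envrc` in a temporary directory. The `STAGE=DEV` value is invented for the demo; the other values are borrowed from the cell below.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp_dir:
    env_path = Path(tmp_dir) / ".envrc"  # throwaway copy, not a real project file

    # "w" creates the file, or truncates it if it already exists.
    with open(env_path, "w") as opened_file:
        opened_file.write("export STAGE=DEV\n")  # hypothetical first value
    with open(env_path, "w") as opened_file:
        opened_file.write("export STAGE=PROD\n")  # replaces the previous content

    # "a" keeps the existing content and adds to the end.
    with open(env_path, "a") as opened_file:
        opened_file.write("export TABLE_ID=token-storage-1234\n")

    final_text = env_path.read_text()

print(final_text)
```

Only the second `w` write survives, followed by the appended line.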

In [22]:
text = '''export STAGE=PROD
export TABLE_ID=token-storage-1234'''

with open(".envrc", "w") as opened_file:
	opened_file.write(text)

It is good practice to close a file when you are finished with it. Python closes a file once the file object is garbage collected (typically after it goes out of scope), but until then the file continues to consume resources and may prevent other processes from opening it.
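
One reason to prefer a `with` block over calling `close` by hand is that the file is closed even when the block exits with an exception. A minimal sketch, using a made-up file name and a simulated failure:

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp_dir:
    log_path = Path(tmp_dir) / "demo.log"  # hypothetical file name

    try:
        with open(log_path, "w") as opened_file:
            opened_file.write("partial write\n")
            raise RuntimeError("simulated failure mid-write")
    except RuntimeError:
        pass

    # The with block closed the file even though an exception was raised.
    closed_after_error = opened_file.closed

print(closed_after_error)
```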

## Introduction to `pathlib`

In [21]:
import pathlib
path = pathlib.Path("hamilton.txt")
path.read_text()

path = pathlib.Path(".envrc")
path.write_text("LOG:DEBUG")

9
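
The `9` above is the return value of `write_text`: the number of characters written. A slightly fuller sketch of `pathlib`, assuming a made-up `settings/app.env` layout inside a temporary directory:

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp_dir:
    config_path = Path(tmp_dir) / "settings" / "app.env"  # hypothetical layout

    # Create the parent directory, then write and read back in one call each.
    config_path.parent.mkdir(parents=True, exist_ok=True)
    chars_written = config_path.write_text("LOG:DEBUG")
    content = config_path.read_text()

    # Path objects also expose the pieces of the path.
    name = config_path.name
    suffix = config_path.suffix
    exists = config_path.exists()

print(chars_written, content, name, suffix, exists)
```

`read_text` and `write_text` open and close the file internally, so no `with` block is needed for small files.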

## Working with JSON

Use the `json` module

__Opening a JSON file__
* `json.load()`: open the file with `open`, then parse its contents into Python objects (a JSON object becomes a dictionary)

__Writing a JSON file__
* `json.dump()`: open the file with `open` in write mode, then serialize a dictionary to it in JSON format

In [26]:
import json
from pprint import pprint
with open('deploy.json', 'r') as opened_file:
	policy = json.load(opened_file)
pprint(policy)

# Changing the resource access to "S3"
policy["Statement"]["Resource"] = "S3"
pprint(policy)

# Writing to JSON
new_dict = {
	"Statement": {
		"Action": "service-prefix:action-name",
		"Condition": {
			"Something": "Greater than 10"
		}
	},
	"Version": "2020-10-18"
}
with open("new_deploy.json", "w") as opened_file:
	json.dump(new_dict, opened_file)

{'Statement': {'Action': 'service-prefix:action-name',
               'Condition': {'DateGreaterThan': {'aws:CurrentTime': '2017-07-01T00:00:00Z'},
                             'DateLessThan': {'aws:CurrentTime': '2017-12-31T23:59:59Z'}},
               'Effect': 'Allow',
               'Resource': '*'},
 'Version': '2012-10-17'}
{'Statement': {'Action': 'service-prefix:action-name',
               'Condition': {'DateGreaterThan': {'aws:CurrentTime': '2017-07-01T00:00:00Z'},
                             'DateLessThan': {'aws:CurrentTime': '2017-12-31T23:59:59Z'}},
               'Effect': 'Allow',
               'Resource': 'S3'},
 'Version': '2012-10-17'}
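
`json.load` and `json.dump` work on file objects; `json.loads` and `json.dumps` are the string counterparts and are handy for a quick round trip. A minimal sketch with a made-up, policy-like dictionary (not the contents of `deploy.json`):

```python
import json

# Hypothetical policy-like dictionary, used only for this demo.
policy = {"Version": "2012-10-17", "Statement": {"Effect": "Allow", "Resource": "*"}}

# dumps() serializes to a string; indent and sort_keys keep the output readable
# and stable between runs, which helps when the file is kept under version control.
as_text = json.dumps(policy, indent=2, sort_keys=True)
print(as_text)

# loads() parses the string back into new Python objects.
round_tripped = json.loads(as_text)
same_before_edit = round_tripped == policy

# Editing the parsed copy leaves the original dictionary untouched.
round_tripped["Statement"]["Resource"] = "S3"
print(same_before_edit, policy["Statement"]["Resource"])
```

The same `indent` and `sort_keys` arguments are accepted by `json.dump` when writing to a file.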


## Working with YAML

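YAML originally stood for Yet Another Markup Language and is now the recursive acronym "YAML Ain't Markup Language". It is a superset of JSON with a more compact format that uses whitespace for structure.

Ansible uses the YAML format for its _playbooks_. The `yaml` module imported below comes from the third-party PyYAML package.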

In [35]:
import yaml

with open("playbook.yaml", "r") as opened_file:
	verify_apache = yaml.safe_load(opened_file)
pprint(verify_apache)

with open("new_playbook.yaml", "w") as written_file:
	yaml.dump(new_dict, written_file)

[{'hosts': 'webservers',
  'tasks': [{'name': 'ensure apache is at the latest version',
             'yum': {'name': 'httpd', 'state': 'latest'}}],
  'vars': {'http_port': 80, 'max_clients': 200, 'remote_user': 'root'}}]


## Working with XML

XML stands for Extensible Markup Language
* Consists of hierarchical documents of tagged elements
* Web systems have used XML to transport data, e.g. Really Simple Syndication (RSS) feeds
	- RSS feeds use XML-formatted pages to track and notify users of updates to websites
* Python maps an XML document's hierarchical structure to a tree-like data structure. Nodes are elements, and a parent-child relationship models the hierarchy

Use the `xml` module

In [37]:
import xml.etree.ElementTree as ET

tree = ET.parse("books.xml")
root = tree.getroot()
print(root)

# iterating over child nodes
for child in root:
	print(child.tag, child.attrib)

# namespacing

<Element 'catalog' at 0x0000020A6929AD90>
book {'id': 'bk101'}
book {'id': 'bk102'}
book {'id': 'bk103'}
book {'id': 'bk104'}
book {'id': 'bk105'}
book {'id': 'bk106'}
book {'id': 'bk107'}
book {'id': 'bk108'}
book {'id': 'bk109'}
book {'id': 'bk110'}
book {'id': 'bk111'}
book {'id': 'bk112'}
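
Beyond iterating over children, `find` and `findall` look up elements by tag, and namespaced documents need a prefix map. The sketch below parses inline strings instead of `books.xml`; the titles, ids and the `http://example.com/books` namespace URI are invented for the demo.

```python
import xml.etree.ElementTree as ET

# Hypothetical inline document standing in for a file such as books.xml.
plain_xml = """
<catalog>
  <book id="bk101"><title>First Title</title></book>
  <book id="bk102"><title>Second Title</title></book>
</catalog>
"""
root = ET.fromstring(plain_xml)

# findall() takes a path relative to the element it is called on.
titles = [book.find("title").text for book in root.findall("book")]
print(titles)

# With a default namespace, tags are stored as "{uri}tag",
# so lookups need a prefix-to-URI mapping.
namespaced_xml = """
<catalog xmlns="http://example.com/books">
  <book id="bk201"><title>Namespaced Title</title></book>
</catalog>
"""
ns_root = ET.fromstring(namespaced_xml)
namespaces = {"bk": "http://example.com/books"}  # the prefix name is arbitrary

ns_ids = [book.attrib["id"] for book in ns_root.findall("bk:book", namespaces)]
unqualified = ns_root.findall("book")  # finds nothing: the tag is namespaced

print(ns_root.tag)
print(ns_ids, len(unqualified))
```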


## Working with CSV

Use either the standard-library `csv` module or the third-party `pandas` library

In [38]:
import csv

file_path = "NewReport.csv"
with open(file_path, newline="") as csv_file:
	off_reader = csv.reader(csv_file, delimiter=",")
	for _ in range(5):
		print(next(off_reader))

['DataMessageGUID', 'SensorID', 'Sensor Name', 'Date', 'Value', 'Formatted Value', 'Battery', 'Raw Data', 'Sensor State', 'GatewayID', 'Alert Sent', 'Signal Strength', 'Voltage', 'Special Export Value']
['b67432ae-2bbb-42b2-8414-bec45135ca66', '630899', 'G-force - Max & Avg - 630899', '05/17/2022 12:02 AM', '0.059', 'X Max: 0.059 g , Y Max: 0.866 g , Z Max: 0.463 g , Magnitude Max: 0.975 g , X Avg: 0.05 g , Y Avg: 0.855 g , Z Avg: 0.442 g , Magnitude Mean: 0.964 g', '100', '0.059|0.866|0.463|0.975|0.05|0.855|0.442|0.964|0', '0', '971850', 'False', '0', '2.99']
['5d060b22-36d2-4c7f-9ec4-b4f8cf85d5cd', '630899', 'G-force - Max & Avg - 630899', '05/17/2022 12:12 AM', '0.16', 'X Max: 0.16 g , Y Max: 0.946 g , Z Max: 0.554 g , Magnitude Max: 1.044 g , X Avg: 0.03 g , Y Avg: 0.843 g , Z Avg: 0.462 g , Magnitude Mean: 0.961 g', '100', '0.16|0.946|0.554|1.044|0.03|0.843|0.462|0.961|0', '0', '971850', 'False', '0', '2.99']
['03ad74ec-6535-422f-a30a-c89a9db45831', '630899', 'G-force - Max & Avg 
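
When columns are easier to address by name than by position, `csv.DictReader` is an alternative to `csv.reader`. A minimal sketch on an in-memory sample: the three-column layout is a cut-down assumption, with column names and values borrowed from the report above.

```python
import csv
import io

# Hypothetical sample data held in memory instead of NewReport.csv.
sample_csv = io.StringIO(
    "SensorID,Date,Value\n"
    "630899,05/17/2022 12:02 AM,0.059\n"
    "630899,05/17/2022 12:12 AM,0.16\n"
)

# DictReader uses the first row as keys, so columns are accessed by name.
reader = csv.DictReader(sample_csv)
rows = list(reader)

# Every field is read as a string; convert explicitly where numbers are needed.
values = [float(row["Value"]) for row in rows]
print(rows[0]["Date"], max(values))
```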

In [61]:
import pandas as pd

df = pd.read_csv("NewReport.csv")
df.head()
df.describe()
df["SensorID"]

0      630899
1      630899
2      630899
3      630899
4      630899
        ...  
210    630899
211    630899
212    630899
213    630899
214    630899
Name: SensorID, Length: 215, dtype: int64

# Using regex to Search Text

* Apache HTTP server is an open source web server widely used to serve web content
* Server can be configured to save log files in different formats
	- Common Log Format (CLF) below

```
<IP Address> <Client Id> <User Id> <Time> <Request> <Status> <Size>
127.0.0.1 - swills [13/Nov/2019:14:43:30 -0800] "GET /assets/234 HTTP/1.0" 200 2326
```


In [62]:
import re

# Pulling out IP address from a line
line = '127.0.0.1 - rj [13/Nov/2019:14:43:30] "GET HTTP/1.0" 200'
re.search(r"(?P<IP>\d+\.\d+\.\d+\.\d+)", line)

m = re.search(r"(?P<IP>\d+\.\d+\.\d+\.\d+)", line)
print(m.group("IP"))

# Getting time
r = r"\[(?P<Time>\d\d/\w{3}/\d{4}:\d{2}:\d{2}:\d{2})\]"
m = re.search(r, line)
print(m.group("Time"))

# Multiple Elements
r = r"(?P<IP>\d+\.\d+\.\d+\.\d+)"
r += r" - (?P<User>\w+) "
r += r"\[(?P<Time>\d\d/\w{3}/\d{4}:\d{2}:\d{2}:\d{2})\]"
r += r' (?P<Request>".+")'
m = re.search(r, line)
print(m.group("User"))
print(m.group("Request"))

Use regex to pull information from the whole log, e.g. pulling all of the IP addresses for the GET requests that happened on November 8, 2019. The pattern can be modified based on the specifics of the request.

In [None]:
r = r"(?P<IP>\d+\.\d+\.\d+\.\d+)"
r += r" - (?P<User>\w+) "
r += r"\[(?P<Time>08/Nov/2019:\d{2}:\d{2}:\d{2})\]"
r += r' (?P<Request>"GET .+")'

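# access_log is a string holding the contents of a fake log file
# finditer returns an iterator of match objects (search returns only the first match)
matched = re.finditer(r, access_log)
for m in matched:
    print(m.group("IP"))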
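
A self-contained version of the same search, as a sketch. The log contents, users and request paths below are invented for the demo and follow the simplified line format used in these notes.

```python
import re

# Hypothetical log contents in the simplified format used above.
access_log = """\
127.0.0.1 - rj [08/Nov/2019:14:43:30] "GET /index.html HTTP/1.0" 200
10.0.0.7 - amy [08/Nov/2019:15:01:02] "POST /login HTTP/1.0" 302
192.168.1.5 - bob [09/Nov/2019:09:12:44] "GET /about HTTP/1.0" 200
172.16.0.9 - kim [08/Nov/2019:23:59:59] "GET /assets/234 HTTP/1.0" 200
"""

# Compile once; [^"]+ stops the request group at its closing quote.
pattern = re.compile(
    r"(?P<IP>\d+\.\d+\.\d+\.\d+)"
    r" - (?P<User>\w+) "
    r"\[(?P<Time>08/Nov/2019:\d{2}:\d{2}:\d{2})\]"
    r' (?P<Request>"GET [^"]+")'
)

# finditer() yields one match object per non-overlapping match.
ips = [m.group("IP") for m in pattern.finditer(access_log)]
print(ips)
```

The POST request and the request from 9 November are skipped because the method and the date are part of the pattern.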

# Dealing with Large Files

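Instead of loading the entire file into memory, read one line at a time. Iterating over a file object yields one line per step, so only the current line has to be held in memory; lines that have already been processed can be garbage collected.

Different operating systems use different line endings (e.g. Windows: `\r\n`, Mac and Linux: `\n`), which is tedious to handle by hand. In text mode Python does this for you: line endings are translated to `\n` on read, and `\n` is written out as the platform's line ending.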
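
The effect of newline translation can be checked on a small file with Windows-style endings. This is a sketch with a made-up file name; `newline=""` is the `open` argument that switches translation off.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp_dir:
    windows_file = Path(tmp_dir) / "windows.txt"  # hypothetical file name
    # Write raw bytes so the file really contains \r\n line endings.
    windows_file.write_bytes(b"alpha\r\nbeta\r\n")

    # Default text mode: \r\n is translated to \n while reading.
    with open(windows_file, "r") as source_file:
        translated = [line for line in source_file]

    # newline="" turns translation off, so the original endings are kept.
    with open(windows_file, "r", newline="") as source_file:
        untouched = [line for line in source_file]

print(translated)
print(untouched)
```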

In [None]:
# Using linebreaks as a separator
with open("big-data.txt", 'r') as source_file:
    with open("big-data-corrected.txt", "w") as target_file:
        for line in source_file:
            target_file.write(line)

# generator function for multiple files
def line_reader(file_path):
    with open(file_path, "r") as source_file:
        for line in source_file:
            yield line

reader = line_reader("big-data.txt")
with open("big-data-corrected.txt", "w") as target_file:
    for line in reader:
        target_file.write(line)

For large binary files, read the data in chunks. Pass the number of bytes to read in each chunk to the file object's `read` method. It returns an empty bytes object (`b""`) once the end of the file is reached.
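
A minimal sketch of chunked reading, written as a generator. The function name `read_chunks`, the 1024-byte default and the test file are assumptions made for the demo.

```python
import tempfile
from pathlib import Path

def read_chunks(file_path, chunk_size=1024):
    """Yield successive chunks of at most chunk_size bytes from a binary file."""
    with open(file_path, "rb") as source_file:
        while True:
            chunk = source_file.read(chunk_size)
            if not chunk:  # read() returned b"", so the end was reached
                break
            yield chunk

with tempfile.TemporaryDirectory() as tmp_dir:
    blob_path = Path(tmp_dir) / "blob.bin"  # hypothetical file name
    blob_path.write_bytes(bytes(2500))      # 2500 zero bytes

    chunk_sizes = [len(chunk) for chunk in read_chunks(blob_path)]

print(chunk_sizes)
```

The last chunk is shorter than `chunk_size` whenever the file size is not an exact multiple of it.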
