In [21]:
import os
import sys
import pandas as pd

# File Management 2

## Pandas File Access

We've already looked at our main tool for reading data from disk: the file read/write functionality in Pandas. When reading datasets, this is normally all we need. Inside the `read_csv` function, the Pandas developers have either written all the file access code needed to read a CSV themselves or, more likely, repurposed and extended existing lower-level functions, such as those in the os library, to do the work for them.

In [22]:
df = pd.read_csv("chipotle.tsv", sep="\t")
df.head()

Unnamed: 0,order_id,quantity,item_name,choice_description,item_price
0,1,1,Chips and Fresh Tomato Salsa,,$2.39
1,1,1,Izze,[Clementine],$3.39
2,1,1,Nantucket Nectar,[Apple],$3.39
3,1,1,Chips and Tomatillo-Green Chili Salsa,,$2.39
4,2,2,Chicken Bowl,"[Tomatillo-Red Chili Salsa (Hot), [Black Beans...",$16.98


## OS and File Read/Write

We can also use the `os` module to do some basic file management. `os` is a library that allows us to interact with the operating system on our local computer. Recall that our Python programs run inside an environment set up by the Python install on our computer. This means that as we work we are "inside" that separate environment, and we can't directly interact with the underlying computer. The os library, and functions such as `read_csv` from other libraries, are tools that allow us to bridge this gap. The os library gives us an assortment of commands to do things like delete files or change the directory we're using. When the code runs in the Python environment, the library's functions are translated into the correct actions for the actual computer and passed on to it.

This also allows Python code to be portable, meaning it can run with few or no changes on different types of computers. I am using a Mac, most of you are probably using a Windows PC, and we can also use a Unix/Linux-based system like Google Colab. The code we write can work in all of those environments because of this abstraction; in each case, the actions triggered by the os module differ depending on the underlying operating system of the machine. In practice, many of the user-friendly libraries that we might use to access files or folders are built on top of the os module, so we often avoid needing to get into the weeds ourselves.

We can use the `os` module to do things like:
<ul>
    <li>Get the current working directory</li>
    <li>Change the current working directory</li>
    <li>Get a list of files in a directory</li>
    <li>Create a new directory</li>
    <li>Remove a directory</li>
    <li>Remove a file</li>
</ul>

First though, let's get some info about our system. Everyone will get totally different results here - I'm on a MacBook Air running macOS, and I assume most of you have some variety of PC running Windows. If you've ever seen one of those websites that tells you, "You are running Windows 10 in Edmonton, AB..." this is a similar idea. The `os.uname()` function reaches out to the computer and retrieves some of its identifying information for us. Note that `os.uname()` is only available on Unix-like systems such as macOS and Linux; on Windows, `platform.uname()` from the standard library returns similar information.

In [23]:
for a in os.uname():
    print(a)

Darwin
Akeems-Air.nait.ca
22.5.0
Darwin Kernel Version 22.5.0: Mon Apr 24 20:52:43 PDT 2023; root:xnu-8796.121.2~5/RELEASE_ARM64_T8112
arm64


## Folder Management

The os library also provides an assortment of folder management functions. We can use these to create, rename, and delete folders.

When using large datasets it is very common to have our data distributed over several folders. For example, if we are doing image recognition we might have a folder for "dogs", another with "cars", and another with "rutabagas". To construct our dataset we need to navigate over all of these folders and read in the files, using os and/or some similar libraries. 

Some common folder actions are:
<ul>
<li> os.mkdir() - create a new folder </li>
<li> os.rename() - rename a folder </li>
<li> os.rmdir() - remove a folder </li>
<li> os.getcwd() - get the current working directory </li>
<li> os.chdir() - change the current working directory </li>
<li> os.listdir() - list the contents of a directory </li>
</ul>
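
The folder actions listed above can be tried safely in a throwaway location. The sketch below assumes nothing about your project; it uses `tempfile.TemporaryDirectory()` to make a scratch folder, and the folder names "dogs" and "cats" are only placeholders.

```python
import os
import tempfile

# Work inside a throwaway folder so nothing in the project is touched
with tempfile.TemporaryDirectory() as base:
    start = os.path.join(base, "dogs")
    renamed = os.path.join(base, "cats")

    os.mkdir(start)                    # create a new folder
    after_mkdir = os.listdir(base)     # ['dogs']

    os.rename(start, renamed)          # rename it
    after_rename = os.listdir(base)    # ['cats']

    os.rmdir(renamed)                  # remove it (only works on empty folders)
    after_rmdir = os.listdir(base)     # []

print(after_mkdir, after_rename, after_rmdir)
```

Note that `os.rmdir()` refuses to delete a folder that still has anything in it, which is a useful safety net.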

In [24]:
os.getcwd()

'/Users/akeem/Documents/GitHub/Programming_Basics_for_ML/workbooks/development_working_dir'

<b>Note:</b> when using these functions, you need to be careful about where you are in the file system. In particular, the working directory doesn't reset itself automatically if you rerun a cell. The command above got the current working directory from the Python environment, which in turn got it from the operating system. If we change that directory and then rerun the cell above, we aren't "reset" to where we were when we started the program. To get back, we either navigate to the correct location ourselves or restart the environment, which kills the current world in which our program is running and generates a brand new one.

Here I'm going to capture the current folder name and then move one level up, and back down. 

In [25]:
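org_fold = os.getcwd()              # remember the starting folder
tmp = os.path.basename(org_fold)    # last folder name; works with / or \ separators
print(tmp)
os.chdir("..")                      # move one level up
os.chdir(tmp)                       # and back down into the original folder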

development_working_dir
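
One way to avoid getting stranded in the wrong folder is to wrap the "move, work, move back" pattern in a small helper. The sketch below is one possible approach, not something from the os library itself; the name `working_directory` is made up for this example.

```python
import os
import tempfile
from contextlib import contextmanager

@contextmanager
def working_directory(path):
    """Temporarily switch to `path`, then return to the original folder."""
    original = os.getcwd()
    os.chdir(path)
    try:
        yield
    finally:
        # Runs even if the body raises, so we always end up back where we started
        os.chdir(original)

before = os.getcwd()
with tempfile.TemporaryDirectory() as tmp_dir:
    with working_directory(tmp_dir):
        inside = os.getcwd()   # we are in the temporary folder here
after = os.getcwd()

print(before == after)
```

Because the move back happens in a `finally` block, an error in the middle of the work does not leave the notebook in an unexpected directory.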


#### Handling File System Data

We can also capture some of this information and store it in variables that help us navigate the file system. For example, when grabbing the working directory above we stored it as a string and pulled the last folder name out of the path. Below, we pull the files in a folder, and that data is returned as a list. If we have code where we need to jump around from folder to folder, we can use this list to help us navigate. For example, if we have a folder structure like this:

```
data
    - dogs
        - dog1.jpg
        - dog2.jpg
        - dog3.jpg
    - cats
        - cat1.jpg
        - cat2.jpg
        - cat3.jpg
    - rutabagas
        - rutabaga1.jpg
        - rutabaga2.jpg
        - rutabaga3.jpg
```
Our "base" folder is the data folder, and the subfolders inside are where we'll likely need to do all of our work. We can keep this map of the folder structure in some data structure, then we can use that info to move around. For example, we can find out where we are, visit each subfolder to do some work, and then move back to the base folder.
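
As a rough sketch of that idea, the code below builds a small stand-in for the `data` folder in a temporary location, then records which files live in each subfolder. The empty `.jpg` files are placeholders rather than real images, and `folder_map` is a name chosen for this example.

```python
import os
import tempfile

# Subfolder name -> prefix used for the placeholder file names
labels = {"dogs": "dog", "cats": "cat", "rutabagas": "rutabaga"}

with tempfile.TemporaryDirectory() as root:
    # Build a stand-in for the data/ folder shown above
    base = os.path.join(root, "data")
    for folder, prefix in labels.items():
        os.makedirs(os.path.join(base, folder))
        for n in range(1, 4):
            # Empty placeholder files standing in for real images
            open(os.path.join(base, folder, f"{prefix}{n}.jpg"), "w").close()

    # Map each subfolder name to the sorted list of files inside it
    folder_map = {}
    for folder in sorted(os.listdir(base)):
        sub = os.path.join(base, folder)
        if os.path.isdir(sub):
            folder_map[folder] = sorted(os.listdir(sub))

for folder, files in folder_map.items():
    print(folder, files)
```

This version builds full paths with `os.path.join()` instead of calling `os.chdir()`, so the working directory never changes; either approach works, but full paths leave less to clean up.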

In [26]:
# Capture the file list and print the first 5 entries
# Bonus: at some point when we looked at strings, someone asked about putting a newline in an f-string. I was wrong, it can work like this, no workarounds needed.
file_list = os.listdir()
print(f"Number of files: {len(file_list)}\n")
for name in file_list[:5]:
    print(name)


Number of files: 38

010_class_comments.ipynb
Clothing.csv
file_list
005_file_access.ipynb
006_objects.ipynb


## Reading and Writing Text Files

Now that we have an idea of what is in our folders, we can start recklessly changing things. For example, we can use Python's built-in `open()` function to make a new CSV file and write some data to it. One thing this reveals, which might not be obvious if you're used to Windows machines, is what a "text" file really is. Text files are .txt, but also .csv, .tsv, .py, etc., meaning all of these file types are made up of just plain text, and we can edit them in any text editor on a computer. The file extension doesn't dictate that; it is there for our convenience. The structure of our code will:
<ul>
<li> Open a connection to the file; if it doesn't exist, this will create it. </li>
<li> Perform the contents of the loop, writing 100 lines of data to the file. </li>
<li> When that writing task is complete, close the connection automatically thanks to the <code>with</code> statement. </li>
</ul>

In the open() function call to connect to the file we provide the second argument that defines what type of access we get to the file we are opening:
<ul>
<li> <b>'r' - read only</b> </li>
<li> <b>'w' - write only</b> </li>
<li> <b>'a' - append to the end of the file</b> </li>
<li> <b>'r+' - read and write</b> </li>
<li> 'w+' - write and read, overwrites existing files</li>
</ul>

These are mostly simple and self-explanatory, with the exception of the distinction between `r+` and `w`. The `r+` option opens the file for reading and writing, but it will not create the file if it doesn't exist; if you try to open a missing file with `r+` you will get a `FileNotFoundError`. The `w` option will create the file if it doesn't exist, but it will also erase the contents of the file if it does exist. So if we are making something brand new, we want `w`, but if we are attempting to update an existing file, we want `r+`. This is an easy place to make an error, so we should be careful. We can also check whether the file we want to make already exists, then make a decision. `w+` is another odd option: it is read/write like `r+`, but like `w` it creates the file if needed and erases any existing contents.

![File Permissions](../../images/file_permissions.png "File Permissions" )

Choosing the level of access of a file that we open is important in terms of writing our code to prevent errors. We want to open the file with the least impactful level of access that we need to have to do what we want. So if we are just reading data from a file, opening it as read only will prevent us accidentally changing that file in any way, as we don't even have the ability to do so. If we want a brand-new file, opening it as write only will prevent any old data that may have been hanging around from persisting. The more flexible the level of access, the more options we have for what we can do, the more likely it is that we may do something unintentional.
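
To make the differences between the modes concrete, here is a small experiment run in a temporary folder. It is only a sketch; the file name `demo.txt` and the variable names are invented for the example.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "demo.txt")  # hypothetical file name

    # 'r+' refuses to create a missing file
    try:
        open(path, "r+").close()
        r_plus_failed = False
    except FileNotFoundError:
        r_plus_failed = True

    # 'w' creates the file, and wipes it if it already exists
    with open(path, "w") as f:
        f.write("first\n")
    with open(path, "w") as f:
        f.write("second\n")
    with open(path, "r") as f:
        after_w = f.read()           # only the second write survives

    # 'a' keeps what is there and adds to the end
    with open(path, "a") as f:
        f.write("third\n")
    with open(path, "r") as f:
        after_a = f.read()

    # 'r+' on an existing file writes over the start without erasing the rest
    with open(path, "r+") as f:
        f.write("SECOND")
    with open(path, "r") as f:
        after_r_plus = f.read()

    # Checking first lets us pick a mode deliberately
    mode = "r+" if os.path.exists(path) else "w"

print(r_plus_failed, repr(after_w), repr(after_a), repr(after_r_plus), mode)
```

The `r+` result is worth a second look: the new text replaces characters starting at the beginning of the file, and everything after it stays in place.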

<b>Note:</b> the file is written to the current working directory, wherever that happens to be. The `os.chdir(org_fold)` line in the cell below resets the location first; if you removed that line, the file would be written to whatever folder you happen to be in.

In [27]:
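q = 1
os.chdir(org_fold)  # go back to the starting folder so the file lands in a known place
# "w" creates fib.csv if it is missing and overwrites it if it already exists
with open("fib.csv", "w") as f:
    # Write 100 rows of "row_number,value" (despite the file name, these are not Fibonacci numbers)
    for i in range(100):
        if q == 1:
            f.write(str(i+1) + "," + str(q) + "\n")
            q += 1
        else:
            q += 1
            f.write(str(i+1) + "," + str(q) + "\n")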

#### Reading Our File

Now that our file is written, we can read it and see what we got. We specify `r` here, since we are only reading. When reading from a file, there are a few main options:
<ul>
<li> read() - read the entire file into a single string </li>
<li> readline() - read the file one line at a time </li>
<li> readlines() - read the file into a list of strings, one per line </li>
</ul>

The first, `read()`, pulls the entire file into one string. This is fine for small files, but for large ones it is unwieldy. For most things of size we probably want to move through the file one line at a time, using `readline()`. There's also a common shortcut with a for loop that does this for us easily:
    
`for line in file:`

If we are reading in a large file, one line at a time is a better choice. Depending on what we are doing, we may be able to process our data and "deal with it" - whether that be saving it to another file or loading it into some dataset - on the fly. When we get to neural networks towards the end of machine learning, we'll try to read enough data so that our processor is busy - so the computer is never waiting for either data or a free processor. Loading batches allows us to make the most of the power of our computer, as we can minimize the amount of time any part of it spends waiting for something else to finish.
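
The sketch below compares the three read options on a tiny file, then shows one possible way to read a file in fixed-size batches of lines. The helper `read_in_batches` and the file `numbers.csv` are made up for this example; real batch loaders in machine learning libraries are more involved.

```python
import os
import tempfile
from itertools import islice

def read_in_batches(path, batch_size):
    """Yield lists of up to `batch_size` lines (newlines stripped) from a text file."""
    with open(path, "r") as f:
        while True:
            # islice pulls at most batch_size lines from wherever the file pointer is
            batch = [line.rstrip("\n") for line in islice(f, batch_size)]
            if not batch:
                break
            yield batch

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "numbers.csv")  # hypothetical file
    with open(path, "w") as f:
        for i in range(1, 8):
            f.write(f"{i},{i * i}\n")

    with open(path, "r") as f:
        whole = f.read()          # one big string
    with open(path, "r") as f:
        first = f.readline()      # just the first line
    with open(path, "r") as f:
        as_list = f.readlines()   # list with one string per line

    batches = list(read_in_batches(path, 3))

print(repr(first), len(as_list))
print(batches)
```

Only one batch is held in memory at a time when the generator is consumed in a loop, which is what makes this pattern workable for files too large to load at once.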

In [28]:
with open("fib.csv", "r") as f:
    for line in f:
        # Each line already ends in "\n" and print() adds its own, hence the blank lines in the output
        print(line)

1,1

2,3

3,4

4,5

5,6

6,7

7,8

8,9

9,10

10,11

11,12

12,13

13,14

14,15

15,16

16,17

17,18

18,19

19,20

20,21

21,22

22,23

23,24

24,25

25,26

26,27

27,28

28,29

29,30

30,31

31,32

32,33

33,34

34,35

35,36

36,37

37,38

38,39

39,40

40,41

41,42

42,43

43,44

44,45

45,46

46,47

47,48

48,49

49,50

50,51

51,52

52,53

53,54

54,55

55,56

56,57

57,58

58,59

59,60

60,61

61,62

62,63

63,64

64,65

65,66

66,67

67,68

68,69

69,70

70,71

71,72

72,73

73,74

74,75

75,76

76,77

77,78

78,79

79,80

80,81

81,82

82,83

83,84

84,85

85,86

86,87

87,88

88,89

89,90

90,91

91,92

92,93

93,94

94,95

95,96

96,97

97,98

98,99

99,100

100,101



#### Appending a File and the In-File Pointer

Another of the options above that is a little odd is the `a`, for append. This will open the file and add new data to the end of it, but it will not overwrite the existing data. This is useful if we want to add new data to an existing file, but we don't want to lose the old data. This is very useful for things like logs - we likely have a pretty substantial amount of data accumulated in a big text file and we want to add new stuff without losing the old data or having to deal with the old data at all. We can use this to basically tack some entries onto the end of a file easily.

This issue of opening an existing file normally vs. appending seems pretty minor, but it can have larger performance implications than we might expect. For example, server logs can be many, many GB of text listing errors or warnings going back years. We want to keep the log, and we also want to add today's entries. Opening a 2GB file "normally", navigating to the end, then spitting it back out can be slow; appending directly to the end of that same file is fast. This is because, just like navigating a file system, a text file has a pointer that maintains your position - think of it as an invisible cursor, like the one in any program where we type. Append puts that cursor directly at the end and just starts writing.

Personally, I once had a job where I remade a little program that went to approximately 150 servers, grabbed their log files, and looked for last night's entry at the end of each file to see if a backup had failed. By changing it from opening the files normally to appending (roughly, the language wasn't Python), I cut the runtime from 4 to 5 hours down to under 10 minutes, without making any actual improvements to the logic of the code, just by jumping directly to the end of the text. When someone wrote the original, all the log files were probably tiny, as the system was new, so it didn't matter for performance; as things grew, this became an issue.

<b>Note:</b> the "it's slow to open a large file and write to the end" thing is obviously a common issue for computers in general. File access packages know this, and are built to be fast no matter what. This idea is still true, just less true than it is with older software.
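
Here is a small sketch of both ideas: appending to a log, and moving the in-file pointer straight to the end to read only the last entry. The log file name and its contents are invented for the example, and the 64-byte window is an arbitrary choice that happens to be big enough for this tiny file.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as folder:
    log_path = os.path.join(folder, "backup.log")  # hypothetical log file

    with open(log_path, "w") as f:
        f.write("day 1: backup ok\n")
        f.write("day 2: backup ok\n")

    # Append mode adds to the end; earlier lines are untouched
    with open(log_path, "a") as f:
        f.write("day 3: backup FAILED\n")

    # Jump the pointer near the end and read only the tail
    with open(log_path, "rb") as f:
        f.seek(0, os.SEEK_END)       # move the pointer to the end of the file
        size = f.tell()              # pointer position is now the file size in bytes
        f.seek(max(0, size - 64))    # back up at most 64 bytes
        tail = f.read().decode("utf-8")

    last_line = tail.splitlines()[-1]

print(last_line)
```

The file is opened in binary mode (`"rb"`) for the tail read because seeking to arbitrary byte offsets is only reliable on binary file objects.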

In [29]:
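# Note: "a" is write-only, so f.readlines() would raise io.UnsupportedOperation here.
# Use "a+" if you need to read and append through the same file handle.
#with open("fib.csv", "a") as f:
#    print(f.readlines())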

### Remote Data

We can also use code to programmatically download data from the internet. This can save us from having to download large files by hand, and it can also help us build automated pipelines for getting data.

One thing that I've worked on a lot in industry is importing data from other systems into learning management systems (LMS) like Moodle. A common process is for the other system to export a CSV file to a specific location on a file server; a script that we created then grabs the new file from that pre-defined location and feeds it into our import work. Applications for personal use are also broad - we could automate downloads of files that are regularly updated.

For pretty much any data source that we might want to be able to access, there is likely a library that will do so. So we can access FTP servers, different file share protocols, and so on - we just need to look up the correct tool for whichever data source we want to access.

<b>Note:</b> there are many tools that download files, and they're pretty much interchangeable. I'm using `urllib.request` here because it's built into Python and is the "basic standard", but you could also use the third-party `requests` library, call out to command-line tools such as `wget` or `curl`, or pick any number of other options. For this, and the others, look at the documentation to see the options and how to use the functions - they are generally similar to this: just provide a URL and a destination.

In [30]:
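# Download
import urllib.error
import urllib.request

url = 'https://raw.githubusercontent.com/justmarkham/DAT8/master/data/chipotle.tsv'
try:
    # Save the remote file into the current working directory
    urllib.request.urlretrieve(url, 'chipotle.tsv')
except urllib.error.URLError as err:
    # Covers unreachable hosts and HTTP errors such as 404
    print(f"Error downloading file: {err}")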
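
If you want a timeout and a clear success/failure result, one alternative is to stream the response yourself with `urllib.request.urlopen()`. The function `download` below is a hypothetical helper written for this example, and the demonstration uses a local `file://` URL so that it runs without an internet connection; in practice you would pass a web address like the one above.

```python
import shutil
import tempfile
import urllib.request
from pathlib import Path

def download(url, destination, timeout=30):
    """Stream `url` to `destination`; return True on success, False on failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            with open(destination, "wb") as out_file:
                # Copy in chunks so large files never sit fully in memory
                shutil.copyfileobj(response, out_file)
        return True
    except OSError as err:
        # urllib's URLError and HTTPError are both subclasses of OSError
        print(f"Error downloading {url}: {err}")
        return False

# Offline demonstration: a file:// URL stands in for a real web address
with tempfile.TemporaryDirectory() as folder:
    source = Path(folder) / "source.tsv"
    source.write_text("a\tb\n1\t2\n")
    target = Path(folder) / "copy.tsv"
    ok = download(source.as_uri(), target)
    copied = target.read_text()

print(ok)
```

Returning a boolean makes it easy for a pipeline to decide whether to continue or retry.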


### Compression

Many files that we deal with, particularly when downloading datasets, may be compressed. Several libraries provide tools for us to programmatically deal with these files, including decompressing them and moving their files into our working directory.

<b>Note:</b> the Pandas `read_csv` function can also read compressed files directly, so you can load my_data.zip or similar with no interim decompression step. For a zip archive, this only works when the archive contains exactly one data file.

In [31]:
# Compress some files
import zipfile

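# Zip the first 5 entries in file_list
# (the default is ZIP_STORED, i.e. no compression; pass compression=zipfile.ZIP_DEFLATED to shrink the files)
with zipfile.ZipFile('file_list.zip', 'w') as myzip:
    for name in file_list[:5]:
        myzip.write(name)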

Decompress into a folder

In [32]:
# Decompress file_list.zip into file_list folder
with zipfile.ZipFile('file_list.zip', 'r') as myzip:
    myzip.extractall('file_list')
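
We don't always need to extract an archive to see what is in it. The sketch below builds a small zip in a temporary folder, lists its members with `namelist()`, and reads one member directly with `ZipFile.open()`. The file `scores.csv` and its contents are invented for the example.

```python
import os
import tempfile
import zipfile

with tempfile.TemporaryDirectory() as folder:
    csv_path = os.path.join(folder, "scores.csv")   # hypothetical data file
    zip_path = os.path.join(folder, "scores.zip")
    with open(csv_path, "w") as f:
        f.write("name,score\nada,10\n")

    # ZIP_DEFLATED asks for real compression; the default only stores files
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as myzip:
        # arcname controls the name stored inside the archive
        myzip.write(csv_path, arcname="scores.csv")

    with zipfile.ZipFile(zip_path, "r") as myzip:
        names = myzip.namelist()                     # list the members
        with myzip.open("scores.csv") as member:     # read one member in place
            text = member.read().decode("utf-8")

print(names)
print(text)
```

Peeking at `namelist()` first is a cheap way to check what an unfamiliar download will unpack before calling `extractall()`.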

#### Delete the New Folder

We can delete that folder we just extracted, as well as the zip itself. 

In [33]:
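#for file in os.listdir('file_list'):
#    print(file)
#    os.remove(os.path.join('file_list', file))  # listdir returns bare names, so add the folder
#os.rmdir('file_list')
#os.remove('file_list.zip')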

## Exercise

In [34]:
# Download and decompress

## Shutil

The shutil library is another built-in Python package that provides tools for working with files. It is a higher-level interface to the file system than the os library, meaning we get an additional layer of abstraction and more human-readable commands. Most of what we'd use shutil for can also be done with os, but it is a bit easier with shutil:
<ul>
<li>Copying files</li>
<li>Moving files</li>
<li>Removing files</li>
<li>Creating archives</li>
<li>Extracting from archives</li>
</ul>

The actions in the shutil package are more similar to how we probably think of things working as an end user looking at files on our computer, while the os package is more similar to the details of how the computer actually works.
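
As an illustration of those actions, the sketch below copies, moves, archives, unpacks, and finally deletes files, all inside a temporary folder. The folder and file names are placeholders chosen for the example.

```python
import os
import shutil
import tempfile

with tempfile.TemporaryDirectory() as folder:
    src_dir = os.path.join(folder, "raw")
    os.mkdir(src_dir)
    original = os.path.join(src_dir, "notes.txt")
    with open(original, "w") as f:
        f.write("hello\n")

    # Copy the file, then move the copy to a new name
    copy_path = shutil.copy(original, os.path.join(src_dir, "notes_copy.txt"))
    moved_path = shutil.move(copy_path, os.path.join(src_dir, "notes_moved.txt"))

    # Pack the folder into raw_backup.zip, then unpack it somewhere else
    archive = shutil.make_archive(os.path.join(folder, "raw_backup"), "zip", root_dir=src_dir)
    restored = os.path.join(folder, "restored")
    shutil.unpack_archive(archive, restored)
    restored_files = sorted(os.listdir(restored))

    # rmtree removes a folder and everything in it, unlike os.rmdir
    shutil.rmtree(src_dir)
    src_exists = os.path.exists(src_dir)

print(restored_files, src_exists)
```

`shutil.rmtree()` does not ask for confirmation and does not use a recycle bin, so it is worth double-checking the path before running it on real data.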

In [35]:
import shutil

# Delete File List Folder
#try:
#    shutil.rmtree('file_list')
#except FileNotFoundError:
#    print("File List Folder Does Not Exist")


In [36]:
disk = shutil.disk_usage(os.getcwd())
disk

usage(total=994662584320, used=184396263424, free=810266320896)
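
The values returned by `shutil.disk_usage()` are byte counts, which are hard to read at a glance. A small sketch of converting them to gibibytes (the variable names are chosen for this example):

```python
import os
import shutil

usage = shutil.disk_usage(os.getcwd())

# The fields are byte counts; divide by 1024**3 to get gibibytes
gib = 1024 ** 3
print(f"Total: {usage.total / gib:.1f} GiB")
print(f"Used:  {usage.used / gib:.1f} GiB")
print(f"Free:  {usage.free / gib:.1f} GiB")

percent_used = 100 * usage.used / usage.total
print(f"Percent used: {percent_used:.1f}%")
```

A check like this can be handy before downloading or decompressing a large dataset.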

We can use the shutil library to delete some of the stuff we created above. 