# 6. Working with files and folders

## Reading the contents of a folder

If you need to manage large numbers of files in a research project, it can be helpful to organise these files using folders and subfolders. To read the contents of such folders with multiple files, you can make use of the `os` library. The two letters in the name of this library stand for 'operating system'. The library includes various functions that can help you to work with files and folders. One useful function is `listdir()`, which, as its name suggests, gives you a list of the names of all the files and subfolders in a given directory.

To make use of `os`, this library needs to be imported first.  

In [5]:
import os

The folder containing the notebooks developed for this course contains a subfolder named 'Corpus'. The cell below shows how you can print a list of all the files in this subfolder.

In [7]:
directory = 'Corpus'

for file_name in os.listdir( directory ):
    print( file_name )

BraveNewWorld.txt
PrideandPrejudice.txt
sonnet116.txt
Ullyses.txt


The `listdir()` function only provides the file names. If you want to do some actual work on the files in the folder, you will in most cases need the paths to these files as well. In this example, the relative paths (i.e. the paths from the current location) to the files in the folder are simple: they consist of the name of the folder, 'Corpus', combined with the file names.

The function `join()`, which is part of the `path` module of `os`, can be used to create a string representing the path to a certain file. If you have one variable which records the base directory of a file, and a second variable which captures the filename, the `join()` function can concatenate the values of these two variables to create the full path to this file. 

The `join()` function always follows the conventions for representing paths of the operating system on which the code runs. These conventions differ: macOS and Linux separate the parts of a path with forward slashes, for instance, whereas Windows uses backslashes. Working with `join()` therefore makes your code more platform-independent.
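
As a small illustration of this point, the sketch below builds a path with `join()` and prints the separator that Python uses on the current system. The file name is taken from the 'Corpus' listing above; the variable name `path` is only chosen for this example.

```python
import os
from os.path import join

# Build the path from a directory name and a file name
path = join('Corpus', 'sonnet116.txt')

# os.sep holds the path separator of the current operating system:
# a backslash on Windows, a forward slash on macOS and Linux
print(os.sep)
print(path)
```

Because the separator is filled in by `join()`, the same two lines of code produce a valid path on each of these systems.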

Another useful function in the `path` module of `os` is `isfile()`. As you list the contents of a certain directory using `listdir()`, you can apply this function to check whether you are dealing with a file or with something else (e.g. a subdirectory).

The code below offers a demonstration of these two functions. It lists all the files in the directory that is mentioned, and makes sure that all the subdirectories are ignored. Note that the first line imports the two functions that have been discussed above. As a result of this, it is no longer necessary to use the period syntax for `isfile()` and `join()`.

In [9]:
from os.path import isfile , join

directory = 'Corpus'

for file_name in os.listdir( directory ):
    path = join( directory , file_name )
    if isfile( path ):
        print( path ) 


Corpus\BraveNewWorld.txt
Corpus\PrideandPrejudice.txt
Corpus\sonnet116.txt
Corpus\Ullyses.txt


### Exercise 6.1. 

Working with the `open()` function, print a sentence which gives information about the number of files in the folder named 'Corpus'.

In [17]:
from os.path import isfile , join

directory = 'Corpus'

count = 0

for file_name in os.listdir( directory ):
    path = join( directory , file_name )
    if isfile( path ):
        with open( path , encoding = 'utf-8' ) as file:
            count = count + 1
print("The number of files in the directory is:", count)

The number of files in the directory is: 4


## Reading a file


If the data that you need to work with in a research project is saved as a file on your computer, you can write code to read this file and to make its contents available within the context of your program.

In Python, the contents of files can be read using the `open()` function. The result of this function is a new object which is called a file handler (or, more specifically, a `TextIOWrapper` object). Simply put, a file handler is an object which establishes a connection to the text file on your disk. You are free to give this file handler object any name you like. 

When you use the `open()` function, it is also advisable to specify the character encoding that has been used in the text file, using the `encoding` parameter. This helps Python to process all the characters correctly.
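
The effect of the `encoding` parameter can be sketched as follows. This is only an illustration under stated assumptions: the sample sentence, the temporary file named 'sample.txt' and the variable names are hypothetical, and the example uses the `with` keyword and the "w" mode, which are both explained later in this chapter. The file is written as UTF-8 and then read twice, once with the matching encoding and once with a different one (Latin-1).

```python
import os
import tempfile

# Hypothetical sample text; the curly apostrophes are not ASCII characters
sample = "Love’s not Time’s fool"

# Write the sample to a temporary file, encoded as UTF-8
path = os.path.join(tempfile.mkdtemp(), 'sample.txt')
with open(path, 'w', encoding='utf-8') as out:
    out.write(sample)

# Reading with the matching encoding returns the original text
with open(path, encoding='utf-8') as text:
    correct = text.read()

# Reading with a different encoding (Latin-1) garbles the apostrophes
with open(path, encoding='latin-1') as text:
    garbled = text.read()

print(correct == sample)
print(garbled == sample)
```

If no encoding is given, Python falls back on a default that depends on the system, which is why stating it explicitly is the safer choice.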

Once the connection is established via the `open()` function, you can access the contents of the file in a variety of ways. A first option is to read the contents on a line-by-line or a paragraph-by-paragraph basis. This approach can be followed when units such as lines or paragraphs in the text are delineated using the hard return or the newline character. The file handler that is created by `open()` is iterable: the `for` keyword can be used to iterate across the units, separated by newline characters, that this file handler represents.
 
The code below demonstrates how you can read and display the full contents of a text file, paragraph by paragraph. It assumes that there is a file named "BraveNewWorld.txt", saved in a folder named "Corpus". It also assumes that the various paragraphs are separated using a hard return. 

In [141]:
path = join("Corpus","BraveNewWorld.txt")
text = open( path , encoding = 'utf-8' )

# Print the file paragraph by paragraph (note that the output is very long)
for paragraph in text:
    print(paragraph.strip())

As an alternative, you can also make use of the `read()` method. When you do this, the text is not divided into smaller units. The full contents of the text file become available as one long string.

In [20]:
path = join("Corpus","Sonnet116.txt")
text = open( path , encoding = 'utf-8' )

# read() returns the full contents of the file as one string
full_text = text.read()
print(full_text)

Let me not to the marriage of true minds
Admit impediments. Love is not love
Which alters when it alteration finds,
Or bends with the remover to remove:
O, no! it is an ever-fixed mark,
That looks on tempests and is never shaken;
It is the star to every wandering bark,
Whose worth’s unknown, although his height be taken.
Love’s not Time’s fool, though rosy lips and cheeks
Within his bending sickle’s compass come;
Love alters not with his brief hours and weeks,
But bears it out even to the edge of doom.
If this be error and upon me proved,
I never writ, nor no man ever loved.


After we have run this code, we can manipulate the string that is created by the `read()` function just like any other string.
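
A minimal sketch of what this can look like is given below. It uses a short hard-coded string in place of the result of `read()`, so that it can be run without the 'Corpus' folder; the variable names are assumptions made for this example.

```python
# Stand-in for the string returned by read(): the first two lines of the sonnet
full_text = "Let me not to the marriage of true minds\nAdmit impediments. Love is not love\n"

# Count the characters, including spaces and newline characters
nr_characters = len(full_text)

# Split the string into a list of lines
lines = full_text.splitlines()

# Split the string into words, using whitespace as the separator
words = full_text.split()

print(nr_characters)
print(len(lines))
print(words[:3])
```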

It is good practice to close the file handler when you are done working on it, using the `close()` method. 

In [22]:
text.close()

Next to the options that have been discussed so far, you can also read files by making use of a mechanism that is referred to as a context manager.

Context managers are used with the `with` keyword. After `with`, you call the `open()` function, followed by the keyword `as` and the name you would like to give to the file handler. In the code block underneath `with`, you can access the contents of this file handler. It is generally useful to assign the contents of the text file to a variable. When the code block underneath `with` ends, the file handler is closed automatically. This is a great advantage of a context manager: you do not risk forgetting to call the `close()` method.

In [30]:
path = join("Corpus","Sonnet116.txt")

with open(path, encoding = 'utf-8') as text:
    contents = text.read()
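
To check that the file is indeed closed once the `with` block ends, you can inspect the `closed` attribute of the file handler. The sketch below is a minimal illustration; it first writes a small temporary file so that it does not depend on the 'Corpus' folder, and the file name 'example.txt' and the variable `open_inside` are hypothetical.

```python
import os
import tempfile

# Create a small temporary file to read (hypothetical file name)
path = os.path.join(tempfile.mkdtemp(), 'example.txt')
with open(path, 'w', encoding='utf-8') as out:
    out.write("Let me not to the marriage of true minds\n")

with open(path, encoding='utf-8') as text:
    contents = text.read()
    # Inside the block, the file is still open
    open_inside = not text.closed

# After the block, the file has been closed automatically
print(open_inside)
print(text.closed)
```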


In [32]:
file = open('Corpus/Sonnet116.txt', encoding='utf-8')
full_text = file.read()
print(full_text)
file.close()

Let me not to the marriage of true minds
Admit impediments. Love is not love
Which alters when it alteration finds,
Or bends with the remover to remove:
O, no! it is an ever-fixed mark,
That looks on tempests and is never shaken;
It is the star to every wandering bark,
Whose worth’s unknown, although his height be taken.
Love’s not Time’s fool, though rosy lips and cheeks
Within his bending sickle’s compass come;
Love alters not with his brief hours and weeks,
But bears it out even to the edge of doom.
If this be error and upon me proved,
I never writ, nor no man ever loved.



## Writing to a file

The output of code created in a Jupyter notebook will normally be shown directly underneath the code cell. When you run a Python program using the Command Prompt, the full output will normally be printed on the Command Prompt as well.

When the program has many lines to print, it can be very difficult to read the output. In such cases, it can be useful to create a new text file which will receive all the output. The results of the program can then be inspected by opening this new file in a text editor.

The function `open()`, which can be used to read files, can also be invoked to create a new file. Instead of referencing a file which already exists on your disk, you need to provide a new file name. Next to this, you also need to supply a second parameter, the character “w”, which stands for “write”. This “w” character makes it clear to Python that you want to write to a file. The `open()` function used with the “w” parameter similarly creates a file handler.

This handler has a `write()` method, which functions very similarly to the `print()` function. The crucial difference, however, is that the output is not sent to the default output device (e.g. the Command Prompt or Jupyter Notebook), but to the file that is associated with this file handler. 
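
A minimal sketch of these steps is shown below, assuming a hypothetical output file named 'output.txt' that is created in a temporary folder. Two details are worth noting: unlike `print()`, the `write()` method does not add a newline character at the end, so it is added explicitly here, and opening an existing file with "w" overwrites its contents.

```python
import os
import tempfile

# Hypothetical output file, placed in a temporary folder
path = os.path.join(tempfile.mkdtemp(), 'output.txt')

# Open the file for writing; 'w' creates the file or overwrites an existing one
out = open(path, 'w', encoding='utf-8')

# write() does not add a newline automatically, so '\n' is added by hand
out.write("First line of output\n")
out.write("Second line of output\n")

# Closing the file makes sure that everything is saved to disk
out.close()

# Read the file again to check what has been written
with open(path, encoding='utf-8') as text:
    contents = text.read()
print(contents)
```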

### Exercise 6.2.

Using the code discussed above, print the full text of Shakespeare's *Sonnet 116* on your screen, and make sure that you also add line numbers, as follows:
    
```
1. [line1]
2. [line2]
```

In [135]:
# Read the sonnet and split it into separate lines
with open('Corpus/Sonnet116.txt', encoding='utf-8') as file:
    lines = file.read().splitlines()

# Print every line on the screen, preceded by its line number
for i, line in enumerate(lines, start=1):
    print(f"{i}. {line}")

### Exercise 6.3.

Building on the code you wrote for exercise 6.2, create a new text file containing the NUMBERED lines of Shakespeare's *Sonnet 116*. As a file name, use "sonnet166_numbered.txt", to make sure that you do not overwrite the existing file.

In [137]:
# Read the sonnet and split it into separate lines
with open('Corpus/Sonnet116.txt', encoding='utf-8') as file:
    lines = file.read().splitlines()

# Write the numbered lines to a new file, one line at a time
with open('lines_sonnet_numbered.txt', 'w', encoding='utf-8') as out:
    for i, line in enumerate(lines, start=1):
        out.write(f"{i}. {line}\n")

### Exercise 6.4.

Create a CSV file listing the names of all the files in the subfolder named "Corpus", together with the number of characters in each of these files. The number of characters can be found using the `len()` function. The header should specify 'text' and 'nr_characters' as column names. 

In [249]:
import csv
import os
from os.path import isfile , join

directory = 'Corpus'

# newline='' prevents the csv module from writing extra blank lines on Windows
with open('test.csv' , 'w', newline='', encoding='utf-8') as csvfile:

    writer = csv.writer(csvfile)
    writer.writerow(["text", "nr_characters"])

    for file_name in os.listdir( directory ):
        path = join( directory , file_name )
        if isfile( path ):
            # Read the full text to count its characters
            with open( path , encoding='utf-8' ) as file:
                full_text = file.read()
            # Pass a list, not a set, so that the column order is fixed
            writer.writerow([file_name, len(full_text)])