## Introduction to Cultural Analytics (ht-2023)

Matti La Mela, matti.lamela@abm.uu.se

### Lab material 1b: reading and writing files

This notebook introduces how to read and write txt files, so that we can bring external text data into our Python data processing. Later we will open csv files with pandas.

The reference readings for this learning material are:

o Chapter 9 in Allen B. Downey Think Python: How to Think Like a Computer Scientist, 2nd ed., Needham 2015, https://greenteapress.com/wp/think-python-2e/

o "Files and character encoding" in Walsh, M. (2021). Introduction to Cultural Analytics and Python. https://melaniewalsh.github.io/Intro-Cultural-Analytics/welcome.html


## 1. Reading a text file

In [None]:
# We will read the text file into a string variable that we label "book". This includes all the characters the text file might
# contain including extra spaces, OCR-dirt, line breaks. These we might have to clean later.

# We use the open() function to open our file. We give as arguments the path, the mode, and the file encoding.
# We use mode="r" because we are reading the file. Later we will use "w" (write, overwrites the old file) or "a" (append, adds to the existing file).

# The with statement ensures that we do not leave the file open: the file is closed automatically when the indented block ends.
# Without with, we would have to close the file ourselves with close(). Python copes fairly well with files that never
# get closed, but when writing, for example, the file needs to be closed so that the content is actually written to disk.

# Note the indentation again: everything that is in the indented code block is done while the file is open!

with open("./texts_week1/Alice_in_wonderland.txt", mode="r", encoding="utf-8") as file:
    book = file.read()
    
print(len(book))
print(book[0:500])

# We print the length of the book, and use slicing to print the first five hundred characters of
# Alice's Adventures in Wonderland by Lewis Carroll (https://www.gutenberg.org/ebooks/11).
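
The point about closing files can be checked directly. The following is a minimal sketch and not part of the original lab: it writes a small temporary file (the file name `demo.txt` and its content are made up for illustration) so that it runs without the course data, and then looks at the `closed` attribute of the file object.

```python
import os
import tempfile

# Create a small throwaway text file so the example does not depend on the course data.
# The temporary folder, the name "demo.txt" and the content are made up for illustration.
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "demo.txt")

with open(demo_path, mode="w", encoding="utf-8") as file:
    file.write("Alice was beginning to get very tired")

# Reading with the with statement: the file is closed when the indented block ends.
with open(demo_path, mode="r", encoding="utf-8") as file:
    demo_text = file.read()
    closed_inside = file.closed      # we are still inside the block

closed_after_with = file.closed      # the with statement has closed the file

# Without with, we have to call close() ourselves.
file2 = open(demo_path, mode="r", encoding="utf-8")
demo_text2 = file2.read()
file2.close()

print(closed_inside, closed_after_with, file2.closed)
```

Both ways read the same text; the with statement simply makes it harder to forget the closing step.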

## 2. Some data processing with the text data we have read

In [None]:
# We want to remove the information about the ebook and start from chapter 1. We know that the opening sentence is "Alice was beginning to get very" etc.

# The .index() string method gives us the index number where a sub-string is first found.

start = book.index("Alice was beginning")

print (book[start:start+100])
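
It is worth knowing what happens when the sub-string is not in the text at all. The sketch below uses a short made-up string in place of the book (an assumption, so that it runs on its own) and compares `.index()` with the related string method `.find()`.

```python
# A short made-up string stands in for the book.
sample = "Project Gutenberg header. CHAPTER I. Alice was beginning to get very tired"

# .index() returns the position where the sub-string starts.
start_sample = sample.index("Alice was beginning")
print(sample[start_sample:])

# .index() raises a ValueError when the sub-string is missing ...
try:
    sample.index("White Rabbit")
    found = True
except ValueError:
    found = False

# ... whereas .find() returns -1 instead of raising an error.
position = sample.find("White Rabbit")

print(found, position)
```

With `.index()` a missing opening sentence stops the program with an error, which is often what we want: slicing from -1 would silently give us only the last character of the book.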


In [None]:
# We then create a new string, "book_ch1", from the slice of "book" that starts at the index "start" and runs until the end of the string.

book_ch1 = book[start:]


In [None]:
# We continue with processing our text data. For the method that we plan to use, we want the text in lower case.

book_ch1 = book_ch1.lower()

print(book_ch1[0:100])

In [None]:
# We also want to store the paragraphs of the text in a list. With the split() method, we can split the text at every double line break \n\n,
# which marks a paragraph boundary here.

book_paragraphs = book_ch1.split("\n\n")

print(book_paragraphs[0:3])

# We should now see the first three paragraphs of the novel.

In [None]:
# We see that the paragraphs contain line breaks, \n. Our hypothetical method processes only words, so we replace the line breaks with spaces.

# We use a for loop to iterate through our list of paragraphs. Here we use range() to get the index numbers of the list, so that we can
# assign the cleaned paragraph back to the same position. Another way would be to keep a counter variable that follows the index number.

for i in range(len(book_paragraphs)):
    book_paragraphs[i] = book_paragraphs[i].replace("\n", " ")

print(book_paragraphs[0:5])
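
Two other common ways to write the same loop are `enumerate()` and a list comprehension. This is only a sketch for comparison, using a made-up two-paragraph list (`raw_paragraphs`) instead of the book.

```python
# A made-up list of two short paragraphs with line breaks inside them.
raw_paragraphs = ["alice was beginning\nto get very tired", "so she was considering\nin her own mind"]

# enumerate() gives us the index number and the value at the same time.
looped = list(raw_paragraphs)          # work on a copy of the list
for i, paragraph in enumerate(looped):
    looped[i] = paragraph.replace("\n", " ")

# A list comprehension builds a new, cleaned list in one line.
cleaned_paragraphs = [paragraph.replace("\n", " ") for paragraph in raw_paragraphs]

print(looped)
print(cleaned_paragraphs == looped)
```

All three versions give the same result; which one to use is mostly a matter of readability.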
                                  

In [None]:
# Depending on our needs, e.g. the format in which the text should be for our analysis, we could continue the processing even further.

# We want to write the text to a file, which we do in the next section.

In [None]:
# Ps. One useful string method for cleaning strings is .strip("CHAR").

# This removes CHAR from the beginning and the end of a string. If no CHAR is given, strip() removes whitespace.

# In this example string there is extra whitespace at the beginning and at the end:

example = "     this is our data   "

cleaned = example.strip()

print(cleaned)
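
The example above shows only the case without CHAR. Below is a small sketch of the other cases, with made-up example strings: giving one character, giving several characters, and the related methods `lstrip()` and `rstrip()` that clean only one end.

```python
# Giving a character: the asterisks are removed from both ends.
example_chars = "***chapter i***"
stripped_chars = example_chars.strip("*")
print(stripped_chars)

# Several characters can be given; each of them is removed from both ends,
# but characters inside the string (like the hyphen in "rabbit-hole") stay.
example_mixed = "-- down the rabbit-hole. --"
stripped_mixed = example_mixed.strip("- .")
print(stripped_mixed)

# lstrip() cleans only the beginning, rstrip() only the end.
example_spaces = "     this is our data   "
left_cleaned = example_spaces.lstrip()
right_cleaned = example_spaces.rstrip()
print(repr(left_cleaned))
print(repr(right_cleaned))
```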


## 3. Writing to text file

In [None]:
# Writing to a text file is done in a similar way to reading a file, but we use the mode "w".

# Note that mode "w" overwrites any existing file. If we want to append, i.e. add new content to the end of the file, we can use mode "a" when writing.
# We use utf-8 encoding for our files.
#
# There are several ways to write our list to the file.

# We can use a for loop for this. Note again the indentation: which blocks are executed while the file is open (with),
# and what is done in the for loop.
#
# In the for loop, we iterate through the book_paragraphs list: each value (i.e. each paragraph in our list) is assigned to "paragraph". We write the
# content of the paragraph variable to the file "output_paragraphs.txt". We also add a line break \n.

with open("./texts_week1/output_paragraphs.txt", mode="w", encoding="utf-8") as file:
    for paragraph in book_paragraphs:
        file.write(paragraph)
        file.write("\n")

# We should now have a file "output_paragraphs.txt" in our texts_week1 folder. You can of course use some other path for this!
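
The difference between the modes "w" and "a" is easiest to see by trying both on the same file. The sketch below is an illustration only: it uses a temporary folder and a made-up file name (`append_demo.txt`) so that nothing in texts_week1 is touched.

```python
import os
import tempfile

# A throwaway file in a temporary folder; the name is made up for illustration.
demo_path = os.path.join(tempfile.mkdtemp(), "append_demo.txt")

# "w" creates the file (or empties it if it already exists) and writes.
with open(demo_path, mode="w", encoding="utf-8") as file:
    file.write("first paragraph\n")

# "a" keeps the existing content and adds to the end of the file.
with open(demo_path, mode="a", encoding="utf-8") as file:
    file.write("second paragraph\n")

with open(demo_path, mode="r", encoding="utf-8") as file:
    after_append = file.read()

# Opening with "w" again throws the old content away.
with open(demo_path, mode="w", encoding="utf-8") as file:
    file.write("only this line\n")

with open(demo_path, mode="r", encoding="utf-8") as file:
    after_overwrite = file.read()

print(after_append)
print(after_overwrite)
```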


In [None]:
# It is very easy to unite a list of strings together with the .join method. The string given to join is what is put
# between the list entries.

# Example with our book titles:

book_titles = ["Introduction to Cultural Analytics & Python", "Distant Horizons", "Data Feminism"]

print(book_titles)

print("")

# We join the list into a string variable. We give a space, " ", as the connecting string. This could be any string that we might want as a connector.

titles_string = " ".join(book_titles)

print(titles_string)

# The first line of output is a list of strings, the lower one is a string where we have concatenated the list entries.


In [None]:
# Now let's use the join() method to unite our paragraphs into a string, and then write the string to a file. We use the line break "\n" as the connector:

text = "\n".join(book_paragraphs)

# we overwrite the file "output_paragraphs.txt".

with open("./texts_week1/output_paragraphs.txt", mode="w", encoding="utf-8") as file:
    file.write(text)


### Well done. You can try opening the text file (output_paragraphs.txt) in a text editor or word processor, e.g. Notepad++ or Word.

We see that there are still empty paragraphs written to our file. You can reflect on which step in our cleaning process left these empty lines. Perhaps our process could be simplified?


## 4. Reading multiple files

In [None]:
# We often want to open several files instead of just one. This enables us to scale our research.

# We import the module glob that allows us to search for filenames in a path. We can use the result for reading the files name by name.

import glob

# Let us see what text files we have in our texts_week1 path. If we use the wildcard *.txt, we get all the files that have the txt file
# extension. We have other txt files in the directory, so we use the more specific expression dhq*.txt: all .txt files that start with dhq.

# Glob creates a list of the file names:

list_files = glob.glob("./texts_week1/dhq*.txt")

# Let's print the first ten filenames:

for filename in list_files[0:10]:
    print(filename)

#print(list_files)
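
One detail to keep in mind is that `glob.glob()` returns the filenames in arbitrary order, so "the first file" may differ between computers. If the order matters, the list can be sorted. The sketch below is a self-contained illustration: it creates three made-up placeholder files in a temporary folder instead of using the dhq files.

```python
import glob
import os
import tempfile

# Create a few placeholder files in a temporary folder (names made up for illustration).
demo_dir = tempfile.mkdtemp()
for name in ["dhq-002.txt", "dhq-001.txt", "notes.txt"]:
    with open(os.path.join(demo_dir, name), mode="w", encoding="utf-8") as file:
        file.write("placeholder text")

# sorted() puts the filenames found by glob into alphabetical order.
demo_files = sorted(glob.glob(os.path.join(demo_dir, "dhq*.txt")))

for filename in demo_files:
    print(os.path.basename(filename))
```

The file notes.txt is not in the list, because it does not match the expression dhq*.txt.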

In [None]:
# We open every file in our list_files in a for loop. Other relevant operations on the files (e.g. data cleaning) could also be done here.

# In this example we save the content of our files as strings to a list ("text_content"), where we could then continue working with them.

text_content = []
text_filename = []


for filename in list_files:
    with open(filename, mode="r", encoding="utf-8") as file:
        article = file.read()
        text_content.append(article)
        text_filename.append(filename[14:])  # we save the filenames to a list text_filename. The path "./texts_week1/" is 14 characters long, so "dhq" always starts at index 14.
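
Slicing at a fixed index works as long as the folder path never changes. A possible alternative, shown here only as a sketch, is `os.path.basename()` from the standard library, which returns the part of a path after the last separator. The two example paths are hypothetical.

```python
import os

# A hypothetical filename in the same form as the ones glob gives us.
demo_filename = "./texts_week1/dhq-example.txt"

sliced = demo_filename[14:]                    # depends on the length of the folder path
basename = os.path.basename(demo_filename)     # independent of the folder path

print(sliced)
print(basename)

# With a different (hypothetical) folder, the fixed index no longer fits,
# while os.path.basename() still returns just the file name.
other_filename = "./another_folder/texts/dhq-example.txt"
other_sliced = other_filename[14:]
other_basename = os.path.basename(other_filename)

print(other_sliced)
print(other_basename)
```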

In [None]:
# Let's print the filename of the first article and the first 1000 characters of the first article.
# The first index is for the list element [0], and the second for the range in the string [0:1000].

print(text_filename[0])
print(text_content[0][0:1000])

In [None]:
# Next we would work with the articles, or process them into some more suitable format.

# If we want to write the processed content back to txt-files, we need to establish new filenames (if we don't want to overwrite the original files).
# We can create another list ("write_filename") with the new filenames.

write_filename = []

for i in range(len(text_filename)):
    filename = "./texts_week1/processed-" + text_filename[i]  # we add the write path and the text "processed-" before the actual file name
    write_filename.append(filename)

# And let's print the first filename in the list.

print(write_filename[0])

# Now we could iterate through this list (write_filename), like we did when we opened the files, but instead of the mode "r", we use "w" and write the processed data to the file.
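
The writing step described above could look roughly like the sketch below. It is an illustration under assumptions, not the course solution: the lists `demo_names` and `demo_content` are made-up stand-ins for text_filename and text_content, the files go to a temporary folder, and the "processing" is just strip() and lower(). The built-in `zip()` lets us go through the filenames and the texts in pairs.

```python
import os
import tempfile

# Made-up stand-ins for text_filename and text_content.
demo_dir = tempfile.mkdtemp()
demo_names = ["dhq-001.txt", "dhq-002.txt"]
demo_content = ["  First Article ", " Second Article  "]

# New filenames with the text "processed-" before the actual file name.
demo_write = [os.path.join(demo_dir, "processed-" + name) for name in demo_names]

# zip() pairs each new filename with the corresponding text.
for filename, article in zip(demo_write, demo_content):
    processed = article.strip().lower()        # placeholder for the real processing
    with open(filename, mode="w", encoding="utf-8") as file:
        file.write(processed)

written_files = sorted(os.listdir(demo_dir))
print(written_files)

with open(demo_write[0], mode="r", encoding="utf-8") as file:
    first_written = file.read()
print(first_written)
```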

## We are now done with the labs for week 1. It is good to revise the Python basics from the course textbook. You can now start doing the first lab assignment.


Ps.

Why do we use encoding="utf-8" when opening files? This is to ensure that we open the file with the right encoding (in this case: utf-8). Otherwise open() uses the computer's default encoding, which might be something else, and this might lead to an error or to wrongly displayed characters.

UTF-8 is the most common encoding today, https://en.wikipedia.org/wiki/Popularity_of_text_encodings, and it covers the whole, very extensive Unicode character set (Unicode 15.0.0 has 149,186 characters, https://www.unicode.org/versions/Unicode15.0.0/). For comparison, a historical character set like ASCII has just 95 printable characters. It is thus good to save your text data in utf-8 to ensure compatibility, and if a file is in some other encoding, it can be converted to utf-8, for example with Notepad++. You can read more about character encodings at https://dsc.gmu.edu/tutorials-data/tutorial-character-encoding/
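
What goes wrong with a mismatched encoding can be shown without any files, using the string method `.encode()` and the bytes method `.decode()`. This is a small sketch with a made-up example word; latin-1 is used here simply as an example of another encoding.

```python
word = "smörgåsbord"

# The same text stored as bytes in two different encodings.
utf8_bytes = word.encode("utf-8")
latin1_bytes = word.encode("latin-1")

# In utf-8 the letters ö and å take two bytes each, in latin-1 one byte each.
print(len(word), len(utf8_bytes), len(latin1_bytes))

# Reading utf-8 bytes as if they were latin-1 gives no error, but the wrong characters.
wrong_text = utf8_bytes.decode("latin-1")
print(wrong_text)

# Reading latin-1 bytes as if they were utf-8 fails with an error.
try:
    latin1_bytes.decode("utf-8")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True
print(decode_failed)
```

Both kinds of problem, garbled characters and decoding errors, are avoided when the encoding given to open() matches the encoding of the file.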

In [None]:
# If you want to see what kind of Unicode characters there are, you can print the first 10000 characters with this for loop:

for i in range(10000):
    print(chr(i))
    
# There are even some emojis included.
