
**FILES IN PYTHON**

I wanted to provide a couple of examples of working with external files in Python.

Our focus this week is on using Python to help us process external data files.

We are going to talk about two kinds of files: text files and CSV (Comma Separated Value) files. Text files are more common, so we will look at them first. CSV files are a way of sharing scientific data in a spreadsheet-like format with rows and columns. They are used a lot when data is available for download on the Internet.

We will have fun with text files today and do more with CSV files in next week's Worksheet.



---




**OPENING FILES**

In order to use a file in Python, you first need to open it.

To grab a file from the Internet (e.g. GitHub) that lives on a different computer, we import urllib.request, which is part of the urllib package, as in the example below.

Note: A URL is a Uniform Resource Locator that identifies a website, file or other resource on the Internet. (You probably see these frequently at the top of your web browser window.) urllib in Python is a package that provides a bunch of functions/commands for dealing with URLs and the Internet.

Note: If you have downloaded Python and are running it on your local machine, you can also use the **open** command to open a local file on your machine. 

In [None]:
import urllib.request

file_name = "https://raw.githubusercontent.com/mlepinski/Python-Worksheets/master/Sonnets.txt"
my_file = urllib.request.urlopen(file_name)

# If we were running Python on our own machine and wanted to open a local file,
# we could use the open command below

#local_file_name = "Sonnets.txt"
#my_file = open(local_file_name)

# The above commands only work if Sonnets.txt is in the same folder on your machine as the .py file with your Python program

In the example above, I just went to the file that I wanted in GitHub, clicked the "raw" button on the upper-right part of the GitHub page, and then copied the URL from my web browser. I needed to click "raw" because I wanted the "raw" text of the file without the other GitHub stuff that appears on the webpage.

So what does this do? At the moment, not much.

This code opens a file and creates a file object that is stored in the variable my_file. A file object is just an abstract blob that represents a file in Python. 



---

**LINE BY LINE**


Typically, when you have a file object you will want to do one of two things: go through it line-by-line or read the entire file.

You can use a **for** loop to go through a file line-by-line. Just like a **for** loop lets you do something once for every item in a list, a **for** loop can also let you do something once for every line in a file.

In [None]:
import urllib.request

file_name = "https://raw.githubusercontent.com/mlepinski/Python-Worksheets/master/Sonnets.txt"
my_file = urllib.request.urlopen(file_name)

favorite_lines = []

for line in my_file:
  line = line.decode("utf-8")
  if "monuments" in line:
    favorite_lines.append(line)

print(favorite_lines)

So the above code goes line by line through the file, identifies any line that contains the word "monuments", and adds that line to a list of favorite lines.

The weird **decode** command is needed because when you grab a file off the Internet, Python hands you raw bytes rather than a normal string. Bytes are just numbers, and Python needs to be told which text encoding turns them into characters. The "classic" (old fashioned) ASCII encoding only works for English and certain similar European languages. (The history of the Internet has a strong bias towards English and western Europe.) UTF-8 is the modern Unicode encoding, which is awesome because it can represent text from any language while still reading plain ASCII text correctly. The **decode** command takes the raw bytes from the web and interprets them as UTF-8 ... that is, it makes a normal Python string.
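To see what **decode** does without going to the Internet, here is a minimal sketch. The bytes value below is typed in by hand as a stand-in for one line read from the web, and the variable names are just for illustration.

```python
# A hand-typed bytes value standing in for one line read from the web.
# The b in front of the quotes means "this is bytes, not a string".
raw_line = b"Not marble, nor the gilded monuments\n"
print(type(raw_line))

# decode interprets the bytes as UTF-8 text and gives back a normal string
text_line = raw_line.decode("utf-8")
print(type(text_line))

# Now that both sides are strings, the "in" test works
print("monuments" in text_line)
```

If you skip the decode step and ask whether a string is in a bytes value, Python stops with a TypeError, which is why every loop in this worksheet decodes the line first.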


---


At the end you can see that there was only one line in the entire file that contained the word "monuments". 

You will also see a weird \n at the end when you print the favorite lines. The \n is just the standard Python marker for "end of line". It is how computers encode pressing the enter key. (Without having these \n markers in a file, Python and other computer programs wouldn't know when to go to the next line.)
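If the end-of-line marker gets in the way, one option is to strip it off. The sketch below uses the string method **rstrip**, which this worksheet has not introduced, so treat it as an optional extra. The name clean_line is just for illustration.

```python
line = "And see thy blood warm when thou feel'st it cold.\n"

# repr shows the string the way Python stores it, including the \n marker
print(repr(line))

# rstrip("\n") returns a copy with any end-of-line markers removed from the right
clean_line = line.rstrip("\n")
print(repr(clean_line))

# The marker counts as exactly one character
print(len(line) - len(clean_line))
```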

Try changing the word "monuments" to something else and you can see that some words appear in many lines. 

In particular "war" appears in a whole bunch of lines. I wonder how many lines contain the word "war"?

In [None]:
import urllib.request

file_name = "https://raw.githubusercontent.com/mlepinski/Python-Worksheets/master/Sonnets.txt"
my_file = urllib.request.urlopen(file_name)

line_count = 0

for line in my_file:
  line = line.decode("utf-8")
  if "monuments" in line:
    line_count = line_count + 1

print(line_count)

So the above code once again tells us that **monuments** appears in only one line of the entire file.

If you go back and change "monuments" to "war", you can see that "war" appears in 39 lines of the file.

There is a bit of a problem here, though. When you ask Python to find "war" in a line, Python will see a line like:

"And see thy blood warm when thou feel'st it cold.\n"

... and notice that "war" is part of "warm". So the if statement that asks:

**if "war" in line:**

will answer "Yes" when it sees the previous line about warm blood, even though the word "war" never appears in it.

In [None]:
line = "And see thy blood warm when thou feel'st it cold.\n"

if "war" in line:
  print("I found 'war' in the line of text")



---

**LOOKING AT INDIVIDUAL WORDS**

Instead of thinking of a file as a collection of lines, we can also think of a file as a collection of words. 

In the following examples, we are going to read the entire file into a very long string and then do stuff with the string. This works as long as your computer has plenty of memory and the file isn't too big. Fortunately, modern computers are able to work with strings that are many megabytes of text. (It is hard to find text files that are many gigabytes of text ... but if you ever find such a monster file, going through line by line is probably your only option.)

In the next example, we use the **read** command to grab the entire contents of a file and put it into a long string. We probably don't want to print the whole string, so let's instead check what words are present ...




In [11]:
import urllib.request

file_name = "https://raw.githubusercontent.com/mlepinski/Python-Worksheets/master/Sonnets.txt"
my_file = urllib.request.urlopen(file_name)

long_string = my_file.read()
long_string = long_string.decode("utf-8")

if "monuments" in long_string:
  print("Good, Monuments is present like it should be")

if "lepinski" in long_string:
  print("Oh No, lepinski shouldn't be in that Sonnet file")
else:
  print("lepinski is missing, as it should be")



Good, Monuments is present like it should be
lepinski is missing, as it should be


Okay, let's print the string just to see what happens ...

In [None]:
import urllib.request

file_name = "https://raw.githubusercontent.com/mlepinski/Python-Worksheets/master/Sonnets.txt"
my_file = urllib.request.urlopen(file_name)

long_string = my_file.read()
long_string = long_string.decode("utf-8")

print(long_string)

It worked, but that is really long!

So if we are just looking for words, we might not care about the line breaks anymore. We can ask Python to replace all of the line breaks with spaces.

(We could also replace line breaks with "!" or "X" or anything else, but I think spaces will be best for now)

In [None]:
import urllib.request

file_name = "https://raw.githubusercontent.com/mlepinski/Python-Worksheets/master/Sonnets.txt"
my_file = urllib.request.urlopen(file_name)

long_string = my_file.read()
long_string = long_string.decode("utf-8")

no_line_breaks = long_string.replace("\n", " ")

print(no_line_breaks)

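In general, the **replace** command in Python takes all instances of a given piece of text in a string (a single character/letter/symbol or a longer sequence) and replaces each one with something of your choice. It gives back a new string and leaves the original string unchanged.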

For example, we could replace every "e" with "Z" as follows:

In [None]:
import urllib.request

file_name = "https://raw.githubusercontent.com/mlepinski/Python-Worksheets/master/Sonnets.txt"
my_file = urllib.request.urlopen(file_name)

long_string = my_file.read()
long_string = long_string.decode("utf-8")

no_line_breaks = long_string.replace("\n", " ")

new_string = no_line_breaks.replace("e", "Z")

print(new_string)

Okay, that was a bit foolish, but it shows how Python can easily make replacements in a string. 
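Replacement is not limited to single letters. Here is a small sketch on a short made-up string (not taken from the sonnet file) that replaces a whole sequence of letters. The variable names are just for illustration.

```python
short_string = "war and peace, warm and cold"

# Every occurrence of the letters "war" is replaced, even inside "warm"
shouted = short_string.replace("war", "WAR")
print(shouted)

# replace gives back a new string, so the original is unchanged
print(short_string)
```

Notice that **replace** has the same blind spot as the **in** test from earlier: it matches "war" inside "warm" too.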

If we are interested in working with individual words, we can have Python split the string up into a list of words:

In [None]:
import urllib.request

file_name = "https://raw.githubusercontent.com/mlepinski/Python-Worksheets/master/Sonnets.txt"
my_file = urllib.request.urlopen(file_name)

long_string = my_file.read()
long_string = long_string.decode("utf-8")

no_line_breaks = long_string.replace("\n", " ")

list_of_words = no_line_breaks.split()

print(list_of_words)

The **split** command in Python is powerful. By default it splits a string into words by breaking it wherever it sees whitespace (that is, spaces, tabs, and new lines). We will see next week that we can also split at a comma (or other delimiter) by putting "," in parentheses as an input to the **split** command.
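Here is a quick sketch of both behaviors on short made-up strings. Neither string comes from the sonnet file, and the variable names are just for illustration.

```python
# A made-up string containing a double space, a tab (\t) and a line break (\n)
sample = "one two  three\tfour\nfive"

# With no input, split breaks at any run of whitespace
words = sample.split()
print(words)

# With "," as input, split breaks at each comma instead
row = "apple,banana,cherry"
fields = row.split(",")
print(fields)
```

Because the default **split** already treats a line break as whitespace, it gives the same list of words whether or not the line breaks were replaced with spaces first.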

Once we have a list of words, we can easily count how many times "war" appears in the sonnet file. 

In [None]:
import urllib.request

file_name = "https://raw.githubusercontent.com/mlepinski/Python-Worksheets/master/Sonnets.txt"
my_file = urllib.request.urlopen(file_name)

long_string = my_file.read()
long_string = long_string.decode("utf-8")

no_line_breaks = long_string.replace("\n", " ")

list_of_words = no_line_breaks.split()

word_count = 0 

for word in list_of_words:
  if word == "war":
    word_count = word_count + 1

print("The word war appears", word_count, "times in the file")
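One thing to watch for: after splitting, a word keeps any punctuation attached to it, so "war," and "war." are not equal to "war". The sketch below shows one possible way to handle this on a short made-up string. It uses **strip** and **lower**, which this worksheet has not introduced, and treating "War" and "war" as the same word is an assumption you may or may not want.

```python
import string

# A made-up sentence, not taken from the sonnet file
sample_text = "War, war and more war. The warm warrior went to war!"

exact_count = 0
cleaned_count = 0

for word in sample_text.split():
  # Exact comparison: punctuation and capital letters prevent a match
  if word == "war":
    exact_count = exact_count + 1

  # Remove punctuation from both ends of the word and make it lowercase
  cleaned = word.strip(string.punctuation).lower()
  if cleaned == "war":
    cleaned_count = cleaned_count + 1

print("Exact matches:", exact_count)
print("Matches after cleaning:", cleaned_count)
```

Either way, "warm" and "warrior" are not counted, because we are comparing whole words rather than searching inside them.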