# Notebook 3.3: File objects and `requests`

### Filepath operations with the `os` package
A type of string that is often difficult to format properly is a filepath, as many of you learned when we tried to edit our `~/.bash_profile` files by hand in class last time. If the string representation of a filepath is off by even a single character, the path will not be found.

If you are writing a program that needs to access filepaths it needs to be written in a way that it will work on any computer, regardless of how that computer formats filepaths. For example, we've seen in class that the location of the `$HOME` directory is different on Linux, MacOSX, and Windows. Fortunately there are packages in Python that generalize path names across systems to keep them properly formatted for us. One way to do this is with the `os` package, which we will use here. 

### Importing a package
Python is a very *atomic* language, meaning that much of the standard library is split into individual packages that need to be loaded before we can use them. This keeps Python lightweight, since the base language does not load these extra utilities unless we ask it to. To load a package that is installed on our system we use the `import` statement, as below. Here we load the `os` package, which is part of the Python standard library, so it is already installed when we install Python. (`gzip` is also in the standard library; `requests` is a third-party package that must be installed separately.)

In [74]:
import os
import gzip
import requests

### Using packages
The `os` package is quite large and we will be using just a small part of it today, the `path` submodule. Python is an object-oriented language, and good Python packages are written to take advantage of this, which makes them easy to use. This means we can access the `os` package and all of its functions as if it were a Python object. Put your cursor after the period in the cell below and press `<tab>` to see available options in `os`. There are many!

In [75]:
## use tab-completion after the '.' to see available options in os
os.

SyntaxError: invalid syntax (<ipython-input-75-ae38a5c94e6a>, line 2)

### Using `os.path`
The `os.path` submodule is used to format filepaths. We can expand shortened path names, join together multiple paths, and look up special directories like `$HOME` or the current directory. Essentially, the module gives us results similar to the commands we learned from bash scripting last week, such as `pwd` to show your current directory, or `~` as a shorthand for your home directory. Here we can access those filepaths as string variables and work with them very easily.

In [128]:
## return my $HOME directory
os.path.expanduser("~")

'C:\\Users\\Montana Airey'

In [129]:
## return my current directory
os.path.abspath('.')

'C:\\Users\\PDSB\\3-Python-basics\\Notebooks'

### Operations on filepaths

In [130]:
## assign my current dir to a variable
curdir = os.path.abspath('.')
curdir

'C:\\Users\\PDSB\\3-Python-basics\\Notebooks'

In [131]:
## get the lowest level directory in curdir
os.path.basename(curdir)

'Notebooks'

In [132]:
## get the directory structure above curdir
os.path.dirname(curdir)

'C:\\Users\\PDSB\\3-Python-basics'

### Joining filepaths
Because it can be hard to keep track of the separator characters ("/" on Linux and MacOSX, "\" on Windows) between directories and filenames, it is useful to use the `join` function of the `os.path` module to join together path names. Here we will create a string variable with a new pathname for a file that doesn't yet exist in our current directory.

In [133]:
## get the full path name to a newfile in our current directory
newfile = os.path.join(curdir, "newfile.txt")
newfile

'C:\\Users\\PDSB\\3-Python-basics\\Notebooks\\newfile.txt'
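
As a minimal sketch of why `os.path.join` is safer than gluing strings together by hand: it inserts the separator of the current system (`os.sep`), and `os.path.dirname` and `os.path.basename` recover the two halves again. The directory names used here (`project`, `data`) are made up for illustration.

```python
import os

# hypothetical directory and file names, used only for illustration
parent = os.path.join("project", "data")
path = os.path.join(parent, "newfile.txt")

# join inserts the separator used by the current operating system
print(path == "project" + os.sep + "data" + os.sep + "newfile.txt")

# dirname and basename split the path back into its two halves
print(os.path.dirname(path) == parent)
print(os.path.basename(path))

# splitext separates the file extension from the rest of the name
root, ext = os.path.splitext(os.path.basename(path))
print(root, ext)
```

The same code should give a correctly formatted path on any system, which is the point of using `os.path` instead of typing the separators ourselves.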

### Writing files

The function `open` returns a file object that we can use to read from or write to a file. The format is `open(filename, mode)`, where mode describes what you plan to do with the file. The main modes are `w` for write, `r` for read, and `a` for append. Below we will use `w`, which creates a new file (or overwrites the file if it already exists).

In [134]:
## get an open file object
ofile = open(newfile, 'w')

## see the file object
ofile

<_io.TextIOWrapper name='C:\\Users\\PDSB\\3-Python-basics\\Notebooks\\newfile.txt' mode='w' encoding='cp1252'>

#### File objects
As with other objects, `ofile` has attributes and functions that we can access and see by using tab-completion. Move your cursor to the end of the object below after the period and use tab to see some of the options. 

In [135]:
## use tab to see options associated with open file objects
ofile.

SyntaxError: invalid syntax (<ipython-input-135-bfb06bdf4a2d>, line 2)

Use the `.write()` function to write a string to the file. 

In [136]:
## write a string to the file. 
## It returns the number of characters written, which we can ignore for now.
ofile.write("Hello world")

11

In [137]:
## when we are done writing to the file use .close()
ofile.close()
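
To make the difference between the `w` and `a` modes concrete, here is a small sketch. It writes to a made-up file name (`demo.txt`) inside a temporary directory, which is an assumption of this example so that no real files are touched.

```python
import os
import tempfile

# work in a temporary directory so no real files are touched
tmpdir = tempfile.mkdtemp()
demo = os.path.join(tmpdir, "demo.txt")  # hypothetical file name

# 'w' creates the file, or erases it first if it already exists
ofile = open(demo, 'w')
ofile.write("Hello world\n")
ofile.close()

# 'a' keeps the existing contents and adds to the end
ofile = open(demo, 'a')
ofile.write("Hello again\n")
ofile.close()

ifile = open(demo, 'r')
after_append = ifile.read()
ifile.close()

# opening with 'w' a second time starts over from an empty file
ofile = open(demo, 'w')
ofile.write("Only this line\n")
ofile.close()

ifile = open(demo, 'r')
after_rewrite = ifile.read()
ifile.close()

print(after_append)
print(after_rewrite)
```

Because `w` silently replaces an existing file, it is worth double checking the filename before opening anything in write mode.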

### Using `requests` to download data
We will spend more time learning about the `requests` package in the future because it is a super useful tool for accessing data from the web. Here we will use it much like we used the `curl` command when learning bash scripting: we want to query a url and get data from it. The response gives us the contents of the page, which we can read as a string. We can either parse that string object directly, or write it to a file. Since we're learning about file objects now we'll practice writing to a file.

In [138]:
url1 = "http://eaton-lab.org/data/40578.fastq.gz"
url2 = "http://eaton-lab.org/data/iris-data-dirty.csv"

The standard way to use `requests` is to make a GET request to a url, which is a request to read the data from that page. This returns a `response` object that we can then query for information. The `response` object holds a status code (200 means success, while codes such as 404 mean the page was not found or the request was refused) along with the body of the response, which for a web page is its HTML text. If the url cannot be reached at all, `requests.get` raises an error instead of returning a response.

In [139]:
## see the response object (200 means successful GET)
response = requests.get(url2)
response

<Response [200]>

In [140]:
## show the first 50 characters
response.text[:50]

'5.1,3.5,1.4,0.2,Iris-setosa\n4.9,3.0,1.4,0.2,Iris-s'

In [141]:
## write the data to a file. The .content returns bytes
ffile = open("./40578.fastq.gz", 'wb')
ffile.write(requests.get(url1).content)
ffile.close()

In [142]:
## Same for the second url. The .text returns unicode
ffile = open("./iris-data-dirty.csv", 'w')
ffile.write(requests.get(url2).text)
ffile.close()
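
The two cells above use different modes because `.content` is bytes and `.text` is a string. The sketch below shows the same distinction without needing a network connection. The row of data and the file names are made up for illustration, and the files are written to a temporary directory.

```python
import os
import tempfile

text = "5.1,3.5,1.4,0.2,Iris-setosa\n"   # a str (unicode text)
raw = text.encode("utf-8")                 # the same data as bytes

tmpdir = tempfile.mkdtemp()

# text mode ('w') accepts str
tfile = open(os.path.join(tmpdir, "text.csv"), 'w')
tfile.write(text)
tfile.close()

# binary mode ('wb') accepts bytes
bfile = open(os.path.join(tmpdir, "bytes.csv"), 'wb')
bfile.write(raw)
bfile.close()

# writing a str to a file opened in binary mode raises a TypeError
bfile = open(os.path.join(tmpdir, "bytes.csv"), 'wb')
try:
    bfile.write(text)
    mixed_ok = True
except TypeError:
    mixed_ok = False
bfile.close()

print(type(text), type(raw), mixed_ok)
```

Compressed files like the `.fastq.gz` are not text, which is why that download is written with `'wb'`.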

### Reading files
To read the data from a file we use the same format as for writing, but with the mode flag `r`. When we show the representation of the file object below you can see that this also returns an open file object, but this time in read mode. We can now access a different set of functions from this object to retrieve data from the file. We will use the `.read()` function to read and return all contents of the file as a string object and store it as the variable `idata`.

In [143]:
ifile = open("./iris-data-dirty.csv", 'r')
ifile

<_io.TextIOWrapper name='./iris-data-dirty.csv' mode='r' encoding='cp1252'>

In [144]:
## read returns all of the contents as a string
idata = ifile.read()

In [145]:
## show the first 50 characters
idata[:50]

'5.1,3.5,1.4,0.2,Iris-setosa\n4.9,3.0,1.4,0.2,Iris-s'

In [146]:
## close the file handle
ifile.close()
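
Besides `.read()`, file objects offer other ways to get at their contents. The following sketch compares three of them on a small made-up file written to a temporary directory (the file name and the two rows are assumptions of the example).

```python
import os
import tempfile

# write a small made-up file to read back
tmpdir = tempfile.mkdtemp()
demo = os.path.join(tmpdir, "demo.csv")
ofile = open(demo, 'w')
ofile.write("5.1,3.5,1.4,0.2,Iris-setosa\n4.9,3.0,1.4,0.2,Iris-setosa\n")
ofile.close()

# .read() returns the whole file as one string
ifile = open(demo, 'r')
whole = ifile.read()
ifile.close()

# .readlines() returns a list with one string per line (newlines kept)
ifile = open(demo, 'r')
lines = ifile.readlines()
ifile.close()

# looping over the file object yields one line at a time
ifile = open(demo, 'r')
nlines = 0
for line in ifile:
    nlines += 1
ifile.close()

print(len(whole), len(lines), nlines)
```

Looping over the file object is usually the better choice for very large files, since it does not need to hold the whole file in memory at once.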

### Gzip compressed files
Gzip compression, as well as many other forms of compression, is easily handled in Python using the standard library. The `gzip` module has an `open()` function that acts just like the regular `open` to create a file object. Let's try it out on the compressed fastq file we just downloaded.

Let's also practice using `os.path` to find the full filepath of the `40578.fastq.gz` file. 

Then, as in the last example, we simply use `.read()` to read the full contents and store it in a variable. Because `gzip.open` in `'rb'` mode returns the data as a bytestring, we also add `.decode()` to convert it to a regular (unicode) string; `.decode()` assumes the bytes are `utf-8` encoded unless told otherwise.

In [147]:
## get full path to the file in our current directory
gzfile = os.path.abspath("./40578.fastq.gz")
gzfile

'C:\\Users\\PDSB\\3-Python-basics\\Notebooks\\40578.fastq.gz'

In [148]:
## read compressed byte data from this file
ffile = gzip.open(gzfile, 'rb')
fdata = ffile.read().decode()
ffile.close()

In [149]:
## show some data from the file
print(fdata[:200])

@40578_rex.1 GRC13_0027_FC:4:1:10524:1181 length=74
TGCAGCATAGCATAGATAATACAAGGTTNNNNNNNNNNNNNNTTTNCACAGTNTNNNATTAAACCCGGTAGNTN
+40578_rex.1 GRC13_0027_FC:4:1:10524:1181 length=74
IIIIIIHIIIIIIIIIGIIIH
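
As an alternative to calling `.decode()` ourselves, `gzip.open` also accepts the text mode `'rt'`, which returns strings directly. This is a minimal sketch on a tiny made-up fastq record written to a temporary directory; the file name and the record are assumptions of the example, not part of the data used above.

```python
import gzip
import os
import tempfile

tmpdir = tempfile.mkdtemp()
gzdemo = os.path.join(tmpdir, "demo.fastq.gz")  # hypothetical file name

# write a small compressed file from bytes
ofile = gzip.open(gzdemo, 'wb')
ofile.write("@read1\nTGCAG\n+\nIIIII\n".encode("utf-8"))
ofile.close()

# 'rb' returns bytes, which need .decode() to become a string
ffile = gzip.open(gzdemo, 'rb')
from_bytes = ffile.read().decode("utf-8")
ffile.close()

# 'rt' decodes for us and returns a string directly
ffile = gzip.open(gzdemo, 'rt', encoding="utf-8")
from_text = ffile.read()
ffile.close()

print(from_bytes == from_text)
```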


### Reading data with the `read()` function
The `read()` function is nice for reading in a large chunk of text, but it then requires us to parse that text using string processing, like we learned in our earlier notebook. Let's use string processing to split the contents of the file into a list. Perhaps instead of separating the contents on every line, as we did for this file when we analyzed it from a bash terminal, we would like to chunk it up into elements that each cover four lines. We can do this by using our own "split" separator. From looking at the text above we can see that each four-line element is separated by the two-character string `"\n@"`, so we'll use that. Note that `split` removes the separator, so every read except the first loses its leading `@`.

In [150]:
## split the fdata string on each occurrence of "\n@"
freads = fdata.strip().split("\n@")

## print the first element in the list
print("The first read: \n{}".format(freads[0]))

## print the last element in the list
print("\nThe last read: \n{}".format(freads[-1]))

## print the number of reads in the file
print("\nN reads in the file = {}".format(len(freads)))

The first read: 
@40578_rex.1 GRC13_0027_FC:4:1:10524:1181 length=74
TGCAGCATAGCATAGATAATACAAGGTTNNNNNNNNNNNNNNTTTNCACAGTNTNNNATTAAACCCGGTAGNTN
+40578_rex.1 GRC13_0027_FC:4:1:10524:1181 length=74
IIIIIIHIIIIIIIIIGIIIHIIIBB:B##############################################

The last read: 
40578_rex.125 GRC13_0027_FC:4:1:2571:1496 length=74
TGCAGCTCACGGTCGTGAGGGTGAGCTTATTTTTTTGTGAACTGTCTCAACTGCTCGTGAGGGTCCTCACGATT
+40578_rex.125 GRC13_0027_FC:4:1:2571:1496 length=74
IIIIIGHIIIIIHIIIIFIIIDIHGIIIBGIIFIDIDIHHIDIHEIHIIIEEEIHIIE>CEEE:DDBDDFECC8

N reads in the file = 125
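
One caveat with splitting on `"\n@"`: in fastq files the quality line can itself begin with `@`, which would split a read in the wrong place. A sketch of an alternative is to split on newlines and group every four lines. The two reads below are made up for illustration, with the second quality line deliberately starting with `@`.

```python
# two made-up reads; the second quality line deliberately starts with "@"
fastq = (
    "@read1\nTGCAG\n+read1\nIIIII\n"
    "@read2\nTGCAT\n+read2\n@IIII\n"
)

# splitting on "\n@" is fooled by the quality line that starts with "@"
by_separator = fastq.strip().split("\n@")

# grouping every four lines does not depend on the contents of the lines
lines = fastq.strip().split("\n")
by_four = []
for start in range(0, len(lines), 4):
    by_four.append("\n".join(lines[start:start + 4]))

# prints: 3 2
print(len(by_separator), len(by_four))
```

Grouping by four also keeps the leading `@` on every read, at the cost of assuming that each record takes exactly four lines.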


### Using context to automatically close files

In Python there is a special keyword called `with` that wraps a block of statements in a *context*. An object opened by the `with` statement is set up when the block starts and cleaned up automatically when the block ends. This is most often used to open a file object: file objects opened using `with` close themselves automatically when the block ends, even if an error occurs inside it. See an example below. This is a much more compact way of opening and closing files than what we were using before.

In [151]:
## path to the iris file we downloaded above, relative to the current directory
irisfile = "./iris-data-dirty.csv"

In [152]:
## infile will automatically close when finished.
with open(irisfile, 'r') as infile:
    data = infile.readlines()

In [153]:
## show the first 10 lines
data[:10]
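
To check that `with` really does close the file for us, we can look at the `.closed` attribute of the file object. This sketch uses a made-up file in a temporary directory, and the `ValueError` is raised on purpose to show what happens when something goes wrong inside the block.

```python
import os
import tempfile

tmpdir = tempfile.mkdtemp()
demo = os.path.join(tmpdir, "demo.txt")  # hypothetical file name

# the file is open inside the with block...
with open(demo, 'w') as outfile:
    outfile.write("Hello world\n")
    open_inside = not outfile.closed

# ...and closed as soon as the block ends
closed_after = outfile.closed

# the file is closed even if an error interrupts the block
try:
    with open(demo, 'r') as infile:
        raise ValueError("something went wrong")
except ValueError:
    pass

# prints: True True True
print(open_inside, closed_after, infile.closed)
```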

### Challenges
Your challenge is to perform similar tasks to those we did in the first bash assignment, but using Python. We'll focus on filtering and counting the Iris data set. This will use the skills you learned for operating on strings and lists, as well as reading and writing files. 

In [76]:
## Download the iris data set and write it to a file

In [77]:
import os
import gzip
import requests
iris = "http://eaton-lab.org/data/iris-data-dirty.csv"
irda = open("./iris-data-dirty.csv", 'w')
irda.write(requests.get(iris).text)
irda.close()

In [78]:
## read in the iris data set from its filepath and store the data as a string

In [79]:
## open the file, read all of its contents into a string, then close it
file = open("./iris-data-dirty.csv", 'r')
data = file.read()
file.close()

In [80]:
## replace "setsa" with "setosa" and "colour" with "color" in the string data
data = data.replace("setsa", "setosa")
data = data.replace("colour", "color")
data

'5.1,3.5,1.4,0.2,Iris-setosa\n4.9,3.0,1.4,0.2,Iris-setosa\n4.7,3.2,1.3,0.2,Iris-setosa\n4.6,3.1,1.5,0.2,Iris-setosa\n5.0,3.6,1.4,0.2,Iris-setosa\n5.4,3.9,1.7,0.4,Iris-setosa\n4.6,3.4,1.4,0.3,Iris-setosa\n5.0,3.4,1.5,0.2,Iris-setosa\n4.4,2.9,1.4,0.2,Iris-setosa\n4.9,3.1,1.5,0.1,Iris-setosa\n5.4,3.7,1.5,0.2,Iris-setosa\n4.8,3.4,1.6,0.2,Iris-setosa\n4.8,3.0,1.4,0.1,Iris-setosa\n4.3,3.0,1.1,0.1,Iris-setosa\n5.8,4.0,1.2,0.2,Iris-setosa\n5.7,4.4,1.5,0.4,Iris-setosa\n5.4,3.9,1.3,0.4,Iris-setosa\n5.1,3.5,1.4,0.3,Iris-setosa\n5.7,3.8,1.7,0.3,Iris-setosa\n5.1,3.8,1.5,0.3,Iris-setosa\n5.4,3.4,1.7,0.2,Iris-setosa\n5.1,3.7,1.5,0.4,Iris-setosa\n4.6,3.6,1.0,0.2,Iris-setosa\n5.1,3.3,1.7,0.5,Iris-setosa\n4.8,3.4,1.9,0.2,Iris-setosa\n5.0,3.0,1.6,0.2,Iris-setosa\n5.0,3.4,1.6,0.4,Iris-setosa\n5.2,3.5,1.5,0.2,Iris-setosa\n5.2,3.4,1.4,0.2,Iris-setosa\n4.7,3.2,1.6,0.2,Iris-setosa\n4.8,3.1,1.6,0.2,Iris-setosa\n5.4,3.4,1.5,0.4,Iris-setosa\n5.2,4.1,1.5,0.1,Iris-setosa\n5.5,4.2,1.4,0.2,Iris-setosa\n4.9,3.1,1.5,0

In [81]:
## split the string to convert it into a list of lines from the file

In [82]:
list = data.split("\n")
list

['5.1,3.5,1.4,0.2,Iris-setosa',
 '4.9,3.0,1.4,0.2,Iris-setosa',
 '4.7,3.2,1.3,0.2,Iris-setosa',
 '4.6,3.1,1.5,0.2,Iris-setosa',
 '5.0,3.6,1.4,0.2,Iris-setosa',
 '5.4,3.9,1.7,0.4,Iris-setosa',
 '4.6,3.4,1.4,0.3,Iris-setosa',
 '5.0,3.4,1.5,0.2,Iris-setosa',
 '4.4,2.9,1.4,0.2,Iris-setosa',
 '4.9,3.1,1.5,0.1,Iris-setosa',
 '5.4,3.7,1.5,0.2,Iris-setosa',
 '4.8,3.4,1.6,0.2,Iris-setosa',
 '4.8,3.0,1.4,0.1,Iris-setosa',
 '4.3,3.0,1.1,0.1,Iris-setosa',
 '5.8,4.0,1.2,0.2,Iris-setosa',
 '5.7,4.4,1.5,0.4,Iris-setosa',
 '5.4,3.9,1.3,0.4,Iris-setosa',
 '5.1,3.5,1.4,0.3,Iris-setosa',
 '5.7,3.8,1.7,0.3,Iris-setosa',
 '5.1,3.8,1.5,0.3,Iris-setosa',
 '5.4,3.4,1.7,0.2,Iris-setosa',
 '5.1,3.7,1.5,0.4,Iris-setosa',
 '4.6,3.6,1.0,0.2,Iris-setosa',
 '5.1,3.3,1.7,0.5,Iris-setosa',
 '4.8,3.4,1.9,0.2,Iris-setosa',
 '5.0,3.0,1.6,0.2,Iris-setosa',
 '5.0,3.4,1.6,0.4,Iris-setosa',
 '5.2,3.5,1.5,0.2,Iris-setosa',
 '5.2,3.4,1.4,0.2,Iris-setosa',
 '4.7,3.2,1.6,0.2,Iris-setosa',
 '4.8,3.1,1.6,0.2,Iris-setosa',
 '5.4,3.

In [83]:
## strip the newline character from the end of each list element

In [84]:
list= list[:-1]
list

['5.1,3.5,1.4,0.2,Iris-setosa',
 '4.9,3.0,1.4,0.2,Iris-setosa',
 '4.7,3.2,1.3,0.2,Iris-setosa',
 '4.6,3.1,1.5,0.2,Iris-setosa',
 '5.0,3.6,1.4,0.2,Iris-setosa',
 '5.4,3.9,1.7,0.4,Iris-setosa',
 '4.6,3.4,1.4,0.3,Iris-setosa',
 '5.0,3.4,1.5,0.2,Iris-setosa',
 '4.4,2.9,1.4,0.2,Iris-setosa',
 '4.9,3.1,1.5,0.1,Iris-setosa',
 '5.4,3.7,1.5,0.2,Iris-setosa',
 '4.8,3.4,1.6,0.2,Iris-setosa',
 '4.8,3.0,1.4,0.1,Iris-setosa',
 '4.3,3.0,1.1,0.1,Iris-setosa',
 '5.8,4.0,1.2,0.2,Iris-setosa',
 '5.7,4.4,1.5,0.4,Iris-setosa',
 '5.4,3.9,1.3,0.4,Iris-setosa',
 '5.1,3.5,1.4,0.3,Iris-setosa',
 '5.7,3.8,1.7,0.3,Iris-setosa',
 '5.1,3.8,1.5,0.3,Iris-setosa',
 '5.4,3.4,1.7,0.2,Iris-setosa',
 '5.1,3.7,1.5,0.4,Iris-setosa',
 '4.6,3.6,1.0,0.2,Iris-setosa',
 '5.1,3.3,1.7,0.5,Iris-setosa',
 '4.8,3.4,1.9,0.2,Iris-setosa',
 '5.0,3.0,1.6,0.2,Iris-setosa',
 '5.0,3.4,1.6,0.4,Iris-setosa',
 '5.2,3.5,1.5,0.2,Iris-setosa',
 '5.2,3.4,1.4,0.2,Iris-setosa',
 '4.7,3.2,1.6,0.2,Iris-setosa',
 '4.8,3.1,1.6,0.2,Iris-setosa',
 '5.4,3.

In [85]:
## remove any lines that are empty or have "NA" in them.

In [86]:
## write every line that is not empty and does not contain "NA"
clean = open('iris_clean', "w")
for line in list:
    if line and 'NA' not in line:
        clean.write(line + "\n")
clean.close()

In [87]:
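## read the filtered lines back in, skipping any empty strings left by the split
clean_data = open("iris_clean", "r")
idata = [line for line in clean_data.read().split("\n") if line]
clean_data.close()
idata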

['5.1,3.5,1.4,0.2,Iris-setosa',
 '4.9,3.0,1.4,0.2,Iris-setosa',
 '4.7,3.2,1.3,0.2,Iris-setosa',
 '4.6,3.1,1.5,0.2,Iris-setosa',
 '5.0,3.6,1.4,0.2,Iris-setosa',
 '5.4,3.9,1.7,0.4,Iris-setosa',
 '4.6,3.4,1.4,0.3,Iris-setosa',
 '5.0,3.4,1.5,0.2,Iris-setosa',
 '4.4,2.9,1.4,0.2,Iris-setosa',
 '4.9,3.1,1.5,0.1,Iris-setosa',
 '5.4,3.7,1.5,0.2,Iris-setosa',
 '4.8,3.4,1.6,0.2,Iris-setosa',
 '4.8,3.0,1.4,0.1,Iris-setosa',
 '4.3,3.0,1.1,0.1,Iris-setosa',
 '5.8,4.0,1.2,0.2,Iris-setosa',
 '5.7,4.4,1.5,0.4,Iris-setosa',
 '5.4,3.9,1.3,0.4,Iris-setosa',
 '5.1,3.5,1.4,0.3,Iris-setosa',
 '5.7,3.8,1.7,0.3,Iris-setosa',
 '5.1,3.8,1.5,0.3,Iris-setosa',
 '5.4,3.4,1.7,0.2,Iris-setosa',
 '5.1,3.7,1.5,0.4,Iris-setosa',
 '4.6,3.6,1.0,0.2,Iris-setosa',
 '5.1,3.3,1.7,0.5,Iris-setosa',
 '4.8,3.4,1.9,0.2,Iris-setosa',
 '5.0,3.0,1.6,0.2,Iris-setosa',
 '5.0,3.4,1.6,0.4,Iris-setosa',
 '5.2,3.5,1.5,0.2,Iris-setosa',
 '5.2,3.4,1.4,0.2,Iris-setosa',
 '4.7,3.2,1.6,0.2,Iris-setosa',
 '4.8,3.1,1.6,0.2,Iris-setosa',
 '5.4,3.

In [88]:
## join the filtered lines (idata), not the unfiltered list, into one string
clean_string = '\n'.join(idata)
clean_string

'5.1,3.5,1.4,0.2,Iris-setosa\n4.9,3.0,1.4,0.2,Iris-setosa\n4.7,3.2,1.3,0.2,Iris-setosa\n4.6,3.1,1.5,0.2,Iris-setosa\n5.0,3.6,1.4,0.2,Iris-setosa\n5.4,3.9,1.7,0.4,Iris-setosa\n4.6,3.4,1.4,0.3,Iris-setosa\n5.0,3.4,1.5,0.2,Iris-setosa\n4.4,2.9,1.4,0.2,Iris-setosa\n4.9,3.1,1.5,0.1,Iris-setosa\n5.4,3.7,1.5,0.2,Iris-setosa\n4.8,3.4,1.6,0.2,Iris-setosa\n4.8,3.0,1.4,0.1,Iris-setosa\n4.3,3.0,1.1,0.1,Iris-setosa\n5.8,4.0,1.2,0.2,Iris-setosa\n5.7,4.4,1.5,0.4,Iris-setosa\n5.4,3.9,1.3,0.4,Iris-setosa\n5.1,3.5,1.4,0.3,Iris-setosa\n5.7,3.8,1.7,0.3,Iris-setosa\n5.1,3.8,1.5,0.3,Iris-setosa\n5.4,3.4,1.7,0.2,Iris-setosa\n5.1,3.7,1.5,0.4,Iris-setosa\n4.6,3.6,1.0,0.2,Iris-setosa\n5.1,3.3,1.7,0.5,Iris-setosa\n4.8,3.4,1.9,0.2,Iris-setosa\n5.0,3.0,1.6,0.2,Iris-setosa\n5.0,3.4,1.6,0.4,Iris-setosa\n5.2,3.5,1.5,0.2,Iris-setosa\n5.2,3.4,1.4,0.2,Iris-setosa\n4.7,3.2,1.6,0.2,Iris-setosa\n4.8,3.1,1.6,0.2,Iris-setosa\n5.4,3.4,1.5,0.4,Iris-setosa\n5.2,4.1,1.5,0.1,Iris-setosa\n5.5,4.2,1.4,0.2,Iris-setosa\n4.9,3.1,1.5,0

In [89]:
## write the string to a new file called "iris-data-clean.csv"

In [90]:
newfile=open('iris-data-clean.csv', 'w')
newfile.write(clean_string)
newfile.close()
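
The cleaning steps above can also be collected into a single function, which makes them easier to test and reuse. This is one possible sketch rather than the expected answer: the function name `clean_iris_text` is made up here, and the three rows are invented for illustration and are not taken from the real data file.

```python
def clean_iris_text(text):
    """Return cleaned CSV text: typos fixed, empty and NA lines removed."""
    # fix the two known typos in the species names
    text = text.replace("setsa", "setosa").replace("colour", "color")
    kept = []
    for line in text.split("\n"):
        line = line.strip()
        # keep only lines that are not empty and do not contain "NA"
        if line and "NA" not in line:
            kept.append(line)
    # end the text with a newline, as is conventional for text files
    return "\n".join(kept) + "\n"

# made-up rows for illustration; not taken from the real data file
dirty = (
    "5.1,3.5,1.4,0.2,Iris-setsa\n"
    "\n"
    "NA,3.0,1.4,0.2,Iris-setosa\n"
    "6.0,2.2,5.0,1.5,Iris-versicolour\n"
)
print(clean_iris_text(dirty))
```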


## Finished
Save this notebook and close it. Push a copy of the notebook to the `assignment/` directory with your name in the filename like `./assignment/<myname>-3.3.ipynb`. 