<div align ="right">Thomas Jefferson University COMP 102: Intro to Scientific Computing</div>

# Importing and exporting data

Importing data into Python is the very first step of most scientific computing problems. Without telling Python where to find our raw data, or directly feeding Python our data, the computer programs we write are **basically useless**. Being able to import data from within a program allows us to reuse the same code on many different datasets. 

And what's the point of going to the trouble of coding something in Python if our goal isn't to automate processes that are time-consuming and repetitive?

Likewise, what's the point of manipulating data in Python if we can't store it in a file that can be saved and shared with our scientific collaborators?

## Basic data import and the `open()` function

In COMP 101 we did a simple data import of DNA data that was stored in a text file. First let's review that, just for fun. If we have DNA sequence data stored in a directory called `data` that exists in the same directory as this notebook, we can access it with the code below:

In [1]:
openfile = open("data/dna.txt","r")
DNA_sequence = openfile.read()
openfile.close()
print(DNA_sequence)

TCTTGGCACTCGCATACTCAAACAGTCCTTGTGCACGGCGCGCAGCTTTAAACTACGTAGGATGTCTCGCGTGGTCCAGATCCACTCAGCCTACCGCTCAATATCGCGGAGGCCAGTTATCTTTCGGGGAAGAGGGGTAGTGACAGGACAACGGTCCCGCAAACATAAGCTGCTGGAATGAAGAGTCTTCGCGGATCAGTAACTGAATGAGAGCATACCCGTATCCGGTCTACGCTTGGTATAACCCACGCGTTGCTAAATACCACACAATCAGTCACCACTTATGCGACATCTCTCATTGCAAGGTTAAAAGCCTCACTCAATTTGTAAAGATTATAGGGGAAAATTGGAGGATCTGAATCACCTGTACATGCTGATTTCGTCGATCGTTCGTCTATTTTTAATGTCGCACCGCTAGTGCTCAGGCGCAATACAAGATACGCGTAGCACGCGCCTTCACCCTAAGCGTAGTGCTGATTGTGTCCATGGCTATACATCCCTCAGAATGCACTGACGCCAACCTTGCGCTATATTCCATATTGTATCGGTTGTACCTGAAATCGCATCGACAAACTGACCTTTTTGTATTTTCGAGTGATAGTGCTCAATCTGGAGGGACAGGCCGCTCGGCTTGAATTCAATTGCCAGTGAGCGTCCACCTGCCGTAAGAGATGACCGGCACCTACGATCGGGAACTCGCTTTTAGCAGACCGACGAGGAGTCCCCGATGTGGAACGTTAGTACATTGCAGTGGGCGAGTGGTACTTACCCAGACGGTGTCCTCTACGAGGAAGGCTCGGGCTCTAGGCTCAAGATACGAACAGAAGGCAGGGTTGCGACGAAATACGGGCCTAACTATTTCTTCGGCACAATATAATTAACAGCCCCTTCGGCTAAGACAACGGTGCCGCCCGGCCTCTCGGAAGTTGAGATTTCCATTGGACTTTGTCGATTGAAAACAC


In my experience this code can be a little challenging for students to understand, so let's dig into it just a little bit more.

Recall that Python works with *objects* that belong to particular *classes*. Each class has certain characteristics, and also has certain *methods* that can be applied to its objects. One type of object that we will be working with today is a *file object*.

The Python function `open()` creates a file object, which we assign to the variable name `openfile`. Note that the function call includes a second argument, `"r"`, which indicates that this is a file object that we will use to read data. If we were writing data to a file, or appending data to an existing file, this argument would be different (`"w"` or `"a"`, respectively).

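In the next step, we apply a method, `.read()`, to our object named `openfile`. This method belongs to the file object class. We use this method to read data from a file, and assign this data, the contents of the file, to a new variable, `DNA_sequence`. Finally, the `.close()` method closes the file once we are done with it. The parentheses matter: `openfile.close` without them names the method but does not call it.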
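
As a minimal sketch of how the mode argument changes what a file object can do, the example below writes to, appends to, and then reads a small file. The file name `example.txt` and the temporary directory are hypothetical, chosen so the sketch does not touch the course data. It also uses a `with` block, which is not part of the example above; it is a common alternative that closes the file automatically.

```python
import os
import tempfile

# Hypothetical scratch file so the example does not touch the course data.
path = os.path.join(tempfile.mkdtemp(), "example.txt")

# "w" creates the file (or overwrites it) and opens it for writing.
with open(path, "w") as outfile:
    outfile.write("ACGT\n")

# "a" opens the file for appending; new text is added at the end.
with open(path, "a") as outfile:
    outfile.write("TTGA\n")

# "r" opens the file for reading; .read() returns the whole file as one string.
with open(path, "r") as infile:
    contents = infile.read()

print(contents)
print(infile.closed)  # True: the with block closed the file for us
```

Checking the `.closed` attribute is a quick way to confirm that a file object has really been closed.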

Let's use the `type` function in Python to take a look at each of the variables that we have created:

In [2]:
print(type(openfile))
print(type(DNA_sequence))

<class '_io.TextIOWrapper'>
<class 'str'>


In [4]:
# Move this farther down but it's a good example

openfile = open("data/Gradebook.csv","r")
gradebook = openfile.read()
openfile.close()
print(gradebook)
print(type(openfile))
print(type(gradebook))



Student,Exam1,Exam2,Exam3,Homework
DeShawn,89,92,93,95
Marie,92,91,94,95
Bob,75,77,72,97
Eloise,82,86,88,85
Kayla,85,91,83,100
<class '_io.TextIOWrapper'>
<class 'str'>


##  .csv files and different flavors of text files

The type of file we will be working with over the next few examples is not just a string of text. It is a kind of data file called a CSV file. These files end in the extension `.csv`. CSV stands for comma-separated values, and it means exactly what you think it means: the individual values within the data are separated from one another with a comma (`,`). Data can be exported as a `.csv` file from many different programs, including spreadsheet programs such as Microsoft Excel or Google Sheets.

In this tutorial we will be working with data that exists in a simple row-and-column configuration: spreadsheet data, which is laid out as a simple two-dimensional grid. Not all data looks like this, but it's a very large category of data that you will encounter. On the screen, a spreadsheet program displays this data as a grid for easy viewing. In the file, each row of the spreadsheet is stored as a line of text that contains the values of that row, with commas separating the values.

Values, whether they are numbers, such as the percentage received on a given test, or strings of letters, such as the name of a student, are stored in the computer as a simple text file, with commas separating the cells of the grid (the cells you would see in Excel). These files are widely used because they can be viewed as simple text files and can also be read or exported by any spreadsheet program.
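
To make the storage format concrete, here is a minimal sketch that turns a small grid into comma-separated text. The variable names `grid`, `lines`, and `csv_text` are made up for illustration, and the values are a subset of the gradebook shown earlier. Joining with commas by hand is only safe when no value itself contains a comma; Python's built-in `csv` module handles that case with quoting.

```python
# A small grid, written as a list of rows (a subset of the gradebook example).
grid = [
    ["Student", "Exam1", "Exam2"],
    ["DeShawn", 89, 92],
    ["Marie", 92, 91],
]

# Join the values in each row with commas, then join the rows with newlines.
lines = [",".join(str(value) for value in row) for row in grid]
csv_text = "\n".join(lines)

print(csv_text)
```

Each printed line corresponds to one row of the grid, which is the same layout you see when a `.csv` file is opened in a text editor.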

Note that, unlike `.xlsx` Excel files, `.csv` files cannot contain multiple sheets.

Here is the same `.csv` data file as it appears in an Excel spreadsheet:
![excel.png](images/excel.png "Data in an excel file")

And in a text file:
![text.png](images/text.png "Data in a text file")



Python provides several functions that can be used to read CSV files. I've included the same `.csv` file shown above in the data folder associated with this exercise. But let's look first at what happens if we attempt to access a `.csv` file just as we accessed the DNA sequence above.

In [None]:
openfile = open("data/Gradebook.csv","r")
DNA_sequence = openfile.read()
openfile.close()
print(DNA_sequence)

Notice what we got back: the block of text we expected. But notice also that there is no structure to this data. Let's check what type of object it is actually stored as:

In [None]:
type(DNA_sequence)

Huh, that's strange. It shows as being just a string. Let's try indexing it.

In [None]:
print(DNA_sequence[0])   #the first character in the string
print(DNA_sequence[7])   #the eighth character in the string
print(DNA_sequence[33])  #happens to be last letter of first line
print(DNA_sequence[34])  #what is this?
print(DNA_sequence[35])  #happens to be first letter of second line

Note that Python treats the data as a single string of text, so each index returns a single character. The item at index 34 shows up blank because it is a special character called a newline (written `\n` in Python), which marks the end of a line. This is why the text that follows it is displayed on a separate line.
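
The invisible character can be made visible with `repr()`. The sketch below types the gradebook text in by hand rather than reading it from `data/Gradebook.csv`, so that it runs on its own; the variable names `gradebook_text` and `rows`, and the trailing newline on the last line, are assumptions for illustration.

```python
# The gradebook text, typed in by hand to match the file contents shown earlier.
gradebook_text = (
    "Student,Exam1,Exam2,Exam3,Homework\n"
    "DeShawn,89,92,93,95\n"
    "Marie,92,91,94,95\n"
    "Bob,75,77,72,97\n"
    "Eloise,82,86,88,85\n"
    "Kayla,85,91,83,100\n"
)

# repr() shows special characters in their escaped form instead of printing them.
print(repr(gradebook_text[34]))

# .splitlines() breaks the string at each end of line and returns a list of lines.
rows = gradebook_text.splitlines()
print(rows[0])
print(rows[1])
```

Splitting on the end of each line recovers the rows, but each row is still one string with the commas inside it.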

So as we can see, taking in data this way works, but it doesn't work very well. It would be much more efficient if we were able to store the information in a data structure that preserves its rows and columns. What if we had thousands of students, and we wanted the program to have access to information such as "What was Eloise's score on Exam 2?" A single long string would be a very inefficient way to handle that type of information.
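
As one possible illustration of what a structured version could look like, the sketch below uses `csv.DictReader` from Python's standard library to build a dictionary keyed by student name. This is an assumption about approach, not necessarily the method this tutorial goes on to use. The text is typed in by hand (via the hypothetical variable `gradebook_text`) so the example runs without the data folder.

```python
import csv
import io

# The gradebook text, typed in by hand to match the file contents shown earlier.
gradebook_text = (
    "Student,Exam1,Exam2,Exam3,Homework\n"
    "DeShawn,89,92,93,95\n"
    "Marie,92,91,94,95\n"
    "Bob,75,77,72,97\n"
    "Eloise,82,86,88,85\n"
    "Kayla,85,91,83,100\n"
)

# io.StringIO lets the csv module treat a string as if it were an open file.
reader = csv.DictReader(io.StringIO(gradebook_text))

# Build a dictionary that maps each student's name to that student's row.
scores = {row["Student"]: row for row in reader}

# The csv module reads every value as a string, so convert before doing math.
eloise_exam2 = int(scores["Eloise"]["Exam2"])
print(eloise_exam2)
```

With the data organized this way, looking up one student's score is a direct lookup by name and column instead of a count of character positions.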