# Session 2 -- Single Structured File

In this session, we will be covering how to load a single file into a `DataFrame`, as well as some ways to interact with a `DataFrame`.

## Packages
First of all, we need to import some packages.
### `os`
The `os` package enables us to interface with the operating system, particularly with files, environment variables, and command line arguments. We will mostly be using it to interact with files.

In [8]:
import os

### `pandas`
As covered in the previous session, `pandas` gives us access to the `DataFrame` class and all its functions, as well as file-reading functions that can load many formats into a `DataFrame`, including .csv, .xls, .tsv, fixed-width files, and even database queries.

In order to save some typing, the `import` statement lets us give an alias to an imported library. A very common one you will see is `pd` for `pandas`, so we will do that here. Give an import an alias with the `as` keyword.

In [9]:
import pandas as pd

### `datetime`
The `datetime` package gives us access to date- and time-related objects. This is more useful than just having a `str` type date, as `datetime` objects give us access to each unit of a date or time (year, month, day, hour, minute, second, microsecond), and let us convert between formats easily as well.  
For this session, we will only be using the `datetime` class from the `datetime` package (slightly confusing, I know), so use the `from` keyword.

In [10]:
from datetime import datetime
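
As a quick illustration of why a `datetime` object is handier than a plain string, here is a minimal sketch. The timestamp is made up for the example; the attribute names and `strftime()` are standard parts of the `datetime` class.

```python
from datetime import datetime

# A fixed, made-up timestamp so the results are repeatable
stamp = datetime(2019, 4, 2, 11, 0, 30)

# Each unit of the date and time is available as an attribute
print(stamp.year, stamp.month, stamp.day)      # 2019 4 2
print(stamp.hour, stamp.minute, stamp.second)  # 11 0 30

# strftime() writes the same moment out in whatever layout we ask for
day_first = stamp.strftime("%d/%m/%Y")
print(day_first)                               # 02/04/2019
```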

## Getting the File

We have some sample data in our data folder. If you open it, you will see that it is already nicely structured -- the first row holds the column header names, and the rest of the file is all data. There is no data outside of the table, and the table is uniform, with data in every column for every row.

To gain access to this file, we will need to get its filepath as a string. We'll assign it to a variable `infile`. Change the filepath to match where you saved yours:

In [1]:
infile = "C:\Users\161289\Py-R\data\session2\session2data.csv"

SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \UXXXXXXXX escape (<ipython-input-1-dc5ab235ac2e>, line 1)

You'll see that this errored out. This is due to the '`\`' -- this is a special character. It might not seem so special by itself, but when combined with certain other characters, it is used to denote special characters.  
For example, `\n` is used to represent a newline character.

In [2]:
print ("this is\na new line")

this is
a new line


In order to bypass these special characters and to let Python know that you are trying to actually use a `\` character, you would use a second `\` to "escape" special character sequences.

In [3]:
print ("this is\\nnot a new line")

this is\nnot a new line


Note that doing this "escape" results in just one `\` being printed. 
The following are special character sequences: 
- `\newline`
- `\\`
- `\'`
- `\"`
- `\a`
- `\b`
- `\f`
- `\n`
- `\N`
- `\r`
- `\t`
- `\u`
- `\U`
- `\v`
- `\`(`0`...`7`) (one to three octal digits, each between 0 and 7)
- `\x`


So as you can see, the path I attempted to create runs into the `\U` sequence, which expects eight hexadecimal digits after the `U` in order to identify a Unicode character. So let's escape that `\U` by using a second `\` character:

In [4]:
infile = "C:\\Users\161289\Py-R\data\session2\session2data.csv"

Looks like it ran successfully! However, if you check what the value of infile is, you will see that there is an issue here:

In [5]:
infile

'C:\\Usersq289\\Py-R\\data\\session2\\session2data.csv'

You'll see that we also triggered the octal escape sequence (a backslash followed by digits). When up to three octal digits (0 through 7) follow a backslash, Python reads them as the octal code of a character; here `\161` became `q`. So let's escape that backslash as well.

In [None]:
infile = "C:\\Users\\161289\Py-R\data\session2\session2data.csv"
infile

So finally, we got what we wanted. You might be thinking this is very tedious and annoying, especially given that Python might not even raise an error or warn you when such a thing might be happening. This requires a lot of precision and is prone to error!  
Luckily, there is an easy way to handle all of these: put the character `r` in front of the opening quote.  
This `r` marks the string as *raw*, meaning the backslashes are kept exactly as typed.

In [11]:
infile = r"C:\Users\161289\Py-R\data\session2\session2data.csv"
infile

'C:\\Users\\161289\\Py-R\\data\\session2\\session2data.csv'

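And you'll see here that we have the path that we wanted. The doubled backslashes are just how Python displays a string; each pair stands for a single `\` in the actual value.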
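
To convince ourselves that the raw string and the fully escaped string really are the same value, here is a small sketch. The folder name is shortened for illustration and is not meant to be a real location.

```python
# Two spellings of the same (illustrative) Windows folder
escaped = "C:\\Users\\161289\\data"   # every backslash doubled
raw = r"C:\Users\161289\data"         # raw string, backslashes kept as typed

print(escaped == raw)   # True

# With only the first backslash escaped, "\161" is read as an octal escape.
# 161 in base 8 is 113 in base 10, which is the character code for "q".
partly_escaped = "C:\\Users\161289"
print(int("161", 8))    # 113
print(chr(113))         # q
print(partly_escaped)   # C:\Usersq289
```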

## Converting the File to a `DataFrame`

We have actually already seen this in our previous session, when we were going over dot notation. To convert a file to a `DataFrame`, we can use the `read_csv()` function of `pandas`. Inside the parentheses of the function, we will provide the path of the file we want to read. We'll give this `DataFrame` the name `df0`. Also, remember that we gave `pandas` the alias of `pd`, so when calling the function, that is what we will refer to it as.

In [12]:
df0 = pd.read_csv(infile)

Just to confirm the contents of `df0`, let's call it here:

In [13]:
df0

Unnamed: 0,DateTime,Recipe,Step,Interval,Pressure,GasFlow,ElectricalPower,Temperature
0,4/2/19 11:00 AM,Etching101,1,1,0.0,0.0,0,60.1
1,4/2/19 11:00 AM,Etching101,1,2,5.2,13.0,0,60.0
2,4/2/19 11:00 AM,Etching101,1,3,7.0,17.5,0,60.1
3,4/2/19 11:00 AM,Etching101,1,4,12.0,30.0,0,60.2
4,4/2/19 11:00 AM,Etching101,1,5,18.0,45.0,0,60.1
5,4/2/19 11:00 AM,Etching101,1,6,25.0,62.5,0,60.0
6,4/2/19 11:00 AM,Etching101,1,7,23.0,57.5,0,60.0
7,4/2/19 11:00 AM,Etching101,1,8,19.0,47.5,0,60.1
8,4/2/19 11:00 AM,Etching101,1,9,20.0,50.0,0,60.0
9,4/2/19 11:00 AM,Etching101,2,10,21.0,50.0,52,60.0
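
If you want to try `read_csv()` without the sample file on disk, here is a minimal sketch that reads the same kind of data from text held in memory. The three rows are copied from the output above; `io.StringIO` is only a stand-in for a real file, and the name `sample` is made up for this example.

```python
import io
import pandas as pd

# A few rows in the same layout as the sample data, held in memory as text
csv_text = """DateTime,Recipe,Step,Interval,Pressure,GasFlow,ElectricalPower,Temperature
4/2/19 11:00 AM,Etching101,1,1,0.0,0.0,0,60.1
4/2/19 11:00 AM,Etching101,1,2,5.2,13.0,0,60.0
4/2/19 11:00 AM,Etching101,1,3,7.0,17.5,0,60.1
"""

# read_csv() accepts a file path or a file-like object such as StringIO
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)   # (3, 8)
print(sample.dtypes)  # the type pandas inferred for each column
```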


## Reading/Checking a `DataFrame`

As we just saw, we can check the contents of a `DataFrame` by calling its variable. However, sometimes the data we are dealing with might be too large to reasonably display the whole thing, so here are some ways to call different pieces or descriptors of a `DataFrame`:

### Columns
One quick way to verify data is to see all the columns of a `DataFrame`.
We can see the `columns` attribute of a `DataFrame` using dot notation, as we saw in the previous session.

In [14]:
df0.columns

Index(['DateTime', 'Recipe', 'Step', 'Interval', 'Pressure', 'GasFlow',
       'ElectricalPower', 'Temperature'],
      dtype='object')

We can get the column names in a list if preferred. This can have useful applications in our scripts, as we will see later (plus it looks nicer when returned!)

In [15]:
list(df0.columns)

['DateTime',
 'Recipe',
 'Step',
 'Interval',
 'Pressure',
 'GasFlow',
 'ElectricalPower',
 'Temperature']

An even higher-level check of a `DataFrame`'s contents is to just check the number of columns, especially if there are many columns in the data set.  
`DataFrame`s also have a `shape` attribute, a tuple made up of the row count and the column count. Find the column count using an index of `[1]`:

In [None]:
df0.shape[1]

Alternatively, we can use the `len` function (short for "length") on the `columns` attribute to find the column count.

In [None]:
len(df0.columns)

### Rows
Similarly, we might want to check the number of rows in a `DataFrame`. We can do this using the `.count()` method, which will give us a count of the non-missing values in each column.

In [None]:
df0.count()

If we just want a count of the rows, regardless of missing values, we can use the `shape` attribute with an index of `[0]`:

In [None]:
df0.shape[0]

We can also achieve this using `len` of the `index` attribute of the `DataFrame`.

In [None]:
len(df0.index)
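
The row and column checks above can be compared side by side on a tiny table. This is only a sketch: the `demo` data is made up, and one value is left missing on purpose to show where `.count()` and `shape` disagree.

```python
import pandas as pd

# Small made-up table: 3 rows, 2 columns, one missing Pressure value
demo = pd.DataFrame({"Step": [1, 1, 2], "Pressure": [0.0, None, 21.0]})

# shape is a (rows, columns) tuple, so it can be unpacked in one line
n_rows, n_cols = demo.shape
print(n_rows, n_cols)       # 3 2

# len() of the index and of the columns gives the same two numbers
print(len(demo.index))      # 3
print(len(demo.columns))    # 2

# count() skips missing values, so Pressure reports 2 rather than 3
print(demo.count())
```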

### Other Views

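`.count` without parentheses does not call the method. It returns the method object itself, which Jupyter displays together with a text view of the `DataFrame` (including its dimensions when the data is too large to show in full). Its behavior in Jupyter looks different than it will in Spyder.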

In [None]:
df0.count

`.describe()` will return some basic summary statistics of the `DataFrame`:

In [None]:
df0.describe() 

`.head()` will show the top of the `DataFrame`. You can specify how many rows you would like to see in the parentheses using `n=`:

In [None]:
df0.head()

In [None]:
df0.head(n=3)

`.tail()` will show the bottom of the `DataFrame`. You can specify how many rows you would like to see in the parentheses using `n=`:

In [None]:
df0.tail()

In [None]:
df0.tail(n=3)
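
Here is a small sketch of `.head()` and `.tail()` on a made-up ten-row table, so the results are easy to predict. When no `n` is given, both methods return five rows.

```python
import pandas as pd

# Made-up table with Interval running from 1 to 10
demo = pd.DataFrame({"Interval": range(1, 11)})

first_default = demo.head()   # no n given, so 5 rows
first3 = demo.head(n=3)       # rows with Interval 1, 2, 3
last3 = demo.tail(n=3)        # rows with Interval 8, 9, 10

print(len(first_default))         # 5
print(list(first3["Interval"]))   # [1, 2, 3]
print(list(last3["Interval"]))    # [8, 9, 10]
```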

## Manipulating a `DataFrame`
### Adding Columns
First, let's create some data to insert. Let's make a data variable `tool` with a value of "ET101", get the current date and time, and get the filename (isolating from the path).

In [None]:
tool = "ET101"

In [None]:
time = datetime.now()

You'll see that the `datetime.now()` function returned a `datetime` object.

In [None]:
time

When putting data into a `DataFrame`, a `datetime` object would not usually make much sense to insert into a file, as it's not a familiar, human-readable format. We can "cast" the `datetime` as a `str` type.

In [None]:
time = str(time)
time
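
Casting with `str()` keeps everything, including the microseconds. If a shorter or differently arranged timestamp is preferred, `strftime()` is one option. The sketch below uses a fixed, made-up timestamp rather than `datetime.now()` so that the output is repeatable.

```python
from datetime import datetime

# Fixed, made-up moment (the last argument is microseconds)
stamp = datetime(2019, 4, 2, 11, 0, 30, 123456)

as_str = str(stamp)
trimmed = stamp.strftime("%Y-%m-%d %H:%M:%S")

print(as_str)    # 2019-04-02 11:00:30.123456
print(trimmed)   # 2019-04-02 11:00:30
```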

Since we already have the filepath, we can use a function from `os.path` called `basename` that enables us to isolate the file name from the file path.

In [None]:
filename = os.path.basename(infile)
filename
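
`basename` has a few close relatives that split a path in other ways. The sketch below uses `ntpath`, which is the Windows flavour of `os.path` (on Windows, `os.path` is this same module), so that the example behaves the same on any operating system. In the session itself, plain `os.path` is all you need.

```python
import ntpath  # Windows flavour of os.path; used here so the demo runs anywhere

infile = r"C:\Users\161289\Py-R\data\session2\session2data.csv"

folder = ntpath.dirname(infile)              # everything before the file name
filename = ntpath.basename(infile)           # just the file name
stem, extension = ntpath.splitext(filename)  # name and extension, separated

print(folder)      # C:\Users\161289\Py-R\data\session2
print(filename)    # session2data.csv
print(stem)        # session2data
print(extension)   # .csv
```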

We can also remove objects from memory using the `del` statement. Referring to `tool` afterwards raises a `NameError`, because the name no longer exists:

In [None]:
del tool
tool

...but we still need that, so get `tool` back:

In [None]:
tool = "ET101"
tool

There are several ways to add columns to a `DataFrame`. One way is to use the `assign` method. This places the column at the end of the `DataFrame` by default.

In [None]:
df0 = df0.assign(Tool = tool)
df0.head()

Another way is using direct assignment, using the desired column name as an index. This also places the column at the end of the `DataFrame` by default:

In [None]:
df0["Time"] = time
df0.head()

Direct assignment can also work with direct values, not just with variables:

In [None]:
df0["MyName"] = "Justin Winata"
df0.head()

Finally, there is the `insert()` method. This allows us to control the position of insertion. The `insert` function requires an index (starting at 0) as the first parameter, a column name for the second, and data for the third.

In [None]:
df0.insert(0,"FileName",filename)
df0.head()
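
To see the three approaches next to each other, here is a minimal sketch on a made-up two-row table. The column names and values are placeholders for this example only.

```python
import pandas as pd

# Made-up starting table
demo = pd.DataFrame({"Step": [1, 2], "Pressure": [5.2, 21.0]})

# assign() returns a new DataFrame, so the result must be stored
demo = demo.assign(Tool="ET101")

# Direct assignment changes demo in place and appends at the end
demo["Source"] = "example"

# insert() also works in place, at the position we choose (0 = first)
demo.insert(0, "FileName", "demo.csv")

print(list(demo.columns))
# ['FileName', 'Step', 'Pressure', 'Tool', 'Source']
```

One thing to watch for in a Notebook: running the `insert()` cell a second time raises a `ValueError`, because a column with that name already exists.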

We may want to re-order the columns in a `DataFrame`. We can do this by simply selecting the columns in the desired order. Note the double brackets in this syntax: the outer brackets index the `DataFrame`, and the inner brackets create a `list` of the column names to select.

In [None]:
df1 = df0[["MyName","Tool","Time","FileName","DateTime",
           "Recipe","Step","Interval","Pressure","GasFlow",
           "ElectricalPower","Temperature"]]
df1.head()
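
The double brackets are easier to read once the inner list gets its own name. The sketch below uses a made-up three-column table, and `order` is just an illustrative variable name.

```python
import pandas as pd

# Made-up table with three columns
demo = pd.DataFrame({"Step": [1, 2],
                     "Pressure": [5.2, 21.0],
                     "Tool": ["ET101", "ET101"]})

# The inner brackets are simply a list of column names
order = ["Tool", "Step", "Pressure"]
reordered = demo[order]        # same as demo[["Tool", "Step", "Pressure"]]

# A single name in single brackets gives one column back as a Series
single = demo["Step"]

print(list(reordered.columns))  # ['Tool', 'Step', 'Pressure']
print(type(single).__name__)    # Series
```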

## Converting a `DataFrame` into a File
After modifying the data, we often want a cleaned copy of our final data, or need one to interface the data with other programs. We can do this by using the `DataFrame` method `.to_csv()`. The first parameter is the file path, and there are many optional parameters afterwards.

In [None]:
df1.to_csv("session2output.csv",index=False)

Notice the `index=False` parameter here. If you look at our `DataFrame`, you'll see that there's an extra column on the left side numbering the rows that wasn't in our original data.

In [None]:
df1.head()

When dealing with data, there will often be a column that already plays this role. In this case, the `Interval` column already numbers the rows. So typically, I set `index=False`, and when the file is written, this extra index column is not written to the file.
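
To see exactly what `index=False` changes, we can ask `.to_csv()` for the text it would write. When no file path is given, it returns the CSV content as a string instead of saving a file. The `demo` table here is made up for the example.

```python
import pandas as pd

# Made-up two-row table
demo = pd.DataFrame({"Interval": [1, 2], "Pressure": [0.0, 5.2]})

with_index = demo.to_csv()                # no path given: returns the text
without_index = demo.to_csv(index=False)

# The first version starts every line with the row number from the index
print(with_index)
# The second version contains only our own columns
print(without_index)
```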

You'll also see that we just saved a file to a default location (the current working directory). We would typically want to set an output directory explicitly as well. I like to just create an output folder where the data is. We can use the function `os.path.dirname` to chop the file name off of our `infile` path:

In [None]:
outpath = os.path.dirname(infile) + r"\output"
outpath

But notice that the folder we want to save into does not exist yet! We can make sure it exists using functions from `os`. The `os.path.exists()` function checks whether or not a path exists, and `os.makedirs()` creates the directories necessary for that path to exist. Here we'll use control flow, using an `if` statement to only execute `os.makedirs()` if `outpath` does not yet exist:

In [None]:
if not os.path.exists(outpath):
    os.makedirs(outpath)

To build the full file name more dynamically, we can build the path string by simply adding on the desired file name to the file path:

In [None]:
outfile = outpath + r"\session2output.csv"

Finally, we can use the `to_csv()` method to create an output file, right where we want it to be.

In [None]:
df1.to_csv(outfile,index=False)
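
As an alternative sketch of the same steps, `os.path.join` can build the paths without typing the separators by hand, and `os.makedirs(..., exist_ok=True)` folds the existence check into a single call. This is a swapped-in variation, not the approach used in the session. A temporary folder stands in for the real data folder so the example can run anywhere, and the `demo` table is made up.

```python
import os
import tempfile
import pandas as pd

# Made-up table to write out
demo = pd.DataFrame({"Interval": [1, 2], "Pressure": [0.0, 5.2]})

# A throwaway folder stands in for the folder that holds the input file
with tempfile.TemporaryDirectory() as base:
    # join() inserts the correct separator for the operating system
    outpath = os.path.join(base, "output")

    # exist_ok=True means "create it, and do not complain if it is there"
    os.makedirs(outpath, exist_ok=True)

    outfile = os.path.join(outpath, "session2output.csv")
    demo.to_csv(outfile, index=False)

    # Confirm the file landed where we expected, then read it back
    written = os.path.exists(outfile)
    round_trip = pd.read_csv(outfile)

print(written)                   # True
print(list(round_trip.columns))  # ['Interval', 'Pressure']
```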

# Final Script

And our script is complete. If you've been keeping up with the commands throughout this Notebook, if it were all compiled into a script, it would look like this:

In [None]:
# imports -- need os, pandas, and datetime
import os
import pandas as pd
from datetime import datetime

# get input file
infile = r"C:\Users\161289\Py-R\data\session2\session2data.csv"
# convert file to DataFrame
df0 = pd.read_csv(infile)

# metadata to insert -- tool name, current time, file name
tool = "ET101"
time = str(datetime.now())
filename = os.path.basename(infile)

# add Tool, Time, MyName, and FileName columns
df0 = df0.assign(Tool = tool)
df0["Time"] = time
df0["MyName"] = "Justin Winata"
df0.insert(0,"FileName",filename)

# re-order columns
df1 = df0[["MyName","Tool","Time","FileName","DateTime",
           "Recipe","Step","Interval","Pressure","GasFlow",
           "ElectricalPower","Temperature"]]

# specify output path and file
outpath = os.path.dirname(infile) + r"\output"
# if path does not exist, create path
if not os.path.exists(outpath):
    os.makedirs(outpath)
outfile = outpath + r"\session2output.csv"

# convert DataFrame to file
df1.to_csv(outfile,index=False)

# Spyder

And that's it for the material in this session! Next, we're going to open Spyder and transfer the code into a development environment. I've provided a code assist to follow, and we will use the same sample data file. The goal will be to replicate the file that we created in this session. Feel free to use this Notebook as a reference. You can copy and paste code blocks, but of course, it is best practice to type it out yourself, even if you are copying the syntax directly.