* Loading data in pandas
* Data summaries and basic operations
* Plotting with Seaborn

## Data File I/O and Visualizing Data with Seaborn

### Learning goals

* read data files from disk (pandas)
* practice exploring data (pandas + Seaborn)

In this tutorial, we will practice reading (and writing) data files from disk, and then exploring the data. These may seem like separate and unrelated steps. While they are separate, they are by no means unrelated. In data science, we are generally dealing with data that we did not collect ourselves, so when we get our hands on a data set, the immediate steps we take are to:

- read the data in
- examine the data as a data table/structure/data frame
- look at the data visually via various plots


##### File input

Reading files from disk is more generally known as file `i/o` where `i` and `o` stand for `i`nput and `o`utput, respectively.  Why do we need to do file `i/o`?

The input case is more obvious. To analyze data, you need data to analyze. Unless you know magic, you access data from *data files*, which are files just like PDF documents, JPG images, or whatever, but they are specialized to some degree for containing data. Whether from a colleague, a boss, a webpage, or a government data repository, data will come in a data file that you will need to read as input in order to visualize and analyze the data.

##### File output

The output case is perhaps less obvious. You read data into your Jupyter Notebook. The data are stored in Python variables. There are many different types of variables, and we will learn about the fundamental ones. The variables can be used inside the Jupyter Notebook to make pretty graphs and to do some cool analysis. But what if you want to share the numerical values, the results of the analysis, with someone else? In that case, you can write those values to a data file, save the file, and send it to your colleagues. They can then read in the values on their end without having to wade through your notebook, cutting and pasting or whatever.

#### Import `pandas`

 Remember that, by convention, we'll import `pandas`  with the nickname `pd`. Ok, let's import `pandas`:

In [None]:
import pandas as pd

 We don't *have* to do this. If we wanted to, we could import pandas like this:
 ```python
 import pandas as cute_creature_that_eats_shoots_and_leaves
 ```
and the library would import just fine with the nickname `cute_creature_that_eats_shoots_and_leaves`.
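To see that the nickname is only a label, here is a minimal sketch that imports `pandas` under two names and checks that both point at the same library. The second nickname, `shoot_eater`, is made up purely for illustration.

```python
import pandas as pd
import pandas as shoot_eater  # hypothetical nickname, for illustration only

# Both names are bound to the very same module object.
print(pd is shoot_eater)

# So a function reached through either name is the same function.
print(pd.read_csv is shoot_eater.read_csv)
```

Sticking with `pd` simply makes your code easier for other people to read, because that is the nickname they expect.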

We've already learned that `pandas` basically gives us R-like tibble (data frame) functionality in Python. But another thing `pandas` gives us is easy ways to import and export data.

### Data preparation

For this tutorial, we are going to read a data file called `016_TU_Data1.csv`. Save this file inside a folder called `datasets`, which should sit inside the same folder that contains this Jupyter Notebook. Ideally, both the `datasets` folder and the Jupyter Notebook should be saved inside a GitHub repo.

### Let's read some data!

We read a file with the extension `.csv` (more on this file type in a bit) using the `pandas.read_csv()` function. But, remember, we have imported pandas as `pd`, so with slightly less typing the call shortens to `pd.read_csv()`. Nicknames are standardized by convention in Python: each library is generally imported under a specific nickname.

In [None]:
import pandas as pd

In [None]:
myDataFromFile = pd.read_csv("./datasets/016_TU_Data1.csv")

This command will work "out of the box" if your copy of the data file is in your "*datasets*" directory, which should be a subdirectory of the one this notebook is in. 

Otherwise, you would have to change the command above to specify the path to the data file – where on the file tree the data file exists (either in '*absolute*' terms from root, or in '*relative*' terms from your current directory).
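If the read fails, a quick way to debug the path is to print it in both forms. The sketch below uses Python's built-in `pathlib` module and assumes the file name and folder used in the `read_csv()` call above; the variable names are just for illustration.

```python
from pathlib import Path

# Relative path: interpreted starting from the current working directory.
relative_path = Path("datasets") / "016_TU_Data1.csv"

# Absolute path: the same location spelled out from the root of the file tree.
absolute_path = Path.cwd() / relative_path

print("relative:", relative_path)
print("absolute:", absolute_path)

# True only if the file really is where the notebook expects it.
print("file found:", relative_path.exists())
```

If the last line prints `False`, the file is not where the relative path points, and `pd.read_csv()` would fail with a "file not found" error.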

$\color{blue}{\text{Answer the following questions:}}$

In the line of code above, what is the:

 - name of the library used to load the file?  [Enter answer here]
 - name of the `pandas` function we use to read the data file?  [Enter answer here]
 - data file name?  [Enter answer here]
 - name of the variable used to store the file?  [Enter answer here]
 - name of the folder containing the data file?  [Enter answer here]

### Let's look at what we just read.

Okay, now let's look at the file. We can take a quick peek by using the `display()` function:

In [None]:
display(myDataFromFile)

Here, we can see that this file (like almost all data files) consists of rows and columns. The rows represent *observations* and the columns represent *variables*. This type of data file contains "tidy" data (if you have used R, you may have encountered the tidyverse). Sometimes, we will encounter data files that violate this "rows = observations, columns = variables" rule – untidy data – we will deal with this issue later in the class.

A very common generic data file type is the comma-separated values file, or .csv file. This is the type of data file we just loaded (016_TU_Data1***.csv***). As the name implies, a file in this format consists of values separated by commas to form rows, with "carriage returns" (CR) or "line feeds" (LF) marking the end of each row.

---
**Useless Trivia Alert!**:

These terms come from typewriters and old-old-old-school printers, respectively. Typewriters had a "carriage" that held the paper and moved to the left while you typed. When you got to the right edge of the paper, you hit the "*carriage return*" key and the whole carriage flew back (*returned*) to the right with a loud clunk and advanced the paper down a line. To this day, the big fat important key on the right side of most keyboards still says "return".

Old-school printers used long continuous "fan fold" sheets of paper (they could be literally hundreds of feet long) and had to be told to advance the paper one line with a "*line feed*" command. Once you were done printing, you ripped/cut your paper off the printer sort of like you do with aluminum foil or plastic wrap!

---
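To make the `.csv` layout concrete, here is a small sketch that builds a CSV "file" as a plain string and reads it with `pd.read_csv()`. The column names and values are invented for illustration; `\n` is the line feed that ends each row.

```python
import io
import pandas as pd

# Hypothetical CSV content: a header row, then one observation per line.
csv_text = "subject,score\n1,10.5\n2,11.7\n3,13.4\n"

# io.StringIO lets read_csv treat the string as if it were a file on disk.
toy_table = pd.read_csv(io.StringIO(csv_text))

print(toy_table)
print(toy_table.shape)  # (number of rows, number of columns)
```

The commas become column boundaries and each line becomes a row, which is exactly what happened when we read our data file.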

$\color{blue}{\text{Answer the following question:}}$

 - What are the dimensions (the size) of the data?  [Type the Answer here]

**Useful aside!:**
`pandas` has very convenient functionality. For example, we can even copy data to the clipboard and read that in. 

Go to Wikipedia and copy the table of the [population of Burkina Faso by year](https://en.wikipedia.org/wiki/Demographics_of_Burkina_Faso). 

After that, you can read the data from the Wikipedia table into a data table (technically a `pandas` data frame) like this:

In [None]:
cb = pd.read_clipboard()

In [None]:
cb

How cool is that?!?!

$\color{blue}{\text{Answer the following questions:}}$

 - What are the names of the columns of the data copied from Wikipedia? [Type the Answer here]
 - What is the name of the variable I used to save the data in Python? [Type the Answer here]

Okay, now back to the show. In addition to `display()`, we can use data frame "methods". 

What is a "method"? Methods are things that an object, like our loaded datasets (technically a `pandas` data frame), can do. They are actions that an object can perform for you without any additional coding on your part!

Methods are invoked using the following syntax: `ObjectName.MethodName()`.

One thing a data frame knows how to do is show you its first few rows with the `head()` method. This method returns the top (the leading rows, or head) of a data table:

In [None]:
myDataFromFile.head()

Another method is `tail()`. This second method shows the last rows of the table:

In [None]:
myDataFromFile.tail()
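Both methods also accept an optional number of rows (the default is 5). The sketch below uses a small made-up data frame, so the column names and values are illustrative only.

```python
import pandas as pd

# Hypothetical toy data with six rows.
toy = pd.DataFrame({"trial": [1, 2, 3, 4, 5, 6],
                    "rt": [10.5, 11.7, 13.4, 12.9, 10.4, 11.7]})

first_two = toy.head(2)   # first 2 rows instead of the default 5
last_three = toy.tail(3)  # last 3 rows

print(first_two)
print(last_three)

# Without parentheses we get the method itself, not its result.
print(callable(toy.head))
```

The parentheses are what actually run the method, which is why we write `head()` and not just `head`.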

But how do you know what methods a given object has? Python's `dir()` function will give you a directory of any object's methods:

In [None]:
dir(myDataFromFile)

HFS!!! Data frames know how to do a LOT! It's a bit overwhelming actually. 
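One way to make the listing less overwhelming is to filter it. This is only a sketch on a tiny made-up data frame: it keeps the names that do not start with an underscore, which hides the internal entries.

```python
import pandas as pd

toy = pd.DataFrame({"x": [1, 2, 3]})  # hypothetical toy data

# Keep only names that do not start with an underscore.
public_names = [name for name in dir(toy) if not name.startswith("_")]

print(len(public_names))
print(public_names[:10])
```

The filtered list is still long, and it mixes methods with other attributes, but it is much easier to scan.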

We can ignore all the things that look like \_\_this\_\_ at the top. Scrolling through the others, the method called `describe()` looks promising. Let's see what it does!

In [None]:
myDataFromFile.describe()

$\color{blue}{\text{Answer the following question:}}$

 - Describe in your own words what the `describe()` method returns for the data, including the measures it reports.
 
 [Type the Answer here]

OMG, that was a good find!

I also noticed a `hist` method. Could it even be possible that data frames know how to draw histograms of themselves?

In [None]:
myDataFromFile.hist()

#### **NO WAY!!!!!**

As you can see, our journey of learning to play with data is going to be part learning to code and part figuring out how to use what's already out there!

$\color{blue}{\text{Answer the following questions:}}$

 - What are the titles of the two histograms in the figures created by the method `.hist`?  [Type the answer here]
 - How do the titles of the histograms relate to the data? [Type the answer here]
 - What are the values in the x-axis of the histograms? [Type the answer here]

### Let's see if we can write data to a file!

Now maybe we can write a summary of the original data to a file so we could potentially share it with others. What we'll do is use the `describe()` method again, but this time we'll assign its output to a new data frame.

In [None]:
mySummary = myDataFromFile.describe()

Let's just quickly check that `mySummary` contains what we hope it does. The Python command `print` will help us take a look at the summary:

In [None]:
print(mySummary)

See what we did above? Instead of returning the results of the method `.describe()` directly in the notebook output, we saved the output into a variable. We then used `print` to display the content of the variable.

$\color{blue}{\text{Answer the following questions:}}$

 - What is the name of the variable we saved the output of the method `.describe()` in? [Type the answer here]

Next let's write the variable to a file! Given that we are dealing with a table, we will save the variable in a `.csv` file. The variable has a method `.to_csv`, which can be found by looking through the output of `dir(<varName>)`.

In [None]:
mySummary.to_csv("mySummary.csv")

Okay, but how do we know that worked? Easy! We'll read that file back in using `pandas.read_csv()` and see what it looks like!

In [None]:
mySummary2 = pd.read_csv("mySummary.csv")

And then we can look at it using `display()`.

In [None]:
display(mySummary2)
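One detail worth noticing in the table we read back: by default `to_csv()` also writes the row labels (`count`, `mean`, and so on) as the first column of the file, and a plain `read_csv()` brings them back as an ordinary column named `Unnamed: 0`. The sketch below shows one way to restore them as row labels using the `index_col` argument. It uses a made-up data frame and a temporary folder, so it does not touch `mySummary.csv`.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical toy data, summarized the same way as in the tutorial.
toy = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [4.0, 5.0, 6.0]})
summary = toy.describe()

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "toy_summary.csv"  # hypothetical file name
    summary.to_csv(out_path)

    # Plain read: the saved row labels come back as a regular column.
    plain = pd.read_csv(out_path)

    # index_col=0 tells pandas to use the first column as the row labels.
    restored = pd.read_csv(out_path, index_col=0)

print(plain.columns.tolist())
print(restored.index.tolist())
```

Either version contains the same numbers; `index_col=0` just gives back a table shaped like the one we saved.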

#### Sweet! We can now read and write data files. File I/O handled!

### Seaborn overview

`Seaborn` is meant to provide facilitated access to plotting data. It's like an add-on to `pandas` to accelerate data exploration via visualization.

`Seaborn` was written to:

* make plots from `pandas` data frames
* create good looking plots "out of the box"

The `seaborn` package is a "high level" plotting package that makes good looking plots while taking care of many details for you under the hood, so making basic plots is easy. If we want fine control over our plots, we need to go another direction, but we'll learn about that later.

The various `seaborn` functions are conceptually structured like this:
![seaborn_overview](./assets/jpnb20/seabornOverview.png)

The three columns correspond to plot types: plots of relationships, plots of data distributions, and plots of categorical data. 

For each plot type, there is a high level function, `relplot()`, `displot()`, and `catplot()`. These pretty much make a figure for you without you having to worry about anything. Technically, they are "figure level" functions.

In addition to the 3 high level functions, there are specific functions (technically "axes level" functions) for making each specific kind of plot directly. Each of these returns an `axes` object, which you can then modify further if you need to.

Don't worry about what "figure level" and "axes level" mean right now – we'll get to that when we get to that.

First, let's import what we'll need:

In [None]:
import pandas as pd
import seaborn as sns

In [16]:
our_data = pd.read_csv("./datasets/016_TU_Data2.csv")

Let's take a quick peek at the data.

In [17]:
our_data

|   | Unnamed: 0 | RTs | sex | strain |
|---|---|---|---|---|
| 0 | 0 | 10.485451 | M | Wild Type |
| 1 | 1 | 11.747948 | M | Wild Type |
| 2 | 2 | 13.41258 | M | Wild Type |
| 3 | 3 | 12.910095 | M | Wild Type |
| 4 | 4 | 10.36777 | M | Wild Type |
| 5 | 5 | 11.698422 | M | Wild Type |
| 6 | 6 | 11.583153 | M | Wild Type |
| 7 | 7 | 11.447349 | M | Wild Type |
| 8 | 8 | 10.852276 | M | Wild Type |
| 9 | 9 | 11.285897 | M | Wild Type |


In [19]:
our_data = our_data.drop(columns=['Unnamed: 0'])

In [20]:
our_data.head()

|   | RTs | sex | strain |
|---|---|---|---|
| 0 | 10.485451 | M | Wild Type |
| 1 | 11.747948 | M | Wild Type |
| 2 | 13.41258 | M | Wild Type |
| 3 | 12.910095 | M | Wild Type |
| 4 | 10.36777 | M | Wild Type |
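Note that the `drop()` call above assigned its result back to `our_data`. That matters because `drop()` returns a new data frame and, by default, leaves the original untouched. Here is a minimal sketch with a made-up data frame that mimics the columns above:

```python
import pandas as pd

# Hypothetical toy data with a leftover "Unnamed: 0" column.
toy = pd.DataFrame({"Unnamed: 0": [0, 1, 2],
                    "RTs": [10.5, 11.7, 13.4],
                    "sex": ["M", "M", "M"]})

# drop() hands back a new data frame without the listed columns.
trimmed = toy.drop(columns=["Unnamed: 0"])

print(toy.columns.tolist())      # the original still has all its columns
print(trimmed.columns.tolist())  # the new one has the column removed
```

Forgetting the assignment is a common slip: the call runs without error, but the unwanted column is still there the next time you look.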


## Figure level plots

 We'll start with some figure level plots.
