# Introduction to pandas DataFrames

In [None]:
import numpy as np
import pandas as pd

Execute the cell below

In [None]:
df = pd.read_csv('./iris-data-index-column.csv', index_col=0, header=0)
print(df)

- You can think of a DataFrame as a spreadsheet
- Each column is stored as a separate pandas Series
- All the Series share one index (in the example, the row labels d1, a1, ...)
- There are many ways to set up the index and the column names (Sepal-Length, Sepal-Width, ...)
- Floats are stored as `float64`, integers are stored as `int64` and strings are stored as `object`

In [None]:
df.dtypes
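
The points above can be checked on a small hand-made table. This is only a sketch: the values, the `Count` column and the name `toy_df` are made up for illustration and do not come from the iris files.

```python
import pandas as pd

# Hypothetical values, loosely modelled on the iris columns
toy_df = pd.DataFrame(
    {
        "Sepal-Length": [5.1, 4.9, 7.0],  # floats
        "Count": [1, 2, 3],               # integers (made-up column)
        "Class": ["Iris-setosa", "Iris-setosa", "Iris-versicolor"],  # strings
    },
    index=["d1", "a1", "d2"],
)

# Selecting one column returns a pandas Series
col = toy_df["Sepal-Length"]
print(type(col))

# The Series carries the same index as the DataFrame it came from
print(col.index.equals(toy_df.index))

# One dtype per column
print(toy_df.dtypes)
```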

Execute the cell below to see the documentation for DataFrame

In [None]:
help(pd.DataFrame)

# Getting the Data

## Checking for data

In this notebook, we work with an extract from the standard "Iris" dataset. This dataset is often used as an example in classification-type AI applications.

Execute the next cell to see what iris data files are available. You should see:

    1. iris-data.csv
    2. iris-data-column.csv
    3. iris-data-index-column.csv
    4. iris-data.txt

In [None]:
%ls -l iris*

## Examining the data

In the three cells below, we use the `%less` magic command to examine the iris data files

In [None]:
%less iris-data.csv 

This is an extract from the original raw data set. We have chosen a small extract so that the output can easily be seen.

In [None]:
%less iris-data-column.csv

In this version of the file, we have added an initial line of column names.

In [None]:
%less iris-data-index-column.csv

Finally, this file also has an extra first column, which we use to show how to control the index:
- Notice that the column label row starts with a comma (,), which leaves the first column without a name
- That unnamed first column (d1, a1, ...) is the one we will use as the index

We won't look at `iris-data.txt`; it is a description of the full data set.

The command to read data from a CSV file is `pd.read_csv`.
- Spend a minute looking over its parameters, then continue with this notebook.

In [None]:
help(pd.read_csv)

In [None]:
raw_df = pd.read_csv('./iris-data.csv', header=None)
print(raw_df)

If you do not pass the keyword argument `header=<row index>`, the first line of the file is treated as the header and supplies the column names.

  * `header=None`
    - Columns are numbered starting at 0
    - Or you can name them using the keyword argument `names=<list of column names>`
  * `header=0`
    - The first row (counting starts at 0) holds the headers
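
The options above can be compared side by side without touching the data files. The sketch below assumes two made-up rows held in a string (`csv_text`) and read through `io.StringIO`; the variable names are illustrative only.

```python
import io
import pandas as pd

# Two hypothetical rows with no header line, standing in for iris-data.csv
csv_text = "5.1,3.5,1.4,0.2,Iris-setosa\n4.9,3.0,1.4,0.2,Iris-setosa\n"

# header=None: columns are numbered 0, 1, 2, ...
numbered = pd.read_csv(io.StringIO(csv_text), header=None)
print(list(numbered.columns))

# header=None plus names=: columns take the supplied labels
names = ["Sepal-Length", "Sepal-Width", "Petal-Length", "Petal-Width", "Class"]
named = pd.read_csv(io.StringIO(csv_text), header=None, names=names)
print(list(named.columns))

# No header argument: the first data row is consumed as the header,
# so one row of data is lost
default = pd.read_csv(io.StringIO(csv_text))
print(len(numbered), len(default))
```

The last case is the one to watch for: a file without a header line that is read with the default settings silently loses its first row to the column names.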

Even if you don't name the columns on import, you can rename them at any time.
- The column names are stored as an Index in the DataFrame.

In [None]:
raw_df.columns = ['Sepal-Length', 'Sepal-Width', 'Petal-Length', 'Petal-Width', 'Class']
print(raw_df)
print(type(raw_df.columns))

If the file has an index column, you must say so with the keyword argument `index_col=<column number>`; otherwise pandas reads it as an ordinary data column. In practice this is not needed very often.

The column is stored as an Index. In the DataFrame below, I use the Index to record who collected each row of data.

Execute the cell below.

In [None]:
df = pd.read_csv('./iris-data-index-column.csv', index_col=0, header=0)
print(df)
print(type(df.index))
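
To see what `index_col` changes, it can help to read the same text with and without it. This is a sketch on a made-up three-column extract held in a string, not the actual iris file; `plain` and `indexed` are illustrative names.

```python
import io
import pandas as pd

# Hypothetical extract; the header starts with a comma,
# so the first column has no name
csv_text = (
    ",Sepal-Length,Sepal-Width,Class\n"
    "d1,5.1,3.5,Iris-setosa\n"
    "a1,4.9,3.0,Iris-setosa\n"
)

# Without index_col the label column is read as ordinary data
# and pandas gives it a placeholder name
plain = pd.read_csv(io.StringIO(csv_text), header=0)
print(plain.columns[0])
print(list(plain.index))

# With index_col=0 the labels become the row index
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0, header=0)
print(list(indexed.index))

# Rows can then be looked up by label
print(indexed.loc["a1", "Sepal-Length"])
```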

You can change the index labels.

Execute the cell below.

In [None]:
df.index = [1, 2, 3, 4, 5, 6]
print(df)

## Saving a DataFrame as a CSV file

To save a DataFrame to a CSV file, use `.to_csv`

In [None]:
help(df.to_csv)

In [None]:
df.to_csv("./saved_iris.csv")

In [None]:
%less saved_iris.csv

* Index
    - The index is written out by default (`index=True`)
    - Use `index=False` to leave the index out
* Header
    - Use `header=False` if no column labels are to be saved
    - Use `header=True` (the default) if column labels are to be saved
    - Use `header=<list of column names>` to write out a different set of column names
* Columns
    - Use `columns=<list of column names>` to write out only the selected columns
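
The effect of each option is easiest to see on the text that `to_csv` produces. The sketch below relies on `to_csv` returning the CSV text as a string when no path is given; the small `toy_df` table and the replacement labels `length` and `species` are made up for illustration.

```python
import pandas as pd

# Hypothetical two-row table
toy_df = pd.DataFrame(
    {"Sepal-Length": [5.1, 4.9], "Class": ["Iris-setosa", "Iris-setosa"]},
    index=["d1", "a1"],
)

# Default: both the index and the header are written
with_index = toy_df.to_csv()

# index=False drops the index column
without_index = toy_df.to_csv(index=False)

# header=False drops the column labels
no_header = toy_df.to_csv(index=False, header=False)

# header=<list> writes different column labels
renamed = toy_df.to_csv(index=False, header=["length", "species"])

# columns=<list> writes only the selected columns
one_column = toy_df.to_csv(index=False, columns=["Class"])

for text in (with_index, without_index, no_header, renamed, one_column):
    print(text)
```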

There are several other ways to create a DataFrame.

See [DataFrame](http://pandas.pydata.org/pandas-docs/stable/dsintro.html#dataframe)
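
As a brief illustration, here is a sketch of three common alternatives to reading a file. The values are made up, and the variable names are only for this example.

```python
import numpy as np
import pandas as pd

columns = ["Sepal-Length", "Sepal-Width"]

# From a dictionary mapping column names to lists of values
from_dict = pd.DataFrame({"Sepal-Length": [5.1, 4.9], "Sepal-Width": [3.5, 3.0]})

# From a list of row dictionaries
from_rows = pd.DataFrame([
    {"Sepal-Length": 5.1, "Sepal-Width": 3.5},
    {"Sepal-Length": 4.9, "Sepal-Width": 3.0},
])

# From a 2-D NumPy array plus explicit column names
from_array = pd.DataFrame(np.array([[5.1, 3.5], [4.9, 3.0]]), columns=columns)

# All three hold the same table
print(from_dict)
print(from_rows)
print(from_array)
```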

# End of Notebook