# <font color='maroon'>Data wrangling with Pandas</font>

In this lesson we learn how to load data into a tabular data structure called a dataframe. The dataframe is provided by the Pandas library, which is built on top of the NumPy library. In this notebook we'll study the attributes of a dataframe and the operations you can perform on it. Run the cells as you go along to see the output, and read through the code to check that you understand what it is doing.

## What is a Pandas dataframe?

In order to manipulate data in Python, we need to store it somewhere. The Pandas library (a third-party package, not part of Python itself) provides dataframes to hold data that comes from different file types. A dataframe lets us store information in tabular form (as you might do in Excel) and perform more complicated operations on this information. For example, it’s possible to select a subset of the data based on the values in a particular column, add new columns, combine two dataframes into one larger dataframe and so on. Columns can be given names, and rows can be given an explicit index, to allow us to access a particular piece of information.

Pandas includes other data structures as well, which we may go into later. A more detailed explanation is available in the Pandas documentation: [pandas: powerful Python data analysis toolkit](https://pandas.pydata.org/pandas-docs/stable/). For now, let’s look at how we can get data into a dataframe and start working with it.


## Loading a file into a Pandas dataframe

Before loading data into your workspace, you need the right tools to load and manipulate it. In the LMS, you had a glance at Numpy, Pandas and Matplotlib, among others. These are modules: collections of code organized into an easy-to-maintain format for a specific purpose. You can import an entire module or only parts of it.

To import the entire module, the code below can be used. Notice that both modules have a standard alias that is widely used by Python users. To read more on this, see the additional reading on Importing Modules in the LMS.

In [0]:
import numpy as np #imports Numpy using a standard alias, np
import pandas as pd #imports Pandas using a standard alias, pd
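The difference between importing a whole module and importing only part of it can be sketched as follows. The choice of `DataFrame` as the imported name and the toy values are only illustrative.

```python
import pandas as pd            # import the whole module under the alias pd
from pandas import DataFrame   # import only the DataFrame class

# Both names refer to the same class, so either spelling works.
via_alias = pd.DataFrame({'x': [1, 2, 3]})
via_name = DataFrame({'x': [1, 2, 3]})

print(pd.DataFrame is DataFrame)
print(via_alias.equals(via_name))
```

Importing the whole module under an alias keeps it clear where each name comes from, which is why the rest of this notebook uses `pd.` throughout.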

Enable inline plotting with the `%matplotlib inline` magic command. This allows the output of cells with plotting commands to appear directly below their input cells and be stored with/embedded in the notebook.

In [0]:
%matplotlib inline  

### The dataset

We use data from the [UCI Machine Learning repository](http://archive.ics.uci.edu/ml/datasets/Airfoil+Self-Noise#). This is NASA data on different-sized NACA 0012 airfoils tested at various wind tunnel speeds and angles of attack. The span of the airfoil and the observer position were the same in all of the experiments. According to the UCI Machine Learning site, the problem has the following inputs:
1. Frequency, in hertz.
2. Angle of attack, in degrees.
3. Chord length, in meters.
4. Free-stream velocity, in meters per second.
5. Suction side displacement thickness, in meters.

The only output is:
6. Scaled sound pressure level, in decibels. 


First open the file and check whether the columns have headings and which separator is used between them. The columns have no headings, so when we use Pandas to load the file into a dataframe, we provide the names of the columns using the information given above.

In [0]:
data = pd.read_csv('airfoil_self_noise.dat', #name of file
                   sep='\t',  # how columns are separated
                   names = ['Freq(Hz)', 'Angle(deg)', 'Chord_length(m)', 'Velocity(ms)', 'Thickness(m)', 'Pressure(dec)'])
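If the data file is not at hand, the same call can be tried on an in-memory string. This is a minimal sketch: the two rows below are made up for illustration and only mimic the layout of the file (tab-separated, no header line), and `raw_text` and `sample` are hypothetical names.

```python
import io
import pandas as pd

# Made-up rows in the same layout as the file: tab-separated, no header line.
raw_text = ("800\t0.0\t0.3048\t71.3\t0.0027\t126.2\n"
            "1000\t0.0\t0.3048\t71.3\t0.0027\t125.2\n")

column_names = ['Freq(Hz)', 'Angle(deg)', 'Chord_length(m)',
                'Velocity(ms)', 'Thickness(m)', 'Pressure(dec)']

# io.StringIO lets read_csv treat the string as if it were a file.
sample = pd.read_csv(io.StringIO(raw_text), sep='\t', names=column_names)

print(sample.shape)
print(list(sample.columns))
```

Because `names` is passed, Pandas treats the first line of the text as data rather than as a header.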

See the column headings.

In [0]:
data.columns

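The Pandas `read_csv` function loads a delimited text file (comma-separated by default, tab-separated here) as a dataframe. Check the type of the object it returned.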

In [0]:
type(data) # make sure you've created a DataFrame

Use the `info()` method to get information about the data types.

In [0]:
data.info()

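Note: the `astype()` method converts data to another type. Let's convert one of the int columns to float. The call returns a new Series and leaves `data` unchanged; assign the result back to the column if you want to keep the conversion.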

In [0]:
data['Freq(Hz)'].astype('float')
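The following sketch shows that `astype()` returns a converted copy rather than changing the dataframe. The dataframe `toy` and its values are made up for illustration.

```python
import pandas as pd

# A small made-up dataframe with an integer column.
toy = pd.DataFrame({'Freq(Hz)': [800, 1000, 1250]})
before = toy['Freq(Hz)'].dtype

# astype returns a new Series; toy itself is not modified by this line.
converted = toy['Freq(Hz)'].astype('float')
print(toy['Freq(Hz)'].dtype)   # still an integer type
print(converted.dtype)         # a float type

# Assign the result back to make the change permanent.
toy['Freq(Hz)'] = toy['Freq(Hz)'].astype('float')
print(toy['Freq(Hz)'].dtype)
```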

### Indexing a dataframe

We may want to set a column as the index to make access to data points easier. First, look at the current index.

In [0]:
data.index

Suppose we set the Freq(Hz) column as the index.

In [0]:
data.set_index('Freq(Hz)', inplace=True)

What happens when you check the data index again?

In [0]:
#write your code here to check data index


In [0]:
#you can also check your data after setting the index using the command below
data
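As an alternative to `inplace=True`, `set_index` can return a new dataframe, and `reset_index` undoes the change. A minimal sketch on made-up values (the names `toy`, `indexed` and `restored` are hypothetical):

```python
import pandas as pd

# Made-up values; note that index labels do not have to be unique.
toy = pd.DataFrame({'Freq(Hz)': [800, 1000, 800],
                    'Pressure(dec)': [126.2, 125.2, 127.6]})

indexed = toy.set_index('Freq(Hz)')   # returns a new dataframe, toy is unchanged
restored = indexed.reset_index()      # moves the index back into a regular column

print(list(toy.columns))       # both columns are still there
print(list(indexed.columns))   # the index column is no longer a regular column
print(restored.equals(toy))
```

Reassigning the result instead of modifying in place makes it harder to lose the original column layout by accident when a cell is run twice.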

### Descriptive information

Let's find out more about the data.

The size of the dataset, or the number of rows and columns in the dataframe, is given by:

In [0]:
data.shape

Find out more about the dataset attributes -- the types of the information in the dataset.

In [0]:
data.info()

See the names of all the columns in the data.

In [0]:
data.columns

To get a small view of the data, use `head()` for the first five rows and `tail()` for the last five.

In [0]:
data.head()

In [0]:
data.tail()

You can view the data in different ways. For example, `transpose()` swaps the rows and columns.

In [0]:
data.transpose()

You can also produce summary statistics of the data as follows:

In [0]:
data.describe()
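The descriptive tools above can be tried on a small made-up dataframe, where the results are easy to check by hand. The values in `toy` are illustrative only.

```python
import pandas as pd

# Four made-up observations with two columns.
toy = pd.DataFrame({'Velocity(ms)': [71.3, 71.3, 55.5, 39.6],
                    'Pressure(dec)': [126.0, 125.0, 127.0, 130.0]})

n_rows, n_cols = toy.shape    # shape is an attribute, not a method
summary = toy.describe()      # one row per statistic, one column per variable

print(n_rows, n_cols)
print(toy.head(2))            # head accepts the number of rows to show
print(summary.loc['mean', 'Pressure(dec)'])
```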

## Creating dataframes

### Selecting a column of the data

You can select a single column from the dataframe by name. When we select `Pressure(dec)`, the result shows the index alongside the values of the `Pressure(dec)` column.

In [0]:
data['Pressure(dec)']

In [0]:
type(data['Pressure(dec)'])

Note the type of the column selected.
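The type depends on how the column is requested: a single label gives a Series, while a list of labels gives a dataframe. A short sketch on made-up values:

```python
import pandas as pd

# Made-up values with the frequency as the index.
toy = pd.DataFrame({'Thickness(m)': [0.0027, 0.0031],
                    'Pressure(dec)': [126.2, 125.2]},
                   index=pd.Index([800, 1000], name='Freq(Hz)'))

as_series = toy['Pressure(dec)']     # single label: a one-dimensional Series
as_frame = toy[['Pressure(dec)']]    # list of labels: a dataframe with one column

print(type(as_series).__name__)
print(type(as_frame).__name__)
```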

Suppose we want to see stats on a certain frequency. This is where the index we set earlier is useful. Using `DataFrame.loc` (here, `DataFrame` should be replaced with the name of the dataframe we created), we pass a label from the index to get all the rows that carry that label.

In [0]:
data.loc[800, :]

To select a row by position instead, pass its row number to `iloc`. Positions start at 0, so `iloc[1, :]` returns the second row.

In [0]:
data.iloc[1, :]
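The difference between the two selections can be seen on a small made-up dataframe in which one index label occurs twice, as frequencies do in the dataset.

```python
import pandas as pd

# Made-up values; the label 800 appears twice in the index.
toy = pd.DataFrame({'Pressure(dec)': [126.2, 125.2, 127.6]},
                   index=pd.Index([800, 1000, 800], name='Freq(Hz)'))

by_label = toy.loc[800, :]      # every row labelled 800: a dataframe with two rows
by_position = toy.iloc[1, :]    # the second row only: a Series

print(by_label)
print(by_position)
```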

### Selecting certain data points from dataset

`DataFrame.loc` also accepts a column label, which lets us select a single column for every row of the dataframe.

In [0]:
data.loc[:,'Thickness(m)'] # index value and thickness values returned

This is similar to using an integer position to select the same column. In other words, `.loc` selects based on labels while `.iloc` is used for position-based indexing. `Thickness(m)` is at position 3 because positions start at 0 and the index column is not counted.

In [0]:
data.iloc[:, 3]
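To check which position a column label corresponds to, `columns.get_loc` can be used. A minimal sketch on made-up values with the same column layout as the dataframe above:

```python
import pandas as pd

# Made-up values; the column order matches the notebook's dataframe after set_index.
toy = pd.DataFrame({'Angle(deg)': [0.0, 1.5],
                    'Chord_length(m)': [0.3048, 0.3048],
                    'Velocity(ms)': [71.3, 39.6],
                    'Thickness(m)': [0.0027, 0.0039],
                    'Pressure(dec)': [126.2, 120.1]},
                   index=pd.Index([800, 1000], name='Freq(Hz)'))

position = toy.columns.get_loc('Thickness(m)')   # zero-based position of the label

by_label = toy.loc[:, 'Thickness(m)']
by_position = toy.iloc[:, position]

print(position)
print(by_label.equals(by_position))
```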

### Selecting a subset with slicing

Suppose you are only interested in a subset of the data. 

In [0]:
data = data.sort_index() # sort the data according to the index

In [0]:
data.head()

Use the `:` operator to select rows and columns. For example, 
    
    loc[:,:] returns the whole dataset. The comma separates the rows and columns
    iloc[:,2] returns all the rows in the 3rd column.
    iloc[:, 1:3] returns all the rows in the 2nd and 3rd columns (positions 1 and 2; the end position is excluded).

In [0]:
data.loc[200:800, :]           # return all columns for rows with index (frequency) between 200 and 800 Hz, inclusive

In [0]:
data.loc[:200,:] # return all columns for rows with frequency up to and including 200 Hz
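Label slices with `loc` include both end points, while position slices with `iloc` exclude the end position. The sketch below illustrates this on a made-up, sorted index.

```python
import pandas as pd

# Made-up values with a sorted index that contains a repeated label.
toy = pd.DataFrame({'Pressure(dec)': [120.0, 121.0, 122.0, 123.0, 124.0]},
                   index=pd.Index([200, 200, 500, 800, 1000], name='Freq(Hz)'))

label_slice = toy.loc[200:800, :]    # labels 200 through 800, both included
up_to_200 = toy.loc[:200, :]         # every row with a label up to and including 200
position_slice = toy.iloc[0:3, :]    # positions 0, 1 and 2; position 3 is excluded

print(list(label_slice.index))
print(list(up_to_200.index))
print(list(position_slice.index))
```

Label slicing on an index with repeated labels relies on the index being sorted, which is why the data is sorted with `sort_index()` first.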

### Exercise

For all the questions below, write your code and comments to define your approach in a way that another user can understand your choices.

Question one: 
With the `DataFrame.loc` indexer, you can select the columns you would like to return (the older `DataFrame.ix` indexer was removed in pandas 1.0). Select two columns, namely `Velocity(ms)` and `Pressure(dec)`. Display the velocity and pressure for frequencies 200 and 800.

In [0]:
# your answer
#data.loc...


Question two: Using `DataFrame.loc` or `DataFrame.iloc`, select the first 5 entries of the dataframe for the `Angle(deg)` column. Use comments to explain how it works (or does not, if appropriate).

In [0]:
#data...

Question three: Select the 3rd column for the 100th to the 250th entries, inclusive, without using the column label. Comment your code to indicate why you chose each command used.

In [0]:
# your input
#index into the dataframe and return all entries with frequencies between 1000 and 10000Hz, for the Velocity column 


