# Pandas DataFrames

### Questions
- How can I do statistical analysis of tabular data?

### Objectives
- Select individual values from a Pandas dataframe.
- Select entire rows or entire columns from a dataframe.
- Select a subset of both rows and columns from a dataframe in a single operation.
- Select a subset of a dataframe by a single Boolean criterion.

## About `pandas` DataFrame

- A DataFrame is a collection of Series
    - The DataFrame is the way Pandas represents a table.
    - A Series is the data structure Pandas uses to represent a column.

- Pandas is built on top of the `NumPy` library, which in practice means that many of the methods defined for NumPy arrays also apply to Pandas Series and DataFrames.

- What makes Pandas so attractive is its powerful interface for accessing individual records of a table, its proper handling of missing values, and its support for relational-database-style operations between DataFrames.
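The points above can be sketched on a tiny table. The column names and values here are hypothetical, made up purely for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table; the values are made up for illustration.
table = pd.DataFrame(
    {"height_cm": [170.0, np.nan, 182.0], "weight_kg": [65.0, 72.0, 80.0]},
    index=["a", "b", "c"],
)

column = table["height_cm"]      # selecting one column gives a Series
print(type(table).__name__)      # DataFrame
print(type(column).__name__)     # Series

# NumPy functions accept a Series and return a Series.
roots = np.sqrt(table["weight_kg"])
print(roots)

# The missing value (NaN) is skipped by default when computing the mean.
print(column.mean())
```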

## Use `DataFrame.iloc[row, column]` to select values by their (entry) position

First, let's read the file `gapminder_gdp_europe.csv` and inspect the data.
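A minimal sketch of reading and inspecting a table follows. So that it runs without the data file, it reads a small inline CSV with made-up values (not real Gapminder figures). With the real file you would pass its path to `pd.read_csv` instead of the `io.StringIO` object.

```python
import io
import pandas as pd

# Stand-in for the CSV file: a tiny inline table with made-up values.
csv_text = """country,gdpPercap_1952,gdpPercap_1957
Albania,1600.0,1900.0
Austria,6100.0,8800.0
Belgium,8300.0,9700.0
"""

# index_col='country' makes the country names the row labels.
data = pd.read_csv(io.StringIO(csv_text), index_col="country")

print(data.head())   # first rows of the table
data.info()          # column names, dtypes, and non-null counts
print(data.shape)    # (number of rows, number of columns)
```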

Now, let's read the value in the first column of the first row (i.e. position `[0, 0]`).
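As a sketch, positional selection looks like this on a small stand-in table. The values are made up for illustration and are not real Gapminder figures.

```python
import pandas as pd

# Made-up values for illustration only.
data = pd.DataFrame(
    {
        "gdpPercap_1952": [1600.0, 6100.0, 8300.0],
        "gdpPercap_1957": [1900.0, 8800.0, 9700.0],
    },
    index=pd.Index(["Albania", "Austria", "Belgium"], name="country"),
)

# Row position 0, column position 0 (positions start at zero).
first_value = data.iloc[0, 0]
print(first_value)
```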

**Exercise**: What is the value of `gdpPercap_1957` (the second column) in the 'Belgium' row?

## Use `DataFrame.loc[row, column]` to select values by their (entry) label.
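A minimal sketch of label-based selection, again on a stand-in table with made-up values:

```python
import pandas as pd

# Made-up values for illustration only.
data = pd.DataFrame(
    {
        "gdpPercap_1952": [1600.0, 6100.0, 8300.0],
        "gdpPercap_1957": [1900.0, 8800.0, 9700.0],
    },
    index=pd.Index(["Albania", "Austria", "Belgium"], name="country"),
)

# Select by row label and column label instead of by position.
value = data.loc["Albania", "gdpPercap_1952"]
print(value)
```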

**Repeat the last exercise**: What is the value of the second column in the 'Belgium' row?

## Use `:` on its own to mean all columns or all rows.
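The following sketch selects a whole row and a whole column with `:`. The table is a stand-in with made-up values.

```python
import pandas as pd

# Made-up values for illustration only.
data = pd.DataFrame(
    {
        "gdpPercap_1952": [1600.0, 6100.0, 8300.0],
        "gdpPercap_1957": [1900.0, 8800.0, 9700.0],
    },
    index=pd.Index(["Albania", "Austria", "Belgium"], name="country"),
)

# All columns for one row.
row = data.loc["Albania", :]
print(row)

# All rows for one column.
column = data.loc[:, "gdpPercap_1952"]
print(column)

# Leaving out the column part gives the same row.
same_row = data.loc["Albania"]
```

Selecting a column with `data["gdpPercap_1952"]` gives the same result as `data.loc[:, "gdpPercap_1952"]`.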

## Select multiple columns or rows using `DataFrame.loc` and a named slice.

Slicing with `.loc` is inclusive at both ends. This differs from slicing with `.iloc`, which selects everything up to but not including the final index.
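The difference between the two slicing styles can be seen on a small stand-in table (made-up values, not real Gapminder figures):

```python
import pandas as pd

# Made-up values for illustration only.
data = pd.DataFrame(
    {
        "gdpPercap_1952": [1600.0, 6100.0, 8300.0, 2400.0],
        "gdpPercap_1957": [1900.0, 8800.0, 9700.0, 3000.0],
        "gdpPercap_1962": [2300.0, 10700.0, 10900.0, 4200.0],
    },
    index=pd.Index(["Albania", "Austria", "Belgium", "Bulgaria"], name="country"),
)

# Label slice: both end points are included.
by_label = data.loc["Austria":"Bulgaria", "gdpPercap_1952":"gdpPercap_1957"]
print(by_label)

# Position slice: the stop position is excluded, as with Python lists.
by_position = data.iloc[1:3, 0:1]
print(by_position)
```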

## Result of slicing can be used in further operations

- Usually we won’t just print a slice.
- Statistical operators that work on entire dataframes work the same way on slices. E.g., calculate `max` or `min` of a slice.
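As a sketch, `max` and `min` can be called directly on a slice. The table and its values are made up for illustration.

```python
import pandas as pd

# Made-up values for illustration only.
data = pd.DataFrame(
    {
        "gdpPercap_1952": [1600.0, 6100.0, 8300.0],
        "gdpPercap_1957": [1900.0, 8800.0, 9700.0],
        "gdpPercap_1962": [2300.0, 10700.0, 10900.0],
    },
    index=pd.Index(["Albania", "Austria", "Belgium"], name="country"),
)

subset = data.loc["Austria":"Belgium", "gdpPercap_1952":"gdpPercap_1957"]

# By default these reduce over rows, giving one value per column.
column_max = subset.max()
column_min = subset.min()
print(column_max)
print(column_min)
```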

## Use comparisons to select data based on value (Boolean output as `True` or `False`)

```python
subset > 10000  # Compare every value with 10000; the result is a frame of True/False
```

## Select values or "Not a Number" (NaN) using a Boolean mask

- A frame full of Booleans is sometimes called a mask because of how it can be used.
- Get the value where the mask is true, and NaN where it is false.
- Useful because NaNs are ignored by operations like max, min, average, etc.
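A minimal sketch of building and applying a mask follows. The stand-in table uses made-up values, and the name `masked` is chosen here for illustration.

```python
import pandas as pd

# Made-up values for illustration only.
subset = pd.DataFrame(
    {
        "gdpPercap_1957": [1900.0, 8800.0, 9700.0],
        "gdpPercap_1962": [2300.0, 10700.0, 10900.0],
    },
    index=pd.Index(["Albania", "Austria", "Belgium"], name="country"),
)

mask = subset > 10000     # frame of Booleans with the same shape
masked = subset[mask]     # original value where True, NaN where False
print(masked)

# NaNs are skipped, so the minimum is taken over the kept values only.
print(masked["gdpPercap_1962"].min())
```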

## Use further operations (e.g. `min`, `max`, `describe`)
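As a sketch, `describe` summarizes each column of a masked frame and counts only the values that are not NaN. The table below is a stand-in with made-up values.

```python
import pandas as pd

# Made-up values for illustration only.
subset = pd.DataFrame(
    {
        "gdpPercap_1957": [1900.0, 8800.0, 9700.0, 17900.0],
        "gdpPercap_1962": [2300.0, 10700.0, 10900.0, 20400.0],
    },
    index=pd.Index(
        ["Albania", "Austria", "Belgium", "Switzerland"], name="country"
    ),
)

masked = subset[subset > 10000]   # NaN wherever the value is not above 10000

# One summary column per data column: count, mean, std, min, quartiles, max.
summary = masked.describe()
print(summary)
```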

**Exercise**: Write an expression to find the Per Capita GDP of Serbia in 2007.

## Reconstructing Data

**Exercise**: Explain what each line in the following short program does: what is in first, second, etc.?

```python
first = pd.read_csv('data/gapminder_all.csv', index_col='country')
second = first[first['continent'] == 'Americas']
third = second.drop('Puerto Rico')
fourth = third.drop('continent', axis = 1)
fourth.to_csv('result.csv')
```

## Get all the column and row names
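A minimal sketch: the `columns` and `index` attributes hold the column names and row names. The table is a stand-in with made-up values.

```python
import pandas as pd

# Made-up values for illustration only.
data = pd.DataFrame(
    {
        "gdpPercap_1952": [1600.0, 6100.0, 8300.0],
        "gdpPercap_1957": [1900.0, 8800.0, 9700.0],
    },
    index=pd.Index(["Albania", "Austria", "Belgium"], name="country"),
)

print(data.columns)   # column labels
print(data.index)     # row labels

# Both can be turned into plain Python lists.
column_names = list(data.columns)
row_names = list(data.index)
```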

## Selecting Indices

**Exercise**: Explain in simple terms what `idxmin` and `idxmax` do in the short program below. When would you use these methods?

```python
import pandas as pd
data = pd.read_csv('data/gapminder_gdp_europe.csv', index_col='country')

print(data.idxmin())
print(data.idxmax())
```

**Exercise**: Assume Pandas has been imported and the Gapminder GDP data for Europe has been loaded. Write an expression to select each of the following:

- GDP per capita for all countries in 1982.
- GDP per capita for Denmark for all years.
- GDP per capita for all countries for years after 1985.
- GDP per capita for each country in 2007 as a multiple of GDP per capita for that country in 1952.

Python includes a `dir` function that can be used to display all of the available methods (functions) that are built into a data object. 

As an example, the functions available for a list data type are:

```python
potatoes = ["Russet", "Norkota", "Yukon Gold", "Pontiac"]
dir(potatoes)
```

Similarly, we can apply the `dir` function to a pandas DataFrame.
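A short sketch on a hypothetical one-column frame is shown below. Filtering out names that start with an underscore is a choice made here to keep the output readable.

```python
import pandas as pd

# Hypothetical frame; any DataFrame would do.
frame = pd.DataFrame({"a": [1, 2]})

# dir returns a list of attribute and method names as strings.
names = [name for name in dir(frame) if not name.startswith("_")]
print(len(names))
print(names[:10])
```

The list is long, so printing only the first few entries or searching it for a name such as `"idxmin"` is often more practical than reading it in full.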