![rmotr](https://user-images.githubusercontent.com/7065401/52071918-bda15380-2562-11e9-828c-7f95297e4a82.png)
<hr style="margin-bottom: 40px;">

<img src="https://user-images.githubusercontent.com/7065401/39118512-dfa1cc1a-46e9-11e8-9547-093d4532451e.png"
    style="width:300px; float: right; margin: 0 40px 40px 40px;"></img>

# Intro to Pandas DataFrames

Probably the most important data structure of pandas is the `DataFrame`. It's a tabular structure tightly integrated with `Series`.

A DataFrame is a tabular structure with the following properties:

- It's composed of an ordered sequence of rows and an ordered sequence of columns.
- It uses an index to reference individual rows.
- Each column can have a different NumPy-related type.
- It can be seen as a collection of multiple Series, all sharing the same index.
- It can be "sliced" horizontally (per row) or vertically (per column).
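The properties above can be sketched with a tiny example. The variable names (`population`, `continent`, `frame`) and the two-row data are chosen only for illustration:

```python
import pandas as pd

# Two Series sharing the same index labels (illustrative data)
population = pd.Series([35.467, 63.951], index=['Canada', 'France'])
continent = pd.Series(['America', 'Europe'], index=['Canada', 'France'])

# A DataFrame can be built as a collection of Series aligned on that index
frame = pd.DataFrame({'Population': population, 'Continent': continent})

# Each column keeps its own type
print(frame.dtypes)

# Vertical slice: selecting a column returns a Series
column = frame['Population']

# Horizontal slice: selecting a row by its label also returns a Series
row = frame.loc['France']
print(row)
```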

![purple-divider](https://user-images.githubusercontent.com/7065401/52071927-c1cd7100-2562-11e9-908a-dde91ba14e59.png)

## Hands on! 

In [None]:
import numpy as np
import pandas as pd

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

## DataFrames creation

The `DataFrame` constructor accepts the following parameters:

- **data**: (almost always provided; omitting it creates an empty DataFrame) the data we want to store in the DataFrame. It can be a dictionary of Series, a dictionary of sequences, a two-dimensional ndarray, a Series or another DataFrame.
- **index**: (optional) the labels we want to assign to the rows of our DataFrame. It can be a Python sequence or a one-dimensional ndarray. Default value: a `RangeIndex`, equivalent to `np.arange(0, number_of_rows)`.
- **columns**: (optional) the labels we want to assign to the columns of our DataFrame. It can be a Python sequence or a one-dimensional ndarray. Default value when `data` carries no column labels: a `RangeIndex`, equivalent to `np.arange(0, number_of_columns)`.
- **dtype**: (optional) a single NumPy data type to force on all columns. If omitted, the type of each column is inferred from the data.
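As a minimal sketch of how these parameters behave, the following compares the default labels with explicit ones. The names `plain` and `labeled` are illustrative only:

```python
import numpy as np
import pandas as pd

# No index or columns given: pandas generates 0-based integer labels for both
plain = pd.DataFrame(np.arange(6).reshape(2, 3))
print(plain.index)
print(plain.columns)

# Explicit row and column labels, plus a single dtype applied to every column
labeled = pd.DataFrame(np.arange(6).reshape(2, 3),
                       index=['r1', 'r2'],
                       columns=['c1', 'c2', 'c3'],
                       dtype=np.float64)
print(labeled.dtypes)
```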

In [None]:
# Using a dictionary of sequences
dataframe = pd.DataFrame({'var1': [1, 2, 3],
                          'var2': ['one', 'two', 'three'],
                          'var3': [1.0, 2.0, 3.0]})

dataframe

In [None]:
# Using a dictionary of Series
dataframe = pd.DataFrame({'var1': pd.Series([1, 2, 3], dtype=np.float64),
                          'var2': pd.Series(['a', 'b'])})

dataframe

In [None]:
# Using an ndarray with labels for rows and columns
dataframe = pd.DataFrame(np.arange(16).reshape(4, 4),
                         index=['r1', 'r2', 'r3', 'r4'],
                         columns=['c1', 'c2', 'c3', 'c4'])
dataframe

In [None]:
# Using an ndarray with default row and column labels and a fixed dtype
dataframe = pd.DataFrame(np.arange(16).reshape(4,4), dtype=np.int32)

dataframe

In [None]:
dataframe.dtypes

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

## DataFrame elements

`DataFrame`s expose some useful attributes:

In [None]:
dataframe = pd.DataFrame({'var1': pd.Series([1, 2, 3], dtype=np.float64),
                          'var2': pd.Series(['a', 'b'])})

dataframe

In [None]:
# Show first rows of our DataFrame
dataframe.head()

In [None]:
# Type of our DataFrame columns
dataframe.dtypes

In [None]:
dataframe

In [None]:
# Values of a DataFrame
dataframe.values

In [None]:
dataframe.info()

In [None]:
type(dataframe.values)

In [None]:
# Using an ndarray with labels for rows and columns
dataframe = pd.DataFrame(np.arange(16).reshape(4, 4),
                         index=['r1', 'r2', 'r3', 'r4'],
                         columns=['c1', 'c2', 'c3', 'c4'])
dataframe

In [None]:
# Index of a DataFrame
dataframe.index

In [None]:
# Columns of a DataFrame
dataframe.columns

In [None]:
# Dimension of the DataFrame
dataframe.ndim

In [None]:
# Shape of the DataFrame
dataframe.shape

In [None]:
# Number of DataFrame elements
dataframe.size

> _Index objects are immutable, so we can't change individual labels in place. However, we can replace a complete index with a new one._

In [None]:
dataframe = pd.DataFrame({'var1': pd.Series([1, 2, 3], dtype=np.float64),
                          'var2': pd.Series(['a', 'b'])})

dataframe

In [None]:
# Modifying a row index will give us an error
dataframe.index[0] = 4

In [None]:
# Modifying a column index will give us an error
dataframe.columns[0] = 4

In [None]:
# This will work
dataframe.index = ['r1', 'r2', 'r3']
dataframe

In [None]:
# This will work
dataframe.columns = ['c1', 'c2']
dataframe
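The cells that modify a single label stop with an error. A minimal sketch that catches it instead, so the whole sequence can run in one go (the `frame` and `mutated` names are illustrative):

```python
import pandas as pd

frame = pd.DataFrame({'var1': [1.0, 2.0, 3.0], 'var2': ['a', 'b', 'c']})

# Item assignment on an Index raises TypeError because Index objects are immutable
try:
    frame.index[0] = 4
    mutated = True
except TypeError as error:
    mutated = False
    print(f'Cannot modify the index in place: {error}')

# Replacing the whole row index or the whole set of column labels is allowed
frame.index = ['r1', 'r2', 'r3']
frame.columns = ['c1', 'c2']
print(frame)
```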

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

## The Group of Seven

We'll continue our analysis of the "[G7 countries](https://en.wikipedia.org/wiki/Group_of_Seven)", now using DataFrames. As mentioned, a DataFrame looks a lot like a table (such as the one you can see [here](https://docs.google.com/spreadsheets/d/1IlorV2-Oh9Da1JAZ7weVw86PQrQydSMp-ydVMH135iI/edit?usp=sharing)):

![image](https://user-images.githubusercontent.com/872296/38153492-72c032ca-3443-11e8-80f4-9de9060a5127.png)

Creating `DataFrame`s manually can be tedious. 99% of the time you'll be pulling the data from a database, a CSV file or the web. Still, you can create a DataFrame by specifying the columns and values:

In [None]:
df = pd.DataFrame({
    'Population': [35.467, 63.951, 80.94 , 60.665, 127.061, 64.511, 318.523],
    'GDP': [
        1785387,
        2833687,
        3874437,
        2167744,
        4602367,
        2950039,
        17348075
    ],
    'Surface Area': [
        9984670,
        640679,
        357114,
        301336,
        377930,
        242495,
        9525067
    ],
    'HDI': [
        0.913,
        0.888,
        0.916,
        0.873,
        0.891,
        0.907,
        0.915
    ],
    'Continent': [
        'America',
        'Europe',
        'Europe',
        'Europe',
        'Asia',
        'Europe',
        'America'
    ]
}, columns=['Population', 'GDP', 'Surface Area', 'HDI', 'Continent'])

_(The `columns` parameter is optional. I'm using it to keep the same order as in the picture above.)_

In [None]:
df

In [None]:
df.dtypes

In [None]:
type(df.values)

In [None]:
df.ndim

In [None]:
df.shape

In [None]:
df.size

In [None]:
df.info()

In [None]:
df.describe()

In [None]:
# get_dtype_counts() was removed in pandas 1.0; count the column dtypes instead
df.dtypes.value_counts()

In [None]:
df.columns

In [None]:
df.index
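As noted at the start of this section, data usually comes from a file rather than from hand-typed values. A hedged sketch of that path, using an in-memory string as a stand-in for a real CSV file (the `csv_text` content and the `Country` column name are assumptions made for the example):

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a real file on disk
csv_text = """Country,Population,Continent
Canada,35.467,America
France,63.951,Europe
Germany,80.94,Europe
"""

# read_csv accepts a path or any file-like object; index_col picks the row labels
g7_sample = pd.read_csv(io.StringIO(csv_text), index_col='Country')
print(g7_sample)
```

With a real file, the `io.StringIO(...)` argument would be replaced by the file path.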

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

## Changing column type

In [None]:
# np.int was removed in NumPy 1.24; the built-in int does the same job
df['Population'].astype(int)

In [None]:
df

In [None]:
# astype(int) drops the decimals (it truncates rather than rounds)
df['Rounded Population'] = df['Population'].astype(int)

In [None]:
df
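Casting a float column to an integer type discards the fractional part instead of rounding to the nearest integer. A small sketch of the difference, using three of the population values from the table (the `truncated` and `rounded` names are illustrative):

```python
import pandas as pd

population = pd.Series([35.467, 63.951, 80.94])

# astype(int) drops the fractional part (truncates toward zero)
truncated = population.astype(int)

# Calling round() before the cast gives the nearest integer instead
rounded = population.round().astype(int)

print(truncated.tolist())  # [35, 63, 80]
print(rounded.tolist())    # [35, 64, 81]
```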

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

## Changing `DataFrame`s column index

In [None]:
df.columns = ['P', 'GDP', 'SA', 'HDI', 'C', 'RP']

df

_(We'll restore the original column names.)_

In [None]:
df.columns = ['Population', 'GDP', 'Surface Area', 'HDI', 'Continent', 'Rounded Population']

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

## Changing `DataFrame`s row index

`DataFrame`s also have a row index. As you can see in the "table" above, pandas has automatically assigned an auto-incrementing numeric index to each "row" in our DataFrame. In our case, we know that each row represents a country, so we'll just reassign the index:

In [None]:
df

In [None]:
df.index

In [None]:
df.index = [
    'Canada',
    'France',
    'Germany',
    'Italy',
    'Japan',
    'United Kingdom',
    'United States',
]

In [None]:
df

In [None]:
df.index

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

## Removing indexes

We can also discard the current index of our `DataFrame` at any time, keeping it as a new column of our data. To do that we use the `reset_index()` method. The new index will be a numerical sequence.

> Note 1: `reset_index()` returns a new DataFrame, so if we want to keep it we need to assign it to a variable.

> Note 2: if we don't want to keep the old index as a column, we can drop it using the `drop=True` parameter.

In [None]:
df = df.reset_index()

df

We can also set a column (or a list of columns) as the `DataFrame` index:

In [None]:
df = df.set_index(['index'])

df
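The two notes in this section can be checked on a small standalone frame. This is only a sketch; the `frame`, `with_column`, `dropped` and `restored` names are illustrative:

```python
import pandas as pd

frame = pd.DataFrame({'GDP': [1785387, 2833687]}, index=['Canada', 'France'])

# reset_index() returns a new DataFrame; `frame` itself is left unchanged
with_column = frame.reset_index()

# drop=True discards the old index instead of turning it into a column
dropped = frame.reset_index(drop=True)

# set_index() moves the 'index' column back into the row index
restored = with_column.set_index('index')

print(with_column)
print(dropped)
print(restored)
```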

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

## Changing `DataFrame`s row and column index at once

In [None]:
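df.rename(
    columns={
        'HDI': 'Human Development Index',
        # This column doesn't exist in df, so the entry is silently ignored
        'Annual Popcorn Consumption': 'APC'
    },
    index={
        'United States': 'USA',
        'United Kingdom': 'UK',
        # 'Argentina' isn't in the index either, so nothing happens for it
        'Argentina': 'AR'
    }
)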

In [None]:
df.rename(index=str.upper)

In [None]:
df.rename(index=lambda x: x.lower())
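Two details of `rename()` are easy to miss: it returns a new DataFrame rather than changing the original, and by default it ignores labels that aren't present. A minimal sketch on a standalone frame (names are illustrative; the `errors='raise'` option is available in pandas 0.25 and later):

```python
import pandas as pd

frame = pd.DataFrame({'HDI': [0.907, 0.915]},
                     index=['United Kingdom', 'United States'])

# rename() returns a new DataFrame; labels missing from the data are ignored
renamed = frame.rename(columns={'HDI': 'Human Development Index'},
                       index={'United States': 'USA', 'Argentina': 'AR'})

# With errors='raise', a label that doesn't exist raises KeyError
try:
    frame.rename(index={'Argentina': 'AR'}, errors='raise')
    raised = False
except KeyError:
    raised = True

print(renamed)
print(raised)
```

To keep the renamed labels, assign the result back, for example `frame = frame.rename(...)`.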

![purple-divider](https://user-images.githubusercontent.com/7065401/52071927-c1cd7100-2562-11e9-908a-dde91ba14e59.png)