# Pandas - DataFrames

DataFrames are two-dimensional, size-mutable structures with labeled axes (rows and columns). Arithmetic and aggregation operations can be applied along either axis.

In general, a DataFrame can be thought of as a dictionary-like container of Series objects. It is the primary data structure in pandas.
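A minimal sketch of both ideas, using a small made-up table (the `scores` frame and its labels are hypothetical): selecting a column gives back a Series, and the `axis` argument chooses the direction of an aggregation.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration.
scores = pd.DataFrame(
    {"math": [7, 8, 9], "physics": [6, 5, 10]},
    index=["ana", "bob", "cid"],
)

# Selecting one column returns a Series that shares the row labels.
math_column = scores["math"]

# axis=0 aggregates down the rows, giving one value per column.
column_totals = scores.sum(axis=0)

# axis=1 aggregates across the columns, giving one value per row.
row_totals = scores.sum(axis=1)

print(type(math_column).__name__)
print(column_totals)
print(row_totals)
```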

In [None]:
import pandas as pd

We can create a dataframe from a dictionary of Series.

Pandas has a set of functions that make data easier to inspect, and a dataframe displays as a formatted table that is much easier to read than a NumPy array.

In [None]:
data = {
    'values': pd.Series([1,2,3]),
    'categories': pd.Series(['A', 'B', 'C'])
}
pd.DataFrame(data)

Pandas also provides functions for reading many file formats, such as CSV and Excel.

In [None]:
df = pd.read_csv('world-happiness-report.csv')
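If the file is not at hand, the same call can be tried on in-memory text. This is only a sketch: the CSV content and the `sample` name below are invented for illustration and are not taken from the file used in this notebook.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a file on disk.
csv_text = """country,score
Finland,7.8
Denmark,7.6
"""

# read_csv accepts a path or a file-like object; by default the first row becomes the header.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.columns.tolist())
print(sample.dtypes)
```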

## Getting dataframe information

When you read a CSV file, pandas treats the first row as the column names by default, so the columns arrive already labeled. Let's look at the first five rows of the dataframe:

In [None]:
df.head()

In [None]:
df.tail(3)

It is often useful to inspect a dataframe's metadata, such as its column names and shape:

In [None]:
print('Columns: ', df.columns.values)
print('Matrix size: ', df.shape)

In [None]:
df.describe()
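Beyond `columns`, `shape` and `describe()`, a few other attributes help when inspecting a dataframe. The sketch below uses a hypothetical `people` frame with one missing value to show `dtypes`, `isna()` and `info()`.

```python
import pandas as pd

# Hypothetical toy data with one missing value.
people = pd.DataFrame(
    {"name": ["Ana", "Bob", "Cid"], "age": [31.0, None, 45.0]}
)

# dtypes lists the type stored in each column.
print(people.dtypes)

# isna().sum() counts missing values per column.
missing_per_column = people.isna().sum()
print(missing_per_column)

# info() prints the index range, non-null counts, dtypes and memory usage.
people.info()
```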

## Working with columns

Each column is a Series, so the same methods are available once we select the column by name:

In [None]:
df['last_evaluation'].mean()

We can filter rows with a boolean mask, as we do with a Series, but here the mask selects rows of the whole dataframe:

In [None]:
higher_than_70 = df['last_evaluation'] > 0.7
high_evaluations = df[higher_than_70]
high_evaluations.shape

In [None]:
smaller_than_70 = df['last_evaluation'] <= 0.7
bigger_than_50 = df['last_evaluation'] > 0.5

median_evaluations = df[smaller_than_70 & bigger_than_50]
median_evaluations.shape

In [None]:
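# `|` is element-wise OR. These two conditions together cover every
# non-missing value, so this keeps all rows with a non-NaN evaluation.
good_evaluations = df[smaller_than_70 | bigger_than_50]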
good_evaluations.shape
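The masks above are stored in variables first. When conditions are written inline, each comparison needs its own parentheses. A small sketch on made-up scores (the `reviews` frame is hypothetical):

```python
import pandas as pd

# Hypothetical evaluation scores, used only for illustration.
reviews = pd.DataFrame({"last_evaluation": [0.45, 0.55, 0.65, 0.75, 0.95]})
col = reviews["last_evaluation"]

# Parentheses are required because & and | bind tighter than > and <=.
middle = reviews[(col > 0.5) & (col <= 0.7)]

# ~ negates a mask, selecting the rows the mask excludes.
not_middle = reviews[~((col > 0.5) & (col <= 0.7))]

print(middle)
print(not_middle)
```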

Counting how often each value occurs gives a quick view of a variable's distribution:

In [None]:
df['salary'].value_counts()

We can drop a column (note `axis=1`, which targets columns rather than rows). By default `drop` returns a new dataframe and leaves `df` unchanged:

In [None]:
df.drop("left", axis=1).head()
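To make the non-destructive behavior concrete, here is a sketch on a hypothetical `staff` frame. It also shows the `columns=` keyword, an equivalent spelling that avoids the `axis` argument.

```python
import pandas as pd

# Hypothetical toy frame reusing a column name from the example above.
staff = pd.DataFrame({"salary": ["low", "high"], "left": [0, 1]})

# drop returns a new DataFrame; the original keeps all of its columns.
without_left = staff.drop("left", axis=1)

# Equivalent call using the columns= keyword instead of axis=1.
also_without_left = staff.drop(columns=["left"])

print(staff.columns.tolist())
print(without_left.columns.tolist())
```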

## Working inplace

Be careful when using `inplace=True`, especially with large dataframes: it modifies the dataframe itself, so recovering the original means reloading the data and starting over.

In [None]:
# Without inplace, rename returns a renamed copy; df itself is unchanged.
df.rename(columns={'number_project': 'num_project'})

In [None]:
df.head()

In [None]:
# With inplace=True, df itself is modified and nothing is returned.
df.rename(columns={'number_project': 'num_project'}, inplace=True)
df.head()
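One way to avoid starting over is to keep an explicit copy before an in-place change, or to assign the returned result instead. A minimal sketch, assuming a hypothetical `projects` frame small enough that copying is cheap:

```python
import pandas as pd

# Hypothetical toy frame reusing the column name from the example above.
projects = pd.DataFrame({"number_project": [2, 5, 7]})

# Without inplace, rename returns a renamed copy and leaves the original alone.
renamed = projects.rename(columns={"number_project": "num_project"})
print(projects.columns.tolist())
print(renamed.columns.tolist())

# An explicit copy makes it possible to go back without re-reading the file.
backup = projects.copy()
projects.rename(columns={"number_project": "num_project"}, inplace=True)
print(projects.columns.tolist())
print(backup.columns.tolist())
```

For a very large dataframe the copy doubles memory use, so assigning the returned result to a new name may be the better trade-off.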

## Getting values

In [None]:
# Row at integer position 99 (the 100th row), returned as a Series.
df.iloc[99]

In [None]:
# Single cell: row position 99, column position 0.
df.iloc[99, 0]
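`iloc` selects by integer position. Its label-based counterpart is `loc`. The sketch below contrasts the two on a hypothetical `cities` frame whose row labels are strings, so position and label cannot be confused.

```python
import pandas as pd

# Hypothetical toy frame with string row labels.
cities = pd.DataFrame(
    {"population": [100, 200, 300], "area": [10.0, 20.0, 30.0]},
    index=["x", "y", "z"],
)

# iloc selects by integer position: second row, all columns.
second_row = cities.iloc[1]

# iloc with two positions returns a single cell: second row, first column.
cell_by_position = cities.iloc[1, 0]

# loc selects by label instead of position.
cell_by_label = cities.loc["y", "population"]

print(second_row)
print(cell_by_position, cell_by_label)
```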

## How can we iterate?

In [None]:
# iterrows yields (index label, row as a Series) pairs.
for index, row in df.iterrows():
    print('.', end='')
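Row-by-row iteration works, but it is usually slow on large dataframes. Where the computation can be written on whole columns, that form is generally preferable. A sketch on a hypothetical `orders` frame comparing the two:

```python
import pandas as pd

# Hypothetical toy data, used only for illustration.
orders = pd.DataFrame({"price": [2.0, 3.5, 1.5], "quantity": [3, 2, 4]})

# Loop version: one multiplication per row.
totals_loop = []
for index, row in orders.iterrows():
    totals_loop.append(row["price"] * row["quantity"])

# Vectorized version: one operation over entire columns.
totals_vectorized = orders["price"] * orders["quantity"]

print(totals_loop)
print(totals_vectorized.tolist())
```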

## Working with multiple dataframes

In [None]:
df1 = pd.DataFrame({'id': range(5),
                    'valor1': ['a', 'b', 'c', 'd', 'e']})
df2 = pd.DataFrame({'id': range(2, 7),
                    'valor2': ['f', 'g', 'h', 'i', 'j']})

print(df1)
print('')
print(df2)

In [None]:
pd.merge(df1, df2, on='id', how='inner')
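The `how` argument decides which `id` values survive the merge. As a sketch, the same two frames are rebuilt here so the block runs on its own, and three join types are compared:

```python
import pandas as pd

# Same frames as in the cell above.
df1 = pd.DataFrame({"id": range(5), "valor1": ["a", "b", "c", "d", "e"]})
df2 = pd.DataFrame({"id": range(2, 7), "valor2": ["f", "g", "h", "i", "j"]})

# inner keeps only ids present in both frames (2, 3, 4).
inner = pd.merge(df1, df2, on="id", how="inner")

# left keeps every row of df1; unmatched valor2 cells become NaN.
left = pd.merge(df1, df2, on="id", how="left")

# outer keeps ids from either frame (0 through 6).
outer = pd.merge(df1, df2, on="id", how="outer")

print(inner)
print(left)
print(outer)
```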

In [None]:
pd.concat([df1, df2])

In [None]:
pd.concat([df1, df2], ignore_index=True)
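The difference between the two `concat` calls is in the row index. A short sketch, again rebuilding the frames so the block is self-contained:

```python
import pandas as pd

# Same frames as in the cells above.
df1 = pd.DataFrame({"id": range(5), "valor1": ["a", "b", "c", "d", "e"]})
df2 = pd.DataFrame({"id": range(2, 7), "valor2": ["f", "g", "h", "i", "j"]})

# By default each frame keeps its own row labels, so labels 0 to 4 appear twice.
stacked = pd.concat([df1, df2])
print(stacked.index.tolist())

# ignore_index=True discards the old labels and numbers the rows 0 to 9.
renumbered = pd.concat([df1, df2], ignore_index=True)
print(renumbered.index.tolist())

# Columns missing from one frame are filled with NaN in its rows.
print(renumbered)
```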