_____

<table align="left" width=100%>
    <td>
        <div style="text-align: center;">
          <img src="./images/bar.png" alt="funding entities"/>
        </div>
    </td>
    <td>
        <p style="text-align: center; font-size:24px;"><b>Introduction to Data Science</b></p>
        <p style="text-align: center; font-size:18px;"><b>Master in Electrical and Computer Engineering</b></p>
        <p style="text-align: center; font-size:14px;"><b>Pedro Cardoso (pcardoso@ualg.pt)</b></p>
    </td>
</table>

_____

__Short Lesson Title:__ Pandas DataFrames: Creation and Manipulation

*__Summary:__ This lesson introduces the fundamental concepts of Pandas DataFrames, the core data structure for data analysis in Python. Students will learn how to create DataFrames from various data sources, including NumPy arrays, Pandas Series, Python dictionaries, and external files (CSV, JSON, Excel). The lesson covers essential DataFrame attributes and methods for inspecting data, such as shape, info, describe, head, tail, index, and columns. It also delves into data selection and manipulation techniques using indexing, slicing, boolean indexing, and the loc, iloc, at, and iat methods. Students will learn how to sort data, add and modify rows and columns, handle missing data, and perform basic statistical operations. Finally, the lesson explores data concatenation, merging, grouping, and basic plotting, providing a comprehensive foundation for working with DataFrames.*



# Dataframes

A __data frame__ is a way to store data in a "rectangular" grid:
* Each row corresponds to the measurements or values of one instance.
* Each column is a vector containing the data for a specific variable.
* The values in a row do not need to share the same type, although they can: they may be numeric, character, boolean, etc.

In the Pandas library, a DataFrame is defined as a __two-dimensional labeled data structure__ with columns of potentially different types.
       
A Pandas DataFrame consists of three main components: the data, the index, and the columns.

The main characteristics of a DataFrame are:
* **Heterogeneous**: Each column can have a different data type (numeric, string, boolean, etc.).
* **Labeled**: Both rows (index) and columns have labels, making it easy to access and manipulate data.
* **Missing Values**: DataFrames can handle missing or NaN values seamlessly, which is important for data cleaning and preprocessing.
* **Size-Mutable**: DataFrames can grow or shrink in size, allowing for dynamic data manipulation.
* **Data Alignment**: DataFrames automatically align data based on the index and column labels, making it easy to perform operations on different DataFrames.
* etc. 
 
 
![images/01_table_dataframe.svg](images/01_table_dataframe.svg)
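
The data-alignment property listed above can be illustrated with a minimal sketch. The two small frames `left` and `right` are hypothetical data invented for this example; the point is only that arithmetic matches values by label, not by position.

```python
import pandas as pd

# Two small frames with partially overlapping labels (hypothetical data).
left = pd.DataFrame({'a': [1, 2], 'b': [10, 20]}, index=['x', 'y'])
right = pd.DataFrame({'a': [100, 200]}, index=['y', 'z'])

# Arithmetic aligns on row AND column labels, not on position.
# Labels present in only one of the operands produce NaN.
total = left + right
print(total)
```

Only the cell with row label `y` and column label `a` exists in both frames, so it is the only one that receives a numeric result.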

Let's start by importing the libraries we will need for this notebook and enabling inline plotting.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline

## Make a dataframe

Many things can serve as input to make a `DataFrame`.

The most basic DataFrame is an empty one, created as follows

In [None]:
df = pd.DataFrame()
df


### From a NumPy array (`ndarray`)

To create a DataFrame from a NumPy array, we can use the `pd.DataFrame` constructor. The `data` parameter is the NumPy array to be converted to a DataFrame.

In [None]:
data = np.array([[1,2],
                [3,4]])
data

In [None]:
type(data)

In [None]:
pd.DataFrame(data=data)

We can **define an index and name the columns**. The `index` parameter is the index of the DataFrame. The `columns` parameter is the columns of the DataFrame.

In [None]:
pd.DataFrame(data=data,
             index=['row 1', 'row 2'],
             columns=['col 1', 'col 2']
             )

We can also use a structured array as follows

In [None]:
pets = np.array(
    [
        ('Olivia', 1, 13.0),
        ('Violeta Maria', 3, 5.0)    
    ],
    dtype=[('name', 'U20'),   # Unicode string of up to 20 characters ('U10' would truncate 'Violeta Maria')
           ('age', 'i4'),     # 4-byte signed integer
           ('weight', 'f4')]  # 4-byte floating-point number
)
pets

In [None]:
df = pd.DataFrame(pets)
df

### From Pandas' Series

As seen in the previous notebook, a Pandas Series is a one-dimensional labeled array capable of holding any data type.

A single column of a DataFrame is, in fact, a Series.

Let us now create a DataFrame from a set of Series, for example the following ones

In [None]:
name = pd.Series(['Margaret', 'John', 'Claudia', 'Mary', 'Peter', 'Helga', 'Heidi', 'Mary'])

age = pd.Series([17, 13, 44, 55, 71, 7, 16, 70])

gender = pd.Series(['F', 'M', 'F', 'F', 'M', 'F', 'F', 'F'])

classification = pd.Series([4, 3, 2, 5, 1, 7, 6, 7])

For example, the `name` Series is as follows

In [None]:
name

And the building of the data frame is as follows

In [None]:
persons_df = pd.DataFrame(data=[name, age, gender, classification],
                          # columns=range(8), # this is the default
                          index=['name', 'age', 'gender', 'classification']
                         )

persons_df

If we prefer, the dataframe can be transposed

In [None]:
persons_df.transpose()

Unless explicitly requested, most operations are **NOT done in place**: they return a new object. So, we must store the returned data (or use the `inplace` argument, if available).

In [None]:
persons_df = persons_df.T # equivalent to persons_df.transpose()
persons_df

### From a Python's Dictionary

A dictionary is a collection of key-value pairs that is changeable and indexed by key (since Python 3.7 it also preserves insertion order). In Python, dictionaries are written with curly brackets. Passing a dictionary to the DataFrame constructor interprets the dictionary keys as the column names and the dictionary values as the data for those columns. A scalar value, such as `'Pt'` below, is repeated in every row.

In [None]:
persons_df = pd.DataFrame({
    'name': name,
    'age': age,
    'gender': gender,
    'country': 'Pt',
    'pet': ['cat', 'dog', 'fish', 'cat', 'cat', 'fish', 'bird', 'cat'],
    'height': [170, 172, 178, 160, 165, 150, 151, np.nan],
    'classification': classification
})

persons_df
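
As a minimal sketch of how the constructor treats the different kinds of dictionary values (the frames `pets_df` and `aligned_df` and their contents are hypothetical): lists are taken row by row, a scalar is repeated, and Series are matched by their index labels.

```python
import pandas as pd

pets_df = pd.DataFrame({
    'name': ['Olivia', 'Violeta Maria'],  # list: one value per row
    'age': [1, 3],
    'species': 'cat',                     # scalar: repeated in every row
})
print(pets_df)

# Series values are aligned on their index labels, not on their position.
weight = pd.Series([5.0, 13.0], index=[1, 0])
aligned_df = pd.DataFrame({'age': pd.Series([1, 3]), 'weight': weight})
print(aligned_df)
```

This is why `persons_df` can be built from Series that share the same default index: each value lands in the row with the matching label.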

## How To Create an Empty DataFrame

As already seen, an empty DataFrame is created by calling the `pd.DataFrame()` constructor without arguments.

In [None]:
empty_df = pd.DataFrame()
empty_df


We can also use the constructor to build a DataFrame of a given shape and initialize it with `numpy.nan`. NaNs indicate that data is not available or missing. Note that `numpy.nan` has type float.

In [None]:
empty_df = pd.DataFrame(np.nan, 
                        index=[0, 1, 2, 3], 
                        columns=['A'])
empty_df

We can force the DataFrame to be of a certain type by passing the `dtype` parameter with the desired type.

In [None]:
empty_df = pd.DataFrame(index=range(0,4),
                        columns=['A'], 
                        dtype='float')
empty_df

## How to get information from the Dataframe?

Returning to the persons dataframe

In [None]:
persons_df

The `shape` attribute of the Dataframe returns a tuple with the number of rows and columns `(number of rows, number of columns)`

In [None]:
m, n = persons_df.shape
print(f'The number of rows is {m} and the number of columns is {n}')

Another way to get the number of rows is to use the `len` function

In [None]:
len(persons_df)

We could also use the `count` method to learn more about the number of elements in the DataFrame. However, it excludes the NaN values (if there are any; see the `height` column).

In [None]:
persons_df.count()

To get the values in the dataframe we can use the `values` attribute

In [None]:
persons_df.values

A statistical description of the data can be obtained with the `describe` method: for example, the mean, standard deviation, minimum and maximum values, and the quartiles. By default, these are only computed for the numerical columns.

For the non-numerical columns, the count, unique, top, and freq values are returned instead, but only if the `include` parameter is set to `'all'`.

In [None]:
persons_df.describe(include='all')  

More information can be obtained with the `info` method, such as the number of non-null values, the type of the data, and the memory usage. By default, the reported memory usage is only an estimate for columns of type `object`, but it is a good indication. To get the actual memory usage we can set the `memory_usage` parameter to `'deep'`.

In [None]:
persons_df.info()

In [None]:
persons_df.info(memory_usage='deep')

## How to set the index

The index is a way to identify each row in the dataframe. To set the index we can use the `set_index` method of the dataframe. The `set_index` method returns a new DataFrame with the new index. The original DataFrame is not modified. To modify the original DataFrame we can use the `inplace` argument and set it to `True`.

In [None]:
persons_df_index_by_name = persons_df.set_index('name')

persons_df_index_by_name

Since the index is now the name, we can use it to access the data.

In [None]:
persons_df_index_by_name.loc['Margaret']

Since _Mary_ is repeated, we get a dataframe with all the rows with the name/index _Mary_.

In [None]:
persons_df_index_by_name.loc['Mary']
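
The behaviour described in this section can be checked on a small, hypothetical frame (`df` and its values are invented for this sketch): `set_index` leaves the original untouched, a unique label returns a Series, a repeated label returns a DataFrame, and `reset_index` moves the index back into a column.

```python
import pandas as pd

df = pd.DataFrame({'name': ['Ana', 'Rui', 'Ana'], 'age': [30, 41, 25]})

by_name = df.set_index('name')      # returns a new DataFrame
print('name' in df.columns)         # the original still has the 'name' column

unique_row = by_name.loc['Rui']     # unique label -> Series
repeated_rows = by_name.loc['Ana']  # repeated label -> DataFrame

restored = by_name.reset_index()    # moves the index back into a regular column
print(restored)
```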

## Load and save data
Pandas allows us to load and save data from several formats. For example, we can **save** data to csv, json, excel, and many more. We can also **load** data from csv, json, excel, and many more.

In [None]:
# !pip install openpyxl # you might need to install this package to load and save data to excel files

persons_df.to_csv('persons.csv')
persons_df.to_json('persons.json')
persons_df.to_excel('persons.xlsx')

To load data from csv, json, excel, and many other formats, use the appropriate read function. For example:

In [None]:
persons_temp = pd.read_csv('persons.csv')
persons_another_temp = pd.read_json('persons.json')
persons_another_another_temp = pd.read_excel('persons.xlsx')
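
One detail worth checking when saving and reloading CSV files is what happens to the index. The following sketch uses an in-memory buffer (`io.StringIO`) instead of a file, and a small hypothetical frame, to show that `to_csv` writes the index as an extra first column unless told otherwise.

```python
import io
import pandas as pd

df = pd.DataFrame({'name': ['Ana', 'Rui'], 'age': [30, 41]})

csv_text = df.to_csv()  # the index is written as an unnamed first column

# Reading it back naively turns the old index into a regular column.
naive = pd.read_csv(io.StringIO(csv_text))
print(naive.columns.tolist())

# Option 1: tell read_csv that the first column is the index.
restored = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Option 2: do not write the index in the first place.
no_index = pd.read_csv(io.StringIO(df.to_csv(index=False)))
```

Which option is preferable depends on whether the index carries information (for example, names) or is just the default row numbering.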

To read data from a url containing a csv file, we can use the `read_csv` function. For example, we can read the iris dataset from the pandas github repository.

In [None]:
iris_df = pd.read_csv('https://raw.githubusercontent.com/pandas-dev/pandas/master/doc/data/iris.data')
iris_df.head()

To avoid downloading the data every time, we can save it to a local file and then read it from there.

In [None]:
try:
    iris_df = pd.read_csv('iris.data.csv', index_col=0)
    print('loaded from local iris.data.csv')
except FileNotFoundError:  # the local copy does not exist yet
    print('downloading...')
    iris_df = pd.read_csv('https://raw.githubusercontent.com/pandas-dev/pandas/master/doc/data/iris.data')
    print('done')
    iris_df.to_csv('iris.data.csv')
    print('saved to iris.data.csv')

## View Data

Recall that `persons_df` contains the following data:

In [None]:
persons_df

The first 5 lines of the dataframe can be obtained with the `head` method. If we want any other number of lines, we can pass that number as an argument to the `head` method.

In [None]:
persons_df.head()

For example, to get the first 3 lines we can do

In [None]:
persons_df.head(3)

To get the last 5 lines of the dataframe we can use the `tail` method. If we want to get any other number of lines we can pass the number of lines as an argument to the `tail` method.

In [None]:
persons_df.tail()

In [None]:
persons_df.tail(2)

To get the index of the dataframe we can use the `index` attribute. In this case, the returned object is a `RangeIndex` object, which is a memory-efficient representation of a sequence of integers.

In [None]:
persons_df.index

If we look at the index of `persons_df_index_by_name`, we can see that it is an `Index` object, which is an immutable sequence of labels.

In [None]:
persons_df_index_by_name.index

To get the columns of the dataframe we can use the `columns` attribute. The returned object is also an `Index` object, i.e., an immutable sequence of labels.

In [None]:
persons_df.columns

## Selection

The Python and NumPy indexing operators [] and attribute operator "." provide quick and easy access to pandas data structures across a wide range of use cases. This makes interactive work intuitive, as there’s little new to learn if you already know how to deal with Python dictionaries and NumPy arrays. However, since the type of the data to be accessed isn’t known in advance, directly using standard operators has some optimization limits.

For example, it is possible to do row slicing, as follows.

Get the 1st and 2nd rows of the dataframe

In [None]:
persons_df[0:2]

Get the first three rows

In [None]:
persons_df[:3]

Get every second row

In [None]:
persons_df[::2]

But the following is not slicing: a single value inside `[]` is interpreted as a column label and, since there is no column named `1`, a `KeyError` exception is raised

In [None]:
persons_df[1]

Being an exception, it can be caught with a try/except block

In [None]:
try:
    persons_df[1]
except KeyError as ke:
    print("KeyError: " + str(ke))

But, you can get columns by name using the following syntax

In [None]:
persons_df['pet']

Get 'pet' and 'gender' columns using a list of column names

In [None]:
persons_df[['pet', 'gender']]

We can also access a series (column) using the dot notation

In [None]:
persons_df.pet
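
Dot notation is convenient, but it only works when the column name is a valid Python identifier that does not clash with an existing DataFrame attribute. The sketch below uses a hypothetical frame with a column called `count` and a column whose name contains a space to show where bracket notation is required.

```python
import pandas as pd

df = pd.DataFrame({
    'pet': ['cat', 'dog'],
    'count': [2, 1],            # clashes with the DataFrame.count method
    'pet name': ['Tom', 'Rex'], # not a valid Python identifier
})

print(df.pet)               # works: 'pet' is a plain identifier
print(callable(df.count))   # df.count is the method, not the column
print(df['count'])          # bracket notation always reaches the column
print(df['pet name'])       # the only way to access this column
```

For this reason, bracket notation is the safer choice in code that must work with arbitrary column names.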

### The `loc`, `iloc` and `at` methods

For production code, we recommend that you take advantage of the optimized pandas data access methods presented next, namely `loc`, `iloc` and `at`:

* `loc` works on labels in the index.

* `iloc` works on the positions in the index (so it only takes integers - 0-index notation).

* `at` works similarly to `loc` but `at` provides label based scalar lookups (access a single value for a row/column label pair, so use `at` if you only need to get or set a single value in a DataFrame or Series).
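
The difference between labels and positions is easiest to see when the index holds integers that do not coincide with the row positions. The frame below is hypothetical and exists only for this sketch; it also shows that label slices include the end label, while position slices exclude the end position.

```python
import pandas as pd

df = pd.DataFrame({'value': ['a', 'b', 'c']}, index=[10, 20, 30])

by_label = df.loc[10, 'value']        # row LABEL 10
by_position = df.iloc[0, 0]           # row POSITION 0 (same cell)

scalar_by_label = df.at[20, 'value']  # single value, by label
scalar_by_position = df.iat[2, 0]     # single value, by position

label_slice = df.loc[10:20]           # end label 20 is INCLUDED -> 2 rows
position_slice = df.iloc[0:1]         # end position 1 is EXCLUDED -> 1 row

print(by_label, by_position, scalar_by_label, scalar_by_position)
print(len(label_slice), len(position_slice))
```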

Let us start by redefining the index of the `persons_df` dataframe to be the name of the person.

In [None]:
persons_df = persons_df.set_index('name')
persons_df

Get the row with the label 'Peter'

In [None]:
persons_df.loc['Peter']

Being a 0-based index, the 5th row can be obtained with the following

In [None]:
persons_df.iloc[4]

Let us measure the time it takes to get the row with the label 'Peter' using the `loc` and `iloc` methods.

In [None]:
# %timeit is an ipython magic function, which can be used to time a particular piece of code (A single execution statement, or a single method). 
%timeit persons_df.loc['Peter']
%timeit persons_df.iloc[4]

Slice the dataframe from the first row up to the row with the label 'Peter'

In [None]:
persons_df.loc[:'Peter'] # 'Peter' is included!

Note that label-based slicing requires the slice bounds to be unambiguous. If we try to slice using a label that is duplicated in an unsorted index, we get a `KeyError`.

In [None]:
persons_df.loc[:'Mary'] # 'Mary' is duplicated in the index
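
The following sketch reproduces the situation on a small, hypothetical frame and shows one possible way around it, assuming the row order is not important: after sorting the index, the duplicated labels are adjacent and the slice bound is well defined.

```python
import pandas as pd

df = pd.DataFrame({'age': [55, 13, 70]}, index=['Mary', 'John', 'Mary'])

try:
    df.loc[:'Mary']            # ambiguous: which 'Mary' ends the slice?
    slicing_failed = False
except KeyError as err:
    slicing_failed = True
    print('KeyError:', err)

# With a sorted (monotonic) index the slice includes every 'Mary'.
sorted_df = df.sort_index()
up_to_mary = sorted_df.loc[:'Mary']
print(up_to_mary)
```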


Slice up to position 4 (not included), i.e., get the first four rows

In [None]:
persons_df.iloc[:4]  # 'Peter' is NOT included!

Slice from row labeled 'John' to row labeled 'Peter'

In [None]:
persons_df.loc['John':'Peter']

It is possible to filter rows and project columns at the same time using the `loc` method.

In [None]:
persons_df.loc[:'John', 'age':'country']

Using the `iloc` method, we can do a similar thing

In [None]:
persons_df.iloc[:2, 0:3]

To get a single value, we can pass both the row label and the column label

In [None]:
persons_df.loc['Peter','country']

or, since `persons_df.loc['Peter']` returns a Series, we can do

In [None]:
persons_df.loc['Peter']['country']

Using the `iloc` method, we pass the row and column positions instead

In [None]:
persons_df.iloc[4, 2]

In [None]:
persons_df.loc['Peter', 'pet']

In [None]:
persons_df.at['Peter', 'pet']

In [None]:
persons_df.iat[4, 3]   

In [None]:
%timeit persons_df.loc['Peter','pet']
%timeit persons_df.at['Peter','pet']

In [None]:
%timeit persons_df.iloc[4, 3]
%timeit persons_df.iat[4, 3]

## Sorting data
To sort the data, we can use the `sort_values` method. The default sorting is ascending, but we can change it to descending by setting the `ascending` parameter to `False`.

In [None]:
persons_df.sort_values(by='height', ascending=False)

We can sort by multiple columns, by passing a list of column names to the `by` parameter. In this case, the first column will be used as the primary sorting key, and the second column will be used as the secondary sorting key.

In [None]:
persons_df.sort_values(by=['pet', 'classification'], ascending=[True, False])
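
To make the role of the primary and secondary keys concrete, here is a minimal sketch on hypothetical data (the `score` column is invented for this example). Rows are first grouped by `pet` in ascending order and, within each pet, ordered by `score` in descending order.

```python
import pandas as pd

df = pd.DataFrame({'pet': ['cat', 'dog', 'cat', 'dog'],
                   'score': [1, 5, 3, 2]})

ordered = df.sort_values(by=['pet', 'score'], ascending=[True, False])
print(ordered)

# The original frame keeps its order: sort_values returned a new DataFrame.
print(df.index.tolist())
```

The original row labels travel with the rows, which makes it easy to see how they were rearranged.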

It's also possible to sort the axis index, by using the `sort_index` method. 

Remember that the `axes` attribute returns a tuple with the index and the columns.

In [None]:
persons_df.axes

To sort the index in ascending order we use `axis=0` and `ascending=True`, which are the default values.

In [None]:
persons_df.sort_index(axis=0, ascending=True)

To sort the columns in ascending order (default behaviour) we use the axis=1 parameter.

In [None]:
persons_df.sort_index(axis=1)

We can sort both the index and the columns in a single expression. Since each method returns a new DataFrame, we can chain the two calls using dot notation.

In [None]:
persons_sorted = persons_df \
    .sort_index(axis=1) \
    .sort_index(axis=0)

persons_sorted

## Boolean Indexing
Pandas also allows you to use boolean indexing to select the data rows that match a specified criterion. For example, we can select the rows where the value of the `height` column is greater than 170.

In [None]:
persons_df.height > 170

In [None]:
persons_df[persons_df.height > 170]

The `isin` method can be used to filter the data using a list of values. For example, we can select the rows where the value of the `pet` column is either 'cat' or 'fish'.

In [None]:
query = persons_df.pet.isin(['cat', 'fish'])
query

In [None]:
persons_df[query]

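This could also be done using the element-wise operators `|` (or, disjunction) and `&` (and, conjunction). Each condition must be wrapped in parentheses, because these operators have higher precedence than the comparison operators.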

In [None]:
query = (persons_df.pet == 'cat') | (persons_df.pet == 'fish')
query

For example, to select the persons who have a cat and are male, we can do the following

In [None]:
query = (persons_df.pet == 'cat') & (persons_df.gender == 'M')
persons_df[query]
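
Conditions can also be negated with the `~` operator. As a sketch on hypothetical data, the example below selects the cat owners who are not male, and compares the result with the same filter written with the `query` method, which some readers find easier to read.

```python
import pandas as pd

df = pd.DataFrame({'pet': ['cat', 'dog', 'cat', 'fish'],
                   'gender': ['F', 'M', 'M', 'F'],
                   'height': [170, 172, 165, 150]})

# Parentheses are required around each comparison; ~ negates a mask.
mask = (df['pet'] == 'cat') & ~(df['gender'] == 'M')
cat_not_male = df[mask]

# The same selection expressed as a query string.
same_with_query = df.query("pet == 'cat' and gender != 'M'")

print(cat_not_male)
```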

## Adding and setting data (single value)
To set a value in a DataFrame, we can use the `loc` or `iloc` methods.

Remember your `persons_df` values

In [None]:
persons_df

To set a value by index/label we can use the `loc` method or the `at` method. For a single value, the `at` method is faster than the `loc` method if the index is unique.

In [None]:
persons_df.loc['Peter', 'pet'] = 'iguana'
persons_df

In [None]:
persons_df.at['Peter', 'pet'] = 'iguana'
persons_df

In [None]:
%timeit persons_df.loc['Peter', 'pet'] = 'iguana'
%timeit persons_df.at['Peter', 'pet'] = 'iguana'

In [None]:
persons_wihout_mary = persons_df.drop('Mary', axis=0)
%timeit persons_wihout_mary.loc['Peter', 'pet'] = 'iguana'
%timeit persons_wihout_mary.at['Peter', 'pet'] = 'iguana'

The `iloc` method is used to set a value by position. For a single value, the `iat` method is faster than the `iloc` method. Positions are always unique, so the duplicated label is not an issue in this case.

In [None]:
persons_df.iloc[4, 3] = 'cat'
persons_df

In [None]:
persons_df.iat[4, 3] = 'cat'
persons_df

In [None]:
%timeit persons_df.iloc[4, 3] = 'cat'
%timeit persons_df.iat[4, 3] = 'cat'

%timeit persons_wihout_mary.iloc[4, 3] = 'cat'
%timeit persons_wihout_mary.iat[4, 3] = 'cat'

In [None]:
# persons_df.at['Peter', 'age'] = 44
# persons_df.at['Peter', 'height'] = 185
# persons_df

In [None]:
# persons_df.iat[0, 0] = 16
# persons_df.iat[0, 4] = 168
# persons_df

## Adding and setting data (rows)
So, how do we update or replace a row? We just use the `loc` method to set the values of the row.

In [None]:
persons_df

If the index does __not exist__ then a new row is added, i.e., the row is appended.

In [None]:
persons_df.loc['George'] = [44, 'M', 'Pt', 'snake', 172.0, 8]
persons_df

If the index __exists__ then the data is updated, i.e., the row is replaced.

In [None]:
persons_df.loc['George'] = [34, 'M', 'Pt', 'iguana', 172.0, 8]
persons_df

## Adding and setting values (columns)

We can add new Series/columns to the DataFrame by assigning them to the DataFrame. For example, we can add a new column called `weight` to the DataFrame.

In [None]:
persons_df['weight'] = [55, 58, 75, 64, 90, 20, 25, 78, 79]
persons_df

We can also operate on Series. For instance, the *body mass index* is defined as the body mass divided by the square of the body height,
$$ BMI = \frac{Weight}{Height^2}$$
and is universally expressed in units of $kg/m^2$, resulting from mass in kilograms and height in metres. Since the `height` column is stored in centimetres, it is divided by 100 in the computation.

In [None]:
persons_df['BMI'] = persons_df['weight'] / (persons_df['height']/100) ** 2
persons_df
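
As a worked check of the formula, a person weighing 55 kg with a height of 170 cm has $BMI = 55 / 1.70^2 \approx 19.03$. The sketch below repeats the column computation on a small, hypothetical frame so that the arithmetic can be verified by hand.

```python
import pandas as pd

df = pd.DataFrame({'weight': [55, 90],      # kilograms
                   'height': [170, 165]},   # centimetres
                  index=['Margaret', 'Peter'])

# Convert the height to metres before squaring it.
df['BMI'] = df['weight'] / (df['height'] / 100) ** 2
print(df.round(2))
```

The operation is vectorized: the division and the power are applied to whole columns, with no explicit loop over the rows.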

What if we want to state each person's weight category? We can use the `apply` method to apply a function to each element of the Series.

In [None]:
def set_category(bmi):
    return 'Underweight' if bmi < 18.5  \
    else 'Normal weight' if 18.5 <= bmi < 25 \
    else 'Overweight'  if 25 <= bmi < 30 \
    else 'Obesity' if bmi >= 30 \
    else '?'

persons_df['category'] = persons_df.BMI.apply(set_category)

persons_df
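
An alternative to `apply` for this kind of binning is `pd.cut`, sketched here under the assumption that the same category boundaries are wanted. The sample `bmi` values are hypothetical. With `right=False` each interval includes its left edge and excludes its right edge, which matches the conditions of `set_category`; missing values stay missing instead of becoming `'?'`.

```python
import numpy as np
import pandas as pd

bmi = pd.Series([17.0, 18.5, 24.9, 25.0, 31.2, np.nan])

category = pd.cut(
    bmi,
    bins=[-np.inf, 18.5, 25, 30, np.inf],
    labels=['Underweight', 'Normal weight', 'Overweight', 'Obesity'],
    right=False,  # intervals are [low, high)
)
print(category)
```

The result is a categorical Series, which keeps the natural order of the categories and is convenient for grouping and sorting.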

## Delete a row / Column

To delete rows, use the `drop` method with the index labels of the rows to be deleted. By default, `drop` acts on rows (`axis=0`) and returns a new DataFrame without them; the original DataFrame is not modified.

In [None]:
persons_df.drop(['Mary', 'George'])

To delete a column, also use the `drop` method, but pass `axis=1`.

In [None]:
persons_df.drop(['pet'], axis = 1)

You can also drop every cat lover! First we get the index labels of the cat lovers

In [None]:
cat_people_indexes = persons_df.index[persons_df.pet == 'cat']
cat_people_indexes

And then we drop them!

In [None]:
persons_df.drop(cat_people_indexes)

To drop duplicates we can use the `drop_duplicates` method with the `keep` argument. If keep is `first` it keeps the first occurrence. If keep is `last` it keeps the last occurrence. If keep is `False` it drops all duplicates.

In [None]:
persons_df[['gender', 'pet']].drop_duplicates(keep='last')

In [None]:
persons_df[['gender', 'pet']].drop_duplicates(keep='first')

In [None]:
persons_df[['gender', 'pet']].drop_duplicates(keep=False)
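
The effect of the three `keep` options is easier to follow on a frame small enough to inspect by eye. In this hypothetical example, rows `a` and `c` are duplicates of each other.

```python
import pandas as pd

df = pd.DataFrame({'gender': ['F', 'M', 'F', 'F'],
                   'pet': ['cat', 'dog', 'cat', 'fish']},
                  index=['a', 'b', 'c', 'd'])

keep_first = df.drop_duplicates(keep='first')  # keeps 'a', drops 'c'
keep_last = df.drop_duplicates(keep='last')    # keeps 'c', drops 'a'
keep_none = df.drop_duplicates(keep=False)     # drops both 'a' and 'c'

print(keep_first.index.tolist())
print(keep_last.index.tolist())
print(keep_none.index.tolist())
```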

## Missing data

pandas primarily uses the value `np.nan` to represent missing data. It is by default not included in computations. Let us add some missing data to our DataFrame.

In [None]:
persons_df_copy = persons_df.copy()
persons_df_copy.at['Peter', 'club'] = 'Associação Académica de Coimbra'
persons_df_copy

Looking at the info of the DataFrame, we can see that the `club` column has only 1 non-null value.

In [None]:
persons_df_copy.info()

To drop any rows that have missing data we can use the `dropna` method. This method returns a new DataFrame with the missing values dropped from it.

In [None]:
persons_with_full_data = persons_df_copy.dropna()
persons_with_full_data

To fill missing data we can use the `fillna` method. This method returns a new DataFrame with the missing values filled or imputed with the value passed as argument.

In [None]:
persons_with_filled_data = persons_df.copy()
persons_with_filled_data['club'] = persons_df_copy['club'].fillna('SC Farense')
persons_with_filled_data

To get the boolean mask where values are NaN we can use the `isnull` method (an alias of `isna`). Later we can use this mask to select the rows with missing data.

In [None]:
persons_df_copy.isnull()
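
The methods of this section can be combined in a few common ways, sketched below on a hypothetical frame: counting the missing values per column, dropping rows only when a specific column is missing (the `subset` parameter), and filling a numerical column with its mean. Filling with the mean is just one possible imputation choice, used here for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'height': [170.0, np.nan, 160.0],
                   'club': [np.nan, np.nan, 'SC Farense']})

missing_per_column = df.isna().sum()           # number of NaNs in each column
complete_rows = df.dropna()                    # rows without any NaN
with_height = df.dropna(subset=['height'])     # only 'height' must be present
filled_height = df['height'].fillna(df['height'].mean())

print(missing_per_column)
print(filled_height)
```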

## Stats

Statistical operations in general exclude missing data. Examples of statistical operations include: `count`, `sum`, `mean`, `median`, `min`, `max`, `std`, `var`, `sem`, `skew`, `kurt`, `quantile`, `cumsum`, `cumprod`, `cummax`, `cummin`.

These methods can be applied to the whole DataFrame or to a specific column. For example, to get the mean of all numerical columns we can use the `mean` method. The mean is given by 
$$ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i $$

In [None]:
numerical_columns = persons_df.select_dtypes('number').columns
numerical_columns

In [None]:
persons_df[numerical_columns].mean()

or simply

In [None]:
persons_df.select_dtypes('number').mean()

Or for a specific column like `age`

In [None]:
persons_df.age.mean()

Performing the same operation but on the other axis can also be done (it has no meaning in our case!)

In [None]:
persons_df[numerical_columns]

In [None]:
persons_df[numerical_columns].mean(axis=1)

The median is the value separating the higher half from the lower half of a data sample, a population, or a probability distribution. It can be computed for all numerical columns with the `median` method.

In [None]:
persons_df[numerical_columns].median()

The rank method computes the numerical data ranks (1 through n) along axis. By default, equal values are assigned a rank that is the average of the ranks of those values.

In [None]:
persons_df.rank()

The standard deviation is a measure used to quantify the amount of variation or dispersion of a set of data values. By default, the `std` method computes the sample standard deviation (`ddof=1`), which is given by
$$ s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2} $$

In [None]:
persons_df[numerical_columns].std()
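
Note that pandas normalizes by $n-1$ by default (sample standard deviation, `ddof=1`), whereas NumPy's `np.std` normalizes by $n$ (population standard deviation, `ddof=0`). The sketch below, on a hypothetical series whose squared deviations sum to 32, makes the difference visible.

```python
import math
import numpy as np
import pandas as pd

s = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])   # mean 5, squared deviations sum to 32

sample_std = s.std()               # divides by n - 1 = 7
population_std = s.std(ddof=0)     # divides by n = 8
numpy_std = np.std(s.to_numpy())   # NumPy's default is ddof=0

print(sample_std, population_std, numpy_std)
```

When comparing results computed with different libraries, it is worth checking which normalization each of them uses.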

The `corr` method computes the pairwise correlation between the columns of a DataFrame (or between two Series). By default, the correlation is the Pearson correlation coefficient, which is the covariance of the two variables divided by the product of their standard deviations. This value ranges from -1 to 1. A value of 1 means that there is a perfect positive linear correlation between the two variables. A value of -1 means that there is a perfect negative linear correlation between the two variables. A value of 0 means that there is no linear correlation between the two variables.

The correlation between the height and the weight is given by
$$ \rho = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}} $$

In [None]:
persons_df.height.corr(persons_df.weight)

and the correlation between all variables

In [None]:
persons_df[numerical_columns].corr()
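
To connect the formula with the method, the following sketch evaluates the Pearson coefficient term by term on two short, hypothetical series and compares the result with `Series.corr`.

```python
import math
import numpy as np
import pandas as pd

x = pd.Series([1.0, 2.0, 3.0, 4.0])
y = pd.Series([2.0, 4.1, 5.9, 8.2])

dx = x - x.mean()   # deviations from the mean of x
dy = y - y.mean()   # deviations from the mean of y

# Numerator and denominator of the formula above.
manual_rho = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

print(manual_rho, x.corr(y))
```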

The count method returns the number of non-null values in the DataFrame.

In [None]:
persons_df.count()

The unique method returns the unique values in a column.

In [None]:
persons_df.pet.unique()

## Append/concat & Merge
Let us consider another dataframe with roughly the same set of columns as before

In [None]:
more_persons_df = pd.DataFrame({'name': ['Ronald', 'Christian'], 
                    'age': [23, 46], 
                    'gender': ['M', 'M'], 
                    'country':'Pt',
                    'pet': ['turtle', 'iguana'],
                    'height': [190, 181],
                    'classification': [8, 9]
                   }).set_index('name')
more_persons_df

Now, we can append/concat the new dataframe to the previous one (https://pandas.pydata.org/docs/reference/api/pandas.concat.html).

In [None]:
persons_df = pd.concat([persons_df, more_persons_df])
persons_df
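
When the concatenated frames do not share exactly the same columns, as is the case above, `concat` keeps the union of the columns and fills the gaps with NaN. A minimal sketch on hypothetical data:

```python
import pandas as pd

a = pd.DataFrame({'age': [17], 'weight': [55]}, index=['Margaret'])
b = pd.DataFrame({'age': [23]}, index=['Ronald'])   # no 'weight' column

combined = pd.concat([a, b])
print(combined)   # Ronald's weight is NaN

# ignore_index=True discards the original labels and renumbers the rows.
renumbered = pd.concat([a, b], ignore_index=True)
print(renumbered.index.tolist())
```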

We can also merge two dataframes (https://pandas.pydata.org/docs/reference/api/pandas.merge.html). Let us consider another dataframe with the number of legs of each pet.

In [None]:
pet_info = pd.DataFrame({'pet': ['cat', 'dog', 'fish', 'bird', 'spider'],
                         'pet_num_legs': [4, 4, 0, 2, 8],
                         })
pet_info

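The join of the tables is done based on some common column, in this case the `pet` column. The default join is an inner join. Note that when merging on columns the original index (here, the names) is discarded and replaced by a default integer index.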

In [None]:
pd.merge(persons_df, pet_info, on='pet')

We can also do a left join and right join.

In [None]:
pd.merge(persons_df, pet_info, how='left', on='pet')

In [None]:
pd.merge(persons_df, pet_info, how='right', on='pet')
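
The difference between the join types is which unmatched rows survive. The sketch below uses two small, hypothetical frames in which one pet appears only on the left and another only on the right; the `indicator=True` option adds a `_merge` column that records where each row came from.

```python
import pandas as pd

people = pd.DataFrame({'name': ['Ana', 'Rui', 'Eva'],
                       'pet': ['cat', 'iguana', 'dog']})
legs = pd.DataFrame({'pet': ['cat', 'dog', 'spider'],
                     'pet_num_legs': [4, 4, 8]})

inner = pd.merge(people, legs, on='pet')               # only matching pets
left = pd.merge(people, legs, how='left', on='pet')    # all people
right = pd.merge(people, legs, how='right', on='pet')  # all known pets
outer = pd.merge(people, legs, how='outer', on='pet', indicator=True)

print(len(inner), len(left), len(right), len(outer))
print(outer)
```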

The merge can also be done on a specific column or the index of the dataframes. Let us consider the `pet_info` dataframe with the `pet` column as index.

In [None]:
pet_info = pet_info.set_index('pet')
pet_info

In [None]:
pd.merge(persons_df, pet_info, left_on='pet', right_index=True)

In [None]:
pd.merge(persons_df, pet_info, how='left', left_on='pet', right_index=True)

In [None]:
pd.merge(persons_df, pet_info, how='right', left_on='pet', right_index=True)

## Grouping

By “group by” we are referring to a process involving one or more of the following steps

- Splitting the data into groups based on some criteria
- Applying a function to each group independently
- Combining the results into a data structure

For example, we can count the number of non-null values in each column, by gender

In [None]:
persons_df.groupby('gender').count()

Or see the mean values (of the numerical columns) by gender. In this case, the columns have to be restricted to the numerical columns, otherwise the `mean` operation will fail.

In [None]:
persons_df.groupby('gender')[numerical_columns].mean()

or just for a single column (note that `persons_df.groupby('gender').height` is a SeriesGroupBy object, so we need to call the `mean` method to get the mean value, resulting in a Series object)

In [None]:
persons_df.groupby('gender').height.mean()

or by a slightly different alternative (note that the result is a DataFrame object)

In [None]:
type(persons_df[['gender', "height"]].groupby('gender').mean())

We can also compute the minimum of the grouped values

In [None]:
persons_df.groupby('gender')[numerical_columns].min()

Although in some cases it makes no sense, we can also compute the product of the grouped values (considering only the fully defined rows)

In [None]:
persons_df.dropna().groupby('gender')[numerical_columns].prod()

Other methods that can be used with `groupby` are `count`, `first`, `last`, `max`, `min`, `median`, `prod`, `std`, `sum`, `var`.
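
Several of these aggregations can be computed in a single call with the `agg` method. The sketch below uses a hypothetical frame and named aggregation, where each keyword (`mean_height` and `max_age` are names chosen for this example) becomes an output column defined by a pair (column, function).

```python
import pandas as pd

df = pd.DataFrame({'gender': ['F', 'M', 'F', 'M'],
                   'height': [170, 172, 160, 180],
                   'age': [17, 13, 55, 71]})

# Split by gender, apply one function per column, combine into a DataFrame.
summary = df.groupby('gender').agg(
    mean_height=('height', 'mean'),
    max_age=('age', 'max'),
)
print(summary)
```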

## Plotting

If you are using jupyterthemes, run the following commands to adapt the plot colors to the notebook theme

In [None]:
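# !pip install jupyterthemes  # if needed
try:
    from jupyterthemes import jtplot
    jtplot.style()  # adapt the plot colors to the notebook theme
except ImportError:
    print('jupyterthemes is not installed; keeping the default plot style')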

In [None]:
# reload data, if necessary
try:
    iris_df = pd.read_csv('iris.data.csv', index_col=0)
    print('loaded from iris.data.csv')
except FileNotFoundError:  # the local copy does not exist yet
    print('downloading...')
    iris_df = pd.read_csv('https://raw.githubusercontent.com/pandas-dev/pandas/master/doc/data/iris.data')
    print('done')
    iris_df.to_csv('iris.data.csv')
    print('saved to iris.data.csv')

iris_df.head()

Let us see some relevant information about the dataframe

In [None]:
iris_df.info()

and see a description of the numerical columns

In [None]:
iris_df.describe()

The `plot` method can be used to plot the data. By default, it draws one line per numerical column, with the index on the x-axis and the values on the y-axis.

In [None]:
iris_df.plot()

plt.show()

We can also plot the data grouped by a column. In this case, the `Name` column.

In [None]:
iris_df.groupby('Name').plot()

plt.show()

A scatter plot can be used to plot the data of two columns.

In [None]:
colors = dict(zip(iris_df.Name.unique(), ['red', 'green', 'blue']))

for t1 in ['SepalLength', 'SepalWidth', 'PetalLength', 'PetalWidth']:
    for t2 in ['SepalLength', 'SepalWidth', 'PetalLength', 'PetalWidth']:
        if t1 > t2:
            iris_df.plot(x=t1, y=t2, kind='scatter', c=iris_df.Name.apply(lambda x : colors[x]))


plt.show()

Subplots are useful to plot the data of all columns in a single figure. In this case, we use the `subplots` function of `matplotlib.pyplot`, and define a 4 by 4 grid of subplots.

In [None]:
import matplotlib.pyplot as plt
fig, axes = plt.subplots(4, 4, figsize=(15, 15))

colors = dict(zip(iris_df.Name.unique(), ['red', 'green', 'blue']))
for i1, t1 in enumerate(['SepalLength', 'SepalWidth', 'PetalLength', 'PetalWidth']):
    for i2, t2 in enumerate(['SepalLength', 'SepalWidth', 'PetalLength', 'PetalWidth']):
        if t1 != t2:
            iris_df.plot(x=t1, y=t2, ax=axes[i1, i2], kind='scatter', c=iris_df.Name.apply(lambda x : colors[x]))
        else:
            iris_df[t1].plot(kind='hist', ax=axes[i1, i2], color='red', alpha=0.5)

plt.show()

A more automatic way to plot the data is to use the `scatter_matrix` function of the `pandas.plotting` module.

In [None]:
from pandas.plotting import scatter_matrix

_ = scatter_matrix(iris_df, figsize=(20, 20))

plt.show()

See more plotting examples at http://pandas.pydata.org/pandas-docs/version/0.18/visualization.html

## Running inplace
Some methods can operate in place. Let us recall the values of the `persons_df` dataframe

In [None]:
persons_df

E.g., the following sort of values will not affect the `persons_df` dataframe. In fact, it will return a new dataframe with the sorted values.

In [None]:
persons_df.sort_values(by=['age'])

So we can use the `inplace` argument (its availability depends on the method). In this case, the `persons_df` dataframe will be modified.

In [None]:
persons_df.sort_values(by=['age'], inplace=True)
persons_df
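
One consequence of `inplace=True` that is easy to overlook is that the method then returns `None`, so its result must not be assigned back to the variable. A minimal sketch on a hypothetical frame:

```python
import pandas as pd

df = pd.DataFrame({'age': [44, 13, 71]})

sorted_copy = df.sort_values(by='age')   # new DataFrame, df is unchanged
print(df['age'].tolist())

returned = df.sort_values(by='age', inplace=True)  # df is modified
print(returned)                                    # nothing is returned
print(df['age'].tolist())
```

Writing `df = df.sort_values(by='age', inplace=True)` would therefore replace the DataFrame with `None`.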

# Exercise

Go to folder `exercises` and solve the exercises in notebook `03_exercises_US_names.ipynb`.