<!-- Container with white background -->
<div style="background-color: white; padding: 20px; border-radius: 10px;">

  <!-- Name in bold and approximate LaTeX font style with the logo blue color -->
  <h1 style="font-family: 'Times New Roman', Times, serif; font-weight: bold; color: #38549c; text-align: center;">
    Introduction to Python | Session #2
  </h1>

  <!-- Project name in similar style with the logo blue color in italics -->
  <h2 style="font-family: 'Times New Roman', Times, serif; color: #38549c; text-align: center; font-style: italic;">
    Diploma in Banking Supervision
  </h2>

  <!-- CEMFI logo centered -->
  <div style="text-align: center; margin-bottom: 40px;">
    <img src="https://www.cemfi.es/images/Logo-Azul.png" alt="CEMFI Logo" style="width:200px;">
  </div>

  <!-- Catchy message about authorship -->
  <p style="font-family: 'Times New Roman', Times, serif; color: #38549c; text-align: center; font-size: 1.2em;">
    Jesus Villota Miranda © 2025
  </p>

  <!-- Contact information with logos -->
  <p style="font-family: 'Times New Roman', Times, serif; color: #38549c; text-align: center; font-size: 1em;">
    Contact:
    <a href="mailto:jesus.villota@cemfi.edu.es" style="color: #38549c;">
      <img src="https://www.logolynx.com/images/logolynx/64/64319177556c729f1806922bcd3adef5.png" alt="Email Logo" style="width: 20px; vertical-align: middle;">
      jesus.villota@cemfi.edu.es
    </a> |
    <a href="https://www.linkedin.com/in/jesusvillotamiranda/" target="_blank" style="color: #38549c;">
      <img src="https://1.bp.blogspot.com/-onvhHUdW1Us/YI52e9j4eKI/AAAAAAAAE4c/6s9wzOpIDYcAo4YmTX1Qg51OlwMFmilFACLcBGAsYHQ/s1600/Logo%2BLinkedin.png" alt="LinkedIn Logo" style="width: 20px; vertical-align: middle;">
      LinkedIn
    </a>
  </p>

</div>


### 5. Data management with Pandas 

Pandas is a Python library designed for efficient, practical data analysis: cleaning, manipulating, and analyzing data. Unlike NumPy, which is designed to work with numerical arrays, Pandas is designed to handle relational or labeled data (i.e., data that has been given a context or meaning through labels).

Pandas and NumPy share some similarities, as Pandas builds on NumPy. However, rather than arrays, the main object for 2D data management in Pandas is the `DataFrame` (equivalent to `data.frame` in R). `DataFrames` are essentially matrices with labeled rows and columns that can accommodate mixed data types and missing values. Row labels are known as the `index` (by default, integers starting from 0) and column labels as `column names`.

Pandas is a very good tool for: 
- Dealing with missing data.
- Adding or deleting columns.
- Aligning data on labels or not on labels (i.e., merging and joining).
- Performing grouped operations on data sets.
- Transforming other Python data structures into DataFrames.
- Multi-indexing hierarchically.
- Loading data from multiple file types.
- Handling time-series data, as it has specific time-series functionality.


**Data structures:**

Pandas builds on two main data structures: `Series` and `DataFrames`. `Series` represent 1D labeled arrays while `DataFrames` are 2D labeled arrays. The easiest way to think about both structures is to conceptualize `DataFrames` as containers of lower dimension data. That is, `DataFrame` columns are `Series`, and each of the elements of a `Series` (i.e., the rows of the `DataFrame`) is an individual scalar value (a number or a string). In plain words, `Series` are columns made of scalar elements and `DataFrames` are collections of `Series` that get an assigned label. The image below represents a `DataFrame`.

<img src="./img1.png" alt="df" width="400"/>


All pandas data structures are value-mutable (i.e., we can change the values they contain), but they are not always size-mutable. The length of a Series cannot be changed, but, for example, columns can be inserted into a DataFrame.

#### 5.1 Creating a DataFrame. 

`DataFrames` and `Series` can be created by transforming built-in Python data structures or by importing data (e.g., reading `.csv` files).

Let's start by importing pandas.

In [3]:
import pandas
pandas.set_option('display.max_columns', None)

In the first example I will create a `DataFrame` from a dictionary. 

In [4]:
phds = {'Name': ['Joël Marbet', 'Alba Miñano-Mañero'], 
        'Undergrad University': ['University of Bern', 'University of València'],
        'Fields':['Monetary economics','Urban Economics'],
        'PhD Desk': [20,10]}
phd_df = pandas.DataFrame(phds)
phd_df

Unnamed: 0,Name,Undergrad University,Fields,PhD Desk
0,Joël Marbet,University of Bern,Monetary economics,20
1,Alba Miñano-Mañero,University of València,Urban Economics,10


To find out how many rows and columns a dataframe has, we can use the `.shape` attribute. For instance, the dataframe we just created has 2 rows and 4 columns.

In [None]:
phd_df.shape

We can work with a single column, or subset our dataframe to a set of columns, by passing the column names in square brackets `[]`. A single column is returned as a `pandas.Series`.

In [None]:
phd_df['Name']

In [None]:
type(phd_df['Name'])

> Tip: If we use double brackets (`[[]]`), we get a `pandas.DataFrame` rather than a `pandas.Series`. This is because the inner pair of brackets creates a list of columns, and the outer pair selects those columns from the dataframe.

In [None]:
phd_df[['Name']]

In [None]:
type(phd_df[['Name']])

We can also access all column names through the `.columns` attribute of a dataframe.

In [None]:
phd_df.columns

With this in mind, we can copy a selection of our DataFrame to a new one. We use the `.copy()` method to make the new frame independent of the original, so that changes to one are not propagated to the other.

In [None]:
phd_df2 = phd_df[['Name','Fields']].copy()
phd_df2
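
As a minimal sketch of why the explicit copy matters, the snippet below uses a small made-up table (the names `people` and `subset` are hypothetical, not part of the course data) and shows that editing the copy leaves the original untouched.

```python
import pandas

# Hypothetical toy data, not the course dataset
people = pandas.DataFrame({'Name': ['Ana', 'Ben'], 'Desk': [1, 2]})

# Explicit copy: later edits stay local to `subset`
subset = people[['Name', 'Desk']].copy()
subset.loc[0, 'Desk'] = 99

print(people.loc[0, 'Desk'])   # value in the original frame
print(subset.loc[0, 'Desk'])   # value in the copy
```

The two printed values differ, because only the copy was modified.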

We can also create a dataframe from other Python data structures. In the example below, we create the same `DataFrame` from a list of lists. If we do not specify the column names, pandas labels the columns with integers from 0 to n-1.

In [None]:
phds_list = [['Joël Marbet', 'University of Bern','Monetary economics', 20],
          ['Alba Miñano-Mañero', 'University of València', 'Urban Economics',10]]
# Create a list of column names
column_names_list = ['Name', 'Undergraduate University', 'Fields','PhD desk']

# Create a DataFrame from the list of lists, using the list of names as column labels
df = pandas.DataFrame(phds_list, columns=column_names_list)
df

The data could also come from a tuple of tuples:

In [None]:
phds_tu = (('Joël Marbet', 'University of Bern','Monetary economics', 20),
          ('Alba Miñano-Mañero', 'University of València', 'Urban Economics',10))
# Create a list of column names
column_names_list = ['Name', 'Undergraduate University', 'Fields','PhD desk']

# Create a DataFrame from the tuple of tuples, using the list of names as column labels
df = pandas.DataFrame(phds_tu, columns=column_names_list)
df

In [None]:
type(phds_tu)

`DataFrames` can also take NumPy arrays as their data source:

In [None]:
import numpy as np 
data_arr = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# Create a DataFrame from the NumPy array
df_arr = pandas.DataFrame(data_arr, columns=['A', 'B', 'C'])

df_arr

Similarly, we can also transform built-in structures into `pandas.Series` 

In [None]:
week_series= pandas.Series(['Monday','Tuesday','Wednesday']) # from a list
week_series


The previous line is equivalent to first creating a list and passing it as an argument to `pandas.Series`. We can also specify the row index, which can be very useful for join and concatenate operations, and a name.

In [None]:
week_list = ['Monday','Tuesday','Wednesday']
week_l_to_series = pandas.Series(week_list,index=range(3,6),name='Week')
week_l_to_series

We could convert that series to a `pandas.DataFrame` with a column named 'Week' and row index (3, 4, 5):

In [None]:
pandas.DataFrame(week_l_to_series)
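
To illustrate why an explicit index is useful for join and concatenate operations, here is a small sketch with two made-up series (`left` and `right` are hypothetical names) whose indices only partly overlap. Pandas matches values by index label rather than by position.

```python
import pandas

# Hypothetical series sharing part of their index
left = pandas.Series([1, 2, 3], index=[3, 4, 5], name='left')
right = pandas.Series([10, 20, 30], index=[4, 5, 6], name='right')

# Arithmetic aligns on the index labels, not on position
total = left + right

# Side-by-side concatenation also aligns on the index
side_by_side = pandas.concat([left, right], axis=1)

print(total)
print(side_by_side)
```

Labels that appear in only one of the two series end up with missing values (NaN) in the result.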

It is also possible to create them from dictionaries.

In [None]:
weekdays_s = {
    'Monday': 'Work',
    'Tuesday': 'Work',
    'Wednesday': 'Work',
    'Thursday': 'Work',
    'Friday': 'Work',
    'Saturday': 'Weekend',
    'Sunday': 'Weekend'
}
dic_to_series = pandas.Series(weekdays_s,name='Day_type')

In [None]:
dic_to_series

Most Pandas operations and methods return `DataFrames` or `Series`. For instance, the `describe()` method, which computes basic statistics of numerical data, returns a `Series` when called on a `Series` and a `DataFrame` when called on a `DataFrame`.

In [None]:
df_arr['A'].describe()

In [None]:
x = df_arr['A'].describe() # I use describe on a series,  returns a series
type(x)

In [None]:
x = df_arr[['A']].describe()  # I use describe on a dataframe, returns a dataframe
type(x)

In [None]:
x

#### 5.2. Importing and exporting data 

Besides creating our own tabular data, we can also use Pandas to import, read, and manipulate existing tabular data. Let's take a look at how to import different data sets. Reading different file types (Excel, Parquet, JSON...) is supported by the family of `read_*` functions.

In [None]:

file_csv ='./card_transdata.csv'
data = pandas.read_csv(file_csv)


We can print the first or last *n* rows of the data with the methods `.head(n)` and `.tail(n)`, respectively.

In [None]:
data.head(5)

In [None]:
data.tail(5)

It is also possible to get all the particular data types specifications of our columns:

In [None]:
data.dtypes # it's an attribute of the dataframe

To get a quick technical summary of our imported data, we can use the `.info()` method:

In [None]:
data.info()

The summary is telling us that: 
- We have a `pandas.DataFrame`. 
- It has 1,000,000 rows, indexed from 0 to 999,999.
- None of the columns has missing data.
- Every column has a 64-bit float data type.
- It takes up about 61 megabytes of RAM.

While the `read_*` family allows us to import data from different sources, the `to_*` family allows us to export it. In the example below, I save a csv that contains only the first 10 rows of the credit transaction data.

In [None]:
data.head(10).to_csv('./subset.csv')

If we export other data to the same file, it will typically overwrite the existing one (unless we are specifying sheets in an Excel file, for instance). This happens without warning, so it is important to be consistent in our naming to avoid unwanted overwrites.

In [None]:
data.head(5).to_csv('./subset.csv')
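
The following sketch shows one possible way to guard against silent overwrites and to keep the row index out of the exported file. It writes a made-up dataframe to a temporary directory (the `tempfile` usage and the variable names are assumptions for illustration, not part of the course material).

```python
import os
import tempfile

import pandas

# Hypothetical toy data
df = pandas.DataFrame({'a': [1.0, 2.0], 'b': [3.0, 4.0]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'subset.csv')

    # index=False leaves the row index out of the file
    df.to_csv(path, index=False)

    # to_csv overwrites silently, so we can check for the file first
    already_there = os.path.exists(path)

    # Reading the file back recovers the same two columns
    round_trip = pandas.read_csv(path)

print(already_there)
print(round_trip)
```

Without `index=False`, the index is written as an extra unnamed first column, which then reappears as `Unnamed: 0` when the file is read back.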

#### 5.3. Filtering and subsetting dataframes. 

It is possible to use an approach similar to the selection of columns to be able to subset our dataframe to given rows. 

For example, let's extract from the credit transaction data those that happened through online orders. 

In [None]:
online = data[data['online_order']==1]
online.head()

We can check that in our new subset the variable **online_order** is always 1. 

In [None]:
online['online_order'].describe()

In [None]:
online['online_order'].unique()

That is, to select rows based on a column condition, we first write the condition within `[]`. This inner expression gives us a series of booleans (i.e., a mask) with True values for the rows that satisfy the condition.

In [None]:
mask_data = data['online_order']==1
mask_data

In [None]:
type(mask_data)

Then, when we wrap that condition in the outer `[]`, we subset the original dataframe to the rows where the mask is True, that is, the rows that satisfy our condition.

If our condition involves two discrete values, we can combine both conditions within the outer brackets, using the `|` operator for *or* conditions and `&` for *and*, with each condition wrapped in parentheses. That is, we could write `data[(data['var']==x) | (data['var']==y)]`. This line of code is equivalent to using the `.isin()` method with argument `[x,y]`.

Sometimes, it is also useful to get rid of missing values on given columns using the `.notna()` method. 
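
A minimal sketch of these filters on a made-up dataframe (the column names `var` and `amount` are hypothetical) could look as follows.

```python
import numpy as np
import pandas

# Hypothetical toy data with one missing value
df = pandas.DataFrame({
    'var': [1, 2, 3, 2],
    'amount': [10.0, np.nan, 30.0, 40.0],
})

# Two equivalent ways of keeping rows where var is 1 or 2
with_or = df[(df['var'] == 1) | (df['var'] == 2)]
with_isin = df[df['var'].isin([1, 2])]

# Keep only the rows where amount is not missing
no_missing = df[df['amount'].notna()]

print(with_isin)
print(no_missing)
```

The parentheses around each condition are needed because `|` and `&` bind more tightly than `==` in Python.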

If we want to select rows and columns simultaneously, we can use the `.loc[rows, columns]` and `.iloc[rows, columns]` indexers. `.loc` selects by label and also accepts a boolean condition for the rows, while `.iloc` selects by integer position. For instance, if we want to extract the distance from home only for the fraudulent transactions we could do: 


In [None]:
data.loc[data['fraud']==1,'distance_from_home'] # first element is the condition, second one is the column we want. 

`.loc` is also useful to replace the value of a column in the rows that satisfy a certain condition. For instance, let's create a new variable that takes value 1 only for those transactions happening more than 5km away from home. 

In [None]:
data['more_5km'] = 0
data.loc[data['distance_from_home'] > 5, 'more_5km'] = 1
data[data['more_5km']==1]['distance_from_home'].describe()

To select rows and columns by position, we typically rely on `iloc` instead. For instance, in the snippet of code below, I am extracting the rows at positions 10 through 20 and the columns at positions 2 and 3 (that is, the third and fourth columns, since positions start at 0):

In [None]:
data.iloc[10:21,2:4]
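
To make the difference between the two indexers concrete, here is a small sketch on a made-up dataframe whose row labels do not start at 0 (all names are hypothetical). Label slices in `.loc` include both endpoints, whereas position slices in `.iloc` exclude the stop position.

```python
import pandas

# Hypothetical toy data with non-default row labels
df = pandas.DataFrame(
    {'a': [1, 2, 3, 4], 'b': [5, 6, 7, 8], 'c': [9, 10, 11, 12]},
    index=[10, 11, 12, 13],
)

# .loc uses labels; the slice includes both endpoints
by_label = df.loc[10:12, ['a', 'b']]

# .iloc uses positions; the slice excludes the stop position
by_position = df.iloc[0:3, 0:2]

# .loc also accepts a boolean mask as the row selector
masked = df.loc[df['a'] > 2, 'c']

print(by_label)
print(by_position)
print(masked)
```

Here `by_label` and `by_position` select the same three rows and two columns, even though the slices are written differently.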

In whatever selection you are doing with a condition, keep in mind that the reasoning is always the same: you first generate a **mask**, and then you keep the rows for which the mask is True.

#### 5.4. Creating new columns

Pandas applies the traditional mathematical operators in an element-wise fashion, without the need to write for loops. For instance, let's convert the distance from kilometers to miles (i.e., we divide by 1.609).

In [None]:
data['distance_from_home_miles'] = data['distance_from_home']/1.609
data

We can also operate on more than one column. For instance, let's sum the indicators for repeat retailer and used chip, so that we get a new column that directly tells us if a row satisfies both conditions (the sum equals 2 only when both indicators are 1).

In [None]:
data['sum_2'] = data['repeat_retailer'] + data['used_chip']
data

Because *sum_2* is not a very intuitive name, let's take advantage of that to show we can rename columns: 

In [None]:
data = data.rename(columns={'sum_2':'both_conditions'})
data
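
As an alternative sketch, a 0/1 flag for "both conditions hold" can be built from a boolean mask instead of a sum. The example uses made-up indicator values, and the column names `n_conditions` and `both_flag` are hypothetical.

```python
import pandas

# Hypothetical 0/1 indicators, mimicking the two dummy columns
df = pandas.DataFrame({'repeat_retailer': [1, 1, 0, 0],
                       'used_chip': [1, 0, 1, 0]})

# The sum counts how many conditions hold (0, 1 or 2)
df['n_conditions'] = df['repeat_retailer'] + df['used_chip']

# A 0/1 flag that is 1 only when both indicators are 1
both = (df['repeat_retailer'] == 1) & (df['used_chip'] == 1)
df['both_flag'] = both.astype(int)

print(df)
```

Which version is preferable depends on the use: the sum keeps the count of conditions, while the flag can be averaged directly to get the share of rows meeting both.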

#### 5.5. Merging tables.

Sometimes, we find that in the process of cleaning data, we need to put together different dataframes, either vertically or horizontally. This can be achieved concatenating along indexes or, when the dataframes have a common identifier, using `.merge()`

Let's start by showing that we can concatenate along an axis. The default axis is 0, which means that it concatenates rows (i.e., appends dataframes vertically).

In [None]:
data_1 = data.iloc[0:100]
data_2 = data.iloc[100:200]

In [None]:
concat_rows = pandas.concat([data_1,data_2], axis = 0)
concat_rows 

If instead we concatenate horizontally (`axis=1`), pandas aligns the rows on the index. In the example below, since the two sets of columns come from different sets of rows, it fills with NaN the rows whose index is missing in the other dataframe. To align on a column rather than on the index, we can first set that column as the index with `.set_index()` or use `.merge()`; the `keys` option of `concat` only labels the pieces being concatenated.

In [None]:
data_c1 =data_1[['distance_from_home','repeat_retailer' ]]
data_c2 =data_2[['distance_from_last_transaction','used_chip' ]]
data_ch = pandas.concat([data_c1,data_c2],axis = 1) 
data_ch 


Using `concat` for horizontal concatenation can be very useful when we have more than two dataframes to put together and want to avoid repeated calls to `.merge`, which joins only two dataframes at a time.

In [None]:
data_c1 =data_1[['distance_from_home','repeat_retailer' ]]
data_c2 =data_1[['distance_from_last_transaction','used_chip' ]]
data_c3 =data_1[['used_pin_number','online_order']]
data_ch = pandas.concat([data_c1,data_c2,data_c3],axis = 1) 
data_ch 


If instead we wanted to use the `.merge()` method, we would have to call it twice:

In [None]:
data_m1 = data_c1.reset_index().merge(data_c2.reset_index(),on='index',how='inner',validate='1:1')
data_m1

In [None]:
data_m1 = data_m1.merge(data_c3.reset_index(),on='index',how='inner',validate='1:1')
data_m1
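
When the dataframes share an identifier column rather than an index, `.merge()` can join on that column directly. The sketch below uses two made-up tables keyed by a hypothetical `Code` column and contrasts an inner merge with an outer one.

```python
import pandas

# Hypothetical tables sharing a 'Code' identifier
left = pandas.DataFrame({'Code': ['ESP', 'FRA', 'ITA'], 'gdp': [1, 2, 3]})
right = pandas.DataFrame({'Code': ['ESP', 'FRA', 'PRT'], 'pop': [47, 67, 10]})

# Inner merge keeps only the codes present in both tables
inner = left.merge(right, on='Code', how='inner', validate='1:1')

# Outer merge keeps every code; indicator=True adds a '_merge'
# column recording which table each row came from
outer = left.merge(right, on='Code', how='outer', indicator=True)

print(inner)
print(outer)
```

The `validate='1:1'` argument makes pandas raise an error if the key is duplicated in either table, which is a cheap safeguard before a merge.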

#### 5.6 Summary statistics and groupby operations. 

Pandas also offers various statistical measures that can be applied to numerical data columns. By default, these operations exclude missing data and are computed down the rows of each column.

In [None]:
print('Mean distance from home:', data['distance_from_home'].mean())
print('Minimum distance from home:',data['distance_from_home'].min())
print('Median distance from home:', data['distance_from_home'].median())
print('Maximum distance from home:', data['distance_from_home'].max())

If we want more flexibility we can also specify a series of statistics and the columns we want them to be computed on using `.agg`

In [None]:
data.agg(
    {
           "distance_from_home": ["min", "max", "median", "skew"],
            "distance_from_last_transaction": [ "max", "median", "mean"],  
       }
)

> Tip: computing the mean of a dummy variable gives the share of observations within that category (multiply by 100 for a percentage).

In [None]:
data['fraud'].mean()  ## 8% are frauds. 

Besides these aggregating operations over the whole sample, we can also compute them within categories, that is, by groups.

In [None]:
data.groupby("fraud")["distance_from_home"].mean()

Because the default return is a `pandas.Series`, it is useful to convert it to a `pandas.DataFrame` which can be done as follows:

In [None]:
pandas.DataFrame(data.groupby("fraud")["distance_from_home"].mean())

Notice that the new index is automatically set to our groupby variable. We can undo this change to recover the groupby variable as a regular column:

In [None]:
pandas.DataFrame(data.groupby("fraud")["distance_from_home"].mean()).reset_index()
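
A compact variant of the steps above, sketched on made-up data (the column `distance` and the values are hypothetical), uses `as_index=False` to keep the group variable as a column and `.agg` to compute several statistics per group.

```python
import pandas

# Hypothetical toy data with a 0/1 group indicator
df = pandas.DataFrame({'fraud': [0, 0, 1, 1],
                       'distance': [1.0, 3.0, 10.0, 20.0]})

# as_index=False keeps the group variable as a regular column
means = df.groupby('fraud', as_index=False)['distance'].mean()

# Several statistics per group at once
stats = df.groupby('fraud')['distance'].agg(['mean', 'max'])

# Share of observations with fraud == 1
share = df['fraud'].mean()

print(means)
print(stats)
print(share)
```

With `as_index=False` the result is already a `DataFrame`, so neither the conversion nor the `reset_index()` call is needed.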

### 6. Visualization: matplotlib and seaborn. 

The library that will allow us to do data visualization, whether static, animated or interactive, in Python is Matplotlib. 

Matplotlib plots our data on Figures, each one containing axes where data points are defined as x-y coordinates. This means that matplotlib will allow us to manipulate the displays by changing the elements of these two classes: 

**Figure**: 

The entire figure that contains the axes, which are the actual plots. 

**Axes**:

Contains a region for plotting data and includes the x and y axes (and z if it is 3D), where we actually plot the data.

<img src="https://matplotlib.org/stable/_images/anatomy.png" alt="drawing" width="400"/>


In [None]:
import matplotlib.pyplot as plt
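
To make the Figure and Axes distinction concrete, the sketch below builds one Figure that holds two Axes, using made-up points. The `matplotlib.use('Agg')` line is an assumption so that the code also runs without a display; inside a notebook it can be dropped.

```python
import matplotlib
matplotlib.use('Agg')  # assumption: non-interactive backend, drop it in a notebook
import matplotlib.pyplot as plt

# One Figure holding a 1x2 grid of Axes
fig, axs = plt.subplots(nrows=1, ncols=2, figsize=(8, 3))

# Each Axes is its own plotting region with its own title
axs[0].plot([1, 2, 3], [1, 4, 9], 'o')
axs[0].set_title('Left Axes')
axs[1].plot([1, 2, 3], [3, 2, 1])
axs[1].set_title('Right Axes')

# The Figure-level title sits above both Axes
fig.suptitle('One Figure, two Axes')
```

Methods called on `fig` affect the whole canvas, while methods called on `axs[0]` or `axs[1]` affect only that plotting region.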


For the following examples, we will use data other than the card transactions. However, just as in most typical real life situations, we will have to do some pre-processing before we can start plotting our data. In this case, we are going to be using data from the [World Health Organization](https://www.who.int/data/gho/data/indicators/indicator-details/GHO/ambient-and-household-air-pollution-attributable-death-rate-(per-100-000-population-age-standardized)) on mortality induced by pollution and country GDP per capita data from the [World Bank](https://ourworldindata.org/grapher/gdp-per-capita-worldbank?tab=table&time=2019..latest).

In [None]:
death_rate = pandas.read_csv('./data.csv')
gdp = pandas.read_csv('./gdp-per-capita-worldbank.csv')

In [None]:
death_rate[(death_rate['Location']=='Algeria') & (death_rate['Dim1']=='Both sexes') & (death_rate['Dim2']=='Total') & (death_rate['IndicatorCode']=='SDGAIRBOD')]

We keep the variables we are interested in. 

In [None]:
keep_vars = ['Indicator','Location','ParentLocationCode','ParentLocation','SpatialDimValueCode','FactValueNumeric']  # each column listed once to avoid duplicated columns
death_rate = death_rate[(death_rate['Dim1']=='Both sexes') & (death_rate['Dim2']=='Total') & (death_rate['IndicatorCode']=='SDGAIRBOD')][keep_vars]

In [None]:
death_rate

In [None]:
gdp.head()

In [None]:
gdp = gdp[(gdp['Year']==2019) & (gdp['Code'].notna()) & (gdp['Code']!='World')]

In [None]:
gdp

Now we are ready to merge both. 

In [None]:
gdp_pollution = gdp.merge(death_rate, left_on='Code',right_on='SpatialDimValueCode',how='inner',validate='1:1')

In [None]:
gdp_pollution

For a quick visualization, we can use the Pandas `.plot()` method, which draws on a Matplotlib figure. 
If we do not specify an X variable, we get a line graph where the X axis is the index (i.e., the row indicator) and the Y axis is the value we are plotting. 

In [None]:
gdp_pollution['FactValueNumeric'].plot()
plt.show()

But this is not very informative. Let's exploit the functionalities of the `.plot()` method to get a better grasp of our data. 

In [None]:
gdp_pollution.plot.scatter(x='GDP per capita, PPP (constant 2017 international $)',y='FactValueNumeric')

This scatterplot is starting to say something: it seems there is a negative relationship between GDP per capita and deaths due to pollution. We could have produced the same graph using Matplotlib directly, as follows: 

In [None]:
plt.plot(gdp_pollution['GDP per capita, PPP (constant 2017 international $)'],gdp_pollution['FactValueNumeric'],'o') #'o' = scatterplot
plt.xlabel('GDP Per Capita')  
plt.ylabel('Air pollution attributable death rate (per 100 000 population)') 

While the Pandas functionality allows us to do direct visualizations, further customizations will be required most of the time to achieve the desired output. This is when Matplotlib comes in handy. 

In [None]:
f, ax = plt.subplots(figsize = (5,5)) # in inches. We can define a converting factor
gdp_pollution.plot(ax=ax, x ='GDP per capita, PPP (constant 2017 international $)', y ='FactValueNumeric', kind='scatter')
ax.set_xlabel('GDP Per Capita')
ax.set_ylabel('Air pollution attributable death rate (per 100 000 population)')
ax.set_title('Title of this subplot')
x = gdp_pollution['GDP per capita, PPP (constant 2017 international $)'].values
y = gdp_pollution['FactValueNumeric'].values
f.suptitle('My first graph')
f.savefig('./fig1.png')



It is also possible to add the labels that indicate the country code using the annotate method. 

In [None]:
f, ax = plt.subplots(figsize = (5,5)) # in inches. We can define a converting factor
gdp_pollution.plot(ax=ax, x ='GDP per capita, PPP (constant 2017 international $)', y ='FactValueNumeric', kind='scatter')
ax.set_xlabel('GDP Per Capita')
ax.set_ylabel('Air pollution attributable death rate (per 100 000 population)')
ax.set_title('Title of this subplot')
x = gdp_pollution['GDP per capita, PPP (constant 2017 international $)'].values
y = gdp_pollution['FactValueNumeric'].values
for i in range(len(gdp_pollution)): 
    # .iloc picks the i-th row by position, whatever the index labels are
    ax.annotate(gdp_pollution['Code'].iloc[i], (x[i], y[i] + 0.5), fontsize=7)
f.suptitle('My first graph')
f.savefig('./fig1.png')

We can also plot multiple subplots within the same figure. 

In [None]:

f, axs = plt.subplots(nrows=1, ncols=2, figsize = (12,5)) # Now axs contains two elements
gdp_pollution.plot(ax=axs[0], x ='GDP per capita, PPP (constant 2017 international $)', y ='FactValueNumeric', kind='scatter')
axs[0].set_xlabel('GDP Per Capita')
axs[0].set_ylabel('Air pollution attributable death rate (per 100 000 population)')
axs[0].set_title('Title of subplot in axs 0')

gdp_pollution.plot(ax=axs[1], x ='GDP per capita, PPP (constant 2017 international $)', y ='FactValueNumeric', kind='scatter')
axs[1].set_xlabel('GDP Per Capita')
axs[1].set_ylabel('Air pollution attributable death rate (per 100 000 population)')
axs[1].set_title('Title of subplot in axs 1')

f.suptitle('My first figure with two subplots')



To perform more advanced statistical graphs, we can rely on Seaborn. Seaborn is a library that facilitates the creation of statistical graphics by leveraging matplotlib and integrating with pandas data structures.

In [None]:
import seaborn as sns

Functions in `seaborn` are classified as: 
- Figure-level: they internally create their own matplotlib figure. When we call this type of function, it initializes its own figure, so we cannot draw it into an existing axes. To customize its axes, we need to access the Matplotlib axes that are generated within the figure and then add or modify elements. 

- Axes-level: they return a `matplotlib.axes.Axes` object, which means we can use them within the Matplotlib figure setup. 

In [None]:
sns.relplot(data=gdp_pollution, x="GDP per capita, PPP (constant 2017 international $)", y="FactValueNumeric")
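
The sketch below contrasts the two kinds of functions on a made-up dataframe (the columns `x` and `y` are hypothetical). `sns.scatterplot` is used as an example of an axes-level function and `sns.relplot` as a figure-level one; the `Agg` backend line is an assumption so the code runs without a display.

```python
import matplotlib
matplotlib.use('Agg')  # assumption: non-interactive backend, drop it in a notebook
import matplotlib.pyplot as plt
import pandas
import seaborn as sns

# Hypothetical toy data
df = pandas.DataFrame({'x': [1, 2, 3, 4], 'y': [4, 3, 2, 1]})

# Axes-level: draws into an Axes we created ourselves and returns it
fig, ax = plt.subplots()
returned = sns.scatterplot(data=df, x='x', y='y', ax=ax)

# Figure-level: creates its own figure and returns a FacetGrid
grid = sns.relplot(data=df, x='x', y='y')

# Customization goes through the Axes stored inside the grid
grid.ax.set_xlabel('Custom x label')
```

The attribute `grid.ax` is available when the grid has a single panel; with `col` or `row` facets, the individual Axes are reached through `grid.axes` instead.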

In Seaborn we can add an additional dimension in the scatterplot by using different colors for observations in different categories specifying the "hue" parameter. 

In [None]:
sns.relplot(data=gdp_pollution, x="GDP per capita, PPP (constant 2017 international $)", y="FactValueNumeric", hue = 'ParentLocation')

Furthermore, it is also straightforward to use different markers for each category specifying it in the "style" option.  While here we are using the same variable for the differentiation, it is also possible to specify different variables in hue and style. 

In [None]:
sns.relplot(data=gdp_pollution, x="GDP per capita, PPP (constant 2017 international $)", y="FactValueNumeric", hue = 'ParentLocation', style='ParentLocation')

To explore more options, let's merge our data with population at the country level. 

In [None]:
population = pandas.read_csv('./population-unwpp.csv')

In [None]:
population = population[(population['Year']==2019) & (population['Code'].notna())]

In [None]:
gdp_pollution = gdp_pollution.merge(population, how='inner', on='Code', validate='1:1')

Now, we can make each point have a different size depending on the population of the country

In [None]:
sns.relplot(data=gdp_pollution, x="GDP per capita, PPP (constant 2017 international $)", y="FactValueNumeric", hue = 'ParentLocation', 
            size='Population (historical estimates)', sizes = (15,250))

In [None]:
population.sort_values('Population (historical estimates)')

Seaborn also has functionality to draw scatterplots with regression lines (`regplot`). The function `lmplot` also allows us to draw the regression lines conditioning on other variables (i.e., by category).

In [None]:
sns.regplot(data=gdp_pollution, x="GDP per capita, PPP (constant 2017 international $)", y="FactValueNumeric")

In [None]:
sns.lmplot(data=gdp_pollution, x="GDP per capita, PPP (constant 2017 international $)", y="FactValueNumeric", hue="ParentLocation" )

Specifying "col" or "row" will draw separate graphs for each category. 

In [None]:
sns.lmplot(data=gdp_pollution, x="GDP per capita, PPP (constant 2017 international $)", y="FactValueNumeric", hue="ParentLocation", col ="ParentLocation" )

`lmplot` also performs polynomial regressions. 

In [None]:
sns.lmplot(data=gdp_pollution, x="GDP per capita, PPP (constant 2017 international $)", y="FactValueNumeric", order=2)

With a similar syntax, we can draw histograms and density plots, which can also be helpful to understand the distribution of continuous variables. 

For instance, in the code below we are going to draw the histogram for GDP per capita for each parent location separately. 

In [None]:
a = sns.displot(data = gdp_pollution, x = "GDP per capita, PPP (constant 2017 international $)",kind = 'hist',col='ParentLocation')
a.set_axis_labels("GDP", "Count")
a.set_titles("{col_name}")
a.savefig('./export.png')

Similarly, we could have used: 

In [None]:
f, ax = plt.subplots()
sns.histplot(data = gdp_pollution, x = "GDP per capita, PPP (constant 2017 international $)", color="skyblue", label="GDP per capita", kde=True, ax = ax)

plt.legend()
plt.xlabel('GDP per capita')

Heatmaps are also built into Seaborn and are a very popular tool to display the correlation between the variables of a dataframe. It's like visualizing a correlation matrix with colors. 

In [None]:
sns.heatmap(gdp_pollution[[ "GDP per capita, PPP (constant 2017 international $)", "FactValueNumeric", 'Population (historical estimates)']].corr())
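
The matrix that the heatmap colors is simply the output of `.corr()`. As a small sketch on made-up columns (`x`, `y` and `z` are hypothetical), we can inspect it directly before plotting.

```python
import pandas

# Hypothetical toy data: y moves with x, z moves against x
df = pandas.DataFrame({'x': [1.0, 2.0, 3.0, 4.0],
                       'y': [2.0, 4.0, 6.0, 8.0],
                       'z': [4.0, 3.0, 2.0, 1.0]})

# Pairwise Pearson correlations between all numeric columns
corr = df.corr()

print(corr)
```

The result is a square `DataFrame` with ones on the diagonal, and each off-diagonal cell becomes one colored tile in the heatmap.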

# Practice:
1. Write a loop that describes all columns of the credit transaction data. 
2. Write a loop that plots in a different subplot the scatterplot of GDP vs mortality for each Parent Location. It should have 3 columns and 2 rows. 