---

author: Юрій Клебан

---

# Exploring data with `dplyr`

## Basic functions and dataset exploration

The most popular `dplyr` functions are listed in the table below.

|dplyr function|Description|Equivalent SQL|
|---|---|---|
|select()|Select columns (variables)|SELECT|
|filter()|Filter (subset) rows|WHERE|
|group_by()|Group the data|GROUP BY|
|summarise()|Summarise (aggregate) the data|-|
|arrange()|Sort the data|ORDER BY|
|*_join() (for example, left_join(), inner_join())|Join data frames (tables)|JOIN|
|mutate()|Create new variables|COLUMN ALIAS|
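
To make the mapping to SQL more concrete, here is a minimal sketch that chains several of these verbs on the `gapminder` dataset introduced below. It assumes R 4.1.0 or later (for the native pipe `|>`) and that the `dplyr` and `gapminder` packages are installed; the names `life_2007`, `mean_lifeExp` and `n_countries` are chosen only for this illustration.

```r
library(dplyr)
library(gapminder)

# Mean life expectancy per continent in 2007, highest first.
# Roughly equivalent SQL:
#   SELECT continent, AVG(lifeExp), COUNT(*) FROM gapminder
#   WHERE year = 2007 GROUP BY continent ORDER BY AVG(lifeExp) DESC
life_2007 <- gapminder |>
    filter(year == 2007) |>                       # WHERE
    group_by(continent) |>                        # GROUP BY
    summarise(mean_lifeExp = mean(lifeExp),       # aggregate per group
              n_countries  = n()) |>
    arrange(desc(mean_lifeExp))                   # ORDER BY ... DESC

life_2007
```

The result has one row per continent, because `summarise()` collapses each group to a single row.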

For the next examples we are going to use the `gapminder` dataset. [Go to gapminder dataset description](00_Datasets.ipynb#gapminder)

The `gapminder` data frame includes six variables:


|variable|meaning|
|---|---|
|country|-|
|continent|-|
|year|-|
|lifeExp|life expectancy at birth|
|pop|total population|
|gdpPercap|per-capita GDP|

Per-capita GDP (gross domestic product) is given in units of international dollars, "a hypothetical unit of currency that has the same purchasing power parity that the U.S. dollar had in the United States at a given point in time" (2005, in this case).

The `gapminder` data frame is a special kind of data frame: a `tibble`. 

In [None]:
library(dplyr) # for demos
#install.packages("gapminder")
library(gapminder)  # load package and dataset
class(gapminder)

Let's preview it with functions `str()`, `glimpse()`, `head()`, `tail()`, `summary()`.

In [None]:
str(gapminder)

tibble [1,704 x 6] (S3: tbl_df/tbl/data.frame)
 $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ year     : int [1:1704] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
 $ lifeExp  : num [1:1704] 28.8 30.3 32 34 36.1 ...
 $ pop      : int [1:1704] 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
 $ gdpPercap: num [1:1704] 779 821 853 836 740 ...


In [None]:
glimpse(gapminder)

Rows: 1,704
Columns: 6
$ country   <fct> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", ~
$ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, ~
$ year      <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, ~
$ lifeExp   <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 40.8~
$ pop       <int> 8425333, 9240934, 10267083, 11537966, 13079460, 14880372, 12~
$ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 786.1134, ~


In [None]:
head(gapminder) #shows first n-rows, 6 by default

country,continent,year,lifeExp,pop,gdpPercap
<fct>,<fct>,<int>,<dbl>,<int>,<dbl>
Afghanistan,Asia,1952,28.801,8425333,779.4453
Afghanistan,Asia,1957,30.332,9240934,820.853
Afghanistan,Asia,1962,31.997,10267083,853.1007
Afghanistan,Asia,1967,34.02,11537966,836.1971
Afghanistan,Asia,1972,36.088,13079460,739.9811
Afghanistan,Asia,1977,38.438,14880372,786.1134


In [None]:
tail(gapminder) #shows last n-rows, 6 by default

country,continent,year,lifeExp,pop,gdpPercap
<fct>,<fct>,<int>,<dbl>,<int>,<dbl>
Zimbabwe,Africa,1982,60.363,7636524,788.855
Zimbabwe,Africa,1987,62.351,9216418,706.1573
Zimbabwe,Africa,1992,60.377,10704340,693.4208
Zimbabwe,Africa,1997,46.809,11404948,792.45
Zimbabwe,Africa,2002,39.989,11926563,672.0386
Zimbabwe,Africa,2007,43.487,12311143,469.7093


In [None]:
summary(gapminder)

        country        continent        year         lifeExp     
 Afghanistan:  12   Africa  :624   Min.   :1952   Min.   :23.60  
 Albania    :  12   Americas:300   1st Qu.:1966   1st Qu.:48.20  
 Algeria    :  12   Asia    :396   Median :1980   Median :60.71  
 Angola     :  12   Europe  :360   Mean   :1980   Mean   :59.47  
 Argentina  :  12   Oceania : 24   3rd Qu.:1993   3rd Qu.:70.85  
 Australia  :  12                  Max.   :2007   Max.   :82.60  
 (Other)    :1632                                                
      pop              gdpPercap       
 Min.   :6.001e+04   Min.   :   241.2  
 1st Qu.:2.794e+06   1st Qu.:  1202.1  
 Median :7.024e+06   Median :  3531.8  
 Mean   :2.960e+07   Mean   :  7215.3  
 3rd Qu.:1.959e+07   3rd Qu.:  9325.5  
 Max.   :1.319e+09   Max.   :113523.1  
                                       

## `filter()` function

In [None]:
austria <- filter(gapminder, country == "Austria")
austria

country,continent,year,lifeExp,pop,gdpPercap
<fct>,<fct>,<int>,<dbl>,<int>,<dbl>
Austria,Europe,1952,66.8,6927772,6137.076
Austria,Europe,1957,67.48,6965860,8842.598
Austria,Europe,1962,69.54,7129864,10750.721
Austria,Europe,1967,70.14,7376998,12834.602
Austria,Europe,1972,70.63,7544201,16661.626
Austria,Europe,1977,72.17,7568430,19749.422
Austria,Europe,1982,73.18,7574613,21597.084
Austria,Europe,1987,74.94,7578903,23687.826
Austria,Europe,1992,76.04,7914969,27042.019
Austria,Europe,1997,77.51,8069876,29095.921


`filter()` takes logical expressions and returns the rows for which all are TRUE.

In [None]:
# task: select rows with lifeExp less than 31
filter(gapminder, lifeExp < 31)

country,continent,year,lifeExp,pop,gdpPercap
<fct>,<fct>,<int>,<dbl>,<int>,<dbl>
Afghanistan,Asia,1952,28.801,8425333,779.4453
Afghanistan,Asia,1957,30.332,9240934,820.853
Angola,Africa,1952,30.015,4232095,3520.6103
Gambia,Africa,1952,30.0,284320,485.2307
Rwanda,Africa,1992,23.599,7290203,737.0686
Sierra Leone,Africa,1952,30.331,2143249,879.7877


In [None]:
# task: select Austria only and year after 1980
filter(gapminder, country == "Austria", year > 1980)

country,continent,year,lifeExp,pop,gdpPercap
<fct>,<fct>,<int>,<dbl>,<int>,<dbl>
Austria,Europe,1982,73.18,7574613,21597.08
Austria,Europe,1987,74.94,7578903,23687.83
Austria,Europe,1992,76.04,7914969,27042.02
Austria,Europe,1997,77.51,8069876,29095.92
Austria,Europe,2002,78.98,8148312,32417.61
Austria,Europe,2007,79.829,8199783,36126.49


In [None]:
# task: select Austria and Belgium
filter(gapminder, country %in% c("Austria", "Belgium"))

country,continent,year,lifeExp,pop,gdpPercap
<fct>,<fct>,<int>,<dbl>,<int>,<dbl>
Austria,Europe,1952,66.8,6927772,6137.076
Austria,Europe,1957,67.48,6965860,8842.598
Austria,Europe,1962,69.54,7129864,10750.721
Austria,Europe,1967,70.14,7376998,12834.602
Austria,Europe,1972,70.63,7544201,16661.626
Austria,Europe,1977,72.17,7568430,19749.422
Austria,Europe,1982,73.18,7574613,21597.084
Austria,Europe,1987,74.94,7578903,23687.826
Austria,Europe,1992,76.04,7914969,27042.019
Austria,Europe,1997,77.51,8069876,29095.921
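
The examples above rely on how `filter()` combines conditions. As a small sketch (assuming R 4.1.0 or later and the `dplyr` and `gapminder` packages; the variable names are illustrative only), the following compares comma-separated conditions with an explicit `&`, and `|` with `%in%`:

```r
library(dplyr)
library(gapminder)

# Comma-separated conditions are combined with AND ...
with_comma <- gapminder |> filter(country == "Austria", year > 1980)
# ... so this call written with `&` selects the same rows.
with_and   <- gapminder |> filter(country == "Austria" & year > 1980)
all.equal(with_comma, with_and)

# Use `|` for OR; for several values of one column, `%in%` is shorter.
with_or <- gapminder |> filter(country == "Austria" | country == "Belgium")
with_in <- gapminder |> filter(country %in% c("Austria", "Belgium"))
all.equal(with_or, with_in)

nrow(with_in)   # 12 rows per country, so 24 rows in total
```

Note that `country == c("Austria", "Belgium")` is not a substitute for `%in%`: `==` recycles the vector element by element and silently drops rows.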


Next, let's rewrite the initial `filter()` call with the pipe operator and store the result in a variable (data frame).

## Pipe (`%>%`/`|>`) operator

`%>%` is the pipe operator. It takes the value on its left-hand side and passes it into the function call on its right-hand side; literally, it drops it in as the first argument.
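
As a minimal sketch of that rule, the three calls below should count the same rows. It assumes R 4.1.0 or later (needed only for `|>`) and relies on `dplyr` re-exporting `%>%` from `magrittr`; the variable names are illustrative.

```r
library(dplyr)      # also makes %>% available
library(gapminder)

# Nested call: read from the inside out.
n_nested <- nrow(filter(gapminder, continent == "Oceania"))

# magrittr pipe: the left-hand side becomes the first argument.
n_magrittr <- gapminder %>% filter(continent == "Oceania") %>% nrow()

# Native pipe: same idea, built into R since 4.1.0.
n_native <- gapminder |> filter(continent == "Oceania") |> nrow()

c(n_nested, n_magrittr, n_native)
```

The piped versions read left to right in the order the steps are executed, which is the main reason to prefer them for longer chains.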

> **R 4.1.0 and later include the native pipe operator `|>`; using it is preferable.**

> The `%>%` operator is not built into R. It is provided by the `magrittr` package (and re-exported by `dplyr`), so in R versions before 4.1.0 you need to install and load that package to have a pipe:

In [None]:
#install.packages("magrittr") # for pipe %>% operator
library(magrittr)

`head()` function without the pipe, showing the top 4 rows:

In [None]:
head(gapminder, n = 4)

country,continent,year,lifeExp,pop,gdpPercap
<fct>,<fct>,<int>,<dbl>,<int>,<dbl>
Afghanistan,Asia,1952,28.801,8425333,779.4453
Afghanistan,Asia,1957,30.332,9240934,820.853
Afghanistan,Asia,1962,31.997,10267083,853.1007
Afghanistan,Asia,1967,34.02,11537966,836.1971


`head()` function with the pipe, showing the top 4 rows:

In [None]:
gapminder %>% head(4)

country,continent,year,lifeExp,pop,gdpPercap
<fct>,<fct>,<int>,<dbl>,<int>,<dbl>
Afghanistan,Asia,1952,28.801,8425333,779.4453
Afghanistan,Asia,1957,30.332,9240934,820.853
Afghanistan,Asia,1962,31.997,10267083,853.1007
Afghanistan,Asia,1967,34.02,11537966,836.1971


The output is the same. So, let's rewrite the filtering for `Austria` with the pipe:

In [None]:
austria <- gapminder |> filter(country == "Austria")
austria

country,continent,year,lifeExp,pop,gdpPercap
<fct>,<fct>,<int>,<dbl>,<int>,<dbl>
Austria,Europe,1952,66.8,6927772,6137.076
Austria,Europe,1957,67.48,6965860,8842.598
Austria,Europe,1962,69.54,7129864,10750.721
Austria,Europe,1967,70.14,7376998,12834.602
Austria,Europe,1972,70.63,7544201,16661.626
Austria,Europe,1977,72.17,7568430,19749.422
Austria,Europe,1982,73.18,7574613,21597.084
Austria,Europe,1987,74.94,7578903,23687.826
Austria,Europe,1992,76.04,7914969,27042.019
Austria,Europe,1997,77.51,8069876,29095.921


In [None]:
# add more conditions in filter
austria <- gapminder |> filter(country == "Austria", year > 2000)
austria

country,continent,year,lifeExp,pop,gdpPercap
<fct>,<fct>,<int>,<dbl>,<int>,<dbl>
Austria,Europe,2002,78.98,8148312,32417.61
Austria,Europe,2007,79.829,8199783,36126.49


---

## `select()` function

Use `select()` to subset the data by variables/columns, referring to them by name or by index (position). `select()` also defines the order of the columns in the result.

In [None]:
gapminder |>
    select(year, country, pop) |>
    slice(1:10)

year,country,pop
<int>,<fct>,<int>
1952,Afghanistan,8425333
1957,Afghanistan,9240934
1962,Afghanistan,10267083
1967,Afghanistan,11537966
1972,Afghanistan,13079460
1977,Afghanistan,14880372
1982,Afghanistan,12881816
1987,Afghanistan,13867957
1992,Afghanistan,16317921
1997,Afghanistan,22227415
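
The example above selects columns by name. The sketch below shows the other options mentioned: selection by position, by a range of names, and reordering. It assumes R 4.1.0 or later and the `dplyr` and `gapminder` packages; the variable names are illustrative.

```r
library(dplyr)
library(gapminder)

# By position: columns 1, 3 and 4 are country, year and lifeExp.
by_index <- gapminder |> select(1, 3, 4)
names(by_index)

# By a range of neighbouring columns, given by name.
by_range <- gapminder |> select(year:pop)
names(by_range)

# Move `year` to the front and keep all other columns in their original order.
reordered <- gapminder |> select(year, everything())
names(reordered)
```

Selecting by name is usually more robust than selecting by position, since positions change when columns are added or removed upstream.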


Let's combine a few functions with the pipe and compare the `dplyr` syntax with the equivalent base R call:

In [None]:
# compare dplyr syntax with base R call
gapminder[gapminder$country == "Austria", c("year", "pop", "lifeExp")]

gapminder |> 
    filter(country == "Austria") |>
    select(year, pop, lifeExp)

year,pop,lifeExp
<int>,<int>,<dbl>
1952,6927772,66.8
1957,6965860,67.48
1962,7129864,69.54
1967,7376998,70.14
1972,7544201,70.63
1977,7568430,72.17
1982,7574613,73.18
1987,7578903,74.94
1992,7914969,76.04
1997,8069876,77.51


year,pop,lifeExp
<int>,<int>,<dbl>
1952,6927772,66.8
1957,6965860,67.48
1962,7129864,69.54
1967,7376998,70.14
1972,7544201,70.63
1977,7568430,72.17
1982,7574613,73.18
1987,7578903,74.94
1992,7914969,76.04
1997,8069876,77.51


You can remove columns with the minus operator (`-`) and combine this with several filter conditions:

In [None]:
austria <- gapminder |>
    filter(country == "Austria", year > 2000) |>
    select(-continent, -gdpPercap) |>
    head()
austria

country,year,lifeExp,pop
<fct>,<int>,<dbl>,<int>
Austria,2002,78.98,8148312
Austria,2007,79.829,8199783


You can also choose columns by a condition on their contents, wrapping a predicate function in `where()` inside `select()`.

In [None]:
gapminder |>
    select(!where(is.numeric)) |>  # still 1704 rows: select() does not drop repeated rows
    slice(1:5)

country,continent
<fct>,<fct>
Afghanistan,Asia
Afghanistan,Asia
Afghanistan,Asia
Afghanistan,Asia
Afghanistan,Asia
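
A few more ways to select columns by condition, as a hedged sketch (assuming R 4.1.0 or later, `dplyr` 1.0.0 or later for `where()`, and the `gapminder` package; the variable names are illustrative):

```r
library(dplyr)
library(gapminder)

# Keep only the numeric columns (the opposite of the example above).
numeric_cols <- gapminder |> select(where(is.numeric)) |> names()
numeric_cols

# Keep only the factor columns.
factor_cols <- gapminder |> select(where(is.factor)) |> names()
factor_cols

# Select by name pattern with a tidyselect helper.
c_cols <- gapminder |> select(starts_with("c")) |> names()
c_cols
```

`where()` looks at the column contents, while helpers such as `starts_with()` look only at the column names.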


Let's output all unique values of `country` with the `distinct()` function:

In [None]:
gapminder |>
    select(country) |>
    distinct() # 142 rows now, one per country

country
<fct>
Afghanistan
Albania
Algeria
Angola
Argentina
Australia
Austria
Bahrain
Bangladesh
Belgium


---

## Selecting random $N$ rows

The `sample_n()` function selects a given number of random rows from a data frame.

In [31]:
gapminder |> sample_n(5)

country,continent,year,lifeExp,pop,gdpPercap
<fct>,<fct>,<int>,<dbl>,<int>,<dbl>
Norway,Europe,1967,74.08,3786019,16361.8765
Central African Republic,Africa,2002,43.308,4048013,738.6906
Uruguay,Americas,2007,76.384,3447496,10611.463
Togo,Africa,1997,58.39,4320890,982.2869
Paraguay,Americas,2002,70.755,5884491,3783.6742


If you want to make pseudo-random generation reproducible, use `set.seed()`. The seed is the starting point of the random number generator: the same seed gives the same output, and different seeds give different output.

In [32]:
set.seed(2023) # example, seed = 2023
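
A quick way to see the effect of the seed is to draw the same sample twice. This is a minimal sketch, assuming R 4.1.0 or later and the `dplyr` and `gapminder` packages; the seed values and variable names are arbitrary choices for illustration.

```r
library(dplyr)
library(gapminder)

set.seed(2023)
first_draw <- gapminder |> sample_n(5)

set.seed(2023)                 # reset to the same starting point
second_draw <- gapminder |> sample_n(5)

all.equal(first_draw, second_draw)   # same seed, same rows

set.seed(42)                   # a different, equally arbitrary seed
third_draw <- gapminder |> sample_n(5)
third_draw
```

The seed has to be set immediately before each sampling call you want to reproduce, because every random draw advances the generator state.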

The `sample_frac()` function selects a random fraction of the rows of a data frame.
Let's select $10\%$ of the data:

In [34]:
set.seed(2023) # with a fixed seed the output does not change between runs
gapminder %>% sample_frac(0.1)

country,continent,year,lifeExp,pop,gdpPercap
<fct>,<fct>,<int>,<dbl>,<int>,<dbl>
Switzerland,Europe,2007,81.701,7554661,37506.4191
Djibouti,Africa,2002,53.373,447416,1908.2609
Slovenia,Europe,1972,69.820,1694510,12383.4862
Sao Tome and Principe,Africa,1997,63.306,145608,1339.0760
Turkey,Europe,1987,63.108,52881328,5089.0437
Lebanon,Asia,1957,59.489,1647412,6089.7869
Eritrea,Africa,1972,44.142,2260187,514.3242
Philippines,Asia,1972,58.065,40850141,1989.3741
Tunisia,Africa,1972,55.602,5303507,2753.2860
Uganda,Africa,1952,39.978,5824797,734.7535
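
In `dplyr` 1.0.0 and later, `sample_n()` and `sample_frac()` are superseded by `slice_sample()`. They still work, but new code may prefer the newer function. A minimal sketch of the equivalent calls, assuming R 4.1.0 or later and `dplyr` 1.0.0 or later (the variable names are illustrative):

```r
library(dplyr)
library(gapminder)

set.seed(2023)

# Five random rows: the newer spelling of sample_n(5).
five_rows <- gapminder |> slice_sample(n = 5)

# A random 10% of the rows: the newer spelling of sample_frac(0.1).
ten_percent <- gapminder |> slice_sample(prop = 0.1)

nrow(five_rows)
nrow(ten_percent)
```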


---

## References

1. [dplyr: A Grammar of Data Manipulation](https://cran.r-project.org/web/packages/dplyr/index.html) on https://cran.r-project.org/.
2. [Data Transformation with dplyr: cheat sheet](https://github.com/rstudio/cheatsheets/blob/master/data-transformation.pdf).
3. [DPLYR TUTORIAL : DATA MANIPULATION (50 EXAMPLES)](https://www.listendata.com/2016/08/dplyr-tutorial.html) by Deepanshu Bhalla.
4. [Dplyr Intro](https://stat545.com/dplyr-intro.html) by Stat 545.
5. [R Dplyr Tutorial: Data Manipulation(Join) & Cleaning(Spread)](https://www.guru99.com/r-dplyr-tutorial.html). Introduction to Data Analysis.
6. [Loan Default Prediction. Beginners data set for financial analytics Kaggle](https://www.kaggle.com/kmldas/loan-default-prediction).