<img src="images/banner_introRProg.png" align="left" />

<table style="float:right;">
    <tr>
        <td>                      
            <div style="text-align: right"><a href="https://www.research.manchester.ac.uk/portal/syed.murtuzabaker.html" target="_blank">Syed Murtuza Baker</a></div>
            <div style="text-align: right">Research Fellow</div>
            <div style="text-align: right">University of Manchester</div>
         </td>
         <td>
             <img src="images/Syed_Baker.jpg" width="50%" />
         </td>
     </tr>
</table>

# Introduction to Tidyverse
****

#### About this Notebook
This notebook introduces the Tidyverse, a collection of R packages for data wrangling.

Level: <code>beginner</code> 

Duration: Approximately 2 hours to complete

<div class="alert alert-block alert-warning"><b>Learning Objectives:</b> 
<br/> At the end of this notebook you will be able to:
    
- Describe the features of the Tidyverse R package
    
- Explain the use of features such as Tibble and Pipe
    
- Explore dplyr verbs such as Mutate, Slice & Arrange


</div> 

<a id="top"></a>

<b>Table of contents</b><br>

1.0 [Introduction](#intro)

2.0 [About Tidyverse](#tidyverse)

3.0 [Your Turn](#yourturn)

*****

<a id="intro"></a>

## Introduction


This notebook introduces `tidyverse`, a collection of R packages popular for data wrangling.

For help in using this Jupyter notebook please refer to the [Jupyter Notebook User Guide](https://online.manchester.ac.uk/bbcswebdav/orgs/I3116-ADHOC-I3HS-HUB-1/Jupyter%20Notebooks/content/index.html#/)






*****
[back to the top](#top)

<a id="tidyverse"></a>

## About Tidyverse

tidyverse.org defines Tidyverse as

>The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.

In [2]:
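library(tidyverse)  # attaches the core packages, including dplyr, tibble and ggplot2
library(dplyr)      # already attached by tidyverse; listed because its verbs are the focus here
library(scater)     # single-cell toolkit; brings in the SingleCellExperiment class used below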

We will use single-cell RNA sequencing data from 6826 stem cells from chronic myelomonocytic leukaemia (CMML) patients and healthy controls, generated using the droplet-based, ultra-high-throughput 10x platform. We found substantial inter- and intra-patient heterogeneity, with CMML stem cells displaying distinctive transcriptional programs. Compared with normal controls, CMML stem cells exhibited transcriptomes characterized by increased expression of myeloid-lineage and cell cycle genes, and lower expression of genes selectively expressed by normal haematopoietic stem cells.

In [36]:
sce <- readRDS('/mnt/sce.rds')
sce

class: SingleCellExperiment 
dim: 12695 6826 
metadata(0):
assays(3): counts logcounts norm_exprs
rownames(12695): FO538757.2 AP006222.2 ... AC004556.1 AC240274.1
rowData names(12): id symbol ... total_counts log10_total_counts
colnames(6826): AAACCTGCACCGATAT-1 AAACGGGCACGACTCG-1 ...
  TTTGGTTTCATCTGCC-11 TTTGTCAGTAGGAGTC-11
colData names(59): barcode Sample ... sizeFactor cellType
reducedDimNames(1): tSNE
mainExpName: NULL
altExpNames(0):

*****
[back to the top](#top)

### Tibble

Tibbles are data frames, but `tibble()` does much less than `data.frame()`:

- it never changes the type of the inputs (e.g. it never converts strings to factors!)
- it never changes the names of variables
- it never creates row names

A tibble can also have column names that are not valid R variable names, aka non-syntactic names.

In [4]:
tb <- tibble(
  `:)` = "smile", 
  ` ` = "space",
  `2000` = "number"
)
tb

:),Unnamed: 1_level_0,2000
<chr>,<chr>,<chr>
smile,space,number
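
To see what these differences mean in practice, here is a minimal sketch contrasting `data.frame()` with `tibble()`. The object names and the `car` column label are illustrative choices, not part of the dataset used in this notebook.

```r
library(tibble)

# data.frame() repairs non-syntactic names by default (check.names = TRUE)
df <- data.frame(`2000` = "number")
names(df)

# tibble() keeps the name exactly as given
tb2 <- tibble(`2000` = "number")
names(tb2)

# Non-syntactic columns are accessed with backticks
tb2$`2000`

# as_tibble() drops row names unless asked to keep them as a column
# ("car" is an arbitrary column name chosen for this sketch)
cars_tb <- as_tibble(mtcars, rownames = "car")
names(cars_tb)[1]
```

Here `names(df)` comes back as `X2000`, whereas the tibble keeps `2000` unchanged.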


### Pipe %>%

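The pipe operator `%>%` passes the output of the expression on its left as the first argument of the function on its right, so a sequence of steps can be read from top to bottom.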
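
As a minimal sketch of the idea (the vector `x` is made up for illustration), a piped chain and the equivalent nested call give the same result:

```r
library(magrittr)  # provides %>%; dplyr also re-exports it

x <- c(4, 9, 16)

# Nested calls are read inside-out
nested <- round(mean(sqrt(x)), 1)

# The same steps with the pipe are read left to right
piped <- x %>% sqrt() %>% mean() %>% round(1)

identical(nested, piped)
```

In general, `x %>% f(y)` is evaluated as `f(x, y)`.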

In [8]:
as_tibble(colData(sce))[1:5,]  # as_tibble() replaces the deprecated tbl_df()

barcode,Sample,total_features,log10_total_features,pct_counts_top_50_features,pct_counts_top_100_features,pct_counts_top_200_features,pct_counts_top_500_features,total_features_endogenous,log10_total_features_endogenous,⋯,pct_counts_Mt,pct_counts_in_top_50_features_Mt,pct_counts_in_top_100_features_Mt,pct_counts_in_top_200_features_Mt,pct_counts_in_top_500_features_Mt,CellCycle,Cluster,Phase,sizeFactor,cellType
<fct>,<chr>,<int>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<int>,<dbl>,⋯,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>,<fct>,<fct>,<dbl>,<chr>
AAACCTGCACCGATAT-1,BC543,2152,3.333044,47.58468,64.94609,73.96214,83.26683,2142,3.331022,⋯,2.657271,100,100,100,100,G1,2,S,0.9487532,HSC
AAACGGGCACGACTCG-1,BC543,2078,3.317854,46.25919,63.97836,74.47798,83.9716,2067,3.315551,⋯,2.789754,100,100,100,100,G1,2,G1,0.9866038,HSC
AAAGCAATCCTAAGTG-1,BC543,2493,3.396896,49.27632,67.14031,75.58654,83.5305,2482,3.394977,⋯,3.218558,100,100,100,100,G1,2,G1,1.2639782,HSC
AAAGTAGGTGATGATA-1,BC543,2298,3.361539,43.84055,61.52083,72.48097,82.09871,2288,3.359646,⋯,3.814357,100,100,100,100,G1,2,G2M,1.0392552,HSC
AAAGTAGTCTCGCTTG-1,BC543,2343,3.369958,48.50873,65.67959,75.49354,84.05766,2333,3.368101,⋯,2.861809,100,100,100,100,G1,2,G1,0.9423322,HSC


In [6]:
as_tibble(colData(sce)) %>%
  group_by(Sample) %>%
  summarise(
    total.features = mean(total_features),
    total.counts = mean(total_counts)
  )

Sample,total.features,total.counts
<chr>,<dbl>,<dbl>
BC278,2971.597,18434.975
BC416,2070.396,9094.933
BC543,1947.178,10966.411
BC572,1471.253,5862.217
BC746,2565.624,13558.324
BC776,2330.206,10740.006
BC786,2507.971,12530.953
HV1,2263.642,10735.622
HV2,2408.25,12868.908
HV3,2283.051,11753.533


`summarise()` ungrouping output (override with `.groups` argument)

### dplyr - Functions as verbs.

The most useful verbs:

<code>select()</code>: select columns

<code>mutate()</code>: create new variables, change existing

<code>filter()</code>: subset your data by some criterion

<code>summarize()</code>: summarize your data in some way

<code>group_by()</code>: group your data by a variable

<code>slice()</code>: grab specific rows by position

Some others

<code>count()</code>: count your data

<code>arrange()</code>: arrange your data by a column or variable

<code>distinct()</code>: gather all distinct values of a variable

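<code>n_distinct()</code>: count how many distinct values you have (often used inside <code>summarize()</code>)

<code>n()</code>: count how many observations you have in a subgroup (only works inside dplyr verbs such as <code>summarize()</code> and <code>mutate()</code>)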

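<code>sample_n()</code>: grab a random sample of N rows from your data (superseded by <code>slice_sample()</code> since dplyr 1.0.0)

<code>ungroup()</code>: remove the grouping from grouped data

<code>top_n()</code>: get the top N entries from a data frame (superseded by <code>slice_max()</code> since dplyr 1.0.0)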
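
Several of these verbs are not demonstrated later in this notebook, so here is a brief sketch on a small made-up table. The table `toy` and its columns are hypothetical and are not taken from `sce`.

```r
library(dplyr)

# Hypothetical toy data: five cells from two samples
toy <- tibble(
  sample    = c("A", "A", "B", "B", "B"),
  cell_type = c("HSC", "GMP", "HSC", "HSC", "MEP"),
  counts    = c(120, 80, 200, 150, 90)
)

per_sample <- toy %>% count(sample)             # number of rows per sample
types      <- toy %>% distinct(cell_type)       # unique cell types
n_types    <- toy %>% summarise(n_types = n_distinct(cell_type))
first_two  <- toy %>% slice(1:2)                # rows 1 and 2 by position
top_two    <- toy %>% slice_max(counts, n = 2)  # two rows with the largest counts

per_sample
top_two
```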


To make it easier, we copy the cell metadata (<code>colData</code>) of our <code>SingleCellExperiment</code> object <code>sce</code> to <code>d</code>:

In [37]:
d <- as_tibble(colData(sce))  # as_tibble() replaces the deprecated tbl_df()

In [39]:
d %>% 
  select(Sample, Cluster, cellType) %>%
  head(10)

Sample,Cluster,cellType
<chr>,<fct>,<chr>
BC543,2,HSC
BC543,2,HSC
BC543,2,HSC
BC543,2,HSC
BC543,2,HSC
BC543,2,HSC
BC543,2,HSC
BC543,6,HSC
BC543,6,HSC
BC543,15,HSC


#### Filter : To select rows


In [19]:
d %>% 
  filter(cellType == "HSC") %>%
  head(10)

barcode,Sample,total_features,log10_total_features,pct_counts_top_50_features,pct_counts_top_100_features,pct_counts_top_200_features,pct_counts_top_500_features,total_features_endogenous,log10_total_features_endogenous,⋯,pct_counts_Mt,pct_counts_in_top_50_features_Mt,pct_counts_in_top_100_features_Mt,pct_counts_in_top_200_features_Mt,pct_counts_in_top_500_features_Mt,CellCycle,Cluster,Phase,sizeFactor,cellType
<fct>,<chr>,<int>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<int>,<dbl>,⋯,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>,<fct>,<fct>,<dbl>,<chr>
AAACCTGCACCGATAT-1,BC543,2152,3.333044,47.58468,64.94609,73.96214,83.26683,2142,3.331022,⋯,2.657271,100,100,100,100,G1,2,S,0.9487532,HSC
AAACGGGCACGACTCG-1,BC543,2078,3.317854,46.25919,63.97836,74.47798,83.9716,2067,3.315551,⋯,2.789754,100,100,100,100,G1,2,G1,0.9866038,HSC
AAAGCAATCCTAAGTG-1,BC543,2493,3.396896,49.27632,67.14031,75.58654,83.5305,2482,3.394977,⋯,3.218558,100,100,100,100,G1,2,G1,1.2639782,HSC
AAAGTAGGTGATGATA-1,BC543,2298,3.361539,43.84055,61.52083,72.48097,82.09871,2288,3.359646,⋯,3.814357,100,100,100,100,G1,2,G2M,1.0392552,HSC
AAAGTAGTCTCGCTTG-1,BC543,2343,3.369958,48.50873,65.67959,75.49354,84.05766,2333,3.368101,⋯,2.861809,100,100,100,100,G1,2,G1,0.9423322,HSC
AACACGTGTTGGTAAA-1,BC543,1658,3.219846,47.34871,65.40645,76.03575,85.48134,1648,3.217221,⋯,2.186254,100,100,100,100,G1,2,G1,0.5750319,HSC
AACACGTTCAGGTAAA-1,BC543,1889,3.276462,50.92943,69.62995,78.85542,86.55766,1879,3.274158,⋯,3.020654,100,100,100,100,G1,2,G1,0.9193944,HSC
AACCATGCACGACGAA-1,BC543,883,2.946452,46.95035,62.2695,73.68794,86.41844,873,2.941511,⋯,2.234043,100,100,100,100,G1,6,G1,0.1917033,HSC
AACCATGTCATAGCAC-1,BC543,812,2.910091,45.69509,59.8908,71.73457,86.89626,801,2.904174,⋯,2.01596,100,100,100,100,G1,6,G1,0.1792775,HSC
AACGTTGTCTCGTTTA-1,BC543,1611,3.207365,47.84624,64.8446,74.53653,84.37841,1600,3.204391,⋯,1.785714,100,100,100,100,G1,15,G1,0.505221,HSC


In [18]:
d %>% 
  select(barcode, Sample, total_features, cellType, Cluster) %>%
  filter(Sample == "BC572") %>%
  head(10)

barcode,Sample,total_features,cellType,Cluster
<fct>,<chr>,<int>,<chr>,<fct>
AAACCTGTCCAAGTAC-2,BC572,1258,HSC,7
AAAGATGGTTGTCGCG-2,BC572,1436,HSC,7
AAAGCAATCACCGTAA-2,BC572,1074,HSC,7
AAATGCCTCGGTTCGG-2,BC572,917,HSC,7
AACTCAGGTCTTCGTC-2,BC572,1090,HSC,7
AACTCTTAGTGGGCTA-2,BC572,983,HSC,7
AAGACCTAGCCCTAAT-2,BC572,2829,HSC,11
AAGGCAGCAGCATACT-2,BC572,1202,HSC,7
AAGTCTGAGAATCTCC-2,BC572,2352,HSC,11
AAGTCTGGTAAACCTC-2,BC572,2776,HSC,11


In [20]:
d %>% 
  filter(cellType == "Erythrocytes", pct_counts_Mt > 1.5) %>% 
  select(barcode, Sample, pct_counts_Mt, cellType, Cluster) %>%
  head(10)

barcode,Sample,pct_counts_Mt,cellType,Cluster
<fct>,<chr>,<dbl>,<chr>,<fct>
AAGGAGCAGTTAGGTA-1,BC543,4.45573,Erythrocytes,17
AGTGGGATCATGTAGC-3,BC746,2.919859,Erythrocytes,9
GAACATCAGCCGATTT-3,BC746,4.329392,Erythrocytes,17
GATCAGTTCCTCATTA-3,BC746,4.152797,Erythrocytes,17
CCATGTCAGTCCAGGA-6,HV3,4.567626,Erythrocytes,17
TTAGGACGTAAGTTCC-6,HV3,3.249836,Erythrocytes,17
GCACTCTGTATTACCG-8,BC786,2.820549,Erythrocytes,17
AAAGATGCAGTCCTTC-9,BC278,3.578209,Erythrocytes,17
AACCGCGGTAGCTAAA-9,BC278,2.954553,Erythrocytes,17
AACGTTGTCCGTAGTA-9,BC278,3.894991,Erythrocytes,17
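
Conditions separated by commas in `filter()` are combined with AND, as in the last cell above. The other common patterns are sketched below on a made-up table (`toy` and its columns are hypothetical):

```r
library(dplyr)

toy <- tibble(
  cell_type = c("HSC", "GMP", "HSC", "MEP", "Erythrocytes"),
  pct_mt    = c(2.1, 0.9, 1.2, 3.4, 4.0)
)

# Comma and & are equivalent: both conditions must hold
and_rows <- toy %>% filter(cell_type == "HSC", pct_mt > 1.5)

# | keeps rows where either condition holds
or_rows <- toy %>% filter(cell_type == "MEP" | pct_mt > 3.5)

# %in% matches against several values at once
in_rows <- toy %>% filter(cell_type %in% c("GMP", "MEP"))

# != excludes a value
not_hsc <- toy %>% filter(cell_type != "HSC")
```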


#### Mutate

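To create new variables in the data table. We first append the log-expression values of three genes (KLF4, RUNX1 and EGR1) to the metadata, then use <code>mutate()</code> to derive a new column from them: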

In [14]:
d_exp <- d
# logcounts() is genes x cells, so transpose to get one row per cell before binding
d_exp <- cbind(d_exp, t(logcounts(sce)[c('KLF4','RUNX1','EGR1'),]))

In [21]:
d_exp %>% head(10)

Unnamed: 0_level_0,barcode,Sample,total_features,log10_total_features,pct_counts_top_50_features,pct_counts_top_100_features,pct_counts_top_200_features,pct_counts_top_500_features,total_features_endogenous,log10_total_features_endogenous,⋯,pct_counts_in_top_200_features_Mt,pct_counts_in_top_500_features_Mt,CellCycle,Cluster,Phase,sizeFactor,cellType,KLF4,RUNX1,EGR1
Unnamed: 0_level_1,<fct>,<chr>,<int>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<int>,<dbl>,⋯,<dbl>,<dbl>,<chr>,<fct>,<fct>,<dbl>,<chr>,<dbl>,<dbl>,<dbl>
AAACCTGCACCGATAT-1,AAACCTGCACCGATAT-1,BC543,2152,3.333044,47.58468,64.94609,73.96214,83.26683,2142,3.331022,⋯,100,100,G1,2,S,0.9487532,HSC,0.0,7.4776,6.485672
AAACGGGCACGACTCG-1,AAACGGGCACGACTCG-1,BC543,2078,3.317854,46.25919,63.97836,74.47798,83.9716,2067,3.315551,⋯,100,100,G1,2,G1,0.9866038,HSC,6.429876,8.003639,6.429876
AAAGCAATCCTAAGTG-1,AAAGCAATCCTAAGTG-1,BC543,2493,3.396896,49.27632,67.14031,75.58654,83.5305,2482,3.394977,⋯,100,100,G1,2,G1,1.2639782,HSC,0.0,6.077144,8.381875
AAAGTAGGTGATGATA-1,AAAGTAGGTGATGATA-1,BC543,2298,3.361539,43.84055,61.52083,72.48097,82.09871,2288,3.359646,⋯,100,100,G1,2,G2M,1.0392552,HSC,0.0,6.355761,6.355761
AAAGTAGTCTCGCTTG-1,AAAGTAGTCTCGCTTG-1,BC543,2343,3.369958,48.50873,65.67959,75.49354,84.05766,2333,3.368101,⋯,100,100,G1,2,G1,0.9423322,HSC,0.0,6.49536,8.069622
AACACGTGTTGGTAAA-1,AACACGTGTTGGTAAA-1,BC543,1658,3.219846,47.34871,65.40645,76.03575,85.48134,1648,3.217221,⋯,100,100,G1,2,G1,0.5750319,HSC,0.0,7.201707,0.0
AACACGTTCAGGTAAA-1,AACACGTTCAGGTAAA-1,BC543,1889,3.276462,50.92943,69.62995,78.85542,86.55766,1879,3.274158,⋯,100,100,G1,2,G1,0.9193944,HSC,0.0,0.0,0.0
AACCATGCACGACGAA-1,AACCATGCACGACGAA-1,BC543,883,2.946452,46.95035,62.2695,73.68794,86.41844,873,2.941511,⋯,100,100,G1,6,G1,0.1917033,HSC,0.0,0.0,0.0
AACCATGTCATAGCAC-1,AACCATGTCATAGCAC-1,BC543,812,2.910091,45.69509,59.8908,71.73457,86.89626,801,2.904174,⋯,100,100,G1,6,G1,0.1792775,HSC,0.0,0.0,0.0
AACGTTGTCTCGTTTA-1,AACGTTGTCTCGTTTA-1,BC543,1611,3.207365,47.84624,64.8446,74.53653,84.37841,1600,3.204391,⋯,100,100,G1,15,G1,0.505221,HSC,0.0,0.0,7.387244


In [22]:
d_exp %>% 
  mutate(Klf4Diff = abs(KLF4 - RUNX1)) %>%
  select(barcode, Sample, cellType, Klf4Diff) %>%
  head(10)

Unnamed: 0_level_0,barcode,Sample,cellType,Klf4Diff
Unnamed: 0_level_1,<fct>,<chr>,<chr>,<dbl>
AAACCTGCACCGATAT-1,AAACCTGCACCGATAT-1,BC543,HSC,7.4776
AAACGGGCACGACTCG-1,AAACGGGCACGACTCG-1,BC543,HSC,1.573763
AAAGCAATCCTAAGTG-1,AAAGCAATCCTAAGTG-1,BC543,HSC,6.077144
AAAGTAGGTGATGATA-1,AAAGTAGGTGATGATA-1,BC543,HSC,6.355761
AAAGTAGTCTCGCTTG-1,AAAGTAGTCTCGCTTG-1,BC543,HSC,6.49536
AACACGTGTTGGTAAA-1,AACACGTGTTGGTAAA-1,BC543,HSC,7.201707
AACACGTTCAGGTAAA-1,AACACGTTCAGGTAAA-1,BC543,HSC,0.0
AACCATGCACGACGAA-1,AACCATGCACGACGAA-1,BC543,HSC,0.0
AACCATGTCATAGCAC-1,AACCATGTCATAGCAC-1,BC543,HSC,0.0
AACGTTGTCTCGTTTA-1,AACGTTGTCTCGTTTA-1,BC543,HSC,0.0


#### Arrange

To order the data by a particular variable:

In [25]:
d_exp %>% 
  mutate(Klf4Diff = abs(KLF4 - RUNX1)) %>% 
  arrange(desc(Klf4Diff)) %>%
  head(10)

Unnamed: 0_level_0,barcode,Sample,total_features,log10_total_features,pct_counts_top_50_features,pct_counts_top_100_features,pct_counts_top_200_features,pct_counts_top_500_features,total_features_endogenous,log10_total_features_endogenous,⋯,pct_counts_in_top_500_features_Mt,CellCycle,Cluster,Phase,sizeFactor,cellType,KLF4,RUNX1,EGR1,Klf4Diff
Unnamed: 0_level_1,<fct>,<chr>,<int>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<int>,<dbl>,⋯,<dbl>,<chr>,<fct>,<fct>,<dbl>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>
AGATCTGCAACTGCTA-5,AGATCTGCAACTGCTA-5,HV2,1611,3.207365,33.33333,40.72438,50.23557,67.28504,1600,3.204391,⋯,100,G1,15,G1,0.2841593,Endothelial cells,11.379282,0.000000,0.000000,11.379282
TTGCGTCCATACCATG-2,TTGCGTCCATACCATG-2,BC572,911,2.959995,40.70539,55.56017,66.80498,82.94606,902,2.955688,⋯,100,G1,7,G1,0.1724582,HSC,0.000000,10.515217,0.000000,10.515217
CTCGAAATCACTGGGC-7,CTCGAAATCACTGGGC-7,BC572,1009,3.004321,47.59764,63.22001,73.95336,85.69823,998,2.999565,⋯,100,G1,7,G1,0.2313448,HSC,0.000000,10.506463,0.000000,10.506463
TGGCTGGGTCGTTGTA-7,TGGCTGGGTCGTTGTA-7,BC572,782,2.893762,48.35329,66.05539,76.53443,89.44611,772,2.888179,⋯,100,S,7,G1,0.1799707,HSC,0.000000,10.453745,0.000000,10.453745
GCATGCGCAGGAATGC-7,GCATGCGCAGGAATGC-7,BC572,819,2.913814,47.81746,63.73016,74.04762,87.34127,809,2.908485,⋯,100,G1,7,S,0.1799932,HSC,0.000000,10.453565,0.000000,10.453565
GTCTCGTTCAGGCCCA-2,GTCTCGTTCAGGCCCA-2,BC572,1025,3.011147,42.85191,56.01173,66.20235,80.75513,1015,3.006894,⋯,100,G1,7,G1,0.1828590,HSC,0.000000,10.430792,0.000000,10.430792
CAAGATCTCACTCTTA-2,CAAGATCTCACTCTTA-2,BC572,1028,3.012415,40.99032,54.20700,65.74832,80.34252,1018,3.008174,⋯,100,G1,7,G1,0.1927839,HSC,0.000000,10.354595,0.000000,10.354595
ATTATCCAGACCTAGG-7,ATTATCCAGACCTAGG-7,BC572,885,2.947434,46.65370,62.06226,71.63424,85.01946,878,2.943989,⋯,100,G1,7,S,0.2043489,HSC,0.000000,10.270612,0.000000,10.270612
AGATTGCAGAGTGACC-2,AGATTGCAGAGTGACC-2,BC572,895,2.952308,43.53877,58.72763,70.21869,84.29423,885,2.947434,⋯,100,G1,7,G1,0.2054339,HSC,0.000000,10.262978,0.000000,10.262978
CTCGGAGTCTATCGCC-2,CTCGGAGTCTATCGCC-2,BC572,825,2.916980,45.75163,59.56427,71.15468,85.83878,815,2.911690,⋯,100,G1,7,G1,0.2066755,HSC,0.000000,10.254292,0.000000,10.254292
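
`arrange()` sorts in ascending order by default and accepts several sort keys, with later keys breaking ties in earlier ones. A small sketch on hypothetical data (the table `toy` is made up for illustration):

```r
library(dplyr)

toy <- tibble(
  sample = c("B", "A", "B", "A"),
  counts = c(150, 80, 200, 120)
)

# Ascending by sample, then descending by counts within each sample
sorted <- toy %>% arrange(sample, desc(counts))
sorted
```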


#### Group by + summarize : forget about loops

First, group the data by one or more variables. Second, summarize each group with new statistics. Summarize turns many rows into one row per group.

Examples:

```
min(x) - minimum value of vector x.

max(x) - maximum value of vector x.

mean(x) - mean value of vector x.

median(x) - median value of vector x.

quantile(x, p) - pth quantile of vector x.

sd(x) - standard deviation of vector x.

var(x) - variance of vector x.

IQR(x) - Inter Quartile Range (IQR) of vector x.

diff(range(x)) - total range of vector x.

```
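
Any of these functions can be used inside `summarise()`. The sketch below applies a few of them per group to a made-up table (`toy` is hypothetical). It also shows that `sd()` returns `NA` for a group with a single row, which is why a standard deviation can come out empty in a per-group summary.

```r
library(dplyr)

toy <- tibble(
  cell_type = c("HSC", "HSC", "HSC", "GMP", "GMP", "MEP"),
  counts    = c(10, 20, 30, 5, 15, 40)
)

stats <- toy %>%
  group_by(cell_type) %>%
  summarise(
    n        = n(),
    mean_c   = mean(counts),
    median_c = median(counts),
    sd_c     = sd(counts),            # NA when the group has only one row
    range_c  = diff(range(counts)),
    q90      = quantile(counts, 0.9)
  )
stats
```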

In [27]:
d %>% 
  group_by(cellType) %>% 
  summarise(mean_total_counts = mean(total_counts, na.rm = TRUE), sd_total_counts = sd(total_counts), 
     mean_pct_Mt_count = mean(pct_counts_Mt), count = n()) %>% 
  #ungroup() %>% 
  slice_max(., n=20, order_by = mean_total_counts)  # note here, it does 

cellType,mean_total_counts,sd_total_counts,mean_pct_Mt_count,count
<chr>,<dbl>,<dbl>,<dbl>,<int>
BM & Prog.,30704.0,,3.585852,1
Erythrocytes,29884.306,9744.2316,3.1332549,265
MEP,26098.203,8539.9938,2.887916,74
GMP,12125.685,5481.6378,2.2918437,146
CMP,11096.714,4576.0677,2.5725066,7
HSC,11036.224,4885.0916,2.2195699,6282
Monocytes,10733.375,4540.2326,2.8227165,8
Pro-Myelocyte,6665.0,5488.5628,1.7153164,2
NK_cell,5750.0,,0.7652174,1
Platelets,5619.8,2339.191,1.3455238,5


`summarise()` ungrouping output (override with `.groups` argument)

<div class="alert alert-success">
    <strong>Note:</strong> <code>mutate()</code> either changes an existing column or adds a new one. <code>summarise()</code> calculates a single value (per group). As you can see, in the Mutate example a new column (<code>Klf4Diff</code>) is added.
</div>
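
The difference can be seen directly in the number of rows returned. In this sketch on hypothetical data (`toy` is made up), `mutate()` keeps every row, while `summarise()` returns one row per group:

```r
library(dplyr)

toy <- tibble(
  sample = c("A", "A", "B"),
  counts = c(100, 300, 50)
)

# mutate(): same number of rows, one extra column
with_log <- toy %>% mutate(log10_counts = log10(counts))

# Grouped mutate(): the group mean is repeated on every row of its group
with_mean <- toy %>%
  group_by(sample) %>%
  mutate(sample_mean = mean(counts)) %>%
  ungroup()

# summarise(): collapses each group to a single row
per_sample <- toy %>%
  group_by(sample) %>%
  summarise(sample_mean = mean(counts))

nrow(with_log)    # 3 rows, as in toy
nrow(per_sample)  # 2 rows, one per sample
```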

*****
[back to the top](#top)

<a id="yourturn"></a>
## Your Turn


Let us look at the mpg dataset first

In [28]:
head(mpg)

manufacturer,model,displ,year,cyl,trans,drv,cty,hwy,fl,class
<chr>,<chr>,<dbl>,<int>,<int>,<chr>,<chr>,<int>,<int>,<chr>,<chr>
audi,a4,1.8,1999,4,auto(l5),f,18,29,p,compact
audi,a4,1.8,1999,4,manual(m5),f,21,29,p,compact
audi,a4,2.0,2008,4,manual(m6),f,20,31,p,compact
audi,a4,2.0,2008,4,auto(av),f,21,30,p,compact
audi,a4,2.8,1999,6,auto(l5),f,16,26,p,compact
audi,a4,2.8,1999,6,manual(m5),f,18,26,p,compact


<div class="alert alert-block alert-info">
    <b>Task 1</b><br/>
   <p>Select only the Audi cars from the above list</p> 
</div>

In [29]:
mpg %>% 
  filter(manufacturer == "audi")

manufacturer,model,displ,year,cyl,trans,drv,cty,hwy,fl,class
<chr>,<chr>,<dbl>,<int>,<int>,<chr>,<chr>,<int>,<int>,<chr>,<chr>
audi,a4,1.8,1999,4,auto(l5),f,18,29,p,compact
audi,a4,1.8,1999,4,manual(m5),f,21,29,p,compact
audi,a4,2.0,2008,4,manual(m6),f,20,31,p,compact
audi,a4,2.0,2008,4,auto(av),f,21,30,p,compact
audi,a4,2.8,1999,6,auto(l5),f,16,26,p,compact
audi,a4,2.8,1999,6,manual(m5),f,18,26,p,compact
audi,a4,3.1,2008,6,auto(av),f,18,27,p,compact
audi,a4 quattro,1.8,1999,4,manual(m5),4,18,26,p,compact
audi,a4 quattro,1.8,1999,4,auto(l5),4,16,25,p,compact
audi,a4 quattro,2.0,2008,4,manual(m6),4,20,28,p,compact


Let us look at how many cars there are for each manufacturer in the list

In [33]:
table(mpg$manufacturer)


      audi  chevrolet      dodge       ford      honda    hyundai       jeep 
        18         19         37         25          9         14          8 
land rover    lincoln    mercury     nissan    pontiac     subaru     toyota 
         4          3          4         13          5         14         34 
volkswagen 
        27 

<div class="alert alert-block alert-info">
    <b>Task 2</b><br/>
<p>How many Ford cars are there?</p>
</div>

In [34]:
mpg %>% 
  filter(manufacturer == "ford") %>%
  summarize(Ford_cars = n())

Ford_cars
<int>
25
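
An alternative worth knowing, offered as a sketch rather than the expected answer: `count()` combines the grouping and counting steps, giving the same information as `table()` but as a tibble that can be piped onwards. It assumes `mpg` is available from ggplot2, which `library(tidyverse)` attaches; the names `per_make` and `ford_n` are illustrative.

```r
library(dplyr)
library(ggplot2)  # provides the mpg dataset

# One row per manufacturer, with the number of cars in column n
per_make <- mpg %>% count(manufacturer, sort = TRUE)
head(per_make, 3)

# Pull out the count for a single manufacturer
ford_n <- per_make %>% filter(manufacturer == "ford") %>% pull(n)
ford_n
```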


*****
[back to the top](#top)

### Notebook details
<br>
<i>Notebook created by <strong>Syed Murtuza Baker</strong>. Other contributors include Fran Hooley... 

Publish date: May 2021<br>
Review date: May 2022</i>

Please give your feedback using the button below:

<a class="typeform-share button" href="https://hub11.typeform.com/to/jUk0oK0O" data-mode="popup" style="display:inline-block;text-decoration:none;background-color:#3A7685;color:white;cursor:pointer;font-family:Helvetica,Arial,sans-serif;font-size:18px;line-height:45px;text-align:center;margin:0;height:45px;padding:0px 30px;border-radius:22px;max-width:100%;white-space:nowrap;overflow:hidden;text-overflow:ellipsis;font-weight:bold;-webkit-font-smoothing:antialiased;-moz-osx-font-smoothing:grayscale;" target="_blank">Rate this notebook </a> <script> (function() { var qs,js,q,s,d=document, gi=d.getElementById, ce=d.createElement, gt=d.getElementsByTagName, id="typef_orm_share", b="https://embed.typeform.com/"; if(!gi.call(d,id)){ js=ce.call(d,"script"); js.id=id; js.src=b+"embed.js"; q=gt.call(d,"script")[0]; q.parentNode.insertBefore(js,q) } })() </script>

****

## Your Notes:
