# Count Data  
## Initialising R  
In this lesson we will be exploring the cars database that will be used in your A-level.
We will be using a powerful programming language called *R*.
To run code in *R*:  
1. Click on the box containing code to highlight it in green.
2. Click on the play button above to show the output.

#### **Exercise 1**  
Load the data and settings by running the code below. If you do this successfully you should get a message in the output.

In [0]:
library(tidyverse)
library(readr)

options(repr.plot.width=8,
        repr.plot.height=4,
        warn=-1)

clean_theme <- theme_bw(base_size=12, base_family="sans") + 
  theme(panel.border = element_blank(), 
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        legend.text = element_text(size = 12),
        legend.title = element_text(size=12),
        axis.line = element_line(size = 0.5, colour = "black"),
        axis.title = element_text(face = "plain"),
        strip.background = element_blank()
  )

clean_theme_hist <- theme_bw(base_size=12, base_family="sans") + 
  theme(panel.border = element_blank(), 
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        legend.text = element_text(size = 12),
        legend.title = element_text(size=12),
        axis.line = element_line(size = 0.5, colour = "black"),
        axis.title = element_text(face = "plain"),
        strip.background = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank()
  )

car_df <-
  read_csv("cars_data.csv")

# Convert the numeric ID codes into labelled categories (factors).
# Any code not listed below becomes NA.
car_data <-
  car_df %>%
  mutate(YearRegistered = as.factor(YearRegistered),
         GovRegion = as.factor(GovRegion),
         PropulsionTypeId = as.factor(case_when(
           PropulsionTypeId == 1 ~ "Petrol",
           PropulsionTypeId == 2 ~ "Diesel",
           PropulsionTypeId == 3 ~ "Electric",
           PropulsionTypeId == 7 ~ "Gas/Petrol",
           PropulsionTypeId == 8 ~ "Electric/Petrol")),
         BodyTypeId = as.factor(case_when(
           BodyTypeId == 1 ~ "2 door saloon",
           BodyTypeId == 2 ~ "4 door saloon",
           BodyTypeId == 3 ~ "saloon",
           BodyTypeId == 4 ~ "convertible",
           BodyTypeId == 5 ~ "coupe",
           BodyTypeId == 6 ~ "estate",
           BodyTypeId == 13 ~ "3 door hatchback",
           BodyTypeId == 14 ~ "5 door hatchback",
           BodyTypeId == 96 ~ "Multi Purpose Vehicle")),
         KeeperTitleId = as.factor(case_when(
           KeeperTitleId == 1 ~ "Male",
           KeeperTitleId == 2 ~ "Female",
           KeeperTitleId == 3 ~ "(not used)",
           KeeperTitleId == 4 ~ "unknown (Dr, Rev, etc.)",
           KeeperTitleId == 5 ~ "company")),
         Make = as.factor(Make))
print("Ready to rumble")

## Inspecting the data

Each row is a single entry (car) and each column contains a particular variable.
Recall, for the **car_data** dataset the variables are as follows:  

* *Reference number*: a unique identifier
* *Make*:             the car manufacturer
* *PropulsionTypeId*: a code for the type of fuel (e.g. petrol, diesel...)
* *BodyTypeId*:       a code for the type of body (e.g. convertible, estate, coupe...)
* *GovRegion*:        where the registered keeper lives
* *EngineSize*:       capacity of engine (cubic cm)
* *YearRegistered*:   the year in which the vehicle was first registered
* *Mass*:             the mass of the vehicle in kg, plus 75 kg (an average person)
* *CO2*:              carbon dioxide emissions (g/km)
* *CO*:               carbon monoxide emissions (g/km)
* *NOX*:              oxides of nitrogen emissions (g/km)
* *part*:             particulate emissions (g/km), diesel only
* *hc*:               hydrocarbon emissions (g/km)
* *Random number*:    a random number between 0 and 1 to assist with sampling  

The **head()** function returns the first six rows (cars) of the dataset we feed to it.
To feed a dataset to a function we use the **%>%** symbol. For example:

```
dataset %>%
head()
```
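As a rough illustration of the pattern above, here is a sketch using a small made-up dataset called `pets` (the dataset and its column names are invented for this example and are not part of the cars data). It has ten rows, so **head()** should return only the first six.

```r
library(tidyverse)

# Hypothetical toy dataset (not part of the cars data): ten pets
pets <- tibble(
  name = c("Rex", "Tom", "Bubbles", "Fido", "Felix",
           "Goldie", "Spot", "Kit", "Nemo", "Max"),
  species = c("dog", "cat", "fish", "dog", "cat",
              "fish", "dog", "cat", "fish", "dog")
)

# Feed the dataset to head(): only the first six of the ten rows are kept
first_pets <- pets %>%
  head()

print(first_pets)
```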

#### **Exercise 2**  
Display the first 6 cars in the *car_data* database

In [0]:
#Your code here

## Comparing counts
Remember, we can summarise each variable using the **summary()** function to display descriptive statistics:  
```
... %>%
summary()
```

For categorical variables, this will display each of the groups with the total number of observations (cars) in each group:
* GROUP 1: number of cars in GROUP 1
* GROUP 2: number of cars in GROUP 2
* .  
* .  
* .  
* GROUP n: number of cars in GROUP n
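
To see what this output can look like, here is a small sketch on a made-up `pets` dataset (invented for this example, not part of the cars data). The categorical column is stored as a factor, just as the cars variables are after loading.

```r
library(tidyverse)

# Hypothetical toy dataset (not part of the cars data)
pets <- tibble(
  species = as.factor(c("dog", "cat", "fish", "dog", "cat", "dog")),
  age = c(3, 5, 1, 7, 2, 4)
)

# For the factor column, summary() lists each group with its count;
# for the numeric column, it gives the minimum, quartiles, mean and maximum
pets %>%
  summary() %>%
  print()

# Summarising the factor on its own returns the counts as a named vector
species_counts <- summary(pets$species)
print(species_counts)
```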
#### **Exercise 3**  
Display summary statistics for the variables in the *car_data* database  
#### **Questions**
1) Which year had the most cars registered?
2) Which car manufacturer is the most popular, and which is the least?

In [0]:
#Your code here

## Visualising total counts - Bar charts 1  
Bar charts allow us to compare the total number of observations in each group of a categorical variable:
1) Feed the data to our **ggplot()** function using **%>%**
2) Specify **x =** the **categorical variable** whose groups we wish to compare
3) Specify options to make the chart look nice

```
dataset %>%
    ggplot(aes(x = categorical variable))+
    geom_bar(options)+
    more_options
```
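The template above might be filled in as follows. This is only a sketch on a made-up `pets` dataset (invented for this example, not the cars data), and it leaves out the lesson's `clean_theme` so that it can run on its own.

```r
library(tidyverse)

# Hypothetical toy dataset (not part of the cars data)
pets <- tibble(
  species = as.factor(c("dog", "cat", "fish", "dog", "cat", "dog"))
)

# geom_bar() counts the rows in each group and draws one bar per group
pet_plot <- pets %>%
  ggplot(aes(x = species)) +
  geom_bar(fill = "cornflowerblue") +
  labs(title = "Pets by species",
       x = "Species",
       y = "Total pets")

print(pet_plot)
```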

#### **Exercise 4**  
Visualise a comparison between the total number of cars registered in 2002 and 2016  
#### **Question**
Can you think of any reasons for the trend you see?

In [0]:
___ %>%
  ggplot(aes(x = ___))+
  geom_bar(colour = "cornflowerblue",
           fill = "cornflowerblue",
           size = 1)+
  labs(title = "Vehicles registered 2002 and 2016",
       x = "Year",
       y = "Total cars")+
  clean_theme

## Comparing total count of groups within groups - Bar charts 2
We may want to compare the total number of each type of vehicle registered in each of the years.  
This is the same code as above but with a few important additions:
1) In **ggplot()**, set the **colour =** and **fill =** options to **categorical variable 2**
2) In **geom_bar()**, set the **position =** option to **"dodge"**

```
dataset %>%
    ggplot(aes(x = categorical variable 1, colour = categorical variable 2, fill = categorical variable 2))+
    geom_bar(position = "dodge",
             options)+
    more_options
```
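Here is a sketch of a grouped ("dodged") bar chart on a made-up `pets` dataset with two categorical variables (the dataset and column names are invented for this example, not taken from the cars data). Each year gets one bar per species, drawn side by side.

```r
library(tidyverse)

# Hypothetical toy dataset (not part of the cars data)
pets <- tibble(
  year = as.factor(c(2020, 2020, 2020, 2021, 2021, 2021, 2021)),
  species = as.factor(c("dog", "cat", "dog", "dog", "cat", "cat", "cat"))
)

# colour and fill split each year into species;
# position = "dodge" places the bars next to each other instead of stacking them
pet_plot <- pets %>%
  ggplot(aes(x = year, colour = species, fill = species)) +
  geom_bar(position = "dodge") +
  labs(title = "Pets by species and year",
       x = "Year",
       y = "Total pets")

print(pet_plot)
```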

#### **Exercise 5**
Visualise a comparison between the total number of different types of engine in 2002 and 2016  
(*hint: begin by copying and pasting your code from **exercise 4***)

#### **Questions**
1) Which engine type shows the biggest increase in total registered from 2002 to 2016?
2) Do any engine types become less popular in 2016?

In [0]:
#Your code here

## Visualising proportions - Tables
Sometimes we prefer to look at proportions rather than total counts.  
To do this we need to transform the data to a proportion measure that we can include in a pie chart.  
1) Start a new dataset by writing a **name** followed by **<-**
2) Feed the original **dataset** to the **group_by()** function using **%>%**
3) Specify a **categorical variable** to **group_by()** whose proportions you wish to visualise.
4) Feed this to the **summarise()** function using **%>%**
5) Set the **total = n()**. This will count the total number of observations in each group.
6) Feed this to the **mutate()** function using **%>%**
7) Set the **proportion = total/sum(total)**. This will divide the number of observations in each group by the total number of observations.
```
name_data <-
  dataset %>%
  group_by(variable) %>%
  summarise(total = n()) %>%
  mutate(proportion = total/sum(total))
```
Finally we can print our new dataset using:
```
name_data %>%
    print()
```
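As a worked sketch of these steps, here is the same pipeline on a made-up `pets` dataset (invented for this example, not the cars data). With two dogs, one cat and one fish, the proportions should come out as 0.5, 0.25 and 0.25.

```r
library(tidyverse)

# Hypothetical toy dataset (not part of the cars data)
pets <- tibble(
  species = c("dog", "cat", "fish", "dog")
)

pet_proportions <-
  pets %>%
  group_by(species) %>%                      # one group per species
  summarise(total = n()) %>%                 # count the rows in each group
  mutate(proportion = total / sum(total))    # divide by the overall total

pet_proportions %>%
  print()
```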
#### **Exercise 6**
Create a new dataset called **proportion_data** which shows the proportion of cars registered in 2002 and 2016.  
**print()** your new dataset to view the proportions

In [0]:
proportion_data <-
    ___ %>%
      group_by(___) %>%
      summarise(___) %>%
      mutate(___)

___ %>%
    ___


## Visualising proportions - Pie charts  
Now you have your proportion data, you can use it to plot a pie chart:
1) Feed your **proportion_dataset** to the **ggplot()** function using **%>%**
2) Set the **y =** axis to your **proportion** measure and **fill =** to your **variable**
3) Add (**+**) the **geom_col()** function
4) Add (**+**) the **coord_polar()** function
5) Finally add any options that make the plot prettier

```
proportion_dataset %>%
  ggplot(aes(x = "", y = proportion, fill = variable))+
  geom_col(options)+
  coord_polar(options)+
  more_options
```
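A sketch of this recipe on a made-up table of proportions (the `pet_proportions` data is invented for this example, not taken from the cars data). A pie chart here is simply a single stacked bar that **coord_polar()** bends into a circle.

```r
library(tidyverse)

# Hypothetical proportion table (not part of the cars data)
pet_proportions <- tibble(
  species = c("cat", "dog", "fish"),
  proportion = c(0.25, 0.5, 0.25)
)

pet_pie <- pet_proportions %>%
  ggplot(aes(x = "", y = proportion, fill = species)) +
  geom_col(colour = "white", width = 1) +   # one stacked bar
  coord_polar("y", start = 0) +             # wrap the bar into a circle
  labs(title = "Pets by species") +
  theme_void()                              # remove axes and gridlines

print(pet_pie)
```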
#### **Exercise 7**
Plot a pie chart from your data in **Exercise 6** using the code below

In [0]:
___ %>%
  ggplot(aes(x = "", y = ___, fill = ___))+
  geom_col(colour = "white",
           width = 1)+
  coord_polar("y", start = 0)+
  labs(title = "Proportion of cars registered in 2002 and 2016")+
  clean_theme +
  theme_void()

## Comparing proportion of groups within groups - Tables  
We know that more cars were bought in 2016. Therefore, comparing *total counts* of engine types between years, as in **Exercise 5**, may be misleading.  
Perhaps a better way to make comparisons would be to compare the *proportion* of different engine types registered each year.  
Once again we must construct a new dataset that contains proportion data.  
This is the same as before but we add one more variable to the **group_by()** function to give two categorical variables.  
**Categorical variable 1** is the variable you are comparing between (i.e. year).  
**Categorical variable 2** is the variable whose proportions you wish to compare (i.e. engine type). 

```
name_data <-
  dataset %>%
  group_by(categorical variable 1, categorical variable 2) %>%
  summarise(total = n()) %>%
  mutate(proportion = total/sum(total))
```
Finally we can print our new dataset using:
```
name_data %>%
    print()
```
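The order of the variables in **group_by()** matters. After **summarise()**, the result is still grouped by the first variable only, so **sum(total)** adds up within each group of categorical variable 1. Here is a sketch on a made-up `pets` dataset (invented for this example, not the cars data), where the proportions should add up to 1 within each year.

```r
library(tidyverse)

# Hypothetical toy dataset (not part of the cars data)
pets <- tibble(
  year = as.factor(c(2020, 2020, 2020, 2020, 2021, 2021)),
  species = c("dog", "dog", "dog", "cat", "dog", "cat")
)

pet_proportions <-
  pets %>%
  group_by(year, species) %>%                # year first, then species
  summarise(total = n()) %>%                 # result stays grouped by year
  mutate(proportion = total / sum(total))    # so this divides within each year

pet_proportions %>%
  print()
```

In this toy data, 2020 is 0.75 dog and 0.25 cat, while 2021 is 0.5 each.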
#### **Exercise 8**
Create a new dataset called **proportion_data** which shows the proportion of different engine types registered in 2002 and 2016.  
(*hint: copy and paste your code from **exercise 6** to start.  
extra hint: make sure the variables go into **group_by()** in the right order*)

In [0]:
#Your code here

## Comparing proportion of groups within groups - Pie Charts  
Now we have our proportion data, we can compare them in two pie charts.  
This is identical to **Exercise 7** but we add one more line of code.  
We add (**+**) the **facet_wrap()** function below **coord_polar()**.  
Inside **facet_wrap()** we provide the argument **vars(categorical variable 1)**. This tells *R* that we want to display a pie chart for each group within this variable.

```
proportion_dataset %>%
  ggplot(aes(x = "", y = proportion, fill = categorical variable 2))+
  geom_col(options)+
  coord_polar(options)+
  facet_wrap(vars(categorical variable 1))+  #NEW LINE OF CODE
  more_options
```
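Here is a sketch of side-by-side pie charts on a made-up table of proportions (the `pet_proportions` data is invented for this example, not taken from the cars data). It uses ggplot2's **facet_wrap()** function to draw one panel for each year.

```r
library(tidyverse)

# Hypothetical proportion table (not part of the cars data)
pet_proportions <- tibble(
  year = as.factor(c(2020, 2020, 2021, 2021)),
  species = c("cat", "dog", "cat", "dog"),
  proportion = c(0.25, 0.75, 0.5, 0.5)
)

pet_pies <- pet_proportions %>%
  ggplot(aes(x = "", y = proportion, fill = species)) +
  geom_col(colour = "white", width = 1) +
  coord_polar("y", start = 0) +
  facet_wrap(vars(year)) +                  # one pie chart per year
  labs(title = "Pets by species, 2020 and 2021") +
  theme_void()

print(pet_pies)
```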

#### **Exercise 9**  
Plot two pie charts comparing the proportion of each engine type registered in 2002 and 2016  
(*hint: remember to put the variables in the correct order!*)

#### **Questions**
Which engine type(s) increase in popularity in 2016?  
Do any engine types become less popular in 2016?  
Have any of your answers from **Exercise 5** changed? Why or why not?  

In [0]:
#Your code here

#### **Optional Exercise**  
How do registered keeper identities change between 2002 and 2016?  
a) Construct a bar chart comparing total counts between 2002 and 2016  
b) Construct a pie chart comparing proportions between 2002 and 2016  
Does this help to explain why more cars were bought in 2016?

In [0]:
#Your code here

In [0]:
#proportion data code

#pie chart code