![](https://www.r-project.org/Rlogo.png)

____________________________________________________________________________________

### Objective:

* To load data into R
* To make sure the data frame has the right data types
* To leverage functions, variables & vectors
* To summarize data
* To find the summary statistics for a specific variable
* To summarize a variable by groups

____________________________________________________________________________________

## Load data into R

In [None]:
# Initial Checking
getwd()
list.files("/kaggle/input")

In [None]:
# Call the "tidyverse" library using the library() function
library(tidyverse)

# read our data file into R and assign it to a variable called "chocolateData". 
# Find out where the data is by expanding the "Input Files"
# box above by clicking the + sign in the left corner.
chocolateData <- read_csv("../input/flavors_of_cacao.csv")
# names(chocolateData) <- make.names(names(chocolateData), unique=TRUE)
# print(names(chocolateData))

# remove the first row of the chocolateData data_frame using a negative index
chocolateData <- chocolateData[-1,]

# check the first few rows of the data using the head() function to make sure it
# looks alright
head(chocolateData)

Rows: 1795 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (6): Company 
(Maker-if known), Specific Bean Origin
or Bar Name, Cocoa
...
dbl (3): REF, Review
Date, Rating

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


Company (Maker-if known),Specific Bean Origin or Bar Name,REF,Review Date,Cocoa Percent,Company Location,Rating,Bean Type,Broad Bean Origin
<chr>,<chr>,<dbl>,<dbl>,<chr>,<chr>,<dbl>,<chr>,<chr>
A. Morin,Kpime,1676,2015,70%,France,2.75,,Togo
A. Morin,Atsane,1676,2015,70%,France,3.0,,Togo
A. Morin,Akata,1680,2015,70%,France,3.5,,Togo
A. Morin,Quilla,1704,2015,70%,France,3.5,,Peru
A. Morin,Carenero,1315,2014,70%,France,2.75,Criollo,Venezuela
A. Morin,Cuba,1315,2014,70%,France,3.5,,Cuba
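
To see what the negative index in chocolateData[-1,] does, here is a minimal sketch on a made-up three-row data frame (the name toy and its values are illustrative and not part of the chocolate dataset).

```r
# A small, made-up data frame standing in for chocolateData
toy <- data.frame(id = 1:3, name = c("a", "b", "c"), stringsAsFactors = FALSE)

# A negative row index drops that row and keeps everything else
trimmed <- toy[-1, ]

nrow(toy)      # 3
nrow(trimmed)  # 2
trimmed$name   # "b" "c"
```

The comma matters: toy[-1, ] drops the first row, while toy[, -1] would drop the first column instead.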


In [30]:
# Before we get going, let's replace the white space in the column names of this
# dataset with underscores. This lets us refer to columns by their names directly,
# since a name containing white space only works when wrapped in backticks.

names(chocolateData) <- gsub("[[:space:]]+", "_", names(chocolateData))
str(chocolateData)

tibble [1,794 × 9] (S3: tbl_df/tbl/data.frame)
 $ Company _(Maker-if_known)       : chr [1:1794] "A. Morin" "A. Morin" "A. Morin" "A. Morin" ...
 $ Specific_Bean_Origin_or_Bar_Name: chr [1:1794] "Kpime" "Atsane" "Akata" "Quilla" ...
 $ REF                             : num [1:1794] 1676 1676 1680 1704 1315 ...
 $ Review_Date                     : num [1:1794] 2015 2015 2015 2015 2014 ...
 $ Cocoa_Percent                   : chr [1:1794] "70%" "70%" "70%" "70%" ...
 $ Company_Location                : chr [1:1794] "France" "France" "France" "France" ...
 $ Rating                          : num [1:1794] 2.75 3 3.5 3.5 2.75 3.5 3.5 3.75 4 2.75 ...
 $ Bean_Type                       : chr [1:1794] " " " " " " " " ...
 $ Broad_Bean_Origin               : chr [1:1794] "Togo" "Togo" "Togo" "Peru" ...
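
A minimal sketch of the same renaming idea on a made-up vector of column names (raw_names and toy are hypothetical). It also shows that a name containing a space can still be used, but only when wrapped in backticks.

```r
# Made-up column names with a space and a line break, similar to the raw file
raw_names <- c("Review\nDate", "Cocoa Percent", "Rating")

# Replace each run of whitespace with a single underscore
clean_names <- gsub("[[:space:]]+", "_", raw_names)
clean_names  # "Review_Date" "Cocoa_Percent" "Rating"

# A name with a space still works, but needs backticks every time it is used
toy <- data.frame(`Cocoa Percent` = c(70, 63), check.names = FALSE)
toy$`Cocoa Percent`  # 70 63
```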


## Data Cleaning

In [None]:
# Use the str() function to check the data type of the columns in chocolateData
str(chocolateData)

tibble [1,794 × 9] (S3: tbl_df/tbl/data.frame)
 $ Company _(Maker-if_known)       : chr [1:1794] "A. Morin" "A. Morin" "A. Morin" "A. Morin" ...
 $ Specific_Bean_Origin_or_Bar_Name: chr [1:1794] "Kpime" "Atsane" "Akata" "Quilla" ...
 $ REF                             : num [1:1794] 1676 1676 1680 1704 1315 ...
 $ Review_Date                     : num [1:1794] 2015 2015 2015 2015 2014 ...
 $ Cocoa_Percent                   : chr [1:1794] "70%" "70%" "70%" "70%" ...
 $ Company_Location                : chr [1:1794] "France" "France" "France" "France" ...
 $ Rating                          : num [1:1794] 2.75 3 3.5 3.5 2.75 3.5 3.5 3.75 4 2.75 ...
 $ Bean_Type                       : chr [1:1794] " " " " " " " " ...
 $ Broad_Bean_Origin               : chr [1:1794] "Togo" "Togo" "Togo" "Peru" ...


In [33]:
# print the first few values from several columns of the data frame "chocolateData"
head(chocolateData$Rating)

head(chocolateData$REF)

head(chocolateData$Review_Date)

head(chocolateData$Cocoa_Percent)

In [34]:
# automatically convert the data types of our data_frame
chocolateData <- type_convert(chocolateData)


[36m──[39m [1mColumn specification[22m [36m────────────────────────────────────────────────────────[39m
cols(
  `Company _(Maker-if_known)` = [31mcol_character()[39m,
  Specific_Bean_Origin_or_Bar_Name = [31mcol_character()[39m,
  Cocoa_Percent = [31mcol_character()[39m,
  Company_Location = [31mcol_character()[39m,
  Bean_Type = [31mcol_character()[39m,
  Broad_Bean_Origin = [31mcol_character()[39m
)



In [None]:
# Check out the structure of the converted data using the str() function.
# Review each column's data type.
str(chocolateData)

tibble [1,794 × 9] (S3: tbl_df/tbl/data.frame)
 $ Company _(Maker-if_known)       : chr [1:1794] "A. Morin" "A. Morin" "A. Morin" "A. Morin" ...
 $ Specific_Bean_Origin_or_Bar_Name: chr [1:1794] "Kpime" "Atsane" "Akata" "Quilla" ...
 $ REF                             : num [1:1794] 1676 1676 1680 1704 1315 ...
 $ Review_Date                     : num [1:1794] 2015 2015 2015 2015 2014 ...
 $ Cocoa_Percent                   : chr [1:1794] "70%" "70%" "70%" "70%" ...
 $ Company_Location                : chr [1:1794] "France" "France" "France" "France" ...
 $ Rating                          : num [1:1794] 2.75 3 3.5 3.5 2.75 3.5 3.5 3.75 4 2.75 ...
 $ Bean_Type                       : chr [1:1794] " " " " " " " " ...
 $ Broad_Bean_Origin               : chr [1:1794] "Togo" "Togo" "Togo" "Peru" ...


In [None]:
# remove all the percent signs in the fifth column (gsub() is already vectorized,
# so it can be applied to the whole column at once)
chocolateData$Cocoa_Percent <- gsub("%", "", chocolateData$Cocoa_Percent, fixed = TRUE)

# try the type_convert() function again
chocolateData <- type_convert(chocolateData)

# check the structure to make sure Cocoa_Percent is now numeric
str(chocolateData)


[36m──[39m [1mColumn specification[22m [36m────────────────────────────────────────────────────────[39m
cols(
  `Company _(Maker-if_known)` = [31mcol_character()[39m,
  Specific_Bean_Origin_or_Bar_Name = [31mcol_character()[39m,
  Cocoa_Percent = [32mcol_double()[39m,
  Company_Location = [31mcol_character()[39m,
  Bean_Type = [31mcol_character()[39m,
  Broad_Bean_Origin = [31mcol_character()[39m
)



tibble [1,794 × 9] (S3: tbl_df/tbl/data.frame)
 $ Company _(Maker-if_known)       : chr [1:1794] "A. Morin" "A. Morin" "A. Morin" "A. Morin" ...
 $ Specific_Bean_Origin_or_Bar_Name: chr [1:1794] "Kpime" "Atsane" "Akata" "Quilla" ...
 $ REF                             : num [1:1794] 1676 1676 1680 1704 1315 ...
 $ Review_Date                     : num [1:1794] 2015 2015 2015 2015 2014 ...
 $ Cocoa_Percent                   : num [1:1794] 70 70 70 70 70 70 70 70 70 70 ...
 $ Company_Location                : chr [1:1794] "France" "France" "France" "France" ...
 $ Rating                          : num [1:1794] 2.75 3 3.5 3.5 2.75 3.5 3.5 3.75 4 2.75 ...
 $ Bean_Type                       : chr [1:1794] " " " " " " " " ...
 $ Broad_Bean_Origin               : chr [1:1794] "Togo" "Togo" "Togo" "Peru" ...
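
Why did the first type_convert() call leave Cocoa_Percent as text? A value such as "70%" is not a valid number until the percent sign is gone. The sketch below shows the same idea with base R only, on a made-up vector (pct is a hypothetical name).

```r
pct <- c("70%", "63%", "85%")

# Direct conversion fails: every value becomes NA (R also issues a warning)
direct <- suppressWarnings(as.numeric(pct))
direct  # NA NA NA

# Strip the literal percent sign first, then convert
cleaned <- as.numeric(sub("%", "", pct, fixed = TRUE))
cleaned  # 70 63 85
```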


> As the joke goes: “80 percent of data science is preparing data, and the other 20 percent is complaining about preparing data.”

## Summarizing data

OK, our data is in R and clean. Now let's start summarizing it! There are a couple of options for how to do this in R.

> One thing you'll learn about R is that there are often multiple ways to do the same thing. This flexibility is really nice once you're comfortable in the language, but I also remember it being very frustrating when I was learning.

Let's try two functions. The first, summary() is from base R, while the second, summarise_all(), is part of the Tidyverse. We'll run both and then compare the outputs.

You can learn more about any function by looking at the documentation for that function. You can do that in a kernel by running a cell with a question mark in front of the function name, with no parentheses after it. (If you do this for more than one function in the same cell, you'll only see the documentation for the last one.) You can also use a search engine to find more information.

> **Protip**: Never feel embarrassed about looking things up. Professional programmers look things up all the time; no one knows everything about every programming language!

In [37]:
# run this cell to learn more about the summary() function
?summary

summary {base},R Documentation

object,an object for which a summary is desired.
x,a result of the default method of summary().
maxsum,"integer, indicating how many levels should be shown for factors."
digits,"integer, used for number formatting with signif() (for summary.default) or format() (for summary.data.frame). In summary.default, if not specified (i.e., missing(.)), signif() will not be called anymore (since R >= 3.4.0, where the default has been changed to only round in the print and format methods)."
quantile.type,"integer code used in quantile(*, type=quantile.type) for the default method."
...,additional arguments affecting the summary produced.


In [38]:
# run this cell to learn more about the summarise_all() function
?summarise_all

summarise_all {dplyr},R Documentation

.tbl,A tbl object.
.funs,"A function fun, a quosure style lambda ~ fun(.) or a list of either form."
...,"Additional arguments for the function calls in .funs. These are evaluated only once, with tidy dots support."
.predicate,A predicate function to be applied to the columns or a logical vector. The variables for which .predicate is or returns TRUE are selected. This argument is passed to rlang::as_function() and thus supports quosure-style lambda functions and strings representing function names.
.vars,"A list of columns generated by vars(), a character vector of column names, a numeric vector of column positions, or NULL."
.cols,This argument has been renamed to .vars to fit dplyr's terminology and is deprecated.


In [40]:
# summary function from base R (base R means no packages)
summary(chocolateData)

 Company _(Maker-if_known) Specific_Bean_Origin_or_Bar_Name      REF      
 Length:1794               Length:1794                      Min.   :   5  
 Class :character          Class :character                 1st Qu.: 576  
 Mode  :character          Mode  :character                 Median :1069  
                                                            Mean   :1035  
                                                            3rd Qu.:1502  
                                                            Max.   :1952  
  Review_Date   Cocoa_Percent   Company_Location       Rating     
 Min.   :2006   Min.   : 42.0   Length:1794        Min.   :1.000  
 1st Qu.:2010   1st Qu.: 70.0   Class :character   1st Qu.:2.812  
 Median :2013   Median : 70.0   Mode  :character   Median :3.250  
 Mean   :2012   Mean   : 71.7                      Mean   :3.186  
 3rd Qu.:2015   3rd Qu.: 75.0                      3rd Qu.:3.500  
 Max.   :2017   Max.   :100.0                      Max.   :5.000  
  Bean

In [None]:
# summary function from the Tidyverse (specifically dplyr).
# Ask for the average using the function mean()
summarise_all(chocolateData, funs(mean))

“[1m[22m`funs()` was deprecated in dplyr 0.8.0.
[36mℹ[39m Please use a list of either functions or lambdas:

# Simple named list: list(mean = mean, median = median)

# Auto named with `tibble::lst()`: tibble::lst(mean, median)

# Using lambdas list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))”
[1m[22m[36mℹ[39m In argument: `Company _(Maker-if_known) = mean(`Company _(Maker-if_known)`)`.
[33m![39m argument is not numeric or logical: returning NA


Company _(Maker-if_known),Specific_Bean_Origin_or_Bar_Name,REF,Review_Date,Cocoa_Percent,Company_Location,Rating,Bean_Type,Broad_Bean_Origin
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
,,1035.436,2012.323,71.70318,,3.185619,,


In [70]:
# avoid the deprecated funs() by passing mean directly; extra arguments
# such as na.rm are forwarded to it
summarise_all(chocolateData, mean, na.rm = TRUE)

[1m[22m[36mℹ[39m In argument: `Company _(Maker-if_known) = (function (x, ...) ...`.
[33m![39m argument is not numeric or logical: returning NA


Company _(Maker-if_known),Specific_Bean_Origin_or_Bar_Name,REF,Review_Date,Cocoa_Percent,Company_Location,Rating,Bean_Type,Broad_Bean_Origin
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
,,1035.436,2012.323,71.70318,,3.185619,,
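
The "argument is not numeric or logical" warnings above appear because mean() is also being applied to the character columns, which is why those columns come back empty. One way to avoid this, sketched here on a made-up tibble (toy and its columns are illustrative), is to restrict the summary to numeric columns with across() and where(). This assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# Made-up data: one character column and two numeric columns
toy <- tibble(
  maker  = c("A", "B", "C"),
  rating = c(2.5, 3.0, 3.5),
  cocoa  = c(70, 60, 80)
)

# Only numeric columns are summarised, so there are no NA columns and no warnings
numeric_means <- toy %>%
  summarise(across(where(is.numeric), ~ mean(.x, na.rm = TRUE)))

numeric_means
```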


In [None]:
# Use summarise() with across() to find the standard deviation of each numeric column.
# The function to find the standard deviation is sd()
chocolateData %>%
  summarise(across(where(is.numeric), ~ sd(.x, na.rm = TRUE)))


REF,Review_Date,Cocoa_Percent,Rating
<dbl>,<dbl>,<dbl>,<dbl>
552.6843,2.926739,6.321543,0.47801


## Summarizing a specific variable

The functions we used above give you an overview of the entire dataset, but often you're only interested in one or two variables. We can look at specific variables really easily using the summarise() function and pipes. Pipes are part of the Tidyverse, which we loaded at the beginning: if you try to use them without loading the package, you'll get an error.

> A pipe, which looks like this: %>% is a special operator. It takes the output from whatever is on its left side and passes it as the first argument to the function on its right side.
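
Concretely, the value on the left of %>% becomes the first argument of the function on the right, so x %>% f() is the same as f(x), and x %>% f(y) is the same as f(x, y). A small sketch with made-up numbers, assuming dplyr (part of the Tidyverse) is installed:

```r
library(dplyr)  # makes the %>% pipe available

scores <- c(5, 6, 25, 16)

# These two lines compute the same thing
nested <- round(mean(scores), 1)
piped  <- scores %>% mean() %>% round(1)

nested  # 13
piped   # 13
```

The piped version reads in the order the steps happen, which is the main reason to prefer it once several steps are chained.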

Let's take our chocolate dataset and then pipe it to the summarise() function. The summarise() function will return a data frame, where each column contains a specific type of information we've asked for and has a name we've given it. In this example, we're going to get back two columns. One, called "averageRating", will have the average of the Rating column in it, while the second, called "sdRating", will have the standard deviation of the Rating column in it.

In [75]:
# return a data frame with the mean and sd of the Rating column
# from the chocolate dataset
chocolateData %>%
    summarise(averageRating = mean(Rating),
             sdRating = sd(Rating))

averageRating,sdRating
<dbl>,<dbl>
3.185619,0.47801
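
One thing to watch for with mean() and sd() inside summarise(): if a column contains missing values, both return NA unless na.rm = TRUE is passed. The Rating column summarised above produced real numbers, so it appears to have no missing values, but the sketch below shows the behaviour on a made-up tibble (toy is hypothetical).

```r
library(dplyr)

# Made-up ratings with one missing value
toy <- tibble(Rating = c(3, 4, NA, 5))

result <- toy %>%
  summarise(withNA    = mean(Rating),                # NA, because of the missing value
            withoutNA = mean(Rating, na.rm = TRUE))  # 4, the mean of 3, 4 and 5

result
```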


> ## Line breaks


In [None]:
# this is fine! :)
mean(c(5,6,25,16))

# this is fine! :)
mean(c(5,6,
       25,16))

# this won't break the code, but it's hard to read :(
mean(c(5,6,
25,16))

In [None]:
# this will break the code :'(
mean(c(5,6,2
      5,16))

ERROR: Error in parse(text = x, srcfile = src): <text>:3:7: unexpected numeric constant
2: mean(c(5,6,2
3:       5
         ^
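
The same care is needed with pipes: a line break is fine after %>%, but not before it, because R then treats the first line as a finished statement. The sketch below only checks whether each snippet is valid R syntax, using base R's parse(); the helper name parses_ok is made up for this example.

```r
# Returns TRUE if the text is syntactically valid R, FALSE otherwise
parses_ok <- function(code) {
  !inherits(try(parse(text = code), silent = TRUE), "try-error")
}

pipe_at_line_end   <- "c(5, 6, 25, 16) %>%\n    mean()"
pipe_at_line_start <- "c(5, 6, 25, 16)\n    %>% mean()"

parses_ok(pipe_at_line_end)    # TRUE
parses_ok(pipe_at_line_start)  # FALSE
```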


In [None]:
# Use a pipe (%>%) and the summarise() function to return a data frame
# with the average and sd of the Cocoa_Percent column.
# Clear column names are the most helpful.
chocolateData %>%
    summarise(averageCocoa_Percent = mean(Cocoa_Percent),
             sdCocoa_Percent = sd(Cocoa_Percent))

averageCocoa_Percent,sdCocoa_Percent
<dbl>,<dbl>
71.70318,6.321543


## Summarize a specific variable by group

At first pass, it may seem a bit silly to calculate the mean and standard deviation this way. You can see why this is such a powerful technique, however, when we look at the same variable across groups.

We can do this with a handy function called group_by(). When you pipe a dataset into the group_by() function and tell it the name of a specific column, it will look at all the values in that column and group together all the rows that have the same value in that column. Then, when you pipe that data into the summarise() function, it will return the values you asked for, for each group separately. Like so:

In [90]:
# Return the average and sd of ratings by the year a rating was given
chocolateData %>%
    group_by(Review_Date) %>%
    summarise(averageRating = mean(Rating),
             sdRating = sd(Rating))

Review_Date,averageRating,sdRating
<dbl>,<dbl>,<dbl>
2006,3.125,0.7691224
2007,3.162338,0.6998193
2008,2.994624,0.5442118
2009,3.073171,0.4591195
2010,3.148649,0.4663426
2011,3.256061,0.4899536
2012,3.178205,0.4835962
2013,3.197011,0.4461178
2014,3.189271,0.4148615
2015,3.246491,0.381096


In [None]:
# Return a data frame with the average and sd of Cocoa_Percent by the year of the review
chocolateData %>%
    group_by(Review_Date) %>%
    summarise(averageCocoa_Percent = mean(Cocoa_Percent),
             sdCocoa_Percent = sd(Cocoa_Percent))

Review_Date,averageCocoa_Percent,sdCocoa_Percent
<dbl>,<dbl>,<dbl>
2006,71.0,7.42474
2007,72.03896,6.951792
2008,72.69892,8.412962
2009,70.44309,6.895057
2010,70.77928,7.424678
2011,70.9697,5.377714
2012,71.52821,5.725056
2013,72.2663,8.325992
2014,72.25304,5.201014
2015,72.01404,5.258777
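
When comparing groups, it can help to know how many rows each group contains, since a mean based on a handful of reviews is less stable than one based on hundreds. A minimal sketch on made-up data (toy and its values are illustrative), using dplyr's n() to count the rows in each group:

```r
library(dplyr)

# Made-up reviews: three from 2014 and two from 2015
toy <- tibble(
  year   = c(2014, 2014, 2014, 2015, 2015),
  rating = c(3.0, 3.5, 4.0, 2.5, 3.5)
)

# One output row per year, with the group size next to the summary statistics
by_year <- toy %>%
  group_by(year) %>%
  summarise(reviews       = n(),
            averageRating = mean(rating),
            sdRating      = sd(rating))

by_year
```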


This is a really efficient way to start understanding the data. For example, it looks like chocolate bar ratings might be trending slightly upwards by year. 

To get a better understanding of this, however, I really want to graph this data so that I can see if there's been reliable change over time. So let's move on to the final part: graphing!