## Module 4 Practice - Histograms

This practice notebook contains exercises for plotting histograms in **R** using the **ggplot2** and **plotly** libraries.

A **histogram** is a bar chart in which each bar shows the number of data items that fall into the corresponding bin on the x axis.
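
The binning behind a histogram can be sketched in base R without any plotting. This is only an illustration: the `calories` values below are hypothetical and are not taken from the USDA data.

```r
# Hypothetical toy values, not from the USDA data
calories <- c(12, 18, 25, 31, 37, 38, 44, 59)

# Bin edges every 10 units: 10, 20, ..., 60
breaks <- seq(10, 60, by = 10)

# Assign each value to a bin; right = FALSE gives left-closed intervals
# such as [10,20), and include.lowest = TRUE closes the last bin at 60
bins <- cut(calories, breaks = breaks, right = FALSE, include.lowest = TRUE)

# Count the items per bin; these counts are the bar heights of a histogram
bin_counts <- table(bins)
print(bin_counts)
```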


In ggplot, histograms can be plotted either by modifying `geom_bar()` or, more simply, by using `geom_histogram()`.

We will read the USDA data to plot some histogram examples. 

In [None]:
usda_data = read.csv("/dsa/data/all_datasets/USDA.csv")
head(usda_data)

In [None]:
summary(usda_data)

Remove the rows in which the Calories variable is NA.

In [None]:
usda_data=usda_data[!is.na(usda_data$Calories),]
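
To see what this subsetting does, here is a minimal sketch on a hypothetical three-row data frame (`toy` and its values are made up for illustration). `is.na()` flags the missing entries, `!` negates the flags, and the resulting logical vector selects the rows to keep.

```r
# Hypothetical toy data frame with one missing Calories value
toy <- data.frame(
  Food     = c("a", "b", "c"),
  Calories = c(100, NA, 250)
)

# TRUE where Calories is present, FALSE where it is NA
keep <- !is.na(toy$Calories)

# Keep only the rows with a Calories value (all columns retained)
toy_clean <- toy[keep, ]
print(toy_clean)
```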

### Qplot in ggplot

`qplot()` (short for "quick plot") is a convenient wrapper around ggplot that creates many different types of plots through a consistent calling scheme similar to R's base graphics. Note that `qplot()` is deprecated as of ggplot2 3.4.0, so newer versions will print a deprecation warning.

In the plot below, a histogram is drawn by passing the string **`"histogram"`** to the **`geom`** parameter.

**`binwidth` tells ggplot to form bins of the specified width.** With a binwidth of 10, each bin in the plot below covers a 10-calorie range on the x axis (the exact edges depend on how ggplot aligns the bins). The data items falling within each range are counted, and that count is drawn as the height of the corresponding bar.

In [None]:
library(ggplot2)
qplot(Calories, data=usda_data, geom="histogram",binwidth=10)

The **`weight`** aesthetic, when used with histograms or bar charts, creates weighted histograms and bar charts.

**Here, the height of each bar no longer represents the count of observations, but the sum of another variable (Protein in this example) over the observations in that bin.**

In [None]:
qplot(Calories, data=usda_data, geom="histogram", weight=Protein, binwidth=10, ylab = "Protein") 
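
The difference between a counted and a weighted histogram can be sketched in base R. The data frame below is hypothetical (six made-up foods). The weighted version sums `Protein` within each calorie bin, which is, as far as this sketch goes, what the `weight` aesthetic asks ggplot to do.

```r
# Hypothetical toy data, not from the USDA data
toy <- data.frame(
  Calories = c(52, 55, 58, 61, 67, 73),
  Protein  = c(1.0, 2.5, 0.5, 4.0, 3.0, 6.0)
)

# Bins of width 10: [50,60), [60,70), [70,80)
bins <- cut(toy$Calories, breaks = seq(50, 80, by = 10), right = FALSE)

# Ordinary histogram: number of observations per bin
unweighted <- table(bins)

# Weighted histogram: sum of Protein over the observations in each bin
weighted <- tapply(toy$Protein, bins, sum)

print(unweighted)
print(weighted)
```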

---


### Layered Grammar of ggplot

We can use the ggplot syntax instead of qplot to create plots that follow ggplot's layered grammar convention. The same histogram can also be plotted like this:

In [None]:

(p <- ggplot(usda_data, aes(x=Calories)) + geom_histogram(binwidth=10, fill="lightblue") + ylab("Frequency"))
    


### Density Curve on Histogram

A density curve, which estimates the probability density function of the variable, can be drawn on top of a histogram. The density is overlaid as a semi-transparent layer; the `alpha` value controls the level of transparency, as shown in the example below. This illustrates the layered structure of ggplot, where two layers (histogram and density) are drawn on the same plot.

**`..density..`** is a derived variable that ggplot computes on the fly: the bin counts rescaled so that the bar areas sum to 1. In ggplot2 3.4.0 and later this notation is deprecated in favor of `after_stat(density)`.

In [None]:
# Histogram with density plot
ggplot(usda_data, aes(x=Calories)) + 
 geom_histogram(aes(y=..density..), colour="black", fill="lightblue", binwidth=10) +
 geom_density(alpha=.2, fill="red") 
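
As a minimal sketch, the same overlay can be written with `after_stat(density)`, assuming ggplot2 3.3.0 or later is installed. The data here is simulated (`toy` is a hypothetical stand-in for `usda_data`), and `layer_data()` is used only to check that the histogram bars on the density scale have a total area of 1.

```r
library(ggplot2)

set.seed(1)
# Simulated stand-in for the Calories column (hypothetical values)
toy <- data.frame(Calories = rgamma(500, shape = 2, scale = 100))

# Histogram on the density scale with a semi-transparent density layer on top
p_density <- ggplot(toy, aes(x = Calories)) +
  geom_histogram(aes(y = after_stat(density)), colour = "black",
                 fill = "lightblue", binwidth = 10) +
  geom_density(alpha = 0.2, fill = "red")

# Extract the computed bins of the first layer (the histogram)
bins <- layer_data(p_density, 1)

# Sum of bar areas (height times width); should be 1 up to rounding
total_area <- sum(bins$density * (bins$xmax - bins$xmin))
print(total_area)

p_density
```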

---

## Plotly Library

We can plot the same histogram with **Plotly**, which lets us **interact with the plot**. We can either convert an existing ggplot object directly into a plotly plot with `ggplotly()`, or build the plot from scratch with the `plot_ly()` function.

If the following cells do not produce plots in the first run, **run them again** and plots should appear in the second run. 

In [None]:
library(plotly)

p <- ggplot(usda_data, aes(x=Calories)) + geom_histogram(binwidth=10, colour="black", fill="lightblue") + ylab("Frequency")
# Pass the ggplot object directly to plotly.
# Rendering may take a while; once the plot appears, move your mouse over it.
ggplotly(p)

In [None]:
# Finally, draw the same histogram with the plot_ly() function
plot_ly(x=usda_data$Calories , type ="histogram")
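
The call above lets plotly choose the bins automatically, so the result need not match the `binwidth=10` used in the ggplot examples. As a hedged sketch, the bin width can be requested through the `xbins` attribute of plotly's histogram trace. The `calories` vector below is simulated and only stands in for `usda_data$Calories`.

```r
library(plotly)

set.seed(1)
# Simulated stand-in for usda_data$Calories (hypothetical values)
calories <- rgamma(500, shape = 2, scale = 100)

# xbins$size requests a bin width of 10,
# analogous to binwidth = 10 in geom_histogram()
fig <- plot_ly(x = calories, type = "histogram", xbins = list(size = 10))
fig
```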

### scale_fill_gradient


We can use `scale_fill_gradient()` to color the bars according to their frequency. In the plot below, blue bars mark the calorie ranges that contain the most food items, and tan bars mark ranges that are sparsely populated in the dataset. This is an example of using two visual channels (bar height and bar color) to represent the same quantity.

In [None]:
p <- ggplot(usda_data, aes(x=Calories))
p + geom_histogram(aes(fill = ..count..), binwidth=10) +
  scale_fill_gradient("Count", low = "tan", high = "blue")