R datasets
====
R comes bundled with some standard data sets. To get a complete list:

In [None]:
library(help="datasets")

Description of a specific data set:

In [None]:
help(mtcars)

Data exploration
====
R provides several functions for getting a quick overview of a data set: its dimensions, its first and last rows, and summary statistics for each column.

In [None]:
mtcars

In [None]:
dim(mtcars)

In [None]:
head(mtcars)

In [None]:
tail(mtcars)

In [None]:
summary(mtcars)
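
Two related base R calls are often useful at this stage: `str()` shows the type of each column, and `summary()` also works on a single column. A minimal sketch on the bundled `mtcars` data:

```r
# Compact structure: one line per column with its type and first values
str(mtcars)

# Row and column counts separately (dim() returns both at once)
nrow(mtcars)
ncol(mtcars)

# Summary statistics for a single column
summary(mtcars$mpg)
```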

Reading a data file
===
Data is frequently stored in tabular form in text files. The `read.table()` function can read a file from your disk or from the internet and return a data frame containing that data.

Suppose we have a data file _mydata.txt_ with the following contents:

    Can 1.70 65
    Cem 1.75 66
    Hande 1.62 61
    Lale 1.76 64
    Arda 1.78 63
    Bilgin 1.77 84
    Cem 1.69 75
    Ozlem 1.75 65
    Ali 1.73 75
    Haluk 1.71 81

The file can be read into a data frame simply with:

In [None]:
hwdata <- read.table("mydata.txt")

In [None]:
hwdata

In [None]:
names(hwdata) <- c("Name", "Height","Weight")
hwdata

The function `read.table()` is quite versatile and has many parameters that tune its behavior. Its help page lists them all:

In [None]:
help(read.table)
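
As one illustration of those parameters, column names can be supplied at read time with `col.names`, and the `text` argument makes `read.table()` parse a string instead of a file, which is handy for experimenting. The sketch below reuses the first three rows of _mydata.txt_; the names `sample_text` and `hw_sample` are made up for this example.

```r
# First three rows of mydata.txt, given inline rather than read from disk
sample_text <- "Can 1.70 65
Cem 1.75 66
Hande 1.62 61"

# 'text' makes read.table() parse the string; 'col.names' names the columns
hw_sample <- read.table(text = sample_text,
                        col.names = c("Name", "Height", "Weight"))
hw_sample
```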

Now let's read the data file _mydata2.txt_ that has a header row, and set the column names of the resulting data frame accordingly:

    Name Height Weight
    Can 1.70 65
    Cem 1.75 66
    Hande 1.62 61
    Lale 1.76 64
    Arda 1.78 63
    Bilgin 1.77 84
    Cem 1.69 75
    Ozlem 1.75 65
    Ali 1.73 75
    Haluk 1.71 81

In [None]:
hwdata <- read.table("mydata2.txt",header = TRUE)
hwdata

Now read the file _mydata3.txt_ whose fields are separated with commas instead of spaces:

    Name,Height,Weight
    Can,1.70,65
    Cem,1.75,66
    Hande,1.62,61
    Lale,1.76,64
    Arda,1.78,63
    Bilgin,1.77,84
    Cem,1.69,75
    Ozlem,1.75,65
    Ali,1.73,75
    Haluk,1.71,81

In [None]:
hwdata <- read.table("mydata3.txt",header = TRUE, sep=",")
hwdata

Let's read the data file _mydata4.txt_ which contains some comments added by the data collector.

    Name,Height,Weight
    Can,1.70,65
    Cem,1.75,66
    # Here is a comment
    Hande,1.62,61
    Lale,1.76,64
    Arda,1.78,63
    Bilgin,1.77,84 # another comment
    Cem,1.69,75
    Ozlem,1.75,65
    Ali,1.73,75
    Haluk,1.71,81

The comment character can be set with the `comment.char` parameter to `read.table()`.

In [None]:
hwdata <- read.table("mydata4.txt",header = TRUE, sep=",", comment.char="#")
hwdata

Actually this was a redundant setting, because by default `comment.char` is already set to `"#"`.
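
The default gets in the way when `#` is ordinary data. A small sketch, with made-up inline data, of switching comment handling off by passing an empty string:

```r
# Hypothetical data in which '#' is part of a value, not a comment
tag_text <- "Name,Tag
Can,#1
Cem,#2"

# comment.char = "" disables comment processing altogether
tags <- read.table(text = tag_text, header = TRUE, sep = ",",
                   comment.char = "")
tags
```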

Sometimes the text in a field contains the separator character. To keep such a field intact, we enclose it in quotes in the data file, as in the _mydata5.txt_ file below:

    Name Height Weight
    "Can Can" 1.70 65
    "Cem Cem" 1.75 66
    "Hande Hande" 1.62 61
    "Lale Lale" 1.76 64
    "Arda Arda" 1.78 63
    "Bilgin Bilgin" 1.77 84
    "Cem Cim" 1.69 75
    "Ozlem Ozlem" 1.75 65
    "Ali Ali" 1.73 75
    "Haluk Haluk" 1.71 81

The function `read.table()` recognizes both single and double quotes by default.

In [None]:
hwdata <- read.table("mydata5.txt", header=TRUE)
hwdata

Other quote characters can be specified with the `quote` parameter to `read.table()`. For example, consider the data file _mydata6.txt_:

    Name Height Weight
    %Can Can% 1.70 65
    %Cem Cem% 1.75 66
    %Hande Hande% 1.62 61
    %Lale Lale% 1.76 64
    %Arda Arda% 1.78 63
    %Bilgin Bilgin% 1.77 84
    %Cem Cim% 1.69 75
    %Ozlem Ozlem% 1.75 65
    %Ali Ali% 1.73 75
    %Haluk Haluk% 1.71 81

In [None]:
hwdata <- read.table("mydata6.txt", header=TRUE, quote="%")
hwdata

Writing to a file
===

Suppose that we process the data file, e.g., add some columns.

In [None]:
hwdata <- read.table("mydata6.txt", header=TRUE, quote="%")
hwdata$BMI <- hwdata$Weight / hwdata$Height^2
hwdata$BMI <- round(hwdata$BMI, 2)  # round to two decimal places
hwdata

The function `write.table()` can be used to store a data frame in a file.

In [None]:
write.table(hwdata,"mydata7.txt")

This function writes the table together with the row names and column names:

    "Name" "Height" "Weight" "BMI"
    "1" "Can Can" 1.7 65 22.49
    "2" "Cem Cem" 1.75 66 21.55
    "3" "Hande Hande" 1.62 61 23.24
    "4" "Lale Lale" 1.76 64 20.66
    "5" "Arda Arda" 1.78 63 19.88
    "6" "Bilgin Bilgin" 1.77 84 26.81
    "7" "Cem Cim" 1.69 75 26.26
    "8" "Ozlem Ozlem" 1.75 65 21.22
    "9" "Ali Ali" 1.73 75 25.06
    "10" "Haluk Haluk" 1.71 81 27.7

We can omit the row and column names with the following parameter settings.

In [None]:
write.table(hwdata,"mydata7.txt",row.names = FALSE, col.names = FALSE)
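
To check what was written, a file can be read straight back. The sketch below is only an illustration: it writes a small hypothetical data frame `df` to a temporary file, comma-separated and without quotes, then reads it back with `read.table()`.

```r
# A small, made-up data frame for the round trip
df <- data.frame(Name = c("Can", "Cem"), Height = c(1.70, 1.75))

# Write to a temporary file: comma-separated, no row names, no quotes
path <- tempfile(fileext = ".csv")
write.table(df, path, sep = ",", row.names = FALSE, quote = FALSE)

# Read it back; the header line supplies the column names
df_back <- read.table(path, header = TRUE, sep = ",")
df_back
```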

Plotting
====

Scatter plots
-----

In [None]:
options(repr.plot.width=5, repr.plot.height=5)

A very simple plot:

In [None]:
plot(c(1,3,5,4,6))

A plot with separate x and y vectors:

In [None]:
plot(1:10, (1:10)^2)

Plot the weight of the people in our data set versus their height.

In [None]:
plot(hwdata$Height, hwdata$Weight)

This is a made-up data set, so it does not reflect reality. A real-life data set with the heights and weights of 10,000 people is available at https://github.com/johnmyleswhite/ML_for_Hackers/tree/master/02-Exploration/data (an internet connection is needed).

In [None]:
url <- "https://raw.githubusercontent.com/johnmyleswhite/ML_for_Hackers/master/02-Exploration/data/01_heights_weights_genders.csv"
heights_weights_gender <- read.table(url, header=TRUE, sep=",")

Let's see what the data looks like:

In [None]:
head(heights_weights_gender)

Put the heights and weights of the men into separate vectors and plot them.

In [None]:
men <- heights_weights_gender$Gender == "Male"
men_heights <- heights_weights_gender[["Height"]][men]
men_weights <- heights_weights_gender[["Weight"]][men]

In [None]:
plot(men_heights, men_weights)

Change the axis labels and add a plot title:

In [None]:
plot(men_heights, men_weights,xlab = "Height [inches]", ylab="Weight [pounds]",
     main="Weight vs height for men")

Change the marker type and color:

In [None]:
plot(men_heights, men_weights, pch=4, col="blue", xlab = "Height [inches]", ylab="Weight [pounds]")
title("Weight vs height for men")

Do the same for women:

In [None]:
women <- heights_weights_gender$Gender == "Female"
women_heights <- heights_weights_gender[["Height"]][women]
women_weights <- heights_weights_gender[["Weight"]][women]

In [None]:
plot(women_heights, women_weights, pch=20, col="red", xlab = "Height [inches]", ylab="Weight [pounds]")
title("Weight vs height for women")

Let's try to show them on the same plot.

In [None]:
plot(men_heights, men_weights, pch=4, col="blue", xlab = "Height [inches]", ylab="Weight [pounds]")
plot(women_heights, women_weights, pch=20, col="red", xlab = "Height [inches]", ylab="Weight [pounds]")
title("Weight vs height of adults")

This does not work: each call to `plot()` starts a new figure instead of drawing on the existing one. To overlay a second scatter plot, we use the `points()` function, which adds points to the existing plot.

In [None]:
plot(men_heights, men_weights, pch=4, col="blue", xlab = "Height [inches]", ylab="Weight [pounds]")
points(women_heights, women_weights, pch=20, col="red")  # axis labels belong to plot(), not points()
title("Weight vs height of adults")

The plot limits don't look right, because they are automatically set for the male data. Set the limits manually:

In [None]:
plot(men_heights, men_weights, pch=4, col="blue",
     xlab = "Height [inches]", ylab="Weight [pounds]",
     xlim = c(50,80), ylim = c(60,270))
points(women_heights, women_weights, pch=20, col="red")  # labels and limits are set by plot()
title("Weight vs height of adults")
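
Rather than hard-coding the limits, they can be derived from the data with `range()`. The sketch below uses simulated stand-in vectors so that it runs without an internet connection: `m_h`, `m_w`, `w_h` and `w_w` are hypothetical, drawn from arbitrary normal distributions, and are not the real data set.

```r
set.seed(1)
# Simulated stand-ins for the height/weight vectors (not real data)
m_h <- rnorm(100, mean = 69, sd = 3); m_w <- rnorm(100, mean = 187, sd = 20)
w_h <- rnorm(100, mean = 64, sd = 3); w_w <- rnorm(100, mean = 136, sd = 19)

# Limits that cover both groups
xlim_all <- range(m_h, w_h)
ylim_all <- range(m_w, w_w)

plot(m_h, m_w, pch = 4, col = "blue", xlim = xlim_all, ylim = ylim_all,
     xlab = "Height [inches]", ylab = "Weight [pounds]")
points(w_h, w_w, pch = 20, col = "red")
```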

We need a legend to understand which is which:

In [None]:
plot(men_heights, men_weights, pch=4, col="blue",
     xlab = "Height [inches]", ylab="Weight [pounds]",
     xlim = c(50,80), ylim = c(60,270))
points(women_heights, women_weights, pch=20, col="red")  # labels and limits are set by plot()
title("Weight vs height of adults")
legend("bottomright", c("Men","Women"), col=c("blue","red"), pch=c(4,20), inset=0.05, cex=0.8)

Histograms
----

You might be interested in the distribution of heights. Here's how we produce a histogram of the data.

In [None]:
hist(men_heights)

Increase the number of bins to about 20 (R treats `breaks` as a suggestion) and plot densities instead of counts, so that the bar areas sum to 1.

In [None]:
hist(men_heights, breaks=20, freq = FALSE)
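
With `freq = FALSE` the bar heights are densities, so the bar areas add up to 1. A quick check, sketched on the bundled `mtcars$mpg` column so that it runs without the downloaded data; `plot = FALSE` makes `hist()` return the computed bins without drawing:

```r
# Compute the histogram without drawing it
h <- hist(mtcars$mpg, breaks = 20, plot = FALSE)

# 'breaks' is only a suggestion, so inspect the bin edges actually used
h$breaks

# Bar area = density * bin width; the areas add up to 1
sum(h$density * diff(h$breaks))
```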

Let's show both genders and use color to tell them apart. The 4th argument of the `rgb()` function is the opacity (alpha) of the color, so 0.5 gives half-transparent bars:

In [None]:
hist(men_heights, breaks=20, freq = FALSE, col=rgb(0,0,1,0.5))
hist(women_heights, breaks=20, freq = FALSE, add=TRUE, col=rgb(1,0,0,0.5))

Fix the title and the x-label of the plot.

In [None]:
hist(men_heights, breaks=20, freq = FALSE, col=rgb(0,0,1,0.5),
     main="Male and female heights", xlab = "Height [inches]", ylim=c(0,0.15))
hist(women_heights, breaks=20, freq = FALSE, col=rgb(1,0,0,0.5), add=TRUE)

Density plots
----

R can estimate the distribution as a smooth curve, which might look better than a histogram. Plot the male and female heights with lines of thickness 2.

In [None]:
d1 <- density(men_heights)
d2 <- density(women_heights)
plot(d1, main="", xlab="", col="blue", lwd=2)
lines(d2, col="red", lwd=2)

Let's fix the plot limits and add text labels to mark the curves.

In [None]:
d1 <- density(men_heights)
d2 <- density(women_heights)
plot(d1, main="Height distribution", xlab="Height [inches]", col="blue", lwd=2, xlim = c(50,80))
lines(d2, col="red", lwd=2)
text(59, 0.12, "Women", col="red")
text(72, 0.12, "Men", col="blue")
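
The smoothness of the curve is governed by the kernel bandwidth, which `density()` picks automatically. The `adjust` argument scales that choice. A sketch on the bundled `mtcars$mpg` column, used here only so that the example is self-contained:

```r
d_default <- density(mtcars$mpg)             # automatic bandwidth
d_smooth  <- density(mtcars$mpg, adjust = 2) # twice the bandwidth: smoother

plot(d_default, main = "Effect of bandwidth", lwd = 2)
lines(d_smooth, lwd = 2, lty = 2)

# The bandwidth actually used is stored in the result
c(d_default$bw, d_smooth$bw)
```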

Line plots
----

Let's use the built-in EuStockMarkets data set to illustrate line plots.

In [None]:
help(EuStockMarkets)

In [None]:
head(EuStockMarkets)

This is a time series object. Let's convert it to a data frame:

In [None]:
eustock <- as.data.frame(EuStockMarkets)
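
The conversion drops the time index. If the dates are needed later, `time()` extracts them from the original time series as decimal years. A sketch, where the names `eu_df` and `Time` are chosen just for this example:

```r
eu_df <- as.data.frame(EuStockMarkets)

# time() returns the sampling times of a ts object as decimal years
eu_df$Time <- as.numeric(time(EuStockMarkets))

head(eu_df)
plot(eu_df$Time, eu_df$DAX, type = "l", xlab = "Year", ylab = "DAX")
```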

Plot the DAX index as a line.

In [None]:
plot(eustock$DAX, type="l")

Plot DAX with a thick red line.

In [None]:
plot(eustock$DAX, type="l", col="red", lwd=3)

Plot DAX and SMI together:

In [None]:
plot(eustock$DAX, type="l", col="red")
lines(eustock$SMI, col="green")

Let's plot all the stocks on the same plot:

In [None]:
names(eustock)

In [None]:
# Set y-limits that cover all the indices; by default they fit DAX only
plot(eustock$DAX, type="l", col="red", xlab="Business days", ylab="Stock value",
     ylim=range(eustock))
nstocks <- length(names(eustock))
colors <- rainbow(nstocks)  # the first rainbow color is red, matching DAX
for (i in 2:nstocks) {
    lines(eustock[[i]], col=colors[i])
}
legend("topleft", names(eustock), col=colors, lty=rep(1,nstocks), inset=0.05)
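
The loop can also be replaced by `matplot()`, which draws every column of a matrix in one call and chooses y-limits that cover all of them. A sketch of that alternative, with `stocks` and `cols` as names chosen for this example:

```r
stocks <- as.data.frame(EuStockMarkets)
cols <- rainbow(ncol(stocks))

# One line per column; lty = 1 keeps all lines solid
matplot(as.matrix(stocks), type = "l", lty = 1, col = cols,
        xlab = "Business days", ylab = "Stock value")
legend("topleft", names(stocks), col = cols, lty = 1, inset = 0.05)
```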

Plotting functions
-----

If we have a function with a known mathematical expression, we can plot it easily. For example, consider the function $y = \mathrm{e}^{-0.1x^2}\sin(x)$.

In [None]:
x <- seq(-10,10, length.out = 300)
y <- exp(-0.1*x^2)*sin(x)

In [None]:
plot(x,y, type="l", col="darkgreen")
title("A function")
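
Base R also provides `curve()`, which evaluates an expression in `x` over an interval and draws it, so the grid does not have to be built by hand. A sketch using the same expression as the code above:

```r
# curve() evaluates the expression at n points between 'from' and 'to'
res <- curve(exp(-0.1 * x^2) * sin(x), from = -10, to = 10, n = 300,
             col = "darkgreen", ylab = "y")
title("A function")

# The evaluated points are returned invisibly as a list with x and y
length(res$x)
```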

Dot, bar, and pie charts
----

In [None]:
mtcars

In [None]:
dotchart(mtcars$mpg,
         labels=row.names(mtcars),cex=.7,
         main="Gas Mileage for Car Models", 
         xlab="Miles Per Gallon")

In [None]:
table(mtcars$cyl)

In [None]:
barplot(table(mtcars$cyl), main="Cylinder distribution", xlab = "Cylinders")

In [None]:
counts <- table(mtcars$cyl)
barplot(counts, main="Cylinder Distribution", horiz=TRUE,
  names.arg=c("4 cyl", "6 cyl", "8 cyl"))

In [None]:
pie(counts, labels=c("4 cyl", "6 cyl", "8 cyl"),
    col = rainbow(length(counts)),
    main = "Cylinder distribution")
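
The labels above are typed by hand, which silently breaks if the categories change. They can instead be built from the names of the table. A sketch, with `cyl_labels` as a name chosen for this example:

```r
counts <- table(mtcars$cyl)

# names(counts) holds the category values as strings: "4", "6", "8"
cyl_labels <- paste(names(counts), "cyl")

barplot(counts, main = "Cylinder distribution", horiz = TRUE,
        names.arg = cyl_labels)
```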