# Reviewing Tables in R; Correlation vs. Causality
## PS 3 week 3 sections - Junwoo 
<p style='text-align: right;'> Credits to: Yue Lin </p>

Today, we will play around with data in R. We will use the built-in `iris` dataset in R (i.e., no need to import any dataset). 

In [1]:
# We'll be looking at data about Iris flowers 
# Run this cell! You can ignore the details of the code below.

iris$setosa <- ifelse(iris$Species == "setosa", 1, 0) # Creating dummy variables
iris$versicolor <- ifelse(iris$Species == "versicolor", 1, 0)
iris$virginica <- ifelse(iris$Species == "virginica", 1, 0)

head(iris) 
# The head() function displays the first few rows of a data frame (6 by default)
# instead of all 150

Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species,setosa,versicolor,virginica
5.1,3.5,1.4,0.2,setosa,1,0,0
4.9,3.0,1.4,0.2,setosa,1,0,0
4.7,3.2,1.3,0.2,setosa,1,0,0
4.6,3.1,1.5,0.2,setosa,1,0,0
5.0,3.6,1.4,0.2,setosa,1,0,0
5.4,3.9,1.7,0.4,setosa,1,0,0


In [31]:
# Remember we can get values in a column by using $
head(iris$Sepal.Length) 

In [32]:
# Remember, we can also give things names and reference that name
# For example -- what's the median petal length for all flowers in the dataset?
all_petal_lengths <- iris$Petal.Length
median(all_petal_lengths)

## Subsetting

Let's focus on only the `Iris virginica` flowers in the dataset. We can do this by using the `subset` function, which takes the following arguments:

`subset(table, column_logical)`

A logical in R is the same thing as a Boolean in Python. In other words, it is a value (or set of values) that is either `TRUE` or `FALSE`. You can also think of them as 1 and 0.

We can get these by doing something called a Boolean comparison, where we compare one value to another. If the condition holds, the comparison returns `TRUE`; otherwise it returns `FALSE`. Here are some common comparisons:

| Comparison | R Code |
| - | - |
| does x equal y? | x == y |
| does x NOT equal y? | x != y |
| is x less than y? | x < y |
| is x greater than y? | x > y |
| is x less than or equal to y? | x <= y |
| is x greater than or equal to y? | x >= y |


In [33]:
# Logical example in R
x <- 5
y <- 10

x == y
x < y
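
The same comparisons also work on a whole column at once, which is what `subset` relies on. Here is a minimal sketch; the cutoff of 7 and the names `is_long` and `n_long` are just choices made for this illustration.

```r
# Comparing a whole column to a value returns one TRUE/FALSE per row
is_long <- iris$Sepal.Length > 7
head(is_long)

# TRUE counts as 1 and FALSE as 0, so sum() counts the rows that pass
n_long <- sum(is_long)
n_long

# The count matches the number of rows that subset() keeps
nrow(subset(iris, Sepal.Length > 7))
```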

In [34]:
# Let's practice using this to subset:
# subset(table, column_name <comparison> <value>)
virginica <- subset(iris, virginica == 1)
head(virginica)

,Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species,setosa,versicolor,virginica
,<dbl>,<dbl>,<dbl>,<dbl>,<fct>,<dbl>,<dbl>,<dbl>
101,6.3,3.3,6.0,2.5,virginica,0,0,1
102,5.8,2.7,5.1,1.9,virginica,0,0,1
103,7.1,3.0,5.9,2.1,virginica,0,0,1
104,6.3,2.9,5.6,1.8,virginica,0,0,1
105,6.5,3.0,5.8,2.2,virginica,0,0,1
106,7.6,3.0,6.6,2.1,virginica,0,0,1


In [35]:
# Alternatively, let's use another way to subset:
virginica1 <- subset(iris, Species == 'virginica')
head(virginica1)

# Aha, `virginica` and `virginica1` are identical

,Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species,setosa,versicolor,virginica
,<dbl>,<dbl>,<dbl>,<dbl>,<fct>,<dbl>,<dbl>,<dbl>
101,6.3,3.3,6.0,2.5,virginica,0,0,1
102,5.8,2.7,5.1,1.9,virginica,0,0,1
103,7.1,3.0,5.9,2.1,virginica,0,0,1
104,6.3,2.9,5.6,1.8,virginica,0,0,1
105,6.5,3.0,5.8,2.2,virginica,0,0,1
106,7.6,3.0,6.6,2.1,virginica,0,0,1
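
Comparing the first six rows by eye is not a full check. One way to compare the two tables in their entirety is the base R function `identical()`, sketched below. The names `flowers`, `by_dummy`, `by_species`, and `same_tables` are made up for this example, and the dummy column is rebuilt so the chunk runs on its own.

```r
# Work on a copy of iris and rebuild the virginica dummy column
flowers <- iris
flowers$virginica <- ifelse(flowers$Species == "virginica", 1, 0)

# The two ways of subsetting shown above
by_dummy   <- subset(flowers, virginica == 1)
by_species <- subset(flowers, Species == "virginica")

# identical() returns TRUE only if the two objects are exactly the same
same_tables <- identical(by_dummy, by_species)
same_tables
```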


## A step further

Using the `subset` function and some other R code, create a table that only has rows of Iris setosa flowers that have a sepal length smaller than the mean sepal length of ALL virginica flowers. Call this new table `small_setosas`.


In [36]:
# First, find the mean sepal length of all virginica flowers
avg_virginica_sepal_length <- mean(virginica$Sepal.Length)

# Then create a table called "setosa" that contains only the setosa flowers
setosa <- subset(iris, setosa == 1)

# Next, keep only the setosas whose sepal length is smaller than
# the mean we found in the first step
small_setosas <- subset(setosa, Sepal.Length < avg_virginica_sepal_length)

# Let's have a look at the table
head(small_setosas)

,Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species,setosa,versicolor,virginica
,<dbl>,<dbl>,<dbl>,<dbl>,<fct>,<dbl>,<dbl>,<dbl>
1,5.1,3.5,1.4,0.2,setosa,1,0,0
2,4.9,3.0,1.4,0.2,setosa,1,0,0
3,4.7,3.2,1.3,0.2,setosa,1,0,0
4,4.6,3.1,1.5,0.2,setosa,1,0,0
5,5.0,3.6,1.4,0.2,setosa,1,0,0
6,5.4,3.9,1.7,0.4,setosa,1,0,0
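
The same result can probably be reached in a single `subset` call by joining two conditions with `&` (logical AND). This is only a sketch of an alternative, not the required solution. The names `avg_virginica` and `small_setosas_one_step` are made up here, and the code uses the `Species` column directly so it does not depend on the dummy variables.

```r
# Mean sepal length of all virginica flowers
avg_virginica <- mean(iris$Sepal.Length[iris$Species == "virginica"])

# Both conditions must be TRUE for a row to be kept
small_setosas_one_step <- subset(
  iris,
  Species == "setosa" & Sepal.Length < avg_virginica
)

# How many setosas are kept, compared with how many there are in total?
nrow(small_setosas_one_step)
sum(iris$Species == "setosa")
```

Because `head` shows only six rows, comparing these two counts is a quick way to see how many setosas actually pass the filter.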


## Creating tables

We can use the `table` function to create one-way and two-way tables. A one-way table counts the rows in each category of a single variable; a two-way table counts the rows in each combination of categories of two variables. To use the `table` function, just plug in the column (or columns) that we want to check.

| One way | Two way |
| - | - | 
| table(data\$var1) | table(data\$var1, data\$var2) |

In [37]:
# Let's see how many flowers are in each category!
table(iris$Species)


    setosa versicolor  virginica 
        50         50         50 
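
For a two-way table, pass two columns (or any two vectors of the same length). A small sketch follows. The second variable, `long_sepal`, is not in the dataset; it is built here purely for illustration, and `species_by_length` is likewise a made-up name.

```r
# Flag flowers whose sepal is longer than the overall average
long_sepal <- iris$Sepal.Length > mean(iris$Sepal.Length)

# Rows are species; columns are FALSE/TRUE for long_sepal
species_by_length <- table(iris$Species, long_sepal)
species_by_length
```

Each row of the result still adds up to the species count from the one-way table.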

## Correlation vs. Causality

It's very important to distinguish between correlation and causality. Correlation means that two variables are linearly related, without making any statement about cause and effect. By contrast, causality describes a relationship in which one event or process brings about the other.
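
As a quick illustration, base R's `cor` function computes the (Pearson) correlation between two numeric columns. The choice of the two petal columns and the name `petal_cor` are just for this sketch.

```r
# Pearson correlation between petal length and petal width
petal_cor <- cor(iris$Petal.Length, iris$Petal.Width)
petal_cor
```

A value close to 1 indicates a strong positive linear relationship, but the number alone says nothing about whether one variable causes the other.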

Why might “A correlates with B” not mean “A causes B”?


This may be due to:

1. **Reverse causality**: B may actually cause A.
2. **Omitted variable bias**: a third variable `C` causes both `A` and `B`. Note that `C` does not need to be a variable in the given dataset.
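
The omitted-variable case can be illustrated with made-up data. The simulation below is a sketch under stated assumptions: the data are randomly generated (not from `iris`), the seed and sample size are arbitrary, and `lm` and `resid` are base R functions used here only to remove the part of `A` and `B` explained by `C`.

```r
set.seed(42)  # arbitrary seed so the simulation is repeatable
n <- 10000

# C is the hidden common cause; A and B each depend on C plus their own noise
C <- rnorm(n)
A <- C + rnorm(n)
B <- C + rnorm(n)

# A and B are clearly correlated even though neither one causes the other
cor_ab <- cor(A, B)
cor_ab

# Remove the part of A and B explained by C, then correlate what is left
cor_ab_given_c <- cor(resid(lm(A ~ C)), resid(lm(B ~ C)))
cor_ab_given_c
```

Once `C` is accounted for, the remaining correlation between `A` and `B` should be close to zero, which is what we would expect if `C` drives the whole relationship.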

## Spurious correlation

Definition: a mathematical relationship in which two or more events or variables are associated but not causally related. The association may arise by pure coincidence or from a third, unseen factor (i.e., a **confounding factor**).

Let's see some online examples here: https://www.tylervigen.com/spurious-correlations