# Data frames
Data frames are essentially the tables you know from Excel or from around the internet. A data.frame is usually the product of a long effort to preprocess and clean your data. To tie this to what we already know: a data.frame is a list of vectors of the same length, with extra functionality to easily access rows of data across those vectors.

All columns of a data frame MUST have the same length; missing values are represented by NA (or NaN for undefined numeric results). NULL cannot serve as a placeholder, because it is dropped when a vector is built.
And, as with vectors, each column can hold only a single data type.
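Both constraints can be checked with a minimal sketch. The data frames `people` and `mixed` and all their values are made up for illustration; `stringsAsFactors = FALSE` is set explicitly so the result does not depend on the R version.

```r
# Columns of equal length; a missing measurement is stored as NA.
people <- data.frame(
  age    = c(17, 23, NA),        # NA marks the missing age
  smoker = c(TRUE, TRUE, FALSE)
)
sapply(people, class)            # one type per column: "numeric", "logical"

# Mixing types inside one column coerces everything to a common type.
mixed <- data.frame(value = c(1, "two", 3), stringsAsFactors = FALSE)
class(mixed$value)               # "character"

# Columns whose lengths are incompatible are rejected with an error.
result <- try(data.frame(a = 1:3, b = 1:2), silent = TRUE)
inherits(result, "try-error")    # TRUE
```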

In [32]:
set.seed(1)
age = sample(c(10:25), 25, replace = T)
gender = sample(c("male", "female"), 25, replace = T)
smoker = sample(c(T, F), 25, replace = T)
BMI = rnorm(25, 20, 2)

df = data.frame(age = age, gender = gender, smoker = smoker, BMI = BMI)

A few simple functions help you examine a data.frame:

In [33]:
head(df)

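|   | age | gender | smoker | BMI |
|---|-----|--------|--------|-----|
| 1 | 14 | male | TRUE | 22.4766082017068 |
| 2 | 15 | male | FALSE | 19.4413074362915 |
| 3 | 19 | male | TRUE | 23.5158061796214 |
| 4 | 24 | female | TRUE | 21.1214921817761 |
| 5 | 13 | male | TRUE | 19.0944320548937 |
| 6 | 24 | male | TRUE | 18.3359134077643 |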


In [34]:
summary(df)

      age           gender     smoker             BMI       
 Min.   :10.00   female:14   Mode :logical   Min.   :15.55  
 1st Qu.:14.00   male  :11   FALSE:8         1st Qu.:18.34  
 Median :19.00               TRUE :17        Median :19.65  
 Mean   :18.04               NA's :0         Mean   :19.84  
 3rd Qu.:22.00                               3rd Qu.:21.12  
 Max.   :25.00                               Max.   :24.88  

In [35]:
nrow(df)
ncol(df)
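A few more base R helpers are often useful next to `head()`, `summary()`, `nrow()` and `ncol()`. The sketch below uses a small made-up data frame called `people` rather than the `df` from above.

```r
# Made-up data frame, used only for illustration.
people <- data.frame(age = c(17, 23, 25), smoker = c(TRUE, TRUE, FALSE))

dim(people)       # rows and columns at once: 3 2
names(people)     # column names: "age" "smoker"
str(people)       # compact overview of each column's type and first values
tail(people, 2)   # last two rows, the counterpart of head()
```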

## Columns
Remember that each column is basically a vector. Therefore, once you select that vector, you can run any function on it. It is also important to know the different ways of subsetting lists. Single brackets [n] select the n-th element of a list WITH its name, wrapped in a list again (for a data frame: a one-column data frame), therefore they don't return a vector per se. Double brackets [[n]], on the other hand, return the element itself, i.e. the vector.

In [36]:
df[3]
df[[3]]

|    | smoker |
|----|--------|
| 1 | TRUE |
| 2 | FALSE |
| 3 | TRUE |
| 4 | TRUE |
| 5 | TRUE |
| 6 | TRUE |
| 7 | TRUE |
| 8 | FALSE |
| 9 | FALSE |
| 10 | TRUE |
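The difference between the two bracket styles is easiest to see by checking the class of what comes back. This is a minimal sketch on a made-up data frame `people`:

```r
people <- data.frame(age = c(17, 23, 25), smoker = c(TRUE, TRUE, FALSE))

one_col <- people[2]      # single bracket: still a data frame with one column
values  <- people[[2]]    # double bracket: the underlying vector

class(one_col)            # "data.frame"
class(values)             # "logical"
sum(values)               # vector functions work directly: 2
```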


Another way of selecting a vector follows the list way of selecting elements by name, using the $ operator. This selection is effectively the same as the selection with [[n]]. But remember that if you want to use the name of a column inside brackets, you need to pass it as a string, [["smoker"]]; otherwise R will look for a variable called smoker.

In [37]:
df$smoker
df[["smoker"]]
df[["smoker"]] == df$smoker
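One practical consequence is worth sketching: a column name stored in a variable works with `[[ ]]` but not with `$`. The data frame `people` and the variable `col_name` are made up for this example.

```r
people <- data.frame(age = c(17, 23, 25), smoker = c(TRUE, TRUE, FALSE))

identical(people$smoker, people[["smoker"]])   # TRUE

# A column name stored in a variable works with [[ ]] ...
col_name <- "age"
people[[col_name]]                             # 17 23 25

# ... but $ takes the name literally and looks for a column called "col_name".
is.null(people$col_name)                       # TRUE, no such column
```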

The data.frame's own way of selecting columns is the df[ROW, COLUMN] form. The column part accepts numbers as well as strings, including a string stored in a variable.

In [38]:
df[,3]
df[,"smoker"]


In [39]:
a = "BMI"
df[, a]
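With the df[ROW, COLUMN] form, the shape of the result depends on how many columns are selected. The following sketch uses a made-up data frame `people`; `drop = FALSE` is a standard argument of data frame indexing that keeps a single column as a data frame.

```r
people <- data.frame(age    = c(17, 23, 25),
                     smoker = c(TRUE, TRUE, FALSE),
                     weight = c(65, 87, 74))

people[, "age"]                  # one column comes back as a plain vector
people[, "age", drop = FALSE]    # drop = FALSE keeps it as a data frame
people[, c("age", "weight")]     # several columns: always a data frame
people[2, "weight"]              # one cell: row 2 of weight, 87
```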

## Subsetting
When we talk about subsetting data frames, we usually mean selecting rows while keeping all columns. If you only want to keep some columns, use the techniques presented above.

There are many ways to subset a data frame. The first thing to realise is that a data frame is a list of vectors, so we can use the same functionality that lists have. The df[ROW, COLUMN] form will also come in handy. If in doubt, go back to the variables lecture about lists.

Basically, there are two major ways of subsetting: common indexing and functions.

### Indexing
Indexing works with either a logical vector or a vector of row indices. Imagine the following data frame:

|age | smoker | weight |
|----|--------|--------|
| 17 |   yes  |  65    |
| 23 |   yes  |  87    |
| 25 |   no   |  74    |


In [40]:
small_df = data.frame(age = c(17, 23, 25), smoker = c(T, T, F), weight = c(65, 87, 74))

You can then select the second row in these two ways:

In [41]:
small_df[c(F, T, F),]
small_df[2,]

|   | age | smoker | weight |
|---|-----|--------|--------|
| 2 | 23 | TRUE | 87 |

|   | age | smoker | weight |
|---|-----|--------|--------|
| 2 | 23 | TRUE | 87 |
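The same idea extends to several rows at once. The sketch below rebuilds the three-row table under the made-up name `small` and selects rows 1 and 3 in three equivalent ways, including a negative index that drops a row.

```r
small <- data.frame(age    = c(17, 23, 25),
                    smoker = c(TRUE, TRUE, FALSE),
                    weight = c(65, 87, 74))

by_position <- small[c(1, 3), ]                # rows 1 and 3 by position
by_logical  <- small[c(TRUE, FALSE, TRUE), ]   # same rows through a logical vector
dropped     <- small[-2, ]                     # negative index drops row 2

by_position
```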


#### Number indexing

In [42]:
age20smoker = which(df$age > 20 & df$smoker) # creating vector of row indices
age20smoker
df[age20smoker,]

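|    | age | gender | smoker | BMI |
|----|-----|--------|--------|-----|
| 4 | 24 | female | TRUE | 21.1214921817761 |
| 6 | 24 | male | TRUE | 18.3359134077643 |
| 7 | 25 | female | TRUE | 17.6668589058306 |
| 17 | 21 | female | TRUE | 19.8902450525768 |
| 21 | 24 | female | TRUE | 15.5521994519801 |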
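`which()` is what turns a condition into row numbers: it returns the positions where a logical vector is TRUE. A minimal sketch on made-up vectors (the NA is included to show that `which()` skips it):

```r
age    <- c(14, 24, 19, 25, 21)
smoker <- c(TRUE, TRUE, FALSE, TRUE, NA)

condition <- age > 20 & smoker
condition          # FALSE TRUE FALSE TRUE NA
which(condition)   # 2 4: positions of TRUE; FALSE and NA are skipped
```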


#### Logical indexing

Indexing with a logical vector is much more common, but maybe a bit harder to wrap your head around. It keeps every element whose position in the logical vector is TRUE.


In [43]:
numbers = 1:10
keep = rep(c(T,F), 5) # named keep so that it does not mask the built-in log()
numbers
keep
numbers[keep]

You can use a logical vector with one value per row to subset a data frame in the same way:

In [44]:
age20smoker = age > 20 & smoker #creating logical vector
age20smoker
df[age20smoker,]

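|    | age | gender | smoker | BMI |
|----|-----|--------|--------|-----|
| 4 | 24 | female | TRUE | 21.1214921817761 |
| 6 | 24 | male | TRUE | 18.3359134077643 |
| 7 | 25 | female | TRUE | 17.6668589058306 |
| 17 | 21 | female | TRUE | 19.8902450525768 |
| 21 | 24 | female | TRUE | 15.5521994519801 |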
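A logical row selector has exactly one TRUE/FALSE per row, which also makes it handy for counting. The sketch below uses a made-up data frame `people` and a selector called `keep_rows`:

```r
people <- data.frame(age    = c(17, 23, 25, 30),
                     smoker = c(TRUE, TRUE, FALSE, FALSE))

keep_rows <- people$age > 20 & !people$smoker   # one TRUE/FALSE per row
length(keep_rows) == nrow(people)               # TRUE
people[keep_rows, ]                             # rows 3 and 4
sum(keep_rows)                                  # number of matching rows: 2
mean(keep_rows)                                 # share of matching rows: 0.5
```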


In [45]:
select_last = c(rep(F, 24), T)
select_last
df[select_last,]

|    | age | gender | smoker | BMI |
|----|-----|--------|--------|-----|
| 25 | 14 | female | TRUE | 18.1187016747628 |


In [46]:
df_smokers = df[smoker,]
df_smokers$BMI
mean(df_smokers$BMI)

In [47]:
zeny = gender == "female"
age22 = age > 22
zeny22 = zeny & age22
df[zeny22,]

|    | age | gender | smoker | BMI |
|----|-----|--------|--------|-----|
| 4 | 24 | female | TRUE | 21.1214921817761 |
| 7 | 25 | female | TRUE | 17.6668589058306 |
| 18 | 25 | female | FALSE | 20.5002826457083 |
| 21 | 24 | female | TRUE | 15.5521994519801 |


In [48]:
# maximal BMI among male non-smokers younger than 24
males = gender == "male"
age24 = age < 24
nonsmoker = !smoker
male24nonsmoker = males & age24 & nonsmoker
max(df[male24nonsmoker,]$BMI) # max() reduces the selected BMI values to the largest one
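The same kind of query can be written as a single condition that refers to the columns of the data frame directly, so it does not depend on the standalone vectors still being in the workspace. This is a sketch on a made-up data frame `people`, not on the `df` above:

```r
people <- data.frame(
  gender = c("male", "male", "female", "male"),
  age    = c(19, 22, 21, 30),
  smoker = c(FALSE, FALSE, FALSE, TRUE),
  BMI    = c(21.5, 23.0, 19.8, 26.1)
)

# Male non-smokers younger than 24, selected in one step.
selected <- people[people$gender == "male" & people$age < 24 & !people$smoker, ]
nrow(selected)       # 2
max(selected$BMI)    # 23
```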