# Creating the leadership data frame

In [2]:
manager <- c(1,2,3,4,5)
date <- c("10/24/08","10/28/08","10/1/08","10/12/08","5/1/09")
country <- c("US","US","UK","UK","UK")
gender <- c("M","F","F","M","F")
age <- c(32,45,25,39,99)
q1 <- c(5,3,3,3,2)
q2 <- c(4,5,5,3,2)
q3 <- c(5,2,5,4,1)
q4 <- c(5,5,5,NA,2)
q5 <- c(5,5,2,NA,1)

leadership <- data.frame(manager,date,country,gender,age,
                        q1,q2,q3,q4,q5,stringsAsFactors=FALSE)
leadership

manager,date,country,gender,age,q1,q2,q3,q4,q5
1,10/24/08,US,M,32,5,4,5,5.0,5.0
2,10/28/08,US,F,45,3,5,2,5.0,5.0
3,10/1/08,UK,F,25,3,5,5,5.0,2.0
4,10/12/08,UK,M,39,3,3,4,,
5,5/1/09,UK,F,99,2,2,1,2.0,1.0


# Creating new variables

In [5]:
mydata <- data.frame(x1 = c(2,2,6,4),
                    x2 = c(3,4,2,8))
mydata

x1,x2
2,3
2,4
6,2
4,8


## Method 1

In [6]:
mydata$sum <- mydata$x1 + mydata$x2
mydata$meanx <- (mydata$x1 + mydata$x2)/2
mydata

x1,x2,sum,meanx
2,3,5,2.5
2,4,6,3.0
6,2,8,4.0
4,8,12,6.0


## Method 2

In [7]:
attach(mydata)
mydata$sum <- x1 + x2
mydata$meanx <- (x1 + x2)/2
detach(mydata)

mydata

x1,x2,sum,meanx
2,3,5,2.5
2,4,6,3.0
6,2,8,4.0
4,8,12,6.0


## Method 3

In [8]:
mydata <- transform(mydata,
                   sum = x1 + x2,
                   meanx = (x1 + x2)/2)

mydata

x1,x2,sum,meanx
2,3,5,2.5
2,4,6,3.0
6,2,8,4.0
4,8,12,6.0
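One caveat with `transform()` is that each expression is evaluated against the columns of the original data frame, so a variable created in the call cannot be reused by a later expression in the same call. The sketch below uses `within()` instead, whose statements run in order. The column name `total` is a hypothetical choice made here to avoid confusion with the built-in `sum()` function.

```r
# Toy data frame matching the one used above
mydata <- data.frame(x1 = c(2, 2, 6, 4),
                     x2 = c(3, 4, 2, 8))

# Statements inside within() run top to bottom, so meanx can reuse total
mydata <- within(mydata, {
    total <- x1 + x2
    meanx <- total / 2
})

# Select columns explicitly, because within() does not guarantee
# the order in which new columns are appended
mydata[, c("x1", "x2", "total", "meanx")]
```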


# Recoding variables

In [9]:
leadership

manager,date,country,gender,age,q1,q2,q3,q4,q5
1,10/24/08,US,M,32,5,4,5,5.0,5.0
2,10/28/08,US,F,45,3,5,2,5.0,5.0
3,10/1/08,UK,F,25,3,5,5,5.0,2.0
4,10/12/08,UK,M,39,3,3,4,,
5,5/1/09,UK,F,99,2,2,1,2.0,1.0


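Recode the managers' continuous age variable `age` in the leadership dataset into the categorical variable `agecat` (Young, Middle Aged, Elder). First, the age value 99 must be recoded as missing.

## Method 1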

In [10]:
leadership$age[leadership$age == 99] <- NA

In [11]:
leadership$agecat[leadership$age > 75] <- "Elder"
leadership$agecat[leadership$age >= 55 &
                 leadership$age <= 75] <- "Middle Aged"
leadership$agecat[leadership$age < 55] <- "Young"

leadership

manager,date,country,gender,age,q1,q2,q3,q4,q5,agecat
1,10/24/08,US,M,32.0,5,4,5,5.0,5.0,Young
2,10/28/08,US,F,45.0,3,5,2,5.0,5.0,Young
3,10/1/08,UK,F,25.0,3,5,5,5.0,2.0,Young
4,10/12/08,UK,M,39.0,3,3,4,,,Young
5,5/1/09,UK,F,,2,2,1,2.0,1.0,


## Method 2

In [None]:
leadership <- within(leadership, {
    agecat <- NA
    agecat[age > 75] <- "Elder"
    agecat[age >= 55 & age <= 75] <- "Middle Aged"
    agecat[age < 55] <- "Young"
})
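
As a possible third variant, the same three-way recoding can be written as a nested `ifelse()` call. This is only a sketch on a toy vector (the name `ages` and its values are made up for illustration). Missing ages stay missing because a comparison with NA yields NA.

```r
# Toy ages for illustration; NA marks a missing value
ages <- c(32, 45, 25, 39, NA, 60, 80)

# The outer test handles "Elder"; the inner test splits the rest
agecat <- ifelse(ages > 75, "Elder",
          ifelse(ages >= 55, "Middle Aged", "Young"))
agecat
```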

# Renaming variables

## Interactive editing

In [12]:
fix(leadership)

## rename

In [14]:
library(reshape)

"package 'reshape' was built under R version 3.4.1"

In [15]:
leadership <- rename(leadership, c(manager="managerID", date="testDate"))

In [16]:
leadership

managerID,testDate,country,gender,age,q1,q2,q3,q4,q5,agecat
1,10/24/08,US,M,32.0,5,4,5,5.0,5.0,Young
2,10/28/08,US,F,45.0,3,5,2,5.0,5.0,Young
3,10/1/08,UK,F,25.0,3,5,5,5.0,2.0,Young
4,10/12/08,UK,M,39.0,3,3,4,,,Young
5,5/1/09,UK,F,,2,2,1,2.0,1.0,


## names

In [17]:
names(leadership)[2] <- "TestDate"
leadership

managerID,TestDate,country,gender,age,q1,q2,q3,q4,q5,agecat
1,10/24/08,US,M,32.0,5,4,5,5.0,5.0,Young
2,10/28/08,US,F,45.0,3,5,2,5.0,5.0,Young
3,10/1/08,UK,F,25.0,3,5,5,5.0,2.0,Young
4,10/12/08,UK,M,39.0,3,3,4,,,Young
5,5/1/09,UK,F,,2,2,1,2.0,1.0,
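
Renaming by position breaks silently if the column order changes. A slightly more defensive sketch, using only base R and a small made-up data frame `df`, looks the column up by its current name instead:

```r
# Hypothetical two-column data frame for illustration
df <- data.frame(manager = 1:3,
                 date = c("10/24/08", "10/28/08", "10/1/08"))

# Match on the existing name rather than on a column index
names(df)[names(df) == "manager"] <- "managerID"
names(df)[names(df) == "date"] <- "testDate"

names(df)
```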


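# Missing values

In R, missing values are represented by the symbol NA (Not Available). Impossible values (for example, the result of 0/0) are represented by the symbol NaN (Not a Number). Dividing a nonzero number by 0 gives Inf or -Inf rather than NaN.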
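
A quick way to see how NA, NaN, and infinite values differ is to test a few expressions directly. The variable names below are illustrative only.

```r
pos_inf <- 1 / 0     # division of a nonzero number by zero
not_num <- 0 / 0     # undefined result

is.infinite(pos_inf) # TRUE
is.nan(not_num)      # TRUE
is.na(not_num)       # TRUE: is.na() also flags NaN
is.nan(NA)           # FALSE: NA is missing, not NaN
```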

## is.na()

In [18]:
y <- c(1,2,3,NA)

In [19]:
is.na(y)

```
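Missing values are considered noncomparable, even to themselves. This means you
cannot use comparison operators to test for the presence of missing values. For
example, the logical test myvar == NA is never TRUE. Instead, you have to use
missing-value functions (like those described in this section) to identify the
missing values in R data objects.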
```
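
The point about comparisons can be checked on the vector `y` from above: comparing with `==` returns NA for every element, while `is.na()` gives a usable logical vector. The names `cmp` and `missing_idx` are introduced here for illustration.

```r
y <- c(1, 2, 3, NA)

cmp <- y == NA               # NA NA NA NA, never TRUE
missing_idx <- which(is.na(y))  # position of the missing value

all(is.na(cmp))   # TRUE
missing_idx       # 4
sum(is.na(y))     # 1 missing value in total
```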

## Excluding missing values from analyses: na.rm and na.omit()

In [21]:
x <- c(1,2,NA,3)
y <- sum(x,na.rm=TRUE)
y

In [22]:
leadership

managerID,TestDate,country,gender,age,q1,q2,q3,q4,q5,agecat
1,10/24/08,US,M,32.0,5,4,5,5.0,5.0,Young
2,10/28/08,US,F,45.0,3,5,2,5.0,5.0,Young
3,10/1/08,UK,F,25.0,3,5,5,5.0,2.0,Young
4,10/12/08,UK,M,39.0,3,3,4,,,Young
5,5/1/09,UK,F,,2,2,1,2.0,1.0,


In [23]:
newdata <- na.omit(leadership)
newdata

managerID,TestDate,country,gender,age,q1,q2,q3,q4,q5,agecat
1,10/24/08,US,M,32,5,4,5,5,5,Young
2,10/28/08,US,F,45,3,5,2,5,5,Young
3,10/1/08,UK,F,25,3,5,5,5,2,Young

# Date values: as.Date(x, "input_format")

In [24]:
mydates <- as.Date(c("2007-06-22","2004-02-13"))
mydates

In [25]:
strDates <- c("01/05/1965","08/16/1975")
dates <- as.Date(strDates,"%m/%d/%Y")
dates

In [26]:
leadership

managerID,TestDate,country,gender,age,q1,q2,q3,q4,q5,agecat
1,10/24/08,US,M,32.0,5,4,5,5.0,5.0,Young
2,10/28/08,US,F,45.0,3,5,2,5.0,5.0,Young
3,10/1/08,UK,F,25.0,3,5,5,5.0,2.0,Young
4,10/12/08,UK,M,39.0,3,3,4,,,Young
5,5/1/09,UK,F,,2,2,1,2.0,1.0,


In [28]:
myformat <- "%m/%d/%y"
leadership$TestDate <- as.Date(leadership$TestDate, myformat)

In [29]:
leadership

managerID,TestDate,country,gender,age,q1,q2,q3,q4,q5,agecat
1,2008-10-24,US,M,32.0,5,4,5,5.0,5.0,Young
2,2008-10-28,US,F,45.0,3,5,2,5.0,5.0,Young
3,2008-10-01,UK,F,25.0,3,5,5,5.0,2.0,Young
4,2008-10-12,UK,M,39.0,3,3,4,,,Young
5,2009-05-01,UK,F,,2,2,1,2.0,1.0,


In [30]:
Sys.Date()

In [31]:
date()

In [32]:
today <- Sys.Date()
format(today, format="%B %d %Y")

In [33]:
format(today, format="%A")

In [34]:
startdate <- as.Date("2004-02-13")
enddate <- as.Date("2011-01-22")
days <- enddate - startdate
days

Time difference of 2535 days

In [35]:
today <- Sys.Date()
dob <- as.Date("1992-12-26")
difftime(today, dob, units = "weeks")

Time difference of 1287.429 weeks

```
# Convert date values to character strings
strDates <- as.character(dates)
```

```
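To learn more about converting character data to dates, see help(as.Date) and help(strftime).
To learn more about date and time formats, see help(ISOdatetime). The lubridate package contains
a number of functions that simplify working with dates, including functions to identify and parse
date-time data, extract date-time components (for example, years, months, and days), and perform
arithmetic on date-time values. If you need to do complex calculations with dates, the fCalendar
package (since superseded by the timeDate package) may help. It provides many functions for working
with dates, can handle multiple time zones at once, and offers sophisticated calendar operations
that recognize business days, weekends, and holidays.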
```
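
For simple cases, base R alone is often enough. The sketch below uses only numeric format codes, so the results do not depend on the locale, and shows one way to extract date components and do basic date arithmetic. The variable names are chosen here for illustration.

```r
d <- as.Date(c("2004-02-13", "2011-01-22"))

# Extract components as numbers via format codes
years  <- as.numeric(format(d, "%Y"))
months <- as.numeric(format(d, "%m"))

# Adding a number to a Date shifts it by that many days
shifted <- d + 30

# seq() can step by calendar units such as "month"
monthly <- seq(d[1], by = "month", length.out = 3)

years
months
shifted
monthly
```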

# Type conversion

In [36]:
a <- c(1,2,3)
a

In [37]:
is.numeric(a)

In [38]:
is.vector(a)

In [39]:
a <- as.character(a)

In [40]:
is.numeric(a)

In [41]:
is.vector(a)

In [42]:
is.character(a)
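
Converting in the other direction can lose information: strings that do not look like numbers become NA, and R issues a warning. A minimal sketch with a made-up vector `b` (the warning is suppressed here only to keep the output clean):

```r
b <- c("1", "2.5", "three")

# "three" cannot be parsed as a number, so it becomes NA
nums <- suppressWarnings(as.numeric(b))

nums          # 1.0 2.5 NA
is.na(nums)   # FALSE FALSE TRUE
```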

# Sorting data

In [43]:
newdata <- leadership[order(leadership$age),]
newdata

,managerID,TestDate,country,gender,age,q1,q2,q3,q4,q5,agecat
3,3,2008-10-01,UK,F,25.0,3,5,5,5.0,2.0,Young
1,1,2008-10-24,US,M,32.0,5,4,5,5.0,5.0,Young
4,4,2008-10-12,UK,M,39.0,3,3,4,,,Young
2,2,2008-10-28,US,F,45.0,3,5,2,5.0,5.0,Young
5,5,2009-05-01,UK,F,,2,2,1,2.0,1.0,


In [47]:
# with() evaluates gender and age inside leadership, so same-named
# variables in the global environment cannot mask them
newdata <- leadership[with(leadership, order(gender, age)),]

newdata

,managerID,TestDate,country,gender,age,q1,q2,q3,q4,q5,agecat
3,3,2008-10-01,UK,F,25.0,3,5,5,5.0,2.0,Young
2,2,2008-10-28,US,F,45.0,3,5,2,5.0,5.0,Young
5,5,2009-05-01,UK,F,,2,2,1,2.0,1.0,
1,1,2008-10-24,US,M,32.0,5,4,5,5.0,5.0,Young
4,4,2008-10-12,UK,M,39.0,3,3,4,,,Young


In [48]:
# -age sorts age in descending order within each gender;
# rows with a missing age are placed last in their group
newdata <- leadership[with(leadership, order(gender, -age)),]

newdata

,managerID,TestDate,country,gender,age,q1,q2,q3,q4,q5,agecat
2,2,2008-10-28,US,F,45.0,3,5,2,5.0,5.0,Young
3,3,2008-10-01,UK,F,25.0,3,5,5,5.0,2.0,Young
5,5,2009-05-01,UK,F,,2,2,1,2.0,1.0,
4,4,2008-10-12,UK,M,39.0,3,3,4,,,Young
1,1,2008-10-24,US,M,32.0,5,4,5,5.0,5.0,Young


# Merging datasets

## Adding columns

```
total <- merge(dataframeA, dataframeB, by = "ID")
total <- merge(dataframeA, dataframeB, by = c("ID","Country"))

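# To join two matrices or data frames horizontally without specifying
# a common key, use cbind():
total <- cbind(A, B)
# cbind() joins objects A and B horizontally. For it to work properly, each
# object must have the same number of rows and be sorted in the same order.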
```
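
To make the behavior of `merge()` concrete, here is a small sketch with two made-up data frames. By default only keys present in both inputs are kept; `all = TRUE` keeps every key and fills the gaps with NA. The column names `Country` and `score` and all values are illustrative assumptions.

```r
dataframeA <- data.frame(ID = c(1, 2, 3),
                         Country = c("US", "UK", "UK"))
dataframeB <- data.frame(ID = c(2, 3, 4),
                         score = c(10, 20, 30))

# Default: inner join, only IDs found in both data frames
inner <- merge(dataframeA, dataframeB, by = "ID")

# all = TRUE: full outer join, unmatched cells become NA
outer <- merge(dataframeA, dataframeB, by = "ID", all = TRUE)

inner
outer
```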

## Adding rows

total <- rbind(dataframeA, dataframeB)
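
For `rbind()` on data frames, both inputs need the same set of column names, but the columns do not have to appear in the same order because they are matched by name. A short sketch with made-up data:

```r
dataframeA <- data.frame(ID = 1:2, score = c(10, 20))
# Same columns as dataframeA, deliberately in a different order
dataframeB <- data.frame(score = c(30, 40), ID = 3:4)

total <- rbind(dataframeA, dataframeB)
total
```

The result keeps the column order of the first data frame.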

# Subsetting datasets

## Method 1

In [50]:
newdata <- leadership[, c(6:10)]
newdata

q1,q2,q3,q4,q5
5,4,5,5.0,5.0
3,5,2,5.0,5.0
3,5,5,5.0,2.0
3,3,4,,
2,2,1,2.0,1.0


## Method 2

In [51]:
myvars <- c("q1","q2","q3","q4","q5")
newdata <- leadership[myvars]

newdata

q1,q2,q3,q4,q5
5,4,5,5.0,5.0
3,5,2,5.0,5.0
3,5,5,5.0,2.0
3,3,4,,
2,2,1,2.0,1.0


## Method 3

In [52]:
myvars <- paste("q",1:5, sep="")
myvars

In [53]:
newdata <- leadership[myvars]
newdata

q1,q2,q3,q4,q5
5,4,5,5.0,5.0
3,5,2,5.0,5.0
3,5,5,5.0,2.0
3,3,4,,
2,2,1,2.0,1.0


# Dropping variables

## Method 1

In [55]:
myvars <- names(leadership) %in% c("q3","q4")
newdata <- leadership[!myvars]

newdata

managerID,TestDate,country,gender,age,q1,q2,q5,agecat
1,2008-10-24,US,M,32.0,5,4,5.0,Young
2,2008-10-28,US,F,45.0,3,5,5.0,Young
3,2008-10-01,UK,F,25.0,3,5,2.0,Young
4,2008-10-12,UK,M,39.0,3,3,,Young
5,2009-05-01,UK,F,,2,2,1.0,


## Method 2

In [56]:
newdata <- leadership[c(-8,-9)]
newdata

managerID,TestDate,country,gender,age,q1,q2,q5,agecat
1,2008-10-24,US,M,32.0,5,4,5.0,Young
2,2008-10-28,US,F,45.0,3,5,5.0,Young
3,2008-10-01,UK,F,25.0,3,5,2.0,Young
4,2008-10-12,UK,M,39.0,3,3,,Young
5,2009-05-01,UK,F,,2,2,1.0,


## Method 3

In [57]:
leadership$q3 <- leadership$q4 <- NULL
leadership

managerID,TestDate,country,gender,age,q1,q2,q5,agecat
1,2008-10-24,US,M,32.0,5,4,5.0,Young
2,2008-10-28,US,F,45.0,3,5,5.0,Young
3,2008-10-01,UK,F,25.0,3,5,2.0,Young
4,2008-10-12,UK,M,39.0,3,3,,Young
5,2009-05-01,UK,F,,2,2,1.0,


Here you set the q3 and q4 columns to undefined (NULL). Note that NULL is not the same as NA (missing).
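
The difference can be seen on a tiny made-up data frame: assigning NULL removes a column entirely, whereas NA is a placeholder value that still occupies a cell.

```r
df <- data.frame(a = 1:3, b = c(4, NA, 6))

df$a <- NULL     # column a no longer exists

names(df)        # "b"
is.null(df$a)    # TRUE: there is nothing to return
is.na(df$b)      # FALSE TRUE FALSE: column b exists, one cell is missing
```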

# Selecting observations

## Method 1

In [58]:
newdata <- leadership[1:3,]
newdata <- leadership[which(leadership$gender=="M" & 
                           leadership$age >30),]

newdata

,managerID,TestDate,country,gender,age,q1,q2,q5,agecat
1,1,2008-10-24,US,M,32,5,4,5.0,Young
4,4,2008-10-12,UK,M,39,3,3,,Young


## Method 2

In [59]:
# with() looks up gender and age inside leadership, avoiding the
# masking by global variables that attach() is prone to
newdata <- leadership[with(leadership, which(gender=="M" & age > 30)),]

newdata

,managerID,TestDate,country,gender,age,q1,q2,q5,agecat
1,1,2008-10-24,US,M,32,5,4,5.0,Young
4,4,2008-10-12,UK,M,39,3,3,,Young
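
Row and column selection can also be combined in one step with the base function `subset()`. The following is a sketch on a made-up data frame `df`, not on the notebook's `leadership` object. Rows where the condition evaluates to NA are dropped, so no `which()` is needed.

```r
# Hypothetical data mirroring a few leadership columns
df <- data.frame(gender = c("M", "F", "F", "M", "F"),
                 age    = c(32, 45, 25, 39, NA),
                 q1     = c(5, 3, 3, 3, 2),
                 q2     = c(4, 5, 5, 3, 2))

# Keep men older than 30 and only three of the columns
newdata <- subset(df, gender == "M" & age > 30,
                  select = c(gender, age, q1))
newdata
```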


In [None]:
leadership$TestDate <- as.Date(leadership$TestDate,"%m/%d/%y")
startdate <- as.Date("2009-01-01")
enddate <- as.Date("2009-10-31")
newdata 