# Data Frames

A data frame (`data.frame`) resembles a matrix in that it has rows and columns, but each column may have a different mode.

## Creating a Data Frame

In [66]:
kids <- c("Jack", "Jill")
ages <- c(12, 10)
d <- data.frame(
    kids, ages,
    stringsAsFactors=FALSE
)
print(d)

  kids ages
1 Jack   12
2 Jill   10


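The `stringsAsFactors` argument controls whether character vectors are converted to factors. Its default was `TRUE` before R 4.0.0 and has been `FALSE` since.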
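
As a small illustration of the effect of `stringsAsFactors`, the sketch below builds the same data frame twice and compares the class of the string column. The names `d_chr` and `d_fac` are introduced here only for the example.

```r
kids <- c("Jack", "Jill")
ages <- c(12, 10)

# Keep strings as character vectors
d_chr <- data.frame(kids, ages, stringsAsFactors = FALSE)
# Convert strings to factors
d_fac <- data.frame(kids, ages, stringsAsFactors = TRUE)

print(class(d_chr$kids))
print(class(d_fac$kids))
print(levels(d_fac$kids))
```

Setting the argument explicitly, as the cell above does, keeps the behavior the same across R versions.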

### Accessing a Data Frame

A data frame is a list of columns, so its components can be accessed by index or by component name.

In [67]:
print(d[[1]])

[1] "Jack" "Jill"


In [68]:
print(d$kids)

[1] "Jack" "Jill"


Matrix-style indexing also works.

In [69]:
print(d[, 1])

[1] "Jack" "Jill"
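
The three access forms above return the same vector. The sketch below checks this on a rebuilt copy of `d`, and also shows that single brackets with one index use list-style subsetting and keep the data frame class. The names `v1`, `v2`, `v3`, and `sub` are introduced only for this example.

```r
d <- data.frame(kids = c("Jack", "Jill"), ages = c(12, 10),
                stringsAsFactors = FALSE)

# These three forms all return the column as a plain vector
v1 <- d[[1]]
v2 <- d$kids
v3 <- d[, 1]
print(identical(v1, v2) && identical(v2, v3))

# Single brackets with one index use list-style subsetting:
# the result is still a data frame with one column
sub <- d[1]
print(class(sub))
```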


The `str()` function displays the structure of a data frame.

In [70]:
str(d)

'data.frame':	2 obs. of  2 variables:
 $ kids: chr  "Jack" "Jill"
 $ ages: num  12 10


### Extended Example: Regression Analysis of Exam Grades (Continued)

In [71]:
score <- read.csv("../data/student-mat.csv", header=TRUE)

In [72]:
head(score)

,school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,...,famrel,freetime,goout,Dalc,Walc,health,absences,G1,G2,G3
,<chr>,<chr>,<int>,<chr>,<chr>,<chr>,<int>,<int>,<chr>,<chr>,...,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>
1,GP,F,18,U,GT3,A,4,4,at_home,teacher,...,4,3,4,1,1,3,6,5,6,6
2,GP,F,17,U,GT3,T,1,1,at_home,other,...,5,3,3,1,1,3,4,5,5,6
3,GP,F,15,U,LE3,T,1,1,at_home,other,...,4,3,2,2,3,3,10,7,8,10
4,GP,F,15,U,GT3,T,4,2,health,services,...,3,2,2,1,1,5,2,15,14,15
5,GP,F,16,U,GT3,T,3,3,other,other,...,4,3,2,1,2,5,4,6,10,10
6,GP,M,16,U,LE3,T,4,3,services,other,...,5,4,2,1,2,5,10,15,15,15


## Other Matrix-Like Operations

Many matrix operations also apply to data frames.

### Extracting Subdata Frames

In [73]:
score[2:5,]

,school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,...,famrel,freetime,goout,Dalc,Walc,health,absences,G1,G2,G3
,<chr>,<chr>,<int>,<chr>,<chr>,<chr>,<int>,<int>,<chr>,<chr>,...,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>
2,GP,F,17,U,GT3,T,1,1,at_home,other,...,5,3,3,1,1,3,4,5,5,6
3,GP,F,15,U,LE3,T,1,1,at_home,other,...,4,3,2,2,3,3,10,7,8,10
4,GP,F,15,U,GT3,T,4,2,health,services,...,3,2,2,1,1,5,2,15,14,15
5,GP,F,16,U,GT3,T,3,3,other,other,...,4,3,2,1,2,5,4,6,10,10


In [74]:
print(score[2:5, 32])

[1]  5  8 14 10


In [75]:
print(class(score[2:5, 32]))

[1] "integer"


In [76]:
print(score[2:5, 32, drop=FALSE])

  G2
2  5
3  8
4 14
5 10


In [77]:
print(class(score[2:5, 32, drop=FALSE]))

[1] "data.frame"


Filtering rows by a condition:

In [78]:
score[score$G1 > 17,]

,school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,...,famrel,freetime,goout,Dalc,Walc,health,absences,G1,G2,G3
,<chr>,<chr>,<int>,<chr>,<chr>,<chr>,<int>,<int>,<chr>,<chr>,...,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>
43,GP,M,15,U,GT3,T,4,4,services,teacher,...,4,3,3,1,1,5,2,19,18,18
48,GP,M,16,U,GT3,T,4,3,health,services,...,4,2,2,1,1,2,4,19,19,20
111,GP,M,15,U,LE3,A,4,4,teacher,teacher,...,5,5,3,1,1,4,6,18,19,19
114,GP,M,15,U,LE3,T,4,2,teacher,other,...,3,5,2,1,1,3,10,18,19,19
130,GP,M,16,R,GT3,T,4,4,teacher,teacher,...,3,5,5,2,5,4,8,18,18,18
199,GP,F,17,U,GT3,T,4,4,services,teacher,...,4,2,4,2,3,2,24,18,18,18
246,GP,M,16,U,GT3,T,2,1,other,other,...,4,3,3,1,1,4,6,18,18,18
287,GP,F,18,U,GT3,T,2,2,at_home,at_home,...,4,3,3,1,2,2,5,18,18,19
294,GP,F,17,R,LE3,T,3,1,services,other,...,3,1,2,1,1,3,6,18,18,18
360,MS,F,18,U,LE3,T,1,1,at_home,services,...,5,3,2,1,1,4,0,18,16,16


### Handling Missing Values

R handles missing data where it can, but some functions must be told to ignore missing values with `na.rm=TRUE`.

In [79]:
x <- c(2, NA, 4)
print(mean(x))

[1] NA


In [80]:
print(mean(x, na.rm=TRUE))

[1] 3


The `subset()` function automatically drops rows for which the condition evaluates to `NA`.

In [81]:
subset(score, G1>17)

,school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,...,famrel,freetime,goout,Dalc,Walc,health,absences,G1,G2,G3
,<chr>,<chr>,<int>,<chr>,<chr>,<chr>,<int>,<int>,<chr>,<chr>,...,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>
43,GP,M,15,U,GT3,T,4,4,services,teacher,...,4,3,3,1,1,5,2,19,18,18
48,GP,M,16,U,GT3,T,4,3,health,services,...,4,2,2,1,1,2,4,19,19,20
111,GP,M,15,U,LE3,A,4,4,teacher,teacher,...,5,5,3,1,1,4,6,18,19,19
114,GP,M,15,U,LE3,T,4,2,teacher,other,...,3,5,2,1,1,3,10,18,19,19
130,GP,M,16,R,GT3,T,4,4,teacher,teacher,...,3,5,5,2,5,4,8,18,18,18
199,GP,F,17,U,GT3,T,4,4,services,teacher,...,4,2,4,2,3,2,24,18,18,18
246,GP,M,16,U,GT3,T,2,1,other,other,...,4,3,3,1,1,4,6,18,18,18
287,GP,F,18,U,GT3,T,2,2,at_home,at_home,...,4,3,3,1,2,2,5,18,18,19
294,GP,F,17,R,LE3,T,3,1,services,other,...,3,1,2,1,1,3,6,18,18,18
360,MS,F,18,U,LE3,T,1,1,at_home,services,...,5,3,2,1,1,4,0,18,16,16
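
The `score` data contain no missing grades, so logical indexing and `subset()` return the same rows above. The difference only shows when the condition itself is `NA`. The sketch below uses a small made-up data frame (`df`, with hypothetical columns `name` and `g`) to make that visible.

```r
df <- data.frame(name = c("a", "b", "c"), g = c(19, NA, 12),
                 stringsAsFactors = FALSE)

# Logical indexing: an NA in the condition yields a row filled with NA
by_index <- df[df$g > 17, ]
# subset(): rows whose condition is NA are dropped
by_subset <- subset(df, g > 17)
# which() keeps only the positions where the condition is TRUE
by_which <- df[which(df$g > 17), ]

print(by_index)
print(by_subset)
print(by_which)
```

Wrapping the condition in `which()` is a common way to get the `subset()` behavior while staying with bracket indexing.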


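`complete.cases()` returns a logical vector marking the rows that have no missing values, which can be used to drop incomplete observations.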

In [82]:
kids <- c("Jack", NA, "Jillian", "John")
states <- c("CA", "MA", "MA", NA)
d4 <- data.frame(
    kids,
    states,
    stringsAsFactors=FALSE
)
print(d4)

     kids states
1    Jack     CA
2    <NA>     MA
3 Jillian     MA
4    John   <NA>


In [83]:
print(complete.cases(d4))

[1]  TRUE FALSE  TRUE FALSE


In [84]:
d5 <- d4[complete.cases(d4),]
print(d5)

     kids states
1    Jack     CA
3 Jillian     MA


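### Using `rbind()`, `cbind()`, and Related Functions

As with matrices, the sizes must agree: `rbind()` requires the same number of columns, and `cbind()` requires the same number of rows.

When `rbind()` adds a new row, the row is usually supplied as a data frame or a list.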

In [85]:
print(d)

  kids ages
1 Jack   12
2 Jill   10


In [86]:
print(rbind(
    d, 
    list("Laura", 19)
))

   kids ages
1  Jack   12
2  Jill   10
3 Laura   19
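
A minimal sketch of both functions on a rebuilt copy of `d`. It assumes the new row is supplied as a data frame, in which case `rbind()` matches columns by name, and it passes a named argument to `cbind()` so the new column gets a readable name. The names `more`, `d_rows`, `d_cols`, and `older` are introduced only for this example.

```r
d <- data.frame(kids = c("Jack", "Jill"), ages = c(12, 10),
                stringsAsFactors = FALSE)

# rbind() with a data frame: columns are matched by name,
# so the column order in the new row does not matter
more <- data.frame(ages = 19, kids = "Laura", stringsAsFactors = FALSE)
d_rows <- rbind(d, more)
print(d_rows)

# cbind() needs the same number of rows; naming the argument
# gives the new column a readable name
d_cols <- cbind(d, older = d$ages > 11)
print(d_cols)
```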


Creating a new column from existing columns:

In [22]:
eq <- cbind(
    score,
    score$G2 - score$G1
)
print(class(eq))

[1] "data.frame"


In [87]:
head(eq)

,school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,...,freetime,goout,Dalc,Walc,health,absences,G1,G2,G3,score$G2 - score$G1
,<chr>,<chr>,<int>,<chr>,<chr>,<chr>,<int>,<int>,<chr>,<chr>,...,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>
1,GP,F,18,U,GT3,A,4,4,at_home,teacher,...,3,4,1,1,3,6,5,6,6,1
2,GP,F,17,U,GT3,T,1,1,at_home,other,...,3,3,1,1,3,4,5,5,6,0
3,GP,F,15,U,LE3,T,1,1,at_home,other,...,3,2,2,3,3,10,7,8,10,1
4,GP,F,15,U,GT3,T,4,2,health,services,...,2,2,1,1,5,2,15,14,15,-1
5,GP,F,16,U,GT3,T,3,3,other,other,...,3,2,1,2,5,4,6,10,10,4
6,GP,M,16,U,LE3,T,4,3,services,other,...,4,2,1,2,5,10,15,15,15,0


Because a data frame is a list, a new column can also be added by assigning to a new component.

In [24]:
score$GDiff <- score$G2 - score$G1
head(score)

,school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,...,freetime,goout,Dalc,Walc,health,absences,G1,G2,G3,GDiff
,<chr>,<chr>,<int>,<chr>,<chr>,<chr>,<int>,<int>,<chr>,<chr>,...,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>
1,GP,F,18,U,GT3,A,4,4,at_home,teacher,...,3,4,1,1,3,6,5,6,6,1
2,GP,F,17,U,GT3,T,1,1,at_home,other,...,3,3,1,1,3,4,5,5,6,0
3,GP,F,15,U,LE3,T,1,1,at_home,other,...,3,2,2,3,3,10,7,8,10,1
4,GP,F,15,U,GT3,T,4,2,health,services,...,2,2,1,1,5,2,15,14,15,-1
5,GP,F,16,U,GT3,T,3,3,other,other,...,3,2,1,2,5,4,6,10,10,4
6,GP,M,16,U,LE3,T,4,3,services,other,...,4,2,1,2,5,10,15,15,15,0


Adding a column this way also supports recycling.

In [88]:
print(d)

  kids ages
1 Jack   12
2 Jill   10


In [89]:
d$ones <- 1
print(d)

  kids ages ones
1 Jack   12    1
2 Jill   10    1
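
Recycling is not limited to length-one values. The sketch below, on a made-up four-row data frame, shows a length-2 vector being recycled and a length-3 vector being rejected because 3 does not divide 4. The column names `group` and `bad` are hypothetical.

```r
d <- data.frame(kids = c("Jack", "Jill", "Jillian", "John"),
                stringsAsFactors = FALSE)

# A length-2 vector is recycled to fill 4 rows
d$group <- c("A", "B")
print(d)

# A length-3 vector does not divide 4 evenly, so the assignment fails;
# try() captures the error instead of stopping the script
res <- try(d$bad <- 1:3, silent = TRUE)
print(inherits(res, "try-error"))
```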


### Using `apply()`

If every column of a data frame has the same type, `apply()` can be used on it.

In [90]:
exam <- score[,31:33]
print(head(exam))

  G1 G2 G3
1  5  6  6
2  5  5  6
3  7  8 10
4 15 14 15
5  6 10 10
6 15 15 15


In [28]:
print(apply(exam, 1, max))

  [1]  6  6 10 15 10 15 12  6 19 15 10 12 14 11 16 14 14 10  6 10 15 15 16 13 10
 [26]  9 12 16 11 12 12 17 17 12 15  8 18 16 12 14 11 12 19 11 10  8 12 20 15  7
 [51] 13 13 11 11 13 10 15 15 10 16 11 11 10 10 10 16 13  7  9 16 15 10  8 14 12
 [76] 10 11 11 10  5 12 11  7 15 10  9  8 14 11  8  8 18  7 11 14 10 15 10 14  9
[101]  7 17 14  7 18 11  8 18 13 16 19 10 13 19  9 16 14 14  9 14 16 16 13 14  8
[126] 13 11  9  7 18 12  8 13 12  9 11 10  4 14 16  9  9 11 14  5 11  7 11  7 10
[151]  6 14 10  5 12 11 16 10 17 12  7  9  7 10  8 12 10 16  7 14  6 16 13  8 11
[176] 10 13  6 10 11  9 13 17  9 13 12 12 15  9 10 13  9  8 10 14 15 17 10 18 10
[201] 16 10 10  7 11 10  7 13 10  7  8 13 14  8 10 15  6  8  8 10  6  6 17 13 14
[226]  9 16 12 10 12 14 11 11 14  9 11 14 13 13  7 12 12  6 13  7 18 13  8  5 15
[251]  8 10  9  9 12  9 14 11 15 10 18  8 13 10 10 17 10 12 10  6  9 15 11 15 10
[276] 12 10  9  9 11  8 11 12 10 11 12 19 13 15 15 12 15 13 18 14 14 10 10 14 16
[301] 12 11 15 18 15 14 18  
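
The same-type requirement comes from `apply()` converting the data frame to a matrix before doing any work. The sketch below uses a small made-up grade table to show what happens once a character column is mixed in, and one way to restrict the call to the numeric columns. The names `mixed`, `num_cols`, and the `sex` values are assumptions for illustration.

```r
exam <- data.frame(G1 = c(5, 15, 7), G2 = c(6, 14, 8), G3 = c(6, 15, 10))

# With all-numeric columns the intermediate matrix is numeric
row_max <- apply(exam, 1, max)
print(row_max)

# Adding a character column makes as.matrix() produce a character matrix,
# so a row-wise function would no longer receive numbers
mixed <- cbind(exam, sex = c("F", "M", "F"))
print(typeof(as.matrix(mixed)))

# Restricting to the numeric columns restores the numeric result
num_cols <- sapply(mixed, is.numeric)
row_max2 <- apply(mixed[, num_cols], 1, max)
print(row_max2)
```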

### Extended Example: A Salary Study

In [29]:
all2006 <- read.csv(
    "../data/2006.csv.short",
    header=TRUE,
    as.is=TRUE
)
head(all2006)

,Case_No,Processing_Center,Final_Case_Status,Received_Date,Certified_Date,Denied_Date,Employer_Name,Employer_Address_1,Employer_Address_2,Employer_City,...,Wage_Offered_From,Wage_Offered_To,Wage_Per,Prevailing_Wage_Job_Title,Prevailing_Wage_Amount,Prevailing_Wage_Level,Prevailing_Wage_SOC_CODE,Prevailing_Wage_SOC_Title,Prevailing_Wage_Source,Prevailing_Wage_Other_Source
,<chr>,<chr>,<chr>,<chr>,<lgl>,<chr>,<chr>,<chr>,<chr>,<chr>,...,<dbl>,<dbl>,<chr>,<chr>,<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>
1,A-05243-28497,Atlanta Processing Center,Denied,10/1/2005 0:00:00,,10/1/2005 10:00:32,"QAMAR UL ZAMAN, MD",1035 RICHWOOD AVENUE,,CUMBERLAND,...,178000.0,,Year,Physician,163800.0,Level III,29-1062.00,Family and General Practitioners,OES,
2,A-05275-38245,Atlanta Processing Center,Denied,10/2/2005 0:00:00,,10/2/2005 2:48:59,"HYGIA INDUSTRIES, INC.",BOX 25,,TALLMAN,...,29.52,,Year,COMPUTER PROGRAMMER,29.52,Level II,15-1021.00,Computer Programmers,OES,
3,A-05263-34450,Atlanta Processing Center,Denied,10/2/2005 0:00:00,,10/2/2005 3:49:09,TRI-SEASON LANDSCAPE & BOULDER COUNTRUCTION,2260 SUNRISE COURT,,SCOTCH PLAINS,...,13.35,,Hour,LANDSCAPING & GROUNDSKEEPING WORKERS,13.35,Level IV,37-3011.00,Landscaping and Groundskeeping Workers,OES,
4,A-05273-38122,Atlanta Processing Center,Denied,10/2/2005 0:00:00,,10/2/2005 9:50:15,NIPPON EXPRESS USA,590 MADISON AVE.,SUITE 2401,NEW YORK,...,62000.0,,Year,Senior Logistics Coordinator,61818.0,Level II,11-3071.01,Transportation Managers,OES,
5,C-05265-35535,Chicago Processing Center,Denied,10/2/2005 0:00:00,,10/2/2005 12:35:50,VERNON FAIRCHILD JR.,4297 NORTH 1400 EAST,,BUHL,...,1.7,,Hour,Sheep Shearer-Crew Leader,8.69,Level I,45-2093.00,"Farmworkers, Farm and Ranch Animals",Other,Prevailing Wage Specialist
6,C-05275-38256,Chicago Processing Center,Denied,10/2/2005 0:00:00,,10/2/2005 13:51:05,VERNON FAIRCHILD JR.,4297 NORTH 1400 EAST,,BUHL,...,1.7,,Hour,Sheep Shearer-Crew Leader,8.69,Level I,45-2093.00,"Farmworkers, Farm and Ranch Animals",Other,prevailig wage specialist


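Keep only the records with a yearly wage, an offered wage above 20000, and a prevailing wage above 200: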

In [30]:
all2006 <- all2006[all2006$Wage_Per == "Year",]
all2006 <- all2006[all2006$Wage_Offered_From > 20000,]
all2006 <- all2006[all2006$Prevailing_Wage_Amount > 200,]
all2006

,Case_No,Processing_Center,Final_Case_Status,Received_Date,Certified_Date,Denied_Date,Employer_Name,Employer_Address_1,Employer_Address_2,Employer_City,...,Wage_Offered_From,Wage_Offered_To,Wage_Per,Prevailing_Wage_Job_Title,Prevailing_Wage_Amount,Prevailing_Wage_Level,Prevailing_Wage_SOC_CODE,Prevailing_Wage_SOC_Title,Prevailing_Wage_Source,Prevailing_Wage_Other_Source
,<chr>,<chr>,<chr>,<chr>,<lgl>,<chr>,<chr>,<chr>,<chr>,<chr>,...,<dbl>,<dbl>,<chr>,<chr>,<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>
1,A-05243-28497,Atlanta Processing Center,Denied,10/1/2005 0:00:00,,10/1/2005 10:00:32,"QAMAR UL ZAMAN, MD",1035 RICHWOOD AVENUE,,CUMBERLAND,...,178000.0,,Year,Physician,163800.0,Level III,29-1062.00,Family and General Practitioners,OES,
4,A-05273-38122,Atlanta Processing Center,Denied,10/2/2005 0:00:00,,10/2/2005 9:50:15,NIPPON EXPRESS USA,590 MADISON AVE.,SUITE 2401,NEW YORK,...,62000.0,,Year,Senior Logistics Coordinator,61818.0,Level II,11-3071.01,Transportation Managers,OES,
11,C-05276-38323,Chicago Processing Center,Denied,9/19/2005 0:00:00,,10/3/2005 8:39:52,"CALIFORNIA STATE UNIVERSITY, LONG BEACH",1250 BELLFLOWER BLVD.,,LONG BEACH,...,55308.0,,Year,"Education Teacher, Postsecondary(Assistant Profess",55308.0,Level II,25-1081.00,"Education Teachers, Postsecondary",OES,
12,C-05276-38330,Chicago Processing Center,Denied,9/21/2005 0:00:00,,10/3/2005 9:25:03,CORPORATE NETWORK SOLUTIONS,5236 S. 40TH STREET,,PHOENIX,...,69000.0,,Year,Senior Network analyst,68869.0,Level IV,15-1081.00,Network Systems and Data Communications Analysts,Other,SESA - Arizona Department of Economic Security
15,C-05264-35092,Chicago Processing Center,Denied,10/3/2005 0:00:00,,10/3/2005 9:40:06,"THE STONE QUARRY, INC.",11768 CLAY RD.,,HOUSTON,...,60070.0,62000.0,Year,Stone Industry Specialist,60070.0,Level IV,17-1012.00,Landscape Architects,OES,
17,C-05276-38343,Chicago Processing Center,Denied,9/22/2005 0:00:00,,10/3/2005 10:10:14,"COMMUNITY HOSPITALISTS, LLC",30680 BAINBRIDGE,,CLEVELAND,...,123091.0,,Year,"Physician, Internal Medicine",123094.0,,29-1063.00,"Internists, General",OES,
21,A-05276-38359,Atlanta Processing Center,Denied,9/30/2005 0:00:00,,10/3/2005 10:40:17,CASTLE MANAGEMENT CORPORATION,"3040 STANTON ROAD, S.E.",SUITE 101,WASHINGTON,...,43000.0,,Year,Computer Support Specialist,31221.0,Level I,15-1041.00,Computer Support Specialists,OES,
22,A-05250-30387,Atlanta Processing Center,Denied,9/8/2005 0:00:00,,10/3/2005 10:54:41,CLC OF CHANTILLY,"4460 BROOKFIELD CORP. DRIVE, SUITE P",,CHANTILLY,...,38334.0,,Year,Electrician,38334.0,Level II,47-2111.00,Electricians,OES,
25,C-05257-32561,Chicago Processing Center,Denied,10/3/2005 0:00:00,,10/3/2005 11:25:26,"THE TBS GROUP, INC DBA TECHSYS BUSINESS SOLUTIONS","6801 GAYLORD PKWY, SUITE 301",,FRISCO,...,95000.0,,Year,Software Engineer-Systems,76170.0,Level III,15-1032.00,"Computer Software Engineers, Systems Software",OES,
29,A-05276-38392,Atlanta Processing Center,Denied,10/3/2005 0:00:00,,10/3/2005 11:40:30,JPSC INC,1465 S UNIVERSITY DRIVE,,PLANTATION,...,30000.0,50000.0,Year,Assistant Creative Director,35000.0,,27-1024.00,Graphic Designers,Employer Conducted,


Compute the ratio of the offered wage to the prevailing wage:

In [31]:
all2006$rat <- all2006$Wage_Offered_From / all2006$Prevailing_Wage_Amount
all2006

,Case_No,Processing_Center,Final_Case_Status,Received_Date,Certified_Date,Denied_Date,Employer_Name,Employer_Address_1,Employer_Address_2,Employer_City,...,Wage_Offered_To,Wage_Per,Prevailing_Wage_Job_Title,Prevailing_Wage_Amount,Prevailing_Wage_Level,Prevailing_Wage_SOC_CODE,Prevailing_Wage_SOC_Title,Prevailing_Wage_Source,Prevailing_Wage_Other_Source,rat
,<chr>,<chr>,<chr>,<chr>,<lgl>,<chr>,<chr>,<chr>,<chr>,<chr>,...,<dbl>,<chr>,<chr>,<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<dbl>
1,A-05243-28497,Atlanta Processing Center,Denied,10/1/2005 0:00:00,,10/1/2005 10:00:32,"QAMAR UL ZAMAN, MD",1035 RICHWOOD AVENUE,,CUMBERLAND,...,,Year,Physician,163800.0,Level III,29-1062.00,Family and General Practitioners,OES,,1.0866911
4,A-05273-38122,Atlanta Processing Center,Denied,10/2/2005 0:00:00,,10/2/2005 9:50:15,NIPPON EXPRESS USA,590 MADISON AVE.,SUITE 2401,NEW YORK,...,,Year,Senior Logistics Coordinator,61818.0,Level II,11-3071.01,Transportation Managers,OES,,1.0029441
11,C-05276-38323,Chicago Processing Center,Denied,9/19/2005 0:00:00,,10/3/2005 8:39:52,"CALIFORNIA STATE UNIVERSITY, LONG BEACH",1250 BELLFLOWER BLVD.,,LONG BEACH,...,,Year,"Education Teacher, Postsecondary(Assistant Profess",55308.0,Level II,25-1081.00,"Education Teachers, Postsecondary",OES,,1.0
12,C-05276-38330,Chicago Processing Center,Denied,9/21/2005 0:00:00,,10/3/2005 9:25:03,CORPORATE NETWORK SOLUTIONS,5236 S. 40TH STREET,,PHOENIX,...,,Year,Senior Network analyst,68869.0,Level IV,15-1081.00,Network Systems and Data Communications Analysts,Other,SESA - Arizona Department of Economic Security,1.0019022
15,C-05264-35092,Chicago Processing Center,Denied,10/3/2005 0:00:00,,10/3/2005 9:40:06,"THE STONE QUARRY, INC.",11768 CLAY RD.,,HOUSTON,...,62000.0,Year,Stone Industry Specialist,60070.0,Level IV,17-1012.00,Landscape Architects,OES,,1.0
17,C-05276-38343,Chicago Processing Center,Denied,9/22/2005 0:00:00,,10/3/2005 10:10:14,"COMMUNITY HOSPITALISTS, LLC",30680 BAINBRIDGE,,CLEVELAND,...,,Year,"Physician, Internal Medicine",123094.0,,29-1063.00,"Internists, General",OES,,0.9999756
21,A-05276-38359,Atlanta Processing Center,Denied,9/30/2005 0:00:00,,10/3/2005 10:40:17,CASTLE MANAGEMENT CORPORATION,"3040 STANTON ROAD, S.E.",SUITE 101,WASHINGTON,...,,Year,Computer Support Specialist,31221.0,Level I,15-1041.00,Computer Support Specialists,OES,,1.3772781
22,A-05250-30387,Atlanta Processing Center,Denied,9/8/2005 0:00:00,,10/3/2005 10:54:41,CLC OF CHANTILLY,"4460 BROOKFIELD CORP. DRIVE, SUITE P",,CHANTILLY,...,,Year,Electrician,38334.0,Level II,47-2111.00,Electricians,OES,,1.0
25,C-05257-32561,Chicago Processing Center,Denied,10/3/2005 0:00:00,,10/3/2005 11:25:26,"THE TBS GROUP, INC DBA TECHSYS BUSINESS SOLUTIONS","6801 GAYLORD PKWY, SUITE 301",,FRISCO,...,,Year,Software Engineer-Systems,76170.0,Level III,15-1032.00,"Computer Software Engineers, Systems Software",OES,,1.2472102
29,A-05276-38392,Atlanta Processing Center,Denied,10/3/2005 0:00:00,,10/3/2005 11:40:30,JPSC INC,1465 S UNIVERSITY DRIVE,,PLANTATION,...,50000.0,Year,Assistant Creative Director,35000.0,,27-1024.00,Graphic Designers,Employer Conducted,,0.8571429


Define a function that computes the median of the `rat` column:

In [32]:
medrat <- function(dataframe) {
    return (median(dataframe$rat, na.rm=TRUE))
}

In [33]:
print(medrat(all2006))

[1] 1.002423


Extract subsets for three occupations. `grep()` has to be given the job-title column: applied to the whole data frame, it returns the positions of matching columns, and those numbers would then be misused as row indices.

In [34]:
se2006 <- all2006[grep("Software Engineer", all2006$Prevailing_Wage_Job_Title),]
se2006


In [35]:
prg2006 <- all2006[grep("Director", all2006$Prevailing_Wage_Job_Title),]
prg2006


In [36]:
ee2006 <- all2006[grep("Industrial Engineer", all2006$Prevailing_Wage_Job_Title),]
ee2006
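
Selecting rows by a text pattern works when `grep()` is applied to a single column, because the positions it returns are then row numbers. The sketch below shows this on a made-up table; `jobs`, `title`, and `wage` are hypothetical names, not part of the wage data set.

```r
jobs <- data.frame(
    title = c("Software Engineer", "Accountant",
              "Senior Software Engineer", "Chef"),
    wage = c(85000, 72000, 99000, 23000),
    stringsAsFactors = FALSE
)

# grep() on one column returns the positions of the matching elements,
# which are exactly the row numbers to keep
rows <- grep("Software Engineer", jobs$title)
print(rows)
se <- jobs[rows, ]
print(se)

# grepl() returns a logical vector, convenient for combining conditions
se_high <- jobs[grepl("Software Engineer", jobs$title) & jobs$wage > 90000, ]
print(se_high)
```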


The following function extracts the subset for a given employer:

In [37]:
makecorp <- function(corpname) {
    t <- all2006[all2006$Employer_Name == corpname,]
    return (t)
}

In [38]:
makecorp("COMMUNITY HOSPITALISTS, LLC")

,Case_No,Processing_Center,Final_Case_Status,Received_Date,Certified_Date,Denied_Date,Employer_Name,Employer_Address_1,Employer_Address_2,Employer_City,...,Wage_Offered_To,Wage_Per,Prevailing_Wage_Job_Title,Prevailing_Wage_Amount,Prevailing_Wage_Level,Prevailing_Wage_SOC_CODE,Prevailing_Wage_SOC_Title,Prevailing_Wage_Source,Prevailing_Wage_Other_Source,rat
,<chr>,<chr>,<chr>,<chr>,<lgl>,<chr>,<chr>,<chr>,<chr>,<chr>,...,<dbl>,<chr>,<chr>,<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<dbl>
17,C-05276-38343,Chicago Processing Center,Denied,9/22/2005 0:00:00,,10/3/2005 10:10:14,"COMMUNITY HOSPITALISTS, LLC",30680 BAINBRIDGE,,CLEVELAND,...,,Year,"Physician, Internal Medicine",123094,,29-1063.00,"Internists, General",OES,,0.9999756


In [39]:
corplist <- list(
    "THE OGILVY GROUP INC", "ogilvy",
    "ITT INDUSTRIES", "itt"
)

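# corplist alternates between a full employer name and a short name
for (i in seq_len(length(corplist) / 2)) {
    # [[ ]] extracts the string itself rather than a one-element list
    corp <- corplist[[2*i - 1]]
    newdtf <- paste(corplist[[2*i]], "2006", sep="")
    # create a global variable such as itt2006
    assign(newdtf, makecorp(corp), pos=.GlobalEnv)
}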

In [40]:
itt2006

,Case_No,Processing_Center,Final_Case_Status,Received_Date,Certified_Date,Denied_Date,Employer_Name,Employer_Address_1,Employer_Address_2,Employer_City,...,Wage_Offered_To,Wage_Per,Prevailing_Wage_Job_Title,Prevailing_Wage_Amount,Prevailing_Wage_Level,Prevailing_Wage_SOC_CODE,Prevailing_Wage_SOC_Title,Prevailing_Wage_Source,Prevailing_Wage_Other_Source,rat
,<chr>,<chr>,<chr>,<chr>,<lgl>,<chr>,<chr>,<chr>,<chr>,<chr>,...,<dbl>,<chr>,<chr>,<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<dbl>
43,A-05276-38405,Atlanta Processing Center,Denied,10/3/2005 0:00:00,,10/3/2005 12:40:42,ITT INDUSTRIES,1761 BUSINESS CENTER DRIVE,,RESTON,...,,Year,Computer Scientist,68578,Level II,15-1011.00,"Computer and Information Scientists, Research",OES,,1
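
Creating variables with `assign()` works, but the per-company data frames can also be kept in one named list. The sketch below is an alternative under that assumption, not the approach used above; `wages` is a made-up stand-in for `all2006`, and `corps` and `by_corp` are hypothetical names.

```r
# Toy stand-in for all2006 (made-up values)
wages <- data.frame(
    Employer_Name = c("ITT INDUSTRIES", "THE OGILVY GROUP INC",
                      "ITT INDUSTRIES", "OTHER CORP"),
    rat = c(1.00, 1.10, 1.05, 0.95),
    stringsAsFactors = FALSE
)

# Short name -> full employer name
corps <- c(ogilvy = "THE OGILVY GROUP INC", itt = "ITT INDUSTRIES")

# One data frame per company, collected in a named list
by_corp <- lapply(corps, function(nm) wages[wages$Employer_Name == nm, ])

print(names(by_corp))
print(by_corp$itt)
```

A list keeps the global environment uncluttered and lets later steps loop over the companies, for example with `sapply(by_corp, nrow)`.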


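## Merging Data Frames

Use the `merge()` function. By default it joins on the columns the two data frames have in common (here `kids`) and keeps only the rows whose key appears in both.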

In [41]:
d1 <- data.frame(
    kids=c("Jack", "Jill", "Jillian", "John"),
    states=c("CA", "MA", "MA", "HI"),
    stringsAsFactors=FALSE
)
print(d1)

     kids states
1    Jack     CA
2    Jill     MA
3 Jillian     MA
4    John     HI


In [42]:
d2 <- data.frame(
    ages=c(10, 7, 12),
    kids=c("Jill", "Lillian", "Jack"),
    stringsAsFactors=FALSE
)
print(d2)

  ages    kids
1   10    Jill
2    7 Lillian
3   12    Jack


In [43]:
d <- merge(d1, d2)
print(d)

  kids states ages
1 Jack     CA   12
2 Jill     MA   10


The `by.x` and `by.y` arguments specify the merge columns when their names differ between the two data frames.

In [44]:
d3 <- data.frame(
    ages=c(12, 10, 7),
    pals=c("Jack", "Jill", "Lillian")
)
print(merge(
    d1, d3,
    by.x="kids",
    by.y="pals"
))

  kids states ages
1 Jack     CA   12
2 Jill     MA   10
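
The merges above drop every child that appears in only one of the data frames. As a hedged sketch of the alternative, `merge()` accepts `all = TRUE` to keep unmatched rows and fill the missing fields with `NA`. The names `inner` and `outer` are introduced only for this example.

```r
d1 <- data.frame(kids = c("Jack", "Jill", "Jillian", "John"),
                 states = c("CA", "MA", "MA", "HI"),
                 stringsAsFactors = FALSE)
d2 <- data.frame(ages = c(10, 7, 12),
                 kids = c("Jill", "Lillian", "Jack"),
                 stringsAsFactors = FALSE)

# Default: only keys present in both data frames are kept
inner <- merge(d1, d2)
print(inner)

# all = TRUE keeps every key and fills the gaps with NA
outer <- merge(d1, d2, all = TRUE)
print(outer)
```

`all.x = TRUE` and `all.y = TRUE` keep unmatched rows from only the first or only the second data frame.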


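Duplicate keys can produce misleading results. Below, `Jill` appears twice in `d2a`, so the merged data frame contains two rows for `Jill`.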

In [45]:
print(d1)

     kids states
1    Jack     CA
2    Jill     MA
3 Jillian     MA
4    John     HI


In [46]:
d2a <- rbind(
    d2,
    list(15, "Jill")
)
print(d2a)

  ages    kids
1   10    Jill
2    7 Lillian
3   12    Jack
4   15    Jill


In [47]:
print(merge(d1, d2a))

  kids states ages
1 Jack     CA   12
2 Jill     MA   10
3 Jill     MA   15
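
One way to catch this before merging is to check the key column for repeats. The sketch below uses base R's `duplicated()` and `anyDuplicated()` on a rebuilt copy of `d2a`; `dup_keys` and `key_is_unique` are hypothetical names.

```r
d2a <- data.frame(ages = c(10, 7, 12, 15),
                  kids = c("Jill", "Lillian", "Jack", "Jill"),
                  stringsAsFactors = FALSE)

# Keys that occur more than once
dup_keys <- unique(d2a$kids[duplicated(d2a$kids)])
print(dup_keys)

# anyDuplicated() returns 0 when every key is unique
key_is_unique <- anyDuplicated(d2a$kids) == 0
print(key_is_unique)
```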


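## Applying Functions to Data Frames

### Using `lapply()` and `sapply()` on Data Frames

When `lapply()` is called on a data frame with a function `f()`, `f()` is applied to each column, and the return values are placed in a list.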

In [48]:
print(d)

  kids states ages
1 Jack     CA   12
2 Jill     MA   10


In [49]:
dl <- lapply(d, sort)
print(dl)

$kids
[1] "Jack" "Jill"

$states
[1] "CA" "MA"

$ages
[1] 10 12



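The list can be coerced back to a data frame.

Note: sorting each column separately breaks the association between the values of a record, so the result is not meaningful.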

In [50]:
print(as.data.frame(dl))

  kids states ages
1 Jack     CA   10
2 Jill     MA   12
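
To sort a data frame while keeping each record intact, the usual approach is to compute a row permutation with `order()` and index the rows with it. A minimal sketch on a rebuilt copy of `d` (the name `d_sorted` is introduced here):

```r
d <- data.frame(kids = c("Jack", "Jill"), states = c("CA", "MA"),
                ages = c(12, 10), stringsAsFactors = FALSE)

# order() returns the row permutation that sorts ages;
# indexing with it moves whole rows, so each record stays together
d_sorted <- d[order(d$ages), ]
print(d_sorted)
```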


### Extended Example: Applying Logistic Models

In [51]:
title <- c("Gender", "Length", "Diameter", "Height", "WholeWt", "ShuckedWt", "ViscWt", "ShellWt", "Rings")
aba <- read.csv(
    "../data/abalone.data",
    header=FALSE,
    col.names=title
)

Convert the `Gender` column to a factor:

In [52]:
aba$Gender <- as.factor(aba$Gender)

Remove the infant abalone records:

In [53]:
abamf <- aba[aba$Gender != "I",]

A function that fits a logistic regression of `Gender` on one column and returns the coefficients:

In [54]:
lftn <- function(clmn) {
    glm(abamf$Gender ~ clmn, family=binomial)$coef
}

Apply the function to every column except `Gender`:

In [55]:
loall <- sapply(abamf[,-1], lftn)

In [56]:
loall

,Length,Diameter,Height,WholeWt,ShuckedWt,ViscWt,ShellWt,Rings
(Intercept),1.275832,1.28913,1.027872,0.4300827,0.2855054,0.4829153,0.5103942,0.64823569
clmn,-1.962613,-2.533227,-5.643495,-0.268807,-0.2941351,-1.4647507,-1.2135496,-0.04509376


In [57]:
print(class(loall))

[1] "matrix" "array" 
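
The result is a matrix because `sapply()` simplifies its output when every call returns a vector of the same length. The sketch below shows the same effect on a made-up data frame, with a hypothetical helper `rng()` that returns two values per column, next to the unsimplified `lapply()` result.

```r
df <- data.frame(a = c(1, 2, 3), b = c(10, 20, 30))

# Each call returns a named vector of length 2
rng <- function(col) c(min = min(col), max = max(col))

as_list <- lapply(df, rng)  # list with one element per column
as_mat <- sapply(df, rng)   # simplified to a matrix, one column per input column

print(class(as_list))
print(dim(as_mat))
print(as_mat["max", "b"])
```

The row names of the matrix come from the names of the returned vector, which is why `loall` has rows `(Intercept)` and `clmn`.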
