# The R Language

## Setting Up the R Environment

- Download R from [here](https://www.r-project.org/)

- Download RStudio from [here](https://rstudio.com/)

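## Probability Distributions

### Binomial Distribution

- To view the documentation, run: help(dbinom)

Example: a pair of dice is thrown 24 times. What is the probability of getting at least one double six?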

Answer: let $Y$ be the number of double sixes in 24 throws, so $Y \sim \mathrm{Binomial}(24, 1/36)$.

$P=P(Y>0)$
$=1-P(Y=0)$
$=1-\binom{24}{0}\left(\frac{1}{36}\right)^{0}\left(\frac{35}{36}\right)^{24}$
$=1-\left(\frac{35}{36}\right)^{24}$
$\approx 0.491$

In [3]:
# Probability of at least one double six in 24 throws of a pair of dice
p <- 1 - dbinom(x=0, size=24, prob=1/36)
print(p)

[1] 0.4914039
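
As a cross-check, the same probability can be obtained in two other ways: from the closed form in the derivation, and from the upper tail of the binomial distribution function. This is a minimal sketch; the variable names are illustrative.

```r
# Three equivalent ways to compute P(Y >= 1) for Y ~ Binomial(24, 1/36)
p_dbinom  <- 1 - dbinom(x = 0, size = 24, prob = 1/36)   # complement of P(Y = 0)
p_formula <- 1 - (35/36)^24                              # closed form from the derivation
p_pbinom  <- pbinom(q = 0, size = 24, prob = 1/36,
                    lower.tail = FALSE)                  # upper tail P(Y > 0)

print(c(p_dbinom, p_formula, p_pbinom))
```

All three values should agree up to floating-point error.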


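**Exercise**

Two cinemas compete for a pool of 1000 customers. Assume that each customer has no preference between the two cinemas and that customers choose independently of one another. Let $N$ be the number of seats in each cinema. How large must $N$ be so that the probability of a customer being turned away because the cinema is full is less than $1\%$?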

In [11]:
# Binomial quantile: the smallest N with P(X <= N) >= 0.99
p <- qbinom(p=0.99, size=1000, prob=0.5)
print(p)

[1] 537
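
The reasoning behind the `qbinom` call can be checked directly. Let $X$ be the number of customers who choose one given cinema, so $X \sim \mathrm{Binomial}(1000, 0.5)$, and customers are turned away when $X > N$. The sketch below, with illustrative variable names, confirms that the quantile keeps the overflow probability under 1% and that one seat fewer does not.

```r
# X = number of customers (out of 1000) who choose a given cinema
n_customers <- 1000
n_seats <- qbinom(p = 0.99, size = n_customers, prob = 0.5)

# Overflow probability P(X > N) at the chosen N and at N - 1
overflow_at_n      <- pbinom(n_seats,     size = n_customers, prob = 0.5,
                             lower.tail = FALSE)
overflow_one_fewer <- pbinom(n_seats - 1, size = n_customers, prob = 0.5,
                             lower.tail = FALSE)

print(n_seats)
print(overflow_at_n)       # expected to be below 0.01
print(overflow_one_fewer)  # expected to be above 0.01
```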


### Normal Distribution

- To view the documentation, run: help(dnorm)

In [14]:
# Normal approximation to Binomial(1000, 0.5): mean = 500, variance = 250
p <- qnorm(p=0.99, mean=500, sd=sqrt(250))
print(p)
print(round(p))

[1] 536.7828
[1] 537
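
The parameters passed to `qnorm` come from the normal approximation to the binomial distribution: for $X \sim \mathrm{Binomial}(n, p)$ the mean is $np$ and the variance is $np(1-p)$. The sketch below derives them instead of hard-coding them; the variable names are illustrative. It uses `ceiling` rather than `round`, on the assumption that a seat count should never be rounded down below the quantile.

```r
n    <- 1000   # number of customers
prob <- 0.5    # probability that a customer picks a given cinema

mu    <- n * prob                      # mean of the binomial
sigma <- sqrt(n * prob * (1 - prob))   # standard deviation of the binomial

n_approx <- qnorm(p = 0.99, mean = mu, sd = sigma)
print(n_approx)
print(ceiling(n_approx))   # round up to a whole number of seats
```

For this exercise the approximation and the exact `qbinom` result give the same number of seats.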


## Multivariate Statistical Analysis with R

### t-Test

#### One-Sample

In [13]:
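# Compute the t statistic and two-sided p-value by hand
x <- c(4.20, 5.03, 5.86, 6.45, 7.38, 7.54, 8.46, 8.47, 9.87)
mu <- 6.50

# t = (sample mean - hypothesized mean) / standard error
tvalue <- (mean(x) - mu) / (sd(x)/sqrt(length(x)))
# abs() keeps the two-sided p-value correct when t is negative
pvalue <- 2 * (1 - pt(abs(tvalue), length(x) - 1))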
print(paste0("t value: ", tvalue))
print(paste0("p value: ", pvalue))

[1] "t value: 0.875408313038532"
[1] "p value: 0.406866176567346"


In [14]:
# t.test() is part of base R's stats package, so no extra library is needed

t.test(x, mu=6.50)


	One Sample t-test

data:  x
t = 0.87541, df = 8, p-value = 0.4069
alternative hypothesis: true mean is not equal to 6.5
95 percent confidence interval:
 5.635688 8.422090
sample estimates:
mean of x 
 7.028889 
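
The 95% confidence interval reported by `t.test` can be rebuilt by hand in the same way as the t statistic: sample mean plus or minus the critical t value times the standard error. This is a minimal sketch with illustrative variable names.

```r
x <- c(4.20, 5.03, 5.86, 6.45, 7.38, 7.54, 8.46, 8.47, 9.87)
n <- length(x)

se     <- sd(x) / sqrt(n)            # standard error of the mean
t_crit <- qt(0.975, df = n - 1)      # two-sided 95% critical value
ci_manual <- mean(x) + c(-1, 1) * t_crit * se

# Interval computed by t.test() for comparison
ci_ttest <- as.numeric(t.test(x, mu = 6.50)$conf.int)

print(ci_manual)
print(ci_ttest)
```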


#### Two-Sample

In [22]:
# Compute the t statistic and two-sided p-value by hand
x1 <- c(22, 34, 52, 62, 30, 40, 64, 84, 56, 59)
x2 <- c(52, 71, 76, 54, 67, 83, 66, 90, 77, 84)
n1 <- length(x1)
n2 <- length(x2)

# Pooled variance (assumes equal variances in the two groups)
Sp_square <- ((n1 - 1) * var(x1) + (n2 - 1) * var(x2))/ (n1 + n2 - 2)
print(Sp_square)

tvalue <- (mean(x1) - mean(x2)) / sqrt(Sp_square * (1/n1 + 1/n2))
# -abs() keeps the two-sided p-value correct when t is positive
pvalue <- 2 * pt(-abs(tvalue), n1 + n2 - 2)
print(paste0("t value: ", tvalue))
print(paste0("p value: ", pvalue))

[1] 254.0056
[1] "t value: -3.0445501225468"
[1] "p value: 0.00697485661385505"


In [27]:
t.test(x1, x2, var.equal = TRUE)


	Two Sample t-test

data:  x1 and x2
t = -3.0446, df = 18, p-value = 0.006975
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -36.6743  -6.7257
sample estimates:
mean of x mean of y 
     50.3      72.0 
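
The object returned by `t.test` is a list, so its components can be read by name, and the confidence interval for the difference in means can be rebuilt from the pooled variance. The sketch below does both; the names `sp2`, `se_diff`, and `ci_manual` are illustrative.

```r
x1 <- c(22, 34, 52, 62, 30, 40, 64, 84, 56, 59)
x2 <- c(52, 71, 76, 54, 67, 83, 66, 90, 77, 84)
res <- t.test(x1, x2, var.equal = TRUE)

# Rebuild the 95% confidence interval from the pooled variance
n1 <- length(x1)
n2 <- length(x2)
df <- n1 + n2 - 2
sp2     <- ((n1 - 1) * var(x1) + (n2 - 1) * var(x2)) / df
se_diff <- sqrt(sp2 * (1 / n1 + 1 / n2))
ci_manual <- (mean(x1) - mean(x2)) + c(-1, 1) * qt(0.975, df) * se_diff

print(ci_manual)
print(as.numeric(res$conf.int))   # interval from t.test()
print(unname(res$statistic))      # t statistic
print(unname(res$parameter))      # degrees of freedom
print(res$p.value)                # two-sided p-value
```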


In [36]:
# Pool both samples and code group membership as a 0/1 indicator
x <- c(x1, x2)
group <- rep(0:1, c(n1, n2))
print(x)
print(group)

 [1] 22 34 52 62 30 40 64 84 56 59 52 71 76 54 67 83 66 90 77 84
 [1] 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1


In [38]:
# Regressing on the 0/1 indicator reproduces the equal-variance t-test
lmobj <- lm(x ~ group)
summary(lmobj)


Call:
lm(formula = x ~ group)

Residuals:
   Min     1Q Median     3Q    Max 
-28.30 -11.80   2.85  11.18  33.70 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   50.300      5.040   9.980 9.21e-09 ***
group         21.700      7.127   3.045  0.00697 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 15.94 on 18 degrees of freedom
Multiple R-squared:  0.3399,	Adjusted R-squared:  0.3032 
F-statistic: 9.269 on 1 and 18 DF,  p-value: 0.006975
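
The regression output matches the two-sample t-test up to sign: the intercept is the mean of group 0, the slope is `mean(x2) - mean(x1)`, and the slope's t value is the negative of the t statistic from `t.test(x1, x2)`. The sketch below checks this numerically; `lm_coefs` and `tt` are illustrative names.

```r
x1 <- c(22, 34, 52, 62, 30, 40, 64, 84, 56, 59)
x2 <- c(52, 71, 76, 54, 67, 83, 66, 90, 77, 84)
x <- c(x1, x2)
group <- rep(0:1, c(length(x1), length(x2)))

lm_coefs <- coef(summary(lm(x ~ group)))   # coefficient table as a matrix
tt <- t.test(x1, x2, var.equal = TRUE)

# Intercept = mean(x1); slope = mean(x2) - mean(x1)
print(lm_coefs[, "Estimate"])

# Same magnitude, opposite sign, identical p-value
print(lm_coefs["group", "t value"])
print(unname(tt$statistic))
print(lm_coefs["group", "Pr(>|t|)"])
print(tt$p.value)
```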


### Analysis of Variance (ANOVA)

In [51]:
library(tidyverse)  # provides tibble()

x1 <- c(18.5, 24.0, 17.2, 19.9, 18.0)
x2 <- c(26.3, 25.3, 24.0, 21.2, 24.5)
x3 <- c(20.6, 25.2, 20.8, 24.7, 22.9)
x4 <- c(25.4, 19.9, 22.6, 17.5, 20.4)
print(paste("means: ", c(mean(x1), mean(x2), mean(x3), mean(x4))))

data <- tibble(
    x = c(x1, x2, x3, x4),
    group = factor(rep(1:4, c(length(x1), length(x2), length(x3), length(x4))))
)
print(data)

[1] "means:  19.52" "means:  24.26" "means:  22.84" "means:  21.16"
# A tibble: 20 x 2
       x group
   <dbl> <fct>
 1  18.5 1
 2  24   1
 3  17.2 1
 4  19.9 1
 5  18   1
 6  26.3 2
 7  25.3 2
 8  24   2
 9  21.2 2
10  24.5 2
11  20.6 3
12  25.2 3
13  20.8 3
14  24.7 3
15  22.9 3
16  25.4 4
17  19.9 4
18  22.6 4
19  17.5 4
20  20.4 4


In [52]:
aov_object <- aov(x ~ group, data)
summary(aov_object)

            Df Sum Sq Mean Sq F value Pr(>F)  
group        3  63.29  21.095   3.462 0.0414 *
Residuals   16  97.50   6.094                 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
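
The entries of the ANOVA table can be reproduced by hand. The between-group sum of squares measures how far the group means lie from the grand mean, the within-group sum of squares measures the scatter around each group mean, and the F statistic is the ratio of their mean squares. This is a minimal sketch in base R with illustrative variable names.

```r
x1 <- c(18.5, 24.0, 17.2, 19.9, 18.0)
x2 <- c(26.3, 25.3, 24.0, 21.2, 24.5)
x3 <- c(20.6, 25.2, 20.8, 24.7, 22.9)
x4 <- c(25.4, 19.9, 22.6, 17.5, 20.4)
x <- c(x1, x2, x3, x4)
group <- factor(rep(1:4, c(length(x1), length(x2), length(x3), length(x4))))

k <- nlevels(group)   # number of groups
n <- length(x)        # total number of observations

group_means <- tapply(x, group, mean)
group_sizes <- tapply(x, group, length)

# Between-group and within-group sums of squares
ss_between <- sum(group_sizes * (group_means - mean(x))^2)
ss_within  <- sum((x - ave(x, group))^2)   # ave() gives each value's group mean

f_value <- (ss_between / (k - 1)) / (ss_within / (n - k))
p_value <- pf(f_value, df1 = k - 1, df2 = n - k, lower.tail = FALSE)

print(c(ss_between = ss_between, ss_within = ss_within))
print(c(F = f_value, p = p_value))
```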

Fitting the same model as a regression with lm() gives the same F statistic and p-value:

In [53]:
lm_obj <- lm(x ~ group, data)
summary(lm_obj)


Call:
lm(formula = x ~ group, data = data)

Residuals:
   Min     1Q Median     3Q    Max 
-3.660 -1.650 -0.100  1.545  4.480 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   19.520      1.104  17.681 6.34e-12 ***
group2         4.740      1.561   3.036  0.00787 ** 
group3         3.320      1.561   2.126  0.04938 *  
group4         1.640      1.561   1.050  0.30913    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.469 on 16 degrees of freedom
Multiple R-squared:  0.3936,	Adjusted R-squared:  0.2799 
F-statistic: 3.462 on 3 and 16 DF,  p-value: 0.04137
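
The coefficients in this regression have a direct reading, assuming R's default treatment contrasts for factors: the intercept is the mean of group 1, and each `groupK` coefficient is the difference between the mean of group K and the mean of group 1. The sketch below checks this and recovers the ANOVA table from the fitted model with `anova()`; it rebuilds the data in base R so it runs on its own.

```r
x1 <- c(18.5, 24.0, 17.2, 19.9, 18.0)
x2 <- c(26.3, 25.3, 24.0, 21.2, 24.5)
x3 <- c(20.6, 25.2, 20.8, 24.7, 22.9)
x4 <- c(25.4, 19.9, 22.6, 17.5, 20.4)
x <- c(x1, x2, x3, x4)
group <- factor(rep(1:4, c(length(x1), length(x2), length(x3), length(x4))))

lm_obj <- lm(x ~ group)
group_means <- tapply(x, group, mean)

# Intercept = mean of group 1; other coefficients = differences from group 1
print(coef(lm_obj))
print(group_means - group_means[1])

# ANOVA table derived from the regression fit
anova_table <- anova(lm_obj)
print(anova_table)
```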
