-
Notifications
You must be signed in to change notification settings - Fork 71
/
Lecture 1 - Intro.R
528 lines (292 loc) · 15.7 KB
/
Lecture 1 - Intro.R
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
#Introduction to R
#The Basics
#R is a powerful analytically tool capable of performing complicated procedures on
#large data sets. When learning R, it may be helpful to think of the program as
#a TI-89 calculator on your laptop. Like a TI-89, R can be used as a calculator
#and graphing software. However, R can also be used to perform advanced statistical
#analysis, machine learning, geo-spatial analysis, and much more. Similar to
#figuring out a TI-89 in high school, becoming proficient in R will take time,
#practice, and inevitably a lot of googling.
#One of the things that makes R such an effective and efficient tool is that everything
#in R is vectorized. Vectorized code means that operations (+,-,x,/,... and more)
#can be performed simultaneously on multiple values without needed to loop the
#code (more on this later).
#NOTE: to run lines of code in R you can use the 'Run' button on the upper left-hand
#corner of this window OR run code line by line using the keys 'ctrl' + 'enter.'
#NOTE: for more information on any function we run, type ?function_name() into the
#console for an explanation of what it does and how to use it.
################################################################################
#### Creating and Using Vectors
#Let's work through an example. To make a simple vector of four values (1 - 4):
a = (1:4)
a = c(1,2,3,4)
a
#Alternatively you can code a vector of numbers individually by using the code:
#a = c(1,2,3,4)
#This may be helpful if you need a vector of non-consecutive numbers.
#We have given this vector the name 'a', this will make it easier to use the
#vector in calculations. You can always write out the vector or value without
#assigning it a name, but it becomes more complicated to re-write the vectors/values
#as things become more complex.
#Now let us multiply this vector by 2
a*2
#In your console the output will show our original vector a where every value
#in a has been doubled. Create a second vector of values 5 - 8 and call it b.
#Multiply this by a
b = (5:8)
a*b
#Here every value in a has been multiplied by its corresponding value in b.
#Now let us create a larger vector of all integers 1 - 100 and call it c.
c = (1:100)
#Let's figure out the summary statistics for c. Using the function summary().
summary(c)
#Great Job! You just used your first function in R! R has many built in functions
#and others that need to be downloaded using 'packages'. But more on this later.
#The mean and median for should be 50.50. Additional summary statistics
#(minimum, 1st quartile, 3rd quartile, maximum) are also provided.
#In addition to summary(), additional functions mean(), max(), and min() can also
#give you useful summary statistics.
################################################################################
###Vectors and Random Number Generation
#Now what if we want a vector or 4 random numbers (taken from the normal distribution)?
#Use the function rnrom() and assign a vector of 4 numbers to the letter d. Find
#the summary statistics.
d = rnorm(4)
summary(d)
d
#The function rnorm() randomly selects values from the normal distribution. This
#means that every time you run the function a different selection of values will be made.
#Type ?rnorm() into the console for a better understanding on the different types
#of random number generation and specifications you can make.
#What if we want to make sure we all get the same numbers to check our results?
#In order to do this we use the function set.seed(). In this function we set the
#starting value of the random number generating sequence, this is called a 'seed'.
#If we all set the same seed, we should see the same results. Let's set the seed to 101.
set.seed(101)
e = rnorm(4)
e
e
#Check with your neighbor to make sure you have the same output.
#NOTE: if you re-run e = rnorm(4) you will get a different selection of numbers
#because you are re-selecting the numbers. If everyone does this we will all get
#the same numbers but they will not be the same as the first selection. The main
#benefit of set.seed is to be able to replicate work to ensure the code is running
#correctly / no errors were made.
################################################################################
###From Vectors to Data Analytics
#Now I am sure you are wondering how you are going to use this information in
#econometric/statistical analysis. Well let us look back at the summary statistics
#of 'a' and 'b'.
a
b
#What if 'a' was the number of girls attending a summer camp on a given day
#and 'b' was the number of boys? Let's create the vector 'total', the total of
#all kids attending the camp.
total = a + b
total
#Now let's rename 'a' and 'b' to 'girls' and 'boys', respectively.
girls = a
boys = b
#Let's turn this mess of vectors into a data table we can use using cbind().
#cbind() is a function that attaches vectors (or columns when thinking about tables)
#together to create new data tables.
attendance = cbind(girls, boys, total)
#Go into the environment and open attendance. You should see three columns in the
#same order that you wrote them in the cbind() function. You can also view attendance
#in the console.
attendance
#Now let's try the summary function on attendance.
summary(attendance)
#Look at that! Three sets of summary statistics for the three variables (columns)
#that we created!
#Now let's try something a bit different. What if the number of girls was not
#counted correctly? How would we go about re-coding the table? To begin create
#a new vector with the numbers (4, 12, 7, 3), name this vector girls_2.
girls_2 = c(4, 12, 7, 3)
#Now we need to replace the column girls with girls_2 in the table. One way to
#do this is re-creating the attendance table. First we need to re-do the total column.
total_2 = girls_2 + boys
attendance_2 = cbind(girls_2, boys, total_2)
attendance_2
#Now let us look at a single column in the table. To do this we use:
#table_name$column. The '$' selects a specific column from a table. Select the
#boys column.
attendance$boys
boys
#You should get the error term "$ operator is invalid for atomic vectors". Why
#is this? R is picky about how data is stored. The table we created is a collection
#of vectors which limits what we can do when coding in R. The atomic designation
#R gives to the columns in the table makes it impossible to view the contents of
#the vectors through the table, you have to look at the vectors themselves. While
#this doesn't seem like a big deal now, not being able to pull up variables/columns
#from specific data sets you download will cause you a lot of trouble.
#To solve this issue we need to convert our table into a data frame. The function t0
#do so is as.data.frame(). Let's remake our attendance table as a data frame and
#name it attendance_df.
attendance_df = as.data.frame(attendance)
attendance
attendance_df
attendance_df$boys
#Now we have our boys column! Next let's select the third value from the boys column.
attendance_df$boys[3]
attendance_df[3,2]
#Great job! Earlier we replaced the entire girls column, now let's change this one
#entry for the boys column from 7 to 11. To do this we need the row and column numbers
#of the value we want to replace. The row for this value is 3 and the column number
#for boys is two. Check and see the location of this value if you do not understand
#how we got these numbers.
attendance_df[3,2] = 11
attendance_df$boys
#Now let's add one more column to this data frame called counselors, for the number
#of camp counselors working that day.
attendance_df$counselor = c(4,4,2,5)
#Coding the counselor variable this way bypasses creating a vector and then adding it
#to the table. If you wanted to first create a vector and then add it to the table
#the code would be:
counselor <- c(4,4,2,5)
attendance_df$counselor = counselor
#Now what if the first two days of the camp actually had 3 counselors instead of 4?
#To re-code multiple values in a table in R use the form:
attendance_df$counselor[attendance_df$counselor == 4] = 3
#This looks complicated (and a bit redundant) but what R is doing is calling the
#column in the data frame (attendance_df$counselor), taking the column and finding
#the values we are interested in changing ([attendance_df$counselor == 4]) and then
#reassigning three to all rows in that column that were listed as fours (= 3).
#Please note that if you have entries that are fours that do not need to be changed
#you will need to use the earlier method and change the entries individually as shown below.
attendance_df[1:2,4] = 6
attendance_df[c(1,2), 4] = 8
#The ':' changes all numbers between those two values, 'c(#,#,#,#...)' replaces specific
#numbers that are not consecutive.
#RECAP
################################################################################
#R uses vectors to preform operations quickly on multiple values.
#Vectors can be created by assigning a letter or word to a string of numbers.
#Vectors can be created using random normally distributed numbers with the
#function rnorm().
#The function summary() gives summary statistics.
#Using as.data.frame() to create a data frame from a table will give you more
#flexibility when coding in R.
################################################################################
######################### Exercises ############################
################################################################################
### Exercise 1.
#You need to find the average temperature and most common kind of weather in a town
#over the last 100 days. We will create this data. The different types of weather
#are 1. sunny, 2. cloudy, 3. partial sun, 4. rain, 5. hail.
#Create 3 vectors of length 100.
#The first vector are the values 1 - 100, name it days. (Use ':')
days = c(1:100)
#The second vector is a repeating sequence of 1 through 5. The code for repeating
#sequences looks like vector_name = rep(repeating_numbers, #_of repetitions).
#Write a line of code that repeats 1 - 5 twenty times. Name this vector weather.
weather = rep(1:5, 20)
#The third vector is a random sample of values. Set the seed to 0. Now go the the
#console and type: ?rnorm . For n put 100, mean = 20, and sd = 5. Name this temp.
set.seed(0)
temp = rnorm(100, 30, 5)
#Combine the columns into a table and make this table a data frame. Name the data
#frame w_report (short for weather report) and find the summary statistics.
w_report = cbind(days, weather, temp)
w_report = as.data.frame(w_report)
summary(w_report)
#What is the average temperature? What kind of weather did they have on the 66th day?
#The average temperature is 30.11 degrees.
#On the 66th day they had sunny weather.
################################################################################
### Exercise 2
#There are five elementary school classes going to a summer camp. Ms. 1's class has
#14 girls and 10 boys, Mr. 2's class has 10 girls and 15 boys, Mr. 3's class has 3 girls
#and 16 boys, Ms. 4's class has 17 girls and 4 boys, Mr. 5's class has 12 girls and 12 boys.
#Create a data frame with a column for:
# 1. All of the different classes called class. Named class.
# 2. The number of girls in a class. Named girls.
# 3. The number of boys in a class. Named boys.
# 4. The total number of children in each class. Named total.
#Answer the following.
# Find the class with the most students and the class with the most boys.
# What is the average number of girls in a class?
#REMEMBER: The ordering of the numbers in the columns matter.
class = c(1:5)
girls = c(14, 10, 3, 17, 12)
boys = c(10, 15, 16, 4, 12)
total = girls + boys
summer_camp = cbind(class, girls, boys, total)
summer_camp = as.data.frame(summer_camp)
summer_camp
#The class with the most students in class 2. The class with the most boys is class 3.
summary(summer_camp$girls)
#The average number of girls in a class is 11.2.
#Half way through the camp it starts raining and several of the kids go home.
# 8 boys from the 1st class go home.
# 3 girls from the third class go home.
# 4 girls and 4 boys from the fourth class go home.
# Everyone from the second and fifth classes stays.
#Go back and correct the numbers in the table. What is total of all the kids remaining
#at the camp? (Remember: to call a specific column from a table use table$column, the
#function sum() will be helpful to use.)
summer_camp[1,3] = 2
summer_camp[3,2] = 0
summer_camp[4,2] = 13
summer_camp[4,3] = 0
summer_camp
summer_camp$total = summer_camp$girls + summer_camp$boys
summer_camp
sum(summer_camp$total)
#The total number of kids left at the summer camp is 94.
################################################################################
### Exercise 3
#A university in Montana made a billing mistake. Certain students over- or
#underpaid their tuition for the semester. The university is issuing refunds to
#students to correct for this issue. The students are from four major cities in
#Montana.
#Create a data frame with 2 columns. Have one column be a random selection from
#the normal distribution, with length 200, a mean of 100, sd of 7.5. Set seed equal to 50.
#Call this refund. Have the other column be the sum of the random sample (given below) and
#the first column. The second column is also 200 entries, name this savings.
#The code for selecting a random sample of certain numbers is given below. In this
#case the code randomly selects integers between 50 and 5000.
selection = sample(c(50:5000), 200)
set.seed(50)
refund = rnorm(200, 100, 7.5)
savings = refund + selection
df_1 = cbind(refund, savings)
df_1 = as.data.frame(df_1)
#Create a second data frame with 2 columns. Have the first column be a repeating
#column of the names "Bozeman", "Missoula", "Helena", and "Billings". This is
#coded the same way you would do a numeric entry, the only difference is the words
#need to have "" around them. Call this column city. The second column is a random
#selection (same seed) with a mean of 500 and sd of 200. Call this column bills.
city = rep(c("Bozeman", "Missoula", "Helena", "Billings"), 50)
bills = rnorm(200, 500, 200)
df_2 = cbind(city, bills)
df_2 = as.data.frame(df_2)
#Combine the data frames, name this student_debt. (If you run into an issue with
#numeric vs. character column designations look at the function as.numeric().)
student_debt = cbind(df_1, df_2)
summary(student_debt)
student_debt$bills = as.numeric(student_debt$bills)
summary(student_debt)
#Now that the data frames are combined create a new column that subtracts savings
#from bills, call this debt. Calculate the average debt held by students. What
#is the smallest and largest refund received by a student?
student_debt$debt = student_debt$savings - student_debt$bills
summary(student_debt)
#The average amount of debt held by a student is $498.19. The largest refund
#received by a student was $120.01, the smallest was -$629.40.
### Challenge - there is no code given for this section. Use the internet and other
#resources to answer the question. **Packages will be discussed in the next section,
#you are welcome to use them if you want. It is possible to answer the question without
#them.**
#Subset the data frame by city. Which city has the largest average debt? What city
#does the student with the smallest debt live in? the largest?
bozeman = student_debt[student_debt$city == "Bozeman", T]
missoula = student_debt[student_debt$city == "Missoula", T]
helena = student_debt[student_debt$city == "Helena", T]
billings = student_debt[student_debt$city == "Billings", T]
summary(bozeman$debt)
summary(missoula$debt)
summary(helena$debt)
summary(billings$debt)
#The city with the largest average debt is Bozeman. The student with the largest
#debt lives in Helena, the smallest lives in Billings.