# EWMBA-257: People Analytics
## Introduction to R (optional)


<img src="images/berkeley_img-4-1.jpg"  />

In this notebook, we introduce the R programming language and some R concepts that will appear in the code for your EWMBA-257 assignment.

Note: **this course will NOT require you to write or interpret R code**. However, you will run pre-written R code and interpret the *outputs* of that code. If you're curious about the code itself, you may find this notebook helpful to glean a bit more about how it's structured and what it does.

This notebook is geared towards people who are new to the R programming language and/or programming in general. By the end of this notebook, you should be able to:

* briefly define or describe R, R data types, and names, expressions, functions, vectors, and dataframes in R
* recognize data types, names, simple expressions, simple functions, vectors, and dataframes when they appear in code cells

Estimated time to complete: 15-30 minutes

### Table of Contents

1. <a href='#sectionr'>What is R (and why do we use it)?</a>

2. <a href='#sectiondata'>Data Types</a>

3. <a href='#sectionexpr'>Expressions</a>

4. <a href='#sectionerr'>Errors</a>

5. <a href='#sectionname'>Names</a>

6. <a href='#sectionfunc'>Functions</a>

7. <a href='#sectionvect'>Vectors</a>

8. <a href='#sectiondf'>Dataframes</a>

9. <a href='#sectionmore'>More Resources</a>


---

## <a id="sectionr"> What is R (and why do we use it)? </a>

### What is R?

[R](https://www.r-project.org/) is a programming language: a way for us to communicate with the computer and give it instructions.

Just like any language, R has a vocabulary made up of words it can understand, and a syntax giving the rules for how to structure communication.

Like natural human languages, R has rules. It differs from natural language in two important ways:

- The rules are simple. You can learn most of them in a few weeks and gain reasonable proficiency with the language in a few months.
- The rules are rigid. If you're proficient in a natural language, you can understand a non-proficient speaker, glossing over small mistakes. A computer running R code is not smart enough to do that.

In this notebook, we're going to learn a few of those rules.

### Why use R?

As a programming language, R makes it possible to perform complex analyses on large datasets in a way that is:

- repeatable: you can use the same code (with minor changes) to do the same analysis on different data
- fast: a lot of code to perform common analysis operations has already been created for you to use without writing it yourself
- free: the software is open source and usable by anyone at no cost

R is commonly used for business and research, particularly in fields like finance, economics, and statistics. If you are planning to work in the business world, it's very possible that you will work with R or that you will work with someone who programs in R.

<div class="alert alert-info"><b>What about Excel/Python/SAS/etc? </b> Common analytics tasks can be done using a variety of programming languages and software, which raises a natural question: which one should you use? Generally, no single language is the "best". Your choice will often depend on:
<ul><li>type of task: R was designed for statistical analysis, so it is well tested and documented for that purpose. Others have different specialties (e.g. Python and machine learning)</li>
    <li> size of dataset: Excel has been known to run into problems with big datasets (more than 1 million rows)</li>
    <li> location: many companies and fields have a predominant software used by most people in their community. It's a lot easier to collaborate when you all speak the same language </li></ul>
    
The good news is that once you are "fluent" in one, it becomes much easier to pick up another.
    </div>



---

## <a id="sectiondata"> Data Types </a>

The data we will work with broadly falls into two types: numbers and text. 

Numbers show up green in code cells and can be positive, negative, or include a decimal. 

When we begin working with collections of data like columns and tables, you'll see that R makes a distinction between *numeric* data (decimals; the default classification for all numbers in R) and *integer* data (whole numbers).

<div class="alert alert-info"> <b>NOTE: </b> <p> Lines in code cells that start with a # symbol are <i>comments</i>. The # symbol tells the computer that to the right of it on that line is NOT a command and should be ignored. Comments are purely for humans to organize and explain code.</p></div>

In [None]:
# everything in this cell is numeric data

3.14159

-70.1

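# the next three are whole numbers; R stores them as numeric by default
# (4 and -667 could also be stored as integers, but 87623000983 is too large for R's integer type, which stops at about 2.1 billion)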
4

87623000983

-667

Text data (also called *character* data or *strings*) show up red in code cells. Strings are enclosed in double or single quotes. Note that numbers can appear in strings. 

In [None]:
# Strings
"a"

"Hello world!"

"You may write me down in history
With your bitter, twisted lies,
You may tread me in the very dirt
But still, like dust, I'll rise."

# to the computer this is a string, NOT numerical data
"3.14159"

Finally, there's one special kind of data that looks like text at first glance. *Logical* data are truth values: they say whether or not a condition or state is true. There are only two possible logical data values, `TRUE` and `FALSE`.

In [None]:
# logical data
TRUE

FALSE
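
If you're ever unsure how R has classified a value, one way to check is the built-in `class` function (functions are covered later in this notebook). The cell below is a small illustrative sketch rather than part of the assignment code. The `L` suffix in `4L` is how you ask R to store a whole number as an integer.

```r
# numbers are "numeric" by default, even whole numbers
class(3.14159)
class(4)

# adding the L suffix stores a whole number as an "integer"
class(4L)

# text in quotes is "character" data, even if it looks like a number
class("3.14159")

# TRUE and FALSE are "logical" data
class(TRUE)
```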

---

## <a id="sectionexpr"> Expressions </a>

A piece of communication in R is called an expression: it tells the computer what to do with the data we give it.

Here's an example of an expression. 

In [None]:
# an expression
14 + 20

When you run the cell, the computer **evaluates** each expression within it and prints the result. 

<div class="alert alert-info"> Remember: you can run cells by clicking on the cell to highlight it, then clicking the Run button in the toolbar at the top or pressing Shift+Enter on your keyboard. </div>

In [None]:
# more expressions. Run the cell to evaluate them
100 / 10

4.3 + 10.98

33 - 9 * (40000 + 1)

884

Many basic arithmetic operations can be used in R, like `*` (multiplication), `+` (addition), `-` (subtraction), and `/` (division). There are many others, which you can find information about [here](https://www.statmethods.net/management/operators.html). 

The computer evaluates arithmetic according to the PEMDAS order of operations (just like you may have learned in middle school): anything in parentheses is done first, followed by exponents, then multiplication and division, and finally addition and subtraction.
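
As a short illustration of the order of operations (the numbers here are made up for the example), the cell below shows how parentheses and exponents change a result. In R, `^` is the exponent operator.

```r
# multiplication happens before addition
2 + 3 * 4      # 2 + 12 = 14

# parentheses are evaluated first
(2 + 3) * 4    # 5 * 4 = 20

# exponents happen before multiplication
2 * 3 ^ 2      # 2 * 9 = 18
```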

In [None]:
# before you run this cell, can you say what it should print?
# run the cell to check your answer
4 - 2 * (1 + 6 / 3)

If you'd like, feel free to use the next cell to experiment with arithmetic in R.

In [None]:
# use this cell to practice creating and executing arithmetic expressions


---

## <a id="sectionerr"> Errors </a>

Whenever you write code, you'll make mistakes. When you run a code cell that has errors, R will produce an *error message* to tell you that the computer doesn't understand what you want it to do.

Errors are completely normal; experienced programmers make many errors every single day (reports vary, but [one study found that professionals spend 35-50% of their programming time fixing errors](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.444.9094&rep=rep1&type=pdf)). When you make an error, the next steps are to find the source of the problem, fix it, and move on.

We have made an error in the next cell. Run it and see what happens.


In [None]:
print("Hello world!"

You should see something like this (minus our annotations):

<img src="images/error.png"/>

The most important part of the message is at the end of the first line (after the last colon, outlined in blue in the image). "unexpected end of input" indicates that there is something missing at the end of the line of code. Can you figure out what?

R errors in general can be confusing to look at. Much of the rest of the error message details what specific part of the code (either in the notebook or inside the R software itself) generated the error. A good strategy when you're learning R is to ignore everything other than the explanation. Even the explanations can be confusing, though, which is why most debugging (as a student or as a professional) involves copying your error into a search engine or asking someone else for help.

---

## <a id="sectionname"> Names </a>

Sometimes, the values you work with can get cumbersome: maybe the expression that gives the value is very complicated, or maybe the value itself is long. In these cases it's useful to give the value a **name**.

We can name values using what's called an *assignment* statement.


In [None]:
# assigns 442 to x
x = 442

The assignment statement has three parts. On the left is the *name* (`x`). On the right is the *value* (442). The *equals sign* in the middle tells the computer to assign the value to the name.

You may also see `<-` used to assign names in R. For simple assignments like the ones in this notebook, `<-` does the same thing as `=`.

In [None]:
# equivalent expression to assign 442 to x
x <- 442

You'll notice that when you run the cell with the assignment, it doesn't print anything. But, if we try to access `x` again in the future, it will have the value we assigned it.

In [None]:
# show the value of x
x

You can also assign names to expressions. The computer will compute the expression and assign the name to the result of the computation.

In [None]:
y = 50 * 2 + 1
y

We can then use these names as if they were the values they represent.

In [None]:
x - 42

In [None]:
x + y
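
Names are most useful when they describe what a value means. The cell below is a minimal sketch using made-up names and values (`hours_per_week`, `weeks_per_year`, and `hours_per_year` are hypothetical and do not appear in the assignment). It also shows that assigning to an existing name replaces its old value, while results computed earlier stay as they were.

```r
# assign descriptive names to two values
hours_per_week = 40
weeks_per_year = 52

# the right-hand side is computed first, then the result is given a name
hours_per_year = hours_per_week * weeks_per_year
hours_per_year    # 40 * 52 = 2080

# assigning to an existing name replaces its old value
hours_per_week = 35
hours_per_week

# hours_per_year was already computed, so it keeps its earlier value
hours_per_year
```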

<div class="alert alert-warning"> <p><b>Optional</b>:</p> <p>Use the next cell to practice assigning names to values. Note that names in R can be made up of letters, numbers, periods, and underscores, as long as they start with a letter or a period followed by a letter.</p></div>


In [None]:
# practice with names


---

## <a id='sectionfunc'>Functions</a>
We've seen that values can have names (often called **variables**), but operations may also have names. A named operation is called a **function**. R has some functions built into it.

In [None]:
# a built-in function 
round

Functions get used in *call expressions*, where a function is named and given values to operate on inside a set of parentheses. The `round` function returns the number it was given, rounded to the nearest whole number.

In [None]:
# a call expression using round
round(1988.74699)

A function may also be called on more than one value (called *arguments*). For instance, the `min` function takes however many arguments you'd like and returns the smallest. Multiple arguments are separated by commas.

In [None]:
min(9, -34, 0, 99)
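
Some arguments are optional and can be given by name. As a brief sketch: `round` accepts an optional `digits` argument that sets how many decimal places to keep, and call expressions can be nested so that the result of one function becomes an argument to another. The values below are made up for illustration.

```r
# round to 2 decimal places using the optional digits argument
round(1988.74699, digits = 2)            # 1988.75

# call expressions can be nested: the inner call is evaluated first
round(min(9.87, 3.14159), digits = 1)    # min gives 3.14159, which rounds to 3.1
```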

<div class= "alert alert-warning"><p><b>Optional:</b></p> Practice calling some built-in R functions.
<ul>
    <li>The <code>abs</code> function takes one numerical argument (just like <code>round</code>)</li>
    <li>The <code>max</code> function takes one or more numerical arguments (just like <code>min</code>)</li>
    </ul>

<p>Try calling <code>abs</code> and <code>max</code> in the cell below. What does each function do?</p>

<p>Also try calling each function <i>incorrectly</i>, such as with the wrong number of arguments. What kinds of error messages do you see?</p>
</div>

In [None]:
# practice with functions


---

## <a id="sectionvect"> Vectors </a>

Ideally, we want to be able to manipulate many values at the same time. We can do this using sequences: collections of data, all sharing the same type (e.g. numerical).

The sequence we'll work with the most is a **vector**. Vectors are made using the `c` function.

In [None]:
# make a vector and name it 'numbers'
numbers = c(4, 8, 15, 16, 23, 42)

# show the numbers vector
numbers

You can retrieve items in a vector by **indexing**. To index an item, put the numerical position of the item in square brackets next to the name of the vector.

In [None]:
# retrieve the second item in the vector
numbers[2]
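
Note that R counts positions starting from 1 (many other programming languages start from 0). The sketch below repeats the `numbers` vector so the cell can run on its own, and shows a few other common ways to index. `length` is a built-in function that returns the number of items in a vector.

```r
# the same vector used above
numbers = c(4, 8, 15, 16, 23, 42)

# R counts positions starting at 1, so this is the first item
numbers[1]                  # 4

# length() gives the number of items, so this retrieves the last item
numbers[length(numbers)]    # 42

# indexing with a vector of positions retrieves several items at once
numbers[c(1, 3)]            # 4 15
```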

Some functions take vectors as arguments.

In [None]:
# get the average value of the vector
mean(numbers)

# add up all the numbers within the vector
sum(numbers)

You can also do arithmetic with vectors. Note that this is *element-wise* arithmetic: for addition, the first element of the first vector is added to the first element of the second vector, the second element of the first vector is added to the second element of the second vector, and so on. The result will be a new vector.

In [None]:
# make another vector called primes
primes = c(2, 3, 5, 7, 11, 13)

# add numbers to primes
numbers + primes
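
Other arithmetic operations work element-wise in the same way. As a small sketch (both vectors are repeated so the cell can run on its own), the cell below subtracts one vector from another and multiplies a vector by a single number.

```r
numbers = c(4, 8, 15, 16, 23, 42)
primes = c(2, 3, 5, 7, 11, 13)

# subtract each prime from the number in the same position
numbers - primes    # 2 5 10 9 12 29

# a single number is applied to every element of the vector
numbers * 2         # 8 16 30 32 46 84
```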

<div class= "alert alert-warning"><p><b>Optional:</b></p> Practice creating and manipulating vectors.
</div>

In [None]:
# practice with vectors



---
## <a id="sectiondf">Dataframes</a>

Dataframes are fundamental ways of organizing and displaying data in tables. Technically, they are collections of vectors, where each vector has the same length. Mentally, you can think of a dataframe as a spreadsheet where each vector represents a column.

Since **columns** are vectors, you know that all of the data in each column is of the same type (numerical, text, etc). Columns typically represent a *feature* about the data, such as:

- color
- price
- name

You can also think about a dataframe in terms of its rows. Each row may have data of different types. A row usually represents an *observation* of something in the world with many features, such as:

- a person participating in a research study
- a day of stock trading
- a product at a store
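
To make the idea of "a collection of vectors" concrete, here is a minimal sketch that builds a tiny dataframe by hand with the built-in `data.frame` function. The `products` dataframe and its contents are made up for illustration and are not part of the assignment data.

```r
# each argument is a vector that becomes one column of the dataframe
products = data.frame(
  name = c("mug", "lamp", "desk"),
  color = c("blue", "white", "brown"),
  price = c(8.5, 24, 150)
)

# show the dataframe: 3 rows (observations) and 3 columns (features)
products
```

Each column holds one type of data (text for `name` and `color`, numbers for `price`), while each row mixes types to describe a single product.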

Dataframes are usually loaded in from spreadsheet files. Our data is in a csv (or "comma-separated values") file, so we will use the `read.csv` function. This function takes the *path*, or location, of the file as its argument.

In [None]:
# load the dataset
hr = read.csv("data/HR-Employee-Attrition.csv")

Once the dataset is loaded, we can view the first six rows using the `head` function.

In [None]:
# show the first 6 rows of the dataset
head(hr)

At the very top, we can see the visible dimensions of the dataframe as "number of rows x number of columns" (in this case, 6 visible rows and 35 columns). The names of the columns are listed next, with the type of data in each column underneath (most of these are "int" for "integer"). Finally, it displays the rows of the dataframe.

Note: because this dataframe has a lot of columns, some of the columns in the middle have been cut out and replaced with "...".

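Also note: some of our columns have been classified as "fct" for "factor". **Factor** data represents categories. In versions of R before 4.0, `read.csv` assumes by default that columns with text in them describe nominal or ordinal categories; from R 4.0 onward, text columns are kept as character ("chr") data unless you ask for factors.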

We can use functions on a dataframe, just as we can use functions on values or vectors. Here are some examples of common dataframe functions:

In [None]:
# get the dimensions (number of rows and number of columns) of the dataframe
dim(hr)

# list the names of the columns
colnames(hr)

You can also index a single column from a dataframe, similarly to how you can index a single value from a vector. The syntax is a bit different: you index a column using the `$` symbol.

In [None]:
# index the Department column
hr$Department
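
Because an indexed column is a vector, you can pass it to the vector functions seen earlier. The sketch below uses a small made-up dataframe called `staff` (hypothetical, not the real HR dataset) so that the cell can run on its own. `table` is a built-in function that counts how often each value appears.

```r
# a small made-up dataframe (not the real HR dataset)
staff = data.frame(
  Department = c("Sales", "Research", "Sales", "HR"),
  MonthlyIncome = c(5000, 6200, 4800, 5500)
)

# indexing a column gives a vector
staff$MonthlyIncome

# so vector functions work on it
mean(staff$MonthlyIncome)    # 21500 / 4 = 5375

# count how many rows fall in each department
table(staff$Department)
```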

---

## <a id="sectionmore"> More Resources for Learning R </a>

This notebook has given an extremely brief overview of the major R concepts used to manipulate the data for the IBM attrition data assignment, including Data Types, Expressions, Errors, Names, Functions, Vectors, and Dataframes. Hopefully you're now able to recognize some of these concepts in the pre-written code seen in the assignment itself.

If you're interested in learning more about R (and potentially writing your own R code), there is a wealth of resources available:

### UC Berkeley workshops and bootcamps
* [D-Lab](https://dlab.berkeley.edu)

### Websites for R help:  
* [STHDA](http://www.sthda.com/english/)  
* [Quick-R](http://statmethods.net/)  
* [UCLA idre](https://stats.idre.ucla.edu/r/)  
* [R-bloggers](https://www.r-bloggers.com/)  
* [Stack Overflow - R](http://stackoverflow.com/questions/tagged/r)  

### Web Resources
* [R Markdown: The Definitive Guide](https://bookdown.org/yihui/rmarkdown/)  
* [The tidyverse style guide](http://style.tidyverse.org/)  
* [Tidy Text Mining](https://www.tidytextmining.com/tidytext.html)
* [Regular expressions with stringr](https://stringr.tidyverse.org/articles/regular-expressions.html)
* [Quick Intro to Parallel Computing in R](https://nceas.github.io/oss-lessons/parallel-computing-in-r/parallel-computing-in-r.html) 
* [Software Carpentry](https://swcarpentry.github.io/)  

### Books
* [Kerns GJ. 2010. Introduction to Probability and Statistics Using R](http://www.atmos.albany.edu/facstaff/timm/ATM315spring14/R/IPSUR.pdf)
* [Wickham H. 2014. Advanced R](http://adv-r.had.co.nz/)  
* [R for Data Science](http://r4ds.had.co.nz/)  
* [Lander J. 2013. R for everyone: Advanced analytics and graphics](http://www.jaredlander.com/r-for-everyone/)  
* [Matloff N. 2011. The art of R programming: A tour of statistical software design](https://www.nostarch.com/artofr.htm)  
* [Brunsdon C, Comber L. 2015. An Introduction to R for Spatial Analysis and Mapping](https://us.sagepub.com/en-us/nam/an-introduction-to-r-for-spatial-analysis-and-mapping/book241031)
* [James G, Witten D, Hastie T, Tibshirani R. 2013. An Introduction to Statistical Learning: With Applications in R, 7th edition](http://faculty.marshall.usc.edu/gareth-james/ISL/)

---

Notebook created by Keeley Takimoto.

Some text adapted with permission from materials made for Haas Executive Education's Data Science Online course by Keeley Takimoto.

Links in the "More Resources for Learning R" section from the [UC Berkeley D-Lab "R Fundamentals" workshop](https://github.com/dlab-berkeley/R-Fundamentals) by Evan Muzzall, Aniket Kesari, Jae Yeon Kim, Sam Abdel-Ghaffar, Guadalupe Tuñón, Shinhye Choi, Patty Frontiera, Rochelle Terman, and Dillon Niederhut ([CC BY-NC 4.0](https://github.com/dlab-berkeley/R-Fundamentals/blob/master/LICENSE))
