# ECON 326: Introduction to Basic Statistics using Jupyterlab

## Authors
* Jonathan Graves (jonathan.graves@ubc.ca)
* Devan Rawlings (rawling5@student.ubc.ca)

## Prerequisites
* None!

## Outcomes

* Understand how to use UBC's JupyterLab environment (Syzygy) to take part in the in-class activities
* Review and refresh our knowledge of introductory statistical concepts
* Understand how to estimate and evaluate these concepts using R
* Import and load data into a Jupyter Notebook
* Understand the structure of data objects in R
* Perform simple manipulation of these objects to create or edit variables

*Last Update: 5 May 2021*

# Part 1: Introduction to R and Jupyter

In this tutorial, we will be working with some real-world data: the 2016 Canadian Census, provided by Statistics Canada[<sup id="fn1s">1</sup>](#fn1).  We will learn more about this dataset later, but let's first get a handle on it by using the data in our **Notebook** (this file).  In order to do this, we first need to import the data.  Essentially, we need to tell our computer how to interpret the data so that we can use it for econometric analysis.  In general, importing data can be more difficult than you might expect - there can be issues with formatting, memory, and other kinds of problems.  Fortunately, for this course we have the luxury of working with relatively "nice" data.

In R, we import data using different commands, which are stored in **libraries** that other developers of R have created for us.  Let's import some of the most important libraries now.  

We can do this in a Jupyter notebook by selecting a cell and hitting "shift+enter" or by pressing the "play" button in the menu.
* *Important*: information is shared across cells in a notebook.  However, cells run independently; so, if you run a later cell it doesn't re-evaluate previous cells.  You have to re-run them before you can use the results.  You can re-run all the cells in a notebook with the "fast forward" button.

Try loading some of the packages into memory by running the following cell.

In [None]:
# Run this cell to evaluate the code

library(tidyverse)
library(haven)


source("hands_on_tests_1.r")

You have now imported the packages into memory, and they are available to use in subsequent cells.  You may also see some output, which tells you about how they have been imported.

As an aside, in this course we try (as much as possible) to use R packages which are part of the [tidyverse](https://www.tidyverse.org/) family of packages.  This is because they are well-supported, consistent, and commonly used in data science.  There are (usually) other packages that provide similar functions.

## Importing Data into R

The first step in an econometric project is to import, tidy, and examine your data.  For this project, we will be using data from the 2016 Canadian Census, provided by Statistics Canada (see the license notes).  This is _real data_ on real Canadians, and is representative of the overall Canadian population.  Wow!

In this course, we will always work with (and expect) data that is **tidy**.  This is a data-science term that refers to data with a particular format or shape.  Specifically, tidy data is rectangular data in which:
* Each row represents one observation
* Each column represents a single variable

You can imagine this as a spreadsheet in a program like Excel.  This is the way most statistical programs like to receive data - but it _isn't_ a property of the data itself.  It is only a representation; there can be other kinds of representations.  For example, a panel dataset might have each row as a unit (e.g. country) and each column might represent a variable in a year (e.g. unemployment in 2016, unemployment in 2017, etc.).  Often, when we work with real-world data we have to reshape it in order for it to be usable - but in this course, we won't worry about that.
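
You will not need to reshape data in this course, but a small sketch may make the difference between the two representations concrete.  The data frame ``wide`` below is entirely made up, and ``pivot_longer`` comes from the ``tidyr`` package (part of the tidyverse); treat this as an illustration rather than a recipe.

```r
library(tidyr)

# Hypothetical "wide" panel: one row per country, one column per year
wide <- data.frame(
  country    = c("A", "B"),
  unemp_2016 = c(7.0, 5.5),
  unemp_2017 = c(6.3, 5.1)
)

# Reshape so that each row is one observation (a country in a given year)
tidy <- pivot_longer(
  wide,
  cols         = starts_with("unemp_"),
  names_to     = "year",
  names_prefix = "unemp_",
  values_to    = "unemp"
)

tidy
```

The tidy version has one row per country-year and one column per variable, which is the shape we assume throughout.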

Data comes in many different formats, which must be interpreted by our statistical programs.  Essentially, we need to tell them what the data means.  This is referred to as **importing** data.  There are different techniques for importing data in different formats; for example:

* Importing data from a ```.dta``` file: this is data that was created by the statistical software Stata and is commonly used in economics.
* Importing data from a delimited text file: this is data which is in a text format, but where the columns are *delimited* or separated by a special character.  For example, the ```.csv``` (comma-separated values) format is text, but where the columns are separated by a comma.

Each of these formats has a special **method** associated with it.  Methods (in R, these are usually called _functions_) are commands, stored in R packages, that tell R to do things.  In fact, you've already seen one!  The ```library(...)``` method imports whatever package is in the brackets into memory.
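
As a rough sketch of the delimited-text case, the example below writes a tiny, made-up ``.csv`` file to a temporary location and reads it back with ``read_csv`` from the ``readr`` package (part of the tidyverse).  The file contents and the name ``toy_csv`` are hypothetical; they are not part of the course data.

```r
library(readr)

# Write a tiny, made-up CSV to a temporary file so the example is self-contained
csv_path <- tempfile(fileext = ".csv")
writeLines(c("id,wages", "1,52000", "2,48000"), csv_path)

# read_csv() parses the comma-delimited text into a data frame
toy_csv <- read_csv(csv_path)

toy_csv
```

The first line of the file is treated as the column names, and each later line becomes one row.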

Let's start by importing our ```.dta``` file into memory, using the ```read_dta``` method.  When we read things into memory, we normally want to use them later, so we have to give them a name.  Let's call this ```census_data``` so we can refer to it later on:

In [None]:
# read the file named "01_census2016.dta" into memory, and assign it to the object "census_data"

census_data <- read_dta("01_census2016.dta")

The symbol ```<-``` in R is the _assignment operator_ and assigns whatever is on the right-hand side into the name on the left-hand side.  When you don't assign something, R just prints the output instead.  Some important things to note:

* You can assign pretty much anything in R using this operator
* However, names are unique and case-sensitive
* If you assign a name that was already assigned, it will be overwritten
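
A quick sketch of the last two points, using made-up names (``total`` and ``Total`` are only for illustration):

```r
total <- 10
Total <- 99          # a different object: names are case-sensitive

total                # still 10

total <- total + 5   # re-using the name overwrites the old value
total
```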

Consider the following R code.  Can you tell what it is going to do before you hit run?

In [None]:
x <- 1
y <- 2

x + y

y <- 5

x + y

That's right!  It assigns the value 1 to the variable ```x``` and the value 2 to the variable ```y```, then adds them up.  Notice that when we change the value of ```y``` we get a different result.  This is an advantage, because it lets you make repeated changes to an object without needing a new name for it - for example, adding a variable to a dataset or deleting observations.

We can also do one other clever thing here: test and evaluate your answers.  For example, let's see if you can predict the result of this code:

In [None]:
x <- 1

y <- 3

y <- x + y

#store the value you think y will be in ''answer1'' by completing this code

answer1 <- #replace this 

test_1()

Did you get the right answer?  We will frequently use this technique in these hands-on sessions to test your knowledge and make sure you're on the right track.

## Viewing and Inspecting Data

After we have imported our data, the next important task is to inspect and view it, so that we can understand its structure.  R offers several different ways to do this. We will learn three of these commands:

* ```head(...)```
* ```print(...)```
* ```glimpse(...)```

Rather than explain each one, let's just look at them - as applied to ```census_data```:

In [None]:
head(census_data)

The ```head``` command provides a "snapshot" of the first 6 rows of a dataset (by default), which gives you a look at what it would look like in a program like Excel.  This is very helpful, since it can give you a quick feeling for the data's structure.  Note the variable names along the columns; this is **tidy** data, after all!
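
If you want more or fewer rows, ``head`` also accepts an argument ``n``.  The sketch below uses ``mtcars``, a small example dataset that ships with R, rather than the census data:

```r
# mtcars is a small example dataset that ships with R
default_rows <- head(mtcars)        # default number of rows
first_three  <- head(mtcars, n = 3) # ask for exactly 3 rows

nrow(default_rows)
nrow(first_three)
```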

In [None]:
print(census_data)

The next command is ```print``` which prints out the first few lines of a dataset - but it also provides much more information on the formats and types of the variables.  This is critical, since (for example) coded qualitative variables can look like numbers, but need to be treated very differently when doing analysis. The tag ```<dbl>``` denotes a double variable that can hold a floating-point (or decimal) number. ```<dbl+lbl>``` indicates that certain values of the variable have labels (usually to represent qualitative variables like native language). 

An alternative to ```print``` is the ```glimpse``` command, which does the same thing but formatted vertically.  This is usually easier to read, particularly when you have a dataset with a large number of variables (columns).

In [None]:
glimpse(census_data)

Personally, we recommend the ``glimpse`` command in general.  You can also ``glimpse`` individual variables (we will explore what ``$`` means in the next section):

In [None]:
glimpse(census_data$immstat)

See if you can figure out what the different parts of this output mean!  You don't need to know right now, but as you become better with R, it will start to make sense to you.

### Accessing Variables and Data Frames

If you recall, this dataset is _tidy_ in that each observation is a row, and each column is a variable.  In R, datasets are called *data frames* - this particular one is a special type of data frame called a _tibble_ (like, table).  We don't need to get into too many details about data frames, but basically they collect and organize all of the variables and observations.  Many functions in R need information to be organized into a data frame so that it can be computed.  You will see examples of these later on.

One of the most important things to remember is that you can access the _variables_ in a dataframe in two ways:

1.  First, you can use the `$` operator to directly access the variables
2.  Second, within a command, you can tell the command what data you are working with, then refer to the variables by name.

For example, if we wanted to get the ``mean`` of ``ppsort`` (for some reason...) we could do it this way:

In [None]:
mean(census_data$ppsort)

This says get the variable ``ppsort`` from the dataframe ``census_data``, then compute the mean.  In a command like ``filter`` you can tell the command to work with the data, and then refer to the variables just by their name.  Compare these two commands:

In [None]:
head(filter(.data = census_data, sex == 1))

head(filter(.data = census_data, census_data$sex == 1))

Notice how we didn't need the `$` reference in the top one?  That was because we told filter that the data (``.data``) was ``census_data``.  This gives you deep control over the data when you're working with complicated datasets.
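
Here is the same idea on a tiny, hypothetical data frame (``toy`` and its values are made up, standing in for ``census_data``), so that you can check the results by eye:

```r
library(dplyr)

# Hypothetical toy data, standing in for census_data
toy <- tibble(sex = c(1, 2, 1, 2), wages = c(40000, 52000, 61000, 45000))

# 1. Pull a column out with $, then pass it to a function
mean(toy$wages)

# 2. Tell filter() which data frame to use, then refer to columns by name
men <- filter(toy, sex == 1)
men
```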


Now that we've loaded our data and have some basic familiarity with R, let's start using it to do some data analysis.  Then, you will take over the analysis and complete exercises.

# Part 2: Hands-On

## The Immigrant Wage Gap

Canada is a nation where immigration is very important, both to our society and to our economy: As the [Wall Street Journal](https://www.wsj.com/articles/canada-looks-to-immigration-to-boost-economic-recovery-11617105739) writes, "Canada is among the most immigrant-reliant advanced economies in the world. Before the pandemic, net migration accounted for more than 80% of Canada’s population growth, compared with about 40% in the U.S."   Canada prides itself on being a welcoming, multi-cultural country: More than 20% of Canada's population are immigrants, and almost 40% of Canadian children have immigrant parents.   Immigrants come to Canada through three main categories:

* Economic immigrants: individuals immigrating to Canada for economic reasons like a new job or business opportunity
* Family immigrants: individuals immigrating to Canada to join a family member
* Refugees: individuals immigrating to Canada to escape unrest or persecution in their home country

Canada particularly tries to recruit and attract skilled immigrants; most immigrants who do not come as part of a family, or as refugees, are accepted under a ["points system"](https://www.canada.ca/en/immigration-refugees-citizenship/services/immigrate-canada/express-entry/eligibility/federal-skilled-workers/six-selection-factors-federal-skilled-workers.html), which particularly looks for immigrants who are (a) young, (b) well-educated, and (c) experienced in desirable fields (such as medicine or engineering).  While refugees receive a lot of press attention, about 60% of immigrants arrive through the economic category - they are seeking better opportunities for themselves and their families.  Each of the other categories is about 20% of the immigrant population.

One of the biggest debates around immigration concerns immigrants' ability to contribute economically to Canadian society.  Economic immigrants are explicitly selected on this basis but they face many challenges integrating into Canadian society.  This can result in serious problems finding a job: we see that immigrants frequently earn a lower wage than domestically born Canadians.  We call this the immigrant wage gap.

The question of "why this gap exists" is critically important, because it tells us whether this is a problem in terms of maximizing the economic benefits of our immigrant population.  It also informs the kinds of policies we should carry out.  To see why this might be a debate, consider the following three rationales for why this gap exists:

1. Immigrants are paid less than domestic-born Canadians because they have less education and fewer skills than domestic Canadians.
  * If this is the case, immigrants are competing with other low-skill or low-education native Canadians for jobs.  Is this desirable?
2. Immigrants are paid less than domestic-born Canadians because Canadian businesses do not recognize their education or professional credentials.  You have probably heard the story of a doctor driving a cab for a living, or a nurse working as a hospital cleaner.
  * If this is the case, immigrants are being underemployed and not using their skills to the maximum extent.  This is undesirable.
3. Immigrants are paid less than domestic-born Canadians because immigrants are more likely to be women, younger, and have children, all of which are groups which earn lower wages.
  * If this is the case, then the wage gap is largely a product of demographics interacting with existing wage gaps (like the gender wage gap) in the Canadian economy.  This implies that the immigrant-wage gap is not a problem _in and of itself_: it's a consequence of other economic issues.

Which of these rationales is correct?  We don't know.  In fact, all of them might be true to some extent, or maybe none of them.  Maybe there's an alternative - you probably have an opinion on this topic, particularly if you're an immigrant or have an immigrant background.

The key to doing research is to understand how we can use data to try and investigate each of these rationales.  Much of what we are going to do is to help us determine if these explanations are plausible, and to what degree.  We need to think about how the different explanations would manifest in our data.  Essentially, each explanation is going to create a "pattern" - we need to figure out how we can detect and analyze this pattern.  This is the process of econometric modelling.

For example, consider the following:

1.  If the first rationale is true, if we compare immigrants and domestic-born Canadians with the same education and experience, the gap should go away.
2.  If the second rationale is true, if we compare immigrants and domestic-born Canadians with similar credentials, the gap should still exist.
3.  If the third rationale is true, if we compare immigrants and domestic-born Canadians with similar demographic characteristics, the gap should go away.
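
To see what "the gap should go away" means in practice, here is a minimal sketch using invented numbers (this is _not_ census data).  In this toy example wages depend only on education, and the immigrant group simply has fewer highly educated people.  The variable names mirror the ones we use later, but the values are assumptions made for illustration.

```r
# Made-up data: wages depend only on education (40000 if low, 70000 if high)
toy <- data.frame(
  immstat = rep(c("immigrants", "non-immigrants"), each = 4),
  educ    = c("low", "low", "low", "high",  "low", "high", "high", "high"),
  wages   = c(40000, 40000, 40000, 70000,   40000, 70000, 70000, 70000)
)

# Raw gap: compare group means, ignoring education
raw_means <- tapply(toy$wages, toy$immstat, mean)
raw_gap   <- unname(raw_means["immigrants"] - raw_means["non-immigrants"])
raw_gap

# Gap within each education level: compare like with like
within_means <- tapply(toy$wages, list(toy$educ, toy$immstat), mean)
within_gap   <- within_means[, "immigrants"] - within_means[, "non-immigrants"]
within_gap
```

The raw comparison shows a gap, but within each education level the gap is zero.  That is the pattern the first rationale would produce; real data will rarely be this clean.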

We can also start to dig deeper into these explanations, to try to understand the fundamental causes, and answer our research question:

> ***What are the causes of the immigrant wage gap in Canada?***

## Getting Started with the Census

In this project, we are going to work with a subset of the 2016 Canadian Census individuals file.  This contains information on a representative number of people in Canada.  We have already restricted attention to people who are:

* Employed full-time during the reference week in which the Census was collected
* Either immigrants or Canadian-born citizens; this excludes non-permanent residents and foreign nationals living in Canada

> _Think Deeper_: What might be the downside of making these kinds of restrictions?  What might we miss by only looking at these kinds of people?

We have already loaded this data earlier; we called it ```census_data``` and have already taken a look at it, but let's do that again.

In [None]:
glimpse(census_data)

We can see there are many variables in this dataset:

* ```ppsort```: a unique identifier for each observation
* ```sex```: the sex of the individual
* ```ageimm```: age at immigration
* ```immstat```: immigration status
* ```wages```: annual wages

We will focus on only a few of them in this lesson.  Most of these variables are _coded_, which means they have labels which are associated with the numbers you can see in the preview of the data.  You can find information on these variables in the documentation: 

* https://abacus.library.ubc.ca/file.xhtml?persistentId=hdl:11272.1/AB2/GDJRT8/1VI5BS&version=1.0

Take a moment to download it now.  Looking up the codes by hand works, but it is better to tell R that our variables are actually qualitative (factor) variables, so they are easier to read.  Let's try that now using the ``as_factor`` function.

This function automatically translates labelled values into factor variables.  Compare the results above with the new dataset below.

In [None]:
census_data <- as_factor(census_data)

glimpse(census_data)

See the difference?  It can seem subtle, but this is an indication that R will now correctly interpret qualitative variables.  Now, we're almost ready to get started - but there's one final issue.
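
If the change is hard to spot in the full dataset, the following small sketch may help.  It builds a made-up labelled variable with ``labelled`` from the ``haven`` package and then converts it; the codes 1 and 2 here are hypothetical and are not taken from the census documentation.

```r
library(haven)

# Hypothetical coded variable: numbers with text labels attached
status_coded <- labelled(
  c(1, 2, 2, 1),
  labels = c("immigrants" = 1, "non-immigrants" = 2)
)

status_coded                # the codes, printed alongside their labels

status_factor <- as_factor(status_coded)
status_factor               # now a factor: the labels replace the codes
levels(status_factor)
```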

### Dealing with Missing Data

Take a close look at ``wages`` in the above table.  What do you see?  Does anything look not-quite-right?

What if we computed the mean of wages?



In [None]:
mean(census_data$wages)

This is one of the most common problems applied economists face: *missing data*.  For many reasons, sometimes data is missing.  In the Census, sometimes people are unable to completely fill out the census for all values - sometimes this is expected (e.g. no wages if you're unemployed) and sometimes it's not (e.g. no wages reported but employed full time?).

We can test this using the ``is.na()`` function:

In [None]:
any(is.na(census_data$wages))

We need to decide what to do with these individuals.  The most transparent way is to remove them from the dataset - however, you should always think through the implications of this, especially if data might be missing for economic reasons.

> _Think Deeper_: what are some possible reasons ``wages`` might be missing?  Are they problematic for the analysis?
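
To see why a single missing value matters, consider this small sketch with made-up numbers (the vector ``w`` is hypothetical):

```r
# Made-up wages with one missing value
w <- c(42000, NA, 58000)

mean(w)                 # NA: one missing value makes the whole mean missing
is.na(w)                # TRUE where the value is missing
mean(w, na.rm = TRUE)   # ignore missing values inside the calculation
```

Using ``na.rm = TRUE`` is an alternative to removing the observations, but it has to be repeated in every calculation, which is one reason we build a clean analysis sample instead.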

This process of cleaning and subsetting a dataset is part of the process of developing your *analysis sample*.  Let's finalize our sample now:

In [None]:
census_data <- filter(census_data, !is.na(census_data$wages))
census_data <- filter(census_data, !is.na(census_data$mrkinc))
#what do you think the ! operator does?

**Question**: What does the ``!`` operator do in the above code? 

* A: multiplies the values
* B: repeats the command
* C: logically negates (not)
* D: factorial

Answer by assigning a value below!

In [None]:
# Assign a letter by replacing X below
answer2 <- "X"

test_2()

Now we can see that the census data is all cleaned up, and ready to use.

In [None]:
any(is.na(census_data$wages))
mean(census_data$wages)

## Computing the Immigrant-Wage Gap

In R, there are many ways we could compute basic descriptive statistics.  The simplest way is to use ``summarize`` and ``group_by``:

In [None]:
results <- 
    census_data %>% #this is a pipe (see note below)
    group_by(immstat) %>%
    summarize(m_wage = mean(wages), sd_wage = sd(wages))

results

> **Advanced Note**: _piping_ The above example uses a special R operator called a **pipe** (``%>%``).  Mechanically, what piping does is insert the object before the pipe as the first argument of the function after the pipe.  For example, if we have ``z <- f(x,y)`` we could write this using pipes as ``z <- x %>% f(y)``.  Piping is really most useful when you are chaining (piping) a series of commands together.  You can think of a pipe as saying _and then_ followed by a command.  The item before the pipe will be passed into the next command.  This lets you do complex data manipulation in a way which is readable.
> For example, the command above (i) starts with ``census_data`` (ii) groups it by ``immstat``, (iii) takes the grouped data and summarizes it.  If we wrote this without using a pipe it would look like:
> ``summarize(group_by(census_data,immstat), m_wage = mean(wages), sd_wage = sd(wages))``
> Not very easy to read!  Now imagine if you have seven or eight manipulations you would need to do!  This is the biggest advantage of using pipes: it makes your code easier to read, which means you are less likely to get confused, and less likely to make mistakes.  You can always find a way to do something without using a pipe, but you should still learn how to read them and use them.
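
You can check for yourself that the two styles give the same answer.  The sketch below uses a tiny made-up data frame (``toy`` is hypothetical) so the group means are easy to verify by hand:

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(
  group = c("a", "a", "b", "b"),
  wages = c(10, 20, 30, 50)
)

# Nested version: read from the inside out
nested <- summarize(group_by(toy, group), m_wage = mean(wages))

# Piped version: read top to bottom, "and then"
piped <- toy %>%
  group_by(group) %>%
  summarize(m_wage = mean(wages))

piped
identical(nested$m_wage, piped$m_wage)
```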


We can also visualize this, using ``ggplot2``, which can create bar graphs and other visualizations.  Here are a bar graph, a boxplot, and a histogram.

In [None]:
f <- ggplot(data = census_data, aes(x = immstat, y = wages)) + xlab("Immigration Status") + ylab("Wages")
f1 <- f + geom_bar(stat = "summary", fun = "mean", fill = "lightblue") #produce a summary statistic, the mean
f1 <- f1 + coord_flip() #make a horizontal bar graph!

options(repr.plot.width=6,repr.plot.height=3) #this controls the size; you can change 6 and 3 to look better

f2 <- f + geom_boxplot(fill = "lightblue") + coord_flip()

f3 <- ggplot(data = census_data, aes(x = wages)) + geom_histogram(binwidth = 1000) + xlab("Wages") + ylab("Count") + facet_grid(. ~ immstat)

f1
f2
f3

> _Think Deeper:_ What does this tell you about the distribution of wages in these datasets?  Could this be a problem for our analysis?

This is all interesting to think about.  However, this is not a formal test of the immigrant-wage gap.  We need to examine this from a statistical perspective.  We can do this using a $t$-test.  This can be performed using the ``t.test`` command:

In [None]:
t1 = t.test(
       x = filter(census_data, immstat == "immigrants")$wages,
       y = filter(census_data, immstat == "non-immigrants")$wages,
       alternative = "two.sided",
       mu = 0,
       conf.level = 0.95)

t1 #test for the immigrant wage gap

round(t1$estimate[1] - t1$estimate[2],2)

This particular $t$-test was for a 95% confidence level; the difference was $-2,552.88$.  As you can see, it looks like there is a much higher wage for non-immigrants than for immigrants.  This is statistically significant and interesting.  It appears that there _is_ an immigrant wage gap in this data.
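
The object returned by ``t.test`` stores each piece of the output separately, which is how we pulled out the two means above.  The sketch below shows the main components using two small made-up samples (``group_a`` and ``group_b`` are hypothetical, not census values).  Note that, by default, ``t.test`` does not assume the two groups have equal variances (this is often called Welch's test).

```r
# Made-up wages for two small groups (not census values)
group_a <- c(40, 42, 45, 47, 51)
group_b <- c(48, 50, 53, 55, 59)

tt <- t.test(x = group_a, y = group_b,
             alternative = "two.sided", mu = 0, conf.level = 0.95)

tt$estimate    # the two sample means
tt$statistic   # the t statistic
tt$p.value     # the p-value of the test
tt$conf.int    # 95% confidence interval for the difference in means

gap <- unname(tt$estimate[1] - tt$estimate[2])   # estimated gap
gap
```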

## Going Deeper: Demographic Characteristics

The next step is to understand why or how the immigrant wage gap might exist.  If you remember, demographic characteristics were one possibility.  For example, perhaps immigrants are less likely to speak English than other Canadians.  This could, potentially, create an immigrant wage gap.  Let's take a look at languages in the census, then try to understand how it interacts with immigration status:

In [None]:
results <- 
    census_data %>%
    group_by(fol) %>%
    summarize(m_wage = mean(wages), sd_wage = sd(wages))

results

#we just replaced immstat with fol in the above table

As you can see, there are significantly lower wages for individuals who speak languages other than English.

> _Think Deeper:_ Why might this be the case?  

We can also see how this breaks down by immigrant status.  Look at the following table - do you see a pattern?


In [None]:
results <- 
    census_data %>%
    group_by(fol,immstat) %>% #notice the fol here; two groups!
    summarize(m_wage = mean(wages), sd_wage = sd(wages))

results

options(repr.plot.width=10,repr.plot.height=3)

f <- ggplot(data = census_data, aes(x = immstat, y = wages)) + xlab("Immigration Status") + ylab("Wages")
f <- f + geom_bar(stat = "summary", fun = "mean", fill = "lightblue") #produce a summary statistic, the mean
f <- f + facet_grid(. ~ fol) #add a grid by language

f

You can see that, with one exception, immigrants tend to earn less than native-born Canadians.  We can make this even more clear by adding a new variable (``speaks_english``) to our dataset.  Frequently, we will want to make new variables to help us analyze the results, especially when a variable is more complicated than we would like it to be.

You can create this in many ways - but a very useful command is the ``case_when`` command.  Here is an example for our ``speaks_english`` variable.  Pay attention to the use of the ``as_factor`` command at the end to tell R that this is still a qualitative variable.


In [None]:
census_data <- census_data %>% 
               mutate( 
               speaks_english = case_when(#this is an example of this function
                     fol == "both english and french" ~ "Yes", #the ~ separates the original from the new name
                     fol == "english only" ~ "Yes",
                     fol == "french only" ~ "No",
                     fol == "neither english nor french" ~ "No")) %>%
             mutate(speaks_english = as_factor(speaks_english)) #remember, it's a factor!

glimpse(census_data$speaks_english)
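
To get a feel for how ``case_when`` works, here is a minimal sketch on made-up data (``toy`` and the age cut-offs are hypothetical).  Conditions are checked from top to bottom and the first one that is true wins; any row that matches none of the conditions gets a missing value (``NA``), so it is common to add a catch-all at the end.

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(age = c(17, 30, 70))

toy <- toy %>%
  mutate(age_group = case_when(
    age < 18 ~ "minor",
    age < 65 ~ "working age",   # only checked if the line above did not match
    TRUE     ~ "senior"         # catch-all for everything else
  ))

toy$age_group
```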

Now, let's repeat the analysis we did above by English-speaking status; then we can perform a $t$-test on each of these sub-groups:

In [None]:
results <- 
    census_data %>%
    group_by(speaks_english,immstat) %>%
    summarize(m_wage = mean(wages), sd_wage = sd(wages))

results #this is the same we did before, just with english status instead of FOL; much easier to read

f <- ggplot(data = census_data, aes(x = immstat, y = wages)) + xlab("Immigration Status") + ylab("Wages")
f <- f + geom_bar(stat = "summary", fun = "mean", fill = "lightblue") #produce a summary statistic, the mean
f <- f + facet_grid(. ~ speaks_english) #add a grid by language

f

In [None]:
eng_data = filter(census_data, speaks_english == "Yes") #english only data 
neng_data = filter(census_data, speaks_english == "No") #not english data

t2 = t.test(
       x = filter(eng_data, immstat == "immigrants")$wages,
       y = filter(eng_data, immstat == "non-immigrants")$wages,
       alternative = "two.sided",
       mu = 0,
       conf.level = 0.95)

t2 #test for the wage gap in english data

round(t2$estimate[1] - t2$estimate[2],2)


t3 = t.test(
       x = filter(neng_data, immstat == "immigrants")$wages,
       y = filter(neng_data, immstat == "non-immigrants")$wages,
       alternative = "two.sided",
       mu = 0,
       conf.level = 0.95)

t3 #test for the wage gap in non-english data

round(t3$estimate[1] - t3$estimate[2],2)

Consider the results above.  In which group is the immigrant wage gap the largest?  Do you think the relationship between immigration status and English proficiency makes the immigrant-wage gap larger or smaller?

> _Think Deeper_: What would you need to know in order to say this for sure?  Hint: you have probably learned a valuable statistic.

### Wrapping Up

At this point, we have started to explore the immigrant wage gap.  Next, work on the following exercises to learn more.

# Part 3: Exercises


## Immigrant wage gap and education

One important dimension of the immigrant-wage gap is the role of education.  Many professions, such as engineering, medicine, and law, face licensing and professional requirements.  Even the most talented lawyer from India cannot practise law in Canada without obtaining the proper credentials.  These can create major barriers to employing immigrants, even skilled ones, to their full extent in the Canadian economy.

In fact, historically many licensing requirements were explicitly designed to keep immigrants out of certain industries, to preserve higher wages for native Canadians.  An infamous example in Canada occurred in the 1920s, when many provinces (including BC) passed laws requiring barbers and hairdressers to get a license to cut hair.  Why?  Ostensibly it was for "health" reasons - but the real reason was to force Asian barbers (primarily Chinese) out of business, and keep prices high.  The barber licenses were nearly impossible for Asian barbers to get, forcing them out of business or onto the black market.

While this kind of outright racism and discrimination is no longer commonplace, many economists view licenses and professional credentials in a similar light.  At the very least, they hinder integration into the economy - are the benefits of licenses higher than the costs?

In these exercises, you will investigate these questions, building on what we did previously and developing your skills with the tools you have seen.  After you are finished, make sure you submit your answers via Canvas.

**Note**:  some of the objects and analysis you complete in this section will be automatically graded.  These are tagged in the code as ``quiz N``.  Other questions require some writing and are listed as ``Short Answer N``.  These correspond to the rubric on Canvas.

### Activity 1 
First, examine ``census_data`` with a focus on education. Create a table that tabulates the average wage by education level. (**Note:** The relevant variable here will be ``hdgree``). A correct ``tab_educ`` table will pass the test.  Try looking at how we generated some of the earlier tables for inspiration, if you need a hint.

In [None]:
tab_educ <- #fill in the code below; what goes before the %>%?
        %>% 
        %>% 
        summarize(avg_wage = mean(wages), std_dev = sd(wages))

tab_educ

answer3 <- tab_educ

test_3() #quiz 1

#### Short Answer 1:
What type of variable is ``hdgree``? Does it make sense to have ``hdgree`` as that variable type? Why or why not?  Write your answer in the box below:

<font color="red">Answer here (delete this text)</font>

### Activity 2

As we can see, the Census has very narrow categories for education levels; however, we tend to like to group education into broader categories. Let's do that: consolidate ``hdgree`` into five categories -- Less than high school, high school diploma, some college (less than a bachelor's degree), bachelor's degree and graduate school. ``"Not available"`` can be left as is. A correct ``census_data`` dataframe will pass the test.

In [None]:
census_data <- 
        census_data %>%
        mutate(educ = case_when(
              hdgree == "no certificate, diploma or degree" ~ "Less than high school",
              hdgree == "secondary (high) school diploma or equivalency certificate" ~ "High school diploma",
              hdgree == "trades certificate or diploma other than certificate of apprenticeship or certificate of qualification" ~ "Some college",
              hdgree == "certificate of apprenticeship or certificate of qualification" ~ "Some college",
              hdgree == "program of 3 months to less than 1 year (college, cegep and other non-university certificates or diplomas)" ~ "Some college",
              hdgree == "program of 1 to 2 years (college, cegep and other non-university certificates or diplomas)" ~ "Some college",
              hdgree == "program of more than 2 years (college, cegep and other non-university certificates or diplomas)" ~ "Some college",
              hdgree == "university certificate or diploma below bachelor level" ~ "Some college",
              hdgree == "bachelor's degree" ~ "Bachelor's degree",              
              hdgree == "university certificate or diploma above bachelor level" ~ "Graduate school",
              hdgree == "degree in medicine, dentistry, veterinary medicine or optometry" ~ "Graduate school",
              hdgree == "master's degree" ~ "Graduate school",
              hdgree == "earned doctorate" ~ "Graduate school",
              hdgree == "not available" ~ "not available"
              )) %>%
        mutate(educ = as_factor(educ))


answer4 <- census_data
test_4()

-- Failure (???): Solution is incorrect ----------------------------------------
digest(census_data) not equal to "b09db48ab3bbf872355973edd9312f10".
1/1 mismatches
x[1]: "5384442344ca76cf0ba73e4cf986a342"
y[1]: "b09db48ab3bbf872355973edd9312f10"

[1] "Success!"


Now that we have simplified the education variable, create a table that tabulates average wage by these new education levels. A correct ``tab_educ2`` table will pass the test.
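
As a hint on the general pattern, here is a minimal sketch on made-up data (not the census data). The names `toy`, `group`, and `pay` are hypothetical; the only point is that `group_by()` followed by `summarize()` returns one row of statistics per group.

```r
library(dplyr)

# Hypothetical toy data: `group` and `pay` are made-up column names
toy <- tibble(
  group = c("A", "A", "B", "B"),
  pay   = c(10, 20, 30, 50)
)

toy_tab <- toy %>%
  group_by(group) %>%                                 # split the rows by group
  summarize(avg_pay = mean(pay), std_dev = sd(pay))   # one summary row per group

toy_tab
```

The worksheet cell below follows the same shape: a data frame, a grouping step, then the `summarize()` call that is already provided.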

In [None]:
tab_educ2 <- #what goes before the pipes?
        %>% 
        %>% 
        summarize(avg_wage = mean(wages), std_dev = sd(wages))

tab_educ2

answer5 <- tab_educ2
test_5() #quiz 2

### Activity 3
The table that we got in the previous activity is fairly clear, but let's illustrate things with a chart. Construct a bar graph that charts the average wage by education group. ``educ_graph`` will store this plot. You can see it by running the second code chunk below.
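
To see how a bar chart of group means is put together, here is a small sketch on made-up data. The names `toy`, `group`, and `pay` are hypothetical. With `stat = "summary"` and `fun = mean`, ggplot2 computes the mean of the `y` variable within each `x` category and draws one bar per category.

```r
library(ggplot2)

# Hypothetical toy data: `group` and `pay` are made-up column names
toy <- data.frame(
  group = c("A", "A", "B", "B"),
  pay   = c(10, 20, 30, 50)
)

toy_graph <- ggplot(data = toy, aes(x = group, y = pay)) +
  geom_bar(stat = "summary", fun = mean, fill = "lightblue") +  # bar height = mean of pay
  xlab("Group") + ylab("Average pay") +
  coord_flip()                                                  # horizontal bars

toy_graph
```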

In [None]:
educ_graph <- ggplot(data = , aes(x = , y = wages)) + xlab("") + ylab("Wages")  #what goes in the x = spot?  what goes in xlab("")?
educ_graph <- educ_graph + geom_bar(stat = "summary", fun = mean, fill = "lightblue")
educ_graph <- educ_graph + coord_flip()

In [None]:
educ_graph

#### Short Answer 2
Examine the graph.  What do we observe when we compare the average wages between education levels? What does this suggest?

<font color="red">Answer here (delete this text)</font>

### Activity 4
Now, let's bring our focus back to understanding the immigrant wage gap. Create a table that tabulates average wages by education level and immigration status (immigrant, non-immigrant). This table, labelled ``tab_educ3``, will be tested for correctness.
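
Grouping by two variables works the same way as grouping by one: listing both inside `group_by()` gives one row per combination. A minimal sketch on invented numbers, where `toy2`, `level`, `status`, and `pay` are hypothetical names:

```r
library(dplyr)

# Hypothetical toy data with two grouping variables
toy2 <- tibble(
  level  = c("X", "X", "X", "X", "Y", "Y", "Y", "Y"),
  status = c("in", "in", "out", "out", "in", "in", "out", "out"),
  pay    = c(10, 20, 30, 50, 40, 60, 70, 90)
)

toy_tab2 <- toy2 %>%
  group_by(level, status) %>%                         # one group per (level, status) pair
  summarize(avg_pay = mean(pay), std_dev = sd(pay))

toy_tab2
```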

In [None]:
tab_educ3 <- #what goes before the pipes?
        %>%
        %>%
        summarize(avg_wage = mean(wages), std_dev = sd(wages))

tab_educ3

answer7 <- tab_educ3
test_7() #quiz 3

Next, create a bar graph that compares average wages between immigrants and non-immigrants within education levels (``educ_graph2`` will store this). Note that most of the syntax is provided -- you simply need to fill in the missing code.

In [None]:
educ_graph2 <- ggplot(data = , aes(x = , y = wages)) + xlab("") + ylab("Wages")
educ_graph2 <- educ_graph2 + geom_bar(stat = "summary", fun = mean, fill = "lightblue")
educ_graph2 <- educ_graph2 + facet_grid(. ~ , scales = "free") + scale_x_discrete(guide = guide_axis(n.dodge=3))

educ_graph2

#### Short Answer 3
**Reflect:** Where do the wage gaps between immigrants and non-immigrants appear to be largest on average? Where are they the smallest, and why might this be the case?

<font color="red">Answer here (delete this text)</font>

### Activity 5
Labour economists are often concerned with two aspects of the relationship between immigrant status and education:
* The difference in average wages between immigrants and non-immigrants within education groups
* The difference in returns to education between immigrants and non-immigrants

Let's explore these two topics. First, test whether average wages differ significantly between immigrants and non-immigrants within each education group. Within which education levels do we see significant differences between the wages of immigrants and those of non-immigrants?

_Note_: We will test your answers for ``ths``, the t-test for high school diploma, and ``tbach``, the t-test for bachelor's degree. However, you should also complete t-tests for less than high school (``tlesshs``), some college (``tsocol``) and graduate school (``tgrad``).
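
For reference, a within-group t-test can be sketched on made-up data as follows. The names `toy`, `level`, `status`, and `pay` are hypothetical and the numbers are invented. Two things are worth knowing: by default `t.test()` runs Welch's test (it does not assume equal variances), and with a formula the two groups are ordered by factor level (alphabetically for character columns), which decides which mean is `estimate[1]`.

```r
# Hypothetical toy data: `level`, `status`, and `pay` are made-up names
toy <- data.frame(
  level  = rep(c("X", "Y"), each = 6),
  status = rep(c("in", "out"), times = 6),
  pay    = c(10, 14, 11, 15, 12, 16, 20, 21, 22, 23, 24, 25)
)

# Keep only one level, then compare mean pay across the two status groups
toy_x <- subset(toy, level == "X")
toy_t <- t.test(pay ~ status, data = toy_x)

toy_t
# Difference in group means: first group ("in") minus second group ("out")
round(toy_t$estimate[1] - toy_t$estimate[2], 2)
```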

In [None]:

#Less than high school 
tlesshs =

tlesshs

#High school diploma

ths = 

ths
round(ths$estimate[1] - ths$estimate[2],2)

test_9() #quiz 4

#Some college

tsocol = 

tsocol

#Bachelor's degree

tbach = 

tbach
round(tbach$estimate[1] - tbach$estimate[2],2)

test_10() #quiz 5

#Grad school

tgrad = 

tgrad

### Activity 6
Next, examine whether returns to education differ between immigrants and non-immigrants. For our purposes, we will define:
> **Returns to Education**: The difference in average wages between two subsequent education levels.

Run t-tests for the returns to education of: 
* High school diploma (relative to less than high school) and 
* Bachelor's degree (relative to high school)

*The following t-test objects will be tested for correctness:* Returns to education of a high school diploma for immigrants (``retHSI``) and for non-immigrants (``retHSNI``), and returns to education of a bachelor's degree for immigrants (``retBachI``) and for non-immigrants (``retBachNI``).
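
Here the comparison runs across education levels while holding immigration status fixed, so the subsetting changes: keep one status group and only the two levels being compared. A minimal sketch on invented data, where `toy3`, `level`, `status`, and `pay` are hypothetical names:

```r
# Hypothetical toy data with three levels and two status groups
toy3 <- data.frame(
  level  = rep(c("low", "mid", "high"), each = 6),
  status = rep(c("in", "out"), times = 9),
  pay    = c(10, 12, 14, 16, 18, 20,
             30, 32, 34, 36, 38, 40,
             60, 62, 64, 66, 68, 70)
)

# One status group, two of the three levels
toy_in <- subset(toy3, status == "in" & level %in% c("low", "mid"))

# Compare mean pay between the two remaining levels
toy_ret <- t.test(pay ~ level, data = toy_in)

toy_ret
# "low" sorts before "mid", so this is mean(low) minus mean(mid)
round(toy_ret$estimate[1] - toy_ret$estimate[2], 2)
```

Because the difference is taken as first group minus second group, its sign depends on how the two levels are ordered; check the group labels in the printed output before interpreting it.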

In [None]:
#Returns to education: High school diploma

##Immigrants

retHSI = #what goes here?

retHSI
round(retHSI$estimate[1] - retHSI$estimate[2],2)

test_11() 

##Non-immigrants

retHSNI = #what goes here?

retHSNI
round(retHSNI$estimate[1] - retHSNI$estimate[2],2)

test_12() #quiz 6

In [None]:
#Returns to education: Bachelor's degree

##Immigrants

retBachI = 

retBachI
round(retBachI$estimate[1] - retBachI$estimate[2],2)

test_13() 

##Non-immigrants

retBachNI =

retBachNI
round(retBachNI$estimate[1] - retBachNI$estimate[2],2)

test_14() #quiz 7

#### Short Answer 4
**Reflect on your analysis:** Interpret the results of the t-tests above. Are the returns to each level of education significant for immigrants? For non-immigrants? Comment on the difference between the returns to education for a high school diploma and those for a bachelor's degree.

<font color="red">Answer here (delete this text)</font>

#### Short Answer 5
**Discuss your results:** Do the returns to education (at either level) differ between immigrants and non-immigrants? What differences between the two groups might explain this?

<font color="red">Answer here (delete this text)</font>

### Activity 7
As we mentioned earlier in the lesson, one reason why immigrants may be paid less than non-immigrants on average is that the former have a higher concentration of women than the latter. The gap could also be driven by differences in average wages between immigrant women and non-immigrant women. To examine this argument, test for an immigrant wage gap _in men only_ (store this in ``tmen`` for testing). Visualize this difference by plotting the wage distributions of immigrant and non-immigrant men with a histogram (store with ``hist_men``). Then, test for an immigrant wage gap within high-school-educated men (``tmenhs``) and bachelor's-educated men (``tmenbach``).
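
As a reminder of how a faceted histogram is assembled, here is a small sketch on made-up data. The names `toy4`, `status`, and `pay` are hypothetical. `facet_grid(. ~ status)` draws one panel per value of `status`, side by side, so the two distributions can be compared on a common axis.

```r
library(ggplot2)

# Hypothetical toy data: `status` and `pay` are made-up column names
toy4 <- data.frame(
  status = rep(c("in", "out"), each = 5),
  pay    = c(10, 12, 15, 21, 24, 14, 18, 22, 27, 33)
)

toy_hist <- ggplot(toy4, aes(x = pay)) +
  geom_histogram(binwidth = 5) +      # each bar counts observations in a 5-unit bin
  xlab("Pay") + ylab("Count") +
  facet_grid(. ~ status)              # one column of panels per status value

toy_hist
```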

In [None]:
#Immigrant wage gap, unconditional, men



tmen = 

tmen
round(tmen$estimate[1] - tmen$estimate[2],2)

test_15()

#Histograms
hist_men <- ggplot(, aes(x = wages)) + geom_histogram(binwidth = 1000) + xlab("Wages") + ylab("Count") + facet_grid(. ~)

hist_men

test_16()

#Immigrant wage gap, High school diploma, men



tmenhs =

tmenhs
round(tmenhs$estimate[1] - tmenhs$estimate[2],2)

test_17()

#Immigrant wage gap, Bachelor's degree, men



tmenbach =

tmenbach
round(tmenbach$estimate[1] - tmenbach$estimate[2],2)

test_18() #quiz 8

#### Short Answer 6 
**Interpret your Findings:** Does the immigrant wage gap disappear when we look only at men? Why or why not? If the immigrant wage gap does not disappear for one (or both) of the education levels, do we see the gap grow or shrink when we exclude women (compare with Activity 5)? Why?

<font color="red">Answer here (delete this text)</font>

## Submitting Your Work
Each week, you will upload your workbook as *both a ``.ipynb`` and a PDF.* Below are instructions for how to do this:
* While viewing worksheet1, click on the ``File`` tab on the upper-left corner of the screen.
* From there, drag your cursor over ``Download as`` to see the formats that you can export the worksheet to.
* To export to a ``.ipynb``, click on _"Notebook (.ipynb)."_
* Next, to export to a PDF, navigate back to ``File`` > ``Download as`` and click on _"PDF via LaTeX."_
* Upload both of these files to Canvas through the Assignments tab, as you would any other assignment.

Once you have finished uploading both of these files, you have completed the hands-on activity -- well done!

<span id="fn1">[<sup>1</sup>](#fn1s)Provided under the Statistics Canada Open License.  Adapted from Statistics Canada, 2016 Census Public Use Microdata File (PUMF). Individuals File, 2020-08-29. This does not constitute an endorsement by Statistics Canada of this product.</span>