
<h1> Lab 3: Reducing Crime</h1>

<h3> w203 Instructional Team </h3>

# Introduction

Your team has been hired to provide research for a political campaign.  They have obtained a dataset of crime statistics for a selection of counties in North Carolina.

Your task is to examine the data to help the campaign understand the determinants of crime and to generate policy suggestions that are applicable to local government.

You may work in a team of up to 3 students. This is not a requirement, but we strongly encourage you to form a group and believe it will add considerable value to the exercise.

When working in a group, do not use a "division-of-labor" approach to complete the lab.  All students should participate in all aspects of the final report.

# Timeline

The lab takes place over three weeks, with a deliverable due each week.


**Stage 1: Draft Report.**  You will create an intermediary report focused on model building but without statistical inference (no standard errors).

**Stage 2: Peer Feedback.**  Teams will exchange reports and provide each other with feedback.

**Stage 3: Final Report.**  You will create a final report, which includes a complete assessment of the classical linear model assumptions, standard errors, and other elements of statistical inference.

# Instructions

Please submit your answers in _one_ PDF and _one_ Jupyter Notebook.
Only your answers in the PDF document will be considered for grading. The Jupyter Notebook is required to verify that any scripts that you have written can be executed.
You are allowed to submit supplementary files such as images of handwritten notes imported into your Jupyter Notebook. Do note, however, that no handwritten notes will be considered for grading.
Finally, do _not_ upload your documents as one zipped file.


# The Data

The data is provided in a file, crime\_v2.csv.  It was first used in a study by Cornwell and Trumball, researchers from the University of Georgia and West Virginia University (C. Cornwell and W. Trumball (1994), “Estimating the Economic Model of Crime with Panel Data,” Review of Economics and Statistics 76, 360-366.)  While we are only providing you with a single cross-section of data, the original study was based on a multi-year panel.  The authors used panel data methods and instrumental variables to control for some types of omitted variables.  Since you are restricted to ordinary least squares regression, omitted variables will be a major obstacle to your estimates.  You should aim for causal estimates, while clearly explaining how you think omitted variables may affect your conclusions.

While you are free to look at the Cornwell and Trumball study (or other papers in the vast literature on crime), that's not necessary and may even harm your grade.  We want you to focus on learning from the data, which shouldn't require specialized knowledge beyond what's in this document.

The data may have been modified by your instructors to test your abilities.

You are given the following codebook:

variable  | label
----------|------
1    county|               county identifier
2      year|                            1987
3    crmrte|     crimes committed per person
4    prbarr|         'probability' of arrest
5   prbconv|     'probability' of conviction
6   prbpris| 'probability' of prison sentence
7    avgsen|             avg. sentence, days
8     polpc|               police per capita
9   density|             people per sq. mile
10    taxpc|          tax revenue per capita
11     west|           =1 if in western N.C.
12  central|           =1 if in central N.C.
13    urban|                   =1 if in SMSA
14 pctmin80|            perc. minority, 1980
15     wcon|       weekly wage, construction
16     wtuc|    wkly wge, trns, util, commun
17     wtrd| wkly wge, whlesle, retail trade
18     wfir|    wkly wge, fin, ins, real est
19     wser|      wkly wge, service industry
20     wmfg|         wkly wge, manufacturing
21     wfed|         wkly wge, fed employees
22     wsta|       wkly wge, state employees
23     wloc|        wkly wge, local gov emps
24      mix| offense mix: face-to-face/other
25  pctymle|              percent young male

In the literature on crime, researchers often distinguish between the certainty of punishment (do criminals expect to get caught and face punishment) and the severity of punishment (for example, how long prison sentences are).  The former concept is the motivation for the 'probability' variables.  The probability of arrest is proxied by the ratio of arrests to offenses, measures drawn from the FBI's Uniform Crime Reports.  The probability of conviction is proxied by the ratio of convictions to arrests, and the probability of prison sentence is proxied by the convictions resulting in a prison sentence to total convictions.  The data on convictions is taken from the prison and probation files of the North Carolina Department of Correction.

The percent young male variable records the proportion of the population that is male and between the ages of 15 and 24.  This variable, as well as percent minority, was drawn from census data.

The number of police per capita was computed from the FBI's police agency employee counts.

The variables for wages in different sectors were provided by the North Carolina Employment Security Commission.

# Stage 1: Draft Report - Due 24 hours before Live Session 12

In the first stage of the project, you will create a draft report that addresses the concerns of the political campaign.  Your report will include a model building process, culminating in a well formatted regression table that displays a minimum of three model specifications.  In fact, your draft report will be very similar in structure to your final report, but won't include standard errors or a full assessment of the classical linear model assumptions, which we will cover in units 12 and 13.

Here are some things to keep in mind during your model building process:

1. What do you want to measure?  Make sure you identify variables that will be relevant to the concerns of the political campaign.

2. What covariates help you identify a causal effect?  What covariates are problematic, either due to multicollinearity, or because they will absorb some of a causal effect you want to measure?

3. What transformations should you apply to each variable?  This is very important because transformations can reveal linearities in the data, make our results relevant, or help us meet model assumptions.

4. Are your choices supported by EDA?  You will likely start with some general EDA to detect anomalies (missing values, top-coded variables, etc.).  From then on, your EDA should be interspersed with your model building.  Use visual tools to guide your decisions.

At the same time, it is important to remember that you are not trying to create one perfect model.  You will create several specifications, giving the reader a sense of how robust your results are (how sensitive to modeling choices), and to show that you're not just cherry-picking the specification that leads to the largest effects.

At a minimum, you should include the following three specifications:

- One model with only the explanatory variables of key interest (possibly transformed, as determined by your EDA), and no other covariates.

- One model that includes key explanatory variables and only covariates that you believe increase the accuracy of your results without introducing substantial bias (for example, you should not include outcome variables that will absorb some of the causal effect you are interested in).  This model should strike a balance between accuracy and parsimony and reflect your best understanding of the determinants of crime.

- One model that includes the previous covariates, and most, if not all, other covariates.  A key purpose of this model is to demonstrate the robustness of your results to model specification.

Guided by your background knowledge and your EDA, other specifications may make sense.  You are trying to choose points that encircle the space of reasonable modeling choices, to give an overall understanding of how these choices impact results.

You should display all of your model specifications in a regression table, using a package like stargazer to format your output.  It should be easy for the reader to find the coefficients that represent key effects near the top of the regression table, and scan horizontally to see how they change from specification to specification.  Since we won't cover inference for linear regression until unit 12, you should not display any standard errors at this point.  You should also avoid conducting statistical tests for now (but please do point out what tests you think would be valuable).

After your model building process, you should include a substantial discussion of omitted variables.  Identify what you think are the 5-10 most important omitted variables that bias results you care about.  For each variable, you should estimate what direction the bias is in.  If you can argue whether the bias is large or small, that is even better.  State whether you have any variables available that may proxy (even imperfectly) for the omitted variable.   Pay particular attention to whether each omitted variable bias is towards zero or away from zero.  You will use this information to judge whether the effects you find are likely to be real, or whether they might be entirely an artifact of omitted variable bias.

**Submission**

Submit your lab via ISVC; please do not submit via email.

Submit 2 files:

1. A pdf file including the summary, the details of your analysis, and all the R codes used to produce the analysis. **Please do not suppress the code in your pdf file.**

2. The Rmd or ipynb source file used to produce the pdf file.

Each group only needs to submit one set of files.

Be sure to include the names of all team members in your report.  Place the word 'draft' in the file names.

Please limit your submission to 6000 words, excluding code cells and R output.


# Stage 2: Peer Feedback - Due 24 hours before Live Session 13

In Stage 2, you will provide feedback on another team's draft report.  We will ask you to comment separately on different sections.  The following list is very similar to the rubric we will use when grading your final report.

**1.0** Introduction.  Is the introduction clear? Is the research question specific and well defined? Could the research question lead to an actionable policy reccomendation? Does it motivate the analysis?  Note that we're not necessarily expecting a long introduction.  Even a single paragraph is probably enough for most reports.

**2.0** The Initial Data Loading and Cleaning.  Did the team notice any anomalous values?  Is there a sufficient justification for any data points that are removed?  Did the report note any coding features that affect the meaning of variables (e.g. top-coding or bottom-coding)?  Overall, does the report demonstrate a thorough understanding of the data?

**3.0** The Model Building Process. Overall, is each step in the model building process supported by EDA?  Is the outcome variable (or variables) appropriate? Is there a thorough univariate analysis of the outcome variable. Did the team identify at least two key explanatory variables and perform a thorough univariate analysis of each? Did the team clearly state why they chose these explanatory variables, does this explanation make sense in term of their research question? Did the team consider available variable transformations and select them with an eye towards model plausibility and interperability?  Are transformations used to expose linear relationships in scatterplots?  Is there enough explanation in the text to understand the meaning of each visualization?

**4.0** Regression Models: Base Model. Does this model only include key explanatory variables? Does the team identify what they want to measure with each coefficient? Does the team interpret the result of the regression in a thorough and convincing manner. Does the team evaluate all 6 CLM assumptions? Are the conclusions they draw based on this evaluation appropriate? Did the team interpret the results in terms of their research question?

**4.1** Regression Model: Second Model. Does this model include covariates meant to increase the accuracy of the regression? Has the team justified inclusion of each of these additional variables? Does the team identify what they want to measure with each coefficient? Does the team interpret the result of the regression in a thorough and convincing manner. Does the team evaluate all 6 CLM assumptions? Are the conclusions they draw based on this evaluation appropriate? Did the team interpret the results in terms of their research question?

**4.2** Regression Model: Third Model. Has the team explained what value can be derived from this model? Does the team interpret the result of the regression in a thorough and convincing manner. Does the team evaluate all 6 CLM assumptions? Are the conclusions they draw based on this evaluation appropriate? Did the team interpret the results in terms of their research question?

**4.3** The Regression Table.  Are the model specifications properly chosen to outline the boundary of reasonable choices?  Is it easy to find key coefficients in the regression table?  Does the text include a discussion of practical significance for key effects?

**5.0** The Omitted Variables Discussion.  Did the report miss any important sources of omitted variable bias?  Are the estimated directions of bias correct? Was their explanation clear?  Is the discussion connected to whether the key effects are real or whether they may be solely an artifact of omitted variable bias?

**6.0** Conclusion.  Does the conclusion address the high-level concerns of a political campaign?  Does it raise interesting points beyond numerical estimates?  Does it place relevant context around the results?

**7.0** Can you find any other errors, faulty logic, unclear or unpersuasive writing, or other elements that leave you less convinced by the conclusions?

Please be thorough and read the report critically, actively trying to find weaknesses.  Your comments will directly help your peers get the most value out of the project.

# Stage 3: Final Report - Due 24 hours before Live Session 14

In the final stage of the project, you will incorporate the feedback you receive, and use what you've learned about OLS inference to create a final report.

One of the most important tasks at this stage is to add valid standard errors to your regression table.

In a new section of the report, please choose one of your most important model specifications, and present a detailed assessment of all 6 classical linear model assumptions.  Use plots and other diagnostic tools to assess whether the assumptions appear to be violated, and follow best practices in responding to any violations you find.  Note that we only want to see this level of detail for one model specification.  

For the other specifications, you should also conduct a full assessment of the CLM assumptions, but only highlight major surprises that you notice in your text.

Note that you may need to change your model specifications in response to violations of the CLM.  At this point, you should also consider whether changes are appropriate to decrease standard errors for your estimates.  These decisions involve tradeoffs and you should strive to be transparent about them in your report.

Note also that you may need to adjust your conclusions in response to statistical significance.  Make sure that you discuss both statistical and practical significance for your key effects of interest.

You may want to include statistical tests besides the standard t-tests for regression coefficients.

We will assess your final report using a rubric that includes the elements listed above.  We will also consider whether you have correctly included elements of statistical inference in your report.  In particular, we will look to see whether you have correctly assessed the CLM assumptions and whether you have responded appropriately to any violations.

Please limit your submission to 8000 words, excluding code cells and R output.

As above, you must submit both the source and pdf files.  Be sure to include the names of all team members in your report.  Place the word 'final' in the file names.


# Omitted Variables Discussion

> Model1 = lm(log(crmrte) ~ log(prbarr) + log(prbconv2) + log(avgsen), data = df)

> $ 
\text{For all possible omitted variables:} \\
log(crmrte) = \beta_0 + \beta_1* log(prbarr) + \beta_2* log(prbconv2) + \beta_3*log(avgsen) + \beta_4 * OV + u \\
OV = \delta_0 + \delta_1*log(prbarr) + \delta_2*log(prbconv2) + \delta_3 * log(avgsen)+ v \\
$

> $
\text{Substituting into original equation:} \\
\begin{align}
\begin{split}
log(crmrte) 
= {} & \beta_0 + \beta_1* log(prbarr) + \beta_2* log(prbconv2) + \beta_3*log(avgsen) \\
     & + \beta_4 * (\delta_0 + \delta_1*log(prbarr)+ \delta_2*log(prbconv2) + \delta_3 * log(avgsen)+ v) + u
\end{split} \\
\begin{split}
= {} &(\beta_0 + \beta_4*\delta_0) + (\beta_1 + \beta_4*\delta_1)log(prbarr) + (\beta_2 + \beta_4*\delta_2)log(prbconv2) \\
     & + (\beta_3 + \beta_4 * \delta_3) log (avgsen) + (\beta_4*v + u)
\end{split} \\
\end{align}
$


## 1. Unemployment (unemp)

-	Size of bias: Unclear
-	Proxies: None in the data

### Impact on log(prbarr)
If $\beta_4 > 0 $, $\delta_1 < 0 $, then $OVB = \beta_4\delta_1 < 0 $ and if $\beta_1<0$ then the OLS coefficient on $log(prbarr)$ will be scaled away from zero (more negative), gaining statistical significance. 
-	Estimated direction: away from 0
-	Explanation for direction: If log(prbarr) increases (i.e. more likely to be arrested), we expect unemp to decrease slightly because crime is now less profitable.
-	Impact on whether effects are real: Including the omitted variable unemp would only strengthen the OLS coefficient for log(prbarr).

### Impact on log(prbconv2)
If $\beta_4 > 0 $, $\delta_2 < 0 $, then $OVB = \beta_4\delta_2 < 0 $ and if $\beta_2 <0$ then the OLS coefficient on $log(prbconv2)$ will be scaled away from zero (more negative), gaining statistical significance. 
-	Estimated direction: away from 0
-	Explanation for direction: If log(prbconv2) increases (i.e. more likely to be convicted), we expect unemp to decrease slightly because crime is now more costly, which might prompt people to seek employment instead.
-	Impact on whether effects are real: Including the omitted variable unemp would only strengthen the OLS coefficient for log(prbconv2).

### Impact on log(avgsen)
If $\beta_4 > 0 $, $\delta_3 < 0 $, then $OVB = \beta_4\delta_3 < 0 $ and if $\beta_3 <0$ then the OLS coefficient on $log(avgsen)$ will be scaled away from zero (more negative), gaining statistical significance.
-	Estimated direction: away from 0
-	Explanation for direction: If log(avgsen) increases (i.e. increase in average sentence length), we expect unemp to decrease slightly because crime is now less profitable.
-	Impact on whether effects are real: Including the omitted variable unemp would only strengthen the OLS coefficient for log(avgsen).

## 2. Income inequality (ineq)

-	Size of bias: Unclear
-	Proxies: Could take the difference between the highest and lowest sectoral wages.

### Impact on log(prbarr)
If $\beta_4 > 0 $, $\delta_1 > 0 $, then $OVB = \beta_4\delta_1 > 0 $ and if $\beta_1<0$ then the OLS coefficient on $log(prbarr)$ will be scaled towards zero (less negative), losing statistical significance. 
-	Estimated direction: towards 0
-	Explanation for direction: If log(prbarr) increases (i.e. more likely to be arrested), we expect inequality to increase because stricter criminal laws tend to be enacted in more unequal places.
-	Impact on whether effects are real: Including the omitted variable ineq would weaken the OLS coefficient for log(prbarr), indicating that its effects may not be that strong or real.

### Impact on log(prbconv2)
If $\beta_4 > 0 $, $\delta_2 > 0 $, then $OVB = \beta_4\delta_2 > 0 $ and if $\beta_2 < 0$ then the OLS coefficient on $log(prbconv2)$ will be scaled towards zero (less negative), losing statistical significance. 
-	Estimated direction: towards 0
-	Explanation for direction: If log(prbconv2) increases (i.e. more likely to be convicted), we expect ineq to increase because we think places with higher changes in probability of convictions tend to occur in places with more inequality.
-	Impact on whether effects are real: Including the omitted variable ineq would weaken the OLS coefficient for log(prbconv2).

### Impact on log(avgsen)
If $\beta_4 > 0 $, $\delta_3 > 0 $, then $OVB = \beta_4\delta_3 > 0 $ and if $\beta_3 <0$ then the OLS coefficient on $log(avgsen)$ will be scaled towards zero (less negative), losing statistical significance.
-	Estimated direction: towards 0
-	Explanation for direction: If log(avgsen) increases (i.e. increase in change in average sentence length), we expect ineq to increase because the more unequal places probably have longer incarceration durations.
-	Impact on whether effects are real: Including the omitted variable ineq would strengthen the OLS coefficient for log(avgsen).


## 3. Immigration levels (immi)

-	Size of bias: Unclear
-	Proxies: Potentially the pctmin80 variable, or the wser and wcon variables, assuming that most immigrants end up in the service and/or construction industries. Lower wages than average might indicate the presence of immigrants in those sectors.

### Impact on log(prbarr)
If $\beta_4 > 0 $, $\delta_1 > 0 $, then $OVB = \beta_4\delta_1 > 0 $ and if $\beta_1<0$ then the OLS coefficient on $log(prbarr)$ will be scaled towards zero (less negative), losing statistical significance. 
-	Estimated direction: towards 0
-	Explanation for direction: We expect log(prbarr) to be positively associated with immi, possibly due to lower levels of social trust. 
-	Impact on whether effects are real: Including the omitted variable ineq would weaken the OLS coefficient for log(prbarr), indicating that its effects may not be that strong or real.

### Impact on log(prbconv2)
If $\beta_4 > 0 $, $\delta_2 > 0 $, then $OVB = \beta_4\delta_2 > 0 $ and if $\beta_2 <0$ then the OLS coefficient on $log(prbconv2)$ will be scaled towards zero (less negative), losing statistical significance.  
-	Estimated direction: towards 0
-	Explanation for direction: We expect log(prbconv2) to be positively associated with immi, i.e. places with more immigrants may have more convictions in order to ensure public order and security (whether real or imagined). 
-	Impact on whether effects are real: Including drug would weaken the OLS coefficient for log(prbconv2), reducing its effect size.

### Impact on log(avgsen)
If $\beta_4 > 0 $, $\delta_3 > 0 $, then $OVB = \beta_4\delta_3 > 0 $ and if $\beta_3 <0$ then the OLS coefficient on $log(avgsen)$ will be scaled towards zero (less negative), losing statistical significance. 
-	Estimated direction: towards 0
-	Explanation for direction: We think log(avgsen) might be somewhat positively associated with immi, i.e. places with more immigrants may have longer sentences in order to because of institutional discrimination. 
-	Impact on whether effects are real: Including ineq would weaken the OLS coefficient for log(avgsen), reducing its effect size.

## 4. Alcohol and drug abuse levels (drug)

-	Size of bias: Unclear
-	Proxies: Potentially the pctymle variable, but only if we assume drug abuse rates among youth are constant across counties.

### Impact on log(prbarr)
If $\beta_4 > 0 $, $\delta_1 > 0 $, then $OVB = \beta_4\delta_1 > 0 $ and if $\beta_1<0$ then the OLS coefficient on $log(prbarr)$ will be scaled towards zero (less negative), losing statistical significance. 
-	Estimated direction: towards 0
-	Explanation for direction: We expect log(prbarr) to be positively associated with drug - with more drug abuse, we expect that the log(prbarr) variable increases. 
-	Impact on whether effects are real: Including the omitted variable drug would weaken the OLS coefficient for log(prbarr), indicating that its effects may not be that strong or real.

### Impact on log(prbconv2)
If $\beta_4 > 0 $, $\delta_2 > 0 $, then $OVB = \beta_4\delta_2 > 0 $ and if $\beta_2 <0$ then the OLS coefficient on $log(prbconv2)$ will be scaled towards zero (less negative), losing statistical significance.  
-	Estimated direction: towards 0
-	Explanation for direction: We expect log(prbconv2) to be positively associated with drug, i.e. counties with more drug abuse are more likely to have convictions in order to deter drug and alcohol abuse and other crime. 
-	Impact on whether effects are real: Including drug would weaken the OLS coefficient for log(prbconv2), reducing its effect size.

### Impact on log(avgsen)
If $\beta_4 > 0 $, $\delta_3 > 0 $, then $OVB = \beta_4\delta_3 > 0 $ and if $\beta_3 <0$ then the OLS coefficient on $log(avgsen)$ will be scaled towards zero (less negative), losing statistical significance. 
-	Estimated direction: towards 0
-	Explanation for direction:  We expect log(avgsen) to be positively associated with drug, i.e. counties with more drug abuse are likely to have longer sentences in order to deter drug and alcohol abuse. 
-	Impact on whether effects are real: Including drug would weaken the OLS coefficient for log(avgsen), reducing its effect size.

## 5. Poverty (poor)

-	Size of bias: Unclear
-	Proxies: The taxpc variable may give an indicator of the poverty levels in a county.

### Impact on log(prbarr)
If $\beta_4 > 0 $, $\delta_1 > 0 $, then $OVB = \beta_4\delta_1 > 0 $ and if $\beta_1<0$ then the OLS coefficient on $log(prbarr)$ will be scaled towards zero (less negative), losing statistical significance. 
-	Estimated direction: towards 0
-	Explanation for direction: We expect log(prbarr) to be positively associated with poor - because of beliefs that poor people are likelier to commit crime, the poor variable increases when log(prbarr) increases.
-	Impact on whether effects are real: Including the omitted variable poor would weaken the OLS coefficient for log(prbarr), indicating that its effects may not be that strong or real.

### Impact on log(prbconv2)
If $\beta_4 > 0 $, $\delta_2 > 0 $, then $OVB = \beta_4\delta_2 > 0 $ and if $\beta_2 <0$ then the OLS coefficient on $log(prbconv2)$ will be scaled towards zero (less negative), losing statistical significance.  
-	Estimated direction: towards 0
-	Explanation for direction: We expect log(prbconv2) to be positively associated with poor, i.e. counties with more poor people are more likely to have convictions in order to deter them from committing opportunistic crime. 
-	Impact on whether effects are real: Including poor would weaken the OLS coefficient for log(prbconv2), reducing its effect size.

### Impact on log(avgsen)
If $\beta_4 > 0 $, $\delta_3 > 0 $, then $OVB = \beta_4\delta_3 > 0 $ and if $\beta_3 <0$ then the OLS coefficient on $log(avgsen)$ will be scaled towards zero (less negative), losing statistical significance. 
-	Estimated direction: towards 0
-	Explanation for direction:  We expect log(avgsen) to be marginally positively associated with poor, i.e. counties with more poor people are likely to have longer sentences because petty crime tends to incur longer sentences. 
-	Impact on whether effects are real: Including poor would weaken the OLS coefficient for log(avgsen), reducing its effect size.

## 6. Parental criminality (prtcrm)

-	Size of bias: Unclear
-	Proxies: None in this data set.

### Impact on log(prbarr)
If $\beta_4 > 0 $, $\delta_1 > 0 $, then $OVB = \beta_4\delta_1 > 0 $ and if $\beta_1<0$ then the OLS coefficient on $log(prbarr)$ will be scaled towards zero (less negative), losing statistical significance. 
-	Estimated direction: towards 0
-	Explanation for direction: We expect log(prbarr) to be positively associated with prtcrm. Parents with criminal records are likelier to have children who commit crimes. Consequently, as log(prbarr) increases due to increased crime, we expect prtcrm to increase too. 
-	Impact on whether effects are real: Including the omitted variable prtcrm would weaken the OLS coefficient for log(prbarr), indicating that its effects may not be that strong or real.

### Impact on log(prbconv2)
If $\beta_4 > 0 $, $\delta_2 > 0 $, then $OVB = \beta_4\delta_2 > 0 $ and if $\beta_2 <0$ then the OLS coefficient on $log(prbconv2)$ will be scaled towards zero (less negative), losing statistical significance.  
-	Estimated direction: towards 0
-	Explanation for direction: We expect log(prbconv2) to be positively associated with prtcrm, i.e. counties with higher parental criminality are more likely to have crime and thus likelier convictions to deter would-be criminals.
-	Impact on whether effects are real: Including prtcrm would weaken the OLS coefficient for log(prbconv2), reducing its effect size.

### Impact on log(avgsen)
If $\beta_4 > 0 $, $\delta_3 > 0 $, then $OVB = \beta_4\delta_3 > 0 $ and if $\beta_3 <0$ then the OLS coefficient on $log(avgsen)$ will be scaled towards zero (less negative), losing statistical significance. 
-	Estimated direction: towards 0
-	Explanation for direction: We expect log(avgsen) to be positively associated with prtcrm, i.e. counties with higher parental criminality are more likely to have crime and thus heavier sentences to deter would-be criminals.
-	Impact on whether effects are real: Including prtcrm would weaken the OLS coefficient for log(avgsen), reducing its effect size.


## 7. Quality of parenting/dysfunctional family background (dysfunc)

-	Size of bias: Unclear
-	Proxies: None in this data set.

### Impact on log(prbarr)
If $\beta_4 > 0 $, $\delta_1 > 0 $, then $OVB = \beta_4\delta_1 > 0 $ and if $\beta_1<0$ then the OLS coefficient on $log(prbarr)$ will be scaled towards zero (less negative), losing statistical significance. 
-	Estimated direction: towards 0
-	Explanation for direction: We expect log(prbarr) to be positively associated with dysfunc. Individuals from dysfunctional families are likelier to commit crimes. Consequently, as log(prbarr) increases, we expect dysfunc to increase too. 
-	Impact on whether effects are real: Including the omitted variable dysfunc would weaken the OLS coefficient for prbarr, indicating that its effects may not be that strong or real.

### Impact on log(prbconv2)
If $\beta_4 > 0 $, $\delta_2 > 0 $, then $OVB = \beta_4\delta_2 > 0 $ and if $\beta_2 <0$ then the OLS coefficient on $log(prbconv2)$ will be scaled towards zero (less negative), losing statistical significance.  
-	Estimated direction: towards 0
-	Explanation for direction: We expect log(prbconv2) to be positively associated with dysfunc, i.e. counties with more dysfunctional families are more likely to have crime and thus likelier convictions to deter would-be criminals.
-	Impact on whether effects are real: Including dysfunc would weaken the OLS coefficient for log(prbconv2), reducing its effect size.

### Impact on log(avgsen)
If $\beta_4 > 0 $, $\delta_3 > 0 $, then $OVB = \beta_4\delta_3 > 0 $ and if $\beta_3 <0$ then the OLS coefficient on $log(avgsen)$ will be scaled towards zero (less negative), losing statistical significance. 
-	Estimated direction: towards 0
-	Explanation for direction: We expect log(avgsen) to be positively associated with dysfunc, i.e. counties with more dysfunctional families are more likely to have crime and thus heavier sentences to deter would-be criminals.
-	Impact on whether effects are real: Including dysfunc would weaken the OLS coefficient for log(avgsen), reducing its effect size.


## 8. Education (educ)

-	Size of bias: Unclear
-	Proxies: Perhaps the urban variable, because urban populations tend to be more educated than their non-urban counterparts. Or the wfir variable, assuming that finding employment in finance, insurance and real estate may require more education than the other sectors.

### Impact on log(prbarr)
If $\beta_4 < 0 $, $\delta_1 < 0 $, then $OVB = \beta_4\delta_1 > 0 $ and if $\beta_1<0$ then the OLS coefficient on $log(prbarr)$ will be scaled towards zero (less negative), losing statistical significance. 
-	Estimated direction: towards 0
-	Explanation for direction: We expect log(prbarr) to be negatively associated with educ. In more educated counties, the log(prbarr) is probably lower. However, we also expect educ to reduce the crmrte variable. Hence, when we take their product to calculate the OVB, we end up with a positive product.
-	Impact on whether effects are real: Including the omitted variable educ would weaken the OLS coefficient for log(prbarr), indicating that its effects may not be that strong or real.

### Impact on log(prbconv2)
If $\beta_4 < 0 $, $\delta_2 < 0 $, then $OVB = \beta_4\delta_2 > 0 $ and if $\beta_2 <0$ then the OLS coefficient on $log(prbconv2)$ will be scaled towards zero (less negative), losing statistical significance.  
-	Estimated direction: towards 0
-	Explanation for direction: In more educated counties, the log(prbconv2) is probably lower, as the criminal justice system is probably more likely to give educated people the benefit of the doubt and not convict them. Hence, we expect log(prbconv2) to be negatively associated with educ. Consequently, when we take their product to calculate the OVB, we end up with a positive value.
-	Impact on whether effects are real: Including educ would weaken the OLS coefficient for log(prbconv2), reducing its effect size.

### Impact on log(avgsen)
If $\beta_4 < 0 $, $\delta_3 < 0 $, then $OVB = \beta_4\delta_3 > 0 $ and if $\beta_3 <0$ then the OLS coefficient on $log(avgsen)$ will be scaled towards zero (less negative), losing statistical significance. 
-	Estimated direction: towards 0
-	Explanation for direction: In more educated counties, the log(avgsen) is probably lower, as longer sentences are usually awarded to petty and violent crime instead of white-collar crime. Hence, we expect log(avgsen) to be negatively associated with educ. Consequently, when we take their product to calculate the OVB, we end up with a positive value.
-	Impact on whether effects are real: Including educ would weaken the OLS coefficient for log(avgsen), reducing its effect size.


## 9. Protections for women (protec)

-	Size of bias: Unclear
-	Proxies: None in this data set.

### Impact on log(prbarr)
If $\beta_4 < 0 $, $\delta_1 > 0 $, then $OVB = \beta_4\delta_1 < 0 $ and if $\beta_1<0$ then the OLS coefficient on $log(prbarr)$ will be scaled away from zero (more negative), gaining statistical significance. 
-	Estimated direction: away from 0
-	Explanation for direction: We expect log(prbarr) to be positively associated with protec. We think that counties with stronger protections for women will take a tougher stance against crimes, and thus have larger log(prbarr). However, we also expect protec to reduce the crmrte variable. Hence, we end up with a negative OVB.
-	Impact on whether effects are real: Including the omitted variable protec would strengthen the OLS coefficient for prbarr, increasing its effect size.

### Impact on log(prbconv2)
If $\beta_4 < 0 $, $\delta_2 > 0 $, then $OVB = \beta_4\delta_2 < 0 $ and if $\beta_2 <0$ then the OLS coefficient on $log(prbconv2)$ will be scaled away from zero (more negative), gaining statistical significance.  
-	Estimated direction: towards 0
-	Explanation for direction: We expect log(prbconv2) to be positively associated with protec. Counties with a stronger stance against crime are likely to have protections for women too. However, we also expect protec to reduce the crmrte variable. Hence, when we take their product to calculate the OVB, we end up with a negative value.
-	Impact on whether effects are real: Including protec would strengthen the OLS coefficient for log(prbconv2), increasing its effect size.

### Impact on log(avgsen)
If $\beta_4 < 0 $, $\delta_3 > 0 $, then $OVB = \beta_4\delta_3 < 0 $ and if $\beta_3 <0$ then the OLS coefficient on $log(avgsen)$ will be scaled away from zero (more negative), gaining statistical significance.
-	Estimated direction: towards 0
-	Explanation for direction: We expect log(avgsen) to be positively associated with protec. Counties with a stronger stance against crime are likely to have protections for women too. However, we also expect protec to reduce the crmrte variable. Hence, when we take their product to calculate the OVB, we end up with a negative value.
-	Impact on whether effects are real: Including protec would strengthen the OLS coefficient for log(avgsen), increasing its effect size.

## 10. Investment in social services (invest)

-	Size of bias: Unclear
-	Proxies: taxpc and polpc variables, which indicate the relative amounts of resources that each county has.

### Impact on log(prbarr)
If $\beta_4 < 0 $, $\delta_1 > 0 $, then $OVB = \beta_4\delta_1 < 0 $ and if $\beta_1<0$ then the OLS coefficient on $log(prbarr)$ will be scaled away from zero (more negative), gaining statistical significance. 
-	Estimated direction: away from 0
-	Explanation for direction: We expect log(prbarr) to be positively associated with invest. We think that counties with larger log(prbarr), which indicates their commitment to tackling crime, are likely to have both the commitment and resources for increased social services spending. However, we also expect invest to reduce the crmrte variable. Hence, we end up with a negative OVB.
-	Impact on whether effects are real: Including the omitted variable invest would strengthen the OLS coefficient for log(prbarr), increasing its effect size.

### Impact on log(prbconv2)
If $\beta_4 < 0 $, $\delta_2 > 0 $, then $OVB = \beta_4\delta_2 < 0 $ and if $\beta_2 <0$ then the OLS coefficient on $log(prbconv2)$ will be scaled away from zero (more negative), gaining statistical significance.  
-	Estimated direction: towards 0
-	Explanation for direction: We expect log(prbconv2) to be positively associated with invest. Counties with the resources to take a strong stance against crime are more likely to have social services investment. We also expect invest to reduce the crmrte variable. Hence, when we take their product to calculate the OVB, we end up with a negative value.
-	Impact on whether effects are real: Including invest would strengthen the OLS coefficient for log(prbconv2), increasing its effect size.

### Impact on log(avgsen)
If $\beta_4 < 0 $, $\delta_3 > 0 $, then $OVB = \beta_4\delta_3 < 0 $ and if $\beta_3 <0$ then the OLS coefficient on $log(avgsen)$ will be scaled away from zero (more negative), gaining statistical significance.
-	Estimated direction: towards 0
-	Explanation for direction: We expect log(avgsen) to be positively associated with invest. Counties with the resources to lock people away for longer are more likely to have social services investment. We also expect invest to reduce the crmrte variable. Hence, when we take their product to calculate the OVB, we end up with a negative value.
-	Impact on whether effects are real: Including invest would strengthen the OLS coefficient for log(avgsen), increasing its effect size.

## Conclusions for Omitted Variables Discussion

Overall, only 3 omitted variables unemp, protec and invest would strengthen the OLS coefficients for our key determinants of crime. The remaining 7 of them - ineq, immi, drug, poor, prtcrm, dysfunc, educ - all weaken the OLS coefficients. Given that more omitted variables weaken rather than strengthen the determinants, and that most academics believe ineq, poor and educ to be more important variables than the others, we can reasonably conclude that omitted variable bias would reduce the effects for our determinants, indicating that the effects presented may be weaker or not real at all. More study is required to establish the extent of the reduction in effect size (i.e. the size of the bias).