# Worksheet 3: Working with Numerical Data


## Objectives: ##
To practice:

* Computing numerical summaries of numerical data.
* Using the IQR method to identify potential outliers.


## Instructions: ##
1. Run the data loading function found in the Tools section.
2. Read the information at the beginning of each question carefully.
3. Answer each part of each question by writing code in the code cell, or a sentence in a markdown cell, as appropriate.

## Formulae: ##

$$IQR=Q_3-Q_1$$

$$z=\frac{x-\mu}{\sigma}\approx\frac{x-\bar{x}}{s}$$
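As a minimal sketch of the z-score formula, the code below standardizes a small made-up vector using the sample mean and sample standard deviation. The vector `x` is hypothetical and is not part of the `census` data.

```r
# Hypothetical toy data, not taken from the census data set
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

# z-score of each value: distance from the sample mean, in units of s
z <- (x - mean(x)) / sd(x)
z
```

Values with a z-score far from 0 lie far from the mean relative to the spread of the data.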

**IQR rule for potential outliers**

A point is a potential outlier if it is less than $Q_1-1.5\times IQR$
or greater than $Q_3+1.5\times IQR$.



## Tools: ##

In [None]:
census <- read.csv("census150.csv",header=TRUE)

## Data Information: ##

### Data Set: ###
A small random sample of observations from the 2000 U.S. Census Data.
#### Name: #### 
* `census` - observations from the 2000 U.S. Census Data.

#### Variables: ####
* `census_year` - Census Year.
* `state_fips_code` - Name of state.
* `total_family_income` - Total family income (in U.S. dollars).
* `age` - Age.
* `sex` - Sex with levels: Female and Male.
* `race_general` - Race with levels: American Indian or Alaska Native, Black, Chinese, Japanese, Other Asian or Pacific Islander, Two major races, White and Other.
* `marital_status` - Marital status with levels Divorced, Married/spouse absent, Married/spouse present, Never married/single, Separated and Widowed.
* `total_personal_income` - Total personal income (in U.S. dollars).
* `percent_family_income` - the percent of the family income earned by the participant, calculated by dividing the personal income by the family income (unless the family income is 0, then the value of this variable is set to 0) given as a percentage.


## Question 1. Calculate the Summaries ##

We have now encountered a number of functions which compute summaries of data. Recall that to display information about a variable from a data set you need to indicate where the variable can be found. 

The format is &lt;dataName&gt;\$&lt;varName&gt;.

The following are useful commands:

- Minimum value  `min(`&lt;dataName&gt;\$&lt;varName&gt;`)`
- Maximum value  `max(`&lt;dataName&gt;\$&lt;varName&gt;`)`
- Mean   `mean(`&lt;dataName&gt;\$&lt;varName&gt;`)`
- Median `median(`&lt;dataName&gt;\$&lt;varName&gt;`)`
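To illustrate the `$` format, here is a small sketch on a made-up data frame. The names `toy` and `height` are hypothetical and only stand in for a real data set and variable.

```r
# Hypothetical toy data frame (the names `toy` and `height` are made up)
toy <- data.frame(height = c(150, 160, 165, 170, 180))

# <dataName>$<varName> pulls out one column as a vector
toy$height

min(toy$height)     # smallest value
max(toy$height)     # largest value
mean(toy$height)    # arithmetic mean
median(toy$height)  # middle value
```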

Use the commands to compute the min, max, mean, and median of `age` and `total_family_income`.

## Question 2. Calculating Quartiles

R has a command  `quantile(`&lt;dataName&gt;\$&lt;varName&gt;, p`)` which finds a value that has proportion p of the data at or below it, and 1-p at or above it. 

We can use this to compute quartiles.  For example the first quartile ($Q_1$) should have  25% of the data at or below (and 75% at or above). So the code we need is `quantile(`&lt;dataName&gt;\$&lt;varName&gt;,0.25`)`. 
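As a quick illustration on a made-up vector (`x` is hypothetical, not from `census`), the sketch below computes a first quartile and checks that the 0.5 quantile agrees with the median. Note that `quantile` interpolates between data values by default, so results on other data can differ slightly from a hand calculation.

```r
# Hypothetical toy data
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)

# First quartile: about 25% of the data at or below this value
q1 <- quantile(x, 0.25)
q1

# The 0.5 quantile is the median
quantile(x, 0.5)
median(x)
```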


a. Run the included code to compute the first quartile of `age` and `total_family_income`.

b. Copy and modify the code to compute the third quartile for age and total_family_income.

In [None]:
quantile(census$age,0.25)
quantile(census$total_family_income,0.25)

## Question 3. Measures of Spread

We have looked at three measures of spread: range, IQR, and standard deviation.

Range and IQR could be computed from the results of previous questions, but there are commands that make it easier: `range(`&lt;dataName&gt;\$&lt;varName&gt;`)` gives the range of values and `IQR(`&lt;dataName&gt;\$&lt;varName&gt;`)` computes the IQR.

The command  `sd(`&lt;dataName&gt;\$&lt;varName&gt;`)` computes the sample standard deviation.
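A small sketch on a made-up vector (`x` is hypothetical) can confirm that `IQR` agrees with the formula $IQR=Q_3-Q_1$ when both quartiles come from `quantile`, and shows `sd` in use.

```r
# Hypothetical toy data
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)

# IQR from the formula Q3 - Q1 (unname() drops the "75%" label)
iqr_by_hand <- unname(quantile(x, 0.75) - quantile(x, 0.25))
iqr_by_hand

# IQR from the built-in command
IQR(x)

# Sample standard deviation
sd(x)
```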



Find the three measures of spread for the variables `age` and `total_family_income`.


Do these return what you expected? 

## Question 4. The summary command.

R has another way to quickly get a numerical picture of the data. The command  `summary(`&lt;dataName&gt;\$&lt;varName&gt;`)` gives the five-number summary (minimum, $Q_1$, median, $Q_3$, maximum), plus the mean.

Use `summary` to compute the five-number summary of `age` and `total_family_income`.
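The result of `summary` can also be stored and its entries pulled out by name, which avoids retyping numbers later. The sketch below uses a made-up vector (`x` and `s` are hypothetical names). The printed values may be rounded for display.

```r
# Hypothetical toy data with one very large value
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 100)

# Store the summary, then print it
s <- summary(x)
s

# Individual entries can be extracted by their labels
s[["1st Qu."]]
s[["3rd Qu."]]
s[["Max."]]
```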

## Question 5. Searching for Outliers. (IQR rule)

**IQR rule for potential outliers**

A point is a potential outlier if it is less than $Q_1-1.5\times IQR$
or greater than $Q_3+1.5\times IQR$.
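As one possible sketch of the rule, the code below computes both cut offs for a made-up vector and lists the values that fall outside them. The names `x`, `q1`, `q3`, `iqr`, `lower_cut`, and `upper_cut` are hypothetical, and the quartiles come from `quantile`, which is only one of several conventions for computing them.

```r
# Hypothetical toy data with one very large value
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 100)

# Quartiles and IQR
q1  <- unname(quantile(x, 0.25))
q3  <- unname(quantile(x, 0.75))
iqr <- q3 - q1

# Cut offs from the IQR rule
lower_cut <- q1 - 1.5 * iqr
upper_cut <- q3 + 1.5 * iqr
c(lower = lower_cut, upper = upper_cut)

# Values flagged as potential outliers
x[x < lower_cut | x > upper_cut]
```

Here the quartiles are 3 and 7, so the cut offs are -3 and 13, and only the value 100 is flagged.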


In class we used this rule to find the potential outliers for the variable `total_personal_income`.

I have included my calculations from class (this time using R).

In [None]:
summary(census$total_personal_income)

First I computed the IQR, using $Q_3-Q_1$.

In [None]:
# IQR = Q3 - Q1, with the quartiles read from summary()
31200-0

Then using  $Q_1-1.5\times IQR$  I found the lower cut off.

In [None]:
0-1.5*31200

Then using  $Q_3+1.5\times IQR$  I found the upper cut off.

In [None]:
31200+1.5*31200

Since the maximum was 160 000, which is above the upper cut off, I know that there are potential outliers.  I can see them in the following boxplot.

In [None]:
boxplot(census$total_personal_income,
        main="Boxplot of Total Personal Income ($)", ylab="Total Personal Income (dollars)")

You will need to do the same to check for outliers for the variables `age` and `total_family_income`.

Compute the IQR for `age`.

Compute the lower cut off for potential outliers using the IQR method and the variable `age`.

Compute the upper cut off for potential outliers using the IQR method and the variable `age`.

Based on what you have already computed about the variable `age`, are there any potential outliers?

Type your answer here.


Confirm your answer with a boxplot.

Compute the IQR for `total_family_income`.

Compute the lower cut off for potential outliers using the IQR method and the variable `total_family_income`.

Compute the upper cut off for potential outliers using the IQR method and the variable `total_family_income`.

Based on what you have already computed about the variable `total_family_income`, are there any potential outliers?

Type your answer here.

Confirm your answer with a boxplot.

## Question 7. Finding All Potential Outliers

When we know that there are potential outliers, it is often necessary to find those particular entries. 

The function  `which(`&lt;dataName&gt;\$&lt;varName&gt;` > number)` will return the numbers of the rows where the indicated variable is larger than a particular number.

I used the following to find where `total_personal_income` was above the upper cut off of 78000.

In [None]:
which(census$total_personal_income>78000)

You can also print those lines of the data set using the code below.

In [None]:
census[which(census$total_personal_income>78000),]
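Conditions can also be combined with `&` so that `which` returns only the rows where both are true. The sketch below shows this on a made-up data frame. The names `toy`, `a`, and `b` and the cut offs 40 and 100 are hypothetical.

```r
# Hypothetical toy data frame with two numerical variables
toy <- data.frame(a = c(1, 50, 3, 60, 5),
                  b = c(10, 20, 300, 400, 50))

# Rows where `a` is above 40
which(toy$a > 40)

# Rows where `a` is above 40 AND `b` is above 100
both <- which(toy$a > 40 & toy$b > 100)
both

# Print those rows of the data frame
toy[both, ]
```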

Find all of the entries in the data set that have an unusually large `total_family_income`.

Are there any entries that have both unusually high personal income and family income?

Type your answer here.