# Applied Exercise 1


**This exercise relates to the `College` data set, which can be found in the file `College.csv`. It contains a number of variables for 777 different universities and colleges in the US. The variables are**

- **`Private`: Public/private indicator**
- **`Apps`: Number of applications received**
- **`Accept`: Number of applicants accepted**
- **`Enroll`: Number of new students enrolled**
- **`Top10perc`: New students from top 10% of high school class**
- **`Top25perc`: New students from top 25% of high school class**
- **`F.Undergrad`: Number of full-time undergraduates**
- **`P.Undergrad`: Number of part-time undergraduates**
- **`Outstate`: Out-of-state tuition**
- **`Room.Board`: Room and board costs**
- **`Books`: Estimated book costs**
- **`Personal`: Estimated personal spending**
- **`PhD`: Percent of faculty with Ph.D.'s**
- **`Terminal`: Percent of faculty with terminal degree**
- **`S.F.Ratio`: Student/faculty ratio**
- **`perc.alumni`: Percent of alumni who donate**
- **`Expend`: Instructional expenditure per student**
- **`Grad.Rate`: Graduation rate**

**Before reading the data into `R`, it can be viewed in Excel or a text editor.**

## Part 1
**Use the `read.csv()` function to read the data into `R`. Call the loaded data `college`. Make sure that you have the directory set to the correct location for the data.**

In [None]:
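# stringsAsFactors = TRUE reads text columns such as Private as factors (not the default since R 4.0);
# the plot() calls below need a factor to draw side-by-side boxplots.
college = read.csv("../input/ISLR-Auto/College.csv", header = TRUE, stringsAsFactors = TRUE)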

## Part 2
**Look at the data using the `fix()` function. You should notice that the first column is just the name of each university. We don't really want `R` to treat this as data. However, it may be handy to have these names for later. Try the following commands:**

```
> rownames(college) = college[, 1]
> fix(college)
```

In [None]:
head(college)

In [None]:
rownames(college) = college[, 1]
head(college)

**You should see that there is now a `row.names` column with the name of each university recorded. This means that `R` has given each row a name corresponding to the appropriate university. `R` will not try to perform calculations on the row names. However, we still need to eliminate the first column in the data where the names are stored. Try**

```
> college = college[, -1]
> fix(college)
```

In [None]:
college = college[, -1]
head(college)

**Now you should see that the first data column is `Private`. Note that another column labeled `row.names` now appears before the `Private` column. However, this is not a data column but rather the name that `R` is giving to each row.**
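
The two steps above (copy the first column into the row names, then drop it) can also be done while reading the file. A minimal sketch, assuming a small inline CSV in place of `College.csv`; `toy_csv`, `toy_college`, and all values in them are invented for illustration:

```r
# Hypothetical three-row stand-in for College.csv (values are invented).
toy_csv <- "Name,Private,Apps
College A,Yes,1660
College B,No,2186
College C,Yes,1428"

# row.names = 1 tells read.csv() to use the first column as row names
# instead of keeping it as a data column.
toy_college <- read.csv(text = toy_csv, row.names = 1)

rownames(toy_college)   # the college names
names(toy_college)      # only the data columns remain: Private, Apps
```

Either route ends with the names stored as row labels rather than as a variable `R` might try to compute on.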

## Part 3.1
**Use the `summary()` function to produce a numerical summary of the variables in the data set.**

In [None]:
summary(college)

## Part 3.2
**Use the `pairs()` function to produce a scatterplot matrix of the first ten columns or variables of the data. Recall that you can reference the first ten columns of a matrix `A` using `A[, 1:10]`.**

In [None]:
pairs(college[, 1:10])

## Part 3.3
**Use the `plot()` function to produce side-by-side boxplots of `Outstate` versus `Private`.**

In [None]:
plot(college$Private, college$Outstate, xlab = "Private", ylab = "Out-of-state tuition (dollars)")

## Part 3.4
**Create a new qualitative variable, called `Elite`, by *binning* the `Top10perc` variable. We are going to divide universities into two groups based on whether or not the proportion of students coming from the top 10% of their high school classes exceeds 50%.**

```
> Elite = rep("No", nrow(college))
> Elite[college$Top10perc>50] = "Yes"
> Elite = as.factor(Elite)
> college = data.frame(college, Elite)
```

In [None]:
Elite = rep("No", nrow(college))
Elite[college$Top10perc > 50] = "Yes"
Elite = as.factor(Elite)
college = data.frame(college, Elite)

**Use the `summary()` function to see how many elite universities there are. Now use the `plot()` function to produce side-by-side boxplots of `Outstate` versus `Elite`.**

In [None]:
summary(college$Elite)

In [None]:
plot(college$Elite, college$Outstate, xlab = "Elite", ylab = "Out-of-state tuition (dollars)")
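
The binning step in this part can also be written as a single vectorised expression. A minimal sketch using `ifelse()`, with a made-up vector `top10` standing in for `college$Top10perc`; the names `top10` and `elite_toy` are chosen here for illustration only:

```r
# Hypothetical Top10perc values (invented for illustration).
top10 <- c(23, 51, 50, 96, 10)

# ifelse() is vectorised: "Yes" where the condition holds, "No" elsewhere.
# The comparison is strict, so a value of exactly 50 is labelled "No".
elite_toy <- factor(ifelse(top10 > 50, "Yes", "No"), levels = c("No", "Yes"))

table(elite_toy)  # counts per level, the same information summary() gives for a factor
```

Setting `levels` explicitly keeps "No" first, so the boxplot order does not depend on which label happens to appear in the data.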

## Part 3.5
**Use the `hist()` function to produce some histograms with differing numbers of bins for a few of the quantitative variables. You may find the command `par(mfrow = c(2, 2))` useful: it will divide the print window into four regions so that four plots can be made simultaneously. Modifying the arguments to this function will divide the screen in other ways.**

In [None]:
par(mfrow = c(2, 2))
hist(college$Apps, xlab = "Number of applicants", main = "Histogram for all colleges")
hist(college$Apps[college$Private == "Yes"], xlab = "Number of applicants", main = "Histogram for private schools")
hist(college$Apps[college$Private == "No"], xlab = "Number of applicants", main = "Histogram for public schools")
hist(college$Apps[college$Elite == "Yes"], xlab = "Number of applicants", main = "Histogram for elite schools")

In [None]:
par(mfrow = c(2, 2))
hist(college$Expend, xlab = "Instructional expenditure per student (dollars)", main = "Histogram for all colleges")
hist(college$Expend[college$Private == "Yes"], xlab = "Instructional expenditure per student (dollars)", main = "Histogram for private schools")
hist(college$Expend[college$Private == "No"], xlab = "Instructional expenditure per student (dollars)", main = "Histogram for public schools")
hist(college$Expend[college$Elite == "Yes"], xlab = "Instructional expenditure per student (dollars)", main = "Histogram for elite schools")

In [None]:
par(mfrow = c(2, 2))
hist(college$S.F.Ratio, xlab = "Student-Faculty Ratio", main = "Histogram for all colleges")
hist(college$S.F.Ratio[college$Private == "Yes"], xlab = "Student-Faculty Ratio", main = "Histogram for private schools")
hist(college$S.F.Ratio[college$Private == "No"], xlab = "Student-Faculty Ratio", main = "Histogram for public schools")
hist(college$S.F.Ratio[college$Elite == "Yes"], xlab = "Student-Faculty Ratio", main = "Histogram for elite schools")
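
The histograms above all use the default binning of `hist()`. The exercise also asks for differing numbers of bins, which is controlled by the `breaks` argument. A minimal sketch on synthetic data (the vector `x` and the bin width of 2000 are assumptions made for illustration, not values taken from `College`):

```r
# Synthetic right-skewed data standing in for a variable such as Apps.
set.seed(1)
x <- rexp(500, rate = 1 / 3000)

par(mfrow = c(1, 3))
hist(x, breaks = 5,  main = "breaks = 5")    # coarse: a suggested bin count
hist(x, breaks = 50, main = "breaks = 50")   # fine: shows more detail in the tail
# An explicit vector of breakpoints fixes the bin edges exactly.
edges <- seq(0, max(x) + 2000, by = 2000)
hist(x, breaks = edges, main = "2000-wide bins")
```

A single number passed to `breaks` is treated as a suggestion, so the plotted bin count may differ slightly from it; a vector of breakpoints is used exactly as given.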

## Part 3.6
**Continue exploring the data, and provide a brief summary of what you discover.**

In [None]:
NonTuitionCosts = college$Room.Board + college$Books + college$Personal
college = data.frame(college, NonTuitionCosts)
par(mfrow = c(1, 2))
plot(college$Private, college$NonTuitionCosts, xlab = "Private", ylab = "Total non-tuition costs per year (dollars)")
plot(college$Elite, college$NonTuitionCosts, xlab = "Elite", ylab = "Total non-tuition costs per year (dollars)")

Based on the above box plots, aside from some outlier schools with very high costs, there isn't a wide gap in median non-tuition costs between private and public schools. The box plots do show, though, a distinct difference in median non-tuition costs between elite and non-elite schools, with elite schools having higher costs.

In [None]:
AcceptPerc = college$Accept / college$Apps * 100
college = data.frame(college, AcceptPerc)
par(mfrow = c(1, 2))
plot(college$Private, college$AcceptPerc, xlab = "Private", ylab = "Acceptance Rate")
plot(college$Elite, college$AcceptPerc, xlab = "Elite", ylab = "Acceptance Rate")

In [None]:
summary(college$AcceptPerc[college$Private == "Yes"])

In [None]:
summary(college$AcceptPerc[college$Private == "No"])

In [None]:
summary(college$AcceptPerc[college$Elite == "Yes"])

In [None]:
summary(college$AcceptPerc[college$Elite == "No"])

The boxplots show that while the median acceptance rates for both private and public schools are pretty close at around 75-80%, private schools have a much wider range of acceptance rates (going down to a minimum of 15.45%). When we distinguish between elite and non-elite schools, elite schools have a much lower median acceptance rate compared to non-elite ones.
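
The four separate `summary()` calls above could be collapsed into one call per grouping variable. A minimal sketch using `tapply()` on made-up numbers; `rate` and `group` are hypothetical stand-ins for `college$AcceptPerc` and `college$Elite`:

```r
# Hypothetical acceptance rates (percent) and group labels, invented for illustration.
rate  <- c(80, 75, 90, 40, 30, 85)
group <- factor(c("No", "No", "No", "Yes", "Yes", "No"))

# tapply() splits `rate` by `group` and applies the function to each piece.
tapply(rate, group, median)
tapply(rate, group, summary)   # minimum, quartiles, mean, and maximum per group
```

With the real data, the same pattern would give the per-group medians and minimums quoted in the discussion without subsetting by hand.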

In [None]:
par(mfrow = c(2, 2))
hist(college$perc.alumni, xlab = "Percent of alumni who donate", main = "Histogram for all colleges")
hist(college$perc.alumni[college$Private == "Yes"], xlab = "Percent of alumni who donate", main = "Histogram for private schools")
hist(college$perc.alumni[college$Private == "No"], xlab = "Percent of alumni who donate", main = "Histogram for public schools")
hist(college$perc.alumni[college$Elite == "Yes"], xlab = "Percent of alumni who donate", main = "Histogram for elite schools")

Based on the above histograms, private schools and elite schools tend to have a higher percent of alumni who donate.

In [None]:
par(mfrow = c(2, 2))
plot(college$PhD, college$Grad.Rate, xlab = "Percent of faculty with PhDs", ylab = "Graduation Rate")
plot(college$Terminal, college$Grad.Rate, xlab = "Percent of faculty with terminal degrees", ylab = "Graduation Rate")
plot(college$S.F.Ratio, college$Grad.Rate, xlab = "Student-faculty ratio", ylab = "Graduation Rate")
plot(college$Expend, college$Grad.Rate, xlab = "Instructional expenditure per student (dollars)", ylab = "Graduation Rate")

The above scatterplots explore some of the factors which might be related to student graduation rates. From the upper-left plot, it appears there is a weak positive relationship between the percent of faculty with PhDs and graduation rates. The upper-right plot appears to indicate that there isn't a relationship between the percent of faculty with terminal degrees and graduation rates. The bottom-left plot indicates that as student-faculty ratios increase, graduation rates generally tend to decrease. Lastly, the bottom-right plot seems to show that there is a definite positive relationship between instructional expenditure per student and graduation rates, with higher expenditures corresponding to higher graduation rates.
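
Visual judgments such as "weak" or "definite" can be cross-checked with a correlation coefficient. A minimal sketch on synthetic data, not on the `College` columns; the variables `x`, `y_strong`, and `y_none` are invented, and `cor()` computes the Pearson correlation by default, which only measures linear association:

```r
# Synthetic data: y_strong tracks x closely, y_none is unrelated noise.
set.seed(42)
x        <- runif(200, 0, 100)
y_strong <- 0.5 * x + rnorm(200, sd = 5)
y_none   <- rnorm(200, mean = 50, sd = 10)

cor(x, y_strong)   # close to 1: strong positive linear relationship
cor(x, y_none)     # close to 0: no linear relationship
```

A curved relationship can have a modest Pearson correlation even when the pattern in the plot is obvious, so the number complements the scatterplot rather than replacing it.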

# Applied Exercise 2

**This exercise involves the `Auto` data set studied in the lab. Make sure that the missing values have been removed from the data.**

In [None]:
Auto = read.csv("../input/ISLR-Auto/Auto.csv", header = TRUE, na.strings = "?")
Auto = na.omit(Auto)
dim(Auto)

## Part 1
**Which of the predictors are quantitative, and which are qualitative?**

In [None]:
head(Auto)

The quantitative variables are `mpg`, `displacement`, `horsepower`, `weight`, and `acceleration`. Depending on the context, we may want to treat `cylinders` and `year` as either quantitative or qualitative predictors. Lastly, `origin` and `name` are qualitative predictors. `origin` is a numeric encoding of a car's country of origin, with 1 being American, 2 being European, and 3 being Japanese.

## Part 2
**What is the *range* of each quantitative predictor? You can answer this using the `range()` function.**

In [None]:
?range

In [None]:
range(Auto$mpg)

In [None]:
range(Auto$cylinders)

In [None]:
range(Auto$displacement)

In [None]:
range(Auto$horsepower)

In [None]:
range(Auto$weight)

In [None]:
range(Auto$acceleration)

In [None]:
range(Auto$year)

We have the following ranges for each quantitative predictor:

- `mpg` = 37.6
- `cylinders` = 5
- `displacement` = 387
- `horsepower` = 184
- `weight` = 3527
- `acceleration` = 16.8
- `year` = 12
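
The values listed above are the widths of the ranges (maximum minus minimum). Rather than calling `range()` once per column, the same numbers can be obtained in one pass. A minimal sketch on a made-up data frame `toy_auto`, whose values are invented for illustration:

```r
# Toy data frame (values invented) standing in for the quantitative columns of Auto.
toy_auto <- data.frame(mpg = c(18, 15, 36.1, 9), weight = c(3504, 3693, 1800, 4732))

sapply(toy_auto, range)                          # row 1 = minimum, row 2 = maximum
sapply(toy_auto, function(v) diff(range(v)))     # maximum minus minimum, per column
```

`range()` returns the pair (minimum, maximum), so `diff()` is what turns it into a single width.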

## Part 3
**What is the mean and standard deviation of each quantitative predictor?**

In [None]:
colMeans(Auto[, 1:7])

In [None]:
apply(Auto[, 1:7], MARGIN = 2, FUN = "sd")

We have the following mean and standard deviation for each quantitative predictor:

- `mpg`: mean = 23.45, standard deviation = 7.81
- `cylinders`: mean = 5.47, standard deviation = 1.71
- `displacement`: mean = 194.41, standard deviation = 104.64
- `horsepower`: mean = 104.47, standard deviation = 38.49
- `weight`: mean = 2977.58, standard deviation = 849.40
- `acceleration`: mean = 15.54, standard deviation = 2.76
- `year`: mean = 75.98, standard deviation = 3.68

## Part 4
**Now remove the 10th through 85th observations. What is the range, mean, and standard deviation of each predictor in the subset of the data that remains?**

In [None]:
apply(Auto[-(10:85), 1:7], MARGIN = 2, FUN = "range")

In [None]:
apply(Auto[-(10:85), 1:7], MARGIN = 2, FUN = "mean")

In [None]:
apply(Auto[-(10:85), 1:7], MARGIN = 2, FUN = "sd")

We have the following range, mean, and standard deviation for each quantitative predictor after the 10th through 85th rows have been removed:

- `mpg`: range = 35.6, mean = 24.40, standard deviation = 7.87
- `cylinders`: range = 5, mean = 5.37, standard deviation = 1.65
- `displacement`: range = 387, mean = 187.24, standard deviation = 99.68
- `horsepower`: range = 184, mean = 100.72, standard deviation = 35.71
- `weight`: range = 3348, mean = 2935.97, standard deviation = 811.30
- `acceleration`: range = 16.3, mean = 15.73, standard deviation = 2.69
- `year`: range = 12, mean = 77.15, standard deviation = 3.11
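
The row removal in this part relies on negative indexing, which refers to row positions rather than row names. A minimal sketch on a made-up 100-row data frame; `toy_rows` and `kept` are names chosen here for illustration:

```r
# Toy data frame with 100 rows; `id` records the original row position.
toy_rows <- data.frame(id = 1:100)

# A negative index drops rows; the parentheses matter, since -10:85 would
# mean the sequence from -10 to 85 rather than "drop 10 through 85".
kept <- toy_rows[-(10:85), , drop = FALSE]

nrow(kept)     # 100 - 76 = 24 rows remain
kept$id[9:10]  # row 9 is followed directly by original row 86
```

Because `na.omit()` leaves gaps in the row names of `Auto`, positions and row names no longer coincide there, and `-(10:85)` drops the 10th through 85th remaining rows.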

## Part 5
**Using the full data set, investigate the predictors graphically, using scatterplots or other tools of your choice. Create some plots highlighting the relationships among the predictors. Comment on your findings.**

In [None]:
par(mfrow = c(2, 2))
plot(Auto$displacement, Auto$mpg, xlab = "Engine displacement (cubic inches)", ylab = "Miles per gallon")
plot(Auto$horsepower, Auto$mpg, xlab = "Horsepower", ylab = "Miles per gallon")
plot(Auto$weight, Auto$mpg, xlab = "Car weight (pounds)", ylab = "Miles per gallon")
plot(Auto$year, Auto$mpg, xlab = "Model Year", ylab = "Miles per gallon")

See discussion in Part 6 below.

In [None]:
par(mfrow = c(2, 2))
plot(Auto$year, Auto$acceleration, xlab = "Model Year", ylab = "0 to 60mph time (seconds)")
plot(Auto$year, Auto$displacement, xlab = "Model Year", ylab = "Engine displacement (cubic inches)")
plot(Auto$year, Auto$weight, xlab = "Model Year", ylab = "Car weight (pounds)")
plot(Auto$year, Auto$horsepower, xlab = "Model Year", ylab = "Horsepower")

Looking at how various car characteristics change with model year, we see that there aren't any strong relationships. There are still some weak relationships, such as max engine displacement, car weight, and horsepower generally decreasing from 1970 to 1982. From a historical perspective, these changes could be in response to the 1973 and 1979 oil crises, in which spikes in oil prices pushed auto manufacturers to take measures to improve the efficiency of their cars.

In [None]:
par(mfrow = c(2, 2))
plot(Auto$weight, Auto$acceleration, xlab = "Car weight (pounds)", ylab = "0 to 60mph time (seconds)")
plot(Auto$cylinders, Auto$acceleration, xlab = "Number of engine cylinders", ylab = "0 to 60mph time (seconds)")
plot(Auto$displacement, Auto$acceleration, xlab = "Engine displacement (cubic inches)", ylab = "0 to 60mph time (seconds)")
plot(Auto$horsepower, Auto$acceleration, xlab = "Horsepower", ylab = "0 to 60mph time (seconds)")

Next, I explored the relationship between the number of seconds it takes a car to accelerate from 0 to 60 miles per hour and a number of different factors. As expected, the 0-to-60 time clearly decreases with increased engine displacement and increased horsepower. There is also a weak tendency for the 0-to-60 time to decrease as the number of engine cylinders increases. While it may seem counter-intuitive at first, the 0-to-60 time also tends to decrease with car weight. This makes more sense in the context of the two scatterplots below, which show that higher weight is correlated with higher horsepower and higher engine displacement.

In [None]:
par(mfrow = c(2, 1))
plot(Auto$weight, Auto$horsepower, xlab = "Car weight (pounds)", ylab = "Horsepower")
plot(Auto$weight, Auto$displacement, xlab = "Car weight (pounds)", ylab = "Engine displacement (cubic inches)")

## Part 6
**Suppose we wish to predict gas mileage (`mpg`) on the basis of the other variables. Do your plots suggest that any of the other variables might be useful in predicting `mpg`? Justify your answer.**

Based on the scatter plots I made in Part 5, which relate miles per gallon to engine displacement, horsepower, car weight, and model year, it seems as if the first three predictors would be most helpful in predicting `mpg`, with model year still potentially being helpful but less so. There are clear relationships in which higher engine displacement, horsepower, and car weight are each associated with lower fuel efficiency. There is also a weak relationship in which fuel efficiency generally increased from 1970 to 1982.

In [None]:
# Recode the numeric origin codes as a labelled factor in one step
Auto$origin = factor(Auto$origin, levels = 1:3, labels = c("American", "European", "Japanese"))

In [None]:
plot(Auto$origin, Auto$mpg, xlab = "Country of origin", ylab = "Miles per gallon")

Looking at the above box plot, we can also see that there is a relationship between a car's country of origin and fuel efficiency, where on average Japanese cars are the most efficient, followed by European cars and then by American cars.

# Applied Exercise 3

**This exercise involves the `Boston` housing data set.**

## Part 1
**To begin, load the `Boston` data set. The `Boston` data set is part of the `MASS` *library* in `R`.**

```
> library(MASS)
```

In [None]:
library(MASS)

**Now the data set is contained in the object `Boston`.**

```
> Boston
```

In [None]:
head(Boston)

**Read about the data set:**

```
> ?Boston
```

In [None]:
?Boston

**Note** Instead of using the `Boston` data set found in the `MASS` library in `R`, from this point on I will instead be using the corrected Boston data set, which can be downloaded [here](http://lib.stat.cmu.edu/datasets/boston_corrected.txt).

In [None]:
Boston_corrected = read.csv("../input/corrected-boston-housing/boston_corrected.csv", header = TRUE)
head(Boston_corrected)

In [None]:
dim(Boston_corrected)

**How many rows are in this data set? How many columns? What do the rows and columns represent?**

The corrected Boston data set has 506 rows and 20 columns. Each row represents a particular tract of land within the Boston area. The data set has the following columns.

- `TOWN`: Name of the town in which the tract is located
- `TOWNNO`: Numeric code corresponding to the town
- `TRACT`: ID number of the tract of land
- `LON`: Longitude of the tract in decimal degrees
- `LAT`: Latitude of the tract in decimal degrees
- `MEDV`: Median value of owner-occupied housing in \\$1000 for the tract
- `CMEDV`: Corrected median value of owner occupied housing in \\$1000 for the tract, since the original values in MEDV were censored in the sense that all median values at or over \\$50000 are set to \\$50000
- `CRIM`: Per capita crime rate for the tract
- `ZN`: Percent of residential land zoned for lots over 25000 square feet per town (constant for all tracts within the same town)
- `INDUS`: Percent of non-retail business acres per town (constant for all tracts within the same town)
- `CHAS`: Dummy variable to indicate whether or not the tract borders the Charles River (1 = Borders Charles River, 0 = Otherwise)
- `NOX`: Nitric oxides concentration (in parts per 10 million) per town (constant for all tracts within the same town)
- `RM`: Average number of rooms per dwelling in the tract
- `AGE`: Percent of owner-occupied units in the tract built prior to 1940 
- `DIS`: Weighted distance from the tract to five Boston employment centers
- `RAD`: Index of accessibility to radial highways per town (constant for all tracts within the same town)
- `TAX`: Full-value property tax rate per \\$10000 per town (constant for all tracts within the same town)
- `PTRATIO`: Pupil-teacher ratio per town (constant for all tracts within the same town)
- `B`: $1000(B - 0.63)^2$, where $B$ is the proportion of black residents in the tract
- `LSTAT`: Percent of tract population designated as lower status

## Part 2
**Make some pairwise scatterplots of the predictors (columns) in the data set. Describe your findings.**

In [None]:
dim(Boston_corrected)

In [None]:
par(mfrow = c(2, 2))
plot(Boston_corrected$AGE, Boston_corrected$CMEDV, xlab = "Percent of units built prior to 1940", ylab = "Median home value in $1000s")
plot(Boston_corrected$LSTAT, Boston_corrected$CMEDV, xlab = "Percent of lower status residents", ylab = "Median home value in $1000s")
plot(Boston_corrected$CMEDV, Boston_corrected$PTRATIO, xlab = "Median home value in $1000s", ylab = "Pupil-teacher ratio")
plot(as.factor(Boston_corrected$CHAS), Boston_corrected$CMEDV, xlab = "Borders Charles River", ylab = "Median home value in $1000s")

First, I generated some plots to explore the relationship between median home value and a number of non-crime factors. There aren't any especially clear patterns I can discern from these plots, aside from the expected result that tracts with higher median home values have a lower proportion of lower-status residents. Also, it appears that tracts which border the Charles River have a slightly higher median home value on average.

In [None]:
par(mfrow = c(2, 2))
plot(Boston_corrected$CMEDV, Boston_corrected$NOX, xlab = "Median home value in $1000s", ylab = "Nitric oxides concentration (parts per 10 million)")
plot(Boston_corrected$INDUS, Boston_corrected$NOX, xlab = "Percent of non-retail business acres", ylab = "Nitric oxides concentration (parts per 10 million)")
plot(Boston_corrected$CMEDV, Boston_corrected$B, xlab = "Median home value in $1000s", ylab = "1000(Proportion of black residents - 0.63)^2")
plot(Boston_corrected$DIS, Boston_corrected$CMEDV, xlab = "Weighted distance to Boston employment centers", ylab = "Median home value in $1000s")

The first two scatter plots in this next group explore factors that might relate to the concentration of nitric oxides. While there isn't a strong relationship, it appears that tracts with higher median home value also weakly tend to have lower concentrations of nitric oxides. There is a much clearer relationship with the percentage of non-retail business acres -- tracts with a higher proportion of non-retail business acres tend to have higher concentrations of nitric oxides. The bottom two plots look at some more factors which might be related to the median home value of a tract. 

The bottom-left plot seems to indicate that there is a relationship between the value of `B` and `CMEDV`, where `B` increases as `CMEDV` increases. If I am interpreting this correctly, this means that tracts with high median home values have a very low (close to 0%) proportion of Black residents, while tracts with low median home values have a much higher proportion (close to 63%). The bottom-right plot appears to indicate that there is also a relationship between proximity to Boston employment centers and median home value, with home values generally increasing as one gets further away from the employment centers.
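
To make the interpretation of `B` more concrete, the transformation from the column list can be evaluated at a few proportions. A minimal sketch; the function name `b_index` and the sample proportions are chosen here for illustration and are not part of the data set:

```r
# The transformation used for the B column, as given in the column list above.
# `b_index` is a name chosen here for illustration.
b_index <- function(p) 1000 * (p - 0.63)^2

b_index(c(0, 0.26, 0.63, 1))
```

A proportion of 0 gives the largest possible value (396.9) and a proportion of 0.63 gives 0. The quadratic is not monotonic, though: proportions of 0.26 and 1 both give 136.9, so a `B` value at or below 136.9 does not identify a single proportion, and low `B` values need not mean a proportion below 63%.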

## Part 3
**Are any of the predictors associated with per capita crime rate? If so, explain the relationship.**

In [None]:
par(mfrow = c(2, 2))
plot(Boston_corrected$B, Boston_corrected$CRIM, xlab = "1000(Proportion of black residents - 0.63)^2", ylab = "Per capita crime rate")
plot(Boston_corrected$LSTAT, Boston_corrected$CRIM, xlab = "Percent of lower status residents", ylab = "Per capita crime rate")
plot(Boston_corrected$CMEDV, Boston_corrected$CRIM, xlab = "Median home value in $1000s", ylab = "Per capita crime rate")
plot(Boston_corrected$DIS, Boston_corrected$CRIM, xlab = "Weighted distance to Boston employment centers", ylab = "Per capita crime rate")

Based on the above four scatter plots, it appears that there are pretty clear relationships between crime rate and median home value, percent of lower status residents, and proximity to Boston employment centers. Tracts with lower home values tend to have higher crime rates, as do tracts which are closer to Boston employment centers. In addition, tracts with a higher proportion of lower status residents tend to have higher crime rates. I was also curious if there would be a relationship between crime rate and `B`, which serves as some kind of measurement for the proportion of Black residents. Based on the scatter plot between those two variables, there doesn't appear to be a clear relationship.

## Part 4
**Do any of the suburbs of Boston appear to have particularly high crime rates? Tax rates? Pupil-teacher ratios? Comment on the range of each predictor.**

In [None]:
par(mfrow = c(2, 2))
hist(Boston_corrected$CRIM, xlab = "Per capita crime rate", main = "Histogram of Boston crime rates")
hist(Boston_corrected$TAX, xlab = "Tax rate per 10000 USD", main = "Histogram of Boston tax rates")
hist(Boston_corrected$PTRATIO, xlab = "Pupil-teacher ratio", main = "Histogram of Boston pupil-teacher ratios")

In [None]:
# Select columns by name rather than by position (8, 17, 18)
summary(Boston_corrected[, c("CRIM", "TAX", "PTRATIO")])

Based on the histograms and the numerical summary, there do appear to be tracts within Boston which have particularly high crime rates, tax rates, or pupil-teacher ratios. The minimum crime rate is 0.00632, while the maximum is 88.97620, with a median of 0.25651. The minimum tax rate is \\$187 per \\$10000, while the maximum is \\$711, with a median of \\$330. The minimum pupil-teacher ratio is 12.60 pupils per teacher, while the maximum is 22, with a median of 19.05. Given the median value, the maximum pupil-teacher ratio in the data set isn't outrageously high, since about half of the tracts have a ratio of 19 or more.

## Part 5
**How many of the suburbs in this data set bound the Charles river?**

In [None]:
sum(Boston_corrected$CHAS)

In this data set, 35 tracts neighbor the Charles river.

## Part 6
**What is the median pupil-teacher ratio among towns in this data set?**

In [None]:
summary(Boston_corrected$PTRATIO)

The median pupil-teacher ratio among towns in this data set is 19.05 pupils per teacher.
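
The summary above takes the median over tracts, so a town with many tracts is counted many times. Since `PTRATIO` is described as constant within a town, one way to get a median over towns is to keep a single row per town first. A minimal sketch on made-up data; `toy_towns` and its values are invented, and the result for the real data set is not claimed here:

```r
# Toy data (invented): town "A" has three tracts, "B" and "C" one each.
toy_towns <- data.frame(
  TOWN    = c("A", "A", "A", "B", "C"),
  PTRATIO = c(20, 20, 20, 15, 18)
)

median(toy_towns$PTRATIO)                  # over tracts: town A dominates
one_per_town <- toy_towns[!duplicated(toy_towns$TOWN), ]
median(one_per_town$PTRATIO)               # over towns: each town counted once
```

In the toy data the two medians differ (20 over tracts, 18 over towns), which shows why the choice of unit can matter.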

## Part 7
**Which suburb of Boston has the lowest median value of owner-occupied homes? What are the values of the other predictors for that suburb, and how do those values compare to the overall ranges for those predictors? Comment on your findings.**

In [None]:
min(Boston_corrected$CMEDV)

In [None]:
# Select the rows at the minimum instead of hard-coding its value
Boston_corrected[Boston_corrected$CMEDV == min(Boston_corrected$CMEDV), ]

In [None]:
summary(Boston_corrected[, c(8:10, 12:20)])

Two of the tracts of South Boston have the lowest median value of owner-occupied homes, at \\$5000. Both of these tracts have very high crime rates compared to the overall range for that variable, with values 38.3518 and 67.9208 putting them far into the upper quartile and into the range of being outliers. These tracts have no land zoned for residential lots over 25000 sq. ft., though this is in line with at least half of the tracts in the overall set given the median for `ZN` is 0. The two tracts do have a relatively high proportion of non-retail business acres, with values of 18.1 being right at the third quartile. Similarly, the tracts also have concentrations of nitric oxides in the upper quartile of the overall set with a value of 0.693 parts per ten million. The average number of rooms per dwelling for these two tracts is at the low end, with values of 5.453 and 5.683 putting them in the bottom quartile. Next, these two tracts are among those with the highest proportion of owner-occupied homes built prior to 1940, with a value of 100. The tracts are also quite close to Boston employment centers, with `DIS` values of 1.4896 and 1.4254 putting them in the bottom quartile. The tracts also are very close to radial highways, with the maximum value of `RAD` at 24. Next, the tracts have above average property tax rates, with a value of \\$666 per \\$10000, putting them at the third quartile. The pupil-teacher ratio of 20.2 also puts these tracts at the third quartile. The tracts have relatively high values for `B`: one tract has the maximum value, while the other, with a value of 384.97, is in between the first and second quartiles. Lastly, the tracts have a high proportion of lower status residents (values of 30.59 and 22.98), putting them in the top quartile of the data.

In summary, these two tracts with the lowest median value of owner-occupied homes have predictors generally at the extreme ends of their respective ranges.
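
Statements such as "in the top quartile" can be made precise by computing where a value falls in the empirical distribution of its column. A minimal sketch using `ecdf()` on synthetic data; `values` and `percentile_of` are hypothetical names, and the numbers are not taken from the Boston data:

```r
# Synthetic skewed values standing in for a column such as CRIM.
set.seed(7)
values <- rexp(500, rate = 1 / 3)

# ecdf() returns a function giving the fraction of observations <= its argument.
percentile_of <- ecdf(values)

percentile_of(max(values))      # 1: the maximum is at the 100th percentile
percentile_of(median(values))   # about 0.5
```

Applied to a real column, `ecdf(column)(value)` would give the fraction of tracts at or below a given tract's value, which is easier to compare across predictors than reading quartiles off `summary()`.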

## Part 8 
**In this data set, how many of the suburbs average more than seven rooms per dwelling? More than eight rooms per dwelling? Comment on the suburbs that average more than eight rooms per dwelling.**

In [None]:
sum(Boston_corrected$RM > 7)

In [None]:
sum(Boston_corrected$RM > 8)

In this data set, there are 64 tracts which average more than seven rooms per dwelling, and 13 of those tracts average more than eight rooms per dwelling.

In [None]:
Boston_corrected[Boston_corrected$RM > 8, ]

In [None]:
summary(Boston_corrected[Boston_corrected$RM > 8, c(7:10, 12:20)])

From the numerical summary, one thing that stands out is that the tracts which average more than eight rooms per dwelling have low crime rates, low concentrations of nitric oxides, low proportions of Black residents (high values of `B`), and low proportions of lower status residents compared to the overall data set.