# Chapter 4: Random Variables and Empirical Distributions

It's not too much of a stretch to say that the entire formalism of probability theory that we have studied up to now merely serves as a foundation for random variables. And now that we have learned about random variables, our programming assignments will begin to align much more closely with what we do in class.

What have we learned about random variables? Among other things:

1. Random variables have their own probability distributions, along with mass and density functions, cumulative distribution functions, and quantiles.
2. Random variables have their own algebra, and they can be "transformed" by plugging them into functions.
3. Random variables have expected values, variances, and standard deviations.

You have learned about these things primarily as theoretical objects, which might lead you to believe that they only live in classrooms. However, as you will see in this programming assignment, these things also manifest themselves in *real-world* datasets. So, our goal is to begin connecting theory to practice, in anticipation of [Chapter 6](https://mml.johnmyersmath.com/stats-book/chapters/theory-to-practice.html) where we explore these ideas in even greater detail.

## Directions

1. The programming assignment is organized into sequences of short problems. You can see the structure of the programming assignment by opening the "Table of Contents" along the left side of the notebook (if you are using Google Colab or Jupyter Lab).

2. Each problem contains a blank cell containing the following comment: `# ENTER YOUR CODE IN THIS CELL`. Enter your code in these cells below the comment, being sure to not erase the comment. There are usually directions on the precise syntax that you will use to enter your solution properly. Please pay very careful attention to these directions.

3. Below most of the solution cells are "autograder" cells. Do not alter the autograder cells in any way.

4. Do not add any cells of your own to the notebook, or delete any existing cells (either code or markdown).

## Submission instructions

1. Once you have finished entering all your solutions, you will want to rerun all cells from scratch to ensure that everything works OK. To do this in Google Colab, click "Runtime -> Restart and run all" along the top of the notebook.

2. Now scroll back through your notebook and make sure that all code cells ran properly.

3. If everything looks OK, save your assignment and upload the `.ipynb` file at the provided link on the course <a href="https://github.com/jmyers7/stats-book-materials">GitHub repo</a>. Late submissions are not accepted.

4. You may submit multiple times, but I will only grade your last submission.

## Importing the dataset

The dataset that we will study in this assignment is the _Ames housing dataset_. This is a well-known dataset that contains information on the sale prices of 2,930 homes in Ames, Iowa. Besides sale prices, the dataset contains observations on 79 other variables, beginning with basic things like square footage, all the way to niche variables like the number of fireplaces. (The fame of this dataset partly derives from its popularization on a [machine learning competition website](https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques).)

We will work with a restricted version of the dataset containing observations on only six variables, including sale price. Let's import the dataset as usual as a Pandas DataFrame, and also import Matplotlib and NumPy for later use:

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.ticker import MultipleLocator
import numpy as np
url = 'https://raw.githubusercontent.com/jmyers7/stats-book-materials/main/data/data-3-1.csv'
df = pd.read_csv(url)

Let's get our eyes on the data by printing out the first five rows using the `head` method:

In [None]:
df.head()

Here are brief descriptions of the variables:

* The `quality` variable is a rating of the overall quality of the home, on a scale from 1 (= bad) to 10 (= good).

* The `year` variable is the year in which the home was built.

* The `area` variable is the square footage of the home.

* The `rooms` variable is the number of rooms in the home.

* The `fireplaces` variable is the number of fireplaces in the home.

* The `price` variable is the selling price of the home, in thousands of (US) dollars.

We can get an idea of the distribution of the `quality` variable, for example, by calling the `value_counts` method on the column:

In [None]:
df['quality'].value_counts()

This shows us a [Pandas Series](https://pandas.pydata.org/docs/reference/api/pandas.Series.html) (essentially a 1-dimensional array with some of the same functionality as a DataFrame) with indices for each "level" of the `quality` variable along with the count at each level. For example, there are 825 homes in the dataset with a `quality` of 5, while there are four homes with a `quality` of 1.

Because the `value_counts` method will be very important in this assignment, it's worth studying it in a bit more detail.

First, note that the method prints the counts in descending order by default; we can reverse the order by passing in the parameter `ascending=True`:

In [None]:
df['quality'].value_counts(ascending=True)

We can also sort the counts by putting the levels themselves in increasing order. We do this by "chaining" the `value_counts` method to the `sort_index` method through _two_ uses of the dot operator `.`:

In [None]:
df['quality'].value_counts().sort_index()

## Enter probability: empirical distributions

Let's take the `quality` variable as a running example. We imagine that it corresponds to a random variable

$$
Q:S\to \mathbb{R},
$$

where $S$ is the sample space of an appropriate probability space. Though we do not have enough contextual information to determine it exactly, we might imagine that $S$ consists of _all_ homes in Ames that have been sold recently, or that it consists of _all_ homes in the US. Whatever $S$ happens to be, we imagine feeding a specific home from $S$ into $Q$, and then $Q$ returns the quality:

$$
Q(\text{home}) = \text{quality}.
$$

We then conceptualize the `quality` column in our dataset as consisting of observations

$$
q_1, q_2, \ldots, q_{2{,}930} \in \mathbb{R} \tag{$\ast$}
$$

of the random variable $Q$. Notice the distinction between upper- and lower-case letters!

Now, like any random variable, $Q$ has a probability distribution $P_Q$ that lives on $\mathbb{R}$. But the dataset $(\ast)$ _also_ has a probability distribution, called the _empirical distribution_ of the dataset. This latter distribution is discrete with mass function $p(q)$ given by

$$
p(q) = \frac{\text{number of $q_k$'s in the dataset $(\ast)$ that equal $q$}}{2{,}930},
$$

for all $q\in \mathbb{R}$. For example, the probability of observing a level of $Q=5$ in the dataset is given by

$$
P(Q=5) = p(5) = \frac{825}{2{,}930} \approx 0.282,
$$

where I retrieved the value $825$ from our computations above with the `value_counts` method. Pretty simple, right? (We will see this definition again [later](https://mml.johnmyersmath.com/stats-book/chapters/theory-to-practice.html#emp-dist-defn).)

But what if we wanted to obtain _all_ the values of the mass function $p(q)$ in a single line of code? Conveniently, this is possible by passing the parameter `normalize=True` into the `value_counts` method:


In [None]:
df['quality'].value_counts(normalize=True)

Now, instead of seeing the raw counts at each level, we see the "normalized" counts, meaning each count has been divided by the size of the dataset. But these are exactly the values of the mass function $p(q)$! For example, from the printout we see

$$
p(4) \approx 0.077 \quad \text{and} \quad p(2) \approx 0.004.
$$
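
As a sanity check, the definition of the empirical mass function can be applied by hand and compared with the output of `value_counts(normalize=True)`. The following is only a minimal sketch: the list `toy` is a made-up dataset (it is not part of the Ames data), and the names `pmf_by_hand` and `pmf_pandas` are chosen here purely for illustration.

```python
from collections import Counter

import pandas as pd

# Hypothetical toy dataset of quality ratings (not taken from the Ames data).
toy = [5, 6, 5, 7, 5, 6, 4, 5]

# Apply the definition directly:
# p(q) = (number of observations equal to q) / (size of the dataset).
counts = Counter(toy)
pmf_by_hand = {level: count / len(toy) for level, count in sorted(counts.items())}
print(pmf_by_hand)

# The same mass function via Pandas, with the levels in ascending order.
pmf_pandas = pd.Series(toy).value_counts(normalize=True).sort_index()
print(pmf_pandas)
```

Both printouts should list the same four levels with the same probabilities, and in each case the probabilities sum to $1$.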

Numerical printouts are nice---but what about a visual description of the empirical distribution? How might we produce a probability histogram, for example?

The Pandas library includes methods for producing plots _directly_ from DataFrames and Series, using Matplotlib (which we know well from the first programming assignment) as the backend. For example, to produce a probability histogram, we chain the `plot` method in Pandas onto the `value_counts` method, passing in the parameter `kind='bar'`:

In [None]:
df['quality'].value_counts(normalize=True).plot(kind='bar')
plt.show()

Looking at the figure, two problems are evident: (1) the levels along the horizontal axis are in the wrong order, and (2) the level labels are rotated at an awkward 90-degree angle by default. We can fix the first problem by chaining with the `sort_index` method as I showed you above, and we can fix the second problem by passing the parameter `rot=0` into the `plot` method along with `kind='bar'`:

In [None]:
df['quality'].value_counts(normalize=True).sort_index().plot(kind='bar', rot=0)
plt.show()

That's better! We can see from the histogram that the majority of the probability mass is concentrated in the middle of the range, with very few homes having qualities at the extremes. Over half of the homes in the dataset have a quality of either $5$ or $6$.

## The big picture

In this situation, the specific identity of the (theoretical) random variable

$$
Q: S \to \mathbb{R}
$$

and its probability distribution $P_Q$ will often be chosen by the analyst based on many factors. This is part of the _modeling process_ that we will talk about later.

With the theoretical model in place, one conceptualizes the empirical distribution of the dataset

$$
q_1,q_2,\ldots,q_{2{,}930} \in \mathbb{R}
$$

as an _estimate_ of the (theoretical) model distribution $P_Q$. Then, all the (theoretical) quantities that might be computed from $P_Q$---like expected values, variances, and quantiles---have their own _empirical_ counterparts that are conceptualized as _estimates_ of the theoretical quantities. (We will meet some of these below.)

So, there are really two things that we need to learn:

1. How to construct (theoretical) probabilistic models.

2. How to estimate various properties and features of the models based on datasets and empirical measures.

This programming assignment focuses on the second item. We will practice the first one later.
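
To make the idea of "empirical distribution as an estimate of the model distribution" concrete, here is a small simulation. It is only a sketch under an assumed model: a fair six-sided die, chosen here because its "true" mass function is known exactly. The model, the seed, and the sample size are illustrative choices that have nothing to do with the Ames dataset.

```python
import numpy as np
import pandas as pd

# Assumed model, chosen purely for illustration: a fair six-sided die.
rng = np.random.default_rng(seed=0)
true_pmf = pd.Series(1 / 6, index=range(1, 7))

# Draw observations from the model and compute the empirical mass function.
rolls = pd.Series(rng.integers(low=1, high=7, size=10_000))
empirical_pmf = rolls.value_counts(normalize=True).sort_index()

# Side-by-side comparison of the model and its empirical estimate.
comparison = pd.DataFrame({'model': true_pmf, 'empirical': empirical_pmf})
print(comparison)

# The theoretical expected value of a fair die is 3.5.
print('model mean:', 3.5, ' empirical mean:', rolls.mean())
```

The empirical probabilities should land close to $1/6$, and the empirical mean close to $3.5$, without matching either exactly. With real data we only ever see the "empirical" column; the "model" column is what the analyst has to choose.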

### Problem 1 --- The empirical distribution of _rooms_

For grading purposes, note that the return values of the `value_counts` and `sort_index` methods may be saved into a variable (since they are simply Pandas Series). For example, inspect (and run) the following code:

In [None]:
probs = df['quality'].value_counts(normalize=True).sort_index()
probs

In the first line, we save the output of `sort_index` into the variable `probs`, and in the second line we print it out to make sure it looks correct.

Now, in the next code block, compute the mass function of the empirical distribution of the `rooms` variable in the DataFrame `df` by using the `value_counts` method. Save your result into the variable `probs` as I just showed you, and make sure to put the levels in ascending order.

In [None]:
# ENTER YOUR CODE IN THIS CELL



In [None]:
# Autograder cell. Do NOT alter or delete this cell.


Using what I taught you above, in the next code block produce a probability histogram of the empirical distribution of `rooms`. Make sure the levels along the horizontal axis are in ascending order and that you correct for the default 90-degree orientation! (_Hint_: Assuming you did the previous problem correctly, you can call the _plot_ method directly on _probs_...)

In [None]:
# ENTER YOUR CODE IN THIS CELL



## "Binned" histograms

Let's take a look at the empirical distribution of the `year` variable:

In [None]:
df['year'].value_counts(normalize=True).sort_index().plot(kind='bar', figsize=(10,5))
plt.gca().xaxis.set_major_locator(MultipleLocator(10))
plt.show()

Note the second line of code, which calls Matplotlib's `gca` function to "get the current axes." I chained `xaxis` to grab the horizontal axis, and then chained the `set_major_locator` method with `MultipleLocator(10)` passed in as a parameter to set the tick labels every ten-ish years.

What if we tried to produce a probability histogram of the `price` variable? Here's what we would get:

In [None]:
df['price'].value_counts(normalize=True).sort_index().plot(kind='bar')
plt.show()

Wow. That just looks..._terrible_.

The reason this plot looks so gross is that the `price` variable has a large number of closely packed levels. In fact, there are 1,032 of them with the most frequent level (i.e., the _mode_) repeated only 34 times:

In [None]:
df['price'].value_counts()

Also, there are 629 prices that appear only _once_ in the dataset:

In [None]:
(df['price'].value_counts() == 1).sum()

Though the (theoretical) `price` random variable is technically discrete (why?), it is much closer to being continuous than the `quality` or `rooms` random variables. The rule of thumb is to avoid using (plain) probability histograms for large datasets that are nearly continuous.

Instead, we may use a "binned" version of a histogram to visualize the empirical distribution of prices. The rough idea (see the pictures [here](https://mml.johnmyersmath.com/stats-book/chapters/theory-to-practice.html#histograms)) is that we imagine plotting the `price` dataset along the horizontal axis which is chopped up into bins, and we then draw rectangles on top of the bins whose heights are proportional to the number of datapoints that fall in each bin. We get something like this:

In [None]:
df['price'].plot(kind='hist', ec='black', density=True)
plt.gca().set_ylabel('')
plt.show()

In this code, we directly call the `plot` method on the column `df['price']`, passing in the parameter `kind='hist'` to produce a binned histogram along with `ec='black'` to change the edge colors of the rectangles to black. The `density=True` parameter normalizes the heights of the rectangles so that their areas sum to 1. (Why is the second line of code there? Try removing it to see why.)

This histogram does a much better job conveying the shape of the `price` distribution. We see that there is a large number of homes with sale prices in the $\$100$k-$\$200$k range, and that the distribution is right-skewed (i.e., it has a long tail stretching to the right).
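
The normalization performed by `density=True` can be checked numerically. The sketch below uses NumPy's `histogram` function, which accepts the same `bins` and `density` options, on synthetic right-skewed data. The data are simulated stand-ins for prices (an assumption made for illustration, not the Ames data), and ten bins are used to mirror the default number of bins.

```python
import numpy as np

# Synthetic right-skewed data standing in for prices (an assumption, not the Ames data).
rng = np.random.default_rng(seed=0)
data = rng.lognormal(mean=5.0, sigma=0.4, size=1_000)

# With density=True, np.histogram returns normalized bar heights instead of raw counts.
heights, edges = np.histogram(data, bins=10, density=True)
widths = np.diff(edges)

# Each rectangle has area height * width, and the areas add up to 1.
total_area = (heights * widths).sum()
print(total_area)
```

Note that it is the _areas_ that sum to $1$, not the heights. This is why the numbers on the vertical axis of a density histogram are not themselves probabilities.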

Though Pandas and Matplotlib will automatically select the number of bins for you (along with their widths), it is possible to control the number manually through the `bins` parameter of the `plot` method:

In [None]:
df['price'].plot(kind='hist', ec='black', density=True, bins=20)
plt.gca().set_ylabel('')
plt.show()

### Problem 2 --- The empirical distribution of _area_

In the next code block, produce a "binned" histogram of the `area` variable with 20 bins:


In [None]:
# ENTER YOUR CODE IN THIS CELL



## Transformations

For various reasons (that we'll talk about later), we often want to transform skewed data to remove tails. For data that are greater than $1$ and right-skewed, we may accomplish this through a _log transform_.

Theoretically, this is just an application of the logarithm function to a random variable. For example, if our `price` column consists of observations of a "price" random variable

$$
P: S \to \mathbb{R},
$$

then a _log transform_ is simply the random variable $\log{P}$. (The "$P$" here stands for "price," not "probability"!) Take a moment to convince yourself that the logarithm function (base $e$, say) has exactly the right shape needed to remove right-skewed tails.
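
One way to see the effect numerically is to measure skewness before and after taking logarithms. The sketch below works with a plain NumPy array of simulated right-skewed values rather than a DataFrame column. The simulated data and the helper `sample_skewness` (the average cubed $z$-score) are illustrative assumptions introduced here, not something defined elsewhere in this assignment.

```python
import numpy as np

# Synthetic right-skewed sample (an assumption for illustration, not the Ames prices).
rng = np.random.default_rng(seed=0)
x = rng.lognormal(mean=5.0, sigma=0.5, size=5_000)


def sample_skewness(values):
    """Average cubed z-score: positive for a long right tail, near 0 for symmetric data."""
    z = (values - values.mean()) / values.std()
    return (z ** 3).mean()


print('skewness before:', sample_skewness(x))
print('skewness after: ', sample_skewness(np.log(x)))
```

The first number should be clearly positive, while the second should be close to $0$: the logarithm grows more and more slowly as its input increases, so it pulls in the large values in the right tail much more than it moves the small ones.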

### Problem 3 --- Log transforms

In the next code cell, add a new column called `log_price` to the DataFrame `df` by applying a log transform to the `price` column. You will have to figure out how to do this on your own---I suggest querying ChatGPT with the following prompt:

* _How do I apply a log transform to a column in a Pandas DataFrame?_

In [None]:
# ENTER YOUR CODE IN THIS CELL



In [None]:
# Autograder cell. Do NOT alter or delete this cell.


You will know that your log transform is correct if the empirical distribution of the new `log_price` variable looks like this:

<center>
<img src="https://github.com/jmyers7/stats-book-materials/blob/283aecb67b4abf55b4828e1feb98f96393597472/img/log-price.png?raw=true" width="600" align="center">
</center>

Notice that the skewness of the data has largely been removed.

In the next code block, re-create this (binned) histogram of the `log_price` variable. Use 20 bins.

In [None]:
# ENTER YOUR CODE IN THIS CELL



## Scatter plots

The `plot` method of a Pandas DataFrame can produce lots of different types of plots, in addition to the two types of histograms that we saw above.

### Problem 4 --- Plot of _price_ versus _area_

For example, suppose that we wanted to visualize the relationship between the `price` and `area` variables in our DataFrame. We may use the `plot` method to produce the following _scatter plot_:

<center>
<img src="https://github.com/jmyers7/stats-book-materials/blob/e69d81862b2fd5034f5fedfa9be47d40c3c44352/img/price-vs-area.png?raw=true" width="600" align="center">
</center>

The scatter plot shows what we might have expected: the `price` variable is positively (and roughly linearly) correlated with `area`, with larger areas corresponding to larger prices.

In the next code block, re-create this scatter plot. Again, I will leave you to figure out how to do this on your own. Instead of using ChatGPT, I encourage you to peek at the description of the `plot` method in the technical documentation [here](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.html). In particular, pay close attention to the `kind`, `x`, and `y` parameters! In order to change the opacity (i.e., transparency) of the points in the scatter plot, you'll need to pass in `alpha=0.2` as a parameter to the `plot` method.

In [None]:
# ENTER YOUR CODE IN THIS CELL



For those of you who take this Python stuff seriously---and **especially** those of you who will put Python experience on your resume!---being familiar with and knowing how to read technical documentation (colloquially known as "_the docs_") is a very important skill.

## Means and variances

Let's return to our `quality` variable, consisting of the dataset

$$
q_1,q_2,\ldots,q_{2{,}930} \in \mathbb{R}.
$$

The _empirical mean value_, denoted $\bar{q}$, is given by the formula

$$
\bar{q} = \frac{1}{2{,}930} \sum_{i=1}^{2{,}930} q_i.
$$

This is, of course, nothing but the usual average value that you're surely familiar with.

Recall that we conceptualize the empirical distribution of the dataset as an _approximation_ to the "true" distribution $P_Q$ of a random variable

$$
Q: S \to \mathbb{R}
$$

defined on some appropriate sample space $S$. Like every random variable, $Q$ has an expected value $\mu_Q = E(Q)$. But if the empirical distribution is an _approximation_ to $P_Q$, then it is natural to also view the empirical mean $\bar{q}$ as an _approximation_ to the "true" mean $\mu_Q$:

$$
\bar{q} \approx \mu_Q.
$$

The mean value $\bar{q}$ is often referred to as the _sample mean_, but this is technically incorrect, since the latter is actually a _function_. (This is a problem of confusing an _estimate_ with an _estimator_. But don't worry about this now---we will discuss it all later when we talk about [statistics and estimators](https://mml.johnmyersmath.com/stats-book/chapters/stats-estimators.html).)

Pandas makes it easy to compute empirical means. For example, here's how you obtain the empirical mean of the `quality` variable:

In [None]:
df['quality'].mean()

The (theoretical) variance $\sigma_Q^2 = V(Q)$ and standard deviation $\sigma_Q$ also have their empirical approximations: The _empirical variance_ of the dataset, denoted $s^2$, is defined via the formula

$$
s^2 = \frac{1}{2{,}930-1} \sum_{i=1}^{2{,}930}(q_i - \bar{q})^2,
$$

while the _empirical standard deviation_, denoted $s$, is given by

$$
s = \sqrt{s^2}.
$$

Notice the curious value of $2{,}930-1$ in the denominator of the empirical variance $s^2$; this is _one less_ than the size of the dataset. Perhaps you might have expected the _actual_ size $2{,}930$ of the dataset to appear in the denominator, but it turns out that the slightly smaller denominator creates an "unbiased estimator" for the "true" variance $\sigma^2_Q$. (This, too, will be discussed later.)
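
The effect of the denominator is easy to see on a small example. The following sketch computes the variance from scratch with both denominators on a made-up eight-element dataset (a hypothetical example, not the Ames data) and compares the results with NumPy's `var` function, which exposes the choice through its `ddof` parameter.

```python
import numpy as np

# Hypothetical toy dataset (not taken from the Ames data).
x = np.array([1.0, 2.0, 2.0, 3.0, 4.0, 4.0, 4.0, 5.0])
n = len(x)
x_bar = x.mean()

# Sum of squared deviations from the empirical mean.
ss = ((x - x_bar) ** 2).sum()

s2 = ss / (n - 1)    # empirical variance as defined above (denominator n - 1)
s2_alt = ss / n      # the alternative with denominator n

print(s2, s2_alt)

# NumPy: ddof=1 gives the denominator n - 1, while ddof=0 gives the denominator n.
print(np.var(x, ddof=1), np.var(x, ddof=0))
```

A word of caution: the defaults differ between libraries. The Pandas `var` and `std` methods use the denominator $n-1$ by default, while NumPy's `np.var` and `np.std` default to the denominator $n$ (that is, `ddof=0`).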

Here's how you obtain empirical variances and standard deviations. First, the variance:

In [None]:
df['quality'].var()

And the standard deviation:

In [None]:
df['quality'].std()

### Problem 5 --- Computing means the easy way

Using the `mean` method on the `fireplaces` variable in the DataFrame, compute the mean number of fireplaces in the next code block. Save your answer into the variable `mean`.

In [None]:
# ENTER YOUR CODE IN THIS CELL



In [None]:
# Autograder cell. Do NOT alter or delete this cell.


### Problem 6 --- Computing means from scratch

Remember, we imagine that the data in the `fireplaces` column constitutes $2{,}930$ observed values

$$
f_1,f_2,\ldots,f_{2{,}930} \in \mathbb{R}
$$

of a (theoretical) random variable

$$
F: S \to \mathbb{R}
$$

defined on some appropriate sample space $S$. ($F$ and $f$ for "fireplace".) For this problem, let's pretend that the sample space $S$ consists _exactly_ of those homes in the DataFrame. In this case, the "true" probability measure $P_F$ is given _exactly_ by the empirical distribution of the dataset, and the "true" expected value $E(F)$ is _exactly_ equal to the empirical mean that you computed in the previous problem.

This means that the empirical mean $\bar{f}$ may be computed by the formula used to compute $E(F)$ that we learned in lecture,

$$
\bar{f} = \sum_{f\in \mathbb{R}} f p_{F}(f),
$$

where $p_F(f)$ is the mass function of the empirical distribution. In this problem, you are going to compute the empirical mean $\bar{f}$ (again) using this latter formula, rather than the Pandas method that you used in the previous problem.
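
To see why the two formulas for $\bar{f}$ agree, group the terms of the usual average by level. Writing $n = 2{,}930$ for the size of the dataset and $n_f$ for the number of observations equal to $f$ (notation introduced only for this remark), we have

$$
\bar{f} = \frac{1}{n} \sum_{i=1}^{n} f_i = \frac{1}{n} \sum_{f\in \mathbb{R}} f \, n_f = \sum_{f\in \mathbb{R}} f \, \frac{n_f}{n} = \sum_{f\in \mathbb{R}} f \, p_F(f),
$$

since $p_F(f) = n_f / n$ by the definition of the empirical mass function. For instance, for the (made-up) four-element dataset $(1, 2, 2, 3)$, the usual average is $(1 + 2 + 2 + 3)/4 = 2$, while the second formula gives $1\cdot \tfrac{1}{4} + 2 \cdot \tfrac{2}{4} + 3 \cdot \tfrac{1}{4} = 2$.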

Start off by computing the mass function $p_F(f)$ for the empirical distribution as I showed you above. Save your answer into the variable `probs`.

In [None]:
# ENTER YOUR CODE IN THIS CELL



In [None]:
# Autograder cell. Do NOT alter or delete this cell.


If your code is correct, then `probs` should be a Pandas Series. From `probs`, in the next code block extract the levels for the `fireplaces` variable by pulling out the indices using the dot operator `.` and the `index` attribute. Save your answer into the variable `levels`. (_Hint_: If you need help, check out the examples at the [docs](https://pandas.pydata.org/docs/reference/api/pandas.Series.index.html) for `index`. In your code block, you can include a line with just `levels` to print your answer and check that it is correct. You should see a Pandas Index object: in older versions of Pandas the printout begins with the prefix `Int64Index`, while in Pandas 2.0 and later it begins with `Index` and ends with `dtype='int64'`.)

In [None]:
# ENTER YOUR CODE IN THIS CELL



In [None]:
# Autograder cell. Do NOT alter or delete this cell.


You now have a Pandas Series `probs` representing the values of the mass function $p_F(f)$ in the formula

$$
\bar{f} = \sum_{f\in \mathbb{R}} f p_F(f).
$$

The mass function has support at the numbers in the Pandas Index object `levels`. To compute the summands in the formula for $\bar{f}$, we need to compute the element-wise product of `levels` and `probs`. But you can do this easily using the standard Python multiplication operator `*`. Do this in the next code block. Save your answer into the variable `product`.

In [None]:
# ENTER YOUR CODE IN THIS CELL



In [None]:
# Autograder cell. Do NOT alter or delete this cell.


If your code is correct, `product` should be a Pandas Series containing the summands in the formula above for $\bar{f}$. To finish the computation, we need to sum together all the numbers in this Series. Do this in the next code block using the `sum` method on `product` (see [here](https://pandas.pydata.org/docs/reference/api/pandas.Series.sum.html)), saving your answer into the variable `mean` (overwriting the value from the previous problem). Make sure to print out `mean` to check that it matches the value in the previous problem!

In [None]:
# ENTER YOUR CODE IN THIS CELL



In [None]:
# Autograder cell. Do NOT alter or delete this cell.


## Quantiles

We may also compute _empirical quantiles_. However, their definition is somewhat different compared to the definition that we learned in class.

To explain, let's consider a simple example of a dataset with eight values:

$$
(x_1,x_2,\ldots,x_8) = (1, 2, 2, 3, 4, 4, 4, 5).
$$

Notice that the dataset is written in non-decreasing order as the subscripts increase:

$$
x_1 \leq x_2 \leq \cdots \leq x_8.
$$

Then, each datapoint $x_k$ is assigned to the _empirical $q$-th quantile_ where

$$
q \stackrel{\text{def}}{=} \frac{k-1}{7}. \tag{$\ast$}
$$

This fraction is exactly the proportion of datapoints (excluding $x_k$) that fall to the _left_ of $x_k$ in the listing

$$
(x_1,x_2,\ldots,x_k, \ldots, x_8).
$$

(Notice that the "_excluding $x_k$_" part accounts for the $7$ in the denominator, rather than $8$.) In particular, $x_1$ is the $0$-quantile, while $x_8$ is the $1$-quantile.

As we iterate through the values $k=1,2,\ldots,8$, the values of $q$ given in $(\ast)$ will cycle through the following (Python) list of numbers:

In [None]:
q = [(k - 1) / 7 for k in range(1, 9)]
q

We see, for example, that $x_2$ is (approximately) the $0.143$-quantile, while $x_7$ is (approximately) the $0.857$-quantile.

This process sets up a function that maps these eight $q$-values to the eight $x$-values in the dataset:

$$
q = \frac{k-1}{7} \mapsto x_k.
$$

Notice that this is a "partial" function, however; it is _not_ defined for _all_ $q$-values between $0$ and $1$, rather it is only defined for those eight $q$-values in the list above.

But what happens if you want to compute, say, the empirical $0.5$-quantile (i.e., the _empirical median_)? Notice that $q=0.5$ is not in the list above.

To explain this part, it will be helpful to graph this "partial" function. To do so, let's toss the new dataset into a DataFrame along with the list `q` from above:

In [None]:
new_data = [1, 2, 2, 3, 4, 4, 4, 5]
df_new = pd.DataFrame({'x': new_data, 'q': q})
df_new

Now, using our new DataFrame, let's plot the "partial" function:

In [None]:
df_new.plot(kind='scatter', x='q', y='x')
plt.show()

The goal is to "complete" this "partial" function so that it is defined for all $q$-values. The (default) implementation in Pandas linearly interpolates between these eight points:

In [None]:
grid = np.linspace(0, 1)
df_new.plot(kind='scatter', x='q', y='x')
plt.plot(grid, df_new['x'].quantile(grid))
plt.show()

From this new graph, we see that the empirical $0.5$-quantile is approximately $3.5$. We may confirm that it is _exactly_ $3.5$ by calling the `quantile` method on the `x` column in the DataFrame with `q=0.5` passed in as a parameter:

In [None]:
df_new['x'].quantile(q=0.5)

Pandas allows other methods for interpolation besides linear interpolation. If you want to see them, check out the `interpolation` parameter of the `quantile` method at [the docs](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.quantile.html).
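
The linear interpolation can also be reproduced by hand, which is a useful check on the description above. The sketch below uses NumPy's `interp` function to interpolate between the points $(q_k, x_k)$ for the toy eight-element dataset, and then compares the result with the `quantile` method. The variable names and the particular $q$-values in `targets` are arbitrary choices made for illustration.

```python
import numpy as np
import pandas as pd

# The toy dataset from above, in non-decreasing order.
x = np.array([1, 2, 2, 3, 4, 4, 4, 5])
n = len(x)

# The q-values (k - 1) / (n - 1) attached to the sorted datapoints x_1, ..., x_n.
q_points = np.arange(n) / (n - 1)

# Linear interpolation between the points (q_k, x_k), evaluated at a few q-values.
targets = [0.25, 0.5, 0.9]
by_hand = np.interp(targets, q_points, x)

# The same quantiles from Pandas (default interpolation='linear').
from_pandas = pd.Series(x).quantile(targets).to_numpy()

print(by_hand)
print(from_pandas)
```

The two printouts should agree. In particular, the middle entry of each is the empirical median $3.5$ found above.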

### Problem 7 --- Plotting quantiles

By mining the code blocks above for hints, produce a plot in the next code block of the quantile function for the `price` variable in the Ames dataset. Here's what you're aiming for:

<center>
<img src="https://github.com/jmyers7/stats-book-materials/blob/8244796611b3f7a4ed558c5538874ddea64bc06b/img/price-quantile.png?raw=true" width="600" align="center">
</center>

I am looking for an _exact_ recreation of this plot. In particular, notice the labels on the horizontal and vertical axes. (_Hint_: You can do this in four lines of code. Do _not_ attempt to plot the "partial" function.)

In [None]:
# ENTER YOUR CODE IN THIS CELL



From the graph, it appears that the empirical median of the `price` variable is $\approx \$150$k. Notice also that the right-skewness of the empirical distribution of `price` manifests itself as the long, steep portion of the graph toward the right edge of the plot. Why do you think this quantile plot appears to be a smooth curve, rather than piecewise linear like the quantile plot of the toy eight-element dataset we explored above?

To finish off this programming assignment, in the next code block compute the _exact_ empirical median to check that it is near $\$150$k. Save your answer into the variable `median`.

In [None]:
# ENTER YOUR CODE IN THIS CELL



In [None]:
# Autograder cell. Do NOT alter or delete this cell.
