Visualizing Numerical Distributions
Many of the variables that data scientists study are quantitative or numerical. Their values are numbers on which you can perform arithmetic. Examples that we have seen include the number of periods in chapters of a book, the amount of money made by movies, and the age of people in the United States.

The file top_movies.csv lists 200 movies. Here are the top ten according to the unadjusted gross receipts in the column Gross.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.ticker as tick 

In [2]:
movies = pd.read_csv('top_movies.csv')
movies.head(10)


## Visualizing the Distribution of the Adjusted Receipts 
In this section we will draw graphs of the distribution of the numerical variable in the column Gross (Adjusted). For simplicity, let's create a smaller table that has the information that we need. And since three-digit numbers are easier to work with than nine-digit numbers, let's measure the Adjusted Gross receipts in millions of dollars. Note how round is used to retain only two decimal places.

In [None]:
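# Keep only the two columns needed, and give the receipts column a shorter name
millions = movies[['Title', 'Gross (Adjusted)']]
p = millions.rename(columns={'Gross (Adjusted)': 'Adjusted Gross'})
# Convert dollars to millions of dollars, rounded to two decimal places
p['Adjusted Gross'] = np.round(p['Adjusted Gross'] / 1e6, 2)
p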

## A Histogram
A histogram of a numerical dataset looks very much like a bar chart, though it has some important differences that we will examine in this section. First, let's just draw a histogram of the adjusted receipts.

The hist method of a Matplotlib axes generates a histogram of the values in a column. The labels on the two axes are set with set_xlabel and set_ylabel. The histogram shows the distribution of the adjusted gross amounts, in millions of 2016 dollars.

In [None]:
q=p['Adjusted Gross']
fig, ax = plt.subplots()
ax.hist(q, density=True, color='slategrey')  # density=True puts the heights on the density scale used by the axis label
ax.set_xlabel('Adjusted Gross (Million Dollars)')
ax.set_ylabel('Percent per Million Dollars')
ax.yaxis.set_major_formatter(tick.FuncFormatter(lambda x,_: f'{(x * 100):.1f}'))
plt.show()

## The Horizontal Axis
The amounts have been grouped into contiguous intervals called bins. Although in this dataset no movie grossed an amount that is exactly on the edge between two bins, hist does have to account for situations where there might have been values at the edges. So hist has an endpoint convention: bins include the data at their left endpoint, but not the data at their right endpoint. The one exception is the last bin, which also includes the data at its right endpoint.
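To see the endpoint convention in action, here is a minimal sketch using np.histogram, the NumPy function that Matplotlib's hist calls to do the binning. The three values are made up for illustration, and each sits exactly on a bin edge. Note one detail of NumPy's rule: every bin is closed on the left and open on the right, except the last bin, which also includes its right endpoint.

```python
import numpy as np

# Hypothetical values that sit exactly on the bin edges
values = np.array([300, 400, 500])
edges = np.array([300, 400, 500])   # two bins: [300, 400) and [400, 500]

counts, returned_edges = np.histogram(values, bins=edges)

# 300 falls in the first bin; 400 falls in the second bin (its left endpoint);
# 500 also falls in the second bin because the last bin is closed on the right
print(counts)
```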

The optional argument bins can be used with hist to specify the endpoints of the bins. It must consist of a sequence of numbers that starts with the left end of the first bin and ends with the right end of the last bin. We will start by setting the numbers in bins to be 300, 400, 500, and so on, ending with 2000.
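As a quick check of that sequence, the sketch below builds the endpoints with np.arange. The stop value 2001 is used because np.arange leaves out its stop value, so stopping at 2000 would drop the final endpoint.

```python
import numpy as np

edges = np.arange(300, 2001, 100)

print(edges[:3], edges[-1])   # first three endpoints and the last endpoint
print(len(edges))             # number of endpoints
print(len(edges) - 1)         # number of bins: one fewer than the endpoints
```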

In [None]:
q=p['Adjusted Gross']
fig, ax = plt.subplots()
n, bins, patches = ax.hist(q, bins=np.arange(300,2001,100), density=True, color='slategrey')
ax.set_xlabel('Adjusted Gross (Million Dollars)')
ax.set_ylabel('Percent per Million Dollars')
ax.yaxis.set_major_formatter(tick.FuncFormatter(lambda x,_: f'{(x * 100):.1f}'))
plt.show()

The horizontal axis of this figure is easier to read. The labels 200, 400, 600, and so on are centered at the corresponding values. The tallest bar is for movies that grossed between 300 million and 400 million dollars.

A very small number of movies grossed 800 million dollars or more. This results in the figure being "skewed to the right," or, less formally, having "a long right hand tail." Distributions of variables like income or rent in large populations also often have this kind of shape.



## The Counts in the Bins
The counts of values in the bins can be computed from the data and the bin endpoints in bins, the array returned by hist. Arranged in a table, the result is a tabular form of a histogram. The first column lists the left endpoints of the bins (but see the note about the final value, below). The second column contains the counts of all values in the Adjusted Gross column that are in the corresponding bin. That is, it counts all the Adjusted Gross values that are greater than or equal to the value in bin, but less than the next value in bin.

In [None]:
bins

In [None]:
len(bins)

In [None]:
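# Count the values in each bin directly; unlike rescaling the densities,
# this stays correct even when some values lie outside the bins
n = np.histogram(q, bins=bins)[0]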
len(n)

In [None]:
n.sum()

In [None]:
bin_counts = pd.DataFrame({
    'bin':bins,
    'Adjusted Gross count': np.append(n, 0)
})
bin_counts

Notice the bin value 2000 in the last row. That's not the left endpoint of any bar; it's the right endpoint of the last bar. No bin starts there, so the corresponding count is recorded as 0, and would have been recorded as 0 even if there had been movies that made more than 2,000 million dollars. When hist is called with a bins argument, the graph only considers values that are in the specified bins.
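The last point can be checked with a small sketch. The four amounts below are invented for illustration; two of them lie outside the range covered by the bins, and np.histogram (which hist uses for binning) simply leaves them out of the counts.

```python
import numpy as np

# Hypothetical amounts in millions of dollars; 250 and 2500 lie outside the bins
amounts = np.array([250, 350, 450, 2500])
edges = np.arange(300, 2001, 100)

counts, _ = np.histogram(amounts, bins=edges)

# Only 350 and 450 are counted, so the total is smaller than len(amounts)
print(counts.sum(), len(amounts))
```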

Once the bin endpoints are stored in a table, they can be reused to draw the histogram by passing the bin column as the bins argument of hist.

In [None]:
q = p['Adjusted Gross']
fig, ax = plt.subplots()
n, bins, patches = ax.hist(q, bins=bin_counts['bin'], density=True, color='slategrey')
ax.set_xlabel('Adjusted Gross (Million Dollars)')
ax.set_ylabel('Percent per Million Dollars')
ax.yaxis.set_major_formatter(tick.FuncFormatter(lambda x,_: f'{(x * 100):.1f}'))
plt.show()

## The Vertical Axis: Density Scale 

The horizontal axis of a histogram is straightforward to read, once we have taken care of details like the ends of the bins. The features of the vertical axis require a little more attention. We will go over them one by one.

Let's start by examining how to calculate the numbers on the vertical axis. If the calculation seems a little strange, have patience – the rest of the section will explain the reasoning.

Calculation. The height of each bar is the percent of elements that fall into the corresponding bin, relative to the width of the bin.

The calculations will become clear if we just examine the first row of the table.

Remember that there are 200 movies in the dataset. The [300, 400) bin contains 81 movies. That's 40.5% of all the movies:

$$\text{Percent} = \frac{81}{200} \cdot 100 = 40.5$$

The width of the [300, 400) bin is $400 - 300 = 100$. So

$$\text{Height} = \frac{40.5}{100} = 0.405$$
 
The code for calculating the heights uses the facts that there are 200 movies in all and that the width of each bin is 100.

In [None]:
Percent=bin_counts['Adjusted Gross count']/200*100
Percent

In [None]:
counts = bin_counts.rename(columns={
    'Adjusted Gross count': 'Count'
})
percents = counts.copy()
percents['Percent'] = (counts['Count']/200)*100
heights = percents.copy()
heights['Height'] = percents['Percent'] / 100   # divide by the bin width, which is 100
heights
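The same arithmetic can be wrapped in a small helper. This is only a sketch: the function name bar_height is made up for illustration and is not part of pandas or Matplotlib. It reproduces the numbers worked out above for the [300, 400) bin.

```python
import math

def bar_height(count, total, width):
    """Height in percent per unit: the percent of entries in the bin, divided by the bin width."""
    percent = count / total * 100
    return percent / width

# The [300, 400) bin: 81 of the 200 movies, width 100 million dollars
height = bar_height(81, 200, 100)
print(round(height, 3))
```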

## Unequal Bins 
An advantage of the histogram over a bar chart is that a histogram can contain bins of unequal width. Below, the values in the Adjusted Gross column are binned into three uneven categories.

In [None]:
uneven = np.array((300, 400, 600, 1500))
data = p['Adjusted Gross']
fig, ax = plt.subplots()
n, bins, patches = ax.hist(data, bins=uneven, density=True, color='slategrey')
ax.set_xlabel('Adjusted Gross (Million Dollars)')
ax.set_ylabel('Percent per Million Dollars')
ax.yaxis.set_major_formatter(tick.FuncFormatter(lambda x,_: f'{(x * 100):.1f}'))
plt.show()

In [None]:
uneven

In [None]:
data

In [None]:
len(uneven)

In [None]:
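# Count the values in each bin directly; rescaling the densities by len(data)
# would overcount if any values lie outside the bins
n = np.histogram(data, bins=uneven)[0]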
n

In [None]:
bin_counts = pd.DataFrame({
    'bin':uneven,
    'Adjusted Gross count': np.append(n, 0)
})
bin_counts

## The Problem with Simply Plotting Counts 
It is possible to display counts directly in a chart, using the density=False option of the hist method (this is also the default). The resulting chart has the same shape as a histogram when the bins all have equal widths, though the numbers on the vertical axis are different.

In [None]:
q = p['Adjusted Gross']
fig, ax = plt.subplots()
n, bins, patches = ax.hist(q, bins=np.arange(300,2001,100), density=False, color='slategrey')
ax.set_xlabel('Adjusted Gross (Million Dollars)')
ax.set_ylabel('Count')
plt.show()

While the count scale is perhaps more natural to interpret than the density scale, the chart becomes highly misleading when bins have different widths. Below, it appears (due to the count scale) that high-grossing movies are quite common, when in fact we have seen that they are relatively rare.

In [None]:
q = p['Adjusted Gross']
fig, ax = plt.subplots()
n, bins, patches = ax.hist(q, bins=uneven, color='slategrey')
ax.set_xlabel('Adjusted Gross (Million Dollars)')
ax.set_ylabel('Count')
plt.show()

Even though the method used is called hist, the figure above is NOT A HISTOGRAM. It misleadingly exaggerates the proportion of movies grossing at least 600 million dollars. The height of each bar is simply plotted at the number of movies in the bin, without accounting for the difference in the widths of the bins.
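A small made-up example makes the distortion concrete. The ten amounts below are hypothetical, not taken from the movies table. The wide [600, 1500) bin holds as many values as the [400, 600) bin, so a count scale draws those two bars at the same height even though the values in the wide bin are spread far more thinly.

```python
import numpy as np

# Hypothetical amounts in millions of dollars
amounts = np.array([310, 320, 330, 340, 350, 360, 450, 550, 700, 1400])
uneven_edges = np.array([300, 400, 600, 1500])

counts, _ = np.histogram(amounts, bins=uneven_edges)
heights, _ = np.histogram(amounts, bins=uneven_edges, density=True)

print(counts)          # bar heights on the count scale
print(heights * 100)   # bar heights on the density scale, in percent per million dollars
```

On the density scale the last bar is much shorter than the middle one, which matches how sparsely the values are spread over that bin.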

The picture becomes even more absurd if the last two bins are combined.

In [None]:
very_uneven = np.array((300, 400, 1500))
data =p['Adjusted Gross']
fig, ax = plt.subplots()
n, bins, patches = ax.hist(data, bins=very_uneven, color='slategrey')
ax.set_xlabel('Adjusted Gross (Million Dollars)')
ax.set_ylabel('Count')
plt.show()

## The Histogram: General Principles and Calculation 
The figure above shows that what the eye perceives as "big" is area, not just height. This observation becomes particularly important when the bins have different widths.

That is why a histogram has two defining properties:

1. The bins are drawn to scale and are contiguous (though some might be empty), because the values on the horizontal axis are numerical.
2. The area of each bar is proportional to the number of entries in the bin.
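The second property can be checked numerically. In the sketch below the data are randomly generated stand-ins (not the movie receipts), and np.histogram with density=True supplies the heights, as it does for hist. Multiplying each height by its bin width gives the area of the bar, which should equal the proportion of values in the bin.

```python
import numpy as np

rng = np.random.default_rng(0)
values = rng.uniform(300, 1500, size=200)   # made-up data inside the bin range
edges = np.array([300, 400, 600, 1500])

counts, _ = np.histogram(values, bins=edges)
heights, _ = np.histogram(values, bins=edges, density=True)

areas = heights * np.diff(edges)            # area = height x width
proportions = counts / counts.sum()

print(np.allclose(areas, proportions))      # each area matches the proportion in its bin
print(np.isclose(areas.sum(), 1.0))         # the total area is 1, that is, 100%
```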

## Flat Tops and the Level of Detail
Even though the density scale correctly represents percents using area, some detail is lost by grouping values into bins.

Take another look at the [300, 400) bin in the figure below. The flat top of the bar, at the level 0.405% per million dollars, hides the fact that the movies are somewhat unevenly distributed across that bin.

In [None]:
uneven = np.array((300, 400, 600, 1500))
data =p['Adjusted Gross']
fig, ax = plt.subplots()
n, bins, patches = ax.hist(data, bins=uneven, density=True, color='slategrey')
ax.set_xlabel('Adjusted Gross (Million Dollars)')
ax.set_ylabel('Percent per Million Dollars')
ax.yaxis.set_major_formatter(tick.FuncFormatter(lambda x,_: f'{(x * 100):.1f}'))
plt.show()

To see this, let us split the [300, 400) bin into 10 narrower bins, each of width 10 million dollars.

In [None]:
uneven = np.array((300, 310, 320, 330, 340, 350, 360, 370, 380, 390, 400, 600,1500))
data = p['Adjusted Gross']
fig, ax = plt.subplots()
n, bins, patches = ax.hist(data, bins=uneven, density=True, color='slategrey')
ax.set_xlabel('Adjusted Gross (Million Dollars)')
ax.set_ylabel('Percent per Million Dollars')
ax.yaxis.set_major_formatter(tick.FuncFormatter(lambda x,_: f'{(x * 100):.1f}'))
plt.show()
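Typing out the endpoints by hand is error-prone. One possible alternative, shown here only as a sketch, builds the same sequence by joining a regular run of narrow bins from np.arange with the remaining endpoints. The name fine_edges is introduced just for this illustration.

```python
import numpy as np

# Ten bins of width 10 from 300 up to (but not including) 400, then the wider bins
fine_edges = np.concatenate([np.arange(300, 400, 10), [400, 600, 1500]])

print(fine_edges)
print(len(fine_edges) - 1)   # number of bins
```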

## Histograms Q&A 
Let's draw the histogram again, this time with four bins, and check our understanding of the concepts.

In [None]:
uneven = np.array((300, 350, 400, 450, 1500))
data = p['Adjusted Gross']
fig, ax = plt.subplots()
n, bins, patches = ax.hist(data, bins=uneven, density=True, color='slategrey')
ax.set_xlabel('Adjusted Gross (Million Dollars)')
ax.set_ylabel('Percent per Million Dollars')
ax.yaxis.set_major_formatter(tick.FuncFormatter(lambda x,_: f'{(x * 100):.1f}'))
plt.show()

In [None]:
bins

In [None]:
len(bins)

In [None]:
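# Count the values in each bin directly; rescaling the densities by len(data)
# would overcount if any values lie outside the bins
n = np.histogram(data, bins=bins)[0]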
n

In [None]:
len(n)

In [None]:
n.sum()

In [None]:
bin_counts = pd.DataFrame({
    'bin':bins,
    'Adjusted Gross count': np.append(n, 0)
})
bin_counts

## Differences Between Bar Charts and Histograms
- Bar charts display one quantity per category. They are often used to display the distributions of categorical variables. Histograms display the distributions of quantitative variables.
- All the bars in a bar chart have the same width, and there is an equal amount of space between consecutive bars. The bars of a histogram can have different widths, and they are contiguous.
- The lengths (or heights, if the bars are drawn vertically) of the bars in a bar chart are proportional to the value for each category. The heights of bars in a histogram measure densities; the areas of bars in a histogram are proportional to the numbers of entries in the bins.