### **Introduction to data visualization: Distributions**


If we take the minimum and maximum values of our data set, we have the range that contains all our data. But within that range the values can be distributed in many different ways. Sometimes they sit very close to the minimum, sometimes very close to the maximum; sometimes almost all of them are crowded around the median and only a few take extreme values; sometimes they even form two "heaps" around which most of the data is concentrated. There are many possibilities.

Individual values cannot give us a general idea of our data set, which is why we usually use techniques that take the whole data set into account at once. Today we are going to learn how data visualization can give us a much more precise idea of how the data as a whole is organized.


**Boxplots**

Boxplots (or box diagrams) visualize our data so that the organization of the percentiles becomes evident.

Boxplots help us discern whether our data is skewed (asymmetric, leaning toward one end of the range), whether it is spread out or clustered, and whether there are outliers with extreme values.


In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd

In [None]:
df = pd.read_csv('../../Datasets/melbourne_housing-clean.csv', index_col=0)


In [None]:
df.head()


In [None]:
sns.set(style="whitegrid")
sns.boxplot(x=df['price'])

What does this all mean?

- The box is delimited by two values: the 25th percentile and the 75th percentile.
- The vertical line inside the box marks the 50th percentile (that is, the median).
- The "whiskers" attempt to cover the rest of the data to the left and right of the box, BUT they extend no further than 1.5 * Interquartile Range from the edges of the box. As you will remember, the interquartile range is the difference between the 75th percentile and the 25th percentile. Multiplying that range by 1.5 gives the maximum length of each whisker.
- The individual points outside the whiskers are the samples that lie beyond the maximum whisker length. This is not an "absolute rule", but these values are generally considered the outliers of our set.

As you can see, this graph gives us a lot of very useful information.

We now know that most of our data is concentrated in values less than 2,000,000 and that very high prices are anomalies in our set.
We know that, within the full range of the data, we have a distribution that tends towards the smallest values.
We also know that our data is generally very concentrated (that is, not very dispersed), but that there is a "tail" of data to the right that extends quite far.
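
The whisker rule described above can be checked by hand. The following is a minimal sketch on a small made-up sample (the `values` array is hypothetical, not the housing data), assuming the usual 1.5 * IQR convention that the explanation describes:

```python
import numpy as np

# Hypothetical sample: nine ordinary values and one extreme value.
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 30], dtype=float)

# Edges of the box (25th and 75th percentiles) and the median.
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Limits that the whiskers are not allowed to cross.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Each whisker ends at the most extreme sample that is still inside the limits.
inside = values[(values >= lower_fence) & (values <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

# Anything beyond the limits would be drawn as an individual point.
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(f"box: [{q1}, {q3}], median: {median}, IQR: {iqr}")
print(f"whiskers: [{whisker_low}, {whisker_high}]")
print(f"outliers: {outliers}")
```

Note that the whiskers stop at actual data points (1 and 9 here), not at the computed limits themselves.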

In [None]:
sns.set(style="whitegrid")
sns.boxplot(x=df['price'])
plt.axvline(df['price'].mean(), c='y')

As you can see, despite such extreme outliers, we have so many values in the lower range of our data that the average is fairly close to the median.
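
To see why this comparison is worth making, here is a small sketch with invented prices (the numbers are hypothetical) showing how a single extreme value pulls the mean much more than the median. When the number of ordinary values is large, as in our data set, that pull becomes weaker:

```python
import numpy as np

# Hypothetical prices: five ordinary values, then the same values plus one extreme price.
typical = np.array([300_000, 400_000, 500_000, 600_000, 700_000])
with_outlier = np.append(typical, 9_000_000)

# Without the outlier, mean and median coincide.
print(np.mean(typical), np.median(typical))

# With the outlier, the mean jumps while the median barely moves.
print(np.mean(with_outlier), np.median(with_outlier))
```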


**Frequency tables**


Percentiles divide our data into segments of different widths that each contain the same number of samples. Frequency tables do the opposite: they divide our data into segments of equal width that contain different numbers of samples.

This can give us another perspective on our data that is also very useful. We are going to learn how to generate a frequency table using pandas.


In [None]:
df.head()


In [None]:
prices = df['price']
prices.max() - prices.min()

In [None]:
pd.cut(prices, 20)


In [None]:
segments = pd.cut(prices, 20)

df['price'].groupby(segments).count()


And that's it! We have a table where the indices are the 20 bins our data set was divided into, and the values in the table are the counts for each bin. In this way, outliers are even more evident, since we can see several segments where the number of samples is very low.
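
The contrast between equal-width segments and percentile-based segments can be sketched with `pd.cut` and `pd.qcut`. The `values` series below is a made-up example with one extreme value, not the housing data:

```python
import pandas as pd

# Hypothetical data: eleven small values and one extreme value.
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 100])

# Equal-width bins (frequency table): same width, different counts.
equal_width = pd.cut(values, 4).value_counts().sort_index()

# Quantile-based bins (quartiles): different widths, same counts.
equal_count = pd.qcut(values, 4).value_counts().sort_index()

print(equal_width)
print(equal_count)
```

With equal-width bins the extreme value ends up alone in the last bin and the middle bins stay empty, which is what makes outliers stand out in a frequency table.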


### **Histograms**


Histograms are a way to visualize our frequency tables.

The x-axis covers the range of our data and is divided into segments (like the ones we generated in the previous section).

The y-axis indicates the count of samples in each segment.
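
As a rough illustration of what goes on each axis, the sketch below uses `np.histogram` on a small made-up array to compute the segment edges (x-axis) and the counts (y-axis), then prints a text version of the bars. The `data` array and the choice of 4 bins are assumptions for the example:

```python
import numpy as np

# Hypothetical data.
data = np.array([1, 1, 2, 3, 3, 3, 4, 5, 5, 9])

# counts -> height of each bar (y-axis); edges -> limits of each segment (x-axis).
counts, edges = np.histogram(data, bins=4)

# Print one text "bar" per segment. The last segment also includes its right edge.
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:.0f} to {right:.0f}: {'#' * count}")
```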


In [None]:
import numpy as np

normal = np.random.normal(loc=0, scale=5, size=10000)


In [None]:
sns.distplot(normal, kde=False, norm_hist=False);


In [None]:
tail = np.array([2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 7, 7, 7, 7, 7, 8, 8, 8])

In [None]:
sns.distplot(tail, kde=False, norm_hist=False, bins=7);


In [None]:
positive_skewness = np.random.exponential(scale=1.0, size=10000)


In [None]:
sns.distplot(positive_skewness, kde=False, norm_hist=False);


In [None]:
normal_left = np.random.normal(loc=-2.5, scale=1, size=10000)
normal_right = np.random.normal(loc=2.5, scale=1, size=10000)
bimodal = np.concatenate([normal_left, normal_right])


In [None]:
sns.distplot(bimodal, kde=False, norm_hist=False);


### **Density plots**


We can obtain density plots using Seaborn's distplot function, just by changing the arguments we pass to it. With hist=False the histogram bars are omitted and only the density curve is drawn:


In [None]:
import numpy as np
import seaborn as sns

sns.set(style='whitegrid')

In [None]:
laplace = np.random.laplace(loc=0.0, scale=1, size=10000)
sns.distplot(laplace, hist=False);


In [None]:
chisquare = np.random.chisquare(4, size=10000)
sns.distplot(chisquare, hist=False)


In [None]:
sns.set(style='white')

normal_1 = np.random.normal(loc=-2, scale=3, size=10000)
normal_2 = np.random.normal(loc=4.5, scale=1, size=10000)
exponential = np.random.exponential(scale=1.0, size=10000) - 1

sns.distplot(normal_1, hist = False, kde_kws = {'shade': True})
sns.distplot(normal_2, hist = False, kde_kws = {'shade': True})
sns.distplot(exponential, hist = False, kde_kws = {'shade': True})

As you can see, density plots make the comparison much clearer. In this example we add the argument kde_kws = {'shade': True} to fill the area under each curve with color. In a later session we will learn how to style our charts in depth.
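
To give an idea of what a density curve represents, here is a minimal sketch of a Gaussian kernel density estimate written with NumPy. The function name `gaussian_kde_sketch` and the hand-picked bandwidth are assumptions for illustration; Seaborn relies on a more refined implementation that chooses the bandwidth for us:

```python
import numpy as np

def gaussian_kde_sketch(samples, grid, bandwidth):
    """Average one Gaussian bump per sample, evaluated at each grid point."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # Distance from every grid point to every sample, in bandwidth units.
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z ** 2) / np.sqrt(2 * np.pi)
    return kernels.mean(axis=1) / bandwidth

rng = np.random.default_rng(0)
samples = rng.normal(loc=0, scale=1, size=500)

grid = np.linspace(-6, 6, 601)
density = gaussian_kde_sketch(samples, grid, bandwidth=0.3)

# A density curve encloses a total area of (approximately) 1.
area = np.sum(density) * (grid[1] - grid[0])
print(round(area, 3))
```

Because the area under the curve is always about 1, curves from samples of different sizes remain directly comparable, which is part of what makes density plots convenient for overlaying distributions.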


### **Annotating our graphs**


A data scientist is a communicator, so it is very important that we produce graphs that are understandable and easy to interpret, so that the information we find can be conveyed to others.


In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style='white')

In [None]:
df = pd.read_csv('../../Datasets/athlete_olympic_events-clean.csv', index_col=0)

df.head()

Adding titles and names for our axes is as easy as calling the set method. Previously we had been calling our Seaborn methods without assigning the results to any variables. If we assign our result to the variable ax, we can now call that variable's set method to annotate our graph:


In [None]:
ax = sns.distplot(df['age'], kde=False, norm_hist=False)
ax.set(title='Age of the athletes', xlabel='Age', ylabel='Count');


In the event that we have two or more graphs (or categories) at the same time, we can add a legend to our graph to be able to differentiate our data:


In [None]:
by_athlete = df.groupby(level=0)[['age', 'height', 'weight']].mean()
sex = df.groupby(level=0)['sex'].last()
merged = by_athlete.merge(sex, left_index=True, right_index=True)
males = by_athlete[merged['sex'] == 'M']
females = by_athlete[merged['sex'] == 'F']

In [None]:
ax = sns.distplot(males['height'], hist=False, kde_kws = {'shade': True}, label='men')
sns.distplot(females['height'], hist=False, kde_kws = {'shade': True}, ax=ax, label='women')
ax.set_title('Height distributions of male and female athletes', fontsize=13, pad=15);
ax.set(xlabel='height');
ax.legend(loc='upper right');

Every plot you create with seaborn lives inside what matplotlib calls a figure, and a figure can contain multiple graphs. Seaborn generates a new figure automatically when one is needed, but we can also create the figure ourselves and use it to customize our graph. For example, to set the size of our graph we can do the following:


In [None]:
fig = plt.figure(figsize=(10, 10))
ax = sns.distplot(df['age'], kde=False, norm_hist=False)
ax.set_title('Ages of athletes who participated in the Olympic Games', fontsize=20, pad=15)
ax.set(xlabel='age', ylabel='count');
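
Since a figure can contain multiple graphs, the same annotation methods can be applied to each of them separately. The following sketch uses matplotlib directly with randomly generated samples (the data, titles and figure size are all made up for the example):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
normal_sample = rng.normal(loc=0, scale=5, size=1000)
skewed_sample = rng.exponential(scale=1.0, size=1000)

# One figure that holds two graphs (axes) side by side.
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(10, 4))

axes[0].hist(normal_sample, bins=20)
axes[0].set(title='Normal sample', xlabel='value', ylabel='count')

axes[1].hist(skewed_sample, bins=20)
axes[1].set(title='Exponential sample', xlabel='value', ylabel='count')

# A title for the whole figure, plus automatic spacing between the graphs.
fig.suptitle('Two graphs in one figure')
fig.tight_layout()
```

Seaborn plotting functions that accept an `ax` argument can draw into either of these axes in the same way the legend example above passes `ax=ax`.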

### **Bar charts**


In [None]:
import pandas as pd
import seaborn as sns
sns.set_style('white')

In [None]:
df = pd.read_json('../../Datasets/zomato_reviews-clean.json')

df.head()

In [None]:
df['user_rating'].unique()


In [None]:
df['user_rating'].nunique()


In [None]:
df['user_rating'].value_counts()


In [None]:
counts = df['user_rating'].value_counts()


In [None]:
ax = sns.barplot(x=counts.index, y=counts)
ax.set_title('Restaurant Rating Count')
ax.set(ylabel='count');

In [None]:
as_percentages = counts * 100 / counts.sum()


In [None]:
ax = sns.barplot(x=as_percentages.index, y=as_percentages)
ax.set_title('Count Restaurant Ratings (as percentages)')
ax.set(ylabel='percent of total');


In [None]:
ax = sns.barplot(x=as_percentages.index, y=as_percentages)
ax.set_title('Count Restaurant Ratings (as percentages)')
ax.set(ylabel='percent of total')
ax.set_xticklabels(ax.get_xticklabels(), rotation=50);


In [None]:
ax = sns.barplot(x=as_percentages, y=as_percentages.index, orient='h')
ax.set_title('Count Restaurant Ratings (as percentages)')
ax.set(xlabel='percent of total')


In [None]:
df['user_rating'].mode()
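
The steps in this section (counting, converting to percentages, finding the mode) can be tried on a tiny series. The `ratings` values below are invented labels, not the contents of the Zomato data set; `value_counts(normalize=True)` is shown as an alternative to dividing by the total ourselves:

```python
import pandas as pd

# Hypothetical ratings.
ratings = pd.Series(['Good', 'Good', 'Average', 'Excellent', 'Good', 'Average'])

# Absolute counts per category, most frequent first.
counts = ratings.value_counts()

# Same table as percentages; normalize=True returns fractions of the total.
percentages = ratings.value_counts(normalize=True) * 100

# The mode is the category with the highest count, i.e. the tallest bar.
most_common = ratings.mode().iloc[0]

print(counts)
print(percentages.round(1))
print(most_common)
```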


### **Violinplots**

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_style('white')

In [None]:
df = pd.read_csv('../../Datasets/athlete_olympic_events-clean.csv', index_col=0)

df.head()

In [None]:
by_athlete = df.groupby(level=0)[['age', 'height', 'weight']].mean()
sex = df.groupby(level=0)['sex'].last()
merged = by_athlete.merge(sex, left_index=True, right_index=True)

In [None]:
merged

In [None]:
sns.boxplot(x=df['weight'])


In [None]:
plt.figure(figsize=(5, 10))
sns.boxplot(data=merged, x='sex', y='weight');


In [None]:
plt.figure(figsize=(5, 10))
sns.violinplot(data=merged, x='sex', y='weight');
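
Both the grouped boxplot and the violin plot summarize one distribution per category. As a rough sketch of the numbers behind each box, the quartiles per group can be computed with pandas. The `toy` DataFrame is a hypothetical stand-in for `merged`:

```python
import pandas as pd

# Hypothetical stand-in for the merged athlete table.
toy = pd.DataFrame({
    'sex': ['M', 'M', 'M', 'F', 'F', 'F'],
    'weight': [70.0, 80.0, 90.0, 50.0, 60.0, 70.0],
})

# One row per category, one column per quartile (the edges and median of each box).
summary = toy.groupby('sex')['weight'].quantile([0.25, 0.5, 0.75]).unstack()
print(summary)
```

The violin plot draws a density curve for each category on top of this kind of summary, so it also shows where the values concentrate inside the box.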
