
# Introduction to Descriptive Statistics using Python
## Last updated: 8-October-2020

**Written by: Pramudita Satria Palar, Faculty of Mechanical and Aerospace Engineering, Bandung Institute of Technology**



# Introduction

This Python notebook is used as a teaching material in the Research Methodology course at the Faculty of Mechanical and Aerospace Engineering, Bandung Institute of Technology. I made this notebook as simple as possible and you can directly try the Python code so that you will get a better grip on the use of statistics for research. 

There are several ways to do statistics with Python. You can use the ```math``` or the ```statistics``` module from Python's standard library, ```numpy```, or ```pandas```. This notebook primarily uses ```numpy```, the statistics module from ```scipy```, and Python's ```statistics``` module, although I also rely on other packages to explain, for example, linear regression (I use ```sklearn``` for this). For plotting, we will use ```matplotlib```.

Some notes before you go:
*   I import ```numpy``` as ```np```, ```statistics``` as ```stat```, ```scipy.stats``` as ```stats```, ```matplotlib.pyplot``` as ```plt```
*  Please execute the "Initialization" cell that contains the code to import the necessary packages. After that, all cells can be executed independently.
*  Most of the implementations of the tools here are basic. If you want to know more about the capabilities of the tools, you can read the references (I give you the pointers for that).
*  The data is given in the form of a one-dimensional Numpy array (```np.array([])```), although you can do most of the things here with a Python list (```[]```).
*  I label most of the plots, so I added extra lines to make the plots look better. However, for a minimal implementation you can just execute the main plotting function (e.g., ```plt.scatter(x,y)```).




# Before you start..
This material assumes that you are familiar with basic programming. However, you can also treat the Python implementation here as ready-made "software": you only need to make a few changes (say, if you want to try a new data set).

First, let's talk about how we write the data. If your data set is $X = \{1,2,3,4,5,6,7\}$, there are at least two ways to write this data in Python, namely, as a:

*   Python list: ```x = [1,2,3,4,5,6,7] ```, or
*   Numpy array (one dimensional): ```x = np.array([1,2,3,4,5,6,7])```

Note that, to create a Python list, we use the brackets ```[]```, while for a Numpy array we use ```np.array([])``` (notice that we import ```numpy``` as ```np```). I will use Numpy arrays throughout this lecture. An $n$-dimensional array can also be created, although we will not need one in this lecture. For the sake of clarity, here is how you create an $n$-dimensional array (e.g., 2-dimensional):

*   ```x = np.array([[1,2,3],[4,5,6]])```

to represent the following matrix:

$\begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6\end{bmatrix}$

Notice that there are brackets inside brackets, and that is how we represent a matrix with Numpy.
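
As a quick check of the matrix above, the short sketch below builds the 2-dimensional array and inspects its shape and one of its elements. The variable name ```A``` is chosen here for illustration only.

```python
import numpy as np  # Import numpy

A = np.array([[1, 2, 3], [4, 5, 6]])  # A matrix with 2 rows and 3 columns

print(A.shape)   # (number of rows, number of columns)
print(A.ndim)    # Number of dimensions of the array
print(A[1, 2])   # Element in the second row and third column (indexing starts at 0)
```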

See the example below about how to create a list or a 1-dimensional array in Python:

**You can use (ctrl+enter) or press the run cell button (hover your mouse around the left side of the cell to find the button)** 




In [None]:
import numpy as np # Import numpy

x1 = [1,2,3,4,5,6,7] # Write x as a Python list
x2 = np.array([1,2,3,4,5,6,7]) # write x as a numpy array

print(type(x1)) # The type should be list
print(type(x2)) # The type should be Numpy array


<class 'list'>
<class 'numpy.ndarray'>


To create a list of strings, we also use a Python list. It is worth noting that a Python list can contain elements of multiple data types (e.g., numbers and strings together). For example,

*   ```list_of_items = ['Laptop','Computer','Smartphone']```
*   ```list_of_data = ['Laptop',50,22.44]```

Try that by running the code below:

In [None]:
list_of_items = ['Laptop','Computer','Smartphone'] # This contains all strings
list_of_data = ['Laptop',50,22.44] # This contains both numerical values and strings

To perform the majority of tasks in this notebook, we will frequently call **Python functions**. A Python function receives your input (mostly, your data) and then outputs various quantities of interest, such as the mean of the data. The general syntax of a Python function call looks like this:

```output = name_of_function(input1, input2,...) ```

For example, if we want to calculate the mean of the data, we will use the ```mean``` function from ```numpy```. To be exact, we will call it like this: ```xbar = np.mean(x)```, where ```x``` is your data and ```xbar``` is the mean of the data. You are, of course, free to assign any names to your data and the output; ```xbar``` and ```x``` here are for illustration only.

In [None]:
import numpy as np # Import numpy

x = np.array([1,2,3,4,5,6,7]) # write x as a numpy array

print(np.mean(x)) # Calculate and print the mean of the data


# Measures of central tendency
We will begin with **descriptive statistics**, in which our aim is to obtain several numbers / coefficients that describe our data. By obtaining these coefficients, you will get a first impression of the nature of your data (e.g., what is the central tendency of your data?).

Let's begin with the **measures of central tendency**. The central tendency, as the name suggests, is a measure that shows the "centerness" of your data. Let's say that you have $n$ data collected into a vector $X = \{x_{1},x_{2},\ldots,x_{n}\}$ and you want to calculate the central tendency of $X$. Several popular measures of central tendency that you can use include the following:


1.   Mean: denoted as $\bar{X}$, where $\bar{X}= (\sum_{i=1}^{n}x_{i})/n$
2.   Median: The data is sorted in ascending order and then split into a lower half and an upper half. The point where the data is split is called the median (for an even $n$, the median is the average of the two middle values).
3.   Mode: The most frequent value in your data set.

For example, suppose your data set is $X = \{4,3,2,1,5,7,6,7,7\}$ ($n=9$). Then,

*   Your mean is $\bar{X}=(4+3+2+1+5+7+6+7+7)/9 =4.6667$
*   Your median is 5. You got this by sorting your data first, i.e., $X_{sorted} = \{1,2,3,4,\boldsymbol{5},6,7,7,7\}$. See that 5 is in the center.
*   Your mode is 7, as you can see that it appears 3 times.
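
The three numbers above can also be reproduced by hand, without any statistics function. The following is only a minimal sketch using plain Python; the variable names are chosen for illustration, and the median shortcut assumes that $n$ is odd, as in this example.

```python
from collections import Counter  # Counts how often each value appears

X = [4, 3, 2, 1, 5, 7, 6, 7, 7]  # The example data set as a Python list
n = len(X)                        # Number of data points (n = 9)

mean_manual = sum(X) / n                        # Sum of the data divided by n
X_sorted = sorted(X)                            # Sort the data in ascending order
median_manual = X_sorted[n // 2]                # Middle element (valid because n is odd)
mode_manual = Counter(X).most_common(1)[0][0]   # Most frequent value

print(mean_manual, median_manual, mode_manual)
```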

Before we can calculate everything, we need to import ```numpy``` as ```np```, ```statistics``` as ```stat```, the ```stats``` module from ```scipy``` (notice that ```stat``` and ```stats``` are two different modules), and ```matplotlib.pyplot``` as ```plt``` for plotting purpose. We will also put our $X$ as a Numpy array first (and also all of our other data). To create a Numpy array, we will use ```np.array()``` to create a one-dimensional array for our $X$:

```X = np.array([4,3,2,1,5,7,6,7,7])```

**Make sure that you run the code in the cell below first**:

In [None]:
#@title Initialization
import numpy as np # Import numpy
import statistics as stat # Import statistics package
from scipy import stats # Import statistics module from scipy
import matplotlib.pyplot as plt # Import the plot module matplotlib

The Numpy and ```statistics``` functions that we can use to calculate these measures of central tendency are:
*   ```np.mean()``` for mean [(reference)](https://numpy.org/doc/stable/reference/generated/numpy.mean.html)
*   ```np.median()``` for median [(reference)](https://numpy.org/doc/stable/reference/generated/numpy.median.html)
*   ```stat.mode()``` for mode [(reference)](https://www.geeksforgeeks.org/python-statistics-mode-function/)

Run the code below to put that into practice:

In [None]:
#@title Mean, median, and mode
X = np.array([4,3,2,1,5,7,6,7,7]) # Create a data set in a form of one-dimensional array
 
meanX = np.mean(X) # Calculate mean of X and save it into variable meanX
medianX = np.median(X) # Calculate median of X and save it into variable medianX
modeX = stat.mode(X) # Calculate mode of X and save it into variable modeX
 
# Print mean, median, and mode of X
print("Mean of given data set is {:.4f}".format(meanX))
print("Median of given data set is {:.4f}".format(medianX))
print("Mode of given data set is {:.4f}".format(modeX))

# Minimum and maximum

You might also be interested in knowing the minimum and the maximum of your data. In that case, you can use ```np.min()``` and ```np.max()``` to compute the minimum and the maximum of your data set, respectively. Based on our previous $X$, the minimum and the maximum values are 1 and 7, respectively. See and run the code below to try it:


In [None]:
#@title Minimum and maximum

X = np.array([4,3,2,1,5,7,6,7,7]) # Create a data set in a form of one-dimensional array

minX = np.min(X) # Calculate minimum of X and save it into variable minX
maxX = np.max(X) # Calculate maximum of X and save it into variable maxX

# Print minimum and maximum of X
print("Minimum of given data set is {:.4f}".format(minX))
print("Maximum of given data set is {:.4f}".format(maxX))

Minimum of given data set is 1.0000
Maximum of given data set is 7.0000


# Measures of variability
Besides the measures of central tendency, you typically want to know the **dispersion of your data with respect to your central tendency**. Some measures that you will typically use are:


*   **Standard deviation** ($\sigma$), the most popular one, i.e., $\sigma(X)=\sqrt{\frac{1}{n} \sum_{i=1}^{n}(x_{i}-\bar{X})^{2}}$
*  **Range**, probably the simplest one, where $\text{Range}(X)=\text{max}(X)-\text{min}(X)$
*  **Interquartile range (IQR)**, the difference between the third and the first quartile (i.e., Q3 and Q1, respectively), $\text{IQR}(X)=Q3-Q1$. See the "Quantile, percentile, quartile" section for more explanation of the IQR.

We will use ```np.std()``` [(reference)](https://numpy.org/devdocs/reference/generated/numpy.std.html) and ```np.ptp()``` [(reference)](https://numpy.org/doc/stable/reference/generated/numpy.ptp.html) to calculate standard deviation and range, respectively. As for IQR, the easiest way is to use ```stats.iqr()``` (notice that this is from ```scipy```) [(reference)](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.iqr.html). Another way to compute the IQR is to calculate the Q1 and Q3 first from ```np.percentile()``` or ```np.quantile()``` (see "Quantile, percentile, quartile")

Recall your previous $X$ again, and you will obtain $\sigma(X)=2.1602$, $\text{Range}(X)=7-1=6$, and $\text{IQR}(X)=4$. Try the code below:


In [None]:
#@title Standard deviation, range, and IQR
X = np.array([4,3,2,1,5,7,6,7,7]) # Create a data set in a form of one-dimensional array

stdX = np.std(X) # Calculate std of X and save it into variable stdX
rangeX = np.ptp(X) # Calculate range of X and save it into variable rangeX
iqrX = stats.iqr(X) # Calculate IQR of X and save it into variable IQRX

# Print std, range, and IQR of X
print("Standard deviation of given data set is {:.4f}".format(stdX))
print("Range of given data set is {:.4f}".format(rangeX))
print("IQR of given data set is {:.4f}".format(iqrX))
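
To connect the function calls with the definitions, the sketch below writes out the standard deviation as the square root of the mean squared deviation from the mean and compares it with ```np.std()```. Note that ```np.std()``` divides by $n$ by default (```ddof=0```); passing ```ddof=1``` divides by $n-1$ instead, which gives the sample standard deviation. The variable names are for illustration only.

```python
import numpy as np  # Import numpy

X = np.array([4, 3, 2, 1, 5, 7, 6, 7, 7])  # The same data set as before
n = X.size                                  # Number of data points

std_manual = np.sqrt(np.sum((X - np.mean(X))**2) / n)  # Standard deviation written out by hand
std_population = np.std(X)                              # Default: divide by n (ddof=0)
std_sample = np.std(X, ddof=1)                          # Divide by n-1 instead
range_manual = np.max(X) - np.min(X)                    # Range written out by hand

print("By hand: {:.4f}, np.std: {:.4f}".format(std_manual, std_population))
print("Sample standard deviation (ddof=1): {:.4f}".format(std_sample))
print("Range by hand: {}".format(range_manual))
```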

# Generating data set from a given probability distribution
For the following explanations, we will frequently use data sets generated by Python functions. The concept of probability is vital here since we generate data according to a given **probability density function (PDF)**, let's call it $f(x)$, where $x$ is a random variable. Each specific PDF has **parameters** that determine the shape of the distribution. In this notebook, we will use the **uniform** and the **normal** (Gaussian) distributions, defined respectively as

*   $f(x)=\frac{1}{b-a}$ for $x \in [a,b]$ and $f({x})=0$ otherwise

and
*   $f(x)=\frac{1}{\sigma_{n} \sqrt{2\pi}} e^{-\frac{1}{2}\big(\frac{x-\mu_{n}}{\sigma_{n}}\big)^{2}}$


We use the notation $\mathcal{U}(a,b)$ and $\mathcal{N}(\mu_{n},\sigma_{n})$ to describe the uniform and normal distribution, respectively. As you can see from the PDFs, the uniform distribution has two parameters, i.e., $a$ and $b$, which correspond to the lower and the upper bound of the distribution. The normal distribution also has two parameters, i.e., $\mu_{n}$ and $\sigma_{n}$, which correspond to the mean and the standard deviation of the normal distribution, respectively. A normal distribution with $\mathcal{N}(0,1)$ is called the **standard normal distribution**.
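
As a small illustration of the uniform distribution $\mathcal{U}(a,b)$, the sketch below draws samples with ```np.random.uniform()``` and evaluates the PDF with ```stats.uniform.pdf()```, where ```loc``` is the lower bound $a$ and ```scale``` is the width $b-a$. The bounds $a=2$ and $b=5$ are arbitrary values chosen for this example.

```python
import numpy as np       # Import numpy
from scipy import stats  # Import statistics module from scipy

a, b = 2.0, 5.0  # Lower and upper bound (example values)

samples = np.random.uniform(a, b, size=10000)              # 10000 samples between a and b
pdf_inside = stats.uniform.pdf(3.0, loc=a, scale=b - a)    # PDF inside [a,b], equals 1/(b-a)
pdf_outside = stats.uniform.pdf(6.0, loc=a, scale=b - a)   # PDF outside [a,b], equals 0

print("Smallest and largest sample: {:.4f}, {:.4f}".format(samples.min(), samples.max()))
print("PDF inside the interval: {:.4f}".format(pdf_inside))
print("PDF outside the interval: {:.4f}".format(pdf_outside))
```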

Run the code below to visualize various normal distributions with $\mathcal{N}(0,1)$, $\mathcal{N}(2,3)$, and $\mathcal{N}(4,1.5)$. As you can see from the plot, changing $\mu_{n}$ will move the center of the distribution while changing $\sigma_{n}$ will affect the spread of the distribution. That is, you will get flatter curve if you set a higher value of $\sigma_{n}$.


In [None]:
#@title Plotting various normal distributions
x = np.linspace(-12,12, 1000) # Create 1000 points between -12 and 12 for plotting purpose

plt.figure(1)
plt.plot(x, stats.norm.pdf(x,0,1)) # Normal distribution with mean = 0 and std = 1
plt.plot(x, stats.norm.pdf(x,2,3)) # Normal distribution with mean = 2 and std = 3
plt.plot(x, stats.norm.pdf(x,4,1.5)) # Normal distribution with mean = 4 and std = 1.5
plt.legend(['N(0,1)','N(2,3)','N(4,1.5)']) # Plot the legend
plt.ylabel('f(x)') # label of the y-axis
plt.xlabel('x') # label of the x-axis
plt.grid(True) # activate grid
plt.show()

To generate data from a normal distribution, you can use ```np.random.randn()```, which takes at least one argument, i.e., the number of samples to generate. So if you try ```x = np.random.randn(1000)```, you will generate 1000 random samples from a standard normal distribution (and name the data ```x```). Notice that ```np.random.randn``` always samples from the standard normal distribution. To change $\mu_{n}$ and $\sigma_{n}$, say if you want $\mathcal{N}(4,1.5)$, you can use the following trick: ```x = 1.5*np.random.randn(1000)+4```. Let's do just that and plot the histograms (notice that you are plotting the histogram, not the PDF!):


In [None]:
#@title Creating histograms from samples generated by various normal distributions
x_1 = np.random.randn(1000) # Generate 1000 samples from standard normal distribution
x_2 = 1.5*np.random.randn(1000)+4 # Generate 1000 samples from normal distribution with mean = 4 and std = 1.5
 
nbins = 20 # Set number of bins to 20
 
plt.figure(1, facecolor= 'white') # Create figure
counts1, bins1, bar1 = plt.hist(x_1, bins = nbins) # Plot histogram for variable x_1
counts2, bins2, bar2 = plt.hist(x_2, bins = nbins) # Plot histogram for variable x_2
plt.show()
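
The scale-and-shift trick can be checked numerically. The sketch below compares it with ```np.random.normal()```, which accepts the mean (```loc```) and the standard deviation (```scale```) directly. The fixed seed and the sample size of 100000 are choices made here only so that the sample mean and standard deviation land close to 4 and 1.5.

```python
import numpy as np  # Import numpy

np.random.seed(0)   # Fix the seed so that the result is reproducible
n = 100000          # Number of samples (large, so the estimates are accurate)

x_trick = 1.5 * np.random.randn(n) + 4                  # Scale and shift standard normal samples
x_direct = np.random.normal(loc=4, scale=1.5, size=n)   # Sample from N(4,1.5) directly

print("Trick:  mean = {:.4f}, std = {:.4f}".format(np.mean(x_trick), np.std(x_trick)))
print("Direct: mean = {:.4f}, std = {:.4f}".format(np.mean(x_direct), np.std(x_direct)))
```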

Notice that we use the matplotlib package to plot the histogram. To be exact, we use ```plt.hist()``` to depict the histogram. ```plt.hist()```[(reference)](https://matplotlib.org/3.3.1/api/_as_gen/matplotlib.pyplot.hist.html) takes at least one argument as the input, i.e., your data set. You can also change the number of bins; here I set the number of bins to 20.

In the limit (that is, if you have a large number of samples), the normalized histogram closely approximates the true PDF. **Execute the following cell after you execute the previous cell**:

In [None]:
#@title Comparison of the normalized histogram and the true PDF
bins1center = (bins1[1:nbins+1] + bins1[0:nbins])/2 # Center of the bins (first data set)
df = bins1[1]-bins1[0] # Length of bins (first data set)
counts1norm = counts1 / np.sum(df*counts1) # Normalize the histogram (first data set)
 
bins2center = (bins2[1:nbins+1] + bins2[0:nbins])/2 # Center of the bins (second data set)
df = bins2[1]-bins2[0] # Length of bins (second data set)
counts2norm = counts2 / np.sum(df*counts2) # Normalize the histogram (second data set)
 
x = np.linspace(-5,12, 1000) # x for plotting
plt.figure(1, facecolor = 'white')
plt.scatter(bins1center,counts1norm) # Plot the normalized histogram (first data set)
plt.scatter(bins2center,counts2norm) # Plot the normalized histogram (second data set)
plt.plot(x, stats.norm.pdf(x,0,1)) # Plot the true PDF
plt.plot(x, stats.norm.pdf(x,4,1.5)) # Plot the true PDF
plt.show()
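
The manual normalization above divides the counts by the total area of the histogram so that the bars integrate to one. As a cross-check, the sketch below compares this with the ```density=True``` option of ```np.histogram()``` (```plt.hist()``` accepts the same option). The data here is a fresh sample generated only for this check.

```python
import numpy as np  # Import numpy

np.random.seed(0)             # Fix the seed so that the result is reproducible
x = np.random.randn(5000)     # Samples from standard normal distribution
nbins = 20                    # Number of bins

counts, edges = np.histogram(x, bins=nbins)          # Raw counts and bin edges
width = edges[1] - edges[0]                          # Length of the (equal-width) bins
manual = counts / np.sum(width * counts)             # Manual normalization, as in the cell above

density, _ = np.histogram(x, bins=nbins, density=True)  # Normalization done by numpy

print("Largest difference: {:.2e}".format(np.max(np.abs(manual - density))))
print("Area under the normalized histogram: {:.4f}".format(np.sum(density * width)))
```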

**Cumulative distribution function**

In many applications (e.g. hypothesis testing), we need to calculate the cumulative distribution function (CDF) $F_{X}(x)$, which is defined as

$F_{X}(x)=P(X\leq x).$

In layman's terms, $F_{X}(x)$ is the probability that $X$ will take a value less than or equal to $x$.

We can also use $F_{X}(x)$ to calculate the probability that $X$ will take a value within a given interval: 

$P(a < X \leq b) = F_{X}(b)-F_{X}(a).$

To compute the CDF for a normal distribution, we can use ```stats.norm.cdf(x,loc,scale)```, where ```loc``` and ```scale``` are the mean and the standard deviation of your normal distribution, respectively. For example, if you want to compute the CDF of a standard normal distribution $\mathcal{N}(0,1)$ at $x=-1.96$, type ```stats.norm.cdf(-1.96,loc=0,scale=1)```, which should return $F_{X}(-1.96) \approx 0.025$.

You might want to do the inverse: find the $x$ that yields a given $F_{X}(x)$. This is especially important in hypothesis testing, where you need the critical value that corresponds to a given significance level. The function that you will need is ```stats.norm.ppf(Fx,loc,scale)``` (the percent point function, i.e., the inverse of the CDF), where ```Fx``` is $F_{X}(x)$. For example, if you want to compute the $x$ that yields $F_{X}(x) = 0.025$ for a standard normal distribution $\mathcal{N}(0,1)$, type ```stats.norm.ppf(0.025,loc=0,scale=1)```, which should return $x \approx -1.96$. Try the cell below:

[See this link for more information about normal distribution with Python](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.norm.html)

In [None]:
#@title Cumulative distribution function (NEW)
Fx = stats.norm.cdf(-1.96,loc=0,scale=1)
print(Fx)

x = stats.norm.ppf(0.025,loc=0,scale=1)
print(x)
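
The interval formula above can be sketched in the same way. The example below computes the probability that a standard normal variable falls between $-1.96$ and $1.96$ as the difference of two CDF values, and also confirms that ```stats.norm.ppf()``` undoes ```stats.norm.cdf()```. The interval bounds are example values.

```python
from scipy import stats  # Import statistics module from scipy

a, b = -1.96, 1.96  # Interval bounds (example values)

# Probability of falling inside the interval: F(b) - F(a)
p_interval = stats.norm.cdf(b, loc=0, scale=1) - stats.norm.cdf(a, loc=0, scale=1)

# Applying the inverse CDF to F(a) should give back a
x_back = stats.norm.ppf(stats.norm.cdf(a, loc=0, scale=1), loc=0, scale=1)

print("Probability inside the interval: {:.4f}".format(p_interval))
print("x recovered from F(a): {:.4f}".format(x_back))
```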


# Measures of asymmetry
For a given data set, you also want to know the tendency of your data set in a sense like this: "is it leaning more toward the left side, or the right side?". Let's say that you are talking about your salary per week and then you collect the data for about two or three years. You obviously want your salary data to lean more toward the right side, that is, you frequently get high salary per week. To measure that, you need to measure the **asymmetry** in your data by calculating the **skewness** of your data set. The value of the skewness $s$ might indicate three different trends: 


*   If $s>0$ (positive skew), your data leans more toward the left side and the tail is heavier on the right side.
*   If $s<0$ (negative skew), your data leans more toward the right side and the tail is heavier on the left side.
*   If $s=0$, or close to zero, your data is symmetric.

The Scipy function that we will use to do this task is ```stats.skew```.
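
To see the sign convention on a tiny data set first, the sketch below uses seven hand-picked numbers with one large value on the right. It also writes out the moment-based formula $m_{3}/m_{2}^{3/2}$, which, to my understanding, is what ```stats.skew``` computes with its default settings; the formula and the data are additions for illustration and are not part of the lecture material.

```python
import numpy as np       # Import numpy
from scipy import stats  # Import statistics module from scipy

x_right_tail = np.array([1, 2, 2, 3, 3, 3, 10])  # Hand-picked data with a long right tail

d = x_right_tail - np.mean(x_right_tail)  # Deviations from the mean
m2 = np.mean(d**2)                        # Second central moment
m3 = np.mean(d**3)                        # Third central moment
skew_manual = m3 / m2**1.5                # Moment-based skewness written out by hand

skew_scipy = stats.skew(x_right_tail)       # Skewness from scipy (positive: tail on the right)
skew_mirrored = stats.skew(-x_right_tail)   # Mirroring the data flips the sign

print("By hand: {:.4f}, scipy: {:.4f}".format(skew_manual, skew_scipy))
print("Mirrored data: {:.4f}".format(skew_mirrored))
```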

Let's apply that in practice by generating skewed data sets (by using ```stats.skewnorm.rvs```) with various skewness and then calculate their skewness. See and run the code below (remember that the parameter that we use to generate the distribution is not equal to the skewness):


In [None]:
#@title Plotting skewed data and calculating skewness

s1 = -10 # Generate a first data set with negative skew
X_skew = stats.skewnorm.rvs(s1,size=10000) # Generate data with skew
skew = stats.skew(X_skew) # Calculate the skewness of the data
s2 = 10 # Generate a second data set with positive skew
X_skew2 = stats.skewnorm.rvs(s2,size=10000) # Generate data with skew
skew2 = stats.skew(X_skew2) # Calculate the skewness of the data
s3 = 0 # Generate a third data set with zero skew (the result will be close to zero)
X_skew3 = stats.skewnorm.rvs(s3,size=10000) # Generate data with skew
skew3 = stats.skew(X_skew3) # Calculate the skewness of the data

# plot figures
plt.figure(1, facecolor='white')
plt.hist(X_skew, bins = 100) # Plot the histogram of the data
plt.text(-3.5,300,"Skewness of given data set is {:.4f}".format(skew))
plt.xlim((-4, 1))
plt.ylim((0,400))
plt.ylabel('Frequency')
plt.xlabel('x')
plt.grid(True)
plt.show()

plt.figure(2, facecolor='white')
plt.hist(X_skew2, bins = 100) # Plot the histogram of the data
plt.text(1,300,"Skewness of given data set is {:.4f}".format(skew2))
plt.xlim((-0.5, 4))
plt.ylim((0,400))
plt.ylabel('Frequency')
plt.xlabel('x')
plt.grid(True)
plt.show()

plt.figure(3, facecolor='white')
plt.hist(X_skew3, bins = 100) # Plot the histogram of the data
plt.text(-2,340,"Skewness of given data set is {:.4f}".format(skew3))
plt.xlim((-4, 4))
plt.ylim((0,400))
plt.ylabel('Frequency')
plt.xlabel('x')
plt.grid(True)
plt.show()

# Intermezzo: Quantile, percentile, quartile
To grasp the concept of IQR, we need to know the concept of **quartile** first. Quartiles divide the (sorted) data into four equal parts (quarters). A quartile is itself a type of **quantile**, which is the value below which a given fraction of the data falls. For example, the 0.4 (or 40%) quantile is the value below which 40% of the data fall. There are various types of quantiles, some of which are:


*   **Percentiles**, the 100-quantiles.
*   **Deciles**, the 10-quantiles.
*   **Quartiles**, the 4-quantiles.

In terms of quartiles, we have Q1, Q2, and Q3, which correspond to the 0.25, 0.5, and 0.75 quantiles. Notice that the median itself is Q2 (the 0.5 quantile) because it is the point at which 50% of the data fall below and another 50% fall above that value. 

The Numpy function that we will use to calculate the percentile is ```np.percentile()``` that you can also use to calculate the quartiles. Let's try this by using a standard normal distribution, which has $Q1=-0.674$, $Q2=0$, and $Q3 = 0.674$. Notice that we will not get exactly these values if we generate a random data set from a standard normal distribution; however, the empirical values will be close to the true values. 

In [None]:
#@title Quartiles of samples from standard normal distribution

X_ex = np.random.randn(10000) # Create a new data set X_ex from standard normal distribution

# Calculate Q3, Q2, and Q1 from the percentiles (in NumPy 1.22 and later, the interpolation argument is named method)
q3, q2, q1= np.percentile(X_ex, [75, 50, 25],interpolation='midpoint')

print("The Q1, Q2, and Q3 of given data set are {:.4f}, {:.4f}, {:.4f}".format(q1, q2, q3))

# Plot the data and the quartiles, the black vertical lines are Q1, Q2, Q3 (from the left to the right)
plt.figure(1, facecolor = 'white')
plt.hist(X_ex, bins=100)
plt.ylabel('Frequency')
plt.xlabel('x')
plt.xlim((-4,4))
plt.ylim((0, 400))
plt.vlines(q1,0,400)
plt.vlines(q2,0,400)
plt.vlines(q3,0,400)
plt.grid(True)
plt.show()
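
The same functions give the IQR of the small data set $X$ used in the "Measures of variability" section. The sketch below computes the quartiles with both ```np.percentile()``` and ```np.quantile()``` (using their default settings) and checks that $Q3-Q1$ agrees with ```stats.iqr()```.

```python
import numpy as np       # Import numpy
from scipy import stats  # Import statistics module from scipy

X = np.array([4, 3, 2, 1, 5, 7, 6, 7, 7])  # The small data set used earlier

q1, q2, q3 = np.percentile(X, [25, 50, 75])     # Quartiles from percentiles (0 to 100)
q1_alt, q3_alt = np.quantile(X, [0.25, 0.75])   # Same quartiles from quantiles (0 to 1)
iqr_manual = q3 - q1                            # IQR written out by hand

print("Q1 = {}, Q2 = {}, Q3 = {}".format(q1, q2, q3))
print("IQR by hand = {}, stats.iqr = {}".format(iqr_manual, stats.iqr(X)))
```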

# Correlation and dependence
When analyzing two different data sets, you might be interested in investigating the relationship between them. You have questions such as: (1) If I increase $X$, will $Y$ decrease?, (2) Does changing $X$ really affect $Y$?, (3) If there is indeed a relationship, how strong is it?, etc.

One way to answer such questions is to use **Pearson correlation coefficient** which measures the linear relationship between two datasets. The Pearson correlation coefficient is calculated as

$\rho_{X,Y} = \text{corr}(X,Y)=\frac{\text{cov}(X,Y)}{\sigma_{X}\sigma_{Y}}=\frac{\mathbb{E}[(X-\mu_{X})(Y-\mu_{Y})]}{\sigma_{X}\sigma_{Y}}$

where $\mathbb{E}$ is the expectation and $\text{cov}$ is the covariance.

In practice, $\rho_{X,Y}$ is estimated from your sample as

$\rho_{X,Y} = \frac{\sum_{i=1}^{n}(x_{i}-\bar{x})(y_{i}-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_{i}-\bar{x})^{2}} \sqrt{\sum_{i=1}^{n}(y_{i}-\bar{y})^{2}}}$

The value of $\rho_{X,Y}$ varies between -1 and +1. In this regard, $\rho_{X,Y}=1$ indicates an exact positive linear relationship (i.e., $X$ increases if $Y$ increases), while $\rho_{X,Y}=-1$ indicates an exact negative linear relationship (i.e., $X$ increases if $Y$ decreases). If $\rho_{X,Y}=0$, or very close to zero, this implies no linear correlation.
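
The sample formula above can be written out directly with Numpy. The sketch below does so and compares the result with ```stats.pearsonr()``` and ```np.corrcoef()```. The fixed noise values are made up for this example so that the result is reproducible.

```python
import numpy as np       # Import numpy
from scipy import stats  # Import statistics module from scipy

X = np.array([1, 3, 5, 8, 9, 11, 12])
noise = np.array([0.5, -1.0, 0.3, 1.2, -0.7, 0.1, -0.4])  # Made-up noise values (for illustration)
Y = X + 2 + noise                                          # Y = (X+2) + noise

dx = X - np.mean(X)  # Deviations of X from its mean
dy = Y - np.mean(Y)  # Deviations of Y from its mean
r_manual = np.sum(dx * dy) / (np.sqrt(np.sum(dx**2)) * np.sqrt(np.sum(dy**2)))

r_scipy = stats.pearsonr(X, Y)[0]             # Correlation coefficient from scipy
r_numpy = np.corrcoef(X, Y)[0, 1]             # Off-diagonal entry of the correlation matrix
r_exact = stats.pearsonr(X, X + 2)[0]         # Exact positive linear relationship
r_exact_neg = stats.pearsonr(X, -(X + 2))[0]  # Exact negative linear relationship

print("By hand: {:.4f}, scipy: {:.4f}, numpy: {:.4f}".format(r_manual, r_scipy, r_numpy))
print("Without noise: {:.4f} and {:.4f}".format(r_exact, r_exact_neg))
```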

The code below shows the implementation of ```stats.pearsonr(x,y)``` to calculate the Pearson correlation coefficient of your data (i.e., X and Y). The original relationship is $Y = X+2$ and $Y=-(X+2)$ for positive and negative correlation, respectively. The $X$ data for this example is $X=(1,3,5,8,9,11,12)$. The scatter plot of $X$ and $Y$ is also shown to give you the sense of the physical meaning of the Pearson correlation coefficient.

You can uncomment some lines below to simulate cases with strong/weak positive correlation or strong/weak negative correlation, which is accomplished by adding a weak or strong random noise to the original relationship.

In [None]:
#@title Calculating Pearson correlation coefficient
X = np.array([1,3,5,8,9,11,12])
Y = X+2+np.random.randn(X.shape[0]) # create Y by Y = (X+2) + random noise with standard normal distribution (strong + correlation)
# Y = np.random.randn(X.shape[0]) # Y equals to random noise, try this for correlation that is close to zero
# Y = -(X+2)+np.random.randn(X.shape[0]) # create Y by Y = -(X+2) + random noise with standard normal distribution (strong - correlation)
# Y = X+2 + 8*np.random.randn(X.shape[0]) # create Y by Y = (X+2) + random noise with mean = 0 and std. dev = 8 (weak + correlation)
# Y = -(X+2) + 8*np.random.randn(X.shape[0]) # create Y by Y = -(X+2) + random noise with mean = 0 and std. dev = 8 (weak - correlation)

plt.figure(1, facecolor = 'white')
plt.scatter(X,Y) # Scatter plot of X and Y

pearcorr = stats.pearsonr(X,Y) # Calculate the pearson correlation coefficient and save into variable pearcorr
print("The Pearson correlation coefficient of the given data sets is {:.4f}".format(pearcorr[0]))
plt.ylabel('Y')
plt.xlabel('X')
plt.grid(True)


# Plotting your data

I will now introduce some simple and useful methods for data visualization based on ```matplotlib.pyplot```. A single image can tell more than a thousand words, and that is especially true in the context of data visualization. Notice that I only give simple examples here. If you want to make better use of the visualization tools, please open the links given in the explanations. Remember that we import ```matplotlib.pyplot``` as ```plt```.

**Histogram**

The very first type of plot that we will use is the histogram. It allows you to discover and show the frequency distribution of your data. By using a histogram, you can also inspect your data visually (e.g., checking the skewness or detecting outliers and anomalies). A histogram works by dividing the range of your data into several bins and counting how many data points fall into each bin. 

In Python, we use ```plt.hist(x)``` [(check the description here)](https://matplotlib.org/3.3.1/api/_as_gen/matplotlib.pyplot.hist.html) from ```matplotlib.pyplot``` to create a histogram from your data. The function takes at least one argument (i.e., your data set), in which case matplotlib uses its default number of bins. You can also set the number of bins yourself. In the example below, I plot two data sets: one is generated from a standard normal distribution and the other from a uniform distribution.

In [None]:
#@title Histogram plot example 1
x_1 = np.random.randn(1000) # Generate 1000 samples from standard normal distribution
x_2 = np.random.rand(1000) # Generate 1000 samples from uniform distribution

nbins = 20 # Set number of bins to 20

plt.figure(1, facecolor= 'white') # Create figure
counts1, bins1, bar1 = plt.hist(x_1, bins = nbins) # Plot histogram for variable x_1
plt.figure(2, facecolor= 'white') # Create figure
counts2, bins2, bar2 = plt.hist(x_2, bins = nbins) # Plot histogram for variable x_2

Here is another simple example. Consider the following data set: 

$X_{1}=\{1,4,5,4,12,13,11,5,4,5,10,7,8,2,3,4,8,9,0,10,11,15,14,16,17,14,8,7,6,3,2,20,22,12,4,5,6\}$

It is not very convenient to see the trend from the raw numbers. What if we plot the histogram instead? Try the code below and you will see that most of the data is concentrated on values around 5:

In [None]:
#@title Histogram plot example 2
x_1 = np.array([1,4,5,4,12,13,11,5,4,5,10,7,8,2,3,4,8,9,0,10,11,15,14,16,17,14,8,7,6,3,2,20,22,12,4,5,6])
nbins = 8
plt.figure(1, facecolor= 'white') # Create figure
counts1, bins1, bar1 = plt.hist(x_1, bins = nbins) # Plot histogram for variable x_1
plt.show()
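
If you want the numbers behind the bars rather than the picture, ```np.histogram()``` returns the count in each bin together with the bin edges. The sketch below applies it to the same data set; the printing loop is only one possible way to display the result.

```python
import numpy as np  # Import numpy

x_1 = np.array([1,4,5,4,12,13,11,5,4,5,10,7,8,2,3,4,8,9,0,10,11,15,14,16,17,14,8,7,6,3,2,20,22,12,4,5,6])
nbins = 8  # Number of bins

counts, edges = np.histogram(x_1, bins=nbins)  # Counts per bin and the nbins+1 bin edges

# Print each bin with the number of data points that fall into it
for i in range(nbins):
    print("Bin from {:.2f} to {:.2f}: {} data points".format(edges[i], edges[i+1], counts[i]))

fullest = np.argmax(counts)  # Index of the bin with the most data points
print("The fullest bin starts at {:.2f} and ends at {:.2f}".format(edges[fullest], edges[fullest+1]))
```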


**Barplot**

To create a barplot, we need to use ```plt.bar(x,height)``` from matplotlib [(see the following)](https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.bar.html). A barplot is obviously a simple and very useful method to show your data in terms of height. Notice that a histogram itself can also be seen as a form of a barplot with frequency in the y-axis. However, a barplot is more general in the sense that anything can be plotted in both the x- and y-axis.

Consider the following example: companies A, B, C, and D sold 4500, 8700, 2300, and 9100 of their products last month, respectively. Now you want to plot the data and show this to your colleagues:

In [None]:
#@title Barplot example
x = ['A', 'B', 'C', 'D'] # Name of company
data = np.array([4500, 8700, 2300, 9100]) # Number of products sold by the company

plt.figure(1, facecolor= 'white') # Create figure
plt.bar(x,data) # Plot the bar plot
plt.xlabel('Company') # Give x-label
plt.ylabel('Number of Products sold') # Give y-label
plt.show()

**Scatterplot**

With a scatterplot, you show the data in the form of 'dots' that are not connected to each other (unlike the standard line plot). A scatterplot is particularly useful if your data is scattered (as the name suggests) and you don't want the data to be connected by lines. We use ```plt.scatter(x,y)``` [(open this link for more details)](https://matplotlib.org/3.3.2/api/_as_gen/matplotlib.pyplot.scatter.html) for this purpose, where $x$ and $y$ are the data on your x- and y-axis, respectively (both x and y are the minimum inputs for the scatterplot).

Consider a simple following example where you recorded the number of visitors per month and you want to uncover the general monthly trend:

In [None]:
#@title Scatter plot example 1

month = ['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec']
nvisitors = np.array([2300,4030,5440,5565,4901,7604,8903,10232,7504,6424,5021,4723])

plt.figure(1, facecolor= 'white') # Create figure
plt.scatter(month,nvisitors) # Create scatter plot
plt.xlabel('Month') # Give x-label
plt.ylabel('Number of visitors') # Give y-label
plt.show()

It can be seen from the above figure that the number of visitors peaked in August and that most of the visitors came between June and September. 

Let's try something more colourful, shall we? As I said, the scatter plot from Matplotlib requires at least two inputs (i.e., x and y). You can make the plot more informative, whenever it is needed, by changing the size and the color of the markers. We will now add two more arguments to the scatterplot, i.e., the marker size ```s``` and the marker color ```c```, as in ```plt.scatter(x, y, s=size, c=color)```. See and run the code below:

In [None]:
#@title Scatter plot example 2

month = ['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec']
nvisitors = np.array([2300,4030,5440,5565,4901,7604,8903,10232,7504,6424,5021,4723])

plt.figure(1, facecolor= 'white') # Create figure
size = nvisitors/50
color = nvisitors
plt.scatter(month, nvisitors, s=size, c= color) # Create scatter plot
plt.xlabel('Month') # Give x-label
plt.ylabel('Number of visitors') # Give y-label
plt.show()


A scatter plot can also be used for outlier detection. An outlier is characterized by a significant deviation from the main trend of the data. If there is an impostor among us, you might detect it from the scatterplot. Let's consider the following example:

In [None]:
#@title Scatter plot example 3

x = np.array([1,3,5,6,7,8,9,10,12])
y = np.array([4,6,7,8,25,9,12,13,14])

plt.figure(1, facecolor= 'white') # Create figure
plt.scatter(x, y) # Create scatter plot
plt.xlabel('x') # Give x-label
plt.ylabel('y') # Give y-label
plt.show()

As you can see, one data point at x=7 deviates significantly from the main trend. This might be due to, for example, a sampling error. So you had better check for yourself whether this particular data point is correct or not.
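
Visual inspection can be complemented by a simple numerical check. The sketch below flags values of ```y``` that lie more than $1.5\,IQR$ below $Q1$ or above $Q3$. This "1.5 IQR rule" is only one common heuristic, used here as an assumption for illustration; it looks at ```y``` alone and ignores the trend with respect to ```x```.

```python
import numpy as np  # Import numpy

x = np.array([1,3,5,6,7,8,9,10,12])
y = np.array([4,6,7,8,25,9,12,13,14])

q1, q3 = np.percentile(y, [25, 75])  # First and third quartile of y
iqr = q3 - q1                        # Interquartile range of y
lower = q1 - 1.5 * iqr               # Values below this limit are flagged
upper = q3 + 1.5 * iqr               # Values above this limit are flagged

is_outlier = (y < lower) | (y > upper)  # Boolean array, True for flagged data points

print("Limits: {:.2f} and {:.2f}".format(lower, upper))
print("Flagged points: x = {}, y = {}".format(x[is_outlier], y[is_outlier]))
```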

**Pie chart**

A pie chart comes in handy when you want to plot data that answers questions like "Who gets the largest share of the pie?" or "Which countries have the largest contributions to ASEAN's Gross Domestic Product?". We will use ```plt.pie``` to that end [(see this reference)](https://www.geeksforgeeks.org/plot-a-pie-chart-in-python-using-matplotlib/). Specifically, we will plot each country's share of ASEAN's GDP in 2019, see below:

In [None]:
country = 'Indonesia','Thailand','Philippines','Singapore', 'Malaysia', 'Vietnam', 'Myanmar', 'Cambodia', 'Laos', 'Brunei'
GDP = [1119191, 543650, 376796, 372063, 364702, 261921, 76086, 27089, 18174, 13469]

plt.figure(1, facecolor= 'white') # Create figure
plt.pie(GDP, labels = country) 
plt.show()
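
A pie chart shows shares, so it can help to compute the percentages explicitly. The sketch below does this with Numpy for the same GDP figures. If you prefer to print the percentages on the chart itself, ```plt.pie()``` also accepts an ```autopct``` argument (e.g., ```autopct='%1.1f%%'```).

```python
import numpy as np  # Import numpy

country = ['Indonesia','Thailand','Philippines','Singapore', 'Malaysia', 'Vietnam', 'Myanmar', 'Cambodia', 'Laos', 'Brunei']
GDP = np.array([1119191, 543650, 376796, 372063, 364702, 261921, 76086, 27089, 18174, 13469])

share = 100 * GDP / np.sum(GDP)  # Share of each country in percent

# Print the share of each country
for name, s in zip(country, share):
    print("{}: {:.1f}%".format(name, s))

largest = country[np.argmax(share)]  # Country with the largest share
print("Largest share: {}".format(largest))
```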

**Boxplot**

A boxplot is very useful if you want to depict your data in terms of a five-number summary, namely, the median ($Q2$), $Q1$, $Q3$, $Q1-1.5 IQR$ (minimum), and $Q3+1.5 IQR$ (maximum). The line in the center is the median, the lower bound of the box is $Q1$, the upper bound of the box is $Q3$, the whisker on the bottom is $Q1-1.5 IQR$, and the whisker on the top is $Q3+1.5 IQR$. To be precise, matplotlib draws each whisker at the most extreme data point that still lies within these limits and plots any points beyond them individually as outliers. Remember that you can also change the orientation of the boxplot from vertical to horizontal.

We will use ```plt.boxplot(x)``` [(see this for more details)](https://matplotlib.org/3.3.2/api/_as_gen/matplotlib.pyplot.boxplot.html), where $x$ is the array of your data, to create a boxplot. We can use ```plt.boxplot(x)``` to create a boxplot for one set of data or plot multiple data sets together. 

We will try that by generating boxplots for data generated from normal distributions with $\mathcal{N}(4,0.2)$ and $\mathcal{N}(2,0.8)$. See below:

(You can also do some experiments by changing the data sets)



In [None]:
#@title Box plot example 1

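data1 = 0.2*np.random.randn(100)+4 # Create data set 1 from normal distribution with mean = 4 and std = 0.2
data2 = 0.8*np.random.randn(100)+2 # Create data set 2 from normal distribution with mean = 2 and std = 0.8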
dataall = [data1, data2] # Combine data set 1 and data set 2

# Plot data set 1 only
plt.figure(1, facecolor= 'white') # Create figure
plt.boxplot(data1) # Create box plot
plt.xlabel('Data 1') # Give x-label
plt.ylabel('y') # Give y-label
plt.show()

# Plot data set 1 and 2 together
plt.figure(2, facecolor= 'white') # Create figure
plt.boxplot(dataall) # Create box plot
plt.xlabel('All data') # Give x-label
plt.ylabel('y') # Give y-label
plt.show()
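
To read off the numbers that a boxplot draws, the sketch below uses ```boxplot_stats``` from ```matplotlib.cbook```, which, as far as I know, is the helper that computes the box, the whiskers, and the outliers ("fliers"). The seeded data and the artificial extreme value 6.0 are additions for illustration, so that at least one point ends up beyond the whiskers.

```python
import numpy as np                 # Import numpy
from matplotlib import cbook       # Helper functions from matplotlib

np.random.seed(0)                        # Fix the seed so that the result is reproducible
data = 0.2 * np.random.randn(100) + 4    # Samples from normal distribution with mean = 4 and std = 0.2
data = np.append(data, 6.0)              # Add one artificial extreme value

summary = cbook.boxplot_stats(data)[0]   # Dictionary with the numbers behind the boxplot

q1, med, q3 = summary['q1'], summary['med'], summary['q3']  # Box and center line
iqr = q3 - q1                                               # Interquartile range
whislo, whishi = summary['whislo'], summary['whishi']       # Ends of the whiskers
fliers = summary['fliers']                                  # Points drawn individually as outliers

print("Q1 = {:.4f}, median = {:.4f}, Q3 = {:.4f}".format(q1, med, q3))
print("Whiskers end at {:.4f} and {:.4f}".format(whislo, whishi))
print("Outliers: {}".format(fliers))
```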

