# Click "Edit App" to see the code

In [None]:
# python packages
from sys import stdout
import pandas as pd # Dataframes and reading CSV files
import numpy as np # Numerical libraries
import matplotlib.pyplot as plt # Plotting library
#%matplotlib notebook
from lmfit import Model # Least squares fitting library

In [None]:
%%javascript
IPython.OutputArea.prototype._should_scroll = function(lines) {
   return false;
}

Let's start by creating a pandas dataframe that contains the data we generated.
On the activity page we created two files, so that we can compare the distributions produced by the first and second random number generators.
We can use the pandas function read_csv to read a file and store its content in a dataframe.

In [None]:
# data = pd.read_csv("random1.csv")
data = pd.read_csv("random2.csv")
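As a minimal sketch of what read_csv does, the example below reads a small CSV from an in-memory string instead of a file. The column names and values are made up for illustration; the real random1.csv and random2.csv may be laid out differently.

```python
import io
import pandas as pd

# Hypothetical file content; the actual CSV files may differ.
csv_text = "index,value\n0,0.12\n1,0.57\n2,0.33\n"

# read_csv accepts a file path or any file-like object.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)  # (number of rows, number of columns)
print(sample.head())
```

By default the first line of the file is used for the column names, which is why the sample has three rows rather than four.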

We can then print the dataframe to see what it contains.

In [None]:
print(data)

We can also change the names of the columns of the dataframe.
This is useful for referring to the content of the dataframe later.

In [None]:
data.columns = ("X","Y")
print(data)

The first thing we can do with the dataframe is to count the number of lines, i.e. the number of data points we have.

In [None]:
nval = len(data["Y"].index)
print("Number of data points :",nval)

We can then compute the average of the data in the second column.
The first column is just an index. The simplest way to do this is to use the NumPy function _mean_.
There are multiple ways of selecting the data in the second column of the dataframe.
Here we use the *name* of the column.

In [None]:
avg = np.mean(data["Y"])
print("Average, method #1 :",avg)

Here we use the *iloc* indexer to select the data by their integer position in the dataframe.
Note that Python starts counting from zero and that the upper limit of the range is not included.

In [None]:
avg1 = np.mean(data.iloc[0:nval,1])
print("Average, method #2 :",avg1)
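The two selection styles can be compared on a tiny dataframe. This is an illustrative sketch: the dataframe `demo` and its values are invented for the example.

```python
import pandas as pd

# Small made-up dataframe with the same column names as above.
demo = pd.DataFrame({"X": [0, 1, 2, 3], "Y": [10.0, 20.0, 30.0, 40.0]})

by_name = demo["Y"]            # select by column label
by_position = demo.iloc[:, 1]  # all rows, second column (position 1)
first_two = demo.iloc[0:2, 1]  # rows 0 and 1 only; row 2 is not included

print(by_name.equals(by_position))
print(first_two.tolist())
```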

Alternatively we can write a simple loop to compute the average.
All three methods should give the same result, apart from rounding errors for large datasets.

In [None]:
avg2 = 0
for val in data.iloc[0:nval,1]:
    avg2 = avg2 + val
avg2 = avg2 / nval
print("Average, method #3 :",avg2)
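Because of rounding, comparing two floating-point averages with `==` may fail even when they agree to many digits. A hedged sketch of a safer comparison, using synthetic data rather than the CSV file, is:

```python
import numpy as np

# Synthetic data standing in for the "Y" column (an assumption).
rng = np.random.default_rng(0)
values = rng.normal(size=1000)

mean_numpy = np.mean(values)

# Same explicit loop as above.
mean_loop = 0.0
for v in values:
    mean_loop += v
mean_loop /= len(values)

# isclose compares within a small tolerance instead of bit-for-bit.
print(np.isclose(mean_numpy, mean_loop))
```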

We can then compute the standard deviation using the NumPy function _std_.

In [None]:
stdev = np.std(data["Y"])
print("Standard deviation :",stdev)
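One detail worth knowing: for a NumPy array, _std_ divides by $N$ by default (the population standard deviation). Passing `ddof=1` divides by $N-1$ instead. The short example below uses a made-up array to show the difference, which becomes negligible for large datasets.

```python
import numpy as np

# Made-up sample with mean 5 and sum of squared deviations 32.
sample = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

pop_std = np.std(sample)             # default ddof=0: divide by N
sample_std = np.std(sample, ddof=1)  # divide by N - 1

print("ddof=0 :", pop_std)
print("ddof=1 :", sample_std)
```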

There is no NumPy function for the standard error, so we have to use the definition
\begin{equation}
\mathrm{StdErr} = \frac{\sigma}{\sqrt{N}}
\end{equation}
where $\sigma$ is the standard deviation and $N$ is the total number of data points.

In [None]:
stder = np.std(data["Y"])  / np.sqrt(len(data.index))
print("Standard error",stder)
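The definition can be wrapped in a small helper. The function name `standard_error` and the synthetic samples are assumptions made for this sketch; it illustrates that the standard error shrinks as the number of data points grows.

```python
import numpy as np

def standard_error(values):
    """Standard deviation divided by the square root of the sample size."""
    values = np.asarray(values, dtype=float)
    return np.std(values) / np.sqrt(values.size)

# Synthetic samples of different sizes drawn from the same distribution.
rng = np.random.default_rng(1)
small = rng.normal(size=100)
large = rng.normal(size=10000)

print("N = 100   :", standard_error(small))
print("N = 10000 :", standard_error(large))
```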

We can then compute the histogram of the data, and compare it with the "normal" distribution that we have seen in statistics. If our data obey the normal distribution, the "normalised" histogram should resemble a Gaussian function centred on the average of the data, whose width is given by the standard deviation.

To compute the histogram we can use the NumPy function "histogram".
This function returns two arrays: the first contains the height of each bar (the number of data points in each bin)
and the second contains the positions of the bin edges.
In the example below we compute the histogram of the data in the column labelled "Y"
of our dataframe, using 50 bins.
Try changing the number of bins to see how your plot changes.

In [None]:
histogram , bins = np.histogram(data["Y"],bins=50)
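To see what the two returned arrays look like, the sketch below runs the same call on synthetic data (an assumption, standing in for the "Y" column). The array of bin edges has one more element than the array of counts, because it stores both ends of every bin.

```python
import numpy as np

# Synthetic data standing in for the "Y" column.
rng = np.random.default_rng(2)
values = rng.normal(size=500)

counts, edges = np.histogram(values, bins=50)

print("Number of bars  :", len(counts))
print("Number of edges :", len(edges))
print("Total count     :", counts.sum())
```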

The histogram that NumPy generates is just a count; it is not a probability.
The sum of the histogram heights gives the number of points in the dataframe,
while a probability density requires the area under the histogram to be 1.
You can verify this by using the NumPy function "sum" to add all
the elements of the array _histogram_.

In [None]:
print("Sum of the heights of the histogram bars :",np.sum(histogram))

The area under each bar of the histogram is simply the height of the bar times its width.
Here delta is the width of each bar, which can be computed as the difference between the
positions of two successive bin edges. All the bars have the same width, so the total area is delta times the sum of the heights.

In [None]:
delta = bins[1]-bins[0]
totalArea = delta * np.sum(histogram)

We can now normalise the histogram by dividing it by the total area, so that the area under it becomes 1.

In [None]:
histogram = histogram / totalArea
print("Area under the histogram :",np.sum(histogram)*delta)
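NumPy can also do this normalisation itself through the `density=True` argument of `np.histogram`. The sketch below, run on synthetic data rather than the CSV file, checks that the manual normalisation and the built-in one agree.

```python
import numpy as np

# Synthetic data standing in for the "Y" column.
rng = np.random.default_rng(3)
values = rng.normal(size=2000)

# Manual normalisation, as in the cells above.
counts, edges = np.histogram(values, bins=50)
width = edges[1] - edges[0]
manual = counts / (counts.sum() * width)

# Built-in normalisation.
density, _ = np.histogram(values, bins=50, density=True)

print(np.allclose(manual, density))
print("Area :", np.sum(density * np.diff(edges)))
```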

We can then define a function that returns the _normal_ distribution for the average and standard deviation that we computed before, so that we can evaluate it at the same positions as the histogram bars.

In [None]:
def gaussian(x,x0,std):
    return np.exp(-0.5*(x-x0)**2 / std**2) / (std * np.sqrt(2*np.pi))
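A quick sanity check of this function is that the area under it should be 1, like the normalised histogram. The sketch below approximates the integral with a simple sum over a fine grid; the grid limits and spacing are arbitrary choices for illustration.

```python
import numpy as np

def gaussian(x, x0, std):
    return np.exp(-0.5 * (x - x0) ** 2 / std ** 2) / (std * np.sqrt(2 * np.pi))

# Fine grid covering ten standard deviations on each side of the centre.
x = np.linspace(-10.0, 10.0, 20001)
dx = x[1] - x[0]

# Riemann-sum approximation of the integral.
area = np.sum(gaussian(x, 0.0, 1.0)) * dx
print("Area under the gaussian :", area)
```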

Then we can evaluate the function at the positions of the _bins_.
Note that the array _bins_ created by NumPy holds the bin edges and therefore has one more element than _histogram_, so we drop the last edge by specifying the range as [:-1].

In [None]:
func = gaussian(bins[:-1],avg,stdev)
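Using [:-1] evaluates the function at the left edge of each bin. A common alternative, shown here only as a sketch on a made-up array of edges, is to use the centre of each bin, which is the average of its two edges.

```python
import numpy as np

# Made-up bin edges for three bins of width 1.
edges = np.array([0.0, 1.0, 2.0, 3.0])

left_edges = edges[:-1]                   # drops the last edge
centres = 0.5 * (edges[:-1] + edges[1:])  # midpoint of each bin

print(left_edges)
print(centres)
```

For narrow bins the two choices give nearly the same curve; the difference is half a bin width.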

Let's now make a plot of the histogram of the data, and compare it with the "normal" distribution
that we have seen in statistics. If our data obey the normal distribution, the "normalised" histogram
should resemble a Gaussian function centred on the average of the data, whose width is given
by the standard deviation.

We first have to create an object for the figure and its axes.
This method will allow us to add more lines to the same graph.
(6,4) is the size, in inches, of the figure that is produced.

We then have to add our normalised histogram to the figure that we have created, using
the axes (_ax_) and the "bar" style.
The label is what will appear in the legend of the plot.

We then add the _normal_ distribution to the same axes, and colour it in red.

We can now prettify our figure by adding labels to the axes and the legend.

Finally we can show the graph and save it as an image.

In [None]:
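# Create the figure and its axes
fig1 , ax = plt.subplots(figsize=(6,4))

# Normalised histogram: each bar starts at its left bin edge and is one bin wide
ax.bar(bins[:-1],histogram,width=delta,align="edge", label="BAR")

# Normal distribution on the same axes, in red
ax.plot(bins[:-1],func, label="Gauss",color='red')

# Axis labels and legend
ax.set(xlabel="Values")
ax.set(ylabel="Probability")
ax.legend()

# Show the graph and save it as an image
fig1.show()
fig1.savefig("fig.png")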