1\. **Kernel Density Estimate**

Produce a KDE for a given distribution (by hand, not using seaborn!):

* Fill a numpy array, x, of length N (with N=O(100)) with a normally distributed variable with a given mean and standard deviation
* Fill a histogram in pyplot, taking proper care of the aesthetics
   * use a meaningful number of bins
   * set a proper y axis label
   * set proper values for the y axis major tick labels (e.g. you want to display only integer labels)
   * display the histogram as data points with errors (the error being the Poisson uncertainty)
* For every element of x, create a gaussian with mean equal to the element value and std as a tunable parameter. The default std value should be:
$$ 1.06 * x.std() * x.size ^{-\frac{1}{5}} $$
You can use the scipy function `stats.norm()` for that.
* In a separate plot (to be placed beside the original histogram), plot all the gaussian functions so obtained
* Sum (with np.sum()) all the gaussian functions and normalize the result such that the integral matches the integral of the original histogram. For that you could use the `scipy.integrate.trapz()` function (named `scipy.integrate.trapezoid()` in recent SciPy releases)
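
Written as a formula, the steps above amount to the estimator below. The notation is an assumption introduced here for illustration: $h$ stands for the tunable std, $x_i$ for the elements of x, and $\Delta$ for the histogram bin width (equal-width bins assumed).

$$ \hat{f}(t) = \frac{1}{N}\sum_{i=1}^{N} \frac{1}{h\sqrt{2\pi}} \exp\left(-\frac{(t-x_i)^2}{2h^2}\right), \qquad \int \hat{f}(t)\,dt = 1 $$

A histogram with $N$ entries and bin width $\Delta$ has area $N\Delta$, so the curve to overlay on it is $N\Delta\,\hat{f}(t)$.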


In [1]:
import numpy as np # N-dimensional array / broadcasting / random numbers
import matplotlib.pyplot as plt # graphs
#from matplotlib.ticker import AutoMinorLocator, MultipleLocator, FuncFormatter
import scipy as sp
from scipy import stats
import pandas as pd # DataFrame and label-based slicing
import seaborn as sns  # data visualization
import math

In [2]:
np.linspace(0,y.max(),10,dtype=int)
std_gauss  = 1.06*x.std()*(x.size**(-1/5))
print(std_gauss)
sp.stats.norm.pdf(x,loc=x[1],scale=std_gauss).shape
len(b_height)
len(b_edges)
print(y.shape)
print(norm_gauss.shape)

NameError: name 'y' is not defined

Choosing the number of bins and the bin sizes needs a lot of care. Typically the content of each $i$th bin, $N_i$, should be statistically significant, i.e. the corresponding relative Poisson uncertainty, $\sqrt{N_i}/N_i = 1/\sqrt{N_i}$, should be small.
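
As a rough illustration of this point, the sketch below tabulates the absolute ($\sqrt{N_i}$) and relative ($1/\sqrt{N_i}$) Poisson uncertainty for each bin of a small sample. The sample, the seed, and the names `sample`, `abs_err` and `rel_err` are assumptions made for the example, not part of the exercise.

```python
import numpy as np

rng = np.random.default_rng(0)            # fixed seed, only for reproducibility
sample = rng.normal(0.0, 1.0, 100)        # hypothetical sample of 100 entries

counts, edges = np.histogram(sample, bins=10)
abs_err = np.sqrt(counts)                 # Poisson uncertainty on each bin content

# relative uncertainty 1/sqrt(N_i); empty bins are reported as NaN
rel_err = np.full(counts.shape, np.nan)
filled = counts > 0
rel_err[filled] = 1.0 / np.sqrt(counts[filled])

for low, high, n, e, r in zip(edges[:-1], edges[1:], counts, abs_err, rel_err):
    print(f"[{low:6.2f}, {high:6.2f})  N_i = {n:3d} +/- {e:4.1f}  (relative {r:.2f})")
```

Bins in the tails typically hold only a few entries, so their relative uncertainty is large; fewer, wider bins trade resolution for significance.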

In [None]:
mean,std,N=0,1,100
x=np.random.normal(mean,std,N)
fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, figsize=(14,7))
n_bins=10
b_height, b_edges, _  = ax1.hist(x,bins=n_bins,histtype='step',label='My distribution',color='blue')

# the Poisson uncertainty on a bin with N_i entries is sqrt(N_i)
central_points = (b_edges[1:]+b_edges[:-1])/2
ax1.errorbar(central_points, b_height, yerr=np.sqrt(b_height), fmt='none', label="Poisson errors", ecolor="red")

ax1.set_title('Histogram and KDE')
ax1.set_ylabel('Entries per bin')
# integer tick labels only, evenly spaced
ax1.set_yticks(np.arange(0, b_height.max()+6, 5))

std_gauss  = 1.06*x.std()*(x.size**(-1/5))

ax2.set_title('Gaussian kernels')
ax2.set_ylabel('Probability density')
# common grid on which all the gaussians are evaluated and summed
n=1000
grid=np.linspace(x.min()-3*std_gauss, x.max()+3*std_gauss, n)
somma=np.zeros((x.size,n))
for m,i in enumerate(x):
    interval=np.linspace(i-3*std_gauss,i+3*std_gauss, 100)
    ax2.plot(interval, sp.stats.norm.pdf(interval,loc=i,scale=std_gauss),ls='-')
    somma[m,:]=sp.stats.norm.pdf(grid,loc=i,scale=std_gauss)

y=np.sum(somma,axis=0)
# histogram integral: entries times bin width, summed over the bins
norm_gauss = np.sum(b_height*np.diff(b_edges))
# rescale the summed gaussians so that their integral matches the histogram one
y_area = np.sum((y[1:]+y[:-1])/2*np.diff(grid))  # trapezoidal rule
ax1.plot(grid,y*norm_gauss/y_area,'g-',label='KDE')
ax1.legend(markerscale=1)
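
The same construction can be packed into a reusable function with the std as a tunable parameter, as the exercise asks. The following is a minimal sketch, not a reference implementation: the function name `kde_by_hand`, the evaluation grid, and the seed are assumptions made here, and broadcasting replaces the explicit loop.

```python
import numpy as np
from scipy import stats

def kde_by_hand(x, grid, bandwidth=None):
    """Average of gaussians centred on each element of x, evaluated on grid."""
    x = np.asarray(x, dtype=float)
    grid = np.asarray(grid, dtype=float)
    if bandwidth is None:
        # default std from the exercise text
        bandwidth = 1.06 * x.std() * x.size ** (-1 / 5)
    # rows: grid points, columns: one gaussian per element of x
    kernels = stats.norm.pdf(grid[:, None], loc=x[None, :], scale=bandwidth)
    return kernels.mean(axis=1)           # integrates to 1 over the real line

rng = np.random.default_rng(1)            # fixed seed, only for reproducibility
x = rng.normal(0, 1, 100)
grid = np.linspace(x.min() - 4, x.max() + 4, 2001)
density = kde_by_hand(x, grid)

# rescale to the histogram area (entries times bin width)
counts, edges = np.histogram(x, bins=10)
hist_area = np.sum(counts * np.diff(edges))
kde_counts = density * hist_area

# trapezoidal rule as a cross-check of the normalization
integral = np.sum((density[1:] + density[:-1]) / 2 * np.diff(grid))
print(integral)
```

Passing a larger `bandwidth` gives a smoother curve and a smaller one follows the individual points more closely, which is a quick way to see why the default matters.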

2\. **Color-coded scatter plot**

Produce a scatter plot out of a dataset with two categories

* Write a function that generates a 2D dataset of 2 categories. Each category should be distributed as a 2D gaussian with a given mean and std (clearly it is better to have different mean values)
* Display the dataset in a scatter plot marking the two categories with different marker colors.

An example is given below

You can try to make the procedure more general by allowing a given number $n\ge 2$ of categories

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

# always useful
import numpy as np
import seaborn as sns
import pandas as pd
import scipy as sp

In [None]:
 ! wget https://www.dropbox.com/s/u4y3k4kk5tc7j46/two_categories_scatter_plot.png
from IPython.display import Image
Image('../data/two_categories_scatter_plot.png')


In [None]:
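n_cat=10          # number of categories
mean=20           # category centres are drawn uniformly in [-mean, mean]
std=2             # category widths are drawn uniformly in [0, std]
lowest=100        # minimum number of points per category
highest=1000      # maximum number of points per category (exclusive)

fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(20, 10))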
for i in range(n_cat):
    size=np.random.randint(lowest,highest)
    m1=np.random.uniform(-mean,mean)
    m2=np.random.uniform(-mean,mean)
    std1=np.random.uniform(0,std)
    std2=np.random.uniform(0,std)

    x = np.random.normal(m1,std1,size)
    y = np.random.normal(m2,std2,size)

    ax.scatter(x,y,alpha=0.7,marker='*',label='cat_set'+str(i+1))

ax.grid(True)
ax.legend(markerscale=2)
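
The cell above generates the categories inline with random parameters. Since the exercise asks for a function taking the means and stds as inputs, a possible sketch is given here; the name `make_categories`, its signature, and the example values are assumptions, not part of the original text.

```python
import numpy as np

def make_categories(means, stds, sizes, seed=None):
    """Return (points, labels) for len(means) gaussian categories in 2D.

    points has shape (sum(sizes), 2); labels holds the category index of each row.
    """
    rng = np.random.default_rng(seed)
    points, labels = [], []
    for k, (mu, sigma, n) in enumerate(zip(means, stds, sizes)):
        # mu and sigma are (x, y) pairs, broadcast over the n rows
        points.append(rng.normal(loc=mu, scale=sigma, size=(n, 2)))
        labels.append(np.full(n, k))
    return np.vstack(points), np.concatenate(labels)

# two categories with different centres (hypothetical values)
means = [(0, 0), (4, 3)]
stds = [(1.0, 1.0), (1.5, 0.5)]
sizes = [300, 200]
points, labels = make_categories(means, stds, sizes, seed=42)
print(points.shape, np.bincount(labels))
```

The labels can then drive the colours, for instance with `ax.scatter(points[:, 0], points[:, 1], c=labels)`, or with one `ax.scatter` call per category when a legend entry per category is wanted. Adding more entries to the three lists gives $n\ge 2$ categories.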

3\. **Profile plot**

Produce a profile plot from a scatter plot.
* Download the following dataset and load it as a pandas dataframe:
```bash
wget https://www.dropbox.com/s/hgnvyj9abatk8g6/residuals_261.npy
```
Note that you should use the `np.load()` function to load the file as a numpy array and then pass it to the `pd.DataFrame()` constructor.
* Inspect the dataset, you'll find two variables (features)
* Clean the sample by selecting the entries (rows) with the variable "residuals" in absolute value smaller than 2
* Perform a linear regression of "residuals" versus "distances" using `scipy.stats.linregress()`
* Plot a seaborn jointplot of "residuals" versus "distances", having seaborn perform a linear regression. The result of the regression should be displayed on the plot
* Fill 3 numpy arrays
  * x, serving as an array of bin centers for the "distance" variable. It should range from 0 to 20 with reasonable number of steps (bins)
  * y, the mean values of the "residuals", estimated in slices (bins) of "distance"
  * erry, the standard deviation of the "residuals", estimated in slices (bins) of "distance"
* Plot the profile plot on top of the scatter plot

In [None]:
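# the file stores a pickled dictionary, so allow_pickle=True is needed with recent numpy versions
df = pd.DataFrame(np.load('../data/residuals_261.npy', allow_pickle=True).item())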

In [None]:
df

Of paramount importance is the condensation of scatter plots into "profiles". The procedure runs as follows: the data are binned along $x$ (if you had to bin on the other variable, just invert the axes); for every bin, take the mean and the standard deviation of the corresponding $y$ values and display them as data points with their errors.
Profiles are related to, but not the same as, "box plots", which summarize each bin with the median and quartiles rather than the mean and standard deviation.
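
The binning step can be sketched with `pd.cut` and `groupby`. The example below runs on a synthetic stand-in for the real dataset (an assumption, so that it is self-contained); the frame `toy`, its slope, offset and noise level are invented for illustration only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)            # fixed seed, only for reproducibility
# synthetic stand-in for the real data: linear trend plus gaussian noise
toy = pd.DataFrame({'distances': rng.uniform(0, 20, 5000)})
toy['residuals'] = 0.05 * toy['distances'] - 0.5 + rng.normal(0, 0.4, len(toy))

bin_edges = np.linspace(0, 20, 11)
x = (bin_edges[1:] + bin_edges[:-1]) / 2                   # bin centres

# assign each row to a [low, high) slice of "distances"
bins = pd.cut(toy['distances'], bin_edges, right=False)
grouped = toy.groupby(bins, observed=False)['residuals']

y = grouped.mean().to_numpy()             # profile points
erry = grouped.std().to_numpy()           # spread of the residuals in each slice
print(np.column_stack([x, y, erry]))
```

The three arrays can be passed directly to `errorbar(x, y, yerr=erry)`. Dividing `erry` by the square root of the number of entries per bin would give the uncertainty on the mean instead of the spread.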

## JOINTPLOT with linear regression

In [None]:
df_cleaned = df[np.absolute(df['residuals'])<2]
#x are distances, y are residuals
slope,intercept,_, _, _ = sp.stats.linregress(df_cleaned['distances'],df_cleaned['residuals'])
#{0:.3f} 0 is the argument, .3f is the number's format
joint = sns.jointplot(x = "distances", y = "residuals", data = df_cleaned, kind="reg",scatter_kws={'alpha':0.2},joint_kws={'label':"y={0:.3f}x+{1:.3f}".format(slope,intercept)},color='blue')
joint.ax_joint.legend()


## PROFILE PLOT

In [None]:
#{0:.3f} 0 is the argument, .3f is the number's format
joint = sns.jointplot(x = "distances", y = "residuals", data = df_cleaned, kind="reg",scatter_kws={'alpha':0.2},joint_kws={'label':"y={0:.3f}x+{1:.3f}".format(slope,intercept)},color='blue')
joint.ax_joint.legend()

#getting the bin centers
nbin=20 
bin_edges = np.linspace(0,20,nbin+1)
central_points =(bin_edges[1:] + bin_edges[:-1])/2 

#filling the array
#y=mean, red point of the profile plot
#erry=error bar size
y=np.array([df_cleaned.loc[(df_cleaned['distances'] >=bin_edges[i]) & (df_cleaned['distances'] <bin_edges[i+1])]['residuals'].mean() for i in range(nbin)])
erry=np.array([df_cleaned.loc[(df_cleaned['distances'] >=bin_edges[i]) & (df_cleaned['distances'] <bin_edges[i+1])]['residuals'].std() for i in range(nbin)])
#plotting profile explicitly on the joint axes of the jointplot
joint.ax_joint.errorbar(central_points,y,yerr=erry, label='Profile Plot',linewidth=1.5,color='r',marker='o')
joint.ax_joint.legend()