1\. **Kernel Density Estimate**

Produce a KDE for a given distribution (by hand, not using seaborn!):

* Fill a numpy array `x` of length N (with N of order 100) with a normally distributed variable of given mean and standard deviation
* Fill a histogram in pyplot, taking proper care of the aesthetics:
   * use a meaningful number of bins
   * set a proper y axis label
   * set proper values for the y axis major tick labels (e.g. display only integer labels)
   * display the histogram as data points with errors (the error being the Poisson uncertainty)
* For every element of `x`, create a gaussian whose mean is the element value and whose standard deviation is a tunable parameter. The default value of the standard deviation should be:
$$ 1.06 \cdot \sigma_x \cdot N^{-\frac{1}{5}} $$
where $\sigma_x$ is `x.std()` and $N$ is `x.size`. You can use the scipy function `stats.norm()` for that.
* In a separate plot (to be placed beside the original histogram), plot all the gaussian functions so obtained
* Sum (with `np.sum()`) all the gaussian functions and normalize the result so that its integral matches the integral of the original histogram. For that you could use `scipy.integrate.trapz()` (named `scipy.integrate.trapezoid()` in recent SciPy releases)
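
The last three steps (one gaussian per point, sum, normalization) can be sketched without any plotting, which makes the normalization easy to check. This is a minimal sketch under stated assumptions, not a reference solution: the helper names `gaussian_pdf` and `kde_by_hand`, the 400-point grid and the padding of 4 bandwidths are illustrative choices, and the trapezoidal rule is written out with NumPy instead of calling SciPy.

```python
import numpy as np


def gaussian_pdf(x, mu, sigma):
    """Normal density at x (hypothetical helper, same formula as scipy.stats.norm.pdf)."""
    z = (x - mu) / sigma
    return np.exp(-0.5 * z**2) / (sigma * np.sqrt(2.0 * np.pi))


def kde_by_hand(data, bins=10, bandwidth=None, n_grid=400):
    """Return the grid, the individual gaussians, their sum scaled to the
    histogram area, and that area."""
    data = np.asarray(data, dtype=float)
    if bandwidth is None:
        # Default from the exercise text: 1.06 * std * N^(-1/5)
        bandwidth = 1.06 * data.std() * data.size ** (-1.0 / 5.0)

    # Grid padded by 4 bandwidths (assumption) so the outermost tails are included
    grid = np.linspace(data.min() - 4 * bandwidth, data.max() + 4 * bandwidth, n_grid)

    # Broadcasting: row i holds the gaussian centred on data[i]
    gaussians = gaussian_pdf(grid[np.newaxis, :], data[:, np.newaxis], bandwidth)
    total = np.sum(gaussians, axis=0)

    # Area under the histogram = sum of (bin content * bin width)
    counts, edges = np.histogram(data, bins=bins)
    hist_area = np.sum(counts * np.diff(edges))

    # Trapezoidal rule written out explicitly
    kde_area = np.sum(0.5 * (total[1:] + total[:-1]) * np.diff(grid))
    return grid, gaussians, total * hist_area / kde_area, hist_area


rng = np.random.default_rng(0)
sample = rng.normal(loc=0.0, scale=1.0, size=100)
grid, gaussians, kde, hist_area = kde_by_hand(sample)
```

Plotting `grid` against each row of `gaussians` gives the panel with the individual gaussians, and `grid` against `kde` gives the curve to compare with the histogram.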


In [None]:
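import matplotlib.pyplot as plt
import numpy as np
from matplotlib.ticker import MaxNLocator
from scipy.integrate import trapezoid  # current name of scipy.integrate.trapz
from scipy.stats import norm

N = 100
mean, sigma = 0.0, 1.0  # mean and standard deviation of the parent distribution
v = np.random.normal(mean, sigma, N)

fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(15, 5))

# Histogram shown as data points with Poisson errors (sqrt of the bin content)
counts, edges = np.histogram(v, bins=10)
centers = 0.5 * (edges[1:] + edges[:-1])
ax1.errorbar(centers, counts, yerr=np.sqrt(counts), fmt='o')
ax1.set_xlabel('x')
ax1.set_ylabel('entries per bin')
ax1.yaxis.set_major_locator(MaxNLocator(integer=True))  # integer tick labels only

# Tunable width of the gaussians; default value from the exercise text
std = 1.06 * v.std() * v.size ** (-1 / 5)

# Common grid, wide enough to contain the tails of the outermost gaussians
x = np.linspace(v.min() - 4 * std, v.max() + 4 * std, 400)

# One gaussian per data point: row i is the pdf centred on v[i]
gaussians = norm.pdf(x, loc=v[:, np.newaxis], scale=std)
for g in gaussians:
    ax2.plot(x, g)
ax2.set_xlabel('x')
ax2.set_ylabel('pdf of each gaussian')

# Sum the gaussians and rescale so that the area matches the histogram area
kde = np.sum(gaussians, axis=0)
hist_area = np.sum(counts * np.diff(edges))
kde *= hist_area / trapezoid(kde, x)

ax3.errorbar(centers, counts, yerr=np.sqrt(counts), fmt='o', label='histogram')
ax3.plot(x, kde, label='KDE')
ax3.set_xlabel('x')
ax3.set_ylabel('entries per bin')
ax3.legend()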

plt.show()
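
For the "meaningful number of bins" requirement, one option is to let a rule of thumb pick the binning. The sketch below is an illustration, not part of the original solution: it compares three of the estimators built into `np.histogram_bin_edges` and then computes bin centers, contents and Poisson errors for one of them. The choice of the Freedman-Diaconis rule (`"fd"`) and the seed are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
sample = rng.normal(loc=0.0, scale=1.0, size=100)

# Candidate binnings from NumPy's built-in rules of thumb
for rule in ("sqrt", "sturges", "fd"):
    rule_edges = np.histogram_bin_edges(sample, bins=rule)
    print(f"{rule:8s} -> {len(rule_edges) - 1} bins")

# Histogram content with the Freedman-Diaconis binning (illustrative choice)
edges = np.histogram_bin_edges(sample, bins="fd")
counts, _ = np.histogram(sample, bins=edges)
centers = 0.5 * (edges[1:] + edges[:-1])
errors = np.sqrt(counts)  # Poisson uncertainty on a count n is sqrt(n)
```

The three arrays can then be drawn with `ax.errorbar(centers, counts, yerr=errors, fmt='o')`.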

2\. **Color-coded scatter plot**

Produce a scatter plot out of a dataset with two categories

* Write a function that generates a 2D dataset with 2 categories. Each category should be distributed as a 2D gaussian with a given mean and std (it is clearly better to give the two categories different means)
* Display the dataset in a scatter plot marking the two categories with different marker colors.

An example is given below

You can try to make the procedure more general by allowing a given number $n\ge 2$ of categories

In [None]:
! wget https://www.dropbox.com/s/u4y3k4kk5tc7j46/two_categories_scatter_plot.png
from IPython.display import Image
Image('two_categories_scatter_plot.png')

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns

N = 300  #number of points

num = 4  #number of gaussians

means = np.random.randint(-10, 10, (num,2))
sigmas = 5*np.random.rand(num)
      
      
def Gaussians2D(means, sigma, N):
    """Return an (n*N, 3) array whose columns are x, y and the category label (1..n)."""
    n = len(sigma)
    # Start from standard normal draws, then scale and shift each block of N rows
    gaussians = np.random.randn(n*N, 3)
    for i in range(n):
        rows = slice(N*i, N*(i+1))
        gaussians[rows, :2] *= sigma[i]   # same std on both axes
        gaussians[rows, :2] += means[i]   # shift to the category mean
        gaussians[rows, 2] = i + 1        # category label
    return gaussians

df = pd.DataFrame(Gaussians2D(means, sigmas, N), columns=['X', 'Y', 'n'])
df['n'] = df['n'].astype(int)  # integer labels give a cleaner legend

sns.relplot(x="X", y="Y", hue="n", palette="deep", data=df);
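
Generalizing to $n$ categories mostly amounts to indexing the per-category parameters by label. A minimal NumPy-only sketch follows; the function name `make_categories`, its arguments and the example means and widths are hypothetical, and each category is assumed isotropic (same std on both axes).

```python
import numpy as np


def make_categories(means, sigmas, n_points, seed=None):
    """Draw n_points per category from isotropic 2D gaussians.

    means  : array-like of shape (n_categories, 2)
    sigmas : array-like of shape (n_categories,)
    Returns the points, shape (n_categories * n_points, 2), and integer labels.
    """
    rng = np.random.default_rng(seed)
    means = np.asarray(means, dtype=float)
    sigmas = np.asarray(sigmas, dtype=float)
    labels = np.repeat(np.arange(len(means)), n_points)
    # Index the parameters by label so every point gets its category's mean and std
    points = rng.normal(loc=means[labels], scale=sigmas[labels, np.newaxis])
    return points, labels


points, labels = make_categories(means=[[-3, 0], [3, 0], [0, 4]],
                                 sigmas=[0.5, 1.0, 1.5], n_points=300, seed=0)
```

With matplotlib, `plt.scatter(points[:, 0], points[:, 1], c=labels, cmap='tab10', s=5)` colors the points by category.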

3\. **Profile plot**

Produce a profile plot from a scatter plot.
* Download the following dataset and load it as a pandas dataframe:
```bash
wget https://www.dropbox.com/s/hgnvyj9abatk8g6/residuals_261.npy
```
Note that you should use the `np.load()` function to load the file as a numpy array (with `allow_pickle=True`, since the file stores a pickled dictionary), call the `.item()` method, and then pass the result to the `pd.DataFrame()` constructor.
* Inspect the dataset; you'll find two variables (features)
* Clean the sample by selecting the entries (rows) whose "residuals" variable is smaller than 2 in absolute value
* Perform a linear regression of "residuals" versus "distances" using `scipy.stats.linregress()`
* Plot a seaborn jointplot of "residuals" versus "distances", letting seaborn perform a linear regression. The result of the regression should be displayed on the plot
* Fill 3 numpy arrays:
  * x, the array of bin centers for the "distances" variable. It should range from 0 to 20 with a reasonable number of steps (bins)
  * y, the mean values of the "residuals", estimated in slices (bins) of "distances"
  * erry, the standard deviation of the "residuals", estimated in slices (bins) of "distances"
* Plot the profile plot on top of the scatter plot
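
The three profile arrays can be built with plain NumPy by assigning each point to a bin and taking the mean and standard deviation per bin. The following is a sketch on synthetic data, since the real file is not assumed to be available here: the helper name `profile`, the 20 unit-width bins and the synthetic trend are all illustrative assumptions.

```python
import numpy as np


def profile(x_values, y_values, edges):
    """Mean and standard deviation (ddof=0) of y_values in bins of x_values.

    Empty bins are returned as NaN; points outside the edges are ignored.
    """
    x_values = np.asarray(x_values, dtype=float)
    y_values = np.asarray(y_values, dtype=float)
    edges = np.asarray(edges, dtype=float)
    centers = 0.5 * (edges[1:] + edges[:-1])
    # np.digitize returns i such that edges[i-1] <= x < edges[i]
    index = np.digitize(x_values, edges) - 1
    mean = np.full(centers.size, np.nan)
    std = np.full(centers.size, np.nan)
    for i in range(centers.size):
        in_bin = y_values[index == i]
        if in_bin.size > 0:
            mean[i] = in_bin.mean()
            std[i] = in_bin.std()
    return centers, mean, std


# Synthetic stand-in for the real dataset: a weak linear trend plus noise
rng = np.random.default_rng(2)
distances = rng.uniform(0, 20, size=5000)
residuals = 0.01 * distances - 0.1 + rng.normal(scale=0.5, size=5000)
x, y, erry = profile(distances, residuals, np.linspace(0, 20, 21))
```

Drawing `ax.errorbar(x, y, yerr=erry, fmt='o')` on the same axes as the scatter plot places the profile on top of it.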

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats

# The file stores a pickled dictionary, hence allow_pickle=True and .item()
dataArray = np.load("residuals_261.npy", allow_pickle=True)
df = pd.DataFrame(dataArray.item())

# Keep only the rows with |residuals| < 2
df = df[df['residuals'].abs() < 2]

# Linear regression of residuals (y) versus distances (x)
slope, intercept, r_value, p_value, std_err = stats.linregress(df['distances'], df['residuals'])

# Joint plot with seaborn's own linear regression, annotated with the linregress parameters
g = sns.jointplot(x="distances", y="residuals", data=df, kind="reg")
g.ax_joint.text(0.05, 0.95, f"y = {slope:.4f} x + {intercept:.4f}",
                transform=g.ax_joint.transAxes, va="top")

# Profile: mean and standard deviation of the residuals in bins of distance
edges = np.linspace(0, 20, 41)        # 40 bins of width 0.5 between 0 and 20
x = 0.5 * (edges[1:] + edges[:-1])    # bin centers
bins = pd.cut(df['distances'], edges, include_lowest=True)
grouped = df.groupby(bins, observed=False)['residuals']
y = grouped.mean().to_numpy()
erry = grouped.std().to_numpy()

# Profile plot drawn on top of the scatter plot
fig, ax = plt.subplots(figsize=(8, 5))
ax.scatter(df['distances'], df['residuals'], s=2, alpha=0.3, label='data')
ax.errorbar(x, y, yerr=erry, fmt='o', color='red', label='profile')
ax.set_xlabel('distances')
ax.set_ylabel('residuals')
ax.legend()
plt.show()
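
As a cross-check on the regression step, the slope and intercept of a simple least-squares line have a closed form that is easy to evaluate directly. The sketch below uses synthetic data and a hypothetical helper `least_squares_line`; the trend, noise level and seed are assumptions chosen only for illustration.

```python
import numpy as np


def least_squares_line(x, y):
    """Slope and intercept of the ordinary least-squares line y = slope * x + intercept."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    # slope = cov(x, y) / var(x); the line passes through the point of means
    slope = np.sum(dx * (y - y.mean())) / np.sum(dx**2)
    intercept = y.mean() - slope * x.mean()
    return slope, intercept


rng = np.random.default_rng(3)
x = rng.uniform(0, 20, size=2000)
y = 0.02 * x - 0.15 + rng.normal(scale=0.3, size=2000)
slope, intercept = least_squares_line(x, y)
```

On the same inputs, `scipy.stats.linregress(x, y)` should return the same slope and intercept, together with the correlation coefficient, p-value and standard error.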
