1\. **Kernel Density Estimate**

Produce a KDE for a given distribution (by hand, not using seaborn!):

* Fill a numpy array, x, of length N (with N of order 100) with a normally distributed variable of given mean and standard deviation
* Fill a histogram in pyplot, taking proper care of the aesthetics
   * use a meaningful number of bins
   * set a proper y axis label
   * set proper values for the y axis major tick labels (e.g. you want to display only integer labels)
   * display the histogram as data points with errors (the error being the Poisson uncertainty)
* For every element of x, create a Gaussian with the mean equal to the element value and the std as a parameter that can be tuned. The default value of the std should be:
$$ 1.06 \cdot \texttt{x.std()} \cdot \texttt{x.size}^{-\frac{1}{5}} $$
You can use the scipy function `stats.norm()` for that.
* In a separate plot (to be placed beside the original histogram), plot all the Gaussian functions so obtained
* Sum (with np.sum()) all the Gaussian functions and normalize the result such that its integral matches the integral of the original histogram. For that you could use the `scipy.integrate.trapz()` function (called `scipy.integrate.trapezoid()` in recent SciPy releases)
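
The binning, error-bar, and bandwidth bullets above can be sketched as follows. This is only a minimal illustration, not the reference solution: the seed and the values of `mean`, `sigma`, and `N` are arbitrary assumptions, and the Freedman-Diaconis rule (through NumPy's `bins="fd"` option) is just one reasonable way to pick the number of bins.

```python
import numpy as np

rng = np.random.default_rng(0)      # fixed seed, chosen arbitrarily
mean, sigma, N = 3.0, 1.0, 100      # illustrative values only
x = rng.normal(mean, sigma, N)

# Freedman-Diaconis bin edges, computed by NumPy
bin_edges = np.histogram_bin_edges(x, bins="fd")
counts, _ = np.histogram(x, bins=bin_edges)

bin_centres = 0.5 * (bin_edges[:-1] + bin_edges[1:])
poisson_err = np.sqrt(counts)       # Poisson uncertainty on each bin count

# Default kernel width from the formula above
bandwidth = 1.06 * x.std() * x.size ** (-1 / 5)
```

The arrays `bin_centres`, `counts`, and `poisson_err` can then be passed to `ax.errorbar(..., fmt='o')` to draw the histogram as data points.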


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import random as rnd
from scipy.stats import norm
import matplotlib.mlab as mlab

In [None]:
## Kernel Density Estimate ##

# INTRINSIC PARAMETER
N = 100
sigma = 1
mu = 3


#NORMAL DISTRIBUTION
x = sigma*np.random.randn(N)+mu

# FINDING THE OPTIMAL NBINS (Freedman-Diaconis rule, from stats.stackexchange.com)
IQR = np.percentile(x, 75) - np.percentile(x, 25)
h = 2*N**(-1/3)*IQR  # bin width suggested by the rule
nbins = int(np.floor((np.max(x)-np.min(x))/h))
print('\nOptimal number of bins =', nbins)

# PLOT HIST
fig, ax = plt.subplots()
ax.hist(x, color = 'blue', edgecolor = 'black',bins = nbins)


# ADD LABELS
ax.set_title('Histogram of a Normally Distributed Variable with Poisson Errors')
ax.set_xlabel('x')
ax.set_ylabel('Number of occurrences')

# Histograms as data points with errors
counts,bin_edges = np.histogram(x,nbins)
bin_centres = (bin_edges[:-1] + bin_edges[1:])/2.
err = np.sqrt(counts) #poisson error
ax.errorbar(bin_centres, counts, yerr=err, fmt='o',ecolor="red")

# Show only integer labels on the y axis major ticks
# (a hand-computed integer step can end up as 0 and make np.arange fail)
from matplotlib.ticker import MaxNLocator
ax.yaxis.set_major_locator(MaxNLocator(integer=True))
ax.grid()
plt.show()



### Create a gaussian for every element of x
# Default bandwidth from the exercise, scaled by a tunable factor (3 here)
std = 3*(1.06*x.std()*x.size**(-1/5))
linespace = np.linspace(mu - 6, mu + 6, 100)
gau_element = np.zeros((N,len(linespace)))

for i in range(N):
    g = norm(x[i],std)  # gaussian centred on the i-th sample; mu is left untouched
    gau_element[i,:] = g.pdf(linespace)
    plt.plot(linespace,gau_element[i,:])

plt.grid()
plt.show()

#### Gaussian sum
gau_sum = np.sum(gau_element,axis =0)

#### plotting
#fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2,figsize=(10, 20))
#
#ax1.plot(linespace,gau_sum)
#ax1.set_title('Gaussian')
#ax1.set_xlabel('$x$')
#ax1.set_ylabel('Gaussian Distribution')
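
The cell above stops at the sum of the kernels, so the normalization requested in the last bullet of the exercise is still missing. A minimal sketch of one way to do it is given here, assuming an arbitrary seed, a fixed `bins=10`, and a hypothetical evaluation grid `grid`; `scipy.integrate.trapezoid` is used as the current name of `trapz`.

```python
import numpy as np
from scipy import stats
from scipy.integrate import trapezoid

rng = np.random.default_rng(1)      # arbitrary seed
x = rng.normal(3.0, 1.0, 100)       # illustrative sample

# Area of the histogram: sum of count * bin width
counts, bin_edges = np.histogram(x, bins=10)
hist_integral = np.sum(counts * np.diff(bin_edges))

bandwidth = 1.06 * x.std() * x.size ** (-1 / 5)
grid = np.linspace(x.min() - 4 * bandwidth, x.max() + 4 * bandwidth, 400)

# One Gaussian per data point: rows are kernels, columns are grid points
kernels = stats.norm.pdf(grid[np.newaxis, :], loc=x[:, np.newaxis], scale=bandwidth)
kde_sum = np.sum(kernels, axis=0)

# Rescale so the area under the curve equals the area of the histogram
kde_scaled = kde_sum * hist_integral / trapezoid(kde_sum, grid)
```

Plotting `kde_scaled` against `grid` on the same axes as the histogram should then give a curve that follows the bin counts.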

2\. **Color-coded scatter plot**

Produce a scatter plot out of a dataset with two categories

* Write a function that generates a 2D dataset of 2 categories. Each category should be distributed as a 2D Gaussian with a given mean and std (clearly it is better to use different mean values for the two categories)
* Display the dataset in a scatter plot marking the two categories with different marker colors.

An example is given below

You can try to make the procedure more general by allowing a given number $n\ge 2$ of categories

In [None]:
! wget https://www.dropbox.com/s/u4y3k4kk5tc7j46/two_categories_scatter_plot.png
from IPython.display import Image
Image('two_categories_scatter_plot.png')


##### Answer 2)

In [None]:
#### producing data
import matplotlib.cm as cm
#n_categ = 5 #number of categories
#mu = 3  #given mean range
#std = 5 #given std range
min_data = 10    #Minimum range of data
max_data = 1000  #Maximum range of data

fig, ax = plt.subplots(figsize=(10, 5))

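def two_D_gau(n_categ, mu=3, std=5):
    # Scatter n_categ 2D gaussian categories on the global `ax`: means are drawn
    # from [-mu, mu], stds from [0, std], sample sizes from [min_data, max_data)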
    for i in range(n_categ):
        data_size = np.random.randint(min_data,max_data)

        m1=np.random.uniform(-mu,mu)
        m2=np.random.uniform(-mu,mu)

        std1=np.random.uniform(0,std)
        std2=np.random.uniform(0,std)
        
        x = np.random.normal(m1,std1,data_size)
        y = np.random.normal(m2,std2,data_size)
        
        #colors = cm.rainbow(np.linspace(0, 1, len(x)))
        
        ax.scatter(x,y,label='category'+str(i+1))
        
two_D_gau(8,5,2)
ax.legend()
plt.show()
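
The answer above draws each category directly inside the function. Since the exercise asks for a function that generates the dataset, a variant that returns the points and their category labels may be easier to reuse. The sketch below is one possible shape for it; the name `make_categories` and all of its parameters are hypothetical choices made for illustration.

```python
import numpy as np

def make_categories(n_categ=2, n_points=200, mean_range=5.0, max_std=2.0, seed=None):
    """Return (points, labels); points has shape (n_categ * n_points, 2)."""
    rng = np.random.default_rng(seed)
    points, labels = [], []
    for k in range(n_categ):
        # Random 2D centre and per-axis std for this category
        centre = rng.uniform(-mean_range, mean_range, size=2)
        std = rng.uniform(0.2, max_std, size=2)
        points.append(rng.normal(centre, std, size=(n_points, 2)))
        labels.append(np.full(n_points, k))
    return np.vstack(points), np.concatenate(labels)

points, labels = make_categories(n_categ=3, seed=0)
```

With this layout, `ax.scatter(points[:, 0], points[:, 1], c=labels)` colours the markers by category in a single call.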

3\. **Profile plot**

Produce a profile plot from a scatter plot.
* Download the following dataset and load it as a pandas dataframe:
```bash
wget https://www.dropbox.com/s/hgnvyj9abatk8g6/residuals_261.npy
```
Note that you should use the `np.load()` function to load the file as a numpy array and then pass it to the `pd.DataFrame()` constructor.
* Inspect the dataset, you'll find two variables (features)
* Clean the sample by selecting the entries (rows) with the variable "residuals" in absolute value smaller than 2
* Perform a linear regression of "residuals" versus "distances" using `scipy.stats.linregress()`
* Plot a seaborn jointplot of "residuals" versus "distances", letting seaborn perform a linear regression. The result of the regression should be displayed on the plot
* Fill 3 numpy arrays
  * x, serving as an array of bin centers for the "distance" variable. It should range from 0 to 20 with reasonable number of steps (bins)
  * y, the mean values of the "residuals", estimated in slices (bins) of "distance"
  * erry, the standard deviation of the "residuals", estimated in slices (bins) of "distance"
* Plot the profile plot on top of the scatter plot

##### Answer 3)

In [None]:
### Loading the data

import pandas as pd
# The file stores a pickled dict inside a 0-d object array, so allow_pickle is required
res = np.load('residuals_261.npy', allow_pickle=True).item()
df = pd.DataFrame(res)
df.head()

In [None]:
### Cleaning the data and implementing the regression

import scipy.stats as ss
df = df[abs(df["residuals"])<2]
# "residuals" versus "distances": distances is x, residuals is y
slope,intercept,_, _, _ = ss.linregress(df['distances'],df['residuals'])
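
As a quick check of the argument order, `linregress` takes the independent variable first and the dependent one second. The sketch below uses synthetic data with a known slope and intercept (both values are made up for illustration and have nothing to do with the real dataset) and reads the result through its named attributes instead of tuple unpacking.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)      # arbitrary seed
distances = rng.uniform(0, 20, 500)
# Synthetic residuals: true slope 0.05, true intercept -0.3, small noise
residuals = 0.05 * distances - 0.3 + rng.normal(0, 0.1, 500)

fit = stats.linregress(distances, residuals)   # x first, then y
print(f"slope={fit.slope:.3f}, intercept={fit.intercept:.3f}, r={fit.rvalue:.3f}")
```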

In [None]:
import seaborn as sns
#### Joint plot
joint_plt = sns.jointplot(x = "distances", y = "residuals", data = df, kind="reg")

#### 3-numpy array

#x as bin centers
nbin=20
bin_edges = np.linspace(0,20,nbin+1)
x =(bin_edges[1:] + bin_edges[:-1])/2

#y as mean value of each center
y = np.zeros(nbin)
for i in range(nbin):
    y[i] = df[(df['distances'] >=bin_edges[i]) & (df['distances'] <bin_edges[i+1])]['residuals'].mean()

#erry as std of the residuals in each bin
erry = np.zeros(nbin)
for i in range(nbin):
    erry[i] = df[(df['distances'] >=bin_edges[i]) & (df['distances'] <bin_edges[i+1])]['residuals'].std()

#plotting profile
# draw the profile explicitly on the scatter (joint) axes of the jointplot
joint_plt.ax_joint.errorbar(x,y,yerr=erry,color ='r',marker='*')
plt.show()
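
The per-bin loops above can also be written with `pd.cut` and `groupby`. The sketch below shows that alternative on a synthetic stand-in for the dataset: `df_demo` and its contents are made up for illustration, and only the column names match the real file.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)      # arbitrary seed
df_demo = pd.DataFrame({"distances": rng.uniform(0, 20, 2000)})
df_demo["residuals"] = 0.02 * df_demo["distances"] + rng.normal(0, 0.5, 2000)

nbin = 20
bin_edges = np.linspace(0, 20, nbin + 1)
x = 0.5 * (bin_edges[:-1] + bin_edges[1:])     # bin centres

# right=False gives half-open bins [low, high), as in the loops above
slices = pd.cut(df_demo["distances"], bins=bin_edges, right=False)
grouped = df_demo.groupby(slices, observed=False)["residuals"]

y = grouped.mean().to_numpy()       # mean residual per distance bin
erry = grouped.std().to_numpy()     # std of the residuals per distance bin
```

Empty bins come out as NaN in both arrays, and `errorbar` simply skips those points.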