1\. **Kernel Density Estimate**

Produce a KDE for a given distribution (by hand, not using seaborn!):

* Fill a numpy array, x, of length N (with N=O(100)) with a normally distributed variable with a given mean and standard deviation
* Fill a histogram in pyplot, taking proper care of the aesthetics
   * use a meaningful number of bins
   * set a proper y axis label
   * set proper values for the y axis major tick labels (e.g. you want to display only integer labels)
   * display the histogram as data points with errors (the error being the Poisson uncertainty)
* For every element of x, create a gaussian with its mean equal to the element value and with the std as a parameter that can be tuned. The default value of the std should be:
$$ 1.06 * x.std() * x.size ^{-\frac{1}{5}} $$
You can use the scipy function `stats.norm()` for that.
* In a separate plot (to be placed beside the original histogram), plot all the gaussian functions so obtained
* Sum (with np.sum()) all the gaussian functions and normalize the result such that the integral matches the integral of the original histogram. For that you could use the `scipy.integrate.trapz()` function (named `scipy.integrate.trapezoid()` in recent SciPy releases)
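
The steps above can also be written without an explicit loop. The following is a minimal NumPy-only sketch, not the reference solution: the helper names `gaussian_pdf`, `trapezoid` and `kde_by_hand`, the seed, and the grid size are illustrative assumptions. It evaluates one gaussian per data point on a common grid via broadcasting, sums them with `np.sum()`, and rescales the sum so that its integral equals the integral of the histogram (total counts times bin width).

```python
import numpy as np

def gaussian_pdf(grid, mean, std):
    """Normal density with the given mean and std, evaluated on grid."""
    z = (grid - mean) / std
    return np.exp(-0.5 * z**2) / (std * np.sqrt(2.0 * np.pi))

def trapezoid(y, grid):
    """Trapezoidal-rule integral of y sampled on grid."""
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(grid)))

def kde_by_hand(x, grid, target_area, std=None):
    """Sum one gaussian per element of x and rescale the sum to target_area.

    std is the tunable kernel width; the default is the rule from the text.
    Returns the individual kernels (one row per data point) and their
    rescaled sum.
    """
    if std is None:
        std = 1.06 * x.std() * x.size ** (-1 / 5)
    # shape (N, len(grid)): row i is the gaussian centred on x[i]
    kernels = gaussian_pdf(grid[np.newaxis, :], x[:, np.newaxis], std)
    total = np.sum(kernels, axis=0)
    return kernels, total * target_area / trapezoid(total, grid)

# Illustrative data: the seed and the parameters are arbitrary choices
rng = np.random.default_rng(0)
x = rng.normal(10, 2, 100)

counts, edges = np.histogram(x, bins=10)
hist_area = counts.sum() * (edges[1] - edges[0])  # integral of the histogram

# Grid extending well past the data so that the kernel tails are covered
grid = np.linspace(x.min() - 6, x.max() + 6, 1000)
kernels, kde = kde_by_hand(x, grid, hist_area)
```

Passing a different `std` to `kde_by_hand` shows how the kernel width trades smoothness against detail; the rows of `kernels` are the individual curves for the second panel, and `kde` is the curve to overlay on the histogram.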


In [None]:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
from scipy import integrate
#normally distributed numpy array, x, of len(N) (with N=O(100))
mu, sigma = 10, 2 # mean and standard deviation
N=100
x = np.random.normal(mu, sigma, N)


#Fill a histogram
fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, figsize=(16, 8))

Nbins=10
#Nbins+1 edges, so that the last bin ends at x.max() and no entry is dropped
bins=np.linspace(x.min(), x.max(), Nbins+1)
freq, bins, patches = ax1.hist(x=x, bins=bins,alpha=1, histtype='bar', rwidth=0.8)

ax1.grid()
ax1.set_xlabel('x')
ax1.set_ylabel('counts per bin')
ax1.yaxis.set_major_locator(plt.MaxNLocator(integer=True)) #integer tick labels only

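#display the bin contents as data points with Poisson errors (sqrt of the counts)
bincenters = (bins[1:] + bins[:-1])/2
#the marker is set through fmt only; passing both fmt and marker is redundant
ax1.errorbar(x=bincenters, y=freq, yerr=np.sqrt(freq), fmt='*', c='r', markersize=4, capsize=5)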

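#gaussian: one kernel per data point, with a tunable std
std_def = 1.06 * x.std() * x.size ** (-1 / 5) #default value from the exercise text
#fine, evenly spaced grid extending past the data so that the kernel tails are covered
x_grid = np.linspace(x.min()-3*sigma, x.max()+3*sigma, 500)
gaussians = []
for dat in x:
    gaussians.append( norm(loc=dat, scale=std_def).pdf(x_grid) )
    ax2.plot(x_grid, gaussians[-1], alpha=0.5)
ax2.set_xlabel('x')
ax2.set_ylabel('f(x)')
ax2.grid()

#Sum all the gaussian functions and normalize the result
#integral of the histogram = total counts times bin width
area = freq.sum() * (bins[1]-bins[0])
kde = np.sum(gaussians, axis=0)
#rescale so that the integral of the sum matches the integral of the histogram
kde = kde / integrate.trapezoid(kde, x_grid) * area
ax1.plot(x_grid, kde, label='kde')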
ax1.legend()

plt.show()

2\. **Color-coded scatter plot**

Produce a scatter plot out of a dataset with two categories

* Write a function that generates a 2D dataset with 2 categories. Each category should be distributed as a 2D gaussian with a given mean and std (clearly it is better to have different mean values)
* Display the dataset in a scatter plot marking the two categories with different marker colors.

An example is given below

You can try to make the procedure more general by allowing a given number $n\ge 2$ of categories
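
One possible way to handle a given number of categories is sketched below. This is an illustration under stated assumptions rather than a reference solution: the function name `generate_categories`, its parameters, the seed, and the chosen means and stds are all hypothetical, and the x and y coordinates are drawn independently (no correlation between them).

```python
import numpy as np
import matplotlib.pyplot as plt

def generate_categories(means, stds, n_points=100, seed=None):
    """Sample n_points per category from independent 2D gaussians.

    means and stds each hold one (x, y) pair per category.
    Returns an (n_categories * n_points, 2) array of points and the
    matching integer labels.
    """
    rng = np.random.default_rng(seed)
    points = [rng.normal(loc=m, scale=s, size=(n_points, 2))
              for m, s in zip(means, stds)]
    labels = np.repeat(np.arange(len(points)), n_points)
    return np.concatenate(points), labels

# Three categories; the means and stds are arbitrary illustrative values
means = [(0, 0), (4, 3), (-3, 4)]
stds = [(1.0, 1.0), (1.5, 0.7), (0.8, 1.2)]
points, labels = generate_categories(means, stds, n_points=150, seed=1)

fig, ax = plt.subplots()
for k in np.unique(labels):
    sel = labels == k
    # each scatter call picks the next color of the default color cycle
    ax.scatter(points[sel, 0], points[sel, 1], s=15, alpha=0.7,
               label=f"category {k}")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.legend()
```

Returning the points together with integer labels keeps the plotting loop independent of the number of categories, so adding a category only requires one more entry in `means` and `stds`.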

In [None]:
! wget https://www.dropbox.com/s/u4y3k4kk5tc7j46/two_categories_scatter_plot.png
from IPython.display import Image
Image('two_categories_scatter_plot.png')

In [None]:
import numpy as np
import matplotlib.pyplot as plt
def generate_two_categories(mean1, std1, mean2, std2, n_points=200, seed=None):
    """Draw n_points per category from two 2D gaussians.

    Each mean and std is an (x, y) pair.
    Returns one (n_points, 2) array per category.
    """
    rng = np.random.default_rng(seed)
    cat1 = rng.normal(loc=mean1, scale=std1, size=(n_points, 2))
    cat2 = rng.normal(loc=mean2, scale=std2, size=(n_points, 2))
    return cat1, cat2

#two categories with different means and stds
cat1, cat2 = generate_two_categories(mean1=(2, 2), std1=(1, 1), mean2=(7, 6), std2=(1.5, 0.8))
print("category 1 shape:", cat1.shape)
print("category 2 shape:", cat2.shape)

#scatter plot, one color and marker per category
fig = plt.figure()
ax1 = fig.add_subplot(111)
ax1.scatter(cat1[:, 0], cat1[:, 1], c='b', marker="s", label='first')
ax1.scatter(cat2[:, 0], cat2[:, 1], c='r', marker="o", label='second')
ax1.set_xlabel('x')
ax1.set_ylabel('y')
plt.legend(loc='upper left')
plt.show()


3\. **Profile plot**

Produce a profile plot from a scatter plot.
* Download the following dataset and load it as a pandas dataframe:
```bash
wget https://www.dropbox.com/s/hgnvyj9abatk8g6/residuals_261.npy
```
Note that you should use the `np.load()` function to load the file as a numpy array, call the `.item()` method, and then pass it to the `pd.DataFrame()` constructor.
* Inspect the dataset; you'll find two variables (features)
* Clean the sample by selecting the entries (rows) with the variable "residuals" in absolute value smaller than 2
* Perform a linear regression of "residuals" versus "distances" using `scipy.stats.linregress()`
* Plot a seaborn jointplot of "residuals" versus "distances", letting seaborn perform a linear regression. The result of the regression should be displayed on the plot
* Fill 3 numpy arrays
  * x, serving as an array of bin centers for the "distances" variable. It should range from 0 to 20 with a reasonable number of steps (bins)
  * y, the mean values of the "residuals", estimated in slices (bins) of "distances"
  * erry, the standard deviation of the "residuals", estimated in slices (bins) of "distances"
* Plot the profile plot on top of the scatter plot
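
The binning steps above can be sketched as follows. Since the real file is not available here, the sketch runs on a synthetic stand-in for the dataset; the synthetic data, the helper name `profile`, the seed, and the choice of 10 bins are all assumptions made for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

def profile(distances, residuals, edges):
    """Mean and standard deviation of residuals in slices of distances."""
    x = 0.5 * (edges[1:] + edges[:-1])        # bin centers
    idx = np.digitize(distances, edges) - 1   # bin index of every entry
    y = np.full(x.size, np.nan)
    erry = np.full(x.size, np.nan)
    for i in range(x.size):
        in_bin = residuals[idx == i]
        if in_bin.size > 0:                   # empty bins stay NaN
            y[i] = in_bin.mean()
            erry[i] = in_bin.std()
    return x, y, erry

# Synthetic stand-in for the real dataset (assumption, not the file's content)
rng = np.random.default_rng(2)
distances = rng.uniform(0, 20, 2000)
residuals = 0.02 * distances + rng.normal(0, 0.5, distances.size)
keep = np.abs(residuals) < 2                  # same cleaning cut as in the text
distances, residuals = distances[keep], residuals[keep]

edges = np.linspace(0, 20, 11)                # 10 bins of width 2
x, y, erry = profile(distances, residuals, edges)

# Profile drawn on top of the scatter plot
fig, ax = plt.subplots()
ax.scatter(distances, residuals, s=4, alpha=0.2, label="data")
ax.errorbar(x, y, yerr=erry, fmt="o-", color="k", capsize=3, label="profile")
ax.set_xlabel("distances")
ax.set_ylabel("residuals")
ax.legend()
```

With the real dataframe, the two input arrays would be the cleaned "distances" and "residuals" columns converted with `.to_numpy()`.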

In [None]:
import pandas as pd   
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns
#please change the file path 
data=np.load("/Users/Selen/data/residuals_261.npy",allow_pickle=True).item()
data=pd.DataFrame(data)
# Inspect the dataset, you'll find two variables (features)
print(data)
print(data.info())
print("Columns:", data.columns)
# Clean the sample by selecting the entries (rows) with the variable "residuals" in absolute value smaller than 2
data = data[data['residuals'].abs() < 2]
print(data)
#residuals versus distances: distances on the x axis, residuals on the y axis
x = data.distances
y = data.residuals
#original data plot
plt.scatter(x,y, color="red", marker="o", label="Original data")
plt.xlabel("distances")
plt.ylabel("residuals")
plt.legend()
plt.show()
#Perform the linear regression of residuals versus distances:
slope, intercept, r_value, p_value, stderr = stats.linregress(x, y)
print("slope =",slope," intercept =",intercept," r_value =",r_value," p_value =",p_value," stderr =",stderr)
#seaborn, linear regression
sns.jointplot(data=data, x="distances", y="residuals", kind="reg")
# binning of distances
dis_array=np.array(data.distances)
bins=[0,5,10,15,20]
print(dis_array)
dis_bin=np.histogram(dis_array,bins = bins ) 
print(dis_bin)
#different way to bin
binned_data=pd.cut(dis_array,bins)
print(binned_data)
#mean and standard deviation of residuals according to bins of distances
mean_std=data.groupby(pd.cut(data['distances'], bins=bins))['residuals'].agg(['mean','std'])
print(mean_std)
