In [None]:
import numpy as np
import matplotlib as mp
import matplotlib.pyplot as plt
import matplotlib.patches as patches
import pandas as pd
import math

## Covariance and correlation

Let us calculate the covariance and correlation for real data using pandas. First we specify the data file name and take a peek at the data frame:

In [None]:
data_file = 'ASC_cg_export.tsv';
pd.read_csv(data_file,sep = '\t',low_memory=False)

Now let's actually read the data into a pandas data frame.

In [None]:
data_frame = pd.read_csv(data_file,sep = '\t',low_memory=False);

Let's check the column names:

In [None]:
data_frame.columns

Before actually calculating the variance-covariance matrix, let's simply look at scatter plots, where we plot the value in a given column as a function of the value in another column.

In [None]:
plt.clf();
plt.plot(data_frame['cg00510787'][:],data_frame['cg03169527'][:],'bo', alpha =0.4);
plt.xlabel('cg00510787');
plt.ylabel('cg03169527');
plt.xlim(0,1);
plt.ylim(0,1);
plt.axis('square');

Now create a grid of scatter plots, one panel per pair of columns, as the figure indicates. Try to take the axis labels from the column names of the data set.

In [None]:
plt.clf();
plt.subplot(121)
plt.plot(data_frame['cg00510787'][:],data_frame['cg03169527'][:],'bo', alpha =0.4);
plt.xlabel('cg00510787');
plt.ylabel('cg03169527');
plt.xlim(0,1);
plt.ylim(0,1);
plt.axis('square');

plt.subplot(122)
plt.plot(data_frame.iloc[:,1],data_frame.iloc[:,3],'bo',alpha = 0.4);
plt.xlabel(data_frame.columns[1]);
plt.ylabel(data_frame.columns[3]);
plt.xlim(0,1);
plt.ylim(0,1);
plt.axis('square');
plt.tight_layout();
plt.show();

plt.clf();
plt.subplot(121)
plt.plot(data_frame.iloc[:,1],data_frame.iloc[:,4],'bo',alpha = 0.4);
plt.xlabel(data_frame.columns[1]);
plt.ylabel(data_frame.columns[4]);
plt.xlim(0,1);
plt.ylim(0,1);
plt.axis('square');

plt.subplot(122)
plt.plot(data_frame.iloc[:,2],data_frame.iloc[:,3],'bo',alpha = 0.4);
plt.xlabel(data_frame.columns[2]);
plt.ylabel(data_frame.columns[3]);
plt.xlim(0,1);
plt.ylim(0,1);
plt.axis('square');
plt.tight_layout();
plt.show();

plt.clf();
plt.subplot(121)
plt.plot(data_frame.iloc[:,2],data_frame.iloc[:,4],'bo',alpha = 0.4);
plt.xlabel(data_frame.columns[2]);
plt.ylabel(data_frame.columns[4]);
plt.xlim(0,1);
plt.ylim(0,1);
plt.axis('square');

plt.subplot(122)
plt.plot(data_frame.iloc[:,3],data_frame.iloc[:,4],'bo',alpha = 0.4);
plt.xlabel(data_frame.columns[3]);
plt.ylabel(data_frame.columns[4]);
plt.xlim(0,1);
plt.ylim(0,1);
plt.axis('square');
plt.tight_layout();
plt.show();
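
The repeated cells above can also be written as a loop over all column pairs. The following is only a sketch under stated assumptions: `toy_columns` is a hypothetical dict of random arrays standing in for `data_frame` (a data frame can be indexed by column name in the same way), and the two-panels-per-row layout simply mirrors the cells above.

```python
from itertools import combinations

import numpy as np
import matplotlib.pyplot as plt

# Hypothetical stand-in for data_frame: four columns of values in [0, 1].
rng = np.random.default_rng(0)
toy_columns = {f'site_{k}': rng.uniform(0, 1, 100) for k in range(4)}

# Every unordered pair of column names: 4 columns give 6 pairs.
pairs = list(combinations(toy_columns, 2))

# Two panels per row, as in the cells above.
n_rows = (len(pairs) + 1) // 2
fig, axes = plt.subplots(n_rows, 2, figsize=(8, 4 * n_rows), squeeze=False)
for ax, (x_name, y_name) in zip(axes.flat, pairs):
    ax.plot(toy_columns[x_name], toy_columns[y_name], 'bo', alpha=0.4)
    ax.set_xlabel(x_name)   # axis labels come from the column names
    ax.set_ylabel(y_name)
    ax.set_xlim(0, 1)
    ax.set_ylim(0, 1)
    ax.set_aspect('equal')
fig.tight_layout()
```

With the real data, the same loop could presumably run over a selection of `data_frame.columns` instead of the toy names.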

Let's now calculate the covariance and the correlation matrices.

In [None]:
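data_frame.cov(numeric_only=True)   # numeric_only skips non-numeric columns (needs pandas >= 1.5)

In [None]:
data_frame.corr(numeric_only=True)  # Pearson correlation by default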
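
The two matrices are linked: each correlation is the covariance divided by the product of the two standard deviations, $\rho_{ij}=\mathrm{cov}_{ij}/(\sigma_i\sigma_j)$. A minimal NumPy sketch of this relation follows; the array `samples` is made-up data for illustration, not the methylation data set.

```python
import numpy as np

# Hypothetical data: rows are observations, columns are two correlated variables.
rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = 0.5 * x + rng.normal(scale=0.5, size=200)
samples = np.column_stack([x, y])

cov = np.cov(samples, rowvar=False)        # variance-covariance matrix
std = np.sqrt(np.diag(cov))                # standard deviations from the diagonal
corr_from_cov = cov / np.outer(std, std)   # divide cov_ij by sigma_i * sigma_j

corr = np.corrcoef(samples, rowvar=False)  # NumPy's own correlation matrix
print(np.allclose(corr_from_cov, corr))
```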

Of course, covariance can also be calculated in NumPy. To test this, let us first generate a multinomial data set. First we define the parameters of the multinomial distribution: the number of trials $N$ and a set of probabilities $p_i$ with $\sum_i p_i=1$.

In [None]:
N_mn = 50
p_mn = [.1,.2,.3,.4] # list of category probabilities (here 4); they must sum to 1
print(sum(p_mn))

In [None]:
multinom_data = np.random.multinomial(N_mn, p_mn, 10)
print(multinom_data);

In [None]:
multinom_cov_matr = np.cov(np.transpose(multinom_data));
multinom_cov_matr

As a comparison, let's also calculate the theoretical variance-covariance matrix.

In [None]:
theor_cov_matr = [[- N_mn*p_mn[i]*p_mn[j] if j!= i else N_mn*p_mn[i]*(1.0-p_mn[i]) for j in range(0,len(p_mn))] for i in range(0,len(p_mn))];
for line in theor_cov_matr:
    print(line);
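
The theoretical matrix can also be written compactly as $N\left(\mathrm{diag}(p)-pp^{T}\right)$, and with only 10 samples the empirical matrix above will be noisy. The sketch below uses hypothetical names `N_demo` and `p_demo` that mirror `N_mn` and `p_mn`, with an arbitrarily chosen sample size, to compare the two on a much larger sample.

```python
import numpy as np

N_demo = 50                               # hypothetical values mirroring N_mn and p_mn
p_demo = np.array([0.1, 0.2, 0.3, 0.4])

# Diagonal: N*p_i*(1 - p_i); off-diagonal: -N*p_i*p_j.
theor_cov = N_demo * (np.diag(p_demo) - np.outer(p_demo, p_demo))

# A large sample, so that the empirical matrix is close to the theoretical one.
rng = np.random.default_rng(2)
big_sample = rng.multinomial(N_demo, p_demo, size=100_000)
empirical_cov = np.cov(big_sample, rowvar=False)   # columns are the variables

max_abs_diff = np.abs(empirical_cov - theor_cov).max()
print(max_abs_diff)
```

Passing `rowvar=False` plays the same role as transposing the data before calling `np.cov`.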

# Inequalities

Let us assume IID Bernoulli random variables such as coin flips, where we do not know the parameter $p$, and only have access to the generated sample.

Our best guess (estimate) for $p$ based on the observed data is given by
\begin{equation}
\widehat{p}=\overline{X_n}=\frac{1}{n}\sum_{i=1}^n x_i.\nonumber
\end{equation}


Hoeffding's inequality states that as $n$ increases, the probability that this estimate deviates from the true value of $p$ by at least $\epsilon$ decays exponentially:
\begin{equation}
P\left(\left| \overline{X_n}- p\right| \geq \epsilon\right)\leq 2 e^{-2n\epsilon^2}.\nonumber
\end{equation}
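
As a quick sanity check, the bound can be compared with a simulated deviation frequency. This is only an illustrative sketch: `p_demo`, `n_demo`, `eps_demo` and `num_trials` are arbitrarily chosen values, not part of the exercise.

```python
import numpy as np

p_demo, n_demo, eps_demo = 0.4, 50, 0.15   # hypothetical illustration values
num_trials = 20_000

rng = np.random.default_rng(3)
flips = rng.binomial(1, p_demo, size=(num_trials, n_demo))  # one row per experiment
sample_means = flips.mean(axis=1)

# Fraction of experiments in which the mean is at least eps_demo away from p_demo.
empirical_tail = np.mean(np.abs(sample_means - p_demo) >= eps_demo)
hoeffding_bound = 2 * np.exp(-2 * n_demo * eps_demo**2)
print(empirical_tail, hoeffding_bound)
```

The simulated frequency should stay below the bound, typically by a wide margin, since Hoeffding's inequality holds for any bounded variable and is therefore conservative.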

Based on that we can construct an interval (called confidence interval) around the estimate $\widehat{p}=\overline{X_n}$ for which we can write down a guaranteed lower bound on the probability that the true $p$ falls within the interval. 

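Let us choose the guaranteed lower bound as $1-\alpha$, where $\alpha$ is a parameter. Setting the right-hand side of Hoeffding's inequality equal to $\alpha$ and solving for $\epsilon$, the interval for a fixed $\alpha$ is $\left(\widehat{p}-\epsilon,\ \widehat{p}+\epsilon\right)$, where
\begin{equation}
\epsilon = \sqrt{\frac{1}{2n}\ln\frac{2}{\alpha}}, \nonumber
\end{equation}
and in this setting the probability that the interval covers the true $p$ satisfies
\begin{equation}
P\left(\left|\overline{X_n}-p\right| < \epsilon\right) \geq 1-2e^{-2n\epsilon^2} = 1-\alpha.
\end{equation}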
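
To get a feel for the half-width that follows from setting $2e^{-2n\epsilon^2}=\alpha$, the sketch below tabulates it for a few sample sizes. The helper name `hoeffding_epsilon` and the chosen values of $n$ are hypothetical and serve only as illustration.

```python
import math

def hoeffding_epsilon(n, alpha):
    """Half-width obtained by solving 2*exp(-2*n*eps**2) = alpha for eps."""
    return math.sqrt(math.log(2.0 / alpha) / (2.0 * n))

# Half-width of the interval at alpha = 0.05 for a few sample sizes.
for n in (1, 10, 50, 1000):
    print(n, round(hoeffding_epsilon(n, 0.05), 4))
```

For very small $n$ the half-width exceeds 1, so the interval contains every possible value of $p$ and the guarantee holds trivially.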




Let us test this by actually generating Bernoulli random variables and measuring the **coverage** of the above interval, where coverage is the ratio of experiments where the true $p$ did fall into the interval.

First, we define the parameters.

In [None]:
p_true = 0.4;       # the p parameter of the Bernoulli distribution
num_flips = 50;   # the number of data points in one experiment
num_series = 1000;   # the number of experiments.
alpha = 0.05;

Let's just try out generating a small sample.

In [None]:
Bernoulli_samp = np.random.binomial(1, p_true, 10)  # 10 flips; 1 occurs with probability p_true, so the sample mean estimates p_true
print(Bernoulli_samp);

OK, it seems to work; let's now generate the longer samples.

In [None]:
B_series = [np.random.binomial(1, p_true, num_flips) for i in range(0,num_series)];  # num_series experiments of num_flips flips each

And finally, let's see how the coverage depends on the number of flips.

In [None]:
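cover_list, epsilon_list = [], []
for n in range(1, num_flips):
    # Half-width from solving 2*exp(-2*n*epsilon**2) = alpha for epsilon.
    epsilon = np.sqrt(np.log(2.0/alpha)/(2.0*n))
    epsilon_list.append(epsilon)
    # 1 if p_true lies within epsilon of the mean of the first n flips of a series, else 0
    # (B_series is assumed to be a list of 0/1 arrays of length num_flips).
    cover_indicator = [1 if abs(np.mean(series[:n]) - p_true) < epsilon else 0 for series in B_series]
    coverage = sum(cover_indicator)/num_series   # fraction of series whose interval covers p_true
    cover_list.append(coverage)
    #print('n=',n,'coverage=',coverage);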

In [None]:
plt.clf();
x_list = range(1,len(cover_list)+1);
plt.plot(x_list,cover_list);
plt.ylim(0.99,1.01);
plt.xlabel('num. flips');
plt.ylabel('coverage in '+str(num_series)+' series');
plt.show();
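
The same coverage curve can be computed without an explicit loop by taking running means with a cumulative sum. This is a sketch under stated assumptions: the `*_demo` names are hypothetical copies of the parameters above, and it covers $n=1,\dots,$ `flips_demo`.

```python
import numpy as np

# Hypothetical values mirroring p_true, num_flips, num_series and alpha.
p_demo, flips_demo, series_demo, alpha_demo = 0.4, 50, 1000, 0.05

rng = np.random.default_rng(4)
flips = rng.binomial(1, p_demo, size=(series_demo, flips_demo))

n_values = np.arange(1, flips_demo + 1)
running_means = np.cumsum(flips, axis=1) / n_values        # mean of the first n flips, per series
epsilons = np.sqrt(np.log(2 / alpha_demo) / (2 * n_values))  # half-width for each n

covered = np.abs(running_means - p_demo) < epsilons        # broadcasts over the series axis
coverage_by_n = covered.mean(axis=0)                       # fraction of series covering p_demo
print(coverage_by_n.min())
```

The minimum over $n$ should not fall below $1-\alpha$ by more than sampling noise, and in practice it tends to sit well above it because the Hoeffding interval is conservative.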