### Sampling

* There are instances where using the entire dataset is infeasible, impractical, or costly

* It is often possible to explore or develop on a subset of the data and, if needed, run on the complete dataset only once.

* Sampling consists of selecting a subset of the instances in a dataset as a proxy for the whole dataset
* In statistical terms, the original dataset is known as the population, while the subset is known as a sample.

* Sampling comes in several flavors and produces a smaller representation of the data that can be processed more efficiently


### Sampling

![](https://www.dropbox.com/s/vzr0w2cvx6b5bm0/sampling.png?dl=1)



### Representing Data 

* When working with data, a unit of observation, or simply observation, is the unit described by the data to be analyzed
  * Observations are often referred to as data points (or simply points) or instances.
    
* Given a dataset with D variables, each data point can be viewed as a point in a D-dimensional Euclidean space
  * For example, data with two variables is data in two dimensions.


<img src="https://www.dropbox.com/s/t0ca5t1qzkr635b/2d-data.png?dl=1" alt="Drawing" width="700px;"/>


### Representing Data  -- Example
<img src="https://www.dropbox.com/s/sy00s9t8bq0wbxb/data_2d.png?dl=1" alt="Drawing" width="900px;"/>


### Representing Data  -- Higher Dimensional Space

<img src="https://www.dropbox.com/s/4flfaojlfb2a101/3d-data.png?dl=1" alt="Drawing" style="width: 700px;"/>

### Sampling Strategies

* Many strategies can be used to draw a sample.
  * Ex. Select the first half of the data and discard the second half
    * "Half" can be replaced with any fraction
    * Clearly not a wise idea, especially if the data is ordered by some attribute
      * e.g., the first half consists of males and the second half of females
  * This is a poor strategy because it relies on something arbitrary: position in a list
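
A minimal sketch of why position-based selection is risky, assuming a hypothetical dataset of 100 records sorted by a sex attribute (the `records` list and the seed are made up for illustration). Taking the first half keeps only one group, while a random sample of the same size draws from both.

```python
import random
from collections import Counter

# Hypothetical dataset sorted by a "sex" attribute: 50 males, then 50 females.
records = [("M", i) for i in range(50)] + [("F", i) for i in range(50, 100)]

# Position-based sample: keep the first half of the list.
first_half = records[: len(records) // 2]

# Random sample of the same size (seed fixed only for reproducibility).
rng = random.Random(0)
random_half = rng.sample(records, len(records) // 2)

first_half_counts = Counter(sex for sex, _ in first_half)
random_half_counts = Counter(sex for sex, _ in random_half)

print("first half :", dict(first_half_counts))
print("random half:", dict(random_half_counts))
```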

* The following is a non-exhaustive list of sampling strategies commonly used to reduce the size of a dataset
   * These are used exclusively with tall data (many observations, relatively few variables)

### Random Subsample

* Takes a list/collection and returns a random subsample
* Each item has an equal probability of being picked
* Ideal when:
  * The data is balanced (relatively equal category proportions)
  * We don't care about missing edge cases

In [15]:
import random
# Draw 2 distinct items uniformly at random; the result varies from run to run.
random.sample([1, 2, 3, 4, 5], 2)

[4, 1]

### Subsampling 

* In some cases, we may want to explicitly select samples from all categories

![](https://www.dropbox.com/s/ld7kps6aq47jq8p/airport_obs.png?dl=1)

* For example, the objective is to carry out some analysis on a certain number of cities

* Other examples?

### Stratified Random Sampling

* The previous idea is referred to as stratified random sampling
  * Used when each data point can clearly be associated with a subgroup.
  * Ensures that the distribution of values for a particular feature within the sample matches the distribution of values for the same feature in the overall population.

* Allows you to:
    * maintain the structure of the population


![](https://www.dropbox.com/s/dq2wkc2l844857m/Screen%20Shot%202022-08-24%20at%204.19.11%20PM.png?dl=1)

### Stratified Random Sampling - Cont'd

* Instances in the population are first divided into homogeneous subgroups known as strata
* The number of elements selected in each stratum is proportional to the size of the stratum
  * Thus maintaining the structure


### Subsampling 

* The previous approach required each instance to fall into predetermined strata

* What if strata information is not readily available?

* We could assign the data to hypothetical strata
   * This is called clustering and will be covered later in the course
   * Clustering identifies clusters (lumps) of data where
     * Points within a cluster are very similar
     * Ideally, points are very dissimilar across clusters.

### Representation of an Observation

* Recall that a data point can be represented as a point in high-dimensional space

<img src="https://www.dropbox.com/s/jbriqx9i7jyloyl/data_clusts_1.png?dl=1" alt="drawing" width="500"/>



### Identifying Clusters

<img src="https://www.dropbox.com/s/jb81jew9hyf44pp/data_clusts_2.png?dl=1" alt="drawing" width="500"/>




### Cluster-Based Sampling 

<img src="https://www.dropbox.com/s/bkd32khr1q8dfhx/before_after.png?dl=1" alt="drawing" width="500"/>


### Cluster-Based Sampling

* Pros
  * Computationally efficient once clusters are established
  * Can be used to retain points close to centers or edge cases

* Cons
  * Number of clusters (hypothetical strata) can be difficult to guess
  * Big data can contain noisy observations and outliers that can affect the algorithm
  * Can result in loss of cluster shape
    * may make further data analysis complex (e.g., difficult to cluster)
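
As a rough sketch of the "retain points close to centers" option, the code below assumes cluster labels are already available (for instance from a clustering algorithm run beforehand) and keeps the few points nearest to each cluster's centroid. The helper names `centroid` and `sample_near_centers`, and the toy points and labels, are hypothetical.

```python
import math
from collections import defaultdict

def centroid(points):
    """Coordinate-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def sample_near_centers(points, labels, per_cluster):
    """Keep the `per_cluster` points closest to each cluster's centroid."""
    clusters = defaultdict(list)
    for point, label in zip(points, labels):
        clusters[label].append(point)

    kept = []
    for members in clusters.values():
        center = centroid(members)
        members_by_distance = sorted(members, key=lambda p: math.dist(p, center))
        kept.extend(members_by_distance[:per_cluster])
    return kept

# Toy data: two clusters, each with one point far from the rest.
points = [(0, 0), (0, 1), (1, 0), (5, 5), (10, 10), (10, 11), (11, 10), (20, 20)]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
kept = sample_near_centers(points, labels, per_cluster=2)
print(kept)
```

Sorting in the opposite order (farthest first) would retain edge cases instead. Either way the retained points no longer reflect the original cluster shape, which is the drawback listed above.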


### Kernel-Based Importance Sampling 

* Select each point with probability inversely proportional to its distance to the cluster center

<img src="https://www.dropbox.com/s/ncrusv71mylkat8/dist_to_center.png?dl=1" alt="drawing" width="700"/>
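
One possible reading of this idea, sketched under assumptions: each point gets a weight of 1 / (distance to the center), and points are drawn one at a time without replacement according to those weights. The function name `inverse_distance_sample`, the small `eps` guard, and the toy points are illustrative. Drawing without replacement means the final inclusion probabilities are only approximately proportional to the weights.

```python
import math
import random

def inverse_distance_sample(points, center, k, seed=None, eps=1e-9):
    """Draw k distinct points, favouring those close to `center`."""
    rng = random.Random(seed)
    remaining = list(points)
    # Weight is the inverse of the distance to the center; eps avoids division by zero.
    weights = [1.0 / (math.dist(p, center) + eps) for p in remaining]
    chosen = []
    for _ in range(min(k, len(remaining))):
        idx = rng.choices(range(len(remaining)), weights=weights, k=1)[0]
        chosen.append(remaining.pop(idx))
        weights.pop(idx)
    return chosen

points = [(1, 0), (0, 2), (3, 0), (0, 4), (10, 0)]
chosen = inverse_distance_sample(points, center=(0, 0), k=3, seed=0)
print(chosen)
```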


### Weighted Density Sampling

* Weighted density sampling is a method to discard data points (events) based on their local density.

* The probability that a data point will be sampled is based on the local density; i.e., density around that data point.

* Local density is defined as the number of data points within a given radius of the point. 

  * The radius used is typically a user-defined parameter  

* Some memory-efficient algorithms can approximate the local density using a single scan of the data
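
A brute-force sketch of the definition above: count the neighbours within a radius, then keep each point with a probability derived from that count. The text does not specify how density maps to probability, so keeping a point with probability 1 / density is an assumption made here, as are the names `local_density` and `density_weighted_sample`. The quadratic scan is for illustration only and is not one of the memory-efficient single-scan algorithms mentioned above.

```python
import math
import random

def local_density(points, radius):
    """Count, for every point, how many points (itself included) lie within `radius`."""
    return [sum(1 for q in points if math.dist(p, q) <= radius) for p in points]

def density_weighted_sample(points, radius, seed=None):
    """Keep each point with probability 1 / local density (an assumed weighting)."""
    rng = random.Random(seed)
    densities = local_density(points, radius)
    return [p for p, d in zip(points, densities) if rng.random() < 1.0 / d]

# Five tightly packed points and one isolated point.
points = [(0, 0), (0.1, 0), (0, 0.1), (0.1, 0.1), (0.05, 0.05), (100, 100)]
densities = local_density(points, radius=0.5)
sample = density_weighted_sample(points, radius=0.5, seed=3)

print(densities)
print(sample)
```

Under this weighting, dense regions are thinned out while isolated points (density 1) are always kept.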

### Hash-Based (Projection-Based) Sampling

* Divide the data into grid cells and select a number of points per cell
  * It is possible to choose a fixed number of points or to select points based on some cell-specific criteria


![](https://www.dropbox.com/s/nijshpr93rth8fn/grid.png?dl=1)
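
A small sketch of the fixed-number-per-cell variant, assuming axis-aligned square cells of side `cell_size` (the function name `grid_sample` and its parameters are hypothetical). The tuple of cell indices plays the role of the hash key.

```python
import math
import random
from collections import defaultdict

def grid_sample(points, cell_size, per_cell, seed=None):
    """Bucket points into square grid cells and keep up to `per_cell` from each."""
    rng = random.Random(seed)
    cells = defaultdict(list)
    for p in points:
        # The cell index along each axis acts as the hash key of the point.
        key = tuple(math.floor(coord / cell_size) for coord in p)
        cells[key].append(p)

    sample = []
    for members in cells.values():
        sample.extend(rng.sample(members, min(per_cell, len(members))))
    return sample

points = [(0.1, 0.2), (0.4, 0.9), (0.8, 0.3), (1.5, 0.5), (1.2, 0.1), (3.7, 3.9)]
sample = grid_sample(points, cell_size=1.0, per_cell=1, seed=0)
print(sample)
```

The six points fall into three cells, so one point per cell yields a sample of three.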

### Random Projections: An intuition

* Given some randomly selected line, project each point perpendicularly onto that line

* Two points that are close in higher dimensional space will also be close on the line with high probability
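
To make the intuition concrete, here is a sketch that projects three 5-dimensional points onto one random direction. The points `a`, `b`, `c`, the seed, and the helper names are made up for illustration; the random direction is obtained by normalizing Gaussian coordinates.

```python
import math
import random

def random_unit_vector(dim, rng):
    """Draw a random direction by normalizing Gaussian coordinates."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def project(point, unit_vector):
    """Scalar coordinate of `point` along `unit_vector` (a dot product)."""
    return sum(p * u for p, u in zip(point, unit_vector))

rng = random.Random(7)
line = random_unit_vector(5, rng)

a = [1.0, 2.0, 3.0, 4.0, 5.0]
b = [1.1, 2.0, 2.9, 4.0, 5.1]    # close to a
c = [9.0, -3.0, 7.0, 0.0, -6.0]  # far from a

gap_ab = abs(project(a, line) - project(b, line))
gap_ac = abs(project(a, line) - project(c, line))
print(f"|a-b| = {math.dist(a, b):.3f}, gap on the line = {gap_ab:.3f}")
print(f"|a-c| = {math.dist(a, c):.3f}, gap on the line = {gap_ac:.3f}")
```

For a unit-length direction, the gap on the line can never exceed the original distance, so nearby points stay nearby. The probabilistic part is the converse: far-apart points usually, but not always, remain apart, which is why several projections are typically combined.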




### Projecting Onto Vector

<img src="https://www.dropbox.com/s/wdiqdrphg0wzhmi/a_b_projected.png?dl=1" alt="drawing" width="600"/>


### Normalizing by the Vector Size
<img src="https://www.dropbox.com/s/bqaeehgtvm9c9ga/a_b_line.png?dl=1" alt="drawing" width="800"/>


### Project on New Orthogonal Vectors

<img src="https://www.dropbox.com/s/wf0nh45do2193i8/a_b_vec_2.png?dl=1" alt="drawing" width="800"/>


### Convert to New Space (Grid)


<img src="https://www.dropbox.com/s/ggohljkopdi6l6n/a_b_new_space.png?dl=1" alt="drawing" width="800"/>



### Similarity-Preserving Projection-Based Approach

Pros: 
* Some theoretical guarantees that points close in high-dimensional space will be assigned to the same grid elements (buckets)
  * See the [Johnson-Lindenstrauss lemma](https://en.wikipedia.org/wiki/Johnson%E2%80%93Lindenstrauss_lemma)
  * Approach similar to locality-sensitive hashing or locality-preserving hashing
    [See](https://en.wikipedia.org/wiki/Locality-sensitive_hashing)

* Computationally efficient
  * Can be easily/quickly computed on GPUs
* Works well with wide data

Cons:
  * Probabilistic in nature: infinite number of possible grids to choose from
  * While a 2-D grid is used here for illustration, depending on the dimension of the data, grids in a much higher-dimensional space may be needed
  * Labor intensive to assess quality of resulting bins.

  


### Projection as a Dot Product


* How do you project a point onto a new axis?
 
* The dot product (or inner product) of two vectors gives the projection of one onto the line spanned by the other, scaled by the length of the other.

* It describes the projected vector in terms of the reference vector
  * After normalizing, we get a quantity that represents the magnitude of the projected vector in terms of the reference vector

$$
\mathrm{Proj}_B A = \frac{A \cdot B}{|B|},
$$
where $A \cdot B$ is simply the dot product of $A$ and $B$:


$$
A \cdot B = \sum_{i=1}^{n} A_i B_i,
$$


where $n$ is the dimension of $A$ and $B$, and $|B|$ is the magnitude of $B$, computed as:

  * $|B| = \sqrt{\sum_{i=1}^{n} B_i^2}$

* The denominator is needed to normalize the resulting quantity (express it in terms of vector $B$)


In [12]:
import math

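# Compute the scalar projection of A onto B, then normalize it by |B|.
A = (3, 4)
B = (5, 2)

A_dot_B = A[0]*B[0] + A[1]*B[1]         # dot product of A and B
amp_B = math.sqrt(B[0]**2 + B[1]**2)    # magnitude |B|
print(f"The magnitude of B is: {amp_B}")
proj = A_dot_B / amp_B                  # length of A's projection onto B
print(f"The magnitude of the projection  {proj}")

# Dividing by |B| again expresses the projection as a fraction of B's length.
print(f"The ratio of the projection in terms of B is {proj/amp_B}")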

The magnitude of B is: 5.385164807134504
The magnitude of the projection  4.270992778072193
The ratio of the projection in terms of B is 0.7931034482758621



<img src="https://www.dropbox.com/s/ggohljkopdi6l6n/a_b_new_space.png?dl=1" alt="drawing" width="800"/>


In [13]:
### Projection as a Dot Product

A = (0.95, 8)
B = (5,7)
C = (6,5)
D = (4,2)

amp_D = math.sqrt(D[0]**2 + D[1]**2)
print(f"The magnitude of D is: {amp_D}\n")


A_dot_D = A[0]*D[0] + A[1]*D[1]
proj_A = A_dot_D / amp_D
print (f"The magnitude of the projection  {proj_A}")
print(f"The ratio of the projection in terms of D is {proj_A/amp_D}")
print(f"Projection occurs in bin {math.ceil(proj_A/amp_D)} \n")


B_dot_D = B[0]*D[0] + B[1]*D[1]
proj_B = B_dot_D / amp_D
print (f"The magnitude of the projection  {proj_B}")
print(f"The ratio of the projection in terms of D is {proj_B/amp_D}")
print(f"Projection occurs in bin {math.ceil(proj_B/amp_D)} \n")


C_dot_D = C[0]*D[0] + C[1]*D[1]
proj_C = C_dot_D / amp_D
print (f"The magnitude of the projection  {proj_C}")
print(f"The ratio of the projection in terms of D is {proj_C/amp_D}")
print(f"Projection occurs in bin {math.ceil(proj_C/amp_D)} \n")


The magnitude of D is: 4.47213595499958

The magnitude of the projection  4.427414595449584
The ratio of the projection in terms of D is 0.99
Projection occurs in bin 1 

The magnitude of the projection  7.602631123499284
The ratio of the projection in terms of D is 1.6999999999999997
Projection occurs in bin 2 

The magnitude of the projection  7.602631123499284
The ratio of the projection in terms of D is 1.6999999999999997
Projection occurs in bin 2 
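
The per-point computation above can be wrapped into helpers that work in any dimension and on several reference vectors at once, giving one bin index per axis, that is, a grid cell. This is a sketch only: the names `scalar_projection` and `bucket`, and the second axis `E` (chosen to be orthogonal to `D`), are assumptions for illustration.

```python
import math

def scalar_projection(a, b):
    """Length of the projection of a onto b: (a . b) / |b|."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / math.sqrt(sum(y * y for y in b))

def bucket(point, axes):
    """Grid cell of `point`: one bin index per reference vector in `axes`."""
    key = []
    for axis in axes:
        length = math.sqrt(sum(y * y for y in axis))
        # Same binning as the cell above: ratio of the projection to |axis|, rounded up.
        key.append(math.ceil(scalar_projection(point, axis) / length))
    return tuple(key)

D = (4, 2)
E = (-2, 4)  # hypothetical second axis, orthogonal to D (D . E == 0)

A = (0.95, 8)
B = (5, 7)
C = (6, 5)
for name, point in (("A", A), ("B", B), ("C", C)):
    print(name, bucket(point, (D, E)))
```

Points that share a bucket key (here `B` and `C`) are treated as neighbours, and sampling then proceeds per bucket as in the grid-based approach.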



### Time Series

* A time series is a collection of data points indexed in time order.
* A great deal of big data is available in the form of time series

<img src="https://www.dropbox.com/s/2lv0jdplyv6tyfa/stock_value.png?dl=1" alt="drawing" width="800"/>



### Sampling from Time Series

* Some commonly used strategies
  * Random: select every x-th time point, where x is chosen at random
  * Systematic: subsample every k-th record
  * Bucketed: split the time series into sections or "buckets", then take an equal number of samples from each bucket
    * a bucket can be hourly or daily, for example
  * Sliding window: slide a window over the series and apply a strategy to pick points within it. For example:
    * Select points randomly using distance from the mean in the window
    * Select a fixed number of points per window
    * etc.
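
Two of these strategies, sketched with the standard library under simple assumptions: the series is a list of `(timestamp, value)` pairs with integer minute timestamps, and buckets are defined by a user-supplied function. The names `every_kth` and `bucket_sample` and the toy series are hypothetical.

```python
import random
from collections import defaultdict

def every_kth(series, k):
    """Systematic subsample: keep every k-th record, starting with the first."""
    return series[::k]

def bucket_sample(series, bucket_of, per_bucket, seed=None):
    """Group (timestamp, value) records into buckets and draw equally from each."""
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for timestamp, value in series:
        buckets[bucket_of(timestamp)].append((timestamp, value))

    sample = []
    for key in sorted(buckets):
        members = buckets[key]
        picked = rng.sample(members, min(per_bucket, len(members)))
        sample.extend(sorted(picked))  # restore time order inside the bucket
    return sample

# Hypothetical series: one reading per minute for three hours.
series = [(minute, 100 + minute % 7) for minute in range(180)]
thinned = every_kth(series, 30)
hourly = bucket_sample(series, bucket_of=lambda t: t // 60, per_bucket=2, seed=1)

print([t for t, _ in thinned])
print([t for t, _ in hourly])
```

Here `every_kth` keeps one reading every 30 minutes, and `bucket_sample` keeps two readings from each hourly bucket.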