### Sampling

* Sometimes using the entire dataset is infeasible, impractical, or expensive.
* Often, it's possible to work with a subset of the data and only process the complete dataset if necessary.
* Sampling involves selecting a subset of instances from a dataset as a representation of the entire dataset.
* In statistical language, the full dataset is termed the **population**, while the subset is called a **sample**.
* There are various methods and strategies for sampling.
    * Each provides a condensed version of the data for more efficient processing.



### Sampling

<div align="center">
    <img src="https://www.dropbox.com/s/vzr0w2cvx6b5bm0/sampling.png?dl=1" width="700" alt="Sampling Image">
</div>


### Representing Data 

* A unit of observation, also called an "observation", is the individual entity described by the data.
* Such observations are also called data points, instances, or simply points.
* For a dataset with D variables, a data point can be visualized as a point in a D-dimensional Euclidean space.
  * For instance, a dataset with two variables can be represented in two dimensions.
<div align="center">
<img src="https://www.dropbox.com/s/t0ca5t1qzkr635b/2d-data.png?dl=1" alt="Drawing" width="700px;"/>
</div>



### Representing Data  -- Example

<div align="center">
    <img src="https://www.dropbox.com/s/sy00s9t8bq0wbxb/data_2d.png?dl=1" alt="Drawing" width="900px;"/>
</div>

### Representing Data  -- Higher Dimensional Space
<div align="center">
<img src="https://www.dropbox.com/s/4flfaojlfb2a101/3d-data.png?dl=1" alt="Drawing" style="width: 700px;"/>
</div>

### Sampling Strategies

* Different methods can be employed to draw samples.
  * A simple method might involve selecting the first half of the data and discarding the rest.
    * This is not always effective, especially if the data is organized based on specific attributes, such as gender.
  * Using position in a list as a strategy is arbitrary and not always effective.
* Several more sophisticated sampling strategies exist to efficiently reduce dataset size, especially for extensive datasets.
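
To make the pitfall of position-based sampling concrete, here is a minimal sketch. The toy `records` list and the `group_counts` helper are hypothetical, chosen only to mimic a dataset that is sorted by one attribute.

```python
import random

# Hypothetical toy dataset: 10 records sorted by a "group" attribute.
records = [("A", i) for i in range(5)] + [("B", i) for i in range(5)]

def group_counts(sample):
    """Count how many records of each group appear in the sample."""
    counts = {}
    for group, _ in sample:
        counts[group] = counts.get(group, 0) + 1
    return counts

# Positional sampling: keep the first half, discard the rest.
first_half = records[: len(records) // 2]

# Random sampling with a fixed seed so the run is repeatable.
rng = random.Random(0)
random_half = rng.sample(records, len(records) // 2)

print("first half :", group_counts(first_half))   # only group "A" survives
print("random half:", group_counts(random_half))
```

Because the data is ordered by group, the first half contains no record from group "B" at all, whereas a random draw of the same size is not tied to the ordering.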


### Random Subsample
* Random subsampling is a method that returns a random subset of items from a collection.
* Each item has an equal chance of being selected.
* Best suited for datasets that:
  * Have relatively balanced category proportions.
  * Don't need edge cases or outliers to be specifically accounted for or analyzed.

In [15]:
import random
random.sample([1,2,3,4,5], 2)

[4, 1]

### Special Cases

* While random subsampling treats every data point equally, there are scenarios where our data has more complex structures or patterns.
  * For example, we might want our sample to mirror certain characteristics of the full dataset or ensure specific segments are represented.
* For this, we turn to subsampling, a broader umbrella term that encompasses various methods including random and stratified subsampling.

### Subsampling 

* Unstructured Data: If data doesn't have predefined categories or strata, how do you ensure a representative sample?

* Clustering: For such data, we can create 'hypothetical' groups or strata using clustering.
  * Clustering groups data points by similarity, so that members of one group are more alike than members of different groups.

* Explicit Selection: Sometimes, we need a sample from each category or group.
  * For instance, if we're analyzing cities, we might want to make sure we have samples from each city in our study.

<div align="center">
    <img src="https://www.dropbox.com/s/ld7kps6aq47jq8p/airport_obs.png?dl=1" width="700px">
</div>
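
Explicit selection from every group can be sketched as follows. The function name `sample_per_group`, its parameters, and the city observations are all hypothetical and used only for illustration.

```python
import random
from collections import defaultdict

def sample_per_group(rows, key, n_per_group, seed=None):
    """Draw up to n_per_group rows from every group defined by key(row)."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row)
    sample = []
    for members in groups.values():
        # A group smaller than n_per_group contributes all of its rows.
        k = min(n_per_group, len(members))
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical (city, measurement) observations.
observations = [
    ("Boston", 12), ("Boston", 15), ("Boston", 9),
    ("Denver", 30),
    ("Miami", 22), ("Miami", 25),
]
picked = sample_per_group(observations, key=lambda row: row[0], n_per_group=1, seed=42)
print(picked)
```

Every city appears in the result, even "Denver", which a small purely random draw could easily miss.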


* Your Turn: Can you think of other scenarios where specific subsampling methods might be useful?






### Stratified Random Sampling

* Utilized when data can be associated with specific subgroups.
* It ensures that the sample retains the distribution of values for specific features present in the entire population.
* This approach:
  * Maintains the population's structure.
  * Divides the population into homogeneous subgroups or strata.
  * Selects elements from each stratum in proportion to the stratum's size.

![](https://www.dropbox.com/s/dq2wkc2l844857m/Screen%20Shot%202022-08-24%20at%204.19.11%20PM.png?dl=1)
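
A minimal sketch of proportional allocation, assuming each row carries a label that identifies its stratum. The name `stratified_sample` and the rule of keeping at least one element per stratum are assumptions of this sketch, not part of the method's definition.

```python
import random
from collections import Counter, defaultdict

def stratified_sample(rows, key, fraction, seed=None):
    """Sample the same fraction from every stratum defined by key(row)."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for row in rows:
        strata[key(row)].append(row)
    sample = []
    for members in strata.values():
        # Proportional allocation; assumption: keep at least one row per stratum.
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical population: 60 rows in stratum "F", 40 rows in stratum "M".
population = [("F", i) for i in range(60)] + [("M", i) for i in range(40)]
sample = stratified_sample(population, key=lambda row: row[0], fraction=0.1, seed=1)
print(Counter(label for label, _ in sample))
```

The 60/40 split of the population is preserved as 6/4 in the sample, which is what "maintaining the population's structure" means here.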

### Stratified Random Sampling - Cont'd

* Instances in the population are first divided into homogeneous subgroups known as strata.
* The number of elements selected from each stratum is proportional to the size of the stratum.
  * This maintains the structure of the population.


### Stratified Random Sampling - Cont'd

* Sometimes, data doesn't come with predefined strata.
* In such cases, data can be assigned to hypothetical strata, a process known as clustering.
* Clustering groups data points based on similarity, ideally ensuring that data points in one cluster are more similar to each other than to those in other clusters.
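
As one possible way to create such hypothetical strata, here is a small k-means sketch in plain Python. K-means is only one of many clustering algorithms and is not named by these notes; the farthest-point initialization and the fixed iteration count are assumptions made to keep the example short and deterministic.

```python
import math

def kmeans(points, k, iterations=20):
    """Plain Lloyd-style k-means; returns (centers, labels)."""
    # Assumption: farthest-point initialization, which is deterministic.
    centers = [points[0]]
    while len(centers) < k:
        farthest = max(points, key=lambda p: min(math.dist(p, c) for c in centers))
        centers.append(farthest)

    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center.
        labels = [min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points]
        # Update step: move each center to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster ends up empty
                centers[j] = tuple(sum(coord) / len(members) for coord in zip(*members))
    return centers, labels

# Hypothetical 2-D points forming two well-separated groups.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.3)]
centers, labels = kmeans(points, k=2)
print(labels)  # [0, 0, 0, 1, 1, 1]
```

The resulting labels can then play the role of strata in a stratified sampling step.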

### Representation of an Observation

* Recall that each data point can be represented as a point in high-dimensional space.

<div align="center">
<img src="https://www.dropbox.com/s/jbriqx9i7jyloyl/data_clusts_1.png?dl=1" alt="drawing" width="500"/>
</div>



### Identifying Clusters
<div align="center">
<img src="https://www.dropbox.com/s/jb81jew9hyf44pp/data_clusts_2.png?dl=1" alt="drawing" width="500"/>
</div>



### Cluster-Based Sampling 
* Clusters before vs. after sampling.
<div align="center">
    <img src="https://www.dropbox.com/s/bkd32khr1q8dfhx/before_after.png?dl=1" alt="drawing" width="500"/>
</div>

### Cluster-Based Sampling

* Advantages:
  * Efficient once clusters are defined.
  * Can include central points or edge cases.

* Limitations:
  * Determining the number of clusters (strata) can be challenging.
  * The presence of outliers or noisy observations can skew clustering.
  * Might distort the shape of clusters, complicating subsequent analysis.

### Kernel-Based Importance Sampling 

* Select each point with probability inversely proportional to its distance from the cluster center.

<div align="center">
<img src="https://www.dropbox.com/s/ncrusv71mylkat8/dist_to_center.png?dl=1" alt="drawing" width="700"/>
</div>
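
The selection rule can be sketched with `random.choices`, which accepts per-item weights. The weight `1 / (distance + eps)` is just one simple choice of kernel; the small `eps` term and sampling with replacement are assumptions of this sketch.

```python
import math
import random

def inverse_distance_weights(points, center, eps=1e-9):
    """Weight each point by 1 / (distance to center); eps avoids division by zero."""
    return [1.0 / (math.dist(p, center) + eps) for p in points]

def importance_sample(points, center, n, seed=None):
    """Draw n points (with replacement), favoring those near the center."""
    rng = random.Random(seed)
    weights = inverse_distance_weights(points, center)
    return rng.choices(points, weights=weights, k=n)

# Hypothetical cluster members and cluster center.
cluster = [(0.1, 0.0), (1.0, 0.0), (4.0, 0.0)]
center = (0.0, 0.0)

weights = inverse_distance_weights(cluster, center)
drawn = importance_sample(cluster, center, n=10, seed=0)
print(weights)
print(drawn)
```

Points near the center receive the largest weights, so they dominate the sample, while distant points are still selectable with small probability.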

### Weighted Density Sampling

* Weighted density sampling is a method used to select (or discard) data points based on their local density.
  * The probability that a data point will be sampled is determined by its local density; that is, the density around that specific data point.
* Local density is defined as the number of data points within a given radius of that point.
* The radius used is typically defined by the user.
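
A brute-force sketch of the idea follows. Two choices here are assumptions: each point counts itself in its own density (so the density is never zero), and the sampling weight is `1 / density`, which favors sparse regions. The opposite weighting is equally possible depending on the goal.

```python
import math
import random

def local_density(points, radius):
    """Number of points (including the point itself) within radius of each point."""
    return [
        sum(1 for q in points if math.dist(p, q) <= radius)
        for p in points
    ]

def density_weighted_sample(points, radius, n, seed=None):
    """Draw n points (with replacement) with weight 1 / local density."""
    rng = random.Random(seed)
    density = local_density(points, radius)
    weights = [1.0 / d for d in density]  # assumption: favor sparse regions
    return rng.choices(points, weights=weights, k=n)

# Hypothetical data: a dense group of four points and one isolated point.
points = [(0, 0), (0.1, 0), (0, 0.1), (0.1, 0.1), (5, 5)]
density = local_density(points, radius=0.5)
print(density)  # [4, 4, 4, 4, 1]
print(density_weighted_sample(points, radius=0.5, n=5, seed=0))
```

The pairwise distance computation is quadratic in the number of points, so a spatial index would be needed for large datasets.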


### Grid-based Sampling

* Divide the data into grid cells and select a number of points per cell
  * It is possible to choose a fixed number of points or to select points based on some cell-specific criteria

* Provides an organized way of ensuring that samples are spread out across the data space
  * Especially important when the distribution of data points is uneven.

* Ideal for spatial and geospatial data.

![](https://www.dropbox.com/s/nijshpr93rth8fn/grid.png?dl=1)
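
A minimal 2-D sketch of grid-based sampling with a fixed number of points per cell. The function name `grid_sample`, the square cells of side `cell_size`, and the sample points are hypothetical.

```python
import math
import random
from collections import defaultdict

def grid_sample(points, cell_size, n_per_cell, seed=None):
    """Assign 2-D points to square grid cells and draw up to n_per_cell from each."""
    rng = random.Random(seed)
    cells = defaultdict(list)
    for x, y in points:
        # The cell index is the integer part of each coordinate divided by the cell size.
        cell = (math.floor(x / cell_size), math.floor(y / cell_size))
        cells[cell].append((x, y))
    sample = []
    for members in cells.values():
        sample.extend(rng.sample(members, min(n_per_cell, len(members))))
    return sample

# Hypothetical points: three in cell (0, 0), two in cell (1, 0), one in cell (3, 3).
points = [(0.1, 0.2), (0.4, 0.3), (0.8, 0.9), (1.5, 0.2), (1.7, 0.6), (3.2, 3.9)]
picked = grid_sample(points, cell_size=1.0, n_per_cell=1, seed=0)
print(picked)
```

The isolated point in cell (3, 3) is always kept, which illustrates how the grid protects sparse regions from being dropped.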

### Grid-based Sampling
* Advantages:
  * Uniform Coverage: ensures that the entire data space is uniformly represented.
    * Mitigates the risk of sampling bias, especially in areas of sparse data.
  * Flexibility: select either a fixed number of points from each cell or use a cell-specific criterion.
  * The grid is a natural stratification of the data space.
    
* Limitations:  
  * The size and shape of the grid cells can influence the sampling results.
    * Too large a grid might miss localized patterns, while too small a grid might lead to over-representation of noisy patterns.
  * Computationally Intensive: Especially for high-dimensional data, creating and managing grid cells can be computationally challenging.



### Grid-based Sampling Using Random Projections: An intuition

* Given a randomly selected line, project each point perpendicularly onto that line.

* Two points that are close in higher dimensional space will also be close on the line with high probability
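
The intuition can be checked numerically with a short sketch. The helper names, the 5-dimensional points, and the Gaussian way of drawing a random direction are assumptions made for illustration.

```python
import math
import random

def random_unit_vector(dim, rng):
    """Random direction: Gaussian components scaled to length 1."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]

def project(point, unit):
    """Signed position of the point's perpendicular projection on the line."""
    return sum(p * u for p, u in zip(point, unit))

rng = random.Random(7)
line = random_unit_vector(5, rng)

a = [1.0, 2.0, 3.0, 4.0, 5.0]
b = [1.1, 2.0, 2.9, 4.0, 5.1]    # close to a
c = [9.0, -3.0, 0.0, 7.0, -2.0]  # far from a

gap_ab = abs(project(a, line) - project(b, line))
gap_ac = abs(project(a, line) - project(c, line))
print(f"a-b: original {math.dist(a, b):.3f}, on the line {gap_ab:.3f}")
print(f"a-c: original {math.dist(a, c):.3f}, on the line {gap_ac:.3f}")
```

Projection onto a line never increases the distance between two points, so nearby points always stay nearby. The probabilistic part is the reverse direction: distant points usually stay apart, but an unlucky line can place them close together.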




### Projecting Onto Vector
<div align="center">
    <img src="https://www.dropbox.com/s/wdiqdrphg0wzhmi/a_b_projected.png?dl=1" alt="drawing" width="600"/>
</div>

### Normalizing by the Vector Size
<img src="https://www.dropbox.com/s/bqaeehgtvm9c9ga/a_b_line.png?dl=1" alt="drawing" width="800"/>


### Finding Orthogonal Vectors

* To find a vector that is orthogonal (or perpendicular) to a given vector, you need a vector whose dot product with the given vector is zero.

* Given vector $ \mathbf{v} = \begin{bmatrix} 5 \\ 2 \end{bmatrix} $.

To find a vector $ \mathbf{u} = \begin{bmatrix} x \\ y \end{bmatrix} $ that is orthogonal to $ \mathbf{v} $, the dot product $ \mathbf{u} \cdot \mathbf{v} $ must be zero:

$$ 
\mathbf{u} \cdot \mathbf{v} = 5x + 2y = 0 
$$



### Finding Orthogonal Vectors -- cont'd

* One way to find an orthogonal vector is to swap the components and negate one of them. Using this method:

* If $ \mathbf{v} = \begin{bmatrix} a \\ b \end{bmatrix} $, then an orthogonal $ \mathbf{u} $ can be $ \begin{bmatrix} -b \\ a \end{bmatrix} $.


* There are infinitely many vectors orthogonal to $ \mathbf{v} $, but $ \begin{bmatrix} -2 \\ 5 \end{bmatrix} $ is one of the simplest ones.
  * You could scale this vector by any scalar, and it would still be orthogonal to $ \mathbf{v} $.
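
The swap-and-negate rule is easy to verify in code. The helper names `dot` and `orthogonal_2d` are hypothetical.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def orthogonal_2d(v):
    """Swap the components and negate the first: (a, b) -> (-b, a)."""
    a, b = v
    return (-b, a)

v = (5, 2)
u = orthogonal_2d(v)
print(u, dot(u, v))            # (-2, 5) 0

# Scaling u keeps it orthogonal to v.
scaled = tuple(3 * c for c in u)
print(scaled, dot(scaled, v))  # (-6, 15) 0
```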


### Project on New Orthogonal Vectors

<img src="https://www.dropbox.com/s/wf0nh45do2193i8/a_b_vec_2.png?dl=1" alt="drawing" width="800"/>


### Convert to New Space (Grid)


<img src="https://www.dropbox.com/s/ggohljkopdi6l6n/a_b_new_space.png?dl=1" alt="drawing" width="800"/>



### Similarity Preserving Projection Based Approach

* Pros:
  * Offers probabilistic guarantees that points close in high-dimensional space are likely to be assigned to the same grid elements (buckets).
    * Reference: Johnson–Lindenstrauss lemma
  * Computationally efficient.
  * Easily and swiftly computed on GPUs.
  * Effective with wide data sets.

* Cons:
  * Probabilistic nature results in an infinite number of potential grids to select from.
  * Although a 2-D grid is shown here, grids in a much higher-dimensional space may be required, depending on the data's dimensionality.
  * Assessing the quality of the resulting bins can be labor-intensive.

### Projection as a Dot Product

* How do you project a point onto a new axis?
 
* The dot product (or inner product) of two vectors provides the projection of one vector onto the line spanned by the other.

* This projection describes the vector in terms of the reference vector.
  * After normalization, the quantity represents the magnitude of the projected vector relative to the reference vector.

$$
\text{Proj}_{B}A = \frac{A \cdot B}{|B|}
$$
where $A \cdot B$ is the dot product of vectors A and B.

$$
A \cdot B = \sum_{i=1}^{n} A_i \times B_i 
$$

Here, $n$ is the dimension of vectors $A$ and $B$. $|B|$ denotes the magnitude of vector $B$, and it's computed as:

$$
|B| = \sqrt{\sum_{i=1}^{n} B_i^2}
$$

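The denominator $|B|$ normalizes the dot product, giving the length of the projection of $A$ along the direction of $B$. Dividing by $|B|$ once more expresses that length as a fraction of the length of $B$.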



In [12]:
import math

# compute the normalized projection of A onto B.
A = (3,4)
B = (5,2)

A_dot_B = A[0]*B[0] + A[1]*B[1]
amp_B = math.sqrt(B[0]**2 + B[1]**2)
print(f"The magnitude of B is: {amp_B}")
proj = A_dot_B / amp_B
print (f"The magnitude of the projection  {proj}")

print(f"The ratio of the projection in terms of B is {proj/amp_B}")

The magnitude of B is: 5.385164807134504
The magnitude of the projection  4.270992778072193
The ratio of the projection in terms of B is 0.7931034482758621



<img src="https://www.dropbox.com/s/ggohljkopdi6l6n/a_b_new_space.png?dl=1" alt="drawing" width="800"/>


In [13]:
### Projection as a Dot Product

A = (0.95, 8)
B = (5,7)
C = (6,5)
D = (4,2)

amp_D = math.sqrt(D[0]**2 + D[1]**2)
print(f"The magnitude of D is: {amp_D}\n")


A_dot_D = A[0]*D[0] + A[1]*D[1]
proj_A = A_dot_D / amp_D
print (f"The magnitude of the projection  {proj_A}")
print(f"The ratio of the projection in terms of D is {proj_A/amp_D}")
print(f"Projection occurs in bin {math.ceil(proj_A/amp_D)} \n")


B_dot_D = B[0]*D[0] + B[1]*D[1]
proj_B = B_dot_D / amp_D
print (f"The magnitude of the projection  {proj_B}")
print(f"The ratio of the projection in terms of D is {proj_B/amp_D}")
print(f"Projection occurs in bin {math.ceil(proj_B/amp_D)} \n")


C_dot_D = C[0]*D[0] + C[1]*D[1]
proj_C = C_dot_D / amp_D
print (f"The magnitude of the projection  {proj_C}")
print(f"The ratio of the projection in terms of D is {proj_C/amp_D}")
print(f"Projection occurs in bin {math.ceil(proj_C/amp_D)} \n")


The magnitude of D is: 4.47213595499958

The magnitude of the projection  4.427414595449584
The ratio of the projection in terms of D is 0.99
Projection occurs in bin 1 

The magnitude of the projection  7.602631123499284
The ratio of the projection in terms of D is 1.6999999999999997
Projection occurs in bin 2 

The magnitude of the projection  7.602631123499284
The ratio of the projection in terms of D is 1.6999999999999997
Projection occurs in bin 2 
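
The same binning can be extended to a 2-D grid cell by projecting onto a second, orthogonal axis. This is a sketch under assumptions: the helper `grid_cell` and the second axis `U = (-2, 4)`, obtained from `D = (4, 2)` with the swap-and-negate rule, are hypothetical choices. The bin index along each axis is the ceiling of (point · axis) / |axis|², which equals the ratio computed in the cell above.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def grid_cell(point, axis_1, axis_2):
    """Bin index along each axis: ceil of (point . axis) / |axis|^2."""
    return tuple(
        math.ceil(dot(point, axis) / dot(axis, axis))
        for axis in (axis_1, axis_2)
    )

D = (4, 2)
U = (-2, 4)  # hypothetical second axis, orthogonal to D

cells = {}
for name, point in [("A", (0.95, 8)), ("B", (5, 7)), ("C", (6, 5))]:
    cells[name] = grid_cell(point, D, U)
    print(name, cells[name])
```

Points B and C land in the same cell, (2, 1), while A lands in (1, 2), so a per-cell sampling step would treat B and C as neighbors.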



### Time Series

* A time series is essentially a sequence of data points, each associated with a specific moment in time.
* These data points are organized chronologically, providing a way to analyze changes over a designated period.
* An enormous amount of big data exists in time series format.
* This is particularly common in sectors like finance, healthcare, and meteorology, where tracking changes over time is crucial for analysis and decision-making.


### Time Series

<img src="https://www.dropbox.com/s/2lv0jdplyv6tyfa/stock_value.png?dl=1" alt="drawing" width="800"/>


### Sampling from Time Series

Common Strategies for Sampling:

  * Subsampling: Here, you pick every k-th record from the time series data. For example, if k=5, you'd select every 5th data point. This strategy is simple but can risk missing out on important variations in the data.

  * Bucket-Based Sampling: This method involves dividing the time series data into discrete sections or "buckets" based on a time unit like an hour or a day. You then select an equal number of samples from each bucket. This approach ensures that the sample is evenly distributed across the time range of interest.

  * Adaptive Sampling: This technique adjusts the sampling rate based on the variability or importance of the data points. For example, if the time series shows rapid changes, the sampling rate increases, and it decreases during periods of little to no change.
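
The three strategies can be sketched on a toy series of `(timestamp, value)` pairs. All function names and the data are hypothetical. In particular, the adaptive rule used here (keep a point once its value has moved more than `threshold` away from the last kept value) is only one simple way to obtain a change-dependent sampling rate.

```python
import random
from collections import defaultdict

def every_kth(series, k):
    """Subsampling: keep every k-th record, starting with the first."""
    return series[::k]

def bucket_sample(series, bucket_size, n_per_bucket, seed=None):
    """Bucket-based: group by timestamp // bucket_size, draw up to n_per_bucket per bucket."""
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for t, value in series:
        buckets[t // bucket_size].append((t, value))
    sample = []
    for key in sorted(buckets):
        members = buckets[key]
        chosen = rng.sample(members, min(n_per_bucket, len(members)))
        sample.extend(sorted(chosen))  # restore chronological order within the bucket
    return sample

def adaptive_sample(series, threshold):
    """Adaptive: keep a point when its value differs from the last kept value by more than threshold."""
    if not series:
        return []
    kept = [series[0]]
    for t, value in series[1:]:
        if abs(value - kept[-1][1]) > threshold:
            kept.append((t, value))
    return kept

# Hypothetical series: flat, then a rapid rise, then flat again.
values = [10, 10, 10, 10, 10, 10, 12, 15, 19, 24, 24, 24]
series = list(enumerate(values))

print(every_kth(series, 5))
print(bucket_sample(series, bucket_size=4, n_per_bucket=2, seed=3))
print(adaptive_sample(series, threshold=1.5))
```

On this series, every-5th sampling skips the entire rise between timestamps 5 and 10, while the adaptive rule keeps each point of the rise and drops the flat stretches.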
