### Sampling

* There are cases where using the entire dataset is infeasible, impractical, or costly.

* It is often possible to explore or develop on a subset of the data and, if needed, run only once on the complete dataset.

* Sampling is the process of selecting a subset of the instances in a dataset as a proxy for the whole.
* In statistical terms, the original dataset is known as the population, while the subset is known as a sample.

* Sampling comes in several flavors.
  * The most common form of sampling is uniform sampling
  * In uniform sampling, each element (item) of the population has an equal probability of being selected

* Sampling produces a smaller representation of the data that can be processed more efficiently
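
As a minimal sketch of uniform sampling, the snippet below draws 5% of a toy population without replacement and compares the sample mean to the population mean. The helper name `uniform_sample`, the toy population, and the 5% fraction are illustrative assumptions, not part of the original material.

```python
import random

def uniform_sample(population, fraction, seed=None):
    """Return a uniform random sample (without replacement) of the population."""
    rng = random.Random(seed)
    k = round(len(population) * fraction)
    return rng.sample(population, k)

# Toy population (assumption): the integers 1..10000.
population = list(range(1, 10_001))
sample = uniform_sample(population, fraction=0.05, seed=0)

population_mean = sum(population) / len(population)
sample_mean = sum(sample) / len(sample)

# The sample is 20x smaller, yet its mean should land close to the population mean.
print(len(sample), population_mean, round(sample_mean, 1))
```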


### Sampling

![](https://www.dropbox.com/s/vzr0w2cvx6b5bm0/sampling.png?dl=1)



### Representing observations

* When working with data, a unit of observation, or simply an observation, is the unit described by the data to be analyzed
  * Observations are often referred to as data points (or simply points) or instances.
    
* Given a dataset with D variables, data points are often viewed as points in a D-dimensional Euclidean space
  * For example, data with two variables is data in two dimensions.
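
To make the "observation as a point" view concrete, here is a small sketch in which each observation is a tuple of D = 2 values. The variable meanings (height, weight) and the values themselves are made up for illustration.

```python
import math

# Each observation is a tuple of D = 2 variable values
# (hypothetical variables: height in cm, weight in kg).
observations = [(170.0, 65.0), (182.0, 80.0), (158.0, 52.0)]

D = len(observations[0])      # number of variables = number of dimensions
a, b = observations[0], observations[1]
distance = math.dist(a, b)    # Euclidean distance between two data points

print(D, round(distance, 2))  # prints: 2 19.21
```

Because observations live in a Euclidean space, notions such as "close", "far", and "dense" become computable, which the later sampling strategies rely on.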


<img src="https://www.dropbox.com/s/t0ca5t1qzkr635b/2d-data.png?dl=1" alt="Drawing" width="700px;"/>



### Sampling Strategies

* Various ideas can be used to draw a sample.
  * Ex. select the first half of the data and discard the second half
    * "Half" can be replaced by any fraction
      * Clearly not a good idea, particularly if the data is organized according to some attribute
        * ex. the first half is all males
  * The above is a poor strategy because the selection rule is arbitrary
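
The problem with "take the first half" can be sketched as follows, assuming a hypothetical dataset that happens to be sorted by one attribute. The labels and sizes are invented for illustration.

```python
import random

# Hypothetical dataset: 500 records labelled "M" followed by 500 labelled "F",
# i.e. the data is organized (sorted) by one attribute.
records = ["M"] * 500 + ["F"] * 500

half = len(records) // 2
first_half = records[:half]                  # arbitrary rule: keep the first half
rng = random.Random(42)
random_half = rng.sample(records, half)      # uniform random alternative

def share_of(label, data):
    """Fraction of items in `data` equal to `label`."""
    return data.count(label) / len(data)

print("first half, share of M: ", share_of("M", first_half))
print("random half, share of M:", round(share_of("M", random_half), 2))
```

The first half contains only "M" records, while the random half should stay close to the 50/50 split of the full data.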

* The following is a non-exhaustive list of sampling strategies commonly used to reduce the size of a dataset
   * These are used exclusively with tall data (many more observations than variables)

### Random Subsample

* Takes in a list/collection and returns a random subsample
* Each item has an equal probability of being picked
* Ideal when:
  * The data is balanced: relatively equal category proportions
  * We don't care about missing edge cases
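
The two conditions above can be illustrated with a short sketch. The labels below are hypothetical: two balanced categories plus a rare "edge case" category that makes up 0.2% of the data.

```python
import random
from collections import Counter

rng = random.Random(1)

# Hypothetical labels: two balanced categories plus one rare category.
labels = ["a"] * 4_990 + ["b"] * 4_990 + ["rare"] * 20
sample = rng.sample(labels, 100)   # every item is equally likely to be picked

print(Counter(labels))
print(Counter(sample))
```

The balanced categories should keep roughly equal shares in the sample. The rare category has an expected count of only 0.2 items in a sample of 100, so it will often be missing entirely, which is acceptable only if edge cases do not matter for the analysis.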

### Subsampling 

* Ideally, we want to retain the structure of the dataset

![](https://www.dropbox.com/s/ld7kps6aq47jq8p/airport_obs.png?dl=1)

* For example, the objective is to carry out some analysis on a certain number of cities

* Other examples?


### Stratified Random Sampling

* The previous idea is referred to as stratified random sampling
  * Used when each data point can clearly be associated with a subgroup.
  * Ensures that the "distribution" of values for a particular feature within the sample matches the distribution of values for the same feature in the overall population.

* Allows you to: 
    * maintain structure of the population


![](https://www.dropbox.com/s/dq2wkc2l844857m/Screen%20Shot%202022-08-24%20at%204.19.11%20PM.png?dl=1)


### Stratified Random Sampling - Cont'd

* Instances in the population are first divided into homogeneous subgroups known as strata
* The number of elements selected in each stratum is proportional to the size of the stratum
  * Thus maintaining the structure
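
The two steps above (divide into strata, then sample each stratum proportionally) can be sketched as below. The function name `stratified_sample`, the `stratum_of` callback, and the city labels are assumptions made for this example.

```python
import random
from collections import Counter, defaultdict

def stratified_sample(records, stratum_of, fraction, seed=None):
    """Sample the same fraction from every stratum (proportional allocation)."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for record in records:
        strata[stratum_of(record)].append(record)

    sample = []
    for members in strata.values():
        k = round(len(members) * fraction)   # size proportional to the stratum
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical records: (city, observation id), with unequal stratum sizes.
records = (
    [("city_A", i) for i in range(600)]
    + [("city_B", i) for i in range(300)]
    + [("city_C", i) for i in range(100)]
)
sample = stratified_sample(records, stratum_of=lambda r: r[0], fraction=0.1, seed=0)

print(Counter(city for city, _ in sample))
# Counter({'city_A': 60, 'city_B': 30, 'city_C': 10})
```

The 60/30/10 split of the population is preserved exactly in the sample. Because each stratum size is rounded separately, the total sample size can differ slightly from `fraction * len(records)` when strata sizes do not divide evenly.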


### Subsampling 

* The previous approach requires that each data point falls in a predetermined stratum

* What if no strata information is readily available?

* We could assign the data to artificial strata
   * This is called clustering and will be covered later in the course
   * Clustering identifies clusters (lumps) of data where
     * Points within a cluster are very similar
     * Ideally, points are very dissimilar across clusters.





### Representation of an Observation

* Recall that a data point can be represented as a point in a high-dimensional space

<img src="https://www.dropbox.com/s/jbriqx9i7jyloyl/data_clusts_1.png?dl=1" alt="drawing" width="500"/>




### Representation of an Observation

* Identifying clusters

<img src="https://www.dropbox.com/s/jb81jew9hyf44pp/data_clusts_2.png?dl=1" alt="drawing" width="500"/>




### Representation of an Observation

<img src="https://www.dropbox.com/s/lqbhe65hkmevsiz/data_clusts_3.png?dl=1" alt="drawing" width="500"/>

### Representation of an Observation
* Number of strata can be difficult to guess
* Big data can contain noisy observations and outliers that can affect the algorithm
* Can result in loss of cluster shape
    * May make further data analysis complex (e.g., hard to cluster)



### Importance Sampling Based on a Kernel

* Select points with probability inversely proportional to their distance to the cluster center

<img src="https://www.dropbox.com/s/ncrusv71mylkat8/dist_to_center.png?dl=1" alt="drawing" width="500"/>
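
One possible reading of this rule, for a single cluster, is sketched below. It keeps every point within a reference distance `r0` of the center and keeps farther points with probability `r0 / distance`. The cap at `r0`, the use of the centroid as the cluster center, and the function names are assumptions of this sketch, not something the slides specify.

```python
import math
import random

def centroid(points):
    """Coordinate-wise mean of a list of equal-length tuples."""
    dims = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dims))

def sample_by_distance_to_center(points, r0, seed=None):
    """Keep each point with probability min(1, r0 / distance to the center)."""
    rng = random.Random(seed)
    center = centroid(points)
    kept = []
    for p in points:
        dist = math.dist(p, center)
        keep_probability = 1.0 if dist <= r0 else r0 / dist
        if rng.random() < keep_probability:
            kept.append(p)
    return kept

# Hypothetical single cluster: 2,000 points from a 2-D Gaussian.
rng = random.Random(0)
points = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(2_000)]
kept = sample_by_distance_to_center(points, r0=0.5, seed=1)

print(len(points), len(kept))
```

With several clusters, the same rule would be applied per cluster using each point's own cluster center.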


### Representation of an Observation

<img src="https://www.dropbox.com/s/bkd32khr1q8dfhx/before_after.png?dl=1" alt="drawing" width="500"/>


### Weighted Density Sampling

* Weighted density sampling is a method to discard data points based on their local density.

* The probability that a data point will be sampled is based on the local density around that data point.

* The local density of a data point is defined as the number of data points within a certain radius around it. 

  * The radius used is often a user-defined parameter  

* Some memory-efficient algorithms can approximate the local density using a single scan of the data
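
A brute-force sketch of the idea follows. The slides do not say in which direction density affects the sampling probability. This sketch assumes dense regions are thinned: a point is kept with probability `min(1, target_density / local_density)`. The names `local_density`, `density_weighted_sample`, and `target_density` are hypothetical, and the O(n²) density computation is for illustration only, not one of the single-scan approximations mentioned above.

```python
import math
import random

def local_density(points, radius):
    """For each point, count the points within `radius` of it (itself included)."""
    return [sum(1 for q in points if math.dist(p, q) <= radius) for p in points]

def density_weighted_sample(points, radius, target_density, seed=None):
    """Assumed rule: keep a point with probability min(1, target_density / density)."""
    rng = random.Random(seed)
    densities = local_density(points, radius)
    return [
        p
        for p, density in zip(points, densities)
        if rng.random() < min(1.0, target_density / density)
    ]

# Hypothetical data: one dense cluster (300 points) and one sparse region (30 points).
rng = random.Random(0)
dense = [(rng.gauss(0, 0.2), rng.gauss(0, 0.2)) for _ in range(300)]
sparse = [(rng.uniform(3, 6), rng.uniform(3, 6)) for _ in range(30)]
points = dense + sparse

kept = density_weighted_sample(points, radius=0.3, target_density=5, seed=1)
print(len(points), len(kept))
```

Under this assumed rule, most of the dense cluster is discarded while the sparse region is largely preserved, so the sample covers the data more evenly than a uniform sample would.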

### Weighted Density Sampling - Cont'd






### Hash Based Sampling
* Hash the data into grid cells and select a fixed number of points per cell

![](https://www.dropbox.com/s/nijshpr93rth8fn/grid.png?dl=1)
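
A minimal sketch of grid-based sampling is shown below. Each point is mapped to a cell key by integer division of its coordinates, and at most `per_cell` points are drawn from each occupied cell. The helper names, the cell size, and the toy data are assumptions of this example.

```python
import math
import random
from collections import defaultdict

def cell_key(point, cell_size):
    """Map a point to the integer coordinates of the grid cell containing it."""
    return tuple(math.floor(coord / cell_size) for coord in point)

def grid_sample(points, cell_size, per_cell, seed=None):
    """Group points by grid cell, then draw up to `per_cell` points from each cell."""
    rng = random.Random(seed)
    cells = defaultdict(list)
    for p in points:
        cells[cell_key(p, cell_size)].append(p)

    sample = []
    for members in cells.values():
        sample.extend(rng.sample(members, min(per_cell, len(members))))
    return sample

# Hypothetical data: 1,000 points spread uniformly over a 10 x 10 square.
rng = random.Random(0)
points = [(rng.uniform(0, 10), rng.uniform(0, 10)) for _ in range(1_000)]
sample = grid_sample(points, cell_size=1.0, per_cell=3, seed=1)

print(len(points), len(sample))
```

Cells holding fewer than `per_cell` points contribute all of their points, so every occupied cell stays represented in the sample.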

### Time Series

* A time series is a collection of data points indexed in time order.
* Time series make up a large share of the available big datasets



### Sampling from Time Series

* Some commonly used strategies
  * Random: select every x-th time point, where x is random
  * Subsample every k records
  * Split the time series into sections or "buckets", then take an equal number of samples from each bucket.
    * A bucket is hourly or daily, for example
  * Use a sliding window
    * Each point in the window has some probability of getting selected
    * Select points randomly using their distance from the mean in the window
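
Two of the strategies above, every k-th record and equal samples per bucket, can be sketched as follows. The synthetic one-reading-per-minute series, the hourly bucket size, and the function name `bucket_sample` are assumptions made for illustration.

```python
import random
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical series: one (timestamp, value) reading per minute for six hours.
start = datetime(2022, 1, 1)
series = [(start + timedelta(minutes=i), float(i)) for i in range(360)]

# Strategy 1: keep every k-th record.
k = 10
every_kth = series[::k]

# Strategy 2: split into hourly buckets, then draw the same number from each.
def bucket_sample(series, per_bucket, seed=None):
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for timestamp, value in series:
        hour = timestamp.replace(minute=0, second=0, microsecond=0)
        buckets[hour].append((timestamp, value))

    sample = []
    for hour in sorted(buckets):
        members = buckets[hour]
        chosen = rng.sample(members, min(per_bucket, len(members)))
        sample.extend(sorted(chosen))   # keep the result in time order
    return sample

hourly = bucket_sample(series, per_bucket=5, seed=0)
print(len(series), len(every_kth), len(hourly))   # prints: 360 36 30
```

Every k-th record preserves regular spacing, while bucketed sampling guarantees that each hour contributes equally even when the selection inside a bucket is random.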

### Conclusion

* Various