# Sampling and Splitting Data

- Sampling
- Imbalanced Data
- Data Split Example
- Splitting Your Data
- Randomization

## Sampling

### Introduction to Sampling

It's often a struggle to gather enough data for a machine learning project. Sometimes, however, there is too much data, and you must select a subset of examples for training.

How do you select that subset? As an example, consider Google Search. At what granularity would you sample its massive amounts of data? Would you use random queries? Random sessions? Random users?

Ultimately, the answer depends on the problem: what do we want to predict, and what features do we want?

- To use the feature *previous query*, you need to sample at the session level, because sessions contain a sequence of queries.

- To use the feature *user behavior from previous days*, you need to sample at the user level.
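
The difference between sampling granularities can be sketched in a few lines. The record layout and the helper name `sample_by_unit` below are assumptions made for illustration, not part of any Search pipeline. The point is only that sampling whole sessions or whole users keeps every query belonging to the chosen unit, which is what features like *previous query* need.

```python
import random

def sample_by_unit(records, unit_key, fraction, seed=0):
    """Keep all records belonging to a random subset of units.

    unit_key is the field that defines the sampling granularity,
    for example "session_id" or "user_id" (hypothetical field names).
    """
    units = sorted({record[unit_key] for record in records})
    rng = random.Random(seed)  # seeded so the sample is reproducible
    num_kept = max(1, round(len(units) * fraction))
    kept_units = set(rng.sample(units, num_kept))
    return [record for record in records if record[unit_key] in kept_units]

# Toy query log (made-up data).
records = [
    {"user_id": "u1", "session_id": "s1", "query": "weather"},
    {"user_id": "u1", "session_id": "s1", "query": "weather tomorrow"},
    {"user_id": "u1", "session_id": "s2", "query": "pizza near me"},
    {"user_id": "u2", "session_id": "s3", "query": "flights"},
    {"user_id": "u2", "session_id": "s3", "query": "cheap flights"},
    {"user_id": "u3", "session_id": "s4", "query": "news"},
]

# Session-level sample: each kept session keeps its full query sequence.
session_sample = sample_by_unit(records, "session_id", fraction=0.5)

# User-level sample: each kept user keeps all of their sessions.
user_sample = sample_by_unit(records, "user_id", fraction=0.34)
```

Sampling individual queries at random would instead break up sessions, so a feature such as *previous query* could no longer be computed reliably.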

### Filtering for PII (Personally Identifiable Information)

If your data includes PII (personally identifiable information), you may need to filter it from your data. A policy may require you to remove infrequent features, for example.

This filtering will skew your distribution. You'll lose information in the tail (the part of the distribution with very low values, far from the mean).

This filtering is helpful because **very infrequent features are hard to learn**. But **it’s important to realize that your dataset will be biased toward the head queries**. At serving time, you can expect to do worse on examples from the tail, since these were the examples that got filtered out of your training data. Although this skew can’t be avoided, be aware of it during your analysis.
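
As a rough illustration of how such a filter removes the tail, the sketch below drops every query that appears fewer than `min_count` times. The threshold, the function name `filter_infrequent`, and the toy data are all assumptions; a real policy would define its own rule.

```python
from collections import Counter

def filter_infrequent(queries, min_count):
    """Drop queries that occur fewer than min_count times (hypothetical policy)."""
    counts = Counter(queries)
    return [q for q in queries if counts[q] >= min_count]

# Toy data: two head queries and three tail queries that each occur once.
queries = (
    ["weather"] * 50
    + ["news"] * 30
    + ["rare query a", "rare query b", "rare query c"]
)

kept = filter_infrequent(queries, min_count=5)
removed = sorted(set(queries) - set(kept))

print(f"kept {len(kept)} of {len(queries)} examples")
print("queries removed from training data:", removed)
```

After filtering, the training data contains only head queries, so the model never sees the tail queries it may still receive at serving time.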

---------------------

## Imbalanced Data

A classification data set with skewed class proportions is called [imbalanced](https://developers.google.com/machine-learning/glossary#class_imbalanced_data_set). Classes that make up a large proportion of the data set are called [majority classes](https://developers.google.com/machine-learning/glossary#majority_class). Those that make up a smaller proportion are [minority classes](https://developers.google.com/machine-learning/glossary#minority_class).

What counts as imbalanced? The answer could range from mild to extreme, as the table below shows.


| **Degree of imbalance** | **Proportion of Minority Class** |
|:--------------|:------------|
| Mild | 20-40% of the data set |
| Moderate | 1-20% of the data set |
| Extreme | <1% of the data set |

Why look out for imbalanced data? You may need to apply a particular sampling technique if you have a classification task with an imbalanced data set.
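
The thresholds in the table above can be written as a small helper. This is only a sketch: the function name is made up, the table does not say which row a value of exactly 20% belongs to (it is treated as mild here), and the label for fractions above 40% is an assumption, since the table does not cover that range.

```python
def imbalance_degree(minority_fraction):
    """Map the minority-class fraction (0.0 to 0.5) to the table's labels."""
    if not 0.0 <= minority_fraction <= 0.5:
        raise ValueError("minority_fraction must be between 0.0 and 0.5")
    if minority_fraction < 0.01:
        return "extreme"
    if minority_fraction < 0.20:
        return "moderate"
    if minority_fraction <= 0.40:
        return "mild"
    # Not covered by the table; this label is an assumption.
    return "roughly balanced"

print(imbalance_degree(0.005))
print(imbalance_degree(0.10))
print(imbalance_degree(0.30))
```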

Consider the following example of a model that detects fraud. Instances of fraud happen once per 200 transactions in this data set, so in the true distribution, about 0.5% of the data is positive.

![](05.png)

Why would this be problematic? With so few positives relative to negatives, the training model will spend most of its time on negative examples and not learn enough from positive ones. For example, if your batch size is 128, many batches will have no positive examples, so the gradients will be less informative.
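
A quick back-of-the-envelope check makes the batch claim concrete. The sketch assumes each example in a batch is drawn independently from the true distribution, which is a simplification of how real input pipelines shuffle data.

```python
positive_rate = 1 / 200  # about 0.5% of examples are fraud
batch_size = 128

# Probability that a batch of independently drawn examples has no positives.
p_no_positive = (1 - positive_rate) ** batch_size

# Average number of positives per batch.
expected_positives = positive_rate * batch_size

print(f"P(batch has no positives) = {p_no_positive:.2f}")
print(f"Expected positives per batch = {expected_positives:.2f}")
```

Under this assumption, roughly half of all batches contain no fraud examples at all, and the average batch contains fewer than one.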

If you have an imbalanced data set, **first try training on the true distribution.** If the model works well and generalizes, you're done! If not, try the following downsampling and upweighting technique.

### Downsampling and Upweighting
An effective way to handle imbalanced data is to downsample and upweight the majority class. Let's start by defining those two new terms:

- **Downsampling** (in this context) means training on a disproportionately low subset of the majority class examples.

- **Upweighting** means adding an example weight to the downsampled class equal to the factor by which you downsampled.

**Step 1: Downsample the majority class:** Consider again our example of the fraud data set, with 1 positive to 200 negatives. We can downsample by a factor of 20, keeping 1 of every 20 negatives. That leaves about 1 positive for every 10 negatives, so roughly 10% of our data is positive, which will be much better for training our model.

![](06.png)

**Step 2: Upweight the downsampled class:** The last step is to add example weights to the downsampled class. Since we downsampled by a factor of 20, the example weight should be 20.

![](07.png)

You may be used to hearing the term *weight* in reference to model parameters, like connections in a neural network. Here we're talking about *example weights*, which means counting an individual example as more important during training. An example weight of 10 means the model treats the example as 10 times as important (when computing loss) as it would an example of weight 1.

The weight should be equal to the factor you used to downsample:

$$\text{example weight} = \text{original example weight} \times \text{downsampling factor}$$
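
The two steps can be sketched together as follows. This is a minimal illustration, not a production sampler: the dictionary layout, the function name, and the convention that label `0` marks the majority class are assumptions. Negatives are kept at random with probability 1/factor, so the number kept varies slightly around the target.

```python
import random

def downsample_and_upweight(examples, factor, seed=0):
    """Keep about 1/factor of the majority class and multiply its weight by factor.

    Assumes label 0 is the majority class and label 1 the minority class.
    """
    rng = random.Random(seed)  # seeded for reproducibility
    result = []
    for example in examples:
        weight = example.get("weight", 1.0)  # original example weight
        if example["label"] == 1:
            # Minority class: keep every example, weight unchanged.
            result.append({**example, "weight": weight})
        elif rng.random() < 1.0 / factor:
            # Majority class: keep a fraction, scale the weight up.
            result.append({**example, "weight": weight * factor})
    return result

# Toy fraud data set: 1 positive for every 200 negatives.
examples = [{"label": 1}] * 100 + [{"label": 0}] * 20000
sampled = downsample_and_upweight(examples, factor=20)

num_pos = sum(e["label"] == 1 for e in sampled)
num_neg = sum(e["label"] == 0 for e in sampled)
print(f"positives: {num_pos}, negatives kept: {num_neg}")
```

The resulting `weight` field would then be passed to whatever per-example weighting mechanism your training framework provides.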

### Why Downsample and Upweight?

It may seem odd to add example weights after downsampling. We were trying to make our model improve on the minority class -- why would we upweight the majority? These are the resulting changes:

- **Faster convergence:** During training, we see the minority class more often, which will help the model converge faster.

- **Disk space**: By consolidating the majority class into fewer examples with larger weights, we spend less disk space storing them. These savings free up disk space for the minority class, so we can collect a greater number and a wider range of examples from that class.

- **Calibration**: Upweighting ensures our model is still calibrated; the outputs can still be interpreted as probabilities.
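
A short worked check shows the intuition behind the calibration point. The symbols are introduced here for illustration only: *P* is the number of positives, *N* the number of negatives, and *k* the downsampling factor. After downsampling, about *N/k* negatives remain, each with weight *k*, so the weighted fraction of positives matches the original data:

$$\frac{P}{P + k \cdot \frac{N}{k}} = \frac{P}{P + N}$$

For the fraud example, with 1 positive to 200 negatives and a factor of 20:

$$\frac{1}{1 + 20 \cdot 10} = \frac{1}{201} \approx 0.5\%$$

Without the weights, the model would instead see positives about 1 time in 11 (roughly 9%) and would tend to overestimate the probability of fraud. This is an informal argument about class proportions, not a proof of calibration.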

-----------------

## Data Split Example
After collecting your data and sampling where needed, the next step is to split your data into [**training sets**](https://developers.google.com/machine-learning/glossary#training_set), [**validation sets**](https://developers.google.com/machine-learning/glossary#validation_set), and [**testing sets**](https://developers.google.com/machine-learning/glossary#test_set).

### When Random Splitting isn't the Best Approach
While random splitting is the best approach for many ML problems, it isn't always the right solution. For example, consider data sets in which the examples are naturally clustered into similar examples.

Suppose you want your model to classify the topic from the text of a news article. Why would a random split be problematic?

![](08.png)

News stories appear in clusters: multiple stories about the same topic are published around the same time. If we split the data randomly, the test set and the training set will therefore likely contain the same stories. In reality, it wouldn't work this way, because all the stories on a topic arrive at about the same time, after the model has been trained. A random split therefore lets the model see test-time topics during training, which causes skew between evaluation and serving.

![](09.png)

A simple approach to fixing this problem would be to split our data based on when the story was published, perhaps by day the story was published. This results in stories from the same day being placed in the same split.

![](10.png)

With tens of thousands or more news stories, some percentage of the stories on a topic may still get divided across days, and so across splits. That's okay, though; in reality these stories were split across two days of the news cycle. Alternatively, you could throw out data within a certain distance of your cutoff to ensure you don't have any overlap. For example, you could train on stories for the month of April, and then use the second week of May as the test set, with the week gap preventing overlap.
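
The April and May example can be sketched as a date-based split with a gap. The story records, the year, and the function name are made up for illustration. Anything published between the two windows is simply discarded.

```python
from datetime import date

def split_with_gap(stories, train_start, train_end, test_start, test_end):
    """Split stories by publication date; stories outside both windows are dropped."""
    train = [s for s in stories if train_start <= s["published"] <= train_end]
    test = [s for s in stories if test_start <= s["published"] <= test_end]
    return train, test

# Toy stories (hypothetical titles and dates).
stories = [
    {"title": "story A", "published": date(2024, 4, 3)},
    {"title": "story B", "published": date(2024, 4, 29)},
    {"title": "story C", "published": date(2024, 5, 2)},  # falls in the gap
    {"title": "story D", "published": date(2024, 5, 9)},
]

# Train on April, test on the second week of May, skip the first week of May.
train, test = split_with_gap(
    stories,
    train_start=date(2024, 4, 1),
    train_end=date(2024, 4, 30),
    test_start=date(2024, 5, 8),
    test_end=date(2024, 5, 14),
)

print([s["title"] for s in train])
print([s["title"] for s in test])
```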

-------------------

## Splitting Your Data
As the [news story example](https://developers.google.com/machine-learning/data-prep/construct/sampling-splitting/example) demonstrates, a pure random split is not always the right approach.

A frequent technique for online systems is to split the data by time, such that you would:

- Collect 30 days of data.
- Train on data from Days 1-29.
- Evaluate on data from Day 30.

For online systems, the training data is older than the serving data, so this technique ensures your validation set mirrors the lag between training and serving. However, time-based splits work best with very large datasets, such as those with tens of millions of examples. In projects with less data, the distributions end up quite different between training, validation, and testing.

Recall also the data split flaw from the [machine learning literature project described in the Machine Learning Crash Course](https://developers.google.com/machine-learning/crash-course/18th-century-literature). The data was literature penned by one of three authors, so data fell into three main groups. Because the team applied a random split, data from each group was present in the training, evaluation, and testing sets, so the model learned from information it wouldn't necessarily have at prediction time. This problem can happen anytime your data is grouped, whether as time series data, or clustered by other criteria. Domain knowledge can inform how you split your data.
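
One common way to avoid this kind of leakage is to assign whole groups, rather than individual examples, to a split. The sketch below is a generic illustration and not the method used in the project above; the `author` field, the author names, and the function name are assumptions.

```python
import random

def split_by_group(examples, group_key, test_fraction, seed=0):
    """Place every example from a given group in the same split."""
    groups = sorted({example[group_key] for example in examples})
    rng = random.Random(seed)  # seeded for reproducibility
    rng.shuffle(groups)
    num_test = max(1, round(len(groups) * test_fraction))
    test_groups = set(groups[:num_test])
    train = [e for e in examples if e[group_key] not in test_groups]
    test = [e for e in examples if e[group_key] in test_groups]
    return train, test

# Toy data: text snippets grouped by a hypothetical author field.
examples = [
    {"author": "author_a", "text": "snippet 1"},
    {"author": "author_a", "text": "snippet 2"},
    {"author": "author_b", "text": "snippet 3"},
    {"author": "author_b", "text": "snippet 4"},
    {"author": "author_c", "text": "snippet 5"},
    {"author": "author_c", "text": "snippet 6"},
]

train, test = split_by_group(examples, "author", test_fraction=1 / 3)
```

With only a handful of groups, the splits can differ a lot from one another, so this approach works better when there are many groups to distribute.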

--------------------

## Randomization

### Practical Considerations

Make your data generation pipeline reproducible. Say you want to add a feature to see how it affects model quality. For a fair experiment, your datasets should be identical except for this new feature. If your data generation runs are not reproducible, you can't make these datasets.

In that spirit, make sure any randomization in data generation can be made deterministic:

- **Seed your random number generators** (RNGs). Seeding ensures that the RNG outputs the same values in the same order each time you run it, recreating your dataset.

- **Use invariant hash keys**. Hashing is a common way to split or sample data. You can hash each example, and use the resulting integer to decide in which split to place the example. The inputs to your hash function shouldn't change each time you run the data generation program. Don't use the current time or a random number in your hash, for example, if you want to recreate your hashes on demand.

The preceding approaches apply both to sampling and splitting your data.
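
Both ideas can be sketched with the Python standard library. The function names, the 80/10/10 proportions, and the use of an example ID as the hash key are assumptions for illustration. Note that Python's built-in `hash()` is randomized per process for strings by default, so a stable digest from `hashlib` is used instead.

```python
import hashlib
import random

def seeded_sample(items, k, seed):
    """Draw the same k items on every run by seeding a dedicated RNG."""
    rng = random.Random(seed)
    return rng.sample(items, k)

def hash_bucket(key, num_buckets=100):
    """Map a string key to a stable bucket in [0, num_buckets)."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets

def assign_split(example_id):
    """Assign an example to a split using only its invariant ID."""
    bucket = hash_bucket(example_id)
    if bucket < 80:
        return "train"       # about 80% of examples
    if bucket < 90:
        return "validation"  # about 10% of examples
    return "test"            # about 10% of examples

# Re-running these calls gives the same results every time.
sample_a = seeded_sample(list(range(100)), 10, seed=42)
sample_b = seeded_sample(list(range(100)), 10, seed=42)
split = assign_split("example-123")
```

Because the split depends only on the example ID, adding a new feature column to the data set does not move any example between splits.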

### Considerations for Hashing

Imagine again that you were collecting Search queries and using hashing to include or exclude queries. If the hash key only used the query, then across multiple days of data, you’ll either *always* include that query or *always* exclude it. Always including or always excluding a query is bad because:


- Your training set will see a less diverse set of queries.

- Your evaluation sets will be artificially hard, because they won't overlap with your training data. In reality, at serving time, you'll have seen some of the live traffic in your training data, so your evaluation should reflect that.

Instead, you can hash on query + date, which results in a different hash, and so potentially a different include or exclude decision, each day.

![](hashing_on_query.gif)
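
The difference between the two hash keys can be sketched as follows. The function name, the sampling rate, the key format, and the dates are assumptions for illustration.

```python
import hashlib

def include_query(query, day, sample_rate=0.5, use_date=True):
    """Decide deterministically whether to include a query seen on a given day."""
    # Hash key: query + date, or the query alone when use_date is False.
    key = f"{query}|{day}" if use_date else query
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return (int(digest, 16) % 1000) < sample_rate * 1000

days = ["2024-04-01", "2024-04-02", "2024-04-03"]

# Query-only key: the decision is identical on every day.
query_only = [include_query("weather", d, use_date=False) for d in days]

# Query + date key: the decision is recomputed per day and may differ.
query_and_date = [include_query("weather", d) for d in days]

print("query only:  ", query_only)
print("query + date:", query_and_date)
```

Both variants remain reproducible, because the key contains no current timestamp or random number. Only the query + date variant lets the same query land in the sample on some days and not on others.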