## Collecting Data

### The Size and Quality of a Data Set

<code>“Garbage in, garbage out”</code>

The preceding adage applies to machine learning. After all, your model is only as good as your data. But how do you measure your data set's quality and improve it? And how much data do you need to get useful results? The answers depend on the type of problem you’re solving.

### The Size of a Data Set

As a rough rule of thumb, your model should train on at least an order of magnitude more examples than trainable parameters. Simple models on large data sets generally beat fancy models on small data sets. Google has had great success training simple linear regression models on large data sets.
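
The rule of thumb above can be written as a quick sanity check. This is only a sketch: the function name `meets_rule_of_thumb` and the default `factor` of 10 (one order of magnitude) are illustrative choices, not part of any library.

```python
def meets_rule_of_thumb(num_examples: int, num_trainable_params: int, factor: int = 10) -> bool:
    """Return True if there are at least `factor` times more examples than parameters."""
    return num_examples >= factor * num_trainable_params


# A linear model with 100 weights plus a bias has 101 trainable parameters.
print(meets_rule_of_thumb(num_examples=150, num_trainable_params=101))     # False
print(meets_rule_of_thumb(num_examples=20_000, num_trainable_params=101))  # True
```

Treat the result as a prompt to look closer, not as a guarantee: the threshold is a heuristic and the right amount of data still depends on the problem.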

What counts as "a lot" of data? It depends on the project. Consider the relative size of these data sets:

| **Data set** | **Size (number of examples)** |
|:------|:------|
| [Iris flower data set](https://archive.ics.uci.edu/ml/datasets/iris) | 150 (total set) |
| [MovieLens (the 20M data set)](https://grouplens.org/datasets/movielens/20m/) | 20,000,263 (total set) |
| [Google Gmail SmartReply](https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/45189.pdf) | 238,000,000 (training set) |
| Google Books Ngram | 486,000,000,000 (total set) |
| Google Translate | trillions |

As you can see, data sets come in a variety of sizes.


### The Quality of a Data Set

It's no use having a lot of data if it's bad data; quality matters, too. But what counts as "quality"? It's a fuzzy term. Consider taking an empirical approach and picking the option that produces the best outcome. With that mindset, a quality data set is one that lets you succeed with the business problem you care about. In other words, the data is *good* if it accomplishes its intended task.

However, while collecting data, it's helpful to have a more concrete definition of quality. Certain aspects of quality tend to correspond to better-performing models:
- reliability
- feature representation
- minimizing skew


-------------

#### Reliability

**Reliability** refers to the degree to which you can *trust* your data. A model trained on a reliable data set is more likely to yield useful predictions than a model trained on unreliable data. In measuring reliability, you must determine:

- How common are label errors? For example, if your data is labeled by humans, sometimes humans make mistakes.

- Are your features noisy? For example, GPS measurements fluctuate. Some noise is okay. You'll never purge your data set of *all* noise. You can collect more examples too.

- Is the data properly filtered for your problem? For example, should your data set include search queries from bots? If you're building a spam-detection system, then likely the answer is yes, but if you're trying to improve search results for humans, then no.

What makes data unreliable? Recall from the [Machine Learning Crash Course](https://developers.google.com/machine-learning/crash-course/representation/cleaning-data) that many examples in data sets are unreliable due to one or more of the following:

- Omitted values. For instance, a person forgot to enter a value for a house's age.
- Duplicate examples. For example, a server mistakenly uploaded the same logs twice.
- Bad labels. For instance, a person mislabeled a picture of an oak tree as a maple.
- Bad feature values. For example, someone typed an extra digit, or a thermometer was left out in the sun.
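
Two of the problems above, omitted values and duplicate examples, are easy to count mechanically. The following is a minimal sketch using plain dictionaries; the helper `audit_examples` and the field names (`age_years`, `price`) are hypothetical. Bad labels and bad feature values usually need domain-specific checks, such as plausible value ranges or manual review, and are not covered here.

```python
def audit_examples(examples, required_fields):
    """Count examples with omitted values and exact duplicate examples."""
    seen = set()
    missing = 0
    duplicates = 0
    for example in examples:
        # An example counts as incomplete if any required field is absent or None.
        if any(example.get(field) is None for field in required_fields):
            missing += 1
        # Use the sorted (field, value) pairs as a hashable fingerprint.
        key = tuple(sorted(example.items()))
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)
    return {"missing": missing, "duplicates": duplicates}


houses = [
    {"id": 1, "age_years": 12, "price": 250_000},
    {"id": 2, "age_years": None, "price": 310_000},  # omitted value
    {"id": 1, "age_years": 12, "price": 250_000},    # duplicate example
]
print(audit_examples(houses, required_fields=["age_years", "price"]))
# {'missing': 1, 'duplicates': 1}
```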

Google Translate focused on reliability to pick the "best subset" of its data; that is, some data had higher-quality labels than other parts.

#### Feature Representation 

Recall from the [Machine Learning Crash Course](https://developers.google.com/machine-learning/crash-course/representation/cleaning-data) that representation is the mapping of data to useful features. You'll want to consider the following questions:

- How is data shown to the model?
- Should you [normalize](https://developers.google.com/machine-learning/glossary#normalization) numeric values?
- How should you handle [outliers](https://developers.google.com/machine-learning/glossary#outliers)?

The [Transform Your Data](https://developers.google.com/machine-learning/data-prep/transform/introduction) section of this course will focus on feature representation.
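
As a small illustration of the normalization and outlier questions above, the sketch below clips values to a fixed range and then applies z-score scaling. These are just two common options, chosen here as assumptions for the example; the helper names `clip` and `z_score` and the clipping bounds are hypothetical.

```python
import statistics


def clip(values, lower, upper):
    """Limit each value to the range [lower, upper] to tame outliers."""
    return [min(max(v, lower), upper) for v in values]


def z_score(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return [0.0 for _ in values]
    return [(v - mean) / stdev for v in values]


raw = [1, 50, 1000]
clipped = clip(raw, lower=0, upper=100)
print(clipped)           # [1, 50, 100]
print(z_score(clipped))  # values centered on 0
```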

#### Training versus Prediction

Let's say you get great results offline. Then in your live experiment, those results don't hold up. What could be happening?

This problem suggests training/serving skew; that is, your metrics come out differently at training time than at serving time. Causes of skew can be subtle but have deadly effects on your results. Always consider what data is available to your model at prediction time. During training, use only the features that you'll have available in serving, and make sure your training set is representative of your serving traffic.
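
One simple guard against this kind of skew is to compare the feature names used in training with those available at prediction time. This is a minimal sketch; the function name and the feature names are hypothetical, and it checks availability only, not whether the serving traffic has the same distribution as the training set.

```python
def features_missing_at_serving(training_features, serving_features):
    """Return training features that will not be available at prediction time."""
    return sorted(set(training_features) - set(serving_features))


training_features = ["query_length", "user_country", "clicked_within_hour"]
serving_features = ["query_length", "user_country"]

print(features_missing_at_serving(training_features, serving_features))
# ['clicked_within_hour']
```

Here `clicked_within_hour` is only known after the event, so a model trained on it would look good offline and fail in a live experiment.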



---------------------
---------------------

## Joining Data Logs
When assembling a training set, you must sometimes join multiple sources of data.

### Types of Logs

You might work with any of the following kinds of input data:

- transactional logs
- attribute data
- aggregate statistics

**Transactional logs** record a specific event. For example, a transactional log might record an IP address making a query and the date and time at which the query was made. Each transactional record corresponds to a single event.

**Attribute data** contains snapshots of information. For example:

- user demographics
- search history at time of query

Attribute data isn't specific to an event or moment in time, but can still be useful for making predictions. For prediction tasks not tied to a specific event (for example, predicting user churn, which involves a range of time rather than an individual moment), attribute data might be the only type of data.

Attribute data and transactional logs are related. For example, you can create a type of attribute data by aggregating several transactional logs, creating aggregate statistics. In this case, you can look at many transactional logs to create a single attribute for a user.

**Aggregate statistics** create an attribute from multiple transactional logs. For example:

- frequency of user queries
- average click rate on a certain ad
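
Both aggregates listed above can be computed by grouping transactional logs. The following is a sketch under assumed log shapes: the field names (`user_id`, `ad_id`, `clicked`) and the function names are hypothetical.

```python
from collections import Counter


def query_frequency(transaction_logs):
    """Number of logged queries per user."""
    return Counter(log["user_id"] for log in transaction_logs)


def average_click_rate(ad_logs):
    """Fraction of impressions that were clicked, per ad."""
    impressions = Counter()
    clicks = Counter()
    for log in ad_logs:
        impressions[log["ad_id"]] += 1
        clicks[log["ad_id"]] += int(log["clicked"])
    return {ad: clicks[ad] / impressions[ad] for ad in impressions}


query_logs = [
    {"user_id": "u1", "query": "shoes"},
    {"user_id": "u1", "query": "boots"},
    {"user_id": "u2", "query": "hats"},
]
ad_logs = [
    {"ad_id": "ad_a", "clicked": True},
    {"ad_id": "ad_a", "clicked": False},
    {"ad_id": "ad_b", "clicked": True},
]

print(dict(query_frequency(query_logs)))  # {'u1': 2, 'u2': 1}
print(average_click_rate(ad_logs))        # {'ad_a': 0.5, 'ad_b': 1.0}
```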

### Joining Log Sources
Each type of log tends to be in a different location. When collecting data for your machine learning model, you must join together different sources to create your data set. Some examples:

- Leverage the user's ID and timestamp in transactional logs to look up user attributes at time of event.
- Use the transaction timestamp to select search history at time of query.
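
The first join above, looking up user attributes as of the time of an event, can be sketched as follows. It assumes each user's attribute snapshots are stored as a list of `(timestamp, attributes)` pairs sorted by timestamp, with timestamps as plain numbers; the data layout and the names `attributes_at` and `join_logs` are assumptions for illustration.

```python
from bisect import bisect_right


def attributes_at(snapshots, timestamp):
    """Return the latest snapshot taken at or before `timestamp`, or None."""
    times = [t for t, _ in snapshots]
    index = bisect_right(times, timestamp)
    return snapshots[index - 1][1] if index > 0 else None


def join_logs(transactions, snapshots_by_user):
    """Attach to each transaction the user's attributes as of the event time."""
    joined = []
    for event in transactions:
        snapshots = snapshots_by_user.get(event["user_id"], [])
        attrs = attributes_at(snapshots, event["timestamp"])
        joined.append({**event, "attributes": attrs})
    return joined


snapshots_by_user = {
    "u1": [(100, {"country": "US"}), (200, {"country": "CA"})],
}
transactions = [
    {"user_id": "u1", "timestamp": 150, "query": "shoes"},
    {"user_id": "u1", "timestamp": 250, "query": "boots"},
    {"user_id": "u2", "timestamp": 50, "query": "hats"},  # no snapshot on file
]

for row in join_logs(transactions, snapshots_by_user):
    print(row["query"], row["attributes"])
# shoes {'country': 'US'}
# boots {'country': 'CA'}
# hats None
```

Using only snapshots from at or before the event avoids leaking information into training that would not have been known at the time of the query.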

------------
------------

## Identifying Labels and Sources

### Direct vs. Derived Labels

Machine learning is easier when your labels are well-defined. The best label is a **direct label** of what you want to predict. For example, if you want to predict whether a user is a Taylor Swift fan, a direct label would be "User is a Taylor Swift fan."

A simpler test of fanhood might be whether the user has watched a Taylor Swift video on YouTube. The label "user has watched a Taylor Swift video on YouTube" is a **derived label** because it does not directly measure what you want to predict. Is this derived label a reliable indicator that the user likes Taylor Swift? Your model will only be as good as the connection between your derived label and your desired prediction.

### Label Sources

The output of your model could be either an Event or an Attribute. This results in the following two types of labels:

- **Direct label for Events**, such as “Did the user click the top search result?”
- **Direct label for Attributes**, such as “Will the advertiser spend more than $X in the next week?”

### Direct Labels for Events

For events, direct labels are typically straightforward, because you can log the user behavior during the event for use as the label. When labeling events, ask yourself the following questions:

- How are your logs structured?
- What is considered an “event” in your logs?

For example, does the system log a user clicking on a search result or when a user makes a search? If you have click logs, realize that you'll never see an impression without a click. You would need logs where the events are impressions, so you cover all cases in which a user sees a top search result.
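
To make the point about impressions concrete, the sketch below starts from impression logs and marks each one as clicked or not, so that negative examples are kept. The log shape, the `impression_id` field, and the function name are hypothetical.

```python
def label_impressions(impressions, clicks):
    """Label each impression 1 if it was clicked, else 0."""
    clicked_ids = {click["impression_id"] for click in clicks}
    return [
        {**imp, "clicked": int(imp["impression_id"] in clicked_ids)}
        for imp in impressions
    ]


impressions = [
    {"impression_id": "i1", "query": "shoes"},
    {"impression_id": "i2", "query": "boots"},
]
clicks = [{"impression_id": "i1"}]

for row in label_impressions(impressions, clicks):
    print(row["impression_id"], row["clicked"])
# i1 1
# i2 0
```

With click logs alone, the `i2` row would never appear, and the model would see only positive examples.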

### Direct Labels for Attributes
Let's say your label is, “The advertiser will spend more than $X in the next week.” Typically, you'd use the previous days of data to predict what will happen in the subsequent days. For example, the following illustration shows the ten days of training data that predict the subsequent seven days:

![](04.png)

Remember to consider seasonality or cyclical effects; for example, advertisers might spend more on weekends. For that reason, you may prefer to use a 14-day window instead, or to use the date as a feature so the model can learn yearly effects.
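
The windowing described in this section can be sketched as follows: the first `feature_days` of spend become the features, and the label records whether spend over the following `label_days` exceeds a threshold. The function name, the list-of-daily-totals input, and the `threshold` value standing in for $X are all assumptions for illustration.

```python
def build_example(daily_spend, feature_days=10, label_days=7, threshold=100.0):
    """Split daily spend (oldest first) into features and a binary label.

    The label is 1 if total spend in the `label_days` following the
    feature window is greater than `threshold`, else 0.
    """
    if len(daily_spend) < feature_days + label_days:
        raise ValueError("not enough days of data")
    features = daily_spend[:feature_days]
    future = daily_spend[feature_days:feature_days + label_days]
    label = int(sum(future) > threshold)
    return features, label


daily_spend = [10.0] * 10 + [20.0] * 7  # 17 days, oldest first
features, label = build_example(daily_spend)
print(len(features), label)  # 10 1
```

Passing `feature_days=14` gives the longer window mentioned above, provided the input then covers at least 21 days.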
