# Data Wrangling in R

**Data wrangling** covers loading, manipulating, reshaping, and exploring data structures.

## Table of Contents

- [Understanding Data](#data)


---
<a id='data'></a>

## Understanding Data

1. **Where do data come from?** - `When working with any data set, it is vital to consider where the data came from and who recorded it, how, and why, in order to analyze it effectively and meaningfully`.
    - **Sensors** - assuming these devices have been properly calibrated, they offer a reliable and consistent mechanism for data collection.
    - **Surveys** - depend on individuals self-reporting, so their quality may vary. The biases inherent in survey responses should be recognized and (if possible) adjusted for.
    - **Record keeping** - based on manual or automated processes that keep track of different (business) activities. Reliability depends on the quality of the systems producing the records and on the way the records are gathered. Record keeping may focus only on particular tasks.
    - **Secondary data analysis** - data compiled from existing knowledge artifacts or measurements (such as historical texts). These artifacts may already exclude some perspectives.
    

2. **Dataset sources**:

    - Government publications
    - News and journalism (The New York Times, FiveThirtyEight)
    - Scientific research (Nature Recommended Data Repositories)
    - Social networks and media organizations (Facebook, Twitter, Google)
    - Online communities (Kaggle, Socrata, UCI ML Repository)
    

3. Once you acquire a data set, you will have to **understand its structure and content** before (programmatically) investigating it. You need to know which kinds of statistical analysis are valid for each type of data, as well as how to interpret what the data are measuring.

4. **Data interpretation**. Working with data requires domain knowledge: at least a basic understanding of the problem domain (the meaning of the data), the significance and purpose of each feature (to detect outliers and errors), and subtleties that may not be explicit in the data set (such as biases or aggregations that may hide important causal relationships). `Gathering domain knowledge almost always requires outside research`.

5. **Organize your data into data structures**. Usually these structures form one or more (connected) tables in which columns represent features and rows represent observations. You need to understand the data schema and the specific context of every value. Use metadata (data about data) as a starting point.

6. **Use data to answer questions**. This requires translating domain questions into specific observations and features in your data set. You need to be able to decide **what precisely is meant by a question**, a task that requires understanding the nuances of the question's problem domain.
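
Steps 3 to 6 above can be sketched in base R. This is a minimal illustration, not a prescribed workflow: the `responses` table, its column names, every value in it, and the 0 to 120 plausibility range for age are all assumptions made up for this example.

```r
# Hypothetical survey-style data (all values invented for illustration).
# Step 5: one table, columns are features, rows are observations.
responses <- data.frame(
  respondent_id = 1:6,
  age           = c(34, 29, 41, 290, 52, 38),  # 290 is a deliberate entry error
  region        = factor(c("north", "south", "south", "north", "east", "east")),
  satisfaction  = c(3, 5, 4, 2, 2, NA)         # assumed 1-5 scale, NA = no answer
)

# Step 3: inspect structure and content before analyzing anything.
str(responses)       # column types and a preview of the values
summary(responses)   # per-column ranges, factor counts, and NA counts

# Step 4: domain knowledge (assumed here: human ages fall within 0-120)
# lets us flag values that are technically numeric but implausible.
plausible_age <- responses$age >= 0 & responses$age <= 120
suspect_rows  <- responses[!plausible_age, ]

# Step 6: turn a domain question into features and observations.
# Question: "What is the average satisfaction per region among plausible records?"
# Features: satisfaction, region. Observations: rows with a plausible age.
clean <- responses[plausible_age, ]
mean_by_region <- aggregate(satisfaction ~ region, data = clean, FUN = mean)
print(mean_by_region)
```

Note that the formula interface of `aggregate()` drops rows with missing values by default, so the respondent who gave no answer is silently excluded from the averages. Whether that is the right treatment is itself a question of interpretation (step 4), not something the code can decide.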

---
<a id='data'></a>