# Prepare Data for Exploration

Notes from this course: https://www.coursera.org/learn/data-preparation/

## Module 1: Data types and structures

### Learning log

#### Collecting data
- As a data analyst, you need to be an expert at structuring and extracting data, and at making sure the data you work with is reliable
- Develop a general idea of how all data is generated and collected since every organization structures data differently
- No matter what data structure you are faced with in your new role, you will feel confident working with it
- Data might be biased instead of credible, or dirty instead of clean
- Your goal is to learn how to analyze data for bias and credibility and to understand what clean data means
- When data analysts work with data, they always check that the data is unbiased and credible
- How to collect, apply, organize, and protect data
- How data is collected
    - Interviews
    - Observations
    - Forms
    - Questionnaires
    - Surveys
    - Cookies
- Knowing how it's generated can help add context to the data
- Knowing how it's collected can help make the data analysis process more efficient
- Data collection considerations
    - How the data will be collected
        - Decide if you will collect the data using your own resources or receive (and possibly purchase) it from another party
    - Choose data sources
        - First-party 
            - Data collected by an individual or group using their own resources
            - Preferred method because you know exactly where it came from
        - Second-party
            - Data collected by a group directly from its audience and then sold
        - Third-party
            - Data collected from outside sources who did not collect it directly
            - Might come from a number of different sources
            - Might not be as reliable but doesn't mean it can't be useful
            - Needs to be checked for accuracy, bias, and credibility
            - Needs to be inspected for accuracy and trustworthiness no matter what kind of data you use
            - Sold by a provider that didn’t collect the data themselves
    - Decide what data to use
        - Choose data that can actually help solve your problem question
        - For example, if you are analyzing trends over time, make sure you use time series data — in other words, data that includes dates
    - How much data to collect
        - If you are collecting your own data, make reasonable decisions about sample size. A random sample from existing data might be fine for some projects. Other projects might need more strategic data collection to focus on certain criteria. Each project has its own needs. 
    - Select the right data type
    - Determine the timeframe
        - If you are collecting your own data, decide how long you will need to collect it, especially if you are tracking trends over a long period of time. If you need an immediate answer, you might not have time to collect new data. In this case, you would need to use historical data that already exists. 
- Population
    - All possible data values in a certain dataset
- Sample
    - A part of a population that is representative of the population
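The population and sample definitions above can be illustrated with a minimal Python sketch. The visitor counts are made-up values generated for illustration, and the sample size of 30 is an arbitrary assumption, not a recommendation from the course.

```python
import random

# Hypothetical population: one daily visitor count for each day of a year.
rng = random.Random(42)  # fixed seed so the sketch is reproducible
population = [rng.randint(10, 200) for _ in range(365)]

# Simple random sample: every value has the same chance of being picked,
# and no position in the population is picked twice.
sample = rng.sample(population, k=30)

population_mean = sum(population) / len(population)
sample_mean = sum(sample) / len(sample)

print(f"population size: {len(population)}, mean: {population_mean:.1f}")
print(f"sample size:     {len(sample)}, mean: {sample_mean:.1f}")
```

Comparing the two means is one rough way to see whether a sample looks representative. A sample that focuses on certain criteria would need a more deliberate selection than `random.sample`.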

##### Flowchart
![flowchart](https://d3c33hcgiwev3.cloudfront.net/imageAssetProxy.v1/5TyGAFZrRi28hgBWa-Ytcg_a723a1a4d78b42e1bcb6ddd2178adc42_Screen-Shot-2020-12-14-at-2.19.22-PM.png?expiry=1699142400000&hmac=IoGziIegBLGSwhMhZYoO09rZvWVvHTug1JfkzOnTrQw)

#### Differentiate between data formats and structure
- Quantitative data
    - Can be measured, counted, expressed as a number
    - Data with a certain quantity, amount, or range
    - Can be broken down into discrete or continuous data
    - Example:
        - Percentage of board-certified doctors who are women
        - Population of elephants in Africa
        - Distance from Earth to Mars
- Qualitative data
    - Can't be counted, measured, or easily expressed using numbers
    - Usually listed as a name, category, or description
    - Example:
        - Exercise activity most enjoyed
        - Favorite brands of most loyal customers
        - Fashion preferences of young adults
- Discrete data
    - Data that is counted and has a limited number of values
    - Example: Budgets, Stars, Points
    - When partial measurements (half-stars or quarter-points) aren't allowed and only full stars or points are accepted, the data is discrete
    - Example:
        - Number of people who visit a hospital on a daily basis (10, 20, 200)
        - Room’s maximum capacity allowed
        - Tickets sold in the current month
- Continuous data
    - Data that is measured and can have almost any numeric value
    - Can be measured using a timer
    - Can be shown as a decimal with several places
    - Example:
        - Height of kids in third-grade classes (52.5 inches, 65.7 inches)
        - Runtime markers in a video
        - Temperature
- Nominal data
    - A type of qualitative data that is categorized without a set order
    - Doesn't have a sequence
    - Example:
        - First-time customer, returning customer, regular customer
        - New job applicant, existing applicant, internal applicant
        - New listing, reduced price listing, foreclosure
        - Q: Ask people if they've watched a movie.
        - A: Yes, No, Not sure
    - Choices don't have a particular order
- Ordinal data
    - A type of qualitative data with a set order or scale
    - Example:
        - Movie ratings (number of stars: 1 star, 2 stars, 3 stars)
        - Ranked-choice voting selections (1st, 2nd, 3rd)
        - Income level (low income, middle income, high income)
        - Q: Rank a movie from 1 to 5
        - A: 1, 2, 3, 4, 5
    - Rankings are in order of how much each person liked the movie
- Internal data
    - Data that lives within a company's own systems
    - Usually more reliable and easier to collect
    - Example:
        - Wages of employees across different business units tracked by HR
        - Product inventory levels across distribution centers
- External data
    - Data that lives and is generated outside of an organization
    - Valuable when your analysis depends on as many sources as possible
    - Example:
        - National average wages for the various positions throughout your organization
        - Credit reports for customers of an auto dealership
- Structured data
    - Data organized in a certain format such as rows and columns
    - Spreadsheets and relational databases are examples of software that can store data in a structured way
    - Example:
        - Expense reports
        - Tax returns
        - Store inventory
- Unstructured data
    - Data that is not organized in any easily identifiable manner
    - Example: Audio and video files
    - No clear way to identify or organize their content
    - Might have internal structure but the data doesn't fit neatly in rows and columns like structured data
    - Example:
        - Social media posts
        - Emails
        - Videos
- Primary data
    - Collected by a researcher from first-hand sources
    - Examples:
        - Data from an interview you conducted
        - Data from a survey returned from 20 participants
        - Data from questionnaires you got back from a group of workers
- Secondary data
    - Gathered by other people or from other research
    - Examples:
        - Data you bought from a local data analytics firm’s customer profiles
        - Demographic data collected by a university
        - Census data gathered by the federal government
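A few of the distinctions above (nominal vs. ordinal, discrete vs. continuous) can be sketched in Python. This is only an illustration under stated assumptions: the names `IncomeLevel`, `CUSTOMER_TYPES`, and `numeric_kind` are hypothetical, and treating `int` as discrete and `float` as continuous is a modeling shortcut, not a rule from the course.

```python
from enum import IntEnum


# Ordinal data: categories with a set order, so comparing them is meaningful.
class IncomeLevel(IntEnum):
    LOW = 1
    MIDDLE = 2
    HIGH = 3


# Nominal data: categories without a set order, so a plain set is enough.
CUSTOMER_TYPES = {"first-time", "returning", "regular"}

responses = [IncomeLevel.HIGH, IncomeLevel.LOW, IncomeLevel.MIDDLE]
ranked = sorted(responses)  # sorting only makes sense for ordinal data


def numeric_kind(value):
    """Assumed shortcut: counted values are ints, measured values are floats."""
    if isinstance(value, int):
        return "discrete"
    if isinstance(value, float):
        return "continuous"
    raise TypeError("not quantitative data")


tickets_sold = 200     # counted, whole values only
height_inches = 52.5   # measured, can carry decimals

print([level.name for level in ranked])
print(numeric_kind(tickets_sold), numeric_kind(height_inches))
```

Nominal values support equality and membership checks (`"regular" in CUSTOMER_TYPES`), while sorting them would impose an order the data does not have.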

##### Data formats in practice
![formats](https://d3c33hcgiwev3.cloudfront.net/imageAssetProxy.v1/lpSSp7kPSMqUkqe5D6jKhQ_d475227147854cadb95f7724129bc6f1_C3M1L2R1.png?expiry=1699142400000&hmac=UHdvrLiU1J4jhAPBxdAvFbR-ZeBGuK4E-bCIed6LdBg)
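As a rough illustration of the structured and unstructured formats described earlier in these notes, the sketch below reads a small expense report as rows and columns and contrasts it with a free-text email. All values, names, and the email text are made up for the example.

```python
import csv
import io

# Structured data: a hypothetical expense report stored as rows and columns.
expense_csv = "employee,category,amount\nAna,travel,120.50\nBen,meals,35.00\n"
rows = list(csv.DictReader(io.StringIO(expense_csv)))

# Every row shares the same columns, so a field can be addressed by name.
total = sum(float(row["amount"]) for row in rows)

# Unstructured data: a free-text email has no columns to address by name.
email_body = "Hi team, I spent about $120 on travel last week. Thanks, Ana"
word_count = len(email_body.split())

print(f"rows: {len(rows)}, total amount: {total:.2f}")
print(f"email words: {word_count}")
```

Getting the amount out of the email would require text parsing or manual review, which is the practical cost of data that doesn't fit neatly into rows and columns.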

#### Explore data types, fields, and values


---

## Module 2: Bias, credibility, privacy, ethics, and access

---

## Module 3: Databases: Where data lives

---

## Module 4: Organizing and protecting your data

---

## Module 5: Optional: Engaging in the data community

---

## Module 6: Course challenge