# Prepare Data for Exploration

Notes from this course: https://www.coursera.org/learn/data-preparation/

## Module 1: Data types and structures

### Learning log

#### Collecting data
- As a data analyst, you need to be an expert at structuring and extracting data, and at making sure the data you are working with is reliable
- Develop a general idea of how all data is generated and collected since every organization structures data differently
- No matter what data structure you are faced with in your new role, you will feel confident working with it
- Data might be biased instead of credible, or dirty instead of clean
- Your goal is to learn how to analyze data for bias and credibility and to understand what clean data means
- When data analysts work with data, they always check that the data is unbiased and credible
- How to collect, apply, organize, and protect data
- How is data collected
    - Interviews
    - Observations
    - Forms
    - Questionnaires
    - Surveys
    - Cookies
- Knowing how it's generated can help add context to the data
- Knowing how it's collected can help make the data analysis process more efficient
- Data collection considerations
    - How the data will be collected
        - Decide if you will collect the data using your own resources or receive it (and possibly purchase it) from another party
    - Choose data sources
        - First-party 
            - Data collected by an individual or group using their own resources
            - Preferred method because you know exactly where it came from
        - Second-party
            - Data collected by a group directly from its audience and then sold
        - Third-party
            - Data collected from outside sources who did not collect it directly
            - Might come from a number of different sources
            - Might not be as reliable but doesn't mean it can't be useful
            - Needs to be checked for accuracy, bias, and credibility
            - Sold by a provider that didn’t collect the data themselves
        - No matter which kind of data source you use, the data needs to be inspected for accuracy and trustworthiness
    - Decide what data to use
        - Choose data that can actually help solve your problem or answer your question
        - For example, if you are analyzing trends over time, make sure you use time series data — in other words, data that includes dates
    - How much data to collect
        - If you are collecting your own data, make reasonable decisions about sample size. A random sample from existing data might be fine for some projects. Other projects might need more strategic data collection to focus on certain criteria. Each project has its own needs. 
    - Select the right data type
    - Determine the timeframe
        - If you are collecting your own data, decide how long you will need to collect it, especially if you are tracking trends over a long period of time. If you need an immediate answer, you might not have time to collect new data. In this case, you would need to use historical data that already exists. 
- Population
    - All possible data values in a certain dataset
- Sample
    - A part of a population that is representative of the population
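
The population and sample definitions above can be sketched in a few lines of Python. This is a minimal illustration, not part of the course material: the population values are made up, and `draw_sample` is a hypothetical helper that takes a simple random sample without replacement. A random sample tends to resemble the population, but it is not guaranteed to be representative.

```python
import random

# Hypothetical population: daily visitor counts for 1,000 hospitals (made-up values).
rng = random.Random(42)  # fixed seed so the sketch is reproducible
population = [rng.randint(10, 200) for _ in range(1000)]


def draw_sample(values, sample_size, seed=0):
    """Return a simple random sample (without replacement) from values."""
    return random.Random(seed).sample(values, sample_size)


sample = draw_sample(population, sample_size=100)

# Compare a summary statistic of the sample with the same statistic of the population.
population_mean = sum(population) / len(population)
sample_mean = sum(sample) / len(sample)

print(f"population size: {len(population)}, mean: {population_mean:.1f}")
print(f"sample size:     {len(sample)}, mean: {sample_mean:.1f}")
```

If the two means are far apart, that is a hint the sample size is too small or the sampling method needs more thought.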

##### Flowchart
![flowchart](https://d3c33hcgiwev3.cloudfront.net/imageAssetProxy.v1/5TyGAFZrRi28hgBWa-Ytcg_a723a1a4d78b42e1bcb6ddd2178adc42_Screen-Shot-2020-12-14-at-2.19.22-PM.png?expiry=1699142400000&hmac=IoGziIegBLGSwhMhZYoO09rZvWVvHTug1JfkzOnTrQw)

#### Differentiate between data formats and structure
- Quantitative data
    - Can be measured, counted, expressed as a number
    - Data with a certain quantity, amount, or range
    - Can be broken down into discrete or continuous data
    - Example:
        - Percentage of board certified doctors who are women
        - Population of elephants in Africa
        - Distance from Earth to Mars
- Qualitative data
    - Can't be counted, measured, or easily expressed using numbers
    - Usually listed as a name, category, or description
    - Example:
        - Exercise activity most enjoyed
        - Favorite brands of most loyal customers
        - Fashion preferences of young adults
- Discrete data
    - Data that is counted and has a limited number of values
    - Example: Budgets, stars, points
    - When partial measurements (half-stars or quarter-points) aren't allowed and only full stars or points are accepted, the data is discrete
    - Example:
        - Number of people who visit a hospital on a daily basis (10, 20, 200)
        - Room’s maximum capacity allowed
        - Tickets sold in the current month
- Continuous data
    - Data that is measured and can have almost any numeric value
    - Can be measured using a timer
    - Can be shown as a decimal with several places
    - Example:
        - Height of kids in third grade classes (52.5 inches, 65.7 inches)
        - Runtime markers in a video
        - Temperature
- Nominal data
    - A type of qualitative data that is categorized without a set order
    - Doesn't have a sequence
    - Example: 
        - First time customer, returning customer, regular customer
        - New job applicant, existing applicant, internal applicant
        - New listing, reduced price listing, foreclosure
        - Q: Ask people if they've watched a movie.
        - A: Yes, No, Not sure
    - The choices don't have a particular order
- Ordinal data
    - A type of qualitative data with a set order or scale
    - Example:
        - Movie ratings (number of stars: 1 star, 2 stars, 3 stars)
        - Ranked-choice voting selections (1st, 2nd, 3rd)
        - Income level (low income, middle income, high income)
        - Q: Rank a movie from 1 to 5
        - A: 1, 2, 3, 4, 5
    - Rankings are in order of how much each person liked the movie
- Internal data
    - Data that lives within a company's own systems
    - Usually more reliable and easier to collect
    - Example:
        - Wages of employees across different business units tracked by HR
        - Product inventory levels across distribution centers
- External data
    - Data that lives and is generated outside of an organization
    - Valuable when your analysis depends on as many sources as possible
    - Example:
        - National average wages for the various positions throughout your organization
        - Credit reports for customers of an auto dealership
- Structured data
    - Data organized in a certain format such as rows and columns
    - Spreadsheets and relational databases are examples of software that can store data in a structured way
    - Makes it easy for analysts to enter, query, and analyze data whenever they need to
    - Makes data visualization easy because it can be applied to charts, graphs, heat maps, dashboards, and most other visual representations of data
    - Stored in relational databases and data warehouses
    - Example:
        - Expense reports
        - Tax returns
        - Store inventory
        - Excel, Google Sheets, SQL, customer data, phone records, transaction history
- Unstructured data
    - Data that is not organized in any easily identifiable manner
    - Example: Audio and video files, which have no clear way to identify or organize their content
    - Might have internal structure but the data doesn't fit neatly in rows and columns like structured data
    - Stored in data lakes, data warehouses, and NoSQL databases
    - Example:
        - Social media posts
        - Emails
        - Videos
        - Audio
        - Photos
        - Text messages, social media comments, phone call transcriptions, various log files
- Primary data
    - Collected by a researcher from first-hand sources
    - Examples:
        - Data from an interview you conducted
        - Data from a survey returned from 20 participants
        - Data from questionnaires you got back from a group of workers
- Secondary data
    - Gathered by other people or from other research
    - Examples:
        - Data you bought from a local data analytics firm’s customer profiles
        - Demographic data collected by a university
        - Census data gathered by the federal government
- Data model
    - A model that is used for organizing data elements and how they relate to one another
    - Helps keep data consistent
    - Provides a map of how data is organized
    - Makes it easier for analysts and stakeholders to make sense of their data and use it for business purposes
- Data elements
    - Pieces of information, such as people's names, account numbers, and addresses
- Fairness issue
    - The new challenge facing data scientists is making sure advancements in artificial intelligence and machine learning algorithms are inclusive and unbiased. Otherwise, certain elements of a dataset will be more heavily weighted and/or represented than others.
    - An unfair dataset does not accurately represent the population, causing skewed outcomes, low accuracy levels, and unreliable analysis
- Data modeling
    - The process of creating diagrams that visually represent how data is organized and structured
    - These visual representations are called data models
    - You can think of data modeling as a blueprint of a house. At any point, there might be electricians, carpenters, and plumbers using that blueprint. Each one of these builders has a different relationship to the blueprint, but they all need it to understand the overall structure of the house. Data models are similar; different users might have different data needs, but the data model gives them an understanding of the structure as a whole
    - Can help you explore the high-level details of your data and how it is related across the organization’s information systems
    - Data modeling sometimes requires data analysis to understand how the data is put together; that way, you know how to map the data
    - Makes it easier for everyone in your organization to understand and collaborate with you on your data
- Levels of data modeling
    - Conceptual
        - Business concepts
        - Gives a high-level view of the data structure, such as how data interacts across an organization
        - For example, a conceptual data model may be used to define the business requirements for a new database. A conceptual data model doesn't contain technical details
    - Logical
        - Data entities
        - Focuses on the technical details of a database such as relationships, attributes, and entities
        - For example, a logical data model defines how individual records are uniquely identified in a database. But it doesn't spell out actual names of database tables
    - Physical
        - Physical tables
        - Depicts how a database operates
        - A physical data model defines all entities and attributes used; for example, it includes table names, column names, and data types for the database
- Data-modeling techniques
    - Two common methods are: 
        - Entity Relationship Diagram (ERD)
            - Visual way to understand the relationship between entities in the data model
        - Unified Modeling Language (UML)
            - Very detailed diagrams that describe the structure of a system by showing the system's entities, attributes, operations, and their relationships
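
To make the three levels of data modeling more concrete, here is a rough sketch using Python's built-in `sqlite3` module. The customer and order entities, the table names, and the `describe` helper are all hypothetical and chosen only for illustration. The conceptual and logical levels appear as comments; only the physical level becomes actual tables.

```python
import sqlite3

# Conceptual level (business concepts): a customer places orders.
# Logical level (entities, attributes, relationships):
#   Customer(id, name) and Order(id, customer, order date, total);
#   each order belongs to exactly one customer.
# Physical level: actual table names, column names, and data types.
PHYSICAL_MODEL = """
CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL
);
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
    order_date  TEXT NOT NULL,   -- ISO 8601 date string
    total       REAL NOT NULL
);
"""

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.executescript(PHYSICAL_MODEL)


def describe(table):
    """Return (column name, declared type) pairs for a table."""
    rows = conn.execute(f"PRAGMA table_info({table})").fetchall()
    return [(row[1], row[2]) for row in rows]


for table in ("customers", "orders"):
    print(table, describe(table))
```

The `REFERENCES` clause is the physical counterpart of the line an ERD would draw between the two entities.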

##### Data formats in practice
![formats](https://d3c33hcgiwev3.cloudfront.net/imageAssetProxy.v1/lpSSp7kPSMqUkqe5D6jKhQ_d475227147854cadb95f7724129bc6f1_C3M1L2R1.png?expiry=1699142400000&hmac=UHdvrLiU1J4jhAPBxdAvFbR-ZeBGuK4E-bCIed6LdBg)
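
As a rough illustration of the distinctions in this section, the sketch below expresses ordinal, nominal, discrete, and continuous data in plain Python. The names `IncomeLevel`, `CUSTOMER_TYPES`, and `classify_number` are hypothetical, and treating whole numbers as discrete and decimals as continuous is a simplifying assumption rather than a formal rule.

```python
from enum import IntEnum


# Ordinal: categories with a set order, so comparisons are meaningful.
class IncomeLevel(IntEnum):
    LOW = 1
    MIDDLE = 2
    HIGH = 3


# Nominal: categories without a set order; only membership and equality make sense.
CUSTOMER_TYPES = {"first-time", "returning", "regular"}


def classify_number(value):
    """Simplifying assumption: whole-number counts are discrete, decimals are continuous."""
    if isinstance(value, bool):
        raise TypeError("Boolean values are not quantitative data")
    if isinstance(value, int):
        return "discrete"
    if isinstance(value, float):
        return "continuous"
    raise TypeError(f"not a number: {value!r}")


print(classify_number(200))    # daily hospital visitors, a count
print(classify_number(52.5))   # height in inches, a measurement
print(IncomeLevel.LOW < IncomeLevel.HIGH)   # ordering is defined for ordinal data
print("returning" in CUSTOMER_TYPES)        # nominal data supports membership checks only
```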

##### Further reading
- Comparison of data models - https://www.1keydata.com/datawarehousing/data-modeling-levels.html
- Data modeling technique - https://dataedo.com/blog/basic-data-modeling-techniques

#### Explore data types, fields, and values
- Data type
    - A specific kind of data attribute that tells what kind of value the data is
    - Tells you what kind of data you're working with
- Data type in spreadsheets
    - Number
    - Text or string
    - Boolean
- Data tables
    - Rows
        - Can be referred to as Records
    - Columns
        - Can be referred to as Fields
    - Records and Fields can be used for any kind of data table, while Rows and Columns are usually reserved for spreadsheets
- Wide data
    - Every data subject has a single row with multiple columns to hold the values of various attributes of the subject
    - Lets you easily find and quickly compare different columns
    - Preferred when:
        - Creating tables and charts with a few variables about each subject
        - Comparing straightforward line graphs
- Long data
    - Each row is one time point per subject, so each subject will have data in multiple rows
    - Great format for storing and organizing data when there are multiple variables for each subject at each time point that we want to observe
    - Preferred when:
        - Storing a lot of variables about each subject. For example, 60 years' worth of interest rates for each bank
        - Performing advanced statistical analysis or graphing
- Data transformation
    - The process of changing the data’s format, structure, or values
    - Usually involves:
        - Adding, copying, or replicating data
        - Deleting fields or records
        - Standardizing the names of variables
        - Renaming, moving, or combining columns in a database
        - Joining one set of data with another
        - Saving a file in a different format. For example, saving a spreadsheet as a comma separated values (CSV) file.
- Why transform data?
    - Data organization: better organized data is easier to use
    - Data compatibility: different applications or systems can then use the same data
    - Data migration: data with matching formats can be moved from one system to another
    - Data merging: data with the same organization can be merged together
    - Data enhancement: data can be displayed with more detailed fields
    - Data comparison: apples-to-apples comparisons of the data can then be made
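
The wide and long formats and the idea of data transformation can be tied together in one small sketch. It assumes made-up interest rates for two hypothetical banks, and `wide_to_long` is an illustrative helper, not a library function. It changes the data's structure (wide to long) and then saves the result in a different format (CSV), here written to an in-memory buffer instead of a file.

```python
import csv
import io

# Hypothetical wide data: one row per bank, one column per year (made-up rates).
wide_rows = [
    {"bank": "Bank A", "2021": 1.5, "2022": 2.0, "2023": 3.25},
    {"bank": "Bank B", "2021": 1.75, "2022": 2.5, "2023": 3.0},
]


def wide_to_long(rows, id_field, var_name, value_name):
    """Turn each non-id column into its own row: one row per subject per time point."""
    long_rows = []
    for row in rows:
        for column, value in row.items():
            if column == id_field:
                continue
            long_rows.append({id_field: row[id_field], var_name: column, value_name: value})
    return long_rows


long_rows = wide_to_long(wide_rows, id_field="bank", var_name="year", value_name="interest_rate")

# Save the transformed data as CSV text.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["bank", "year", "interest_rate"])
writer.writeheader()
writer.writerows(long_rows)
print(buffer.getvalue())
```

Each bank now appears in three rows, one per year, which is the shape most statistical and graphing tools expect.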
    
##### Further reading
- Tips for searching with Boolean operators - https://libguides.mit.edu/c.php?g=175963&p=1158594
- Origins of Boolean Algebra in the Logic of Classes - https://maa.org/press/periodicals/convergence/origins-of-boolean-algebra-in-the-logic-of-classes-george-boole-john-venn-and-c-s-peirce
- Machine Learning Tutorial for Beginners - https://www.kaggle.com/code/kanncaa1/machine-learning-tutorial-for-beginners
- [gganimate](https://www.kaggle.com/code/mrisdal/gganimate/notebook)
- [Getting staRted in R: First Steps](https://www.kaggle.com/code/rtatman/getting-started-in-r-first-steps/notebook)
- [Writing Hamilton Lyrics with Tensorflow/R](https://www.kaggle.com/code/anasofiauzsoy/writing-hamilton-lyrics-with-tensorflow-r/notebook)
- [Dive into dplyr (tutorial #1)](https://www.kaggle.com/code/jessemostipak/dive-into-dplyr-tutorial-1/notebook)

#### Glossary
https://docs.google.com/document/d/1qmlDAzuprOPslSCjEA63Ok645KKdjvaKMzYZzJcKSyQ/template/preview

---

## Module 2: Bias, credibility, privacy, ethics, and access

---

## Module 3: Databases: Where data lives

---

## Module 4: Organizing and protecting your data

---

## Module 5: Optional: Engaging in the data community

---

## Module 6: Course challenge