# Prepare Data for Exploration

Notes from this course: https://www.coursera.org/learn/data-preparation/

## Module 1: Data types and structures

### Learning log

#### Collecting data
- As a data analyst, you need to be an expert at structuring and extracting data and at making sure the data you are working with is reliable
- Develop a general idea of how all data is generated and collected since every organization structures data differently
- No matter what data structure you are faced with in your new role, you will feel confident working with it
- Data might be biased instead of credible, or dirty instead of clean
- Your goal is to learn how to analyze data for bias and credibility and to understand what clean data means
- When data analysts work with data, they always check that the data is unbiased and credible
- How to collect, apply, organize, and protect data
- How is data collected
    - Interviews
    - Observations
    - Forms
    - Questionnaires
    - Surveys
    - Cookies
- Knowing how it's generated can help add context to the data
- Knowing how it's collected can help make the data analysis process more efficient
- Data collection considerations
    - How the data will be collected
        - Decide if you will collect the data using your own resources or receive it (and possibly purchase it) from another party
    - Choose data sources
        - First-party 
            - Data collected by an individual or group using their own resources
            - Preferred method because you know exactly where it came from
        - Second-party
            - Data collected by a group directly from its audience and then sold
        - Third-party
            - Data collected from outside sources who did not collect it directly
            - Might come from a number of different sources
            - Might not be as reliable but doesn't mean it can't be useful
            - Needs to be checked for accuracy, bias, and credibility
            - Needs to be inspected for accuracy and trustworthiness no matter what kind of data you use
            - Sold by a provider that didn’t collect the data themselves
    - Decide what data to use
        - Choose data that can actually help solve your problem question
        - For example, if you are analyzing trends over time, make sure you use time series data — in other words, data that includes dates
    - How much data to collect
        - If you are collecting your own data, make reasonable decisions about sample size. A random sample from existing data might be fine for some projects. Other projects might need more strategic data collection to focus on certain criteria. Each project has its own needs. 
    - Select the right data type
    - Determine the timeframe
        - If you are collecting your own data, decide how long you will need to collect it, especially if you are tracking trends over a long period of time. If you need an immediate answer, you might not have time to collect new data. In this case, you would need to use historical data that already exists. 
- Population
    - All possible data values in a certain dataset
- Sample
    - A part of a population that is representative of the population

##### Flowchart
![flowchart](https://d3c33hcgiwev3.cloudfront.net/imageAssetProxy.v1/5TyGAFZrRi28hgBWa-Ytcg_a723a1a4d78b42e1bcb6ddd2178adc42_Screen-Shot-2020-12-14-at-2.19.22-PM.png?expiry=1699142400000&hmac=IoGziIegBLGSwhMhZYoO09rZvWVvHTug1JfkzOnTrQw)

#### Differentiate between data formats and structure
- Quantitative data
    - Can be measured, counted, expressed as a number
    - Data with a certain quantity, amount, or range
    - Can be broken down into discrete or continuous data
    - Example:
        - Percentage of board certified doctors who are women
        - Population of elephants in Africa
        - Distance from Earth to Mars
- Qualitative data
    - Can't be counted, measured, or easily expressed using numbers
    - Usually listed as a name, category, or description
    - Example:
        - Exercise activity most enjoyed
        - Favorite brands of most loyal customers
        - Fashion preferences of young adults
- Discrete data
    - Data that is counted and has a limited number of values
    - Example: Budgets, Stars, Points
    - When partial measurements (half-stars or quarter-points) aren't allowed, that is, when you only accept full stars or points, the data is discrete
    - Example:
        - Number of people who visit a hospital on a daily basis (10, 20, 200)
        - Room’s maximum capacity allowed
        - Tickets sold in the current month
- Continuous data
    - Data that is measured and can have almost any numeric value
    - Can be measured using a timer
    - Can be shown as a decimal with several places
    - Example:
        - Height of kids in third grade classes (52.5 inches, 65.7 inches)
        - Runtime markers in a video
        - Temperature
- Nominal data
    - A type of qualitative data that is categorized without a set order
    - Doesn't have a sequence
    - Example: 
        - First time customer, returning customer, regular customer
        - New job applicant, existing applicant, internal applicant
        - New listing, reduced price listing, foreclosure
        - Q: Ask people if they've watched a movie. 
        - A: Yes, No, Not sure
    - The choices don't have a particular order
- Ordinal data
    - A type of qualitative data with a set order or scale
    - Example:
        - Movie ratings (number of stars: 1 star, 2 stars, 3 stars)
        - Ranked-choice voting selections (1st, 2nd, 3rd)
        - Income level (low income, middle income, high income)
        - Q: Rank a movie from 1 to 5
        - A: 1, 2, 3, 4, 5
    - Rankings are in order of how much each person liked the movie
- Internal data
    - Data that lives within a company's own systems
    - Usually more reliable and easier to collect
    - Example:
        - Wages of employees across different business units tracked by HR
        - Product inventory levels across distribution centers
- External data
    - Data that lives and is generated outside of an organization
    - Valuable when your analysis depends on as many sources as possible
    - Example:
        - National average wages for the various positions throughout your organization
        - Credit reports for customers of an auto dealership
- Structured data
    - Data organized in a certain format such as rows and columns
    - Spreadsheets and Relational Databases are examples of software that can store data in a structured way
    - Makes it easy for analysts to enter, query, and analyze data whenever they need to
    - Makes data visualization pretty easy because it can be applied to charts, graphs, heat maps, dashboards, and most other visual representations of data
    - Stored in relational databases and data warehouses
    - Example:
        - Expense reports
        - Tax returns
        - Store inventory
        - Excel, Google Sheets, SQL, customer data, phone records, transaction history
- Unstructured data
    - Data that is not organized in any easily identifiable manner
    - Example: Audio and video files
    - No clear way to identify or organize their content
    - Might have internal structure but the data doesn't fit neatly in rows and columns like structured data
    - Stored in data lakes, data warehouses, and NoSQL databases
    - Example:
        - Social media posts
        - Emails
        - Videos
        - Audio
        - Photos
        - Text messages, social media comments, phone call transcriptions, various log files, images, audio, video
- Primary data
    - Collected by a researcher from first-hand sources
    - Examples:
        - Data from an interview you conducted
        - Data from a survey returned from 20 participants
        - Data from questionnaires you got back from a group of workers
- Secondary data
    - Gathered by other people or from other research
    - Examples:
        - Data you bought from a local data analytics firm’s customer profiles
        - Demographic data collected by a university
        - Census data gathered by the federal government
- Data model
    - A model that is used for organizing data elements and how they relate to one another
    - Helps keep data consistent
    - Provides a map of how data is organized
    - Makes it easier for analysts and stakeholders to make sense of their data and use it for business purposes
- Data elements
    - Pieces of information, such as people's names, account numbers, and addresses
- Fairness issue
    - The new challenge facing data scientists is making sure advancements in artificial intelligence and machine learning algorithms are inclusive and unbiased. Otherwise, certain elements of a dataset will be more heavily weighted and/or represented than others.
    - An unfair dataset does not accurately represent the population, causing skewed outcomes, low accuracy levels, and unreliable analysis
- Data modeling
    - The process of creating diagrams that visually represent how data is organized and structured
    - These visual representations are called data models
    - You can think of data modeling as a blueprint of a house. At any point, there might be electricians, carpenters, and plumbers using that blueprint. Each one of these builders has a different relationship to the blueprint, but they all need it to understand the overall structure of the house. Data models are similar; different users might have different data needs, but the data model gives them an understanding of the structure as a whole
    - Can help you explore the high-level details of your data and how it is related across the organization’s information systems
    - Data modeling sometimes requires data analysis to understand how the data is put together; that way, you know how to map the data
    - Makes it easier for everyone in your organization to understand and collaborate with you on your data
- Levels of data modeling
    - Conceptual
        - Business concepts
        - Gives a high-level view of the data structure, such as how data interacts across an organization
        - For example, a conceptual data model may be used to define the business requirements for a new database. A conceptual data model doesn't contain technical details
    - Logical
        - Data entities
        - Focuses on the technical details of a database such as relationships, attributes, and entities
        - For example, a logical data model defines how individual records are uniquely identified in a database. But it doesn't spell out actual names of database tables
    - Physical
        - Physical tables
        - Depicts how a database operates
        - Physical data model defines all entities and attributes used; for example, it includes table names, column names, and data types for the database
- Data-modeling techniques
    - Two common methods are: 
        - Entity Relationship Diagram (ERD)
            - Visual way to understand the relationship between entities in the data model
        - Unified Modeling Language (UML)
            - Very detailed diagrams that describe the structure of a system by showing the system's entities, attributes, operations, and their relationships

##### Data formats in practice
![formats](https://d3c33hcgiwev3.cloudfront.net/imageAssetProxy.v1/lpSSp7kPSMqUkqe5D6jKhQ_d475227147854cadb95f7724129bc6f1_C3M1L2R1.png?expiry=1699142400000&hmac=UHdvrLiU1J4jhAPBxdAvFbR-ZeBGuK4E-bCIed6LdBg)
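
As a rough illustration of the nominal versus ordinal distinction from this section, the sketch below sorts ordinal values along their scale and counts nominal values per category. The category labels come from the examples in these notes; the response lists and the helper name `sort_ordinal` are made up for the example.

```python
from collections import Counter

# Ordinal scale: the list order defines the ranking (labels from the notes above).
INCOME_SCALE = ["low income", "middle income", "high income"]


def sort_ordinal(values, scale):
    """Sort ordinal values by their position on the given scale."""
    rank = {label: position for position, label in enumerate(scale)}
    return sorted(values, key=lambda label: rank[label])


# Made-up survey responses, for illustration only.
income_responses = ["high income", "low income", "middle income", "low income"]
watched_movie = ["Yes", "No", "Not sure", "Yes"]

# Ordinal data can be meaningfully sorted along its scale.
ordered_incomes = sort_ordinal(income_responses, INCOME_SCALE)
print(ordered_incomes)

# Nominal data has no order, so counting per category is the natural summary.
answer_counts = Counter(watched_movie)
print(answer_counts.most_common())
```

Sorting the nominal answers alphabetically would still run, but the resulting order would carry no meaning, which is the practical difference between the two types.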

##### Further reading
- Comparison of data models - https://www.1keydata.com/datawarehousing/data-modeling-levels.html
- Data modeling technique - https://dataedo.com/blog/basic-data-modeling-techniques

#### Explore data types, fields, and values
- Data type
    - A specific kind of data attribute that tells what kind of value the data is
    - Tells you what kind of data you're working with
- Data type in spreadsheets
    - Number
    - Text or string
    - Boolean
- Data tables
    - Rows
        - Can be referred to as Records
    - Columns
        - Can be referred to as Fields
    - Records and Fields can be used for any kind of data table, while Rows and Columns are usually reserved for spreadsheets
- Wide data
    - Every data subject has a single row with multiple columns to hold the values of various attributes of the subject
    - Lets you easily find, identify, and quickly compare different columns
    - Preferred when:
        - Creating tables and charts with a few variables about each subject
        - Comparing straightforward line graphs
- Long data
    - Each row is one time point per subject, so each subject will have data in multiple rows
    - Great format for storing and organizing data when there are multiple variables for each subject at each time point that we want to observe
    - Preferred when:
        - Storing a lot of variables about each subject. For example, 60 years' worth of interest rates for each bank
        - Performing advanced statistical analysis or graphing
- Data transformation
    - The process of changing the data’s format, structure, or values
    - Usually involves:
        - Adding, copying, or replicating data
        - Deleting fields or records
        - Standardizing the names of variables
        - Renaming, moving, or combining columns in a database
        - Joining one set of data with another
        - Saving a file in a different format. For example, saving a spreadsheet as a comma separated values (CSV) file.
- Why transform data?
    - Data organization: better organized data is easier to use
    - Data compatibility: different applications or systems can then use the same data
    - Data migration: data with matching formats can be moved from one system to another
    - Data merging: data with the same organization can be merged together
    - Data enhancement: data can be displayed with more detailed fields
    - Data comparison: apples-to-apples comparisons of the data can then be made
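
The wide and long formats described above are two layouts of the same data, and converting between them is a common data transformation. Here is a minimal sketch in plain Python, assuming a tiny made-up table of yearly interest rates per bank; the function name `wide_to_long` and its parameters are hypothetical.

```python
# Wide format: one row per subject, one column per year (values are made up).
wide_rows = [
    {"bank": "Bank A", "2019": 2.5, "2020": 1.75},
    {"bank": "Bank B", "2019": 2.25, "2020": 1.5},
]


def wide_to_long(rows, id_field, var_name, value_name):
    """Turn every non-id column into its own row: one row per subject per time point."""
    long_rows = []
    for row in rows:
        for column, value in row.items():
            if column == id_field:
                continue  # the id column is repeated on every output row instead
            long_rows.append(
                {id_field: row[id_field], var_name: column, value_name: value}
            )
    return long_rows


long_rows = wide_to_long(wide_rows, "bank", "year", "interest_rate")
for row in long_rows:
    print(row)
```

Two wide rows with two year columns become four long rows. In pandas, `DataFrame.melt` performs the same kind of reshaping on a DataFrame.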
    
##### Further reading
- Tips for searching with Boolean operators - https://libguides.mit.edu/c.php?g=175963&p=1158594
- Origins of Boolean Algebra in the Logic of Classes - https://maa.org/press/periodicals/convergence/origins-of-boolean-algebra-in-the-logic-of-classes-george-boole-john-venn-and-c-s-peirce
- Machine Learning Tutorial for Beginners - https://www.kaggle.com/code/kanncaa1/machine-learning-tutorial-for-beginners
- [gganimate](https://www.kaggle.com/code/mrisdal/gganimate/notebook)
- [Getting staRted in R: First Steps](https://www.kaggle.com/code/rtatman/getting-started-in-r-first-steps/notebook)
- [Writing Hamilton Lyrics with Tensorflow/R](https://www.kaggle.com/code/anasofiauzsoy/writing-hamilton-lyrics-with-tensorflow-r/notebook)
- [Dive into dplyr (tutorial #1)](https://www.kaggle.com/code/jessemostipak/dive-into-dplyr-tutorial-1/notebook)

#### Glossary
https://docs.google.com/document/d/1qmlDAzuprOPslSCjEA63Ok645KKdjvaKMzYZzJcKSyQ/template/preview

---

## Module 2: Bias, credibility, privacy, ethics, and access

### Learning log

#### Unbiased and objective data
- Bias
    - A preference in favor of or against a person, group of people, or thing
    - It can be conscious or subconscious
    - Once we know and accept that we have bias, we can start to recognize our own patterns of thinking and learn how to manage it
    - Can also happen if a sample group lacks inclusivity
    - You have to think about bias and fairness from the moment you start collecting data to the time you present your conclusions. After all, those conclusions can have serious implications.
- Data bias
    - A type of error that systematically skews results in a certain direction
- Sampling bias
    - When a sample isn't representative of the population as a whole
    - Can be avoided by making sure the sample is chosen at random, so that all parts of the population have an equal chance of being included
    - If you don't use random sampling during data collection, you end up favoring one outcome
- Unbiased sampling
    - When a sample is representative of the population being measured
    - Another great way to discover if you're working with unbiased data is to bring the results to life with visualizations
- More types of data bias
    - Observer bias
    - Interpretation bias
    - Confirmation bias
- Observer bias
    - Sometimes referred to as experimenter bias or research bias
    - The tendency for different people to observe things differently
    - Example:
        - Two scientists looking into the same microscope might see different things
- Interpretation bias
    - The tendency to always interpret ambiguous situations in a positive or negative way
    - Can lead to two people seeing or hearing the exact same thing and interpreting it in different ways because they have different backgrounds and experiences
    - Example:
        - Let's say you're having lunch with a colleague when you get a voicemail from your boss asking you to call her back. You put the phone down in a huff, certain that she's angry and that you're on the hot seat for something. But when you play the message for your colleague, he doesn't hear anger at all; he actually thinks she sounds calm and straightforward
- Confirmation bias
    - "People see what they want to see"
    - The tendency to search for or interpret information in a way that confirms preexisting beliefs
    - Someone might be so eager to confirm a gut feeling that they only notice things that support it, ignoring all other signals
    - Example:
        - We might get our news from a certain website because the writers share our beliefs, or we socialize with people because we know that they hold similar views. After all, conflicting viewpoints might cause us to question our worldview, which can lead us to change our whole belief system, and let's face it, change is tough
- The four types of bias are all unique but have one thing in common: they each affect the way we collect and make sense of data
- No matter what kind of data you use, all of it needs to be inspected for accuracy and trustworthiness
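
To make the sampling bias point above concrete, here is a small sketch that compares a convenience sample with a random sample. The population, group labels, and sample sizes are invented for the example; the only claim is that `random.sample` gives every member an equal chance of being picked.

```python
import random

random.seed(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: 1,000 people, 30% of whom belong to group "A".
# The list happens to be ordered by group, as exported data often is.
population = ["A"] * 300 + ["B"] * 700


def share_of(group, people):
    """Fraction of people that belong to the given group."""
    return sum(1 for person in people if person == group) / len(people)


# Biased sample: taking the first 100 records only ever sees group "A".
biased_sample = population[:100]

# Random sample: every member has an equal chance of being selected.
random_sample = random.sample(population, k=100)

print("population:", share_of("A", population))
print("biased sample:", share_of("A", biased_sample))
print("random sample:", share_of("A", random_sample))
```

The biased sample reports that everyone is in group "A", while the random sample should land reasonably close to the true 30% share.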

#### Explore data credibility
- Good is subjective
    - What I think is good and what you think is good might be different
    - What about good data sources? Are those subjective too?
- The more high quality data we have, the more confidence we can have in our decisions
- ROCCC
    - Reliable
        - Good data sources are reliable
        - You can trust that you're getting accurate, complete, and unbiased information that's been vetted and proven fit for use
    - Original
        - Data discovered through a second or third party source should be validated with the original source to make sure you're dealing with good data
    - Comprehensive
        - The best data sources contain all critical information needed to answer the question or find the solution
        - Example:
            - You wouldn't want to work for a company just because you found one great online review about it. You'd research every aspect of the organization to make sure it was the right fit
    - Current
        - Usefulness of data decreases as time passes
        - The best data sources are current and relevant to the task at hand
    - Cited
        - Citing makes the information you're providing more credible
        - When you're choosing a data source, think about three things:
            - Who created the data set?
            - Is it part of a credible organization?
            - When was the data last refreshed?
- Lots of places that are known for having good data
    - Vetted public data sets
    - Academic papers
    - Financial data
    - Governmental agency data
- What is "bad" data?
    - Flat-out wrong or filled with human error
    - Not ROCCC
    - Not reliable
        - Bad data can't be trusted because it's inaccurate, incomplete, or biased
        - This could be data that has sample selection bias because it doesn't reflect the overall population. Or it could be data visualizations and graphs that are just misleading
    - Not original
        - If you can't locate the original data source and you're just relying on second or third party information, that can signal you may need to be extra careful in understanding your data
    - Not comprehensive
        - Bad data sources are missing important information needed to answer the question or find the solution. What's worse, they may contain human error, too
    - Not current
        - Bad data sources are out of date and irrelevant. Many respected sources refresh their data regularly, giving you confidence that it's the most current info available
    - Not cited
        - If your source hasn't been cited or vetted, it's a no-go
- It's important for data analysts to understand and keep an eye out for bad data because it can have serious and lasting impacts
- Every good solution is found by avoiding bad data

#### Data ethics and privacy
- Ethics
    - Set of principles to live by
    - Our personal ethics evolve and become more rational, giving us a moral compass to use as we face life's questions, challenges, and opportunities
    - When we analyze data, we're also faced with questions, challenges, and opportunities, but we have to rely on more than just our personal code of ethics to address them
    - An exact definition is still under discussion in philosophy
    - One practical view is that ethics refers to well-founded standards of right and wrong that prescribe what humans ought to do, usually in terms of rights, obligations, benefits to society, fairness or specific virtues.
- Data ethics
    - Refers to well-founded standards of right and wrong that dictate how data is collected, shared, and used
    - Because the ability to collect, share, and use data in such large quantities is relatively new, the rules that regulate and govern the process are still evolving
    - Data ethics tries to get to the root of the accountability companies have in protecting and responsibly using the data they collect
    - Think about:
        - Who's collecting the data?
        - Why are they collecting it?
        - How are they collecting it?
        - For what purpose?
        - How can aspects of users' data be kept protected and private?
        - What mechanisms can give users and consumers more control over their data?
- GDPR
    - General Data Protection Regulation of the European Union
- Aspects of data ethics
    - Ownership
        - This answers the question: who owns data?
        - It isn't the organization that invested time and money collecting, storing, processing, and analyzing it. It's individuals who own the raw data they provide, and they have primary control over its usage, how it's processed and how it's shared
    - Transaction transparency
        - All data processing activities and algorithms should be completely explainable and understood by the individual who provides their data
        - This is in response to concerns over data bias
        - Biased outcomes can lead to negative consequences. To avoid them, it's helpful to provide transparent analysis especially to the people who share their data
        - This lets people judge whether the outcome is fair and unbiased and allows them to raise potential concerns
    - Consent
        - This is an individual's right to know explicit details about how and why their data will be used before agreeing to provide it
        - They should know answers to questions like:
            - Why is the data being collected?
            - How will it be used?
            - How long will it be stored?
        - Consent is important because it prevents all populations from being unfairly targeted, which is a very big deal for marginalized groups who are often disproportionately misrepresented by biased data
    - Currency
        - Individuals should be aware of financial transactions resulting from the use of their personal data and the scale of these transactions
        - If your data is helping to fund a company's efforts, you should know what those efforts are all about and be given the opportunity to opt out.
    - Privacy
        - Privacy is personal. We may all define privacy in our own way, and we're all entitled to it.
        - Privacy means preserving a data subject's information and activity any time a data transaction occurs. This is sometimes called information privacy or data protection
        - It's all about access, use, and collection of data
        - It also covers a person's legal right to their data
        - Protection from unauthorized access to our private data
        - Freedom from inappropriate use of our data
        - The right to inspect, update, or correct our data
        - Ability to give consent to use our data
        - Legal right to access our data
    - Openness
        - Free access, usage, and sharing of data
- Data anonymization
    - Is the process of protecting people's private or sensitive data by eliminating personally identifiable information (PII)
    - What types of data should be anonymized?
        - Healthcare and financial data are two of the most sensitive types of data
        - Telephone numbers
        - Names
        - License plates and license numbers
        - Social security numbers
        - IP addresses
        - Medical records
        - Email addresses
        - Photographs
        - Account numbers
- Personally identifiable information (PII)
    - Information that can be used by itself or with other data to track down a person's identity
- De-identification
    - A process used to wipe data clean of all personally identifying information
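
As a minimal sketch of de-identification, the function below drops PII fields from a record and keeps only a salted hash as a stand-in identifier. The record layout, the `PII_FIELDS` set, and the function name `de_identify` are assumptions made for this example, not a standard.

```python
import hashlib

# Fields treated as PII in this sketch (hypothetical record layout).
PII_FIELDS = {"name", "email", "phone"}


def de_identify(record, salt):
    """Drop PII fields and add a salted hash of the email as a stand-in ID."""
    cleaned = {key: value for key, value in record.items() if key not in PII_FIELDS}
    token_source = (salt + record["email"]).encode("utf-8")
    # Truncated SHA-256 digest; the same salt and email always give the same ID.
    cleaned["subject_id"] = hashlib.sha256(token_source).hexdigest()[:12]
    return cleaned


# Made-up record, for illustration only.
customer = {
    "name": "Jane Doe",
    "email": "jane@example.com",
    "phone": "555-0100",
    "purchase_total": 42.5,
}

safe_record = de_identify(customer, salt="replace-with-a-secret-value")
print(safe_record)
```

This is closer to pseudonymization than full anonymization: the remaining fields can still identify someone when combined with other data, so which fields to remove needs to be decided per dataset.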
    
#### Understanding open data
- Standards
    - Availability and access
        - Open data must be available as a whole, preferably by downloading over the internet in a convenient and modifiable form
        - data.gov
    - Reuse and redistribution
        - Open data must be provided under terms that allow reuse and redistribution including the ability to use it with other datasets
    - Universal participation
        - Everyone must be able to use, reuse, and redistribute the data. There shouldn't be any discrimination against fields, persons, or groups
        - No one can place restrictions on the data like making it only available for use in a specific industry
- Benefits of open data
    - Credible databases can be used more widely
    - Good data can be leveraged, shared, and combined with other data
    - Scientific collaboration, research advances, analytical capacity, and decision-making
- Interoperability is key to open data's success
- Data interoperability
    - The ability of data systems and services to openly connect and share data
- Open data
    - Part of data ethics, which has to do with using data ethically 
- Sites and resources for open data
    - [U.S. government data site](https://data.gov/)
    - [U.S. Census Bureau](https://www.census.gov/data.html)
    - [Open Data Network](https://www.opendatanetwork.com/)
    - [Google Cloud Public Datasets](https://cloud.google.com/datasets)
    - [Dataset Search](https://datasetsearch.research.google.com/)
    
#### Glossary
https://docs.google.com/document/d/1Q19G-LXCzWIpv42LzzJF3wYlRJRF4s9rzmrSVNi9cHw/template/preview

---

## Module 3: Databases: Where data lives

### Learning log

#### Working with databases
- Database
    - Collection of data stored in a computer system
- Meta
    - Usually refers to something that references itself or is completely self-aware
    - Example:
        - If a character in a book knows she's in a book, that's meta
        - If you make a documentary about making documentaries, that's meta
        - If you analyze how you analyze data, that's meta
- Metadata
    - Data about data
    - Think of it like a reference guide
    - Without a guide, all you have is a bunch of data with no context explaining what it means
    - Tells where the data comes from, when and how it was created, and what it's all about
- Relational database
    - A database that contains a series of tables that can be connected via their relationships
    - Allows data analysts to organize and link data based on what the data has in common
- Non-relational table
    - You will find all of the possible variables you might be interested in analyzing all grouped together
    - Hard to sort through
- Primary key
    - An identifier that references a column in which each value is unique
    - Used to ensure data in a specific column is unique
    - Uniquely identifies a record in a relational database table
    - Only one primary key is allowed in a table
    - Cannot contain null or blank values
- Foreign key
    - A field within a table that is a primary key in another table
    - A column or group of columns in a relational database table that provides a link between the data in two tables
    - More than one foreign key is allowed to exist in a table
- Database Normalization
    - A process of organizing data in a relational database
    - It is applied to eliminate data redundancy, increase data integrity, and reduce complexity in a database
    - For example, creating tables and establishing relationships between those tables
- Key to relational databases
    - Tables in a relational database are connected by the fields they have in common
    - Some tables don't require a primary key
    - Composite key
        - A primary key may also be constructed using multiple columns of a table
- Structured Query Language (SQL) 
    - A type of query language that lets data analysts communicate with a database
    - Data analysts use SQL to create a query to view the specific data that they want from within the larger set
    - Data analysts can write queries to get data from the related tables
- Inspecting a dataset
    - Before you begin an analysis, it’s important to inspect your data to determine if it contains the specific information you need to answer your stakeholders’ questions
    - Problems you might find:
        - The data is not there (you have sandwich data, but you need pizza data)
        - The data is insufficient (you have pizza data for June 1-7, but you need data for the entire month of June)
        - The data is incorrect (your pizza data lists the cost of a slice as $250, which makes you question the validity of the dataset)
    - Inspecting your dataset will help you pinpoint what questions are answerable and what data is still missing
    - You may be able to recover the missing data from an external source or at least recommend to your stakeholders that another data source be used
- Ice cream sales sample problem:
    - Question 1: What is the most popular flavor of ice cream?
        - To discover the most popular flavor, you first need to define what is meant by "popular." Is the most popular flavor the one that generated the most revenue in 2019? Or is it the flavor that had the largest number of units sold in 2019? 
        - The dataset did not come with a data description, so you have to figure out the significance of the columns on your own
        - Your next step would be to ask your stakeholders if the annual sales per flavor data is available from another source. If not, you can add a statement about the current data’s limitations to your analysis. 
    - Question 2: How does temperature affect sales?
        - When daily highs are above X degrees, average ice cream sales increase by Y amount
    - Question 3: How do weekends and holidays affect sales?
        - Add a column on the spreadsheet to determine if the row is weekend or holiday
        - Find out whether sales on weekends and holidays are greater than sales on other days
    - Question 4: How does profitability differ for new customers versus returning customers?
        - Dataset does not contain sales data related to new customers
        - It may be the case that the company collects customer data and stores it in a different data table
        - Your next step would be to find out how to access the company’s customer data
        - You can then join the revenue sales data to the customer data table to categorize each sale as from a new or returning customer and analyze the difference in profitability between the two sets of customers
        - This information will help your stakeholders develop marketing campaigns for specific types of customers to increase brand loyalty and overall profitability
    - Conclusion
        - You won’t always have all the necessary or relevant data at your disposal
        - In many of these cases, you can turn to other data sources to fill in the gaps
        - Despite the limitations of your dataset, it’s still possible to offer your stakeholders some valuable insights
        - For next steps, your best plan of action will be to take the initiative to ask questions, identify other relevant datasets, or do some research on your own
        - No matter what data you’re working with, carefully inspecting your data makes a big impact on the overall quality of your analysis
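
The inspection and weekend/holiday steps from the ice cream problem can be sketched in Python. This is a minimal illustration, not the course's method: the sales rows, the `HOLIDAYS` set, and the helper names `missing_dates` and `is_weekend_or_holiday` are all hypothetical.

```python
from datetime import date, timedelta
from statistics import mean

# Hypothetical daily sales as (date, units sold); the numbers are invented.
sales = [
    (date(2019, 7, 4), 120),   # Thursday, treated as a holiday below
    (date(2019, 7, 5), 80),    # Friday
    (date(2019, 7, 6), 110),   # Saturday
    (date(2019, 7, 7), 130),   # Sunday
    (date(2019, 7, 8), 70),    # Monday
    (date(2019, 7, 10), 90),   # Wednesday (July 9 has no row)
]

# Assumed holiday calendar; a real analysis would load one for the right region.
HOLIDAYS = {date(2019, 7, 4)}


def missing_dates(rows, start, end):
    """Return every date from start to end (inclusive) that has no sales row."""
    present = {day for day, _ in rows}
    gaps = []
    current = start
    while current <= end:
        if current not in present:
            gaps.append(current)
        current += timedelta(days=1)
    return gaps


def is_weekend_or_holiday(day):
    """True for Saturday, Sunday, or a date listed in HOLIDAYS."""
    return day.weekday() >= 5 or day in HOLIDAYS


# Step 1: check whether the data is sufficient for the period in question.
gaps = missing_dates(sales, date(2019, 7, 4), date(2019, 7, 10))

# Step 2: add the weekend/holiday flag as an extra "column" on each row.
flagged = [(day, units, is_weekend_or_holiday(day)) for day, units in sales]

# Step 3: compare average sales for flagged days against all other days.
special_avg = mean(units for _, units, flag in flagged if flag)
regular_avg = mean(units for _, units, flag in flagged if not flag)

print("Missing dates:", gaps)
print("Weekend/holiday average:", special_avg)
print("Other days average:", regular_avg)
```

Reporting the gap list alongside the averages keeps the dataset's limitations visible to stakeholders.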
    
#### Managing data with metadata
- Metadata
    - Used in database management to help data analysts interpret the contents of the data within the database
    - As important as the data itself
    - Tells the who, what, when, where, which, how, and why of data
    - Ensures that you are able to find, use, preserve, and reuse data in the future
    - Creates a single source of truth by keeping things consistent and uniform
    - Makes data more reliable by making sure it's accurate, precise, relevant, and timely
- 3 common types of metadata
    - Descriptive
        - Describes a piece of data and can be used to identify it at a later point in time
        - Example: Book's ISBN
    - Structural
        - Indicates how a piece of data is organized and whether it is part of one, or more than one, data collection
        - Important to note that structural metadata keeps track of the relationship between two things
        - Example: How pages of a book are put together to create different chapters
    - Administrative
        - Indicates the technical source and details of a digital asset
- Elements of metadata
    - Title and description
        - What is the name of the file or website you are examining? What type of content does it contain?
    - Tags and categories
        - What is the general overview of the data that you have? Is the data indexed or described in a specific way? 
    - Who created it and when
        - Where did the data come from, and when was it created? Is it recent, or has it existed for a long time?
    - Who last modified it and when
        - Were any changes made to the data? If yes, were the modifications recent?
    - Who can access or update it
        - Is this dataset public? Are special permissions needed to customize or modify the dataset?
- Examples of metadata
    - Photos
    - Emails
    - Spreadsheets and documents
    - Websites
    - Digital files
    - Books
- Remember, it will be your responsibility to manage and make use of data in its entirety; metadata is as important as the data itself
- Data that is uniform can be organized, classified, stored, accessed, and used effectively
- When a database is consistent, it's so much easier to discover relationships between the data inside it and the data elsewhere
- Metadata repository
    - A database specifically created to store metadata
    - Can be stored in a physical location or be virtual
    - Describes where the data came from, keeps it in an accessible form so it can be used quickly and easily, and keeps it in a common structure for everyone who may need to use it
    - Makes it easier and faster to bring together multiple sources for data analysis
        - Describes the state and location of the metadata
        - Describes the structures of the tables inside
        - Describes how the data flows through the repository
        - Keeps track of who accesses the metadata and when
    - Using a metadata repository, a data analyst can find it easier to bring together multiple sources of data, confirm how or when data was collected, and verify that data from an outside source is being used appropriately
- Data governance
    - A process to ensure the formal management of a company's data assets
    - Gives an organization better control of its data and helps a company manage issues related to data security and privacy, integrity, usability, and internal and external data flows
    - More than just standardizing terminology and procedures. It's about the roles and responsibilities of the people who work with the metadata every day
- Metadata specialists
    - Organize and maintain company data, ensuring that it's of the highest possible quality
    - Create basic metadata identification and discovery information, describe the way different data sets work together, and explain the many different types of data resources
    - Create the standards that everyone follows and the models used to organize the data
    - Make data accessible by sharing it with colleagues and other stakeholders
- Data lake
    - Brings together all of the sources of data that you might want to use in an analysis into one place
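
As a small illustration of metadata for a digital file, the sketch below reads a few of the elements listed above (title, type, size, last modified) from the file system. The `FileMetadata` class and `describe_file` helper are hypothetical names, and elements such as the creator or access permissions are left out because they depend on the platform.

```python
import tempfile
from dataclasses import dataclass
from datetime import datetime, timezone
from pathlib import Path


@dataclass
class FileMetadata:
    """A few metadata elements for one file (hypothetical structure)."""
    title: str
    file_type: str
    size_bytes: int
    last_modified: datetime


def describe_file(path: Path) -> FileMetadata:
    """Collect metadata about a file without reading its contents."""
    info = path.stat()
    return FileMetadata(
        title=path.name,
        file_type=path.suffix.lstrip(".") or "unknown",
        size_bytes=info.st_size,
        last_modified=datetime.fromtimestamp(info.st_mtime, tz=timezone.utc),
    )


# Create a throwaway sample file so the example is self-contained.
with tempfile.TemporaryDirectory() as folder:
    sample = Path(folder) / "SalesReport_20201125_v02.csv"
    sample.write_text("flavor,units\nvanilla,10\n")
    metadata = describe_file(sample)

print(metadata)
```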

#### Accessing different data sources
- IMPORTRANGE
    - Enables you to specify a range of cells in the other spreadsheet to duplicate in the spreadsheet you are working in
- IMPORTHTML
    - Enables you to import the data from an HTML table (or list) on a web page
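
In Google Sheets these functions take the forms `IMPORTRANGE(spreadsheet_url, range_string)` and `IMPORTHTML(url, query, index)`, where `query` is "table" or "list" and `index` starts at 1. To show roughly what IMPORTHTML does with a table, here is a Python sketch that pulls the rows out of an HTML string using only the standard library. The `TableExtractor` class, the `import_html_table` helper, and the sample page are hypothetical; the sketch does not fetch a URL or handle nested tables.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collects the rows of every <table> in a page as lists of cell text."""

    def __init__(self):
        super().__init__()
        self.tables = []   # one list of rows per table found
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None


def import_html_table(html, index):
    """Return the rows of the index-th table (1-based, like IMPORTHTML)."""
    parser = TableExtractor()
    parser.feed(html)
    return parser.tables[index - 1]


# Invented sample page standing in for a real web page.
page = """
<table>
  <tr><th>Flavor</th><th>Units</th></tr>
  <tr><td>Vanilla</td><td>120</td></tr>
  <tr><td>Mint</td><td>95</td></tr>
</table>
"""

rows = import_html_table(page, 1)
print(rows)
```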

##### Further reading
- [Web scraping made easy](https://www.thedataschool.co.uk/anna-prosvetova/web-scraping-made-easy-import-html-tables-or-lists-using-google-sheets-and-excel/)

##### Public datasets
- [Google Cloud Public Datasets](https://cloud.google.com/datasets)
- [Dataset Search](https://datasetsearch.research.google.com/)
- [Kaggle](https://www.kaggle.com/datasets)
- [BigQuery](https://cloud.google.com/bigquery/public-data)
- Public health datasets
    - [Global Health Observatory data](https://www.who.int/data/collections)
    - [The Cancer Imaging Archive (TCIA) dataset](https://cloud.google.com/healthcare-api/docs/resources/public-datasets/idc)
    - [1000 Genomes](https://cloud.google.com/life-sciences/docs/resources/public-datasets/1000-genomes)
- Public climate datasets
    - [National Climatic Data Center](https://www.ncei.noaa.gov/products)
    - [NOAA Public Dataset Gallery](https://www.climate.gov/maps-data/all?listingMain=datasetgallery)
- Public social-political datasets
    - [UNICEF State of the World’s Children](https://data.unicef.org/resources/dataset/sowc-2019-statistical-tables/)
    - [CPS Labor Force Statistics](https://www.bls.gov/cps/tables.htm)
    - [The Stanford Open Policing Project](https://openpolicing.stanford.edu/)

#### Sorting and filtering
- Sorting
    - Arranging data into a meaningful order to make it easier to understand, analyze, and visualize
- Filtering
    - Showing only the data that meets specific criteria while hiding the rest
- Data cleaning
    - Corrects or removes incorrect, missing, and faulty data
    - Is of critical importance because an analysis based on dirty data can lead to wrong conclusions and bad decisions
- Spreadsheets are better-suited to self-contained data, where the data exists in one place
- Databases can be relational while spreadsheets cannot
- Databases can be used to store data from external tables, allowing you to change data in several places by editing in only one place
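
Sorting, filtering, and a basic cleaning step can be sketched in a few lines of Python. The rows and the threshold of 100 units are invented for illustration.

```python
# Hypothetical rows; one has a missing value to show a cleaning step.
rows = [
    {"flavor": "vanilla", "units": 120},
    {"flavor": "mint", "units": 95},
    {"flavor": "chocolate", "units": 150},
    {"flavor": "lemon", "units": None},
]

# Data cleaning: remove rows whose units value is missing.
clean = [row for row in rows if row["units"] is not None]

# Sorting: arrange rows from most to fewest units sold.
by_units = sorted(clean, key=lambda row: row["units"], reverse=True)

# Filtering: keep only rows that meet the criterion (at least 100 units).
popular = [row for row in by_units if row["units"] >= 100]

print([row["flavor"] for row in by_units])
print([row["flavor"] for row in popular])
```

Cleaning comes first here because sorting a column that mixes numbers with missing values would raise an error.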

#### Working with large datasets in SQL
- [BigQuery](https://cloud.google.com/bigquery/docs)
    - A data warehouse on Google Cloud that data analysts can use to query and filter large datasets, aggregate results, and perform complex operations
- In-depth guide: SQL best practices
    - https://d3c33hcgiwev3.cloudfront.net/5vVDkB5qT1y1Q5Aeau9c_Q_6d0e31160e2e43479d172390d19853f1_DAC3-In-depth-guide_-SQL-best-practices.pdf?Expires=1700006400&Signature=Fz-rqGB1WlHTr6LN1vuZPRc4PgPirOONj5CA2XBuYtb1djJZchhJNCQLqVezLtC32Nqsyv9VDwX-7F7wTppumCcmz7~Ko6TpMOgRSJH6535BelVLJy3kDy4psKETIul8XGSMfAJZkZhxlVz8ee-xTa-w775Kk5PLly2csuYm2w8_&Key-Pair-Id=APKAJLTNE6QMUY6HBC5A
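
To give a feel for querying, filtering, and aggregating with SQL, the sketch below runs a query against an in-memory SQLite database from Python. SQLite is only a local stand-in here, not BigQuery, and the `sales` table and its rows are invented for illustration.

```python
import sqlite3

# In-memory database as a small stand-in for a data warehouse.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (flavor TEXT, sale_date TEXT, units INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("vanilla", "2019-07-04", 120),
        ("vanilla", "2019-07-05", 80),
        ("mint", "2019-07-04", 95),
        ("mint", "2019-07-05", 20),
        ("chocolate", "2019-07-05", 150),
    ],
)

# Filter rows (WHERE), aggregate per flavor (GROUP BY), then sort the result.
query = """
    SELECT flavor, SUM(units) AS total_units
    FROM sales
    WHERE units >= 50
    GROUP BY flavor
    ORDER BY total_units DESC
"""
results = conn.execute(query).fetchall()
conn.close()

for flavor, total_units in results:
    print(flavor, total_units)
```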

#### Glossary
https://docs.google.com/document/d/1TYydDqJIHiVMEjpOX7tI5VNsaqteiyZM97cz7VZuXqQ/template/preview

---

## Module 4: Organizing and protecting your data

### Learning log

#### Effectively organize data
- Benefits of organizing data
    - Makes it easier to find and use
    - Helps you avoid making mistakes during your analysis
    - Helps to protect your data
- Best practices when organizing data
    - Naming conventions
    - Foldering
    - Archiving older files
    - Align your naming and storage practices with your team
- Naming conventions
    - Consistent guidelines that describe the content, date, or version of a file in its name
    - Use logical and descriptive names for your files to make them easier to find and use
- Foldering
    - Organizing your files into folders helps keep project-related files together in one place
- Archiving
    - Move old projects to a separate location to create an archive and cut down on clutter
- Align your naming and storage practices with your team
    - Avoids confusion
- Think about how often you're making copies of data and storing it in different places
- Best practices for file naming conventions
    - Work out and agree on file naming conventions early on in a project to avoid renaming files again and again.
    - Align your file naming with your team's or company's existing file-naming conventions.
    - Ensure that your file names are meaningful; consider including information like project name and anything else that will help you quickly identify (and use) the file for the right purpose.
    - Include the date and version number in file names; common formats are YYYYMMDD for dates and v## for versions (or revisions).
    - Create a text file as a sample file with content that describes (breaks down) the file naming convention and a file name that applies it.
    - Avoid spaces and special characters in file names because they can cause errors in some applications. Instead, use dashes, underscores, or capital letters, for example: SalesReport_2020_11_25_v02
    - Keep file names short and sweet
    - Format dates yyyymmdd: SalesReport20201125
    - Lead revision numbers with 0: SalesReport20201125v02
- Best practices for keeping files organized
    - Create folders and subfolders in a logical hierarchy so related files are stored together.
    - Separate ongoing from completed work so your current project files are easier to find. Archive older files in a separate folder, or in an external storage location.
    - If your files aren't automatically backed up, manually back them up often to avoid losing important work.
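
One way to apply the file naming practices above is to generate and check names in code. The sketch below assumes a `Project_YYYYMMDD_v##.ext` convention; that exact pattern, and the helper names `build_file_name` and `follows_convention`, are illustrative choices that a team would replace with its own agreed convention.

```python
import re
from datetime import date


def build_file_name(project, day, version, extension):
    """Build a name with a YYYYMMDD date and a zero-padded version number."""
    return f"{project}_{day:%Y%m%d}_v{version:02d}.{extension}"


# Assumed convention: letters, digits, or dashes, then _YYYYMMDD_v##.ext
NAME_PATTERN = re.compile(r"[A-Za-z0-9-]+_\d{8}_v\d{2}\.[a-z0-9]+")


def follows_convention(name):
    """Check that a file name matches the assumed convention exactly."""
    return NAME_PATTERN.fullmatch(name) is not None


name = build_file_name("SalesReport", date(2020, 11, 25), 2, "csv")
print(name)
print(follows_convention(name))
print(follows_convention("Sales Report final.csv"))
```

Because the date is written as YYYYMMDD, sorting file names alphabetically also sorts them chronologically.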

#### Securing data
- Data security
    - Protecting data from unauthorized access or corruption by adopting safety measures
    - Usually the purpose of data security is to keep unauthorized users from accessing or viewing sensitive data
    - Data analysts have to find a way to balance data security with their actual analysis needs
- Encryption
    - Uses a unique algorithm to alter data and make it unusable by users and applications that don’t know the algorithm
    - This algorithm is saved as a “key” which can be used to reverse the encryption; so if you have the key, you can still use the data in its original form
- Tokenization
    - Replaces the data elements you want to protect with randomly generated data referred to as a “token.”
    - The original data is stored in a separate location and mapped to the tokens. To access the complete original data, the user or application needs to have permission to use the tokenized data and the token mapping. This means that even if the tokenized data is hacked, the original data is still safe and secure in a separate location.
- Encryption and tokenization are just some of the data security options out there. There are a lot of others, like using authentication devices for AI technology
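
The tokenization idea can be sketched as follows: sensitive values are swapped for random tokens, and the mapping back to the originals lives in a separate store. The `TokenVault` class and the customer records are hypothetical, and this toy keeps the mapping in memory, whereas a real system would keep it in a separately secured location.

```python
import secrets


class TokenVault:
    """Holds the token-to-value mapping, kept apart from the tokenized data."""

    def __init__(self):
        self._token_to_value = {}
        self._value_to_token = {}

    def tokenize(self, value):
        """Return a random token for the value, reusing it for repeat values."""
        if value not in self._value_to_token:
            token = secrets.token_hex(8)  # 16 random hexadecimal characters
            self._value_to_token[value] = token
            self._token_to_value[token] = value
        return self._value_to_token[value]

    def detokenize(self, token):
        """Look up the original value; requires access to the vault."""
        return self._token_to_value[token]


# Invented customer records containing a sensitive field.
customers = [
    {"name": "Customer A", "card": "4111-1111-1111-1111"},
    {"name": "Customer B", "card": "5500-0000-0000-0004"},
]

vault = TokenVault()
tokenized = [{**row, "card": vault.tokenize(row["card"])} for row in customers]

# The tokenized rows can be shared for analysis without exposing card numbers.
print(tokenized)
print(vault.detokenize(tokenized[0]["card"]))
```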

#### Glossary
https://docs.google.com/document/d/1W3Uzz4TdlNlHNrGPQytypTKxrYteQggcWS8r6cBk-CM/template/preview

---

## Module 5: Optional: Engaging in the data community

---

## Module 6: Course challenge