layout | published-on | author | title | subtitle | description | keywords | meta_og_image |
---|---|---|---|---|---|---|---|
post | March 27th 2023 | Josh Patterson | Appendix B - Data Roles | The Hitchhiker's Guide To Building Modern Data Products | In this post we'll ..... | snowflake, snowpark, automl, AutoGluon, pandas, dataframe, whl, pip, anaconda, dependency | pct_autogluon_dep_og_card.jpg |
Purpose of this series:
To develop a clear step-by-step process to design and operate data infrastructure for your data product.
The intended audience for this series is:
Individual researchers, data scientists, and enterprise data teams.
This article is part of a larger series:
- Prologue (Don't Panic)
- The Evolution of Modern Data Platforms
- Revisiting The Lab and the Factory
- A Methodology for Building Data Products
- Appendix A: Definitions
- Appendix B: Roles
In this appendix we break down the specific roles involved in analytics and machine learning. Roles in the data platform space require many skills, and each role overlaps with the others in the skills it requires. In the infographic below, we compare the skills of each role (rated 1-5 per skill per role) in a polar plot:
Note: This diagram was inspired by a similar diagram in the Data Captain's article "Guide to Data roles".
In the sections that follow I break down the details of each role and the tools commonly used in it.
A data scientist is a professional who uses their skills in statistical analysis, machine learning, data mining, and data visualization to extract insights from large and complex sets of data. They possess technical skills in programming languages such as Python or R, and data manipulation tools such as SQL and Excel. Additionally, they must have strong communication skills to effectively present their findings to non-technical stakeholders.
Data scientists work with data engineers to build machine learning models based on datasets extracted from the data warehouse and from other sources such as log files and tabular data in flat files.
- Python, Pandas, Jupyter Notebooks
- Scikit-learn, TensorFlow, PyTorch, and others
- SAS
- R
A data engineer is a professional who designs and builds the infrastructure to store, process, and analyze large and complex data sets. They work closely with data scientists to identify the data requirements and ensure that the necessary data is available in a usable format.
Data engineers are skilled in data architecture, database design, and distributed systems. They are responsible for developing and implementing data pipelines, transforming data into a usable format, and loading it into a data warehouse. They also optimize data processing systems to ensure efficiency and scalability.
Below we list the key tasks and required skills for a data engineer, based on the article "The Rise of the Data Engineer".
- data ingestion
- metric computation
- anomaly detection
- metadata management
- experimentation
- instrumentation
- sessionization
- SQL
- Data modeling techniques
- ETL design
- System architecture
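To make one of the tasks listed above more concrete, here is a minimal sketch of sessionization in plain Python. It assumes events arrive as `(user_id, timestamp)` pairs and that a gap of more than 30 minutes starts a new session. The function name `sessionize`, the threshold, and the sample events are all illustrative assumptions and do not come from the referenced article.

```python
from datetime import datetime, timedelta
from itertools import groupby
from operator import itemgetter

# Assumption: more than 30 minutes of inactivity starts a new session.
SESSION_GAP = timedelta(minutes=30)


def sessionize(events, gap=SESSION_GAP):
    """Assign a per-user session number to each (user_id, timestamp) event.

    Returns (user_id, timestamp, session_number) tuples sorted by user and
    time. Session numbers start at 1 for each user.
    """
    ordered = sorted(events, key=itemgetter(0, 1))
    result = []
    for user_id, user_events in groupby(ordered, key=itemgetter(0)):
        session_number = 0
        previous_ts = None
        for _, ts in user_events:
            # The first event, or any event after a long gap, opens a session.
            if previous_ts is None or ts - previous_ts > gap:
                session_number += 1
            result.append((user_id, ts, session_number))
            previous_ts = ts
    return result


# Hypothetical click events: (user_id, timestamp)
events = [
    ("alice", datetime(2023, 3, 27, 9, 0)),
    ("alice", datetime(2023, 3, 27, 9, 10)),
    ("alice", datetime(2023, 3, 27, 11, 0)),
    ("bob", datetime(2023, 3, 27, 9, 5)),
]

sessions = sessionize(events)
```

In a real pipeline this logic would more likely be written in SQL with window functions or in a distributed framework. The sketch only shows the core idea of splitting each user's event stream on inactivity gaps.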
A machine learning engineer is responsible for designing, building, and deploying machine learning systems that can learn from and make predictions on large and complex sets of data. They work closely with data scientists to develop and implement machine learning algorithms and are skilled in programming languages such as Python or Java, as well as machine learning frameworks such as TensorFlow or PyTorch.
Machine learning engineers are responsible for selecting and preparing data sets, tuning the parameters of machine learning models, and designing and developing the software infrastructure needed to support machine learning systems. They work with data scientists and other stakeholders to identify business problems that can be solved with machine learning and to deliver solutions that can provide business value.
Overall, machine learning engineers are highly technical professionals who are critical to the development and implementation of machine learning systems.
Analytics engineer is a role defined by dbt Labs:
Analytics engineers provide clean data sets to end users, modeling data in a way that empowers end users to answer their own questions. While a data analyst spends their time analyzing data, an analytics engineer spends their time transforming, testing, deploying, and documenting data. Analytics engineers apply software engineering best practices like version control and continuous integration to the analytics code base.
We list the key tasks for an analytics engineer from Preset's blog:
- owning core datasets: as analytics engineers cover more specific subject areas, data engineers may cover the mission-critical datasets that are heavily shared across teams
- data modeling: define and refine best practices around modeling data
- coding standards: define and defend naming conventions, coding conventions, and testing standards.
- abstractions: create and manage reusable components in the form of jinja macros libraries and computation frameworks
- metadata management: data assets documentation, discoverability, metadata integration, etc
- data operations: SLAs, data warehouse cost management, data quality monitoring, anomaly detection, and garbage collecting unused resources
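As a rough illustration of the "testing" and "data quality monitoring" responsibilities above, the sketch below implements two common checks, not-null and uniqueness, over rows held in memory. In practice these checks are usually declared in the transformation tool itself instead of hand-written. The helper names `check_not_null` and `check_unique` and the sample `customers` rows are hypothetical.

```python
from collections import Counter


def check_not_null(rows, column):
    """Return the rows where `column` is missing or None."""
    return [row for row in rows if row.get(column) is None]


def check_unique(rows, column):
    """Return the sorted values of `column` that appear more than once."""
    counts = Counter(
        row[column] for row in rows if row.get(column) is not None
    )
    return sorted(value for value, n in counts.items() if n > 1)


# Hypothetical rows from a core "customers" dataset.
customers = [
    {"customer_id": 1, "email": "a@example.com"},
    {"customer_id": 2, "email": None},
    {"customer_id": 2, "email": "c@example.com"},
]

null_emails = check_not_null(customers, "email")
duplicate_ids = check_unique(customers, "customer_id")

# A test "passes" when the check returns no offending rows or values.
tests_passed = not null_emails and not duplicate_ids
```

Returning the offending rows, and not just a boolean, makes a failing check easier to debug, which matters when the dataset is shared across teams.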
A data analyst is a professional who examines large and complex sets of data to identify patterns, trends, and insights that can be used to inform business decisions. They use statistical and analytical techniques such as regression analysis, hypothesis testing, and data visualization to extract insights from data. They are skilled in programming languages such as SQL, Python, or R, and data manipulation tools such as Excel or Tableau.
Data analysts are responsible for collecting and cleaning data, preparing it for analysis, and presenting their findings to stakeholders through reports or visualizations. They play a critical role in informing business decisions by providing insights based on data analysis. Overall, data analysts are highly analytical professionals who are responsible for extracting insights from data and communicating their findings to stakeholders.
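Regression analysis, one of the techniques mentioned above, can be sketched with nothing more than the Python standard library. The example fits a straight line by ordinary least squares. The function name `fit_line` and the ad-spend data are made up for illustration, and an analyst would normally reach for a statistics package instead of writing this by hand.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    x_bar, y_bar = mean(xs), mean(ys)
    # Sum of squared deviations of x, and of the x/y cross-deviations.
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


# Hypothetical data: monthly ad spend (in thousands) vs. new signups.
ad_spend = [1, 2, 3, 4, 5]
signups = [12, 19, 31, 42, 48]

slope, intercept = fit_line(ad_spend, signups)
print(f"signups ~= {slope:.2f} * ad_spend + {intercept:.2f}")
```

The slope is the kind of number an analyst would then translate for stakeholders, for example as the estimated additional signups per extra thousand spent, subject to the usual caveats about correlation and sample size.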
dbt Labs has a great article comparing the data analyst role with other data roles:
While a data analyst spends their time analyzing data, an analytics engineer spends their time transforming, testing, deploying, and documenting data.
Analytics engineers apply software engineering best practices like version control and continuous integration to the analytics code base.
- TODO: more writing on how this role deploys infrastructure as contrasted to roles who "use the infrastructure"