Ideally a new data science hire would work through these resources in their first 30-60 days. I wrote this with a junior data scientist in mind, joining a data science team that uses Python and GCP. The list is roughly ordered.
Rectangle: A window management app for macOS, Rectangle enables you to quickly and effortlessly resize and organize your windows using keyboard shortcuts or by dragging windows to screen edges. I used to use ShiftIt, which did something similar, but Rectangle works on the latest versions of macOS.
Stats: An open-source system monitor for macOS, Stats provides you with detailed information on your CPU, memory, disk, network, and battery usage, all accessible from your menu bar. I used to pay for iStat Menus, but Stats is an open-source alternative.
Amphetamine: Keep your Mac awake and prevent it from sleeping with Amphetamine, a powerful and customizable app that allows you to set rules based on applications, time, or power source. Similar to the Caffeine app.
Be Focused: A productivity-enhancing time management app, Be Focused utilizes the Pomodoro Technique to help you break work into manageable intervals, maintain focus, and stay on track. I find Pomodoros, 25-minute timers of focused work, incredibly helpful.
Hidden Bar: A minimalist app that allows you to declutter your Mac's menu bar by hiding icons you don't need to see all the time, Hidden Bar lets you access these icons with a simple click whenever needed.
Homebrew: A must-have package manager for macOS, Homebrew makes it easy to install, update, and manage software packages, including command-line tools and graphical applications.
Visual Studio Code: A versatile and free source code editor developed by Microsoft, Visual Studio Code supports a wide range of programming languages and comes with built-in support for Git, intelligent code completion, and a plethora of extensions to customize your coding environment.
iTerm2: A highly customizable and feature-rich terminal emulator for macOS, iTerm2 improves upon the default Terminal app with features like split panes, search functionality, and extensive customization options.
Anaconda/Miniconda: Anaconda is a powerful Python and R distribution that simplifies package management and deployment, while Miniconda is its lightweight counterpart. Both options provide you with the essential tools to set up and manage your data science and machine learning environments.
zsh: zsh has become my bash replacement; it is also the default shell on modern macOS.
Oh My Zsh: A framework that makes zsh more useful with a large collection of plugins and themes.
Sublime Text: A sophisticated and lightning-fast text editor designed for code, markup, and prose, Sublime Text offers a sleek interface, multiple selections, and a highly extensible plugin API.
Read through the Foundations sections of madewithml:
| 🛠 Toolkit | 🔥 Machine Learning | 🤖 Deep Learning |
| --- | --- | --- |
| Notebooks | Linear Regression | |
| Python | Logistic Regression | |
| NumPy | Neural Network | |
| Pandas | Data Quality | |
| PyTorch | Utilities | |
Left out MadewithML's material on Attention, Embeddings and Transformers because Jay Alammar's blog posts are better.
There are many command-line tutorials, but this one is decent: https://www.freecodecamp.org/news/command-line-for-beginners/
- gcloud
- gsutil
- BigQuery: read through and run the examples in these BigQuery documentation pages:
- Introduction: https://cloud.google.com/bigquery/docs/query-overview
- Query BigQuery data (15 subpages) https://cloud.google.com/bigquery/docs/running-queries
- Query data with SQL (10 subpages) https://cloud.google.com/bigquery/docs/introduction-sql
Read through the Reproducibility section of madewithml's MLOps course:

| ♻️ Reproducibility |
| --- |
| Git |
| Pre-commit |
| Versioning |
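Once pre-commit is installed, hooks are declared in a `.pre-commit-config.yaml` file at the repo root. A minimal sketch (the hook repos and IDs are real; the pinned `rev` versions here are illustrative and should be updated to current releases):

```yaml
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.4.0  # illustrative pin; use the latest tag
    hooks:
      - id: trailing-whitespace
      - id: end-of-file-fixer
  - repo: https://github.com/psf/black
    rev: 23.3.0  # illustrative pin; use the latest tag
    hooks:
      - id: black
```

Run `pre-commit install` once per clone, and the hooks will run automatically on every `git commit`.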
Sometimes you will run into merge conflicts; read this guide from GitHub on how to resolve them.
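When a conflict occurs, Git marks the disputed region directly in the file. A made-up example of what you will see (the variable and branch names are invented):

```text
<<<<<<< HEAD
learning_rate = 0.01
=======
learning_rate = 0.001
>>>>>>> feature-branch
```

Keep the version you want (or combine them), delete the marker lines, then `git add` the file to mark the conflict resolved.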
- https://valohai.com/blog/docker-for-data-science/
- madewithml Docker guide: https://madewithml.com/courses/mlops/docker/
Hadley Wickham's tidy data paper first introduced me to the idea of tidy data. I first read it around 2017, when I was getting into data analytics. It totally changed how I thought about representing data and gave names to shapes of data like "wide" and "long".
- Read the original paper: https://vita.had.co.nz/papers/tidy-data.pdf
- Work through some Python examples: https://byuidatascience.github.io/python4ds/tidy-data.html
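To make the wide/long distinction concrete, here is a minimal pandas sketch (the data is made up) that melts a wide table into tidy long form:

```python
import pandas as pd

# Wide format: one column per year (hypothetical example data)
wide = pd.DataFrame({
    "country": ["US", "CA"],
    "2019": [100, 50],
    "2020": [110, 55],
})

# melt() reshapes wide -> long ("tidy"): one row per observation
long = wide.melt(id_vars="country", var_name="year", value_name="sales")
print(long)
```

Each row of `long` now holds a single (country, year, sales) observation, which is the shape most plotting and groupby operations expect.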
This is one of the most important skills for a data scientist: most data lives in databases, so being able to extract and manipulate it with SQL is crucial. Mode Analytics provides a good tutorial; start with the intermediate one. https://mode.com/sql-tutorial/
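As a self-contained warm-up before the Mode tutorial, the sketch below runs a typical aggregate-filter-sort query using Python's built-in sqlite3 (the table and data are invented for illustration):

```python
import sqlite3

# In-memory database so the example runs anywhere with no setup
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "alice", 30.0), (2, "bob", 20.0), (3, "alice", 50.0)],
)

# Typical extraction pattern: aggregate per group, filter groups, order
rows = conn.execute(
    """
    SELECT customer, SUM(amount) AS total
    FROM orders
    GROUP BY customer
    HAVING total > 25
    ORDER BY total DESC
    """
).fetchall()
print(rows)  # [('alice', 80.0)]
```

The same `GROUP BY` / `HAVING` / `ORDER BY` pattern carries over directly to BigQuery and other warehouses.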
| 🎨 Design | 🔢 Data | 📈 Modeling |
| --- | --- | --- |
| Product | Exploration | Baselines |
| Engineering | Labeling | Evaluation |
| Project | Preprocessing | Experiment tracking |
| | Splitting | Optimization |
| | Augmentation | |
Understand the sklearn API through this example notebook.
- fit
- transform
- predict
- fit_transform
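A minimal sketch of those four methods, using a scaler and a classifier on toy data (the numbers are made up):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Toy data: four samples, one feature
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([0, 0, 1, 1])

scaler = StandardScaler()
scaler.fit(X)                   # fit: learn parameters (here, mean and std)
X_scaled = scaler.transform(X)  # transform: apply the learned parameters
# fit_transform: both steps in one call, same result
assert np.allclose(X_scaled, StandardScaler().fit_transform(X))

clf = LogisticRegression()
clf.fit(X_scaled, y)            # estimators also learn via fit
preds = clf.predict(X_scaled)   # predict: produce outputs for data
print(preds)
```

The key convention: transformers implement `fit`/`transform`, estimators implement `fit`/`predict`, and everything shares the same `fit` entry point, which is what makes sklearn Pipelines composable.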
Work through Steps 0-7 in the official PyTorch guide: https://pytorch.org/tutorials/beginner/basics/intro.html
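For orientation before starting the tutorial, here is a minimal training-loop sketch, not taken from the guide itself, that fits y = 2x with a single linear layer (the data and hyperparameters are illustrative):

```python
import torch

torch.manual_seed(0)
X = torch.linspace(0, 1, 20).unsqueeze(1)  # 20 samples, 1 feature
y = 2 * X                                  # target: y = 2x

model = torch.nn.Linear(1, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.5)
loss_fn = torch.nn.MSELoss()

for _ in range(300):
    optimizer.zero_grad()        # clear accumulated gradients
    loss = loss_fn(model(X), y)  # forward pass + loss
    loss.backward()              # backpropagate
    optimizer.step()             # update weights

print(float(loss))  # should be close to 0
```

The zero_grad / forward / backward / step cycle is the skeleton of nearly every PyTorch training loop you will write.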
- Use the cookiecutter data science project template for new projects: https://drivendata.github.io/cookiecutter-data-science/
- Read through the developing section of madewithml:
| 💻 Developing |
| --- |
| Packaging |
| Organization |
| Logging |
| Documentation |
| Styling |
| Makefile |
- Similar to @radekosmulski, I use VS Code exclusively in order to use GitHub Copilot in Jupyter
- You can remote SSH to connect to a server and run Jupyter on the server and use Copilot there as well (bare metal VMs, Cloud VMs, etc)
- Learn hotkeys
- Learn by working through the official example
- Streamlit Cheatsheet
@karpathy also recommends spending a couple of hours learning Streamlit
Gradio is a similar library from Hugging Face.
- Use GitHub Copilot for all coding (Python, Jupyter, SQL, etc.). I estimate it makes me 20% more productive for all programming tasks.
- Use Jupyter in VS Code to use Copilot in Jupyter notebooks
| 📦 Serving | ✅ Testing |
| --- | --- |
| Command-line | Code |
| RESTful API | Data |
| | Models |
- https://help.openai.com/en/articles/6654000-best-practices-for-prompt-engineering-with-openai-api
- https://www.promptingguide.ai/
| 🚀 Production | ⎈ Data engineering |
| --- | --- |
| Dashboard | Data stack |
| CI/CD | Orchestration |
| Monitoring | Feature store |
| Systems design | |
Great example of a full ML project (Part 1, Part 2, Part 3) from Will Koehrsen. Steps 1-3 are in Part 1, Steps 4-6 are in Part 2, and Steps 7-8 are in Part 3.
- Data cleaning and formatting
- Exploratory data analysis
- Feature engineering and selection
- Compare several machine learning models on a performance metric
- Perform hyperparameter tuning on the best model to optimize it for the problem
- Evaluate the best model on the testing set
- Interpret the model results to the extent possible
- Draw conclusions and write a well-documented report
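Steps 4-6 above can be sketched end to end with scikit-learn. This is a compressed illustration, not Koehrsen's actual code: synthetic data stands in for a cleaned dataset, and the model list and parameter grids are made up:

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

# Synthetic data stands in for a cleaned, engineered dataset (steps 1-3)
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Step 4: compare several models on one metric
# (ideally compare on a validation split, not the test set)
models = {"ridge": Ridge(), "forest": RandomForestRegressor(random_state=0)}
scores = {
    name: mean_absolute_error(y_test, m.fit(X_train, y_train).predict(X_test))
    for name, m in models.items()
}
best_name = min(scores, key=scores.get)

# Step 5: hyperparameter tuning on the best model (hypothetical grids)
grids = {"ridge": {"alpha": [0.1, 1.0, 10.0]},
         "forest": {"n_estimators": [50, 100]}}
search = GridSearchCV(models[best_name], grids[best_name],
                      scoring="neg_mean_absolute_error")
search.fit(X_train, y_train)

# Step 6: evaluate the tuned model on the held-out test set
test_mae = mean_absolute_error(y_test, search.predict(X_test))
print(best_name, round(test_mae, 2))
```

Steps 7-8 (interpretation and the written report) then build on `search.best_estimator_` and `test_mae`.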
Remember, the field of data science is vast and constantly evolving. The most important skill to develop is the ability to learn and adapt to new tools, technologies, and techniques. Here are some resources to help you continue to learn:
- @lexfridman (and associated podcast transcripts)
- @AndrejKarpathy
- @jamesbriggs
- @ai-explained-