lawwu/data-science-toolkit

A toolkit for new hires in data science. Designed with Python users in mind.

Ideally, a new data science hire would work through these resources in their first 30-60 days. I wrote this with a junior data scientist in mind, joining a data science team that uses Python and GCP. The list is roughly ordered.

Computer & Environment Setup

Productivity

Rectangle: A window management app for macOS. Rectangle lets you quickly resize and organize your windows using keyboard shortcuts or by dragging windows to screen edges. I used to use ShiftIt, which did something similar, but Rectangle works on the latest versions of macOS.

Stats: An open-source system monitor for macOS. Stats shows detailed information on your CPU, memory, disk, network, and battery usage, all from your menu bar. I used to pay for iStat Menus, but Stats is an open-source alternative.

Amphetamine: Keep your Mac awake and prevent it from sleeping with Amphetamine, a powerful and customizable app that lets you set rules based on applications, time, or power source. Similar to the Caffeine app.

Be Focused: A time management app that uses the Pomodoro Technique to help you break work into manageable intervals, maintain focus, and stay on track. I find Pomodoros (25-minute timers of focused work) incredibly helpful.

Hidden Bar: A minimalist app that allows you to declutter your Mac's menu bar by hiding icons you don't need to see all the time, Hidden Bar lets you access these icons with a simple click whenever needed.

Developer Tools

Homebrew: A must-have package manager for macOS, Homebrew makes it easy to install, update, and manage software packages, including command-line tools and graphical applications.

Visual Studio Code: A versatile and free source code editor developed by Microsoft, Visual Studio Code supports a wide range of programming languages and comes with built-in support for Git, intelligent code completion, and a plethora of extensions to customize your coding environment.

iTerm2: A highly customizable and feature-rich terminal emulator for macOS, iTerm2 improves upon the default Terminal app with features like split panes, search functionality, and extensive customization options.

Anaconda/Miniconda: Anaconda is a powerful Python and R distribution that simplifies package management and deployment, while Miniconda is its lightweight counterpart. Both options provide you with the essential tools to set up and manage your data science and machine learning environments.

zsh: zsh has become my bash replacement.

Oh My Zsh: Makes zsh more useful with a bunch of plugins.

Sublime Text: A sophisticated and lightning-fast text editor designed for code, markup, and prose, Sublime Text offers a sleek interface, multiple selections, and a highly extensible plugin API.

Technical

Read through the Foundations section of madewithml:

🛠  Toolkit: Notebooks, Python, NumPy, Pandas, PyTorch
🔥  Machine Learning: Linear Regression, Logistic Regression, Neural Network, Data Quality, Utilities
🤖  Deep Learning: (see the note below)

I left out madewithml's material on Attention, Embeddings, and Transformers because Jay Alammar's blog posts cover those topics better.

Command Line

There are many tutorials, but this one is decent: https://www.freecodecamp.org/news/command-line-for-beginners/

Cloud - GCP

Git & Bitbucket/GitHub

Read through the Reproducibility section of madewithml's MLOps course:

♻️  Reproducibility: Git, Pre-commit, Versioning

Sometimes you will run into merge conflicts; read this guide from GitHub on how to resolve them.

Docker

Kubeflow

Data

Hadley Wickham's tidy data paper introduced me to the idea of tidy data. I first read it around 2017, when I was getting into data analytics. It completely changed how I thought about representing data and gave me names for shapes of data like "wide" and "long".

  • Read the original paper: https://vita.had.co.nz/papers/tidy-data.pdf
  • Work through some Python examples: https://byuidatascience.github.io/python4ds/tidy-data.html
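To see the idea in code, here is a minimal pandas sketch (not from the paper; the table and numbers are purely illustrative) that reshapes a "wide" table into "long"/tidy form:

```python
import pandas as pd

# A "wide" table: one row per country, one column per year (illustrative values).
wide = pd.DataFrame({
    "country": ["Afghanistan", "Brazil"],
    "1999": [745, 37737],
    "2000": [2666, 80488],
})

# melt() reshapes it into "long"/tidy form: one row per (country, year) observation.
tidy = wide.melt(id_vars="country", var_name="year", value_name="cases")
print(tidy)
```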

SQL

SQL is one of the most important skills for a data scientist, since most data lives in databases; being able to extract and manipulate data with SQL is crucial. Mode Analytics provides a good tutorial. Start with the intermediate one: https://mode.com/sql-tutorial/
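If you want to practice the core query pattern without standing up a real database, here is a minimal sketch using Python's built-in sqlite3 module; the table and values are made up for illustration:

```python
import sqlite3

# A throwaway in-memory database just to practice the query pattern.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "alice", 20.0), (2, "bob", 35.5), (3, "alice", 12.25)],
)

# The bread-and-butter SQL pattern: filter, group, aggregate, order.
rows = conn.execute(
    """
    SELECT customer, COUNT(*) AS n_orders, SUM(amount) AS total
    FROM orders
    GROUP BY customer
    ORDER BY total DESC
    """
).fetchall()
print(rows)  # [('bob', 1, 35.5), ('alice', 2, 32.25)]
```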

madewithml (parts 1, 2 and 3)

🎨  Design: Product, Engineering, Project
🔢  Data: Exploration, Labeling, Preprocessing, Splitting, Augmentation
📈  Modeling: Baselines, Evaluation, Experiment tracking, Optimization

Python

scikit-learn

Understand the sklearn API through this example notebook; a minimal sketch of the four core methods follows the list below.

  • fit
  • transform
  • predict
  • fit_transform
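As a rough illustration of those four methods (this is not the notebook itself, just a sketch on a toy dataset):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit_transform = fit + transform in one call
X_test_scaled = scaler.transform(X_test)        # reuse statistics learned on the training set

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train_scaled, y_train)                # estimators learn with fit
print(clf.predict(X_test_scaled[:5]))           # and make predictions with predict
```

The key pattern: transformers are fit on the training data only and then reused to transform the test data.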

torch

Work through Steps 0-7 in the official PyTorch guide: https://pytorch.org/tutorials/beginner/basics/intro.html
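To give a feel for what the guide builds up to, here is a minimal training-loop sketch on synthetic data (not taken from the tutorial itself):

```python
import torch
from torch import nn

# A tiny regression model: one linear layer mapping 3 features to 1 output.
model = nn.Linear(3, 1)
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

X = torch.randn(32, 3)           # a batch of 32 synthetic examples
y = torch.randn(32, 1)           # matching synthetic targets

for epoch in range(5):
    optimizer.zero_grad()        # clear gradients from the previous step
    loss = loss_fn(model(X), y)  # forward pass + loss
    loss.backward()              # backpropagate
    optimizer.step()             # update parameters
    print(f"epoch {epoch}: loss={loss.item():.4f}")
```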

Packaging

💻  Developing: Packaging, Organization, Logging, Documentation, Styling, Makefile

Jupyter

  • Similar to @radekosmulski, I use VS Code exclusively for Jupyter notebooks so that I can use GitHub Copilot in them
  • You can also SSH into a remote server (bare-metal machines, cloud VMs, etc.), run Jupyter there, and use Copilot as well
  • Learn hotkeys

Streamlit

@karpathy also recommends spending a couple of hours learning Streamlit.
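For a sense of how little code a Streamlit app takes, here is a minimal sketch (save it as app.py and run `streamlit run app.py`; the data is random and purely illustrative):

```python
import numpy as np
import pandas as pd
import streamlit as st

st.title("Hello, Streamlit")

# A slider widget; Streamlit reruns the script on every interaction.
n = st.slider("Number of points", min_value=10, max_value=1000, value=100)

# Plot some random data (illustrative only).
df = pd.DataFrame(np.random.randn(n, 2), columns=["x", "y"])
st.line_chart(df)
```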

Gradio is a similar library from Hugging Face.
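The Gradio equivalent is similarly small; a minimal sketch wrapping a toy function:

```python
import gradio as gr

def greet(name: str) -> str:
    """Toy function to wrap in a web UI."""
    return f"Hello, {name}!"

# gr.Interface builds a simple input/output UI around a Python function.
demo = gr.Interface(fn=greet, inputs="text", outputs="text")

if __name__ == "__main__":
    demo.launch()  # serves a local web app
```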

GitHub Copilot

madewithml (parts 5 and 6)

📦  Serving: Command-line, RESTful API
✅  Testing: Code, Data, Models

NLP

Embeddings

Transformers

Language Models / Gen AI

Prompt Engineering

madewithml (parts 8 and 9)

🚀  Production: Dashboard, CI/CD, Monitoring, Systems design
⎈  Data engineering: Data stack, Orchestration, Feature store

Extras

  • MIT: The Missing Semester of Your CS Education

  • Great example of a full ML project (Part 1, Part 2, Part 3) from Will Koehrsen. Steps 1-3 are in Part 1, Steps 4-6 are in Part 2, and Steps 7-8 are in Part 3; a small scikit-learn sketch of steps 4-6 follows the list.

    1. Data cleaning and formatting
    2. Exploratory data analysis
    3. Feature engineering and selection
    4. Compare several machine learning models on a performance metric
    5. Perform hyperparameter tuning on the best model to optimize it for the problem
    6. Evaluate the best model on the testing set
    7. Interpret the model results to the extent possible
    8. Draw conclusions and write a well-documented report
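As a rough, self-contained illustration of steps 4-6 (not Koehrsen's actual code), here is a small scikit-learn sketch on a toy dataset:

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Step 4: compare a couple of models on a single performance metric (MAE here).
for name, model in [("ridge", Ridge()),
                    ("random_forest", RandomForestRegressor(random_state=42))]:
    scores = cross_val_score(model, X_train, y_train,
                             scoring="neg_mean_absolute_error", cv=5)
    print(f"{name}: cross-validated MAE = {-scores.mean():.1f}")

# Step 5: tune the chosen model with a small grid search.
grid = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid={"n_estimators": [100, 300], "max_depth": [None, 5]},
    scoring="neg_mean_absolute_error",
    cv=5,
)
grid.fit(X_train, y_train)

# Step 6: evaluate the tuned model once on the held-out test set.
print("test MAE:", mean_absolute_error(y_test, grid.best_estimator_.predict(X_test)))
```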

Continue to Learn

Remember, the field of data science is vast and constantly evolving. The most important skill to develop is the ability to learn and adapt to new tools, technologies, and techniques. Here are some resources to help you continue to learn:

YouTube

Twitter

Blogs
