Open Source Data Science Resources.
Latest commit 6923816 Apr 20, 2016 @jonathan-bower Added git tips

Data Science Resources

Hello and welcome to the Data Science Resources repo. I originally built this repo so that I could have a location to host resources that are helpful to me. Through building the repo I realized that other people might also be interested in this content - so I have tried to curate content on data science topics, high quality resources to learn from, and relevant blog posts.

The intended goal was to cover more than just the technical component of data science. Data science as a discipline is still relatively young, and many businesses are still learning how to properly integrate and structure data science teams and how to understand the value proposition that data science can provide.

As a result I also tried to find topics that cover building data science teams, business practices, use-cases, product metrics and data science career paths.

This is a constant work in progress and I hope to refactor and update in some kind of meaningful time frame.

If you find this resource helpful - please send it around to other people, upvote it on DataTau, share it on LinkedIn, Twitter or Facebook, add it to Quora, or just send me a note. Good luck, I hope this helps you find what you are looking for now, or in the future.

Remember - If you’re not prepared to be wrong, you’ll never come up with anything original.

Table Of Contents

  1. Data Science Getting Started

  2. Data Pipeline & Tools

  3. Product

  4. Career Resources

  5. Open Source Data Science Resources

  6. About Me

Data Science Getting Started

Data Science is a multidisciplinary field covering, at the very minimum, statistics, programming, and machine learning (see Drew Conway's Venn diagram or the Cheat Sheet of a Modern Data Scientist). These topics are covered throughout this repo. I personally find the best way to learn a topic is to get my hands dirty quickly - with that in mind, I would get to work in Python and then add different tools or theory to my toolkit as they are understood. If you haven't used Python before, I would strongly urge you to take the Codecademy course to familiarize yourself with the language and with how to program. Good luck and have fun.

A note about order - I arranged the contents of the Pipeline & Tools section in the order of the data pipeline: acquisition, exploratory data analysis, cleaning data, model selection & evaluation, and then visualization.
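As a rough illustration of that ordering, here is a minimal Python sketch that walks a toy data set through each stage using only the standard library. Everything in it is an assumption made for illustration: the inline CSV data, the function names (`acquire`, `explore`, `clean`, `fit_line`, `evaluate`, `visualize`), and the choice of a least-squares line as the "model". In practice each stage would use the libraries listed in the sections below.

```python
import csv
import io
import statistics

# Hypothetical raw data standing in for whatever you scrape or download.
RAW_CSV = """x,y
1,2.1
2,3.9
3,
4,8.2
5,9.8
"""

def acquire(text):
    """Acquisition: parse raw CSV text into a list of row dicts."""
    return list(csv.DictReader(io.StringIO(text)))

def explore(rows):
    """EDA: count rows and missing y values before changing anything."""
    missing = sum(1 for r in rows if not r["y"])
    return {"rows": len(rows), "missing_y": missing}

def clean(rows):
    """Cleaning: drop rows with a missing y and convert strings to floats."""
    return [(float(r["x"]), float(r["y"])) for r in rows if r["y"]]

def fit_line(points):
    """Model: ordinary least-squares fit of y = slope * x + intercept."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in points)
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    return slope, mean_y - slope * mean_x

def evaluate(points, slope, intercept):
    """Evaluation: mean absolute error of the line on the given points.

    This scores the same points used for fitting to keep the sketch short;
    a real evaluation would use held-out data.
    """
    return statistics.mean(abs(y - (slope * x + intercept)) for x, y in points)

def visualize(points, width=40):
    """Visualization: a crude text bar chart, one line per point."""
    top = max(y for _, y in points)
    return "\n".join(f"x={x:g} " + "#" * round(width * y / top) for x, y in points)

rows = acquire(RAW_CSV)
summary = explore(rows)
points = clean(rows)
slope, intercept = fit_line(points)
error = evaluate(points, slope, intercept)

print(summary)
print(f"slope={slope:.2f} intercept={intercept:.2f} mae={error:.2f}")
print(visualize(points))
```

Keeping each stage as its own function mirrors the section layout and makes it easy to swap in a real tool later, for example replacing `acquire` with a scraper or `visualize` with a plotting library.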

Start

Data Science Courses:

  • Coursera - Data Science Specialization at Coursera - many other courses available as well.
  • Udacity - Online MOOCs related to data science.
  • Data Science Bootcamps - A collection of all bootcamps currently on the market as of April 5, 2014 by Ikechukwu Okonkwo.
  • Coursera Machine Learning Course - Andrew Ng's landmark Machine Learning course.
  • edX - edX courses related to data science.

Data Pipeline & Tools

Python

Python is my workhorse language because it has many data science and statistics libraries, it works well in production environments, and it can be used for problems outside of data science. There are many other languages that could be useful but are not covered here: Julia, R, Cython, Pig, Scala, Java, etc.

Data Structures & CS Topics

Statistics

Some primers on understanding statistics and other resources to get a deeper understanding.

Stats/Engineering Libraries

A collection of workhorse libraries that are essential for any Python data scientist.

  • Pandas - Wes McKinney's pandas library for EDA on small to medium sized data sets when you don't want to set up the infrastructure for SQL or when it isn't necessary. It has many other great applications beyond being a convenient alternative to SQL on small to medium data sets.
  • SciPy - Open-source software for mathematics, science and engineering.
  • NumPy - Fundamental package for scientific computing with Python.
  • StatsModels - Module that allows users to explore data, estimate statistical models and perform statistical tests.
  • PyMC - Bayesian estimation library built around Markov chain Monte Carlo sampling (among other things).

Data Acquisition

Libraries that are very helpful for abstracting away some of the complications of scraping or working with HTTP.

Processing & Exploratory Data Analysis

A collection of documents explaining some of the ways to do processing & EDA.

Databases/Frameworks

A collection of databases & frameworks that are helpful for data management and are the industry standard.

Machine Learning

There is a lot of information available online about the theory, mathematical intuition, and tuning involved in this discipline. Here are some tools that are currently available.

Machine Learning Theory

Deep Learning

Deep learning is getting a lot of media traction - get your feet wet with some of these resources:

Time-Series

Model Selection

Resources about how to decide on your model.

Model Evaluation

Resources to help with understanding model evaluation.

Feature Engineering

A critical element of data science for improving model performance, but one that is rarely talked about.

Additional Tools or Processes

Resources on other topics that are very helpful for data scientists and product.

Data Visualization

Collection of the best libraries that I know for easy and powerful data visualizations.

  • ggplot - ggplot for python ported by the team at yhat.
  • matplotlib - Awesome plotting library for python.
  • d3 - Mike Bostock's viz library - the de facto gold standard for polished visualization - in js, steep learning curve but beautiful outcomes.
  • bokeh - Interactive visualization library.
  • d3py - Another library for data viz.
  • vincent - Help with python for d3.
  • seaborn - Clean statistical data visualization library.

Other available Visualization Resources.

  • Scott Murray's D3 Tutorials - Tutorials from Interactive Data Visualization for the Web
  • tributary.io - live code visualization platform designed specifically for D3.js
  • plot.ly - A web visualization and data processing platform
  • blockspring - Share code and visualizations through a single platform
  • dot.append - Ian Johnson (enjalot) goes through several live-coding examples using D3
  • Text Visualization Plots - Interactive site with different types of text visualization for different problems.

Design Theory

The importance of design theory in data visualization, storytelling and presentations cannot be overstated. Poor design can take great content and make it confusing or virtually unusable, while good design can make content sing and connect with the audience. Through a better understanding of design theory and UI principles, a data scientist (or anyone) can convey more understandable information to the intended audience and give a strong story to their content.

IPython Notebook Tutorials

Collection of IPython notebooks that are helpful as examples of how to use tools or as explanations of certain topics.

Data Sources

Collection of sites to access data if you want to build out a project or just use some of the tools for EDA.

New Data Tools

This section aims to keep track of developing trends and new tech that is helpful for the practicing Data Scientist. New might be a misnomer.

  • BigML - machine learning for the everyday user, also useful for EDA.
  • GraphLab - graph-based, high performance, distributed computation framework. They just added deep learning to their platform.
  • ModeAnalytics - platform to share analysis/data science.
  • Apache Mahout - Scalable machine learning library. Not in python.
  • Apache Hadoop - Open-source software for reliable, scalable, distributed computing. Not really new (10 years old at this point)

Other Useful Scripts

Product

Product Metrics

Understanding product, user behavior, and product metrics is helpful for data scientists in industry. Being able to help your product manager and team execute on strategies by understanding the problem, the metrics, and the team's own understanding of them makes for a more fruitful relationship.

Team Communication & Business Tools

There are some very innovative new companies that are producing very effective tools to minimize and abstract away inefficient processes at companies. While they aren't strictly data science related, these products could be very helpful to integrate with your teams to improve overall productivity.

  • Aha! - Clean product roadmapping software for PMs.
  • Slack - Amazing team communication tool - abstracting away unnecessary e-mails.
  • Harvest - Effortless time tracking for business.
  • Trello - Helping organize everything - great for project management.
  • Zapier - Bringing together Harvest + Slack + Trello and a lot more...
  • Thoughtbot Playbook - A detailed account of how thoughtbot runs its software consulting company, covering everything from guiding principles, design sprints and code reviews to sales and operations. A content-packed post.
  • IFTTT - 'Putting the internet to work for you'. Great for small companies to automate social media, marketing or to have your own personal recipes set up.
  • Github - Clearly a great product - 'Build software better, together'.
  • Web Analytics & Reporting Software:
    • Google Analytics - In depth real-time analytics.
    • Mixpanel - provides real-time analytics and solid cohort analysis.
    • Clicky - Pride themselves on ease of use.
  • Evernote - Great for keeping notes.

Best Practices

Using source control and keeping accurate documentation, so that you and your colleagues can follow and reproduce your work, are very important. I will add some best coding practices & data science practices.

Career Resources

Data Science Career Path

Types of Data Scientists

Not all Data Scientists are the same and it's critical for organizations to understand what it is they need, and how best to fill those roles and/or complement the skills of their team. Finding the organizational structure that enables the data scientists/data engineers within the organization and generates better results is also crucial. It should be given thorough consideration.

Data Science Applications/Use Cases

Data Science has many different applications and use cases within industry, and new ones are continually being discovered. These resources provide some potential ideas.

Data Science Websites/Books

More resources for community based information or hard copy books.

  • Data Science Handbook - Not yet released but should be interesting providing stories from academia and industry about data science - go read the post for a better description!
  • CrossValidated - A question and answer site for people interested in statistics, machine learning, data analysis, data mining, and data visualization.
  • StackOverflow - Language-independent collaboratively edited question and answer site for programmers.
  • Kaggle - Model building competition and great resources for training and data.
  • O'Reilly Media - A lot of content rich books available and tutorials on using the tools.
  • Quora - Question and answer site - lots of data science content and career content.
  • Data Science @ StackExchange - Still in beta.

Data Science Meetups in the Bay Area

A great way to meet other Data Scientists and keep up to date with best practices.

Data Science Blogs

Data Science Conferences

  • Strata - Conference and a lot of videos from previous conferences - great resource.
  • GraphLab - Another great conference.
  • PyData

Data Science Presentations

Relevant Business Processes

  • Lean Startup - A method to develop product and businesses.
  • Agile Development - group of software development methods to optimize for self-organizational and cross-functional teams.
  • Scrum - an iterative and incremental agile software development framework for managing product development.

Start-Up Resources

  • How to Start a Start-up - Series of lectures from successful entrepreneurs and investors (e.g. Y Combinator, SV Angel, etc.) on how to start a start-up.

Open Source Data Science Resources

While the name might sound redundant, this section represents other sites or repos that have aggregated information covering similar topics. There is tons of great content on these sites - definitely go check them out.

Other Open Source Data Science Content

There are some really great resources linked within this section covering all of Data Science, the entire data pipeline, machine-learning, statistics, python, etc. Go check them out.

Auxiliary Content & Apps

About Me

I am currently working at an advanced energy storage start-up called Stem, which is at the heart of revolutionizing how the grid integrates energy storage from a consumer and utility perspective. Our team works on a variety of different engineering challenges, in particular a lot of time-series problems.

I am a chemical engineer and economist by formal education and have worked in the energy, water and carbon industries ever since college. I acquired my data science coding skills through programming on the job and then taking three months off to hone my data science skills at Zipfian Academy (since acquired by Galvanize). For me, taking that time off to learn, run the daily/weekly sprints, and be in a collective learning environment at Zipfian was irreplaceable. Even if all of Zipfian's resources were open source, without taking the time off work it would have been next to impossible to learn all that content. Not to mention the great people I met through the program.

I am always interested to hear what other data scientists are up to, especially those in the clean energy industry. If you have some project ideas or other resources that would be great to add here - feel free to reach out on Twitter @sf_oak, LinkedIn or AngelList.
