Python for Data Mining
Python is a programming language designed to have clear, concise, and expressive code. An extremely popular general-purpose language, Python has been used for tasks as diverse as web development, teaching, and systems administration. This mini-course provides an introduction to Python for data mining.
Messy data has an inconsistent or inconvenient format, and may have missing values. Noisy data has measurement error. Data mining extracts meaningful information from messy, noisy data. This is a start-to-finish process that includes gathering, cleaning, visualizing, modeling, and reporting.
Programming and research best practices are a secondary focus of the mini-course, because Python is a philosophy as well as a language. Core concepts include: writing organized, well-documented code; being a self-sufficient learner; using version control for code management and collaboration; ensuring reproducibility of results; producing concise, informative analyses and visualizations.
We will meet for four weeks during the Winter 2015 quarter at the University of California, Davis.
The mini-course is open to undergraduate and graduate students from all departments. We recommend that students have prior programming experience and a basic understanding of statistical methods, so they can follow along with the examples. For instance, completion of STA 108 and STA 141 is sufficient (but not required).
The mini-course will kick off with a quick introduction to the syntax of Python, including operators, data types, control flow statements, function definition, and string manipulation. Slower, in-depth coverage will be given to uniquely Pythonic features such as built-in data structures, list comprehensions, iterators, and docstrings.
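As a rough illustration of the Pythonic features named above, the sketch below combines a docstring, a list comprehension, a generator (one kind of iterator), and a built-in dictionary. The function and variable names are invented for this example and are not taken from the course materials.

```python
def squares_of_evens(numbers):
    """Return the squares of the even values in `numbers`.

    The docstring is stored on the function object and is what
    help(squares_of_evens) displays.
    """
    # List comprehension: filter and transform in a single expression.
    return [n * n for n in numbers if n % 2 == 0]


def countdown(start):
    """Yield start, start - 1, ..., 1, one value at a time."""
    # A generator function produces an iterator lazily.
    while start > 0:
        yield start
        start -= 1


# Built-in data structure: a dict built with a dict comprehension.
word_lengths = {word: len(word) for word in ["data", "mining", "python"]}

evens_squared = squares_of_evens(range(10))
ticks = list(countdown(3))  # Consume the iterator into a list.
```

Calling `help(squares_of_evens)` in an interactive session prints the docstring, which is one reason documenting functions this way pays off.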
Authoring packages and other advanced topics may also be discussed.
Support for stable, high-performance vector operations is provided by the NumPy package. NumPy will be introduced early and used often, because it's the foundation for most other scientific computing packages. We will also cover SciPy, which extends NumPy with functions for linear algebra, optimization, and elementary statistics.
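To give a feel for what "vector operations" means here, the following minimal sketch applies arithmetic and a comparison to a whole NumPy array at once, with no explicit loop. The temperature values and variable names are made up for illustration.

```python
import numpy as np

# Hypothetical temperatures in degrees Celsius (toy values).
temps_c = np.array([12.5, 18.0, 21.5, 9.0])

# Arithmetic is applied elementwise to the whole array.
temps_f = temps_c * 9 / 5 + 32

# The same conversion written as an explicit Python loop, for comparison.
temps_f_loop = [c * 9 / 5 + 32 for c in temps_c.tolist()]

# A comparison yields a boolean array, which can be used as a mask.
is_warm = temps_c > 15
mean_warm_c = temps_c[is_warm].mean()
```

The vectorized version is shorter, and on large arrays it is typically much faster than the loop, because the work happens in compiled code.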
Specialized packages will be discussed during the final week.
The pandas package provides tabular data structures and convenience functions for manipulating them. This includes a two-dimensional data frame similar to the one found in R. Pandas will be covered extensively, because it makes it easy to:
- Read and write many formats (CSV, JSON, HDF, database)
- Filter and restructure data
- Handle missing values gracefully
- Perform group-by operations
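The tasks in the list above might look roughly like the following sketch. The CSV text, column names, and rainfall figures are invented toy data, and an in-memory string stands in for a file on disk.

```python
import io

import pandas as pd

# Toy CSV data (invented); the second Davis row has a missing rainfall value.
raw_csv = io.StringIO(
    "city,year,rainfall_mm\n"
    "Davis,2013,210\n"
    "Davis,2014,\n"
    "Sacramento,2013,250\n"
    "Sacramento,2014,310\n"
)

# Read: parse the CSV text into a data frame.
rain = pd.read_csv(raw_csv)

# Missing values: the empty field is read as NaN.
n_missing = int(rain["rainfall_mm"].isna().sum())
complete = rain.dropna(subset=["rainfall_mm"])

# Filter: keep only the rows for 2014.
recent = rain[rain["year"] == 2014]

# Group-by: mean rainfall per city (NaN values are skipped by default).
mean_by_city = rain.groupby("city")["rainfall_mm"].mean()
```

Here `mean_by_city` is a pandas Series indexed by city, so a single value can be looked up with `mean_by_city["Davis"]`.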
Many visualization packages are available for Python, but the mini-course will focus on Seaborn, which is a user-friendly abstraction of the venerable matplotlib package.
Other packages such as ggplot (a Python port of R's ggplot2), Vincent, Bokeh, and mpld3 may also be covered.
Python 3 has syntax changes and new features that break compatibility with Python 2.
All of the major scientific computing packages have added support for Python 3
over the last few years, so it will be our focus.
We recommend the Anaconda Python 3 distribution,
which bundles most packages we'll use into one download.
Any other packages needed can be installed using
Python code is supported by a vast array of editors.
- Spyder IDE, included in Anaconda, is a Python equivalent of RStudio, designed with scientific computing in mind.
- PyCharm IDE and Sublime Text provide good user interfaces.
- Terminal-based text editors, such as Vim and Emacs, are a great choice for ambitious students. They can be used with any language. See here for more details. Clark and Nick both use Vim.
No books are required, but we recommend Wes McKinney's book:
- McKinney, W. (2012). Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython. O'Reilly Media.
Python and most of the packages we'll use have excellent documentation, which can be found at the following links.
Due to Python's popularity, a large number of general references are available. While these don't focus specifically on data analysis, they're helpful for learning the language and its idioms. Some of our favorites are listed below, many of which are free.
- Swaroop, C. H. (2003). A Byte of Python. (PDF)
- Reitz, K. Hitchhiker's Guide to Python. (PDF)
- Lutz, M. (2014). Python Pocket Reference. O'Reilly Media.
- Beazley, D. (2009). Python Essential Reference. Addison-Wesley.
- Pilgrim, M., & Willison, S. (2009). Dive Into Python 3. Apress.
- Non-programmer's Tutorial for Python 3
- Beginner's Guide to Python
- Five Lifejackets to Throw to the New Coder
- StackOverflow. Please be conscious of the rules!
- Videos featuring Guido van Rossum, Raymond Hettinger, Travis Oliphant, Fernando Pérez, David Beazley, and Alex Martelli are suggested.