Python for Data Mining

Python is a programming language designed to make code clear, concise, and expressive. An extremely popular general-purpose language, Python has been used for tasks as diverse as web development, teaching, and systems administration. This mini-course provides an introduction to Python for data mining.

Messy data has an inconsistent or inconvenient format, and may have missing values. Noisy data has measurement error. Data mining extracts meaningful information from messy, noisy data. This is a start-to-finish process that includes gathering, cleaning, visualizing, modeling, and reporting.

Programming and research best practices are a secondary focus of the mini-course, because Python is a philosophy as well as a language. Core concepts include writing organized, well-documented code; being a self-sufficient learner; using version control for code management and collaboration; ensuring reproducibility of results; and producing concise, informative analyses and visualizations.

We will meet for four weeks during the Winter 2015 quarter at the University of California, Davis.

Target Audience

The mini-course is open to undergraduate and graduate students from all departments. We recommend that students have prior programming experience and a basic understanding of statistical methods, so they can follow along with the examples. For instance, completion of STA 108 and STA 141 is sufficient (but not required).

Topics

Core Python

The mini-course will kick off with a quick introduction to the syntax of Python, including operators, data types, control flow statements, function definition, and string manipulation. Slower, in-depth coverage will be given to uniquely Pythonic features such as built-in data structures, list comprehensions, iterators, and docstrings.
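As a rough illustration of the Pythonic features named above, the sketch below shows a docstring, a list comprehension, a generator (one kind of iterator), and a built-in dictionary. The function names `squares_of_evens` and `countdown` and the sample sentence are made up for this example and are not part of the course materials.

```python
def squares_of_evens(numbers):
    """Return the squares of the even values in `numbers`.

    This string is the docstring; `help(squares_of_evens)` displays it.
    """
    # List comprehension: transform and filter in a single expression.
    return [n ** 2 for n in numbers if n % 2 == 0]


def countdown(start):
    """Yield start, start - 1, ..., 1, one value at a time."""
    # A generator function produces an iterator lazily.
    while start > 0:
        yield start
        start -= 1


# Built-in data structure: a dict mapping each word to its count.
word_counts = {}
for word in "the quick brown fox jumps over the lazy dog the end".split():
    word_counts[word] = word_counts.get(word, 0) + 1

evens_squared = squares_of_evens(range(10))
ticks = list(countdown(3))

print(evens_squared)        # [0, 4, 16, 36, 64]
print(ticks)                # [3, 2, 1]
print(word_counts["the"])   # 3
```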

Authoring packages and other advanced topics may also be discussed.

Scientific Computing

Support for stable, high-performance vector operations is provided by the NumPy package. NumPy will be introduced early and used often, because it's the foundation for most other scientific computing packages. We will also cover SciPy, which extends NumPy with functions for linear algebra, optimization, and elementary statistics.
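The following is a minimal sketch of what vector operations in NumPy and the SciPy extensions look like in practice, assuming both packages are installed. The temperature values and the small linear system are invented for illustration.

```python
import numpy as np
from scipy import linalg, optimize, stats

# Vectorized arithmetic: the expression applies to every element, no loop needed.
celsius = np.array([12.0, 18.5, 21.0, 25.5])
fahrenheit = celsius * 9 / 5 + 32

# Linear algebra: solve the system A @ x = b.
A = np.array([[2.0, 1.0], [1.0, 3.0]])
b = np.array([3.0, 5.0])
x = linalg.solve(A, b)

# Optimization: find the t that minimizes (t - 2)^2.
result = optimize.minimize_scalar(lambda t: (t - 2.0) ** 2)

# Elementary statistics: count, min/max, mean, variance, and more.
summary = stats.describe(celsius)

print(fahrenheit)
print(x)
print(result.x)
print(summary.mean)
```

The same operations written with plain Python lists and loops would be longer and usually slower, which is one reason NumPy arrays underlie most of the other packages covered here.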

Specialized packages will be discussed during the final week.

Data Manipulation

The pandas package provides tabular data structures and convenience functions for manipulating them, including a two-dimensional data frame similar to the one found in R. We will cover pandas extensively, because it makes it easy to:

Read and write many formats (CSV, JSON, HDF, database)
Filter and restructure data
Handle missing values gracefully
Perform group-by operations (apply functions)
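The four capabilities listed above can be sketched in a few lines, assuming pandas is installed. The column names (`site`, `year`, `rainfall_mm`) and all values are made up for this example, and the CSV is read from an in-memory string rather than a real file.

```python
import io

import pandas as pd

# A tiny made-up CSV; the first data row has a missing rainfall value.
raw_csv = io.StringIO(
    "site,year,rainfall_mm\n"
    "A,2013,\n"
    "A,2014,380\n"
    "B,2013,250\n"
    "B,2014,410\n"
)

# Read: the empty field is parsed as NaN.
df = pd.read_csv(raw_csv)

# Filter: keep only the rows for 2014.
recent = df[df["year"] == 2014]

# Missing values: replace NaN in one column with that column's mean.
filled = df.fillna({"rainfall_mm": df["rainfall_mm"].mean()})

# Group-by: mean rainfall per site (NaN values are skipped by default).
mean_by_site = df.groupby("site")["rainfall_mm"].mean()

# Write: serialize the filtered rows as JSON text.
json_text = recent.to_json(orient="records")

print(mean_by_site)
print(json_text)
```

Filling with the column mean is only one option; depending on the analysis, dropping the incomplete rows with `dropna` may be more appropriate.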

Data Visualization

Many visualization packages are available for Python, but the mini-course will focus on Seaborn, a user-friendly interface built on top of the venerable matplotlib package.
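To give a sense of how Seaborn sits on top of matplotlib, here is a small sketch, assuming Seaborn 0.9 or later (for `scatterplot`), matplotlib, and pandas are installed. The data frame and its column names are invented, and the figure is written to an in-memory buffer only so the example runs without a display or a file on disk.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend; no window is opened
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up measurements for two groups.
df = pd.DataFrame({
    "dose": [1, 2, 3, 4, 1, 2, 3, 4],
    "response": [2.1, 3.9, 6.2, 7.8, 1.0, 2.2, 2.9, 4.1],
    "group": ["a"] * 4 + ["b"] * 4,
})

# Seaborn draws onto an ordinary matplotlib Axes object.
fig, ax = plt.subplots()
sns.scatterplot(data=df, x="dose", y="response", hue="group", ax=ax)
ax.set_title("Made-up dose/response data")

# Save as PNG into a buffer; pass a filename instead to write a file.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
plt.close(fig)
```

Seaborn takes the axis labels and legend entries from the column names, while the title and the saving step are plain matplotlib calls on the same objects.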

Other packages such as ggplot (a Python port of R's ggplot2), Vincent, Bokeh, and mpld3 may also be covered.

Programming Environment

Python 3 has syntax changes and new features that break compatibility with Python 2. All of the major scientific computing packages have added support for Python 3 over the last few years, so it will be our focus. We recommend the Anaconda Python 3 distribution, which bundles most packages we'll use into one download. Any other packages needed can be installed using pip or conda.
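A few of the Python 3 changes can be seen in a short script. This is only a sketch of three well-known differences (the `print` function, true division, and Unicode strings), not a complete list; the variable names are chosen for this example.

```python
import sys

# Confirm which interpreter is running.
running_python3 = sys.version_info.major >= 3

# print is a function in Python 3; in Python 2 it was a statement.
print("Python", sys.version.split()[0])

# / is true division and // is floor division.
# In Python 2, 7 / 2 evaluated to 3.
true_quotient = 7 / 2
floor_quotient = 7 // 2

# str holds Unicode text; bytes is a separate type.
text = "café"
encoded = text.encode("utf-8")  # "é" takes two bytes in UTF-8

print(true_quotient, floor_quotient, len(text), len(encoded))
```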

A vast array of editors support Python code, including:

Spyder IDE, included in Anaconda, is a Python equivalent of RStudio, designed with scientific computing in mind.
PyCharm IDE and Sublime Text provide good user interfaces.
Terminal-based text editors, such as Vim and Emacs, are a great choice for ambitious students. They can be used with any language. See here for more details. Clark and Nick both use Vim.

References

No books are required, but we recommend Wes McKinney's book:

McKinney, W. (2012). Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython. O'Reilly Media.

Python and most of the packages we'll use have excellent documentation, which can be found at the following links.

Python 3
NumPy
SciPy
pandas
matplotlib
Seaborn
scikit-learn
IPython

Due to Python's popularity, a large number of general references are available. While these don't focus specifically on data analysis, they're helpful for learning the language and its idioms. Some of our favorites are listed below, many of which are free.

Swaroop, C. H. (2003). A Byte of Python. (PDF)
Reitz, K. Hitchhiker's Guide to Python. (PDF)
Lutz, M. (2014). Python Pocket Reference. O'Reilly Media.
Beazley, D. (2009). Python Essential Reference. Addison-Wesley.
Pilgrim, M., & Willison, S. (2009). Dive Into Python 3. Apress.
Non-programmer's Tutorial for Python 3
Beginner's Guide to Python
Five Lifejackets to Throw to the New Coder
Pyvideo*
StackOverflow. Please be conscious of the rules!

* Videos featuring Guido van Rossum, Raymond Hettinger, Travis Oliphant, Fernando Pérez, David Beazley, and Alex Martelli are suggested.
