# Introduction

This book focuses on a specific sub-field of machine learning called predictive modeling. This is
the field of machine learning that is the most useful in industry and the type of machine learning
that the scikit-learn library in Python excels at facilitating. Unlike statistics, where models are
used to understand data, predictive modeling is laser-focused on developing models that make
the most accurate predictions, even at the expense of explaining why those predictions are made.
Unlike the broader field of machine learning, which could feasibly be used with data in any format,
predictive modeling is primarily focused on tabular data (e.g., tables of numbers like in a spreadsheet).

A predictive modeling machine learning project can be broken down into six top-level tasks:
1. Define Problem: Investigate and characterize the problem in order to better understand
the goals of the project.
2. Analyze Data: Use descriptive statistics and visualization to better understand the data
you have available.
3. Prepare Data: Use data transforms in order to better expose the structure of the
prediction problem to modeling algorithms.
4. Evaluate Algorithms: Design a test harness to evaluate a number of standard algorithms
on the data and select the top few to investigate further.
5. Improve Results: Use algorithm tuning and ensemble methods to get the most out of
well-performing algorithms on your data.
6. Present Results: Finalize the model, make predictions and present results.
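
The six tasks above can be sketched as one short script. This is a minimal illustration rather than the book's own recipe: it assumes scikit-learn and pandas are installed, uses the copy of the Iris data bundled with scikit-learn, and the choice of algorithms, the tuning grid and the random seed are arbitrary assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# 1. Define Problem: predict the iris species from four flower measurements.
iris = load_iris(as_frame=True)
X, y = iris.data, iris.target

# 2. Analyze Data: descriptive statistics and class balance.
print(X.describe())
print(y.value_counts())

# Hold back a validation set that is only used in step 6.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=7, stratify=y
)

# 3. Prepare Data and 4. Evaluate Algorithms: standardize the inputs inside
# a pipeline and compare two standard algorithms with 10-fold cross-validation.
candidates = {
    "LR": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(),
}
cv_scores = {}
for name, model in candidates.items():
    pipeline = Pipeline([("scale", StandardScaler()), ("model", model)])
    scores = cross_val_score(pipeline, X_train, y_train, cv=10, scoring="accuracy")
    cv_scores[name] = scores.mean()
print(cv_scores)

# 5. Improve Results: tune one hyperparameter of one candidate (an arbitrary choice).
knn_pipeline = Pipeline([("scale", StandardScaler()), ("model", KNeighborsClassifier())])
grid = GridSearchCV(
    knn_pipeline,
    param_grid={"model__n_neighbors": [1, 3, 5, 7, 9]},
    cv=10,
    scoring="accuracy",
)
grid.fit(X_train, y_train)

# 6. Present Results: the refit best model is scored on the held-back data.
val_accuracy = grid.score(X_val, y_val)
print(grid.best_params_, val_accuracy)
```

Keeping the scaler inside the pipeline means it is refit on each training fold, so the cross-validation estimate is not influenced by the held-out fold.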

You need to piece the recipes together into end-to-end projects. This will show you how to
actually deliver a model or make predictions on new data using Python. This book uses small,
well-understood machine learning datasets from the UCI Machine Learning Repository in both
the lessons and the example projects. These datasets are available for free as CSV downloads.
They are excellent for practicing applied machine learning because:
* They are small, meaning they fit into memory and algorithms can model them in
reasonable time.
* They are well behaved, meaning you often don’t need to do a lot of feature engineering
to get a good result.
* They are benchmarks, meaning that many people have used them before and you can
get ideas of good algorithms to try and accuracy levels you should expect.
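
Because the datasets are plain CSV files, loading one takes only a few lines. The sketch below assumes pandas is installed and reads from an in-memory string that stands in for a downloaded file. The four rows follow the layout of the Iris CSV (four measurements and a class label, no header row), and the column names are hypothetical labels chosen for this example.

```python
import io

import pandas as pd

# A few rows in the layout of the Iris CSV file; in practice you would pass
# the path of the downloaded file to read_csv instead of this string.
csv_text = """5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica
"""

# The file has no header row, so column names are supplied here (assumed names).
names = ["sepal_length", "sepal_width", "petal_length", "petal_width", "species"]
data = pd.read_csv(io.StringIO(csv_text), header=None, names=names)

print(data.shape)                      # rows and columns
print(data.dtypes)                     # type inferred for each column
print(data.groupby("species").size())  # number of rows per class
```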

In Part III you will work through three projects:

Hello World Project (Iris flowers dataset): A quick pass through the project steps, without much tuning or optimizing, on a dataset that is widely used as the "hello world" of machine learning.

Regression (Boston House Price dataset): Work through each step of the project process with a regression problem.

Binary Classification (Sonar dataset): Work through each step of the project process using all of the methods on a binary classification problem.

**Python ecosystem for machine learning**

1. Python and its rising use for machine learning.
2. SciPy and the functionality it provides through NumPy, Matplotlib and Pandas.
3. scikit-learn, which provides the machine learning algorithms.

**Python**

Python is a general-purpose, interpreted programming language. It is easy to learn and use, primarily because the language focuses on readability.

It is a popular language in general, consistently appearing in the top 10 programming languages in Stack Overflow surveys. It is a dynamic language, well suited to interactive development and quick prototyping, with the power to support the development of large applications. It is also widely used for machine learning and data science because of its excellent library support and because it is a general-purpose programming language (unlike R or MATLAB).

**SciPy**

SciPy is an ecosystem of Python libraries for mathematics, science and engineering. It is an
add-on to Python that you will need for machine learning. The SciPy ecosystem includes
the following core libraries relevant to machine learning:
* NumPy: A foundation for SciPy that allows you to efficiently work with data in arrays.
* Matplotlib: Allows you to create 2D charts and plots from data.
* Pandas: Tools and data structures to organize and analyze your data.

To be effective at machine learning in Python you must install and become familiar with
SciPy. Specifically:
* You will prepare your data as NumPy arrays for modeling in machine learning algorithms.
* You will use Matplotlib (and wrappers of Matplotlib in other frameworks) to create plots
and charts of your data.
* You will use Pandas to load, explore, and better understand your data.
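
A minimal sketch of how the three libraries fit together, assuming NumPy, Pandas and Matplotlib are installed. The column names and values are made up for illustration, and the non-interactive `Agg` backend is selected only so that the script runs without a display.

```python
import io

import matplotlib

matplotlib.use("Agg")  # assumption: render off-screen, no display needed
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Pandas: hold the data in a table and summarize it (made-up values).
frame = pd.DataFrame(
    {
        "height_cm": [150.0, 160.0, 170.0, 180.0],
        "weight_kg": [50.0, 58.0, 69.0, 80.0],
    }
)
print(frame.describe())
print(frame.corr())

# NumPy: convert the table to arrays, the form modeling algorithms expect.
array = frame.to_numpy()
X = array[:, 0:1]  # input column, kept two-dimensional
y = array[:, 1]    # output column
print(X.shape, y.shape)

# Matplotlib (through the Pandas plotting wrapper): draw a scatter plot.
fig, ax = plt.subplots()
frame.plot(kind="scatter", x="height_cm", y="weight_kg", ax=ax)
buffer = io.BytesIO()
fig.savefig(buffer, format="png")  # write the PNG to memory instead of a file
plt.close(fig)
```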

**scikit-learn**

The scikit-learn library is how you can develop and practice machine learning in Python. It is
built upon and requires the SciPy ecosystem. The name scikit suggests that it is a SciPy plug-in
or toolkit. The focus of the library is machine learning algorithms for classification, regression,
clustering and more. It also provides tools for related tasks such as evaluating models, tuning
parameters and pre-processing data.
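
The sketch below illustrates the range of tasks described above using the same fit and predict pattern for classification, regression and clustering. It assumes scikit-learn is installed and uses synthetic data from scikit-learn's own generators. The algorithms, sample sizes and random seed are arbitrary choices for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs, make_classification, make_regression
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import accuracy_score, r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Classification, with a pre-processing step fit on the training data only.
X, y = make_classification(n_samples=200, n_features=5, random_state=7)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=7)
scaler = StandardScaler().fit(X_train)
classifier = LogisticRegression().fit(scaler.transform(X_train), y_train)
accuracy = accuracy_score(y_test, classifier.predict(scaler.transform(X_test)))
print("classification accuracy:", accuracy)

# Regression: same fit/predict pattern, different estimator and metric.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=7)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=7)
regressor = LinearRegression().fit(X_train, y_train)
r2 = r2_score(y_test, regressor.predict(X_test))
print("regression R^2:", r2)

# Clustering: no target values are used, only the inputs.
X, _ = make_blobs(n_samples=200, centers=3, random_state=7)
labels = KMeans(n_clusters=3, n_init=10, random_state=7).fit_predict(X)
print("cluster labels found:", sorted(set(labels)))
```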

# Analyze Data

# Prepare Data

# Evaluate Algorithms

# Improve Results

# Present Results