# Distributed computing and Spark

You've learned about software for accessing data that is stored in a distributed form. Now you want to do something with that data. But chances are good that your analyses will be slow if you run them all at once, as a single job on a single machine. There are several ways to deal with this issue. Here, you'll focus on two: one is local, and the other is distributed.

**Key Terms**

* **Multicore parallelization:**
    The execution of a task on several cores of a CPU at the same time, where each core handles part of the work.

## Multicore computing

When you bought your computer, you may have heard about how many cores it has. A core is a processing unit inside the CPU that can execute instructions independently of the others. Home computers used to be single-core machines, so everything ran through one core. Now they are typically multicore machines, often with two, four, eight, or more cores.

If you have a multicore machine, you can often build models more quickly. Many scikit-learn estimators accept an `n_jobs` parameter, which sets how many jobs can run in parallel during training and therefore how many cores can be used; `n_jobs=-1` uses all available cores. The higher this number, the more the work of building the model is spread out. This is referred to as multicore parallelization: the training runs on multiple cores simultaneously. Ideally, this means that processing time is divided by the number of cores, so splitting across two cores means that an analysis runs in half the time. In practice, the speedup is usually smaller, because some steps can't be split and coordinating the cores has a cost of its own.
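To make `n_jobs` concrete, here is a minimal sketch that trains the same model twice, once on a single core and once on every available core, and times both runs. The synthetic dataset, the choice of a random forest, and the helper name `fit_and_time` are assumptions made for illustration; the timings will depend on your machine, and on a small dataset like this one the parallel run may not be faster at all.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data so the example needs no external files.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)


def fit_and_time(n_jobs):
    """Train a random forest with the given n_jobs; return (model, seconds)."""
    model = RandomForestClassifier(n_estimators=100, n_jobs=n_jobs, random_state=0)
    start = time.perf_counter()
    model.fit(X, y)
    return model, time.perf_counter() - start


# n_jobs=1 keeps training on one core; n_jobs=-1 asks for every available core.
single_core_model, single_core_seconds = fit_and_time(n_jobs=1)
all_cores_model, all_cores_seconds = fit_and_time(n_jobs=-1)

print(f"n_jobs=1:  {single_core_seconds:.2f} s")
print(f"n_jobs=-1: {all_cores_seconds:.2f} s")
```

Try increasing `n_samples` or `n_estimators` and rerunning. The gap between the two timings tends to widen as the amount of work grows, because the fixed cost of coordinating the cores matters less.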

Although some data science models are easier to parallelize than others, most can be parallelized in some way. Random forest is the easiest to picture: the trees are independent of one another, so different cores can grow different trees. Boosted trees are harder, because each tree is fit to the errors of the trees before it, so the trees have to be built in order. The parallelism there comes from inside each tree, where the search for the best split can be spread across cores. Something like SVM doesn't parallelize particularly well, but luckily, memory for the fitted model isn't much of a problem, since only the points near the margin (the support vectors) matter.
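One rough way to see which scikit-learn models can be spread across cores through `n_jobs` is to check whether the estimator has that parameter at all. The sketch below does this for three classifiers; the helper name `exposes_n_jobs` is hypothetical, and the three estimators are chosen only as examples.

```python
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.svm import SVC


def exposes_n_jobs(estimator):
    """Return True if the estimator has an n_jobs parameter."""
    return "n_jobs" in estimator.get_params()


candidates = [RandomForestClassifier(), GradientBoostingClassifier(), SVC()]
support = {type(est).__name__: exposes_n_jobs(est) for est in candidates}

for name, has_n_jobs in support.items():
    print(f"{name}: n_jobs available = {has_n_jobs}")
```

Treat the result as a hint rather than the full picture. An estimator without `n_jobs` may still use several cores in other ways, and other libraries that implement the same kind of model may parallelize it differently.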

## Spark

Sometimes (actually, often) you just can't run the model on your local machine. Either there is too much data, or training would take too long. In that case, you need to spread the work across several machines. One widely used tool for that is *Spark*. Apache Spark is an open-source engine for distributed computing that grew up in the ecosystem around Hadoop.

Spark is a powerful tool for distributed computing. And luckily for you, moving from Python to Spark is fairly straightforward, because Spark has a Python API called [PySpark](https://spark.apache.org/docs/0.9.0/python-programming-guide.html). You can also run PySpark from IPython notebooks, and Spark includes its own machine learning library, MLlib, which fills a role similar to scikit-learn's.

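PySpark code is ordinary Python that calls Spark's API, so it will look familiar. What changes is where the work runs: you need Spark installed, and for real workloads a cluster set up, to run it in a distributed fashion.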
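As a small illustration of how close the two styles are, the sketch below computes a sum of squares twice: once in plain Python and once with PySpark. The function names are hypothetical and the task is deliberately trivial. The Spark version assumes that `pyspark` and a Java runtime are installed, and it uses the `local[*]` master, which runs Spark on the cores of your own machine rather than on a cluster.

```python
from functools import reduce


def square(x):
    """Per-element work; Spark ships this function to its workers."""
    return x * x


def add(a, b):
    """Combine two partial results into one."""
    return a + b


def sum_of_squares_local(numbers):
    """Plain Python: everything runs in a single process."""
    return reduce(add, map(square, numbers))


def sum_of_squares_spark(numbers):
    """The same computation with PySpark (requires pyspark and Java)."""
    # Imported here so the file still loads on machines without Spark.
    from pyspark.sql import SparkSession

    # "local[*]" uses this machine's cores; a cluster needs a different master URL.
    spark = (
        SparkSession.builder.master("local[*]")
        .appName("sum-of-squares")
        .getOrCreate()
    )
    try:
        rdd = spark.sparkContext.parallelize(numbers)  # split data into partitions
        return rdd.map(square).reduce(add)  # map on the workers, then combine
    finally:
        spark.stop()


print(sum_of_squares_local(range(1, 11)))
```

With Spark available, `sum_of_squares_spark(range(1, 11))` should return the same value as the local version. The `map` and `reduce` steps mirror the plain Python ones; the difference is that Spark applies them to partitions of the data in parallel.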