# Google Colaboratory
In this assignment, you'll get up and running with [Google Colaboratory](https://research.google.com/colaboratory/faq.html) (Colab for short), a platform for machine learning research powered by Google. You'll use this platform to install and run Spark for the big data work in the remainder of this unit. We'll walk you through setting up Colab and installing and running Apache Spark on it.

Apache Spark can be challenging to install and configure because it is designed to run as a distributed system. In other words, Apache Spark can be installed on multiple computers that act as a single cluster. When set up as a cluster, it leverages the combined power of all the machines in the cluster by distributing data and computations across them.

Fortunately, we can avoid these configuration headaches by installing Spark on Colab, so we can quickly get to the important business of learning Spark. Note that we install Spark on a single machine, since Colab offers us only a single virtual machine. However, all the analysis we'll do in this unit is also valid for multi-server settings.

## What is Colab?
Colab is a free platform for running Jupyter notebooks in the cloud:

* Yes, it's free! You will not pay anything for accessing and using this platform.
* Colab also provides free GPU access, limited to a certain amount of time per session.
* The platform is based on Jupyter notebooks. Although the notebook interface looks slightly different from what you're used to, these are still the same Jupyter notebooks, with some adjustments and additions to run on Colab.
* Python and all its major data science libraries are already installed. Most of the time, you can upload your Jupyter notebooks and run them on Colab without installing anything or modifying your code.
* That being said, Apache Spark is not pre-installed on Colab. In order to use Spark, we need to install it.

In the rest of this assignment, we'll show you how to set up Colab and run your Jupyter notebooks on it. If you're already familiar with Google Drive, the instructions here are pretty straightforward. If you're new to Google Drive, the instructions below are prepared for you. If you want to read more about Colab before jumping into the setup, here's a good [introduction](https://colab.research.google.com/notebooks/welcome.ipynb#scrollTo=-Rh3-Vt9Nev9), and if you want to see it in action, you can follow this [link](https://colab.research.google.com/notebooks/welcome.ipynb#recent=true).

### Step 1: Create a Google account
If you use Gmail or any other Google service, you already have a Google account. If you don't have one, you need to create one. You can do this [here](https://accounts.google.com/signup/v2/webcreateaccount?flowName=GlifWebSignIn&flowEntry=SignUp). Once you've created your account, you can move on to the next step.

### Step 2: Set up your Colab on Google Drive
In this step, we'll connect Colab to your Google Drive. This step involves multiple sub-steps as follows:

#### Enter Google Drive
Log in to your Google account and enter Google Drive by clicking on the Drive image at the top right corner of the screen:

![Drive](Photos/drive.png)

This should open Google Drive in a new tab on your browser. This is your drive and you can upload any files to this place.

#### Connect Colab to your drive
Once you're in your drive, click on the + New button at the top left corner of your drive:

![New](Photos/new.png)

Then hover over the More menu and click on + Connect more apps:

![connect_more_apps](Photos/connect_more_apps.png)


Type *colab* and search for it:

![search_colab](Photos/search_colab.png)

Click on Colaboratory and then click on the + Connect button:

![connect_colab](Photos/connect_colab.png)

You should see the *Colaboratory was connected to Google Drive.* message, which indicates that you have successfully connected Colab to your drive.

![connected_colab](Photos/connected_colab.png)

You can check that Colab has been added to your connected apps by clicking on + New and then More:

![check_connected](Photos/check_connected.png)

That's all for setting up Colab in your drive.

#### Create two folders
Next, you need to create two folders in your drive. We'll use these two folders throughout this unit. We suggest naming them *Colab Datasets* and *Colab Notebooks*, because the code we provide assumes that the relevant files reside in these folders. You're free to choose other names as long as you edit the code accordingly.

![create_two_folders](Photos/create_two_folders.png)

And we're done with the setup. Now, let's see how we can use Colab.

### Step 3: Create a new Colab notebook
The next step is to create a Jupyter notebook and upload it to the *Colab Notebooks* folder on your Google Drive. Once you've uploaded the notebook, you can open it in two different ways:

1. Right click on the notebook --> Open with --> Colaboratory: 
![colab_right_click](Photos/colab_right_click.png)

2. Double click on the notebook and press Open with Colaboratory at the top of the screen: 
![colab_double_click](Photos/colab_double_click.png)

And that's it. You're now ready to run your notebooks.

### Step 4: Run your first Jupyter notebook
Colab comes with most of the Python libraries that are relevant to data science, including *NumPy*, *Pandas*, and *Matplotlib*. Write the following short piece of code into a cell and run it as you would run a regular Jupyter notebook cell:

![first_notebook](Photos/first_notebook.png)

You should see the same output as the one in the figure above.
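The code in the screenshot is not reproduced in this text. As a separate sanity check, the following minimal sketch reports which of the libraries named above can be imported in the current runtime. It uses only the standard library; `LIBRARIES` and `available_libraries` are hypothetical names chosen for this illustration.

```python
import importlib.util

# Libraries named in this step; extend the tuple to check others.
LIBRARIES = ("numpy", "pandas", "matplotlib")


def available_libraries(names=LIBRARIES):
    """Map each library name to True if it can be imported in this runtime."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


for name, installed in available_libraries().items():
    print(f"{name}: {'installed' if installed else 'not installed'}")
```

On a fresh Colab runtime, all three libraries would be expected to show up as installed.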

### Step 5: Install the packages
Colab is a great research and education platform, but it has some limitations. The major one for our purposes is that it allocates a server to us only temporarily. Once you shut down your notebook and your server has been idle for some time, Colab may release the resources it previously assigned to you. This means that you may need to install Java, Spark, and the other packages every time you want to run your notebook.

**Hence, it's a good idea to put the following code at the beginning of all of your notebooks.** Note that some lines in the following code start with **!**. This is a feature of Jupyter (IPython) notebooks, and it works on Colab as well. By putting **!** at the beginning of a line, you run that line as a shell command in the terminal of the Ubuntu server that Colab has allocated to you. Thus, you can install whatever you want on the machine you're currently using.
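Outside a notebook, a line starting with **!** is not valid Python. As a rough illustration of what such a line does, here is a minimal sketch that runs a shell command from plain Python using the standard `subprocess` module. The helper name `run_shell` is hypothetical, and the notebook's own mechanism differs in its details.

```python
import subprocess


def run_shell(command):
    """Run a shell command and return its standard output as text."""
    result = subprocess.run(
        command,
        shell=True,           # hand the string to the system shell, like "!" does
        capture_output=True,  # collect stdout and stderr instead of printing them
        text=True,            # decode the output bytes to str
        check=True,           # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout


print(run_shell("echo hello from the shell"))
```

In a notebook cell, `!echo hello from the shell` would be the shorter way to get the same effect.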

#### Install Java and Apache Spark
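First, you need to install Java and Apache Spark on the server that is allocated to you. The following code installs Java 8 and Apache Spark 2.4.0. If the download fails because the mirror no longer hosts this release, note that older Spark releases are kept in the Apache archive at https://archive.apache.org/dist/spark/.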

- `!apt-get install openjdk-8-jdk-headless -qq > /dev/null`
- `!wget -q http://apache.osuosl.org/spark/spark-2.4.0/spark-2.4.0-bin-hadoop2.7.tgz`
- `!tar xf spark-2.4.0-bin-hadoop2.7.tgz`

Second, you need to set the locations where Spark and Java are installed:

- `import os`
- `os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"`
- `os.environ["SPARK_HOME"] = "/content/spark-2.4.0-bin-hadoop2.7"`

#### Install Findspark and PySpark

Third, you need to install [Findspark](https://github.com/minrk/findspark) (a library that makes it easy for Python to find Spark) and PySpark using pip:

- `!pip install -q findspark`
- `!pip install pyspark`
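Once these packages are installed and `SPARK_HOME` is set, a Spark session can be started from Python. The following is a minimal sketch, assuming the installation steps above succeeded; the function name `start_local_spark` and the application name `colab-spark` are hypothetical choices for this illustration.

```python
def start_local_spark(app_name="colab-spark"):
    """Create (or reuse) a SparkSession that runs on the single Colab machine."""
    # Imported inside the function so that defining it does not require
    # the packages to be installed yet.
    import findspark

    # Uses SPARK_HOME to add that Spark installation's Python files to sys.path.
    findspark.init()

    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .master("local[*]")  # run locally, using all cores of the one machine
        .appName(app_name)   # hypothetical application name
        .getOrCreate()
    )
```

Calling `spark = start_local_spark()` and then `spark.range(5).show()` in a cell should print a small five-row table if everything is in place.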

#### Mount your Google Drive to Colaboratory

Lastly, you'll need to mount your Google Drive to Colaboratory so that you can access the files on your drive. This will enable you to read the data files you uploaded to your drive:

- `from google.colab import drive`
- `drive.mount('/content/gdrive')`

After you run the code above, Google will ask you for an authorization code:

![authorization_code](Photos/authorization_code.png)

You should follow that link and authorize Google to access your drive:

![authorize_google](Photos/authorize_google.png)

After you authorize Google, a code will appear on the screen:

![copy_code](Photos/copy_code.png)

You should copy that code and paste it into the input box that appeared in your notebook on Colab:

![paste_notebook](Photos/paste_notebook.png)

Once you hit Enter, a message saying your drive is mounted will appear:

![drive_mounted](Photos/drive_mounted.png)

And that's all. You can start to run your own code.
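With the drive mounted, files in your Drive folders can be reached through ordinary file paths. The sketch below builds such a path for a file in the *Colab Datasets* folder. It assumes the Drive root appears as a `My Drive` folder under the mount point (it may appear as `MyDrive` instead, depending on the Colab version); `dataset_path` and `example.csv` are hypothetical names used only for illustration.

```python
from pathlib import PurePosixPath

# Mount point passed to drive.mount(...) in the step above.
MOUNT_POINT = PurePosixPath("/content/gdrive")
# Assumption: the Drive root shows up as "My Drive" inside the mount point.
DRIVE_ROOT = MOUNT_POINT / "My Drive"


def dataset_path(filename, folder="Colab Datasets"):
    """Return the full path of a file stored in the given Drive folder."""
    return str(DRIVE_ROOT / folder / filename)


# "example.csv" is a hypothetical file name, not a file provided by this unit.
print(dataset_path("example.csv"))
```

If you named your folders differently, pass the folder name explicitly, for example `dataset_path("example.csv", folder="My Data")`.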

As we said before, you should put all the code from the previous four code cells at the beginning of every notebook that you run on Google Colaboratory. When you run this code in Colab, if the packages are already installed on the server you were given, you'll get a warning saying so. If the server is freshly allocated to you, this code will install all the relevant packages on the machine.
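If you would rather skip the download when Spark is already present, one option is to check for the unpacked directory first. This is a minimal sketch, assuming Spark was unpacked to the `SPARK_HOME` location used earlier; `needs_spark_install` is a hypothetical helper name.

```python
import os

# Same location as the SPARK_HOME value set earlier in this assignment.
SPARK_HOME = "/content/spark-2.4.0-bin-hadoop2.7"


def needs_spark_install(spark_home=SPARK_HOME):
    """Return True when the unpacked Spark directory is not on this machine."""
    return not os.path.isdir(spark_home)


if needs_spark_install():
    print("Spark not found: run the installation cells.")
else:
    print("Spark already unpacked: the download can be skipped.")
```

On a freshly allocated server the directory is absent, so the check reports that the installation cells need to run.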