 # **Loading Data in Google Colab** 


## 1) Loading data from your computer:

Run the following two lines of code to open a file picker in the browser.

Note: make sure "block third-party cookies" is turned off in your browser.

In [61]:
from google.colab import files
uploaded = files.upload()




Saving example.csv to example.csv
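
The dictionary returned by `files.upload()` maps each uploaded file name to its contents as bytes, so the data can also be parsed straight from memory. The helper below is a minimal sketch of that idea; the name `read_uploaded_csvs` is hypothetical, and it assumes every uploaded file is a CSV.

```python
import io

import pandas as pd


def read_uploaded_csvs(uploaded):
    """Parse a {filename: bytes} dictionary into {filename: DataFrame}.

    `uploaded` has the shape returned by google.colab.files.upload().
    Assumes every entry is CSV text.
    """
    return {name: pd.read_csv(io.BytesIO(data)) for name, data in uploaded.items()}


# Usage in Colab (google.colab is only importable inside a Colab runtime):
#   from google.colab import files
#   frames = read_uploaded_csvs(files.upload())
#   frames['example.csv']
```

This avoids depending on the copy that Colab saves to the working directory, which can matter when several files are uploaded at once.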


In [62]:
import pandas 
df_python = pandas.read_csv('example.csv')
df_python

|   | 1 | 2 | 3 | 4 | hello |
|---|---|---|---|---|-------|
| 0 | 5 | 6 | 7 | 8 | world |
| 1 | 9 | 10 | 11 | 12 | foo |
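
In the output above, pandas has used the first row of the file (`1,2,3,4,hello`) as column names, because `read_csv` assumes a header row by default. If that row is data, passing `header=None` keeps it. The sketch below uses an in-memory copy of the three rows as a stand-in for the file.

```python
import io

import pandas as pd

# In-memory stand-in for the example file: three data rows, no header row.
sample = "1,2,3,4,hello\n5,6,7,8,world\n9,10,11,12,foo\n"

# Default behaviour: the first row becomes the column names.
df_default = pd.read_csv(io.StringIO(sample))

# header=None: every row is data and columns are numbered 0..4.
df_no_header = pd.read_csv(io.StringIO(sample), header=None)

print(df_default.shape)    # (2, 5)
print(df_no_header.shape)  # (3, 5)
```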


## 2) Loading data from AWS’s S3 cloud storage 

Large datasets are often stored in the cloud. To load data from Amazon S3 cloud storage:

Create a (free) AWS account at console.aws.amazon.com.
Set up S3: create a bucket and upload a file.

Then create an access key. Click your account name, then choose My Security Credentials.
Expand the Access keys (access key ID and secret access key) section.
Choose Create New Access Key.


## Step 1: Install AWS SDK:

In [42]:
!pip install boto3

Collecting boto3
  Downloading https://files.pythonhosted.org/packages/cc/a8/b5037dc144e458b3574c085d891b85ab2035b63ab946b5c91c23f2dfc1c6/boto3-1.16.4-py2.py3-none-any.whl (129kB)
Collecting jmespath<1.0.0,>=0.7.1
  Downloading https://files.pythonhosted.org/packages/07/cb/5f001272b6faeb23c1c9e0acc04d48eaaf5c862c17709d20e3469c6e0139/jmespath-0.10.0-py2.py3-none-any.whl
Collecting s3transfer<0.4.0,>=0.3.0
  Downloading https://files.pythonhosted.org/packages/69/79/e6afb3d8b0b4e96cefbdc690f741d7dd24547ff1f94240c997a26fa908d3/s3transfer-0.3.3-py2.py3-none-any.whl (69kB)
Collecting botocore<1.20.0,>=1.19.4
  Downloading https://files.pythonhosted.org/packages/2b/55/9347e51769db0fe3487ed2ae5f438b3cc6aa2916e5e9d05e60a04855373e/botocore-1.19.4-py2.py3-none-any.whl (6.7MB)
Collecting urll

## Step 2: Set up boto3 credentials to pull data from S3:



In [43]:
import boto3

# Replace with your bucket name
BUCKET_NAME = 'mys3bucket_name'

# Replace with your AWS access key ID and secret access key
s3 = boto3.resource('s3',
                    aws_access_key_id='aws_access_key',
                    aws_secret_access_key='aws_secret_access_key')
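
Pasting keys into a notebook makes them easy to leak when the notebook is shared. One alternative, sketched below, is to read them from the environment variables `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`, which are the standard names boto3 itself looks for. The helper name `aws_credentials_from_env` is hypothetical.

```python
import os


def aws_credentials_from_env(env=os.environ):
    """Build the keyword arguments for boto3.resource from environment variables."""
    key_id = env.get("AWS_ACCESS_KEY_ID")
    secret = env.get("AWS_SECRET_ACCESS_KEY")
    if not key_id or not secret:
        raise KeyError("set AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY first")
    return {"aws_access_key_id": key_id, "aws_secret_access_key": secret}


# Usage in the notebook (requires boto3 and real credentials):
#   import boto3
#   s3 = boto3.resource('s3', **aws_credentials_from_env())
```

When both variables are set, `boto3.resource('s3')` with no key arguments should pick them up as well; the explicit helper only makes a missing key fail early with a clear message.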

In [44]:
KEY = 'ex.csv'  # replace with your object key

# Download the object from the bucket and save it locally as 'ex.csv'
s3.Bucket(BUCKET_NAME).download_file(KEY, 'ex.csv')
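
Re-running the download cell fetches the object again every time. A small guard, sketched below, skips the transfer when the local file already exists. The helper name `download_if_missing` is hypothetical; it only assumes that `bucket` has a `download_file(key, filename)` method, as the object returned by `s3.Bucket(...)` does.

```python
import os


def download_if_missing(bucket, key, local_path):
    """Download `key` from `bucket` unless `local_path` already exists.

    Returns True if a download was performed, False if it was skipped.
    """
    if os.path.exists(local_path):
        return False
    bucket.download_file(key, local_path)
    return True


# Usage in the notebook:
#   download_if_missing(s3.Bucket(BUCKET_NAME), KEY, 'ex.csv')
```

Note that this checks only for the presence of the file, so a changed object in S3 is not picked up until the local copy is deleted.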


## Step 3: Set up PySpark in Google Colab:

For details, see my file, "PySpark_GoogleColab.ipynb"

In [9]:


!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://archive.apache.org/dist/spark/spark-2.4.1/spark-2.4.1-bin-hadoop2.7.tgz
!tar xf spark-2.4.1-bin-hadoop2.7.tgz 

import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
# SPARK_HOME must point at the Spark version unpacked above (2.4.1)
os.environ["SPARK_HOME"] = "/content/spark-2.4.1-bin-hadoop2.7"

!pip install -q findspark
import findspark
findspark.init("/content/spark-2.4.1-bin-hadoop2.7")
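
The download URL, `SPARK_HOME`, and the path passed to `findspark.init` all have to name the same Spark build. One way to keep them in sync is to derive every string from a single version constant, as in this sketch. The helper `spark_paths` and the constant names are hypothetical; `/content` is assumed to be the Colab working directory, as in the cell above.

```python
SPARK_VERSION = "2.4.1"
HADOOP_VERSION = "2.7"


def spark_paths(spark_version, hadoop_version, install_root="/content"):
    """Return the archive URL and install directory for one Spark build."""
    name = f"spark-{spark_version}-bin-hadoop{hadoop_version}"
    return {
        "archive_url": f"https://archive.apache.org/dist/spark/spark-{spark_version}/{name}.tgz",
        "spark_home": f"{install_root}/{name}",
    }


paths = spark_paths(SPARK_VERSION, HADOOP_VERSION)
print(paths["archive_url"])
print(paths["spark_home"])
```

The `spark_home` value can then be assigned to `os.environ["SPARK_HOME"]` and passed to `findspark.init`, so a version bump only touches one line.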

In [37]:
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

## Step 4: Load the data using PySpark

In [47]:
df_spark = spark.read.csv('ex.csv')

df_spark.show()

+---+---+---+---+-----+
|_c0|_c1|_c2|_c3|  _c4|
+---+---+---+---+-----+
|  1|  2|  3|  4|hello|
|  5|  6|  7|  8|world|
|  9| 10| 11| 12|  foo|
+---+---+---+---+-----+
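
Because the file has no header row, Spark assigns the default names `_c0` to `_c4`, and without schema inference every column is read as a string. The sketch below shows one way to get typed, named columns; the helper `read_csv_with_names` and the column names in the usage comment are hypothetical.

```python
def read_csv_with_names(spark, path, column_names):
    """Read a header-less CSV and replace the default _c0, _c1, ... names.

    `spark` is an existing SparkSession; `column_names` needs one entry
    per column in the file.
    """
    df = spark.read.csv(path, inferSchema=True)  # infer integer/string types
    return df.toDF(*column_names)


# Usage in the notebook (column names are made up for illustration):
#   df_named = read_csv_with_names(spark, 'ex.csv', ['a', 'b', 'c', 'd', 'label'])
#   df_named.printSchema()
```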



## To visualize this dataset using pandas:

In [48]:
import pandas 
df_python = pandas.read_csv('ex.csv')
df_python

|   | 1 | 2 | 3 | 4 | hello |
|---|---|---|---|---|-------|
| 0 | 5 | 6 | 7 | 8 | world |
| 1 | 9 | 10 | 11 | 12 | foo |


## 3) Loading data from GitHub repositories

This is straightforward: copy the link to the raw file and pass it to pandas.read_csv:

In [54]:
df_github = pandas.read_csv('https://raw.githubusercontent.com/kyramichel/Pyspark_Cloud/master/ex.csv')
df_github

|   | 1 | 2 | 3 | 4 | hello |
|---|---|---|---|---|-------|
| 0 | 5 | 6 | 7 | 8 | world |
| 1 | 9 | 10 | 11 | 12 | foo |
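
`pandas.read_csv` needs the raw file link, not the address of the GitHub page that displays the file. The sketch below converts a page URL of the form `github.com/<user>/<repo>/blob/<branch>/<path>` into the matching `raw.githubusercontent.com` link. The helper name `github_blob_to_raw` is hypothetical, and it assumes the branch name contains no slash.

```python
from urllib.parse import urlparse


def github_blob_to_raw(url):
    """Convert a GitHub file page URL into its raw.githubusercontent.com URL."""
    parts = urlparse(url)
    segments = parts.path.strip("/").split("/")
    # Expected shape: <user>/<repo>/blob/<branch>/<path...>
    if parts.netloc != "github.com" or len(segments) < 5 or segments[2] != "blob":
        raise ValueError(f"not a GitHub file page URL: {url}")
    user, repo, _, branch, *file_path = segments
    return "https://raw.githubusercontent.com/" + "/".join([user, repo, branch, *file_path])


raw_url = github_blob_to_raw("https://github.com/kyramichel/Pyspark_Cloud/blob/master/ex.csv")
print(raw_url)
```

The resulting string can be passed directly to `pandas.read_csv`.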
