# **How to Get Data in Google Colab**


- Google Colab is a cloud-hosted Jupyter Notebook environment with free GPU access, which helps make Machine Learning education more accessible

- Choosing the Colab platform lets you get started quickly with Apache Spark for big data, with limited set-up time. (See [my notebook](https://github.com/kyramichel/Pyspark_Cloud/blob/master/PySpark_GoogleColab.ipynb) for details on how to install PySpark on Google Colab)


- The ability to access data in your Colab environment is the second important step in getting started working on a Data Science project in Colab



## This guide will show you how to access data from a variety of sources in Google's cloud Jupyter environment




# **Let's Get Started:**

When working on a small Data Science project, there are various ways to get data into Colab
 
## 1) Upload your data file to Colab directly:

Run the following two lines of code to invoke the browser and upload data files from your computer.

ps: make sure to turn off "block third-party cookies" on your browser

In [1]:
from google.colab import files
uploaded = files.upload()
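
`files.upload()` returns a dictionary that maps each uploaded file name to its contents as bytes, so the data can also be read straight from memory. The sketch below is only an illustration: it uses a hand-made dictionary with made-up contents as a stand-in for the real return value, so that it runs outside Colab, and the helper name `rows_from_upload` is hypothetical.

```python
import csv
import io

# Stand-in for the dictionary returned by files.upload() in Colab:
# keys are file names, values are the raw file contents as bytes.
# The file name and contents here are made up for illustration.
uploaded = {"example.csv": b"1,2,3,4,hello\n5,6,7,8,world\n9,10,11,12,foo\n"}

def rows_from_upload(uploaded_files, name):
    """Decode one uploaded file and parse it as CSV rows (lists of strings)."""
    text = uploaded_files[name].decode("utf-8")
    return list(csv.reader(io.StringIO(text)))

# Report what was uploaded, then parse one file.
for file_name, content in uploaded.items():
    print(f"{file_name}: {len(content)} bytes")

rows = rows_from_upload(uploaded, "example.csv")
print(rows[0])
```

With pandas, the same bytes can be wrapped in `io.BytesIO` and passed to `pandas.read_csv`, which accepts file-like objects as well as paths.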




To view the uploaded data in Colab, use Pandas read_csv, as if you were in a Jupyter notebook on your computer:


In [None]:
import pandas 
df_python = pandas.read_csv('example.csv')
df_python

Unnamed: 0,1,2,3,4,hello
0,5,6,7,8,world
1,9,10,11,12,foo


## 2) Connect your Google Drive to Google Colab

After mounting your drive, accessing data files from Google Drive in Colab is like working on your local computer

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive
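
Once the drive is mounted, its files are ordinary paths under the mount point. The sketch below shows one way to build such a path. It assumes the personal drive appears in a folder named `MyDrive` under `/content/drive` (check the file browser in the Colab sidebar, since the folder name may differ), and the `data/example.csv` location is hypothetical.

```python
from pathlib import Path

# Assumption: the drive was mounted at /content/drive and the personal
# drive shows up in a folder named "MyDrive" under that mount point.
DRIVE_ROOT = Path("/content/drive/MyDrive")

def drive_path(relative, root=DRIVE_ROOT):
    """Return the full path of a file stored in the mounted Google Drive."""
    return root / relative

csv_path = drive_path("data/example.csv")  # hypothetical folder and file
print(csv_path)
print("found" if csv_path.exists() else "not found (is the drive mounted?)")
```

The resulting path can be passed to `pandas.read_csv` like any local file.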


## **Accessing data from GitHub:**

Accessing data from GitHub in Colab works the same way as on your computer:

- Copy the link to the raw data file from the GitHub repo, then use Pandas read_csv to read the data:

In [None]:
df_github = pandas.read_csv('https://raw.githubusercontent.com/kyramichel/Pyspark_Cloud/master/ex.csv')
df_github

- Clone an entire GitHub repo inside your Colab using *git clone*. On the GitHub repo page, go to Code, then Clone, and copy the HTTPS link

ps: refresh your browser

In [None]:
! git clone https://github.com/kyramichel/Pyspark_Cloud.git
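
For the raw-file approach, read_csv needs the `raw.githubusercontent.com` address rather than the regular page address of the file. The helper below is a small sketch of that conversion. The function name `github_raw_url` is hypothetical, and it assumes the usual `github.com/<user>/<repo>/blob/<branch>/<path>` layout of a file page.

```python
from urllib.parse import urlparse

def github_raw_url(blob_url):
    """Convert a github.com '/blob/' file URL into its raw.githubusercontent.com form."""
    parts = urlparse(blob_url).path.strip("/").split("/")
    # Expected layout: <user>/<repo>/blob/<branch>/<path to file>
    if len(parts) < 5 or parts[2] != "blob":
        raise ValueError(f"not a GitHub file URL: {blob_url}")
    user, repo, _, *rest = parts
    return f"https://raw.githubusercontent.com/{user}/{repo}/{'/'.join(rest)}"

raw_url = github_raw_url("https://github.com/kyramichel/Pyspark_Cloud/blob/master/ex.csv")
print(raw_url)
```

The returned string can be passed directly to `pandas.read_csv`.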

# **Loading data from Kaggle**:


Kaggle is a platform where you can find datasets and Data Science competitions

## Uploading data from Kaggle into Colab

Kaggle datasets can be downloaded in Colab using the Kaggle CLI.

**Steps:** 

- Create a Kaggle account
- Go to My account, API, click 'Create New API Token', and download the kaggle.json file.

- Install kaggle cli:



In [2]:
! pip install -q kaggle

- Upload kaggle.json file:

In [None]:
from google.colab import files

files.upload()

- Create the .kaggle folder:

In [None]:
!mkdir -p ~/.kaggle

In [8]:
! cp kaggle.json ~/.kaggle/

- Change the permissions of the file:

In [9]:
! chmod 600 ~/.kaggle/kaggle.json
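
The folder, copy and permission steps can also be done from Python, which is handy when the set-up has to be repeated in every new runtime. This is a sketch rather than an official Kaggle helper: the function name `install_kaggle_token` is hypothetical, and the demo runs in a throwaway directory with a dummy token so that no real credentials are involved.

```python
import shutil
import stat
import tempfile
from pathlib import Path

def install_kaggle_token(token_file="kaggle.json", home=None):
    """Copy the API token into <home>/.kaggle and restrict it to the owner."""
    home = Path(home) if home is not None else Path.home()
    kaggle_dir = home / ".kaggle"
    kaggle_dir.mkdir(parents=True, exist_ok=True)  # like: mkdir -p ~/.kaggle
    target = kaggle_dir / "kaggle.json"
    shutil.copy(token_file, target)                # like: cp kaggle.json ~/.kaggle/
    target.chmod(stat.S_IRUSR | stat.S_IWUSR)      # like: chmod 600
    return target

# Demo in a temporary directory with a dummy token, so nothing real is touched.
with tempfile.TemporaryDirectory() as tmp:
    dummy = Path(tmp) / "kaggle.json"
    dummy.write_text('{"username": "demo", "key": "demo"}')
    installed = install_kaggle_token(dummy, home=Path(tmp) / "home")
    mode = stat.S_IMODE(installed.stat().st_mode)
    installed_name = installed.name

print(oct(mode))
```

In Colab, calling `install_kaggle_token()` with no arguments after uploading kaggle.json would place the token in the real home directory.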

- Now you are ready to search for datasets on kaggle:

In [93]:
! kaggle datasets list

ref                                                               title                                                 size  lastUpdated          downloadCount  voteCount  usabilityRating  
----------------------------------------------------------------  --------------------------------------------------  ------  -------------------  -------------  ---------  ---------------  
nehaprabhavalkar/indian-food-101                                  Indian Food 101                                        7KB  2020-09-30 06:23:43           5163        187  1.0              
christianlillelund/donald-trumps-rallies                          Donald Trump's Rallies                               720KB  2020-09-26 10:25:08           1238         82  1.0              
heeraldedhia/groceries-dataset                                    Groceries dataset                                    257KB  2020-09-17 04:36:08           5400        187  1.0              
andrewmvd/trip-advisor-hotel-reviews         

ps: If you get a warning that you are using an outdated API version (for example, server 1.5.6/client 1.5.4), upgrade your Kaggle API client as follows:

In [12]:
!kaggle -v

Kaggle API 1.5.4


In [15]:
! pip install -q kaggle --upgrade

In [17]:
!kaggle -v

Kaggle API 1.5.4

- If the version is still unchanged after the upgrade, force a reinstall:


In [18]:
!pip install --upgrade --force-reinstall --no-deps kaggle

Collecting kaggle
  Downloading https://files.pythonhosted.org/packages/fe/52/3d13208c0f24c72b886c400e94748076222d5ffa4913fb410af50cb09219/kaggle-1.5.9.tar.gz (58kB)
Building wheels for collected packages: kaggle
  Building wheel for kaggle (setup.py) ... done
  Created wheel for kaggle: filename=kaggle-1.5.9-cp36-none-any.whl size=73265 sha256=c5dae4e760befd9a31c830c98324043b7ce764f1fcf94e82ef0544d6cfe5745a
  Stored in directory: /root/.cache/pip/wheels/68/6d/9b/7a98271454edcba3b56328cbc78c037286e787d004c8afee71
Successfully built kaggle
Installing collected package

## Now you are ready to download data from Kaggle!

- Go to Kaggle, click Compete, join a competition, go to Data, copy the API command (kaggle competitions download -c 'name-of-competition')

- Run the API command:

In [32]:
! kaggle competitions download -c 'house-prices-advanced-regression-techniques'

house-prices-advanced-regression-techniques.zip: Skipping, found more recently modified local copy (use --force to force download)


To unzip the data
- first create a folder kaggle_data:

In [26]:
! mkdir kaggle_data

Copy the path of downloaded zip file then
- Unzip downloaded file to your kaggle_data folder:

In [33]:
! unzip /content/house-prices-advanced-regression-techniques.zip -d kaggle_data

Archive:  /content/house-prices-advanced-regression-techniques.zip
  inflating: kaggle_data/data_description.txt  
  inflating: kaggle_data/sample_submission.csv  
  inflating: kaggle_data/test.csv    
  inflating: kaggle_data/train.csv   
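
The same extraction can be done with Python's standard `zipfile` module, which avoids shelling out and returns the list of extracted files. The sketch below is an illustration under stated assumptions: `extract_zip` is a hypothetical helper, and the demo builds a tiny archive with made-up contents so that it runs without the Kaggle download.

```python
import tempfile
import zipfile
from pathlib import Path

def extract_zip(zip_path, dest_dir):
    """Extract every member of zip_path into dest_dir and return the member names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
        return archive.namelist()

# Demo on a small archive built on the fly (made-up contents).
with tempfile.TemporaryDirectory() as tmp:
    demo_zip = Path(tmp) / "demo.zip"
    with zipfile.ZipFile(demo_zip, "w") as archive:
        archive.writestr("train.csv", "Id,SalePrice\n1,208500\n")
    names = extract_zip(demo_zip, Path(tmp) / "kaggle_data")
    extracted_text = (Path(tmp) / "kaggle_data" / "train.csv").read_text()

print(names)
```

In the notebook, the call would be `extract_zip('/content/house-prices-advanced-regression-techniques.zip', 'kaggle_data')`.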


To visualize your train.csv dataset:

In [35]:
import pandas
df2  = pandas.read_csv("/content/kaggle_data/train.csv")
df2.head(5)

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,MasVnrType,MasVnrArea,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtFinType2,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,Heating,...,CentralAir,Electrical,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,FireplaceQu,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond,PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2003,2003,Gable,CompShg,VinylSd,VinylSd,BrkFace,196.0,Gd,TA,PConc,Gd,TA,No,GLQ,706,Unf,0,150,856,GasA,...,Y,SBrkr,856,854,0,1710,1,0,2,1,3,1,Gd,8,Typ,0,,Attchd,2003.0,RFn,2,548,TA,TA,Y,0,61,0,0,0,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,FR2,Gtl,Veenker,Feedr,Norm,1Fam,1Story,6,8,1976,1976,Gable,CompShg,MetalSd,MetalSd,,0.0,TA,TA,CBlock,Gd,TA,Gd,ALQ,978,Unf,0,284,1262,GasA,...,Y,SBrkr,1262,0,0,1262,0,1,2,0,3,1,TA,6,Typ,1,TA,Attchd,1976.0,RFn,2,460,TA,TA,Y,298,0,0,0,0,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2001,2002,Gable,CompShg,VinylSd,VinylSd,BrkFace,162.0,Gd,TA,PConc,Gd,TA,Mn,GLQ,486,Unf,0,434,920,GasA,...,Y,SBrkr,920,866,0,1786,1,0,2,1,3,1,Gd,6,Typ,1,TA,Attchd,2001.0,RFn,2,608,TA,TA,Y,0,42,0,0,0,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,Corner,Gtl,Crawfor,Norm,Norm,1Fam,2Story,7,5,1915,1970,Gable,CompShg,Wd Sdng,Wd Shng,,0.0,TA,TA,BrkTil,TA,Gd,No,ALQ,216,Unf,0,540,756,GasA,...,Y,SBrkr,961,756,0,1717,1,0,1,0,3,1,Gd,7,Typ,1,Gd,Detchd,1998.0,Unf,3,642,TA,TA,Y,0,35,272,0,0,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,FR2,Gtl,NoRidge,Norm,Norm,1Fam,2Story,8,5,2000,2000,Gable,CompShg,VinylSd,VinylSd,BrkFace,350.0,Gd,TA,PConc,Gd,TA,Av,GLQ,655,Unf,0,490,1145,GasA,...,Y,SBrkr,1145,1053,0,2198,1,0,2,1,4,1,Gd,9,Typ,1,TA,Attchd,2000.0,RFn,3,836,TA,TA,Y,192,84,0,0,0,0,,,,0,12,2008,WD,Normal,250000


# **Big Data:**

Considerations:

### Big Data storage

Big data is usually stored in the cloud. Several cloud storage providers exist, and their prices vary.

### Machine Learning/Deep Learning on the cloud

- Building large machine learning or deep learning models requires access to high computational power. Prices for it vary across cloud providers.

- Using Apache Spark in a Jupyter notebook on your own computer to run Machine Learning/Deep Learning models on large datasets requires a GPU and RAM sized for your needs. Hardware prices vary as well.


### Limitations of Google Colab  

Colab is a temporary cloud environment. The runtime is disconnected if it has remained idle for 90 minutes, or if it has been in use for 12 hours. When disconnected, you lose all your variables, state, installed packages, and files. When you reconnect, you are connected to a new environment.

Colab's disk space is limited: 77 GB per user is enough for small projects, but you may not meet storage requirements when working with big data.
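
The disk space actually available in a given runtime can be checked before downloading a large dataset. The sketch below uses the standard library's `shutil.disk_usage`. The helper name `disk_report` is hypothetical, and the figures it prints depend on the machine it runs on.

```python
import shutil

def disk_report(path="/"):
    """Return total, used and free disk space for `path`, in GiB (2**30 bytes)."""
    usage = shutil.disk_usage(path)
    gib = 1024 ** 3
    return {
        "total_gb": usage.total / gib,
        "used_gb": usage.used / gib,
        "free_gb": usage.free / gib,
    }

report = disk_report()
print(f"free: {report['free_gb']:.1f} GB of {report['total_gb']:.1f} GB")
```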



# **Working on a Larger Project:**

There is a *free* multi-cloud solution to get you started running ML/AI models with large datasets!

For medium-sized projects, a solution would be to:
- Use Google Colab, a Jupyter notebook running in the cloud on a server at Google. There is a paid version, Colab Pro ($9.99/mo), with 24-hour runtimes and more powerful GPU/TPU, which costs less than comparable cloud providers
- Run the open-source big data framework Apache Spark in Google Colab (see [my notebook](https://github.com/kyramichel/Pyspark_Cloud/blob/master/PySpark_GoogleColab.ipynb) for installation details) to process and analyze vast amounts of data
- Access big data from public cloud storage, such as the AWS and GCP public datasets, or from other sources
- Use free cloud storage like AWS S3 for storage needs



# **Let's Get Started:**

## Load data from AWS’s S3 cloud storage 

To load data from Amazon cloud storage:

- Create a (free) AWS account: go to console.aws.amazon.com. Set up S3, create a bucket, and upload a file.

- Create an access key: go to your account name, then choose My Security Credentials. Expand the Access keys (access key ID and secret access key) section and choose Create New Access Key.


- Install AWS SDK:

In [41]:
!pip install boto3

Collecting boto3
  Downloading https://files.pythonhosted.org/packages/83/c2/1d4ccb8d0772b37d2f5d14e784babbf6a51aca6dd2786da842cd912e98b4/boto3-1.16.5.tar.gz (97kB)
Collecting botocore<1.20.0,>=1.19.5
  Downloading https://files.pythonhosted.org/packages/59/91/c3ac686983570cbaf01063956e51ebd547d2e496f3e33052b7a9102c9a75/botocore-1.19.5-py2.py3-none-any.whl (6.7MB)
Collecting jmespath<1.0.0,>=0.7.1
  Downloading https://files.pythonhosted.org/packages/07/cb/5f001272b6faeb23c1c9e0acc04d48eaaf5c862c17709d20e3469c6e0139/jmespath-0.10.0-py2.py3-none-any.whl
Collecting s3transfer<0.4.0,>=0.3.0
  Downloading https://files.pythonhosted.org/packages/69/79/e6afb3d8b0b4e96cefbdc690f741d7dd24547ff1f94240c997a26fa908d3/s3transfer-0.3.3-py2.py3-none-any.whl (69kB)
Collecting urllib3<1.26,>=1.25.4; p

- Set up boto credentials to pull data from S3:



In [42]:
import boto3

In [60]:
# replace with your bucket name
BUCKET_NAME = 'XXXX'


#replace with your authentication credentials:
s3 = boto3.resource('s3', aws_access_key_id = 'YYYY', 
                          aws_secret_access_key= 'ZZZZ')


In [69]:
# replace with the key (path inside the bucket) of the file to download
KEY = 'WWWW'
s3.Bucket(BUCKET_NAME).download_file(KEY, 'ex.csv')
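
Typing real keys into a notebook cell makes them easy to leak when the notebook is shared. One hedged alternative is to read them from environment variables, as sketched below. The helper name `aws_credentials_from_env` is hypothetical. `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` are the standard variable names, which boto3 also picks up on its own when no keys are passed explicitly. The demo uses placeholder values only.

```python
import os

def aws_credentials_from_env(env=None):
    """Collect the AWS key pair from environment variables as boto3 keyword arguments."""
    env = os.environ if env is None else env
    try:
        return {
            "aws_access_key_id": env["AWS_ACCESS_KEY_ID"],
            "aws_secret_access_key": env["AWS_SECRET_ACCESS_KEY"],
        }
    except KeyError as missing:
        raise RuntimeError(f"environment variable {missing} is not set") from None

# Demo with placeholder values; in practice set the two variables beforehand
# rather than writing real keys into the notebook.
creds = aws_credentials_from_env(
    {"AWS_ACCESS_KEY_ID": "YYYY", "AWS_SECRET_ACCESS_KEY": "ZZZZ"}
)
print(sorted(creds))
# With boto3 installed: s3 = boto3.resource('s3', **creds)
```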

## Loading data from Amazon using the AWS CLI

In [70]:
!pip install awscli




## Accessing AWS public datasets  

https://registry.opendata.aws/fcp-indi/


In [None]:
# List the top level of the public bucket (no AWS credentials needed):
!aws s3 ls s3://fcp-indi/ --no-sign-request
# To copy one file by name:
# !aws s3 cp s3://fcp-indi/<file name with path> . --no-sign-request

                           PRE data/
                           PRE resources/
2019-02-08 17:34:25          0 c-pac_gui-0.0.1.dmg
2019-02-08 17:34:21   37112608 c-pac_gui-0.0.1_amd64.deb
2019-02-08 17:34:22  109142016 c-pac_gui-0.0.1_amd64.snap
2019-02-08 17:34:24   49467932 c-pac_gui-0.0.1_amd64.tar.gz
2019-02-08 17:34:25          0 c-pac_gui-0.0.1_mac.zip
2020-10-25 09:16:50     118782 fcp-indi_TEST.csv


## **Run PySpark in Google Colab**

Copy this code:

ps: For details, see my file, "PySpark_GoogleColab.ipynb"

In [None]:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://archive.apache.org/dist/spark/spark-2.4.1/spark-2.4.1-bin-hadoop2.7.tgz
!tar xf spark-2.4.1-bin-hadoop2.7.tgz 

import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.4.1-bin-hadoop2.7"

!pip install -q findspark
import findspark
findspark.init("/content/spark-2.4.1-bin-hadoop2.7")
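
The Spark version appears in the download URL, the archive name and the install directory, and all of them have to agree. One way to keep them in sync is to derive every string from a single version setting, as in this sketch. The helper name `spark_paths` is hypothetical, and the `/content` base directory is assumed to be Colab's working directory.

```python
def spark_paths(spark_version="2.4.1", hadoop_version="2.7", base_dir="/content"):
    """Build the download URL, archive name and SPARK_HOME for one Spark release."""
    name = f"spark-{spark_version}-bin-hadoop{hadoop_version}"
    return {
        "url": f"https://archive.apache.org/dist/spark/spark-{spark_version}/{name}.tgz",
        "archive": f"{name}.tgz",
        "spark_home": f"{base_dir}/{name}",
    }

paths = spark_paths()
print(paths["url"])
print(paths["spark_home"])
```

The `spark_home` value can then be used both for `os.environ["SPARK_HOME"]` and for `findspark.init(...)`.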

In [None]:
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

## Loading data using PySpark:

In [None]:
df_spark = spark.read.csv('ex.csv')

df_spark.show()

+---+---+---+---+-----+
|_c0|_c1|_c2|_c3|  _c4|
+---+---+---+---+-----+
|  1|  2|  3|  4|hello|
|  5|  6|  7|  8|world|
|  9| 10| 11| 12|  foo|
+---+---+---+---+-----+



- You can visualize this dataset using pandas:

In [None]:
import pandas 
df_python = pandas.read_csv('ex.csv')
df_python

Unnamed: 0,1,2,3,4,hello
0,5,6,7,8,world
1,9,10,11,12,foo


## **Loading data files from Cloud Storage to Google Colab:**

To access data (even public datasets) from GCS you will need to authenticate your Google account:

In [84]:
from google.colab import auth
auth.authenticate_user()

- Install the Google Cloud SDK:

In [None]:
!curl https://sdk.cloud.google.com | bash

In [None]:
!gcloud init

Now you are ready to access data files in Google Cloud Storage:


In [92]:
# copy the file from the bucket into the current directory
!gsutil cp gs://mycloud/example.csv .