
#### Copyright 2020 Google LLC.

In [None]:
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Acquiring Data

The datasets we have worked with so far have all been small enough that we have directly typed them in our code blocks. In reality, you'll need to bring in your data from an outside source.

This can be done in many ways. We'll explore a few in this lab, and we'll also mention some methods that are out of scope for this course but are often seen in the wild.

## Uploading Data

If you have the data that you want to work with on your computer, you can work with it locally using [Python](http://python.org), [Jupyter](https://jupyter.org/), and/or many other tools. If you want to work with that data in Colab, you'll need to upload the data to Colab since Colab executes code on a virtual machine in the cloud, not locally on your computer.

The first step of uploading data is having the data on your local machine.

For this example we will use the famous [Iris Dataset](https://en.wikipedia.org/wiki/Iris_flower_data_set). There are many copies of this dataset on the internet. We'll use the [version hosted by the University of California Irvine](https://archive.ics.uci.edu/ml/datasets/Iris).

The direct link to the dataset is [https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data](https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data).

**Download `iris.data` now.**

Once you have the data downloaded, you can upload the file to this Colab environment. To do that:

1. Click on the folder icon on the left of the Colab interface. This opens the 'Files' sidebar.
1. At the top of the sidebar there is an 'Upload' link. Click the link to open a file selector.
1. Through the selector, find the `iris.data` file that you just downloaded.
1. Click 'Open' or 'OK' to confirm the upload.

You will see a warning about files not being saved. This is because the file is stored on a virtual machine in the cloud, and when that machine is turned off, all files on it are lost. For classes like this you should be fine. For longer projects or long-running model training, there are other ways and places to store your files. We'll get to those later in the course.

Let's see if you uploaded the file successfully.

**Run the code block below**

In [1]:
import pandas as pd

column_names = [
  'sepal length',
  'sepal width',
  'petal length',
  'petal width',
  'class'
]

pd.read_csv('iris.data', names=column_names)

Unnamed: 0,sepal length,sepal width,petal length,petal width,class
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,Iris-virginica
146,6.3,2.5,5.0,1.9,Iris-virginica
147,6.5,3.0,5.2,2.0,Iris-virginica
148,6.2,3.4,5.4,2.3,Iris-virginica


You should see a `DataFrame` containing information about iris flowers.

If you get an error, you likely uploaded the wrong file or uploaded the file to the wrong location.

By default Colab works in the `/content/` folder of the virtual machine. Most of the time this is invisible to you. However, when you uploaded the file, you might have hit the 'Parent Directory' link instead of the 'Upload' link since they are close to each other and both have "up arrow" icons. If you see a long list of folders instead of a single `sample_data` folder in the files list, then you hit the 'Parent Directory' button. Unfortunately, the only way to redirect uploads to the correct folder is to restart your runtime.
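If you are not sure where an upload ended up, a small search can help. The sketch below is one possible way to check; the helper name `find_uploaded_file` is made up for this example, and the `/content` default assumes the standard Colab working folder described above.

```python
import os


def find_uploaded_file(filename, search_root="/content"):
    """Return every path under search_root whose file name equals filename."""
    matches = []
    for dirpath, _dirnames, filenames in os.walk(search_root):
        if filename in filenames:
            matches.append(os.path.join(dirpath, filename))
    return sorted(matches)


# Show where relative paths such as 'iris.data' are resolved from.
print("Working directory:", os.getcwd())

# An empty list means the file is not anywhere under /content.
print("Matches:", find_uploaded_file("iris.data"))
```

If the file shows up in a subfolder, you can pass that full path to `read_csv` instead of the bare file name.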

## Downloading With Python

If your data is hosted online, you can use Python to directly download the data and bypass the download/upload cycle mentioned in the last section.

One way to do this is to use the [`urlretrieve`](https://docs.python.org/3/library/urllib.request.html#urllib.request.urlretrieve) function from the [`urllib.request`](https://docs.python.org/3/library/urllib.request.html) module.

In the example below we request the `iris.names` file from UCI and then list the directory where it was downloaded.

In [None]:
import urllib.request
import os

urllib.request.urlretrieve(
    'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.names',
    'iris.names')

os.listdir()

['.config', 'iris.data', 'iris.names', 'sample_data']
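When you rerun a notebook, you may not want to fetch the same file again. The following is a minimal sketch of one way to skip the download when the file is already present; `download_if_missing` is a hypothetical helper written for this example, not part of `urllib`.

```python
import os
import urllib.request


def download_if_missing(url, destination):
    """Download url to destination unless the file already exists.

    Returns True if a download happened and False if it was skipped.
    """
    if os.path.exists(destination):
        return False
    urllib.request.urlretrieve(url, destination)
    return True


# Example call (not run here, since it needs network access):
# download_if_missing(
#     'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.names',
#     'iris.names')
```

Note that this check only looks at whether the file exists. It does not detect a partial or outdated copy.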

## Downloading With Pandas

It is possible to download data directly into a Pandas `DataFrame`. We have used `read_csv` in previous labs to load files from disk. If you pass `read_csv` a URL, it will pull data from the internet and load that data into a `DataFrame` in one shot.

In [None]:
import pandas as pd

column_names = [
  'sepal length',
  'sepal width',
  'petal length',
  'petal width',
  'class'
]

url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'

pd.read_csv(url, names=column_names)

Unnamed: 0,sepal length,sepal width,petal length,petal width,class
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,Iris-virginica
146,6.3,2.5,5.0,1.9,Iris-virginica
147,6.5,3.0,5.2,2.0,Iris-virginica
148,6.2,3.4,5.4,2.3,Iris-virginica


## Kaggle Data

Kaggle is a popular machine learning and data science educational playground. There are many interesting datasets hosted on [Kaggle's datasets page](https://www.kaggle.com/datasets).

If you [navigate to a dataset](https://www.kaggle.com/joshmcadams/oranges-vs-grapefruit) you won't be able to download it until you create a Kaggle account.

We'll be using Kaggle in this course. Even outside of the course, Kaggle is a great place to learn and experiment with machine learning and data science, all while building your public machine learning and data science resume.

**Log in to Kaggle now. Create a new account if you need to.**

At this point you should have a Kaggle account. You can now download [a dataset](https://www.kaggle.com/joshmcadams/oranges-vs-grapefruit) by clicking on the 'Download' link at the top of the information page for the dataset.

If you download the [Oranges vs. Grapefruit](https://www.kaggle.com/joshmcadams/oranges-vs-grapefruit) dataset, you should now have a file called `oranges-vs-grapefruit.zip` on your computer.

**Upload `oranges-vs-grapefruit.zip` to this lab.**

Now that you have uploaded `oranges-vs-grapefruit.zip`, we can load it into a `DataFrame`.

In [None]:
import pandas as pd

pd.read_csv('oranges-vs-grapefruit.zip')

Unnamed: 0,name,diameter,weight,red,green,blue
0,orange,2.96,86.76,172,85,2
1,orange,3.91,88.05,166,78,3
2,orange,4.42,95.17,156,81,2
3,orange,4.47,95.60,163,81,4
4,orange,4.48,95.76,161,72,9
...,...,...,...,...,...,...
9995,grapefruit,15.35,253.89,149,77,20
9996,grapefruit,15.41,254.67,148,68,7
9997,grapefruit,15.59,256.50,168,82,20
9998,grapefruit,15.92,260.14,142,72,11


Notice that the file that we loaded was `oranges-vs-grapefruit.zip`, which is a zip file, not a csv file. Zip files are compressed to save space. As a result, if you were to open `oranges-vs-grapefruit.zip` in a text editor, you wouldn't be able to read it. Lucky for us, `read_csv` knows what to do when it receives a compressed file.
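To see why a text editor can't make sense of a zip file, you can look at its raw bytes. The sketch below builds a tiny archive in memory (the file name `example.csv` and its contents are made up for illustration) and prints the first few bytes.

```python
import io
import zipfile

# Build a small zip archive in memory; the CSV content is illustrative only.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as z:
    z.writestr("example.csv", "name,diameter\norange,2.96\n")

raw = buffer.getvalue()

# The archive starts with the zip signature bytes rather than CSV text.
print(raw[:4])

# is_zipfile checks whether the bytes look like a zip archive.
print(zipfile.is_zipfile(io.BytesIO(raw)))
```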

Sometimes we will want to decompress a file before creating a `DataFrame`. Zip files can contain more than one file, so we might need to unzip our files and then load them individually into `DataFrame` objects.

To do this we use the `zipfile` library.

In the example below, we open the zip file and then extract all of the contained files into the current directory. We then list the directory and see that we now have a `citrus.csv` file, which is the uncompressed contents of `oranges-vs-grapefruit.zip`. This csv can then be loaded into a `DataFrame` directly.

In [None]:
import zipfile
import os

with zipfile.ZipFile('oranges-vs-grapefruit.zip','r') as z:
  z.extractall('./')

os.listdir()

['.config',
 'oranges-vs-grapefruit.zip',
 'iris.data',
 'iris.names',
 'citrus.csv',
 'sample_data']
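Before extracting everything, it can be useful to check what an archive contains, or to read a single member without writing it to disk. Here is a minimal sketch using the same `zipfile` library; the helper names `list_members` and `read_member_text` are made up for this example.

```python
import zipfile


def list_members(zip_path):
    """Return the names of the files stored in the archive."""
    with zipfile.ZipFile(zip_path, "r") as z:
        return z.namelist()


def read_member_text(zip_path, member, encoding="utf-8"):
    """Return the decoded contents of one file inside the archive."""
    with zipfile.ZipFile(zip_path, "r") as z:
        return z.read(member).decode(encoding)


# Example calls, assuming the archive from this section has been uploaded:
# list_members('oranges-vs-grapefruit.zip')
# read_member_text('oranges-vs-grapefruit.zip', 'citrus.csv')[:200]
```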

Zip is one of many file compression formats, and it is actually more than just a compression format. Remember when we mentioned above that a zip file might contain multiple files? Combining one or more files into a single file is known as archiving. Reducing the size of files is known as compression. Zip is both an archiving and a compression format.

You can find a list of similar formats [on Wikipedia](https://en.wikipedia.org/wiki/List_of_archive_formats).
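The difference between archiving and compression can be seen with `zipfile` itself. The sketch below, which uses made-up file names and repeated rows as sample data, writes the same two files once with `ZIP_STORED` (archive only) and once with `ZIP_DEFLATED` (archive and compress), then compares the resulting sizes.

```python
import io
import zipfile


def archive_size(files, compression):
    """Write files (a name -> text mapping) to an in-memory zip and return its size in bytes."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=compression) as z:
        for name, text in files.items():
            z.writestr(name, text)
    return len(buffer.getvalue())


# Repetitive sample data compresses well; real datasets will vary.
files = {
    "a.csv": "orange,2.96,86.76\n" * 500,
    "b.csv": "grapefruit,15.35,253.89\n" * 500,
}

stored = archive_size(files, zipfile.ZIP_STORED)      # archiving only
deflated = archive_size(files, zipfile.ZIP_DEFLATED)  # archiving plus compression

print("Stored bytes:  ", stored)
print("Deflated bytes:", deflated)
```

Both archives hold the same two files. Only the deflated one is smaller than the original data.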

### Direct Downloads

So far, we've been able to download data from Kaggle and then upload it to Colab. This involves downloading the entire dataset from Kaggle's servers onto your local machine and then uploading that dataset to the Colab server to actually process the data. For small datasets, this is reasonable. But for large datasets, this can quickly become a burden on your network connection and your device's storage space.

Kaggle offers an [API](https://github.com/Kaggle/kaggle-api) that comes with a command line program that can help you download files directly from Kaggle to Colab, skipping over your local machine entirely.

#### Credentials

In order to use the API, you'll need to "log in" to Kaggle. This is done using [API credentials](https://github.com/Kaggle/kaggle-api#api-credentials) in the form of an API token.

To do that:

1. Navigate to the 'Account' tab of your user profile in Kaggle at `https://www.kaggle.com/<username>/account`.
1. Click the "Create New API Token" button. This will download a `kaggle.json` file containing your API credentials.
1. Upload the `kaggle.json` file to this lab.

**Warning: Keep your `kaggle.json` private! It contains information that will allow people to authenticate into Kaggle using your user account.**

At this point you should have a `kaggle.json` file in the file list on the left. We can now download a dataset using the `kaggle` command and `datasets` subcommand:

In [6]:
! KAGGLE_CONFIG_DIR=/content/ kaggle datasets download joshmcadams/oranges-vs-grapefruit

Downloading oranges-vs-grapefruit.zip to /content
  0% 0.00/61.2k [00:00<?, ?B/s]
100% 61.2k/61.2k [00:00<00:00, 54.5MB/s]


You should see text similar to:

```
Downloading oranges-vs-grapefruit.zip to /content
  0% 0.00/61.2k [00:00<?, ?B/s]
100% 61.2k/61.2k [00:00<00:00, 23.1MB/s]
```

If you do, then you successfully downloaded the dataset!

You might have also seen a warning like this:

```
Warning: Your Kaggle API key is readable by other users on this system! To fix this, you can run 'chmod 600 /content/kaggle.json'
```

If so, it means that the `kaggle.json` file is readable by people other than you. This is probably okay since you are on a virtual machine by yourself. If you do want to fix the warning, run `chmod` as instructed.

In [7]:
! chmod 600 kaggle.json

You might also be wondering what that `KAGGLE_CONFIG_DIR=/content/` in front of the `kaggle` command was.

This is telling `kaggle` where to find your `kaggle.json` file. `kaggle` expects the file to be in `~/.kaggle/`. Since we didn't upload it there, `kaggle` can't find `kaggle.json` without us leading it to the correct folder.
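Writing `NAME=value` in front of a shell command sets that environment variable for that one command only. The sketch below shows the same idea from Python using a stand-in child process instead of `kaggle`; the helper name `run_with_env` is made up for this example.

```python
import os
import subprocess
import sys


def run_with_env(command, extra_env):
    """Run command with extra environment variables and return its standard output."""
    env = {**os.environ, **extra_env}
    result = subprocess.run(
        command, env=env, capture_output=True, text=True, check=True)
    return result.stdout


# A stand-in for 'kaggle': a child Python process that prints the variable it sees.
child = [
    sys.executable, "-c",
    "import os; print(os.environ.get('KAGGLE_CONFIG_DIR'))",
]

# The child process sees the variable...
print(run_with_env(child, {"KAGGLE_CONFIG_DIR": "/content/"}).strip())

# ...but the notebook's own environment is left unchanged.
print(os.environ.get("KAGGLE_CONFIG_DIR"))
```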

If you don't want to set this variable every time, move `kaggle.json` to the folder where `kaggle` expects it.

First make sure the directory exists.

In [8]:
! ls ~/.kaggle 2>/dev/null || mkdir ~/.kaggle

And then move the file.

In [9]:
! mv kaggle.json ~/.kaggle

Now you can run the `kaggle` command without having to set the configuration directory.

In [10]:
! kaggle datasets download joshmcadams/oranges-vs-grapefruit

oranges-vs-grapefruit.zip: Skipping, found more recently modified local copy (use --force to force download)


Note that you'll have to repeat this process every time your virtual machine resets. The setup will live through reloads though.

Keep a copy of your `kaggle.json` file on your local machine. Then, when you need to load your credentials into a Colab notebook, just:

1. Upload `kaggle.json`
1. Create a code block and run `! chmod 600 kaggle.json && (ls ~/.kaggle 2>/dev/null || mkdir ~/.kaggle) && mv kaggle.json ~/.kaggle/ && echo 'Done'`
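If you prefer to stay in Python, the same setup can be sketched with the standard library. The function name `install_kaggle_credentials` is made up for this example; it assumes `kaggle.json` has been uploaded to the current directory and that `~/.kaggle` is the folder `kaggle` reads from, as described above.

```python
import os
import shutil
import stat


def install_kaggle_credentials(source="kaggle.json", config_dir=None):
    """Move the credentials file into config_dir and restrict its permissions."""
    if config_dir is None:
        config_dir = os.path.join(os.path.expanduser("~"), ".kaggle")

    # Create the folder if needed, like the 'ls ... || mkdir' step above.
    os.makedirs(config_dir, exist_ok=True)

    destination = os.path.join(config_dir, "kaggle.json")
    shutil.move(source, destination)

    # Owner read/write only, the equivalent of 'chmod 600'.
    os.chmod(destination, stat.S_IRUSR | stat.S_IWUSR)
    return destination


# Example call, after uploading kaggle.json:
# install_kaggle_credentials()
```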

## Other Data Acquisition Methods

### Databases

It is possible to interact with databases directly from Python and therefore from Colab and other notebook environments. Python has a standard [database API](https://www.python.org/dev/peps/pep-0249/) and [many toolkits](https://docs.python-guide.org/scenarios/db/) that make interacting with databases easier.

If your data is stored in a database, you'll need to work with a database administrator to get access credentials and to understand the data and how it is stored.
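As a rough illustration of the database API, the sketch below uses Python's built-in `sqlite3` module with a temporary in-memory database. The table name `citrus` and its columns are invented for this example; a real project would connect to an existing database with credentials from your administrator.

```python
import sqlite3

# An in-memory database that disappears when the connection closes.
connection = sqlite3.connect(":memory:")
try:
    cursor = connection.cursor()

    # Create a small example table and add two rows.
    cursor.execute(
        "CREATE TABLE citrus (name TEXT, diameter REAL, weight REAL)")
    cursor.executemany(
        "INSERT INTO citrus VALUES (?, ?, ?)",
        [("orange", 2.96, 86.76), ("grapefruit", 15.35, 253.89)])
    connection.commit()

    # Query with a parameter placeholder instead of string formatting.
    cursor.execute(
        "SELECT name, weight FROM citrus WHERE diameter > ?", (10,))
    rows = cursor.fetchall()
finally:
    connection.close()

print(rows)
```

Other database drivers that follow the same API use a similar connect, cursor, execute, and fetch pattern, though connection details and placeholder styles differ.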

### APIs

Data can also be accessed through an application programming interface (API). APIs provide a way for you to write Python code that interacts with another system in a well-defined way.

For example, [Twitter](http://twitter.com) has an [API](https://developer.twitter.com/en/docs) that allows you to work with tweets. There are even abstraction layers like [Tweepy](https://www.tweepy.org/) that make working with the API even easier.

Every system has its own API with different methods and calling patterns. You'll hear terms like REST, SOAP, JSON, and XML thrown around when talking about specific APIs.
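To give a feel for what working with a JSON-based API can look like, here is a minimal sketch that builds a request URL and parses a response. Everything specific in it is hypothetical: the `api.example.com` endpoint, the query parameters, and the response body are invented for illustration, and no request is actually sent.

```python
import json
from urllib.parse import urlencode


def build_request_url(base_url, params):
    """Append URL-encoded query parameters to base_url."""
    return f"{base_url}?{urlencode(params)}"


# Hypothetical endpoint and parameters.
url = build_request_url(
    "https://api.example.com/v1/fruits", {"name": "orange", "limit": 2})
print(url)

# A made-up JSON response body, standing in for what a server might return.
response_body = (
    '{"results": ['
    '{"name": "orange", "diameter": 2.96}, '
    '{"name": "orange", "diameter": 3.91}]}'
)

# json.loads turns the JSON text into Python dictionaries and lists.
records = json.loads(response_body)["results"]
diameters = [record["diameter"] for record in records]
print(diameters)
```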

# Exercises

## Exercise 1: Direct Download

Use Python to directly download the `bridges.data.version2` data file from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/machine-learning-databases/bridges/). Load the data into a Pandas `DataFrame` and `.describe()` that `DataFrame`.

**Student Solution**

In [13]:
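# Your code goes here
import pandas as pd

# Read the file directly from the UCI repository rather than from a local copy.
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/bridges/bridges.data.version2'

# The file has no header row, so the first record becomes the column names here.
df = pd.read_csv(url)
df.describe()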

Unnamed: 0,E1,M,3,CRAFTS,HIGHWAY,?,2,N,THROUGH,WOOD,SHORT,S,WOOD.1
count,107,107,107,107,107,107,107,107,107,107,107,107,107
unique,107,4,55,4,4,4,5,3,3,4,4,4,8
top,E23,A,28,MATURE,HIGHWAY,MEDIUM,2,G,THROUGH,STEEL,MEDIUM,F,SIMPLE-T
freq,1,49,5,54,70,48,60,80,86,79,53,58,44


---

## Exercise 2: Kaggle Download

Use the Kaggle API to download [a dataset containing avocado prices in the US](https://www.kaggle.com/neuromusic/avocado-prices). Load the data into a Pandas `DataFrame` and describe the `DataFrame`.

**Student Solution**

In [15]:
# Your code goes here
! KAGGLE_CONFIG_DIR=/content/ kaggle datasets download neuromusic/avocado-prices

Downloading avocado-prices.zip to /content
  0% 0.00/629k [00:00<?, ?B/s]
100% 629k/629k [00:00<00:00, 91.7MB/s]


In [16]:
! chmod 600 kaggle.json

In [17]:
! ls ~/.kaggle 2>/dev/null || mkdir ~/.kaggle

kaggle.json


In [18]:
! mv kaggle.json ~/.kaggle

In [19]:
! kaggle datasets download neuromusic/avocado-prices

avocado-prices.zip: Skipping, found more recently modified local copy (use --force to force download)


In [23]:
import pandas as pd

dataset = pd.read_csv('avocado-prices.zip')
dataset.head()

Unnamed: 0.1,Unnamed: 0,Date,AveragePrice,Total Volume,4046,4225,4770,Total Bags,Small Bags,Large Bags,XLarge Bags,type,year,region
0,0,2015-12-27,1.33,64236.62,1036.74,54454.85,48.16,8696.87,8603.62,93.25,0.0,conventional,2015,Albany
1,1,2015-12-20,1.35,54876.98,674.28,44638.81,58.33,9505.56,9408.07,97.49,0.0,conventional,2015,Albany
2,2,2015-12-13,0.93,118220.22,794.7,109149.67,130.5,8145.35,8042.21,103.14,0.0,conventional,2015,Albany
3,3,2015-12-06,1.08,78992.15,1132.0,71976.41,72.58,5811.16,5677.4,133.76,0.0,conventional,2015,Albany
4,4,2015-11-29,1.28,51039.6,941.48,43838.39,75.78,6183.95,5986.26,197.69,0.0,conventional,2015,Albany


---