# Download and uncompress data for analysis from Kaggle API

------------------
![GitHub](https://img.shields.io/github/license/nicolastolosa/AirBnbModel)

**Author:** Nicolás Tolosa (github nicolastolosa)

**Achievements:** **(1)** Downloaded the relevant data for the project from the Kaggle API into the *../exploration_data* folder. **(2)** Uncompressed the data into *.csv* format and **(3)** removed the *.zip* files.

-------------
### Introduction
The purpose of this notebook is to obtain the data needed to develop the *AirBnbModel* project. 
First, a zip file containing all the files of the dataset is downloaded using the API provided by Kaggle. 
Then, the contents of the file are unzipped into the folder stored in the `destination_path` variable and, finally, all the *.zip* files are removed, leaving only data in *.csv* format.


### Important notes!
**For the API query, and hence this code, to run successfully, the Kaggle API credentials (username and key) must be present in the *.kaggle/kaggle.json* file.** Further information can be found in __[Public API Documentation | Kaggle](https://www.kaggle.com/docs/api#authentication)__.
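
Before authenticating, it can help to confirm that the credentials file is in place. The following is a minimal sketch, not part of the original notebook: the helper name `find_kaggle_credentials` is hypothetical, and it assumes that the file lives in `~/.kaggle` (or in the folder named by the `KAGGLE_CONFIG_DIR` environment variable) and holds a `username` and a `key` field.

```python
import json
import os


def find_kaggle_credentials(config_dir=None):
    """Return the path to kaggle.json if it holds both fields, else None."""
    # Assumption: credentials live in ~/.kaggle unless KAGGLE_CONFIG_DIR is set.
    if config_dir is None:
        config_dir = os.environ.get(
            "KAGGLE_CONFIG_DIR", os.path.join(os.path.expanduser("~"), ".kaggle")
        )
    credentials_path = os.path.join(config_dir, "kaggle.json")

    if not os.path.isfile(credentials_path):
        return None

    with open(credentials_path, "r", encoding="utf-8") as handle:
        try:
            credentials = json.load(handle)
        except json.JSONDecodeError:
            return None

    # Assumption: the token file is a JSON object with "username" and "key".
    if not isinstance(credentials, dict):
        return None
    if not credentials.get("username") or not credentials.get("key"):
        return None
    return credentials_path
```

Calling `find_kaggle_credentials()` with no arguments checks the default location. A return value of `None` suggests that authentication is likely to fail.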

<div class="alert alert-block alert-warning">
<b>Warning:</b> For the data of this project to be accessible, the terms and conditions of the Kaggle competition must be accepted first. Details can be found on the following page: <a href="https://www.kaggle.com/c/airbnb-recruiting-new-user-bookings/data"> Airbnb New User Bookings | Kaggle </a>
    
</div>

### Table of contents
* [1. Libraries](#1)
* [2. Setup](#2)
* [3. Download and unzip data](#3)
    * [3.1 Main folder](#3_1)
    * [3.2 *.csv* files](#3_2)
* [4. Resulting data](#4)

----------



In [1]:
%load_ext watermark
%watermark

Last updated: 2021-08-15T00:19:25.005343+02:00

Python implementation: CPython
Python version       : 3.9.5
IPython version      : 7.21.0

Compiler    : MSC v.1916 64 bit (AMD64)
OS          : Windows
Release     : 10
Machine     : AMD64
Processor   : Intel64 Family 6 Model 78 Stepping 3, GenuineIntel
CPU cores   : 4
Architecture: 64bit



***


## 1. Libraries <a class="anchor" id="1"></a>

In [2]:
# Import libraries
# ----------------

# system
import os
import zipfile

# APIs
import kaggle

---------------------

## 2. Setup <a class="anchor" id="2"></a>

In [3]:
# Definition and authentication of the Kaggle API
kaggle_api = kaggle.KaggleApi()
kaggle_api.authenticate()

In [4]:
# Definition and creation of the folder to contain the data
destination_path = '../exploration_data'
os.makedirs(destination_path, exist_ok=True)

------------------

## 3. Download and unzip data <a class="anchor" id="3"></a>

### 3.1 Main folder <a class="anchor" id="3_1"></a>
A main *.zip* file containing the whole dataset will be downloaded using `competition_download_files`.

The files contained in that main file can be seen in the following code block.

In [5]:
competition_name = 'airbnb-recruiting-new-user-bookings'
competition_datasets = kaggle_api.competition_list_files(competition_name)
competition_datasets

[age_gender_bkts.csv.zip,
 countries.csv.zip,
 test_users.csv.zip,
 train_users_2.csv.zip,
 sample_submission_NDF.csv.zip,
 sessions.csv.zip]

The file is downloaded and stored in the folder passed as `path` (here, `destination_path`), and its contents are extracted.

In [6]:
kaggle_api.competition_download_files(competition=competition_name, path=destination_path)

In [7]:
zip_path = os.path.join(destination_path, competition_name + '.zip')
with zipfile.ZipFile(zip_path, 'r') as zip_ref:
    zip_ref.extractall(destination_path)
    
os.remove(zip_path) # Remove .zip after extracting its contents 
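
Because the archive is deleted right after extraction, it may be worth inspecting it first. The sketch below is an optional addition rather than part of the original workflow. The helper `list_archive_members` is a hypothetical name, and it relies only on the standard `zipfile` module to list the members of an archive without extracting them.

```python
import zipfile


def list_archive_members(zip_path):
    """Return the member names of a zip archive without extracting it."""
    # is_zipfile returns False for missing or corrupted files.
    if not zipfile.is_zipfile(zip_path):
        raise ValueError(f"{zip_path} is not a valid zip archive")
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        return zip_ref.namelist()
```

Running `list_archive_members(zip_path)` before the extraction cell would show the names of the inner archives, and a `ValueError` would point to an incomplete or failed download.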

### 3.2 *.csv* files <a class="anchor" id="3_2"></a>

The files extracted in the previous step are themselves *.zip* files, which contain the *.csv* files of the dataset. The next step extracts those *.csv* files and, lastly, removes all the *.zip* files from the destination folder.

In [8]:
competition_datasets # Files extracted in the previous step

[age_gender_bkts.csv.zip,
 countries.csv.zip,
 test_users.csv.zip,
 train_users_2.csv.zip,
 sample_submission_NDF.csv.zip,
 sessions.csv.zip]

In [9]:
for dataset in competition_datasets:
    zip_path = os.path.join(destination_path, str(dataset))
    
    with zipfile.ZipFile(zip_path, 'r') as zip_ref:
        zip_ref.extractall(destination_path)
        
    os.remove(zip_path)
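
The loop above takes the archive names from the API listing. As an alternative sketch, which is not the method used in this notebook, the same step can be driven by whatever *.zip* files are actually present in the folder, using `glob`. The function name `extract_nested_zips` is hypothetical.

```python
import glob
import os
import zipfile


def extract_nested_zips(folder):
    """Extract every .zip found directly in `folder`, then delete it."""
    extracted = []
    for zip_path in sorted(glob.glob(os.path.join(folder, "*.zip"))):
        with zipfile.ZipFile(zip_path, "r") as zip_ref:
            zip_ref.extractall(folder)
            extracted.extend(zip_ref.namelist())
        # Remove the archive only after the file handle has been closed.
        os.remove(zip_path)
    return extracted
```

With this variant, `extract_nested_zips(destination_path)` would not depend on how the API objects convert to strings, at the cost of also extracting any unrelated *.zip* file that happens to be in the folder.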

---------------------
## 4. Resulting data <a class="anchor" id="4"></a>

The final structure of the folder, once the data has been extracted, can be seen in the following code block:

In [10]:
os.listdir(destination_path)

['age_gender_bkts.csv',
 'countries.csv',
 'sample_submission_NDF.csv',
 'sessions.csv',
 'test_users.csv',
 'train_users_2.csv']
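
The listing above can also be checked programmatically. The following sketch is an optional addition: the names `EXPECTED_FILES` and `check_data_folder` are hypothetical, and the expected file names are copied from the output above. It reports which expected files are missing and whether any *.zip* files are left over.

```python
import os

# File names taken from the folder listing shown above.
EXPECTED_FILES = frozenset({
    "age_gender_bkts.csv",
    "countries.csv",
    "sample_submission_NDF.csv",
    "sessions.csv",
    "test_users.csv",
    "train_users_2.csv",
})


def check_data_folder(folder, expected=EXPECTED_FILES):
    """Return (missing expected files, leftover .zip files), both sorted."""
    present = set(os.listdir(folder))
    missing = sorted(set(expected) - present)
    leftover_zips = sorted(name for name in present if name.endswith(".zip"))
    return missing, leftover_zips
```

If every step ran as intended, `check_data_folder(destination_path)` should return two empty lists.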