# Setup

## Instructions

This notebook uses the Kaggle API. Complete the following steps to obtain the credentials it needs:

1. Sign up for a Kaggle account at https://www.kaggle.com.
2. Go to the 'Account' tab of your user profile: `https://www.kaggle.com/{username}/account`.
3. Select 'Create New API Token' under the 'API' section.
    - This triggers the download of `kaggle.json`, a file containing your API credentials.
4. Place this file in the following location:
    - `~/.kaggle/kaggle.json` (for macOS/Unix)
    - `C:\Users\<Windows-username>\.kaggle\kaggle.json` (for Windows)
    - On Windows, you can check the exact location, minus the drive letter, with `echo %HOMEPATH%`.
    - To use a different location, set the `KAGGLE_CONFIG_DIR` environment variable. The file is then read from:
        - `$KAGGLE_CONFIG_DIR/kaggle.json` (for macOS/Unix)
        - `%KAGGLE_CONFIG_DIR%\kaggle.json` (for Windows)
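
Before importing `kaggle`, it can help to confirm that the credentials file is where step 4 expects it. The following is a minimal sketch, not part of the Kaggle tooling: `expected_credentials_path` is a hypothetical helper that only mirrors the locations listed above (the Kaggle client itself may check additional places).

```python
import os
from pathlib import Path


def expected_credentials_path(env=None):
    """Return where kaggle.json is expected, following the steps above.

    Hypothetical helper: uses KAGGLE_CONFIG_DIR if it is set,
    otherwise falls back to the .kaggle folder in the home directory.
    """
    env = os.environ if env is None else env
    config_dir = env.get('KAGGLE_CONFIG_DIR')
    base_dir = Path(config_dir) if config_dir else Path.home() / '.kaggle'
    return base_dir / 'kaggle.json'


# Report whether the credentials file is present on this machine
credentials_path = expected_credentials_path()
status = 'found' if credentials_path.is_file() else 'missing'
print(f'{credentials_path}: {status}')
```

Passing a dictionary as `env` makes it possible to check a location without changing the real environment, for example `expected_credentials_path({'KAGGLE_CONFIG_DIR': '/tmp/cfg'})`.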

## Additional Information

Documentation Guide:
- Kaggle API ([Kaggle](https://www.kaggle.com/docs/api))
- Kaggle API ([GitHub](https://github.com/Kaggle/kaggle-api)) 

## Imports

In [1]:
# Import kaggle, installing it first if necessary
try:
    import kaggle
except ImportError:
    !pip3 install kaggle
    import kaggle
    
import subprocess # Used to run the kaggle command-line tool
import pandas as pd # For storing/manipulating command output
from io import StringIO # Lets us read csv string output from a command into a DataFrame
import json # For reading back the metadata files
from tqdm import tqdm # Shows a progress bar while looping

# Data wrangling

## Getting/extracting dataset names

In [2]:
def get_search_output(search_terms, search_type):
    '''
    Calls the Kaggle API with the specified query terms and returns the search output results.
    
    Params:
    - search_terms (str/list-like): string or list of strings that should be searched for
    - search_type (str): objects to search over (must be either datasets or kernels)
    
    Returns:
    - cumulative_output (str): output from all searches
    '''
    # Make sure our input is valid
    assert len(search_terms) > 0, 'Please enter non-empty search terms'
    assert search_type in ('datasets', 'kernels'), 'Search can only be conducted over datasets or kernels'
    
    # If a single search term is provided as a string, wrap it in a list
    if isinstance(search_terms, str):
        search_terms = [search_terms]
    
    # Set search parameters
    cumulative_output = ''
    completion_phrase = f'No {search_type} found'
    
    # Search for each term
    for search_term in tqdm(search_terms):
        # Restart paging for every search term
        page_idx = 1
        
        while True:
            # Pull one page of results for the given search term:
            # run the command, capture stdout, and decode it from bytes to str
            search_output = subprocess.run(['kaggle', search_type, 'list', '-v',
                                             '-s', f'"{search_term}"', 
                                             '-p', str(page_idx)], 
                                            capture_output=True).stdout.decode()
            
            # Stop once there are no more results (or the command produced no output at all)
            if not search_output.strip() or search_output.strip() == completion_phrase:
                break
            
            # Keep the header row only for the very first page that is accumulated
            if cumulative_output:
                search_output = ''.join(search_output.splitlines(keepends=True)[1:])
            
            # Accumulate the output and move on to the next page
            cumulative_output += search_output
            page_idx += 1
        
    return cumulative_output
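
The header handling used when combining pages can be illustrated in isolation. The following is a minimal sketch rather than part of the notebook's pipeline: `merge_csv_pages` is a hypothetical helper, and the two pages are made-up strings shaped like csv output, each starting with its own header row.

```python
import csv
from io import StringIO


def merge_csv_pages(pages):
    """Concatenate csv pages, keeping only the first page's header row."""
    merged = ''
    for page in pages:
        lines = page.splitlines(keepends=True)
        # Every page after the first repeats the header, so drop it
        if merged:
            lines = lines[1:]
        merged += ''.join(lines)
    return merged


# Made-up sample pages, each with its own header row
pages = [
    'ref,title\r\nowner-a/first,First\r\n',
    'ref,title\r\nowner-b/second,Second\r\n',
]
merged = merge_csv_pages(pages)
rows = list(csv.reader(StringIO(merged)))
print(rows)
```

Using `splitlines(keepends=True)` keeps the sketch independent of whether the rows end in `\r\n` or `\n`.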

In [3]:
def convert_string_csv_output_to_dataframe(output):
    '''
    Given a string variable in csv format, returns a Pandas DataFrame
    
    Params:
    - output (str): csv-styled string to be converted
    
    Returns:
    - df (pandas.DataFrame): DataFrame consisting of data from 'output' string variable
    '''
    # Create DataFrame of results
    output = StringIO(output)
    df = pd.read_csv(output)
    
    return df

In [4]:
search_terms = ['korea']
search_type = 'datasets'

In [5]:
search_output = get_search_output(search_terms, search_type)
search_output_df = convert_string_csv_output_to_dataframe(search_output)

100%|██████████| 1/1 [00:07<00:00,  7.32s/it]


In [6]:
search_output_df

Unnamed: 0,ref,title,size,lastUpdated,downloadCount,voteCount,usabilityRating
0,kimjihoo/coronavirusdataset,[NeurIPS 2020] Data Science for COVID-19 (DS4C),7MB,2020-07-13 14:07:31,82997,1465,1.000000
1,bappekim/air-pollution-in-seoul,Air Pollution in Seoul,20MB,2020-04-03 16:33:49,9741,309,1.000000
2,bappekim/south-korea-visitors,South Korea Visitors,99KB,2020-06-04 08:53:36,1027,24,1.000000
3,hongsean/korea-income-and-welfare,Korea Income and Welfare,772KB,2020-12-20 13:05:27,568,15,0.970588
4,bryanpark/korean-single-speaker-speech-dataset,Korean Single Speaker Speech Dataset,3GB,2020-03-15 08:56:42,5469,106,0.750000
...,...,...,...,...,...,...,...
169,llkdev/gender-data,Gender Data,2MB,2019-06-14 08:08:16,8,0,0.125000
170,chenykfrank/immc21,immc21,94KB,2021-02-03 15:21:36,0,0,0.117647
171,anuragshakya2005/gas-prices,gas_prices,690B,2020-11-18 10:57:15,41,1,0.117647
172,gremmn/gas-prices,gas prices,685B,2021-01-20 12:59:28,12,0,0.117647


## Pulling dataset metadata

Note: I could not find a way to load the metadata directly into memory, so each metadata file is saved to disk and then read back. This workaround appears to be functional.

In [7]:
def _retrieve_metadata_json(dataset_path):
    '''
    Queries Kaggle for metadata json file & returns the json data as a dictionary
    
    Params:
    - dataset_path (str): path for the dataset
    
    Returns:
    - json_data (dict): dictionary containing json metadata
    '''
    # Download the metadata
    subprocess.run(['kaggle', 'datasets', 'metadata', dataset_path])

    # Access the metadata and load it in as a dictionary
    with open('dataset-metadata.json') as file:
        json_data = json.load(file)
        
    return json_data
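
If leaving `dataset-metadata.json` in the working directory is undesirable, one option is to download into a temporary directory that is deleted afterwards. The following is a minimal sketch under two assumptions: that `kaggle datasets metadata` accepts `-p` to choose the download directory, and that the file is still named `dataset-metadata.json`. The name `load_metadata_via_tempdir` and the `run` parameter are hypothetical; `run` exists only so the command runner can be swapped out.

```python
import json
import subprocess
import tempfile
from pathlib import Path


def load_metadata_via_tempdir(dataset_path, run=subprocess.run):
    """Download a dataset's metadata into a throwaway directory and return it as a dict."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        # Assumption: -p sets the directory the metadata file is written to
        run(['kaggle', 'datasets', 'metadata', '-p', tmp_dir, dataset_path], check=True)

        # Read the file back before the temporary directory is removed
        with open(Path(tmp_dir) / 'dataset-metadata.json') as file:
            return json.load(file)
```

With valid credentials in place, a call would look like `load_metadata_via_tempdir('bappekim/air-pollution-in-seoul')`. The metadata still passes through a file, but nothing is left behind on disk.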

In [8]:
def get_metadata(dataset_paths):
    '''
    Retrieves the metadata for the file/files listed in dataset_paths
    
    Params:
    - dataset_paths (str/list-like): string or list of strings containing the paths for the datasets
    
    Returns:
    - metadata_df (pandas.DataFrame): DataFrame containing metadata for the requested datasets
    '''
    # Make sure our input is valid
    assert len(dataset_paths) > 0, 'Please enter at least one dataset path'
    
    # If a single dataset path is provided as a string, wrap it in a list
    if isinstance(dataset_paths, str):
        dataset_paths = [dataset_paths]
        
    # Download & load the metadata for each dataset
    records = []
    for dataset_path in tqdm(dataset_paths):
        records.append(_retrieve_metadata_json(dataset_path))
        
    # DataFrame.append was removed in pandas 2.0, so build the DataFrame from all records at once
    metadata_df = pd.DataFrame(records)
        
    return metadata_df

In [9]:
dataset_paths = search_output_df.ref.values
metadata_df = get_metadata(dataset_paths)

100%|██████████| 173/173 [03:14<00:00,  1.12s/it]


In [10]:
metadata_df

Unnamed: 0,id,id_no,datasetId,datasetSlug,ownerUser,usabilityRating,totalViews,totalVotes,totalDownloads,title,subtitle,description,isPrivate,keywords,licenses,collaborators,data
0,kimjihoo/coronavirusdataset,527325,527325,coronavirusdataset,kimjihoo,1.000000,490936,1465,82997,[NeurIPS 2020] Data Science for COVID-19 (DS4C),DS4C: Data Science for COVID-19 in South Korea,### A portion of our dataset has been accepted...,False,"[universities and colleges, biology, data visu...",[{'name': 'CC-BY-NC-SA-4.0'}],"[{'username': 'kjm0623v', 'role': 'writer'}, {...",[]
1,bappekim/air-pollution-in-seoul,576393,576393,air-pollution-in-seoul,bappekim,1.000000,81243,309,9741,Air Pollution in Seoul,Air Pollution Measurement Information in Seoul...,### Context\nThis dataset deals with air pollu...,False,"[earth and nature, environment, pollution]",[{'name': 'CC-BY-SA-4.0'}],[],[]
2,bappekim/south-korea-visitors,692628,692628,south-korea-visitors,bappekim,1.000000,5843,24,1027,South Korea Visitors,Foreign visitors into South Korea,### Context\n\nThis dataset deals with the vis...,False,"[global, travel]",[{'name': 'CC-BY-SA-4.0'}],[],[]
3,hongsean/korea-income-and-welfare,1046735,1046735,korea-income-and-welfare,hongsean,0.970588,3018,15,568,Korea Income and Welfare,Where Korea wealth come from?,![](https://www.googleapis.com/download/storag...,False,"[income, education, social issues and advocacy...",[{'name': 'CC0-1.0'}],[],[]
4,bryanpark/korean-single-speaker-speech-dataset,19829,19829,korean-single-speaker-speech-dataset,bryanpark,0.750000,83396,106,5469,Korean Single Speaker Speech Dataset,KSS Dataset: Korean Single Speaker Speech Dataset,"# [Updated on September 28, 2019] KSS Dataset:...",False,"[languages, arts and entertainment, education]",[{'name': 'CC-BY-NC-SA-4.0'}],[],[]
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
169,llkdev/gender-data,231235,231235,gender-data,llkdev,0.125000,338,0,8,Gender Data,,,False,[],[{'name': 'unknown'}],[],[]
170,chenykfrank/immc21,1135726,1135726,immc21,chenykfrank,0.117647,44,0,0,immc21,,,False,[],[{'name': 'unknown'}],[],[]
171,anuragshakya2005/gas-prices,981330,981330,gas-prices,anuragshakya2005,0.117647,208,1,41,gas_prices,,,False,[],[{'name': 'unknown'}],[],[]
172,gremmn/gas-prices,1108372,1108372,gas-prices,gremmn,0.117647,168,0,12,gas prices,,,False,[],[{'name': 'unknown'}],[],[]
