# Setup

## Instructions

This notebook uses the Kaggle API. Follow these steps to obtain the credentials it requires:

1. Sign up for a Kaggle account at https://www.kaggle.com.
2. Go to the 'Account' tab of your user profile - ```https://www.kaggle.com/{username}/account```.
3. Select 'Create New API Token' under the 'API' section.
    - This will trigger the download of kaggle.json, a file containing your API credentials. 
4. Place this file in the location:
    - ~/.kaggle/kaggle.json (for macOS/unix)
    - C:/Users/username/.kaggle/kaggle.json (for Windows) 
    - On Windows, you can check the exact location, sans drive, with `echo %HOMEPATH%`.
    - You can define a shell environment variable KAGGLE_CONFIG_DIR to change this location to:
        - $KAGGLE_CONFIG_DIR/kaggle.json (for macOS/unix)
        - %KAGGLE_CONFIG_DIR%\kaggle.json (for Windows)
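
The lookup order described in step 4 can be sketched as follows. This is a minimal illustration of the locations listed above, not the Kaggle client's own lookup code; the helper name `kaggle_credentials_path` is hypothetical.

```python
import os
from pathlib import Path


def kaggle_credentials_path(environ=None):
    """Return the expected kaggle.json path (hypothetical helper).

    Uses KAGGLE_CONFIG_DIR when it is set, otherwise ~/.kaggle.
    """
    environ = os.environ if environ is None else environ
    config_dir = environ.get("KAGGLE_CONFIG_DIR")
    base = Path(config_dir) if config_dir else Path.home() / ".kaggle"
    return base / "kaggle.json"


path = kaggle_credentials_path()
print(path, "(found)" if path.is_file() else "(missing)")
```

Running this before `api.authenticate()` gives a quick check that the token file is where the notebook expects it.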

Note: this notebook uses functions written in Python to query the Kaggle API. **This code will only work with Python 3.7 or later**.
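
If it helps, the version requirement can be checked up front. This is a small sketch; the names `MIN_PYTHON` and `python_is_supported` are hypothetical and not part of the notebook.

```python
import sys

MIN_PYTHON = (3, 7)  # minimum version stated in the note above


def python_is_supported(version_info=sys.version_info, minimum=MIN_PYTHON):
    """Return True when the (major, minor) version meets the minimum."""
    return tuple(version_info[:2]) >= minimum


if not python_is_supported():
    raise RuntimeError("This notebook requires Python 3.7 or later")
```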

## Additional Information

Documentation Guide:
- Kaggle API ([Kaggle](https://www.kaggle.com/docs/api))
- Kaggle API ([GitHub](https://github.com/Kaggle/kaggle-api)) 

## Workflow overview

<img src="../images/Kaggle_workflow.jpg" width=650 height=650 align="left"/>

## Import libraries

In [33]:
# Allow system to search parent folder for local imports
import sys
sys.path.append('..')

# Kaggle-specific imports
from kaggle import KaggleApi
from kaggle.rest import ApiException

import pandas as pd # For storing/manipulating command data
import json # Reading back the metadata files
from tqdm import tqdm # Gives status bar on loop completion
import itertools # For efficient looping over queries
import os # Exporting saved results
from collections import OrderedDict
from utils import flatten_nested_df
from flatten_json import flatten

In [2]:
api = KaggleApi()
api.authenticate()



## Query #1: query API based on search terms and search types

Function `get_all_search_outputs` queries the Kaggle API for every combination of the specified search terms and search types, and returns the results as a dictionary of dataframes (one dataframe per query combination, keyed by the `(search_term, search_type)` tuple).
- Calls function `get_individual_search_output`
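
The "all combinations" behaviour comes from `itertools.product`. The sketch below shows the resulting key order using a stand-in for the real query; `run_all` and `fake_run_one` are hypothetical names used only for illustration.

```python
import itertools
from collections import OrderedDict


def run_all(search_terms, search_types, run_one):
    """Run run_one for every (term, type) pair, keyed by that pair."""
    results = OrderedDict()
    for term, stype in itertools.product(search_terms, search_types):
        results[(term, stype)] = run_one(term, stype)
    return results


def fake_run_one(term, stype):
    # Hypothetical stand-in for get_individual_search_output
    return f"results for {term!r} in {stype}"


results = run_all(["ml", "ai"], ["datasets", "kernels"], fake_run_one)
print(list(results))
# [('ml', 'datasets'), ('ml', 'kernels'), ('ai', 'datasets'), ('ai', 'kernels')]
```

Each term is paired with every type before moving on to the next term, so the dictionary order follows the order of `search_terms` first.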

In [3]:
def get_all_search_outputs(search_terms, search_types, flatten_output=False):
    """
    Call the Kaggle API for each search term and search type. 
    Results are returned in results[(search_term, search_type)] = df
    
    Params:
    - search_terms (list-like): collection of search terms to query over
    - search_types (list-like): collection of search types to query over
    - flatten_output (bool): optional (default=False)
    
    Returns:
    - results (dict): dictionary of the DataFrames returned by get_individual_search_output, one per query
    """
    
    num_searches = len(search_terms) * len(search_types)
    results = OrderedDict()
    
    for search_term, search_type in itertools.product(search_terms, search_types):
        results[(search_term, search_type)] = get_individual_search_output(search_term, search_type, flatten_output)
        
    return results

Function `get_individual_search_output` queries the Kaggle API with the specified search term (e.g., “machine learning”) and search type (must be either “datasets” or “kernels”).
- Searches across all returned pages
- Calls function `_convert_string_csv_output_to_dataframe` which converts results from API (strings in semi-structured table) to dataframe format
- Result is a dataframe (one dataframe per search term/search type combination)
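
The "searches across all returned pages" step is a loop that keeps requesting the next page until one comes back empty. A minimal sketch with a fake paged endpoint follows; `fetch_all_pages`, `fake_list_page`, and the toy records are assumptions for illustration, standing in for `api.datasets_list` / `api.kernels_list`.

```python
def fetch_all_pages(list_page, search_term):
    """Request pages 1, 2, ... until an empty page is returned."""
    rows = []
    page_idx = 1
    output = list_page(search=search_term, page=page_idx)
    while output:
        for record in output:
            # Tag every record with the page it came from
            rows.append({**record, "page": page_idx})
        page_idx += 1
        output = list_page(search=search_term, page=page_idx)
    return rows


# Hypothetical stand-in for a paged Kaggle list call
_FAKE_PAGES = {
    1: [{"ref": "owner/a"}, {"ref": "owner/b"}],
    2: [{"ref": "owner/c"}],
}


def fake_list_page(search, page):
    return _FAKE_PAGES.get(page, [])


all_rows = fetch_all_pages(fake_list_page, "machine learning")
print([(r["ref"], r["page"]) for r in all_rows])
# [('owner/a', 1), ('owner/b', 1), ('owner/c', 2)]
```

The loop always makes one final request that returns nothing; that empty page is what ends the search.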

In [4]:
def get_individual_search_output(search_term, search_type, flatten_output=False):
    """
    Calls the Kaggle API with the specified search term and returns the search output results.
    
    Params:
    - search_term (str): keyword to search for
    - search_type (str): objects to search over (must be either datasets or kernels)
    - flatten_output (bool): optional (default=False)
    
    Returns:
    - df (pandas.DataFrame): DataFrame containing the output of the search query
    """
    
    # Make sure our input is valid
    assert isinstance(search_term, str), 'Search term must be a string'
    assert search_type in ('datasets', 'kernels'), 'Search can only be conducted over datasets or kernels'
    
    # Use search type to get relevant API function
    list_queries = getattr(api, f'{search_type}_list')
    
    page_idx = 1
    search_df = pd.DataFrame()
    
    # Pulls the records for a single page of results for the given search term
    output = list_queries(search=search_term, page=page_idx)

    # Continue searching until we no longer receive results
    while output:
        if search_type == 'kernels':
            output = [vars(result) for result in output]
        if flatten_output:
            output = [flatten(result) for result in output]
            
        output_df = pd.DataFrame(output)
        output_df['page'] = page_idx

        search_df = pd.concat([search_df, output_df]).reset_index(drop=True)
        
        # Increments the page count for searching
        page_idx += 1

        # Pulls the records for a single page of results for the given search term
        output = list_queries(search=search_term, page=page_idx)
        
    # Modify columns for metadata merge
    search_df = search_df.rename(columns={'id': 'datasetId', 'ref': 'id'})
    if search_type == 'datasets':
        search_df = search_df.drop(columns=['viewCount', 'voteCount'])
    search_df = search_df.convert_dtypes()

    
    return search_df

Function `_convert_string_csv_output_to_dataframe` converts raw results from API (strings in semi-structured table) to dataframe format

In [5]:
from io import StringIO  # Needed below; not imported elsewhere in this notebook

def _convert_string_csv_output_to_dataframe(output):
    """
    Given a string variable in csv format, returns a Pandas DataFrame
    
    Params:
    - output (str): csv-styled string to be converted
    
    Returns:
    - df (pandas.DataFrame): DataFrame consisting of data from 'output' string variable
    """
    
    # Create DataFrame of results
    output = StringIO(output)
    df = pd.read_csv(output)
    
    return df

#### Run query #1 functions - example

In [6]:
search_terms = ['"machine learning"', '"artificial intelligence"']
search_types = ['datasets', 'kernels']

In [7]:
search_output_dict = get_all_search_outputs(search_terms, search_types, flatten_output=True)

In [8]:
sample_key = (search_terms[0], search_types[0])
sample_df = search_output_dict[sample_key]

In [9]:
sample_df

Unnamed: 0,datasetId,id,subtitle,creatorName,creatorUrl,totalBytes,url,lastUpdated,downloadCount,isPrivate,...,tags_9_scriptCount,tags_9_totalCount,tags_10_ref,tags_10_name,tags_10_description,tags_10_fullPath,tags_10_competitionCount,tags_10_datasetCount,tags_10_scriptCount,tags_10_totalCount
0,70947,kaggle/kaggle-survey-2018,The most comprehensive dataset available on th...,Paul Mooney,paultimothymooney,4405170,https://www.kaggle.com/kaggle/kaggle-survey-2018,2018-11-03T22:35:07.12Z,15400,False,...,,,,,,,,,,
1,2733,kaggle/kaggle-survey-2017,A big picture view of the state of data scienc...,Mark McDonald,markmcdonald,3692241,https://www.kaggle.com/kaggle/kaggle-survey-2017,2017-10-27T22:03:03.417Z,23057,False,...,,,,,,,,,,
2,635,alopez247/pokemon,(Almost) all Pokémon stats until generation 6:...,alopez247,alopez247,731777,https://www.kaggle.com/alopez247/pokemon,2017-03-05T15:01:26.013Z,10498,False,...,,,,,,,,,,
3,32132,kashnitsky/mlcourse,Open Machine Learning Course by OpenDataScience,Yury Kashnitsky,kashnitsky,53599525,https://www.kaggle.com/kashnitsky/mlcourse,2018-12-09T16:45:09.507Z,25561,False,...,,,,,,,,,,
4,654897,kaushil268/disease-prediction-using-machine-le...,Use Machine Learning and Deep Learning models ...,KAUSHIL268,kaushil268,30490,https://www.kaggle.com/kaushil268/disease-pred...,2020-05-15T03:58:44.15Z,2476,False,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1822,334537,sinjonny/student-projects,,Kedar Jadav,sinjonny,27580,https://www.kaggle.com/sinjonny/student-projects,2019-09-04T17:24:25.917Z,49,False,...,,,,,,,,,,
1823,293689,partham/techgig-times-internet-gender-prediction,,Partha,partham,659711447,https://www.kaggle.com/partham/techgig-times-i...,2019-08-08T14:32:19.19Z,11,False,...,,,,,,,,,,
1824,1418312,jimregan/dalaj-v10,,Jim O'Regan,jimregan,266431,https://www.kaggle.com/jimregan/dalaj-v10,2021-06-18T21:07:57.907Z,5,False,...,,,,,,,,,,
1825,776337,awsaf49/vip-cup-2020,,Awsaf,awsaf49,5307889843,https://www.kaggle.com/awsaf49/vip-cup-2020,2020-08-02T18:30:26.807Z,14,False,...,,,,,,,,,,


## Query #2: query API for full metadata for hits from initial query

Note: we were unable to find a way to keep the metadata in memory rather than saving it to a file, so the functions below download each metadata file and read it back. This workaround appears to work.
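
One way to keep the download-then-read workaround from leaving files behind is to download into a temporary directory. The sketch below is an assumption-laden illustration, not the notebook's method: `load_metadata_via_tempdir` and `fake_download` are hypothetical, and `fake_download` merely imitates a downloader that writes `dataset-metadata.json` into the given `path`.

```python
import json
import tempfile
from pathlib import Path


def load_metadata_via_tempdir(download, object_path):
    """Download metadata into a throwaway directory and return it as a dict.

    `download(object_path, path=...)` is assumed to write
    dataset-metadata.json into the directory given by `path`.
    """
    with tempfile.TemporaryDirectory() as tmp_dir:
        download(object_path, path=tmp_dir)
        with open(Path(tmp_dir) / "dataset-metadata.json") as fh:
            return json.load(fh)


def fake_download(object_path, path):
    # Hypothetical stand-in for a real metadata download call
    with open(Path(path) / "dataset-metadata.json", "w") as fh:
        json.dump({"id": object_path, "title": "toy"}, fh)


metadata = load_metadata_via_tempdir(fake_download, "owner/a")
print(metadata)
# {'id': 'owner/a', 'title': 'toy'}
```

The temporary directory is deleted when the `with` block exits, so each call starts from a clean location.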

Function `_retrieve_object_json` uses the path for each object (taken from each dataframe in the query #1 ordered dictionary) to query the API for the metadata associated with that object. The JSON results are loaded as a dictionary.

In [68]:
def _retrieve_object_json(object_path, flatten_output=False):
    """
    Queries Kaggle for metadata json file & returns the json data as a dictionary
    
    Params:
    - object_path (str): path for the dataset
    - flatten_output (bool): optional (default=False)
    
    Returns:
    - metadata_dict (dict): dictionary containing json metadata
    """
    
    # Download the metadata
    try:
        api.dataset_metadata(object_path, path='../data/')
    # Error occurs when there is no metadata to return
    except (TypeError, ApiException) as e:
        if (isinstance(e, ApiException) and 
            e.status != 404 and
            'bigquery' not in e.headers['Turbolinks-Location']):
            raise e
        else:
            return None
    else:
        # Access the metadata and load it in as a dictionary
        with open('../data/dataset-metadata.json') as file:
            json_data = json.load(file)

        if flatten_output:
            json_data = flatten(json_data)

        return json_data

Function `get_query_metadata` extracts metadata associated with each object and formats as dataframe
- Calls function `_retrieve_object_json`
- For each object, appends JSON results from `_retrieve_object_json` to a dataframe (converting to dataframe format)
- Output is a single dataframe for each search query (matching each dataframe in the query #1 ordered dictionary)
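
When `flatten_output=True`, nested JSON is flattened so that each nested value gets its own column. The sketch below illustrates the idea in plain Python; it is a simplified stand-in written for this explanation, not the `flatten_json` implementation, and `flatten_sketch` and the toy record are hypothetical.

```python
def flatten_sketch(obj, sep="_"):
    """Flatten nested dicts/lists into a single-level dict.

    Keys are joined with `sep`; list items use their index as the key part.
    Empty containers are kept as-is.
    """
    flat = {}

    def _walk(value, key):
        if isinstance(value, dict) and value:
            for k, v in value.items():
                _walk(v, f"{key}{sep}{k}" if key else str(k))
        elif isinstance(value, (list, tuple)) and value:
            for i, v in enumerate(value):
                _walk(v, f"{key}{sep}{i}" if key else str(i))
        else:
            flat[key] = value

    _walk(obj, "")
    return flat


nested = {
    "id": "owner/a",
    "keywords": ["games", "arts"],
    "collaborators": [],
    "licenses": [{"name": "CC0"}],
}
print(flatten_sketch(nested))
# {'id': 'owner/a', 'keywords_0': 'games', 'keywords_1': 'arts',
#  'collaborators': [], 'licenses_0_name': 'CC0'}
```

Because records can have different numbers of list items, the combined dataframe ends up with many sparsely filled columns.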

In [11]:
def get_query_metadata(object_paths, flatten_output=False):
    """
    Retrieves the metadata for the file/files listed in object_paths
    
    Params:
    - object_paths (str/list-like): string or list of strings containing the paths for the objects
    - flatten_output (bool): optional (default=False)
    
    Returns:
    - metadata_df (pandas.DataFrame): DataFrame containing metadata for the requested datasets
    """
    
    # If a single path is provided as a string, wrap it in a list
    if isinstance(object_paths, str):
        object_paths = [object_paths]
    
    # Make sure our input is valid
    assert len(object_paths) > 0, 'Please enter at least one object id'
        
    # Collect one metadata record per object; _retrieve_object_json returns None
    # when no metadata is available, so those objects are skipped
    records = []
    for object_path in tqdm(object_paths):
        # Download & load the metadata
        json_data = _retrieve_object_json(object_path, flatten_output)
        if json_data is not None:
            records.append(json_data)

    # Build the DataFrame in one step (DataFrame.append was removed in pandas 2.0)
    metadata_df = pd.DataFrame(records)
        
    # Modify dtypes
    metadata_df = metadata_df.convert_dtypes()
        
    return metadata_df

Function `get_all_metadata` uses a `for` loop to put dataframes into an ordered dictionary, matching result #1 ordered dictionary
- Calls function `get_query_metadata`

In [12]:
def get_all_metadata(search_output_dict, flatten_output=False):
    """
    Retrieves all of the metadata that relates to the provided DataFrames
    
    Params:
    - search_output_dict : dict
        Dictionary of DataFrames from get_all_search_outputs
    - flatten_output : bool, optional (default=False)
        flag for flattening nested columns of output  
      
    Returns:
    - metadata_dict : collections.OrderedDict
        OrderedDict of DataFrames with metadata for each query
        Order matches the order of search_output_dict
    """

    ## Extract IDs from DataFrame, and returns as list of strings
    metadata_dict = OrderedDict()

    for query, df in search_output_dict.items():
        # Kernels do not have metadata
        if 'kernels' not in query:
            print(f'Retrieving {query} metadata')
            # Create object paths
            _, search_type = query
            object_paths = df.id.values

            metadata_dict[query] = get_query_metadata(object_paths, flatten_output)
        
    return metadata_dict

In [69]:
metadata_dict = get_all_metadata(search_output_dict, flatten_output=True)

Retrieving ('"machine learning"', 'datasets') metadata


100%|██████████| 1826/1826 [3:25:13<00:00,  6.74s/it]   


Retrieving ('"artificial intelligence"', 'datasets') metadata


100%|██████████| 506/506 [09:56<00:00,  1.18s/it]  


## Combine results of query #1 and query #2

Function `merge_search_and_metadata_dicts` merges the output dictionaries from query #1 and query #2 into a single ordered dictionary and optionally saves each merged dataframe as a csv file.
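
When `on`, `left_on`, and `right_on` are all left as `None`, `pandas.merge` joins on every column name the two frames have in common. The plain-Python sketch below (not pandas, and using made-up toy rows) illustrates why that matters for an outer join: if any shared column differs between the two sides, the rows no longer match and both copies are kept. `outer_join` is a hypothetical helper written only for this illustration.

```python
def outer_join(left, right, keys):
    """Plain-Python outer join of two lists of dicts on the given key columns."""
    def key_of(row):
        return tuple(row.get(k) for k in keys)

    right_by_key = {}
    for row in right:
        right_by_key.setdefault(key_of(row), []).append(row)

    merged, matched = [], set()
    for lrow in left:
        k = key_of(lrow)
        if k in right_by_key:
            matched.add(k)
            for rrow in right_by_key[k]:
                # Left values win on conflicts here; pandas would instead
                # keep both values in suffixed columns
                merged.append({**rrow, **lrow})
        else:
            merged.append(dict(lrow))
    # Keep right-hand rows that never matched
    for k, rows in right_by_key.items():
        if k not in matched:
            merged.extend(dict(r) for r in rows)
    return merged


# Toy rows: 'isPrivate' is stored differently on each side
search_rows = [
    {"id": "owner/a", "isPrivate": False, "downloadCount": 10},
    {"id": "owner/b", "isPrivate": False, "downloadCount": 5},
]
metadata_rows = [{"id": "owner/a", "isPrivate": "0", "description": "toy"}]

shared = sorted(set(search_rows[0]) & set(metadata_rows[0]))
on_all_shared = outer_join(search_rows, metadata_rows, shared)
on_id = outer_join(search_rows, metadata_rows, ["id"])
print(shared, len(on_all_shared), len(on_id))
# ['id', 'isPrivate'] 3 2
```

If the merged dataframe has noticeably more rows than the search dataframe, passing an explicit key such as `on='id'` may be worth trying.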

In [70]:
def merge_search_and_metadata_dicts(search_dict, metadata_dict, on=None, left_on=None, right_on=None, save=False):
    """
    Merges together all of the search and metadata DataFrames by the given 'on' key
    
    Params:
    - search_dict (dict): dictionary of search output results
    - metadata_dict (dict): dictionary of metadata results
    - on (str/list-like): column name(s) to merge the two dicts on
    - left_on (str/list-like): column name(s) to merge the left dict on
    - right_on (str/list-like): column name(s) to merge the right dict on
    - save=False, optional (bool/list-like): specifies if the output DataFrames should be saved
        If True: saves to file of format 'data/kaggle/{search_term}_{search_type}.csv'
        If list-like: saves to respective location in list of save locations
            Must contain enough strings (one per query; len(search_terms) * len(search_types))
            
    Returns:
    - df_dict (OrderedDict): OrderedDict containing all of the merged search/metadata dicts
    """
    
    num_dataframes = len(search_dict)
    
    # Ensure the save variable data is proper
    try:
        if isinstance(save, bool):
            save = [save] * num_dataframes
        assert len(save) == num_dataframes
    except (TypeError, AssertionError):
        raise ValueError('Incorrect save value(s)')

    # Merge the DataFrames
    df_dict = OrderedDict()
    for query_key, save_loc in zip(search_dict.keys(), save):
        search_df = search_dict[query_key]
        if query_key in metadata_dict:
            # Merge small version of "full" dataframe with "detailed" dataframe
            metadata_df = metadata_dict[query_key]
            df_all = pd.merge(search_df, metadata_df, on=on, left_on=left_on, right_on=right_on, how='outer')
        else:
            df_all = search_df
        
        # Save DataFrame
        if save_loc:
            data_dir = os.path.join('data', 'kaggle')
            if isinstance(save_loc, str):
                output_file = save_loc
            elif isinstance(save_loc, bool):
                # Ensure kaggle directory is already created
                if not os.path.isdir(data_dir):
                    os.makedirs(data_dir)
                
                search_term, search_type = query_key
                output_file = f'{search_term}_{search_type}.csv'
            else:
                raise ValueError('Save type must be bool or str')

            # Save the merged DataFrame rather than only the search results
            df_all.to_csv(os.path.join(data_dir, output_file), index=False)
        
        df_dict[query_key] = df_all
    
    return df_dict

In [71]:
#run merge function
df_dict = merge_search_and_metadata_dicts(search_output_dict, metadata_dict)

In [72]:
df_dict[sample_key]

Unnamed: 0,datasetId,id,subtitle,creatorName,creatorUrl,totalBytes,url,lastUpdated,downloadCount,isPrivate,...,collaborators_2_role,collaborators_2_username,collaborators_3_role,collaborators_3_username,collaborators_4_role,collaborators_4_username,collaborators_5_role,collaborators_5_username,collaborators_6_role,collaborators_6_username
0,70947,kaggle/kaggle-survey-2018,The most comprehensive dataset available on th...,Paul Mooney,paultimothymooney,4405170,https://www.kaggle.com/kaggle/kaggle-survey-2018,2018-11-03T22:35:07.12Z,15400,False,...,,,,,,,,,,
1,2733,kaggle/kaggle-survey-2017,A big picture view of the state of data scienc...,Mark McDonald,markmcdonald,3692241,https://www.kaggle.com/kaggle/kaggle-survey-2017,2017-10-27T22:03:03.417Z,23057,False,...,,,,,,,,,,
2,635,alopez247/pokemon,(Almost) all Pokémon stats until generation 6:...,alopez247,alopez247,731777,https://www.kaggle.com/alopez247/pokemon,2017-03-05T15:01:26.013Z,10498,False,...,,,,,,,,,,
3,32132,kashnitsky/mlcourse,Open Machine Learning Course by OpenDataScience,Yury Kashnitsky,kashnitsky,53599525,https://www.kaggle.com/kashnitsky/mlcourse,2018-12-09T16:45:09.507Z,25561,False,...,,,,,,,,,,
4,654897,kaushil268/disease-prediction-using-machine-le...,Use Machine Learning and Deep Learning models ...,KAUSHIL268,kaushil268,30490,https://www.kaggle.com/kaushil268/disease-pred...,2020-05-15T03:58:44.15Z,2476,False,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3584,334537,sinjonny/student-projects,,,,,,,,0,...,,,,,,,,,,
3585,293689,partham/techgig-times-internet-gender-prediction,,,,,,,,0,...,,,,,,,,,,
3586,1418312,jimregan/dalaj-v10,,,,,,,,0,...,,,,,,,,,,
3587,776337,awsaf49/vip-cup-2020,,,,,,,,0,...,,,,,,,,,,


#### Example output

In [73]:
#results of query #1
output_df = search_output_dict[sample_key]

#results of query #2
metadata_df = metadata_dict[sample_key]

#result of merging datasets into "full" dataframe
full_df = df_dict[sample_key]

In [74]:
sample_df.head()

Unnamed: 0,datasetId,id,subtitle,creatorName,creatorUrl,totalBytes,url,lastUpdated,downloadCount,isPrivate,...,tags_9_scriptCount,tags_9_totalCount,tags_10_ref,tags_10_name,tags_10_description,tags_10_fullPath,tags_10_competitionCount,tags_10_datasetCount,tags_10_scriptCount,tags_10_totalCount
0,70947,kaggle/kaggle-survey-2018,The most comprehensive dataset available on th...,Paul Mooney,paultimothymooney,4405170,https://www.kaggle.com/kaggle/kaggle-survey-2018,2018-11-03T22:35:07.12Z,15400,False,...,,,,,,,,,,
1,2733,kaggle/kaggle-survey-2017,A big picture view of the state of data scienc...,Mark McDonald,markmcdonald,3692241,https://www.kaggle.com/kaggle/kaggle-survey-2017,2017-10-27T22:03:03.417Z,23057,False,...,,,,,,,,,,
2,635,alopez247/pokemon,(Almost) all Pokémon stats until generation 6:...,alopez247,alopez247,731777,https://www.kaggle.com/alopez247/pokemon,2017-03-05T15:01:26.013Z,10498,False,...,,,,,,,,,,
3,32132,kashnitsky/mlcourse,Open Machine Learning Course by OpenDataScience,Yury Kashnitsky,kashnitsky,53599525,https://www.kaggle.com/kashnitsky/mlcourse,2018-12-09T16:45:09.507Z,25561,False,...,,,,,,,,,,
4,654897,kaushil268/disease-prediction-using-machine-le...,Use Machine Learning and Deep Learning models ...,KAUSHIL268,kaushil268,30490,https://www.kaggle.com/kaushil268/disease-pred...,2020-05-15T03:58:44.15Z,2476,False,...,,,,,,,,,,


In [75]:
metadata_df.head()

Unnamed: 0,collaborators,data,datasetId,datasetSlug,description,id,id_no,isPrivate,keywords_0,keywords_1,...,collaborators_2_role,collaborators_2_username,collaborators_3_role,collaborators_3_username,collaborators_4_role,collaborators_4_username,collaborators_5_role,collaborators_5_username,collaborators_6_role,collaborators_6_username
0,[],[],635,pokemon,# Context With the rise of the popularity of...,alopez247/pokemon,635,0,arts and entertainment,games,...,,,,,,,,,,
1,[],[],32132,mlcourse,![](https://habrastorage.org/webt/ia/m9/zk/iam...,kashnitsky/mlcourse,32132,0,computer science,data visualization,...,,,,,,,,,,
2,[],[],654897,disease-prediction-using-machine-learning,### Context During the time when Machine Lear...,kaushil268/disease-prediction-using-machine-le...,654897,0,diseases,earth and nature,...,,,,,,,,,,
3,[],[],477512,online-shoppers-intention,**Data Set Information:** The dataset consist...,henrysue/online-shoppers-intention,477512,0,universities and colleges,arts and entertainment,...,,,,,,,,,,
4,[],[],1373456,phishing-dataset-for-machine-learning,### Context Anti-phishing refers to efforts t...,shashwatwork/phishing-dataset-for-machine-lear...,1373456,0,research,exploratory data analysis,...,,,,,,,,,,


In [76]:
full_df.head()

Unnamed: 0,datasetId,id,subtitle,creatorName,creatorUrl,totalBytes,url,lastUpdated,downloadCount,isPrivate,...,collaborators_2_role,collaborators_2_username,collaborators_3_role,collaborators_3_username,collaborators_4_role,collaborators_4_username,collaborators_5_role,collaborators_5_username,collaborators_6_role,collaborators_6_username
0,70947,kaggle/kaggle-survey-2018,The most comprehensive dataset available on th...,Paul Mooney,paultimothymooney,4405170,https://www.kaggle.com/kaggle/kaggle-survey-2018,2018-11-03T22:35:07.12Z,15400,False,...,,,,,,,,,,
1,2733,kaggle/kaggle-survey-2017,A big picture view of the state of data scienc...,Mark McDonald,markmcdonald,3692241,https://www.kaggle.com/kaggle/kaggle-survey-2017,2017-10-27T22:03:03.417Z,23057,False,...,,,,,,,,,,
2,635,alopez247/pokemon,(Almost) all Pokémon stats until generation 6:...,alopez247,alopez247,731777,https://www.kaggle.com/alopez247/pokemon,2017-03-05T15:01:26.013Z,10498,False,...,,,,,,,,,,
3,32132,kashnitsky/mlcourse,Open Machine Learning Course by OpenDataScience,Yury Kashnitsky,kashnitsky,53599525,https://www.kaggle.com/kashnitsky/mlcourse,2018-12-09T16:45:09.507Z,25561,False,...,,,,,,,,,,
4,654897,kaushil268/disease-prediction-using-machine-le...,Use Machine Learning and Deep Learning models ...,KAUSHIL268,kaushil268,30490,https://www.kaggle.com/kaushil268/disease-pred...,2020-05-15T03:58:44.15Z,2476,False,...,,,,,,,,,,
