# <span style="font-weight:bold; font-size: 3rem; color:#1EB182;"><img src="../../images/icon102.png" width="38px"></img> **Hopsworks Feature Store** </span>

# <span style="font-weight:bold; font-size: 3rem; color:#333;">How to Query from Federated Data Sources with Hopsworks Feature Query Service</span>

The aim of this tutorial is to create a unified view of features for the 100 most popular GitHub projects by joining public datasets on Snowflake ([GitHub Archive](https://app.snowflake.com/marketplace/listing/GZTSZAS2KJ3/cybersyn-inc-github-archive?search=software&categorySecondary=%5B%2213%22%5D)), BigQuery ([deps.dev](https://console.cloud.google.com/marketplace/product/bigquery-public-data/deps-dev?hl=en)), and Hopsworks. We will create a feature group for each of these sources and then combine them in a single feature view that exposes all features together, regardless of their source. We then use the view to create training data for a model predicting the code coverage of GitHub projects.

## Prerequisites:
* To follow this tutorial you can sign up for the [Hopsworks Free Tier](https://app.hopsworks.ai/) or use your own Hopsworks installation. You also need access to Snowflake and BigQuery, both of which offer free trials: [Snowflake Free Trial](https://signup.snowflake.com/?utm_source=google&utm_medium=paidsearch&utm_campaign=em-se-en-brand-trial-exact&utm_content=go-rsa-evg-ss-free-trial&utm_term=c-g-snowflake%20trial-e&_bt=591349674928&_bk=snowflake%20trial&_bm=e&_bn=g&_bg=129534995484&gclsrc=aw.ds&gad_source=1&gclid=EAIaIQobChMI0eeI-rPrggMVOQuiAx3WfgzdEAAYASAAEgIwS_D_BwE), [Google Cloud Free Tier](https://cloud.google.com/free?hl=en). If you use your own Hopsworks installation, it must run Hopsworks version 3.5 or above, and you must be the Data Owner/Author of a project. The Hopsworks Feature Query Service must also be enabled on the cluster, which can only be done during [cluster creation](https://docs.hopsworks.ai/3.5/setup_installation/common/arrow_flight_duckdb/).

## <span style='color:#ff5f27'> Gain access to the datasets

* Add the [GitHub Archive](https://app.snowflake.com/marketplace/listing/GZTSZAS2KJ3/cybersyn-inc-github-archive?search=software&categorySecondary=%5B%2213%22%5D) dataset to your Snowflake account
* The BigQuery dataset [deps.dev](https://console.cloud.google.com/marketplace/product/bigquery-public-data/deps-dev?hl=en) is readable by default

## <span style='color:#ff5f27'> Set up the Snowflake and BigQuery storage connectors in Hopsworks

Hopsworks manages the connection to Snowflake and BigQuery through storage connectors. Follow the [Storage Connector Guides](https://docs.hopsworks.ai/3.5/user_guides/fs/storage_connector/) to configure storage connectors for Snowflake and BigQuery and name them **Snowflake** and **BigQuery**.

## <span style='color:#ff5f27'> Dependencies

In [None]:
# Quote the version specifiers so the shell does not treat ">=" as an output redirection
!pip install "hopsworks>=3.5.0rc1" "hsfs>=3.5.0rc2" --quiet

## <span style='color:#ff5f27'> Connect to Hopsworks

In [None]:
import hopsworks
from hsfs.feature import Feature


project = hopsworks.login()
feature_store = project.get_feature_store()

snowflake = feature_store.get_storage_connector("Snowflake")
bigquery = feature_store.get_storage_connector("BigQuery") 

## <span style='color:#ff5f27'> Create an External Feature Group on Snowflake
We now create an external feature group that queries the [GitHub Archive](https://app.snowflake.com/marketplace/listing/GZTSZAS2KJ3/cybersyn-inc-github-archive?search=software&categorySecondary=%5B%2213%22%5D) dataset on Snowflake and returns the 100 repositories that received the most stars during the 365 days before Nov 13, 2023.

In [None]:
query_str = """
WITH latest_repo_name AS (
    SELECT repo_name,
           repo_id
    FROM cybersyn.github_repos
    QUALIFY ROW_NUMBER() OVER (PARTITION BY repo_id ORDER BY first_seen DESC) = 1
)
SELECT LOWER(repo.repo_name) as repo_name,
       SUM(stars.count) AS sum_stars
FROM cybersyn.github_stars AS stars
JOIN latest_repo_name AS repo
    ON (repo.repo_id = stars.repo_id)
WHERE stars.date >= DATEADD('day', -365, DATE('2023-11-13'))
GROUP BY repo.repo_name, repo.repo_id
ORDER BY sum_stars DESC NULLS LAST
LIMIT 100;"""

features = [
    Feature(name="repo_name",type="string"),
    Feature(name="sum_stars",type="int")
]

github_most_starts_fg = feature_store.create_external_feature_group(
    name="github_most_starts",
    version=1,
    description="The GitHub repos that got the most stars last year",
    primary_key=['repo_name'],
    query=query_str,
    storage_connector=snowflake,
    features=features
)

github_most_starts_fg.save()
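
The 365-day window in the query above can be checked locally. The following is a minimal sketch that mirrors the `DATEADD('day', -365, DATE('2023-11-13'))` expression with Python's standard library; it does not contact Snowflake, and the variable names are illustrative only.

```python
from datetime import date, timedelta

# Reference date used in the Snowflake query above.
reference_date = date(2023, 11, 13)

# DATEADD('day', -365, ...) moves the reference date back by 365 days.
window_start = reference_date - timedelta(days=365)

print(window_start.isoformat())  # 2022-11-13
```

Stars dated on or after this day are included in `sum_stars`.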

After creating the external feature group on Snowflake, we can query it from the notebook using the Hopsworks Feature Query Service:

In [None]:
github_most_starts_df = github_most_starts_fg.read()
github_most_starts_df.head()
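
Because the query lowercases `repo_name` but groups by both `repo_name` and `repo_id`, it may be worth confirming that the primary key is unique in the returned data. The sketch below uses a hypothetical helper, `find_duplicate_keys`, and made-up sample names; in the notebook it could be called as `find_duplicate_keys(github_most_starts_df['repo_name'])`.

```python
from collections import Counter


def find_duplicate_keys(keys):
    """Return the sorted list of key values that occur more than once."""
    counts = Counter(keys)
    return sorted(key for key, n in counts.items() if n > 1)


# Made-up sample values, not real query output.
sample = ["owner/a", "owner/b", "owner/a"]
duplicates = find_duplicate_keys(sample)
print(duplicates)  # ['owner/a']
```

An empty list means every `repo_name` appears exactly once.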

## <span style='color:#ff5f27'> Create an External Feature Group on BigQuery

We now create an external feature group on BigQuery containing the licenses, number of forks, and number of open issues from the deps.dev dataset. To keep query costs low, we restrict the query to the 100 repositories from the `github_most_starts` feature group:

In [None]:
repos_quoted = github_most_starts_df['repo_name'].map(lambda r: f"'{r}'").tolist()
repos_quoted[0:5]
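
Since the quoted names are interpolated directly into the SQL text, one defensive option is to validate them first. The following is a sketch only: the helper `quote_repo_names` and the `REPO_NAME` pattern are assumptions about what an `owner/repo` name looks like, not part of Hopsworks or BigQuery.

```python
import re

# Assumed shape of an "owner/repo" name: letters, digits, '.', '_' and '-'.
REPO_NAME = re.compile(r"[A-Za-z0-9._-]+/[A-Za-z0-9._-]+")


def quote_repo_names(names):
    """Single-quote each name, rejecting anything outside the assumed pattern."""
    quoted = []
    for name in names:
        if not REPO_NAME.fullmatch(name):
            raise ValueError(f"unexpected repository name: {name!r}")
        quoted.append(f"'{name}'")
    return quoted


quoted = quote_repo_names(["facebook/react", "microsoft/vscode"])
print(",".join(quoted))  # 'facebook/react','microsoft/vscode'
```

A name containing a quote character raises `ValueError` instead of ending up in the query string.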

In [None]:
query_str = f"""
SELECT
  Name as repo_name, Licenses as licenses, ForksCount as forks_count, OpenIssuesCount as open_issues_count
FROM
  `bigquery-public-data.deps_dev_v1.Projects`
WHERE
  TIMESTAMP_TRUNC(SnapshotAt, DAY) = TIMESTAMP("2023-11-13")
  AND
  Type = 'GITHUB'
  AND Name IN ({','.join(repos_quoted)})
 """

features = [
    Feature(name="repo_name",type="string"),
    Feature(name="licenses",type="string"),
    Feature(name="forks_count",type="int"),
    Feature(name="open_issues_count",type="int")
]

github_info_fg = feature_store.create_external_feature_group(
    name="github_info",
    version=1,
    description="Information about GitHub project licenses, forks count and open issues count",
    primary_key=['repo_name'],
    query=query_str,
    storage_connector=bigquery,
    features=features
)

github_info_fg.save()

After creating the external feature group on BigQuery, we can query it from the notebook using the Hopsworks Feature Query Service:

In [None]:
github_info_df = github_info_fg.read()
github_info_df.head()

## <span style='color:#ff5f27'> Create a Feature Group on Hopsworks

To show that data on Snowflake and BigQuery can be queried together with data stored in Hopsworks, we now generate a synthetic dataset of code coverage values for the GitHub repositories and insert it into a feature group on Hopsworks:

In [None]:
import random
import pandas as pd

repos = github_most_starts_df['repo_name'].tolist()

numbers = [random.uniform(0, 1) for _ in range(len(repos))]
coverage_df = pd.DataFrame(list(zip(repos, numbers)),
               columns =['repo_name', 'code_coverage'])

coverage_fg = feature_store.create_feature_group(
    name="github_coverage",
    version=1,
    primary_key=['repo_name'],
)

coverage_fg.insert(coverage_df, write_options={"wait_for_job": True})
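
The synthetic coverage values change on every run because the global random generator is unseeded. If reproducible values are preferred, a seeded generator can be used instead. This is a minimal sketch; `make_coverage` and the default seed are hypothetical, and the repository names are placeholders.

```python
import random


def make_coverage(repos, seed=42):
    """Pair each repo with a pseudo-random coverage value in [0, 1]."""
    rng = random.Random(seed)  # private generator, leaves global state untouched
    return [(repo, rng.uniform(0, 1)) for repo in repos]


rows = make_coverage(["owner/a", "owner/b", "owner/c"])
# In the notebook, the rows could be turned into a DataFrame with:
# pd.DataFrame(rows, columns=['repo_name', 'code_coverage'])
```

Calling `make_coverage` twice with the same seed yields the same values.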

After creating the feature group, we can look at it:

In [None]:
coverage_fg.select_all().show(5)

## <span style='color:#ff5f27'> Create a Feature View joining all Feature Groups together

We now join the external feature groups on Snowflake and BigQuery with the feature group in Hopsworks into a single feature view. We mark the feature `code_coverage` as the label so that we can create training data in the next step:

In [None]:
query = (
    github_most_starts_fg.select_all()
    .join(github_info_fg.select_all(), join_type='left')
    .join(coverage_fg.select_all(), join_type='left')
)

feature_view = feature_store.create_feature_view(
    name='github_all_info',
    version=1,
    query=query,
    labels=['code_coverage']
)
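
To illustrate what `join_type='left'` means for the result, here is a plain-Python sketch of left-join semantics on small made-up rows. It is not how Hopsworks executes the join; the helper `left_join` is hypothetical and only shows that every row of the left side is kept, with missing values where the right side has no match.

```python
def left_join(left_rows, right_rows, key):
    """Left join two lists of dicts on `key`; unmatched right columns become None."""
    right_by_key = {row[key]: row for row in right_rows}
    right_columns = sorted({c for row in right_rows for c in row if c != key})
    joined = []
    for row in left_rows:
        match = right_by_key.get(row[key], {})
        merged = dict(row)
        for column in right_columns:
            merged[column] = match.get(column)
        joined.append(merged)
    return joined


# Made-up sample rows, not real query output.
stars = [
    {"repo_name": "owner/a", "sum_stars": 10},
    {"repo_name": "owner/b", "sum_stars": 7},
]
info = [{"repo_name": "owner/a", "forks_count": 3}]

result = left_join(stars, info, "repo_name")
print(result[1])  # {'repo_name': 'owner/b', 'sum_stars': 7, 'forks_count': None}
```

In the feature view, a repository missing from the BigQuery or Hopsworks feature group would similarly still appear, with empty values for the features from that source.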

We can query the feature view in the same way we query any other feature view, regardless of the data being spread across Snowflake, BigQuery and Hopsworks. The data will be queried directly from its source and joined using the Hopsworks Feature Query Service before being returned to Python:

In [None]:
data = feature_view.get_batch_data()
data.head()

## <span style='color:#ff5f27'> Create the training data from the Feature View

Finally, we can use the feature view to create training data that could be used to train a model predicting the code coverage of the GitHub repositories:

In [None]:
X_train, X_test, Y_train, Y_test = feature_view.train_test_split(test_size=0.2)
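
With 100 repositories, `test_size=0.2` should leave roughly 80 rows for training and 20 for testing. The sketch below shows a random 80/20 split on row indices using only the standard library. It is an illustration of the proportion, not the Hopsworks implementation; `split_indices` and its seed are hypothetical.

```python
import random


def split_indices(n_rows, test_size=0.2, seed=0):
    """Shuffle row indices and split them into (train, test) lists."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = round(n_rows * test_size)
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = split_indices(100, test_size=0.2)
print(len(train_idx), len(test_idx))  # 80 20
```

The shapes of `X_train` and `X_test` returned by the feature view can be compared against these counts.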