![](/files/images/dbxatscale.png)

# Overview
To use Databricks' full AI/ML pipeline with AtScale data, we need to move a small portion of the data into a table before starting our AutoML experiment. Databricks AutoML experiments require the input table to already exist in Databricks. Ours does; however, the data is spread across two fact tables, and the <b>Semantic Model</b> transformations we want are not applied to those underlying tables. To make those transformations available to the AutoML experiment, we need to execute the SQL query generated by AI Link and write its result as a new table.

Imports and AtScale connection: We need to create an AtScale connection to generate the database query that will define our new data table.

In [0]:
from atscale.client import Client
from atscale.data_model import DataModel
from atscale.project import Project
from atscale.connection import Connection
from atscale.eda.feature_engineering import *
from atscale.base import enums
from atscale.utils import db_utils

Our username and password are stored in an Azure secret scope, which we can read with the Databricks dbutils.secrets utility.

In [0]:
AILinkUser = dbutils.secrets.get(scope = "ai-link scope", key = "atscale-andrew-user")
AILinkPass = dbutils.secrets.get(scope = "ai-link scope", key = "atscale-andrew-pass")
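As a hedged sketch, the secret lookup above can be wrapped in a small helper that fails early with a readable message when a secret comes back empty. The helper name `load_credentials` and its parameters are hypothetical and not part of AI Link or Databricks. `get_secret` is assumed to be any callable that takes the same `scope` and `key` keyword arguments as `dbutils.secrets.get`, so the example substitutes a plain dictionary lookup with made-up values.

```python
def load_credentials(get_secret, scope, user_key, pass_key):
    """Return (username, password) fetched through get_secret.

    get_secret is assumed to accept scope= and key= keyword arguments,
    as dbutils.secrets.get does in the cell above.
    """
    values = []
    for key in (user_key, pass_key):
        value = get_secret(scope=scope, key=key)
        if not value:
            raise ValueError(f"Secret {key!r} in scope {scope!r} is missing or empty")
        values.append(value)
    return tuple(values)


# Stand-in secret store for illustration only; the names and values are made up.
_fake_store = {
    ("demo-scope", "user-key"): "demo_user",
    ("demo-scope", "pass-key"): "demo_pass",
}


def fake_get_secret(scope, key):
    """Mimic a secret getter by reading from the dictionary above."""
    return _fake_store.get((scope, key), "")


username, password = load_credentials(
    fake_get_secret, "demo-scope", "user-key", "pass-key"
)
```

In the notebook, passing `dbutils.secrets.get` as `get_secret`, together with the scope and key names from the cell above, would return the same two values that cell assigns.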

In [0]:
client = Client(
    server = "http://ailink-public.atscale.com",
    organization = "default",
    username = AILinkUser,
    password = AILinkPass)

In [0]:
client.connect()

In [0]:
project = client.select_project(name_contains="Walmart Databricks ML")
dm = project.select_data_model(name_contains= "m5_walmart_sales")

Please choose a project:
Automatically selecting only option: "ID: bfab4719-9a61-46f5-7d4c-1dcc2881ba44: Name: Walmart Databricks ML"
Please choose a published project:
Automatically selecting only option: "ID: 7f2140aa-8b8c-4517-6484-47432da7e337: Name: Walmart Databricks ML"
Please choose a data model:
Automatically selecting only option: "ID: 50fd61ba-2a0b-44d2-7dec-786e15c6ab6f: Name: m5_walmart_sales"


For more information on the dataset behind this demo, see <a href="https://www.kaggle.com/competitions/m5-forecasting-accuracy">the Kaggle competition page</a>.

# Problem Statement
Before we pull any data, let's discuss the model we are aiming to create, deploy, and use. For this demo, we are going to pretend our business team has asked us to create a model that predicts the type of an item based solely on its sales information. We want a model that sorts all of our sales into three categories: food items, hobby-related items, and household items. To do this, we will pull all numeric sales-related data, along with the item name so we can label our training data.

Using AI Link's get_all_categorical_feature_names and get_all_numeric_feature_names calls, we can retrieve lists of all categorical and all numeric features in our data model.

In [0]:
catfeats = dm.get_all_categorical_feature_names()
numfeats = dm.get_all_numeric_feature_names()

We then define our features as all of the numeric features, with three categorical features (date, store, and item) appended to the end. The complete feature list is printed below. Lastly, we build the SQL query for these features and save it as a string named query.

In [0]:
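# Copy the list so that appending does not also modify numfeats
features = list(numfeats)

# These positions are specific to this data model: they select date, store, and item
features.append(catfeats[2])
features.append(catfeats[-1])
features.append(catfeats[-3])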

print(features)
query = dm.get_database_query(feature_list = features)

['average_sales', 'average_units_sold', 'max_sales', 'max_units_sold', 'population_variance_sales', 'population_variance_units_sold', 'sample_standard_deviation_sales', 'sample_standard_deviation_units_sold', 'sample_variance_units_sold', 'total_categories', 'total_departments', 'total_items', 'total_sales', 'total_states', 'total_stores', 'total_transactions', 'total_units_sold', 'day_over_day_units_sold', 'previous_days_units_sold', 'total_sales_30_prd_mv_avg', 'total_units_sold_28_day_max', 'total_units_sold_30_prd_mv_avg', 'date', 'store', 'item']
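Picking the categorical features by position depends on the order in which the data model happens to return them. A minimal sketch of selecting them by name instead follows. `build_feature_list` is a hypothetical helper, not an AI Link function, and the short input lists are trimmed, partly invented stand-ins for the real `numfeats` and `catfeats`.

```python
def build_feature_list(numeric, categorical, wanted):
    """Return the numeric features followed by the requested categorical ones.

    Raises KeyError if a requested name is not in the categorical list,
    and leaves both input lists unmodified.
    """
    missing = [name for name in wanted if name not in categorical]
    if missing:
        raise KeyError(f"Not in the categorical feature list: {missing}")
    return list(numeric) + list(wanted)


# Stand-ins for dm.get_all_numeric_feature_names() and
# dm.get_all_categorical_feature_names(); the real lists are longer and
# the categorical names other than date, store, and item are assumptions.
numfeats_demo = ["average_sales", "total_units_sold"]
catfeats_demo = ["category", "department", "date", "item", "state", "store"]

features_demo = build_feature_list(
    numfeats_demo, catfeats_demo, ["date", "store", "item"]
)
print(features_demo)
```

Selecting by name keeps the feature list stable if a categorical feature is later added to or removed from the data model, and a misspelled name fails immediately instead of silently pulling the wrong column.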


Create our Spark dataframe from the query. With AI Link, we do not need to move data through AtScale; instead, we can programmatically generate a dataframe based solely on the SQL generated by the AtScale engine.

In [0]:
data = spark.sql(query)

Now let's convert our data to pandas so we can divide up our 14 items into our 3 categories.

In [0]:
df = data.toPandas()

We can now create a new column called item_type, which holds 1 for food items, 2 for hobby items, and 3 for household items.

In [0]:
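item_cat = []

# Count the rows that fall into each category
foodc = 0
hobc = 0
housec = 0

for index, row in df.iterrows():
  # The first three characters of the item name identify its category
  item = row["item"][0:3]
  if item == "FOO":
    item_cat.append(1)
    foodc += 1
  elif item == "HOB":
    item_cat.append(2)
    hobc += 1
  elif item == "HOU":
    item_cat.append(3)
    housec += 1
  else:
    # Fail early: a skipped row would leave item_cat shorter than df
    raise ValueError(f"Unexpected item prefix: {item}")

# Attach the labels, then drop the columns we do not want as model inputs
df["item_type"] = item_cat
df = df.drop(["item", "date", "store"], axis = 1)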

df.head(5)

Unnamed: 0,average_sales,average_units_sold,max_sales,max_units_sold,population_variance_sales,population_variance_units_sold,sample_standard_deviation_sales,sample_standard_deviation_units_sold,sample_variance_units_sold,total_categories,total_departments,total_items,total_sales,total_states,total_stores,total_transactions,total_units_sold,day_over_day_units_sold,previous_days_units_sold,total_sales_30_prd_mv_avg,total_units_sold_28_day_max,total_units_sold_30_prd_mv_avg,item_type
0,1.96,19.0,1.96,19.0,0.0,0.0,,,,1.0,1.0,1.0,1.96,1.0,1.0,1.0,19.0,0.0,,1.96,19.0,19,1
1,0.98,9.0,0.98,9.0,0.0,0.0,,,,1.0,1.0,1.0,0.98,1.0,1.0,1.0,9.0,0.0,,0.98,9.0,9,1
2,3.28,10.0,3.28,10.0,0.0,0.0,,,,1.0,1.0,1.0,3.28,1.0,1.0,1.0,10.0,0.0,,3.28,10.0,10,1
3,3.28,9.0,3.28,9.0,0.0,0.0,,,,1.0,1.0,1.0,3.28,1.0,1.0,1.0,9.0,0.0,,3.28,9.0,9,1
4,1.48,0.0,1.48,0.0,0.0,0.0,,,,1.0,1.0,1.0,1.48,1.0,1.0,1.0,0.0,0.0,,1.48,0.0,0,1
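The labelling loop can also be expressed as a lookup on the three-letter prefix. The following is a sketch under stated assumptions rather than the notebook's own method: `item_type` and `class_counts` are hypothetical helpers, and the sample item names are made up to resemble the `FOO`, `HOB`, and `HOU` prefixes the loop checks for.

```python
from collections import Counter

# Same encoding as the loop above: 1 = food, 2 = hobby, 3 = household
ITEM_TYPE_BY_PREFIX = {"FOO": 1, "HOB": 2, "HOU": 3}


def item_type(item_name):
    """Map an item name to its numeric category using its first three characters."""
    prefix = item_name[:3]
    if prefix not in ITEM_TYPE_BY_PREFIX:
        raise ValueError(f"Unexpected item prefix {prefix!r} in {item_name!r}")
    return ITEM_TYPE_BY_PREFIX[prefix]


def class_counts(item_names):
    """Count how many rows fall into each category."""
    return Counter(item_type(name) for name in item_names)


# Made-up item names for illustration only
sample_items = ["FOODS_1_001", "HOBBIES_1_001", "HOUSEHOLD_1_001", "FOODS_2_010"]
counts = class_counts(sample_items)
```

Applied to the dataframe before the item column is dropped, `df["item"].map(item_type)` would produce the same labels as the loop, and the counts give a quick view of class balance before training.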


# Write a table
Finally, let's write our data as a table so we can use it as an AutoML input.

In [0]:
table = spark.createDataFrame(df)
table.write.saveAsTable("hive_metastore.ai_link_data.m5_data_for_auto_ml", mode = "overwrite")

![](/files/images/atscale_logo.png)