# MLflow Runs

Unlike the previous notebook, this one demonstrates the new "Runs" feature in the Databricks interface.

### REQUIREMENT

Create an MLflow Experiment and define its experiment name and ID.

In [3]:
#Read information about our products and convert it to a pandas DataFrame
df = spark.read.parquet("/mnt/databricks-workshop-datasets/Contoso-retail/initech/productsFull")
pandasDF = df.toPandas()

In [4]:
#Look at the type of info we have
pandasDF.head()

# Let's apply TF-IDF to our product names to see if we can predict the price purely from them

<img src="https://cdn-images-1.medium.com/max/2000/1*q3qYevXqQOjJf6Pwdlx8Mw.png" />

https://en.wikipedia.org/wiki/Tf%E2%80%93idf
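To make the weighting concrete, here is a minimal pure-Python sketch of the smoothed TF-IDF variant that scikit-learn's `TfidfVectorizer` applies by default: `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row. The product names are made up for illustration, and the whitespace tokenizer is a simplification of scikit-learn's own tokenization.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document (smoothed idf, L2-normalized)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))

    # Smoothed inverse document frequency.
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}

    rows = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw term counts within this document
        weights = {term: tf[term] * idf[term] for term in tf}
        # L2-normalize the row; fall back to 1.0 for an empty document.
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({term: w / norm for term, w in weights.items()})
    return rows


# Hypothetical product names, not taken from the workshop dataset.
names = ["red mountain bike", "blue mountain bike", "red road helmet"]
weights = tfidf(names)
for name, row in zip(names, weights):
    print(name, {term: round(w, 3) for term, w in row.items()})
```

Terms that appear in fewer names (such as `blue`) receive a higher weight than terms shared by several names (such as `bike`), which is what lets a model pick up on distinguishing words.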

In [6]:
#Let's build a prediction based on the name of the product

from sklearn.feature_extraction.text import TfidfVectorizer
import pandas

#Create TFIDF 
v = TfidfVectorizer()
x = v.fit_transform(pandasDF['Name'])

#Add our values to the initial dataframe
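#Note: get_feature_names_out() requires scikit-learn >= 1.0; older versions use get_feature_names()
df1 = pandas.DataFrame(x.toarray(), columns=v.get_feature_names_out())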
res = pandas.concat([pandasDF, df1], axis=1)

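#Create our feature matrix and target
#This assumes the first 17 columns are the original product columns, so the TF-IDF features start at column 17
X = res.iloc[:, 17:]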
y = res['ListPrice']

In [7]:
from sklearn.ensemble import RandomForestRegressor
import mlflow
import mlflow.sklearn
import pandas
import numpy
from sklearn.model_selection import train_test_split

print("MLflow Version:",mlflow.version.VERSION)

with mlflow.start_run():

  #Split the data into training and testing sets
  trainX, testX, trainY, testY = train_test_split(X, y, test_size=0.33, random_state=42)
  
  #Define number of estimators and log them
  n_est = 1000
  mlflow.log_param("n_est", n_est)
  
  #Train the model
  rfsk = RandomForestRegressor(n_estimators = n_est, random_state = 42)
  rfsk.fit(trainX, trainY)

  #Make the predictions
  predictions = rfsk.predict(testX)
  predDF = pandas.DataFrame(testY[:])
  predDF['pred'] = predictions
  predDF['errors'] = predDF["ListPrice"] - predDF['pred']

  #Get standard deviation of errors
  stdv = predDF['errors'].std()

  #Mean absolute error of predictions
  mae = predDF['errors'].abs().mean()
  
  #Log to mlflow
  mlflow.log_metric("mae", mae)
  mlflow.log_metric("std", stdv)
  mlflow.sklearn.log_model(rfsk, "model1sklearn")

## Next Step

[5-01 Simple Pipeline without Delta]($../5-Delta/5-01 Simple Pipeline without Delta)

&copy; 2018 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.<br/>