## Overview

This notebook shows how to create and query a table or DataFrame from a file that you uploaded to DBFS. [DBFS](https://docs.databricks.com/user-guide/dbfs-databricks-file-system.html), the Databricks File System, lets you store data for querying inside Databricks. The notebook assumes that the file you want to read is already in DBFS.

This notebook is written in **Python**, so the default cell type is Python. You can switch a cell to another language with the `%LANGUAGE` syntax. Python, Scala, SQL, and R are all supported.

In [0]:
# File location and type
file_location = "/FileStore/tables/reviews-2.csv"
file_type = "csv"

# CSV options
infer_schema = "true"
first_row_is_header = "true"
delimiter = "\t"

# The applied options are for CSV files. For other file types, these will be ignored.
df = spark.read.format(file_type) \
  .option("inferSchema", infer_schema) \
  .option("header", first_row_is_header) \
  .option("sep", delimiter) \
  .load(file_location)

display(df)

reviews,Name,Platform,metacritic_score,Num_of_comments
"Just as a Nintendo 64 without 'Super Mario 64' was unthinkable, so too is a Wii without Wii Sports. It’s one of the most perfect launch games we’ve seen.",Wii Sports,wii,90,49
"In terms of control and immersion, Wii Sports destroys the competition and provides an excellent foundation for other developers to build upon.",Wii Sports,wii,90,49
Wii Sports is the perfect pack-in for the Wii since it really shows off the system’s unique capabilities and manages to draw in casual gamers who would otherwise not want to play videogames.,Wii Sports,wii,87,49
The game does a great job of showing off what the system is about and showcasing the system’s abilities.,Wii Sports,wii,87,49
"We can't help but enjoy the fact that we're getting a solid sports experience for nothing. It's definitely more fun in groups and won't have a great deal of longevity, but anyone who doesn't find at least some fun in Wii Sports has a heart of coal.",Wii Sports,wii,85,49
"Sure, it lacks goals and can be beaten very quickly. But the ultra-responsive technology and high multiplayer replay value are far greater than any other party or sports game collection.",Wii Sports,wii,85,49
"The single-player game is good for practice and daily tests, but Wii Sports shines as a multiplayer game. [Jan. 2007, p.96]",Wii Sports,wii,83,49
"It's bloody good fun and you'll enjoy playing it for ages. [January 2007, p.38]",Wii Sports,wii,82,49
"It's a game you can play with your friends and family, and it perfectly highlights the direction that Nintendo is taking with the platform. It's not the best-looking game on the system, and it's definitely not the deepest, but it may well be the most fun.",Wii Sports,wii,80,49
"This is a simple game about simple fun, and everyone owes it to themselves to try it out to get a taste for what the Wii may in fact have in store for us all very soon.",Wii Sports,wii,80,49


In [0]:
from pyspark.ml.feature import Tokenizer, RegexTokenizer, StopWordsRemover, HashingTF, IDF
from pyspark.sql.functions import col, udf, length
from pyspark.sql.types import IntegerType

# UDF that returns the number of tokens in an array column (not used in the cells below)
count_tokens = udf(lambda words: len(words), IntegerType())

In [0]:
# Add the character length of each review as an extra numeric feature
df = df.withColumn("length", length(col("reviews")))

In [0]:
# Text features: split into words, drop stop words, hash to term counts, then reweight by IDF
tokenizer = Tokenizer(inputCol="reviews", outputCol="words")
remover = StopWordsRemover(inputCol="words", outputCol="filtered")
hashing = HashingTF(inputCol="filtered", outputCol="hashed")
idf = IDF(inputCol="hashed", outputCol="tf_idf")
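
To make the `HashingTF` and `IDF` stages less of a black box, here is a minimal plain-Python sketch of the same idea. The function names, the 16-bucket size, and the use of `zlib.crc32` are assumptions for illustration only: Spark's `HashingTF` hashes with MurmurHash3 and defaults to 2^18 buckets. The IDF weight follows the smoothed form `log((m + 1) / (df + 1))`, where `m` is the number of documents and `df` the number of documents containing the bucket.

```python
import math
import zlib
from collections import Counter


def hashed_tf(tokens, num_features=16):
    """Count tokens per hash bucket (crc32 is a stand-in for Spark's MurmurHash3)."""
    return Counter(zlib.crc32(tok.encode("utf-8")) % num_features for tok in tokens)


def smoothed_idf(tf_docs):
    """Return log((m + 1) / (df + 1)) for every bucket seen in the corpus."""
    m = len(tf_docs)
    # Iterating a Counter yields each bucket once, so this counts documents, not tokens.
    doc_freq = Counter(bucket for tf in tf_docs for bucket in tf)
    return {bucket: math.log((m + 1) / (df + 1)) for bucket, df in doc_freq.items()}


# Hypothetical toy corpus, already tokenized and stop-word filtered.
toy_docs = [["wii", "sports", "fun"], ["wii", "sports", "perfect"], ["fun", "multiplayer"]]
toy_tf = [hashed_tf(doc) for doc in toy_docs]
toy_idf = smoothed_idf(toy_tf)
toy_tf_idf = [{b: count * toy_idf[b] for b, count in tf.items()} for tf in toy_tf]
```

Two different words can land in the same bucket, which is the price of hashing instead of keeping a vocabulary; a larger bucket count makes such collisions rarer.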


In [0]:
from pyspark.ml.feature import VectorAssembler

In [0]:
# Combine the TF-IDF vector and the review length into a single feature vector
clean_up = VectorAssembler(inputCols=["tf_idf", "length"], outputCol="features")

In [0]:
from pyspark.ml.regression import RandomForestRegressor
from pyspark.ml.feature import VectorIndexer
from pyspark.ml.evaluation import RegressionEvaluator

In [0]:
from pyspark.ml import Pipeline

In [0]:
data_prep_pipe = Pipeline(stages=[tokenizer, remover, hashing, idf, clean_up])

In [0]:
cleaner = data_prep_pipe.fit(df)

In [0]:
cleaned_data = cleaner.transform(df)

In [0]:
cleaned_data.columns

In [0]:
clean_data = cleaned_data.select("metacritic_score", "features")

In [0]:
clean_data = clean_data.withColumnRenamed("metacritic_score","label")

In [0]:
# Hold out roughly 30% of the rows for testing (the split is random, so sizes are approximate)
training, test = clean_data.randomSplit([0.7, 0.3])
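
The split above is random per row, so the 70/30 proportions are only approximate and change between runs unless a seed is passed. A rough plain-Python sketch of that behavior, under the assumption that each row gets one uniform draw and lands in the split whose cumulative weight range contains it (the helper name `bernoulli_split` is hypothetical, not a Spark API):

```python
import random


def bernoulli_split(rows, weights, seed=None):
    """Assign each row to one split using normalized cumulative weights."""
    total = sum(weights)
    bounds, acc = [], 0.0
    for w in weights:
        acc += w / total  # weights are normalized, so [7, 3] behaves like [0.7, 0.3]
        bounds.append(acc)

    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        u = rng.random()
        # First split whose upper bound exceeds u; the last split absorbs rounding error.
        index = next((i for i, b in enumerate(bounds) if u < b), len(bounds) - 1)
        splits[index].append(row)
    return splits


toy_train, toy_test = bernoulli_split(list(range(1000)), [0.7, 0.3], seed=42)
```

With the same seed the sketch returns the same partition every time, which is the reason to pass a seed when results need to be comparable across runs.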

In [0]:
rf = RandomForestRegressor(featuresCol="features")

In [0]:
model = rf.fit(training)

In [0]:
predictions = model.transform(test)

predictions.select("prediction","label","features").show(5)

In [0]:
evaluator = RegressionEvaluator(
    labelCol="label", predictionCol="prediction", metricName="rmse")
rmse = evaluator.evaluate(predictions)
print("Root Mean Squared Error (RMSE) on test data = %g" % rmse)
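
For reference, RMSE is the square root of the mean squared difference between label and prediction, so it is expressed in the same units as the label (Metacritic points here). A minimal plain-Python sketch, with a hypothetical helper name and made-up numbers:

```python
import math


def rmse_sketch(labels, predictions):
    """Root mean squared error over paired labels and predictions."""
    squared_errors = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Errors of 3 and -4 points give sqrt((9 + 16) / 2).
toy_rmse = rmse_sketch([90, 80], [87, 84])
```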


In [0]:
from pyspark.ml.regression import LinearRegression

In [0]:
# Elastic-net linear regression: regParam is the penalty strength, elasticNetParam the L1/L2 mix
lr = LinearRegression(featuresCol="features", labelCol="label", maxIter=10, regParam=0.3, elasticNetParam=0.8)
lr_model = lr.fit(training)
print("Coefficients: " + str(lr_model.coefficients))
print("Intercept: " + str(lr_model.intercept))
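
The `regParam` and `elasticNetParam` arguments control an elastic-net penalty on the coefficients. As a sketch, assuming the usual form `regParam * (elasticNetParam * ||w||_1 + (1 - elasticNetParam) / 2 * ||w||_2^2)` and ignoring the feature standardization Spark applies by default, the penalty for a given coefficient vector can be computed like this (the function name is hypothetical):

```python
def elastic_net_penalty(coefficients, reg_param, elastic_net_param):
    """Penalty term: reg_param * (alpha * L1 + (1 - alpha) / 2 * L2 squared)."""
    l1 = sum(abs(w) for w in coefficients)
    l2_squared = sum(w * w for w in coefficients)
    return reg_param * (elastic_net_param * l1 + (1 - elastic_net_param) / 2 * l2_squared)


# Same settings as the cell above, applied to a made-up coefficient vector.
toy_penalty = elastic_net_penalty([1.0, -2.0], reg_param=0.3, elastic_net_param=0.8)
```

With `elasticNetParam=0.8` most of the penalty is L1, which tends to push many coefficients to exactly zero. That is worth keeping in mind when reading the printed coefficient vector for a high-dimensional TF-IDF input.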


In [0]:
trainingSummary = lr_model.summary
print("RMSE: %f" % trainingSummary.rootMeanSquaredError)
print("r2: %f" % trainingSummary.r2)


In [0]:
lr_predictions = lr_model.transform(test)
lr_predictions.select("prediction", "label", "features").show(5)

# Evaluate R^2 on the held-out test set
lr_evaluator = RegressionEvaluator(
    predictionCol="prediction", labelCol="label", metricName="r2")


In [0]:
print("R Squared (R2) on test data = %g" % lr_evaluator.evaluate(lr_predictions))

test_result = lr_model.evaluate(test)
print("Root Mean Squared Error (RMSE) on test data = %g" % test_result.rootMeanSquaredError)
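
R² compares the model's squared error with that of a baseline that always predicts the mean label. A minimal plain-Python sketch, with a hypothetical helper name and made-up numbers:

```python
def r2_sketch(labels, predictions):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_label = sum(labels) / len(labels)
    ss_residual = sum((y - p) ** 2 for y, p in zip(labels, predictions))
    ss_total = sum((y - mean_label) ** 2 for y in labels)
    return 1 - ss_residual / ss_total


# Residual sum of squares 8 against a total of 50.
toy_r2 = r2_sketch([90, 80, 85], [88, 82, 85])
```

A value of 1 means perfect predictions and 0 means no better than predicting the mean. On held-out data the value can be negative, which signals a model that does worse than that baseline.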



In [0]:
print("numIterations: %d" % trainingSummary.totalIterations)
print("objectiveHistory: %s" % str(trainingSummary.objectiveHistory))
trainingSummary.residuals.show()