# Big Data Model and Evaluation

## Decision Tree

We fit a decision tree regressor of maximum depth 10 to predict `view_count` from the tokenized description and tags, together with other attributes such as `comments_disabled`.

First, import the modules and initialize the Spark session.

In [1]:
%load_ext autoreload
%autoreload 2

from predict.decision_tree import *
spark: SparkSession = SparkSession.builder.getOrCreate()

23/11/19 00:10:18 WARN Utils: Your hostname, KW resolves to a loopback address: 127.0.1.1; using 172.24.216.50 instead (on interface eth0)
23/11/19 00:10:18 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/11/19 00:10:19 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


Lazy-load the data into a DataFrame. The PySpark DataFrame API is similar to that of pandas, which keeps the code readable.

Then filter the rows and cast the columns to the correct types, since every column is read from CSV as a string. Tokenize the description and tag fields so the model can use them later.

In [2]:
file_list = [
    f"{ROOT_PATH}/{c}_youtube_trending_data_processed.csv" for c in COUNTRIES
]
df = spark.read.option("header", True).csv(file_list)
print(df)

df = filter_and_cast(df)
df = tokenize_description(df)
df = tokenize_tag(df)

DataFrame[video_id: string, title: string, publishedAt: string, channelId: string, channelTitle: string, categoryId: string, trending_date: string, tags: string, view_count: string, likes: string, dislikes: string, comment_count: string, thumbnail_link: string, comments_disabled: string, ratings_disabled: string, description: string, category_name: string]


[nltk_data] Downloading package punkt to /home/kw/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /home/kw/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
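The helpers `tokenize_description` and `tokenize_tag` live in `predict.decision_tree` and are not shown here. As a rough illustration of what such a step usually does, here is a minimal pure-Python sketch. The function name `tokenize`, the regular expression, and the tiny stopword set are assumptions made for this example; the actual helpers appear to rely on NLTK's `punkt` and `stopwords` data, as the log above suggests.

```python
import re

# Tiny illustrative stopword set (an assumption for this sketch);
# the notebook's helpers download NLTK's full stopword list instead.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "on", "for"}


def tokenize(text):
    """Lowercase the text, keep alphanumeric runs, and drop stopwords."""
    if not text:
        # Missing descriptions become an empty token list.
        return []
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in words if w not in STOPWORDS]


tokens = tokenize("The Best of Spark in 2023!")
```

In this sketch a missing description yields an empty list, which keeps later stages from having to handle null values.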


Construct the hashing transformers and the feature assembler, apply them to the data, and then fit the decision tree model.

In [3]:
hashing_text, hashing_tags, assembler = construct_hashing_assembler()
df = hashing_text.transform(df)
df = hashing_tags.transform(df)
df = assembler.transform(df)

regressor = DecisionTreeRegressor(
    maxDepth=10,
    maxMemoryInMB=512,
    featuresCol="features",
    labelCol="view_count",
    predictionCol="prediction",
)
model = regressor.fit(df)
print(model.featureImportances)

                                                                                

(42,[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41],[0.0018278573848525665,0.07464633900220487,0.12334376379985049,0.028529325879655464,0.031046733856399958,0.02460595751780358,0.008506165500510471,0.0151428801631184,0.016525395217440626,0.012582680834966888,0.020564151214719998,0.02549186190101141,0.013064011639899674,0.0051530214713479605,0.04920009146921148,0.003737898074440687,0.017573337674904443,0.018624275185159786,0.03895801335276766,0.016106560401864277,0.014763713640571502,0.0464948813974357,0.006854187144458909,0.004259017218670106,0.06680596816862805,0.028662138859354973,0.026826854802257362,0.016110971754540213,0.015344687830170508,0.013130711912632202,0.010088154180455182,0.04895480723521748,0.006158957042376181,0.027262245950435707,0.02274762327941597,0.010080178143945443,0.013879554288261486,0.01245376191055722,0.003289278176354053,0.023300910469035093,0.011321306283613653,0.025979768769482454])
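The feature vector above has a fixed length of 42 even though descriptions and tags have an open vocabulary, which is what the hashing trick makes possible. The sketch below shows the idea in plain Python. It is not the implementation behind `construct_hashing_assembler`: the function `hash_tokens`, the CRC32 hash, the bucket sizes, and the trailing flag are all assumptions chosen for illustration.

```python
import zlib


def hash_tokens(tokens, num_features):
    """Map a token list to a fixed-length count vector via the hashing trick."""
    vector = [0.0] * num_features
    for token in tokens:
        # CRC32 is a stand-in hash for this sketch; it is stable across runs,
        # unlike Python's built-in hash() for strings.
        index = zlib.crc32(token.encode("utf-8")) % num_features
        vector[index] += 1.0
    return vector


# Hypothetical bucket sizes; the notebook's real split of the 42 slots is not shown.
text_vec = hash_tokens(["spark", "tutorial", "spark"], 16)
tags_vec = hash_tokens(["bigdata"], 8)

# Assembling concatenates the hashed vectors with the remaining numeric columns,
# here a single hypothetical comments_disabled flag.
features = text_vec + tags_vec + [0.0]
```

Because distinct tokens can collide in the same bucket, a single entry of `featureImportances` refers to a hash bucket, not necessarily to one word.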


Different channels have vastly different numbers of subscribers, so absolute view counts are not comparable across channels. To evaluate the model more fairly, we do not compute the error on the view count itself, but on each video's rank within its own channel.

In [4]:
file_list = [
    f"{TEST_PATH}/{c}_youtube_trending_data_processed_test.csv" for c in COUNTRIES
]
# Note: only the first country's test file is evaluated here
df = spark.read.option("header", True).csv(file_list[0])

df = filter_and_cast(df)
df = tokenize_description(df)
df = tokenize_tag(df)

df = hashing_text.transform(df)
df = hashing_tags.transform(df)
df = assembler.transform(df)
df = model.transform(df)

# Eval based on rank instead of abs view count
df = df.withColumn(
    "expected",
    dense_rank().over(Window.partitionBy("channelId").orderBy("view_count")),
)
df = df.withColumn(
    "predicted",
    dense_rank().over(Window.partitionBy("channelId").orderBy("prediction")),
)
df = df.withColumns(
    {
        "expected": df["expected"].cast("double"),
        "predicted": df["predicted"].cast("double"),
    }
)

metrics = RegressionMetrics(df.select("expected", "predicted").rdd.map(tuple))
print(f"MAE: {metrics.meanAbsoluteError}, MSE: {metrics.meanSquaredError}")

                                                                                

MAE: 11.356571662907525, MSE: 380.21283575968835
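To make the rank-based metric concrete, the following pure-Python sketch reproduces the same idea on four hand-made rows: a dense rank per channel for both the true and the predicted view count, followed by MAE and MSE on the ranks. The helper names and the sample values are invented for illustration and do not come from the dataset.

```python
from collections import defaultdict


def dense_rank_within_group(rows, group_key, value_key):
    """Dense rank per row inside its group: 1-based, ties share a rank, no gaps."""
    values_by_group = defaultdict(set)
    for row in rows:
        values_by_group[row[group_key]].add(row[value_key])
    # Within each group, the sorted distinct values get ranks 1, 2, 3, ...
    rank_of = {
        group: {value: rank for rank, value in enumerate(sorted(values), start=1)}
        for group, values in values_by_group.items()
    }
    return [rank_of[row[group_key]][row[value_key]] for row in rows]


def mae_mse(expected, predicted):
    """Mean absolute error and mean squared error between two rank lists."""
    errors = [e - p for e, p in zip(expected, predicted)]
    n = len(errors)
    return sum(abs(d) for d in errors) / n, sum(d * d for d in errors) / n


# Made-up rows: channel A has three videos, channel B has one.
rows = [
    {"channelId": "A", "view_count": 100, "prediction": 250.0},
    {"channelId": "A", "view_count": 300, "prediction": 150.0},
    {"channelId": "A", "view_count": 200, "prediction": 250.0},
    {"channelId": "B", "view_count": 5000, "prediction": 900.0},
]
expected = dense_rank_within_group(rows, "channelId", "view_count")
predicted = dense_rank_within_group(rows, "channelId", "prediction")
mae, mse = mae_mse(expected, predicted)
print(f"MAE: {mae}, MSE: {mse}")  # prints "MAE: 0.75, MSE: 1.25"
```

One thing worth keeping in mind when reading the numbers above: a tree of depth 10 has at most 1024 leaves and therefore at most 1024 distinct predictions, so predicted ranks tie more often than expected ranks. With dense ranking, those ties compress the predicted rank range, which may inflate the error for channels with many videos.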


In [6]:
sub_df = df.limit(100).select(
    ["video_id", "title", "channelId", "channelTitle", "expected", "predicted"]
)
sub_df.write.csv(f"{PREDICT_OUTPUT}/decision_tree", header=True)

                                                                                