# Task description

For SNA Hackathon we collected logs of content from open groups that was shown in users' news feeds during February and March 2018. The last week and a half of March is held out as the test set. Each log record describes what was shown and to whom, as well as how the user reacted to the content: gave it a "Class" (the Odnoklassniki like), commented on it, ignored it, or hid it from the feed.

The goal is to rank the candidates for each user in the test set, placing those that will receive a "Class" as high as possible.

We usually offer a single task, but this time we decided to offer three at once. You do not have to solve all of them; one is enough. Because a user's feed mixes content of different types, ranking it calls for skills from several fields: computer vision, text processing, and recommender systems.

For the online stage we provide three datasets, each containing only one type of information: images, texts, or a variety of collaborative features.

Only at the second stage, when experts from the different fields come together, will the combined dataset be revealed, making it possible to find synergies between the methods.

Once the championship opens on the platform, you will see the task descriptions and be able to download the data needed to participate.

# Data description

The data is provided in Apache Parquet, the primary data format of the Spark framework. To work with this format from Python, we recommend the Apache Arrow library. To make things easier to follow, baselines are published in a GitHub repository. Feel free to use them!

In the training set the data is laid out by day, and each day is split into 6 parts by user ID (a given user always falls into the same part). This layout lets participants analyze only selected days and/or subsets of users instead of all the data at once.

The training sets are split into three non-overlapping groups: texts, images, and collaborative features. In every group the data contains the following fields:

* instanceId_userId: user identifier (anonymized);
* instanceId_objectType: object type;
* instanceId_objectId: object identifier (anonymized);
* feedback: array of the user's reaction types (the presence of the Liked token in the array means the object received a "Class" from the user);
* audit_clientType: type of platform the user came from;
* audit_timestamp: time when the feed was built;
* metadata_ownerId: author of the shown object (anonymized);
* metadata_createdAt: creation date of the shown object.


For objects in the text training set, the associated texts are additionally provided in Apache Parquet format:

* objectId: object identifier;
* lang: language of the text (as determined by the Odnoklassniki language detector);
* text: raw text associated with the object;
* preprocessed: array of tokens obtained after punctuation filtering and stemming.



The image-ranking data additionally contains an array field ImageId with the MD5 hashes of the images associated with each object. The image bodies are packed into separate tar files according to the first character of the hash.


The collaborative-features block provides a variety of additional information:

* audit_*: extended information about the context in which the feed was built;
* metadata_*: extended information about the object itself;
* userOwnerCounters_*: information about previous interactions of the user with the content author;
* ownerUserCounters_*: information about previous interactions of the content author with the user;
* membership_*: information about the user's membership in the group where the content was published;
* user_*: detailed information about the user;
* auditweights_*: a large number of runtime features extracted by the current system.

The test sets have the same structure as the training sets, except that they are not laid out by day and do not contain the feedback field.

# Evaluation

Participants must sort the feed so that objects with a high probability of receiving a "Class" end up at the top. Sorting is done separately for each user, after which the submission text is formed as follows (the format matches an export of a Pandas DataFrame with columns of type int and int[]):

* User_id_1,"[object_id_1_1, object_id_1_2]"
* User_id_2,"[object_id_2_1, object_id_2_2, object_id_2_3]"
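
A minimal sketch of producing lines in this shape from scored candidates, with users in ascending ID order and each user's objects ordered from most to least relevant. It is plain Scala rather than Spark; `SubmissionSketch`, `Scored` and `submissionLines` are illustrative names that do not come from the competition materials, and the score is assumed to be any model output where a higher value means a more likely "Class".

```scala
object SubmissionSketch {
  // One candidate: who saw it, what it was, and the model's relevance score (assumed: higher is better).
  final case class Scored(userId: Int, objectId: Int, score: Double)

  def submissionLines(rows: Seq[Scored]): Seq[String] =
    rows
      .groupBy(_.userId)
      .toSeq
      .sortBy { case (userId, _) => userId } // users in ascending ID order
      .map { case (userId, userRows) =>
        // Most relevant object first.
        val ranked = userRows.sortBy(r => -r.score).map(_.objectId)
        userId.toString + ",\"" + ranked.mkString("[", ", ", "]") + "\""
      }
}
```

Calling `SubmissionSketch.submissionLines(rows).foreach(println)` prints one line per user in the quoted-list shape shown above.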


The submission must contain a line for every user in the test set, with lines sorted by ascending user ID. The objects for each user must be sorted by descending relevance. When a submission is scored, a personal ROC AUC is computed for each user; these values are then averaged over all users and multiplied by 100.
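
The metric can be sketched in plain Scala as follows. This is an illustration under stated assumptions, not the organizers' scoring code: `PerUserAuc`, `Item` and `meanAuc100` are made-up names, ties are counted as half a correct pair, and users whose items are all liked or all not liked are skipped because ROC AUC is undefined for them (how the real scorer treats such users is not stated in the description).

```scala
object PerUserAuc {
  // One shown item: the ranker's score and whether the user liked it.
  final case class Item(score: Double, liked: Boolean)

  // ROC AUC as the share of (liked, not liked) pairs ordered correctly; ties count as 0.5.
  // Returns None when only one class is present, since AUC is undefined there.
  def auc(items: Seq[Item]): Option[Double] = {
    val pos = items.filter(_.liked).map(_.score)
    val neg = items.filterNot(_.liked).map(_.score)
    if (pos.isEmpty || neg.isEmpty) None
    else {
      val wins = pos.iterator.flatMap { p =>
        neg.iterator.map { n =>
          if (p > n) 1.0 else if (p == n) 0.5 else 0.0
        }
      }.sum
      Some(wins / (pos.size.toDouble * neg.size))
    }
  }

  // Mean of the per-user AUC values, multiplied by 100.
  // Assumption: users with an undefined AUC are left out of the average.
  def meanAuc100(byUser: Map[Int, Seq[Item]]): Double = {
    val aucs = byUser.values.toSeq.flatMap(items => auc(items).toList)
    if (aucs.isEmpty) Double.NaN else 100.0 * aucs.sum / aucs.size
  }
}
```

The pairwise count is quadratic in the number of items per user, which is acceptable for feeds of a few dozen items; a rank-based formula would be the usual choice for long lists.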

# Some imports

In [1]:
import $ivy.`org.apache.spark::spark-sql:2.4.5`

import org.apache.log4j.{Level, Logger}
Logger.getLogger("org").setLevel(Level.OFF)

import org.apache.spark.sql._
import org.apache.spark.sql.functions._

val spark = NotebookSparkSession
    .builder()
    .master("local[*]")
    .getOrCreate()

Loading spark-stubs
Getting spark JARs
Creating SparkSession

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties

spark: SparkSession = org.apache.spark.sql.SparkSession@2451efa6

In [2]:
import $ivy.`org.plotly-scala::plotly-almond:0.7.6`
import plotly._, plotly.element._, plotly.layout._, plotly.Almond._

repl.pprinter() = repl.pprinter().copy(defaultHeight = 3)

In [3]:
def sc = spark.sparkContext
// Reuse the session's SQLContext; the SQLContext(sc) constructor is deprecated since Spark 2.0.
val sqlContext = spark.sqlContext

defined function sc
sqlContext: SQLContext = org.apache.spark.sql.SQLContext@289048e0

In [4]:
import spark.implicits._
// val df = sqlContext.read.parquet("data/test/part-00000-6d949390-48b0-4104-a477-39e306b726c5-c000.gz.parquet")
// df.count()

In [5]:
import $ivy.`org.apache.spark::spark-mllib:2.4.5`

import org.apache.spark.ml.linalg.{Matrix, Vectors}
import org.apache.spark.ml.stat.Correlation
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.classification.LogisticRegression

In [6]:
val df = sqlContext.read.parquet("train/")

df: DataFrame = [instanceId_userId: int, instanceId_objectType: string ... 167 more fields]

In [7]:
df.printSchema

root
 |-- instanceId_userId: integer (nullable = true)
 |-- instanceId_objectType: string (nullable = true)
 |-- instanceId_objectId: integer (nullable = true)
 |-- audit_pos: long (nullable = true)
 |-- audit_clientType: string (nullable = true)
 |-- audit_timestamp: long (nullable = true)
 |-- audit_timePassed: long (nullable = true)
 |-- audit_experiment: string (nullable = true)
 |-- audit_resourceType: long (nullable = true)
 |-- metadata_ownerId: integer (nullable = true)
 |-- metadata_ownerType: string (nullable = true)
 |-- metadata_createdAt: long (nullable = true)
 |-- metadata_authorId: integer (nullable = true)
 |-- metadata_applicationId: long (nullable = true)
 |-- metadata_numCompanions: integer (nullable = true)
 |-- metadata_numPhotos: integer (nullable = true)
 |-- metadata_numPolls: integer (nullable = true)
 |-- metadata_numSymbols: integer (nullable = true)
 |-- metadata_numTokens: integer (nullable = true)
 |-- metadata_numVideos: integer (nullable = true)
...

# Get info about dataframe

In [8]:
val n_rows = df.count()

n_rows: Long = 18286575L

In [9]:
val n_columns = df.columns.length

n_columns: Int = 169

In [10]:
df.select("audit_pos").describe().show()

+-------+-----------------+
|summary|        audit_pos|
+-------+-----------------+
|  count|         18286575|
|   mean|11.39937626373446|
| stddev|16.76051641364607|
|    min|                0|
|    max|              489|
+-------+-----------------+



In [11]:
df.head(5)

res10: Array[Row] = Array(
  [108,Post,18452434,0,MOB,1520194086477,10184811,XPRM-5386_G1,8,13680,GROUP_OPE...

In [12]:
df.columns.foreach(println)

instanceId_userId
instanceId_objectType
instanceId_objectId
audit_pos
audit_clientType
audit_timestamp
audit_timePassed
audit_experiment
audit_resourceType
metadata_ownerId
metadata_ownerType
metadata_createdAt
metadata_authorId
metadata_applicationId
metadata_numCompanions
metadata_numPhotos
metadata_numPolls
metadata_numSymbols
metadata_numTokens
metadata_numVideos
metadata_platform
metadata_totalVideoLength
metadata_options
relationsMask
userOwnerCounters_USER_FEED_REMOVE
userOwnerCounters_USER_PROFILE_VIEW
userOwnerCounters_VOTE_POLL
userOwnerCounters_USER_SEND_MESSAGE
userOwnerCounters_USER_DELETE_MESSAGE
userOwnerCounters_USER_INTERNAL_LIKE
userOwnerCounters_USER_INTERNAL_UNLIKE
userOwnerCounters_USER_STATUS_COMMENT_CREATE
userOwnerCounters_PHOTO_COMMENT_CREATE
userOwnerCounters_MOVIE_COMMENT_CREATE
userOwnerCounters_USER_PHOTO_ALBUM_COMMENT_CREATE
userOwnerCounters_COMMENT_INTERNAL_LIKE
userOwnerCounters_USER_FORUM_MESSAGE_CREATE
userOwnerCounters_PHOTO_MARK_CREATE
...

# About target

In [13]:
df.select("feedback").show(10)

+----------------+
|        feedback|
+----------------+
|         [Liked]|
|[Clicked, Liked]|
|         [Liked]|
|       [Ignored]|
|       [Ignored]|
|       [Ignored]|
|[Clicked, Liked]|
|         [Liked]|
|       [Ignored]|
|       [Ignored]|
+----------------+
only showing top 10 rows



In [14]:
df.select("feedback").schema

res13: types.StructType = StructType(
  StructField("feedback", ArrayType(StringType, true), true, {})
)

In [15]:
// Turn the feedback array into 0/1 indicator columns, one per reaction type.
val df_flags = df
  .withColumn("feedback_Liked_flag", array_contains($"feedback", "Liked").cast("Int"))
  .withColumn("feedback_Clicked_flag", array_contains($"feedback", "Clicked").cast("Int"))
  .withColumn("feedback_Viewed_flag", array_contains($"feedback", "Viewed").cast("Int"))

df_flags: DataFrame = [instanceId_userId: int, instanceId_objectType: string ... 170 more fields]

In [16]:
df_flags.select("instanceId_objectId", "feedback", "feedback_Liked_flag", "feedback_Clicked_flag", "feedback_Viewed_flag").show(25)

+-------------------+----------------+-------------------+---------------------+--------------------+
|instanceId_objectId|        feedback|feedback_Liked_flag|feedback_Clicked_flag|feedback_Viewed_flag|
+-------------------+----------------+-------------------+---------------------+--------------------+
|           18452434|         [Liked]|                  1|                    0|                   0|
|           31980032|[Clicked, Liked]|                  1|                    1|                   0|
|           33834009|         [Liked]|                  1|                    0|                   0|
|           25653019|       [Ignored]|                  0|                    0|                   0|
|           24024934|       [Ignored]|                  0|                    0|                   0|
|           25075524|       [Ignored]|                  0|                    0|                   0|
|           18132300|[Clicked, Liked]|                  1|                    1|  

# Top objects by likes

In [17]:
df_flags.groupBy("instanceId_objectId").sum("feedback_Liked_flag").orderBy(desc("sum(feedback_Liked_flag)")).show(false)

+-------------------+------------------------+
|instanceId_objectId|sum(feedback_Liked_flag)|
+-------------------+------------------------+
|535842             |1329                    |
|1282812            |868                     |
|603629             |822                     |
|19152905           |689                     |
|11300713           |660                     |
|1041333            |633                     |
|9458730            |569                     |
|38567725           |568                     |
|31009524           |560                     |
|35514331           |544                     |
|18354936           |528                     |
|39007803           |506                     |
|1004136            |490                     |
|340210             |478                     |
|21549312           |476                     |
|32819707           |467                     |
|31704618           |465                     |
|22311315           |465                     |
|23742340    

# Top objects by clicks

In [18]:
df_flags.groupBy("instanceId_objectId").sum("feedback_Clicked_flag").orderBy(desc("sum(feedback_Clicked_flag)")).show(false)

+-------------------+--------------------------+
|instanceId_objectId|sum(feedback_Clicked_flag)|
+-------------------+--------------------------+
|535842             |3529                      |
|1041333            |3017                      |
|9469235            |2735                      |
|1282812            |2536                      |
|603629             |2228                      |
|1026913            |1752                      |
|372612             |1646                      |
|9766501            |1568                      |
|1004136            |1541                      |
|340210             |1515                      |
|9458730            |1459                      |
|858942             |1399                      |
|803420             |1243                      |
|9501306            |1127                      |
|385538             |1051                      |
|9501615            |1022                      |
|432606             |998                       |
|9501042            

# Top objects by views

In [19]:
df_flags.groupBy("instanceId_objectId").sum("feedback_Viewed_flag").orderBy(desc("sum(feedback_Viewed_flag)")).show(false)

+-------------------+-------------------------+
|instanceId_objectId|sum(feedback_Viewed_flag)|
+-------------------+-------------------------+
|1375200            |170                      |
|1041333            |164                      |
|1282812            |119                      |
|603629             |114                      |
|535842             |106                      |
|9458730            |96                       |
|9469235            |90                       |
|1382438            |89                       |
|1026913            |81                       |
|1004136            |77                       |
|858942             |69                       |
|9448527            |58                       |
|790516             |57                       |
|372612             |55                       |
|340210             |51                       |
|2673314            |51                       |
|2184938            |50                       |
|608435             |50                 

# Some info about time

In [20]:
df.select("audit_timestamp").schema

res19: types.StructType = StructType(
  StructField("audit_timestamp", LongType, true, {})
)

In [21]:
df.select("audit_timestamp").show(10)

+---------------+
|audit_timestamp|
+---------------+
|  1520194086477|
|  1520113126655|
|  1520181187538|
|  1520149339921|
|  1520180595763|
|  1520160411706|
|  1520149355325|
|  1520149355325|
|  1520159055559|
|  1520180648026|
+---------------+
only showing top 10 rows



In [22]:
// audit_timestamp is in milliseconds; from_unixtime expects seconds, hence the division by 1000.
val df_time = df_flags
  .withColumn("audit_timestamp_unix", from_unixtime($"audit_timestamp" / 1000))
  .withColumn("hour", hour($"audit_timestamp_unix"))

df_time: DataFrame = [instanceId_userId: int, instanceId_objectType: string ... 172 more fields]

In [23]:
df_time.select("audit_timestamp_unix", "hour").show(10)

+--------------------+----+
|audit_timestamp_unix|hour|
+--------------------+----+
| 2018-03-04 20:08:06|  20|
| 2018-03-03 21:38:46|  21|
| 2018-03-04 16:33:07|  16|
| 2018-03-04 07:42:19|   7|
| 2018-03-04 16:23:15|  16|
| 2018-03-04 10:46:51|  10|
| 2018-03-04 07:42:35|   7|
| 2018-03-04 07:42:35|   7|
| 2018-03-04 10:24:15|  10|
| 2018-03-04 16:24:08|  16|
+--------------------+----+
only showing top 10 rows
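
The hours in this table correspond to UTC: the first timestamp, 1520194086477, is 2018-03-04 20:08:06 UTC. Spark's `from_unixtime` formats in the session time zone, so the same data could yield different hours on a machine configured differently. A minimal plain-Scala sketch of the same conversion with an explicit zone; `HourOfDay` and `hourAt` are illustrative names, and `Europe/Moscow` is only an example zone, since the source does not say where the users are located.

```scala
import java.time.{Instant, ZoneId, ZoneOffset}

object HourOfDay {
  // epochMillis is a timestamp in milliseconds since the Unix epoch, like audit_timestamp.
  def hourAt(epochMillis: Long, zone: ZoneId): Int =
    Instant.ofEpochMilli(epochMillis).atZone(zone).getHour
}

// First timestamp from the table above, read in two different zones.
val hourUtc = HourOfDay.hourAt(1520194086477L, ZoneOffset.UTC)
val hourMoscow = HourOfDay.hourAt(1520194086477L, ZoneId.of("Europe/Moscow"))
```

Here `hourUtc` is 20, matching the notebook output, while `hourMoscow` is 23. If the hour is used as a feature, fixing the zone explicitly keeps train and test consistent.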



In [24]:
df_time.select("audit_timestamp_unix", "hour").schema

res23: types.StructType = StructType(
  StructField("audit_timestamp_unix", StringType, true, {}),
...

In [25]:
val df_hist_act = df_time.groupBy("hour").count().orderBy("hour")
df_hist_act.show()

+----+-------+
|hour|  count|
+----+-------+
|   0| 190669|
|   1| 255999|
|   2| 353397|
|   3| 508045|
|   4| 662820|
|   5| 787754|
|   6| 831354|
|   7| 837955|
|   8| 843048|
|   9| 858546|
|  10| 889471|
|  11| 922062|
|  12| 972513|
|  13|1029619|
|  14|1122013|
|  15|1213270|
|  16|1276641|
|  17|1275757|
|  18|1180341|
|  19| 917899|
+----+-------+
only showing top 20 rows

df_hist_act: Dataset[Row] = [hour: int, count: bigint]

In [26]:
val tmp_hist = df_hist_act.select("hour", "count").collect() // pull the rows out of Spark
  .map(
    _.toSeq             // turn each Row into a sequence of values
      .map(_.toString)  // go through String first: the values are typed Any (Int and Long), so a direct conversion to Int does not work
      .map(_.toInt)     // now convert each value in the row to Int
      .toVector         // make each row a Vector, just in case
  )
  .toVector   // Vector again, because collect returns an Array
  .transpose  // rows become columns: Vector(hours, counts)

tmp_hist: Vector[Vector[Int]] = Vector(
  Vector(
...

In [27]:
val tmp_hour = tmp_hist(0)
val tmp_count = tmp_hist(1)

tmp_hour: Vector[Int] = Vector(
  0,
...
tmp_count: Vector[Int] = Vector(
  190669,
...

# Like plot

In [28]:
val df_hist_liked = df_time.where(col("feedback_Liked_flag") === 1).groupBy("hour").count().orderBy("hour").cache()

df_hist_liked: Dataset[Row] = [hour: int, count: bigint]

In [29]:
// hour is an int column and count is a bigint column, so read them with typed getters.
val tmp_hour_liked = df_hist_liked.select("hour").collect().map(_.getInt(0)).toVector
val tmp_count_liked = df_hist_liked.select("count").collect().map(_.getLong(0).toInt).toVector

tmp_hour_liked: Vector[Int] = Vector(
  0,
...
tmp_count_liked: Vector[Int] = Vector(
  35654,
...

In [30]:
// https://alexarchambault.github.io/plotly-scala
val data = Seq(
  Bar(
      tmp_hour_liked,
      tmp_count_liked
  )
)

val layout = Layout(
  title = "All liked bar plot"
)

plot(data, layout)

data: Seq[Bar] = List(
  Bar(
...
layout: Layout = Layout(
  Some("All liked bar plot"),
...
res29_2: String = "plot-decf436f-1349-4e9e-b5ba-a3f6e7fc4cb8"

# Click plot

In [31]:
val df_hist_clicked = df_time.where(col("feedback_Clicked_flag") === 1).groupBy("hour").count().orderBy("hour").cache()

df_hist_clicked: Dataset[Row] = [hour: int, count: bigint]

In [32]:
// hour is an int column and count is a bigint column, so read them with typed getters.
val tmp_hour_clicked = df_hist_clicked.select("hour").collect().map(_.getInt(0)).toVector
val tmp_count_clicked = df_hist_clicked.select("count").collect().map(_.getLong(0).toInt).toVector

tmp_hour_clicked: Vector[Int] = Vector(
  0,
...
tmp_count_clicked: Vector[Int] = Vector(
  21653,
...

In [33]:
// https://alexarchambault.github.io/plotly-scala
val data = Seq(
  Bar(
      tmp_hour_clicked,
      tmp_count_clicked
  )
)

val layout = Layout(
  title = "All clicked bar plot"
)

plot(data, layout)

data: Seq[Bar] = List(
  Bar(
...
layout: Layout = Layout(
  Some("All clicked bar plot"),
...
res32_2: String = "plot-c7d4e447-eca4-4a18-bf3c-d267e4aac393"

# View plot

In [34]:
val df_hist_viewed = df_time.where(col("feedback_Viewed_flag") === 1).groupBy("hour").count().orderBy("hour").cache()

df_hist_viewed: Dataset[Row] = [hour: int, count: bigint]

In [35]:
// hour is an int column and count is a bigint column, so read them with typed getters.
val tmp_hour_viewed = df_hist_viewed.select("hour").collect().map(_.getInt(0)).toVector
val tmp_count_viewed = df_hist_viewed.select("count").collect().map(_.getLong(0).toInt).toVector

tmp_hour_viewed: Vector[Int] = Vector(
  0,
...
tmp_count_viewed: Vector[Int] = Vector(
  1496,
...

In [36]:
// https://alexarchambault.github.io/plotly-scala
val data = Seq(
  Bar(
      tmp_hour_viewed,
      tmp_count_viewed
  )
)

val layout = Layout(
  title = "All viewed bar plot"
)

plot(data, layout)

data: Seq[Bar] = List(
  Bar(
...
layout: Layout = Layout(
  Some("All viewed bar plot"),
...
res35_2: String = "plot-b02ee180-02f4-428a-b38b-7a3e7c772928"

In [37]:
df_flags.select("feedback_Liked_flag", "feedback_Clicked_flag", "feedback_Viewed_flag").show(5)

+-------------------+---------------------+--------------------+
|feedback_Liked_flag|feedback_Clicked_flag|feedback_Viewed_flag|
+-------------------+---------------------+--------------------+
|                  1|                    0|                   0|
|                  1|                    1|                   0|
|                  1|                    0|                   0|
|                  0|                    0|                   0|
|                  0|                    0|                   0|
+-------------------+---------------------+--------------------+
only showing top 5 rows



# Feedback Liked correlation

In [26]:
val liked_flag_corr = scala.collection.mutable.Map[String,Double]()
val liked_flag_cov = scala.collection.mutable.Map[String,Double]()

liked_flag_corr: collection.mutable.Map[String, Double] = Map()
liked_flag_cov: collection.mutable.Map[String, Double] = Map()

In [None]:
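// Correlate every column that is not a string, string array or date with the like flag.
// Each stat.corr call makes its own pass over the data, so this loop is slow on wide tables.
for (elem <- df_flags.schema) {
    if ( (elem.dataType.toString != "StringType") && (elem.dataType.toString != "ArrayType(StringType,true)")
       && (elem.dataType.toString != "DateType") && (elem.name != "feedback_Liked_flag")
       )  {
        liked_flag_corr += elem.name -> df_flags.stat.corr("feedback_Liked_flag", elem.name)
    }
}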

In [40]:
val df_liked_flag_corr = liked_flag_corr.toSeq.toDF("name", "corr")
df_liked_flag_corr.show(200)

+--------------------+--------------------+
|                name|                corr|
+--------------------+--------------------+
|     audit_timestamp|-0.00196274844152...|
|auditweights_user...|                 NaN|
|ownerUserCounters...|                 NaN|
|auditweights_user...|                 NaN|
|auditweights_user...|-0.01057874993249...|
|auditweights_partSvd|                 NaN|
|   user_is_activated|-0.00178984943993...|
|   metadata_authorId|0.007233333042612959|
|userOwnerCounters...|                 NaN|
|auditweights_isRa...| 0.08935913247486381|
|           audit_pos|0.005763417569643012|
|ownerUserCounters...|                 NaN|
|ownerUserCounters...|                 NaN|
|auditweights_frie...|-0.01049097762314...|
|userOwnerCounters...|                 NaN|
|ownerUserCounters...|                 NaN|
|membership_status...|-0.04364947249430642|
|        owner_status|                 NaN|
| owner_change_datime|                 NaN|
|auditweights_svd_...| 0.0872707

df_liked_flag_corr: DataFrame = [name: string, corr: double]

# Feedback Clicked correlation

In [27]:
val clicked_flag_corr = scala.collection.mutable.Map[String,Double]()
val clicked_flag_cov = scala.collection.mutable.Map[String,Double]()

clicked_flag_corr: collection.mutable.Map[String, Double] = Map()
clicked_flag_cov: collection.mutable.Map[String, Double] = Map()

In [None]:
// Same loop as for likes, with the click flag as the reference column.
for (elem <- df_flags.schema) {
    if ( (elem.dataType.toString != "StringType") && (elem.dataType.toString != "ArrayType(StringType,true)")
       && (elem.dataType.toString != "DateType") && (elem.name != "feedback_Clicked_flag")
       )  {
        clicked_flag_corr += elem.name -> df_flags.stat.corr("feedback_Clicked_flag", elem.name)
    }
}

In [29]:
val df_clicked_flag_corr = clicked_flag_corr.toSeq.toDF("name", "corr")
df_clicked_flag_corr.show(200)

+--------------------+--------------------+
|                name|                corr|
+--------------------+--------------------+
|     audit_timestamp|-0.00271881167382...|
|auditweights_user...|                 NaN|
|ownerUserCounters...|                 NaN|
|auditweights_user...|                 NaN|
|auditweights_user...|0.045602424465965505|
|auditweights_partSvd|                 NaN|
|   user_is_activated|4.782788156491318...|
|   metadata_authorId|-0.00691360987401...|
|userOwnerCounters...|                 NaN|
|auditweights_isRa...|0.056966288112293234|
|           audit_pos|-0.00385700979785...|
|ownerUserCounters...|                 NaN|
|ownerUserCounters...|                 NaN|
|auditweights_frie...|0.021223367554136247|
|userOwnerCounters...|                 NaN|
|ownerUserCounters...|                 NaN|
|membership_status...|-0.02002750273713564|
|        owner_status|                 NaN|
| owner_change_datime|                 NaN|
|auditweights_svd_...|-2.8955866

df_clicked_flag_corr: DataFrame = [name: string, corr: double]

# Feedback Viewed correlation 

In [26]:
val view_flag_corr = scala.collection.mutable.Map[String,Double]()
val view_flag_cov = scala.collection.mutable.Map[String,Double]()

view_flag_corr: collection.mutable.Map[String, Double] = Map()
view_flag_cov: collection.mutable.Map[String, Double] = Map()

In [None]:
// Same loop as for likes, with the view flag as the reference column.
for (elem <- df_flags.schema) {
    if ( (elem.dataType.toString != "StringType") && (elem.dataType.toString != "ArrayType(StringType,true)")
       && (elem.dataType.toString != "DateType") && (elem.name != "feedback_Viewed_flag")
       )  {
        view_flag_corr += elem.name -> df_flags.stat.corr("feedback_Viewed_flag", elem.name)
    }
}

In [32]:
val df_view_flag_corr = view_flag_corr.toSeq.toDF("name", "corr")
df_view_flag_corr.show(200)

+--------------------+--------------------+
|                name|                corr|
+--------------------+--------------------+
|     audit_timestamp|-0.00417983697636...|
|auditweights_user...|                 NaN|
|ownerUserCounters...|                 NaN|
|auditweights_user...|                 NaN|
|auditweights_user...|0.005781136003325965|
|auditweights_partSvd|                 NaN|
|   user_is_activated|3.320926485636930...|
|   metadata_authorId|-0.04477702002512947|
|userOwnerCounters...|                 NaN|
|auditweights_isRa...|0.015134480580451286|
|           audit_pos|0.017989935921300836|
|ownerUserCounters...|                 NaN|
|ownerUserCounters...|                 NaN|
|auditweights_frie...|  0.0198370881648911|
|userOwnerCounters...|                 NaN|
|ownerUserCounters...|                 NaN|
|membership_status...| 0.02347187229860836|
|        owner_status|                 NaN|
| owner_change_datime|                 NaN|
|auditweights_svd_...|-0.0143091

df_view_flag_corr: DataFrame = [name: string, corr: double]

# More data insights:
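
A note on the NaN entries in the correlation tables above: they coincide with features whose covariance with the flag is reported as 0.0 in the tables below (for example auditweights_partSvd and owner_status). That pattern is what Pearson correlation gives for a column with zero variance, because the formula divides by the product of the standard deviations. Whether these columns really are constant in this dataset is an assumption; the truncated output does not confirm it. A minimal plain-Scala sketch, independent of Spark, where `PearsonSketch` and its methods are illustrative names:

```scala
object PearsonSketch {
  def mean(xs: Seq[Double]): Double = xs.sum / xs.size

  // Sample covariance with the n - 1 denominator.
  def cov(xs: Seq[Double], ys: Seq[Double]): Double = {
    require(xs.size == ys.size && xs.size > 1, "need two equally sized samples with at least two values")
    val mx = mean(xs)
    val my = mean(ys)
    xs.zip(ys).map { case (x, y) => (x - mx) * (y - my) }.sum / (xs.size - 1)
  }

  // Pearson correlation: cov(x, y) / (sd(x) * sd(y)).
  // A constant column has zero variance, so the result is 0.0 / 0.0 = NaN.
  def corr(xs: Seq[Double], ys: Seq[Double]): Double =
    cov(xs, ys) / math.sqrt(cov(xs, xs) * cov(ys, ys))
}

val liked = Seq(1.0, 0.0, 1.0, 0.0)
val constantFeature = Seq(5.0, 5.0, 5.0, 5.0)
val covWithConstant = PearsonSketch.cov(liked, constantFeature)   // zero covariance
val corrWithConstant = PearsonSketch.corr(liked, constantFeature) // NaN
```

Such columns carry no signal for a linear model and can be dropped before fitting; a quick `describe()` on them would show whether the standard deviation is in fact zero.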

# Feedback covariance

In [None]:
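// Covariance of every non-string, non-array, non-date column with the like flag.
// Unlike correlation, covariance depends on each feature's scale, so values are not comparable across columns.
for (elem <- df_flags.schema) {
    if ( (elem.dataType.toString != "StringType") && (elem.dataType.toString != "ArrayType(StringType,true)")
       && (elem.dataType.toString != "DateType") && (elem.name != "feedback_Liked_flag")
       )  {
        liked_flag_cov += elem.name -> df_flags.stat.cov("feedback_Liked_flag", elem.name)
    }
}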

In [34]:
val df_liked_flag_cov = liked_flag_cov.toSeq.toDF("name", "cov")
df_liked_flag_cov.show(200)

+--------------------+--------------------+
|                name|                 cov|
+--------------------+--------------------+
|     audit_timestamp|  -915773.9287044593|
|auditweights_user...|                 0.0|
|ownerUserCounters...|                 0.0|
|auditweights_user...|                 0.0|
|auditweights_user...|-0.00170523010704...|
|auditweights_partSvd|                 0.0|
|   user_is_activated|-5.98856362348141...|
|   metadata_authorId|   775.4579561169603|
|userOwnerCounters...|                 0.0|
|auditweights_isRa...|0.007159232425890721|
|           audit_pos|0.036795361359536025|
|ownerUserCounters...|                 0.0|
|ownerUserCounters...|                 0.0|
|auditweights_frie...|-3.48941473680328...|
|userOwnerCounters...|                 0.0|
|ownerUserCounters...|                 0.0|
|membership_status...|-1.16438193166170...|
|        owner_status|                 0.0|
| owner_change_datime|                 0.0|
|auditweights_svd_...| 0.0119947

df_liked_flag_cov: DataFrame = [name: string, cov: double]

In [None]:
// Same loop as for likes, with the click flag as the reference column.
for (elem <- df_flags.schema) {
    if ( (elem.dataType.toString != "StringType") && (elem.dataType.toString != "ArrayType(StringType,true)")
       && (elem.dataType.toString != "DateType") && (elem.name != "feedback_Clicked_flag")
       )  {
        clicked_flag_cov += elem.name -> df_flags.stat.cov("feedback_Clicked_flag", elem.name)
    }
}

In [36]:
val df_clicked_flag_cov = clicked_flag_cov.toSeq.toDF("name", "cov")
df_clicked_flag_cov.show(200)

+--------------------+--------------------+
|                name|                 cov|
+--------------------+--------------------+
|     audit_timestamp|  -1095736.876389146|
|auditweights_user...|                 0.0|
|ownerUserCounters...|                 0.0|
|auditweights_user...|                 0.0|
|auditweights_user...|0.006349508238355...|
|auditweights_partSvd|                 0.0|
|   user_is_activated|1.382263469251077...|
|   metadata_authorId|   -640.218443712178|
|userOwnerCounters...|                 0.0|
|auditweights_isRa...|0.003942293535770...|
|           audit_pos| -0.0212699880556395|
|ownerUserCounters...|                 0.0|
|ownerUserCounters...|                 0.0|
|auditweights_frie...|6.097536985056149E-4|
|userOwnerCounters...|                 0.0|
|ownerUserCounters...|                 0.0|
|membership_status...|-4.614735091499944E9|
|        owner_status|                 0.0|
| owner_change_datime|                 0.0|
|auditweights_svd_...|-3.4376448

df_clicked_flag_cov: DataFrame = [name: string, cov: double]

In [None]:
for (elem <- df_flags.schema) {
    if ( (elem.dataType.toString != "StringType") & (elem.dataType.toString != "ArrayType(StringType,true)")
       & (elem.dataType.toString != "DateType") & (elem.name != "feedback_Viewed_flag")
       )  {
        view_flag_cov += elem.name -> df_flags.stat.cov("feedback_Viewed_flag", elem.name)
    }
}

In [28]:
val df_view_flag_cov = view_flag_cov.toSeq.toDF("name", "cov")
df_view_flag_cov.show(200)

+--------------------+--------------------+
|                name|                 cov|
+--------------------+--------------------+
|     audit_timestamp| -491390.50286864804|
|auditweights_user...|                 0.0|
|ownerUserCounters...|                 0.0|
|auditweights_user...|                 0.0|
|auditweights_user...|2.348040632053031E-4|
|auditweights_partSvd|                 0.0|
|   user_is_activated|2.799685257722645E-7|
|   metadata_authorId| -1209.5358985658843|
|userOwnerCounters...|                 0.0|
|auditweights_isRa...|3.055194190154317...|
|           audit_pos|0.028939190823878195|
|ownerUserCounters...|                 0.0|
|ownerUserCounters...|                 0.0|
|auditweights_frie...|1.662487293093171...|
|userOwnerCounters...|                 0.0|
|ownerUserCounters...|                 0.0|
|membership_status...|1.5776402308577611E9|
|        owner_status|                 0.0|
| owner_change_datime|                 0.0|
|auditweights_svd_...|-4.9553903

df_view_flag_cov: DataFrame = [name: string, cov: double]
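
The covariances printed above depend on the scale of each column: `audit_timestamp` and `membership_status...` produce very large values mainly because the columns themselves are large. The following plain-Scala sketch (no Spark) illustrates the effect on made-up numbers and shows that a Pearson correlation is unaffected by rescaling. The object `CovVsCorr`, its methods, and the sample values are all illustrative assumptions, not part of the original notebook.

```scala
// Illustrative sketch: sample covariance vs. Pearson correlation on toy data.
object CovVsCorr {
  def mean(xs: Seq[Double]): Double = xs.sum / xs.size

  /** Sample covariance with an (n - 1) denominator. */
  def cov(xs: Seq[Double], ys: Seq[Double]): Double = {
    require(xs.size == ys.size && xs.size > 1, "need two samples of equal size > 1")
    val mx = mean(xs)
    val my = mean(ys)
    xs.zip(ys).map { case (x, y) => (x - mx) * (y - my) }.sum / (xs.size - 1)
  }

  /** Pearson correlation: covariance divided by both standard deviations. */
  def corr(xs: Seq[Double], ys: Seq[Double]): Double =
    cov(xs, ys) / math.sqrt(cov(xs, xs) * cov(ys, ys))

  def main(args: Array[String]): Unit = {
    val flag    = Seq(0.0, 1.0, 0.0, 1.0)      // hypothetical binary feedback flag
    val feature = Seq(1.0, 2.0, 3.0, 4.0)      // hypothetical numeric feature
    val scaled  = feature.map(_ * 1000.0)      // same feature in different units

    println(s"cov:  ${cov(flag, feature)} vs ${cov(flag, scaled)}")
    println(s"corr: ${corr(flag, feature)} vs ${corr(flag, scaled)}")
  }
}
```

Rescaling the feature multiplies the covariance by the same factor but leaves the correlation unchanged, so correlations are easier to compare across columns.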

# Difference in distributions between train and test
* To check whether a variable is distributed differently in train and test, we fit a logistic regression on that single variable, labelling train rows 0 and test rows 1. If the model's ROC AUC is close to 0.5, it cannot tell the two sets apart and the variable's distribution can be considered homogeneous. An AUC far from 0.5 signals a shift between train and test.
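
To make the criterion concrete, here is a minimal plain-Scala sketch (no Spark) that computes the ROC AUC of a raw feature value used directly as a train-vs-test score. It is not the notebook's method: the notebook fits a regularized logistic regression, whose AUC should match this rank-based value, or one minus it, only when the fitted coefficient is non-zero. The names `DriftAuc` and `auc` and the sample values are assumptions made for illustration.

```scala
// Plain-Scala sketch (no Spark). `DriftAuc` and `auc` are illustrative names.
object DriftAuc {

  /** ROC AUC of a single feature used as a train-vs-test score:
    * the share of (test, train) pairs where the test value is larger,
    * counting ties as one half. Quadratic in the sample size, so only
    * suitable for small samples. */
  def auc(train: Seq[Double], test: Seq[Double]): Double = {
    require(train.nonEmpty && test.nonEmpty, "both samples must be non-empty")
    val pairScores = for (t <- test; r <- train) yield {
      if (t > r) 1.0 else if (t == r) 0.5 else 0.0
    }
    pairScores.sum / (train.size.toLong * test.size)
  }

  def main(args: Array[String]): Unit = {
    // Same values in both sets: the feature cannot separate them.
    println(auc(Seq(1.0, 2.0, 3.0, 4.0), Seq(1.0, 2.0, 3.0, 4.0))) // 0.5
    // Every test value exceeds every train value, as with a timestamp.
    println(auc(Seq(1.0, 2.0, 3.0), Seq(10.0, 11.0, 12.0)))        // 1.0
  }
}
```

Values below 0.5 mean the feature tends to be smaller in test than in train, so it is the distance from 0.5 in either direction that indicates a shift.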

In [7]:
val df_test = sqlContext.read.parquet("test/")

df_test: DataFrame = [instanceId_userId: int, instanceId_objectType: string ... 166 more fields]

In [None]:
// Print the group prefix (the part before the first "_") of every non-string column
for (elem <- df.schema) {
    if (elem.dataType.toString != "StringType") {
        println(elem.name.split("_")(0))
    }
}

In [None]:
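import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.functions.lit

val audit_roc_train_test = scala.collection.mutable.Map[String,Double]()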
for (elem <- df.schema) {
    if ( (elem.dataType.toString != "StringType")
       & (elem.dataType.toString != "DateType") & (elem.name != "instanceId_userId")
       & (elem.name != "instanceId_objectId") & (elem.name != "metadata_ownerId")
       & (elem.name != "metadata_authorId") & (elem.name != "metadata_options")
       & (elem.name != "metadata_applicationId") & (elem.name != "relationsMask")
       & (elem.name.toString.split("_")(0) == "audit")
       )  { 
        println(elem.name)
        val tmp_df_train = df.select(elem.name).withColumn("label", lit(0))
        val tmp_df_test = df_test.select(elem.name).withColumn("label", lit(1))
        val tmp_df_union = tmp_df_train.union(tmp_df_test).cache()
        
        val assembler = new VectorAssembler().setInputCols(Array(elem.name)).setOutputCol("features")
        val output = assembler.transform(tmp_df_union)
        
        // Fit a lightly regularized logistic regression on the single feature
        val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.01).setElasticNetParam(0.01)
        val lrModel = lr.fit(output.select("features", "label"))

        // Record the ROC AUC on the fitted data, then release the cached union
        val trainingSummary = lrModel.binarySummary
        audit_roc_train_test += elem.name -> trainingSummary.areaUnderROC
        tmp_df_union.unpersist()
    }
}

In [9]:
val df_audit_roc_train_test = audit_roc_train_test.toSeq.toDF("name", "auc")
df_audit_roc_train_test.show(200)

+------------------+------------------+
|              name|               auc|
+------------------+------------------+
|   audit_timestamp|0.9997749792361402|
|         audit_pos|0.5191852809372661|
|  audit_timePassed|0.4738613106710364|
|audit_resourceType|0.5152863106330348|
+------------------+------------------+



df_audit_roc_train_test: DataFrame = [name: string, auc: double]

In [None]:
val metadata_roc_train_test = scala.collection.mutable.Map[String,Double]()
for (elem <- df.schema) {
    if ( (elem.dataType.toString != "StringType") 
       & (elem.dataType.toString != "DateType") & (elem.name != "instanceId_userId")
       & (elem.name != "instanceId_objectId") & (elem.name != "metadata_ownerId")
       & (elem.name != "metadata_authorId") & (elem.name != "metadata_options")
       & (elem.name != "metadata_applicationId") & (elem.name != "relationsMask")
       & (elem.name.toString.split("_")(0) == "metadata")
       )  { 
        //view_flag_corr += elem.name -> df_flags.stat.corr("feedback_Viewed_flag", elem.name)
        println(elem.name)
        val tmp_df_train = df.select(elem.name).withColumn("label", lit(0))
        val tmp_df_test = df_test.select(elem.name).withColumn("label", lit(1))
        val tmp_df_union = tmp_df_train.union(tmp_df_test).cache()
        
        val assembler = new VectorAssembler().setInputCols(Array(elem.name)).setOutputCol("features")
        val output = assembler.transform(tmp_df_union)
        
        // Fit a lightly regularized logistic regression on the single feature
        val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.01).setElasticNetParam(0.01)
        val lrModel = lr.fit(output.select("features", "label"))

        // Record the ROC AUC on the fitted data, then release the cached union
        val trainingSummary = lrModel.binarySummary
        metadata_roc_train_test += elem.name -> trainingSummary.areaUnderROC
        tmp_df_union.unpersist()
    }
}

In [9]:
val df_metadata_roc_train_test = metadata_roc_train_test.toSeq.toDF("name", "auc")
df_metadata_roc_train_test.show(200)

+--------------------+-------------------+
|                name|                auc|
+--------------------+-------------------+
|metadata_totalVid...| 0.5083871163567828|
|  metadata_createdAt| 0.9997988288627346|
|   metadata_numPolls| 0.5052815299447426|
|  metadata_numVideos| 0.5117300441626862|
|  metadata_numPhotos|0.49416092042028514|
| metadata_numSymbols|  0.504941536035124|
|  metadata_numTokens| 0.5041447339114223|
|metadata_numCompa...| 0.5004224324403201|
+--------------------+-------------------+



df_metadata_roc_train_test: DataFrame = [name: string, auc: double]

In [None]:
val userOwnerCounters_roc_train_test = scala.collection.mutable.Map[String,Double]()
for (elem <- df.schema) {
    if ( (elem.dataType.toString != "StringType") 
       & (elem.dataType.toString != "DateType") & (elem.name != "instanceId_userId")
       & (elem.name != "instanceId_objectId") & (elem.name != "metadata_ownerId")
       & (elem.name != "metadata_authorId") & (elem.name != "metadata_options")
       & (elem.name != "metadata_applicationId") & (elem.name != "relationsMask")
       & (elem.name.toString.split("_")(0) == "userOwnerCounters")
       )  { 
        println(elem.name)
        val tmp_df_train = df.select(elem.name).withColumn("label", lit(0)).na.fill(0)
        val tmp_df_test = df_test.select(elem.name).withColumn("label", lit(1)).na.fill(0)
        val tmp_df_union = tmp_df_train.union(tmp_df_test).cache()
        
        val assembler = new VectorAssembler().setInputCols(Array(elem.name)).setOutputCol("features")
        val output = assembler.transform(tmp_df_union)
        
        // Fit a lightly regularized logistic regression on the single feature
        val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.01).setElasticNetParam(0.01)
        val lrModel = lr.fit(output.select("features", "label"))

        // Record the ROC AUC on the fitted data, then release the cached union
        val trainingSummary = lrModel.binarySummary
        userOwnerCounters_roc_train_test += elem.name -> trainingSummary.areaUnderROC
        tmp_df_union.unpersist()
    }
}

In [10]:
val df_userOwnerCounters_roc_train_test = userOwnerCounters_roc_train_test.toSeq.toDF("name", "auc")
df_userOwnerCounters_roc_train_test.show(200)

+--------------------+-------------------+
|                name|                auc|
+--------------------+-------------------+
|userOwnerCounters...|                0.5|
|userOwnerCounters...|                0.5|
|userOwnerCounters...| 0.5097951486811122|
|userOwnerCounters...|                0.5|
|userOwnerCounters...|                0.5|
|userOwnerCounters...|                0.5|
|userOwnerCounters...|                0.5|
|userOwnerCounters...|                0.5|
|userOwnerCounters...|                0.5|
|userOwnerCounters...|                0.5|
|userOwnerCounters...|                0.5|
|userOwnerCounters...| 0.5001834328968966|
|userOwnerCounters...| 0.5089637611668155|
|userOwnerCounters...| 0.5318797127438658|
|userOwnerCounters...| 0.5035311469187826|
|userOwnerCounters...|                0.5|
|userOwnerCounters...|                0.5|
|userOwnerCounters...|                0.5|
|userOwnerCounters...|                0.5|
|userOwnerCounters...|  0.499950024027785|
|userOwnerC

df_userOwnerCounters_roc_train_test: DataFrame = [name: string, auc: double]

In [None]:
val ownerUserCounters_roc_train_test = scala.collection.mutable.Map[String,Double]()
for (elem <- df.schema) {
    if ( (elem.dataType.toString != "StringType")
       & (elem.dataType.toString != "DateType") & (elem.name != "instanceId_userId")
       & (elem.name != "instanceId_objectId") & (elem.name != "metadata_ownerId")
       & (elem.name != "metadata_authorId") & (elem.name != "metadata_options")
       & (elem.name != "metadata_applicationId") & (elem.name != "relationsMask")
       & (elem.name.toString.split("_")(0) == "ownerUserCounters")
       )  { 
        println(elem.name)
        val tmp_df_train = df.select(elem.name).withColumn("label", lit(0)).na.fill(0)
        val tmp_df_test = df_test.select(elem.name).withColumn("label", lit(1)).na.fill(0)
        val tmp_df_union = tmp_df_train.union(tmp_df_test).cache()
        
        val assembler = new VectorAssembler().setInputCols(Array(elem.name)).setOutputCol("features")
        val output = assembler.transform(tmp_df_union)
        
        // Fit a lightly regularized logistic regression on the single feature
        val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.01).setElasticNetParam(0.01)
        val lrModel = lr.fit(output.select("features", "label"))

        // Record the ROC AUC on the fitted data, then release the cached union
        val trainingSummary = lrModel.binarySummary
        ownerUserCounters_roc_train_test += elem.name -> trainingSummary.areaUnderROC
        tmp_df_union.unpersist()
    }
}

In [9]:
val df_ownerUserCounters_roc_train_test = ownerUserCounters_roc_train_test.toSeq.toDF("name", "auc")
df_ownerUserCounters_roc_train_test.show(200)

+--------------------+---+
|                name|auc|
+--------------------+---+
|ownerUserCounters...|0.5|
|ownerUserCounters...|0.5|
|ownerUserCounters...|0.5|
|ownerUserCounters...|0.5|
|ownerUserCounters...|0.5|
|ownerUserCounters...|0.5|
|ownerUserCounters...|0.5|
|ownerUserCounters...|0.5|
|ownerUserCounters...|0.5|
|ownerUserCounters...|0.5|
|ownerUserCounters...|0.5|
|ownerUserCounters...|0.5|
|ownerUserCounters...|0.5|
|ownerUserCounters...|0.5|
|ownerUserCounters...|0.5|
|ownerUserCounters...|0.5|
|ownerUserCounters...|0.5|
|ownerUserCounters...|0.5|
|ownerUserCounters...|0.5|
|ownerUserCounters...|0.5|
|ownerUserCounters...|0.5|
|ownerUserCounters...|0.5|
|ownerUserCounters...|0.5|
|ownerUserCounters...|0.5|
|ownerUserCounters...|0.5|
|ownerUserCounters...|0.5|
|ownerUserCounters...|0.5|
+--------------------+---+



df_ownerUserCounters_roc_train_test: DataFrame = [name: string, auc: double]

In [None]:
val auditweights_roc_train_test = scala.collection.mutable.Map[String,Double]()
for (elem <- df.schema) {
    if ( (elem.dataType.toString != "StringType") 
       & (elem.dataType.toString != "DateType") & (elem.name != "instanceId_userId")
       & (elem.name != "instanceId_objectId") & (elem.name != "metadata_ownerId")
       & (elem.name != "metadata_authorId") & (elem.name != "metadata_options")
       & (elem.name != "metadata_applicationId") & (elem.name != "relationsMask")
       & (elem.name.toString.split("_")(0) == "auditweights")
       )  { 
        println(elem.name)
        val tmp_df_train = df.select(elem.name).withColumn("label", lit(0)).na.fill(0)
        val tmp_df_test = df_test.select(elem.name).withColumn("label", lit(1)).na.fill(0)
        val tmp_df_union = tmp_df_train.union(tmp_df_test).cache()
        
        val assembler = new VectorAssembler().setInputCols(Array(elem.name)).setOutputCol("features")
        val output = assembler.transform(tmp_df_union)
        
        // Fit a lightly regularized logistic regression on the single feature
        val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.01).setElasticNetParam(0.01)
        val lrModel = lr.fit(output.select("features", "label"))

        // Record the ROC AUC on the fitted data, then release the cached union
        val trainingSummary = lrModel.binarySummary
        auditweights_roc_train_test += elem.name -> trainingSummary.areaUnderROC
        tmp_df_union.unpersist()
    }
}

In [9]:
val df_auditweights_roc_train_test = auditweights_roc_train_test.toSeq.toDF("name", "auc")
df_auditweights_roc_train_test.show(200)

+--------------------+-------------------+
|                name|                auc|
+--------------------+-------------------+
|auditweights_user...|                0.5|
|auditweights_user...|                0.5|
|auditweights_partSvd|                0.5|
|auditweights_user...| 0.5174566606064765|
|auditweights_isRa...|   0.50745456609254|
|auditweights_frie...| 0.5012586371569485|
|auditweights_svd_...| 0.7239223510435374|
|auditweights_x_Ac...| 0.5142121153057541|
|  auditweights_ageMs| 0.5641927917081782|
|auditweights_frie...| 0.5455312654345361|
| auditweights_closed|                0.5|
|auditweights_ctr_...|0.49280467942401984|
|auditweights_user...| 0.5002178216372128|
|auditweights_numD...|  0.527474157261406|
|auditweights_user...|                0.5|
|auditweights_partAge|                0.5|
|auditweights_frie...| 0.5012585952787948|
|auditweights_ctr_...|0.47002066809821763|
|auditweights_dail...| 0.5622979091064119|
|auditweights_like...| 0.5207009990685785|
|auditweigh

df_auditweights_roc_train_test: DataFrame = [name: string, auc: double]

In [None]:
val owner_roc_train_test = scala.collection.mutable.Map[String,Double]()
for (elem <- df.schema) {
    if ( (elem.dataType.toString != "StringType") 
       & (elem.dataType.toString != "DateType") & (elem.name != "instanceId_userId")
       & (elem.name != "instanceId_objectId") & (elem.name != "metadata_ownerId")
       & (elem.name != "metadata_authorId") & (elem.name != "metadata_options")
       & (elem.name != "metadata_applicationId") & (elem.name != "relationsMask")
       & (elem.name.toString.split("_")(0) == "owner")
       )  { 
        println(elem.name)
        val tmp_df_train = df.select(elem.name).withColumn("label", lit(0)).na.fill(0)
        val tmp_df_test = df_test.select(elem.name).withColumn("label", lit(1)).na.fill(0)
        val tmp_df_union = tmp_df_train.union(tmp_df_test).cache()
        
        val assembler = new VectorAssembler().setInputCols(Array(elem.name)).setOutputCol("features")
        val output = assembler.transform(tmp_df_union)
        
        // Fit a lightly regularized logistic regression on the single feature
        val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.01).setElasticNetParam(0.01)
        val lrModel = lr.fit(output.select("features", "label"))

        // Record the ROC AUC on the fitted data, then release the cached union
        val trainingSummary = lrModel.binarySummary
        owner_roc_train_test += elem.name -> trainingSummary.areaUnderROC
        tmp_df_union.unpersist()
    }
}

In [9]:
val df_owner_roc_train_test = owner_roc_train_test.toSeq.toDF("name", "auc")
df_owner_roc_train_test.show(200)

+--------------------+---+
|                name|auc|
+--------------------+---+
|        owner_status|0.5|
| owner_change_datime|0.5|
|    owner_birth_date|0.5|
|     owner_is_active|0.5|
|   owner_create_date|0.5|
|        owner_region|0.5|
|    owner_is_deleted|0.5|
|    owner_ID_country|0.5|
|   owner_ID_Location|0.5|
|  owner_is_activated|0.5|
|owner_is_semiacti...|0.5|
|     owner_is_abused|0.5|
|        owner_gender|0.5|
+--------------------+---+



df_owner_roc_train_test: DataFrame = [name: string, auc: double]

In [None]:
val user_roc_train_test = scala.collection.mutable.Map[String,Double]()
for (elem <- df.schema) {
    if ( (elem.dataType.toString != "StringType")
       & (elem.dataType.toString != "DateType") & (elem.name != "instanceId_userId")
       & (elem.name != "instanceId_objectId") & (elem.name != "metadata_ownerId")
       & (elem.name != "metadata_authorId") & (elem.name != "metadata_options")
       & (elem.name != "metadata_applicationId") & (elem.name != "relationsMask")
       & (elem.name.toString.split("_")(0) == "user")
       )  { 
        println(elem.name)
        val tmp_df_train = df.select(elem.name).withColumn("label", lit(0)).na.fill(0)
        val tmp_df_test = df_test.select(elem.name).withColumn("label", lit(1)).na.fill(0)
        val tmp_df_union = tmp_df_train.union(tmp_df_test).cache()
        
        val assembler = new VectorAssembler().setInputCols(Array(elem.name)).setOutputCol("features")
        val output = assembler.transform(tmp_df_union)
        
        // Fit a lightly regularized logistic regression on the single feature
        val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.01).setElasticNetParam(0.01)
        val lrModel = lr.fit(output.select("features", "label"))

        // Record the ROC AUC on the fitted data, then release the cached union
        val trainingSummary = lrModel.binarySummary
        user_roc_train_test += elem.name -> trainingSummary.areaUnderROC
        tmp_df_union.unpersist()
    }
}

In [9]:
val df_user_roc_train_test = user_roc_train_test.toSeq.toDF("name", "auc")
df_user_roc_train_test.show(200)

+--------------------+------------------+
|                name|               auc|
+--------------------+------------------+
|   user_is_activated|0.5000212285529356|
|     user_birth_date|0.5036404909771568|
|      user_is_abused|0.5000454746590249|
|         user_gender| 0.512492314286315|
|user_is_semiactiv...|               0.5|
|  user_change_datime|0.5462780402293776|
|         user_status| 0.505587802371713|
|     user_is_deleted|               0.5|
|    user_create_date|0.5314741209672205|
|     user_ID_country|               0.5|
|    user_ID_Location|0.5042253339314723|
|         user_region|0.5023994555602083|
|      user_is_active|0.5000244337495369|
+--------------------+------------------+



df_user_roc_train_test: DataFrame = [name: string, auc: double]

In [None]:
val membership_roc_train_test = scala.collection.mutable.Map[String,Double]()
for (elem <- df.schema) {
    if ( (elem.dataType.toString != "StringType")
       & (elem.dataType.toString != "DateType") & (elem.name != "instanceId_userId")
       & (elem.name != "instanceId_objectId") & (elem.name != "metadata_ownerId")
       & (elem.name != "metadata_authorId") & (elem.name != "metadata_options")
       & (elem.name != "metadata_applicationId") & (elem.name != "relationsMask")
       & (elem.name.toString.split("_")(0) == "membership")
       )  { 
        println(elem.name)
        val tmp_df_train = df.select(elem.name).withColumn("label", lit(0)).na.fill(0)
        val tmp_df_test = df_test.select(elem.name).withColumn("label", lit(1)).na.fill(0)
        val tmp_df_union = tmp_df_train.union(tmp_df_test).cache()
        
        val assembler = new VectorAssembler().setInputCols(Array(elem.name)).setOutputCol("features")
        val output = assembler.transform(tmp_df_union)
        
        // Fit a lightly regularized logistic regression on the single feature
        val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.01).setElasticNetParam(0.01)
        val lrModel = lr.fit(output.select("features", "label"))

        // Record the ROC AUC on the fitted data, then release the cached union
        val trainingSummary = lrModel.binarySummary
        membership_roc_train_test += elem.name -> trainingSummary.areaUnderROC
        tmp_df_union.unpersist()
    }
}

In [11]:
val df_membership_roc_train_test = membership_roc_train_test.toSeq.toDF("name", "auc")
df_membership_roc_train_test.show(200)

+--------------------+------------------+
|                name|               auc|
+--------------------+------------------+
|membership_status...|0.5795486944417629|
|membership_joinRe...|0.5002262453015549|
| membership_joinDate|0.5797038225237044|
+--------------------+------------------+



df_membership_roc_train_test: DataFrame = [name: string, auc: double]
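
To read the per-group tables together, one option is to rank features by how far their AUC lies from 0.5. The sketch below does this in plain Scala for a handful of AUC values copied from the tables above; only columns whose names were printed in full are included. The object `DriftReport`, the method `shifted`, and the 0.05 threshold are illustrative assumptions, not something the notebook defines.

```scala
// Illustrative sketch: flag features whose train-vs-test AUC is far from 0.5.
object DriftReport {

  // AUC values copied from the printed tables (fully printed column names only).
  val aucByFeature: Map[String, Double] = Map(
    "audit_timestamp"     -> 0.9997749792361402,
    "audit_pos"           -> 0.5191852809372661,
    "audit_timePassed"    -> 0.4738613106710364,
    "metadata_createdAt"  -> 0.9997988288627346,
    "auditweights_ageMs"  -> 0.5641927917081782,
    "user_change_datime"  -> 0.5462780402293776,
    "membership_joinDate" -> 0.5797038225237044
  )

  /** Features whose AUC deviates from 0.5 by more than `threshold`,
    * paired with that deviation and sorted with the largest first. */
  def shifted(aucs: Map[String, Double], threshold: Double): Seq[(String, Double)] =
    aucs.toSeq
      .map { case (name, auc) => (name, math.abs(auc - 0.5)) }
      .filter { case (_, deviation) => deviation > threshold }
      .sortBy { case (_, deviation) => -deviation }

  def main(args: Array[String]): Unit =
    // The 0.05 threshold is an arbitrary choice for this example.
    shifted(aucByFeature, threshold = 0.05).foreach { case (name, deviation) =>
      println(f"$name%-22s $deviation%.4f")
    }
}
```

With these inputs, the two time-based columns (`metadata_createdAt` and `audit_timestamp`) come out on top, which is expected because the test set covers a later period than the training set.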