https://sparkbyexamples.com/pyspark-rdd/

# Lab 1. Computing movie rating statistics with RDD

## Deadline

Monday, March 1, 23:59:59

## Task

Using the available movie rating data (MovieLens: 100,000 ratings), compute aggregate statistics over the ratings.

## Data description

The following input data is available:

* A `users x movies` table with ratings. Download the dataset archive from the [GroupLens](http://files.grouplens.org/datasets/movielens/ml-100k.zip) site. It is also available on HDFS at `/labs/laba01/ml-100k`. The file `u.data` contains all the ratings, and the file `u.item` lists all the movies.

`!hdfs dfs -ls /labs/laba01/ml-100k`

* The movie `id` for computing the per-movie statistics is in your personal account, on the [Lab 1](https://lk-spark.newprolab.com/lab/slaba01) page.

## Result

The output file format is JSON. Example solution:

```json
{
   "hist_film": [  
      134,
      123,
      782,
      356,
      148
   ],
   "hist_all": [  
      134,
      123,
      782,
      356,
      148
   ]
}
```

In the `"hist_film"` field, give the number of ratings received by the movie with the assigned `id`, in this order: `"1", "2", "3", "4", "5"`. That is, how many ones, twos, threes, and so on the movie received.

In the `"hist_all"` field, give the same counts computed over all movies together, in the same order: `"1", "2", "3", "4", "5"`.
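
As a rough illustration of this format, the sketch below assembles the result from two rating-to-count mappings in plain Python. The helper name `build_result` is made up for the example, and the counts are the placeholder numbers from the sample above, not real results.

```python
import json

RATINGS = ["1", "2", "3", "4", "5"]  # required output order


def build_result(film_counts, all_counts):
    """Turn two {rating: count} dicts into the lab01.json structure."""
    return {
        "hist_film": [film_counts[r] for r in RATINGS],
        "hist_all": [all_counts[r] for r in RATINGS],
    }


# Placeholder counts, only to show the shape of the output.
placeholder = {"1": 134, "2": 123, "3": 782, "4": 356, "5": 148}
example = build_result(placeholder, placeholder)
print(json.dumps(example))
```

The lists are ordered by rating value, not by the order in which the dict keys happen to be stored.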

## Checking

Put the file in your home directory on the cluster under the name `lab01.json`.

The check is run by an automated script on the lab page in your personal account.

To get credit for the lab, you must also publish your solution to the repository via a pull request after the lab deadline. See [here](/git.md) for how to do this. If you have questions, ask in Slack.

# Starting PySpark

In [79]:
import os
import sys
os.environ["PYSPARK_PYTHON"]='/opt/anaconda/envs/bd9/bin/python'
os.environ["SPARK_HOME"]='/usr/hdp/current/spark2-client'
os.environ["PYSPARK_SUBMIT_ARGS"]='--num-executors 3 pyspark-shell'

spark_home = os.environ.get('SPARK_HOME', None)

sys.path.insert(0, os.path.join(spark_home, 'python'))
sys.path.insert(0, os.path.join(spark_home, 'python/lib/py4j-0.10.7-src.zip'))
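
The py4j archive name above hard-codes version 0.10.7, and that version differs between Spark releases. A minimal sketch that looks the archive up by pattern instead; `find_py4j_zip` is a hypothetical helper, and the `python/lib/py4j-*-src.zip` layout is assumed from the path used above.

```python
import glob
import os


def find_py4j_zip(spark_home):
    """Return the path of the py4j source zip under spark_home, or None."""
    pattern = os.path.join(spark_home, "python", "lib", "py4j-*-src.zip")
    matches = sorted(glob.glob(pattern))
    # If several versions are present, take the last one in sorted order.
    return matches[-1] if matches else None


spark_home = os.environ.get("SPARK_HOME", "/usr/hdp/current/spark2-client")
py4j_zip = find_py4j_zip(spark_home)
if py4j_zip is None:
    print("py4j archive not found under", spark_home)
else:
    print("py4j archive:", py4j_zip)
```

The returned path can then be passed to `sys.path.insert(0, ...)` in place of the hard-coded file name.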

## Starting SparkContext (sc), the main controlling object

In [80]:
from pyspark import SparkContext, SparkConf

conf = SparkConf()
conf.set("spark.app.name", "dmitry.varyukhin Spark RDD lab01") 

# getOrCreate() reuses a SparkContext that is already running in this kernel;
# calling SparkContext(conf=conf) a second time raises
# "ValueError: Cannot run multiple SparkContexts at once".
sc = SparkContext.getOrCreate(conf=conf)

In [81]:
sc

In [82]:
# sc.stop()  # stop the context only when all work is done; later cells still use sc

## To get all configuration options that have been set, use `sc.getConf()`

In [83]:
# sc.getConf().getAll()

In [84]:
sc.getConf().get("spark.app.name")

'dmitry.varyukhin Spark RDD lab01'

# Checking the data sources

In [85]:
!hdfs getconf -confKey fs.defaultFS

hdfs://spark-de-master-1.newprolab.com:8020


In [86]:
!hdfs dfs -ls /labs/laba01/ml-100k

Found 23 items
-rw-r--r--   3 hdfs hdfs       6750 2020-09-05 20:38 /labs/laba01/ml-100k/README
-rw-r--r--   3 hdfs hdfs        716 2020-09-05 20:38 /labs/laba01/ml-100k/allbut.pl
-rw-r--r--   3 hdfs hdfs        643 2020-09-05 20:38 /labs/laba01/ml-100k/mku.sh
-rw-r--r--   3 hdfs hdfs    1979173 2020-09-05 20:38 /labs/laba01/ml-100k/u.data
-rw-r--r--   3 hdfs hdfs        202 2020-09-05 20:38 /labs/laba01/ml-100k/u.genre
-rw-r--r--   3 hdfs hdfs         36 2020-09-05 20:38 /labs/laba01/ml-100k/u.info
-rw-r--r--   3 hdfs hdfs     236344 2020-09-05 20:38 /labs/laba01/ml-100k/u.item
-rw-r--r--   3 hdfs hdfs        193 2020-09-05 20:38 /labs/laba01/ml-100k/u.occupation
-rw-r--r--   3 hdfs hdfs      22628 2020-09-05 20:38 /labs/laba01/ml-100k/u.user
-rw-r--r--   3 hdfs hdfs    1586544 2020-09-05 20:38 /labs/laba01/ml-100k/u1.base
-rw-r--r--   3 hdfs hdfs     392629 2020-09-05 20:38 /labs/laba01/ml-100k/u1.test
-rw-r--r--   3 hdfs hdfs    1583948 2020-09-05 20:38 /labs/laba01/ml-1

In [87]:
!hdfs dfs -cat /labs/laba01/ml-100k/u.data | head

196	242	3	881250949
186	302	3	891717742
22	377	1	878887116
244	51	2	880606923
166	346	1	886397596
298	474	4	884182806
115	265	2	881171488
253	465	5	891628467
305	451	3	886324817
6	86	3	883603013
cat: Unable to write to output stream.


In [88]:
!hdfs dfs -tail /labs/laba01/ml-100k/u.data

91363685
823	134	5	878438232
130	93	5	874953665
130	121	5	876250746
537	778	3	886031106
655	913	4	891817521
889	2	3	880182460
865	1009	5	880144368
851	979	3	875730244
833	474	5	875122675
394	380	4	881132876
193	690	4	889123221
621	809	4	880740136
766	91	5	891310125
650	479	5	891372339
429	199	5	882386006
847	596	3	878938982
934	216	1	891191511
788	556	2	880871128
897	369	4	879993713
936	287	4	886832419
936	766	3	886832597
449	120	1	879959573
661	762	2	876037121
721	874	3	877137447
821	151	4	874792889
764	596	3	876243046
537	443	3	886031752
618	628	2	891308019
487	291	3	883445079
113	975	5	875936424
943	391	2	888640291
864	685	4	888891900
750	323	3	879445877
279	64	1	875308510
646	750	3	888528902
654	370	2	887863914
617	582	4	883789294
913	690	3	880824288
660	229	2	891406212
421	498	4	892241344
495	1091	4	888637503
806	421	4	882388897
676	538	4	892685437
721	262	3	877137285
913	209	2	881367150
378	78	3	880056976
880	476	3	880175444
716	204

# Loading the data

In [89]:
# Columns: user id | item id | rating | timestamp (tab-separated)
# Needs a live SparkContext: after sc.stop() this call fails with
# "AttributeError: 'NoneType' object has no attribute 'sc'"
udata_textFile = sc.textFile("hdfs://spark-de-master-1.newprolab.com:8020/labs/laba01/ml-100k/u.data")

In [None]:
# Keep (item id, rating) pairs as strings; each line is split only once
udata_rdd = udata_textFile.map(lambda line: line.split("\t")).map(lambda f: (f[1], f[2]))
# udata_rdd = udata_textFile.map(lambda line: line.split("\t")).map(lambda f: (int(f[1]), int(f[2])))
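
To make the record layout concrete, here is a plain-Python sketch of the same parsing step that runs without a cluster. `parse_rating_line` is a hypothetical helper; the sample lines are the first rows of `u.data` shown earlier.

```python
def parse_rating_line(line):
    """Split one tab-separated u.data line and return (item id, rating) as strings."""
    user_id, item_id, rating, timestamp = line.rstrip("\n").split("\t")
    return item_id, rating


sample = [
    "196\t242\t3\t881250949",
    "186\t302\t3\t891717742",
    "22\t377\t1\t878887116",
]
pairs = [parse_rating_line(line) for line in sample]
print(pairs)  # [('242', '3'), ('302', '3'), ('377', '1')]
```

A named function like this can also be passed to `udata_textFile.map(...)` in place of the lambda, which makes the step easier to test on its own.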

In [None]:
udata_rdd.take(10)

# Preparing the data for output

In [70]:
# Count ratings per rating value for film id 318
hist_film = udata_rdd.filter(lambda x: x[0] == "318")\
    .map(lambda l: (l[1], 1))\
    .reduceByKey(lambda a, b: a + b)\
    .sortByKey(ascending=True)
hist_film.collect()

[('1', 4), ('2', 6), ('3', 23), ('4', 79), ('5', 186)]

In [None]:
hist_film.keys().take(10)

In [71]:
hist_film.values().take(10)

[4, 6, 23, 79, 186]
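
`hist_film.values()` returns five numbers here only because film 318 received every rating from 1 to 5 at least once. A rating that never occurs produces no key at all, and the list comes out shorter. The sketch below shows one way to guard against that in plain Python; `rating_histogram` is a hypothetical helper and the input pairs are invented.

```python
from collections import Counter


def rating_histogram(pairs, film_id=None):
    """Count ratings 1..5 in (item id, rating) string pairs.

    With film_id set, only that film is counted. The result always has
    five entries; ratings that never occur are reported as 0.
    """
    counts = Counter(
        rating for item, rating in pairs if film_id is None or item == film_id
    )
    return [counts[str(r)] for r in range(1, 6)]


# Invented sample: film "7" never received a 1 or a 4.
pairs = [("7", "5"), ("7", "3"), ("7", "5"), ("9", "1"), ("7", "2")]
print(rating_histogram(pairs, film_id="7"))  # [0, 1, 1, 0, 2]
print(rating_histogram(pairs))               # [1, 1, 1, 0, 2]
```

On the RDD side, a similar effect should be achievable by collecting the counts with `collectAsMap()` and reading each rating with `.get(key, 0)`.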

In [72]:
hist_all = udata_rdd.map(lambda l: (l[1], 1))\
    .reduceByKey(lambda a, b: a + b)\
    .sortByKey(ascending=True)
hist_all.collect()

[('1', 6110), ('2', 11370), ('3', 27145), ('4', 34174), ('5', 21201)]

In [73]:
hist_all.keys().take(10)

['1', '2', '3', '4', '5']

In [74]:
hist_all.values().take(10)

[6110, 11370, 27145, 34174, 21201]

# Saving the result to JSON

In [None]:
import json

In [None]:
json_data = (json.dumps({"hist_film" : hist_film.values().collect(), "hist_all": hist_all.values().collect()}))

In [None]:
with open("/data/home/dmitry.varyukhin/lab01.json", "w") as f:
    f.write(json_data)
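
Before submitting, it can help to read the file back and confirm that it parses. A small sketch that uses a temporary directory so it runs anywhere; on the cluster the path would be the home-directory `lab01.json` used above. The counts are the ones computed in this notebook.

```python
import json
import os
import tempfile

result = {
    "hist_film": [4, 6, 23, 79, 186],
    "hist_all": [6110, 11370, 27145, 34174, 21201],
}

# Temporary location for the demo only.
path = os.path.join(tempfile.mkdtemp(), "lab01.json")

with open(path, "w") as f:
    json.dump(result, f)  # writes straight to the file, no intermediate string

with open(path) as f:
    loaded = json.load(f)

print(loaded == result)  # True
```

A further sanity check is that the `hist_all` counts add up to 100000, the number of ratings in the dataset.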

# WTF!?!?!?

In [16]:
# rdd1 = udata_rdd.map(lambda l: (l[0] + "_" + l[1], 1))\
# .reduceByKey(lambda k, v: k+v)\
# .sortByKey(ascending=True)

In [17]:
# rdd1.take(10)

In [63]:
lines = udata_rdd
lines.take(10)

[('242', '3'),
 ('302', '3'),
 ('377', '1'),
 ('51', '2'),
 ('346', '1'),
 ('474', '4'),
 ('265', '2'),
 ('465', '5'),
 ('451', '3'),
 ('86', '3')]

In [64]:
processed = lines.map(lambda l: (l, 1))
processed.take(10)

[(('242', '3'), 1),
 (('302', '3'), 1),
 (('377', '1'), 1),
 (('51', '2'), 1),
 (('346', '1'), 1),
 (('474', '4'), 1),
 (('265', '2'), 1),
 (('465', '5'), 1),
 (('451', '3'), 1),
 (('86', '3'), 1)]

In [65]:
reduced = processed.reduceByKey(lambda a, b: a + b).sortByKey(ascending=True)
reduced.take(10)

[(('1', '1'), 8),
 (('1', '2'), 27),
 (('1', '3'), 96),
 (('1', '4'), 202),
 (('1', '5'), 119),
 (('10', '1'), 2),
 (('10', '2'), 7),
 (('10', '3'), 21),
 (('10', '4'), 33),
 (('10', '5'), 26)]
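
The keys come back in the order `'1'`, `'10'`, and so on because they are strings, and strings are compared character by character. That is harmless for single-digit ratings but matters for film ids. A short illustration with a few ids:

```python
film_ids = ["10", "2", "1", "100"]

# String comparison is lexicographic.
print(sorted(film_ids))           # ['1', '10', '100', '2']

# Converting to int for the comparison restores numeric order.
print(sorted(film_ids, key=int))  # ['1', '2', '10', '100']
```

With an RDD, converting the ids to `int` before `sortByKey` would give numeric order in the same way.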

In [66]:
def formatter(line):
    # line is ((film id, rating), count); keep only the film id and the count
    film = line[0][0]
    cnt = line[1]
    return film, cnt

film_raiting = reduced.map(formatter)
film_raiting.collect()

[('1', 8),
 ('1', 27),
 ('1', 96),
 ('1', 202),
 ('1', 119),
 ('10', 2),
 ('10', 7),
 ('10', 21),
 ('10', 33),
 ('10', 26),
 ('100', 14),
 ('100', 18),
 ('100', 70),
 ('100', 179),
 ('100', 227),
 ('1000', 2),
 ('1000', 6),
 ('1000', 2),
 ('1001', 10),
 ('1001', 2),
 ('1001', 1),
 ('1001', 3),
 ('1001', 1),
 ('1002', 5),
 ('1002', 2),
 ('1002', 1),
 ('1003', 2),
 ('1003', 3),
 ('1003', 2),
 ('1003', 1),
 ('1004', 1),
 ('1004', 1),
 ('1004', 3),
 ('1004', 4),
 ('1005', 2),
 ('1005', 2),
 ('1005', 4),
 ('1005', 7),
 ('1005', 7),
 ('1006', 3),
 ('1006', 6),
 ('1006', 8),
 ('1006', 4),
 ('1006', 2),
 ('1007', 8),
 ('1007', 25),
 ('1007', 14),
 ('1008', 4),
 ('1008', 2),
 ('1008', 11),
 ('1008', 16),
 ('1008', 4),
 ('1009', 3),
 ('1009', 15),
 ('1009', 11),
 ('1009', 25),
 ('1009', 10),
 ('101', 5),
 ('101', 16),
 ('101', 20),
 ('101', 19),
 ('101', 13),
 ('1010', 4),
 ('1010', 6),
 ('1010', 15),
 ('1010', 13),
 ('1010', 6),
 ('1011', 7),
 ('1011', 8),
 ('1011', 40),
 ('1011', 33),
 ('1011'

In [67]:
hist_film = film_raiting.combineByKey(lambda x: [x], lambda u, v: u + [v], lambda u1,u2: u1+u2)
hist_film.collect()

[('1', [8, 27, 96, 202, 119]),
 ('10', [2, 7, 21, 33, 26]),
 ('100', [14, 18, 70, 179, 227]),
 ('1002', [5, 2, 1]),
 ('1009', [3, 15, 11, 25, 10]),
 ('1013', [11, 9, 14, 4]),
 ('1015', [2, 3, 4, 1, 2]),
 ('1017', [2, 9, 20, 13, 6]),
 ('1018', [1, 4, 17, 8, 2]),
 ('1019', [2, 5, 14, 10]),
 ('102', [4, 12, 16, 17, 5]),
 ('1020', [1, 9, 18, 7]),
 ('1021', [2, 3, 9, 10, 14]),
 ('1023', [4, 10, 12, 5]),
 ('1024', [1, 1, 6, 2, 5]),
 ('1025', [6, 13, 10, 8, 7]),
 ('1034', [5, 10, 7, 3, 2]),
 ('1037', [9, 10, 2, 2, 1]),
 ('1039', [6, 14, 43, 27]),
 ('1040', [4, 7, 10, 3, 1]),
 ('1043', [1, 1, 4, 1, 1]),
 ('1044', [2, 7, 13, 16, 2]),
 ('1045', [6, 9, 8, 2]),
 ('1049', [5, 5, 12, 3]),
 ('1051', [1, 9, 13, 15, 3]),
 ('1054', [8, 3, 10, 1, 1]),
 ('1057', [4, 10, 5, 3]),
 ('1058', [1, 3, 4, 5, 2]),
 ('1059', [9, 4, 10, 10, 2]),
 ('106', [3, 27, 20, 17, 4]),
 ('1060', [4, 7, 16, 7, 5]),
 ('1062', [2, 2, 3, 5]),
 ('1065', [7, 6, 9, 16, 15]),
 ('1069', [3, 4, 1, 7, 3]),
 ('107', [4, 1, 15, 14, 8]),
 (

In [68]:
hist_film.keys().take(10)

['1', '10', '100', '1002', '1009', '1013', '1015', '1017', '1018', '1019']

In [69]:
hist_film.values().take(10)

[[8, 27, 96, 202, 119],
 [2, 7, 21, 33, 26],
 [14, 18, 70, 179, 227],
 [5, 2, 1],
 [3, 15, 11, 25, 10],
 [11, 9, 14, 4],
 [2, 3, 4, 1, 2],
 [2, 9, 20, 13, 6],
 [1, 4, 17, 8, 2],
 [2, 5, 14, 10]]
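
`combineByKey` takes three functions: one that starts a combiner from the first value seen for a key, one that folds a further value into a combiner, and one that merges combiners built on different partitions. The plain-Python sketch below imitates those roles on invented data; `combine_by_key` is a hypothetical stand-in, not Spark's implementation.

```python
def combine_by_key(partitions, create_combiner, merge_value, merge_combiners):
    """Imitate the three-function contract of combineByKey on lists of (key, value)."""
    per_partition = []
    for part in partitions:
        acc = {}
        for key, value in part:
            if key in acc:
                acc[key] = merge_value(acc[key], value)
            else:
                acc[key] = create_combiner(value)
        per_partition.append(acc)

    merged = {}
    for acc in per_partition:
        for key, comb in acc.items():
            if key in merged:
                merged[key] = merge_combiners(merged[key], comb)
            else:
                merged[key] = comb
    return merged


# Invented (film id, count) pairs split over two partitions.
partitions = [
    [("1002", 5), ("1", 8), ("1", 27)],
    [("1002", 2), ("1", 96), ("1002", 1)],
]
result = combine_by_key(
    partitions,
    lambda v: [v],              # start a list from the first value
    lambda acc, v: acc + [v],   # append a further value
    lambda a, b: a + b,         # concatenate lists from two partitions
)
print(result)  # {'1002': [5, 2, 1], '1': [8, 27, 96]}
```

A list such as `[5, 2, 1]` no longer says which ratings the counts belong to: the rating was dropped by `formatter` before combining, so a film with only three distinct ratings cannot be mapped back to the `"1"` to `"5"` slots.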

In [70]:
# Count ratings per rating value for film id 318
hist_film = udata_rdd.filter(lambda x: x[0] == "318")\
    .map(lambda l: (l[1], 1))\
    .reduceByKey(lambda a, b: a + b)\
    .sortByKey(ascending=True)
hist_film.collect()

[('1', 4), ('2', 6), ('3', 23), ('4', 79), ('5', 186)]

In [None]:
hist_film.keys().take(10)

In [71]:
hist_film.values().take(10)

[4, 6, 23, 79, 186]

In [72]:
hist_all = udata_rdd.map(lambda l: (l[1], 1))\
    .reduceByKey(lambda a, b: a + b)\
    .sortByKey(ascending=True)
hist_all.collect()

[('1', 6110), ('2', 11370), ('3', 27145), ('4', 34174), ('5', 21201)]

In [73]:
hist_all.keys().take(10)

['1', '2', '3', '4', '5']

In [74]:
hist_all.values().take(10)

[6110, 11370, 27145, 34174, 21201]

# Saving the result to JSON

In [75]:
import json

In [76]:
json_data = (json.dumps({"hist_film" : hist_film.values().collect(), "hist_all": hist_all.values().collect()}))

In [77]:
json_data

'{"hist_film": [4, 6, 23, 79, 186], "hist_all": [6110, 11370, 27145, 34174, 21201]}'

In [78]:
with open("/data/home/dmitry.varyukhin/lab01.json", "w") as f:
    f.write(json_data)

In [547]:
json_data = (json.dumps({"hist_film" : hist_film.values().collect(), "hist_all": hist_all.values().collect()}))

In [None]:
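# Python 3: dict views cannot be added with +, so convert them to lists first (x and y are two dicts)
z = dict(list(x.items()) + list(y.items()))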

In [534]:
hist_all_dict = {"hist_all": hist_all.values().collect()}

In [551]:
hist_film_dict = hist_film.collectAsMap()

In [562]:
list(hist_film.collectAsMap().items())

[('44', [6, 6, 29, 31, 7]),
 ('440', [9, 2, 2, 1]),
 ('441', [8, 16, 21, 7, 1]),
 ('442', [3, 1]),
 ('445', [3, 7, 10, 2]),
 ('446', [2, 3, 2, 2]),
 ('447', [5, 11, 40, 48, 17]),
 ('449', [9, 26, 45, 26, 11]),
 ('45', [1, 1, 20, 29, 29]),
 ('451', [15, 31, 37, 54, 33]),
 ('452', [10, 19, 24, 9, 4]),
 ('453', [6, 6, 3, 1]),
 ('456', [16, 7, 20, 4, 1]),
 ('457', [16, 5, 3, 3]),
 ('458', [11, 10, 33, 29, 7]),
 ('461', [5, 28, 31, 10]),
 ('464', [2, 2, 7, 10, 6]),
 ('465', [4, 8, 26, 30, 17]),
 ('467', [2, 15, 22, 9]),
 ('468', [4, 9, 22, 21, 8]),
 ('469', [3, 3, 14, 33, 14]),
 ('470', [2, 9, 36, 37, 24]),
 ('473', [8, 22, 52, 34, 10]),
 ('474', [6, 34, 59, 95]),
 ('477', [11, 13, 27, 33, 11]),
 ('478', [5, 16, 45, 38]),
 ('48', [3, 7, 11, 51, 45]),
 ('481', [2, 1, 19, 18, 23]),
 ('483', [1, 1, 25, 75, 141]),
 ('485', [3, 10, 34, 38, 40]),
 ('486', [1, 2, 20, 27, 14]),
 ('488', [2, 10, 26, 27]),
 ('490', [2, 8, 27, 13]),
 ('492', [3, 15, 28, 13]),
 ('495', [4, 19, 29, 7]),
 ('496', [5, 10,

In [536]:
json_data = dict(list(hist_film_dict.items()) + list(hist_all_dict.items()))

In [539]:
json.dumps(json_data)

'{"44": [6, 6, 29, 31, 7], "440": [9, 2, 2, 1], "441": [8, 16, 21, 7, 1], "442": [3, 1], "445": [3, 7, 10, 2], "446": [2, 3, 2, 2], "447": [5, 11, 40, 48, 17], "449": [9, 26, 45, 26, 11], "45": [1, 1, 20, 29, 29], "451": [15, 31, 37, 54, 33], "452": [10, 19, 24, 9, 4], "453": [6, 6, 3, 1], "456": [16, 7, 20, 4, 1], "457": [16, 5, 3, 3], "458": [11, 10, 33, 29, 7], "461": [5, 28, 31, 10], "464": [2, 2, 7, 10, 6], "465": [4, 8, 26, 30, 17], "467": [2, 15, 22, 9], "468": [4, 9, 22, 21, 8], "469": [3, 3, 14, 33, 14], "470": [2, 9, 36, 37, 24], "473": [8, 22, 52, 34, 10], "474": [6, 34, 59, 95], "477": [11, 13, 27, 33, 11], "478": [5, 16, 45, 38], "48": [3, 7, 11, 51, 45], "481": [2, 1, 19, 18, 23], "483": [1, 1, 25, 75, 141], "485": [3, 10, 34, 38, 40], "486": [1, 2, 20, 27, 14], "488": [2, 10, 26, 27], "490": [2, 8, 27, 13], "492": [3, 15, 28, 13], "495": [4, 19, 29, 7], "496": [5, 10, 41, 71, 104], "499": [1, 15, 32, 14], "50": [9, 16, 57, 176, 325], "501": [4, 13, 45, 40, 21], "502": [2

In [540]:
with open("/data/home/dmitry.varyukhin/lab01.json", "w") as f:
    f.write(json.dumps(json_data))
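
The dictionary written here has film ids as top-level keys and no `hist_film` key, so it does not match the format required by the task. A quick structural check before writing can catch this. `is_valid_result` below is a hypothetical helper; the rule it enforces (exactly two keys, each a list of five integers) is read off the example in the task statement, and the real checker may test more.

```python
def is_valid_result(obj):
    """Return True if obj looks like {"hist_film": [5 ints], "hist_all": [5 ints]}."""
    if not isinstance(obj, dict) or set(obj) != {"hist_film", "hist_all"}:
        return False
    for key in ("hist_film", "hist_all"):
        values = obj[key]
        if not isinstance(values, list) or len(values) != 5:
            return False
        if not all(isinstance(v, int) and not isinstance(v, bool) for v in values):
            return False
    return True


good = {"hist_film": [4, 6, 23, 79, 186], "hist_all": [6110, 11370, 27145, 34174, 21201]}
bad = {"44": [6, 6, 29, 31, 7], "442": [3, 1], "hist_all": [6110, 11370, 27145, 34174, 21201]}

print(is_valid_result(good))  # True
print(is_valid_result(bad))   # False
```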

In [522]:
json_data = (json.dumps({hist_film.collectAsMap(), "hist_all": hist_all.values().collect()}))

SyntaxError: invalid syntax (<ipython-input-522-4484159f343c>, line 1)

In [557]:
a = {"hist_all": hist_all.values().collect()}
a

{'hist_all': [6110, 11370, 27145, 34174, 21201]}

In [558]:
b = hist_film.collectAsMap()
b

{'44': [6, 6, 29, 31, 7],
 '440': [9, 2, 2, 1],
 '441': [8, 16, 21, 7, 1],
 '442': [3, 1],
 '445': [3, 7, 10, 2],
 '446': [2, 3, 2, 2],
 '447': [5, 11, 40, 48, 17],
 '449': [9, 26, 45, 26, 11],
 '45': [1, 1, 20, 29, 29],
 '451': [15, 31, 37, 54, 33],
 '452': [10, 19, 24, 9, 4],
 '453': [6, 6, 3, 1],
 '456': [16, 7, 20, 4, 1],
 '457': [16, 5, 3, 3],
 '458': [11, 10, 33, 29, 7],
 '461': [5, 28, 31, 10],
 '464': [2, 2, 7, 10, 6],
 '465': [4, 8, 26, 30, 17],
 '467': [2, 15, 22, 9],
 '468': [4, 9, 22, 21, 8],
 '469': [3, 3, 14, 33, 14],
 '470': [2, 9, 36, 37, 24],
 '473': [8, 22, 52, 34, 10],
 '474': [6, 34, 59, 95],
 '477': [11, 13, 27, 33, 11],
 '478': [5, 16, 45, 38],
 '48': [3, 7, 11, 51, 45],
 '481': [2, 1, 19, 18, 23],
 '483': [1, 1, 25, 75, 141],
 '485': [3, 10, 34, 38, 40],
 '486': [1, 2, 20, 27, 14],
 '488': [2, 10, 26, 27],
 '490': [2, 8, 27, 13],
 '492': [3, 15, 28, 13],
 '495': [4, 19, 29, 7],
 '496': [5, 10, 41, 71, 104],
 '499': [1, 15, 32, 14],
 '50': [9, 16, 57, 176, 325],
 

In [560]:
z = dict(list(b.items()) + list(a.items()))

In [561]:
json.dumps(z)

'{"44": [6, 6, 29, 31, 7], "440": [9, 2, 2, 1], "441": [8, 16, 21, 7, 1], "442": [3, 1], "445": [3, 7, 10, 2], "446": [2, 3, 2, 2], "447": [5, 11, 40, 48, 17], "449": [9, 26, 45, 26, 11], "45": [1, 1, 20, 29, 29], "451": [15, 31, 37, 54, 33], "452": [10, 19, 24, 9, 4], "453": [6, 6, 3, 1], "456": [16, 7, 20, 4, 1], "457": [16, 5, 3, 3], "458": [11, 10, 33, 29, 7], "461": [5, 28, 31, 10], "464": [2, 2, 7, 10, 6], "465": [4, 8, 26, 30, 17], "467": [2, 15, 22, 9], "468": [4, 9, 22, 21, 8], "469": [3, 3, 14, 33, 14], "470": [2, 9, 36, 37, 24], "473": [8, 22, 52, 34, 10], "474": [6, 34, 59, 95], "477": [11, 13, 27, 33, 11], "478": [5, 16, 45, 38], "48": [3, 7, 11, 51, 45], "481": [2, 1, 19, 18, 23], "483": [1, 1, 25, 75, 141], "485": [3, 10, 34, 38, 40], "486": [1, 2, 20, 27, 14], "488": [2, 10, 26, 27], "490": [2, 8, 27, 13], "492": [3, 15, 28, 13], "495": [4, 19, 29, 7], "496": [5, 10, 41, 71, 104], "499": [1, 15, 32, 14], "50": [9, 16, 57, 176, 325], "501": [4, 13, 45, 40, 21], "502": [2

In [520]:
with open("/data/home/dmitry.varyukhin/lab01.json", "w") as f:
    f.write(json.dumps(z))

In [477]:
a = json.dumps({"hist_all": hist_all.values().collect()})
a

'{"hist_all": [6110, 11370, 27145, 34174, 21201]}'

In [478]:
({"hist_all": hist_all.values().collect()})

{'hist_all': [6110, 11370, 27145, 34174, 21201]}

In [476]:
b = json.dumps(hist_film.collectAsMap())
b

'{"1": ["8", "27", "96", "202", "119"], "10": ["2", "7", "21", "33", "26"], "100": ["14", "18", "70", "179", "227"], "1002": ["5", "2", "1"], "1009": ["3", "15", "11", "25", "10"], "1013": ["11", "9", "14", "4"], "1015": ["2", "3", "4", "1", "2"], "1017": ["2", "9", "20", "13", "6"], "1018": ["1", "4", "17", "8", "2"], "1019": ["2", "5", "14", "10"], "102": ["4", "12", "16", "17", "5"], "1020": ["1", "9", "18", "7"], "1021": ["2", "3", "9", "10", "14"], "1023": ["4", "10", "12", "5"], "1024": ["1", "1", "6", "2", "5"], "1025": ["6", "13", "10", "8", "7"], "1034": ["5", "10", "7", "3", "2"], "1037": ["9", "10", "2", "2", "1"], "1039": ["6", "14", "43", "27"], "1040": ["4", "7", "10", "3", "1"], "1043": ["1", "1", "4", "1", "1"], "1044": ["2", "7", "13", "16", "2"], "1045": ["6", "9", "8", "2"], "1049": ["5", "5", "12", "3"], "1051": ["1", "9", "13", "15", "3"], "1054": ["8", "3", "10", "1", "1"], "1057": ["4", "10", "5", "3"], "1058": ["1", "3", "4", "5", "2"], "1059": ["9", "4", "10", 

In [479]:
hist_film.collectAsMap()

{'44': ['6', '6', '29', '31', '7'],
 '440': ['9', '2', '2', '1'],
 '441': ['8', '16', '21', '7', '1'],
 '442': ['3', '1'],
 '445': ['3', '7', '10', '2'],
 '446': ['2', '3', '2', '2'],
 '447': ['5', '11', '40', '48', '17'],
 '449': ['9', '26', '45', '26', '11'],
 '45': ['1', '1', '20', '29', '29'],
 '451': ['15', '31', '37', '54', '33'],
 '452': ['10', '19', '24', '9', '4'],
 '453': ['6', '6', '3', '1'],
 '456': ['16', '7', '20', '4', '1'],
 '457': ['16', '5', '3', '3'],
 '458': ['11', '10', '33', '29', '7'],
 '461': ['5', '28', '31', '10'],
 '464': ['2', '2', '7', '10', '6'],
 '465': ['4', '8', '26', '30', '17'],
 '467': ['2', '15', '22', '9'],
 '468': ['4', '9', '22', '21', '8'],
 '469': ['3', '3', '14', '33', '14'],
 '470': ['2', '9', '36', '37', '24'],
 '473': ['8', '22', '52', '34', '10'],
 '474': ['6', '34', '59', '95'],
 '477': ['11', '13', '27', '33', '11'],
 '478': ['5', '16', '45', '38'],
 '48': ['3', '7', '11', '51', '45'],
 '481': ['2', '1', '19', '18', '23'],
 '483': ['1', 

In [488]:
c = a + b
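
Here `a` and `b` are already JSON strings, so `a + b` glues two documents together, and a later `json.dumps(c)` encodes that string a second time, which is where the escaped quotes come from. The usual approach is to combine the Python objects first and serialize once. A sketch using the counts from this notebook:

```python
import json

a_obj = {"hist_all": [6110, 11370, 27145, 34174, 21201]}
b_obj = {"hist_film": [4, 6, 23, 79, 186]}

# Concatenating two serialized documents does not yield valid JSON.
glued = json.dumps(a_obj) + json.dumps(b_obj)
try:
    json.loads(glued)
    glued_is_valid = True
except json.JSONDecodeError:
    glued_is_valid = False
print(glued_is_valid)  # False

# Serializing a string produces a quoted, escaped string, not an object.
double_encoded = json.dumps(json.dumps(a_obj))
print(type(json.loads(double_encoded)).__name__)  # str

# Merge the dicts, then serialize exactly once.
merged = {**b_obj, **a_obj}
print(json.dumps(merged))
```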

In [489]:
c

'{"hist_all": [6110, 11370, 27145, 34174, 21201]}{"1": ["8", "27", "96", "202", "119"], "10": ["2", "7", "21", "33", "26"], "100": ["14", "18", "70", "179", "227"], "1002": ["5", "2", "1"], "1009": ["3", "15", "11", "25", "10"], "1013": ["11", "9", "14", "4"], "1015": ["2", "3", "4", "1", "2"], "1017": ["2", "9", "20", "13", "6"], "1018": ["1", "4", "17", "8", "2"], "1019": ["2", "5", "14", "10"], "102": ["4", "12", "16", "17", "5"], "1020": ["1", "9", "18", "7"], "1021": ["2", "3", "9", "10", "14"], "1023": ["4", "10", "12", "5"], "1024": ["1", "1", "6", "2", "5"], "1025": ["6", "13", "10", "8", "7"], "1034": ["5", "10", "7", "3", "2"], "1037": ["9", "10", "2", "2", "1"], "1039": ["6", "14", "43", "27"], "1040": ["4", "7", "10", "3", "1"], "1043": ["1", "1", "4", "1", "1"], "1044": ["2", "7", "13", "16", "2"], "1045": ["6", "9", "8", "2"], "1049": ["5", "5", "12", "3"], "1051": ["1", "9", "13", "15", "3"], "1054": ["8", "3", "10", "1", "1"], "1057": ["4", "10", "5", "3"], "1058": ["1"

In [490]:
json.dumps(c)

'"{\\"hist_all\\": [6110, 11370, 27145, 34174, 21201]}{\\"1\\": [\\"8\\", \\"27\\", \\"96\\", \\"202\\", \\"119\\"], \\"10\\": [\\"2\\", \\"7\\", \\"21\\", \\"33\\", \\"26\\"], \\"100\\": [\\"14\\", \\"18\\", \\"70\\", \\"179\\", \\"227\\"], \\"1002\\": [\\"5\\", \\"2\\", \\"1\\"], \\"1009\\": [\\"3\\", \\"15\\", \\"11\\", \\"25\\", \\"10\\"], \\"1013\\": [\\"11\\", \\"9\\", \\"14\\", \\"4\\"], \\"1015\\": [\\"2\\", \\"3\\", \\"4\\", \\"1\\", \\"2\\"], \\"1017\\": [\\"2\\", \\"9\\", \\"20\\", \\"13\\", \\"6\\"], \\"1018\\": [\\"1\\", \\"4\\", \\"17\\", \\"8\\", \\"2\\"], \\"1019\\": [\\"2\\", \\"5\\", \\"14\\", \\"10\\"], \\"102\\": [\\"4\\", \\"12\\", \\"16\\", \\"17\\", \\"5\\"], \\"1020\\": [\\"1\\", \\"9\\", \\"18\\", \\"7\\"], \\"1021\\": [\\"2\\", \\"3\\", \\"9\\", \\"10\\", \\"14\\"], \\"1023\\": [\\"4\\", \\"10\\", \\"12\\", \\"5\\"], \\"1024\\": [\\"1\\", \\"1\\", \\"6\\", \\"2\\", \\"5\\"], \\"1025\\": [\\"6\\", \\"13\\", \\"10\\", \\"8\\", \\"7\\"], \\"1034\\": [\\"5\\", \\"

In [469]:
hist_film.collectAsMap()

{'44': ['6', '6', '29', '31', '7'],
 '440': ['9', '2', '2', '1'],
 '441': ['8', '16', '21', '7', '1'],
 '442': ['3', '1'],
 '445': ['3', '7', '10', '2'],
 '446': ['2', '3', '2', '2'],
 '447': ['5', '11', '40', '48', '17'],
 '449': ['9', '26', '45', '26', '11'],
 '45': ['1', '1', '20', '29', '29'],
 '451': ['15', '31', '37', '54', '33'],
 '452': ['10', '19', '24', '9', '4'],
 '453': ['6', '6', '3', '1'],
 '456': ['16', '7', '20', '4', '1'],
 '457': ['16', '5', '3', '3'],
 '458': ['11', '10', '33', '29', '7'],
 '461': ['5', '28', '31', '10'],
 '464': ['2', '2', '7', '10', '6'],
 '465': ['4', '8', '26', '30', '17'],
 '467': ['2', '15', '22', '9'],
 '468': ['4', '9', '22', '21', '8'],
 '469': ['3', '3', '14', '33', '14'],
 '470': ['2', '9', '36', '37', '24'],
 '473': ['8', '22', '52', '34', '10'],
 '474': ['6', '34', '59', '95'],
 '477': ['11', '13', '27', '33', '11'],
 '478': ['5', '16', '45', '38'],
 '48': ['3', '7', '11', '51', '45'],
 '481': ['2', '1', '19', '18', '23'],
 '483': ['1', 

In [475]:
json_data = (json.dumps({"hist_all": hist_all.values().collect()}, hist_film.collectAsMap()}))
json_data

TypeError: dumps() takes 1 positional argument but 2 were given

In [455]:
b = json.dumps(hist_film.collectAsMap())
b

'{"44": ["6", "6", "29", "31", "7"], "440": ["9", "2", "2", "1"], "441": ["8", "16", "21", "7", "1"], "442": ["3", "1"], "445": ["3", "7", "10", "2"], "446": ["2", "3", "2", "2"], "447": ["5", "11", "40", "48", "17"], "449": ["9", "26", "45", "26", "11"], "45": ["1", "1", "20", "29", "29"], "451": ["15", "31", "37", "54", "33"], "452": ["10", "19", "24", "9", "4"], "453": ["6", "6", "3", "1"], "456": ["16", "7", "20", "4", "1"], "457": ["16", "5", "3", "3"], "458": ["11", "10", "33", "29", "7"], "461": ["5", "28", "31", "10"], "464": ["2", "2", "7", "10", "6"], "465": ["4", "8", "26", "30", "17"], "467": ["2", "15", "22", "9"], "468": ["4", "9", "22", "21", "8"], "469": ["3", "3", "14", "33", "14"], "470": ["2", "9", "36", "37", "24"], "473": ["8", "22", "52", "34", "10"], "474": ["6", "34", "59", "95"], "477": ["11", "13", "27", "33", "11"], "478": ["5", "16", "45", "38"], "48": ["3", "7", "11", "51", "45"], "481": ["2", "1", "19", "18", "23"], "483": ["1", "1", "25", "75", "141"], "4

In [462]:
json_data = (json.dumps(hist_film.collectAsMap()), {"hist_all": hist_all.values().collect()})

In [467]:
json_data = (json.dumps({"hist_all": hist_all.values().collect()}, json.dumps(hist_film.collectAsMap())))

TypeError: dumps() takes 1 positional argument but 2 were given

In [463]:
json_data

('{"44": ["6", "6", "29", "31", "7"], "440": ["9", "2", "2", "1"], "441": ["8", "16", "21", "7", "1"], "442": ["3", "1"], "445": ["3", "7", "10", "2"], "446": ["2", "3", "2", "2"], "447": ["5", "11", "40", "48", "17"], "449": ["9", "26", "45", "26", "11"], "45": ["1", "1", "20", "29", "29"], "451": ["15", "31", "37", "54", "33"], "452": ["10", "19", "24", "9", "4"], "453": ["6", "6", "3", "1"], "456": ["16", "7", "20", "4", "1"], "457": ["16", "5", "3", "3"], "458": ["11", "10", "33", "29", "7"], "461": ["5", "28", "31", "10"], "464": ["2", "2", "7", "10", "6"], "465": ["4", "8", "26", "30", "17"], "467": ["2", "15", "22", "9"], "468": ["4", "9", "22", "21", "8"], "469": ["3", "3", "14", "33", "14"], "470": ["2", "9", "36", "37", "24"], "473": ["8", "22", "52", "34", "10"], "474": ["6", "34", "59", "95"], "477": ["11", "13", "27", "33", "11"], "478": ["5", "16", "45", "38"], "48": ["3", "7", "11", "51", "45"], "481": ["2", "1", "19", "18", "23"], "483": ["1", "1", "25", "75", "141"], "

In [440]:
json.dumps(hist_film.collectAsMap())

'{"1": ["8", "27", "96", "202", "119"], "10": ["2", "7", "21", "33", "26"], "100": ["14", "18", "70", "179", "227"], "1002": ["5", "2", "1"], "1009": ["3", "15", "11", "25", "10"], "1013": ["11", "9", "14", "4"], "1015": ["2", "3", "4", "1", "2"], "1017": ["2", "9", "20", "13", "6"], "1018": ["1", "4", "17", "8", "2"], "1019": ["2", "5", "14", "10"], "102": ["4", "12", "16", "17", "5"], "1020": ["1", "9", "18", "7"], "1021": ["2", "3", "9", "10", "14"], "1023": ["4", "10", "12", "5"], "1024": ["1", "1", "6", "2", "5"], "1025": ["6", "13", "10", "8", "7"], "1034": ["5", "10", "7", "3", "2"], "1037": ["9", "10", "2", "2", "1"], "1039": ["6", "14", "43", "27"], "1040": ["4", "7", "10", "3", "1"], "1043": ["1", "1", "4", "1", "1"], "1044": ["2", "7", "13", "16", "2"], "1045": ["6", "9", "8", "2"], "1049": ["5", "5", "12", "3"], "1051": ["1", "9", "13", "15", "3"], "1054": ["8", "3", "10", "1", "1"], "1057": ["4", "10", "5", "3"], "1058": ["1", "3", "4", "5", "2"], "1059": ["9", "4", "10", 

In [436]:
hist_film.collectAsMap()

{'44': ['6', '6', '29', '31', '7'],
 '440': ['9', '2', '2', '1'],
 '441': ['8', '16', '21', '7', '1'],
 '442': ['3', '1'],
 '445': ['3', '7', '10', '2'],
 '446': ['2', '3', '2', '2'],
 '447': ['5', '11', '40', '48', '17'],
 '449': ['9', '26', '45', '26', '11'],
 '45': ['1', '1', '20', '29', '29'],
 '451': ['15', '31', '37', '54', '33'],
 '452': ['10', '19', '24', '9', '4'],
 '453': ['6', '6', '3', '1'],
 '456': ['16', '7', '20', '4', '1'],
 '457': ['16', '5', '3', '3'],
 '458': ['11', '10', '33', '29', '7'],
 '461': ['5', '28', '31', '10'],
 '464': ['2', '2', '7', '10', '6'],
 '465': ['4', '8', '26', '30', '17'],
 '467': ['2', '15', '22', '9'],
 '468': ['4', '9', '22', '21', '8'],
 '469': ['3', '3', '14', '33', '14'],
 '470': ['2', '9', '36', '37', '24'],
 '473': ['8', '22', '52', '34', '10'],
 '474': ['6', '34', '59', '95'],
 '477': ['11', '13', '27', '33', '11'],
 '478': ['5', '16', '45', '38'],
 '48': ['3', '7', '11', '51', '45'],
 '481': ['2', '1', '19', '18', '23'],
 '483': ['1', 

In [446]:
a = hist_film.collectAsMap()

In [447]:
json_data = json.dumps({"hist_all": hist_all.values().collect()})

SyntaxError: invalid syntax (<ipython-input-447-76e6b4d6943a>, line 1)

In [445]:
json_data?

In [None]:
json_data = (json.dumps({"hist_film" : hist_film.values().collect(), "hist_all": hist_all.values().collect()}))

In [403]:
hist_film?

In [424]:
hist_all.values().collect()

[6110, 11370, 27145, 34174, 21201]

In [426]:
hist_film.collect()

[('1', ['8', '27', '96', '202', '119']),
 ('10', ['2', '7', '21', '33', '26']),
 ('100', ['14', '18', '70', '179', '227']),
 ('1002', ['5', '2', '1']),
 ('1009', ['3', '15', '11', '25', '10']),
 ('1013', ['11', '9', '14', '4']),
 ('1015', ['2', '3', '4', '1', '2']),
 ('1017', ['2', '9', '20', '13', '6']),
 ('1018', ['1', '4', '17', '8', '2']),
 ('1019', ['2', '5', '14', '10']),
 ('102', ['4', '12', '16', '17', '5']),
 ('1020', ['1', '9', '18', '7']),
 ('1021', ['2', '3', '9', '10', '14']),
 ('1023', ['4', '10', '12', '5']),
 ('1024', ['1', '1', '6', '2', '5']),
 ('1025', ['6', '13', '10', '8', '7']),
 ('1034', ['5', '10', '7', '3', '2']),
 ('1037', ['9', '10', '2', '2', '1']),
 ('1039', ['6', '14', '43', '27']),
 ('1040', ['4', '7', '10', '3', '1']),
 ('1043', ['1', '1', '4', '1', '1']),
 ('1044', ['2', '7', '13', '16', '2']),
 ('1045', ['6', '9', '8', '2']),
 ('1049', ['5', '5', '12', '3']),
 ('1051', ['1', '9', '13', '15', '3']),
 ('1054', ['8', '3', '10', '1', '1']),
 ('1057', ['4',

In [421]:
hist_film.map(lambda x: json.dumps(x)).collect()

['["44", ["6", "6", "29", "31", "7"]]',
 '["440", ["9", "2", "2", "1"]]',
 '["441", ["8", "16", "21", "7", "1"]]',
 '["442", ["3", "1"]]',
 '["445", ["3", "7", "10", "2"]]',
 '["446", ["2", "3", "2", "2"]]',
 '["447", ["5", "11", "40", "48", "17"]]',
 '["449", ["9", "26", "45", "26", "11"]]',
 '["45", ["1", "1", "20", "29", "29"]]',
 '["451", ["15", "31", "37", "54", "33"]]',
 '["452", ["10", "19", "24", "9", "4"]]',
 '["453", ["6", "6", "3", "1"]]',
 '["456", ["16", "7", "20", "4", "1"]]',
 '["457", ["16", "5", "3", "3"]]',
 '["458", ["11", "10", "33", "29", "7"]]',
 '["461", ["5", "28", "31", "10"]]',
 '["464", ["2", "2", "7", "10", "6"]]',
 '["465", ["4", "8", "26", "30", "17"]]',
 '["467", ["2", "15", "22", "9"]]',
 '["468", ["4", "9", "22", "21", "8"]]',
 '["469", ["3", "3", "14", "33", "14"]]',
 '["470", ["2", "9", "36", "37", "24"]]',
 '["473", ["8", "22", "52", "34", "10"]]',
 '["474", ["6", "34", "59", "95"]]',
 '["477", ["11", "13", "27", "33", "11"]]',
 '["478", ["5", "16", 