Fix spark logging and spark tests action workflow #1413

Merged: 40 commits, May 7, 2021
Changes from 33 commits
Commits (40)
784c3c7
Test sentry in test setup
amCap1712 Apr 26, 2021
dd11f22
Build spark containers before test and use run instead of up
amCap1712 Apr 26, 2021
ee3d2b1
Copy configuration file during spark test run
amCap1712 Apr 26, 2021
ba0b805
Remove SparkIntegration
amCap1712 Apr 26, 2021
c768d5d
Change config.py.sample to hadoop master because it is used in tests
amCap1712 Apr 26, 2021
2b892e5
Test removing pyspark dependency
amCap1712 Apr 26, 2021
46e14a6
Configure python logger
amCap1712 Apr 27, 2021
45593bd
Test a hunch
amCap1712 Apr 27, 2021
03628bd
Configure base logger
amCap1712 Apr 27, 2021
5e13034
Add missing logger
amCap1712 Apr 27, 2021
34a272c
Move configuration to earlier phase
amCap1712 Apr 27, 2021
b006235
Another attempt at logging configuration
amCap1712 Apr 27, 2021
910ed62
remove pyspark dep
amCap1712 Apr 27, 2021
49a91ce
Add stop-request-consumer-container.sh script
amCap1712 Apr 27, 2021
c324ebe
Add metabrainz-spark-test image for use in tests
amCap1712 Apr 27, 2021
0349388
Add back deps
amCap1712 Apr 27, 2021
88c25cb
Copy config file correctly
amCap1712 Apr 27, 2021
76fbc3b
Fix file path and rearrange
amCap1712 Apr 28, 2021
dfbd064
Fix copying config file
amCap1712 May 1, 2021
2fb6eb9
Do not configure sentry in test
amCap1712 May 1, 2021
75e5f40
Dedup spark Dockerfile
amCap1712 May 2, 2021
09ae051
Install development dependencies
amCap1712 May 2, 2021
ca5a5bf
Remove pyspark dep
amCap1712 May 2, 2021
f476062
Set PYTHONPATH correctly
amCap1712 May 2, 2021
7b7c160
Add py4j to PYTHONPATH
amCap1712 May 2, 2021
4c6edbf
reformat file
amCap1712 May 2, 2021
a27e681
Fix SPARK_HOME
amCap1712 May 2, 2021
77a198b
Second attempt to fix SPARK_HOME
amCap1712 May 2, 2021
51de543
third attempt to fix SPARK_HOME
amCap1712 May 2, 2021
6d7a50a
Rearrange schema fields
amCap1712 May 2, 2021
65d8115
Rearrange schema fields - 2
amCap1712 May 2, 2021
63edc77
Rearrange schema fields - 3
amCap1712 May 2, 2021
245ec5a
Rearrange schema fields - 4
amCap1712 May 2, 2021
f94e95c
Add labels to Dockerfile.spark
amCap1712 May 5, 2021
f59c161
Add build-arg to push-request-consumer.sh
amCap1712 May 5, 2021
0276144
Delete obsolete scripts
amCap1712 May 5, 2021
036c978
Move remaining spark scripts a level up
amCap1712 May 5, 2021
5807dd4
Add default label to base
amCap1712 May 6, 2021
6d0f8e8
Add build arg after FROM as well
amCap1712 May 6, 2021
ac66b8d
Run spark-request-consumer without docker
amCap1712 May 7, 2021
2 changes: 1 addition & 1 deletion .github/workflows/frontend-tests.yml
@@ -25,7 +25,7 @@ jobs:
- uses: satackey/action-docker-layer-caching@v0.0.11
continue-on-error: true

- name: Build frontend tests
- name: Build frontend containers
run: ./test.sh fe -b

- name: Run frontend tests
3 changes: 3 additions & 0 deletions .github/workflows/spark-tests.yml
@@ -28,5 +28,8 @@ jobs:
- uses: satackey/action-docker-layer-caching@v0.0.11
continue-on-error: true

- name: Build spark containers
run: ./test.sh spark -b

- name: Run tests
run: ./test.sh spark
193 changes: 29 additions & 164 deletions Dockerfile.spark
@@ -1,148 +1,7 @@
ARG JAVA_VERSION=1.8
FROM airdock/oraclejdk:$JAVA_VERSION as metabrainz-spark-base

ARG GIT_COMMIT_SHA

LABEL org.label-schema.vcs-url="https://github.com/metabrainz/listenbrainz-server.git" \
org.label-schema.vcs-ref=$GIT_COMMIT_SHA \
org.label-schema.schema-version="1.0.0-rc1" \
org.label-schema.vendor="MetaBrainz Foundation" \
org.label-schema.name="ListenBrainz" \
org.metabrainz.based-on-image="airdock/oraclejdk:$JAVA_VERSION"
Collaborator:
We should add the labels in the new version. Also add the changes in #1424
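A minimal sketch of what carrying the labels over to the new base stage could look like, reusing the label keys from the removed block above and the metabrainz/python base image from the new FROM line (this mirrors the direction of the later commits "Add labels to Dockerfile.spark", "Add default label to base" and "Add build arg after FROM as well", which land after the 33 commits shown in this view):

# Sketch: labels re-applied on the new, slimmer base stage
FROM metabrainz/python:3.8-20210115 as metabrainz-spark-base

# ARG must be redeclared after FROM to be visible inside this stage
ARG GIT_COMMIT_SHA

LABEL org.label-schema.vcs-url="https://github.com/metabrainz/listenbrainz-server.git" \
      org.label-schema.vcs-ref=$GIT_COMMIT_SHA \
      org.label-schema.schema-version="1.0.0-rc1" \
      org.label-schema.vendor="MetaBrainz Foundation" \
      org.label-schema.name="ListenBrainz" \
      org.metabrainz.based-on-image="metabrainz/python:3.8-20210115"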


# Compile and install specific version of Python
# The jdk image comes with jessie which has python 3.4 which
# is not supported anymore. We install Python 3.6 here because
# 3.7 needs a version of OpenSSL that is not available in jessie
# Based on https://github.com/docker-library/python/blob/master/3.6/jessie/Dockerfile

# Ensure that local Python build is preferred over whatever might come with the base image
ENV PATH /usr/local/bin:$PATH

# http://bugs.python.org/issue19846
# > At the moment, setting "LANG=C" on a Linux system *fundamentally breaks Python 3*, and that's not OK.
ENV LANG C.UTF-8

# Runtime dependencies. This includes the core packages for all of the buildDeps listed
# below. We explicitly install them so that when we `remove --auto-remove` the dev packages,
# these packages stay installed.
RUN apt-get update \
&& apt-get install -y --no-install-recommends \
ca-certificates \
netbase \
git \
libbz2-1.0 \
libexpat1 \
libffi6 \
libgdbm3 \
liblzma5 \
libncursesw5 \
libreadline6 \
libsqlite3-0 \
libssl1.0.0 \
libuuid1 \
tcl \
tk \
zlib1g wget \
&& rm -rf /var/lib/apt/lists/*

ENV GPG_KEY 0D96DF4D4110E5C43FBFB17F2D347EA6AA65421D
ENV PYTHON_VERSION 3.6.9

# The list of build dependencies comes from the python-docker slim version:
# https://github.com/docker-library/python/blob/408f7b8130/3.7/stretch/slim/Dockerfile#L29
RUN set -ex \
&& buildDeps=' \
build-essential \
libbz2-dev \
libexpat1-dev \
libffi-dev \
libgdbm-dev \
liblzma-dev \
libncursesw5-dev \
libreadline-dev \
libsqlite3-dev \
libssl-dev \
tk-dev \
tcl-dev \
uuid-dev \
xz-utils \
zlib1g-dev \
' \
&& apt-get update \
&& apt-get install -y $buildDeps --no-install-recommends \
\
&& wget -O python.tar.xz "https://www.python.org/ftp/python/${PYTHON_VERSION%%[a-z]*}/Python-$PYTHON_VERSION.tar.xz" \
&& wget -O python.tar.xz.asc "https://www.python.org/ftp/python/${PYTHON_VERSION%%[a-z]*}/Python-$PYTHON_VERSION.tar.xz.asc" \
&& export GNUPGHOME="$(mktemp -d)" \
&& gpg --batch --keyserver ha.pool.sks-keyservers.net --recv-keys "$GPG_KEY" \
&& gpg --batch --verify python.tar.xz.asc python.tar.xz \
&& { command -v gpgconf > /dev/null && gpgconf --kill all || :; } \
&& rm -rf "$GNUPGHOME" python.tar.xz.asc \
&& mkdir -p /usr/src/python \
&& tar -xJC /usr/src/python --strip-components=1 -f python.tar.xz \
&& rm python.tar.xz \
\
&& cd /usr/src/python \
&& gnuArch="$(dpkg-architecture --query DEB_BUILD_GNU_TYPE)" \
&& ./configure \
--build="$gnuArch" \
--enable-loadable-sqlite-extensions \
--enable-shared \
--with-system-expat \
--with-system-ffi \
--without-ensurepip \
&& make -j "$(nproc)" \
&& make install \
&& ldconfig \
\
&& find /usr/local -depth \
\( \
\( -type d -a \( -name test -o -name tests \) \) \
-o \
\( -type f -a \( -name '*.pyc' -o -name '*.pyo' \) \) \
\) -exec rm -rf '{}' + \
&& rm -rf /usr/src/python \
\
&& apt-get purge -y --auto-remove $buildDeps \
&& rm -rf /var/lib/apt/lists/* \
\
&& python3 --version


# make some useful symlinks that are expected to exist
RUN cd /usr/local/bin \
&& ln -s idle3 idle \
&& ln -s pydoc3 pydoc \
&& ln -s python3 python \
&& ln -s python3-config python-config

# Install pip
ENV PYTHON_PIP_VERSION 21.0.1

RUN set -ex; \
\
wget -O get-pip.py 'https://bootstrap.pypa.io/get-pip.py'; \
\
python get-pip.py \
--disable-pip-version-check \
--no-cache-dir \
"pip==$PYTHON_PIP_VERSION" \
; \
pip --version; \
\
find /usr/local -depth \
\( \
\( -type d -a \( -name test -o -name tests \) \) \
-o \
\( -type f -a \( -name '*.pyc' -o -name '*.pyo' \) \) \
\) -exec rm -rf '{}' +; \
rm -f get-pip.py

FROM metabrainz/python:3.8-20210115 as metabrainz-spark-base

RUN apt-get update \
&& apt-get install -y --no-install-recommends \
scala \
wget \
net-tools \
dnsutils \
@@ -152,36 +11,42 @@ RUN apt-get update \
zip \
&& rm -rf /var/lib/apt/lists/*

RUN pip3 install pip==21.0.1

COPY requirements_spark.txt /requirements_spark.txt
RUN pip3 install -r /requirements_spark.txt

FROM metabrainz-spark-base as metabrainz-spark-prod
WORKDIR /rec
COPY . /rec

FROM metabrainz-spark-base as metabrainz-spark-test

ENV DOCKERIZE_VERSION v0.6.1
RUN wget https://github.com/jwilder/dockerize/releases/download/$DOCKERIZE_VERSION/dockerize-linux-amd64-$DOCKERIZE_VERSION.tar.gz \
&& tar -C /usr/local/bin -xzvf dockerize-linux-amd64-$DOCKERIZE_VERSION.tar.gz \
&& rm dockerize-linux-amd64-$DOCKERIZE_VERSION.tar.gz

COPY docker/apache-download.sh /apache-download.sh
ENV SPARK_VERSION 2.4.1
ENV HADOOP_VERSION 2.7
RUN cd /usr/local && \
/apache-download.sh spark/spark-$SPARK_VERSION/spark-$SPARK_VERSION-bin-hadoop$HADOOP_VERSION.tgz && \
tar xzf spark-$SPARK_VERSION-bin-hadoop$HADOOP_VERSION.tgz && \
ln -s spark-$SPARK_VERSION-bin-hadoop$HADOOP_VERSION spark

RUN mkdir /rec
WORKDIR /rec
COPY requirements_spark.txt /rec/requirements_spark.txt
RUN pip3 install -r requirements_spark.txt

FROM metabrainz-spark-base as metabrainz-spark-master
CMD /usr/local/spark/sbin/start-master.sh
WORKDIR /usr/local

FROM metabrainz-spark-base as metabrainz-spark-worker
CMD dockerize -wait tcp://spark-master:7077 -timeout 9999s /usr/local/spark/sbin/start-slave.sh spark://spark-master:7077
ENV JAVA_VERSION 11.0.11
ENV JAVA_BUILD_VERSION 9
RUN wget https://github.com/AdoptOpenJDK/openjdk11-binaries/releases/download/jdk-${JAVA_VERSION}%2B${JAVA_BUILD_VERSION}/OpenJDK11U-jdk_x64_linux_hotspot_${JAVA_VERSION}_${JAVA_BUILD_VERSION}.tar.gz \
&& tar xzf OpenJDK11U-jdk_x64_linux_hotspot_${JAVA_VERSION}_${JAVA_BUILD_VERSION}.tar.gz
ENV JAVA_HOME /usr/local/jdk-${JAVA_VERSION}+${JAVA_BUILD_VERSION}
ENV PATH $JAVA_HOME/bin:$PATH

FROM metabrainz-spark-base as metabrainz-spark-jobs
COPY . /rec
COPY docker/apache-download.sh /apache-download.sh
ENV SPARK_VERSION 3.1.1
ENV HADOOP_VERSION 3.2
RUN /apache-download.sh spark/spark-$SPARK_VERSION/spark-$SPARK_VERSION-bin-hadoop$HADOOP_VERSION.tgz \
&& tar xzf spark-$SPARK_VERSION-bin-hadoop$HADOOP_VERSION.tgz
ENV SPARK_HOME /usr/local/spark-$SPARK_VERSION-bin-hadoop$HADOOP_VERSION
ENV PATH $SPARK_HOME/bin:$PATH
ENV PYTHONPATH $SPARK_HOME/python/lib/py4j-0.10.9-src.zip:$SPARK_HOME/python:$PYTHONPATH

FROM metabrainz-spark-base as metabrainz-spark-dev
COPY . /rec
COPY requirements_development.txt /requirements_development.txt
RUN pip3 install -r /requirements_development.txt

FROM metabrainz-spark-base as metabrainz-spark-request-consumer
WORKDIR /rec
COPY . /rec
22 changes: 0 additions & 22 deletions Dockerfile.spark.newcluster

This file was deleted.

4 changes: 2 additions & 2 deletions docker/docker-compose.spark.test.yml
@@ -21,7 +21,7 @@ services:
build:
context: ..
dockerfile: Dockerfile.spark
target: metabrainz-spark-dev
command: dockerize -wait tcp://hadoop-master:9000 -timeout 60s bash -c "PYTHONDONTWRITEBYTECODE=1 python -m pytest -c pytest.spark.ini --junitxml=/data/test_report.xml --cov-report xml:/data/coverage.xml"
target: metabrainz-spark-test
command: dockerize -wait tcp://hadoop-master:9000 -timeout 60s bash -c "cp listenbrainz_spark/config.py.sample listenbrainz_spark/config.py; PYTHONDONTWRITEBYTECODE=1 python -m pytest -c pytest.spark.ini"
volumes:
- ..:/rec:z
2 changes: 1 addition & 1 deletion docker/spark-new-cluster/push-request-consumer.sh
@@ -2,5 +2,5 @@

cd "$(dirname "${BASH_SOURCE[0]}")/../../"

docker build -t metabrainz/listenbrainz-spark-new-cluster -f Dockerfile.spark.newcluster .
docker build --target metabrainz-spark-prod -t metabrainz/listenbrainz-spark-new-cluster -f Dockerfile.spark .
docker push metabrainz/listenbrainz-spark-new-cluster
3 changes: 1 addition & 2 deletions docker/spark-new-cluster/start-request-consumer-container.sh
@@ -12,7 +12,6 @@ docker pull metabrainz/listenbrainz-spark-new-cluster:latest
python3 -m venv pyspark_venv
source pyspark_venv/bin/activate
pip install -r requirements_spark.txt
pip uninstall pyspark py4j -y
pip install venv-pack
venv-pack -o pyspark_venv.tar.gz

@@ -42,4 +41,4 @@ docker run \
--conf "spark.executor.memory"=$EXECUTOR_MEMORY \
--conf "spark.driver.memory"=$DRIVER_MEMORY \
--py-files listenbrainz_spark_request_consumer.zip \
spark_manage.py request_consumer
spark_manage.py request_consumer
6 changes: 6 additions & 0 deletions docker/spark-new-cluster/stop-request-consumer-container.sh
@@ -0,0 +1,6 @@
#!/bin/bash

docker stop spark-request-consumer
docker rm spark-request-consumer
rm -r pyspark_venv pyspark_venv.tar.gz listenbrainz_spark_request_consumer.zip

14 changes: 12 additions & 2 deletions listenbrainz_spark/__init__.py
@@ -1,5 +1,15 @@
import logging

_handler = logging.StreamHandler()
_handler.setLevel(logging.INFO)
_formatter = logging.Formatter("%(asctime)s %(name)-20s %(levelname)-8s %(message)s")
_handler.setFormatter(_formatter)

_logger = logging.getLogger("listenbrainz_spark")
_logger.setLevel(logging.INFO)
_logger.addHandler(_handler)

import sentry_sdk
from sentry_sdk.integrations.spark import SparkIntegration

from py4j.protocol import Py4JJavaError
from pyspark.sql import SparkSession, SQLContext
@@ -19,7 +29,7 @@ def init_spark_session(app_name):
app_name (str): Name of the Spark application. This will also occur in the Spark UI.
"""
if hasattr(config, 'LOG_SENTRY'): # attempt to initialize sentry_sdk only if configuration available
sentry_sdk.init(**config.LOG_SENTRY, integrations=[SparkIntegration()])
sentry_sdk.init(**config.LOG_SENTRY)
global session, context, sql_context
try:
session = SparkSession \
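With this package-level handler in place, modules under listenbrainz_spark only need logging.getLogger(__name__); their records propagate up to the "listenbrainz_spark" logger configured above. A self-contained sketch of that behaviour (the child logger name and the message below are illustrative, not taken from the PR):

import logging

# Equivalent of the setup added in listenbrainz_spark/__init__.py
_handler = logging.StreamHandler()
_handler.setLevel(logging.INFO)
_handler.setFormatter(logging.Formatter("%(asctime)s %(name)-20s %(levelname)-8s %(message)s"))

_logger = logging.getLogger("listenbrainz_spark")
_logger.setLevel(logging.INFO)
_logger.addHandler(_handler)

# A module's logger is a child of the package logger, so its records
# propagate up and are emitted by the handler configured above.
logger = logging.getLogger("listenbrainz_spark.recommendations.recording.recommend")
logger.info("creating dataframes")  # illustrative message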
12 changes: 6 additions & 6 deletions listenbrainz_spark/config.py.sample
@@ -1,6 +1,6 @@
HDFS_HTTP_URI = 'http://leader:9870' # the URI of the http webclient for HDFS
HDFS_HTTP_URI = 'http://hadoop-master:9870' # the URI of the http webclient for HDFS

HDFS_CLUSTER_URI = 'hdfs://leader:9000' # the URI to be used with Spark
HDFS_CLUSTER_URI = 'hdfs://hadoop-master:9000' # the URI to be used with Spark

# rabbitmq
RABBITMQ_HOST = "rabbitmq"
@@ -19,10 +19,10 @@ SPARK_RESULT_QUEUE = "spark_result"
# calculate stats on X months data
STATS_CALCULATION_WINDOW = 1

LOG_SENTRY = {
'dsn':'',
'environment': 'development',
}
# LOG_SENTRY = {
# 'dsn':'',
# 'environment': 'development',
# }

# Model id is made up of two parts.
# String + UUID
1 change: 0 additions & 1 deletion listenbrainz_spark/recommendations/recording/recommend.py
@@ -34,7 +34,6 @@
from pyspark.sql.types import DoubleType
from pyspark.mllib.recommendation import MatrixFactorizationModel


logger = logging.getLogger(__name__)


@@ -1,9 +1,8 @@
from datetime import datetime
import sys
from listenbrainz_spark.tests import SparkTestCase
from listenbrainz_spark.recommendations.recording import candidate_sets
from listenbrainz_spark.recommendations.recording import create_dataframes
from listenbrainz_spark import schema, utils, config, path, stats
from listenbrainz_spark import utils, path, stats
from listenbrainz_spark.exceptions import (TopArtistNotFetchedException,
SimilarArtistNotFetchedException)

@@ -12,6 +11,7 @@
import pyspark.sql.functions as f
from pyspark.sql.types import StructField, StructType, StringType


class CandidateSetsTestClass(SparkTestCase):

recommendation_generation_window = 7
@@ -568,10 +568,10 @@ def test_explode_artist_collaborations(self):
def test_append_artists_from_collaborations(self, mock_explode, mock_read_hdfs):
top_artist_df = utils.create_dataframe(
Row(
mb_artist_credit_mbids=["6a70b322-9aa9-41b3-9dce-824733633a1c"],
top_artist_credit_id=2,
top_artist_name='kishorekumar',
user_name='vansika',
mb_artist_credit_mbids=["6a70b322-9aa9-41b3-9dce-824733633a1c"]
),
schema=None
)