Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rewrite the AniDB utility (together with a test!) #248

Merged
merged 9 commits into from Feb 8, 2017
Merged
Show file tree
Hide file tree
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Jump to
Jump to file
Failed to load files.
Diff view
Diff view
28 changes: 28 additions & 0 deletions mangaki/mangaki/management/commands/add_anidb.py
@@ -0,0 +1,28 @@
from django.core.management.base import BaseCommand, CommandError
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CommandError is unused.

from mangaki.utils.anidb import AniDB
from mangaki.models import Work, Category
from django.db.models import Count
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unused.

from urllib.parse import urlparse, parse_qs
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

parse_qs is unused.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't we find an automated thing that does this job?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, theorically, autoflake should do the job.

import sys
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unused.


def create_anime(**kwargs):
anime = Category.objects.get(slug='anime')
if 'anidb_aid' in kwargs:
return Work.objects.update_or_create(category=anime, anidb_aid=kwargs['anidb_aid'], defaults=kwargs)[0]
else:
return Work.objects.create(category=anime, **kwargs)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if we don't specify an anidb_aid and we create a duplicate by mistake?
Is there any escape hatch to prevent this behavior which is destroying our DB?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Destroying is not the correct word.
Actually, for this function only, the if is unnecessary.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right. Still, that does not answer the question, if anidb_aid is not specified, can we create a duplicate? (due to no constraints on the DB side or the model manager).


class Command(BaseCommand):
args = ''
help = 'Retrieve AniDB data'

def add_arguments(self, parser):
parser.add_argument('id', type=int)

def handle(self, *args, **options):
if options.get('id'):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

C'est possible que ça soit false?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pas tellement. Sécurité inutile.

print(options.get('id'))
anidb = AniDB('mangakihttp', 1)
anime = create_anime(**anidb.get(options.get('id')))
print(anime)
anime.retrieve_poster()
1 change: 1 addition & 0 deletions mangaki/mangaki/models.py
Expand Up @@ -141,6 +141,7 @@ def retrieve_poster(self, url=None, session=None):
return False

filename = os.path.basename(urlparse(url).path)
# Hé mais ça va pas écraser des posters / créer des collisions, ça ?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then, we need to handle name collisions.
A simple way to do it is:

  • save files with a random hash based on their content (md5, sha1, whatever) : poster_f182308f12830.png
  • link it to the DB
  • run a maintenance task which will prune the old posters which are not longer linked to a work to free up disk space.

If you encounter a collision, it means:

  • either, we have bad luck and the Birthday problem didn't like us.
  • either, it's exactly the same file, then, we overwrite, we don't care.

Generally, we should overwrite because we are lucky.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oui mais c'est l'objet d'une issue nice-to-have (vais créer). Et tu es pédant avec le paradoxe des anniversaires.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright, then wouldn't that be better if you put the issue link in the comment so that anyone can access to discussion and context about the issue?
Also, I was merely exposing my thoughts about the problem 😄

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Follow-up: #251


try:
r = session.get(url, timeout=5, stream=True)
Expand Down
25 changes: 25 additions & 0 deletions mangaki/mangaki/tests/test_anidb.py
@@ -0,0 +1,25 @@
from django.test import TestCase
from mangaki.models import Work, Category, Editor, Studio
from mangaki.utils.anidb import AniDB


class AniDBTest(TestCase):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd really prefer a TestCase which is not dependent on the network, nor on the AniDB API.

We should be able to provide a fixture of test AniDB responses and feed them to the test case.
Moreover, should we depend on the network, then we have to handle credentials on the CI server… yet another pain.

Mocking + fixturing this test case will be interesting and set an example for the codebase.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't mock. I will remove it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tests-chan :'(

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Follow-up: #250


def create_anime(self, **kwargs):
anime = Category.objects.get(slug='anime')
return Work.objects.create(category=anime, **kwargs)

def setUp(self):
# FIXME: The defaults for editor and studio in Work requires those to
# exist, or else foreign key constraints fail.
Editor.objects.create(pk=1)
Studio.objects.create(pk=1)
self.anidb = AniDB('mangakihttp', 1)

def test_anidb_search(self):
results = self.anidb.search(q='sangatsu no lion')
self.assertNotEqual(len(results), 0)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Quite basic as a test, could we try to see if the result contain really something similar to sangatsu no lion ?


def test_anidb_get(self):
anime = self.create_anime(**self.anidb.get(11606))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

11606 is really a fascinating number.

It is composed of three distinct primes factors: 2, 7, 829.

But, really, I am not convinced that we should rely on AniDB server-side IDs (e.g. though, we store them in our DB, agreed), I'd prefer if we could test random ID or rely on fixture responses rather than network responses.

self.assertNotEqual(anime.title, '')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if I just created the anime: ?

112 changes: 27 additions & 85 deletions mangaki/mangaki/utils/anidb.py
@@ -1,17 +1,12 @@
import requests
from bs4 import BeautifulSoup

from datetime import datetime


BASE_URL = "http://api.anidb.net:9001/httpapi"
SEARCH_URL = "http://anisearch.outrance.pl/"
PROTOCOL_VERSION = 1

try:
str = unicode
except:
pass
# str = str

class AniDB:
def __init__(self, client_id, client_ver=0):
Expand Down Expand Up @@ -45,12 +40,10 @@ def search(self, q):
results = []
animetitles = BeautifulSoup(r.text, 'xml').animetitles
for anime in animetitles.find_all('anime'):
results.append(Anime({
'id': int(anime['aid']),
'title': str(anime.find('title', attrs={'type': "official"}).string),
'picture': "http://img7.anidb.net/pics/anime/" + str(anime.find('picture').string),
}, partial=True, updater=lambda: self.get(anime.id)))

results.append(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It appears to me this is a good candidate to transform this function into a generator, if animetitles.find_all is also a generator, of course.

Otherwise, I'm wondering if it'd be more optimal to use a dict to store animes by IDs, anyway, this is okay.

(int(anime['aid']),
str(anime.find('title', attrs={'type': "official"}).string))
)
return results

def get(self, id):
Expand All @@ -62,81 +55,30 @@ def get(self, id):

r = self._request("anime", {'aid': id})
soup = BeautifulSoup(r.text.encode('utf-8'), 'xml') # http://stackoverflow.com/questions/31126831/beautifulsoup-with-xml-fails-to-parse-full-unicode-strings#comment50430922_31146912
"""with open('backup.xml', 'w') as f:
f.write(r.text)"""
with open('backup.xml', 'w') as f:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you explain why do you create this file, which purpose does it serve?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was for debug, bien vu!

f.write(r.text)
if soup.error is not None:
raise Exception(soup.error.string)

anime = soup.anime
titles = anime.titles

a = Anime({
'id': id,
'type': str(anime.type.string),
'episodecount': int(anime.episodecount.string),
'startdate': datetime(*list(map(int, anime.startdate.string.split("-")))),
'enddate': datetime(*list(map(int, anime.enddate.string.split("-")))),
'titles': [(
str(title.string),
title['type'] if 'type' in title else "unknown"
) for title in anime.find_all('title')],
'title': str(titles.find('title', attrs={'type': "main"}).string),
'relatedanime': [],
'url': str(anime.url.string) if anime.url else None,
'creators': anime.creators,
'description': str(anime.description.string),
'ratings': SmartDict({
'permanent': float(anime.ratings.permanent.string),
'temporary': float(anime.ratings.temporary.string),
'review': float(anime.ratings.review.string) if anime.ratings.review else ''
}),
'picture': "http://img7.anidb.net/pics/anime/" + str(anime.picture.string),
'categories': [],
'tags': [],
'characters': [],
'episodes': [],
})

self._cache[id] = a

all_titles = anime.titles
# creators = anime.creators
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a TODO marker or FIXME to let someone catch it while running ag 'TODO' mangaki/** when bored.

# episodes = anime.episodes
# tags = anime.tags
# characters = anime.characters
# ratings = anime.ratings.{permanent, temporary}

a = {
'title': str(all_titles.find('title', attrs={'type': "main"}).string),
'source': 'AniDB: ' + str(anime.url.string) if anime.url else None,
'ext_poster': 'http://img7.anidb.net/pics/anime/' + str(anime.picture.string),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hotlink is hot. Could we use urlparse.urljoin rather than concatenating manually? (just for extra safety!)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are talking in 2.7. from urllib.parse import urljoin.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm getting old. :'(

# 'nsfw': ?
'date': datetime(*list(map(int, anime.startdate.string.split("-")))),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we extract this magic method into a utility function which would be put at the top of this file?
So that we can easily reuse this datetime parsing logic for all fields which may require it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't remember where it comes from but yes.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I recall, it comes from the original library.

# not yet in model: 'enddate': datetime(*list(map(int, anime.enddate.string.split("-")))),
'synopsis': str(anime.description.string),
# 'artists': ? from anime.creators
'nb_episodes': int(anime.episodecount.string),
'anime_type': str(anime.type.string),
'anidb_aid': id
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is generally a bad idea to name a variable id, time for riddles: why?

(answer: because it's supposed to be a built-in Python function.)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. Now please merge!

}
return a

class SmartDict(dict):
def __init__(self, *a, **kw):
super(SmartDict, self).__init__(**kw)
for x in a:
try:
self.update(x)
except TypeError:
self.update(x.__dict__)
self.__dict__ = self

class Anime:
def __init__(self, data={}, partial=False, **kw):
self._data = data
self._partial = partial
if partial:
self._updater = kw['updater']

def _update(self):
if not self._partial:
raise Exception("Attempted to update a ready object")
else:
self._data = self._updater()._data
self._partial = False

def __getattr__(self, name):
if name in self._data:
return self._data[name]
else:
if self._partial:
self._update()
if name in self._data:
return self._data[name]
else:
raise AttributeError("no attribute called '%s'" % name)
else:
raise AttributeError("no attribute called '%s'" % name)

def __repr__(self):
return u'<Anime %i "%s">' % (self.id, self.title)