# Mangaki

## Confirmed hypotheses

- Manami has no duplicate references
- Two anime with the same AniDB ID are not necessarily the same anime on Mangaki (e.g. the Madoka films, Kara no Kyoukai)

## Hypotheses to confirm

- Two anime with the same MAL ID are the same anime on Mangaki

## To do

- Create WorkCluster fixtures

We have plenty of references:

In [1]:
Reference.objects.count()

12013

In [2]:
import json

with open('../../anime-offline-database/dead-entries.json') as f:
    dead = json.load(f)

def check_ref(ref):
    if ref.startswith('MAL:'):
        return int(ref[4:]) not in dead['mal']
    if ref.startswith('AniDB:'):
        return int(ref[6:]) not in dead['anidb']
    return None
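
As a rough illustration of the three possible answers from `check_ref`, here is a self-contained sketch on toy data. The `toy_dead` dictionary and the helper name `check_ref_toy` are made up for this example, and the assumption that `dead-entries.json` stores integer IDs under `'mal'` and `'anidb'` comes only from how `check_ref` indexes it.

```python
# Toy stand-in for dead-entries.json (assumed layout, not the real file).
toy_dead = {'mal': [16532, 13067], 'anidb': [42]}

def check_ref_toy(ref, dead):
    """Return False for a dead reference, True for a live one, None for an unknown source."""
    source, _, identifier = ref.partition(':')
    key = {'MAL': 'mal', 'AniDB': 'anidb'}.get(source)
    if key is None:
        return None
    return int(identifier) not in dead[key]

results = {
    ref: check_ref_toy(ref, toy_dead)
    for ref in ['MAL:16532', 'MAL:1033', 'AniDB:570', 'Animeka:12']
}
print(results)
```

Only the `False` case is flagged in the report, which is why a source other than MAL or AniDB is never marked as dead.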

In [3]:
Reference.objects.first().__dict__

{'_state': <django.db.models.base.ModelState at 0x7f7e809e7580>,
 'id': 1,
 'work_id': 325,
 'source': 'AniDB',
 'identifier': '570',
 'url': 'http://anidb.net/perl-bin/animedb.pl?show=anime&aid=570'}

In [4]:
from collections import Counter

Counter(Reference.objects.values_list('source', flat=True))

Counter({'AniDB': 120,
         'MAL': 11817,
         'VGMdb': 1,
         'Animeka': 57,
         'Manga-News': 17,
         'Icotaku': 1})

In [5]:
Reference.objects.filter(url__contains='ilist')

<QuerySet []>

## Works sharing the same title

In [6]:
import pandas as pd

def nunique_nonzero(series):
    return len(set(c for c in series if c))

df = pd.DataFrame(Work.objects.filter(category__slug='anime').values('id', 'title', 'anidb_aid'))
same_title = df.groupby('title').agg({'id': 'count', 'anidb_aid': nunique_nonzero}).query('id > 1')

print(f"{len(same_title)} anime clusters share the same title on Mangaki")
same_title.index.name = None
same_title

43 anime clusters share the same title on Mangaki


Unnamed: 0,id,anidb_aid
Deadman Wonderland,2,1
Desert Punk,3,0
Digimon: The Movie,2,0
Dimension W,3,0
Dororo,2,0
Fruits Basket,2,1
Gestalt,2,0
Ghost Talker's Daydream,2,0
Guin Saga,2,0
Hanayamata,2,0


In [7]:
VALID_MANGAKI_IDS = set(Work.objects.filter(category__slug='anime').values_list('id', flat=True))
# "Not valid" means the work is already an identified duplicate

## Same title + different AniDB IDs (phew, zero)

In [8]:
nb_with_anidb_aid = Work.objects.filter(anidb_aid__gt=0).count()
f'{nb_with_anidb_aid} works have an AniDB ID filled in'

'338 works have an AniDB ID filled in'

In [9]:
anidb_ids = pd.DataFrame(Reference.objects.filter(source='AniDB', work_id__category__slug='anime').values(
    'identifier', 'work_id', 'work_id__title', 'work_id__anidb_aid')).rename(
    columns={'work_id__title': 'title'})
anidb_ids.groupby('title').agg({
    'work_id': 'count',
    'identifier': pd.Series.nunique,
    'work_id__anidb_aid': nunique_nonzero
}).query(
    'identifier != work_id__anidb_aid').sort_values(['identifier', 'work_id'], ascending=(False, False))

Unnamed: 0_level_0,work_id,identifier,work_id__anidb_aid
title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Grandeek,1,1,0


Grandeek has an AniDB `Reference` object but its `anidb_aid` field is not filled in. This is not a big deal.
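
If it ever becomes worth fixing, one possible way to spot such out-of-sync works is sketched below. It runs on plain tuples rather than on the Django models, `find_unsynced` is a hypothetical helper, and the work ID `9999` and identifier `'1234'` are placeholders, not Grandeek's real values.

```python
def find_unsynced(anidb_references, anidb_aid_by_work):
    """Return {work_id: (reference_identifier, field_value)} where the two disagree."""
    unsynced = {}
    for work_id, identifier in anidb_references:
        field_value = anidb_aid_by_work.get(work_id, 0)
        # Reference identifiers are strings, the anidb_aid field is an integer.
        if str(field_value) != identifier:
            unsynced[work_id] = (identifier, field_value)
    return unsynced

# Toy data: (work_id, AniDB identifier) pairs and the anidb_aid field per work.
toy_references = [(325, '570'), (9999, '1234')]
toy_anidb_aid = {325: 570, 9999: 0}

unsynced = find_unsynced(toy_references, toy_anidb_aid)
print(unsynced)
```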

## Same title + different MAL IDs

In [13]:
mal_ids = pd.DataFrame(
    Reference.objects.filter(source='MAL', work_id__category__slug='anime',
                             work__redirect__isnull=True).exclude(identifier__in=dead['mal']).values(
    'identifier', 'work_id', 'work_id__title')).rename(
    columns={'work_id__title': 'title'})
mal_duplicates = mal_ids.groupby('title').agg({'work_id': 'count', 'identifier': pd.Series.nunique}).query(
    'identifier > 1').sort_values(['identifier', 'work_id'], ascending=(False, False))
print(f'{len(mal_duplicates)} clusters of works sharing the same title but with different MAL IDs')
mal_duplicates.index.name = None
mal_duplicates

11 clusters of works sharing the same title but with different MAL IDs


Unnamed: 0,work_id,identifier
The Asterisk War: The Academy City on the Water,9,2
Sorcerer Hunters,3,2
Digimon: The Movie,2,2
Dororo,2,2
Gangsta.,2,2
Gestalt,2,2
Karneval,2,2
Nanana's Buried Treasure,2,2
"Sankarea: I, Too, Am... A Zombie...",2,2
Spice and Wolf II,2,2


A list of 45 anime clusters + 35 outliers to match against Manami

```
same title | title to change x | mal_id | mangaki_id  | anidb_id | anilist_id
Berserk    | Berserk (2005)    | 253    | 25,12,85,32 |          |
Berserk    | Berserk           | 14     | 17,2,4      |          |
```

(- Produce the spreadsheet file if needed)

- While there are duplicates:
- Match with Manami via AniDB_ID & MAL_ID
- Suggest a new title from the Manami title (or its synonyms)
- Create WorkClusters on the server + Soubi helps
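
The title-suggestion step could look roughly like this. `suggest_title` is a hypothetical helper, and the toy data borrows the two Berserk entries that Manami returns further down in this notebook; assuming that both Mangaki works are currently titled 'Berserk' is a simplification for the example.

```python
def suggest_title(current_title, manami_title, manami_synonyms):
    """Return the first Manami name that differs from the colliding title, or None."""
    for candidate in [manami_title, *manami_synonyms]:
        if candidate.lower() != current_title.lower():
            return candidate
    return None

# Toy data: Mangaki ID -> (Manami title, a few Manami synonyms).
colliding = {
    3409: ('Kenpuu Denki Berserk', ['Berserk', 'Berserk: The Chronicles of Wind Blades']),
    14327: ('Berserk', ['Berserk (2016)', 'Berserk (Saison 1)']),
}

suggestions = {
    mangaki_id: suggest_title('Berserk', title, synonyms)
    for mangaki_id, (title, synonyms) in colliding.items()
}
print(suggestions)
```

Preferring the Manami title over its synonyms is a design choice of this sketch; a human would still review each suggestion before renaming anything.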

For example, there are 4 Berserk works that share the same title but have 2 different MAL IDs:

In [14]:
mal_ids.query('title == "Berserk"') # Manami ID reference

Unnamed: 0,identifier,work_id,title
3391,33,3409,Berserk


These have to be handled case by case. In fact, MAL ID 33 corresponds to Berserk (2016), which is surely rated lower than the original.

In [15]:
# mangaki_to_manami[18331]

In [16]:
EXCLUDED_MANAMI_IDS = {
    #21888,  # Uchiage Hanabi: should be 'Music' not 'Special'
    #5852,  # Gangsta Recap
    #5043,  # Time of Eve episodes
    #17567,  # PV Nanana no Maizoukin
}

In [17]:
# manami.from_title['Yoru wa Mijikashi Arukeyo Otome'.lower()]
# manami.from_title['Azumanga Daioh'.lower()], manami.from_synonym['Azumanga Daioh'.lower()]

In [18]:
# manami['data'][5852]

In [19]:
# manami.references[('MAL', '35456')]

In [20]:
# manami.from_synonym['charlotte']

In [21]:
# manami[2994]

In [22]:
pd.read_csv('dedupe_report.csv').query('problem == "Zero or multiple Manami"')

Unnamed: 0,problem,mangaki_title,refs,mangaki_id,manami_data,manami_data_from_title
0,Zero or multiple Manami,Beelzebub: Kaiketsu!! Beel-bo Meitantei Suiri,MAL:16532:X,3053,,"1817:Beelzebub Specials,1821:Beelzebub: Kaiket..."
1,Zero or multiple Manami,Beelzebub: Sakigake!! Beel to Shinsengumi,MAL:13067:X,1790,,"1817:Beelzebub Specials,1822:Beelzebub: Sakiga..."
2,Zero or multiple Manami,"Charlotte: Arata na ""Unmei"" no Hajimari",MAL:30987:X,12674,,
3,Zero or multiple Manami,Chokotto Kamen,MAL:32754:X,14411,,
21,Zero or multiple Manami,Kanojo x Kanojo x Kanojo Full Version,MAL:18597:X,12101,,


In [24]:
#title_map([20229], mangaki, manami)

In [33]:
# NEW! Manually merge Manami entries

manami_merge = {
    #'Princess Mononoke': mangaki_to_manami[60],
    #'My Teen Romantic Comedy': {23052, 23060},  # 128
    #'Tales from Earthsea': {5910, 5911},  # 2461
    #'Gunbuster': {21440, 21448},  # 3329
    #'Assassination Classroom: Jump Festa 2013 Special': {964, 979},  # 8858
    #'Assassination Classroom': {965, 966},  # 8859
    #'The Night is Short': {23363, 23364},  # 18416
    #'Azumanga Daioh': {1385, 1388},
    #'Eureka Seven': manami.from_synonym['eureka seven']  # 96
    'Macross Delta SP': title_map([15185], mangaki, manami),
    'Saru Kani Gassen': title_map([9259], mangaki, manami),
    'Shin Saru Kani Gassen': title_map([20229], mangaki, manami),
}
manami_map = {k: k for k in range(24369)}

for _, manami_ids in manami_merge.items():
    for manami_id in manami_ids:
        manami_map[manami_id] = min(manami_ids)

# Mangaki + Manami

In [34]:
import json

# Sources shared by Manami and Mangaki
COMMON_SOURCES = {'AniDB', 'MAL'}

def sanitize(s):
    return s.lower()

# We use 'AniDB' and 'MAL' to match the source names used by Mangaki
FROM_MANAMI_SOURCE = {
    'https://anidb.net/anime': 'AniDB',
    'https://anilist.co/anime': 'anilist.co',
    'https://kitsu.io/anime': 'kitsu.io',
    'https://myanimelist.net/anime': 'MAL',
    'https://notify.moe/anime': 'notify.moe',
}

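def parse_manami_source(source):
    # Split 'https://host/anime/<id>' into its URL prefix and trailing identifier
    parts = source.rsplit('/', 1)
    if len(parts) != 2:
        raise ValueError(f'Unexpected Manami source URL: {source!r}')

    source_name, identifier = parts
    return FROM_MANAMI_SOURCE[source_name], identifier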

class AnimeOfflineDatabase:
    def __init__(self, path: str, *, filter_sources=None):
        self._path = path
        self._filter_sources = filter_sources
        self.references = {}
        self.from_title = {}
        self.from_synonym = {}

        with open(self._path) as f:
            self._raw = json.load(f)

        for local_id, datum in enumerate(self._raw['data']):
            if local_id in EXCLUDED_MANAMI_IDS:
                continue
            # Parse sources
            for url in datum['sources']:
                source, identifier = parse_manami_source(url)
                if self._filter_sources is not None and source not in self._filter_sources:
                    continue

                self.references.setdefault((source, identifier), []).append(manami_map[local_id])
                
            # Setup title reverse search
            self.from_title.setdefault(sanitize(datum['title']), set()).add(manami_map[local_id])
            for synonym in datum['synonyms']:
                self.from_synonym.setdefault(sanitize(synonym), set()).add(manami_map[local_id])
            
        self._check()
        
    def _check(self):
        for (source, identifier), manami_entry in self.references.items():
            assert len(manami_entry) == 1, f"Multiple Manami entries with reference {source}/{identifier}"

    def __getitem__(self, key):
        return self._raw['data'][key]
    
    def __len__(self):
        return len(self._raw['data'])
    
    def print_summary(self):
        if self._filter_sources is None:
            filters = ''
        else:
            filters = ' ({} only)'.format(', '.join(self._filter_sources))

        print('Manami:')
        print(f'    {len(self)} animes (with {len(self.from_title)} unique titles)')
        print(f'    {len(self.references)} unique references{filters}')

manami = AnimeOfflineDatabase('../../anime-offline-database/anime-offline-database.json', filter_sources=COMMON_SOURCES)
manami.print_summary()

Manami:
    24369 animes (with 23867 unique titles)
    28491 unique references (AniDB, MAL only)
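
The consistency check that runs at load time (each reference must point to exactly one Manami entry) can be illustrated on toy data. This is only a sketch: `index_references` and `shared_references` are hypothetical names, the entries are invented, and the index is keyed on the raw URL prefix instead of the 'AniDB' / 'MAL' labels.

```python
from collections import defaultdict

def index_references(entries):
    """Map each (url_prefix, identifier) pair to the list of entry indices citing it."""
    index = defaultdict(list)
    for local_id, entry in enumerate(entries):
        for url in entry['sources']:
            prefix, identifier = url.rsplit('/', 1)
            index[(prefix, identifier)].append(local_id)
    return dict(index)

def shared_references(index):
    """Keep only the references cited by more than one entry."""
    return {ref: ids for ref, ids in index.items() if len(ids) > 1}

toy_entries = [
    {'title': 'A', 'sources': ['https://anidb.net/anime/7', 'https://myanimelist.net/anime/164']},
    {'title': 'B', 'sources': ['https://anidb.net/anime/7']},  # deliberately conflicting
    {'title': 'C', 'sources': ['https://myanimelist.net/anime/33']},
]

conflicts = shared_references(index_references(toy_entries))
print(conflicts)
```

An empty result on the real database is what the "no duplicate references" hypothesis amounts to.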


In [35]:
extra_ref = []
for mangaki_id, anidb_id in Work.objects.filter(anidb_aid__gt=0).values_list('id', 'anidb_aid'):
    extra_ref.append((mangaki_id, 'AniDB', str(anidb_id)))

In [36]:
class MangakiDatabase:
    def __init__(self, *, filter_sources=None):
        self._raw = {work['pk']: work for work in Work.objects.filter(category__slug='anime').values('pk', 'title')}
        self._filter_sources = filter_sources
        self.references = {}
        self.from_title = {}
        self.from_synonym = {}

        # Extract all Mangaki references
        redirects = dict(Work.all_objects.filter(redirect__isnull=False).values_list('pk', 'redirect_id'))

        # We use `Reference.objects` so that we also get duplicated works.  References on duplicated
        # works should have been moved to the cluster representative, but it doesn't hurt to be
        # conservative.
        qs = Reference.objects \
                .filter(work_id__category__slug='anime') \
                .values_list('work_id', 'source', 'identifier')
        if filter_sources is not None:
            qs = qs.filter(source__in=list(filter_sources))
        print(qs[:5])
        qs = set(list(qs) + extra_ref)

        for mangaki_id, source, identifier in qs:
            # Clean up known duplicates.  NB: These are bogus DB entries.
            # Yes JJ Kruskal I know.
            while mangaki_id in redirects:
                mangaki_id = redirects[mangaki_id]

            self._raw[mangaki_id].setdefault('sources', []).append((source, identifier))
            self.references.setdefault((source, identifier), set()).add(mangaki_id)
            
        self.missing_refs = {pk for pk, raw in self._raw.items() if 'sources' not in raw}
        
        for pk, work in self._raw.items():
            self.from_title.setdefault(sanitize(work['title']), set()).add(pk)
            
        for work_id, synonym in WorkTitle.objects.values_list('work_id', 'title'):
            while work_id in redirects:
                work_id = redirects[work_id]
            if work_id not in self._raw:
                continue

            self._raw[work_id].setdefault('synonyms', []).append(synonym)
            self.from_synonym.setdefault(sanitize(synonym), set()).add(work_id)
            
    def __getitem__(self, key):
        return self._raw[key]
    
    def __len__(self):
        return len(self._raw)
            
    def print_summary(self):
        if self._filter_sources is None:
            filters = ''
        else:
            filters = '{} '.format(', '.join(self._filter_sources))

        print('Mangaki:')
        print(f'    {len(self)} animes')
        print(f'    {len(self.references)} unique {filters}references')
        if self.missing_refs:
            print(f'    {len(self.missing_refs)} animes with no {filters}references')
            
mangaki = MangakiDatabase(filter_sources=COMMON_SOURCES)
mangaki.print_summary()

<QuerySet [(21453, 'AniDB', '14727'), (15117, 'MAL', '8795'), (325, 'AniDB', '570'), (325, 'MAL', '1033'), (21455, 'AniDB', '14523')]>
Mangaki:
    10978 animes
    10713 unique AniDB, MAL references
    6 animes with no AniDB, MAL references


In [37]:
class UnionFind:
    def __init__(self):
        self._parent = {}
        
    def union(self, first_id, *other_ids):
        root = self.find(first_id)
        for other_id in other_ids:
            other_root = self.find(other_id)
            if other_root != root:
                self._parent[other_root] = root
            
    def find(self, key):
        parent = self._parent.get(key)
        if parent is not None:
            return self.find(parent)
        
        return key
    
    def clusters(self):
        from_root = {}
        for key in self._parent.keys():
            root = self.find(key)
            from_root.setdefault(root, {root}).add(key)
            
        return {tuple(sorted(cluster)) for cluster in from_root.values()}
    
def build_clusters(mapping):
    uf = UnionFind()
    for _, identical in mapping.items():
        uf.union(*identical)
    return uf.clusters()

def remote_merge(from_db, to_db):
    # For each ID in from_db, collect the to_db IDs sharing at least one reference with it
    mapping = {}
    for ref_id, from_ids in from_db.references.items():
        to_ids = to_db.references.get(ref_id)
        if to_ids is not None:
            for from_id in from_ids:
                mapping.setdefault(from_id, set()).update(to_ids)

    return mapping

In [38]:
manami_to_mangaki = remote_merge(manami, mangaki)
mangaki_to_manami = remote_merge(mangaki, manami)

mangaki_clusters = build_clusters(manami_to_mangaki)
merged_animes = {c for cluster in mangaki_clusters for c in cluster}
mangaki_conflations = {mangaki_id for mangaki_id, manami_ids in mangaki_to_manami.items() if len(manami_ids) > 1}

print(f'{len(mangaki)-len(mangaki_to_manami)} ({(1-len(mangaki_to_manami)/len(mangaki))*100:.1f} %) animes not found in Manami')
print(f'{len(mangaki_conflations)} ({len(mangaki_conflations)/len(mangaki)*100.:.1f} %) animes in Mangaki with multiple Manami references')
print(f'{len(mangaki_clusters)} animes merged by Manami (concerns {len(merged_animes)} animes)')

46 (0.4 %) animes not found in Manami
61 (0.6 %) animes in Mangaki with multiple Manami references
464 animes merged by Manami (concerns 960 animes)
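
The clustering step is a union-find (disjoint-set) pass over the Manami-to-Mangaki mapping. The sketch below replays it on a toy mapping; `ToyUnionFind` is a simplified stand-in written for this example (no path compression, no union by rank), and the IDs are invented.

```python
class ToyUnionFind:
    def __init__(self):
        self.parent = {}

    def find(self, key):
        # Walk up to the root; a key without a parent is its own root.
        while key in self.parent:
            key = self.parent[key]
        return key

    def union(self, first, *others):
        root = self.find(first)
        for other in others:
            other_root = self.find(other)
            if other_root != root:
                self.parent[other_root] = root

    def clusters(self):
        # Only keys that were attached to another root appear here,
        # so singletons are never reported as clusters.
        groups = {}
        for key in self.parent:
            root = self.find(key)
            groups.setdefault(root, {root}).add(key)
        return {tuple(sorted(group)) for group in groups.values()}

# Toy mapping: Manami ID -> set of Mangaki IDs sharing a reference with it.
toy_mapping = {100: {1, 2}, 101: {2, 3}, 102: {4}}

uf = ToyUnionFind()
for mangaki_ids in toy_mapping.values():
    uf.union(*sorted(mangaki_ids))

toy_clusters = uf.clusters()
print(toy_clusters)
```

Works 1 and 3 never share a Manami entry directly, yet they end up in the same cluster through work 2, which is the reason for using union-find rather than a plain group-by.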


In [39]:
mangaki_to_manami[60]

{14001, 14002}

In [40]:
def title_map(cluster, from_db, to_db):
    all_ids = set()
    for from_id in cluster:
        title = sanitize(from_db[from_id]['title'])
        to_ids = to_db.from_title.get(title)
        if to_ids is not None:
            all_ids.update(to_ids)
        
        to_ids = to_db.from_synonym.get(title)
        if to_ids is not None:
            all_ids.update(to_ids)

    if len(all_ids) == 0:
        for from_id in cluster:
            obj = from_db[from_id]
            if 'synonyms' in obj:
                for synonym in obj['synonyms']:
                    synonym = sanitize(synonym)

                    to_ids = to_db.from_title.get(synonym)
                    if to_ids is not None:
                        all_ids.update(to_ids)
                        
                    to_ids = to_db.from_synonym.get(synonym)
                    if to_ids is not None:
                        all_ids.update(to_ids)
                        
    return all_ids

# new_refs are the new references to add to Mangaki
# not_found are the animes not found in Manami at all
# multiples are the animes found multiple times in Manami
before = len(set(mangaki._raw) - set(mangaki_to_manami))
new_refs = []
not_found = []
multiples = []
for missing in set(mangaki._raw) - set(mangaki_to_manami):
    manami_ids = title_map([missing], mangaki, manami)
    if not manami_ids:
        not_found.append(missing)

    if len(manami_ids) == 1:
        manami_id, = manami_ids
        mangaki_to_manami.setdefault(missing, set()).add(manami_id)
        manami_to_mangaki.setdefault(manami_id, set()).add(missing)

        for url in manami[manami_id]['sources']:
            source, identifier = parse_manami_source(url)
            if source in COMMON_SOURCES:
                new_refs.append((missing, source, identifier, url))
                
    if len(manami_ids) > 1:
        multiples.append((missing, manami_ids))

remaining = set(mangaki._raw) - set(mangaki_to_manami)
after = len(remaining)
print(f'Recovered {before - after}/{before} anime ({after} remaining, {len(not_found)} of which not found)')

Recovered 41/46 anime (5 remaining, 3 of which not found)
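
The title-based fallback tries the work's title first and only then its synonyms, each time against both Manami titles and Manami synonyms. Here is a minimal sketch of that lookup order; `lookup_by_name` and `match_work` are hypothetical names, and the two toy indexes only contain the Berserk entries printed elsewhere in this notebook.

```python
def lookup_by_name(names, title_index, synonym_index):
    """Collect every ID whose title or synonym matches one of the names (case-insensitive)."""
    found = set()
    for name in names:
        key = name.lower()
        found |= title_index.get(key, set())
        found |= synonym_index.get(key, set())
    return found

def match_work(title, synonyms, title_index, synonym_index):
    """Try the title first; fall back to the synonyms only when the title finds nothing."""
    matches = lookup_by_name([title], title_index, synonym_index)
    if not matches:
        matches = lookup_by_name(synonyms, title_index, synonym_index)
    return matches

# Toy indexes: lowercased name -> set of Manami IDs.
title_index = {'kenpuu denki berserk': {10255}, 'berserk': {1861}}
synonym_index = {'berserk': {10255}, 'berserk (2016)': {1861}}

ambiguous = match_work('Berserk', [], title_index, synonym_index)
via_synonym = match_work('Berserk 2016', ['Berserk (2016)'], title_index, synonym_index)
print(ambiguous, via_synonym)
```

A single match can be accepted automatically, while zero or several matches need manual review.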


In [41]:
#manami[13844]
mangaki_to_manami[10085]

{504}

In [42]:
#manami[501]

In [43]:
len(manami)

24369

In [44]:
from urllib.parse import urlparse

def get_tld(url):
    # Despite its name, this returns the full hostname (e.g. 'anidb.net'), not just the TLD
    return urlparse(url).hostname

def manami_to_dict(manami_id):
    entry = manami[manami_id]
    # Ensure there are no duplicate references from the same source
    assert Counter(get_tld(url) for url in entry['sources']).most_common(1)[0][1] == 1
    return {**{get_tld(url): url for url in entry['sources']}, **{'manami_title': f"{manami_id}:{entry['title']}"}}

mangaki_conflations = {mangaki_id for mangaki_id, manami_ids in mangaki_to_manami.items() if len(manami_ids) > 1}

def get_manami_ids(mangaki_id, desperate_search_by_title=False):
    manami_ids = mangaki_to_manami.get(mangaki_id)
    if manami_ids is None:
        if desperate_search_by_title:
            try:
                manami_ids = title_map([mangaki_id], mangaki, manami)
            except KeyError:
                print('Failed search by title', mangaki_id)
    return manami_ids

def handle_multiple_manami(mangaki_conflations):
    manami_duplicates = []
    for mangaki_id in sorted(mangaki_conflations):
        mangaki_work = mangaki[mangaki_id]
        manamis = get_manami_ids(mangaki_id, desperate_search_by_title=True)
        if not manamis:
            continue
        manamis = list(manamis)
        print(mangaki_id, manamis)
        manami_duplicates.append({**manami_to_dict(manamis[0]),
                                  **{'mangaki_id': mangaki_id, 'mangaki_title': mangaki_work['title']}})
        for manami_id in manamis[1:]:
            manami_duplicates.append(manami_to_dict(manami_id))
        print(mangaki_id, '|', mangaki_work['title'], '|',
            ' | '.join(
                '{} ( {} )'.format(manami[mid]['title'],' ; '.join(manami[mid]['sources']))
                for mid in manamis))
    df_dup_manami = pd.DataFrame.from_dict(manami_duplicates)
    return df_dup_manami[['mangaki_id', 'mangaki_title', 'manami_title', 'anidb.net', 'myanimelist.net', 'anilist.co', 'kitsu.io', 'notify.moe']].fillna('')

manami_dup = handle_multiple_manami(mangaki_conflations)
manami_dup

60 [14001, 14002]
60 | Princess Mononoke | Mononoke Hime ( https://kitsu.io/anime/142 ; https://myanimelist.net/anime/164 ; https://notify.moe/anime/FZltcFmig ) | Mononoke-hime ( https://anidb.net/anime/7 ; https://anilist.co/anime/164 )
128 [23389, 23383]
128 | My Teen Romantic Comedy SNAFU | Yahari Ore no Seishun LoveCome wa Machigatte Iru. ( https://anidb.net/anime/9310 ) | Yahari Ore no Seishun Love Comedy wa Machigatteiru. ( https://anilist.co/anime/14813 ; https://kitsu.io/anime/7169 ; https://myanimelist.net/anime/14813 ; https://notify.moe/anime/uthdpFmmg )
176 [6064, 12589]
176 | Puella Magi Madoka Magica the Movie Part 3: Rebellion | Gekijouban Mahou Shoujo Madoka Magica ( https://anidb.net/anime/8778 ) | Mahou Shoujo Madoka★Magica Movie 3: Hangyaku no Monogatari ( https://kitsu.io/anime/6638 ; https://myanimelist.net/anime/11981 ; https://notify.moe/anime/GBmHtKimR )
179 [6064, 12587]
179 | Puella Magi Madoka Magica the Movie Part 1: Beginnings | Gekijouban Mahou Shoujo Mado

Unnamed: 0,mangaki_id,mangaki_title,manami_title,anidb.net,myanimelist.net,anilist.co,kitsu.io,notify.moe
0,60,Princess Mononoke,14001:Mononoke Hime,,https://myanimelist.net/anime/164,,https://kitsu.io/anime/142,https://notify.moe/anime/FZltcFmig
1,,,14002:Mononoke-hime,https://anidb.net/anime/7,,https://anilist.co/anime/164,,
2,128,My Teen Romantic Comedy SNAFU,23389:Yahari Ore no Seishun LoveCome wa Machig...,https://anidb.net/anime/9310,,,,
3,,,23383:Yahari Ore no Seishun Love Comedy wa Mac...,,https://myanimelist.net/anime/14813,https://anilist.co/anime/14813,https://kitsu.io/anime/7169,https://notify.moe/anime/uthdpFmmg
4,176,Puella Magi Madoka Magica the Movie Part 3: Re...,6064:Gekijouban Mahou Shoujo Madoka Magica,https://anidb.net/anime/8778,,,,
...,...,...,...,...,...,...,...,...
118,,,533:Ajin 2nd Season,,https://myanimelist.net/anime/33253,https://anilist.co/anime/21799,https://kitsu.io/anime/12115,https://notify.moe/anime/cTmR2KiiR
119,14593,No Game No Life: Zero,15065:No Game No Life Zero,https://anidb.net/anime/12276,,https://anilist.co/anime/21875,,
120,,,15066:No Game No Life: Zero,,https://myanimelist.net/anime/33674,,https://kitsu.io/anime/12325,https://notify.moe/anime/0_Rk2KimR
121,20898,Maquia: When the Promised Flowers Bloom,18217:Sayonara no Asa ni Yakusoku no Hana o Ka...,https://anidb.net/anime/13262,,,,
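
Each row of this table is one Manami entry with its source URLs pivoted by hostname. A self-contained sketch of that pivot follows; `sources_by_host` is a hypothetical helper that raises instead of asserting, and the URLs are the ones printed above for Manami entry 14001.

```python
from collections import Counter
from urllib.parse import urlparse

def sources_by_host(sources):
    """Return {hostname: url}, refusing entries that cite the same host twice."""
    hosts = [urlparse(url).hostname for url in sources]
    duplicated = [host for host, count in Counter(hosts).items() if count > 1]
    if duplicated:
        raise ValueError(f'Several references from the same host: {duplicated}')
    return dict(zip(hosts, sources))

row = sources_by_host([
    'https://kitsu.io/anime/142',
    'https://myanimelist.net/anime/164',
    'https://notify.moe/anime/FZltcFmig',
])
print(row)
```

Hosts missing from an entry simply have no key, which is what ends up as an empty cell once the rows are collected into a DataFrame and filled with `''`.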


In [45]:
mangaki_to_manami[176]

{6064, 12589}

In [46]:
Work.all_objects.filter(title__contains='Sankarea').values_list('id', 'title', 'category__slug', 'redirect')

<WorkQuerySet [(1857, 'Sankarea', 'anime', 1236), (1859, 'Sankarea OVA', 'anime', None), (1236, 'Sankarea: I, Too, Am... A Zombie...', 'anime', None), (7210, 'Sankarea', 'manga', None)]>

# 14 anime with multiple references

In [47]:
manami_dup.to_csv('dup_manami.csv', index=False)

In [48]:
import pandas as pd

def nunique_nonzero(series):
    return len(set(c for c in series if c))

nb_missing = 0
data = []

local_merges = {c: cluster[0] for cluster in build_clusters(manami_to_mangaki) for c in cluster[1:]}

for work_id, work in mangaki._raw.items():
    title = work['title']
    
    # Skip already merged
    if work_id in local_merges:
        continue

    manami_ids = mangaki_to_manami.get(work_id)
    # print(work)
    if manami_ids is None:
        data.append({'id': work_id, 'title': title, 'manami_id': 0})
    else:
        for manami_id in manami_ids:
            data.append({'id': work_id, 'title': title, 'manami_id': manami_id})
df = pd.DataFrame(data)
print(df.shape)

duplicates = df.groupby('title').agg({'id': nunique_nonzero, 'manami_id': nunique_nonzero}).query('id > 1')
print(duplicates.index)

print(f"{len(duplicates)} anime clusters share the same title on Mangaki")
duplicates.index.name = None
duplicates

(10520, 3)
Index(['Digimon: The Movie', 'Dororo', 'Gestalt', 'Karneval',
       'Sorcerer Hunters', 'The Asterisk War: The Academy City on the Water'],
      dtype='object', name='title')
6 anime clusters share the same title on Mangaki


Unnamed: 0,id,manami_id
Digimon: The Movie,2,2
Dororo,2,2
Gestalt,2,2
Karneval,2,2
Sorcerer Hunters,2,2
The Asterisk War: The Academy City on the Water,2,2


In [49]:
dup_title_ids = list(Work.objects.filter(title__in=duplicates.index).values_list('id', flat=True))

In [50]:
for work in Work.objects.filter(title='Lupin the IIIrd: Chikemuri no Ishikawa Goemon').prefetch_related('reference_set'):
    print(work.title, work.reference_set.values('source', 'identifier'))

Lupin the IIIrd: Chikemuri no Ishikawa Goemon <QuerySet []>
Lupin the IIIrd: Chikemuri no Ishikawa Goemon <QuerySet [{'source': 'MAL', 'identifier': '34021'}]>


In [51]:
for mangaki_id in [3409, 14327]:
    for manami_id in mangaki_to_manami[mangaki_id]:
        print(manami_id, manami[manami_id]['title'], manami[manami_id]['synonyms'])

10255 Kenpuu Denki Berserk ['Berserk', 'Berserk: The Chronicles of Wind Blades', 'Kenfu Denki Berserk', 'Sword-Wind Chronicle Berserk', 'Берсерк', 'برسرک', 'นักรบวิปลาส', 'けんぷうでんきべるせるく', '剣風伝奇ベルセルク']
1861 Berserk ['Berserk (2016)', 'Berserk (Saison 1)', 'Берсерк (2016)', 'べるせるく', 'ベルセルク']


# 18 anime sharing 9 titles

In [52]:
manami[1]['sources']

def pick_anidb_from_manami(manami_id):
    for source in manami[manami_id]['sources']:
        if 'anidb' in source:
            return source[source.index('anidb.net/anime/') + len('anidb.net/anime/'):]
    return None

pick_anidb_from_manami(1)

'3689'
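
A slightly more general variant, not used in this notebook, could select the identifier by hostname with `urllib.parse` instead of substring search. `pick_identifier` is a hypothetical name; the URLs are the sources of Manami entry 20314 printed elsewhere in this notebook.

```python
from urllib.parse import urlparse

def pick_identifier(sources, hostname):
    """Return the last path segment of the first source hosted on `hostname`, or None."""
    for url in sources:
        parsed = urlparse(url)
        if parsed.hostname == hostname:
            return parsed.path.rstrip('/').rsplit('/', 1)[-1]
    return None

sources = [
    'https://anidb.net/anime/11897',
    'https://anilist.co/anime/118581',
    'https://kitsu.io/anime/11755',
    'https://myanimelist.net/anime/32521',
    'https://notify.moe/anime/fOoqtKimg',
]

anidb_id = pick_identifier(sources, 'anidb.net')
mal_id = pick_identifier(sources, 'myanimelist.net')
print(anidb_id, mal_id)
```

Matching on the exact hostname avoids picking up a URL that merely contains the string 'anidb' somewhere else.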

In [53]:
pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_rows', 1000)

def concatenate(data):
    return ','.join(str(entry) for entry in set(data))

def get_problem(mangaki_id):
    problems = []
    if mangaki_id in remaining:
        problems.append('Zero or multiple Manami')
    if mangaki_id in dup_title_ids:
        problems.append('Mangaki title collision with different Manami')
    elif mangaki_id in same_title_as_duplicates:
        problems.append('Mangaki title collision')
    return ', '.join(problems)

def get_manami_data(mangaki_id, desperate_search_by_title=False):
    manami_ids = get_manami_ids(mangaki_id, desperate_search_by_title)
    if manami_ids is None:
        return ''
    data = []
    for manami_id in manami_ids:
        entry = manami[manami_id]
        base = f"{manami_id}:{entry['title']}"
        anidb_id = pick_anidb_from_manami(manami_id)
        if anidb_id is not None:
            base += f':{anidb_id}'
        data.append(base)  # ','.join(entry['synonyms'])
    return ','.join(data)

def display_summary_from_ref(query, desperate_search_by_title=False):
    dup_title_references = Reference.objects.filter(work__redirect__isnull=True, **query).values_list('work_id', 'work__title', 'source', 'identifier')
    dup_title = pd.DataFrame(dup_title_references, columns=('mangaki_id', 'mangaki_title', 'source', 'identifier'))
    dup_title['manami_data'] = dup_title['mangaki_id'].map(lambda mangaki_id: get_manami_data(mangaki_id, desperate_search_by_title))
    dup_title = dup_title.groupby(['mangaki_title', 'source', 'identifier'])[['mangaki_id', 'manami_data']].agg(concatenate).reset_index()
    return dup_title

# display_summary_from_ref({'work__title__in': duplicates.index})
# Does not include works without refs

In [54]:
from django.db import connection

def display_check_ref(ref):
    base = f'{ref.source}:{ref.identifier}'
    if check_ref(base) is False:
        base += ':X'
    return base

def display_summary(mangaki_ids):
    # start_query = len(connection.queries)
    # print(start_query)
    results = []
    for work in Work.objects.filter(id__in=mangaki_ids).prefetch_related('reference_set'):
        entry = {
            'problem': get_problem(work.id),
            'mangaki_title': work.title,
            'mangaki_id': work.id,
            'refs': '; '.join([display_check_ref(ref) for ref in work.reference_set.all()]),
            'manami_data': get_manami_data(work.id),
            'manami_data_from_title': ''
        }
        if not entry['manami_data']:
            entry['manami_data_from_title'] = get_manami_data(work.id, desperate_search_by_title=True)
        results.append(entry)
    # print(len(connection.queries))
    return pd.DataFrame.from_dict(results).groupby(['problem', 'mangaki_title', 'refs'])[[
        'mangaki_id', 'manami_data', 'manami_data_from_title']].agg(
        concatenate).sort_values('mangaki_title').reset_index()

# display_summary(dup_title_ids)

In [55]:
# Use a dedicated counter so that the `not_found` list built earlier is not overwritten
nb_not_found = 0
for mangaki_id in remaining:
    manami_ids = title_map([mangaki_id], mangaki, manami)
    if not manami_ids:
        nb_not_found += 1
        continue
    print(Work.objects.get(id=mangaki_id), manami_ids)
print(nb_not_found, len(remaining))

Beelzebub: Kaiketsu!! Beel-bo Meitantei Suiri {1817, 1821}
Beelzebub: Sakigake!! Beel to Shinsengumi {1817, 1822}
3 5


In [56]:
remaining_titles = list(Work.objects.filter(id__in=remaining).values_list('title', flat=True))
same_title_as_remaining = (list(Work.objects.filter(category__slug='anime',
                                                    title__in=remaining_titles).values_list('id', flat=True)))
same_title_as_duplicates = list(Work.objects.filter(category__slug='anime',
                                                    title__in=same_title.index).values_list('id', flat=True))
problematic_ids = set(dup_title_ids) | set(same_title_as_remaining) | set(same_title_as_duplicates)
super_dedupe = display_summary(problematic_ids)

Failed search by title 7403
Failed search by title 5098
Failed search by title 5976


In [57]:
super_dedupe['problem'].value_counts()

Mangaki title collision                          41
Mangaki title collision with different Manami    15
Zero or multiple Manami                           5
Name: problem, dtype: int64

In [58]:
super_dedupe.to_csv('dedupe_report.csv', index=False)

In [59]:
len(super_dedupe)

61

In [60]:

#super_dedupe['is_dead'] = super_dedupe['refs'].map(check_ref)

In [61]:
# manami[4297]

In [62]:
Work.objects.filter(title='Free!').values_list('anidb_aid')

<WorkQuerySet [(9858,)]>

In [63]:
Work.objects.get(title='Free!').reference_set.values('source')

<QuerySet [{'source': 'Animeka'}]>

In [64]:
super_dedupe.query('problem != "Mangaki title collision"').shape

(20, 6)

In [65]:
manami[1822]

{'sources': ['https://anilist.co/anime/13067',
  'https://kitsu.io/anime/6875',
  'https://notify.moe/anime/bcZDpKimg'],
 'title': 'Beelzebub: Sakigake!! Beel to Shinsengumi',
 'type': 'Special',
 'episodes': 7,
 'status': 'FINISHED',
 'animeSeason': {'season': 'WINTER', 'year': 2011},
 'picture': 'https://s4.anilist.co/file/anilistcdn/media/anime/cover/medium/13067.jpg',
 'thumbnail': 'https://s4.anilist.co/file/anilistcdn/media/anime/cover/medium/default.jpg',
 'synonyms': ['Beelzebub: Sakigake!! Beel and Shinsengumi',
  'べるぜバブ: 魁！！ベルっと新撰組'],
 'relations': ['https://anilist.co/anime/9513'],
 'tags': ['action',
  'comedy',
  'demon',
  'demons',
  'fantasy',
  'school',
  'school life',
  'shounen',
  'supernatural']}

In [289]:
# 3053: https://anilist.co/anime/16532, manami[1821] 
# 1790: https://anilist.co/anime/13067, manami[1822]

In [290]:
manami.from_synonym['kanojo x kanojo x kanojo'.lower()]

{9929}

In [291]:
manami[9929]

{'sources': ['https://myanimelist.net/anime/7411'],
 'title': 'Kanojo x Kanojo x Kanojo: Sanshimai to no DokiDoki Kyoudou Seikatsu',
 'type': 'OVA',
 'episodes': 3,
 'status': 'FINISHED',
 'animeSeason': {'season': 'FALL', 'year': 2009},
 'picture': 'https://cdn.myanimelist.net/images/anime/2/17720.jpg',
 'thumbnail': 'https://cdn.myanimelist.net/images/anime/2/17720t.jpg',
 'synonyms': ['Kanojo x Kanojo x Kanojo', '彼女×彼女×彼女～三姉妹とのドキドキ共同生活～'],
 'relations': [],
 'tags': ['harem', 'hentai']}

In [292]:
# manami_to_mangaki[1996]
manami.references['AniDB', '11897']

[20314]

In [293]:
manami[20314]

{'sources': ['https://anidb.net/anime/11897',
  'https://anilist.co/anime/118581',
  'https://kitsu.io/anime/11755',
  'https://myanimelist.net/anime/32521',
  'https://notify.moe/anime/fOoqtKimg'],
 'title': 'Super Mario World: Mario to Yoshi no Bouken Land',
 'type': 'OVA',
 'episodes': 1,
 'status': 'FINISHED',
 'animeSeason': {'season': 'WINTER', 'year': 1991},
 'picture': 'https://cdn.myanimelist.net/images/anime/12/78065.jpg',
 'thumbnail': 'https://cdn.myanimelist.net/images/anime/12/78065t.jpg',
 'synonyms': ["Super Mario World: Mario & Yoshi's Adventure Land",
  'スーパーマリオワールド マリオとヨッシーの冒険ランド'],
 'relations': ['https://anidb.net/anime/5551',
  'https://anilist.co/anime/4389',
  'https://kitsu.io/anime/3585',
  'https://myanimelist.net/anime/4389',
  'https://notify.moe/anime/fyYucKimR'],
 'tags': ['comedy', 'game', 'kids', 'video games', 'virtual reality']}

In [66]:
super_dedupe.query('problem != "Mangaki title collision"')#.to_csv('dedupe_report.csv', index=False)

Unnamed: 0,problem,mangaki_title,refs,mangaki_id,manami_data,manami_data_from_title
0,Zero or multiple Manami,Beelzebub: Kaiketsu!! Beel-bo Meitantei Suiri,MAL:16532:X,3053,,"1817:Beelzebub Specials,1821:Beelzebub: Kaiketsu!! Beel-bo Meitantei Suiri"
1,Zero or multiple Manami,Beelzebub: Sakigake!! Beel to Shinsengumi,MAL:13067:X,1790,,"1817:Beelzebub Specials,1822:Beelzebub: Sakigake!! Beel to Shinsengumi"
2,Zero or multiple Manami,"Charlotte: Arata na ""Unmei"" no Hajimari",MAL:30987:X,12674,,
3,Zero or multiple Manami,Chokotto Kamen,MAL:32754:X,14411,,
7,Mangaki title collision with different Manami,Digimon: The Movie,MAL:2397,1513,4127:Digimon Adventure: Bokura no War Game!:1149,
8,Mangaki title collision with different Manami,Digimon: The Movie,MAL:2961,10207,4117:Digimon Adventure Movie:1148,
10,Mangaki title collision with different Manami,Dororo,MAL:5760,11406,4425:Dororo to Hyakkimaru:1119,
11,Mangaki title collision with different Manami,Dororo,MAL:11471,14019,4424:Dororo Pilot,
12,Mangaki title collision with different Manami,Dororo,,5098,,
15,Mangaki title collision with different Manami,Gestalt,MAL:23061,2353,7730:Heya/Keitai,


In [68]:
super_dedupe.query('problem == "Zero or multiple Manami"')#.to_csv('dedupe_report.csv', index=False)

Unnamed: 0,problem,mangaki_title,refs,mangaki_id,manami_data,manami_data_from_title
0,Zero or multiple Manami,Beelzebub: Kaiketsu!! Beel-bo Meitantei Suiri,MAL:16532:X,3053,,"1817:Beelzebub Specials,1821:Beelzebub: Kaiketsu!! Beel-bo Meitantei Suiri"
1,Zero or multiple Manami,Beelzebub: Sakigake!! Beel to Shinsengumi,MAL:13067:X,1790,,"1817:Beelzebub Specials,1822:Beelzebub: Sakigake!! Beel to Shinsengumi"
2,Zero or multiple Manami,"Charlotte: Arata na ""Unmei"" no Hajimari",MAL:30987:X,12674,,
3,Zero or multiple Manami,Chokotto Kamen,MAL:32754:X,14411,,
21,Zero or multiple Manami,Kanojo x Kanojo x Kanojo Full Version,MAL:18597:X,12101,,


In [69]:
ALL_MANAMI = set(range(len(manami)))
covered_by_mangaki = set()
nb_clusters = 0
for manami_ids in mangaki_to_manami.values():
    if len(manami_ids) >= 2:
        nb_clusters += 1
    covered_by_mangaki.update(manami_ids)
len(covered_by_mangaki), nb_clusters

(10536, 61)

In [70]:
covered_by_manami = set()
nb_clusters_mangaki = 0
for manami_id, mangaki_ids in manami_to_mangaki.items():
    if len(mangaki_ids) >= 2:
        nb_clusters_mangaki += 1
    covered_by_manami.add(manami_id)
len(covered_by_manami), nb_clusters_mangaki

(10536, 470)

In [72]:
covered_by_mangaki == covered_by_manami

True
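
The equality above is what one would expect if `manami_to_mangaki` is simply the inverse of `mangaki_to_manami`. The following is a minimal sketch of that inversion on toy data; the dictionary contents and the `invert` helper are illustrative assumptions, not the mappings built earlier in the notebook.

```python
from collections import defaultdict

# Toy stand-in for mangaki_to_manami: Mangaki work ID -> set of Manami indices.
# The IDs and links below are illustrative only.
toy_mangaki_to_manami = {
    39: {3833},
    14198: {3833},        # two Mangaki works pointing to the same Manami entry
    1374: {9868, 9869},   # one Mangaki work linked to two Manami entries
}

def invert(mapping):
    """Return the inverse of a dict of sets (value -> set of keys)."""
    inverse = defaultdict(set)
    for key, values in mapping.items():
        for value in values:
            inverse[value].add(key)
    return dict(inverse)

toy_manami_to_mangaki = invert(toy_mangaki_to_manami)

# Manami entries reachable from Mangaki are exactly the keys of the inverse.
covered_from_mangaki = set().union(*toy_mangaki_to_manami.values())
covered_from_manami = set(toy_manami_to_mangaki)
print(covered_from_mangaki == covered_from_manami)

# A "cluster" is an item linked to at least two items on the other side.
nb_clusters_toy = sum(len(v) >= 2 for v in toy_mangaki_to_manami.values())
nb_clusters_mangaki_toy = sum(len(v) >= 2 for v in toy_manami_to_mangaki.values())
print(nb_clusters_toy, nb_clusters_mangaki_toy)
```

The two cluster counts need not match even though the covered sets do, which is consistent with the 61 versus 470 figures above.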

In [78]:
covered_by_manami.update([1821, 1822])

In [79]:
len(ALL_MANAMI - covered_by_manami)

13831

In [80]:
list(ALL_MANAMI - covered_by_manami)[:5]

[5, 6, 7, 8, 10]

In [81]:
Reference.objects.filter(id=1).values()

<QuerySet [{'id': 1, 'work_id': 325, 'source': 'AniDB', 'identifier': '570', 'url': 'http://anidb.net/perl-bin/animedb.pl?show=anime&aid=570'}]>

In [83]:
ADDED_WORK_IDS = []
ADDED_REF_IDS = []
ADDED_WORK_TITLE_IDS = []

def insert_manami_to_mangaki(manami_id):
    entry = manami[manami_id]
    # Parse references
    anidb_aid = None
    refs = []
    for url in entry['sources']:
        source, identifier = parse_manami_source(url)
        if source == 'AniDB':
            anidb_aid = identifier
        refs.append(Reference(source=source, identifier=identifier, url=url))

    # Create work
    work = Work.objects.create(
        title=entry['title'],
        category_id=1,
        anime_type=entry['type'],
        ext_poster=entry['picture'],
        nb_episodes=entry['episodes'],
        anidb_aid=anidb_aid if anidb_aid is not None else 0,
    )
    ADDED_WORK_IDS.append(work.id)
    
    # Save references
    for ref in refs:
        ref.work_id = work.id
        ref.save()
        ADDED_REF_IDS.append(ref.id)
    
    # Create synonyms (tags are skipped: they are tedious to import because of the mandatory AniDB ID)
    for synonym in entry['synonyms']:
        worktitle = WorkTitle.objects.create(work_id=work.id, title=synonym, type='synonym')
        ADDED_WORK_TITLE_IDS.append(worktitle.id)
    
    #print(work.id, work)
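
`insert_manami_to_mangaki` relies on `parse_manami_source`, which is not shown in this part of the notebook. A minimal sketch of what such a helper could look like follows; the function name `parse_manami_source_sketch` and the source labels other than `AniDB` and `MAL` are assumptions made for illustration.

```python
from urllib.parse import urlparse

# Domain -> source label. 'AniDB' and 'MAL' are the labels already used by
# Reference.source; the other labels are hypothetical.
SOURCE_OF_DOMAIN = {
    'anidb.net': 'AniDB',
    'myanimelist.net': 'MAL',
    'anilist.co': 'AniList',
    'kitsu.io': 'Kitsu',
    'notify.moe': 'notify.moe',
}

def parse_manami_source_sketch(url):
    """Split a Manami source URL into (source label, identifier)."""
    parsed = urlparse(url)
    # Unknown domains fall back to the domain itself as the label.
    source = SOURCE_OF_DOMAIN.get(parsed.netloc, parsed.netloc)
    # The identifier is the last path component, e.g. '/anime/11897' -> '11897'.
    identifier = parsed.path.rstrip('/').rsplit('/', 1)[-1]
    return source, identifier

print(parse_manami_source_sketch('https://anidb.net/anime/11897'))
print(parse_manami_source_sketch('https://notify.moe/anime/fOoqtKimg'))
```

Identifiers are kept as strings because some sources (notify.moe here) do not use numeric IDs.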

In [84]:
%%time
for manami_id in ALL_MANAMI - covered_by_manami:
    insert_manami_to_mangaki(manami_id)
len(ADDED_WORK_IDS), len(ADDED_REF_IDS), len(ADDED_WORK_TITLE_IDS)

CPU times: user 30 s, sys: 1.48 s, total: 31.5 s
Wall time: 1min 8s


(13831, 29124, 33291)

In [87]:
max(ADDED_REF_IDS)

42755

In [86]:
from django.core import serializers

with open('manamiworks.json', 'w') as f:
    f.write(serializers.serialize('json', Work.objects.filter(id__in=ADDED_WORK_IDS)))
with open('manamirefs.json', 'w') as f:
    f.write(serializers.serialize('json', Reference.objects.filter(id__in=ADDED_REF_IDS)))
with open('manamisynonyms.json', 'w') as f:
    f.write(serializers.serialize('json', WorkTitle.objects.filter(id__in=ADDED_WORK_TITLE_IDS)))

In [371]:
#Tag.objects.get_or_create(title='delinquents')

In [367]:
for tag in ['basketball', 'bullying', 'comedy', 'delinquents', 'drama', 'high school', 'male protagonist', 'manga', 'primarily male cast', 'school', 'school club', 'school clubs', 'school life', 'shounen', 'sports', 'tragedy', 'twins']:
    print(Tag.objects.filter(title=tag))

<QuerySet [<Tag: basketball>]>
<QuerySet [<Tag: bullying>]>
<QuerySet [<Tag: comedy>]>
<QuerySet []>
<QuerySet []>
<QuerySet [<Tag: high school>]>
<QuerySet [<Tag: male protagonist>]>
<QuerySet [<Tag: manga>]>
<QuerySet []>
<QuerySet []>
<QuerySet []>
<QuerySet [<Tag: school clubs>]>
<QuerySet [<Tag: school life>]>
<QuerySet [<Tag: shounen>]>
<QuerySet [<Tag: sports>]>
<QuerySet [<Tag: tragedy>]>
<QuerySet []>
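
Reading the output above line by line is error-prone. The sketch below lists the missing tags directly; the `existing` set is transcribed by hand from the non-empty QuerySets above, which is an assumption of this example (in the notebook itself, a single `Tag.objects.filter(title__in=wanted)` query would presumably give the same set).

```python
wanted = ['basketball', 'bullying', 'comedy', 'delinquents', 'drama',
          'high school', 'male protagonist', 'manga', 'primarily male cast',
          'school', 'school club', 'school clubs', 'school life', 'shounen',
          'sports', 'tragedy', 'twins']

# Titles for which the lookup above returned a Tag (transcribed from the output).
existing = {'basketball', 'bullying', 'comedy', 'high school',
            'male protagonist', 'manga', 'school clubs', 'school life',
            'shounen', 'sports', 'tragedy'}

# Tags that would have to be created before they can be attached to a work.
missing = [tag for tag in wanted if tag not in existing]
print(missing)
```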


In [383]:


categories = Counter()
LIMIT = 5
c = 0
for manami_id in ALL_MANAMI - covered_by_manami:
    entry = manami[manami_id]
    categories[f"type_{entry['type']}"] += 1
    categories[f"status_{entry['status']}"] += 1
    if entry['type'] == 'TV' and entry['status'] == 'CURRENTLY' and entry['tags']:
        c += 1
        print(manami_id, entry)
        if c >= LIMIT:
            break

for category in sorted(categories):
    print(category, ':', categories[category])

408 {'sources': ['https://anidb.net/anime/13848', 'https://anilist.co/anime/101239', 'https://kitsu.io/anime/41399', 'https://myanimelist.net/anime/37403', 'https://notify.moe/anime/Ig-64AvmR'], 'title': 'Ahiru no Sora', 'type': 'TV', 'episodes': 50, 'status': 'CURRENTLY', 'animeSeason': {'season': 'FALL', 'year': 2019}, 'picture': 'https://cdn.myanimelist.net/images/anime/1505/102719.jpg', 'thumbnail': 'https://cdn.myanimelist.net/images/anime/1505/102719t.jpg', 'synonyms': ['آهيرو نو سورا', 'あひるのそら', 'あひるの空', '鸭子的天空'], 'relations': [], 'tags': ['basketball', 'bullying', 'comedy', 'delinquents', 'drama', 'high school', 'male protagonist', 'manga', 'primarily male cast', 'school', 'school club', 'school clubs', 'school life', 'shounen', 'sports', 'tragedy', 'twins']}
625 {'sources': ['https://myanimelist.net/anime/35226'], 'title': 'Akita Kenritsu Iburi Gakkou Chuutou-bu', 'type': 'TV', 'episodes': 0, 'status': 'CURRENTLY', 'animeSeason': {'season': 'SPRING', 'year': 2017}, 'picture': 

In [380]:
Counter(Work.objects.values_list('anime_type', flat=True))

Counter({'': 15282,
         'TV Series': 23,
         'Movie': 20,
         'Web': 3,
         'Music Video': 2,
         'Série': 77,
         'VOD': 21,
         'Film': 30,
         ' ': 8,
         'série': 4,
         'Shonen': 1,
         'Oav': 4,
         'Mystère': 1,
         'film': 6,
         'Action - Aventure - Comédie': 1})
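
The existing `anime_type` values are free-form, whereas Manami uses a small fixed vocabulary (`TV`, `Movie`, `OVA`, `Special` in the entries shown in this notebook). One possible way to reconcile them is sketched below; the `CANONICAL` mapping and `normalize_anime_type` are hypothetical and only cover the obvious cases.

```python
from collections import Counter

# Counts copied from the output above.
observed = Counter({'': 15282, 'TV Series': 23, 'Movie': 20, 'Web': 3,
                    'Music Video': 2, 'Série': 77, 'VOD': 21, 'Film': 30,
                    ' ': 8, 'série': 4, 'Shonen': 1, 'Oav': 4, 'Mystère': 1,
                    'film': 6, 'Action - Aventure - Comédie': 1})

# Hypothetical mapping from lowercased Mangaki values to Manami-style types.
CANONICAL = {
    'tv series': 'TV',
    'série': 'TV',
    'movie': 'Movie',
    'film': 'Movie',
    'oav': 'OVA',
}

def normalize_anime_type(raw):
    """Map a free-form type to a canonical one; leave unknown values untouched."""
    key = raw.strip().lower()
    if not key:
        return ''
    return CANONICAL.get(key, raw.strip())

normalized = Counter()
for raw, count in observed.items():
    normalized[normalize_anime_type(raw)] += count

for anime_type in sorted(normalized):
    print(repr(anime_type), normalized[anime_type])
```

Values such as `Shonen` or `Mystère` are genres rather than types and are deliberately left as-is for manual review.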

In [349]:
max(Work.objects.values_list('id', flat=True))

21462

In [338]:
new_work = Work(id=21462, title='Test')
serializers.serialize('json', [new_work])

'[{"model": "mangaki.work", "pk": 21462, "fields": {"redirect": null, "title": "Test", "source": "", "ext_poster": "", "int_poster": "", "nsfw": false, "date": null, "end_date": null, "synopsis": "", "ext_synopsis": "", "category": null, "origin": "", "nb_episodes": "Inconnu", "anime_type": "", "vo_title": "", "manga_type": "", "catalog_number": "", "anidb_aid": 0, "vgmdb_aid": null, "editor": null, "studio": null, "sum_ratings": 0, "nb_ratings": 0, "nb_likes": 0, "nb_dislikes": 0, "controversy": 0, "title_search": "", "genre": []}}]'

In [333]:
with open('../fixtures/seed_data.json') as f:
    print(json.dumps(json.load(f)[3], indent=4))

{
    "model": "mangaki.work",
    "pk": 1,
    "fields": {
        "redirect": null,
        "title": "Death Note",
        "source": "",
        "ext_poster": "http://myanimelist.cdn-dena.com/images/anime/9/9453.jpg",
        "int_poster": "",
        "nsfw": false,
        "date": null,
        "end_date": null,
        "synopsis": "Light Yagami est un lyc\u00e9en \u00e2g\u00e9 de 17 ans, jeune homme brillant, fils d'un policier, il d\u00e9couvre un \u00e9trange carnet qui se r\u00e9v\u00e8le \u00eatre le livre d'un dieu de la mort : Ry\u00fbk ! Light apprendra vite quels terribles pouvoirs renferment ce carnet : tous ceux dont le nom est inscrit dans le Death Note sont appel\u00e9s \u00e0 mourir dans les 40 secondes qui suivent !Les implications sont \u00e9normes et en possession d'un tel carnet Light est potentiellement capable d'imposer sa loi \u00e0 un monde qu'il estime perverti. Mais peut-on choisir qui va vivre et qui va mourir ? Certaines personnes m\u00e9ritent-elles de mou

In [323]:
from django.core import serializers

print(json.dumps(json.loads(serializers.serialize('json', Work.objects.filter(id=1))), indent=4))

[
    {
        "model": "mangaki.work",
        "pk": 1,
        "fields": {
            "redirect": null,
            "title": "Death Note",
            "source": "",
            "ext_poster": "http://cdn.myanimelist.net/images/anime/9/9453.jpg",
            "int_poster": "",
            "nsfw": false,
            "date": null,
            "end_date": null,
            "synopsis": "Light Yagami est un lyc\u00e9en \u00e2g\u00e9 de 17 ans, jeune homme brillant, fils d'un policier, il d\u00e9couvre un \u00e9trange carnet qui se r\u00e9v\u00e8le \u00eatre le livre d'un dieu de la mort : Ry\u00fbk ! Light apprendra vite quels terribles pouvoirs renferment ce carnet : tous ceux dont le nom est inscrit dans le Death Note sont appel\u00e9s \u00e0 mourir dans les 40 secondes qui suivent !Les implications sont \u00e9normes et en possession d'un tel carnet Light est potentiellement capable d'imposer sa loi \u00e0 un monde qu'il estime perverti. Mais peut-on choisir qui va vivre et qui va mourir

In [299]:
Work.objects.filter(id__in=[10603, 14598])

<WorkQuerySet [<Work: "Bungaku Shoujo" Kyou no Oyatsu: Hatsukoi>, <Work: Book Girl OVA>]>

In [300]:
mangaki_to_manami[14598]

{2}

In [301]:
#super_dedupe.query('problem != "Mangaki title collision"')['is_dead'].value_counts()
#mangaki_to_manami[3053]

In [302]:
super_dedupe['n_occ'] = super_dedupe.groupby('mangaki_title')['problem'].transform('count')

In [303]:
super_dedupe.query('problem == "Mangaki title collision"').sort_values(['n_occ', 'mangaki_title'], ascending=(False, True))

Unnamed: 0,problem,mangaki_title,refs,mangaki_id,manami_data,manami_data_from_title,n_occ
4,Mangaki title collision,Deadman Wonderland,MAL:6880,14198,3833:Deadman Wonderland:7902,,2
5,Mangaki title collision,Deadman Wonderland,Animeka:deadman-wonderland,39,3833:Deadman Wonderland:7902,,2
13,Mangaki title collision,Fruits Basket,MAL:120,14275,5466:Fruits Basket:34,,2
14,Mangaki title collision,Fruits Basket,AniDB:34,85,5466:Fruits Basket:34,,2
29,Mangaki title collision,Lupin the IIIrd: Chikemuri no Ishikawa Goemon,,21458,12275:Lupin the IIIrd: Chikemuri no Ishikawa Goemon:12398,,2
30,Mangaki title collision,Lupin the IIIrd: Chikemuri no Ishikawa Goemon,MAL:34021,20602,12275:Lupin the IIIrd: Chikemuri no Ishikawa Goemon:12398,,2
38,Mangaki title collision,Samurai Champloo,AniDB:1543,41,18076:Samurai Champloo:1543,,2
39,Mangaki title collision,Samurai Champloo,MAL:205,14362,18076:Samurai Champloo:1543,,2
6,Mangaki title collision,Desert Punk,MAL:25,34081337115831,20272:Sunabouzu:2166,,1
9,Mangaki title collision,Dimension W,MAL:31163,142971429813524,4151:Dimension W:11345,,1


In [154]:
again_multiple = map(int, super_dedupe.query('problem == "Zero or multiple Manami"')['mangaki_id'].tolist())
manami_dup_from_title = handle_multiple_manami(again_multiple)
manami_dup_from_title.to_csv('dup_manami2.csv', index=False)
manami_dup_from_title

21 [7793, 5738]
21 | Highschool of the Dead | Highschool of the Dead ( https://anilist.co/anime/8074 ; https://kitsu.io/anime/5187 ; https://myanimelist.net/anime/8074 ; https://notify.moe/anime/KOwzpKmig ) | Gakuen Mokushiroku: High School of the Dead ( https://anidb.net/anime/7382 )
52 [7566, 7567]
52 | Hellsing | Hellsing ( https://anidb.net/anime/32 ; https://anilist.co/anime/270 ; https://kitsu.io/anime/245 ; https://myanimelist.net/anime/270 ; https://notify.moe/anime/cbFh5KimR ) | Hellsing (2006) ( https://anidb.net/anime/3296 )
79 [8797, 8798]
79 | InuYasha | InuYasha ( https://kitsu.io/anime/11153 ; https://myanimelist.net/anime/31133 ; https://notify.moe/anime/LOaXpKmiR ) | InuYasha (TV) ( https://anidb.net/anime/144 ; https://anilist.co/anime/249 ; https://kitsu.io/anime/224 ; https://myanimelist.net/anime/249 ; https://notify.moe/anime/Q3ShcFmiR )
96 [11186, 11175]
96 | Eureka Seven | Koukyoushihen: Eureka Seven ( https://anidb.net/anime/2826 ) | Koukyoushihen Eureka Seven 

Unnamed: 0,mangaki_id,mangaki_title,manami_title,anidb.net,myanimelist.net,anilist.co,kitsu.io,notify.moe
0,21.0,Highschool of the Dead,7793:Highschool of the Dead,,https://myanimelist.net/anime/8074,https://anilist.co/anime/8074,https://kitsu.io/anime/5187,https://notify.moe/anime/KOwzpKmig
1,,,5738:Gakuen Mokushiroku: High School of the Dead,https://anidb.net/anime/7382,,,,
2,52.0,Hellsing,7566:Hellsing,https://anidb.net/anime/32,https://myanimelist.net/anime/270,https://anilist.co/anime/270,https://kitsu.io/anime/245,https://notify.moe/anime/cbFh5KimR
3,,,7567:Hellsing (2006),https://anidb.net/anime/3296,,,,
4,79.0,InuYasha,8797:InuYasha,,https://myanimelist.net/anime/31133,,https://kitsu.io/anime/11153,https://notify.moe/anime/LOaXpKmiR
5,,,8798:InuYasha (TV),https://anidb.net/anime/144,https://myanimelist.net/anime/249,https://anilist.co/anime/249,https://kitsu.io/anime/224,https://notify.moe/anime/Q3ShcFmiR
6,96.0,Eureka Seven,11186:Koukyoushihen: Eureka Seven,https://anidb.net/anime/2826,,,,
7,,,11175:Koukyoushihen Eureka Seven,,https://myanimelist.net/anime/237,https://anilist.co/anime/237,https://kitsu.io/anime/212,https://notify.moe/anime/9onhcFmig
8,98.0,Hellsing Ultimate,7569:Hellsing Ultimate,,https://myanimelist.net/anime/777,https://anilist.co/anime/777,https://kitsu.io/anime/695,https://notify.moe/anime/KJTAcKmmR
9,,,7567:Hellsing (2006),https://anidb.net/anime/3296,,,,


In [214]:
import json

with open('../../anime-offline-database/anime-offline-database.json') as f:
    manami = json.load(f)

In [215]:
len(manami['data'])

24369

In [45]:
mangaki = Work.objects.filter(category__slug='anime')
mangaki.count()  # On Mangaki

10996

In [48]:
manami['data'][42]

{'sources': ['https://anidb.net/anime/8699',
  'https://anilist.co/anime/11755',
  'https://kitsu.io/anime/6588',
  'https://myanimelist.net/anime/11755',
  'https://notify.moe/anime/nX_ItFiiR'],
 'title': '009 Re:Cyborg',
 'type': 'Movie',
 'episodes': 1,
 'status': 'FINISHED',
 'picture': 'https://cdn.myanimelist.net/images/anime/9/40189.jpg',
 'thumbnail': 'https://cdn.myanimelist.net/images/anime/9/40189t.jpg',
 'synonyms': ['009 RE:CYBORG', 'ぜろぜろないん り・さいぼーぐ'],
 'relations': ['https://anidb.net/anime/1214',
  'https://anidb.net/anime/12277',
  'https://anidb.net/anime/1527',
  'https://anidb.net/anime/7994',
  'https://anilist.co/anime/1678',
  'https://anilist.co/anime/4690',
  'https://anilist.co/anime/8394',
  'https://kitsu.io/anime/12326',
  'https://kitsu.io/anime/12397',
  'https://kitsu.io/anime/12410',
  'https://kitsu.io/anime/12543',
  'https://kitsu.io/anime/1508',
  'https://kitsu.io/anime/3741',
  'https://kitsu.io/anime/7868',
  'https://myanimelist.net/anime/1678',


In [50]:
# Case-sensitive substring search: 'youkai' is meant to match the '...Kyoukai...' titles
for entry in manami['data']:
    if 'youkai' in entry['title']:
        print(entry['title'], entry['sources'])

Gekijouban Kara no Kyoukai: Mirai Fukuin - The Garden of Sinners Recalled Out Summer ['https://anidb.net/anime/9309']
Gekijouban Kara no Kyoukai: Mirai Fukuin - The Garden of Sinners Recalled Out Summer - Extra Chorus ['https://anidb.net/anime/10566']
Gekijouban Kara no Kyoukai: The Garden of Sinners ['https://anidb.net/anime/4932']
Gekijouban Kyoukai no Kanata - I' LL BE HERE - Kako-hen “Yakusoku no Kizuna” Dansu PV ['https://anilist.co/anime/102629']
Gekijouban Kyoukai no Kanata: I`ll Be Here ['https://anidb.net/anime/10726']
Kara no Kyoukai 1: Fukan Fuukei ['https://kitsu.io/anime/2357', 'https://myanimelist.net/anime/2593', 'https://notify.moe/anime/TxDs5KmiR']
Kara no Kyoukai 2: Satsujin Kousatsu (Zen) ['https://kitsu.io/anime/3248', 'https://myanimelist.net/anime/3782', 'https://notify.moe/anime/Y8T_cKmmR']
Kara no Kyoukai 3: Tsuukaku Zanryuu ['https://kitsu.io/anime/3249', 'https://myanimelist.net/anime/3783', 'https://notify.moe/anime/Pz0l5Fmig']
Kara no Kyoukai 4: Garan no Dou

In [16]:
from collections import defaultdict
import re

references = set()
url1 = defaultdict(list)

# Get all sources of Manami entries
for manami_id, entry in enumerate(manami['data']):
    references.add('manami/{}'.format(manami_id))
    for url in entry['sources']:
        ref = re.sub(r'https?://', '', url)
        references.add(ref)
        url1[manami_id].append(ref)
len(references)

94422

In [17]:
url2 = defaultdict(set)

# Mangaki works with AniDB ID
anidb_id = dict(Work.objects.filter(anidb_aid__gt=0).values_list('id', 'anidb_aid'))
for mangaki_id, aid in anidb_id.items():
    ref1 = 'mangaki.fr/work/{}'.format(mangaki_id)
    ref2 = 'anidb.net/anime/{}'.format(aid)
    references.update([ref1, ref2])
    url2[mangaki_id].update([ref1, ref2])

# All Mangaki references
for mangaki_id, url in Reference.objects.filter(work_id__category__slug='anime').values_list('work_id', 'url'):
    ref1 = 'mangaki.fr/work/{}'.format(mangaki_id)
    ref2 = re.sub(r'https?://', '', url).replace('anidb.net/perl-bin/animedb.pl?show=anime&aid=', 'anidb.net/anime/')
    references.update([ref1, ref2])
    url2[mangaki_id].update([ref1, ref2])
len(references)

105853
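
Both cells above reduce every URL to a scheme-less key so that the same page always maps to the same node. The sketch below isolates that normalization in one helper; the name `normalize_reference` is an assumption of this example, while the rewriting rules are the ones used above.

```python
import re

# Legacy AniDB URL prefix still present in some Mangaki references.
LEGACY_ANIDB = 'anidb.net/perl-bin/animedb.pl?show=anime&aid='

def normalize_reference(url):
    """Strip the scheme and rewrite legacy AniDB URLs to the modern form."""
    ref = re.sub(r'^https?://', '', url)
    return ref.replace(LEGACY_ANIDB, 'anidb.net/anime/')

old = 'http://anidb.net/perl-bin/animedb.pl?show=anime&aid=570'
new = 'https://anidb.net/anime/570'
print(normalize_reference(old), normalize_reference(new))
```

Without this step, the Mangaki reference and the Manami source for the same AniDB entry would become two distinct nodes and never be merged.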

In [18]:
ids = dict(zip(sorted(references), range(len(references))))

In [19]:
list(ids.keys())[:5]

['anidb.net/anime/1',
 'anidb.net/anime/10',
 'anidb.net/anime/100',
 'anidb.net/anime/1000',
 'anidb.net/anime/10000']

In [21]:
from tryalgo.kruskal import UnionFind

uf = UnionFind(len(references))
nb_merge = 0
for manami_id, refs in url1.items():
    for ref in refs:
        nb_merge += uf.union(ids[ref], ids['manami/{}'.format(manami_id)])
nb_merge

70404

In [22]:
nb_has_anidb = 0
for mangaki_id, refs in url2.items():
    has_anidb = False
    for ref in refs:
        if 'anidb' in ref:
            has_anidb = True
        nb_merge += uf.union(ids[ref], ids['mangaki.fr/work/{}'.format(mangaki_id)])
    nb_has_anidb += has_anidb
f'{nb_merge} merges: {nb_has_anidb}/{len(url2)} had an AniDB ID ({nb_has_anidb/len(url2) * 100:.1f} %)'

'81828 merges: 322/11297 had an AniDB ID (2.9 %)'

In [23]:
clusters = defaultdict(list)
for ref, ref_id in ids.items():
    clusters[uf.find(ref_id)].append(ref)
len(clusters)

24025
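
The clustering above can be reproduced on a handful of references. The sketch below uses a hand-written `TinyUnionFind` rather than `tryalgo`, and assumes (as the `nb_merge += uf.union(...)` lines suggest) that `union` returns a truthy value only when two distinct components are merged. The edges mirror the Fruits Basket rows of the report; `manami/42` is deliberately left unlinked.

```python
from collections import defaultdict

class TinyUnionFind:
    """Minimal union-find with path halving (illustrative, not tryalgo's)."""
    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, x, y):
        root_x, root_y = self.find(x), self.find(y)
        if root_x == root_y:
            return False  # already in the same component
        self.parent[root_x] = root_y
        return True

toy_refs = ['anidb.net/anime/34', 'manami/42', 'manami/5466',
            'mangaki.fr/work/14275', 'mangaki.fr/work/85',
            'myanimelist.net/anime/120']
toy_ids = {ref: i for i, ref in enumerate(sorted(toy_refs))}

toy_edges = [
    ('manami/5466', 'anidb.net/anime/34'),
    ('manami/5466', 'myanimelist.net/anime/120'),
    ('mangaki.fr/work/85', 'anidb.net/anime/34'),
    ('mangaki.fr/work/14275', 'myanimelist.net/anime/120'),
]

toy_uf = TinyUnionFind(len(toy_refs))
toy_nb_merge = 0
for a, b in toy_edges:
    toy_nb_merge += toy_uf.union(toy_ids[a], toy_ids[b])

toy_clusters = defaultdict(list)
for ref, ref_id in toy_ids.items():
    toy_clusters[toy_uf.find(ref_id)].append(ref)

print(toy_nb_merge, len(toy_clusters))
```

The two Mangaki works never share a reference directly; they end up in the same cluster only because the Manami entry links the AniDB and MAL pages together.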

In [32]:
from collections import Counter

c = Counter()
clusters_by_occ = defaultdict(list)

def get_mangaki_id(url):
    return int(url[len('mangaki.fr/work/'):])

def get_manami_id(url):
    return int(url[len('manami/'):])

def get_anidb_id(url):
    return int(url[len('anidb.net/anime/'):])


nb_cdup_mangaki = 0
nb_cdup_manami = 0
total_ref = 0
search_queries = []
mangaki_not_manami = set()
for cluster in clusters.values():
    mangaki_refs = [ref for ref in cluster if ref.startswith('mangaki') and get_mangaki_id(ref) in VALID_MANGAKI_IDS]
    nb_mangaki = len(mangaki_refs)
    nb_manami = len([ref for ref in cluster if ref.startswith('manami')])
    if nb_mangaki > 0:
        total_ref += len(cluster) - nb_mangaki - nb_manami
    if nb_mangaki >= 2 and nb_manami >= 1:
        nb_cdup_mangaki += 1
    if nb_manami >= 2 and nb_mangaki >= 1:
        nb_cdup_manami += 1
    # if nb_mangaki + nb_manami >= 14:
    #     print(cluster)  # It was Kara no Kyoukai
    if nb_mangaki + nb_manami >= 12:
        print(cluster)
    if nb_mangaki >= 3 or nb_manami >= 3:  # Needs manual analysis
        clusters_by_occ[nb_mangaki, nb_manami].append(cluster)
    if nb_mangaki == 0 and nb_manami == 1:
        manami_url = [ref for ref in cluster if ref.startswith('manami')][0]
        manami_id = get_manami_id(manami_url)
        search_queries.append(manami['data'][manami_id]['title'])  # In Manami but not in Mangaki
    if nb_manami == 0:
        for url in mangaki_refs:
            mangaki_id = get_mangaki_id(url)
            mangaki_not_manami.add(mangaki_id)
    c[nb_mangaki, nb_manami] += 1
c, nb_cdup_mangaki, nb_cdup_manami, total_ref

['anidb.net/anime/4932', 'kitsu.io/anime/2357', 'kitsu.io/anime/3248', 'kitsu.io/anime/3249', 'kitsu.io/anime/3545', 'kitsu.io/anime/3546', 'manami/5990', 'manami/9868', 'manami/9869', 'manami/9870', 'manami/9871', 'manami/9872', 'manami/9874', 'mangaki.fr/work/1374', 'mangaki.fr/work/1375', 'mangaki.fr/work/1376', 'mangaki.fr/work/1377', 'mangaki.fr/work/1378', 'mangaki.fr/work/1381', 'mangaki.fr/work/14626', 'mangaki.fr/work/14629', 'myanimelist.net/anime/2593', 'myanimelist.net/anime/3782', 'myanimelist.net/anime/3783', 'myanimelist.net/anime/4280', 'myanimelist.net/anime/4282', 'myanimelist.net/anime/5205', 'notify.moe/anime/6AbXcFmiR', 'notify.moe/anime/Pz0l5Fmig', 'notify.moe/anime/RNxu5Kmmg', 'notify.moe/anime/TxDs5KmiR', 'notify.moe/anime/Y8T_cKmmR']


(Counter({(1, 1): 9935,
          (0, 1): 13529,
          (2, 1): 427,
          (1, 2): 29,
          (2, 2): 14,
          (3, 1): 12,
          (5, 1): 1,
          (3, 2): 3,
          (4, 3): 1,
          (8, 7): 1,
          (1, 3): 1,
          (4, 1): 2,
          (3, 4): 1,
          (2, 3): 1,
          (1, 0): 68}),
 463,
 51,
 42237)

This means:

- 13529 works are in Manami but not in Mangaki
- 68 works are in Mangaki but not in Manami
- `(2, 1): 427` $\Rightarrow$ 427 clusters contain 2 Mangaki works but only 1 Manami entry
- `(8, 7): 1` is actually Kara no Kyoukai, which has a single AniDB reference but corresponds to 8 works in Mangaki and 7 in Manami
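
As a sanity check, the summary figures can be recomputed from the printed `Counter` alone. The sketch below copies the counter from the output above (keys are `(nb_mangaki, nb_manami)` pairs) and rederives the totals; the variable names are specific to this example.

```python
from collections import Counter

# Copied from the cell output: (nb_mangaki, nb_manami) -> number of clusters.
c = Counter({(1, 1): 9935, (0, 1): 13529, (2, 1): 427, (1, 2): 29,
             (2, 2): 14, (3, 1): 12, (5, 1): 1, (3, 2): 3, (4, 3): 1,
             (8, 7): 1, (1, 3): 1, (4, 1): 2, (3, 4): 1, (2, 3): 1,
             (1, 0): 68})

only_manami = c[0, 1]    # clusters with a Manami entry and no Mangaki work
only_mangaki = c[1, 0]   # clusters with a Mangaki work and no Manami entry

# Clusters with duplicates on the Mangaki side (resp. on the Manami side).
dup_mangaki = sum(n for (nb_mangaki, nb_manami), n in c.items()
                  if nb_mangaki >= 2 and nb_manami >= 1)
dup_manami = sum(n for (nb_mangaki, nb_manami), n in c.items()
                 if nb_manami >= 2 and nb_mangaki >= 1)

print(only_manami, only_mangaki, dup_mangaki, dup_manami, sum(c.values()))
# → 13529 68 463 51 24025
```

The last value matches the 24025 clusters found by the union-find, and 463 and 51 match `nb_cdup_mangaki` and `nb_cdup_manami`.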

In [33]:
%%time

# This is slow (one query per title). Don't do this.
'''
search_results = Counter()
print(len(search_queries))
for title in search_queries:
    nb_results = Work.objects.filter(title=title).count()
    search_results[nb_results] += 1
search_results'''
# Counter({0: 12849, 1: 663, 2: 13, 3: 4})

ids_found_in_mangaki_title = set(Work.objects.filter(title__in=search_queries).values_list('id', flat=True))
ids_found_in_mangaki_synonym = set(WorkTitle.objects.filter(title__in=search_queries).values_list('work_id', flat=True))
ids_found_in_mangaki = ids_found_in_mangaki_title | ids_found_in_mangaki_synonym
len(ids_found_in_mangaki_title), len(ids_found_in_mangaki_synonym), len(ids_found_in_mangaki)

CPU times: user 194 ms, sys: 0 ns, total: 194 ms
Wall time: 2.73 s


(673, 16, 683)

In [34]:
paired = ids_found_in_mangaki & mangaki_not_manami
print(f'{len(paired)} anime recovered')
print(f'{len(search_queries) - len(paired)} anime in Manami but not in Mangaki')
print(f'{len(mangaki_not_manami - paired)} anime in Mangaki but not in Manami (this number should be close to 0)')

33 anime recovered
13496 anime in Manami but not in Mangaki
35 anime in Mangaki but not in Manami (this number should be close to 0)


In [35]:
# Examples
Work.objects.filter(id__in=mangaki_not_manami - paired)

<WorkQuerySet [<Work: Rurouni Kenshin: Meiji Kenkaku Romantan>, <Work: Shakugan no Shana II>, <Work: Flipping through Belgrade>, <Work: Kamichu! The Goddess is a Middle School Student>, <Work: Galaxy Divine Wind Jinraiger>, <Work: Birei Okami Mie>, <Work: Loups=Garous Pilot>, <Work: A Message for Passing the Baton from Cure Lovely to Cure Flora>, <Work: The Inheritor of the Crescent Moon>, <Work: Paper Rabbit Rope: Christmas>, <Work: Gintama OVA>, <Work: UFO Robo Grendizer: Confrontation in the Red Setting Sun>, <Work: Getter Robo Movie>, <Work: Noragami OVA 2>, <Work: Ore, Twin tails ni Narimasu. Recap>, <Work: Kanojo x Kanojo x Kanojo Full Version>, <Work: Devilman (Movie)>, <Work: Bishoujo Senshi Sailor Moon: Crystal 2nd Season>, <Work: Charlotte: Arata na "Unmei" no Hajimari>, <Work: Shingeki no Bahamut: Genesis Special>, '...(remaining elements truncated)...']>

In [37]:
nb_in_mangaki_not_manami_hard_to_pair = len(mangaki_not_manami - paired)
print(nb_in_mangaki_not_manami_hard_to_pair, 'to pair')

# Are they popular? (nb_ratings below only counts "like" ratings)
df_to_pair = pd.DataFrame(Reference.objects.filter(work_id__in=mangaki_not_manami - paired).values(
    'source', 'identifier', 'work_id', 'work_id__title'))
nb_ratings_of_work = Counter(Rating.objects.filter(work_id__in=df_to_pair['work_id'], choice='like').values_list('work_id', flat=True))
df_to_pair['nb_ratings'] = df_to_pair['work_id'].map(nb_ratings_of_work)
df_to_pair.sort_values('nb_ratings', ascending=False)
# To be paired manually

35 to pair


Unnamed: 0,source,identifier,work_id,work_id__title,nb_ratings
19,Animeka,shakugan-no-shana2,145,Shakugan no Shana II,88
20,Animeka,kenshin,105,Rurouni Kenshin: Meiji Kenkaku Romantan,65
21,MAL,489,1907,Kamichu! The Goddess is a Middle School Student,19
2,MAL,30987,12674,"Charlotte: Arata na ""Unmei"" no Hajimari",9
1,MAL,30624,10035,Noragami OVA 2,7
4,MAL,31095,12907,Shingeki no Bahamut: Genesis Special,2
13,MAL,31492,13471,Haikyuu!!: Jump Festa 2015 Special,2
0,MAL,7839,8120,UFO Robo Grendizer: Confrontation in the Red S...,1
28,MAL,28295,3346,Gintama OVA,1
18,MAL,30595,12360,Bishoujo Senshi Sailor Moon: Crystal 2nd Season,1


In [41]:
# Add the MAL references as edges in the union-find

In [40]:
for work in Work.objects.filter(title__search='no shana'):
    print(work.title)

Shakugan no Shana: Season II
Shakugan no Shana SP: Koi to Onsen no Kougai Gakushuu!
Shakugan no Shana Movie Special
Shakugan no Shana Specials
Shakugan no Shana: Shana to Yuuji no Naze Nani Shana! Nandemo Shitsumon-bako!
Shakugan no Shana S: OVA Series
Shakugan no Shana: Naze Nani Shana! Nandemo Shitsumonbako! - Special
Shakugan no Shana II (Second) Specials
Shakugan no Shana: The Movie
Shakugan no Shana: Naze Nani Shana 2
Shakugan no Shana III (Final) Specials
Shakugan no Shana S Specials
Shakugan no Shana: Season III
Shakugan no Shana II
Shakugan no Shana: Naze Nani Shana
Shakugan no Shana


For example, we need to check that these MAL links actually resolve.
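
One way to run such a check is sketched below. It is only an illustration: `http_status` and `broken_mal_links` are hypothetical helpers, the URL pattern is the `https://myanimelist.net/anime/<id>` form seen in the Manami sources, and a server may answer `HEAD` requests differently from `GET`, so results would need to be double-checked by hand.

```python
import urllib.error
import urllib.request

def http_status(url, timeout=10):
    """Return the HTTP status of a HEAD request, or None on network failure."""
    request = urllib.request.Request(url, method='HEAD')
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code      # e.g. 404 for a removed entry
    except OSError:
        return None            # URLError, timeouts and other socket errors

def broken_mal_links(mal_ids, fetch_status=http_status):
    """Return the MAL IDs whose page does not answer with HTTP 200."""
    broken = []
    for mal_id in mal_ids:
        url = f'https://myanimelist.net/anime/{mal_id}'
        if fetch_status(url) != 200:
            broken.append(mal_id)
    return broken

# Offline demonstration with a fake fetcher (no network access needed):
def fake_status(url):
    return 404 if url.endswith('/2') else 200

print(broken_mal_links([1, 2, 3], fetch_status=fake_status))
```

Against the real site one would call `broken_mal_links(...)` with the MAL identifiers from the table above and the default fetcher, ideally with a pause between requests. The offline alternative is the `dead-entries.json` lookup that `check_ref` performs at the top of the notebook.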

In [29]:
def translate_url(url):
    if url.startswith('mangaki'):
        mangaki_id = get_mangaki_id(url)
        try:
            return f'mangaki:{Work.objects.get(id=mangaki_id).title}'
        except Work.DoesNotExist as e:
            print(mangaki_id, e, '(already deduplicated via WorkCluster)')
            return f'mangaki:{mangaki_id}'
    if url.startswith('manami'):
        manami_id = get_manami_id(url)
        return f'manami:{manami["data"][manami_id]["title"]}'
    return url

for key in clusters_by_occ:
    print(key)
    for cluster in clusters_by_occ[key]:
        print([translate_url(url) for url in cluster])

(3, 1)
['anidb.net/anime/10697', 'anilist.co/anime/20754', 'kitsu.io/anime/8644', 'manami:Gakkougurashi!', 'mangaki:Gakkou Gurashi!', 'mangaki:School-Live!', 'mangaki:School-Live!', 'myanimelist.net/anime/24765', 'notify.moe/anime/5jHJtFmiR']
['anidb.net/anime/10813', 'anilist.co/anime/20829', 'kitsu.io/anime/8736', 'manami:Owari no Seraph', 'mangaki:Seraph of the End: Vampire Reign', 'mangaki:Seraph of the End: Vampire Reign', 'mangaki:Seraph of the End', 'myanimelist.net/anime/26243', 'notify.moe/anime/90w1tKmiR']
['anidb.net/anime/11163', 'anilist.co/anime/21122', 'kitsu.io/anime/10762', 'manami:Nagato Yuki-chan no Shoushitsu: Owarenai Natsuyasumi', 'mangaki:Nagato Yuki-chan no Shoushitsu OVA', 'mangaki:Nagato Yuki-chan no Shoushitsu: Owarenai Natsuyasumi', 'mangaki:The Vanishing of Nagato Yuki-chan', 'myanimelist.net/anime/30379', 'notify.moe/anime/Id3QpKmig']
['anidb.net/anime/11345', 'anilist.co/anime/21256', 'kitsu.io/anime/11170', 'manami:Dimension W', 'mangaki:Dimension W', 'm

In [30]:
from collections import Counter

c2 = Counter()
nb_with_anidb = 0
nb_c_mangaki = 0
anidb_id_of = {}
for cluster in clusters.values():
    mangaki_urls = [ref for ref in cluster if ref.startswith('mangaki')]
    anidb_urls = [ref for ref in cluster if ref.startswith('anidb')]
    nb_mangaki = len(mangaki_urls)
    nb_anidb = len(anidb_urls)
    if nb_mangaki > 0:
        nb_c_mangaki += 1
        if nb_anidb > 0:
            nb_with_anidb += 1
            anidb_id = get_anidb_id(anidb_urls[0])
            mangaki_id = get_mangaki_id(mangaki_urls[0])
            anidb_id_of[mangaki_id] = anidb_id
    c2[nb_mangaki, nb_anidb] += 1
c2, f'{nb_with_anidb}/{nb_c_mangaki} now ({nb_with_anidb / nb_c_mangaki * 100:.1f} %)'

(Counter({(1, 1): 5384,
          (0, 1): 5414,
          (2, 1): 315,
          (3, 1): 40,
          (4, 1): 8,
          (5, 1): 2,
          (3, 2): 4,
          (4, 2): 1,
          (8, 1): 2,
          (1, 0): 4421,
          (0, 0): 8115,
          (2, 0): 295,
          (3, 0): 18,
          (4, 0): 6}),
 '5756/10496 now (54.8 %)')
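
The coverage ratio can be cross-checked from the counter itself, whose keys are `(nb_mangaki, nb_anidb)` pairs. The sketch below copies the counter from the output above under the name `c2_copy`, an assumption of this example.

```python
from collections import Counter

# Copied from the cell output: (nb_mangaki, nb_anidb) -> number of clusters.
c2_copy = Counter({(1, 1): 5384, (0, 1): 5414, (2, 1): 315, (3, 1): 40,
                   (4, 1): 8, (5, 1): 2, (3, 2): 4, (4, 2): 1, (8, 1): 2,
                   (1, 0): 4421, (0, 0): 8115, (2, 0): 295, (3, 0): 18,
                   (4, 0): 6})

# Clusters containing at least one Mangaki work.
with_mangaki = sum(n for (nb_mangaki, _), n in c2_copy.items() if nb_mangaki > 0)
# Among those, clusters that also contain at least one AniDB reference.
with_both = sum(n for (nb_mangaki, nb_anidb), n in c2_copy.items()
                if nb_mangaki > 0 and nb_anidb > 0)

print(f'{with_both}/{with_mangaki} ({with_both / with_mangaki * 100:.1f} %)')
# → 5756/10496 (54.8 %)
print(sum(c2_copy.values()))
# → 24025
```

The ratio is computed per cluster rather than per work, which is why it is not directly comparable with the 322/11297 figure obtained before merging.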

In [31]:
anidb = pd.DataFrame(list(anidb_id_of.items()), columns=['decoded_item', 'anidb_aid']).sort_values('anidb_aid')

In [29]:
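anonymized = pd.read_csv('ratings-2020.csv')
# NOTE: ratings-2020.csv apparently has no header row, so its first record ends up as
# the column names below (hence one row fewer than in decoder).
decoder = pd.read_csv('decoder-2020.csv')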
decoder.columns, anonymized.columns

(Index(['user', 'item', 'rating', 'items_per_user', 'users_per_item',
        'decoded_item'],
       dtype='object'),
 Index(['339', '1626', 'favorite'], dtype='object'))

In [30]:
decoder.shape, anonymized.shape

((234814, 6), (234813, 3))

In [31]:
anonymized.head()

Unnamed: 0,339,1626,favorite
0,2057,7939,neutral
1,1382,8120,like
2,1455,8588,like
3,223,5575,dislike
4,1968,3814,like


In [53]:
god = decoder.merge(anidb, on='decoded_item')
god.head()

Unnamed: 0,user,item,rating,items_per_user,users_per_item,decoded_item,anidb_aid
0,1382,8120,like,284,62,1467,10432
1,602,8120,neutral,269,62,1467,10432
2,2026,8120,dislike,227,62,1467,10432
3,1898,8120,neutral,283,62,1467,10432
4,1207,8120,like,364,62,1467,10432


In [55]:
# Work.objects.get(id=1467)

In [59]:
god[['item', 'anidb_aid']].drop_duplicates().to_csv('anidb.csv', index=False)