## Overview

This document attempts to filter out HLAs that do not provide useful information about a character. The procedure is:
* Randomly draw a small sample from the full set of HLAs.
* Manually mark each sampled HLA as important or non-important.
* Use the HLAs marked non-important to filter the rest of the HLA dataset, via adjacent HLAs.

An HLA counts as important if it actually describes the character. Two kinds of HLA fail this test:
* Example 1: the HLA describes a story associated with the character rather than the character itself, e.g. ADayInTheLimeLight.
* Example 2: the HLA describes a general aspect of characters rather than a specific attribute, e.g. TrademarkFavoriteFood.

In [7]:
import os
import pandas as pd
import random

In [5]:
# -- simple df inspection function --
def inspect_df(df, col1, col2):
    _count = getattr(df, col1).value_counts()
    print('COUNT {}: {}'.format(col1, len(_count)))
    _count = getattr(df, col2).value_counts()
    print('COUNT {}: {}'.format(col2, len(_count)))

_CHAR_BASE_FOLDER = os.path.expanduser('~/Datasets/dodgsons/formal_aaai/')

_live_chars = pd.read_csv(_CHAR_BASE_FOLDER + 'char_features_live.csv')
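_animated_chars = pd.read_csv(_CHAR_BASE_FOLDER + 'char_features_animated.csv')
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
chars = pd.concat([_live_chars, _animated_chars], ignore_index=True)
# treat low-count rows as noise: drop characters with fewer than 5 features,
# then drop features attached to fewer than 5 of the remaining characters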
chars = chars[chars.groupby('char_id')['feature'].transform('count').ge(5)]
chars = chars[chars.groupby('feature')['char_id'].transform('count').ge(5)]
chars = chars.sample(frac=1, random_state=718281).reset_index(drop=True)

inspect_df(chars, 'char_id', 'feature')
chars

COUNT char_id: 45821
COUNT feature: 12815


Unnamed: 0,feature,char_id,work,char_name
0,XtremeKoolLetterz,l41543,StarWars,Xev Xrexus
1,OneHitKill,l3731,Battlebots,Son of Whyachi (#6)
2,TheLancer,l21770,Primeval,Connor Temple
3,BigBrotherInstinct,l18267,ModernFamily,Manny Delgado
4,FinalDeath,a639,Coco,Héctor
...,...,...,...,...
945514,MissingMom,a4190,Archer,Malory Archer
945515,ButNowIMustGo,l21681,ThePretender,Jarod
945516,EvenTheGuysWantHim,l27524,Survivor,Ken McNickle
945517,HollywoodGenetics,l24084,ShamelessUS,Liam Gallagher
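
The two count filters in the cell above run once each, in sequence. A single pass does not guarantee that both thresholds hold at the same time: dropping rare features can push a character back below the minimum. The sketch below shows this on a toy frame and repeats the pass until nothing changes. The data, `MIN_COUNT = 2`, and both helper names are hypothetical; the notebook itself uses a threshold of 5 and a single pass.

```python
import pandas as pd

# Toy data (hypothetical): four characters, four features.
toy = pd.DataFrame({
    'char_id': ['c1', 'c1', 'c2', 'c2', 'c3', 'c3', 'c4', 'c4'],
    'feature': ['A',  'B',  'A',  'B',  'A',  'C',  'C',  'D'],
})
MIN_COUNT = 2  # the notebook uses 5


def filter_once(df, min_count):
    # keep characters with at least `min_count` features ...
    df = df[df.groupby('char_id')['feature'].transform('count').ge(min_count)]
    # ... then keep features attached to at least `min_count` remaining characters
    df = df[df.groupby('feature')['char_id'].transform('count').ge(min_count)]
    return df


def filter_until_stable(df, min_count):
    # repeat the pass until it no longer drops any row
    while True:
        out = filter_once(df, min_count)
        if len(out) == len(df):
            return out
        df = out


once = filter_once(toy, MIN_COUNT)
stable = filter_until_stable(toy, MIN_COUNT)

print(once.groupby('char_id').size())    # c4 is left with a single feature
print(stable.groupby('char_id').size())  # every remaining character meets the threshold
```

Whether the extra passes are worth it depends on how strict the downstream use is; with the notebook's sizes a single pass may already be close enough.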


In [35]:
chars.feature.value_counts()[:1000]

DeadpanSnarker                    5455
Jerkass                           5188
BigBad                            4367
BerserkButton                     4278
JerkWithAHeartOfGold              3974
                                  ... 
TheKirk                            207
HopeSpot                           207
AmusingInjuries                    207
UnsympatheticComedyProtagonist     206
MistakenForGay                     206
Name: feature, Length: 1000, dtype: int64

In [36]:
# select the 1000 most frequent HLAs
samples = chars.feature.value_counts()[:1000].index
samples

Index(['DeadpanSnarker', 'Jerkass', 'BigBad', 'BerserkButton',
       'JerkWithAHeartOfGold', 'NiceGuy', 'ButtMonkey', 'MeaningfulName',
       'CatchPhrase', 'HiddenDepths',
       ...
       'AmoralAttorney', 'NominalHero', 'AsskickingEqualsAuthority',
       'AmazingTechnicolorPopulation', 'IconicItem', 'TheKirk', 'HopeSpot',
       'AmusingInjuries', 'UnsympatheticComedyProtagonist', 'MistakenForGay'],
      dtype='object', length=1000)

In [11]:
# randomly draw 300
# random.seed(79416256)
# samples = random.choices(all_hla, k=300)
# samples

['BusFullOfInnocents',
 'EvilRunningGood',
 'SexualKarma',
 'ILoveYouBecauseICantControlYou',
 'DoppelgangerSpin',
 'GalacticConquerer',
 'IHaveNoSon',
 'IceQueen',
 'PassedInTheirSleep',
 'PrisonDimension',
 'ExcessiveEvilEyeShadow',
 'TragicKeepsake',
 'AnachronismStew',
 'FilleFatale',
 'PandorasBox',
 'CasualKink',
 'FunSize',
 'IdentityAmnesia',
 'LastSecondShowoff',
 'ThePigPen',
 'TrademarkFavoriteFood',
 'PirateParrot',
 'KnifeNut',
 'TheRainman',
 'SleptThroughTheApocalypse',
 'TookALevelInIdealism',
 'FatAndSkinny',
 'WorkHardPlayHard',
 'VillainsNeverLie',
 'FullNameBasis',
 'MediumBlending',
 'WatchingTroyBurn',
 'IrritationIsTheSincerestFormOfFlattery',
 'UncannyValleyMakeup',
 'EverythingTryingToKillYou',
 'ItMakesJustAsMuchSenseInContext',
 'ShaggyDogStory',
 'GloryHound',
 'ExpressiveShirt',
 'FreudianCouch',
 'ParasolOfPrettiness',
 'YouCanTalk',
 'TheUnfettered',
 'RapidFireNo',
 'OnlyKnownByHerNickname',
 'SomewhereAHerpetologistIsCrying',
 'ChastityCouple',
 'CouldS
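
A note on the commented-out draw above: `random.choices` samples with replacement, so 300 draws can contain the same HLA more than once. If distinct HLAs are wanted, `random.sample` draws without replacement. A minimal sketch, where `all_hla` is a hypothetical stand-in for the full HLA list (the notebook does not show how it was built):

```python
import random

# Hypothetical stand-in for the full HLA vocabulary.
all_hla = ['HLA{}'.format(i) for i in range(1000)]

rng = random.Random(79416256)  # seed taken from the commented-out cell

with_replacement = rng.choices(all_hla, k=300)    # entries may repeat
without_replacement = rng.sample(all_hla, k=300)  # 300 distinct entries

print(len(set(with_replacement)), len(set(without_replacement)))
```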

In [38]:
# output to a CSV file and wait for the manual marks
def write_samples(path):
    # refuse to overwrite a file that may already contain manual marks
    if os.path.exists(path):
        raise FileExistsError('{} already exists.'.format(path))
    with open(path, 'w') as f:
        f.write('hla,mark,conditioned\n')
        for s in samples:
            f.write('{},,0\n'.format(s))

write_samples('./hlas.csv')
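
An alternative to the explicit existence check is to open the file in exclusive-creation mode (`'x'`), which raises `FileExistsError` by itself if the file is already there. The sketch below also uses the `csv` module so that any HLA name containing a comma would be quoted correctly. The function name `write_samples_exclusive` and the demo path are hypothetical.

```python
import csv
import os
import tempfile


def write_samples_exclusive(path, hlas):
    # mode 'x' fails with FileExistsError instead of overwriting
    with open(path, 'x', newline='') as f:
        writer = csv.writer(f, lineterminator='\n')
        writer.writerow(['hla', 'mark', 'conditioned'])
        for hla in hlas:
            # mark is left empty for manual editing; conditioned defaults to 0
            writer.writerow([hla, '', 0])


demo_path = os.path.join(tempfile.mkdtemp(), 'hlas_demo.csv')
write_samples_exclusive(demo_path, ['DeadpanSnarker', 'Jerkass'])

try:
    write_samples_exclusive(demo_path, ['BigBad'])
    overwritten = True
except FileExistsError:
    overwritten = False  # the second call is refused, marks stay intact
```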

In [39]:
# -- read the manually marked copy back --
manual_correction = pd.read_csv('/home/kits-adm/Workspace/play/dodgson_play/all_hla_cleaned.csv')
manual_correction

Unnamed: 0,hla,mark,conditioned
0,DeadpanSnarker,1,0
1,Jerkass,1,0
2,BigBad,1,0
3,BerserkButton,1,1
4,JerkWithAHeartOfGold,1,0
...,...,...,...
995,TheKirk,1,0
996,HopeSpot,1,0
997,AmusingInjuries,1,1
998,UnsympatheticComedyProtagonist,1,0


In [49]:
filtered = manual_correction[(manual_correction.mark==1) & (manual_correction.conditioned==0)]
filtered

Unnamed: 0,hla,mark,conditioned
0,DeadpanSnarker,1,0
1,Jerkass,1,0
2,BigBad,1,0
4,JerkWithAHeartOfGold,1,0
5,NiceGuy,1,0
...,...,...,...
993,AmazingTechnicolorPopulation,1,0
995,TheKirk,1,0
996,HopeSpot,1,0
998,UnsympatheticComedyProtagonist,1,0


In [50]:
len(set(filtered.hla.unique().tolist()))

709
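
One plausible next step, not shown in the notebook, is to restrict the character table to the kept HLAs. The sketch below uses small stand-in frames (`chars_demo`, `marks_demo`, both hypothetical) with the same column names as `chars` and `manual_correction`. Note the assumption it makes: an HLA that was never marked is dropped along with the rejected ones.

```python
import pandas as pd

# Hypothetical stand-ins for `chars` and `manual_correction`.
chars_demo = pd.DataFrame({
    'feature': ['DeadpanSnarker', 'BerserkButton', 'Jerkass', 'TrademarkFavoriteFood'],
    'char_id': ['l1', 'l1', 'a2', 'a2'],
})
marks_demo = pd.DataFrame({
    'hla': ['DeadpanSnarker', 'BerserkButton', 'Jerkass'],
    'mark': [1, 1, 1],
    'conditioned': [0, 1, 0],
})

# same selection as the notebook: mark == 1 and conditioned == 0
kept = marks_demo[(marks_demo.mark == 1) & (marks_demo.conditioned == 0)]

# keep only rows whose feature is one of the kept HLAs
chars_kept = chars_demo[chars_demo.feature.isin(kept.hla)].reset_index(drop=True)
print(chars_kept)
```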