# Teaching condition sampling

Which items should be used for the teaching set?

Ideally, I would customize the instructions based on what previous contributors provided. However, purposively selecting the toughest examples would bias the results: the other conditions would still be measured on those ten extra-tough examples, while the teaching condition would have no measurements for them. Random sampling is not ideal, but it is the approach I'm taking to avoid that bias.
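As a minimal sketch of what seeded random sampling buys here: fixing the seed makes the draw repeatable, so the teaching set can be regenerated later. The helper name `draw_teaching_sample` and the integer pool are hypothetical stand-ins, not part of the actual pipeline, and the sketch uses a private `random.Random` instance rather than the global seed used in the cells below.

```python
import random

def draw_teaching_sample(item_ids, k, seed):
    """Return k items chosen uniformly at random, repeatable for a given seed."""
    rng = random.Random(seed)  # private generator; the global random state is untouched
    return rng.sample(list(item_ids), k)

# Hypothetical pool standing in for the real item ids
pool = list(range(100))
first = draw_teaching_sample(pool, 10, seed=12345)
again = draw_teaching_sample(pool, 10, seed=12345)
print(first == again)  # same seed, same sample
```

Because every item has the same chance of being drawn, the sample is not tilted toward easy or hard examples, which is the property the paragraph above relies on.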

In [87]:
%matplotlib inline
import pandas as pd
from pymongo import MongoClient
from IPython.display import Image, display, HTML
from bson.objectid import ObjectId
import random
import json
client = MongoClient('localhost')
db = client['crowdy']

In [3]:
print db.collection_names()
hits = db.hits
ts = db.tasksets

[u'bonus', u'hits', u'pins', u'system.indexes', u'tagqualities', u'tasksets', u'trainingsets', u'turkbackups']


## Sample Tag Condition

Let's grab the list of tags. All the hits have the items in them, so first I'll pick out a hit and grab the sample list from there.

In [4]:
search = hits.find({},{"name":1})
pd.DataFrame(list(search))

Unnamed: 0,_id,name
0,55911fb049eb7fef15c0f5c2,first-pin-tagging-basic-basic
1,55a48ddcc8f17da574db50b7,pin-tagging-fast-fast-1
2,55b590a40850a14a81376f89,TESTBASIC
3,55bc94fb887cb2a616d5b2b3,Test Relevance HIT
4,55c25383c7ed2ff20c44da12,image-relevance-basic-basic1
5,55c2e23839121f0919b64293,image-relevance-fast-fast-1-TESTING
6,55c3021cf7dc871244355b59,image-relevance-fast-fast-1
7,55c43e3a5560042511d789bc,image-relevance-basic-feedback-TESTING1
8,55c5576b406503ba27582732,image-relevance-basic-feedback-1-lateFri
9,55c68b7447110a6718e4f663,first-pin-tagging-basic-feedback


"first-pin-tagging-basic-basic" will do.

In [82]:
search = hits.find({'_id':ObjectId("55911fb049eb7fef15c0f5c2")}, {'items':1})
items = list(search)[0]['items']
random.seed(84754378437982498555547830) #Keyboard mashing
sample1 = random.sample(items, 10)
random.seed(3432478234) #Keyboard mashing
sample2 = random.sample(items, 10)
sample1

[2.427728546632588e+16,
 1.9527333381469443e+17,
 1.1934545884538064e+17,
 2814818489916730.0,
 4.7027438605946976e+17,
 3.364330346336472e+17,
 1.5129286870590038e+17,
 1.6642214870322944e+17,
 1.191343526127168e+17,
 3.5043641472721856e+17]
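The ids above print as floats because that is how they were stored. A small sketch of why that matters, using the first id from the output above: formatting with `%.0f` gives back a readable id string, but integers above 2**53 can no longer be told apart once they are floats, so the stored value may not be the exact original id.

```python
stored = 2.427728546632588e+16  # first id in sample1, as stored in the database

# Format without an exponent or decimal point
as_text = "%.0f" % stored
print(as_text)  # 24277285466325880

# Above 2**53 neighbouring integers collapse onto the same float,
# so an id ending in ...881 would be indistinguishable from this one.
print(stored > 2 ** 53)
print(float(24277285466325881) == stored)
```

This is why the cells below always pass ids through `'%.0f' % ...` before writing them to JSON, and why storing them as strings in the first place would have been safer.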

Now let's cross-reference that with the judgment information on great, good, ok, and poor tags.

In [93]:
i = 0
for sample in [sample1, sample2]:
    i += 1
    search = db.tagqualities.aggregate([
        {'$match':{'item_id':{'$in':sample}}},
        {'$group':{'_id':{'item':'$item_id', 'quality':'$quality'}, 'tags':{'$push':'$tag'}}},
        {'$group':{'_id':'$_id.item', 'answers':{'$push':{'quality':'$_id.quality', 'tags':'$tags'}}}}
    ])
    allanswers = list(search)

    # Replace answers with associative array
    # and rename _id as 'item'
    def toObj(l):
        d = {}
        key = {'1':'poor', '2':'ok', '3':'good', '4':'great'}
        for qual in l:
            d[key[qual['quality']]] = qual['tags']
        return d

    for item in allanswers:
        item['item'] = "%.0f" % item['_id']
        del item['_id']
        item['answers'] = toObj(item['answers'])
        item['teaching'] = {'tips':""}

    data = {'_id':('sample%d'%i), 'items':['%.0f'%s for s in sample], 'answers':allanswers}
    print json.dumps(data)[:300] + "..."
    
    # The context manager closes the file even if json.dump raises
    with open('../data/turk/training-tag-sample%d.json' % i, 'w') as f:
        json.dump(data, f, indent=2)

{"items": ["24277285466325880", "195273333814694432", "119345458845380640", "2814818489916730", "470274386059469760", "336433034633647232", "151292868705900384", "166422148703229440", "119134352612716800", "350436414727218560"], "_id": "sample1", "answers": [{"item": "195273333814694432", "answers":...
{"items": ["2814818489916730", "269230883943533664", "104708760057824320", "406520303835237888", "84161086761367824", "18366310953663964", "42573158949928880", "226798531204491904", "386254105513790016", "407083253788406464"], "_id": "sample2", "answers": [{"item": "84161086761367824", "answers": {"...
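To make the reshaping step easier to follow, here is a standalone sketch of what `toObj` does to one item's grouped answers: it turns a list of `{quality, tags}` records into a dictionary keyed by the quality label. The function name `answers_by_label` and the example tags are made up for illustration; only the code-to-label mapping comes from the cell above.

```python
# Mapping from stored quality codes to labels, as in toObj above
QUALITY_LABELS = {'1': 'poor', '2': 'ok', '3': 'good', '4': 'great'}

def answers_by_label(grouped):
    """Turn [{'quality': code, 'tags': [...]}, ...] into {label: [...]}."""
    return dict((QUALITY_LABELS[g['quality']], g['tags']) for g in grouped)

# Hypothetical grouped output for a single item
grouped = [
    {'quality': '4', 'tags': ['granola', 'breakfast']},
    {'quality': '1', 'tags': ['yum']},
]
reshaped = answers_by_label(grouped)
print(reshaped['great'])
print(reshaped['poor'])
```

Quality levels with no tags for an item are simply absent from the result, so anything reading the JSON should not assume all four keys exist.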


That's it. The rest is manually encoded in the teaching set file. For cross-referencing, here is the image information for each sampled item:

In [90]:
pins = list(db.pins.aggregate([{'$project':{'image':'$image.236', 'description':1, 'title':1}}]))
pinDetails = dict(zip([pin['_id'] for pin in pins], ["%s||%s" % (pin['title'], pin['description']) for pin in pins]))
pinRef = dict(zip([pin['_id'] for pin in pins], [pin['image'] for pin in pins]))


for item in sample1 + sample2:
    print "%.0f"%item
    print pinDetails[item]
    display(Image(url=pinRef[item]))

24277285466325880
Outdoor Living||Image Via: A Pair and A Spare DIY


195273333814694432
Nails||Burgundy French Nails style


119345458845380640
Cookies and other mini delights||So, so appealingly delicious: Salted Caramel Bars.


2814818489916730
{ blythe dolos }||Blythe ... Minime


470274386059469760
MARK TEIXEIRA #25||Mark Teixeira


336433034633647232
My style.. ||SMYKKER - SOPHIE BY SOPHIE / STAR BRACELET - NELLY.COM


151292868705900384
NOMAD CHIC swim||http://www.nomad-chic.com/swim.html


166422148703229440
PHOTOS: Happy 31st Birthday, Prince William!||prince william


119134352612716800
Recipes/Food||Homemade granola.


350436414727218560
It's a nurse thing...||Lol funny nurse nursing problems #nurse history of my life


2814818489916730
{ blythe dolos }||Blythe ... Minime


269230883943533664
Food & Drinks||Smoked Salmon Spread in Sourdough Bowl


104708760057824320
Food :: Drink||Didn't realise how wrong my way of cooking mushrooms was!


406520303835237888
Vocaloid/UTAU||#vocaloid #anime


84161086761367824
Krystals baby shower||red queen party theme | queen-of-hearts-engagement-party-dessert-table-01.jpg


18366310953663964
Fiction||The Hound of the Baskervilles from Pulp! The Classics  See more pulped classics at http://www.ipgbook.com/oldcastle-books-ltd-publisher-OLC.php#by_Imprint_pulp--the-classics


42573158949928880
Parties For Girls 2||Beach Seaside Cake


226798531204491904
haha||Funny conversations


386254105513790016
Bonsai||Jeliti 01


407083253788406464
Systems/Networks/Patterns||from Earth Medicine by Kenneth Meadows
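One thing the listing above shows is that item 2814818489916730 is printed twice: the two samples were drawn independently, so they can overlap. A quick sketch of how to check that with the id strings from the JSON output, followed by one possible way to get disjoint samples. The helper `draw_disjoint` is hypothetical and was not used to build the actual teaching sets.

```python
import random

sample1_ids = ["24277285466325880", "195273333814694432", "119345458845380640",
               "2814818489916730", "470274386059469760", "336433034633647232",
               "151292868705900384", "166422148703229440", "119134352612716800",
               "350436414727218560"]
sample2_ids = ["2814818489916730", "269230883943533664", "104708760057824320",
               "406520303835237888", "84161086761367824", "18366310953663964",
               "42573158949928880", "226798531204491904", "386254105513790016",
               "407083253788406464"]

# Items that ended up in both samples
shared = sorted(set(sample1_ids) & set(sample2_ids))
print(shared)  # ['2814818489916730']

def draw_disjoint(items, k, seed):
    """Draw 2*k items once and split them, so the two halves cannot overlap."""
    rng = random.Random(seed)
    drawn = rng.sample(list(items), 2 * k)
    return drawn[:k], drawn[k:]

# Hypothetical pool of ids for illustration
first_half, second_half = draw_disjoint(range(50), 10, seed=1)
print(set(first_half) & set(second_half))
```

Whether overlap is a problem depends on how the two samples are used; if they go to different conditions, a shared item is harmless, but it does mean the combined set has 19 distinct items rather than 20.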


## Sample Relevance Judgments

First, randomly select 2 queries. These will be used for different conditions.

In [124]:
random.seed(345679867543) #Keyboard mashing
search = hits.aggregate([
    {"$match":{"_id":ObjectId("55c25383c7ed2ff20c44da12")}},
    {"$unwind":'$facets'}
        ])
facetList = [v['facets'] for v in search]
facetSample = random.sample(facetList, 2)
facetSample

[{u'_id': ObjectId('55c25383c7ed2ff20c44da1f'),
  u'items': [4.094054223486766e+17,
   3.158853613339869e+17,
   3.187704798480601e+17,
   2.1560990088547654e+17,
   5.052476519179934e+17,
   1.2736749568546998e+17,
   4.996885211263089e+17,
   4.837851849450169e+17,
   3.569807079350651e+17,
   2.4488391696345056e+16,
   5.554205665139417e+17,
   3.965983109093696e+17,
   2.5065353548220704e+17,
   7.853158722332877e+16,
   5.1425485117416403e+17,
   5.0545875815085203e+17,
   3.598658264470783e+17,
   4.911739030811547e+17,
   2.923822007810157e+17,
   4.5627100596948275e+17,
   4.600005494110556e+17,
   3.3024058514710886e+17,
   4.127127533233578e+17,
   4.8329260373437766e+17,
   3.945576173273675e+17,
   4.4923413154970195e+17,
   2.319354495294728e+17,
   1.1097157829321018e+17,
   4.2333843369454726e+17,
   4.032135913863553e+16],
  u'meta': {u'good': u'images relating to Islamic religion or Islamic culture, such as prayer or religious congregation; passages of scripture or ins

Okay, so the two sample queries are "upcycle" and "buffalo chicken dip". Let's select 10 items from each.

In [7]:
finalSample = facetSample
for query in finalSample:
     query['items'] = random.sample(query['items'], 10)
finalSample

[{u'_id': ObjectId('55c25383c7ed2ff20c44da19'),
  u'items': [5.600649036302706e+17,
   1.9245858404593962e+17,
   5.011662647571746e+17,
   5.705498465134858e+17,
   3.19474785791594e+16,
   4.187644653241429e+17,
   2.343279868312408e+17,
   5.6252780968015206e+17,
   2.2792443111638272e+17,
   1.1069010331549446e+17],
  u'meta': {u'good': u'items being reused in a way that improves on the original',
   u'ok': u'images that show something in the process of being rebuilt or reused',
   u'query': u'upcycle'}},
 {u'_id': ObjectId('55c25383c7ed2ff20c44da15'),
  u'items': [9.51386108477204e+16,
   5.5647609767733766e+17,
   1.2026025251972878e+17,
   3659243419266073.0,
   1.606519116786779e+17,
   2.5058316674052784e+17,
   2.3890195520626806e+17,
   5.11510470150033e+17,
   8514686766779371.0,
   2.3643966764345764e+16],
  u'meta': {u'good': u'images of buffalo chicken dip (a dip for food, like chips), recipes for buffalo chicken dip',
   u'ok': u'images of other types of dip, images of 
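Note that `finalSample = facetSample` only adds a second name for the same list, so trimming `query['items']` also overwrites the full item lists in `facetSample`. If the original lists are still needed, a non-mutating variant could look like the sketch below. The function `subsample_items` and the toy facet are assumptions for illustration, not the code that produced the output above.

```python
import copy
import random

def subsample_items(facets, k, seed):
    """Return copies of the facets with 'items' cut down to k random entries."""
    rng = random.Random(seed)
    trimmed_facets = []
    for facet in facets:
        trimmed = copy.deepcopy(facet)  # leave the caller's facet untouched
        trimmed['items'] = rng.sample(facet['items'], k)
        trimmed_facets.append(trimmed)
    return trimmed_facets

# Hypothetical facet with 30 items, mirroring the shape of the real ones
facets = [{'meta': {'query': 'upcycle'}, 'items': list(range(30))}]
trimmed = subsample_items(facets, 10, seed=7)
print("%d %d" % (len(facets[0]['items']), len(trimmed[0]['items'])))  # 30 10
```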

Now, let's look up the counts for all the results that we've seen before.

Note that much of the info in the 'meta' field was manually added for easier querying of tasksets (without having to cross-reference the associated HIT). In other words, make sure everything in the DB is properly tagged!

In [135]:
pd.set_option('display.float_format', lambda x: '%.0f' % x)
search = ts.aggregate([
    {"$match":{"meta.type":"relevance judgments", "meta.test":False}},
    {"$unwind":"$tasks"},
    {"$project":{"judgment":"$tasks.contribution.relevance", "query":"$facet.meta.query", "item":"$tasks.item.id"}},
    {"$unwind":"$judgment"}
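])
# Materialize the cursor so the judgments can be grouped with pandas
results = pd.DataFrame(list(search))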

def percentByGroup(x):
    defaults = pd.Series({"Not Relevant":0, "Somewhat Relevant":0, "Very Relevant":0})
    p = x.groupby('judgment').count()['_id'] / len(x)
    return p.add(defaults, fill_value=0)
judgment_likelihood = pd.DataFrame(results.groupby(["query","item"]).apply(percentByGroup))

# Melt from wide to long format
judgescores = pd.melt(
    judgment_likelihood.reset_index(),
    id_vars=["query", "item"],
    value_vars=["Not Relevant", "Somewhat Relevant","Very Relevant"],
    value_name="probability"
)
judgescores

Unnamed: 0,query,item,variable,probability
0,agape,21744010675101592,Not Relevant,1
1,agape,40321359138635528,Not Relevant,1
2,agape,53550683039109584,Not Relevant,1
3,agape,61291244899996240,Not Relevant,1
4,agape,86272149083353136,Not Relevant,1
5,agape,103512491405117920,Not Relevant,1
6,agape,132293307774792368,Not Relevant,0
7,agape,146226319122668288,Not Relevant,1
8,agape,162762974005815584,Not Relevant,1
9,agape,185351340889599296,Not Relevant,1
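The `probability` column is just the share of contributors who gave each judgment for a query and item pair, with zeros filled in for labels nobody chose. A pandas-free sketch of the same arithmetic for a single pair, using four hypothetical votes and a made-up helper name `judgment_shares`:

```python
from collections import Counter

LABELS = ["Not Relevant", "Somewhat Relevant", "Very Relevant"]

def judgment_shares(judgments):
    """Share of votes per label; labels with no votes get 0.0."""
    counts = Counter(judgments)  # missing labels count as 0
    total = float(len(judgments))
    return dict((label, counts[label] / total) for label in LABELS)

# Hypothetical votes for one (query, item) pair
votes = ["Not Relevant", "Not Relevant", "Somewhat Relevant", "Very Relevant"]
shares = judgment_shares(votes)
print(shares["Not Relevant"])       # 0.5
print(shares["Somewhat Relevant"])  # 0.25
```

The sketch assumes at least one vote per pair; an empty list would divide by zero, which the grouped pandas version never encounters because empty groups do not exist.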


In [148]:
# I stupidly saved item ids as floats, this will coerce to string
a = ['%.0f' % x for x in finalSample[0]['items']]
judgescores
finalSample

[{u'_id': ObjectId('55c25383c7ed2ff20c44da19'),
  u'items': [5.600649036302706e+17,
   1.9245858404593962e+17,
   5.011662647571746e+17,
   5.705498465134858e+17,
   3.19474785791594e+16,
   4.187644653241429e+17,
   2.343279868312408e+17,
   5.6252780968015206e+17,
   2.2792443111638272e+17,
   1.1069010331549446e+17],
  u'meta': {u'good': u'items being reused in a way that improves on the original',
   u'ok': u'images that show something in the process of being rebuilt or reused',
   u'query': u'upcycle'}},
 {u'_id': ObjectId('55c25383c7ed2ff20c44da15'),
  u'items': [9.51386108477204e+16,
   5.5647609767733766e+17,
   1.2026025251972878e+17,
   3659243419266073.0,
   1.606519116786779e+17,
   2.5058316674052784e+17,
   2.3890195520626806e+17,
   5.11510470150033e+17,
   8514686766779371.0,
   2.3643966764345764e+16],
  u'meta': {u'good': u'images of buffalo chicken dip (a dip for food, like chips), recipes for buffalo chicken dip',
   u'ok': u'images of other types of dip, images of 

In [199]:
import json
def getProb(query, item):
    a= judgescores.query("query == '%s' and item == %.0f" % (query, item))
    return a[['variable', 'probability']].set_index('variable').to_dict()
for sample in finalSample:
    # Get probabilities for all the items
    items = [{("%.0f" % item): getProb(sample['meta']['query'], item)} 
                   for item in sample['items']]
    sample['answers'] = items
    sample['_id'] = str(sample['_id'])
    # The context manager closes the file even if json.dump raises
    with open('../data/turk/training-relevance-%s.json' % sample['_id'], 'w') as f:
        json.dump(sample, f, indent=2)

In [14]:
search = ts.aggregate([
    {'$unwind':'$tasks'}, {'$unwind':'$tasks.contribution.tags'},
    {'$group':{'_id':{'item':'$tasks.item.id', 'tag':'$tasks.contribution.tags'}, 'count':{'$sum':1}}},
    {'$project':{'item':'$_id.item', 'tag':'$_id.tag', 'count':1, '_id':0}}
])
pd.DataFrame(list(search)).sort('count', ascending=False)

count    2387
item     2387
tag      2387
dtype: int64

For reference while I'm manually encoding the teaching tips:

In [201]:
from IPython.display import Image, display, HTML
for sample in finalSample:
    display(HTML("<h2>%s(%s)</h2>"%(sample['meta']['query'], sample['_id'])))
    for item in sample['items']:
        print "%.0f"%item, getProb(sample['meta']['query'], item)
        print pinDetails[item]
        display(Image(url=pinRef[item]))

560064903630270592 {'probability': {'Somewhat Relevant': 0.25, 'Not Relevant': 0.5, 'Very Relevant': 0.25}}
Upcycle||Interesting pic of shadow. Getting this table, why buy new?


192458584045939616 {'probability': {'Somewhat Relevant': 0.3125, 'Not Relevant': 0.125, 'Very Relevant': 0.5625}}
Reduce, Re-use & Recycle.... and UPcycle!||way too many little plastic animals lying around anyway!


501166264757174592 {'probability': {'Somewhat Relevant': 0.26666666666666666, 'Not Relevant': 0.066666666666666666, 'Very Relevant': 0.66666666666666663}}
Repurpose & Reuse||23 Creative Ways To Reuse Old Plastic Bottles,,plastic-bottles-recycling-ideas-13


570549846513485824 {'probability': {'Somewhat Relevant': 0.1875, 'Not Relevant': 0.1875, 'Very Relevant': 0.625}}
§ Recycle § upcycle clothes||THE MUDDY PRINCESS


31947478579159400 {'probability': {'Somewhat Relevant': 0.29411764705882354, 'Not Relevant': 0.0, 'Very Relevant': 0.70588235294117652}}
Sewing-kids||Upcycle tees


418764465324142912 {'probability': {'Somewhat Relevant': 0.23529411764705882, 'Not Relevant': 0.6470588235294118, 'Very Relevant': 0.11764705882352941}}
PAPER ART||Stargazer Lilies DL Envelope Card Mini Kit on Craftsuprint designed by Sandie Burchell - made by Dianne Jackson - I used a centre opening gatefold card as my base to make this up in another different way. I decoupaged with sticky pads and added the insert. I tied a ribbon bow round the front and added my own badge sentiment to the bottom. This really does look effective. These designs are so versatile and you can use them in so many different ways...How will you use them - Now available for ...


234327986831240800 {'probability': {'Somewhat Relevant': 0.2857142857142857, 'Not Relevant': 0.14285714285714285, 'Very Relevant': 0.5714285714285714}}
Things to make.||Upcycle paint samples. As the daughter of an interior designer, I could make fifty of these.


562527809680152064 {'probability': {'Somewhat Relevant': 0.13333333333333333, 'Not Relevant': 0.20000000000000001, 'Very Relevant': 0.66666666666666663}}
manualidades||


227924431116382720 {'probability': {'Somewhat Relevant': 0.055555555555555552, 'Not Relevant': 0.055555555555555552, 'Very Relevant': 0.88888888888888884}}
upcycle||Dishfunctional Designs: Neat Things You Can Make With Cookie Cutters


110690103315494464 {'probability': {'Somewhat Relevant': 0.57894736842105265, 'Not Relevant': 0.26315789473684209, 'Very Relevant': 0.15789473684210525}}
Recycle, Upcycle :: Donna Shipley-Richie's clipboard on||Hometalk :: Recycle, Upcycle :: Donna Shipley-Richie's clipboard on Hometalk Burlap Boxes


95138610847720400 {'probability': {'Somewhat Relevant': 0.375, 'Not Relevant': 0.625, 'Very Relevant': 0.0}}
Good Eats||Buffalo Chicken Sandwiches


556476097677337664 {'probability': {'Somewhat Relevant': 0.20000000000000001, 'Not Relevant': 0.80000000000000004, 'Very Relevant': 0.0}}
Food and Drink||Bacon Wrapped Chicken And Asparagus I used the same marinade for the chicken and asparagus and put some Philly 3 cheese cooking cream in the middle


120260252519728784 {'probability': {'Somewhat Relevant': 0.41176470588235292, 'Not Relevant': 0.52941176470588236, 'Very Relevant': 0.058823529411764705}}
Chicken with Tomatoes and Prunes||Chicken with tomatoes and prunes... could be wonderful?    UPDATE: Made this tonight, it was very delish!  I might mess around with the cooking times for the chicken itself as it ended up being a little overcooked, but otherwise totally tasty!


3659243419266073 {'probability': {'Somewhat Relevant': 0.1875, 'Not Relevant': 0.6875, 'Very Relevant': 0.125}}
Made By Yours Truly||Stuffed Buffalo Chicken


160651911678677888 {'probability': {'Somewhat Relevant': 0.25, 'Not Relevant': 0.6875, 'Very Relevant': 0.0625}}
Healthy Curry Gluten-Free Chicken Nuggets Recipe||Healthy Curry Gluten-Free Chicken Nuggets - Egg-Free too!


250583166740527840 {'probability': {'Somewhat Relevant': 0.11764705882352941, 'Not Relevant': 0.88235294117647056, 'Very Relevant': 0.0}}
Food||Skinny Chef Red Hot Buffalo Chicken Wings


238901955206268064 {'probability': {'Somewhat Relevant': 0.52941176470588236, 'Not Relevant': 0.35294117647058826, 'Very Relevant': 0.11764705882352941}}
I love food - chicken||Buffalo chicken and celery


511510470150033024 {'probability': {'Somewhat Relevant': 0.25, 'Not Relevant': 0.375, 'Very Relevant': 0.375}}
Cherry Chipotle Chicken Wings For Game Day||“Is the big football game a must see in your house? This year I will be making cherry chipotle chicken wings…. Oh my word. I made them the other day and have made another batch since. I would be too crazy not to share them with my friends and family as we cheer on our favorite football team on Feb. 2nd. Give them a test run and fall in love.” - Babble.com


8514686766779371 {'probability': {'Somewhat Relevant': 0.058823529411764705, 'Not Relevant': 0.88235294117647056, 'Very Relevant': 0.058823529411764705}}
Chicken Meatball Soup||Our Greek-Style Chicken Meatballs recipe is a fresh addition to a classic chicken noodle soup with vegetables in this Chicken Meatball Soup. #recipes #soup #comfortfood


23643966764345764 {'probability': {'Somewhat Relevant': 0.0, 'Not Relevant': 1.0, 'Very Relevant': 0.0}}
My Favorite Recipes||Chicken Curry Salad in a Hurry Allrecipes.com.  This is a great "grab and go" recipe if you're swinging by the market.  I usually omit the hard boiled eggs unless we happen to have some on hand.
