# Teaching condition sampling

Which items should be used for the teaching set?

Ideally, I would customize the teaching instructions based on what previous contributors provided. However, purposively selecting the toughest examples would bias the results: the other conditions would still be measured on those ten extra-tough examples, while the teaching condition would have no measurements for them. Even though it is not ideal, the approach I'm taking to avoid this bias is random sampling.
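
As a minimal sketch of the random-sampling idea, the snippet below draws a reproducible teaching set from a pool of candidate items. The helper name `sample_teaching_items`, the item ids, and the seed are hypothetical and are not taken from the study data.

```python
import random

def sample_teaching_items(item_ids, k, seed):
    """Return k item ids drawn uniformly at random, reproducibly."""
    rng = random.Random(seed)  # private generator; global random state is untouched
    return rng.sample(list(item_ids), k)

# Hypothetical pool of 30 candidate items
candidate_ids = ["item-%02d" % i for i in range(30)]

teaching = sample_teaching_items(candidate_ids, k=10, seed=2015)
teaching_set = set(teaching)

# Items that are not used for teaching
remaining = [i for i in candidate_ids if i not in teaching_set]
```

Because every item has the same chance of being drawn, the teaching set is not systematically harder or easier than the items left over.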

In [1]:
%matplotlib inline
import pandas as pd
from pymongo import MongoClient
from bson.objectid import ObjectId
import random

In [2]:
client = MongoClient('localhost')
db = client['crowdy']
print db.collection_names()
hits = db.hits
ts = db.tasksets

[u'bonus', u'hits', u'pins', u'system.indexes', u'tagqualities', u'tasksets', u'turkbackups']


## Sample Tag Condition

In [3]:
search = ts.aggregate([
    {'$unwind':'$tasks'}, {'$unwind':'$tasks.contribution.tags'}, {'$group':{'_id':'$hit_id', 'count':{'$sum':1}}}
])
pd.DataFrame(list(search))

,_id,count
0,55c68b7447110a6718e4f663,609
1,55911fb049eb7fef15c0f5c2,1574
2,55a48ddcc8f17da574db50b7,1155


## Sample Relevance Judgments

First, randomly select 2 queries. These will be used for different conditions.

In [124]:
random.seed(345679867543) #Keyboard mashing
search = hits.aggregate([
    {"$match":{"_id":ObjectId("55c25383c7ed2ff20c44da12")}},
    {"$unwind":'$facets'}
        ])
facetList = [v['facets'] for v in search]
facetSample = random.sample(facetList, 2)
facetSample

[{u'_id': ObjectId('55c25383c7ed2ff20c44da1f'),
  u'items': [4.094054223486766e+17,
   3.158853613339869e+17,
   3.187704798480601e+17,
   2.1560990088547654e+17,
   5.052476519179934e+17,
   1.2736749568546998e+17,
   4.996885211263089e+17,
   4.837851849450169e+17,
   3.569807079350651e+17,
   2.4488391696345056e+16,
   5.554205665139417e+17,
   3.965983109093696e+17,
   2.5065353548220704e+17,
   7.853158722332877e+16,
   5.1425485117416403e+17,
   5.0545875815085203e+17,
   3.598658264470783e+17,
   4.911739030811547e+17,
   2.923822007810157e+17,
   4.5627100596948275e+17,
   4.600005494110556e+17,
   3.3024058514710886e+17,
   4.127127533233578e+17,
   4.8329260373437766e+17,
   3.945576173273675e+17,
   4.4923413154970195e+17,
   2.319354495294728e+17,
   1.1097157829321018e+17,
   4.2333843369454726e+17,
   4.032135913863553e+16],
  u'meta': {u'good': u'images relating to Islamic religion or Islamic culture, such as prayer or religious congregation; passages of scripture or ins

Okay, so the two sample queries are "upcycle" and "buffalo chicken dip". Let's select 10 items from each.

In [7]:
finalSample = facetSample
for query in finalSample:
    query['items'] = random.sample(query['items'], 10)
finalSample

[{u'_id': ObjectId('55c25383c7ed2ff20c44da19'),
  u'items': [5.600649036302706e+17,
   1.9245858404593962e+17,
   5.011662647571746e+17,
   5.705498465134858e+17,
   3.19474785791594e+16,
   4.187644653241429e+17,
   2.343279868312408e+17,
   5.6252780968015206e+17,
   2.2792443111638272e+17,
   1.1069010331549446e+17],
  u'meta': {u'good': u'items being reused in a way that improves on the original',
   u'ok': u'images that show something in the process of being rebuilt or reused',
   u'query': u'upcycle'}},
 {u'_id': ObjectId('55c25383c7ed2ff20c44da15'),
  u'items': [9.51386108477204e+16,
   5.5647609767733766e+17,
   1.2026025251972878e+17,
   3659243419266073.0,
   1.606519116786779e+17,
   2.5058316674052784e+17,
   2.3890195520626806e+17,
   5.11510470150033e+17,
   8514686766779371.0,
   2.3643966764345764e+16],
  u'meta': {u'good': u'images of buffalo chicken dip (a dip for food, like chips), recipes for buffalo chicken dip',
   u'ok': u'images of other types of dip, images of 
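
One thing to keep in mind about the cell above: `finalSample = facetSample` binds a second name to the same list, so the original 30-item lists are overwritten by the 10-item samples. If the full lists are still needed later, a copy can be sampled instead. This is a sketch with hypothetical toy facets, not the notebook's data; `subsample_items` is an illustrative helper.

```python
import copy
import random

def subsample_items(facets, k, seed):
    """Return a copy of facets where each facet keeps k randomly chosen items."""
    rng = random.Random(seed)
    sampled = copy.deepcopy(facets)  # leave the caller's data untouched
    for facet in sampled:
        facet['items'] = rng.sample(facet['items'], k)
    return sampled

# Hypothetical facets standing in for the two sampled queries
facets = [
    {'meta': {'query': 'upcycle'}, 'items': list(range(30))},
    {'meta': {'query': 'buffalo chicken dip'}, 'items': list(range(100, 130))},
]

final = subsample_items(facets, k=10, seed=7)
```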

Now, let's look up the counts for all the results that we've seen before.

Note that much of the info in the 'meta' field was manually added for easier querying of tasksets (without having to cross-reference the associated HIT). In other words, make sure everything in the DB is properly tagged!

In [135]:
pd.set_option('display.float_format', lambda x: '%.0f' % x)
search = ts.aggregate([
    {"$match":{"meta.type":"relevance judgments", "meta.test":False}},
    {"$unwind":"$tasks"},
    {"$project":{"judgment":"$tasks.contribution.relevance", "query":"$facet.meta.query", "item":"$tasks.item.id"}},
    {"$unwind":"$judgment"}
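])
# Materialize the cursor; the groupby below needs a DataFrame
results = pd.DataFrame(list(search))

def percentByGroup(x):
    # Share of each judgment label within one (query, item) group
    defaults = pd.Series({"Not Relevant":0, "Somewhat Relevant":0, "Very Relevant":0})
    p = x.groupby('judgment').count()['_id'] / len(x)
    return p.add(defaults, fill_value=0)

judgment_likelihood = pd.DataFrame(results.groupby(["query","item"]).apply(percentByGroup))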

# Melt from wide to long format
judgescores = pd.melt(
    judgment_likelihood.reset_index(),
    id_vars=["query", "item"],
    value_vars=["Not Relevant", "Somewhat Relevant","Very Relevant"],
    value_name="probability"
)
judgescores

,query,item,variable,probability
0,agape,21744010675101592,Not Relevant,1
1,agape,40321359138635528,Not Relevant,1
2,agape,53550683039109584,Not Relevant,1
3,agape,61291244899996240,Not Relevant,1
4,agape,86272149083353136,Not Relevant,1
5,agape,103512491405117920,Not Relevant,1
6,agape,132293307774792368,Not Relevant,0
7,agape,146226319122668288,Not Relevant,1
8,agape,162762974005815584,Not Relevant,1
9,agape,185351340889599296,Not Relevant,1
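
The per-item label shares can also be computed with a row-normalized cross-tabulation. The sketch below uses hypothetical toy judgments and assumes a pandas version where `pd.crosstab` accepts `normalize`; it is an alternative formulation, not the code that produced the table above.

```python
import pandas as pd

LABELS = ["Not Relevant", "Somewhat Relevant", "Very Relevant"]

# Hypothetical judgments, one row per worker judgment
toy = pd.DataFrame({
    "query": ["upcycle"] * 4 + ["agape"] * 2,
    "item": ["a", "a", "a", "a", "b", "b"],
    "judgment": ["Very Relevant", "Very Relevant", "Somewhat Relevant",
                 "Not Relevant", "Not Relevant", "Not Relevant"],
})

# Each (query, item) row sums to 1; labels nobody chose become 0
shares = pd.crosstab(
    index=[toy["query"], toy["item"]],
    columns=toy["judgment"],
    normalize="index",
).reindex(columns=LABELS, fill_value=0.0)

# Long format: one row per (query, item, label)
long_shares = shares.reset_index().melt(
    id_vars=["query", "item"],
    value_vars=LABELS,
    var_name="variable",
    value_name="probability",
)
```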


In [148]:
# I stupidly saved item ids as floats; this coerces them to integer-formatted strings
a = ['%.0f' % x for x in finalSample[0]['items']]
judgescores
finalSample

[{u'_id': ObjectId('55c25383c7ed2ff20c44da19'),
  u'items': [5.600649036302706e+17,
   1.9245858404593962e+17,
   5.011662647571746e+17,
   5.705498465134858e+17,
   3.19474785791594e+16,
   4.187644653241429e+17,
   2.343279868312408e+17,
   5.6252780968015206e+17,
   2.2792443111638272e+17,
   1.1069010331549446e+17],
  u'meta': {u'good': u'items being reused in a way that improves on the original',
   u'ok': u'images that show something in the process of being rebuilt or reused',
   u'query': u'upcycle'}},
 {u'_id': ObjectId('55c25383c7ed2ff20c44da15'),
  u'items': [9.51386108477204e+16,
   5.5647609767733766e+17,
   1.2026025251972878e+17,
   3659243419266073.0,
   1.606519116786779e+17,
   2.5058316674052784e+17,
   2.3890195520626806e+17,
   5.11510470150033e+17,
   8514686766779371.0,
   2.3643966764345764e+16],
  u'meta': {u'good': u'images of buffalo chicken dip (a dip for food, like chips), recipes for buffalo chicken dip',
   u'ok': u'images of other types of dip, images of 
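
To illustrate why float ids are risky: ids near 5.6e17 are larger than 2**53, where 64-bit floats can no longer represent every integer (at this magnitude neighbouring floats are 64 apart). Formatting with `'%.0f'` prints the stored value exactly, but it cannot restore digits that were lost when the id was first saved as a float. Whether any ids in this dataset were actually altered is not something the notebook shows; the snippet only demonstrates the mechanism.

```python
# First 'upcycle' item id, as shown in the output above
as_float = 5.600649036302706e+17
as_text = '%.0f' % as_float  # exact decimal rendering of the stored float

# The same id as an integer, and its immediate neighbour
big = 560064903630270592
collides = float(big + 1) == float(big)  # both round to the same float

# Keeping ids as strings (or integers) avoids this rounding entirely
safe_id = str(big)
```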

In [199]:
import json
def getProb(query, item):
    a = judgescores.query("query == '%s' and item == %.0f" % (query, item))
    return a[['variable', 'probability']].set_index('variable').to_dict()
for sample in finalSample:
    # Get probabilities for all the items
    items = [{("%.0f" % item): getProb(sample['meta']['query'], item)} 
                   for item in sample['items']]
    sample['answers'] = items
    sample['_id'] = str(sample['_id'])
    # Write one JSON file per sampled query; the file is closed automatically
    with open('../data/turk/training-relevance-%s.json' % sample['_id'], 'w') as f:
        json.dump(sample, f, indent=2)

In [14]:
search = ts.aggregate([
    {'$unwind':'$tasks'}, {'$unwind':'$tasks.contribution.tags'},
    {'$group':{'_id':{'item':'$tasks.item.id', 'tag':'$tasks.contribution.tags'}, 'count':{'$sum':1}}},
    {'$project':{'item':'$_id.item', 'tag':'$_id.tag', 'count':1, '_id':0}}
])
pd.DataFrame(list(search)).sort('count', ascending=False)

count    2387
item     2387
tag      2387
dtype: int64

For reference while I'm manually encoding the teaching tips:

In [201]:
from IPython.display import Image, display, HTML
pins = list(db.pins.aggregate([{'$project':{'image':'$image.236', 'description':1, 'title':1}}]))
# Lookup tables keyed by pin id: "title||description" text and image URL
pinDetails = {pin['_id']: "%s||%s" % (pin['title'], pin['description']) for pin in pins}
pinRef = {pin['_id']: pin['image'] for pin in pins}
for sample in finalSample:
    display(HTML("<h2>%s(%s)</h2>"%(sample['meta']['query'], sample['_id'])))
    for item in sample['items']:
        print "%.0f"%item, getProb(sample['meta']['query'], item)
        print pinDetails[item]
        display(Image(url=pinRef[item]))

560064903630270592 {'probability': {'Somewhat Relevant': 0.25, 'Not Relevant': 0.5, 'Very Relevant': 0.25}}
Upcycle||Interesting pic of shadow. Getting this table, why buy new?


192458584045939616 {'probability': {'Somewhat Relevant': 0.3125, 'Not Relevant': 0.125, 'Very Relevant': 0.5625}}
Reduce, Re-use & Recycle.... and UPcycle!||way too many little plastic animals lying around anyway!


501166264757174592 {'probability': {'Somewhat Relevant': 0.26666666666666666, 'Not Relevant': 0.066666666666666666, 'Very Relevant': 0.66666666666666663}}
Repurpose & Reuse||23 Creative Ways To Reuse Old Plastic Bottles,,plastic-bottles-recycling-ideas-13


570549846513485824 {'probability': {'Somewhat Relevant': 0.1875, 'Not Relevant': 0.1875, 'Very Relevant': 0.625}}
§ Recycle § upcycle clothes||THE MUDDY PRINCESS


31947478579159400 {'probability': {'Somewhat Relevant': 0.29411764705882354, 'Not Relevant': 0.0, 'Very Relevant': 0.70588235294117652}}
Sewing-kids||Upcycle tees


418764465324142912 {'probability': {'Somewhat Relevant': 0.23529411764705882, 'Not Relevant': 0.6470588235294118, 'Very Relevant': 0.11764705882352941}}
PAPER ART||Stargazer Lilies DL Envelope Card Mini Kit on Craftsuprint designed by Sandie Burchell - made by Dianne Jackson - I used a centre opening gatefold card as my base to make this up in another different way. I decoupaged with sticky pads and added the insert. I tied a ribbon bow round the front and added my own badge sentiment to the bottom. This really does look effective. These designs are so versatile and you can use them in so many different ways...How will you use them - Now available for ...


234327986831240800 {'probability': {'Somewhat Relevant': 0.2857142857142857, 'Not Relevant': 0.14285714285714285, 'Very Relevant': 0.5714285714285714}}
Things to make.||Upcycle paint samples. As the daughter of an interior designer, I could make fifty of these.


562527809680152064 {'probability': {'Somewhat Relevant': 0.13333333333333333, 'Not Relevant': 0.20000000000000001, 'Very Relevant': 0.66666666666666663}}
manualidades||


227924431116382720 {'probability': {'Somewhat Relevant': 0.055555555555555552, 'Not Relevant': 0.055555555555555552, 'Very Relevant': 0.88888888888888884}}
upcycle||Dishfunctional Designs: Neat Things You Can Make With Cookie Cutters


110690103315494464 {'probability': {'Somewhat Relevant': 0.57894736842105265, 'Not Relevant': 0.26315789473684209, 'Very Relevant': 0.15789473684210525}}
Recycle, Upcycle :: Donna Shipley-Richie's clipboard on||Hometalk :: Recycle, Upcycle :: Donna Shipley-Richie's clipboard on Hometalk Burlap Boxes


95138610847720400 {'probability': {'Somewhat Relevant': 0.375, 'Not Relevant': 0.625, 'Very Relevant': 0.0}}
Good Eats||Buffalo Chicken Sandwiches


556476097677337664 {'probability': {'Somewhat Relevant': 0.20000000000000001, 'Not Relevant': 0.80000000000000004, 'Very Relevant': 0.0}}
Food and Drink||Bacon Wrapped Chicken And Asparagus I used the same marinade for the chicken and asparagus and put some Philly 3 cheese cooking cream in the middle


120260252519728784 {'probability': {'Somewhat Relevant': 0.41176470588235292, 'Not Relevant': 0.52941176470588236, 'Very Relevant': 0.058823529411764705}}
Chicken with Tomatoes and Prunes||Chicken with tomatoes and prunes... could be wonderful?    UPDATE: Made this tonight, it was very delish!  I might mess around with the cooking times for the chicken itself as it ended up being a little overcooked, but otherwise totally tasty!


3659243419266073 {'probability': {'Somewhat Relevant': 0.1875, 'Not Relevant': 0.6875, 'Very Relevant': 0.125}}
Made By Yours Truly||Stuffed Buffalo Chicken


160651911678677888 {'probability': {'Somewhat Relevant': 0.25, 'Not Relevant': 0.6875, 'Very Relevant': 0.0625}}
Healthy Curry Gluten-Free Chicken Nuggets Recipe||Healthy Curry Gluten-Free Chicken Nuggets - Egg-Free too!


250583166740527840 {'probability': {'Somewhat Relevant': 0.11764705882352941, 'Not Relevant': 0.88235294117647056, 'Very Relevant': 0.0}}
Food||Skinny Chef Red Hot Buffalo Chicken Wings


238901955206268064 {'probability': {'Somewhat Relevant': 0.52941176470588236, 'Not Relevant': 0.35294117647058826, 'Very Relevant': 0.11764705882352941}}
I love food - chicken||Buffalo chicken and celery


511510470150033024 {'probability': {'Somewhat Relevant': 0.25, 'Not Relevant': 0.375, 'Very Relevant': 0.375}}
Cherry Chipotle Chicken Wings For Game Day||“Is the big football game a must see in your house? This year I will be making cherry chipotle chicken wings…. Oh my word. I made them the other day and have made another batch since. I would be too crazy not to share them with my friends and family as we cheer on our favorite football team on Feb. 2nd. Give them a test run and fall in love.” - Babble.com


8514686766779371 {'probability': {'Somewhat Relevant': 0.058823529411764705, 'Not Relevant': 0.88235294117647056, 'Very Relevant': 0.058823529411764705}}
Chicken Meatball Soup||Our Greek-Style Chicken Meatballs recipe is a fresh addition to a classic chicken noodle soup with vegetables in this Chicken Meatball Soup. #recipes #soup #comfortfood


23643966764345764 {'probability': {'Somewhat Relevant': 0.0, 'Not Relevant': 1.0, 'Very Relevant': 0.0}}
My Favorite Recipes||Chicken Curry Salad in a Hurry Allrecipes.com.  This is a great "grab and go" recipe if you're swinging by the market.  I usually omit the hard boiled eggs unless we happen to have some on hand.


In [132]:
search = ts.aggregate([
    {'$unwind':'$tasks'}, {'$unwind':'$tasks.contribution.tags'},
    {'$project': {'tasks':1 ,'tags': { '$toLower': '$tasks.contribution.tags' }}},
    {'$group':{'_id':'$tasks.item.id', 'tags':{'$push':'$tags'}}}
])

rows = []
for item in list(search):
    #img = Image(url=pinRef[item['_id']])
    tags = pd.Series(item['tags'])
    tag_count_list = ", ".join(
        ["%s (%s)" % (row['index'], row[0]) for row in 
         tags.value_counts().reset_index().to_dict('records')]
    )
    rows.append("<tr><td width=75><img src='%s' /></td><td>%s</td></tr>" % (pinRef[item['_id']], tag_count_list))
    
table = HTML("<table>%s</table>" % ("".join(rows)))
display(table)

0,1
,"kiss (6), short girl (2), kiss me (2), just shut up and kiss me (2), relationship (1), shut up (1), teenagers kissing (1), weird (1), kids kissing (1), romance (1), love meme (1), couple (1), couples kissing (1), love (1), hug (1), teen (1)"
,"anime (8), girl (2), anime girl (2), vocal (1), cartoon (1), waifu (1), anime girl with wings and blue hair (1), thigh high socks (1), artwork (1), vocaloid (1), art (1), manga blue hair (1)"
,"dip (3), bread (2), appetizer (2), bread bowl (2), party appetizer idea (1), snack (1), breadbowl (1), food (1), delicious bread bowl dip (1), salmon (1), sourdough bowl (1), bread boll (1), smoked salmon spread in sourdough bowl (1), smoked salmon dip (1), salmon spread (1), bread bowl dip (1), smoked salmon (1), bread dip (1), creamy dip (1)"
,"woman (3), monica belluci (2), black bra (2), monica belluci sleeping (1), star (1), comfort (1), brunette laying down (1), red lips (1), movie (1), celebrity (1), makeup (1), beautiful woman (1), actress (1), bra (1), sleeping (1), women (1), actor (1), rest (1), too hard (1), movie star (1), model (1), portrait (1), pretty woman (1), monica (1)"
,"dress (5), black dress (4), little black dress (3), black (2), shift dress (1), frilly dress (1), black frilly dress (1), ruffles (1), black lace (1), frills (1), black dress with frills (1), lacy dress (1), closet (1), short-sleeves (1), womens (1)"
,"cocktail shaker (4), thermos (3), drink (2), too hard (2), shaker (2), copper (2), copper color (1), mixer (1), stainless steel cocktail shaker (1), water bottle (1), cup (1), cocktail skaker (1), copper shaker (1), metal bottle (1), canteen (1), insulated coffee mug (1), drinks (1)"
,"window (3), mountains (2), cabin (2), view (1), lake (1), cabin interior (1), house (1), wood panelling (1), lake view from bedroom window (1), wooden home (1), wood room (1), window with a view (1), comfy (1), lakeside view (1), scene (1), misty (1), wooden (1), mountain view (1), lake view from window (1), scenic (1), architecture (1), hotel with a view (1), giant window (1), interior design (1), chalet view (1), beautiful (1)"
,"clothing (4), mannequins (4), store (3), clothing store (2), fashion (2), mannequin (2), pants (1), boutique (1), department store (1), zara johannesburg (1), shopping (1), shorts (1), zara in johannesburg (1), women's (1), purse (1), shirts (1), figures (1), mannequins in beige (1), skirts (1), shop (1)"
,"underwear (3), men (3), models (2), abs (2), men in thongs (1), lockerroom (1), men in underwear (1), topher dimaggio and dominic pacifico (1), big bulges (1), hot male models (1), muscular (1), gay men locker room (1), dominic pacifico (1), homosexuality (1), topher dimaggio (1), gay men underwear (1), sexy (1), partially nude men (1), hot yummy men (1), two men (1), two semi-nude men in locker room (1)"
,"ohio state (4), keep calm and love ohio state (2), keep calm (2), plaque (1), keep calm and love ohio state poster (1), ohio state university (1), motto (1), ohio (1), design (1), book shelf ideas (1), state (1), sports (1), neutral (1), ohio state sign over shelf (1), love (1), dorm room (1)"


In [120]:
a = ['test', 'test', 'test2', 'test3']
b = pd.Series(a)
", ".join(["%s (%s)" % (row['index'], row[0]) for row in b.value_counts().reset_index().to_dict('records')])

'test (2), test2 (1), test3 (1)'
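
The same "tag (count)" summary can be built with `collections.Counter`, which avoids relying on the column names that `value_counts().reset_index()` produces (these may differ between pandas versions). This is a sketch, and `format_tag_counts` is a hypothetical helper name.

```python
from collections import Counter

def format_tag_counts(tags):
    """Return 'tag (n), ...' ordered from most to least frequent."""
    counts = Counter(tag.lower() for tag in tags)
    return ", ".join("%s (%d)" % (tag, n) for tag, n in counts.most_common())

summary = format_tag_counts(['test', 'test', 'test2', 'test3'])
```

On Python 3.7 and later, tags with equal counts keep the order in which they were first seen.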

## Save all unique tags to CSV

This allows me to manually judge the quality of each tag.

In [33]:
search = ts.aggregate([
    {'$unwind':'$tasks'}, {'$unwind':'$tasks.contribution.tags'},
    {'$project': {'tasks':1 ,'tags': { '$toLower': '$tasks.contribution.tags' }}},
    {'$group':{'_id':'$tasks.item.id', 'tags':{'$addToSet':'$tags'}}},
    {'$unwind':'$tags'}
])

allUnique = pd.DataFrame(list(search))
allUnique['image'] = allUnique['_id'].apply(lambda x: pinRef[x])
allUnique.sort('image')

,_id,tags,image
915,2.004806e+17,craft,https://crowdycrowdy.com/images/s-media-cache-...
916,2.004806e+17,party ideas,https://crowdycrowdy.com/images/s-media-cache-...
917,2.004806e+17,spider pail,https://crowdycrowdy.com/images/s-media-cache-...
918,2.004806e+17,halloween,https://crowdycrowdy.com/images/s-media-cache-...
919,2.004806e+17,halloween pail,https://crowdycrowdy.com/images/s-media-cache-...
920,2.004806e+17,gift pail,https://crowdycrowdy.com/images/s-media-cache-...
921,2.004806e+17,gift basket,https://crowdycrowdy.com/images/s-media-cache-...
922,2.004806e+17,halloween present,https://crowdycrowdy.com/images/s-media-cache-...
923,2.004806e+17,pail,https://crowdycrowdy.com/images/s-media-cache-...
924,2.004806e+17,gift,https://crowdycrowdy.com/images/s-media-cache-...
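
A minimal sketch of the export step, using only the standard library: tags are lowercased, de-duplicated per item, and written with an empty `quality` column to be filled in by hand. The item id, the column layout, the whitespace stripping, and the helper name `unique_tag_rows` are assumptions for illustration.

```python
import csv
import io

def unique_tag_rows(contributions):
    """Return sorted, de-duplicated (item_id, tag) pairs with lowercased tags."""
    seen = {(str(item_id), tag.strip().lower()) for item_id, tag in contributions}
    return sorted(seen)

# Hypothetical contributions: (item id, tag as typed by a worker)
contributions = [
    ("item-1", "Halloween"),
    ("item-1", "halloween"),
    ("item-1", "gift pail"),
]

buffer = io.StringIO()  # swap for open(path, "w", newline="") to write a real file
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["item", "tag", "quality"])  # quality is left blank for manual coding
for item_id, tag in unique_tag_rows(contributions):
    writer.writerow([item_id, tag, ""])

csv_text = buffer.getvalue()
```

Writing ids as strings also sidesteps the float formatting seen in the `_id` column above.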


In [45]:
group = allUnique.groupby(['image', 'tags']).sum()

In [24]:
search = db.pins.aggregate([
        {"$project":{"image":"$image.236"}}
    ])
a = pd.DataFrame(list(search))
b = pd.read_csv("../data/pinterest/tag-quality-judgments.csv")
print a[:5]
print b[:5]
c= a.merge(b)
del c['image']
print c[:5]
c.to_csv("../data/pinterest/tag-quality-judgments-merged.csv", float_format='%.0f', index=False)

                _id                                              image
0  2814818489916730  https://crowdycrowdy.com/images/s-media-cache-...
1  3096293466026588  https://crowdycrowdy.com/images/s-media-cache-...
2  4855512069785413  https://crowdycrowdy.com/images/s-media-cache-...
3  7529524346871766  https://crowdycrowdy.com/images/s-media-cache-...
4 10414642858052268  https://crowdycrowdy.com/images/s-media-cache-...
                                               image             tag  quality
0  https://crowdycrowdy.com/images/s-media-cache-...           craft        2
1  https://crowdycrowdy.com/images/s-media-cache-...     party ideas        2
2  https://crowdycrowdy.com/images/s-media-cache-...     spider pail        3
3  https://crowdycrowdy.com/images/s-media-cache-...       halloween        2
4  https://crowdycrowdy.com/images/s-media-cache-...  halloween pail        4
               _id            tag  quality
0 2814818489916730      girl doll        4
1 2814818489916730  

,_id,image,tag,quality
0,2.814818e+15,https://crowdycrowdy.com/images/s-media-cache-...,girl doll,4
1,2.814818e+15,https://crowdycrowdy.com/images/s-media-cache-...,glasses,3
2,2.814818e+15,https://crowdycrowdy.com/images/s-media-cache-...,read hair,1
3,2.814818e+15,https://crowdycrowdy.com/images/s-media-cache-...,bjd,3
4,2.814818e+15,https://crowdycrowdy.com/images/s-media-cache-...,big-eyed doll,4
5,2.814818e+15,https://crowdycrowdy.com/images/s-media-cache-...,doll,3
6,2.814818e+15,https://crowdycrowdy.com/images/s-media-cache-...,toy,2
7,2.814818e+15,https://crowdycrowdy.com/images/s-media-cache-...,child,2
8,2.814818e+15,https://crowdycrowdy.com/images/s-media-cache-...,blithe minime,1
9,2.814818e+15,https://crowdycrowdy.com/images/s-media-cache-...,red haired doll,4
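
The merge above joins on whatever columns the two frames share (here, `image`). A more defensive variant names the key explicitly and checks for judgments whose image URL has no matching pin. This sketch uses hypothetical toy frames and example URLs, and it assumes a pandas version that supports the `validate` and `indicator` arguments of `merge`.

```python
import pandas as pd

# Hypothetical stand-ins for the pins collection and the judgments CSV
pins_df = pd.DataFrame({
    "_id": ["2814818489916730", "3096293466026588"],
    "image": ["https://example.com/a.jpg", "https://example.com/b.jpg"],
})
judgments = pd.DataFrame({
    "image": ["https://example.com/a.jpg", "https://example.com/a.jpg",
              "https://example.com/c.jpg"],
    "tag": ["girl doll", "glasses", "orphan tag"],
    "quality": [4, 3, 1],
})

# Left join keeps every judgment; validate fails if an image maps to several pins
merged = judgments.merge(
    pins_df, on="image", how="left",
    validate="many_to_one", indicator=True,
)

# Judgments whose image URL matched no pin
unmatched = merged[merged["_merge"] == "left_only"]

# Matched rows, with the image column dropped as in the cell above
result = merged.loc[merged["_merge"] == "both", ["_id", "tag", "quality"]]
```

Inspecting `unmatched` before saving makes silent row loss from an inner join visible.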
