# Collecting Amazon Mechanical Turk Labels into CSV

Each batch of Turk jobs used a different set of labels. The labels from one batch teach us about the properties of the data, so we usually revise them for the next batch as we learn more.


#### Batch 1

* Alcohol Related
* Not Alcohol Related

#### Batch 2

Once we had data on how the split works, we realised that we needed to separate alcohol from its consumption. Things like drunk driving and beer ads would slip into the domain of `Alcohol Related`, so we decided to separate those.

* Not Alcohol Related
* Alcohol Consumption Related
    - First Person
    - Second Person
    - Third Person
    - Other
* Alcohol Related
    - Promotional Content
    - Discussion

#### Batch 3

Afterwards we noticed how well these could be separated, so we decided to look explicitly for first-person accounts of alcohol consumption at different levels of intensity.

* First Person - Alcohol
    - Looking to drink
    - Casual
    - Heavy
    - Reflecting
* Alcohol Related
    - Promotional Content
    - Discussion
* Not Alcohol Related

In [1]:
import pandas as pd

### Batch 1

In [92]:
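# Helpers: derive binary target columns from the free-text `Answer` column

def set_alch(df):
    # Batch 1: anything not labelled "Not Alcohol Related" counts as alcohol
    df["alcohol"] = (~df.Answer.str.contains("Not")).astype(int)
    return df

def set_alch_promo(df):
    # Batches 2 and 3: promotional content is also excluded from the alcohol class
    is_not_alch = df.Answer.str.contains("Not")
    is_promo = df.Answer.str.contains("Promo")
    df["alcohol"] = (~(is_not_alch | is_promo)).astype(int)
    return df

def set_first(df):
    # 1 when the answer mentions a first-person account, 0 otherwise
    df["first_person"] = df.Answer.str.contains("First").astype(int)
    return df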
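
To see what these substring rules do, here is a small check on made-up answer strings. The strings are assumptions modeled on the Batch 3 label list and may not match how the real AMT `Answer` values are formatted. The flags are recomputed inline so the snippet runs on its own.

```python
import pandas as pd

# Hypothetical answer strings modeled on the Batch 3 labels; the real
# AMT "Answer" values may be formatted differently.
toy = pd.DataFrame({"Answer": [
    "First Person - Alcohol: Casual",
    "Alcohol Related: Promotional Content",
    "Alcohol Related: Discussion",
    "Not Alcohol Related",
]})

not_alcohol = toy["Answer"].str.contains("Not")
promo = toy["Answer"].str.contains("Promo")

# Same rules as set_alch_promo and set_first
toy["alcohol"] = (~(not_alcohol | promo)).astype(int)
toy["first_person"] = toy["Answer"].str.contains("First").astype(int)

print(toy[["Answer", "alcohol", "first_person"]])
```

Under these rules only the first-person and discussion rows end up with `alcohol == 1`; promotional content is treated the same as unrelated tweets.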

In [93]:
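# Load the Batch 1 results, using the first column as the index
b1 = pd.read_csv("/Users/JasonLiu/Downloads/Batch_2031538_batch_results.csv", index_col=0)
# The tweet column is stored as a Python-literal string; eval is only safe on trusted data
b1.tweet = b1.tweet.apply(eval)
b1 = set_alch(b1)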

In [94]:
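# Load the Batch 2 results, using the first column as the index
b2 = pd.read_csv("/Users/JasonLiu/Downloads/Batch_2034190_batch_results.csv", index_col=0)
b2.tweet = b2.tweet.apply(eval)

# Correcting missing index: take `_id` from amt_le_02.data.csv and attach it
# by position (this assumes both files list the rows in the same order)
b2_id = pd.read_csv("/Users/JasonLiu/Desktop/amt_le_02.data.csv", index_col=0)
ids = b2_id["_id"].copy()
ids.index = b2.index
b2 = b2.join(ids)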
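
The index fix above attaches `_id` by row position, not by label. A minimal sketch of the same idea on toy data follows. The frame names and values are hypothetical, and it assumes both frames hold the same rows in the same order, which should be checked for the real files.

```python
import pandas as pd

# Toy stand-ins: same rows in the same order, but different index labels
results = pd.DataFrame(
    {"Answer": ["Not Alcohol Related", "Alcohol Related: Discussion"]},
    index=["hit_a", "hit_b"],
)
ids = pd.DataFrame({"_id": ["t1", "t2"]}, index=[0, 1])

# Positional alignment only makes sense when the lengths match
assert len(results) == len(ids)

# Copy the values by position, ignoring the mismatched index labels
results["_id"] = ids["_id"].to_numpy()
print(results)
```

Assigning the raw values sidesteps index alignment entirely. A plain `join` on these two frames would produce only missing values, since none of the index labels match.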

In [91]:
b2 = set_alch_promo(b2)
b2 = set_first(b2)

In [95]:
# Load the Batch 3 results, using the first column as the index
b3 = pd.read_csv("/Users/JasonLiu/Downloads/Batch_2035137_batch_results.csv", index_col=0)
b3.tweet = b3.tweet.apply(eval)
b3 = set_alch_promo(b3)
b3 = set_first(b3)

In [101]:
Y = pd.concat([b1, b2, b3])

In [102]:
Y = Y.set_index("_id")[["alcohol", "first_person"]]

In [103]:
Y.first_person.value_counts()

1    511
0    385
dtype: int64

In [104]:
Y.alcohol.value_counts()

1    1582
0     716
dtype: int64
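
The two totals differ (896 first-person labels against 2298 alcohol labels). One plausible reason is that Batch 1 only goes through `set_alch`, so its rows have no `first_person` value after `pd.concat`, and `value_counts` skips missing values by default. The toy frames below are hypothetical and only illustrate that pandas behaviour.

```python
import pandas as pd

# `first` mimics a batch without a first_person column, `later` one with it
first = pd.DataFrame({"_id": ["a", "b"], "alcohol": [1, 0]})
later = pd.DataFrame({"_id": ["c", "d"], "alcohol": [1, 1], "first_person": [1, 0]})

combined = pd.concat([first, later])

# Rows from `first` get NaN in the column they never had
missing = combined["first_person"].isna().sum()
# value_counts drops NaN unless dropna=False is passed
counted = combined["first_person"].value_counts().sum()

print(len(combined), missing, counted)
```

Passing `dropna=False` to `value_counts` would show the unlabelled rows as their own category, which is a quick way to confirm where the gap comes from.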

In [105]:
Y.to_csv("../twitter_labels.csv")