# Loading the expert-labeled examples from the Benoit et al. (2016) datasets

This notebook is supporting material for the exercises on day 4 of our course.
It shows how to load the [datasets](../data/labeled/benoit_crowdsourced_2016/) for the tasks that Benoit et al. (2016) distributed to crowd workers:

1. classification of manifesto sentences (a) into economic or social policy issue statements and (b) by the stance expressed on these issues (see the instructions in [econ_social_policy.md](../data/labeled/benoit_crowdsourced_2016/instructions/econ_social_policy.md))
2. classification of manifesto sentences (a) according to whether or not they discuss the issue of immigration and, if so, (b) by the stance expressed on this issue (see the instructions in [immigration_policy.md](../data/labeled/benoit_crowdsourced_2016/instructions/immigration_policy.md))
3. classification of the stance expressed in debates about a subsidy scheme in the European Parliament (see the instructions in [subsidies_stance.md](../data/labeled/benoit_crowdsourced_2016/instructions/subsidies_stance.md))

In [7]:
import os
from utils.io import read_tabular
from IPython.display import display, HTML

In [8]:
SEED = 42

In [9]:
data_path = os.path.join('..', 'data', 'labeled', 'benoit_crowdsourced_2016')
data_files = [f for f in os.listdir(data_path) if f.endswith('.csv')]
data_files

['benoit_crowdsourced_2016-econ_policy_stance.csv',
 'benoit_crowdsourced_2016-immigration_policy.csv',
 'benoit_crowdsourced_2016-social_policy_stance.csv',
 'benoit_crowdsourced_2016-subsidies_stance.csv',
 'benoit_crowdsourced_2016-policy_area.csv',
 'benoit_crowdsourced_2016-immigration_policy_stance.csv']
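
`os.listdir` returns directory entries in arbitrary order, which is why the listing above is not alphabetical. If a stable order matters, for example when looping over all files, a minimal sketch along the following lines may help. `list_csv_files` is a hypothetical helper, not part of the course's `utils` package, and the temporary directory only stands in for the real data folder.

```python
import tempfile
from pathlib import Path


def list_csv_files(directory):
    """Return the names of all CSV files in `directory`, sorted alphabetically."""
    return sorted(p.name for p in Path(directory).glob("*.csv"))


# Demonstration on a throwaway directory (file names are made up).
with tempfile.TemporaryDirectory() as tmp:
    for name in ["b.csv", "a.csv", "notes.md"]:
        (Path(tmp) / name).touch()
    demo_files = list_csv_files(tmp)

print(demo_files)
```

Calling `list_csv_files(data_path)` would then give the same order on every machine.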

## Data for economic/social policy classification

### Policy area

In [23]:
fp = os.path.join(data_path, 'benoit_crowdsourced_2016-policy_area.csv')
df = read_tabular(fp, columns=['text', 'label', 'metadata__gold'])

# subset to gold examples
df = df[df.metadata__gold]
del df['metadata__gold']

id2label = {
    2: 'economic',
    3: 'social',
    1: 'neither',
}
df.label = df.label.map(id2label)

print(df.label.value_counts())

# get five examples per label class
expls = df.groupby('label').sample(5, random_state=SEED)

display(HTML(expls.to_html()))

label
economic    225
neither     181
social      100
Name: count, dtype: int64


Unnamed: 0,text,label
92,"They are no longer content that some of the most important decisions in their lives what school their children attend, for example, or whether or not to go on strike should be taken by officialdom or trade union bosses.",economic
2340,Any extra burden on business will destroy jobs.,economic
1377,We will increase the bonus by paying a double pension in the first week of December.,economic
2800,We will legislate to remove legal immunity from industrial action which has disproportionate or excessive effect.,economic
1843,We will extend the long-term supplementary benefit rate to the long-term unemployed.,economic
3777,Clean up party funding.,neither
893,But we must not lower our guard.,neither
3945,We therefore favour the wider application of majority voting.,neither
738,"Northern Ireland. The British people have shown their commitment to the people of Northern Ireland in the common fight against terrorism, and in helping improve the economic and social situation in the Province.",neither
4984,"The Commonwealth provides Britain with a unique network of contacts linked by history, language and legal systems.",neither
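
The cells in this notebook repeat the same three steps: keep the gold examples, drop the `metadata__gold` column, and map numeric label ids to names. One possible way to factor this out is sketched below. `prepare_gold` is a hypothetical helper (it is not part of `utils.io`), it assumes that `metadata__gold` is a boolean column as in the cell above, and the toy data frame only stands in for the output of `read_tabular`.

```python
import pandas as pd


def prepare_gold(df, id2label, gold_col="metadata__gold"):
    """Keep gold rows, drop the gold flag, and map label ids to names."""
    gold = df[df[gold_col]].drop(columns=[gold_col])
    # Fail loudly if a label id has no name instead of silently producing NaN.
    unmapped = set(gold["label"]) - set(id2label)
    if unmapped:
        raise ValueError(f"label ids without a name: {sorted(unmapped)}")
    return gold.assign(label=gold["label"].map(id2label))


# Toy stand-in for a data frame returned by read_tabular (made-up rows).
toy = pd.DataFrame({
    "text": ["sentence a", "sentence b", "sentence c"],
    "label": [2, 3, 1],
    "metadata__gold": [True, False, True],
})
toy_gold = prepare_gold(toy, {2: "economic", 3: "social", 1: "neither"})
print(toy_gold)
```

The explicit check for unmapped ids is a design choice: `Series.map` alone would turn unknown ids into missing values without any warning.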


### Economic policy stance

In [26]:
fp = os.path.join(data_path, 'benoit_crowdsourced_2016-econ_policy_stance.csv')
df = read_tabular(fp, columns=['text', 'label', 'metadata__gold'])

# subset to gold examples
df = df[df.metadata__gold]
del df['metadata__gold']

id2label = {
    -2: 'very left',
    -1: 'somewhat left',
     0: 'neither left nor right',
     1: 'somewhat right',
     2: 'very right',
}
df.label = df.label.map(id2label)

print(df.label.value_counts())

# get five examples per label class
expls = df.groupby('label').sample(5, random_state=SEED)

display(HTML(expls.to_html()))

label
somewhat right    160
somewhat left      65
Name: count, dtype: int64


Unnamed: 0,text,label
794,"Bring in a stronger regulatory framework to ensure honest practice in the City of London and introduce new safeguards on mergers, takeovers and monopolies to protect our national industrial, technological and research and development interests.",somewhat left
1524,"This tax cut will be paid for by introducing a new rate of income tax of 50%, payable on taxable income of over £100,000 per year.",somewhat left
462,Tax cuts have had a higher priority than job creation.,somewhat left
767,We will start to phase in a new disability income scheme and provide resources to give special support to young people with disabilities.,somewhat left
582,The Alliance will tackle poverty by targeting much higher benefits to those with the lowest incomes in relation to their needs.,somewhat left
1047,"We have already cut the basic rate of income tax from 33p to 23p, and our aim is to get it down to 20p, benefitting 18 million taxpayers.",somewhat right
1171,"We will guarantee to preserve the national identity, universal service and distinctive characteristics of the Royal Mail, while considering options - including different forms of privatisation - to introduce private capital and management skills into its operations.",somewhat right
302,We are also privatising the former British Airports Authority the world's leading international airports group.,somewhat right
169,high taxation prevents them doing so.,somewhat right
63,We must attract new private investment into rented housing - both from large institutions such as building societies and housing associations as well as from small private landlords.,somewhat right
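
The mapping above defines five stance categories, but `value_counts()` only lists categories that actually occur, so the three categories without gold examples are missing from the printed counts. A small sketch of how the empty categories could be shown as well is given below. The `labels` series is toy data, not the real gold set.

```python
import pandas as pd

id2label = {
    -2: "very left",
    -1: "somewhat left",
    0: "neither left nor right",
    1: "somewhat right",
    2: "very right",
}

# Toy stand-in for the mapped df.label column.
labels = pd.Series(["somewhat right", "somewhat left", "somewhat right"])

# Reindex the counts on the full category list; absent categories get 0.
counts = labels.value_counts().reindex(list(id2label.values()), fill_value=0)
print(counts)
```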


### Social policy stance

In [34]:
fp = os.path.join(data_path, 'benoit_crowdsourced_2016-social_policy_stance.csv')
df = read_tabular(fp, columns=['text', 'label', 'metadata__gold'])

# subset to gold examples
df = df[df.metadata__gold]
del df['metadata__gold']

id2label = {
    -2: 'very liberal',
    -1: 'somewhat liberal',
     0: 'neither liberal nor conservative',
     1: 'somewhat conservative',
     2: 'very conservative',
}
df.label = df.label.map(id2label)

print(df.label.value_counts())

# get five examples per label class
expls = df.groupby('label').apply(lambda x: x.sample(min(len(x), 5), random_state=SEED)).reset_index(drop=True)

display(HTML(expls.to_html()))

label
somewhat liberal         53
somewhat conservative    46
very liberal              1
Name: count, dtype: int64


Unnamed: 0,text,label
0,Reform Crown Prosecution Service to convict more criminals,somewhat conservative
1,Since 1985 the average sentence for violence against the person has risen by a third and for sexual offences by nearly 40%.,somewhat conservative
2,We have built 22 new prisons since 1980.,somewhat conservative
3,"We will audit the resources available, take proper ministerial responsibility for the service, and seek to ensure that prison regimes are constructive and require inmates to face up to their offending behaviour.",somewhat conservative
4,Persistent house burglars and dealers in hard drugs will receive mandatory minimum prison sentences of 3 and 7 years respectively.,somewhat conservative
5,"We will give Councils powers and resources to support high-quality, targeted crime prevention initiatives.",somewhat liberal
6,"We will tackle any discriminatory use of police powers, such as stop and search, and enhance police action to deal with racial attacks.",somewhat liberal
7,Stop discrimination.,somewhat liberal
8,"In addition, we will take steps to ensure that homosexuals are not discriminated against.",somewhat liberal
9,"Lesbians and gay men. In a free and tolerant society, discrimination on any grounds is unacceptable.",somewhat liberal
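
The cell above uses `min(len(x), 5)` because the "very liberal" class has only one gold example, and `groupby(...).sample(5)` raises a `ValueError` when a group has fewer rows than requested (sampling without replacement). An alternative that avoids `groupby(...).apply` is to shuffle all rows once and then take the first rows of each group. The sketch below uses toy data and a hypothetical helper name, `sample_up_to`, and it will generally draw different rows than the cell above.

```python
import pandas as pd

SEED = 42


def sample_up_to(df, by, n, seed):
    """Return up to `n` randomly chosen rows per group of column `by`."""
    shuffled = df.sample(frac=1, random_state=seed)  # shuffle all rows once
    return shuffled.groupby(by).head(n).sort_values(by)


# Toy data: one class with seven rows, one class with a single row.
toy = pd.DataFrame({
    "text": [f"sentence {i}" for i in range(8)],
    "label": ["a"] * 7 + ["b"],
})
toy_expls = sample_up_to(toy, "label", 5, SEED)
print(toy_expls)
```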


## Data for immigration policy classification

### Policy area

In [36]:
fp = os.path.join(data_path, 'benoit_crowdsourced_2016-immigration_policy.csv')
df = read_tabular(fp, columns=['text', 'label', 'metadata__gold'])

# subset to gold examples
df = df[df.metadata__gold]
del df['metadata__gold']

id2label = {
    4: 'immigration',
    1: 'neither',
}
df.label = df.label.map(id2label)

print(df.label.value_counts())

# get five examples per label class
expls = df.groupby('label').sample(5, random_state=SEED)

display(HTML(expls.to_html()))

label
immigration    47
neither        37
Name: count, dtype: int64


Unnamed: 0,text,label
4451,"We value Britain's open, welcoming character, and will protect it by changing the immigration system to make it firm and fair so that people can once again put their faith in it.",immigration
6752,Non- UK citizens travelling to or from the UK will have their entry and exit recorded.,immigration
3804,"Those seeking sanctuary should not be detained, and in particular the administrative detention of children is unacceptable and should cease immediately.",immigration
6763,"As a member of the EU, Britain has lost control of her borders.",immigration
3800,"In particular, a legal status must be provided for people who have not succeeded in their claim for humanitarian protection but who cannot be returned to their country of origin due to the political situation there.",immigration
238,Leaving the European Union. The BNP loves Europe but hates the EU.,neither
2298,"We will commission a 24/7 urgent care service in every area of England, including GP out of hours services, and ensure that every patient can access a GP in their area between 8am and 8pm, seven days a week.",neither
2123,We will put in place a levy on banks.,neither
16,The BNP will institute a Community Award Scheme for young people.,neither
6775,UKIP will also halt European moves to give prisoners the vote.,neither


### Policy stance

In [38]:
fp = os.path.join(data_path, 'benoit_crowdsourced_2016-immigration_policy_stance.csv')
df = read_tabular(fp, columns=['text', 'label', 'metadata__gold'])

# subset to gold examples
df = df[df.metadata__gold]
del df['metadata__gold']

id2label = {
    -1: 'favorable and open',
     0: 'neutral',
     1: 'Negative and closed',
}
df.label = df.label.map(id2label)

print(df.label.value_counts())

# get five examples per label class
expls = df.groupby('label').sample(5, random_state=SEED)

display(HTML(expls.to_html()))

label
Negative and closed    21
favorable and open     20
neutral                 6
Name: count, dtype: int64


Unnamed: 0,text,label
9,"COUNTER JIHAD: CONFRONTING THE ISLAMIC COLONISATION OF BRITAIN. The BNP is implacably opposed to the Labour/Tory regime's mass immigration policies which, if left unchecked, will see Britain and most of Europe colonised by Islam within a few decades.",Negative and closed
214,"As a member of the EU, Britain has lost control of her borders.",Negative and closed
209,Require those living in the UK under 'Permanent Leave to Remain' to abide by a legally binding 'Undertaking of Residence' ensuring they respect our laws or face deportation.,Negative and closed
13,"The BNP will deport all foreigners convicted of crimes in Britain, regardless of their immigration status.",Negative and closed
110,We will introduce an annual limit on the number of non-EU economic migrants admitted into the UK to live and work.,Negative and closed
191,"an ‘earned citizenship’ system, similar to those in Canada or australia, would allow Scotland to attract high-skill immigrants who can add to the strength of our economy and help deliver growing prosperity for the whole nation.",favorable and open
160,"We will allow people who have been in Britain without the correct papers for ten years, but speak English, have a clean record and want to live here long-term to earn their citizenship.",favorable and open
150,"A firm but fair immigration system. Britain has always been an open, welcoming country, and thousands of businesses, schools and hospitals in many parts of the country rely on people who've come to live here from overseas.",favorable and open
129,It is not just a matter of immigration – over 5 million British Citizens benefit from other countries' liberal immigration policies by living abroad.,favorable and open
128,"Much of our language, culture and way of life have been enriched by successive new arrivals over two thousand years.",favorable and open
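
Some sentences in the raw immigration data can contain the characters `Ô`, `Õ` and `Ð` where curly quotes and a dash were intended. This pattern is consistent with Mac OS Roman bytes that were decoded as Latin-1, although that is an inference from the characters, not something the dataset documents. A minimal repair sketch under that assumption follows. `repair_mojibake` is a hypothetical helper, and it deliberately touches only these three characters, so it should not be applied to texts in which `Ô`, `Õ` or `Ð` are legitimate letters.

```python
# Characters assumed to be mis-decoded Mac OS Roman punctuation.
SUSPECT = "ÔÕÐ"


def repair_mojibake(text):
    """Re-decode the suspect characters as Mac OS Roman, leave the rest as is."""
    return "".join(
        ch.encode("latin-1").decode("mac_roman") if ch in SUSPECT else ch
        for ch in text
    )


repaired = repair_mojibake("an Ôearned citizenshipÕ system")
print(repaired)
```

With pandas, the helper could be applied to a text column via `df.text.map(repair_mojibake)`.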


## Data for subsidies stance classification

In [42]:
fp = os.path.join(data_path, 'benoit_crowdsourced_2016-subsidies_stance.csv')
df = read_tabular(fp, columns=['text', 'label', 'metadata__gold', 'metadata__language'])

# subset to the English versions of the sentences
df = df[df.metadata__language == 'en']
del df['metadata__language']

# subset to gold examples
df = df[df.metadata__gold]
del df['metadata__gold']

id2label = {
    -1: "Anti-subsidy",
     0: "Neutral or inapplicable",
     1: "Pro-subsidy",
}
df.label = df.label.map(id2label)

print(df.label.value_counts())

# get five examples per label class
expls = df.groupby('label').sample(5, random_state=SEED)

display(HTML(expls.to_html()))

label
Pro-subsidy                14
Anti-subsidy               13
Neutral or inapplicable     8
Name: count, dtype: int64


Unnamed: 0,text,label
73,"Our proposal therefore establishes a gradual reduction in aid, with a target to remove it altogether by 2014.",Anti-subsidy
61,These funds are not being spent on developing sustainable and competitive industries for the future.,Anti-subsidy
2,"However, the previous laws supporting the unprofitable mines, rather than helping them to become profitable, instead encouraged continued waste and uncompetitiveness.",Anti-subsidy
55,We should support the Commission’s proposal.,Anti-subsidy
37,"European coal mines will, sooner or later, have to adapt to change.",Anti-subsidy
90,I would like to discuss three points from the report.,Neutral or inapplicable
135,It concerns a Council regulation.,Neutral or inapplicable
107,We have all considered the arguments both for and against the proposal to continue subsidizing the coal mines.,Neutral or inapplicable
96,We need to consider two additional issues.,Neutral or inapplicable
118,I would like to present a different perspective.,Neutral or inapplicable
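
Each cell above defines its own `id2label` dictionary. For the exercises it may be convenient to keep all six mappings in one place, keyed by file name. The sketch below only collects the mappings and file names already shown in this notebook. The names `LABEL_MAPS` and `label_map_for` are hypothetical.

```python
PREFIX = "benoit_crowdsourced_2016-"

# Label id -> label name, per data file, copied from the cells above.
LABEL_MAPS = {
    PREFIX + "policy_area.csv": {2: "economic", 3: "social", 1: "neither"},
    PREFIX + "econ_policy_stance.csv": {
        -2: "very left", -1: "somewhat left", 0: "neither left nor right",
        1: "somewhat right", 2: "very right",
    },
    PREFIX + "social_policy_stance.csv": {
        -2: "very liberal", -1: "somewhat liberal",
        0: "neither liberal nor conservative",
        1: "somewhat conservative", 2: "very conservative",
    },
    PREFIX + "immigration_policy.csv": {4: "immigration", 1: "neither"},
    PREFIX + "immigration_policy_stance.csv": {
        -1: "favorable and open", 0: "neutral", 1: "Negative and closed",
    },
    PREFIX + "subsidies_stance.csv": {
        -1: "Anti-subsidy", 0: "Neutral or inapplicable", 1: "Pro-subsidy",
    },
}


def label_map_for(filename):
    """Look up the id-to-name mapping for one of the six data files."""
    return LABEL_MAPS[filename]


print(label_map_for(PREFIX + "immigration_policy.csv"))
```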
