
# Lexicons

When humanities studies concentrate exclusively on the vocabulary or on the syntax of texts, and not so much on the meaning of these texts, this is admittedly a rather shallow form of research. Various attempts have been made, however, to bridge the gap between the lexical codes and their semantic contents. One of the ways in which the semantic aspects of texts may be uncovered is by making use of lexicons which map the text’s tokens to pre-defined semantic categories. Examples of applications in which this principle is implemented include the [Harvard General Inquirer (HGI)](http://www.wjh.harvard.edu/~inquirer/homecat.htm), the [Linguistic Inquiry and Word Count (LIWC) tool](http://liwc.wpengine.com/) and the [UCREL Semantic Analysis System (USAS)](http://ucrel.lancs.ac.uk/usas/). The programmers responsible for the *Harvard General Inquirer*, for example, have defined 182 semantic categories, and they have compiled long lists of words that are indicative of these categories. The category “negative”, for instance, contains over 2290 entries. Such word lists are usually referred to as lexicons.
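
The principle of mapping tokens to semantic categories can be illustrated with a minimal sketch. The two tiny word lists and the function name `tagTokens` below are invented for illustration only; they are not taken from the HGI or from any of the other tools mentioned above.

```python
# A toy lexicon: each semantic category maps to a set of indicative words.
# These word lists are made up for illustration and are far shorter than real lexicons.
toyLexicons = {
    'positive': {'good', 'happy', 'delight'},
    'negative': {'bad', 'sad', 'grief'},
}

def tagTokens(tokens, lexicons):
    """Return, for each token, the list of categories whose word list contains it."""
    tagged = []
    for token in tokens:
        categories = [name for name, words in lexicons.items() if token.lower() in words]
        tagged.append((token, categories))
    return tagged

tokens = 'A sad tale with a happy ending'.split()
for token, categories in tagTokens(tokens, toyLexicons):
    print(token, categories)
```

Tokens that do not occur in any word list simply receive an empty list of categories.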

To let you explore the possibilities of semantic tagging, a number of the lexicons that have been made available have been downloaded and merged. In addition to the lexicons developed for the HGI and USAS, the word lists created for this course also include terms taken from lists compiled by [Bing Liu](https://www.cs.uic.edu/~liub/) and by the project team that worked on the [Multi-Perspective Question Answering (MPQA) tool](http://mpqa.cs.pitt.edu/).

The merged semantic lexicons have been added to the GitHub repository for DTDP:
https://raw.githubusercontent.com/peterverhaar/dtdp2020/master/Lexicons/

Among other word lists, this folder contains a lexicon named "positive.txt", which lists words with a positive connotation. This particular lexicon can be used to examine the degree to which the text expresses positive emotions. 

The lexicons can all be downloaded by running the code in the following two cells. 

In [1]:
import dtdpTdm as tdm
import os
from os.path import join
import requests
import re

lexicons = [ 'positive.txt', 'negative.txt' , 'Abstract.txt' , 'Academic.txt' , 'Active.txt' , 'Economics.txt' , 'Hostile.txt' , 'Increase.txt' , 'Legal.txt' , 'Military.txt' , 'Movement.txt' , 'Pain.txt' , 'Passive.txt' , 'Pleasure.txt' , 'Politics.txt' , 'Power.txt' , 'Religion.txt' , 'Space.txt' , 'Time.txt' , 'Transportation.txt' , 'Vice.txt' , 'Weather.txt' , 'workAndEmployment.txt' ]



In [None]:
baseUrl = 'https://raw.githubusercontent.com/peterverhaar/dtdp2020/master/Lexicons/'

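def downloadLexicon(fileName):
    # Fetch one lexicon file from the repository and save it in the current directory
    response = requests.get( baseUrl + fileName )
    # A Response object is truthy only if the request succeeded (status code below 400)
    if response:
        response.encoding = 'utf-8'
        # The with-statement closes the file even if writing fails
        with open( fileName , 'w' , encoding = 'utf-8' ) as out:
            out.write( response.text )
    else:
        print( 'Cannot download lexicon file {}!'.format( fileName ) )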



for l in lexicons:
    print( 'Downloading {} ...'.format(l) )
    downloadLexicon( l )
    


If the lexicons are all available on your computer, you can use the code below to count the number of occurrences of the words in these various lexicons within the texts of your corpus. The code makes use of the function `countOccurrencesLexicon()` from the `dtdpTdm` module. Each count is divided by the total number of tokens in the text, so that texts of different lengths can be compared. The result (consisting of these relative frequencies for all the texts in your corpus) is stored in a file named 'lexicon.csv'.

If your texts are long, or if the corpus contains many texts, running the code may take quite a while.

In [None]:
csv = open( 'lexicon.csv' , 'w' , encoding = 'utf-8' )

## print header
csv.write( 'title' )
for l in lexicons:
    column = re.sub(r'\.txt' , '' , l)
    csv.write(',{}'.format(column.lower() ) )
csv.write('\n')

dir = 'Corpus'
for file in os.listdir( dir ):
    if re.search( r'\.txt$' , file ):
        print( 'Performing semantic tagging for {} ...'.format( file ) )
        csv.write( tdm.getTitle( file ) )
        path = join( dir, file )
        tokens = tdm.numberOfTokens(path)
        for l in lexicons:
            count = tdm.countOccurrencesLexicon( path , l )
            csv.write( ',{}'.format( count / tokens ) )
        csv.write('\n')
        
csv.close()

print("Done!")

Performing semantic tagging for ARoomWithaView.txt ...
Lemmatising Corpus/ARoomWithaView.txt ...
Performing semantic tagging for ATaleofTwoCities.txt ...
Lemmatising Corpus/ATaleofTwoCities.txt ...
Performing semantic tagging for HeartofDarkness.txt ...
Lemmatising Corpus/HeartofDarkness.txt ...
Performing semantic tagging for Ivanhoe.txt ...
Lemmatising Corpus/Ivanhoe.txt ...
Performing semantic tagging for MobyDick.txt ...
Lemmatising Corpus/MobyDick.txt ...
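
The implementation of `countOccurrencesLexicon()` is not shown in this notebook, but the output above suggests that the texts are lemmatised before the lexicon terms are counted. As a rough sketch of the general idea only, the code below counts lexicon terms in a string and divides the count by the number of tokens. The function names, the regular-expression tokenisation and the assumption that a lexicon file contains one term per line are all illustrative. The sketch performs no lemmatisation, so its results would differ from those of the `dtdpTdm` module.

```python
import re

def readLexicon(path):
    """Read a word list, assuming (hypothetically) one term per line."""
    with open(path, encoding='utf-8') as f:
        return {line.strip().lower() for line in f if line.strip()}

def relativeFrequency(text, lexiconWords):
    """Share of the tokens in `text` that occur in `lexiconWords`."""
    # Simplistic tokenisation: lowercased runs of word characters
    tokens = re.findall(r'\w+', text.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in lexiconWords)
    return hits / len(tokens)

# Usage with an in-memory word list instead of a downloaded file
sampleWords = {'storm', 'rain', 'wind'}
print(relativeFrequency('The rain fell and the wind rose.', sampleWords))
```

Storing the terms in a set keeps each membership test fast, which matters when a long text is checked against a lexicon of several thousand entries.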


In the cell below, the values that have been calculated for the terms in the various lexicons are visualised as a bar chart. As the value of the variable named `y`, you need to type in the name of the lexicon in lower case, without the .txt extension.

In [None]:
%matplotlib inline

import matplotlib.pyplot as plt
import pandas as pd


df = pd.read_csv('lexicon.csv')

fig = plt.figure( figsize=( 7 ,6 ) )
ax = plt.axes()

x = 'title'
y = 'religion'


bar_width = 0.45
opacity = 0.8

ax.bar( df[x] , df[y] , width = bar_width, alpha = opacity , color = '#27a897')

plt.xticks(rotation= 90)

ax.set_xlabel('Categories' , fontsize= 12)
ax.set_ylabel('Mean values' , fontsize = 12 )
ax.set_title( y.title() , fontsize=20 )


plt.show()
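
The column headers in 'lexicon.csv' are derived from the file names by removing the extension and converting the name to lower case, so 'workAndEmployment.txt' becomes the column `workandemployment`. The helper below, whose name `columnName` is introduced here for illustration, mirrors that conversion and can be used to check a value for `y` before plotting.

```python
import re

def columnName(fileName):
    """Mirror the header logic used when 'lexicon.csv' was written."""
    return re.sub(r'\.txt', '', fileName).lower()

# A shortened list of lexicon files, for illustration
someLexicons = ['positive.txt', 'Religion.txt', 'workAndEmployment.txt']
validColumns = [columnName(l) for l in someLexicons]

y = 'religion'
if y not in validColumns:
    print('Unknown lexicon: {}. Choose one of {}'.format(y, validColumns))
```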

As was explained in the notebook on [Natural Language Processing](NLTK.ipynb), it is also possible to explore the broader differences between the various categories that can be distinguished in your corpus. To connect the lexicon CSV to your metadata CSV, which captures data about these categories, you need to run the following code. Note that the code copies the `class` column row by row, so the texts need to appear in the same order in both files.

In [None]:
import pandas as pd

df = pd.read_csv( 'lexicon.csv' )
metadata = pd.read_csv( 'metadata.csv' )
df['class'] = metadata['class']

df.to_csv (r'dataset.csv', index = False, header=True) 

df = pd.read_csv( 'dataset.csv' )

colours = [ '#09349E' , '#D1AC32' , '#C70C2B' , '#6AD964' , '#A640E6' ]

classColours = dict()

# Sort the categories so that each class receives the same colour on every run
unique_categories = sorted( set( df['class'] ) )
if len( unique_categories ) <= len(colours):
    for u in range( len( unique_categories ) ):
        classColours[ unique_categories[u] ] = colours[u]
else:
    print("You have more than {} categories. You need to add colours to the list!".format( len(colours) ) )
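
Assigning the `class` column directly, as in the cell above, pairs the rows of the two files by position. If 'metadata.csv' also contains a `title` column with the same values as 'lexicon.csv' (an assumption, since the structure of the metadata file is not shown here), the two tables can be joined on that column instead, which does not depend on row order. The sketch below uses small in-memory tables with invented values rather than the actual files.

```python
import pandas as pd

# Invented sample data standing in for 'lexicon.csv' and 'metadata.csv'
lexiconData = pd.DataFrame({
    'title': ['Ivanhoe', 'MobyDick'],
    'religion': [0.012, 0.009],
})
metadataSample = pd.DataFrame({
    'title': ['MobyDick', 'Ivanhoe'],   # deliberately in a different order
    'class': ['American', 'British'],
})

# A left join keeps every row of the lexicon table and looks up its class by title
merged = lexiconData.merge(metadataSample[['title', 'class']], on='title', how='left')
print(merged)
```

With a left join, a text that is missing from the metadata keeps its row and receives an empty `class` value, which makes such gaps easy to spot.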
    

The code below creates a bar chart whose colours represent the various categories you have distinguished.

In [None]:
%matplotlib inline

import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import pandas as pd


df = pd.read_csv('dataset.csv')

colours = []
for category in df['class']:
    colours.append( classColours[category] )

fig = plt.figure( figsize=( 7 ,6 ) )
ax = plt.axes()

x = 'title'
y = 'religion'


bar_width = 0.45
opacity = 0.8

ax.bar( df[x] , df[y] , width = bar_width, alpha = opacity , color = colours )

plt.xticks(rotation= 90)


patchList = []
for key in classColours:
    data_key = mpatches.Patch(color=classColours[key], label=key)
    patchList.append(data_key)
    
plt.legend(handles=patchList , shadow=True, fontsize='large' , frameon = True )


ax.set_xlabel('Categories' , fontsize= 12)
ax.set_ylabel('Mean values' , fontsize = 12 )
ax.set_title( y.title() , fontsize=20 )


plt.show()
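
To compare the categories as a whole rather than the individual texts, the values can also be averaged per class. The sketch below does this with the pandas `groupby()` method on a small table of invented values; with the real data, the table would be read from 'dataset.csv'.

```python
import pandas as pd

# Invented values standing in for the contents of 'dataset.csv'
sample = pd.DataFrame({
    'title': ['TextA', 'TextB', 'TextC', 'TextD'],
    'religion': [0.010, 0.020, 0.004, 0.006],
    'time': [0.030, 0.050, 0.020, 0.040],
    'class': ['X', 'X', 'Y', 'Y'],
})

# Average the selected lexicon columns within each class
classMeans = sample.groupby('class')[['religion', 'time']].mean()
print(classMeans)
```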

The code below, finally, creates a bar chart which visualises the values for three lexicons simultaneously. The lexicons to be shown need to be specified in the list named `lexiconsInChart`. This bar chart enables you to compare the values collected for the three semantic domains you have listed.

In [None]:
%matplotlib inline

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

df = pd.read_csv('lexicon.csv')

fig = plt.figure( figsize=( 12 ,8 ) )
ax = plt.axes()

lexiconsInChart = [ 'academic' , 'time' , 'religion' ]


ind = np.arange( df.shape[0] )


width = 0.27     


bar1 = ax.bar(ind, df[ lexiconsInChart[0] ] , width, color='#3264a8')
bar2 = ax.bar(ind+width, df[lexiconsInChart[1] ], width, color='#edcf0e')
bar3 = ax.bar(ind+width*2, df[ lexiconsInChart[2] ], width, color='#e01c0d')


# Centre the tick labels under the middle bar of each group
ax.set_xticks(ind+width)

ax.set_xticklabels( df['title'] )
ax.legend( ( bar1[0], bar2[0], bar3[0]), ( lexiconsInChart[0] , lexiconsInChart[1] , lexiconsInChart[2] ) )


plt.xticks(rotation= 90)

ax.set_xlabel('Categories' , fontsize= 12)
ax.set_ylabel('Mean values' , fontsize = 12 )
ax.set_title( 'Sample data' , fontsize=20 )


plt.show()
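
The cell above is written for exactly three lexicons. The bar positions follow a simple pattern: the bars of series number i are shifted by i times the bar width. As a sketch of how this could be generalised, the hypothetical helpers `barWidthFor` and `barOffsets` below pick a width so that any number of series fits within one group. The `groupWidth` value of 0.81 is chosen here only because it reproduces the width of 0.27 for three series.

```python
def barWidthFor(numberOfSeries, groupWidth=0.81):
    """Width of a single bar so that all series fit within `groupWidth`."""
    return groupWidth / numberOfSeries

def barOffsets(numberOfSeries, width):
    """Horizontal shift for each series relative to the start of its group."""
    return [i * width for i in range(numberOfSeries)]

lexiconsInChart = ['academic', 'time', 'religion', 'power']
width = barWidthFor(len(lexiconsInChart))
for name, offset in zip(lexiconsInChart, barOffsets(len(lexiconsInChart), width)):
    # In the chart code, each series would be drawn with ax.bar(ind + offset, df[name], width)
    print(name, round(offset, 4))
```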