# Dictionary Methods

How do we know what texts are about? One of the simplest approaches is the dictionary method, which counts words of interest in a text. This is useful if we know which categories we're interested in and can specify words related to them. It's also one of the oldest and most widely used methods in automated text analysis, so it's important both to understand the method and to be able to implement it.

Dictionary methods are used for many purposes. A few possibilities:
* classify text into themes
* measure the *tone* of text
* measure sentiment
* measure psychological processes

Dictionary methods are based on the assumption that themes or categories consist of a group of words, and texts that cover that theme will have a higher percentage of that group of words compared to other texts. Dictionary-based analysis is appropriate when the categories and the textual features (words and/or phrases associated with each category) are known and fixed--based on expert knowledge, crowd-sourcing, etc. 
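
To make this assumption concrete, here is a minimal sketch on two invented sentences. The word list `food_words` and the helper `theme_share` are hypothetical names made up for illustration; they are not part of any standard dictionary.

```python
# Toy illustration: a "theme" is a set of words, and a text's score is the
# share of its tokens that belong to that set. All data here is invented.
food_words = {"bread", "cheese", "soup", "dinner"}  # hypothetical mini-dictionary

def theme_share(text, dictionary):
    """Return the proportion of whitespace-separated tokens found in `dictionary`."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0  # an empty text has no tokens to score
    hits = sum(1 for token in tokens if token in dictionary)
    return hits / len(tokens)

about_food = "we had bread and cheese for dinner"
about_music = "the band played a slow quiet song"

share_food = theme_share(about_food, food_words)
share_music = theme_share(about_music, food_words)
print(share_food, share_music)
```

The text about food gets a higher share than the text about music, which is all the method relies on.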

We will use dictionary methods to do sentiment analysis, a popular text analysis task, on a corpus of Music Reviews (collected from MetaCritic.com), using a standard sentiment analysis dictionary. We will also use a weighted dictionary to detect concreteness in novels. Finally, we will explore using word weighting to get distinctive words.

## Outline
* [Part 0: Basic dictionary method](#basic)
    0. [Introduction to dictionary methods](#intro)
        * Standard dictionaries
        * Custom dictionaries
    1. [Preprocessing](#preprocess)
    2. [Creating dictionary counts](#counts)
    3. [Sentiment analysis using Scikit-learn](#sentiment)
<br><br>    
* [Part 1: Weighted dictionary](#weighting)
    0. [Read concreteness score dictionary](#read)
    1. [Merging a DTM with a weighted dictionary](#merge)
    2. [Weight term frequencies by their concreteness score](#concweight)
    3. [Calculating an average concreteness score for each text](#average)
    4. [Assess the difference](#assess)
<br><br>
* [Bonus: Weighting words with TF-IDF](#tfidf)
    * Identifying distinctive words
<br><br>
* [Further resources](#resources)

## Learning Goals
* Understand the intuition behind dictionary methods
* Learn how to implement it with Python's pandas and NLTK packages
* Get more comfortable combining Python packages together
* Implement a rudimentary sentiment analysis tool and test it on sample data
* Practice applying a weighted dictionary

## Vocabulary

* *dictionary method*:
    * text analysis method that utilizes the frequency of key words, grouped into themes, to determine the prevalence of that theme throughout a corpus.
* *standard dictionary*:
    * otherwise known as general dictionaries, a dictionary created by experts meant to measure general phenomena.
* *custom dictionary*:
    * dictionaries tailored to a specific domain or question. Usually created by the researcher based on the research question.
* *sentiment analysis*:
    * the process of computationally identifying and categorizing opinions expressed in a piece of text, especially in order to determine whether the writer's attitude towards a particular topic, product, etc., is positive, negative, or neutral.
* *Term Frequency-Inverse Document Frequency (TF-IDF) Scores*: 
    * a numerical statistic intended to reflect how important a word is to a document in a collection or corpus. In a TF-IDF matrix, as in a DTM, rows correspond to documents in the collection and columns correspond to terms. Computing TF-IDF scores is a common preprocessing step: it takes in tokenized texts and makes them ready for downstream tasks (e.g., topic models, supervised machine learning models).

**__________________________________**


# Part 0: Basic dictionary methods<a id='basic'></a>

## 0.0 Introduction to dictionary methods<a id='intro'></a>

The basic idea behind dictionary methods is that language reflects social position and culture: the cognitive categories through which individuals attend to the world are embedded in the words they use.  Words that are used frequently are cognitively central and reflect what is most on the speaker’s (or writer’s) mind.  Words that are used infrequently or not at all are at the speaker’s cognitive periphery, perhaps even representing uncomfortable or alien concepts.

There are two forms of dictionaries: standard or general dictionaries, and custom dictionaries.

### Standard dictionaries

There are a number of standard dictionaries that have been created by field experts. The benefit of standardized dictionaries is that they're developed by experts and have been thoroughly validated. Others have likely published using these dictionaries, so reviewers are more likely to accept them as valid. Because of this, they are good options if they fit your research question.

Here are a few:

* [DICTION](http://www.dictionsoftware.com/): a computer-aided text analysis program for determining the tone of a text. It was created by and for organization scholars and political scientists.
    * Main five categories: Certainty, Activity, Optimism, Realism, Commonality
    * 35 sub-categories
    * Allows you to create your own dictionary
    * Proprietary software
* [Linguistic Inquiry and Word Count (LIWC)](http://liwc.wpengine.com/): Created by psychologists, it's meant to capture psychological processes around feelings, personality, and motivations. It's also proprietary.
* [Multi-Perspective Question Answering (MPQA)](http://mpqa.cs.pitt.edu/): A freely available subjectivity lexicon of positive and negative words, and a free alternative to LIWC for sentiment. We will use this dictionary today.
* [Valence Aware Dictionary and sEntiment Reasoner (VADER)](https://github.com/cjhutto/vaderSentiment): a popular and free rule-based sentiment analysis tool tuned to capture sentiment in social media data.
* [Harvard General Inquirer](http://www.wjh.harvard.edu/~inquirer/spreadsheet_guide.htm). Multiple categories, including abstract and concrete words. It's free and available online.

However, dictionaries are context-specific, and applying a dictionary outside its original target domain can lead to biased or unreliable results. A classic example is the use of the Harvard General Inquirer to classify negative tone in corporate earnings reports, which [leads to serious problems](https://onlinelibrary.wiley.com/doi/full/10.1111/j.1540-6261.2010.01625.x?casa_token=Vl0Os0QtDIQAAAAA%3AaxNRNAj4ajoocjLv_DCJMsS8GEA_jAqb_Z26h5I8g1yW0muShiPs_lglFbpHR-3AU1etsuyjhdN_ahZy). For example, the term "vice" characterizes an executive rather than immorality, while "tire" is not negative in the context of automobile industry reports. And some negative words aren't captured at all: the terms "litigation" and "unanticipated" are missing from the Harvard list. With enough of these problems, the results of our analyses will be wrong. So sometimes it's worth the effort to create a custom dictionary.

### Custom dictionaries

Many research questions or data are domain specific and will thus require you to create your own dictionary based on your own knowledge of the domain and research question. Creating your own dictionary requires a lot of thought and must be validated. These dictionaries are typically created in an iterative fashion; they are first formed from domain knowledge, and as the validation process gets going they may be modified again and again. See references & links (at bottom of this notebook) to Enns et al. (2015) or Haber (2020) for examples of how to construct a domain-specific dictionary. 

Today we will use the free and standard sentiment dictionary from MPQA to measure positive and negative sentiment in the music reviews.

We are about to apply an NLP model. Do you remember what comes first?

## 0.1 Preprocessing<a id='preprocess'></a>

### Review of reading in multiple files

Often, our text data is split across multiple files in a folder. We want to read them all into a single variable. <br>`glob` is a handy package for this: it lists all files matching a pattern. We can use this to get all files in a folder. 

As a demonstration, let's see how to read in all the `.csv` files in the folder `amazon`. We'll extract only the `Text` column from the first two files and store the values in a list called `reviews`. 

In [None]:
import glob, os
import pandas as pd # needed for pd.read_csv below

fnames = os.path.join('../day-1/data/', 'amazon', '*.csv') # get file paths
fnames = glob.glob(fnames)

# Most elegant or 'pythonic' way to do this: a for loop
reviews = [] # ***VERY IMPORTANT:*** initialize empty list
column_names = ["Id", "ProductId", "UserId", "ProfileName", "HelpfulnessNumerator", "HelpfulnessDenominator", "Score", "Time", "Summary", "Text"] # define column names

# Loop over fnames, for each one create DataFrame and extract txt 
for fname in fnames[:2]:
    df = pd.read_csv(fname, names=column_names) # read each .csv into a DataFrame, coerce column names
    text = list(df['Text']) # extract `text` column
    reviews.extend(text) # add to `reviews` list

reviews[:3] # preview results

### Review of removing punctuation

In [None]:
import string
punctuations = list(string.punctuation)

fname = 'example2.txt'
fname = os.path.join('../day-1/data/', fname)
with open(fname) as f:
    text = f.read()
    
# Note the list comprehension--this is the most common way to remove punctuation:
no_punct = ''.join([char for char in text if char not in punctuations])
no_punct

### New data
Let's read in our Music Reviews corpus as a Pandas dataframe.

In [None]:
#import the necessary packages
import pandas as pd
import nltk
nltk.download('punkt')
from nltk import word_tokenize

#read the Music Reviews corpus into a Pandas dataframe
df = pd.read_csv("../day-2/data/BDHSI2016_music_reviews.csv", encoding='utf-8', sep = '\t')

#view the dataframe
df.head()

In [None]:
# Look at breakdown of genres:
df['genre'].value_counts()

The next step is to create a new column in our dataset that contains tokenized words with all the pre-processing steps. First, let's remove digits, then you can practice by doing the rest.

In [None]:
# Remove digits from `body` column:
df['body'] = df['body'].apply(lambda x: ''.join([i for i in x if not i.isdigit()]))

### Challenge
Let's review preprocessing using the `df` we just created. This is a little different from yesterday's practice using strings and lists, but the essentials are the same. To see the key new things you'll likely want to use, refer to the example of removing digits from the previous cell--especially note the list comprehension and this useful strategy: 
`df['column'].apply(lambda x: function(x))`. 

To preprocess `df`, take these steps:
* Create a new column in `df` called `body_tokens` that contains a lower cased version of `df['body']`. 
* Tokenize the `body_tokens` column of `df` using one of the methods we worked with yesterday. 
* Remove punctuation from `body_tokens`. 
* Create a new column that contains the length of the token list in each row. We will use this later to normalize the dictionary counts. 
* Reflect: What other pre-processing steps might we use?

In [None]:
# your code here

## 0.2. Creating dictionary counts<a id='counts'></a>

I created two text files: one is a list of positive words from the MPQA dictionary, the other a list of negative words, with one word per line. Our goal here is to count the number of positive and negative words in each row of our dataframe, and add two columns to our dataset with the count of positive and negative words.

First, read in the positive and negative words and create list variables for each.

In [None]:
pos_str = open("../day-2/data/positive_words.txt", encoding='utf-8').read()
neg_str = open("../day-2/data/negative_words.txt", encoding='utf-8').read()

#view part of the pos_str variable, to see how it's formatted.
pos_str[:101]

In [None]:
#remember the split function? We'll split on the newline character (\n) to create a list
positive_words=pos_str.split('\n')
negative_words=neg_str.split('\n')

#view every 100th element in the lists
print(positive_words[::100])
print(negative_words[::100])

In [None]:
#count number of words in each list
print('This many words in positive MPQA dictionary:', str(len(positive_words)))
print('This many words in negative MPQA dictionary:', str(len(negative_words)))

Great! You know what to do now.

### Challenge
1. Create a column with the number of positive words, and another with the proportion of positive words
2. Create a column with the number of negative words, and another with the proportion of negative words
3. Print the average proportion of negative and positive words by genre
4. Compare this to the average score by genre

*Note:* You won't be able to do this challenge (or anything else in this section) if you didn't complete the first challenge above to preprocess `df['body']` into `df['body_tokens']`. If you skipped that part or got stuck, copy and run the solution from `solutions/dictionary-methods-solutions.ipynb` before moving on.

In [None]:
# your code here

That's the dictionary method! You can do this with any dictionary you want, whether a standard one or one you create yourself.
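
As a sketch of how the same pattern generalizes to any dictionary, the hypothetical helper `add_dictionary_counts` below adds a count column and a proportion column for each named word list. The toy DataFrame, the column name `tokens`, and the two tiny word lists are assumptions made for illustration; they are not the MPQA lists.

```python
import pandas as pd

def add_dictionary_counts(frame, token_col, dictionaries):
    """Return a copy of `frame` with `<name>_count` and `<name>_prop` columns.

    `dictionaries` maps a category name to an iterable of words.
    """
    out = frame.copy()
    n_tokens = out[token_col].apply(len)
    for name, words in dictionaries.items():
        word_set = set(words)  # set membership is much faster than list membership
        counts = out[token_col].apply(lambda toks: sum(t in word_set for t in toks))
        out[f"{name}_count"] = counts
        # Empty documents would divide by zero, so mask them and fill with 0.0
        out[f"{name}_prop"] = (counts / n_tokens.where(n_tokens > 0)).fillna(0.0)
    return out

# Invented toy data: each row holds an already-tokenized document
toy = pd.DataFrame({"tokens": [["a", "great", "lovely", "album"],
                               ["dull", "and", "boring"],
                               []]})
toy_dicts = {"pos": ["great", "lovely"], "neg": ["dull", "boring", "awful"]}

scored = add_dictionary_counts(toy, "tokens", toy_dicts)
print(scored)
```

Returning a copy rather than modifying `frame` in place is a design choice that keeps the original data untouched while you experiment with different word lists.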

## 0.3. Sentiment analysis using scikit-learn<a id='sentiment'></a>

We can also do this using a Document-Term Matrix (DTM). We'll do this in pandas, to make it conceptually clear. As you get more comfortable with programming, you will probably want to work directly with the sparse matrix that `CountVectorizer` returns (SciPy's compressed sparse row format), which is much more memory-efficient than a dense DataFrame.
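
For reference, the same kind of count can be taken directly on the sparse matrix that `CountVectorizer` returns, without converting to a dense DataFrame. The sketch below uses invented toy documents and an invented word list (`toy_docs`, `toy_positive`); the cells that follow continue with the dense pandas version.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

toy_docs = ["a great and lovely album", "dull dull boring record", "great record"]
toy_positive = ["great", "lovely", "superb"]  # invented word list

vec = CountVectorizer()
X = vec.fit_transform(toy_docs)  # SciPy sparse matrix, shape (n_docs, n_terms)
# Note: the default tokenizer drops single-character tokens such as "a"

# Column indices of dictionary words that actually occur in the corpus
pos_idx = [vec.vocabulary_[w] for w in toy_positive if w in vec.vocabulary_]

# Row sums over the selected columns and over all columns
pos_count = np.asarray(X[:, pos_idx].sum(axis=1)).ravel()
total_count = np.asarray(X.sum(axis=1)).ravel()
pos_prop = pos_count / total_count
print(pos_count, pos_prop)
```

Only the small per-document result vectors are ever made dense here, which matters once the vocabulary grows to tens of thousands of terms.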

In [None]:
#import the function CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

countvec = CountVectorizer()

#create our document term matrix as a pandas dataframe
#(get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2)
dtm_df = pd.DataFrame(countvec.fit_transform(df.body).toarray(), columns=countvec.get_feature_names_out(), index = df.index)
dtm_df

Now we can keep only those *columns* that occur in our positive words list. To do this, we'll first save a list of the columns names as a variable, and then only keep the elements of the list that occur in our positive words list. We'll then create a new dataframe keeping only those select columns.

In [None]:
#create a columns variable that is a list of all column names
columns = list(dtm_df)
columns[::750] # view every 750th element

In [None]:
#create a new variable that contains only column names that are in our positive words list
pos_columns = [word for word in columns if word in positive_words]
pos_columns[::100] # view every 100th element

In [None]:
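#create a dtm from our dtm_df that keeps only positive sentiment columns
#(.copy() gives an independent DataFrame, so adding a column later won't trigger a SettingWithCopyWarning)
dtm_pos = dtm_df[pos_columns].copy()
dtm_pos.head()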

In [None]:
#count the number of positive words for each document
dtm_pos['pos_count'] = dtm_pos.sum(axis=1)
dtm_pos['pos_count']

### Challenge
1. Do the same for negative words.  
2. Calculate the proportion of negative and positive words for each document.

In [None]:
# your code here

# Part 1: Weighting dictionaries<a id='weighting'></a>

Next we'll use a weighted dictionary to compare the relative average concreteness of the words used in Austen's *Pride and Prejudice* versus Alcott's *A Garland for Girls*. A weighted dictionary indicates not only whether a phrase is associated with a category, but *how strongly* it is associated with that category. In this approach, a dictionary is a list of weighted words.

This could be done using a regular dictionary: a list of concrete and abstract words. Instead, we'll use a crowdsourced dictionary that provides an average "concreteness score" for a large number of English words.
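
Before turning to the real data, here is a minimal sketch of what scoring with a weighted dictionary involves. The weights in `toy_weights` are invented numbers on a 1 to 5 scale, and `weighted_average` is a hypothetical helper; neither comes from the dictionary used below.

```python
from collections import Counter

# Invented weights on a 1-5 scale, for illustration only
toy_weights = {"table": 5.0, "river": 4.8, "idea": 1.6, "hope": 1.4}

def weighted_average(tokens, weights):
    """Average weight over the tokens that appear in `weights`."""
    counts = Counter(t for t in tokens if t in weights)
    n_scored = sum(counts.values())
    if n_scored == 0:
        return None  # no dictionary words, so no score
    total = sum(count * weights[word] for word, count in counts.items())
    return total / n_scored

concrete_text = "the table by the river the table".split()
abstract_text = "hope and one idea and more hope".split()

concrete_score = weighted_average(concrete_text, toy_weights)
abstract_score = weighted_average(abstract_text, toy_weights)
print(concrete_score, abstract_score)
```

Note the denominator: this sketch divides by the number of tokens that have a weight, not by the total length of the text. Which denominator is appropriate is a design decision worth making deliberately.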

## 1.1 Read concreteness score dictionary<a id='read'></a>

First we'll create a pandas dataframe from the concreteness score dictionary, saved on our hard drive in the form of a .csv file.

This dictionary comes from work by [Marc Brysbaert, Amy Beth Warriner, and Victor Kuperman.](https://link.springer.com/article/10.3758/s13428-013-0403-5) In summary:

    The authors obtained concreteness ratings for 37,058 English words and 2,896 two-word expressions (such as zebra crossing and zoom in), by means of a norming study using Internet crowdsourcing for data collection. They had over 4,000 participants rate words on a 5-point concreteness scale, from 1 (very abstract) to 5 (very concrete). They define concrete words as words you can experience through the senses, and abstract words as words that you cannot experience through the senses. They provide the average concreteness score and the standard deviation for each word.

Let's read in the data.

In [None]:
con_score = pd.read_csv('../day-2/data/Concreteness_ratings_Brysbaert_et_al.csv')
con_score

We can see the most concrete and most abstract words by sorting on `Conc.M`.

In [None]:
con_score[['Word','Conc.M']].sort_values(by='Conc.M',ascending=False)

## 1.2. Merging a DTM with a weighted dictionary<a id='merge'></a>

The goal is to merge this score with our Document Term Matrix, so we can calculate the average concreteness score for our texts.

To do this, we'll first create the DTM from our two novels, transpose this matrix, and merge it with the dataframe created above. We'll merge on the column 'Word'.

First, create the DTM.

In [None]:
text_list = [] # initialize list

# open and read the novels, save them as variables
austen_string = open('../day-2/data/Austen_PrideAndPrejudice.txt', encoding='utf-8').read()
alcott_string = open('../day-2/data/Alcott_GarlandForGirls.txt', encoding='utf-8').read()

# append each novel to the list
text_list.append(austen_string)
text_list.append(alcott_string)

countvec = CountVectorizer(stop_words="english")

# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
novels_df = pd.DataFrame(countvec.fit_transform(text_list).toarray(), columns=countvec.get_feature_names_out())
novels_df

Next, we'll take a subset of the DTM, keeping only the intersection between the words in our corpus and the words in the dictionary.

In [None]:
columns = list(novels_df)
con_words = set(con_score['Word']) # build the lookup set once, instead of rebuilding a list for every word
columns_con = [word for word in columns if word in con_words]
columns_con[::300]

In [None]:
novels_df_con = novels_df[columns_con]
novels_df_con 

Next, transpose the matrix, rename the column, and merge with the dictionary dataframe.

In [None]:
df = novels_df_con.transpose() # transpose

df.rename(columns={0: 'Austen', 1: 'Alcott'}, inplace=True) # rename

df

In [None]:
#Rename the index 'Word', and reset the index, so the words become a column in our dataframe and we get a new index.
df.index.names = ['Word']
df.reset_index(inplace=True)

df

In [None]:
#merge with our dictionary dataframe, called 'con_score'
df = df.merge(con_score, on = 'Word')
df

## 1.3. Weighting term frequencies by the concreteness score<a id='concweight'></a>

Now we can weight the term frequency cells by the concreteness score, by multiplying the frequency count column by the concreteness score column.

In [None]:
df['austen_con_score'] = df['Austen'] * df['Conc.M']
df['alcott_con_score'] = df['Alcott'] * df['Conc.M']
df

### Challenge

Calculate and print the average concreteness score for each text. Careful! Think through this before you implement it. You want the average score, normalized over all the words in the text. 

In [None]:
# your code here

## 1.4. Assessing the difference<a id='assess'></a>

So there is a difference, but what does it mean? What is the magnitude of the difference?

We can look at the difference between the two means as a percent difference based on the scale range. We can calculate this using simple math.
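
The arithmetic in the next cells can also be written as a small reusable function. The name `pct_of_range` is hypothetical; the two means and the range of 3.83 are the values that appear in this section's cells.

```python
def pct_of_range(mean_a, mean_b, scale_range):
    """Absolute difference between two means as a percentage of the scale range."""
    if scale_range <= 0:
        raise ValueError("scale_range must be positive")
    return abs(mean_a - mean_b) / scale_range * 100

# Means and range as used in this section's cells
result = pct_of_range(3.1534507874, 2.78328905828, 3.83)
print(round(result, 2))
```

Computing from the unrounded means avoids the small error introduced by typing in a rounded difference by hand.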

In [None]:
# First find the difference between the means by subtracting one from the other
3.1534507874-2.78328905828

In [None]:
# Find the range of concreteness scores
print(df['Conc.M'].min())
print(df['Conc.M'].max())

In [None]:
# Find the scale range
df['Conc.M'].max() - df['Conc.M'].min()

In [None]:
# Calculate the difference of means as a percent of this range
(0.37/3.83)* 100

### Challenge
Print the most concrete and abstract terms in Austen and in Alcott. Don't worry about term frequencies; just look at the raw score of words present in each novel.<br>
*Hint:* You can't simply sort on the column `austen_con_score` and so on. Why not? What are your next steps?

In [None]:
# your code here

# Bonus: Weighting words with TF-IDF<a id='tfidf'></a>

Next let's practice with Term Frequency-Inverse Document Frequency (TF-IDF) word scores. This isn't a dictionary method, but rather a weighted version of what we practiced yesterday: the Document-Term Matrix (DTM). Using TF-IDF will allow us to find distinctive words in our dataset.

The idea behind word scores (like TF-IDF) in general is to weight words not just by their frequency (as in the weighted dictionaries we just practiced), but by their frequency in one document compared to their distribution across all documents. Words that are frequent, but are also used in every single document, will not be indicative of the content of that document. We want to instead identify frequent words that are unevenly distributed across the corpus.

TF-IDF is one of the most popular ways to weight words (aside from counting raw word frequencies). By offsetting the frequency of a word in a document (TF) by its inverse document frequency (IDF), which shrinks as the number of documents containing the word grows, we can down-weight common or stop words like 'the', 'of', and 'and'.

More precisely, the IDF is calculated as the inverse proportion of documents with a given word:

`number_of_documents / number_documents_with_word`

This together with the raw frequency of the word in a given document gives:

`tfidf_word1 = word1_frequency_document1 * (number_of_documents / number_documents_with_word1)`

This gives higher scores to words in long documents, simply because long documents contain more occurrences. To keep word scores on a comparable scale, it's usually best to normalize the term frequency by the document's word count:

`tfidf_word1 = (word1_frequency_document1 / word_count_document1) * (number_of_documents / number_documents_with_word1)`
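
As a toy illustration of this arithmetic (not a replacement for doing it on the real corpus), the sketch below applies the normalized formula to two invented, already-tokenized documents. `simple_tfidf` and `toy_docs` are hypothetical names.

```python
# Two invented documents, already tokenized
toy_docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "barked", "loudly"],
]

def simple_tfidf(word, doc, docs):
    """Length-normalized term frequency times (N / document frequency)."""
    tf = doc.count(word) / len(doc)
    docs_with_word = sum(1 for d in docs if word in d)
    if docs_with_word == 0:
        return 0.0  # word never occurs, so there is nothing to weight
    return tf * (len(docs) / docs_with_word)

score_the = simple_tfidf("the", toy_docs[0], toy_docs)  # appears in both documents
score_cat = simple_tfidf("cat", toy_docs[0], toy_docs)  # appears only in the first
print(score_the, score_cat)
```

Both words occur once in the first document, but "cat" scores twice as high because it appears in only one of the two documents. With this unlogged ratio, a word found in every document still gets a multiplier of 1 rather than 0, which is one reason many implementations take a logarithm of the ratio.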

Scikit-learn is the standard method of implementing TF-IDF in Python, and the main class for this is [TfidfVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn.feature_extraction.text.TfidfVectorizer). This notebook uses the Scikit-learn method, but here's a special challenge for those willing: Try calculating TF-IDF manually!

In [None]:
# Let's use our Music Reviews corpus for this. Read into Pandas DataFrame:
df = pd.read_csv("../day-2/data/BDHSI2016_music_reviews.csv", encoding='utf-8', sep = '\t')

# To avoid muddying things, clean out numbers--but leave stopwords, etc. in place for now:
df['body'] = df['body'].apply(lambda x: ''.join([i for i in x if not i.isdigit()]))

In [None]:
#import the function TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

tfidfvec = TfidfVectorizer()

#create the dtm, but with cells weighted by the TF-IDF word score
# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
tfidf_df = pd.DataFrame(tfidfvec.fit_transform(df['body']).toarray(), columns=tfidfvec.get_feature_names_out())

#view results
tfidf_df

Let's look at the 20 words with highest tf-idf weights.

In [None]:
print(tfidf_df.max(axis=0).sort_values(ascending=False)[:20])

Ok! We have successfully identified content words, without removing stop words and without part-of-speech tagging. What else do you notice about this list?

## Identifying Distinctive Words

What can we do with this? These scores are best used when you want to identify distinctive words for individual documents, or groups of documents, compared to other groups or the corpus as a whole. To illustrate this, let's compare three genres and identify the most distinctive words by genre.

First we merge the genre of the document into our DTM weighted by TF-IDF scores, and then compare genres.

In [None]:
#create dataset with document index and genre
df_genre = df['genre'].to_frame()
print(df_genre)

In [None]:
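#merge this into the tfidf_df (DTM weighted by TF-IDF)
#lsuffix='_x' renames our label column to 'genre_x' when 'genre' is also a word (column) in tfidf_df
merged_df = df_genre.join(tfidf_df, how = 'right', lsuffix='_x')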

#view result
merged_df

Now let's compare the words with the highest TF-IDF weight for each genre. The illustrating question: **what words distinguish reviews of Rap albums, Indie Rock albums, and Jazz albums?**

In [None]:
#pull out the reviews for three genres, Rap, Alternative/Indie Rock, and Jazz
dtm_rap = merged_df[merged_df['genre_x']=='Rap']
dtm_indie = merged_df[merged_df['genre_x']=='Alternative/Indie Rock']
dtm_jazz = merged_df[merged_df['genre_x']=='Jazz']

#print the words with the highest TF-IDF scores for each genre
print('Rap Words')
print(dtm_rap.max(numeric_only=True).sort_values(ascending=False)[0:20])
print()
print('Indie Words')
print(dtm_indie.max(numeric_only=True).sort_values(ascending=False)[0:20])
print()
print('Jazz Words')
print(dtm_jazz.max(numeric_only=True).sort_values(ascending=False)[0:20])

There we go! A method of identifying content words, and distinctive words based on groups of texts. You notice there are some proper nouns in there. How might we remove those if we're not interested in them?

### Challenge

Compare the distinctive words for two artists in the data.

Note: choose artists that have several reviews each; check the frequency counts to find them.

*Hint:* Copy and paste the above code and modify it as needed.

In [None]:
# your code here

## Further Resources<a id='resources'></a>

[Gonçalves, Pollyanna, Matheus Araújo, Fabrício Benevenuto, and Meeyoung Cha. 2013. "Comparing and Combining Sentiment Analysis Methods"](https://dl.acm.org/doi/abs/10.1145/2512938.2512951)
<br>Compares eight popular sentiment analysis methods (including LIWC and others) and finds wide variation in their results.

[Jockers, Matthew L. 2015. "A Novel Method for Detecting Plot"](http://www.matthewjockers.net/2014/06/05/a-novel-method-for-detecting-plot/)
<br>Blog applying sentiment analysis to several classic novels. 

Examples of creating custom dictionaries:
- [Haber, Jaren. 2020. "Sorting Schools: A Computational Analysis of Charter School Identities and Stratification"](https://doi.org/10.1177/0038040720953218)
- [Enns, Peter, Nathan Kelly, Jana Morgan, and Christopher Witko. 2015. "Money and the Supply of Political Rhetoric: Understanding the Congressional (Non-)Response to Economic Inequality"](http://cdn.equitablegrowth.org/wp-content/uploads/2016/06/29155322/enns-kelly-morgan-witko-econinterests-policyagenda.pdf)