# Dictionary methods

## Dataset for the exercise

* [New York Times Comments](https://www.kaggle.com/aashita/nyt-comments/data), a set of readers' comments on articles published in the New York Times.

## Overarching research question

The comments offer a perspective on the kinds of concerns people raise when discussing online articles. The aim of this example is to construct a set of keywords for a dictionary search that captures the readers' views about the New York Times and journalistic reporting. Try different sets of keywords iteratively and examine how the results change.

In [None]:
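# Specify keywords for dictionary search (comma-separated)
keywords = "New York Times,NYT"
# Lowercase and split into a list, trimming stray whitespace around each keyword
keywords = [ keyword.strip() for keyword in keywords.lower().split(',') ]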

In [None]:
import csv
import os

In [None]:
path = 'data/nyt-comments/'
files = os.listdir( path ) ## see all files in directory
files = [ file_name for file_name in files if file_name.startswith("Comments") ] ## choose only data files
files = [ os.path.join( path, file_name ) for file_name in files ] ## add path to file names; a list (unlike map) can be iterated again when later cells are rerun

In [None]:
counter = 0
comments = 0

# Iterate through the files and count comments which mention any of the keywords
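for file in files:
    ## newline='' is what the csv module expects; utf-8 is an assumption about the data files
    with open( file, newline='', encoding='utf-8' ) as csv_file:
        for entry in csv.DictReader( csv_file ):

            comments += 1

            comment = entry['commentBody'].lower()

            ## count the comment once if it mentions any of the keywords
            if any( keyword in comment for keyword in keywords ):
                counter += 1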

print( counter, "/", comments, "comments mention any of the keywords:", ', '.join(keywords) )

## Tasks

* Try to think of other keywords that would make the search more comprehensive, while not producing excessive volumes of irrelevant material. Add these to the keywords above and rerun the script.
* Are there any cases where the above code might not work? Modify the code to address these if possible.
* The data has a `createDate` variable as well, which identifies when the comment was created. Based on this, try to look for some temporal trends in comment counts.
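
One case where the substring search above can mislead is a short keyword that also occurs inside longer words: `nyt` is a substring of `anything`, for instance. The sketch below shows one possible way to address this with whole-word matching through the standard `re` module. The helper name `build_keyword_pattern` and the example sentences are made up for illustration; they are not part of the dataset or the original exercise.

```python
import re

def build_keyword_pattern(keywords):
    """Compile a pattern matching any keyword as a whole word or phrase."""
    # re.escape guards against keywords that contain regex metacharacters
    alternatives = "|".join(re.escape(keyword) for keyword in keywords)
    # \b marks a word boundary, so 'nyt' no longer matches inside 'anything'
    return re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)

pattern = build_keyword_pattern(["new york times", "nyt"])

# Hypothetical comments, invented for this illustration
examples = [
    "I read it in the NYT yesterday.",
    "Is there anything new here?",
    "The New York Times got this wrong.",
]
matches = [bool(pattern.search(text)) for text in examples]
print(matches)
```

In the counting loop, `pattern.search( comment )` could then stand in for the substring test. Whether whole-word matching is preferable depends on the keywords: it would, for example, no longer find a keyword that appears as part of a hashtag or a compound word.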

# Natural language analysis

In many languages, a word can take several forms. For example, 'I have an apple' and 'I have several apples' convey almost the same information; similarly, 'She had an apple' and 'She has an apple' are almost identical. Such cases abound in Finnish, where a word can take many forms thanks to its numerous suffixes.

This complicates keyword-based analyses. One approach to reducing the complexity of language is to **stem** or **lemmatize** words into their basic form. Furthermore, tools such as the [Natural Language Toolkit](https://www.nltk.org/) can parse text to identify proper nouns and named entities, or to determine whether a word is an adjective, a noun, etc.


## Tasks

The code below is a short example of stemming the words of a message. Replicate the keyword search above, but this time using stemming. Do the results change?

After doing the search using stemming, try processing the resulting comments to create a Document-Term Matrix. You can also check the other code examples for hints on how this could be done.

In [None]:
import nltk
from nltk.stem.snowball import EnglishStemmer
stemmer = EnglishStemmer()

In [None]:
message = 'This is a longer example! Many words are included here, and we shall stem them all.'
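# word_tokenize needs NLTK's 'punkt' tokenizer data; if it is missing, nltk.download('punkt') fetches it
# (newer NLTK versions may ask for 'punkt_tab' instead)
stemmed = ' '.join( stemmer.stem( word ) for word in nltk.word_tokenize( message ) )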
    
print( stemmed )
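
The Document-Term Matrix mentioned in the tasks can be sketched with the standard library alone. This is a minimal illustration, not the only way to build one: the function name `document_term_matrix` and the two toy documents are hypothetical, and the tokenization is a plain whitespace split for brevity. In the exercise, the documents would instead be the stemmed comments returned by the keyword search.

```python
from collections import Counter

def document_term_matrix(documents):
    """Return (vocabulary, matrix), where matrix[i][j] counts term j in document i."""
    # Naive tokenization: lowercase and split on whitespace (an assumption for brevity)
    tokenized = [document.lower().split() for document in documents]
    # The vocabulary is the sorted set of all terms seen in any document
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Counter returns 0 for terms that do not occur in this document
        matrix.append([counts[term] for term in vocabulary])
    return vocabulary, matrix

# Toy documents, invented for this illustration
docs = ["the times report the news", "news report"]
vocabulary, matrix = document_term_matrix(docs)
print(vocabulary)
for row in matrix:
    print(row)
```

Each row corresponds to one document and each column to one term in `vocabulary`. For a large collection of comments, a dense list of lists like this grows quickly, so a sparse representation may be more practical.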