# Dictionary methods

## Dataset for the exercise

* [New York Times Comments](https://www.kaggle.com/aashita/nyt-comments/data), a set of readers' comments on articles published in the New York Times

## Overarching research question

The comments offer a way to study what kinds of concerns people raise when commenting on online articles.
Study what seems to be the target of the commenting: New York Times staff or journalistic guidelines (suggesting that comments serve as a tool for journalists to interact with their audiences) _or_ other audience members.

In [3]:
import csv
from collections import OrderedDict
import os
import datetime

In [55]:
## There are three sets of keywords because we are interested in New York Times staff and journalistic
## guidelines: first we find the comments that are about the New York Times itself, and then we
## determine which category each of those comments belongs to.

keywords = "New York Times, NYT ,staff,journalist,writer,editor,ethics,truth,fairness,diversity"
keywords = keywords.lower()
keywords = keywords.split(',')

keywords_nyt = "New York Times, NYT "
keywords_nyt = keywords_nyt.lower()
keywords_nyt = keywords_nyt.split(',')

keywords_staff = 'staff,journalist,writer,editor'
keywords_staff = keywords_staff.lower()
keywords_staff = keywords_staff.split(',')

keywords_ethics = 'ethics,truth,fairness,diversity,honesty'
keywords_ethics = keywords_ethics.lower()
keywords_ethics = keywords_ethics.split(',')

In [39]:
path = 'data/nyt/'
files = os.listdir( path ) ## see all files in directory
files = filter( lambda file_name: file_name.startswith("Comments"), files ) ## choose only data files
files = map( lambda file_name: path + file_name, files ) ## add path to file names
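
Note that `filter` and `map` return one-shot iterators in Python 3, which is presumably why the file list is rebuilt before every pass over the data in this notebook. A minimal sketch of an alternative that returns a reusable list; the helper name `list_comment_files` is hypothetical, and the `Comments` prefix is taken from the cell above:

```python
from pathlib import Path

def list_comment_files(directory, prefix="Comments"):
    """Return a sorted, reusable list of data file paths in `directory`."""
    return sorted(
        str(p) for p in Path(directory).iterdir()
        if p.is_file() and p.name.startswith(prefix)
    )

# Usage (assumes the same directory layout as above):
# files = list_comment_files('data/nyt/')
```

Because the result is a list, it can be looped over several times without re-running the cell.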

In [40]:
counter = 0
comments = 0

for file in files:
    for entry in csv.DictReader( open( file, encoding='utf8') ):
        
        comments += 1
        
        comment = entry['commentBody']
        
        ## work through several different keywords in the analysis
            
        for keyword in keywords:
            if keyword in comment.lower():
                counter += 1
                break
   
print( counter, "comments mention any of these:", ','.join(keywords) )
print( "There are in total", comments )

156957 comments mention any of these: new york times, nyt ,staff,journalist,writer,editor,ethics,truth,fairness,diversity
There are in total 2176364


## Tasks

* Identify other potential keywords for this phenomenon and add those keywords to the list.

I created three sets of keywords: the first to find the comments that are about the New York Times, the second to find the comments that are about the staff of the NYT, and the third to find the comments that are about the journalistic guidelines of the NYT.

* Are there any cases where this approach might break? Modify the code to mitigate them when possible.

For example, whitespace matters. If the keyword is "nyt" without surrounding spaces, it also matches any word that merely contains the letters "nyt". This is why the keyword is written as " NYT ", with a space on each side.
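
Padding a keyword with spaces still misses cases such as "NYT," or a comment that ends with "NYT". As a hedged alternative (not what the cells in this notebook do), a regular expression with word boundaries handles both problems. The helper name `compile_keywords` and the sample sentences are made up for illustration:

```python
import re

def compile_keywords(keywords):
    """Build one case-insensitive pattern that matches any keyword as a whole word."""
    escaped = (re.escape(k.strip()) for k in keywords if k.strip())
    return re.compile(r"\b(?:" + "|".join(escaped) + r")\b", re.IGNORECASE)

nyt_pattern = compile_keywords(["New York Times", "NYT"])

# Matches even though the keyword is followed by a comma
print(bool(nyt_pattern.search("The NYT, as usual, got it right.")))
# Does not match a (made-up) word that only contains the letters "nyt"
print(bool(nyt_pattern.search("Canyton is a nice town.")))
```

With this pattern the check inside the loop would become `if nyt_pattern.search(comment):`.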

* The data also has a `createDate` variable, which identifies when the comment was created. Based on this, try to look for some temporal trends in comment counts.

The following code implements the new keyword "strategy" and also counts the number of comments in each month:

In [57]:
path = 'data/nyt/'
files = os.listdir( path ) ## see all files in directory
files = filter( lambda file_name: file_name.startswith("Comments"), files ) ## choose only data files
files = map( lambda file_name: path + file_name, files ) ## add path to file names

In [58]:
comments = 0

comments_about_nyt_counter = 0
comments_about_staff_counter = 0
comments_about_ethics_counter = 0
comments_in_both_categories = 0

wanted_months = ['Jan 2017', 'Feb 2017', 'Mar 2017', 'Apr 2017', 'May 2017', 'Jan 2018', 'Feb 2018', 'Mar 2018', 
                'Apr 2018'] # In the 2017 data there were also comments from June to December

nyt_comments_per_month = OrderedDict([('Jan 2017', 0), ('Feb 2017', 0), ('Mar 2017', 0), ('Apr 2017', 0), 
                                      ('May 2017', 0), ('Jan 2018', 0), ('Feb 2018', 0), ('Mar 2018', 0), 
                                      ('Apr 2018', 0)])

staff_comments_per_month = OrderedDict([('Jan 2017', 0), ('Feb 2017', 0), ('Mar 2017', 0), ('Apr 2017', 0), 
                                      ('May 2017', 0), ('Jan 2018', 0), ('Feb 2018', 0), ('Mar 2018', 0), 
                                      ('Apr 2018', 0)])

ethics_comments_per_month = OrderedDict([('Jan 2017', 0), ('Feb 2017', 0), ('Mar 2017', 0), ('Apr 2017', 0), 
                                      ('May 2017', 0), ('Jan 2018', 0), ('Feb 2018', 0), ('Mar 2018', 0), 
                                      ('Apr 2018', 0)])

for file in files:
    for entry in csv.DictReader( open( file, encoding='utf-8') ):
        
        comments += 1
        
        comment = entry['commentBody']
        create_date = entry['createDate']
        
        comment_time = float(create_date)        
        
        ## First, let's filter the comments that mention the keywords new york times or nyt, so that we
        ## have the body of comments that are about the newspaper itself:
        for keyword_nyt in keywords_nyt:
            
                       
            was_in_staff = False
            
            if keyword_nyt in comment.lower():
                comments_about_nyt_counter += 1
                
                month_of_comment = datetime.datetime.fromtimestamp(int(comment_time)).strftime('%b %Y')
                
                if month_of_comment in wanted_months:
                
                    nyt_comments_per_month[month_of_comment] = nyt_comments_per_month[month_of_comment] + 1

                    for keyword_staff in keywords_staff:
                        if keyword_staff in comment.lower():
                            comments_about_staff_counter += 1    
                            was_in_staff = True
                            staff_comments_per_month[month_of_comment] = staff_comments_per_month[month_of_comment] + 1
                            break

                    for keyword_ethics in keywords_ethics:
                        if keyword_ethics in comment.lower():
                            comments_about_ethics_counter += 1
                            ethics_comments_per_month[month_of_comment] = ethics_comments_per_month[month_of_comment] + 1
                            if was_in_staff == True:
                                comments_in_both_categories += 1
                            break

                    break
                
print( comments_about_nyt_counter, "comments mention any of these:", ','.join(keywords_nyt) )
print( comments_about_staff_counter, "comments mention any of these:", ','.join(keywords_staff) )
print( comments_about_ethics_counter, "comments mention any of these:", ','.join(keywords_ethics) )
print( comments_in_both_categories, "comments that mention staff and ethics keywords" )
print( "There are in total", comments, '\n' )

print('About nyt: ', nyt_comments_per_month, '\n')
print('About staff: ', staff_comments_per_month, '\n')
print('About ethics:', ethics_comments_per_month, '\n')

35694 comments mention any of these: new york times, nyt 
5867 comments mention any of these: staff,journalist,writer,editor
2625 comments mention any of these: ethics,truth,fairness,diversity,honesty
599 comments that mention staff and ethics keywords
There are in total 2176364 

About nyt:  OrderedDict([('Jan 2017', 9900), ('Feb 2017', 8378), ('Mar 2017', 8824), ('Apr 2017', 7796), ('May 2017', 10258), ('Jan 2018', 5774), ('Feb 2018', 5602), ('Mar 2018', 6948), ('Apr 2018', 6796)]) 

About staff:  OrderedDict([('Jan 2017', 887), ('Feb 2017', 714), ('Mar 2017', 666), ('Apr 2017', 674), ('May 2017', 1041), ('Jan 2018', 434), ('Feb 2018', 406), ('Mar 2018', 511), ('Apr 2018', 534)]) 

About ethics: OrderedDict([('Jan 2017', 474), ('Feb 2017', 329), ('Mar 2017', 354), ('Apr 2017', 278), ('May 2017', 373), ('Jan 2018', 227), ('Feb 2018', 187), ('Mar 2018', 182), ('Apr 2018', 221)]) 



Unsurprisingly, defining the right keywords seems to be a very difficult task. I am not sure whether first filtering the comments that are about the New York Times and then checking which category they fall into is reasonable, since it keeps only 35,694 of the 2,176,364 comments. Looking for temporal trends, April 2017 has fewer comments than the other 2017 months, which might point to a problem with the data (or my code). What I find interesting is that January 2017 and May 2017 seem to have been quite busy times for commenting. In 2018, March and April seem to be the busiest months. Beyond that, there do not seem to be any noteworthy trends in the comments.
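
The per-month tally above can also be sketched more compactly with a `Counter`, which removes the need to list every month in advance. One caveat worth knowing: `datetime.fromtimestamp` without a `tz` argument uses the local time zone of the machine, so comments created near a month boundary can land in different months on different computers. The sketch below fixes the time zone to UTC, which is an assumption on my part; the sample timestamps and the helper name `month_label` are made up for illustration:

```python
from collections import Counter
from datetime import datetime, timezone

def month_label(create_date):
    """Convert a Unix timestamp (string or number) to a label such as 'Jan 2017'."""
    moment = datetime.fromtimestamp(int(float(create_date)), tz=timezone.utc)
    return moment.strftime('%b %Y')

# Hypothetical createDate values, for illustration only
sample_dates = ['1483455908', '1483426105.0', '1517529600']

# Counter creates a month's entry the first time that month is seen
per_month = Counter(month_label(d) for d in sample_dates)
print(per_month)
```

In the loop over the data, `per_month[month_label(entry['createDate'])] += 1` would play the role of the three pre-filled `OrderedDict` objects.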

# Natural language analysis

In many languages, words can take different forms. For example, 'I have an apple' and 'I have several apples' convey almost the same information; similarly, 'She had an apple' and 'She has an apple' are almost identical. In Finnish, such variation is much more extensive: thanks to the many suffixes, a word may have a large number of forms.

![Joke about conjugation](https://i1.wp.com/finnishteacher.com/wp-content/uploads/2018/11/finnish-language-meme.png?resize=1024%2C591&ssl=1)

This might make analysis difficult! Therefore the text is often **stemmed** or **lemmatized** into its basic form. Furthermore, tools such as the [Natural Language Toolkit](https://www.nltk.org/) allow parsing text to identify proper nouns and named entities, or to determine whether a word is an adjective, a noun, etc.

## Dataset

Use the same dataset.

## Tasks

Replicate the previous exercise using proper stemming. If the results change, how and why?

In [44]:
import nltk
from nltk.stem.snowball import EnglishStemmer
stemmer = EnglishStemmer()

In [66]:
message = 'This is a longer example! Many words are included here, and we shall all words.'
stemmed = ''

for word in nltk.word_tokenize( message ):
    stemmed += stemmer.stem( word ) + ' '
    
print( stemmed )

this is a longer exampl ! mani word are includ here , and we shall all word . 


Stemming the keywords:

In [59]:
keywords_nyt = "New York Times, NYT "

stemmed = ''
for word in nltk.word_tokenize( keywords_nyt ):
    stemmed += stemmer.stem( word ) + ' '

keywords_nyt = stemmed.lower()
keywords_nyt = keywords_nyt.split(',')

keywords_staff = 'staff,journalist,writer,editor'

stemmed = ''
for word in nltk.word_tokenize( keywords_staff ):
    stemmed += stemmer.stem( word ) + ' '

keywords_staff = stemmed.lower()
keywords_staff = keywords_staff.split(',')

keywords_ethics = 'ethics,truth,fairness,diversity,honesty'

stemmed = ''
for word in nltk.word_tokenize( keywords_ethics ):
    stemmed += stemmer.stem( word ) + ' '

keywords_ethics = stemmed.lower()
keywords_ethics = keywords_ethics.split(',')


In [64]:
path = 'data/nyt/'
files = os.listdir( path ) ## see all files in directory
files = filter( lambda file_name: file_name.startswith("Comments"), files ) ## choose only data files
files = map( lambda file_name: path + file_name, files ) ## add path to file names

In [65]:
wanted_months = ['Jan 2017', 'Feb 2017', 'Mar 2017', 'Apr 2017', 'May 2017', 'Jan 2018', 'Feb 2018', 'Mar 2018', 
                'Apr 2018'] # In the 2017 data there were also comments from June to December

nyt_comments_per_month = OrderedDict([('Jan 2017', 0), ('Feb 2017', 0), ('Mar 2017', 0), ('Apr 2017', 0), 
                                      ('May 2017', 0), ('Jan 2018', 0), ('Feb 2018', 0), ('Mar 2018', 0), 
                                      ('Apr 2018', 0)])

staff_comments_per_month = OrderedDict([('Jan 2017', 0), ('Feb 2017', 0), ('Mar 2017', 0), ('Apr 2017', 0), 
                                      ('May 2017', 0), ('Jan 2018', 0), ('Feb 2018', 0), ('Mar 2018', 0), 
                                      ('Apr 2018', 0)])

ethics_comments_per_month = OrderedDict([('Jan 2017', 0), ('Feb 2017', 0), ('Mar 2017', 0), ('Apr 2017', 0), 
                                      ('May 2017', 0), ('Jan 2018', 0), ('Feb 2018', 0), ('Mar 2018', 0), 
                                      ('Apr 2018', 0)])


comments = 0

comments_about_nyt_counter = 0
comments_about_staff_counter = 0
comments_about_ethics_counter = 0
comments_in_both_categories = 0

for file in files:
    for entry in csv.DictReader( open( file, encoding='utf-8') ):
        
        comments += 1
        
        comment = entry['commentBody']
        create_date = entry['createDate']
        
        stemmed_comment= ''
        
        comment_time = float(create_date)
        
        for keyword_nyt in keywords_nyt:
            
            was_in_staff = False
            
            if keyword_nyt in comment.lower():
                
                ## Stemming will only be done to comments that have one of the nyt-keywords since
                ## stemming all the comments seems to take too long
                
                month_of_comment = datetime.datetime.fromtimestamp(int(comment_time)).strftime('%b %Y')
                
                if month_of_comment in wanted_months:
                    
                    nyt_comments_per_month[month_of_comment] = nyt_comments_per_month[month_of_comment] + 1
                    
                    for word in nltk.word_tokenize( comment.lower() ):
                        stemmed_comment += stemmer.stem( word ) + ' '

                    comments_about_nyt_counter += 1

                    for keyword_staff in keywords_staff:
                        if keyword_staff in stemmed_comment:
                            comments_about_staff_counter += 1    
                            staff_comments_per_month[month_of_comment] = staff_comments_per_month[month_of_comment] + 1
                            was_in_staff = True
                            break

                    for keyword_ethics in keywords_ethics:
                        if keyword_ethics in stemmed_comment:
                            comments_about_ethics_counter += 1
                            ethics_comments_per_month[month_of_comment] = ethics_comments_per_month[month_of_comment] + 1
                            if was_in_staff == True:
                                comments_in_both_categories += 1
                            break

                    break

print( comments_about_nyt_counter, "comments mention any of these:", ','.join(keywords_nyt) )
print( comments_about_staff_counter, "comments mention any of these:", ','.join(keywords_staff) )
print( comments_about_ethics_counter, "comments mention any of these:", ','.join(keywords_ethics) )
print( comments_in_both_categories, "comments that mention staff and ethics keywords")
print( "There are in total", comments, 'comments\n' )

print('About nyt: ', nyt_comments_per_month, '\n')
print('About staff: ', staff_comments_per_month, '\n')
print('About ethics:', ethics_comments_per_month, '\n')

25203 comments mention any of these: new york time , nyt 
2515 comments mention any of these: staff , journalist , writer , editor 
2413 comments mention any of these: ethic , truth , fair , divers , honesti 
352 comments that mention staff and ethics keywords
There are in total 2176364 comments

About nyt:  OrderedDict([('Jan 2017', 3568), ('Feb 2017', 2956), ('Mar 2017', 3134), ('Apr 2017', 2765), ('May 2017', 3665), ('Jan 2018', 2052), ('Feb 2018', 2060), ('Mar 2018', 2555), ('Apr 2018', 2448)]) 

About staff:  OrderedDict([('Jan 2017', 394), ('Feb 2017', 315), ('Mar 2017', 288), ('Apr 2017', 301), ('May 2017', 398), ('Jan 2018', 179), ('Feb 2018', 184), ('Mar 2018', 228), ('Apr 2018', 228)]) 

About ethics: OrderedDict([('Jan 2017', 398), ('Feb 2017', 263), ('Mar 2017', 289), ('Apr 2017', 260), ('May 2017', 392), ('Jan 2018', 219), ('Feb 2018', 188), ('Mar 2018', 198), ('Apr 2018', 206)]) 



Fewer comments were found than when the data was not stemmed. Without stemming there were 5867 comments about staff and 2625 about ethics; with stemming there were 2515 comments about staff and 2413 about journalistic guidelines. I would suggest that stemming makes the process more precise, since it removes the "unnecessary part" of each word. For example, in the stemmed ethics keywords "diversity" becomes "divers". This, in turn, removes the need to list all the possible forms of the word ("diversity" and "diverse" are both probably stemmed to "divers"). However, I expected to find more comments after stemming, not fewer. One likely reason is that the stemmed NYT keyword ("new york time ") is still compared against the unstemmed comment text, so a comment containing "new york times" no longer matches it.
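
In the cell above, the keywords are stemmed as one comma-joined string, and the stemmed NYT keywords are compared with the unstemmed comment. A sketch of one possible alternative is to stem each keyword and each comment into a list of tokens and compare the token lists, so that both sides are always in the same form. To keep the sketch runnable without NLTK, `toy_stem` is a deliberately crude stand-in for `stemmer.stem`, and a simple regular expression replaces `nltk.word_tokenize`; both are assumptions made for illustration, as are all the helper names:

```python
import re

def toy_stem(word):
    """Stand-in stemmer for illustration only: strips a trailing 's'."""
    return word[:-1] if word.endswith('s') and len(word) > 3 else word

def stem_tokens(text, stem):
    """Lowercase the text, split it into word tokens, and stem each token."""
    return [stem(token) for token in re.findall(r"[a-z']+", text.lower())]

def contains_phrase(tokens, phrase_tokens):
    """True if phrase_tokens occurs as a consecutive run inside tokens."""
    n = len(phrase_tokens)
    return any(tokens[i:i + n] == phrase_tokens for i in range(len(tokens) - n + 1))

keywords_nyt = ['New York Times', 'NYT']
stemmed_keywords = [stem_tokens(k, toy_stem) for k in keywords_nyt]

# Both the keyword and the comment are stemmed before they are compared
comment = stem_tokens('The New York Times editors got this wrong.', toy_stem)
print(any(contains_phrase(comment, k) for k in stemmed_keywords))
```

In the notebook itself, `stemmer.stem` would be passed in place of `toy_stem`. Comparing whole tokens also avoids partial-word matches, such as a keyword matching inside a longer word.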

* How would you use this to create a word-document matrix?

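To create a word-document matrix we should first collect, for example, a list of all the documents (comments) that are about the staff. Then we would use scikit-learn's `CountVectorizer` to build the matrix. The vectorizer has to be fitted to the documents before it can transform them, so `fit_transform` is used. For example:

```
from sklearn.feature_extraction.text import CountVectorizer

vectorizing_tool = CountVectorizer()
words_in_staff_documents = vectorizing_tool.fit_transform(staff_documents)
```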
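
To make concrete what such a matrix contains, here is a small sketch that builds the counts by hand using only the standard library. The two sample sentences are made up for illustration, and splitting on whitespace is a simplification of the tokenization that `CountVectorizer` performs:

```python
from collections import Counter

# Hypothetical stand-ins for the filtered staff comments
staff_documents = [
    'the editor ignored the writer',
    'the writer thanked the staff',
]

# One word-count table per document
counts = [Counter(doc.split()) for doc in staff_documents]

# The sorted vocabulary defines the column order
vocabulary = sorted(set().union(*counts))

# Rows are documents, columns are words, cells are counts
matrix = [[c[word] for word in vocabulary] for c in counts]

print(vocabulary)
for row in matrix:
    print(row)
```

Here each row is a document and each column a word, which is also the orientation `CountVectorizer` returns; transposing the result gives words as rows.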

## Some reflections

What became apparent when doing this exercise was that using dictionary methods is difficult. It is really hard to find the keywords that would retrieve all the relevant documents about the topic we are interested in. I guess what is needed is a preliminary reading of some of the interesting documents (but how do we know which documents are interesting if our goal is to find documents about some topic in a large body of documents?). In addition, we should have at least basic knowledge of the topic we are interested in and of the source of the data. It would probably also help to be a native speaker of the language the documents are written in. I found this exercise really difficult.

Dictionary methods seem to be the most rudimentary method of "data science" in the sense that, at least in their basic form, they do not provide tools for analysis (e.g. an algorithm whose output would simplify the data in some way). However, the method is really easy to understand, has clear parameters, and does not appear as a black box to those who do not have the time to study its mathematical background, because it does not have one. I guess the only reason it could be considered a data science method is that computers make it possible to go through a really large body of documents. My impression is that dictionary methods are really useful for finding the wanted documents in a large body of documents, which can then be analyzed further with other methods. However, the results depend heavily on how good the researcher is at the "craft of choosing the right keywords".

I would use dictionary methods first to find the interesting documents in a body of work: for example, all the documents about architecture in a large body of newspaper articles that are not indexed properly. After that, this set of articles could be divided into those that are about buildings and those that are more about architecture as such. At least for now, I consider dictionary methods something that can be used to build a solid foundation for the actual analysis, which is done with some other machine learning method.