# Enron Query Elasticsearch Analysis

## Download NLTK Data

macOS: Calling `nltk.download('all')` in a Jupyter notebook saves the data under your home directory,
at `/Users/username/nltk_data`, rather than in a system-wide location.

To install the data system-wide instead, run the following command in a terminal:
    
```shell
$ sudo python -m nltk.downloader -d /usr/local/share/nltk_data all
```

This saves the data for all NLTK modules in `/usr/local/share/nltk_data`.
If you only need a subset of the NLTK data, replace the final argument `all`,
which selects every package, with the identifiers of the packages you need.
For example, use `punkt` for the tokenizer or `vader_lexicon` for sentiment analysis.
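
As a rough sketch of the subset approach, the snippet below downloads only the packages that the imports in this notebook appear to rely on. The package list and the helper name `download_required` are assumptions, not part of the original setup. Newer NLTK releases may additionally ask for `punkt_tab` (tokenizer) or `omw-1.4` (WordNet), so adjust the list if a `LookupError` names another resource.

```python
# Hypothetical package list: the NLTK resources this notebook's imports rely on.
REQUIRED_PACKAGES = ["punkt", "stopwords", "wordnet", "vader_lexicon"]


def download_required(packages=REQUIRED_PACKAGES, download_dir=None):
    """Download each package and return the identifiers that failed.

    download_dir=None lets NLTK pick its default location.
    """
    import nltk  # imported here so the list above can be reused without NLTK

    failed = []
    for name in packages:
        # nltk.download returns True on success and False on failure
        if not nltk.download(name, download_dir=download_dir, quiet=True):
            failed.append(name)
    return failed
```

Calling `download_required()` once from a notebook cell or a script should be enough; an empty return value means every package was fetched.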

A more robust approach is to store the data path in the `NLTK_DATA` environment variable, which NLTK checks when it looks for data, in your shell configuration.
After adding `NLTK_DATA`, reload the configuration and download the data using the variable
as the target path.

```shell
# Set the NLTK data path in ~/.bashrc or ~/.zshrc
export NLTK_DATA="/usr/local/share/nltk_data"
# Reload the shell configuration
$ source ~/.zshrc
# Download all NLTK data
$ sudo python -m nltk.downloader -d "$NLTK_DATA" all
```
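
To confirm that the variable is visible to Python, a small check along these lines may help. It is a sketch that uses only the standard library; the helper name `nltk_data_dirs` is hypothetical, and it assumes `NLTK_DATA` may hold several directories separated by the platform's path separator.

```python
import os


def nltk_data_dirs(environ=os.environ):
    """Return (directory, exists) pairs for each entry in NLTK_DATA."""
    raw = environ.get("NLTK_DATA", "")
    # Skip empty entries left by a missing variable or a trailing separator
    dirs = [d for d in raw.split(os.pathsep) if d]
    return [(d, os.path.isdir(d)) for d in dirs]


for directory, exists in nltk_data_dirs():
    print(f"{directory}: {'found' if exists else 'missing'}")
```

If nothing is printed, the variable was not set in the process that launched Python. Once NLTK is imported, `nltk.data.path` lists every directory that is actually searched.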

In [None]:
import nltk
# Run once here or in terminal
nltk.download('all')

## Import

In [None]:
import requests
import pandas
import json
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from dateutil import parser

# Host for INTA 6450 Class
host = 'http://18.188.56.207:9200/'
# Request enron corpus database
requests.get(host + '_cat/indices/enron').content

b'yellow open enron lVq0is2BTCmgDk2kFyZHTQ 1 1 251735 129380 1.1gb 1.1gb\n'
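
The response is a single whitespace-separated row without headers. As an illustration, the sketch below labels each value, assuming the default `_cat/indices` column order (health, status, index, uuid, pri, rep, docs.count, docs.deleted, store.size, pri.store.size). The helper name `parse_cat_indices_row` is hypothetical. Appending `?v` to the request asks Elasticsearch to include the header row itself.

```python
# Assumed default column order of the _cat/indices API
CAT_INDICES_COLUMNS = [
    "health", "status", "index", "uuid", "pri", "rep",
    "docs.count", "docs.deleted", "store.size", "pri.store.size",
]


def parse_cat_indices_row(raw):
    """Map one row of _cat/indices output (bytes or str) to a dict."""
    text = raw.decode("utf-8") if isinstance(raw, bytes) else raw
    values = text.split()
    if len(values) != len(CAT_INDICES_COLUMNS):
        raise ValueError("unexpected number of columns: %d" % len(values))
    return dict(zip(CAT_INDICES_COLUMNS, values))


row = parse_cat_indices_row(
    b"yellow open enron lVq0is2BTCmgDk2kFyZHTQ 1 1 251735 129380 1.1gb 1.1gb\n"
)
print(row["health"], row["docs.count"])
```

Read this way, the `enron` index holds 251,735 documents. A `yellow` health status means the primary shard is allocated but its replica is not, which is usual on a single-node cluster.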

## Functions

In [None]:
def elasticsearch_results_to_df(results):
    '''
    Take the response of a requests.get call to Elasticsearch
    and return the hits as a pandas.DataFrame indexed by document ID
    '''
    hits = results.json()['hits']['hits']
    data = pandas.DataFrame([i['_source'] for i in hits], index=[i['_id'] for i in hits])
    # Parse the date strings into datetime objects
    data['date'] = data['date'].apply(parser.parse)
    return data

def print_df_row(row):
    '''
    A function that will take a row of the data frame and print it out
    '''
    print('____________________')
    print('RE: %s' % row.get('subject',''))
    print('At: %s' % row.get('date',''))
    print('From: %s' % row.get('sender',''))
    print('To: %s' % row.get('recipients',''))
    print('CC: %s' % row.get('cc',''))
    print('BCC: %s' % row.get('bcc',''))
    print('Body:\n%s' % row.get('text',''))
    print('____________________')

# Build the stop-word set and lemmatizer once, not on every call
STOP_WORDS = set(stopwords.words('english'))
LEMMATIZER = WordNetLemmatizer()

def preprocess_text(text):
    '''
    Lowercase and tokenize the text, drop English stop words,
    lemmatize the remaining tokens, and return them as one string
    '''
    # Tokenize the text
    tokens = word_tokenize(text.lower())

    # Remove stop words
    filtered_tokens = [token for token in tokens if token not in STOP_WORDS]

    # Lemmatize the tokens
    lemmatized_tokens = [LEMMATIZER.lemmatize(token) for token in filtered_tokens]

    # Join the tokens back into a string
    return ' '.join(lemmatized_tokens)

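def get_sentiment(text, sentiment):
    '''
    Return one VADER score for the text; sentiment is one of
    "pos", "neu", "neg", or "compound"
    '''
    analyzer = SentimentIntensityAnalyzer()
    scores = analyzer.polarity_scores(text)
    return scores[sentiment]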
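
Since `polarity_scores` returns all four values at once, the scores for a text can be collected in a single call rather than one call per column. The sketch below shows the idea with a stand-in scorer so that it runs without NLTK; `score_texts` and `fake_scorer` are hypothetical names, and in the notebook the scorer would be `SentimentIntensityAnalyzer().polarity_scores`.

```python
SENTIMENT_KEYS = ("pos", "neu", "neg", "compound")


def score_texts(texts, scorer):
    """Call scorer once per text and keep only the four sentiment keys."""
    rows = []
    for text in texts:
        scores = scorer(text)
        rows.append({key: scores[key] for key in SENTIMENT_KEYS})
    return rows


def fake_scorer(text):
    """Stand-in for polarity_scores; the values are made up for illustration."""
    return {"neg": 0.0, "neu": 1.0, "pos": 0.0, "compound": 0.0}


rows = score_texts(["account number enclosed"], fake_scorer)
print(rows[0])
```

The resulting list of dicts can be turned into columns with `pandas.DataFrame(rows, index=df.index)` and joined onto the original frame.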

## Test Query

In [18]:
# Query for an exact phrase match in the "text" field
# Uses the "match_phrase" query: https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-match-query-phrase.html
doc = {
    "query": {
        "match_phrase" : {
            "text" : "Account numbers"
        } 
    },
    "from" : 0, # Starting message to return. 
    "size" : 2000, # Return this many messages. Can't be more than 10,000
}
r=requests.get(host + 'enron/_search',
               data=json.dumps(doc), headers={'Content-Type':'application/json'})
r.raise_for_status()
print("Found %s messages matching the query" % r.json()['hits']['total']['value'])
df = elasticsearch_results_to_df(r)
df['processedText'] = df['text'].apply(preprocess_text)
df['pos'] = df['processedText'].apply(get_sentiment,args=("pos",))
df['neu'] = df['processedText'].apply(get_sentiment,args=("neu",))
df['neg'] = df['processedText'].apply(get_sentiment,args=("neg",))
df['compound'] = df['processedText'].apply(get_sentiment,args=("compound",))

Found 77 messages matching the query
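
The 10,000 limit mentioned in the query comment applies to `from` + `size` together (the default `index.max_result_window`), so larger result sets have to be fetched in pages. A minimal sketch of building the page requests follows; `paged_bodies` is a hypothetical helper, and each yielded body would be sent with `requests.get` in the same way as the query above.

```python
import json


def paged_bodies(query, total, page_size=2000, max_window=10000):
    """Yield search bodies covering `total` hits without passing max_window."""
    limit = min(total, max_window)
    for start in range(0, limit, page_size):
        yield {
            "query": query,
            "from": start,
            # Shrink the last page so from + size never exceeds the limit
            "size": min(page_size, limit - start),
        }


query = {"match_phrase": {"text": "Account numbers"}}
for body in paged_bodies(query, total=4500):
    print(json.dumps(body)[:80])
```

Beyond 10,000 hits, Elasticsearch offers `search_after` or the scroll API instead of `from`/`size` paging.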
