# Week 10 Assignment
Find a source of text and create a bag-of-words representation. Build a simple sentiment analyzer from scratch without using any sentiment packages.

## Data Loading
The data I'm using is the 20 Newsgroups dataset from scikit-learn, restricted to the `comp.graphics` category of the training subset. While the dataset is mainly used for classifying posts by newsgroup, we can still use it for a simple sentiment analysis.

In [None]:
import pandas as pd
from sklearn.datasets import fetch_20newsgroups

newsgroups_data = pd.DataFrame(
    fetch_20newsgroups(
        subset = 'train',
        categories = ['comp.graphics'], 
        shuffle = True, 
        random_state = 1
    ).data,
    columns = ['text']
)
newsgroups_data.head()

Unnamed: 0,text
0,From: davidr@rincon.ema.rockwell.com (David J....
1,From: seth@north6.acpub.duke.edu (Seth Wanders...
2,From: camter28@astro.ocis.temple.edu (Carter A...
3,From: crussell@netcom.com (Chris Russell)\nSub...
4,From: brentb@tamsun.tamu.edu (Brent)\nSubject:...


## Data Cleaning & Preprocessing
We are interested in the `text` column, so we clean it up first with a few text preprocessing steps.
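
To show what the cleaning steps do to a single string, here is a minimal sketch that uses only the standard library. The function name `clean_text` and the tiny `TOY_STOP_WORDS` set are hypothetical and exist only for illustration. The notebook itself uses pandas string methods and NLTK's English stop-word list, and this sketch skips lemmatization.

```python
import re
import string

# Hypothetical, deliberately tiny stop-word list for illustration only;
# the notebook itself uses NLTK's English stop words.
TOY_STOP_WORDS = {"the", "a", "is", "it", "of", "and", "in", "to", "i"}

# Translation table that maps every punctuation character to a space.
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def clean_text(text):
    """Lowercase, replace punctuation and newlines, drop digits and stop words."""
    text = text.lower()
    text = text.translate(PUNCT_TO_SPACE)
    text = text.replace("\n", " ")
    text = re.sub(r"\d+", "", text)
    return " ".join(word for word in text.split() if word not in TOY_STOP_WORDS)

print(clean_text("It was 1.3MB and had 11 minutes\nof 13 fps video."))
# prints: was mb had minutes fps video
```

Because punctuation becomes a space rather than disappearing, `1.3MB` splits into separate tokens before the digits are removed, which leaves only `mb`.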

In [None]:
import string
import nltk
nltk.download('stopwords')
nltk.download('wordnet')

from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer

# Chain some preprocessing steps together:
# 1) Lowercase the text field
# 2) Replace punctuation with spaces
# 3) Replace newline characters with spaces
# 4) Remove numbers (regex = True is required for the \d+ pattern
#    to be treated as a regular expression in recent pandas versions)
newsgroups_data['text_cleaned'] = newsgroups_data['text'] \
    .str.lower() \
    .str.translate(str.maketrans(string.punctuation, ' ' * len(string.punctuation))) \
    .str.replace('\n', ' ', regex = False) \
    .str.replace(r'\d+', '', regex = True)

# Remove stopwords from the cleaned-up text field.
# A set makes the membership test below much faster than a list.
stop_words = set(stopwords.words('english'))
newsgroups_data['text_cleaned'] = newsgroups_data['text_cleaned'].apply(
    lambda row: ' '.join([word for word in row.split() if word not in stop_words])
)

# Lemmatize words in the text field. This lemmatizing
# step isn't perfect because I am not determining the
# POS (part-of-speech) tag so by default, the lemmatizer
# assumes each word is a noun and tries to find the lemma
# for that form of the word.
lemmatizer = WordNetLemmatizer()
newsgroups_data['text_cleaned'] = newsgroups_data['text_cleaned'].apply(
    lambda row: ' '.join([lemmatizer.lemmatize(word) for word in row.split()])
)

print(f'=====TEXT BEFORE PROCESSING===== \n"{newsgroups_data["text"][0]}"')
print(f'=====TEXT AFTER PROCESSING===== \n"{newsgroups_data["text_cleaned"][0]}"')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
=====TEXT BEFORE PROCESSING===== 
"From: davidr@rincon.ema.rockwell.com (David J. Ray)
Subject: Re: Fractals? what good are they?
Organization: Rockwell International
X-Newsreader: Tin 1.1 PL5
Lines: 16

In regards to fractal commpression, I have seen 2 fractal compressed "movies".
They were both fairly impressive.  The first one was a 64 gray scale "movie" of
Casablanca, it was 1.3MB and had 11 minutes of 13 fps video.  It was a little
grainy but not bad at all.  The second one I saw was only 3 minutes but it
had 8 bit color with 10fps and measured in at 1.2MB.

I consider the fractal movies a practical thing to explore.  But unlike many 
other formats out there, you do end up losing resolution.  I don't know what
kind of software/hardware was used for cr

## Bag-of-Words
scikit-learn provides the `CountVectorizer` class for building a bag-of-words representation of a collection of texts, and I use it in the following cell. Note that `CountVectorizer` returns a sparse matrix, so I convert it to a dense one in order to view it. In practice, it should be kept sparse for performance reasons, since a dense matrix stores every zero explicitly and takes up far more memory.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
text_bow = vectorizer.fit_transform(newsgroups_data['text_cleaned'])
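# get_feature_names() was removed in newer scikit-learn releases;
# get_feature_names_out() is its replacement.
text_bow_dense = pd.DataFrame(text_bow.todense(), columns = vectorizer.get_feature_names_out())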
text_bow_dense.sample(5)

Unnamed: 0,aa,aaaa,aad,aalborg,aamrl,aangeboden,aantal,aao,aaoepp,aaplay,aarhus,aarnet,aau,ab,abad,abandon,abbreviation,abc,abekas,abel,aberdeen,abild,abildskov,ability,ablaze,able,abo,abort,abp,abpsoft,abrash,abraxis,absence,absent,absolute,absolutely,abstact,abstract,abstractsoft,abuse,...,zbuffering,zbww,zc,zcat,zealand,zeit,zemcik,zen,zenith,zenkar,zeno,zentrum,zephyr,zero,zeus,zhao,zhenghao,ziedman,zillion,zip,zipped,zippy,zirkel,zola,zoo,zool,zoom,zooming,zopfi,zorg,zorn,zrz,zsoft,zt,zug,zurich,zvi,zyeh,zyxel,ªl
583,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
212,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
43,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
309,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
235,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
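
To make the table above less of a black box, here is a minimal from-scratch sketch of the same idea using only the standard library. The function name `bag_of_words` and the two toy documents are hypothetical. It is not a reimplementation of `CountVectorizer`: it splits on whitespace only, whereas `CountVectorizer`'s default token pattern also drops single-character tokens, and it builds dense rows rather than a sparse matrix.

```python
from collections import Counter

def bag_of_words(documents):
    """Return (vocabulary, rows) where rows[i][j] counts vocabulary[j] in documents[i]."""
    # Tokenize on whitespace; the texts are assumed to be cleaned already.
    tokenized = [doc.split() for doc in documents]
    # A sorted vocabulary gives a stable column order.
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Counter returns 0 for words that do not occur in this document.
        rows.append([counts[word] for word in vocabulary])
    return vocabulary, rows

docs = ["fractal movie good", "movie grainy movie"]
vocabulary, rows = bag_of_words(docs)
print(vocabulary)  # ['fractal', 'good', 'grainy', 'movie']
print(rows)        # [[1, 1, 0, 1], [0, 0, 1, 2]]
```

Each row is one document and each column is one vocabulary word, which is the layout of `text_bow_dense`. Most entries are zero even in this toy case, which is why the sparse form is preferred on a real corpus.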


## Sentiment Analyzer
A brute-force way to build a simple sentiment analyzer is to count the occurrences of "positive" words and "negative" words in each text. The sentiment score is the difference between the positive count and the negative count. If the score is positive, the text contains more positive words, so the overall sentiment is positive. If the score is negative, negative words dominate, so the overall sentiment is negative. If the score is 0, the counts are tied or no positive or negative words are present, so the text is labeled neutral.
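
The scoring rule can be sketched on a single piece of text with plain Python. The function name `score_text` and the two toy word sets are hypothetical and far smaller than the lists used in this notebook. The sketch works on raw tokens rather than on the bag-of-words DataFrame.

```python
def score_text(text, positive_words, negative_words):
    """Return (score, label) using the positive-minus-negative word count."""
    tokens = text.lower().split()
    # Each occurrence counts, so a repeated word contributes more than once.
    positive_count = sum(token in positive_words for token in tokens)
    negative_count = sum(token in negative_words for token in tokens)
    score = positive_count - negative_count
    if score > 0:
        label = "POSITIVE"
    elif score < 0:
        label = "NEGATIVE"
    else:
        label = "NEUTRAL"
    return score, label

# Hypothetical toy lexicons for illustration only.
toy_positive = {"good", "best"}
toy_negative = {"bad", "boring"}

print(score_text("good movie but bad boring ending", toy_positive, toy_negative))
# (-1, 'NEGATIVE')
print(score_text("best good movie", toy_positive, toy_negative))
# (2, 'POSITIVE')
print(score_text("movie about fractal", toy_positive, toy_negative))
# (0, 'NEUTRAL')
```

The last case shows that "neutral" covers both a tie and the complete absence of lexicon words, so a large neutral share mostly reflects how short the word lists are.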

In [None]:
import numpy as np

def very_simple_sentiment_analyzer(positive_words, negative_words, data):
    '''Analyze a bag-of-words representation of texts for sentiment.

    This function simply uses counts of positive and negative words
    to determine if a particular text field is positive, negative,
    or neutral.

    Parameters
    ----------
    positive_words : list[str]
        List of words that are considered positive.

    negative_words : list[str]
        List of words that are considered negative.

    data : pandas DataFrame
        A DataFrame containing the bag-of-words representation
        of a corpus.

    Returns
    -------
    numpy.ndarray
        One label per row of `data`: 'POSITIVE', 'NEUTRAL',
        or 'NEGATIVE'.
    '''
    # Sum up all the positive word counts if they exist
    positive_counts = data[data.columns.intersection(positive_words)].sum(axis = 1)

    # Sum up all the negative word counts if they exist
    negative_counts = data[data.columns.intersection(negative_words)].sum(axis = 1)

    # Determine the final label based on the total score.
    total_score = positive_counts - negative_counts
    final_label = np.where(
        total_score > 0, 
        'POSITIVE', 
        np.where(total_score == 0, 'NEUTRAL', 'NEGATIVE') 
    )

    return final_label

positive_words = [
    'awesome',
    'amazing',
    'amaze',
    'best',
    'bountiful',
    'beautiful',
    'beauty',
    'cool',
    'calm',
    'dashing',
    'delicious',
    'decadent',
    'dope',
    'good',
    'impress',
    'impressed',
    'pretty',
    'positive',
    'positively'
]

negative_words = [
    'awful',
    'atrocious',
    'bad',
    'broke',
    'broken',
    'boo',
    'bore',
    'boring',
    'crazy',
    'craze',
    'cantankerous',
    'cranky',
    'doom',
    'darn',
    'death',
    'drat',
    'dislike',
    'ew',
    'fault',
    'faulty',
    'hate',
    'icky',
    'mean',
    'negative',
    'rot',
    'rotten',
    'suck',
    'stink'
]

text_bow_dense['sentiment'] = very_simple_sentiment_analyzer(
    positive_words,
    negative_words,
    text_bow_dense
)
text_bow_dense['sentiment'].value_counts()

NEUTRAL     388
POSITIVE    144
NEGATIVE     52
Name: sentiment, dtype: int64

### Results Analysis
Let's take a look at one "positive" text, one "negative" text, and one "neutral" text to see how the analyzer did. Obviously, I don't expect much since the lists of positive and negative words are hardly exhaustive, but it's a start. :)

In [None]:
newsgroups_data_analyzed = newsgroups_data.join(text_bow_dense[['sentiment']])
positive_sample = newsgroups_data_analyzed[newsgroups_data_analyzed['sentiment'] == 'POSITIVE'].sample(1)
negative_sample = newsgroups_data_analyzed[newsgroups_data_analyzed['sentiment'] == 'NEGATIVE'].sample(1)
neutral_sample = newsgroups_data_analyzed[newsgroups_data_analyzed['sentiment'] == 'NEUTRAL'].sample(1)

print(f'=====POSITIVE SAMPLE===== \n"{positive_sample["text"].values[0]}"\n\n====================')
print(f'=====NEGATIVE SAMPLE===== \n"{negative_sample["text"].values[0]}"\n\n===================')
print(f'=====NEUTRAL SAMPLE===== \n"{neutral_sample["text"].values[0]}"\n\n====================')

=====POSITIVE SAMPLE===== 
"From: markl@hunan.rastek.com (Mark Larsen)
Subject: Re: Ray tracer for ms-dos?
Organization: Rastek Corporation, Huntsville, AL
Lines: 32

In article <1r1cqiINNje8@srvr1.engin.umich.edu> tdawson@llullaillaco.engin.umich.edu (Chris Herringshaw) writes:
>
>Sorry for the repeat of this request, but does anyone know of a good
>free/shareware program with which I can create ray-traces and save
>them as bit-mapped files?  (Of course if there is such a thing =)
>
>Thanks in advance
>
>Daemon

There are 2 books published by M&T BOOKS that come with C source code on
floppies.  They are:

Programming In 3 Dimensions, 3-D Graphics, Ray Traycing, and Animation
by: Christopher D. Watkins and Larry Sharp.

Photorealism and Ray Tracing in C
by: Christopher D. Watkins, Stephen B. Coy, and Mark Finlay.

I have the first book and it is a great intro to 3-D, Ray Tracing and
Animation.  Most of the programs are on the disk compiled and ready to run.

I have only glanced at the 