# Project 1 - Text Analysis, Classification, and Prediction

#### **Developer:** Mark Trombly
#### **Course:** Artificial Intelligence Applications
#### **Program Requirements**
~~~
1. Import necessary packages
2. Review data
3. Prepare data for analysis
4. Filter data
5. Display product review sentiment analysis
6. Create prediction analysis
~~~

### Part 1: Import necessary packages/modules

In [None]:
# Note: *After* installing NLTK (if not installed already), you *must* download the NLTK datasets (corpora).
# Some important datasets: stopwords, gutenberg, framenet_v15, and others.
import sys
import os
import pandas as pd
print(sys.version) # print python version
print(os.environ.get('CONDA_DEFAULT_ENV', 'no conda environment detected')) # print conda environment (avoids KeyError outside conda)

import nltk # Natural Language Toolkit - language processing
# ***AFTER initial downloading comment out!***
# nltk.download('punkt_tab') # Sentence Tokenizer: splits text into list of sentences (must be trained before it can be used) ***AFTER initial downloading comment out!***
# nltk.download('averaged_perceptron_tagger_eng') # contains pre-trained (Wall Street Journal) English part-of-speech (POS) tagger ***AFTER initial downloading comment out!***

# word tokenizer
from nltk.tokenize import word_tokenize

# nltk.download('stopwords') # only if needed, then comment out
# use to identify stop words - common words carrying little information (see below)
from nltk.corpus import stopwords

# use for tagging words with their parts of speech (POS) (e.g., nouns, verbs, etc.)
# nltk.download('averaged_perceptron_tagger') # after downloading comment out ***AFTER initial downloading comment out!***

# use for sentiment analysis - analyse positive/negative emotion of text (see below)
from nltk.sentiment import SentimentIntensityAnalyzer
# nltk.download('vader_lexicon') # ***AFTER initial downloading comment out!***

# required to split data into train and test sets, where feature variables are given as input in method
from sklearn.model_selection import train_test_split

# CountVectorizer() converts collection of text documents into matrix.
from sklearn.feature_extraction.text import CountVectorizer

# classifies document based on counts it finds of multiple keywords
from sklearn.naive_bayes import MultinomialNB # import naive bayes

# used for confusion matrix in classification problems to assess errors in model
from sklearn import metrics

# determines accuracy classification score
from sklearn.metrics import accuracy_score

# library for creating static, animated, and interactive visualizations in Python
import matplotlib.pyplot as plt

### Part 2: Load and review data

In [None]:
# Load reviews into DataFrame
# Note: pipe (|) used instead of commas, as commas occur in reviews, and # indicates indexed column
df = pd.read_csv('GuitarReviews2out.txt', sep='|', index_col='#')

rows = df.shape[0]   # num rows
cols = df.shape[1]   # num cols

# display number of rows/cols
print(rows)
print(cols)

In [None]:
df.head() # display first 5 reviews

In [None]:
df.iloc[0].review # display first review (from review column)

### Part 3: Prepare, tokenize, and visualize data

In [None]:
# combine all reviews from DataFrame into list for data manipulation/analysis
allTextList = df.review.to_list()

# used only for comparison
print(allTextList) # display list

In [None]:
allText = ' '.join(allTextList) # join elements of list with space
print(allText) # Note: elements no longer separated by commas, or include single quotation marks (')

In [None]:
# Tokenizers divide strings into Lists of substrings
# resource: https://www.nltk.org/api/nltk.tokenize.html
# example: find words and punctuation in a string
# parse: tokenize text
tokens = nltk.word_tokenize(allText)

# print(tokens) # display all tokens
tokens[:10] # print only first 10 tokens

In [None]:
# determine word frequency
# Note: FreqDist() captures number of times each outcome of experiment has occurred
# https://www.nltk.org/api/nltk.probability.FreqDist.html
wordFrequency = nltk.FreqDist(tokens)
wordFrequency.plot(30)
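
The counting step above can be sketched without NLTK, since `FreqDist` behaves much like Python's `collections.Counter`. The tokens below are hypothetical sample data, not taken from the review file.

```python
from collections import Counter

# Hypothetical sample tokens (not from the review data)
sample_tokens = ["great", "guitar", ",", "great", "tone", ",", "great", "price"]

# Count how many times each token occurs
frequency = Counter(sample_tokens)

# Two most frequent tokens, highest count first
top_two = frequency.most_common(2)
print(top_two)  # [('great', 3), (',', 2)]
```

Note that punctuation is counted like any other token, which is why Part 4 filters tokens before counting again.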

### Part 4: Filter data

In [None]:
# keep only tokens made up entirely of letters, using list comprehension
# Note: if necessary, review list comprehensions:
# https://www.w3schools.com/python/python_lists_comprehension.asp
alpha_words = [token for token in tokens if token.isalpha()]

alpha_words[:10] # print first 10 tokens w/letters

In [None]:
# cast alpha words into lower case, using list comprehension
lower_case_words = [word.lower() for word in alpha_words]

lower_case_words[:10] # print first 10 tokens w/letters in lower case

In [None]:
# find stop words using NLTK stopwords package
# stop words: common words carrying little information
# explained:
# https://www.opinosis-analytics.com/knowledge-base/stop-words-explained/
# https://medium.com/@yashj302/stopwords-nip-python-4aa57dc492af
# examples: "the," "is," "for," "where," "when," "to," "at,"...
# NLTK's list of English stopwords: https://gist.github.com/sebleier/554280

# get NLTK English stopwords
stopWords = stopwords.words('english')

type(stopWords) # display data type of stopWords (list)

In [None]:
len(stopWords) # print number of stop words

In [None]:
stopWords[:10] # display first 10 NLTK English stop words

In [None]:
# Remove stop words from lower case words
lower_case_no_stop_words = [word for word in lower_case_words if word not in stopWords]

# display first 10 tokens w/letters in lower case, *after* removing NLTK English stop words
lower_case_no_stop_words[:10]
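
As a minimal sketch of the three filtering steps so far (letters only, lower case, stop words removed), the example below runs them on a handful of hypothetical tokens. `sample_stop_words` is a tiny stand-in for NLTK's much longer English list, and it is a set, an assumption made here because membership tests on a set are typically faster than on a list.

```python
# Hypothetical tokens and a tiny stand-in stop word set (not NLTK's full list)
sample_tokens = ["The", "guitar", "is", "GREAT", ",", "5", "stars", "!"]
sample_stop_words = {"the", "is", "a", "and"}

# Step 1: keep tokens made up entirely of letters
alpha = [token for token in sample_tokens if token.isalpha()]

# Step 2: convert to lower case
lower = [token.lower() for token in alpha]

# Step 3: drop stop words
filtered = [token for token in lower if token not in sample_stop_words]
print(filtered)  # ['guitar', 'great', 'stars']
```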

In [None]:
# determine lower-case words w/no stop words frequency
wordFrequency = nltk.FreqDist(lower_case_no_stop_words)
wordFrequency # display frequency

In [None]:
# visualize/plot word frequency
wordFrequency.plot(30)

In [None]:
# stemming: remove morphological affixes from words, leaving only word stem (algorithm for suffix stripping)
# examples: playing = play, likes/likely/liked = like
# use Porter Stemmer (strip word suffixes)
porterStemmer = nltk.PorterStemmer()
stemmed_words = [porterStemmer.stem(word) for word in lower_case_no_stop_words]

# type(stemmed_words)
stemmed_words[:10] # print first 10 stemmed words

In [None]:
# Add part of speech to each token
# reference: https://www.nltk.org/book/ch05.html
wordsWithTags = nltk.pos_tag(tokens)
wordsWithTags[:10] # display first 10 tokens and their part of speech

In [None]:
# include only nouns (tags beginning with N)
nouns = [word for (word, tag) in wordsWithTags if tag.startswith('N')]
nouns[:8]

In [None]:
# determine noun frequency
wordFrequency = nltk.FreqDist(nouns)
wordFrequency

In [None]:
# visualize/plot noun frequency
wordFrequency.plot(30)

### Part 5: Sentiment analysis

In [None]:
# Sentiment Analysis:
# used to analyse positive/negative emotion of text (determine polarity of text: positive, negative, or neutral)
# https://www.nltk.org/howto/sentiment.html
# https://medium.com/@rslavanyageetha/vader-a-comprehensive-guide-to-sentiment-analysis-in-python-c4f1868b0d2e

# initialize SentimentIntensityAnalyzer object
analyzer = SentimentIntensityAnalyzer()

In [None]:
# analyze first review (from review column)
review1 = df.iloc[0].review
review1

In [None]:
# polarity_scores() method: returns dictionary of sentiment scores
# dictionary contains four key/value pairs: neg, neu, pos, and compound
# i.e., how negative (0-1), how neutral (0-1), how positive (0-1), as well as a compound score between -1 and 1
# compound: composite score of overall positive or negative sentiment (e.g., 0.9646 is very positive!)

# calculate polarity scores for first review
analyzer.polarity_scores(review1)

In [None]:
# loop through each review using polarity_scores() function
# display index, compound (composite) score, formatted to two decimal places, and review title
# Note: sixth review most positive, and ninth review only negative review
compoundList = []
for index, row in df.iterrows():
    text = row.review
    scores = analyzer.polarity_scores(text)
    compound = scores['compound']
    print(format(index, '2d'), format(compound, '6.2f'), row.title)
    compoundList.append(compound)

In [None]:
# more concisely, use DataFrame apply() method
# define function that calculates and returns compound VADER score
def compoundScore(text):
    scores = analyzer.polarity_scores(text)
    return scores['compound']

# apply analyzer on all reviews in DataFrame and display
# Note: apply() method passes function as an argument, and applies it on every single value of Pandas series

# apply compoundScore() function to "review" column in DataFrame, and create new DF column "compound"
df['compound'] = df['review'].apply(compoundScore)
df # display entire DataFrame

In [None]:
# or just display, index, title, and compound score
print(df[['title', 'compound']])
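
A compound score is often easier to read as a label. The sketch below maps a score to "positive", "negative", or "neutral". The function name `label_compound` is hypothetical, and the ±0.05 cutoffs are an assumption based on a commonly used convention for VADER, not something defined in this notebook.

```python
def label_compound(compound, threshold=0.05):
    """Map a compound score in [-1, 1] to a sentiment label.

    The default threshold is an assumed convention; adjust it to suit the data.
    """
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

print(label_compound(0.9646))  # positive
print(label_compound(-0.30))   # negative
print(label_compound(0.0))     # neutral
```

With the DataFrame above, something like `df['compound'].apply(label_compound)` would add a label per review.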

### Part 6: Text Classification

In [None]:
# Load data file of emails into DataFrame. Note: one line per email, pipe delimited
df = pd.read_csv('emails2.txt', usecols=['isSpam', 'Message'], sep='|')

rows = df.shape[0] # num rows
cols = df.shape[1] # num cols

# display number of rows/cols
print(rows)
print(cols)

In [None]:
df.head() # display first 5 e-mails

In [None]:
# review spam vs. nonspam emails (1=spam, 0=nonspam)
df.isSpam.value_counts()

In [None]:
# review (part of) first email (nonspam)
df.iloc[0].Message[:160] # display first 160 chars. of Message col.

In [None]:
# create function to remove nonLetters
def remove_non_letters(text):
    alist = [c if c.isalpha() else ' ' for c in text]
    return ''.join(alist)

# iterate over Message col. using apply() method, and create new col.
df['NonLettersRemoved'] = df['Message'].apply(remove_non_letters)

In [None]:
# display first 160 chars. of Message col.
df.iloc[0].NonLettersRemoved[:160]
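
To see what this cleaning step does, the sketch below applies the same idea to a hypothetical message. Each non-letter is replaced with a space rather than deleted, so word boundaries survive (for example, "Click-here" becomes two words).

```python
def remove_non_letters(text):
    # Replace every non-letter character with a space, preserving word boundaries
    return ''.join(c if c.isalpha() else ' ' for c in text)

sample = "Win $1000 now!!! Click-here"  # hypothetical message
cleaned = remove_non_letters(sample)
words = cleaned.split()
print(words)  # ['Win', 'now', 'Click', 'here']
```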

In [None]:
# tokenize e-mails
# create lambda function to tokenize filtered e-mail messages
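tokenizer = lambda text: word_tokenize(text)
df['NonLettersRemoved'] = df['NonLettersRemoved'].apply(tokenizer) # apply tokenizer so each e-mail becomes a list of tokens
df['NonLettersRemoved'][:10] # display tokens for first 10 e-mails w/nonLetters removed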

In [None]:
# stemming: remove morphological affixes from words, leaving only word stem (algorithm for suffix stripping)
# examples: playing = play, likes/likely/liked = like
# use Porter Stemmer (strip word suffixes)
stemmer = lambda words: [porterStemmer.stem(word) for word in words]
df['NonLettersRemoved'] = df['NonLettersRemoved'].apply(stemmer)

In [None]:
# display stemmed tokens for first 10 e-mails w/nonLetters removed
df['NonLettersRemoved'][:10]

In [None]:
# create Lambda function to rejoin tokenized e-mail messages
rejoiner = lambda words: ' '.join(words)
df['NonLettersRemoved'] = df['NonLettersRemoved'].apply(rejoiner)

In [None]:
# compare initial and transformed text for first 5 (nonspam) messages
df.head()

In [None]:
# compare initial and transformed text for the last 5 (spam) messages
df.tail()

### Part 7: Predictive analysis

### Definitions:

  1. **Dependent variables (also called):** response, outcome/output, or target variables (respond to changes in (an)other variable(s))
  2. **Independent variables (also called):** predictor, input, regressor, or explanatory variable(s) (predict/explain changed values of dependent variable(s))


*Dependent* variables **(output on y-axis)** are *always* the ones *being studied*--that is, whose variation(s) is/are being modified somehow!

*Independent* variables **(input on x-axis)** are *always* the ones being manipulated, to study and compare the effects on the dependent variable(s).

**Note:** The designations *independent* and *dependent* variables are used so as not to imply "cause and effect" (as the terms "predictor" and "explanatory" do).

In [None]:
# dependent variable: isSpam (studied var.)
# independent variable: NonLettersRemoved (manipulated var.)

# split data into 25% "test" data and 75% "train" data
# Note: "generally," 25/75 is how data are split into test/train data sets

# returns four results (all Pandas "Series" data type):
# train_text and test_text: contain e-mail text
# train_labels and test_labels: contain binary values from isSpam column
train_text, test_text, train_labels, test_labels = train_test_split(df.NonLettersRemoved, df.isSpam, test_size=0.25, random_state=1)
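
The idea behind the split can be sketched with the standard library alone. `simple_split` below is a hypothetical helper, not scikit-learn's implementation: it shuffles row positions with a fixed seed, reserves a fraction for testing (rounding up), and keeps each text paired with its label. It will not reproduce the exact rows that `train_test_split` selects.

```python
import math
import random

def simple_split(items, labels, test_size=0.25, seed=1):
    """Shuffle positions with a fixed seed, then split items and labels together."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(len(items) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    train_x = [items[i] for i in train_idx]
    test_x = [items[i] for i in test_idx]
    train_y = [labels[i] for i in train_idx]
    test_y = [labels[i] for i in test_idx]
    return train_x, test_x, train_y, test_y

# Hypothetical data: 8 messages with alternating 0/1 labels
messages = [f"message {i}" for i in range(8)]
flags = [i % 2 for i in range(8)]

train_x, test_x, train_y, test_y = simple_split(messages, flags)
print(len(train_x), len(test_x))  # 6 2
```

Fixing the seed plays the same role as `random_state=1` above: the split is repeatable from run to run.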

In [None]:
train_labels

In [None]:
# CountVectorizer(): Converts collection of text documents into matrix of token counts.
# rows represent documents, and cols represent tokens (i.e, words or n-grams).
# counts occurrences of each token in each document.
# Note: "n-gram" is collection of n successive items in a text document--may include words, numbers, symbols, and punctuation.
# Resource: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

# use CountVectorizer() to determine word freq. for each e-mail
# build "bag of words" (bow) features vectorizer and get features

# min_df=1: tracks words occurring at least once
# ngram_range=(1,1): finds single words, rather than word combinations
bow_vectorizer = CountVectorizer(min_df=1, ngram_range=(1,1))

#### **fit() vs. transform() vs. fit_transform() methods:**

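- **fit():** learns parameters from the data. For `CountVectorizer`, this is the vocabulary (the tokens and their column positions); for a scaler, it would be the mean and variance of each feature.
- **transform():** applies the learned parameters to data. For `CountVectorizer`, it counts occurrences of each vocabulary token in each document.
- **fit_transform():** used on training data to learn the parameters and transform the training data in one step.

The vectorizer learns its vocabulary from the training set only; the same vocabulary is then used to transform the test data, so tokens that appear only in the test data are ignored.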

**Resource:** https://towardsdatascience.com/what-and-why-behind-fit-transform-vs-transform-in-scikit-learn-78f915cf96fe

In [None]:
# fit_transform(): used on training data
# learns vocabulary from training e-mails, then counts occurrences of each word in each e-mail
bow_train_features = bow_vectorizer.fit_transform(train_text)

In [None]:
# transform(): used on test data
# transform(): uses same vocabulary learned from training data to count tokens in test data.
# Bottom line: parameters learned from training data are reused to transform test data.
bow_test_features = bow_vectorizer.transform(test_text)
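
The difference between fitting and transforming can be illustrated with a small bag-of-words sketch in plain Python. The helpers `fit_vocabulary` and `transform` are hypothetical stand-ins for what a count vectorizer does, simplified to whitespace splitting, and the documents are made-up examples.

```python
from collections import Counter

def fit_vocabulary(documents):
    """Learn a sorted vocabulary and map each word to a column index."""
    words = sorted({word for doc in documents for word in doc.split()})
    return {word: column for column, word in enumerate(words)}

def transform(documents, vocabulary):
    """Count vocabulary words per document; words outside the vocabulary are ignored."""
    rows = []
    for doc in documents:
        counts = Counter(doc.split())
        rows.append([counts.get(word, 0) for word in vocabulary])
    return rows

train_docs = ["free prize free", "meeting at noon"]  # hypothetical training e-mails
test_docs = ["free meeting tomorrow"]                # "tomorrow" never seen in training

vocabulary = fit_vocabulary(train_docs)              # "fit": learn from training data only
train_matrix = transform(train_docs, vocabulary)
test_matrix = transform(test_docs, vocabulary)       # reuse the training vocabulary

print(list(vocabulary))  # ['at', 'free', 'meeting', 'noon', 'prize']
print(train_matrix)      # [[0, 2, 0, 0, 1], [1, 0, 1, 1, 0]]
print(test_matrix)       # [[0, 1, 1, 0, 0]]
```

The test row has the same number of columns as the training rows, which is what allows a model trained on one to score the other.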

In [None]:
# multinomial Naive Bayes classifier: suitable for classification with discrete features (e.g., word counts for text classification).
# probabilistic classifier calculates probability distribution of text data
# Multinomial Distribution: used to find probabilities in experiments, where there are more than two outcomes.
# Resource: https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html

model = MultinomialNB()

In [None]:
# fit(): trains machine learning model on dataset.
# fit() method: takes in dataset (typically, 2D array or matrix), and a set of labels, then fits model to data.
# MultinomialNB fit() method: expects x and y input.
# x: training vectors (i.e., training data)
# y: target values (i.e., labels, targets or classes)

# Note: train model using training data, then predict using new data (i.e., test data, below).
# fit() method: determines probabilities of individual words occurring in nonspam vs spam e-mails.
model.fit(bow_train_features, train_labels)

In [None]:
# predict nonspam vs spam e-mails using model and test data
predictions = model.predict(bow_test_features)
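
For intuition about what fitting and predicting involve, here is a minimal multinomial Naive Bayes sketch in plain Python. It is an illustration under stated assumptions, not scikit-learn's implementation: `train_nb` and `predict_nb` are hypothetical names, smoothing is add-one (Laplace), and the four training messages are made up.

```python
import math
from collections import Counter, defaultdict

def train_nb(documents, labels, alpha=1.0):
    """Estimate log priors and smoothed log word likelihoods for each class."""
    vocabulary = {word for doc in documents for word in doc.split()}
    word_counts = defaultdict(Counter)
    class_counts = Counter(labels)
    for doc, label in zip(documents, labels):
        word_counts[label].update(doc.split())
    model = {}
    for label, n_docs in class_counts.items():
        total = sum(word_counts[label].values()) + alpha * len(vocabulary)
        model[label] = {
            "log_prior": math.log(n_docs / len(documents)),
            "log_likelihood": {
                word: math.log((word_counts[label][word] + alpha) / total)
                for word in vocabulary
            },
        }
    return model

def predict_nb(model, document):
    """Return the class with the highest log posterior; unseen words are skipped."""
    scores = {}
    for label, params in model.items():
        score = params["log_prior"]
        for word in document.split():
            if word in params["log_likelihood"]:
                score += params["log_likelihood"][word]
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical training data: 1 = spam, 0 = nonspam
train_docs = ["free prize now", "win free cash", "meeting at noon", "lunch at noon"]
train_flags = [1, 1, 0, 0]

nb_model = train_nb(train_docs, train_flags)
print(predict_nb(nb_model, "free cash prize"))   # 1
print(predict_nb(nb_model, "meeting at lunch"))  # 0
```

Working in log space avoids multiplying many small probabilities together, and smoothing keeps a word that never appeared in one class from forcing that class's probability to zero.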

In [None]:
# number of emails in test data
len(test_labels)

In [None]:
# number of emails in training data
len(train_labels)

In [None]:
# Evaluating model's predictions:
# Compare actual spam/nonspam e-mails with model's prediction of spam/nonspam e-mails
test_results = pd.DataFrame({'actual':test_labels.tolist(), 'predict':list(predictions)})
test_results

In [None]:
# display *all* rows where model was incorrect (index value indicates row position)
# Note: only four rows!
test_results[test_results.actual != test_results.predict]

In [None]:
# calculate accuracy score for set of predicted labels against true labels
# resource: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html
accuracy_score(test_results.actual, test_results.predict)

# display as percentage (note: 94% accuracy!)
print('Accuracy {:.1%}'.format(accuracy_score(test_results.actual, test_results.predict)))

In [None]:
# also, can check accuracy using confusion matrix
# creates table to assess where errors occurred in model
# rows represent actual classes that outcomes should have been
# columns represent predictions made
# using the table is an easy way to see which predictions are wrong!

# "metrics" (imported in Part 1) provides confusion matrix function for "actual" and "predicted" values
# Generic syntax: confusion_matrix = metrics.confusion_matrix(actual, predicted)

confusion_matrix = metrics.confusion_matrix(test_results.actual, test_results.predict)

confusion_matrix

In [None]:
cm_display = metrics.ConfusionMatrixDisplay(confusion_matrix = confusion_matrix, display_labels = [0, 1])

cm_display.plot()
plt.show()

# Confusion Matrix creates four quadrants:
# True Negatives (TN) (Top-Left Quadrant): Prediction no, true value no
# False Positives (FP) (Top-Right Quadrant): Prediction yes, true value no
# False Negatives (FN) (Bottom-Left Quadrant): Prediction no, true value yes
# True Positives (TP) (Bottom-Right Quadrant): Prediction yes, true value yes

In [None]:
# interpretation:
# row1: Model correctly categorized **nonspam** e-mails in 33 of 36 cases (91.7%), "specificity"
# row2: Model correctly categorized **spam** e-mails in 33 out of 34 cases (97.1%), "sensitivity"

**Confusion Matrix:** Table compares predicted and actual values.

||**Predicted: No (not spam)** | **Predicted: Yes (spam)** | **Total:** |
|:---|:-----------------------:|:-------------------------:|:-----------|
|**Actual: No  (not spam)**| TN=33 | FP=3 | 36 |
|**Actual: Yes (spam)**| FN=1 | TP=33 | 34 |
|**Total:** | 34 | 36 | 70 |

## Related Metrics:

|**Metric** | **Formula** | **Definition** |
|:---|:-----------------------:|:-------------------------:|
|**Accuracy**| (TP+TN)/(TP+TN+FP+FN) | Percentage of total items classified correctly |
|**Precision**| TP/(TP+FP) | Positive predictions accuracy |
|**Recall/Sensitivity** | TP/(TP+FN) | True positive rate (share of actual positives found; lowered by false negatives) |
|**Specificity** | TN/(TN+FP) | True negative rate (share of actual negatives found; lowered by false positives) |
|**F1 score** | 2TP/(2TP+FP+FN) | Harmonic mean of precision and recall/sensitivity |
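
As a worked check, the formulas in the table can be applied directly to the four counts from the confusion matrix above (TN=33, FP=3, FN=1, TP=33). The variable names are chosen here for illustration.

```python
# Counts taken from the confusion matrix table
tn, fp, fn, tp = 33, 3, 1, 33

accuracy = (tp + tn) / (tp + tn + fp + fn)   # 66 / 70
precision = tp / (tp + fp)                   # 33 / 36
recall = tp / (tp + fn)                      # 33 / 34
specificity = tn / (tn + fp)                 # 33 / 36
f1 = 2 * tp / (2 * tp + fp + fn)             # 66 / 70

for name, value in [("Accuracy", accuracy), ("Precision", precision),
                    ("Recall", recall), ("Specificity", specificity),
                    ("F1 score", f1)]:
    print(f"{name:<12}{value:.3f}")
# Accuracy    0.943
# Precision   0.917
# Recall      0.971
# Specificity 0.917
# F1 score    0.943
```

Accuracy and F1 coincide here only because TP equals TN; in general they differ.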

**Resources:**
https://machine-learning.paperspace.com/wiki/confusion-matrix
https://classeval.wordpress.com/introduction/basic-evaluation-measures/
https://poojapawani.medium.com/what-is-confusion-matrix-accuracy-sensitivity-specificity-precision-recall-1091b4723714
https://www.w3schools.com/python/python_ml_confusion_matrix.asp

## Examples:

In [None]:
# Accuracy measures how often model is correct
# Calculation: (True Positive + True Negative) / Total Predictions
# Example: Accuracy = metrics.accuracy_score(actual, predicted)

Accuracy = metrics.accuracy_score(test_results.actual, test_results.predict)

In [None]:
# Precision: Of positives predicted, what percentage is *truly* positive?
# Note: Precision does not evaluate correctly predicted negative cases:

# Calculation: True Positive / (True Positive + False Positive)
# Example: Precision = metrics.precision_score(actual, predicted)

Precision = metrics.precision_score(test_results.actual, test_results.predict)

In [None]:
# Sensitivity (aka Recall):
# Of all positive cases, what percentage are *predicted* positive?
# Measures how well model predicts something is positive.

# Translation: Looks at true positives and false negatives (which are positives that have been incorrectly predicted as negative).

# Calculation: True Positive / (True Positive + False Negative)
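# Example: Sensitivity_recall = metrics.recall_score(actual, predicted)

Sensitivity_recall = metrics.recall_score(test_results.actual, test_results.predict)

# Specificity: Of all negative cases, what percentage are *predicted* negative?
# Calculation: True Negative / (True Negative + False Positive)
# Equivalent to recall of negative class, hence pos_label=0
Specificity = metrics.recall_score(test_results.actual, test_results.predict, pos_label=0)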

In [None]:
# F-score: "Harmonic mean" of precision and sensitivity.
# Considers both false positive and false negative cases--good for imbalanced datasets.
# Note: Score does not take into consideration True Negative values.

# Calculation: 2 * ((Precision * Sensitivity) / (Precision + Sensitivity))

# Example: F1_score = metrics.f1_score(actual, predicted)

F1_score = metrics.f1_score(test_results.actual, test_results.predict)

In [None]:
# all calculations: print dictionary (Python dictionaries use curly braces {}), that is, key:value pairs
print({"Accuracy":Accuracy, "Precision":Precision,"Sensitivity_recall":Sensitivity_recall,"Specificity":Specificity,"F1_score":F1_score})

In [None]:
# Or, to format nicely! :)
# my_dictionary = key:value pairs
my_dictionary = {"Accuracy":Accuracy,"Precision":Precision,"Sensitivity_recall":Sensitivity_recall,"Specificity":Specificity,"F1_score":F1_score}

# Note: "0" and "1" indicate field order--that is, key=0 and value=1
# Note: '<' Forces field to be left-aligned, within available space (default for most objects)
# Resource: https://docs.python.org/2/library/string.html#string-formatting

print("\n".join("{0: <16}\t{1:.2f}".format(k, v) for k, v in my_dictionary.items()))