# Artificial Intelligence Python Program by Mason Howes

Uses **Machine Learning**, **Natural Language Processing**, **Classification**, **Data Visualization**, and the **Automation of an Intelligent Behavior**.

This program uses a dataset of Amazon.com product reviews, specifically [this dataset](https://cseweb.ucsd.edu/~jmcauley/datasets.html#amazon_reviews). According to the dataset description,

> This is a large-scale Amazon Reviews dataset collected in 2023. This dataset contains 48.19 million items, and 571.54 million reviews from 54.51 million users

The data spans May 1996 to September 2023. The dataset attribution is at the bottom of this notebook.

**IMPORTANT**: To run this program, please download [the review dataset](https://cseweb.ucsd.edu/~jmcauley/datasets.html#amazon_reviews) as well as this Jupyter Notebook. Place the .csv file and the .ipynb file in the same folder, and make sure the .csv file is named "food_products.csv".

This program computes the most positive and negative words and reviews, and visualizes the data as it is processed in cells marked with **Visualization**.

TL;DR - View the visualizations produced with Seed 416 (uncomment in the code to achieve these results)


---
Loads modules

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import math
import string

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction import DictVectorizer

sns.set()
%matplotlib inline

Reads the dataset into the program

In [None]:
products = pd.read_csv('food_products.csv')

# Uncomment the seed if you want to achieve similar visualizations to those provided in the pictures above
# np.random.seed(416)

Removes neutral reviews (rating of 3), since they carry little information about positive or negative sentiment

In [None]:
products = products[products['rating'] != 3].copy()

**Visualization**: Distribution of the number of reviews per rating (scale of 1-5)

In [None]:
plt.title('Number of Reviews Per Rating')
sns.histplot(products['rating'])

Declares ratings of 4-5 to be considered Positive, and 1-2 to be considered Negative. In the "Sentiment" column, +1 is used to represent Positive, and -1 for Negative.

In [None]:
products['sentiment'] = products['rating'].apply(lambda rating : +1 if rating > 3 else -1)

Building a "Word Count" vector for model analysis.

Removing punctuation and then obtaining word counts for each review.

In [None]:
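# Helper function to mass remove punctuation from reviews
def remove_punctuation(text):
    # Missing reviews are read by pandas as NaN (a float), so they become empty strings
    if isinstance(text, str):
        return text.translate(str.maketrans('', '', string.punctuation))
    else:
        return ''

# Builds the 'review_clean' column used below, unless the .csv already provides it.
# NOTE: the raw text column name 'review' is an assumption; adjust it to match your .csv
if 'review_clean' not in products.columns:
    products['review_clean'] = products['review'].apply(remove_punctuation)

# Makes the counts for each review
vectorizer = CountVectorizer()
count_matrix = vectorizer.fit_transform(products['review_clean'])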

# Maps unique words as features
features = vectorizer.get_feature_names_out()

# DataFrame creation with count information
product_data = pd.DataFrame(count_matrix.toarray(),
        index=products.index,
        columns=features)

# Adds old columns to the new DataFrame
product_data['sentiment'] = products['sentiment']
product_data['review_clean'] = products['review_clean']
product_data['summary'] = products['summary']
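
To make the cleaning and counting steps concrete, here is a minimal sketch on hypothetical toy reviews (not taken from the dataset). It swaps `CountVectorizer` for the standard library's `collections.Counter`, so it only approximates what the cell above does: by default `CountVectorizer` also lowercases text, but additionally drops one-character tokens.

```python
import string
from collections import Counter

def remove_punctuation(text):
    # Non-string values (e.g. a missing review) become empty strings
    if isinstance(text, str):
        return text.translate(str.maketrans('', '', string.punctuation))
    return ''

# Hypothetical toy reviews, for illustration only
toy_reviews = ["Great taste, great price!", "Terrible... would not buy again.", None]

# Step 1: strip punctuation from each review
cleaned = [remove_punctuation(review) for review in toy_reviews]

# Step 2: count how often each lowercase word appears in each review
word_counts = [Counter(review.lower().split()) for review in cleaned]

print(cleaned)
print(word_counts[0])
```

Each review becomes one row of counts, and each distinct word becomes one column (feature) in the resulting table.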

Splits data into Training, Validation and Test sets for model training.

(80% Training, 10% Validation, 10% Test)

In [None]:
train_data, test_and_validation_data = train_test_split(product_data, test_size=0.2, random_state=3)
validation_data, test_data = train_test_split(test_and_validation_data, test_size=0.5, random_state=3)
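
The two-step split can be sanity-checked on a toy table. The sketch below assumes a hypothetical 1,000-row DataFrame named `toy` and reuses the same `test_size` and `random_state` values as the cell above. Holding out 20% and then halving that remainder should yield an 80/10/10 split with no shared rows.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for product_data
toy = pd.DataFrame({"x": np.arange(1000)})

# First split: 80% training, 20% held out
train, rest = train_test_split(toy, test_size=0.2, random_state=3)

# Second split: the held-out 20% is halved into validation and test
validation, test = train_test_split(rest, test_size=0.5, random_state=3)

print(len(train), len(validation), len(test))

# The three sets should not share any rows
shared = (set(train.index) & set(validation.index)) | (set(validation.index) & set(test.index))
print(len(shared))
```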

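Builds a majority class classifier as a baseline: it predicts the most frequent training label for every datapoint, giving a reference accuracy that the sentiment model should beat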

In [None]:
# Computes most frequent label
vals, counts = np.unique(train_data["sentiment"], return_counts=True)
index = np.argmax(counts)
majority_label = vals[index]

# Finds validation accuracy for majority class classifier
correct = 0
for y in validation_data["sentiment"]:
    if y == majority_label:
        correct += 1

majority_classifier_validation_accuracy = correct / len(validation_data)
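
The same baseline can be written without an explicit loop. This is a sketch on hypothetical toy labels, using the pandas methods `Series.mode` and a boolean mean. It is an alternative formulation, not the code used above.

```python
import pandas as pd

# Hypothetical toy labels (+1 = positive, -1 = negative)
train_labels = pd.Series([1, 1, 1, -1, 1, -1])
validation_labels = pd.Series([1, -1, 1, 1])

# Most frequent label in the training set
majority_label = train_labels.mode()[0]

# Fraction of validation labels that match the majority label
baseline_accuracy = (validation_labels == majority_label).mean()

print(majority_label, baseline_accuracy)
```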

Trains a sentiment classifier with logistic regression

In [None]:
sentiment_model = LogisticRegression(penalty='l2', C=1e23, random_state=1)
sentiment_model.fit(train_data[features], train_data['sentiment'])

Finds the most positive and negative word in the sentiment model

In [None]:
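# One coefficient per word, in the same order as `features`
coefficients = sentiment_model.coef_[0]

most_negative_word = features[np.argmin(coefficients)]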
most_positive_word = features[np.argmax(coefficients)]
print('Most Negative Word:', most_negative_word)
print('Most Positive Word:', most_positive_word)

Finds the most positive and negative review in the sentiment model

In [None]:
predictions = sentiment_model.predict_proba(validation_data[features])

most_positive = validation_data.iloc[np.argmax(predictions[:,1])]
most_negative = validation_data.iloc[np.argmax(predictions[:,0])]

most_positive_review = most_positive["review_clean"]
most_negative_review = most_negative["review_clean"]

print('Most Positive Review:')
print(most_positive_review)
print()
print('Most Negative Review:')
print(most_negative_review)

Computes the validation accuracy of the sentiment model

In [None]:
sent_true = validation_data["sentiment"]
sent_pred = sentiment_model.predict(validation_data[features])

sentiment_model_validation_accuracy = accuracy_score(sent_true, sent_pred)

**Visualization**: Creates a confusion matrix to measure the accuracy of the sentiment model

In [None]:
def plot_confusion_matrix(tp, fp, fn, tn):
    """
    Plots a confusion matrix using the values
       tp - True Positive
       fp - False Positive
       fn - False Negative
       tn - True Negative
    """
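    # Rows are predictions, columns are actual labels
    data = np.array([[tp, fp], [fn, tn]])

    # fmt='d' prints the counts as whole numbers instead of scientific notation
    sns.heatmap(data, annot=True, fmt='d', xticklabels=['Actual Pos', 'Actual Neg'],
                yticklabels=['Pred. Pos', 'Pred. Neg'])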

from sklearn.metrics import confusion_matrix
tn, fp, fn, tp = confusion_matrix(sent_true, sent_pred).ravel()
plot_confusion_matrix(tp=tp, fp=fp, tn=tn, fn=fn)
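
The four counts also give precision and recall, which accuracy alone can hide when one class dominates. A minimal sketch on hypothetical toy labels follows. With labels ordered as `[-1, 1]`, `confusion_matrix(...).ravel()` returns the counts in the order `tn, fp, fn, tp`.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical toy labels (+1 = positive, -1 = negative)
y_true = [1, 1, 1, -1, -1, 1]
y_pred = [1, -1, 1, -1, 1, 1]

# Fixing the label order makes the unpacking order explicit
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[-1, 1]).ravel()

precision = tp / (tp + fp)  # of the reviews predicted positive, how many were positive
recall = tp / (tp + fn)     # of the positive reviews, how many were found

print(tn, fp, fn, tp)
print(precision, recall)
```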

One potential issue with the current program is that every unique word becomes its own feature. There can be many more unique words (features) than there are reviews (observations), which makes it easy for the model to overfit the training data.

This portion of the program implements **L2 Regularization** to help avoid overfitting the data, with the goal of **increasing the accuracy**.
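
scikit-learn's `LogisticRegression` takes `C`, the inverse of the regularization strength, rather than the penalty itself, which is why the models below are built with `C=1/l2_penalty`. Here is a minimal sketch of the shrinking effect, assuming synthetic data from `make_classification` instead of the review dataset. L2 is the default penalty, so it is not passed explicitly.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data, for illustration only
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

norms = []
for l2_penalty in [0.01, 1, 100]:
    # A larger penalty means a smaller C, i.e. stronger regularization
    model = LogisticRegression(C=1 / l2_penalty, max_iter=1000)
    model.fit(X, y)
    # Overall size of the coefficient vector
    norms.append(np.linalg.norm(model.coef_))

print(norms)
```

The coefficient norm should fall as the penalty grows, which is the same pattern the coefficient path plot at the end of the notebook shows for individual words.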

In [None]:
# L2 Regularization Penalty Setups
l2_penalties = [0.01, 1, 4, 10, 1e2, 1e3, 1e5]
l2_penalty_names = [f'coefficients [L2={l2_penalty:.0e}]'
                    for l2_penalty in l2_penalties]

Creates a table with one row per word and one column per L2 penalty, to hold the coefficients of each model

In [None]:
coef_table = pd.DataFrame(columns=['word'] + l2_penalty_names)
coef_table['word'] = features

Sets up empty list to store accuracies

In [None]:
accuracy_data = []

Trains L2 model

In [None]:
for l2_penalty, l2_penalty_column_name in zip(l2_penalties, l2_penalty_names):
    lr_model = LogisticRegression(penalty='l2', C=1/l2_penalty, fit_intercept=False, random_state=1)

    lr_model.fit(train_data[features], train_data["sentiment"])

    # Saves coefficients
    coef_table[l2_penalty_column_name] = lr_model.coef_[0]

    # Calculates and saves the train and validation accuracies
    train_accuracy = accuracy_score(train_data["sentiment"],
                                    lr_model.predict(train_data[features]))
    validation_accuracy = accuracy_score(validation_data["sentiment"],
                                         lr_model.predict(validation_data[features]))
    accuracy_data.append({"l2_penalty": l2_penalty,
                          "train_accuracy": train_accuracy,
                          "validation_accuracy": validation_accuracy})

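Once `accuracy_data` is filled, one way to use it is to pick the penalty with the highest validation accuracy. The sketch below uses placeholder numbers that are not results from this notebook; only the dictionary keys match the loop above.

```python
# Placeholder values for illustration only; these are NOT results from the notebook
accuracy_data = [
    {"l2_penalty": 0.01, "train_accuracy": 0.99, "validation_accuracy": 0.90},
    {"l2_penalty": 1, "train_accuracy": 0.97, "validation_accuracy": 0.93},
    {"l2_penalty": 100, "train_accuracy": 0.88, "validation_accuracy": 0.87},
]

# Select on validation accuracy, not training accuracy, to avoid rewarding overfitting
best = max(accuracy_data, key=lambda row: row["validation_accuracy"])

print(best["l2_penalty"])
```

Training accuracy that is much higher than validation accuracy is a sign of overfitting, so the comparison between the two columns is as informative as the maximum itself.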

Finds the 5 most positive and 5 most negative words in the L2 model with penalty 1

In [None]:
positive_words = coef_table.nlargest(5, "coefficients [L2=1e+00]").iloc[:, 0]
negative_words = coef_table.nsmallest(5, "coefficients [L2=1e+00]").iloc[:, 0]
print(positive_words)
print(negative_words)

**Visualization**: Shows how increasing the L2 regularization penalty affects the coefficients of the most positive and negative words

In [None]:
def make_coefficient_plot(table, positive_words, negative_words, l2_penalty_list):

    # Plots coefficient paths given a table (rows = words, columns = L2 penalties),
    # the lists of positive and negative words, and the list of L2 penalties
    def get_cmap_value(cmap, i, total_words):

        # Computes scale from i=0 to i=total_words - 1 for cmap
        return cmap(0.8 * ((i + 1) / (total_words * 1.2) + 0.15))


    def plot_coeffs_for_words(ax, words, cmap):

        # Plots coeff paths for each word in words given axes & word list
        words_df = table[table['word'].isin(words)]
        words_df = words_df.reset_index(drop=True)

        for i, row in words_df.iterrows():
            color = get_cmap_value(cmap, i, len(words))
            ax.plot(xx, row[row.index != 'word'], '-',
                    label=row['word'], linewidth=4.0, color=color)

    # Canvas creation
    fig, ax = plt.subplots(1, figsize=(10, 6))

    # Set up the xs to plot and draw a line for y=0
    xx = l2_penalty_list
    ax.plot(xx, [0.] * len(xx), '--', linewidth=1, color='k')

    # Plot the positive and negative coefficient paths
    cmap_positive = plt.get_cmap('Reds')
    cmap_negative = plt.get_cmap('Blues')
    plot_coeffs_for_words(ax, positive_words, cmap_positive)
    plot_coeffs_for_words(ax, negative_words, cmap_negative)

    # Set up axis labels, scale, and legend
    ax.legend(loc='best', ncol=2, prop={'size':16}, columnspacing=0.5 )
    ax.set_title('Coefficient path')
    ax.set_xlabel(r'L2 penalty ($\lambda$)')
    ax.set_ylabel('Coefficient value')
    ax.set_xscale('log')


make_coefficient_plot(coef_table, positive_words, negative_words, l2_penalty_list=l2_penalties)

---


Dataset attribution

2023 version

Bridging Language and Items for Retrieval and Recommendation
Yupeng Hou, Jiacheng Li, Zhankui He, An Yan, Xiusi Chen, Julian McAuley
arXiv