## Large language model project

### The dataset (source: Kaggle.com)
#### This dataset consists of reviews of fine foods from Amazon. The data span a period of more than 10 years, including all ~500,000 reviews up to October 2012. Reviews include product and user information, ratings, and a plain-text review. It also includes reviews from all other Amazon categories.
#### Columns: Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text

### Objective: this project tries out two things:
#### 1) Tokenizing the comment text to generate numerical features that can be used to assign ratings to products automatically. This can help standardize scores: people are inherently subjective, so what they write in a review often does not match the score they give the product.
#### 2) Generating short summaries of comments

In [6]:
import pandas as pd # dataframe manipulation
import numpy as np # linear algebra
from sklearn.model_selection import train_test_split

In [None]:
# Read in the raw file (for the score) and the text embeddings
df_text=pd.read_csv('embedding_train_amazon.csv')
df_amazon_raw = pd.read_csv('Reviews.csv')
df_score=df_amazon_raw[['Score']]

In [None]:
# Add the score column alongside the text embeddings
df_text['Score'] = df_amazon_raw['Score']
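#### Attaching the score this way relies on the embeddings file holding the same reviews, in the same order, as Reviews.csv, because pandas matches rows by index when a column is assigned. The sketch below uses small made-up frames (the column names e0 and e1 and all values are hypothetical) to show a quick sanity check and what a misaligned index would produce.

```python
import pandas as pd

# Toy stand-ins for the embeddings file and Reviews.csv (hypothetical values)
embeddings = pd.DataFrame({"e0": [0.1, 0.2, 0.3], "e1": [0.4, 0.5, 0.6]})
reviews = pd.DataFrame({"Score": [5, 1, 4], "Text": ["great", "awful", "good"]})

# Both frames should describe the same reviews in the same order
assert len(embeddings) == len(reviews), "row counts differ"

# Column assignment aligns on the index, so matching indexes give a clean join
embeddings["Score"] = reviews["Score"]
assert embeddings["Score"].notna().all()

# A shifted index leaves rows without a partner, which shows up as NaN
shifted = pd.Series([5, 1, 4], index=[1, 2, 3])
misaligned = embeddings[["e0", "e1"]].assign(Score=shifted)
missing_rows = int(misaligned["Score"].isna().sum())
print(missing_rows)
```

##### A non-zero count of missing scores after the assignment is a sign that the two files do not line up row for row.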

In [None]:
df_text.head()

In [None]:
# Randomize the rows
df_text=df_text.sample(frac=1)

In [None]:
df_score=df_text[['Score']]

In [None]:
df_desc=df_text.drop(['Score'],axis=1)

In [None]:
y_score_train, y_score_test = train_test_split(df_score, test_size=0.4, shuffle=False)

In [None]:
x_text_train, x_text_test = train_test_split(df_desc, test_size=0.4, shuffle=False)
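#### Splitting the features and the labels in two separate calls works here only because the rows were shuffled beforehand and both calls use shuffle=False. As an alternative sketch, train_test_split accepts several arrays at once and keeps them in step; the toy data and the random_state value below are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Small synthetic stand-in for the embedding features and the scores
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(10, 3)), columns=["e0", "e1", "e2"])
y = pd.Series(rng.integers(1, 6, size=10), name="Score")

# One call splits features and labels together, so rows cannot drift apart
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=42
)

# The feature and label subsets share the same row indexes
assert X_train.index.equals(y_train.index)
assert X_test.index.equals(y_test.index)
print(X_train.shape, X_test.shape)
```

##### Fixing random_state also makes the split reproducible between notebook runs.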

#### We want an algorithm that trains relatively quickly and uses little memory, so we start with LGBMClassifier.
#### LGBMClassifier is the classifier from LightGBM (Light Gradient Boosting Machine), a gradient-boosting framework built on decision trees that supports ranking, classification, and other machine-learning tasks. It uses two techniques, Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB), to handle large-scale data faster and with less memory while keeping accuracy.
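#### To make the GOSS idea concrete, here is a minimal NumPy sketch of a single sampling step. It is an illustration of the technique, not LightGBM's actual implementation: rows with the largest absolute gradients are always kept, a random subset of the remaining rows is sampled, and the sampled rows are up-weighted by (1 - top_rate) / other_rate to compensate for the rows that were dropped. The function name goss_sample and its default rates are assumptions.

```python
import numpy as np

def goss_sample(gradients, top_rate=0.2, other_rate=0.1, seed=0):
    """Return (row indices, row weights) picked by one GOSS-style sampling step."""
    rng = np.random.default_rng(seed)
    n = len(gradients)
    n_top = int(top_rate * n)
    n_other = int(other_rate * n)

    # Sort rows by absolute gradient, largest first, and keep the top slice
    order = np.argsort(-np.abs(gradients))
    top_idx = order[:n_top]

    # Randomly sample from the remaining small-gradient rows
    other_idx = rng.choice(order[n_top:], size=n_other, replace=False)

    # Up-weight the sampled small-gradient rows; large-gradient rows keep weight 1
    weights = np.ones(n_top + n_other)
    weights[n_top:] = (1.0 - top_rate) / other_rate

    return np.concatenate([top_idx, other_idx]), weights

# Synthetic gradients for 100 rows
grads = np.random.default_rng(1).normal(size=100)
idx, w = goss_sample(grads)
print(len(idx), w[0], w[-1])
```

##### Only 30 of the 100 rows are used to grow the next tree in this toy run, which is where the speed and memory savings come from.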

In [None]:
import lightgbm as lgb

##### We are first going to try this with default settings

In [None]:
clf_km = lgb.LGBMClassifier()

In [None]:
clf_km.fit(X = x_text_train , y = y_score_train.values.ravel())

In [None]:
y_pred = clf_km.predict(x_text_test)

In [None]:
from sklearn.metrics import accuracy_score
accuracy=accuracy_score(y_score_test,y_pred)
print('Test-set accuracy score: {0:0.4f}'.format(accuracy))
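#### An accuracy figure is easier to judge next to a trivial baseline. The sketch below computes the accuracy of always predicting the most frequent training score; the helper name majority_baseline_accuracy and the toy label lists are hypothetical, and in the notebook the inputs would be the training and test scores.

```python
from collections import Counter

def majority_baseline_accuracy(train_labels, test_labels):
    """Accuracy obtained by always predicting the most common training label."""
    most_common_label, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(1 for label in test_labels if label == most_common_label)
    return hits / len(test_labels)

# Toy scores, for illustration only
train_scores = [5, 5, 4, 1, 5, 3, 5, 2]
test_scores = [5, 4, 5, 1, 5]
baseline = majority_baseline_accuracy(train_scores, test_scores)
print(f"Majority-class baseline accuracy: {baseline:.4f}")
```

##### If the classifier only narrowly beats this number, most of its accuracy comes from the skew in the scores rather than from the text.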

##### Accuracy of 71% is OK but nothing to write home about.
##### Let's see if changing some of the hyperparameters helps.

In [None]:
clf_km = lgb.LGBMClassifier(n_estimators=500)

In [None]:
clf_km.fit(X = x_text_train , y = y_score_train.values.ravel())

In [None]:
y_pred_1 = clf_km.predict(x_text_test)

In [None]:
from sklearn.metrics import accuracy_score
accuracy=accuracy_score(y_score_test,y_pred_1)
print('Test-set accuracy score: {0:0.4f}'.format(accuracy))

##### Accuracy of 77% is OK but still nothing to write home about.
##### Let's see if using a different algorithm makes a difference: enter the famous XGBoost!

In [None]:
import xgboost as xgb

In [None]:
clf_XGB = xgb.XGBClassifier()

In [None]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y_score_train = le.fit_transform(y_score_train.values.ravel())

In [None]:
clf_XGB.fit(X = x_text_train , y = y_score_train)

In [None]:
y_pred_2 = clf_XGB.predict(x_text_test)

In [None]:
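# The model was trained on encoded labels (0-4), so map predictions back to the original scores (1-5) before scoring
accuracy_2 = accuracy_score(y_score_test, le.inverse_transform(y_pred_2))
print('Test-set accuracy score: {0:0.4f}'.format(accuracy_2))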
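#### The LabelEncoder step is there because recent XGBoost versions expect class labels to start at 0, while the review scores run from 1 to 5. Predictions therefore come back in the encoded space and need to be mapped back before they are compared with the raw test scores. The sketch below shows the round trip on made-up arrays; the "predictions" are hard-coded stand-ins, not output from a real model.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Toy scores on the original 1-5 scale
y_train = np.array([1, 5, 3, 5, 2, 4])
y_test = np.array([5, 1, 4])

le = LabelEncoder()
y_train_enc = le.fit_transform(y_train)  # maps 1..5 to 0..4

# Stand-in for classifier output, expressed in the encoded 0-4 space
y_pred_enc = np.array([4, 0, 3])

# Comparing encoded predictions with raw scores silently scores every row wrong
naive_accuracy = float(np.mean(y_pred_enc == y_test))

# Decoding first puts both sides back on the 1-5 scale
y_pred = le.inverse_transform(y_pred_enc)
decoded_accuracy = float(np.mean(y_pred == y_test))
print(naive_accuracy, decoded_accuracy)
```

##### Encoding the test labels with le.transform would work equally well; what matters is that both sides of the comparison use the same scale.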

#### Still 77%
#### We could of course try other algorithms to see whether the accuracy improves further, but I suspect it will not go much higher. People are inherently subjective, and there is no rigorous relationship between what we write in a review and the score we give.
#### Let's now try the summary real quick to see if it gives better prediction of scores.

In [None]:
df_summ=pd.read_csv("embedding_train_amazon_summ.csv")

In [None]:
df_summ['Score']=df_amazon_raw[['Score']]

In [None]:
df_summ=df_summ.sample(frac=1)

In [None]:
df_score_summ=df_summ[['Score']]

In [None]:
df_text_summ=df_summ.drop(['Score'],axis=1)

In [None]:
y_score_train, y_score_test = train_test_split(df_score_summ, test_size=0.4, shuffle=False)

In [None]:
x_text_train, x_text_test = train_test_split(df_text_summ, test_size=0.4, shuffle=False)

In [None]:
clf_km = lgb.LGBMClassifier(n_estimators=500)

In [None]:
clf_km.fit(X = x_text_train , y = y_score_train.values.ravel())

In [None]:
y_pred_1 = clf_km.predict(x_text_test)

In [None]:
from sklearn.metrics import accuracy_score
accuracy=accuracy_score(y_score_test,y_pred_1)
print('Test-set accuracy score: {0:0.4f}'.format(accuracy))

##### Slightly better than with the full review text, but we are still up against the inherent subjectivity of the human mind.

### We will now use a pretrained LLM to generate summaries of the reviews
### We are looking for a light-weight LLM which can execute on a decent PC.
### I tried GPT-J (model size: 6B) and C4AI Command-R (model size: 35B) and my laptop ran out of memory.
### I tried Llama 3B v2 but it was quite slow.
### So I then tried Mistral 7B

In [1]:
import fireworks.client
import os
import dotenv
import chromadb
import json
from tqdm.auto import tqdm
import random

# you can set envs using Colab secrets
dotenv.load_dotenv()

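# Read the API key from the environment (e.g. a .env file or Colab secrets) instead of hard-coding it.
# The variable name FIREWORKS_API_KEY is an assumption; use whatever name your environment defines.
fireworks.client.api_key = os.getenv("FIREWORKS_API_KEY")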
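#### Credentials are safer read from the environment than typed into a notebook cell. Here is a minimal sketch of a helper that fails loudly when the key is missing; the function name load_api_key and the variable name FIREWORKS_API_KEY are assumptions, not part of the Fireworks client.

```python
import os

def load_api_key(var_name="FIREWORKS_API_KEY"):
    """Return the API key stored in the given environment variable."""
    key = os.environ.get(var_name)
    if not key:
        # Stop early with a clear message instead of failing on the first API call
        raise RuntimeError(f"Set the {var_name} environment variable first")
    return key

# Usage in the notebook, after dotenv.load_dotenv() has run:
# fireworks.client.api_key = load_api_key()
```

##### With a .env file, dotenv.load_dotenv() populates the environment, so the key never appears in the notebook itself.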

In [2]:
def get_completion(prompt, model=None, max_tokens=50):

    fw_model_dir = "accounts/fireworks/models/"

    if model is None:
        model = fw_model_dir + "llama-v2-7b"
    else:
        model = fw_model_dir + model

    completion = fireworks.client.Completion.create(
        model=model,
        prompt=prompt,
        max_tokens=max_tokens,
        temperature=0
    )

    return completion.choices[0].text

In [3]:
def call_model(textinput):

    p1 = """[INST] Summarise the following in 15 words :{"""
    p2="""}[/INST]"""
    prompt= p1+textinput+p2

    summ=get_completion(prompt, model=mistral_llm, max_tokens=200)
    return summ
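#### The prompt wraps the review in the [INST] ... [/INST] instruction tags used in call_model. As a sketch of that step in isolation, the helper below builds the prompt from plain review text so it can be inspected before any API call is made. The name build_prompt, the max_words parameter, and the sample review are hypothetical.

```python
def build_prompt(text, max_words=15):
    """Wrap a review in an instruction-style prompt asking for a short summary."""
    instruction = f"Summarise the following in {max_words} words:"
    # Strip stray whitespace so the model sees a clean block of text
    return f"[INST] {instruction}\n{text.strip()} [/INST]"

# Made-up review text, for illustration only
sample_review = "  I have bought several of these cans and the quality is good.  "
prompt = build_prompt(sample_review)
print(prompt)
```

##### Printing a few prompts this way is a cheap check that the text reaches the model in the intended form.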

In [4]:
# Set the LLM
mistral_llm = "mistral-7b-instruct-4k"

In [7]:
df_text = pd.read_csv('Reviews.csv')

In [8]:
# Process the first 200 comments
df_text_reduced=df_text[['Text']][:200]

In [9]:
summaries = df_text_reduced[['Text']].apply(lambda x: call_model(x.to_json()), axis=1).tolist()

In [10]:
df_text_reduced=df_text_reduced.assign(summ=summaries)

In [11]:
df_text_reduced.to_csv("amazon_summ.csv",index = False)

In [None]:
## Commented out - can use this to process in batches when handling large files
# Generate summaries and save them in batches
# batch_size = 50

# Loop through the batches, generating and storing the summaries
# for i in tqdm(range(0, len(df_text_reduced), batch_size)):
#     df_text_reduced_sub = df_text_reduced[i : i + batch_size]
#     summaries = df_text_reduced_sub[['Text']].apply(lambda x: call_model(x.to_json()), axis=1).tolist()
#     df_text_reduced_sub = df_text_reduced_sub.assign(summ=summaries)
#     filename = "amazon_summ" + str(i) + ".csv"
#     df_text_reduced_sub.to_csv(filename, index=False)
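#### The batching idea can be sketched without pandas or the API. The example below splits a list of texts into fixed-size batches and applies a summariser to each one; iter_batches, summarise_in_batches, and fake_summarise are hypothetical names, and fake_summarise is a stand-in for the real model call.

```python
def iter_batches(items, batch_size):
    """Yield (start offset, batch) pairs covering items in order."""
    for start in range(0, len(items), batch_size):
        yield start, items[start:start + batch_size]

def summarise_in_batches(texts, summarise, batch_size=50):
    """Apply summarise to every text, one batch at a time."""
    results = []
    for start, batch in iter_batches(texts, batch_size):
        # A real run could write each finished batch to disk here,
        # so an interrupted job loses at most one batch of work
        results.extend(summarise(text) for text in batch)
    return results

def fake_summarise(text):
    # Stand-in for the LLM call: keep only the first two words
    return " ".join(text.split()[:2])

texts = [f"review number {i} about coffee" for i in range(7)]
batch_sizes = [len(batch) for _, batch in iter_batches(texts, 3)]
summaries = summarise_in_batches(texts, fake_summarise, batch_size=3)
print(batch_sizes, summaries[:2])
```

##### Saving each batch as it finishes, as the commented-out cell does, means a failed request partway through does not discard the summaries already generated.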
