**TEXT CLASSIFICATION PROJECT:**
(The project is divided into 3 parts: two classifiers and a text summarizer)
1. Sentiment Classifier
2. Spam Email Classifier
3. Text Summarization

**The project is inspired by what I did in the Udemy course "Natural Language Processing" by ProfessionAI.**

1. **SENTIMENT CLASSIFICATION** (sentiment analysis): a model that tries to determine whether a sentence is positive or negative

First, download the dataset on which the model will be trained.

In [None]:
!wget http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz

In [None]:
# extracting the compressed file
!tar -xzf aclImdb_v1.tar.gz

Creating a function that reads the reviews from all files and returns them together with their corresponding targets

In [None]:
from os import listdir
from sklearn.utils import shuffle


def get_xy(files_path, labels=("pos", "neg")):
    # Map the first label to 1 (positive) and the second to 0 (negative)
    label_map = {labels[0]: 1, labels[1]: 0}

    reviews = []
    y = []

    for label in labels:
        path = files_path + label
        for file in listdir(path):
            # The context manager closes each file after reading it
            with open(path + "/" + file, encoding="utf-8") as review_file:
                review = review_file.read()

            reviews.append(review)
            y.append(label_map[label])

    # sklearn's shuffle function applies the same permutation
    # to multiple arrays, so reviews and targets stay aligned
    reviews, y = shuffle(reviews, y)

    return reviews, y

Using the function to get the reviews and targets as two lists

In [None]:
reviews_train, y_train = get_xy("aclImdb/train/")
reviews_test, y_test = get_xy("aclImdb/test/")

print("First review of the test set")
print(reviews_test[0])
print("Sentiment: %d" % y_test[0])

Encoding the reviews ("bag of words")

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

bow = CountVectorizer(max_features=5000)

bow_train = bow.fit_transform(reviews_train)
bow_test = bow.transform(reviews_test)

X_train = bow_train.toarray()
X_test = bow_test.toarray()

X_train.shape
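
To make the bag-of-words encoding concrete, here is a minimal pure-Python sketch on two toy sentences. The helper `bag_of_words` and the toy data are illustrative assumptions, not part of scikit-learn; `CountVectorizer` additionally applies its own tokenization rules and returns a sparse matrix.

```python
from collections import Counter


def bag_of_words(docs, max_features=None):
    # Split each document into lowercase tokens (simplified whitespace tokenization)
    tokenized = [doc.lower().split() for doc in docs]
    # Keep the most frequent tokens across the corpus as the vocabulary
    totals = Counter(tok for toks in tokenized for tok in toks)
    vocab = sorted(tok for tok, _ in totals.most_common(max_features))
    index = {tok: i for i, tok in enumerate(vocab)}
    # One row per document, one column per vocabulary token
    rows = []
    for toks in tokenized:
        row = [0] * len(vocab)
        for tok in toks:
            if tok in index:
                row[index[tok]] += 1
        rows.append(row)
    return vocab, rows


toy_docs = ["great great movie", "bad movie"]
vocab, rows = bag_of_words(toy_docs)
print(vocab)  # ['bad', 'great', 'movie']
print(rows)   # [[0, 2, 1], [1, 0, 1]]
```

Each review becomes a vector of token counts, and word order is discarded, which is why the encoding is called a "bag" of words.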

Standardizing the resulting arrays

In [None]:
from sklearn.preprocessing import StandardScaler

ss = StandardScaler()

X_train = ss.fit_transform(X_train)
X_test = ss.transform(X_test)
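
As a rough illustration of what standardization does to a single feature column, the sketch below computes the mean and standard deviation on training values only and reuses them on test values, mirroring the `fit_transform` / `transform` split above. The helpers `fit_scaler` and `scale` and the numbers are hypothetical.

```python
from statistics import fmean, pstdev


def fit_scaler(column):
    # Learn mean and population standard deviation from the training column
    mean = fmean(column)
    std = pstdev(column) or 1.0  # avoid division by zero for constant columns
    return mean, std


def scale(column, mean, std):
    # Subtract the training mean and divide by the training standard deviation
    return [(v - mean) / std for v in column]


train_counts = [0, 2, 4]
test_counts = [2, 6]

mean, std = fit_scaler(train_counts)
train_scaled = scale(train_counts, mean, std)
test_scaled = scale(test_counts, mean, std)
print(train_scaled)
print(test_scaled)
```

The test values are scaled with the training statistics, so no information from the test set leaks into the preprocessing.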

Model creation and training using logistic regression

In [None]:
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(C=0.001)
lr.fit(X_train, y_train)

Prediction and evaluation of the results using two metrics that measure the model's quality:

log loss and accuracy

In [None]:
from sklearn.metrics import accuracy_score, log_loss

train_pred = lr.predict(X_train)
train_pred_proba = lr.predict_proba(X_train)

train_accuracy = accuracy_score(y_train, train_pred)
train_loss = log_loss(y_train, train_pred_proba)

test_pred = lr.predict(X_test)
test_pred_proba = lr.predict_proba(X_test)

test_accuracy = accuracy_score(y_test, test_pred)
test_loss = log_loss(y_test, test_pred_proba)

print("Train Accuracy %.4f - Train Loss %.4f" % (train_accuracy, train_loss))
print("Test Accuracy %.4f - Test Loss %.4f" % (test_accuracy, test_loss))
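
For intuition about the two metrics, here is a small pure-Python sketch for the binary case. The functions `accuracy` and `binary_log_loss` and the toy probabilities are assumptions for illustration; the 0.5 threshold mirrors how a binary probabilistic classifier typically turns probabilities into labels.

```python
from math import log


def accuracy(y_true, y_pred):
    # Fraction of predictions that match the true labels
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def binary_log_loss(y_true, p_pos, eps=1e-15):
    # Average negative log-likelihood of the true labels,
    # where p_pos is the predicted probability of class 1
    total = 0.0
    for t, p in zip(y_true, p_pos):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(t * log(p) + (1 - t) * log(1 - p))
    return total / len(y_true)


y_true = [1, 0, 1, 0]
p_pos = [0.9, 0.2, 0.4, 0.1]
y_pred = [int(p >= 0.5) for p in p_pos]

acc = accuracy(y_true, y_pred)
loss = binary_log_loss(y_true, p_pos)
print("Accuracy %.4f - Log loss %.4f" % (acc, loss))
```

Accuracy only looks at the predicted labels, while log loss also penalizes confident but wrong probabilities, so the two can move differently.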

The model is quite accurate (94% on the training data and 87% on the test data)

**-> MODEL TEST** (Sentiment analysis [positive or negative])

In [None]:
# FIRST REVIEW TO BE CLASSIFIED (POSITIVE)
review = "This is the best movie I've ever seen"
# Apply the same preprocessing used in training: bag of words, then scaling
features = ss.transform(bow.transform([review]).toarray())
prediction = lr.predict(features)
if prediction[0] == 0: # the model returns 0 for a negative review
  print("The review is negative")
else: # otherwise it returns 1
  print("The review is positive")

In [None]:
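# SECOND REVIEW TO BE CLASSIFIED (NEGATIVE)
review = "This is the worst movie I've ever seen"
# Apply the same preprocessing used in training: bag of words, then scaling
features = ss.transform(bow.transform([review]).toarray())
prediction = lr.predict(features)
if prediction[0] == 0:
  print("The review is negative")
else:
  print("The review is positive")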

-> the model successfully recognized which review was positive and which was negative

2. **EMAIL SPAM CLASSIFICATION**

Importing libraries

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn import svm

Reading the training dataset

In [None]:
spam = pd.read_csv('data/spam.csv')

In [None]:
spam.head()

In [None]:
# @title v1

from matplotlib import pyplot as plt
import seaborn as sns
spam.groupby('v1').size().plot(kind='barh', color=sns.palettes.mpl_palette('Dark2'))
plt.gca().spines[['top', 'right',]].set_visible(False)

Preparing the data: separating texts and labels, then splitting them into training and test sets

In [None]:
z = spam['v2'] # v2 = message text
y = spam["v1"] # v1 = label: spam or not spam (ham)
z_train, z_test, y_train, y_test = train_test_split(z, y, test_size=0.2)
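
The split above holds out 20% of the messages for testing. A minimal pure-Python sketch of the same idea follows; the helper `split_train_test`, the `seed` parameter and the toy messages are illustrative assumptions, and scikit-learn's `train_test_split` offers more options such as stratification.

```python
import math
import random


def split_train_test(items, labels, test_size=0.2, seed=None):
    # Shuffle the indices once so that items and labels stay aligned
    rng = random.Random(seed)
    order = list(range(len(items)))
    rng.shuffle(order)
    # Reserve the first n_test shuffled indices for the test set
    n_test = math.ceil(len(items) * test_size)
    test_idx, train_idx = order[:n_test], order[n_test:]
    return (
        [items[i] for i in train_idx],
        [items[i] for i in test_idx],
        [labels[i] for i in train_idx],
        [labels[i] for i in test_idx],
    )


messages = ["win cash now", "see you later", "free prize", "call me", "lunch today?"]
kinds = ["spam", "ham", "spam", "ham", "ham"]

m_train, m_test, k_train, k_test = split_train_test(messages, kinds, test_size=0.2, seed=0)
print(len(m_train), len(m_test))  # 4 1
```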

"Tokenization": consists of dividing a text into smaller entities, called tokens.

In [None]:
cv = CountVectorizer()
features = cv.fit_transform(z_train)
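
As a sketch of the tokenization step, the snippet below lowercases a message and extracts tokens of two or more word characters, which matches `CountVectorizer`'s default `token_pattern`. The function name `tokenize` and the sample message are assumptions for illustration.

```python
import re

# Tokens are runs of two or more word characters, as in CountVectorizer's default
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")


def tokenize(text):
    # Lowercase first, then extract the tokens; punctuation is dropped
    return TOKEN_PATTERN.findall(text.lower())


tokens = tokenize("URGENT! Txt the word CLAIM to 81010 :)")
print(tokens)  # ['urgent', 'txt', 'the', 'word', 'claim', 'to', '81010']
```

Single-character tokens and punctuation never reach the vocabulary, so they cannot influence the classifier.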

Model training

In [None]:
model = svm.SVC()
model.fit(features,y_train)

Evaluation on the test set

In [None]:
features_test = cv.transform(z_test)
print(model.score(features_test,y_test))

The model is very accurate (98% on the test set)

**-> MODEL TEST** (Classifying an email [spam or not])

In [None]:
# TEST ON A SPAM EMAIL
email = ["URGENT! You have won a 1 week FREE membership in our £100,000 Prize Jackpot! Txt the word: CLAIM to No: 81010 T&C www.dbuk.net LCCLTD POBOX 4403LDNW1A7RW18"]
feature_test = cv.transform(email)
result = model.predict(feature_test)
print(f"The email is {result[0]}")

In [None]:
# TEST ON A NON-SPAM EMAIL (HAM)
email = ["Please don't text me anymore. I have nothing else to say."]
feature_test = cv.transform(email)
result = model.predict(feature_test)
print(f"The email is {result[0]}")

-> the model successfully recognized spam mail and non-spam mail (ham)

3. **SUMMARIZE A TEXT**

Importing the article to be summarized

In [None]:
!pip install newspaper3k

In [None]:
from newspaper import Article
from newspaper import Config

user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'
config = Config()
config.browser_user_agent = user_agent
page = Article("https://www.sciencedaily.com/releases/2021/08/210811162816.htm", config=config)
page.download()
page.parse()
print(page.text)

Importing libraries

In [None]:
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from string import punctuation
from heapq import nlargest

Loading the model from the spaCy library

In [None]:
nlp = spacy.load('en_core_web_sm')

Text processing (the spaCy pipeline does most of the work automatically)

In [None]:
doc= nlp(page.text)

Tokenization

In [None]:
tokens=[token.text for token in doc]

Computing word frequencies and sentence scores

In [None]:
# Count how often each word occurs, ignoring stop words, punctuation and whitespace.
# Keys are lowercased so that the lookups in the scoring loop match them.
word_frequencies = {}
for word in doc:
  key = word.text.lower()
  if key not in STOP_WORDS and key not in punctuation and not word.is_space:
    word_frequencies[key] = word_frequencies.get(key, 0) + 1

# Normalize the counts by the highest frequency
max_frequency = max(word_frequencies.values())
for key in word_frequencies:
  word_frequencies[key] = word_frequencies[key] / max_frequency

# Score each sentence as the sum of the normalized frequencies of its words
sentence_tokens = list(doc.sents)
sentence_scores = {}
for sent in sentence_tokens:
  for word in sent:
    key = word.text.lower()
    if key in word_frequencies:
      sentence_scores[sent] = sentence_scores.get(sent, 0) + word_frequencies[key]

per = fraction of the article's sentences to extract (0.05 = 5%)

In [None]:
per = 0.05

In [None]:
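# Keep at least one sentence, even for short articles
select_length = max(1, int(len(sentence_tokens) * per))
summary = nlargest(select_length, sentence_scores, key=sentence_scores.get)
final_summary = [sent.text for sent in summary]
# Join the selected sentences with a space so they do not run together
summary = ' '.join(final_summary)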
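
The frequency-based scoring used above can be reproduced without spaCy on a toy text. This is only a sketch under simplifying assumptions: sentences are split with a regular expression, the stop list `STOP` is a tiny hand-made set, and `summarize_sketch` is a hypothetical helper rather than the notebook's function.

```python
import re
from collections import Counter
from heapq import nlargest

STOP = {"the", "is", "a", "of", "and", "it", "in"}  # tiny illustrative stop list


def content_words(sentence):
    # Lowercased words of a sentence, without stop words
    return [w for w in re.findall(r"\w+", sentence.lower()) if w not in STOP]


def summarize_sketch(text, n_sentences=1):
    # Naive sentence split on ., ! or ? followed by whitespace
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    # Word frequencies normalized by the most frequent word
    freq = Counter(w for s in sentences for w in content_words(s))
    top = max(freq.values())
    # Sentence score = sum of the normalized frequencies of its words
    scores = {s: sum(freq[w] / top for w in content_words(s)) for s in sentences}
    best = nlargest(n_sentences, scores, key=scores.get)
    # Return the selected sentences in their original order
    return " ".join(s for s in sentences if s in best)


toy_text = "Cats sleep a lot. Cats and dogs play in the garden. It is sunny."
print(summarize_sketch(toy_text))  # Cats and dogs play in the garden.
```

Longer sentences tend to score higher with this method because scores are sums, which is one reason the result is best seen as a rough extractive summary.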

**-> Model Test** (Summarizing a Text)

In [None]:
print(summary.replace(",", "\n"))

-> the article has been successfully summarized

**TEST MODELS VIA A UI CREATED USING THE STREAMLIT LIBRARY**

In [None]:
!pip install -q streamlit

In [None]:
from joblib import dump, load
# Save the sentiment model together with the vectorizer and scaler fitted during training
dump(lr, 'sent_classifier.sav')
dump(bow, 'sent_vectorizer.sav')
dump(ss, 'sent_scaler.sav')

In [None]:
# Save the spam model together with the vectorizer fitted during training
dump(model, 'spam_classifier.sav')
dump(cv, 'spam_vectorizer.sav')

In [None]:
%%writefile app.py

import streamlit as st
from joblib import load
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from string import punctuation
from heapq import nlargest
from io import StringIO
st.set_page_config(page_title="ThreeTools Algo")

@st.cache_resource
def sent_load():
  # Classifier plus the vectorizer and scaler it was trained with
  return (load('/content/sent_classifier.sav'),
          load('/content/sent_vectorizer.sav'),
          load('/content/sent_scaler.sav'))
@st.cache_resource
def spam_load():
  # Classifier plus the vectorizer it was trained with
  return load('/content/spam_classifier.sav'), load('/content/spam_vectorizer.sav')
@st.cache_resource
def summ_load():
  return spacy.load('en_core_web_sm')

sent_classifier, bow, ss = sent_load()
spam_classifier, cv = spam_load()
nlp = summ_load()

def summarize(text, per):
  doc = nlp(text)
  # Word counts with lowercased keys, skipping stop words, punctuation and whitespace
  word_frequencies = {}
  for word in doc:
    key = word.text.lower()
    if key not in STOP_WORDS and key not in punctuation and not word.is_space:
      word_frequencies[key] = word_frequencies.get(key, 0) + 1
  # Normalize the counts by the highest frequency
  max_frequency = max(word_frequencies.values())
  for key in word_frequencies:
    word_frequencies[key] = word_frequencies[key] / max_frequency
  # Sentence score = sum of the normalized frequencies of its words
  sentence_tokens = list(doc.sents)
  sentence_scores = {}
  for sent in sentence_tokens:
    for word in sent:
      key = word.text.lower()
      if key in word_frequencies:
        sentence_scores[sent] = sentence_scores.get(sent, 0) + word_frequencies[key]
  # Keep at least one sentence and join the selection with spaces
  select_length = max(1, int(len(sentence_tokens) * per))
  summary = nlargest(select_length, sentence_scores, key=sentence_scores.get)
  return ' '.join(sent.text for sent in summary)


file_boolean = False
string_data = str()


with st.sidebar:
  with st.container(height=60, border=False):
    st.title("Functions:")

  with st.container(height=70, border=False):
    but1 = st.button("Emotion & Sentiment Analysis", use_container_width=True)
  with st.container(height=50, border=False):
    but2 = st.button("SMS & Email Spam Classification", use_container_width=True)
  with st.container(height=40, border=False):
    but3 = st.button("Summarizer", use_container_width=True)

# A button is True only during the rerun triggered by its click,
# so the selected function is remembered in the session state
if but1:
  st.session_state.tool = "sentiment"
elif but2:
  st.session_state.tool = "spam"
elif but3:
  st.session_state.tool = "summary"

with st.container(height=220, border=False):
  col1, col2 = st.columns([3, 2])

  with col1:
    st.title("ThreeTools Algorithm")

  with col2:
    uploaded_file = st.file_uploader(label="UPLOADER (Upload a .txt file), for Summarization", type=['txt'])
    if uploaded_file is not None:
      stringio = StringIO(uploaded_file.getvalue().decode("utf-8"))
      string_data = stringio.read()
      file_boolean = True


if "messages" not in st.session_state:
  st.session_state.messages = [{"role": "assistant", "content": "Hi, how can I help you?"}]

for message in st.session_state.messages:
  with st.chat_message(message["role"]):
    st.write(message["content"])


if prompt := st.chat_input("Write the sentence to classify or upload it from the uploader"):
  st.session_state.messages.append({"role": "user", "content": prompt})
  with st.chat_message("user"):
    st.write(prompt)

tool = st.session_state.get("tool")
if st.session_state.messages[-1]["role"] != "assistant":
  # The text to process is the last message written by the user
  text = st.session_state.messages[-1]["content"]
  with st.chat_message("assistant"):
    with st.spinner("Loading..."):
      answer = None
      try:
        if tool == "sentiment":
          # Same preprocessing as in training: bag of words, then scaling
          features = ss.transform(bow.transform([text]).toarray())
          prediction = sent_classifier.predict(features)
          # the model returns 0 for a negative review and 1 for a positive one
          answer = "The review is negative" if prediction[0] == 0 else "The review is positive"
        elif tool == "spam":
          result = spam_classifier.predict(cv.transform([text]))
          answer = f"The email is {result[0]}"
        elif tool == "summary":
          if file_boolean:
            answer = summarize(string_data, 0.05)
          else:
            st.write("Please upload the text to summarize in the UPLOADER")
        else:
          st.write("Please choose a function from the sidebar")
      except Exception:
        answer = "Something went wrong, please try again"
      if answer is not None:
        # Show the answer in the chat and keep it in the history
        st.write(answer)
        st.session_state.messages.append({"role": "assistant", "content": answer})

In [None]:
!npm install localtunnel

In [None]:
!streamlit run app.py &>/content/logs.txt & npx localtunnel --port 8501 & curl ipv4.icanhazip.com