# Custom Text Classifier

In [None]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix, classification_report

In [None]:
# sample data of sentences and their associated sentiment labels
data = pd.DataFrame(
        [
            ("i love spending time with my friends and family", "positive"),
            ("that was the best meal i've ever had in my life", "positive"),
            ("i feel so grateful for everything i have in my life", "positive"),
            ("i received a promotion at work and i couldn't be happier", "positive"),
            ("watching a beautiful sunset always fills me with joy", "positive"),
            (
                "my partner surprised me with a thoughtful gift and it made my day",
                "positive",
            ),
            ("i am so proud of my daughter for graduating with honors", "positive"),
            (
                "listening to my favorite music always puts me in a good mood",
                "positive",
            ),
            (
                "i love the feeling of accomplishment after completing a challenging task",
                "positive",
            ),
            ("i am excited to go on vacation next week", "positive"),
            ("i feel so overwhelmed with work and responsibilities", "negative"),
            ("the traffic during my commute is always so frustrating", "negative"),
            ("i received a parking ticket and it ruined my day", "negative"),
            (
                "i got into an argument with my partner and we're not speaking",
                "negative",
            ),
            ("i have a headache and i feel terrible", "negative"),
            ("i received a rejection letter for the job i really wanted", "negative"),
            ("my car broke down and it's going to be expensive to fix", "negative"),
            ("i'm feeling sad because i miss my friends who live far away", "negative"),
            (
                "i'm frustrated because i can't seem to make progress on my project",
                "negative",
            ),
            ("i'm disappointed because my team lost the game", "negative"),
        ], columns=["text", "sentiment"]
)

> Problem: create an algorithm that can classify each sentence by its sentiment label (positive or negative).

In [None]:
# shuffle the data so that positive and negative examples are mixed;
# a fixed random_state keeps the shuffle reproducible between runs
data = data.sample(frac=1, random_state=7).reset_index(drop=True)

In [None]:
x = data['text']
y = data['sentiment']

In [None]:
count_vectorizer = CountVectorizer()

In [None]:
count_vectorizer_fit = count_vectorizer.fit_transform(x)

In [None]:
bag_of_words = pd.DataFrame(count_vectorizer_fit.toarray(), columns=count_vectorizer.get_feature_names_out())
bag_of_words

In [None]:
x_train, x_test, y_train, y_test = train_test_split(bag_of_words, y, test_size=0.3, random_state=7)

## Logistic Regression
- Logistic regression is a statistical model that in its basic form uses a logistic function to model a binary dependent variable.
- In regression analysis, logistic regression (or logit regression) estimates the parameters of a logistic model (a form of binary regression).
- Mathematically, a binary logistic model has a dependent variable with two possible values, such as pass/fail, represented by an indicator variable whose two values are labeled "0" and "1".
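
To make the logistic function concrete, here is a minimal sketch of how a weighted sum of word counts becomes a probability. The two-word vocabulary, the weights, and the helper name `predict_proba_positive` are hypothetical and chosen only for illustration; a fitted `LogisticRegression` learns its own weights from the training data.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical weights for a tiny vocabulary (assumed, not learned from the data).
weights = {"love": 1.5, "terrible": -2.0}
bias = 0.0


def predict_proba_positive(word_counts: dict) -> float:
    """Probability of the 'positive' class for a bag of word counts."""
    z = bias + sum(weights.get(word, 0.0) * count for word, count in word_counts.items())
    return sigmoid(z)


p_love = predict_proba_positive({"love": 1})
p_terrible = predict_proba_positive({"terrible": 1})
print(p_love > 0.5, p_terrible < 0.5)
```

A positive weight pushes the probability above 0.5 and a negative weight pushes it below, which is how the model separates the two classes.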

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
logistic_regression_model = LogisticRegression(random_state=1).fit(x_train, y_train)

In [None]:
y_pred_logistic_regression = logistic_regression_model.predict(x_test)

In [None]:
# scikit-learn metrics expect the arguments in the order (y_true, y_pred)
accuracy_score(y_test, y_pred_logistic_regression)

In [None]:
print(classification_report(y_test, y_pred_logistic_regression, zero_division=0))

## Naive Bayes
- Naive Bayes is a classification technique based on Bayes’ Theorem with an assumption of independence among predictors.
- In simple terms, a Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature.
- For example, a fruit may be considered to be an apple if it is red, round, and about 3 inches in diameter. Even if these features depend on each other or upon the existence of the other features, a naive Bayes classifier would consider all of these properties to independently contribute to the probability that this fruit is an apple.
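
The independence assumption can be sketched by hand on a toy corpus. The four training sentences below are hypothetical, and the add-one (Laplace) smoothing is an assumption of this sketch; it illustrates the idea behind a multinomial Naive Bayes classifier rather than reproducing scikit-learn's implementation.

```python
import math
from collections import Counter

# Hypothetical toy corpus (not the notebook's dataset).
train = [
    ("i love my friends", "positive"),
    ("i love this meal", "positive"),
    ("i feel terrible", "negative"),
    ("my car broke down", "negative"),
]

# Count how often each word occurs per class, and how many documents each class has.
word_counts = {"positive": Counter(), "negative": Counter()}
class_counts = Counter()
for text, label in train:
    class_counts[label] += 1
    word_counts[label].update(text.split())

vocabulary = {word for counts in word_counts.values() for word in counts}


def log_score(text: str, label: str, alpha: float = 1.0) -> float:
    """Log prior plus the sum of per-word log likelihoods (words treated as independent)."""
    total_words = sum(word_counts[label].values())
    score = math.log(class_counts[label] / sum(class_counts.values()))
    for word in text.split():
        if word not in vocabulary:
            continue  # ignore words never seen in training
        smoothed = (word_counts[label][word] + alpha) / (total_words + alpha * len(vocabulary))
        score += math.log(smoothed)
    return score


def classify(text: str) -> str:
    """Return the class with the highest log score."""
    return max(class_counts, key=lambda label: log_score(text, label))


print(classify("i love this"), classify("i feel terrible"))
```

Each word contributes its own factor to the score regardless of which other words appear, which is exactly the "naive" part of the model.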

In [None]:
from sklearn.naive_bayes import MultinomialNB

In [None]:
naive_bayes_alg = MultinomialNB().fit(x_train, y_train)

y_prediction_naive_bayes = naive_bayes_alg.predict(x_test)

# scikit-learn metrics expect the arguments in the order (y_true, y_pred)
accuracy_score(y_test, y_prediction_naive_bayes)

In [None]:
print(classification_report(y_test, y_prediction_naive_bayes, zero_division=0))

## Linear Support Vector Machine (Linear SVC)
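- A linear Support Vector Machine (SVM) is a linear model for classification; related variants exist for regression.
- SVMs in general can also solve non-linear problems by using kernels, and they work well for many practical problems. The linear variant used here learns a linear decision boundary.
- The idea of SVM is simple: the algorithm finds a line or hyperplane that separates the data into classes, ideally with the widest possible margin.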
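
The separating hyperplane can be illustrated with a few 2-D points. The points, the weight vector `w`, and the bias `b` below are hypothetical values picked by hand, not the result of training; the sketch only shows how a linear decision function assigns classes and how the hinge loss measures margin violations.

```python
# Hypothetical 2-D points with labels +1 / -1 (illustrative only).
points = [(2.0, 2.0), (3.0, 1.0), (-2.0, -1.0), (-1.0, -3.0)]
labels = [1, 1, -1, -1]

# Hypothetical hyperplane w . x + b = 0 (chosen by hand, not learned).
w = (0.5, 0.5)
b = 0.0


def decision_function(x):
    """Signed score of a point; its sign gives the side of the hyperplane."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def hinge_loss(x, label):
    """Zero when the point is on the correct side with a margin of at least 1."""
    return max(0.0, 1.0 - label * decision_function(x))


predictions = [1 if decision_function(x) >= 0 else -1 for x in points]
total_hinge = sum(hinge_loss(x, y) for x, y in zip(points, labels))
print(predictions, total_hinge)
```

Training a linear SVM amounts to searching for the `w` and `b` that keep this hinge loss small while also keeping the weights small, which corresponds to a wide margin.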

In [None]:
from sklearn.linear_model import SGDClassifier
# SGDClassifier with its default hinge loss fits a linear SVM using stochastic gradient descent
# other implementations exist, such as sklearn.svm.LinearSVC (liblinear)
# and sklearn.svm.SVC (libsvm)

In [None]:
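# loss="hinge" is the default and is what makes this a linear SVM;
# random_state makes the stochastic training reproducible
svm_algo = SGDClassifier(loss="hinge", random_state=1).fit(x_train, y_train)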

y_prediction_svm = svm_algo.predict(x_test)

# scikit-learn metrics expect the arguments in the order (y_true, y_pred)
accuracy_score(y_test, y_prediction_svm)

In [None]:
print(classification_report(y_test, y_prediction_svm, zero_division=0))

> NOTE
> - the scores are not great on such a small dataset, so we need to think about improving our data
> - we can collect more data, clean the data, or try more advanced models
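
As one possible reading of "clean the data", here is a minimal sketch of a text-cleaning helper. The function name `clean_text`, the stop-word list, and the rule of dropping very short tokens are all assumptions made for illustration. Whether such cleaning actually improves the scores would need to be checked on the data.

```python
import re

# Hypothetical, deliberately tiny stop-word list (an assumption for illustration).
# Negations such as "not" are kept on purpose because they carry sentiment.
STOP_WORDS = {"and", "the", "for", "with", "that", "was"}


def clean_text(text: str) -> str:
    """Lowercase, replace non-letters with spaces, and drop stop words and short tokens."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # punctuation and apostrophes become spaces
    # Dropping tokens of one or two characters is a crude way to remove
    # contraction fragments such as the "ve" left over from "i've".
    tokens = [t for t in text.split() if len(t) > 2 and t not in STOP_WORDS]
    return " ".join(tokens)


cleaned = clean_text("I received a parking ticket and it ruined my day!")
print(cleaned)
```

A helper like this could be applied with `data["text"].apply(clean_text)` before vectorizing, so that the bag-of-words vocabulary contains fewer uninformative tokens.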