<center> <img src="https://yildirimcaglar.github.io/ds3000/ds3000.png"> </center>

<center> <h2> Text Classifiers</h2></center>

## Outline
1. <a href='#1'>Logistic Regression</a>
2. <a href='#2'>Multilayer Perceptron Classifier</a>




<a id="1"></a>

## 1. Logistic Regression
* Despite its name, logistic regression is a classification algorithm, not a regression algorithm
* https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
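
To illustrate why logistic regression counts as a classifier, here is a minimal sketch using only the standard library. It assumes the usual binary formulation: a linear score is squashed into a probability by the sigmoid function, and the probability is then thresholded into a class label. The helper names `sigmoid` and `predict_label` and the example scores are hypothetical, chosen only for this illustration.

```python
import math

def sigmoid(z):
    """Map a real-valued linear score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_label(score, threshold=0.5):
    """Turn a linear score into a class label (1 or 0) by thresholding its probability."""
    return 1 if sigmoid(score) > threshold else 0

# hypothetical linear scores (w . x + b) for three reviews
scores = [-2.0, 0.5, 3.0]

probabilities = [sigmoid(s) for s in scores]
labels = [predict_label(s) for s in scores]

print(probabilities)  # probability of the positive class for each score
print(labels)         # [0, 1, 1]
```

The output of the model is a discrete label, which is what makes this a classification method even though a continuous probability is computed along the way.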

In [1]:
import pandas as pd
data = pd.read_csv("res/game_reviews.csv")
features = data["comment"]
target = data["sentiment"]

In [2]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

X_train, X_test, y_train, y_test = train_test_split(features, target, random_state=3000)


#create the vocabulary based on the training data
vect = TfidfVectorizer(min_df=5, ngram_range=(1,2)).fit(X_train)

#encode the words in X_train and X_test based on the vocabulary
X_train_vectorized = vect.transform(X_train)
X_test_vectorized = vect.transform(X_test)

#train the classifier
model = LogisticRegression().fit(X=X_train_vectorized, y=y_train)


print("Classification accuracy on training set: ", model.score(X_train_vectorized, y_train))
print("Classification accuracy on testing set: ", model.score(X_test_vectorized, y_test))


Classification accuracy on training set:  0.9012009495880463
Classification accuracy on testing set:  0.8217801047120419


### 1.1. Probability Estimates
* Logistic regression can return a probability estimate for each prediction, not just a class label
* Use model.predict_proba()
    * The returned columns are ordered by class label, in the same order as model.classes_


* https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression.predict_proba
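
The column ordering described above can be checked on a small example. This is a sketch on made-up data, not the game review dataset: the `toy_` names, the comments, and the labels are all hypothetical. It shows that the columns of `predict_proba()` follow `classes_`, that each row sums to 1, and that `predict()` agrees with picking the class with the highest probability.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# made-up comments and labels (1 = recommended, 0 = not recommended)
toy_comments = [
    "great game loved it",
    "amazing fun and great story",
    "loved the fun gameplay",
    "terrible boring game",
    "boring and buggy mess",
    "awful buggy terrible controls",
]
toy_labels = [1, 1, 1, 0, 0, 0]

# vectorize the text and fit the classifier
toy_vect = TfidfVectorizer().fit(toy_comments)
toy_X = toy_vect.transform(toy_comments)
toy_model = LogisticRegression().fit(toy_X, toy_labels)

toy_proba = toy_model.predict_proba(toy_X)

# column i of toy_proba holds the probability of class toy_model.classes_[i]
print(toy_model.classes_)

# each row is a probability distribution over the classes
print(toy_proba.sum(axis=1))

# the predicted label is the class whose column has the highest probability
toy_pred = toy_model.classes_[np.argmax(toy_proba, axis=1)]
print(toy_pred)
```

Reading the column names from `classes_` rather than hard-coding them avoids mislabeling the columns if the labels are encoded differently.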



In [3]:
model.predict(X_test_vectorized)

array([0, 1, 1, ..., 1, 0, 0], dtype=int64)

In [4]:
model.predict_proba(X_test_vectorized)

array([[0.74943183, 0.25056817],
       [0.07551562, 0.92448438],
       [0.27700778, 0.72299222],
       ...,
       [0.49807486, 0.50192514],
       [0.58939542, 0.41060458],
       [0.65454571, 0.34545429]])

In [5]:
proba = pd.DataFrame(model.predict_proba(X_test_vectorized), columns = ["Not recommended", "Recommended"])
proba["prediction"] = model.predict(X_test_vectorized)

In [6]:
proba

| | Not recommended | Recommended | prediction |
|---|---|---|---|
| 0 | 0.749432 | 0.250568 | 0 |
| 1 | 0.075516 | 0.924484 | 1 |
| 2 | 0.277008 | 0.722992 | 1 |
| 3 | 0.590125 | 0.409875 | 0 |
| 4 | 0.111302 | 0.888698 | 1 |
| ... | ... | ... | ... |
| 4770 | 0.271429 | 0.728571 | 1 |
| 4771 | 0.633448 | 0.366552 | 0 |
| 4772 | 0.498075 | 0.501925 | 1 |
| 4773 | 0.589395 | 0.410605 | 0 |


<a id="2"></a>

## 2. Multilayer Perceptron Classifier
* Simple, feed-forward neural network
* hidden_layer_sizes = (10,) adds a single hidden layer with 10 hidden units (a bare (10) is just the integer 10 in Python, which scikit-learn also accepts)

* https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html#sklearn.neural_network.MLPClassifier
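
To make the effect of the hidden layer setting concrete, here is a small sketch on random data. The data, the `toy_` names, and the `max_iter` value are hypothetical and chosen only for this illustration. It shows how one hidden layer with 10 units is reflected in the fitted attributes `n_layers_`, `coefs_`, and `intercepts_`.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# random toy data: 40 samples, 6 features, binary target
rng = np.random.RandomState(0)
toy_X = rng.rand(40, 6)
toy_y = (toy_X[:, 0] > 0.5).astype(int)

# one hidden layer with 10 units; (10,) is a one-element tuple
toy_mlp = MLPClassifier(hidden_layer_sizes=(10,), max_iter=500,
                        random_state=0).fit(toy_X, toy_y)

# input layer + one hidden layer + output layer
print(toy_mlp.n_layers_)  # 3

# one weight matrix per connection between consecutive layers
toy_weight_shapes = [w.shape for w in toy_mlp.coefs_]
print(toy_weight_shapes)  # [(6, 10), (10, 1)]

# one bias vector per non-input layer
toy_bias_shapes = [b.shape for b in toy_mlp.intercepts_]
print(toy_bias_shapes)  # [(10,), (1,)]
```

The first weight matrix has one row per input feature and one column per hidden unit; the second maps the 10 hidden units to the single output unit used for a binary target.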


In [7]:
from sklearn.neural_network import MLPClassifier

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

X_train, X_test, y_train, y_test = train_test_split(features, target, random_state=3000)


#create the vocabulary based on the training data
vect = TfidfVectorizer(min_df=5, ngram_range=(1,2)).fit(X_train)

#encode the words in X_train and X_test based on the vocabulary
X_train_vectorized = vect.transform(X_train)
X_test_vectorized = vect.transform(X_test)

#train the classifier
#hidden_layer_sizes=(10,) is a one-element tuple: a single hidden layer with 10 units
model = MLPClassifier(solver="adam", hidden_layer_sizes=(10,), activation="relu",
                      random_state=3000).fit(X=X_train_vectorized, y=y_train)


print("Classification accuracy on training set: ", model.score(X_train_vectorized, y_train))
print("Classification accuracy on testing set: ", model.score(X_test_vectorized, y_test))


Classification accuracy on training set:  0.9969278033794163
Classification accuracy on testing set:  0.7951832460732984


In [8]:
model.n_layers_

3

In [9]:
model.coefs_

[array([[ 0.28296065, -0.22105024, -0.2174355 , ..., -0.23400576,
          0.29859513,  0.31121737],
        [ 0.13273665, -0.09415211, -0.10321598, ..., -0.08206359,
          0.12935822,  0.13261685],
        [ 0.00774221,  0.03681403,  0.02944204, ...,  0.02962293,
          0.00253839,  0.00299492],
        ...,
        [-0.01895684,  0.05293769,  0.0586672 , ...,  0.057522  ,
         -0.02275426, -0.03385177],
        [-0.27278427,  0.2912648 ,  0.27547451, ...,  0.27952719,
         -0.2690866 , -0.28629307],
        [ 0.10712993, -0.11743189, -0.13380156, ..., -0.12133759,
          0.1017334 ,  0.10750843]]),
 array([[-2.90132981],
        [ 3.06195661],
        [ 3.19361664],
        [ 3.16842952],
        [-2.74120839],
        [ 3.19857573],
        [-2.58825901],
        [ 3.12090005],
        [-2.74447616],
        [-2.64547645]])]