# Text Classification with spaCy

This notebook trains a text classifier on movie reviews. The dataset is binary: each review is labeled either positive or negative, and the model learns to predict that label.

# Preparations for training the model

In [6]:
from spacy.tokens import DocBin
from ml_datasets import imdb # The IMDB user reviews dataset, used to train the model.
import spacy

In [7]:
train_data, valid_data = imdb() # Training and validation splits, each a list of (text, label) tuples.

Each review gets two category values. A positive review has `positive` set to 1 and `negative` set to 0; a negative review has the opposite.

In [8]:
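def make_docs(data):
    # Relies on the global `nlp` pipeline, which is loaded in the next cell.
    docs = []
    # as_tuples=True lets nlp.pipe take (text, label) pairs and yield (doc, label) pairs.
    for doc, label in nlp.pipe(data, as_tuples=True):
        if label == "pos":
            doc.cats["positive"] = 1
            doc.cats["negative"] = 0
        else:
            doc.cats["positive"] = 0
            doc.cats["negative"] = 1
        docs.append(doc)
    return docs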
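The label mapping can be checked on its own, without loading a pipeline. The sketch below isolates it in a helper; `label_to_cats` is a hypothetical name that does not appear in the notebook, and it mirrors the original logic in treating any label other than `"pos"` as negative.

```python
def label_to_cats(label):
    """Map a dataset label to two mutually exclusive category values.

    Hypothetical helper: mirrors the branch inside make_docs, where any
    label other than "pos" is treated as negative.
    """
    is_positive = label == "pos"
    return {"positive": int(is_positive), "negative": int(not is_positive)}


print(label_to_cats("pos"))
print(label_to_cats("neg"))
```

Inside `make_docs`, the same dictionary could be assigned with `doc.cats = label_to_cats(label)`.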

In [9]:
nlp = spacy.load("en_core_web_trf") # Loading the spaCy English transformer pipeline (roberta-base).

Create the training and validation files. Only the first 500 reviews of each split are used.

In [12]:
num_texts = 500

train_docs = make_docs(train_data[:num_texts])
doc_bin = DocBin(docs=train_docs)
doc_bin.to_disk("./pandas/train.spacy")

valid_docs = make_docs(valid_data[:num_texts])
doc_bin = DocBin(docs=valid_docs)
doc_bin.to_disk("./pandas/valid.spacy")

# Testing the model

Training took 8 hours, and the resulting model reached a precision of 0.96.
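For reference, precision is the share of predicted positives that are actually positive. The notebook does not show how the 0.96 figure was computed, so the following is only a minimal sketch of the definition on made-up toy labels; the `precision` function and the sample lists are illustrative and not taken from the training run.

```python
def precision(gold, predicted, target="positive"):
    """Precision for one class: true positives / (true positives + false positives)."""
    true_pos = sum(1 for g, p in zip(gold, predicted) if p == target and g == target)
    false_pos = sum(1 for g, p in zip(gold, predicted) if p == target and g != target)
    if true_pos + false_pos == 0:
        return 0.0  # The class was never predicted, so precision is undefined; report 0.
    return true_pos / (true_pos + false_pos)


# Toy data (not from the IMDB run): 3 predicted positives, 2 of them correct.
gold = ["positive", "negative", "positive", "negative", "positive"]
predicted = ["positive", "positive", "positive", "negative", "negative"]
print(precision(gold, predicted))
```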

In [8]:
nlp = spacy.load("output/model-best") # The best model.

In [9]:
text = train_data[2072] # A review outside the first 500, so the model has not seen it during training.

In [16]:
doc = nlp(text[0])
text

('This is a very old and cheaply made film--a typical low-budget B-Western in so many ways. Gary Cooper was not yet a star and this film is highly reminiscent of the early films of John Wayne that were done for "poverty row" studios. With both actors, their familiar style and persona were still not completely formed. This incarnation of Gary Cooper doesn\'t seem exactly like the Cooper of just a few years later (he talks faster in this early film, among other things).\n\n\n\nHowever, unlike the average B-movie of the era, there are at least a few interesting elements that make the film unique (if not good). If you ever want to see the woman that was married to Errol Flynn for seven years, this is your chance. Lili Damita stars as the female love interest and this is a very, very odd casting choice, as she has a heavy accent (she was French) and wasn\'t even close to being "movie star pretty". Incidentally, she was also married to director Michael Curtiz. \n\n\n\nBut for me, the most me

In [11]:
doc.cats # The predicted score for each category.

{'positive': 0.09725569933652878, 'negative': 0.8936318755149841}

As can be seen, the model is relatively sure that this is a negative review, and the end of the review text confirms that it is.
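To turn the scores into a single prediction, one option is to take the category with the highest score. The sketch below does this for the scores printed above; `top_label` is a hypothetical helper, not part of spaCy.

```python
def top_label(cats):
    """Return the highest-scoring category and its score."""
    label = max(cats, key=cats.get)
    return label, cats[label]


# Scores copied from the doc.cats output above.
cats = {"positive": 0.09725569933652878, "negative": 0.8936318755149841}
label, score = top_label(cats)
print(label, round(score, 3))  # negative 0.894
```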

In [13]:
text2 = "I like this movie."
doc2 = nlp(text2)
doc2.cats

{'positive': 0.9999939203262329, 'negative': 4.825131782126846e-06}

The other models perform poorly on short texts and on texts unrelated to movies, because the dataset consists entirely of long, movie-related reviews. The best model also handles short texts reasonably well.

In [18]:
text3 = "This movie is 5/10. Not bad, not good." # A neutral and short review.
doc3 = nlp(text3)
doc3.cats

{'positive': 0.40605244040489197, 'negative': 0.4592644274234772}
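Here the two scores are close, which fits a neutral review. One way to surface such cases is to require a minimum gap between the top two scores before committing to a label. This is a sketch only: `decide` and the `min_margin` value of 0.2 are arbitrary assumptions for illustration, not something the notebook or spaCy prescribes.

```python
def decide(cats, min_margin=0.2):
    """Return the top label, or "undecided" if the top two scores are too close.

    min_margin is an arbitrary illustrative threshold.
    """
    ranked = sorted(cats.items(), key=lambda item: item[1], reverse=True)
    (best_label, best_score), (_, runner_up_score) = ranked[0], ranked[1]
    if best_score - runner_up_score < min_margin:
        return "undecided"
    return best_label


# Scores copied from the notebook outputs for the neutral and the short positive review.
neutral_cats = {"positive": 0.40605244040489197, "negative": 0.4592644274234772}
short_cats = {"positive": 0.9999939203262329, "negative": 4.825131782126846e-06}
print(decide(neutral_cats))  # undecided
print(decide(short_cats))    # positive
```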

In [19]:
text4 = "Honest and totally honest. If people are giving 8-9 stars for just graphics and designing & VFX, back up people, there is alot more there should be for a 9 star. It starts good, it gives you hopes, the comedy is good, but then, it gets all sentimental and emotional. I know it is needed, but it just gets so much that it kind of gets sleepy. They carry the story too slow because of this emotional backstories, not actually focusing on the main writing, I do not really think the back stories were really required so much and even more so, not so much in DETAIL. After all things said and done, the movie heads to main plotline, how to get to stop the villain, a squad is formed for that, the main character is stuck and needs to find a way out, it all gets interesting, and then, A CLIFFHANGER. End. Wasted too much time in the unnecessary and making a part 3 out of that seems like a loosen your pocket for more in the next one. People giving 8-9 stars, even 10, seriously! Grow up kiddos. This movie wasn not MARVEL at all. It was Disneyland for kids."
doc4 = nlp(text4)
doc4.cats

{'positive': 0.0020606203470379114, 'negative': 0.9981915354728699}

This review was taken manually from the IMDB website. The user gave the movie 5 stars but gave little credit to the parts they liked. As this example shows, the review text is a more reliable signal than the star rating.

In [20]:
text4

'Honest and totally honest. If people are giving 8-9 stars for just graphics and designing & VFX, back up people, there is alot more there should be for a 9 star. It starts good, it gives you hopes, the comedy is good, but then, it gets all sentimental and emotional. I know it is needed, but it just gets so much that it kind of gets sleepy. They carry the story too slow because of this emotional backstories, not actually focusing on the main writing, I do not really think the back stories were really required so much and even more so, not so much in DETAIL. After all things said and done, the movie heads to main plotline, how to get to stop the villain, a squad is formed for that, the main character is stuck and needs to find a way out, it all gets interesting, and then, A CLIFFHANGER. End. Wasted too much time in the unnecessary and making a part 3 out of that seems like a loosen your pocket for more in the next one. People giving 8-9 stars, even 10, seriously! Grow up kiddos. This mo