This code will help understanding text and its classification using spacy packages. All the preprocessing of the text will be done using Spacy packages. For training model, either spacy or skicit packages will be used. 

In [1]:
import pandas as pd
import numpy as np
import spacy
import re
import tqdm
from spacy.tokenizer import Tokenizer
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn import metrics 
import xgboost as xgb
import warnings
warnings.filterwarnings('ignore')

Read the input data.

In [2]:
df = pd.read_csv('input.csv')
print(df)

                                                  text  target
0    If you like adult comedy cartoons, like South ...       1
1    Bromwell High is a cartoon comedy. It ran at t...       1
2    All the world's a stage and its people actors ...       1
3    FUTZ is the only show preserved from the exper...       1
4    I came in in the middle of this film so I had ...       1
..                                                 ...     ...
97   When a man who doesn't have Alzheimer's can't ...       0
98   There are some nice shots in this film, it cat...       0
99   A very cheesy and dull road movie, with the in...       0
100  Three part "horror" film with some guy in a bo...       0
101  Three part "horror" film with some guy in a bo...       0

[102 rows x 2 columns]


Stop words are removed first and then lemmatization is applied by converting to root words if any.

In [3]:
nlp = spacy.load("en_core_web_sm")

stop_words = nlp.Defaults.stop_words

stop_words.add("in")

count = 0


df['text_no_stop'] = df['text']

for text in df['text']:
    words = []
    doc = nlp(text)
    for word in doc:
        if word.text not in stop_words:
            words.append(word.lemma_)
            
    
    stop_word = re.sub(r'[^0-9a-zA-Z_//.]+'," ",str(words))
    
       
    df['text_no_stop'][count] = stop_word
   
    
    count+=1
    

In [4]:
df

Unnamed: 0,text,target,text_no_stop
0,"If you like adult comedy cartoons, like South ...",1,if like adult comedy cartoon like South Park ...
1,Bromwell High is a cartoon comedy. It ran at t...,1,Bromwell High cartoon comedy . it run time pr...
2,All the world's a stage and its people actors ...,1,all world stage people actor like . who hell ...
3,FUTZ is the only show preserved from the exper...,1,FUTZ preserve experimental theatre movement N...
4,I came in in the middle of this film so I had ...,1,I come middle film I idea credit title till I...
...,...,...,...
97,When a man who doesn't have Alzheimer's can't ...,0,when man Alzheimer remember film probably wor...
98,"There are some nice shots in this film, it cat...",0,there nice shot film catch landscape beautifu...
99,"A very cheesy and dull road movie, with the in...",0,a cheesy dull road movie intention hip modern...
100,"Three part ""horror"" film with some guy in a bo...",0,three horror film guy board house implore vie...


Token2Vec conversion.

In [5]:
count = 0

df['word_vector'] = df['text']

for text in df['text_no_stop']:
    
    word_vec=[]
    
    doc = nlp(text)
   
    df['word_vector'][count] = doc.vector
    
    count+=1


In [6]:
"""Reshaping the Spacy Vector Matrix so that this vector matrix will be used for SCIKIT models"""
vector_list = np.concatenate([nlp(i).vector.reshape(1,-1) for i in df['text_no_stop']])
print(vector_list.shape)

(102, 96)


As the input is very small, instead of splitting the input data I am using different files for testing and scoring.

In [7]:
test = pd.read_csv('test.csv')
test.head()


Unnamed: 0,text,target
0,"What an inspiring movie, I laughed, cried and ...",1
1,This is just a short comment but I stumbled on...,1
2,My family and I have viewed this movie often o...,1
3,What a lovely heart warming television movie. ...,1
4,"What are the odds of a ""Mermaid"" helium balloo...",1


Do the same preprocessing step for test file as input file.  If large input file, this will be done at once and then split the input as train and test. 

In [8]:
count = 0


test['text_no_stop'] = test['text']

for test_text in df['text']:
    words = []
    doc = nlp(test_text)
    for word in doc:
        if word.text not in stop_words:
            words.append(word.lemma_)
            
    
    stop_word = re.sub(r'[^0-9a-zA-Z_//.]+'," ",str(words))
    
       
    test['text_no_stop'][count] = stop_word
   
    
    count+=1

In [9]:
count = 0

test['word_vector'] = test['text']

for test_text in test['text_no_stop']:
    
    word_vec=[]
    
    doc = nlp(test_text)
   
    test['word_vector'][count] = doc.vector
    
    count+=1

In [10]:
test_vector_list = np.concatenate([nlp(i).vector.reshape(1,-1) for i in test['text_no_stop']])
print(test_vector_list.shape)

(40, 96)


In [11]:
"""Fitting with the logistic Regression model"""
model = LogisticRegression()
clf = model.fit(vector_list,df['target'])


In [13]:
"""Predict the target for the test data and then score with the actuals"""

y_test = np.array(test['target'])
score = model.score(test_vector_list,y_test)
y_hat = model.predict(test_vector_list)
accuracy_score = metrics.accuracy_score(y_test,y_hat)
print(score)
print(accuracy_score)

0.5
0.5


The above result is not really great accuracy as we have only 50% right result. Fine tuning the data by adding more preprocessing steps will give a higher accuracy.  And using large data will automatically increase the accuracy and reduce the loss.  This code is not really for optimization instead to understand how to preprocess using Spacy and then model with different libraries. 