https://realpython.com/python-keras-text-classification/

In [4]:
import pandas as pd


In [5]:
filepath_dict={'yelp':"data/sentiment_analysis/yelp_labelled.txt",
              'amazon':"data/sentiment_analysis/yelp_labelled.txt",
              "imdb":"data/sentiment_analysis/yelp_labelled.txt"}

In [6]:
filepath_dict

{'yelp': 'data/sentiment_analysis/yelp_labelled.txt',
 'amazon': 'data/sentiment_analysis/yelp_labelled.txt',
 'imdb': 'data/sentiment_analysis/yelp_labelled.txt'}

In [7]:
df_list=[]

for source, filepath in filepath_dict.items():
    #Name=equence of Hashable, optional Sequence of column labels to apply. If the file contains a header row,
    #then you should explicitly pass ``header=0`` to override the column names.
    df=pd.read_csv(filepath,names=['sentence', 'label'],sep="\t")
    df['source']=source ## Add another column filled with the source name
    df_list.append(df)
    

df=pd.concat(df_list) #Concatenate pandas objects along a particular axis.


In [10]:

df_list[0:3]

[                                              sentence  label source
 0                             Wow... Loved this place.      1   yelp
 1                                   Crust is not good.      0   yelp
 2            Not tasty and the texture was just nasty.      0   yelp
 3    Stopped by during the late May bank holiday of...      1   yelp
 4    The selection on the menu was great and so wer...      1   yelp
 ..                                                 ...    ...    ...
 995  I think food should have flavor and texture an...      0   yelp
 996                           Appetite instantly gone.      0   yelp
 997  Overall I was not impressed and would not go b...      0   yelp
 998  The whole experience was underwhelming, and I ...      0   yelp
 999  Then, as if I hadn't wasted enough of my life ...      0   yelp
 
 [1000 rows x 3 columns],
                                               sentence  label  source
 0                             Wow... Loved this place.      

In [29]:
df.iloc[0:11]

Unnamed: 0,sentence,label,source
0,Wow... Loved this place.,1,yelp
1,Crust is not good.,0,yelp
2,Not tasty and the texture was just nasty.,0,yelp
3,Stopped by during the late May bank holiday of...,1,yelp
4,The selection on the menu was great and so wer...,1,yelp
5,Now I am getting angry and I want my damn pho.,0,yelp
6,Honeslty it didn't taste THAT fresh.),0,yelp
7,The potatoes were like rubber and you could te...,0,yelp
8,The fries were great too.,1,yelp
9,A great touch.,1,yelp


The resulting vector is also called a feature vector. In a feature vector, each dimension can be a numeric or categorical feature, like for example the height of a building, the price of a stock, or, in our case, the count of a word in a vocabulary.
Let’s quickly illustrate this. Imagine you have the following two sentences:

In [14]:
sentences = ['John likes ice cream', 'John hates chocolate.']

In [18]:
from sklearn.feature_extraction.text import CountVectorizer
#(we used it in our Recommender System)


vectorizer=CountVectorizer(min_df=0.0,lowercase=False)
vectorizer.fit(sentences)
vectorizer.vocabulary_


{'John': 0, 'likes': 5, 'ice': 4, 'cream': 2, 'hates': 3, 'chocolate': 1}

This vocabulary serves also as an index of each word. Now, you can take each sentence and get the word occurrences of the words based on the previous vocabulary. The vocabulary consists of all five words in our sentences, each representing one word in the vocabulary.When you take the previous two sentences and transform them with the CountVectorizer you will get a vector representing the count of each word of the sentence:

In [23]:
vectorizer.transform(sentences).toarray()


array([[1, 0, 1, 0, 1, 1],
       [1, 1, 0, 1, 0, 0]], dtype=int64)

This is considered a Bag-of-words (BOW) model, which is a common way in NLP to create vectors out of text.Each document is represented as a vector. You can use these vectors now as feature vectors for a machine learning model. This leads us to our next part, defining a baseline model.

## Defining a Baseline Model:

When you work with machine learning, one important step is to define a baseline model. This usually involves a simple model, which is then used as a comparison with the more advanced models that you want to test. In this case, you’ll use the baseline model to compare it to the more advanced methods involving (deep) neural networks, the meat and potatoes of this tutorial.

We start by taking the Yelp data set which we extract from our concatenated data set. From there, we take the sentences and labels. The .values returns a NumPy array instead of a Pandas Series object which is in this context easier to work with:

In [33]:
from sklearn.model_selection import train_test_split 

df_yelp=df[df['source']=='yelp']
sentences=df_yelp['sentence'].values
y=df_yelp['label'].values

In [35]:
sentences[0:5],y[0:5]

(array(['Wow... Loved this place.', 'Crust is not good.',
        'Not tasty and the texture was just nasty.',
        'Stopped by during the late May bank holiday off Rick Steve recommendation and loved it.',
        'The selection on the menu was great and so were the prices.'],
       dtype=object),
 array([1, 0, 0, 1, 1], dtype=int64))

In [36]:
sentences_train, sentences_test, y_train, y_test=train_test_split(sentences,y,test_size=0.25,random_state=1000)

Here we will use again on the previous BOW model to vectorize the sentences. You can use again the CountVectorizer for this task. Since you might not have the testing data available during training, you can create the vocabulary using only the training data. Using this vocabulary, you can create the feature vectors for each sentence of the training and testing set:

In [44]:
vectorizer.fit(sentences_train)

#teansformer:Transform documents to document-term matrix. Extract token counts out of raw text documents using the vocabulary
#fitted with fit or the one provided to the constructor.
    
X_train=vectorizer.transform(sentences_train)
X_test=vectorizer.transform(sentences_test)
X_train

<750x1938 sparse matrix of type '<class 'numpy.int64'>'
	with 7453 stored elements in Compressed Sparse Row format>

You can see that the resulting feature vectors have 750 samples which are the number of training samples we have after the train-test split. Each sample has 2505 dimensions which is the size of the vocabulary. Also, you can see that we get a sparse matrix. This is a data type that is optimized for matrices with only a few non-zero elements, which only keeps track of the non-zero elements reducing the memory load

CountVectorizer performs tokenization which separates the sentences into a set of tokens as you saw previously in the vocabulary. It additionally removes punctuation and special characters and can apply other preprocessing to each word. If you want, you can use a custom tokenizer from the NLTK library with the CountVectorizer or use any number of the customizations which you can explore to improve the performance of your model. The classification model we are going to use is the logistic regression which is a simple yet powerful linear mode

In [40]:
from sklearn.linear_model import LogisticRegression 

classifier=LogisticRegression()

In [45]:
classifier.fit(X_train,y_train)
score=classifier.score(X_test,y_test)

print("Accuracy score:",score)

Accuracy score: 0.772
