**Bag Of Words Model**

In [150]:
import numpy as np
import pandas as pd

**A function to load a text file into a table**

In [151]:
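def data_table(doc):
    # Read a tab-separated file with no header row: review text, then its 0/1 label.
    df = pd.read_csv(doc, delimiter='\t', header=None, names=['Review_text', 'Review_class'])
    # Note: head() returns only the first five rows, so later cells see a 5-row sample per file.
    return df.head()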
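
Because `data_table` ends with `df.head()`, each call yields only the first five rows of a file, which is why the combined table later has 15 rows. If the whole file is wanted, a variant along these lines could be used. The name `load_reviews` and the in-memory sample rows are illustrative assumptions, not part of the original notebook.

```python
import io
import pandas as pd

def load_reviews(source):
    """Read a tab-separated review file and return every row (hypothetical helper)."""
    return pd.read_csv(
        source,
        delimiter='\t',
        header=None,
        names=['Review_text', 'Review_class'],
    )

# Illustrative in-memory stand-in for a labelled text file (made-up rows).
sample = io.StringIO(
    "Great phone\t1\n"
    "Bad battery\t0\n"
    "Works fine\t1\n"
    "Stopped charging\t0\n"
    "Love it\t1\n"
    "Too expensive\t0\n"
)
reviews = load_reviews(sample)
print(reviews.shape)  # all six rows are kept, not just the first five
```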

**Read the 1st text file**

In [152]:
data_amazon = data_table('amazon_cells_labelled.txt')

**Read the 2nd text file**

In [153]:
data_yelp = data_table('yelp_labelled.txt')

**Read the 3rd text file**

In [154]:
data_imdb = data_table('imdb_labelled.txt')

**Concatenate all data into one table**

In [165]:
data = pd.concat([data_amazon,data_yelp,data_imdb])
data.head()

| | Review_text | Review_class |
|---|---|---|
| 0 | So there is no way for me to plug it in here i... | 0 |
| 1 | Good case, Excellent value. | 1 |
| 2 | Great for the jawbone. | 1 |
| 3 | Tied to charger for conversations lasting more... | 0 |
| 4 | The mic is great. | 1 |
| 0 | Wow... Loved this place. | 1 |
| 1 | Crust is not good. | 0 |
| 2 | Not tasty and the texture was just nasty. | 0 |
| 3 | Stopped by during the late May bank holiday of... | 1 |
| 4 | The selection on the menu was great and so wer... | 1 |
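
The index in the combined table restarts at 0 for each source because `pd.concat` keeps the original row labels by default. A minimal sketch of one way to get a continuous index, using small made-up frames in place of the real data:

```python
import pandas as pd

# Two tiny stand-in frames (illustrative rows, not the real reviews).
a = pd.DataFrame({'Review_text': ['good', 'bad'], 'Review_class': [1, 0]})
b = pd.DataFrame({'Review_text': ['fine', 'poor'], 'Review_class': [1, 0]})

kept = pd.concat([a, b])                            # original labels are preserved
renumbered = pd.concat([a, b], ignore_index=True)   # labels are rebuilt from 0

print(kept.index.tolist())
print(renumbered.index.tolist())
```

A unique index matters when selecting rows by label: `kept.loc[0]` matches two rows, while `renumbered.loc[0]` matches one.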


**Shape of the combined table**

In [163]:
data.shape

(15, 2)

**Important Libraries for NLP**

In [157]:
import re
import string
import nltk
nltk.download('punkt')
nltk.download('stopwords')
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

[nltk_data] Downloading package punkt to /home/femi/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /home/femi/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


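**Steps**
<br><i>Create a **function** that:</i>
- creates an empty list for the cleaned text
- converts the review text column into a list and assigns it to a variable
- converts each review to lower case inside a **<i>for loop</i>**
- removes URLs and punctuation, then tokenizes the text
- drops non-alphabetic tokens and English stop words (keeping "not")
- stems each remaining word, joins the words back into a string, and appends it to the list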
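
The cleaning function in the next cell also removes English stop words but keeps "not", presumably because negation changes the sentiment of a review. The sketch below illustrates that choice with a tiny hand-written stop list, which is an assumption made for the example; the notebook itself uses NLTK's English list.

```python
# Tiny illustrative stop list; the notebook uses nltk's stopwords.words("english").
STOP_WORDS = {"the", "is", "was", "and", "not"}

def drop_stop_words(text, keep=()):
    """Remove stop words from a sentence, except those listed in `keep`."""
    active = STOP_WORDS - set(keep)
    return " ".join(w for w in text.lower().split() if w not in active)

sentence = "The crust is not good"
print(drop_stop_words(sentence))                # crust good
print(drop_stop_words(sentence, keep={"not"}))  # crust not good
```

Without the exception, a negative review would be reduced to the same tokens as a positive one.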

In [158]:
def clean_text(df):
    cleaned_reviews = []
    lines = df['Review_text'].values.tolist()
    # Build loop-invariant helpers once instead of on every iteration.
    pattern = re.compile(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[|*\(\),]|(?:%[0-9a-fA-F]))+')
    table = str.maketrans('', '', string.punctuation)
    stop_words = set(stopwords.words("english"))
    stop_words.discard("not")  # keep the negation word in the text
    PS = PorterStemmer()
    for eachText in lines:
        eachText = eachText.lower()
        # Remove URLs, then punctuation.
        eachText = pattern.sub('', eachText)
        eachText = re.sub(r"[,.\"!@#$%^&*(){}?/;'~:<>+=-]", "", eachText)
        tokens = word_tokenize(eachText)
        stripped = [w.translate(table) for w in tokens]
        # Keep alphabetic tokens only, drop stop words, and stem what remains.
        words = [word for word in stripped if word.isalpha()]
        words = [PS.stem(w) for w in words if w not in stop_words]
        cleaned_reviews.append(' '.join(words))
    return cleaned_reviews

cleaned_reviews = clean_text(data)
cleaned_reviews



['way plug us unless go convert',
 'good case excel valu',
 'great jawbon',
 'tie charger convers last minutesmajor problem',
 'mic great',
 'wow love place',
 'crust not good',
 'not tasti textur nasti',
 'stop late may bank holiday rick steve recommend love',
 'select menu great price',
 'slowmov aimless movi distress drift young man',
 'not sure lost flat charact audienc nearli half walk',
 'attempt arti black white clever camera angl movi disappoint becam even ridicul act poor plot line almost nonexist',
 'littl music anyth speak',
 'best scene movi gerardo tri find song keep run head']
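
Tokens such as 'minutesmajor' and 'slowmov' in this output most likely arise because punctuation is replaced with an empty string before tokenizing, which glues together words that were separated only by a punctuation mark. A hedged sketch of the effect, and of one possible alternative (substituting a space), using a made-up string and only the `re` module:

```python
import re

# Same character class as the one used in clean_text.
PUNCT = re.compile(r"[,.\"!@#$%^&*(){}?/;'~:<>+=-]")

def strip_punct(text, replacement):
    """Replace punctuation with `replacement`, then split on whitespace."""
    return PUNCT.sub(replacement, text.lower()).split()

text = "A slow-moving film.Very dull!"  # illustrative input, not from the dataset
print(strip_punct(text, ""))   # ['a', 'slowmoving', 'filmvery', 'dull']
print(strip_punct(text, " "))  # ['a', 'slow', 'moving', 'film', 'very', 'dull']
```

Whether to split or merge such tokens is a design choice; splitting usually gives a vocabulary with fewer one-off terms.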

**Build vocabulary**

In [162]:
# print(np.shape(data))
# print(np.shape(data['Review_class']))
data.shape

(15, 2)

In [160]:
from sklearn.feature_extraction.text import CountVectorizer
CV = CountVectorizer(min_df=3)   # keep only terms that appear in at least 3 reviews
X = CV.fit_transform(cleaned_reviews).toarray()   # document-term count matrix
y = data['Review_class'].values   # target vector
print(np.shape(X))
print(np.shape(y))

(15, 3)
(15,)
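
With `min_df=3`, `CountVectorizer` keeps only terms that occur in at least three documents, which is why the matrix has just three columns. The standard-library sketch below reproduces that document-frequency filter on the cleaned reviews shown above. It illustrates the idea rather than scikit-learn's implementation, and it splits on whitespace instead of using the vectorizer's token pattern.

```python
from collections import Counter

cleaned_reviews = [
    'way plug us unless go convert',
    'good case excel valu',
    'great jawbon',
    'tie charger convers last minutesmajor problem',
    'mic great',
    'wow love place',
    'crust not good',
    'not tasti textur nasti',
    'stop late may bank holiday rick steve recommend love',
    'select menu great price',
    'slowmov aimless movi distress drift young man',
    'not sure lost flat charact audienc nearli half walk',
    'attempt arti black white clever camera angl movi disappoint becam even '
    'ridicul act poor plot line almost nonexist',
    'littl music anyth speak',
    'best scene movi gerardo tri find song keep run head',
]

def vocabulary(docs, min_df):
    """Return the sorted terms that appear in at least `min_df` documents."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc.split()))  # count each term once per document
    return sorted(term for term, df in doc_freq.items() if df >= min_df)

vocab = vocabulary(cleaned_reviews, min_df=3)
# Bag-of-words matrix: one row per review, one column per vocabulary term.
X = [[doc.split().count(term) for term in vocab] for doc in cleaned_reviews]

print(vocab)              # ['great', 'movi', 'not']
print(len(X), len(X[0]))  # 15 3
```

With only 15 reviews, a threshold of 3 discards almost every term; a lower `min_df` would keep more of the vocabulary.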


In [161]:
pd.read_csv?

Signature:
pd.read_csv(
    filepath_or_buffer: Union[ForwardRef('PathLike[str]'), str, IO[~T], io.RawIOBase, io.BufferedIOBase, io.TextIOBase, _io.TextIOWrapper, mmap.mmap],
    sep=<object object at 0x7f6e31001320>,
    delimiter=None,
    header='infer',
    names=None