## [Python Keras text classification](https://realpython.com/python-keras-text-classification/)

In [2]:
import pandas as pd

filepath_dict = {'yelp':   'data/yelp_labelled.txt',
                 'amazon': 'data/amazon_cells_labelled.txt',
                 'imdb':   'data/imdb_labelled.txt'}

df_list = []
for source, filepath in filepath_dict.items():
    df = pd.read_csv(filepath, names=['sentence', 'label'], sep='\t')
    df['source'] = source  # Add another column filled with the source name
    df_list.append(df)

df = pd.concat(df_list)
print(df.iloc[0])

sentence    Wow... Loved this place.
label                              1
source                          yelp
Name: 0, dtype: object


In [4]:
from sklearn.feature_extraction.text import CountVectorizer
# quick two sentence demo
sentences = ['John likes ice cream.', 'John hates chocolate.']

# Next, you can use the CountVectorizer provided by the scikit-learn library to vectorize sentences. It takes the words of each sentence and creates a vocabulary of all the unique words in the sentences. This vocabulary can then be used to create a feature vector of the count of the words:

# min_df=1 (the default) keeps every word; recent scikit-learn releases reject the integer 0
cv = CountVectorizer(min_df=1, lowercase=False)
cv.fit(sentences)
cv.vocabulary_

# This vocabulary also serves as an index of each word. Now you can take each sentence and count its word occurrences based on this vocabulary. The vocabulary consists of all six distinct words in the two sentences, each mapped to one position in the feature vector. When you transform the two sentences with the CountVectorizer, you get a vector holding the count of each word in the sentence:

cv.transform(sentences).toarray()

# => array([[1, 0, 1, 0, 1, 1], [1, 1, 0, 1, 0, 0]])
# The resulting feature vectors for each sentence are based on the previous vocabulary.
# This is the Bag-of-Words (BOW) model, a common way to create vectors out of text:
# each document is represented as a vector of word counts.

array([[1, 0, 1, 0, 1, 1],
       [1, 1, 0, 1, 0, 0]], dtype=int64)
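
To see where these counts come from, here is a minimal pure-Python sketch of the same Bag-of-Words idea. It is not how CountVectorizer is implemented. The helper names `tokenize` and `bag_of_words` are made up for illustration, and the tokenization is a simplified assumption (runs of two or more word characters, case preserved as in the demo above).

```python
import re

sentences = ['John likes ice cream.', 'John hates chocolate.']


def tokenize(sentence):
    # Simplified assumption: a token is a run of two or more word characters.
    return re.findall(r"\b\w\w+\b", sentence)


def bag_of_words(docs):
    # Vocabulary: every distinct token, sorted, mapped to a column index.
    distinct = sorted({token for doc in docs for token in tokenize(doc)})
    vocabulary = {word: index for index, word in enumerate(distinct)}
    # One count vector per document, one column per vocabulary word.
    vectors = []
    for doc in docs:
        row = [0] * len(vocabulary)
        for token in tokenize(doc):
            row[vocabulary[token]] += 1
        vectors.append(row)
    return vocabulary, vectors


vocabulary, vectors = bag_of_words(sentences)
print(vocabulary)
print(vectors)  # [[1, 0, 1, 0, 1, 1], [1, 1, 0, 1, 0, 0]]
```

'John' lands in column 0 because uppercase letters sort before lowercase ones, which is why both vectors start with a 1.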

# Defining a baseline model
A baseline is a simple model used as a point of comparison for the more advanced models you want to test.  
In this case, the baseline is compared against the more advanced methods involving neural networks.  

First, split the data into a training set and a test set.  
This makes overfitting visible: an overfitted model has fit the training data too closely and has largely memorized it,  
which shows up as high accuracy on the training data but low accuracy on the test data.

Start with the Yelp set and extract the sentences and labels.  
`.values` returns a NumPy array instead of a pandas Series object, which is easier to work with here.



In [6]:
from sklearn.model_selection import train_test_split

df_yelp = df[df['source'] == 'yelp']

sentences = df_yelp['sentence'].values
y = df_yelp['label'].values

sentences_train, sentences_test, y_train, y_test = train_test_split(sentences, y, test_size=0.25, random_state=1000)
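
As a rough picture of what the split does, the sketch below shuffles the row indices with a fixed seed and cuts off a quarter of them for testing. It is an illustration only: `simple_train_test_split` and the toy data are hypothetical, and the function will not reproduce the exact rows that scikit-learn selects for `random_state=1000`.

```python
import random


def simple_train_test_split(items, labels, test_size=0.25, seed=1000):
    # Shuffle the indices reproducibly, then cut off the test share.
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    n_test = int(round(len(items) * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return ([items[i] for i in train_idx], [items[i] for i in test_idx],
            [labels[i] for i in train_idx], [labels[i] for i in test_idx])


# Toy stand-in data (hypothetical values), 1,000 rows.
toy_sentences = ['sentence {}'.format(i) for i in range(1000)]
toy_labels = [i % 2 for i in range(1000)]

s_train, s_test, l_train, l_test = simple_train_test_split(toy_sentences, toy_labels)
print(len(s_train), len(s_test))  # 750 250
```

Fixing the seed means the same rows end up in the test set on every run, so scores stay comparable between experiments.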

Here we again use the BOW model from before to vectorize the sentences.  
You can again use the CountVectorizer for this task. Since you might not have the testing data available during training, you can create the vocabulary using only the training data.  
Using this vocabulary, you can create the feature vectors for each sentence of the training and testing set:

In [7]:
cv = CountVectorizer()
cv.fit(sentences_train)

X_train = cv.transform(sentences_train)
X_test = cv.transform(sentences_test)

X_train

<750x1714 sparse matrix of type '<class 'numpy.int64'>'
	with 7368 stored elements in Compressed Sparse Row format>

750 = number of training samples  
1714 = size of the vocabulary  

sparse matrix = data type optimized for matrices with only a few non-zero elements; it stores only the non-zero elements, which reduces memory use.  
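
To make the "only non-zero elements" idea concrete, the sketch below builds the three arrays of a Compressed Sparse Row layout by hand for the small two-sentence matrix from earlier. The function `to_csr` is a hypothetical helper for illustration, not SciPy's implementation. The last lines compute how sparse the 750x1714 matrix is from the numbers in the output above.

```python
def to_csr(dense):
    # Keep only non-zero values, their column indices, and row boundaries.
    data, indices, indptr = [], [], [0]
    for row in dense:
        for col, value in enumerate(row):
            if value != 0:
                data.append(value)
                indices.append(col)
        # indptr[i]:indptr[i + 1] is the slice of data belonging to row i.
        indptr.append(len(data))
    return data, indices, indptr


dense = [[1, 0, 1, 0, 1, 1],
         [1, 1, 0, 1, 0, 0]]
data, indices, indptr = to_csr(dense)
print(data)     # [1, 1, 1, 1, 1, 1, 1]
print(indices)  # [0, 2, 4, 5, 0, 1, 3]
print(indptr)   # [0, 4, 7]

# Share of non-zero entries in the 750x1714 training matrix.
density = 7368 / (750 * 1714)
print('{:.2%}'.format(density))  # 0.57%
```

With well under 1% of the entries non-zero, storing the full dense matrix would mostly store zeros.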

CountVectorizer performs tokenization: it separates sentences into a set of tokens, as seen previously in the vocabulary.  
It also removes punctuation and special characters and can apply other preprocessing to each word.  

You can pass a custom tokenizer from the NLTK library to the CountVectorizer, or customize it in other ways.
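
The default behavior can be approximated with the standard library alone. The sketch below lowercases the text and applies the token pattern that the scikit-learn documentation lists as the CountVectorizer default. The function name `default_like_tokenize` is made up here, and the sketch ignores other options such as stop words or n-grams.

```python
import re

# Default token pattern documented for CountVectorizer.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def default_like_tokenize(text):
    # Lowercase first, then keep runs of two or more word characters.
    return TOKEN_PATTERN.findall(text.lower())


print(default_like_tokenize('Wow... Loved this place.'))
# ['wow', 'loved', 'this', 'place']
print(default_like_tokenize('I give it a 10/10!'))
# ['give', 'it', '10', '10']
```

Note that single-character tokens such as "I" and "a" are dropped along with the punctuation, which is one reason to swap in a custom tokenizer when those tokens matter.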

## Classification - logistic regression
simple but powerful linear model: a form of regression that maps the input feature vector to a value between 0 and 1  
by specifying a cutoff value (0.5 by default), the regression model is used for classification  
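
The cutoff idea can be sketched in a few lines: a weighted sum of the features is squashed by the sigmoid function into a value between 0 and 1, and that value is compared with the cutoff. The weights, bias, and three-word vocabulary below are invented for illustration and are not learned from the data.

```python
import math


def sigmoid(z):
    # Squash any real number into the open interval (0, 1).
    return 1.0 / (1.0 + math.exp(-z))


def predict(features, weights, bias, cutoff=0.5):
    # Linear score, then sigmoid, then compare with the cutoff.
    z = sum(w * x for w, x in zip(weights, features)) + bias
    probability = sigmoid(z)
    return probability, int(probability > cutoff)


# Hypothetical weights for a three-word vocabulary: ['bad', 'good', 'place'].
weights = [-1.5, 2.0, 0.1]
bias = -0.2

print(predict([0, 1, 1], weights, bias))  # 'good place' -> label 1
print(predict([1, 0, 1], weights, bias))  # 'bad place'  -> label 0
```

Training a logistic regression amounts to finding the weights and bias that make these probabilities match the labels as closely as possible.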



In [8]:
from sklearn.linear_model import LogisticRegression

classifier = LogisticRegression()
classifier.fit(X_train, y_train)
score = classifier.score(X_test, y_test)

print('accuracy: ', score)

accuracy:  0.796
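
For a classifier, `score` reports mean accuracy, that is, the fraction of test samples labeled correctly. A minimal sketch with a hypothetical `accuracy` helper and toy labels is shown below. The last line assumes 250 test sentences, which is what 750 training samples at `test_size=0.25` imply.

```python
def accuracy(y_true, y_pred):
    # Fraction of predictions that match the true labels.
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Toy labels (hypothetical): 4 of 5 predictions are right.
print(accuracy([1, 0, 1, 1, 0], [1, 0, 0, 1, 0]))  # 0.8

# 199 correct out of 250 test sentences corresponds to the score above.
print(199 / 250)  # 0.796
```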


reached 79.6% accuracy on the Yelp data - now check the other data sets

this script performs and evaluates the whole process for each dataset


In [9]:
for source in df['source'].unique():
    df_source = df[df['source'] == source]
    sentences = df_source['sentence'].values
    y = df_source['label'].values

    sentences_train, sentences_test, y_train, y_test = train_test_split(
        sentences, y, test_size=0.25, random_state=1000)

    vectorizer = CountVectorizer()
    vectorizer.fit(sentences_train)
    X_train = vectorizer.transform(sentences_train)
    X_test  = vectorizer.transform(sentences_test)

    classifier = LogisticRegression()
    classifier.fit(X_train, y_train)
    score = classifier.score(X_test, y_test)
    print('Accuracy for {} data: {:.4f}'.format(source, score))

Accuracy for yelp data: 0.7960
Accuracy for amazon data: 0.7960
Accuracy for imdb data: 0.7487
