## Types of String Data in NLP

#### 1. Categorical Data

In [33]:
""" String data drawn from a fixed list of values.
    For example, the options of a drop-down menu that the user selects from.
"""
categorical_data = ["red", "yellow", "green", "brown"]

print(categorical_data) 

['red', 'yellow', 'green', 'brown']


#### 2. Free strings that can be semantically mapped to categories

In [34]:
""" This is when the user is allowed to type in a color freely, without a fixed list (drop-down menu) being provided.
    The result can still be mapped to categorical data.
    One must, however, check for spelling mistakes, since different spellings might mean the same thing,
    like February and Februry.
"""

' This is when the user is allowed to type in a color freely, without a fixed list (drop-down menu) being provided.\n    The result can still be mapped to categorical data.\n    One must, however, check for spelling mistakes, since different spellings might mean the same thing,\n    like February and Februry.\n'
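
The spelling check described above can be sketched with the standard library's `difflib`. This is a minimal illustration rather than a complete solution: the helper name `map_to_category`, the `MONTHS` list, and the similarity `cutoff` of 0.8 are assumptions made for the example.

```python
from difflib import get_close_matches

# Illustrative category list (an assumption for this example).
MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]

def map_to_category(raw, categories, cutoff=0.8):
    """Return the category closest to the free-text input, or None if nothing is close."""
    # Compare in lowercase so that capitalization does not affect the match.
    lookup = {category.lower(): category for category in categories}
    matches = get_close_matches(raw.strip().lower(), list(lookup), n=1, cutoff=cutoff)
    return lookup[matches[0]] if matches else None

print(map_to_category("Februry", MONTHS))
print(map_to_category("xyz", MONTHS))
```

Inputs that resemble no category come back as `None`, so they can be reviewed by hand instead of being silently assigned to the wrong category.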

#### 3. Structured String Data

#### 4. Text data

In [35]:
""" Free-form text data with no specific structure.
    For example, blog posts, Twitter comments, YouTube comments, etc.
"""

' Free-form text data with no specific structure.\n    For example, blog posts, Twitter comments, YouTube comments, etc.\n'

### All Imports

In [36]:
import pandas as pd 
from sklearn.model_selection import train_test_split 

### Read the IMDB Data 

In [44]:
data_imdb = pd.read_csv('data/imdb_dataset.csv')
print(data_imdb.head())


                                              review sentiment
0  One of the other reviewers has mentioned that ...  positive
1  A wonderful little production. <br /><br />The...  positive
2  I thought this was a wonderful way to spend ti...  positive
3  Basically there's a family where a little boy ...  negative
4  Petter Mattei's "Love in the Time of Money" is...  positive


In [45]:
print(data_imdb.sentiment.unique())

['positive' 'negative']


### Function to remove line-break tags from the reviews

In [39]:
def first_preprocess(data):
    # Replace each doubled HTML line-break tag with a single space.
    return data.replace('<br /><br />', " ")

In [40]:
data_imdb["new_review"] = data_imdb["review"].apply(first_preprocess)
process_data = data_imdb[["new_review", "sentiment"]]
print(process_data.head())

                                          new_review sentiment
0  One of the other reviewers has mentioned that ...  positive
1  A wonderful little production.  The filming te...  positive
2  I thought this was a wonderful way to spend ti...  positive
3  Basically there's a family where a little boy ...  negative
4  Petter Mattei's "Love in the Time of Money" is...  positive
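
`first_preprocess` removes only the doubled tag `<br /><br />`. If the reviews also contain single or differently written break tags (an assumption, since the notebook does not check for them), a regular expression is one way to catch those as well. The name `strip_break_tags` is hypothetical, and unlike the function above this sketch also collapses repeated whitespace.

```python
import re

# Matches <br>, <br/>, <br /> and uppercase variants.
BR_TAG = re.compile(r"<br\s*/?>", flags=re.IGNORECASE)

def strip_break_tags(text):
    """Replace HTML line-break tags with spaces, then collapse runs of whitespace."""
    without_tags = BR_TAG.sub(" ", text)
    return re.sub(r"\s+", " ", without_tags).strip()

print(strip_break_tags("A wonderful little production. <br /><br />The filming"))
```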


In [41]:
X, y = data_imdb["new_review"], data_imdb["sentiment"]

In [50]:
# Plain Python lists, so that the split results can be indexed by position.
X_ = list(X)
y_ = list(y)

In [51]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

print(type(X_train))
print(len(X_train), len(X_test), len(y_train), len(y_test)) 

<class 'pandas.core.series.Series'>
33500 16500 33500 16500
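
The sizes above follow from applying `test_size=0.33` to the 50,000 reviews. As a rough illustration of the bookkeeping (not scikit-learn's implementation), a shuffled split can be sketched with the standard library. `simple_split` is a hypothetical helper, and its shuffling will not select the same rows as `train_test_split`.

```python
import math
import random

def simple_split(items, test_size, seed=42):
    """Shuffle the positions, then take the first ceil(test_size * n) of them as the test set."""
    n_test = math.ceil(test_size * len(items))
    positions = list(range(len(items)))
    random.Random(seed).shuffle(positions)
    test = [items[p] for p in positions[:n_test]]
    train = [items[p] for p in positions[n_test:]]
    return train, test

# Stand-in data: 50,000 integers instead of the actual reviews.
train, test = simple_split(list(range(50_000)), test_size=0.33)
print(len(train), len(test))
```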


In [52]:
X_train, X_test, y_train, y_test = train_test_split(X_, y_, test_size=0.33, random_state=42)

print(type(X_train))
print(len(X_train), len(X_test), len(y_train), len(y_test)) 


<class 'list'>
33500 16500 33500 16500


In [55]:
for i in range(4):
    print(X_train[i])
    print(y_train[i])

Randolph Scott is heading into Albuquerque to take a job with his uncle. However, on the way there, the stage is held up--even though they are not carrying a strongbox. However, a nice lady on board is concealing $10,000 for her and her brother's business...and the robbers seem to know this. Once in town, Scott goes to this uncle about the job. However, he soon learns that this uncle is a jerk--the typical bad guy from Westerns. You know, the rich guy who only wants to become richer by cheating and stealing and threatening until he owns everything. And, it just so happens that this jerk was behind the robbery. Scott demands that the uncle returns the money and then Scott goes into business with the nice lady and her brother. Not surprisingly, this is NOT the end of the problems---just the beginning. Again and again, intrigues of various types occur to try to crush the uncle's opposition. One trick is to bring in a pretty lady to befriend Scott and his partners. She's a crack shot and i


## Representing Data as a Bag of Words

### 1. Tokenization

### 2. Vocabulary Building

### 3. Encoding
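
The three steps above can be sketched in plain Python on two short sentences. The tokenization rule used here (lowercase everything, keep runs of two or more word characters) is an assumption chosen to resemble the default behavior of scikit-learn's `CountVectorizer`. The names `tokenize` and `encode` are illustrative.

```python
import re

docs = ["The fool doth think he is wise",
        "but the wise man knows himself to be a fool"]

def tokenize(text):
    """Step 1: lowercase the text and split it into tokens of two or more word characters."""
    return re.findall(r"\b\w\w+\b", text.lower())

tokenized = [tokenize(doc) for doc in docs]

# Step 2: collect every distinct token and number the tokens in alphabetical order.
all_tokens = sorted({token for tokens in tokenized for token in tokens})
vocabulary = {token: index for index, token in enumerate(all_tokens)}

def encode(tokens):
    """Step 3: count how often each vocabulary entry occurs in one document."""
    counts = [0] * len(vocabulary)
    for token in tokens:
        counts[vocabulary[token]] += 1
    return counts

encoded = [encode(tokens) for tokens in tokenized]

print(len(vocabulary))
for row in encoded:
    print(row)
```

Each document becomes one row with one column per vocabulary entry. Word order is discarded, which is why the representation is called a bag of words.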

## Applying Bag-of-Words to a Toy Dataset

In [69]:
bards_words = ["The fool doth think he is wise", "but the wise man knows himself to be a fool"]

from sklearn.feature_extraction.text import CountVectorizer 

vect = CountVectorizer()
vect.fit(bards_words)

print(f"Vocabulary size: {len(vect.vocabulary_)}\n")
print(f"Vocabulary content: \n{(vect.vocabulary_)}")

Vocabulary size: 13

Vocabulary content: 
{'the': 9, 'fool': 3, 'doth': 2, 'think': 10, 'he': 4, 'is': 6, 'wise': 12, 'but': 1, 'man': 8, 'knows': 7, 'himself': 5, 'to': 11, 'be': 0}


### Bag of Words

In [70]:
bag_of_words = vect.transform(bards_words)
print(f"bag of words : {repr(bag_of_words)}")

bag of words : <2x13 sparse matrix of type '<class 'numpy.int64'>'
	with 16 stored elements in Compressed Sparse Row format>


In [71]:
print(f"bag of words : \n{(bag_of_words)}")

bag of words : 
  (0, 2)	1
  (0, 3)	1
  (0, 4)	1
  (0, 6)	1
  (0, 9)	1
  (0, 10)	1
  (0, 12)	1
  (1, 0)	1
  (1, 1)	1
  (1, 3)	1
  (1, 5)	1
  (1, 7)	1
  (1, 8)	1
  (1, 9)	1
  (1, 11)	1
  (1, 12)	1


In [73]:
print(f"bag of words : \n{(bag_of_words.toarray())}")

bag of words : 
[[0 0 1 1 1 0 1 0 0 1 1 0 1]
 [1 1 0 1 0 1 0 1 1 1 0 1 1]]
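
The printouts above show the same matrix in sparse and dense form. A sparse representation keeps only the nonzero entries, which is where the 16 stored elements come from. The sketch below uses a dictionary keyed by `(row, column)`. That is a simplification for illustration and not the Compressed Sparse Row layout that SciPy uses internally.

```python
dense = [[0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1],
         [1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1]]

# Keep only the nonzero counts, addressed by their (row, column) position.
stored = {(r, c): value
          for r, row in enumerate(dense)
          for c, value in enumerate(row)
          if value != 0}

total_cells = len(dense) * len(dense[0])
print(f"{len(stored)} stored elements out of {total_cells} cells")
```

On the toy data the saving is small, but for a matrix in which almost every entry is zero, storing only the nonzero entries is what keeps the memory footprint manageable.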


### Apply the Bag-of-Words to the IMDB Dataset 

In [74]:
vect = CountVectorizer().fit(X_train)

In [76]:
X_train = vect.transform(X_train)

In [77]:
print(f"X_train: \n{repr(X_train)}")

X_train: 
<33500x86358 sparse matrix of type '<class 'numpy.int64'>'
	with 4541135 stored elements in Compressed Sparse Row format>


In [94]:
feature_names = vect.get_feature_names_out()
print(f"Number of features : {len(feature_names)}\n")
print(f"First 20 features : \n{'-'*70}\n{feature_names[:20]}\n")
print(f"Features 2000-2019 : \n{'-'*70}\n{feature_names[2000:2020]}")

Number of features : 86358

First 20 features : 
----------------------------------------------------------------------
['00' '000' '00000000000' '00000001' '00001' '000dm' '000s' '001' '003830'
 '007' '0079' '0080' '0083' '009' '0093638' '00am' '00o' '00pm' '00s'
 '00schneider']

Features 2000-2019 : 
----------------------------------------------------------------------
['adarsh' 'adas' 'aday' 'adays' 'add' 'addam' 'addams' 'addario' 'added'
 'addendum' 'adder' 'addict' 'addicted' 'addicted2you' 'addicting'
 'addiction' 'addictions' 'addictive' 'addicts' 'addie']


### The first features are not actually words but numbers stored as strings. We have to eliminate these features, as they have no semantic significance.
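
One possible way to identify such features is to test each token for digits. The sketch below runs on a handful of the feature names printed above. The helper names `is_numeric_token` and `has_digit` are hypothetical, and whether to drop only purely numeric tokens or every token that contains a digit is a judgment call left open here.

```python
sample_features = ["00", "000", "000dm", "007", "00am", "00schneider",
                   "add", "addams", "addicted2you"]

def is_numeric_token(token):
    """True when the token consists of digits only, e.g. '007'."""
    return token.isdigit()

def has_digit(token):
    """True when the token contains at least one digit, e.g. 'addicted2you'."""
    return any(ch.isdigit() for ch in token)

without_numbers = [t for t in sample_features if not is_numeric_token(t)]
letters_only = [t for t in sample_features if not has_digit(t)]

print(without_numbers)
print(letters_only)
```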

#### Before that, let's train the model and check its accuracy

In [97]:
from sklearn.model_selection import cross_val_score 
from sklearn.linear_model    import LogisticRegression 
import numpy as np 

In [None]:
scores = cross_val_score(LogisticRegression(), X_train, y_train, cv=5)

In [99]:
print(f"Mean cross-validation accuracy :  {np.mean(scores) :.3f}")

Mean cross-validation accuracy :  0.88
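
With `cv=5`, the training set is divided into five folds, and each fold serves once as the validation set while the other four are used for fitting. The sketch below shows only the index bookkeeping, using contiguous folds. `kfold_indices` is a hypothetical helper. For classifiers, scikit-learn's splitter additionally balances the classes within each fold, which is omitted here.

```python
def kfold_indices(n_samples, n_folds=5):
    """Yield (training, validation) index lists for contiguous, near-equal folds."""
    base, extra = divmod(n_samples, n_folds)
    # The first `extra` folds receive one additional sample each.
    fold_sizes = [base + (1 if i < extra else 0) for i in range(n_folds)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        validation = list(range(start, stop))
        training = list(range(0, start)) + list(range(stop, n_samples))
        yield training, validation
        start = stop

folds = list(kfold_indices(33500, n_folds=5))
for training, validation in folds:
    print(len(training), len(validation))
```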


In [100]:
from sklearn.model_selection import GridSearchCV


In [None]:
param_grid = {"C" : [0.001, 0.01, 0.1, 1, 10]}

grid = GridSearchCV(LogisticRegression(), param_grid, cv=5)
grid.fit(X_train, y_train)

In [104]:
print(f"Best cross-validation score : {grid.best_score_:.3f}")
print(f"Best parameters : {grid.best_params_}")

Best cross-validation score : 0.890
Best parameters : {'C': 0.1}
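
Conceptually, a grid search evaluates every parameter combination and keeps the one with the highest score. The loop can be sketched as follows. `grid_search` and `toy_score` are hypothetical names, and `toy_score` is a made-up function used only to make the sketch runnable. It is not a real accuracy and says nothing about the IMDB results.

```python
from itertools import product

def grid_search(param_grid, evaluate):
    """Try every combination in param_grid and return the best (params, score) pair."""
    names = list(param_grid)
    best_params, best_score = None, None
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(**params)
        if best_score is None or score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

def toy_score(C):
    # Made-up objective for illustration only; it peaks at C = 1.
    return -abs(C - 1)

best_params, best_score = grid_search({"C": [0.001, 0.01, 0.1, 1, 10]}, toy_score)
print(best_params, best_score)
```

In the notebook, the role of `evaluate` is played by the mean 5-fold cross-validation accuracy of `LogisticRegression` for a given `C`, so the five candidate values require 25 model fits before the final refit on the whole training set.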


In [105]:
X_test = vect.transform(X_test)

In [106]:
print(f"Test score: {grid.score(X_test, y_test) :.3f}")

Test score: 0.898
