# Bag of Words

1. Sentiment analysis
    * Positive/Negative
2. Topics
    * Cars, Sports, Cooking

### Loading our Data

We'll begin by loading some data from the 20 Newsgroups dataset.

In [2]:
import pandas as pd
from sklearn.datasets import fetch_20newsgroups
newsgroups_train = fetch_20newsgroups(subset='train')

documents = pd.DataFrame(newsgroups_train['data'], columns = ['text'])
y = newsgroups_train['target']

In [5]:
documents[:3]

                                                text
0  From: lerxst@wam.umd.edu (where's my thing)\nS...
1  From: guykuo@carson.u.washington.edu (Guy Kuo)...
2  From: twillis@ec.ecn.purdue.edu (Thomas E Will...


In [4]:
# newsgroups_train['target_names']

In [6]:
first_document = documents['text'][0]
first_document

"From: lerxst@wam.umd.edu (where's my thing)\nSubject: WHAT car is this!?\nNntp-Posting-Host: rac3.wam.umd.edu\nOrganization: University of Maryland, College Park\nLines: 15\n\n I was wondering if anyone out there could enlighten me on this car I saw\nthe other day. It was a 2-door sports car, looked to be from the late 60s/\nearly 70s. It was called a Bricklin. The doors were really small. In addition,\nthe front bumper was separate from the rest of the body. This is \nall I know. If anyone can tellme a model name, engine specs, years\nof production, where this car is made, history, or whatever info you\nhave on this funky looking car, please e-mail.\n\nThanks,\n- IL\n   ---- brought to you by your neighborhood Lerxst ----\n\n\n\n\n"

In [9]:
# document -> [will, "not", ""]

In an NLP problem, an observation like this is referred to as a **document**.

> In NLP, a **document** is a distinct text. This generally means that each article, book, and so on is its own document. It is the equivalent of an observation in machine learning.

> When we break our document into chunks, whether sentences, lines, or heading, body, and footer, it's called **segmentation**.

>  A **token** is a string of contiguous characters between two spaces.

* `will not` -> `will`, `not`
* `won't` -> `won`, `'t`

In [14]:
# split the document on whitespace
tokenized_doc = first_document.split()
tokenized_doc[6:10]

['WHAT', 'car', 'is', 'this!?']
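
Splitting on whitespace leaves punctuation attached to tokens (`this!?`). As a rough sketch of a stricter tokenizer, the snippet below lowercases the text and keeps only runs of two or more word characters, which mirrors the default `token_pattern` of scikit-learn's `CountVectorizer`. The helper name `simple_tokenize` is made up for illustration.

```python
import re

# Pattern borrowed from CountVectorizer's default token_pattern:
# a token is a run of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def simple_tokenize(text):
    """Lowercase the text and return its tokens, dropping punctuation."""
    return TOKEN_PATTERN.findall(text.lower())

tokens = simple_tokenize("Subject: WHAT car is this!?")
print(tokens)
```

Single-character tokens such as `I` or `a` are dropped by this pattern, which is one reason the vectorizer's vocabulary can differ from a plain `split()`.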

In [13]:
# each word in the vocabulary maps to a column index: what -> 2, car -> 4

In [15]:
first_document = "What car is this"

In [None]:

[0, 0, 1, 0, 1]
[0, 1, 0, 0, 2]
[0, 1, 0, 1, 1]
#      what  car

In [None]:
* bag of words

* Histogram: {'what': 1, 'car': 2, 'is': 5}
    
[0, 0, 1, 0, 2]
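
To make the idea concrete, here is a minimal from-scratch sketch of bag of words: build a vocabulary that maps each word to a column index, then count how often each word appears in a document. The toy corpus and the helper name `to_count_vector` are assumptions for illustration, not part of the newsgroups data.

```python
from collections import Counter

# Toy corpus, made up for illustration.
docs = ["what car is this", "this car is a fast car", "what is this"]

# Vocabulary: map each unique word to a column index (sorted for a stable order).
unique_words = sorted({word for doc in docs for word in doc.split()})
vocabulary = {word: index for index, word in enumerate(unique_words)}

def to_count_vector(doc, vocabulary):
    """Return a list where position i holds the count of the word with index i."""
    vector = [0] * len(vocabulary)
    for word, count in Counter(doc.split()).items():
        if word in vocabulary:  # ignore words outside the vocabulary
            vector[vocabulary[word]] = count
    return vector

vectors = [to_count_vector(doc, vocabulary) for doc in docs]
print(vocabulary)
for vector in vectors:
    print(vector)
```

Each document becomes a vector of the same length as the vocabulary, no matter how long the document is.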

### Encoding Each Word

In [16]:
from sklearn.feature_extraction.text import CountVectorizer

In [17]:
vectorizer = CountVectorizer()

In [24]:
X = vectorizer.fit_transform(newsgroups_train['data'])


In [41]:
vectorizer.vocabulary_['car']

37780

In [53]:
# X[1].toarray()[0]

> The way that we translate our text into numbers is called **text representation**. And when we represent our documents as vectors, it's called a **vector space model**.
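
As a small sketch of what this representation looks like, the snippet below runs `CountVectorizer` on two made-up sentences instead of the full newsgroups data. The `toy_` names are assumptions chosen so they do not overwrite the variables used elsewhere in this notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_docs = ["What car is this", "This car is a fast car"]  # made-up examples

toy_vectorizer = CountVectorizer()
toy_X = toy_vectorizer.fit_transform(toy_docs)  # sparse document-term matrix

# vocabulary_ maps each token to its column index.
print(toy_vectorizer.vocabulary_)
# Convert to a dense array only because the toy matrix is tiny.
print(toy_X.toarray())
```

Each row is one document and each column is one word from the vocabulary, so the second row records `car` twice.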

In [54]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer() 
X = vectorizer.fit_transform(newsgroups_train.data)

In [72]:
vectorizer.inverse_transform(X[:4])

[array(['from', 'lerxst', 'wam', 'umd', 'edu', 'where', 'my', 'thing',
        'subject', 'what', 'car', 'is', 'this', 'nntp', 'posting', 'host',
        'rac3', 'organization', 'university', 'of', 'maryland', 'college',
        'park', 'lines', '15', 'was', 'wondering', 'if', 'anyone', 'out',
        'there', 'could', 'enlighten', 'me', 'on', 'saw', 'the', 'other',
        'day', 'it', 'door', 'sports', 'looked', 'to', 'be', 'late', '60s',
        'early', '70s', 'called', 'bricklin', 'doors', 'were', 'really',
        'small', 'in', 'addition', 'front', 'bumper', 'separate', 'rest',
        'body', 'all', 'know', 'can', 'tellme', 'model', 'name', 'engine',
        'specs', 'years', 'production', 'made', 'history', 'or',
        'whatever', 'info', 'you', 'have', 'funky', 'looking', 'please',
        'mail', 'thanks', 'il', 'brought', 'by', 'your', 'neighborhood'],
       dtype='<U180'),
 array(['from', 'edu', 'subject', 'this', 'nntp', 'posting', 'host',
        'organization', 'univ

In [56]:
# y

In [58]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify = y, test_size = .2)

In [59]:
from sklearn.linear_model import LogisticRegression

In [60]:
lr = LogisticRegression().fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
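
The warning above means the solver stopped before converging. One option the message itself suggests is raising `max_iter`. The sketch below shows that on a made-up four-document corpus; the value `1000` is an arbitrary choice for illustration, not a tuned setting for the newsgroups data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Made-up corpus: label 0 = cars, label 1 = graphics.
toy_docs = [
    "the car has a fast engine",
    "the engine of this car is loud",
    "the image file is a gif",
    "this graphics package reads image files",
]
toy_labels = [0, 0, 1, 1]

toy_vectorizer = CountVectorizer()
toy_X = toy_vectorizer.fit_transform(toy_docs)

# max_iter=1000 gives the solver more iterations than the default of 100.
toy_lr = LogisticRegression(max_iter=1000).fit(toy_X, toy_labels)

new_docs = ["a loud car engine", "gif image graphics"]
predictions = toy_lr.predict(toy_vectorizer.transform(new_docs))
print(predictions)
```

New documents must go through `transform` (not `fit_transform`) so that they are encoded with the vocabulary learned from the training documents.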


In [62]:
lr.score(X_test, y_test)

0.8780380026513478

In [73]:
# newsgroups_train['target_names']

In [70]:
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
coefficients = pd.Series(lr.coef_[1], vectorizer.get_feature_names_out()).sort_values(ascending=False)
coefficients[:20]

graphics     1.573041
image        1.002384
images       0.976303
3d           0.885399
24           0.774670
files        0.751026
pov          0.717016
library      0.691739
package      0.652439
3do          0.645722
file         0.642139
tiff         0.612601
polygon      0.598174
imagine      0.583751
animation    0.571993
cview        0.568688
42           0.562986
gif          0.546851
vga          0.532252
color        0.514511
dtype: float64
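
The same inspection can be repeated for every class, since `coef_` holds one row of weights per class. The sketch below does this with NumPy on a made-up three-class corpus; the corpus, the `toy_` names, and the choice of showing two words per class are all assumptions for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Made-up corpus with three classes and no words shared between classes.
toy_docs = [
    "car engine wheels", "engine car brakes",
    "image graphics pixels", "graphics image polygon",
    "recipe oven flour", "oven recipe butter",
]
toy_labels = [0, 0, 1, 1, 2, 2]

toy_vectorizer = CountVectorizer()
toy_X = toy_vectorizer.fit_transform(toy_docs)
toy_lr = LogisticRegression(max_iter=1000).fit(toy_X, toy_labels)

# Recover feature names in column order from the fitted vocabulary.
vocab = toy_vectorizer.vocabulary_
feature_names = np.array(sorted(vocab, key=vocab.get))

# coef_ has one row per class; take the two largest weights in each row.
top_words = {}
for class_index, weights in enumerate(toy_lr.coef_):
    top_columns = np.argsort(weights)[::-1][:2]
    top_words[class_index] = feature_names[top_columns].tolist()
print(top_words)
```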

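> A **corpus** is the full collection of documents we are working with.

> The **vocabulary** is the set of unique tokens found across the corpus. After fitting, `CountVectorizer` stores it in `vectorizer.vocabulary_`, which maps each token to its column index.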

### What's lost from BOW
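
One thing a bag of words discards is word order. As a minimal sketch (the two sentences and the helper `bag_of_words` are made up for illustration), two sentences with opposite meanings can end up with exactly the same counts:

```python
from collections import Counter

def bag_of_words(text):
    """Count each whitespace-separated, lowercased word."""
    return Counter(text.lower().split())

first = bag_of_words("the dog bit the man")
second = bag_of_words("the man bit the dog")

# Both sentences contain the same words the same number of times,
# so their bag-of-words representations are identical.
print(first == second)
```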

### Stop Words

In [80]:
documents.iloc[0].text

"From: lerxst@wam.umd.edu (where's my thing)\nSubject: WHAT car is this!?\nNntp-Posting-Host: rac3.wam.umd.edu\nOrganization: University of Maryland, College Park\nLines: 15\n\n I was wondering if anyone out there could enlighten me on this car I saw\nthe other day. It was a 2-door sports car, looked to be from the late 60s/\nearly 70s. It was called a Bricklin. The doors were really small. In addition,\nthe front bumper was separate from the rest of the body. This is \nall I know. If anyone can tellme a model name, engine specs, years\nof production, where this car is made, history, or whatever info you\nhave on this funky looking car, please e-mail.\n\nThanks,\n- IL\n   ---- brought to you by your neighborhood Lerxst ----\n\n\n\n\n"

In [None]:
# bigrams of the first document (pairs of adjacent whitespace-separated tokens)
[['From: lerxst@wam.umd.edu', "lerxst@wam.umd.edu (where's", "(where's my"]]

In [81]:
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

# stop_words expects 'english', a list, or None, so convert the frozenset to a list
vectorizer = CountVectorizer(stop_words=list(ENGLISH_STOP_WORDS), ngram_range=(2, 3))

In [82]:
X_ngrams = vectorizer.fit_transform(newsgroups_train.data)

In [88]:
X_ngrams.shape

(11314, 1186545)

In [87]:
list(vectorizer.vocabulary_.items())[-20:]

[('cbr serial', 255199),
 ('number jh2sc281xpm100187', 758092),
 ('jh2sc281xpm100187 engine', 594925),
 ('engine number', 412653),
 ('number 2101240', 757683),
 ('2101240 turn', 44172),
 ('signals mirrors', 969551),
 ('mirrors lights', 710599),
 ('lights taped', 644290),
 ('taped track', 1039529),
 ('track riders', 1072631),
 ('riders session', 912805),
 ('session willow', 958492),
 ('willow springs', 1146450),
 ('springs tomorrow', 996569),
 ('tomorrow guess', 1068261),
 ('ll miss', 655714),
 ('miss help', 711308),
 ('help baby', 527863),
 ('baby kjg', 190499)]

1. Stop words 

Stop words are common words (for example, *the*, *is*, and *of*) that we exclude from our features.

* Downside
* Stop words can 
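
As a rough sketch of stop-word filtering, the snippet below removes words from a tiny hand-picked stop list. The list and the helper `remove_stop_words` are assumptions for illustration; real lists such as `ENGLISH_STOP_WORDS` are much longer. It also hints at one possible downside: when a negation like `not` is on the stop list, two sentences with opposite meanings reduce to the same tokens.

```python
# Tiny hand-picked stop list for illustration; real lists are much longer.
STOP_WORDS = {"the", "is", "was", "a", "of", "not", "this"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens that are not in the stop list."""
    return [token for token in tokens if token not in stop_words]

kept = remove_stop_words("this car is fast".split())
negated = remove_stop_words("this car is not fast".split())
print(kept)
print(negated)
```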

In [None]:
2. Ngrams
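
An n-gram is a sequence of `n` adjacent tokens. The sketch below builds them by sliding a window over a token list, with a range that mirrors the `ngram_range=(2, 3)` setting used above. The helper `ngrams` and the sample tokens are made up for illustration and are not how `CountVectorizer` is implemented internally.

```python
def ngrams(tokens, min_n, max_n):
    """Return all n-grams with min_n <= n <= max_n, shortest first, joined by spaces."""
    result = []
    for n in range(min_n, max_n + 1):
        # Slide a window of width n across the token list.
        for start in range(len(tokens) - n + 1):
            result.append(" ".join(tokens[start:start + n]))
    return result

tokens = ["engine", "number", "turn", "signals"]
print(ngrams(tokens, 2, 3))
```

Because every window position yields a new feature, the number of columns grows quickly, which is consistent with the much wider matrix reported by `X_ngrams.shape`.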



In [None]:
"On hold"

### Training a model

[Naive Bayes Newsgroups](https://towardsdatascience.com/machine-learning-nlp-text-classification-using-scikit-learn-python-and-nltk-c52b92a7c73a)

[NLP Newsgroups](https://medium.com/@siyao_sui/nlp-with-the-20-newsgroups-dataset-ab35cd0ea902)



[FastAI NLP](https://github.com/fastai/course-nlp/blob/master/2-svd-nmf-topic-modeling.ipynb)

[Sklearn text data](https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html)

[Naive Bayes Classifier](https://towardsdatascience.com/the-naive-bayes-classifier-e92ea9f47523)