### Data prep 

In this notebook we prepare text data for machine learning and compare several ways of turning raw text into features.


In [1]:
%matplotlib inline

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from tqdm.notebook import tqdm
from sklearn.datasets import fetch_20newsgroups

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import Ridge, LogisticRegression
from sklearn.metrics import mean_squared_error, accuracy_score

import warnings
warnings.simplefilter("ignore")


In [4]:
# download data 
data = fetch_20newsgroups(subset='all', categories=['comp.graphics', 'sci.med'])

In [5]:
data['target_names']

['comp.graphics', 'sci.med']

In [6]:
texts = data['data']
target = data['target']

In [7]:
#sample text
texts[0]

'From: dyer@spdcc.com (Steve Dyer)\nSubject: Re: Analgesics with Diuretics\nOrganization: S.P. Dyer Computer Consulting, Cambridge MA\n\nIn article <ofk=lve00WB2AvUktO@andrew.cmu.edu> Lawrence Curcio <lc2b+@andrew.cmu.edu> writes:\n>I sometimes see OTC preparations for muscle aches/back aches that\n>combine aspirin with a diuretic.\n\nYou certainly do not see OTC preparations advertised as such.\nThe only such ridiculous concoctions are nostrums for premenstrual\nsyndrome, ostensibly to treat headache and "bloating" simultaneously.\nThey\'re worthless.\n\n>The idea seems to be to reduce\n>inflammation by getting rid of fluid. Does this actually work? \n\nThat\'s not the idea, and no, they don\'t work.\n\n-- \nSteve Dyer\ndyer@ursa-major.spdcc.com aka {ima,harvard,rayssd,linus,m2c}!spdcc!dyer\n'

In [8]:
#target for the sample text
data['target_names'][target[0]]


'sci.med'

In [9]:
#train test split 
texts_train, texts_test, y_train, y_test = train_test_split(texts, target, test_size=0.2, random_state=10)

### BAG-of-WORDS
The simplest way to build a feature description of the texts is `vectorization`. Suppose we have a collection of texts $D = \{d_i\}_{i=1}^l$ and a dictionary of all words that occur in it, $V = \{v_j\}_{j=1}^d$. Then a text $d_i$ can be described by a vector $(x_{ij})_{j=1}^d$, where $\large{x_{ij} = \sum_{v \in d_i}[v= v_j]}$. In other words, $x_{ij}$ is the number of times the word $v_j$ occurs in the text $d_i$.
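To make the counting formula concrete, here is a minimal sketch on a toy corpus. The two sentences, the whitespace tokenization and the helper name `bag_of_words` are assumptions for illustration only; `CountVectorizer` applies its own tokenization and lowercasing, so its vocabulary will generally differ.

```python
from collections import Counter

# Toy corpus (hypothetical, not taken from the 20 Newsgroups data).
docs = ["the cat sat on the mat", "the dog sat"]

# Vocabulary V: every distinct word, sorted so that each word v_j gets a fixed index j.
vocab = sorted({word for doc in docs for word in doc.split()})

def bag_of_words(doc, vocab):
    """Return the count vector (x_ij) of one document over the vocabulary."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]

vectors = [bag_of_words(doc, vocab) for doc in docs]
print(vocab)    # ['cat', 'dog', 'mat', 'on', 'sat', 'the']
print(vectors)  # [[1, 0, 1, 1, 1, 2], [0, 1, 0, 0, 1, 1]]
```

Each row has one entry per vocabulary word, which is why most entries are zero for real texts.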

In [10]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(encoding='utf8')
_ = vectorizer.fit(texts_train)
len(vectorizer.vocabulary_)

29227

In [11]:
vectorizer.transform(texts_train[:1])

<1x29227 sparse matrix of type '<class 'numpy.int64'>'
	with 43 stored elements in Compressed Sparse Row format>

In [12]:
print(vectorizer.transform(texts_train[:1]).indices)
print(vectorizer.transform(texts_train[:1]).data)

[  451  3280  3285  4133  4134  5214  5848  7692  8797  9094  9438  9616
  9904 11526 12024 12364 13304 13697 14400 14592 14944 14950 16427 17425
 18814 19241 19490 20697 20955 20984 21019 21280 21693 21990 22035 22535
 23585 25311 26187 26499 26956 28495 28772]
[1 1 1 2 2 3 1 1 4 1 1 1 5 2 1 1 1 1 4 1 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 5 1 1]
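The two arrays printed above are the sparse representation of a single row: `indices` holds the column numbers of the words that occur in the text, and `data` holds the corresponding counts. A minimal sketch of how such a pair expands into a dense vector, using small made-up numbers rather than the real 29227-column row:

```python
# Hypothetical sparse row: only non-zero positions and their values are stored.
n_features = 10
indices = [1, 4, 7]  # columns (vocabulary indices) of the words present in the text
data = [2, 1, 3]     # how many times each of those words occurs

# Expand to a dense vector: every column not listed in `indices` stays zero.
dense = [0] * n_features
for j, count in zip(indices, data):
    dense[j] = count

print(dense)  # [0, 2, 0, 0, 1, 0, 0, 3, 0, 0]
```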


Using grid search we will find the optimal hyperparameters. Since this is a binary classification task, we will use accuracy as the target metric.

$$Accuracy(\hat{y}, y) = \frac{1}{N}\sum_{i=1}^{N}[\hat{y}_i=y_i]$$
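As a quick illustration of the metric, the sketch below computes accuracy by hand on made-up labels. The function name `accuracy` is hypothetical; the notebook itself relies on `accuracy_score` from scikit-learn.

```python
def accuracy(y_pred, y_true):
    """Fraction of positions where the prediction equals the true label."""
    if len(y_pred) != len(y_true):
        raise ValueError("y_pred and y_true must have the same length")
    matches = sum(int(p == t) for p, t in zip(y_pred, y_true))
    return matches / len(y_true)

# Three of the four predictions match the true labels.
print(accuracy([0, 1, 1, 0], [0, 1, 0, 0]))  # 0.75
```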

In [13]:
# build training func
def train_model(X_train, y_train):
    # candidate values for C, the inverse regularization strength of LogisticRegression
    # (the variable is called "alpha" here to match the printed output)
    alphas = np.logspace(-1, 3, 10)
    searcher = GridSearchCV(LogisticRegression(), [{'C': alphas, 'max_iter': [500]}],
                            scoring='accuracy', cv=5, n_jobs=-1)
    searcher.fit(X_train, y_train)

    best_alpha = searcher.best_params_["C"]
    print("Best alpha = %.4f" % best_alpha)

    # refit on the whole training set with the selected C
    model = LogisticRegression(C=best_alpha, max_iter=500)
    model.fit(X_train, y_train)

    return model
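For reference, the grid that `np.logspace(-1, 3, 10)` produces can be reproduced without NumPy. This is only a sketch of the candidate values, ten points spaced evenly on a log scale between 0.1 and 1000; the variable names are assumptions.

```python
# Pure-Python equivalent of np.logspace(-1, 3, 10):
# ten exponents evenly spaced between -1 and 3, each used as a power of 10.
start, stop, num = -1, 3, 10
step = (stop - start) / (num - 1)
candidates = [10 ** (start + i * step) for i in range(num)]

for c in candidates:
    print("%.4f" % c)
```

Every "Best alpha" value reported in this notebook is one of these ten points, so the search can only be as fine as the grid allows.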

In [14]:
# data transformation
X_train = vectorizer.transform(texts_train)
X_test = vectorizer.transform(texts_test)

In [15]:
model = train_model(X_train, y_train)

print("Train accuracy = %.4f" % accuracy_score(y_train, model.predict(X_train)))
print("Test accuracy = %.4f" % accuracy_score(y_test, model.predict(X_test)))

Best alpha = 0.7743
Train accuracy = 1.0000
Test accuracy = 0.9720


### TF-IDF

Another way of structuring the data for this scenario is [TF-IDF](https://en.wikipedia.org/wiki/Tf–idf) (**T**erm **F**requency–**I**nverse **D**ocument **F**requency). 

- $D$ is the collection of texts, $t$ is a word (term), and $n_{td}$ is the number of times $t$ occurs in the text $d \in D$
- TermFrequency - the share of the word $t$ among all words of the text $d$: $\large{tf(t, d) = \frac{n_{td}}{\sum_{t' \in d} n_{t'd}}}$
- InverseDocumentFrequency - $\large{idf(t, D) = \log\frac{|D|}{|\{d \in D:t \in d\}|}}$, where $|\{d \in D:t \in d\}|$ is the number of texts in the collection that contain the word $t$.

Then for every (word, text) pair $(t,d)$ we can compute ${tf{\text -}idf(t,d,D) = tf(t,d) \times idf(t, D)}$
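The formulas can be checked by hand on a toy corpus. The sketch below follows the textbook definitions given here; the three sentences and the function names are assumptions for illustration. Note that `TfidfVectorizer` by default uses a smoothed idf and L2-normalizes each row, so its numbers will not match these exactly.

```python
import math
from collections import Counter

# Toy corpus (hypothetical), tokenized by whitespace.
docs = [
    "the scan shows the tumor".split(),
    "the render shows the mesh".split(),
    "the doctor ordered a scan".split(),
]

def tf(term, doc):
    """Share of the tokens in `doc` that are equal to `term`."""
    return Counter(doc)[term] / len(doc)

def idf(term, docs):
    """log(|D| / number of documents containing `term`).

    Assumes the term occurs in at least one document.
    """
    n_containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / n_containing)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

print(tf_idf("the", docs[0], docs))    # 0.0: the word occurs in every document
print(tf_idf("tumor", docs[0], docs))  # positive: the word occurs in one document only
```

A word that appears in every text gets zero weight, while a word specific to one text is emphasized.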

In [16]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(encoding='utf8')
_ = vectorizer.fit(texts_train)
len(vectorizer.vocabulary_)

29227

In [17]:
vectorizer.transform(texts_train[:1])

<1x29227 sparse matrix of type '<class 'numpy.float64'>'
	with 43 stored elements in Compressed Sparse Row format>

In [18]:
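# note: texts[:1] is the first document of the full dataset, not texts_train[:1] as in the previous cell
print(vectorizer.transform(texts[:1]).indices)
print(vectorizer.transform(texts[:1]).data)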

[29097 28831 28793 28751 28665 27566 26802 26499 26318 26283 26211 26209
 25758 25396 25311 25072 24692 24588 24253 23753 23743 22882 22873 22227
 22035 22002 21137 21124 19566 19553 19490 19356 19263 19241 18937 18935
 18816 18280 16947 16821 16812 16775 16445 16140 16116 14432 14253 14120
 14015 13218 13150 12376 12024 11805 11721 10241 10080  9773  9741  9712
  9632  9631  8497  7916  7743  7692  7482  7473  7316  6729  6385  6167
  5606  5186  4963  4616  4575  4555  4461  4138  4127  4097  3818  3633
  3486  3419]
[0.02847645 0.03142552 0.10526438 0.10312729 0.05644174 0.08756307
 0.07828735 0.05600843 0.02587729 0.07735152 0.05526081 0.04728733
 0.07421514 0.09664853 0.01676178 0.123904   0.25103298 0.06939308
 0.09992653 0.05728882 0.10066724 0.1015241  0.09846807 0.08101145
 0.05042139 0.08830816 0.23376552 0.12170482 0.18525966 0.12170482
 0.01735912 0.04445822 0.12170482 0.01877742 0.05769067 0.12170482
 0.03931898 0.08550803 0.06326732 0.07149103 0.08830816 0.12170482
 0.082

In [19]:
# transform data and find optimal hyperparam 

X_train = vectorizer.transform(texts_train)
X_test = vectorizer.transform(texts_test)

model = train_model(X_train, y_train)

print("Train accuracy = %.4f" % accuracy_score(y_train, model.predict(X_train)))
print("Test accuracy = %.4f" % accuracy_score(y_test, model.predict(X_test)))

Best alpha = 1000.0000
Train accuracy = 1.0000
Test accuracy = 0.9873


### Stemming and Lemmatization 

- [Stemming](https://en.wikipedia.org/wiki/Stemming) is the process of reducing inflected (or sometimes derived) words to their word stem, base or root form, generally a written word form.

- [Lemmatization](https://en.wikipedia.org/wiki/Lemmatisation) is the process of grouping together the inflected forms of a word so they can be analysed as a single item, identified by the word's lemma, or dictionary form.
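The difference between the two ideas can be sketched with toy versions of both. This is not the Snowball stemmer or the WordNet lemmatizer used below: the suffix list, the small lemma dictionary and the function names are all hypothetical, chosen only to show that stemming cuts word endings by rule while lemmatization looks up a dictionary form.

```python
# Hypothetical suffix list and lemma dictionary, for illustration only.
SUFFIXES = ("ing", "ed", "es", "s")
LEMMAS = {"was": "be", "better": "good", "mice": "mouse", "studies": "study"}

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word):
    """Look the word up in the dictionary; unknown words are returned unchanged."""
    return LEMMAS.get(word, word)

for w in ["studies", "mice", "treated", "was"]:
    print(w, toy_stem(w), toy_lemmatize(w))
```

The stem need not be a real word ("studi"), whereas the lemma always is, but a lemmatizer only helps for words its dictionary knows.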

#### Stemming

In [20]:
# importing a stemming lib 
import nltk
stemmer = nltk.stem.snowball.EnglishStemmer()

In [21]:
#test

print(stemmer.stem('vizuallization'))

vizual


Let's try **stemming** as a preprocessing step for the text before **vectorization**. We will use **tf-idf** for vectorization, because it has shown the best results so far. 

In [22]:
def stem_text(text, stemmer):
    tokens = text.split()
    return ' '.join(map(lambda w: stemmer.stem(w), tokens))

stemmed_texts_train = []
for t in tqdm(texts_train):
    stemmed_texts_train.append(stem_text(t, stemmer))
    
stemmed_texts_test = []
for t in tqdm(texts_test):
    stemmed_texts_test.append(stem_text(t, stemmer))

  0%|          | 0/1570 [00:00<?, ?it/s]

  0%|          | 0/393 [00:00<?, ?it/s]

Comparing the initial and stemmed text

In [23]:
print(texts_train[1])

From: black@sybase.com (Chris Black)
Subject: cystic breast disease
Organization: Sybase, Inc.
Lines: 18

My mom has just been diagnosed with cystic breast disease -- a big
relief, as it was a lump that could have been cancer.  Her doctor says
she should go off caffeine and chocolate for 6 months, as well as
stopping the estrogen she's been taking for menopause-related reasons.
She's not thrilled with this, I think especially because she just gave
up cigarettes -- soon she won't have any pleasures left!  Now, I thought
I'd heard that cystic breasts were common and not really a health risk.
Is this accurate?  If so, why is she being told to make various
sacrifices to treat something that's not that big of a deal?

Thanks for any information.

-- Chris

-- 
black@sybase.com

Note:  My mailer tends to garble subject lines.  



In [24]:
print(stemmed_texts_train[1])

from: black@sybase.com (chris black) subject: cystic breast diseas organization: sybase, inc. lines: 18 my mom has just been diagnos with cystic breast diseas -- a big relief, as it was a lump that could have been cancer. her doctor say she should go off caffein and chocol for 6 months, as well as stop the estrogen she been take for menopause-rel reasons. she not thrill with this, i think especi becaus she just gave up cigarett -- soon she won't have ani pleasur left! now, i thought i'd heard that cystic breast were common and not realli a health risk. is this accurate? if so, whi is she be told to make various sacrific to treat someth that not that big of a deal? thank for ani information. -- chris -- black@sybase.com note: my mailer tend to garbl subject lines.


In [25]:
vectorizer = TfidfVectorizer(encoding='utf8')
_ = vectorizer.fit(stemmed_texts_train)
len(vectorizer.vocabulary_)

27715

In [26]:
X_train = vectorizer.transform(stemmed_texts_train)
X_test = vectorizer.transform(stemmed_texts_test)

In [27]:
model = train_model(X_train, y_train)

print("Train accuracy = %.4f" % accuracy_score(y_train, model.predict(X_train)))
print("Test accuracy = %.4f" % accuracy_score(y_test, model.predict(X_test)))

Best alpha = 129.1550
Train accuracy = 1.0000
Test accuracy = 0.9924


#### Lemmatization

In [28]:
from nltk.stem import WordNetLemmatizer

# WordNetLemmatizer needs the WordNet corpus (nltk.download('wordnet') if it is missing)
lemmatizer = WordNetLemmatizer()

def lemmatize_text(text, lemmatizer):
    tokens = text.split()
    return ' '.join(map(lambda w: lemmatizer.lemmatize(w), tokens))

lemmatized_texts_train = []
for t in tqdm(texts_train):
    lemmatized_texts_train.append(lemmatize_text(t, lemmatizer))

lemmatized_texts_test = []
for t in tqdm(texts_test):
    lemmatized_texts_test.append(lemmatize_text(t, lemmatizer))

  0%|          | 0/1570 [00:00<?, ?it/s]

  0%|          | 0/393 [00:00<?, ?it/s]

In [29]:
print(lemmatized_texts_train[1])

From: black@sybase.com (Chris Black) Subject: cystic breast disease Organization: Sybase, Inc. Lines: 18 My mom ha just been diagnosed with cystic breast disease -- a big relief, a it wa a lump that could have been cancer. Her doctor say she should go off caffeine and chocolate for 6 months, a well a stopping the estrogen she's been taking for menopause-related reasons. She's not thrilled with this, I think especially because she just gave up cigarette -- soon she won't have any pleasure left! Now, I thought I'd heard that cystic breast were common and not really a health risk. Is this accurate? If so, why is she being told to make various sacrifice to treat something that's not that big of a deal? Thanks for any information. -- Chris -- black@sybase.com Note: My mailer tends to garble subject lines.


In [30]:
print(stemmed_texts_train[1])

from: black@sybase.com (chris black) subject: cystic breast diseas organization: sybase, inc. lines: 18 my mom has just been diagnos with cystic breast diseas -- a big relief, as it was a lump that could have been cancer. her doctor say she should go off caffein and chocol for 6 months, as well as stop the estrogen she been take for menopause-rel reasons. she not thrill with this, i think especi becaus she just gave up cigarett -- soon she won't have ani pleasur left! now, i thought i'd heard that cystic breast were common and not realli a health risk. is this accurate? if so, whi is she be told to make various sacrific to treat someth that not that big of a deal? thank for ani information. -- chris -- black@sybase.com note: my mailer tend to garbl subject lines.


In [31]:
vectorizer = TfidfVectorizer(encoding='utf8')
_ = vectorizer.fit(lemmatized_texts_train)
len(vectorizer.vocabulary_)

28717

In [32]:
X_train = vectorizer.transform(lemmatized_texts_train)
X_test = vectorizer.transform(lemmatized_texts_test)

In [33]:
model = train_model(X_train, y_train)

print("Train accuracy = %.4f" % accuracy_score(y_train, model.predict(X_train)))
print("Test accuracy = %.4f" % accuracy_score(y_test, model.predict(X_test)))

Best alpha = 359.3814
Train accuracy = 1.0000
Test accuracy = 0.9873


## Summary

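- BAG-of-WORDS       Test accuracy = 0.9720 
- TF-IDF             Test accuracy = 0.9873

TF-IDF turned out to be more accurate. This can be attributed to the nature of the weighting, which emphasizes the **more significant** words. 


Next we tried two approaches to text preprocessing, both followed by TF-IDF, with the following results: 

- Stemming          Test accuracy = 0.9924
- Lemmatization     Test accuracy = 0.9873

Stemming gave a small improvement over plain TF-IDF, while lemmatization matched it. With 393 test texts, the gap between the two corresponds to about two documents, so the difference should be read with caution.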