# Text Analysis
In this notebook we will practice the following items:
- Text data vectorization
- Additional machine learning models
- Applying supervised machine learning to text data, specifically text classification (into topics) using the 20 Newsgroups data
- Getting familiar with the `pipeline` object

## Import python modules

In [30]:
import pandas as pd
import numpy as np


from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score, GridSearchCV 


from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn import preprocessing
from sklearn import metrics
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.metrics import silhouette_score

from sklearn.datasets import fetch_20newsgroups
from sklearn import datasets

from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# linear models, SVM and neural network classifiers
from sklearn import linear_model
from sklearn.svm import LinearSVC
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier

from sklearn.decomposition import PCA

from sklearn.cluster import KMeans

import warnings
warnings.simplefilter("ignore")
%matplotlib inline

# Show the output of every expression in a cell, not only the last one. This lets us condense several steps into one cell.
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

## Vectorization
Vectorizing text features means converting text into a set of representative numerical values.<br/>
For example, most automatic mining of social media data relies on some form of encoding the text as numbers.<br/>

We will review the following scikit-learn vectorizers:
* The CountVectorizer
* The TfidfVectorizer

### The CountVectorizer
One of the simplest methods of encoding data is by *word counts*: <br/>
you take each snippet of text, count the occurrences of each word within it, <br/>
and put the results in a table.

For example, consider the following set of three phrases:

In [2]:
sample = ['problem of evil',
          'evil queen is evil',
          'horizon problem']

For a vectorization of this data based on word count, we could construct <br/>
   a column representing the word "problem," the word "evil," the word "horizon," and so on.<br/>
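
To make the idea concrete, here is a minimal by-hand sketch of the same word-count table, using only the standard library. It assumes that splitting on whitespace is a good enough tokenizer for this tiny example; the names `vocabulary` and `rows` are illustrative and not part of any library.

```python
from collections import Counter

sample = ['problem of evil',
          'evil queen is evil',
          'horizon problem']

# Vocabulary: every distinct word in the corpus, sorted alphabetically
vocabulary = sorted({word for doc in sample for word in doc.split()})

# One row per document, one column per vocabulary word
rows = []
for doc in sample:
    counts = Counter(doc.split())
    rows.append([counts[word] for word in vocabulary])

print(vocabulary)
for row in rows:
    print(row)
```

Each row holds the counts for one phrase, in the column order given by `vocabulary`.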
   
While doing this by hand would be possible, the tedium can be avoided by using Scikit-Learn's ``CountVectorizer``:

In [3]:
sample
vec = CountVectorizer()
X_train = vec.fit_transform(sample)
type(X_train)
X_train
type(X_train.toarray())
X_train.toarray()
pd.DataFrame(X_train.toarray(), columns=vec.get_feature_names_out())

['problem of evil', 'evil queen is evil', 'horizon problem']

scipy.sparse.csr.csr_matrix

<3x6 sparse matrix of type '<class 'numpy.int64'>'
	with 8 stored elements in Compressed Sparse Row format>

numpy.ndarray

array([[1, 0, 0, 1, 1, 0],
       [2, 0, 1, 0, 0, 1],
       [0, 1, 0, 0, 1, 0]], dtype=int64)

Unnamed: 0,evil,horizon,is,of,problem,queen
0,1,0,0,1,1,0
1,2,0,1,0,0,1
2,0,1,0,0,1,0


#### Some parameters you should review:
* **analyzer** - default='word'; can be changed to 'char' or 'char_wb'.
  * The 'char_wb' option creates character n-grams only from text inside word boundaries.
* **tokenizer** - overrides the string tokenization step while preserving the preprocessing and n-gram generation steps (only applies when analyzer='word').
* **stop_words** - if a list is given (stop_words=python_lst), it is assumed to contain stop words, all of which will be removed from the resulting tokens.
* **ngram_range** - tuple (min_n, max_n), default=(1, 1) - the range of n-gram sizes to extract; for example, (1, 2) extracts unigrams and bigrams.
* **min_df** - float in range [0.0, 1.0] or int, default=1 - ignore terms (or basic units) that appear in fewer than this number (int) or proportion (float) of documents.
* **max_df** - float in range [0.0, 1.0] or int, default=1.0 - ignore terms (or basic units) that appear in more than this number (int) or proportion (float) of documents.
* **max_features** - int, default=None - if set, keep only the top `max_features` terms, ordered by term frequency across the corpus.

For additional information click the link: [sklearn's CountVectorizer documentation](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)

### Mini Exercise 1 
Use the `sample` list as input and do the following:
1. Fit a `CountVectorizer` to the `sample` data.
2. Use the `stop_words` parameter with ['is', 'of'] as stop words
3. Use the `ngram_range` parameter for both unigrams (one word) and bigrams (two words)
4. Display the vectorized dataset

In [None]:
# your solution

In [9]:
vec = CountVectorizer(stop_words=['is', 'of'], ngram_range=(1, 2))
X_train = vec.fit_transform(sample)
X_train = pd.DataFrame(X_train.toarray(), columns=vec.get_feature_names_out())
X_train

Unnamed: 0,evil,evil queen,horizon,horizon problem,problem,problem evil,queen,queen evil
0,1,0,0,0,1,1,0,0
1,2,1,0,0,0,0,1,1
2,0,0,1,1,1,0,0,0


### Mini Exercise 2
Use the `sample` list as input and do the following:
1. Fit a `CountVectorizer` to the `sample` data.
2. Now set the `analyzer` parameter to `'char'`
3. Use the `ngram_range` parameter with a (3, 4) range
4. Display the vectorized dataset

In [None]:
# your solution

In [67]:
vec = CountVectorizer(analyzer='char', ngram_range=(3, 4))
X_train = vec.fit_transform(sample)
X_train = pd.DataFrame(X_train.toarray(), columns=vec.get_feature_names_out())
X_train

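# Alternative attempt: `tokenizer` only applies when analyzer='word', so it is ignored here and the result is the same
vec_char_ngrams = CountVectorizer(analyzer='char', tokenizer=lambda x:x.split(), ngram_range=(3,4))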
X_train = vec_char_ngrams.fit_transform(sample)
pd.DataFrame(X_train.toarray(), columns=vec_char_ngrams.get_feature_names_out())

Unnamed: 0,ev,evi,is,is.1,of,of.1,pr,pro,qu,que,...,rob,robl,s e,s ev,uee,ueen,vil,vil.1,zon,zon.1
0,1,1,0,0,1,1,0,0,0,0,...,1,1,0,0,0,0,1,0,0,0
1,1,1,1,1,0,0,0,0,1,1,...,0,0,1,1,1,1,2,1,0,0
2,0,0,0,0,0,0,1,1,0,0,...,1,1,0,0,0,0,0,0,1,1


Unnamed: 0,ev,evi,is,is.1,of,of.1,pr,pro,qu,que,...,rob,robl,s e,s ev,uee,ueen,vil,vil.1,zon,zon.1
0,1,1,0,0,1,1,0,0,0,0,...,1,1,0,0,0,0,1,0,0,0
1,1,1,1,1,0,0,0,0,1,1,...,0,0,1,1,1,1,2,1,0,0
2,0,0,0,0,0,0,1,1,0,0,...,1,1,0,0,0,0,0,0,1,1


### The TfidfVectorizer
The `CountVectorizer` approach has a drawback: <br/>
   the raw word counts lead to features that put too much weight on words that appear very frequently, <br/>
   and this can be sub-optimal in some classification algorithms.<br/>

One approach to fix this is known as *term frequency-inverse document frequency* (*TF–IDF*),<br/>
   which weights the word counts by a measure of how often they appear in the documents, so that words occurring in many documents count for less.<br/>
The syntax for computing these features is similar to the previous example, as shown after the parameter list.

#### Some parameters you should review:
* **norm** - default='l2'; 'l1' is also possible.
  * `'l2'` - the sum of squares of the vector elements is 1
  * `'l1'` - the sum of absolute values of the vector elements is 1
* **use_idf** - bool, default=True - if False, no IDF weighting is applied; the result is like `CountVectorizer`, but with normalized term frequencies instead of raw counts.
* **sublinear_tf** - bool, default=False - if True, replace tf with 1 + log(tf), which dampens the effect of very frequent terms (motivated by Zipf's law).
* **stop_words** - if a list is given (stop_words=python_lst), it is assumed to contain stop words, all of which will be removed from the resulting tokens.
* **ngram_range** - tuple (min_n, max_n), default=(1, 1) - the range of n-gram sizes to extract; for example, (1, 2) extracts unigrams and bigrams.
* **min_df** - float in range [0.0, 1.0] or int, default=1 - ignore terms (or basic units) that appear in fewer than this number (int) or proportion (float) of documents.
* **max_df** - float in range [0.0, 1.0] or int, default=1.0 - ignore terms (or basic units) that appear in more than this number (int) or proportion (float) of documents.
* **max_features** - int, default=None - if set, keep only the top `max_features` terms, ordered by term frequency across the corpus.

For additional information click the link: [sklearn's TfidfVectorizer documentation](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html)
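
As a quick check of what the `norm` parameter does, the sketch below vectorizes the three sample phrases with both norms and sums each row. The variable names are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

sample = ['problem of evil',
          'evil queen is evil',
          'horizon problem']

X_l2 = TfidfVectorizer(norm='l2').fit_transform(sample).toarray()
X_l1 = TfidfVectorizer(norm='l1').fit_transform(sample).toarray()

# With 'l2' the squared values of every row sum to 1
l2_sums = (X_l2 ** 2).sum(axis=1)
# With 'l1' the absolute values of every row sum to 1
l1_sums = np.abs(X_l1).sum(axis=1)

print(l2_sums)
print(l1_sums)
```

Either way every document ends up with the same overall "length", so long documents do not dominate short ones.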

In [10]:
vec = TfidfVectorizer()
X_train_idf = vec.fit_transform(sample)
sample
pd.DataFrame(X_train_idf.toarray(), columns=vec.get_feature_names_out())

['problem of evil', 'evil queen is evil', 'horizon problem']

Unnamed: 0,evil,horizon,is,of,problem,queen
0,0.517856,0.0,0.0,0.680919,0.517856,0.0
1,0.732359,0.0,0.481482,0.0,0.0,0.481482
2,0.0,0.795961,0.0,0.0,0.605349,0.0
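
The values in the table above can be reproduced by hand. The sketch below assumes scikit-learn's default settings for `TfidfVectorizer`: a smoothed IDF of `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization of each row. Whitespace splitting stands in for the real tokenizer, which is enough for this sample.

```python
import math
from collections import Counter

sample = ['problem of evil',
          'evil queen is evil',
          'horizon problem']

tokenized = [doc.split() for doc in sample]
vocabulary = sorted({word for doc in tokenized for word in doc})
n_docs = len(tokenized)

# Document frequency: in how many documents does each word appear?
doc_freq = {word: sum(word in doc for doc in tokenized) for word in vocabulary}

# Smoothed IDF (assumed to match the TfidfVectorizer default, smooth_idf=True)
idf = {word: math.log((1 + n_docs) / (1 + doc_freq[word])) + 1 for word in vocabulary}

tfidf_rows = []
for doc in tokenized:
    tf = Counter(doc)
    raw = [tf[word] * idf[word] for word in vocabulary]
    l2_norm = math.sqrt(sum(value ** 2 for value in raw))
    tfidf_rows.append([value / l2_norm for value in raw])

print(vocabulary)
for row in tfidf_rows:
    print([round(value, 6) for value in row])
```

Note how "of", which appears in only one document, gets a larger weight in the first row than "evil" and "problem", which each appear in two.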


### Mini Exercise 3
Use the `sample` list as input and do the following:
1. Fit a `TfidfVectorizer` to the `sample` data.
2. Now set `sublinear_tf` to True and `use_idf` to False (without IDF this is close to a `CountVectorizer`, as we studied in the lecture, except that the counts are log-scaled and each row is normalized)
3. Display the vectorized dataset
4. Now try the `min_df` parameter with a value of 2. What changed?

In [None]:
# your solution

In [14]:
vec = TfidfVectorizer(sublinear_tf=True, use_idf=False)
X_train = vec.fit_transform(sample)
pd.DataFrame(X_train.toarray(), columns=vec.get_feature_names_out())

vec = TfidfVectorizer(sublinear_tf=True, use_idf=False, min_df=2)
X_train = vec.fit_transform(sample)
pd.DataFrame(X_train.toarray(), columns=vec.get_feature_names_out())

Unnamed: 0,evil,horizon,is,of,problem,queen
0,0.57735,0.0,0.0,0.57735,0.57735,0.0
1,0.767495,0.0,0.453295,0.0,0.0,0.453295
2,0.0,0.707107,0.0,0.0,0.707107,0.0


Unnamed: 0,evil,problem
0,0.707107,0.707107
1,1.0,0.0
2,0.0,1.0
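
To see where the numbers in the first table come from, here is a by-hand sketch for the second phrase, 'evil queen is evil'. It assumes the default L2 normalization and the `1 + log(tf)` scaling described for `sublinear_tf`; the dictionary names are illustrative.

```python
import math

# Raw term counts for the phrase 'evil queen is evil'
tf = {'evil': 2, 'is': 1, 'queen': 1}

# Sublinear scaling: tf is replaced by 1 + ln(tf)
scaled = {word: 1 + math.log(count) for word, count in tf.items()}

# L2 normalization of the row
l2_norm = math.sqrt(sum(value ** 2 for value in scaled.values()))
weights = {word: value / l2_norm for word, value in scaled.items()}

for word, weight in weights.items():
    print(word, round(weight, 6))
```

With `min_df=2`, only "evil" and "problem" remain, because they are the only words that appear in at least two phrases. Each row is then normalized again over the two remaining columns, which is why the second table contains values such as 1.0.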


## Additional Machine Learning models
Some other models, some of which we have not studied **yet**:

In [15]:
# import the iris dataset
iris = datasets.load_iris()

In [16]:
# load another sample dataset for regression
diabetes = datasets.load_diabetes()

### Regression

#### Linear Regression

In [17]:
# linear regression
regr = linear_model.LinearRegression()
regr.fit(diabetes.data, diabetes.target)

LinearRegression()

In [18]:
# regression coefficients
print(diabetes.data.shape)
print(regr.coef_.shape,regr.coef_)
print(regr.intercept_)

(442, 10)
(10,) [ -10.01219782 -239.81908937  519.83978679  324.39042769 -792.18416163
  476.74583782  101.04457032  177.06417623  751.27932109   67.62538639]
152.1334841628965


In [19]:
# mean squared error
np.mean((regr.predict(diabetes.data)-diabetes.target)**2)

2859.6903987680657

In [20]:
# explained variance (r^2)
regr.score(diabetes.data, diabetes.target)

0.5177494254132934
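
The scores above are computed on the same data the model was trained on, so they tend to be optimistic. A minimal sketch of a held-out evaluation follows; the 25% test size and `random_state=0` are arbitrary choices made here for illustration. It also recomputes r^2 from its definition to show what `r2_score` measures.

```python
import numpy as np
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

diabetes = datasets.load_diabetes()

# Hold out a quarter of the rows for evaluation (split parameters are arbitrary)
X_tr, X_te, y_tr, y_te = train_test_split(
    diabetes.data, diabetes.target, test_size=0.25, random_state=0)

regr = linear_model.LinearRegression().fit(X_tr, y_tr)
y_hat = regr.predict(X_te)

mse = mean_squared_error(y_te, y_hat)
r2 = r2_score(y_te, y_hat)

# r^2 by definition: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_te - y_hat) ** 2)
ss_tot = np.sum((y_te - y_te.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(mse, r2, r2_manual)
```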

### Clustering

#### K-means

In [21]:
# k means clustering
k_means = KMeans(n_clusters=3, init='k-means++', n_init=5)
k_means.fit(iris.data)
print(k_means.labels_)
print(k_means.cluster_centers_)

KMeans(n_clusters=3, n_init=5)

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 2 2 2 2 1 2 2 2 2
 2 2 1 1 2 2 2 2 1 2 1 2 1 2 2 1 1 2 2 2 2 2 1 2 2 2 2 1 2 2 2 1 2 2 2 1 2
 2 1]
[[5.006      3.428      1.462      0.246     ]
 [5.9016129  2.7483871  4.39354839 1.43387097]
 [6.85       3.07368421 5.74210526 2.07105263]]
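
Because clustering is unsupervised, the label ids printed above are arbitrary, and the number of clusters has to be chosen by us. One common way to compare candidate values of k is the silhouette score, which ranges from -1 to 1 (higher means tighter, better separated clusters). The sketch below is one possible comparison; the range of k and the `random_state` are assumptions made for the example.

```python
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

iris = datasets.load_iris()

scores = {}
for k in range(2, 6):
    model = KMeans(n_clusters=k, init='k-means++', n_init=10, random_state=0)
    labels = model.fit_predict(iris.data)
    # Mean silhouette coefficient over all samples for this clustering
    scores[k] = silhouette_score(iris.data, labels)

for k, score in scores.items():
    print(k, round(score, 3))
```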


### Classification

#### SVM:

In [22]:
sgd_svm_cls = SGDClassifier(loss='hinge', penalty='l2', 
                    alpha=1e-3, random_state=42, max_iter=5, tol=None)
sgd_svm_cls.fit(iris.data[:-2], iris.target[:-2])
sgd_svm_cls.predict(iris.data[-2:])


SGDClassifier(alpha=0.001, max_iter=5, random_state=42, tol=None)

array([2, 2])

In [23]:
linear_svm_cls=LinearSVC()
linear_svm_cls.fit(iris.data[:-2], iris.target[:-2])
linear_svm_cls.predict(iris.data[-2:])

LinearSVC()

array([2, 2])

#### Perceptron

In [24]:
perceptron_cls=Perceptron(tol=1e-3, random_state=42, alpha=0.00001, max_iter=10)
perceptron_cls.fit(iris.data[:-2], iris.target[:-2])
perceptron_cls.predict(iris.data[-2:])

Perceptron(alpha=1e-05, max_iter=10, random_state=42)

array([2, 2])

#### Artificial Neural Networks (ANN):

In [25]:
mlp_cls = MLPClassifier(activation='logistic',solver='sgd')
mlp_cls.fit(iris.data[:-2], iris.target[:-2])
mlp_cls.predict(iris.data[-2:])

MLPClassifier(activation='logistic', solver='sgd')

array([2, 2])
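
The cells above only predict the last two samples, which says little about how well each model generalizes. A rough comparison can be sketched with `cross_val_score`, which trains and scores each model on several train/validation splits. The choice of 5 folds and the slightly different hyperparameters (more iterations than above) are assumptions made for this example.

```python
from sklearn import datasets
from sklearn.linear_model import Perceptron, SGDClassifier
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

iris = datasets.load_iris()

models = {
    'LinearSVC': LinearSVC(),
    'SGD (hinge loss)': SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3,
                                      random_state=42, max_iter=1000, tol=1e-3),
    'Perceptron': Perceptron(tol=1e-3, random_state=42),
}

cv_means = {}
for name, model in models.items():
    # Accuracy on each of 5 stratified folds
    fold_scores = cross_val_score(model, iris.data, iris.target, cv=5)
    cv_means[name] = fold_scores.mean()

for name, mean_accuracy in cv_means.items():
    print(name, round(mean_accuracy, 3))
```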

## Text processing - Text Classification pipeline
Let's get familiar with the `pipeline` object

**Text Classification Flow**

For this task we will use a dataset called “Twenty Newsgroups”. This is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups (topics). The 20 newsgroups collection has become a popular data set for experiments in text applications of machine learning techniques, such as text classification and text clustering.

We will use the built-in [dataset loader for 20 newsgroups](https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html#loading-the-20-newsgroups-dataset) from scikit-learn. Our task is to train a classifier to correctly classify a new post into one of the topics (newsgroups) based on its content. We will use part of the examples provided [here](https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html#training-a-classifier)

In [31]:
twenty_train = fetch_20newsgroups(subset='train', shuffle=True, random_state=12) # use sklearn's method

Let's take a look at some of the documents (feel free to change the document id you look at)

In [34]:
doc_id=11
print('\n'.join([line for line in twenty_train.data[doc_id].split('\n') if line.strip()])) # print the chosen doc without blank lines
print("its topic id is:",twenty_train.target[doc_id])
print("its topic name is:",twenty_train.target_names[twenty_train.target[doc_id]])

From: hrs1@cbnewsi.cb.att.com (herman.r.silbiger)
Subject: ANSI/AIIM MS-53 Standard Image File Format
Organization: AT&T
Keywords: image, file format
Lines: 6
wing the suggestion of Stu Lynne, I have posted the Image File Format executable and source code to alt.sources.
Herman Silbiger
.
its topic id is: 1
its topic name is: comp.graphics


Let's take a look at the 20 topics:

In [35]:
# first 5 classes:
twenty_train.target_names[:5]

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware']

It's time to turn it into a feature matrix (do you remember how to do it?)

In [36]:
count_vect = CountVectorizer(stop_words="english")
X_train_counts = count_vect.fit_transform(twenty_train.data)
X_train_counts.shape

(11314, 129796)

Wow! Over 120,000 features! That might be too many; we don't need all of them, so let's limit ourselves to the top 10,000 features:

In [37]:
count_vect = CountVectorizer(stop_words="english",max_features=10000)
X_train_counts = count_vect.fit_transform(twenty_train.data)
X_train_counts.shape

(11314, 10000)

That's more reasonable (you can always test again later what happens if you keep the larger number of features, or reduce the number even more aggressively).

As seen earlier, it's now recommended to normalize the data (according to relative frequency):

In [43]:
pd.DataFrame(X_train_counts.toarray(), columns=count_vect.get_feature_names_out()).head(3)
X_train_normalized = preprocessing.normalize(X_train_counts, norm='l1')
pd.DataFrame(X_train_normalized.toarray(), columns=count_vect.get_feature_names_out()).head(3)
#X_train_normalized.toarray()

Unnamed: 0,00,000,005,01,02,02238,02p,03,030,0358,...,zone,zoo,zoology,zoom,zq,zs,zuma,zv,zx,zz
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,2,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Unnamed: 0,00,000,005,01,02,02238,02p,03,030,0358,...,zone,zoo,zoology,zoom,zq,zs,zuma,zv,zx,zz
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.007576,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
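
L1 normalization simply divides every row by the sum of its values, turning counts into relative frequencies. The sketch below checks this on the small count matrix from the `sample` phrases; `by_hand` is an illustrative name.

```python
import numpy as np
from sklearn import preprocessing

# Word counts of the three sample phrases (columns: evil, horizon, is, of, problem, queen)
counts = np.array([[1, 0, 0, 1, 1, 0],
                   [2, 0, 1, 0, 0, 1],
                   [0, 1, 0, 0, 1, 0]], dtype=float)

normalized = preprocessing.normalize(counts, norm='l1')

# The same result by hand: divide each row by its total count
by_hand = counts / counts.sum(axis=1, keepdims=True)

print(normalized)
print(np.allclose(normalized, by_hand))
```

In the table above, a count of 2 became 0.007576, which is consistent with that document holding roughly 264 counted terms (2 / 264 ≈ 0.007576).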


Money time! Time to train the classifier. We will use the Naive Bayes classifier (SVM works well for texts as well).

In [44]:
clf_nb = MultinomialNB().fit(X_train_normalized, twenty_train.target)

Ok, let's evaluate the model on the test set. But...

Before we run it, we need to pass the test data through the same steps of feature extraction, filtering and normalization (exactly as in the train phase). We have to use the same vectorizer object (otherwise we will get different feature ids). This can be complicated, and that's why we have the `pipeline` object, which comes to our help.
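
To see why the fitted vectorizer must be reused, consider this small sketch. The "test" phrase is made up for the example. Calling `transform` on the fitted vectorizer keeps the training columns, whereas fitting a fresh vectorizer on the test text produces a different, incompatible feature space.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_docs = ['problem of evil',
              'evil queen is evil',
              'horizon problem']
test_docs = ['evil horizon ahead']   # made-up unseen document

vec = CountVectorizer()
X_train = vec.fit_transform(train_docs)

# Correct: reuse the fitted vocabulary; the unseen word 'ahead' is ignored
X_test = vec.transform(test_docs)

# Wrong: a new vectorizer learns its own vocabulary from the test text
X_test_refit = CountVectorizer().fit_transform(test_docs)

print(X_train.shape, X_test.shape, X_test_refit.shape)
print(X_test.toarray())
```

The classifier was trained on the columns of `X_train`, so only `X_test` can be passed to it.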


## `pipeline` Object

To make the vectorizer => transformer => classifier chain easier to work with, scikit-learn provides a `Pipeline` class that behaves like a compound classifier:

In [45]:

text_clf_nb = Pipeline([
    ('vect', CountVectorizer(stop_words="english",max_features=10000)),
    ('norm', preprocessing.Normalizer(norm='l1')),
    ('clf_nb', MultinomialNB()),
])

The names `vect`, `norm` and `clf_nb` (classifier) are arbitrary. We can use them, for example, to perform a grid search for suitable hyperparameters. We will now train the model with a single command:

In [48]:
clf_nb = text_clf_nb.fit(twenty_train.data, twenty_train.target)
clf_nb

Pipeline(steps=[('vect',
                 CountVectorizer(max_features=10000, stop_words='english')),
                ('norm', Normalizer(norm='l1')), ('clf_nb', MultinomialNB())])
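
The step names become useful when searching for hyperparameters: a parameter of a step is addressed as `<step name>__<parameter name>`. The sketch below runs a small `GridSearchCV` over such a pipeline. The tiny corpus, its labels and the parameter values are made up for illustration, so the resulting scores carry no meaning beyond showing the mechanics.

```python
from sklearn import preprocessing
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up toy corpus: label 0 = sports, label 1 = graphics
docs = [
    'the team won the hockey game',
    'great goal in the hockey match',
    'the player scored in the game',
    'hockey season starts tonight',
    'new graphics card renders images fast',
    'image file format for graphics',
    'rendering images on the screen',
    'graphics software draws the image',
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

pipe = Pipeline([
    ('vect', CountVectorizer()),
    ('norm', preprocessing.Normalizer(norm='l1')),
    ('clf_nb', MultinomialNB()),
])

# Keys follow the pattern <step name>__<parameter name>
param_grid = {
    'vect__ngram_range': [(1, 1), (1, 2)],
    'clf_nb__alpha': [0.1, 1.0],
}

search = GridSearchCV(pipe, param_grid, cv=2)
search.fit(docs, labels)

print(search.best_params_)
print(search.best_score_)
```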

What's next?

Correct, evaluation on the test set. Evaluating the predictive accuracy of the model is equally easy:

In [49]:
twenty_test = fetch_20newsgroups(subset='test', shuffle=True, random_state=12)
docs_test = twenty_test.data
predicted = text_clf_nb.predict(docs_test)
np.mean(predicted == twenty_test.target)


0.6481678173127987

We achieved 64.8% accuracy.

In [50]:
print(metrics.classification_report(twenty_test.target, predicted,
    target_names=twenty_test.target_names))

                          precision    recall  f1-score   support

             alt.atheism       1.00      0.03      0.06       319
           comp.graphics       0.59      0.70      0.64       389
 comp.os.ms-windows.misc       0.65      0.78      0.71       394
comp.sys.ibm.pc.hardware       0.60      0.71      0.65       392
   comp.sys.mac.hardware       0.89      0.58      0.70       385
          comp.windows.x       0.74      0.72      0.73       395
            misc.forsale       0.68      0.85      0.76       390
               rec.autos       0.64      0.84      0.73       396
         rec.motorcycles       0.60      0.91      0.72       398
      rec.sport.baseball       0.44      0.89      0.59       397
        rec.sport.hockey       0.80      0.91      0.85       399
               sci.crypt       0.73      0.85      0.79       396
         sci.electronics       0.74      0.47      0.57       393
                 sci.med       0.74      0.63      0.68       396
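
The `alt.atheism` row shows why accuracy alone can mislead: a precision of 1.00 with a recall of 0.03 means the model almost never predicts this class, but is right when it does. The toy sketch below reproduces that pattern with made-up labels and computes the three metrics by hand.

```python
# Toy labels: 10 documents truly belong to class 'A' and 10 to class 'B'
y_true = ['A'] * 10 + ['B'] * 10
# The classifier labels only one document as 'A' (correctly) and all others as 'B'
y_pred = ['A'] + ['B'] * 19

pairs = list(zip(y_true, y_pred))
tp = sum(t == 'A' and p == 'A' for t, p in pairs)   # true positives for class 'A'
fp = sum(t != 'A' and p == 'A' for t, p in pairs)   # false positives
fn = sum(t == 'A' and p != 'A' for t, p in pairs)   # false negatives

precision = tp / (tp + fp)   # how many predicted 'A' really are 'A'
recall = tp / (tp + fn)      # how many true 'A' were found
f1 = 2 * precision * recall / (precision + recall)

print(precision, recall, round(f1, 3))
```

The same formula applied to the report gives 2 * 1.00 * 0.03 / 1.03 ≈ 0.06, matching the f1-score shown for `alt.atheism`.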
         

### Pipeline Exercise 1
We have defined below a subset of 4 categories from the 20 newsgroups. Build a classifier to classify an unseen document into one of the 4 categories. This time use only the top 1000 features, and no function words (stop words). Evaluate your performance on the test set.

In [52]:
categories = ['alt.atheism', 'soc.religion.christian', 'comp.graphics', 'sci.med']

train_data = fetch_20newsgroups(subset='train', categories=categories, shuffle=True, random_state=32)

test_data = fetch_20newsgroups(subset='test', categories=categories, shuffle=True, random_state=32)


In [None]:
# your code here

In [68]:
# first attempt (without stop-word removal)
text_clf_nb = Pipeline([
    ('vect', CountVectorizer(max_features=1000)),
    ('norm', preprocessing.Normalizer(norm='l1')),
    ('clf_nb', MultinomialNB()),
])
text_clf_nb.fit(train_data.data, train_data.target)
predicted = text_clf_nb.predict(test_data.data)
np.mean(predicted == test_data.target)


# pipeline exercise 1 - solution
text_clf = Pipeline([
    ('vect', CountVectorizer(stop_words="english",max_features=1000)),
    ('norm', preprocessing.Normalizer(norm='l1')),
    ('clf', SGDClassifier(loss='hinge', penalty='l2',
                          alpha=1e-3, random_state=42,
                          max_iter=5, tol=None)),
])
text_clf.fit(train_data.data, train_data.target)
y_pred = text_clf.predict(test_data.data)
np.mean(y_pred == test_data.target)

Pipeline(steps=[('vect', CountVectorizer(max_features=1000)),
                ('norm', Normalizer(norm='l1')), ('clf_nb', MultinomialNB())])

0.6418109187749668

Pipeline(steps=[('vect',
                 CountVectorizer(max_features=1000, stop_words='english')),
                ('norm', Normalizer(norm='l1')),
                ('clf',
                 SGDClassifier(alpha=0.001, max_iter=5, random_state=42,
                               tol=None))])

0.7789613848202397

Can you display a confusion matrix? Which categories were confused? What can you do to fix it?

In [None]:
# your code here

In [72]:
pd.DataFrame(metrics.confusion_matrix(y_pred=y_pred ,y_true=test_data.target), columns=train_data.target_names, index=train_data.target_names)

Unnamed: 0,alt.atheism,comp.graphics,sci.med,soc.religion.christian
alt.atheism,156,29,20,114
comp.graphics,3,375,6,5
sci.med,9,101,261,25
soc.religion.christian,9,11,0,378
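
One way to read the matrix above: rows are the true classes and columns are the predicted classes, so off-diagonal cells are mistakes. The sketch below copies the numbers from the table and computes the overall accuracy, the recall of each class and the most frequent confusion; the variable names are illustrative.

```python
import numpy as np

labels = ['alt.atheism', 'comp.graphics', 'sci.med', 'soc.religion.christian']

# Confusion matrix copied from the table above (rows: true class, columns: predicted class)
cm = np.array([[156,  29,  20, 114],
               [  3, 375,   6,   5],
               [  9, 101, 261,  25],
               [  9,  11,   0, 378]])

accuracy = np.trace(cm) / cm.sum()               # correct predictions / all predictions
recall_per_class = np.diag(cm) / cm.sum(axis=1)  # correct predictions / true members of each class

# Largest off-diagonal cell: the most frequent confusion
off_diag = cm.copy()
np.fill_diagonal(off_diag, 0)
true_idx, pred_idx = np.unravel_index(off_diag.argmax(), off_diag.shape)

print(round(accuracy, 4))
for name, recall in zip(labels, recall_per_class):
    print(name, round(recall, 3))
print(labels[true_idx], 'is most often mistaken for', labels[pred_idx])
```

The accuracy computed this way agrees with the 0.7789 reported for the SGD pipeline, and the largest confusion is between the two religion-related groups.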


In [70]:
# To reduce the confusion, remove stop words with stop_words='english'
text_clf_nb = Pipeline([
    ('vect', CountVectorizer(stop_words='english', max_features=1000)),
    ('norm', preprocessing.Normalizer(norm='l1')),
    ('clf_nb', MultinomialNB()),
])

text_clf_nb.fit(train_data.data, train_data.target)
predicted = text_clf_nb.predict(test_data.data)
np.mean(predicted == test_data.target)

pd.DataFrame(metrics.confusion_matrix(y_pred=predicted ,y_true=test_data.target), columns=train_data.target_names, index=train_data.target_names)

Pipeline(steps=[('vect',
                 CountVectorizer(max_features=1000, stop_words='english')),
                ('norm', Normalizer(norm='l1')), ('clf_nb', MultinomialNB())])

0.7336884154460719

Unnamed: 0,alt.atheism,comp.graphics,sci.med,soc.religion.christian
alt.atheism,41,31,67,180
comp.graphics,0,370,14,5
sci.med,0,71,314,11
soc.religion.christian,0,18,3,377
