<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Natural Language Processing Lab

---

In this lab we will further explore sklearn's and NLTK's capabilities for processing text. We will use the 20 Newsgroups dataset, which is provided by sklearn.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
 
plt.style.use('ggplot')
sns.set(font_scale=1.5)
%config InlineBackend.figure_format = 'retina'
%matplotlib inline

In [2]:
# Getting the Sklearn Dataset
from sklearn.datasets import fetch_20newsgroups

### 1. Use the `fetch_20newsgroups` function to download a training and testing set.

Look up the [function documentation](https://scikit-learn.org/0.19/datasets/twenty_newsgroups.html) for how to grab the data.

You should pull these categories:
- `alt.atheism`
- `talk.religion.misc`
- `comp.graphics`
- `sci.space`

Also remove the headers, footers, and quotes using the `remove` keyword argument of the function.

In [18]:
# Extracting Information from the Data's Dictionary format 
# Categories of emails we want
categories = [
    'alt.atheism',
    'talk.religion.misc',
    'comp.graphics',
    'sci.space',
]
# Setting training data
data_train = fetch_20newsgroups(subset='train', categories=categories,
                                shuffle=True, random_state=42,
                               remove=('headers', 'footers', 'quotes')
 )
# Setting testing data
data_test = fetch_20newsgroups(subset='test', categories=categories,
                               shuffle=True, random_state=42,
                               remove=('headers', 'footers', 'quotes')
)

In [19]:
from collections import Counter

In [174]:
list(data_train.keys())

['data', 'filenames', 'target_names', 'target', 'DESCR']

In [20]:
data_train

{'data': ["Hi,\n\nI've noticed that if you only save a model (with all your mapping planes\npositioned carefully) to a .3DS file that when you reload it after restarting\n3DS, they are given a default position and orientation.  But if you save\nto a .PRJ file their positions/orientation are preserved.  Does anyone\nknow why this information is not stored in the .3DS file?  Nothing is\nexplicitly said in the manual about saving texture rules in the .PRJ file. \nI'd like to be able to read the texture rule information, does anyone have \nthe format for the .PRJ file?\n\nIs the .CEL file format available from somewhere?\n\nRych",
  '\n\nSeems to be, barring evidence to the contrary, that Koresh was simply\nanother deranged fanatic who thought it neccessary to take a whole bunch of\nfolks with him, children and all, to satisfy his delusional mania. Jim\nJones, circa 1993.\n\n\nNope - fruitcakes like Koresh have been demonstrating such evil corruption\nfor centuries.',
  "\n >In article <19

In [175]:
data_test

{'data': ['TRry the SKywatch project in  Arizona.',
  'The Vatican library recently made a tour of the US.\n Can anyone help me in finding a FTP site where this collection is \n available.',
  'Hi there,\n\nI am here looking for some help.\n\nMy friend is a interior decor designer. He is from Thailand. He is\ntrying to find some graphics software on PC. Any suggestion on which\nsoftware to buy,where to buy and how much it costs ? He likes the most\nsophisticated \nsoftware(the more features it has,the better)',
  'RFD\n                          Request For Discussion\n                                for the\n                          OPEN  TELEMATIC GROUP\n\n                                  OTG\n\nI have proposed the forming of a consortium/task force for the\npromotion of NAPLPS/JPEG, FIF to openly discuss ways, method,\nprocedures,algorythms, applications, implementation, extensions of\nNAPLPS/JPEG standards.  These standards should facilitate the creation\nof REAL_TIME Online appli

### 2. Data inspection

We have downloaded a few newsgroup categories and removed headers, footers and quotes.

Because this is an sklearn dataset, it comes with pre-split train and test sets (note that we could pass `subset='train'` and `subset='test'` to the fetch function).

Let's inspect them.

- What data type is `data_train`?
- What does `data_train` contain? 
- How many data points does `data_train` contain?
- How many data points of each category does `data_train` contain?
- Inspect the first data point, what does it look like?
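
The per-category question above can be answered by counting the integer labels in `data_train.target` and mapping them to `data_train.target_names`. The following is a minimal sketch; the two small arrays are made-up stand-ins for those attributes so that it runs without downloading the dataset.

```python
from collections import Counter

import numpy as np

# Hypothetical stand-ins for data_train.target and data_train.target_names
target = np.array([1, 3, 2, 1, 0, 1])
target_names = ['alt.atheism', 'comp.graphics', 'sci.space', 'talk.religion.misc']

# Map every integer label to its category name and count the occurrences
counts = Counter(target_names[i] for i in target)
for name in target_names:
    print(name, counts[name])
```

With the real data, replacing the two stand-ins by `data_train.target` and `data_train.target_names` should give the number of posts per newsgroup.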

In [21]:
type(data_train)

sklearn.utils.Bunch

In [22]:
# len() of a Bunch counts its keys, not the documents; use len(data_train.data) for that
len(data_train)

5

In [29]:
from collections import Counter
print(Counter(data_train.data[0].lower().split()))
print()



Counter({'the': 7, 'a': 4, 'to': 4, 'you': 3, 'file': 3, '.prj': 3, 'is': 3, 'in': 3, 'that': 2, 'if': 2, 'save': 2, '.3ds': 2, 'are': 2, 'does': 2, 'anyone': 2, 'file?': 2, 'texture': 2, 'format': 2, 'hi,': 1, "i've": 1, 'noticed': 1, 'only': 1, 'model': 1, '(with': 1, 'all': 1, 'your': 1, 'mapping': 1, 'planes': 1, 'positioned': 1, 'carefully)': 1, 'when': 1, 'reload': 1, 'it': 1, 'after': 1, 'restarting': 1, '3ds,': 1, 'they': 1, 'given': 1, 'default': 1, 'position': 1, 'and': 1, 'orientation.': 1, 'but': 1, 'their': 1, 'positions/orientation': 1, 'preserved.': 1, 'know': 1, 'why': 1, 'this': 1, 'information': 1, 'not': 1, 'stored': 1, 'nothing': 1, 'explicitly': 1, 'said': 1, 'manual': 1, 'about': 1, 'saving': 1, 'rules': 1, 'file.': 1, "i'd": 1, 'like': 1, 'be': 1, 'able': 1, 'read': 1, 'rule': 1, 'information,': 1, 'have': 1, 'for': 1, '.cel': 1, 'available': 1, 'from': 1, 'somewhere?': 1, 'rych': 1})



In [32]:
data_train.keys()

dict_keys(['data', 'filenames', 'target_names', 'target', 'DESCR'])

In [35]:
print(data_train.data[0])

Hi,

I've noticed that if you only save a model (with all your mapping planes
positioned carefully) to a .3DS file that when you reload it after restarting
3DS, they are given a default position and orientation.  But if you save
to a .PRJ file their positions/orientation are preserved.  Does anyone
know why this information is not stored in the .3DS file?  Nothing is
explicitly said in the manual about saving texture rules in the .PRJ file. 
I'd like to be able to read the texture rule information, does anyone have 
the format for the .PRJ file?

Is the .CEL file format available from somewhere?

Rych


In [37]:
len(data_train.data)

2034

In [30]:
from sklearn.feature_extraction.text import CountVectorizer


In [118]:
cvec = CountVectorizer()
# Fit on the list of documents; fitting on the Bunch itself would only see its five keys
cvec.fit(data_train.data)


CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

### 3. Bag of Words model

Let's train a model using a simple count vectorizer.

- Initialize a standard CountVectorizer and fit the training data.
- How big is the feature dictionary?
- Repeat eliminating English stop words.
- Is the dictionary smaller?
- Transform the training data using the trained vectorizer.
- What are the 20 words that are most common in the whole corpus?
- What are the 20 most common words in each of the 4 classes?
- Evaluate the performance of a Logistic Regression on the features extracted by the CountVectorizer.
    - You will have to transform the test set, too. Be careful to use the already-fitted vectorizer, without re-fitting it.
    - Create a confusion matrix.

**BONUS:**
- Try a couple of modifications:
    - restrict max_features
    - change max_df and min_df
    - for each of the above, print a confusion matrix and investigate which classes get confused
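
The evaluation steps listed above (fit the vectorizer on the training text, transform the test text with the same vocabulary, fit a Logistic Regression, build a confusion matrix) can be sketched as follows. The tiny corpus and its labels are invented for illustration and stand in for `data_train` and `data_test`; the default `LogisticRegression()` settings are an assumption, not the lab's prescribed configuration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

# Toy stand-ins for data_train.data / data_train.target and the test split
train_docs = [
    "the rocket reached orbit",
    "nasa launched a new satellite into orbit",
    "render the image with shading",
    "the polygon mesh has a texture",
]
train_labels = [1, 1, 0, 0]  # hypothetical labels: 1 = space, 0 = graphics
test_docs = ["the satellite is in orbit", "texture mapping on a mesh"]
test_labels = [1, 0]

cvec = CountVectorizer(stop_words="english")
X_train = cvec.fit_transform(train_docs)  # learn the vocabulary on training text only
X_test = cvec.transform(test_docs)        # reuse that vocabulary, no re-fitting

clf = LogisticRegression()
clf.fit(X_train, train_labels)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(test_labels, clf.predict(X_test))
print(cm)
```

Because the test matrix is produced with `transform` rather than `fit_transform`, it has the same columns as the training matrix, which is what the classifier expects.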

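**Note:** Passing `stop_words={"English"}` changed nothing, because a set is treated as a custom stop-word list holding the single token `English`, which never matches the lowercased tokens. The built-in list is selected with the string `stop_words="english"`.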

In [120]:
from sklearn.feature_extraction.text import CountVectorizer
cvec = CountVectorizer(stop_words="english")
cvec.fit(data_train.data)


CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words='english',
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)
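
To see whether removing stop words shrinks the dictionary, the vocabulary sizes of two fitted vectorizers can be compared directly. This is a small sketch on two invented sentences; with the real data, `docs` would be `data_train.data`.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up documents standing in for data_train.data
docs = ["the rocket is in orbit", "this is the texture of the mesh"]

plain = CountVectorizer().fit(docs)
no_stop = CountVectorizer(stop_words="english").fit(docs)

# vocabulary_ maps each term to its column index, so its length is the dictionary size
print("with stop words:   ", len(plain.vocabulary_))
print("without stop words:", len(no_stop.vocabulary_))
print(sorted(no_stop.vocabulary_))
```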

In [121]:
document_matrix = cvec.transform(data_train.data)
document_matrix


<2034x26576 sparse matrix of type '<class 'numpy.int64'>'
	with 133634 stored elements in Compressed Sparse Row format>

In [122]:
cvec.get_feature_names()


['00',
 '000',
 '0000',
 '00000',
 '000000',
 '000005102000',
 '000062david42',
 '0001',
 '000100255pixel',
 '00041032',
 '0004136',
 '0004246',
 '0004422',
 '00044513',
 '0004847546',
 '0005',
 '0007',
 '00090711',
 '000usd',
 '0012',
 '001200201pixel',
 '0018',
 '00196',
 '0020',
 '0022',
 '0028',
 '0029',
 '0033',
 '0034',
 '0038',
 '0049',
 '006',
 '0065',
 '0094',
 '0098',
 '00index',
 '00pm',
 '01',
 '0100',
 '013846',
 '01752',
 '0179',
 '01821',
 '01826',
 '0184',
 '01852',
 '01854',
 '01890',
 '018b',
 '0195',
 '0199',
 '01a',
 '02',
 '020',
 '0200',
 '020359',
 '020637',
 '02115',
 '02138',
 '02139',
 '02154',
 '02178',
 '0223',
 '0235',
 '023b',
 '0245',
 '03',
 '030',
 '0300',
 '03051',
 '0330',
 '034',
 '034101',
 '04',
 '040',
 '040286',
 '0410',
 '04110',
 '041493003715',
 '0418',
 '045',
 '04g',
 '05',
 '050',
 '0500',
 '050524',
 '0511',
 '05402',
 '05446',
 '0545',
 '054589e',
 '058',
 '06',
 '060',
 '0605',
 '06111',
 '06179397',
 '06487',
 '0649',
 '067',
 '0674',
 

In [123]:
len(cvec.get_feature_names())

26576

In [124]:
print("Number of nonzero entries:")
print(document_matrix.nnz)
print("Highest count:")
print(document_matrix.max())
print("Row means:")
print(document_matrix.mean(axis=1))
print("Transform to numpy array format:")
print(document_matrix.toarray())


Number of nonzero entries:
133634
Highest count:
232
Row means:
[[0.00199428]
 [0.00101595]
 [0.00116647]
 ...
 [0.00086544]
 [0.00361228]
 [0.        ]]
Transform to numpy array format:
[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]
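
The numbers printed above show why the document matrix is stored in sparse form. As a rough back-of-the-envelope check, using only the shape and the count of nonzero entries reported by the notebook, the share of nonzero cells and the memory a dense `int64` copy would need can be computed like this:

```python
# Figures reported in the notebook output above
n_docs, n_features = 2034, 26576
nnz = 133634

# Fraction of cells that actually hold a count
density = nnz / (n_docs * n_features)

# A dense int64 array needs 8 bytes per cell, zero or not
dense_bytes = n_docs * n_features * 8

print(f"density: {density:.4%}")
print(f"dense size: {dense_bytes / 1e6:.0f} MB")
```

Roughly a quarter of a percent of the cells are nonzero, while a dense copy occupies several hundred megabytes, so calls such as `.toarray()` are best kept for small slices or for exploration.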


In [125]:
import pandas as pd
df = pd.DataFrame(cvec.transform(data_train.data).toarray(),
                 columns=cvec.get_feature_names())
df.transpose().sort_values(0, ascending=False).transpose()


Unnamed: 0,file,3ds,prj,orientation,does,save,texture,information,format,able,...,earths,earthquake,earthly,earthings,earthinfo,earthers,earth,ears,earnshaw,zyxel
0,6,3,3,2,2,2,2,2,2,1,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
7,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
9,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [126]:
pd.DataFrame(df.sum(axis=0)).sort_values(by=0,ascending=False).head(20)


| | 0 |
|---|---|
| space | 1061 |
| people | 793 |
| god | 745 |
| don | 730 |
| like | 682 |
| just | 675 |
| does | 600 |
| know | 592 |
| think | 584 |
| time | 546 |


In [None]:
#1. What are the 20 words that are most common in the whole corpus?
#2. What are the 20 most common words in each of the 4 classes?
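
The second question, the most common words per class, is not answered by the cells above. One possible approach is to sum the rows of the sparse count matrix separately for each class label. The sketch below uses a four-document toy corpus with invented labels and `top_n = 2`; with the real data, `docs`, `labels` and `target_names` would come from `data_train`, and `top_n` would be 20.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Toy stand-ins for data_train.data, data_train.target and data_train.target_names
docs = [
    "texture texture mesh",
    "texture render",
    "orbit orbit rocket",
    "orbit launch",
]
labels = np.array([0, 0, 1, 1])
target_names = ["comp.graphics", "sci.space"]

cvec = CountVectorizer(stop_words="english")
X = cvec.fit_transform(docs)

# Terms ordered by their column index in X
vocab = np.array(sorted(cvec.vocabulary_, key=cvec.vocabulary_.get))

top_n = 2
top_words = {}
for class_id, name in enumerate(target_names):
    rows = np.flatnonzero(labels == class_id)                # documents of this class
    class_counts = np.asarray(X[rows].sum(axis=0)).ravel()   # total count per term
    order = np.argsort(class_counts)[::-1][:top_n]           # most frequent terms first
    top_words[name] = [(str(vocab[i]), int(class_counts[i])) for i in order]

for name, words in top_words.items():
    print(name, words)
```

Summing the sparse rows avoids building a dense DataFrame of the whole corpus just to count words.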


In [139]:
from sklearn.linear_model import LogisticRegression


In [145]:
X = df


In [148]:
data_train.target

array([1, 3, 2, ..., 1, 0, 1])

In [149]:
# Use the 1-D label array directly; a single-column DataFrame makes fit() emit a DataConversionWarning
y = data_train.target

In [151]:
model = LogisticRegression(C=10**10, multi_class='ovr', solver='lbfgs')
model.fit(X, y)
# Accuracy on the training data itself, so this is an optimistic estimate
print(model.score(X, y))
print(model.intercept_, model.coef_)


0.9783677482792527
[-1.38629503 -1.01159977 -1.01160133 -1.01160659] [[-28.69958834  11.5326483   -1.68969603 ...  -0.2169364   -0.4338728
   -0.90541626]
 [  7.9818317  -16.97961247  -2.42471592 ...  -0.2096413   -0.41928261
    0.97488451]
 [ -1.40000637   9.73402884   2.94897587 ...   0.47190076   0.94380151
   -1.24500369]
 [-15.07146976  -6.19689041  -1.01934716 ...  -0.18914246  -0.37828493
   -0.54075205]]


In [152]:
from sklearn.metrics import accuracy_score


In [157]:
# Call the method; without the parentheses only the bound method is stored
predictions = model.predict(X)

In [165]:
accuracy_score(y, model.predict(X))


0.9783677482792527

In [172]:
from sklearn.model_selection import cross_val_score
accs = cross_val_score(model, X, data_train.target, cv=10)
print(accs)
print(np.mean(accs))


[0.83414634 0.85365854 0.84878049 0.85294118 0.81773399 0.81280788
 0.82758621 0.84653465 0.80693069 0.88118812]
0.8382308086488516


In [168]:
from sklearn.model_selection import train_test_split


In [169]:
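# Hold out half of the training documents for validation (the official test split lives in data_test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=1)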


In [170]:
from sklearn.model_selection import cross_val_score, cross_val_predict
from sklearn import metrics


In [173]:
scores = cross_val_score(model, X, data_train.target, cv=10)
print("Cross-validated scores:", scores)
print("Mean of cross-validated scores:", scores.mean())


Cross-validated scores: [0.83414634 0.85365854 0.84878049 0.85294118 0.81773399 0.81280788
 0.82758621 0.84653465 0.80693069 0.88118812]
Mean of cross-validated scores: 0.8382308086488516
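
In the cross-validation above, the vectorizer was fitted on all training documents before the folds were created, so each fold's vocabulary includes words from its own held-out posts. One way to avoid this, sketched here as an alternative rather than as the lab's required method, is to wrap the vectorizer and the classifier in a pipeline so that both are re-fitted inside every fold. The eight toy documents, their labels and `cv=2` are assumptions made to keep the example small.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Toy stand-ins for data_train.data and data_train.target
docs = [
    "render the texture on the mesh",
    "shading and texture for the polygon",
    "the image format stores pixel colors",
    "mesh polygon render pipeline",
    "the rocket reached orbit",
    "satellite launch into orbit",
    "the shuttle docked with the station",
    "rocket launch delayed by weather",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]  # hypothetical labels: 0 = graphics, 1 = space

# The vectorizer is fitted on the training part of each fold only
pipe = make_pipeline(CountVectorizer(stop_words="english"), LogisticRegression())
scores = cross_val_score(pipe, docs, labels, cv=2)

print("Cross-validated scores:", scores)
print("Mean of cross-validated scores:", scores.mean())
```

Passing the raw text to `cross_val_score` works because the pipeline's first step turns it into a matrix.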


### 4. TF-IDF

Let's see if TF-IDF improves the accuracy.

- Initialize a TF-IDF Vectorizer and repeat the analysis above.
- Does the score improve with respect to the count vectorizer? 
- Print out the number of features for this model.

**BONUS:**
- Change the parameters of either (or both!) models to improve your score.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
tvec = TfidfVectorizer(stop_words='english', norm='l2')
tvec.fit(data_train.data)


In [None]:
# toarray() returns a plain ndarray, consistent with the count-vectorizer cell above
df = pd.DataFrame(tvec.transform(data_train.data).toarray(),
                  columns=tvec.get_feature_names())
df.transpose().sort_values(0, ascending=False).transpose()
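
The cells above build the TF-IDF features but stop before the evaluation. A minimal sketch of the remaining steps, on an invented corpus standing in for `data_train` and `data_test`, could look like this. It also checks that `norm='l2'` scales every document vector to unit length, which is the main difference from raw counts that the classifier sees.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy stand-ins for the train and test splits
train_docs = [
    "render the texture on the mesh",
    "shading and texture for the polygon",
    "the rocket reached orbit",
    "satellite launch into orbit",
]
train_labels = [0, 0, 1, 1]  # hypothetical labels: 0 = graphics, 1 = space
test_docs = ["polygon mesh texture", "rocket launch"]
test_labels = [0, 1]

tvec = TfidfVectorizer(stop_words="english", norm="l2")
X_train = tvec.fit_transform(train_docs)
X_test = tvec.transform(test_docs)  # same vocabulary and idf weights as training

print("number of features:", X_train.shape[1])

# With norm='l2' every non-empty row has Euclidean length 1
row_norms = np.sqrt(np.asarray(X_train.multiply(X_train).sum(axis=1)).ravel())
print("row norms:", row_norms)

clf = LogisticRegression().fit(X_train, train_labels)
print("test accuracy:", clf.score(X_test, test_labels))
```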


### 5. Classifier comparison

Of all the vectorizers tested above, choose one that performs reasonably well with a manageable number of features, and compare the performance of these models:

- KNN
- Logistic Regression
- Decision Trees
- Support Vector Machine
- Random Forest
- Extra Trees

To speed up the computation, vectorize the data only once and then reuse the resulting matrices for every model.
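
A comparison loop along these lines might look as follows. This is a sketch under several assumptions: the corpus and labels are invented, TF-IDF is picked arbitrarily as the vectorizer, and the hyperparameters (`n_neighbors=3`, a linear SVM kernel, 50 trees) are illustrative choices rather than tuned values.

```python
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Toy stand-ins for the train and test splits
train_docs = [
    "render the texture on the mesh",
    "shading and texture for the polygon",
    "the image format stores pixel colors",
    "mesh polygon render pipeline",
    "the rocket reached orbit",
    "satellite launch into orbit",
    "the shuttle docked with the station",
    "rocket launch delayed by weather",
]
train_labels = [0, 0, 0, 0, 1, 1, 1, 1]
test_docs = ["polygon texture shading", "pixel image render",
             "orbit of the satellite", "shuttle rocket launch"]
test_labels = [0, 0, 1, 1]

# Vectorize once, then reuse the same matrices for every model
tvec = TfidfVectorizer(stop_words="english")
X_train = tvec.fit_transform(train_docs)
X_test = tvec.transform(test_docs)

models = {
    "KNN": KNeighborsClassifier(n_neighbors=3),
    "Logistic Regression": LogisticRegression(),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Support Vector Machine": SVC(kernel="linear"),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=42),
    "Extra Trees": ExtraTreesClassifier(n_estimators=50, random_state=42),
}

results = {}
for name, model in models.items():
    model.fit(X_train, train_labels)
    results[name] = model.score(X_test, test_labels)  # accuracy on the test split

for name, acc in sorted(results.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {acc:.3f}")
```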

### Bonus: Other classifiers

Adapt the code from [this example](https://scikit-learn.org/stable/auto_examples/text/plot_document_classification_20newsgroups.html#sphx-glr-auto-examples-text-plot-document-classification-20newsgroups-py) to compare all the classifiers suggested above and to display the final plot.

### Bonus: 

- Fit a model to the 20newsgroups dataset with all classes.

- Choose texts, for example from newspaper articles, and check which class label is predicted for them. Does the predicted label meet your expectations?