### CIS 9: Lab 4
Natural Language Processing: Multinomial NB

In [2]:
### Name: Nitya Kashyap

In this lab you will train an ML model to categorize news articles.

In [3]:
# import modules
import pandas as pd
import numpy as np
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

[BBC News](https://www.bbc.com/news) is a British news organization that reports on current events around the world. In this exercise you will train an NLP model to categorize news articles by topic. The model will determine whether a news article is about sports, politics, etc.

The training data are from BBC News and have been preprocessed for ML. The training input file is `news.csv` ([source](https://www.kaggle.com/datasets/dheemanthbhat/bbc-full-text-preprocessed?select=docs_stage_3_preprocessed.csv))

1. __Read data from _news.csv_ into a DataFrame__.<br>
Then __print the number of rows and columns of the DataFrame__<br>
and __print the first 5 rows__ to see what the data looks like.

In [4]:
df = pd.read_csv("news.csv")
print("rows, columns:", df.shape)
df.head()

rows, columns: (2205, 24)


Unnamed: 0,DocId,DocTextlen,DocText,ADJ,ADP,ADV,AUX,CCONJ,DET,NOUN,...,PUNCT,SCONJ,SYM,VERB,X,INTJ,DocType,FileSize,FilePath,DocCat
0,B_001,2553,ad sale boost time_warner profit quarterly pro...,31,61,15.0,15.0,13.0,28,114,...,55,3.0,9.0,53,0.0,0.0,Business,2560,../input/bbc-full-text-document-classification...,0
1,B_002,2248,dollar gain greenspan speech dollar hit high l...,33,54,15.0,21.0,9.0,44,99,...,43,5.0,2.0,43,0.0,0.0,Business,2252,../input/bbc-full-text-document-classification...,0
2,B_003,1547,yukos unit buyer face loan claim owner embattl...,11,32,3.0,15.0,4.0,25,71,...,26,3.0,4.0,42,0.0,0.0,Business,1552,../input/bbc-full-text-document-classification...,0
3,B_004,2395,high fuel price hit ba profit british_airways ...,36,53,16.0,17.0,8.0,26,114,...,62,8.0,10.0,45,0.0,0.0,Business,2412,../input/bbc-full-text-document-classification...,0
4,B_005,1565,pernod takeover talk lift domecq share uk drin...,15,32,5.0,13.0,8.0,14,68,...,35,5.0,3.0,26,0.0,0.0,Business,1570,../input/bbc-full-text-document-classification...,0


---

2. Data cleaning

2a. __Print all the column labels__.

In [5]:
print("column labels:", df.columns)

column labels: Index(['DocId', 'DocTextlen', 'DocText', 'ADJ', 'ADP', 'ADV', 'AUX', 'CCONJ',
       'DET', 'NOUN', 'NUM', 'PART', 'PRON', 'PROPN', 'PUNCT', 'SCONJ', 'SYM',
       'VERB', 'X', 'INTJ', 'DocType', 'FileSize', 'FilePath', 'DocCat'],
      dtype='object')


2b. Since the data have been preprocessed, each row or news article has multiple features, some of which we don't need for our ML training purpose.

The column labels that are all uppercase, such as ADJ, ADV, NOUN..., denote the count of adjectives, adverbs, nouns... in the article. We can remove these columns because part-of-speech counts are not used by the MultinomialNB model.

The columns we want to keep are:
- DocText: contains the news articles
- DocType: categories of the news articles, as strings
- DocCat: categories of the news articles, as numbers

Given that the part-of-speech columns can be removed for the reason above, create a Raw NBConvert cell to __explain why the other columns can also be removed__, so that we only keep the 3 columns DocText, DocType, and DocCat.

2c. __Create a DataFrame with the 3 columns__ that you want to keep.<br>
Then __print the first 5 rows__ of the DataFrame.

In [6]:
df.drop(columns=["DocId", "DocTextlen", "ADJ", "ADP", "ADV", "AUX", "CCONJ",
       "DET", "NOUN", "NUM", "PART", "PRON", "PROPN", "PUNCT", "SCONJ", "SYM",
       "VERB", "X", "INTJ", "FileSize", "FilePath"], inplace=True)
df.head()

Unnamed: 0,DocText,DocType,DocCat
0,ad sale boost time_warner profit quarterly pro...,Business,0
1,dollar gain greenspan speech dollar hit high l...,Business,0
2,yukos unit buyer face loan claim owner embattl...,Business,0
3,high fuel price hit ba profit british_airways ...,Business,0
4,pernod takeover talk lift domecq share uk drin...,Business,0
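
Selecting the columns to keep is an equivalent and shorter alternative to listing every column to drop. A minimal sketch on a made-up frame (the `toy` frame and its values are hypothetical; only the column names mirror `news.csv`):

```python
import pandas as pd

# Hypothetical stand-in for the full DataFrame
toy = pd.DataFrame({
    "DocId": ["B_001", "B_002", "S_001"],
    "DocText": ["ad sale boost", "dollar gain", "cup final win"],
    "NOUN": [114, 99, 80],
    "DocType": ["Business", "Business", "Sport"],
    "DocCat": [0, 0, 3],
})

# Keep only the three columns needed for training;
# .copy() gives an independent DataFrame rather than a view of toy
kept = toy[["DocText", "DocType", "DocCat"]].copy()
print(kept.columns.tolist())
```

This form also keeps working if the input file gains extra columns later, since nothing has to be added to a drop list.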


2d. __Shorten the column labels__ by removing the 'Doc' from each label and lowercase all letters.<br>
Then __print the first 5 rows__ of the DataFrame.

In [7]:
df.columns = df.columns.str.replace("Doc", "").str.lower()
# df.columns = df.columns.str.lower()
df.head()

Unnamed: 0,text,type,cat
0,ad sale boost time_warner profit quarterly pro...,Business,0
1,dollar gain greenspan speech dollar hit high l...,Business,0
2,yukos unit buyer face loan claim owner embattl...,Business,0
3,high fuel price hit ba profit british_airways ...,Business,0
4,pernod takeover talk lift domecq share uk drin...,Business,0


2e. __Check and remove any NaN__.

In [8]:
print("check for NaN:")
print(df.isna().sum())

check for NaN:
text    0
type    0
cat     0
dtype: int64
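
The check above finds no NaN, so there is nothing to remove here. If there had been missing values, one way to remove them is `dropna`. A minimal sketch on a made-up frame (the `toy` data is hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing article text
toy = pd.DataFrame({
    "text": ["ad sale boost", np.nan, "cup final win"],
    "cat": [0, 0, 3],
})

print(toy.isna().sum())                          # count of NaN per column
cleaned = toy.dropna().reset_index(drop=True)    # drop rows with any NaN, renumber the index
print(cleaned.shape)
```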


---

3. Analyze data

3a. __Show the count of each DocType category__<br>
and then __show the count of each DocCat category__

In [9]:
print("DocType categories:")
print(df.groupby("type").type.count())

print("\nDocCat categories:")
print(df.groupby("cat").cat.count())

DocType categories:
type
Business         510
Entertainment    381
Politics         413
Sport            506
Tech             395
Name: type, dtype: int64

DocCat categories:
cat
0    510
1    381
2    413
3    506
4    395
Name: cat, dtype: int64


3b. The output seems to show that there is a one-to-one correspondence between the strings in DocType and the numbers in DocCat.

Write code to __print the proof that they correspond with each other__. This means showing that all "Business" DocType rows are 0 in DocCat, all "Sport" DocType rows are 3 in DocCat, etc.

_Challenge: write a loop to check and print the 5 results, instead of copy-paste code 5 times_

In [10]:
# build the lookup from the unique values, then verify that every row agrees with it
lookup = dict(zip(df.cat.unique(), df.type.unique()))
print("lookup table:", lookup)
for k,v in lookup.items():
#     print(k,v)
    print("\ncategory", k, "matches type", v, "in", df[(df.cat == k) & (df.type == v)].shape[0], "rows")
    print("or equivalently: there are ", df[(df.cat == k) & (df.type != v)].shape[0], "rows in which category", k, "and type", v, "do not match") # should be zero

lookup table: {0: 'Business', 1: 'Entertainment', 2: 'Politics', 3: 'Sport', 4: 'Tech'}

category 0 matches type Business in 510 rows
or equivalently: there are  0 rows in which category 0 and type Business do not match

category 1 matches type Entertainment in 381 rows
or equivalently: there are  0 rows in which category 1 and type Entertainment do not match

category 2 matches type Politics in 413 rows
or equivalently: there are  0 rows in which category 2 and type Politics do not match

category 3 matches type Sport in 506 rows
or equivalently: there are  0 rows in which category 3 and type Sport do not match

category 4 matches type Tech in 395 rows
or equivalently: there are  0 rows in which category 4 and type Tech do not match
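
Another way to show the correspondence, without building the lookup first, is to count how many distinct values of one column occur for each value of the other. A minimal sketch on made-up rows (the `toy` frame is hypothetical; the column names follow the renamed DataFrame):

```python
import pandas as pd

# Hypothetical rows using the same two columns as the lab DataFrame
toy = pd.DataFrame({
    "type": ["Business", "Business", "Sport", "Tech", "Sport"],
    "cat": [0, 0, 3, 4, 3],
})

# One-to-one means each type has exactly one cat and each cat has exactly one type
cats_per_type = toy.groupby("type")["cat"].nunique()
types_per_cat = toy.groupby("cat")["type"].nunique()
one_to_one = bool((cats_per_type == 1).all() and (types_per_cat == 1).all())
print("one-to-one:", one_to_one)

# A cross-tabulation shows the same thing as a table:
# exactly one nonzero cell in every row and every column
print(pd.crosstab(toy["type"], toy["cat"]))
```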


3c. __Create a lookup table__ which is a dictionary where each unique DocCat value is the key, and the corresponding DocType string is the value.<br>
Then __print the lookup table__.

In [11]:
# already set up
print(lookup)

{0: 'Business', 1: 'Entertainment', 2: 'Politics', 3: 'Sport', 4: 'Tech'}


---

4. Preparing data for ML

4a. Now that you've proven that DocType and DocCat carry the same information, choose the column that makes it less work to use the ML model, then __remove the other column__.<br>
Then __show the first 5 rows__ of the DataFrame.

In [12]:
df.drop(columns=["type"], inplace=True)
df.head()

Unnamed: 0,text,cat
0,ad sale boost time_warner profit quarterly pro...,0
1,dollar gain greenspan speech dollar hit high l...,0
2,yukos unit buyer face loan claim owner embattl...,0
3,high fuel price hit ba profit british_airways ...,0
4,pernod takeover talk lift domecq share uk drin...,0


4b. __Create the X and y datasets__ and __print the shape__ of each.

In [14]:
X = df.text
y = df.cat
print("shape of X:", X.shape, "shape of y:", y.shape)
# print(type(X))

shape of X: (2205,) shape of y: (2205,)
<class 'pandas.core.series.Series'>


4c. Since the training data are already preprocessed, we want to look at one sample news article to see whether further preprocessing is needed.

__Print the news article at row 0__ to inspect it.

In [13]:
print(df.text[0])
# or df.at[0, 'text']

ad sale boost time_warner profit quarterly profit media giant timewarner jump 76 $ 1.13bn £ 600 m month december $ 639 m year early firm big investor google benefit sale high speed internet connection high advert sale timewarner say fourth quarter sale rise 2 $ 11.1bn $ 10.9bn profit buoy gain offset profit dip warner_bros user aol time_warner say friday own 8 search engine google internet business aol mixed fortune lose 464,000 subscriber fourth quarter profit low precede quarter company say aol underlie profit exceptional item rise 8 strong internet advertising revenue hope increase subscriber offer online service free timewarner internet customer try sign aol exist customer high speed broadband timewarner restate 2000 2003 result follow probe the_us_securities_exchange_commission sec close conclude time_warner's fourth quarter profit slightly well analyst expectation film division see profit slump 27 $ 284 m help box office flop alexander catwoman sharp contrast year early final fil

4d. The preprocessing steps that we've discussed in class relate to stop words and stemming.<br>

__Create a Raw NBConvert cell to explain__:
- Does it look like stop words have been removed? Give examples from the text.
- Does it look like stemming was applied? Give examples from the text.
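
One way to support the stop word answer with evidence is to list which tokens of a text appear in a known stop word list. The sketch below is only an illustration: it assumes scikit-learn's built-in `ENGLISH_STOP_WORDS` list, which differs somewhat from the NLTK list imported at the top of this lab, and the helper `stop_words_in` is a hypothetical name.

```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

def stop_words_in(text):
    """Return the tokens of text found in scikit-learn's English stop word list."""
    return [tok for tok in text.lower().split() if tok in ENGLISH_STOP_WORDS]

# Opening words of the article at row 0
sample = "ad sale boost time_warner profit quarterly profit media giant timewarner jump"
print(stop_words_in(sample))

# An ordinary sentence for comparison
print(stop_words_in("the profit of the company rose in the fourth quarter"))
```

A text with stop words removed should yield few or no matches, while ordinary prose yields many.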

4e. __Convert the preprocessed data to numbers__ so it's ready for the ML model.<br>
Then __print the shape of the X dataset__ that will be used with the model.

In [14]:
vect = CountVectorizer()
vect.fit(X)
X_vectors = vect.transform(X)
# print(X_vectors)
X_vectors.shape

(2205, 28975)
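
To see what `CountVectorizer` produces, it can help to run it on a corpus small enough to read. A minimal sketch with two made-up documents (`docs` and `toy_vect` are hypothetical names):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["profit rise profit", "team win match"]  # hypothetical mini-corpus

toy_vect = CountVectorizer()
counts = toy_vect.fit_transform(docs)  # sparse matrix: one row per document, one column per word

print(counts.shape)            # (number of documents, number of distinct words)
print(toy_vect.vocabulary_)    # maps each word to its column index
print(counts.toarray())        # word counts per document
```

The shape printed for the lab data reads the same way: 2205 articles and 28975 distinct words.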

---

5. Train and test the model

5a. __Create X and y training and testing datasets__.<br>
Then __print the shape of each dataset__.

In [15]:
X_train, X_test, y_train, y_test = train_test_split(X_vectors, y, test_size = 0.2)
print("X training dataset size:", X_train.shape, "X testing dataset size:", X_test.shape, "y training dataset size:", y_train.shape, "y testing dataset size:", y_test.shape)

X training dataset size: (1764, 28975) X testing dataset size: (441, 28975) y training dataset size: (1764,) y testing dataset size: (441,)
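
The split above is random, so the accuracy will vary slightly from run to run. `train_test_split` accepts `random_state` to make the split repeatable and `stratify` to keep the class proportions the same in both parts. A minimal sketch on made-up data (`X_toy` and `y_toy` are hypothetical):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 samples, 5 balanced classes
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.repeat(np.arange(5), 20)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy,
    test_size=0.2,
    random_state=0,    # fixed seed: the same split on every run
    stratify=y_toy,    # keep class proportions equal in train and test
)
print(X_tr.shape, X_te.shape)
print(np.bincount(y_te))   # number of test samples in each class
```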


5b. __Train and test the ML model__<br>
and then __print the accuracy measurements__.

_There is more than one accuracy measurement._

In [61]:
classifier = MultinomialNB()
classifier.fit(X_train, y_train)
y_predict = classifier.predict(X_test)

# measure accuracy
print("accuracy score:", metrics.accuracy_score(y_test, y_predict))
print("confusion matrix:")
metrics.confusion_matrix(y_test, y_predict, labels=list(lookup.keys()))

accuracy score: 0.981859410430839
confusion matrix:


array([[107,   0,   2,   0,   1],
       [  0,  61,   1,   0,   2],
       [  1,   0,  89,   0,   0],
       [  0,   0,   0, 102,   0],
       [  0,   0,   1,   0,  74]])
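
The accuracy score and the confusion matrix are linked: accuracy is the sum of the diagonal divided by the sum of all entries. The sketch below recomputes it from the matrix printed above and also derives per-class recall and precision, assuming scikit-learn's convention that rows are true labels and columns are predicted labels.

```python
import numpy as np

# Confusion matrix from the run above: rows are true labels, columns are predicted labels
cm = np.array([[107,   0,   2,   0,   1],
               [  0,  61,   1,   0,   2],
               [  1,   0,  89,   0,   0],
               [  0,   0,   0, 102,   0],
               [  0,   0,   1,   0,  74]])

accuracy = np.trace(cm) / cm.sum()          # correct predictions / all predictions
recall = np.diag(cm) / cm.sum(axis=1)       # per class: correct / actual members of the class
precision = np.diag(cm) / cm.sum(axis=0)    # per class: correct / predicted members of the class

print("accuracy:", accuracy)
print("recall per class:", recall.round(3))
print("precision per class:", precision.round(3))
```

For a single call that reports the same per-class figures from the labels directly, `metrics.classification_report(y_test, y_predict)` is an option.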

5c. Create a Raw NBConvert cell to __discuss whether the accuracy measurements agree with each other__.

---

6. Real life testing of the model you've trained.

6a. __Print the lookup table__ you created in step 3c, which shows the corresponding values of the DocType and DocCat columns.

In [62]:
print(lookup)

{0: 'Business', 1: 'Entertainment', 2: 'Politics', 3: 'Sport', 4: 'Tech'}


6b. One advantage of working in NLP is that it's easier to come up with testing data. 

1. Go to the [BBC News](https://www.bbc.com/news) website to find the different types of news categories.<br>
2. __Choose 3 of the news categories__ in the BBC web page header that match the categories that the ML model has learned.<br>
3. For each category, click on the category link to __find today's news articles in that category__.<br>
> - Then click to open an article and __copy the first 4-5 paragraphs of the article__.<br>
> - Create a Code cell and paste the paragraphs into a Python string.

_You should end up with 3 Code cells, each with a Python string containing the first 4-5 paragraphs of a news article._

In [92]:
entertainment_article = '''In the darkest corner of a grand museum that looks like a neo-classical palace lies a not-so-secret room.
It is filled with statues of Congolese people, which have been regarded as racist, that were once part of the permanent exhibition.
Schoolchildren on educational tours file past the Leopard Man, men with spears and women almost naked.
This is the Africa Museum in Tervuren, just outside Brussels, and until recently those sculptures were part of the permanent exhibition.
After facing years of heavy criticism nationally and internationally, the museum worked with a group of experts from the African diaspora in Belgium to rethink the controversial statues on display.'''

In [93]:
tech_article = '''Thousands of Reddit communities will be inaccessible on Monday in protest at how the site is being run.
Reddit is introducing controversial charges to developers of third-party apps, which are used to browse the social media platform.
But this has resulted in a backlash, with moderators of some of the biggest subreddits making their communities private for 48 hours in protest.
Almost 3,500 subreddits will be inaccessible as a result.
A subreddit is the name given to a forum within the Reddit platform - effectively a community of people who gather to discuss a particular interest.'''

In [94]:
sport_article = '''Novak Djokovic says it is not down to him to decide if he is the greatest player of all time after he won a men's record 23rd Grand Slam title.
Serbia's Djokovic won the French Open on Sunday, moving him one clear of Rafael Nadal in terms of men's majors.
He is level with Serena Williams on 23 and could equal Margaret Court's all-time record of 24 at Wimbledon in July.
"I don't want to enter in these discussions. I'm writing my own history," Djokovic, 36, said.
"I don't want to say I am the greatest. I leave those discussions to someone else."'''

6c. __Create a DataFrame from the 3 Python strings__.<br>
Then __print the DataFrame__.

_An example DataFrame is shown below, from news articles on 6/3. Your text will be different._

In [95]:
X = pd.DataFrame(columns=["text"], data=[[entertainment_article], [tech_article], [sport_article]])
X

Unnamed: 0,text
0,In the darkest corner of a grand museum that l...
1,Thousands of Reddit communities will be inacce...
2,Novak Djokovic says it is not down to him to d...


6d. __Test the ML model__ that you've trained with your new data in the DataFrame.

This means:
- preprocess the new data (your answer in step 4d will determine how you preprocess the new data).
- convert the new data to numbers
- test the model with the data
- print the news categories that the model predicted. Use the lookup table to convert the numeric result from the model into the category string.
Example:<br>`
Article 1 : Sport
Article 2 : Business
Article 3 : Tech
`

_You'll need 4 Code cells, one for each step above_.

In [96]:
tokenizer = RegexpTokenizer(r'\w+')  # raw string avoids an invalid escape sequence warning
stop_words = set(stopwords.words("english"))
# stemmer = PorterStemmer()

def preprocess(string):
    # lowercase the text and split it into word tokens
    s = tokenizer.tokenize(string.lower())
    # the training data has no stop words, so remove them here too
    s = [word for word in s if word not in stop_words]
    # no stemming: the training data wasn't stemmed, and training and testing data should be processed the same way
    # s = [stemmer.stem(word) for word in s]
    return ' '.join(s)

X_processed = pd.DataFrame([preprocess(X.loc[i, "text"]) for i in range(len(X))])
X_processed
# print(X_processed.loc[1,0])

Unnamed: 0,0
0,darkest corner grand museum look like neo clas...
1,thousand reddit commun inaccess monday protest...
2,novak djokov say decid greatest player time me...


In [97]:
X_vect = vect.transform(X_processed[0])
# print(X_vect)
X_vect.shape # only 3 rows, as expected

(3, 28975)

In [102]:
y_pred = classifier.predict(X_vect)

In [103]:
# print(len(y_pred))
actual = ["Entertainment", "Tech", "Sport"]
for i in range(len(y_pred)):
    print(actual[i], "article predicted as:", lookup[y_pred[i]])

Entertainment article predicted as: Entertainment
Tech article predicted as: Tech
Sport article predicted as: Sport
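
The vectorize and predict steps can also be bundled so that new text always passes through the same vectorizer the model was trained with. This is a minimal sketch using scikit-learn's `make_pipeline` on a made-up corpus, not the lab's required approach; `train_docs`, `train_labels`, and `toy_lookup` are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical, already-cleaned training texts and their numeric categories
train_docs = [
    "team win match goal",
    "player score goal match",
    "profit share market rise",
    "bank profit loan market",
]
train_labels = [3, 3, 0, 0]
toy_lookup = {0: "Business", 3: "Sport"}

# The pipeline fits the vectorizer and the classifier together
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_docs, train_labels)

# New text goes straight in; the fitted vectorizer is applied automatically
new_docs = ["Team win match", "bank profit"]
predicted = [toy_lookup[label] for label in model.predict(new_docs)]
for i, name in enumerate(predicted, start=1):
    print("Article", i, ":", name)
```

Any custom cleaning, such as the stop word removal above, still has to be applied to the new text before it reaches the pipeline.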


6e. Create a Raw NBConvert cell to __discuss the result of your test__.