## Introduction

Welcome to the DSN Internship Coding Challenge! This assessment will put your Natural Language Processing (NLP) and problem-solving abilities to the test:

- Section one of the assessment will require you to build a text classification model.

Good luck! If you have questions about the framing of the questions, please contact **recruitment@datasciencenigeria.ai**

### How to Use and Submit this Notebook.
- Make a copy of this document and rename it **Firstname_Lastname_DSNInternshipCodingAssessment.ipynb**
- Before attempting to submit, ensure that you have run all of the cells in your notebook and that the output is visible.
- Once you’ve completed all tasks, save and download a copy of the notebook as .ipynb
- Submit a link to the notebook (make sure that the link is set to "Anyone on the internet with the link can view") and the downloaded copy of your final notebook via this [link](https://forms.gle/t8sFNrfAymZUrfJq7).

### What Not to Do.
- Do not share this document with any external party
- No teamwork is permitted
- After submitting a copy of your script, you are not permitted to make any changes to the online version; any discrepancy between the online and submitted copies will render your application null and void.

### Dataset

This is a news [dataset](https://drive.google.com/file/d/1NgPM7_mFCDKnuqI9SamMCrkF1mE5AgAI/view?usp=sharing) which contains 2225 examples of news articles with their respective labels. Use the link to learn more about the dataset.

## Section 1

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


### Task

**This is to test your knowledge of NLP**

Build and train a machine learning model with the provided dataset to classify the news category or topic. You can use any architecture or model in this test.

**Make sure to plot the accuracy vs epochs and loss vs epochs graphs**


In [3]:
import pandas as pd
import re
import numpy as np
import pickle
import string
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
import matplotlib.pyplot as plt
# from nltk.corpus import stopwords
nlp = spacy.load("en_core_web_sm")


In [4]:
STOP_WORDS.add("mr")
STOP_WORDS.add("said")
STOP_WORDS.add("mrs")

In [5]:
# datafile = "/content/bbc-text.csv"
datafile = "/content/drive/MyDrive/DSN/Dataset/ML/bbc-text.csv"

In [6]:
textData = pd.read_csv(datafile)

In [7]:
# make a copy of the dataset
textData1 = textData.copy()

In [8]:
textData1

Unnamed: 0,category,text
0,tech,tv future in the hands of viewers with home th...
1,business,worldcom boss left books alone former worldc...
2,sport,tigers wary of farrell gamble leicester say ...
3,sport,yeading face newcastle in fa cup premiership s...
4,entertainment,ocean s twelve raids box office ocean s twelve...
...,...,...
2220,business,cars pull down us retail figures us retail sal...
2221,politics,kilroy unveils immigration policy ex-chatshow ...
2222,entertainment,rem announce new glasgow concert us band rem h...
2223,politics,how political squabbles snowball it s become c...


## **Perform Basic EDA on the dataset**

In [9]:
#checking to see the value counts of each category
textData1["category"].value_counts()

sport            511
business         510
politics         417
tech             401
entertainment    386
Name: category, dtype: int64
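
The counts above are fairly balanced. As a rough sketch (plain Python, with the counts copied by hand from the output rather than read from the DataFrame), the share of each category and the accuracy of a trivial "always predict the largest class" model can be worked out like this. The variable names are illustrative only.

```python
# Category counts copied from the value_counts() output above.
counts = {
    "sport": 511,
    "business": 510,
    "politics": 417,
    "tech": 401,
    "entertainment": 386,
}

total = sum(counts.values())

# Share of the dataset held by each category.
shares = {label: n / total for label, n in counts.items()}

# Accuracy of a trivial model that always predicts the largest class.
majority_label = max(counts, key=counts.get)
majority_baseline = counts[majority_label] / total

for label, share in shares.items():
    print(f"{label:<14}{share:.1%}")
print(f"Majority-class baseline ({majority_label}): {majority_baseline:.1%}")
```

The baseline comes out at roughly 23%, which gives a reference point for judging any classifier trained on this data.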

In [10]:
# checking the shape of the data
textData.shape

(2225, 2)

In [11]:
# Checking data summary of the data
# As seen below, both columns have the dtype "object"
textData.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2225 entries, 0 to 2224
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   category  2225 non-null   object
 1   text      2225 non-null   object
dtypes: object(2)
memory usage: 34.9+ KB


In [12]:
# Check whether any data is missing
# There is no missing data
textData.isna().value_counts()

category  text 
False     False    2225
dtype: int64

In [13]:
textData1["text"][0]

'tv future in the hands of viewers with home theatre systems  plasma high-definition tvs  and digital video recorders moving into the living room  the way people watch tv will be radically different in five years  time.  that is according to an expert panel which gathered at the annual consumer electronics show in las vegas to discuss how these new technologies will impact one of our favourite pastimes. with the us leading the trend  programmes and other content will be delivered to viewers via home networks  through cable  satellite  telecoms companies  and broadband service providers to front rooms and portable devices.  one of the most talked-about technologies of ces has been digital and personal video recorders (dvr and pvr). these set-top boxes  like the us s tivo and the uk s sky+ system  allow people to record  store  play  pause and forward wind tv programmes when they want.  essentially  the technology allows for much more personalised tv. they are also being built-in to high

## **Perform Text Preprocessing**

#### **1.   Removal of Punctuations**

In [14]:
test1 = textData1["text"][0]
test1

'tv future in the hands of viewers with home theatre systems  plasma high-definition tvs  and digital video recorders moving into the living room  the way people watch tv will be radically different in five years  time.  that is according to an expert panel which gathered at the annual consumer electronics show in las vegas to discuss how these new technologies will impact one of our favourite pastimes. with the us leading the trend  programmes and other content will be delivered to viewers via home networks  through cable  satellite  telecoms companies  and broadband service providers to front rooms and portable devices.  one of the most talked-about technologies of ces has been digital and personal video recorders (dvr and pvr). these set-top boxes  like the us s tivo and the uk s sky+ system  allow people to record  store  play  pause and forward wind tv programmes when they want.  essentially  the technology allows for much more personalised tv. they are also being built-in to high

In [15]:
#using a string method to get all the punctuations
myPunctuations = string.punctuation
myPunctuations

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [16]:
for puncts in myPunctuations:
    test1 = test1.replace(puncts, "")

In [17]:
# a function to remove punctuations:
def removePunct(text):
    for puncts in myPunctuations:
        text = text.replace(puncts, "")
    return text

In [18]:
textData1["text"] = textData1["text"].apply(removePunct)
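
A possible alternative to the character-by-character loop, sketched with the standard library only: `str.translate` removes all punctuation in one pass. The second table is an assumption about what might be preferable, not what the notebook does: it maps punctuation to spaces so that hyphenated words such as "high-definition" are split rather than fused into "highdefinition".

```python
import string

# Deleting punctuation outright (the same effect as removePunct, in a single pass).
DELETE_PUNCT = str.maketrans("", "", string.punctuation)

# Alternative: map every punctuation character to a space so that
# hyphenated words are split instead of fused together.
SPACE_PUNCT = str.maketrans(string.punctuation, " " * len(string.punctuation))

sample = "high-definition tvs (dvr and pvr)."

deleted = sample.translate(DELETE_PUNCT)
spaced = sample.translate(SPACE_PUNCT)

print(deleted)
print(spaced.split())
```

Which variant is better depends on whether fused tokens like "highdefinition" are acceptable features for the classifier.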

In [19]:
textData1

Unnamed: 0,category,text
0,tech,tv future in the hands of viewers with home th...
1,business,worldcom boss left books alone former worldc...
2,sport,tigers wary of farrell gamble leicester say ...
3,sport,yeading face newcastle in fa cup premiership s...
4,entertainment,ocean s twelve raids box office ocean s twelve...
...,...,...
2220,business,cars pull down us retail figures us retail sal...
2221,politics,kilroy unveils immigration policy exchatshow h...
2222,entertainment,rem announce new glasgow concert us band rem h...
2223,politics,how political squabbles snowball it s become c...


In [20]:
# checking to make sure the removePunct function works as it should
textData1["text"][5]

'howard hits back at mongrel jibe michael howard has said a claim by peter hain that the tory leader is acting like an  attack mongrel  shows labour is  rattled  by the opposition  in an upbeat speech to his party s spring conference in brighton  he said labour s campaigning tactics proved the tories were hitting home mr hain made the claim about tory tactics in the antiterror bill debate  something tells me that someone  somewhere out there is just a little bit rattled   mr howard said mr hain  leader of the commons  told bbc radio four s today programme that mr howard s stance on the government s antiterrorism legislation was putting the country at risk he then accused the tory leader of behaving like an  attack mongrel  and  playing opposition for opposition sake   mr howard told his party that labour would  do anything  say anything  claim anything to cling on to office at all costs   so far this year they have compared me to fagin  to shylock and to a flying pig this morning peter

#### **2. Removal of extra white space from the text corpus**

In [21]:
# Note: strip() removes only leading and trailing whitespace
textData1["text"] = textData1["text"].apply(lambda x: x.strip())
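
The articles also contain runs of two or more spaces inside the text, which the step above leaves in place. A minimal sketch of collapsing them with a regular expression follows; `collapse_whitespace` is a hypothetical helper name, not part of the notebook.

```python
import re

def collapse_whitespace(text):
    """Replace every run of whitespace with one space and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()

sample = "  home theatre systems  plasma highdefinition tvs  "
cleaned = collapse_whitespace(sample)
print(repr(cleaned))
```

It could be applied to the column with `.apply(...)` in the same way as `removePunct`.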

#### **3. Stop words Removal**

In [22]:
# Using spaCy's stop-word list, which covers more English stop words than NLTK's default list.
stopWords = STOP_WORDS

In [23]:
def removeStopwords(text):
    text = nlp(text)
    tokens = [word.text for word in text if word.text not in stopWords]
    cleanedText = " ".join(tokens)
    return cleanedText


In [24]:
textData1["text"][0]

'tv future in the hands of viewers with home theatre systems  plasma highdefinition tvs  and digital video recorders moving into the living room  the way people watch tv will be radically different in five years  time  that is according to an expert panel which gathered at the annual consumer electronics show in las vegas to discuss how these new technologies will impact one of our favourite pastimes with the us leading the trend  programmes and other content will be delivered to viewers via home networks  through cable  satellite  telecoms companies  and broadband service providers to front rooms and portable devices  one of the most talkedabout technologies of ces has been digital and personal video recorders dvr and pvr these settop boxes  like the us s tivo and the uk s sky system  allow people to record  store  play  pause and forward wind tv programmes when they want  essentially  the technology allows for much more personalised tv they are also being builtin to highdefinition tv

In [25]:
# Before removing the stop words, the text corpus needs to be tokenized.
# When I ran this code, I observed that removing stop words directly, without tokenizing first, damages the lexical structure of the text.
# I use the spaCy model to tokenize the text because of how robust spaCy is.

textData1["text"] = textData1["text"].apply(removeStopwords)
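
To illustrate the point about tokenizing first, here is a small sketch with a made-up three-word stop list and a whitespace split standing in for spaCy's tokenizer (both are simplifying assumptions). Deleting stop words as raw substrings also cuts them out of longer words, while filtering whole tokens does not.

```python
# A tiny illustrative stop-word set (not spaCy's full list).
stop_words = {"is", "the", "a"}

sample = "this is the history of a theatre"

# Naive approach: delete stop words as raw substrings.
naive = sample
for word in sorted(stop_words):
    naive = naive.replace(word, "")

# Token-based approach: split first, then drop whole tokens only.
tokens = [tok for tok in sample.split() if tok not in stop_words]
token_based = " ".join(tokens)

print(repr(naive))        # "history" and "theatre" are damaged
print(repr(token_based))
```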

In [26]:
# Checking to see if the stopwords are actually removed from the text corpus
textData1["text"][0]

'tv future hands viewers home theatre systems   plasma highdefinition tvs   digital video recorders moving living room   way people watch tv radically different years   time   according expert panel gathered annual consumer electronics las vegas discuss new technologies impact favourite pastimes leading trend   programmes content delivered viewers home networks   cable   satellite   telecoms companies   broadband service providers rooms portable devices   talkedabout technologies ces digital personal video recorders dvr pvr settop boxes   like s tivo uk s sky system   allow people record   store   play   pause forward wind tv programmes want   essentially   technology allows personalised tv builtin highdefinition tv sets   big business japan   slower europe lack highdefinition programming people forward wind adverts   forget abiding network channel schedules   putting alacarte entertainment networks cable satellite companies worried means terms advertising revenues   brand identity   v

In [27]:
textData1["text"][1]

'worldcom boss   left books   worldcom boss bernie ebbers   accused overseeing 11bn £ 58bn fraud   accounting decisions   witness told jurors   david myers comments questioning defence lawyers arguing ebbers responsible worldcom s problems phone company collapsed 2002 prosecutors claim losses hidden protect firm s shares myers pleaded guilty fraud assisting prosecutors   monday   defence lawyer reid weingarten tried distance client allegations cross examination   asked myers knew ebbers   accounting decision     aware    myers replied   know ebbers accounting entry worldcom books    weingarten pressed      replied witness myers admitted ordered false accounting entries request worldcom chief financial officer scott sullivan defence lawyers trying paint sullivan   admitted fraud testify later trial   mastermind worldcom s accounting house cards   ebbers   team     looking portray affable boss   admission pe graduate economist abilities   ebbers transformed worldcom relative unknown 160b

## **Perform Feature Extraction**

In [28]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split


# Splitting the data into features (X) and labels (y)
X = textData1["text"]
y = textData1["category"]


# Split the data into training and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize the CountVectorizer
vectorizer = CountVectorizer()

# Fit the vectorizer on the training data
vectorizer.fit(X_train)

# Transform the training and test data into feature vectors
X_train = vectorizer.transform(X_train)
X_test = vectorizer.transform(X_test)
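
A simplified, standard-library sketch of what fitting on the training data and then transforming both splits means. It splits on whitespace only, whereas `CountVectorizer` applies its own tokenization and lowercasing, so treat it as an illustration of the idea rather than a reimplementation. The function names are hypothetical.

```python
from collections import Counter

def fit_vocabulary(documents):
    """Map each distinct training token to a column index."""
    tokens = sorted({tok for doc in documents for tok in doc.split()})
    return {tok: idx for idx, tok in enumerate(tokens)}

def transform(documents, vocabulary):
    """Turn documents into count vectors; unseen tokens are skipped."""
    rows = []
    for doc in documents:
        counts = Counter(tok for tok in doc.split() if tok in vocabulary)
        rows.append([counts.get(tok, 0) for tok in vocabulary])
    return rows

train_docs = ["tv future viewers tv", "worldcom boss fraud"]
test_docs = ["tv boss gadget"]

vocab = fit_vocabulary(train_docs)            # learned from training data only
train_vectors = transform(train_docs, vocab)
test_vectors = transform(test_docs, vocab)    # "gadget" is not in the vocabulary

print(vocab)
print(train_vectors)
print(test_vectors)
```

Because the vocabulary is built from the training split alone, words that appear only in the test split contribute nothing to the test vectors, which keeps test information out of the fitted features.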

## **Building and Training the Classifier Model Using Naive Bayes**
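
As a rough illustration of the idea behind multinomial Naive Bayes, the sketch below estimates class priors and add-one (Laplace) smoothed word likelihoods from token counts and predicts the class with the highest log score. The toy documents, labels, and function names are invented for the example; the notebook itself relies on scikit-learn's `MultinomialNB`.

```python
import math
from collections import Counter, defaultdict

def train_nb(documents, labels, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods from token counts."""
    class_docs = Counter(labels)
    token_counts = defaultdict(Counter)
    for doc, label in zip(documents, labels):
        token_counts[label].update(doc.split())
    vocabulary = {tok for counts in token_counts.values() for tok in counts}

    log_prior = {c: math.log(n / len(labels)) for c, n in class_docs.items()}
    log_likelihood = {}
    for c in class_docs:
        total = sum(token_counts[c].values()) + alpha * len(vocabulary)
        log_likelihood[c] = {
            tok: math.log((token_counts[c][tok] + alpha) / total)
            for tok in vocabulary
        }
    return log_prior, log_likelihood

def predict_nb(document, log_prior, log_likelihood):
    """Return the class with the highest log posterior score."""
    scores = {}
    for c in log_prior:
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][tok]
            for tok in document.split()
            if tok in log_likelihood[c]  # tokens unseen in training are ignored
        )
    return max(scores, key=scores.get)

docs = ["match goal win", "goal team cup",
        "shares profit market", "market bank profit"]
labels = ["sport", "sport", "business", "business"]

log_prior, log_likelihood = train_nb(docs, labels)
prediction = predict_nb("team win cup", log_prior, log_likelihood)
print(prediction)
```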

In [29]:
# Import the model and the evaluation metrics
# Naive Bayes is used here because it is a fast, well-established baseline for bag-of-words text classification
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.naive_bayes import MultinomialNB


# Initialize the Naive Bayes classifier
nb_classifier = MultinomialNB()


# Train the classifier
nb_classifier.fit(X_train, y_train)


# Make the predictions on the test data
y_pred = nb_classifier.predict(X_test)


# Evaluate the model's accuracy to test its performance
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy:.2f}")


# Calculate the precision, recall and F1 score.
# average="weighted" averages the per-class scores, weighted by the number of true examples in each class
precision = precision_score(y_test, y_pred, average = "weighted")
recall = recall_score(y_test, y_pred, average = "weighted")
f1 = f1_score(y_test, y_pred, average = "weighted")


print(f"Model Precision: {precision:.2f}")
print(f"Model Recall Score: {recall:.2f}")
print(f"Model F1 Score: {f1:.2f}")

Model Accuracy: 0.96
Model Precision: 0.96
Model Recall Score: 0.96
Model F1 Score: 0.96
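
A hand-rolled sketch of support-weighted averaging, using a made-up set of six labels, may help when reading the numbers above. One consequence worth noting: support-weighted recall is always equal to plain accuracy, so those two figures matching is expected rather than a coincidence. The function name is hypothetical.

```python
from collections import Counter

def weighted_scores(y_true, y_pred):
    """Support-weighted precision, recall and F1, plus plain accuracy."""
    support = Counter(y_true)
    predicted = Counter(y_pred)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    n = len(y_true)

    precision = recall = f1 = 0.0
    for label, count in support.items():
        p = correct[label] / predicted[label] if predicted[label] else 0.0
        r = correct[label] / count
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        weight = count / n          # class share among the true labels
        precision += weight * p
        recall += weight * r
        f1 += weight * f

    accuracy = sum(correct.values()) / n
    return precision, recall, f1, accuracy

# Toy labels, invented for illustration.
y_true = ["tech", "tech", "sport", "sport", "sport", "business"]
y_pred = ["tech", "sport", "sport", "sport", "business", "business"]

precision, recall, f1, accuracy = weighted_scores(y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.2f}")
```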


## **Store the Naive Bayes Classifier and Vectorizer in Pickle Files So They Can Be Reused for Predictions**

In [30]:
#Storing the model classifier in a pickle file
Model_filename = "nb_classifier.pkl"

# storing the model vectorizer in a pickle file as well
vectorizer_filename = "vectorizer.pkl"

with open(vectorizer_filename, "wb") as vf:
    pickle.dump(vectorizer, vf)

with open(Model_filename, "wb") as file:
    pickle.dump(nb_classifier, file)


## **Load the Model and Test It on Any Text Data**

In [31]:
# Loading the Model Vectorizer from the pickle file to be used
with open(vectorizer_filename, "rb") as f:
    loadedVectorizer = pickle.load(f)

# Loading the Model Classifier from pickle file
with open(Model_filename, "rb") as mf:
    loadedClassifier = pickle.load(mf)
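
The save-and-load round trip can be checked in isolation with a small sketch. A plain dictionary stands in for the fitted objects here, and the file name and temporary directory are placeholders. Keep in mind that unpickling can execute arbitrary code, so only load pickle files you created yourself or otherwise trust.

```python
import os
import pickle
import tempfile

# A stand-in object; in the notebook this would be the vectorizer or classifier.
artifact = {"vocabulary": {"tv": 0, "boss": 1}, "alpha": 1.0}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "artifact.pkl")

    # Serialize with the highest protocol available to this Python version.
    with open(path, "wb") as fh:
        pickle.dump(artifact, fh, protocol=pickle.HIGHEST_PROTOCOL)

    # Only unpickle files you created or trust: loading can execute code.
    with open(path, "rb") as fh:
        restored = pickle.load(fh)

round_trip_ok = restored == artifact
print(round_trip_ok)
```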

In [32]:
# the text string to be used to test the prediction of the model
text = """tv future in the hands of viewers with home theatre systems  plasma high-definition tvs  and digital video recorders moving into the living room  the way people watch tv will be radically different in five years  time.  that is according to an expert panel which gathered at the annual consumer electronics show in las vegas to discuss how these new technologies will impact one of our favourite pastimes. with the us leading the trend  programmes and other content will be delivered to viewers via home networks  through cable  satellite  telecoms companies  and broadband service providers to front rooms and portable devices.  one of the most talked-about technologies of ces has been digital and personal video recorders (dvr and pvr). these set-top boxes  like the us s tivo and the uk s sky+ system  allow people to record  store  play  pause and forward wind tv programmes when they want.  essentially  the technology allows for much more personalised tv. they are also being built-in to high-definition tv sets  which are big business in japan and the us  but slower to take off in europe because of the lack of high-definition programming. not only can people forward wind through adverts  they can also forget about abiding by network and channel schedules  putting together their own a-la-carte entertainment. but some us networks and cable and satellite companies are worried about what it means for them in terms of advertising revenues as well as  brand identity  and viewer loyalty to channels. although the us leads in this technology at the moment  it is also a concern that is being raised in europe  particularly with the growing uptake of services like sky+.  what happens here today  we will see in nine months to a years  time in the uk   adam hume  the bbc broadcast s futurologist told the bbc news website. for the likes of the bbc  there are no issues of lost advertising revenue yet. 
it is a more pressing issue at the moment for commercial uk broadcasters  but brand loyalty is important for everyone.  we will be talking more about content brands rather than network brands   said tim hanlon  from brand communications firm starcom mediavest.  the reality is that with broadband connections  anybody can be the producer of content.  he added:  the challenge now is that it is hard to promote a programme with so much choice.   what this means  said stacey jolna  senior vice president of tv guide tv group  is that the way people find the content they want to watch has to be simplified for tv viewers. it means that networks  in us terms  or channels could take a leaf out of google s book and be the search engine of the future  instead of the scheduler to help people find what they want to watch. this kind of channel model might work for the younger ipod generation which is used to taking control of their gadgets and what they play on them. but it might not suit everyone  the panel recognised. older generations are more comfortable with familiar schedules and channel brands because they know what they are getting. they perhaps do not want so much of the choice put into their hands  mr hanlon suggested.  on the other end  you have the kids just out of diapers who are pushing buttons already - everything is possible and available to them   said mr hanlon.  ultimately  the consumer will tell the market they want.   of the 50 000 new gadgets and technologies being showcased at ces  many of them are about enhancing the tv-watching experience. high-definition tv sets are everywhere and many new models of lcd (liquid crystal display) tvs have been launched with dvr capability built into them  instead of being external boxes. one such example launched at the show is humax s 26-inch lcd tv with an 80-hour tivo dvr and dvd recorder. 
one of the us s biggest satellite tv companies  directtv  has even launched its own branded dvr at the show with 100-hours of recording capability  instant replay  and a search function. the set can pause and rewind tv for up to 90 hours. and microsoft chief bill gates announced in his pre-show keynote speech a partnership with tivo  called tivotogo  which means people can play recorded programmes on windows pcs and mobile devices. all these reflect the increasing trend of freeing up multimedia so that people can watch what they want  when they want."""

In [33]:
# A function to use and call the classifier to make predictions with the test string
def classifyText(text):
    prediction = loadedClassifier.predict(loadedVectorizer.transform([text]))
    return (f"Text Classification: {prediction[0].capitalize()}")

In [34]:
# calling the function on the test string, and getting the output
classifyText(text)

'Text Classification: Tech'
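
The training text had punctuation and stop words removed, while the test string above is passed in raw. One possible way to keep the two consistent is to run the same cleaning at prediction time, sketched below. Everything here is hypothetical: `clean_text` uses a whitespace split instead of spaCy, and the two small classes are stand-ins so the example runs without the pickled objects. In the notebook, the loaded vectorizer and classifier would be passed in instead.

```python
import string

_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def clean_text(text, stop_words=frozenset()):
    """Apply the training-time cleaning steps to one raw string."""
    text = text.translate(_PUNCT_TABLE)
    tokens = [tok for tok in text.split() if tok not in stop_words]
    return " ".join(tokens)

def classify_text(text, vectorizer, classifier, stop_words=frozenset()):
    """Clean the text, vectorize it, and return the predicted label."""
    features = vectorizer.transform([clean_text(text, stop_words)])
    return classifier.predict(features)[0]

class EchoVectorizer:
    """Hypothetical stand-in that passes cleaned strings through unchanged."""
    def transform(self, texts):
        return list(texts)

class KeywordClassifier:
    """Hypothetical stand-in that labels a row by a single keyword."""
    def predict(self, rows):
        return ["tech" if "highdefinition" in row.split() else "other"
                for row in rows]

label = classify_text(
    "high-definition tvs, said mr hanlon.",
    EchoVectorizer(),
    KeywordClassifier(),
    stop_words={"mr", "said"},
)
print(label)
```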