### Classification

Classification is a `Supervised` machine learning technique

`Predict` the class of new data from labeled examples

Text Mining : Words become `Features`

Preprocessing : `TF-IDF` turns each document into a weighted feature vector

In [1]:
with open("../Data/Course Descriptions.txt",'r') as file:
    description = file.read().splitlines()
    
print(f'Sample Course Description : {description[:2]}')

Sample Course Description : ['In this practical, hands-on course, learn how to do data preparation, data munging, data visualization, and predictive analytics. ', 'PHP is the most popular server-side language used to build dynamic websites, and though it is not especially difficult to use, nonprogrammers often find it intimidating. ']


In [2]:
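import nltk
# One-time setup if the NLTK data is missing (newer NLTK releases may need 'punkt_tab' instead of 'punkt'):
# nltk.download('punkt'); nltk.download('stopwords'); nltk.download('wordnet')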
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
lemmatizer = WordNetLemmatizer()

`Clean` Text

In [3]:
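# Build the stopword set once; calling stopwords.words() for every token is slow
english_stopwords = set(stopwords.words('english'))

def clean_text(text):
    # Tokenize, drop English stopwords, then lemmatize what remains
    tokens = nltk.word_tokenize(text)
    nostopwords = [token for token in tokens if token not in english_stopwords]
    lemmatized = [lemmatizer.lemmatize(word) for word in nostopwords]
    return lemmatized

vectorizer = TfidfVectorizer(tokenizer=clean_text)
tfidf = vectorizer.fit_transform(description)

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(f'Features : \n{vectorizer.get_feature_names_out()[:25].tolist()}')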
print(f'\nSize of TF-IDF Matrix : {tfidf.shape}')

Features : 
["'ll", "'re", "'s", '(', ')', ',', '.', '?', 'actively', 'adopting', 'amazon', 'analysis', 'analytics', 'application', 'applied', 'architect', 'architecture', 'around', 'aspect', 'associate', 'aws', 'basic', 'become', 'begin', 'big']

Size of TF-IDF Matrix : (20, 240)
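
To make the weights in the TF-IDF matrix less opaque, here is a minimal pure-Python sketch of the calculation. It assumes the formula scikit-learn documents for the `TfidfVectorizer` defaults (smoothed IDF `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row). The three-sentence toy corpus, the whitespace tokenizer, and the helper name `tfidf_rows` are made up for illustration and are not part of the notebook.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return (vocabulary, rows); each row is an L2-normalized TF-IDF vector."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed inverse document frequency
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1 for tok in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        raw = [counts[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(v * v for v in raw))
        rows.append([v / norm for v in raw])
    return vocab, rows

# Hypothetical toy corpus, not the course descriptions
docs = ["data science with python", "python programming", "cloud computing with aws"]
vocab, rows = tfidf_rows(docs)
for term, weight in zip(vocab, rows[1]):
    if weight:
        print(f"{term}: {weight:.3f}")
```

In the second toy document, `programming` occurs in only one document and therefore gets a larger weight than `python`, which occurs in two.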


`Building` the Model

In [4]:
with open("../Data/Course Classification.txt","r") as file:
    labels = file.read().splitlines()
    
print(f'Labels : \n{labels}')

Labels : 
['Data-Science', 'Programming', 'Programming', 'Cloud-Computing', 'Data-Science', 'Programming', 'Data-Science', 'Programming', 'Cloud-Computing', 'Data-Science', 'Data-Science', 'Programming', 'Programming', 'Cloud-Computing', 'Programming', 'Cloud-Computing', 'Cloud-Computing', 'Cloud-Computing', 'Programming', 'Programming']


In [5]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
le.fit(labels)
print(f'Classes : {le.classes_}')

Classes : ['Cloud-Computing' 'Data-Science' 'Programming']


`Encoding` Labels to Numeric Values

In [6]:
encoded_labels = le.transform(labels)
print(f'Encoded Labels : {encoded_labels}')

Encoded Labels : [1 2 2 0 1 2 1 2 0 1 1 2 2 0 2 0 0 0 2 2]


`Encoded Labels`

`0` - Cloud-Computing

`1` - Data-Science

`2` - Programming
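
The mapping above follows from how `LabelEncoder` assigns codes: it sorts the distinct class names and uses each name's position as its integer. A minimal pure-Python sketch of that idea, using the first five labels from the output above (the variable names are illustrative):

```python
labels = ['Data-Science', 'Programming', 'Programming', 'Cloud-Computing', 'Data-Science']

# Sorted distinct class names; the index of each name is its code
classes = sorted(set(labels))
to_code = {name: code for code, name in enumerate(classes)}

encoded = [to_code[name] for name in labels]
decoded = [classes[code] for code in encoded]  # inverse mapping back to names

print(classes)
print(encoded)
```

With scikit-learn itself, `le.inverse_transform(...)` performs the decoding step.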

Splitting the Dataset into `Train` and `Test` Sets

In [7]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(tfidf, encoded_labels, test_size=0.2, random_state=0)
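
With 20 documents and `test_size=0.2`, the split holds out 4 documents and trains on the other 16. The sketch below mimics the idea with a seeded shuffle in plain Python. It is an illustration only and will not reproduce the exact partition that `train_test_split` produces for `random_state=0`; the function name `shuffle_split` is hypothetical.

```python
import math
import random

def shuffle_split(n_samples, test_size=0.2, seed=0):
    """Return (train_indices, test_indices) from a seeded shuffle."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(test_size * n_samples)
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = shuffle_split(20, test_size=0.2, seed=0)
print(len(train_idx), len(test_idx))
```

With only 4 test documents, each prediction moves the test accuracy by 25 percentage points, so the score is coarse. `train_test_split` also accepts a `stratify` argument (for example `stratify=encoded_labels`) that keeps class proportions similar in both parts, which may help on a corpus this small.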

Classification using `Naive Bayes`
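
Multinomial Naive Bayes scores each class as the log prior plus the feature-weighted sum of log term likelihoods, then predicts the highest-scoring class. Below is a minimal sketch with Laplace smoothing (`alpha=1.0`, which is also the `MultinomialNB` default) on a hypothetical count matrix. The function names and toy data are illustrative and not part of the notebook.

```python
import math

def fit_multinomial_nb(X, y, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods per class."""
    classes = sorted(set(y))
    log_prior, log_like = {}, {}
    for c in classes:
        rows = [x for x, label in zip(X, y) if label == c]
        log_prior[c] = math.log(len(rows) / len(X))
        # Per-feature totals within the class, each smoothed by alpha
        totals = [sum(col) + alpha for col in zip(*rows)]
        denom = sum(totals)  # class total + alpha * number of features
        log_like[c] = [math.log(t / denom) for t in totals]
    return classes, log_prior, log_like

def predict(x, classes, log_prior, log_like):
    """Return the class with the highest joint log probability."""
    def score(c):
        return log_prior[c] + sum(v * w for v, w in zip(x, log_like[c]))
    return max(classes, key=score)

# Toy term counts over the vocabulary [aws, data, php]; labels are class codes
X = [[3, 0, 0], [2, 1, 0], [0, 3, 1], [0, 2, 0], [0, 0, 4], [1, 0, 2]]
y = [0, 0, 1, 1, 2, 2]
model = fit_multinomial_nb(X, y)
print(predict([0, 2, 0], *model))
print(predict([1, 0, 3], *model))
```

The notebook feeds TF-IDF weights rather than integer counts into the model; the same formulas apply, with fractional weights taking the place of counts.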

In [8]:
from sklearn.naive_bayes import MultinomialNB
classifier = MultinomialNB().fit(X_train, y_train)
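
Once fitted, the classifier can label an unseen description: transform the text with the already fitted vectorizer (`transform`, not `fit_transform`), call `predict`, and map the code back to a class name with `inverse_transform`. The sketch below is self-contained, so it rebuilds a tiny pipeline on a hypothetical six-sentence corpus with scikit-learn's default tokenizer. In the notebook, the same three calls would use `vectorizer`, `classifier`, and `le` from the cells above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy corpus standing in for the course descriptions
train_texts = [
    "train machine learning models for data analysis",
    "statistics and data visualization with python",
    "write php code for dynamic websites",
    "learn java programming and object oriented code",
    "deploy servers on aws cloud infrastructure",
    "cloud storage and virtual machines on azure",
]
train_labels = ["Data-Science", "Data-Science", "Programming",
                "Programming", "Cloud-Computing", "Cloud-Computing"]

toy_vectorizer = TfidfVectorizer()
toy_encoder = LabelEncoder()
X = toy_vectorizer.fit_transform(train_texts)
y = toy_encoder.fit_transform(train_labels)
toy_classifier = MultinomialNB().fit(X, y)

new_text = ["build websites with php code"]
new_features = toy_vectorizer.transform(new_text)  # reuse the fitted vocabulary
predicted_code = toy_classifier.predict(new_features)
predicted_label = toy_encoder.inverse_transform(predicted_code)[0]
print(predicted_label)
```

Words that were never seen during fitting (here `build`) are simply ignored by `transform`, which is why the vectorizer must not be refitted on the new text.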

`Predictions` and Evaluation

In [9]:
from sklearn.metrics import confusion_matrix, accuracy_score

# Evaluate on the held-out test set
print('Testing with Test Set :\n')
y_pred = classifier.predict(X_test)
print(f'Confusion Matrix :\n{confusion_matrix(y_test, y_pred)}\n')
print(f'Prediction Accuracy : {accuracy_score(y_test, y_pred)*100:.2f}%')

# Evaluate on all 20 documents; this includes the training data, so the score is optimistic
print('Testing with Full Corpus :\n')
y_pred_full = classifier.predict(tfidf)
print(f'Confusion Matrix :\n{confusion_matrix(encoded_labels, y_pred_full)}\n')
print(f'Prediction Accuracy : {accuracy_score(encoded_labels, y_pred_full)*100:.2f}%')

Testing with Test Set :

Confusion Matrix :
[[1 0]
 [1 2]]

Prediction Accuracy : 75.00%
Testing with Full Corpus :

Confusion Matrix :
[[6 0 0]
 [0 5 0]
 [1 0 8]]

Prediction Accuracy : 95.00%
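
Accuracy can be read directly off a confusion matrix: correct predictions sit on the diagonal, so accuracy is the diagonal sum divided by the total. The short sketch below checks this against the two matrices printed above and also derives per-class recall. Rows are true classes and columns are predicted classes, as in `sklearn.metrics.confusion_matrix`; the helper names are illustrative.

```python
def accuracy(matrix):
    """Share of samples on the diagonal (correct predictions)."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

def recall_per_class(matrix):
    """Diagonal entry divided by its row sum (the true-class count)."""
    return [row[i] / sum(row) for i, row in enumerate(matrix)]

test_set = [[1, 0], [1, 2]]
full_corpus = [[6, 0, 0], [0, 5, 0], [1, 0, 8]]

print(f"{accuracy(test_set) * 100:.2f}%")
print(f"{accuracy(full_corpus) * 100:.2f}%")
print([round(r, 3) for r in recall_per_class(full_corpus)])
```

The test-set matrix is 2x2 because, without an explicit `labels` argument, `confusion_matrix` only includes classes that occur in the true or predicted labels, and just two of the three classes show up among the 4 held-out documents. In the full-corpus matrix, the single error is one `Programming` document (row 2) predicted as `Cloud-Computing` (column 0).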
