In [1]:
# # Introduction Here

In [9]:
# # Importing Libraries
import pickle
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from nltk.tokenize import RegexpTokenizer
from nltk.stem.snowball import SnowballStemmer
%matplotlib inline

### The dataset we will be using is the news headlines dataset (abcnews-date-text.csv)

In [3]:
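# on_bad_lines="skip" skips malformed rows (pandas >= 1.3; older versions used error_bad_lines=False)
data = pd.read_csv("abcnews-date-text.csv", on_bad_lines="skip", usecols=["headline_text"])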
data.head()





Unnamed: 0,headline_text
0,aba decides against community broadcasting lic...
1,act fire witnesses must be aware of defamation
2,a g calls for infrastructure protection summit
3,air nz staff in aust strike for pay rise
4,air nz strike to affect australian travellers


### Data Preprocessing

In [4]:
data[data['headline_text'].duplicated(keep=False)].sort_values('headline_text').head(8)

Unnamed: 0,headline_text
57973,10 killed in pakistan bus crash
116304,10 killed in pakistan bus crash
912357,110 with barry nicholls
673104,110 with barry nicholls
676569,110 with barry nicholls
748865,110 with barry nicholls
827317,110 with barry nicholls episode 15
898182,110 with barry nicholls episode 15


In [5]:
data = data.drop_duplicates('headline_text')

Machine learning algorithms operate on numbers, so in natural language processing words must first be converted into vectors. If your goal is to do machine learning on text data, such as movie reviews or tweets, you need to convert the text into numbers. This process is sometimes referred to as “embedding” or “vectorization”.

Vectorization is more than turning a single word into a single number. While individual words can be mapped to numbers, an entire document is translated into a vector. With text data these vectors are usually high-dimensional, because each dimension of the feature data corresponds to a word, and the documents you are examining will contain thousands of distinct words.
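To make the idea of "one dimension per word" concrete, here is a minimal, dependency-free sketch of a bag-of-words count vector. The three headlines are made up for illustration and are not taken from the dataset; `to_count_vector` is a hypothetical helper, not part of any library.

```python
from collections import Counter

# Toy corpus (hypothetical headlines, not taken from the dataset)
docs = [
    "police investigate fatal crash",
    "council approves new water plan",
    "police charge man over crash",
]

# Build the vocabulary: one vector dimension per distinct word
vocabulary = sorted({word for doc in docs for word in doc.split()})
index = {word: i for i, word in enumerate(vocabulary)}

def to_count_vector(doc):
    """Map a document to a vector of word counts over the vocabulary."""
    vector = [0] * len(vocabulary)
    for word, count in Counter(doc.split()).items():
        vector[index[word]] = count
    return vector

vectors = [to_count_vector(doc) for doc in docs]

print(len(vocabulary))  # number of dimensions of every document vector
for doc, vec in zip(docs, vectors):
    print(vec, "<-", doc)
```

Even this tiny corpus needs a 12-dimensional vector per headline, and most entries are zero, which is why real text vectors are both high-dimensional and sparse.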

### TF-IDF
In information retrieval, tf–idf (also written TFIDF), short for term frequency–inverse document frequency, is a numerical statistic intended to reflect how important a word is to a document in a collection or corpus. It is often used as a weighting factor in information retrieval, text mining, and user modeling. The tf–idf value increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word, which adjusts for the fact that some words appear more frequently in general. tf–idf is one of the most popular term-weighting schemes; one survey reported that 83% of text-based recommender systems in the domain of digital libraries use tf–idf.
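The weighting described above can be sketched in a few lines of plain Python. This assumes one common textbook definition (term frequency as a relative count, idf as the natural log of the number of documents divided by the number of documents containing the term). The toy corpus and the function names are illustrative assumptions. scikit-learn's `TfidfVectorizer` by default uses a smoothed idf and L2-normalises each row, so its numbers will differ from these.

```python
import math

# Toy corpus of already-tokenized documents (hypothetical, for illustration)
docs = [
    ["police", "probe", "crash"],
    ["police", "charge", "man"],
    ["council", "water", "plan"],
]

def tf(term, doc):
    """Relative frequency of the term within one document."""
    return doc.count(term) / len(doc)

def idf(term, docs):
    """Log of (number of documents / documents containing the term).

    Assumes the term occurs in at least one document.
    """
    df = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / df)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

# "police" appears in two documents, "crash" in only one,
# so "crash" receives the higher weight in the first document.
print(tf_idf("police", docs[0], docs))
print(tf_idf("crash", docs[0], docs))
```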

Variations of the tf–idf weighting scheme are often used by search engines as a central tool in scoring and ranking a document's relevance given a user query. tf–idf can be successfully used for stop-words filtering in various subject fields, including text summarization and classification.

One of the simplest ranking functions is computed by summing the tf–idf for each query term; many more sophisticated ranking functions are variants of this simple model.
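As a rough illustration of that simplest ranking function, the sketch below scores each document by summing the tf–idf of every query term and sorts the documents by score. The corpus, the document names and the helper functions are hypothetical, and the tf–idf definition is the same unsmoothed textbook variant assumed earlier, not what a production search engine would use.

```python
import math

# Hypothetical mini-corpus: document name -> list of tokens
docs = {
    "d1": "police probe fatal crash".split(),
    "d2": "council plans water fund".split(),
    "d3": "police charge man over fatal stabbing".split(),
}

def tf_idf(term, doc, corpus):
    """Unsmoothed tf-idf; terms absent from the corpus score 0."""
    df = sum(1 for d in corpus if term in d)
    if df == 0:
        return 0.0
    return (doc.count(term) / len(doc)) * math.log(len(corpus) / df)

def score(query, doc, corpus):
    """Rank score: sum of tf-idf over all query terms."""
    return sum(tf_idf(term, doc, corpus) for term in query)

query = ["fatal", "crash"]
corpus = list(docs.values())
ranking = sorted(docs, key=lambda name: score(query, docs[name], corpus), reverse=True)
print(ranking)
```

Here `d1` contains both query terms, `d3` only one, and `d2` none, so they rank in that order.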

### Natural Language Processing
<p> Just as we encoded the datetime last time to fit the machine learning model, we now need to transform this dataset into a form the clustering algorithm can use. </p>
<p> In NLP: </p>
    <ul>
        <li>we remove stop words (very common words such as 'the' and 'is') and punctuation such as '.' and ','</li>
        <li>we reduce each word to its root form through stemming. Example: charges, charged and charging all reduce to the common stem charg</li>
        <li>we break each sentence into words and punctuation, a process called tokenization</li>
        <li>we convert the dataset into vectors (embedding/vectorization)</li>
        <li>we use Term Frequency-Inverse Document Frequency to reflect the importance of a word to the document</li>
    </ul>
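The first three steps listed above can be sketched without any external library. Note the assumptions: `STOP_WORDS` is a tiny illustrative list, and `crude_stem` is a deliberately naive suffix stripper standing in for a real stemmer such as NLTK's `SnowballStemmer`. It will not reproduce Snowball's output in general.

```python
import re

# Tiny illustrative stop-word list (real lists contain hundreds of words)
STOP_WORDS = {"the", "a", "an", "of", "in", "for", "to", "is", "are"}

def tokenize(sentence):
    """Lowercase the sentence and keep runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", sentence.lower())

def crude_stem(word):
    """Naive stand-in for a stemmer: strip one common suffix if enough remains."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(sentence):
    """Tokenize, drop stop words, then stem the remaining tokens."""
    tokens = tokenize(sentence)
    return [crude_stem(tok) for tok in tokens if tok not in STOP_WORDS]

print(preprocess("Police are investigating the crashes in Sydney."))
```

Punctuation disappears at the tokenization step because the regular expression only matches letters and apostrophes.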

In [7]:
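# # Removing Stop Words
punc = ['.', ',', '"', "'", '?', '!', ':', ';', '(', ')', '[', ']', '{', '}',"%"]
# Recent scikit-learn versions expect stop_words as a list, not a frozenset
stop_words = list(text.ENGLISH_STOP_WORDS.union(punc))
desc = data['headline_text'].values

# # Use Stemming and Tokenization
stemmer = SnowballStemmer('english')
tokenizer = RegexpTokenizer(r'[a-zA-Z\']+')

def tokenize(sentence):
    # The parameter is named `sentence` to avoid shadowing the imported sklearn `text` module
    return [stemmer.stem(word) for word in tokenizer.tokenize(sentence.lower())]

# # Vectorize using TF-IDF (keep only the 1000 most frequent terms)
vectorizer = TfidfVectorizer(stop_words = stop_words, tokenizer = tokenize, max_features = 1000)
X = vectorizer.fit_transform(desc)
# get_feature_names_out() needs scikit-learn >= 1.0; older versions used get_feature_names()
words = vectorizer.get_feature_names_out()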



### Unsupervised Learning
-- description for unsupervised learning with references

### K-Means Clustering
-- description for K-Means Clustering with references
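As a rough intuition for what K-Means does, the sketch below implements the standard two-step loop (assign each point to its nearest centroid, then move each centroid to the mean of its points) in plain Python on six made-up 2-D points. It is an illustration only: seeding the centroids with the first `k` points is a simplifying assumption, whereas scikit-learn's `KMeans` uses a more careful initialisation and runs on the sparse TF-IDF matrix rather than on toy points.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iterations=10):
    # Simplification (assumption): seed the centroids with the first k points
    centroids = [list(p) for p in points[:k]]
    labels = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids

# Hypothetical data: two visually obvious groups
points = [(1.0, 1.0), (8.0, 8.0), (1.5, 2.0), (9.0, 8.5), (1.0, 0.5), (8.5, 9.0)]
labels, centroids = kmeans(points, k=2)
print(labels)
print(centroids)
```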

Because of the limited compute power of our local PC, we were not able to train the model locally. Instead, we rented a virtual machine on Azure, used its compute power to run the clustering algorithm and create the model, saved the model as a pickle file, and loaded that pickle file here in the notebook. First, we save X as a pickle file and upload it to the VM for training.


In [11]:
# # Pickling X for the clustering algorithm to train on
with open("X.pkl", "wb") as fp:
    pickle.dump(X, fp)

In [13]:
# # Pickling words so the feature names can be reused alongside the trained model
with open("words.pkl", "wb") as fp:
    pickle.dump(words, fp)

The train function below runs on the Azure virtual machine and gives us the model as a pickle file.

In [22]:
# # Training the model

# # Unpickling
# with open("X.pkl", "rb") as fp:
#     X = pickle.load(fp)
# number_of_cluster = 10
    
def train(number_of_cluster, X):
    kmeans = KMeans(n_clusters = number_of_cluster, n_init = 20)
    kmeans.fit(X)
    
    # Use a context manager so the file is flushed and closed after writing
    with open('model.pkl', 'wb') as fp:
        pickle.dump(kmeans, fp)
    print('Model is pickled and saved')

Proof of Training on Azure

<img src="1.png" /> 

Training Process on Azure VM

<img src="2.png"/>

Now we load the model on the local machine and inspect the results.

In [23]:
# load the model
with open("model.pkl", "rb") as fp:
    model = pickle.load(fp)

# list the 25 highest-weighted terms of each cluster centroid
common_words = model.cluster_centers_.argsort()[:,-1:-26:-1]
for num, centroid in enumerate(common_words):
    print(str(num) + ' : ' + ', '.join(words[word] for word in centroid))

0 : say, council, govt, australia, court, report, warn, fund, urg, water, face, accus, man, health, chang, boost, nsw, consid, qld, public, cut, ban, wa, hear, rural
1 : man, death, nsw, sydney, year, open, hit, jail, attack, claim, wa, set, miss, chang, qld, hous, day, world, home, die, hospit, elect, talk, final, return
2 : polic, interview, investig, probe, man, search, hunt, offic, miss, arrest, death, car, shoot, drug, seek, attack, say, driver, crash, murder, fatal, assault, suspect, extend, warn
3 : charg, man, murder, face, assault, drug, polic, child, sex, woman, court, teen, death, stab, drop, alleg, rape, men, attack, guilti, shoot, bail, sydney, fatal, yo
4 : new, zealand, law, year, open, council, polic, home, deal, hospit, centr, set, hope, australia, appoint, look, announc, chief, say, minist, govt, mayor, south, rule, servic
5 : support, communiti, indigen, group, urg, council, servic, fund, govt, remot, say, aborigin, ralli, offer, seek, centr, health, nt, need, labor,