# Wikipedia scraping and KMeans clustering
The example below is adapted from a workshop example from the University of Exeter. The original consisted of code only; explanations and comments have been added to aid learning.

In [1]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
import wikipedia

##### Outline
This notebook takes two Wikipedia pages and splits the text of each into sentence-like fragments, using '.' as the delimiter. The fragments from both pages are combined into a single corpus in which each fragment is one document. The corpus is transformed into TF-IDF features, and the KMeans algorithm is used to find two clusters, which are expected to correspond to the two Wikipedia pages.

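Predictions are then made for two new samples, assigning each to cluster 0 or cluster 1 according to which cluster centre is nearest. This indicates which of the two Wikipedia articles each sample more closely resembles. The cluster numbers themselves are arbitrary and can swap between runs, because KMeans starts from randomly chosen centres.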
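
To make the splitting step of the outline concrete, here is a minimal sketch that works on made-up strings rather than live Wikipedia pages. The helper name `split_page` and the sample texts are assumptions for illustration. It applies the same `replace` and `split` calls as the notebook, and additionally drops empty fragments, which the notebook itself does not do.

```python
def split_page(content):
    """Split page text into sentence-like fragments on '.'."""
    # Same two calls as the notebook: drop newlines, then split on full stops.
    fragments = content.replace("\n", "").split(sep=".")
    # Extra step (not in the notebook): strip whitespace and drop empty fragments.
    return [f.strip() for f in fragments if f.strip()]


# Hypothetical stand-ins for the two page texts.
text_a = "Exeter is in Devon.\nThe campus is large."
text_b = "Pizza is baked in an oven. Dough is made from flour."

page_a = split_page(text_a)
page_b = split_page(text_b)

corpus = page_a + page_b                        # page A fragments first
origin = [0] * len(page_a) + [1] * len(page_b)  # source page of each fragment

print(corpus)
print(origin)
```

Splitting on '.' also cuts at abbreviations and decimal points, so the fragments are only approximately sentences. The `origin` list is not needed for clustering, but it records which page each fragment came from, which is useful if the clusters are later checked against the true sources.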

##### Scikit Learn documentation
About term frequency times inverse document frequency (TF-IDF):
https://scikit-learn.org/stable/modules/feature_extraction.html#tfidf-term-weighting
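
As a rough illustration of the weighting described at the link above, the sketch below computes TF-IDF by hand using only the standard library. It is intended to follow the smoothed idf, `ln((1 + n) / (1 + df)) + 1`, and the L2 row normalisation described there. The function name `tfidf`, the simplified tokeniser and the three toy documents are assumptions, and stop words are not removed.

```python
import math
import re
from collections import Counter


def tfidf(documents):
    """Return (vocabulary, rows); rows[i][j] is the weight of term j in document i."""
    # Lowercase and keep tokens of two or more word characters.
    tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in documents]
    vocabulary = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(documents)

    # Document frequency: number of documents containing each term.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed inverse document frequency.
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocabulary}

    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        row = [counts[t] * idf[t] for t in vocabulary]
        # Scale the row to unit Euclidean length (guard against empty documents).
        norm = math.sqrt(sum(w * w for w in row)) or 1.0
        rows.append([w / norm for w in row])
    return vocabulary, rows


docs = ["pizza dough oven", "pizza cheese", "campus students"]
vocab, rows = tfidf(docs)
for term, weight in zip(vocab, rows[0]):
    print(f"{term:10s} {weight:.3f}")
```

In the first toy document, "pizza" appears in two of the three documents and so receives a lower weight than "dough" or "oven", which appear in only one. Terms that are absent from a document get a weight of zero.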

About the KMeans clustering algorithm:
https://scikit-learn.org/stable/modules/clustering.html#k-means
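
The core of the algorithm linked above can be sketched in a few lines of plain Python. This is a simplified version of Lloyd's iterations with fixed starting centres, not scikit-learn's implementation, which by default picks its starting centres with k-means++. The function names `closest` and `kmeans` and the toy points are assumptions for illustration.

```python
import math


def closest(point, centroids):
    """Index of the centroid nearest to point (Euclidean distance)."""
    distances = [math.dist(point, c) for c in centroids]
    return distances.index(min(distances))


def kmeans(points, centroids, max_iter=100):
    """Plain Lloyd's iterations starting from the given centroids."""
    for _ in range(max_iter):
        # Assignment step: label each point with its nearest centroid.
        labels = [closest(p, centroids) for p in points]
        # Update step: move each centroid to the mean of its members.
        updated = []
        for k, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                updated.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                updated.append(old)  # keep an empty cluster's centroid in place
        if updated == centroids:     # converged: no centroid moved
            break
        centroids = updated
    return centroids, labels


points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
centroids, labels = kmeans(points, [points[0], points[3]])
print(labels)
# Predicting a new sample means finding its nearest centroid.
print(closest((4.5, 4.5), centroids))
```

Prediction for a new sample is the assignment step applied to a single point, which is what happens when the notebook asks which cluster a new sentence belongs to.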

About the adjusted Rand score:
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.adjusted_rand_score.html#sklearn.metrics.adjusted_rand_score
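
The adjusted Rand score compares two labellings of the same items while ignoring the label names, which suits clustering because cluster numbers are arbitrary. The sketch below computes it from pair counts using the standard library only. The function name `adjusted_rand` and the toy labels are assumptions, and returning 1.0 in the degenerate cases is a choice made here for simplicity.

```python
from collections import Counter
from math import comb


def adjusted_rand(labels_true, labels_pred):
    """Adjusted Rand index computed from pair counts."""
    n = len(labels_true)
    if n < 2:
        return 1.0  # no pairs to compare
    cells = Counter(zip(labels_true, labels_pred))  # contingency table cells
    rows = Counter(labels_true)
    cols = Counter(labels_pred)

    # Number of item pairs that share a cell, a row, or a column.
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in rows.values())
    sum_cols = sum(comb(c, 2) for c in cols.values())

    expected = sum_rows * sum_cols / comb(n, 2)  # agreement expected by chance
    maximum = (sum_rows + sum_cols) / 2
    if maximum == expected:  # degenerate case, e.g. one cluster on both sides
        return 1.0
    return (sum_cells - expected) / (maximum - expected)


truth = [0, 0, 0, 1, 1, 1]
print(adjusted_rand(truth, [1, 1, 1, 0, 0, 0]))  # same grouping, labels swapped
print(adjusted_rand(truth, [1, 1, 0, 0, 0, 0]))  # one item in the wrong group
```

In this notebook, the true labelling would mark each fragment with the page it came from, and the predicted labelling would be the cluster assigned by KMeans. A score near 1 means the clusters closely match the pages, and a score near 0 means the match is no better than chance.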

In [2]:
# Obtain formatted wiki pages for Exeter and Pizza
p_wiki = wikipedia.page("University_of_Exeter")
page_1 = p_wiki.content.replace("\n", "").split(sep='.')

p_wiki = wikipedia.page("Pizza")
page_2 = p_wiki.content.replace("\n", "").split(sep='.')

# Combine both pages into a single corpus and convert each fragment
# into a TF-IDF feature vector (one feature per distinct term)
documents = page_1 + page_2
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(documents)

# Create clusters with KMeans
true_k = 2
model = KMeans(n_clusters=true_k, max_iter=100)
model.fit(X)

# Display the top terms for each cluster
print("Top terms per cluster:")
# Sort term indices by centroid weight, highest first
order_centroids = model.cluster_centers_.argsort()[:, ::-1]
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
terms = vectorizer.get_feature_names_out()
for i in range(true_k):
    print("Cluster %d" % i)
    #print(terms) # Every feature in the vocabulary
    for ind in order_centroids[i, :10]:
        print(' %s' % terms[ind])
    print()

# Predict labels for samples
Y = vectorizer.transform(["I live in Devon."])
prediction = model.predict(Y)
print('Prediction for "I live in Devon":', prediction)

Y = vectorizer.transform(["You cook it in the oven"])
prediction = model.predict(Y)
print('Prediction for "You cook it in the oven":', prediction)

Top terms per cluster:
Cluster 0
 pizza
 dough
 oven
 baked
 cheese
 crust
 bread
 similar
 ingredients
 pizzas

Cluster 1
 university
 exeter
 campus
 college
 luke
 school
 students
 streatham
 centre
 student

Prediction for "I live in Devon": [1]
Prediction for "You cook it in the oven": [0]
