## Worksheet 8 - Generative models 3

Problem 3:

Part a)

Unpacked and sorted through the directories. There are 20 news categories, and each category is determined by the directory hierarchy.

Part b)

Load the same dataset and hierarchy through scikit-learn, which links to the same directory as specified in the homework. Each document gets a training label according to the directory that it lies in, e.g. alt.atheism 1, etc.

In [None]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
# Strip headers, footers, and quotes: they are strong identifiers of article category
newsgroups_train = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'))
newsgroups_test = fetch_20newsgroups(subset='test', remove=('headers', 'footers', 'quotes'))

In [None]:
print(newsgroups_train.filenames.shape)
print(newsgroups_test.filenames.shape)

We have 11,314 documents of training data and 7,532 documents of test data.

Part c)

In [None]:
length = newsgroups_train.filenames.shape[0]

In [None]:
import numpy as np
unique, counts = np.unique(newsgroups_train.target, return_counts=True)

In [None]:
import pandas as pd
prior_prob = pd.DataFrame({'class':unique, 'prior_prob':counts/length})

In [None]:
# Fraction of total documents that belong to each class. The last class appears to have a smaller share.
# The classes are numbered 0-19 here rather than 1-20.
prior_prob
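
As a quick sanity check on what the table above contains, here is a minimal sketch of the same prior computation in plain Python. The `class_priors` helper and the toy labels are hypothetical and only for illustration; the real labels come from `newsgroups_train.target`.

```python
from collections import Counter

def class_priors(labels):
    """Return {class: fraction of documents carrying that label}."""
    counts = Counter(labels)
    total = len(labels)
    return {c: n / total for c, n in sorted(counts.items())}

# Hypothetical toy labels (0-based, like the scikit-learn targets)
toy_labels = [0, 0, 1, 2, 2, 2, 2, 1]
priors = class_priors(toy_labels)
print(priors)
```

The fractions should always sum to 1, which is an easy check to run on the real `prior_prob` column as well.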

In [None]:
vocab = {}
reverse_vocab = {}
# Map each vocabulary word to its 0-based line index, and back again
with open('./vocabulary.txt', 'r') as a:
    for count, v in enumerate(a):
        val = v.strip()
        vocab[val] = count
        reverse_vocab[count] = val

In [None]:
vocab['baseball']

In [None]:
# Vectorize each training document, using the vocabulary document
vectorizer = TfidfVectorizer(strip_accents='unicode', decode_error = 'ignore', stop_words='english', vocabulary=vocab)
vectors = vectorizer.fit_transform(newsgroups_train.data)
vectors.shape

In [None]:
# Vectorize the test documents with the vectorizer already fitted on the training data
# (transform only, so the IDF weights are not re-fit on the test set)
vectors_test = vectorizer.transform(newsgroups_test.data)
vectors_test.shape
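
To make the vectorization step more concrete, here is a rough sketch of TF-IDF with a fixed vocabulary, where the IDF weights are learned from one corpus and then applied to a new document. It assumes the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, which is my understanding of the scikit-learn defaults. The whitespace tokenizer, the helper names, and the toy corpus are simplifications of mine, not what `TfidfVectorizer` does internally.

```python
import math
from collections import Counter

def learn_idf(train_docs, vocab):
    """Learn smoothed IDF weights from the training documents only."""
    n = len(train_docs)
    df = Counter()
    for doc in train_docs:
        for term in set(doc.split()):  # count each term once per document
            if term in vocab:
                df[term] += 1
    return {t: math.log((1 + n) / (1 + df[t])) + 1.0 for t in vocab}

def tfidf_vector(doc, vocab, idf):
    """Weight term counts by the learned IDF, then L2-normalize."""
    tf = Counter(t for t in doc.split() if t in vocab)
    vec = [0.0] * len(vocab)
    for term, count in tf.items():
        vec[vocab[term]] = count * idf[term]
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else vec

# Hypothetical toy vocabulary and corpus
toy_vocab = {'ball': 0, 'game': 1, 'god': 2}
toy_train = ['ball game ball', 'game score', 'god belief']
idf = learn_idf(toy_train, toy_vocab)

# Words outside the vocabulary ('unknown') are simply ignored
test_vec = tfidf_vector('god game game unknown', toy_vocab, idf)
print(test_vec)
```

Words that appear in fewer training documents get a larger IDF weight, and a new document is scored with those same weights rather than with weights of its own.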

Part d) Used a smoothing constant other than 1, since 1 did not perform as well. MultinomialNB works with log probabilities internally.

In [None]:
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics
clf = MultinomialNB(alpha=0.046)
clf.fit(vectors, newsgroups_train.target)
# By default (fit_prior=True) MultinomialNB learns class priors from the training labels,
# which matches the class fractions computed above
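
For reference, this is a minimal sketch of what a multinomial Naive Bayes classifier with additive smoothing computes, written in log space. The function names and toy count vectors are hypothetical, and this is not the scikit-learn implementation; it is only meant to show where the smoothing constant `alpha` and the logs enter.

```python
import math
from collections import Counter

def fit_multinomial_nb(docs, labels, vocab_size, alpha):
    """docs are count (or weight) vectors; returns log priors and log likelihoods."""
    class_counts = Counter(labels)
    n_docs = len(labels)
    log_prior = {c: math.log(n / n_docs) for c, n in class_counts.items()}
    # Sum feature counts per class
    word_counts = {c: [0.0] * vocab_size for c in class_counts}
    for doc, c in zip(docs, labels):
        for j, x in enumerate(doc):
            word_counts[c][j] += x
    # Additive smoothing: add alpha to every count before normalizing
    log_lik = {}
    for c, counts in word_counts.items():
        total = sum(counts) + alpha * vocab_size
        log_lik[c] = [math.log((n + alpha) / total) for n in counts]
    return log_prior, log_lik

def predict(doc, log_prior, log_lik):
    """Pick the class with the largest log prior plus summed log likelihoods."""
    scores = {
        c: log_prior[c] + sum(x * lw for x, lw in zip(doc, log_lik[c]))
        for c in log_prior
    }
    return max(scores, key=scores.get)

# Hypothetical toy data: 3-word vocabulary, two classes
toy_docs = [[2, 1, 0], [3, 0, 0], [0, 1, 3], [0, 0, 2]]
toy_labels = [0, 0, 1, 1]
log_prior, log_lik = fit_multinomial_nb(toy_docs, toy_labels, vocab_size=3, alpha=0.046)

print(predict([1, 0, 0], log_prior, log_lik))
print(predict([0, 0, 1], log_prior, log_lik))
```

Without smoothing, a word never seen in a class would get probability zero and a log of negative infinity. A small `alpha` keeps that penalty large but finite.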

Part e)

In [None]:
# Predictions:
pred = clf.predict(vectors_test)

In [None]:
accuracy = metrics.accuracy_score(newsgroups_test.target, pred)
print('The model is', round(accuracy * 100), '% accurate')
print('The model has an error rate of', round((1 - accuracy) * 100), '%')

## Worksheet 9 - Clustering

Problem 1:

In [None]:
with open('./Animals_with_Attributes/Features/README-features.txt', 'r') as f:
    file_contents = f.read()
print(file_contents)

Problem 2:

In [None]:
# Different animal classes
with open('./Animals_with_Attributes/classes.txt', 'r') as f:
    classes_str = f.read()
print(classes_str)

In [None]:
# Different available features
with open('./Animals_with_Attributes/predicates.txt', 'r') as f:
    features_str = f.read()
print(features_str)

Problem 3:

In [None]:
classes = ''.join([i for i in classes_str if not i.isdigit()]).split()

In [None]:
classes[:10]

In [None]:
features = ''.join([i for i in features_str if not i.isdigit()]).split()

In [None]:
features[:10]

In [None]:
animals_data = pd.read_fwf("./Animals_with_Attributes/predicate-matrix-continuous.txt", header=None).values
print('The shape of the data is', animals_data.shape)

In [None]:
animal_df = pd.DataFrame(data = animals_data, columns = features)
animal_df.index = classes
animal_df.head()

In [None]:
# Import K Means Package
from sklearn.cluster import KMeans

# Set k = 10
km10 = KMeans(n_clusters=10)
km10.fit(animals_data)
# Get cluster assignment labels
labels = km10.labels_
# Format results as a DataFrame
results = pd.DataFrame([animal_df.index,labels]).T
results.columns = ['class', 'cluster']

In [None]:
results.groupby('cluster')['class'].apply(list)

The clusters look sensible to me: the large aquatic and land animals are grouped together, the flying animal is alone, the bears are together, and the household pets are grouped together.
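
To show what K-means is doing when it forms these groups, here is a minimal sketch of the basic assign-then-update loop (Lloyd's algorithm) on toy 2-D points. The points and starting centroids are hypothetical, and the initialization is fixed by hand for simplicity, whereas scikit-learn's `KMeans` picks its starting centroids itself, so its clusters can change between runs unless `random_state` is set.

```python
import math

def kmeans(points, centroids, n_iter=100):
    """Alternate nearest-centroid assignment and centroid updates until stable."""
    labels = []
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid
        labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members
        new_centroids = []
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centroids.append(centroids[k])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids

# Hypothetical toy data: two well-separated groups
toy_points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
toy_labels, toy_centroids = kmeans(toy_points, [(0, 0), (10, 10)])
print(toy_labels)
print(toy_centroids)
```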

Problem 4:

In [None]:
from scipy.cluster.hierarchy import dendrogram, linkage
import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams['figure.figsize'] = 10, 10

In [None]:
HC = linkage(animals_data, 'ward')

In [None]:
plt.title('Hierarchical Clustering of Animals')
# With orientation='right' the distances run along the x-axis and the animals along the y-axis
plt.xlabel('distance')
plt.ylabel('animal')
dendrogram(HC, labels= classes, orientation='right')
plt.show()

The hierarchical clusters make sense to me: the larger land animals are grouped together, the smaller land animals are grouped together, and the aquatic animals are grouped together. In these examples, however, I believe K-means was very comparable, especially given how similar the bat and monkey families appear.
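
For intuition about how the Ward tree is built, here is a naive sketch of agglomerative clustering with the Ward criterion: at each step it merges the two clusters whose merge increases the within-cluster sum of squares the least. The helper name and toy points are hypothetical. New cluster ids are numbered from `n` upward, the same way the rows of a scipy linkage matrix refer to merged clusters. The merge heights scipy reports are on a different scale from the raw costs below.

```python
def ward_merge_order(points):
    """Return the list of merges as (cluster_id_a, cluster_id_b, cost)."""
    clusters = {i: [p] for i, p in enumerate(points)}
    next_id = len(points)
    merges = []

    def centroid(members):
        return [sum(c) / len(members) for c in zip(*members)]

    def cost(a, b):
        # Increase in within-cluster sum of squares if a and b are merged
        ca, cb = centroid(a), centroid(b)
        sq_dist = sum((x - y) ** 2 for x, y in zip(ca, cb))
        return len(a) * len(b) / (len(a) + len(b)) * sq_dist

    while len(clusters) > 1:
        ids = sorted(clusters)
        pairs = [(a, b) for idx, a in enumerate(ids) for b in ids[idx + 1:]]
        i, j = min(pairs, key=lambda ab: cost(clusters[ab[0]], clusters[ab[1]]))
        merges.append((i, j, cost(clusters[i], clusters[j])))
        clusters[next_id] = clusters.pop(i) + clusters.pop(j)
        next_id += 1
    return merges

# Hypothetical toy points: two tight pairs and one far-away point
toy_points = [(0, 0), (0, 1), (5, 5), (5, 6), (20, 20)]
toy_merges = ward_merge_order(toy_points)
for step in toy_merges:
    print(step)
```

The two tight pairs merge first, then the pairs merge with each other, and the outlier joins last, which is the same bottom-up reading as the dendrogram.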

## Worksheet 10 - PCA and SVD

Problem 4:

In [None]:
from sklearn.decomposition import PCA
pc2 = PCA(2)
animals_data2d = pc2.fit_transform(animals_data)
print(animals_data2d.shape)

In [None]:
print('Reducing to 2-D retained', sum(pc2.explained_variance_ratio_), "of the original data's variance")

In [None]:
# Let us plot the hierarchical clusters again:
HC2 = linkage(animals_data2d, 'ward')
plt.title('Hierarchical Clustering of Animals (PCA 2-D)')
# With orientation='right' the distances run along the x-axis and the animals along the y-axis
plt.xlabel('distance')
plt.ylabel('animal')
dendrogram(HC2, labels= classes, orientation='right')
plt.show()

In [None]:
# Plot values in 2-D
fig = plt.figure(1, figsize=(10, 10))
ax = fig.add_subplot(111)
for i,point in enumerate(animals_data2d):
    ax.annotate(classes[i], xy=point, xytext=point)
    
plt.scatter(animals_data2d[:,0], animals_data2d[:,1])
plt.title('PCA Projection of Animals')

While the projection does seem sensible, I think a few more dimensions may yield more accurate results. Two dimensions seem mainly to separate aquatic from non-aquatic animals.
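
One way to make "a few more dimensions" concrete is to pick the smallest number of components whose cumulative explained variance ratio passes a threshold. The sketch below assumes a list of per-component ratios such as `PCA().fit(animals_data).explained_variance_ratio_` would give. The helper name, the toy ratios, and the 0.8 threshold are hypothetical choices of mine, not results from this dataset.

```python
def components_for_variance(ratios, threshold):
    """Smallest k such that the first k ratios sum to at least threshold."""
    total = 0.0
    for k, r in enumerate(ratios, start=1):
        total += r
        if total >= threshold:
            return k
    return len(ratios)  # threshold never reached: keep every component

# Hypothetical ratios, sorted from largest to smallest as PCA reports them
toy_ratios = [0.45, 0.25, 0.15, 0.10, 0.05]
k = components_for_variance(toy_ratios, threshold=0.8)
print(k)
```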