In [176]:
import pandas as pd
import numpy as np
import scipy.sparse as sp

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

# Feature extraction

## Raw Text Data

raw_data.txt contains text from 11039 Wikipedia articles falling into the categories "Actor", "Movie", "Math", or "Mathematician".

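We can use `CountVectorizer` to get a "Bag-of-Words" representation of the data: a sparse matrix with one row per article and one column per distinct token, where each entry counts how often that token occurs in that article.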

In [147]:
vectorizer = CountVectorizer(encoding='utf-8', strip_accents='unicode')
# Each line of raw_data.txt is treated as one document (one article per line).
with open('./data/raw_data.txt', 'r', encoding='utf-8') as f:
    lines = f.readlines()
X = vectorizer.fit_transform(lines)

In [148]:
X

<11039x128513 sparse matrix of type '<class 'numpy.int64'>'
	with 3026265 stored elements in Compressed Sparse Row format>

In [149]:
X.shape

(11039, 128513)
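
To make the shape above easier to interpret, here is a minimal sketch of the same bag-of-words step on a toy corpus. The two sentences and the names `toy_docs`, `toy_vectorizer`, and `toy_X` are made up for illustration and are not part of raw_data.txt.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, standing in for the lines of raw_data.txt.
toy_docs = [
    "the actor starred in the movie",
    "the mathematician proved the theorem",
]

toy_vectorizer = CountVectorizer()
toy_X = toy_vectorizer.fit_transform(toy_docs)  # sparse matrix, one row per document

# vocabulary_ maps each token to its column index in toy_X.
the_col = toy_vectorizer.vocabulary_["the"]
actor_col = toy_vectorizer.vocabulary_["actor"]

counts = toy_X.toarray()  # dense copy is fine only because the toy matrix is tiny
print(toy_X.shape)
print(counts[:, the_col], counts[:, actor_col])
```

Rows correspond to documents and columns to vocabulary terms, so the 128513 columns of `X` are the distinct tokens found across all articles.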

## Graph edges - hyperlinks between articles

The rows of graph.csv contain a pair of article numbers, the first number indicating an article with a hyperlink to the article associated with the second number.

In [152]:
df_graph = pd.read_csv('./data/graph.csv', sep=' ', header=None,
                       names=['link_in', 'link_to']).astype(int)

In [153]:
df_graph.head()

   link_in  link_to
0     5172     3360
1     5172     2636
2     5172     1059
3     5172     4689
4     6758     2211


The following cell counts the unique entries in each column. Both counts equal the total number of articles (11039), so every article contains at least one hyperlink and is also the target of at least one hyperlink.

In [161]:
df_graph['link_in'].unique().shape, df_graph['link_to'].unique().shape

((11039,), (11039,))
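
The same check can be written more explicitly. The sketch below uses a hypothetical three-article edge list (`toy_links`) and assumes article numbers run from 0 to n-1; it lists any articles that never appear as a source or as a target.

```python
import pandas as pd

# Hypothetical toy edge list: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0
toy_links = pd.DataFrame({"link_in": [0, 0, 1, 2], "link_to": [1, 2, 2, 0]})
all_articles = set(range(3))  # assumption: ids are 0 .. n-1

n_sources = toy_links["link_in"].nunique()  # articles with an outgoing link
n_targets = toy_links["link_to"].nunique()  # articles with an incoming link

# Articles missing from either column would show up here.
no_outgoing = all_articles - set(toy_links["link_in"])
no_incoming = all_articles - set(toy_links["link_to"])
print(n_sources, n_targets, no_outgoing, no_incoming)
```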

We can construct an adjacency matrix from this data. This will be an $11039\times11039$ matrix in which element $(i,j)$ is 1 if article $i$ contains a hyperlink to article $j$ and 0 otherwise. Because hyperlinks are directed, the matrix is not necessarily symmetric.

We will construct this as a `coo_matrix`, with data provided in `(data, (i, j))` tuple format. For `data`, we just use the value `1` for each entry to indicate a connection between indices `i` and `j` specified in the two columns of the dataframe.

In [175]:
ones = np.ones(len(df_graph['link_in']))
graph = sp.coo_matrix((ones, (df_graph['link_in'], df_graph['link_to'])))
graph

<11039x11039 sparse matrix of type '<class 'numpy.float64'>'
	with 174309 stored elements in COOrdinate format>
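
As a small illustration of the `(data, (i, j))` format, the sketch below builds the adjacency matrix of a hypothetical three-article graph. The names `toy_edges`, `toy_graph`, and `n_nodes` are assumptions for this example only; the explicit `shape` argument is optional and is passed here so the matrix size does not depend on the largest index present.

```python
import numpy as np
import pandas as pd
import scipy.sparse as sp

# Hypothetical toy edge list: 0 -> 1, 0 -> 2, 2 -> 0
toy_edges = pd.DataFrame({"link_in": [0, 0, 2], "link_to": [1, 2, 0]})
n_nodes = 3

toy_ones = np.ones(len(toy_edges))
toy_graph = sp.coo_matrix(
    (toy_ones, (toy_edges["link_in"].to_numpy(), toy_edges["link_to"].to_numpy())),
    shape=(n_nodes, n_nodes),
)

dense = toy_graph.toarray()      # dense copy only because the toy graph is tiny
out_degree = dense.sum(axis=1)   # row sums: links leaving each article
in_degree = dense.sum(axis=0)    # column sums: links arriving at each article
print(dense)
print(out_degree, in_degree)
```

Row `i` holds the links leaving article `i`, so `dense[0, 1]` is 1 while `dense[1, 0]` is 0 in this toy graph.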

# Dimensionality reduction

We first use truncated singular value decomposition (`TruncatedSVD`), which operates directly on sparse matrices, to reduce the bag-of-words features from 128513 to 1000 dimensions.

In [None]:
svd = TruncatedSVD(n_components=1000, random_state=0)
X_svd = svd.fit_transform(X)
X_svd.shape
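
One way to judge how much information a given number of components keeps is to sum the `explained_variance_ratio_` attribute of the fitted model. The sketch below does this on a small random sparse matrix; the synthetic data, its size, and `n_components=10` are arbitrary choices for illustration, not values from this notebook.

```python
import numpy as np
import scipy.sparse as sp
from sklearn.decomposition import TruncatedSVD

# Hypothetical stand-in for a sparse count matrix: 50 documents, 200 terms.
rng = np.random.default_rng(0)
toy_counts = sp.csr_matrix(rng.poisson(0.3, size=(50, 200)).astype(float))

toy_svd = TruncatedSVD(n_components=10, random_state=0)
toy_reduced = toy_svd.fit_transform(toy_counts)  # dense array, shape (50, 10)

# Fraction of the total variance captured by the retained components.
retained = toy_svd.explained_variance_ratio_.sum()
print(toy_reduced.shape, retained)
```

The same sum computed on `svd` after fitting would indicate whether 1000 components is a reasonable choice for `X`.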