In [1]:
import spacy
import nltk
import pandas as pd

In [2]:
from collections import Counter

In [73]:
data = pd.read_csv(r"D:\Data\wikipedia-ml\wikipedia_machine_learning.csv",sep='\t')

The data has one article per row. Each row comes back as a Series with 3 elements: 0: title, 1: url, 2: body text.

In [74]:
d0 = data.loc[0]

In [75]:
type(d0)

pandas.core.series.Series

In [76]:
d0[0]

'Outline of computer vision'

In [77]:
d0[1]

'https://en.wikipedia.org/wiki/Outline_of_computer_vision'

In [78]:
d0[2][:100]

'The following outline is provided as an overview of and topical guide to computer vision: Computer v'

In [79]:
data.shape

(7318, 3)

In [80]:
articles = data.apply(lambda x: x[2], axis=1)

In [81]:
titles = data.apply(lambda x: x[0], axis=1)

In [82]:
urls = data.apply(lambda x: x[1], axis=1)

In [83]:
df = pd.concat([titles,urls,articles],axis=1)

In [84]:
df.columns = ['title', 'url','original_article']
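The three `apply` calls plus `concat` can probably be collapsed into a single positional selection. This is a minimal sketch on a made-up two-row frame; `raw` and `df_sketch` are hypothetical names, and the second URL is a placeholder, not a row from the real file.

```python
import pandas as pd

# Hypothetical stand-in for the three-column file (title, url, body text).
raw = pd.DataFrame([
    ["Outline of computer vision",
     "https://en.wikipedia.org/wiki/Outline_of_computer_vision",
     "The following outline is provided as an overview ..."],
    ["Some other article", "https://example.org/other", "Body text ..."],
])

# Select the columns by position, then name them in one step.
# .copy() gives an independent frame, so adding columns later
# should not trigger a SettingWithCopyWarning.
df_sketch = raw.iloc[:, [0, 1, 2]].copy()
df_sketch.columns = ["title", "url", "original_article"]

print(df_sketch.columns.tolist())
```

Selecting with `iloc` avoids the row-wise lambdas, which are slow on larger frames and rely on positional indexing into each row.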

The next stage is to apply typical NLP preprocessing steps before looking at which words are used most often:
- lowercase
- remove stopwords

In [85]:
list(set([type(a) for a in df.original_article]))

[float, str]

In [86]:
float_cols = df[df['original_article'].apply(lambda x: isinstance(x, float))]

In [87]:
float_cols.head()

Unnamed: 0,title,url,original_article
100,Category:Robotics suites,https://en.wikipedia.org/wiki/Category:Robotics_suites,
5449,Category:Search algorithms,https://en.wikipedia.org/wiki/Category:Search_algorithms,


In [88]:
df['original_article'].isna().sum()

2

There aren't many nulls: 2 out of 7318 rows, which is about 0.03%. With a proportion that small, I'm just going to drop them.
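Checking the proportion directly is a one-liner, since the mean of a boolean mask is the fraction of True values. A small sketch on a toy frame (the `toy` data is invented for illustration):

```python
import pandas as pd

# Toy frame with one missing article out of four rows.
toy = pd.DataFrame({
    "title": ["a", "b", "c", "d"],
    "original_article": ["some text", None, "more text", "x"],
})

# isna() gives a boolean mask; its mean is the share of missing values.
null_fraction = toy["original_article"].isna().mean()
print(f"{null_fraction:.2%} of articles are missing")

# Drop only the rows where the article body itself is missing.
cleaned = toy.dropna(subset=["original_article"])
print(cleaned.shape)
```

Using `subset=` limits the drop to the column that matters, whereas `how='any'` also drops rows with a missing title or url.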

In [89]:
df = df.dropna(how='any')
df.shape

(7316, 3)

In [90]:
articles = df['original_article'].apply(lambda s: s.lower())

I need to create a dataframe with one column per word, where each row holds the counts of those words in one article
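One way to get that shape, sketched on two tiny token lists (`docs` is made up), is to build a `Counter` per article and let pandas align the keys into columns:

```python
from collections import Counter

import pandas as pd

# Hypothetical tokenized articles.
docs = [
    ["computer", "vision", "computer"],
    ["language", "vision"],
]

# A Counter is a dict, so a list of Counters becomes one row per article
# and one column per distinct word. Words absent from an article are NaN,
# which we turn into 0 counts.
counts = pd.DataFrame([Counter(d) for d in docs]).fillna(0).astype(int)
print(counts)
```

On the full corpus this dense frame could get very wide, so a sparse representation may be the more practical choice there.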

In [91]:
from nltk.corpus import stopwords

In [92]:
articles = articles.apply(lambda row: nltk.word_tokenize(row))

In [93]:
articles.head()

0    [the, following, outline, is, provided, as, an, overview, of, and, topical, guide, to, computer, vision, :, computer, vision, –, interdisciplinary, field, that, deals, with, how, computers, can, b...
1    [the, following, outline, is, provided, as, an, overview, of, and, topical, guide, to, natural, language, processing, :, natural, language, processing, –, computer, activity, in, which, computers,...
2    [the, following, outline, is, provided, as, an, overview, of, and, topical, guide, to, robotics, :, robotics, is, a, branch, of, mechanical, engineering, ,, electrical, engineering, and, computer,...
3    [the, accuracy, paradox, is, the, paradoxical, finding, that, accuracy, is, not, a, good, metric, for, predictive, models, when, classifying, in, predictive, analytics, ., this, is, because, a, si...
4    [action, model, learning, (, sometimes, abbreviated, action, learning, ), is, an, area, of, machine, learning, concerned, with, creation, and, modification, of, software, agen

In [94]:
stop_words = set(stopwords.words('english')) 

I was conservative about which punctuation to remove, because some of it might be important, especially ? and ==

In [95]:
punctuation = "()'',.:"

I need to remove punctuation as well as stopwords

In [96]:
def remove_unwanted(tokens):
    # Keep tokens that are neither stopwords nor listed punctuation
    return [w for w in tokens if w not in stop_words and w not in punctuation]
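One thing worth knowing about `w not in punctuation` when `punctuation` is a string: `in` does a substring test, so any token that happens to be a substring of it (including the empty string) is removed too. A quick sketch comparing that with an explicit set; the token list and the `punctuation_set` name are made up for illustration:

```python
punctuation_str = "()'',.:"
# Explicit set of the tokens to drop, including the two-character
# closing-quote token '' that appears in the string above.
punctuation_set = {"(", ")", "'", "''", ",", ".", ":"}

tokens = ["vision", "()", ",.", ":", "==", "?", ""]

# Substring semantics: "()", ",." and "" are all substrings of the string.
kept_with_str = [w for w in tokens if w not in punctuation_str]

# Set semantics: only exact matches are removed.
kept_with_set = [w for w in tokens if w not in punctuation_set]

print(kept_with_str)  # ['vision', '==', '?']
print(kept_with_set)  # ['vision', '()', ',.', '==', '?', '']
```

For single-character tokens the two behave the same, so the string version is probably fine here, but the set makes the intent explicit.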

In [97]:
punctuation

"()'',.:"

In [98]:
a = articles.loc[0]

In [99]:
articles = articles.apply(lambda x: [w for w in x if w not in stop_words and w not in punctuation])

In [100]:
print(articles.shape)
print(df.shape)

(7316,)
(7316, 3)


In [53]:
c = Counter(a)

In [55]:
df['orig_']


In [46]:
#df.drop('article',axis=1)
df['article']=articles

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  


next: sort columns by count frequency

next make a dataframe with the most frequent words
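A rough sketch of both "next" steps on toy data: accumulate one corpus-wide `Counter`, then turn its most common entries into a dataframe that is already sorted by frequency. `token_lists` is a made-up stand-in for the tokenized articles.

```python
from collections import Counter

import pandas as pd

# Hypothetical tokenized articles.
token_lists = [
    ["computer", "vision", "image"],
    ["image", "processing", "image"],
]

# Accumulate counts over every article.
totals = Counter()
for tokens in token_lists:
    totals.update(tokens)

# most_common(n) returns (word, count) pairs sorted by descending count.
top_words = pd.DataFrame(totals.most_common(2), columns=["word", "count"])
print(top_words)
```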

In [None]:
Something like this https://www.aclweb.org/anthology/W15-1526.pdf would be good to try - supposedly outperforms LDA, but

In [47]:
df['length'] = articles.apply(len)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.


In [48]:
df['length'].describe()

count     7316.000000
mean      1324.216785
std       1505.295536
min          4.000000
25%        346.000000
50%        808.000000
75%       1711.000000
max      17054.000000
Name: length, dtype: float64

Given this spread (the shortest article has only 4 tokens), discarding articles with fewer than some threshold of non-stopword tokens should help when training an algorithm
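A sketch of that filter on a toy frame. `MIN_TOKENS` is a hypothetical cutoff chosen only for this example; a sensible value for the real data would need to be picked by looking at the short articles.

```python
import pandas as pd

# Toy frame: one very short article, one longer one.
toy = pd.DataFrame({
    "title": ["stub", "full"],
    "article": [["robotics"], ["computer", "vision", "image", "analysis"]],
})
toy["length"] = toy["article"].apply(len)

# Hypothetical threshold, not taken from the notebook.
MIN_TOKENS = 3

# Boolean mask keeps rows at or above the threshold; .copy() makes the
# result independent of the original frame.
kept = toy[toy["length"] >= MIN_TOKENS].copy()
print(kept["title"].tolist())
```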

In [56]:
df['original_article'][0]

"The following outline is provided as an overview of and topical guide to computer vision: Computer vision – interdisciplinary field that deals with how computers can be made to gain high-level understanding from digital images or videos. From the perspective of engineering, it seeks to automate tasks that the human visual system can do. Computer vision tasks include methods for acquiring digital images(through image sensors), image processing, and image analysis, to reach an understanding of digital images. In general, it deals with the extraction of high-dimensional data from the real world in order to produce numerical or symbolic information that the computer can interpret. The image data can take many forms, such as video sequences, views from multiple cameras, or multi-dimensional data from a medical scanner. As a technological discipline, computer vision seeks to apply its theories and models for the construction of computer vision systems. As a scientific discipline, computer v

In [57]:
df.article[0]

['following',
 'outline',
 'provided',
 'overview',
 'topical',
 'guide',
 'computer',
 'vision',
 'computer',
 'vision',
 '–',
 'interdisciplinary',
 'field',
 'deals',
 'computers',
 'made',
 'gain',
 'high-level',
 'understanding',
 'digital',
 'images',
 'videos',
 'perspective',
 'engineering',
 'seeks',
 'automate',
 'tasks',
 'human',
 'visual',
 'system',
 'computer',
 'vision',
 'tasks',
 'include',
 'methods',
 'acquiring',
 'digital',
 'images',
 'image',
 'sensors',
 'image',
 'processing',
 'image',
 'analysis',
 'reach',
 'understanding',
 'digital',
 'images',
 'general',
 'deals',
 'extraction',
 'high-dimensional',
 'data',
 'real',
 'world',
 'order',
 'produce',
 'numerical',
 'symbolic',
 'information',
 'computer',
 'interpret',
 'image',
 'data',
 'take',
 'many',
 'forms',
 'video',
 'sequences',
 'views',
 'multiple',
 'cameras',
 'multi-dimensional',
 'data',
 'medical',
 'scanner',
 'technological',
 'discipline',
 'computer',
 'vision',
 'seeks',
 'apply',
 '

# Following along LSA tutorial

In [58]:
import matplotlib.pyplot as plt
import seaborn as sns
pd.set_option("display.max_colwidth", 200)

In [62]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

Stopwords have been removed and the text has been tokenized

In [102]:
def make_topics(articles, num_topics):
    # Re-join each token list into one string, since TfidfVectorizer expects raw text
    articles = articles.apply(lambda x: ' '.join(x))

    vectorizer = TfidfVectorizer(stop_words='english',
                                 max_features=1000,
                                 max_df=0.5,
                                 smooth_idf=True)

    X = vectorizer.fit_transform(articles)

    print("shape: ", X.shape)

    # SVD represents documents and terms as vectors
    svd_model = TruncatedSVD(n_components=num_topics, algorithm='randomized', n_iter=100, random_state=122)

    svd_model.fit(X)

    # get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
    terms = list(vectorizer.get_feature_names_out())

    # Print the 7 highest-weighted terms of each component
    for i, comp in enumerate(svd_model.components_):
        terms_comp = zip(terms, comp)
        sorted_terms = sorted(terms_comp, key=lambda x: x[1], reverse=True)[:7]
        print("Topic "+str(i)+": ")
        for t in sorted_terms:
            print(t[0])
            print(" ")

    # Return the fitted model and vocabulary so callers can unpack them
    return svd_model, terms
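Each row of `components_` holds one weight per vocabulary term, so the "topic" printout is just the highest-weighted terms of each row. A small NumPy sketch of that lookup; the `terms` list, the weights, and the `top_terms` helper are all invented for illustration and are not output from the model above.

```python
import numpy as np

# Hypothetical vocabulary and component weights (2 topics x 4 terms).
terms = ["image", "language", "neural", "search"]
components = np.array([
    [0.1, 0.7, 0.2, 0.0],
    [0.6, -0.1, 0.1, 0.3],
])


def top_terms(components, terms, n=2):
    """Return the n highest-weighted terms for each component row."""
    result = []
    for row in components:
        # argsort is ascending, so reverse it and take the first n indices.
        idx = np.argsort(row)[::-1][:n]
        result.append([terms[i] for i in idx])
    return result


print(top_terms(components, terms))
```

Returning the terms as lists, instead of printing them, makes it easier to compare runs with different numbers of topics.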

In [104]:
make_topics(articles,20)

shape:  (7316, 1000)
Topic 0: 
model
 
systems
 
language
 
information
 
learning
 
computer
 
research
 
Topic 1: 
x_
 
mathbf
 
frac
 
distribution
 
left
 
right
 
probability
 
Topic 2: 
language
 
languages
 
english
 
word
 
words
 
meaning
 
text
 
Topic 3: 
language
 
languages
 
programming
 
software
 
algorithm
 
code
 
image
 
Topic 4: 
learning
 
neural
 
machine
 
language
 
intelligence
 
artificial
 
algorithm
 
Topic 5: 
mathbf
 
x_
 
intelligence
 
ai
 
research
 
learning
 
frac
 
Topic 6: 
mathbf
 
image
 
images
 
brain
 
space
 
signal
 
video
 
Topic 7: 
mathbf
 
search
 
algorithm
 
optimization
 
problem
 
algorithms
 
analysis
 
Topic 8: 
search
 
algorithm
 
x_
 
optimization
 
algorithms
 
problem
 
game
 
Topic 9: 
programming
 
systems
 
design
 
theory
 
logic
 
computer
 
control
 
Topic 10: 
neural
 
network
 
brain
 
neurons
 
networks
 
neuron
 
x_
 
Topic 11: 
game
 
games
 
player
 
probability
 
video
 
ai
 
music
 
Topic 12: 
x_
 
genetic
 
y_
 


In [108]:
import topicgetter

In [109]:
model, terms = topicgetter.make_topics(articles, 20)


TypeError: cannot unpack non-iterable NoneType object

In [116]:
type(terms)

list

In [118]:
len(terms)

1000

In [120]:
len(model.components_)

20