In [49]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [139]:
import warnings
warnings.filterwarnings('ignore')
from collections import Counter
import numpy as np
import pandas as pd
import re
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.model_selection import train_test_split

import enchant
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import NearMiss
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem.snowball import SnowballStemmer
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import stopwords

In [237]:
df = pd.read_csv('data/articles.csv', index_col='index')
df.date = pd.to_datetime(df.date)

In [121]:
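# row positions of articles whose tag string contains "art," as a whole tag
# (at the start of the string or after a space)
art = []
for ix, tag in enumerate(df.tags):
    if re.match(".* art,.*", tag) or re.match("art,.*", tag):
        art.append(ix)

# row positions of articles whose tag string contains "science,"
science = []
for ix, tag in enumerate(df.tags):
    if re.match(".*science,.*", tag):
        science.append(ix)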
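
The regexes above need a trailing comma, so they skip a tag that comes last in the string, and the science pattern would also match a longer tag such as "neuroscience". A minimal sketch of an exact-match alternative, assuming the tags column holds comma-separated strings (the `has_tag` helper and the toy tag strings are made up for illustration, and the resulting counts would differ from the ones in this notebook):

```python
def has_tag(tag_string, wanted):
    """Return True if `wanted` is exactly one of the comma-separated tags."""
    tags = {t.strip().lower() for t in tag_string.split(",")}
    return wanted.lower() in tags

# Toy tag strings (hypothetical, not taken from the dataset)
examples = ["art, books, design", "neuroscience, psychology", "history, science"]

art_hits = [ix for ix, tags in enumerate(examples) if has_tag(tags, "art")]
science_hits = [ix for ix, tags in enumerate(examples) if has_tag(tags, "science")]
print(art_hits, science_hits)
```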

BrainPickings.org has 5000+ articles containing 5 million words. The author is interested in the evolving themes in her work. While this can be visualized with a stacked bar chart (years on x, count on y, stacked by tag), I want to turn this into an NLP problem involving vectorization of the articles and then classification. The classification can be done in two ways: by time and by tags. Time would predict Early (written before 2014) or Late (written 2014-present), while tags would divide articles by popular tags and then try to guess which tag the article is filed under. The tags should be distinct, like Science and Art, two popular ones. In order to use rarer tags, you will have to undersample the more popular class (or oversample the rarer one).
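
As a rough sketch of the time target described above, the Early/Late label can be derived from the publication year. The `era_label` helper and the example dates are illustrative assumptions, not part of the notebook:

```python
from datetime import date

CUTOFF_YEAR = 2014  # Late means 2014 to present, per the description above

def era_label(published):
    """Return 1 for Late (2014 or after), 0 for Early (before 2014)."""
    return int(published.year >= CUTOFF_YEAR)

# Made-up dates for illustration
dates = [date(2012, 5, 1), date(2013, 12, 31), date(2014, 1, 1), date(2018, 7, 9)]
labels = [era_label(d) for d in dates]
print(labels)
```

With the `df` loaded above, the same rule could probably be applied to the whole column as `(df.date.dt.year >= 2014).astype(int)`.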

Classifying articles into one of many tags is a multi-class problem (or a multi-label one if an article can carry several tags at once). You'll want to calculate probabilities for each tag for every document. When making the model, think about the target and how to best transform the "natural language" into a vector that captures the essence of the target. Suppose you want to classify an article as being about Art or Science. A basic model would be to look for the words "art" and "science" in the article and return a vector of the counts. So if an article says "art" once and "science" three times, the vector is [1,3]. You want the simplest representation of the article that gives the clearest signal of what the classification should be.
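
The basic keyword-count model from the paragraph above can be sketched in a few lines. The `keyword_counts` helper and the sample sentence are made up for illustration:

```python
import re
from collections import Counter

def keyword_counts(text, keywords=("art", "science")):
    """Count whole-word occurrences of each keyword, in keyword order."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    return [counts[k] for k in keywords]

# Made-up sentence: "art" appears once, "science" three times
doc = "Science meets art: the science of seeing, and why science matters."
vector = keyword_counts(doc)
print(vector)  # [1, 3]
```

Matching on whole tokens keeps words like "article" or "party" from being counted as "art".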

If the classification is about whether the author wrote the article early in their career or later, then you'll want to represent the article as a different vector. Maybe the inexperienced writer used a lot of curse words or said "like" too much. Maybe the stronger writer writes longer sentences or uses a wider vocabulary. In this case, using too small a vocabulary will not be a good idea. You would expect the vectors for early articles to be sparser and to lean more on common words.
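
A minimal sketch of two such style signals, mean sentence length and vocabulary richness (unique words divided by total words). The `style_features` helper and both sample texts are invented for illustration, and the sentence splitting is deliberately crude:

```python
import re

def style_features(text):
    """Return [mean words per sentence, unique words / total words]."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[a-z']+", text.lower())
    if not sentences or not words:
        return [0.0, 0.0]
    mean_sentence_len = len(words) / len(sentences)
    type_token_ratio = len(set(words)) / len(words)
    return [mean_sentence_len, type_token_ratio]

# Made-up examples of a repetitive style and a more varied one
early = "I like it. I like it a lot. It is, like, good."
late = "The essay unfolds slowly, gathering disparate observations into one argument."

early_features = style_features(early)
late_features = style_features(late)
print(early_features, late_features)
```

The ratio of unique to total words falls as documents get longer, so it is only comparable across articles of similar length.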

In choosing the right vector to capture the right signal, you'll want to think about the particular problem. Do you need rare words, do you need to filter out words that occur in most articles, do you need stems or lemmas, do you need punctuation, do you want a few stop words or many? Be careful as well that you don't make a useless model. Suppose you left the date in the article and the model just used that to predict when it was written. Or maybe the author always signed off with a phrase and then stopped doing it, so the model just sees that.
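
One way to guard against the leaks described above is to mask them before vectorizing. This is only a sketch: the sign-off phrase, the `strip_leaks` helper, and the placeholder token are hypothetical, and masking every four-digit year also hides years that are legitimate content.

```python
import re

# Hypothetical sign-off phrase, used only for illustration
SIGN_OFF = "until next time"

def strip_leaks(text, sign_off=SIGN_OFF):
    """Mask years and drop a recurring sign-off so the model cannot key on them."""
    text = text.lower()
    # Replace four-digit years (1900-2099) with a placeholder token
    text = re.sub(r"\b(?:19|20)\d{2}\b", "yeartoken", text)
    # Drop the recurring sign-off phrase
    text = text.replace(sign_off, " ")
    # Collapse runs of whitespace left behind by the replacements
    return " ".join(text.split())

cleaned = strip_leaks("Published in 2012. A lovely book. Until next time")
print(cleaned)
```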

Topic modeling is the process of discovering salient features in a corpus in an unsupervised way.

In [172]:
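# .copy() avoids pandas' SettingWithCopyWarning when adding the label column
art_df = df.iloc[art].copy()
art_df["art"] = 1  # positive class: art
sci_df = df.iloc[science].copy()
sci_df["art"] = 0  # negative class: science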

In [179]:
df2 = pd.concat([art_df, sci_df])
# ~250 articles are tagged both art and science, so they appear twice;
# drop every copy of them so the two classes do not overlap
df2 = df2[~df2.index.duplicated(keep=False)]
df2 = df2.reset_index(drop=True)

In [191]:
#the corpus...
documents = df2.title + df2.content

In [193]:
# cleaning: lowercase, drop commas, replace remaining punctuation with spaces
documents = documents.str.lower()
documents = documents.str.replace(',', '', regex=False)
documents = documents.str.replace(r'[^\w\s]', ' ', regex=True)

In [218]:
documents[666]

'5  mostly  vintage children s books by iconic graphic designers   saul bass milton glaser paula scher bruno munari paul rand as a lover of  children s books  i have a particularly soft spot for little known gems by well known creators  after  two   rounds  of excavating obscure children s books by famous authors of literature for grown ups and icons of the  art   world  here are five wonderful vintage children s books by some of history s most celebrated graphic designers  henri s walk to paris by saul bass saul bass   1920 1996  is commonly considered the  greatest graphic designer  of all time responsible for some of the most timeless logos and most memorable  film title sequences  of the twentieth century  in 1962 bass collaborated with former librarian  leonore klein  on his only children s book which spent decades as a prized out of print collector s item  this year half a century later rizzoli reprinted  henri s walk to paris    public library     an absolute gem like only bass 

In [247]:
TFIDF = TfidfVectorizer(strip_accents="unicode",
                             min_df=3,
                             max_df=.96,
                             stop_words='english') #ngram_range=(1,2)
X = TFIDF.fit_transform(documents)
X

<2766x25612 sparse matrix of type '<class 'numpy.float64'>'
	with 796009 stored elements in Compressed Sparse Row format>

In [248]:
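# vocabulary_ maps term -> column index; its keys come back in insertion order,
# so this array is not aligned with the columns of X
vocabulary = np.array(list(TFIDF.vocabulary_.keys()))
len(vocabulary)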

25612

In [249]:
vocabulary

array(['mmm', 'invisible', 'furniture', ..., 'lisel', 'mueller',
       'conservancy'], dtype='<U19')

In [228]:
X.todense()[9]

matrix([[0.02513346, 0.        , 0.03838287, ..., 0.        , 0.        ,
         0.        ]])

#### Modeling

In [198]:
y = df2.art
y.value_counts()

1    1827
0     939
Name: art, dtype: int64

In [233]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    stratify=y, random_state=36)

In [234]:
X_train.shape

(2212, 14210)

In [235]:
# oversample the minority (science) class, on the training split only
X_train, y_train = SMOTE().fit_resample(X_train, y_train.to_numpy())

In [236]:
X_train.shape

(2922, 14210)