## Investigating whether the tax topic can be divided into subtopics
The Taxes topic is large, which raises the concern that it may be a catch-all.

In [1]:
import pandas as pd
doj = pd.read_csv('topics.csv')

In [7]:
taxes = doj[doj['topicname']=='Taxes']

In [8]:
taxes.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1503 entries, 58 to 13061
Data columns (total 9 columns):
Unnamed: 0         1503 non-null int64
components         1503 non-null object
contents           1503 non-null object
date               1503 non-null object
title              1503 non-null object
topicnumber        1503 non-null int64
strengthoftopic    1503 non-null float64
year               1503 non-null int64
topicname          1503 non-null object
dtypes: float64(1), int64(3), object(5)
memory usage: 117.4+ KB


## Tokenizing and transforming contents of articles in tax subtopic only to create an NMF matrix

In [14]:
import re
import string

import numpy as np
import unidecode
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

def customtokenizer(article):
    # Strip punctuation and digits in a single pass
    drop = str.maketrans('', '', string.punctuation + string.digits)
    article_c = article.translate(drop)
    # Transliterate non-ASCII characters to ASCII, then lowercase
    article_c = unidecode.unidecode(article_c).lower()
    # Keep tokens of two or more word characters
    tokens = re.findall(r'(?u)\b\w\w+\b', article_c)
    # Remove English stop words (a set makes membership checks fast)
    stop_words = set(stopwords.words('english'))
    tokens = [t for t in tokens if t not in stop_words]
    # Reduce each remaining token to its stem
    stemmer = SnowballStemmer('english')
    return [stemmer.stem(t) for t in tokens]
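
To make the cleaning steps concrete, here is a minimal sketch that mirrors only the standard-library part of `customtokenizer`: punctuation and digit removal, lowercasing, and the two-character word filter. The helper name `clean_and_split` and the sample sentence are hypothetical. Transliteration, stop-word removal, and stemming are left out because they depend on `unidecode` and NLTK data.

```python
import re
import string


def clean_and_split(text):
    """Strip punctuation and digits, lowercase, keep words of 2+ characters."""
    drop = str.maketrans('', '', string.punctuation + string.digits)
    text = text.translate(drop).lower()
    return re.findall(r'(?u)\b\w\w+\b', text)


# Hypothetical sample sentence, not taken from the dataset
sample = "The I.R.S. filed 3 complaints in 2016; a U.S. court agreed."
tokens = clean_and_split(sample)
print(tokens)
```

Because punctuation is deleted rather than replaced with a space, abbreviations such as "I.R.S." and "U.S." collapse into the single tokens `irs` and `us`. The stemming step that follows is a likely source of stems such as `ir` in the topic listings.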

In [19]:
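# customtokenizer already removes stop words before stemming; stop_words here
# additionally drops any stem that happens to equal a stop word
vectorizer = TfidfVectorizer(tokenizer=customtokenizer, stop_words=stopwords.words('english'))

X = vectorizer.fit_transform(taxes['contents'])

# get_feature_names() was removed in scikit-learn 1.2; on versions older
# than 1.0, call vectorizer.get_feature_names() instead
idx_to_word = np.array(vectorizer.get_feature_names_out())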

In [24]:
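nmf = NMF(n_components=5, solver="mu")

# W holds document-topic weights, H holds topic-term weights
W = nmf.fit_transform(X)
H = nmf.components_

# print the 20 highest-weighted terms per topic
# (argsort is ascending, so the strongest term is listed last)
for i, topic in enumerate(H):
    print("Topic {}: {}".format(i + 1, ",".join([str(x) for x in idx_to_word[topic.argsort()[-20:]]])))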

Topic 1: deduct,ir,one,perman,enjoin,court,incom,justic,depart,claim,busi,feder,credit,alleg,injunct,complaint,custom,return,prepar,tax
Topic 2: incom,special,trial,agent,addit,district,fals,year,file,ciraolo,prison,deputi,act,general,sentenc,us,divis,investig,assist,attorney
Topic 3: time,million,respons,depart,cash,busi,wage,quarter,court,account,collect,payrol,withheld,employe,compani,ir,fail,pay,employ,tax
Topic 4: defend,year,beyond,mere,incom,obstruct,file,juri,maximum,alleg,grand,proven,innoc,presum,convict,ir,fals,charg,count,indict
Topic 5: court,jenkin,bogus,larg,frivol,marti,issu,promot,fals,million,withhold,request,base,redempt,fraudul,claim,oid,scheme,form,refund
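
One way to see how the articles spread across these topics is to assign each article to its highest-weighted topic and count the assignments. The sketch below uses a small made-up matrix, `W_demo`, in place of the real `W`; the values are hypothetical and only illustrate the mechanics.

```python
import numpy as np

# Hypothetical document-topic weights standing in for W (4 documents, 3 topics)
W_demo = np.array([
    [0.9, 0.1, 0.0],
    [0.2, 0.7, 0.1],
    [0.8, 0.0, 0.3],
    [0.1, 0.2, 0.6],
])

# Index of the strongest topic for each document
dominant = W_demo.argmax(axis=1)

# Number of documents assigned to each topic
counts = np.bincount(dominant, minlength=W_demo.shape[1])
print(counts)
```

Applied to the real `W`, a heavily lopsided count would suggest one dominant theme, while a more even split would hint at separable subtopics.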


#### It is difficult to tell whether these five topics are genuinely distinct or merely artifacts of the factorization. Topics 1 and 3 both rank tax and ir among their top terms, and stems such as fals, file, and incom recur across several others, so the split may simply reflect that the Department of Justice investigates and prosecutes a large number of tax crimes. Without a clearer basis for separating them, we will keep Taxes as a single topic.
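
A possible way to quantify the overlap, offered as a sketch and not as something this notebook ran, is the cosine similarity between topic-term vectors (the rows of `H`). The matrix `H_demo` below is hypothetical and stands in for the real `H`.

```python
import numpy as np

# Hypothetical topic-term weights standing in for H (3 topics, 4 terms)
H_demo = np.array([
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 3.0, 4.0, 0.0],
    [0.0, 4.0, 3.0, 0.0],
])

# Scale each topic vector to unit length, then take pairwise dot products.
# Assumes no topic row is all zeros, which would cause a division by zero.
unit = H_demo / np.linalg.norm(H_demo, axis=1, keepdims=True)
similarity = unit @ unit.T
print(np.round(similarity, 2))
```

Values near 1 off the diagonal would indicate topics that weight nearly the same vocabulary, and values near 0 would indicate topics with little shared vocabulary. In this toy matrix the second and third topics overlap heavily while the first is separate.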