**This notebook reads a CSV of tags assigned in Overview (overviewdocs.com) and tries to generalize them using machine learning, producing a new CSV for import back into Overview.**

Overview is an open source document mining tool that does OCR, search, and visualization for document sets of up to millions of documents.

To use this script:

- Manually tag a bunch of documents in Overview. Ensure that each document you review gets exactly one tag (could be "None" or "Other") and that all documents you didn't review have no tags. 
- Then export as CSV in "all tags in one column" format. 
- Copy the CSV into the same directory as this notebook and name it `overview-tags-in.csv`.
- Run this notebook. It will write `overview-tags-out.csv`.
- Import this CSV into Overview to create a new document set with computer-assigned tags. (Unfortunately, there is currently no way to merge tags into an existing document set.)
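
Before running the notebook, it can help to confirm that the export follows the tagging rules above. The following is a minimal sketch, not part of the original notebook: it uses a small in-memory stand-in for `overview-tags-in.csv`, and it assumes that multiple tags in the single tags column are separated by commas.

```python
import io
import pandas as pd

# Hypothetical stand-in for overview-tags-in.csv; a real export has more columns.
sample_csv = io.StringIO(
    "id,text,tags\n"
    "1,First press release,Healthcare\n"
    "2,Second press release,\n"
    '3,Third press release,"Iran,Other"\n'
)
sample = pd.read_csv(sample_csv)

# Reviewed documents are the ones with a non-empty tags cell
reviewed = sample[sample.tags.notnull()]

# Assumption: several tags in one cell are comma-separated
multi_tagged = reviewed[reviewed.tags.str.contains(",")]

print(len(reviewed), "reviewed,", len(sample) - len(reviewed), "untagged")
print("documents with more than one tag:", multi_tagged.id.tolist())
```

Any document listed in the last line would need to be reduced to a single tag in Overview before exporting again.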



In [1]:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

In [2]:
# Load document text and tags 
df = pd.read_csv('overview-tags-in.csv')
df.head()

Unnamed: 0,id,title,text,url,tags
0,,,Menendez Statement on Black History Month\n ...,http://menendez.senate.gov/newsroom/press/rele...,
1,,,Menendez Praises Susan G. Komen For Reversing ...,http://menendez.senate.gov/newsroom/press/rele...,
2,,,Menendez Applauds Dentists’ Pro-Bono Work For ...,http://menendez.senate.gov/newsroom/press/rele...,Healthcare
3,,,Senator Menendez Applauds Passage of STOCK Act...,http://menendez.senate.gov/newsroom/press/rele...,Other
4,,,Menendez Hails Banking Committee Passage of Ir...,http://menendez.senate.gov/newsroom/press/rele...,Iran


In [3]:
# Create document vectors from the text
vectorizer = CountVectorizer(stop_words='english', min_df=2) # keep only words that appear in at least 2 docs
matrix = vectorizer.fit_transform(df.text)
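
To see what `stop_words='english'` and `min_df=2` do, here is a small sketch on three made-up sentences (the sentences and the `toy_` names are illustrative only). Stop words such as "the" and "for" are dropped, and so is any word that appears in only one document.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Three made-up documents for illustration
docs = [
    "senator praises housing grants",
    "senator opposes the iran sanctions",
    "housing grants for the families",
]

toy_vectorizer = CountVectorizer(stop_words='english', min_df=2)
toy_matrix = toy_vectorizer.fit_transform(docs)

# vocabulary_ maps each kept word to its column; columns are in alphabetical order
vocab = sorted(toy_vectorizer.vocabulary_)
print(vocab)
print(toy_matrix.toarray())  # one row per document, one column per kept word
```

Only "grants", "housing", and "senator" appear in at least two documents, so they are the only columns left in the matrix.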

In [4]:
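# Put the document vectors in a dataframe, one column per word
# (scikit-learn versions older than 1.0 call this method get_feature_names)
vectors = pd.DataFrame(matrix.toarray(), columns=vectorizer.get_feature_names_out())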

In [5]:
# Add the tags into the vectors dataframe, so we can split into train/predict sets
df2 = pd.concat([df.tags, vectors], axis=1)

# Split into documents that have a tag (training data), and those that don't (data to predict)
train = df2[~pd.isnull(df.tags)]
predict = df2[pd.isnull(df.tags)]
len(train)

72

In [6]:
# build the model
x = train.iloc[:,1:].values
y = train.iloc[:,0].values
rf = RandomForestClassifier()
rf.fit(x,y)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)
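
The notebook does not measure how well the model generalizes. One possible check, sketched here on random toy data rather than the real document vectors, is cross-validation with `cross_val_score`. The `toy_` names, the fold count, `n_estimators=100`, and `random_state=0` are all assumptions chosen for illustration, not settings from the original.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Hypothetical toy data: 12 documents, 4 word-count features, 2 tags
rng = np.random.RandomState(0)
x_toy = rng.randint(0, 3, size=(12, 4))
y_toy = np.array(['Housing', 'Iran'] * 6)

# A fixed random_state makes the forest reproducible between runs
toy_rf = RandomForestClassifier(n_estimators=100, random_state=0)

# Train on two thirds of the labeled documents, score on the rest, three times
scores = cross_val_score(toy_rf, x_toy, y_toy, cv=3)
print(scores.mean())
```

With the real data, the same call would take `x` and `y` from the cell above. With only 72 labeled documents, and possibly very few examples of some tags, the estimate is likely to be noisy.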

In [7]:
# Actually predict new tags
x_predict = predict.iloc[:,1:].values
y_predict = rf.predict(x_predict)

In [8]:
# Merge the predicted tags back into the main dataframe.
# To do this we need to give them the index of the predict frame (basically, so they can remember their row numbers)
predicted_tags = pd.DataFrame(y_predict, index=predict.index, columns=['tags'])

# Prefix all computer-generated tag names with 'bot-'
predicted_tags = 'bot-' + predicted_tags

# Merge tags
df.update(predicted_tags)
df

Unnamed: 0,id,title,text,url,tags
0,,,Menendez Statement on Black History Month\n ...,http://menendez.senate.gov/newsroom/press/rele...,bot-Other
1,,,Menendez Praises Susan G. Komen For Reversing ...,http://menendez.senate.gov/newsroom/press/rele...,bot-Other
2,,,Menendez Applauds Dentists’ Pro-Bono Work For ...,http://menendez.senate.gov/newsroom/press/rele...,Healthcare
3,,,Senator Menendez Applauds Passage of STOCK Act...,http://menendez.senate.gov/newsroom/press/rele...,Other
4,,,Menendez Hails Banking Committee Passage of Ir...,http://menendez.senate.gov/newsroom/press/rele...,Iran
5,,,Transportation Subcommittee Chair Says Biparti...,http://menendez.senate.gov/newsroom/press/rele...,Transit
6,,,"Menendez, Lautenberg Announce More than $2 Mil...",http://menendez.senate.gov/newsroom/press/rele...,Fire Dept
7,,,Menendez Hosts NJ Small Business Leaders at Na...,http://menendez.senate.gov/newsroom/press/rele...,Jobs
8,,,Senator Menendez Slams Unfair Imprisonment of ...,http://menendez.senate.gov/newsroom/press/rele...,Other
9,,,Menendez Hails President Obama’s Plan to Help ...,http://menendez.senate.gov/newsroom/press/rele...,Housing
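
The merge works because `DataFrame.update` aligns rows by index and modifies the frame in place. A small sketch on a made-up four-row frame (all `toy` names and values are hypothetical) shows that manually assigned tags are left alone and only the untagged rows are filled.

```python
import numpy as np
import pandas as pd

# Made-up frame: rows 0 and 2 were tagged by hand, rows 1 and 3 were not
toy = pd.DataFrame({'tags': ['Iran', np.nan, 'Other', np.nan]})
unlabeled_index = toy.index[toy.tags.isnull()]

# Hypothetical predictions for the two untagged rows, keyed by their row numbers
toy_predicted = pd.DataFrame(['Other', 'Iran'], index=unlabeled_index, columns=['tags'])
toy_predicted = 'bot-' + toy_predicted

# update() writes each prediction into the row with the matching index
toy.update(toy_predicted)
print(toy.tags.tolist())
```

Because the predictions carry only the index labels of the untagged rows, the hand-assigned tags in rows 0 and 2 never change.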


In [9]:
# Output! index=False keeps the pandas row index out of the file
df.to_csv('overview-tags-out.csv', index=False)