# Crunchbase HW: Assign Categories to Articles

The goal of this project is to categorize articles accurately so that we can determine which ones are most relevant to Crunchbase's needs.

In [1]:
import pandas as pd

First, let's read in our data. The file is in JSON format with one record per line, so we can use the pandas built-in JSON reader with lines=True.

In [2]:
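# read_json already returns a DataFrame, so no extra conversion is needed
newsDF = pd.read_json('corpus.txt', lines=True)
# reverse the column order so that 'title' comes first
newsDF = newsDF[newsDF.columns[::-1]]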
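
To make the lines=True option concrete, here is a minimal sketch that reads two made-up JSON records from an in-memory string rather than from corpus.txt. The sample records and the names sample_lines and sample_df are illustrative assumptions, not part of the actual corpus.

```python
import io
import pandas as pd

# Two made-up records, one JSON object per line (hypothetical data)
sample_lines = (
    '{"title": "Article A", "text": "Some text"}\n'
    '{"title": "Article B", "text": "More text"}\n'
)

# lines=True tells pandas to parse each line as a separate JSON object
sample_df = pd.read_json(io.StringIO(sample_lines), lines=True)
print(sample_df.shape)
print(list(sample_df.columns))
```

Each line becomes one row, and each JSON key becomes a column.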

In [3]:
newsDF.head()

Unnamed: 0,title,thumbnailUrl,text,tags,siteName,pageUrl,humanLanguage,html,date,crawlName,author
0,There’s Reason to Be Skeptical of a Tesla-Powe...,{'string': 'http://www.slate.com/content/dam/s...,"Joe Raedle/Getty Images\nLast week, I wrote a ...","[{'score': 0.75, 'count': 10, 'label': 'Puerto...",{'string': 'Slate Magazine'},http://www.slate.com/articles/technology/techn...,{'string': 'en'},,1507313040000,{'string': 'slate_tech'},{'string': 'Eleanor Cummins'}
1,Police: Threats to West Des Moines students we...,{'string': 'https://www.gannett-cdn.com/-mm-/9...,Zach Boyden-Holmes/The Register\nPolice have i...,"[{'score': 0.65, 'count': 3, 'label': 'West De...",{'string': 'Des Moines Register'},http://www.desmoinesregister.com/story/news/cr...,{'string': 'en'},,1507248000000,{'string': 'demoines_register_business'},{'string': 'Kelly McGowan'}
2,Gunman reserved two rooms at Blackstone,{'string': 'http://www.trbimg.com/img-59d66185...,Chicago police are investigating whether Steph...,"[{'score': 0.38, 'count': 1, 'label': 'The Bla...",{'string': 'chicagotribune.com'},http://www.chicagotribune.com/news/local/break...,{'string': 'en'},,1507161600000,{'string': 'chicago_tribue_business'},{'string': 'Chicago Tribune'}
3,"No. 5 Georgia Bulldogs, Vanderbilt Commodores ...",{'string': 'https://cdnph.upi.com/svc/sv/upi/2...,"NASHVILLE, Tenn. -- No. 5 Georgia, coming off ...",[],{'string': 'UPI'},https://www.upi.com/Sports_News/College-Footba...,{'string': 'en'},,1507055340000,{'string': 'upi_business'},
4,Craig Robinson and Adam Scott buddy up in Fox'...,{'string': 'http://www.trbimg.com/img-59cea4c3...,,[],{'string': 'chicagotribune.com'},http://www.chicagotribune.com/entertainment/tv...,{'string': 'en'},,1506776400000,{'string': 'cb3_chicago_tribue_business'},


# Clean the Data

First, we need to identify which columns/features matter most to us. In my opinion, the 'text' column will be the most useful for the topic modeling process, so we will focus on that.

We start by importing the required nltk modules, then extend the stopword list so that we strip out words that do not help in classifying the topic.

In [4]:
import nltk
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
import string

In [5]:
stop = set(stopwords.words('english'))
exclude = set(string.punctuation)
lemma = WordNetLemmatizer()
# additional corpus-specific stopwords
stop.update(['string', 'said', 'would', 'you', 'year', 'one','also', 'ha', '–', 'and','like', 'get', 'andn', 'n','inn', 'ofn', 'ofnn', 'andnn', 'ann','thenn','tonn', 'u'])

def clean_text(doc):
    # lowercase the document and drop stopwords
    stop_remove = " ".join(i for i in doc.lower().split() if i not in stop)
    # strip ASCII punctuation characters
    punc_remove = "".join(ch for ch in stop_remove if ch not in exclude)
    # lemmatize each remaining word
    cleaned = " ".join(lemma.lemmatize(word) for word in punc_remove.split())
    return cleaned

In [6]:
# cast to str so that non-string entries do not break clean_text
cleaned = newsDF['text'].astype(str).apply(clean_text)

I run a second cleaning pass to remove stopwords that only surface after the first pass, for example words that originally had punctuation attached or that lemmatization reduced to a stopword form.

In [7]:
# second cleaning

new_cleaned = cleaned.apply(clean_text)
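
The following sketch shows why a second pass removes additional words. It uses a toy stopword set and skips lemmatization, so toy_stop and toy_clean are simplified stand-ins for the notebook's stop set and clean_text, not the real thing.

```python
import string

# Toy stopword set for illustration only (the notebook uses nltk's English list)
toy_stop = {"the", "said", "would", "it"}
punct = set(string.punctuation)

def toy_clean(doc):
    # Same order as clean_text: drop stopwords first, then strip punctuation
    kept = " ".join(w for w in doc.lower().split() if w not in toy_stop)
    return "".join(ch for ch in kept if ch not in punct)

first_pass = toy_clean("The CEO said, it would grow.")
second_pass = toy_clean(first_pass)
print(first_pass)   # 'said,' slipped through because of the trailing comma
print(second_pass)  # the now-bare 'said' is removed
```

Stripping punctuation before filtering stopwords would likely let a single pass do most of the same work.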

Now that the articles have been stripped of their stopwords, we can also drop the columns we don't need.

In [8]:
newsDF['cleaned_text'] = new_cleaned
# keep only the columns needed for topic modeling
newsDF_copy = newsDF.drop(columns=['thumbnailUrl', 'tags', 'siteName', 'pageUrl', 'humanLanguage', 'html', 'crawlName', 'author'])

newsDF_copy.head(10)

Unnamed: 0,title,text,date,cleaned_text
0,There’s Reason to Be Skeptical of a Tesla-Powe...,"Joe Raedle/Getty Images\nLast week, I wrote a ...",1507313040000,joe raedlegetty image last week wrote story sl...
1,Police: Threats to West Des Moines students we...,Zach Boyden-Holmes/The Register\nPolice have i...,1507248000000,zach boydenholmesthe register police identifie...
2,Gunman reserved two rooms at Blackstone,Chicago police are investigating whether Steph...,1507161600000,chicago police investigating whether stephen p...
3,"No. 5 Georgia Bulldogs, Vanderbilt Commodores ...","NASHVILLE, Tenn. -- No. 5 Georgia, coming off ...",1507055340000,nashville tenn 5 georgia coming two impressive...
4,Craig Robinson and Adam Scott buddy up in Fox'...,,1506776400000,
5,'Mark Felt: The Man Who Brought Down the White...,,1507215120000,
6,Stars tweet support for Ashley Judd after she ...,OSCAR-WINNER Brie Larson has spoken out in sup...,1507303440000,oscarwinner brie larson spoken support fellow ...
7,< Review: 'Blade Runner 2049',"RACHEL MARTIN, HOST:\n""Blade Runner"" is back. ...",1507248000000,rachel martin host blade runner back sure harr...
8,Hurricane Nate threatens US central Gulf Coast...,MANAGUA: Hurricane Nate may strengthen on Satu...,1507397400000,managua hurricane nate may strengthen saturday...
9,Board Meeting On 03 Nov 17,Board meeting for approval of Q2 unaudited fin...,1507334400000,board meeting approval q2 unaudited financial ...


# Topic Modeling

At this point, we are ready to prepare our cleaned text for topic modeling. We will split the cleaned text of each document into a list of string tokens.

In [9]:
import numpy as np

# split each cleaned document into a list of word tokens
clean_text_arr = newsDF['cleaned_text'].astype(str)
clean_text_arr = np.asarray(clean_text_arr)
clean_text_arr = [i.split(" ") for i in clean_text_arr]
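
One detail about the split(" ") call is worth knowing: on an empty string it returns a list holding one empty token, so blank articles still contribute a token to the dictionary. The sketch below compares it with the argument-free split(); the sample documents and variable names are illustrative.

```python
docs = ["board meeting approval", "", "two  spaces"]

# split(" ") keeps empty strings; split() discards them
with_sep = [d.split(" ") for d in docs]
no_sep = [d.split() for d in docs]

print(with_sep)
print(no_sep)
```

Switching to split() would change how the blank articles are tokenized, so the topic assignments for those rows may differ from the ones shown in this notebook.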


Latent Dirichlet Allocation (LDA) is our method of choice for topic modeling. We first build a dictionary that maps each unique token to an integer id, then convert each document to a bag-of-words representation. The dictionary and corpus are saved to disk so they can be reused when we identify the topics.

In [10]:
import gensim
from gensim import corpora

dictionary = corpora.Dictionary(clean_text_arr)
corpus = [dictionary.doc2bow(text) for text in clean_text_arr]

import pickle
# use a context manager so the file handle is closed after writing
with open('corpus.pkl', 'wb') as f:
    pickle.dump(corpus, f)
dictionary.save('dictionary.gensim')
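
As a rough illustration of what the dictionary and bag-of-words steps produce, here is a plain-Python sketch. It is not gensim's implementation: token2id and to_bow are hypothetical names, and the ids gensim assigns may differ from the ones here.

```python
from collections import Counter

docs = [["board", "meeting", "board"], ["market", "meeting"]]

# Map each unique token to an integer id, in order of first appearance
token2id = {}
for doc in docs:
    for token in doc:
        token2id.setdefault(token, len(token2id))

def to_bow(doc):
    # Count tokens and return sorted (token_id, count) pairs
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

bows = [to_bow(doc) for doc in docs]
print(token2id)
print(bows)
```

Each document ends up as a sparse list of (token id, count) pairs, which is the shape of input the LDA model consumes.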

I am assuming there will be about 5 topics: miscellaneous news, culture, sports, business, and tech.

In [11]:
NUM_TOPICS = 5
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics = NUM_TOPICS, id2word=dictionary, passes=50)

topics = ldamodel.print_topics(num_words=3)
for topic in topics:
    print(topic)

(0, '0.010*"game" + 0.006*"team" + 0.006*"first"')
(1, '0.005*"time" + 0.004*"people" + 0.004*"new"')
(2, '0.005*"state" + 0.004*"trump" + 0.003*"people"')
(3, '0.008*"company" + 0.006*"market" + 0.005*"business"')
(4, '0.005*"new" + 0.004*"user" + 0.004*"device"')


The output above helps us assign more concrete labels to these topics. 'Game', 'team', 'first' sounds like sports. 'Time', 'people', 'new' sounds like it could be entertainment. Words like 'state', 'trump', and 'people' make me think of domestic news. Topic 3 sounds business-related, and topic 4 sounds like it's related to tech/science.

I then create a dictionary that maps each topic id to its label.

In [22]:
topic_dict = {0:'sports', 1:'entertainment/culture', 2:'domestic/local news', 3:'business', 4:'tech'}

The function below takes one tokenized document and returns a label from the dictionary above by picking the topic with the highest probability for that document.

In [23]:
from operator import itemgetter

def get_topic(doc):
    # convert the token list to bag-of-words and get (topic_id, probability) pairs
    bow = dictionary.doc2bow(doc)
    topic_pred = ldamodel.get_document_topics(bow)
    # pick the topic with the highest probability and look up its label
    best_topic = max(topic_pred, key=itemgetter(1))[0]
    return topic_dict.get(best_topic)

clean_series = pd.Series(clean_text_arr)
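
A possible refinement, sketched below, is to refuse a label when the best topic is not confident enough. The function label_with_threshold, the min_prob cutoff of 0.5, and the "unlabeled" placeholder are all assumptions made for illustration; the input mimics the list of (topic_id, probability) pairs used above.

```python
from operator import itemgetter

def label_with_threshold(topic_probs, labels, min_prob=0.5):
    # topic_probs: list of (topic_id, probability) pairs
    if not topic_probs:
        return "unlabeled"
    best_id, best_prob = max(topic_probs, key=itemgetter(1))
    # Fall back to a placeholder when the best topic is not confident enough
    if best_prob < min_prob:
        return "unlabeled"
    return labels.get(best_id, "unlabeled")

labels = {0: "sports", 3: "business"}
confident = label_with_threshold([(0, 0.1), (3, 0.8)], labels)
uncertain = label_with_threshold([(0, 0.35), (3, 0.4)], labels)
print(confident, uncertain)
```

The cutoff is a judgment call; a stricter value trades coverage for fewer doubtful labels.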

With each article now labeled, we can append the labels to the data frame as a new column. In my opinion, the categories 'business' and 'tech' seem to be the most relevant to the interests of Crunchbase.

In [24]:
newsDF_copy['category'] = clean_series.apply(get_topic)
newsDF_copy.drop(columns=['date']).head(100)

Unnamed: 0,title,text,cleaned_text,category
0,There’s Reason to Be Skeptical of a Tesla-Powe...,"Joe Raedle/Getty Images\nLast week, I wrote a ...",joe raedlegetty image last week wrote story sl...,domestic/local news
1,Police: Threats to West Des Moines students we...,Zach Boyden-Holmes/The Register\nPolice have i...,zach boydenholmesthe register police identifie...,domestic/local news
2,Gunman reserved two rooms at Blackstone,Chicago police are investigating whether Steph...,chicago police investigating whether stephen p...,domestic/local news
3,"No. 5 Georgia Bulldogs, Vanderbilt Commodores ...","NASHVILLE, Tenn. -- No. 5 Georgia, coming off ...",nashville tenn 5 georgia coming two impressive...,sports
4,Craig Robinson and Adam Scott buddy up in Fox'...,,,domestic/local news
5,'Mark Felt: The Man Who Brought Down the White...,,,domestic/local news
6,Stars tweet support for Ashley Judd after she ...,OSCAR-WINNER Brie Larson has spoken out in sup...,oscarwinner brie larson spoken support fellow ...,entertainment/culture
7,< Review: 'Blade Runner 2049',"RACHEL MARTIN, HOST:\n""Blade Runner"" is back. ...",rachel martin host blade runner back sure harr...,entertainment/culture
8,Hurricane Nate threatens US central Gulf Coast...,MANAGUA: Hurricane Nate may strengthen on Satu...,managua hurricane nate may strengthen saturday...,domestic/local news
9,Board Meeting On 03 Nov 17,Board meeting for approval of Q2 unaudited fin...,board meeting approval q2 unaudited financial ...,business
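
If 'business' and 'tech' are indeed the categories of interest, the labeled frame can be narrowed down with a simple filter. The sketch below uses a small made-up frame named labeled as a stand-in for newsDF_copy.

```python
import pandas as pd

# Hypothetical stand-in for newsDF_copy with a few labeled rows
labeled = pd.DataFrame({
    "title": ["Board Meeting", "Bulldogs win", "New device launch"],
    "category": ["business", "sports", "tech"],
})

# Keep only the categories assumed to matter most to Crunchbase
relevant = labeled[labeled["category"].isin(["business", "tech"])]
print(relevant["title"].tolist())
```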


# Areas of Improvement

There are a few areas where I feel the classification can be improved.

1. We could have fine-tuned the document cleaning further. This could have produced more concise and more distinct topic labels. Additionally, we could have removed the entries where the 'text' field was blank; rows 4 and 5 above, for example, were still assigned a label.

2. The choice of the number of topics was rough and arbitrary. There is probably a more formal way to select an effective number of topics. Choosing too many topics can lead to 'overfitting', for lack of a better term, when the goal is simply to categorize.

3. Recall that we labeled each document with its highest-probability topic. This leaves plenty of room for error, since the top topic may win by only a narrow margin. If we fine-tuned the number of topics and the number of words for each topic, we could probably get much higher probabilities when labeling the documents.
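
For the first point, removing blank entries before modeling could look roughly like this. The frame sample and the choice to treat whitespace-only text as blank are assumptions for the sake of the example.

```python
import pandas as pd

sample = pd.DataFrame({
    "title": ["A", "B", "C"],
    "text": ["Some article text", "", "   "],
})

# Treat empty or whitespace-only text as blank and drop those rows
has_text = sample["text"].astype(str).str.strip() != ""
non_blank = sample[has_text].reset_index(drop=True)
print(len(non_blank))
```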

Overall, I feel this was a decent starting point for filtering out which documents to home in on. As a possible next step, it would be interesting to apply the same approach to just the titles, or to a new column that appends the title to the article text. Although the model can be improved, I believe it offers a solid platform for identifying the articles most relevant to Crunchbase.
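
A minimal sketch of that combined column, assuming a frame with 'title' and 'text' columns like the one used here. The column name title_and_text and the sample rows are hypothetical.

```python
import pandas as pd

sample = pd.DataFrame({
    "title": ["Hurricane Nate threatens coast", "Board Meeting"],
    "text": ["Hurricane Nate may strengthen", ""],
})

# Hypothetical new column: title followed by the article text
combined = sample["title"].fillna("") + " " + sample["text"].fillna("")
sample["title_and_text"] = combined.str.strip()
print(sample["title_and_text"].tolist())
```

Articles with blank text would then still carry their title into the cleaning and modeling steps.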