# Review the code in homework/bill_classifier.ipynb
Understand the steps in creating a text classifier

Comment in your PR on the utility of something like this for the work you've done or would like to do

In [11]:
from sklearn import preprocessing
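# sklearn.cross_validation was removed in scikit-learn 0.20; model_selection provides cross_val_score
from sklearn import model_selection as cross_validation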
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_extraction.text import CountVectorizer

I went through the notebook and found some of the code a bit confusing.
I understand the general idea of what the code is doing: it takes text and classifies it into a specific category. To use our text as input for sklearn, we need to make some changes.
The first step is to separate the bill title from the bill category, which is fairly easy since they are separated by a "|".
From there we assign numbers to our labels and store them in a new variable called correct_labels.
I get stuck at the step where you applied the CountVectorizer. I am not sure what the output is.
At this point we define our model and fit it on our x and y. Our x is everything in our data variable (which I guess holds information about each word in our bill title) and our y is what we have stored in correct_labels.
We now have our model ready.
But I am very confused by the last lines of code:
    test_data = vectorizer.transform(docs_new)

    for i in range(len(docs_new)):
        print('%s -> %s' % (docs_new[i], encoder.classes_[model.predict(test_data[i])]))

I do think this is useful for dealing with large amounts of text data. As journalists we are constantly on deadline and trying to be as timely as possible. Tools like this one would help us process large amounts of information in less time. I would like to understand this code in depth so I can create a text classifier myself.

In [26]:
########## STEP 1: DATA IMPORT AND PREPROCESSING ##########

# Here we're taking in the training data and splitting it into two lists: One with the text of
# each bill title, and the second with each bill title's corresponding category. Order is important.
# The first bill in list 1 should also be the first category in list 2.
training = [line.strip().split('|') for line in open("bills.txt", 'r', encoding='utf8').readlines()]
text = [t[0] for t in training if len(t) > 1]
labels = [t[1] for t in training if len(t) > 1]

# A little bit of cleanup for scikit-learn's benefit. Scikit-learn models want our categories to
# be numbers, not strings. The LabelEncoder performs this transformation.
encoder = preprocessing.LabelEncoder()

correct_labels = encoder.fit_transform(labels)
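
The splitting in Step 1 can be tried on a few in-memory lines before touching the real file. This is a minimal sketch: the sample lines below are hypothetical stand-ins for `bills.txt`, and the variable names `sample_lines`, `sample_rows`, `sample_text` and `sample_labels` are introduced only for illustration.

```python
# Hypothetical sample lines standing in for bills.txt (not the real data).
sample_lines = [
    "An act relating to school funding.|Education\n",
    "An act relating to solar energy credits.|Energy\n",
    "A malformed line with no category\n",
    "\n",
]

# Strip the newline and split each line on the pipe, as the notebook does.
sample_rows = [line.strip().split('|') for line in sample_lines]

# Keep only rows that have both a title and a category.
# Applying the same filter to both lists keeps them aligned by position.
sample_text = [r[0] for r in sample_rows if len(r) > 1]
sample_labels = [r[1] for r in sample_rows if len(r) > 1]

print(sample_text)
print(sample_labels)
```

The `len(...) > 1` check is what drops blank or malformed lines, so title number i always lines up with category number i.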

In [25]:
print(text)



In [27]:
print(labels)


['Education', 'Public Services', 'Labor and Employment', 'Labor and Employment', 'Crime', 'Social Issues', 'Business and Consumers', 'Housing and Property', 'Education', 'Commerce', 'Campaign Finance and Election Issues', 'Energy', 'Military', 'Legal Issues', 'Energy', 'Crime', 'Budget, Spending, and Taxes', 'Other', 'Environmental', 'Public Services', 'Energy', 'Public Services', 'Energy', 'Commerce', 'Housing and Property', 'Energy', 'Resolutions', 'Energy', 'Health', 'Health', 'Education', 'Labor and Employment', 'Legal Issues', 'Environmental', 'Energy', 'Senior Issues', 'Health', 'Housing and Property', 'Commerce', 'Family and Children Issues', 'Budget, Spending, and Taxes', 'Family and Children Issues', 'Military', 'Recreation', 'Family and Children Issues', 'Public Services', 'Environmental', 'Labor and Employment', 'Transportation', 'Legal Issues', 'Campaign Finance and Election Issues', 'Legal Issues', 'Environmental', 'Labor and Employment', 'Housing and Property', 'Housing a

In [37]:
print(correct_labels)
print(len(labels))
print(len(correct_labels))

[10 31 25 ..., 19 19 27]
5750
5750
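
To see what the integers in `correct_labels` mean, it may help to run `LabelEncoder` on a tiny list. This is a sketch with hypothetical labels; `toy_labels`, `toy_encoder` and `toy_codes` are illustrative names, not part of the notebook.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical categories, chosen only for illustration.
toy_labels = ["Education", "Crime", "Energy", "Crime"]

toy_encoder = LabelEncoder()
toy_codes = toy_encoder.fit_transform(toy_labels)

# classes_ holds the distinct labels in sorted order;
# each code is the position of that label inside classes_.
print(toy_encoder.classes_)
print(toy_codes)

# inverse_transform maps the integer codes back to the original strings.
print(toy_encoder.inverse_transform(toy_codes))
```

Because a code is just an index into `classes_`, an expression like `encoder.classes_[code]` recovers the category name from a predicted number.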


In [34]:
########## STEP 2: FEATURE EXTRACTION ##########
vectorizer = CountVectorizer(stop_words='english')
data = vectorizer.fit_transform(text)
print(data)

  (0, 4986)	1
  (0, 5059)	1
  (0, 7052)	1
  (0, 3241)	1
  (0, 5719)	1
  (0, 5391)	1
  (0, 6894)	1
  (0, 7242)	1
  (1, 4986)	1
  (1, 7052)	1
  (1, 5391)	1
  (1, 6894)	1
  (1, 4995)	1
  (1, 4617)	1
  (1, 5970)	1
  (1, 6808)	1
  (1, 5933)	1
  (2, 4986)	1
  (2, 5059)	1
  (2, 7052)	1
  (2, 5391)	1
  (2, 6894)	1
  (2, 4995)	1
  (2, 7053)	1
  (2, 2036)	1
  :	:
  (5743, 6882)	1
  (5743, 1776)	1
  (5744, 6040)	1
  (5744, 6896)	1
  (5744, 6453)	1
  (5744, 5209)	1
  (5745, 6396)	1
  (5745, 5263)	1
  (5745, 5742)	1
  (5746, 6016)	1
  (5746, 5288)	1
  (5746, 5525)	1
  (5747, 6396)	1
  (5747, 5263)	1
  (5747, 5742)	1
  (5748, 6016)	1
  (5748, 5288)	1
  (5748, 5525)	1
  (5749, 948)	1
  (5749, 7069)	1
  (5749, 5829)	1
  (5749, 6204)	1
  (5749, 6896)	1
  (5749, 7002)	1
  (5749, 1776)	1
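
The printed output above is a sparse matrix: each line reads `(row, column) count`, where the row is the bill title's position in `text`, the column is the index of one word in the vectorizer's vocabulary, and the count is how often that word appears in that title. A small sketch with two hypothetical titles makes this easier to inspect; `toy_titles`, `toy_vectorizer` and `toy_data` are illustrative names.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two hypothetical titles, for illustration only.
toy_titles = [
    "An act relating to education funding",
    "An act relating to energy and education",
]

toy_vectorizer = CountVectorizer(stop_words='english')
toy_data = toy_vectorizer.fit_transform(toy_titles)

# vocabulary_ maps each kept word to its column index.
# Common words such as "an", "to" and "and" are dropped as stop words.
print(sorted(toy_vectorizer.vocabulary_.items(), key=lambda kv: kv[1]))

# The sparse form lists only the non-zero (row, column) count entries...
print(toy_data)

# ...while toarray() shows the same counts as a full table, one row per title.
print(toy_data.toarray())
```

So the `data` variable is one row per bill title and one column per vocabulary word, which is the numeric input the model is fitted on.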


In [16]:
########## STEP 3: MODEL BUILDING ##########
model = DecisionTreeClassifier()
fit_model = model.fit(data, correct_labels)


In [17]:
# ########## STEP 4: EVALUATION ##########
# Evaluate our model with 5-fold cross-validation (cv=5)
scores = cross_validation.cross_val_score(model, data, correct_labels, cv=5)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))




Accuracy: 0.65 (+/- 0.05)
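
The accuracy line reports the mean of the per-fold scores, plus or minus two standard deviations. A minimal sketch of the same evaluation on synthetic data follows. It assumes a recent scikit-learn, where `cross_val_score` lives in `sklearn.model_selection` (the `sklearn.cross_validation` module was removed in version 0.20). The dataset and the `toy_` names are illustrative, not the bill data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic two-class data, generated only for illustration.
toy_X, toy_y = make_classification(n_samples=200, n_features=10, random_state=0)

toy_model = DecisionTreeClassifier(random_state=0)

# cv=5 splits the data into 5 folds; the model is trained on 4 folds and
# scored on the remaining one, once per fold, giving 5 accuracy scores.
toy_scores = cross_val_score(toy_model, toy_X, toy_y, cv=5)

print(toy_scores)
print("Accuracy: %0.2f (+/- %0.2f)" % (toy_scores.mean(), toy_scores.std() * 2))
```

Each score comes from data the model did not see while training on that fold, which is why this number is a fairer estimate than accuracy measured on the training set.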


In [18]:
# ########## STEP 5: APPLYING THE MODEL ##########
docs_new = ["Public postsecondary education: executive officer compensation.",
            "An act to add Section 236.3 to the Education code, related to the pricing of college textbooks.",
            "Political Reform Act of 1974: campaign disclosures.",
            "An act to add Section 236.3 to the Penal Code, relating to human trafficking."
        ]

test_data = vectorizer.transform(docs_new)

for i in range(len(docs_new)):
    # test_data[i] keeps the row two-dimensional (1 x n_features), which predict() requires
    print('%s -> %s' % (docs_new[i], encoder.classes_[model.predict(test_data[i])]))
   

Public postsecondary education: executive officer compensation. -> ['Education']
An act to add Section 236.3 to the Education code, related to the pricing of college textbooks. -> ['Education']
Political Reform Act of 1974: campaign disclosures. -> ['Campaign Finance and Election Issues']
An act to add Section 236.3 to the Penal Code, relating to human trafficking. -> ['Crime']
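
The last step can be read as three moves: turn the new titles into word counts with the already-fitted vectorizer, ask the model for an integer code, and map that code back to a category name. The sketch below walks through those moves on a hypothetical four-title training set. All `toy_` names and `new_docs` are illustrative, and `inverse_transform` is used here as an equivalent of indexing `encoder.classes_` with the predicted code.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical training data, for illustration only.
toy_text = [
    "school funding for teachers",
    "college textbook pricing",
    "prison sentencing reform",
    "human trafficking penalties",
]
toy_labels = ["Education", "Education", "Crime", "Crime"]

toy_encoder = LabelEncoder()
toy_y = toy_encoder.fit_transform(toy_labels)

toy_vectorizer = CountVectorizer(stop_words='english')
toy_X = toy_vectorizer.fit_transform(toy_text)

toy_model = DecisionTreeClassifier(random_state=0).fit(toy_X, toy_y)

new_docs = ["textbook pricing for college students"]

# transform (not fit_transform) reuses the training vocabulary, so the new
# rows have the same columns as toy_X; words never seen in training are ignored.
new_X = toy_vectorizer.transform(new_docs)

# predict returns one integer code per row of new_X.
codes = toy_model.predict(new_X)

# Map the integer codes back to category names.
names = toy_encoder.inverse_transform(codes)

for doc, name in zip(new_docs, names):
    print('%s -> %s' % (doc, name))
```

In the notebook, `model.predict(...)` is called on one row at a time and returns an array with a single code. Indexing `encoder.classes_` with that array yields a one-element array, which is why each printed category appears inside brackets.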


