# Text Data IIb: Building a tweet classifier

Name: Jocelyn Shen

In this project, you'll finish off all the work that we've done so far, and build a collection of classifiers that 

## 1: Reshaping your code into classes

At this point, you should have a solid understanding of text preprocessing, and the Naive Bayes algorithm.  Take the code from your previous projects, and consolidate it into two Python classes: 

* `TextPreprocessor`: Given a Pandas series of tweets (with punctuation, links, and other garbage), perform all the preprocessing necessary to construct a Pandas dataframe of vectorized tweets.  It should have some keyword arguments (set to reasonable defaults) that include all of the choices that you might make regarding text processing.  Here's an example, taken from our class worksheet:
    
    ```
    >>> df = pd.DataFrame({"Tweets":["Trick or Treat!",
                                     "One to Two Guesses",
                                     "Try this one weird trick!",
                                     "That's weird, you might guess",
                                     "Can you guess these 10 health tricks?"]})
    >>> vectorized_tweets = TextProcessor(df, N=1000)
    >>> vectorized_tweets
    ```
    
|      | "one" | "weird" | "trick" | "guess" |
|------|------|------|------|------|
|   "Trick or Treat!"  |  0   |  0   |  1   |  0    |
| "One to Two Guesses"  |   1  |   0  |  0   |  1    |
| "Try this one weird trick!"   |   1  |  1   |  1   |   0   |
| "That's weird, you might guess"   |  0   |  1   |  0   |  1    |
| "Can you guess these 10 health tricks?"   |   0  |  0   |  1   |  1 |
    
* `NaiveBayes`: Given a Pandas Dataframe of vectorized tweets (the output of your `TextPreprocessor` class), train a Naive Bayes classifier, which can then be used to classify tweets it hasn't seen before.  

    ```
    >>> model = NaiveBayes()
    >>> model.fit(vectorized_tweets, y)      // y is the column of 1's and 0's, as usual
    >>> model.predict("Man, @nicholaszufelt is such a great teacher! #brownnoseforlife")
    1
    ```

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn.datasets import make_circles, make_moons, make_classification
from sklearn.linear_model import LogisticRegression
from string import punctuation
from collections import Counter
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords
import re
from math import *
import random

%matplotlib inline

plt.style.use('fivethirtyeight')

fivethiryeight_colors = {"blue": "#30a2da", 
                         "red": "#fc4f30", 
                         "green": "#e5ae38", 
                         "yellow": "#6d904f", 
                         "gray": "#8b8b8b"}
fivethirtyeight_rb = ["#30a2da", "#fc4f30"]

In [2]:
tweets_class = pd.read_csv("tweets_class.txt", encoding='ISO-8859-1', sep = "\t")
tweets_class.head()

Unnamed: 0,sentiment,text
0,0.0,Moving the Tesla announcement to Wednesday. Ne...
1,1.0,@markpinc @TeslaMotors thanks!
2,0.0,@Reuters Umm...Autobahn?
3,0.0,@vicentes obviously wrong
4,0.0,@Cocoanetics @heiseonline Not actually based o...


## TextProcessor Class

In [11]:
class TextProcessor:
    """TextProcessor class takes a dF and the corpus size as parameters
    Processeses text"""
    def __init__(self, df, N):
        self.all_tweets = df['text'].values
        self.df = df
        self.CORPUS_SIZE = N
        self.common_words = []
    
    """Removes all punctuation from a string
    parameters: s, string to remove punctuation from
    return: w, word without punctuation"""
    def remove_punctuation(self, s):
        w = ''.join(c for c in s if c not in punctuation)
        return w
    
    """Makes sure the word contains only characters
    parameters: word, word checking for any non-characters
    return: True or False, depending on if the word has all characters or not"""
    def all_char(self, word):
        for letter in word:
            if not letter.isalnum():
                return False
        return True
    
    """Stems a given line
    parameters: line, line to stem
    return: stemmed_line, the stemmed line"""
    def stem_a_line(self, line,stemmed_words=None, stop_words=None):  
        stemmer = PorterStemmer()
        words = re.split("\s", line)
        words = [word for word in words if (len(word) > 0 and self.all_char(word))]
        stemmed_line = []
        for word in words:
            temp = word.replace("\n", "").lower()
            if "https" not in temp and temp[0] != "@":
                w = stemmer.stem(temp)
                if w not in stopwords.words('english'):
                    stemmed_line.append(w)
                    if stemmed_words != None:                                    
                        if w not in stemmed_words: stemmed_words.append(w)        
                else:
                     if stemmed_words != None:                                     
                            if w not in stop_words: stop_words.append(w) 
        if stemmed_words != None:  
            return(stemmed_words, stop_words, stemmed_line)
        else:
            return stemmed_line
        
    """Vectorize tweets
    return: DataFrame of all vectorized, N mos common words"""
    def process(self):
        textdata = pd.DataFrame({'all_tweets': self.all_tweets})
        for i in range(len(self.all_tweets)):
            self.all_tweets[i] = str(self.all_tweets[i])
            self.all_tweets[i] = ''.join([c for c in self.all_tweets[i] if c not in punctuation])
        stemmed_words = []     
        stop_words = []       
        stemmed_tweets =[]
        s = []
        for line in self.all_tweets:
            stemmed_words, stop_words, stemmed_line = self.stem_a_line(line,stemmed_words, stop_words)
            stemmed_tweets = stemmed_tweets + stemmed_line
            s.append(stemmed_line)
        common_words = [w[0] for w in Counter(stemmed_words).most_common(self.CORPUS_SIZE)]
        v = []
        for tweet in self.all_tweets:
            a = []
            for i in range(self.CORPUS_SIZE):
                if common_words[i] in tweet or tweet in common_words[i]:
                    a.append(1)
                else:
                    a.append(0)
            v.append(a)
        for i in range(len(common_words)):
            textdata[common_words[i]] = [n[i] for n in v] 
        self.common_words = common_words
        return textdata
    
    """Adds bigrams to the N common words list"""
    def bigram(self, multiplier):
        common_words = self.common_words
        BIGRAM_MULTIPLIER = 2
        bigram_lyrics = [] 
        for ilyric in range(len(self.df.text)):
            bigram = u''  
            i_bigram = 0
            stemmed_line = self.stem_a_line(self.df.text[ilyric])
            for iwd in range(self.CORPUS_SIZE):
                if common_words[iwd] in stemmed_line:
                    bigram += "_"+common_words[iwd]
                    i_bigram += 1
                    if i_bigram>BIGRAM_MULTIPLIER: 
                        break
            if i_bigram > 1: 
                stemmed_line.append(bigram)
                if bigram not in common_words: 
                    common_words.append(bigram)   
            bigram_lyrics.append(stemmed_line)  
        textdata = pd.DataFrame({'all_tweets': self.all_tweets})
        stemmed_words = []     
        stop_words = []       
        stemmed_tweets =[]
        s = []
        for line in self.all_tweets:
            stemmed_words, stop_words, sL = self.stem_a_line(line,stemmed_words, stop_words)
            stemmed_tweets = stemmed_tweets + sL
            s.append(sL)
        v = []
        for tweet in self.all_tweets:
            a = []
            for i in range(len(common_words)):
                if common_words[i] in tweet or tweet in common_words[i]:
                    a.append(1)
                else:
                    a.append(0)
            v.append(a)
        for i in range(len(common_words)):
            textdata[common_words[i]] = [n[i] for n in v] 
        return textdata


In [4]:
ex = TextProcessor(tweets_class, 1000)
processed = ex.process()
processed.head()

Unnamed: 0,all_tweets,extend,unkar,1949,demonstr,lit,risk,vc,asshol,goal,...,nerdsunit,signup,mute,raven,1julieanderson,procur,http,hottest,lschibi,bringi
0,Moving the Tesla announcement to Wednesday. Ne...,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,@markpinc @TeslaMotors thanks!,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,@Reuters Umm...Autobahn?,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,@vicentes obviously wrong,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,@Cocoanetics @heiseonline Not actually based o...,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


## NaiveBayes

In [5]:
class NaiveBayes:
    """Constructs a NaiveBayes Classifier"""
    def __init__(self):
        self.pP = None
        self.pN = None
        self.prob_p = None
        self.prob_n = None
        self.common_words = None
    
    def all_char(self, word):
        for letter in word:
            if not letter.isalnum():
                return False
        return True 
    
    def stem_a_line(self, line,stemmed_words=None, stop_words=None):  
        stemmer = PorterStemmer()
        words = re.split("\s", line)
        words = [word for word in words if (len(word) > 0 and self.all_char(word))]
        stemmed_line = []
        for word in words:
            temp = word.replace("\n", "").lower()
            if "https" not in temp and temp[0] != "@":
                w = stemmer.stem(temp)
                if w not in stopwords.words('english'):
                    stemmed_line.append(w)
                    if stemmed_words != None:                                    
                        if w not in stemmed_words: stemmed_words.append(w)        
                else:
                     if stemmed_words != None:                                     
                            if w not in stop_words: stop_words.append(w) 
        if stemmed_words != None:  
            return(stemmed_words, stop_words, stemmed_line)
        else:
            return stemmed_line
    
    """Generates vector of probabilities using NaiveBayes
    parameters: vectorized_tweets, the dataframe of vectorized tweets
                y, the sentimet of the tweets"""
    def fit(self, vectorized_tweets,y):
        den_positive = 0                     
        den_negative = 0                 
        for v in y:
            if v == 0:
                den_negative += 1
            else:
                den_positive += 1
        common_words = vectorized_tweets.columns.tolist()
        common_words = [word for word in common_words if word != "all_tweets"]
        self.common_words = common_words
        N = len(common_words)
        prob_p = {}        
        prob_n = {}
        vect_n = np.zeros(N)
        vect_p = np.zeros(N)
        for i in range(N):
            for j in range(len(y)):
                if y[j] == 0.0:
                    vect_n[i] = (vect_n[i] + vectorized_tweets[common_words[i]][j])
                else:
                    vect_p[i] = (vect_p[i] + vectorized_tweets[common_words[i]][j])
        for iwrd in range(N):
            prob_n[common_words[iwrd]] = float(vect_n[iwrd])/den_negative    
            prob_p[common_words[iwrd]] = float(vect_p[iwrd])/den_positive   
        self.pP = float(den_positive)/(den_positive + den_negative)
        self.pN = float(den_negative)/(den_positive + den_negative)
        self.prob_p = prob_p
        self.prob_n = prob_n
    
    """Given a line, predict the sentiment, 1 or 0
    parameters: line, the line to predict
    return: result, the sentiment of the line"""
    def predict(self, line):
        stemmed_line = self.stem_a_line(line)
        prob_p = self.pP
        prob_n = self.pN
        for w in sorted(set(stemmed_line)):   
            if w in self.common_words:
                prob_p *= self.prob_p[w]    
                prob_n  *= self.prob_n[w]
        if prob_n == 0.0 and prob_p == 0.0:
            result = random.randint(0,1)
        elif prob_n > prob_p:
            result = 0
        else:
            result = 1
        return result


In [6]:
nb = NaiveBayes()
nb.fit(processed,tweets_class['sentiment'])

In [12]:
ex = TextProcessor(tweets_class, 1000)
ex.process()
ex.bigram(2).head()

Unnamed: 0,all_tweets,extend,unkar,1949,demonstr,lit,risk,vc,asshol,goal,...,_push_superbowl,_blame_steelersn,_k12_superbowl,_litt_camcuss_say,_alert_200k,_could_derrickros,_miss_troubl_followu,_wearephx_releas,_option_insidehoop_buck,_cousin_seat
0,Moving the Tesla announcement to Wednesday Nee...,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,markpinc TeslaMotors thanks,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Reuters UmmAutobahn,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,vicentes obviously wrong,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Cocoanetics heiseonline Not actually based on ...,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### Combine the results into a data frame called "Results" then calculate the accuracy score and output the confusion matrix of this tweets_clas

In [1]:
all_results = []
classification = tweets_class.sentiment
correct = []
for i in range(len(tweets_class.text)):
    line = tweets_class.text[i]
    all_results.append(tweets_class.predict(line))
    if nb.predict(line) == classification[i]:
        correct.append(1)
    else:
        correct.append(0)
results = pd.DataFrame({'prediction': all_results, 'classification': classification, 'correct': correct})
print(float(sum(correct))/len(classification))
false_p = 0
false_n = 0
true_p = 0
true_n = 0
for i in range(len(results)):
    if results['classification'][i] == 1 and results['prediction'][i] == 1:
        true_p = true_p + 1
    if results['classification'][i] == 1 and results['prediction'][i] == 0:
        false_n = false_n + 1
    if results['classification'][i] == 0 and results['prediction'][i] == 1:
        false_p = false_p + 1
    if results['classification'][i] == 0 and results['prediction'][i] == 0:
        true_n = true_n + 1
c_matrix = np.matrix([[true_p,false_p],[false_n,true_n]])
print(c_matrix)
results.head()

NameError: name 'tweets_class' is not defined

### Add bigrams to the tweets_class and recreate the Results data frame

In [13]:
processed = ex.bigram(2)
nb = NaiveBayes()
nb.fit(processed,tweets_class['sentiment'])

In [14]:
all_results = []
classification = tweets_class.sentiment
correct = []
for i in range(len(tweets_class.text)):
    line = tweets_class.text[i]
    all_results.append(nb.predict(line))
    if nb.predict(line) == classification[i]:
        correct.append(1)
    else:
        correct.append(0)
results = pd.DataFrame({'prediction': all_results, 'classification': classification, 'correct': correct})
print(float(sum(correct))/len(classification))
false_p = 0
false_n = 0
true_p = 0
true_n = 0
for i in range(len(results)):
    if results['classification'][i] == 1 and results['prediction'][i] == 1:
        true_p = true_p + 1
    if results['classification'][i] == 1 and results['prediction'][i] == 0:
        false_n = false_n + 1
    if results['classification'][i] == 0 and results['prediction'][i] == 1:
        false_p = false_p + 1
    if results['classification'][i] == 0 and results['prediction'][i] == 0:
        true_n = true_n + 1
c_matrix = np.matrix([[true_p,false_p],[false_n,true_n]])
print(c_matrix)
results.head()

0.6156770985668324
[[ 650  279]
 [1043 1445]]


Unnamed: 0,classification,correct,prediction
0,0.0,0,1
1,1.0,1,1
2,0.0,1,0
3,0.0,1,0
4,0.0,1,1


## 2: Train-Test split 

Now, before you build your model, you need to construct some datasets to try out.  Fortunately, we've been working on building one!  You can download `tweets_class.txt` from Canvas.  It's a `.txt`, not a `.csv`, because tweets have a lot of commas in them, and that messed with Pandas `read_csv` method.  So `tweets_class.txt` is a tab-delimited dataset.  You'll need to change the delimiter for `read_csv`.  I also found that including the keyword `encoding='ISO-8859-1'` helped a ton as well.

Take the data, split off some amount of it as a **test dataset**.  This means that you don't give it to your model, but you run it through the model after training to test its accuracy on tweets it hasn't seen before.  How much to split off is a good question, and the numbers vary from 10% to 50%.  Both of those extremes I think are a little over-the-top, I would recommend about 20-30%.  The remaining dataset is called your **training dataset**.

A quick google search allowed me to find the gigantic [Sentiment Analysis Dataset](http://cs.stanford.edu/people/alecmgo/trainingandtestdata.zip).  It's 150MB, consisting of 1.5 million tweets which have been labeled for sentiment.  You may be getting frustrated with me at this point: "Why did we have to go label all those tweets when there's a huge dataset already?!"  I have many answers to this, but the most important one is that all of these labeled tweets were labeled by someone else's model, so I didn't want you to work off of entirely computer-generated data (It's the same idea behind "a copy of a copy of a copy...".  Nonetheless, let's pad our dataset with it.  Here's a line I found useful for opening that massive dataset in pandas:

In [15]:
df = pd.read_csv("Sentiment Analysis Dataset.csv", encoding='ISO-8859-1', error_bad_lines=False)

b'Skipping line 8836: expected 4 fields, saw 5\n'
b'Skipping line 535882: expected 4 fields, saw 7\n'


Add some number of these labeled tweets to your dataset.  How much is up to you, I might suggest some number of thousands, but less than 10 thousand (Though feel free to try more!).  Take them randomly from the dataframe, not from the top.  

In [16]:
df.columns = ['ItemID', 'sentiment', 'SentimentSource', 'text']
df.head()

Unnamed: 0,ItemID,sentiment,SentimentSource,text
0,1,0,Sentiment140,is so sad for my APL frie...
1,2,0,Sentiment140,I missed the New Moon trail...
2,3,1,Sentiment140,omg its already 7:30 :O
3,4,0,Sentiment140,.. Omgaga. Im sooo im gunna CRy. I'...
4,5,0,Sentiment140,i think mi bf is cheating on me!!! ...


### Take 1% of the original sentiment analysis dataset, then split it into testing data and training data. Add the training data to the original tweets class. 

In [17]:
df_n = df[0:int(0.001*len(df))]
test_data = df_n[0:int(0.3*len(df_n))]
training_data = df_n[int(0.3*len(df_n)):]
all_classifications = [sentiment for sentiment in tweets_class.sentiment] + [sentiment for sentiment in training_data.sentiment]
all_text =  [text for text in tweets_class.text]+[text for text in training_data.text]
data = pd.DataFrame({'text': all_text, 'sentiment':all_classifications})
data.head()

Unnamed: 0,sentiment,text
0,0.0,Moving the Tesla announcement to Wednesday Nee...
1,1.0,markpinc TeslaMotors thanks
2,0.0,Reuters UmmAutobahn
3,0.0,vicentes obviously wrong
4,0.0,Cocoanetics heiseonline Not actually based on ...


## 3: Train and test your models

Build both a Naive Bayes and a Logistic Regression classifier on your training dataset, and test them on your test dataset.  Which is better? What percent of positives and negatives do you have? How many false positives and false negatives do you have?  Interpret your results.  Is your model better or worse when you include the computer-generated data?  Add some bigrams to, and try changing your `N` in Naive Bayes, and try changing your `C` in Logistic Regression.  What's the best model? (This is why creating a robust class in part 1 will help you.)

### Processing the training data
Re-fit using NaiveBayes for the new training data set

In [18]:
vectorized_tweets = TextProcessor(data, 1000).process()
nbayes_tweets = NaiveBayes()
nbayes_tweets.fit(vectorized_tweets, data.sentiment)

### Classifying testing data and returning accuracy score, confusion matrix


In [19]:
all_results = []
classification = test_data.sentiment
correct = []
for i in range(len(test_data.text)):
    line = test_data.text[i]
    all_results.append(nbayes_tweets.predict(line))
    if nbayes_tweets.predict(line) == classification[i]:
        correct.append(1)
    else:
        correct.append(0)
results = pd.DataFrame({'prediction': all_results, 'classification': classification, 'correct': correct})
print(float(sum(correct))/len(classification))
false_p = 0
false_n = 0
true_p = 0
true_n = 0
for i in range(len(results)):
    if results['classification'][i] == 1 and results['prediction'][i] == 1:
        true_p = true_p + 1
    if results['classification'][i] == 1 and results['prediction'][i] == 0:
        false_n = false_n + 1
    if results['classification'][i] == 0 and results['prediction'][i] == 1:
        false_p = false_p + 1
    if results['classification'][i] == 0 and results['prediction'][i] == 0:
        true_n = true_n + 1
c_matrix = np.matrix([[true_p,false_p],[false_n,true_n]])
print(c_matrix)
results.head()

0.6511627906976745
[[  9  21]
 [147 296]]


Unnamed: 0,classification,correct,prediction
0,0,1,0
1,0,1,0
2,1,0,0
3,0,1,0
4,0,1,0


### Adding bigrams
Re-fit the model with the bigrams added and re-process

In [22]:
ex = TextProcessor(data, 1000)
ex.process()
processed = ex.bigram(2)
nb = NaiveBayes()
nb.fit(processed,data.sentiment)

In [23]:
all_results = []
classification = test_data.sentiment
correct = []
for i in range(len(test_data.text)):
    line = test_data.text[i]
    all_results.append(nb.predict(line))
    if nb.predict(line) == classification[i]:
        correct.append(1)
    else:
        correct.append(0)
results = pd.DataFrame({'prediction': all_results, 'classification': classification, 'correct': correct})
print(float(sum(correct))/len(classification))
false_p = 0
false_n = 0
true_p = 0
true_n = 0
for i in range(len(results)):
    if results['classification'][i] == 1 and results['prediction'][i] == 1:
        true_p = true_p + 1
    if results['classification'][i] == 1 and results['prediction'][i] == 0:
        false_n = false_n + 1
    if results['classification'][i] == 0 and results['prediction'][i] == 1:
        false_p = false_p + 1
    if results['classification'][i] == 0 and results['prediction'][i] == 0:
        true_n = true_n + 1
c_matrix = np.matrix([[true_p,false_p],[false_n,true_n]])
print(c_matrix)
results.head()

0.642706131078224
[[ 11  19]
 [145 298]]


Unnamed: 0,classification,correct,prediction
0,0,1,0
1,0,1,0
2,1,0,0
3,0,1,0
4,0,1,0


### Logistic regression
Get all the words in common words and then make sure no values are NaN or infinite. Use LogisticRegression() to predict

In [143]:
col = [n for n in processed.columns.values if n is not "all_tweets"]
X = processed[col]
y=data.sentiment
logreg = LogisticRegression()
isnan_idx = []
for i in range(len(y)):
    if isnan(y[i]):
        isnan_idx.append(i)
y = y.drop(y.index[[isnan_idx]])
X = X.drop(X.index[[isnan_idx]])
model = logreg.fit(X,y)


   extend  unkar  rod  1949  demonstr  lit  risk  jm  vc  asshol     ...       \
0       0      0    0     0         0    0     0   0   0       0     ...        
1       0      0    0     0         0    0     0   0   0       0     ...        
2       0      0    0     0         0    0     0   0   0       0     ...        
3       0      0    0     0         0    0     0   0   0       0     ...        
4       0      0    0     0         0    0     0   0   0       0     ...        

   _bore_talk  _rain_gone  _take_ride  _month_talk  _till_eat_wait  \
0           0           0           0            0               0   
1           0           0           0            0               0   
2           0           0           0            0               0   
3           0           0           0            0               0   
4           0           0           0            0               0   

   _happi_hous  _sink_must  _ew_gt_came  _could_sync  _miss_month  
0            0          

In [170]:
stemmed = []
for lin in test_data.text:
    stemmed.append(stem_a_line(lin))
test_data['stemmed_text'] = stemmed
all_results = []
classification = test_data.sentiment
correct = []
for i in range(len(test_data.text)):
    text = test_data.stemmed_text[i]
    arr = [1 if word in text else 0 for word in col]
    a = pd.DataFrame(arr)
    
    all_results.append(int(round(model.predict(a.values.reshape(1,-1))[0])))
    if int(round(model.predict(a.values.reshape(1,-1))[0])) == classification[i]:
        correct.append(1)
    else:
        correct.append(0)
results = pd.DataFrame({'prediction': all_results, 'classification': classification, 'correct': correct})
print(float(sum(correct))/len(classification))
false_p = 0
false_n = 0
true_p = 0
true_n = 0
for i in range(len(results)):
    if results['classification'][i] == 1 and results['prediction'][i] == 1:
        true_p = true_p + 1
    if results['classification'][i] == 1 and results['prediction'][i] == 0:
        false_n = false_n + 1
    if results['classification'][i] == 0 and results['prediction'][i] == 1:
        false_p = false_p + 1
    if results['classification'][i] == 0 and results['prediction'][i] == 0:
        true_n = true_n + 1
c_matrix = np.matrix([[true_p,false_p],[false_n,true_n]])
print("c_matrix")
print(c_matrix)
results.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


0.6596194503171248
c_matrix
[[  2   7]
 [154 310]]


Unnamed: 0,classification,correct,prediction
0,0,1,0
1,0,1,0
2,1,0,0
3,0,1,0
4,0,1,0
