<a href="https://colab.research.google.com/github/nishusingh11/python-scripting-for-social-science/blob/main/classification_assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Analyzing insults with Naive Bayes: pandas and sklearn

In [None]:
import numpy as np
import pandas as pd
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV as gs
import sklearn.feature_extraction.text as text
import sklearn.naive_bayes as nb
import matplotlib.pyplot as plt
from sklearn.metrics import precision_score, recall_score, accuracy_score
%matplotlib inline

## Loading and preparing the data

Let's open the CSV file with `pandas`.

In [None]:
import os.path
site = 'https://gawron.sdsu.edu/python_for_ss/course_core/book_draft/_static/'
df = pd.read_csv(os.path.join(site,"troll.csv"))

Each row is a comment taken from a blog or online forum. There are three columns: whether the comment is insulting (1) or not (0), the date, and the unicode-encoded contents of the comment.

In [None]:
df[['Insult', 'Comment']].tail()

Unnamed: 0,Insult,Comment
3942,1,"""you are both morons and that is never happening"""
3943,0,"""Many toolbars include spell check, like Yahoo..."
3944,0,"""@LambeauOrWrigley\xa0\xa0@K.Moss\xa0\nSioux F..."
3945,0,"""How about Felix? He is sure turning into one ..."
3946,0,"""You're all upset, defending this hipster band..."


Write a pandas command to give you just the insults.

In [None]:
# Boolean mask: keep only the rows labeled as insults
insult_df = df[df['Insult'] == 1]

In [None]:
insult_df[:25]

Unnamed: 0,Insult,Date,Comment
0,1,20120618192155Z,"""You fuck your dad."""
7,1,,"""shut the fuck up. you and the rest of your fa..."
8,1,20120502173553Z,"""Either you are fake or extremely stupid...may..."
9,1,20120620160512Z,"""That you are an idiot who understands neither..."
15,1,20120611090207Z,"""FOR SOME REASON U SOUND RETARDED. LOL. DAMN. ..."
16,1,20120320162532Z,"""You with the 'racist' screen name\n\nYou are ..."
18,1,20120320075347Z,"""your such a dickhead..."""
19,1,20120320203947Z,"""Your a retard go post your head up your #%&*"""
34,1,20120515132156Z,"""Allinit123, your\xa0hypocrisy\xa0is sickening..."
37,1,20120620161958Z,"""I can't believe the stupid people on this sit..."


In [None]:
df['Comment'][79:85]

79    "Fact : Georgia passed a strict immigration po...
80              "Of course you would bottom feeder ..."
81    "M\xe1tenlos!!\nhttp://1.bp.blogspot.com/-YVSZ...
82    "You are\xa0 a fukin moron. \xa0\xa0 You are j...
83    "He is doing what any president doe's on this ...
84    "...yeah, and you're a f'ing expert.....go bac...
Name: Comment, dtype: object

In [None]:
df['Comment'][79]

'"Fact : Georgia passed a strict immigration policy and most of the Latino farm workers left the area. Vidalia Georgia now has over 3000 agriculture job openings and they have been able to fill about 250 of them in past year. All you White Real Americans who are looking for work that the Latinos stole from you..Where are you ? The jobs are i Vadalia just waiting for you..Or maybe its the fact that you would rather collect unemployment like the rest of the Tea Klaners.. You scream..you complain..and you sit at home in your wife beaters and drink beer..Typical Real White Tea Klan...."'

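NB: `sort_values` returns a new, sorted Series, so `insult_df` is **not** reordered by the following sort. Note also that the next cell rebinds `insult_df` to the full `df`, so the `Size` column is computed for every comment, not just the insults.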

In [None]:
insult_df = df

In [None]:
insult_df['Size'] = df['Comment'].apply(len)
insult_df['Size'].sort_values(ascending = False)

2004    17805
3416    10716
1305     4769
3068     4312
3208     4016
        ...  
3919        8
755         8
45          8
2937        6
3112        6
Name: Size, Length: 3947, dtype: int64

Now we define the feature matrix $\mathbf{X}$ and the labels $\mathbf{y}$.

In [None]:
len(insult_df.loc[3208]['Comment'].split())

703

In [None]:
insult_df.loc[755]

Insult                   1
Date       20120620121441Z
Comment           "Retard"
Size                     8
Name: 755, dtype: object

In [None]:
insult_df.loc[45]

Insult                   1
Date       20120619074710Z
Comment           "faggot"
Size                     8
Name: 45, dtype: object

In [None]:
y = df['Insult']

We want to use one of the linear classifiers in `sklearn`,
but the learners in `sklearn` only work with numerical arrays. How do we convert text into a matrix of numbers?
As discussed in lecture and in our text,
obtaining the feature matrix from the text is not trivial.

The classical solution is to first extract a **vocabulary**: a list of words used throughout the corpus. Then, we can count, for each document in the sample, the frequency of each word. We end up with a **sparse matrix**: a huge matrix containing mostly zeros. Here, `sklearn` and `pandas` make it possible to do this in two lines. 
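To make the idea concrete, here is a minimal pure-Python sketch of the vocabulary-and-counts step. It is not how `sklearn` implements it: the three toy documents and the helper names `build_vocabulary` and `count_matrix` are made up for illustration, and tokenization is a plain whitespace split.

```python
from collections import Counter

def build_vocabulary(docs):
    """Map each distinct word to a column index (alphabetical order)."""
    words = sorted({word for doc in docs for word in doc.lower().split()})
    return {word: col for col, word in enumerate(words)}

def count_matrix(docs, vocab):
    """Return a sparse count matrix as {(row, col): count}; zeros are not stored."""
    cells = {}
    for row, doc in enumerate(docs):
        for word, n in Counter(doc.lower().split()).items():
            cells[(row, vocab[word])] = n
    return cells

# Hypothetical toy corpus
docs = ["you are wrong", "you are you", "nice day"]
vocab = build_vocabulary(docs)
cells = count_matrix(docs, vocab)
print(vocab)
print(cells)
print(f"{len(cells)} of {len(docs) * len(vocab)} cells are non-zero")
```

Even with three short documents, fewer than half of the cells are non-zero; with thousands of comments and a vocabulary of many thousands of words, the proportion becomes tiny.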

In [None]:
print(text.TfidfVectorizer.__doc__)

Convert a collection of raw documents to a matrix of TF-IDF features.

    Equivalent to :class:`CountVectorizer` followed by
    :class:`TfidfTransformer`.

    Read more in the :ref:`User Guide <text_feature_extraction>`.

    Parameters
    ----------
    input : {'filename', 'file', 'content'}, default='content'
        - If `'filename'`, the sequence passed as an argument to fit is
          expected to be a list of filenames that need reading to fetch
          the raw content to analyze.

        - If `'file'`, the sequence items must have a 'read' method (file-like
          object) that is called to fetch the bytes in memory.

        - If `'content'`, the input is expected to be a sequence of items that
          can be of type string or byte.

    encoding : str, default='utf-8'
        If bytes or files are given to analyze, this encoding is used to
        decode.

    decode_error : {'strict', 'ignore', 'replace'}, default='strict'
        Instruction on what to do if a b

In [None]:
tf = text.TfidfVectorizer()
X = tf.fit_transform(df['Comment'])


In [None]:
X.shape

(3947, 16469)

In [None]:
y.shape

(3947,)


The TFIDF vectorizer uses a simple formula to assign a significance score to the
count of each vocabulary item in each document. Our TFIDF matrix is stored in `X`.

Say a word occurs n times in a document.
TFIDF (term frequency times inverse document frequency) is a very popular measure of the significance of that fact:
the score grows with n and shrinks as the word appears in more documents of the corpus.
It first proved useful in document retrieval. It has some competitors in classification, but
we have used it here mainly because it's the easiest **feature weighting scheme**
to use in `sklearn`.
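The following sketch works through the weighting by hand. It assumes the formula scikit-learn documents for its default settings: raw counts as term frequency, a smoothed inverse document frequency $\mathrm{idf}(t) = \ln\frac{1+n}{1+\mathrm{df}(t)} + 1$, and each row scaled to unit Euclidean length. The toy documents and the function name `tfidf_rows` are hypothetical, and tokens come from a whitespace split rather than from `sklearn`'s tokenizer.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """TF-IDF with smoothed idf and L2 row normalization (whitespace tokens)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}
    rows = []
    for tokens in tokenized:
        weights = {t: n * idf[t] for t, n in Counter(tokens).items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()})
    return rows

# Hypothetical toy corpus
docs = ["you are a moron", "you are right", "thank you"]
rows = tfidf_rows(docs)
for row in rows:
    print({t: round(w, 3) for t, w in row.items()})
```

In the first document, *moron* receives a larger weight than *you*, because *you* occurs in every document and so carries little information about any one of them.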

In [None]:
# Shape and Number of non zero entries
print(f'Shape: ({X.shape[0]:,} x {X.shape[1]:,})  Non-zero entries: {X.nnz:,}')

Shape: (3,947 x 16,469)  Non-zero entries: 100,269


There are 3,947 comments and 16,469 different words. Let's estimate the sparsity of this feature matrix.

In [None]:
print(("The document matrix X is ~{0:.2%} non-zero features.".format(
          X.nnz / float(X.shape[0] * X.shape[1]))))

The document matrix X is ~0.15% non-zero features.


A `TfidfVectorizer` instance stores its vocabulary, a dictionary mapping each word to its column index in `X`, in the attribute `vocabulary_` (note
the trailing underscore!):

In [None]:
tf.vocabulary_['moron']

8704

The `sklearn` module stores many of its internally computed arrays as **sparse matrices**. This is basically a
very clever computer science device for not wasting all the space that very sparse matrices
waste. Natural language representations are often **quite** sparse. The 0.15% non-zero features
figure we just looked at is typical. Sparse matrices come at a cost, however; although some
computations can be done while the matrix is in sparse form, some cannot, and to do those
you have to convert the matrix to a nonsparse matrix, do what you need to do, and then, probably,
convert it back. This is costly. We're going to do it now, but only because we're goofing
around. Conversion to non-sparse format should in general be avoided whenever possible.
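A rough back-of-the-envelope calculation shows what the conversion costs in memory. The sketch uses the shape and non-zero count reported above; the storage layout is an assumption (8-byte floats, and a compressed-sparse-row format with 4-byte integer indices), so treat the numbers as estimates rather than measurements.

```python
# Figures reported earlier in this notebook
n_rows, n_cols, nnz = 3947, 16469, 100269

# Dense array: one 8-byte float per cell, zero or not
dense_bytes = n_rows * n_cols * 8

# Compressed sparse row (assumed layout): an 8-byte value and a 4-byte
# column index per stored entry, plus one 4-byte row pointer per row (+1)
csr_bytes = nnz * (8 + 4) + (n_rows + 1) * 4

print(f"dense: {dense_bytes / 1e6:.0f} MB")
print(f"CSR:   {csr_bytes / 1e6:.1f} MB")
print(f"ratio: {dense_bytes / csr_bytes:.0f}x")
```

Under these assumptions the dense copy is several hundred times larger than the sparse original. If all we need is a single entry, the sparse matrix can be indexed directly, for example `X[3942, 8704]`, without building the dense array at all.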

In [None]:
XA = X.toarray()

Consider comment 3942:

In [None]:
insult_df.loc[3942]['Comment']

'"you are both morons and that is never happening"'

Ok, now we can check the TFIDF matrix for the statistic for `'moron'` in this comment:

In [None]:
XA[3942][8704]

0.0

Zero. The comment contains *morons*, not *moron*, and the vectorizer treats these as different words:

In [None]:
tf.vocabulary_['morons']

8707

A different vocabulary item, stored in its own column of `XA`:

In [None]:
XA[3942][8707]

0.5139224706716653
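Why are *moron* and *morons* separate features? The vectorizer does no stemming; by default it lowercases the text and keeps every run of two or more word characters as a token. The sketch below imitates that step with the regular expression scikit-learn documents as its default `token_pattern`. The function name `tokenize` and the second example sentence are made up, and the third example assumes the file stores `\xa0` as literal text rather than as a real non-breaking space.

```python
import re

# Default token_pattern of scikit-learn's text vectorizers
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def tokenize(comment):
    """Lowercase, then keep runs of two or more word characters."""
    return TOKEN_PATTERN.findall(comment.lower())

print(tokenize('"you are both morons and that is never happening"'))
# Apostrophes split words; one-letter pieces are dropped
print(tokenize("You're a moron, don't reply"))
# A literal backslash escape leaves 'xa0' behind as a token
print(tokenize(r"You are\xa0 a moron"))
```

This also accounts for odd-looking vocabulary items such as `re`, `don`, and `xa0`: they are fragments of *you're*, *don't*, and the escaped non-breaking space.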

## Training

Now, we are going to train a classifier as usual. We first split the data into a train and test set.

In [None]:
X_train,X_test, y_train,y_test = train_test_split(X,y)

In [None]:
y_test.shape

(987,)

We use a **Bernoulli Naive Bayes classifier**.

In [None]:
bnb =nb.BernoulliNB()

bnb.fit(X_train, y_train);

In [None]:
bnb.score(X_test, y_test)

0.7213779128672746

Now try re-executing the previous three cells. The results should be the same, right?

Well, are they?

Ok, re-execute the same three cells again. Now one more time. Now try the following
piece of code:

#### Basic train and test loop

In [None]:
def split_and_fit(X, y, test_size=.2):
    # Draw a fresh random train/test split, then fit a new classifier on it
    (X_train, X_test,
       y_train, y_test) = train_test_split(X, y,
                                         test_size=test_size)
    bnb = nb.BernoulliNB()
    return bnb.fit(X_train, y_train), X_train, X_test, y_train, y_test

num_runs = 10
for test_run in range(num_runs):
    clf, X_train, X_test, y_train,y_test = split_and_fit(X,y)
    print('{0}'.format(clf.score(X_test, y_test)))

0.7354430379746836
0.7556962025316456
0.7455696202531645
0.740506329113924
0.740506329113924
0.7278481012658228
0.7379746835443038
0.7734177215189874
0.7443037974683544
0.7481012658227848


What's happening? How should we deal with this when we report our evaluations?

Explain the purpose of the code in the next cell.

Ans:

Each random split of the data produces a different accuracy, so a single score is unreliable. It is better to split the data many times and report the average of the scores across runs, which gives a more stable estimate.
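One way to report such results is as a mean together with a measure of spread. The sketch below does this for the ten accuracies printed by the basic train and test loop; it uses only the standard library. (If a single reproducible split is wanted instead, `train_test_split` accepts a `random_state` argument.)

```python
from statistics import mean, stdev

# The ten accuracies printed by the basic train and test loop
scores = [0.7354430379746836, 0.7556962025316456, 0.7455696202531645,
          0.740506329113924, 0.740506329113924, 0.7278481012658228,
          0.7379746835443038, 0.7734177215189874, 0.7443037974683544,
          0.7481012658227848]

print(f"mean accuracy: {mean(scores):.3f}")
print(f"std deviation: {stdev(scores):.3f}")
print(f"range: {min(scores):.3f} to {max(scores):.3f}")
```

The spread between the best and worst run is several percentage points, which is why a difference of a point or two between two single runs should not be read as a real difference between models.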

#### Refined train and test loop

In [None]:
num_runs = 100

# Running totals for accuracy, precision, recall, and proportion of insults
stats = np.zeros((4,))
for test_run in range(num_runs):
    clf, X_train, X_test, y_train, y_test = split_and_fit(X, y)
    predicted = clf.predict(X_test)
    y_array = y_test.values
    prop_insults = y_array.sum() / len(y_array)
    # sklearn metrics take the true labels first, then the predictions
    stats = stats + np.array([clf.score(X_test, y_test),
                              precision_score(y_test, predicted),
                              recall_score(y_test, predicted),
                              prop_insults])
normed_stats = stats / num_runs
labels = ['Accuracy', 'Precision', 'Recall', 'Pct Insults']
for (i, s) in enumerate(normed_stats):
    print(f'{labels[i]} {s:.2f}')

Accuracy 0.75
Precision 0.61
Recall 0.16
Pct Insults 0.26
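Precision and recall are easy to mix up, and the order of the arguments matters: scikit-learn's metric functions expect the true labels first and the predictions second. The sketch below computes both measures by hand on a small made-up example (the labels and the helper name `precision_recall` are hypothetical) and shows that swapping the two arguments makes precision and recall trade places.

```python
def precision_recall(y_true, y_pred, pos_label=1):
    """Compute precision and recall for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == pos_label and p == pos_label for t, p in pairs)
    fp = sum(t != pos_label and p == pos_label for t, p in pairs)
    fn = sum(t == pos_label and p != pos_label for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical labels: 4 real insults, the classifier flags only 2 comments
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 1, 0, 0, 0]

print(precision_recall(y_true, y_pred))   # true labels first
print(precision_recall(y_pred, y_true))   # swapped: the two values trade places
```

Note also that with about 26% insults, a classifier that always predicts "not an insult" would already reach about 74% accuracy, so accuracy alone says little here.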


Let's take a look at the words corresponding to the largest coefficients (the words we find frequently in insulting comments).

In [None]:
dir(bnb)

['__abstractmethods__',
 '__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__setstate__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_abc_impl',
 '_check_X',
 '_check_X_y',
 '_check_alpha',
 '_check_n_features',
 '_count',
 '_estimator_type',
 '_get_param_names',
 '_get_tags',
 '_init_counters',
 '_joint_log_likelihood',
 '_more_tags',
 '_repr_html_',
 '_repr_html_inner',
 '_repr_mimebundle_',
 '_update_class_log_prior',
 '_update_feature_log_prob',
 '_validate_data',
 'alpha',
 'binarize',
 'class_count_',
 'class_log_prior_',
 'class_prior',
 'classes_',
 'coef_',
 'feature_count_',
 'feature_log_prob_',
 'fit',
 'fit_prior',
 'get_params',
 'intercept_',
 'n_features_',
 'n_features_in_',
 '

In [None]:
bnb.feature_count_.shape

(2, 16469)

In [None]:
# We first get the words corresponding to each feature.
# (get_feature_names_out() replaces the older get_feature_names().)
names = np.asarray(tf.get_feature_names_out())
# Next, we display the 50 words with the largest coefficients.
# For a two-class model, row 1 of feature_log_prob_ holds the log probability
# of each word given the insult class (older versions exposed it as coef_[0]).
coefficient_matrix = bnb.feature_log_prob_[1, :]
print(coefficient_matrix.shape)
# Sorting gives us smallest first, we reverse the order and take top 50
top_fifty_feat_indices = np.argsort(coefficient_matrix)[::-1][:50]
print(','.join(names[top_fifty_feat_indices]))

(16469,)
you,your,are,the,to,and,of,that,is,it,in,like,have,on,for,re,not,just,so,an,xa0,idiot,this,up,all,go,fuck,what,with,get,do,be,no,don,but,can,or,if,ass,as,stupid,bitch,about,know,me,because,who,little,my,out
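Most of the words in this list (*the*, *to*, *and*) are simply frequent everywhere, not especially insulting. One possible refinement, sketched below, ranks words by the difference between their log probability in the insult class and in the clean class. The smoothing mirrors the formula scikit-learn documents for Bernoulli naive Bayes, (count + alpha) / (class size + 2 * alpha); the document counts are invented for illustration, and `log_prob` is a hypothetical helper.

```python
import math

def log_prob(doc_count, class_size, alpha=1.0):
    """Smoothed log P(word present | class), as in Bernoulli naive Bayes."""
    return math.log((doc_count + alpha) / (class_size + 2 * alpha))

# Hypothetical document counts: word -> (insulting docs, clean docs)
n_insult, n_clean = 100, 300
doc_counts = {"you": (90, 150), "the": (60, 200),
              "idiot": (30, 2), "thanks": (1, 40)}

# Ranking used above: probability within the insult class only
by_insult_prob = sorted(doc_counts, reverse=True,
                        key=lambda w: log_prob(doc_counts[w][0], n_insult))
# Alternative: how much more likely the word is in insults than elsewhere
by_log_ratio = sorted(doc_counts, reverse=True,
                      key=lambda w: log_prob(doc_counts[w][0], n_insult)
                                    - log_prob(doc_counts[w][1], n_clean))

print(by_insult_prob)
print(by_log_ratio)
```

With the real model, the same comparison could be made between the two rows of `bnb.feature_log_prob_`.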




Finally, let's test our estimator on a few test sentences.


In [None]:
predicted = bnb.predict(tf.transform([
    "I totally agree with you.",
    "You are so stupid.",
    "I love you."
    ]))

print(predicted)

[0 0 0]


In [None]:
print(predicted)
print(y_test[:3])

[0 0 0]
1768    0
2378    0
350     1
Name: Insult, dtype: int64


Not very impressive. The word *stupid* was not recognized as an insult.

> You'll find all the explanations, figures, references, and much more in the book (to be released later this summer).

> [IPython Cookbook](http://ipython-books.github.io/), by [Cyrille Rossant](http://cyrille.rossant.net), Packt Publishing, 2014 (500 pages).

In [None]:
print((bnb.predict(tf.transform([ "I totally agree with you.", "You are so stupid.", "I love you." ]))))

[0 0 0]


## Homework

Read the online book draft chapter about the movie review data,
and try the classifier used there, an SVM, on this data. Be sure
to stick with scikit-learn (it has an SVM implementation).

Show your code, and print out results.  Which classifier does better?

#### Help with getting the movie reviews data.

Execute the next two cells.

In [None]:
import numpy as np
import pandas as pd
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV as gs
import sklearn.feature_extraction.text as text
import sklearn.naive_bayes as nb
import matplotlib.pyplot as plt
from sklearn.metrics import precision_score, recall_score, accuracy_score
%matplotlib inline

In [None]:
import nltk
nltk.download('movie_reviews')

[nltk_data] Downloading package movie_reviews to /root/nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!


True

In [None]:
# Given
from nltk.corpus import movie_reviews as mr

def get_file_strings (corpus, file_ids):
    return [corpus.raw(file_id) for file_id in file_ids]

data = dict(pos = mr.fileids('pos'),
            neg = mr.fileids('neg'))

pos_file_ids = data['pos']
neg_file_ids = data['neg']

pos_file_ids[:5]

['pos/cv000_29590.txt',
 'pos/cv001_18431.txt',
 'pos/cv002_15918.txt',
 'pos/cv003_11664.txt',
 'pos/cv004_11636.txt']

This illustrates how to get all the positive reviews.

In [None]:
# Given
# Storing positive reviews and negative reviews
pos_file_reviews = get_file_strings (mr, pos_file_ids)
neg_file_reviews = get_file_strings (mr, neg_file_ids)


In [None]:
# Given
# Testing
# First 20 characters of first positive review
print(pos_file_reviews[0][:20])
print()
# First 20 characters of second positive review
print(pos_file_reviews[1][:20])


films adapted from c

every now and then a


In [None]:
# Testing
tf = text.TfidfVectorizer()
X = tf.fit_transform(pos_file_reviews)

After executing the code above, the names `pos_file_reviews` and `neg_file_reviews` each contain a list of reviews. Each review is a single raw string (the full text of one file). A list of strings like `pos_file_reviews` can be passed to `text.TfidfVectorizer()` via the `fit_transform` method to train a vectorizer for machine learning.

Just remember: when applying the trained vectorizer to test data, use
`transform` in place of `fit_transform`.
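The reason is that fitting learns the vocabulary, and the test data must be mapped onto the same columns as the training data. The toy class below is a hypothetical stand-in for a real vectorizer (whitespace tokens, raw counts, made-up reviews), written only to show the difference between the two calls.

```python
class ToyCountVectorizer:
    """Hypothetical stand-in for a vectorizer; whitespace tokens, raw counts."""

    def fit(self, docs):
        # Learn the vocabulary: one column per distinct training word
        words = sorted({w for doc in docs for w in doc.lower().split()})
        self.vocabulary_ = {w: i for i, w in enumerate(words)}
        return self

    def transform(self, docs):
        # Count words using the vocabulary learned by fit()
        rows = []
        for doc in docs:
            row = [0] * len(self.vocabulary_)
            for w in doc.lower().split():
                if w in self.vocabulary_:      # unseen words are dropped
                    row[self.vocabulary_[w]] += 1
            rows.append(row)
        return rows

    def fit_transform(self, docs):
        return self.fit(docs).transform(docs)

vec = ToyCountVectorizer()
X_train = vec.fit_transform(["a great film", "a dull film"])
X_test = vec.transform(["a great great sequel"])   # same columns as X_train
print(vec.vocabulary_)
print(X_train)
print(X_test)
```

Calling `fit_transform` on the test data instead would build a new vocabulary, so column 3 might no longer mean *great* and the classifier's learned weights would be applied to the wrong words.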

What you will need to do to train the classifier (call it `clf`) is pass the matrix of vectorized training data
(call it `X`) to `clf`'s
`fit(...)` method, along with an aligned sequence
of labels `y`. By saying the two sequences are aligned, I mean this: row `X[i]` is the vector representation of a review that has the class
`y[i]`.

The steps are

1.  Create training data: a sequence of reviews (the code above did this) and an aligned sequence of review labels (each label is either `pos` or `neg`). The training sequence should be a balanced mix of positive and negative reviews.

2.  Follow the same procedure to create test data. Use 9 times as many training documents as test documents (1000 positive reviews + 1000 negative reviews means 1800 training examples and 200 test examples).

3.  Train and test the models multiple times and take the average precision, recall, and accuracy scores as the measure of your model's performance. The cells above labeled **Basic train and test loop** and **Refined train and test loop** illustrate this step.


The code cell below illustrates one way of
getting randomly mixed data with
aligned labels (steps 1 and 2).

#### Help with steps 1 and 2

In [None]:
# Given
# Testing
# Let's work on letters instead of documents
# There are 2 classes, letters from the first half of the
# alphabet ('f') and letters from the last half ('l')

from random import shuffle
from string import ascii_lowercase
f_lets = ascii_lowercase[:13]
print(f_lets)
l_lets = ascii_lowercase[13:]
print(l_lets)
f_pairs = [(let,'f') for let in f_lets]
l_pairs = [(let,'l') for let in l_lets]
# Way too orderly, the classes aren't mixed yet.
data = f_pairs + l_pairs
shuffle(data)
prepared_data, prepared_labels = zip(*data)
print(prepared_data)
print(prepared_labels)

abcdefghijklm
nopqrstuvwxyz
('j', 'n', 'm', 'k', 'z', 't', 'a', 'i', 'v', 'y', 'c', 'r', 'o', 's', 'w', 'd', 'l', 'b', 'f', 'x', 'e', 'u', 'q', 'h', 'p', 'g')
('f', 'l', 'f', 'f', 'l', 'l', 'f', 'f', 'l', 'l', 'f', 'l', 'l', 'l', 'l', 'f', 'f', 'f', 'f', 'l', 'f', 'l', 'l', 'f', 'l', 'f')
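Once the pairs are shuffled, the 9-to-1 division of step 2 is just a slice. The sketch below continues the letters example; `shuffled_split` is a hypothetical helper, and the fixed `seed` is there only so that the result is repeatable.

```python
from random import Random
from string import ascii_lowercase

def shuffled_split(items, labels, test_fraction=0.1, seed=0):
    """Shuffle (item, label) pairs together, then cut off the tail as test data."""
    pairs = list(zip(items, labels))
    Random(seed).shuffle(pairs)          # pairs stay aligned while shuffling
    n_test = round(len(pairs) * test_fraction)
    cut = len(pairs) - n_test
    return pairs[:cut], pairs[cut:]

letters = list(ascii_lowercase)
labels = ['f' if let < 'n' else 'l' for let in letters]
train, test = shuffled_split(letters, labels, test_fraction=0.1)
print(len(train), len(test))
print(test)

# Separate items from labels again, as in the cell above
train_items, train_labels = zip(*train)
```

Because each letter travels with its label through the shuffle and the slice, the two sequences recovered by `zip(*train)` are still aligned.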


In [None]:
# Homework
from random import shuffle                                # shuffle function from the random module
pos_pair = [(i,'pos') for i in pos_file_reviews]          # Pair each positive review with the label 'pos'.
neg_pair = [(i,'neg') for i in neg_file_reviews]          # Pair each negative review with the label 'neg'.
all_data = pos_pair + neg_pair                            # Combine the positive and negative pairs.

shuffle(all_data)                                         # Shuffle in place so the classes are mixed.
input, output = zip(*all_data)                            # Separate reviews (input) from labels (output); note that `input` shadows the builtin.


In [None]:
# Testing

print(input[:20])
print(output[:20])

('you know something , christmas is not about presents . \nit\'s about over-hyped holiday films with lots of merchandising and product tie-ins . \nat least that would seem to be the message of " the grinch , " which has been advertised since last christmas and whose logo is currently plastered all over stores . \nhollywood expects us to ignore this cynical greed as the movie scolds us about losing the true spirit of the season . \nyou know the plot : there\'s this evil furry green guy called the grinch ( jim carrey ) who lives on a mountain overlooking whoville . \ndown below all the whos are preparing for their whobilation , but the grinch is determined to steal their christmas . \nthe movie is , of course , a live-action version of the beloved children\'s book , which was previously adapted into a 1966 tv special by looney tunes animator chuck jones . \nit\'s rare that a big budget hollywood release is shamed by a thirty-year-old half-hour cartoon , but that\'s the case when jones\' 

In [None]:
# Testing
# from sklearn import preprocessing

# encoder = preprocessing.LabelEncoder()
# output = encoder.fit_transform(output)

###  Summary for step 3:  The basic training of the vectorizer and the classifier.

In [None]:
# Homework continued..
# This cell covers all three steps above.

from sklearn.svm import LinearSVC

num_runs = 9    # Number of random train/test splits to average over

def modelling(clf):
    """Train and test `clf` on `num_runs` random splits; return the summed scores."""
    scores = np.zeros((3,))    # Running totals: accuracy, precision, recall

    for test_run in range(num_runs):
        # test_size=0.2 on 2000 reviews gives 1600 training and 400 test documents
        x_train, x_test, y_train, y_test = train_test_split(input, output, test_size=0.2)
        tf = text.TfidfVectorizer()      # New vectorizer for each split
        X = tf.fit_transform(x_train)    # Learn vocabulary and weights from the training data only
        X_test = tf.transform(x_test)    # Reuse that vocabulary for the test data
        clf = clf.fit(X, y_train)        # Fit the model to the training data
        predicted = clf.predict(X_test)  # Predict labels for the test data
        scores = scores + np.array([accuracy_score(y_test, predicted),
                                    precision_score(y_test, predicted, pos_label='pos'),
                                    recall_score(y_test, predicted, pos_label='pos')])
    return scores


In [None]:
# Train and evaluate a linear support vector machine classifier.
# To try another classifier, change only `classification_model`.
classification_model = LinearSVC(loss='squared_hinge', penalty="l2",
                dual=False, tol=1e-3)
scores = modelling(classification_model)     # Summed accuracy, precision, and recall over all runs
normed_scores = scores/num_runs              # Average each score over the runs
labels = ['Accuracy','Precision','Recall']   # Row labels for printing
for (i,s) in enumerate(normed_scores):
    print(f' Linear SVC: {labels[i]} {s:.4f}')

 Linear SVC: Accuracy 0.8528
 Linear SVC: Precision 0.8440
 Linear SVC: Recall 0.8631


Ans: 

I trained a linear support vector classifier, which achieves approximately 85% accuracy, averaged over 9 random splits.