# Part 1.

The deadline for Part 1 is **1:30 pm Feb 6th, 2020**.   
You should submit a `.ipynb` file with your solutions to NYU Classes.

---


In this part we will preprocess the SMS Spam Collection Dataset and train a bag-of-words classifier (logistic regression) for spam detection. 

## Data Loading

First, we download the SMS Spam Collection Dataset. The dataset is taken from [Kaggle](https://www.kaggle.com/uciml/sms-spam-collection-dataset/data#) and was uploaded to [Google Drive](https://drive.google.com/open?id=1OVRo37agn02mc6yp5p6-wtJ8Hyb-YMXR) so that everyone can access it.

In [2]:
!wget 'https://docs.google.com/uc?export=download&id=1OVRo37agn02mc6yp5p6-wtJ8Hyb-YMXR' -O spam.csv 

--2020-02-12 16:08:48--  https://docs.google.com/uc?export=download&id=1OVRo37agn02mc6yp5p6-wtJ8Hyb-YMXR
Resolving docs.google.com (docs.google.com)... 74.125.20.138, 74.125.20.102, 74.125.20.101, ...
Connecting to docs.google.com (docs.google.com)|74.125.20.138|:443... connected.
HTTP request sent, awaiting response... 302 Moved Temporarily
Location: https://doc-14-04-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/p075sendvfo9q0te2bo28bvh5ad7oe4i/1581523200000/08752484438609855375/*/1OVRo37agn02mc6yp5p6-wtJ8Hyb-YMXR?e=download [following]
--2020-02-12 16:08:48--  https://doc-14-04-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/p075sendvfo9q0te2bo28bvh5ad7oe4i/1581523200000/08752484438609855375/*/1OVRo37agn02mc6yp5p6-wtJ8Hyb-YMXR?e=download
Resolving doc-14-04-docs.googleusercontent.com (doc-14-04-docs.googleusercontent.com)... 74.125.20.132, 2607:f8b0:400e:c07::84
Connecting to doc-14-04-docs.googleusercontent.com (doc-14-04-docs.g

In [3]:
!ls

sample_data  spam.csv


There are two columns: `v1` -- spam or ham indicator, `v2` -- text of the message.

In [4]:
import pandas as pd
import numpy as np

df = pd.read_csv("spam.csv", usecols=["v1", "v2"], encoding='latin-1')
# 1 - spam, 0 - ham
df.v1 = (df.v1 == "spam").astype("int")
df.head()

|   | v1 | v2 |
|---|----|----|
| 0 | 0 | Go until jurong point, crazy.. Available only ... |
| 1 | 0 | Ok lar... Joking wif u oni... |
| 2 | 1 | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | 0 | U dun say so early hor... U c already then say... |
| 4 | 0 | Nah I don't think he goes to usf, he lives aro... |


Your task is to split the data into train/dev/test sets. Make sure that each row appears in only one of the splits.

In [5]:
df['v2']


0       Go until jurong point, crazy.. Available only ...
1                           Ok lar... Joking wif u oni...
2       Free entry in 2 a wkly comp to win FA Cup fina...
3       U dun say so early hor... U c already then say...
4       Nah I don't think he goes to usf, he lives aro...
                              ...                        
5567    This is the 2nd time we have tried 2 contact u...
5568                Will Ì_ b going to esplanade fr home?
5569    Pity, * was in mood for that. So...any other s...
5570    The guy did some bitching but I acted like i'd...
5571                           Rofl. Its true to its name
Name: v2, Length: 5572, dtype: object

In [0]:
# 0.15 for val, 0.15 for test, 0.7 for train
val_size = int(df.shape[0] * 0.15)
test_size = int(df.shape[0] * 0.15)

"""
split the dataset into train/val/test
"""

df_copy = df.copy()
train_set = df_copy.sample(frac=0.7, random_state=10)
rest = df_copy.drop(train_set.index)
val_set = rest.sample(frac=0.5, random_state=10)
test_set = rest.drop(val_set.index)


train_texts, train_labels = train_set['v2'], train_set['v1']
val_texts, val_labels     = val_set['v2'], val_set['v1']
test_texts, test_labels   = test_set['v2'], test_set['v1']
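
One way to confirm that each row lands in exactly one split is to compare the row indices of the three splits. The sketch below is a minimal, self-contained illustration in plain Python; `check_disjoint_splits` is a hypothetical helper, not part of the assignment, and the toy index lists stand in for `train_set.index`, `val_set.index`, and `test_set.index`.

```python
import random

def check_disjoint_splits(n_rows, *splits):
    """Return True if the splits are pairwise disjoint and together cover all n_rows indices."""
    seen = set()
    for split in splits:
        split = set(split)
        if seen & split:          # an index already used by an earlier split
            return False
        seen |= split
    return seen == set(range(n_rows))

# Toy stand-in for the 70/15/15 split: shuffle the row indices, then slice.
rng = random.Random(10)
indices = list(range(20))
rng.shuffle(indices)
train_idx, val_idx, test_idx = indices[:14], indices[14:17], indices[17:]

print(check_disjoint_splits(20, train_idx, val_idx, test_idx))  # True
```

Assuming the DataFrame keeps its default integer index, the same check should apply to the notebook's splits as `check_disjoint_splits(len(df), train_set.index, val_set.index, test_set.index)`.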

In [7]:
train_texts.iloc[3899]

"Hi I'm sue. I am 20 years old and work as a lapdancer. I love sex. Text me live - I'm i my bedroom now. text SUE to 89555. By TextOperator G2 1DA 150ppmsg 18+"

In [8]:
len(train_set)

3900

## Data Processing

The task is to create bag-of-words features: tokenize the text, index each token, represent each message as a dictionary of tokens and their counts, and limit the vocabulary to the $n$ most frequent tokens. In the lab we used the built-in `sklearn` class `sklearn.feature_extraction.text.CountVectorizer`. 
**In this HW, you are required to implement the `Vectorizer` on your own without using `sklearn` built-in functions.**

Function `preprocess_data` takes a list of texts and returns a list of token lists, one per message. 
You may use the [spacy](https://spacy.io/) or [nltk](https://www.nltk.org/) text processing libraries in the `preprocess_data` function. 

Class `Vectorizer` is used to vectorize the text and to create a matrix of features.
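
As a rough illustration of the steps above (index the tokens, count them per message, keep the $n$ most frequent), here is a minimal sketch on a toy corpus. The function name `bag_of_words` and the toy messages are assumptions made for illustration; this is not the required `Vectorizer` implementation.

```python
from collections import Counter

def bag_of_words(tokenized_texts, n):
    """Build an n-token vocabulary and per-message count vectors."""
    # Count every token across the whole corpus.
    totals = Counter(tok for text in tokenized_texts for tok in text)
    # Keep the n most frequent tokens and give each a column index.
    vocab = [tok for tok, _ in totals.most_common(n)]
    index = {tok: i for i, tok in enumerate(vocab)}
    rows = []
    for text in tokenized_texts:
        row = [0] * len(vocab)
        for tok in text:
            if tok in index:       # tokens outside the vocabulary are dropped
                row[index[tok]] += 1
        rows.append(row)
    return vocab, rows

texts = [["free", "entry", "free", "prize"], ["call", "free"], ["call", "later"]]
vocab, rows = bag_of_words(texts, n=2)
print(vocab)  # ['free', 'call']
print(rows)   # [[2, 0], [1, 1], [0, 1]]
```

The dictionary lookup keeps each token's column access at constant cost, which matters once the vocabulary grows to thousands of tokens.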


In [9]:
import string
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')

# Build the stop word set once so it is not rebuilt for every word
stop_words = set(stopwords.words('english'))

def preprocess_data(data):
    # This function should return a list of lists of preprocessed tokens for each message
    preprocessed_data = []
    for text in data:
        # remove the punctuation
        nonpunct = ''.join(char for char in text if char not in string.punctuation)
        # remove the stop words
        tokens = [word for word in nonpunct.split() if word.lower() not in stop_words]
        preprocessed_data.append(tokens)
    return preprocessed_data

train_data = preprocess_data(train_texts)
val_data = preprocess_data(val_texts)
test_data = preprocess_data(test_texts)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [10]:
train_data[:5]

[['K', 'k', 'pa', 'lunch', 'aha'],
 ['Sorry', 'Ill', 'call', 'later', 'meeting'],
 ['Never',
  'try',
  'alone',
  'take',
  'weight',
  'tear',
  'comes',
  'ur',
  'heart',
  'falls',
  'ur',
  'eyes',
  'Always',
  'remember',
  'STUPID',
  'FRIEND',
  'share',
  'BSLVYL'],
 ['Hey', 'happy', 'birthday'],
 ['Ill', 'hand', 'phone', 'chat', 'wit', 'u']]

In [0]:
# data = train_data[:5]
# vocab = []
# for line in data:
#   for word in line:
#     if word not in vocab:
#       vocab.append(word)
# # print(data)
# print(len(vocab))

# flat_list = [item for sublist in data for item in sublist]

# num = [[x,flat_list.count(x)] for x in set(vocab)]
# max_fea = 10
# print(num)
# vo = []
# for i in range(max_fea):
#   vo.append(num[i][0])
  
# print(vo)
# vo.index('ur')
# only keeps the most 

# import numpy as np
# matrix = np.zeros((len(data), len(vocab)))

# print(data)
# print(vocab)

# row = 0
# for sent in data:
#   for sw in sent:
#     for j, word in enumerate(vocab):
#       if word == sw:
#         matrix[row,j] += 1
#   row += 1
# matrix 

In [0]:
import numpy as np
from collections import Counter

class Vectorizer():
    def __init__(self, max_features):
        self.max_features = max_features
        self.vocab_list = []
        self.token_to_index = None

    def fit(self, dataset):
        # Create a vocab list, self.vocab_list, using the most frequent "max_features" tokens
        counts = Counter(token for sentence in dataset for token in sentence)
        self.vocab_list = [token for token, _ in counts.most_common(self.max_features)]
        # Token indexer: maps each vocabulary token to its column index
        self.token_to_index = {token: idx for idx, token in enumerate(self.vocab_list)}

    def token_2_index(self, token):
        # Return the index of the token in self.vocab_list (KeyError for unknown tokens)
        return self.token_to_index[token]

    def transform(self, dataset):
        # This function transforms text dataset into a matrix, data_matrix
        data_matrix = np.zeros((len(dataset), len(self.vocab_list)))
        for row, sentence in enumerate(dataset):
            for token in sentence:
                col = self.token_to_index.get(token)
                if col is not None:  # skip tokens outside the vocabulary
                    data_matrix[row, col] += 1
        return data_matrix

In [13]:
def count_word(dataset):
    # Number of distinct tokens in the dataset
    return len({word for sentence in dataset for word in sentence})


# Fraction of the distinct training tokens to keep as features (1.0 keeps all of them)
frequent_rate = 1.
all_word = count_word(train_data)
max_features = int(frequent_rate * all_word)
print('selected max feature num:{}, {} of the {} dataword'.format(max_features, frequent_rate, all_word))


vectorizer = Vectorizer(max_features=max_features)
vectorizer.fit(train_data)
X_train = vectorizer.transform(train_data)
X_val = vectorizer.transform(val_data)
X_test = vectorizer.transform(test_data)

y_train = np.array(train_labels)
y_val = np.array(val_labels)
y_test = np.array(test_labels)

vocab = vectorizer.vocab_list


selected max feature num:9215, 1.0 of the 9215 dataword


You can add more features to the feature matrix.
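
One possible way to add features, sketched under assumptions: the hand-crafted features below (message length, digit count, share of uppercase letters) are illustrative choices, not ones prescribed by the assignment, and `extra_features` is a hypothetical helper. The resulting columns can be appended to the bag-of-words matrix with `np.hstack`.

```python
import numpy as np

def extra_features(texts):
    """Return an (n_messages, 3) array: length, digit count, uppercase share."""
    feats = []
    for text in texts:
        letters = [c for c in text if c.isalpha()]
        upper_share = sum(c.isupper() for c in letters) / len(letters) if letters else 0.0
        feats.append([len(text), sum(c.isdigit() for c in text), upper_share])
    return np.array(feats, dtype=float)

texts = ["WIN cash NOW call 0906", "ok see you"]
bow = np.array([[1.0, 0.0], [0.0, 1.0]])       # stand-in for a bag-of-words matrix
X = np.hstack([bow, extra_features(texts)])   # append the extra columns
print(X.shape)  # (2, 5)
```

These features are computed from the raw message text rather than from the preprocessed tokens, since punctuation removal and stop word filtering would discard some of the signal. Because they live on a different scale than token counts, rescaling them may help; that has not been tested here.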

## Model

We train a logistic regression model and save its predictions for the train, val, and test sets.


In [0]:
from sklearn.linear_model import LogisticRegression

# Define Logistic Regression model
model = LogisticRegression(random_state=0, solver='liblinear')

# Fit the model to training data
model.fit(X_train, y_train)

# Make prediction using the trained model
y_train_pred = model.predict(X_train)
y_val_pred = model.predict(X_val)
y_test_pred = model.predict(X_test)

## Performance of the model

Your task is to report train, val, test accuracies and F1 scores.
**You are required to implement `accuracy_score` and `f1_score` methods without using built-in python functions.**

Your model should achieve at least **0.95** test accuracy and **0.90** test F1 score.

In [0]:
def accuracy_score(y_true, y_pred): 
    # Calculate accuracy of the model's prediction
    err = np.mean(np.abs(y_true - y_pred))
    accuracy = 1. - err
    return accuracy

def f1_score(y_true, y_pred): 
    # Calculate F1 score of the model's prediction
    # precision: out of all positive predictions, what fraction is correct
    # recall: out of all positive samples, what fraction is predicted positive
    precision = np.sum(y_true * y_pred) / np.sum(y_pred)
    recall = np.sum(y_true * y_pred) / np.sum(y_true)

    f1 = 2 * (precision * recall) / (precision + recall)
    return f1
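
For reference, the same quantities can be written in terms of confusion counts (TP, FP, FN). The following is a minimal sketch with a hypothetical helper `f1_from_counts`; it additionally returns 0 when there are no true positives, which is one common convention rather than something the assignment specifies.

```python
def f1_from_counts(y_true, y_pred):
    """F1 for the positive class (label 1) computed from TP, FP and FN."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0   # no true positives: precision or recall is 0 (or undefined)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
print(round(f1_from_counts(y_true, y_pred), 3))  # 0.667
```

Here TP = 2, FP = 1, and FN = 1, so precision and recall are both 2/3 and F1 is 2/3 as well.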

In [16]:
print(f"Training accuracy: {accuracy_score(y_train, y_train_pred):.3f}, "
      f"F1 score: {f1_score(y_train, y_train_pred):.3f}")
print(f"Validation accuracy: {accuracy_score(y_val, y_val_pred):.3f}, "
      f"F1 score: {f1_score(y_val, y_val_pred):.3f}")
print(f"Test accuracy: {accuracy_score(y_test, y_test_pred):.3f}, "
      f"F1 score: {f1_score(y_test, y_test_pred):.3f}")

Training accuracy: 0.995, F1 score: 0.983
Validation accuracy: 0.981, F1 score: 0.927
Test accuracy: 0.980, F1 score: 0.909


**Question.**
Is accuracy the metric that logistic regression optimizes while training? If not, which metric is optimized in logistic regression?

**Your answer:** No. Logistic regression minimizes the logistic loss (binary cross-entropy), not the classification error.
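
To make the distinction concrete, here is a small sketch. The probabilities are made-up numbers, and `log_loss` and `accuracy` are hand-written helpers, not library calls. Two sets of predicted probabilities give the same accuracy at a 0.5 threshold but different logistic loss, which is the quantity the optimizer actually reduces.

```python
import math

def log_loss(y_true, probs):
    """Mean logistic loss (binary cross-entropy) for labels in {0, 1}."""
    return -sum(
        y * math.log(p) + (1 - y) * math.log(1 - p)
        for y, p in zip(y_true, probs)
    ) / len(y_true)

def accuracy(y_true, probs, threshold=0.5):
    """Fraction of correct labels after thresholding the probabilities."""
    return sum((p >= threshold) == bool(y) for y, p in zip(y_true, probs)) / len(y_true)

y_true = [1, 0, 1, 0]
hesitant = [0.6, 0.4, 0.6, 0.4]     # correct, but barely
confident = [0.9, 0.1, 0.9, 0.1]    # correct and confident

print(accuracy(y_true, hesitant), accuracy(y_true, confident))   # 1.0 1.0
print(round(log_loss(y_true, hesitant), 3), round(log_loss(y_true, confident), 3))  # 0.511 0.105
```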

**Question.**
In general, does having 0.99 accuracy on the test set mean that the model is great? If not, can you give an example of a case when the accuracy is high but the model is not good? (Hint: why do we use F1 score?)

**Your answer:** In general, a model with 0.99 accuracy is a good one, but not always. Suppose the test set contains 99% negative samples (ham) and 1% positive samples (spam), and the model simply classifies every sample as negative. Such a model has no ability to detect spam, yet it still achieves 99% accuracy. The high accuracy is an artifact of the class imbalance. The F1 score exposes the problem: recall on the positive class is 0, so the F1 score is 0 as well. 
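
A quick numeric check of this effect, as a sketch on synthetic labels. The 990/10 class split, with spam as the rare positive class, and the always-ham predictor are illustrative assumptions, not results from this notebook.

```python
import numpy as np

# Synthetic test set: 990 ham (0) and 10 spam (1).
y_true = np.array([0] * 990 + [1] * 10)
# A degenerate "classifier" that labels every message as ham.
y_pred = np.zeros_like(y_true)

accuracy = np.mean(y_true == y_pred)
tp = np.sum((y_true == 1) & (y_pred == 1))
recall = tp / np.sum(y_true == 1)

print(accuracy)  # 0.99
print(recall)    # 0.0
```

With no true positives, the F1 score for the spam class is 0 under the usual convention, even though accuracy is 0.99.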

### Exploration of predictions

Show a few examples with true+predicted labels on the train and val sets.

In [41]:
# 1 - spam, 0 - ham
print('------------- train data set ------------')
print('')
for i in range(3):
  print('text:', train_set.iloc[i]['v2'])
  print('true:{}, predict:{}'.format('ham' if train_set.iloc[i]['v1'] < 0.5 else 'spam',
                                     'ham' if y_train_pred[i] < 0.5 else 'spam'))
  print('')

print('')

print('------------- validation data set ------------')
print('')
for i in range(3):
  print('text:', val_set.iloc[i]['v2'])
  print('true:{}, predict:{}'.format('ham' if val_set.iloc[i]['v1'] < 0.5 else 'spam',
                                     'ham' if y_val_pred[i] < 0.5 else 'spam'))
  print('')


------------- train data set ------------

text: K k pa Had your lunch aha.
true:ham, predict:ham

text: Sorry, I'll call later in meeting
true:ham, predict:ham

text: Never try alone to take the weight of a tear that comes out of ur heart and falls through ur eyes... Always remember a STUPID FRIEND is here to share... BSLVYL
true:ham, predict:ham


------------- validation data set ------------

text: complimentary 4 STAR Ibiza Holiday or å£10,000 cash needs your URGENT collection. 09066364349 NOW from Landline not to lose out! Box434SK38WP150PPM18+
true:spam, predict:ham

text: Sorry completely forgot * will pop em round this week if your still here?
true:ham, predict:ham

text: When did dad get back.
true:ham, predict:ham



**Question** Print 10 examples from val set which were labeled incorrectly by the model. Why do you think the model got them wrong?




**Your answer:** 

In [34]:
wrong_index = np.where(np.abs(y_val_pred - y_val)>0.5)

count = 1
for i in wrong_index[0][:10]:
  print('example ', count)
  count += 1
  print(val_set.iloc[i]['v2'])
  print('class:', val_set.iloc[i]['v1'])
  print(' ')

example  1
complimentary 4 STAR Ibiza Holiday or å£10,000 cash needs your URGENT collection. 09066364349 NOW from Landline not to lose out! Box434SK38WP150PPM18+
class: 1
 
example  2
You have an important customer service announcement from PREMIER.
class: 1
 
example  3
Save money on wedding lingerie at www.bridal.petticoatdreams.co.uk Choose from a superb selection with national delivery. Brought to you by WeddingFriend
class: 1
 
example  4
Win the newest ÛÏHarry Potter and the Order of the Phoenix (Book 5) reply HARRY, answer 5 questions - chance to be the first among readers!
class: 1
 
example  5
Babe: U want me dont u baby! Im nasty and have a thing 4 filthyguys. Fancy a rude time with a sexy bitch. How about we go slo n hard! Txt XXX SLO(4msgs)
class: 1
 
example  6
PRIVATE! Your 2003 Account Statement for 078
class: 1
 
example  7
Hi babe its Chloe, how r u? I was smashed on saturday night, it was great! How was your weekend? U been missing me? SP visionsms.com Text stop to s

**Reason analysis**: First of all, because I use a bag-of-words representation to vectorize the text, the classification depends largely on which words appear in the message. Thus, if a spam message contains no words that are typical of spam, it will not be classified as spam. Secondly, I did not normalize capitalization, so the token 'PREMIER' is treated as different from 'premier', and some spam messages escape the classifier. Also, websites and telephone numbers are not handled specially by this model.
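
A possible follow-up, sketched under assumptions: lowercase the text and replace URLs and long digit runs with placeholder tokens before counting. The regular expressions and the placeholder names `urltoken` and `numtoken` are illustrative choices, and this variant has not been evaluated in this notebook.

```python
import re

URL_RE = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)
NUMBER_RE = re.compile(r"\d{5,}")   # long digit runs, e.g. phone numbers

def normalize(text):
    """Lowercase the text and replace URLs / long numbers with placeholder tokens."""
    text = URL_RE.sub(" urltoken ", text)
    text = NUMBER_RE.sub(" numtoken ", text)
    return text.lower().split()

print(normalize("Call 09066364349 NOW or visit www.example.com"))
# ['call', 'numtoken', 'now', 'or', 'visit', 'urltoken']
```

Mapping every URL or phone number to one shared token lets the classifier learn from the presence of such items, instead of treating each distinct number as a separate, rarely seen token.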

## End of Part 1.
