## K Nearest Neighbors Model for the Product Safety Dataset

In this notebook, we build a K Nearest Neighbors model to predict the __human_tag__ field of the dataset.

Use the notebooks from the class to implement the model, then train and test it with the corresponding datasets. You will use a __classifier__. Submissions are ranked by F1 score. Sklearn provides the [__f1_score():__](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html) function if you want to see how your model performs on your training or validation set.

You can follow these steps:
1. Read the training and test data (given)
2. Train a KNN classifier
3. Make predictions on your test dataset
4. Write your test predictions to a CSV file

__You can use the KNN Classifier from here: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html__

## 1. Reading the dataset

We will use the __pandas__ library to read our dataset. Let's first run the following credential cell and then download the files.

#### __Training data:__

In [1]:
# import the datasets
import boto3
import os
from os import path
import pandas as pd

bucketname = 'mlu-courses-datalake' # replace with your bucket name
filename1 = 'MLA-NLP/data/final_project/training.csv' # replace with your object key
filename2 = 'MLA-NLP/data/final_project/test.csv' # replace with your object key
pathname = './data/final_project'
s3 = boto3.resource('s3')
# Create the local data directory if it does not exist yet
if not path.exists(pathname):
    try:
        os.makedirs(pathname)
    except OSError:
        # Report the directory name (pathname), not the os.path module
        print("Creation of the directory %s failed" % pathname)

s3.Bucket(bucketname).download_file(filename1, './data/training.csv')
s3.Bucket(bucketname).download_file(filename2, './data/test.csv')
print("Successfully created the directory %s " % pathname)

Successfully created the directory ./data/final_project 
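
The directory check in the cell above can also be written with the `exist_ok` flag of `os.makedirs`. The following is a minimal sketch of that alternative, not the notebook's own code; it uses a temporary base directory (an assumption made here so the example does not touch the real `./data` folder).

```python
import os
import tempfile

# Hypothetical base directory so the sketch does not touch the notebook's ./data folder
base_dir = tempfile.mkdtemp()
target_dir = os.path.join(base_dir, "data", "final_project")

# exist_ok=True makes the call safe to repeat: no error if the directory already exists
os.makedirs(target_dir, exist_ok=True)
os.makedirs(target_dir, exist_ok=True)

print("Directory ready:", os.path.isdir(target_dir))
```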


In [2]:
import pandas as pd

train_df = pd.read_csv('./data/training.csv', encoding='utf-8', header=0)
train_df.head()

Unnamed: 0,ID,doc_id,text,date,star_rating,title,human_tag
0,47490,15808037321,"I ordered a sample of the Dietspotlight Burn, ...",6/25/2018 17:51,1,DO NOT BUY!,0
1,16127,16042300811,This coffee tasts terrible as if it got burnt ...,2/8/2018 15:59,2,Coffee not good,0
2,51499,16246716471,I've been buying lightly salted Planters cashe...,3/22/2018 17:53,2,"Poor Quality - Burnt, Shriveled Nuts With Blac...",0
3,36725,14460351031,This product is great in so many ways. It goes...,12/7/2017 8:49,4,"Very lovey product, good sunscreen, but strong...",0
4,49041,15509997211,"My skin did not agree with this product, it wo...",3/21/2018 13:51,1,Not for everyone. Reactions can be harsh.,1


#### __Test data:__

In [3]:
import pandas as pd

test_df = pd.read_csv('./data/test.csv', encoding='utf-8', header=0)
test_df.head()

Unnamed: 0,ID,doc_id,text,date,star_rating,title
0,62199,15449606311,"Quality of material is great, however, the bac...",3/7/2018 19:47,3,great backpack with strange fit
1,76123,15307152511,The product was okay but wasn't refined campho...,43135.875,2,Not refined
2,78742,12762748321,I normally read the reviews before buying some...,42997.37708,1,"Doesnt work, wouldnt recommend"
3,64010,15936405041,These pads are completely worthless. The light...,43313.25417,1,The lighter colored side of the pads smells li...
4,17058,13596875291,The saw works great but the blade oiler does n...,12/5/2017 20:17,2,The saw works great but the blade oiler does n...


In [4]:
train_df["human_tag"].value_counts()

0    53375
1     9759
Name: human_tag, dtype: int64
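
The counts above show that most training rows are labeled 0. As a rough illustration of why the competition ranks by F1 rather than accuracy, the sketch below scores a hypothetical baseline that always predicts 0, using only the two class counts printed above. The baseline itself is an assumption for illustration, not part of the assignment.

```python
from sklearn.metrics import accuracy_score, f1_score

# Class counts copied from the value_counts() output above
n_negative, n_positive = 53375, 9759

# True labels with the same class balance as the training set
y_true = [0] * n_negative + [1] * n_positive

# Hypothetical baseline: always predict the majority class (0)
y_pred = [0] * len(y_true)

baseline_accuracy = accuracy_score(y_true, y_pred)
# zero_division=0 avoids a warning when no positive predictions are made
baseline_f1 = f1_score(y_true, y_pred, zero_division=0)

print(f"Accuracy: {baseline_accuracy:.3f}")
print(f"F1 (positive class): {baseline_f1:.3f}")
```

The baseline looks strong on accuracy but never finds a positive example, which the F1 score makes visible.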

## 2. Train a KNN Classifier
Here, you will apply the pre-processing operations covered in class, then split your dataset into training and validation sets. For your first submission, you will use the __K Nearest Neighbors Classifier__, available [here](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html). The competition is ranked by F1 score; in sklearn, you can use the [__f1_score():__](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html) function to check your F1 score on your training or validation set.

In [12]:
# Implement this
import nltk, re

nltk.download('punkt')
nltk.download('stopwords')

from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
from nltk.tokenize import word_tokenize

# Let's get a list of stop words from the NLTK library
stop = stopwords.words('english')

# These words are important for our problem. We don't want to remove them.
excluding = ['against', 'not', 'don', "don't",'ain', 'aren', "aren't", 'couldn', "couldn't",
             'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 
             'haven', "haven't", 'isn', "isn't", 'mightn', "mightn't", 'mustn', "mustn't",
             'needn', "needn't",'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', 
             "weren't", 'won', "won't", 'wouldn', "wouldn't"]

# New stop word list
stop_words = [word for word in stop if word not in excluding]

snow = SnowballStemmer('english')

def process_text(texts): 
    final_text_list=[]
    for sent in texts:
        
        # Check if the sentence is a missing value
        if not isinstance(sent, str):
            sent = ""
            
        filtered_sentence=[]
        
        sent = sent.lower() # Lowercase 
        sent = sent.strip() # Remove leading/trailing whitespace
        sent = re.sub(r'\s+', ' ', sent) # Collapse runs of spaces and tabs (raw string avoids an invalid escape)
        sent = re.compile('<.*?>').sub('', sent) # Remove HTML tags/markup
        
        for w in word_tokenize(sent):
            # We are applying some custom filtering here, feel free to try different things
            # Check if it is not numeric and its length>2 and not in stop words
            if(not w.isnumeric()) and (len(w)>2) and (w not in stop_words):  
                # Stem and add to filtered list
                filtered_sentence.append(snow.stem(w))
        final_string = " ".join(filtered_sentence) #final string of cleaned words
 
        final_text_list.append(final_string)
        
    return final_text_list




[nltk_data] Downloading package punkt to /home/ec2-user/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /home/ec2-user/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
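
To see what the string-level steps of `process_text` do before tokenizing and stemming, here is a small standalone sketch that needs no NLTK downloads. The helper name `basic_clean` and the sample sentence are hypothetical and used only for illustration.

```python
import re

def basic_clean(sent):
    """Apply the string-level steps of process_text (no tokenizing or stemming)."""
    sent = sent.lower()                 # lowercase
    sent = sent.strip()                 # trim leading/trailing whitespace
    sent = re.sub(r'\s+', ' ', sent)    # collapse whitespace runs
    sent = re.sub(r'<.*?>', '', sent)   # drop HTML tags (non-greedy match)
    return sent

cleaned = basic_clean("  This <b>coffee</b>\t tastes   BURNT  ")
print(repr(cleaned))
```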


In [13]:
from sklearn.model_selection import train_test_split


X_train, X_val, y_train, y_val = train_test_split(train_df["text"],
                                                  train_df["human_tag"],
                                                  test_size=0.10,
                                                  shuffle=True,
                                                  random_state=324
                                                 )
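
Because the classes are imbalanced, one option worth trying is a stratified split, which keeps the share of positive labels the same in the training and validation sets. The sketch below shows the `stratify` argument of `train_test_split` on a toy DataFrame; the toy data is an assumption standing in for `train_df`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for train_df (hypothetical data, 20% positive class)
toy_df = pd.DataFrame({
    "text": [f"review {i}" for i in range(100)],
    "human_tag": [1] * 20 + [0] * 80,
})

X_tr, X_va, y_tr, y_va = train_test_split(
    toy_df["text"],
    toy_df["human_tag"],
    test_size=0.10,
    shuffle=True,
    random_state=324,
    stratify=toy_df["human_tag"],  # keep the class ratio equal in both splits
)

print("Positive share (train):", y_tr.mean())
print("Positive share (val):  ", y_va.mean())
```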



In [17]:
X_train.head()

13635    Maybe I don't know the true definition of &#34...
1058     fingers burnt thru in less than a month...the ...
17710    Please do not purchase this item. It is very d...
38394    Can someone help. First time I cut it tugged a...
15333    These were terrible.  Both bags were burnt and...
Name: text, dtype: object
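
The first review above contains `&#34`, an HTML character reference for a double quote. If you want such references decoded before tokenizing, the standard library's `html.unescape` is one option. This is a minimal sketch on a made-up sentence modeled on that review, not a step the notebook already performs.

```python
import html

# Sample string modeled on the truncated review shown above (hypothetical text)
raw = "Maybe I don't know the true definition of &#34;fresh&#34;"

# html.unescape converts numeric and named character references to characters
decoded = html.unescape(raw)
print(decoded)
```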

In [19]:
print("Processing the text fields")
train_text_list = process_text(X_train.tolist())
val_text_list = process_text(X_val.tolist())



Processing the text fields


In [20]:
# We are using the lists of processed text fields
X_train = train_text_list
X_val = val_text_list

In [21]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import KNeighborsClassifier

### PIPELINE ###
##########################

pipeline = Pipeline([
    ('text_vect', CountVectorizer(binary=True,
                                  max_features=50)),
    ('knn', KNeighborsClassifier())  
                                ])

# Visualize the pipeline
# This will come in handy especially when building more complex pipelines, stringing together multiple preprocessing steps
from sklearn import set_config
set_config(display='diagram')
pipeline
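
To make the `CountVectorizer(binary=True, max_features=...)` step more concrete, the sketch below applies it to a three-document toy corpus with `max_features=3`. The corpus and the smaller feature limit are assumptions chosen so the output is easy to read.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (hypothetical, already lowercased)
docs = [
    "burnt smell burnt taste",
    "great taste",
    "burnt plastic smell",
]

# binary=True records presence (1) or absence (0) instead of raw counts;
# max_features keeps only the most frequent terms across the corpus
vectorizer = CountVectorizer(binary=True, max_features=3)
matrix = vectorizer.fit_transform(docs)

vocab = sorted(vectorizer.vocabulary_)
print(vocab)
print(matrix.toarray())
```

Note that "burnt" appears twice in the first document but is still encoded as 1, and the rarer words "great" and "plastic" are dropped by the feature limit.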





In [23]:
# Fit the Pipeline to training data
pipeline.fit(X_train, y_train.values)
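
One way to choose K is to fit the same pipeline with several `n_neighbors` values and compare F1 scores on the validation set. The sketch below does this on tiny toy lists; the data and the candidate K values are assumptions, and in the notebook you would use `X_train`, `y_train`, `X_val` and `y_val` instead.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import f1_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Toy stand-ins for the processed training/validation lists (hypothetical data)
toy_train = ["burn skin rash", "skin burn", "rash itch burn", "great tast",
             "love product", "good coffe", "nice smell", "tast great love"]
toy_y_train = [1, 1, 1, 0, 0, 0, 0, 0]
toy_val = ["burn rash", "great coffe", "skin itch", "love smell"]
toy_y_val = [1, 0, 1, 0]

scores = {}
for k in (1, 3, 5):  # candidate K values, chosen arbitrarily for illustration
    candidate = Pipeline([
        ('text_vect', CountVectorizer(binary=True, max_features=50)),
        ('knn', KNeighborsClassifier(n_neighbors=k)),
    ])
    candidate.fit(toy_train, toy_y_train)
    # zero_division=0 avoids a warning if a candidate predicts no positives
    scores[k] = f1_score(toy_y_val, candidate.predict(toy_val), zero_division=0)

# Pick the K with the highest validation F1
best_k = max(scores, key=scores.get)
print(scores, "best K:", best_k)
```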

## 3. Make predictions on your test dataset

Once we select our best performing model, we can use it to make predictions on the test dataset. Call __.fit()__ on your training data with the best performing K value, then call __.predict()__ on your test data to get your test predictions.

In [27]:
# Implement this
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

# Optional: check the fitted pipeline on the validation dataset first
# val_predictions = pipeline.predict(X_val)
# print(confusion_matrix(y_val.values, val_predictions))
# print(classification_report(y_val.values, val_predictions))
# print("Accuracy (validation):", accuracy_score(y_val.values, val_predictions))

# Use the fitted pipeline to make predictions on the test dataset
test_predictions = pipeline.predict(process_text(test_df["text"].values.tolist()))

## 4. Write your predictions to a CSV file
You can use the following code to write your test predictions to a CSV file. Then upload your file to https://leaderboard.corp.amazon.com/tasks/352/submit

In [28]:
import pandas as pd
 
result_df = pd.DataFrame()
result_df["ID"] = test_df["ID"]
result_df["human_tag"] = test_predictions
 
result_df.to_csv("./data/project_result.csv", encoding='utf-8', index=False)
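
Before uploading, it can help to read the file back and confirm it has the expected columns and row count. The sketch below shows such a round trip on toy values written to a temporary file; the predictions are made up, and the temporary path is an assumption so the example does not overwrite `./data/project_result.csv`.

```python
import os
import tempfile
import pandas as pd

# Toy stand-ins for test_df["ID"] and test_predictions (hypothetical predictions)
toy_ids = [62199, 76123, 78742]
toy_predictions = [0, 1, 0]

toy_result = pd.DataFrame({"ID": toy_ids, "human_tag": toy_predictions})

# Write to a temporary file, then read it back to inspect what was saved
out_path = os.path.join(tempfile.mkdtemp(), "project_result.csv")
toy_result.to_csv(out_path, encoding="utf-8", index=False)
check_df = pd.read_csv(out_path)

print(list(check_df.columns))
print(len(check_df) == len(toy_ids))
```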