## Final Project Day 1: K Nearest Neighbors Model for the Product Safety Dataset

For the final project, build a K Nearest Neighbors model to predict the __human_tag__ field of the dataset. You will submit your predictions to the Leaderboard competition here: https://leaderboard.corp.amazon.com/tasks/352

Use the notebooks from class to implement the model, then train and test it with the corresponding datasets. You will use a __classifier__. Submissions are ranked by F1 score. Sklearn provides the [__f1_score():__](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html) function if you want to check how your model performs on your training or validation set.
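To make the ranking metric concrete, here is a minimal sketch that computes the F1 score by hand for binary labels. The helper name `f1_from_labels` and the toy labels are made up for illustration; for binary labels, sklearn's `f1_score` should give the same value.

```python
def f1_from_labels(y_true, y_pred, positive=1):
    """Compute the binary F1 score as the harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision and recall are both zero (or undefined)
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical toy labels: one positive example is missed by the predictions
y_true = [1, 0, 1, 1]
y_pred = [1, 0, 0, 1]
score = f1_from_labels(y_true, y_pred)
print(round(score, 3))  # 0.8 (precision 1.0, recall 2/3)
```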

You can follow these steps:
1. Read training-test data (Given)
2. Train a KNN classifier (Implement)
3. Make predictions on your test dataset (Implement)
4. Write your test predictions to a CSV file (Given)

__You can use the KNN Classifier from here: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html__
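As a quick orientation to the classifier's interface, the sketch below fits `KNeighborsClassifier` on made-up one-dimensional points. The data and variable names are hypothetical; in the project, the features would come from the vectorized review text.

```python
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy data: two well-separated clusters on a line
X_train = [[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]]
y_train = [0, 0, 0, 1, 1, 1]

# Each prediction is a majority vote among the 3 closest training points
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

toy_predictions = knn.predict([[1.5], [10.5]])
print(toy_predictions)  # [0 1]
```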

## 1. Reading the dataset

We will use the __pandas__ library to read our dataset. Let's first run the following credential cell and then download the files.

#### __Training data:__

In [1]:
# Download the datasets from S3
import boto3
import os
import pandas as pd

bucketname = 'mlu-courses-datalake' # replace with your bucket name
# NOTE: both keys below point to test.csv; set filename1 to the training-set key
filename1 = 'MLA-NLP/data/final_project/test.csv' # replace with your object key
filename2 = 'MLA-NLP/data/final_project/test.csv' # replace with your object key
pathname = '../../data/final_project'
s3 = boto3.resource('s3')

# Create the local data directory if it does not exist yet
if not os.path.exists(pathname):
    try:
        os.makedirs(pathname)
    except OSError:
        print("Creation of the directory %s failed" % pathname)

s3.Bucket(bucketname).download_file(filename1, os.path.join(pathname, os.path.basename(filename1)))
s3.Bucket(bucketname).download_file(filename2, os.path.join(pathname, os.path.basename(filename2)))
print("Successfully downloaded the files to %s" % pathname)

Successfully downloaded the files to ../../data/final_project


#### __Test data:__

In [18]:
import pandas as pd
df = pd.read_csv('../../data/final_project/test.csv', encoding='utf-8', header=0)

# Label a review as positive when its star rating is above 2
def myFunction(x):
    return x > 2

variable_1 = df['star_rating']
isPositive = variable_1.map(myFunction)
df['isPositive'] = isPositive

## 2. Train a KNN Classifier
Here, you will apply the pre-processing operations covered in class. Then, you can split your dataset into training and validation sets. For your first submission, you will use the __K Nearest Neighbors Classifier__. It is available [here](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html). The competition is scored with F1. In sklearn, you can use the [__f1_score():__](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html) function to see your F1 score on your training or validation set.
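The split-then-score workflow described above could look roughly like the following sketch. The toy texts, labels, and variable names are hypothetical stand-ins for the cleaned review text and its labels; the vectorizer settings and `n_neighbors=3` are arbitrary choices for the tiny example, not recommendations.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Hypothetical toy data standing in for the cleaned review text and labels
toy_texts = [
    "great product works well", "love this item great value",
    "works great very happy", "excellent quality great price",
    "burned my hand dangerous", "caught fire very dangerous",
    "broke and cut my finger", "dangerous sharp edge cut me",
]
toy_labels = [0, 0, 0, 0, 1, 1, 1, 1]

# Hold out 25% of the data; stratify keeps both classes in each part
X_train, X_val, y_train, y_val = train_test_split(
    toy_texts, toy_labels, test_size=0.25, stratify=toy_labels, random_state=42
)

toy_pipeline = Pipeline([
    ("text_vect", TfidfVectorizer()),
    ("knn", KNeighborsClassifier(n_neighbors=3)),
])

# Fit on the training part only, then score on the held-out part
toy_pipeline.fit(X_train, y_train)
val_pred = toy_pipeline.predict(X_val)
val_f1 = f1_score(y_val, val_pred)
print("F1 (validation):", val_f1)
```

Scoring on data the model never saw during fitting gives a more honest estimate of leaderboard performance than scoring on the training set.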

In [19]:
import nltk, re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.neighbors import KNeighborsClassifier

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

# Stop words that process_text() removes. Note that this dict also contains
# negations such as 'not' and "don't", so those are filtered out as well.
excl = {'against': True, 'not': True, 'don': True, "don't": True, 'ain': True, 'aren': True, "aren't": True, 'couldn': True, 
        "couldn't": True, 'didn': True, "didn't": True, 'doesn': True, "doesn't": True, 'hadn': True, "hadn't": True,
        'hasn': True, "hasn't": True, 'haven': True, "haven't": True, 'isn': True, "isn't": True, 'mightn': True, 
        "mightn't": True, 'mustn': True, "mustn't": True, 'needn': True, "needn't": True, 'shouldn': True, 
        "shouldn't": True, 'wasn': True, "wasn't": True, 'weren': True, "weren't": True, 'won': True, "won't": True,
        'wouldn': True, "wouldn't": True, 'i': True, 'me': True, 'my': True, 'myself': True, 'we': True, 'our': True, 
        'ours': True, 'ourselves': True, 'you': True, "you're": True, "you've": True, "you'll": True, "you'd": True, 
        'your': True, 'yours': True, 'yourself': True, 'yourselves': True, 'he': True, 'him': True, 'his': True,
        'himself': True, 'she': True, "she's": True, 'her': True, 'hers': True, 'herself': True, 'it': True, 
        "it's": True, 'its': True, 'itself': True, 'they': True, 'them': True, 'their': True, 'theirs': True,
        'themselves': True, 'what': True, 'which': True, 'who': True, 'whom': True, 'this': True, 'that': True,
        "that'll": True, 'these': True, 'those': True, 'am': True, 'is': True, 'are': True, 'was': True, 'were': True, 
        'be': True, 'been': True, 'being': True, 'have': True, 'has': True, 'had': True, 'having': True, 'do': True,
        'does': True, 'did': True, 'doing': True, 'a': True, 'an': True, 'the': True, 'and': True, 'but': True, 
        'if': True, 'or': True, 'because': True, 'as': True, 'until': True, 'while': True, 'of': True, 'at': True,
        'by': True, 'for': True, 'with': True, 'about': True, 'between': True, 'into': True, 'through': True, 'during': True, 
        'before': True, 'after': True, 'above': True, 'below': True, 'to': True, 'from': True, 'up': True, 'down': True, 
        'in': True, 'out': True, 'on': True, 'off': True, 'over': True, 'under': True, 'again': True, 'further': True, 
        'then': True, 'once': True, 'here': True, 'there': True, 'when': True, 'where': True, 'why': True, 'how': True, 
        'all': True, 'any': True, 'both': True, 'each': True, 'few': True, 'more': True, 'most': True, 'other': True, 
        'some': True, 'such': True, 'no': True, 'nor': True, 'only': True, 'own': True, 'same': True, 'so': True, 
        'than': True, 'too': True, 'very': True, 's': True, 't': True, 'can': True, 'will': True, 'just': True, 
        'should': True, "should've": True, 'now': True, 'd': True, 'll': True, 'm': True, 'o': True, 're': True, 
        've': True, 'y': True, 'ma': True, 'shan': True, "shan't": True}

# Lemmatizer used to reduce each word to its base form
lemma = nltk.stem.WordNetLemmatizer()

def process_text(texts): 
    final_text_list=[]
    for sent in texts:
        
        # Check if the sentence is a missing value
        if not isinstance(sent, str):
            sent = ""
            
        filtered_sentence=[]
        
        sent = sent.lower() # Lowercase 
        sent = sent.strip() # Remove leading/trailing whitespace
        sent = re.sub(r'\s+', ' ', sent) # Collapse runs of whitespace (spaces, tabs, newlines)
        sent = re.compile('<.*?>').sub('', sent) # Remove HTML tags/markup
        sent = re.sub(r'[?!\'"#.,)(|/*]', '', sent) # Remove punctuation and special characters
        
        for w in word_tokenize(sent):
            # Check if it is not numeric and its length>2 and not in stop words
            if(not w.isnumeric()) and (len(w)>2) and (w not in excl):  
                # lemmatize and add to filtered list
                filtered_sentence.append(lemma.lemmatize(w))
        final_string = " ".join(filtered_sentence) #final string of cleaned words
 
        final_text_list.append(final_string)
        
    return final_text_list

df.head(10)

[nltk_data] Downloading package punkt to /home/ec2-user/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /home/ec2-user/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /home/ec2-user/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


| Unnamed: 0 | ID | doc_id | text | date | star_rating | title | isPositive |
|---|---|---|---|---|---|---|---|
| 0 | 62199 | 15449606311 | Quality of material is great, however, the bac... | 3/7/2018 19:47 | 3 | great backpack with strange fit | True |
| 1 | 76123 | 15307152511 | The product was okay but wasn't refined campho... | 43135.875 | 2 | Not refined | False |
| 2 | 78742 | 12762748321 | I normally read the reviews before buying some... | 42997.37708 | 1 | Doesnt work, wouldnt recommend | False |
| 3 | 64010 | 15936405041 | These pads are completely worthless. The light... | 43313.25417 | 1 | The lighter colored side of the pads smells li... | False |
| 4 | 17058 | 13596875291 | The saw works great but the blade oiler does n... | 12/5/2017 20:17 | 2 | The saw works great but the blade oiler does n... | False |
| 5 | 21905 | 15874101741 | More powerful than I expected. Especially on ... | 7/10/2018 17:13 | 4 | Keep well watered after use,has a tendency to ... | True |
| 6 | 78037 | 17029633241 | We have had it for a few months and it is stil... | 43233.08542 | 4 | Great price compared to what we saw at the store | True |
| 7 | 47695 | 15625164971 | Blades were dull. Gave me bad face burn from ... | 5/1/2018 18:40 | 1 | Blades were dull. Gave me bad face burn from s... | False |
| 8 | 63344 | 13607360001 | Loved the first box I purchased from a local s... | 43097.14444 | 2 | Burnt tasting... | False |
| 9 | 66093 | 17494929541 | I expected better quality. The toaster burns t... | 43269.53889 | 1 | I expected better quality. The toaster burns t... | False |


In [24]:
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

values_list = df["isPositive"].tolist()
text_list = process_text(df["text"].values.tolist())

pipeline = Pipeline([
    ('text_vect', TfidfVectorizer(binary=True, max_features=5, ngram_range=(1, 5))),
    ('knn', KNeighborsClassifier(n_neighbors=5))                
])

pipeline.fit(text_list, values_list)

Pipeline(steps=[('text_vect',
                 TfidfVectorizer(binary=True, max_features=5,
                                 ngram_range=(1, 5))),
                ('knn', KNeighborsClassifier())])

In [26]:
# from random import randrange
# for i in range(5000):
#     index = randrange(15000)
#     text_list[index] = 0
    

# Use the fitted pipeline to make predictions.
# Note: text_list is the same data the pipeline was fitted on, so these are
# training-set metrics, not validation metrics.
val_predictions = pipeline.predict(text_list)
print(confusion_matrix(values_list, val_predictions))
print(classification_report(values_list, val_predictions))
print("Accuracy (training):", accuracy_score(values_list, val_predictions))

MemoryError: Unable to allocate 1.00 GiB for an array with shape (8503, 15784) and data type float64
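The error above is raised while allocating a dense 8503 x 15784 float64 array during prediction. One possible workaround is to predict in smaller batches, so that any per-batch intermediate array stays small. The helper `predict_in_batches` and the stand-in model below are hypothetical, and whether this resolves the error depends on the sklearn version and the memory available.

```python
def predict_in_batches(model, items, batch_size=256):
    """Call model.predict on consecutive slices of items and join the results."""
    predictions = []
    for start in range(0, len(items), batch_size):
        batch = items[start:start + batch_size]
        predictions.extend(model.predict(batch))
    return predictions


class LengthModel:
    """Hypothetical stand-in for a fitted pipeline; predicts len(text) > 3."""

    def predict(self, batch):
        return [len(text) > 3 for text in batch]


demo_predictions = predict_in_batches(
    LengthModel(), ["ok", "great", "bad", "awful"], batch_size=2
)
print(demo_predictions)  # [False, True, False, True]
```

With the notebook's objects, the call would be `predict_in_batches(pipeline, text_list)`, which returns a plain list of predictions in the original order.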

## 4. Write your predictions to a CSV file
You can use the following code to write your test predictions to a CSV file. Then upload your file to https://leaderboard.corp.amazon.com/tasks/352/submit

In [6]:
import pandas as pd

result_df = pd.DataFrame()
result_df["ID"] = df["ID"]
result_df["human_tag"] = val_predictions
result_df.to_csv("./my_result.csv", encoding='utf-8', index=False)