# Machine Learning Challenge

## Overview

The focus of this exercise is on a field within machine learning called [Natural Language Processing](https://en.wikipedia.org/wiki/Natural-language_processing) (NLP). We can think of this field as the intersection between language and machine learning. Tasks in this field include automatic translation (Google Translate), intelligent personal assistants (Siri), information extraction, and speech recognition.

NLP uses many of the same techniques as traditional data science, but also features a number of specialised skills and approaches. There is no expectation that you have any experience with NLP. However, to complete the challenge it will be useful to have the following skills:

- understanding of the Python programming language
- understanding of basic machine learning concepts, e.g. supervised learning


### Instructions

1. Download this notebook!
2. Answer each of the provided questions, including your source code as cells in this notebook.
3. Share the results with us, e.g. a GitHub repo.

### Task description

You will be performing a task known as [sentiment analysis](https://en.wikipedia.org/wiki/Sentiment_analysis). Here, the goal is to predict sentiment -- the emotional intent behind a statement -- from text. For example, the sentence "*This movie was terrible!*" has a negative sentiment, whereas "*loved this cinematic masterpiece*" has a positive sentiment.

To simplify the task, we consider sentiment binary: labels of `1` indicate a sentence has a positive sentiment, and labels of `0` indicate that the sentence has a negative sentiment.

### Dataset

The dataset is split across three files, representing three different sources -- Amazon, Yelp and IMDB. Your task is to build a sentiment analysis model using both the Yelp and IMDB data as your training set, and to test the performance of your model on the Amazon data.

Each file can be found in the `input` directory, and contains 1000 rows of data. Each row contains a sentence, a `tab` character and then a label -- `0` or `1`. 
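
The row format described above (sentence, tab, label) can be parsed with Python's standard `csv` module. This is only a minimal sketch: the two sample rows are copied from the Amazon preview further down, and the variable names are illustrative rather than part of the challenge.

```python
import csv
import io

# Two sample rows in the "sentence<TAB>label" format; the real files hold 1000 rows each.
sample = (
    "Good case, Excellent value.\t1\n"
    "Needless to say, I wasted my money.\t0\n"
)

# QUOTE_NONE stops the reader from treating quotation marks inside a sentence as field quoting.
reader = csv.reader(io.StringIO(sample), delimiter="\t", quoting=csv.QUOTE_NONE)
parsed = [(sentence.strip(), int(label)) for sentence, label in reader]

for sentence, label in parsed:
    print(label, sentence)
```

Splitting on the tab rather than on commas or spaces matters here, because the sentences themselves contain both.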

**Notes**
- Feel free to use existing machine learning libraries as components in your solution!
- Suggested libraries: `sklearn` (for machine learning), `pandas` (for loading/processing data), `spacy` (for text processing).
- As mentioned, you are not expected to have previous experience with this exact task. You are free to refer to external tutorials/resources to assist you. However, you will be asked to justify the choices you have made -- so make sure you understand the approach you have taken.

In [1]:
import os
print(os.listdir("./input"))

['amazon_cells_labelled.txt', 'yelp_labelled.txt', 'imdb_labelled.txt']


In [2]:
!head "./input/amazon_cells_labelled.txt"
# !head "./input/imdb_labelled.txt"
# !head "./input/yelp_labelled.txt"

So there is no way for me to plug it in here in the US unless I go by a converter.	0
Good case, Excellent value.	1
Great for the jawbone.	1
Tied to charger for conversations lasting more than 45 minutes.MAJOR PROBLEMS!!	0
The mic is great.	1
I have to jiggle the plug to get it to line up right to get decent volume.	0
If you have several dozen or several hundred contacts, then imagine the fun of sending each of them one by one.	0
If you are Razr owner...you must have this!	1
Needless to say, I wasted my money.	0
What a waste of money and time!.	0


# Tasks
### 1. Read and concatenate data into test and train sets.
### 2. Prepare the data for input into your model.

#### 2a: Find the ten most frequent words in the training set.

### 3. Train your model and justify your choices.

### 4. Evaluate your model using metric(s) you see fit and justify your choices.

# 1. Read and concatenate data into test and train sets

In [3]:
# Read the Yelp file; the with-block closes the file handle afterwards
with open('./input/yelp_labelled.txt', "r", encoding="ISO-8859-1") as yelp:
    yelp_s = [line.strip() for line in yelp]

yelp_s[0:4]

['Wow... Loved this place.\t1',
 'Crust is not good.\t0',
 'Not tasty and the texture was just nasty.\t0',
 'Stopped by during the late May bank holiday off Rick Steve recommendation and loved it.\t1']

In [4]:
# Read the IMDB file; the with-block closes the file handle afterwards
with open('./input/imdb_labelled.txt', "r", encoding="ISO-8859-1") as imdb:
    imdb_s = [line.strip() for line in imdb]

imdb_s[0:4]

['A very, very, very slow-moving, aimless movie about a distressed, drifting young man.  \t0',
 'Not sure who was more lost - the flat characters or the audience, nearly half of whom walked out.  \t0',
 'Attempting artiness with black & white and clever camera angles, the movie disappointed - became even more ridiculous - as the acting was poor and the plot and lines almost non-existent.  \t0',
 'Very little music or anything to speak of.  \t0']

In [5]:
# Read the Amazon file; the with-block closes the file handle afterwards
with open('./input/amazon_cells_labelled.txt', "r", encoding="ISO-8859-1") as amazon:
    amazon_s = [line.strip() for line in amazon]

amazon_s[0:4]

['So there is no way for me to plug it in here in the US unless I go by a converter.\t0',
 'Good case, Excellent value.\t1',
 'Great for the jawbone.\t1',
 'Tied to charger for conversations lasting more than 45 minutes.MAJOR PROBLEMS!!\t0']

In [6]:
len(yelp_s),len(imdb_s), len(amazon_s)

(1000, 1000, 1000)

In [7]:
# split sentences & labels for yelp
yelp_text = []
yelp_label = []
for i in yelp_s:
    text, st = i.split('\t')
    yelp_text.append(text)
    yelp_label.append(int(st))
len(yelp_text),len(yelp_label)

(1000, 1000)

In [8]:
# split sentences & labels for imdb (these rows have two spaces before the tab)
imdb_text = []
imdb_label = []
for i in imdb_s:
    text, st = i.split('  \t')
    imdb_text.append(text)
    imdb_label.append(int(st))
len(imdb_text),len(imdb_label)

(1000, 1000)

In [9]:
# split sentences & labels for amazon
amazon_text = []
amazon_label = []
for i in amazon_s:
    text, st = i.split('\t')
    amazon_text.append(text)
    amazon_label.append(int(st))
len(amazon_text),len(amazon_label)

(1000, 1000)

In [10]:
train_text = yelp_text + imdb_text     # (2000 items) ['Wow... Loved this pl...', 'Crust is not good.',..]
train_label = yelp_label + imdb_label  # (2000 items) [1, 0, 0, 1, 1, ...]

test_text = amazon_text     # (1000 items) ['So there is no way f...', 'Good case, Excellent...',...]
test_label = amazon_label   # (1000 items) [0, 1, 1, 0, 1, ...]

len(train_text), len(train_label), len(test_text), len(test_label)

(2000, 2000, 1000, 1000)
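
The three read-and-split cells above repeat the same steps, so they could be folded into one helper. The following is a sketch under assumptions: `load_labelled` is a hypothetical name, and the demo writes a small temporary file so that it runs without the `input` directory.

```python
import os
import tempfile

def load_labelled(path, encoding="ISO-8859-1"):
    """Return (texts, labels) from a file of 'sentence<TAB>label' rows."""
    texts, labels = [], []
    # The with-block closes the file even if parsing raises an error.
    with open(path, "r", encoding=encoding) as handle:
        for line in handle:
            line = line.strip()
            if not line:  # skip blank lines
                continue
            # rpartition splits on the last tab, so the label is always the final field.
            sentence, _, label = line.rpartition("\t")
            texts.append(sentence.strip())  # also drops the spaces IMDB rows have before the tab
            labels.append(int(label))
    return texts, labels

# Demo file with one Yelp-style row and one IMDB-style row (two spaces before the tab).
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="ISO-8859-1") as tmp:
    tmp.write("Wow... Loved this place.\t1\n")
    tmp.write("Very little music or anything to speak of.  \t0\n")
    demo_path = tmp.name

demo_texts, demo_labels = load_labelled(demo_path)
os.remove(demo_path)
print(demo_texts, demo_labels)
```

With a helper like this, the training set would be the concatenation of the Yelp and IMDB results, and the Amazon result would be the test set, exactly as in the cell above.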

# 2. Prepare the data for input into your model

In [40]:
import re
# import spacy
import nltk
from collections import Counter
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords as sw
from nltk.stem import WordNetLemmatizer
# # from nltk.stem.porter import PorterStemmer
# # porter = PorterStemmer()


import numpy as np

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer

from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV  # for best parameters
from sklearn.model_selection import cross_val_score
from sklearn.metrics import make_scorer, accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import confusion_matrix, classification_report
# from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LogisticRegressionCV


# nlp = spacy.load('en_core_web_sm')
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
stop_words = sw.words()  # no language argument: loads the stop words of every language in the corpus, not only English
lemmatizer = WordNetLemmatizer()

[nltk_data] Downloading package punkt to /Users/jane/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /Users/jane/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/jane/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [12]:
def prepare_data(sentences):
    text_lower = [t.lower() for t in sentences]                      # lowercase
    text_punc  = [re.sub(r"[^a-z0-9]+", " ", i) for i in text_lower] # remove punc
    text_punc  = [re.sub(r"[0-9]","", i) for i in text_punc]         # remove digits
    text_token = [word_tokenize(i) for i in text_punc]               # tokenisation
    
    # remove stop words - to, the, is, have....
    text_ns=[] 
    for tokens in text_token:
        filtered_sentence = [w for w in tokens if w not in stop_words]
        text_ns.append(filtered_sentence)
    
    # lemmatisation - prices->price, kits->kit ...
    text_lemma = []
    for tokens in text_ns:
        lemma_sentence = [lemmatizer.lemmatize(w) for w in tokens]
        text_lemma.append(lemma_sentence)
    
    # # stemming
    # text_stem = []
    # for tokens in text_ns:
    #     stem = [porter.stem(w) for w in tokens]
    #     text_stem.append(stem)    

    return text_lemma

In [13]:
train_tt = prepare_data(train_text)
test_tt = prepare_data(test_text)

In [14]:
train_string = [' '.join(tokens) for tokens in train_tt]
test_string = [' '.join(tokens) for tokens in test_tt]
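
To make the effect of `prepare_data` concrete, here is a simplified, dependency-free sketch of the same idea applied to one Yelp sentence. It rests on several assumptions: `DEMO_STOP_WORDS` is a tiny hand-written set rather than the NLTK list, whitespace splitting stands in for `word_tokenize`, and lemmatisation is left out.

```python
import re

# Hypothetical stop-word set for illustration only; the notebook uses NLTK's list.
DEMO_STOP_WORDS = {"the", "is", "to", "and", "was", "this", "a"}

def clean_sentence(sentence):
    """Lowercase, strip punctuation and digits, tokenise, and drop stop words."""
    lowered = sentence.lower()
    # One pass that replaces every run of non-letters with a space
    # (the notebook removes punctuation and digits in two separate steps).
    letters_only = re.sub(r"[^a-z]+", " ", lowered)
    tokens = letters_only.split()
    return [token for token in tokens if token not in DEMO_STOP_WORDS]

cleaned = clean_sentence("Not tasty and the texture was just nasty.")
print(cleaned)
```

One design point worth checking in the real pipeline: NLTK's English stop-word list includes negations such as "not", so a sentence like this one may lose the word that flips its sentiment.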

In [47]:
# count unigrams and bigrams (tf-idf weighting is applied in the next cell)

bigram_vectorizer = CountVectorizer(ngram_range=(1, 2))
X_train_bigram = bigram_vectorizer.fit_transform(train_string)
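
Setting `ngram_range=(1, 2)` asks the vectorizer to count single words and adjacent word pairs. The sketch below shows the idea in plain Python; `unigrams_and_bigrams` is a hypothetical helper and not how scikit-learn implements it.

```python
def unigrams_and_bigrams(text):
    """Return the single tokens of a cleaned sentence followed by its adjacent pairs."""
    tokens = text.split()
    # zip(tokens, tokens[1:]) pairs each token with the one that follows it.
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return tokens + bigrams

features = unigrams_and_bigrams("wow loved place")
print(features)
```

Bigrams let the model see short phrases, at the cost of a much larger and sparser vocabulary. Note also that `CountVectorizer`'s default token pattern ignores single-character tokens, which this sketch does not replicate.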

In [49]:
# bigram with tf-idf

bigram_tf_idf_transformer = TfidfTransformer()
X_train_bigram_tf_idf = bigram_tf_idf_transformer.fit_transform(X_train_bigram)

In [50]:
X_train_bigram_tf_idf

<2000x13410 sparse matrix of type '<class 'numpy.float64'>'
	with 22704 stored elements in Compressed Sparse Row format>
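
The tf-idf weighting applied above can be reproduced by hand on a toy corpus. This sketch assumes scikit-learn's default settings for `TfidfTransformer` (smoothed idf, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation of each row); the three documents are made up for illustration.

```python
import math

# Toy corpus of already-tokenised documents (illustrative, not from the dataset).
demo_docs = [["good", "food"], ["bad", "food"], ["good", "good", "place"]]
n_docs = len(demo_docs)

vocab = sorted({term for doc in demo_docs for term in doc})
# Document frequency: in how many documents does each term appear?
doc_freq = {term: sum(term in doc for doc in demo_docs) for term in vocab}
# Smoothed inverse document frequency.
idf = {term: math.log((1 + n_docs) / (1 + doc_freq[term])) + 1 for term in vocab}

def tfidf_row(doc):
    """Raw term counts times idf, scaled so the row has unit Euclidean length."""
    weighted = [doc.count(term) * idf[term] for term in vocab]
    length = math.sqrt(sum(value * value for value in weighted))
    return [value / length for value in weighted]

tfidf_rows = [tfidf_row(doc) for doc in demo_docs]
for doc, row in zip(demo_docs, tfidf_rows):
    print(doc, [round(value, 3) for value in row])
```

Terms that occur in fewer documents ("bad", "place") receive a larger idf than terms shared across documents ("food", "good"), which is why tf-idf down-weights very common words.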

In [52]:
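# hold out 20% of the Yelp + IMDB data as a validation split; the labels come from train_label
X_train, X_test, y_train, y_test = train_test_split(X_train_bigram_tf_idf, train_label, test_size=0.2, random_state=1)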

X_train.shape, X_test.shape, len(y_train), len(y_test)


((1600, 13410), (400, 13410), 1600, 400)

# 2a: Find the ten most frequent words in the training set

In [26]:
word_list = []  # flattened list of every token in the prepared training set
for item in train_tt: 
    for word in item:
        word_list.append(word)
len(word_list)

12533

In [27]:
count = Counter(word_list)
most_frequent_words = count.most_common(10)

most_frequent_words  # top 10 frequent words

[('movie', 212),
 ('film', 187),
 ('good', 153),
 ('food', 127),
 ('place', 118),
 ('great', 111),
 ('time', 103),
 ('like', 97),
 ('bad', 89),
 ('service', 87)]
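
The flatten-then-count step above can also be written in one expression with `itertools.chain`. A small sketch on made-up token lists (`demo_tokens` is illustrative, not the real training data):

```python
from collections import Counter
from itertools import chain

# Each inner list stands for one prepared sentence.
demo_tokens = [["good", "food"], ["good", "movie"], ["bad", "movie", "good"]]

# chain.from_iterable walks through all inner lists without building an intermediate list.
counts = Counter(chain.from_iterable(demo_tokens))
top_two = counts.most_common(2)
print(top_two)
```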

# 3. Train your model and justify your choices

In [54]:
# Fit logistic regression on the training split (Yelp + IMDB); LogisticRegressionCV
# cross-validates over the regularisation strength
lr = LogisticRegressionCV(cv=100,scoring='accuracy',random_state=0,n_jobs=-1,verbose=3,max_iter=500).fit(X_train,y_train)

# predict labels for the held-out validation split
y_pred = lr.predict(X_test) 

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  24 tasks      | elapsed:   15.8s
[Parallel(n_jobs=-1)]: Done 100 out of 100 | elapsed:   55.9s finished
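
For intuition about what the fitted model computes: logistic regression turns a weighted sum of the features into a probability with the sigmoid function. The sketch below uses hypothetical weights for two words; it is not the fitted `lr` model and the numbers are invented for illustration.

```python
import math

def predict_proba(features, weights, bias):
    """Probability of the positive class: sigmoid(bias + sum of weight * feature value)."""
    # Features without a learned weight contribute nothing.
    z = bias + sum(weights[name] * value for name, value in features.items() if name in weights)
    return 1.0 / (1.0 + math.exp(-z))

demo_weights = {"great": 2.0, "bad": -2.5}  # hypothetical weights, not learned from the data
demo_bias = 0.0

p_positive = predict_proba({"great": 1.0}, demo_weights, demo_bias)
p_negative = predict_proba({"bad": 1.0, "unseen": 1.0}, demo_weights, demo_bias)
print(round(p_positive, 3), round(p_negative, 3))
```

A probability above 0.5 maps to label `1` and below 0.5 to label `0`. Because each word gets one weight, the learned coefficients can be inspected directly, which helps when justifying the model choice.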


In [55]:
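# get model evaluation on the validation split
# NOTE: the signature is classification_report(y_true, y_pred); with the arguments
# in this order, the precision and recall columns are swapped and support counts predictions.
print(classification_report(y_pred,y_test))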

              precision    recall  f1-score   support

           0       0.83      0.79      0.81       205
           1       0.79      0.83      0.81       195

    accuracy                           0.81       400
   macro avg       0.81      0.81      0.81       400
weighted avg       0.81      0.81      0.81       400
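
The columns of the report can be recomputed by hand, which also shows why argument order matters. This is a sketch on made-up labels; `precision_recall_f1` is a hypothetical helper, not a scikit-learn function.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall and F1 for one class, computed from paired labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # correctly predicted positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # predicted positive, actually not
    fn = sum(t == positive and p != positive for t, p in pairs)  # missed positives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

demo_true = [1, 1, 1, 1, 0, 0]
demo_pred = [1, 1, 0, 0, 1, 0]

correct_order = precision_recall_f1(demo_true, demo_pred)
swapped_order = precision_recall_f1(demo_pred, demo_true)
print("true first:", correct_order)
print("pred first:", swapped_order)
```

Passing the predictions first swaps precision and recall, while F1 and accuracy stay the same because they are symmetric in the two arguments.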



In [56]:
# prepare test data (AMAZON) for evaluation

test_data_vec = bigram_vectorizer.transform(test_string)
test_data = bigram_tf_idf_transformer.transform(test_data_vec)

In [57]:
# put amazon data into the trained LR model and get result
test_pred = lr.predict(test_data)

In [59]:
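# NOTE: arguments are (y_pred, y_true) here, the reverse of scikit-learn's (y_true, y_pred), so precision and recall are swapped
print(classification_report(test_pred ,test_label))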

              precision    recall  f1-score   support

           0       0.76      0.74      0.75       514
           1       0.73      0.76      0.74       486

    accuracy                           0.75      1000
   macro avg       0.75      0.75      0.75      1000
weighted avg       0.75      0.75      0.75      1000

