# Ensemble methods
The goal of ensemble methods is to combine the predictions of several base estimators built with a given learning algorithm in order to improve generalizability / robustness over a single estimator.

Two families of ensemble methods are usually distinguished:

* Averaging methods: Here the driving principle is to build several estimators independently and then to average their predictions. On average, the combined estimator is usually better than any single base estimator because its variance is reduced. Example: Random forest

* Boosting methods: Here base estimators are built sequentially, and each new estimator tries to reduce the bias of the combined estimator. The motivation is to combine several weak models to produce a powerful ensemble. Example: AdaBoost
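
The variance argument behind averaging can be checked with a small simulation. The sketch below is not part of the original notebook. It assumes that each base estimator's prediction equals the true value plus independent noise, which real ensemble members only approximate because their errors are usually correlated. The names `true_value`, `n_estimators`, and `n_trials` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

true_value = 1.0      # quantity every base estimator tries to predict
n_estimators = 10     # size of the hypothetical ensemble
n_trials = 20000      # number of repeated experiments

# Each row is one experiment; each column is one noisy base estimator.
predictions = true_value + rng.normal(0.0, 1.0, size=(n_trials, n_estimators))

single_var = predictions[:, 0].var()           # variance of one estimator
ensemble_var = predictions.mean(axis=1).var()  # variance of the averaged prediction

print("variance of a single estimator:", single_var)
print("variance of the average       :", ensemble_var)
```

With independent errors, the variance of the average shrinks roughly as 1 / `n_estimators`. Correlated errors shrink it less, which is one reason random forests try to decorrelate their trees.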

### Random Forest classifier
In random forests, each tree in the ensemble is built from a sample drawn with replacement from the training set. In addition, when splitting a node during the construction of the tree, the split that is chosen is no longer the best split among all features. Instead, the split that is picked is the best split among a random subset of the features. As a result of this randomness, the bias of the forest usually slightly increases (with respect to the bias of a single non-random tree) but, due to averaging, its variance also decreases, usually more than compensating for the increase in bias, hence yielding an overall better model.
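
The two sources of randomness described above can be sketched with NumPy alone. This is an illustration under assumed sizes (`n_samples`, `n_features`), not how scikit-learn implements its trees internally, and the square-root rule for the subset size is one common choice rather than the only one.

```python
import numpy as np

rng = np.random.default_rng(42)

n_samples, n_features = 1000, 16   # illustrative sizes, not taken from the dataset

# 1) Bootstrap: draw n_samples row indices *with replacement*.
bootstrap_idx = rng.integers(0, n_samples, size=n_samples)
unique_fraction = np.unique(bootstrap_idx).size / n_samples

# 2) At a split, consider only a random subset of the features.
subset_size = int(np.sqrt(n_features))
candidate_features = rng.choice(n_features, size=subset_size, replace=False)

print("fraction of distinct rows in the bootstrap sample:", unique_fraction)
print("features considered at this split:", sorted(candidate_features.tolist()))
```

Because rows are drawn with replacement, each tree sees only part of the training set (around 63% of the distinct rows for large samples), so different trees end up fitting different data.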

### Sklearn
Scikit-learn (imported as `sklearn`) is a high-level Python module built on NumPy, SciPy, and matplotlib. It integrates a wide range of state-of-the-art machine learning algorithms for medium-scale supervised and unsupervised problems. The package focuses on bringing machine learning to non-specialists using a general-purpose high-level language. Emphasis is put on ease of use, performance, documentation, and API consistency.
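
As a rough illustration of what API consistency means in practice, the sketch below drives two unrelated algorithms through the same `fit` and `score` calls. It is not from the original notebook; the iris dataset and the two chosen models are assumptions made only to keep the example small.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Two unrelated algorithms, used through the same interface.
models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree": DecisionTreeClassifier(random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)                  # learn from the training data
    scores[name] = model.score(X_test, y_test)   # mean accuracy on held-out data
    print(name, scores[name])
```

The same pattern applies to the `RandomForestClassifier` used later in this notebook.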

#### Step 1: Import data
*Create dataframes from the csv data*

In [52]:
import pandas as pd
import numpy as np


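# Raw string so the backslashes in the Windows path are not treated as escape sequences
TRAIN_CSV = r"C:\Users\kmpoo\Dropbox\HEC\Teaching\Python for PhD May 2019\python4phd\Session 3\Sent\sentence_review.csv"
# on_bad_lines="skip" (pandas >= 1.3) replaces the removed error_bad_lines=False
dataframe = pd.read_csv(TRAIN_CSV, sep=",", on_bad_lines="skip", header=0, low_memory=False, encoding="Latin1")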
print(dataframe)


       sentiment                                        review_text
0              1  It makes me sad that the name has Honduras in ...
1              4  My friends got take out from this place, I tri...
2              5  My wife took me here. The service ain't top no...
3              4  This is the best place to eat hondurean food. ...
4              5  Belizean fish market has always been the go to...
5              4  Bird's Nest is a nice find in an area you woul...
6              1  I'm giving this Place one star because of ther...
7              1  I usually enjoy this establishment very much! ...
8              2  Super mediocre. If you like your brownies to t...
9              5  YAY!  This place is gr8!  Where do I start?  T...
10             5  Everything in the menu is delicious!!! They ca...
11             5  No complaints, I ordered an American which tas...
12             5  I looooove this place! I live pretty close by ...
13             3  The Bird's Nest hot chocolate 

#### Step 2: Manipulate the dataframe and add new features
*Create a new feature, the number of words in each review, and a binary sentiment label (ratings of 4 or 5 count as positive)*



In [63]:
dataframe = dataframe.assign(nWords = lambda x : x['review_text'].str.split().str.len() )
dataframe['bi_senti'] = [ "positive" if x >= 4 else "negative" for x in dataframe['sentiment']]
print(dataframe)
print(dataframe['bi_senti'].value_counts())

       sentiment                                        review_text  nWords  \
0              1  It makes me sad that the name has Honduras in ...     172   
1              4  My friends got take out from this place, I tri...      87   
2              5  My wife took me here. The service ain't top no...      51   
3              4  This is the best place to eat hondurean food. ...     100   
4              5  Belizean fish market has always been the go to...      22   
5              4  Bird's Nest is a nice find in an area you woul...     208   
6              1  I'm giving this Place one star because of ther...     138   
7              1  I usually enjoy this establishment very much! ...     137   
8              2  Super mediocre. If you like your brownies to t...     133   
9              5  YAY!  This place is gr8!  Where do I start?  T...     134   
10             5  Everything in the menu is delicious!!! They ca...     138   
11             5  No complaints, I ordered an Americ
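
The same two derived columns can be reproduced on a few made-up rows, which makes the transformation easier to inspect than on the full dataset. In this sketch the `toy` dataframe and its reviews are hypothetical, and `np.where` is used as a vectorized alternative to the list comprehension in the cell above.

```python
import numpy as np
import pandas as pd

# Hypothetical reviews standing in for sentence_review.csv.
toy = pd.DataFrame({
    "sentiment": [1, 4, 5, 3],
    "review_text": ["Too salty", "Nice place to eat", "I love it", "It was fine"],
})

# Word count: split on whitespace, then count the pieces.
toy["nWords"] = toy["review_text"].str.split().str.len()

# Ratings of 4 or 5 count as positive, everything else as negative.
toy["bi_senti"] = np.where(toy["sentiment"] >= 4, "positive", "negative")

print(toy)
```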

#### Step 3: Split into train and test samples

In [66]:
from sklearn.utils import shuffle #To shuffle the dataframe
from sklearn.model_selection import train_test_split

dataframe = shuffle(dataframe)
df_train, df_test = train_test_split(dataframe, test_size=0.2)
print("size of training data ", len(df_train))

size of training data  9247
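
`train_test_split` shuffles the rows by default, so a separate shuffle is optional. A possible variant, sketched below on hypothetical data, fixes the random seed so the split is repeatable and stratifies on the label so both parts keep the same class balance. The `toy` dataframe and its 70/30 label ratio are assumptions for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical, imbalanced labels standing in for dataframe['bi_senti'].
toy = pd.DataFrame({
    "review_text": [f"review {i}" for i in range(100)],
    "bi_senti": ["positive"] * 70 + ["negative"] * 30,
})

train, test = train_test_split(
    toy,
    test_size=0.2,
    stratify=toy["bi_senti"],  # keep the 70/30 class ratio in both parts
    random_state=0,            # fixed seed so the split is repeatable
)

print(len(train), len(test))
print(test["bi_senti"].value_counts())
```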


#### Step 4: Tokenization of sentence and feature extraction
Text data requires special preparation before you can start using it for predictive modeling.

The text must first be split into individual words (tokens), a step called tokenization. The words then need to be encoded as integers or floating-point values so that they can be used as input to a machine learning algorithm, a step called feature extraction (or vectorization).

##### Convert sentences to TF-IDF vectors
tf–idf (term frequency–inverse document frequency) is a numerical statistic intended to reflect how important a word is to a document in a collection or corpus. It is often used as a weighting factor in information retrieval, text mining, and user modeling. The tf–idf value increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word, which helps to adjust for the fact that some words appear more frequently in general.
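
To make the definition concrete, here is a minimal sketch that tokenizes a tiny, hypothetical corpus and computes the textbook weight tf × log(N / df) by hand. This is an assumption-laden simplification: scikit-learn's `TfidfVectorizer` uses a smoothed idf and L2-normalizes each row by default, so its numbers will differ from these.

```python
import math
from collections import Counter

# Tiny hypothetical corpus; each document is already lower-cased.
docs = [
    "the food was great",
    "the service was slow",
    "great food great price",
]

# Tokenization: split each document into word tokens.
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each word appear?
doc_freq = Counter(word for tokens in tokenized for word in set(tokens))

def tf_idf(tokens):
    """Return {word: tf * idf} for one tokenized document."""
    counts = Counter(tokens)
    return {
        word: (count / len(tokens)) * math.log(n_docs / doc_freq[word])
        for word, count in counts.items()
    }

weights = [tf_idf(tokens) for tokens in tokenized]
for doc, w in zip(docs, weights):
    print(doc, "->", w)
```

In the third document, "price" occurs once but appears in no other document, while "great" occurs twice but is shared with the first document. The rarer word ends up with the larger weight, which is the offsetting effect described above.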

In [67]:
from sklearn.feature_extraction.text import TfidfVectorizer
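def vectorize(corpus, train_sentence, test_sentence):
    """Fit a TF-IDF vocabulary on corpus, then transform both text collections."""
    word_vectorizer = TfidfVectorizer(
        strip_accents='unicode',   # remove accents from characters
        analyzer='word',           # build features from words, not characters
        stop_words='english',      # drop common English words
        ngram_range=(1, 3),        # unigrams, bigrams and trigrams
        max_features=20000)        # keep the 20,000 most frequent terms

    word_vectorizer.fit(corpus)
    train_x = word_vectorizer.transform(train_sentence)
    test_x = word_vectorizer.transform(test_sentence)
    print(train_x)
    return train_x, test_x

# Note: the vocabulary is fitted on the full dataframe, so it also sees the
# test reviews; fitting on df_train['review_text'] only would avoid that leakage.
train_x, test_x = vectorize(dataframe['review_text'], df_train['review_text'], df_test['review_text'])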

  (0, 18586)	0.0916710092342351
  (0, 18511)	0.11149597929281477
  (0, 18277)	0.18676973302304148
  (0, 17372)	0.24454292948436535
  (0, 17361)	0.1149300612958263
  (0, 17273)	0.16632313358455292
  (0, 16778)	0.20350381105507792
  (0, 15163)	0.21601556317008766
  (0, 14526)	0.2278088514523289
  (0, 14524)	0.15534846648546435
  (0, 14276)	0.23796918118284127
  (0, 13634)	0.2289194284961458
  (0, 13333)	0.05993400948598399
  (0, 12800)	0.1802811834309952
  (0, 10782)	0.14166975956345212
  (0, 10613)	0.23126034450296204
  (0, 10603)	0.13480380562637528
  (0, 9693)	0.27548243902597286
  (0, 9688)	0.17465156057058517
  (0, 8967)	0.1530405486031496
  (0, 8145)	0.1738006325901372
  (0, 7641)	0.09570965669639206
  (0, 7197)	0.1588797619758763
  (0, 6617)	0.19992504756369334
  (0, 6257)	0.1925557231747529
  :	:
  (9246, 2378)	0.10775848858201226
  (9246, 2213)	0.09142233523808797
  (9246, 2209)	0.11311419713615518
  (9246, 2189)	0.10931426144380321
  (9246, 2169)	0.12291460827618025
  (9246, 19

#### Step 5: Generate the classifier

In [68]:
from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier(n_estimators=10)
r = rfc.fit(train_x, df_train['bi_senti'])
acc = r.score(test_x,df_test['bi_senti'])
print("accuracy of rfc is = ", acc)
rfc.predict_proba(test_x)[0:10]


accuracy of rfc is =  0.8784602076124568


array([[0. , 1. ],
       [0.2, 0.8],
       [0.1, 0.9],
       [0.1, 0.9],
       [0. , 1. ],
       [0.8, 0.2],
       [0.1, 0.9],
       [0.1, 0.9],
       [0.1, 0.9],
       [0.3, 0.7]])
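
The columns of `predict_proba` follow the order of the fitted model's `classes_` attribute, which holds the sorted labels, so with the labels used here the first column would be "negative" and the second "positive". A forest's probability is the average over its trees, which is why a 10-tree forest typically produces values in steps of 0.1. The sketch below illustrates the column order on hypothetical numeric data rather than on the review vectors.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Hypothetical numeric features; the label depends on the first column only.
X = rng.normal(size=(200, 5))
y = np.where(X[:, 0] > 0, "positive", "negative")

clf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

proba = clf.predict_proba(X[:5])

print(clf.classes_)   # column order of predict_proba
print(proba)          # one row per sample, one column per class
```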

#### Step 6: Use the model to predict the sentiment of your sentence

In [71]:
s1 = pd.Series('my name is poonacha')
s2 = pd.Series("This movie was absolutely horrible. A boring, random, nonsensical mess from start to finish. The film is incompetently directed from a very poor script. It feels more like a superhero movie from the early 2000's such as Catwoman or Daredevil. Watching it makes it clear that the people involved had no idea what they were doing, and should never have been put in charge of a project this size to begin with. The story makes no sense, and the whole reason Batman wants to kill Superman is contrived. Batman and Superman hate each other because they both cause collateral damage and human death, and neither one ever sees fit to point out their similarities, or try and talk to each other about their different perspectives. Apparently that would have been too interesting, so of course Snyder didn't include it.")
x1, x2 = vectorize(dataframe['review_text'],s1,s2)
print(rfc.predict_proba(x1))



[[0. 1.]]
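
The cell above re-fits the vectorizer before transforming the new sentences and prints the probabilities for the first sentence only. One alternative, sketched here with hypothetical training texts, is to bundle the vectorizer and the classifier in a scikit-learn pipeline so that the vocabulary is fitted once and any number of new sentences can be scored together. The training data and the simplified vectorizer settings below are assumptions, not the notebook's actual setup.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Hypothetical training data standing in for df_train.
train_text = [
    "great food and friendly staff",
    "loved the place, delicious menu",
    "horrible service and boring food",
    "terrible, never coming back",
]
train_label = ["positive", "positive", "negative", "negative"]

# The vectorizer is fitted once, on the training text only, inside the pipeline.
model = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    RandomForestClassifier(n_estimators=10, random_state=0),
)
model.fit(train_text, train_label)

new_sentences = ["my name is poonacha", "This movie was absolutely horrible"]
new_proba = model.predict_proba(new_sentences)   # one row per new sentence

print(model.classes_)
print(new_proba)
```

A sentence whose words are all stop words or outside the learned vocabulary becomes an all-zero vector, so its predicted probabilities say little about the sentence itself.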
