# Functions and implementation

In [15]:
import string
import re

from collections import Counter, OrderedDict

import numpy as np
import pandas as pd 

from nltk.tokenize import WordPunctTokenizer

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_selection import SelectKBest, chi2



## Summary of the process 

* Receive all documents (i.e. texts) as a `List`, `DataFrame` or `Series`.  


* Define function to clean a string:
 * ```python
     def clean(doc, regex_list=(("<[^>]*>", ""),)):
        # assumes `tokenizer` (e.g. WordPunctTokenizer()) and `stemmer`
        # (any object with a .stem() method) are already defined

        # remove or replace characters
        for pattern, replacement in regex_list:
            doc = re.sub(pattern, replacement, doc)
        # lowercase
        doc = doc.lower()
        # tokenize
        words = tokenizer.tokenize(doc)
        # drop tokens made up only of punctuation (e.g. "," or "...")
        words = [w for w in words if not all(c in string.punctuation for c in w)]
        # stem
        stems = [stemmer.stem(w) for w in words]
        return " ".join(stems)
   ```  


* Apply `clean` to every document with `df.applymap(clean)`, `series.apply(clean)` or a list comprehension.


* Define a function to build the vocabulary, i.e., the distinct words occurring in `docs` together with their counts, ordered from most to least frequent (see `Counter()` and `OrderedDict()` in the [`collections` library](https://docs.python.org/2/library/collections.html)):
 * ```python
     def build_vocabulary(docs):
        vocabulary = Counter()

        for doc in docs:
            words = doc.split()
            vocabulary.update(words)

        return OrderedDict(vocabulary.most_common())
   ```


* Define a function to vectorize the documents, i.e., convert them into a table where each row represents one document and each column holds the count of one vocabulary word.
 * ```python
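     def vectorize(docs):
        # build the vocabulary from the same documents being vectorized
        vocabulary = build_vocabulary(docs)
        vectors = []
        for doc in docs:
            # count whole tokens; str.count() would also match substrings
            counts = Counter(doc.split())
            vector = np.array([counts[word] for word in vocabulary])
            vectors.append(vector)

        return vectors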
   ```

* Build the bag-of-words table: `docs_BOW = pd.DataFrame(vectorize(docs), columns=list(build_vocabulary(docs)))`


* Improvements to the bag of words representation:
 * Remove *stop words*
 * Term Frequency - Inverse Document Frequency (TF-IDF)
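
As a rough illustration of the steps above, the following sketch uses only the standard library to run a toy corpus through vocabulary building, count vectorization, stop-word removal and a plain TF-IDF weighting. The corpus, the `STOP_WORDS` set and the `tfidf()` helper are assumptions made for this example. The weighting uses the textbook `tf * log(N / df)` form, which differs from the smoothed variant scikit-learn applies by default.

```python
import math
from collections import Counter, OrderedDict

# Toy corpus, assumed to be already cleaned (lowercased, no punctuation).
docs = ["the movie was good", "the movie was bad", "good good acting"]

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"the", "was"}


def build_vocabulary(docs):
    """Count every non-stop word and order the words by frequency."""
    vocabulary = Counter()
    for doc in docs:
        vocabulary.update(w for w in doc.split() if w not in STOP_WORDS)
    return OrderedDict(vocabulary.most_common())


def vectorize(docs, vocabulary):
    """One row per document, one column per vocabulary word."""
    vectors = []
    for doc in docs:
        counts = Counter(doc.split())  # whole-token counts
        vectors.append([counts[word] for word in vocabulary])
    return vectors


def tfidf(vectors):
    """Textbook TF-IDF: (count / row total) * log(N / document frequency)."""
    n_docs = len(vectors)
    n_terms = len(vectors[0])
    doc_freq = [sum(1 for v in vectors if v[j] > 0) for j in range(n_terms)]
    idf = [math.log(n_docs / doc_freq[j]) for j in range(n_terms)]
    weighted = []
    for v in vectors:
        total = sum(v)
        weighted.append(
            [(v[j] / total) * idf[j] if total else 0.0 for j in range(n_terms)]
        )
    return weighted


vocabulary = build_vocabulary(docs)
counts = vectorize(docs, vocabulary)
weights = tfidf(counts)

print(list(vocabulary))
print(counts)
```

In the second document, "bad" ends up with a higher weight than "movie" because it appears in fewer documents, which is the effect TF-IDF is meant to produce.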

**Or...**


## Alternatives

* Instead of a standalone `clean()` function, define a scikit-learn transformer class that wraps the cleaning logic and exposes a `transform()` method applying it to each document. This lets the cleaning step be included in a `Pipeline`.


* Use sklearn [CountVectorizer()](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) instead of `build_vocabulary()` + `vectorize()`.  
It receives an iterable of documents and returns a vectorized sparse matrix of token counts of those documents.  


* Use sklearn [TfidfTransformer()](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html) to transform a count matrix to a normalized tf or tf-idf representation.  
`CountVectorizer()` followed by `TfidfTransformer()` is equivalent to [TfidfVectorizer()](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html).  


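
The alternatives above could be combined roughly as follows. This is a minimal sketch, not the notebook's final implementation: `TextCleaner` is a hypothetical class name, its cleaning step is deliberately simplified to tag removal and lowercasing (no tokenizer or stemmer), and the three documents are made up for the example.

```python
import re

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)
from sklearn.pipeline import Pipeline


class TextCleaner(BaseEstimator, TransformerMixin):
    """Hypothetical transformer wrapping a simplified cleaning step."""

    def __init__(self, pattern="<[^>]*>"):
        self.pattern = pattern

    def fit(self, X, y=None):
        # Nothing to learn from the data; cleaning is stateless.
        return self

    def transform(self, X):
        # Strip matches of the pattern (HTML tags by default) and lowercase.
        return [re.sub(self.pattern, "", doc).lower() for doc in X]


# Made-up documents, for illustration only.
docs = ["<b>Good</b> movie", "Bad movie", "Good good acting"]

# Cleaning, counting and TF-IDF weighting chained in one Pipeline.
pipeline = Pipeline([
    ("clean", TextCleaner()),
    ("counts", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
])
X_pipeline = pipeline.fit_transform(docs)

# With default parameters, TfidfVectorizer does both steps at once.
cleaned = TextCleaner().transform(docs)
X_direct = TfidfVectorizer().fit_transform(cleaned)

same = np.allclose(X_pipeline.toarray(), X_direct.toarray())
print(X_pipeline.shape, same)
```

Keeping the cleaning step inside the `Pipeline` means the same preprocessing is applied at fit time and at prediction time without extra bookkeeping.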

# Implementation

In [13]:
df = pd.read_csv('./BLU07 - Feature Extraction/data/imdb_sentiment.csv')
df.head()

|   | sentiment | text |
|---|-----------|------|
| 0 | Negative | Aldolpho (Steve Buscemi), an aspiring film mak... |
| 1 | Negative | An unfunny, unworthy picture which is an undes... |
| 2 | Negative | A failure. The movie was just not good. It has... |
| 3 | Positive | I saw this movie Sunday afternoon. I absolutel... |
| 4 | Negative | Disney goes to the well one too many times as ... |


In [14]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 2 columns):
sentiment    5000 non-null object
text         5000 non-null object
dtypes: object(2)
memory usage: 78.2+ KB
