> **Note:** In most sessions you will be solving exercises posed in a Jupyter notebook that looks like this one. Because you are cloning a GitHub repository that only we can push to, you should **NEVER EDIT** any of the files you pull from GitHub. Instead, either make a new notebook and write your solutions in there, or **make a copy of this notebook and save it somewhere else** on your computer, not inside the `sds` folder that you cloned, so you can write your answers in there. If you edit the notebook you pulled from GitHub, those edits (possibly your solutions to the exercises) may be overwritten and lost the next time you pull from GitHub. This is important, so don't hesitate to ask if it is unclear.

# Exercise Set 15: Text as Data 1

*Morning, August 22, 2018*

In this Exercise Set you will implement a sklearn classifier to do sentiment analysis using the labeled review data that you collected in Exercise Set 8. You will also practice your basic Python skills while implementing the tf-idf weighting scheme. 

## Exercise Section 15.1: Writing your own TF-IDF vectorizer

In this exercise you will practice your python skills while implementing the [Term Frequency - Inverse Document Frequency](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) scheme.
> **Ex. 15.1.0:** First we load the data using the `pd.read_csv` function. The link to the data is here: 'https://raw.githubusercontent.com/snorreralund/scraping_seminar/master/english_review_sample.csv'

> Next we define a variable `tokenized` to be transformed using our TF-IDF vectorizer, by tokenizing the text column (`reviewBody`) in the dataframe using the `nltk.word_tokenize` function. 


In [27]:
#[Answer 15.1.0]
import pandas as pd
import nltk
#nltk.download('punkt')

url = 'https://raw.githubusercontent.com/snorreralund/scraping_seminar/master/english_review_sample.csv'

data = pd.read_csv(url)
print(data['reviewBody'].head())

type(data['reviewBody'][0])


tokenized = data['reviewBody'].apply(nltk.word_tokenize).values

print(tokenized)

0    Lots of inventory, very fast and efficient. I ...
1    I did not received the map I had ordered and p...
2    After searching a number of stores here in my ...
3    Website is not intuitive.  I don't like having...
4    Outstanding customer service, appreciated the ...
Name: reviewBody, dtype: object
[list(['Lots', 'of', 'inventory', ',', 'very', 'fast', 'and', 'efficient', '.', 'I', 'would', 'recommend', 'this', 'company', '.'])
 list(['I', 'did', 'not', 'received', 'the', 'map', 'I', 'had', 'ordered', 'and', 'paid', 'for', 'within', 'the', 'stated', 'delivery', 'time', '.', 'I', 'emailed', 'Mapscompany', 'and', 'their', 'only', 'first', 'response', 'was', 'to', 'send', 'me', 'an', 'invitation', 'to', 'review', 'my', 'experience', '!', 'My', 'initial', 'review', 'was', 'therefore', 'extremely', 'negative', '.', 'Since', 'then', 'a', 'support', 'team', 'member', 'emailed', 'me', 'and', 'was', 'reassuring', 'but', 'ultimately', 'did', "n't", 'provide', 'me', 'with', 'any', 'concret

Now we define our tf-idf transformation of the tokenized texts. 
Remember that:

$ IDF = \log\frac{N}{n_t} $

$ TF = \frac{c_{t,d}}{c_d} $

where 

$N$ is the number of documents,

$n_t$ is the number of documents in which the token $t$ is present,

$c_{t,d}$ is the number of times the token $t$ is present in document $d$,

$c_d$ is the total number of tokens in document $d$.

We need to do the following steps:
1. For each word, count the number of documents it is present in.
2. Transform this document count into the inverse document frequency. 
3. Calculate the term frequency in each document.
4. Weight the term frequency in each document with the inverse document frequency of each term.
5. Return the result as a sparse vector. 
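Steps 1 to 4 can be sketched on a tiny example before working on the real data. The three-document corpus below is made up for illustration, and the variable names are only suggestions; the sparse conversion in step 5 is left out.

```python
import math
from collections import Counter

# Hypothetical toy corpus: three already-tokenized documents.
docs = [
    ["good", "service", "good"],
    ["bad", "service"],
    ["good", "price"],
]
N = len(docs)

# Steps 1-2: count the documents each token appears in, then IDF = log(N / n_t).
doc_count = Counter(token for doc in docs for token in set(doc))
idf = {token: math.log(N / n_t) for token, n_t in doc_count.items()}

# Steps 3-4: TF = c_{t,d} / c_d, weighted by the IDF of each token.
tfidf_docs = []
for doc in docs:
    counts = Counter(doc)
    c_d = len(doc)
    tfidf_docs.append({t: (c / c_d) * idf[t] for t, c in counts.items()})

print(tfidf_docs[0])
```

Note that every weight is non-negative, because a token can never occur in more than $N$ documents, so $N / n_t \geq 1$.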

> **Ex. 15.1.1:** 
Import the Counter object from the builtin package collections (Hint1). This is essentially a dictionary designed for keeping counts, same syntax, but extra functionality. We don't have to initialize each key. We can write: 

```python
c = Counter()
# then we can do this
c['hej']+=1
# without first defining c['hej'] = 0
```


>* Initialize a Counter object and assign it to the variable `dc` (document count).
>* Define a list named `text_counts`. In this container we will store each document after we have converted it to counts of tokens.
>* Run through all tokenized texts and
    * initialize a Counter object with the tokenized text as input, and assign this object to a variable `c_t`. This will now contain a count of each token in the document. Append `c_t` to our list `text_counts`.
    * run through each key in `c_t` and increment the document count variable `dc` by one. (Hint2)

(hint1: from ... import ...)

(hint2: dc[token]+=1)

In [88]:
#[Answer to 15.1.1]
from collections import Counter

dc = Counter()

text_counts = []

for i in tokenized:
    c_t = Counter(i)
    text_counts.append(c_t)
    for token in c_t:
        dc[token]+=1
        
print(text_counts[0:100])


[Counter({'.': 2, 'Lots': 1, 'of': 1, 'inventory': 1, ',': 1, 'very': 1, 'fast': 1, 'and': 1, 'efficient': 1, 'I': 1, 'would': 1, 'recommend': 1, 'this': 1, 'company': 1}), Counter({'and': 4, '.': 4, 'was': 4, 'to': 4, 'I': 3, 'did': 3, 'map': 3, 'me': 3, 'the': 2, 'emailed': 2, 'an': 2, 'review': 2, 'my': 2, 'a': 2, 'not': 1, 'received': 1, 'had': 1, 'ordered': 1, 'paid': 1, 'for': 1, 'within': 1, 'stated': 1, 'delivery': 1, 'time': 1, 'Mapscompany': 1, 'their': 1, 'only': 1, 'first': 1, 'response': 1, 'send': 1, 'invitation': 1, 'experience': 1, '!': 1, 'My': 1, 'initial': 1, 'therefore': 1, 'extremely': 1, 'negative': 1, 'Since': 1, 'then': 1, 'support': 1, 'team': 1, 'member': 1, 'reassuring': 1, 'but': 1, 'ultimately': 1, "n't": 1, 'provide': 1, 'with': 1, 'any': 1, 'concrete': 1, 'information': 1, 'as': 1, 'where': 1, 'The': 1, 'eventually': 1, 'arrive': 1, 'is': 1, 'great': 1, ',': 1, 'so': 1, 'good': 1, 'end': 1, 'imperfect': 1, 'start': 1}), Counter({'I': 4, '.': 4, 'a': 3, 'a

> **Ex. 15.1.2:** 
Now we define the inverse document frequency variable `idf` as a dictionary with the tokens as keys and idf weights as values. We do this by running through both the token and the value (document count) in the `dc` variable and calculating the log of the ratio between the number of documents and the token's document count. 

>Use the `np.log` function for the log transform.

>We can iterate through this using the `.items()` syntax we know from the dictionary. 


In [86]:
#[Answer 15.1.2]
import numpy as np
idf = {}
N = len(data)

for key, value in dc.items():
    idf[key] = np.log(N / value)  # IDF = log(N / n_t)

print(idf)



> **Ex. 15.1.3:** 
Now we weight the term frequency in each document with the idf value of each token. Here we use our `text_counts` variable, which almost holds the term frequency; we just need to divide each count by the number of tokens in the document. 
Define a list container: `tfidf_docs`. 

* FIRST LOOP: for each counter in the `text_counts` container:
    * define the variable `doc_n` as the sum of all values in the counter (use `.values()`).
    * define a dictionary named `tfidf`.
    * SECOND LOOP: run through all tokens and their counts by using the `.items()` method of the counter.
        * define a value `tf` as the ratio between the count and the sum.
        * now weight this value with the idf weight found by calling the `idf` variable with the token as key.
        * assign this weighted term frequency to `tfidf[token]`.
    * Once outside the second loop, append the `tfidf` dictionary to the `tfidf_docs` list container.

In [92]:
#[Answer 15.1.3]
tfidf_docs = []

for counter in text_counts:
    doc_n = sum(counter.values())  # total number of tokens in the document
    tfidf = {}
    for token, count in counter.items():
        tf = count/doc_n
        tfidf[token] = tf*idf[token]
    tfidf_docs.append(tfidf)

print(tfidf_docs[0:100])

[{'Lots': 0.4143072065614794, 'of': 0.08970242457651485, 'inventory': 0.4143072065614794, ',': 0.06088634819295711, 'very': 0.1135832394603561, 'fast': 0.17872810224759716, 'and': 0.03171868546821267, 'efficient': 0.2850577625896994, '.': 0.03754232952692359, 'I': 0.04473295839458747, 'would': 0.1434388146305012, 'recommend': 0.17067075489258404, 'this': 0.13349870003624717, 'company': 0.17531341564608224}, {'I': 0.02212069371160919, 'did': 0.0853056279569124, 'not': 0.021924180145561373, 'received': 0.031635588154752654, 'the': 0.014902451497624272, 'map': 0.2674196071420244, 'had': 0.02185972828344711, 'ordered': 0.029476730267622334, 'and': 0.020913418990030335, 'paid': 0.04433176526815774, 'for': 0.013652386728717008, 'within': 0.04234894612422166, 'stated': 0.05867186111064058, 'delivery': 0.03348379744990705, 'time': 0.02292156164266768, '.': 0.012376592151733052, 'emailed': 0.11601131295059026, 'Mapscompany': 0.10121253156017783, 'their': 0.03127

> **Ex. 15.1.extra:** 
Convert the dictionaries to a sparse matrix.
* Create an index for each token that you can look up using a dictionary. 
* Define the shape of the matrix, i.e. n_rows and n_cols, as a tuple containing the number of documents and the number of tokens.
* Import `scipy.sparse` as `sp` and initialize a sparse matrix you can build incrementally: `sp.lil_matrix()`. 
    * It takes the shape parameter and a datatype `dtype` parameter; define the dtype as `np.float32`. 
* FIRST LOOP: iterate through the transformed documents from the `tfidf_docs` variable. Add the `enumerate()` function to keep track of the row numbers. 
    * SECOND LOOP: iterate through all tokens and tf-idf scores. Get the index of the token and assign the score to the matrix using doc_idx and token_idx as selectors, i.e. `mat[doc_idx, token_idx] = score`.
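One way these steps might look is sketched below. The two small dictionaries are a made-up stand-in for the real `tfidf_docs` list, and the names `vocabulary` and `token2idx` are only suggestions.

```python
import numpy as np
import scipy.sparse as sp

# Hypothetical stand-in for the `tfidf_docs` list built in Ex. 15.1.3.
tfidf_docs = [
    {"good": 0.27, "service": 0.14},
    {"bad": 0.55, "service": 0.20},
]

# Give every token a column index that can be looked up in a dictionary.
vocabulary = sorted({token for doc in tfidf_docs for token in doc})
token2idx = {token: idx for idx, token in enumerate(vocabulary)}

# Shape: (number of documents, number of tokens).
shape = (len(tfidf_docs), len(token2idx))
mat = sp.lil_matrix(shape, dtype=np.float32)

# Fill the matrix one (document, token) cell at a time.
for doc_idx, doc in enumerate(tfidf_docs):
    for token, score in doc.items():
        mat[doc_idx, token2idx[token]] = score

# LIL is convenient for incremental building; CSR is common for later computation.
mat = mat.tocsr()
print(mat.shape, mat.nnz)
```

The conversion to CSR at the end is optional and goes beyond what the exercise asks for.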
        

In [None]:
# [Answer to ex. 15.1.extra]

## Exercise Section 15.2: Supervised Sentiment Analysis


In this exercise I want you to train a classifier to do sentiment analysis of text. You will use the ratings as labels and the reviews as features. You will go through all steps, from preprocessing, feature engineering, cleaning and tokenization, to vectorization and training of the classifier. Then you will wrap it all in a function to make the code reusable. 

And finally you will analyze the performance of the resulting classifier.
> **Ex. 15.2.0:** First we load the data using the `pd.read_csv` function. The link to the data is here: 'https://raw.githubusercontent.com/snorreralund/scraping_seminar/master/english_review_sample.csv'

In [98]:
# [Answer to Ex. 15.2.0]
import pandas as pd
url = 'https://raw.githubusercontent.com/snorreralund/scraping_seminar/master/english_review_sample.csv'
data = pd.read_csv(url)
print(data.info())
print(data.head())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 21 columns):
Unnamed: 0                  10000 non-null int64
__domain__                  10000 non-null object
address_@type               10000 non-null object
address_addressCountry      3580 non-null object
address_addressLocality     7087 non-null object
address_postalCode          6789 non-null object
address_streetAddress       6859 non-null object
author_@type                10000 non-null object
datePublished               10000 non-null object
email                       7717 non-null object
headline                    10000 non-null object
inLanguage                  10000 non-null object
itemReviewed_@type          10000 non-null object
itemReviewed_name           10000 non-null object
meta_@type                  10000 non-null object
name                        10000 non-null object
reviewBody                  10000 non-null object
reviewRating_@type          10000 non-null objec

### Feature engineering using regular expressions
Because we are essentially creating a model that trains on a bag-of-words representation of the data, we are not going to think too much about the tokenization scheme. However, we want to make sure that emoticons and emojis are included, as they carry vital information for sentiment analysis. Here we either reuse the regular expression for capturing emoticons that we wrote in Exercise Set 9 or write a new one. 

Furthermore, we want to continue exercise 9.2 on capturing references to the cost of a service. We want to embed domain knowledge into the prices by converting them to a categorical variable, from low to high price, instead of presenting the model with unique tokens ($\$921$ $\$10$ $\$935$ ) that it cannot learn from. 

**Ex. 15.2.1:** 
Write a function that captures all digits before or after a dollar sign, keeping only the digits. We then convert the digits to a float and map it to a categorical value using the following rules: 
* if below $\$10$: return `'__price0__'`
* elif between $\$10$ and $\$100$: return `'__price1__'`
* elif between $\$100$ and $\$500$: return `'__price2__'`
* else: return `'__price3__'`

Instructions here:
>* First write a function `price2category` that takes a float or integer value and outputs a price category (e.g. 100000 = `'__price3__'`) according to the above rules.
* Compile your currency regular expression and assign it to the variable `currency_re`.
* Use the `currency_re` variable to find all (`.findall`) prices/currencies mentioned in a string. Assign this to a variable: `prices`.
* Define a simple regular expression that extracts only the digits from a string you already know is a price.
* Extract the digits from each price string in the `prices` variable, and assign them to a list named `digits`.
* Now we need to distinguish between a `,` or `.` used to indicate a fraction of a dollar and one used as a separator to help us read large numbers. Here we use the following patterns, which look behind and ahead and count the digits: two digits after the separator indicate a fraction of a dollar, and three digits indicate a helper separator: 
```python
helper_pattern = r'(?<=\d)[.,](?=\d{3})' # helper pattern
cent_pattern = r'(?<=\d)[,.](?=\d{2})' # cent pattern
```
* Use the cent_pattern to substitute all `,` and `.` with `'.'`, i.e. by applying `re.sub(cent_pattern, '.', digit_string)` to all digit strings in the `digits` variable. 
* Do the same thing with the helper_pattern, but now substitute with an empty string `''`. 
* Convert the now ready digit strings to floats using the builtin function `float()`.
* Then convert all float values to price categories by applying the `price2category` function to all the values. 
* Lastly, iterate through the original matches in the string, stored in the `prices` variable, and the resulting price categories using a for loop where you zip the two variables: `zip(prices, price_categories)`. 
    * For each price, price category pair you overwrite the original string with a string in which the price has been replaced by the price category.

* Finally wrap it all in a function named `embed_price_categories` and remember to return the string. 
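The steps above could be put together roughly as follows. This is a sketch under assumptions, not the reference solution: the currency pattern (a number directly before or after `$`, or followed by `dollar(s)`) is just one possible choice, and each lower boundary is treated as belonging to the higher category (so exactly 10 maps to `__price1__`).

```python
import re

def price2category(value):
    # Assumption: boundaries belong to the higher category (10 -> __price1__).
    if value < 10:
        return '__price0__'
    elif value < 100:
        return '__price1__'
    elif value < 500:
        return '__price2__'
    return '__price3__'

# Assumed shape of a number: digits, optional 3-digit groups, optional 2-digit cents.
NUMBER = r'\d+(?:[.,]\d{3})*(?:[.,]\d{2})?'
# Assumed shape of a price: "$ number", "number $" or "number dollar(s)".
currency_re = re.compile(rf'\$\s?{NUMBER}|{NUMBER}\s?(?:\$|dollars?\b)')
digit_re = re.compile(NUMBER)
helper_pattern = r'(?<=\d)[.,](?=\d{3})'  # separator before exactly-grouped thousands
cent_pattern = r'(?<=\d)[,.](?=\d{2})'    # separator followed by at least two digits

def embed_price_categories(string):
    prices = currency_re.findall(string)
    digits = [digit_re.search(price).group() for price in prices]
    digits = [re.sub(cent_pattern, '.', d) for d in digits]    # unify separators to '.'
    digits = [re.sub(helper_pattern, '', d) for d in digits]   # drop thousands separators
    price_categories = [price2category(float(d)) for d in digits]
    for price, category in zip(prices, price_categories):
        string = string.replace(price, category, 1)
    return string

print(embed_price_categories('hej med dig 1234$ 123, 76 dollars, 1 dollar, $100'))
```

Here the bare `123` is left untouched because it has no currency marker. Replacing with `str.replace` follows the instructions, but it can hit the wrong occurrence when one price string is contained in another; `currency_re.sub` with a replacement function would be a more robust alternative.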

In [121]:
#[Answer 15.2.1]
import re

def price2category(value):
    if value < 10:
        price_cat = '__price0__'
    elif value < 100:  # chained checks so that exactly 10 is not sent to the last category
        price_cat = '__price1__'
    elif value < 500:  # likewise for exactly 100
        price_cat = '__price2__'
    else:
        price_cat = '__price3__'
    return price_cat


string = 'hej med dig 1234$ 123, 76 dollars, 1 dollar, $100'

#pattern = '^\$?(\d{1,3},?(\d{3},?)*\d{3}(\.\d{0,2})?|\d{1,3}(\.\d{0,2})?|\.\d{1,2}?)$'
pattern = r'\$+ | dollars'  # raw string, so the backslash reaches the regex engine unchanged
currency_re = re.compile(pattern)
currency_re.findall(string)

['$ ', ' dollars']

> **Ex. 15.2.2:** Normalization and tokenization
In this exercise we define a function for normalizing tokens. It should apply our feature engineering of prices, make sure emoticons are tokenized, and convert the string to lowercase.  

>* Define a function `normalize_tokens` taking a string as input. 
* First use the function for extracting prices and substituting them with a price class.
* Because the standard tokenization scheme does not have a rule for emoticons, we extract all emoticons before tokenization. Do this by using the following precompiled regular expression from the `nltk` package:
`nltk.sentiment.util.EMOTICON_RE.findall()`. Compiled here means that you don't need to specify the pattern, since it is built in. You need to `import nltk.sentiment`.
* If emoticons are found, iterate through them and remove them from the string using the builtin string method `.replace`.
* Tokenize the remaining string, then write a list comprehension lowercasing all tokens, i.e. capital to noncapital letters. Use the builtin string method `.lower()`.
* Finally add the emoticons found to the token list and return all tokens.
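The flow of `normalize_tokens` can be sketched without any downloads as below. To keep it dependency-free, the sketch uses two simplified stand-ins that are assumptions of this example: `SIMPLE_EMOTICON_RE` in place of `nltk.sentiment.util.EMOTICON_RE`, and a regex tokenizer in place of `nltk.word_tokenize`. The `embed_price_categories` placeholder stands for your function from Ex. 15.2.1.

```python
import re

# Stand-ins (assumptions) so the sketch runs without nltk:
# in the exercise, use nltk.sentiment.util.EMOTICON_RE and nltk.word_tokenize.
SIMPLE_EMOTICON_RE = re.compile(r"[:;=][-']?[)(\]\[DPp/\\|]")
SIMPLE_TOKEN_RE = re.compile(r"\w+|[^\w\s]")

def embed_price_categories(string):
    """Placeholder for the function from Ex. 15.2.1 (here it changes nothing)."""
    return string

def normalize_tokens(string, emoticon_re=SIMPLE_EMOTICON_RE,
                     tokenize=SIMPLE_TOKEN_RE.findall):
    string = embed_price_categories(string)
    # Pull the emoticons out first so the tokenizer cannot split them up.
    emoticons = emoticon_re.findall(string)
    for emoticon in emoticons:
        string = string.replace(emoticon, ' ')
    tokens = [token.lower() for token in tokenize(string)]
    # Emoticons are appended unchanged (":D" and ":d" would otherwise merge).
    return tokens + emoticons

print(normalize_tokens("Great SERVICE :) would buy again!"))
```

Replacing each emoticon with a space rather than an empty string is a small design choice that keeps neighbouring words from being glued together.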


In [None]:
# [Answer 15.2.2]

> **Ex. 15.2.3** Now we are ready to convert our documents into sparse matrices to be used in training the classifier. But first we need to convert our ratings variable into binary form and split the data into train and test. 
* Apply a function that returns 0 if the rating is 3 or below and 1 if the rating is above 3. 
* Next we split our data into train and test by indexing the first 7500 rows for training and the last 2500 for testing. 
* We use the sklearn version of the tfidf vectorizer. First import it (if you don't know how to, ask Google). 
* Initialize the vectorizer with the arguments `preprocessor=None, tokenizer=normalize_tokens`.
* Apply the `.fit` method to the training data only (to make sure no information leaks between train and test).
* Apply the `.transform` method to both the training and the test data.
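A scaled-down sketch of these steps is shown below. The six-row dataframe, the column name `rating`, the split point `n_train = 4` and the whitespace-based `normalize_tokens` are all assumptions made for illustration; on the real data you would use the actual rating column, 7500 training rows and your own tokenizer.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy data standing in for the review sample.
data = pd.DataFrame({
    "reviewBody": ["great service", "terrible delivery", "very good price",
                   "bad support", "good and fast", "slow and bad"],
    "rating": [5, 1, 4, 2, 5, 3],
})

# Binary label: 1 if the rating is above 3, otherwise 0.
y = (data["rating"] > 3).astype(int)

# Positional split: first rows for training, the rest for testing.
n_train = 4
train_text, test_text = data["reviewBody"].iloc[:n_train], data["reviewBody"].iloc[n_train:]
y_train, y_test = y.iloc[:n_train], y.iloc[n_train:]

def normalize_tokens(string):
    """Simplified stand-in for the tokenizer from Ex. 15.2.2."""
    return string.lower().split()

vectorizer = TfidfVectorizer(preprocessor=None, tokenizer=normalize_tokens)
vectorizer.fit(train_text)                 # vocabulary and idf from training data only
X_train = vectorizer.transform(train_text)
X_test = vectorizer.transform(test_text)
print(X_train.shape, X_test.shape)
```

Tokens that occur only in the test documents have no column in the fitted vocabulary and are simply ignored by `.transform`.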

In [31]:
#[Answer 15.2.3]

> **Ex. 15.2.4:** Training the model. 
Here we apply a logistic regression model with regularization to predict whether the rating is positive (above 3) or negative. 
* First we import the classifier: from sklearn.linear_model import LogisticRegression 
* Next we initialize it with regularization parameter C=10
* Then we use the .fit method.
* And finally we measure the performance: accuracy, precision, recall, f1 etc.
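The training and evaluation steps might look like the sketch below. The tiny training and test sets are made up, and a default `TfidfVectorizer` is used only to keep the example self-contained; the scores it prints say nothing about performance on the real reviews.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Hypothetical toy data: 1 = positive review, 0 = negative review.
train_texts = ["great service", "very good price", "good and fast", "love it",
               "terrible delivery", "bad support", "slow and bad", "awful service"]
y_train = [1, 1, 1, 1, 0, 0, 0, 0]
test_texts = ["good service", "bad delivery"]
y_test = [1, 0]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)

# C is the inverse regularization strength: larger C means weaker regularization.
clf = LogisticRegression(C=10)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_test, y_pred, average="binary", zero_division=0
)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

`sklearn.metrics.classification_report` prints the same measures per class in one call, which can be handy once there are more than two classes.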

In [38]:
#[Answer 15.2.4]

>**Ex. 15.2.5:**
Now train and evaluate the classifier again, but this time do a multiclass prediction. This means changing the `y` variable to be the ratings themselves.

* When doing the evaluation, we also want to see the confusion matrix to inspect the errors.
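A sketch of the multiclass variant with a confusion matrix follows. The toy reviews and their 1 to 5 ratings are invented for illustration, and `max_iter=1000` is an extra setting chosen here only to avoid convergence warnings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical toy reviews with 1-5 star ratings (two per rating).
train_texts = [
    "awful terrible experience", "worst service ever",
    "bad and slow delivery", "poor support",
    "okay but nothing special", "average experience",
    "good service", "nice and fast delivery",
    "excellent great service", "perfect love it",
]
y_train = [1, 1, 2, 2, 3, 3, 4, 4, 5, 5]
test_texts = ["terrible service", "good delivery", "great experience"]
y_test = [1, 4, 5]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)

clf = LogisticRegression(C=10, max_iter=1000)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Rows are true ratings, columns are predicted ratings, both ordered as in `labels`.
labels = [1, 2, 3, 4, 5]
cm = confusion_matrix(y_test, y_pred, labels=labels)
print("accuracy:", accuracy_score(y_test, y_pred))
print(cm)
```

Passing `labels` explicitly keeps the matrix 5 by 5 even when some ratings are missing from the test set. Off-diagonal mass close to the diagonal means the classifier mostly confuses neighbouring ratings.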

In [52]:
#[Answer 15.2.5]

## Bias and Fairness
If we want to use our classifier as a measurement tool, say for measuring public sentiment, we need to understand the bias our classifier has so we can potentially correct it. 
In this exercise you should calculate performance on subpopulations of the data.

* We should look at how it is skewed towards one class or the other.
* We should look at whether it does better for certain product categories.
* We should look at whether it does better for male or female authors, by inferring gender from matching the author's name to data from the following register (female names: https://ast.dk/_namesdb/export/names?format=xls&gendermask=1, male names: https://ast.dk/_namesdb/export/names?format=xls&gendermask=2, unisex names: https://ast.dk/_namesdb/export/names?format=xls&gendermask=3).
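Performance on subpopulations can be computed with a `groupby` once true labels, predictions and a grouping variable sit in one table. The table below is entirely made up, and the column names `y_true`, `y_pred` and `group` are assumptions; the same pattern works with a product category as the grouping column.

```python
import pandas as pd

# Hypothetical predictions with an inferred group per review.
results = pd.DataFrame({
    "y_true": [1, 1, 0, 0, 1, 0, 1, 0],
    "y_pred": [1, 0, 0, 1, 1, 0, 1, 1],
    "group": ["female"] * 4 + ["male"] * 4,
})
results["correct"] = results["y_true"] == results["y_pred"]

per_group = results.groupby("group").agg(
    n=("correct", "size"),
    accuracy=("correct", "mean"),
    share_predicted_positive=("y_pred", "mean"),  # skew towards the positive class
    share_actual_positive=("y_true", "mean"),
)
print(per_group)
```

Comparing `share_predicted_positive` with `share_actual_positive` within each group shows whether the classifier over- or under-predicts the positive class for that group.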


In [51]:
#[Answer here]

## Extra
Design a regular expression that locates references to time (days, months, minutes, hours) and do a similar categorization, using domain knowledge of what counts as a long time.
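One possible shape for such a function is sketched below. Everything specific in it is an assumption: the unit list (weeks are added to the units named above), the 30-day month, the category names `__time0__` to `__time3__`, and the thresholds of one day, one week and one month.

```python
import re

# Assumed conversion factors to days (a month is approximated as 30 days).
DAYS_PER_UNIT = {"minute": 1 / (24 * 60), "hour": 1 / 24, "day": 1, "week": 7, "month": 30}

# A number followed by a time unit, e.g. "3 weeks" or "20 minutes".
time_re = re.compile(r'(\d+(?:\.\d+)?)\s*(minute|hour|day|week|month)s?\b', re.IGNORECASE)

def time2category(days):
    # Hypothetical thresholds for what counts as a long time.
    if days < 1:
        return '__time0__'
    elif days < 7:
        return '__time1__'
    elif days < 30:
        return '__time2__'
    return '__time3__'

def embed_time_categories(string):
    def replace(match):
        amount = float(match.group(1))
        unit = match.group(2).lower()
        return time2category(amount * DAYS_PER_UNIT[unit])
    # re.sub with a function replaces each match in place, left to right.
    return time_re.sub(replace, string)

print(embed_time_categories("Arrived after 3 weeks, support replied in 20 minutes"))
```

Unlike the zip-and-replace loop used for prices, a replacement function passed to `re.sub` cannot accidentally overwrite the wrong occurrence.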

In [None]:
# [Answer here]