
# Exercise 7: Bag of Words representation

https://machinelearningmastery.com/gentle-introduction-bag-words-model/

The goal of this exercise is to understand how to create a Bag of Words (BoW) model on a corpus of texts. More precisely, we will create a labeled data set from textual data using a word count matrix.

As explained in the resource, the Bag of Words representation assumes that the order in which words appear in a text doesn't matter. There are different types of Bag of Words representations:

- Boolean: Each document is a boolean vector (1 if the word appears in the document, 0 otherwise)
- Wordcount: Each document is a word count vector
- TF-IDF: Each document is a score vector. The score is detailed in the next exercise.

The file `tweets_train.txt` contains tweets, each labeled with a sentiment that indicates how positive the tweet is.

Steps:

1. Preprocess the data using the function implemented in the previous exercise. Then, using `CountVectorizer` from scikit-learn with `max_features=500`, compute the word counts of the tweets. The output is a sparse matrix.

- Check the shape of the word count matrix
- Set **max_features** to 500, which keeps only the 500 most frequent terms of the initial dictionary.

**Reminder**: A data set is often described as an m x n matrix, where m is the number of rows (samples) and n is the number of columns (features). It is strongly recommended to work with m >> n. The acceptable ratio depends on the signal present in the data set and on the model complexity.

2. Using `from_spmatrix` from pandas (`pd.DataFrame.sparse.from_spmatrix`), create a DataFrame with documents in rows and dictionary words in columns.

|     | and | boat | compute |
| --: | --: | ---: | ------: |
|   0 |   0 |    2 |       0 |
|   1 |   0 |    0 |       1 |
|   2 |   1 |    0 |       0 |

3. Create a DataFrame with the labels:

- 1: positive
- 0: neutral
- -1: negative

|     | target |
| --: | -----: |
|   0 |     -1 |
|   1 |      0 |
|   2 |      1 |
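
One possible way to build this DataFrame is sketched below. It assumes the sentiments are already available as strings in a list; the `raw_labels` values and the `label_to_int` mapping are hypothetical, since the exact format of the labels in the data file is not described here:

```python
import pandas as pd

# Hypothetical raw labels; the real file format may differ.
raw_labels = ["negative", "neutral", "positive"]

# Mapping to the integer encoding used in the exercise.
label_to_int = {"positive": 1, "neutral": 0, "negative": -1}

y = pd.DataFrame({"target": [label_to_int[label] for label in raw_labels]})
print(y)
```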

- https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html


In [32]:
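import re
import pandas as pd
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer

# The NLTK tokenizer and stop word resources must be downloaded beforehand
# with nltk.download().

def prepare(text):
    """Lowercase, strip punctuation, drop stop words and stem. Returns a list of tokens."""
    text = text.lower()
    # Raw string so that the escapes reach the regex engine unchanged.
    text = re.sub(r'[!"#$%&\'()*+,\-./:;<=>?@\[\]^_`{|}~]', '', text)
    words = word_tokenize(text)

    stop_words = set(stopwords.words('english'))
    filtered_sentence = [w for w in words if w not in stop_words]

    ps = PorterStemmer()
    return [ps.stem(w) for w in filtered_sentence]

# Read the whole file as a single text; the file is closed on leaving the block.
with open('train-tweets.txt') as file:
    prepared = prepare(file.read())

# 1.
# NOTE: `prepared` is a flat list of tokens, so CountVectorizer treats every
# token as its own document (one row per token, not one row per tweet).
# To get one row per tweet, pass one preprocessed string per tweet instead.
vectorizer = CountVectorizer(max_features=500)
X = vectorizer.fit_transform(prepared)

# 2.
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out().
df = pd.DataFrame.sparse.from_spmatrix(X, columns=vectorizer.get_feature_names_out())
print(df[['talk', 'team', 'tell']])
print(df.iloc[:3,400:403].to_markdown())

# 3.
print(df.iloc[300:304,499:501].to_markdown())
# Raises KeyError: the labels were never extracted, so there is no 'label'
# column, and 'youtube' is not among the vectorizer's feature names either.
print(df[['youtube', 'label']])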

       talk  team  tell
0         0     0     0
1         0     0     0
2         0     0     0
3         0     0     0
4         0     0     0
...     ...   ...   ...
88878     0     0     0
88879     0     0     0
88880     0     0     0
88881     0     0     0
88882     0     0     0

[88883 rows x 3 columns]
|    |   someon |   someth |   son |
|---:|---------:|---------:|------:|
|  0 |        0 |        0 |     0 |
|  1 |        0 |        0 |     0 |
|  2 |        0 |        0 |     0 |
|     |   young |   your |
|----:|--------:|-------:|
| 300 |       0 |      0 |
| 301 |       0 |      0 |
| 302 |       0 |      0 |
| 303 |       0 |      0 |


KeyError: "None of [Index(['youtube', 'label'], dtype='object')] are in the [columns]"