# Section 9. Bag of Words Text Analysis

#### Instructor: Pierre Biscaye 

This is the second of three notebooks covering the foundations for performing **text analysis** in Python. In the previous part, we learned how to perform text preprocessing. However, we didn't move beyond the text data itself. If we're interested in doing any computational analysis on the text data, we still need approaches to convert the text into a **numeric representation**.

In Part 2 of this series, we'll explore one of the most straightforward ways to generate a numeric representation from text: the **bag-of-words** (BoW). We will implement the BoW representation to transform the airline tweets data into a numerical representation, and then build a classifier to explore what we can learn about the sentiment of the tweets. At the heart of the bag-of-words approach lies the assumption that the frequency of specific tokens is informative about the semantics and sentiment underlying the text. We'll make heavy use of the `scikit-learn` package to do so, as it provides a nice framework for constructing the numeric representations.

The content of this notebook is taken from UC Berkeley D-Lab's Python Text Analysis [course](https://github.com/dlab-berkeley/Python-Text-Analysis).
    
### Sections
1. Exploratory Data Analysis and Preprocessing
2. The Bag-of-Words Representation: Learn how to convert text data into a numerical representation through a Bag-of-Words approach.
3. Term Frequency-Inverse Document Frequency: Understand the TF-IDF algorithm and how it complements the Bag-of-Words representation. 
4. Sentiment Classification: Use the numerical representations of text data to perform sentiment analysis.

In [None]:
# Uncomment to install the following packages
# %pip install NLTK
# %pip install spaCy
# !python -m spacy download en_core_web_sm
# %pip install scikit-learn

In [None]:
# Import packages
import re
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from string import punctuation
%matplotlib inline

# 1. Exploratory Data Analysis and Preprocessing

Before we do any preprocessing or modeling, we should always do some exploratory data analysis to get a feel for the dataset.

First, let's take a look again at the first few rows of the dataset.

In [None]:
# Load dataset in
tweets_path = 'Data/airline_tweets.csv'
tweets = pd.read_csv(tweets_path, sep=',')

In [None]:
tweets.head()

As a refresher, each row in this dataframe corresponds to a tweet. The following columns are of main interest to us. There are other columns containing metadata of the tweet, such as the author of the tweet, when it was created, the timezone of the user, and others, which we will set aside for now. 

- `text` (`str`): the text of the tweet.
- `airline_sentiment` (`str`): the sentiment of the tweet, labeled as "neutral", "positive", or "negative". 
- `airline` (`str`): the airline that is tweeted about.
- `retweet_count` (`int`): how many times the tweet was retweeted.

To prepare us for sentiment classification, we'll partition the dataset so as to focus on the "positive" and "negative" tweets for now. 

In [None]:
tweets = tweets[tweets['airline_sentiment'] != 'neutral'].reset_index(drop=True)
tweets.shape

Let's take a look at the text of a few of these tweets.

In [None]:
# Print first five tweets
for idx in range(5):
    print(tweets['text'].iloc[idx])

We can already see that some of these tweets contain negative sentiment—how can we tell this is the case? 

Next, let's take a look at the distribution of sentiment labels in this dataset. 

In [None]:
# Make a bar plot showing the count of tweet sentiments
sns.countplot(data=tweets,
              x='airline_sentiment', 
              color='cornflowerblue',
              order=['positive', 'negative']);

It looks like the majority of the tweets we have in this dataset have been classified as expressing negative sentiment.

Let's take a look at what gets more retweeted:

In [None]:
# Get the mean retweet count for each sentiment
tweets.groupby('airline_sentiment')['retweet_count'].mean()

Negative tweets are clearly retweeted more often than tweets having positive sentiments.

Let's see which airline receives the largest share of negative tweets:

In [None]:
# Get the proportion of negative tweets by airline
proportions = tweets.groupby(['airline', 'airline_sentiment']).size() / tweets.groupby('airline').size()
proportions.unstack().sort_values('negative', ascending=False)

It looks like people are most dissatisfied with US Airways, followed by American Airlines, both having over 85\% negative tweets!

That's enough data exploration for now. 

## Preprocessing

Before conducting our own sentiment analysis, we need to preprocess the text data so that they are in a standard format.

We spent much of the last workshop learning how to preprocess data. Let's apply what we learned! Looking at some of the tweets above, we can see that while they are in pretty good shape, we can do some additional processing on them.

In our pipeline, we'll omit the tokenization process, since we will perform it in a later step. 

Let's put together a text cleaning pipeline. 

We'll accomplish this by writing a function called `preprocess()` that performs the following steps on a text input:
* Step 1: Lowercase text.
* Step 2: Replace the following patterns with placeholders:
    * URLs &rarr; ` URL `
    * Tweet handles &rarr; ` HANDLE `
    * Digits &rarr; ` DIGITS `
    * Hashtag symbols (`#`) &rarr; a single space, keeping the word that follows
* Step 3: Remove extra whitespace.

**Question**: Why might some of these steps make sense for a sentiment analysis task? What step(s) might we reconsider?

In [None]:
blankspace_pattern = r'\s+'
url_pattern = r'(http|ftp|https):\/\/([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:\/~+#-]*[\w@?^=%&\/~+#-])'
handle_pattern = r'@\w+'
digit_pattern = r'\d+'
hashtag_pattern = r'[＃#]'

def preprocess(text):
    '''Create a preprocess pipeline that cleans the tweet data.'''

    # Step 1: Lowercase
    text = text.lower()

    # Step 2: Replace patterns with placeholders
    text = re.sub(url_pattern, ' URL ', text)
    text = re.sub(handle_pattern, ' HANDLE ', text)
    text = re.sub(digit_pattern, ' DIGITS ', text)
    text = re.sub(hashtag_pattern, ' ', text)

    # Step 3: Remove extra whitespace characters
    text = re.sub(blankspace_pattern, ' ', text)
    text = text.strip()
    
    return text

Let's test the `preprocess()` function on an example tweet to see how it's working. Then we can apply it to the entire `text` column in the tweets DataFrame.

In [None]:
example_tweet = 'congrats @Beyonce #Finally and @kendricklamar #NotLikeUs for big wins at the 2025 #Grammys https://abcnews.go.com/GMA/Culture/2025-grammys-winners-list/story?id=118247847'

# Print the example tweet
print(example_tweet)
print(f"{'='*50}")

# Print the preprocessed tweet
print(preprocess(example_tweet))

In [None]:
# Apply the function to text column and assign the preprocessed tweets to a new column
tweets['text_processed'] = tweets['text'].apply(preprocess)
tweets['text_processed'].head()

Congratulations! Preprocessing is complete. Let's dive into the bag-of-words!

# 2. The Bag-of-Words Representation

The idea of bag-of-words (BoW), as the name suggests, is quite intuitive: we take a text (or all its words) and toss it in a bag. The action of "throwing" the document in a bag disregards relative position between words, so what is "in the bag" is essentially an unsorted set of words [(Jurafsky & Martin, 2024, p.62)](https://web.stanford.edu/~jurafsky/slp3/ed3book.pdf). We then sort these into a list of unique words and the frequency with which they appear. 

For example, as shown in the following illustration, the word "coffee" appears twice. 

<img src='Images/bow-illustration-1.png' alt="BoW-Part2" width="600">

Analysis based on a bag-of-words representation primarily focuses on word frequency while discarding considerations around word order. 

In the context of sentiment analysis, the sentiment of a tweet is conveyed more strongly by specific words. For example, if a tweet contains the word "happy", it likely conveys positive sentiment, but not always (e.g., "not happy" denotes the opposite sentiment). When these words come up more often, they probably more strongly convey the sentiment. Clearly, a bag-of-words approach can help with sentiment analysis, but it has serious limitations.
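
The limitation mentioned above can be made concrete with a minimal sketch in plain Python. The two sentences and the `bag_of_words` helper are made up for illustration; the helper simply lowercases and splits on whitespace, which is cruder than the tokenization used later in this notebook.

```python
from collections import Counter

def bag_of_words(text):
    """Count lowercased, whitespace-separated tokens, ignoring their order."""
    return Counter(text.lower().split())

# Two hypothetical sentences with the same words but different meanings
bag_a = bag_of_words("not happy with the delay but happy with the crew")
bag_b = bag_of_words("happy with the delay but not happy with the crew")

print(bag_a["happy"])   # how many times "happy" appears in the first sentence
print(bag_a == bag_b)   # whether the two bags are indistinguishable
```

Because both sentences contain exactly the same tokens with the same counts, their bags are identical even though a human reader would interpret them differently.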

## Document Term Matrix

Now let's implement the idea of bag-of-words. Before we go deep into that, let's step back for a moment. In practice, text analysis often involves handling many documents. From now on, we use the term **document** to indicate a piece of text that we perform analysis on. It could be a news article, a book chapter, a phrase, a sentence, a tweet, etc. As long as it can be represented by a string of text, the length doesn't really matter. 

Imagine we have four documents (i.e., the four coffee-related phrases shown above) and toss them all in the bag. Instead of a word-frequency list, we can create a **document-term matrix** (DTM), which preserves information about each document rather than simply aggregating across all documents. In a DTM, the word list is the **vocabulary** (V) that holds all unique words occurring across the documents. For each **document** (D), we count the number of occurrences of each word in the vocabulary, and then plug the number into the matrix. In other words, the DTM we construct is a $D \times V$ matrix, where each row corresponds to a document, and each column corresponds to a token (or "term").

In the following example, the unique tokens (in this case individual words) in this set of documents, in alphabetical order, are in columns. For each document, we record the number of occurrences of each word in that document. The numerical representation for each document is a row in the matrix. For example, the first document, "the coffee roaster", has the numerical representation $[0, 1, 0, 0, 0, 1, 1, 0]$.

Note that the left index column now displays these documents as texts, but typically we would just assign an index to each of them. 

$$
\begin{array}{c|cccccccc}
 & \text{americano} & \text{coffee} & \text{iced} & \text{light} & \text{roast} & \text{roaster} & \text{the} & \text{time} \\ \hline
\text{the coffee roaster} & 0 & 1 & 0 & 0 & 0 & 1 & 1 & 0 \\
\text{light roast} & 0 & 0 & 0 & 1 & 1 & 0 & 0 & 0 \\
\text{iced americano} & 1 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\
\text{coffee time} & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 1 \\
\end{array}
$$
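
To see where the numbers in this matrix come from, here is a minimal sketch that builds the same DTM by hand in plain Python. It assumes whitespace tokenization, which is enough for these four phrases but is not what a full vectorizer does.

```python
docs = ["the coffee roaster", "light roast", "iced americano", "coffee time"]

# Vocabulary: every unique token across the documents, in alphabetical order
vocabulary = sorted({token for doc in docs for token in doc.split()})

# One row per document, one column per vocabulary term
dtm = [[doc.split().count(term) for term in vocabulary] for doc in docs]

print(vocabulary)
for doc, row in zip(docs, dtm):
    print(f"{doc:20s} {row}")
```

Each row is the numerical representation of one document, and the matrix has $D = 4$ rows and $V = 8$ columns.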

To create a DTM, we will use `CountVectorizer` from the package `sklearn`.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

The image below summarizes the general workflow of `CountVectorizer`:

<img src='Images/CountVectorizer1.png' alt="CountVectorizer" width="500">

Let's walk through these steps with the example of coffee phrases shown above.

### A Toy Example

In [None]:
# A toy example containing four documents (phrases)
test = ['the coffee roaster',
        'light roast',
        'iced americano',
        'coffee time']

The first step is to initialize a `CountVectorizer` object. Within the parentheses are the parameter settings we may choose to specify. You can take a look at the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) and see what options are available.  

For now we can just leave it blank for the default settings. 

In [None]:
# Create a CountVectorizer object
vectorizer = CountVectorizer()

The second step is to `fit` this `CountVectorizer` object to the data, which means creating a vocabulary of tokens from the set of documents. The third step is to `transform` our data according to the "fitted" `CountVectorizer` object, which means converting each document into a row of the DTM according to the vocabulary established in the "fitting" step.

It may sound a bit complex, but steps 2 and 3 can be done in one go using the `fit_transform` method.

In [None]:
# Fit and transform to create DTM
test_count = vectorizer.fit_transform(test)

Let's take a look at the resulting DTM. 

In [None]:
test_count

The output is a "sparse matrix": a matrix that contains mostly zeros. This makes sense: for each document, most words in the vocabulary don't occur at all, and these are counted as zero in the DTM. The sparse matrix is stored in "Compressed Sparse Row" format, a memory-saving format designed for sparse matrices. 
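
The idea behind the Compressed Sparse Row layout can be sketched in plain Python. This is only an illustration of the layout, not how `scipy` implements it; the names `data`, `indices`, and `indptr` are chosen to mirror the attributes that `scipy` CSR matrices expose.

```python
# Dense toy DTM (rows: documents, columns: vocabulary terms)
dense = [
    [0, 1, 0, 0, 0, 1, 1, 0],
    [0, 0, 0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0, 0, 1],
]

data, indices, indptr = [], [], [0]
for row in dense:
    for col, value in enumerate(row):
        if value != 0:
            data.append(value)     # the non-zero value itself
            indices.append(col)    # the column that value sits in
    indptr.append(len(data))       # position in `data` where the next row starts

print(data)
print(indices)
print(indptr)
```

Only the 9 non-zero entries are stored, instead of all 32 cells. The saving is small here but becomes substantial when the vocabulary has thousands of terms and each document uses only a handful of them.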

Let's convert it to a dense matrix, where the zeros are stored explicitly, as in a numpy array.

In [None]:
# Convert DTM to a dense matrix 
test_count.todense()

So this is our DTM. It is the same as shown above, but to make it more reader-friendly, let's convert it to a dataframe. The column names should be tokens in the vocabulary, which we can access with `get_feature_names_out()`.

In [None]:
# Retrieve the vocabulary
vectorizer.get_feature_names_out()

In [None]:
# Create a DTM dataframe
test_dtm = pd.DataFrame(data=test_count.todense(),
                        columns=vectorizer.get_feature_names_out())
test_dtm

Here it is! The DTM of our toy data is now a dataframe. The index of `test_dtm` corresponds to the position of each document in the `test` list. 

Now let's apply this process to the preprocessed tweet data we set up above.

### DTM for Tweets

We'll still begin with initializing a `CountVectorizer` object. In the following cell, we have included a few parameters that people often adjust. These parameters are currently set to their default values.

As shown below, when we construct a DTM, the default is to lowercase the input text. If nothing is provided for `stop_words`, the default is to keep them. The next three parameters are used to control the size of the vocabulary, which we'll return to in a minute.

In [None]:
# Create a CountVectorizer object
vectorizer = CountVectorizer(lowercase=True,
                             stop_words=None,
                             min_df=1,
                             max_df=1.0, 
                             max_features=None)

In [None]:
# Fit and transform to create DTM
counts = vectorizer.fit_transform(tweets['text_processed'])
counts

In [None]:
# Do not run if you have limited memory
counts.toarray()

In [None]:
# Extract tokens
tokens = vectorizer.get_feature_names_out()

In [None]:
# Create DTM
first_dtm = pd.DataFrame(data=counts.todense(),
                         index=tweets.index,
                         columns=tokens)

# Print the shape of DTM
print(first_dtm.shape)

If we leave the `CountVectorizer` to the default setting, in total we have a vocabulary size of 9817. 

In [None]:
first_dtm.head()

Most of the tokens have zero occurrences, at least in the first five tweets. This is not surprising when most tweets have relatively few words and we are indexing over nearly 10,000 vocabulary terms. 

Let's take a closer look at the DTM.

In [None]:
# Most frequent tokens
first_dtm.sum().sort_values(ascending=False).head(10)

In [None]:
# Least frequent tokens
first_dtm.sum().sort_values(ascending=True).head(10)

It is not surprising to see "handle" and "digits" among the most frequent tokens, since we replaced every individual handle and number with these placeholders. The rest of the most frequent tokens are mostly stop words, except for "flight".

Perhaps a more interesting pattern is to look for which token appears most in any given tweet:

In [None]:
# Use a new name so we don't overwrite the sparse `counts` matrix
top_tokens = pd.DataFrame()

# For each tweet, retrieve the token that appears most frequently
top_tokens['token'] = first_dtm.idxmax(axis=1)

# Retrieve the number of occurrences of that token
top_tokens['number'] = first_dtm.max(axis=1)

# Filter out placeholders
top_tokens[(top_tokens['token'] != 'digits')
           & (top_tokens['token'] != 'handle')].sort_values('number', ascending=False).head(10)

It looks like, across all tweets, the most times any single token appears in one tweet is 6, and that token is either "it" or "worst". 

Let's go back to our tweets dataframe and look at the tweet at index 918.

In [None]:
# Look at index 918: "worst"
tweets.iloc[918]['text']

## Customizing the `CountVectorizer`

So far we have used the default parameter settings to create our DTMs, but in many cases we may want to customize the `CountVectorizer` object. The purpose of doing so is to further filter out unnecessary tokens. In the example below, we tweak the following parameters:

- `stop_words = 'english'`: exclude English stop words 
- `min_df = 2`: exclude words that appear in fewer than two documents
- `max_df = 0.95`: exclude words if they occur in more than 95\% of the documents 

**Question**: Does it seem reasonable to set these parameters? Keep in mind the objective of this text analysis task: sentiment analysis.

Typically we are not interested in words that appear in too few or too many documents, so we use the `min_df` and `max_df` parameters to trim them out. Alternatively, we can cap the vocabulary size at $N$ using the `max_features` parameter, which tells `CountVectorizer` to consider only the $N$ most frequent tokens when constructing the DTM.

In [None]:
# Customize the parameter setting
vectorizer = CountVectorizer(lowercase=True,
                             stop_words='english',
                             min_df=2,
                             max_df=0.95,
                             max_features=None)

In [None]:
# Fit, transform, and get tokens
counts = vectorizer.fit_transform(tweets['text_processed'])
tokens = vectorizer.get_feature_names_out()

# Create the second DTM
second_dtm = pd.DataFrame(data=counts.todense(),
                          index=tweets.index,
                          columns=tokens)

In [None]:
print(first_dtm.shape)
print(second_dtm.shape)

Our second DTM has a substantially smaller vocabulary, compared to the first one.

In [None]:
second_dtm.sum().sort_values(ascending=False).head(10)

The most frequent token list now includes words that make more sense to us, for example, "cancelled", "service", etc. Note that it no longer includes "handle," as this was likely included in every tweet.

## Lemmatize the Text Input

Recall from notebook 9a that we introduced using `spaCy` to perform **lemmatization**, i.e., reducing each word to its dictionary form (its lemma) by stripping inflectional endings. With lemmatization, different forms of the same word are counted together, which presumably captures the core meaning of the text. 

Now let's implement lemmatization on our tweet data, and pass the lemmatized text to create a third DTM. We'll write a function `lemmatize_text`. It requires a text input, and the output is the same text except all tokens are lemmatized. We will use the `nlp()` pipeline for this, as lemmatization is one of the linguistic annotations that the `nlp` pipeline automatically does. We can use `token.lemma_` to access the annotation.

In [None]:
# Import spaCy (can take a few seconds)
import spacy
nlp = spacy.load('en_core_web_sm')

In [None]:
# Create a function to lemmatize text
def lemmatize_text(text):
    '''Lemmatize the text input with spaCy annotations.'''

    # Step 1: Initialize an empty list to hold lemmas
    lemma = []

    # Step 2: Apply the nlp pipeline to input text
    doc = nlp(text)

    # Step 3: Iterate over tokens in the text to get the token lemma
    for token in doc:
        lemma.append(token.lemma_)

    # Step 4: Join lemmas together into a single string
    text_lemma = ' '.join(lemma)
    
    return text_lemma

Let's apply the function to an example tweet first.

In [None]:
# Apply the function to an example tweet
print(tweets.iloc[101]["text_processed"])
print(f"{'='*50}")
print(lemmatize_text(tweets.iloc[101]['text_processed']))

Now let's lemmatize the tweet data, and save the output to a new column `text_lemmatized`.

In [None]:
# This may take a while!
tweets['text_lemmatized'] = tweets['text_processed'].apply(lemmatize_text)

In [None]:
# Let's save this dataset with the cleaned tweets for future use
tweets.to_csv("Data/tweets_clean.csv", index=False)

Now with the `text_lemmatized` column, let's create a third DTM. The parameter settings are the same as the second DTM. 

In [None]:
# Create the vectorizer (the same param setting as previous)
vectorizer = CountVectorizer(lowercase=True,
                             stop_words='english',
                             min_df=2,
                             max_df=0.95,
                             max_features=None)

# Fit, transform, and get tokens
counts = vectorizer.fit_transform(tweets['text_lemmatized'])
tokens = vectorizer.get_feature_names_out()

# Create the third DTM
third_dtm = pd.DataFrame(data=counts.todense(),
                         index=tweets.index,
                         columns=tokens)
third_dtm.head()

In [None]:
# Print the shapes of three DTMs
print(first_dtm.shape)
print(second_dtm.shape)
print(third_dtm.shape)

Let's print the top 10 most frequent tokens as usual. These tokens are now lemmas, and the counts also change after lemmatization. 

In [None]:
# Get the most frequent tokens in the third DTM
third_dtm.sum().sort_values(ascending=False).head(10)

In [None]:
# Compared to the most frequent tokens in the second DTM
second_dtm.sum().sort_values(ascending=False).head(10)

# 3. Term Frequency-Inverse Document Frequency 

So far, we've been relying on word frequencies to give us information about a document. This assumes that if a word appears more often in a document, it's more informative. However, this may not always be the case. For example, we've already removed stop words because they are not informative, despite the fact that they appear many times in a document. We also know the word "flight" is among the most frequent words, but it is also not that informative, because it appears in many documents. Since we're looking at airline tweets, we shouldn't be surprised to see the word "flight"!

To remedy this, we use a weighting scheme called **tf-idf (term frequency-inverse document frequency)**. The big idea behind tf-idf is to weight a word not just by its frequency within a document, but also by its frequency across documents. The idea is that words are more informative if there is variation in their appearance across documents, and when they appear frequently within a given document. So, when we construct the DTM, we will be assigning each term a **tf-idf score**. Specifically, term $t$ in document $d$ is assigned a tf-idf score as follows:

<img src='Images/tf-idf.png' alt="TF-IDF" width="1200">

In essence, the tf-idf score of a word in a document is the product of two components: term frequency (tf) and inverse document frequency (idf). The idf acts as a scaling factor. If a word occurs in all documents, then idf equals 1 and no scaling happens. For words that occur in fewer documents, idf is greater than 1, which raises the tf-idf score to highlight that the word is informative. In practice, we add 1 to both the numerator and the denominator ("add-1 smoothing") to prevent any issues with zero occurrences.
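
As a worked example, the sketch below computes smoothed idf values by hand for the four coffee phrases. It assumes the smoothed form $\text{idf}(t) = \ln\frac{1 + N}{1 + \text{df}(t)} + 1$, which is what `scikit-learn` documents for its default `smooth_idf=True` setting, and it leaves out the row normalization that `TfidfVectorizer` also applies by default. The helper names are illustrative.

```python
import math

docs = ["the coffee roaster", "light roast", "iced americano", "coffee time"]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

def smoothed_idf(term):
    """idf with add-1 smoothing: ln((1 + N) / (1 + df)) + 1."""
    df = sum(term in tokens for tokens in tokenized)  # documents containing the term
    return math.log((1 + n_docs) / (1 + df)) + 1

def tfidf(term, tokens):
    """Raw count of the term in one document times its smoothed idf."""
    return tokens.count(term) * smoothed_idf(term)

print(round(smoothed_idf("coffee"), 3))           # appears in 2 of 4 documents
print(round(smoothed_idf("roaster"), 3))          # appears in 1 of 4 documents
print(round(tfidf("roaster", tokenized[0]), 3))   # score in "the coffee roaster"
```

The rarer term "roaster" receives a larger idf than "coffee", so within the first document it ends up with the higher tf-idf score even though both words appear there once.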

We will create a tf-idf DTM using `sklearn`'s `TfidfVectorizer` class. It takes mostly the same parameters as `CountVectorizer`.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
# Create a tfidf vectorizer
vectorizer = TfidfVectorizer(lowercase=True,
                             stop_words='english',
                             min_df=2,
                             max_df=0.95,
                             max_features=None)

In [None]:
# Fit and transform 
tf_dtm = vectorizer.fit_transform(tweets['text_lemmatized'])
tf_dtm

In [None]:
# Create a tf-idf dataframe
tfidf = pd.DataFrame(tf_dtm.todense(),
                     columns=vectorizer.get_feature_names_out(),
                     index=tweets.index)
tfidf.head()

You may have noticed that we still have the same vocabulary size as above. This is because we used the same parameter settings when creating the vectorizer. But the values in the matrix are different now—they are tf-idf scores instead of raw counts. 

## Interpret TF-IDF Values

Let's take a look at the documents where each term has the highest tf-idf value.

We'll use `idxmax()` to find the index of each of these documents.

In [None]:
# Retrieve the index to the document
tfidf.idxmax()

For example, the term "worst" has its highest tf-idf value in the tweet at index 918. 

In [None]:
tfidf.idxmax()['worst']

Recall from the previous section that this is the tweet where the word "worst" appears 6 times!

In [None]:
tweets['text'].iloc[918]

How about "delay"? 

In [None]:
tfidf.idxmax()['delay']

In [None]:
tweets['text_processed'].iloc[5740]

It seems there is no instance where 'delay' appears more than once.

In the practice notebook, you will be asked to use the tf-idf dataframe to plot the 10 most informative words in tweets that have been classified as positive and negative.

# 4. Introduction to Sentiment Classification

Now that we have a numerical representation of the text, we are ready to operate on it for our sentiment classification task. We'll construct a DTM from the text data, and use that to predict the sentiment labels using a logistic regression model, as covered in Section 7 of this course. 

We will split the tweets data into training and test samples, then train the model on the training sample. The target is the airline sentiment being positive or negative. The list of features we'll pass to the model is exactly the vocabulary of the DTM. The coefficients from the model will tell us whether a feature contributes positively or negatively to the predicted value. The predicted value will then inform the predicted label, either positive (when $p \geq 0.5$) or negative (when $p < 0.5$). 
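
The link between coefficients, the predicted value $p$, and the predicted label can be sketched in a few lines. The coefficient values, intercept, and tf-idf scores below are hypothetical numbers chosen for illustration; they are not fitted values from the tweets data.

```python
import math

# Hypothetical coefficients, for illustration only (not fitted values)
coefs = {"thank": 2.0, "great": 1.5, "delay": -1.8, "cancel": -2.2}
intercept = 0.0

def predict(features):
    """Return (p, label) for a dict mapping token -> tf-idf score."""
    z = intercept + sum(coefs.get(tok, 0.0) * val for tok, val in features.items())
    p = 1 / (1 + math.exp(-z))  # logistic (sigmoid) function
    return p, "positive" if p >= 0.5 else "negative"

print(predict({"thank": 0.6, "great": 0.4}))
print(predict({"delay": 0.7, "cancel": 0.5}))
```

Tokens with positive coefficients push $p$ above 0.5 and toward the "positive" label, while tokens with negative coefficients push it below 0.5.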

We will then evaluate the performance of the model on the test data. 

Now that we have the tf-idf dataframe, the feature set is ready. Let's dive into model specification!

In [None]:
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import train_test_split

We'll use the `train_test_split` function from `sklearn` to separate our data into two sets:

In [None]:
# Train-test split
X = tfidf
y = tweets['airline_sentiment']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=5)

Now let's specify and fit a logistic regression model.

Recall that the positive and negative classes are not balanced, so we can tell the classifier to automatically assign class weights (through the `class_weight` parameter) that are inversely proportional to the class frequencies, so that it pays more attention to the minority class.
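
As a rough sketch of what `class_weight='balanced'` does, `scikit-learn` documents the heuristic as `n_samples / (n_classes * count_of_class)`. The example below applies that formula by hand to a made-up set of ten labels; the label counts are assumptions for illustration.

```python
from collections import Counter

# Hypothetical labels: 8 negative and 2 positive tweets
labels = ["negative"] * 8 + ["positive"] * 2

label_counts = Counter(labels)
n_samples, n_classes = len(labels), len(label_counts)

# Each class weight is inversely proportional to how often the class occurs
weights = {label: n_samples / (n_classes * count)
           for label, count in label_counts.items()}
print(weights)
```

The minority class receives the larger weight, so misclassifying a positive tweet costs the model more than misclassifying a negative one.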

In [None]:
# Specify the model
lr = LogisticRegressionCV(Cs=10,  # number of inverse regularization strengths (C) to try; 10 is the default
                          cv=5,  # 5-fold cross-validation
                          solver='liblinear',  # suitable for smaller datasets, supports lasso and ridge regularization
                          class_weight='balanced',  # automatically adjusts weights to handle imbalanced classes
                          random_state=5,  # for reproducibility
                          refit=True)  # refits on the full training data using the best C value
# Fit the logistic regression model
model = lr.fit(X_train, y_train)

How does it perform?

In [None]:
# Get the training and test accuracy
print(f"Training accuracy: {model.score(X_train, y_train)}")
print(f"Test accuracy: {model.score(X_test, y_test)}")

The model got ~95% accuracy on the training set, and ~91% on the test set - that's pretty good! The similarity between the two performances is also a good sign—it means we were able to generalize pretty well.

Let's also take a look at the fitted coefficients to see if what we see makes sense! 

We can access them using `coef_`, and we can match each coefficient to the tokens from the vectorizer:

In [None]:
# Get coefs of all features
coefs = model.coef_.ravel()

# Get all tokens
tokens = vectorizer.get_feature_names_out()

# Create a token-coef dataframe
importance = pd.DataFrame()
importance['token'] = tokens
importance['coefs'] = coefs

In [None]:
# Get the top 10 tokens with lowest coefs
neg_coef = importance.sort_values('coefs').head(10)
neg_coef

In [None]:
# Plot the top 10 tokens that have the lowest coefs
neg_coef.plot(kind='barh', 
              xlim=(0, -8),
              x='token',
              color='darksalmon',
              title='Top 10 tokens with lowest coefficient values');

In [None]:
# Get the top 10 tokens with highest coefs
pos_coef = importance.sort_values('coefs').tail(10)
pos_coef 

In [None]:
# Plot the top 10 tokens that have the highest coefs
pos_coef.sort_values('coefs', ascending=False).plot(kind='barh', 
                                                    xlim=(0, 14),
                                                    x='token',
                                                    color='cornflowerblue',
                                                    title='Top 10 tokens with highest coefficient values');

**Question:** Do these make sense? Could an analysis based on a set of such keywords have been used for the sentiment classification in the tweets dataset we are using? What would a modeler need to predict sentiment based on tweet text without preexisting classification as positive or negative?


## Key Points

* A Bag-of-Words representation is a simple method to transform our text data to numbers. It focuses on word frequency but not word order. 
* A TF-IDF representation is a step further; it also considers whether a word appears distinctively in one document or occurs uniformly across all documents. 
* With a numerical representation, we can perform a range of text classification tasks, such as sentiment analysis. 

