Advanced ML Week 1, Lecture 1: Working with and Preparing Text Data

In this notebook we will be preparing Twitter (X) Tweets for sentiment analysis.  Sentiment analysis is a common text classification challenge to determine whether a text is positive or negative.  

This is useful for companies that want to analyze large numbers of documents, tweets, reviews, etc., to determine public sentiment about a product or service.

The data was originally gathered from Twitter (now X) and hand-labeled.  Of course there will be some human bias in the labeling.  It was downloaded from Kaggle at this site: [Kaggle Twitter Tweets Sentiment Dataset](https://www.kaggle.com/datasets/yasserh/twitter-tweets-sentiment-dataset/)

There are 3 classes: positive, negative, and neutral.

In [1]:
## Import necessary packages
import pandas as pd
import nltk

# Load the Data

We will load our **corpus** of tweets from the downloaded archive.

In [3]:
## Load the corpus of tweets (pandas can read a zip archive that contains a single CSV)
df = pd.read_csv('../Data/archive.zip')
df.head()

Unnamed: 0,textID,text,selected_text,sentiment
0,cb774db0d1,"I`d have responded, if I were going","I`d have responded, if I were going",neutral
1,549e992a42,Sooo SAD I will miss you here in San Diego!!!,Sooo SAD,negative
2,088c60f138,my boss is bullying me...,bullying me,negative
3,9642c003ef,what interview! leave me alone,leave me alone,negative
4,358bd9e861,"Sons of ****, why couldn`t they put them on t...","Sons of ****,",negative


# Some light EDA

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27481 entries, 0 to 27480
Data columns (total 4 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   textID         27481 non-null  object
 1   text           27480 non-null  object
 2   selected_text  27480 non-null  object
 3   sentiment      27481 non-null  object
dtypes: object(4)
memory usage: 858.9+ KB


In [5]:
df.duplicated().sum()

0

# Some Light Data Cleaning

We see that our **corpus** has 27481 **documents**, each with an ID, the full text, a shortened version, and the labeled sentiment.

Interestingly, one of the tweets has no text!  We definitely want to get rid of that.  We will also drop the `textID` and `selected_text` columns.  We are going to use the entire text of each tweet, not just a subset.

We will keep the label, `sentiment`, for later classification and analysis tasks.

In [6]:
df = df.drop(columns=['textID', 'selected_text'])
df = df.dropna()

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 27480 entries, 0 to 27480
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   text       27480 non-null  object
 1   sentiment  27480 non-null  object
dtypes: object(2)
memory usage: 644.1+ KB


# Some More EDA
Let's look at some aspects of this text.
* What do the **documents** look like?
* How long do they tend to be?

## View some sample tweets

In [8]:
## Expand how many characters pandas will show
pd.set_option('display.max_colwidth', None)

## Display some of the documents (tweets)
df['text'].head(10)

0                                                             I`d have responded, if I were going
1                                                   Sooo SAD I will miss you here in San Diego!!!
2                                                                       my boss is bullying me...
3                                                                  what interview! leave me alone
4                      Sons of ****, why couldn`t they put them on the releases we already bought
5    http://www.dothebouncy.com/smf - some shameless plugging for the best Rangers forum on earth
6                                2am feedings for the baby are fun when he is all smiles and coos
7                                                                                      Soooo high
8                                                                                     Both of you
9                            Journey!? Wow... u just became cooler.  hehe... (is that possible!?)
Name: text, dtype: object

In [9]:
df.loc[df['text'].str.contains('http://')]

Unnamed: 0,text,sentiment
5,http://www.dothebouncy.com/smf - some shameless plugging for the best Rangers forum on earth,neutral
17,"i`ve been sick for the past few days and thus, my hair looks wierd. if i didnt have a hat on it would look... http://tinyurl.com/mnf4kw",negative
35,"Thats it, its the end. Tears for Fears vs Eric Prydz, DJ Hero http://bit.ly/2Hpbg4",neutral
50,Then you should check out http://twittersucks.com and connect with other tweeple who hate twitter,neutral
57,will be back later. http://plurk.com/p/rp3k7,neutral
...,...,...
27374,"says Finally, Im home. http://plurk.com/p/rr121",neutral
27384,This is a much better tool than some I have come across http://www.tweepular.com - Twitter Karma on Steroids,positive
27386,#vwll2009 Would one of the VWLLers want to add this event to our Ning? http://bit.ly/BF5sh Would much appreciate that,positive
27463,LIKE DREW SAID 'GIVE TC A CHANCE' WE WILL MISS THOMAS BUT HAVE TO MOVE ON. SO WATCH THIS! http://bit.ly/r6RfC,negative


We can see here that there are some URLs in the text.  This will be a problem for normalization.  We will remove those.
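One possible way to strip URLs from the raw strings, before any tokenizing, is a regular expression.  This is only a sketch: `URL_PATTERN` and `strip_urls` are hypothetical names, the pattern is an assumption about what counts as a URL, and it is not the token-based approach used later in this notebook.

```python
import re

# Assumed pattern: anything starting with http://, https://, or www.
# and running up to the next whitespace character
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+")

def strip_urls(text):
    """Remove URL-like substrings, then collapse the leftover whitespace."""
    no_urls = URL_PATTERN.sub("", text)
    return " ".join(no_urls.split())

print(strip_urls("will be back later. http://plurk.com/p/rp3k7"))
# will be back later.
```

Working on the raw string keeps the URL in one piece, which is easier than finding its fragments after a tokenizer has split it.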

## Get some statistics on the length of **documents**

Let's see how long each tweet is and determine the average length of tweets.

In [10]:
## Determine the length of each tweet
## Create a new column of the lengths of each tweet
df['length']= df['text'].map(len)

In [11]:
## Analyze the statistics of the lengths
df['length'].describe()

count    27480.000000
mean        68.330022
std         35.603870
min          3.000000
25%         39.000000
50%         64.000000
75%         97.000000
max        141.000000
Name: length, dtype: float64

The tweets have a mean length of 68 characters and a median of 64.  They range from 3 to 141 characters with a standard deviation of about 36.  The middle 50% are between 39 and 97 characters in length.

This gives us some idea of how long they tend to be.

# Text Normalization with NLTK

## Normalizing Casing

It's common practice to lower the casing of the text in our documents as one step of normalization.

In [12]:
## Lower the casing of each document
df['lower_text'] = df['text'].str.lower()
df.head()

Unnamed: 0,text,sentiment,length,lower_text
0,"I`d have responded, if I were going",neutral,36,"i`d have responded, if i were going"
1,Sooo SAD I will miss you here in San Diego!!!,negative,46,sooo sad i will miss you here in san diego!!!
2,my boss is bullying me...,negative,25,my boss is bullying me...
3,what interview! leave me alone,negative,31,what interview! leave me alone
4,"Sons of ****, why couldn`t they put them on the releases we already bought",negative,75,"sons of ****, why couldn`t they put them on the releases we already bought"


## Tokenizing

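Tokenizing text into single-word tokens is simple in Python.  We can just use `str.split()`.  With no arguments, `.split()` separates on any run of whitespace (spaces, tabs, newlines) and drops empty strings, which is not the same as passing `' '` explicitly.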
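A quick check of how `.split()` treats whitespace, compared with an explicit `' '` separator.  The sample string is made up for illustration.

```python
# Made-up sample with a double space, a tab, and a trailing newline
text = "sooo  sad\ti will miss you\n"

default_tokens = text.split()     # splits on any run of whitespace
space_tokens = text.split(" ")    # splits on every single space character

print(default_tokens)
# ['sooo', 'sad', 'i', 'will', 'miss', 'you']
print(space_tokens)
# ['sooo', '', 'sad\ti', 'will', 'miss', 'you\n']
```

The explicit separator leaves an empty string and keeps the tab and newline inside tokens, so the no-argument form is usually the safer choice for tweets.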

We can access pandas' string accessor with `df['column'].str.<method>`.  This allows us to apply string methods to every row of a column.

When processing text, if memory allows, it can be useful to keep many versions of your text: tokenized, lemmatized, no stop words, etc.  Some analysis or modeling packages expect tokenized data and others do not.  We often want to use different versions for different kinds of analysis, too.

In [13]:
## Split the documents into tokens

df['tokens'] = df['lower_text'].str.split()
df.head()

Unnamed: 0,text,sentiment,length,lower_text,tokens
0,"I`d have responded, if I were going",neutral,36,"i`d have responded, if i were going","[i`d, have, responded,, if, i, were, going]"
1,Sooo SAD I will miss you here in San Diego!!!,negative,46,sooo sad i will miss you here in san diego!!!,"[sooo, sad, i, will, miss, you, here, in, san, diego!!!]"
2,my boss is bullying me...,negative,25,my boss is bullying me...,"[my, boss, is, bullying, me...]"
3,what interview! leave me alone,negative,31,what interview! leave me alone,"[what, interview!, leave, me, alone]"
4,"Sons of ****, why couldn`t they put them on the releases we already bought",negative,75,"sons of ****, why couldn`t they put them on the releases we already bought","[sons, of, ****,, why, couldn`t, they, put, them, on, the, releases, we, already, bought]"


### Better way to tokenize data

NLTK has a more sophisticated tokenization function that also splits punctuation into separate tokens.  This way 'hooray' and 'hooray!!!' both produce the token 'hooray'.

In order for NLTK to recognize the punctuation, we will need to download the 'punkt' data.

In [14]:
## Download punkt
nltk.download('punkt')

## Tokenize with nltk.word_tokenize instead

df['tokens'] = df['lower_text'].apply(nltk.word_tokenize)
df.head()

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/purvikansara/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Unnamed: 0,text,sentiment,length,lower_text,tokens
0,"I`d have responded, if I were going",neutral,36,"i`d have responded, if i were going","[i, `, d, have, responded, ,, if, i, were, going]"
1,Sooo SAD I will miss you here in San Diego!!!,negative,46,sooo sad i will miss you here in san diego!!!,"[sooo, sad, i, will, miss, you, here, in, san, diego, !, !, !]"
2,my boss is bullying me...,negative,25,my boss is bullying me...,"[my, boss, is, bullying, me, ...]"
3,what interview! leave me alone,negative,31,what interview! leave me alone,"[what, interview, !, leave, me, alone]"
4,"Sons of ****, why couldn`t they put them on the releases we already bought",negative,75,"sons of ****, why couldn`t they put them on the releases we already bought","[sons, of, *, *, *, *, ,, why, couldn, `, t, they, put, them, on, the, releases, we, already, bought]"
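To see the idea behind punctuation-aware tokenizing without installing anything, here is a rough regex stand-in.  It is not NLTK's algorithm, and `simple_tokenize` is a hypothetical helper: it treats every punctuation character as its own token, whereas `nltk.word_tokenize` keeps `...` together, as row 2 above shows.

```python
import re

def simple_tokenize(text):
    """Split into runs of word characters or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

print(simple_tokenize("hooray!!!"))
# ['hooray', '!', '!', '!']
print(simple_tokenize("hooray"))
# ['hooray']
```

Both strings now share the token `'hooray'`, which is the property we wanted from the tokenizer.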


## Remove Stop Words

In [18]:
## Download NLTK stopword list
nltk.download('stopwords')

## Load the English stop words.

stop_words = nltk.corpus.stopwords.words('english')
stop_words[:10]

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/purvikansara/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]

<font color=red> NOTICE </font> that all of the stop words are lower case.  It's necessary to ensure that your tokens are all lower case before using this list to remove stop words.

To remove the stop words from each document, we will apply a function that will check each word in the list of tokens against the list of stopwords and remove them if they are in the list.  More specifically, it will only save them if they are NOT in the list.

In [20]:
## Create function to remove stop words
def remove_stopwords(tokens):
    # no_stops = []
    # for token in tokens:
    #     if token not in stop_words:
    #         no_stops.append(token)

    no_stops = [token for token in tokens if token not in stop_words]
    
    return no_stops
    
## Apply the function to the tokenized data

df['no_stops'] = df['tokens'].map(remove_stopwords)
df.head(10)




Unnamed: 0,text,sentiment,length,lower_text,tokens,no_stops
0,"I`d have responded, if I were going",neutral,36,"i`d have responded, if i were going","[i, `, d, have, responded, ,, if, i, were, going]","[`, responded, ,, going]"
1,Sooo SAD I will miss you here in San Diego!!!,negative,46,sooo sad i will miss you here in san diego!!!,"[sooo, sad, i, will, miss, you, here, in, san, diego, !, !, !]","[sooo, sad, miss, san, diego, !, !, !]"
2,my boss is bullying me...,negative,25,my boss is bullying me...,"[my, boss, is, bullying, me, ...]","[boss, bullying, ...]"
3,what interview! leave me alone,negative,31,what interview! leave me alone,"[what, interview, !, leave, me, alone]","[interview, !, leave, alone]"
4,"Sons of ****, why couldn`t they put them on the releases we already bought",negative,75,"sons of ****, why couldn`t they put them on the releases we already bought","[sons, of, *, *, *, *, ,, why, couldn, `, t, they, put, them, on, the, releases, we, already, bought]","[sons, *, *, *, *, ,, `, put, releases, already, bought]"
5,http://www.dothebouncy.com/smf - some shameless plugging for the best Rangers forum on earth,neutral,92,http://www.dothebouncy.com/smf - some shameless plugging for the best rangers forum on earth,"[http, :, //www.dothebouncy.com/smf, -, some, shameless, plugging, for, the, best, rangers, forum, on, earth]","[http, :, //www.dothebouncy.com/smf, -, shameless, plugging, best, rangers, forum, earth]"
6,2am feedings for the baby are fun when he is all smiles and coos,positive,64,2am feedings for the baby are fun when he is all smiles and coos,"[2am, feedings, for, the, baby, are, fun, when, he, is, all, smiles, and, coos]","[2am, feedings, baby, fun, smiles, coos]"
7,Soooo high,neutral,10,soooo high,"[soooo, high]","[soooo, high]"
8,Both of you,neutral,12,both of you,"[both, of, you]",[]
9,Journey!? Wow... u just became cooler. hehe... (is that possible!?),positive,69,journey!? wow... u just became cooler. hehe... (is that possible!?),"[journey, !, ?, wow, ..., u, just, became, cooler, ., hehe, ..., (, is, that, possible, !, ?, )]","[journey, !, ?, wow, ..., u, became, cooler, ., hehe, ..., (, possible, !, ?, )]"
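`stop_words` is a Python list, so every `token not in stop_words` check scans the list from the start.  Converting it to a set is a common speed-up on larger corpora.  The sketch below uses a hypothetical miniature stop list so that it runs without NLTK; `stop_set` and `remove_stopwords_fast` are illustrative names.

```python
# Hypothetical mini stop list standing in for nltk.corpus.stopwords.words('english')
stop_words = ['i', 'me', 'my', 'we', 'you', 'is', 'of', 'both', 'will', 'here', 'in']

# Set membership is a hash lookup instead of a linear scan
stop_set = set(stop_words)

def remove_stopwords_fast(tokens):
    """Keep only the tokens that are not in the stop set."""
    return [token for token in tokens if token not in stop_set]

print(remove_stopwords_fast(['my', 'boss', 'is', 'bullying', 'me', '...']))
# ['boss', 'bullying', '...']
```

The result is the same as with the list; only the lookup cost changes.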


## Remove Punctuation

We can remove punctuation in a similar way to how we removed stop words.  However, we will get our list of punctuation from Python's built-in `string` module.

In [21]:
## Import punctuation from the built-in string module
from string import punctuation
print(punctuation)

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~


In [22]:
## Create function to remove punctuation tokens

def remove_punct(tokens):
    no_punct = []
    for token in tokens:
        if token not in punctuation:
            no_punct.append(token)
    return no_punct

## Apply the function to the tokens without stop words

df['no_stops_no_punct'] = df['no_stops'].apply(remove_punct)
df.head(10)

Unnamed: 0,text,sentiment,length,lower_text,tokens,no_stops,no_stops_no_punct
0,"I`d have responded, if I were going",neutral,36,"i`d have responded, if i were going","[i, `, d, have, responded, ,, if, i, were, going]","[`, responded, ,, going]","[responded, going]"
1,Sooo SAD I will miss you here in San Diego!!!,negative,46,sooo sad i will miss you here in san diego!!!,"[sooo, sad, i, will, miss, you, here, in, san, diego, !, !, !]","[sooo, sad, miss, san, diego, !, !, !]","[sooo, sad, miss, san, diego]"
2,my boss is bullying me...,negative,25,my boss is bullying me...,"[my, boss, is, bullying, me, ...]","[boss, bullying, ...]","[boss, bullying, ...]"
3,what interview! leave me alone,negative,31,what interview! leave me alone,"[what, interview, !, leave, me, alone]","[interview, !, leave, alone]","[interview, leave, alone]"
4,"Sons of ****, why couldn`t they put them on the releases we already bought",negative,75,"sons of ****, why couldn`t they put them on the releases we already bought","[sons, of, *, *, *, *, ,, why, couldn, `, t, they, put, them, on, the, releases, we, already, bought]","[sons, *, *, *, *, ,, `, put, releases, already, bought]","[sons, put, releases, already, bought]"
5,http://www.dothebouncy.com/smf - some shameless plugging for the best Rangers forum on earth,neutral,92,http://www.dothebouncy.com/smf - some shameless plugging for the best rangers forum on earth,"[http, :, //www.dothebouncy.com/smf, -, some, shameless, plugging, for, the, best, rangers, forum, on, earth]","[http, :, //www.dothebouncy.com/smf, -, shameless, plugging, best, rangers, forum, earth]","[http, //www.dothebouncy.com/smf, shameless, plugging, best, rangers, forum, earth]"
6,2am feedings for the baby are fun when he is all smiles and coos,positive,64,2am feedings for the baby are fun when he is all smiles and coos,"[2am, feedings, for, the, baby, are, fun, when, he, is, all, smiles, and, coos]","[2am, feedings, baby, fun, smiles, coos]","[2am, feedings, baby, fun, smiles, coos]"
7,Soooo high,neutral,10,soooo high,"[soooo, high]","[soooo, high]","[soooo, high]"
8,Both of you,neutral,12,both of you,"[both, of, you]",[],[]
9,Journey!? Wow... u just became cooler. hehe... (is that possible!?),positive,69,journey!? wow... u just became cooler. hehe... (is that possible!?),"[journey, !, ?, wow, ..., u, just, became, cooler, ., hehe, ..., (, is, that, possible, !, ?, )]","[journey, !, ?, wow, ..., u, became, cooler, ., hehe, ..., (, possible, !, ?, )]","[journey, wow, ..., u, became, cooler, hehe, ..., possible]"
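The `...` tokens in rows 2 and 9 survive because `token not in punctuation` is a substring test against one long string, and `'...'` is not a substring of it.  A stricter variant is sketched below, under the assumption that a token counts as punctuation when every one of its characters is punctuation.  `is_all_punct` and `remove_punct_strict` are hypothetical names.

```python
from string import punctuation

def is_all_punct(token):
    """True when the token is non-empty and made only of punctuation characters."""
    return len(token) > 0 and all(char in punctuation for char in token)

def remove_punct_strict(tokens):
    """Drop tokens that consist entirely of punctuation."""
    return [token for token in tokens if not is_all_punct(token)]

print('...' in punctuation)
# False
print(remove_punct_strict(['boss', 'bullying', '...']))
# ['boss', 'bullying']
```

Whether to drop `...` at all is a modeling choice, since an ellipsis can carry tone in a tweet.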


## Remove URLs

In [25]:
## Define a function to remove URL tokens (for loop with continue)
def remove_urls(token_list):
    no_urls = []
    for token in token_list:
        if ('http' in token) or ('www' in token):
            continue
        no_urls.append(token)
    return no_urls

## Remove URLs from no_stops_no_punct
df['no_stops_no_punct'] = df['no_stops_no_punct'].apply(remove_urls)
df.head(10)


Unnamed: 0,text,sentiment,length,lower_text,tokens,no_stops,no_stops_no_punct
0,"I`d have responded, if I were going",neutral,36,"i`d have responded, if i were going","[i, `, d, have, responded, ,, if, i, were, going]","[`, responded, ,, going]","[responded, going]"
1,Sooo SAD I will miss you here in San Diego!!!,negative,46,sooo sad i will miss you here in san diego!!!,"[sooo, sad, i, will, miss, you, here, in, san, diego, !, !, !]","[sooo, sad, miss, san, diego, !, !, !]","[sooo, sad, miss, san, diego]"
2,my boss is bullying me...,negative,25,my boss is bullying me...,"[my, boss, is, bullying, me, ...]","[boss, bullying, ...]","[boss, bullying, ...]"
3,what interview! leave me alone,negative,31,what interview! leave me alone,"[what, interview, !, leave, me, alone]","[interview, !, leave, alone]","[interview, leave, alone]"
4,"Sons of ****, why couldn`t they put them on the releases we already bought",negative,75,"sons of ****, why couldn`t they put them on the releases we already bought","[sons, of, *, *, *, *, ,, why, couldn, `, t, they, put, them, on, the, releases, we, already, bought]","[sons, *, *, *, *, ,, `, put, releases, already, bought]","[sons, put, releases, already, bought]"
5,http://www.dothebouncy.com/smf - some shameless plugging for the best Rangers forum on earth,neutral,92,http://www.dothebouncy.com/smf - some shameless plugging for the best rangers forum on earth,"[http, :, //www.dothebouncy.com/smf, -, some, shameless, plugging, for, the, best, rangers, forum, on, earth]","[http, :, //www.dothebouncy.com/smf, -, shameless, plugging, best, rangers, forum, earth]","[shameless, plugging, best, rangers, forum, earth]"
6,2am feedings for the baby are fun when he is all smiles and coos,positive,64,2am feedings for the baby are fun when he is all smiles and coos,"[2am, feedings, for, the, baby, are, fun, when, he, is, all, smiles, and, coos]","[2am, feedings, baby, fun, smiles, coos]","[2am, feedings, baby, fun, smiles, coos]"
7,Soooo high,neutral,10,soooo high,"[soooo, high]","[soooo, high]","[soooo, high]"
8,Both of you,neutral,12,both of you,"[both, of, you]",[],[]
9,Journey!? Wow... u just became cooler. hehe... (is that possible!?),positive,69,journey!? wow... u just became cooler. hehe... (is that possible!?),"[journey, !, ?, wow, ..., u, just, became, cooler, ., hehe, ..., (, is, that, possible, !, ?, )]","[journey, !, ?, wow, ..., u, became, cooler, ., hehe, ..., (, possible, !, ?, )]","[journey, wow, ..., u, became, cooler, hehe, ..., possible]"
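One caveat with substring checks: `'www' in token` is also true for a word such as `'awww'`, so ordinary words can be dropped along with URLs.  The sketch below checks how a token starts instead.  `remove_urls_strict` is a hypothetical variant, and treating tokens that begin with `//` as URL remnants is an assumption based on how the tokenizer split the URL in row 5.

```python
# Assumed markers for the start of a URL fragment after tokenizing
URL_PREFIXES = ('http', 'www.', '//')

def remove_urls_strict(token_list):
    """Drop tokens that start like a URL fragment; keep everything else."""
    return [token for token in token_list if not token.startswith(URL_PREFIXES)]

print('www' in 'awww')
# True
print(remove_urls_strict(['awww', 'http', '//www.dothebouncy.com/smf', 'shameless']))
# ['awww', 'shameless']
```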


## Results

Note how many fewer tokens the `no_stops_no_punct` column has than the original text.  Some information was lost, but much was retained.

Normalization is a huge part of the NLP process.  It is always a balance between reducing the size of our vocabulary, and therefore simplifying our models, and retaining enough information for the model to extract meaningful patterns from the texts.

There are a lot of choices here to make.
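A small illustration of the vocabulary trade-off, using plain Python.  The first document is a tweet from this dataset; the second is made up for the example, and stripping a fixed set of punctuation characters is a simplification of the steps above.

```python
docs = [
    "Sooo SAD I will miss you here in San Diego!!!",  # from the dataset
    "so sad , I miss you",                            # made-up second document
]

# Vocabulary of raw whitespace-separated tokens
raw_vocab = {token for doc in docs for token in doc.split()}

# Vocabulary after lower-casing and stripping edge punctuation
norm_vocab = {token.strip("!.,?").lower() for doc in docs for token in doc.split()}
norm_vocab.discard("")  # a lone ',' becomes an empty string

print(len(raw_vocab), len(norm_vocab))
# 13 11
```

`SAD` and `sad` collapse into one entry, which is the kind of reduction normalization buys, at the cost of losing the emphasis that the capitals carried.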

# Normalizing Text with spaCy

The spaCy Python package provides text processing pipelines that can do many of these operations, plus much more complicated processing, very fast and in many fewer steps.  For this reason it is a very popular tool.  

It utilizes pretrained language models that can recognize things like parts of speech and named entities (people, specific places, currency, etc.).

spaCy was not included in your original dojo_env, so you will need to install it if you have not already.

We will also download the pretrained English language model trained on written web text.  We will use the small one for efficiency.

In [26]:
## Install spacy if necessary
#!pip install spacy

import spacy

## Download the English small-sized model trained on web documents if necessary
spacy.cli.download('en_core_web_sm')

Collecting en-core-web-sm==3.6.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.6.0/en_core_web_sm-3.6.0-py3-none-any.whl (12.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.8/12.8 MB 4.7 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


## The spaCy model

In [27]:
## Load the model.  Disable the named entity recognizer (not needed here, and slow)
nlp_model = spacy.load('en_core_web_sm', disable=['ner'])

## Display the names of each pipeline component
nlp_model.pipe_names

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer']

We have our model, and we can apply it like a function.  It expects a string of text as the input.

In [28]:
df['text'][5]

'http://www.dothebouncy.com/smf - some shameless plugging for the best Rangers forum on earth'

In [29]:
## Process a document with the model

doc = nlp_model(df['text'][5])
doc

http://www.dothebouncy.com/smf - some shameless plugging for the best Rangers forum on earth

The document is a collection of tokens that we can iterate over.

## Documents and Tokens

In [30]:
## Display the tokens in the document

[token for token in doc]

[http://www.dothebouncy.com/smf,
 -,
 some,
 shameless,
 plugging,
 for,
 the,
 best,
 Rangers,
 forum,
 on,
 earth]

Each token is much more than a string.  It is a spaCy `Token` object.

In [31]:
## Isolate the last token in the document
word = doc[-1]

## Display the text and type of the token
print(word)
type(word)

earth


spacy.tokens.token.Token

Each token has many attributes that we can take advantage of, such as its lemma and whether it is punctuation, whitespace, or a stop word.

In [32]:
## Display the lemmatized form of the token

word.lemma_

'earth'

In [33]:
## Check whether the token is punctuation
word.is_punct


False

In [34]:
## Check whether the token is a space
word.is_space

False

spaCy can even determine the part of speech of each token!

In [35]:
## Check the part of speech of the token
word.pos_

'NOUN'

In [36]:
## Show the parts of speech for each token in the document

[token.pos_ for token in doc]

['PROPN',
 'PUNCT',
 'DET',
 'ADJ',
 'VERB',
 'ADP',
 'DET',
 'ADJ',
 'PROPN',
 'NOUN',
 'ADP',
 'NOUN']

In [37]:
## Show a list of the lemmas for each token in the document

[token.lemma_ for token in doc]

['http://www.dothebouncy.com/smf',
 '-',
 'some',
 'shameless',
 'plug',
 'for',
 'the',
 'good',
 'Rangers',
 'forum',
 'on',
 'earth']

Notice that spaCy does not lower the case of lemmas.  Let's make sure we do that, too.

In [38]:
## Build a list of lemmas for the tokens that are not punctuation, spaces, or stop words
lemmas_list = []
for token in doc:
    if token.is_punct:
        continue
    if token.is_space:
        continue
    if token.is_stop:
        continue

    lemmas_list.append(token.lemma_.lower())

lemmas_list


['http://www.dothebouncy.com/smf',
 'shameless',
 'plug',
 'good',
 'rangers',
 'forum',
 'earth']

In [None]:
## Show a list of all the tokens in the document that are not punctuation, spaces, or stop words
[token.lemma_.lower() for token in doc if 
 not token.is_punct and 
 not token.is_space and 
 not token.is_stop]

In order to use spaCy to process our entire dataframe, we will need to make a function and apply it to our text column.

In [39]:
## Let's also remove URLs
[token.lemma_.lower() for token in doc if 
 not token.is_punct and 
 not token.is_space and 
 not token.is_stop and 
 'http' not in token.lemma_.lower() and
 'www' not in token.lemma_.lower()]

['shameless', 'plug', 'good', 'rangers', 'forum', 'earth']

## Preprocessing with spaCy

In [40]:
## Define a function to use spacy to process our text
def spacy_process(text):
    """Lemmatize and lower-case tokens; drop punctuation, spaces, stop words, and URLs."""
    doc = nlp_model(text)
    processed_doc = [token.lemma_.lower()
                     for token in doc
                     if not token.is_punct
                     and not token.is_space
                     and not token.is_stop
                     and 'http' not in token.lemma_.lower()
                     and 'www' not in token.lemma_.lower()]
    return processed_doc

## process the tweets using the spacy function
df['spacy_lemmas'] = df['text'].apply(spacy_process)
df.head()

Unnamed: 0,text,sentiment,length,lower_text,tokens,no_stops,no_stops_no_punct,spacy_lemmas
0,"I`d have responded, if I were going",neutral,36,"i`d have responded, if i were going","[i, `, d, have, responded, ,, if, i, were, going]","[`, responded, ,, going]","[responded, going]","[i`d, respond, go]"
1,Sooo SAD I will miss you here in San Diego!!!,negative,46,sooo sad i will miss you here in san diego!!!,"[sooo, sad, i, will, miss, you, here, in, san, diego, !, !, !]","[sooo, sad, miss, san, diego, !, !, !]","[sooo, sad, miss, san, diego]","[sooo, sad, miss, san, diego]"
2,my boss is bullying me...,negative,25,my boss is bullying me...,"[my, boss, is, bullying, me, ...]","[boss, bullying, ...]","[boss, bullying, ...]","[boss, bully]"
3,what interview! leave me alone,negative,31,what interview! leave me alone,"[what, interview, !, leave, me, alone]","[interview, !, leave, alone]","[interview, leave, alone]","[interview, leave]"
4,"Sons of ****, why couldn`t they put them on the releases we already bought",negative,75,"sons of ****, why couldn`t they put them on the releases we already bought","[sons, of, *, *, *, *, ,, why, couldn, `, t, they, put, them, on, the, releases, we, already, bought]","[sons, *, *, *, *, ,, `, put, releases, already, bought]","[sons, put, releases, already, bought]","[son, couldn`t, release, buy]"


We used spaCy to tokenize, lemmatize, and remove punctuation and stopwords from our text in one step!

Notice that the spaCy processed data is a little different than our previously processed data.  The text has been lemmatized and spaCy has a different list of stop words than NLTK.

The learn platform has directions for how you can customize your spaCy stopword list and a function with more flexibility in how spaCy will process your data.

# ngrams
N-grams combine multiple consecutive words into a single token.

In [None]:
## Import the ngrams function
from nltk import ngrams

In [None]:

## Isolate the 6th lemmatized document


In [None]:

# Create list of bigrams


In [None]:

# Create list of trigrams
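To make the idea concrete, here is a sketch of what an n-gram function computes, written with plain `zip` so that it runs without NLTK.  `make_ngrams` is a hypothetical helper, and the token list is the lemmatized example document from earlier in this notebook.

```python
def make_ngrams(tokens, n):
    """Return a list of tuples, each holding n consecutive tokens."""
    # Zip the list against copies of itself shifted by 1, 2, ..., n-1 positions
    shifted = (tokens[i:] for i in range(n))
    return list(zip(*shifted))

lemmas = ['shameless', 'plug', 'good', 'rangers', 'forum', 'earth']

bigrams = make_ngrams(lemmas, 2)
trigrams = make_ngrams(lemmas, 3)

print(bigrams[:2])
# [('shameless', 'plug'), ('plug', 'good')]
print(len(bigrams), len(trigrams))
# 5 4
```

A document with fewer than `n` tokens simply yields an empty list.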



## Applying `ngrams` to make a new column



We need to make a function that returns a list of bigrams.  Passing `ngrams` directly to `.apply()` won't do: it needs a value for `n`, and it returns a lazy iterator rather than a list.
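Storing a lazy iterator in a column is a problem because it can be read only once.  The sketch below shows this with a stand-in function built on `zip`; `lazy_bigrams` is hypothetical and the tokens are made up.

```python
def lazy_bigrams(tokens):
    """Stand-in for a lazy n-gram function: zip returns an iterator, not a list."""
    return zip(tokens, tokens[1:])

pairs = lazy_bigrams(['boss', 'bully', 'today'])

first_pass = list(pairs)   # consumes the iterator
second_pass = list(pairs)  # nothing left to read

print(first_pass)
# [('boss', 'bully'), ('bully', 'today')]
print(second_pass)
# []
```

Wrapping the result in `list(...)` inside a helper function gives every row a reusable list instead.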


In [None]:
## Create a function to create bigrams
def make_bigrams(doc):
    bigrams = ngrams(doc, 2)
    bigrams = list(bigrams)
    return bigrams

In [None]:
# add bigrams to the df with .apply()
df['bigrams'] = df['spacy_lemmas'].apply(make_bigrams)
df.head()


# Save the final data version for modeling


In [41]:

## Save the processed data
df.to_csv('../Data/processed_data.csv', index=False)

In [43]:
## Save the processed data with joblib
import joblib

joblib.dump(df, '../Data/processed_data.joblib')

['../Data/processed_data.joblib']
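One reason to keep a joblib copy next to the CSV: CSV stores every cell as text, so a column of token lists comes back as strings such as `"['boss', 'bully']"`.  The sketch below shows the round trip with the standard `csv` module rather than pandas, on a made-up single row, and uses `ast.literal_eval` as one way to rebuild the list.

```python
import ast
import csv
import io

# Made-up row with a list-valued column, like our token columns
rows = [{"text": "my boss is bullying me...", "spacy_lemmas": ["boss", "bully"]}]

# Write to an in-memory CSV, then read it back
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["text", "spacy_lemmas"])
writer.writeheader()
writer.writerows(rows)
buffer.seek(0)
loaded = list(csv.DictReader(buffer))

raw_value = loaded[0]["spacy_lemmas"]   # a string, not a list
restored = ast.literal_eval(raw_value)  # safely parse it back into a list

print(type(raw_value).__name__, type(restored).__name__)
# str list
```

The joblib file avoids this step because it stores the Python objects themselves.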