# Dealing with Text Data
>  Finally, in this chapter, you will work with unstructured text data and learn how to engineer columnar features from a text corpus. You will compare how different approaches affect how much context is extracted from a text, and how to balance the need for context against the number of features created.

- toc: true 
- badges: true
- comments: true
- author: Lucas Nunes
- categories: [Python, Datacamp, Machine Learning]
- image: images/datacamp/1_supervised_learning_with_scikit_learn/2_regression.png

> Note: This is a summary of the course's chapter 4 exercises "Feature Engineering for Machine Learning in Python" at datacamp. <br>[Github repo](https://github.com/lnunesAI/Datacamp/) / [Course link](https://www.datacamp.com/tracks/machine-learning-scientist-with-python)

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

## Encoding text

### Cleaning up your text

<div class=""><p>Unstructured text data cannot be directly used in most analyses. Multiple steps need to be taken to go from a long free form string to a set of numeric columns in the right format that can be ingested by a machine learning model. The first step of this process is to standardize the data and eliminate any characters that could cause problems later on in your analytic pipeline. </p>
<p>In this chapter you will be working with a new dataset containing the inaugural speeches of the presidents of the United States loaded as <code>speech_df</code>, with the speeches stored in the <code>text</code> column.</p></div>

In [2]:
speech_df = pd.read_csv('https://raw.githubusercontent.com/lnunesAI/Datacamp/main/2-machine-learning-scientist-with-python/10-feature-engineering-for-machine-learning-in-python/datasets/speech_df.csv')

Instructions 1/2
<p>Print the first 5 rows of the <code>text</code> column to see the free text fields.</p>

In [3]:
# Print the first 5 rows of the text column
print(speech_df['text'].head())

0    Fellow-Citizens of the Senate and of the House...
1    Fellow Citizens:  I AM again called upon by th...
2    WHEN it was first perceived, in early times, t...
3    Friends and Fellow-Citizens:  CALLED upon to u...
4    PROCEEDING, fellow-citizens, to that qualifica...
Name: text, dtype: object


Instructions 2/2
<ul>
<li>Replace all non letter characters in the <code>text</code> column with a whitespace.</li>
<li>Make all characters in the newly created <code>text_clean</code> column lower case.</li>
</ul>

In [4]:
# Replace all non letter characters with a whitespace
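# (regex=True is needed because recent pandas versions treat the pattern as a literal string by default)
speech_df['text_clean'] = speech_df['text'].str.replace('[^a-zA-Z]', ' ', regex=True)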

# Change to lower case
speech_df['text_clean'] = speech_df['text_clean'].str.lower()

# Print the first 5 rows of the text_clean column
print(speech_df['text_clean'].head())

0    fellow citizens of the senate and of the house...
1    fellow citizens   i am again called upon by th...
2    when it was first perceived  in early times  t...
3    friends and fellow citizens   called upon to u...
4    proceeding  fellow citizens  to that qualifica...
Name: text_clean, dtype: object


**Now your text strings have been standardized and cleaned up. You can use this new column (text_clean) to extract information about the speeches.**
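
As a small illustration of the cleaning step, the sketch below applies the same pattern to two made-up strings (they are not taken from the dataset). The final whitespace-collapsing step is an optional extra and an assumption on my part, not something the exercise asks for.

```python
import pandas as pd

# Two made-up strings standing in for speeches (not from the dataset)
toy = pd.Series(["Fellow-Citizens: I AM here, in 1789!", "We the People..."])

# Replace every non-letter with a space, then lower-case
cleaned = toy.str.replace('[^a-zA-Z]', ' ', regex=True).str.lower()

# Optional extra step: collapse the runs of spaces left behind by the replacement
collapsed = cleaned.str.replace(r'\s+', ' ', regex=True).str.strip()

print(cleaned.tolist())
print(collapsed.tolist())
```

Because each removed character is replaced by exactly one space, `cleaned` keeps the original string length, which is why the cleaned speeches above contain double and triple spaces.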

### High level text features

<p>Once the text has been cleaned and standardized you can begin creating features from the data. The most fundamental information you can calculate about free form text is its size, such as its length and number of words. In this exercise (and the rest of this chapter), you will focus on the cleaned/transformed text column (<code>text_clean</code>) you created in the last exercise.</p>

Instructions
<ul>
<li>Record the character length of each speech in the <code>char_cnt</code> column.</li>
<li>Record the word count of each speech in the <code>word_cnt</code> column.</li>
<li>Record the average word length of each speech in the <code>avg_word_length</code> column.</li>
</ul>

In [None]:
# Find the length of each text
speech_df['char_cnt'] = speech_df['text_clean'].str.len()

# Count the number of words in each text
speech_df['word_cnt'] = speech_df['text_clean'].str.split().str.len()

# Find the average length of word
speech_df['avg_word_length'] = speech_df['char_cnt'] / speech_df['word_cnt']

# Print the first 5 rows of these columns
print(speech_df[['text_clean', 'char_cnt', 'word_cnt', 'avg_word_length']].head())

**These features may appear basic but can be quite useful in ML models.**
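
One detail worth keeping in mind: `char_cnt` counts spaces as well as letters, so `char_cnt / word_cnt` is slightly larger than the true mean word length. The sketch below, on a made-up two-row DataFrame, compares the exercise's ratio with a letters-only variant; the column name `avg_word_length_no_ws` is hypothetical and only used for illustration.

```python
import pandas as pd

# Made-up cleaned texts (not from the dataset)
toy = pd.DataFrame({'text_clean': ['we the people', 'fellow citizens   i am here']})

words = toy['text_clean'].str.split()

toy['char_cnt'] = toy['text_clean'].str.len()   # includes spaces
toy['word_cnt'] = words.str.len()
toy['avg_word_length'] = toy['char_cnt'] / toy['word_cnt']

# Letters-only variant: sum the lengths of the individual words instead
toy['avg_word_length_no_ws'] = words.apply(lambda ws: sum(len(w) for w in ws)) / toy['word_cnt']

print(toy)
```

Either version can work as a feature; the point is only to be aware of what the number measures.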

## Word counts

### Counting words (I)

<div class=""><p>Once high level information has been recorded you can begin creating features based on the actual content of each text. One way to do this is to approach it in a similar way to how you worked with categorical variables in the earlier lessons. </p>
<ul>
<li>For each unique word in the dataset a column is created. </li>
<li>For each entry, the number of times this word occurs is counted and the count value is entered into the respective column.  </li>
</ul>
<p>These "count" columns can then be used to train machine learning models.</p></div>

Instructions
<ul>
<li>Import <code>CountVectorizer</code> from <code>sklearn.feature_extraction.text</code>.  </li>
<li>Instantiate <code>CountVectorizer</code> and assign it to <code>cv</code>. </li>
<li>Fit the vectorizer to the <code>text_clean</code> column. </li>
<li>Print the feature names generated by the vectorizer.</li>
</ul>

In [24]:
# Import CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

# Instantiate CountVectorizer
cv = CountVectorizer()

# Fit the vectorizer
cv.fit(speech_df['text_clean'])

# Print feature names
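# (get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it)
cv.get_feature_names_out()[:5].tolist()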

['abandon', 'abandoned', 'abandonment', 'abate', 'abdicated']

**This vectorizer can be applied to both the text it was trained on and new texts.**

### Counting words (II)

<div class=""><p>Once the vectorizer has been fit to the data, it can be used to transform the text to an array representing the word counts. This array will have a row per block of text and a column for each of the features generated by the vectorizer that you observed in the last exercise. </p>
<p>The vectorizer you fit in the last exercise (<code>cv</code>) is available in your workspace.</p></div>

Instructions 1/2
<ul>
<li>Apply the vectorizer to the <code>text_clean</code> column. </li>
<li>Convert this transformed (sparse) array into a numpy array with counts.</li>
</ul>

In [9]:
# Apply the vectorizer
cv_transformed = cv.transform(speech_df['text_clean'])

# Convert the sparse result to a dense numpy array and print it
cv_array = cv_transformed.toarray()
print(cv_array)

[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 1 0 ... 0 0 0]
 ...
 [0 1 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]


Instructions 2/2
<p>Print the dimensions of this numpy array.</p>

In [10]:
# Print the shape of cv_array
print(cv_array.shape)

(58, 9043)


**The speeches have 9043 unique words, which is a lot! In the next exercise, you will see how to create a limited set of features.**
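
Most entries of a count matrix are zero, which is why the vectorizer returns a sparse matrix rather than a dense array. The sketch below, on a made-up three-document corpus, shows one way to check how sparse the result is before calling `.toarray()`; the variable names are hypothetical.

```python
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer

# Made-up corpus (not from the dataset)
docs = ['the people of the union', 'the union of states', 'peace and liberty']

X = CountVectorizer().fit_transform(docs)

print(sparse.issparse(X))   # the result is a SciPy sparse matrix
print(X.shape)              # (documents, unique words)

# Fraction of cells that actually hold a non-zero count
density = X.nnz / (X.shape[0] * X.shape[1])
print(round(density, 3))
```

For a large corpus it can be worth keeping the sparse form, since many scikit-learn estimators accept it directly.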

### Limiting your features

<div class=""><p>As you have seen, using the <code>CountVectorizer</code> with its default settings creates a feature for every single word in your corpus. This can create far too many features, often including ones that will provide very little analytical value.</p>
<p>For this purpose <code>CountVectorizer</code> has parameters that you can set to reduce the number of features:  </p>
<ul>
<li><code>min_df</code> : Use only words that occur in at least this proportion of documents (a float between 0 and 1) or, if an integer is given, in at least this many documents. This can be used to remove outlier words that will not generalize across texts.  </li>
<li><code>max_df</code> : Use only words that occur in at most this proportion (float) or number (integer) of documents. This is useful to eliminate very common words that occur in nearly every document without adding value, such as "and" or "the".</li>
</ul></div>

Instructions
<ul>
<li>Limit the number of features in the CountVectorizer by keeping only words that appear in at least 20% and at most 80% of the documents.</li>
<li>Fit and apply the vectorizer on <code>text_clean</code> column in one step. </li>
<li>Convert this transformed (sparse) array into a numpy array with counts. </li>
<li>Print the dimensions of the new reduced array.</li>
</ul>

In [11]:
# Import CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

# Specify arguments to limit the number of features generated
cv = CountVectorizer(min_df=0.2, max_df=0.8)

# Fit, transform, and convert into array
cv_transformed = cv.fit_transform(speech_df['text_clean'])
cv_array = cv_transformed.toarray()

# Print the array shape
print(cv_array.shape)

(58, 818)


**Did you notice that the number of features (unique words) greatly reduced from 9043 to 818?**

### Text to DataFrame

<div class=""><p>Now that you have generated these count based features in an array, you will need to reformat them so that they can be combined with the rest of the dataset. This can be achieved by converting the array into a pandas DataFrame, with the feature names you found earlier as the column names, and then concatenating it with the original DataFrame.</p>
<p>The numpy array (<code>cv_array</code>) and the vectorizer (<code>cv</code>) you fit in the last exercise are available in your workspace.</p></div>

Instructions
<ul>
<li>Create a DataFrame <code>cv_df</code> containing the <code>cv_array</code> as the values and the feature names as the column names. </li>
<li>Add the prefix <code>Counts_</code> to the column names for ease of identification. </li>
<li>Concatenate this DataFrame (<code>cv_df</code>) to the original DataFrame (<code>speech_df</code>) column wise.</li>
</ul>

In [14]:
# Create a DataFrame with these features
# (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out())
cv_df = pd.DataFrame(cv_array, 
                     columns=cv.get_feature_names_out()).add_prefix('Counts_')

# Add the new columns to the original DataFrame
speech_df_new = pd.concat([speech_df, cv_df], axis=1, sort=False)
speech_df_new.head()

Unnamed: 0,Name,Inaugural Address,Date,text,text_clean,Counts_abiding,Counts_ability,Counts_able,Counts_about,Counts_above,Counts_abroad,Counts_accept,Counts_accomplished,Counts_achieve,Counts_across,Counts_act,Counts_action,Counts_acts,Counts_add,Counts_adequate,Counts_administration,Counts_adopted,Counts_advance,Counts_advantage,Counts_affairs,Counts_afford,Counts_after,Counts_again,Counts_against,Counts_age,Counts_ago,Counts_agriculture,Counts_aid,Counts_alike,Counts_almighty,Counts_almost,Counts_alone,Counts_along,Counts_already,Counts_also,...,Counts_vital,Counts_voice,Counts_want,Counts_war,Counts_wars,Counts_washington,Counts_waste,Counts_way,Counts_ways,Counts_weak,Counts_wealth,Counts_weight,Counts_welfare,Counts_were,Counts_what,Counts_whatever,Counts_where,Counts_wherever,Counts_whether,Counts_while,Counts_whole,Counts_whom,Counts_whose,Counts_willing,Counts_wisdom,Counts_wise,Counts_wisely,Counts_wish,Counts_within,Counts_without,Counts_women,Counts_words,Counts_work,Counts_wrong,Counts_year,Counts_years,Counts_yet,Counts_you,Counts_young,Counts_your
0,George Washington,First Inaugural Address,"Thursday, April 30, 1789",Fellow-Citizens of the Senate and of the House...,fellow citizens of the senate and of the house...,0,0,0,0,0,0,0,1,0,0,1,0,0,1,0,2,0,0,1,1,0,0,1,1,0,0,0,0,0,1,0,0,1,0,0,...,0,2,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0,1,0,0,0,2,0,0,0,0,0,1,0,5,0,9
1,George Washington,Second Inaugural Address,"Monday, March 4, 1793",Fellow Citizens: I AM again called upon by th...,fellow citizens i am again called upon by th...,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
2,John Adams,Inaugural Address,"Saturday, March 4, 1797","WHEN it was first perceived, in early times, t...",when it was first perceived in early times t...,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,3,1,0,1,0,0,3,0,1,1,0,1,0,0,0,0,0,0,0,0,...,0,1,0,1,0,0,0,0,0,0,1,0,1,5,3,2,1,0,0,1,3,0,0,0,0,1,0,1,0,3,0,0,0,0,2,3,0,0,0,1
3,Thomas Jefferson,First Inaugural Address,"Wednesday, March 4, 1801",Friends and Fellow-Citizens: CALLED upon to u...,friends and fellow citizens called upon to u...,0,0,0,1,1,1,0,0,0,0,0,0,0,0,0,2,0,2,0,1,0,0,0,2,1,0,1,0,0,0,0,1,0,0,0,...,2,1,1,1,0,0,0,0,0,0,0,0,0,0,5,1,3,0,0,0,2,1,2,0,2,1,0,1,1,2,0,0,1,2,0,0,2,7,0,7
4,Thomas Jefferson,Second Inaugural Address,"Monday, March 4, 1805","PROCEEDING, fellow-citizens, to that qualifica...",proceeding fellow citizens to that qualifica...,0,0,1,0,0,0,0,0,0,0,3,1,0,1,0,1,0,1,0,3,1,0,2,6,0,0,1,0,0,0,0,2,0,1,1,...,0,0,1,3,1,0,0,0,0,0,0,1,0,0,4,0,1,0,3,0,1,2,3,0,2,0,0,1,4,2,0,0,0,0,2,2,2,4,0,4


**With the new features combined with the original DataFrame, they can now be used for ML models or analysis.**
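
A side note on the concatenation: `pd.concat(..., axis=1)` aligns rows by index label, not by position. `speech_df` here has the default 0..n-1 index, so the step above lines up correctly. The sketch below uses made-up data with a non-default index to show what can go wrong otherwise, and one possible way around it (reusing the original index when building the feature DataFrame).

```python
import numpy as np
import pandas as pd

# Made-up frame with a non-default index, e.g. after filtering rows
left = pd.DataFrame({'Name': ['a', 'b', 'c']}, index=[10, 11, 12])
counts = np.array([[1, 0], [0, 2], [3, 1]])

# Feature frame with a fresh 0..2 index: labels do not match, rows are not paired
misaligned = pd.concat([left, pd.DataFrame(counts, columns=['x', 'y'])], axis=1)

# Feature frame that reuses the original index: rows are paired as intended
aligned = pd.concat(
    [left, pd.DataFrame(counts, columns=['x', 'y'], index=left.index)], axis=1
)

print(misaligned.shape)
print(aligned.shape)
```

Another option would be to call `reset_index(drop=True)` on the original DataFrame before concatenating.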

## Term frequency-inverse document frequency

### Tf-idf

<p>While counts of occurrences of words can be useful to build models, words that occur many times may skew the results undesirably. To limit these common words from overpowering your model a form of normalization can be used. In this lesson you will be using Term frequency-inverse document frequency (Tf-idf) as was discussed in the video. Tf-idf has the effect of reducing the value of common words, while increasing the weight of words that do not occur in many documents.</p>

Instructions
<ul>
<li>Import <code>TfidfVectorizer</code> from <code>sklearn.feature_extraction.text</code>.  </li>
<li>Instantiate <code>TfidfVectorizer</code> while limiting the number of features to 100 and removing English stop words. </li>
<li>Fit and apply the vectorizer on <code>text_clean</code> column in one step. </li>
<li>Create a DataFrame <code>tv_df</code> containing the weights of the words and the feature names as the column names.</li>
</ul>

In [16]:
# Import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

# Instantiate TfidfVectorizer
tv = TfidfVectorizer(max_features=100, stop_words='english')

# Fit the vectorizer and transform the data
tv_transformed = tv.fit_transform(speech_df['text_clean'])

# Create a DataFrame with these features
# (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out())
tv_df = pd.DataFrame(tv_transformed.toarray(), 
                     columns=tv.get_feature_names_out()).add_prefix('TFIDF_')
tv_df.head()

Unnamed: 0,TFIDF_action,TFIDF_administration,TFIDF_america,TFIDF_american,TFIDF_americans,TFIDF_believe,TFIDF_best,TFIDF_better,TFIDF_change,TFIDF_citizens,TFIDF_come,TFIDF_common,TFIDF_confidence,TFIDF_congress,TFIDF_constitution,TFIDF_country,TFIDF_day,TFIDF_duties,TFIDF_duty,TFIDF_equal,TFIDF_executive,TFIDF_faith,TFIDF_far,TFIDF_federal,TFIDF_fellow,TFIDF_force,TFIDF_foreign,TFIDF_free,TFIDF_freedom,TFIDF_future,TFIDF_general,TFIDF_god,TFIDF_good,TFIDF_government,TFIDF_great,TFIDF_high,TFIDF_history,TFIDF_home,TFIDF_hope,TFIDF_human,...,TFIDF_need,TFIDF_new,TFIDF_office,TFIDF_old,TFIDF_order,TFIDF_party,TFIDF_peace,TFIDF_people,TFIDF_place,TFIDF_policy,TFIDF_political,TFIDF_power,TFIDF_powers,TFIDF_present,TFIDF_president,TFIDF_principles,TFIDF_progress,TFIDF_prosperity,TFIDF_public,TFIDF_purpose,TFIDF_right,TFIDF_rights,TFIDF_secure,TFIDF_service,TFIDF_shall,TFIDF_spirit,TFIDF_state,TFIDF_states,TFIDF_strength,TFIDF_support,TFIDF_things,TFIDF_time,TFIDF_today,TFIDF_union,TFIDF_united,TFIDF_war,TFIDF_way,TFIDF_work,TFIDF_world,TFIDF_years
0,0.0,0.133415,0.0,0.105388,0.0,0.0,0.0,0.0,0.0,0.229644,0.0,0.0,0.111079,0.0,0.060755,0.229644,0.115098,0.064225,0.238637,0.063036,0.14728,0.0,0.178978,0.0,0.147528,0.0,0.0,0.098352,0.0,0.101797,0.0,0.0,0.147528,0.36743,0.133183,0.0,0.0,0.0,0.051787,0.126073,...,0.0,0.049176,0.0,0.0,0.141458,0.070729,0.0,0.17459,0.056532,0.138691,0.0,0.050898,0.065448,0.315182,0.06188,0.063036,0.0,0.064225,0.333237,0.0,0.05554,0.050898,0.0,0.063036,0.145021,0.0,0.0,0.103573,0.0,0.0,0.0,0.045929,0.0,0.136012,0.203593,0.0,0.060755,0.0,0.045929,0.052694
1,0.0,0.261016,0.266097,0.0,0.0,0.0,0.0,0.0,0.0,0.179712,0.0,0.0,0.217318,0.0,0.237725,0.179712,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.192418,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.179712,0.0,0.233437,0.0,0.0,0.0,0.0,...,0.0,0.0,0.242128,0.0,0.0,0.0,0.0,0.170786,0.0,0.0,0.0,0.0,0.0,0.246652,0.242128,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.567446,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.199157,0.0,0.0,0.0,0.0,0.0
2,0.0,0.092436,0.157058,0.073018,0.0,0.0,0.026112,0.06046,0.0,0.106072,0.0,0.056125,0.025654,0.196017,0.224501,0.212143,0.026582,0.029665,0.055113,0.058233,0.068028,0.082669,0.027556,0.0,0.068143,0.0,0.246496,0.045428,0.0,0.02351,0.133321,0.0,0.136285,0.339429,0.102528,0.027556,0.029116,0.0,0.02392,0.058233,...,0.0,0.022714,0.0,0.0,0.130678,0.130678,0.121696,0.403213,0.026112,0.0,0.027556,0.117549,0.03023,0.058233,0.0,0.058233,0.0,0.059331,0.153921,0.0,0.025654,0.02351,0.03023,0.058233,0.089313,0.153921,0.090691,0.21528,0.0,0.116465,0.03203,0.021214,0.0,0.062823,0.070529,0.024339,0.0,0.0,0.063643,0.073018
3,0.0,0.092693,0.0,0.0,0.0,0.090942,0.117831,0.045471,0.053335,0.223369,0.0,0.084421,0.154348,0.0,0.084421,0.127639,0.039983,0.089243,0.0,0.175183,0.051163,0.082899,0.041449,0.059596,0.239161,0.048179,0.0,0.102498,0.17197,0.035363,0.100268,0.0,0.170829,0.382918,0.030844,0.124348,0.087591,0.045471,0.03598,0.0,...,0.0,0.0,0.042992,0.0,0.04914,0.0,0.183051,0.06065,0.039277,0.0,0.124348,0.14145,0.045471,0.0,0.0,0.131387,0.0,0.044621,0.154348,0.0,0.154348,0.070725,0.0,0.0,0.201512,0.0,0.090942,0.0,0.0,0.131387,0.048179,0.0,0.0,0.094497,0.0,0.03661,0.0,0.039277,0.095729,0.0
4,0.041334,0.039761,0.0,0.031408,0.0,0.0,0.067393,0.039011,0.091514,0.27376,0.0,0.0,0.033105,0.0,0.21728,0.109504,0.034302,0.153126,0.14224,0.075146,0.043893,0.0,0.0,0.0,0.234492,0.0,0.159045,0.029311,0.073768,0.060676,0.043011,0.0,0.087934,0.082128,0.026461,0.0,0.075146,0.039011,0.0,0.075146,...,0.10864,0.029311,0.0,0.040535,0.126475,0.0,0.125634,0.0,0.067393,0.0,0.03556,0.091014,0.039011,0.112719,0.0,0.112719,0.036884,0.0,0.463464,0.0,0.033105,0.091014,0.039011,0.037573,0.201694,0.066209,0.312084,0.12347,0.078021,0.075146,0.082667,0.164256,0.0,0.121605,0.030338,0.094225,0.0,0.0,0.054752,0.062817


**Did you notice that counting the word occurrences and calculating the Tf-idf weights are very similar? This consistent API is one of the reasons scikit-learn is so popular.**

### Inspecting Tf-idf values

<div class=""><p>After creating Tf-idf features you will often want to understand which words score highest in each document. This can be achieved by isolating the row you want to examine and then sorting the scores from high to low. </p>
<p>The DataFrame from the last exercise (<code>tv_df</code>) is available in your workspace.</p></div>

Instructions
<ul>
<li>Assign the first row of <code>tv_df</code> to <code>sample_row</code>. </li>
<li><code>sample_row</code> is now a series of weights assigned to words. Sort these values to print the top 5 highest-rated words.</li>
</ul>

In [17]:
# Isolate the row to be examined
sample_row = tv_df.iloc[0]

# Print the top 5 words of the sorted output
print(sample_row.sort_values(ascending=False).head())

TFIDF_government    0.367430
TFIDF_public        0.333237
TFIDF_present       0.315182
TFIDF_duty          0.238637
TFIDF_citizens      0.229644
Name: 0, dtype: float64


**Do you think these scores make sense for the corresponding words?**
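
If you want to inspect more than one row, the same idea can be wrapped in a loop. The sketch below uses a small made-up DataFrame in place of `tv_df`; the name `toy_tv_df` and the values in it are hypothetical.

```python
import pandas as pd

# Made-up Tf-idf weights standing in for tv_df
toy_tv_df = pd.DataFrame({
    'TFIDF_government': [0.37, 0.10],
    'TFIDF_public': [0.33, 0.00],
    'TFIDF_peace': [0.00, 0.32],
})

# For every row, keep the names of the two highest-weighted words
top_words = {
    idx: row.nlargest(2).index.tolist()
    for idx, row in toy_tv_df.iterrows()
}
print(top_words)
```

`nlargest(n)` is a shorthand for sorting in descending order and taking the first `n` entries.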

### Transforming unseen data

<div class=""><p>Any transformation that you apply to text before training a machine learning model must also be applied to new, unseen (test) data. To achieve this, follow the same approach as in the last chapter: <em>fit the vectorizer only on the training data, and apply it to the test data.</em></p>
<p>For this exercise the <code>speech_df</code> DataFrame has been split in two:</p>
<ul>
<li><code>train_speech_df</code>: The training set consisting of the first 45 speeches.</li>
<li><code>test_speech_df</code>: The test set consisting of the remaining speeches.</li>
</ul></div>

In [18]:
train_speech_df = speech_df.iloc[:45]
test_speech_df = speech_df.iloc[45:]

Instructions
<ul>
<li>Instantiate <code>TfidfVectorizer</code>. </li>
<li>Fit the vectorizer and apply it to the <code>text_clean</code> column of the training data. </li>
<li>Apply the same vectorizer on the <code>text_clean</code> column of the test data. </li>
<li>Create a DataFrame of these new features from the test set.</li>
</ul>

In [20]:
# Instantiate TfidfVectorizer
tv = TfidfVectorizer(max_features=100, stop_words='english')

# Fit the vectorizer and transform the training data
tv_transformed = tv.fit_transform(train_speech_df['text_clean'])

# Transform test data
test_tv_transformed = tv.transform(test_speech_df['text_clean'])

# Create new features for the test set
# (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out())
test_tv_df = pd.DataFrame(test_tv_transformed.toarray(), 
                          columns=tv.get_feature_names_out()).add_prefix('TFIDF_')
test_tv_df.head()

Unnamed: 0,TFIDF_action,TFIDF_administration,TFIDF_america,TFIDF_american,TFIDF_authority,TFIDF_best,TFIDF_business,TFIDF_citizens,TFIDF_commerce,TFIDF_common,TFIDF_confidence,TFIDF_congress,TFIDF_constitution,TFIDF_constitutional,TFIDF_country,TFIDF_day,TFIDF_duties,TFIDF_duty,TFIDF_equal,TFIDF_executive,TFIDF_faith,TFIDF_far,TFIDF_federal,TFIDF_fellow,TFIDF_force,TFIDF_foreign,TFIDF_free,TFIDF_freedom,TFIDF_future,TFIDF_general,TFIDF_given,TFIDF_god,TFIDF_good,TFIDF_government,TFIDF_great,TFIDF_high,TFIDF_hope,TFIDF_human,TFIDF_important,TFIDF_institutions,...,TFIDF_order,TFIDF_ought,TFIDF_party,TFIDF_peace,TFIDF_people,TFIDF_policy,TFIDF_political,TFIDF_power,TFIDF_powers,TFIDF_present,TFIDF_principle,TFIDF_principles,TFIDF_progress,TFIDF_proper,TFIDF_prosperity,TFIDF_protection,TFIDF_public,TFIDF_purpose,TFIDF_question,TFIDF_republic,TFIDF_revenue,TFIDF_right,TFIDF_rights,TFIDF_secure,TFIDF_self,TFIDF_service,TFIDF_shall,TFIDF_spirit,TFIDF_state,TFIDF_states,TFIDF_subject,TFIDF_support,TFIDF_time,TFIDF_union,TFIDF_united,TFIDF_war,TFIDF_way,TFIDF_work,TFIDF_world,TFIDF_years
0,0.0,0.02954,0.233954,0.082703,0.0,0.0,0.0,0.022577,0.0,0.0,0.02635,0.0,0.02695,0.0,0.022577,0.02954,0.0,0.0,0.065003,0.0,0.03172,0.056409,0.0,0.049296,0.0,0.0,0.049296,0.066626,0.02635,0.0,0.030968,0.195008,0.024111,0.115378,0.11045,0.055135,0.07905,0.0,0.0,0.0,...,0.034158,0.0,0.0,0.3162,0.3026,0.0,0.0,0.025767,0.0,0.0,0.0,0.0,0.030968,0.0,0.0,0.0,0.0,0.02954,0.0,0.0,0.0,0.0,0.0,0.030242,0.0,0.0,0.086457,0.165406,0.0,0.024648,0.0,0.0,0.115378,0.0,0.024648,0.07905,0.033313,0.0,0.299983,0.134749
1,0.0,0.0,0.547457,0.036862,0.0,0.036036,0.0,0.015094,0.0,0.0,0.017617,0.0,0.0,0.0,0.045283,0.01975,0.0,0.0,0.02173,0.0,0.08483,0.037714,0.0,0.016479,0.043459,0.0,0.0,0.089089,0.052851,0.0,0.020704,0.086919,0.01612,0.154278,0.13292,0.018431,0.035234,0.040438,0.043459,0.0,...,0.022837,0.0,0.0,0.334722,0.086705,0.0,0.0,0.017227,0.018857,0.0,0.024041,0.0,0.103522,0.0,0.0,0.0,0.0,0.01975,0.024685,0.0,0.0,0.108108,0.01612,0.020219,0.0,0.0,0.101155,0.036862,0.0,0.0,0.0,0.019296,0.092567,0.0,0.0,0.052851,0.066817,0.078999,0.277701,0.126126
2,0.0,0.0,0.126987,0.134669,0.0,0.131652,0.0,0.0,0.0,0.046997,0.042907,0.0,0.0,0.0,0.036763,0.048102,0.045927,0.0,0.052924,0.0,0.103304,0.0,0.0,0.0,0.0,0.049244,0.040136,0.216981,0.085814,0.0,0.100853,0.052924,0.078521,0.150301,0.071941,0.08978,0.085814,0.24622,0.0,0.0,...,0.0,0.0,0.0,0.042907,0.211174,0.0,0.0,0.0,0.0,0.0,0.0,0.093993,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.043884,0.117781,0.0,0.054245,0.0,0.0,0.269339,0.0,0.040136,0.0,0.0,0.075151,0.0,0.080272,0.042907,0.054245,0.096203,0.225452,0.043884
3,0.037094,0.067428,0.267012,0.031463,0.03999,0.061516,0.050085,0.077301,0.0,0.0,0.0,0.03999,0.030758,0.0,0.077301,0.134856,0.0,0.0,0.074188,0.0,0.108607,0.03219,0.183116,0.084393,0.0,0.0,0.056262,0.304162,0.09022,0.0,0.0,0.185469,0.027517,0.42138,0.100845,0.031463,0.060146,0.034515,0.037094,0.0,...,0.038984,0.0,0.038984,0.060146,0.222015,0.0,0.092274,0.029408,0.03219,0.094389,0.0,0.03294,0.070687,0.0,0.0,0.0,0.029408,0.0,0.0,0.03999,0.0,0.030758,0.0,0.0,0.07604,0.0,0.024668,0.0,0.0,0.112524,0.0,0.098819,0.21069,0.0,0.056262,0.030073,0.03802,0.235998,0.237026,0.061516
4,0.0,0.0,0.221561,0.156644,0.028442,0.087505,0.0,0.109959,0.0,0.023428,0.021389,0.028442,0.0,0.0,0.018327,0.143872,0.0,0.0,0.026383,0.0,0.077246,0.0,0.162799,0.060023,0.0,0.0,0.060023,0.37858,0.042778,0.025138,0.025138,0.211061,0.019571,0.337164,0.089656,0.0,0.042778,0.220934,0.026383,0.0,...,0.0,0.0,0.027727,0.171114,0.298266,0.0,0.043752,0.041832,0.0,0.022378,0.0,0.0,0.150826,0.0,0.0,0.0,0.0,0.023979,0.059941,0.028442,0.0,0.087505,0.019571,0.024548,0.108166,0.0,0.03509,0.044755,0.023428,0.060023,0.0,0.023428,0.187313,0.131913,0.040016,0.021389,0.081124,0.119894,0.299701,0.153133


**The vectorizer should only be fit on the train set, never on your test set.**
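
One practical reason for this rule is that a vectorizer fit separately on the test set produces different columns, so a model trained on the training features could not use them. The sketch below shows this on made-up training and test texts; all names are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up split (not from the dataset)
train_docs = ['the union of states', 'peace and the union']
test_docs = ['peace among nations']

# Correct: fit on train, reuse the same fitted vectorizer on test
toy_tv = TfidfVectorizer()
train_X = toy_tv.fit_transform(train_docs)
test_X = toy_tv.transform(test_docs)
print(train_X.shape, test_X.shape)   # same number of columns

# Incorrect: a second vectorizer fit on the test set has its own vocabulary
wrong_X = TfidfVectorizer().fit_transform(test_docs)
print(wrong_X.shape)                 # column count no longer matches train_X
```

Fitting on the test set would also let information from the test data influence the features, which makes the evaluation less trustworthy.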

## N-grams

### Using longer n-grams

<div class=""><p>So far you have created features based on individual words in each of the texts. This can be quite powerful when used in a machine learning model, but you may be concerned that by looking at words individually a lot of the context is being ignored. To deal with this when creating models you can use n-grams, which are sequences of n words grouped together. For example:</p>
<ul>
<li>bigrams: Sequences of two consecutive words</li>
<li>trigrams: Sequences of three consecutive words   </li>
</ul>
<p>These can be automatically created in your dataset by specifying the <code>ngram_range</code> argument as a tuple <code>(n1, n2)</code> where all n-grams in the <code>n1</code> to <code>n2</code> range are included.</p></div>

Instructions
<ul>
<li>Import <code>CountVectorizer</code> from <code>sklearn.feature_extraction.text</code>.  </li>
<li>Instantiate <code>CountVectorizer</code> while considering only trigrams.  </li>
<li>Fit the vectorizer and apply it to the <code>text_clean</code> column in one step.  </li>
<li>Print the feature names generated by the vectorizer.</li>
</ul>

In [23]:
# Import CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

# Instantiate a trigram vectorizer
cv_trigram_vec = CountVectorizer(max_features=100, 
                                 stop_words='english', 
                                 ngram_range = (3,3))

# Fit and apply trigram vectorizer
cv_trigram = cv_trigram_vec.fit_transform(speech_df['text_clean'])

# Print the trigram features
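# (get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it)
cv_trigram_vec.get_feature_names_out()[:5].tolist()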

['ability preserve protect',
 'agriculture commerce manufactures',
 'america ideal freedom',
 'amity mutual concession',
 'anchor peace home']

**Here you can see that by taking sequences of three consecutive words, some context is preserved.**

### Finding the most common words

<div class=""><p>It's always advisable once you have created your features to inspect them to ensure that they are as you would expect. This will allow you to catch errors early, and perhaps influence what further feature engineering you will need to do.   </p>
<p>The vectorizer (<code>cv_trigram_vec</code>) you fit in the last exercise and the sparse array consisting of word counts (<code>cv_trigram</code>) are available in your workspace.</p></div>

Instructions
<ul>
<li>Create a DataFrame of the features (word counts). </li>
<li>Add the counts of word occurrences and print the top 5 most occurring words.</li>
</ul>

In [25]:
# Create a DataFrame of the features
# (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out())
cv_tri_df = pd.DataFrame(cv_trigram.toarray(), 
                 columns=cv_trigram_vec.get_feature_names_out()).add_prefix('Counts_')

# Print the top 5 words in the sorted output
print(cv_tri_df.sum().sort_values(ascending=False).head())

Counts_constitution united states    20
Counts_people united states          13
Counts_preserve protect defend       10
Counts_mr chief justice              10
Counts_president united states        8
dtype: int64


**It makes a lot of sense that the most common trigram in US presidents' speeches is constitution united states.**