# Encoding text data

Unstructured data is free-form data such as text. To use it in a model, we have to transform it into vectors.

In [2]:
import pandas as pd
speech_df = pd.read_csv('inaugural_speeches.csv')


In [1]:
speech_df.head()


The text is stored in a single column of the DataFrame. It first needs to be cleaned of characters such as punctuation; after that we can create features from it.

In [15]:
#Removing unwanted characters
# Replace all non letter characters with a whitespace
speech_df['text_clean'] = speech_df['text'].str.replace(r'[^a-zA-Z]', ' ', regex=True)

# Change to lower case
speech_df['text_clean'] = speech_df['text_clean'].str.lower()

# Print the first 5 rows of the text_clean column
print(speech_df['text_clean'].head())

# Find the length of each text
speech_df['char_cnt'] = speech_df['text_clean'].str.len()

# Count the number of words in each text
speech_df['word_cnt'] = speech_df['text_clean'].str.split().str.len()

# Find the average length of word
speech_df['avg_word_length'] = speech_df['char_cnt'] / speech_df['word_cnt']

# Print these columns for every row
print(speech_df[['text_clean', 'char_cnt', 'word_cnt', 'avg_word_length']])

0    fellow citizens of the senate and of the house...
1    fellow citizens   i am again called upon by th...
2    when it was first perceived  in early times  t...
3    friends and fellow citizens   called upon to u...
4    proceeding  fellow citizens  to that qualifica...
Name: text_clean, dtype: object
                                           text_clean  char_cnt  word_cnt  \
0   fellow citizens of the senate and of the house...      8616      1432   
1   fellow citizens   i am again called upon by th...       787       135   
2   when it was first perceived  in early times  t...     13871      2323   
3   friends and fellow citizens   called upon to u...     10144      1736   
4   proceeding  fellow citizens  to that qualifica...     12902      2169   
5   unwilling to depart from examples of the most ...      7003      1179   
6   about to add the solemnity of an oath to the o...      7148      1211   
7   i should be destitute of feeling if i was not ...     19894      3382   
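
Note that char_cnt is measured on the cleaned text, so it also counts the spaces left behind where punctuation was removed, and char_cnt / word_cnt slightly overstates the word length. A minimal sketch of a whitespace-free variant follows, assuming a toy DataFrame in place of speech_df; the names toy, letter_cnt, and avg_word_length_no_ws are hypothetical and not part of the original notebook.

```python
import pandas as pd

# Toy stand-in for speech_df['text_clean'] (hypothetical data, not from the speeches file)
toy = pd.DataFrame({'text_clean': ['fellow citizens   i am', 'we the people']})

# Same features as in the cell above: length including whitespace, and word count
toy['char_cnt'] = toy['text_clean'].str.len()
toy['word_cnt'] = toy['text_clean'].str.split().str.len()

# Count letters only, by summing the length of each whitespace-separated token
toy['letter_cnt'] = toy['text_clean'].str.split().apply(
    lambda words: sum(len(w) for w in words))

# Compare the two averages
toy['avg_word_length'] = toy['char_cnt'] / toy['word_cnt']
toy['avg_word_length_no_ws'] = toy['letter_cnt'] / toy['word_cnt']
print(toy[['char_cnt', 'letter_cnt', 'word_cnt',
           'avg_word_length', 'avg_word_length_no_ws']])
```

Which of the two averages is the better feature depends on the model; the point is only that they measure slightly different things.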


We create a column for each word and count the number of times it appears in each text. To avoid creating too large a vector, we use the min_df and max_df arguments. They set the minimum and maximum number of documents (or, when given as floats, the proportion of documents) in which a word may occur. This excludes rare and very common words.

In [17]:

# Import CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

# Instantiate CountVectorizer
cv = CountVectorizer()
# Fit the vectorizer
cv.fit(speech_df['text_clean'])

# Print feature names
print(cv.get_feature_names_out())

# Apply the vectorizer
cv_transformed = cv.transform(speech_df['text_clean'])

# cv_transformed is a sparse array with a row for every text and a column for every word counted



# Print the full array
cv_array = cv_transformed.toarray()
print(cv_array)


##### AGAIN WITH A DIFFERENT VERSION
# Specify arguments to limit the number of features generated
cv = CountVectorizer(min_df=.2, max_df=.8)

# Fit, transform, and convert into array
cv_transformed = cv.fit_transform(speech_df['text_clean'])
cv_array = cv_transformed.toarray()

# Print the array shape
print(cv_array.shape)
######################################


# Create a DataFrame with these features
cv_df = pd.DataFrame(cv_array, 
                     columns=cv.get_feature_names_out()).add_prefix('Counts_')

# Add the new columns to the original DataFrame
speech_df_new = pd.concat([speech_df, cv_df], axis=1, sort=False)
print(speech_df_new.head())


['abandon' 'abandoned' 'abandonment' ... 'zealous' 'zealously' 'zone']
[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 1 0 ... 0 0 0]
 ...
 [0 1 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]
(58, 818)
                Name         Inaugural Address                      Date  \
0  George Washington   First Inaugural Address  Thursday, April 30, 1789   
1  George Washington  Second Inaugural Address     Monday, March 4, 1793   
2         John Adams         Inaugural Address   Saturday, March 4, 1797   
3   Thomas Jefferson   First Inaugural Address  Wednesday, March 4, 1801   
4   Thomas Jefferson  Second Inaugural Address     Monday, March 4, 1805   

                                                text  \
0  Fellow-Citizens of the Senate and of the House...   
1  Fellow Citizens:  I AM again called upon by th...   
2  WHEN it was first perceived, in early times, t...   
3  Friends and Fellow-Citizens:  CALLED upon to u...   
4  PROCEEDING, fellow-citizens, to that qualifica...   

     

# TF-IDF Representation

$$ \text{TF-IDF} = \frac{\text{Count of word occurrences}}{\text{Total words in document}} \times \log\left(\frac{\text{Total number of docs}}{\text{Number of docs word is in}}\right) $$

This decreases the value of common words while increasing the value of words that do not occur in many documents. The idea is to weight a word more heavily the more often it appears in a particular document, but to down-weight it if it appears in too many documents.

After transforming the data, always check whether the word weights make sense by examining a row.
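
As a worked example of the weighting, the sketch below computes TF-IDF by hand on a made-up three-document corpus, using the common definition of term frequency times log(total docs / docs containing the word). The corpus and the helper tf_idf are hypothetical. scikit-learn's TfidfVectorizer uses a smoothed IDF and L2-normalizes each row by default, so its numbers will differ from these, although the relative ordering is usually similar.

```python
import math
from collections import Counter

# Hypothetical toy corpus, already cleaned and lower-cased
docs = [
    'the union and the people',
    'the senate and the house',
    'the people of the union',
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each word occur?
doc_freq = Counter(word for tokens in tokenized for word in set(tokens))

def tf_idf(tokens):
    """Return {word: tf * log(N / df)} for one tokenized document."""
    counts = Counter(tokens)
    total = len(tokens)
    return {w: (c / total) * math.log(n_docs / doc_freq[w])
            for w, c in counts.items()}

scores = [tf_idf(tokens) for tokens in tokenized]

# Examine one row: words of the second document, highest weight first
top = sorted(scores[1].items(), key=lambda kv: kv[1], reverse=True)
print(top)
```

Here 'the' occurs in every document, so its weight is zero, while 'senate' and 'house', which occur only in the second document, get the highest weights in that row.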


In [19]:
# Import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

# Instantiate TfidfVectorizer
tv = TfidfVectorizer(max_features=100, stop_words='english')

# Fit the vectorizer and transform the data
tv_transformed = tv.fit_transform(speech_df['text_clean'])

# Create a DataFrame with these features
tv_df = pd.DataFrame(tv_transformed.toarray(), 
                     columns=tv.get_feature_names_out()).add_prefix('TFIDF_')
print(tv_df.head())

# Isolate the row to be examined
sample_row = tv_df.iloc[0]

# Print the top 5 words of the sorted output
print(sample_row.sort_values(ascending=False).head())

# Re-instantiate a fresh TfidfVectorizer, to be fitted on the training set only
tv = TfidfVectorizer(max_features=100, stop_words='english')



   TFIDF_action  TFIDF_administration  TFIDF_america  TFIDF_american  \
0      0.000000              0.133415       0.000000        0.105388   
1      0.000000              0.261016       0.266097        0.000000   
2      0.000000              0.092436       0.157058        0.073018   
3      0.000000              0.092693       0.000000        0.000000   
4      0.041334              0.039761       0.000000        0.031408   

   TFIDF_americans  TFIDF_believe  TFIDF_best  TFIDF_better  TFIDF_change  \
0              0.0       0.000000    0.000000      0.000000      0.000000   
1              0.0       0.000000    0.000000      0.000000      0.000000   
2              0.0       0.000000    0.026112      0.060460      0.000000   
3              0.0       0.090942    0.117831      0.045471      0.053335   
4              0.0       0.000000    0.067393      0.039011      0.091514   

   TFIDF_citizens  ...  TFIDF_things  TFIDF_time  TFIDF_today  TFIDF_union  \
0        0.229644  ...    

In [None]:

# NOTE: train_speech_df and test_speech_df must be created with a train/test split before running this cell
# Fit the vectorizer and transform the training data
tv_transformed = tv.fit_transform(train_speech_df['text_clean'])

# Transform test data
test_tv_transformed = tv.transform(test_speech_df['text_clean'])

# Create new features for the test set
test_tv_df = pd.DataFrame(test_tv_transformed.toarray(), 
                          columns=tv.get_feature_names_out()).add_prefix('TFIDF_')
print(test_tv_df.head())


# N-grams

N-grams build on the idea that words take their meaning from context, so their order matters. The ngram_range=(min, max) argument makes the vectorizer extract sequences of min to max consecutive words instead of only single words.

In [22]:
# Import CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

# Instantiate a trigram vectorizer
cv_trigram_vec = CountVectorizer(max_features=100, 
                                 stop_words='english', 
                                 ngram_range=(3, 3))

# Fit and apply trigram vectorizer
cv_trigram = cv_trigram_vec.fit_transform(speech_df['text_clean'])

# Print the trigram features
print(cv_trigram_vec.get_feature_names_out())

# Create a DataFrame of the features
cv_tri_df = pd.DataFrame(cv_trigram.toarray(),
                         columns=cv_trigram_vec.get_feature_names_out()).add_prefix('Counts_')

# Print the 5 most frequent trigrams across all texts
print(cv_tri_df.sum().sort_values(ascending=False).head())


['ability preserve protect' 'agriculture commerce manufactures'
 'america ideal freedom' 'amity mutual concession' 'anchor peace home'
 'ask bow heads' 'best ability preserve' 'best interests country'
 'bless god bless' 'bless united states' 'chief justice mr'
 'children children children' 'citizens united states'
 'civil religious liberty' 'civil service reform' 'commerce united states'
 'confidence fellow citizens' 'congress extraordinary session'
 'constitution does expressly' 'constitution united states'
 'coordinate branches government' 'day task people'
 'defend constitution united' 'distinction powers granted'
 'distinguished guests fellow' 'does expressly say' 'equal exact justice'
 'era good feeling' 'executive branch government'
 'faithfully execute office' 'fellow citizens assembled'
 'fellow citizens called' 'fellow citizens large' 'fellow citizens world'
 'form perfect union' 'general welfare secure' 'god bless america'
 'god bless god' 'good greatest number' 'government p