# Numeric Features from Reviews
>  Imagine you are in the shoes of a company offering a variety of products. You want to know which of your products are bestsellers and most of all - why. We embark on step 1 of understanding the reviews of products, using a dataset with Amazon product reviews. To that end, we transform the text into a numeric form and consider a few complexities in the process.

- toc: true 
- badges: true
- comments: true
- author: Lucas Nunes
- categories: [Datacamp]
- image: images/datacamp/___

> Note: This is a summary of the chapter 2 exercises from the course "Sentiment Analysis in Python" on DataCamp. <br>[Github repo](https://github.com/lnunesAI/Datacamp/) / [Course link](https://www.datacamp.com/tracks/natural-language-processing-in-python)

In [None]:
import pandas as pd

## Bag-of-words

### Which statement about BOW is true?

<p>You were introduced to a bag-of-words (BOW) and some of its characteristics in the video. Which of the following statements about BOW <strong>is</strong> true?</p>

<pre>
Possible Answers
Bag-of-words preserves the word order and grammar rules.
Bag-of-words describes the order and frequency of words or tokens within a corpus of documents.
<b>Bag-of-words is a simple but effective method to build a vocabulary of all the words occurring in a document.</b>
Bag-of-words can only be applied to a large document, not to shorter documents or single sentences.
</pre>
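
To make the correct answer concrete, here is a minimal sketch of a bag-of-words built by hand with `collections.Counter`. The two sentences and the whitespace-based `bag_of_words` helper are illustrative assumptions, not part of the course material. Two sentences with the same words in a different order end up with the same bag, which is why BOW cannot preserve word order or grammar.

```python
from collections import Counter

def bag_of_words(text):
    """Count lowercase, whitespace-separated tokens (hypothetical helper)."""
    return Counter(text.lower().split())

bow_a = bag_of_words("the dog bit the man")
bow_b = bag_of_words("the man bit the dog")

print(bow_a)           # word counts for the first sentence
print(bow_a == bow_b)  # the word order is not part of the representation
```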

### Your first BOW

<div class=""><p>A bag-of-words is an approach to transform text to numeric form. </p>
<p>In this exercise, you will apply a BOW to the <code>annak</code> list before moving on to a larger dataset in the next exercise.  </p>
<p>Your task will be to work with this list and apply a BOW using the <code>CountVectorizer()</code>. This transformation is your first step in being able to understand the sentiment of a text. Pay attention to words which might carry a strong sentiment. </p>
<p>Remember that the output of a <code>CountVectorizer()</code> is a sparse matrix, which stores only entries which are non-zero. To look at the actual content of this matrix, we convert it to a dense array using the <code>.toarray()</code> method.</p>
<p>Note that in this case you don't need to specify the <code>max_features</code> argument because the text is short.</p></div>

Instructions
<ul>
<li>Import the count vectorizer function from <code>sklearn.feature_extraction.text</code>.</li>
<li>Build and fit the vectorizer on the small dataset.</li>
<li>Create the BOW representation with name <code>anna_bow</code> by calling the <code>transform()</code> method.</li>
<li>Print the BOW result as a dense array.</li>
</ul>

In [None]:
# Import the required function
from sklearn.feature_extraction.text import CountVectorizer

annak = ['Happy families are all alike;', 'every unhappy family is unhappy in its own way']

# Build the vectorizer and fit it
anna_vect = CountVectorizer()
anna_vect.fit(annak)

# Create the bow representation
anna_bow = anna_vect.transform(annak)

# Print the bag-of-words result 
print(anna_bow.toarray())

[[1 1 1 0 1 0 1 0 0 0 0 0 0]
 [0 0 0 1 0 1 0 1 1 1 1 2 1]]


**You have transformed the first sentence of Anna Karenina to an array counting the frequencies of each word. However, the output is not very readable, is it? We are still missing the names of the features. And does the approach change when we apply it to a larger dataset?**
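
One way to recover the missing feature names is to read them from the fitted vectorizer's `vocabulary_` attribute, which maps each term to its column index. The sketch below assumes scikit-learn and pandas are installed; the names `feature_names` and `anna_df` are introduced only for illustration.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

annak = ['Happy families are all alike;',
         'every unhappy family is unhappy in its own way']

# Fit the vectorizer and build the sparse BOW matrix in one step
anna_vect = CountVectorizer()
anna_bow = anna_vect.fit_transform(annak)

# vocabulary_ maps term -> column index; sorting by index gives the column order
feature_names = sorted(anna_vect.vocabulary_, key=anna_vect.vocabulary_.get)

# Label the dense array with the recovered feature names
anna_df = pd.DataFrame(anna_bow.toarray(), columns=feature_names)
print(anna_df)
```

On scikit-learn 1.0 or later, `anna_vect.get_feature_names_out()` returns the same names in the same order.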

### BOW using product reviews

<div class=""><p>You practiced a BOW on a small dataset. Now you will apply it to a sample of Amazon product reviews. The data has been imported for you and is called <code>reviews</code>. It contains two columns. The first one is called <code>score</code> and it is <code>0</code> when the review is negative, and <code>1</code> when it is positive. The second column is called <code>review</code> and it contains the text of the review that a customer wrote. Feel free to explore the data in the IPython Shell.</p>
<p>Your task is to build a BOW vocabulary, using the <code>review</code> column.</p>
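<p>Remember that we can call the <code>.get_feature_names()</code> method on the vectorizer to obtain a list of all the vocabulary elements. In scikit-learn 1.0 and later this method is named <code>.get_feature_names_out()</code>; the old name was removed in version 1.2.</p></div>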

In [None]:
reviews_df = pd.read_csv('https://github.com/lnunesAI/Datacamp/raw/main/3-skill-tracks/sentiment-analysis-in-python/datasets/amazon_reviews_sample.csv', index_col=0)

In [None]:
reviews = reviews_df

Instructions
<ul>
<li>Create a CountVectorizer object, specifying the maximum number of features. </li>
<li>Fit the vectorizer. </li>
<li>Transform the review column using the fitted vectorizer.</li>
<li>Create a DataFrame where you transform the sparse matrix to a dense array and make sure to correctly specify the names of columns.</li>
</ul>

In [None]:
from sklearn.feature_extraction.text import CountVectorizer 

# Build the vectorizer, specify max features 
vect = CountVectorizer(max_features=100)
# Fit the vectorizer
vect.fit(reviews.review)

# Transform the review column
X_review = vect.transform(reviews.review)

# Create the bow representation
X_df = pd.DataFrame(X_review.toarray(), columns=vect.get_feature_names_out())
print(X_df.head())

   about  after  all  also  am  an  ...  will  with  work  would  you  your
0      0      0    1     0   0   0  ...     0     1     0      2    0     1
1      0      0    0     0   0   0  ...     0     0     0      1    1     0
2      0      0    3     0   0   1  ...     0     0     1      1    2     0
3      0      0    0     0   0   0  ...     0     0     0      0    0     0
4      0      1    0     0   0   0  ...     0     0     0      0    3     1

[5 rows x 100 columns]


**You have successfully built your first BOW-generated vocabulary and transformed it to numeric features of the dataset!**

## Getting granular with n-grams

### Specify token sequence length with BOW

<div class=""><p>We saw in the video that by specifying token sequences of different lengths, which we called n-grams, we can better capture the context, which can be very important.</p>
<p>In this exercise, you will work with a sample of the Amazon product reviews. Your task is to build a BOW vocabulary, using the <code>review</code> column and specify the sequence length of tokens.</p></div>

In [None]:
reviews = reviews_df[:100].copy()  # copy so that columns can be added later without a SettingWithCopyWarning

Instructions
<ul>
<li>Build the vectorizer, specifying the token sequence length to be uni- and bigrams.</li>
<li>Fit the vectorizer.</li>
<li>Transform the review column using the fitted vectorizer.</li>
<li>In the DataFrame, make sure to correctly specify the column names.</li>
</ul>

In [None]:
from sklearn.feature_extraction.text import CountVectorizer 

# Build the vectorizer, specify token sequence and fit
vect = CountVectorizer(ngram_range=(1,2))
vect.fit(reviews.review)

# Transform the review column
X_review = vect.transform(reviews.review)

# Create the bow representation
X_df = pd.DataFrame(X_review.toarray(), columns=vect.get_feature_names_out())
print(X_df.head())

   10  10 95  10 cups  ...  zen  zen baseball  zen motorcycle
0   0      0        0  ...    0             0               0
1   0      0        0  ...    0             0               0
2   0      0        0  ...    0             0               0
3   0      0        0  ...    0             0               0
4   0      0        0  ...    0             0               0

[5 rows x 8436 columns]
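
To see what the uni- and bigram setting produces, the following sketch builds n-grams by hand for a single made-up sentence. The `ngrams` helper and the example sentence are assumptions for illustration; `CountVectorizer` performs its own tokenization, so its output on real reviews will differ in detail.

```python
def ngrams(tokens, n):
    """Return all runs of n consecutive tokens, each joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "i am not happy".split()

unigrams = ngrams(tokens, 1)  # single words
bigrams = ngrams(tokens, 2)   # pairs of neighbouring words

print(unigrams)
print(bigrams)
```

The bigram `not happy` keeps the negation attached to the word it modifies, which the unigram `happy` alone would lose.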


### Size of vocabulary of movie reviews

<div class=""><p>In this exercise, you will practice different ways to limit the size of the vocabulary using a sample of the <code>movies</code> reviews dataset. The first column is the <code>review</code>, which is of type <code>object</code> and the second column is the <code>label</code>, which is <code>0</code> for a negative review and <code>1</code> for a positive one. </p>
<p>The three methods that you will use will transform the text column to new numeric columns, capturing the count of a word or a phrase in each review. Each method will ultimately result in building a different number of new features.</p></div>

In [None]:
movies_df = pd.read_csv('https://github.com/lnunesAI/Datacamp/raw/main/3-skill-tracks/sentiment-analysis-in-python/datasets/IMDB_sample.csv', index_col=0)
movies = movies_df[:1000]

Instructions 1/3
<p>Using the <code>movies</code> dataset, limit the size of the vocabulary to  <code>100</code>.</p>

In [None]:
from sklearn.feature_extraction.text import CountVectorizer 

# Build the vectorizer, specify size of vocabulary and fit
vect = CountVectorizer(max_features=100)
vect.fit(movies.review)

# Transform the review column
X_review = vect.transform(movies.review)
# Create the bow representation
X_df = pd.DataFrame(X_review.toarray(), columns=vect.get_feature_names_out())
print(X_df.head())

   about  all  also  an  and  any  ...  which  who  will  with  would  you
0      0    0     0   0    1    0  ...      0    0     0     1      1    0
1      0    3     1   1   11    0  ...      2    0     2     7      2    3
2      0    0     0   1    7    0  ...      0    0     0     2      0    0
3      0    0     0   2    1    0  ...      0    1     0     0      0    1
4      0    3     0   0    8    0  ...      1    0     0     2      0    0

[5 rows x 100 columns]


Instructions 2/3
<p>Using the <code>movies</code> dataset, limit the size of the vocabulary to include terms which occur in no more than 200 documents.</p>

In [None]:
# Build and fit the vectorizer
vect = CountVectorizer(max_df=200)
vect.fit(movies.review)

# Transform the review column
X_review = vect.transform(movies.review)
# Create the bow representation
X_df = pd.DataFrame(X_review.toarray(), columns=vect.get_feature_names_out())
print(X_df.head())

   00  000  000s  007  00s  ...  zooms  zsigmond  zulu  zuniga  zvyagvatsev
0   0    0     0    0    0  ...      0         0     0       0            0
1   0    0     0    0    0  ...      0         0     0       0            0
2   0    0     0    0    0  ...      0         0     0       0            0
3   0    0     0    0    0  ...      0         0     0       0            0
4   0    0     0    0    0  ...      0         0     0       0            0

[5 rows x 17669 columns]


Instructions 3/3
<p>Using the <code>movies</code> dataset, limit the size of the vocabulary to ignore terms which occur in fewer than 50 documents.</p>

In [None]:
# Build and fit the vectorizer
vect = CountVectorizer(min_df=50)
vect.fit(movies.review)

# Transform the review column
X_review = vect.transform(movies.review)
# Create the bow representation
X_df = pd.DataFrame(X_review.toarray(), columns=vect.get_feature_names_out())
print(X_df.head())

   10  about  absolutely  acting  action  ...  yes  yet  you  young  your
0   0      0           0       0       0  ...    0    0    0      0     0
1   1      0           0       1       0  ...    0    1    3      0     2
2   0      0           0       0       0  ...    0    0    0      1     0
3   0      0           0       0       1  ...    0    0    1      1     0
4   1      0           0       0       1  ...    0    0    0      0     0

[5 rows x 434 columns]


**Any of the three methods you applied here can be used to limit the size of the vocabulary. Which of the three methods you used resulted in the lowest number of constructed features?**
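
The difference between the three parameters is easier to see on a toy corpus. The four short documents and the `vocab` helper below are made up for illustration; the sketch assumes scikit-learn is installed and passes integer values to `max_df` and `min_df`, which are read as absolute document counts.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini-corpus: "movie" appears in every document, "acting" in one
docs = ["good movie good plot",
        "bad movie",
        "good movie bad acting",
        "movie plot"]

def vocab(**kwargs):
    """Fit a CountVectorizer with the given options and return its sorted vocabulary."""
    return sorted(CountVectorizer(**kwargs).fit(docs).vocabulary_)

top_two = vocab(max_features=2)  # keep the 2 most frequent terms in the corpus
not_common = vocab(max_df=2)     # drop terms found in more than 2 documents
not_rare = vocab(min_df=2)       # drop terms found in fewer than 2 documents

print(top_two)
print(not_common)
print(not_rare)
```

Here `max_features` keeps `good` and `movie`, `max_df=2` removes only the ubiquitous `movie`, and `min_df=2` removes only the rare `acting`.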

### BOW with n-grams and vocabulary size

<p>In this exercise, you will practice building a bag-of-words once more, using the <code>reviews</code> dataset of Amazon product reviews. Your main task will be to limit the size of the vocabulary and specify the length of the token sequence.</p>

Instructions
<ul>
<li>Import the vectorizer from <code>sklearn</code>.</li>
<li>Build the vectorizer and make sure to specify the following parameters: the size of the vocabulary should be limited to 1000, include only bigrams, and ignore terms that appear in more than 500 documents.</li>
<li>Fit the vectorizer to the <code>review</code> column.</li>
<li>Create a DataFrame from the BOW representation.</li>
</ul>

In [None]:
#Import the vectorizer
from sklearn.feature_extraction.text import CountVectorizer

# Build the vectorizer, specify max features and fit
vect = CountVectorizer(max_features=1000, ngram_range=(2, 2), max_df=500)
vect.fit(reviews.review)

# Transform the review
X_review = vect.transform(reviews.review)

# Create a DataFrame from the bow representation
X_df = pd.DataFrame(X_review.toarray(), columns=vect.get_feature_names_out())
print(X_df.head())

   1980 style  aa batteries  ...  your money  yr old
0           0             0  ...           0       0
1           0             0  ...           0       0
2           0             0  ...           0       0
3           0             0  ...           0       0
4           0             0  ...           0       0

[5 rows x 1000 columns]


**You have successfully created a bag-of-words representation of the product reviews dataset, including more sophisticated sequences of tokens, while limiting the size of the vocabulary.**

## Build new features from text

### Tokenize a string from GoT

<div class=""><p>A first standard step when working with text is to tokenize it, in other words, split a bigger string into individual strings, which are usually single words (tokens). </p>
<p>A string <code>GoT</code> has been created for you and it contains a quote from George R.R. Martin's <em>Game of Thrones</em>. Your task is to split it into individual tokens.</p></div>

In [None]:
import nltk
nltk.download('punkt')
GoT = 'Never forget what you are, for surely the world will not. Make it your strength. Then it can never be your weakness. Armour yourself in it, and it will never be used to hurt you.'

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


Instructions
<ul>
<li>Import the word tokenizing function from <code>nltk</code>.</li>
<li>Transform the <code>GoT</code> string to word tokens.</li>
</ul>

In [None]:
# Import the required function
from nltk import word_tokenize

# Transform the GoT string to word tokens
print(word_tokenize(GoT))

['Never', 'forget', 'what', 'you', 'are', ',', 'for', 'surely', 'the', 'world', 'will', 'not', '.', 'Make', 'it', 'your', 'strength', '.', 'Then', 'it', 'can', 'never', 'be', 'your', 'weakness', '.', 'Armour', 'yourself', 'in', 'it', ',', 'and', 'it', 'will', 'never', 'be', 'used', 'to', 'hurt', 'you', '.']
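
For intuition about what a tokenizer does beyond splitting on whitespace, here is a rough regex-based sketch. It is not NLTK's algorithm, only a simplified stand-in: the pattern treats each run of word characters as one token and each punctuation mark as its own token.

```python
import re

sentence = "Make it your strength."

# Plain whitespace splitting leaves the full stop glued to the last word
whitespace_tokens = sentence.split()

# Runs of word characters, or any single character that is neither word nor space
regex_tokens = re.findall(r"\w+|[^\w\s]", sentence)

print(whitespace_tokens)
print(regex_tokens)
```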


### Word tokens from the Avengers

<div class=""><p>Now that you have tokenized your first string, it is time to iterate over items of a list and tokenize them as well. An easy way to do that with one line of code is with a list comprehension.</p>
<p>A list <code>avengers</code> has been created for you. It contains a few quotes from the <em>Avengers</em> movies. You can explore it in the IPython Shell.</p></div>

In [None]:
avengers = ["Cause if we can't protect the Earth, you can be d*** sure we'll avenge it",
 'There was an idea to bring together a group of remarkable people, to see if we could become something more',
 "These guys come from legend, Captain. They're basically Gods."]

Instructions
<ul>
<li>Import the required function and package.</li>
<li>Apply the word tokenizing function on each item of our list.</li>
</ul>

In [None]:
# Import the word tokenizing function
from nltk import word_tokenize

# Tokenize each item in the avengers 
tokens_avengers = [word_tokenize(item) for item in avengers]

print(tokens_avengers)

[['Cause', 'if', 'we', 'ca', "n't", 'protect', 'the', 'Earth', ',', 'you', 'can', 'be', 'd***', 'sure', 'we', "'ll", 'avenge', 'it'], ['There', 'was', 'an', 'idea', 'to', 'bring', 'together', 'a', 'group', 'of', 'remarkable', 'people', ',', 'to', 'see', 'if', 'we', 'could', 'become', 'something', 'more'], ['These', 'guys', 'come', 'from', 'legend', ',', 'Captain', '.', 'They', "'re", 'basically', 'Gods', '.']]


### A feature for the length of a review

<div class=""><p>You have now worked with a string and with a list of strings; it is time to use a larger sample of data.</p>
<p>Your task in this exercise is to create a new feature for the length of a review, using the familiar <code>reviews</code> dataset.</p></div>

Instructions 1/2
<ul>
<li>Import the word tokenizing function from the required package.</li>
<li>Apply the function to the <code>review</code> column of the <code>reviews</code> dataset.</li>
</ul>

In [None]:
# Import the needed packages
from nltk import word_tokenize

# Tokenize each item in the review column 
word_tokens = [word_tokenize(review) for review in reviews.review]

# Print out the first item of the word_tokens list
print(word_tokens[0])

['Stuning', 'even', 'for', 'the', 'non-gamer', ':', 'This', 'sound', 'track', 'was', 'beautiful', '!', 'It', 'paints', 'the', 'senery', 'in', 'your', 'mind', 'so', 'well', 'I', 'would', 'recomend', 'it', 'even', 'to', 'people', 'who', 'hate', 'vid', '.', 'game', 'music', '!', 'I', 'have', 'played', 'the', 'game', 'Chrono', 'Cross', 'but', 'out', 'of', 'all', 'of', 'the', 'games', 'I', 'have', 'ever', 'played', 'it', 'has', 'the', 'best', 'music', '!', 'It', 'backs', 'away', 'from', 'crude', 'keyboarding', 'and', 'takes', 'a', 'fresher', 'step', 'with', 'grate', 'guitars', 'and', 'soulful', 'orchestras', '.', 'It', 'would', 'impress', 'anyone', 'who', 'cares', 'to', 'listen', '!', '^_^']


Instructions 2/2
<ul>
<li>Iterate over the created <code>word_tokens</code> list. </li>
<li>As you iterate, find the length of each item in the list and append it to the empty <code>len_tokens</code> list. </li>
<li>Create a new feature <code>n_words</code> in the <code>reviews</code> DataFrame for the length of each review.</li>
</ul>

In [None]:
# Create an empty list to store the length of reviews
len_tokens = []

# Iterate over the word_tokens list and determine the length of each item
for tokens in word_tokens:
    len_tokens.append(len(tokens))
# Equivalent one-liner: len_tokens = [len(tokens) for tokens in word_tokens]

# Create a new feature for the length of each review
reviews['n_words'] = len_tokens


In [None]:
reviews.head()

   score                                            review  n_words
0      1  Stuning even for the non-gamer: This sound tr...       87
1      1  The best soundtrack ever to anything.: I'm re...      109
2      1  Amazing!: This soundtrack is my favorite musi...      165
3      1  Excellent Soundtrack: I truly like this sound...      145
4      1  Remember, Pull Your Jaw Off The Floor After H...      109
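
The same kind of length feature can be computed without an explicit loop. The sketch below uses a tiny made-up DataFrame (`toy_reviews` is hypothetical) and whitespace splitting instead of `word_tokenize`, so punctuation is not counted as separate tokens and the numbers would be lower than NLTK's on real reviews.

```python
import pandas as pd

# Hypothetical two-row stand-in for the reviews dataset
toy_reviews = pd.DataFrame({
    "score": [1, 0],
    "review": ["Great sound track!", "Stopped working after two days."],
})

# Split each review on whitespace and count the resulting pieces
toy_reviews["n_words"] = toy_reviews["review"].str.split().str.len()
print(toy_reviews)
```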


## Can you guess the language?

### Identify the language of a string

<div class=""><p>Sometimes you might need to analyze the sentiment of non-English text. Your first task in such a case will be to identify the foreign language. </p>
<p>In this exercise you will identify the language of a single string. A string called <code>foreign</code> has been created for you. Feel free to explore it in the IPython Shell.</p></div>

In [None]:
foreign = 'La histoire rendu étai fidèle, excellent, et grand.'

In [None]:
%%capture
!pip install langdetect

Instructions
<ul>
<li>Import the required function from the language detection package.</li>
<li>Detect the language of the <code>foreign</code> string.</li>
</ul>

In [None]:
# Import the language detection function and package
from langdetect import detect_langs

# Detect the language of the foreign string
print(detect_langs(foreign))

[fr:0.999997159109143]


### Detect language of a list of strings

<p>Now you will detect the language of each item in a list. A list called <code>sentences</code> has been created for you and it contains 3 sentences, each in a different language. They have been randomly extracted from the product reviews dataset.</p>

In [None]:
sentences = ['La histoire rendu étai fidèle, excellent, et grand.',
 'Excelente muy recomendable.',
 'It had a leak from day one but the return and exchange process was very quick.']

Instructions
<ul>
<li>Iterate over the sentences in the list.</li>
<li>Detect the language of each sentence and append the detected language to the empty list <code>languages</code>.</li>
</ul>

In [None]:
from langdetect import detect_langs

languages = []

# Loop over the sentences in the list and detect their language
for sentence in sentences:
    languages.append(detect_langs(sentence))
    
print('The detected languages are: ', languages)

The detected languages are:  [[fr:0.9999993247241308], [es:0.9999944158600821], [en:0.9999961920509105]]


### Language detection of product reviews

<div class=""><p>You will practice language detection on a small dataset called <code>non_english_reviews</code>. It is a sample of non-English reviews from the Amazon product reviews. </p>
<p>You will iterate over the rows of the dataset, detecting the language of each row and appending it to an empty list. The list needs to be cleaned so that it only contains the language of the review such as <code>'en'</code> for English instead of the regular output <code>en:0.9987654</code>. Remember that the language detection function might detect more than one language and the first item in the returned list is the most likely candidate. Finally, you will assign the list to a new column. </p>
<p>The logic is the same as used in the slides and the exercise before but instead of applying the function to a list, you work with a dataset.</p></div>

Instructions
<ul>
<li>Iterate over the rows of the <code>non_english_reviews</code> dataset.   </li>
<li>Inside the loop, detect the language of the second column of the dataset.</li>
<li>Clean the string by splitting on a <code>:</code> inside the list comprehension expression.</li>
<li>Finally, assign the cleaned list to a new column.</li>
</ul>

In [None]:
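from langdetect import detect_langs
from tqdm.auto import tqdm  # progress bar for the loop below

languages = []
non_english_reviews = reviews_df.copy()  # work on a copy so reviews_df itself is not modified
# Loop over the rows of the dataset and append the detected languages
for row in tqdm(range(len(non_english_reviews))):
    languages.append(detect_langs(non_english_reviews.iloc[row, 1]))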

# Clean the list by splitting     
languages = [str(lang).split(':')[0][1:] for lang in languages]

# Assign the list to a new feature 
non_english_reviews['language'] = languages

non_english_reviews.head()

   score                                            review  language
0      1  Stuning even for the non-gamer: This sound tr...        en
1      1  The best soundtrack ever to anything.: I'm re...        en
2      1  Amazing!: This soundtrack is my favorite musi...        en
3      1  Excellent Soundtrack: I truly like this sound...        en
4      1  Remember, Pull Your Jaw Off The Floor After H...        en
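
Splitting the printed form on `:` works, but it depends on how the result list happens to be rendered as a string. The objects returned by `detect_langs` expose `lang` and `prob` attributes, so the language code can also be read directly. The sketch below uses a `namedtuple` stand-in (`Candidate`, hypothetical) in place of real detection results so that it runs without `langdetect` installed; `top_language` and the probabilities are likewise made up for illustration.

```python
from collections import namedtuple

# Stand-in for langdetect's result objects, which carry .lang and .prob
Candidate = namedtuple("Candidate", ["lang", "prob"])

def top_language(candidates):
    """Return the language code of the most probable candidate."""
    return max(candidates, key=lambda c: c.prob).lang

# Invented detection results: one unambiguous, one with two candidates
detected = [
    [Candidate("en", 0.99)],
    [Candidate("es", 0.57), Candidate("pt", 0.43)],
]

languages = [top_language(candidates) for candidates in detected]
print(languages)
```

With real results, `top_language(detect_langs(text))` would be the equivalent call; `langdetect.detect(text)` also returns the most likely language code directly.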


In [None]:
non_english_reviews[non_english_reviews.language != 'en'].head()

      score                                            review  language
1249      1  Il grande ritorno!: E' dai tempi del tour di ...        it
1259      1  La reencarnación vista por un científico: El ...        es
1260      1  Excelente Libro / Amazing book!!: Este libro ...        es
1261      1  Magnifico libro: Brian Weiss ha dejado una ma...        es
1639      1  El libro mas completo que existe para nosotra...        es


In [None]:
sum(non_english_reviews.language[non_english_reviews.language != 'en'].value_counts())

29

In [None]:
non_english_reviews.language[non_english_reviews.language != 'en'].value_counts()

es    16
fr     8
de     3
it     1
id     1
Name: language, dtype: int64