# Unit 4 Assessment: Natural Language Processing
----

### Overview

The Unit 4 assessment covers Natural Language Processing.  All of the questions for this assessment are contained in the `Unit_4_Assessment_unsolved.ipynb` Jupyter Notebook file. **The assessment is worth 50 points.**

### Files

Use the following link to download the assessment instructions and Jupyter Notebook file.

[Download the Unit 4 Assessment resources](https://2u-data-curriculum-team.s3.amazonaws.com/nflx-data-science-adv/week-14/Assessment/Unit_4_Assessment.zip)

### Instructions

Keep the following in mind while working on the assessment:

* Remember that this is an individual assessment: you may not work with your classmates. However, you are free to consult your course notes and activities to help you answer the questions.

* Although this assessment is delivered in a Jupyter Notebook, we recommend that you make a copy of the `Unit_4_Assessment_unsolved.ipynb` file and upload it to Google Colab.

    > **Note:** If your answers are not clearly identified, you may receive a score of “0” for that question. 

* When you are ready to submit your assessment, rename the Google Colab notebook file with your last name. For example, `Unit_4_Assessment_<your_last_name>.ipynb`. Please do not clear your outputs if you have written code.

## Question 1

- **5 points**

Regular expressions (regex) are an important tool for working with text data. Which of these would find, and then remove, any text that is not a letter or space?

a. `regex = re.compile("[^0-9@#%!]")
  re_clean = regex.sub('  ', sentence)`

b. `re_clean = regex.sub('  ', sentence)`

c. `regex = re.compile("[^a-zA-Z ]")
  re_clean = regex.sub('  ', sentence)`

d. `regex = re.compile("[a-zA-Z ]")
  re_clean = regex.sub('  ', sentence)`

### Please provide answer below

# C
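
As a quick check on the character class in option c, the following minimal sketch applies the pattern to a made-up sentence (the sample string is an assumption, not part of the assessment data). It substitutes an empty string so that the unwanted characters are removed outright, whereas the options above substitute two spaces.

```python
import re

# Hypothetical sample sentence; not taken from the assessment data.
sentence = "Hello, World! It's 2024."

# [^a-zA-Z ] matches any single character that is NOT a letter or a space.
regex = re.compile("[^a-zA-Z ]")

# Replace every match with an empty string to remove it.
re_clean = regex.sub("", sentence)
print(re_clean)  # prints "Hello World Its " (note the trailing space)
```

Dropping the leading `^`, as in option d, inverts the match: the letters and spaces would be replaced and only the punctuation and digits would remain.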

## Question 2

- **5 points**

Tokenizing means to separate text into individual components, usually into individual words. Assuming the `sentence` object below is a string of text containing a sentence, which string method could you use to tokenize it?

a. `sentence.str.replace('.','')`

b. ``sentence.split('.')``

c. ``tokenize(sentence)``

d.  `sentence.split(' ')`

### Please provide answer below

# D
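
To illustrate what splitting on spaces produces, here is a minimal sketch that uses one line borrowed from the poetry dataset that appears later in this assessment. Note that this approach leaves punctuation attached to the neighboring word.

```python
# One line borrowed from the poetry dataset used later in this assessment.
sentence = "it flows so long as falls the rain,"

# Split on single spaces to get one token per word.
tokens = sentence.split(' ')
print(tokens)

# The trailing comma stays attached to the last token.
print(tokens[-1])  # prints "rain,"
```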

## Question 3

- **5 points**

Assuming you were going to tokenize the above sentence using `nltk` instead of a string method, which function would you use?

a. `sentence_tokenize(sentence)`

b. `tokenize(sentence)`

c. `word_tokenize(sentence)` 

d. `split(sentence)`


### Please provide answer below

# C

## Question 4

- **5 points**

When might you use a **word cloud**?


a. When you want to visualize TF-IDF scores.

b. When you want a table of meaning for a sentence.

c. When you want to visualize sentiment scores for each word in a document.

d. When you want to visually display the frequency count of words in an easy-to-understand way.

### Please provide answer below

# D
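
A word cloud is drawn from word frequency counts, so the underlying data can be sketched with the standard library alone. The text below is a hypothetical sample, and this example only computes the counts; it does not render an image.

```python
from collections import Counter

# Hypothetical sample text; not taken from the assessment data.
text = "the rain falls and the river flows and the rain returns"

# Count how often each word appears.
word_counts = Counter(text.split())

# Show the three most frequent words with their counts.
for word, count in word_counts.most_common(3):
    print(word, count)
```

In a word cloud, words with higher counts would be drawn in larger type.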

## Question 5

- **5 points**

TF-IDF is a method for scoring text. When you're looking at thousands of documents, which of the following would suggest a more interesting document that should be subjected to further analysis?


a. A high TF-IDF score

b. A low TF-IDF score

c. No change in TF-IDF score


### Please provide answer below

# A
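
To make the scoring concrete, the sketch below computes a basic, unsmoothed TF-IDF value by hand: term frequency within one document multiplied by the log of the inverse document frequency. The toy corpus and the `tf_idf` helper are assumptions for illustration. Libraries such as scikit-learn apply smoothed variants by default, so their numbers will differ.

```python
import math

# Hypothetical toy corpus; not taken from the assessment data.
documents = [
    "the rain falls on the plain",
    "the river flows to the sea",
    "the heroes fought for liberty",
]
tokenized = [doc.split() for doc in documents]


def tf_idf(term, doc_tokens, corpus):
    """Unsmoothed TF-IDF; assumes the term occurs somewhere in the corpus."""
    # Term frequency: share of this document's tokens that are the term.
    tf = doc_tokens.count(term) / len(doc_tokens)
    # Inverse document frequency: terms found in fewer documents score higher.
    docs_with_term = sum(1 for tokens in corpus if term in tokens)
    idf = math.log(len(corpus) / docs_with_term)
    return tf * idf


common = tf_idf("the", tokenized[0], tokenized)   # appears in every document
rare = tf_idf("rain", tokenized[0], tokenized)    # appears in one document
print(common, rare)
```

Here "the" scores 0 because it appears in every document, while "rain" scores higher because it is specific to the first one.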

## Question 6

- **5 points**

VADER is a tool which can calculate the sentiment of a batch of text. Which of the following will the VADER algorithm return? 

a. VADER will provide a score for each word in each document

b. A `final` score, summarizing all relevant information for the document

c. Two values: `pos` and `neg` scores

d. `pos`, `neu`, `neg`, and `compound` scores for each chunk of text analyzed


### Please provide answer below

# D

## Question 7

- **5 points**

Which of the following is the correct syntax for estimating sentiment through VADER? 

a. `analyzer = SentimentIntensityAnalyzer()
   analyzer.polarity_scores()`

b. `analyzer.polarity_scores()`

c. `SentimentIntensityAnalyzer()`

d. `vaderize()`


### Please provide answer below

# A

## Questions 8-10 (Coding Application Questions)

Read in the dataset of lines of old poems ([curated](https://arxiv.org/abs/2011.02686) from Project Gutenberg), and download the `vader_lexicon` from `nltk`. 

After initializing the `SentimentIntensityAnalyzer`, estimate polarity scores for the lines of poetry. Save these scores to a DataFrame, in order to answer the questions which follow.

In [2]:
# Create a temporary view of the poetry data.  
import pandas as pd
poems = pd.read_csv('../Resources/poems.csv')
poems.iloc[1:5]

|   | poetry_line |
|---|-------------|
| 1 | it flows so long as falls the rain, |
| 2 | and that is why, the lonesome day, |
| 3 | when i peruse the conquered fame of heroes, an... |
| 4 | of inward strife for truth and liberty. |


In [3]:
# Import the necessary libraries
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
# Download/Update the VADER Lexicon
nltk.download("vader_lexicon")

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\frank\AppData\Roaming\nltk_data...


True

In [23]:
# Initialize the VADER sentiment analyzer
analyzer = SentimentIntensityAnalyzer()

In [24]:
# Iterate through each line and calculate `polarity_scores`.
# Save the scores to a list, appending each time a new row is scored.
all_lines = []

for value in poems["poetry_line"]:
    text = value
    sentiment = analyzer.polarity_scores(value)
    pos = sentiment['pos']
    neg = sentiment['neg']  # read the negative score from its own key
    neu = sentiment['neu']
    compound = sentiment['compound']

    all_lines.append({
        "poetry_line": text,
        "compound": compound,
        "pos": pos,
        "neg": neg,
        "neu": neu
    })

all_lines[0:2]

[{'poetry_line': 'with pale blue berries. in these peaceful shades--',
  'compound': 0.4939,
  'pos': 0.314,
  'neg': 0.0,
  'neu': 0.686},
 {'poetry_line': 'it flows so long as falls the rain,',
  'compound': 0.0,
  'pos': 0.0,
  'neg': 0.0,
  'neu': 1.0}]

In [26]:
# After the loop, convert the list to a DataFrame.
# Reorder the columns so the original line of poetry comes last.
sentiment_df = pd.DataFrame(all_lines)
cols = ["neg", "neu", "pos", "compound", "poetry_line"]
sentiment_df = sentiment_df[cols]
sentiment_df.head()

|   | neg | neu | pos | compound | poetry_line |
|---|-----|-----|-----|----------|-------------|
| 0 | 0.0 | 0.686 | 0.314 | 0.4939 | with pale blue berries. in these peaceful shad... |
| 1 | 0.0 | 1.0 | 0.0 | 0.0 | it flows so long as falls the rain, |
| 2 | 0.294 | 0.706 | 0.0 | -0.3612 | and that is why, the lonesome day, |
| 3 | 0.0 | 0.652 | 0.348 | 0.7914 | when i peruse the conquered fame of heroes, an... |
| 4 | 0.0 | 0.467 | 0.533 | 0.6908 | of inward strife for truth and liberty. |


Analyze the above DataFrame to answer the following questions.
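
Before relying on the scores, it can help to sanity-check them. VADER reports `pos`, `neu`, and `neg` as proportions that should sum to roughly 1 for any non-empty text, so a row that breaks this rule points to a mix-up when the scores were copied. The sketch below is a minimal check under that assumption; the helper name, the tolerance, and both sample dictionaries are hypothetical.

```python
def proportions_sum_to_one(scores, tolerance=0.002):
    """Return True when pos + neu + neg is within tolerance of 1."""
    total = scores["pos"] + scores["neu"] + scores["neg"]
    return abs(total - 1.0) <= tolerance


# Hypothetical score dictionaries shaped like VADER output.
consistent = {"pos": 0.314, "neu": 0.686, "neg": 0.0, "compound": 0.4939}
suspicious = {"pos": 0.314, "neu": 0.686, "neg": 0.686, "compound": 0.4939}

print(proportions_sum_to_one(consistent))  # prints True
print(proportions_sum_to_one(suspicious))  # prints False
```

The small tolerance allows for the rounding VADER applies to each proportion.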

## Question 8

- **5 points**

Across all the lines of poetry, what was the average positive ("pos") score?

a. 1.50271

b. 0.07353

c. 0.48463

d. 0.10775 

### Please provide answer below

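In [28]:
# Calculate average positive sentiment score value.
# Summarize the neu, pos, and compound columns; the mean of `pos` answers this question.
sentiment_df[["neu", "pos", "compound"]].describe()

|       | neu | pos | compound |
|-------|-----|-----|----------|
| count | 749.0 | 749.0 | 749.0 |
| mean  | 0.819569 | 0.107752 | 0.044218 |
| std   | 0.217677 | 0.185353 | 0.336327 |
| min   | 0.0 | 0.0 | -0.886 |
| 25%   | 0.642 | 0.0 | 0.0 |
| 50%   | 1.0 | 0.0 | 0.0 |
| 75%   | 1.0 | 0.231 | 0.1877 |
| max   | 1.0 | 1.0 | 0.9588 |

# D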


## Question 9

- **5 points**

Across all the lines of poetry, what was the average negative ("neg") score?

a. 0.07268

b. 9.21031

c. 0.58732

d. 0.13075

### Please provide answer below

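In [4]:
# Calculate average negative sentiment score value.
# The pos, neu, and neg scores for each line sum to approximately 1, so the
# mean neg score is approximately 1 - mean neu - mean pos.
round(1 - 0.819569 - 0.107752, 5)

0.07268

# A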

## Question 10
- **5 points**

Based on your answers above, was the poetry more upbeat on average, or would you characterize its tone as predominantly gloomy?

a. upbeat on average

b. predominantly gloomy

c. decidedly neutral



### Please provide answer below

# C
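
One way to reason about overall tone is the thresholding convention commonly cited in the VADER documentation: a compound score of 0.05 or higher is read as positive, -0.05 or lower as negative, and anything in between as neutral. The sketch below applies that convention to the mean compound score reported by `describe()` above. The helper name is hypothetical, and applying a per-text threshold to an average is a rough heuristic rather than an established method.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a VADER compound score to a coarse sentiment label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# Mean compound score reported by describe() above.
mean_compound = 0.044218
print(label_from_compound(mean_compound))  # prints "neutral"
```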