# <center> Natural Language Processing (NLP) </center>

### <center> Applications</center>
<center>
- Translation <br>
- Sentiment analysis <br>
- Question answering <br>
- Chatbots 
</center>

### <center> Challenges of NLP </center>

Meaning is hard for a computer to recover. Here the wording is positive until the final clause reverses it:

    I was led to believe that the Fyre Festival would be an amazing, transcendent event - I was conned.

Ambiguity that only context (semantics) can resolve. Here "it" refers to the pipe:

    The pipe couldn't fit through the hole in the wall since it was too big.

versus the following, where "it" refers to the hole:

    The pipe couldn't fit through the hole in the wall since it was too small.



## <center> NLTK </center>

In [2]:
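import nltk
# The tokenizers below need the Punkt models, e.g. nltk.download('punkt') ('punkt_tab' in recent NLTK releases)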

### <center> Tokenization </center>

In [14]:
sentence = "This is a sentence with 7 words."
nltk.word_tokenize(sentence)

['This', 'is', 'a', 'sentence', 'with', '7', 'words', '.']
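
For intuition, word-level tokenization can be roughly approximated with a single regular expression. This is only a sketch: `simple_word_tokenize` is a hypothetical helper, not NLTK's algorithm, and it will differ from `nltk.word_tokenize` on contractions such as "isn't" and on abbreviations.

```python
import re

def simple_word_tokenize(text):
    """Hypothetical helper: split text into word runs and single punctuation marks."""
    # \w+ matches a run of letters, digits or underscores;
    # [^\w\s] matches one character that is neither a word character nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text)

print(simple_word_tokenize("This is a sentence with 7 words."))
```

On this simple sentence the sketch yields the same eight tokens as the NLTK call.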

In [15]:
paragraph = "This is a sentence. This is another sentence. This is a third sentence"
nltk.sent_tokenize(paragraph)

['This is a sentence.',
 'This is another sentence.',
 'This is a third sentence']

### <center> Remove punctuation </center>

In [19]:
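import string  # provides string.punctuation

# Build a translation table that deletes every ASCII punctuation character
sentence_stripped = sentence.translate(str.maketrans('', '', string.punctuation))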

In [20]:
nltk.word_tokenize(sentence_stripped)

['This', 'is', 'a', 'sentence', 'with', '7', 'words']

### <center> Stop words</center>
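<center> Stop words are very common words (such as "the" or "is") that carry little meaning on their own and are often filtered out as noise. </center>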

In [7]:
nltk.download('stopwords')
stop_words=set(nltk.corpus.stopwords.words("english"))
print(stop_words)

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Matthew\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.
{'mightn', 'shan', "weren't", 'nor', 'my', 'how', 'yours', "won't", 'not', 'now', 'our', 'while', 'few', 'couldn', 'over', 'd', 'by', "hadn't", 'needn', 'if', 'does', 'on', 'ma', 'under', 'herself', 'off', 'her', 'it', 'both', 'only', 'them', 'because', 's', 'are', 'mustn', 'a', 'having', 'or', 'wouldn', 'to', 're', 'before', 'an', "she's", 'down', 'of', "you've", 'further', "wouldn't", 'this', 've', "you'll", 'from', 'and', 'your', 'is', 'he', 't', 'doing', "mightn't", 'shouldn', 'should', 'itself', 'be', 'ours', 'o', 'as', 'his', "haven't", 'any', 'but', "it's", 'hasn', 'at', 'y', "you're", 'more', 'in', 'up', 'am', 'has', 'into', 'myself', 'had', 'all', 'hadn', "that'll", 'some', 'with', 'each', 'such', 'about', 'until', "shan't", 'these', 'against', 'don', "isn't", 'will', 'haven', 'between', 'me', 'did', 'what', 'theirs', 'same',

In [22]:
sentence = "Hello Mr. Smith. How are you doing today?"
sentence_stripped = sentence.translate(str.maketrans('', '', string.punctuation))
tokens = nltk.word_tokenize(sentence_stripped)
filtered_tokens = [x for x in tokens if x not in stop_words]
filtered_tokens

['Hello', 'Mr', 'Smith', 'How', 'today']
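
Note that 'How' survives in the output above: the NLTK stop words are all lowercase and the membership test is case-sensitive. Below is a minimal sketch of a case-insensitive filter. The tiny `toy_stop_words` set and the `remove_stop_words` helper are assumptions made for illustration so that the example runs without any download; they are not part of NLTK.

```python
# Tiny illustrative stop-word set; the real NLTK list is much longer.
toy_stop_words = {"how", "are", "you", "doing"}

def remove_stop_words(tokens, stop_words):
    """Drop tokens whose lowercase form is a stop word, keeping original casing."""
    return [tok for tok in tokens if tok.lower() not in stop_words]

tokens = ["Hello", "Mr", "Smith", "How", "are", "you", "doing", "today"]
print(remove_stop_words(tokens, toy_stop_words))
```

Lowercasing only for the comparison keeps the original capitalization of the tokens that remain.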

### <center> Stemming </center>
<center> Reducing words to a root form by stripping suffixes; the result need not be a real word. </center>

In [24]:
ps = nltk.stem.PorterStemmer()

In [25]:
ps.stem("running")

'run'

### <center> Lemmatization </center>
<center> Reducing words to their dictionary form (lemma), using a vocabulary and the word's part of speech. </center>

In [26]:
# Requires the WordNet corpus: nltk.download('wordnet')
lem = nltk.stem.wordnet.WordNetLemmatizer()

In [30]:
lem.lemmatize("flying", "v"), ps.stem("flying")

('fly', 'fli')

In [39]:
lem.lemmatize("better", "a")

'good'
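
The contrast between the two approaches can be sketched with toy code: a stemmer applies suffix rules, while a lemmatizer looks words up. Both helpers below are hypothetical simplifications. `toy_stem` is not the Porter algorithm, and `toy_lemmas` is a two-entry stand-in for WordNet.

```python
def toy_stem(word):
    """Toy suffix stripper (NOT Porter): drop 'ing', then undouble a final consonant."""
    if word.endswith("ing") and len(word) > 5:
        stem = word[:-3]
        # 'runn' -> 'run': collapse the doubled consonant left behind by the suffix
        if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in "aeiou":
            stem = stem[:-1]
        return stem
    return word

# Toy lemma dictionary keyed by (word, part of speech).
toy_lemmas = {("better", "a"): "good", ("flying", "v"): "fly"}

def toy_lemmatize(word, pos):
    """Look the word up; fall back to the word itself when it is unknown."""
    return toy_lemmas.get((word, pos), word)

print(toy_stem("running"), toy_stem("better"))
print(toy_lemmatize("better", "a"), toy_lemmatize("flying", "v"))
```

No suffix rule can turn "better" into "good"; that mapping is only available through a lookup, which is why it appears in the lemmatizer output and not in a stemmer's.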

### <center> Part of Speech Tagging </center>

In [40]:
sent = "Albert Einstein was born in Ulm, Germany in 1879."
tokens = nltk.word_tokenize(sent)
nltk.pos_tag(tokens)

[('Albert', 'NNP'),
 ('Einstein', 'NNP'),
 ('was', 'VBD'),
 ('born', 'VBN'),
 ('in', 'IN'),
 ('Ulm', 'NNP'),
 (',', ','),
 ('Germany', 'NNP'),
 ('in', 'IN'),
 ('1879', 'CD'),
 ('.', '.')]
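
The tags follow the Penn Treebank scheme (NNP = proper noun, VBD = past-tense verb, VBN = past participle, IN = preposition, CD = cardinal number). Once tokens are tagged, filtering by tag is a simple list comprehension. The sketch below hard-codes the tagged output shown above so that it runs without NLTK.

```python
# Tagged tokens copied from the pos_tag output above.
tagged = [
    ("Albert", "NNP"), ("Einstein", "NNP"), ("was", "VBD"), ("born", "VBN"),
    ("in", "IN"), ("Ulm", "NNP"), (",", ","), ("Germany", "NNP"),
    ("in", "IN"), ("1879", "CD"), (".", "."),
]

# Keep proper nouns only.
proper_nouns = [word for word, tag in tagged if tag == "NNP"]
# All verb tags start with 'VB' (VBD, VBN, ...).
verbs = [word for word, tag in tagged if tag.startswith("VB")]

print(proper_nouns)
print(verbs)
```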

### <center> TF-IDF </center>
<center> Term Frequency-Inverse Document Frequency: a term scores high when it is frequent within a document but rare across the corpus. </center>

<center><img src="tfidf.png" height=600 width=600></center>
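
A minimal pure-Python sketch of one common textbook formulation, tf(t, d) × log(N / df(t)), may help make the idea concrete. The notation in the figure may differ, and the three toy documents are made up for illustration. scikit-learn's `TfidfVectorizer` uses a smoothed variant by default, idf = ln((1 + N) / (1 + df)) + 1 followed by L2 normalization of each row, so its numbers will not match this sketch.

```python
import math
from collections import Counter

# Toy corpus: each document is already tokenized.
docs = [
    "the pearl was great".split(),
    "the pearl was lonely".split(),
    "the sea was calm".split(),
]

def tf_idf(term, doc, corpus):
    """Textbook TF-IDF: relative term frequency times log(N / document frequency)."""
    tf = Counter(doc)[term] / len(doc)
    df = sum(1 for d in corpus if term in d)
    if df == 0:
        return 0.0  # term never seen in the corpus
    return tf * math.log(len(corpus) / df)

print(tf_idf("the", docs[0], docs))    # in every document, so the weight is zero
print(tf_idf("pearl", docs[0], docs))  # in two of three documents
print(tf_idf("great", docs[0], docs))  # in one document only, so the highest weight
```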

In [43]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [58]:
stemmer = nltk.stem.PorterStemmer()  # create once and reuse across calls

def tokenize(text):
    """Split text into word tokens and reduce each one to its Porter stem."""
    return [stemmer.stem(token) for token in nltk.word_tokenize(text)]

In [59]:
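# Note: the built-in English stop-word list is not stemmed, so scikit-learn warns
# that stop_words may be inconsistent with this custom tokenizer.
tf_idf_vec = TfidfVectorizer(tokenizer=tokenize, stop_words='english')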

In [60]:
import os
import string

# Map each file name to its lowercased, punctuation-free text
token_dict = {}
for dirpath, _, files in os.walk('Steinbeck'):
    for f in files:
        fname = os.path.join(dirpath, f)
        print(fname)
        with open(fname) as pearl:
            text = pearl.read()
            token_dict[f] = text.lower().translate(str.maketrans('', '', string.punctuation))

Steinbeck\Pearl1.txt
Steinbeck\Pearl2.txt
Steinbeck\Pearl3.txt
Steinbeck\Pearl4.txt
Steinbeck\Pearl5.txt
Steinbeck\Pearl6.txt


In [61]:
vectors = tf_idf_vec.fit_transform(token_dict.values())

<6x2307 sparse matrix of type '<class 'numpy.float64'>'
	with 4617 stored elements in Compressed Sparse Row format>

In [65]:
phrase = 'all great and precious things are lonely.'
response = tf_idf_vec.transform([phrase])
print(response)

  (0, 2024)	0.3759573580043784
  (0, 1143)	0.8469428138456193
  (0, 851)	0.3759573580043784


In [71]:
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
tf_idf_vec.get_feature_names_out()[1143]

'lone'
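
The three weights printed for the phrase form a unit-length vector, because `TfidfVectorizer` L2-normalizes each row by default (`norm='l2'`). A quick check using the values copied from the output above:

```python
import math

# Non-zero weights copied from the printed sparse row above.
weights = [0.3759573580043784, 0.8469428138456193, 0.3759573580043784]

# Euclidean (L2) norm of the row.
norm = math.sqrt(sum(w * w for w in weights))
print(round(norm, 6))
```

The weights are therefore relative within the phrase: 'lone' dominates because it is the rarest of the three matched stems in the corpus or the most frequent in the phrase, not because 0.85 means anything on an absolute scale.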

### <center> Artist Lyric Classification Example </center>

In [72]:
import pandas as pd
import numpy as np
df = pd.read_csv('lyrics.csv')
df.head()

| | index | song | year | artist | genre | lyrics |
|---|---|---|---|---|---|---|
| 0 | 0 | ego-remix | 2009 | beyonce-knowles | Pop | Oh baby, how you doing?\nYou know I'm gonna cu... |
| 1 | 1 | then-tell-me | 2009 | beyonce-knowles | Pop | playin' everything so easy,\nit's like you see... |
| 2 | 2 | honesty | 2009 | beyonce-knowles | Pop | If you search\nFor tenderness\nIt isn't hard t... |
| 3 | 3 | you-are-my-rock | 2009 | beyonce-knowles | Pop | Oh oh oh I, oh oh oh I\n[Verse 1:]\nIf I wrote... |
| 4 | 4 | black-culture | 2009 | beyonce-knowles | Pop | Party the people, the people the party it's po... |


In [None]:
df.dropna(inplace=True)

In [129]:
len(df[df['artist']=='beyonce-knowles'])

249

In [170]:
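# Draw 30 songs per artist for a balanced two-class set; sample() is unseeded here,
# so the selection (and the scores below) will change between runs
beyonce = df[df['artist']=='beyonce-knowles'].sample(30).index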

In [171]:
barbara = df[df['artist']=='barbra-streisand'].sample(30).index

In [172]:
data = df[df.index.isin(beyonce) | df.index.isin(barbara)]

In [174]:
data = data.drop(['index','song','year','genre'], axis=1)
data.head()

| | artist | lyrics |
|---|---|---|
| 2 | beyonce-knowles | If you search\nFor tenderness\nIt isn't hard t... |
| 8 | beyonce-knowles | [Verse 1:]\nI read all of the magazines\nwhile... |
| 16 | beyonce-knowles | You're bad for me I clearly get it\nI don't se... |
| 24 | beyonce-knowles | Ay! Ay! Ay!\nOh! Beyonc, Beyonc\nOh! Beyonc, B... |
| 30 | beyonce-knowles | You don't care about me at all\nYou treat me l... |


In [175]:
# Encode the label as an integer: 0 = beyonce-knowles, 1 = barbra-streisand
data['artist'] = [0 if x=='beyonce-knowles' else 1 for x in data['artist']]
data.head()

| | artist | lyrics |
|---|---|---|
| 2 | 0 | If you search\nFor tenderness\nIt isn't hard t... |
| 8 | 0 | [Verse 1:]\nI read all of the magazines\nwhile... |
| 16 | 0 | You're bad for me I clearly get it\nI don't se... |
| 24 | 0 | Ay! Ay! Ay!\nOh! Beyonc, Beyonc\nOh! Beyonc, B... |
| 30 | 0 | You don't care about me at all\nYou treat me l... |


In [187]:
## Count Vectorizer 
from sklearn.feature_extraction.text import CountVectorizer
matrix = CountVectorizer(max_features=10000)
X = matrix.fit_transform(data['lyrics']).toarray()
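
What a count vectorizer produces can be sketched in a few lines of plain Python: one column per vocabulary word, one row per document, each cell holding a word count. The two toy documents and the `count_vector` helper are assumptions for illustration. The real `CountVectorizer` additionally lowercases the text and, with its default token pattern, ignores single-character tokens.

```python
from collections import Counter

docs = ["oh baby oh", "baby love"]

# Vocabulary: every distinct word, sorted so the column order is deterministic.
vocab = sorted({word for doc in docs for word in doc.split()})

def count_vector(doc, vocab):
    """Return the count of each vocabulary word in doc, in vocabulary order."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]

X_toy = [count_vector(doc, vocab) for doc in docs]
print(vocab)
print(X_toy)
```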

In [195]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

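X_train, X_test, y_train, y_test = train_test_split(X, data['artist'], test_size=0.2, random_state=123)
clf = RandomForestClassifier(random_state=123).fit(X_train, y_train)
# (train accuracy, test accuracy); the test set holds only 12 of the 60 songs,
# so a single misclassification moves test accuracy by about 0.083
clf.score(X_train, y_train), clf.score(X_test, y_test)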

(1.0, 0.9166666666666666)

In [196]:
## TF-IDF Vectorizer
tf_idf_vec = TfidfVectorizer(tokenizer=tokenize, stop_words='english')
vectors = tf_idf_vec.fit_transform(data['lyrics'])

In [197]:
X_train, X_test, y_train, y_test = train_test_split(vectors, data['artist'], test_size=0.2, random_state=123)
clf = RandomForestClassifier(random_state=123).fit(X_train, y_train)
clf.score(X_train, y_train), clf.score(X_test, y_test)

(1.0, 0.8333333333333334)

In [236]:
## Bigrams
for bigram in nltk.bigrams(nltk.word_tokenize(data['lyrics'].values[0])):
    print(bigram)

('If', 'you')
('you', 'search')
('search', 'For')
('For', 'tenderness')
('tenderness', 'It')
('It', 'is')
('is', "n't")
("n't", 'hard')
('hard', 'to')
('to', 'find')
('find', 'You')
('You', 'can')
('can', 'have')
('have', 'the')
('the', 'love')
('love', 'You')
('You', 'need')
('need', 'to')
('to', 'live')
('live', 'But')
('But', 'if')
('if', 'you')
('you', 'look')
('look', 'For')
('For', 'truthfulness')
('truthfulness', 'You')
('You', 'might')
('might', 'just')
('just', 'As')
('As', 'well')
('well', 'be')
('be', 'blind')
('blind', 'It')
('It', 'always')
('always', 'seems')
('seems', 'to')
('to', 'be')
('be', 'So')
('So', 'hard')
('hard', 'to')
('to', 'give')
('give', 'Chorus')
('Chorus', ':')
(':', 'Honesty')
('Honesty', 'Is')
('Is', 'such')
('such', 'a')
('a', 'lonely')
('lonely', 'word')
('word', 'Everyone')
('Everyone', 'is')
('is', 'so')
('so', 'untrue')
('untrue', 'Honesty')
('Honesty', 'Is')
('Is', 'hardly')
('hardly', 'ever')
('ever', 'heard')
('heard', 'And')
('And', 'mostly')
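
An n-gram generator is short enough to write by hand, which shows what `nltk.bigrams` is doing: it pairs each token with its successor. The `ngrams` helper below is a hypothetical stand-in written for illustration, not NLTK's implementation.

```python
def ngrams(tokens, n):
    """Return the consecutive n-token tuples of a token list, in order."""
    # zip over n shifted copies of the list; zip stops at the shortest copy.
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = ["If", "you", "search", "For", "tenderness"]
print(ngrams(tokens, 2))
print(ngrams(tokens, 3))
```

A list of k tokens yields k - n + 1 n-grams, so bigram features are sparser than single-word features on a data set this small.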


In [222]:
## Bigram Count Vectorizer
matrix = CountVectorizer(max_features=10000, ngram_range=(2,2))
X = matrix.fit_transform(data['lyrics']).toarray()
X_train, X_test, y_train, y_test = train_test_split(X, data['artist'], test_size=0.2, random_state=123)
clf = RandomForestClassifier(random_state=123).fit(X_train, y_train)
clf.score(X_train, y_train), clf.score(X_test, y_test)

(1.0, 0.75)

## <center> Activity </center>

<center> Use NLP and classification to perform a sentiment analysis task: predict a Twitter user's sentiment toward nuclear power from the text of their tweet. </center>

In [240]:
df = pd.read_csv('sentiment_nuclear_power.csv',encoding = 'unicode_escape')