<a href="https://colab.research.google.com/github/jolonia/NLP/blob/main/Project_4_Eminem_NLP_jpynb.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Problem Description

The purpose of Project 4 is to explore a prepared dataset of Wikipedia summaries (.csv format) with Natural Language Processing (NLP). A TF-IDF transform followed by a K-Nearest-Neighbor search is used to find the ten nearest neighbors of a selected person in the summary file.

In addition, a Wikipedia API must be used to access the full content of the Wikipedia pages for the selected person and their ten nearest neighbors.

Comparisons will be made of the nearest neighbors lists from the two sources, as well as of the sentiments of the Wiki summary and the Wikipedia full page for the targeted person.

An audience participation feature at the end will allow the audience to select a person and print the corresponding full text of the Wikipedia page using the Wikipedia API.

### Code Libraries

In [None]:
!python -m textblob.download_corpora

[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Package brown is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package conll2000 to /root/nltk_data...
[nltk_data]   Package conll2000 is already up-to-date!
[nltk_data] Downloading package movie_reviews to /root/nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!
Finished.


In [None]:
#install Wikipedia API
!pip3 install wikipedia-api



In [None]:
from textblob import TextBlob
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neighbors import NearestNeighbors
from sklearn.pipeline import Pipeline
import wikipediaapi

# Part 1

## Read in Data

A .csv dataset was provided for the first part of this project:

In [None]:
from google.colab import drive
drive.mount('/drive')

Mounted at /drive


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
dataset = pd.read_csv('/content/drive/MyDrive/Copy of Project_4.csv')

In [None]:
dataset.shape

(42786, 3)

In [None]:
dataset.head()

Unnamed: 0,URI,name,text
0,<http://dbpedia.org/resource/Digby_Morrell>,Digby Morrell,digby morrell born 10 october 1979 is a former...
1,<http://dbpedia.org/resource/Alfred_J._Lewy>,Alfred J. Lewy,alfred j lewy aka sandy lewy graduated from un...
2,<http://dbpedia.org/resource/Harpdog_Brown>,Harpdog Brown,harpdog brown is a singer and harmonica player...
3,<http://dbpedia.org/resource/Franz_Rottensteiner>,Franz Rottensteiner,franz rottensteiner born in waidmannsfeld lowe...
4,<http://dbpedia.org/resource/G-Enka>,G-Enka,henry krvits born 30 december 1974 in tallinn ...


Create two pandas Series from the dataset, one for the text column and one for the name column. These are used later to fit the nearest-neighbor model and to look up names.

In [None]:
train_text = dataset['text']
train_text.head()

0    digby morrell born 10 october 1979 is a former...
1    alfred j lewy aka sandy lewy graduated from un...
2    harpdog brown is a singer and harmonica player...
3    franz rottensteiner born in waidmannsfeld lowe...
4    henry krvits born 30 december 1974 in tallinn ...
Name: text, dtype: object

In [None]:
train_names = dataset['name']

## Data Cleaning
There are no missing values in the dataset, and the summaries are already lowercased and stripped of punctuation, so they need little preparation for TextBlob or the vectorizer. A few small adjustments are made during modeling as needed.
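
The no-missing-values claim is easy to check directly. The sketch below uses a small stand-in DataFrame (made-up rows with the same three column names as the project dataset) rather than the real file, so the names and values are assumptions for illustration only:

```python
import pandas as pd

# Stand-in for the project dataset: same column names, made-up rows.
sample = pd.DataFrame({
    "URI": ["<uri-1>", "<uri-2>", "<uri-3>"],
    "name": ["Person A", "Person B", "Person C"],
    "text": ["person a is a singer", None, "person c is a writer"],
})

# Count missing values per column; all zeros would mean nothing to drop.
missing_per_column = sample.isna().sum()
print(missing_per_column)

# A row without a bio cannot be vectorized, so it would be dropped first.
usable = sample.dropna(subset=["text"])
print(len(usable))
```

On the real data, `dataset.isna().sum()` should report zero for every column if the statement above holds.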

## Exploratory Data Analysis

### Part 1: K-Nearest-Neighbor from Dataset

In [None]:
#Vectorize training text into X_train_counts
count_vec = CountVectorizer(stop_words='english')
X_train_counts = count_vec.fit_transform(train_text)

In [None]:
# Apply Tfidf transform to create X_train_tfidf (sparse matrix)
tfidf_xfrm =  TfidfTransformer()
X_train_tfidf = tfidf_xfrm.fit_transform(X_train_counts)
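
The count and TF-IDF steps can also be chained so they cannot get out of sync. This is a sketch on a made-up three-document corpus, not the project data; it uses the `Pipeline` class imported earlier and checks that the two-step chain gives the same matrix as scikit-learn's one-step `TfidfVectorizer`:

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)
from sklearn.pipeline import Pipeline

# Made-up corpus for illustration only.
docs = [
    "eminem is an american rapper and record producer",
    "dr dre is an american rapper and record producer",
    "celine dion is a canadian singer",
]

# Two-step version: raw counts, then TF-IDF weighting.
two_step = Pipeline([
    ("counts", CountVectorizer(stop_words="english")),
    ("tfidf", TfidfTransformer()),
])
X_two_step = two_step.fit_transform(docs)

# One-step version with the same settings.
X_one_step = TfidfVectorizer(stop_words="english").fit_transform(docs)

same = np.allclose(X_two_step.toarray(), X_one_step.toarray())
print(X_two_step.shape, same)
```

Either form produces a sparse matrix with one row per document, which is what `NearestNeighbors` is fitted on next.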

In [None]:
# Fit the nearest-neighbor model on the TF-IDF matrix (default metric: Euclidean distance)
nearest = NearestNeighbors()
nearest.fit(X_train_tfidf)

NearestNeighbors(algorithm='auto', leaf_size=30, metric='minkowski',
                 metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                 radius=1.0)

Find the nearest neighbors of the specified person.

In [None]:
person = 'Eminem'

In [None]:
# Look up the row index of the selected person in the names Series
ind = np.where(train_names == person)
ind[0]


array([31657])

In [None]:
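# Query the fitted model and store the results (distances and row indices of the nearest neighbors).
# Note: the slice [31657:] queries every row from 31657 to the end of the matrix, not only the target,
# so only the first result row (index 0) belongs to the target. n_neighbors=11 is the target plus ten neighbors.
dist, row = nearest.kneighbors(X_train_tfidf[31657:], n_neighbors=11)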

In [None]:
dist

array([[0.        , 1.15405969, 1.23649465, ..., 1.25948347, 1.26088656,
        1.26178261],
       [0.        , 1.31479593, 1.32345847, ..., 1.33678074, 1.33771814,
        1.33911713],
       [0.        , 1.10261898, 1.12691023, ..., 1.20636908, 1.20831809,
        1.21024284],
       ...,
       [0.        , 1.19607473, 1.22069517, ..., 1.27242588, 1.281583  ,
        1.28496619],
       [0.        , 1.25803849, 1.34079872, ..., 1.38446366, 1.38549918,
        1.38589221],
       [0.        , 1.22992624, 1.25191282, ..., 1.28783668, 1.28832753,
        1.29327643]])
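
All of the distances above fall between 0 and about 1.41. That is expected: `TfidfTransformer` L2-normalizes each row by default, and for unit-length vectors the Euclidean distance d and the cosine similarity c are tied by d² = 2 − 2c, so d can never exceed √2. The sketch below checks this on a made-up corpus (the documents are assumptions, not project data):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# Made-up corpus for illustration only.
docs = [
    "american rapper and record producer from detroit",
    "american rapper and record producer from compton",
    "canadian singer known for ballads",
    "dutch painter known for landscapes",
]
X = TfidfVectorizer(stop_words="english").fit_transform(docs)

# Default NearestNeighbors uses Euclidean distance (minkowski, p=2).
nn = NearestNeighbors().fit(X)
dist, idx = nn.kneighbors(X[0], n_neighbors=len(docs))

# Cosine similarity of row 0 with each returned neighbor (rows are unit length).
cos = (X[0] @ X[idx[0]].T).toarray().ravel()
from_cosine = np.sqrt(np.maximum(0.0, 2.0 - 2.0 * cos))

matches = np.allclose(dist[0], from_cosine, atol=1e-6)
print(matches, dist.max() <= np.sqrt(2) + 1e-9)
```

A distance near √2 therefore means two summaries share almost no weighted vocabulary, and the ranking by Euclidean distance is the same as a ranking by cosine similarity.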

In [None]:
# Row indices of the nearest neighbors; row[0] holds the neighbors of the target (index 31657)
row

array([[31657, 24782, 15946, ..., 35738,  6946, 33007],
       [31658, 20956, 21537, ..., 40380, 35460, 27669],
       [31659,  3069,  8768, ...,  1177,  7068, 38434],
       ...,
       [42783, 38992, 31581, ...,  9560, 14192,  6538],
       [42784,  6898, 21406, ..., 31235, 39179, 23495],
       [42785, 35332, 42541, ..., 36050, 36821, 19596]])

In [None]:
# Get the names of the people in the above rows
neighbors = train_names.iloc[row[0]]
neighbors

31657                        Eminem
24782                       50 Cent
15946                       Dr. Dre
17337                         Jay Z
26055    Andrea Bocelli discography
34724                        Lecrae
35801                    Joss Stone
24857                       Rihanna
35738                  Tommy Coster
6946                  Philip Atwell
33007                   Celine Dion
Name: name, dtype: object
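
The lookup, query, and name-mapping steps can be wrapped in one helper that queries only the target's row. This is a minimal sketch: `nearest_names` is a hypothetical helper, and the names and texts are made up so that it runs without the project file.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

def nearest_names(names, tfidf_matrix, model, target, k=10):
    """Return the target followed by its k nearest neighbors, closest first."""
    matches = np.flatnonzero(names.to_numpy() == target)
    if matches.size == 0:
        raise KeyError(f"{target!r} not found in names")
    target_row = int(matches[0])
    # Indexing a sparse matrix with one integer keeps the 2-D (1, n_features)
    # shape that kneighbors expects, so only this one row is queried.
    _, rows = model.kneighbors(tfidf_matrix[target_row], n_neighbors=k + 1)
    return names.iloc[rows[0]]

# Made-up data for illustration only.
names = pd.Series(["Rapper A", "Rapper B", "Singer C", "Painter D"])
texts = pd.Series([
    "american rapper and record producer from detroit",
    "american rapper and record producer from compton",
    "canadian singer known for ballads",
    "dutch painter known for landscapes",
])
X = TfidfVectorizer(stop_words="english").fit_transform(texts)
model = NearestNeighbors().fit(X)

result = nearest_names(names, X, model, "Rapper A", k=2)
print(list(result))
```

With the notebook's objects, the equivalent call would be `nearest_names(train_names, X_train_tfidf, nearest, person)`.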

In [None]:
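# To determine the sentiment of the person's bio, use TextBlob.
# Note: str() on a Series returns its printed form (index, truncated text, dtype line),
# so TextBlob receives only the shortened preview shown in the next cell, not the full bio.
bio = TextBlob(str(train_text.iloc[ind[0]]))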

In [None]:
bio

TextBlob("31657    marshall bruce mathers iii born october 17 197...
Name: text, dtype: object")

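### Sentiment
The sentiment of the selected summary comes back as exactly neutral and fully objective (polarity 0.0, subjectivity 0.0). These zeros reflect the input rather than the biography: the `TextBlob` was built from the printed form of a pandas Series, which contains only the first few words of the summary, so the score should not be read as the sentiment of the full text.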

In [None]:
bio.sentiment

Sentiment(polarity=0.0, subjectivity=0.0)
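
The difference between the printed form of a Series and the string it holds can be seen with pandas alone. This sketch uses a made-up, repeated sentence as the bio (an assumption for illustration) and the same index label as the target row:

```python
import pandas as pd

# Made-up long bio: one sentence repeated ten times.
full_bio = " ".join(["marshall bruce mathers iii is an american rapper"] * 10)
bios = pd.Series([full_bio], index=[31657], name="text")

# Selecting with a list of positions returns a Series; str() of a Series is
# its printed form: index label, truncated value, and a Name/dtype line.
as_printed = str(bios.iloc[[0]])

# Selecting with a single integer position returns the actual string.
as_string = bios.iloc[0]

print(len(as_printed), len(as_string))
print("..." in as_printed, as_string == full_bio)
```

Passing the equivalent of `as_string` (for example `train_text.iloc[ind[0][0]]`) to `TextBlob` would score the whole summary instead of the truncated preview.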


---

### Part Two: Wikipedia




The purpose of Part Two is to explore using the Wikipedia API directly instead of starting with a cleaned .csv file.

The first step is to obtain the text of the person's Wikipedia page.

Define a function that uses the Wikipedia API to retrieve the full page text for any given name.

In [None]:
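def wiki_content(celebrity):
    """Return the full text of the English Wikipedia page for `celebrity`, or None if no such page exists."""
    # Note: newer wikipedia-api releases may also require a user_agent argument here.
    wikip = wikipediaapi.Wikipedia('en')
    page_ex = wikip.page(celebrity)
    if page_ex.exists():
        return page_ex.text
    return None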

In [None]:
#Use the function to call up the page for the person to test the function
wiki_content(person)



Using the list of nearest neighbors derived in Part 1 (*neighbors*), call the wiki_content function in a loop to get the full text of the Wikipedia pages for the target and the ten nearest neighbors.
Put the results into the list *wiki_text*.

In [None]:
wiki_text = []
for name in neighbors:
  wiki_text.append(wiki_content(name))
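
Since `wiki_content` returns `None` for a page that does not exist, a `None` entry in the list would break the string cleaning that follows. One way to guard against that is to collect found and missing names separately. In this sketch `collect_pages` and `fake_fetch` are hypothetical, and the page texts are placeholders so the example runs offline:

```python
def collect_pages(names, fetch):
    """Fetch each name; return (found, missing) so None never reaches cleaning."""
    found, missing = {}, []
    for name in names:
        text = fetch(name)
        if text is None:
            missing.append(name)
        else:
            found[name] = text
    return found, missing

# Placeholder pages standing in for real API responses.
fake_pages = {
    "Eminem": "placeholder text for the Eminem page",
    "Dr. Dre": "placeholder text for the Dr. Dre page",
}

def fake_fetch(name):
    """Offline stand-in for wiki_content: returns None for unknown names."""
    return fake_pages.get(name)

# The third name is treated as missing purely for the sake of the example.
found, missing = collect_pages(
    ["Eminem", "Dr. Dre", "Andrea Bocelli discography"], fake_fetch
)
print(list(found), missing)
```

Keeping the results keyed by name preserves the link between each text and its person even if some pages are skipped, which a plain list aligned by position would lose.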

In [None]:
wiki_text

 'Shawn Corey Carter (born December 4, 1969), known professionally as Jay-Z (stylized as JAY-Z), is an American rapper, songwriter, record executive, businessman, and media proprietor. He is widely regarded as one of the most influential hip-hop artists in history and is also well known for being the former CEO of Def Jam Recordings, cultivating major industry artists such as Rihanna and Rick Ross.Born and raised in New York City, Jay-Z first began his musical career after founding the record label Roc-A-Fella Records in 1995, and subsequently released his debut studio album Reasonable Doubt in 1996. The album was released to widespread critical success, and solidified his standing in the music industry. He went on to release twelve additional albums, including the acclaimed albums The Blueprint (2001), The Black Album (2003), American Gangster (2007), and 4:44 (2017). He also released the full-length collaborative albums Watch the Throne (2011) with Kanye West and Everything Is Love (

Pulling the page directly from Wikipedia leaves some formatting in the text (newline characters, possessives, apostrophes) that could alter the results of natural language processing. These need to be cleaned up before any further processing. Put the results into *wiki_text_clean*.

In [None]:
# Replace newline characters with spaces before any nearest-neighbor processing,
# strip the possessive 's, then drop any remaining apostrophes.
wiki_text_clean = [
    text.replace("\n", " ").replace("'s", "").replace("'", "")
    for text in wiki_text
]
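
To see what the replace chain does to a piece of text, here is the same sequence applied to two made-up sample strings (`clean_text` is a hypothetical wrapper, not part of the notebook):

```python
def clean_text(text):
    """Newlines become spaces, 's is dropped, remaining apostrophes are dropped."""
    return text.replace("\n", " ").replace("'s", "").replace("'", "")

sample = "Eminem's debut album\nwasn't his last."
cleaned = clean_text(sample)
print(cleaned)

# The 's rule also shortens contractions such as "it's".
contraction = clean_text("it's")
print(contraction)
```

The chain only matches the straight apostrophe character, so any typographic (curly) apostrophes in a page would pass through unchanged.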

In [None]:
wiki_text_clean

 'Shawn Corey Carter (born December 4, 1969), known professionally as Jay-Z (stylized as JAY-Z), is an American rapper, songwriter, record executive, businessman, and media proprietor. He is widely regarded as one of the most influential hip-hop artists in history and is also well known for being the former CEO of Def Jam Recordings, cultivating major industry artists such as Rihanna and Rick Ross.Born and raised in New York City, Jay-Z first began his musical career after founding the record label Roc-A-Fella Records in 1995, and subsequently released his debut studio album Reasonable Doubt in 1996. The album was released to widespread critical success, and solidified his standing in the music industry. He went on to release twelve additional albums, including the acclaimed albums The Blueprint (2001), The Black Album (2003), American Gangster (2007), and 4:44 (2017). He also released the full-length collaborative albums Watch the Throne (2011) with Kanye West and Everything Is Love (

### Sentiment of Full Wikipedia page

Calculate the sentiment of the full Wikipedia page for the person (the first entry in *wiki_text_clean*).

In [None]:
bio_wiki = TextBlob(wiki_text_clean[0])

In [None]:
bio_wiki.sentiment

Sentiment(polarity=0.04224700050096881, subjectivity=0.41019922023890326)

Apply the same TF-IDF and nearest-neighbor steps to the full (cleaned) text of the Wikipedia pages.

In [None]:
#Vectorize training text from full Wiki pages
count_vec2 = CountVectorizer(stop_words='english')
X_train_counts2 = count_vec2.fit_transform(wiki_text_clean)

In [None]:
#Apply Tfidf transform
tfidf_xfrm =  TfidfTransformer()
X_train_tfidf2 = tfidf_xfrm.fit_transform(X_train_counts2)

In [None]:
#Find the nearest neighbors
nearest = NearestNeighbors()
nearest.fit(X_train_tfidf2)

NearestNeighbors(algorithm='auto', leaf_size=30, metric='minkowski',
                 metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                 radius=1.0)

In [None]:
#List the nearest neighbors of the target in row 0
dist2, row2 = nearest.kneighbors(X_train_tfidf2[0], n_neighbors=11)

In [None]:
# Get the names of the people in the above rows
neighbors_wiki = neighbors.iloc[row2[0]]
neighbors_wiki

31657                        Eminem
6946                  Philip Atwell
15946                       Dr. Dre
24782                       50 Cent
17337                         Jay Z
35738                  Tommy Coster
35801                    Joss Stone
24857                       Rihanna
33007                   Celine Dion
26055    Andrea Bocelli discography
34724                        Lecrae
Name: name, dtype: object

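Compare the two lists side by side. Both contain the same eleven pages, because Part 2 re-ranks the neighbors found in Part 1; only the order can differ.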

In [None]:
my_dict = {'Neighbors_CSV': list(neighbors), 'Neighbors_Wiki': list(neighbors_wiki)}
results_df = pd.DataFrame(my_dict)
results_df

Unnamed: 0,Neighbors_CSV,Neighbors_Wiki
0,Eminem,Eminem
1,50 Cent,Philip Atwell
2,Dr. Dre,Dr. Dre
3,Jay Z,50 Cent
4,Andrea Bocelli discography,Jay Z
5,Lecrae,Tommy Coster
6,Joss Stone,Joss Stone
7,Rihanna,Rihanna
8,Tommy Coster,Celine Dion
9,Philip Atwell,Andrea Bocelli discography
10,Celine Dion,Lecrae
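
The comparison can also be quantified. This sketch hard-codes the two orderings printed earlier in the notebook and reports which names keep their position and how far the others move (a positive shift means the name ranks lower in the full-page ordering):

```python
# Orderings copied from the notebook's two neighbor listings.
csv_order = [
    "Eminem", "50 Cent", "Dr. Dre", "Jay Z", "Andrea Bocelli discography",
    "Lecrae", "Joss Stone", "Rihanna", "Tommy Coster", "Philip Atwell",
    "Celine Dion",
]
wiki_order = [
    "Eminem", "Philip Atwell", "Dr. Dre", "50 Cent", "Jay Z",
    "Tommy Coster", "Joss Stone", "Rihanna", "Celine Dion",
    "Andrea Bocelli discography", "Lecrae",
]

# Both lists should contain exactly the same names.
same_members = set(csv_order) == set(wiki_order)

# Names that sit at the same rank in both orderings.
same_position = [a for a, b in zip(csv_order, wiki_order) if a == b]

# Rank change per name: position in wiki_order minus position in csv_order.
rank_shift = {
    name: wiki_order.index(name) - csv_order.index(name) for name in csv_order
}

print(same_members, same_position)
for name, shift in sorted(rank_shift.items(), key=lambda kv: -abs(kv[1])):
    if shift != 0:
        print(f"{name}: {shift:+d}")
```

The largest move is Philip Atwell, who rises eight places when the full page text is used instead of the summary.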


In [None]:
import platform
# Call the function; without the parentheses this prints the function object, not the version
print(platform.python_version())
