<a href="https://colab.research.google.com/github/jolonia/NLP/blob/main/wikiNLP_JO_jpynb.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Problem Description

The purpose of this project is to explore a prepared dataset of Wikipedia summaries (in .csv format) using Natural Language Processing (NLP). The summaries are weighted with a TF-IDF transform, and a K-Nearest Neighbors search then determines the ten nearest neighbors of a selected person in the summary .csv file.

In addition, a Wikipedia API must be used to access the full content of the Wikipedia pages for the selected person and their ten nearest neighbors.

Comparisons will be made of the nearest neighbors lists from the two sources, as well as of the sentiments of the Wiki summary and the Wikipedia full page for the targeted person.

An audience participation feature at the end will allow the audience to enter a person's name and display the summary of the corresponding Wikipedia page, along with its sentiment, using the Wikipedia API.



### Code Libraries

In [None]:
!python -m textblob.download_corpora

[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Unzipping corpora/brown.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package conll2000 to /root/nltk_data...
[nltk_data]   Unzipping corpora/conll2000.zip.
[nltk_data] Downloading package movie_reviews to /root/nltk_data...
[nltk_data]   Unzipping corpora/movie_reviews.zip.
Finished.


In [None]:
#install Wikipedia API
!pip3 install wikipedia-api

Collecting wikipedia-api
  Downloading Wikipedia_API-0.6.0-py3-none-any.whl (14 kB)
Installing collected packages: wikipedia-api
Successfully installed wikipedia-api-0.6.0


In [None]:
from textblob import TextBlob
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neighbors import NearestNeighbors
from sklearn.pipeline import Pipeline
import wikipediaapi

# Part 1

## Read in Data

A .csv dataset was provided for the first part of this project:

In [None]:
#dataset = pd.read_csv('/content/drive/MyDrive/Copy of Project_4.csv')

In [None]:
url = 'https://ddc-datascience.s3.amazonaws.com/Projects/Project.5-NLP/Data/NLP.csv'
dataset = pd.read_csv(url)

In [None]:
dataset.shape

(42786, 3)

In [None]:
dataset.head()

Unnamed: 0,URI,name,text
0,<http://dbpedia.org/resource/Digby_Morrell>,Digby Morrell,digby morrell born 10 october 1979 is a former...
1,<http://dbpedia.org/resource/Alfred_J._Lewy>,Alfred J. Lewy,alfred j lewy aka sandy lewy graduated from un...
2,<http://dbpedia.org/resource/Harpdog_Brown>,Harpdog Brown,harpdog brown is a singer and harmonica player...
3,<http://dbpedia.org/resource/Franz_Rottensteiner>,Franz Rottensteiner,franz rottensteiner born in waidmannsfeld lowe...
4,<http://dbpedia.org/resource/G-Enka>,G-Enka,henry krvits born 30 december 1974 in tallinn ...


Create two additional pandas Series, one for the text column and one for the name column. These will be used later to fit the nearest-neighbors model and to look up names.

In [None]:
train_text = dataset['text']
train_text.head()

0    digby morrell born 10 october 1979 is a former...
1    alfred j lewy aka sandy lewy graduated from un...
2    harpdog brown is a singer and harmonica player...
3    franz rottensteiner born in waidmannsfeld lowe...
4    henry krvits born 30 december 1974 in tallinn ...
Name: text, dtype: object

In [None]:
train_names = dataset['name']

## Data Cleaning
There are no missing values in the dataset, and the text appears clean enough for TextBlob processing. A few small adjustments will be made during modeling as needed.
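
A minimal sketch of how the "no missing values" claim could be checked. The `toy` frame below is a hypothetical stand-in with the same column names as the project dataset (one missing summary is planted on purpose); in the notebook the same calls would be made on `dataset`.

```python
import pandas as pd

# Hypothetical stand-in for the project dataset (same columns, made-up rows).
toy = pd.DataFrame({
    "URI": ["<uri-a>", "<uri-b>", "<uri-c>"],
    "name": ["Person A", "Person B", "Person C"],
    "text": ["person a is a singer", None, "person c is a writer"],
})

# Missing values per column; all zeros would mean nothing to drop or impute.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Repeated names matter because later lookups select a person by name.
duplicate_names = int(toy["name"].duplicated().sum())
print(duplicate_names)
```

On the real data, `dataset.isna().sum()` returning zeros for every column would support the statement above.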

## Exploratory Data Analysis

### Part 1: K-Nearest-Neighbor from Dataset

In [None]:
#Vectorize training text into X_train_counts
count_vec = CountVectorizer(stop_words='english')
X_train_counts = count_vec.fit_transform(train_text)

In [None]:
# Apply Tfidf transform to create X_train_tfidf (sparse matrix)
tfidf_xfrm =  TfidfTransformer()
X_train_tfidf = tfidf_xfrm.fit_transform(X_train_counts)

In [None]:
# Find nearest neighbors on the transformed training matrix
dfnearest = NearestNeighbors()
dfnearest.fit(X_train_tfidf)
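
The `Pipeline` class imported earlier is not used in the cells above. As a sketch only, the count and TF-IDF steps could be chained with it so that new text is always transformed the same way. The four-sentence corpus and the names `toy_text`, `text_to_tfidf`, and `nn_toy` are hypothetical, chosen for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.neighbors import NearestNeighbors
from sklearn.pipeline import Pipeline

# Hypothetical mini-corpus standing in for train_text.
toy_text = [
    "american rapper and record producer",
    "american rapper songwriter and actor",
    "italian opera tenor and songwriter",
    "australian rules football player",
]

# Chain word counts and TF-IDF weighting into a single transformer.
text_to_tfidf = Pipeline([
    ("counts", CountVectorizer(stop_words="english")),
    ("tfidf", TfidfTransformer()),
])
X_toy = text_to_tfidf.fit_transform(toy_text)

# Fit the neighbor search and query with the first document.
nn_toy = NearestNeighbors(n_neighbors=2).fit(X_toy)
dist_toy, row_toy = nn_toy.kneighbors(X_toy[0], n_neighbors=2)
print(row_toy[0])  # the query document itself comes first, then its closest match
```

With this arrangement, `text_to_tfidf.transform([...])` could later vectorize unseen text with the vocabulary learned during fitting.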

Find the people closest to the specified person.

In [None]:
person = 'Eminem'

In [None]:
ind = np.where(train_names == person)
ind[0][0]

31657

In [None]:
#Give it a location of a name, and store results (distance and index row of nearest neighbors)
dist, row = dfnearest.kneighbors(X_train_tfidf[ind[0],:], n_neighbors=11)

In [None]:
dist

array([[0.        , 1.15405969, 1.23649465, 1.24666436, 1.24846074,
        1.24849862, 1.24931436, 1.25640304, 1.25948347, 1.26088656,
        1.26178261]])
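
To help read these numbers: assuming the defaults used above (`TfidfTransformer` normalizes each row to unit L2 length, and `NearestNeighbors` measures Euclidean distance), the distance between two rows is tied to their cosine similarity by $\|a-b\|^2 = 2 - 2\cos(a,b)$. Distances therefore range from 0 (identical) to about 1.414 (no shared terms). The sketch below converts the reported distances to cosine similarities and checks the identity on two arbitrary unit vectors.

```python
import numpy as np

# Distances reported by kneighbors above (target first, then ten neighbors).
dist_reported = np.array([0.0, 1.15405969, 1.23649465, 1.24666436, 1.24846074,
                          1.24849862, 1.24931436, 1.25640304, 1.25948347,
                          1.26088656, 1.26178261])

# For unit-length vectors: ||a - b||^2 = 2 - 2 * cos(a, b)
cosine_sim = 1.0 - dist_reported ** 2 / 2.0
print(np.round(cosine_sim, 3))

# Sanity check of the identity on two arbitrary unit vectors.
rng = np.random.default_rng(0)
a, b = rng.random(5), rng.random(5)
a, b = a / np.linalg.norm(a), b / np.linalg.norm(b)
identity_holds = bool(np.isclose(np.linalg.norm(a - b) ** 2, 2 - 2 * (a @ b)))
print(identity_holds)
```

Read this way, the closest neighbor has a cosine similarity of roughly 0.33 to the target, and the tenth roughly 0.20.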

In [None]:
# Row indices of the selected person and their ten nearest neighbors
row

array([[31657, 24782, 15946, 17337, 26055, 34724, 35801, 24857, 35738,
         6946, 33007]])

In [None]:
# Get the names of the people in the above rows
neighbors = train_names.iloc[row[0]]
neighbors

31657                        Eminem
24782                       50 Cent
15946                       Dr. Dre
17337                         Jay Z
26055    Andrea Bocelli discography
34724                        Lecrae
35801                    Joss Stone
24857                       Rihanna
35738                  Tommy Coster
6946                  Philip Atwell
33007                   Celine Dion
Name: name, dtype: object

In [None]:
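# To determine the sentiment of the person's bio, use TextBlob.
# Note: str() on a Series returns its truncated display representation
# (index label, "...", dtype), not the full summary text; see the `bio` output below.
bio = TextBlob(str(train_text.iloc[ind[0]]))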

In [None]:
bio

TextBlob("31657    marshall bruce mathers iii born october 17 197...
Name: text, dtype: object")

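### Sentiment
Calculating the sentiment of the selected summary returns a neutral polarity (0.0) and a subjectivity of 0.0. Because `bio` was built from the truncated Series representation shown above rather than from the full summary text, this result reflects only the first few words of the bio.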

In [None]:
bio.sentiment

Sentiment(polarity=0.0, subjectivity=0.0)
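
The `bio` output above contains an index label, an ellipsis, and a dtype line, which suggests that `str()` was applied to a Series rather than to the summary string itself. The sketch below illustrates the difference with pandas alone; `toy_text` and `ind_toy` are hypothetical stand-ins for `train_text` and `ind[0]`.

```python
import pandas as pd

# Hypothetical one-row Series with a long text value (300 characters).
toy_text = pd.Series(["word " * 60], index=[31657], name="text")
ind_toy = [0]  # positional index list, mimicking ind[0] in the notebook

# Selecting with a list returns a Series; str() gives its truncated display form.
as_repr = str(toy_text.iloc[ind_toy])

# Selecting with a single integer returns the full string stored in that row.
as_string = toy_text.iloc[ind_toy[0]]

print(as_repr)
print(len(as_repr), len(as_string))
```

Under that reading, passing `train_text.iloc[ind[0][0]]` to `TextBlob` would score the complete summary; the resulting sentiment values are not shown here because they were not part of the original run.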


---

### Part Two: Wikipedia




The purpose of Part Two is to explore using the Wikipedia API directly instead of starting with a cleaned .csv file.

The first step is to obtain the text of the person's Wikipedia page.

Define a function that uses the Wikipedia API to retrieve the entire Wikipedia page for any given name.

In [None]:
def wiki_content(celebrity):
    # Create the API client with an identifying user agent string
    wikip = wikipediaapi.Wikipedia(user_agent='FOofooo')
    page_ex = wikip.page(celebrity)
    # Return the full page text, or None if the page does not exist
    if page_ex.exists():
        return page_ex.text
    return None

In [None]:
#Use the function to call up the page for the person to test the function
wiki_content(person)



Using the same list of nearest neighbors derived in Part 1 (`neighbors`), loop over the names with the wiki_content function to get the full text of the Wikipedia pages for the target and the ten nearest neighbors.
Put the results into the list *wiki_text*.

In [None]:
wiki_text = []
for name in neighbors:
  wiki_text.append(wiki_content(name))

In [None]:
wiki_text

 'Shawn Corey Carter (born December 4, 1969), known professionally as Jay-Z, is an American rapper and entrepreneur. Born and raised in New York City, he was named the greatest rapper of all time by Billboard and Vibe in 2023. He served as the president and chief executive officer of Def Jam Recordings from 2004 to 2007 before founding the entertainment company, Roc Nation the following year.\nAs a protégé of fellow New York City-based rapper Jaz-O, Jay-Z began his musical career in the late 1980s; he co-founded the record label Roc-A-Fella Records in 1994 to release his first two studio albums Reasonable Doubt (1996) and In My Lifetime, Vol. 1 (1997), both of which were met with critical acclaim. His following albums, including The Blueprint (2001), The Black Album (2003), American Gangster (2007), and 4:44 (2017) each debuted atop the Billboard 200; Jay-Z holds the record for the most number-one albums (14) of any solo artist on the chart. He has also released the collaborative album

Pulling the page directly from Wikipedia leaves some formatting (newline characters, apostrophes) that could alter the results of natural language processing. These need to be cleaned up before any further processing. Put the results into *wiki_text_clean*.

In [None]:
wiki_text_clean = []
for text in wiki_text:
  if text is not None:
    # Flatten newlines and drop possessive 's and stray apostrophes
    wiki_text_clean.append(text.replace("\n", " ").replace("'s", "").replace("'", ""))
  else:
    # Page not found: append an empty string so positions still line up with `neighbors`
    wiki_text_clean.append("")
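
As a sketch, the same cleanup could be wrapped in a small reusable helper. `clean_wiki_text` is a hypothetical name, and unlike the loop above it also collapses runs of whitespace into a single space, so its output can differ slightly from `wiki_text_clean`.

```python
import re

def clean_wiki_text(text):
    """Drop possessive 's and apostrophes, flatten whitespace; None becomes ''."""
    if text is None:
        return ""
    text = text.replace("'s", "").replace("'", "")
    # Newlines, tabs, and repeated spaces all become one space.
    return re.sub(r"\s+", " ", text).strip()

sample = "The singer's first album\nsold well.\n\nIt wasn't the last."
cleaned_sample = clean_wiki_text(sample)
print(cleaned_sample)
```

It would be applied as `[clean_wiki_text(t) for t in wiki_text]`. Typographic apostrophes (’) are not handled in this sketch.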

In [None]:
wiki_text_clean

 'Shawn Corey Carter (born December 4, 1969), known professionally as Jay-Z, is an American rapper and entrepreneur. Born and raised in New York City, he was named the greatest rapper of all time by Billboard and Vibe in 2023. He served as the president and chief executive officer of Def Jam Recordings from 2004 to 2007 before founding the entertainment company, Roc Nation the following year. As a protégé of fellow New York City-based rapper Jaz-O, Jay-Z began his musical career in the late 1980s; he co-founded the record label Roc-A-Fella Records in 1994 to release his first two studio albums Reasonable Doubt (1996) and In My Lifetime, Vol. 1 (1997), both of which were met with critical acclaim. His following albums, including The Blueprint (2001), The Black Album (2003), American Gangster (2007), and 4:44 (2017) each debuted atop the Billboard 200; Jay-Z holds the record for the most number-one albums (14) of any solo artist on the chart. He has also released the collaborative albums

### Sentiment of Full Wikipedia page

Calculate the sentiment of the full Wikipedia page for the selected person.

In [None]:
bio_wiki = TextBlob(wiki_text_clean[0])

In [None]:
bio_wiki.sentiment

Sentiment(polarity=0.047621649525338074, subjectivity=0.4157806012293721)

Use NLP and KNN to analyze the full (cleaned) text of the Wiki bios

In [None]:
#Vectorize training text from full Wiki pages
count_vec2 = CountVectorizer(stop_words='english')
X_train_counts2 = count_vec2.fit_transform(wiki_text_clean)

In [None]:
#Apply Tfidf transform
tfidf_xfrm =  TfidfTransformer()
X_train_tfidf2 = tfidf_xfrm.fit_transform(X_train_counts2)

In [None]:
#Find the nearest neighbors
nearest = NearestNeighbors()
nearest.fit(X_train_tfidf2)

In [None]:
#List the nearest neighbors of the target in row 0
dist2, row2 = nearest.kneighbors(X_train_tfidf2[0], n_neighbors=11)

In [None]:
# Get the names of the people in the above rows
neighbors_wiki = neighbors.iloc[row2[0]]
neighbors_wiki

31657                        Eminem
35738                  Tommy Coster
15946                       Dr. Dre
24782                       50 Cent
17337                         Jay Z
35801                    Joss Stone
24857                       Rihanna
33007                   Celine Dion
34724                        Lecrae
26055    Andrea Bocelli discography
6946                  Philip Atwell
Name: name, dtype: object

Compare the two lists side by side.

In [None]:
my_dict = {'Neighbors_CSV': list(neighbors), 'Neighbors_Wiki': list(neighbors_wiki)}
results_df = pd.DataFrame(my_dict)
results_df

Unnamed: 0,Neighbors_CSV,Neighbors_Wiki
0,Eminem,Eminem
1,50 Cent,Tommy Coster
2,Dr. Dre,Dr. Dre
3,Jay Z,50 Cent
4,Andrea Bocelli discography,Jay Z
5,Lecrae,Joss Stone
6,Joss Stone,Rihanna
7,Rihanna,Celine Dion
8,Tommy Coster,Lecrae
9,Philip Atwell,Andrea Bocelli discography
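
Because the second model is fit only on the eleven retrieved pages, both lists contain the same names and only the ordering can differ. One way to quantify that, sketched below, is the change in rank per name. The two lists are copied from the `neighbors` and `neighbors_wiki` outputs above; the variable names are hypothetical.

```python
import pandas as pd

# Orderings copied from the neighbors and neighbors_wiki outputs above.
csv_order = ["Eminem", "50 Cent", "Dr. Dre", "Jay Z", "Andrea Bocelli discography",
             "Lecrae", "Joss Stone", "Rihanna", "Tommy Coster", "Philip Atwell",
             "Celine Dion"]
wiki_order = ["Eminem", "Tommy Coster", "Dr. Dre", "50 Cent", "Jay Z", "Joss Stone",
              "Rihanna", "Celine Dion", "Lecrae", "Andrea Bocelli discography",
              "Philip Atwell"]

# Rank of each name in each list (0 = the target person).
csv_rank = pd.Series(range(len(csv_order)), index=csv_order)
wiki_rank = pd.Series(range(len(wiki_order)), index=wiki_order)

# Subtraction aligns on the name index; negative means closer in the full-page model.
rank_shift = (wiki_rank - csv_rank).rename("wiki_minus_csv")
same_names = set(csv_order) == set(wiki_order)

print(same_names)
print(rank_shift.sort_values())
```

In this run the largest moves are Tommy Coster (seven places closer on full pages) and Andrea Bocelli discography (five places farther).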


In [None]:
!pip install ipywidgets wikipedia-api


Collecting jedi>=0.16 (from ipython>=4.0.0->ipywidgets)
  Downloading jedi-0.19.1-py2.py3-none-any.whl (1.6 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.6/1.6 MB 7.1 MB/s eta 0:00:00
Installing collected packages: jedi
Successfully installed jedi-0.19.1


In [77]:
from ipywidgets import interact, widgets
from IPython.display import display, HTML
import wikipediaapi
from textblob import TextBlob  # Import TextBlob for sentiment analysis

# Function to fetch Wikipedia summary
def fetch_wikipedia_summary(name):
    wiki_wiki = wikipediaapi.Wikipedia(user_agent='Producer')
    page = wiki_wiki.page(name)
    if page.exists():
        summary = page.summary[:10000]  # Keep at most the first 10,000 characters of the summary
        return summary
    else:
        return "No summary available for this person on Wikipedia."

# Function to perform sentiment analysis
def analyze_sentiment(text):
    blob = TextBlob(text)
    sentiment_score = blob.sentiment.polarity
    return sentiment_score

# Function to display the summary and sentiment analysis
def show_summary_with_sentiment(name):
    summary = fetch_wikipedia_summary(name)
    sentiment_score = analyze_sentiment(summary)

    # Determine sentiment label
    if sentiment_score > 0:
        sentiment_label = "Positive"
    elif sentiment_score < 0:
        sentiment_label = "Negative"
    else:
        sentiment_label = "Neutral"

    # Display summary and sentiment
    display(HTML(f"<p><strong>{name}:</strong> {summary}</p>"))
    display(HTML(f"<p><strong>Sentiment:</strong> {sentiment_label}</p>"))

# Create the interactive widget
interact(show_summary_with_sentiment, name=widgets.Text(value='', description='Name:', placeholder='Type a name...'))


interactive(children=(Text(value='', description='Name:', placeholder='Type a name...'), Output()), _dom_class…