# Natural Language Processing



This project will give you practical experience using Natural Language Processing techniques. This project is in three parts:
- in part 1) you will use a traditional dataset in a CSV file
- in part 2) you will use the Wikipedia API to directly access content
on Wikipedia.
- in part 3) you will make your notebook interactive


### Part 1)



- The CSV file is available at https://ddc-datascience.s3.amazonaws.com/Projects/Project.5-NLP/Data/NLP.csv
- The file contains a list of famous people and a brief overview.
- The goal of part 1) is provide the capability to
  - Take one person from the list as input and output the 10 other people who's overview are "closest" to the person in a Natural Language Processing sense
  - Also output the sentiment of the overview of the person



Make an interactive notebook.

In addition to presenting the project slides, at the end of the presentation each student will demonstrate their code using a famous person suggested by the other students that exists in the DBpedia set.


In [1]:
%%capture
# Download corpora
!python -m textblob.download_corpora

In [2]:
import numpy as np
import plotly.express as px
import seaborn as sns
import matplotlib.pyplot as plt
from textblob import TextBlob
import pandas as pd
from io import StringIO
from sklearn.metrics.pairwise import cosine_similarity
import nltk
from nltk.corpus import stopwords
from sklearn.neighbors import NearestNeighbors
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger_eng')
nltk.download('stopwords')

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/joeyrobak/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     /Users/joeyrobak/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package punkt to /Users/joeyrobak/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /Users/joeyrobak/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/joeyrobak/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [3]:
path = '/Users/joeyrobak/Downloads/NLP.xlsx'
df=pd.read_excel(path)
df.head()

Unnamed: 0,URI,name,text
0,<http://dbpedia.org/resource/Digby_Morrell>,Digby Morrell,digby morrell born 10 october 1979 is a former...
1,<http://dbpedia.org/resource/Alfred_J._Lewy>,Alfred J. Lewy,alfred j lewy aka sandy lewy graduated from un...
2,<http://dbpedia.org/resource/Harpdog_Brown>,Harpdog Brown,harpdog brown is a singer and harmonica player...
3,<http://dbpedia.org/resource/Franz_Rottensteiner>,Franz Rottensteiner,franz rottensteiner born in waidmannsfeld lowe...
4,<http://dbpedia.org/resource/G-Enka>,G-Enka,henry krvits born 30 december 1974 in tallinn ...


In [4]:
text = df['text'] #[0:50]

In [5]:
text = text.str.lower()

In [6]:
text

0        digby morrell born 10 october 1979 is a former...
1        alfred j lewy aka sandy lewy graduated from un...
2        harpdog brown is a singer and harmonica player...
3        franz rottensteiner born in waidmannsfeld lowe...
4        henry krvits born 30 december 1974 in tallinn ...
                               ...                        
42781    motoaki takenouchi born july 8 1967 saitama pr...
42782    alan graham judge born 14 may 1960 is a retire...
42783    eduardo lara lozano born 4 september 1959 in c...
42784    tatiana faberg is an author and faberg scholar...
42785    kenneth thomas born february 24 1938 was chief...
Name: text, Length: 42786, dtype: object

In [7]:
sentiment = text.apply(lambda x: TextBlob(x).sentiment)

In [8]:
sentiment.sort_values()


25962                                  (-0.8, 1.0)
14143    (-0.4321428571428572, 0.7821428571428571)
1290                 (-0.42500000000000004, 0.475)
30126    (-0.3821428571428572, 0.6535714285714286)
34978               (-0.35625, 0.6131944444444446)
                           ...                    
1126      (0.5166666666666667, 0.2972222222222222)
4977      (0.5230909090909092, 0.4163030303030303)
29271     (0.5249999999999999, 0.4933333333333333)
331                      (0.5333333333333333, 0.5)
30155     (0.5593394886363636, 0.4389441287878789)
Name: text, Length: 42786, dtype: object

In [9]:
df['sentiment']=sentiment

In [10]:
df

Unnamed: 0,URI,name,text,sentiment
0,<http://dbpedia.org/resource/Digby_Morrell>,Digby Morrell,digby morrell born 10 october 1979 is a former...,"(-0.041666666666666664, 0.17896825396825394)"
1,<http://dbpedia.org/resource/Alfred_J._Lewy>,Alfred J. Lewy,alfred j lewy aka sandy lewy graduated from un...,"(0.2186607142857143, 0.5276785714285714)"
2,<http://dbpedia.org/resource/Harpdog_Brown>,Harpdog Brown,harpdog brown is a singer and harmonica player...,"(0.24754901960784317, 0.3892156862745098)"
3,<http://dbpedia.org/resource/Franz_Rottensteiner>,Franz Rottensteiner,franz rottensteiner born in waidmannsfeld lowe...,"(0.04795574795574796, 0.3609547859547859)"
4,<http://dbpedia.org/resource/G-Enka>,G-Enka,henry krvits born 30 december 1974 in tallinn ...,"(0.34924242424242424, 0.5173821548821549)"
...,...,...,...,...
42781,<http://dbpedia.org/resource/Motoaki_Takenouchi>,Motoaki Takenouchi,motoaki takenouchi born july 8 1967 saitama pr...,"(0.057575757575757565, 0.3984848484848485)"
42782,<http://dbpedia.org/resource/Alan_Judge_(footb...,"Alan Judge (footballer, born 1960)",alan graham judge born 14 may 1960 is a retire...,"(0.017857142857142856, 0.2976190476190476)"
42783,<http://dbpedia.org/resource/Eduardo_Lara>,Eduardo Lara,eduardo lara lozano born 4 september 1959 in c...,"(0.11477272727272729, 0.38181818181818183)"
42784,<http://dbpedia.org/resource/Tatiana_Faberg%C3...,Tatiana Faberg%C3%A9,tatiana faberg is an author and faberg scholar...,"(0.1805785123966942, 0.5140495867768595)"


In [11]:
tokenized_data = []

# Loop through each row in the data series
for _ in text:
    # Convert the text to a TextBlob
    blob = TextBlob(_)
    # Tokenize the text by words
    tokens = blob.words
    # Append the tokens to the new list
    tokenized_data.append(tokens)

# Convert the list of tokens back into a pandas Series if needed
tokenized_series = pd.Series(tokenized_data)
print(tokenized_series)

0        [digby, morrell, born, 10, october, 1979, is, ...
1        [alfred, j, lewy, aka, sandy, lewy, graduated,...
2        [harpdog, brown, is, a, singer, and, harmonica...
3        [franz, rottensteiner, born, in, waidmannsfeld...
4        [henry, krvits, born, 30, december, 1974, in, ...
                               ...                        
42781    [motoaki, takenouchi, born, july, 8, 1967, sai...
42782    [alan, graham, judge, born, 14, may, 1960, is,...
42783    [eduardo, lara, lozano, born, 4, september, 19...
42784    [tatiana, faberg, is, an, author, and, faberg,...
42785    [kenneth, thomas, born, february, 24, 1938, wa...
Length: 42786, dtype: object


In [12]:
stop_words = set(stopwords.words('english'))

filtered_data = tokenized_series.apply(lambda tokens: [word for word in tokens if word not in stop_words])

print(filtered_data)

0        [digby, morrell, born, 10, october, 1979, form...
1        [alfred, j, lewy, aka, sandy, lewy, graduated,...
2        [harpdog, brown, singer, harmonica, player, ac...
3        [franz, rottensteiner, born, waidmannsfeld, lo...
4        [henry, krvits, born, 30, december, 1974, tall...
                               ...                        
42781    [motoaki, takenouchi, born, july, 8, 1967, sai...
42782    [alan, graham, judge, born, 14, may, 1960, ret...
42783    [eduardo, lara, lozano, born, 4, september, 19...
42784    [tatiana, faberg, author, faberg, scholar, swi...
42785    [kenneth, thomas, born, february, 24, 1938, ch...
Length: 42786, dtype: object


In [13]:
lemmatizer = WordNetLemmatizer()
lemitized_data = filtered_data.apply(lambda tokens: [lemmatizer.lemmatize(word) for word in tokens])

In [14]:
df['clean_text']=lemitized_data

In [15]:
df

Unnamed: 0,URI,name,text,sentiment,clean_text
0,<http://dbpedia.org/resource/Digby_Morrell>,Digby Morrell,digby morrell born 10 october 1979 is a former...,"(-0.041666666666666664, 0.17896825396825394)","[digby, morrell, born, 10, october, 1979, form..."
1,<http://dbpedia.org/resource/Alfred_J._Lewy>,Alfred J. Lewy,alfred j lewy aka sandy lewy graduated from un...,"(0.2186607142857143, 0.5276785714285714)","[alfred, j, lewy, aka, sandy, lewy, graduated,..."
2,<http://dbpedia.org/resource/Harpdog_Brown>,Harpdog Brown,harpdog brown is a singer and harmonica player...,"(0.24754901960784317, 0.3892156862745098)","[harpdog, brown, singer, harmonica, player, ac..."
3,<http://dbpedia.org/resource/Franz_Rottensteiner>,Franz Rottensteiner,franz rottensteiner born in waidmannsfeld lowe...,"(0.04795574795574796, 0.3609547859547859)","[franz, rottensteiner, born, waidmannsfeld, lo..."
4,<http://dbpedia.org/resource/G-Enka>,G-Enka,henry krvits born 30 december 1974 in tallinn ...,"(0.34924242424242424, 0.5173821548821549)","[henry, krvits, born, 30, december, 1974, tall..."
...,...,...,...,...,...
42781,<http://dbpedia.org/resource/Motoaki_Takenouchi>,Motoaki Takenouchi,motoaki takenouchi born july 8 1967 saitama pr...,"(0.057575757575757565, 0.3984848484848485)","[motoaki, takenouchi, born, july, 8, 1967, sai..."
42782,<http://dbpedia.org/resource/Alan_Judge_(footb...,"Alan Judge (footballer, born 1960)",alan graham judge born 14 may 1960 is a retire...,"(0.017857142857142856, 0.2976190476190476)","[alan, graham, judge, born, 14, may, 1960, ret..."
42783,<http://dbpedia.org/resource/Eduardo_Lara>,Eduardo Lara,eduardo lara lozano born 4 september 1959 in c...,"(0.11477272727272729, 0.38181818181818183)","[eduardo, lara, lozano, born, 4, september, 19..."
42784,<http://dbpedia.org/resource/Tatiana_Faberg%C3...,Tatiana Faberg%C3%A9,tatiana faberg is an author and faberg scholar...,"(0.1805785123966942, 0.5140495867768595)","[tatiana, faberg, author, faberg, scholar, swi..."


In [16]:
text_data = [' '.join(doc) for doc in lemitized_data]

my_vectorizer = CountVectorizer(stop_words='english')

my_bow_vec = my_vectorizer.fit_transform(text_data)

In [17]:
tf_idf_vec = TfidfTransformer()
tf_idf_fit = tf_idf_vec.fit_transform(my_bow_vec)

In [19]:
tf_idf_fit.shape

(42786, 423599)

In [19]:
nn = NearestNeighbors().fit(tf_idf_fit)

In [20]:
df.loc[df['name'] == 'Christian Bale']

Unnamed: 0,URI,name,text,sentiment,clean_text
39695,<http://dbpedia.org/resource/Christian_Bale>,Christian Bale,christian charles philip bale born 30 january ...,"(0.18916666666666665, 0.2966666666666667)","[christian, charles, philip, bale, born, 30, j..."


In [21]:
sent = tf_idf_fit[39695]
sent.shape

(1, 423599)

In [22]:
distances, indices = nn.kneighbors(
    X = sent,
    n_neighbors=11
)


In [25]:
distance_df = pd.DataFrame({'indices': indices.flatten(), 'distances': distances.flatten()})

In [27]:
distance_df

Unnamed: 0,indices,distances
0,39695,0.0
1,28070,1.201615
2,5875,1.212054
3,4672,1.212396
4,11156,1.212695
5,39923,1.217291
6,34836,1.224546
7,13993,1.233194
8,29556,1.2334
9,38372,1.235514


In [28]:
name_locations = distance_df['indices'][0:11]

In [30]:
target=df.iloc[(list(name_locations))]
target

Unnamed: 0,URI,name,text,sentiment,clean_text
39695,<http://dbpedia.org/resource/Christian_Bale>,Christian Bale,christian charles philip bale born 30 january ...,"(0.18916666666666665, 0.2966666666666667)","[christian, charles, philip, bale, born, 30, j..."
28070,<http://dbpedia.org/resource/Amy_Adams>,Amy Adams,amy lou adams born august 20 1974 is an americ...,"(0.22387387387387386, 0.28175675675675677)","[amy, lou, adam, born, august, 20, 1974, ameri..."
5875,<http://dbpedia.org/resource/Michael_Keaton>,Michael Keaton,michael john douglas born september 5 1951 bet...,"(0.1886679292929293, 0.38437499999999997)","[michael, john, douglas, born, september, 5, 1..."
4672,<http://dbpedia.org/resource/Daniel_Day-Lewis>,Daniel Day-Lewis,sir daniel michael blake daylewis kt born 29 a...,"(0.2114583333333333, 0.3815972222222222)","[sir, daniel, michael, blake, daylewis, kt, bo..."
11156,<http://dbpedia.org/resource/Anne_Hathaway>,Anne Hathaway,anne jacqueline hathaway born november 12 1982...,"(0.25546536796536795, 0.35633116883116883)","[anne, jacqueline, hathaway, born, november, 1..."
39923,<http://dbpedia.org/resource/Jack_Nicholson>,Jack Nicholson,john joseph jack nicholson born april 22 1937 ...,"(0.23333333333333334, 0.4002604166666667)","[john, joseph, jack, nicholson, born, april, 2..."
34836,<http://dbpedia.org/resource/Michael_Caine>,Michael Caine,sir michael caine cbe ken born maurice joseph ...,"(0.20304054054054052, 0.32747747747747746)","[sir, michael, caine, cbe, ken, born, maurice,..."
13993,<http://dbpedia.org/resource/Ralph_Fiennes>,Ralph Fiennes,ralph nathaniel twisletonwykehamfiennes ref fa...,"(0.3046875, 0.3194444444444444)","[ralph, nathaniel, twisletonwykehamfiennes, re..."
29556,<http://dbpedia.org/resource/Felicity_Huffman>,Felicity Huffman,felicity kendall huffman born december 9 1962 ...,"(0.215, 0.40374999999999994)","[felicity, kendall, huffman, born, december, 9..."
38372,<http://dbpedia.org/resource/Liam_Neeson>,Liam Neeson,liam john neeson obe born 7 june 1952 is an ir...,"(0.4015625, 0.40156249999999993)","[liam, john, neeson, obe, born, 7, june, 1952,..."


In [31]:
target['nn_distance_1']=distance_df['distances']

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  target['nn_distance_1']=distance_df['distances']


Couple of fun ones:
- 42475 Jim Carrey
- 24428 Brad Pitt
- 39695 Christian Bale

In [32]:
#df.reset_index(inplace=True)
distance_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11 entries, 0 to 10
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   indices    11 non-null     int64  
 1   distances  11 non-null     float64
dtypes: float64(1), int64(1)
memory usage: 308.0 bytes


In [33]:
fig = px.line(
    
    x = None,
    y = distance_df['distances'])
fig.update_layout(
    title="Nearest Neighbors of Wikipedia Dataset",
    xaxis_title='50 Closest Neighbors to Christian Bale',
    yaxis_title='Distance'
    )
fig.show()

### Part 2)



In [34]:
%%capture
!pip3 install wikipedia-api

In [35]:
import wikipediaapi

In [36]:
def get_wiki_text(person_name):
    wikip = wikipediaapi.Wikipedia(user_agent = 'foobar')
    page = wikip.page(person_name)
    if page.exists():
        return page.text
    else:
        return None

In [37]:
print(get_wiki_text("Christian Bale"))

Christian Charles Philip Bale (born 30 January 1974) is an English actor. Known for his versatility and physical transformations for his roles, he has been a leading man in films of several genres. He has received various accolades, including an Academy Award and two Golden Globe Awards. Forbes magazine ranked him as one of the highest-paid actors in 2014.
Born in Wales to English parents, Bale had his breakthrough role at age 13 in Steven Spielberg's 1987 war film Empire of the Sun. After more than a decade of performing in leading and supporting roles in films, he gained wider recognition for his portrayals of serial killer Patrick Bateman in the black comedy American Psycho (2000) and the title role in the psychological thriller The Machinist (2004). In 2005, he played superhero Batman in Batman Begins and again in The Dark Knight (2008) and The Dark Knight Rises (2012), garnering acclaim for his performance in the trilogy, which is one of the highest-grossing film franchises.
Bale 

In [38]:
target['full_text'] = target['name'].apply(get_wiki_text)



A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy



In [39]:
target

Unnamed: 0,URI,name,text,sentiment,clean_text,nn_distance_1,full_text
39695,<http://dbpedia.org/resource/Christian_Bale>,Christian Bale,christian charles philip bale born 30 january ...,"(0.18916666666666665, 0.2966666666666667)","[christian, charles, philip, bale, born, 30, j...",,Christian Charles Philip Bale (born 30 January...
28070,<http://dbpedia.org/resource/Amy_Adams>,Amy Adams,amy lou adams born august 20 1974 is an americ...,"(0.22387387387387386, 0.28175675675675677)","[amy, lou, adam, born, august, 20, 1974, ameri...",,"Amy Lou Adams (born August 20, 1974) is an Ame..."
5875,<http://dbpedia.org/resource/Michael_Keaton>,Michael Keaton,michael john douglas born september 5 1951 bet...,"(0.1886679292929293, 0.38437499999999997)","[michael, john, douglas, born, september, 5, 1...",,"Michael John Douglas (born September 5, 1951),..."
4672,<http://dbpedia.org/resource/Daniel_Day-Lewis>,Daniel Day-Lewis,sir daniel michael blake daylewis kt born 29 a...,"(0.2114583333333333, 0.3815972222222222)","[sir, daniel, michael, blake, daylewis, kt, bo...",,Sir Daniel Michael Blake Day-Lewis (born 29 Ap...
11156,<http://dbpedia.org/resource/Anne_Hathaway>,Anne Hathaway,anne jacqueline hathaway born november 12 1982...,"(0.25546536796536795, 0.35633116883116883)","[anne, jacqueline, hathaway, born, november, 1...",,"Anne Jacqueline Hathaway (born November 12, 19..."
39923,<http://dbpedia.org/resource/Jack_Nicholson>,Jack Nicholson,john joseph jack nicholson born april 22 1937 ...,"(0.23333333333333334, 0.4002604166666667)","[john, joseph, jack, nicholson, born, april, 2...",,"John Joseph Nicholson (born April 22, 1937) is..."
34836,<http://dbpedia.org/resource/Michael_Caine>,Michael Caine,sir michael caine cbe ken born maurice joseph ...,"(0.20304054054054052, 0.32747747747747746)","[sir, michael, caine, cbe, ken, born, maurice,...",,Sir Michael Caine (born Maurice Joseph Mickle...
13993,<http://dbpedia.org/resource/Ralph_Fiennes>,Ralph Fiennes,ralph nathaniel twisletonwykehamfiennes ref fa...,"(0.3046875, 0.3194444444444444)","[ralph, nathaniel, twisletonwykehamfiennes, re...",,Ralph Nathaniel Twisleton-Wykeham-Fiennes (; b...
29556,<http://dbpedia.org/resource/Felicity_Huffman>,Felicity Huffman,felicity kendall huffman born december 9 1962 ...,"(0.215, 0.40374999999999994)","[felicity, kendall, huffman, born, december, 9...",,"Felicity Kendall Huffman (born December 9, 196..."
38372,<http://dbpedia.org/resource/Liam_Neeson>,Liam Neeson,liam john neeson obe born 7 june 1952 is an ir...,"(0.4015625, 0.40156249999999993)","[liam, john, neeson, obe, born, 7, june, 1952,...",,William John Neeson (born 7 June 1952) is a N...


In [40]:
full_text_df=target['full_text']

In [41]:
full_text_df = full_text_df.str.lower()

In [42]:
sentiment = full_text_df.apply(lambda x: TextBlob(x).sentiment)

In [43]:
target['full sentiment'] = sentiment



A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy



In [44]:
target

Unnamed: 0,URI,name,text,sentiment,clean_text,nn_distance_1,full_text,full sentiment
39695,<http://dbpedia.org/resource/Christian_Bale>,Christian Bale,christian charles philip bale born 30 january ...,"(0.18916666666666665, 0.2966666666666667)","[christian, charles, philip, bale, born, 30, j...",,Christian Charles Philip Bale (born 30 January...,"(0.1154168870752412, 0.31647796993806937)"
28070,<http://dbpedia.org/resource/Amy_Adams>,Amy Adams,amy lou adams born august 20 1974 is an americ...,"(0.22387387387387386, 0.28175675675675677)","[amy, lou, adam, born, august, 20, 1974, ameri...",,"Amy Lou Adams (born August 20, 1974) is an Ame...","(0.1300817160367722, 0.3954665980087334)"
5875,<http://dbpedia.org/resource/Michael_Keaton>,Michael Keaton,michael john douglas born september 5 1951 bet...,"(0.1886679292929293, 0.38437499999999997)","[michael, john, douglas, born, september, 5, 1...",,"Michael John Douglas (born September 5, 1951),...","(0.10917390096534484, 0.3986533430918459)"
4672,<http://dbpedia.org/resource/Daniel_Day-Lewis>,Daniel Day-Lewis,sir daniel michael blake daylewis kt born 29 a...,"(0.2114583333333333, 0.3815972222222222)","[sir, daniel, michael, blake, daylewis, kt, bo...",,Sir Daniel Michael Blake Day-Lewis (born 29 Ap...,"(0.1720494282584446, 0.348486373978177)"
11156,<http://dbpedia.org/resource/Anne_Hathaway>,Anne Hathaway,anne jacqueline hathaway born november 12 1982...,"(0.25546536796536795, 0.35633116883116883)","[anne, jacqueline, hathaway, born, november, 1...",,"Anne Jacqueline Hathaway (born November 12, 19...","(0.1338206836972, 0.38011790154771125)"
39923,<http://dbpedia.org/resource/Jack_Nicholson>,Jack Nicholson,john joseph jack nicholson born april 22 1937 ...,"(0.23333333333333334, 0.4002604166666667)","[john, joseph, jack, nicholson, born, april, 2...",,"John Joseph Nicholson (born April 22, 1937) is...","(0.1425362244995014, 0.40812247493038506)"
34836,<http://dbpedia.org/resource/Michael_Caine>,Michael Caine,sir michael caine cbe ken born maurice joseph ...,"(0.20304054054054052, 0.32747747747747746)","[sir, michael, caine, cbe, ken, born, maurice,...",,Sir Michael Caine (born Maurice Joseph Mickle...,"(0.07516953857774175, 0.36406321272043946)"
13993,<http://dbpedia.org/resource/Ralph_Fiennes>,Ralph Fiennes,ralph nathaniel twisletonwykehamfiennes ref fa...,"(0.3046875, 0.3194444444444444)","[ralph, nathaniel, twisletonwykehamfiennes, re...",,Ralph Nathaniel Twisleton-Wykeham-Fiennes (; b...,"(0.10110291907166907, 0.3501971914081287)"
29556,<http://dbpedia.org/resource/Felicity_Huffman>,Felicity Huffman,felicity kendall huffman born december 9 1962 ...,"(0.215, 0.40374999999999994)","[felicity, kendall, huffman, born, december, 9...",,"Felicity Kendall Huffman (born December 9, 196...","(0.12875308892496398, 0.4129040646853146)"
38372,<http://dbpedia.org/resource/Liam_Neeson>,Liam Neeson,liam john neeson obe born 7 june 1952 is an ir...,"(0.4015625, 0.40156249999999993)","[liam, john, neeson, obe, born, 7, june, 1952,...",,William John Neeson (born 7 June 1952) is a N...,"(0.10508207070707073, 0.348951884435755)"


In [45]:
tokenized_data = []

# Loop through each row in the data series
for _ in full_text_df:
    # Convert the text to a TextBlob
    blob = TextBlob(_)
    # Tokenize the text by words
    tokens = blob.words
    # Append the tokens to the new list
    tokenized_data.append(tokens)

# Convert the list of tokens back into a pandas Series if needed
tokenized_series = pd.Series(tokenized_data)
print(tokenized_series)

0     [christian, charles, philip, bale, born, 30, j...
1     [amy, lou, adams, born, august, 20, 1974, is, ...
2     [michael, john, douglas, born, september, 5, 1...
3     [sir, daniel, michael, blake, day-lewis, born,...
4     [anne, jacqueline, hathaway, born, november, 1...
5     [john, joseph, nicholson, born, april, 22, 193...
6     [sir, michael, caine, born, maurice, joseph, m...
7     [ralph, nathaniel, twisleton-wykeham-fiennes, ...
8     [felicity, kendall, huffman, born, december, 9...
9     [william, john, neeson, born, 7, june, 1952, i...
10    [thomas, jeffrey, hanks, born, july, 9, 1956, ...
dtype: object


In [46]:
stop_words = set(stopwords.words('english'))

filtered_data = tokenized_series.apply(lambda tokens: [word for word in tokens if word not in stop_words])

print(filtered_data)

0     [christian, charles, philip, bale, born, 30, j...
1     [amy, lou, adams, born, august, 20, 1974, amer...
2     [michael, john, douglas, born, september, 5, 1...
3     [sir, daniel, michael, blake, day-lewis, born,...
4     [anne, jacqueline, hathaway, born, november, 1...
5     [john, joseph, nicholson, born, april, 22, 193...
6     [sir, michael, caine, born, maurice, joseph, m...
7     [ralph, nathaniel, twisleton-wykeham-fiennes, ...
8     [felicity, kendall, huffman, born, december, 9...
9     [william, john, neeson, born, 7, june, 1952, n...
10    [thomas, jeffrey, hanks, born, july, 9, 1956, ...
dtype: object


In [47]:
lemmatizer = WordNetLemmatizer()
lemitized_data = filtered_data.apply(lambda tokens: [lemmatizer.lemmatize(word) for word in tokens])

In [48]:
target['full_clean_text']=lemitized_data



A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy



In [49]:
text_data = [' '.join(doc) for doc in lemitized_data]

# Initialize CountVectorizer
my_vectorizer = CountVectorizer(stop_words='english')

# Fit and transform the text data
my_bow_vec = my_vectorizer.fit_transform(text_data)

# Convert the bag of words matrix to a DataFrame
#my_sent_df = pd.DataFrame(my_bow_vec.toarray(), columns=my_vectorizer.get_feature_names_out())

# Display the DataFrame
#print(my_sent_df)

In [50]:
tf_idf_vec = TfidfTransformer()
tf_idf_fit = tf_idf_vec.fit_transform(my_bow_vec)

In [51]:
nn = NearestNeighbors().fit(tf_idf_fit)

In [52]:
sent = tf_idf_fit[0]
sent.shape

(1, 7402)

In [53]:
distances, indices = nn.kneighbors(
    X = sent,
    n_neighbors=11
)

In [54]:
distance_df2 = pd.DataFrame({'indices': indices.flatten(), 'distances': distances.flatten()})

In [64]:
distance_df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11 entries, 0 to 10
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   indices    11 non-null     int64  
 1   distances  11 non-null     float64
dtypes: float64(1), int64(1)
memory usage: 308.0 bytes


In [56]:
target['nn_distance_2']=distance_df2['distances']



A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy



In [57]:
target

Unnamed: 0,URI,name,text,sentiment,clean_text,nn_distance_1,full_text,full sentiment,full_clean_text,nn_distance_2
39695,<http://dbpedia.org/resource/Christian_Bale>,Christian Bale,christian charles philip bale born 30 january ...,"(0.18916666666666665, 0.2966666666666667)","[christian, charles, philip, bale, born, 30, j...",,Christian Charles Philip Bale (born 30 January...,"(0.1154168870752412, 0.31647796993806937)",,
28070,<http://dbpedia.org/resource/Amy_Adams>,Amy Adams,amy lou adams born august 20 1974 is an americ...,"(0.22387387387387386, 0.28175675675675677)","[amy, lou, adam, born, august, 20, 1974, ameri...",,"Amy Lou Adams (born August 20, 1974) is an Ame...","(0.1300817160367722, 0.3954665980087334)",,
5875,<http://dbpedia.org/resource/Michael_Keaton>,Michael Keaton,michael john douglas born september 5 1951 bet...,"(0.1886679292929293, 0.38437499999999997)","[michael, john, douglas, born, september, 5, 1...",,"Michael John Douglas (born September 5, 1951),...","(0.10917390096534484, 0.3986533430918459)",,
4672,<http://dbpedia.org/resource/Daniel_Day-Lewis>,Daniel Day-Lewis,sir daniel michael blake daylewis kt born 29 a...,"(0.2114583333333333, 0.3815972222222222)","[sir, daniel, michael, blake, daylewis, kt, bo...",,Sir Daniel Michael Blake Day-Lewis (born 29 Ap...,"(0.1720494282584446, 0.348486373978177)",,
11156,<http://dbpedia.org/resource/Anne_Hathaway>,Anne Hathaway,anne jacqueline hathaway born november 12 1982...,"(0.25546536796536795, 0.35633116883116883)","[anne, jacqueline, hathaway, born, november, 1...",,"Anne Jacqueline Hathaway (born November 12, 19...","(0.1338206836972, 0.38011790154771125)",,
39923,<http://dbpedia.org/resource/Jack_Nicholson>,Jack Nicholson,john joseph jack nicholson born april 22 1937 ...,"(0.23333333333333334, 0.4002604166666667)","[john, joseph, jack, nicholson, born, april, 2...",,"John Joseph Nicholson (born April 22, 1937) is...","(0.1425362244995014, 0.40812247493038506)",,
34836,<http://dbpedia.org/resource/Michael_Caine>,Michael Caine,sir michael caine cbe ken born maurice joseph ...,"(0.20304054054054052, 0.32747747747747746)","[sir, michael, caine, cbe, ken, born, maurice,...",,Sir Michael Caine (born Maurice Joseph Mickle...,"(0.07516953857774175, 0.36406321272043946)",,
13993,<http://dbpedia.org/resource/Ralph_Fiennes>,Ralph Fiennes,ralph nathaniel twisletonwykehamfiennes ref fa...,"(0.3046875, 0.3194444444444444)","[ralph, nathaniel, twisletonwykehamfiennes, re...",,Ralph Nathaniel Twisleton-Wykeham-Fiennes (; b...,"(0.10110291907166907, 0.3501971914081287)",,
29556,<http://dbpedia.org/resource/Felicity_Huffman>,Felicity Huffman,felicity kendall huffman born december 9 1962 ...,"(0.215, 0.40374999999999994)","[felicity, kendall, huffman, born, december, 9...",,"Felicity Kendall Huffman (born December 9, 196...","(0.12875308892496398, 0.4129040646853146)",,
38372,<http://dbpedia.org/resource/Liam_Neeson>,Liam Neeson,liam john neeson obe born 7 june 1952 is an ir...,"(0.4015625, 0.40156249999999993)","[liam, john, neeson, obe, born, 7, june, 1952,...",,William John Neeson (born 7 June 1952) is a N...,"(0.10508207070707073, 0.348951884435755)",,


In [62]:
fig = px.line(
    
    x = None,
    y = distance_df2['distances'])
fig.update_layout(
    title="Nearest Neighbors of Wikipedia Dataset Long Text",
    xaxis_title='50 Closest Neighbors to Christian Bale Long Text',
    yaxis_title='Distance'
    )
fig.show()

In [59]:
distance_df

Unnamed: 0,indices,distances
0,39695,0.0
1,28070,1.201615
2,5875,1.212054
3,4672,1.212396
4,11156,1.212695
5,39923,1.217291
6,34836,1.224546
7,13993,1.233194
8,29556,1.2334
9,38372,1.235514


In [60]:
distance_df2

Unnamed: 0,indices,distances
0,0,0.0
1,1,1.218762
2,3,1.289884
3,9,1.291092
4,4,1.294021
5,2,1.297877
6,7,1.315752
7,6,1.31806
8,10,1.326763
9,5,1.333833


In [63]:
target

Unnamed: 0,URI,name,text,sentiment,clean_text,nn_distance_1,full_text,full sentiment,full_clean_text,nn_distance_2
39695,<http://dbpedia.org/resource/Christian_Bale>,Christian Bale,christian charles philip bale born 30 january ...,"(0.18916666666666665, 0.2966666666666667)","[christian, charles, philip, bale, born, 30, j...",,Christian Charles Philip Bale (born 30 January...,"(0.1154168870752412, 0.31647796993806937)",,
28070,<http://dbpedia.org/resource/Amy_Adams>,Amy Adams,amy lou adams born august 20 1974 is an americ...,"(0.22387387387387386, 0.28175675675675677)","[amy, lou, adam, born, august, 20, 1974, ameri...",,"Amy Lou Adams (born August 20, 1974) is an Ame...","(0.1300817160367722, 0.3954665980087334)",,
5875,<http://dbpedia.org/resource/Michael_Keaton>,Michael Keaton,michael john douglas born september 5 1951 bet...,"(0.1886679292929293, 0.38437499999999997)","[michael, john, douglas, born, september, 5, 1...",,"Michael John Douglas (born September 5, 1951),...","(0.10917390096534484, 0.3986533430918459)",,
4672,<http://dbpedia.org/resource/Daniel_Day-Lewis>,Daniel Day-Lewis,sir daniel michael blake daylewis kt born 29 a...,"(0.2114583333333333, 0.3815972222222222)","[sir, daniel, michael, blake, daylewis, kt, bo...",,Sir Daniel Michael Blake Day-Lewis (born 29 Ap...,"(0.1720494282584446, 0.348486373978177)",,
11156,<http://dbpedia.org/resource/Anne_Hathaway>,Anne Hathaway,anne jacqueline hathaway born november 12 1982...,"(0.25546536796536795, 0.35633116883116883)","[anne, jacqueline, hathaway, born, november, 1...",,"Anne Jacqueline Hathaway (born November 12, 19...","(0.1338206836972, 0.38011790154771125)",,
39923,<http://dbpedia.org/resource/Jack_Nicholson>,Jack Nicholson,john joseph jack nicholson born april 22 1937 ...,"(0.23333333333333334, 0.4002604166666667)","[john, joseph, jack, nicholson, born, april, 2...",,"John Joseph Nicholson (born April 22, 1937) is...","(0.1425362244995014, 0.40812247493038506)",,
34836,<http://dbpedia.org/resource/Michael_Caine>,Michael Caine,sir michael caine cbe ken born maurice joseph ...,"(0.20304054054054052, 0.32747747747747746)","[sir, michael, caine, cbe, ken, born, maurice,...",,Sir Michael Caine (born Maurice Joseph Mickle...,"(0.07516953857774175, 0.36406321272043946)",,
13993,<http://dbpedia.org/resource/Ralph_Fiennes>,Ralph Fiennes,ralph nathaniel twisletonwykehamfiennes ref fa...,"(0.3046875, 0.3194444444444444)","[ralph, nathaniel, twisletonwykehamfiennes, re...",,Ralph Nathaniel Twisleton-Wykeham-Fiennes (; b...,"(0.10110291907166907, 0.3501971914081287)",,
29556,<http://dbpedia.org/resource/Felicity_Huffman>,Felicity Huffman,felicity kendall huffman born december 9 1962 ...,"(0.215, 0.40374999999999994)","[felicity, kendall, huffman, born, december, 9...",,"Felicity Kendall Huffman (born December 9, 196...","(0.12875308892496398, 0.4129040646853146)",,
38372,<http://dbpedia.org/resource/Liam_Neeson>,Liam Neeson,liam john neeson obe born 7 june 1952 is an ir...,"(0.4015625, 0.40156249999999993)","[liam, john, neeson, obe, born, 7, june, 1952,...",,William John Neeson (born 7 June 1952) is a N...,"(0.10508207070707073, 0.348951884435755)",,
