
# Natural Language Processing



## Problem Definition
>"What's going on?"

This project gives you practical experience with Natural Language Processing techniques. It has three parts:
- in part 1) you will use a dataset in a CSV file
- in part 2) you will use the Wikipedia API to directly access content on Wikipedia
- in part 3) you will make your notebook interactive


### Part 1)



- The CSV file is available at https://ddc-datascience.s3.amazonaws.com/Projects/Project.5-NLP/Data/NLP.csv
- The file contains a list of famous people and a brief overview.
- The goal of part 1) is to ...
  1. Pick one person from the list (the target person) and output the 10 other people whose overviews are "closest" to the target person's overview in a Natural Language Processing sense
  1. Also output the sentiment of the overview of the target person



### Part 2)



- For the same target person that you chose in Part 1), use the Wikipedia API to access the whole content of the target person's Wikipedia page.
- The goal of Part 2) is to ...
  1. Print out the text of the Wikipedia article for the target person
  1. Determine the sentiment of the text of the Wikipedia page for the target person
  1. Collect the text of the Wikipedia pages from the 10 nearest neighbors from Part 1)
  1. Determine the nearness ranking of these 10 people to your target person based on their entire Wikipedia page
  1. Compare, i.e. plot, the nearness ranking from Part 1) with the Wikipedia page nearness ranking. The difference in rank is one means of comparison.
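
One way to read the last step, sketched in plain Python: give each neighbor a rank in both orderings and subtract. The function name `rank_differences` and the two short lists are illustrative placeholders, not project results.

```python
def rank_differences(overview_order, wiki_order):
    """Return {name: wiki_rank - overview_rank} for names present in both lists.

    Both arguments are lists of names ordered from nearest to farthest.
    """
    overview_rank = {name: i for i, name in enumerate(overview_order, start=1)}
    wiki_rank = {name: i for i, name in enumerate(wiki_order, start=1)}
    return {
        name: wiki_rank[name] - overview_rank[name]
        for name in overview_order
        if name in wiki_rank
    }


# Placeholder names, not real project output.
overview_order = ["A", "B", "C", "D"]
wiki_order = ["A", "C", "D", "B"]
diffs = rank_differences(overview_order, wiki_order)
print(diffs)
```

A positive value means the person sits further from the target in the Wikipedia ranking than in the overview ranking. Plotting these differences as a bar chart is one option for the comparison.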



### Part 3)


Make an interactive notebook where a user can choose or enter a name and the notebook displays the 10 closest individuals.
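
A user-entered name rarely matches the dataset spelling exactly, so some form of name resolution is likely needed before the neighbor lookup. The sketch below is one possible approach using only the standard library; `resolve_name`, the `cutoff` value, and the short name list are assumptions made for illustration. The resolved name could then feed a text box or dropdown widget (for example from `ipywidgets`, which this notebook does not import).

```python
import difflib


def resolve_name(user_input, known_names, cutoff=0.6):
    """Map free-text input to a name in the dataset, or return None.

    Tries a case-insensitive exact match first, then falls back to the
    closest fuzzy match found by difflib.
    """
    lookup = {name.lower(): name for name in known_names}
    key = user_input.strip().lower()
    if key in lookup:
        return lookup[key]
    close = difflib.get_close_matches(key, list(lookup), n=1, cutoff=cutoff)
    return lookup[close[0]] if close else None


known_names = ["Digby Morrell", "Harpdog Brown", "Franz Rottensteiner"]
print(resolve_name("digby morrell", known_names))
print(resolve_name("Harpdog Browne", known_names))
print(resolve_name("zzzz", known_names))
```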

In addition to presenting the project slides, at the end of the presentation each student will demonstrate their code using a famous person suggested by the other students that exists in the DBpedia set.


## Data Collection/Sources
>"Initial Setup"


In [65]:
import pandas as pd
import sqlite3 as db
import matplotlib.pyplot as plt
import plotly
import plotly.express as px
import kaleido
import seaborn as sns
import sklearn
import numpy as np

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

pd.options.display.max_columns = 100

import nltk
# nltk.download('omw-1.4')
nltk.download('punkt_tab')
# nltk.download('averaged_perceptron_tagger_eng')


[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [66]:
%%capture
# Install textblob
!pip install -U textblob
from textblob import TextBlob

In [67]:
%%capture
!pip install wikipedia-api
import wikipediaapi

In [68]:
agent = 'CNM_DeepDive (lisaballortiz+deepdive@gmail.com)'
wiki_wiki = wikipediaapi.Wikipedia(user_agent=agent, language='en')

### Load the data

In [69]:
# -O writes the download to NLP.csv, so nothing reaches the pipe and wc -l reports 0
!curl -s -O https://ddc-datascience.s3.amazonaws.com/Projects/Project.5-NLP/Data/NLP.csv | wc -l

0


In [70]:
# !curl --help all
!head NLP.csv

URI,name,text
<http://dbpedia.org/resource/Digby_Morrell>,Digby Morrell,digby morrell born 10 october 1979 is a former australian rules footballer who played with the kangaroos and carlton in the australian football league aflfrom western australia morrell played his early senior football for west perth his 44game senior career for the falcons spanned 19982000 and he was the clubs leading goalkicker in 2000 at the age of 21 morrell was recruited to the australian football league by the kangaroos football club with its third round selection in the 2001 afl rookie draft as a forward he twice kicked five goals during his time with the kangaroos the first was in a losing cause against sydney in 2002 and the other the following season in a drawn game against brisbaneafter the 2003 season morrell was traded along with david teague to the carlton football club in exchange for corey mckernan he played 32 games for the blues before being delisted at the end of 2005 he continued to play victoria

In [71]:
!ls -lh NLP.csv

-rw-r--r-- 1 root root 80M Nov 10 16:01 NLP.csv


In [72]:
conn = db.connect(':memory:')
csv_file = 'NLP.csv'  # Replace with your CSV file path
chunk_size = 10000  # Adjust as needed

for i, chunk in enumerate(pd.read_csv(csv_file, chunksize=chunk_size)):
    if i == 0:
        # Create table with header from the first chunk
        chunk.to_sql('temp_table', conn, if_exists='replace', index=False)
    else:
        # Append subsequent chunks to the table
        chunk.to_sql('temp_table', conn, if_exists='append', index=False)
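
Once the rows are in SQLite, a single person can be pulled without loading the whole table into pandas. The sketch below uses only the standard-library `sqlite3` module; `fetch_overview` and `conn_demo` are hypothetical names, and the tiny in-memory table with made-up rows stands in for the `temp_table` built above.

```python
import sqlite3


def fetch_overview(conn, name):
    """Return the overview text for one person, or None if the name is absent."""
    row = conn.execute(
        "SELECT text FROM temp_table WHERE name = ?",  # ? binds the value safely
        (name,),
    ).fetchone()
    return row[0] if row else None


# Tiny stand-in for the table built from NLP.csv (rows are made up).
conn_demo = sqlite3.connect(":memory:")
conn_demo.execute("CREATE TABLE temp_table (URI TEXT, name TEXT, text TEXT)")
conn_demo.executemany(
    "INSERT INTO temp_table VALUES (?, ?, ?)",
    [
        ("<uri-1>", "Person One", "person one is a placeholder overview"),
        ("<uri-2>", "Person O'Two", "person o two is another placeholder"),
    ],
)

print(fetch_overview(conn_demo, "Person O'Two"))
print(fetch_overview(conn_demo, "Nobody"))
```

Binding the name with `?` rather than formatting it into the SQL string keeps names that contain apostrophes from breaking the query.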

In [73]:
query = '''
  select *
  from temp_table
  limit 100
'''

temp = pd.read_sql_query(query , conn)
temp

Unnamed: 0,URI,name,text
0,<http://dbpedia.org/resource/Digby_Morrell>,Digby Morrell,digby morrell born 10 october 1979 is a former...
1,<http://dbpedia.org/resource/Alfred_J._Lewy>,Alfred J. Lewy,alfred j lewy aka sandy lewy graduated from un...
2,<http://dbpedia.org/resource/Harpdog_Brown>,Harpdog Brown,harpdog brown is a singer and harmonica player...
3,<http://dbpedia.org/resource/Franz_Rottensteiner>,Franz Rottensteiner,franz rottensteiner born in waidmannsfeld lowe...
4,<http://dbpedia.org/resource/G-Enka>,G-Enka,henry krvits born 30 december 1974 in tallinn ...
...,...,...,...
95,<http://dbpedia.org/resource/Steve_Castle>,Steve Castle,steve castle born 17 may 1966 in barking is a ...
96,<http://dbpedia.org/resource/Armen_Ra>,Armen Ra,armen ra is an american artist and performer o...
97,<http://dbpedia.org/resource/David_Shaughnessy>,David Shaughnessy,david james shaughnessy also spelled shaughnes...
98,<http://dbpedia.org/resource/John_Reynolds_(Ca...,John Reynolds (Canadian politician),john douglas reynolds pc born january 19 1942 ...


### Part 1


In [75]:
# # Select the target person's overview and read it into a pandas DataFrame
# query = '''
#   select text
#   from temp_table
#   where name='Digby Morrell'
# '''

# temp = pd.read_sql_query(query , conn)
# temp

In [76]:
temp.iloc[0,2]

'digby morrell born 10 october 1979 is a former australian rules footballer who played with the kangaroos and carlton in the australian football league aflfrom western australia morrell played his early senior football for west perth his 44game senior career for the falcons spanned 19982000 and he was the clubs leading goalkicker in 2000 at the age of 21 morrell was recruited to the australian football league by the kangaroos football club with its third round selection in the 2001 afl rookie draft as a forward he twice kicked five goals during his time with the kangaroos the first was in a losing cause against sydney in 2002 and the other the following season in a drawn game against brisbaneafter the 2003 season morrell was traded along with david teague to the carlton football club in exchange for corey mckernan he played 32 games for the blues before being delisted at the end of 2005 he continued to play victorian football league vfl football with the northern bullants carltons vfla

In [77]:
# Sentiment of the target person's overview using TextBlob
target = TextBlob(temp.iloc[0,2])
target.sentiment

Sentiment(polarity=-0.041666666666666664, subjectivity=0.17896825396825394)

## Data Cleaning

>  "Clean Up Data"

#### Clean Weird Characters

In [78]:
mask = temp['name'].str.contains(r'[^\w\s]')
temp_clean = temp[~mask]
temp_clean.reset_index(drop=True, inplace=True)
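
It is worth checking what this filter actually removes. The pattern matches any character that is neither a word character nor whitespace, so a name containing a period, hyphen, or parenthesis is dropped along with its row. The sketch below applies the same pattern with the standard-library `re` module to four names taken from the sample shown earlier.

```python
import re

# Same pattern as the mask above: any character that is neither a word
# character nor whitespace.
pattern = re.compile(r"[^\w\s]")

sample_names = [
    "Digby Morrell",
    "Alfred J. Lewy",
    "G-Enka",
    "John Reynolds (Canadian politician)",
]

kept = [n for n in sample_names if not pattern.search(n)]
dropped = [n for n in sample_names if pattern.search(n)]
print("kept:", kept)
print("dropped:", dropped)
```

Only the first name survives here, which suggests the filter discards legitimate people rather than only malformed entries. Cleaning the name instead of dropping the row may be a gentler alternative.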

In [79]:
temp_clean['text']

Unnamed: 0,text
0,digby morrell born 10 october 1979 is a former...
1,harpdog brown is a singer and harmonica player...
2,franz rottensteiner born in waidmannsfeld lowe...
3,sam henderson born october 18 1969 is an ameri...
4,aaron lacrate is an american music producer re...
...,...
70,eva felicitas habermann born january 16 1976 i...
71,steve castle born 17 may 1966 in barking is a ...
72,armen ra is an american artist and performer o...
73,david james shaughnessy also spelled shaughnes...


#### Singularize One Overview

In [80]:
sentence_tb = TextBlob(temp_clean.iloc[0,2]) # Make a textblob so that we can singularize the word
sentence_singular = [ x.singularize() for x in sentence_tb.words ] # Singularize each word in the text
sentence_clean = ' '.join(sentence_singular) # Join it together into a single string
sentence_clean

'digby morrell born 10 october 1979 is a former australian rule footballer who played with the kangaroo and carlton in the australian football league aflfrom western australium morrell played hi early senior football for west perth hi 44game senior career for the falcon spanned 19982000 and he wa the club leading goalkicker in 2000 at the age of 21 morrell wa recruited to the australian football league by the kangaroo football club with it third round selection in the 2001 afl rookie draft a a forward he twice kicked five goal during hi time with the kangaroo the first wa in a losing cause against sydney in 2002 and the other the following season in a drawn game against brisbaneafter the 2003 season morrell wa traded along with david teague to the carlton football club in exchange for corey mckernan he played 32 game for the blue before being delisted at the end of 2005 he continued to play victorian football league vfl football with the northern bullant carlton vflaffiliate in 2006 an

In [81]:
temp_clean.iloc[0,2]

'digby morrell born 10 october 1979 is a former australian rules footballer who played with the kangaroos and carlton in the australian football league aflfrom western australia morrell played his early senior football for west perth his 44game senior career for the falcons spanned 19982000 and he was the clubs leading goalkicker in 2000 at the age of 21 morrell was recruited to the australian football league by the kangaroos football club with its third round selection in the 2001 afl rookie draft as a forward he twice kicked five goals during his time with the kangaroos the first was in a losing cause against sydney in 2002 and the other the following season in a drawn game against brisbaneafter the 2003 season morrell was traded along with david teague to the carlton football club in exchange for corey mckernan he played 32 games for the blues before being delisted at the end of 2005 he continued to play victorian football league vfl football with the northern bullants carltons vfla

#### Singularize All

In [82]:
def singularize_text(sentence):
  sentence_tb = TextBlob(sentence) # Make a textblob so that we can singularize the word
  sentence_singular = ' '.join([ x.singularize() for x in sentence_tb.words ]) # Singularize each word in the text

  return sentence_singular

In [83]:
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
temp_clean = temp_clean.assign(singular_text=temp_clean['text'].map(singularize_text))
# temp_clean



## Exploratory Data Analysis
> "Look Around"

#### Count Vectorizer

In [84]:
vectorizer = CountVectorizer(stop_words='english')
bow_matrix = vectorizer.fit_transform(temp_clean['singular_text'])

In [85]:
bow_matrix.shape

(75, 5073)

#### TF-IDF

In [86]:
tf_idf_matrix = TfidfTransformer()
tf_idf_famous = tf_idf_matrix.fit_transform(bow_matrix)

In [87]:
tf_idf_famous.shape

(75, 5073)
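
To make the weighting concrete, here is a small pure-Python sketch of what `TfidfTransformer()` computes with its default settings (smoothed idf, L2 row normalisation). The function `tfidf_rows` and the three toy documents are illustrative only and are not part of the project data.

```python
import math
from collections import Counter


def tfidf_rows(docs):
    """Compute L2-normalised TF-IDF rows for a list of tokenised documents.

    Uses the smoothed idf that scikit-learn applies by default:
    idf(t) = ln((1 + n) / (1 + df(t))) + 1
    """
    n = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in df}

    rows = []
    for doc in docs:
        counts = Counter(doc)
        weights = {t: c * idf[t] for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()})
    return rows


docs = [
    ["footballer", "club", "goal"],
    ["singer", "club", "album"],
    ["footballer", "league", "goal"],
]
rows = tfidf_rows(docs)
print(rows[1])
```

In the second document, "singer" appears in only one document while "club" appears in two, so "singer" receives the larger weight.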

In [88]:
vectorizer.get_feature_names_out()

array(['10', '100', '1000', ..., 'zone', 'zoubeir', 'zwigoff'],
      dtype=object)

## Data Processing
>  "Crunch Numbers"

#### Nearest Neighbors

In [89]:
nn = NearestNeighbors().fit(tf_idf_famous)

In [90]:
target = 'Digby Morrell'
target

'Digby Morrell'

In [91]:
mask = (temp_clean['name']==target).to_numpy()

In [92]:
sent0 = tf_idf_famous[mask]
sent0

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 107 stored elements and shape (1, 5073)>

In [93]:
distances, indices = nn.kneighbors(
  X = sent0,
  n_neighbors = 11,
)

In [94]:
distances[distances>0]

array([1.09682007, 1.28056948, 1.28656336, 1.30154545, 1.30903476,
       1.31444259, 1.33158459, 1.33673087, 1.33681614, 1.33941432])
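
These distances are easier to interpret as cosine similarities. `TfidfTransformer()` scales every row to unit length by default and `NearestNeighbors()` defaults to Euclidean distance, so for two rows at distance d the cosine similarity is 1 - d²/2, and d cannot exceed √2 because TF-IDF weights are non-negative. The helper name `cosine_from_euclidean` is illustrative.

```python
import math


def cosine_from_euclidean(d):
    """Cosine similarity of two unit-length vectors that are distance d apart."""
    return 1.0 - (d * d) / 2.0


nearest = 1.09682007   # smallest non-zero distance reported above
farthest = 1.33941432  # largest of the ten distances reported above

print(round(cosine_from_euclidean(nearest), 3))
print(round(cosine_from_euclidean(farthest), 3))
print(round(math.sqrt(2), 3))  # distance between unit vectors sharing no terms
```

Even the closest neighbor has a cosine similarity of only about 0.4, so the overviews in this small sample share relatively little vocabulary.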

In [95]:
names = temp_clean['name'][indices[0]]
names

Unnamed: 0,name
0,Digby Morrell
74,Dean Greig
50,Alan Roper
68,Shaka Hislop
71,Steve Castle
57,Vladimir Yurchenko
18,Adel Sellimi
40,Corey Woolfolk
21,Vic Stasiuk
27,Bob Reece


### Part 2

In [96]:
target

'Digby Morrell'

In [97]:
target_page = wiki_wiki.page(target)

In [98]:
len(target_page.text)

2027

In [99]:
# def get_wiki_sentiment(name):
#   target_page = page_py = wiki_wiki.page(name)
#   blob = TextBlob(target_page.text)

#   return blob.sentiment

In [100]:
def get_wiki_text(name):
  # Fetch the full plain-text content of the Wikipedia page titled `name`
  return wiki_wiki.page(name).text

def get_wiki_sentiment(text):
  # TextBlob returns a (polarity, subjectivity) named tuple
  return TextBlob(text).sentiment

In [101]:
target_blob_wiki = TextBlob(target_page.text)
target_blob_wiki.sentiment

Sentiment(polarity=-0.041035353535353536, subjectivity=0.19291125541125542)

In [102]:
# names = temp_clean['name'][indices[0]].tolist()

In [103]:
names_list = names.tolist()
names_list

['Digby Morrell',
 'Dean Greig',
 'Alan Roper',
 'Shaka Hislop',
 'Steve Castle',
 'Vladimir Yurchenko',
 'Adel Sellimi',
 'Corey Woolfolk',
 'Vic Stasiuk',
 'Bob Reece',
 'Ceiron Thomas']

In [104]:
texts = list(map(get_wiki_text, names_list))
sentiments = list(map(get_wiki_sentiment, texts))

In [105]:
sentiments

[Sentiment(polarity=-0.041035353535353536, subjectivity=0.19291125541125542),
 Sentiment(polarity=0.1532608695652174, subjectivity=0.366304347826087),
 Sentiment(polarity=0.03333333333333333, subjectivity=0.03333333333333333),
 Sentiment(polarity=0.10306851811089102, subjectivity=0.33278242473157743),
 Sentiment(polarity=0.12942989214175654, subjectivity=0.38694144838212624),
 Sentiment(polarity=0.1763157894736842, subjectivity=0.24298245614035086),
 Sentiment(polarity=0.08708597603946441, subjectivity=0.24933051444679355),
 Sentiment(polarity=0.051176470588235295, subjectivity=0.29490196078431374),
 Sentiment(polarity=0.14069059870946662, subjectivity=0.270747177350951),
 Sentiment(polarity=0.033088235294117654, subjectivity=0.3042232277526395),
 Sentiment(polarity=0.00972222222222222, subjectivity=0.3467200854700854)]

In [106]:
singular_texts = list(map(singularize_text, texts))

In [107]:
# list(singular_texts)

In [108]:
wiki_bow_matrix = vectorizer.fit_transform(singular_texts)

In [109]:
tf_idf_wiki = tf_idf_matrix.fit_transform(wiki_bow_matrix)

In [110]:
wiki_nn = NearestNeighbors().fit(tf_idf_wiki)

In [111]:
tgt_idx = names_list.index(target)

In [112]:
tgt_vector_wiki = tf_idf_wiki[tgt_idx]

In [113]:
tf_idf_wiki.shape

(11, 1393)

In [114]:
tgt_vector_wiki.shape

(1, 1393)

In [115]:
wiki_distances, wiki_indices = wiki_nn.kneighbors(
  X = tgt_vector_wiki,
  n_neighbors = 11,
)

In [116]:
names_arr = np.array(names_list)
names_arr[wiki_indices[0]]

array(['Digby Morrell', 'Dean Greig', 'Shaka Hislop', 'Steve Castle',
       'Vladimir Yurchenko', 'Adel Sellimi', 'Vic Stasiuk',
       'Corey Woolfolk', 'Alan Roper', 'Bob Reece', 'Ceiron Thomas'],
      dtype='<U18')

In [117]:
indices[distances>0]

array([74, 50, 68, 71, 57, 18, 40, 21, 27, 17])

In [118]:
wiki_indices[wiki_distances>0]

array([ 1,  3,  4,  5,  6,  8,  7,  2,  9, 10])

In [123]:
wiki_df = pd.DataFrame(index=names_arr[distances[0]>0])
wiki_df['orig_ranks'] = np.arange(1, 11)
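# Position j of wiki_indices holds the names_list index of the (j+1)-th nearest
# Wikipedia page, so argsort inverts that mapping into one rank per Part 1 neighbor.
# For this run the result is [1, 8, 2, 3, 4, 5, 7, 6, 9, 10].
wiki_df['wiki_ranks'] = np.argsort(wiki_indices[wiki_distances>0]) + 1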
wiki_df

Unnamed: 0,orig_ranks,wiki_ranks
Dean Greig,1,1
Alan Roper,2,8
Shaka Hislop,3,2
Steve Castle,4,3
Vladimir Yurchenko,5,4
Adel Sellimi,6,5
Corey Woolfolk,7,7
Vic Stasiuk,8,6
Bob Reece,9,9
Ceiron Thomas,10,10


## Data Visualization
> "Let's Plot"

In [120]:
#!plotly_get_chrome




In [122]:
import plotly.graph_objects as go

# Part 1 rank (x) against Wikipedia-page rank (y) for the ten neighbors
go.Figure(
    go.Scatter(x=wiki_df['orig_ranks'], y=wiki_df['wiki_ranks']),
    layout_title_text='Rank Comparison',
    layout_xaxis_title="Original Rank",
    layout_yaxis_title="Wikipedia Rank",
)
#fig = go.Figure(go.Scatter(x=wiki_df['orig_ranks'], y=wiki_df['wiki_ranks']), layout_title_text='Rank Comparison', layout_xaxis_title="Original Rank", layout_yaxis_title="Wikipedia Rank")
#fig.write_image("yourfile.png")

## Results

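We were given a data set from DBpedia that lists famous people along with a source URI and a short overview of each. The overview text was already lowercased and stripped of punctuation, so it needed little cleaning before analysis.

The original data set was somewhere in the realm of 40,000 rows. We were warned that loading all of it into pandas would crash the notebook.

I took a small sample (the first 100 rows, 75 after dropping names that contain punctuation) and fit a nearest neighbors model on the TF-IDF vectors of the DBpedia overviews.

We also pulled the full Wikipedia article text for the target person and each of the nearest neighbors through the Wikipedia API.

Using the Nearest Neighbors algorithm on both sources, the rankings from Wikipedia and DBpedia were compared. For Digby Morrell, Dean Greig was the closest match in both rankings, while Alan Roper fell from 2nd to 8th when the full Wikipedia pages were used.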