# Class 5

### TF-IDF Vectorizer

In [2]:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords


In [3]:
text_data = np.array(['I love Brazil. Brazil!','Sweden is best','Germany beats both'])

In [4]:
count = CountVectorizer()
count.fit_transform(text_data)

<3x8 sparse matrix of type '<class 'numpy.int64'>'
	with 8 stored elements in Compressed Sparse Row format>

In [25]:
count.get_feature_names_out()

array(['beats', 'best', 'both', 'brazil', 'germany', 'is', 'love',
       'sweden'], dtype=object)
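
The vocabulary above has no entry for "I" and stores everything in lowercase. A minimal sketch of why, assuming CountVectorizer's default settings (lowercasing on, and a token pattern that keeps only tokens of two or more word characters); it mimics the behaviour with the standard library and is not the library's actual implementation:

```python
import re
from collections import Counter

text_data = ['I love Brazil. Brazil!', 'Sweden is best', 'Germany beats both']

# Default scikit-learn token pattern: words of at least two characters
token_pattern = re.compile(r"(?u)\b\w\w+\b")

# Lowercase each document, then extract its tokens
tokenized = [token_pattern.findall(doc.lower()) for doc in text_data]

# Vocabulary is the sorted set of all tokens; "i" is dropped as too short
vocabulary = sorted({tok for doc in tokenized for tok in doc})

# Dense count matrix: one row per document, one column per vocabulary term
counts = [[Counter(doc)[term] for term in vocabulary] for doc in tokenized]

print(vocabulary)
for row in counts:
    print(row)
```

The first row contains a 2 in the "brazil" column, and the matrix has 8 non-zero entries in total, which matches the "8 stored elements" reported for the sparse matrix.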

In [26]:
tfidf = TfidfVectorizer()

In [27]:
feature_matrix = tfidf.fit_transform(text_data)
print(feature_matrix.toarray())


[[0.         0.         0.         0.89442719 0.         0.
  0.4472136  0.        ]
 [0.         0.57735027 0.         0.         0.         0.57735027
  0.         0.57735027]
 [0.57735027 0.         0.57735027 0.         0.57735027 0.
  0.         0.        ]]


In [28]:
tfidf.vocabulary_

{'love': 6,
 'brazil': 3,
 'sweden': 7,
 'is': 5,
 'best': 1,
 'germany': 4,
 'beats': 0,
 'both': 2}
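
The vocabulary_ dictionary maps each term to its column index in the feature matrix. As a small sketch, the mapping printed above can be turned back into the ordered list of column names, or inverted to look up a term by column:

```python
# Mapping copied from the tfidf.vocabulary_ output above
vocabulary_ = {'love': 6, 'brazil': 3, 'sweden': 7, 'is': 5,
               'best': 1, 'germany': 4, 'beats': 0, 'both': 2}

# Terms ordered by column index (same order as get_feature_names_out())
feature_names = sorted(vocabulary_, key=vocabulary_.get)

# Inverse mapping: column index -> term
index_to_term = {idx: term for term, idx in vocabulary_.items()}

print(feature_names)
print(index_to_term[3])
```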

In [29]:
# Take the column names from the same vectorizer that produced the matrix
df1 = pd.DataFrame(feature_matrix.toarray(), columns=tfidf.get_feature_names_out())
df1

Unnamed: 0,beats,best,both,brazil,germany,is,love,sweden
0,0.0,0.0,0.0,0.894427,0.0,0.0,0.447214,0.0
1,0.0,0.57735,0.0,0.0,0.0,0.57735,0.0,0.57735
2,0.57735,0.0,0.57735,0.0,0.57735,0.0,0.0,0.0
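
The values in the table can be reproduced by hand. The sketch below assumes TfidfVectorizer's default settings: raw term counts, the smoothed inverse document frequency idf(t) = ln((1 + n) / (1 + df(t))) + 1, and L2 normalisation of each row. The tokenised documents are written out manually here rather than produced by the vectorizer.

```python
import math

# Tokens per document after lowercasing and dropping one-letter words
docs = [
    ["love", "brazil", "brazil"],
    ["sweden", "is", "best"],
    ["germany", "beats", "both"],
]

vocab = sorted({t for d in docs for t in d})
n_docs = len(docs)

# Document frequency: number of documents containing each term
df = {t: sum(t in d for d in docs) for t in vocab}

# Smoothed idf, as used by scikit-learn when smooth_idf=True
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

def tfidf_row(doc):
    # Term count times idf, then scale the row to unit Euclidean length
    raw = [doc.count(t) * idf[t] for t in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]

rows = [tfidf_row(d) for d in docs]
for row in rows:
    print([round(v, 6) for v in row])
```

Every term occurs in exactly one document, so all idf values are equal and cancel during normalisation. The first row reduces to 2/sqrt(5) for "brazil" and 1/sqrt(5) for "love"; the other rows reduce to 1/sqrt(3) for each of their three terms.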


# Class 5 
## Lecture 4
### Demonstrate Text Preprocessing: Replacing and Correcting Words

#### 4.1 Illustrate Text Conversion to Lowercase
##### Text Conversion to Lowercase

- It is often necessary to convert text to lowercase so that words such as "Brazil" and "brazil" are treated as the same token.
- We can call Python's built-in lower() string method on our text to convert it to lowercase.


In [52]:
myString = "The 5 countries include China, United States,Indonesia,India and Brazil"

In [53]:
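# Use a name other than `str` so the built-in string type is not shadowed
lower_str = myString.lower()
print(lower_str)

the 5 countries include china, united states,indonesia,india and brazil


Challenge question: Create tokens from the above string using NLTK and convert only the first letter of each token to uppercase.

In [84]:
def upper_funct(text):
    # Tokenize the text, then capitalize each token
    # (capitalize() uppercases the first letter and lowercases the rest)
    word_toke = word_tokenize(text)
    x = [result.capitalize() for result in word_toke]
    return x

In [85]:
x = upper_funct(lower_str)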
print(x)

['The', '5', 'Countries', 'Include', 'China', ',', 'United', 'States', ',', 'Indonesia', ',', 'India', 'And', 'Brazil']
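
One detail worth noting about the challenge: capitalize() also lowercases everything after the first character. That makes no difference on already-lowercased input, but it changes mixed-case tokens. A short sketch comparing it with slicing, using a made-up token list:

```python
# Hypothetical tokens, including two with internal capitals
tokens = ['the', '5', 'countries', 'united', 'states', 'USA', 'mcDonald']

# capitalize(): first letter upper, the rest forced to lowercase
capitalized = [t.capitalize() for t in tokens]

# Slicing: only the first character is changed, the rest is left as-is
first_only = [t[:1].upper() + t[1:] for t in tokens]

print(capitalized)
print(first_only)
```

Here capitalize() turns 'USA' into 'Usa' and 'mcDonald' into 'Mcdonald', while the slicing version yields 'USA' and 'McDonald'.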


#### 4.2 Apply Number Removal, Punctuation Removal, and Whitespace Removal
##### Number removal 

Numbers are often irrelevant to a text analysis.
In Python they can be removed with regular expressions: the pattern \d+ matches every run of one or more digits.

In [41]:
import re

In [42]:
myString = "Box A has 4 red and 6 white balls, while Box B has 3 red and 5 blue balls"

In [43]:
output = re.sub(r'\d+',"",myString)
output

'Box A has  red and  white balls, while Box B has  red and  blue balls'
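
The output above keeps the spaces on both sides of each removed number, which leaves double spaces behind. Two possible ways to tidy this up, shown as a sketch (the variable names are illustrative):

```python
import re

my_string = "Box A has 4 red and 6 white balls, while Box B has 3 red and 5 blue balls"

# Original approach: digits removed, surrounding spaces left in place
digits_only = re.sub(r'\d+', '', my_string)

# Option 1: remove each number together with the whitespace that follows it
digits_and_space = re.sub(r'\d+\s*', '', my_string)

# Option 2: remove digits first, then collapse runs of whitespace
collapsed = re.sub(r'\s+', ' ', digits_only).strip()

print(digits_and_space)
print(collapsed)
```

Option 1 would also delete whitespace after digits embedded in a longer token, so Option 2 is the safer choice when the text contains identifiers that mix letters and digits.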

##### Punctuation removal
- We may want to remove punctuation from our text to simplify processing.
- Examples of such symbols include #, $, %, &, *, (, ), +, -, ., /, :, ;, <, =, >, ?, @, [, \, ], ^, _, `, {, |, }, and ~.


In [89]:
import string

In [103]:
myString = 'You,{$%are amazing students:at@@! at Lambton College ! 123 ;'

In [104]:
test_str = myString.translate(str.maketrans('','',string.punctuation))

In [105]:
print(test_str) 

Youare amazing studentsat at Lambton College  123 


Challenge question: Write code that uses re.sub to remove numbers and punctuation in a single step.

In [116]:
output = re.sub(r'[^a-zA-Z]+'," ",myString)
output

'You are amazing students at at Lambton College '
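
The pattern [^a-zA-Z]+ keeps ASCII letters only, so accented characters are removed along with digits and punctuation. If the text may contain such letters, one possible alternative is to remove non-word characters, digits, and underscores instead. The sample sentence below is made up for illustration:

```python
import re

sample = "Café No. 5, São Paulo!"

# ASCII-only pattern from the challenge answer: accented letters are lost
ascii_only = re.sub(r'[^a-zA-Z]+', ' ', sample)

# Unicode-aware alternative: drop anything that is not a letter
# (\W = non-word characters, \d = digits, _ = underscore)
unicode_aware = re.sub(r'[\W\d_]+', ' ', sample)

print(repr(ascii_only))
print(repr(unicode_aware))
```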

##### Whitespace removal
- We may want to work with text that has no leading or trailing whitespace.
- The strip() method removes spaces, tabs, and newlines from both ends of a string.


In [119]:
myString = "\t a sample string \t"
myString2 = "  a sample string  "
print(myString)
print(myString2)
print(myString.strip())
print(myString2.strip())


	 a sample string 	
  a sample string  
a sample string
a sample string
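
strip() only touches the two ends of the string. The sketch below shows the one-sided variants and a common idiom for collapsing whitespace inside the string as well; the sample string is illustrative:

```python
s = "\t  a   sample \n string \t"

both_ends = s.strip()         # remove whitespace from both ends
left_only = s.lstrip()        # remove leading whitespace only
right_only = s.rstrip()       # remove trailing whitespace only

# split() with no argument splits on any whitespace run,
# so joining with a single space normalises the interior too
normalised = " ".join(s.split())

# strip() also accepts the set of characters to remove
dashes_removed = "--a sample string--".strip("-")

print(repr(both_ends))
print(repr(normalised))
print(repr(dashes_removed))
```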


#### 4.3 Use Part-of-Speech (POS) Tagging
##### POS

- The goal of POS tagging is to assign a part of speech (noun, verb, adjective, and so on) to every word of the provided text.
- The tag is normally chosen based on the word's definition and its context.
- Several tools provide POS taggers, including NLTK and TextBlob.
- In this lecture, we will use TextBlob.
- It needs to be installed first (with pip on Windows).


In [120]:
pip install textblob

Collecting textblob
  Downloading textblob-0.17.1-py2.py3-none-any.whl (636 kB)
     -------------------------------------- 636.8/636.8 kB 5.0 MB/s eta 0:00:00
Installing collected packages: textblob
Successfully installed textblob-0.17.1
Note: you may need to restart the kernel to use updated packages.


In [121]:
from textblob import TextBlob
import nltk
nltk.download('tagsets')


[nltk_data] Downloading package tagsets to
[nltk_data]     C:\Users\camil\AppData\Roaming\nltk_data...
[nltk_data]   Package tagsets is already up-to-date!


In [127]:
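# Note: the two adjacent string literals are concatenated without a space,
# which is why 'blog.Blog' shows up as a single token in the output.
text = ('Codespeedy is a programming blog.' 'Blog posts contains articles and tutorials on python, CSS and even much more')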
tb = TextBlob(text)
print("POS TextBlob:",tb.tags)

POS TextBlob: [('Codespeedy', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('programming', 'VBG'), ('blog.Blog', 'NN'), ('posts', 'NNS'), ('contains', 'VBZ'), ('articles', 'NNS'), ('and', 'CC'), ('tutorials', 'NNS'), ('on', 'IN'), ('python', 'NN'), ('CSS', 'NNP'), ('and', 'CC'), ('even', 'RB'), ('much', 'RB'), ('more', 'JJR')]


Challenge question: Use an alternative technique based on NLTK tools to discover the POS for each word!

In [128]:
word_toke_POS = word_tokenize(text)

In [129]:
print("POS NLTK:", nltk.pos_tag(word_toke_POS))

POS NLTK: [('Codespeedy', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('programming', 'VBG'), ('blog.Blog', 'NN'), ('posts', 'NNS'), ('contains', 'VBZ'), ('articles', 'NNS'), ('and', 'CC'), ('tutorials', 'NNS'), ('on', 'IN'), ('python', 'NN'), (',', ','), ('CSS', 'NNP'), ('and', 'CC'), ('even', 'RB'), ('much', 'RB'), ('more', 'JJR')]


One difference is that with NLTK you must tokenize the text yourself before you can use the library's POS tagger. The outputs also show that NLTK keeps the comma as its own tagged token, while TextBlob drops it.
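
Once the text is tagged, a quick way to summarise the result is to count how often each tag occurs. The sketch below copies the NLTK output shown above into a list and tallies it with the standard library:

```python
from collections import Counter

# (token, tag) pairs copied from the NLTK output above
tagged = [('Codespeedy', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('programming', 'VBG'),
          ('blog.Blog', 'NN'), ('posts', 'NNS'), ('contains', 'VBZ'), ('articles', 'NNS'),
          ('and', 'CC'), ('tutorials', 'NNS'), ('on', 'IN'), ('python', 'NN'), (',', ','),
          ('CSS', 'NNP'), ('and', 'CC'), ('even', 'RB'), ('much', 'RB'), ('more', 'JJR')]

# Count occurrences of each tag
tag_counts = Counter(tag for _, tag in tagged)

# Collect the tokens that received a plural-noun tag
plural_nouns = [token for token, tag in tagged if tag == 'NNS']

print(tag_counts.most_common(3))
print(plural_nouns)
```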

How can we find the description of each POS tag?

In [132]:
nltk.help.upenn_tagset()

$: dollar
    $ -$ --$ A$ C$ HK$ M$ NZ$ S$ U.S.$ US$
'': closing quotation mark
    ' ''
(: opening parenthesis
    ( [ {
): closing parenthesis
    ) ] }
,: comma
    ,
--: dash
    --
.: sentence terminator
    . ! ?
:: colon or ellipsis
    : ; ...
CC: conjunction, coordinating
    & 'n and both but either et for less minus neither nor or plus so
    therefore times v. versus vs. whether yet
CD: numeral, cardinal
    mid-1890 nine-thirty forty-two one-tenth ten million 0.5 one forty-
    seven 1987 twenty '79 zero two 78-degrees eighty-four IX '60s .025
    fifteen 271,124 dozen quintillion DM2,000 ...
DT: determiner
    all an another any both del each either every half la many much nary
    neither no some such that the them these this those
EX: existential there
    there
FW: foreign word
    gemeinschaft hund ich jeux habeas Haementeria Herr K'ang-si vous
    lutihaw alai je jour objets salutaris fille quibusdam pas trop Monte
    terram fiche oui corporis ...
IN: preposition or

#### 4.4 Practice Named Entity Recognition
##### Named Entity Recognition

 
Named Entity Recognition (NER) seeks to locate named entities in text and classify them into predefined categories such as person names, organizations, locations, time expressions, quantities, monetary values, and percentages.

NER is used in many areas of Natural Language Processing (NLP), and it can help answer many real-world questions:

- Which companies were mentioned in the news article?
- Were specified products mentioned in complaints or reviews?
- Does the tweet contain the name of a person? Does the tweet contain this person’s location?


In [133]:
nltk.download('maxent_ne_chunker')

[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     C:\Users\camil\AppData\Roaming\nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!


True

In [135]:
nltk.download('words')

[nltk_data] Downloading package words to
[nltk_data]     C:\Users\camil\AppData\Roaming\nltk_data...
[nltk_data]   Package words is already up-to-date!


True

In [163]:
myString = '  Jack Nelson worked for Microsoft and attended a conference  in Italy. I study at Lambton college in Toronto'

In [167]:
from nltk.tokenize import sent_tokenize, word_tokenize

# tokenize the article into sentences: sentences
sentences = sent_tokenize(myString)

# tokenize  each sentence into words: token_sentences
token_sentences = [word_tokenize(sent) for sent in sentences]

# tag each tokenized sentence into parts of speech: pos_sentences
pos_sentences = [nltk.pos_tag(sent) for sent in token_sentences]

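# Create the named entity chunks: chunked_sentences
# ne_chunk_sents can return a one-shot iterator, so store the result in a list
# to be able to loop over it more than once
chunked_sentences = list(nltk.ne_chunk_sents(pos_sentences, binary=True))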

In [165]:
for sent in chunked_sentences:
    for chunk in sent:
        print(chunk)

(NE Jack/NNP Nelson/NNP)
('worked', 'VBD')
('for', 'IN')
(NE Microsoft/NNP)
('and', 'CC')
('attended', 'VBD')
('a', 'DT')
('conference', 'NN')
('in', 'IN')
(NE Italy/NNP)
('.', '.')
('I', 'PRP')
('study', 'VBP')
('at', 'IN')
(NE Lambton/NNP)
('college', 'NN')
('in', 'IN')
(NE Toronto/NNP)


Challenge question: Print only the named entities.

In [168]:
for sent in chunked_sentences:
    for chunk in sent:
        if hasattr(chunk, "label") and chunk.label() == 'NE':
            print(chunk)

(NE Jack/NNP Nelson/NNP)
(NE Microsoft/NNP)
(NE Italy/NNP)
(NE Lambton/NNP)
(NE Toronto/NNP)
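
The chunks printed above are nltk Tree objects whose leaves are (token, tag) pairs. To get each entity as a plain string, the leaves can be joined. The sketch below builds a small tree by hand, so it runs without downloading the chunker models; entity_strings is a hypothetical helper name, not part of NLTK:

```python
from nltk.tree import Tree

# Hand-built stand-in for one element of chunked_sentences
sentence = Tree('S', [
    Tree('NE', [('Jack', 'NNP'), ('Nelson', 'NNP')]),
    ('worked', 'VBD'),
    ('for', 'IN'),
    Tree('NE', [('Microsoft', 'NNP')]),
    ('in', 'IN'),
    Tree('NE', [('Italy', 'NNP')]),
])

def entity_strings(chunked_sentence):
    """Return the text of every NE subtree in a chunked sentence."""
    entities = []
    for chunk in chunked_sentence:
        # Plain (token, tag) tuples are not Trees and are skipped
        if isinstance(chunk, Tree) and chunk.label() == 'NE':
            entities.append(' '.join(token for token, tag in chunk.leaves()))
    return entities

entities = entity_strings(sentence)
print(entities)
```

With binary=True every entity carries the generic label NE, which is why the check compares against that single label.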


#### 4.5 Show Collocation Extraction and Synonyms

##### Discovering Word Collocations

- Collocations are two or more words that tend to appear frequently together, such as United States. Of course, there are many other words that can come after United, such as United Kingdom and United Airlines. As with many aspects of natural language processing, context is very important. And for collocations, context is everything!

- In the case of collocations, the context will be a document in the form of a list of words. Discovering collocations in this list of words means that we'll find common phrases that occur frequently throughout the text. For fun, we'll start with the script for Monty Python and the Holy Grail.

Getting ready
- The script for Monty Python and the Holy Grail is found in the webtext corpus, so be sure that it's unzipped at nltk_data/corpora/webtext/.
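
NLTK provides BigramCollocationFinder for this task. The sketch below runs it on a short made-up token list so that it works without the corpus; in the notebook, the lowercased word list from grail.txt would be passed to from_words instead. The frequency threshold and the number of results are arbitrary choices for the illustration.

```python
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures

# Toy token list standing in for the corpus words
tokens = ("the united states and the united kingdom signed "
          "the treaty in the united states").split()

# Build the finder from a flat list of words
finder = BigramCollocationFinder.from_words(tokens)

# Ignore bigrams that occur fewer than 2 times
finder.apply_freq_filter(2)

# Rank the remaining bigrams by log-likelihood ratio and keep the best 2
top = finder.nbest(BigramAssocMeasures.likelihood_ratio, 2)
print(top)
```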


Discovering word collocations in natural language processing (NLP) is the process of identifying and analysing combinations of words or phrases that occur frequently and tend to appear together in a text. Collocations are essentially words that occur together with a higher probability than would be expected by chance.

In NLP, discovering collocations matters because it helps us understand the relationships between words and improves tasks such as machine translation, sentiment analysis, and information extraction. Identifying relevant collocations makes it possible to build more accurate language models and to produce more coherent and understandable results.

There are several techniques for discovering word collocations. Common methods include frequency analysis of adjacent words, statistical association measures such as the log-likelihood (LL) score, the chi-squared (χ²) score, or the mutual information (MI) score, and language models based on n-grams.

These techniques identify relevant word combinations such as "machine learning", "sentiment analysis", and "neural networks", among many others. Understanding collocations improves automated text understanding and generation, which benefits many NLP applications.
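
To make the idea of an association measure concrete, here is a minimal sketch of pointwise mutual information for adjacent word pairs, written with the standard library only. The function name bigram_pmi, the min_count threshold, and the sample tokens are assumptions made for the illustration. The probabilities are simple relative frequencies, so this is not a drop-in replacement for NLTK's scoring.

```python
import math
from collections import Counter

def bigram_pmi(tokens, min_count=2):
    """Score adjacent word pairs by pointwise mutual information (log base 2)."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)

    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # rare pairs give unreliable, inflated scores
        p_pair = count / n_bi
        p_w1 = unigrams[w1] / n_uni
        p_w2 = unigrams[w2] / n_uni
        # PMI > 0 means the pair co-occurs more often than chance predicts
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

tokens = ("the united states and the united kingdom signed "
          "the treaty in the united states").split()
ranked = bigram_pmi(tokens)
for pair, score in ranked:
    print(pair, round(score, 3))
```

In this toy input, "the united" is the more frequent pair, yet "united states" scores higher because "the" is common on its own and therefore carries less information about what follows.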


In [169]:
import nltk

In [170]:
nltk.download("webtext")

[nltk_data] Downloading package webtext to
[nltk_data]     C:\Users\camil\AppData\Roaming\nltk_data...
[nltk_data]   Package webtext is already up-to-date!


True

In [171]:
from nltk.corpus import webtext

In [196]:
# First, let's check what our corpus looks like
words = [w.lower() for w in webtext.words('grail.txt')]
output = ' '.join(words)
print(output)



In [176]:
# Number of tokens in the corpus (len(output) would count characters of the joined string)
len(words)

16967

In [205]:
def word_toke(text):
    """Split a string into word tokens with NLTK."""
    return word_tokenize(text)

def remo_stopwords(text, language):
    """Remove the stopwords of the given language; takes and returns a string."""
    wordlist = set(stopwords.words(language))  # a set gives fast membership tests
    tokenize_word = word_tokenize(text)
    without_stopwords = [x for x in tokenize_word if x.lower() not in wordlist]
    return ' '.join(without_stopwords)

def remo_punctuation(text):
    """Replace every character that is neither a word character nor whitespace with a space."""
    return re.sub(r'[^\w\s]', ' ', text)


In [210]:
text_clean = word_toke(output)
print (text_clean)



In [214]:
# remo_stopwords tokenizes internally, so pass it the raw string rather than a token list
text_clean = remo_stopwords(output,"english")
print (text_clean)



In [215]:
text_clean = remo_punctuation(text_clean)
print (text_clean)
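
The cleaning steps can also be combined into a single function. The sketch below is a simplified stand-in that runs without NLTK data: it splits on whitespace instead of using word_tokenize, and STOPWORDS is a tiny illustrative set rather than NLTK's English list. The name clean_text and the sample sentence are assumptions for the example.

```python
import re

# Illustrative stopword set only; NLTK's English list is much longer
STOPWORDS = {"the", "of", "and", "a", "is", "in", "to"}

def clean_text(text, stopwords=STOPWORDS):
    """Lowercase, strip punctuation, tokenize on whitespace, drop stopwords."""
    text = text.lower()
    # Remove punctuation before tokenizing so it cannot stick to words
    text = re.sub(r'[^\w\s]', ' ', text)
    tokens = text.split()
    return [t for t in tokens if t not in stopwords]

tokens = clean_text("King Arthur: 'I am Arthur, King of the Britons!'")
print(tokens)
```

Removing punctuation before stopword filtering means a token such as "the," is reduced to "the" first and is then filtered out correctly.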

