## Text Processing

### Capturing Text Data
#### Plain Text

In [1]:
from google.colab import drive
drive.mount('/content/drive')


Mounted at /content/drive


In [2]:
import os

# Read in a plain text file
with open(os.path.join("/content/drive/MyDrive/NLP/data ", "hieroglyph.txt"), "r", encoding="utf-8") as f:
    text = f.read()
print(text)

Hieroglyphic writing dates from c. 3000 BC, and is composed of hundreds of symbols. A hieroglyph can represent a word, a sound, or a silent determinative; and the same symbol can serve different purposes in different contexts. Hieroglyphs were a formal script, used on stone monuments and in tombs, that could be as detailed as individual works of art.
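
The cell above reads the whole file in one call. A minimal sketch of the same step with `pathlib` follows, assuming the file is UTF-8 encoded. The helper name `read_text_file` and the temporary sample file are illustrative and not part of the notebook's data.

```python
import tempfile
from pathlib import Path


def read_text_file(path, encoding="utf-8"):
    """Return the full contents of a text file as one string."""
    # errors="replace" substitutes U+FFFD for undecodable bytes instead of raising
    return Path(path).read_text(encoding=encoding, errors="replace")


# Demonstrate on a throwaway file (hypothetical content)
with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "sample.txt"
    sample_path.write_text("Hieroglyphs were a formal script.\n", encoding="utf-8")
    sample = read_text_file(sample_path)

print(sample.strip())
```

Stating the encoding explicitly keeps the result the same across platforms whose default encoding differs.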


#### Tabular Data

In [3]:
import pandas as pd

# Load the news dataset into a dataframe
df = pd.read_csv("/content/drive/MyDrive/NLP/data /news.csv")

# Convert the text column to lowercase
df['title'] = df['title'].str.lower()
df.head()[['publisher', 'title']]

| | publisher | title |
|---|---|---|
| 0 | Livemint | fed's charles plosser sees high bar for change... |
| 1 | IFA Magazine | us open: stocks fall after fed official hints ... |
| 2 | IFA Magazine | fed risks falling 'behind the curve', charles ... |
| 3 | Moneynews | fed's plosser: nasty weather has curbed job gr... |
| 4 | NASDAQ | plosser: fed may have to accelerate tapering pace |
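
For comparison, the same column can be pulled out with the standard library alone, which may help when pandas is not installed. This is a rough sketch: only the column names `publisher` and `title` come from the notebook, and the rows below are made up for illustration.

```python
import csv
import io

# Made-up rows standing in for news.csv; only the column names match the notebook
raw_csv = io.StringIO(
    "publisher,title\n"
    "Example Wire,Markets Rally After Rate Decision\n"
    'Sample Daily,"Stocks Slip, Bonds Steady"\n'
    "Example Wire,\n"
)

titles = []
for row in csv.DictReader(raw_csv):
    # Normalise the title; a missing field becomes an empty string
    title = (row.get("title") or "").strip().lower()
    if title:  # skip rows whose title is missing or blank
        titles.append(title)

print(titles)
```

`csv.DictReader` handles the quoted comma in the second row, which a plain `line.split(",")` would not.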


#### Online Resource

In [4]:
import requests
import json

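# Fetch data from a REST API
r = requests.get("https://quotes.rest/qod.json", timeout=10)
r.raise_for_status()  # stop early on an HTTP error instead of parsing an error page
res = r.json()
print(json.dumps(res, indent=4))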


{
    "success": {
        "total": 1
    },
    "contents": {
        "quotes": [
            {
                "quote": "Not every day is going to offer us a chance to save somebody's life, but every day offers us an opportunity to affect one.",
                "length": "122",
                "author": "Mark Bezos",
                "tags": {
                    "0": "inspire",
                    "1": "life",
                    "3": "tod"
                },
                "category": "inspire",
                "language": "en",
                "date": "2022-05-19",
                "permalink": "https://theysaidso.com/quote/mark-bezos-not-every-day-is-going-to-offer-us-a-chance-to-save-somebodys-life-bu",
                "id": "9V2lXZnG7op9S4S8prwXfQeF",
                "background": "https://theysaidso.com/img/qod/qod-inspire.jpg",
                "title": "Inspiring Quote of the day"
            }
        ]
    },
    "baseurl": "https://theysaidso.com",
    "copyright": {
      

In [5]:
# Extract relevant object and field
q=res['contents']['quotes'][0]
print(q["quote"], "\n--", q["author"])

Not every day is going to offer us a chance to save somebody's life, but every day offers us an opportunity to affect one. 
-- Mark Bezos
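
Indexing with `res['contents']['quotes'][0]` raises an exception if the API returns a different payload, for example when a rate limit is hit. Below is a hedged sketch of a more defensive lookup. The function name `first_quote` and both sample payloads are hypothetical, and the error shape in particular is an assumption rather than the service's documented format.

```python
def first_quote(payload):
    """Return (quote, author) from a quote-of-the-day payload, or None."""
    # Fall back to empty containers when a level of nesting is absent
    quotes = payload.get("contents", {}).get("quotes", [])
    if not quotes:
        return None
    first = quotes[0]
    return first.get("quote"), first.get("author")


# Hypothetical payloads for illustration only
sample_ok = {"contents": {"quotes": [{"quote": "An example quote.", "author": "A. Nonymous"}]}}
sample_error = {"error": {"code": 429, "message": "Too Many Requests"}}

print(first_quote(sample_ok))
print(first_quote(sample_error))
```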


#### Cleaning


In [6]:
import requests

# Fetch a web page
r = requests.get("https://news.ycombinator.com")
print(r.text)

<html lang="en" op="news"><head><meta name="referrer" content="origin"><meta name="viewport" content="width=device-width, initial-scale=1.0"><link rel="stylesheet" type="text/css" href="news.css?F2tdhQa6ArfaD03Cf9Jc">
        <link rel="shortcut icon" href="favicon.ico">
          <link rel="alternate" type="application/rss+xml" title="RSS" href="rss">
        <title>Hacker News</title></head><body><center><table id="hnmain" border="0" cellpadding="0" cellspacing="0" width="85%" bgcolor="#f6f6ef">
        <tr><td bgcolor="#ff6600"><table border="0" cellpadding="0" cellspacing="0" width="100%" style="padding:2px"><tr><td style="width:18px;padding-right:4px"><a href="https://news.ycombinator.com"><img src="y18.gif" width="18" height="18" style="border:1px white solid;"></a></td>
                  <td style="line-height:12pt; height:10px;"><span class="pagetop"><b class="hnname"><a href="news">Hacker News</a></b>
              <a href="newest">new</a> | <a href="front">past</a> | <a href=

In [7]:
import re

# Remove HTML tags using a regular expression
pattern = re.compile(r'<.*?>')  # non-greedy match for anything of the form <...>
print(pattern.sub(" ", r.text))  # replace each tag with a single space

     
         
           
         Hacker News     
                 
                      Hacker News  
               new  |  past  |  comments  |  ask  |  show  |  jobs  |  submit                 
                               login 
                            
                  
     
               
        1.                  Outhorse Your Email   (  visiticeland.com  )       
         221 points  by  eorri    2 hours ago      |  hide  |  40&nbsp;comments                 
        
                 
        2.                  I'm an addict   (  bearblog.dev  )       
         71 points  by  tarunreddy    2 hours ago      |  hide  |  23&nbsp;comments                 
        
                 
        3.                  You Eat a Credit’s Card Worth of Plastic Every Week   (  nautil.us  )       
         54 points  by  dnetesn    1 hour ago      |  hide  |  41&nbsp;comments                 
        
                 
        4.                  Happy Birthday, Libera Chat   
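
The regex output above still contains HTML entities such as `&nbsp;` and long runs of whitespace. One possible follow-up, sketched here with the standard library, decodes the entities and collapses the whitespace. The helper name `strip_tags` and the sample snippet are illustrative, and a regex remains a rough tool for HTML in general.

```python
import html
import re

# Same non-greedy idea as above; DOTALL lets a tag span several lines
TAG_RE = re.compile(r"<.*?>", re.DOTALL)


def strip_tags(markup):
    """Remove tags, decode entities, and collapse whitespace."""
    text = TAG_RE.sub(" ", markup)
    text = html.unescape(text)  # e.g. "&nbsp;" becomes a non-breaking space
    # \s also matches the non-breaking space, so it is collapsed as well
    return re.sub(r"\s+", " ", text).strip()


snippet = '<td class="subtext">221 points | <a href="item?id=1">40&nbsp;comments</a></td>'
print(strip_tags(snippet))
```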

In [8]:
from bs4 import BeautifulSoup

# Remove HTML tags using the Beautiful Soup library
soup = BeautifulSoup(r.text, 'html5lib')
print(soup.get_text())


        
          
        Hacker News
        
                  Hacker News
              new | past | comments | ask | show | jobs | submit            
                              login
                          
              

              
      1.      Outhorse Your Email (visiticeland.com)
        221 points by eorri 2 hours ago  | hide | 40 comments              
      
                
      2.      I'm an addict (bearblog.dev)
        71 points by tarunreddy 2 hours ago  | hide | 23 comments              
      
                
      3.      You Eat a Credit’s Card Worth of Plastic Every Week (nautil.us)
        54 points by dnetesn 1 hour ago  | hide | 41 comments              
      
                
      4.      Happy Birthday, Libera Chat (libera.chat)
        188 points by mcwhy 5 hours ago  | hide | 96 comments              
      
                
      5.      Poll: Self Hosting Git Repositories
        19 points by gmemstr 59 minutes ago  | hide | 35 comments

In [9]:
# Find all articles
summaries = soup.find_all("tr", class_="athing")
summaries[1]

<tr class="athing" id="31426683">
      <td align="right" class="title" valign="top"><span class="rank">2.</span></td>      <td class="votelinks" valign="top"><center><a href="vote?id=31426683&amp;how=up&amp;goto=news" id="up_31426683"><div class="votearrow" title="upvote"></div></a></center></td><td class="title"><a class="titlelink" href="https://tarunreddy.bearblog.dev/addict/">I'm an addict</a><span class="sitebit comhead"> (<a href="from?site=bearblog.dev"><span class="sitestr">bearblog.dev</span></a>)</span></td></tr>

In [10]:
# Extract title
summaries[0].find("a", class_="titlelink").get_text().strip()

'Outhorse Your Email'

In [15]:
# Find all articles, extract titles
articles = []
summaries = soup.find_all("tr", class_="athing")
for summary in summaries:
    title = summary.find("a", class_="titlelink").get_text().strip()
    articles.append(title)

print(len(articles), "Article summaries found. Sample:")
print(articles[0])

30 Article summaries found. Sample:
Outhorse Your Email


#### Normalization
##### Case Normalization

In [16]:
# Sample text
text = "The first time you see The Second Renaissance it may look boring. Look at it at least twice and definitely watch part 2. It will change your view of the matrix. Are the human people the ones who started the war ? Is AI a bad thing ?"
print(text)

The first time you see The Second Renaissance it may look boring. Look at it at least twice and definitely watch part 2. It will change your view of the matrix. Are the human people the ones who started the war ? Is AI a bad thing ?


In [17]:
# Convert to lowercase
text = text.lower() 
print(text)

the first time you see the second renaissance it may look boring. look at it at least twice and definitely watch part 2. it will change your view of the matrix. are the human people the ones who started the war ? is ai a bad thing ?


##### Punctuation Removal

In [18]:
import re

# Replace every character that is not an ASCII letter or digit with a space
text = re.sub(r"[^a-zA-Z0-9]", " ", text)
print(text)

the first time you see the second renaissance it may look boring  look at it at least twice and definitely watch part 2  it will change your view of the matrix  are the human people the ones who started the war   is ai a bad thing  
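
The pattern `[^a-zA-Z0-9]` keeps only ASCII letters and digits, so accented letters are treated like punctuation. The sketch below contrasts it with a Unicode-aware alternative, `[^\w\s]`, on a made-up sample string. Note that `\w` also keeps the underscore, which may or may not be wanted.

```python
import re

sample = "Zoë's café: naïve?"  # made-up text containing accented letters

# ASCII-only pattern from the cell above: accented letters become spaces
ascii_only = re.sub(r"[^a-zA-Z0-9]", " ", sample)

# Unicode-aware alternative: drop anything that is neither a word character nor whitespace
unicode_aware = re.sub(r"[^\w\s]", " ", sample)

print(ascii_only.split())
print(unicode_aware.split())
```

For English-only text the two patterns behave almost identically, so the simpler ASCII version is often sufficient.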


#### Tokenization

In [19]:
# Split text into tokens (words)
words = text.split()
print(words)

['the', 'first', 'time', 'you', 'see', 'the', 'second', 'renaissance', 'it', 'may', 'look', 'boring', 'look', 'at', 'it', 'at', 'least', 'twice', 'and', 'definitely', 'watch', 'part', '2', 'it', 'will', 'change', 'your', 'view', 'of', 'the', 'matrix', 'are', 'the', 'human', 'people', 'the', 'ones', 'who', 'started', 'the', 'war', 'is', 'ai', 'a', 'bad', 'thing']
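
Case normalization, punctuation removal, and whitespace tokenization can also be folded into one step. This is a minimal sketch under the same ASCII-only assumption as the cells above. The function name `tokenize` is illustrative.

```python
import re


def tokenize(text):
    """Lowercase the text and return runs of ASCII letters or digits as tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


tokens = tokenize("Look at it at least twice and definitely watch part 2.")
print(tokens)
```

Using `re.findall` avoids building the intermediate string with punctuation replaced by spaces, while producing the same kind of token list as `text.split()` above.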


#### NLTK: Natural Language ToolKit

In [21]:
from nltk.tokenize import word_tokenize