<a href="https://colab.research.google.com/github/hyu623/week4trial/blob/main/Week_4_trial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Week 4 - Trial

(Be sure to save a copy to your Drive.)

Text data is a bit different from numeric data. With numbers, we can quickly compute the average or find the highest and lowest values in a range to get a sense of what we are dealing with. We can't really do that with text, so we'll focus on some tools for analyzing it. We'll start with a library called [TextBlob](https://textblob.readthedocs.io/en/dev/).

In [1]:
#Load up our libraries
from textblob import TextBlob
from google.colab import drive

#these should look familiar
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import requests

#Some extra libraries we'll need for text analysis
import nltk
nltk.download('punkt')
nltk.download('brown')
nltk.download('punkt_tab')


#Connect to Gdrive
drive.mount('/content/gdrive')

print("Libraries and Drive Ready!")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Unzipping corpora/brown.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


Mounted at /content/gdrive
Libraries and Drive Ready!


# Noun Phrases for the Diary

Now let's generate the noun phrases for January's entries

# Text and Text Files

Our week 2 & week 3 warmup material introduced some ideas about working with files in Google Drive and in our Colab environment. Since we are dealing with text analysis right now, we'll take a moment to talk about text files as well.

Sometimes we want to take a string variable and write it to a file so that we can use it at a later time.

We'll also make use of automatically grabbing content from the web using the [requests](https://pypi.org/project/requests/) library just like we did in week 1. We are going to grab a book from the [Project Gutenberg](https://www.gutenberg.org/) site as our example.

This is technically an example of screen scraping, i.e., programmatically grabbing content from the web with an automated tool. This is the type of thing that AI bots are doing, and arguably it is [ruining](https://library.unc.edu/news/library-it-vs-the-ai-bots/) the web.

In [2]:
#We'll be using the H.G. Wells book - The Invisible Man (https://www.gutenberg.org/ebooks/5230)
#but we'll focus on the plain text version
book_text_url = "https://www.gutenberg.org/cache/epub/5230/pg5230.txt"

response = requests.get(book_text_url)

In [3]:
#we now have a string variable (response.text) which holds the whole text of the book
response.text



In [4]:
#File I/O in Python is a whole week of content on its own
#but quickly, the 'w' means we are writing to the file
#(encoding='utf-8' keeps non-ASCII characters intact on any platform)
with open('invisible_man.txt', 'w', encoding='utf-8') as f:
    f.write(response.text)
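
Writing is only half of the story: the point of saving a string is to read it back later. Here is a minimal sketch of that round trip. The sample text and the file name `example.txt` are made up for illustration, and a temporary folder is used so nothing is left behind.

```python
import tempfile
from pathlib import Path

# Made-up sample text (includes a non-ASCII character on purpose)
text = "The stranger came early in February.\nProject Gutenberg™ sample line.\n"

with tempfile.TemporaryDirectory() as folder:
    path = Path(folder) / "example.txt"  # hypothetical file name

    # 'w' opens the file for writing (and overwrites anything already there)
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)

    # 'r' opens the file for reading; f.read() returns the whole file as one string
    with open(path, "r", encoding="utf-8") as f:
        restored = f.read()

# The string we read back should match the one we wrote
print(restored == text)
```

Using the same `encoding` for writing and reading is what keeps characters like `™` from being garbled.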

In [5]:
#Shell command (run with the ! prefix) to display the contents of the folder
!ls -l

total 308
drwx------ 6 root root   4096 Feb 17 15:02 gdrive
-rw-r--r-- 1 root root 306429 Feb 17 15:02 invisible_man.txt
drwxr-xr-x 1 root root   4096 Jan 16 14:24 sample_data
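
If we want the same kind of listing as a Python value that we can loop over, the standard library's `pathlib` module can help. This is just a sketch; the helper name `list_folder` is our own and not part of any library.

```python
from pathlib import Path

def list_folder(folder="."):
    """Return (name, size_in_bytes) pairs for a folder, sorted by name.

    Sub-folders get a size of None. `list_folder` is a hypothetical helper.
    """
    entries = []
    for path in sorted(Path(folder).iterdir()):
        size = path.stat().st_size if path.is_file() else None
        entries.append((path.name, size))
    return entries

# Print the contents of the current working folder
for name, size in list_folder("."):
    print(name, size if size is not None else "(folder)")
```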



# One final activity: Automatic Keyword Generator

Let's put everything we have learned together to create an automatic keyword generator that identifies noun phrases in a book from Project Gutenberg.

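The example book for this activity is [The Prince](https://en.wikipedia.org/wiki/The_Prince), but the cells below apply the same steps to three other Project Gutenberg titles: 隨園詩話, 狄公案, and Der Struwwelpeter.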


In [6]:
keywords = dict()

# We are using 隨園詩話 - https://www.gutenberg.org/ebooks/52206
book_url = "https://www.gutenberg.org/cache/epub/52206/pg52206.txt"
book_title = "隨園詩話"


print("Downloading book...")
book = requests.get(book_url)

#save a copy of the downloaded book as a text file
with open(book_title+'.txt', 'w', encoding='utf-8') as f:
    f.write(book.text)

#Turn text into text blob
book_blob = TextBlob(book.text)


print("Identifying noun phrases and building frequency dictionary...")

#Go through all noun phrases and count how often each one appears
for np in book_blob.noun_phrases:
    if np in keywords:
        keywords[np] += 1
    else:
        keywords[np] = 1

noun_phrases = ""
#Sort dictionary by count (highest first) and print top 20 entries
print("Most common noun phrases...")

for np in sorted(keywords, key=keywords.get, reverse=True)[0:20]:
    noun_phrases += np + ","+str(keywords[np])+"\n"
    print(np, keywords[np])

#save the top 20 phrases and their counts, one "phrase,count" pair per line
with open(book_title+'_keywords.txt', 'w', encoding='utf-8') as f:
    f.write(noun_phrases)

Downloading book...
Identifying noun phrases and building frequency dictionary...
Most common noun phrases...
project gutenberg™ 46
electronic works 15
project gutenberg 13
project gutenberg literary archive 12
electronic work 11
project gutenberg™ license 8
u.s. 7
copyright law 5
information 5
phrase “ 4
copyright holder 4
trademark license 3
free distribution 3
paragraph 1.f.3 3
restrictions whatsoever 2
project gutenberg license 2
title 2
author 2
隨園詩話 * * * 2
terms 2
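
Most of the phrases in the output above come from the Project Gutenberg license text rather than from the book itself. (TextBlob's default noun phrase extraction seems to be aimed at English, which may be why the Chinese body text contributes so little; treat that as an assumption worth checking.) One way to reduce the noise is to cut everything outside the `*** START OF ...` and `*** END OF ...` marker lines before building the blob. The sketch below assumes the file uses those marker lines; the helper name `strip_gutenberg_boilerplate` and the sample text are made up for illustration.

```python
import re

# Marker lines assumed to wrap the body of a Project Gutenberg plain-text file
START_MARKER = re.compile(
    r"\*\*\* ?START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK[^\n]*\n", re.IGNORECASE
)
END_MARKER = re.compile(
    r"\*\*\* ?END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK", re.IGNORECASE
)

def strip_gutenberg_boilerplate(text):
    """Return only the text between the START and END marker lines.

    If a marker is missing, that side of the text is left untouched.
    """
    start = START_MARKER.search(text)
    begin = start.end() if start else 0
    end = END_MARKER.search(text, begin)
    finish = end.start() if end else len(text)
    return text[begin:finish].strip()

# Made-up miniature "book" for demonstration
sample = (
    "The Project Gutenberg eBook of Example\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "Chapter 1\nThe stranger arrived in February.\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "Project Gutenberg license text...\n"
)
body = strip_gutenberg_boilerplate(sample)
print(body)
```

In the cell above, something like `TextBlob(strip_gutenberg_boilerplate(book.text))` could then replace `TextBlob(book.text)`.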


In [None]:
keywords = dict()

# We are using 狄公案 - https://www.gutenberg.org/ebooks/27686
book_url = "https://www.gutenberg.org/cache/epub/27686/pg27686.txt"
book_title = "狄公案"


print("Downloading book...")
book = requests.get(book_url)

#save a copy of the downloaded book as a text file
with open(book_title+'.txt', 'w', encoding='utf-8') as f:
    f.write(book.text)

#Turn text into text blob
book_blob = TextBlob(book.text)


print("Identifying noun phrases and building frequency dictionary...")

#Go through all noun phrases and count how often each one appears
for np in book_blob.noun_phrases:
    if np in keywords:
        keywords[np] += 1
    else:
        keywords[np] = 1

noun_phrases = ""
#Sort dictionary by count (highest first) and print top 20 entries
print("Most common noun phrases...")

for np in sorted(keywords, key=keywords.get, reverse=True)[0:20]:
    noun_phrases += np + ","+str(keywords[np])+"\n"
    print(np, keywords[np])

#save the top 20 phrases and their counts, one "phrase,count" pair per line
with open(book_title+'_keywords.txt', 'w', encoding='utf-8') as f:
    f.write(noun_phrases)

Downloading book...
Identifying noun phrases and building frequency dictionary...
Most common noun phrases...
project gutenberg™ 46
electronic works 15
project gutenberg 13
project gutenberg literary archive 12
electronic work 11
project gutenberg™ license 8
u.s. 7
copyright law 5
information 5
phrase “ 4
copyright holder 4
trademark license 3
free distribution 3
paragraph 1.f.3 3
restrictions whatsoever 2
project gutenberg license 2
january 2
terms 2
paragraph 1.e.8 2
full terms 2
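
The count-then-sort pattern used in these cells (an `if`/`else` that builds a dictionary, followed by `sorted`) is common enough that Python's standard library has a shortcut for it: `collections.Counter`. Here is a sketch using a made-up list of phrases; the helper name `top_phrases` is our own.

```python
from collections import Counter

def top_phrases(phrases, n=20):
    """Count each phrase and return the n most common as (phrase, count) pairs."""
    return Counter(phrases).most_common(n)

# Made-up sample standing in for a list of noun phrases
sample_phrases = [
    "invisible man", "mr. marvel", "invisible man",
    "iping", "invisible man", "mr. marvel",
]

for phrase, count in top_phrases(sample_phrases, n=2):
    print(phrase, count)
```

Since `book_blob.noun_phrases` is a list-like collection of strings, it should be possible to pass it in place of `sample_phrases`.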


In [None]:
keywords = dict()

# We are using Der Struwwelpeter - https://www.gutenberg.org/ebooks/24571
book_url = "https://www.gutenberg.org/cache/epub/24571/pg24571.txt"
book_title = "Der Struwwelpeter"


print("Downloading book...")
book = requests.get(book_url)

#save a copy of the downloaded book as a text file
with open(book_title+'.txt', 'w', encoding='utf-8') as f:
    f.write(book.text)

#Turn text into text blob
book_blob = TextBlob(book.text)


print("Identifying noun phrases and building frequency dictionary...")

#Go through all noun phrases and count how often each one appears
for np in book_blob.noun_phrases:
    if np in keywords:
        keywords[np] += 1
    else:
        keywords[np] = 1

noun_phrases = ""
#Sort dictionary by count (highest first) and print top 20 entries
print("Most common noun phrases...")

for np in sorted(keywords, key=keywords.get, reverse=True)[0:20]:
    noun_phrases += np + ","+str(keywords[np])+"\n"
    print(np, keywords[np])

#save the top 20 phrases and their counts, one "phrase,count" pair per line
with open(book_title+'_keywords.txt', 'w', encoding='utf-8') as f:
    f.write(noun_phrases)

Downloading book...
Identifying noun phrases and building frequency dictionary...
Most common noun phrases...
project gutenberg™ 46
illustration 36
electronic works 15
project gutenberg 13
und 13
suppe 13
er 13
kind 12
ich 12
project gutenberg literary archive 12
electronic work 11
da 9
die geschichte 8
miau 8
mio 8
doch 8
project gutenberg™ license 8
und der 7
der 7
hund 7
