## Aman's ML process:
1. Problem
2. Data
3. Algorithm
4. Discovery

## Auto-Summarizing Text
Find a document and auto-summarize it. It can be a blog post, news article, or research paper. Use a machine-based approach for this.

* get data
* parse
* analyze

### Reference
* [build a quick summarizer with Python and NLTK](https://dev.to/davidisrawi/build-a-quick-summarizer-with-python-and-nltk)
* [Stack Abuse: Text Summarization with NLTK in Python](https://stackabuse.com/text-summarization-with-nltk-in-python/)
* [Towards Data Science has a great example with a summarizer deployed on the command line](https://towardsdatascience.com/write-a-simple-summarizer-in-python-e9ca6138a08e)

* Ideas:
    * [Using Deep Learning to AutoSummarize Email](https://medium.com/jatana/unsupervised-text-summarization-using-sentence-embeddings-adb15ce83db1)
    * [TextRank for Text Summarization](https://nlpforhackers.io/textrank-text-summarization/)

### 1. Get Data
NLTK has some great [notes](https://www.nltk.org/book/ch03.html) on pre-processing

In [145]:
import requests
from bs4 import BeautifulSoup

article_url = 'https://www.websitebuilderexpert.com/building-websites/how-much-should-a-website-cost/'
headers = {
    'User-Agent': 'Mozilla/5.0'
}
html = requests.get(article_url, headers = headers)
soup = BeautifulSoup( html.content.decode('utf-8', 'ignore') , 'html.parser')

In [157]:
stitle = soup.find('h1', class_='entry-title').get_text()
parsed_article = soup.find('div', class_='entry')
paras = parsed_article.find_all('p')
# Start from an empty string, then append each paragraph's text plus a space
stext = ''
for p in paras:
    stext += p.get_text() + ' '

Replace non-breaking spaces and newlines with plain spaces and set everything to lower case (?)  
The regex that adds a missing space after a full stop comes from [Stack Overflow](https://stackoverflow.com/questions/42445842/how-to-split-text-into-sentences-when-there-is-no-space-after-full-stop).

In [186]:
import re
raw = stext.replace(u'\xa0',' ')
raw = raw.replace(u'\n',' ')
raw = raw.lower()
raw = re.sub(r'\.(?=[^ \W\d])','. ', raw)

In [185]:
test = 'this is what consumerism is all about.so, let’s put money (as a resource) aside for now,'
re.sub(r'\.(?=[^ \W\d])','. ', test)

'this is what consumerism is all about. so, let’s put money (as a resource) aside for now,'
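
To see what the lookahead in that pattern does and does not touch, here is a small sketch. The helper name `add_space_after_full_stop` and the sample strings are made up for illustration. The pattern matches a full stop only when the next character is a letter or underscore, so decimals are left alone, but dotted names such as domains get split as a side effect.

```python
import re

# Same pattern as in the cell above: a full stop followed directly by a
# word character that is not a digit (a letter or an underscore).
pattern = re.compile(r'\.(?=[^ \W\d])')

def add_space_after_full_stop(text):
    """Insert a space after any full stop that runs straight into a letter."""
    return pattern.sub('. ', text)

samples = [
    'all about.so, let us move on',   # missing space: gets fixed
    'plans start at $5.00 a month',   # decimal number: left alone
    'see example.com for details',    # domain name: split (side effect)
]
for s in samples:
    print(add_space_after_full_stop(s))
```

If the article contains many URLs or file names, it may be worth stripping those before applying the substitution.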

### 2. Parse
#### Preliminary

In [37]:
import nltk
import string

#### Tokenization

In [187]:
from nltk import word_tokenize
from nltk import sent_tokenize

raw_nopunc = re.sub( r'[^\w\s]',' ',raw)
word_tok = word_tokenize(raw_nopunc)
sent_tok = sent_tokenize(raw)

#### Stopword Removal

In [188]:
from nltk.corpus import stopwords
my_stopwords = set( stopwords.words('english') + list(string.punctuation))
my_words = [ w for w in word_tok if w not in my_stopwords]

#### N-grams

In [47]:
from nltk import ngrams
n = 3
trigrams = ngrams( my_words, n)
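
Note that `ngrams` returns a lazy iterator, so `trigrams` can only be looped over once; wrap it in `list(...)` to reuse it. To make clear what a trigram is, here is a plain-Python sketch of the same idea. The helper `make_ngrams` and the sample tokens are made up for illustration and are not part of NLTK.

```python
def make_ngrams(tokens, n):
    """Return every run of n consecutive tokens as a list of tuples."""
    # zip stops at the shortest slice, so incomplete runs at the end are dropped
    return list(zip(*(tokens[i:] for i in range(n))))

sample_words = ['website', 'builder', 'drag', 'drop', 'editor']  # made-up tokens
for gram in make_ngrams(sample_words, 3):
    print(gram)
# ('website', 'builder', 'drag')
# ('builder', 'drag', 'drop')
# ('drag', 'drop', 'editor')
```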

### 3. Analyze

#### Weighted Frequency of Occurrence

In [189]:
word_freq = {}
for w in my_words:
    if w not in word_freq.keys():
        word_freq[w] = 1
    else:
        word_freq[w] += 1

Sorting the dictionary by value, thanks to this [tip on Stack Overflow](https://stackoverflow.com/questions/613183/how-do-i-sort-a-dictionary-by-value).

In [190]:
word_freq_sort = sorted( word_freq.items() , key = lambda kv: kv[1], reverse= True)
for key, value in word_freq_sort[:10]:
    print(f'{key}:{value}')

website:693
cost:237
design:155
builder:128
time:127
wordpress:124
need:113
drag:103
drop:99
building:97
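
As an alternative, the counting loop and the sort above can be collapsed into one step with `collections.Counter` from the standard library. This is a sketch on a made-up token list rather than on `my_words`, so the numbers are illustrative only.

```python
from collections import Counter

sample_words = ['website', 'cost', 'website', 'design', 'cost', 'website']  # made-up tokens

# Counter builds the word -> count mapping in one pass
word_freq = Counter(sample_words)

# most_common(k) returns the k highest counts, already sorted in descending order
for word, count in word_freq.most_common(2):
    print(f'{word}:{count}')
# website:3
# cost:2
```

A `Counter` is a `dict` subclass, so the rest of the notebook (for example `word_freq.items()`) should work unchanged if it were swapped in.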


In [191]:
most_freq = word_freq_sort[0][1]
word_freq_score = { k: v/ most_freq for k,v in word_freq_sort}
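
The weighted frequency is each word's count divided by the count of the most frequent word, so the top word scores 1.0 and everything else falls between 0 and 1. A quick worked example, using the first three counts from the printed output above:

```python
# Counts copied from the top of the printed frequency list
top_counts = {'website': 693, 'cost': 237, 'design': 155}

most_freq = max(top_counts.values())  # 693, the count for 'website'
scores = {w: c / most_freq for w, c in top_counts.items()}

for w, s in scores.items():
    print(f'{w}: {s:.3f}')
# website: 1.000
# cost: 0.342
# design: 0.224
```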

In [192]:
sent_scored = {}
for sent in sent_tok:
    iscore = 0
    for w in word_tokenize(sent):
        if w in word_freq_score.keys():
            iscore += word_freq_score[w]
    sent_scored[sent] = iscore
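
Because the score is a plain sum, long sentences tend to win simply by containing more words. One possible variation, which is not what this notebook does, is to divide the sum by the sentence length. The sketch below uses a made-up helper `score_sentence`, made-up scores, and `str.split` instead of `word_tokenize` to stay dependency-free.

```python
def score_sentence(sentence, word_scores, normalize=False):
    """Sum the scores of known words; optionally divide by the word count."""
    words = sentence.split()
    total = sum(word_scores.get(w, 0.0) for w in words)
    if normalize and words:
        return total / len(words)
    return total

toy_scores = {'website': 1.0, 'cost': 0.5}  # made-up weighted frequencies
short = 'website cost'
long_ = 'website cost and then many other words that add nothing website'

for sent in (short, long_):
    raw = score_sentence(sent, toy_scores)
    norm = score_sentence(sent, toy_scores, normalize=True)
    print(f'raw={raw:.2f} normalized={norm:.2f} | {sent}')
```

With the raw sum the long sentence ranks first; with normalization the short one does. Which behaviour is preferable depends on the kind of summary wanted.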

In [193]:
sent_sorted = sorted( sent_scored.items(), key = lambda kv: kv[1], reverse = True)
print(f'Auto-summary: Top 10 sentences with scores\n----------------------------')
for k,v in sent_sorted[:10]:
    print(f'{k}: {v}\n')

Auto-summary: Top 10 sentences with scores
----------------------------
summary chart of cost of website using drag & drop website builder: website setup costs automated setup: $0 website builder software learning costs time: few minutes – 1 hour website design costs free templates: $0 website building costs free if you do it yourself (but will cost you time) hourly costs of a designer: $30 – $60/hour average cost of content population: $500 – $2,000 website maintenance costs wix:                 $5.00 | $10.00 | $14.00 | $17.00 | $25.00 squarespace: $12.0 | $18.0 | $26.0 | $40.0 weebly:           $8.0 | $12.0 | $25.0 | $49.0 *monthly fee, based on annual plans* all plans come with dedicated, 24/7 support.: 11.359307359307367

summary chart of hiring a professional to help you build a wordpress website: website setup costs hosting cost: $5 – $250/month hiring pro to setup: $50 – $200 (1-time fee) time: 1 to 6 hours website builder software learning costs paid tutorials: $50/month time 