# HW-2: 

* Classification and regression are among the most common forms of supervised machine learning.
* In this homework we will explore regression and multi-class classification as applied to text data.
* I recommend starting the data collection and preparation as soon as possible, then doing the neural network training after you have completed the labs.
* Once the data is collected, this HW should be quite easy, since most of the code can be recycled from the labs and textbook.
* **You can do this assignment in either Keras OR PyTorch. It is your choice.**

**Instructions** 
* **Develop a text-based classification and regression data set:**
  * Use the Wikipedia API to search for articles to generate the data set
  * Select a set of highly different topics (i.e. labels), for example:
    * multi-class case: y=(pizza, oak_trees, joe_biden, ... , etc)=(0,1,2, ... , N-1)
    * You don't have to use these; you can use whatever labels you want, but use at least 3 labels.
  * Search for Wikipedia pages about these topics and harvest the text from the pages.
  * Do some basic text cleaning as needed. 
  * Use the NLTK sentence tokenizer to break the text into sentences. Then form chunks of text that are five sentences long as your "inputs".
  * The "label" for these chunks will be the search label used to find the text. 
  * Also "tag" each chunk of text with an associated "compound" sentiment score computed using NLTK sentiment analysis.
  * The data set will not be perfect. 
    * There will be chunks of text that are not related to the topic (i.e. noise). 
    * However, that is just something we have to live with.
  * Important: Start small when writing and debugging, THEN scale up. The more chunks of text you have, the better.
  * Do this in a file called "data_collection.ipynb", and have it save its results to a folder called "data"
    * Save the text and labels in the same format used by the textbook so that you can recycle your lab code seamlessly.
    * Also, include a second file with the associated numeric sentiment scores.
* **Training guidelines**
  * Do the training in this "HW-2.ipynb" file
  * Use a dense feed forward ANN model
  * Normalize the data as needed
  * Visualize the results at the end where possible
  * Partition data into training, validation, and test
  * Monitor training and validation throughout training by plotting
  * Print training, validation, and test errors at the very end
  * Do basic hyperparameter tuning to try to achieve an optimally fit model
    * i.e. best training/validation loss without over-fitting
    * Explore L1 and L2 regularization (or dropout)
    * Explore different optimizers 
    * Use the loss functions specified in the textbook
    * Explore different options for activation functions, network size/depth, etc
  * **Document what is going on in the code, as needed, with narrative markdown text between cells.**
  * *Submit the version with the hyperparameters that provide the optimal fit*
    * i.e. you don't need to show the outputs of your hyperparameter tuning process
  * *Optional*: write an outer loop to do an automated hyperparameter search.
* **Classification training:**
  * Recycle your code from lab-2.2 to train a multi-class classifier on the data set 
* **Regression training:**
  * Use your lab-2.1 code as a starting point 
  * Train an ANN to predict the sentiment score from the vectorized text
    * Use the same input matrix as the classification problem
    * This is a bit of a silly and circular exercise, since we are using the outputs of the NLTK model to train an ANN model.
    * The only time you would want to do this is if the second model is computationally MUCH MUCH faster than the first.
    * That being said, it is an educational exercise so hopefully it is informative.   
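
The partitioning step in the training guidelines can be sketched without any framework. This is only an illustration, not the method used in the labs: the helper name `split_indices`, the 80/10/10 fractions, and the seed are all assumptions chosen for the example.

```python
import random

def split_indices(n_samples, val_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle indices 0..n_samples-1 and cut them into train/val/test lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seeded so the split is reproducible
    n_test = int(n_samples * test_frac)
    n_val = int(n_samples * val_frac)
    test_idx = indices[:n_test]
    val_idx = indices[n_test:n_test + n_val]
    train_idx = indices[n_test + n_val:]
    return train_idx, val_idx, test_idx

train_idx, val_idx, test_idx = split_indices(30)
print(len(train_idx), len(val_idx), len(test_idx))
```

Splitting by index rather than by row means the same three index lists can be reused for the text inputs, the class labels, and the sentiment scores, which keeps the classification and regression problems on identical partitions.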
  
**Submission:**
* You need to upload TWO documents to Canvas when you are done
  * (1) A PDF (or HTML) of the completed form of the "HW-2.ipynb" document 
  * (2) A PDF (or HTML) of the completed form of the "data_collection.ipynb" document 
* The final uploaded version should NOT contain any code errors
* All outputs must be visible in the uploaded version, including code-cell outputs, images, graphs, etc
* **Total points:** 41.66


In [1]:
import wikipedia

In [2]:
waffle_t = wikipedia.search("waffle", results=10, suggestion=False)
waffle_t

['Waffle',
 'Waffle House',
 'Waffle House Index',
 'Stroopwafel',
 'Belgian waffle',
 'Waffle slab',
 'Chicken and waffles',
 'Waffle iron',
 'Egg waffle',
 'Waffle Crisp']

In [46]:
pizza_t = wikipedia.search("pizza", results=10, suggestion=False)
pizza_t

['Pizza',
 'Pizza Pizza',
 'Pizza Hut',
 'Licorice Pizza',
 "Domino's Pizza",
 'Pizza Margherita',
 'Hawaiian pizza',
 'History of pizza',
 'Chicago-style pizza',
 'Mystic Pizza']

In [30]:
football_t = wikipedia.search("football", results=10, suggestion=False)
football_t

['Football',
 'College football',
 'England national football team',
 '2001 Tennessee Volunteers football team',
 '1999 Tennessee Volunteers football team',
 '1996 Tennessee Volunteers football team',
 'Notre Dame Fighting Irish football',
 '2004 Tennessee Volunteers football team',
 'Nebraska–Omaha Mavericks football',
 '1995 Tennessee Volunteers football team']

In [29]:
waf_page = []
waf_content = []
for title in waffle_t:
    waf_p = wikipedia.page(title)      # fetch the article for this search hit
    waf_page.append(waf_p)
    waf_content.append(waf_p.content)  # keep the plain-text body

In [31]:
foo_page = []
foo_content = []
for title in football_t:
    foo_p = wikipedia.page(title)      # fetch the article for this search hit
    foo_page.append(foo_p)
    foo_content.append(foo_p.content)  # keep the plain-text body

In [47]:
piz_page = []
piz_content = []
for title in pizza_t:
    piz_p = wikipedia.page(title)      # fetch the article for this search hit
    piz_page.append(piz_p)
    piz_content.append(piz_p.content)  # keep the plain-text body
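
The three loops above stop at the first title that raises an error, for example when a search hit turns out to be ambiguous. A hedged sketch of a more forgiving collector follows. `collect_pages` and its `fetch` argument are hypothetical names, and the demo uses a stand-in fetcher so that the block runs offline. In the notebook, `fetch` could be something like `lambda title: wikipedia.page(title).content`.

```python
def collect_pages(titles, fetch):
    """Return (contents, skipped), skipping any title whose fetch raises."""
    contents, skipped = [], []
    for title in titles:
        try:
            contents.append(fetch(title))
        except Exception as err:  # broad on purpose: this is only a sketch
            skipped.append((title, repr(err)))
    return contents, skipped

# Stand-in fetcher (assumption): pretends one title cannot be resolved.
def fake_fetch(title):
    if title == "Waffle slab":
        raise LookupError("ambiguous title")
    return f"Text of the {title} article."

contents, skipped = collect_pages(["Waffle", "Waffle slab", "Waffle iron"], fake_fetch)
print(len(contents), "pages kept;", len(skipped), "skipped")
```

Recording the skipped titles instead of silently dropping them makes it easy to see how much of each topic was actually harvested.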

In [48]:
import re
import spacy
from spacy.language import Language


pipeline = spacy.load('en_core_web_sm')

# http://emailregex.com/
email_re = r"""(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])"""

# replace = [ (pattern-to-replace, replacement),  ...]
replace = [
    (r"<a[^>]*>(.*?)</a>", r"\1"),  # Unwrap HTML anchor tags, keeping the link text
    (email_re, "email"),            # Replace email addresses with the token "email"
    (r"(?<=\d),(?=\d)", ""),        # Remove commas inside numbers
    (r"\d+", "number"),             # Map digit runs to the token "number"
    (r"[\t\n\r\*\.\@\,\-\/]", " "), # Replace punctuation and control characters with spaces
    (r"\s+", " ")                   # Collapse runs of whitespace into one space
]




In [53]:
import nltk
from nltk.corpus import wordnet

In [None]:
# Download each required NLTK resource once (duplicates removed)
nltk.download([
    "punkt",
    "averaged_perceptron_tagger",
    "wordnet",
    "stopwords",
    "names",
    "state_union",
    "twitter_samples",
    "movie_reviews",
    "vader_lexicon",
])

In [62]:
waf_text = ''.join(str(e) for e in waf_content)
foo_text = ''.join(str(e) for e in foo_content)
piz_text = ''.join(str(e) for e in piz_content)

In [63]:
waf_sen = nltk.tokenize.sent_tokenize(waf_text)
foo_sen = nltk.tokenize.sent_tokenize(foo_text)
piz_sen = nltk.tokenize.sent_tokenize(piz_text)

In [64]:
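# Start from the tokenized sentences, then apply every (pattern, replacement)
# pair in order so that each substitution builds on the previous result.
waf_sen_clean = waf_sen
foo_sen_clean = foo_sen
piz_sen_clean = piz_sen
for pattern, repl in replace:
    waf_sen_clean = [re.sub(pattern, repl, text) for text in waf_sen_clean]
    foo_sen_clean = [re.sub(pattern, repl, text) for text in foo_sen_clean]
    piz_sen_clean = [re.sub(pattern, repl, text) for text in piz_sen_clean]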
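
A list of (pattern, replacement) pairs like `replace` only has its full effect when each substitution is applied to the output of the previous one. The standalone sketch below shows that chaining on a toy sentence. `demo_rules` is a shortened, hypothetical subset of `replace`, and `apply_rules` is an illustrative helper name.

```python
import re

demo_rules = [
    (r"(?<=\d),(?=\d)", ""),          # drop commas inside numbers: 1,234 -> 1234
    (r"\d+", "number"),               # map digit runs to the token "number"
    (r"[\t\n\r\*\.\@\,\-\/]", " "),   # punctuation and control characters -> space
    (r"\s+", " "),                    # collapse whitespace runs
]

def apply_rules(text, rules):
    """Apply each (pattern, replacement) pair in order, feeding each result forward."""
    for pattern, repl in rules:
        text = re.sub(pattern, repl, text)
    return text

print(apply_rules("Sold 1,234 waffles,\nthen 56 more.", demo_rules))
```

Order matters here: the comma rule has to run before the digit rule, otherwise `1,234` would become two separate `number` tokens.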

In [65]:
n = 5 # chunk length (sentences per input)
# Chunks are built from the raw sentences; punctuation and digits are
# stripped per chunk in a later cell.
waf_inputs = [waf_sen[i:i+n] for i in range(0, len(waf_sen), n)]
foo_inputs = [foo_sen[i:i+n] for i in range(0, len(foo_sen), n)]
piz_inputs = [piz_sen[i:i+n] for i in range(0, len(piz_sen), n)]
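
The slicing above leaves a final chunk shorter than five sentences whenever the sentence count is not a multiple of five. Below is a minimal sketch of a helper that makes that choice explicit. The name `chunk_sentences` and the decision to drop the short remainder by default are assumptions for illustration, not part of the assignment.

```python
def chunk_sentences(sentences, size=5, keep_remainder=False):
    """Group sentences into consecutive chunks of `size` sentences each."""
    chunks = [sentences[i:i + size] for i in range(0, len(sentences), size)]
    if chunks and not keep_remainder and len(chunks[-1]) < size:
        chunks.pop()  # drop a short final chunk so every input has equal length
    # Join with a space so that words at sentence boundaries stay separate.
    return [" ".join(chunk) for chunk in chunks]

demo = [f"Sentence {k}." for k in range(1, 13)]  # 12 toy sentences
chunks = chunk_sentences(demo)
print(len(chunks))
print(chunks[0])
```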

In [67]:
waf_inputs = waf_inputs[:10]
foo_inputs = foo_inputs[:10]
piz_inputs = piz_inputs[:10]
waf_inputs[0]

['A waffle is a dish made from leavened batter or dough that is cooked between two plates that are patterned to give a characteristic size, shape, and surface impression.',
 'There are many variations based on the type of waffle iron and recipe used.',
 'Waffles are eaten throughout the world, particularly in Belgium, which has over a dozen regional varieties.',
 'Waffles may be made fresh or simply heated after having been commercially cooked and frozen.',
 '== Etymology ==\nThe word waffle first appears in the English language in 1725: "Waffles.']

In [70]:
inputs_all = waf_inputs + foo_inputs + piz_inputs
len(inputs_all)

30

In [72]:
import re
import pandas as pd

punc_ul = r"""(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])"""

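for i in range(len(inputs_all)):
    # Merge the five sentences of a chunk into one string. Note that ''.join
    # adds no separator, so words at sentence boundaries fuse (e.g. "mixThen").
    sen = ''.join(inputs_all[i])
    sen_n = re.sub(r'[^\w\s]','',sen)      # drop punctuation
    sen_n = re.sub(punc_ul,'',sen_n)       # drop email-like strings
    sen_n = re.sub(r'\w*\d\w*','',sen_n)   # drop any token containing a digit
    inputs_all[i] = sen_n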

df_all = pd.DataFrame({'content':inputs_all})

In [74]:
df_all.head()

Unnamed: 0,content
0,A waffle is a dish made from leavened batter o...
1,Take flower cream It is directly derived from ...
2,The format of the iron itself was almost alway...
3,Toss in some flour and mixThen fill little by ...
4,However this was a waffle gaufre in name only...


In [75]:
labels = ['waffle']*10 + ['football']*10 + ['pizza']*10
df_all['label'] = labels
df_all

Unnamed: 0,content,label
0,A waffle is a dish made from leavened batter o...,waffle
1,Take flower cream It is directly derived from ...,waffle
2,The format of the iron itself was almost alway...,waffle
3,Toss in some flour and mixThen fill little by ...,waffle
4,However this was a waffle gaufre in name only...,waffle
5,By the century paintings by Joachim de Beucke...,waffle
6,Take with that half water and half wine and gi...,waffle
7,Groote Wafelen in its use of leavening was the...,waffle
8,to be no less than yards from one to the othe...,waffle
9,The wealthier families waffles known often as ...,waffle
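
The instructions describe the multi-class target as integers 0 to N-1, while the `label` column above holds strings. One possible way to build that mapping with plain Python is sketched below. The names `label_to_id` and `y` are illustrative, and numbering the labels in order of first appearance is an arbitrary assumption.

```python
labels = ["waffle"] * 10 + ["football"] * 10 + ["pizza"] * 10

# Assign integers in order of first appearance (an arbitrary but stable choice).
label_to_id = {}
for name in labels:
    label_to_id.setdefault(name, len(label_to_id))

# Integer class index for every chunk, aligned with the rows of the table.
y = [label_to_id[name] for name in labels]
print(label_to_id)
```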


In [76]:
df_all.to_csv('data.csv')
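
The instructions ask for results to be saved in a folder called "data", whereas the cell above writes to the working directory. The sketch below shows one way to create the folder before writing. The helper name `save_rows`, the file name `chunks.csv`, and the two-column CSV layout are assumptions, since the textbook's format is not reproduced here. The demo writes to a temporary directory so that it can run anywhere.

```python
import csv
import tempfile
from pathlib import Path

def save_rows(rows, out_dir, filename="chunks.csv"):
    """Write (content, label) rows to out_dir/filename, creating the folder if needed."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / filename
    with path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["content", "label"])  # header row
        writer.writerows(rows)
    return path

# Demo in a temporary directory; in the notebook out_dir would be "data".
with tempfile.TemporaryDirectory() as tmp:
    saved = save_rows([("A waffle is a dish", "waffle")], Path(tmp) / "data")
    print(saved.name, saved.exists())
```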

In [101]:
from nltk.sentiment import SentimentIntensityAnalyzer

# Score each cleaned chunk with VADER and keep only the "compound" value
sia = SentimentIntensityAnalyzer()
scorelist = []
for text in inputs_all:
    scorelist.append(sia.polarity_scores(text)['compound'])

In [104]:
df_all['score'] = scorelist
df_all

Unnamed: 0,content,label,score
0,A waffle is a dish made from leavened batter o...,waffle,0.3182
1,Take flower cream It is directly derived from ...,waffle,-0.4201
2,The format of the iron itself was almost alway...,waffle,0.2732
3,Toss in some flour and mixThen fill little by ...,waffle,-0.2584
4,However this was a waffle gaufre in name only...,waffle,0.6808
5,By the century paintings by Joachim de Beucke...,waffle,0.4678
6,Take with that half water and half wine and gi...,waffle,0.5964
7,Groote Wafelen in its use of leavening was the...,waffle,0.6705
8,to be no less than yards from one to the othe...,waffle,0.1901
9,The wealthier families waffles known often as ...,waffle,0.659
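
All of the scores above fall between -1 and 1, which matters when choosing the output layer for the regression model. As far as I understand the VADER implementation, the compound score is a sum of word valences squashed into that range by a simple normalization. The sketch below reproduces only that squashing step. The function name `squash` is hypothetical, and `alpha=15` is taken to be the library default, which is an assumption worth checking against the NLTK source.

```python
import math

def squash(raw_sum, alpha=15):
    """Map an unbounded valence sum into the open interval (-1, 1)."""
    return raw_sum / math.sqrt(raw_sum * raw_sum + alpha)

for raw in (-4.0, 0.0, 1.0, 4.0):
    print(raw, round(squash(raw), 4))
```

Because the targets are bounded this way, no additional normalization of the sentiment scores should be needed before regression.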


In [105]:
df_all.to_csv('data_w_score.csv')