# REFRESHER EXERCISE CHALLENGE

This week, you will work in your groups to solve the coding tasks below. The goal is to rehearse the topics you have covered so far in the course. 

Teams that complete the challenge by **16.00 on Wednesday 05.01.2022** (send me your code via e-mail) will receive a **0.3 upgrade bonus** on their final grade. The full solutions, along with code for some other parts of the course, will be released on GitHub on Thursday 06.01.2022. 

In [None]:
### If you are missing any of the required packages, uncomment the relevant lines below
### to install them.

#!pip install unidecode
#!pip install googletrans
#!pip install gensim
#!pip install spacy
#!pip install wordcloud
#!pip install pyldavis

#!python -m spacy download en_core_web_sm
#!python -m spacy download en_core_web_lg

#import nltk
#nltk.download('stopwords') 
#nltk.download('punkt') 
#nltk.download('wordnet') 
#nltk.download('averaged_perceptron_tagger')
#nltk.download('vader_lexicon')

### Set up and load the data


In [1]:
# Usual imports
import numpy as np
import os
import pandas as pd

# To plot pretty figures
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
import matplotlib as mpl
import matplotlib.pyplot as plt
#%matplotlib notebook
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)

import seaborn as sns
import warnings
warnings.filterwarnings('ignore', category=FutureWarning)
warnings.filterwarnings('ignore', category=DeprecationWarning)
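# Hack: turn warnings.filterwarnings into a no-op so that later calls to it cannot change the filters set above
warnings.filterwarnings = lambda *a, **kw: None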

# To make this notebook's output identical at every run
np.random.seed(42)

In [2]:
# This exercise requires Scikit-Learn ≥0.20
import sklearn

As an example, we use the **20 Newsgroups** dataset ([http://qwone.com/~jason/20Newsgroups/](http://qwone.com/~jason/20Newsgroups/)), available through `sklearn`: a collection of about 20,000 newsgroup (message forum) documents. 

In [3]:
from sklearn.datasets import fetch_20newsgroups
data = fetch_20newsgroups() # returns a dictionary-like Bunch object
data.keys()

Downloading 20news dataset. This may take a few minutes.
Downloading dataset from https://ndownloader.figshare.com/files/5975967 (14 MB)


dict_keys(['data', 'filenames', 'target_names', 'target', 'DESCR'])

In [4]:
#Check dataset characteristics
print(data['DESCR'])

.. _20newsgroups_dataset:

The 20 newsgroups text dataset
------------------------------

The 20 newsgroups dataset comprises around 18000 newsgroups posts on
20 topics split in two subsets: one for training (or development)
and the other one for testing (or for performance evaluation). The split
between the train and test set is based upon a messages posted before
and after a specific date.

This module contains two loaders. The first one,
:func:`sklearn.datasets.fetch_20newsgroups`,
returns a list of the raw texts that can be fed to text feature
extractors such as :class:`sklearn.feature_extraction.text.CountVectorizer`
with custom parameters so as to extract feature vectors.
The second one, :func:`sklearn.datasets.fetch_20newsgroups_vectorized`,
returns ready-to-use features, i.e., it is not necessary to use a feature
extractor.

**Data Set Characteristics:**

    Classes                     20
    Samples total            18846
    Dimensionality               1
    Features       

In [5]:
W, y = data.data, data.target
n_samples = y.shape[0] # fetch_20newsgroups loads only the training subset by default
n_samples

11314

In [6]:
y[:10] # news story categories

array([ 7,  4,  4,  1, 14, 16, 13,  3,  2,  4])

In [7]:
doc = W[0]
doc

"From: lerxst@wam.umd.edu (where's my thing)\nSubject: WHAT car is this!?\nNntp-Posting-Host: rac3.wam.umd.edu\nOrganization: University of Maryland, College Park\nLines: 15\n\n I was wondering if anyone out there could enlighten me on this car I saw\nthe other day. It was a 2-door sports car, looked to be from the late 60s/\nearly 70s. It was called a Bricklin. The doors were really small. In addition,\nthe front bumper was separate from the rest of the body. This is \nall I know. If anyone can tellme a model name, engine specs, years\nof production, where this car is made, history, or whatever info you\nhave on this funky looking car, please e-mail.\n\nThanks,\n- IL\n   ---- brought to you by your neighborhood Lerxst ----\n\n\n\n\n"

In [8]:
#Make and check a pandas dataframe
df = pd.DataFrame(W,columns=['text'])
df['topic'] = y
df.head()

| | text | topic |
|---|---|---|
| 0 | From: lerxst@wam.umd.edu (where's my thing)\nS... | 7 |
| 1 | From: guykuo@carson.u.washington.edu (Guy Kuo)... | 4 |
| 2 | From: twillis@ec.ecn.purdue.edu (Thomas E Will... | 4 |
| 3 | From: jgreen@amber (Joe Green)\nSubject: Re: W... | 1 |
| 4 | From: jcm@head-cfa.harvard.edu (Jonathan McDow... | 14 |


In [9]:
from gensim.utils import simple_preprocess

processed = []
# iterate over documents
for i, text in enumerate(W):
    document = simple_preprocess(text) # lowercase and split into tokens
    processed.append(document) # add to list
    if i > 100: # stop after the first 102 documents to keep this demo fast
        break



In [10]:
processed[0][:10]

['from',
 'lerxst',
 'wam',
 'umd',
 'edu',
 'where',
 'my',
 'thing',
 'subject',
 'what']

### Basic Text Descriptive Statistics

In [11]:
# Replace non-ASCII characters with their closest ASCII equivalents
from unidecode import unidecode # package that transliterates Unicode text to ASCII
uncode_str = 'Visualizations\xa0'
fixed = unidecode(uncode_str) # example usage
print([uncode_str],[fixed]) # the non-breaking space \xa0 is replaced with a plain space

['Visualizations\xa0'] ['Visualizations ']


**YOUR TASKS:**

1. Count the words per document. 

2. Build a frequency distribution over words with `Counter`. 

3. Create a graph of the 10 most frequently occurring words, with and without typical English stopwords. (Tip: `from nltk.corpus import stopwords`) 

4. Create a graph of the 10 most frequently occurring bigrams and trigrams, with and without typical English stopwords. (Tip: `from nltk.corpus import stopwords`) 

5. Use RegEx to extract: 
   (i) hyphenated words 
   (ii) e-mail addresses
   
6. Tag Parts of Speech

7. Create a function that removes punctuation, lowercases words, removes stopwords, and stems the remaining words (using the Snowball stemmer).

8. Apply a TF-IDF vectorizer and create word clouds.
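
Tasks 2, 4 and 5 can be prototyped with the standard library alone. The sketch below is a minimal illustration, not the official solution: the toy token list, the helper name `ngrams`, and both regular expressions are assumptions made for this example, and the e-mail pattern is deliberately simple rather than a complete address validator.

```python
import re
from collections import Counter

# Toy token list standing in for one preprocessed document (assumption)
tokens = "the car was a sports car and the car was fast".split()

# Task 2: frequency distribution over words
word_freq = Counter(tokens)
top_words = word_freq.most_common(2)

def ngrams(seq, n):
    """Return all contiguous n-grams of seq as tuples (hypothetical helper)."""
    return list(zip(*(seq[i:] for i in range(n))))

# Task 4: frequency distribution over bigrams (use n=3 for trigrams)
bigram_freq = Counter(ngrams(tokens, 2))

# Task 5: simple patterns for hyphenated words and e-mail addresses
raw = "From: lerxst@wam.umd.edu. It was a 2-door sports car, please e-mail."
hyphenated = re.findall(r"\b\w+(?:-\w+)+\b", raw)
emails = re.findall(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+", raw)

print(top_words)
print(bigram_freq.most_common(1))
print(hyphenated)
print(emails)
```

On the real corpus, the same `Counter` objects can be fed to a bar plot, and filtering `tokens` against a stopword list before counting gives the "without stopwords" variant.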


### Dictionary/Matching Methods

In [None]:
# Dictionary-Based Sentiment Analysis on the corpus
from nltk.sentiment.vader import SentimentIntensityAnalyzer
sid = SentimentIntensityAnalyzer()
polarity = sid.polarity_scores(doc)
print(polarity)

In [None]:
# sample 20% of the dataset
dfs = df.sample(frac=.2) 

# apply compound sentiment score to data-frame
def get_sentiment(snippet):
    return sid.polarity_scores(snippet)['compound']
dfs['sentiment'] = dfs['text'].apply(get_sentiment)

In [None]:
dfs.sort_values('sentiment',inplace=True)
# show a 100-character snippet (characters 50 to 150) of the five most positive documents
[x[50:150] for x in dfs[-5:]['text']]

In [None]:
# show a 100-character snippet (characters 50 to 150) of the five most negative documents
[x[50:150] for x in dfs[:5]['text']]

### Topic Models

We again use the **20 Newsgroups** dataset ([http://qwone.com/~jason/20Newsgroups/](http://qwone.com/~jason/20Newsgroups/)), available through `sklearn`: a collection of about 20,000 newsgroup (message forum) documents. 


In [None]:
W=data.data

In [None]:
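# Preprocessing
from gensim.utils import simple_preprocess
from nltk.corpus import stopwords

# English stopword list (requires nltk.download('stopwords'))
stop_words = set(stopwords.words('english'))

doc_clean = []
# iterate over documents
for i, text in enumerate(W):
    document = simple_preprocess(text) # lowercase and split into tokens
    document = [word for word in document if word not in stop_words] # remove stopwords
    doc_clean.append(document) # add to list
    if i > 100: # stop after the first 102 documents to keep this demo fast
        break
# shuffle the documents
from random import shuffle
shuffle(doc_clean)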

# creating the term dictionary
from gensim import corpora
dictionary = corpora.Dictionary(doc_clean)
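
A term dictionary of this kind maps each token to an integer id, and a document-term matrix stores, per document, how often each id occurs. The plain-Python sketch below illustrates the idea on a toy corpus; it is not gensim's implementation. The names `token2id`, `to_bow` and `to_tfidf` are hypothetical, and the weighting (count times log2(N / document frequency), then L2 normalisation) is one common TF-IDF variant assumed for this example. In gensim, the corresponding tools are `dictionary.doc2bow` and `gensim.models.TfidfModel`.

```python
import math
from collections import Counter

# Toy corpus of tokenized documents (assumption, stands in for doc_clean)
docs = [["car", "engine", "car"], ["engine", "space"], ["space", "orbit", "space"]]

# Term dictionary: token -> integer id, in order of first appearance
token2id = {}
for doc in docs:
    for tok in doc:
        token2id.setdefault(tok, len(token2id))

def to_bow(doc):
    """Sparse bag-of-words: sorted list of (token_id, count) pairs."""
    return sorted(Counter(token2id[t] for t in doc).items())

bow_corpus = [to_bow(d) for d in docs]

# Document frequency: number of documents that contain each token id
doc_freq = Counter(tid for bow in bow_corpus for tid, _ in bow)
n_docs = len(docs)

def to_tfidf(bow):
    """Weight each count by log2(N / doc_freq), then L2-normalise the vector."""
    weights = [(tid, cnt * math.log2(n_docs / doc_freq[tid])) for tid, cnt in bow]
    norm = math.sqrt(sum(w * w for _, w in weights)) or 1.0
    return [(tid, w / norm) for tid, w in weights]

tfidf_corpus = [to_tfidf(b) for b in bow_corpus]
print(bow_corpus[0])
print(tfidf_corpus[0])
```

In the first toy document, "car" receives a higher weight than "engine" because it occurs more often there and appears in fewer documents overall.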

**YOUR TASKS:**

1. Convert the list of documents (the corpus) into a document-term matrix and a TF-IDF matrix, using the dictionary prepared above.

2. Train LDA with 10 topics and print out the topics.

(Note: parameters of LDA:
 `num_topics` = the number of topics to extract from the documents;
 `alpha` = document-topic density (the higher the value, the more topics each document is assigned to, and vice versa);
 `eta` = topic-word density (the higher the value, the more words each topic contains, and vice versa).)

3. Create LDA word clouds.
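
The effect of `alpha` described in the note can be made concrete by sampling from a symmetric Dirichlet distribution, which is the prior LDA places on each document's topic proportions (`eta` plays the same role for each topic's word proportions). The following is a minimal standard-library sketch, not part of any LDA library: the helper names `sample_dirichlet` and `mean_top_share`, the seed, and the two alpha values are assumptions chosen for illustration.

```python
import random

def sample_dirichlet(alpha, k, rng):
    """Draw one sample from a symmetric Dirichlet(alpha) over k topics.

    Standard construction: normalise k independent Gamma(alpha, 1) draws.
    """
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]

def mean_top_share(alpha, k=10, n=2000, seed=42):
    """Average share of the single largest topic over n sampled documents."""
    rng = random.Random(seed)
    return sum(max(sample_dirichlet(alpha, k, rng)) for _ in range(n)) / n

sparse = mean_top_share(alpha=0.1)   # small alpha: one topic tends to dominate
dense = mean_top_share(alpha=10.0)   # large alpha: topics are mixed more evenly
print(f"alpha=0.1: {sparse:.2f}  alpha=10: {dense:.2f}")
```

A larger share for the top topic means each document is concentrated on few topics, which is the low-`alpha` regime; with a high `alpha` the share approaches 1/k, so every document spreads over many topics.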

### Word Embeddings

We train word embeddings with word2vec, which is available in the gensim package. Remember that word2vec takes sentences as input. Hence, your tasks in this section are as follows:

1. Create a function that extracts sentences from a document (make sure to lowercase, split, and stem the raw sentences in the document).
2. Stream the sentences in random order (using `from random import shuffle` + ...).
3. Train the model (using `from gensim.models import Word2Vec` + ...).
4. Once training is done, save the trained w2v model as a .pkl file (e.g. `w2v.save('w2v-vectors.pkl')`).
5. Find the 10 words most similar to "man".
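
Steps 1 and 2 can be sketched with the standard library as follows. This is an illustrative outline under stated assumptions, not the reference solution: the function names `get_sentences` and `stream_sentences`, the `stem` hook, the naive sentence-boundary rule, and the fixed seed are all hypothetical choices made for this example.

```python
import random
import re

def get_sentences(doc, stem=lambda w: w):
    """Split a raw document into sentences of lowercased, stemmed tokens.

    `stem` is a hypothetical hook: pass e.g. the `stem` method of a Snowball
    stemmer; the default leaves every word unchanged.
    """
    sentences = []
    # Naive rule (assumption): a sentence ends at ., ! or ? followed by whitespace
    for raw in re.split(r"(?<=[.!?])\s+", doc):
        words = [stem(w) for w in re.findall(r"[a-z]+", raw.lower())]
        if words:
            sentences.append(words)
    return sentences

def stream_sentences(docs, seed=42):
    """Collect the sentences of all documents and return them in random order."""
    sentences = [s for doc in docs for s in get_sentences(doc)]
    random.Random(seed).shuffle(sentences)
    return sentences

docs = ["It was a 2-door sports car. It was called a Bricklin!", "Thanks for any info."]
sentences = stream_sentences(docs)
print(sentences)
```

The result is a list of token lists, which is the input shape `gensim.models.Word2Vec` accepts for its `sentences` argument; after training, `model.wv.most_similar("man", topn=10)` is the usual way to query the nearest neighbours of a word.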