# Foundations of AI & ML
## Session 06
### Case Study
### Applying PCA, ISOMAP, LLE, T-SNE on data

### Step 1
We read the entire file into a list of lines, converting everything to lowercase and removing leading and trailing whitespace.

In [None]:
# Use a context manager so the file is closed once it has been read
with open("War_And_Peace.txt", encoding="utf8") as f:
    wp_text_stage0 = [line.strip().lower() for line in f]
print(wp_text_stage0[4000:4010])

### Step 2
We combine the lines into one gigantic string.

In [None]:
wp_text_stage1 = ' '.join(wp_text_stage0)

In [None]:
print(len(wp_text_stage1))
print(wp_text_stage1[40000:40200])

### Step 3
We break this gigantic string into sentences.

In [None]:
from nltk.tokenize import sent_tokenize
wp_text_stage2 = sent_tokenize(wp_text_stage1)

In [None]:
print(len(wp_text_stage2))
print(wp_text_stage2[5000:5010])

So we have about 26k sentences in the tome. We now take each sentence and clean it up as follows:
 * replace all non-alphanumeric characters with a space
 * split each sentence on whitespace
 * in each sentence, drop words that are shorter than 3 letters or that appear in the list of fluff words
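
The three rules can be tried out on a single sentence before running them on the whole book. The sketch below is only an illustration: `toy_fluff` is a made-up stand-in for the fluff file, and `clean_sentence` is a hypothetical helper that combines the rules in the way the later steps of this notebook apply them.

```python
import re

# Hypothetical stand-in for the fluff (stop-word) file used in this notebook.
toy_fluff = {"the", "and", "was", "with"}

# Any run of characters that are not letters, digits or underscore.
non_alnum = re.compile(r"[^\w]+")

def clean_sentence(sentence, fluff_words):
    """Apply the three clean-up rules to one lowercase sentence."""
    spaced = non_alnum.sub(" ", sentence).strip()   # rule 1: non-alphanumerics -> space
    words = spaced.split()                          # rule 2: split on whitespace
    # rule 3: keep words of 3+ letters that are not fluff
    return [w for w in words if len(w) > 2 and w not in fluff_words]

example = "the prince was silent, and looked at pierre with a smile."
print(clean_sentence(example, toy_fluff))
# ['prince', 'silent', 'looked', 'pierre', 'smile']
```

Short words such as "at" and "a" are dropped by the length rule, while "the", "was", "and" and "with" are dropped because they are in the toy fluff set.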

### Step 4
We read the entire contents of the fluff file into a set. As mentioned earlier, a set is much faster than a list for checking membership.
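
The difference between the two containers can be measured with a rough sketch such as the one below. The word list is synthetic and the helper `time_membership` is a hypothetical name; the exact timings depend on the machine, so treat the numbers as indicative only.

```python
import timeit

# Synthetic vocabulary; the real fluff file is much smaller, but the effect is the same.
words = [f"word{i}" for i in range(10_000)]
as_list = words
as_set = set(words)
probe = "word9999"  # last element: the slowest case for a list scan

def time_membership(container, repeats=200):
    """Time `repeats` membership checks of `probe` in `container`."""
    return timeit.timeit(lambda: probe in container, number=repeats)

list_seconds = time_membership(as_list)
set_seconds = time_membership(as_set)
print(f"list: {list_seconds:.6f}s  set: {set_seconds:.6f}s")
```

A list is scanned element by element, whereas a set uses hashing, so the set lookup time stays roughly constant as the collection grows.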

In [None]:
# A set comprehension builds the set directly; the context manager closes the file
with open("stoplist.txt") as f:
    fluff = {line.strip() for line in f}

### Step 5
We replace every run of non-alphanumeric characters with a single space.

In [None]:
import re
only_alnum = re.compile(r"[^\w]+")  # \w matches Unicode letters, digits and the underscore
# only_alnum = re.compile(r"[^a-z0-9]")  # would also remove accented characters, which are part of many names!

# Replace each run of characters other than Unicode letters, digits and underscore with a single space
def cleanUp(s):
    return re.sub(only_alnum, " ", s).strip()
wp_text_stage3 = [cleanUp(s) for s in wp_text_stage2]
print(wp_text_stage3[4000:4010])
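
The remark about accented characters can be checked on a small example. This is a minimal sketch with a made-up input string; the two patterns are the ones shown in the cell above.

```python
import re

unicode_aware = re.compile(r"[^\w]+")   # pattern used in this notebook
ascii_only = re.compile(r"[^a-z0-9]")   # commented-out alternative

# Made-up lowercase input containing two accented letters (é and ó).
name = "andr\u00e9 bolk\u00f3nski"

kept = unicode_aware.sub(" ", name)
damaged = ascii_only.sub(" ", name)
print(kept)      # accented letters survive
print(damaged)   # accented letters are replaced by spaces, splitting the names
```

With the ASCII-only pattern each name is broken into fragments, which would later be treated as separate (and meaningless) words.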

### Step 6
Now we split each sentence into a list of words, then traverse this list and drop the unwanted words.

In [None]:
def choose_words(s):
    return [w for w in s.split() if len(w) > 2 and w not in fluff]

In [None]:
wp_text_stage4 = [choose_words(sentence) for sentence in wp_text_stage3]
print(wp_text_stage4[4000:4010])

In [None]:
print(len(wp_text_stage4))

### Step 7
We reduce each word to its common stem; that is, we do not want to treat "run", "runs", and "running" as separate words.

In [None]:
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer("english")
print(stemmer.stem("running"), stemmer.stem("run"), stemmer.stem("runs"), stemmer.stem("runner"))
print(stemmer.stem("guns"), stemmer.stem("gun"), stemmer.stem("gunned"), stemmer.stem("gunning"))

In [None]:
def stem_list(wordlist):
    return [stemmer.stem(word) for word in wordlist]
for n in range(4000, 4010):
    print(wp_text_stage4[n], stem_list(wp_text_stage4[n]))

In [None]:
wp_text_stage5 = [stem_list(s) for s in wp_text_stage4]
print(wp_text_stage5[4000:4010])

### Step 8
We now build a word2vec model with this corpus.

In [None]:
import gensim
from gensim.models import Word2Vec
from gensim.models import word2vec
from gensim.models import Phrases
import logging

In [None]:
num_features = 300    # Word vector dimensionality                      
min_word_count = 50   # Minimum word count                        
num_workers = 4       # Number of threads to run in parallel
context = 6           # Context window size                                                                                    
downsampling = 1e-3   # Downsample setting for frequent words
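
To get a feel for what the minimum word count does, the sketch below prunes a tiny made-up corpus by frequency. It only illustrates the effect of the setting; gensim performs this pruning internally, and `toy_corpus` and `toy_min_count` are hypothetical names chosen for the example.

```python
from collections import Counter

# Tiny made-up corpus of already-stemmed sentences.
toy_corpus = [
    ["princ", "andrew", "look"],
    ["princ", "smile"],
    ["andrew", "princ", "ride"],
]
toy_min_count = 2  # hypothetical threshold; the notebook uses 50

# Count every word across all sentences, then keep only the frequent ones.
counts = Counter(word for sentence in toy_corpus for word in sentence)
kept_words = sorted(w for w, c in counts.items() if c >= toy_min_count)
print(kept_words)
# ['andrew', 'princ']
```

Words seen fewer times than the threshold do not get a vector, which is why the vocabulary of the trained model is much smaller than the number of distinct words in the book.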

In [None]:
# Note: this call uses the gensim 3.x API; in gensim >= 4.0 the `size` argument is named `vector_size`
wp = word2vec.Word2Vec(wp_text_stage5, workers=num_workers,
                       size=num_features, min_count=min_word_count,
                       window=context, sample=downsampling)

In [None]:
# L2-normalise the word vectors in place (gensim 3.x; deprecated in gensim >= 4.0)
wp.init_sims(replace=True)

In [None]:
wp.corpus_count

In [None]:
len(wp.wv.vocab)  # vocabulary size (gensim 3.x; gensim >= 4.0 exposes wp.wv.key_to_index instead)

In [None]:
sorted(wp.wv.vocab)

### Step 9
Let us save the word vectors so that we can continue from here later.

In [None]:
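# Note: despite the .bin extension, the file is written in text format unless binary=True is passed
wp.wv.save_word2vec_format('wp.bin')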

In [None]:
import numpy as np
X = np.array([wp.wv.get_vector(w) for w in wp.wv.vocab])
X

**Exercise 1:** Apply Hierarchical Clustering

In [None]:
### Your code here

**Exercise 2:** Apply PCA on the data

In [None]:
### Your code here

**Exercise 3:** Apply ISOMAP on the data

In [None]:
### Your code here

**Exercise 4:** Apply LLE on the data

In [None]:
### Your code here

**Exercise 5:** Apply T-SNE on the data

In [None]:
### Your code here