**Description**
- Introduce topic identification.
- Use basic NLP models to identify topics in texts based on term frequencies.
- Experiment with and compare two simple methods: bag-of-words, and tf-idf using a new library, Gensim.

### Building a Counter with bag-of-words
- Build a bag-of-words counter using a Wikipedia article.
- Try building the bag-of-words without looking at the full article text, and guess what the topic is!

**Note:** This article text has had very little preprocessing from the raw Wikipedia database entry.
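
As a minimal sketch of the idea, a bag-of-words is just a mapping from token to count. The sentence below is made up, and plain whitespace splitting stands in for `word_tokenize`; the `toy_` names are hypothetical and are not used elsewhere in the notebook.

```python
from collections import Counter

# Hypothetical toy text; the notebook uses a full Wikipedia article instead.
text = "The bug was found and the bug was fixed"

# Lowercase and split on whitespace (a simple stand-in for word_tokenize).
toy_tokens = text.lower().split()

# Count how often each token occurs.
toy_bow = Counter(toy_tokens)

# Show the three most frequent tokens with their counts.
print(toy_bow.most_common(3))
```

Even in this tiny example, a function word such as "the" ties for the top spot, which is why the raw counts on the real article are dominated by punctuation and stop words.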


In [26]:
from collections import Counter
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

from gensim.corpora.dictionary import Dictionary

In [6]:
# Read the whole article at once; the with block closes the file
with open('Wikipedia articles/wiki_text_debugging.txt', mode='r') as file:
    article = file.read()


In [7]:
# Import Counter
from collections import Counter

# Tokenize the article: tokens
tokens = word_tokenize(article)

# Convert the tokens into lowercase: lower_tokens
lower_tokens = [t.lower() for t in tokens]

# Create a Counter with the lowercase tokens: bow_simple
bow_simple = Counter(lower_tokens)

# Print the 10 most common tokens
print(bow_simple.most_common(10))

[(',', 151), ('the', 150), ('.', 89), ('of', 81), ("''", 66), ('to', 63), ('a', 60), ('``', 47), ('in', 44), ('and', 41)]


### Text preprocessing practice

Apply the following techniques to clean up text for better NLP results:
   - remove stop words and non-alphabetic tokens, lemmatize, and
   - build a new bag-of-words from the cleaned text.

Start with `lower_tokens` and the `Counter` class.
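
The filtering steps can be sketched on a toy token list. This is only an illustration under stated assumptions: the stop list is a tiny hand-written set (NLTK's English list is much longer), the tokens are made up, and lemmatization is left out because it needs the WordNet data.

```python
from collections import Counter

# Hypothetical, deliberately tiny stop list; NLTK's English list is much longer.
toy_stops = {"the", "a", "of", "and", "to", "in", "is"}

# Hypothetical lowercase tokens, including punctuation and a number.
toy_lower_tokens = ["the", "debugger", "is", "a", "tool", ",", "and", "the",
                    "tool", "finds", "2", "bugs", "."]

# Keep alphabetic tokens only, which drops ",", "." and "2".
toy_alpha = [t for t in toy_lower_tokens if t.isalpha()]

# Drop stop words; a set keeps each membership test cheap.
toy_clean = [t for t in toy_alpha if t not in toy_stops]

# Count what is left and show the two most frequent tokens.
print(Counter(toy_clean).most_common(2))
```

Only content words survive the two filters, so the most common tokens start to describe the topic.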

In [19]:
# Import WordNetLemmatizer and the stop word list
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

# Retain alphabetic words: alpha_only
alpha_only = [t for t in lower_tokens if t.isalpha()]

# Remove all stop words: no_stops (a set makes the membership test fast)
english_stops = set(stopwords.words("english"))
no_stops = [t for t in alpha_only if t not in english_stops]

# Instantiate the WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()

# Lemmatize all tokens into a new list: lemmatized
lemmatized = [wordnet_lemmatizer.lemmatize(t) for t in no_stops]

# Create the bag-of-words: bow
bow = Counter(lemmatized)

# Print the 10 most common tokens
print(bow.most_common(10))


[('debugging', 40), ('system', 25), ('bug', 17), ('software', 16), ('problem', 15), ('tool', 15), ('computer', 14), ('process', 13), ('term', 13), ('debugger', 13)]


### Creating and querying a corpus with gensim

- Use these data structures to investigate word trends and potentially interesting topics in a document set.
- Import a few additional messy articles from Wikipedia, which were
    - preprocessed by
        - lowercasing all words,
        - tokenizing them, and
        - removing stop words and punctuation,
    - and stored in a list of document tokens.
- Generate the gensim dictionary and corpus.
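
To make the two data structures concrete, here is a pure-Python sketch of what a token-to-id dictionary and a bag-of-words corpus contain. It is not gensim's implementation: the documents are made up, the `toy_` names are hypothetical, and gensim's `Dictionary` may assign ids in a different order.

```python
from collections import Counter

# Hypothetical pre-tokenized documents.
toy_docs = [
    ["computer", "crash", "software", "crash"],
    ["software", "bug", "debugging", "bug", "bug"],
]

# Assign each distinct token an integer id in order of first appearance.
# (gensim's Dictionary may assign ids in a different order.)
toy_token2id = {}
for doc_tokens in toy_docs:
    for token in doc_tokens:
        toy_token2id.setdefault(token, len(toy_token2id))

def toy_doc2bow(doc_tokens):
    """Return sorted (token_id, count) pairs for one document."""
    counts = Counter(toy_token2id[t] for t in doc_tokens)
    return sorted(counts.items())

# One list of (token_id, count) pairs per document.
toy_corpus = [toy_doc2bow(d) for d in toy_docs]

print(toy_token2id)
print(toy_corpus)
```

Each document becomes a sparse list of `(token_id, count)` pairs, the same shape as the `corpus[4][:10]` output further down.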

In [59]:
# Read the whole article at once; the with block closes the file
with open('Wikipedia articles/wiki_text_crash.txt', mode='r') as file:
    article_1 = file.read()

### Tokenization

In [168]:
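import glob
import errno

# Read every .txt file in the folder into a list of raw strings
path = 'Wikipedia articles/*.txt'
files = glob.glob(path)
articles_1 = []
for name in files:
    try:
        with open(name, encoding="utf8") as f:
            articles_1.append(f.read())
    except IOError as exc:
        # Skip directories that match the pattern; re-raise anything else
        if exc.errno != errno.EISDIR:
            raise

# Build the stop word set once instead of once per token
english_stops = set(stopwords.words('english'))

# Lowercase, tokenize, and keep alphabetic non-stop-word tokens per article
articles = []
for text in articles_1:
    tokens = [w for w in word_tokenize(text.lower()) if w.isalpha()]
    no_stops = [t for t in tokens if t not in english_stops]
    articles.append(no_stops)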

In [174]:
articles[2][1:30]

['dmy',
 'file',
 'crashed',
 'crashed',
 'imac',
 'computing',
 'crash',
 'occurs',
 'computer',
 'program',
 'software',
 'application',
 'operating',
 'system',
 'stops',
 'functioning',
 'properly',
 'exit',
 'system',
 'call',
 'program',
 'responsible',
 'may',
 'appear',
 'hang',
 'computing',
 'crash',
 'reporting',
 'service']

In [181]:
# Import Dictionary
from gensim.corpora.dictionary import Dictionary

# Create a Dictionary from the articles: dictionary
dictionary = Dictionary(articles)

# Select the id for "computer": computer_id
computer_id = dictionary.token2id.get("computer")

# Use computer_id with the dictionary to print the word
print(dictionary.get(computer_id))

# Create the bag-of-words corpus, a list of (id, count) lists: corpus
corpus = [dictionary.doc2bow(article) for article in articles]

# Print the first 10 word ids with their frequency counts from the fifth document
print(corpus[4][:10])


computer
[(1, 1), (13, 1), (15, 1), (18, 1), (26, 1), (29, 1), (37, 1), (38, 4), (47, 2), (48, 7)]


In [185]:
from collections import defaultdict
import itertools

# Save the fifth document: doc
doc = corpus[4]

# Sort the doc for frequency: bow_doc
bow_doc = sorted(doc, key=lambda w: w[1], reverse=True)

# Print the top 5 words of the document alongside the count
for word_id, word_count in bow_doc[:5]:
    print(dictionary.get(word_id), word_count)
    
# Create the defaultdict: total_word_count
total_word_count = defaultdict(int)
for word_id, word_count in itertools.chain.from_iterable(corpus):
    total_word_count[word_id] += word_count
    
# Create a sorted list from the defaultdict: sorted_word_count
sorted_word_count = sorted(total_word_count.items(), key=lambda w: w[1], reverse=True) 

# Print the top 5 words across all documents alongside the count
for word_id, word_count in sorted_word_count[:5]:
    print(dictionary.get(word_id), word_count)

debugging 40
system 19
software 16
tools 14
computer 12
computer 597
software 450
cite 322
ref 259
code 235


In [193]:
from gensim.models.tfidfmodel import TfidfModel
# Create a new TfidfModel using the corpus: tfidf
tfidf = TfidfModel(corpus)

# Calculate the tfidf weights of doc: tfidf_weights
tfidf_weights = tfidf[doc]

# Print the first five weights
print(tfidf_weights[0:5])

# Sort the weights from highest to lowest: sorted_tfidf_weights
sorted_tfidf_weights = sorted(tfidf_weights, key=lambda w: w[1], reverse=True)

# Print the top 5 weighted words along with their weight
for term_id, weight in sorted_tfidf_weights[:5]:
    print(dictionary.get(term_id), weight)

[(1, 0.012387783211068613), (13, 0.01564619640758989), (15, 0.019634171856606864), (18, 0.012387783211068613), (26, 0.019634171856606864)]
wolf 0.22204869139372047
debugging 0.20565578262121448
fence 0.17763895311497638
debugger 0.13626561532175474
squeeze 0.1332292148362323
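
The weights above can be approximated by hand. The sketch below assumes the common weighting of term count times log2(N / document frequency), followed by scaling each document vector to unit length. To my understanding this matches `TfidfModel`'s default settings, but it is not gensim's code, and the corpus and `toy_` names are hypothetical.

```python
import math

# Hypothetical corpus in (token_id, count) form: three tiny documents.
toy_bow_corpus = [
    [(0, 3), (1, 1)],   # token 0 appears in every document
    [(0, 2), (2, 4)],
    [(0, 1), (1, 2)],
]

num_docs = len(toy_bow_corpus)

# Document frequency: in how many documents each token id appears.
doc_freq = {}
for bow_vec in toy_bow_corpus:
    for token_id, _ in bow_vec:
        doc_freq[token_id] = doc_freq.get(token_id, 0) + 1

def toy_tfidf(bow_vec):
    """Weight counts by log2(N / df), then scale to unit length."""
    raw = [(tid, cnt * math.log2(num_docs / doc_freq[tid])) for tid, cnt in bow_vec]
    raw = [(tid, w) for tid, w in raw if w > 0]   # drop zero-weight tokens
    norm = math.sqrt(sum(w * w for _, w in raw))
    return [(tid, w / norm) for tid, w in raw] if norm else []

print(toy_tfidf(toy_bow_corpus[1]))
```

Token 0 occurs in every toy document, so its weight is zero and it drops out, while token 2 occurs in only one document and carries all the weight. The same effect presumably explains why a corpus-rare word like `wolf` can outrank `debugging`, which has the highest raw count in that document.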
