## Text Mining

![miners](img/text-miners.jpeg)

## Activation discussion:

Take 5 minutes and talk with your neighbor about:
 - What makes text mining hard?
 - How is it different from other types of analysis we've done?
 - What are some applications of text mining you've heard of (or are _interested_ in)?
 
Be prepared to share out something your neighbor said with the group!

***

Report out!

***

#### How is text mining different? What is text?

- Order the words from **SMALLEST** to **LARGEST** units
 - character
 - corpora
 - sentence
 - word
 - corpus
 - paragraph
 - document

(after it is all organized)

- Any disagreements about the terms used?

### **Goal**: to internalize the steps, challenges, and methodology of text mining

### Objectives
- use tools we already have to vectorize some text
- apply the `nltk` package to use some more robust tools to vectorize text
- use TF-IDF to classify documents



## Part 1: First Exercise

### Scenario:
Imagine a time before topic tags on articles. We want to save time sorting through news articles from BBC news to only read the sports articles. We are going to try this using the tools we have.

The `text_examples` folder contains the articles<br>
(don't read them all! You'll ruin the surprise!)

Here's what we are going to do:
- read an article into Python as a long string
- use `.split()` to convert the string to a list
- fix any problems
- create a bar chart of the top 20 most frequent words in your article
- paste that bar chart [here](https://docs.google.com/presentation/d/1pTYQDlpFudghIM1rEDQNKgjubR77SgCUL4YtPjTmfeA/edit?usp=sharing)
- Try to figure out whose articles are in the same topic as yours using the **bar charts only**


In [None]:
file = open("text_examples/A.txt", "r")

In [None]:
article_a = file.read()

In [None]:
article_a[:20]

In [None]:
article_a_words = article_a.split()

In [None]:
article_a_words[:20]

In [None]:
import pandas as pd


series_a = pd.Series(article_a_words)

In [None]:
series_a.value_counts()

In [None]:
%matplotlib inline

series_a.value_counts()[:20].plot.bar()

### What's the problem here?

In [None]:
series_a.apply(lambda x: x.lower())

In [None]:
series_a.apply(lambda x: x.lower()).value_counts()[:20].plot.bar()

### We done?

In [None]:
with open("remove_words.txt", "r") as file:  # closes the file automatically
    remove = file.read()

In [None]:
remove_list = remove.split()

In [None]:
remove_list[:5]

In [None]:
shorter_series_a = series_a[~series_a.apply((lambda x: x.lower())).isin(remove_list)]

In [None]:
shorter_series_a.apply((lambda x: x.lower())).value_counts()[:20].plot.bar()

In [None]:
shorter_series_a.apply((lambda x: x.lower())).value_counts()[:40].plot.bar()

### What other things could we fix?

- using regex, perhaps?
- normalize for number of words in article?
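
Both ideas can be sketched with the standard library alone. The snippet below is a minimal illustration, not the one right answer: the `sample` text is made up, and the regex (letters with an optional apostrophe suffix) is just one reasonable choice of what counts as a word.

```python
import re
from collections import Counter

# Hypothetical sample text standing in for an article.
sample = "The match was won. The crowd cheered; the team's captain smiled!"

# Keep runs of letters (with an optional apostrophe suffix); punctuation is dropped.
words = re.findall(r"[a-z]+(?:'[a-z]+)?", sample.lower())

counts = Counter(words)
total = sum(counts.values())

# Relative frequency: the share of all words in the article taken up by each word.
relative = {word: n / total for word, n in counts.items()}

print(counts.most_common(3))
print(round(relative["the"], 3))
```

Dividing by the article's word count makes a long article and a short one comparable, which raw counts are not.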

## Do this with your own articles if you haven't already!
Find your group!

###  Text Vectorization, aka Bag of Words
#### Steps:

<img style="float: left" src="./img/bag_of_word.jpg" width="200">

![step by step](https://i.gifer.com/VxbJ.gif)

1. make all lower case
2. Remove punctuation, numbers, symbols, etc
3. Remove stop words, perhaps develop custom stop words list
4. Stemming/Lemmatization


How was that process we just did?
But what about tokenization? When is the best time to tokenize?

## NLTK makes it easier!

### _Natural Language Tool Kit_

NLTK is its own Python library. And of course, it has its own [documentation](https://www.nltk.org/)

In [None]:
from __future__ import print_function  # must be the first statement in the cell; a no-op on Python 3
import nltk
import sklearn

In [None]:
nltk.download()  # for when you are bringing in files from Gutenberg, etc.

In [None]:
from nltk.collocations import *
from nltk import FreqDist, word_tokenize
import string
import re
import urllib

In [None]:
#print(tokens[:100])

In [None]:
with open("text_examples/A.txt", "r") as file:  # closes the file automatically
    article_a = file.read()

Load your article here

In [None]:
pattern = "([a-zA-Z]+(?:'[a-z]+)?)"
article_a_tokens_raw = nltk.regexp_tokenize(article_a, pattern)
print(article_a_tokens_raw[:50])

In [None]:
article_a_tokens = [i.lower() for i in article_a_tokens_raw]
print(article_a_tokens[:50])

In [None]:
from nltk.corpus import stopwords
stopwords.words("english")

In [None]:
stop_words = set(stopwords.words('english'))
article_a_tokens_stopped = [w for w in article_a_tokens if w not in stop_words]
print(article_a_tokens_stopped[:50])

## Stemming / Lemmatization

### Stemming - Porter Stemmer 
![porter](https://cdn.homebrewersassociation.org/wp-content/uploads/Baltic_Porter_Feature-600x800.jpg)

In [None]:
from nltk.stem import *
stemmer = PorterStemmer()
example = ['caresses', 'flies', 'dies', 'mules', 'denied',
           'died', 'agreed', 'owned', 'humbled', 'sized',
           'meeting', 'stating', 'siezing', 'itemization',
           'sensational', 'traditional', 'reference', 'colonizer',
          'plotted']

In [None]:
singles = [stemmer.stem(word) for word in example]  # stem each word, not the whole list
print(' '.join(singles))

### Stemming - Snowball Stemmer
![snowball](https://localtvwiti.files.wordpress.com/2018/08/gettyimages-936380496.jpg?quality=85&strip=all)

In [None]:
print(" ".join(SnowballStemmer.languages))

In [None]:
stemmer = SnowballStemmer("english")
print(stemmer.stem("running"))

### Porter vs Snowball

In [None]:
print(SnowballStemmer("english").stem("generously"))
print(SnowballStemmer("porter").stem("generously"))

### Use Snowball on the article

In [None]:
article_a_stemmed = [stemmer.stem(word) for word in article_a_tokens_stopped]
print(article_a_stemmed[:50])

### Lemmatization

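Lemmatization uses the WordNet lexical database to map each word to its dictionary form (lemma).

`from nltk.stem import WordNetLemmatizer`<br>
`wordnet_lemmatizer = WordNetLemmatizer()`


Challenge of lemmatization: the lemmatizer needs each word's part of speech, and it assumes a noun unless told otherwise:

`wordnet_lemmatizer.lemmatize(word, pos="v")`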

## Here is a short list of additional considerations when cleaning text:

- Handling large documents and large collections of text documents that do not fit into memory.
- Extracting text from markup like HTML, PDF, or other structured document formats.
- Transliteration of characters from other languages into English.
- Decoding Unicode characters into a normalized form, such as UTF-8.
- Handling of domain-specific words, phrases, and acronyms.
- Handling or removing numbers, such as dates and amounts.
- Locating and correcting common typos and misspellings.

## Try it with your article!!

### Document statistics

Average word length in document

In [None]:
float(sum(map(len, article_a_stemmed))) / len(article_a_stemmed)

Number of words in document

In [None]:
len(article_a_stemmed)

In [None]:
a_freqdist = FreqDist(article_a_stemmed)

In [None]:
a_freqdist.most_common(50)

In [None]:

a_freqdist.plot(30,cumulative=False)


**TASK**: Create word frequency plot for your article

Question:  Should any more stop words be added to the list given your plot results?

In [None]:
a_finder = BigramCollocationFinder.from_words(article_a_stemmed)
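
The finder above is created but not queried. As a library-free sketch of what a bigram count involves, the snippet below pairs each token with its successor and tallies the pairs. The `tokens` list is hypothetical and stands in for `article_a_stemmed`.

```python
from collections import Counter

# Hypothetical stemmed tokens standing in for article_a_stemmed.
tokens = ["world", "cup", "final", "world", "cup", "win", "cup", "final"]

# Pair each token with the one that follows it to form bigrams.
bigrams = list(zip(tokens, tokens[1:]))
bigram_counts = Counter(bigrams)

print(bigram_counts.most_common(2))
```

With NLTK itself, something like `a_finder.nbest(BigramAssocMeasures().raw_freq, 10)` should give a comparable ranking by raw frequency. Check the collocations documentation for the other scoring measures.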


## What you've all been waiting for 

![big deal](http://reddebtedstepchild.com/wp-content/uploads/2013/04/Big-deal-gif.gif)


## Frequency distributions

## Creating a data frame that compares the documents

**Puzzle**: how could you adapt the code below to allow you to compare documents and word counts?

In [50]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

docs = ['why hello there', 'omg hello pony', 'she went there? omg']
vec = CountVectorizer()
X = vec.fit_transform(docs)
df = pd.DataFrame(X.toarray(), columns=vec.get_feature_names_out())  # get_feature_names() was removed in scikit-learn 1.2
print(df)

   hello  omg  pony  she  there  went  why
0      1    0     0    0      1     0    1
1      1    1     1    0      0     0    0
2      0    1     0    1      1     1    0


In [None]:
""

## Comparing documents

Why is the `CountVectorizer` not enough to adequately compare documents?

### More math!

$n_{i,j} = \text{number of occurrences of term } i \text{ in document } j$

$df_i = \text{number of documents containing term } i$

$N = \text{total number of documents}$

### Term Frequency (TF)

$\begin{align}
 tf_{i,j} = \dfrac{n_{i,j}}{\displaystyle \sum_k n_{k,j} }
\end{align} $

### Inverse Document Frequency (IDF)

$\begin{align}
idf_i = \log \dfrac{N}{df_i}
\end{align} $

### TF-IDF score

$ \begin{align}
w_{i,j} = tf_{i,j} \times \log \dfrac{N}{df_i}
\end{align} $
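
To make the formulas concrete, here is a small sketch that applies them to the three toy sentences from the `CountVectorizer` puzzle. The natural log is an assumption (the formula leaves the base open), and the helper names `tf`, `idf`, and `tf_idf` are made up for this illustration.

```python
import math

# Toy corpus; tokens are assumed to be already cleaned.
docs = [
    ["why", "hello", "there"],
    ["omg", "hello", "pony"],
    ["she", "went", "there", "omg"],
]
N = len(docs)

def tf(term, doc):
    # Share of the document's tokens that are `term`.
    return doc.count(term) / len(doc)

def idf(term):
    # Natural log is an assumption; any base gives the same ranking.
    df = sum(1 for doc in docs if term in doc)
    return math.log(N / df)

def tf_idf(term, doc):
    return tf(term, doc) * idf(term)

print(tf_idf("pony", docs[1]))   # rarer term: appears in 1 of 3 documents
print(tf_idf("hello", docs[1]))  # more common term: appears in 2 of 3 documents
```

Both words occur once in the second document, but "pony" scores higher because fewer documents contain it. Note that scikit-learn's `TfidfVectorizer` by default smooths the IDF and L2-normalizes each row, so its numbers will not match this textbook version exactly.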


In [None]:
cleaned_a = ' '.join(article_a_stemmed)
cleaned_b = ' '.join(FILL THIS)# need b article here


from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer()
response = tfidf.fit_transform([cleaned_a, cleaned_b])

import pandas as pd
df = pd.DataFrame(response.toarray(), columns=tfidf.get_feature_names_out())  # get_feature_names() was removed in scikit-learn 1.2
print(df)

# EXIT TICKET [HERE](https://forms.gle/WXAyoWjnFb7XdaJd9)

### _PREP FOR NEXT LECTURE_:

Take a dataset of tweets and prepare it for next lecture.

- explore a few tweets in your chosen dataset
- practice cleaning them for text analysis
- tokenizing, lower case, punctuation, stemming
- then use `" ".join()` to turn each cleaned tweet back into a string
- save that cleaned string in a new dataframe

Make a function that will do that cleaning for the whole dataset.
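
One possible starting point, using only the standard library, is sketched below. Everything in it is an assumption to adapt: the name `clean_tweet`, the tiny `STOP_WORDS` set, the regex patterns, and the example tweets are all made up for illustration, and stemming is left out.

```python
import re

# Tiny illustrative stop list; a real one (e.g. NLTK's) would be much longer.
STOP_WORDS = {"a", "an", "and", "at", "is", "the", "to"}

def clean_tweet(text):
    """Lowercase, strip URLs and @mentions, tokenize, drop stop words, re-join."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)   # remove links
    text = re.sub(r"@\w+", " ", text)           # remove @mentions
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text)
    return " ".join(t for t in tokens if t not in STOP_WORDS)

# Hypothetical tweets, made up for illustration.
tweets = [
    "Heading to the keynote at #SXSW! https://example.com @friend",
    "The line is SO long...",
]
cleaned = [clean_tweet(t) for t in tweets]
print(cleaned)
```

With a dataframe, the same function could be applied to a whole column with `.apply(clean_tweet)`. The column name depends on the dataset you choose.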



### Dataset Choices

#### What words indicate a text might be biased?

Politically Neutral vs Partisan dataset:

```
import pandas as pd
df = pd.read_csv('https://query.data.world/s/yexnb2tlowvuvq6dcnkex6jnlzzbmc')
```

#### Can you tell from tweets if someone was attending SXSW or Coachella?
**2017 SXSW twitter**

<img src="https://pmcvariety.files.wordpress.com/2019/10/sxsw.jpg?w=1000" alt="sxsw" style ="text-align:center;width:200px;float:none" >

A collection of all tweets that mention #sxsw or @sxsw
```
import pandas as pd
df = pd.read_csv('https://query.data.world/s/gt4rvczsuklcxymodiklrl7uq2vmyx')
```


**2015 Coachella twitter**

<img src="https://consequenceofsound.net/wp-content/uploads/2019/12/Coachella-2020-lineup.png?w=800" alt="coachella" style ="text-align:center;width:200px;float:none" >

```
import pandas as pd
df = pd.read_csv('https://query.data.world/s/buoio54tyk6fg7gh7x3d7qclbmtqb2')
```

#### Trump vs Johnson

**Boris Johnson Tweets** (as of last night)
in the data folder!

**Donald Trump Tweets**
Same location!


## Appendix
#### If you want to see TF, IDF, and TF-IDF from _scratch_...

![homemade](https://media2.giphy.com/media/LBZcXdG0eVBdK/giphy.gif?cid=3640f6095c2d7bb2526a424a4d97117c)

In [None]:
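# article_b_stemmed is assumed to be built for a second article the same way as article_a_stemmed
wordSet = set(article_a_stemmed).union(set(article_b_stemmed))
wordDictA = dict.fromkeys(wordSet, 0)
wordDictB = dict.fromkeys(wordSet, 0)

# count how often each word occurs in each article
for word in article_a_stemmed:
    wordDictA[word] += 1

for word in article_b_stemmed:
    wordDictB[word] += 1

def computeTF(wordDict, bow):
    """Return each word's count divided by the number of words in the document."""
    tfDict = {}
    bowCount = len(bow)
    for word, count in wordDict.items():
        tfDict[word] = count / float(bowCount)
    return tfDict

tfbowA = computeTF(wordDictA, article_a_stemmed)
tfbowB = computeTF(wordDictB, article_b_stemmed)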

In [None]:
tfbowA

In [None]:
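def computeIDF(docList):
    """Compute the inverse document frequency of each word.
    docList: one word-count dict per document, all sharing the same keys
    returns: dict mapping each word to log10(N / number of docs containing it)
    """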
    import math
    idfDict = {}
    N = len(docList)
    
    idfDict = dict.fromkeys(docList[0].keys(), 0)
    for doc in docList:
        for word, val in doc.items():
            if val > 0:
                idfDict[word] += 1
    
    for word, val in idfDict.items():
        idfDict[word] = math.log10(N / float(val))
        
    return idfDict

In [None]:
idfs = computeIDF([wordDictA, wordDictB])

In [None]:
def computeTFIDF(tfBow, idfs):
    """Multiply each word's term frequency by its inverse document frequency."""
    tfidf = {}
    for word, val in tfBow.items():
        # TF of the word in this document times its corpus-wide IDF
        tfidf[word] = val * idfs[word]
    return tfidf

In [None]:
tfidfBowA = computeTFIDF(tfbowA, idfs)
tfidfBowB = computeTFIDF(tfbowB, idfs)

In [None]:
import pandas as pd
pd.DataFrame([tfidfBowA, tfidfBowB])