<a href="https://colab.research.google.com/github/priyanka-ingale/unstructured-intelligence/blob/main/MSIS521_S3_Text_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Text Analysis
This example shows a few text analysis methods: similarity, clustering, topic modeling, and sentiment analysis.


## Data
We are providing the data as a list of documents.

In [1]:
d0 = "He is a good guy, he is not bad"
d1 = "feet wolves cooked boys girls ,!<@!"
d2 = "He is not a good guy, he is bad"
d3 = "I drink water in parties"
d4 = "I grab a drink in parties"

c3 = [d3, d4]

## Similarity Measures

In [2]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity  # Cosine
from sklearn.metrics import jaccard_score  # Jaccard

### CountVectorizer


`fit()` builds the bag-of-words (BOW) vocabulary and assigns each word an index in alphabetical order. By default, single-character tokens such as "I" and "a" are dropped.

In [3]:
vectorizer5 = CountVectorizer()
vectorizer5.fit(c3)
print(vectorizer5.vocabulary_)

{'drink': 0, 'water': 4, 'in': 2, 'parties': 3, 'grab': 1}


`transform()` converts each document to a vector of word counts over the vocabulary. Stacking these rows gives the document-term matrix (DTM).


In [4]:
v_c3 = vectorizer5.transform(c3)
print(v_c3.toarray())

[[1 0 1 1 1]
 [1 1 1 1 0]]


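**Similarity** - `cosine_similarity` expects 2-D inputs (arrays or sparse matrices) with one row per document, so we pass single-row slices of the DTM.
We can calculate Jaccard similarity because the DTM is binary here: no word occurs more than once in a document. (We could have forced a binary DTM by using `vectorizer5 = CountVectorizer(binary=True)`.)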
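
To see what `binary=True` changes, here is a minimal sketch on two hypothetical documents (not part of the corpus above), one of which repeats a word. With the default settings the repeated word is counted twice; with `binary=True` every nonzero count is clipped to 1.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical documents, chosen so that one word ("drink") repeats.
docs = ["drink drink water", "grab a drink"]

# Default: raw term counts. Vocabulary order is drink, grab, water.
counts = CountVectorizer().fit_transform(docs).toarray()

# binary=True: presence/absence only.
binary = CountVectorizer(binary=True).fit_transform(docs).toarray()

print(counts)  # first row holds a 2 for "drink"
print(binary)  # first row holds a 1 for "drink"
```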

In [6]:
cosine_similarity(v_c3[0],v_c3[1])

array([[0.75]])

In [11]:
jaccard_score(v_c3[0].toarray(),v_c3[1].toarray(), average='micro')

np.float64(0.6)
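
As a sanity check, both scores can be reproduced by hand from the two DTM rows printed earlier. This is a plain-Python sketch rather than the scikit-learn implementation: cosine is the dot product divided by the product of the vector lengths, and Jaccard is the number of shared words divided by the number of words appearing in either document.

```python
import math

# DTM rows from above; columns are drink, grab, in, parties, water.
a = [1, 0, 1, 1, 1]  # "I drink water in parties"
b = [1, 1, 1, 1, 0]  # "I grab a drink in parties"

# Cosine similarity: dot(a, b) / (|a| * |b|)
dot = sum(x * y for x, y in zip(a, b))
norm_a = math.sqrt(sum(x * x for x in a))
norm_b = math.sqrt(sum(y * y for y in b))
cosine = dot / (norm_a * norm_b)

# Jaccard similarity: |intersection| / |union| of the word sets
shared = sum(1 for x, y in zip(a, b) if x and y)
either = sum(1 for x, y in zip(a, b) if x or y)
jaccard = shared / either

print(cosine)   # 3 / (2 * 2) = 0.75
print(jaccard)  # 3 / 5 = 0.6
```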

## Clustering Analysis

We use K-Means and Agglomerative Clustering for clustering analysis.


In [12]:
from sklearn.feature_extraction.text import TfidfVectorizer # we will use TF-IDF vectorizer
from sklearn.cluster import KMeans

In [13]:
d5 = "Seattle weather is bad in winter"
d6 = "Seattle Seahawks is a great football team"
d7 = "I love Seahawks"
d8 = "I learned a lot of Data analytics tools"
d9 = "I am a data scientist"
c4 = [d0,d2,d3,d4,d5,d6,d7,d8,d9]

In [16]:
vectorizer8 = TfidfVectorizer(stop_words='english')
X = vectorizer8.fit_transform(c4)

### K-Means

`fit()` trains the model and stores the learned centroids in `cluster_centers_` (one row per cluster, one column per vocabulary term).

In [14]:
k = 4
model1 = KMeans(n_clusters=k, random_state=12, n_init=10)

In [17]:
model1.fit(X)

In [18]:
model1.cluster_centers_

array([[0.        , 0.52374168, 0.        , 0.        , 0.        ,
        0.6023681 , 0.        , 0.        , 0.6023681 , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , 0.54218382, 0.        ,
        0.        , 0.32096472, 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.54218382, 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.32096472, 0.        ,
        0.        ],
       [0.23030531, 0.        , 0.51714818, 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.23030531,
        0.23030531, 0.        , 0.        , 0.38198267, 0.        ,
        0.        , 0.        , 0.23030531, 0.        , 0.        ,
        0.        ],
       [0.        , 0.13572908, 0.        , 0.        , 0.15842953,
        0.        , 0.        , 0.15842953, 0.       

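`fit()` also assigns cluster membership, which is stored in `labels_`. (`transform()` returns each document's distance to every centroid, and `predict()` returns the nearest cluster.)

In [19]:
model1clusters = model1.labels_.tolist()
print(model1clusters)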

[0, 0, 1, 1, 3, 3, 3, 2, 2]
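
The relationship between `labels_`, `transform()`, and `predict()` is easier to see on a toy example. The sketch below uses four hypothetical 2-D points instead of the TF-IDF matrix; the names `points` and `toy_model` are illustrative only.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical, well-separated points: two near the origin, two near (10, 10).
points = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])

toy_model = KMeans(n_clusters=2, random_state=12, n_init=10).fit(points)

# transform() returns distances to each centroid: one row per point,
# one column per cluster.
distances = toy_model.transform(points)
nearest = distances.argmin(axis=1)

print(distances.shape)            # (4, 2)
print(toy_model.labels_)          # memberships stored by fit()
print(toy_model.predict(points))  # nearest centroid for each point
```

On the training data, the column with the smallest distance in each row of `transform()` matches both `labels_` and `predict()`.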


#### Characteristics of each cluster

We want to find the **4 most significant words for each cluster**.

1. For each cluster, we get the term indices sorted by centroid weight in descending order. `argsort()` sorts each row in ascending order and returns the indices of the sorted values. The slice `[:, ::-1]` reverses every row, which effectively rearranges the indices into descending order.
2. `get_feature_names_out()` returns the words corresponding to those indices.

In [20]:
order_centroids = model1.cluster_centers_.argsort()[:, ::-1]
terms = vectorizer8.get_feature_names_out()
for c in range(k):
    print(f'Cluster {c}')
    for ind in order_centroids[c, :4]:
        print(f' {terms[ind]}')


Cluster 0
 guy
 good
 bad
 weather
Cluster 1
 parties
 drink
 grab
 water
Cluster 2
 data
 scientist
 learned
 lot
Cluster 3
 seahawks
 seattle
 love
 weather
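
The `argsort()[:, ::-1]` idiom can be checked on a tiny array. This sketch uses hypothetical centroid weights and terms (`centers`, `toy_terms`), not the values from `model1`.

```python
import numpy as np

# Hypothetical centroid weights: 2 clusters x 4 terms.
centers = np.array([[0.1, 0.7, 0.0, 0.3],
                    [0.5, 0.0, 0.9, 0.2]])
toy_terms = np.array(["bad", "data", "drink", "guy"])

# Ascending argsort per row, then reverse each row to get descending order.
order = centers.argsort()[:, ::-1]
print(order)
# [[1 3 0 2]
#  [2 0 3 1]]

# Top 2 terms per cluster.
for c in range(centers.shape[0]):
    print(c, toy_terms[order[c, :2]])
```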


### Agglomerative Clustering

1. **n_clusters** - number of clusters, selected in advance
2. **affinity** (named **metric** in scikit-learn 1.2 and later) - distance measure between documents; the default is euclidean, others are: manhattan, cosine
3. **linkage** - distance measure between clusters; the default is ward, others are: single, complete, average

`fit_predict()` fits the hierarchical clustering from features or distance matrix, and returns cluster labels.
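
A minimal sketch of `fit_predict()`, assuming four hypothetical 2-D points rather than the TF-IDF matrix. Only `n_clusters` and `linkage` are set here, so the distance measure stays at its euclidean default and the code does not depend on whether the installed version calls that parameter `affinity` or `metric`.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Hypothetical, well-separated points: two near the origin, two near (10, 10).
points = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])

agg = AgglomerativeClustering(n_clusters=2, linkage="complete")
agg_labels = agg.fit_predict(points)  # one cluster label per point

print(agg_labels)
```

The label numbers themselves are arbitrary; what matters is that the first two points share one label and the last two share the other.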

In [None]:
## Packages to plot dendrogram
from scipy.cluster.hierarchy import dendrogram, linkage
import plotly.figure_factory as ff

In [None]:
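# create_dendrogram expects the observation matrix and computes the linkage itself,
# so we pass the data plus a linkage function instead of a precomputed linkage matrix
fig_complete = ff.create_dendrogram(X.toarray(), orientation='bottom',
                                    linkagefun=lambda dist: linkage(dist, method="complete"))
fig_complete.update_layout(width=1000, height=600, title='Hierarchical Clustering Dendrogram using Complete Linkage')
fig_complete.show()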

## LDA Topic Modeling

### Data and document vectors

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

### Latent Dirichlet Allocation

* **n_components** - number of topics
* The model **lda**'s attribute **components_** stores the topic-word distribution. The entry **components_[i, j]** can be viewed as a pseudocount that represents the number of times **word j** was assigned to **topic i**.

To display the representative words under each topic, for each topic:
1. sort the indices of words in descending order of pseudocount
2. take the top 4
3. retrieve the corresponding words and join them with a space " " between them

#### Probability of a document containing a topic
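
In scikit-learn, the document-topic probabilities come from `transform()` (or `fit_transform()`), which returns one row per document and one column per topic, with each row summing to 1. The sketch below assumes a five-document subset of `c4` and 2 topics purely to keep the output small; `doc_lda` and `doc_topics` are illustrative names.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Subset of c4 (d5 to d9), used here only for illustration.
docs = [
    "Seattle weather is bad in winter",
    "Seattle Seahawks is a great football team",
    "I love Seahawks",
    "I learned a lot of Data analytics tools",
    "I am a data scientist",
]

counts = CountVectorizer(stop_words="english").fit_transform(docs)
doc_lda = LatentDirichletAllocation(n_components=2, random_state=0)  # 2 topics (assumed)

# Each row is a probability distribution over topics for one document.
doc_topics = doc_lda.fit_transform(counts)

for doc, probs in zip(docs, doc_topics):
    print(np.round(probs, 3), doc)
```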

### LDA using gensim

#### Prepare the data

gensim expects each document as a list of tokens, so we convert the list of documents into a list of lists of words.

In [None]:
import nltk
from nltk.corpus import stopwords
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('wordnet')
nltk.download('stopwords')

In [None]:
## Preprocessing the documents for gensim
lemmatizer = nltk.stem.WordNetLemmatizer()
stemmer = nltk.stem.PorterStemmer()
stop_words = set(stopwords.words('english'))  # build the set once instead of once per token

processed_c4 = []
for doc in c4:
  tokens = nltk.word_tokenize(doc.lower())
  # keep alphabetic tokens and drop stop words before stemming,
  # because stems such as "thi" (from "this") no longer match the stop list
  tokens = [token for token in tokens if token.isalpha() and token not in stop_words]
  tokens = [lemmatizer.lemmatize(token) for token in tokens]
  tokens = [stemmer.stem(token) for token in tokens]
  processed_c4.append(tokens)

print(processed_c4)

#### gensim

In a terminal, run **pip install gensim** (or use `!pip install gensim` in a notebook cell).

1. create the **dictionary**, which maps every unique word across all documents to an integer ID
2. convert each document into the bag-of-words format: a list of (word_id, word_count) 2-tuples
3. train the model with the number of topics **num_topics** and the mapping from word IDs to words **id2word**
4. use **`print_topics(num_topics, num_words)`** to display up to **num_topics** topics (an arbitrary subset when the model has more), each as a string of its **num_words** most significant words. Both arguments are optional, with the defaults `num_topics=20, num_words=10`. `num_topics = -1` shows all topics.

In [None]:
!pip install gensim
import gensim

In [None]:
dictionary = gensim.corpora.Dictionary(processed_c4)
bow_c4 = [dictionary.doc2bow(doc) for doc in processed_c4]
print(bow_c4)

In [None]:
lda_model = gensim.models.LdaModel(bow_c4, num_topics=4, id2word=dictionary,
                                   passes=10,
                                   iterations=200)

In [None]:
for idx, topic in lda_model.print_topics(-1,4):
  print(f'Topic {idx}: \n Words: {topic}')

## Sentiment Analysis

### Install vader_lexicon

### Data

In [None]:
sentences = ["They are smart, cute, and funny.",  # positive sentence example
    "They are smart, cute, and funny!", # punctuation emphasis handled correctly (sentiment intensity adjusted)
    "They are very smart, cute, and funny.",# booster words handled correctly (sentiment intensity adjusted)
    "They are VERY SMART, cute, and FUNNY.",  # emphasis for ALLCAPS handled
    "They are VERY SMART, cute, and FUNNY!!!",  # combination of signals - VADER appropriately adjusts intensity
    "They are VERY SMART, really handsome, and INCREDIBLY FUNNY!!!",  # booster words & punctuation make this close to ceiling for score
    "The book was good.",  # positive sentence
    "The book was kind of good.",  # qualified positive sentence is handled correctly (intensity adjusted)
    "The plot was good, but the characters are uncompelling and the dialog is not great.",  # mixed negation sentence
    "A really bad, horrible book.",  # negative sentence with booster words
    "At least it isn't a horrible book.",  # negated negative sentence with contraction
    ":) and :D",  # emoticons handled
    "",  # an empty string is correctly handled
    "Today sux",  # negative slang handled
    "Today sux!",  # negative slang with punctuation emphasis handled
    "Today SUX!",  # negative slang with capitalization emphasis
    "Today kinda sux! But I'll get by, lol"  # mixed sentiment example with slang and contrastive conjunction "but"
     ]


### Sentiment analysis with existing classifier

1. use `SentimentIntensityAnalyzer()`
2. get the polarity scores: compound, neg, neu, pos. Each result is a dictionary, to which we add the sentence itself using `update()`.
3. convert the list of dictionaries to a `DataFrame` and display it.