
In this practice we will perform topic modeling and sentiment analysis on two datasets.

**Activity 1:** Load the Kaggle voted dataset and perform the following tasks:

* Focus on the "Description" column and preprocess it if required
* Fit an LDA model with 10 topics on the "Description" column
    * While creating the TF matrix, ignore terms that have a document frequency strictly higher than 100
* Print top-5 words per topic
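
To make the document-frequency cutoff concrete, here is a minimal plain-Python sketch on a made-up toy corpus. The function name `filter_by_max_df` and the toy documents are hypothetical and not part of the lab; the sketch only mirrors the rule stated above (drop terms that occur in strictly more than the threshold number of documents) and is not scikit-learn's implementation.

```python
from collections import Counter


def filter_by_max_df(tokenized_docs, max_df):
    """Keep only terms whose document frequency is at most max_df."""
    # Count each term once per document, no matter how often it occurs there
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    return sorted(term for term, df in doc_freq.items() if df <= max_df)


# Hypothetical toy corpus: three already-tokenized "descriptions"
toy_docs = [
    ["data", "housing", "prices"],
    ["data", "movie", "ratings"],
    ["data", "housing", "census"],
]

kept_terms = filter_by_max_df(toy_docs, max_df=2)
print(kept_terms)
```

With a cutoff of 2, `data` is dropped because it occurs in all three toy documents, while `housing` (two documents) is kept.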

### Load data file

In [None]:
# load necessary packages
import json
import pandas as pd
import re

In [None]:
filepath = "/dsa/data/DSA-8410/voted-kaggle-dataset.csv"

df = pd.read_csv(filepath, encoding = 'utf-8')
print(df['Description'].head(5))

### Create a TF matrix

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation


In [None]:
print(df.shape)
desc = df['Description'].dropna()
print(desc.shape)

In [None]:
docs = desc.values
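# An integer max_df drops terms that appear in more than 100 documents
countVectorizer = CountVectorizer(stop_words='english', max_df=100)
termFrequency = countVectorizer.fit_transform(docs)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
featureNames = countVectorizer.get_feature_names_out()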

### Fit an LDA model

In [None]:
# A fixed random_state keeps the fitted topics reproducible across runs
lda = LatentDirichletAllocation(n_components=10, random_state=0)
lda.fit(termFrequency)

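### Print top 5 words per topic

In [None]:
n_top_words = 5  # the activity asks for the top 5 words per topic
for idx, topic in enumerate(lda.components_):
    # argsort is ascending, so the reversed slice yields the highest weights first
    print("Topic", idx, " ".join(featureNames[i] for i in topic.argsort()[:-n_top_words - 1:-1]))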
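
The reversed slice on `argsort()` can be hard to read at first. As a rough illustration, the plain-Python sketch below selects the indices of the `k` largest weights in descending order, which is the same idea. The helper `top_k_indices`, the toy vocabulary, and the toy weights are all made up for this example and do not come from the fitted model.

```python
def top_k_indices(weights, k):
    """Return the indices of the k largest weights, largest first."""
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    return order[:k]


# Hypothetical vocabulary and one made-up row of topic-term weights
toy_vocab = ["data", "image", "price", "text", "user"]
toy_topic_weights = [0.2, 3.1, 0.7, 1.5, 0.1]

top_terms = [toy_vocab[i] for i in top_k_indices(toy_topic_weights, k=3)]
print(top_terms)
```

Here `image` has the largest weight, followed by `text` and `price`, so those three terms are printed in that order.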

**Activity 2:** Load the Twitter US Airline Sentiment data and perform the following tasks:

* Preprocess the text column by removing all mentions
* Identify the sentiment (pos, neg, neu) of each tweet using the standard rule mentioned in the lab
* Print the classification report
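
The first two tasks can be sketched without any sentiment library, assuming the rule is the one used in the solution cells: a compound score above 0.05 is positive, below -0.05 is negative, and anything else is neutral. The helper names `strip_mentions` and `label_from_compound` are hypothetical, and unlike the solution this sketch also collapses the leftover whitespace.

```python
import re

MENTION_PATTERN = re.compile(r"@\w+")


def strip_mentions(text):
    """Replace every @mention with a space, then collapse extra whitespace."""
    return " ".join(MENTION_PATTERN.sub(" ", text).split())


def label_from_compound(compound, threshold=0.05):
    """Map a compound polarity score to POS, NEG, or NEU (assumed rule)."""
    if compound > threshold:
        return "POS"
    if compound < -threshold:
        return "NEG"
    return "NEU"


# Made-up tweet and scores, for illustration only
clean = strip_mentions("@VirginAmerica thanks for the help @jetblue")
labels = [label_from_compound(s) for s in (0.6, -0.3, 0.0)]
print(clean, labels)
```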

### Load data file

In [None]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from sklearn.metrics import classification_report

In [None]:
filepath = "/dsa/data/DSA-8410/Twitter-US-Airline-Sentiment/Tweets.csv"
# filepath = "../../../../data/Twitter-US-Airline-Sentiment/Tweets.csv"

df_all = pd.read_csv(filepath)
df_all.head()

### Preprocess the tweets

In [None]:
tweets = [re.sub(r'@(\w+)', ' ', t) for t in df_all['text'].values]

### Identify polarity for each tweet 

In [None]:
analyzer = SentimentIntensityAnalyzer()
tweets_sentiment = [analyzer.polarity_scores(t) for t in tweets]

df = pd.DataFrame(tweets_sentiment)
df['tweet'] = tweets
df.head()

In [None]:
df.describe()

### Perform a rule-based classification

In [None]:
# Default to neutral, then override wherever the compound score passes a threshold
df['sentiment'] = 'NEU'
df.loc[df['compound'] > 0.05, 'sentiment'] = 'POS'
df.loc[df['compound'] < -0.05, 'sentiment'] = 'NEG'

df.head()

In [None]:
import seaborn as sns
sns.set()
sns.boxplot(x="sentiment", y="compound", data=df);

### Report classification metrics

In [None]:
y_true = df_all["airline_sentiment"].map({'neutral': 'NEU', 'positive': 'POS', 'negative': 'NEG'})
y_pred = df['sentiment']
print(classification_report(y_true, y_pred))
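
To help read the report, the sketch below computes precision and recall for a single class by hand. The helper `precision_recall` and the six toy labels are made up for illustration; the numbers say nothing about the airline data.

```python
def precision_recall(y_true, y_pred, label):
    """Compute precision and recall for one class from paired label lists."""
    true_pos = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    predicted = sum(p == label for p in y_pred)  # how often the class was predicted
    actual = sum(t == label for t in y_true)     # how often the class really occurs
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    return precision, recall


# Hypothetical gold labels and predictions
toy_true = ["NEG", "NEG", "NEU", "POS", "NEG", "POS"]
toy_pred = ["NEG", "POS", "NEU", "POS", "NEU", "POS"]

neg_precision, neg_recall = precision_recall(toy_true, toy_pred, "NEG")
pos_precision, pos_recall = precision_recall(toy_true, toy_pred, "POS")
print(neg_precision, neg_recall, pos_precision, pos_recall)
```

In this toy case every predicted `NEG` is correct (precision 1.0), but only one of the three true `NEG` tweets is found (recall 1/3).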