# Text Mining and Sentiment Analysis with NLTK and pandas in Python



## Data import

In [None]:
import pandas as pd

# Import some Tweets from Elon Musk 
df = pd.___("https://raw.githubusercontent.com/kirenz/twitter-tweepy/main/tweets.csv")
# Show first 3 rows
df.head(___)

## Data transformation

In [None]:
# Transform strings to lower case (overwrite the column text)
df['___'] = df['___'].astype(str).str.___()

df.head(3)

## Tokenization

- Install [NLTK](https://anaconda.org/anaconda/nltk): 

```bash
conda install -c anaconda nltk
```


- We use NLTK's [RegexpTokenizer](https://www.nltk.org/_modules/nltk/tokenize/regexp.html) to perform [tokenization](https://en.wikipedia.org/wiki/Lexical_analysis#Tokenization) in combination with regular expressions. 

- To learn more about regular expressions ("regexp"), visit the following sites:
    - [regular expression basics](https://www.w3schools.com/python/python_regex.asp)
    - [interactive regular expressions tool](https://regex101.com/)

- `\w+` matches one or more Unicode word characters; this includes most characters that can be part of a word in any language, as well as digits and the underscore.
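
To preview what `\w+` does to a tweet, here is a minimal sketch using Python's built-in `re` module. It assumes that `RegexpTokenizer` with its default settings returns the same matches as `re.findall` for the same pattern; the sample string is made up for illustration.

```python
import re

# Made-up sample text, already lower-cased as in the previous step
sample = "tesla's model_3 costs $35,000! https://t.co/abc"

# \w+ grabs each maximal run of letters, digits and underscores;
# everything else (spaces, punctuation, symbols) acts as a separator
tokens = re.findall(r"\w+", sample)
print(tokens)
```

Punctuation splits words such as "tesla's" into two tokens, and URLs break into fragments like `https`, which is why `https` is added to the stop list in the next section.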

In [None]:
from nltk.tokenize import RegexpTokenizer

regexp = RegexpTokenizer('___')

# Create a new column called text_token from text column
# apply the function regexp
df['___']=df['___'].apply(___.tokenize)

df.head(3)


## Stopwords

- Stop words are common words (like "will", "and", "or", "has", ...) collected in a stop list. They are dropped before analysing natural language data because they carry little information on their own.
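
The idea can be sketched without NLTK. The stop list below is a hypothetical mini list chosen for illustration; NLTK's English list is considerably longer.

```python
# Hypothetical mini stop list for illustration only
stop_list = {"will", "and", "or", "has", "the"}

# Made-up tokens standing in for one tokenized tweet
tokens = ["the", "rocket", "will", "launch", "and", "land"]

# Keep only tokens that are not in the stop list
filtered = [token for token in tokens if token not in stop_list]
print(filtered)
```

The sketch stores the stop list in a `set`, which makes each membership check cheap. A plain list, as used in this notebook, gives the same result and is fast enough for a small dataset.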

In [None]:
import nltk

nltk.download('stopwords')

In [None]:
import nltk

# Make a list of English stopwords
# (the list is called stopwords because the cells below refer to it by that name)
stopwords = nltk.corpus.stopwords.words("english")

# Extend the list with your own custom stopwords
# add the word https
my_stopwords = ['___']

stopwords.extend(my_stopwords)

- We use a [lambda function](https://www.w3schools.com/python/python_lambda.asp) to remove the stopwords:

In [None]:
# Remove stopwords from text_token
df['___'] = df['___'].apply(lambda x: [item for item in x if item not in stopwords])

df.head(3)

## Remove infrequent words

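- We keep only tokens that are longer than two characters. Note that the code filters by word length, not by how often a word occurs.
- This operation also changes the data format of our column `text_token`: each list of tokens is joined into a single space-separated string (notice the missing brackets).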

In [None]:
df['text_token'] = df['text_token'].apply(lambda x: ' '.join([item for item in x if len(item)>2]))

df.head(3)
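
To make the effect of the cell above concrete, here is a small sketch that applies the same expression to one made-up token list and shows the change from a list to a string.

```python
# Made-up tokens standing in for one row of text_token
tokens = ["rt", "tesla", "is", "on", "mars", "go"]

# Same logic as in the notebook cell: drop tokens of one or two characters,
# then join the remaining tokens with single spaces
kept = [item for item in tokens if len(item) > 2]
joined = " ".join(kept)

print(type(tokens).__name__, "->", type(joined).__name__)
print(joined)
```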

## Lemmatization

- Next, we perform [lemmatization](https://en.wikipedia.org/wiki/Lemmatisation).

In [None]:
nltk.download('wordnet')

In [None]:
from nltk.stem import WordNetLemmatizer

wordnet_lem = ___()

# use on text_token
df['___'] = df['___'].apply(wordnet_lem.lemmatize)
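
One caveat: `WordNetLemmatizer.lemmatize` expects a single word, while each row of `text_token` is now a whole string of words, so applying it directly to a row will likely leave most tweets unchanged. A possible workaround is to lemmatize word by word. The sketch below assumes a helper named `lemmatize_text` (hypothetical, not part of NLTK) and uses a toy dictionary in place of `wordnet_lem.lemmatize` so that it runs without the WordNet data.

```python
from typing import Callable


def lemmatize_text(text: str, lemmatize: Callable[[str], str]) -> str:
    """Apply a single-word lemmatizer to every whitespace-separated token."""
    return " ".join(lemmatize(token) for token in text.split())


# Toy stand-in for wordnet_lem.lemmatize (illustration only)
toy_lemmas = {"rockets": "rocket", "cars": "car"}


def toy_lemmatize(word: str) -> str:
    # Return the known lemma, or the word itself if it is not in the toy dictionary
    return toy_lemmas.get(word, word)


result = lemmatize_text("rockets and cars", toy_lemmatize)
print(result)
```

In the notebook, the same helper could be used as `df['text_token'].apply(lambda t: lemmatize_text(t, wordnet_lem.lemmatize))`.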

## Word cloud

- Install [wordcloud](https://amueller.github.io/word_cloud/):

```bash
conda install -c conda-forge wordcloud
```

- [Word cloud example gallery](https://amueller.github.io/word_cloud/auto_examples/index.html#example-gallery)

In [None]:
all_words = ' '.join([word for word in df['text_token']])

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
from wordcloud import WordCloud

wordcloud = WordCloud(width=600, 
                     height=400, 
                     random_state=2, 
                     max_font_size=100).generate(all_words)


plt.figure(figsize=(10, 7))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off');

- Different style:

In [None]:
import numpy as np

x, y = np.ogrid[:300, :300]
mask = (x - 150) ** 2 + (y - 150) ** 2 > 130 ** 2
mask = 255 * mask.astype(int)

wc = WordCloud(background_color="white", repeat=True, mask=mask)
wc.generate(all_words)

plt.axis("off")
plt.imshow(wc, interpolation="bilinear");

## Frequency distributions

In [None]:
nltk.download('punkt')

In [None]:
from nltk.tokenize import word_tokenize
from nltk.probability import FreqDist

# Use all_words
words = nltk.word_tokenize(___)

# Use on words
fd = FreqDist(___)

### Most common words

In [None]:
# Show 3 most common words
fd.most_common(___)

In [None]:
fd.tabulate(___)
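
`FreqDist` builds on Python's `collections.Counter`, so the counting behaviour can be previewed with the standard library alone. The word list below is made up for illustration.

```python
from collections import Counter

# Made-up words standing in for the tokenized all_words string
words = "tesla rocket tesla mars rocket tesla".split()

counts = Counter(words)

# most_common(n) returns the n most frequent (word, count) pairs, highest first
top_2 = counts.most_common(2)
print(top_2)

# Looking up a word returns its count; unseen words count as 0
print(counts["mars"], counts["nasa"])
```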

### Plot common words

In [None]:
# Obtain top 10 words
top_10 = fd.most_common(___)

# Create pandas series to make plotting easier
fdist = pd.Series(dict(top_10))

In [None]:
import seaborn as sns
sns.set_theme(style="ticks")

sns.barplot(y=fdist.index, x=fdist.values, color='blue');

In [None]:
import plotly.express as px

fig = px.bar(y=fdist.index, x=fdist.values)

# sort values
fig.update_layout(barmode='stack', yaxis={'categoryorder':'total ascending'})

# show plot
fig.show()

### Search specific words

In [None]:
# Show frequency of a specific word
fd["nasa"]

## Sentiment analysis



### VADER lexicon

- NLTK provides a simple rule-based model for general sentiment analysis called VADER, which stands for "Valence Aware Dictionary and Sentiment Reasoner" (Hutto & Gilbert, 2014).

In [None]:
nltk.download('vader_lexicon')

### Sentiment Intensity Analyzer

- Create a `SentimentIntensityAnalyzer` object named `analyzer`:

In [None]:
from nltk.sentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

### Polarity scores

- Use the `polarity_scores` method:

In [None]:
# Create column called polarity
df['___'] = df['text_token'].apply(lambda x: analyzer.polarity_scores(x))

df.head(3)

### Transform data

In [None]:
# Change data structure
df = pd.concat(
    [df.drop(['Unnamed: 0', 'id', 'author_id', 'polarity'], axis=1), 
     df['polarity'].apply(pd.Series)], axis=1)

df.head(3)
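
The cell above turns the dictionaries in the `polarity` column into separate columns. As a sketch of the same idea on a tiny made-up DataFrame, the code below builds the score columns from the list of dictionaries, which is one possible alternative to `apply(pd.Series)`. The scores are invented and only mimic the keys (`neg`, `neu`, `pos`, `compound`) that `polarity_scores` returns.

```python
import pandas as pd

# Made-up scores shaped like the dictionaries polarity_scores returns
demo = pd.DataFrame({
    "text": ["great launch", "bad delay"],
    "polarity": [
        {"neg": 0.0, "neu": 0.2, "pos": 0.8, "compound": 0.6},
        {"neg": 0.7, "neu": 0.3, "pos": 0.0, "compound": -0.5},
    ],
})

# Build one column per dictionary key, keeping the original row index
scores = pd.DataFrame(demo["polarity"].tolist(), index=demo.index)

# Replace the dictionary column with the new score columns
demo = pd.concat([demo.drop(columns="polarity"), scores], axis=1)
print(demo.columns.tolist())
```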

In [None]:
# Create new variable with sentiment "neutral," "positive" and "negative"
# Call the new column sentiment 
df['___'] = df['compound'].apply(lambda x: '___' if x >0 else '___' if x==0 else '___')

df.head(4)
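
The labelling rule can also be written as a small named function, which is easier to read and test than a nested lambda. This is a sketch: the function name `label_sentiment` and the `threshold` parameter are assumptions for illustration. With the default `threshold=0.0` it follows the rule used in this notebook (only a score of exactly 0 is neutral).

```python
def label_sentiment(compound: float, threshold: float = 0.0) -> str:
    """Map a compound score to 'positive', 'neutral' or 'negative'.

    `threshold` is a hypothetical parameter: scores within +/- threshold
    are treated as neutral.
    """
    if compound > threshold:
        return "positive"
    if compound < -threshold:
        return "negative"
    return "neutral"


# Made-up compound scores for illustration
labels = [label_sentiment(score) for score in (0.6, 0.0, -0.5)]
print(labels)
```

A small positive threshold widens the neutral band; the VADER documentation suggests cut-offs of about ±0.05, which would correspond to `label_sentiment(score, threshold=0.05)` up to the handling of scores that sit exactly on the boundary.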

### Analyze data

In [None]:
# Tweet with highest positive sentiment (use max)
df.loc[df['compound'].idx___()].values

In [None]:
# Tweet with the most negative sentiment (use min)
# ...this seems to be a misclassification caused by the word "deficit"
df.___[df['compound'].idx___()].values

### Visualize data

In [None]:
# Number of tweets with certain sentiment
sns.countplot(y='___', 
             data=___, 
             palette=['#b2d8d8',"#008080", '#db3d13']
             );

In [None]:
# Lineplot with compound
g = sns.lineplot(x='created_at', y='___', data=df)

g.set(xticklabels=[]) 
g.set(title='Sentiment of Tweets')
g.set(xlabel="Time")
g.set(ylabel="Sentiment")
g.tick_params(bottom=False)

g.axhline(0, ls='--', c = 'grey');

In [None]:
# Boxplot
sns.___(y='compound', 
            x='sentiment',
            palette=['#b2d8d8',"#008080", '#db3d13'], 
            data=df);

Literature:

[Hutto, C.J. & Gilbert, E.E. (2014). VADER: A Parsimonious Rule-based Model for
Sentiment Analysis of Social Media Text. Eighth International Conference on
Weblogs and Social Media (ICWSM-14). Ann Arbor, MI, June 2014.](https://ojs.aaai.org/index.php/ICWSM/article/view/14550)