# Text mining with NLTK



## Setup

For this code tutorial, you will need to install [nltk](https://anaconda.org/anaconda/nltk) and [wordcloud](https://anaconda.org/conda-forge/wordcloud).



## Import data

In [1]:
import pandas as pd

df = pd.read_csv("https://raw.githubusercontent.com/kirenz/twitter-tweepy/main/tweets.csv")
df

Unnamed: 0.1,Unnamed: 0,created_at,id,author_id,text
0,0,2021-12-10T07:20:45.000Z,1469205428227784711,44196397,@Albi_SideArms maybe i will …
1,1,2021-12-10T07:19:05.000Z,1469205011687223298,44196397,@jack https://t.co/ueyR6NAwap
2,2,2021-12-10T06:44:19.000Z,1469196261953884160,44196397,@SawyerMerritt 🤣🤣
3,3,2021-12-10T04:42:00.000Z,1469165476911755264,44196397,@SawyerMerritt Tesla China has done amazing work
4,4,2021-12-10T04:21:25.000Z,1469160298158383109,44196397,@MrBeast 🙏
...,...,...,...,...,...
67,67,2021-12-03T23:35:30.000Z,1466914018615078912,44196397,@joroulette It is an honor to serve NASA and t...
68,68,2021-12-03T19:28:57.000Z,1466851970443010056,44196397,@NASASpaceflight 39A is hallowed spaceflight g...
69,69,2021-12-03T19:22:34.000Z,1466850364012044288,44196397,@EvaFoxU @SawyerMerritt Huge cranes are cool haha
70,70,2021-12-03T19:20:15.000Z,1466849780253003782,44196397,@PPathole @ErcXspace @SpaceX This will look so...


## Lowercase

In [2]:
df['text'] = df['text'].astype(str)
df['text'] = df['text'].str.lower()
df

Unnamed: 0.1,Unnamed: 0,created_at,id,author_id,text
0,0,2021-12-10T07:20:45.000Z,1469205428227784711,44196397,@albi_sidearms maybe i will …
1,1,2021-12-10T07:19:05.000Z,1469205011687223298,44196397,@jack https://t.co/ueyr6nawap
2,2,2021-12-10T06:44:19.000Z,1469196261953884160,44196397,@sawyermerritt 🤣🤣
3,3,2021-12-10T04:42:00.000Z,1469165476911755264,44196397,@sawyermerritt tesla china has done amazing work
4,4,2021-12-10T04:21:25.000Z,1469160298158383109,44196397,@mrbeast 🙏
...,...,...,...,...,...
67,67,2021-12-03T23:35:30.000Z,1466914018615078912,44196397,@joroulette it is an honor to serve nasa and t...
68,68,2021-12-03T19:28:57.000Z,1466851970443010056,44196397,@nasaspaceflight 39a is hallowed spaceflight g...
69,69,2021-12-03T19:22:34.000Z,1466850364012044288,44196397,@evafoxu @sawyermerritt huge cranes are cool haha
70,70,2021-12-03T19:20:15.000Z,1466849780253003782,44196397,@ppathole @ercxspace @spacex this will look so...


## Tokenization

- [RegexpTokenizer](https://www.nltk.org/_modules/nltk/tokenize/regexp.html)
- [regular expression](https://www.w3schools.com/python/python_regex.asp)
- [interactive regular expressions tool](https://regex101.com/)

`\w+` matches one or more consecutive Unicode word characters: letters from most languages, digits, and the underscore. Everything else (whitespace, punctuation, symbols such as `@`, and emoji) separates tokens and is dropped.
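
To see what the pattern keeps and drops, the sketch below applies `\w+` with Python's built-in `re` module to three strings modeled on the lowercased tweets above. It assumes `re.findall` is a fair stand-in for the tokenizer's matching behavior; the variable names are illustrative only.

```python
import re

# Sample strings modeled on the lowercased tweets above (illustrative only)
samples = [
    "@jack https://t.co/ueyr6nawap",
    "@sawyermerritt 🤣🤣",
    "@nasaspaceflight 39a is hallowed",
]

# \w+ keeps runs of letters, digits and underscores;
# "@", ":", "/", "." and emoji act as separators and are dropped
tokens = [re.findall(r"\w+", s) for s in samples]

for s, t in zip(samples, tokens):
    print(f"{s!r} -> {t}")
```

Note that the URL is split into fragments such as `t` and `co`, which is why the later steps remove `https` and very short tokens.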

In [3]:
from nltk.tokenize import RegexpTokenizer

# use a raw string so that \w is not treated as a string escape sequence
regexp = RegexpTokenizer(r'\w+')

df['text_token'] = df['text'].apply(regexp.tokenize)
df['text_token']


0                       [albi_sidearms, maybe, i, will]
1                      [jack, https, t, co, ueyr6nawap]
2                                       [sawyermerritt]
3     [sawyermerritt, tesla, china, has, done, amazi...
4                                             [mrbeast]
                            ...                        
67    [joroulette, it, is, an, honor, to, serve, nas...
68    [nasaspaceflight, 39a, is, hallowed, spaceflig...
69    [evafoxu, sawyermerritt, huge, cranes, are, co...
70    [ppathole, ercxspace, spacex, this, will, look...
71            [evafoxu, sawyermerritt, i, love, norway]
Name: text_token, Length: 72, dtype: object

## Stopwords

If you are using this module for the first time, you need to download the stopword lists:

```python
import nltk

nltk.download('stopwords')
```

In [None]:
import nltk

# make a list of English stopwords
stopwords = nltk.corpus.stopwords.words("english")

# extend the list with your own custom stopwords
my_stopwords = ['https']
stopwords.extend(my_stopwords)

In [None]:
# remove stopwords
df['text_token'] = df['text_token'].apply(lambda x: [item for item in x if item not in stopwords])
df['text_token'] 
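
The filtering step can be tried on a single token list without downloading anything. The sketch below uses a tiny hand-written stopword set (an assumption for illustration, not NLTK's English list) and a hypothetical helper `remove_stopwords`. A set is used because membership tests on a set are faster than on a list.

```python
# Tiny hand-written stopword set, for illustration only;
# NLTK's English list is much longer
demo_stopwords = {"it", "is", "an", "to", "and", "https"}

def remove_stopwords(tokens, stop_set):
    """Return the tokens that are not in stop_set, preserving order."""
    return [tok for tok in tokens if tok not in stop_set]

tokens = ["it", "is", "an", "honor", "to", "serve", "nasa"]
filtered = remove_stopwords(tokens, demo_stopwords)
print(filtered)  # ['honor', 'serve', 'nasa']
```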

## Remove short words

We remove all words with a length of two characters or fewer. Such tokens are mostly noise in this data set, for example URL fragments like `t` and `co`, and they rarely carry sentiment on their own. The following cell also joins the remaining tokens of each tweet back into a single string.

In [None]:
df['text_token'] = df['text_token'].apply(lambda x: ' '.join([w for w in x if len(w)>2]))

In [None]:
df['text_token']
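
As a quick check of what the length filter does, the sketch below applies the same `len(w) > 2` condition to one hand-made token list (the tokens are illustrative). It also shows a side effect worth keeping in mind: short but meaningful words such as `no` are dropped along with the URL fragments.

```python
# Illustrative tokens: URL fragments ("t", "co") plus ordinary words
tokens = ["jack", "t", "co", "no", "tesla", "ai", "work"]

# Keep only tokens longer than two characters
kept = [w for w in tokens if len(w) > 2]
dropped = [w for w in tokens if len(w) <= 2]

print("kept:   ", kept)
print("dropped:", dropped)
```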

## Lemmatization

In [None]:
# uncomment on first use to download the WordNet data
# nltk.download('wordnet')

In [None]:
from nltk.stem import WordNetLemmatizer

wordnet_lem = WordNetLemmatizer()

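# text_token holds one string per tweet at this point, and lemmatize()
# expects a single word, so lemmatize word by word and join again
df['text_token'] = df['text_token'].apply(
    lambda text: ' '.join(wordnet_lem.lemmatize(w) for w in text.split()))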

In [None]:
df['text_token']

## Word cloud

[Word cloud example gallery](https://amueller.github.io/word_cloud/auto_examples/index.html#example-gallery)

In [None]:
# concatenate all tweets into one long string
all_words = ' '.join(df['text_token'])
all_words

In [None]:
import matplotlib.pyplot as plt
from wordcloud import WordCloud

wordcloud = WordCloud(width=600, 
                     height=400, 
                     random_state=2, 
                     max_font_size=100).generate(all_words)

plt.figure(figsize=(10, 7))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()

A different style, using a circular mask:

In [None]:
import numpy as np

x, y = np.ogrid[:300, :300]
mask = (x - 150) ** 2 + (y - 150) ** 2 > 130 ** 2
mask = 255 * mask.astype(int)

wc = WordCloud(background_color="white", repeat=True, mask=mask)
wc.generate(all_words)

plt.axis("off")
plt.imshow(wc, interpolation="bilinear")
plt.show()

## Frequency distributions

In [None]:
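import nltk
from nltk.probability import FreqDist

# word_tokenize needs NLTK's Punkt tokenizer data
# (nltk.download('punkt'); 'punkt_tab' in newer NLTK releases)
words = nltk.tokenize.word_tokenize(all_words)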
fd = FreqDist(words)

### Most common words

In [None]:
fd.most_common(3)

In [None]:
fd.tabulate(3)
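
`FreqDist` is a subclass of Python's `collections.Counter`, so its counting behavior can be sketched with the standard library alone. The example below uses a made-up word list (not the tweet data) to show what `most_common` returns and what a lookup of an unseen word gives.

```python
from collections import Counter

# Made-up word list for illustration; not taken from the tweet data
words = ["tesla", "nasa", "tesla", "spacex", "tesla", "nasa"]

counts = Counter(words)

# most_common(n) returns (word, count) pairs sorted by descending count
top_2 = counts.most_common(2)
print(top_2)  # [('tesla', 3), ('nasa', 2)]

# Looking up a word that never occurred returns 0 instead of raising KeyError
print(counts["mars"])  # 0
```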

In [None]:
# Obtain top 10 words
top_10 = fd.most_common(10)

In [None]:
import seaborn as sns
sns.set_theme(style="ticks")

# Make pandas series for easier plotting
fdist = pd.Series(dict(top_10))

# Horizontal bar plot: words on the y-axis, counts on the x-axis
sns.barplot(y=fdist.index, x=fdist.values);

### Search words

In [None]:
fd["nasa"]

## Sentiment

In [None]:
from nltk.sentiment import SentimentIntensityAnalyzer
