**Important notes:**

- Use your **HdM ID** (e.g. the xy123 in ***xy123***@hdm-stuttgart.de) as **NAME**


- Don't change the name of the file and don't delete any cells.


- Make sure you fill in any place that says  <font color='green'> \# YOUR CODE HERE </font> or "YOUR ANSWER HERE".


- The statement `raise NotImplementedError()` prevents you from handing in tasks with empty cells. Simply delete it when you start working on a cell that contains it.


- Before you turn this problem in (i.e., after you have completed all tasks), make sure everything runs as expected: restart the kernel and run all cells. If you use:
  - *Visual Studio Code*: select "Restart" and then "Run All" 
  - *Colab*: in the menubar, select `Runtime` and click on `Restart and run all`
  - *Jupyter Notebook*: in the menubar, select `Kernel` and click on `Restart & Run All`


Good luck!

In [None]:
NAME = ""

In [None]:
import IPython
assert IPython.version_info[0] >= 3, "Your version of IPython is too old, please update it."

---

# Text Mining with NLTK

## Python setup

We need the following modules in this notebook:

- nltk
- wordcloud
- pandas
- altair

In [None]:
# we suppress some unimportant warnings
import warnings

warnings.simplefilter(action='ignore', category=FutureWarning)

## Data

### Data import

In [None]:
import pandas as pd

# Import some Tweets
df = pd.read_csv("https://raw.githubusercontent.com/kirenz/datasets/master/tweets-cnn.csv")

# drop some columns
df.drop(columns=["author_id", "edit_history_tweet_ids", "id"], inplace=True)

df.head(3)

### Data corrections

In [None]:
df['text'] = df['text'].astype(str).str.lower()

df.head(3)

In [None]:
df['created_at'] = pd.to_datetime(df['created_at'])

df.info()

## Text mining data preparation

### Tokenization


- We use NLTK's [RegexpTokenizer](https://www.nltk.org/_modules/nltk/tokenize/regexp.html) to perform [tokenization](https://en.wikipedia.org/wiki/Lexical_analysis#Tokenization) in combination with regular expressions. 

- To learn more about regular expressions ("regexp"), visit the following sites:
  - [regular expression basics](https://www.w3schools.com/python/python_regex.asp)
  - [interactive regular expressions tool](https://regex101.com/)


- The pattern `\w+` matches one or more consecutive Unicode word characters. This includes most characters that can be part of a word in any language, as well as digits and the underscore.
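
To get a feeling for what `\w+` does, here is a minimal sketch on a made-up sentence. It uses `re.findall` from the Python standard library as a stand-in; `RegexpTokenizer` with the same pattern should give the same tokens. The variable names `sample` and `tokens` are only chosen for this illustration.

```python
import re

# Hypothetical example text (not taken from the dataset)
sample = "breaking: 3 new_items found! https://t.co/abc123"

# \w+ keeps runs of letters, digits and underscores;
# every other character (spaces, ':', '/', '.', '!') separates tokens
tokens = re.findall(r"\w+", sample)
print(tokens)
```

Note how the URL is split into several tokens because `:`, `/` and `.` are not word characters.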

In [None]:
from nltk.tokenize import RegexpTokenizer

Hint

---

```python

regexp = RegexpTokenizer('___') # use regular expression to match (multiple) word characters and numbers

df['text_token']=df['___'].apply(___.tokenize) # insert the data column and the regular expression pattern

```

---


In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
df.head()

In [None]:
# Check your code
assert df.iloc[0, 2] == ['the',
 'body',
 'of',
 'missing',
 'princeton',
 'university',
 'student',
 'misrach',
 'ewunetie',
 'has',
 'been',
 'found',
 'https',
 't',
 'co',
 '66wv0od5ut']


*Compare the entries of `text` with `text_token`. Do you notice any differences?*

### Stopwords

- Stop words are very common words (like "will", "and", "or", "has", ...) that are collected in a stop list and dropped before analysing natural language data because they carry little information.

In [None]:
import nltk

# download the stopwords package
nltk.download('stopwords')

In [None]:
import nltk
from nltk.corpus import stopwords

In [None]:
# Make a list of English stopwords (note: the name `stopwords` now refers to this list)
stopwords = nltk.corpus.stopwords.words("english")

In [None]:
# make your own custom stopwords
my_stopwords = ['https', 'co']

In [None]:
# Extend the stopword list with your own custom stopwords
stopwords.extend(my_stopwords)

- Next, we use a [lambda function](https://www.w3schools.com/python/python_lambda.asp) (anonymous function) to remove the stopwords:

Hint: 

We want to get rid of all stopwords in `text_token` and create a new column called `text_token_s` (for "text token without stopwords"). 

Therefore, we use the following code:


---

```python
df['text_token_s'] = df['text_token'].___(___ x: [__ for __ in x if __ not in ___])
```


---

You need to complete the code with the following information:


- `.apply` applies a function to every entry of the column (i.e. to the token list of every row).


- `lambda x:` is an anonymous function (we don't have to give it a name).


- Use `i` as the loop variable to iterate through every token of a row and only keep tokens that are not in `stopwords`.
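
The interplay of an anonymous function and a list comprehension can be sketched on plain Python lists. This is only an illustration on made-up data: `rows`, `toy_stopwords` and `drop_stopwords` are hypothetical names, and the built-in `map()` stands in for `.apply`.

```python
# Hypothetical toy data: two "rows" of tokens and a tiny stop list
rows = [["the", "cat", "has", "a", "hat"], ["dogs", "and", "cats"]]
toy_stopwords = ["the", "has", "a", "and"]

# Anonymous function: keep every token i of the row x that is not a stop word
drop_stopwords = lambda x: [i for i in x if i not in toy_stopwords]

# map() plays the role of .apply here: it calls the function once per row
cleaned = list(map(drop_stopwords, rows))
print(cleaned)
```

Each row is processed independently, so the result has as many entries as there are rows.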

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
# Check your code
assert df.iloc[1,3] == ['uk',
 'prime',
 'minister',
 'liz',
 'truss',
 'quits',
 'disastrous',
 'six',
 'weeks',
 'office',
 'putting',
 'course',
 'britain',
 'shortest',
 'serving',
 'leader',
 '0o0xqscrxi']

In [None]:
df.head(3)

*Compare the entries of `text_token_s` with `text_token`. Do you notice any differences?*

### Transform data and remove short words

In the next step, we will:

- transform the text tokens to a simple string (i.e. from cell value [a , b , c] to 'a b c') because the following steps in this notebook work on strings rather than on token lists


- remove tokens that are only one or two characters long (because such short tokens usually don't have much value for our analysis)


- save the result in a new column called `text_si` (`s` stands for stopwords and `i` for infrequent words, the label this notebook uses for the removed short tokens)



Hint:


---

```python
___ = df['___'].___(lambda x: ' '.join([__ for __ in __ if len(__)>__]))
```


---



- Name the new column `text_si`.


- Use the column `text_token_s`.


- Use `.apply` to apply a lambda function to every row of the column.


- The lambda function should: 

  - combine (use `join()`) all word tokens (use `i` as the loop variable) from a row into a single string (use a white space `' '` as separator between the tokens)
  - only keep tokens that are longer than 2 characters
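
The combination of a length filter and `join()` can be tried out on a single made-up token list first. This is a minimal sketch; `tokens` and `text` are hypothetical names used only here.

```python
# Hypothetical token list (not taken from the dataset)
tokens = ["eu", "leaders", "meet", "in", "rome", "on", "may", "5"]

# Keep tokens with more than 2 characters and glue them together with spaces
text = " ".join([i for i in tokens if len(i) > 2])
print(text)
```

The result is a plain string without brackets, and the short tokens `eu`, `in`, `on` and `5` are gone.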

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
# Check your code
assert df.iloc[1, 4] == 'prime minister liz truss quits disastrous six weeks office putting course britain shortest serving leader 0o0xqscrxi'

In [None]:
df.head(3)

*Note that this operation changes the format of your cell entries (notice the missing brackets). Do you notice further differences?*


### Lemmatization

- Next, we perform [lemmatization](https://en.wikipedia.org/wiki/Lemmatisation), the process of converting a word to its base form (its lemma).

In [None]:
# we need to download some packages
nltk.download('wordnet')
nltk.download('omw-1.4')

In [None]:
from nltk.stem import WordNetLemmatizer

In [None]:
# create an instance of the WordNetLemmatizer class called wordnet_lem
wordnet_lem = WordNetLemmatizer()

In [None]:
# create a new column called text_sil (l for lemmatization) by applying the method .lemmatize to every entry
df['text_sil'] = df['text_si'].apply(wordnet_lem.lemmatize)

In [None]:
# we check whether there are any differences between the two columns
check_difference = (df['text_sil'] == df['text_si'])

# count the True and False values
check_difference.value_counts()

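*We can observe that on our data, the lemmatization function did not change any of the words (we only have `True` values, which means that `df['text_sil'] == df['text_si']` holds for every row). One likely reason is that `lemmatize` expects a single word, whereas here it receives the whole string of each row.*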

In [None]:
df.to_csv("sentiment-cnn.csv", index=False)

## Data visualization

### Word cloud

We use a word cloud to visualize our data ([word cloud example gallery](https://amueller.github.io/word_cloud/auto_examples/index.html#example-gallery))

In [None]:
# combine all words in one object called all_words
all_words = ' '.join([i for i in df['text_sil']])

In [None]:
all_words

In [None]:
import matplotlib.pyplot as plt
from wordcloud import WordCloud

wordcloud = WordCloud(width=600, 
                     height=400, 
                     random_state=2, 
                     max_font_size=100).generate(all_words)

plt.figure(figsize=(10, 7))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()

- Different style:

In [None]:
import numpy as np

x, y = np.ogrid[:300, :300]
mask = (x - 150) ** 2 + (y - 150) ** 2 > 130 ** 2
mask = 255 * mask.astype(int)

wc = WordCloud(background_color="white", repeat=True, mask=mask)
wc.generate(all_words)

plt.axis("off")
plt.imshow(wc, interpolation="bilinear")
plt.show()

### Frequency distributions

In [None]:
# download the package
nltk.download('punkt')

In [None]:
from nltk.tokenize import word_tokenize
from nltk.probability import FreqDist

In [None]:
# tokenize the words
words_tokens = word_tokenize(all_words)

In [None]:
# build a frequency distribution with FreqDist and save the result as fd
fd = FreqDist(words_tokens)

In [None]:
words_tokens

In [None]:
fd

### Most common words

Find the 3 most common words by using the method `most_common(n=foo)` (foo is a placeholder for the number of words).

Use the object `fd` to obtain the result.

Save the result as `top_3`.
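
NLTK's `FreqDist` builds on Python's `collections.Counter`, so the behaviour of `most_common` can be sketched with the standard library alone. The example below uses made-up tokens; `toy_tokens`, `toy_counts` and `top_2` are hypothetical names.

```python
from collections import Counter

# Hypothetical tokens (not taken from the dataset)
toy_tokens = ["storm", "flood", "storm", "rain", "storm", "flood"]

# Count how often each token occurs
toy_counts = Counter(toy_tokens)

# most_common(n) returns the n most frequent tokens as (token, count) pairs
top_2 = toy_counts.most_common(2)
print(top_2)
```

The result is a list of tuples sorted from the most to the least frequent token, which is the same shape the check below expects for `top_3`.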



In [None]:
# find the 3 most common words
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
# Check your code
assert top_3 == [('trump', 5), ('president', 5), ('russian', 4)]

In [None]:
# show the 3 most common words as table
fd.tabulate(3)

### Plot common words

In [None]:
# Obtain top 10 words
top_10 = fd.most_common(10)

top_10

In [None]:
# make a pandas dataframe from the dictionary
df_dist = pd.DataFrame({"value": dict(top_10)})

df_dist

In [None]:
# reset index to transform index to column
df_dist.reset_index(inplace=True)

df_dist

In [None]:
import altair as alt

alt.Chart(df_dist).mark_bar().encode(
    x=alt.X("value"),
    y=alt.Y("index", sort="-x")
)

### Search specific words

In [None]:
# Show frequency of a specific word
fd["trump"]