## Quick note about Jupyter cells

When you edit a cell in a Jupyter notebook, re-run it by pressing `<Shift> + <Enter>` so that your changes become available to other cells.

Use `<Enter>` to make new lines inside a cell you are editing.

### Code cells
Re-running will execute any statements you have written. To edit an existing code cell, click on it.

### Markdown cells
Re-running will render the markdown text. To edit an existing markdown cell, double-click on it.


### Common Jupyter operations

**Inserting and removing cells**

- Use the "plus sign" icon to insert a cell below the currently selected cell.
- Use "Insert" -> "Insert Cell Above" from the menu to insert a cell above it.

**Clear the output of all cells**

- Use "Kernel" -> "Restart" from the menu to restart the kernel.
- Click on "clear all outputs & restart" to have all the output cleared.

**Show function signature**

Start typing a function name and press `<Shift> + <Tab>`.

# Preprocessing

## Import necessary libraries

Import the following packages: `pandas as pd`, `csv`, `nltk` and `matplotlib.pyplot as plt`. Also import `word_tokenize` from `nltk.tokenize`.

In [1]:
import pandas as pd
import csv
import nltk 
from nltk.tokenize import word_tokenize
import matplotlib.pyplot as plt 

## Load data

We have already collected some sample data for you, so you only have to load it into a pandas dataframe with `pd.read_csv()`. Typing a variable name into a Jupyter cell and running it shows you its current content.

In [2]:
tweets = pd.read_csv('data\\tweets\\tweets.tsv', sep='\t', header=None, names=["id", "sentiment","md5","related","text"])

In [3]:
tweets

| | id | sentiment | md5 | related | text |
|---|---|---|---|---|---|
| 0 | 385497381925847040 | positive | 16d4ea17feedebbc87d8ddf3f57c172b | [] | Not Available |
| 1 | 364483696570945536 | neutral | a6416685aa01bb28315ec81baa50639d | [] | Not Available |
| 2 | 373606425379225600 | positive | e9f93f030ab466d8aa624d1bfb33d31b | [] | Not Available |
| 3 | 367189542857482240 | neutral | c020aa23ff1f8ff985ce489b2b678674 | [] | Tainted Talents (Ateliertagebuch.) » Wir sind ... |
| 4 | 368327046574776321 | neutral | 0096b66e311fffcca65c23d2a310083b | [] | Aber wenigstens kommt #Supernatural heute mal ... |
| 5 | 390690148188712960 | positive | 66ffcee70a34442d5e3df0b39e359d11 | [] | Not Available |
| 6 | 368309870673793024 | neutral | 575fd73efa41e07e2b0c360a721d19d7 | [] | DARLEHEN - Angebot für Schufa-freie Darlehen: ... |
| 7 | 362896018389475328 | neutral | 8b824c765a7642a980b9e14c02830126 | [] | ANRUF ERWÜNSCHT: Hardcore Teeny Vicky Carrera:... |
| 8 | 367912309303148545 | neutral | a07f6bc77b0cfb06a75c7cb1a88d752b | [] | Na? Wo sind Frankens heimliche Talente? - Die ... |
| 9 | 364291928692490241 | positive | 40d5d6cccb61d3d230c730e60ca4dbf1 | [] | ... Glück breitet sich aus ... |
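
To see what the `read_csv()` arguments do without touching the real file, here is a minimal sketch on two made-up rows. The rows, the `sample_tsv` string and the `sample` variable are assumptions for illustration only.

```python
import io

import pandas as pd

# Two made-up rows in the same five-column, tab-separated layout (not real data).
sample_tsv = (
    "1\tpositive\tabc123\t[]\tSchönes Wetter heute\n"
    "2\tneutral\tdef456\t[]\tNot Available\n"
)

# header=None: the input has no header row, so the first line is data.
# names=...:   supplies the column labels instead.
sample = pd.read_csv(
    io.StringIO(sample_tsv),
    sep="\t",
    header=None,
    names=["id", "sentiment", "md5", "related", "text"],
)

print(sample.shape)  # (2, 5)
print(list(sample.columns))
```

Without `header=None`, pandas would treat the first data row as column names and silently drop it from the data.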


## Inspect and clean data

The dataset contains several messages that were deleted by their authors after posting. Use pandas' `str.contains()`, or any other method such as `loc`, to remove all rows whose message is no longer accessible ("Not Available").

In [None]:
tweets = ...

In [None]:
tweets
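
Both approaches mentioned above can be tried on a toy dataframe first. This is a sketch under assumed data: the `demo` frame and its rows are hypothetical, and the exercise cell above still has to be filled in for the real `tweets` dataframe.

```python
import pandas as pd

# Hypothetical toy frame; the real data comes from tweets.tsv.
demo = pd.DataFrame({
    "sentiment": ["positive", "neutral", "negative"],
    "text": ["Not Available", "Guten Morgen", "Not Available"],
})

# Option 1: an exact comparison builds a boolean mask; .loc keeps the True rows.
kept_exact = demo.loc[demo["text"] != "Not Available"]

# Option 2: substring match; ~ negates the mask, regex=False matches literally.
kept_contains = demo[~demo["text"].str.contains("Not Available", regex=False)]

print(len(kept_exact), len(kept_contains))  # 1 1
```

The exact comparison only removes rows whose text is exactly the placeholder, while `str.contains()` would also remove a real message that happens to include the phrase.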

The columns we are most interested in are "text" and "sentiment". Use pandas' `groupby()` method in combination with `count()` to get a first impression of how our labels are distributed.



In [4]:
tweets.groupby('sentiment').count()

| sentiment | id | md5 | related | text |
|---|---|---|---|---|
| negative | 1617 | 1617 | 1617 | 1617 |
| neutral | 5883 | 5883 | 5883 | 5883 |
| positive | 2439 | 2439 | 2439 | 2439 |
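
The same counting can be sketched on a small made-up frame (the `demo` data is an assumption). `value_counts()` is shown as an alternative that returns one number per label instead of one per column.

```python
import pandas as pd

# Hypothetical toy frame with an uneven label distribution.
demo = pd.DataFrame({
    "sentiment": ["positive", "neutral", "neutral", "negative"],
    "text": ["a", "b", "c", "d"],
})

# count() reports the number of non-missing values per column and group.
counts = demo.groupby("sentiment").count()
print(counts)

# value_counts() on the label column yields the same numbers as a Series.
label_counts = demo["sentiment"].value_counts()
print(label_counts["neutral"])  # 2
```

In the output above, "neutral" accounts for 5883 of 9939 rows (about 59%), so the labels are clearly imbalanced.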


Drop the unnecessary columns to retain a dataframe with the two columns "text" and "sentiment".





In [None]:
tweets = ...
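
Two common ways to reduce a dataframe to a subset of its columns, sketched on made-up data (the `demo` frame is hypothetical): select the columns you want, or drop the ones you do not.

```python
import pandas as pd

# Hypothetical toy frame with the same column names as the tweets data.
demo = pd.DataFrame({
    "id": [1, 2],
    "sentiment": ["positive", "negative"],
    "md5": ["abc", "def"],
    "related": ["[]", "[]"],
    "text": ["hello", "world"],
})

# Option 1: select the columns to keep (this also sets their order).
kept = demo[["text", "sentiment"]]

# Option 2: drop the unwanted columns (the remaining order is unchanged).
dropped = demo.drop(columns=["id", "md5", "related"])

print(list(kept.columns))     # ['text', 'sentiment']
print(list(dropped.columns))  # ['sentiment', 'text']
```

Selecting is usually the safer choice when a later step depends on column order, for example when the data is saved without a header row.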

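Use the pandas method `str.replace()` to get rid of [Twitter handles](https://www.urbandictionary.com/define.php?term=twitter%20handle) (hint: pass a regular expression written as a raw string, `r'my_regex'`, as the first argument of `str.replace()`; in pandas 2.0 and later you also need to pass `regex=True`, because the pattern is otherwise treated as a literal string).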

In [5]:
tweets['text'] = ...

In [None]:
tweets['text']
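
A minimal sketch of regex-based replacement on two made-up messages. The pattern `@\w+` ("@" followed by word characters) is an assumption about what a handle looks like, not the notebook's reference solution.

```python
import pandas as pd

# Hypothetical example messages.
texts = pd.Series(["@anna Danke!", "Hallo @bob_99 und @carl"])

# Assumed handle pattern: "@" followed by one or more word characters.
no_handles = texts.str.replace(r"@\w+", "", regex=True)

# Removing handles leaves stray spaces; collapse and trim them.
no_handles = no_handles.str.replace(r"\s+", " ", regex=True).str.strip()

print(no_handles.tolist())  # ['Danke!', 'Hallo und']
```

Note that this simple pattern would also strip the part after "@" in an e-mail address, so inspect the result before relying on it.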

**Advanced**
- Get rid of links. 
- Inspect the rest of the columns and keep some if they might contain information relevant to our prediction at a later stage.

In [None]:
tweets['text'] = ...

In [None]:
tweets['text']
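
For the link removal, one possible starting point is shown below on made-up messages. The pattern `https?://\S+` is an assumption that only catches links starting with `http://` or `https://`.

```python
import pandas as pd

# Hypothetical example messages.
texts = pd.Series([
    "Mehr dazu: https://example.com/a?b=1 jetzt",
    "kein Link hier",
])

# Assumed link pattern: the scheme followed by everything up to the next space.
no_links = texts.str.replace(r"https?://\S+", "", regex=True)

# Collapse the double space left behind and trim the ends.
no_links = no_links.str.replace(r"\s+", " ", regex=True).str.strip()

print(no_links.tolist())  # ['Mehr dazu: jetzt', 'kein Link hier']
```

Links written without a scheme, such as `www.example.com`, are not matched by this pattern and would need a broader expression.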

## Create single string to count word frequencies

In order to visualize word frequencies, we will concatenate all messages to create one long string containing all words present in these messages. 

Use `str.cat()` with `sep=' '` on the column 'text' to create one string containing all messages.

In [None]:
merged_tweets = ...
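
As a small illustration of what `str.cat()` does when called without a second Series, here is a sketch on three made-up messages (the `texts` Series is hypothetical):

```python
import pandas as pd

# Hypothetical example messages.
texts = pd.Series(["Guten Morgen", "schönes Wetter", "bis bald"])

# With no other Series given, str.cat() joins all elements into one string.
merged = texts.str.cat(sep=" ")

print(merged)  # Guten Morgen schönes Wetter bis bald
```

The separator matters: without `sep=' '`, the last word of one message would be glued to the first word of the next.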

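Pass `merged_tweets` to the `nltk` function `word_tokenize()` to create a list of tokens. If NLTK reports a missing resource, download it once with `nltk.download()` as the error message suggests.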

In [None]:
merged_tweets_tokens = ...
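
To get a feel for what a token list looks like, the sketch below uses `wordpunct_tokenize()` as a stand-in. It is a different, purely regex-based NLTK tokenizer that needs no downloaded models, so its splits can differ from those of `word_tokenize()`. The example sentence is made up.

```python
from nltk.tokenize import wordpunct_tokenize

# Hypothetical example sentence.
sentence = "Guten Morgen, Welt!"

# Splits into runs of word characters and runs of punctuation.
tokens = wordpunct_tokenize(sentence)

print(tokens)  # ['Guten', 'Morgen', ',', 'Welt', '!']
```

Punctuation marks become tokens of their own, which is why they will show up in the frequency plot later on.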

We will use matplotlib to plot token frequencies. Set the matplotlib figure size to `(15, 8)` to create a larger plot area.
Pass `merged_tweets_tokens` to `nltk.FreqDist()`, save the result as `fd` and call `fd.plot(50, cumulative=False)` to plot the 50 most frequent tokens.

In [None]:
# plot token frequencies
plt.figure(figsize=(.., ..))  

fd = nltk. ...
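
Before plotting, it can help to look at the counts directly. This is a sketch on a made-up token list; it leaves out the plot so that it runs without a display.

```python
import nltk

# Hypothetical token list; in the notebook this would be merged_tweets_tokens.
tokens = ["der", "Hund", "und", "der", "Ball", "und", "der", "Garten"]

# FreqDist counts how often each token occurs.
fd = nltk.FreqDist(tokens)

print(fd["der"])          # 3
print(fd.most_common(2))  # [('der', 3), ('und', 2)]
```

`fd.plot()` draws these same counts, ordered from most to least frequent.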

## Save data

Use the pandas `to_csv()` method to save your dataframe as a .csv file. Name it "training_data_tweets.csv", set `encoding='utf-8'`, use `quoting=csv.QUOTE_ALL`, `header=False` and `index=False`.

In [None]:
# save the dataframe in a format you can easily import in the following notebook
tweets...
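
To preview what these `to_csv()` options produce, the sketch below writes a made-up frame to a string instead of a file. The `demo` data is hypothetical, and `encoding` is left out because nothing is written to disk here.

```python
import csv

import pandas as pd

# Hypothetical two-column frame in the shape the exercise asks for.
demo = pd.DataFrame({
    "text": ["hello world", "bad day"],
    "sentiment": ["positive", "negative"],
})

# Without a path, to_csv() returns the CSV content as a string.
# QUOTE_ALL wraps every field in double quotes, which protects commas in tweets.
csv_text = demo.to_csv(index=False, header=False, quoting=csv.QUOTE_ALL)

print(csv_text)
```

Because `header=False` drops the column names, whoever reads the file later needs to know that the first column is the text and the second the label.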