## Quick note about Jupyter cells

After editing a cell in a Jupyter notebook, re-run it by pressing `<Shift> + <Enter>`. This makes your changes available to other cells.

Use `<Enter>` to make new lines inside a cell you are editing.

### Code cells
Re-running will execute any statements you have written. To edit an existing code cell, click on it.

### Markdown cells
Re-running will render the markdown text. To edit an existing markdown cell, double-click on it.


### Common Jupyter operations

**Inserting and removing cells**

- Use the "plus sign" icon to insert a cell below the currently selected cell.
- Use "Insert" -> "Insert Cell Above" from the menu to insert a cell above it.

**Clear the output of all cells**

Use "Kernel" -> "Restart" from the menu to restart the kernel, then click on "clear all outputs & restart" to have all the output cleared.

**Show function signature**

Start typing a function name and press `<Shift> + <Tab>`.

# Preprocessing

## import necessary libraries

Import the following packages: `pandas as pd`, `csv`, `nltk` and `matplotlib.pyplot as plt`

In [None]:
import pandas as pd
import csv
import nltk 
import matplotlib.pyplot as plt 

## load data

We have already collected some sample data for you, so you only have to load it into a pandas dataframe using the function `pd.read_csv()`. Typing a variable name into a Jupyter cell and running it shows you the variable's current content.

In [None]:
tweets = pd.read_csv('data\\tweets\\tweets.tsv', sep='\t', header=None, names=["id", "sentiment", "md5", "related", "text"])

In [None]:
tweets
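To see what `sep='\t'`, `header=None` and `names=[...]` do, here is a minimal sketch that reads a tiny tab-separated string instead of the real file. The two rows and their values are made up for illustration; only the column names come from the cell above.

```python
import io

import pandas as pd

# Two made-up rows in the same tab-separated layout (illustrative values only).
raw = (
    "1\tpositive\tabc123\ttopic_a\tGreat day\n"
    "2\tnegative\tdef456\ttopic_b\tNot Available\n"
)

columns = ["id", "sentiment", "md5", "related", "text"]

# header=None: the first line is data, not column names.
# names=...: supplies the column names ourselves.
toy = pd.read_csv(io.StringIO(raw), sep="\t", header=None, names=columns)
print(toy)
```

Without `header=None`, pandas would treat the first tweet as the header row and silently drop it from the data.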

## inspect and clean data

The dataset contains several messages that were deleted by the user after posting. Use pandas' `str.contains()` or any other method like `loc` to remove all rows that represent a message that is no longer accessible ("Not Available").
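As a hint, the general pattern could look like the following sketch on a made-up dataframe. The name `toy` and its contents are assumptions for illustration; adapt the idea to `tweets` yourself.

```python
import pandas as pd

# Made-up example data; the real dataset has more columns and rows.
toy = pd.DataFrame({
    "sentiment": ["positive", "neutral", "negative"],
    "text": ["Great game tonight!", "Not Available", "Worst service ever"],
})

# Boolean mask: True for rows whose text contains the placeholder.
# regex=False treats the search term as a plain string.
is_deleted = toy["text"].str.contains("Not Available", regex=False)

# ~ inverts the mask, so only rows that are still available are kept.
toy_clean = toy.loc[~is_deleted].reset_index(drop=True)
print(toy_clean)
```

`reset_index(drop=True)` is optional; it just renumbers the remaining rows from 0.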

In [None]:
#TODO:
tweets = ...

In [None]:
tweets

The columns we are most interested in are "text" and "sentiment". Use pandas' `groupby()` method in combination with `count()` to get a first impression of how our labels are distributed.



In [1]:
tweets.groupby('sentiment').count()
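On a small made-up dataframe, the same call behaves as sketched below. `value_counts()` is shown as a possible alternative when only the label counts are needed; the data is purely illustrative.

```python
import pandas as pd

# Made-up labels and texts for illustration.
toy = pd.DataFrame({
    "sentiment": ["positive", "negative", "positive", "neutral"],
    "text": ["a", "b", "c", "d"],
})

# One row per label; each remaining column holds the number of non-missing values.
counts = toy.groupby("sentiment").count()
print(counts)

# Alternative: count the labels directly, sorted from most to least frequent.
label_counts = toy["sentiment"].value_counts()
print(label_counts)
```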


Drop the unnecessary columns to retain a dataframe with the two columns "text" and "sentiment".
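Two common ways to do this are selecting the columns to keep or dropping the ones you do not need. The sketch below uses a made-up dataframe with the same column names; the values are illustrative only.

```python
import pandas as pd

# Made-up row with the same column names as the tweets dataframe.
toy = pd.DataFrame({
    "id": [1, 2],
    "sentiment": ["positive", "negative"],
    "md5": ["abc123", "def456"],
    "related": ["topic_a", "topic_b"],
    "text": ["Great day", "Bad day"],
})

# Option 1: select the columns to keep (this also fixes their order).
by_selection = toy[["text", "sentiment"]]

# Option 2: drop the columns that are not needed (the original order is kept).
by_drop = toy.drop(columns=["id", "md5", "related"])

print(list(by_selection.columns))
print(list(by_drop.columns))
```

The two options produce the columns in a different order, which matters later if the data is saved without a header row.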





In [None]:
#TODO:
tweets = ...

Use the pandas function `str.replace()` to get rid of [twitter handles](https://www.urbandictionary.com/define.php?term=twitter%20handle) (hint: pass a regular expression written as `r'my_regex'` as the first argument to `str.replace()`, the replacement string as the second, and set `regex=True`, because recent pandas versions treat the pattern as a literal string by default). Example to remove all numbers: `data['column'].str.replace(r'[0-9]+', '', regex=True)`

Information on how to build regular expressions: https://www.rexegg.com/regex-quickstart.html#quantifiers

Test your regular expressions: https://regex101.com/
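The number-removal example can be tried on made-up data as in the sketch below; the dataframe `data` and its contents are assumptions for illustration. The same pattern of calls should carry over to handles once you have found a suitable regular expression.

```python
import pandas as pd

# Made-up strings for illustration.
data = pd.DataFrame({"column": ["room 101", "call 555 0199 now", "no digits"]})

# Replace every run of digits with an empty string.
# regex=True makes pandas interpret the pattern as a regular expression.
data["column"] = data["column"].str.replace(r"[0-9]+", "", regex=True)

# Removing tokens can leave extra spaces behind; collapse and trim them.
data["column"] = (
    data["column"].str.replace(r"\s+", " ", regex=True).str.strip()
)
print(data["column"].tolist())
```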

In [None]:
#TODO:
tweets['text'] = ...

In [None]:
tweets['text']

**Advanced**
- Get rid of links. 
- Inspect the rest of the columns and keep some if they might contain information relevant to our prediction at a later stage.

In [None]:
#TODO:
tweets['text'] = ...

In [None]:
tweets['text']

## create single string to count word frequencies 

In order to visualize word frequencies, we will concatenate all messages to create one long string containing all words present in these messages. 


In [None]:
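from nltk.tokenize import word_tokenize

# join all messages into one long string, separated by single spaces
merged_tweets = tweets['text'].str.cat(sep=' ')

# split the string into tokens; if this raises a LookupError, the error message
# names the tokenizer resource to fetch once with nltk.download()
merged_tweets_tokens = word_tokenize(merged_tweets)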
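What `str.cat(sep=' ')` returns can be checked on a made-up column, as in this sketch. Plain `str.split()` stands in for `word_tokenize` here so the example runs without NLTK's tokenizer data; unlike `word_tokenize`, it does not separate punctuation from words.

```python
import pandas as pd

# Made-up messages for illustration.
toy = pd.DataFrame({"text": ["good morning", "good night", "sleep well"]})

# Concatenate all rows of the column into one string, separated by spaces.
merged = toy["text"].str.cat(sep=" ")

# Stand-in tokenizer: split on whitespace only.
tokens = merged.split()

print(merged)
print(tokens)
```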

### Plot the word frequencies

In [None]:
# plot token frequencies
plt.figure(figsize=(15, 8))  

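# count how often each token occurs
fd = nltk.FreqDist(merged_tweets_tokens)

# plot the 50 most frequent tokens as absolute (non-cumulative) counts
fd.plot(50, cumulative=False)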
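If you prefer to check the most frequent tokens as text rather than in a plot, a sketch with `collections.Counter` from the standard library could look like this. `nltk.FreqDist` builds on `Counter`, so `most_common()` works on `fd` in the same way; the token list below is made up.

```python
from collections import Counter

# Made-up token list for illustration.
tokens = ["good", "morning", "good", "night", "sleep", "well", "good"]

# Count how often each token occurs.
freq = Counter(tokens)

# The single most frequent token together with its count.
top = freq.most_common(1)
print(top)
```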

##### If you still see symbols or words that seem unnecessary for the classification, jump back up and replace them

## save data

Use the pandas `to_csv()` method to save your dataframe as a .csv file. Name it "training_data_tweets.csv", set `encoding='utf-8'`, use `quoting=csv.QUOTE_ALL`, `header=False` and `index=False`.

In [None]:
# save the dataframe in a format you can easily import in the following notebook
tweets.to_csv("training_data_tweets.csv", encoding='utf-8', quoting=csv.QUOTE_ALL, header=False, index=False)
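To check that a file written with these settings can be read back, here is a minimal round-trip sketch. It writes a made-up dataframe to a temporary directory, so the file name `toy.csv` and the data are assumptions for illustration. Because the file has no header row, the column names have to be supplied again when loading.

```python
import csv
import os
import tempfile

import pandas as pd

# Made-up data; the first text contains a comma and quotes on purpose.
toy = pd.DataFrame({
    "text": ['He said "hi", then left', "plain text"],
    "sentiment": ["positive", "neutral"],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy.csv")  # hypothetical file name

    # Same settings as above: every field quoted, no header row, no index column.
    toy.to_csv(path, encoding="utf-8", quoting=csv.QUOTE_ALL,
               header=False, index=False)

    # header=None plus names=... restores the column names on load.
    restored = pd.read_csv(path, encoding="utf-8", header=None,
                           names=["text", "sentiment"])

print(restored)
```

Quoting every field keeps commas and quotation marks inside a tweet from being mistaken for column separators.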