# Introduction to Deep Learning

<p align="center">
    <img width="699" alt="image" src="https://user-images.githubusercontent.com/49638680/159042792-8510fbd1-c4ac-4a48-8320-bc6c1a49cdae.png">
</p>

---

# Text Classification problem

Here we tackle a well-known text classification problem: sentiment analysis of tweets.
We use the Sentiment140 Twitter dataset (available [here](https://www.tensorflow.org/datasets/catalog/sentiment140) or through the TensorFlow Datasets library).

The main objectives are:
1. Show a *brief* preliminary analysis of the data (class balance, useful information, feature selection, etc.).
2. Show some visualisations.
3. Answer the questions listed later in this document.
4. Train a model with a test accuracy above $80\%$.
5. *Optional*: deploy the model on a webpage through TensorFlow.js.

**Bonus**: make me learn something I did not know 🙂.

#### Important note
Any choice has to be properly explained and justified.

<details>
    <summary><b>HINT</b></summary> 
    
    Make use of open-source implementations of similar problems you can easily find online!
</details>

## The dataset
<details>
    <summary><b>Click to Expand</b></summary> 
    
We will use the [Sentiment140 dataset](https://www.tensorflow.org/datasets/catalog/sentiment140).

### What is Sentiment140?

Sentiment140 allows you to discover the sentiment of a brand, product, or topic on Twitter.

### How does this work?
The approach is described in the authors' technical report, [Twitter Sentiment Classification using Distant Supervision](http://cs.stanford.edu/people/alecmgo/papers/TwitterDistantSupervision09.pdf). Sentiment140 also has additional features that are not described in that report.

### Who created this?
Sentiment140 was created by Alec Go, Richa Bhayani, and Lei Huang, who were Computer Science graduate students at Stanford University.

    
**Note**: you can download the dataset directly from [TensorFlow Datasets](https://www.tensorflow.org/datasets/catalog/sentiment140).
</details>

I suggest performing your preprocessing steps first and then converting the result to a TensorFlow dataset, a robust format that is ready for parallel computation.

Run the following in a notebook cell to download and unpack the dataset (the leading `!` runs each line as a shell command):

```python
# Make data directory if it doesn't exist
!mkdir -p data
!wget -nc https://nyc3.digitaloceanspaces.com/ml-files-distro/v1/sentiment-analysis-is-bad/data/training.1600000.processed.noemoticon.csv.zip -P data
!unzip -n -d data data/training.1600000.processed.noemoticon.csv.zip
```

```text
--2021-05-21 18:14:03--  https://nyc3.digitaloceanspaces.com/ml-files-distro/v1/sentiment-analysis-is-bad/data/training.1600000.processed.noemoticon.csv.zip
Resolving nyc3.digitaloceanspaces.com (nyc3.digitaloceanspaces.com)... 162.243.189.2
Connecting to nyc3.digitaloceanspaces.com (nyc3.digitaloceanspaces.com)|162.243.189.2|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 85088192 (81M) [application/zip]
Saving to: ‘data/training.1600000.processed.noemoticon.csv.zip’


2021-05-21 18:14:26 (3.58 MB/s) - ‘data/training.1600000.processed.noemoticon.csv.zip’ saved [85088192/85088192]

Archive:  data/training.1600000.processed.noemoticon.csv.zip
  inflating: data/training.1600000.processed.noemoticon.csv
```
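
Once the CSV is unpacked, the suggestion above (preprocess first, then convert) could look like the following minimal sketch. The cleaning rules (dropping URLs and `@mentions`, lower-casing), the function names `clean_tweet` and `to_tf_dataset`, and the batch and shuffle sizes are all illustrative assumptions, not part of the assignment. It also assumes that "TensorFlow dataset" means a `tf.data.Dataset`.

```python
import re

# Illustrative cleaning rules; which ones are appropriate is for you to justify.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
MENTION_PATTERN = re.compile(r"@\w+")
WHITESPACE_PATTERN = re.compile(r"\s+")


def clean_tweet(text):
    """Normalise a raw tweet before tokenisation.

    Args:
        text (str): Raw tweet text.

    Returns:
        str: Lower-cased text with URLs and user mentions removed and
            runs of whitespace collapsed to a single space.
    """
    text = URL_PATTERN.sub(" ", text)
    text = MENTION_PATTERN.sub(" ", text)
    text = WHITESPACE_PATTERN.sub(" ", text)
    return text.strip().lower()


def to_tf_dataset(texts, labels, batch_size=256, shuffle_buffer=10000):
    """Wrap cleaned texts and labels in a shuffled, batched dataset.

    Args:
        texts (Iterable[str]): Cleaned tweet texts.
        labels (Iterable[int]): One label per text.
        batch_size (int): Number of examples per batch (assumed value).
        shuffle_buffer (int): Size of the shuffle buffer (assumed value).

    Returns:
        tf.data.Dataset: Batches of (text, label) pairs.
    """
    # Imported lazily so that the cleaning code does not depend on TensorFlow.
    import tensorflow as tf

    dataset = tf.data.Dataset.from_tensor_slices((list(texts), list(labels)))
    dataset = dataset.shuffle(shuffle_buffer).batch(batch_size)
    return dataset.prefetch(tf.data.AUTOTUNE)
```

A typical call would be `to_tf_dataset([clean_tweet(t) for t in texts], labels)`, where `texts` and `labels` are the columns read from the CSV. Keeping the cleaning step in plain Python makes it easy to unit-test before any model is involved.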


## Hardware suggestion

I strongly advise working in Colab, or in any other environment with a GPU available, in order to minimise training time and to be able to train multiple models.
Recall that experimenting is crucial.

To check whether your instance has a GPU activated, you can run the following code:
```python
import tensorflow as tf

# Get the name of the first GPU device; the string is empty if none is available.
device_name = tf.test.gpu_device_name()

# On Colab the device name typically looks like '/device:GPU:0'.
if device_name:
    print('Found GPU at: {}'.format(device_name))
else:
    raise SystemError('GPU device not found')
```

If you do not have the GPU enabled, just go to:

`Edit -> Notebook Settings -> Hardware accelerator -> Set to GPU`



### Questions to answer

1. Is the dataset balanced?
2. What kind of preprocessing do you think is necessary?
3. Can you use some sort of transfer learning? Which one?
4. How many items contain the word "*bush*"?
5. How many items containing the word "*pussy*" are classified as "positive"?
6. How many items are classified as "neutral" and do not contain any of the words "phone", "computer", "President" or "suck"?
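
A minimal sketch of how the balance and counting questions could be approached with the standard library alone. It assumes the CSV layout documented for Sentiment140 (polarity in the first column, with 0 = negative, 2 = neutral and 4 = positive; tweet text in the sixth column) and Latin-1 encoding. It also assumes that "contains the word" means a case-insensitive whole-word match; a substring match would give different counts, so state and justify whichever reading you choose. All function names are hypothetical.

```python
import csv
import re
from collections import Counter


def read_rows(path):
    """Yield (polarity, text) pairs from the Sentiment140 CSV file."""
    with open(path, encoding="latin-1", newline="") as handle:
        for record in csv.reader(handle):
            yield int(record[0]), record[5]


def contains_word(text, word):
    """Return True if `word` occurs in `text` as a whole word, ignoring case."""
    pattern = r"\b{}\b".format(re.escape(word))
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


def class_balance(rows):
    """Count how many items fall into each polarity class."""
    return Counter(polarity for polarity, _ in rows)


def count_items(rows, include=(), exclude=(), polarity=None):
    """Count items that match the given word and polarity filters.

    Args:
        rows (Iterable[tuple[int, str]]): (polarity, text) pairs.
        include (Sequence[str]): Words that must all be present.
        exclude (Sequence[str]): Words none of which may be present.
        polarity (int | None): Required polarity, or None for any.

    Returns:
        int: Number of matching items.
    """
    count = 0
    for row_polarity, text in rows:
        if polarity is not None and row_polarity != polarity:
            continue
        if not all(contains_word(text, word) for word in include):
            continue
        if any(contains_word(text, word) for word in exclude):
            continue
        count += 1
    return count
```

Note that `read_rows` is a generator, so it has to be called again (or wrapped in `list(...)`) for each new question.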

## General assignments

* Write your code following the [PEP 8 style guide](https://www.python.org/dev/peps/pep-0008/).
* Docstrings have to be written in [Google Style](https://sphinxcontrib-napoleon.readthedocs.io/en/latest/example_google.html).
* It is strongly advised to collect your functions in your own modules and import them in the notebook (this will make the following point almost effortless). To import custom modules in Colab [look at this example](https://colab.research.google.com/drive/1uvHuizCBqFgvbCwEhK7FvU8JW0AfxgJw#scrollTo=psH0aLrvoh78).
* Once you are sure the notebook runs smoothly, write a Python script that can be run from the command line to train your model:

```bash
python3 -m train --conf config.yml
```

The `config.yml` file has to contain configuration instructions on the model architecture (kind of layers, number of layers, number of units, activations, etc.), on training (number of epochs, batch size, whether to apply early stopping, optimiser, etc.) and on script metadata (where to get the data, where to save the output model).
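
As an illustration of this requirement, here is one possible shape for `config.yml` together with the command-line entry point of a `train.py` script. Every key name in the example configuration, and the helper names `parse_args`, `load_config` and `missing_sections`, are assumptions made for this sketch. `load_config` relies on the third-party PyYAML package, and the model building and training steps are deliberately left out.

```python
import argparse

# Hypothetical contents of config.yml; every key name here is an assumption.
EXAMPLE_CONFIG = """\
model:
  layers:
    - {type: embedding, units: 64}
    - {type: lstm, units: 32}
    - {type: dense, units: 1, activation: sigmoid}
training:
  epochs: 5
  batch_size: 256
  early_stopping: true
  optimizer: adam
metadata:
  data_path: data/training.1600000.processed.noemoticon.csv
  output_path: models/sentiment
"""

REQUIRED_SECTIONS = ("model", "training", "metadata")


def parse_args(argv=None):
    """Parse the command-line arguments of the training script."""
    parser = argparse.ArgumentParser(prog="train")
    parser.add_argument("--conf", required=True,
                        help="Path to the YAML configuration file.")
    return parser.parse_args(argv)


def load_config(path):
    """Read the YAML configuration file into a dictionary."""
    import yaml  # PyYAML, assumed to be installed

    with open(path, encoding="utf-8") as handle:
        return yaml.safe_load(handle)


def missing_sections(config):
    """Return the required top-level sections absent from `config`."""
    return [name for name in REQUIRED_SECTIONS if name not in config]


def main(argv=None):
    """Load and validate the configuration, then hand over to training."""
    args = parse_args(argv)
    config = load_config(args.conf)
    missing = missing_sections(config)
    if missing:
        raise ValueError("Missing config sections: {}".format(", ".join(missing)))
    # Model building and training, driven by `config`, would go here.
    print("Loaded configuration from {}".format(args.conf))
```

In a real `train.py`, `main()` would be called under the usual `if __name__ == "__main__":` guard so that `python3 -m train --conf config.yml` runs it. Validating the configuration before any data is loaded makes a typo in the file fail fast instead of after a long preprocessing step.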

* Finally (optionally), you can serve your model on a webpage with TensorFlow.js.

<div style="margin: 0 auto; text-align: center">
    <a href="https://colab.research.google.com/github/oscar-defelice/DeepLearning-lectures/blob/master/src/FinalProject.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
</div>