# Boring stuff: setting everything up

*Warning: run this section only once*

Connect to your Google Drive so that your work is not lost when your session ends.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


Change working directory to your Google Drive

In [None]:
%cd /content/drive/MyDrive/

/content/drive/MyDrive


Create the main directory for the laboratory inside your Google Drive

In [None]:
!mkdir -p NLP_MASTER

Remove unwanted directories (if it is your first run these directories do not exist and the following two commands have no effect)

In [None]:
!rm -rf /content/drive/MyDrive/NLP_MASTER/finance

In [None]:
!rm -rf /content/drive/MyDrive/NLP_MASTER/spacy-projects

Now let's install all the dependencies for the laboratory

In [None]:
!pip install -U pip setuptools wheel

In [None]:
#!pip install -U spacy-nightly --pre

Collecting spacy-nightly
  Downloading spacy-nightly-3.0.0rc5.tar.gz (7.0 MB)
  Installing build dependencies ... canceled
ERROR: Operation cancelled by user

In [None]:
!pip install -U spacy transformers

Collecting spacy
  Using cached spacy-3.7.5-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (27 kB)
Using cached spacy-3.7.5-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (6.6 MB)
Installing collected packages: spacy
Successfully installed spacy-3.7.5

Now that everything is set up, change working directory to the newly created directory NLP_MASTER in your Google Drive

In [None]:
%cd /content/drive/MyDrive/NLP_MASTER/

/content/drive/MyDrive/NLP_MASTER


Clone the official projects from the spaCy repository. You are going to start from [this one](https://github.com/explosion/projects/tree/v3/tutorials/textcat_goemotions) and adapt it to the sentiment classification of financial news headlines.

In [None]:
!git clone https://github.com/explosion/projects.git spacy-projects

Cloning into 'spacy-projects'...
remote: Enumerating objects: 4663, done.
remote: Counting objects: 100% (902/902), done.
remote: Compressing objects: 100% (413/413), done.
remote: Total 4663 (delta 590), reused 660 (delta 482), pack-reused 3761
Receiving objects: 100% (4663/4663), 19.05 MiB | 10.39 MiB/s, done.
Resolving deltas: 100% (2779/2779), done.
Updating files: 100% (594/594), done.


Let's now create a subdirectory "finance" inside NLP_MASTER and copy into it the textcat_goemotions tutorial we just cloned.

In [None]:
!mkdir -p finance


In [None]:
!cp -r spacy-projects/tutorials/textcat_goemotions/* finance/

In [None]:
%cd /content/drive/MyDrive/NLP_MASTER/finance/

/content/drive/MyDrive/NLP_MASTER/finance


The spaCy command line in action: now that we have moved into the root directory of the project, we tell spaCy to download everything the project needs in order to run.

In [None]:
!spacy project assets

ℹ Fetching 4 asset(s)
✔ Downloaded asset /content/drive/MyDrive/NLP_MASTER/finance/assets/categories.txt
✔ Downloaded asset /content/drive/MyDrive/NLP_MASTER/finance/assets/train.tsv
✔ Downloaded asset /content/drive/MyDrive/NLP_MASTER/finance/assets/dev.tsv
✔ Downloaded asset /content/drive/MyDrive/NLP_MASTER/finance/assets/test.tsv


# Sentiment analysis: Reddit Posts Dataset

You can take a look at the dataset [here](https://drive.google.com/file/d/118kEBuOXikDJhlAvDVmAVxNBymtQ5MKb/view?usp=sharing).

*Example records (tab-separated: TEXT_CONTENT, EMOTION_ID, TEXT_ID; a record can carry several comma-separated emotion ids):*

*   My favourite food is anything I didn't have to cook myself.	27	eebbqej
*   Thank you friend	15	eeqd04y
*   It's crazy how far Photoshop has come. Underwater bridges?!! NEVER!!!	7,13	efanc6t


Check out **assets/categories.txt** to explore the labels for this dataset. *The first row corresponds to the emotion_id 0, the second row to the emotion_id 1 and so on.*
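
To make the id-to-label mapping concrete, here is a minimal sketch that decodes one record. The helper name `decode_record` is made up for illustration, and the label list is copied from the order of the model output shown later in this notebook, which is assumed to match assets/categories.txt. In the notebook you would read the labels from that file instead, one per line.

```python
# Label order copied from the model output in this notebook;
# assumed (not verified) to match assets/categories.txt line by line.
CATEGORIES = [
    "admiration", "amusement", "anger", "annoyance", "approval", "caring",
    "confusion", "curiosity", "desire", "disappointment", "disapproval",
    "disgust", "embarrassment", "excitement", "fear", "gratitude", "grief",
    "joy", "love", "nervousness", "optimism", "pride", "realization",
    "relief", "remorse", "sadness", "surprise", "neutral",
]


def decode_record(line, categories):
    """Split one tab-separated record and map its emotion ids to label names."""
    # Split from the right so a tab inside the text cannot break the parse.
    text, emotion_ids, text_id = line.rstrip("\n").rsplit("\t", 2)
    labels = [categories[int(i)] for i in emotion_ids.split(",")]
    return text, labels, text_id


print(decode_record("Thank you friend\t15\teeqd04y", CATEGORIES))
```

A record with several ids, such as `7,13`, yields one label per id.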

---



***Edit project.yml and change gpu_id from -1 to 0 in order to take advantage of the Colab GPU***
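
If you prefer to make this edit from a code cell rather than by hand, the following sketch rewrites the setting with a regular expression. The helper `set_gpu_id` is hypothetical, and the sample text only imitates the relevant line of project.yml; the sketch assumes the file contains a line of the form `gpu_id: -1`.

```python
import re


def set_gpu_id(project_yml_text, gpu_id):
    """Return the YAML text with the first `gpu_id: <int>` entry replaced."""
    new_text, count = re.subn(
        r"(?m)^([ \t]*gpu_id:[ \t]*)-?\d+",  # keep indentation and key, swap the number
        rf"\g<1>{gpu_id}",
        project_yml_text,
        count=1,
    )
    if count == 0:
        raise ValueError("no gpu_id entry found")
    return new_text


# Illustrative sample, not the real file contents
sample = "vars:\n  gpu_id: -1\n"
print(set_gpu_id(sample, 0))

# In the notebook (assumed usage):
# with open("project.yml") as f:
#     text = f.read()
# with open("project.yml", "w") as f:
#     f.write(set_gpu_id(text, 0))
```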

Let spaCy **preprocess the Reddit Posts Dataset** (assets/train.tsv, assets/dev.tsv, assets/test.tsv and assets/categories.txt) and convert it into the format it needs internally.

In [None]:
!spacy project run preprocess

Running command: /usr/bin/python3 scripts/convert_corpus.py


Now that the dataset has been processed, **let's train the model** on the Reddit posts!

In [None]:
!spacy project run train

Running command: /usr/bin/python3 -m spacy train ./configs/cnn.cfg -o training/cnn --gpu-id -1
✔ Created output directory: training/cnn
ℹ Saving to output directory: training/cnn
ℹ Using CPU
ℹ To switch to GPU 0, use the option: --gpu-id 0
✔ Initialized pipeline
ℹ Pipeline: ['textcat']
ℹ Initial learn rate: 0.001
E    #       LOSS TEXTCAT  CATS_SCORE  SCORE 
---  ------  ------------  ----------  ------
  0       0          0.27       50.46    0.50
  0     200          7.96       57.03    0.57
  0     400          7.04       60.02    0.60
  0     600          6.61       62.25    0.62
  0     800          6.45       63.64    0.64
  0    1000          6.34       66.80    0.67
  0    1200          6.19       69.88    0.70
  1    1400          6.02       72.13    0.72
  1    1600          5.67       73.92    0.74
  1    1800          5.74       75.33    0.75
  1    2000          5.80  

Automatic spaCy evaluation of the model you just trained, run on the test set:

In [None]:
!spacy project run evaluate

Running command: /usr/bin/python3 -m spacy evaluate ./training/cnn/model-best ./corpus/test.spacy --output ./metrics/cnn.json
ℹ Using CPU
ℹ To switch to GPU 0, use the option: --gpu-id 0

TOK                   100.00
TEXTCAT (macro AUC)   83.39 
SPEED                 18184 

                     P       R       F
admiration       68.35   59.13   63.40
amusement        77.78   82.20   79.93
anger            58.33   24.75   34.75
annoyance        48.44    9.69   16.15
approval         58.33   15.95   25.06
caring           62.96   12.59   20.99
confusion        49.06   16.99   25.24
curiosity        51.40   32.39   39.74
desire           53.12   20.48   29.57
disappointment   50.00    0.66    1.31
disapproval      43.28   10.86   17.37
disgust          51.22   17.07   25.61
embarrassment     0.00    0.00    0.00
excitement       75.00   11.65   20.17
fear             85.71   38.46   53.10
gratitude        93.53   90.34   91.91
grief             0
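
The evaluate command also writes its scores to ./metrics/cnn.json, so the per-label table can be inspected programmatically. The sketch below ranks labels by F-score. It assumes the JSON stores the per-label scores under a `cats_f_per_type` key with `p`, `r` and `f` entries; check the file before relying on that. The helper name `weakest_labels` and the sample dictionary are made up for illustration, with values that mirror three rows of the table above rescaled to an assumed 0-1 range.

```python
def weakest_labels(metrics, k=3):
    """Return the k labels with the lowest F-score as (label, f) pairs."""
    per_type = metrics.get("cats_f_per_type", {})  # assumed key name
    ranked = sorted(per_type.items(), key=lambda item: item[1]["f"])
    return [(label, scores["f"]) for label, scores in ranked[:k]]


# Hand-written sample; not loaded from the real metrics file
sample = {
    "cats_f_per_type": {
        "gratitude": {"p": 0.9353, "r": 0.9034, "f": 0.9191},
        "embarrassment": {"p": 0.0, "r": 0.0, "f": 0.0},
        "disappointment": {"p": 0.5, "r": 0.0066, "f": 0.0131},
    }
}
print(weakest_labels(sample, k=2))

# In the notebook (assumed usage):
# import json
# with open("metrics/cnn.json") as f:
#     print(weakest_labels(json.load(f)))
```

Labels that rank lowest here are the ones the model rarely predicts correctly, such as embarrassment in the table above.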

Let's test the model on some examples. **Feel free to change them to whatever you want!**

In [None]:
import spacy
nlp = spacy.load("./training/cnn/model-best")

texts = [
    "It was really bad to watch you leave, hopefully you'll be back soon",
    "Oh yes, I can relate to that. Still, you'd better think about it twice.",
]

for doc in nlp.pipe(texts):
    # doc.cats maps each label to the score predicted for this text
    print(doc.cats)

{'admiration': 0.0034495063591748476, 'amusement': 0.0034103163052350283, 'anger': 0.00824139267206192, 'annoyance': 0.0036229901015758514, 'approval': 0.006888858042657375, 'caring': 0.1147070825099945, 'confusion': 0.0007254867232404649, 'curiosity': 0.006565956398844719, 'desire': 0.027285708114504814, 'disappointment': 0.15754802525043488, 'disapproval': 0.007828064262866974, 'disgust': 0.0574275404214859, 'embarrassment': 0.005095605738461018, 'excitement': 0.01632966287434101, 'fear': 0.007915233261883259, 'gratitude': 0.0042739431373775005, 'grief': 0.0015688115963712335, 'joy': 0.0008439110824838281, 'love': 0.0013803112087771297, 'nervousness': 0.002749372273683548, 'optimism': 0.9889201521873474, 'pride': 0.0013569535221904516, 'realization': 0.00494409492239356, 'relief': 0.0011274284915998578, 'remorse': 0.008032864890992641, 'sadness': 0.12642258405685425, 'surprise': 0.001355200307443738, 'neutral': 0.004896792117506266}
{'admiration': 0.042892564088106155, 'amusement': 0
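
The full dictionary is hard to read at a glance, so it can help to keep only the highest-scoring labels. This is a minimal sketch; the helper `top_labels` is not part of spaCy, and the sample dictionary holds four scores rounded from the first output above.

```python
def top_labels(cats, k=3):
    """Return the k highest-scoring (label, score) pairs from a doc.cats-style dict."""
    return sorted(cats.items(), key=lambda item: item[1], reverse=True)[:k]


# Four scores rounded from the first output above
cats = {"caring": 0.115, "disappointment": 0.158, "optimism": 0.989, "sadness": 0.126}
print(top_labels(cats, k=2))  # [('optimism', 0.989), ('disappointment', 0.158)]
```

Inside the loop above you would call `top_labels(doc.cats)` instead of printing the whole dictionary. The scores in the output do not sum to 1, which suggests each label is scored independently.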

# Data Preparation: from the Reddit Posts Dataset to the Financial News Dataset
**TODO: Upload the Financial News Dataset file FinancialPhraseBank_AllAgree.txt to the assets folder. You can find the dataset [here](https://drive.google.com/file/d/1WXM2t8sh-myIEUZt37zIXC2McNrCyS2l/view?usp=sharing).**\
Financial news dataset example records [TEXT_CONTENT, SENTIMENT_LABEL], with the two fields separated by `@`:


*   According to Gran , the company has no plans to move all production to Russia , although that is where the company is growing .@neutral
*   Finnish Talentum reports its operating profit increased to EUR 20.5 mn in 2005 from EUR 9.3 mn in 2004 , and net sales totaled EUR 103.3 mn , up from EUR 96.4 mn .@positive
*   Pharmaceuticals group Orion Corp reported a fall in its third-quarter earnings that were hit by larger expenditures on R&D and marketing .@negative



---

Now you have to **format the Financial News Dataset like the Reddit Posts Dataset**, in order to retrain the sentiment classifier on the new financial dataset.

Remember to split the dataset into train (70%), validation (10%) and test (20%), **saving the respective TSV files (train.tsv, dev.tsv, test.tsv) in the assets folder**. Also update assets/categories.txt so that it lists the new sentiment labels, one per row.
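
One possible way to approach this step is sketched below; treat it as a starting point rather than the reference solution. Several things are assumptions: the label order in `LABELS` (it must match whatever you write to assets/categories.txt), the text id scheme `fin0`, `fin1`, ..., and the helper names themselves. The preprocessing script is assumed to expect the same three tab-separated columns as the Reddit files.

```python
import random

# Assumed label order; row i of assets/categories.txt must hold LABELS[i].
LABELS = ["negative", "neutral", "positive"]


def parse_line(line):
    """Split 'sentence@label' on the last '@' and return (text, label)."""
    text, label = line.rstrip("\n").rsplit("@", 1)
    return text.strip(), label.strip()


def to_tsv_row(text, label, row_id):
    """Format one record as TEXT<TAB>LABEL_ID<TAB>TEXT_ID."""
    return f"{text}\t{LABELS.index(label)}\t{row_id}"


def split_rows(rows, seed=0):
    """Shuffle, then split into 70% train, 10% dev and the remaining ~20% test."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # fixed seed keeps the split reproducible
    n_train = len(rows) * 7 // 10
    n_dev = len(rows) // 10
    return rows[:n_train], rows[n_train:n_train + n_dev], rows[n_train + n_dev:]


# Tiny demonstration on one record from the examples above
text, label = parse_line(
    "Pharmaceuticals group Orion Corp reported a fall in its third-quarter "
    "earnings that were hit by larger expenditures on R&D and marketing .@negative"
)
print(to_tsv_row(text, label, "fin0"))
```

In the notebook you would read every line of assets/FinancialPhraseBank_AllAgree.txt, build the rows, call `split_rows`, and write each part to assets/train.tsv, assets/dev.tsv and assets/test.tsv. If opening the file as UTF-8 fails, try `encoding="latin-1"`; that encoding is a guess, not something this notebook states.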

