# A quick intro to working with ```spaCy```

In [2]:
# spaCy has to be installed in the environment you are working in (a fresh environment needs it installed again)
# Install it from the terminal with: pip install spacy
import spacy

Once ```spaCy``` is imported, we need to load a model.

NB: Models first have to be downloaded from the command line. An overview of the models available for ```spaCy``` can be found [here](https://spacy.io/usage/models). To download the small English model, run this in the terminal:

```python -m spacy download en_core_web_sm```

In [7]:
nlp = spacy.load("en_core_web_sm")

# By convention the loaded pipeline is called nlp, since it is the NLP tool used for all processing
# en_core_web_sm is the name of a small English model

The loaded model is a ```spaCy``` pipeline that we will use for all of our analysis. Essentially, we feed our examples of language into the pipeline and get annotated texts out at the other end.

In [5]:
sentence = "My name is Ross and I come from Scotland"

The final object that comes out of the pipeline is known as a ```spaCy``` Doc, which is essentially a list of tokens.

However, rather than just being a list of strings, each of the tokens in this list has its own *attributes*, which can be accessed using dot notation.

In [6]:
doc = nlp(sentence)


# Here the model processes the sentence defined above
# The result is called doc because nlp() returns a spaCy Doc object
# A Doc is essentially a list of tokens with annotations attached

In [8]:
for token in doc:
    print(token.text, "\t\t", token.pos_, "\t\t", token.dep_,"\t\t", token.morph)

    # For every token in the doc (the sentence from before), print the text, the part-of-speech tag, the dependency label, and the morphological features

My 		 PRON 		 poss 		 Number=Sing|Person=1|Poss=Yes|PronType=Prs
name 		 NOUN 		 nsubj 		 Number=Sing
is 		 AUX 		 ROOT 		 Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin
Ross 		 PROPN 		 attr 		 Number=Sing
and 		 CCONJ 		 cc 		 ConjType=Cmp
I 		 PRON 		 nsubj 		 Case=Nom|Number=Sing|Person=1|PronType=Prs
come 		 VERB 		 conj 		 Tense=Pres|VerbForm=Fin
from 		 ADP 		 prep 		 
Scotland 		 PROPN 		 pobj 		 Number=Sing
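
The last column of the output packs several morphological features into one string, with features separated by `|` and each feature written as `Name=Value`. The sketch below shows how such a string can be read, using plain Python on strings copied from the output above. The helper name `parse_morph` is hypothetical and is not part of ```spaCy```.

```python
def parse_morph(morph_string):
    """Turn 'Number=Sing|Person=1' into {'Number': 'Sing', 'Person': '1'}."""
    if not morph_string:
        # Tokens such as "from" have no morphological features at all
        return {}
    return dict(feature.split("=", 1) for feature in morph_string.split("|"))

# Strings copied from the output above ("My" and "from")
my_features = parse_morph("Number=Sing|Person=1|Poss=Yes|PronType=Prs")
from_features = parse_morph("")

print(my_features["PronType"])
print(from_features)
```

On a real token this parsing is not needed, since `token.morph.to_dict()` returns the features as a dictionary directly.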


We can also visualise certain aspects of the linguistic structure of the sentence, such as the dependency relations between individual words:

In [9]:
spacy.displacy.serve(doc, style="dep")

# Dependency parse, based on dependency grammar




Using the 'dep' visualizer
Serving on http://0.0.0.0:5000 ...

Shutting down server on port 5000.


## Experimenting

- Experiment with different language models available from ```spaCy``` for another language you know - Danish, Dutch, Chinese, Portuguese, whatever.
    - How does ```spaCy``` perform? 
    - Are all the same features available for all languages?

In [10]:
nlp_span = spacy.load("es_core_news_sm")

In [16]:
sentence = "me llamo Diaz y tengo veinticinco años. Yo soy feliz"

doc = nlp_span(sentence)

In [17]:
for token in doc:
    print(token.text, "\t\t", token.pos_, "\t\t", token.dep_,"\t\t", token.morph)


    # AUX = auxiliary verb (this tag also covers modal verbs)

me 		 PRON 		 obj 		 Case=Acc|Number=Sing|Person=1|PrepCase=Npr|PronType=Prs|Reflex=Yes
llamo 		 VERB 		 ROOT 		 Mood=Ind|Number=Sing|Person=1|Tense=Pres|VerbForm=Fin
Diaz 		 PROPN 		 nsubj 		 
y 		 CCONJ 		 cc 		 
tengo 		 VERB 		 conj 		 Mood=Ind|Number=Sing|Person=1|Tense=Pres|VerbForm=Fin
veinticinco 		 NUM 		 nummod 		 NumType=Card|Number=Plur
años 		 NOUN 		 obj 		 Gender=Masc|Number=Plur
. 		 PUNCT 		 punct 		 PunctType=Peri
Yo 		 PRON 		 nsubj 		 Case=Nom|Number=Sing|Person=1|PronType=Prs
soy 		 AUX 		 cop 		 Mood=Ind|Number=Sing|Person=1|Tense=Pres|VerbForm=Fin
feliz 		 ADJ 		 ROOT 		 Number=Sing


In [18]:
spacy.displacy.serve(doc, style="dep")




Using the 'dep' visualizer
Serving on http://0.0.0.0:5000 ...

Shutting down server on port 5000.


## Task

- In the shared data drive, there is a file called ```News_Category_Dataset_v2.json```. It is taken from [this Kaggle dataset](https://www.kaggle.com/datasets/rmisra/news-category-dataset) and comprises some 200k news headlines from [HuffPost](https://www.huffpost.com/). The data is in *JSON Lines* format, with one JSON object per line. You can load it into ```pandas``` in the following way:

```python
data = pd.read_json(filepath, lines=True)
```
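
To make the JSON Lines format concrete, here is a minimal sketch that parses two records by hand with the standard library. The field names and headlines are taken from the dataset preview shown later in this notebook. A real file contains more fields per record, and the in-memory `raw` object only stands in for an opened file.

```python
import io
import json

# Two records in JSON Lines form: each line is a complete JSON object
raw = io.StringIO(
    '{"category": "ENTERTAINMENT", "headline": "Hugh Grant Marries For The First Time At Age 57"}\n'
    '{"category": "ENTERTAINMENT", "headline": "Bow Wow Has Tax Liens From 2006, 2008, And 2010"}\n'
)

# Parse line by line, skipping any empty lines
records = [json.loads(line) for line in raw if line.strip()]

print(len(records))
print(records[0]["headline"])
```

Passing `lines=True` to `pd.read_json` tells ```pandas``` to expect this layout, so each line becomes one row of the resulting DataFrame.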

- Select a couple of sub-categories of news data and use ```spaCy``` to find the **relative frequency per 10k words** of each of the following word classes: NOUN, VERB, ADJ (adjective), ADV (adverb)
    - Save the results as a CSV file (again using ```pandas```)
    - Are there any differences in the distributions?
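
The relative frequency per 10k words is the count of a word class divided by the total number of tokens, multiplied by 10,000. The sketch below shows one possible way to compute it. It assumes the part-of-speech tags have already been collected into a list; here they are copied from the English example output earlier in this notebook. The function name `relative_frequencies` is hypothetical.

```python
from collections import Counter

def relative_frequencies(pos_tags, wanted=("NOUN", "VERB", "ADJ", "ADV"), per=10_000):
    """Return the number of occurrences of each wanted tag per `per` tokens."""
    counts = Counter(pos_tags)
    total = len(pos_tags)
    if total == 0:
        # Avoid dividing by zero for an empty category
        return {tag: 0.0 for tag in wanted}
    return {tag: counts[tag] / total * per for tag in wanted}

# Tags copied from the English example output earlier in the notebook
tags = ["PRON", "NOUN", "AUX", "PROPN", "CCONJ", "PRON", "VERB", "ADP", "PROPN"]

freqs = relative_frequencies(tags)
print({tag: round(value, 1) for tag, value in freqs.items()})
```

For the task itself, the tag list would be built from `token.pos_` over all headlines in a category. Whether punctuation tokens count as "words" in the total is a choice you have to make and should keep the same across categories. The resulting dictionaries can be collected in a ```pandas``` DataFrame and written out with `DataFrame.to_csv`.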

In [28]:
import pandas as pd

data = pd.read_json("../../../115274/news_data/News_Category_Dataset_v2.json", lines=True)

#print(data)

data.category.unique() # This is how you find the unique values in a column

array(['CRIME', 'ENTERTAINMENT', 'WORLD NEWS', 'IMPACT', 'POLITICS',
       'WEIRD NEWS', 'BLACK VOICES', 'WOMEN', 'COMEDY', 'QUEER VOICES',
       'SPORTS', 'BUSINESS', 'TRAVEL', 'MEDIA', 'TECH', 'RELIGION',
       'SCIENCE', 'LATINO VOICES', 'EDUCATION', 'COLLEGE', 'PARENTS',
       'ARTS & CULTURE', 'STYLE', 'GREEN', 'TASTE', 'HEALTHY LIVING',
       'THE WORLDPOST', 'GOOD NEWS', 'WORLDPOST', 'FIFTY', 'ARTS',
       'WELLNESS', 'PARENTING', 'HOME & LIVING', 'STYLE & BEAUTY',
       'DIVORCE', 'WEDDINGS', 'FOOD & DRINK', 'MONEY', 'ENVIRONMENT',
       'CULTURE & ARTS'], dtype=object)

In [26]:
data_subset = data[data["category"] == "ENTERTAINMENT"]

print(data_subset)

             category                                           headline  \
1       ENTERTAINMENT  Will Smith Joins Diplo And Nicky Jam For The 2...   
2       ENTERTAINMENT    Hugh Grant Marries For The First Time At Age 57   
3       ENTERTAINMENT  Jim Carrey Blasts 'Castrato' Adam Schiff And D...   
4       ENTERTAINMENT  Julianna Margulies Uses Donald Trump Poop Bags...   
5       ENTERTAINMENT  Morgan Freeman 'Devastated' That Sexual Harass...   
...               ...                                                ...   
200775  ENTERTAINMENT    Bow Wow Has Tax Liens From 2006, 2008, And 2010   
200776  ENTERTAINMENT  World Preview Of Madonna's 'Give Me All Your L...   
200777  ENTERTAINMENT  'Terminator 3' Star Nick Stahl Arrested For No...   
200838  ENTERTAINMENT  Sundance, Ice-T, and Shades of the American Ra...   
200839  ENTERTAINMENT  'Girl With the Dragon Tattoo' India Release Ca...   

                                                  authors  \
1                         