# Getting to know the DANTEStocks Data

In [1]:
data = open(r"C:\Users\ianba\OneDrive\Desktop\shenanigans\usp\ic\poetisa\data\DANTEStocks (15dez2022).conllu", "r", encoding="utf-8")

### CONLL-U File Framework Description

Lines preceded by a hashtag are comment lines providing metadata on the corpus. In the case of the DANTEStocks corpus the comment lines are used to indicate the filename, new paragraphs, each sentece ID and the full tweet before the syntactic and semantic annotation. Following the comment lines we have word lines that each contain a token and its annotations. For example for the first phrase we have 14 word lines for the words in the first tweet. The columns in each line are separated by tabulators.
The columns have the following meanings, as indicated in https://universaldependencies.org/format.html.

1. ID: Word index, integer starting at 1 for each new sentence; may be a range for multiword tokens; may be a decimal number for empty nodes (decimal numbers can be lower than 1 but must be greater than 0).
2. FORM: Word form or punctuation symbol.
3. LEMMA: Lemma or stem of word form.
4. UPOS: Universal part-of-speech tag.
5. XPOS: Language-specific part-of-speech tag; underscore if not available.
6. FEATS: List of morphological features from the universal feature inventory or from a defined language-specific extension; underscore if not available.
7. HEAD: Head of the current word, which is either a value of ID or zero (0).
8. DEPREL: Universal dependency relation to the HEAD (root iff HEAD = 0) or a defined language-specific subtype of one.
9. DEPS: Enhanced dependency graph in the form of a list of head-deprel pairs.
10. MISC: Any other annotation.

In [2]:
annotations = data.read()

In [3]:
data.close()

In [4]:
print(annotations[:4000])

# newdoc id = DANTE_29Nov22.conllu
# newpar
# sent_id = dante_01_441014389496619009l
# text = #VALE5 - Análise #Ichimoku - pregão de sexta-feira, 28 de fevereiro. http://t.co/kEreB1xgU4
1	#VALE5	#VALE5	X	_	_	_	_	_	_
2	-	-	PUNCT	_	_	_	_	_	_
3	Análise	Análise	PROPN	_	_	_	_	_	_
4	#Ichimoku	#Ichimoku	PROPN	_	_	_	_	_	_
5	-	-	PUNCT	_	_	_	_	_	_
6	pregão	pregão	NOUN	_	Gender=Masc|Number=Sing	_	_	_	_
7	de	de	ADP	_	_	_	_	_	_
8	sexta-feira	sexta-feira	NOUN	_	Gender=Fem|Number=Sing	_	_	_	_
9	,	,	PUNCT	_	_	_	_	_	_
10	28	28	NUM	_	NumType=Card	_	_	_	_
11	de	de	ADP	_	_	_	_	_	_
12	fevereiro	fevereiro	NOUN	_	Gender=Masc|Number=Sing	_	_	_	_
13	.	.	PUNCT	_	_	_	_	_	_
14	http://t.co/kEreB1xgU4	http://t.co/kEreB1xgU4	SYM	_	_	_	_	_	SpacesAfter=\n

# sent_id = dante_01_441020223408578560l
# text = #PETR4 - Análise #Ichimoku - pregão de sexta-feira, 28 de fevereiro. http://t.co/oAHK5pB3e0
1	#PETR4	#PETR4	X	_	_	_	_	_	_
2	-	-	PUNCT	_	_	_	_	_	_
3	Análise	Análise	PROPN	_	_	_	_	_	_
4	#Ichimoku	#Ichimoku	PROPN	_	_	_	

I would like to better visualize and manipulate this data, so I will try to convert it to a pandas dataframe. 

In [5]:
! pip install conllu



In [6]:
import conllu as c

In [7]:
# parses the text data in conllu format into a list of token list
# objects, where token lists are a datatype specific to the conllu
# format
sentences = c.parse(annotations)

In [8]:
type(sentences)

conllu.models.SentenceList

In [9]:
type(sentences[0])

conllu.models.TokenList

In [10]:
# We are able to extract the comments from the DANTEStocks corpus 
# using the metadata atribute. Furthermore, as it returns a dictionary
# like object we are able to use the keys to access specific comments.
# In this corpus the keys are as follows:

# 1. newdoc id: Corpus ID
# 2. newpar: new paragraph
# 3. sent_id: sentence ID
# 4. text: full tweet
sentences[0].metadata

{'newdoc id': 'DANTE_29Nov22.conllu',
 'newpar': None,
 'sent_id': 'dante_01_441014389496619009l',
 'text': '#VALE5 - Análise #Ichimoku - pregão de sexta-feira, 28 de fevereiro. http://t.co/kEreB1xgU4'}

In [11]:
# Extracting line 1
sentences[0][0]

{'id': 1,
 'form': '#VALE5',
 'lemma': '#VALE5',
 'upos': 'X',
 'xpos': None,
 'feats': None,
 'head': None,
 'deprel': '_',
 'deps': None,
 'misc': None}

In [12]:
# getting a specific key from line 1
sentences[0][0]['lemma']

'#VALE5'

In [13]:
# lets create a list of all the full tweets
tweets = []
for i in range(len(sentences)):
    tweet = sentences[i].metadata['text']
    tweets.append(tweet)

In [14]:
tweets[:5]

['#VALE5 - Análise #Ichimoku - pregão de sexta-feira, 28 de fevereiro. http://t.co/kEreB1xgU4',
 '#PETR4 - Análise #Ichimoku - pregão de sexta-feira, 28 de fevereiro. http://t.co/oAHK5pB3e0',
 'as nuvens do ichimoku: PETR4 - pregão de sexta-feira, 28 de fevereiro http://t.co/fZ3wAPf7An',
 'Em a #PETR4 fizemos em a sexta passada uma Sub Onda 5 de fundo e em o dia 05/02/14uma Onda 3 de fundo alvos 15,44 16,40 http://t.co/5DPBQCulWr',
 '@PaiRico @frfontanella @eddu56 @TiagoBDS Acabei de fazer análise da sua #petr4 . Abraços !']

In [15]:
len(tweets)

4048

### Creating a Pandas Dataframe
We're going to try and create a pandas dataframe with our data

In [16]:
# creating a list of columns
columns = list(sentences[0][0].keys())
columns

['id',
 'form',
 'lemma',
 'upos',
 'xpos',
 'feats',
 'head',
 'deprel',
 'deps',
 'misc']

In [17]:
# creating a dictionary where the keys are the columns
# and the values are lists of the values for each column
df = {k:[] for k in columns}

In [18]:
# populating the dictionary
for t in range(len(sentences)):
    for tok in sentences[t]:
        for k, v in tok.items():
            df[k].append(v)

In [19]:
# getting the number of tokens
numtoks = 0
for i in range(len(sentences)):
    numtoks += len(sentences[i])

In [20]:
numtoks

84397

In [21]:
for col in df.keys():
    print(len(df[col]))

84397
84397
84397
84397
84397
84397
84397
84397
84397
84382


As we can see for some reason some tokens don't have the misc column.

In [22]:
miscs = None
errors = []
for t in range(len(sentences)):
    for tok in sentences[t]:
        try:
            miscs = tok['misc']
        except:
            errors.append([t, tok['form']])

In [23]:
errors

[[366, '#falencia'],
 [366, '#incompetencia'],
 [366, '#PETR4'],
 [366, '#divida'],
 [702, 'que'],
 [950, 'niquel'],
 [1110, 'sem'],
 [1264, 'o'],
 [1676, 'VEM'],
 [1719, 'ações'],
 [1834, 'Teria'],
 [2266, 'ações'],
 [2316, 'o'],
 [3490, 'o'],
 [3833, 'o']]

In [24]:
error_words = [i[1] for i in errors]
error_idxs = [i[0] for i in errors]

for t in error_idxs:
    for tok in sentences[t]:
        if tok['form'] in error_words:
            tok['misc'] = None

In [25]:
df = {k:[] for k in columns}
for t in range(len(sentences)):
    for tok in sentences[t]:
        for k, v in tok.items():
            df[k].append(v)

In [26]:
for col in df.keys():
    print(len(df[col]))

84397
84397
84397
84397
84397
84397
84397
84397
84397
84397


In [27]:
import pandas as pd
import numpy as np

In [28]:
dante_df = pd.DataFrame(df)

In [29]:
dante_df.shape

(84397, 10)

In [30]:
dante_df.head()

Unnamed: 0,id,form,lemma,upos,xpos,feats,head,deprel,deps,misc
0,1,#VALE5,#VALE5,X,,,,_,,
1,2,-,-,PUNCT,,,,_,,
2,3,Análise,Análise,PROPN,,,,_,,
3,4,#Ichimoku,#Ichimoku,PROPN,,,,_,,
4,5,-,-,PUNCT,,,,_,,


In [31]:
# we want to add a column to the dataframe that indicates what tweet the token
# comes from

tweet_ids = []
for i in range(len(sentences)):
    for _ in range(len(sentences[i])):
        tweet_ids.append(i)

In [32]:
len(tweet_ids)

84397

In [33]:
dante_df['tweet_id'] = tweet_ids

In [34]:
dante_df.head()

Unnamed: 0,id,form,lemma,upos,xpos,feats,head,deprel,deps,misc,tweet_id
0,1,#VALE5,#VALE5,X,,,,_,,,0
1,2,-,-,PUNCT,,,,_,,,0
2,3,Análise,Análise,PROPN,,,,_,,,0
3,4,#Ichimoku,#Ichimoku,PROPN,,,,_,,,0
4,5,-,-,PUNCT,,,,_,,,0


### EDA

1. ID: Word index, integer starting at 1 for each new sentence; may be a range for multiword tokens; may be a decimal number for empty nodes (decimal numbers can be lower than 1 but must be greater than 0).
2. FORM: Word form or punctuation symbol.
3. LEMMA: Lemma or stem of word form.
4. UPOS: Universal part-of-speech tag.
5. XPOS: Language-specific part-of-speech tag; underscore if not available.
6. FEATS: List of morphological features from the universal feature inventory or from a defined language-specific extension; underscore if not available.
7. HEAD: Head of the current word, which is either a value of ID or zero (0).
8. DEPREL: Universal dependency relation to the HEAD (root iff HEAD = 0) or a defined language-specific subtype of one.
9. DEPS: Enhanced dependency graph in the form of a list of head-deprel pairs.
10. MISC: Any other annotation.

In [35]:
dante_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 84397 entries, 0 to 84396
Data columns (total 11 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   id        84397 non-null  object 
 1   form      84397 non-null  object 
 2   lemma     84397 non-null  object 
 3   upos      84397 non-null  object 
 4   xpos      3 non-null      object 
 5   feats     36582 non-null  object 
 6   head      28 non-null     float64
 7   deprel    84397 non-null  object 
 8   deps      16 non-null     object 
 9   misc      6835 non-null   object 
 10  tweet_id  84397 non-null  int64  
dtypes: float64(1), int64(1), object(9)
memory usage: 7.1+ MB


In [40]:
dante_df['id'].unique()

array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, (3, '-', 4), 15, 16,
       (5, '-', 6), (16, '-', 17), 17, 18, 19, 20, 21, 22, 23, 24, 25, 26,
       27, 28, (9, '-', 10), (8, '-', 9), 29, 30, (7, '-', 8), 31, 32, 33,
       34, 35, (21, '-', 22), (15, '-', 16), (18, '-', 19), (2, '-', 3),
       (12, '-', 13), (19, '-', 20), (22, '-', 23), (6, '-', 7),
       (25, '-', 26), (4, '-', 5), (11, '-', 12), (24, '-', 25),
       (14, '-', 15), (23, '-', 24), (28, '-', 29), (20, '-', 21),
       (10, '-', 11), (13, '-', 14), 36, 37, 38, 39, 40, 41, 42,
       (17, '-', 18), (1, '-', 2), (27, '-', 28), (30, '-', 31),
       (26, '-', 27), (32, '-', 33), (29, '-', 30), 43, 44, 45, 46, 47,
       48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64,
       65, 66, 67, 68, 69, 70, 71, (31, '-', 32), (33, '-', 34),
       (36, '-', 37), (13, '-', 16), (12, '-', 15), (9, '-', 11)],
      dtype=object)

In [36]:
# Which are the universal part of speach tags
dante_df['upos'].unique()

array(['X', 'PUNCT', 'PROPN', 'NOUN', 'ADP', 'NUM', 'SYM', 'DET', '_',
       'VERB', 'ADJ', 'CCONJ', 'ADV', 'AUX', 'SCONJ', 'PRON', 'INTJ',
       'PART'], dtype=object)

In [37]:
dante_df.to_csv(r"C:\Users\ianba\OneDrive\Desktop\shenanigans\usp\ic\poetisa\data\data_intermed\dante_df.csv")

#### Universal Part-of-Speech Tags

Source: https://universaldependencies.org/u/pos/

- ADJ: adjective. "Adjectives are words that typically modify nouns and specify their properties or attributes". In UD some words that could be considered adjectives are labled as numbers or determiners.

- ADP: adposition. "Adposition is a cover term for prepositions and postpositions. Adpositions belong to a closed set of items that occur before (preposition) or after (postposition) a complement composed of a noun phrase, noun, pronoun, or clause that functions as a noun phrase, and that form a single structure with the complement to express its grammatical and semantic relation to another unit within a clause". 

- ADV: adverb. "Adverbs are words that typically modify verbs for such categories as time, place, direction or manner. They may also modify adjectives and other adverbs". 

- AUX: auxiliary. "An auxiliary is a function word that accompanies the lexical verb of a verb phrase and expresses grammatical distinctions not carried by the lexical verb, such as person, number, tense, mood, aspect, voice or evidentiality".

- CCONJ: coordinating conjunction. "A coordinating conjunction is a word that links words or larger constituents without syntactically subordinating one to the other and expresses a semantic relationship between them."

- DET: determiner. "Determiners are words that modify nouns or noun phrases and express the reference of the noun phrase in context. That is, a determiner may indicate whether the noun is referring to a definite or indefinite element of a class, to a closer or more distant element, to an element belonging to a specified person or thing, to a particular number or quantity, etc."

- INTJ: interjection. "An interjection is a word that is used most often as an exclamation or part of an exclamation. It typically expresses an emotional reaction, is not syntactically related to other accompanying expressions, and may include a combination of sounds not otherwise found in the language."

- NOUN: noun. "Nouns are a part of speech typically denoting a person, place, thing, animal or idea."

- NUM: numeral. 

- PART: particle

- PRON: pronoun. "Pronouns are words that substitute for nouns or noun phrases, whose meaning is recoverable from the linguistic or extralinguistic context."

- PROPN: proper noun. "A proper noun is a noun (or nominal content word) that is the name (or part of the name) of a specific individual, place, or object."

- PUNCT: punctuation

- SCONJ: subordinating conjunction. "A subordinating conjunction is a conjunction that links constructions by making one of them a constituent of the other. The subordinating conjunction typically marks the incorporated constituent which has the status of a (subordinate) clause."

- SYM: symbol

- VERB: verb. "Note that the VERB tag covers main verbs (content verbs) but it does not cover auxiliary verbs and verbal copulas (in the narrow sense), for which there is the AUX tag."

- X: other. "The tag X is used for words that for some reason cannot be assigned a real part-of-speech category."

In [36]:
# Lets check how many times each of the unique POS tags appear in 
# the dataset

dante_df['upos'].value_counts()

PUNCT    13056
NOUN     11936
PROPN    11440
ADP       8735
DET       6855
VERB      6585
NUM       5033
SYM       4509
_         3347
ADJ       2934
ADV       2764
X         2102
CCONJ     1694
AUX       1316
PRON      1214
SCONJ      732
INTJ       142
PART         3
Name: upos, dtype: int64

As expected, there are more nouns and proper nouns than other sintactic functions in the corpus.

In [37]:
# Lets examine some of the X POS tags
dante_df.query("upos == 'X'").sample(10)

Unnamed: 0,id,form,lemma,upos,xpos,feats,head,deprel,deps,misc,tweet_id
67480,11,RT,RT,X,,,,_,,,3235
19271,2,#PETR4,#PETR4,X,,,,_,,,940
20104,27,Heeheh,Heeheh,X,,,,_,,{'SpacesAfter': '\n'},981
40074,3,#EuApoioCPIdaPetrobras,#EuApoioCPIdaPetrobras,X,,,,_,,,1949
33210,1,#LLXL3,#LLXL3,X,,,,_,,,1623
13211,18,rsrsrrs,rsrsrrs,X,,,,_,,{'SpacesAfter': '\n'},633
32719,1,$AEDU3,$AEDU3,X,,,,_,,,1600
65221,1,$VALE5,$VALE5,X,,,,_,,,3127
73027,25,#infomoney,#infomoney,X,,,,_,,,3496
67004,1,RT,RT,X,,,,_,,,3216


Mostly hashtags, cashtags and retweets.

In [38]:
# Lets check what the _ POS tag is 
dante_df.loc[dante_df.loc[:, 'upos'] == '_', :]

Unnamed: 0,id,form,lemma,upos,xpos,feats,head,deprel,deps,misc,tweet_id
30,"(3, -, 4)",do,_,_,,,,_,,,2
49,"(5, -, 6)",na,_,_,,,,_,,,3
61,"(16, -, 17)",no,_,_,,,,_,,,3
83,"(9, -, 10)",da,_,_,,,,_,,,4
98,"(8, -, 9)",na,_,_,,,,_,,,5
...,...,...,...,...,...,...,...,...,...,...,...
84265,"(21, -, 22)",no,_,_,,,,_,,,4040
84275,"(6, -, 7)",à,_,_,,,,_,,,4041
84311,"(8, -, 9)",do,_,_,,,,_,,,4043
84322,"(6, -, 7)",dos,_,_,,,,_,,,4044


Don't quite understand what these are.

In [39]:
# Lets check what deprels we have
dante_df['deprel'].unique()

array(['_', 'goeswith'], dtype=object)

In [40]:
dante_df['deprel'].value_counts()

_           84369
goeswith       28
Name: deprel, dtype: int64

Mostly missing values. 

Consulting Bryans masters project we can see that this is expected:

"Para os tweets, tem-se, por enquanto, apenas diretrizes para a etiquetação morfossintática(DI-FELIPPO et al., 2022)."

Let's check what the 'goeswith' deprel is.

In [41]:
dante_df.query("deprel == 'goeswith'").sample(5)

Unnamed: 0,id,form,lemma,upos,xpos,feats,head,deprel,deps,misc,tweet_id
57595,3,diretor,_,X,,,2.0,goeswith,,,2768
67433,17,lenga,_,X,,,16.0,goeswith,,,3232
37799,18,side,_,X,,,17.0,goeswith,,{'SpacesAfter': '\n'},1849
2494,14,sal,_,X,,,13.0,goeswith,,,134
37819,18,side,_,X,,,17.0,goeswith,,"{'SpacesAfter': 'No', 'CorrectSpaceAfter': 'Yes'}",1850


They're for compound words.

Let's check the feats.

In [42]:
dante_df.loc[(dante_df.loc[:, 'feats'].notnull()), 'feats'].sample(5)

13588    {'Mood': 'Imp', 'Number': 'Plur', 'Person': '3...
34265    {'Definite': 'Def', 'Gender': 'Masc', 'Number'...
17867    {'Mood': 'Ind', 'Number': 'Sing', 'Person': '3...
36422                                  {'NumType': 'Card'}
58890    {'Mood': 'Sub', 'Number': 'Sing', 'Person': '3...
Name: feats, dtype: object

Information on gender, number, if it is definite or indefinite etc.

#### Miscs
We want to check if the misc column is relevant. If it isn't, then we can drop it.

In [43]:
misc_vals = dante_df.loc[~pd.isna(dante_df.loc[:, 'misc']), 'misc']

In [44]:
type(misc_vals)

pandas.core.series.Series

In [45]:
misc_vals.index[0]

13

In [46]:
misc_vals.shape

(6835,)

In [47]:
l = {}
for i in range(misc_vals.shape[0]):
    key, value = list(misc_vals.values[i].items())[0]
    if not l.get(key):
        l[key] = [[value, misc_vals.index[i]]]
    else:
        l[key].append([value, misc_vals.index[i]])

In [48]:
np.unique(list(l.keys()))

array(['4:reparandum', 'CorrectForm', 'CorrectSpaceAfter', 'FulForm',
       'FullFom', 'FullFor', 'FullForm', 'FullForma', 'FullForn',
       'FullFrom', 'Fullform', 'SpaceAfter', 'SpacesAfter', 'Trunc'],
      dtype='<U17')

Apparently there are many keys that are just misspelled versions of the same category, such as SpacesAfter and SpaceAfter.

In [49]:
l['FullForm'][:5]

[['sexta-feira', 52],
 ['mínima', 101],
 ['sexta-feira', 103],
 ['para', 192],
 ['suporte', 215]]

In [50]:
dante_df.iloc[192, :]

id                            32
form                         pra
lemma                       para
upos                         ADP
xpos                        None
feats            {'Abbr': 'Yes'}
head                         NaN
deprel                         _
deps                        None
misc        {'FullForm': 'para'}
tweet_id                       8
Name: 192, dtype: object

The FullForm Misc category doesn't seem to be very important since the lemmas usually capture the full form of the token.

In [51]:
l['SpacesAfter'][:5]

[['\\n', 13], ['\\n', 27], ['\\n', 44], ['\\n', 74], ['\\n', 90]]

In [52]:
dante_df.iloc[44]

id                              16
form        http://t.co/fZ3wAPf7An
lemma       http://t.co/fZ3wAPf7An
upos                           SYM
xpos                          None
feats                         None
head                           NaN
deprel                           _
deps                          None
misc         {'SpacesAfter': '\n'}
tweet_id                         2
Name: 44, dtype: object

The SpacesAfter category also doesnt seem very usefull since all values seem to be the same and it only indicates that there was a space after the token. The tokens in this case are usually hyperlinks.

In [53]:
l['CorrectSpaceAfter'][:5]

[['Yes', 65], ['Yes', 824], ['Yes', 1527], ['Yes', 1759], ['Yes', 1841]]

In [54]:
dante_df.iloc[1759]

id                                                         1
form                                                05/03/14
lemma                                               05/03/14
upos                                                     NUM
xpos                                                    None
feats                                    {'NumType': 'Card'}
head                                                     NaN
deprel                                                     _
deps                                                    None
misc        {'CorrectSpaceAfter': 'Yes', 'SpaceAfter': 'No'}
tweet_id                                                 100
Name: 1759, dtype: object

Also does not seem very relevant.

In [55]:
l['CorrectForm'][:5]

[['Subonda', 55], ['subonda', 95], ['fez', 197], ['Papéis', 236], ['Não', 320]]

In [56]:
dante_df.iloc[55]

id                                                         10
form                                                      Sub
lemma                                                 subonda
upos                                                     NOUN
xpos                                                     None
feats       {'Gender': 'Fem', 'Number': 'Sing', 'Typo': 'Y...
head                                                      NaN
deprel                                                      _
deps                                                     None
misc                               {'CorrectForm': 'Subonda'}
tweet_id                                                    3
Name: 55, dtype: object

In [57]:
tweets[3]

'Em a #PETR4 fizemos em a sexta passada uma Sub Onda 5 de fundo e em o dia 05/02/14uma Onda 3 de fundo alvos 15,44 16,40 http://t.co/5DPBQCulWr'

The CorrectForm category seems more relevant than the others, but the lemma seems to be enough to understand the token, even when the form in the tweet is very poorly written.

In [58]:
l['Trunc'][:5]

[['Yes', 2660], ['Yes', 4007], ['Yes', 10366], ['Yes', 13827], ['Yes', 17334]]

In [59]:
dante_df.iloc[10366]

id                        29
form                     que
lemma                    que
upos                       X
xpos                    None
feats                   None
head                     NaN
deprel                     _
deps                    None
misc        {'Trunc': 'Yes'}
tweet_id                 500
Name: 10366, dtype: object

In [60]:
tweets[500]

'CTIP3 encerrou com uma pequena perda (-0,76%). Abriu Estácio (ESTC3) e subiu forte, e BR Malls (BRML3), que... http://t.co/K7mbQ4GTJ8'

This category indicates truncation, in the sense that the tweets was likely truncated due to the size limitation distinctive of this format.

It is unlikely that the Misc column will be useful so we will drop it.

In [158]:
dante_df.drop('misc', axis=1, inplace=True)

Number of unique tokens in the corpus.

In [62]:
dante_df.loc[:, 'lemma'].nunique()

9330

In [68]:
# Number of tokens not counting punctuation, symbols and X
subset = dante_df.query('upos != "PUNCT" and upos != "X" and upos != "SYM"')
subset.loc[:, 'lemma'].nunique()

7503

Getting average number of tokens per tweet.

In [64]:
len(tweets)

4048

In [70]:
subset.shape[0] / len(tweets)

15.990612648221344

In [71]:
dante_df.shape[0] / len(tweets)

20.849061264822133

## Getting Possible Predicative Nouns

The predicative nouns are usually accompanied by a preposition. As a first attempt to try to find all predicative nouns we will try to find all tweets that have nouns followed and prepositions. We will then form touples with the noun, preposition pairs, as well as the word that preceded the noun and the word that comes after the preprosition. Proper nouns cannot be predicative names, so we will ignore those. We also want to create a column in the data frame with the possible predicative nouns and a list containing the tweets and the words in the tweets that can be predicative nouns. 

In [162]:
def find_noun_pred(df):

    # lists we will create
    noun_pred_windows = []
    noun_pred_locs = []
    noun_pred_col = []

    for i in range(df.shape[0]-1):

        # restricting ourselves to the same tweet
        if (df['tweet_id'][i]) != (df['tweet_id'][i+1]):
            noun_pred_col.append(0)
            continue

        # finding situations where a noun is followed by a preposition
        if (df['upos'][i] == 'NOUN') and (df['upos'][i+1] == 'ADP'):

            # creating the window of words we want
            window = [i+j for j in range(-1, 3, 1) if (i+j >= 0) and (i+j <= df.shape[0])]
            w = [df['lemma'][i] for i in window]
            noun_pred_windows.append(w)
            noun_pred_locs.append([df['tweet_id'][i], df['id'][i]])
            noun_pred_col.append(1)
        else:
            noun_pred_col.append(0)

    noun_pred_col.append(0)

    return noun_pred_windows, noun_pred_locs, noun_pred_col

In [163]:
windows, locs, col = find_noun_pred(dante_df)

In [164]:
len(col)

84397

In [165]:
dante_df['poss_npred'] = col
dante_df.head()

Unnamed: 0,id,form,lemma,upos,xpos,feats,head,deprel,deps,tweet_id,poss_npred
0,1,#VALE5,#VALE5,X,,,,_,,0,0
1,2,-,-,PUNCT,,,,_,,0,0
2,3,Análise,Análise,PROPN,,,,_,,0,0
3,4,#Ichimoku,#Ichimoku,PROPN,,,,_,,0,0
4,5,-,-,PUNCT,,,,_,,0,0


In [167]:
windows[:5]

[['-', 'pregão', 'de', 'sexta-feira'],
 ['-', 'pregão', 'de', 'sexta-feira'],
 ['-', 'pregão', 'de', 'sexta-feira'],
 ['o', 'mínima', 'de', 'sexta-feira'],
 ['o', 'momento', 'de', 'entrar']]

In [203]:
tweets_with_npred = []
for i in range(len(locs)):
    tweet = dante_df.query('tweet_id == {idx}'.format(idx = i))
    tweet.loc[tweet.loc[:, 'id'] == locs[i][1], 'form'] = tweet.loc[tweet.loc[:, 'id'] == locs[i][1], 'form'] + '(NPRED)' 
    tweet_str = " ".join(tweet['form'])
    tweets_with_npred.append(tweet_str)

In [206]:
len(tweets_with_npred)

2444

In [207]:
dante_df.query('poss_npred == 1')

Unnamed: 0,id,form,lemma,upos,xpos,feats,head,deprel,deps,tweet_id,poss_npred
5,6,pregão,pregão,NOUN,,"{'Gender': 'Masc', 'Number': 'Sing'}",,_,,0,1
19,6,pregão,pregão,NOUN,,"{'Gender': 'Masc', 'Number': 'Sing'}",,_,,1,1
37,9,pregão,pregão,NOUN,,"{'Gender': 'Masc', 'Number': 'Sing'}",,_,,2,1
101,10,min,mínima,NOUN,,"{'Gender': 'Fem', 'Number': 'Sing', 'Abbr': 'Y...",,_,,5,1
127,8,momento,momento,NOUN,,"{'Gender': 'Masc', 'Number': 'Sing'}",,_,,6,1
...,...,...,...,...,...,...,...,...,...,...,...
84300,11,Formulario,formulário,NOUN,,"{'Gender': 'Masc', 'Number': 'Sing', 'Typo': '...",,_,,4042,1
84341,8,Prod.,produção,NOUN,,"{'Gender': 'Fem', 'Number': 'Sing', 'Abbr': 'Y...",,_,,4045,1
84343,10,Petroleo,petróleo,NOUN,,"{'Gender': 'Masc', 'Number': 'Sing', 'Typo': '...",,_,,4045,1
84352,19,Recorde,recorde,NOUN,,{'Number': 'Sing'},,_,,4045,1


In [208]:
poss_npred_df = pd.DataFrame({'tweets':tweets_with_npred,
                              'windows': windows,
                              'locations': [loc[1] for loc in locs]})

In [209]:
poss_npred_df.head()

Unnamed: 0,tweets,windows,locations
0,#VALE5 - Análise #Ichimoku - pregão(NPRED) de ...,"[-, pregão, de, sexta-feira]",6
1,#PETR4 - Análise #Ichimoku - pregão(NPRED) de ...,"[-, pregão, de, sexta-feira]",6
2,as nuvens do de o ichimoku : PETR4 - pregão(NP...,"[-, pregão, de, sexta-feira]",9
3,Em a #PETR4 fizemos na em a sexta passada uma ...,"[o, mínima, de, sexta-feira]",10
4,@PaiRico @frfontanella @eddu56 @TiagoBDS Acabe...,"[o, momento, de, entrar]",8


In [212]:
poss_npred_df.to_csv(r"C:\Users\ianba\OneDrive\Desktop\shenanigans\usp\ic\poetisa\data\data_intermed\poss_npred.csv")