# Evaluating Topic Distances of Authors in Twitter-Conversations

From the project plan, this addresses the following tasks:

4. Map named entities to word vectors using fasttext and store them in db
   - [ ] map entities to word vectors and calculate the hit rate (how many are contained in the fasttext list)
   - [ ] (optionally) enhance the fasttext list with the missing vocabulary
   - [ ] if the hit rate is too low, use transfer learning to train a model that includes 90% of the NERs
   - [ ] store the NER for the authors in a separate table in the database with foreign key reference to the authors
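
The last task above (a separate table with a foreign key to the authors) could be sketched as follows. This is a minimal illustration, not the project's actual schema: the table names `author` and `author_named_entity`, the column `in_vocabulary`, the helper `store_entities`, and the example entities are all assumptions. It uses an in-memory database so it does not touch `db.sqlite3`.

```python
import sqlite3

# Hypothetical schema: table and column names are assumptions for illustration.
SCHEMA = """
CREATE TABLE IF NOT EXISTS author (
    id INTEGER PRIMARY KEY
);
CREATE TABLE IF NOT EXISTS author_named_entity (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    author_id INTEGER NOT NULL REFERENCES author(id) ON DELETE CASCADE,
    entity TEXT NOT NULL,
    in_vocabulary INTEGER NOT NULL  -- 1 if the embedding vocabulary contains the entity, else 0
);
"""


def store_entities(conn, author_id, entities, vocabulary):
    """Insert one row per entity and flag whether it is in the vocabulary."""
    rows = [(author_id, entity, int(entity in vocabulary)) for entity in entities]
    conn.executemany(
        "INSERT INTO author_named_entity (author_id, entity, in_vocabulary) VALUES (?, ?, ?)",
        rows,
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce foreign keys by default
conn.executescript(SCHEMA)

# Made-up example data: one author and two entities, one of which is "known".
conn.execute("INSERT INTO author (id) VALUES (?)", (14838508,))
store_entities(conn, 14838508, ["Oregon", "OregonOEM"], vocabulary={"Oregon"})

# The hit rate is the mean of the in_vocabulary flag.
hit_rate = conn.execute("SELECT AVG(in_vocabulary) FROM author_named_entity").fetchone()[0]
print(hit_rate)
```

Storing the `in_vocabulary` flag next to each entity keeps the hit rate queryable per author with a plain `GROUP BY author_id`.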

In [6]:
import sqlite3
import pandas as pd

from twitter.nlp_util import process_tweet

# Open a connection to the project's SQLite database.
cnx = sqlite3.connect('db.sqlite3')

# Load only the English tweets from the timeline table.
df = pd.read_sql_query(
    "SELECT id, conversation_id, created_at, text, author_id, in_reply_to_user_id FROM delab_timeline WHERE lang='en'",
    cnx)
df.head(3)

Unnamed: 0,id,conversation_id,created_at,text,author_id,in_reply_to_user_id
0,1,1435703745304612870,2021-09-08 20:37:01,RT @OregonOEM: 🚩🌩🔥 Red Flag and #FireWeather w...,14838508,
1,2,1435664738353098756,2021-09-08 18:02:01,It's #NationalPreparednessMonth. Help ensure t...,14838508,
2,3,1435663883595706370,2021-09-08 17:58:37,RT @OregonOEM: Oregon is still recovering from...,14838508,


In [7]:
df_reduced = df[["author_id", "text", "id"]]
#df_reduced = df_reduced.groupby('author_id')
# df_reduced.count()

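# One column per author, one row per tweet id; each cell holds the tweet text (NaN otherwise).
df_reshaped = df_reduced.pivot(index="id", columns="author_id", values="text")
# Drop every author with fewer than 400 distinct tweets.
tweet_counts = df_reshaped.nunique()
df_reshaped = df_reshaped.drop(columns=tweet_counts[tweet_counts < 400].index)
df_reshaped.nunique()  # the number of distinct tweets per author, for authors with at least 400 tweets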

author_id
16558158               478
18616003               946
26998226               469
382814447              446
1005470991668084736    445
1106611172462219265    491
1162371171805011968    447
1239172010363826183    427
1292908140975943681    414
1402252385427222528    414
1403930956428460035    441
dtype: int64

The next cell converts the pandas DataFrame into a dictionary that maps each author id to that author's
tweet corpus, with missing values dropped.

In [8]:
#df_reshaped.shape
author_corpora_cleaned = {}
author_corpora = df_reshaped.to_dict(orient="series")
for author_id, tweets in author_corpora.items():
    author_corpora_cleaned[author_id] = tweets.dropna()

example_corpus = author_corpora_cleaned[next(iter(author_corpora))]
example_corpus

id
8247                                   @ncreen_same ouch!
8248    That last tweet got some responses from spambo...
8249    'virtually' virtually means: real\n\nSo it sho...
8250                               Sounds great on paper!
8251    Any .com.au registrar recommendations? So far ...
                              ...                        
8742    RT @paydirtapp: Check out our Free Invoice Cre...
8743    @taitems @mmilo yeah nice site, suggestion: ht...
8744    RT @MichaelFHansen: Zendesk eyes Southeast Asi...
8745    While LinkedIn has been changing drastically o...
8746    Another LinkedIn email fail… now they want me ...
Name: 16558158, Length: 478, dtype: object

The first step is to evaluate how many out-of-vocabulary words we have in the authors' tweets.

```python
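import fasttext

# fasttext.load_model('cc.en.300.bin') # comment this in instead of the next line, if you are not Julian
# NOTE: the tweets are filtered to lang='en', but this path points at the German model (cc.de);
# the English model (cc.en.300.bin) is presumably the intended one.
ft = fasttext.load_model('/home/julian/nltk_data/fasttext/cc.de.300.bin')

# Build a set once: `word in ft.words` scans the whole word list on every lookup.
vocabulary = set(ft.get_words())

# Collect every space-separated token from all authors' tweets, without any cleaning.
author_words_uncleaned = []
for author, a_tweets in author_corpora_cleaned.items():
    for a_tweet in a_tweets:
        author_words_uncleaned.extend(a_tweet.split(" "))

n_words = len(author_words_uncleaned)
n_words_in_voc = sum(word in vocabulary for word in author_words_uncleaned)
# Multiply the ratio, not the formatted string, by 100.
print("{:.2f}% of uncleaned words are in the embedding vocabulary".format(n_words_in_voc / n_words * 100))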
```

For memory reasons, I had to run this outside of the Jupyter notebook:

100%|██████████| 99225/99225 [11:27<00:00, 144.33it/s]
The hit rate, i.e. the share of uncleaned words that are in the embedding vocabulary, is 0.7420710506424792
(about 74%). We consider this high enough to use the tweets as input directly.
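
To make the hit-rate calculation easy to check without loading the multi-gigabyte model, it can be isolated into a small function. The sketch below is an illustration, not the code that produced the number above: the function name `vocabulary_hit_rate` and the toy vocabulary are assumptions, and the two tweets are taken from the example corpus shown earlier.

```python
def vocabulary_hit_rate(texts, vocabulary):
    """Return the share of space-separated tokens in `texts` found in `vocabulary`."""
    vocab = set(vocabulary)  # set membership is O(1) per lookup
    words = [word for text in texts for word in text.split(" ")]
    if not words:
        return 0.0
    hits = sum(word in vocab for word in words)
    return hits / len(words)


# Toy vocabulary (an assumption); with the real model one could pass ft.get_words() instead.
toy_vocabulary = ["Sounds", "great", "on"]
toy_tweets = ["Sounds great on paper!", "@ncreen_same ouch!"]

rate = vocabulary_hit_rate(toy_tweets, toy_vocabulary)
print("{:.2f}% of uncleaned words are in the toy vocabulary".format(rate * 100))
```

Tokens such as `paper!` and `@ncreen_same` count as misses here because punctuation and mentions are not stripped, which is the same behavior as the uncleaned measurement above.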
