Step 2: First look into the data #3

trannel · 2021-05-24T10:51:55Z

First we should take a look into the data we have by analysing keywords and using tf-idf.

Determine the top 20 words (unigrams) per conference
Determine the top 20 bigrams per conference
Do the same with tf-idf
Implement scattertext for comparisons
Compare our results with those of NLP Scholar
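
As a rough sketch of the first two items, top unigrams and bigrams per conference can be counted with the standard library alone. The toy corpus, the conference keys, and the helper names `ngrams` and `top_ngrams` are made up for illustration and are not taken from the project code.

```python
import re
from collections import Counter


def ngrams(tokens, n):
    """Return all n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def top_ngrams(docs, n=1, k=20):
    """Count n-grams over all documents and return the k most frequent."""
    counts = Counter()
    for doc in docs:
        # Very simple tokenization: lowercase runs of ASCII letters only.
        tokens = re.findall(r"[a-z]+", doc.lower())
        counts.update(ngrams(tokens, n))
    return counts.most_common(k)


# Hypothetical toy corpus keyed by conference name.
corpus = {
    "ACL": ["neural machine translation", "machine translation with attention"],
    "EMNLP": ["word embeddings for parsing", "dependency parsing with word embeddings"],
}

top_unigrams = {conf: top_ngrams(docs, n=1) for conf, docs in corpus.items()}
top_bigrams = {conf: top_ngrams(docs, n=2) for conf, docs in corpus.items()}
```

N-grams are built per document here, so no bigram spans a document boundary.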

trannel · 2021-06-27T18:11:41Z

I checked what was wrong with the tf-idf function: I had overlooked that it generates a matrix, but by using .idf_ I already got the weighting for each feature/token out of it. I changed the function so that the matrix is easier to access.

We can get the highest tf-idf scores using this, which gives us the following: Highest tf-idf scores in selection: [('+0.4', 1, 6.938854596835685), ('+0.6', 1, 6.938854596835685), ('+0.7', 1, 6.938854596835685), ('+1', 1, 6.938854596835685), ('-25.3', 1, 6.938854596835685), ('-50.5', 1, 6.938854596835685), followed by some links. As you can see, removing numbers isn't trivial, as I can only give sklearn a list of words to remove, and listing every number in every notation is not feasible. Numbers no longer appear in the visualizations done in scattertext or pyLDAvis, though.

The issue with the counting during the demo was caused by a missing default value, so the CLI overwrote the other default value.

truas · 2021-06-29T07:37:01Z

Thanks for the update Lennart.
I wonder if .idf_ is just the inverse document frequency part of the equation. In any case, if we can access the entire matrix, it should be fine.
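
A small check along these lines may help settle the question. It assumes scikit-learn's default settings (`smooth_idf=True`), under which the documented IDF is `ln((1 + n) / (1 + df)) + 1`. The toy documents are made up. The sketch compares `.idf_` (one weight per feature) with the matrix returned by `fit_transform` (one row per document).

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy documents, not project data.
docs = [
    "neural machine translation",
    "machine translation with attention",
    "word embeddings for parsing",
]

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(docs)  # sparse, shape (n_docs, n_features)

n_docs = len(docs)
# Document frequency: in how many documents each feature occurs.
df = np.asarray((matrix > 0).sum(axis=0)).ravel()
# Smoothed IDF as documented for smooth_idf=True.
expected_idf = np.log((1 + n_docs) / (1 + df)) + 1

idf_matches = np.allclose(vectorizer.idf_, expected_idf)
print("idf_ shape:", vectorizer.idf_.shape, "matrix shape:", matrix.shape)
print("idf_ equals the smoothed IDF formula:", idf_matches)
```

If the check holds, `.idf_` carries no term-frequency information, so every token with the same document frequency gets the same value. That would be consistent with the identical scores in the list above.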

About the number issue: are you removing the numerical characters before running the tf-idf? I believe this would be easier, since we clean the input before using it in any processing, right before/after the stopword removal. I'm still wondering if we should use TfidfVectorizer instead of TfidfTransformer. The former is usually used when the input is the raw documents, and the latter if you already have a count matrix. Also, with the former, several tasks can be automated with a parameter (e.g. stopword removal, n-grams, max features, min, regex, etc).
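
To make the comparison concrete, here is a minimal sketch, assuming default tf-idf settings and made-up documents. It runs both routes with the same vectorizer parameters and checks that they produce the same matrix. The parameter values are illustrative only.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Hypothetical toy documents, not project data.
docs = [
    "neural machine translation",
    "machine translation with attention",
    "word embeddings for parsing",
]

# Illustrative parameters shared by both routes.
params = dict(ngram_range=(1, 2), stop_words="english", max_features=50)

# Route 1: raw counts first, then tf-idf weighting on the count matrix.
counts = CountVectorizer(**params).fit_transform(docs)
two_step = TfidfTransformer().fit_transform(counts)

# Route 2: raw documents straight to tf-idf.
one_step = TfidfVectorizer(**params).fit_transform(docs)

same = np.allclose(two_step.toarray(), one_step.toarray())
print("Both routes give the same matrix:", same)
```

The two-step route keeps the count matrix around, which can be handy if the raw counts are needed elsewhere, e.g. for the top-20 lists.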

trannel · 2021-06-29T13:55:37Z

The CountVectorizer I use has the parameters you mentioned and creates the matrix the TfidfTransformer needs. I can check if the results would be the same.

The parameters are also the issue for the stopwords, as I can only pass a list of stopwords, which sklearn will remove. I think I can smuggle it into the tokenization, so numbers will also be removed. Then we would also do the stopword removal ourselves, because we have to check numbers with a function and can't pass a list of all numbers to remove. Maybe I also missed something and you can pass a function as well.
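
On the last point: CountVectorizer does accept a callable through its `tokenizer` parameter, so the number check can live inside the tokenization. The sketch below is one possible shape of that idea. The stopword set, the helper names, and the example sentences are assumptions for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stopword list; the project's real list is not shown in this thread.
STOPWORDS = {"the", "of", "by", "a", "and"}


def is_number(token):
    """Return True if the token parses as a float, e.g. '+0.4' or '-25.3'."""
    try:
        float(token)
        return True
    except ValueError:
        return False


def tokenize(text):
    """Split on whitespace, then drop stopwords and numeric tokens."""
    return [t for t in text.split() if t not in STOPWORDS and not is_number(t)]


# token_pattern=None because the regex is unused once a tokenizer is given.
# Lowercasing is still done by CountVectorizer before the tokenizer runs.
vectorizer = CountVectorizer(tokenizer=tokenize, token_pattern=None)
counts = vectorizer.fit_transform([
    "BLEU improves by +0.4 points",
    "The error drops by -25.3 percent",
])
vocab = sorted(vectorizer.vocabulary_)
print(vocab)
```

Doing the stopword removal inside the tokenizer avoids passing `stop_words` to sklearn at all, which matches the idea of handling both filters in one place.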

truas · 2021-07-05T06:52:45Z

Sorry for the delay Lennart.
Yes, don't overthink this. Just a regex to get rid of punctuation/numbers is enough. Essentially, the stopword removal is nothing more than a simple comprehension that checks whether a given word is in the list or not.
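
Something like the following would be enough as a first pass. It is only a sketch: the stopword set and the function name `clean` are placeholders, and the regex keeps ASCII letters only, which is an assumption that may be too strict for non-English text.

```python
import re

# Hypothetical stopword list; replace with the project's own.
STOPWORDS = {"the", "is", "by", "of"}

# Every run of characters that is neither a lowercase letter nor whitespace
# (digits, punctuation, signs) is replaced by a single space.
NON_LETTERS = re.compile(r"[^a-z\s]+")


def clean(text):
    """Lowercase, strip numbers/punctuation, and drop stopwords."""
    text = NON_LETTERS.sub(" ", text.lower())
    return [word for word in text.split() if word not in STOPWORDS]


tokens = clean("The F1-score is improved by +1.23/-7.23 points (p < 0.05).")
print(tokens)
```

One side effect to be aware of: mixed tokens are split rather than dropped, so "F1-score" ends up as "f" and "score". A minimum token length could filter out such leftovers.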

trannel · 2021-07-06T13:36:32Z

For now, I removed some numbers by trying to cast each token to a float and checking whether that works. Tokens like +1.23/-7.23 are not removed though, but there are also quite a few other tokens consisting only of punctuation that we might have to look at later on anyway.
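
For reference, a minimal sketch of the float-cast check and of the kinds of tokens it lets through. It assumes that a pair like +1.23/-7.23 reaches the check as one whitespace-delimited token. The helper name `is_float` and the sample tokens are illustrative.

```python
def is_float(token):
    """Return True if Python's float() accepts the token."""
    try:
        float(token)
        return True
    except ValueError:
        return False


# Sample tokens; which of them survive the filter depends only on float().
samples = ["+0.4", "-25.3", "1e-3", "+1.23/-7.23", "12%", "1,000", "...", "nan"]

removed = [t for t in samples if is_float(t)]
kept = [t for t in samples if not is_float(t)]
print("removed:", removed)
print("kept:", kept)
```

Note that float() also accepts strings such as "nan", "inf" and "infinity", so these would be dropped as numbers even if they occur as ordinary tokens.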

trannel added the doing label Jun 12, 2021

trannel removed the doing label Jul 6, 2021

jpwahle added this to the Milestone 1 - Data Aquisition & Feature Development milestone Jul 30, 2021

jpwahle closed this as completed Jul 30, 2021
