Topic Modelling and Visualization #163

Open: wants to merge 50 commits into base: master
Changes from 42 commits
Commits
fa342a9
added MultiIndex DF support
mk2510 Aug 18, 2020
59a9f8c
beginning with tests
henrifroese Aug 19, 2020
19c52de
implemented correct sparse support
mk2510 Aug 19, 2020
66e566c
Merge branch 'master_upstream' into change_representation_to_multicolumn
mk2510 Aug 21, 2020
41f55a8
added back list() and rm .tolist()
mk2510 Aug 21, 2020
217611a
rm .tolist() and added list()
mk2510 Aug 21, 2020
6a3b56d
Adopted the test to the new dataframes
mk2510 Aug 21, 2020
b8ff561
wrong format
mk2510 Aug 21, 2020
e3af2f9
Address most review comments.
henrifroese Aug 21, 2020
77ad80e
Add more unittests for representation
henrifroese Aug 21, 2020
bee2157
Initial commit to add topic modelling
henrifroese Aug 24, 2020
dece7b5
add pyLDAvis to dependencies
henrifroese Aug 24, 2020
6387ce9
add return_figure option
henrifroese Aug 24, 2020
01c0818
allow display in Console and Jupyter Notebooks
henrifroese Aug 24, 2020
9cd113c
tsvd
mk2510 Aug 24, 2020
187d8f5
Change display at end of function
henrifroese Aug 24, 2020
d9d032c
Merge branch 'topic_modelling' of https://github.com/SummerOfCode-NoH…
henrifroese Aug 24, 2020
85089b1
change display
henrifroese Aug 24, 2020
77d815a
change display for notebook again
henrifroese Aug 24, 2020
242383a
added lda
mk2510 Aug 24, 2020
7edac3a
Merge remote-tracking branch 'origin/topic_modelling' into topic_mode…
mk2510 Aug 24, 2020
9e09e7a
Add tests
henrifroese Aug 24, 2020
fc54ff2
Merge branch 'topic_modelling' of https://github.com/SummerOfCode-NoH…
henrifroese Aug 24, 2020
46289f2
Format; change name; remove new type Signature
henrifroese Aug 24, 2020
bcfa78d
updatewd test
mk2510 Aug 24, 2020
eb2d31b
add docstring
henrifroese Aug 24, 2020
1fd16eb
Merge branch 'topic_modelling' of https://github.com/SummerOfCode-NoH…
henrifroese Aug 24, 2020
3a39346
Implement matrix multiplication changes; fix metadata error
henrifroese Aug 24, 2020
4ad7ee8
added test for lda and tSVD
mk2510 Aug 25, 2020
65504bb
Implement top_words_per_document, top_words_per_topic, topics_from_to…
henrifroese Aug 25, 2020
76e7689
added tests for topic functions
mk2510 Aug 25, 2020
3dfd528
Add docstrings and function comments to new topic modelling functions…
henrifroese Aug 25, 2020
6222448
Merge remote-tracking branch 'origin/topic_modelling' into topic_mode…
henrifroese Aug 25, 2020
79fc37e
fixed index and test
mk2510 Aug 25, 2020
2db883d
- Fix display options
henrifroese Aug 25, 2020
d1c582b
Fix plot visualization
henrifroese Aug 25, 2020
72a7736
remove return_figure parameter
henrifroese Aug 25, 2020
ebc4171
Fix errors and bugs.
henrifroese Aug 25, 2020
9d85c14
remove test-docstring at the end
henrifroese Aug 25, 2020
64edbaf
Start implementing discussed changes
henrifroese Aug 29, 2020
69af26b
Finish implementing the suggested changes.
henrifroese Aug 30, 2020
b9eaf1f
incorporate suggested changes from review
henrifroese Sep 12, 2020
6c30a5e
Fix pyLDAvis PCoA issue.
henrifroese Sep 18, 2020
d12ba7e
Add comment to docstring.
henrifroese Sep 18, 2020
a571925
import _helper in __init__ to overwrite pyLDAvis change
henrifroese Sep 18, 2020
a75aebe
enable auto-display for jupyter notebooks
henrifroese Sep 18, 2020
f8a09c4
Merge branch 'master_upstream' into topic_modelling
mk2510 Sep 22, 2020
cfc78d9
fixed vector series, as pca returns an array
mk2510 Sep 22, 2020
4c5aa0b
fixed the last merged issues
mk2510 Sep 22, 2020
dc42ed1
fix formatting
mk2510 Sep 22, 2020
2 changes: 1 addition & 1 deletion .travis.yml
@@ -20,7 +20,7 @@ jobs:
  env: PATH=/c/Python38:/c/Python38/Scripts:$PATH
  install:
  - pip3 install --upgrade pip # all three OSes agree about 'pip3'
- - pip3 install black
+ - pip3 install black==19.10b0
  - pip3 install ".[dev]" .
  # 'python' points to Python 2.7 on macOS but points to Python 3.8 on Linux and Windows
  # 'python3' is a 'command not found' error on Windows but 'py' works on Windows only
3 changes: 2 additions & 1 deletion setup.cfg
@@ -38,10 +38,11 @@ install_requires =
  unidecode>=1.1.1
  gensim>=3.6.0
  matplotlib>=3.1.0
+ pyLDAvis>=2.1.2
  # TODO pick the correct version.
  [options.extras_require]
  dev =
- black>=19.10b0
+ black==19.10b0
  pytest>=4.0.0
  Sphinx>=3.0.3
  sphinx-markdown-builder>=0.5.4
51 changes: 34 additions & 17 deletions tests/test_indexes.py
@@ -12,6 +12,13 @@
  s_tokenized_lists = pd.Series([["Test", "Test2"], ["Test3"]], index=[5, 6])
  s_numeric = pd.Series([5.0], index=[5])
  s_numeric_lists = pd.Series([[5.0, 5.0], [6.0, 6.0]], index=[5, 6])
+ df_document_term = pd.DataFrame(
+     [[0.125, 0.0, 0.0, 0.125, 0.250], [0.0, 0.25, 0.125, 0.0, 0.125]],
+     index=[5, 6],
+     columns=pd.MultiIndex.from_product([["test"], ["!", ".", "?", "TEST", "Test"]]),
+     dtype="Sparse",
+ )
+

  # Define all test cases. Every test case is a list
  # of [name of test case, function to test, tuple of valid input for the function].
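The `df_document_term` fixture added in this hunk combines two-level columns with a sparse dtype. The following is a minimal sketch of how such a frame can be built and inspected with plain pandas; it is an illustration, not part of the PR. One assumption to flag: the sketch converts with an explicit `pd.SparseDtype(float, fill_value=0.0)` through `astype`, whereas the fixture passes the string `"Sparse"` to the constructor, so the fill value here may differ from the fixture's.

```python
import pandas as pd

# Two-level columns: the outer level names the representation ("test" here),
# the inner level holds the individual terms.
columns = pd.MultiIndex.from_product([["test"], ["!", ".", "?", "TEST", "Test"]])

dense = pd.DataFrame(
    [[0.125, 0.0, 0.0, 0.125, 0.250], [0.0, 0.25, 0.125, 0.0, 0.125]],
    index=[5, 6],
    columns=columns,
)

# Assumption (not taken from the PR): an explicit sparse dtype with 0.0 as the
# fill value, so that zero entries are not stored.
sparse = dense.astype(pd.SparseDtype(float, fill_value=0.0))

print(sparse.index.tolist())    # the document index is carried over unchanged
print(sparse.columns.nlevels)   # number of column levels
print(sparse.sparse.density)    # share of explicitly stored values
```

The index `[5, 6]` is deliberately not the default `0..n-1` range, which is what lets the index tests tell a preserved index apart from a reset one.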
@@ -56,27 +63,27 @@
  ]

  test_cases_representation = [
-     [
-         "count",
-         lambda x: representation.flatten(representation.count(x)),
-         (s_tokenized_lists,),
-     ],
-     [
-         "term_frequency",
-         lambda x: representation.flatten(representation.term_frequency(x)),
-         (s_tokenized_lists,),
-     ],
-     [
-         "tfidf",
-         lambda x: representation.flatten(representation.tfidf(x)),
-         (s_tokenized_lists,),
-     ],
+     ["count", representation.count, (s_tokenized_lists,),],
+     ["term_frequency", representation.term_frequency, (s_tokenized_lists,),],
+     ["tfidf", representation.tfidf, (s_tokenized_lists,),],
      ["pca", representation.pca, (s_numeric_lists, 0)],
      ["nmf", representation.nmf, (s_numeric_lists,)],
      ["tsne", representation.tsne, (s_numeric_lists,)],
+     ["truncatedSVD", representation.tsne, (s_numeric_lists, 1)],
+     ["lda", representation.tsne, (s_numeric_lists, 1)],
      ["kmeans", representation.kmeans, (s_numeric_lists, 1)],
      ["dbscan", representation.dbscan, (s_numeric_lists,)],
      ["meanshift", representation.meanshift, (s_numeric_lists,)],
+     [
+         "topics_from_topic_model",
+         representation.topics_from_topic_model,
+         (s_numeric_lists,),
+     ],
+     [
+         "top_words_per_document",
+         representation.relevant_words_per_document,
+         (df_document_term,),
+     ],
  ]

  test_cases_visualization = []
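In this hunk the cases labelled "truncatedSVD" and "lda" both pass `representation.tsne` as the function under test, so as written those two labels exercise `tsne` again. A minimal sketch of one way to keep label and function from drifting apart is to derive the label from the function itself. The helper `make_cases` and the stand-in functions below are hypothetical and exist only for this illustration; they are not part of the PR or of the library.

```python
def make_cases(*entries):
    """Build [name, function, args] lists, taking each name from the function itself."""
    return [[func.__name__, func, args] for func, args in entries]


# Hypothetical stand-ins for the functions under test (assumption: the real
# ones would come from the representation module).
def tsne(s, n_components=2):
    return s


def lda(s, n_components=2):
    return s


cases = make_cases(
    (tsne, ([1.0, 2.0],)),
    (lda, ([1.0, 2.0], 1)),
)

for name, func, args in cases:
    # The label can no longer disagree with the function that is called.
    print(name, func.__name__, func(*args))
```

The trade-off is that a label can no longer be chosen freely, which matters when the same function should appear under several names with different inputs.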
@@ -106,12 +113,22 @@ class AbstractIndexTest(PandasTestCase):
      def test_correct_index(self, name, test_function, valid_input):
          s = valid_input[0]
          result_s = test_function(*valid_input)
-         t_same_index = pd.Series(s.values, s.index)
+
+         if isinstance(s, pd.Series):
+             t_same_index = pd.Series(s.values, s.index)
+         else:
+             t_same_index = pd.DataFrame(s.values, s.index)
+
          self.assertTrue(result_s.index.equals(t_same_index.index))

      @parameterized.expand(test_cases)
      def test_incorrect_index(self, name, test_function, valid_input):
          s = valid_input[0]
          result_s = test_function(*valid_input)
-         t_different_index = pd.Series(s.values, index=None)
+
+         if isinstance(s, pd.Series):
+             t_different_index = pd.Series(s.values, index=None)
+         else:
+             t_different_index = pd.DataFrame(s.values, index=None)
+
          self.assertFalse(result_s.index.equals(t_different_index.index))
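The two tests in this hunk compare the result's index with the input's index and with a default index, for both `Series` and `DataFrame` inputs. A minimal standalone sketch of that check follows. It compares against `data.index` and `pd.RangeIndex` directly instead of constructing a helper object as the PR does, and the functions `preserving` and `resetting` are hypothetical stand-ins for functions under test.

```python
import pandas as pd


def keeps_index(func, data):
    """True if func(data) carries the same index as its input."""
    return func(data).index.equals(data.index)


def differs_from_default_index(func, data):
    """True if func(data) did not fall back to a default 0..n-1 index."""
    return not func(data).index.equals(pd.RangeIndex(len(data)))


# Hypothetical stand-ins for functions under test.
def preserving(x):
    return x.copy()


def resetting(x):
    return x.reset_index(drop=True)


s = pd.Series([[5.0, 5.0], [6.0, 6.0]], index=[5, 6])
df = pd.DataFrame([[0.125, 0.0], [0.0, 0.25]], index=[5, 6])

for data in (s, df):
    print(
        type(data).__name__,
        keeps_index(preserving, data),
        keeps_index(resetting, data),
    )
```

Because both `Series` and `DataFrame` expose `.index`, the comparison itself needs no type check; the `isinstance` branch in the PR is only needed for building the comparison object.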