## What is topic analysis?

Topic analysis is a machine learning technique that uses key words and phrases in a collection of texts to identify common themes or categories.

There are many methods for topic analysis; two of the most popular are latent Dirichlet allocation (LDA) and non-negative matrix factorization (NMF). Some research indicates that NMF may produce stronger results, but each can be the better choice in different situations. datto currently uses NMF, and LDA options will likely be added in the future.

## Topic analysis in datto

To run a basic topic analysis in datto, first create a Pandas dataframe with a column that holds one piece of text per row.

You then need to specify the following parameters:
* `X` -> The dataframe of text data
* `num_examples` -> datto returns a few examples of texts within each category; choose how many you'd like.
* `text_column_name` -> The name of the column in your dataframe containing the text

The following parameters are optional:
* `chosen_num_topics` -> How many topics to create
* `chosen_stopwords` -> Any additional stopwords to use (i.e. words to ignore)
* `min_df` -> The minimum number/percentage of texts a word has to appear in to be counted (default is 3)
* `max_df` -> The maximum number/percentage of texts a word can appear in before it is ignored (default is 0.1)
* `min_ngrams` -> The smallest number of consecutive words to treat as a single term (default is 1, i.e. a single word)
* `max_ngrams` -> The largest number of consecutive words to treat as a single term (default is 3, i.e. up to 3-word phrases)
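
To make the effect of `min_df`, `max_df`, and the n-gram bounds concrete, here is a minimal pure-Python sketch of document-frequency filtering. It is not datto's implementation: the helper names `ngrams` and `filter_vocabulary` are hypothetical, and treating integers as counts and floats as a share of all texts is an assumption based on the "number/percentage" wording above.

```python
from collections import Counter


def ngrams(tokens, min_n=1, max_n=3):
    """Return every run of min_n to max_n consecutive tokens, joined by spaces."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(min_n, max_n + 1)
        for i in range(len(tokens) - n + 1)
    ]


def filter_vocabulary(tokenized_texts, min_df=3, max_df=0.1, min_n=1, max_n=3):
    """Keep terms whose document frequency lies between min_df and max_df.

    Assumption: integers are absolute counts, floats are a share of all texts.
    """
    n_texts = len(tokenized_texts)
    doc_freq = Counter()
    for tokens in tokenized_texts:
        # Count each term at most once per text
        doc_freq.update(set(ngrams(tokens, min_n, max_n)))

    low = min_df if isinstance(min_df, int) else min_df * n_texts
    high = max_df if isinstance(max_df, int) else max_df * n_texts
    return {term for term, count in doc_freq.items() if low <= count <= high}


# Tiny made-up corpus, already tokenized
texts = [
    ["hard", "drive", "failed"],
    ["hard", "drive", "sale"],
    ["the", "drive"],
    ["the", "sale"],
]

# Keep terms that appear in at least 2 texts and in at most 50% of texts
vocab = filter_vocabulary(texts, min_df=2, max_df=0.5, max_n=2)
```

In this toy corpus `drive` is dropped because it appears in more than half of the texts, while the phrase `hard drive` survives because it appears in exactly two.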

## Automatically choosing the number of topics

There are a variety of methods for choosing how many topics the model should create. If you omit the `chosen_num_topics` parameter, the method runs some tests to determine the "optimal" number of topics for you. Automatically choosing the number of topics is often a great place to start, but these testing methods are far from perfect.

The automated method for choosing the number of topics uses two measurements: Jaccard similarity and coherence.

**Jaccard similarity** measures how similar two sets are: the number of items they share divided by the total number of unique items across both. Here, the words in each topic form a set, and the similarity is computed for every pair of topics to verify that the scores are low, i.e. that the topics are distinct.
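
As a rough illustration of the definition above, the sketch below computes Jaccard similarity for every pair of topics. The topic word lists are made up for the example, and the function name `jaccard_similarity` is hypothetical rather than part of datto.

```python
from itertools import combinations


def jaccard_similarity(words_a, words_b):
    """Shared unique words divided by total unique words across both lists."""
    set_a, set_b = set(words_a), set(words_b)
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)


# Hypothetical top words for three topics
topics = {
    0: ["drive", "disk", "floppy", "scsi"],
    1: ["drive", "scsi", "controller", "ide"],
    2: ["god", "atheist", "religion", "bible"],
}

# Similarity for every pair of topics; low values mean distinct topics
pairwise = {
    (a, b): jaccard_similarity(topics[a], topics[b])
    for a, b in combinations(topics, 2)
}
mean_similarity = sum(pairwise.values()) / len(pairwise)
```

Topics 0 and 1 share two of six unique words, so they score about 0.33, which hints that they may be near-duplicates; both score 0 against topic 2.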

**Coherence scores** evaluate a topic by the degree to which the prominent words in that topic consistently appear together throughout all the texts.
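
There are several published coherence measures, and this document does not say which one datto uses. The sketch below is therefore only a simplified co-occurrence score, written to show the idea: for each pair of top words, it asks how often the two words appear in the same text. The function name and the toy texts are assumptions made for the example.

```python
from itertools import combinations


def cooccurrence_coherence(top_words, tokenized_texts):
    """Average, over word pairs, of (texts with both words) / (texts with either)."""
    doc_sets = [set(tokens) for tokens in tokenized_texts]
    scores = []
    for word_a, word_b in combinations(top_words, 2):
        both = sum(1 for doc in doc_sets if word_a in doc and word_b in doc)
        either = sum(1 for doc in doc_sets if word_a in doc or word_b in doc)
        scores.append(both / either if either else 0.0)
    return sum(scores) / len(scores) if scores else 0.0


# Made-up, already tokenized texts
texts = [
    ["hard", "drive", "floppy", "disk"],
    ["floppy", "disk", "drive"],
    ["god", "religion", "bible"],
    ["bible", "god", "drive"],
]

# Words that tend to appear together score higher than a mixed bag of words
coherent_topic = cooccurrence_coherence(["drive", "disk", "floppy"], texts)
mixed_topic = cooccurrence_coherence(["drive", "god", "floppy"], texts)
```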

This code tests various numbers of topics (5 to 75 at intervals of 5) and chooses the number that both maximizes coherence and minimizes Jaccard similarity.
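
The search over candidate topic counts can be sketched as follows. How datto actually combines the two measurements is not described here, so subtracting the mean Jaccard similarity from the coherence score is an assumption, and the two scoring functions are made-up placeholders standing in for "fit a model with `k` topics and measure it".

```python
def choose_num_topics(candidates, coherence_for, jaccard_for):
    """Pick the candidate with the highest (coherence - mean Jaccard) score."""
    return max(candidates, key=lambda k: coherence_for(k) - jaccard_for(k))


# Candidate topic counts: 5, 10, ..., 75
candidate_counts = list(range(5, 80, 5))


# Placeholder scores for illustration only; in practice each call would
# fit a model with k topics and measure the resulting topics.
def toy_coherence(k):
    return 1.0 - abs(k - 30) / 100


def toy_jaccard(k):
    return k / 500


best_num_topics = choose_num_topics(candidate_counts, toy_coherence, toy_jaccard)
```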

## Manually choosing the number of topics

Nothing substitutes for simply looking through the identified topics to get a feel for whether there are duplicate topics or topics that need to be split further. Your business case may even require a certain number or range of topics. There are also graphing techniques for visually inspecting topics, such as [UMAP](https://umap-learn.readthedocs.io/en/latest/clustering.html). Graphing techniques may be added to datto in the future.

## Choosing stopwords

A large part of getting good results from text analysis depends on choosing effective stopwords. Stopwords are words that are considered to carry no meaning in the current analysis, and are thus ignored when creating topics.

Several Python packages have built-in lists of stopwords, so to get the best of all worlds, datto combines the lists from several packages (`nltk` + `spacy` + `sklearn`) as its default stopwords. By default, datto also excludes punctuation, single letters, pronouns, some common terms (`w/`, `and/or`, `i.e.`, `e.g.`) and some common punctuation combinations (e.g. `---`, `..`).

You can add any number of additional stopwords by including them as a set in the `chosen_stopwords` parameter.

To choose stopwords, there's no substitute for looking through the generated topics and identifying which words aren't meaningful to your specific analysis. For example, if you're analyzing messages from a certain company, the name of that company isn't really meaningful. Similarly, a topic consisting only of days of the week is likely unhelpful, so you can exclude the name of each day.
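
A set of extra stopwords for the two examples above might be built like this. The company name `acme` is hypothetical, and `remove_stopwords` is only a small helper for previewing the effect on a tokenized text; it is not part of datto.

```python
# Hypothetical company name that shows up in every message
company_terms = {"acme"}

weekdays = {
    "monday", "tuesday", "wednesday", "thursday",
    "friday", "saturday", "sunday",
}

# Combine into one set of additional stopwords
chosen_stopwords = company_terms | weekdays


def remove_stopwords(tokens, stopwords):
    """Drop tokens whose lowercase form is in the stopword set."""
    return [token for token in tokens if token.lower() not in stopwords]


preview = remove_stopwords(["Acme", "meeting", "moved", "to", "Monday"], chosen_stopwords)
```

The resulting set can then be passed as the `chosen_stopwords` parameter described earlier.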

You will likely need to iterate through this analysis many times, continuing to identify unhelpful words, until meaningful topics start to emerge.

## Output

This datto method returns three objects: a dataframe with one row per topic, your original dataframe (one row per text) with the chosen topic added, and the trained text analysis model.

The first dataframe contains the number of texts with that topic as the most relevant, the key words and phrases used to create that topic, and several columns of example texts that fall into that category.

The second dataframe is your complete original dataframe with added information indicating which topic each text fits most closely. Note that the actual model output scores each text by how closely it fits every topic; the single chosen topic is simply the one with the highest score.
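
Reducing per-topic scores to a single topic is an argmax over each row of scores. The sketch below uses made-up scores (one row per text, one column per topic) and a hypothetical helper name; it illustrates the idea rather than datto's own code.

```python
def most_relevant_topic(scores):
    """Return the index of the highest score (the first one wins on ties)."""
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical scores: one row per text, one column per topic
topic_scores = [
    [0.02, 0.81, 0.10],
    [0.40, 0.05, 0.33],
    [0.00, 0.12, 0.57],
]

chosen_topics = [most_relevant_topic(row) for row in topic_scores]
```

A text whose top two scores are close, like the second row here, may reasonably belong to more than one topic, which the single chosen topic hides.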

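The model is the trained text analysis model used to create the topics. You can use it to score new text data against the chosen topics. Note that the scikit-learn `NMF` object shown in the example output below exposes `transform()` rather than `predict()`, and new text must first be vectorized the same way as the training data.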

## Example usage

In [25]:
import pandas as pd
import datto as dt

from sklearn.datasets import fetch_20newsgroups

In [22]:
# Testing using a built-in sklearn dataset
training_data = fetch_20newsgroups(subset='train')
df = pd.DataFrame(training_data.data, columns=['text'])

In [24]:
df.head()

| | text |
|---|---|
| 0 | From: lerxst@wam.umd.edu (where's my thing)\nS... |
| 1 | From: guykuo@carson.u.washington.edu (Guy Kuo)... |
| 2 | From: twillis@ec.ecn.purdue.edu (Thomas E Will... |
| 3 | From: jgreen@amber (Joe Green)\nSubject: Re: W... |
| 4 | From: jcm@head-cfa.harvard.edu (Jonathan McDow... |


In [27]:
df.shape

(11314, 1)

In [26]:
mr = dt.ModelResults()

In [31]:
# Let's do our first run using just the defaults and a sample so it runs in a reasonable time
# Note: this still takes a few minutes (testing different topic numbers makes this run longer)
# The three return values must be unpacked in a single statement
concated_topics, original_with_keywords, model = mr.most_similar_texts(
    df.sample(500),
    text_column_name='text',
    num_examples=15,
)

Your stop_words may be inconsistent with your preprocessing. Tokenizing the stop words generated tokens ['-pron-', 'I', 'far', 'need', 'regard', 'shall', 'use', 'win'] not in stop_words.
Function get_feature_names is deprecated; get_feature_names is deprecated in 1.0 and will be removed in 1.2. Please use get_feature_names_out instead.
np.matrix usage is deprecated in 1.0 and will raise a TypeError in 1.2. Please convert to a numpy array with np.asarray. For more information see: https://numpy.org/doc/stable/reference/generated/numpy.matrix.html
The 'init' value, when 'init=None' and n_components is less than n_samples and n_features, will be changed from 'nndsvd' to 'nndsvda' in 1.1 (renaming of 0.26).
Maximum number of iterations 200 reached. Increase it to improve convergence.

Topics created with top words & example texts:
    topic_num  num_in_category  \
0           0                2   
1           1               21   
2           2               11   
3           3               20   
4           4                6   
5           5               19   
6           6                3   
7           7               11   
8           8                5   
9           9                5   
10         10                8   
11         11                8   
12         12                4   
13         13                6   
14         14                6   
15         15               12   
16         16                8   
17         17               18   
18         18                5   
19         19               13   
20         20                2   
21         21                8   
22         22                8   
23         23               10   
24         24                9   
25         25               10   
26         26                5   
...

In [34]:
print('Number of topics chosen:')
print(concated_topics.shape[0])

Number of topics chosen:
50


In [35]:
# Let's inspect the topics
# Note: each topic is labeled with a topic number to make it easy to identify; rows are sorted by how many texts fall into each topic
concated_topics

Unnamed: 0,topic_num,num_in_category,top_words_and_phrases,example0,example1,example2,example3,example4,example5,example6,example7,example8,example9,example10,example11,example12,example13,example14
1,1,21,"[god, atheist, evil, religion, eternal, author...",From: jsledd@ssdc.sas.upenn.edu (James Sledd)\...,From: dlphknob@camelot.bradley.edu (Jemaleddin...,From: joslin@pogo.isp.pitt.edu (David Joslin)\...,From: phs431d@vaxc.cc.monash.edu.au\nSubject: ...,From: coffey@cptc2.neep.wisc.edu (Robert L. Co...,From: mussack@austin.ibm.com (Christopher Muss...,From: seanna@bnr.ca (Seanna (S.M.) Watson)\nSu...,From: REXLEX@fnal.fnal.gov\nSubject: Re: Athie...,From: news@cbnewsk.att.com\nSubject: Re: An ag...,From: B8HA <B8HA@MUSICB.MCGILL.CA>\nSubject: R...,From: I3150101@dbstu1.rz.tu-bs.de (Benedikt Ro...,From: jaeger@buphy.bu.edu (Gregg Jaeger)\nSubj...,From: sandvik@newton.apple.com (Kent Sandvik)\...,From: bobbe@vice.ICO.TEK.COM (Robert Beauchain...,From: bil@okcforum.osrhe.edu (Bill Conner)\nSu...
3,3,20,"[drive, hard drive, hard, disk, floppy, floppy...",From: Aovai@qube.OCUnix.On.Ca (Aovai)\nSubject...,From: wgs1@Isis.MsState.Edu (Walter G. Seefeld...,From: eacj@theory.TC.Cornell.EDU (Julian Vries...,From: dashley@wyvern.wyvern.com (Doug Ashley)\...,From: bcasavan@cougar.ecn.uoknor.edu (Brent Ca...,From: corwin@igc.apc.org (Corwin Nichols)\nSub...,From: jdrout@scott.skidmore.edu (JTD is lost)\...,From: robert.desonia@hal9k.ann-arbor.mi.us (Ro...,From: d88-jwa@hemul.nada.kth.se (Jon Wtte)\nSu...,From: bagels@gotham.East.Sun.COM (Alex Beigelm...,From: bruce@liv.ac.uk (Bruce Stephens)\nSubjec...,From: joshc@csa.bu.edu (Josh Carroll)\nSubject...,Subject: Put ex. syquest in Centris 610?\nFrom...,From: darrylo@srgenprp.sr.hp.com (Darryl Okaha...,From: choe@dirac.phys.washington.edu\nSubject:...
41,41,20,"[sale, cd, tape, steven, shipping, sell, rob, ...",From: rob@mother.bates.edu (Rob Spellman)\nSub...,From: smedley@ecst.csuchico.edu (Steven Medley...,From: wgs1@Isis.MsState.Edu (Walter G. Seefeld...,From: walshs@cs.uwp.edu (Steven Walsh)\nSubjec...,From: sbrogii@copernicus.Tymnet.COM (Scott Bro...,"From: jac2y@Virginia.EDU (""Jonathan A. Cook <j...",From: hinds@cmgm.stanford.edu (Alexander Hinds...,From: Wil.Chin@launchpad.unc.edu (Wilson Chin)...,From: pchang@ic.sunysb.edu (Pong Chang)\nSubje...,From: hsieh1@carson.u.washington.edu (Darrell ...,From: easwarakv@woods.ulowell.edu\nSubject: CD...,From: mycal@NetAcsys.com (Mycal)\nSubject: Nee...,From: mkaschke@oasys.dt.navy.mil (Martin Kasch...,From: dwarner@journalism.indiana.edu (David J....,From: steveg@bach.udel.edu (Steven N Gaudino)\...
46,46,20,"[piece, north carolina, carolina, address, nor...",From: jiml@strauss.FtCollinsCO.NCR.COM (Jim L)...,From: bryan@philips.oz.au (Bryan Ryan)\nOrgani...,From: cmgrawbu@eos.ncsu.edu (CHRISTOPHER M GRA...,From: jrwaters@eos.ncsu.edu (JACK ROGERS WATER...,From: jmcocker@eos.ncsu.edu (Mitch)\nSubject: ...,"From: jac2y@Virginia.EDU (""Jonathan A. Cook <j...",From: rbemben@timewarp.prime.com (Rich Bemben)...,,,,,,,,
5,5,19,"[key, chip, encryption, clipper, escrow, law e...",From: amolitor@nmsu.edu (Andrew Molitor)\nSubj...,From: smythw@vccnw03.its.rpi.edu (William Smyt...,From: ameline@vnet.IBM.COM (Ian Ameline)\nSubj...,From: amanda@intercon.com (Amanda Walker)\nSub...,From: silly@ugcs.caltech.edu (Brad Threatt)\nS...,From: Michael_LaBella@vos.stratus.com\nSubject...,From: ho@cs.arizona.edu (Hilarie Orman)\nSubje...,From: felixg@coop.com (Felix Gallo)\nSubject: ...,From: denning@guvax.acc.georgetown.edu\nSubjec...,From: C445585@mizzou1.missouri.edu (John Kelse...,From: stan@tacobel.UUCP (stan)\nSubject: Re: t...,From: tommc@hpcvusj.cv.hp.com (Tom McFarland)\...,From: pat@rwing.UUCP (Pat Myrto)\nSubject: New...,From: olson@umbc.edu (Bryan Olson; CMSC)\nSubj...,From: ankleand@mtl.mit.edu (Andrew Karanicolas...
17,17,18,"[driver, mode, card, vesa, video, memory, vga,...",From: d91-fad@tekn.hj.se (DANIEL FALK)\nSubjec...,From: debrown@hubcap.clemson.edu (David E. Bro...,From: theroo@med.unc.edu (Bron D. Skinner Ph.D...,From: jmc@engr.engr.uark.edu (J. M. Carmack)\n...,From: ICH344@DJUKFA11.BITNET\nSubject: Wanted:...,From: tiang@midway.ecn.uoknor.edu (Tiang)\nSub...,From: dsou@btma57.nohost.nodomain\nSubject: Sp...,From: jmgree01@starbase.spd.louisville.edu (Ju...,From: j_meyer@informatik.uni-kl.de (Joerg Meye...,From: rob@rjck.UUCP (Robert J.C. Kyanko)\nSubj...,From: jjd1@cbnewsg.cb.att.com (james.j.dutton)...,From: lingeke2@mentor.cc.purdue.edu (Ken Linge...,From: phil@howtek.MV.COM (Phil Hunt)\nSubject:...,From: loschen@binah.cc.brandeis.edu\nSubject: ...,From: bgrubb@dante.nmsu.edu (GRUBB)\nSubject: ...
33,33,15,"[graphic, program, terminal, benchmark, cpu, u...",From: D.L.P.Li1@lut.ac.uk (DLP Li) \nSubject: ...,From: amann@iam.unibe.ch (Stephan Amann)\nSubj...,From: lioness@maple.circa.ufl.edu\nSubject: Re...,From: aw@camcon.co.uk (Alain Waha)\nSubject: R...,From: afielden@cbnewsb.cb.att.com (andrew.j.fi...,From: timd@fenian.dell.com (Tim Deagan)\nSubje...,From: kohut1@urz.unibas.ch\nSubject: Help ! Mi...,From: egerter@gaul.csd.uwo.ca (Barry Egerter)\...,From: afielden@cbnewsb.cb.att.com (andrew.j.fi...,From: oleg@sdd.comsat.com (Oleg Roytburd)\nSub...,From: ruocco@ghost.dsi.unimi.it (sergio ruocco...,From: petro@server.uwindsor.ca (PETRO DAVID )...,From: tdawson@engin.umich.edu (Chris Herringsh...,From: jmgree01@starbase.spd.louisville.edu (Ju...,From: robert.desonia@hal9k.ann-arbor.mi.us (Ro...
43,43,15,"[25, 250, ray, network, line 25, 35, 11, chara...",From: brentw@netcom.com (Brent C. Williams)\nS...,From: timr@sco.COM (Tim Ruckle)\nSubject: Who ...,From: stovall@ficus.cs.ucla.edu (Steven Stoval...,From: delilah@next18pg2.wam.umd.edu (Romeo DeV...,From: JJMARVIN@pucc.princeton.edu\nSubject: Re...,From: andrew@idacom.hp.com (Andrew Scott)\nSub...,From: wstomv@wsinpa04.win.tue.nl (Tom Verhoeff...,From: ray@ole.cdac.com (Ray Berry)\nSubject: R...,From: wesommer@mit.edu (Bill Sommerfeld)\nSubj...,From: calzone@athena.mit.edu\nSubject: Re: Eum...,From: dkmiller@unixg.ubc.ca (Derek K. Miller)\...,From: L.H.Wood@lut.ac.uk\nSubject: An 8051 sim...,From: bks2@cbnewsi.cb.att.com (bryan.k.strouse...,From: joerg@sax.sax.de (Joerg Wunsch)\nSubject...,From: dashley@wyvern.wyvern.com (Doug Ashley)\...
45,45,14,"[player, play, lopez, catcher, defensive, soli...",From: gballent@hudson.UVic.CA (Greg Ballentine...,From: mjones@fenway.aix.kingston.ibm.com (Mike...,"From: ldo@waikato.ac.nz (Lawrence D'Oliveiro, ...",From: sheehan@aludra.usc.edu (Joseph Sheehan)\...,"From: ""Dennis G Parslow"" <p00421@psilink.com>\...",From: steph@cs.uiuc.edu (Dale Stephenson)\nSub...,From: genetic+@pitt.edu (David M. Tate)\nSubje...,From: genetic+@pitt.edu (David M. Tate)\nSubje...,From: gballent@hudson.UVic.CA (Greg Ballentine...,From: niepornt@phoenix.Princeton.EDU (David Ma...,From: pkortela@snakemail.hut.fi (Petteri Korte...,From: st902415@pip.cc.brandeis.edu (Adam Levin...,From: icop@csa.bu.edu (Antonio Pera)\nSubject:...,From: woods@ncar.ucar.edu (Greg Woods)\nSubjec...,From: steveh@thor.isc-br.com (Steve Hendricks)...
31,31,14,"[gun, crime, firearm, rate, nra, family, defen...",From: PA146008@utkvm1.utk.edu (David Veal)\nSu...,From: dtmedin@catbyte.b30.ingr.com (Dave Medin...,From: jbrown@batman.bmd.trw.com\nSubject: Re: ...,From: n9020351@henson.cc.wwu.edu (James Dougla...,From: VEAL@utkvm1.utk.edu (David Veal)\nSubjec...,From: lvc@cbnews.cb.att.com (Larry Cipriani)\n...,From: 0005111312@mcimail.com (Peter Nesbitt)\n...,From: mikey@ccwf.cc.utexas.edu (Strider)\nSubj...,From: matt@galaxy.nsc.com (Matt Freivald x8043...,From: meyers@leonardo.rtp.dg.com (Bill Meyers)...,From: jblanken@ccat.sas.upenn.edu (James R. Bl...,"From: tms@cs.umd.edu (Tom Swiss (not Swift, no...",From: timr@sco.COM (Tim Ruckle)\nSubject: Who ...,From: arc@cco.caltech.edu (Aaron Ray Clements)...,From: djohnson@cs.ucsd.edu (Darin Johnson)\nSu...


In [40]:
# Here we can see each text, the topic with the highest score for that text,
# and a repeat of the words & phrases that make up that topic
original_with_keywords.head(15)

| | text | topic_num | top_words_and_phrases |
|---|---|---|---|
| 0 | From: sera@zuma.UUCP (Serdar Argic)\nSubject: ... | 30 | [armenian, argic, armenia, serdar argic, serda... |
| 1 | From: et@teal.csn.org (Eric H. Taylor)\nSubjec... | 44 | [polygon, font, clip, algorithm, curve, postsc... |
| 2 | From: rvenkate@ux4.cso.uiuc.edu (Ravikuma Venk... | 39 | [ed, green, fast, pull, left, grateful dead, g... |
| 3 | From: d91-fad@tekn.hj.se (DANIEL FALK)\nSubjec... | 17 | [driver, mode, card, vesa, video, memory, vga,... |
| 4 | From: Aovai@qube.OCUnix.On.Ca (Aovai)\nSubject... | 3 | [drive, hard drive, hard, disk, floppy, floppy... |
| 5 | From: rob@mother.bates.edu (Rob Spellman)\nSub... | 41 | [sale, cd, tape, steven, shipping, sell, rob, ... |
| 6 | From: brentw@netcom.com (Brent C. Williams)\nS... | 43 | [25, 250, ray, network, line 25, 35, 11, chara... |
| 7 | From: frankkim@CATFISH.LCS.MIT.EDU (Frank Kim)... | 42 | [monitor, 17, buy, opinion, monitor I, monitor... |
| 8 | From: roby@chopin.udel.edu (Scott W Roby)\nSub... | 24 | [batf, communication service, organization net... |
| 9 | From: david@c-cat.UUCP (Dave)\nSubject: Re: ID... | 37 | [scsi, ide, scsi-2, controller, scsi-1, bus, c... |


In [41]:
# The trained text analysis model
model

NMF(n_components=50, random_state=42)

In [38]:
# Let's look closer at the words & phrases used
concated_topics['top_words_and_phrases'].tolist()

[['god',
  'atheist',
  'evil',
  'religion',
  'eternal',
  'authority',
  'christian',
  'bible',
  'true',
  'religious',
  'life',
  'christ',
  'existence',
  'islam',
  'atheism'],
 ['drive',
  'hard drive',
  'hard',
  'disk',
  'floppy',
  'floppy drive',
  'cartridge',
  'meg',
  'switch',
  'external',
  'scsi',
  'quantum',
  'syqu',
  'internal',
  'motherboard'],
 ['sale',
  'cd',
  'tape',
  'steven',
  'shipping',
  'sell',
  'rob',
  'inch',
  'iv',
  'offer',
  'good offer',
  'include',
  'disk',
  'college',
  'sony'],
 ['piece',
  'north carolina',
  'carolina',
  'address',
  'north carolina state',
  'organization north carolina',
  'university project eos',
  'eos',
  'state university project',
  'project eos',
  'carolina state',
  'carolina state university',
  'university project',
  'organization north',
  'company'],
 ['key',
  'chip',
  'encryption',
  'clipper',
  'escrow',
  'law enforcement',
  'clipper chip',
  'enforcement',
  'government',
  'agency'

There's lots to unpack here as we look through the initial topics identified. Some things that don't appear meaningful are: numbers, emails, `na`, line numbers, version numbers, dates, phone numbers, and smiley faces (just to name a few). Let's try adding those in as stopwords and rerunning the analysis.