# Topic modeling with Gensim

# Introduction

__Topic Modeling__ is a statistical modeling used to extract the _hidden topics_ or _structures_ from large corpora. Why are the topics _hidden_? Because they are not clearly visible or clearly stated by the author(s) of the corpus. Topic modeling is a unsupervised tecnique. 

There are some algorithms that can be used for topic modeling: LDA, PLDA, LSI. 
Here I used __LDA__ (Latent Dirichlet allocation), that is a very popular algorithm and has an implementation in __Gensim__ Python library. 

How does LDA work?
- __input__: corpus with N documents and K topics. 
- __training__: 
    - it goes through each document and randomly assigns each word in the document to one of K topics. This random assignment gives random topic representations of __all documents__ and __word distributions of all the topics__.
    - For each __document d__, compute __P( topic t | document d )__ := proportion of words in document d that are assigned to topic t
    - For each __topic t__, __P( word w | topic t )__ := proportion of assignments to topic t that come from word w (across all documents)
    - For each __word w__, reassign __topic t’__, where it chooses topic t’ with probability __P( topic t’ | word w ) = P( topic t’ | document d ) * P( word w | topic t’ )__
- output: prediction on the probability that topic t’ generated word w and the final values of P( word w | topic t ) and P( topic t | document d).

The general structure of my project follows these passages: 
- _text processing_
- _lda training_
- _evaluation_ and _visualization_ of outcomes

The performance of the topic modeling technique relies on*: 
- the _corpus_ analyzed
- its _processing_
- the _algorithm_ 
- its _parameters_

I started from two texts: 
- __Infinite Jest__(1996): an encyclopedic novel by David Foster Wallace.
- __Brown corpus__(1960): a corpus of 500 samples of English-language text, compiled by Henry Kučera and W. Nelson Francis form works published in the USA in 1961.

More precisely,

A. __Text processing__
- I cleaned them trough _tokenization_, _lemmatization_ and _filtering_. 
- I divided the texts into _documents_.


B. __LDA training__
- I prepared _dictionaries_ and _bag of words per document_ needed as input by the algorithm.
- I trained the algorithm, printing the outputs.


C. __Visualization and Evaluation of outputs and model__
- I tried to visualize the results with pyLDAvis and to evaluate them computing the _coherence_ of the model.


As I stated above*, the performance of LDA is influenced by the input (the corpus and how it is processed) and by the parameters of the algorithm. For these reasons, I tried apply some modifications at the level of: 
- _text cleaning_: 
 * I __pos-tagged__ the texts,
 * I create two new corpora composed only by __common nouns__, and by __common nouns and adjectives__,
 * I filtered words according to their __length__.
- _parameters tuning_: I tried to see how the coherence and the perplexity of the models are related to: 
 * the __number of topics__,
 * the __number of passes__.
 
I selected the modifications that gave the best results (in terms of coherence of results for every corpus), and I applied them.
Then, I tried to evaluate the new models and to visualize their outputs, in order to see whether the improvements could be visible or not. 

# A. Basic Steps

In [3]:
# here I import libraries, packages, corpora and models that I will need later
import re # to clean and process the corpora with regex
import nltk # to tokenize, lemmatize and pos-tag corpora, to use english stopwords and brown corpus
import gensim # to train and use LDA model and to compute coherence scores
import pyLDAvis # to visualize results
import pyLDAvis.gensim
import pandas as pd # to build tables
#import no_w 
#from no_w import stopw 
from nltk import word_tokenize, pos_tag
from nltk.corpus import stopwords, brown
from gensim import corpora, models
from gensim.models import CoherenceModel, LdaModel
wnl = nltk.WordNetLemmatizer()

In [18]:
# here I prepare two list of stopwords
# stopwords is a list of words from nltk corpus that are very frequent and that don't have a "real meaning".

#stopwords1 list comes form http://www.matthewjockers.net/macroanalysisbook/expanded-stopwords-list/
# this list contains more frequent and meaningless words + common proper nouns from english literature.
stopw = '0, 1, 2, 3, 4, 5, 6, 7, 8, 9, a, aaron, abbey, abbie, abdul, abe, across, abel, abigail, about, above, abraham, abram, abst, accordance, according, act, actually, ada, adah, adalberto, adaline, adam, adan, added, among, addie, adela, adelaida, adelaide, adele, adelia, adelina, adeline, adell, adella, adelle, adena, adina, adj, adolfo, adolph, adopted, adria, adrian, adriana, adriane, adrianna, adrien, adrienne, after, afterwards, afton, again, against, agatha, agnes, agnus, agueda, agustina, ahmad, ahmed, ai, aida, besides, aide, aiko, aileen, ailene, aimee, aja, akilah, al, alaina, alaine, alan, alana, alane, alanna, alayna, alba, albert, alberta, albertha, albertina, albertine, alberto, albina, alda, alden, aldo, alease, alec, alecia, e, aleen, aleisha, eg, alejandra, alejandrina, alejandro, alena, elsewhere, alene, alesha, aleshia, alesia, alessandra, aleta, aletha, everywhere, alethea, alethia, alex, alexander, alexandria, alexia, alexis, alfonso, alfonzo, alfred, alfreda, alfredia, alfredo, ali, alia, alica, alice, alicia, alida, alina, alisa, alise, had, alisha, alishia, alisia, alison, alissa, alita, alix, aliza, all, alla, allan, alleen, allegra, allen, allena, allene, allie, alline, allison, allyn, allyson, alma, hop, almeda, almeta, almost, alona, alone, along, alonso, alonzo, alpha, alphonse, alphonso, already, also, alta, altagracia, altha, althea, although, alton, alva, alvaro, alvera, alverta, alvin, alvina, always, alyce, alycia, alysa, alyse, alysha, alysia, alyson, alyssa, am, amado, amal, amalia, amanda, amber, amberly, ambrose, amee, amelia, long, america, m, ami, amie, amiee, amina, may, amira, ammie, amongst, amos, might, amparo, amy, an, ana, anabel, anamaria, anastacia, anastasia, and, andera, anderson, andra, andre, na, andrea, andreas, andres, andrew, andria, andy, anette, angel, angela, angele, angelena, angeles, angelia, angelic, angelica, angelina, angeline, angelique, angelita, angella, angelo, angelyn, angie, angila, angla, angle, anglea, anh, anibal, anika, anisha, anissa, anita, anitra, anjanette, anjelica, ann, anna, annabel, annabell, annabelle, overall, annalee, annalisa, annamae, annamaria, annamarie, anne, anneliese, annelle, annemarie, put, annetta, annice, annie, rather, annika, annis, annmarie, announce, another, answered, anthony, antione, antionette, antoine, anton, antone, antonetta, antonette, antonia, antonietta, antonina, antonio, antony, antwan, any, anya, anyhow, anyone, anything, anywhere, apolonia, april, apryl, ara, should, araceli, aracelis, aracely, arcelia, archie, ardath, ardelia, ardell, ardella, ardelle, arden, ardis, ardith, are, aren, arent, aretha, argelia, argentina, ariana, arianna, arianne, arica, arie, ariel, arielle, arla, arlean, arleen, arlen, there, arlena, arlene, arletha, arletta, arlette, arlie, arlinda, arline, arlyne, armand, armanda, they, armandina, armando, armida, arminda, arnetta, arnette, arnita, arnold, arnoldo, arnulfo, around, arron, three, art, arthur, artie, arturo, arvilla, as, asa, asha, ashanti, ashely, ashlea, toward, ashlee, ashleigh, ashley, ashli, ashlie, ashly, ashlyn, asia, ask, unlikely, asked, asley, assunta, astrid, ups, at, athena, aubrey, audie, audra, audrea, audrey, audria, audrie, audry, august, augusta, augustina, augustine, augustus, aundrea, aura, aurea, aurelia, aurelio, aurora, aurore, austin, auth, autumn, ava, available, avelina, avery, avis, avril, awilda, ayako, ayana, ayanna, ayesha, azalee, azucena, azzie, b, babara, babette, back, bailey, bambi, bao, barabara, barb, barbara, barbera, barbie, barbra, bari, barney, barrett, barrie, bart, basil, would, basilia, be, bea, beata, beatrice, beatris, beatriz, beau, beaulah, bebe, became, because, becki, becky, become, becomes, becoming, bee, abby, been, before, beforehand, begin, beginning, behind, being, belen, belia, belinda, belkis, bell, bella, belle, below, ben, benedict, benita, benito, benjamin, bennett, benny, benton, berenice, berna, bernadette, bernadine, bernard, bernarda, bernardina, bernardine, bernardo, adrianne, berneice, adriene, bernice, bernie, berniece, bernita, bert, agripina, berta, agustin, bertha, bertie, bertram, beryl, beside, bess, bessie, best, beth, bethanie, aisha, bethann, akiko, bethany, bethel, betsey, bette, better, bettie, bettina, betty, bettyann, bettye, between, beula, beulah, bev, beverlee, beverley, beverly, beyond, bianca, bibi, bill, billie, billy, billye, aleida, birdie, birgit, blaine, blair, blake, blanca, blanch, blanche, blondell, blossom, blythe, bo, bobbi, bobbie, bobbye, alexa, bok, alexandra, bong, bonita, bonnie, bonny, booker, boris, both, boyce, boyd, brad, bradford, bradley, bradly, brady, brain, branda, aline, brande, brandee, branden, brandi, brandon, brandy, brant, breana, breann, breanna, breanne, bree, brenda, brendan, brendon, brenna, brent, brenton, bret, brett, brian, briana, brianne, brice, bridget, bridgett, bridgette, brigette, brigid, brigida, brigitte, brinda, britany, britney, britni, britt, britta, brittaney, brittani, brittanie, britteny, brittni, brittny, brock, broderick, bronwyn, brook, brooke, brooks, bruce, amada, bruna, brunilda, bruno, bryan, bryanna, bryant, bryce, brynn, bryon, bud, buddy, buena, buffy, buford, bula, bunny, burl, burma, burt, burton, buster, but, analisa, by, byron, c, ca, caitlin, caitlyn, calandra, caleb, calista, andree, callie, calvin, camelia, camellia, cameron, camie, camila, camilla, camille, cammy, can, candace, candance, angelika, candelaria, candi, candice, candida, candis, candy, candyce, cannot, cant, caprice, caption, cara, caren, carey, cari, anisa, caridad, carie, carina, carisa, anja, carissa, carita, carl, carla, carlee, carleen, carlena, carlene, carletta, carley, carli, carline, carlita, carlo, carlos, carlota, annett, carlotta, annette, carlton, carly, carlyn, carma, annita, carman, carmel, carmela, carmelia, carmelina, antoinette, carmelita, carmella, carmelo, carmen, carmina, carmine, carmon, carol, carola, carolann, carole, carolee, carolin, caroline, caroll, carolyn, carolyne, carolynn, caron, caroyln, carri, carrie, carrol, carry, carson, cary, caryl, carylon, caryn, casandra, casey, casie, ariane, casimira, cassandra, cassaundra, cassey, cassidy, cassie, cassondra, cassy, catalina, catarina, caterina, catharine, catherin, catherina, catherine, cathern, catheryn, cathey, cathi, cathie, cathleen, cathrine, cathryn, cathy, catina, catrice, catrina, cayla, cecil, cecila, cecile, aron, cecilia, cecille, cecily, cedric, cedrick, celena, celesta, celeste, celestina, celestine, celia, celina, celinda, celine, celsa, ceola, cesar, chad, ashton, chadwick, chae, chan, chana, asuncion, chance, chanda, chandra, chanel, chanell, chanelle, chang, chantal, chantay, chante, chantel, chantell, chantelle, chapter, chara, charis, charise, charissa, charisse, charita, charity, charla, charlena, charlene, charles, charlesetta, charlette, charley, charlie, charline, charlott, charlotte, charlsie, charlyn, charmaine, charolette, chase, chasidy, chasity, chastity, chau, chauncey, chaya, barbar, chelsea, chelsey, chelsie, cher, chere, cheree, cherelle, cheri, barry, cherie, barton, cherilyn, cherish, cherlyn, cherri, cherrie, cherry, cherryl, chery, cheryl, cheryle, cheryll, beckie, chester, chet, cheyenne, chi, chia, chieko, chin, china, ching, belva, chiquita, chloe, chong, chris, chrissy, christa, bennie, christal, christeen, christel, christen, christene, christi, christia, christian, christiana, christiane, christie, christin, bernetta, christina, christine, christinia, christoper, berry, christopher, christy, chrystal, chu, chuck, chun, chung, cicely, ciera, cierra, cinda, cinderella, cindi, cindy, betsy, cinthia, cira, clair, clara, clare, clarence, claretha, claretta, claribel, clarice, clarine, claris, clarisa, clarissa, clarita, billi, clark, classie, claud, claude, claudette, claudia, claudie, claudine, claudio, clay, clayton, clemencia, clement, clemente, clementina, bob, clementine, clemmie, bobby, cleo, bobette, cleopatra, cleora, cleotilde, cleta, cletus, cleveland, cliff, clifford, clifton, clint, clinton, clora, clorinda, clotilde, clyde, co, codi, cody, colby, coleen, brandie, coleman, colene, coletta, colette, colin, colleen, collen, collene, collette, collin, columbus, come, concepcion, conception, concetta, concha, conchita, connie, brianna, conrad, constance, consuela, consuelo, contessa, cora, coral, coralee, coralie, corazon, cordelia, cordell, cordia, cordie, coreen, corene, coretta, corey, brittany, corie, brittney, corina, corine, corinna, corinne, corliss, cornelia, cornelius, cornell, corrie, corrin, corrina, corrine, corrinne, cortez, cortney, could, couldnt, courtney, buck, coy, craig, creola, cried, cris, criselda, bulah, crissy, crista, cristal, cristen, cristi, cristie, cristin, cristina, cristine, cristobal, cristopher, cristy, cruz, crysta, crystal, crystle, cuc, cami, curt, curtis, cyndi, cyndy, cammie, cynthia, cyril, cyrstal, cyrus, cythia, d, dacia, candie, dagmar, candra, dagny, dahlia, daina, daine, daisey, daisy, dakota, dale, dalene, carin, dalia, dalila, dallas, dalton, damaris, damian, damien, damion, damon, dan, dana, danae, dane, carlie, danelle, danette, dani, danial, danica, daniel, daniela, daniele, daniell, daniella, danielle, danika, danille, danilo, danita, dann, danna, dannette, dannie, dannielle, danny, danuta, danyel, danyell, danyelle, daphine, dara, darby, carolina, darcel, darcey, darci, darcie, darcy, darell, daren, daria, darin, dario, carroll, darius, darla, carter, darleen, darlena, darlene, darline, darnell, daron, darrel, darrell, darren, darrick, darrin, cassi, darron, darryl, darwin, daryl, date, dave, david, davida, davina, davis, dawn, dawna, dawne, dayle, dayna, daysi, deadra, dean, deana, deandra, deandre, deandrea, deane, deangelo, cecelia, deann, deanna, deanne, deb, debbi, debbie, debbra, debby, debera, debi, debora, deborah, debra, debrah, debroah, dede, dedra, dee, deeann, deeanna, deedee, deena, deetta, deidra, deidre, deirdre, deja, del, delana, delbert, delcie, delena, delfina, delia, delicia, delila, delilah, delinda, delisa, dell, della, delma, delmar, delmer, delmy, delois, charleen, deloise, delora, deloras, delores, deloris, delorse, delpha, delphia, delphine, delsie, delta, demarcus, charmain, demetra, demetria, chas, demetrice, demetrius, dena, chassidy, denae, deneen, denese, denice, denis, denise, denisha, denita, denna, dennis, dennise, denny, denver, denyse, cherise, deon, cherly, deonna, derek, derick, derrick, deshawn, desirae, desire, desiree, despina, dessie, destiny, detra, devin, devon, devona, devora, devorah, dewayne, dewey, dewitt, dexter, dia, diamond, dian, diana, diane, diann, dianna, christena, dianne, dick, did, didnt, diedra, diedre, diego, dierdre, digna, dillon, dimple, dina, dinah, dino, dinorah, dion, dione, dionna, dionne, ciara, dirk, divina, dixie, do, dodie, does, cindie, doesnt, dollie, dolly, dolores, claire, doloris, domenic, domenica, dominga, domingo, dominic, dominica, clarinda, dominick, dominique, dominque, domitila, domonique, don, dona, donald, donella, donetta, donette, dong, donita, donn, donna, donnell, clelia, donnetta, donnette, donnie, donny, donovan, dont, donte, donya, dora, dorathy, dorcas, doreatha, doreen, dorene, doretha, dorethea, doretta, dori, doria, dorian, dorie, dorinda, dorine, doris, dorla, cole, dorotha, dorothea, dorothy, dorris, dorsey, dortha, dorthea, dorthey, dorthy, dot, dotty, colton, doug, douglas, douglass, dovie, down, doyle, dreama, drew, drucilla, duane, dudley, dulcie, dung, during, dusti, dustin, dusty, dwana, dwayne, dwight, dylan, each, earl, earle, earlean, cori, earleen, earlene, earlie, earline, earnest, earnestine, eartha, easter, eboni, ebonie, ebony, echo, ed, edda, eddie, cory, eddy, edelmira, eden, edgardo, edie, edith, edmond, edmund, edmundo, edna, edra, edris, eduardo, edward, edwardo, edwin, edyth, edythe, effie, efrain, efren, ehtel, eight, eighty, eilene, either, ela, eladia, elaina, elaine, elana, elane, elanor, elayne, elba, elda, elden, eldon, eldora, eldridge, eleanor, eleanora, eleanore, elease, elena, elene, eleni, elenor, elenora, eleonor, eleonora, eleonore, elfreda, elfrieda, elfriede, eli, elia, eliana, elias, dania, elicia, elida, elidia, elijah, elina, elinor, elinore, elisa, elisabeth, elise, eliseo, elisha, elissa, eliz, eliza, elizabeth, elizbeth, elizebeth, dante, elke, ella, ellamae, ellan, ellen, daphne, ellena, elli, ellie, elliot, elliott, ellis, ellsworth, elly, ellyn, elma, elmer, elmira, elmo, elna, elnora, elodia, elois, eloisa, eloise, elouise, eloy, elroy, elsa, else, elsie, elsy, elton, elva, elvera, elvia, elvie, elvina, elvira, elvis, elwanda, elwood, elyse, elza, ema, emanuel, emelda, emelia, emelina, emeline, emely, emerald, emerita, emerson, emery, emiko, emil, emile, emilee, emilia, emilie, emilio, emily, emma, emmaline, emmanuel, emmie, emmitt, emmy, emogene, emory, ena, enda, enedina, deedra, enid, enoch, enola, enough, enrique, enriqueta, epifania, delaine, era, erasmo, eric, erica, erich, erick, ericka, erik, erika, erin, erinn, erlene, erlinda, erline, erma, ermelinda, erminia, erna, ernest, ernestina, ernestine, ernesto, ernie, errol, ervin, erwin, eryn, esmeralda, esperanza, essie, esteban, estefana, estela, estell, estella, estelle, ester, esther, estrella, etc, etha, ethan, denisse, ethel, ethelene, ethelyn, ethyl, etsuko, etta, ettie, eufemia, eugena, eugene, eugenia, eugenie, eugenio, eula, eulah, eulalia, desmond, eun, euna, eunice, eura, eusebia, eusebio, eustolia, evalyn, evan, evangelina, evangeline, eve, evelia, evelin, evelina, eveline, evelyn, evelyne, evelynn, even, ever, everett, everette, every, everyone, everything, evette, evia, evie, evita, evon, evonne, ewa, except, exie, ezekiel, ezequiel, ezra, f, fabian, fabiola, fae, fairy, faith, fallon, fannie, fanny, far, farah, farrah, fatima, fatimah, faustina, faustino, fausto, fawn, fay, faye, fe, felecia, felica, felice, felicia, felicidad, felicita, felicitas, felipa, felipe, felisa, felisha, felix, felton, ferdinand, fermin, fermina, fern, fernanda, fernande, fernando, ferne, few, fidel, fidela, fidelia, fifty, filiberto, filomena, fiona, first, five, flavia, fleta, fletcher, flo, flor, flora, florance, florence, florencia, florencio, florene, dottie, florentina, florentino, floretta, floria, florida, florinda, florine, drema, florrie, flossie, drusilla, floy, floyd, dulce, fonda, duncan, for, forest, former, formerly, dwain, forrest, forty, foster, dyan, found, four, fran, france, francene, frances, francesca, francesco, franchesca, francie, francina, francine, francis, francisca, francisco, francoise, frank, eda, frankie, franklyn, fransisca, fred, freda, edgar, fredda, freddie, edison, freddy, frederic, frederica, frederick, fredericka, fredia, fredric, fredrick, fredricka, freeda, freeman, edwina, freida, frida, frieda, fritz, from, fumiko, eileen, further, g, gabriel, gabriela, gabriele, gabriella, gabrielle, gail, gala, gale, elbert, galen, galina, garfield, garland, garnet, garnett, garret, garrett, garry, gary, gaston, gavin, gay, gaye, elenore, gayla, gayle, gaylene, gaylord, gaynell, gearldine, gema, gemma, gena, gene, genesis, geneva, genevie, genevieve, elin, genevive, genia, genie, genna, gennie, genny, genoveva, geoffrey, georgann, george, georgeann, elizabet, georgeanna, georgene, georgetta, georgette, georgia, georgiana, georgiann, georgianne, georgina, georgine, gerald, geraldine, geraldo, geralyn, gerard, gerardo, gerda, geri, german, gerri, gerry, gertie, gertrude, gertrudis, get, ghislaine, gia, gianna, gidget, gigi, gilbert, gilberte, gilberto, gilda, gillian, gilma, gina, ginette, ginger, elvin, ginny, gino, giovanna, giovanni, gisela, gisele, giselle, gita, giuseppe, giuseppina, gladis, glady, gladys, glayds, glen, glenda, glendora, glenn, glenna, glennie, glennis, glinda, gloria, glory, glynda, glynis, go, golda, golden, emmett, goldie, gonzalo, good, gordon, got, grace, gracia, gracie, eneida, graciela, grady, graham, grant, granville, grayce, grazyna, great, gregg, gregoria, gregorio, gregory, greta, gretchen, gretta, gricelda, grisel, griselda, guadalupe, gudrun, guillermina, guillermo, gus, gussie, gustavo, guy, gwen, gwenda, gwendolyn, gwenn, gwyn, gwyneth, h, ha, hae, hai, esta, hailey, hal, haley, halina, halley, hallie, han, hang, hanh, hank, hanna, hannah, hannelore, hans, harlan, harland, harley, harmony, harold, harriet, harriett, harriette, harris, harrison, harry, harvey, has, hasnt, hassan, hassie, hattie, have, havent, haydee, eva, hayden, hayley, haywood, hazel, he, heath, heather, hector, hed, hedwig, hedy, heide, heidi, heidy, heike, helaine, helen, helena, helene, helga, hellen, hence, henrietta, henriette, henry, her, herb, herbert, here, hereafter, hereby, herein, heres, hereupon, herlinda, herma, herman, hermelinda, hermina, hermine, faviola, herminia, hers, herschel, herself, federico, hershel, hertha, hes, hester, hettie, hid, hien, hilaria, hilario, hilary, hilda, hildegard, hildred, hillary, hilma, hilton, him, himself, hipolito, hiroko, his, hoa, hobert, holley, holli, hollie, holly, home, homer, honey, hong, hope, horace, horacio, hortencia, hortense, hortensia, hosea, houston, how, howard, however, hoyt, hubert, huey, hugh, hui, hulda, humberto, hundred, hung, hunter, huong, hwa, hye, hyman, hyo, hyon, i, ian, id, ida, idalia, idell, idella, ie, iesha, if, ignacio, franklin, ike, ilana, ileana, ileen, ilene, iliana, ill, illa, ilona, ilse, iluminada, im, ima, imelda, imogene, in, inc, include, includes, indeed, index, india, indira, inell, ines, inez, information, inga, inge, ingeborg, inger, ingrid, inocencia, instead, internet, into, iola, iona, garth, ione, iraida, irena, irene, iris, irish, irma, irmgard, irvin, irving, gaynelle, is, isa, isaac, isabel, genaro, isabell, isabella, isadora, isaiah, isaias, isaura, isela, isiah, isidra, isidro, isis, ismael, isnt, isobel, israel, isreal, issac, it, its, itself, ivan, ivana, ive, georgianna, ivelisse, georgie, ivey, ivonne, ivory, ivy, izetta, izola, j, ja, jacalyn, jacelyn, germaine, jacinda, jacinta, jacinto, gertha, jackeline, gertrud, jackelyn, jacki, gertude, jackie, jacklyn, jackqueline, jackson, jaclyn, gil, jacob, jacqualine, jacque, jacquelin, jacqueline, jacquelyne, jacquelynn, jacques, jacquetta, jacqui, jacquie, jacquiline, jacquline, jacqulyn, jada, jadwiga, jae, jaime, jaimee, jaimie, jake, jaleesa, jalisa, jama, jamaal, jamal, jamar, jame, jamee, jamel, james, jamey, jami, jamie, jamika, jamila, jamison, jammie, jan, jana, janae, janay, jane, janean, janee, janeen, graig, janel, janell, janella, janelle, greg, janene, janessa, janet, janeth, janett, janetta, janette, janey, jani, janice, grover, janie, janina, janine, janis, janise, janita, jann, janna, jannet, jannette, jannie, january, janyce, jaqueline, jaquelyn, jared, jarod, jarred, jarrett, jarrod, jarvis, jasmin, jason, jasper, hana, jaunita, javier, jay, jaye, jayme, jaymie, jayna, jayne, jayson, jazmin, jazmine, jc, jean, jeana, jeane, jeanelle, jeanene, jeanett, jeanetta, jeanette, jeanice, jeanie, jeanine, jeanmarie, jeanna, jeanne, jeannetta, jeannette, jeannie, jeannine, jed, jeff, hee, jefferey, jefferson, jeffery, jeffie, jeffrey, jeffry, jen, jena, jenae, jene, jenee, jenell, jenelle, jenette, jeneva, heriberto, jeni, jenice, jenifer, jeniffer, hermila, jenine, jenise, jenna, jennefer, jennell, herta, jennette, jenni, jennie, hiedi, jennifer, jenniffer, jennine, jerald, jeraldine, hilde, jeramy, hildegarde, jere, jeremiah, jeri, jerica, jerilyn, hiram, jermaine, hisako, jerold, jerome, jeromy, jerrell, jerri, hollis, jerrica, jerrie, jerrod, jerrold, jerry, jesica, jess, jesse, jessenia, jessi, jessia, jessica, jessika, jestine, hsiu, jesus, hue, jesusa, jesusita, hugo, jetta, jettie, jewel, jewell, ji, jill, jillian, hyacinth, jim, jimmie, jin, jina, hyun, jinny, jo, joan, joana, joane, joann, ignacia, joanna, joanne, ila, joannie, ilda, joaquin, joaquina, jocelyn, jodee, jodi, jodie, jody, joeann, joel, joella, joelle, joellen, ina, joesph, joetta, joette, joey, johana, johanna, johanne, john, johna, johnathan, johnathon, johnetta, johnie, johnna, ira, johnnie, johnny, johnsie, irina, johnson, joi, joie, jolanda, joleen, jolene, irwin, jolie, joline, jolyn, jolynn, jon, isabelle, jona, jonah, jonas, jonathon, jone, jonell, jonelle, jong, joni, jonie, jonna, jonnie, jordan, jordon, iva, jorge, jose, josef, ivette, josefa, josefina, josefine, joselyn, joseph, josephina, josephine, josette, josh, joshua, josiah, josie, jack, joslyn, jospeh, josphine, josue, jovan, joy, joya, joyce, joycelyn, joye, juan, juana, juanita, jacquelyn, jude, judi, judie, judith, judson, judy, julee, julene, jules, juli, jade, julia, julian, juliana, juliane, juliann, julianna, julianne, julie, julieann, julienne, juliet, julieta, julietta, juliette, julio, julissa, julius, june, jung, junie, junior, junita, junko, just, justa, justina, justine, jutta, k, ka, kacey, kaci, kacie, kacy, kai, kaila, kaitlin, kaitlyn, kala, kaleigh, kaley, kali, kallie, kalyn, kam, kamala, janiece, kami, kamilah, kandace, kandice, kandis, kandra, kandy, kanesha, kanisha, kara, karan, kareem, kareen, karen, karena, karey, kari, karie, karima, karin, karina, jasmine, karine, karisa, karissa, karl, karla, karleen, karlene, karly, karlyn, karma, karmen, karol, karole, karoline, karolyn, karon, karren, karri, karrie, karry, kary, karyl, karyn, kasandra, kasha, kasi, kassandra, kassie, kate, katelin, katelyn, katelynn, katerine, kathaleen, katharina, katharine, katharyn, kathe, katheleen, katherin, katherine, kathern, kathey, kathi, kathie, kathleen, kathlene, kathline, kathlyn, kathrin, kathrine, kathryn, kathryne, kathy, kathyrn, kati, katia, katie, katlyn, katrice, katrina, kattie, katy, kay, jenny, kayce, kaycee, kaye, kayla, kaylee, jeremy, kayleen, kayleigh, kaylene, jerlene, kazuko, kecia, keeley, keely, keena, keesha, keiko, keila, keira, keisha, keith, jesenia, keitha, keli, kelle, kelley, kelli, kellie, kelly, jessie, kellye, kelsey, kelsi, kelsie, kelvin, kemberly, ken, kena, kenda, kendal, kendall, kendra, kendrick, keneth, jimmy, kenia, kenisha, kenna, kenneth, kenny, kent, kenton, joanie, kenyatta, kenyetta, kera, keren, keri, kermit, kerri, kerrie, kerry, kerstin, kesha, joe, keshia, keturah, keva, keven, kevin, khadijah, khalilah, kia, kiana, kiara, kiera, kiersten, kiesha, kieth, kim, kimber, kimberlee, johnette, kimberley, kimberlie, kimberly, kimbery, kimbra, kimi, kimiko, kina, kindra, king, kira, kirby, kirk, kirsten, kirstie, kirstin, kisha, kit, kittie, jonathan, kitty, kiyoko, kizzie, kizzy, klara, know, korey, kori, kortney, kory, kourtney, kraig, kris, krishna, krissy, krista, kristal, kristan, kristeen, kristel, kristen, kristi, kristian, kristie, kristin, kristina, kristine, kristofer, kristy, kristyn, krysta, jovita, krystal, krysten, krystin, krystina, krystle, krystyna, kum, kurt, kurtis, kyla, kyle, kylee, kylie, kym, jule, kymberly, kyoko, kyong, kyra, kyung, l, lacey, lachelle, laci, lacie, lacresha, lacy, ladawn, ladonna, lady, lael, lahoma, lai, laine, lajuana, lakeesha, lakeisha, lakendra, lakenya, lakesha, lakeshia, lakia, lakiesha, justin, lakisha, lakita, lala, lamonica, lamont, lan, lana, lance, landon, lane, lanell, lanelle, lanette, lani, lanie, lanita, lannie, lanny, lanora, laquanda, laquita, lara, larae, kandi, laraine, laree, larhonda, larisa, larissa, larita, laronda, larraine, larry, larue, lasandra, lashanda, lashandra, lashaun, lashaunda, lashawn, lashawna, lashay, lashell, lashon, lashonda, lashunda, last, latanya, latasha, latashia, later, latesha, latia, laticia, latina, latisha, latonia, latonya, latoria, latosha, latoya, latoyia, latrice, latricia, latrina, latrisha, kasey, latter, latterly, kasie, launa, laura, lauralee, lauran, laure, laureen, laurel, lauren, laurena, laurence, laurene, lauretta, laurette, lauri, katherina, laurice, laurie, katheryn, laurinda, laurine, lauryn, lavada, lavelle, lavenia, lavera, lavern, laverna, laverne, laveta, lavette, lavinia, lavon, lavona, lavonda, katina, lavone, lavonia, lavonna, lawana, lawanda, lawanna, lawerence, lawrence, layla, layne, lazaro, le, lea, leah, lean, leana, leandra, leandro, leann, keenan, leanna, leanne, leanora, least, leatha, leatrice, lecia, leda, leeann, kellee, leeanna, leeanne, leena, leesa, left, leia, leida, leif, leigh, leigha, leighann, leila, leilani, leisa, leisha, lekisha, lela, lelah, leland, lelia, lemuel, len, kennith, lena, lenard, lenita, kenya, lenna, lennie, lenny, lenora, lenore, leo, leola, leoma, leon, leona, leonard, leonarda, leonardo, leone, leonel, leonia, leonida, leonie, leonila, leonor, leonora, leonore, leontine, leopoldo, leora, kiley, leota, lera, kimberely, leroy, les, kimberli, lesa, lesha, lesia, leslee, lesley, lesli, leslie, less, lessie, kip, lester, let, leta, letha, leticia, letisha, letitia, lets, lettie, letty, levi, lewis, lezlie, li, lia, liana, liane, lianne, libbie, libby, liberty, librada, lida, lidia, lien, lieselotte, ligia, like, likely, lila, lili, lilia, lilian, liliana, lilla, kristle, lilli, kristopher, lillia, lilliam, lillian, lilliana, lillie, lilly, lily, lin, lina, lincoln, linda, lindsay, lindsey, lindsy, lindy, line, linette, ling, linh, links, linn, linnea, linnie, lino, linsey, linwood, lionel, lisa, lisabeth, lisandra, lisbeth, lise, lisette, lisha, laila, lissa, lissette, lita, livia, liz, liza, lizabeth, lizbeth, lizeth, lizette, lizzette, lizzie, ll, lamar, lloyd, loan, logan, loida, lois, lola, lolita, loma, lon, lona, lang, londa, loni, lonna, lonnie, lonny, lora, loraine, loralee, lore, lorean, loree, loreen, lorelei, loren, lorena, lorene, lorenza, lorenzo, loreta, loretta, lorette, lori, loria, loriann, lorie, lorilee, lashawnda, lorinda, loris, lorita, lorna, lorraine, lasonya, lorretta, latarsha, lorri, lorriane, lorrie, lorrine, lory, lottie, lou, louanne, louella, louetta, louie, louis, louisa, louise, loura, lourdes, lourie, love, lovella, lovetta, lovie, lowell, loyce, loyd, ltd, lu, luana, luann, luanna, luanne, luba, lucas, luci, lucia, luciano, lucie, lucien, lucienne, lucila, lucilla, lucille, lucina, lucio, lucius, lucrecia, lavina, lucretia, lucy, ludie, ludivina, luella, luetta, luigi, lavonne, luis, luisa, luise, luke, lula, lulu, luna, lupe, lupita, lura, lurlene, lurline, luther, luvenia, luz, lyda, lydia, lyla, lyle, lyman, lyn, lynda, lyndia, lee, lyndon, lyndsay, lyndsey, lynell, lynelle, lynette, lynn, lynna, lynne, lynnette, lynsey, lynwood, mabel, mabelle, mable, mac, machelle, macie, mack, mackenzie, macy, madalene, madaline, madalyn, maddie, made, madelaine, madeleine, madelene, madeline, madelyn, madge, madie, madison, madlyn, madonna, mae, maegan, mafalda, magali, magaly, magan, magaret, magda, magdalen, magdalena, magdalene, magen, maggie, magnolia, mahalia, mai, maia, maida, maile, maira, maire, maisha, maisie, major, majorie, make, makeda, makes, malcolm, malcom, malena, malia, malik, malika, malinda, malisa, lexie, malissa, malka, mallie, mallory, malorie, mamie, mammie, man, mana, manda, mandi, mandie, mandy, manie, manual, manuel, manuela, many, maple, mara, maragaret, maragret, maranda, marc, marcel, marcela, marcelina, marceline, marcelino, marcell, marcella, marcelle, marcellus, marcelo, marcene, marchelle, marcia, marcie, marcos, marcus, marcy, maren, marg, margareta, margarett, margaretta, margarette, margarita, margarite, margarito, margart, margene, margeret, margert, margery, marget, margherita, margie, margit, margo, margorie, margret, margrett, marguerita, marguerite, margurite, margy, marhta, mari, maria, loise, mariah, mariam, marian, mariana, marianela, mariann, marianna, marianne, mariano, maribel, maribeth, marica, maricela, maricruz, marie, mariel, mariela, mariella, marietta, mariette, mariko, marilee, marilou, marilu, marilyn, marilynn, marin, marina, marinda, marine, mario, marion, lorina, maris, lorine, marisa, marisela, marisha, marisol, marissa, marita, maritza, marivel, marjory, mark, marketta, markita, louann, markus, marla, marlana, marleen, marlen, marlena, marlene, marlin, marline, marlo, louvenia, marlon, marlyn, marlys, marna, marni, marnie, marquerite, marquetta, marquis, marquita, marquitta, marry, marsha, marshall, marth, martha, luciana, marti, martin, martina, martine, marty, lucile, marva, marvel, marvella, lucinda, marvin, marvis, marx, mary, marya, maryalice, maryam, lue, maryann, maryanna, maryanne, marybelle, marybeth, maryellen, maryetta, maryjane, maryjo, maryland, marylee, marylin, maryln, marylou, marylouise, marylyn, marylynn, maryrose, masako, mason, matha, mathew, mathilda, mathilde, matilda, matilde, matt, matthew, mattie, maud, maude, lynetta, maudie, maura, maureen, maurice, mauricio, maurine, maurita, ma, mauro, mavis, max, maxie, maxima, maximina, maximo, maxine, maxwell, maya, maybe, maybell, maybelle, maye, mayme, maynard, mayola, mayra, mazie, mckenzie, mckinley, me, meagan, meaghan, meantime, meanwhile, mechelle, meda, mee, meg, megan, meggan, meghan, meghann, mei, mel, melaine, melani, melania, melanie, melany, melba, melda, melia, melida, melina, melinda, melisa, melissa, melissia, melita, mellie, mellisa, mellissa, melodee, melodi, melodie, melody, melonie, melony, melva, malvina, melvin, melvina, mendy, mercedes, mercedez, mercy, meredith, meri, merideth, meridith, merilyn, merissa, merle, mao, merlene, merlin, merlyn, merna, merri, merrie, merrilee, merrill, marcelene, merry, mertie, mervin, meryl, meta, mi, mia, mica, micaela, micah, marci, michael, michaela, marco, michaele, michal, michale, mardell, micheal, michel, margaret, michele, margarete, michelina, micheline, michell, michelle, michiko, mickey, micki, marge, mickie, miesha, migdalia, mignon, miguelina, mika, mikaela, mike, mikel, miki, margot, mikki, milagro, milagros, milan, milda, mildred, miles, milford, milissa, millard, millie, million, milly, milo, milton, mimi, min, mina, minda, mindi, mindy, minerva, ming, minh, minna, minnie, minta, marielle, miquel, mira, miranda, mireille, mireya, miriam, mirian, mirna, mirta, mirtha, misha, miss, missy, misti, mistie, misty, mitch, mitchel, mitchell, mitsue, mitsuko, mittie, marjorie, mitzi, mitzie, miyoko, modesto, mohamed, mohammad, mohammed, moira, mollie, molly, moment, mona, monet, monica, monika, monique, monnie, monroe, monserrate, monte, monty, moon, mora, more, moreover, morgan, moriah, morris, marta, morton, mose, moses, moshe, most, mostly, mozell, mozella, mozelle, mr, mrs, much, mui, muoi, murray, must, my, myesha, myles, myong, myra, myrl, myrle, myrna, myron, myrta, myrtie, myrtle, myself, myung, n, nada, nadene, nadia, naida, nakesha, nakia, nakisha, nakita, nam, namely, nan, nana, nancee, nancey, nanci, nancie, nancy, nanette, nannette, nannie, naoma, naomi, napoleon, narcisa, natacha, natalia, natalie, natalya, natasha, natashia, nathalie, nathanael, nathanial, natisha, natividad, natosha, neal, near, necole, ned, neda, nedra, neely, neida, neil, neither, nelda, nelia, nelida, nell, nella, nelle, nellie, nelly, nelson, nena, nenita, neoma, neomi, nereida, nerissa, nery, nestor, neta, nettie, neva, nevada, never, nevertheless, neville, new, newton, next, nga, ngan, ngoc, nguyet, nia, nichelle, nichol, nicholas, nichole, melynda, nicholle, nick, nicki, nickie, nickole, nicky, nicol, nicola, nicolas, nicolasa, nicole, nicolette, nicolle, nida, nidia, niesha, nieves, nigel, niki, nikia, nikita, nikki, nikole, nila, nilda, nilsa, nina, nine, ninety, micha, ninfa, nisha, nita, no, noah, noble, nobody, nobuko, noe, noel, noelia, noella, noelle, noemi, nola, nolan, noma, nona, none, miguel, nonetheless, noone, nor, nora, norah, norbert, norberto, mila, noreen, noriko, norine, norma, norman, normand, not, nothing, nova, millicent, novella, now, nowhere, nu, nubia, numbers, nydia, nyla, o, obdulia, ocie, octavia, octavio, oda, odelia, odell, odessa, odette, odilia, odis, mirella, of, ofelia, off, often, oh, ola, olen, olene, oleta, olevia, olga, olimpia, olin, olinda, oliva, olive, oliver, olivia, ollie, olympia, oma, modesta, omar, omega, omer, omitted, on, moises, ona, once, one, oneida, ones, onie, onita, only, onto, opal, ophelia, or, oralee, oralia, ord, oren, oretha, orlando, orpha, orval, orville, oscar, ossie, osvaldo, oswaldo, otelia, muriel, otha, other, others, otherwise, otilia, otis, myriam, otto, ouida, our, ours, ourselves, myrtice, out, myrtis, over, owen, own, ozell, ozella, ozie, nadine, p, pa, pablo, page, pages, paige, palma, palmer, palmira, pamala, pamela, pamelia, pamella, pamila, pamula, pansy, paola, paris, parker, part, parthenia, particia, pasquale, pasty, pat, patience, nathan, patria, patrica, nathaniel, patrice, patricia, patrina, patsy, patti, pattie, patty, paula, paulene, paulette, pauline, paulita, paz, pearl, pearle, pearlene, pearlie, pearline, pearly, pedro, peg, peggie, peggy, pei, penelope, penney, penni, pennie, penny, per, percy, perhaps, perla, perry, pete, peter, petra, petrina, petronila, phebe, phil, philip, phillip, phillis, philomena, phung, phuong, nickolas, phylicia, phylis, phyliss, pia, piedad, pierre, ping, pinkie, piper, pok, polly, porfirio, porsche, porsha, porter, portia, pp, precious, preston, pricilla, prince, princess, priscila, priscilla, proud, providencia, prudence, q, qiana, queen, queenie, quentin, quiana, quincy, quinn, nohemi, quintin, quinton, quyen, r, rachael, rachal, racheal, rachel, rachele, norene, rachell, rachelle, racquel, rae, raeann, norris, raelene, rafael, rafaela, raguel, raina, raisa, raleigh, ralph, ramiro, ramon, ramona, ramonita, ran, rana, ranae, randa, randal, randall, randee, ok, randi, randolph, ranee, raphael, raquel, rashad, rasheeda, rashida, raul, raven, ray, raye, rayford, raymon, raymond, raymonde, raymundo, rayna, re, rea, reagan, reatha, reba, rebbeca, rebbecca, ora, rebeca, rebecca, rebecka, rebekah, recent, recently, reed, reena, ref, refs, refugia, refugio, regan, regena, regenia, reggie, regina, reginald, regine, reginia, reid, reiko, reina, reinaldo, reita, related, rema, remedios, remona, pam, rena, renaldo, renata, renate, renato, renay, pandora, renda, renea, renetta, renita, replied, research, ressie, reta, retha, retta, reuben, reva, rex, reyes, patrick, reyna, reynaldo, rhea, rheba, rhett, paul, rhiannon, rhoda, pauletta, rhona, paulina, rhonda, ria, ricardo, rich, richard, richelle, richie, rick, rickey, ricki, rickie, ricky, rico, rigoberto, rikki, riley, rima, rina, risa, rita, rivka, robbi, robbie, robby, robbyn, robena, robert, roberta, roberto, robin, robt, robyn, rocco, phoebe, rochel, rochell, rochelle, rocio, rocky, phyllis, rod, roderick, rodger, pilar, rodney, rodolfo, rodrick, rodrigo, rogelio, roger, roland, rolanda, rolande, rolando, rolf, rolland, roma, romaine, roman, romana, romelia, romeo, romona, pura, ron, rona, ronald, roni, ronna, ronni, ronnie, ronny, roosevelt, rory, rosa, rosalba, rosalee, rosalia, rosalie, rosalind, rosalinda, rosaline, rosalva, rosalyn, rosamaria, rosamond, rosana, rosann, rosanna, rosanne, rosaria, rosario, rosaura, roscoe, rose, roseann, roseanna, roseanne, roselee, roselia, roseline, rosella, randell, roselle, roselyn, randy, rosemarie, rosemary, rosena, rosenda, rosendo, rosetta, rosette, rosia, rosie, rosina, rosio, raylene, roslyn, ross, rossana, rossie, rosy, rowena, roxana, reanna, roxane, roxann, roxanna, roxanne, roxie, roy, royal, royce, reda, rozanne, rozella, ruben, rubi, rubie, rubin, rubye, rudolf, rudolph, rudy, rueben, rufina, rufus, run, rupert, russel, russell, rusty, ruth, rutha, ruthann, renae, ruthanne, ruthe, ruthie, ryan, ryann, s, rene, sabina, renee, sabine, sabra, renna, sabrina, sacha, sachiko, sade, sadie, sadye, sage, rey, said, sal, reynalda, salena, salina, salley, sallie, sally, salome, salvador, salvatore, sam, ricarda, samantha, samara, samatha, same, samella, samira, sammie, sammy, samual, samuel, sana, sanda, sandee, sandi, sandie, sandra, sandy, sanford, riva, sang, rob, sanjuana, sanjuanita, robbin, sanora, santa, santana, santiago, santo, santos, sara, sarah, sarai, saran, sari, sarina, sarita, sasha, saturnina, sau, saul, saundra, savanna, say, scarlet, scarlett, scot, scott, scottie, scotty, sean, search, season, sebastian, sebrina, sec, section, see, seem, seema, seemed, seeming, seems, ronda, selena, selene, selina, selma, sena, senaida, september, serafina, serena, sergio, serina, serita, rosalina, server, seth, seven, seventy, several, seymour, sha, shad, shae, shaina, shakia, shakita, shala, shalanda, shall, shalon, shalonda, shameka, shamika, shan, shana, shanae, shanda, shandi, shandra, shane, shaneka, shanel, shanell, shanelle, shanice, shanika, shaniqua, shanita, shannan, shannon, rosita, shanon, shanta, shantae, shantay, shante, shantel, shantell, shantelle, shanti, shaquana, shaquita, shara, roxy, sharan, sharda, sharee, sharell, sharen, shari, sharice, sharie, sharika, ruby, sharilyn, sharla, sharleen, sharlene, sharmaine, sharolyn, sharon, sharri, russ, sharyl, sharyn, shasta, shaun, shauna, shaunda, shaunna, shaunta, shaunte, shavon, shavonda, shavonne, shawana, shawanda, shawanna, shawn, shawnda, shawnee, shawnna, shawnta, shay, shayla, shayna, shayne, she, shea, sheba, shed, sheena, sheila, sheilah, shela, shelba, shelby, sheldon, shelia, shell, shella, shelley, shelli, shellie, shelly, shemeka, shemika, shena, shenika, shenita, shenna, shera, sheree, sherell, sheri, sherice, sheridan, sherie, santina, sherika, sherill, sherilyn, sherise, sherita, sherlene, sherley, sherly, sherlyn, sherman, sheron, sherrell, sherri, sherrie, sherril, savannah, sherrill, sherron, sherry, sherryl, sherwood, shery, sheryl, sheryll, shes, shiela, shila, shiloh, shin, shira, shirely, shirl, shirlee, shirleen, shirlene, shirley, shirly, shizue, shizuko, shon, shona, setsuko, shonda, shondra, shonna, shonta, shoshana, shu, shakira, shyla, sibyl, sid, sidney, sierra, signe, sigrid, silas, silva, silvana, silvia, sima, simon, simona, simone, simonne, sina, since, shani, sindy, siobhan, sirena, siu, shanna, six, sixty, skye, slyvia, so, socorro, sofia, soila, sol, solange, soledad, solomon, some, somehow, someone, somer, something, sometime, sometimes, somewhere, sommer, son, sona, sharita, sondra, song, sonia, sonja, sonny, sonya, sharonda, soo, sharron, sook, soon, sophia, sophie, soraya, sparkle, spencer, spring, stacee, stacey, staci, stacia, stacie, stacy, stanford, shawna, stanley, stanton, star, starla, starr, stefan, stefani, stefania, stefanie, stefany, steffanie, stella, stepanie, stephan, stephane, stephani, stephania, stephanie, stephen, stephenie, stephine, stephnie, sterling, shelton, steve, steven, stevie, stewart, still, stop, stormy, stuart, suanne, such, sudie, sue, sueann, suellen, suk, sulema, sumiko, summer, sun, sunday, sung, sunni, sunny, sunshine, susan, susana, susann, susanna, susannah, susanne, susie, susy, suzan, suzann, suzanna, suzanne, suzette, suzi, suzy, svetlana, sybil, syble, sydney, sylvester, sylvia, sylvie, synthia, syreeta, t, ta, tabetha, tabitha, tad, tai, taina, taisha, tajuana, takako, taking, takisha, talia, talisha, talitha, tama, tamar, tamara, tamatha, tambra, tameika, tameka, tamekia, tamela, tamera, tamesha, tami, tamica, tamie, tamiko, tamisha, sixta, tammara, tammera, tammi, tammie, tammy, tana, tandra, tandy, taneka, tanesha, tangela, tania, tanika, tanja, tanna, tanner, tara, tarah, taren, tarra, tarsha, taryn, tasha, tashia, tashina, tasia, tatiana, tatum, tatyana, taunya, tawana, tawanda, tawanna, tawna, tawnya, stan, taylor, tayna, ted, teddy, tegan, tell, stasia, telma, temeka, temika, tempie, temple, ten, tena, tenesha, stephaine, tenisha, tennie, tennille, teodora, teodoro, stephany, teofila, tequila, tera, tereasa, terence, teresa, teresia, teresita, teressa, teri, terica, su, terina, terisa, terra, terrance, terrell, terrence, terresa, terri, terrie, terrilyn, terry, tesha, tessa, tessie, thad, thaddeus, thalia, than, thanh, thao, that, thatll, thats, thatve, the, theda, their, thelma, them, suzie, themselves, then, thence, theo, theodora, theodore, theola, thereafter, thereby, thered, therefore, tabatha, therein, therell, therere, theres, theresa, therese, theressa, thereupon, thereve, theron, thersa, these, tam, theyd, tamala, theyll, theyre, theyve, thi, thing, think, thirty, this, thomas, thomasena, thomasina, thomasine, thora, tamika, those, though, thought, thousand, thresa, through, throughout, tamra, thru, thu, thurman, thus, thuy, tia, tiana, tianna, tanisha, tiara, tien, tiera, tanya, tierra, tiesha, tifany, tari, tiffaney, tiffani, tiffanie, tiffany, tiffiny, tijuana, til, tilda, till, tillie, tim, timika, timmy, timothy, tina, tawny, tinisha, tiny, tip, tisa, tish, teena, tisha, teisha, titus, to, tobi, tobias, tobie, toby, toccara, tod, together, toi, told, tom, tomas, tomasa, tomeka, tomi, tomika, tomiko, terese, tommie, tommy, tommye, tomoko, tona, tonda, tonette, toney, tonia, tonie, tonisha, tonita, tonja, tony, tonya, too, tora, tess, tori, torie, torri, torrie, tory, tosha, toshia, thea, toshiko, tova, towanda, towards, toya, tracee, traci, tracie, theresia, tracy, tran, trang, travis, treasa, treena, trena, trent, trenton, tresa, tressa, tressie, treva, trevor, trey, tricia, trillion, trina, trinh, trinidad, trinity, trisha, trista, tristan, troy, trudi, trudie, trudy, trula, truman, try, tu, tuan, tula, tuyet, twana, twanda, twanna, twenty, twila, two, twyla, ty, tyesha, tyisha, tyler, tynisha, todd, tyra, tyree, tyron, tyrone, tyson, u, ulrike, ulysses, un, una, under, unless, unlike, until, unto, up, toni, upon, ursula, us, used, usha, using, ute, v, vada, val, valarie, valda, valencia, valene, valentin, valentina, valentine, valeri, valeria, valerie, tracey, valery, valorie, valrie, van, vanda, vanesa, vanessa, vanetta, vania, vanita, vanna, vannesa, vannessa, vashti, vasiliki, vaughn, ve, veda, velda, velia, vella, velma, trish, velva, velvet, vena, venessa, venetta, venice, venita, vennie, venus, veola, vera, verda, verdell, verdie, verena, vergie, verla, verlene, verlie, verline, vern, verna, vernell, vernetta, vernia, tyrell, vernice, vernie, vernita, ula, vernon, verona, veronica, veronika, veronique, versie, vertie, very, vesta, veta, vi, via, vicenta, vicente, vickey, vickie, vicky, victor, victoria, vida, vallie, viki, vikki, vilma, vance, vina, vince, vincent, vincenza, vincenzo, vinita, vinnie, viola, violet, violeta, violette, virgen, virgie, virgil, virgilio, virgina, virginia, vita, vito, viva, vivan, vivian, viviana, vivien, vivienne, vol, vols, von, voncile, vonda, vonnie, vs, w, wade, wai, waldo, walker, wallace, wally, walter, walton, waltraud, wan, wanda, waneta, wanetta, wanita, ward, warner, warren, was, wasnt, wava, way, waylon, wayne, we, wed, vicki, wei, weldon, well, wen, victorina, wendell, wendi, wendie, wendolyn, wendy, wenona, went, were, werent, werner, wes, wesley, weston, weve, what, whatever, whatll, whats, whatve, when, whence, whenever, where, whereafter, whereas, whereby, wherein, wheres, whereupon, wherever, whether, which, while, whim, whither, whitley, whitney, who, whod, whoever, whole, wholl, whom, whomever, whos, whose, why, wilber, wilbert, wilbur, wilburn, wilda, wiley, wilford, wilfred, wilfredo, wilhelmina, wilhemina, will, willa, willard, willena, willene, willetta, willette, willia, william, williams, willian, willie, williemae, willis, willodean, willow, willy, wilma, wilmer, wilson, wilton, windy, winford, winfred, winifred, winnie, winnifred, winona, winston, winter, with, within, without, wm, wonda, wont, woodrow, words, wouldnt, wyatt, wynell, wynona, x, xavier, xenia, xiao, xiomara, xochitl, xuan, y, yadira, yaeko, yael, yahaira, yajaira, yan, yang, yanira, yasmin, yasmine, yasuko, yee, yelena, yen, yer, yes, yesenia, yessenia, yet, yetta, yevette, yi, ying, yoko, yolanda, yolande, yolando, yolonda, yon, yong, yoshie, yoshiko, you, youd, youlanda, youll, young, your, youre, yours, yourself, yourselves, youve, yu, yuette, yuk, yuki, yukiko, yuko, yulanda, yun, yung, yuonne, yuri, yuriko, yvette, yvone, yvonne, z, zachariah, zachary, zachery, zack, zackary, zada, zaida, zana, zandra, zane, zelda, zella, zelma, zena, zenaida, zenia, zenobia, zetta, zina, zita, zoe, zofia, zoila, zola, zona, zonia, zora, zoraida, zula, zulema, zulma'
stopwords1 = stopw.split(',')

print(stopwords[:15])
print(70*'——')
print(stopwords1[:15])

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours']
————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————
['0', ' 1', ' 2', ' 3', ' 4', ' 5', ' 6', ' 7', ' 8', ' 9', ' a', ' aaron', ' abbey', ' abbie', ' abdul']


In [19]:
# here I filter the warnings, ignoring them. 
import warnings
warnings.filterwarnings("ignore")

## 1. Pre-processing

### _Infinite Jest_

In [23]:
ij_corpus_txt = open('/Volumes/greta/greta/uni/cimec/fatti/Progetti/topic_modeling/CSTA_txt/infinite_jest .txt', 'r').read()

In [24]:
type(ij_corpus_txt)

str

In [25]:
# Functions that I need to: 1.split the corpus (1 chapter = 1 document), 2. remove newlines, 3.remove the head 
def better_corpus(stringa):
    return stringa.split('<ch><') #<ch>< marks the beginning of a new chapter.
def newline_compressor(stringa):
    return re.sub(r"\\\n+", " ", stringa) 
def head_remove(stringa):
    return re.sub("^.* Wallace ", " ", stringa)

In [26]:
def clean_string(txt):
    corpus = newline_compressor(txt)
    corpus = head_remove(corpus)
    corpus = better_corpus(corpus)
    return corpus

def clean_corpus(lista):
    tokenized_corpus = [word_tokenize(w.lower()) for w in lista]
    lemma_filtered_corpus = [[wnl.lemmatize(i) for i in t if i not in stopwords] for t in tokenized_corpus]
    lemma_filtered_corpus = [[wnl.lemmatize(i) for i in t if i not in stopwords1] for t in lemma_filtered_corpus]
    lemma_filtered_corpus = [[wnl.lemmatize(i) for i in t if i.isalpha()] for t in lemma_filtered_corpus]
    return lemma_filtered_corpus

In [27]:
clean_ij_string = clean_string(ij_corpus_txt)

In [31]:
print(clean_ij_string[1][:100])

01> YEAR OF GLAD I am seated in an office, surrounded by heads and bodies. My posture is consciously


In [36]:
ij_tokenized_documents = clean_corpus(clean_ij_string)
print(ij_tokenized_documents[1][:10])

['year', 'glad', 'seated', 'office', 'surrounded', 'head', 'body', 'posture', 'consciously', 'congruent']


In [35]:
n_ij_documents = len(ij_tokenized_documents)
print(n_ij_documents)

66


### _Brown Corpus_

In [37]:
clean_brown_documents = []
for fileid in brown.fileids(): #brown corpus is already divided in files 
    document = ' '.join(brown.words(fileid)) # from a list to a string
    clean_brown_documents.append(document) # here I created a list of texts, just as for the text above, but with another strategy. 

In [38]:
brown.categories()[:10] 
# brown corpus is already devided in categories, 
#so I printed them here to set them as the 'golden standard' fo my future model.

['adventure',
 'belles_lettres',
 'editorial',
 'fiction',
 'government',
 'hobbies',
 'humor',
 'learned',
 'lore',
 'mystery']

In [39]:
brown_tokenized_documents = clean_corpus(clean_brown_documents)

In [41]:
print(brown_tokenized_documents[1][:10])

['austin', 'texas', 'committee', 'approval', 'gov', 'price', 'daniel', 'abandoned', 'property', 'act']


In [44]:
n_brown_documents = len(brown_tokenized_documents)
print(n_brown_documents)

500


## 2. Training the algorithm


### _Dictionaries and Corpora_

In [45]:
#Dictionaries
id2word1 = corpora.Dictionary(ij_tokenized_documents)
id2word2 = corpora.Dictionary(brown_tokenized_documents)
# This function crates a dictionary composed by every unique token in the tokenized texts and its identificator. 

#Corpora
texts1 = ij_tokenized_documents
texts2 = brown_tokenized_documents

corpus1 = [id2word1.doc2bow(text) for text in texts1]
corpus2 = [id2word2.doc2bow(text) for text in texts2]
# the corpus is composed by tuples (word_id, word_frequency)

In [50]:
print(id2word1)
print(20*'—————')
#print(id2word1.token2id.items())

Dictionary(25575 unique tokens: ['courier', 'david', 'foster', 'infinite', 'jest']...)
————————————————————————————————————————————————————————————————————————————————————————————————————


In [52]:
print(corpus1[1][:10])
print(20*'—————')
print(corpus1[2][:10])

[(6, 1), (7, 1), (8, 1), (9, 1), (10, 1), (11, 2), (12, 1), (13, 1), (14, 1), (15, 6)]
————————————————————————————————————————————————————————————————————————————————————————————————————
[(6, 1), (10, 1), (29, 2), (31, 1), (42, 1), (43, 1), (53, 2), (56, 5), (62, 2), (63, 5)]


In [53]:
# The tuples represent (word_id, word_frequency). We can figure out the "meaning" of each id in this way: 
def read_touple(dic, corpus):
    readable_tuples = [[(dic[id], freq) for id, freq in doc] for doc in corpus]
    return readable_tuples

In [59]:
read_touple(id2word1, corpus1)[:1]

[[('courier', 1),
  ('david', 1),
  ('foster', 1),
  ('infinite', 1),
  ('jest', 1),
  ('wallace', 1)]]

In [65]:
read_touple(id2word2, corpus2)[1][:10]

[('act', 3),
 ('age', 2),
 ('aid', 2),
 ('also', 4),
 ('amendment', 2),
 ('appointment', 1),
 ('approved', 2),
 ('assistance', 1),
 ('association', 1),
 ('attorney', 1)]

### _Training_

In [66]:
# I decided to set the number of topics = 10, just beacuse in every tutorial I used to learn this topic used as a convention the number 10.
# However, this parameter will be changed below**.
# Here I ignored every parameter that is not obligatory.
lda_model1 = gensim.models.ldamodel.LdaModel(corpus=corpus1, id2word=id2word1, num_topics=10) #Infinite Jest
lda_model2 = gensim.models.ldamodel.LdaModel(corpus=corpus2, id2word=id2word2, num_topics=10) #Brown Corpus

### Parameters: 
- corpus: bag of words per document;
- num_topics: the number of requested latent topics to be extracted from the training corpus.
- id2word: map from word IDs to words. It is used to determine the vocabulary size, as well as for debugging and topic printing.


These are the previous models' parameters. Below I printed the some other default parameters (hidden in previous lines):

In [67]:
print('Default paramteters, Model 1')
print('- chunksize, the number of documents to be used in each training chunk:', lda_model1.chunksize)
print('- passes, number of passes through the corpus during training:',lda_model1.passes)
print('- decay, a number between (0.5, 1] to weight what percentage of the previous lambda value is forgotten when each new document is examined:',lda_model1.decay)
print('- numterms, number of terms:', lda_model1.num_terms)
print('\nDefault paramteters, Model 2')
print('- chunksize:', lda_model2.chunksize)
print('- passes:',lda_model2.passes)
print('- decay:',lda_model2.decay)
print('- numterms:', lda_model2.num_terms)

Default paramteters, Model 1
- chunksize, the number of documents to be used in each training chunk: 2000
- passes, number of passes through the corpus during training: 1
- decay, a number between (0.5, 1] to weight what percentage of the previous lambda value is forgotten when each new document is examined: 0.5
- numterms, number of terms: 25575

Default paramteters, Model 2
- chunksize: 2000
- passes: 1
- decay: 0.5
- numterms: 35251


In [73]:
for i in lda_model1.print_topics():
    print(i[0], '\n', i[1].split('+')[:5])

0 
 ['0.010*"like" ', ' 0.006*"gately" ', ' 0.005*"one" ', ' 0.004*"little" ', ' 0.004*"way" ']
1 
 ['0.007*"like" ', ' 0.007*"one" ', ' 0.006*"say" ', ' 0.005*"get" ', ' 0.005*"gately" ']
2 
 ['0.012*"like" ', ' 0.006*"gately" ', ' 0.004*"get" ', ' 0.004*"one" ', ' 0.004*"time" ']
3 
 ['0.010*"like" ', ' 0.007*"one" ', ' 0.006*"gately" ', ' 0.005*"time" ', ' 0.005*"get" ']
4 
 ['0.009*"like" ', ' 0.006*"one" ', ' 0.006*"back" ', ' 0.004*"way" ', ' 0.004*"say" ']
5 
 ['0.008*"like" ', ' 0.007*"gately" ', ' 0.007*"one" ', ' 0.005*"even" ', ' 0.005*"back" ']
6 
 ['0.007*"one" ', ' 0.007*"like" ', ' 0.006*"way" ', ' 0.005*"even" ', ' 0.004*"say" ']
7 
 ['0.009*"like" ', ' 0.007*"one" ', ' 0.006*"even" ', ' 0.004*"way" ', ' 0.004*"say" ']
8 
 ['0.010*"like" ', ' 0.006*"one" ', ' 0.005*"hal" ', ' 0.005*"even" ', ' 0.005*"gately" ']
9 
 ['0.009*"like" ', ' 0.006*"gately" ', ' 0.005*"one" ', ' 0.005*"back" ', ' 0.005*"little" ']


In [74]:
for i in lda_model2.print_topics():
    print(i[0], '\n', i[1].split('+')[:5])

0 
 ['0.005*"one" ', ' 0.005*"would" ', ' 0.004*"said" ', ' 0.003*"time" ', ' 0.003*"year" ']
1 
 ['0.006*"one" ', ' 0.006*"would" ', ' 0.005*"said" ', ' 0.004*"state" ', ' 0.003*"new" ']
2 
 ['0.007*"one" ', ' 0.004*"new" ', ' 0.004*"said" ', ' 0.003*"would" ', ' 0.003*"year" ']
3 
 ['0.007*"one" ', ' 0.007*"would" ', ' 0.004*"af" ', ' 0.004*"man" ', ' 0.003*"time" ']
4 
 ['0.007*"one" ', ' 0.005*"would" ', ' 0.004*"said" ', ' 0.003*"time" ', ' 0.003*"may" ']
5 
 ['0.007*"would" ', ' 0.006*"one" ', ' 0.003*"said" ', ' 0.003*"time" ', ' 0.003*"year" ']
6 
 ['0.007*"one" ', ' 0.006*"would" ', ' 0.004*"time" ', ' 0.003*"like" ', ' 0.003*"new" ']
7 
 ['0.006*"one" ', ' 0.004*"could" ', ' 0.004*"time" ', ' 0.004*"would" ', ' 0.003*"state" ']
8 
 ['0.007*"one" ', ' 0.005*"would" ', ' 0.004*"new" ', ' 0.004*"state" ', ' 0.004*"year" ']
9 
 ['0.006*"one" ', ' 0.004*"could" ', ' 0.004*"would" ', ' 0.004*"said" ', ' 0.004*"time" ']


### ___Some questions___
- How can we read these _results_? Are they _good results_?
- Can these results represent the outcomes of two _good models_?
- How can we _evaluate_ the models?


## 3. Interpretation of Outcomes 

### _Visualization_

__PyLDAvis__ is a tool that can help humans to read outcomes of topic models. 

It creates - starting from the _model_, a _corpus_ and a _dictionary_ - a graphical representation of the "topics" in the corpus.

In [32]:
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(lda_model1, corpus1, id2word1)
vis

In [33]:
vis = pyLDAvis.gensim.prepare(lda_model2, corpus2, id2word2)
vis

Outcomes do not seem very clear (nor _printed_, nether represented in a _graph_). 

It is not trivial to understand the goodness of a model looking at its outcomes. It is not entirely clear how to evaluate topic models, there is an open debate. However, it is possible to compute the __coherence__ of its results, that can help to draw some conclusions and to fix problems.

_Gensim_ provides the tools to do it.

### _Evaluation_

In [75]:
coherence_model_lda = CoherenceModel(model=lda_model1, texts=texts1, dictionary=id2word1)
coherence_lda = coherence_model_lda.get_coherence()
print('Model 1 (IJ) \nCoherence Score: ', coherence_lda)#higher the better.
coherence_model_lda = CoherenceModel(model=lda_model2, texts=texts2, dictionary=id2word2)
coherence_lda = coherence_model_lda.get_coherence()
print('\nModel 2 (BC) \nCoherence Score:', coherence_lda)

Model 1 (IJ) 
Coherence Score:  0.2817601901839984

Model 2 (BC) 
Coherence Score: 0.25465790400765


## Toy Example

Can the coherence improve?

Maybe, in order to understand how it is possible to have better results, it would be useful to start from a _very small corpus_ (what I will call "toy corpus", a corpus composed by only _10 sentences_(=documents). This passage is meant to understend if the coherence is related to the size of the corpora, and also to summarize the necessary steps (in order to add other ones). 

In this case, I already know what "topics" I expect as outputs. Only by skimming the corpus it is possible to notice that the main topics are related to the concepts of 'knowledge', 'justification', 'belief' and 'epistemology'.

### *Processing*

In [77]:
toy_corpus = open('/Volumes/greta/greta/uni/cimec/fatti/Progetti/topic_modeling/CSTA_txt/Epistemology1.rtf', 'r').read()

In [79]:
toy_corpus_non = re.sub(r"\n+", " ", toy_corpus) 
toy_corpus_noh = re.sub("^.* Defined narrowly, " , " ", toy_corpus_non)
toy_corpus_docs = re.split("[;?!.\}]+", toy_corpus_noh)

print(toy_corpus_docs)

[' epistemology is the study of knowledge and justified belief', ' As the study of knowledge, epistemology is concerned with the following questions: What are the necessary and sufficient conditions of knowledge', ' What are its sources', ' What is its structure, and what are its limits', ' As the study of justified belief, epistemology aims to answer questions such as: How we are to understand the concept of justification', ' What makes justified beliefs justified', " Is justification internal or external to one's own mind", ' Understood more broadly, epistemology is about issues having to do with the creation and dissemination of knowledge in particular areas of inquiry', ' This article will provide a systematic overview of the problems that the questions above raise and focus in some depth on issues relating to the structure and the limits of knowledge and justification', '']


In [80]:
toy_tokenized_documents = clean_corpus(toy_corpus_docs)
n_toy_documents = len(toy_tokenized_documents)
print(n_toy_documents)

10


In [81]:
toy_tokenized_documents[0]

['epistemology', 'study', 'knowledge', 'justified', 'belief']

### *Dictionaries and Corpora*

In [84]:
id2word0 = corpora.Dictionary(toy_tokenized_documents)

texts0 = toy_tokenized_documents
corpus0 = [id2word0.doc2bow(text) for text in texts0]

#read_touple(id2word0,corpus0)[1]

### *Training the model*

In [85]:
lda_model0 = gensim.models.ldamodel.LdaModel(corpus=corpus0, id2word=id2word0, num_topics=3)
# I changed the number of topics because 10 topics seems too much for a single paragraph of 10 sentences. 

for i in lda_model0.print_topics():
    print(i[0], '\n', i[1].split('+')[:5])

0 
 ['0.063*"knowledge" ', ' 0.060*"epistemology" ', ' 0.050*"broadly" ', ' 0.050*"area" ', ' 0.050*"issue" ']
1 
 ['0.065*"knowledge" ', ' 0.060*"justification" ', ' 0.054*"study" ', ' 0.053*"epistemology" ', ' 0.051*"question" ']
2 
 ['0.098*"justified" ', ' 0.068*"belief" ', ' 0.045*"make" ', ' 0.043*"knowledge" ', ' 0.039*"source" ']


### *Evaluation*

In [86]:
coherence_model_lda = CoherenceModel(model=lda_model0, texts=toy_tokenized_documents, dictionary=id2word0)
coherence_lda = coherence_model_lda.get_coherence()
print('Model 0 (TM) \nCoherence Score: ', coherence_lda)

Model 0 (TM) 
Coherence Score:  0.40119204444144435


### *Visualization*

In [91]:
vis0 = pyLDAvis.gensim.prepare(lda_model0, corpus0, id2word0)
#vis0

The coherence of results increased.

The performance of the models can impove with some modifications:
- related to the __text processing__;
- related to the __parameters__ of the model.

For example, Martin and Johnson (2015) showed how models improve their performance if feeded with corpora composed only by _nouns_. 
For this reason I tagged the corpora and I created new corpora composed only by _nouns_ (I tagged the Brown Corpus with the same tagger of the others. I know that it exists a version that is already tagged, but I tried to keep consistency through the steps.)

PS:I also thought that it could have been useful to delete _quantifiers_ (very frequent but not-so-related with the general meaning of the texts).

# B. Text Processing Modifications

## B.1 POS Tagging

In [92]:
def only_noun(texts): 
    some_quantifiers = ['anything', 'anybody', 'something', 'somebody', 'everything', 'thing']
    tagged_corpus = [nltk.pos_tag(i) for i in texts]
    only_noun_corpus = [[t[0] for t in i if t[1] == 'NN' and t[0] not in some_quantifiers] for i in tagged_corpus]
    return only_noun_corpus
    

In [93]:
tag_text_0 = only_noun(texts0)
tag_text_1 = only_noun(texts1)
tag_text_2 = only_noun(texts2)

In [94]:
#Dictionary
new_id2word0 = corpora.Dictionary(tag_text_0)
new_id2word1 = corpora.Dictionary(tag_text_1)
new_id2word2 = corpora.Dictionary(tag_text_2)

new_corpus0 = [new_id2word0.doc2bow(text) for text in tag_text_0]
new_corpus1 = [new_id2word1.doc2bow(text) for text in tag_text_1]
new_corpus2 = [new_id2word2.doc2bow(text) for text in tag_text_2]

In [95]:
new_lda_model0 = gensim.models.ldamodel.LdaModel(corpus=new_corpus0, id2word=new_id2word0, num_topics=3)
new_lda_model1 = gensim.models.ldamodel.LdaModel(corpus=new_corpus1, id2word=new_id2word1, num_topics=10)
new_lda_model2 = gensim.models.ldamodel.LdaModel(corpus=new_corpus2, id2word=new_id2word2, num_topics=10)

for i in new_lda_model0.print_topics():
    print(i[0], '\n', i[1].split('+')[:5])

for i in new_lda_model1.print_topics():
    print(i[0], '\n', i[1].split('+')[:5])

for i in new_lda_model2.print_topics():
    print(i[0], '\n', i[1].split('+')[:5])

In [99]:
coherence_model_lda = CoherenceModel(model=new_lda_model0, texts=tag_text_0, dictionary=new_id2word0)
coherence_lda = coherence_model_lda.get_coherence()
print('Model 0, postagged \nCoherence Score: ', coherence_lda)
coherence_model_lda = CoherenceModel(model=new_lda_model1, texts=tag_text_1, dictionary=new_id2word1)
coherence_lda = coherence_model_lda.get_coherence()
print('Model 1, postagged \nCoherence Score: ', coherence_lda)
coherence_model_lda = CoherenceModel(model=new_lda_model2, texts=tag_text_2, dictionary=new_id2word2)
coherence_lda = coherence_model_lda.get_coherence()
print('Model 2, postagged \nCoherence Score: ', coherence_lda)

Model 0, postagged 
Coherence Score:  0.41191453493827535
Model 1, postagged 
Coherence Score:  0.253990184869637
Model 2, postagged 
Coherence Score:  0.2911192252788263


## B1.1 Noun and Adjectives
Also the adjectives can be meaningfull to detect the topics, the themes, or - at least - the writing style of a corpus. I added this part in a second phase, just to see if it could be a usefull filter. 

In [100]:
def only_noun_and_adjectives(texts): 
    some_quantifiers = ['anything', 'anybody', 'something', 'somebody', 'everything', 'thing']
    tagged_corpus = [nltk.pos_tag(i) for i in texts]
    only_noun_corpus = [[t[0] for t in i if t[1] == 'NN'or t[1] == 'ADJ' and t[0] not in some_quantifiers] for i in tagged_corpus]
    return only_noun_corpus
    

In [101]:
tag_text_na_0 = only_noun_and_adjectives(texts0)
tag_text_na_1 = only_noun_and_adjectives(texts1)
tag_text_na_2 = only_noun_and_adjectives(texts2)

In [102]:
#Dictionary
new_na_id2word0 = corpora.Dictionary(tag_text_na_0)
new_na_id2word1 = corpora.Dictionary(tag_text_na_1)
new_na_id2word2 = corpora.Dictionary(tag_text_na_2)

new_na_corpus0 = [new_na_id2word0.doc2bow(text) for text in tag_text_na_0]
new_na_corpus1 = [new_na_id2word1.doc2bow(text) for text in tag_text_na_1]
new_na_corpus2 = [new_na_id2word2.doc2bow(text) for text in tag_text_na_2]

In [103]:
new_na_lda_model0 = gensim.models.ldamodel.LdaModel(corpus=new_na_corpus0, id2word=new_na_id2word0, num_topics=3)
new_na_lda_model1 = gensim.models.ldamodel.LdaModel(corpus=new_na_corpus1, id2word=new_na_id2word1, num_topics=10)
new_na_lda_model2 = gensim.models.ldamodel.LdaModel(corpus=new_na_corpus2, id2word=new_na_id2word2, num_topics=10)

In [104]:
coherence_model_lda = CoherenceModel(model=new_na_lda_model0, texts=tag_text_na_0, dictionary=new_na_id2word0)
coherence_lda = coherence_model_lda.get_coherence()
print('Model 0, na_postagged \nCoherence Score: ', coherence_lda)
coherence_model_lda = CoherenceModel(model=new_na_lda_model1, texts=tag_text_na_1, dictionary=new_na_id2word1)
coherence_lda = coherence_model_lda.get_coherence()
print('Model 1, na_postagged \nCoherence Score: ', coherence_lda)
coherence_model_lda = CoherenceModel(model=new_na_lda_model2, texts=tag_text_na_2, dictionary=new_na_id2word2)
coherence_lda = coherence_model_lda.get_coherence()
print('Model 2, na_postagged \nCoherence Score: ', coherence_lda)

Model 0, na_postagged 
Coherence Score:  0.4634461091136269
Model 1, na_postagged 
Coherence Score:  0.2547740898843837
Model 2, na_postagged 
Coherence Score:  0.28669414661110604


I can do another modification at the level of _selection of words_. Some tutorials (e.g. https://towardsdatascience.com/topic-modelling-in-python-with-nltk-and-gensim-4ef03213cd21) filter the words in corpora also by their _lenght_. As we can see, the key words printed above are redundant and most of them are composed by 4 letters. Maybe 4 letter words are just too common in every english text to count them as meaningfull. 

Here I try to remove the words that have less than 4 letters to see if the models improve teir performance. 

## B.2 Long Words

In [105]:
def long(texts):
    only_long_nouns_corpus = [[t for t in i if len(t)>4] for i in texts]
    return only_long_nouns_corpus

In [106]:
ltexts0 = long(texts0)
ltexts1 = long(texts1)
ltexts2 = long(texts2)

In [107]:
#Dictionary
lnew_id2word0 = corpora.Dictionary(ltexts0)
lnew_id2word1 = corpora.Dictionary(ltexts1)
lnew_id2word2 = corpora.Dictionary(ltexts2)


lnew_corpus0 = [lnew_id2word0.doc2bow(text) for text in ltexts0]
lnew_corpus1 = [lnew_id2word1.doc2bow(text) for text in ltexts1]
lnew_corpus2 = [lnew_id2word2.doc2bow(text) for text in ltexts2]

In [108]:
lnew_lda_model0 = gensim.models.ldamodel.LdaModel(corpus=lnew_corpus0, id2word=lnew_id2word0, num_topics=3)
lnew_lda_model1 = gensim.models.ldamodel.LdaModel(corpus=lnew_corpus1, id2word=lnew_id2word1, num_topics=10)
lnew_lda_model2 = gensim.models.ldamodel.LdaModel(corpus=lnew_corpus2, id2word=lnew_id2word2, num_topics=10)

for i in lnew_lda_model0.print_topics():
    print(i[0], '\n', i[1].split('+')[:5])
  
for i in lnew_lda_model1.print_topics():
    print(i[0], '\n', i[1].split('+')[:5])
    
for i in lnew_lda_model2.print_topics():
    print(i[0], '\n', i[1].split('+')[:5])

In [109]:
coherence_model_lda = CoherenceModel(model=lnew_lda_model0, texts=ltexts0, dictionary=lnew_id2word0)
coherence_lda = coherence_model_lda.get_coherence()
print('Model 0, long postagged \nCoherence Score: ', coherence_lda)
coherence_model_lda = CoherenceModel(model=lnew_lda_model1, texts=ltexts1, dictionary=lnew_id2word1)
coherence_lda = coherence_model_lda.get_coherence()
print('Model 1, long postagged \nCoherence Score: ', coherence_lda)
coherence_model_lda = CoherenceModel(model=lnew_lda_model2, texts=ltexts2, dictionary=lnew_id2word2)
coherence_lda = coherence_model_lda.get_coherence()
print('Model 2, long postagged \nCoherence Score: ', coherence_lda)

Model 0, long postagged 
Coherence Score:  0.427693409669931
Model 1, long postagged 
Coherence Score:  0.26572696306962096
Model 2, long postagged 
Coherence Score:  0.2912060273129933


I select the models with the best text processing (relying on coherence scores), and I will work at the level of paramters. The two parameters I am going to work with are: 
- the number of __topics__
- the number of __passes__

Let's see if these modifications have impact on the models. 

# C. Parameters 

## C.1 Number of topics

The coherence of a model can be influenced by the number of its topics (https://datascienceplus.com/evaluation-of-topic-modeling-topic-coherence/). How?

In [111]:
def compute_coherence_values(dictionary, corpus, texts, limit, start=2, step=1):
    # the limit will be the max number of topics
    # the start will be the minimum number of topics
    # the step is the step between a number of topics and one other e.g. step = 1 --> 1,2,3,4 step=2 1,3,5,7
    coherence_values = []
    model_list = []
    for num_topics in range(start, limit, step):
        model=LdaModel(corpus=corpus, id2word=dictionary, num_topics=num_topics)
        model_list.append(model)
        coherencemodel = CoherenceModel(model=model, texts=texts, dictionary=dictionary)
        coherence_values.append(coherencemodel.get_coherence())
    return (dict(zip(model_list, coherence_values)))

def sort(dictionary): 
    # here I sorted the dictionaries from the highest to the lowest coherence score.
    sorted_dictionary = sorted(dictionary.items(), key=lambda x: x[1], reverse = True)
    return sorted_dictionary

# to see it clearly: 
def visualize_it_better(dictionary):
    for i in dictionary:
        print('The coherence of', i[0],'is', i[1])

### _Toy corpus_

In [112]:
cv_model0 = compute_coherence_values(dictionary=new_id2word0, corpus=new_corpus0, texts=tag_text_0, limit=11, step=1)

In [113]:
sorted_cv_model0 = sort(cv_model0)

In [114]:
visualize_it_better(sorted_cv_model0)

The coherence of LdaModel(num_terms=25, num_topics=7, decay=0.5, chunksize=2000) is 0.4821422170997605
The coherence of LdaModel(num_terms=25, num_topics=4, decay=0.5, chunksize=2000) is 0.4785680529365949
The coherence of LdaModel(num_terms=25, num_topics=10, decay=0.5, chunksize=2000) is 0.47686589395260964
The coherence of LdaModel(num_terms=25, num_topics=6, decay=0.5, chunksize=2000) is 0.4763325842863737
The coherence of LdaModel(num_terms=25, num_topics=9, decay=0.5, chunksize=2000) is 0.47401853911181685
The coherence of LdaModel(num_terms=25, num_topics=8, decay=0.5, chunksize=2000) is 0.4723596084307099
The coherence of LdaModel(num_terms=25, num_topics=3, decay=0.5, chunksize=2000) is 0.4616819275880187
The coherence of LdaModel(num_terms=25, num_topics=5, decay=0.5, chunksize=2000) is 0.46150315393232955
The coherence of LdaModel(num_terms=25, num_topics=2, decay=0.5, chunksize=2000) is 0.4540625209956816


### _Infinite Jest_


In [115]:
cv_model1 = compute_coherence_values(dictionary=lnew_id2word1, corpus=lnew_corpus1, texts=ltexts1, limit=11, step=1)

In [116]:
sorted_cv_model1 = sort(cv_model1)

In [117]:
visualize_it_better(sorted_cv_model1)

The coherence of LdaModel(num_terms=23055, num_topics=9, decay=0.5, chunksize=2000) is 0.2657049108228897
The coherence of LdaModel(num_terms=23055, num_topics=3, decay=0.5, chunksize=2000) is 0.264938994722154
The coherence of LdaModel(num_terms=23055, num_topics=4, decay=0.5, chunksize=2000) is 0.26425723737919304
The coherence of LdaModel(num_terms=23055, num_topics=2, decay=0.5, chunksize=2000) is 0.2640875778593762
The coherence of LdaModel(num_terms=23055, num_topics=6, decay=0.5, chunksize=2000) is 0.26369223554649146
The coherence of LdaModel(num_terms=23055, num_topics=8, decay=0.5, chunksize=2000) is 0.26246572087260384
The coherence of LdaModel(num_terms=23055, num_topics=10, decay=0.5, chunksize=2000) is 0.2617803514183503
The coherence of LdaModel(num_terms=23055, num_topics=5, decay=0.5, chunksize=2000) is 0.26052032156367755
The coherence of LdaModel(num_terms=23055, num_topics=7, decay=0.5, chunksize=2000) is 0.25733268279598587


### _Brown corpus_

In [118]:
cv_model2 = compute_coherence_values(dictionary=new_id2word2, corpus=new_corpus2, texts=tag_text_2, limit=11, step=1)

In [119]:
sorted_cv_model2 = sort(cv_model2)

In [120]:
visualize_it_better(sorted_cv_model2)

The coherence of LdaModel(num_terms=19620, num_topics=10, decay=0.5, chunksize=2000) is 0.30789230126742767
The coherence of LdaModel(num_terms=19620, num_topics=8, decay=0.5, chunksize=2000) is 0.2933278836018408
The coherence of LdaModel(num_terms=19620, num_topics=9, decay=0.5, chunksize=2000) is 0.28695464031321083
The coherence of LdaModel(num_terms=19620, num_topics=7, decay=0.5, chunksize=2000) is 0.28326879027931035
The coherence of LdaModel(num_terms=19620, num_topics=5, decay=0.5, chunksize=2000) is 0.2826938930622197
The coherence of LdaModel(num_terms=19620, num_topics=2, decay=0.5, chunksize=2000) is 0.2822749623404128
The coherence of LdaModel(num_terms=19620, num_topics=4, decay=0.5, chunksize=2000) is 0.2806026908604852
The coherence of LdaModel(num_terms=19620, num_topics=6, decay=0.5, chunksize=2000) is 0.2783951203413922
The coherence of LdaModel(num_terms=19620, num_topics=3, decay=0.5, chunksize=2000) is 0.2756099524352494


If this function works, the best number of parameters (in terms of coherence) is: 
- 7 for Toy Corpus
- 9 for Infinte Jest
- 10 for Brown corpus

## C.2 Number of Passes

In [121]:
np_lda_model0p1 = gensim.models.ldamodel.LdaModel(corpus=new_corpus0, id2word=new_id2word0, num_topics=7, passes=1)
np_lda_model1p1 = gensim.models.ldamodel.LdaModel(corpus=lnew_corpus1, id2word=lnew_id2word1, num_topics=9, passes=1)
np_lda_model2p1 = gensim.models.ldamodel.LdaModel(corpus=new_corpus2, id2word=new_id2word2, num_topics=10, passes=1)

In [122]:
np_lda_model0p50 = gensim.models.ldamodel.LdaModel(corpus=new_corpus0, id2word=new_id2word0, num_topics=7, passes=50)
np_lda_model1p50 = gensim.models.ldamodel.LdaModel(corpus=lnew_corpus1, id2word=lnew_id2word1, num_topics=9, passes=50)
np_lda_model2p50 = gensim.models.ldamodel.LdaModel(corpus=new_corpus2, id2word=new_id2word2, num_topics=10, passes=50)

In [123]:
np_lda_model0p100 = gensim.models.ldamodel.LdaModel(corpus=new_corpus0, id2word=new_id2word0, num_topics=7, passes=100)
np_lda_model1p100 = gensim.models.ldamodel.LdaModel(corpus=lnew_corpus1, id2word=lnew_id2word1, num_topics=9, passes=100)
np_lda_model2p100 = gensim.models.ldamodel.LdaModel(corpus=new_corpus2, id2word=new_id2word2, num_topics=10, passes=100)

In [125]:
print('Model 0')
coherence_model_lda = CoherenceModel(model=np_lda_model0p1, texts=tag_text_0, dictionary=new_id2word0)
coherence_lda = coherence_model_lda.get_coherence()
print('1 pass \nCoherence Score: ', coherence_lda)
coherence_model_lda = CoherenceModel(model=np_lda_model0p50, texts=tag_text_0, dictionary=new_id2word0)
coherence_lda = coherence_model_lda.get_coherence()
print('50 passes \nCoherence Score: ', coherence_lda)
coherence_model_lda = CoherenceModel(model=np_lda_model0p100, texts=tag_text_0, dictionary=new_id2word0)
coherence_lda = coherence_model_lda.get_coherence()
print('100 passes \nCoherence Score: ', coherence_lda)

Model 0
1 pass 
Coherence Score:  0.4612575187527838
50 passes 
Coherence Score:  0.4560294352311454
100 passes 
Coherence Score:  0.4299202597262787


In [124]:
print('Model 1')
coherence_model_lda = CoherenceModel(model=np_lda_model0p1, texts=ltexts1, dictionary=lnew_id2word1)
coherence_lda = coherence_model_lda.get_coherence()
print('1 pass \nCoherence Score: ', coherence_lda)
coherence_model_lda = CoherenceModel(model=np_lda_model1p50, texts=ltexts1, dictionary=lnew_id2word1)
coherence_lda = coherence_model_lda.get_coherence()
print('50 passes \nCoherence Score: ', coherence_lda)
coherence_model_lda = CoherenceModel(model=np_lda_model1p100, texts=ltexts1, dictionary=lnew_id2word1)
coherence_lda = coherence_model_lda.get_coherence()
print('100 passes \nCoherence Score: ', coherence_lda)

Model 1
1 pass 
Coherence Score:  0.5344849354113694
50 passes 
Coherence Score:  0.33779412026657957
100 passes 
Coherence Score:  0.29305179255425373


In [125]:
print('Model 2')
coherence_model_lda = CoherenceModel(model=np_lda_model2p1, texts=tag_text_2, dictionary=new_id2word2)
coherence_lda = coherence_model_lda.get_coherence()
print('1 pass \nCoherence Score: ', coherence_lda)
coherence_model_lda = CoherenceModel(model=np_lda_model2p50, texts=tag_text_2, dictionary=new_id2word2)
coherence_lda = coherence_model_lda.get_coherence()
print('50 passes \nCoherence Score: ', coherence_lda)
coherence_model_lda = CoherenceModel(model=np_lda_model2p100, texts=tag_text_2, dictionary=new_id2word2)
coherence_lda = coherence_model_lda.get_coherence()
print('100 passes \nCoherence Score: ', coherence_lda)

Model 2
1 pass 
Coherence Score:  0.29208736107269284
50 passes 
Coherence Score:  0.40285875663119564
100 passes 
Coherence Score:  0.4178955871455436


### __Is it possible to visualize the improvements?__


In [127]:
Toy_final_model = gensim.models.ldamodel.LdaModel(corpus=new_corpus0, id2word=new_id2word0, num_topics=8, passes=1)
IJ_final_model = gensim.models.ldamodel.LdaModel(corpus=lnew_corpus1, id2word=lnew_id2word1, num_topics=7, passes=1)
BC_final_model = gensim.models.ldamodel.LdaModel(corpus=new_corpus2, id2word=new_id2word2, num_topics=10, passes=50)

In [129]:
for i in Toy_final_model.print_topics():
    print(i[0], '\n', i[1].split('+')[:5])

0 
 ['0.040*"belief" ', ' 0.040*"source" ', ' 0.040*"justification" ', ' 0.040*"knowledge" ', ' 0.040*"epistemology" ']
1 
 ['0.132*"limit" ', ' 0.132*"structure" ', ' 0.070*"justification" ', ' 0.070*"knowledge" ', ' 0.070*"question" ']
2 
 ['0.040*"source" ', ' 0.040*"belief" ', ' 0.040*"justification" ', ' 0.040*"structure" ', ' 0.040*"knowledge" ']
3 
 ['0.272*"belief" ', ' 0.030*"source" ', ' 0.030*"justification" ', ' 0.030*"structure" ', ' 0.030*"knowledge" ']
4 
 ['0.219*"mind" ', ' 0.219*"justification" ', ' 0.024*"belief" ', ' 0.024*"source" ', ' 0.024*"knowledge" ']
5 
 ['0.111*"knowledge" ', ' 0.111*"issue" ', ' 0.111*"understood" ', ' 0.111*"creation" ', ' 0.111*"inquiry" ']
6 
 ['0.155*"study" ', ' 0.155*"knowledge" ', ' 0.155*"epistemology" ', ' 0.106*"question" ', ' 0.056*"justification" ']
7 
 ['0.272*"source" ', ' 0.030*"belief" ', ' 0.030*"justification" ', ' 0.030*"epistemology" ', ' 0.030*"limit" ']


Recall: the topic, listed by intuition above are:
- __knowledge__, 
- __belief__, 
- __justification__, 
- __epistemology__. 

They correspond with the topic given by the model.

In [131]:
for i in IJ_final_model.print_topics():
    print(i[0], '\n', i[1].split('+')[:5])

0 
 ['0.011*"gately" ', ' 0.005*"right" ', ' 0.005*"thing" ', ' 0.004*"around" ', ' 0.004*"little" ']
1 
 ['0.006*"gately" ', ' 0.006*"little" ', ' 0.005*"thing" ', ' 0.005*"could" ', ' 0.004*"pemulis" ']
2 
 ['0.008*"gately" ', ' 0.005*"could" ', ' 0.005*"right" ', ' 0.005*"little" ', ' 0.004*"would" ']
3 
 ['0.006*"could" ', ' 0.006*"gately" ', ' 0.004*"little" ', ' 0.004*"right" ', ' 0.004*"thing" ']
4 
 ['0.005*"gately" ', ' 0.005*"around" ', ' 0.005*"could" ', ' 0.004*"little" ', ' 0.004*"right" ']
5 
 ['0.008*"gately" ', ' 0.005*"little" ', ' 0.005*"would" ', ' 0.004*"thing" ', ' 0.004*"could" ']
6 
 ['0.005*"little" ', ' 0.005*"thing" ', ' 0.004*"gately" ', ' 0.004*"around" ', ' 0.004*"could" ']


The topics given by wikipedia (https://en.wikipedia.org/wiki/Infinite_Jest) for Infinite Jest are: 
- __rehab__,
- __family__,
- __tennis__,
- __politics__

They are not clearly represents here. Some reason can be:
- the complexity of the text (of its vocabulary, its style. It is full of neologisms, acronims, names, that have been filtered here);
- the fact that topic models are not meant to work on a single book;
- the fact that I decided to assign every chapter a document. Maybe another assignation could be better.
- maybe the coherence of a model is not so related to how intuitive (for us, humans) the results could be.


In [133]:
for i in BC_final_model.print_topics():
    print(i[0], '\n', i[1].split('+')[:5])

0 
 ['0.009*"state" ', ' 0.008*"cost" ', ' 0.007*"af" ', ' 0.007*"point" ', ' 0.007*"number" ']
1 
 ['0.012*"year" ', ' 0.007*"time" ', ' 0.006*"day" ', ' 0.006*"child" ', ' 0.005*"class" ']
2 
 ['0.006*"year" ', ' 0.005*"time" ', ' 0.005*"life" ', ' 0.004*"man" ', ' 0.004*"wife" ']
3 
 ['0.018*"state" ', ' 0.015*"year" ', ' 0.009*"school" ', ' 0.009*"program" ', ' 0.008*"government" ']
4 
 ['0.010*"af" ', ' 0.009*"time" ', ' 0.006*"house" ', ' 0.006*"man" ', ' 0.005*"way" ']
5 
 ['0.008*"world" ', ' 0.007*"work" ', ' 0.007*"experience" ', ' 0.007*"number" ', ' 0.007*"society" ']
6 
 ['0.011*"church" ', ' 0.010*"state" ', ' 0.008*"war" ', ' 0.007*"time" ', ' 0.006*"president" ']
7 
 ['0.013*"man" ', ' 0.013*"time" ', ' 0.011*"hand" ', ' 0.007*"car" ', ' 0.007*"head" ']
8 
 ['0.016*"af" ', ' 0.010*"line" ', ' 0.009*"cell" ', ' 0.008*"temperature" ', ' 0.007*"surface" ']
9 
 ['0.010*"time" ', ' 0.008*"man" ', ' 0.008*"year" ', ' 0.007*"day" ', ' 0.007*"life" ']


Here the topic are given by the _categories_ contained in the brown corpus: 
- __adventure__,
- __belles lettres__,
- __editorial__,
- __fiction__,
- __government__,
- __hobbies__,
- __humor__,
- __learned__,
- __lore__,
- __myster__,
- __news__,
- __religion__,
- __reviews__,
- __romance__,
- __science fiction__.

Here there is a partial overlap between topics and categories.

# Conclusion

Topic modeling is a useful tool to cluster topics and to show the hidden features of a text. 
Here, I tried to create models starting from three different corpora:
1. a __toy corpus__: very small but clear. I already knew the topics I wanted as output. They have been reported by the model as well. 
2. a larger corpus, __Infinite Jest__: a book that contains different wiriting styles, which alternates between aulic and street langauge, and that is full of neologisms and acronyms.
3. __Brown Corpus__: the largest corpus, composed by 500 tagged documents (1 milion words).

The best performance, at least in terms of coherence, is given by the model based on the second corpus, but its visual representation is not intuitive for a human reader.

My aim was not to find the best cleaning process and the best parameters for topic models _in general_. I wanted to find a way improve LDA models starting from _these specific corpora_.

I tried to modify: 
- the process of _word selection_,
- _number of topics_ and _passes parameters_,
in order to reach a better rapresentation of the topics in these three texts. 

However, as I stated before, it is not entirely clear how to obtain the best performances from these models and how to evaluate them. There is a huge and compelling open debate on this issue. For sure, coherence is not sufficent to give a good evaluation of the model, but it is intuitive and easy to compute and interpret. 

### Directions for future research
It would be intresting, in the future, to:
- find wich kind of corpora are optimal for topic modeling (book, news, scientific papers, tweets?);
- think of a topic model that does not need the number of topics to be given a priori;
- find an unambiguous evaluation method that is more related to the human interpretation of the results.

